GB 213-007: Business Statistics Fall 2022
R Peer Pair Project 1
To begin:
• Download the SAT by State dataset and attach it to an R script file using the attach(file_path_name) function.
• Comment your first name and last name along with the date and class section number at the top of the file.
• Submit the file using the following naming format: GB213_section_first3_last3_rp1.R, e.g., GB213_7_krt_jun_rp1.R
Part A: Understanding the dataset
In data analytics, it is important to first have an understanding of the data before performing any analyses or manipulations of the dataset. Therefore, in this part of the code, you will be computing basic summary statistics of the dataset. Include screenshots of the output in your submission file.
• How many rows and columns does the dataset have?
• What are the column titles?
• What is the structure of the dataset, i.e., the type of data in each column?
• Are there any missing values?
• What is the range of values for all numerical columns?
• Print the first 5 rows
• Print the last 5 rows
• Create a histogram for the teachers’ salaries column. Make sure you label the horizontal axis and give the histogram a title. Is the data skewed? Please discuss.
After obtaining the above information, write a brief summary description of the dataset.
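The checks listed in Part A map onto a handful of base R functions. The following is a minimal sketch, not a model answer: the data frame `sat` and its column names (`state`, `salary`, `total`) are hypothetical stand-ins, and the values are invented so that the script runs on its own. Substitute the real dataset and its actual column names.

```r
# Toy stand-in for the SAT by State data. Column names and values are
# invented for illustration and will differ from the real file.
sat <- data.frame(
  state  = c("A", "B", "C", "D", "E", "F"),
  salary = c(31.1, 47.9, 32.2, 28.9, 41.1, 34.6),
  total  = c(1029, 934, 944, 1005, 902, 980)
)

dim(sat)             # number of rows, then number of columns
names(sat)           # column titles
str(sat)             # type of data in each column
colSums(is.na(sat))  # count of missing values per column

# Range (minimum and maximum) of every numeric column
numeric_cols <- sapply(sat, is.numeric)
col_ranges <- sapply(sat[numeric_cols], range)
col_ranges

head(sat, 5)         # first 5 rows
tail(sat, 5)         # last 5 rows

# Histogram with a title and a labelled horizontal axis
salary_hist <- hist(sat$salary,
                    main = "Distribution of teachers' salaries",
                    xlab = "Teachers' salary")
```

When run with Rscript rather than in an interactive session, the histogram is written to a file (Rplots.pdf by default) instead of a plot window.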
Part B: Analysis
In this part, we want to examine whether there is a relationship between student SAT scores and teachers’ salaries.
1. Calculate the following basic statistics on the total scores and teachers’ salaries. Save each result under a descriptive variable name. Write a few sentences about what each statistic means. Include screenshots of the output in your submission file.
• Mean
• Median
• Standard Deviation
In your discussion of mean and median, mention their significance, and why they may differ. Consider the minimum and maximum values of the total scores and teachers’ salaries that you obtained in Part A and use them to explain how outliers may affect the mean and median.
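As a rough illustration of how an extreme value can move the mean while leaving the median alone, the sketch here uses two short vectors of invented scores. The variable names are hypothetical and the numbers are not from the SAT by State dataset.

```r
# Invented values for illustration only, not taken from the real dataset.
total_scores <- c(900, 950, 1000, 1050, 1100)

# Save each statistic under a descriptive name.
mean_total   <- mean(total_scores)
median_total <- median(total_scores)
sd_total     <- sd(total_scores)

# Same values, but the largest one is replaced by an extreme value
# to mimic an outlier.
with_outlier <- c(900, 950, 1000, 1050, 1600)

mean_outlier   <- mean(with_outlier)    # pulled upward by the outlier
median_outlier <- median(with_outlier)  # unchanged: the middle value is the same

cat("Mean:  ", mean_total, "->", mean_outlier, "\n")
cat("Median:", median_total, "->", median_outlier, "\n")
```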
2. Not all data is “clean” or consistent. Data scientists and statisticians spend time preparing their data to be analyzed to compensate for human and computer errors introduced during the data collection process. This SAT by State dataset is clean for our purposes; however, we want to maintain consistency. Not all states had high participation rates.
Subset the data for only states with a participation rate above 0.10. Name this new subset ‘df_clean’. Recalculate the following summary statistics for the total scores. Include screenshots of the output in your submission file.
• Mean
• Median
• Standard Deviation
Have these statistics changed, compared to the original SAT by State dataset? If so, are the figures more useful for analysis now than before? Why, or why not?
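One possible way to build the subset is with `subset()` and a logical condition. In this sketch the column names `participation` and `total` are assumptions and the rows are invented. If the real file stores participation as a percentage (for example 10 instead of 0.10), the threshold would need to change accordingly.

```r
# Toy data. `participation` and `total` are hypothetical column names,
# and the values are invented for illustration.
sat <- data.frame(
  state         = c("A", "B", "C", "D", "E", "F"),
  participation = c(0.04, 0.09, 0.10, 0.45, 0.80, 0.65),
  total         = c(1100, 1080, 1040, 960, 900, 930)
)

# Keep only rows whose participation rate is strictly above 0.10.
# A rate of exactly 0.10 is excluded.
df_clean <- subset(sat, participation > 0.10)

mean_clean   <- mean(df_clean$total)
median_clean <- median(df_clean$total)
sd_clean     <- sd(df_clean$total)

nrow(sat)       # rows before filtering
nrow(df_clean)  # rows after filtering
```

The same subset can be written with bracket indexing as `sat[sat$participation > 0.10, ]`; which form to use is a matter of taste.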
3. Using the subset ‘df_clean’, create a boxplot for teachers’ salaries. Remember to label the horizontal axis. Include a screenshot of the output in your submission file.
Notice that the max, min, median, and IQR are not labeled in the box plot. You have already calculated the maximum, minimum, and median. Now calculate:
• Q1
• Q3
Explain the significance of Q1 and Q3.
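A sketch of the box plot and quartile calculations, assuming a hypothetical `salary` column and invented values. `quantile()` is used here with its default method; note that `boxplot()` draws its box from hinges, which can differ slightly from `quantile()` results on some datasets.

```r
# Invented salary values for illustration only.
df_clean <- data.frame(salary = c(28, 30, 32, 34, 36, 38, 40, 42, 50))

# horizontal = TRUE places the salary scale on the horizontal axis.
boxplot(df_clean$salary, horizontal = TRUE,
        main = "Teachers' salaries",
        xlab = "Teachers' salary")

# First and third quartiles (25th and 75th percentiles).
q1_salary <- quantile(df_clean$salary, 0.25)
q3_salary <- quantile(df_clean$salary, 0.75)

# The interquartile range is the distance between them.
iqr_salary <- q3_salary - q1_salary
```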
4. Using the subset ‘df_clean’, create a line plot with the total scores and teachers’ salaries on the horizontal and vertical axes, respectively. Include a screenshot of the output in your submission. Remember to add axis labels, a title, and color to the graph.
What is the general trend of the graph? Is there a visual relationship between the two variables?
What are some other types of graphs that could be used to represent the relationship between these two variables?
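A line plot in base R can be drawn with `plot(..., type = "l")`. The sketch here uses invented values and hypothetical column names `total` and `salary`. The rows are sorted by the horizontal variable first, since an unsorted data frame would make the line zigzag back and forth.

```r
# Invented values for illustration only.
df_clean <- data.frame(
  total  = c(1010, 900, 960, 930, 1050),
  salary = c(33, 45, 38, 41, 30)
)

# Sort by the horizontal-axis variable so the line runs left to right.
plot_data <- df_clean[order(df_clean$total), ]

plot(plot_data$total, plot_data$salary,
     type = "l",      # "l" draws a line rather than points
     col  = "blue",
     main = "Teachers' salaries against total SAT scores",
     xlab = "Total SAT score",
     ylab = "Teachers' salary")
```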
• Calculate the correlation coefficient for total scores and teachers’ salaries and explain its significance.
Is there a strong or weak relationship between total scores and teachers’ salaries?
What are some other variables, either in the dataset or not, that might correlate with students’ scores? What further data should be included in a future analysis?
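The correlation coefficient can be obtained with `cor()`, which computes Pearson's r by default. The sketch here also recomputes it from the definition as a cross-check. The data and the column names `total` and `salary` are invented for illustration and say nothing about the real dataset.

```r
# Invented values for illustration only.
df_clean <- data.frame(
  total  = c(1010, 900, 960, 930, 1050),
  salary = c(33, 45, 38, 41, 30)
)

x <- df_clean$total
y <- df_clean$salary

# Pearson correlation coefficient, always between -1 and 1.
r_total_salary <- cor(x, y)

# The same quantity from the definition: the sample covariance
# divided by the product of the sample standard deviations.
n <- length(x)
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

all.equal(r_total_salary, r_manual)
```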
5. Sometimes, not all data is useful or required for the analysis. We are, in fact, only interested in the New England region. Use the table below to create a new subset of data called ‘df_new_eng’.
• For this new subset of data, calculate the mean and median for the English SAT scores and the Math SAT scores.
• Compare these numbers with the values of those quantities for the entire country.
Did New England perform above or below average in the English and Math SATs, compared to the entire country?
• What is the sample size for the New England subset?
• Calculate the standard deviation of English and Math SAT scores for this subset, using the usual formula for standard deviation. Do not use the functions sd() or var() to get the standard deviation; this exercise is intended in part to give you practice in using R for direct calculations. (You may use the sd() and var() functions to check your answers.)
Play around with the dataset to see if you can discern a relationship between sample size and standard deviation. What do you find, if anything?
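The New England steps could be sketched as follows. Several things here are assumptions: the list of states is the conventional six-state definition of New England rather than the handout's own table, the column names `state`, `english`, and `math` are hypothetical, and every score is invented. The helper `manual_sd()` is likewise a hypothetical name. It uses the sample formula with n - 1 in the denominator, which is what `sd()` computes, so the two can be compared.

```r
# Toy data. All scores are invented and are NOT the real state values.
sat <- data.frame(
  state   = c("Connecticut", "Maine", "Massachusetts", "New Hampshire",
              "Rhode Island", "Vermont", "Texas", "Ohio"),
  english = c(510, 505, 508, 520, 501, 506, 495, 536),
  math    = c(509, 498, 504, 516, 491, 500, 500, 535)
)

# Assumed region definition (conventional six New England states).
new_england <- c("Connecticut", "Maine", "Massachusetts",
                 "New Hampshire", "Rhode Island", "Vermont")

# Keep only the rows whose state appears in the list.
df_new_eng <- sat[sat$state %in% new_england, ]

# Sample size of the subset.
n_new_eng <- nrow(df_new_eng)

# Region versus all rows, for one of the two subjects.
mean_english_ne  <- mean(df_new_eng$english)
mean_english_all <- mean(sat$english)

# Sample standard deviation computed directly:
# sqrt( sum((x - mean)^2) / (n - 1) )
manual_sd <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - 1))
}

sd_english_ne <- manual_sd(df_new_eng$english)
sd_math_ne    <- manual_sd(df_new_eng$math)

# Check against the built-in function.
all.equal(sd_english_ne, sd(df_new_eng$english))
all.equal(sd_math_ne, sd(df_new_eng$math))
```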
6. If you like, perform similar analyses for other subsets of the data. What do you find?
Part C: Interpretation
Write a brief passage summarizing what you have found through your analysis of the dataset. What do you conclude about the relationship between students’ SAT scores and teachers’ salaries? Why do you think this relationship holds? What implications does your conclusion have? What are the limitations of your analysis, and what are its strengths? Please discuss.
2022-11-12