
ECON20003 QUANTITATIVE METHODS 2

Second Semester, 2022

Assignment 1

Exercise 1 (16 marks = 4 + 4 + 4 + 4)

A large population has a mean and standard deviation of 36 and 12, respectively. Consider the sampling distribution of the sample mean based on simple random samples of size 40. Answer the following questions by performing all required calculations manually.

(a)  What are the mean and the standard deviation of this sampling distribution?

(4 marks: 2 marks for the mean and 2 marks for the standard deviation. Give at most 1 + 1 marks if the proper notations and formulas are missing.)

We are told that population X has $\mu = 36$ and $\sigma = 12$, and that the sample size is n = 40. Consequently, the mean and the standard deviation of the sampling distribution of the sample mean are

$\mu_{\bar{X}} = \mu = 36$ , $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{12}{\sqrt{40}} = 1.897$

(b)  What can you tell about the shape of this sampling distribution?

(4 marks: 2 marks for acknowledging that n > 30 and hence CLT applies, and 2 marks for stating that the sampling distribution of the sample mean is at least approximately normal.)

We do not know whether the sampled population is normally distributed. However, the sample size is large enough (n > 30) to rely on the Central Limit Theorem. Hence, the sampling distribution of the sample mean is at least approximately normally distributed, even if the population is not.

(c)  What is the probability that the mean of a single sample is at least 35?

(4 marks: 3 marks for the calculations and 1 mark for the comment on the result. Students do not have to show all details, but must demonstrate how they got the result from the Standard Normal table. Otherwise, deduct 1 mark.)

$P(\bar{X} > 35) = P\left(Z > \frac{35 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) = P\left(Z > \frac{35 - 36}{1.897}\right) = P(Z > -0.53)$

$= 1 - P(Z < -0.53) = 1 - 0.2981 = 0.7019$

Hence, the probability that the mean of a single sample is at least 35 is 0.7019.
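Although the exercise asks for manual calculations, the table lookup can be cross-checked in R. The following is only a sketch; the variable names are illustrative and not part of the assignment. It computes the probability once with z rounded to two decimals, as in the Standard Normal table, and once without rounding, which gives a slightly different value (about 0.701).

```r
# Population parameters and sample size as given in the exercise.
mu    <- 36
sigma <- 12
n     <- 40

se <- sigma / sqrt(n)              # standard error of the sample mean, about 1.897
z  <- round((35 - mu) / se, 2)     # z-score rounded as for a table lookup: -0.53

p_table <- 1 - pnorm(z)            # table-style result, P(Z > -0.53)
p_exact <- pnorm(35, mean = mu, sd = se, lower.tail = FALSE)  # unrounded z

round(c(table = p_table, exact = p_exact), 4)
```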

(d)  What proportion of the sample means is between 30 and 35?

(4 marks: 3 marks for the calculations and 1 mark for the comment on the result. Students do not have to show all details, but must demonstrate how they got the result from the Standard Normal table. Otherwise, deduct 1 mark. Students have to interpret the proportion, not the probability value.)

To find this proportion, one needs first to determine the probability that the mean of a single sample is between 30 and 35.

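$P(30 \leq \bar{X} \leq 35) = P\left(\frac{30 - \mu_{\bar{X}}}{\sigma_{\bar{X}}} \leq Z \leq \frac{35 - \mu_{\bar{X}}}{\sigma_{\bar{X}}}\right) = P\left(\frac{30 - 36}{1.897} \leq Z \leq \frac{35 - 36}{1.897}\right)$

$= P(-3.16 \leq Z \leq -0.53) = P(Z \leq -0.53) - P(Z \leq -3.16) = 0.2981 - 0 = 0.2981$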

Hence, the probability that the mean of a single sample is between 30 and 35 is 0.2981, implying that the proportion of the sample means that are between 30 and 35 is 29.81%.
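As a cross-check of the table-based result, the same proportion can be computed directly from the normal distribution of the sample mean. This is a sketch with illustrative variable names. Because z is not rounded here and P(Z ≤ –3.16) is not treated as exactly zero, the value differs from 0.2981 in the fourth decimal.

```r
# Population parameters and sample size as given in the exercise.
mu    <- 36
sigma <- 12
n     <- 40

se <- sigma / sqrt(n)   # standard error of the sample mean

# P(30 <= Xbar <= 35) as a difference of two cumulative probabilities.
prop_between <- pnorm(35, mean = mu, sd = se) - pnorm(30, mean = mu, sd = se)

round(prop_between, 4)
```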

Exercise 2 (52 marks = 6 + 6 + 11 + 4 + 12 + 13)

The sales manager of the Happy Life company keeps records of the time her salespeople spend on customer calls. She has found that salespeople who spend more time per customer call are more successful and that the most successful salespeople spend, on average, more than 50 minutes on a customer call. In order to see whether a new salesperson might become one of the firm’s most successful salespeople, the sales manager recorded the time (in minutes) this salesperson spent on a random sample of 35 calls during his probation period. These observations are saved in the a1e2.xlsx Excel file.

(a)  Consider the time variable and answer the following questions. Is this variable qualitative or quantitative? If it is qualitative, is it ranked or unranked? If it is quantitative, is it discrete or continuous? What is its level of measurement? Explain your answers.

(6 marks: 3 x 2 marks for the correct answers with explanation. Award maximum 3 x 1 marks if there are no explanations or incorrect explanations.)

Time is a quantitative variable as its possible values are numerical and the basic arithmetic operations are meaningful. Theoretically, it is a continuous variable because it can take on any non-negative real number. However, in this case time is measured in minutes, so the possible values are integers and hence the observed variable is a discrete variable. Yet, since it has a large number of different possible values, it can be treated as a continuous variable for practical purposes. This variable has a genuine zero point (0 minutes), and by comparing two different non-zero values it is possible to tell that one particular call took x times as many minutes as another call, so the measurement scale is ratio.

(b)  Launch RStudio, create a new RStudio project and script, and name both a1e2. Import the data set from the a1e2 Excel data file to RStudio and save it as a1e2.RData. Attach the data to your RStudio project. Now take a screenshot of your RStudio window and paste it into your assignment.

(6 marks: 2 + 2 marks for the correct R commands, i.e., for importing the data set, and for attaching it to the workfile, and 2 marks for the proper screenshot. It must be clear from the screenshot that the data was loaded and saved successfully. Give only 1 mark for the screenshot if it does not show the saved RData file.)

After a1e2.xlsx has been imported into RStudio, the time variable can be made available either by attaching the data set with the

attach(a1e2)

command or by referring to it directly as

a1e2$time

In the former case the screenshot looks like this:

Perform the following tasks with RStudio / R.

(c)  Obtain the smallest observation, the largest observation, the range, the 1st quartile, the median, the 3rd quartile, the mean, the standard deviation, and the coefficient of variation for time. What do these statistics tell you about the amounts of time the new salesperson spent on customer calls? Provide a precise interpretation of each of these statistics.

(11 marks: 1+1 marks for the correct printout and code, 9x1 marks for the interpretations of the statistics with the name of the statistics, the name of the variable, and the unit of measurement. Award only 1 mark for the code and the printout if a student uses only one of the summary() and stat.desc() functions. It is enough if students mention the precise name of the variable only once and then refer to it as time. Otherwise, award only 0.5 mark for each interpretation if the variable name is unclear or the unit of measurement is missing, and round the mark upward to the nearest integer.)

The

library(pastecs)

summary(time)

round(stat.desc(time, basic = TRUE, norm = TRUE), 2)

commands return

and

All these descriptive statistics are about the number of minutes the new salesperson spent on the 35 customer calls in the sample.

Minimum = 12.00 In this sample of customer calls made by the new salesperson the shortest call took 12 minutes.

Maximum = 108.00 In this sample of customer calls made by the new salesperson the longest call took 108 minutes.

Range = 108.00 – 12.00 = 96.00 In this sample the difference between the longest and the shortest customer calls made by the new salesperson is 96 minutes.

Q1 = 36.00 In this sample of customer calls made by the new salesperson one quarter of the calls took at most 36.00 minutes and three quarters took at least 36.00 minutes.

Median = Q2 = 52.00 In this sample of customer calls made by the new salesperson half of the calls took at most 52.00 minutes and half of them took at least 52.00 minutes.

Q3 = 62.00 In this sample of customer calls made by the new salesperson three quarters of the calls took at most 62.00 minutes and one quarter took at least 62.00 minutes.

Mean = 51.26 In this sample of customer calls made by the new salesperson the average length of calls is 51.26 minutes.

Std. Dev. = 23.004 In this sample of customer calls made by the new salesperson the typical deviation of the length of calls from the mean is about 23.004 minutes.

Coefficient of variation = 0.449 In this sample of customer calls made by the new salesperson the standard deviation of the length of calls is about 45% of the average length of calls.
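The two derived statistics, the range and the coefficient of variation, follow from the other reported values. The sketch below redoes that arithmetic from the rounded summary figures above rather than from the raw data, so the variable names are illustrative only. With the attached data, the same quantities could be obtained as max(time) - min(time) and sd(time) / mean(time).

```r
# Rounded summary statistics as reported above (in minutes).
min_time  <- 12
max_time  <- 108
mean_time <- 51.26
sd_time   <- 23.004

range_time <- max_time - min_time   # range: 96 minutes
cv_time    <- sd_time / mean_time   # coefficient of variation (unit-free)

round(c(range = range_time, cv = cv_time), 3)
```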

(d)  Construct a 95% confidence interval for the population mean of time the new salesperson spends on customer calls. Interpret your confidence interval.

(4 marks: 1+1 marks for the correct R code and printout and 2 marks for the correct and precise interpretation.)

This confidence interval is provided by the t.test() function of R. The

t.test(time, conf.level = 0.95)

command returns

Hence, with 95% confidence, the average time the new salesperson spends on customer calls is between about 43.35 and 59.16 minutes.
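The interval reported by t.test() can be approximately reproduced from the summary statistics in part (c). The sketch below uses the rounded mean and standard deviation, so its limits may differ from the printout in the second decimal. The variable names are illustrative.

```r
# Rounded summary statistics from part (c).
n         <- 35
mean_time <- 51.26
sd_time   <- 23.004

se     <- sd_time / sqrt(n)        # estimated standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value, 34 df

ci <- mean_time + c(-1, 1) * t_crit * se   # lower and upper limits
round(ci, 2)
```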

(e)  Perform an appropriate test at the 5% significance level on the sample data to see whether the new salesperson might become one of the firm’s most successful salespeople. Specify the null and alternative hypotheses, the observed test statistic, make a statistical decision based on the p-value, and draw an appropriate conclusion.

(12 marks: 1+1 marks for the correct code and printout, 2 marks for the hypotheses using proper notations, 2 marks for the test statistic, 2 marks for the correct statistical decision, 2 marks for the correct and precise conclusion, and 2 marks for answering the actual question. Give at most 1 mark for a decision based on the t critical value instead of the p-value.)

Since the most successful salespeople spend, on average, more than 50 minutes on a customer call, the hypotheses are

$H_0: \mu = 50$ , $H_A: \mu > 50$

The

t.test(time, mu = 50, alternative = "greater")

command returns

The test statistic is 0.3233 and the p-value is 0.3742 > 0.05, i.e., larger than the significance level. Therefore, at the 5% significance level we fail to reject H0. This means that at the 5% significance level there is not sufficient evidence to conclude that the mean time the new salesperson spends on a call is greater than 50 minutes. Consequently, the manager cannot expect the new salesperson to be one of the firm’s most successful salespeople.

Some students might (incorrectly) base the decision on the critical value, $t_{0.05,34} \approx t_{0.05,35} = 1.690$. This is a right-tail test, and the observed test statistic is smaller than this critical value, so H0 is maintained at the 5% significance level.
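Both decision rules can be checked from the reported test statistic alone. The sketch below takes t = 0.3233 and 34 degrees of freedom from the printout; the variable names are illustrative. It computes the right-tail p-value and the exact critical value for 34 degrees of freedom, which is marginally larger than the tabulated value for 35.

```r
# Observed test statistic and degrees of freedom from the t.test() printout.
t_stat <- 0.3233
df     <- 34
alpha  <- 0.05

p_value <- pt(t_stat, df = df, lower.tail = FALSE)  # right-tail p-value
t_crit  <- qt(1 - alpha, df = df)                   # right-tail critical value

reject_by_p    <- p_value < alpha    # p-value rule
reject_by_crit <- t_stat > t_crit    # critical-value rule

round(c(p_value = p_value, t_crit = t_crit), 4)
```

Both rules lead to the same decision here: neither logical flag is TRUE, so H0 is maintained.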

(f)  What conditions are required to validate the confidence interval and the test in parts (d) and (e)? Are they likely satisfied? Use as much evidence as you can to support your answers.

(13 marks for mentioning and discussing the requirements of the t-test. In particular, students can get 4x1 marks for mentioning all four requirements, 3x1 marks for presenting some reasonable argument or evidence related to the first three requirements, and 6x1 marks for the six checks of normality. Although in later assignments students will not be expected to apply all checks for normality, this time they are asked to do so. Award at most 5 marks for the normality checks if the R codes for the plots are missing. Deduct 2 marks if a student does not draw an overall conclusion.)

The confidence interval and the t-test in parts (d) and (e) are based on the following assumptions:

i. The data is a random sample of independent observations.

ii. The variable of interest is quantitative and continuous.

iii. The measurement scale is interval or ratio.

iv. The sampled population is normally distributed, at least approximately.

We were told that the sample is a random sample.

In part (a) it was already discussed that time is a discrete quantitative variable, but since it has a large number of possible values, we can treat it as continuous in hypothesis testing.

It was also mentioned in part (a) that the measurement scale of time is ratio.

As regards normality, we learnt about six possible checks.

Histogram (with normal curve). The

hist(time, freq = FALSE, col = "orange",
     main = "Relative frequency of minutes spent on calls",
     xlim = c(0, 120), ylim = c(0, 0.025))

lines(seq(0, 120, by = 1),
      dnorm(seq(0, 120, by = 1), mean(time), sd(time)), col = "blue")

commands return the following plot:

It shows that the sample data are more or less symmetric, supporting the normality assumption.

QQ plot. The

qqnorm(time, main = "Normal Q-Q Plot",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", col = "black")

qqline(time, col = "red")

commands return the following plot:

Apparently, there are only a couple of points relatively far from the 45-degree reference line, hence the QQ plot also supports the normality assumption.

Mean-median comparison. The sample mean and sample median are 51.26 and 52.00, respectively. The difference between them is very small, less than 1.5% of the sample mean, so the population of time might be symmetric and thus normally distributed.

Skewness is 0.420, implying some positive skewness. However, its Z value is 0.528 < 1, suggesting that the population of time might be symmetric and thus normal.

Excess kurtosis is 0.020 > 0, indicating a leptokurtic distribution. However, its Z value is 0.013 < 1, so in terms of the thickness of the tails the distribution of the time population might be normal.

The Shapiro-Wilk normality test statistic is 0.964 and its p-value is 0.308 > 0.1. Hence, the null hypothesis of normality cannot be rejected, not even at the 10% significance level.

All six checks for normality indicate that the population of time might be normally distributed. Hence, the assumptions behind the confidence interval and the t-test in parts (d) and (e) can be regarded as satisfied.
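The Z values quoted for skewness and excess kurtosis can be reproduced from the reported statistics. The sketch below assumes that each Z value is the statistic divided by twice its standard error, using the usual small-sample standard-error formulas. That this is how stat.desc() obtains them is an assumption, although the formulas do reproduce both printed figures. The variable names are illustrative.

```r
# Sample size and rounded statistics as reported in the normality checks.
n        <- 35
skewness <- 0.420
kurtosis <- 0.020   # excess kurtosis

# Standard errors of skewness and kurtosis (assumed formulas, see lead-in).
se_skew <- sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt <- sqrt(24 * n * (n - 1)^2 / ((n - 3) * (n - 2) * (n + 3) * (n + 5)))

# Z values: statistic divided by two standard errors; below 1 in absolute
# value means the statistic is not significantly different from zero.
z_skew <- skewness / (2 * se_skew)
z_kurt <- kurtosis / (2 * se_kurt)

round(c(z_skew = z_skew, z_kurt = z_kurt), 3)
```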

Exercise 3 (32 marks = 4 + 12 + 16)

The population of adolescent laborers who dropped out of high school at age 16 has a median reading comprehension score of 60 on a scale 0, 1, 2, …, 100. Suppose we would like to know whether adolescents still in school at age 16 achieve a higher median score on the same test than dropouts employed as laborers. To answer this research question, we took a random sample of 21 adolescents who are still in school at age 16 and recorded their scores on the same test. These scores are saved in the a1e3.xlsx Excel file. Perform all required calculations and tasks with RStudio / R.

(a)  Granted that the required conditions are satisfied, which tests can be used to answer the research question? Explain your answer.

(4 marks: 2 + 2 marks for nominating the two tests with some reasonable explanation. Do not award any mark for any other test.)

Since the research question is about the median test score of adolescents still in school at age 16, two tests are available, the sign test and the Wilcoxon signed ranks test.
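To illustrate how these two tests work, the sketch below runs both in base R on a hypothetical vector of 21 scores. The scores are invented for illustration only and are not the contents of a1e3.xlsx. The sign test is carried out here as a binomial test on the number of scores above 60, which is one common way to do it; the course may use a dedicated function instead. Both tests are right-tailed, with hypothesised median 60.

```r
# HYPOTHETICAL scores, invented for illustration (not the a1e3.xlsx data).
scores <- c(72, 58, 65, 81, 60, 77, 69, 55, 90, 63, 74,
            61, 59, 68, 85, 70, 66, 52, 79, 62, 73)
median0 <- 60   # hypothesised median under H0

# Sign test: drop scores equal to 60, then count scores above and below it.
n_above <- sum(scores > median0)
n_below <- sum(scores < median0)
sign_result <- binom.test(n_above, n_above + n_below,
                          p = 0.5, alternative = "greater")

# Wilcoxon signed ranks test; exact = FALSE requests the normal
# approximation, since ties and zero differences rule out the exact test.
wilcoxon_result <- wilcox.test(scores, mu = median0,
                               alternative = "greater", exact = FALSE)

c(sign_p = sign_result$p.value, wilcoxon_p = wilcoxon_result$p.value)
```

The sign test uses only the direction of each difference from 60, whereas the Wilcoxon signed ranks test also uses the ranks of the absolute differences and therefore additionally requires a symmetric population.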

(b)  Perform the tests you nominated in part (a) at the 5% significance level. For each test specify the null and alternative hypotheses, the observed test statistic, make a statistical decision based on the p-value, and draw an appropriate conclusion.

(12 marks: 6 + 6 marks for the sign test and the Wilcoxon signed ranks test. In each case award 1 mark for the correct code, 1 mark for the correct printout, 1 mark for the hypotheses using proper notations, 1 mark for the test statistic, 1 mark for the correct statistical decision based on the p-value, 1 mark for the correct and precise conclusion. Students who failed to realize that the research question is about the median and nominated the t-test, or performed a different test than the one nominated in part (a), can get at most 3 marks.)