ETW2001 2022 Semester 1

Group Assignment (25%)

The main task of the assignment is for students to work together with their groupmates to perform an overall analysis of the data, including:

1. Understand the context of the variables.

2. Perform descriptive analysis.

3. Develop a predictive model.

4. Present the output in a report format.

Deadline: 20th May, 11:55 PM (MYT)

Submission: A report in PDF format and an R script containing the code that you used for the report. When you submit, Moodle might show an error saying that it does not recognize the R script. You may ignore that error message, since the R script is still submitted. You will extract the data from Penn World Table (PWT) version 10.0.

Context

The main components that determine a country’s GDP include investment, government expenditure and net exports. You will use some of the relevant variables from the Penn World Table (PWT) dataset.

The main objectives of this assignment are to:

• Practise writing a proper statistical report.

• Explore how much income per capita has changed over the last 20 years.

• Develop predictive models for income and conduct necessary tests and diagnostics.

1. Data preparation (15 marks)

1.1. Download PWT version 10.0 in Excel format (https://www.rug.nl/ggdc/productivity/pwt/?lang=en) and import it into RStudio.

1.2. Delete all columns except countrycode, country, year, rgdpe, pop, csh_i, csh_g, csh_x. You should refer to the “Legend” sheet to understand the definition and the unit of measurement of each variable.

1.3. Restrict your dataframe to the data for 2000 and 2019 only. Your dataframe should have 366 rows.

1.4. Remove the oil-producing countries of the Persian Gulf: ARE (United Arab Emirates), BHR (Bahrain), IRN (Iran), IRQ (Iraq), KWT (Kuwait), OMN (Oman), QAT (Qatar), SAU (Saudi Arabia) and YEM (Yemen). The main source of GDP in these countries differs from that of the rest. After the removal, your dataframe should have 348 rows.
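
Steps 1.2 to 1.4 could be sketched in base R roughly as follows. This is only an illustration on a small made-up dataframe: `pwt_toy`, `extra_col` and all values are hypothetical and are not PWT data. With the real file you would first import the Excel data sheet (for example with the readxl package) and check the row counts stated in each step.

```r
# Toy stand-in for the imported PWT sheet (values are invented for illustration).
pwt_toy <- data.frame(
  countrycode = rep(c("AUS", "KWT", "MYS"), each = 3),
  country     = rep(c("Australia", "Kuwait", "Malaysia"), each = 3),
  year        = rep(c(2000, 2010, 2019), times = 3),
  rgdpe       = c(700, 1000, 1300, 100, 200, 250, 300, 600, 900) * 1000,
  pop         = c(19, 22, 25, 2, 3, 4, 23, 28, 32),
  csh_i       = 0.25,
  csh_g       = 0.15,
  csh_x       = 0.30,
  extra_col   = NA  # stands for any column that should be dropped
)

# 1.2: keep only the required columns.
keep_cols <- c("countrycode", "country", "year", "rgdpe", "pop",
               "csh_i", "csh_g", "csh_x")
df <- pwt_toy[, keep_cols]

# 1.3: keep only the years 2000 and 2019.
df <- df[df$year %in% c(2000, 2019), ]

# 1.4: drop the Persian Gulf oil producers listed in the task.
gulf <- c("ARE", "BHR", "IRN", "IRQ", "KWT", "OMN", "QAT", "SAU", "YEM")
df <- df[!(df$countrycode %in% gulf), ]

nrow(df)  # compare the row count with the number expected at this step
```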

1.5. Remove all countries with a population of less than 1 million in 2019. Make sure you have the same number of observations for both 2000 and 2019. To carry out this task, you may take the following steps:

• Filter the rows where the year is 2019 and the population is at least 1 million. Save this as a separate dataframe.

• Use an appropriate join function to merge the dataframe from 1.4 with the one above. Remove the duplicate rows for 2019.

You will notice that there are duplicate variables with the suffixes “.x” and “.y”. Rename them with the years 2000 and 2019 respectively.

For example, rgdpe is separated into two columns as rgdpe2000 and rgdpe2019. You can remove any redundant variables.

To confirm, make sure that your dataframe includes 139 rows.
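
One possible way to carry out the filter, join and rename steps above is sketched below with base R's `merge()`. The dataframe and its values are invented for illustration, and only one of the share variables is included to keep the example short. The sketch assumes that pop is measured in millions; confirm the unit in the “Legend” sheet.

```r
# Toy dataframe standing in for the result of step 1.4 (invented values).
df <- data.frame(
  countrycode = rep(c("AUS", "ISL", "MYS"), each = 2),
  country     = rep(c("Australia", "Iceland", "Malaysia"), each = 2),
  year        = rep(c(2000, 2019), times = 3),
  rgdpe       = c(700000, 1300000, 10000, 19000, 300000, 900000),
  pop         = c(19, 25, 0.28, 0.36, 23, 32),  # assumed to be in millions
  csh_i       = c(0.24, 0.23, 0.22, 0.21, 0.27, 0.22)
)

# Rows for 2019 with at least 1 million people.
big2019 <- df[df$year == 2019 & df$pop >= 1, ]

# Inner join on the country identifiers; clashing columns get .x / .y suffixes.
joined <- merge(df, big2019, by = c("countrycode", "country"))

# Each country now appears twice (year.x = 2000 and 2019). Keeping the 2000 row
# leaves the 2000 values in the .x columns and the 2019 values in the .y columns.
wide <- joined[joined$year.x == 2000, ]

# Rename the suffixes to the years and drop the now redundant year columns.
names(wide) <- sub("\\.x$", "2000", names(wide))
names(wide) <- sub("\\.y$", "2019", names(wide))
wide$year2000 <- NULL
wide$year2019 <- NULL

wide
```

An inner join is used here because it drops the countries that are absent from the 2019 filter in the same step; other join functions can achieve the same result.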

1.6. Now you will create a new dataframe using the variables from 1.5. It should include the following variables.

• incpc2019 = rgdpe2019/pop2019 (this is income per capita in 2019).

• incpc2000 = rgdpe2000/pop2000 (this is income per capita in 2000).

• popgrowth = 100*((pop2019 - pop2000)/pop2000)/20 (this is the average annual population growth rate in the past 20 years in %).

• invshare = 100*(csh_i2019+csh_i2000)/2 (this is a measure of % of GDP invested).

• govshare = 100*(csh_g2019+csh_g2000)/2 (this is a measure of government expenditure as a % of GDP).

• expshare = 100*(csh_x2019+csh_x2000)/2 (this is a measure of exports as a % of GDP).

• Now, your new dataframe will include countrycode, country name, and the 6 variables that you created. It should include 139 rows and 8 columns.
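
The variable definitions above translate almost directly into R. The sketch below applies them to a hypothetical two-row dataframe `wide` (invented values) that stands in for the result of step 1.5.

```r
# Toy wide dataframe standing in for the result of step 1.5 (invented values).
wide <- data.frame(
  countrycode = c("AAA", "BBB"),
  country     = c("Country A", "Country B"),
  rgdpe2000 = c(200000, 50000), rgdpe2019 = c(450000, 120000),
  pop2000   = c(10, 20),        pop2019   = c(12, 30),
  csh_i2000 = c(0.20, 0.30),    csh_i2019 = c(0.30, 0.20),
  csh_g2000 = c(0.10, 0.20),    csh_g2019 = c(0.20, 0.20),
  csh_x2000 = c(0.40, 0.10),    csh_x2019 = c(0.60, 0.30)
)

# Build the new dataframe from the formulas given in the task.
final <- with(wide, data.frame(
  countrycode,
  country,
  incpc2019 = rgdpe2019 / pop2019,
  incpc2000 = rgdpe2000 / pop2000,
  popgrowth = 100 * ((pop2019 - pop2000) / pop2000) / 20,
  invshare  = 100 * (csh_i2019 + csh_i2000) / 2,
  govshare  = 100 * (csh_g2019 + csh_g2000) / 2,
  expshare  = 100 * (csh_x2019 + csh_x2000) / 2
))

dim(final)  # number of rows and columns, to compare with the task
```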

2. Introduction (5 marks)

Write an introduction to the report. Your introduction should include:

• A general overview of changes in income globally for the last 20 years (after year 2000).

• A description of the overall objective of the report and how the report is structured.

For the introduction, you do not need to perform any analysis using the data. If you refer to any external sources for information, you should cite them properly.

Your introduction is limited to 1 page. Do not write more than 1 page.

3. Descriptive Analysis (25 marks)

3.1. Draw a histogram for both incpc2000 and incpc2019. Combine these two histograms into one plot. (4 marks)
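
Overlaying the two histograms on shared bins is one of several ways to combine them into one plot; side-by-side panels would also work. The sketch below uses base R graphics and simulated incomes, not the real data, and the colors and number of bins are arbitrary choices.

```r
set.seed(1)
# Simulated incomes standing in for incpc2000 and incpc2019 (not real data).
incpc2000 <- rlnorm(139, meanlog = 9.0, sdlog = 1)
incpc2019 <- rlnorm(139, meanlog = 9.5, sdlog = 1)

# Common breaks so that the two histograms are directly comparable.
upper  <- max(incpc2000, incpc2019) * 1.01
breaks <- seq(0, upper, length.out = 31)

h2000 <- hist(incpc2000, breaks = breaks, plot = FALSE)
h2019 <- hist(incpc2019, breaks = breaks, plot = FALSE)

# Overlay both histograms on one set of axes with semi-transparent fills.
col2000 <- rgb(0, 0, 1, 0.4)
col2019 <- rgb(1, 0, 0, 0.4)
plot(h2000, col = col2000, main = "Income per capita, 2000 and 2019",
     xlab = "Income per capita",
     ylim = c(0, max(h2000$counts, h2019$counts)))
plot(h2019, col = col2019, add = TRUE)
legend("topright", legend = c("2000", "2019"), fill = c(col2000, col2019))
```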

3.2. Compare the two histograms. Discuss how the distribution changed over the 20 years. Your discussion should identify similarities and differences between the distributions and should use relevant statistics. (5 marks)

3.3. Create a scatterplot showing incpc2000 on the x-axis and incpc2019 on the y-axis. Add two straight lines with slopes of 1 and 2 respectively. Explain what these lines indicate and how they can be used. (5 marks)

3.4. On the scatterplot, highlight in color and label with their country codes the top 5 countries by growth ratio (incpc2019/incpc2000), subject to both incpc2000 and incpc2019 being greater than the median in the respective year. Name those 5 countries “Potential Top 5” in the legend. (3 marks)

3.5. On the scatterplot, highlight in color and label with their country codes the top 5 countries by growth ratio (incpc2019/incpc2000), subject to incpc2000 being greater than 25000 and incpc2019 being greater than 50000. Name those 5 countries “Conventional Top 5” in the legend. (3 marks)
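
A rough base R sketch of tasks 3.3 to 3.5 is given below. The data are simulated, `top5()` is a hypothetical helper written for this example, and the two reference lines are assumed to pass through the origin. A point on the slope-1 line has the same income in both years, and a point on the slope-2 line has exactly doubled its income.

```r
set.seed(2)
# Simulated data standing in for the dataframe from step 1.6 (not real data).
d <- data.frame(countrycode = sprintf("C%03d", 1:139),
                incpc2000   = rlnorm(139, meanlog = 9.5, sdlog = 1))
d$incpc2019 <- d$incpc2000 * runif(139, min = 0.8, max = 3)
d$ratio     <- d$incpc2019 / d$incpc2000

# Hypothetical helper: the (up to) 5 rows with the highest growth ratio
# among the rows that satisfy a logical condition.
top5 <- function(data, cond) {
  cand <- data[cond, ]
  cand[order(-cand$ratio), ][seq_len(min(5, nrow(cand))), ]
}

potential    <- top5(d, d$incpc2000 > median(d$incpc2000) &
                        d$incpc2019 > median(d$incpc2019))
conventional <- top5(d, d$incpc2000 > 25000 & d$incpc2019 > 50000)

plot(d$incpc2000, d$incpc2019, col = "grey60",
     xlab = "incpc2000", ylab = "incpc2019")
abline(a = 0, b = 1, lty = 2)  # slope 1: incpc2019 equals incpc2000
abline(a = 0, b = 2, lty = 3)  # slope 2: incpc2019 is twice incpc2000
points(potential$incpc2000, potential$incpc2019, col = "blue", pch = 19)
text(potential$incpc2000, potential$incpc2019,
     labels = potential$countrycode, pos = 3, col = "blue")
points(conventional$incpc2000, conventional$incpc2019, col = "red", pch = 19)
text(conventional$incpc2000, conventional$incpc2019,
     labels = conventional$countrycode, pos = 3, col = "red")
legend("topleft", legend = c("Potential Top 5", "Conventional Top 5"),
       col = c("blue", "red"), pch = 19)
```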

3.6. Discuss the characteristics of the “Potential Top 5” and “Conventional Top 5” countries. What are the similarities and differences within and between the groups? (5 marks)

4. Predictive Analysis (35 marks)

4.1. Run 4 regression models as below. The dependent variable is incpc2019 and the independent variable is incpc2000.

• lm1: level-level for incpc2019 and incpc2000

• lm2: level-log for incpc2019 and incpc2000

• lm3: log-level for incpc2019 and incpc2000

• lm4: log-log for incpc2019 and incpc2000

The remaining variables such as popgrowth, invshare, govshare, and expshare should be included as independent variables for all 4 models. (5 marks)
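
The four specifications can be written with `lm()` as in the sketch below. It assumes the common textbook naming in which the first word refers to the dependent variable (so "level-log" means incpc2019 in levels and incpc2000 in logs); check this against the convention used in your unit. The data are simulated purely so that the code runs.

```r
set.seed(3)
n <- 139
# Simulated data standing in for the dataframe from step 1.6 (not real data).
d <- data.frame(incpc2000 = rlnorm(n, meanlog = 9, sdlog = 1),
                popgrowth = rnorm(n, mean = 1.5, sd = 1),
                invshare  = runif(n, 10, 40),
                govshare  = runif(n, 5, 30),
                expshare  = runif(n, 5, 80))
d$incpc2019 <- exp(0.5 + 0.98 * log(d$incpc2000) + 0.01 * d$invshare +
                   rnorm(n, sd = 0.3))

# level-level
lm1 <- lm(incpc2019 ~ incpc2000 + popgrowth + invshare + govshare + expshare,
          data = d)
# level-log (assumed: dependent variable in levels, incpc2000 in logs)
lm2 <- lm(incpc2019 ~ log(incpc2000) + popgrowth + invshare + govshare + expshare,
          data = d)
# log-level (assumed: dependent variable in logs, incpc2000 in levels)
lm3 <- lm(log(incpc2019) ~ incpc2000 + popgrowth + invshare + govshare + expshare,
          data = d)
# log-log
lm4 <- lm(log(incpc2019) ~ log(incpc2000) + popgrowth + invshare + govshare + expshare,
          data = d)

summary(lm4)
```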

4.2. Create a table combining the summaries of the above 4 models to make comparison easier. The table should include the variable names, estimated coefficients, standard errors of the estimated coefficients, number of observations, R-squared, adjusted R-squared and residual standard error of each model.

You can obtain all the values from the model summaries, so arrange those values neatly into one table. (3 marks)
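
As a minimal sketch of how such a table might be assembled from `summary()` output in base R, the example below combines two small simulated models. The helper `coef_column()` and the "estimate (standard error)" layout are illustrative choices only; dedicated table packages can produce similar output.

```r
set.seed(4)
# Two small simulated models; in the assignment these would be lm1 to lm4.
d <- data.frame(x = rnorm(50))
d$y <- 1 + 2 * d$x + rnorm(50)
models <- list(m1 = lm(y ~ x, data = d),
               m2 = lm(y ~ x + I(x^2), data = d))

# Hypothetical helper: each coefficient formatted as "estimate (std. error)".
coef_column <- function(m) {
  ct <- summary(m)$coefficients
  setNames(sprintf("%.3f (%.3f)", ct[, "Estimate"], ct[, "Std. Error"]),
           rownames(ct))
}

# Union of all terms so that every model fills the same rows.
terms_all <- unique(unlist(lapply(models, function(m) names(coef(m)))))
coef_tab  <- sapply(models, function(m) coef_column(m)[terms_all])
rownames(coef_tab) <- terms_all
coef_tab[is.na(coef_tab)] <- ""  # term not present in that model

# Model-level statistics appended below the coefficients.
stats_tab <- sapply(models, function(m) {
  s <- summary(m)
  c("Observations" = as.character(nobs(m)),
    "R-squared"    = sprintf("%.3f", s$r.squared),
    "Adj. R-squared" = sprintf("%.3f", s$adj.r.squared),
    "Residual SE"  = sprintf("%.3f", s$sigma))
})

comparison <- rbind(coef_tab, stats_tab)
print(comparison, quote = FALSE)
```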

4.3. Among the 4 models, which one do you think is the best? Justify your choice, and interpret any statistics you use for the decision. (3 marks)

4.4. Regardless of your decision in 4.3, a researcher would like to examine the relationship of income for both years in terms of percentage change. In this case, which is the most appropriate model? (3 marks)

4.5. Perform hypothesis tests for all independent variables in the model you identified in 4.4. Test whether each variable has a statistically significant positive impact on the dependent variable. You do not need to write out the process separately for every variable; just provide a standard format that can be applied to all independent variables. (5 marks)
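
For a one-sided test of a positive effect, the hypotheses for coefficient j are H0: beta_j <= 0 against H1: beta_j > 0, and the test statistic is the estimate divided by its standard error. The sketch below computes the one-sided p-values and the 5% critical value for a small simulated model; the data, the variable names and the 5% significance level are assumptions made for illustration.

```r
set.seed(5)
# Simulated data: x1 has a positive effect, x2 has no true effect.
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 0.8 * d$x1 + rnorm(100)
fit <- lm(y ~ x1 + x2, data = d)

ct     <- summary(fit)$coefficients
t_stat <- ct[, "t value"]      # estimate / standard error
df_res <- df.residual(fit)     # n minus number of estimated coefficients

# One-sided p-value: P(T > t) under H0, using the t distribution.
p_one_sided <- pt(t_stat, df = df_res, lower.tail = FALSE)

# Equivalent decision rule: reject H0 if t exceeds the one-sided critical value.
t_crit <- qt(0.95, df = df_res)

data.frame(t_stat, p_one_sided, reject_H0 = t_stat > t_crit)
```

Note that `summary()` reports two-sided p-values, so they should not be used directly for a one-sided test.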

4.6. Interpret the estimated coefficients of the variables (including the constant) that are statistically significant. (10 marks)

4.7. Examine the VIF of each variable in the model you chose in 4.4. Are the values acceptable? Also, notice that the VIF of one particular variable is higher than the rest. Why do you think it shows a high VIF? (3 marks)
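
The VIF of predictor j is 1/(1 - R²_j), where R²_j is the R-squared from regressing predictor j on all the other predictors. The sketch below computes it by hand on simulated data in which x3 is deliberately correlated with x1; `vif_manual()` is a hypothetical helper, and packages such as car provide a ready-made `vif()` function for fitted models.

```r
set.seed(6)
n <- 139
# Simulated predictors; x3 is constructed to be highly correlated with x1.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$x3 <- 0.9 * d$x1 + rnorm(n, sd = 0.3)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

v <- vif_manual(d, c("x1", "x2", "x3"))
v
```

In this toy example x1 and x3 show much larger VIFs than x2, because each of them is well explained by the other.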

4.8. Draw a histogram of the residuals for the model you chose in 4.4. Discuss this plot. (3 marks)

5. Implications (15 marks)

Does this regression measure the causal effect of the explanatory variables on income per capita? Use all the relevant and appropriate information obtained throughout the analysis. Discuss any suggestions for, or limitations of, the model as well. Do not write more than 1 page for this task.

6. Conclusion (5 marks)

Write a conclusion that summarizes tasks 2 to 5. Do not write more than 1 page.