ECOM2000: Econometric Principles ‐ Data Analysis Project 2022
1. Introduction
One of the most widely discussed hypotheses in the literature on development and environmental economics is the Environmental Kuznets Curve (EKC). It states that the relationship between a country’s national income and the extent of environmental degradation follows an inverted U-shape. That is, environmental degradation first increases with national income at a diminishing rate, and then starts decreasing once national income rises beyond a certain level. In this project, we will test the EKC hypothesis empirically using data from the World Bank.
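In regression terms, an inverted U-shape is commonly captured by a quadratic specification. The following is only a sketch in assumed notation: E for environmental degradation, Y for national income per capita, and u for the error term are symbols introduced here for illustration and do not appear elsewhere in this document.

```latex
\[
E_i = \beta_0 + \beta_1 Y_i + \beta_2 Y_i^2 + u_i,
\qquad \beta_1 > 0,\ \beta_2 < 0
\]
\[
\frac{\partial E_i}{\partial Y_i} = \beta_1 + 2\beta_2 Y_i = 0
\quad\Longrightarrow\quad
Y^{*} = -\frac{\beta_1}{2\beta_2}
\]
```

Under these sign restrictions, degradation rises with income below Y* and falls above it, which is the pattern the EKC hypothesis describes.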
2. Preliminary
Data collection
To do this project, you need to download the following data from the World Bank’s (WB’s) World Development Indicators (WDI) website (https://databank.worldbank.org/source/world-development-indicators):
Variable | WB indicator name | Measurement | WB data code
CO2 | CO2 emissions | Metric tons per capita | EN.ATM.CO2E.PC
GDP | GDP per capita | Constant 2015 US$ | NY.GDP.PCAP.KD
PopDen | Population density | People per sq. km of land area | EN.POP.DNST
UrbPop | Urban population | % of total population | SP.URB.TOTL.IN.ZS
Please follow the steps below to download these data from the WB’s website:
1. Expand the “Country” tab on the left-hand side of the website and choose all countries. To do this, select “Countries” out of the three options, then select all countries by ticking the box on the next line. You should see that you have selected 217 countries. (see Image 1 at the end of this document)
2. Expand the “Series” tab and search for the required data series by the WB indicator name or data code listed above. Go through the search results and tick the box next to the intended variable (pay attention to the measurement as well). (see Image 2)
3. Move to the “Time” tab and select “2018” by ticking the box next to it. (see Image 3)
4. Click “Apply Changes” on the right-hand side of the website. (see Image 3)
5. Under “Download options,” choose “Advanced Options”. (see Image 3)
6. In the popup window, select “Names only” under the “Variable format” option. (see Image 4)
7. Click “Download” and save the file to your local drive.
Data Cleaning/Formatting
Before analyzing the data, you have to follow several steps to clean and rearrange them. First, open the data file in Excel. You will notice that the data downloaded from the WB are arranged as:
Column A | Column B | Column C
Country Name | Series Name | 2018
The data are stored in rows 2-869. Below them, in rows 873 and 874, you will see the following text:
Data from database: World Development Indicators
Last Updated: ##/##/2022
Please delete these two lines and save the Excel file under the same name. (see Image 5)
Next, we need to convert the data format from a long form (data on 4 variables from 217 countries are stacked vertically in one column) into a wide form (data are stored as a table in which the first column holds the country name and each subsequent column holds the data on one variable). There are many ways to perform this transformation, but one possible way is to execute the following in R:
dat = readxl::read_excel("[path]/Data_Extract_From_World_Development_Indicators.xlsx", sheet = "Data")
datw = tidyr::spread(dat, "Series Name", "2018")
We are familiar with the first line, which reads the Excel data into the workspace (you need to change the file path). The second line converts the data from a long form into a wide form and saves the new data as “datw.” We also want to shorten the variable names so that they are easier to handle. We can try:
datw = dplyr::rename(datw,
  CO2 = "CO2 emissions (metric tons per capita)",
  GDPpc = "GDP per capita (constant 2015 US$)",
  PopDen = "Population density (people per sq. km of land area)",
  UrbPop = "Urban population (% of total population)")
Now, the data frame “datw” contains the country name in the first column and the data on four variables (CO2, GDPpc, PopDen, and UrbPop) in columns 2-5.
Two more steps are needed: (1) convert missing values from “..” into “NA” and eliminate them from the dataset, and (2) change the data type from character to numeric. These can be done by:
datw[datw==".."] = NA
datw = na.omit(datw)
datw$CO2 = as.numeric(datw$CO2)
datw$GDPpc = as.numeric(datw$GDPpc)
datw$PopDen = as.numeric(datw$PopDen)
datw$UrbPop = as.numeric(datw$UrbPop)
The first line changes “..” into “NA” (which is the default value for missing observations in R), while the second line eliminates these missing observations from the dataset. The remaining four lines change the data type from character into numeric for the four variables. Now we are ready to analyze the data.
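As a quick check of the cleaning steps above, the following sketch runs the same sequence on a small made-up data frame. The country labels, the object names “long” and “wide,” and all values are hypothetical, and base R's reshape() stands in for spread() so that the example needs no extra packages.

```r
# Toy long-form data (hypothetical values): 3 countries x 2 series,
# with ".." marking a missing observation as in the WB download
long <- data.frame(
  country = rep(c("A", "B", "C"), each = 2),
  series  = rep(c("CO2", "GDPpc"), times = 3),
  value   = c("1.5", "1000", "..", "2000", "4.2", "30000"),
  stringsAsFactors = FALSE
)

# Long -> wide: one row per country, one column per series
wide <- reshape(long, idvar = "country", timevar = "series",
                direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

# Recode ".." as NA, drop incomplete rows, then convert to numeric
wide[wide == ".."] <- NA
wide <- na.omit(wide)
wide$CO2   <- as.numeric(wide$CO2)
wide$GDPpc <- as.numeric(wide$GDPpc)

# Inspect the result: country B is dropped because its CO2 value was ".."
str(wide)
```

Running str(datw) and nrow(datw) on the real data in the same way is a simple way to confirm that the four variables are numeric and to see how many countries remain after removing missing observations.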
3. Data Analysis
Analyze the WDI data using R/RStudio and answer the following 11 questions.
1. (6 points) Create a new variable CO2k by converting the data on CO2 emissions from metric tons per capita into kilograms per capita (by multiplying the original data by 1,000). Then, create a scatter plot of CO2 emissions per capita (vertical axis) against per capita GDP (horizontal axis). Please label each axis clearly.
2. (10 points) Under the assumption that CO2 emissions (in kg) are independently and identically distributed in the population, construct a 90% confidence interval for the population mean of CO2 emissions per capita (in kg) manually (that is, using the sample mean, the sample variance, and the appropriate critical values obtained from R and/or statistical tables). Interpret the calculated confidence interval.
3. (10 points) Estimate a multiple regression model with CO2 emissions per capita (in kg) as the dependent variable, and GDP per capita, GDP per capita squared, population density, and the share of population living in urban areas as explanatory variables. Write down the estimated sample regression equation.
4. (8 points) For the regression model estimated in Question 3, interpret the reported R-squared value as well as the standard error of the regression. Briefly comment on the model’s goodness of fit to the observed data.
5. (8 points) For the regression model estimated in Question 3, provide interpretations of the estimated coefficients for PopDen and UrbPop.
6. (10 points) For the regression model estimated in Question 3, test whether the true population coefficient for PopDen is negative at a 10% test size, using a critical value approach. State clearly the null and alternative hypotheses.
7. (10 points) For the regression model estimated in Question 3, construct a 99% confidence interval for the true population coefficient for UrbPop. Interpret the obtained confidence interval.
8. (14 points) Using the regression model estimated in Question 3, calculate the predicted values of CO2 over the range of GDP observed in the sample (in increments of 1,000) whilst keeping the values of PopDen and UrbPop at their respective sample means. Create a two-dimensional diagram in which the predicted values of CO2 (vertical axis) are plotted against GDP (horizontal axis). Briefly describe the relationship between CO2 emissions per capita and GDP per capita as implied by the estimated regression model. Does this have the shape you expected? Explain why or why not.
9. (6 points) Based on the model estimated in Question 3, find the level of GDP per capita at which the effect of GDP per capita on CO2 emissions changes its sign. Briefly comment on how this relates to your answer to Question 8 above.
10. (10 points) Describe how you could test the joint hypothesis that the true population coefficients for PopDen and UrbPop are both equal to zero at a 5% significance level. State the null and alternative hypotheses, and clearly present the test statistic and how you would calculate it.
11. (8 points) Implement the joint hypothesis test as described in Question 10 at a 5% significance level.
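The R functions relevant to these questions can be sketched on simulated data. Everything below is an illustration under stated assumptions: the data are randomly generated (not the WDI sample), the coefficient values used to simulate them are arbitrary, and the object names “sim,” “fit,” “fit_r,” and “grid” are hypothetical. Comparing nested models with anova() is only one of several ways to carry out a joint test.

```r
set.seed(1)

# Simulated data (NOT the WDI sample); all numbers below are made up
n      <- 150
GDPpc  <- runif(n, 500, 80000)
PopDen <- runif(n, 5, 500)
UrbPop <- runif(n, 15, 100)
CO2k   <- 500 + 0.4 * GDPpc - 3e-6 * GDPpc^2 - 1 * PopDen +
          20 * UrbPop + rnorm(n, sd = 1500)
sim    <- data.frame(CO2k, GDPpc, PopDen, UrbPop)

# 90% confidence interval for the mean, built by hand from
# the sample mean, the sample standard deviation, and a t critical value
xbar <- mean(sim$CO2k)
se   <- sd(sim$CO2k) / sqrt(n)
ci90 <- xbar + c(-1, 1) * qt(0.95, df = n - 1) * se

# Quadratic-in-GDP regression; I() protects the square inside a formula
fit <- lm(CO2k ~ GDPpc + I(GDPpc^2) + PopDen + UrbPop, data = sim)
summary(fit)                          # coefficients, R-squared, residual SE
confint(fit, "UrbPop", level = 0.99)  # 99% CI for a single coefficient

# Predicted values over a GDP grid, other regressors held at their means
grid <- data.frame(
  GDPpc  = seq(min(sim$GDPpc), max(sim$GDPpc), by = 1000),
  PopDen = mean(sim$PopDen),
  UrbPop = mean(sim$UrbPop)
)
grid$fitted <- predict(fit, newdata = grid)
plot(grid$GDPpc, grid$fitted, type = "l",
     xlab = "GDP per capita", ylab = "Predicted CO2 (kg per capita)")

# Turning point of the fitted quadratic: -b1 / (2 * b2)
b <- coef(fit)
turning_point <- -b["GDPpc"] / (2 * b["I(GDPpc^2)"])

# Joint test that the PopDen and UrbPop coefficients are both zero:
# F test comparing the restricted model with the unrestricted one
fit_r  <- lm(CO2k ~ GDPpc + I(GDPpc^2), data = sim)
f_test <- anova(fit_r, fit)
f_test
```

The F statistic reported by anova() is the usual ratio of the increase in the residual sum of squares per restriction to the unrestricted residual variance, so it can be checked against a manual calculation based on the two models' residuals.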
4. Further Instructions
This is an individual project, not a group project. You are required to work and compose your report individually.
You need to submit either:
i. A PDF report generated by RMarkdown that contains your R commands, outputs, and text‐based answers addressing each question, or
ii. A PDF document containing your text‐based answers to the questions, AND an RMarkdown file (or an R‐script file plus its outputs) providing your R commands and their outputs. Please attach your RMarkdown or R script (and output) file at the end of your text‐based report.
If you choose the second of the above two options, please include the key results of your data analysis from R (for example, descriptive statistics, figures, and regression outputs, but not the data) in your text-based report, so that your report can be read without referring to your R script/output file.
Your report is marked out of 100 marks in total and will count for 35% of your overall final grade.
You need to show all of your workings. Full marks will not be awarded if any essential step is not presented or described.
Please submit your report through Turn-it-in by the due date specified on the first page of these instructions.
Appendix: Captured images for Data Download
Image 1:
Image 2:
Image 3:
Image 4:
2022-09-04