Homework Assignment 2

Multiple Regression, Omitted Variable Bias

(Due February 14, 2024)

This week you will be exploring data on subways in cities across the world. Your objective will be to assess the effect of subways on air pollution, and, in doing that, you will also explore the relationship between a city’s size and the extent of its subway network.

The data that you will use comes from actual studies that have been published in highly regarded journals. The articles are included as supplementary materials, so you can see what some of these studies look like when written up for academic outlets. (You will not be replicating these papers’ results, so they are more for your edification than any sort of direct guidance.) Write everything up professionally; your overall presentation style is important. You should not have any raw R output or code copied/pasted into your final document. All questions should be answered clearly and concisely in full sentences. Plots, tables, etc., should be clearly labeled and referenced appropriately in your writeup. For this assignment, your presentation scores will be incorporated into each question.

1. The Basics (20pts)

Create a new .R script file (“LastName PID HW2.R” – e.g., “Garfias 12345678 HW2.R”; no blank spaces in the script name). Your script should generate all of the output, tables, and graphics used in your written submission and needs to run on its own, fully, without errors, to get full credit. You should assume that we will run your .R file in the same directory as the original data files; do not assume that the data is already loaded.

• Script (.R file) named correctly and runs without errors. (10pts)

• Script (.R file) does not overwrite original data, does the requisite analysis, and outputs any figures or tables used in your writeup (labeled correctly, and saved with filenames that include your last name and PID). (10pts)

2. Theory and Getting to Know the Data (20pts)

You will begin by mimicking the ‘Introduction’ section of a paper that explains the theory you are exploring and the methods you will use. It is really important to understand conceptually what you want to do before you dive into the analysis. Imagine explaining it to an educated but non-QM reader. You will then move on to describe the data, mimicking the ‘Data’ section of a paper. As you respond, think of how you could turn your answers into an essay that seamlessly describes the analysis.

1. Provide an introductory sentence or two that explains the importance of subways and why we care about their effects on cities as public policy analysts. (1-2 sentences, 1pt)

2. Thinking about your response above and the variables in your data set, explain how you could use regression analysis and this data set to estimate the effects of subways. Make sure to clearly identify possible dependent and independent variables. (1 short paragraph; 5pts)

3. Note that the provided data is a .dta Stata dataset, which is commonly used in Economics; use the haven package to import the data. Write a 3-sentence (maximum) description of the data as though you were explaining it to a colleague. What is the unit of observation, how many cities are included in the data, how many countries are represented in the data, how many years are represented in the data? Make sure to exclude missing values from your count. (2pts)

4. Looking at the included articles, how are subways defined in the data? How many cities have a working subway by the end of 2010? (1-2 sentences, 2pts)

5. Now, describe some trends in the construction of subways across the world. Starting from the original dataset, keep observations starting in 1862. Then, collapse/aggregate, by decade, 1) the total number of city subway systems each decade and 2) the total number of new subways that become operational each decade. Create a presentable table of the data that results from the collapse/aggregation. (3pts)

6. With the collapsed data by year, create a single presentable graph that plots both the total number of cities with a subway over time and the total number of new subways over time. During which decade can you clearly see an inflection point in the number of cities with active subways? (5pts)

7. Now collapse the data by continent-year (beware of missing values here), and create a graph that plots the total number of cities with a subway over time for each continent. Is the increase in the number of subways following the inflection point driven by cities in specific continent(s)? (2pts)

3. Exploratory Regression Analysis (25pts)

You will begin by exploring one important correlate of the extensiveness of the subway network in a city: its population.

1. Starting from the original panel data, keep a cross-section of cities that have a subway in 1999. For all the subsequent analyses, make sure you are using variables measured in 1999. With these data, evaluate the relationship between a city’s population and the subway’s extent, as measured by the number of subway stations. Regress the number of subway stations on total city population count. Interpret the results (magnitude and significance) in a meaningful way. (2pts)

2. To facilitate a substantive interpretation of the estimates, standardize city population and re-estimate the regression. Interpret the results in a meaningful way. Is the standard error different in this regression as compared to the previous one (from question 3.1)? Is the p-value different? How do you make sense of these differences/similarities? (3pts)

3. A colleague tells you that, in their experience, it is standard practice to log population before any analysis. Follow this advice and take the log of total population. What would you do if a city were reported to have a population of zero? (1pt)

4. Regress the number of subway stations on the log of total city population count and interpret the results. (4pts)

5. Now create a scatterplot of the number of subway stations against the logged population variable. Is the bivariate relationship visibly linear or does it seem to follow a non-linear shape? (2pts)

6. Consider a model in which you want to relate the number of subway stations as a dependent variable and log(population) and log(population) squared as independent variables. Write out the equation for the model. Show your calculations for how you would interpret a change in y with regards to a change in x. (5pts)

7. Now regress the number of subway stations on log(population) and log(population) squared. What do you find? Generate a scatter plot showing the original (logged) data and the predicted values from your regression. Calculate the turning point. What can you say about the relationship between a city’s number of subway stations and its population? (5pts)

8. Look at the results from your last three regressions above (from questions 3.1, 3.4, and 3.7). Which model fits the data best? Why? What can you say about the root mean squared error in each case? (3pts)
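As a notational reference for questions 3.6 and 3.7, the following is a minimal sketch of a level-on-log quadratic model and its turning point. The symbols (stations, pop, the betas, and u) are assumptions chosen for illustration, not notation taken from the data or the articles; adapt them to your own variable names.

```latex
% Assumed notation: stations_i = number of subway stations in city i,
% pop_i = total population of city i, u_i = error term.
\begin{align*}
\text{stations}_i &= \beta_0 + \beta_1 \ln(\text{pop}_i)
                   + \beta_2 \left[\ln(\text{pop}_i)\right]^2 + u_i \\[4pt]
% Marginal effect of log population: it depends on the population level.
\frac{\partial\, \text{stations}}{\partial \ln(\text{pop})}
                  &= \beta_1 + 2\beta_2 \ln(\text{pop}) \\[4pt]
% Turning point: set the marginal effect to zero and solve.
\ln(\text{pop})^{*} &= -\frac{\beta_1}{2\beta_2},
\qquad
\text{pop}^{*} = \exp\!\left(-\frac{\beta_1}{2\beta_2}\right)
\end{align*}
```

Because the regressor is in logs, a 1% increase in population is associated with a change of approximately (β1 + 2β2 ln(pop))/100 stations, evaluated at the population level of interest. Whether the turning point falls inside the observed range of the data is worth checking before interpreting it.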

4. Bias Analysis (30pts)

The results from your analysis so far are useful to evaluate the descriptive relationship between the extent of a city’s subway network and its population. You will now turn to estimating the effect of a related measure of subway extent — the number of subway lines in a city — on air pollution, as measured by PM10 concentration in the air. These tiny particles have a diameter of 10 micrometres (0.01 mm) or less, are usually found in dust and smoke, and can be highly detrimental to health.

1. Articulate a clear argument about whether the extent of a city’s subway should increase or reduce air pollution. (1-2 sentences, 3pts)

2. Assess your argument that a city’s subway should increase/reduce air pollution using a bivariate regression model. Focus on the number of subway lines in a city in 1999 and PM10 concentration in 1999. Interpret the coefficients in a clear and substantive manner. (2pts)

3. Now, let’s move on to multiple regression. Why might you want to include average temperature in the regressions above? What alternative explanation might be driving the results in the regression above? Now explain what you think will happen to the regression coefficient you calculated above for the number of subway lines if you also include average temperature in the regression. Think about omitted variable bias and use correlations to guide your intuition. Regress PM10 concentration on the number of subway lines, and also include average temperature. What happens? Interpret the coefficients appropriately and substantively. (5pts)

4. Do you think that leaving out a measure of a city’s population might bias your results? Explain clearly and in detail why/why not. Then, estimate the regression from the previous question, but now include log of population as well. Interpret the findings substantively. What does this do to your coefficient of interest? (5pts)

5. Present one nicely formatted final regression table that contains the regression results for the models that you estimated in the last three questions. In your table, you should have one column for each model, and should show the parameter estimate, standard error, and some marker for statistical significance. Notes for the table should explain to the reader how to interpret it. In your answers to the questions above and in the discussion section below, you can reference the table by column, since the columns are the different model results. (5pts)

6. Imagine that as part of an unprecedented effort to expand public transport, a country with a large number of cities funds a program that randomly allocates additional subway lines. You collect data on these randomly allocated lines and estimate a model similar to that of question 4.2. Should you include average temperature and a measure of population as additional covariates to avoid biasing your estimate of the effect of the additional subway lines? (5pts)

7. When you only kept observations from the 1999 cross section of cities that have a subway, how many cities ended up in the data? How many of these cities have missing data for PM10 concentration? Under what conditions would you be unconcerned about these missing observations? Is there evidence that you can provide to show that these conditions are met (or not met) in these data? (5pts)
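For the omitted variable bias reasoning in questions 4.3 and 4.4, the standard two-variable result can be sketched as follows. The notation (PM10, lines, temp, and the Greek letters) is an assumption introduced here for illustration; the same logic applies with log population in place of temperature.

```latex
% Assumed notation: the "long" model includes the omitted variable (temp);
% the "short" model leaves it out.
\begin{align*}
\text{Long model:}\quad
  \text{PM10}_i &= \beta_0 + \beta_1\,\text{lines}_i + \beta_2\,\text{temp}_i + u_i \\
\text{Short model:}\quad
  \text{PM10}_i &= \tilde{\beta}_0 + \tilde{\beta}_1\,\text{lines}_i + v_i \\
\text{Auxiliary regression:}\quad
  \text{temp}_i &= \delta_0 + \delta_1\,\text{lines}_i + e_i \\[4pt]
% The short-regression slope picks up the effect of the omitted variable.
E\!\left[\tilde{\beta}_1\right] &= \beta_1 + \beta_2\,\delta_1
\end{align*}
```

The bias term is β2·δ1, so its sign is the product of the sign of temperature's effect on PM10 and the sign of the correlation between temperature and the number of subway lines. If either is zero, omitting temperature does not bias the coefficient on subway lines.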

5. Conclusion (5pts)

Imagine an international financial institution is considering allocating substantial funds to cities with the objective of expanding subway systems to curb air pollution. They ask you to provide your expert advice on whether this massive investment is justified by the evidence. If you were asked to critique the shortcomings of your own analysis and to provide an understanding to the policymakers of what potential uncertainties remain, what would you say? What issues with your own analysis might you raise? (one short paragraph, 5pts)