BA222 - Lecture Notes 14: Introduction to Time-Series and Non-Linear Regression Models
Table of Contents
• Time Series
• Identifying Time-Trends:
• Visualization of Time-Series
• Estimating Time Trends using regression models
• Making a Forecast using Time
• Seasonality
• Identifying Seasonality and Time Trends using Regression Models
• Non-Linear Regression Models
• Increasing and Diminishing Returns
• Polynomials
• Logarithmic Transformations
• Comparing Non-Linear Models
Time Series
A time-series is data for which the unit of observation is time. Each observation in a time-series represents the values of the variables in the data measured at different moments in time. Much business and economics data comes in the form of time series: for example, interest rates, GDP, the inflation rate, the unemployment rate, stock prices, etc. Time series come in different frequencies. The most common time frequencies are:
• Annual (Real GDP, Population Growth)
• Quarterly (Quarterly GDP)
• Monthly (Unemployment, Inflation)
• Weekly (Industrial Production)
• Daily (Stock Prices)
Using time-series we can identify patterns over time and use them to make
predictions about the future based on past observations.
In a time-series we are concerned with two patterns:
• Time-Trend: How the average value of y changes as time passes. There may be a singular time-trend or several, describing periods of growth or decrease over time.
• Seasonality: How the average value of y changes over different seasons. A season here is defined as a repeating stage in a cycle of time units. For instance, data for retail sales may have an unusually high average value every year during the holiday season. Data for restaurant reservations will be higher on Fridays and Saturdays than on other days of the week.
The key difference between seasonality and a time-trend is that seasonality is a repeating pattern, while a time-trend is a singular effect.
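To make the distinction concrete, here is a small simulated sketch (the numbers are made up for illustration, not real data): the time trend moves the level of the series permanently, while the seasonal effect repeats every four quarters.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly series combining a trend and a seasonal pattern
rng = np.random.default_rng(0)
t = np.arange(1, 41)                          # 40 quarters = 10 years
quarter = ((t - 1) % 4) + 1                   # repeating 1, 2, 3, 4 pattern
trend = 2.0 * t                               # time trend: +2 units per quarter
seasonal = np.where(quarter == 4, 15.0, 0.0)  # made-up Q4 holiday spike
y = 100 + trend + seasonal + rng.normal(0, 1, size=len(t))

df = pd.DataFrame({'t': t, 'quarter': quarter, 'y': y})
# The Q4 average sits well above the other quarters (seasonality),
# while every quarter's level keeps rising over the years (trend).
q_means = df.groupby('quarter')['y'].mean()
```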
Identifying Time-Trends:
Let's start by loading some time-series data. The data in GDPC1.csv includes two columns:
• y is the value of the quarterly real GDP for the US.
• date represents the dates in which the quarterly real GDP was reported. The data starts in the first quarter of 1960 and ends in the last quarter of 2022.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import seaborn as sb

path = '/Users/ccasso/Dropbox/BU/Teaching/2023/Spring/BA222/Data/realGDPUS/gdp.csv'
gdp = pd.read_csv(path)
gdp
Let's start by creating some useful variables to help us with the analysis:
# EXTRACTING YEARS
dateDecomposed = gdp.date.str.split('/')
gdp['month'] = dateDecomposed.str[0].astype('int')
gdp['day'] = dateDecomposed.str[1].astype('int')
gdp['year'] = dateDecomposed.str[2].astype('int')

# ADDING 1900 OR 2000 TO YEARS
gdp.loc[gdp['year'] >= 60, 'year'] += 1900
gdp.loc[gdp['year'] < 60, 'year'] += 2000

# EXTRACTING QUARTERS
gdp['quarter'] = 1
gdp.loc[gdp['month'] == 4, 'quarter'] = 2
gdp.loc[gdp['month'] == 7, 'quarter'] = 3
gdp.loc[gdp['month'] == 10, 'quarter'] = 4

# GENERATING TIME PERIODS
gdp['t'] = pd.Series(range(1, len(gdp) + 1, 1))
Code Explanation:
• .str.split('/') : The date variable is a string. We can apply string-related functions to a variable by using .str in pandas. One of those functions is .split() , which allows you to divide a string and create a list based on a character. Because dates are stored in the format 'm/d/y' we use the character / to obtain a list in the form ['m','d','y'] .
• .str[i] : On a list of strings like the one we created using .str.split('/') you can extract individual elements by using the square bracket operator [] and the index of the element you want. For instance, on ['m','d','y'] , using .str[0] will extract m , .str[1] will extract d , etc.
• .astype('int') : This function allows us to typecast the data from str to int . The dates are originally saved as strings, but we want days, months and years to be integers.
• += : The += operator allows you to add a value to the current value of a variable. For instance, if x = 10 , then doing x += 5 is the same as x = x + 5 .
• Making Quarters: The variables month and day are not very useful as the data is organized in quarters. But we can infer the quarter number (1, 2, 3, 4) from the month of the report. Also, we are making a variable called t which represents the different periods with an integer.
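As an aside, here is a sketch of an alternative approach: pandas can parse these date strings directly with pd.to_datetime, which replaces the manual split and typecasting (the toy dates below are illustrative). One caveat: with a two-digit year format, years 00-68 are mapped to 20xx, so years like 60 still need a century correction of our own.

```python
import pandas as pd

# Toy dates in the same 'm/d/y' format as the gdp data
gdp = pd.DataFrame({'date': ['1/1/60', '4/1/60', '7/1/60', '10/1/99', '1/1/00']})
parsed = pd.to_datetime(gdp['date'], format='%m/%d/%y')

# %y maps 00-68 to 20xx, so 1960-1968 come out a century too late;
# shift any future-dated year back by 100 years.
century_fix = parsed.dt.year > 2022
parsed = parsed.mask(century_fix, parsed - pd.DateOffset(years=100))

gdp['year'] = parsed.dt.year
gdp['quarter'] = parsed.dt.quarter   # quarter inferred from the month
gdp['t'] = range(1, len(gdp) + 1)
```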
Visualization of Time-Series
Visualizing a variable over time is very simple: you just need to use the .plot() method. Let's take a look at the real GDP ( y ):
gdp.y.plot()
plt.show()
Alternatively, we can use .lineplot() from the seaborn package.
sb.lineplot(y = gdp.y, x = gdp.t) # Using t on the x-axis
plt.show()
We can see an overall increasing (positive) trend in the data.
Let's go over another example: the inflationUS.csv file includes information about the monthly consumer price index (CPI) for the US and the inflation rate (the 12-month % change of the CPI). Let's start by making a graph of the CPI. I'm calling this database pi :
pi.cpi.plot()
plt.show()
An increasing trend that seems to accelerate at the end. Let's see now the inflation rate:
pi.inflation.plot()
plt.show()
This one, for the most part, has no clear overall time trend: the inflation rate seems to fluctuate around 2 percent, except for the last part of the data, where it rises to a peak of 8% and has recently been descending. It's probably easier to see this by adding a reference line at 2 percent.
pi.inflation.plot()
plt.axhline(y = 2, color = 'black', linestyle = 'dashed')
plt.show()
The reason why the inflation rate fluctuates around 2 percent has to do with business cycles and the monetary policy of the Federal Reserve (FED). The FED changes interest rates to keep inflation under what it considers the natural inflation rate consistent with the growth in real GDP, in order to avoid periods in which prices increase faster than production. The FED is not always successful in keeping inflation under control, as we can see at the end of the time-series.
We can use yearly averages to identify time-trends with more clarity. It is as simple as using the year variable in the time series in the .lineplot function:
sb.lineplot(y = pi.inflation, x = pi.year)
plt.axhline(y = 2, color = 'black', linestyle = 'dashed')
plt.show()
We can do the same for the GDP data:
sb.lineplot(y = gdp.y, x = gdp.year)
plt.show()
In summary:
• We can visually identify time trends of time-series using the .plot() and .lineplot() functions.
• Time trends may be singular, like in the case of the real GDP as the data is overall increasing.
• A time-series may also show different trends in different periods. The inflation rate for the most part fluctuates around 2 percent, but at the end it shows an increasing trend.
Estimating Time Trends using Regression Models
A regression model with a time trend can be represented as:
y = β0 + β1time + error
Where time represents the time periods, measured in some unit (here, quarters). β1 is the slope: it shows by how much the average value of y changes when time increases by one unit. Finally, the error term represents all the factors, besides time, that explain the variation of the y variable.
We can estimate the parameters of the model with smf.ols() :
smf.ols('y ~ t', data = gdp).fit().summary()
Based on these results we can say that: on average, real GDP increased by $68.62 billion each quarter (yes, US real GDP is measured in billions of dollars).
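To see that the slope really does measure the average change per period, here is a quick sketch on simulated data (the true trend of 5 units per period is made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated series: y grows by 5 units per period on average
rng = np.random.default_rng(1)
df = pd.DataFrame({'t': np.arange(1, 101)})
df['y'] = 50 + 5 * df['t'] + rng.normal(0, 2, size=100)

fit = smf.ols('y ~ t', data=df).fit()
slope = fit.params['t']  # recovers the true per-period change (about 5)
```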
The slope of the graph describes a linear estimation of the time trend. We can visualize the fit using .regplot from the seaborn package.
sb.regplot(x = gdp.t, y = gdp.y, line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.lineplot(x = gdp.t, y = gdp.y, color = 'blue')
plt.show()
This is a pretty good fit. It captures the overall time trend.
Let's repeat this exercise using the data for inflation:
# CREATING T for the PI data
pi['t'] = pd.Series(range(1, len(pi) + 1, 1))
smf.ols('inflation ~ t', data = pi).fit().summary()
Now let's see the plot:
sb.regplot(x = pi.t, y = pi.inflation, line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.lineplot(x = pi.t, y = pi.inflation, color = 'blue')
plt.show()
The regression is biased by the last period in the data, when inflation ran above 2 percent. Let's split the sample into three periods and fit a separate trend to each:
start = 111
end = 126
timeFilterPre = pi.t < start
timeFilterMed = (pi.t >= start) & (pi.t < end)
timeFilterPost = pi.t >= end

sb.regplot(x = pi.t[timeFilterPre], y = pi.inflation[timeFilterPre], line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.regplot(x = pi.t[timeFilterMed], y = pi.inflation[timeFilterMed], line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.regplot(x = pi.t[timeFilterPost], y = pi.inflation[timeFilterPost], line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.lineplot(x = pi.t, y = pi.inflation, color = 'blue')
plt.show()
In this last graph we can clearly see the period where inflation is under control, with no increasing/decreasing trend over time. Then, in period 111 (March of 2021), we start seeing inflation consistently above 2 percent. This inflation cycle reaches a peak of about 9.1 percent in period 126 (June of 2022) and then starts descending (what was the monetary policy implemented by the FED in Summer 2022 to fight inflation?).
Making a Forecast using Time
Forecasting is very similar to calculating the fitted values of a model. We just need to plug in the time period that we want and calculate the estimated value of y indicated by the regression. Let's make a forecast for the quarterly GDP for the first quarter of 2023:
T = max(gdp.t) + 1 # one more quarter after the data ends
b = smf.ols('y ~ t', data = gdp).fit().params
gdp_2023Q1 = b['Intercept'] + b['t'] * T
gdp_2023Q1
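As a sanity check on the mechanics, the manual calculation above should agree exactly with .predict(). A sketch on simulated data (hypothetical numbers):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a known linear trend
rng = np.random.default_rng(2)
df = pd.DataFrame({'t': np.arange(1, 51)})
df['y'] = 10 + 3 * df['t'] + rng.normal(0, 1, size=50)

fit = smf.ols('y ~ t', data=df).fit()
T = df['t'].max() + 1  # one period past the end of the sample

# Forecast by hand: intercept + slope * T ...
manual = fit.params['Intercept'] + fit.params['t'] * T
# ... and the same forecast via .predict()
auto = fit.predict(pd.DataFrame({'t': [T]})).iloc[0]
```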
To decide if this was a good or a bad prediction we'll have to wait for the actual real GDP numbers for the first quarter of 2023.
Let's see how good the model is overall at producing accurate predictions using the data that we have. For this we are going to pretend that we didn't have the data for the last ten years (2013-2022). Using only a model estimated with data before 2013 we are going to make predictions for the quarters of 2013-2022 and compare them to the real values to see how the model performs:
# Estimating using only data before 2013
reg1 = smf.ols('y ~ t', data = gdp[gdp.year < 2013]).fit()

# Use the function .predict()
# to make predictions for the entire data
# using the model estimated with data before 2013
forecast = reg1.predict(gdp)

# Adding the forecast to the original data (optional)
gdp['forecast'] = forecast

# Making graphs
gdp.forecast.plot()
gdp.y.plot()
plt.show()
The predictions look pretty good (this is essentially the same graph as what we produced with .regplot but using only data prior to 2013). In order to objectively determine if the predictions are accurate we'll use the 95% confidence interval for the predictions. If the observed values are generally within the CI then the predictions of the model are good.
To get the 95% confidence interval bounds we need to use the function .get_prediction().summary_frame(alpha = 0.05) . You can change the alpha level to obtain different confidence levels.
The result of .get_prediction().summary_frame(alpha = 0.05) is a data frame that includes 6 columns:
• The first one is labeled mean and that's the same as the fitted values of the regression.
• The second one is mean_se and that's the standard error of the prediction. Recall that we use the standard errors to construct the confidence intervals.
• The third one is mean_ci_lower ; these are the values of the lower bound of the 95% confidence interval.
• The fourth column is mean_ci_upper ; it works the same as mean_ci_lower but for the upper bound.
• The last two columns, obs_ci_lower and obs_ci_upper , are the bounds of the interval for individual observations rather than for the average prediction. We will not use these columns here.
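A small sketch on simulated data (hypothetical values) showing these columns in action:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated trend data
rng = np.random.default_rng(3)
df = pd.DataFrame({'t': np.arange(1, 41)})
df['y'] = 5 + 2 * df['t'] + rng.normal(0, 1, size=40)

fit = smf.ols('y ~ t', data=df).fit()
pred = fit.get_prediction(df).summary_frame(alpha=0.05)
cols = list(pred.columns)  # mean, mean_se, mean_ci_lower/upper, obs_ci_lower/upper

# The fitted values ('mean') always sit inside their own CI bounds
inside = ((pred['mean'] >= pred['mean_ci_lower']) &
          (pred['mean'] <= pred['mean_ci_upper'])).all()
```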
Therefore, if we want to judge how accurate the forecasts of a regression model are, we can make a graph of the predicted values, the actual values and the confidence interval. If, most of the time, the observed values are within the confidence interval, and, in general terms, the model captures the time trends in the data, we can use the model for forecasting.
# Making predictions with 95% confidence interval
predictions = reg1.get_prediction(gdp).summary_frame(alpha = 0.05)

# Adding confidence interval values to the data frame
gdp['lB'] = predictions.mean_ci_lower
gdp['uB'] = predictions.mean_ci_upper

# Making graphs for year >= 2013
# Using the linestyle = "dashed" option to make
# the graph a bit easier to read
gdp.forecast[(gdp.year >= 2013) & (gdp.year < 2023)].plot(color = 'red', linestyle = "dashed")
gdp.y[(gdp.year >= 2013) & (gdp.year < 2023)].plot(color = 'blue')
gdp.lB[(gdp.year >= 2013) & (gdp.year < 2023)].plot(color = 'black', linestyle = "dashed")
gdp.uB[(gdp.year >= 2013) & (gdp.year < 2023)].plot(color = 'black', linestyle = "dashed")
plt.show()
Doesn't look so good now. Our predictions and confidence intervals are way off in this portion of the sample. The reason for this is that in a regression model there are two components that explain the variation of y: a part explained by the independent variables, and the error term. If elements of the error term change in a significant way over time, then we should not be surprised by these results.
Seasonality
Frequencies in time series are important when discussing seasonality. A seasonal pattern happens when the average value of y is statistically related to the presence of specific seasons (or cyclical moments in time). For example, retail sales tend to increase on average in the fourth quarter because of holiday sales. Restaurant reservations are harder to find on Fridays than on other days of the week. In the summer, sales of ice cream increase on average, and so on.
Take a look at the data in the file gap_revenues.csv for a time-series in quarterly frequency with a very clear seasonal pattern:
import pandas as pd
import matplotlib.pyplot as plt

path = '/Users/ccasso/Dropbox/BU/Teaching/2022/Spring/BA222/BA222Spring2022/Datasets/gap/gap_revenues.csv'
gap = pd.read_csv(path)
plt.plot(gap.revenue)
plt.show()
Let's see if we can detect a seasonal pattern by looking at the average by quarter using a boxplot:
sb.boxplot(y = gap.revenue, x = gap.quarter)
plt.show()
We can identify the overall time trend by using a yearly average instead:
sb.lineplot(y = gap.revenue, x = gap.year)
plt.show()
In summary, we can say that Gap's revenues increased from 1995 to 2000 and then remained roughly constant for the rest of the sample. Moreover, each year there is a cyclical pattern in sales, with a spike in the fourth quarter.
Identifying Seasonality and Time Trends using Regression Models
We can model seasonal patterns by adding dummies for each part of the cycle to a regression model. For instance:
y = β0 + β1 (Quarter2) + β2 (Quarter3) + β3 (Quarter4) + error
Would be the correct specification to identify seasonal patterns in quarterly data. Note that the first quarter is excluded (why?).
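Why exclude the first quarter? With an intercept in the model, the four quarter dummies always add up to one, duplicating the intercept column, so including all four would make the regressors perfectly collinear (the "dummy variable trap"). A small sketch:

```python
import numpy as np
import pandas as pd

# Five years of quarters, repeating 1, 2, 3, 4
quarters = pd.Series([1, 2, 3, 4] * 5)

# All four dummies plus an intercept column: the dummies sum to 1 in
# every row, so the 5-column matrix only has rank 4 (perfect collinearity)
dummies = pd.get_dummies(quarters, prefix='Q')  # Q_1 ... Q_4
rank = np.linalg.matrix_rank(
    np.column_stack([np.ones(len(quarters)), dummies]))

# Dropping the first category restores full rank; Q1 becomes the baseline
dropped = pd.get_dummies(quarters, prefix='Q', drop_first=True)
```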
Estimating a seasonal effect is as simple as estimating a regression model with dummies:
reg1 = smf.ols('revenue ~ C(quarter)', data = gap).fit()
reg1.summary()
Interpretation:
• β0: Average sales when quarter is equal to one.
• β1: Average change in sales between quarter one and two.
• β2: Average change in sales between quarter one and three.
• β3: Average change in sales between quarter one and four.
In this case β1 and β2 are statistically equal to zero, meaning that we cannot say that there is a statistically significant difference between sales in the first, second and third quarters. Because β3 is statistically different from zero, we can say that sales in the last quarter are statistically different from sales in the first quarter.
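The interpretation above can be verified on simulated data: with quarter dummies, the OLS coefficients reproduce the quarter averages exactly (the Q4 spike of 20 below is made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated quarterly revenues with a made-up Q4 spike of 20
rng = np.random.default_rng(4)
df = pd.DataFrame({'quarter': [1, 2, 3, 4] * 25})
bump = df['quarter'].map({1: 0.0, 2: 0.0, 3: 0.0, 4: 20.0})
df['revenue'] = 100 + bump + rng.normal(0, 1, size=len(df))

fit = smf.ols('revenue ~ C(quarter)', data=df).fit()
means = df.groupby('quarter')['revenue'].mean()

b0 = fit.params['Intercept']        # equals the Q1 average
b3 = fit.params['C(quarter)[T.4]']  # equals Q4 average minus Q1 average
```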
In the previous graphs we noticed that there was also an important time trend (increasing from 1995 to 2000, then constant). Ignoring that would probably lead to an incorrect specification of the model. We can inspect the residuals to see if there is any time trend left:
reg1.resid.plot()
plt.show()
Look! The overall time trend is in the residuals, which is not good: residuals are supposed to be random. Let's estimate the model with the time trend and look at the residuals again:
y = β0 + β1 (Quarter2) + β2 (Quarter3) + β3 (Quarter4) + β4time + error
reg2 = smf.ols('revenue ~ time + C(quarter)', data = gap).fit()
reg2.resid.plot()
plt.show()
Better, but still not random. That's because adding time adds a single, constant time trend, and we need two. In order to capture that, we'll create the following dummy:
gap['growth'] = 0
gap.loc[gap.time <= 40, 'growth'] = 1
And then create an interaction term by multiplying the dummy by the time variable, like this:
y = β0 + β1 (Quarter2) + β2 (Quarter3) + β3 (Quarter4) + β4time + β5 (time × growth) + error
Here β4 will capture the overall time effect, and β5 the difference during the growing period:
reg3 = smf.ols('revenue ~ time*growth + C(quarter)', data = gap).fit()
reg3.summary()
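The piecewise-trend idea can be checked on simulated data (the two-regime series below is hypothetical): the interaction coefficient recovers the extra slope present only during the growth period.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Two-regime series: +5 per period up to t = 40, flat afterwards
rng = np.random.default_rng(5)
df = pd.DataFrame({'time': np.arange(1, 81)})
df['growth'] = (df['time'] <= 40).astype(int)

level = np.where(df['growth'] == 1, 5.0 * df['time'], 5.0 * 40)
df['revenue'] = level + rng.normal(0, 1, size=len(df))

# time * growth expands to time + growth + time:growth
fit = smf.ols('revenue ~ time * growth', data=df).fit()
extra_slope = fit.params['time:growth']  # extra trend during the growth period
```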
Let's look at the residuals now:
reg3.resid.plot()
plt.show()
The residuals look pretty random, as they should. We still have a big negative spike (the effect of the COVID pandemic). You can see this more clearly by comparing the fitted values with the observed values:
reg3.fittedvalues.plot()
gap.revenue.plot()
plt.show()
We can control for an outlier by adding a dummy for it:
# IDENTIFYING THE OUTLIER (the period with the largest negative residual)
period = gap[reg3.resid == reg3.resid.min()].time.values[0]
gap['outlierDummy'] = 0
gap.loc[gap.time == period, 'outlierDummy'] = 1

# ESTIMATION
reg4 = smf.ols('revenue ~ time*growth + C(quarter) + outlierDummy', data = gap).fit()
reg4.summary()
Inspecting the residuals:
reg4.resid.plot()
plt.show()
Much better. Now let's see the fitted values vs observed values:
reg4.fittedvalues.plot()
gap.revenue.plot()
plt.show()
Non-Linear Regression Models
One of the issues that we may encounter when dealing with business and economics data is that the relation between the variables is not linear. We have seen several cases of non-linear relations in the past (e.g. income vs math scores).
For now, take a look at the data in the nonlinear.csv file. The data contains the quarterly revenues of a company. I'm calling this database db . A quick inspection of the data reveals the clear non-linearity of revenues with respect to time.
db.y.plot()
plt.show()
We can identify the overall trend in the data by calculating the average of y for each year:
db[["year", "y"]].groupby("year").mean().plot(legend = False)
plt.show()
And the seasonality can be identified by computing the average for each quarter:
db[["quarter", "y"]].groupby("quarter").mean().plot(kind = "bar", legend = False)
plt.show()
The yearly average graph shows a clear non-linear pattern. First the average revenues increase, then plateau around 2005 and remain at about the same level for the rest of the sample.
See how poorly a linear model fits the data:
sb.regplot(x = db.t, y = db.y)
plt.show()
Let's use a quadratic model instead:
sb.regplot(x = db.t, y = db.y, order = 2)
plt.show()
Much better. Maybe a logarithmic transformation of the x variable?
sb.regplot(x = db.t, y = db.y, logx = True)
plt.show()
The non-linear fits look much better than the linear fit, especially the quadratic model.
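One simple way to compare these fits is to estimate each model with smf.ols and compare their R-squared values. A sketch on simulated data with diminishing returns (the log-shaped series below is made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated series that rises quickly and then flattens out
rng = np.random.default_rng(6)
df = pd.DataFrame({'t': np.arange(1, 101, dtype=float)})
df['y'] = 50 * np.log(df['t']) + rng.normal(0, 3, size=100)

# Three competing specifications for the time trend
linear = smf.ols('y ~ t', data=df).fit()
quadratic = smf.ols('y ~ t + I(t**2)', data=df).fit()
logfit = smf.ols('y ~ np.log(t)', data=df).fit()

# Higher R-squared means the specification explains more of the variation
r2 = {'linear': linear.rsquared,
      'quadratic': quadratic.rsquared,
      'log': logfit.rsquared}
```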
Increasing and Diminishing Returns
2023-05-03