BA222 - Lecture Notes 14: Introduction to Time-Series and Non-Linear Regression Models
Table of Contents
• Time Series
• Identifying Time-Trends:
• Visualization of Time-Series
• Estimating Time Trends using regression models
• Making a Forecast using Time
• Seasonality
• Identifying Seasonality and Time Trends using Regression Models
• Non-Linear Regression Models
• Increasing and Diminishing Returns
• Polynomials
• Logarithmic Transformations
• Comparing Non-Linear Models
Time Series
A time-series is data for which the unit of observation is time. Each observation in a time-series represents the values of the variables in the data measured at different moments in time. Much business and economics data comes in the form of time series: for example, interest rates, GDP, the inflation rate, the unemployment rate, stock prices, etc. Time series come in different frequencies. The most common time frequencies are:
• Annual (Real GDP, Population Growth)
• Quarterly (Quarterly GDP)
• Monthly (Unemployment, Inflation)
• Weekly (Industrial Production)
• Daily (Stock Prices)
Using time-series we can identify patterns over time and use them to make
predictions about the future based on past observations.
In a time-series we are concerned with two patterns:
• Time-Trend: How the average value of y changes as time passes. There may be a singular time-trend or several, describing periods of growth or decrease over time.
• Seasonality: How the average value of y changes over different seasons. A season here is defined as a repeating stage in a cycle of time units. For instance, data for retail sales may have an unusually high average value every year during the holiday season. Data for restaurant reservations will be higher on Fridays and Saturdays than on other days of the week.
The key difference between seasonality and a time-trend is that seasonality is a repeating pattern, while a time-trend is a singular effect.
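To make the distinction concrete, here is a small simulated sketch (the numbers are made up for illustration, not real data): the time trend moves the level of the series permanently, while the seasonal effect repeats every four quarters.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly series combining a trend and a seasonal pattern
rng = np.random.default_rng(0)
t = np.arange(1, 41)                          # 40 quarters = 10 years
quarter = ((t - 1) % 4) + 1                   # repeating 1, 2, 3, 4 pattern
trend = 2.0 * t                               # time trend: +2 units per quarter
seasonal = np.where(quarter == 4, 15.0, 0.0)  # made-up Q4 holiday spike
y = 100 + trend + seasonal + rng.normal(0, 1, size=len(t))

df = pd.DataFrame({'t': t, 'quarter': quarter, 'y': y})
# The Q4 average sits well above the other quarters (seasonality),
# while every quarter's level keeps rising over the years (trend).
q_means = df.groupby('quarter')['y'].mean()
```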
Identifying Time-Trends:
Let's start by loading some time-series data. The data in GDPC1.csv includes two columns:
• y is the value of the quarterly real GDP for the US.
• date represents the dates in which the quarterly real GDP was reported. The data starts in the first quarter of 1960 and ends in the last quarter of 2022.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import seaborn as sb

path = '/Users/ccasso/Dropbox/BU/Teaching/2023/Spring/BA222/Data/realGDPUS/gdp.csv'
gdp = pd.read_csv(path)
gdp
Let's start by creating some useful variables to help us with the analysis:
# EXTRACTING YEARS
dateDecomposed = gdp.date.str.split('/')
gdp['month'] = dateDecomposed.str[0].astype('int')
gdp['day'] = dateDecomposed.str[1].astype('int')
gdp['year'] = dateDecomposed.str[2].astype('int')

# ADDING 1900 OR 2000 TO YEARS
gdp.loc[gdp['year'] >= 60, 'year'] += 1900
gdp.loc[gdp['year'] < 60, 'year'] += 2000

# EXTRACTING QUARTERS
gdp['quarter'] = 1
gdp.loc[gdp['month'] == 4, 'quarter'] = 2
gdp.loc[gdp['month'] == 7, 'quarter'] = 3
gdp.loc[gdp['month'] == 10, 'quarter'] = 4

# GENERATING TIME PERIODS
gdp['t'] = pd.Series(range(1, len(gdp) + 1, 1))
Code Explanation:
• .str.split('/') : The date variable is a string. We can apply string-related functions to a variable by using .str in pandas. One of those functions is .split() , which allows you to divide a string and create a list based on a character. Because dates are stored in the format 'm/d/y' we use the character / to obtain a list in the form ['m','d','y'] .
• .str[i] : On a list of strings like the one we created using .str.split('/') you can extract individual elements by using the square bracket operator [] and the index of the element you want. For instance, on ['m','d','y'] , using .str[0] will extract m , .str[1] will extract d , etc.
• .astype('int') : This function allows us to typecast the data from str to int . The dates are originally saved as strings, but we want days, months and years to be integers.
• += : The += operator allows you to add a value to the current value of a variable. For instance, if x = 10 , then doing x += 5 is the same as x = x + 5 .
• Making Quarters: The variables month and day are not very useful as the data is organized in quarters. But we can infer the quarter number (1, 2, 3, 4) from the month of the report. Also, we are making a variable called t which represents the different periods with an integer.
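As an aside, here is a sketch of an alternative approach: pandas can parse these date strings directly with pd.to_datetime, which replaces the manual split and typecasting (the toy dates below are illustrative). One caveat: with a two-digit year format, years 00-68 are mapped to 20xx, so years like 60 still need a century correction of our own.

```python
import pandas as pd

# Toy dates in the same 'm/d/y' format as the gdp data
gdp = pd.DataFrame({'date': ['1/1/60', '4/1/60', '7/1/60', '10/1/99', '1/1/00']})
parsed = pd.to_datetime(gdp['date'], format='%m/%d/%y')

# %y maps 00-68 to 20xx, so 1960-1968 come out a century too late;
# shift any future-dated year back by 100 years.
century_fix = parsed.dt.year > 2022
parsed = parsed.mask(century_fix, parsed - pd.DateOffset(years=100))

gdp['year'] = parsed.dt.year
gdp['quarter'] = parsed.dt.quarter   # quarter inferred from the month
gdp['t'] = range(1, len(gdp) + 1)
```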
Visualization of Time-Series
Visualizing a variable over time is very simple: you just need to use the .plot() method. Let's take a look at the real GDP ( y ):
gdp.y.plot()
plt.show()
Alternatively, we can use .lineplot() from the seaborn package.
sb.lineplot(y = gdp.y, x = gdp.t) # Using t on the x-axis
plt.show()
We can see an overall increasing (positive) trend in the data.
Let's go over another example: the inflationUS.csv file includes information about the monthly consumer price index (CPI) for the US and the inflation rate (the 12-month % change of the CPI). Let's start by making a graph of the CPI. I'm calling this database pi :
pi.cpi.plot()
plt.show()
An increasing trend that seems to accelerate at the end. Let's see now the inflation rate:
pi.inflation.plot()
plt.show()
This one, for the most part, has no clear overall time trend: the inflation rate seems to fluctuate around 2 percent, except for the last part of the data, where it rises to a peak of 8% and has recently been descending. It's probably easier to see this by adding a reference line at 2 percent.
pi.inflation.plot()
plt.axhline(y = 2, color = 'black', linestyle = 'dashed')
plt.show()
The reason why the inflation rate fluctuates around 2 percent has to do with business cycles and the monetary policy of the Federal Reserve (FED). The FED changes interest rates to keep inflation under what it considers the natural inflation rate consistent with the growth in real GDP, in order to avoid periods in which prices increase faster than production. The FED is not always successful in keeping inflation under control, as we can see at the end of the time-series.
We can use yearly averages to identify time-trends with more clarity. It is as simple as using the year variable in the time series in the .lineplot function:
sb.lineplot(y = pi.inflation, x = pi.year)
plt.axhline(y = 2, color = 'black', linestyle = 'dashed')
plt.show()
We can do the same for the GDP data:
sb.lineplot(y = gdp.y, x = gdp.year)
plt.show()
In summary:
• We can visually identify time trends of time-series using the .plot() and .lineplot() functions.
• Time trends may be singular, like in the case of the real GDP as the data is overall increasing.
• A time-series may also show different trends in different periods. The inflation rate for the most part fluctuates around 2 percent, but at the end it shows an increasing trend.
Estimating Time Trends using Regression Models
A regression model with a time trend can be represented as:
y = β0 + β1time + error
Where time represents the time periods, measured in some unit (here, quarters). β1 is the slope: it shows by how much the average value of y changes when time increases by one unit. Finally, the error term represents all the factors, besides time, that explain the variation of the y variable.
We can estimate the parameters of the model with smf.ols() :
smf.ols('y ~ t', data = gdp).fit().summary()
Based on these results we can say that: on average, real GDP increased by $68.62 billion each quarter (yes, US real GDP is measured in billions of dollars).
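To see that the slope really does measure the average change per period, here is a quick sketch on simulated data (the true trend of 5 units per period is made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated series: y grows by 5 units per period on average
rng = np.random.default_rng(1)
df = pd.DataFrame({'t': np.arange(1, 101)})
df['y'] = 50 + 5 * df['t'] + rng.normal(0, 2, size=100)

fit = smf.ols('y ~ t', data=df).fit()
slope = fit.params['t']  # recovers the true per-period change (about 5)
```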
The slope of the graph describes a linear estimation of the time trend. We can visualize the fit using .regplot from the seaborn package.
sb.regplot(x = gdp.t, y = gdp.y, line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.lineplot(x = gdp.t, y = gdp.y, color = 'blue')
plt.show()
This is a pretty good fit. It captures the overall time trend.
Let's repeat this exercise using the data for inflation:
# CREATING T for the PI data
pi['t'] = pd.Series(range(1, len(pi) + 1, 1))
smf.ols('inflation ~ t', data = pi).fit().summary()
Now let's see the plot:
sb.regplot(x = pi.t, y = pi.inflation, line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.lineplot(x = pi.t, y = pi.inflation, color = 'blue')
plt.show()
The regression is biased by the last period in the data, when inflation ran above 2 percent. Let's split the sample into three periods and fit a separate trend to each:
start = 111
end = 126
timeFilterPre = pi.t < start
timeFilterMed = (pi.t >= start) & (pi.t < end)
timeFilterPost = pi.t >= end

sb.regplot(x = pi.t[timeFilterPre], y = pi.inflation[timeFilterPre], line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.regplot(x = pi.t[timeFilterMed], y = pi.inflation[timeFilterMed], line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.regplot(x = pi.t[timeFilterPost], y = pi.inflation[timeFilterPost], line_kws = {'color': 'red', 'linestyle': 'dashed'}, scatter = False)
sb.lineplot(x = pi.t, y = pi.inflation, color = 'blue')
plt.show()
In this last graph we can clearly see the period where inflation is under control, with no increasing/decreasing trend over time. Then, in period 111 (March of 2021), we start seeing inflation consistently above 2 percent. This inflation cycle reaches a peak of about 9.1 percent in period 126 (June of 2022) and then starts descending (what was the monetary policy implemented by the FED in Summer 2022 to fight inflation?).
Making a Forecast using Time
Forecasting is very similar to calculating the fitted values of a model. We just need to plug in the time period that we want and calculate the estimated value of y indicated by the regression. Let's make a forecast for the quarterly GDP for the first quarter of 2023:
T = max(gdp.t) + 1 # one more quarter after the data ends
b = smf.ols('y ~ t', data = gdp).fit().params
gdp_2023Q1 = b['Intercept'] + b['t'] * T
gdp_2023Q1
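As a sanity check on the mechanics, the manual calculation above should agree exactly with .predict(). A sketch on simulated data (hypothetical numbers):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data with a known linear trend
rng = np.random.default_rng(2)
df = pd.DataFrame({'t': np.arange(1, 51)})
df['y'] = 10 + 3 * df['t'] + rng.normal(0, 1, size=50)

fit = smf.ols('y ~ t', data=df).fit()
T = df['t'].max() + 1  # one period past the end of the sample

# Forecast by hand: intercept + slope * T ...
manual = fit.params['Intercept'] + fit.params['t'] * T
# ... and the same forecast via .predict()
auto = fit.predict(pd.DataFrame({'t': [T]})).iloc[0]
```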
To decide if this was a good or a bad prediction we'll have to wait for the actual real GDP numbers for the first quarter of 2023.
Let's see how good the model is overall at producing accurate predictions using the data that we have. For this we are going to pretend that we didn't have the data for the last ten years (2013-2022). Using only a model estimated with data before 2013 we are going to make predictions for the quarters of 2013-2022 and compare them to the real values to see how the model performs:
# Estimating using only data before 2013
reg1 = smf.ols('y ~ t', data = gdp[gdp.year < 2013]).fit()

# Use the function .predict()
# to make predictions for the entire data
# using the model estimated with data before 2013
forecast = reg1.predict(gdp)

# Adding the forecast to the original data (optional)
gdp['forecast'] = forecast

# Making graphs
gdp.forecast.plot()
gdp.y.plot()
plt.show()
The predictions look pretty good (this is essentially the same graph as what we produced with .regplot but using only data prior to 2013). In order to objectively determine if the predictions are accurate we'll use the 95% confidence interval for the predictions. If the observed values are generally within the CI then the predictions of the model are good.
To get the 95% confidence interval bounds we need to use the function .get_prediction().summary_frame(alpha = 0.05) . You can change the alpha level to obtain different confidence levels.
The result of .get_prediction().summary_frame(alpha = 0.05) is a data frame that includes 6 columns:
• The first one is labeled mean and that's the same as the fitted values of the regression.
• The second one is mean_se and that's the standard error of the prediction. Recall that we use the standard errors to construct the confidence intervals.
• The third one is mean_ci_lower ; these are the values of the lower bound of the 95% confidence interval.
• The fourth column is mean_ci_upper ; it works the same as mean_ci_lower but for the upper bound.
• The last two columns, obs_ci_lower and obs_ci_upper , are the bounds of the interval for individual observations rather than for the average prediction. We will not use these columns here.
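A small sketch on simulated data (hypothetical values) showing these columns in action:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated trend data
rng = np.random.default_rng(3)
df = pd.DataFrame({'t': np.arange(1, 41)})
df['y'] = 5 + 2 * df['t'] + rng.normal(0, 1, size=40)

fit = smf.ols('y ~ t', data=df).fit()
pred = fit.get_prediction(df).summary_frame(alpha=0.05)
cols = list(pred.columns)  # mean, mean_se, mean_ci_lower/upper, obs_ci_lower/upper

# The fitted values ('mean') always sit inside their own CI bounds
inside = ((pred['mean'] >= pred['mean_ci_lower']) &
          (pred['mean'] <= pred['mean_ci_upper'])).all()
```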
Therefore, if we want to judge how accurate the forecasts of a regression model are, we can make a graph of the predicted values, the actual values and the confidence interval. If, most of the time, the observed values are within the confidence interval, and, in general terms, the model captures the time trends in the data, we can use the model for forecasting.
# Making predictions with 95% confidence interval
predictions = reg1.get_prediction(gdp).summary_frame(alpha = 0.05)

# Adding confidence interval values to the data frame
gdp['lB'] = predictions.mean_ci_lower
gdp['uB'] = predictions.mean_ci_upper

# Making graphs for year >= 2013
# Using the linestyle = "dashed" option to make
# the graph a bit easier to read
gdp.forecast[(gdp.year >= 2013) & (gdp.year < 2023)].plot(color = 'red', linestyle = "dashed")
gdp.y[(gdp.year >= 2013) & (gdp.year < 2023)].plot(color = 'blue')
gdp.lB[(gdp.year >= 2013) & (gdp.year < 2023)].plot(color = 'black', linestyle = "dashed")
gdp.uB[(gdp.year >= 2013) & (gdp.year < 2023)].plot(color = 'black', linestyle = "dashed")
plt.show()
Doesn't look so good now. Our predictions and confidence intervals are way off in this portion of the sample. The reason for this is that in a regression model there are two components that explain the variation of y: a part explained by the independent variables, and the error term. If elements of the error term change in a significant way over time, then we should not be surprised by these results.
Seasonality
Frequencies in time series are important when discussing seasonality. A seasonal pattern happens when the average value of y is statistically related to the presence of specific seasons (or cyclical moments in time). For example, retail sales tend to increase on average in the fourth quarter because of holiday sales. Restaurant reservations are harder to find on Fridays than on other days of the week. In the summer, sales of ice cream increase on average, and so on.
Take a look at the data in the file gap_revenues.csv for a time-series in quarterly frequency with a very clear seasonal pattern:
import pandas as pd
import matplotlib.pyplot as plt

path = '/Users/ccasso/Dropbox/BU/Teaching/2022/Spring/BA222/BA222Spring2022/Datasets/gap/gap_revenues.csv'
gap = pd.read_csv(path)
plt.plot(gap.revenue)
plt.show()
Let's see if we can detect a seasonal pattern by looking at the average by quarter using a boxplot:
sb.boxplot(y = gap.revenue, x = gap.quarter)
plt.show()
We can identify the overall time trend by using a yearly average instead:
sb.lineplot(y = gap.revenue, x = gap.year)
plt.show()
In summary, we can say that Gap's revenues increased from 1995 to 2000 and then remained roughly constant for the rest of the sample. Moreover, each year there is a cyclical pattern in sales, with a spike in the fourth quarter.
Identifying Seasonality and Time Trends using Regression Models
We can model seasonal patterns by adding dummies for each part of the cycle to a regression model. For instance:
y = β0 + β1 (Quarter2) + β2 (Quarter3) + β3 (Quarter4) + error
Would be the correct specification to identify seasonal patterns in quarterly data. Note that the first quarter is excluded (why?).
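Why exclude the first quarter? With an intercept in the model, the four quarter dummies always add up to one, duplicating the intercept column, so including all four would make the regressors perfectly collinear (the "dummy variable trap"). A small sketch:

```python
import numpy as np
import pandas as pd

# Five years of quarters, repeating 1, 2, 3, 4
quarters = pd.Series([1, 2, 3, 4] * 5)

# All four dummies plus an intercept column: the dummies sum to 1 in
# every row, so the 5-column matrix only has rank 4 (perfect collinearity)
dummies = pd.get_dummies(quarters, prefix='Q')  # Q_1 ... Q_4
rank = np.linalg.matrix_rank(
    np.column_stack([np.ones(len(quarters)), dummies]))

# Dropping the first category restores full rank; Q1 becomes the baseline
dropped = pd.get_dummies(quarters, prefix='Q', drop_first=True)
```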
Estimating a seasonal effect is as simple as estimating a regression model with dummies:
reg1 = smf.ols('revenue ~ C(quarter)', data = gap).fit()
reg1.summary()
Interpretation:
• β0: Average sales when quarter is equal to one.
• β1: Average change in sales between quarter one and two.
• β2: Average change in sales between quarter one and three.
• β3: Average change in sales between quarter one and four.
In this case β1 and β2 are statistically equal to zero, meaning that we cannot say that there is a statistically significant difference between sales in the first, second and third quarters. Because β3 is statistically different from zero, we can say that sales in the last quarter are statistically different from sales in the first quarter.
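The interpretation above can be verified on simulated data: with quarter dummies, the OLS coefficients reproduce the quarter averages exactly (the Q4 spike of 20 below is made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated quarterly revenues with a made-up Q4 spike of 20
rng = np.random.default_rng(4)
df = pd.DataFrame({'quarter': [1, 2, 3, 4] * 25})
bump = df['quarter'].map({1: 0.0, 2: 0.0, 3: 0.0, 4: 20.0})
df['revenue'] = 100 + bump + rng.normal(0, 1, size=len(df))

fit = smf.ols('revenue ~ C(quarter)', data=df).fit()
means = df.groupby('quarter')['revenue'].mean()

b0 = fit.params['Intercept']        # equals the Q1 average
b3 = fit.params['C(quarter)[T.4]']  # equals Q4 average minus Q1 average
```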
In the previous graphs we noticed that there was also an important time trend (increasing from 1995 to 2000, then constant). Ignoring that would probably lead to an incorrect specification of the model. We can inspect the residuals to see if there is any time trend left:
reg1.resid.plot()
plt.show()
Look! The overall time trend is in the residuals, which is not good: residuals are supposed to be random. Let's estimate the model with the time trend and look at the residuals again:
y = β0 + β1 (Quarter2) + β2 (Quarter3) + β3 (Quarter4) + β4time + error
reg2 = smf.ols('revenue ~ time + C(quarter)', data = gap).fit()
reg2.resid.plot()
plt.show()
Better, but still not random. That's because adding time adds a single, constant time trend, and we need two. In order to capture that, we'll create the following dummy:
gap['growth'] = 0
gap.loc[gap.time <= 40, 'growth'] = 1
And then create an interaction term by multiplying the dummy by the time variable, like this:
y = β0 + β1 (Quarter2) + β2 (Quarter3) + β3 (Quarter4) + β4time + β5 (time × growth) + error
Here β4 will capture the overall time effect, and β5 the difference during the growing period:
reg3 = smf.ols('revenue ~ time*growth + C(quarter)', data = gap).fit()
reg3.summary()
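The piecewise-trend idea can be checked on simulated data (the two-regime series below is hypothetical): the interaction coefficient recovers the extra slope present only during the growth period.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Two-regime series: +5 per period up to t = 40, flat afterwards
rng = np.random.default_rng(5)
df = pd.DataFrame({'time': np.arange(1, 81)})
df['growth'] = (df['time'] <= 40).astype(int)

level = np.where(df['growth'] == 1, 5.0 * df['time'], 5.0 * 40)
df['revenue'] = level + rng.normal(0, 1, size=len(df))

# time * growth expands to time + growth + time:growth
fit = smf.ols('revenue ~ time * growth', data=df).fit()
extra_slope = fit.params['time:growth']  # extra trend during the growth period
```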
Let's look at the residuals now:
reg3.resid.plot()
plt.show()
The residuals look pretty random, as they should. We still have a big negative spike (the effect of the COVID pandemic). You can see this more clearly by comparing the fitted values with the observed values:
reg3.fittedvalues.plot()
gap.revenue.plot()
plt.show()
We can control for an outlier by adding a dummy for it:
# IDENTIFYING THE OUTLIER (the period with the largest negative residual)
period = gap[reg3.resid == reg3.resid.min()].time.values[0]
gap['outlierDummy'] = 0
gap.loc[gap.time == period, 'outlierDummy'] = 1

# ESTIMATION
reg4 = smf.ols('revenue ~ time*growth + C(quarter) + outlierDummy', data = gap).fit()
reg4.summary()
Inspecting the residuals:
reg4.resid.plot()
plt.show()
Much better. Now let's see the fitted values vs observed values:
reg4.fittedvalues.plot()
gap.revenue.plot()
plt.show()
Non-Linear Regression Models
One of the issues that we may encounter when dealing with business and economics data is that the relation between the variables is not linear. We have seen several cases of non-linear relations in the past (e.g. income vs math scores).
For now, take a look at the data in the nonlinear.csv file. The data contains the quarterly revenues of a company. I'm calling this database db . A quick inspection of the data reveals the clear non-linearity of revenues with respect to time.
db.y.plot()
plt.show()
We can identify the overall trend in the data by calculating the average of y for each year:
db[["year", "y"]].groupby("year").mean().plot(legend = False)
plt.show()
And the seasonality can be identified by computing the average for each quarter:
db[["quarter", "y"]].groupby("quarter").mean().plot(kind = "bar", legend = False)
plt.show()
The yearly average graph shows a clear non-linear pattern. First the average revenues increase, then plateau around 2005 and remain at about the same level for the rest of the sample.
See how poorly a linear model fits the data:
sb.regplot(x = db.t, y = db.y)
plt.show()
Let's use a quadratic model instead:
sb.regplot(x = db.t, y = db.y, order = 2)
plt.show()
Much better. Maybe a logarithmic transformation of the x variable?
sb.regplot(x = db.t, y = db.y, logx = True)
plt.show()
The non-linear fits look much better than the linear fit, especially the quadratic model.
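One simple way to compare these fits is to estimate each model with smf.ols and compare their R-squared values. A sketch on simulated data with diminishing returns (the log-shaped series below is made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated series that rises quickly and then flattens out
rng = np.random.default_rng(6)
df = pd.DataFrame({'t': np.arange(1, 101, dtype=float)})
df['y'] = 50 * np.log(df['t']) + rng.normal(0, 3, size=100)

# Three competing specifications for the time trend
linear = smf.ols('y ~ t', data=df).fit()
quadratic = smf.ols('y ~ t + I(t**2)', data=df).fit()
logfit = smf.ols('y ~ np.log(t)', data=df).fit()

# Higher R-squared means the specification explains more of the variation
r2 = {'linear': linear.rsquared,
      'quadratic': quadratic.rsquared,
      'log': logfit.rsquared}
```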
Increasing and Diminishing Returns
2023-05-03