AQM2000 – Predictive Business Analytics Spring 2023
Logistic Regression
Exponential Growth
Recall that exponential growth happens when a quantity increases as a percentage of how much there is at a
given moment. Investments increase exponentially since you earn interest as a percentage of how much you have in your account. The interest you earn this time period is added to your investment (this is called compounding), so that you will earn even more interest in the next time period.
The Logistic Function
Logistic functions are those that exhibit exponential growth initially, but have a saturation point, L, that
cannot be exceeded. In population ecology this saturation point is called the carrying capacity.
The general logistic function has the form f(x) = L / (1 + e^(−x)). The plot below shows the graph of f(x) = 100 / (1 + e^(−x)). Note that f(0) = 100 / (1 + 1) = 50. Furthermore, as x → ∞, e^(−x) → 0, so f(x) → 100, and as x → −∞, e^(−x) → ∞, so f(x) → 0.
[Figure: graph of the logistic function f(x) = 100 / (1 + e^(−x)); the y-axis runs from 0 to 100 and the curve levels off at the saturation point L = 100.]
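These limiting values are easy to check numerically. A quick sketch in R (the function name f is ours, chosen to match the notation above):

```r
# Logistic function with saturation point L = 100
f <- function(x) 100 / (1 + exp(-x))

f(0)    # 50: halfway to the saturation point
f(10)   # very close to 100 as x grows large
f(-10)  # very close to 0 as x becomes very negative
```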
Logistic equations are used to model how a quantity spreads through a population over time, such as viruses
and rumors. Consider a virus that infects a population in such a way that in every time period 10% of those not infected become newly infected. Since those who are already infected cannot be newly infected, there is a limit to how many people can be infected by the virus, but eventually everybody becomes infected.
The Motivation for Logistic Regression Models
Logistic regression models are used when we want to assign a probability to a response variable and then use
that probability to classify the observation.
For example a logistic regression model can be used to create a spam filter. Each email is assigned a
probability representing its “spam likelihood” based on characteristics of the message, and then is classified as spam if its probability exceeds a certain “spam threshold” .
A Simple Spam Filter
Suppose you want to develop a spam filter that uses the number of spelling errors to flag potential spam
emails. Each message will be assigned a probability of being spam based on the number of spelling errors.
You analyze a collection of emails and count the number of spelling errors and whether or not the email is
spam. Suppose this is what you get when you plot the data on a graph.
It’s hard to see how to create a model using these data. To get more insight, the number of spelling errors is
grouped into bins of width 5. The percentage of spam messages in each bin is the probability of being spam.
Although there is more structure in this graph, linear regression will not do a good job of modeling these data.
What we want to do is to create a (non-linear) regression model that captures the logistic shape of these data.
This requires performing a series of transformations to the data to make it linear, then using linear regression.
Probabilities, Odds, and Logistic Functions
We start by discussing a measure that is different from, but related to, probability, called odds.
The odds of an event occurring is the probability the event occurs relative to the probability that it doesn’t.
odds = p / (1 − p)
This is called “odds for” since it represents the odds that the event happens.
Odds are sometimes used in sports betting (especially in horse racing), but there they are represented as the “odds against”, which is the reciprocal of the “odds for” calculation.
In this course we will only be using “odds for” .
Example
If an event has a 50% chance of happening (p = 0.5), then odds = 0.5 / (1 − 0.5) = 1. We say that the odds that the
event happens are 1 to 1.
If an event has an 80% chance of happening (p = 0.8), then odds = 0.8 / (1 − 0.8) = 0.8 / 0.2 = 4. In this case the odds that
the event happens are 4 to 1.
If an event has probability p = 0.2, then odds = 0.2 / (1 − 0.2) = 0.2 / 0.8 = 0.25. In this case the odds of it happening are
0.25 to 1, which can also be phrased as 1 to 4. This means that for every one time it occurs, there are 4 times that it does not occur.
. For small probabilities the odds are close to the probability itself. For example, if p = 0.01, odds = 0.01 / (1 − 0.01) = 0.01 / 0.99 ≈ 0.0101.
. For large probabilities the odds get large. For example, if p = 0.98, odds = 0.98 / (1 − 0.98) = 0.98 / 0.02 = 49, and if p = 0.99, odds = 0.99 / (1 − 0.99) = 0.99 / 0.01 = 99.
Probability and odds are directly related: the higher the probability, the higher the odds, and the lower the
probability, the lower the odds.
As the probability approaches 0, odds approach 0, and as the probability approaches 1, odds approach infinity.
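The probability-to-odds conversion is one line of code. A quick R sketch of the examples above (the function name odds is ours):

```r
# Odds "for": probability the event occurs relative to the probability it doesn't
odds <- function(p) p / (1 - p)

odds(0.5)   # 1    (1 to 1)
odds(0.8)   # 4    (4 to 1)
odds(0.2)   # 0.25 (1 to 4)
odds(0.01)  # ~0.0101, close to p for small probabilities
odds(0.99)  # 99, large for probabilities near 1
```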
Converting Odds to Probabilities
p = odds / (odds + 1)
Example
Suppose the odds of an event happening are 3 to 1. The probability that the event happens is p = 3 / (3 + 1) = 0.75 = 75%.
If the odds of an event happening are 0.35 to 1, then the probability the event happens is p = 0.35 / (0.35 + 1) = 0.35 / 1.35 ≈ 0.26, or about 26%.
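The reverse conversion is equally short in R (the function name odds_to_prob is ours):

```r
# Convert odds (expressed as "odds to 1") back to a probability
odds_to_prob <- function(odds) odds / (odds + 1)

odds_to_prob(3)     # 0.75
odds_to_prob(0.35)  # ~0.259
```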
Exponentials and Logarithms
Plotting the data in terms of odds, we get the following graph that can be modeled by an exponential equation.
Since exponentials and logarithms are the inverse of each other (logs “undo” exponentials), we can apply
one more transformation by taking the logarithm of the odds to obtain a graph very close to a linear function.
The Logit Function
This process of taking the logarithm of the odds (log-odds) is called the logit function.
Here is how we can use the logit function to model the process of assigning a probability to an email with a
given number of spelling errors.
The Logit Function and the Spam Filter
Suppose s is the number of spelling errors in a message, and p(s) is the probability that a message
with s spelling errors is spam. Then the odds that a message with s spelling errors is spam is odds(s) = p(s) / (1 − p(s)). Take the logarithm of the odds to obtain the logit function: logit(s) = log( p(s) / (1 − p(s)) ).
Since the logit function is linear, it has an equation of the form Y = β0 + β1·s, where s is the number of spelling
errors in the message. This means that the logit function can be written Y = log( p(s) / (1 − p(s)) ) = β0 + β1·s.
The slope is β1 and we can see how the log-odds and odds change as the number of spelling errors increases.
. If β1 > 0, then as s increases, Y increases, the log-odds increase, and so the odds must also increase.
. If β1 < 0, then as s increases, Y decreases, the log-odds decrease, and so the odds must also decrease.
. If β1 = 0, then as s increases, Y is constant, the log-odds don’t change, and so the odds also don’t change.
Exponentials from Logarithms
Since exponentials and logarithms are inverses, the equation log( p(s) / (1 − p(s)) ) = log(odds(s)) = β0 + β1·s can be
rewritten as odds(s) = e^(β0 + β1·s).
So how do the odds change as s increases by 1? In the context of the spam filter, this asks: how do the odds
of a message being spam change if the message has one additional spelling error?
In terms of equations, this means finding the relationship between odds(s) and odds(s + 1). For a given number of spelling errors s, odds(s) = e^(β0 + β1·s) and odds(s + 1) = e^(β0 + β1·(s + 1)) = e^(β0 + β1·s + β1).
. Take the ratio of these values to obtain odds(s + 1) / odds(s) = e^(β0 + β1·s + β1) / e^(β0 + β1·s) = e^(β1).
. This means that odds(s + 1) = e^(β1) · odds(s).
The way to interpret this is to say that for each additional spelling error, the odds that the message is spam
change by a multiplicative factor of e^(β1).
. If β1 is positive, then e^(β1) is greater than 1, and so the new odds would be more than the old odds.
. If β1 is negative, then e^(β1) is less than 1, and so the new odds would be less than the old odds.
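That the ratio of successive odds is constant can be checked numerically for any choice of coefficients. A sketch in R (the coefficient values below are just an example):

```r
# Odds implied by a logit model: log(odds(s)) = b0 + b1 * s
b0 <- -7.23
b1 <- 0.17
odds <- function(s) exp(b0 + b1 * s)

# The ratio of successive odds does not depend on s and equals e^b1
odds(11) / odds(10)   # exp(0.17) ~ 1.185
odds(31) / odds(30)   # same multiplicative factor
```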
Interpreting the Logistic Regression Equation
Suppose our logistic regression model to predict the odds that a message with s spelling errors is spam is
given by the equation log(odds(s)) = −7.23 + 0.17·s.
a) What are the odds that a message with 30 spelling errors is spam? Give your answer to 4 decimal places. log(odds(30)) = −7.23 + 0.17(30) = −2.13 ⇒ odds(30) = e^(−2.13) ≈ 0.1188
b) What is the probability that a message with 30 spelling errors is spam? p = 0.1188 / (1 + 0.1188) ≈ 0.1062 ≈ 10.6%
c) How do the odds of a message being spam change for each additional spelling error?
Recall that odds(s + 1) = e^(β1) · odds(s), so the odds the message is spam increase by a multiplicative factor of e^(0.17) ≈ 1.185. New odds = 1.185 × Old odds, meaning that the odds the message is spam increase by 18.5%.
Spam messages tend to be shorter in length than messages that are not spam. Suppose the logistic regression
equation log(odds(w)) = 4.09 - 0.043w gives the odds that a message with w words is spam.
a) What are the odds that a message with 120 words is spam? Give your answer to 4 decimal places. log(odds(120)) = 4.09 − 0.043(120) = −1.07 ⇒ odds(120) = e^(−1.07) ≈ 0.3430
b) What is the probability that a message with 120 words is spam? p = 0.3430 / (1 + 0.3430) ≈ 0.2554 ≈ 25.5%
c) How do the odds of a message being spam change for each additional word?
The odds that a message is spam change by a multiplicative factor of e^(−0.043) ≈ 0.958, so the new odds are 95.8% of the old odds, meaning that the odds that the message is spam decrease by 4.2%.
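Both worked examples above can be reproduced in a few lines of R:

```r
# Example 1: spelling errors, log(odds(s)) = -7.23 + 0.17*s
odds30 <- exp(-7.23 + 0.17 * 30)    # ~ 0.1188
p30    <- odds30 / (1 + odds30)     # ~ 0.106

# Example 2: word count, log(odds(w)) = 4.09 - 0.043*w
odds120 <- exp(4.09 - 0.043 * 120)  # ~ 0.3430
p120    <- odds120 / (1 + odds120)  # ~ 0.255
```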
Creating a Spam Filter
A database of 3,921 emails was analyzed and 21 different variables were tabulated.
. The variables include whether or not the email is spam, the number of characters (in thousands) in the email, whether the email had "Re:" in the subject line, whether there was an attachment, the presence of certain keywords such as “inherit”, “password”, or “viagra”, the time the email was sent, etc.
spam          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
to_multiple   0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, …
from          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
cc            0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, …
sent_email    0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, …
time          2012-01-01 06:16:41, 2012-01-01 07:03:59, …
image         0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
attach        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, …
dollar        0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
winner        no, no, no, no, no, no, no, no, no, no, no…
inherit       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
viagra        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
password      0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
num_char      11.370, 10.504, 7.773, 13.256, 1.231, 1.09…
line_breaks   202, 202, 192, 255, 29, 25, 193, 237, 69, …
format        1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, …
re_subj       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, …
exclaim_subj  0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
urgent_subj   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
exclaim_mess  0, 1, 6, 48, 1, 1, 1, 18, 1, 0, 2, 1, 0, 1…
number        big, small, small, small, none, none, big,…
Logistic Regression in R
Let’s start with a simple model using the number of characters as the predictor and whether the email is spam
as the response.
. Load the data.
Emaildf = read.csv('email.csv')
. Logistic regression models require the response variable to be cast as a logical variable.
Emaildf$spam = as.logical(Emaildf$spam)
Standard linear regression models use lm(), but in this case we are generating a linear model with a logical
response variable, so we use the glm() command (“generalized linear model”) and specify that the response values are only 0 or 1 by including the parameter family = 'binomial'.
LogReg = glm(spam ~ num_char, data = Emaildf, family = 'binomial')
summary(LogReg)
LogReg$coefficients
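Once a model like this is fit, a glm object can turn log-odds into probabilities via predict(..., type = 'response'). A self-contained sketch on simulated data (the data frame df and its coefficients below are made up for illustration, not taken from email.csv):

```r
# Simulate a small data set with the same shape as the email example
set.seed(1)
df <- data.frame(num_char = rnorm(200, mean = 10, sd = 5))
df$spam <- runif(200) < 1 / (1 + exp(-(-2 + 0.1 * df$num_char)))

m <- glm(spam ~ num_char, data = df, family = 'binomial')

# Probability of spam for a message with 10 (thousand) characters;
# type = 'response' applies the odds / (odds + 1) conversion for us
p_hat <- predict(m, newdata = data.frame(num_char = 10), type = 'response')

# Same value computed by hand from the fitted coefficients
b <- coef(m)
log_odds <- b[1] + b[2] * 10
p_manual <- exp(log_odds) / (1 + exp(log_odds))
```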
2023-03-29