
MTH113: Revision Outline for Final Exam

1 Exploratory Data Analysis

Categorical and Numerical variables

• Categorical Variable (also called a nominal or qualitative variable): a variable that tells us which group or category an individual data point belongs to.

• Numerical Variable: contains measured numerical values with measurement units. It can be further differentiated into discrete numerical and continuous numerical variables.

Notes: A categorical variable can also take number values, such as 1, 2, 3, . . . or 2010, 2011, . . . for years. However, for a categorical variable the number is simply a label for the corresponding category and cannot be used in arithmetic when computing summaries of the variable (contrast this with Age and Year treated as genuine numerical values).

Graphical display of the data

• Frequency and relative frequency tables are usually used for categorical variables.

We can also use a bar chart to display a frequency (relative frequency) table.

• A histogram is used for numerical, especially continuous numerical, variables; the y-axis shows the frequency or relative frequency and the x-axis shows the grouped (binned) values of the variable.

*Pay attention to the difference in the x-axis between bar charts for categorical variables and histograms for numerical variables.

• A boxplot is a type of plot for numerical variables from which we can easily identify the Interquartile Range (IQR), the 5-number summary (Max, Q3, Median, Q1, Min), and any outliers based on the IQR.

Mild outliers are those data values located outside the interval $(Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR})$; they are denoted by a dot.

Extreme outliers are those data values located outside the interval $(Q_1 - 3\,\mathrm{IQR},\ Q_3 + 3\,\mathrm{IQR})$; they are denoted by a circle.

• Be aware of the different pros and cons of histograms and boxplots, and be able to connect the histogram to a boxplot for a given numerical variable. A small sketch of the IQR-based outlier rule follows this list.
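Below is a minimal Python sketch (not part of the original outline) of the IQR-based outlier rule described above. The data values are invented, and numpy's default quartile convention is assumed, which may differ slightly from the convention used in class.

```python
# Hedged sketch: flag mild and extreme outliers with the 1.5*IQR and 3*IQR fences.
import numpy as np

data = np.array([2, 4, 5, 5, 6, 7, 7, 8, 9, 30])      # hypothetical data set

q1, q3 = np.percentile(data, [25, 75])                 # first and third quartiles
iqr = q3 - q1                                          # interquartile range

mild_lo, mild_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # mild-outlier fences
ext_lo, ext_hi = q1 - 3 * iqr, q3 + 3 * iqr            # extreme-outlier fences

extreme = data[(data < ext_lo) | (data > ext_hi)]
mild = data[((data < mild_lo) | (data > mild_hi)) &
            (data >= ext_lo) & (data <= ext_hi)]

print("IQR:", iqr)
print("mild outliers:", mild, "extreme outliers:", extreme)
```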

Some characteristics of the data

• Shape characteristics: Mode, Symmetry, Skewness, Outliers.

A histogram is skewed right if it has a long right tail. It is skewed left if it has a long left tail.

Outliers (for univariate data) are those data values that are far above or below the rest of the data values. You can identify them from a boxplot (using the IQR) or visualize them in a histogram.

• Center characteristics: Mean, Median

The median is the center of the data values: the value of the middle data point when the number of data points is odd, or the average of the two middle data points when it is even.

The mean is the arithmetic average of the data values:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

for a data set with $n$ values.

• Spread characteristics: Range, Percentiles and Quartiles, Interquartile Range (IQR), Sample Variance, Sample Standard Deviation.

First quartile Q1.

Third quartile Q3.

IQR = Q3 − Q1.

Variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.

Standard Deviation $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$.
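The center and spread summaries above can be checked with a short script. The following is a sketch using Python's statistics module (whose variance and stdev already use the $n-1$ divisor); the data values are made up.

```python
# Hedged sketch: mean, median, range, sample variance and standard deviation.
import statistics as st

x = [12, 15, 15, 18, 20, 22, 25]        # hypothetical data values

mean = st.mean(x)                        # arithmetic average (x-bar)
median = st.median(x)                    # middle value
rng = max(x) - min(x)                    # range
s2 = st.variance(x)                      # sample variance, divisor n - 1
s = st.stdev(x)                          # sample standard deviation

print(mean, median, rng, s2, s)
```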

2 Regression and Correlation for Numerical Variables

Scatterplots and Association

Scatterplots exhibit the relationship between two numerical variables and are used to detect patterns, trends, relationships and extraordinary values.

• Be able to interpret the various associations between two numerical variables from a scatterplot and to identify any outliers. Pay attention to the different definitions of outliers for scatterplots versus outliers in a boxplot or histogram.

Correlation (coefficient)

Definition: Let $(x_i, y_i)$ for $i = 1, 2, \ldots, n$ be the paired values of two numerical variables $x$ and $y$; then the (linear) correlation between $x$ and $y$ has the following two equivalent definitions:

(a) $r = \dfrac{\sum z_{x_i} z_{y_i}}{n-1}$, where $z_{x_i}$ and $z_{y_i}$ are the z-scores of $x_i$ and $y_i$;

(b) $r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$,

where $\bar{x} = \frac{1}{n}\sum x_i$ and $\bar{y} = \frac{1}{n}\sum y_i$ are the means of $x$ and $y$ respectively.

• Assumptions and conditions for correlation:

1.  To use $r$, there must be an underlying linear relationship between the two variables. That is, you should first identify a linear association, possibly by investigating scatterplots, and then calculate $r$ to quantify that linear relationship.

2.  The variables must be numerical.

3.  Outliers can strongly affect the correlation.

• Correlation $r$ has no units, so changing the units of $x$ or $y$ does not affect $r$.

• Note that $|r| \le 1$; when $r = \pm 1$, the points $(x_i, y_i)$ lie exactly on a straight line, $(y_i - \bar{y}) = k(x_i - \bar{x})$ for some constant $k$.
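A quick way to see that the two definitions agree is to compute $r$ from z-scores and compare with a library routine. The sketch below uses numpy on invented paired data.

```python
# Hedged sketch: correlation from z-scores (definition (a)) vs. np.corrcoef.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)      # z-scores of x (sample sd, n - 1 divisor)
zy = (y - y.mean()) / y.std(ddof=1)      # z-scores of y

r_manual = np.sum(zx * zy) / (n - 1)     # definition (a)
r_numpy = np.corrcoef(x, y)[0, 1]        # library value; the two should agree

print(r_manual, r_numpy)
```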

Linear regression

Linear models for two numerical variables

• In a statistical investigation we are interested in one numerical variable, called the Response (dependent) Variable (denoted $y$), whose value we would like to predict; another variable, chosen to provide information for predicting the response variable, is called the Explanatory (independent) Variable (denoted $x$).

• Our purpose is to build a linear model

$\hat{y} = a + bx$,

such that, given any value of $x$, we can generate the predicted value $\hat{y}$ according to the linear model above.

• Obviously, we need to find the values of the slope $b$ and the intercept $a$ based on the available data for $x$ and $y$.

The line of best fit

$\hat{y} = a + bx$, where $a = \bar{y} - b\bar{x}$ and $b = r\dfrac{s_y}{s_x}$,

where $s_x$ and $s_y$ are the standard deviations of the data $x$ and $y$. This line is also called the Least squares line or Linear regression line.

•  $a$ and $b$ are determined by minimizing the Residual Sum of Squares

Residual Sum of Squares $= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - a - bx_i)^2$,

where $e_i = y_i - \hat{y}_i$ is called the residual, equal to the observed value $y_i$ minus the predicted value $\hat{y}_i$ obtained from $x_i$ and the regression line.
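The slope and intercept formulas above can be verified numerically. This is a sketch with invented data; np.polyfit is used only as a cross-check of the hand formulas.

```python
# Hedged sketch: least squares line from b = r * sy / sx and a = ybar - b * xbar.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

r = np.corrcoef(x, y)[0, 1]
b = r * y.std(ddof=1) / x.std(ddof=1)    # slope
a = y.mean() - b * x.mean()              # intercept

b_check, a_check = np.polyfit(x, y, 1)   # returns (slope, intercept) for degree 1

y_hat = a + b * x                        # predicted values
residuals = y - y_hat                    # e_i = y_i - y_hat_i

print(a, b, "check:", a_check, b_check)
print("residuals:", residuals)
```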

Residual analysis and R2

How can we check whether the linear model works well or not?

• Check the value of $R^2$ (R-squared):

$R^2 = 1 - \dfrac{\text{Residual Sum of Squares}}{\text{Total Sum of Squares}} = \dfrac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}} = \dfrac{\sum(\hat{y}_i - \bar{y})^2}{\sum(y_i - \bar{y})^2}$

Note:

1.  Total Sum of Squares $= \sum_{i=1}^{n}(y_i - \bar{y})^2$ (proportional to the variance of the data $y$) measures the deviations of the data points $y_i$ from the mean $\bar{y}$.

2.  Regression Sum of Squares $= \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ measures the deviation in $y$ that is explained by the linear model $\hat{y}_i = a + bx_i$.

3.  Residual Sum of Squares

$\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$

is the deviation in $y$ that cannot be explained by the linear model.

4. Clearly $0 \le R^2 \le 1$. $R^2$ close to 1 means that the linear model explains most of the variation in $y$, while $R^2$ close to 0 means that the linear model cannot explain the variation in $y$.

5.  $R^2$ will usually be generated by statistical software as an output of the linear regression. For manual computation, $R^2$ is equal in value to $r^2$, the square of the correlation coefficient (a small computational check is sketched below).
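A small numerical check of note 5, again on invented data: compute $R^2$ from the sums of squares and compare it with $r^2$.

```python
# Hedged sketch: R-squared from sums of squares, compared with r squared.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

b, a = np.polyfit(x, y, 1)
y_hat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)       # Total Sum of Squares
ss_resid = np.sum((y - y_hat) ** 2)          # Residual Sum of Squares
ss_reg = np.sum((y_hat - y.mean()) ** 2)     # Regression Sum of Squares

r = np.corrcoef(x, y)[0, 1]
print(1 - ss_resid / ss_total, ss_reg / ss_total, r ** 2)   # all three should agree
```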

• After fitting the regression model, you should also check the plot of residuals against the predicted values (or $x$ values) and look for:

Any bends that would violate the assumption of a linear relationship between $x$ and $y$,

Any outliers, and

Any change in the spread of the residuals from one part of the plot to another.

• For a good regression model, the residual scatterplot should have

no direction

no shape

no bends

no outliers

no identifiable patterns

3 Gathering Data

Three ideas of sampling

• Examine a part of the whole.

The goal of many statistical investigations is to learn about an entire group of individuals, called the Population; however, it is in general impossible to work directly on the whole population. Statisticians therefore collect a small group of individuals (called a sample) from the population and study the population's characteristics through the sample.

• Randomize.

Using a randomizing method to select the sample can lead to a representative sample.

• It is the sample size that matters.

Bias in sampling

Bias can exist in our sample if no randomizing method is implemented when selecting the sample data.

• Bias in sampling is the tendency for samples to differ from the corresponding population as a result of the way or method by which the sample is selected.

• Common types of bias are Selection bias and Non-response bias.

1. Selection bias is introduced when the way the sample is selected systematically excludes some part of the population of interest.

2. Non-response bias occurs when responses are not obtained from all individuals selected for inclusion in the sample.

Random sampling methods

Random sampling is a class of methods that can (theoretically) eliminate bias in the resulting samples. There are four types of random sampling methods that are frequently used (a small sketch of a few of them follows the list).

1. Simple random sampling

2. Stratified sampling

3. Cluster sampling

4. Systematic sampling
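As a rough illustration of how some of these methods differ mechanically, here is a sketch using Python's random module on an invented population of 100 labelled individuals (cluster sampling is omitted).

```python
# Hedged sketch: simple random, stratified and systematic sampling.
import random

random.seed(1)
population = list(range(1, 101))                       # individuals labelled 1..100

# Simple random sampling: every set of 10 individuals is equally likely.
srs = random.sample(population, 10)

# Stratified sampling: split into strata, then sample within each stratum.
strata = {"A": population[:60], "B": population[60:]}
stratified = [unit for group in strata.values()
              for unit in random.sample(group, 5)]

# Systematic sampling: random start, then every k-th individual.
k = 10
start = random.randrange(k)
systematic = population[start::k]

print(srs, stratified, systematic, sep="\n")
```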

Experiments and observational studies

Observational studies

• Researchers don’t assign choices

• Passively observe participants

• Good for discovering relationships related to rare outcomes

• Bad for establishing cause-and-effect relationships

Tough to handle lurking variables

There are two types of observational studies:

1. Retrospective Studies: In this type of study, the investigator collects data on something that has already occurred.

2. Prospective Studies: In this type of study, the investigator identifies subjects in advance and collects data as events unfold.

Experiments

• An experiment is a study in which one or more explanatory variables are manipulated in order to observe the effect on a response variable.

• The explanatory variables are those whose values are controlled by the experimenter. They are sometimes also called factors.

• The response variable is a variable that is not controlled by the experimenter and that is measured as part of the experiment.

• An experimental condition is any particular combination of values for the explanatory variables. They are also called treatments.

4 Introduction to Probability

Three building blocks in probability theory

(Chance) Experiment is any activity or situation in which there is uncertainty about which of two or more possible outcomes will result.

Sample Space, denoted as $S$, is the collection of all possible outcomes of the (chance) experiment.

Event is any collection of outcomes from the sample space.

• Set theory can be applied to events and sample space.

1.  A simple event $s \in S$ is one outcome from the experiment, i.e. an element of the set $S$.

2.  An event $A \subseteq S$ is simply a subset of the sample space.

• When we talk about events in a probability context, the following three statements are implied by the definition of an event.

1.  There is a specific experiment related to the events of interest.

2.  All outcomes of the experiment are known; that is, you know the sample space.

3.  We have a specific and clear statement connecting the event to one outcome, or a combination of outcomes, of the experiment.

Definitions of probability

Classical approach to probability

Definition: Assume that an experiment can generate $N$ equally likely outcomes, and an event $E$ contains $M$ of those outcomes; then the probability of event $E$ is defined as

$P(E) = \dfrac{M}{N}$

Most calculations of classical probabilities are based on Combinations and Permutations.

Limitations of the Classical Approach to Probability: The classical approach to probability works well for games of chance and other situations with a finite number of outcomes that can be regarded as equally likely. However, some situations that are clearly probabilistic in nature do not fit the classical model. We can extend the definition to cover some situations with infinitely many outcomes, the so-called "geometric approach to probability", where probabilities are assumed to be proportional to lengths or areas.

Statistical (relative frequency) approach to probability

Definition: The probability of an event $E$, denoted by $P(E)$, is defined to be the value approached by the relative frequency of occurrence of $E$ in a very long series of trials of an experiment. Thus, if the number of trials is quite large,

$P(E) \approx \dfrac{\text{number of times } E \text{ occurs}}{\text{number of trials}}$

• A drawback of this definition of probability is that the relative frequency is only an approximation of the probability. One way to resolve this is to define the probability of event $E$ as a number $p$ such that, when we repeat the experiment, the relative frequency with which $E$ occurs varies around the value $p$ and converges to $p$ as the number of trials tends to infinity. How to prove that such a $p$ exists is still a problem; this is related to one of the most famous theorems, the Law of Large Numbers, which we will learn about later.
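The relative-frequency idea can be seen by simulation. Below is a sketch that repeatedly rolls a fair die and tracks the relative frequency of the event "roll a 6", which should settle down near $1/6$; the trial counts are arbitrary.

```python
# Hedged sketch: relative frequency of an event approaching its probability.
import random

random.seed(0)
for n_trials in (100, 10_000, 1_000_000):
    count = sum(1 for _ in range(n_trials) if random.randint(1, 6) == 6)
    print(n_trials, count / n_trials)        # should approach 1/6 as trials grow
```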

**Modern probability theory

• In modern probability theory, probability $P$ is defined as a function mapping from $\mathcal{F}$ to $[0, 1]$, where $\mathcal{F}$ is the collection of subsets (including $S$ itself) of the sample space $S$, i.e. any event $A \in \mathcal{F}$.

• Three axioms for the probability measure (function):

1.  $P(S) = 1$.

2.  For any $A \in \mathcal{F}$, $0 \le P(A) \le 1$.

3.  (Addition axiom) If $A_1, A_2, \ldots, A_n, \ldots$ are mutually disjoint, then

$P\!\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$

Properties related to probability

• $P(A^c) = 1 - P(A)$, where $A^c = \{\omega \mid \omega \in S \setminus A\}$.

• $P(\emptyset) = 0$, where $\emptyset = S^c$.

• If $A \subseteq B$, then $P(A) \le P(B)$.

• $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Conditional probability

Definition: Suppose $A$ and $B$ are two events in the sample space $S$ and $P(B) > 0$; then the probability given by

$P(A \mid B) = \dfrac{P(A \cap B)}{P(B)}$

is called the conditional probability of event $A$ given $B$.

• Multiplication principle.

$P(A \cap B) = P(A \mid B)P(B)$,

which is a simple rearrangement of the conditional probability. What should the multiplication rule be for more than two events,

$P\!\left(\bigcap_{i=1}^{n} A_i\right) = \,?$

Total probability and Bayes' Theorem

• Partition of $S$. A partition of the sample space $S$ is a finite collection of events, say $C_1, C_2, \ldots, C_n$, in $S$ such that

1.  $C_i \cap C_j = \emptyset$ for any $i \ne j$ (mutually exclusive),

2.  $\bigcup_{i=1}^{n} C_i = S$ (exhaustive).

Law of total probability

Let $\{C_i\}$ be a partition of $S$ such that $P(C_i) > 0$ for all $i = 1, 2, \ldots, n$; then for any event $A \subseteq S$,

$P(A) = \sum_{i=1}^{n} P(A \mid C_i)P(C_i).$

Bayes' Theorem

Let $\{C_i\}$ be a partition of $S$ such that $P(C_i) > 0$ for all $i = 1, 2, \ldots, n$; then for any event $A \subseteq S$ with $P(A) > 0$,

$P(C_i \mid A) = \dfrac{P(A \mid C_i)P(C_i)}{\sum_{j=1}^{n} P(A \mid C_j)P(C_j)}.$
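The law of total probability and Bayes' Theorem can be packaged in a few lines. This sketch uses an invented two-event partition (a hypothetical diagnostic-test setting); the numbers are illustrative only.

```python
# Hedged sketch: total probability in the denominator, Bayes' Theorem overall.
def bayes(prior, likelihood):
    """prior[i] = P(C_i), likelihood[i] = P(A | C_i); returns P(C_i | A) for all i."""
    p_a = sum(p * l for p, l in zip(prior, likelihood))       # law of total probability
    return [p * l / p_a for p, l in zip(prior, likelihood)]   # Bayes' Theorem

prior = [0.01, 0.99]              # P(C1) = has condition, P(C2) = does not
likelihood = [0.95, 0.05]         # P(positive | C1), P(positive | C2)

print(bayes(prior, likelihood))   # P(C1 | positive) comes out near 0.16
```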

Independence of random events

Definition: Two events $A$ and $B$ are said to be independent if any of the following three equivalent statements holds:

1. $P(A \mid B) = P(A)$,

2. $P(B \mid A) = P(B)$,

3. $P(A \cap B) = P(A)P(B)$.

Note that the third statement is the multiplication principle when $A$ and $B$ are independent.

• More generally, a finite collection of events $\{A_i\}$ is called independent if

$P\!\left(\bigcap_{i \in I} A_i\right) = \prod_{i \in I} P(A_i),$

for every subset $I \subseteq \{1, 2, \ldots, n\}$.

• Independent $\ne$ Disjoint.

$A$ and $B$ being disjoint ($P(A \cap B) = 0$) means that if one event occurs, the other cannot occur. This concept usually relates to two different events that cannot occur simultaneously in one trial of an experiment.

$A$ and $B$ being independent ($P(A \mid B) = P(A)$) means that the occurrence of one event gives no information about the occurrence of the other. This concept usually relates to two events from two (independent) trials of an experiment.

When does disjoint coincide with independent?

• Mutually independent vs. Pairwise independent

A collection of events $A_i$ for $i = 1, 2, \ldots$ is (mutually) independent if each event $A_i$ is independent of any combination of the other events in the collection.

A collection of events $A_i$ for $i = 1, 2, \ldots$ is pairwise independent if any two events in the collection are independent.

Mutually independent $\Rightarrow$ Pairwise independent

Pairwise independent $\nRightarrow$ Mutually independent

5 Random variables

Definition: A random variable is a function mapping from the sample space $S$ to $\mathbb{R}$, usually denoted $X$. A random variable is simply a function of the outcomes of an experiment; it arises naturally when studying quantities related to the outcomes of an experiment.

Distribution functions

Discrete random variable

• Probability mass function (pmf)

$p_X(x) := P(X = x)$

is called the pmf of the random variable $X$, defined for all values of $x$.

$p_X(x) \ge 0$ for any $x$,

$\sum_x p_X(x) = 1$.

• Cumulative distribution function (cdf). The function $F_X : \mathbb{R} \to [0, 1]$, defined by

$F_X(x) := P(X \le x),$

is called the cdf of the discrete random variable $X$.

$F_X(x) = \sum_{y \le x} p_X(y).$

Continuous random variable

• Cumulative distribution function (cdf). It has the same definition as in the discrete case. The function $F_X : \mathbb{R} \to [0, 1]$, defined by

$F_X(x) := P(X \le x),$

is called the cdf of the continuous random variable $X$.

• Probability density function (pdf). The function $f_X : \mathbb{R} \to \mathbb{R}$ such that $f_X(x) \ge 0$ for any $x$,

$\int_{-\infty}^{\infty} f_X(x)\,dx = 1,$

and

$f_X(x) = \dfrac{d}{dx}F_X(x),$

given that $F_X(x)$ is differentiable at $x$.

Characteristics of a single random variable

Expectation

For any random variable X, the expectation is usually denoted as E(X).

• When X is a discrete random variable

$E(X) = \sum_x x\, p_X(x)$

• When X is a continuous random variable

$E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx$

• For any function $h : \mathbb{R} \to \mathbb{R}$ such that $h(X)$ is still a random variable,

$E(h(X)) = \begin{cases} \sum_x h(x)p_X(x), & X \text{ is a discrete random variable} \\ \int_{-\infty}^{\infty} h(x)f_X(x)\,dx, & X \text{ is a continuous random variable} \end{cases}$

• When $h$ is a linear function, $h(x) = a + bx$ for some real constants $a$ and $b$, then

$E(h(X)) = E(a + bX) = a + bE(X),$

which also indicates that $E(a) = a$, i.e. the expectation of a constant is the constant itself. (Here the expectation is with respect to the random variable $X$.)

Variance

• The definition of Variance is given as

$\mathrm{Var}(X) = E\big[(X - E(X))^2\big]$

• Another formula for the variance that is easier for calculation is

$\mathrm{Var}(X) = E(X^2) - (E(X))^2,$

where $E(X^2)$ is called the second moment of the random variable $X$, which can be calculated as the expectation of $h(X)$ with $h$ the power function with power 2:

$E(X^2) = \begin{cases} \sum_x x^2 p_X(x), & X \text{ is a discrete random variable} \\ \int_{-\infty}^{\infty} x^2 f_X(x)\,dx, & X \text{ is a continuous random variable} \end{cases}$

The variance of a linear function of the random variable $X$, i.e. $a + bX$, is

$\mathrm{Var}(a + bX) = b^2\mathrm{Var}(X),$

which also indicates that the variance of a constant is zero, i.e. $\mathrm{Var}(a) = 0$; again, the variance is with respect to the random variable $X$. A short numerical check appears below.
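A short numerical check of the expectation and variance formulas for a discrete random variable; the pmf is invented for illustration.

```python
# Hedged sketch: E(X), E(X^2), Var(X) from a pmf, and the linear-function rules.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}                        # p_X(x); values sum to 1

EX = sum(x * p for x, p in pmf.items())               # E(X)
EX2 = sum(x ** 2 * p for x, p in pmf.items())         # E(X^2), the second moment
VarX = EX2 - EX ** 2                                  # Var(X) = E(X^2) - (E(X))^2

a, b = 3, 2
E_lin = a + b * EX                                    # E(a + bX) = a + b E(X)
Var_lin = b ** 2 * VarX                               # Var(a + bX) = b^2 Var(X)

print(EX, VarX, E_lin, Var_lin)
```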

Joint distribution functions for multiple random variables

• For any two random variables $X$ and $Y$, the joint cdf is defined by $F_{X,Y}(x,y) = P(X \le x, Y \le y)$, for any $x, y \in \mathbb{R}$.

• For discrete random variables:

$F_{X,Y}(a,b) = \sum_{x \le a}\sum_{y \le b} p_{X,Y}(x,y),$

where $p_{X,Y}(x,y) := P(X = x, Y = y)$ is the so-called joint probability mass function (joint pmf) of $X$ and $Y$.

• For continuous random variables:

$F_{X,Y}(a,b) = \int_{-\infty}^{a}\int_{-\infty}^{b} f_{X,Y}(x,y)\,dy\,dx,$

where $f_{X,Y}(x,y)$ is the so-called joint probability density function (joint pdf) of $X$ and $Y$.

• From joint pdf or pmf to marginal pdf or pmf:

$p_X(x) = \sum_y p_{X,Y}(x,y)$ for discrete random variables,

$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy$ for continuous random variables,

and similarly for $p_Y(y)$ or $f_Y(y)$.

Independence of random variables

Analogous to the independence of random events, we also have independence of random variables.

• The definition of independence between two random variables $X$ and $Y$ is through the joint distribution and the marginal distributions; that is, for any events $A$ and $B$ we have

$P(X \in A, Y \in B) = P(X \in A)P(Y \in B),$

or, in terms of the conditional distribution and the marginal one,

$P(X \in A \mid Y \in B) = P(X \in A).$

• Note that these joint, conditional and marginal distributions can be written as probabilities of random events in $S$. Let $E = \{s \in S \mid X(s) \in A\}$ and $F = \{s \in S \mid Y(s) \in B\}$, which are independent events if $X$ and $Y$ are independent; then the two equations above are simply

$P(X \in A, Y \in B) = P(E \cap F) = P(E)P(F) = P(X \in A)P(Y \in B),$

$P(X \in A \mid Y \in B) = P(E \mid F) = P(E) = P(X \in A).$

• An easy way to check independence among random variables is to use distribution functions. Let $X_1, X_2, \ldots, X_n$ be a sequence of random variables; then they are independent if and only if

the joint cdf equals the product of the marginal cdfs:

$F(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i),$

the joint pmf equals the product of the marginal pmfs for discrete random variables:

$p(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} p_{X_i}(x_i),$

the joint pdf equals the product of the marginal pdfs for continuous random variables:

$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i),$

for any $x_1, x_2, \ldots, x_n$.

Covariance of two random variables

• The definition of the covariance of two random variables $X$ and $Y$ is

$\mathrm{Cov}(X,Y) = E\big[(X - E(X))(Y - E(Y))\big].$

• Another formula for the covariance that is easier for calculation is

$\mathrm{Cov}(X,Y) = E(XY) - E(X)E(Y),$

where $E(XY) = \sum_x \sum_y xy\, p_{X,Y}(x,y)$ for discrete random variables and $E(XY) = \int\!\!\int xy\, f_{X,Y}(x,y)\,dx\,dy$ for continuous random variables.

• The correlation coefficient between $X$ and $Y$ is defined as

$\rho = \dfrac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\mathrm{Var}(Y)}}.$

Recall and compare this formula to the correlation $r$ we introduced for data on two numerical variables.

• The covariance of independent random variables is zero, i.e.

$\mathrm{Cov}(X,Y) = 0$ when $X$ and $Y$ are independent;

this can be proved by noting that $E(XY) = E(X)E(Y)$ when $X$ and $Y$ are independent.

Expectation and Variance of linear combination of random variables

Let $X$ and $Y$ be two random variables, and let $a$ and $b$ be two real constants; then

• $E(aX + bY) = aE(X) + bE(Y)$; note that this equation is always true, whether or not $X$ and $Y$ are independent.

• $\mathrm{Var}(aX + bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X,Y)$.

• When $X$ and $Y$ are independent, we have

$\mathrm{Var}(aX + bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y).$

A simulation check of these rules is sketched below.
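The sketch below checks these rules by simulation, with dependent $X$ and $Y$ so that the covariance term matters. The distributions and constants are arbitrary choices.

```python
# Hedged sketch: Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y), by simulation.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(1.0, 2.0, n)               # X with mean 1, sd 2
Y = 0.5 * X + rng.normal(0.0, 1.0, n)     # Y built from X, so Cov(X, Y) != 0
a, b = 2.0, -3.0

lhs = np.var(a * X + b * Y, ddof=1)
rhs = (a ** 2 * np.var(X, ddof=1) + b ** 2 * np.var(Y, ddof=1)
       + 2 * a * b * np.cov(X, Y, ddof=1)[0, 1])

print(np.mean(a * X + b * Y), a * X.mean() + b * Y.mean())   # E(aX + bY) = aE(X) + bE(Y)
print(lhs, rhs)                                              # should agree closely
```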

Examples of random variables

Discrete random variables

• Bernoulli ($X \sim \mathrm{Bernoulli}(p)$).

A random variable $X$ that takes value 1 with probability $p \in (0, 1)$ and value 0 with probability $q = 1 - p$ is called a Bernoulli random variable with parameter $p$.

pmf:

$p_X(x) = \begin{cases} p, & \text{for } x = 1, \\ 1 - p, & \text{for } x = 0. \end{cases}$

$E(X) = p$, $\mathrm{Var}(X) = p(1 - p)$.

• Binomial ($X \sim \mathrm{Bin}(n, p)$).

If $X$ is the number of successes in $n$ independent Bernoulli trials (each with success probability $p$), then $X$ is a Binomial random variable with parameters $n$ and $p$.

pmf:

$p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n.$

$E(X) = np$, $\mathrm{Var}(X) = np(1 - p)$.

• Geometric ($X \sim \mathrm{Geom}(p)$).

If $X$ is the number of trials until the first success in a sequence of independent Bernoulli trials (with parameter $p$), then $X$ is a Geometric random variable with parameter $p$.

pmf:

$p_X(k) = (1 - p)^{k-1}p, \quad k = 1, 2, \ldots$

$E(X) = \dfrac{1}{p}$, $\mathrm{Var}(X) = \dfrac{1-p}{p^2}$.
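Finally, a simulation sketch checking the Binomial and Geometric mean and variance formulas above; the parameter values are arbitrary.

```python
# Hedged sketch: simulated moments vs. the E(X) and Var(X) formulas.
import numpy as np

rng = np.random.default_rng(42)
n, p, reps = 10, 0.3, 200_000

binom = rng.binomial(n, p, reps)            # successes in n Bernoulli trials
geom = rng.geometric(p, reps)               # trial number of first success (1, 2, ...)

print(binom.mean(), n * p)                  # E(X) = np
print(binom.var(ddof=1), n * p * (1 - p))   # Var(X) = np(1 - p)
print(geom.mean(), 1 / p)                   # E(X) = 1/p
print(geom.var(ddof=1), (1 - p) / p ** 2)   # Var(X) = (1 - p)/p^2
```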