DSC 212 — Probability and Statistics for Data Science Lecture 5
January 24, 2023
5.1 IID random variables
Definition 1 (Independent and Identically Distributed (i.i.d.) random variables). If random variables X1, ..., Xn are independent and have the same marginal distribution, then they are referred to as i.i.d.:
FXi(t) = FX1(t)  ∀ i
Example 1. X1, ..., Xn are i.i.d. Bernoulli random variables with parameter p. For a Bernoulli random variable, the PMF is given by
P({X1 = t}) = p^t (1 − p)^(1−t), where t ∈ {0, 1}
Due to the i.i.d. property, the joint distribution is the following probability,
P({X1 = t1, X2 = t2, ..., Xn = tn}) = ∏_{i=1}^n P({Xi = ti}) = ∏_{i=1}^n p^{ti} (1 − p)^{1−ti}
= p^{∑ ti} (1 − p)^{∑(1−ti)} = p^{∑ ti} (1 − p)^{n − ∑ ti}
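The factorized form of the joint PMF can be checked numerically; below is a minimal sketch using NumPy, where the values p = 0.3 and n = 5 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3
t = rng.integers(0, 2, size=5)          # one outcome (t1, ..., tn), each ti in {0, 1}

# Product of the marginal PMFs, term by term
joint_product = np.prod(p**t * (1 - p)**(1 - t))

# Collapsed form: p^(sum ti) * (1 - p)^(n - sum ti)
k = t.sum()
joint_closed = p**k * (1 - p)**(len(t) - k)

assert np.isclose(joint_product, joint_closed)
```

Both expressions give the same probability for any outcome vector, as the algebra above shows.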
Property 1. For i.i.d. random variables Xi, where each Xi has variance σ², i.e. EXi² − (EXi)² = σ², we have Var(∑_{i=1}^n Xi) = n · σ².
Proof. First define Y = ∑_{i=1}^n Xi and let EXi = µ. The variance Var(Y) can be expressed as EY² − (EY)². Let us consider each of the terms separately. The first term can be expressed as,
EY² = E[(X1 + X2 + ... + Xn)(X1 + X2 + ... + Xn)]
= E[X1² + X2² + ... + Xn² + 2X1X2 + 2X1X3 + ... + 2X1Xn + ...]
= ∑_{i=1}^n EXi² + ∑_{i≠j} E[XiXj]          (by linearity)
= ∑_{i=1}^n EXi² + ∑_{i≠j} EXi · EXj        (by independence, i ≠ j)
= nEX1² + 2 (n choose 2) EX1 · EX1          (by identically distributed)
= nEX1² + n(n − 1)(EX1)²
The second term (EY)² can be written as,
(EY)² = (E(X1 + X2 + ... + Xn))² = (EX1 + ... + EXn)² = (nEX1)²
where the first equality uses linearity and the second that the Xi are identically distributed.
Finally, the variance is
Var(Y) = nEX1² + n(n − 1)(EX1)² − n²(EX1)²
= nEX1² + n²(EX1)² − n(EX1)² − n²(EX1)²
= n(EX1² − (EX1)²)
= n · Var(X1)
Thus, the variance of X1 + X2 + ... + Xn is n times the variance of X1 if X1, X2, ..., Xn are i.i.d.
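This property can be verified by simulation; the sketch below uses Bernoulli(p) variables with arbitrary choices n = 10 and p = 0.4, for which n · Var(X1) = n · p(1 − p) = 2.4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 10, 0.4, 200_000

# `trials` independent copies of Y = X1 + ... + Xn with Xi ~ Bernoulli(p)
X = rng.random((trials, n)) < p
Y = X.sum(axis=1)

empirical = Y.var()
theoretical = n * p * (1 - p)           # n * Var(X1)
assert abs(empirical - theoretical) < 0.05
```

The empirical variance of the sum matches n · Var(X1) up to Monte Carlo error.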
5.2 Probability Inequalities
Theorem 1 (Markov’s inequality). For any non-negative random variable X, suppose that EX exists. For any t > 0,
P(X ≥ t) ≤ EX / t
Proof.
EX = ∫₀^∞ u fX(u) du = ∫₀^t u fX(u) du + ∫_t^∞ u fX(u) du
≥ ∫_t^∞ u fX(u) du          (the first term is non-negative)
≥ t ∫_t^∞ fX(u) du = t · P(X ≥ t)
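Markov’s inequality can be checked against an empirical sample; in fact the bound holds exactly for any sample of non-negative values, since (1/N)∑ 1[xi ≥ t] ≤ (1/N)∑ xi/t. A minimal sketch, using exponential draws as an arbitrary non-negative example:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=2.0, size=100_000)   # non-negative, EX = 2

for t in [1.0, 3.0, 5.0]:
    tail = (X >= t).mean()        # empirical P(X >= t)
    bound = X.mean() / t          # Markov bound EX / t
    assert tail <= bound + 1e-9   # the bound is never violated
```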
Theorem 2 (Chebyshev’s inequality). For a random variable X where EX² < ∞, let EX = µ; then
P(|X − µ| ≥ t) ≤ Var(X) / t²
Proof. Let Y = |X − µ|. Clearly Y is non-negative, whereby applying Markov’s inequality we get,
P(|X − µ| ≥ t) = P(Y ≥ t) = P(Y² ≥ t²) ≤ EY² / t² = E(X − µ)² / t² = Var(X) / t²
Remark 1. Thus, P({X > µ + 3σ} or {X < µ − 3σ}) = P(|X − µ| ≥ 3σ) ≤ σ²/(3σ)² = 1/9
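The 3σ remark can be checked on, for example, standard normal samples, where the true tail probability is far smaller than the distribution-free Chebyshev bound of 1/9:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal(1_000_000)    # mu = 0, sigma = 1

empirical = (np.abs(X) >= 3).mean()   # empirical P(|X - mu| >= 3*sigma)
assert empirical <= 1 / 9             # Chebyshev bound holds (loosely)
```

For the normal distribution the true value is about 0.0027; Chebyshev’s bound is conservative because it makes no distributional assumptions.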
Another useful inequality is the following. Let Y = X1 + X2 + ... + Xn where X1, X2, ..., Xn are i.i.d. random variables. Thus, EY = nEX1, and by Chebyshev’s inequality,
P(|Y − EY| ≥ t) ≤ Var(Y)/t² = n · Var(X1)/t²
By setting t = nu, we obtain the following
P(|Y/n − EX1| > u) ≤ n · Var(X1)/(nu)² = Var(X1)/(nu²)
which implies that the probability that the average of i.i.d. random variables deviates from the expectation goes to zero as n goes to infinity, i.e. lim_{n→∞} P(|Y/n − EX1| > u) = 0. This is known as the weak law of large numbers and is formally defined later.
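The shrinking deviation probability can be observed by simulation. A sketch for a fair coin with deviation threshold u = 0.05 (both values are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)
p, u = 0.5, 0.05
probs = []

for n in [100, 1_000, 10_000]:
    # 1,000 independent sample means, each averaging n Bernoulli(p) draws
    means = (rng.random((1_000, n)) < p).mean(axis=1)
    probs.append((np.abs(means - p) > u).mean())

print(probs)   # deviation probabilities shrink toward 0 as n grows
```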
Definition 2 (Convergence in Probability). A sequence of random variables X1, ..., Xn (the i.i.d. assumption is not required) is said to converge in probability to a random variable X, written as Xn →p X, if for every t > 0,
lim_{n→∞} P(|Xn − X| ≥ t) = 0
Theorem 3 (Weak Law of Large Numbers). If X1, ..., Xn are i.i.d. random variables each with expected value µ, then
(1/n) ∑_{i=1}^n Xi →p µ
Remark 2 . This justifies why we refer to the expected value as the mean.
In practice, however, we often only have a finite number of samples, whereby we have to rely on probability inequalities to make statements with high confidence.
Example 2. Consider tossing a coin with P({H}) = p. Define a sequence of random variables Xi as Xi = 1 if the i-th toss is heads and Xi = 0 if it is tails. Define Yn = X1 + X2 + ... + Xn (the total number of heads in n tosses).
We want to know if the coin is biased or not.
Using the weak law of large numbers, Yn/n →p EX1 = P(X1 = 1) = p.
However, tossing an infinite number of coins is not feasible. What if we want to know the answer for a smaller number of tosses, e.g. for n = 100?
Here Yn/n ∈ [0, 1] is a random variable and P(|Yn/n − p| > t) ≤ Var(X1)/(nt²) (by Chebyshev’s inequality), where
Var(X1) = EX1² − (EX1)² = p − p² = p(1 − p)
since X1² = X1 for a Bernoulli random variable.
Given p = 0.7 and n = 100, suppose we want an answer that is within an interval. For example, for t = 0.1,
P(|Yn/n − p| > 0.1) ≤ p(1 − p)/(nt²) = 0.21/(100 · 0.01) = 0.21.
This implies that with probability at least 0.79, the fraction of heads lies in the interval [0.6, 0.8].
If we are willing to increase the size of the interval, we can be more confident, since
P(|Yn/n − p| > 0.15) ≤ 0.21/(100 · 0.0225) = 0.21/2.25 ≈ 0.0934.
The above statement means that with probability at least 1 − 0.0934 ≈ 91%, we can say the fraction of heads lies in [0.55, 0.85].
More generally, for some α ∈ [0, 1], with probability at least 1 − α, the fraction of heads will lie in the interval [p − √(p(1 − p)/(nα)), p + √(p(1 − p)/(nα))].
Substituting p = 0.5, n = 100, α = 0.09, we get √(p(1 − p)/(nα)) ≈ 0.17. If we observe fewer than 0.33 fraction of heads or more than 0.67 fraction of heads, we can conclude that the coin is biased with probability 91%.
If on the other hand we had n = 10000 coin tosses, we can conclude that the coin is biased with probability 91% if we observe fewer than 0.483 fraction of heads or more than 0.517.
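These half-widths follow from solving p(1 − p)/(nt²) = α for t. A small helper sketch (the function name is our own, not from the lecture):

```python
import math

def chebyshev_halfwidth(p, n, alpha):
    """Half-width t such that P(|Yn/n - p| > t) <= p(1-p)/(n t^2) = alpha."""
    return math.sqrt(p * (1 - p) / (n * alpha))

print(round(chebyshev_halfwidth(0.5, 100, 0.09), 3))     # 0.167 -> interval [0.33, 0.67]
print(round(chebyshev_halfwidth(0.5, 10_000, 0.09), 4))  # 0.0167 -> interval [0.483, 0.517]
```

Note the half-width shrinks like 1/√n: a 100-fold increase in tosses narrows the interval by a factor of 10.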
A tighter confidence interval can be constructed with more specialized inequalities such as Hoeffding’s inequality.