DSC 212 — Probability and Statistics for Data Science
January 19, 2023
Lecture 4
4.1 Independence of Random Variables
In the last class we defined independence between discrete random variables. We say X ⊥⊥ Y if

PXY({X = u} ∩ {Y = v}) = PX({X = u}) · PY({Y = v})   for all u ∈ RX, v ∈ RY
Let us quickly verify the types of all the objects in the equation above. The probability space on the LHS is (RX × RY, B(RX × RY), PXY), whereas on the right we have (RX, B(RX), PX) and (RY, B(RY), PY), where B(S) denotes the Borel field of a space S. These individual probability spaces have themselves been defined on some underlying probability spaces (Ω1, F1, P1) and (Ω2, F2, P2), via random variables X : Ω1 → RX and Y : Ω2 → RY. Indeed, independence is really about the interaction between the randomness of outcomes in Ω1 and Ω2, given by a joint distribution P12 over events from Ω1 × Ω2.
In general, the two random variables are independent if their joint CDF can be written as the product of their marginal CDFs.
FXY(u,v) = FX(u) · FY(v) (4.1)
where FXY(u,v) = PXY({X ≤ u} ∩ {Y ≤ v}) = P12({(ω1, ω2) ∈ Ω1 × Ω2 | X(ω1) ≤ u, Y(ω2) ≤ v}), FX(u) = PX({X ≤ u}), and FY(v) = PY({Y ≤ v}).
Recall that for a pair of random variables (X,Y), their distribution function is described by their CDF,
FXY(u,v) = PXY(X ≤ u, Y ≤ v)   (4.2)

For continuous random variables, there exists a more convenient representation than the CDF.
4.1.1 Joint Density
(X,Y) are jointly continuous random variables if there exists a function fXY : R² → R, called the joint probability density function (joint PDF), such that

PXY(A) = ∫∫_A fXY(u,v) du dv   (4.3)

The PDF must satisfy

fXY(u,v) ≥ 0   (non-negativity)
∫∫_{R²} fXY(u,v) du dv = 1   (normalization)
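As a quick numerical sanity check, the sketch below verifies the two PDF conditions for a hypothetical joint density fXY(u,v) = u + v on the unit square; the density, grid size, and the event A = [0, 0.5] × [0, 0.5] are illustrative choices, not from the lecture.

```python
import numpy as np

# Hypothetical joint density on the unit square: f_XY(u, v) = u + v.
def f_xy(u, v):
    return u + v

# Evaluate on a fine grid over [0, 1]^2.
n = 2000
u = np.linspace(0, 1, n)
U, V = np.meshgrid(u, u)
density = f_xy(U, V)

# Non-negativity, and normalization via a mean-value (Riemann) approximation
# of the double integral; the area of the unit square is 1.
nonnegative = bool((density >= 0).all())
total_mass = density.mean()  # ≈ ∫∫ f_XY du dv over [0,1]^2

# P_XY(A) for A = [0, 0.5] x [0, 0.5]; analytically this is 1/8.
mask = (U <= 0.5) & (V <= 0.5)
prob_quadrant = density[mask].mean() * 0.25  # mean value on A × area of A
```

Analytically ∫∫(u + v) du dv = 1 over the square and the lower-left quadrant has probability 1/8; the grid approximation reproduces both to a few decimal places.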
4.1.2 Marginal Density
Based on the joint PDF, one can derive the marginal density of X and Y , denoted fX and fY respectively.
fX(u) = ∫_R fXY(u,t) dt,   fY(v) = ∫_R fXY(t,v) dt

Knowing the marginal densities fX and fY alone does not determine the joint density fXY, unless X and Y are independent.
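As a numerical illustration (a sketch assuming numpy; the joint density fXY(u,v) = u + v on [0,1]² is a hypothetical choice), a marginal can be computed by integrating out the other variable.

```python
import numpy as np

# Hypothetical joint density f_XY(u, v) = u + v on [0,1]^2 (illustrative).
v = np.linspace(0, 1, 10_001)
dv = v[1] - v[0]

def marginal_f_x(u):
    # f_X(u) = ∫ f_XY(u, v) dv, approximated by a Riemann sum over v.
    return float(np.sum((u + v) * dv))

# Analytically f_X(u) = u + 1/2, so f_X(0.3) = 0.8.
fx_at_03 = marginal_f_x(0.3)
```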
For continuous random variables, one can verify that if X, Y are independent, then

fXY(u,v) = fX(u) · fY(v)
Suppose X, Y are independent. Then E[g(X)h(Y)] = E[g(X)] E[h(Y)]:

E[g(X)h(Y)] = ∫∫ g(u)h(t) fXY(u,t) du dt
            = ∫∫ g(u)h(t) fX(u) fY(t) du dt
            = ∫ g(u) fX(u) du · ∫ h(t) fY(t) dt
            = E[g(X)] · E[h(Y)]

where the second equality follows from the independence of X and Y, which allows fXY to be written as the product of fX and fY.
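This identity is easy to check by Monte Carlo simulation; the sketch below assumes numpy, and the choices X, Y ~ Uniform[0,1], g(t) = t², h(t) = sin(t) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(0, 1, n)  # X ~ Uniform[0,1]
y = rng.uniform(0, 1, n)  # Y ~ Uniform[0,1], drawn independently of X

g = lambda t: t ** 2
h = lambda t: np.sin(t)

lhs = np.mean(g(x) * h(y))           # E[g(X) h(Y)]
rhs = np.mean(g(x)) * np.mean(h(y))  # E[g(X)] E[h(Y)]
```

Here both sides approximate E[X²] · E[sin Y] = (1/3)(1 − cos 1), and they agree up to Monte Carlo error.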
4.2 Transformations of Random Variables
Given a random variable X we are often interested in the distribution of Y = g(X) for some function g . To find the distribution of Y , we must find its CDF, and then find its PDF by differentiating the CDF.
Example 1. Let X be drawn from a uniform distribution and transformed by Y = sin(X). Suppose

X ∼ Uniform[−π/2, π/2]
Y = sin(X)

Figure 4.1: (left) The CDF and PDF of X. (right) The transformation Y = sin(X).

CDF: Let us find the probability that Y ≤ t, for some t ∈ [−1, 1]. Consider the following set of equations.

FY(t) = P{Y ≤ t}
      = P{sin(X) ≤ t}
      = P{X ≤ sin⁻¹(t)}   (b)
      = FX(sin⁻¹(t)) = (sin⁻¹(t) + π/2) / π

where (b) holds since the sin function is increasing on this domain.

PDF: Let us take the derivative of FY(t), which we found above, to find the PDF fY(t):

fY(t) = ∂FY(t)/∂t = (1/π) · ∂sin⁻¹(t)/∂t = 1 / (π·√(1 − t²))
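A simulation makes this concrete: sampling X uniformly on [−π/2, π/2] and comparing the empirical CDF of Y = sin(X) against (sin⁻¹(t) + π/2)/π. This is a sketch assuming numpy; the evaluation point t = 0.5 and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-np.pi / 2, np.pi / 2, 1_000_000)  # X ~ Uniform[-pi/2, pi/2]
y = np.sin(x)                                      # Y = sin(X)

# Empirical CDF of Y at t versus the derived formula F_Y(t) = (arcsin t + pi/2)/pi.
t = 0.5
empirical = np.mean(y <= t)
analytic = (np.arcsin(t) + np.pi / 2) / np.pi
```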
4.2.1 General invertible increasing transformations
Assume g is an invertible and increasing function, and Y = g(X). Then

FY(t) = FX(g⁻¹(t))
fY(t) = ∂/∂t FX(g⁻¹(t)) = fX(g⁻¹(t)) · ∂g⁻¹(t)/∂t   (a)

where (a) follows from the chain rule of differentiation. This general form may be used to verify the PDF of Y = sin(X) found above: with g⁻¹(t) = sin⁻¹(t), fX(u) = 1/π, and ∂sin⁻¹(t)/∂t = 1/√(1 − t²), we recover fY(t) = 1/(π·√(1 − t²)).
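The same recipe applies to any increasing invertible g. As an illustrative sketch (assuming numpy), take X ~ Uniform[0,1] and g(x) = x², which is increasing on [0,1]: the formula gives FY(t) = FX(√t) = √t and fY(t) = 1/(2√t).

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 1_000_000)  # X ~ Uniform[0,1]
y = x ** 2                        # g(x) = x^2, increasing and invertible on [0,1]

# F_Y(t) = F_X(g^{-1}(t)) = sqrt(t), since F_X(u) = u on [0,1].
t = 0.25
empirical_cdf = np.mean(y <= t)
analytic_cdf = np.sqrt(t)
```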
4.3 Conditional Distributions
For a pair of random variables (X,Y) with a joint distribution PXY, the conditional distribution of X | Y is defined as

PX|Y({X = u} | {Y = v}) = PXY({X = u} ∩ {Y = v}) / PY({Y = v})
As in the case of sets, the above equation is well defined only if {Y = v} does not have measure 0, i.e., PY ({Y = v}) > 0.
For each v, the LHS is a distribution of X; hence v is a parameter of this distribution. The axioms of probability only hold for the first argument of a conditional distribution.

For continuous random variables (X,Y) with density fXY, the conditional density of X | Y is

fX|Y(u|v) = fXY(u,v) / fY(v)

which is a valid density function with respect to u. Note that the above function is well defined only for v such that fY(v) > 0.
Notice that if X ⊥⊥ Y, then

PX|Y({X = u} | {Y = v}) = PX({X = u})
fX|Y(u|v) = fX (u)
Example 2 (Max transformation). Given two independent Uniform[0,1] random variables X and Y, find the CDF and PDF of their maximum Z.
X ∼ Uniform[0, 1]
Y ∼ Uniform[0, 1]
Z = max{X,Y }
CDF: Let us find the probability of the event Z ≤ t. This means that both X and Y are ≤ t. Since X and Y are independent, the joint distribution can be decomposed as a product of the two marginal distributions. Hence we can write
FZ(t) = P{max{X,Y} ≤ t}
      = PXY({X ≤ t} ∩ {Y ≤ t})
      = PX{X ≤ t} · PY{Y ≤ t}
      = FX(t) · FY(t) = t · t = t²
PDF: Differentiating the above yields

fZ(t) = d/dt FZ(t) = d/dt t² = 2t,   for t ∈ [0, 1].
Figure 4.2: (left) The joint distribution of two uniform random variables. (right) The PDFs of the two uniform distributions and of the max transformation.
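The derivation above can be checked by simulation; this is a sketch assuming numpy, with the evaluation point t = 0.7 and the seed chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.uniform(0, 1, n)  # X ~ Uniform[0,1]
y = rng.uniform(0, 1, n)  # Y ~ Uniform[0,1], independent of X
z = np.maximum(x, y)      # Z = max{X, Y}

# Empirical CDF of Z at t versus the derived F_Z(t) = t^2.
t = 0.7
empirical = np.mean(z <= t)
analytic = t ** 2
```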
4.4 Some commonly used expectations
4.4.1 Mean
The mean or expectation of X is a constant; it is not random. It is the best estimate of the random variable in the sense of minimizing the mean squared error. We often denote the mean of a random variable X by µX. Recall that by definition of the expectation, the mean is given by the integral

EX = ∫_R t fX(t) dt
where fX is the PDF of X .
4.4.2 Variance
The variance of a random variable X describes the average squared deviation around the mean, and is denoted σX², since it is non-negative.

σX² = E[(X − EX)²] = E[(X − µX)²] = EX² − (EX)²

Observe that (X − EX)² is always non-negative, whereby σX² is also non-negative. Hence, EX² ≥ (EX)².
The square-root of the variance, i.e., σX is called the standard deviation of X, i.e., the deviation about the mean, which can be considered to be a “standard”.
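The two expressions for the variance agree on simulated data as well; this sketch assumes numpy, and the choice X ~ Normal(2, 3²) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)  # X ~ Normal(2, 9)

mu = np.mean(x)
var_centered = np.mean((x - mu) ** 2)     # E[(X - EX)^2]
var_shortcut = np.mean(x ** 2) - mu ** 2  # E[X^2] - (EX)^2
```

Both estimates approximate σX² = 9, and they are algebraically identical on any sample, differing only by floating-point round-off.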
4.4.3 Covariance and Correlation
The covariance is defined below.
Cov(X,Y) = E[(X − µX)(Y − µY)] = E[XY − XµY − µXY + µXµY] = E[XY] − µXµY

The inner terms cancel since E[XµY] = E[µXY] = µXµY. When we want to study the interaction between a pair of random variables (X,Y), we collect these quantities into the matrix

ΣXY = [ E(X − EX)²          E(X − EX)(Y − EY) ]   =   [ σX²     ρσXσY ]
      [ E(X − EX)(Y − EY)   E(Y − EY)²        ]       [ ρσXσY   σY²   ]      (4.5)

where ρ = E[(X − µX)(Y − µY)] / (σXσY), and −1 ≤ ρ ≤ 1 is called the correlation.
The matrix above is also sometimes referred to as a Variance-Covariance matrix, because it contains the variances along the diagonal and the covariances off the diagonal.
In vector notation: the covariance matrix of a random vector X ∈ Rd is the matrix

Σ = E(X − EX)(X − EX)⊤ = EXX⊤ − (EX)(EX)⊤   (4.6)

We also have the inequality

Σ ⪰ 0  ⇐⇒  EXX⊤ ⪰ (EX)(EX)⊤   (4.7)

which means Σ is a positive semidefinite matrix and v⊤(EXX⊤)v ≥ (E v⊤X)², for any constant vector v ∈ Rd. Notice that both quantities in the last inequality are scalars.
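Positive semidefiniteness is easy to observe empirically: the eigenvalues of a sample covariance matrix are non-negative, as is every quadratic form v⊤Σv. The sketch below assumes numpy; the dimension d = 3, the mixing matrix, and the test vector v are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)
# A random vector X in R^3 with correlated coordinates, via a random linear mix.
samples = rng.normal(size=(100_000, 3)) @ rng.normal(size=(3, 3))

mean = samples.mean(axis=0)
centered = samples - mean
sigma = centered.T @ centered / len(samples)  # empirical E(X-EX)(X-EX)^T

eigenvalues = np.linalg.eigvalsh(sigma)  # real, since sigma is symmetric
v = np.array([1.0, -2.0, 0.5])           # arbitrary constant vector
quad_form = float(v @ sigma @ v)          # v^T Sigma v, a scalar
```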
Exercise 1. Find the range of ρ .
Solution: The Cauchy-Schwarz inequality states:
|⟨a,b⟩|2 ≤ ⟨a,a⟩⟨b,b⟩
|⟨a,b⟩| ≤ ∥a∥ · ∥b∥
Let:
a := X − µX
b := Y − µY
⟨a,b⟩ := Eab
The expectation of the above terms may be used in the Cauchy-Schwarz inequality. The last step below follows from the definition of the correlation coefficient ρ.

|E[(X − µX)(Y − µY)]|² ≤ E[(X − µX)²] E[(Y − µY)²]
|Cov(X,Y)| ≤ σX σY

Hence |ρ| = |Cov(X,Y)| / (σX σY) ≤ 1, i.e., −1 ≤ ρ ≤ 1.
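The bound holds for any sample as well. As an illustrative sketch (assuming numpy; the linear model Y = 0.5X + noise is a hypothetical choice), the empirical ρ lands inside [−1, 1] and near its theoretical value 0.5/√1.25 ≈ 0.447.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)  # Y correlated with X

# Empirical covariance and correlation coefficient.
cov = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov / (x.std() * y.std())
```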
4.5 Conditional Expectation
We now define how to take an expectation given a conditioning event:

E[g(X) | Y = v] = EX|Y=v g(X) = ∫_R g(t) fX|Y(t|v) dt

The above integral is taken with respect to t, thus the expectation depends on v. The conditional expectation E[g(X) | Y] is itself a random variable: it is h(Y), where h(v) = E[g(X) | Y = v].
Claim 1. For a pair of random variables (X,Y), we have

EY[EX|Y[g(X) | Y]] = EY h(Y) = E g(X)

Proof. Let h(v) = E[g(X) | Y = v]. Then

EY[h(Y)] = ∫_{RY} h(v) fY(v) dv
         = ∫_{RY} (∫_{RX} g(t) fX|Y(t|v) dt) fY(v) dv
         = ∫_{RX} g(t) (∫_{RY} fXY(t,v) dv) dt       (since fX|Y(t|v) fY(v) = fXY(t,v))
         = ∫_{RX} g(t) fX(t) dt                      (marginal of X)
         = E g(X)
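The claim (the law of total expectation) can be checked by Monte Carlo. The sketch below assumes numpy, and the model Y ~ Uniform[0,1] with X | Y = v ~ Normal(v, 1) and g(x) = x² is a hypothetical choice: there E[X² | Y = v] = v² + 1, so both sides should approximate E[Y²] + 1 = 4/3.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
y = rng.uniform(0, 1, n)          # Y ~ Uniform[0,1]
x = rng.normal(loc=y, scale=1.0)  # X | Y=v ~ Normal(v, 1)

g = lambda t: t ** 2

# Inner conditional expectation in closed form: E[X^2 | Y=v] = v^2 + 1
# (variance plus squared mean of a Normal(v, 1)).
inner = y ** 2 + 1.0
lhs = np.mean(inner)  # E_Y[ E[g(X) | Y] ]
rhs = np.mean(g(x))   # E[g(X)] directly
```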