DSC 212 — Probability and Statistics for Data Science Lecture 2
January 12, 2023
2.1 Random Variables
In statistics, the connection between sample spaces, events, and data is established through the use of random variables. Recall that a measurable space (Ω, F) consists of a sample space Ω and a σ-algebra F of subsets of Ω.
Definition 1 (Measurable function). Let (Ω , F) and (S,S) be two measurable spaces. For a function X : Ω → S we define the pre-image of a set A ∈ S to be
X−1(A) := {ω ∈ Ω | X(ω) ∈ A}.
A function X : Ω → S is said to be F-measurable if X−1(A) ∈ F for all A ∈ S .
Definition 2 (Random variable). Consider the measurable space (R, B), where B is the Borel σ-algebra, and a probability space (Ω, F, P). An F-measurable function X : Ω → R is called a random variable. The probability measure of a random variable X, denoted PX, is defined as
PX({X ∈ A}) = P(X−1(A)) ∀A ∈ B.
Note 1. Random variables are just a special case of measurable functions, with S = R. They map a sample space Ω into R, where calculus operations can be performed.
Remark 1. In practice, F may be excessively large. We define σ(X) ⊆ F to be the smallest σ-algebra over Ω such that X is measurable with respect to it. Note that σ(X) consists of events (sets of outcomes) in Ω. Properties such as independence of random variables are stated in terms of events in σ(X).
Definition 3 (Independent random variables). Random variables X : Ω → R and Y : Ω → R are independent if all events A ∈ σ(X) and B ∈ σ(Y) are independent. Mathematically, this means
P(A ∩ B) = P(A)P(B) ∀A ∈ σ(X),B ∈ σ(Y).
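The product rule can be checked exhaustively on a small finite example. The following sketch (the choice of two fair coin tosses is an illustrative assumption, not from the notes) takes X and Y to be the indicators of heads on the first and second toss, and verifies P(A ∩ B) = P(A)P(B) for the generating events of σ(X) and σ(Y):

```python
from itertools import product
from fractions import Fraction

# Sample space of two independent fair coin tosses; each outcome has probability 1/4.
# (Illustrative choice of probability space, not prescribed by the notes.)
omega = list(product("HT", repeat=2))
P = {w: Fraction(1, 4) for w in omega}

X = lambda w: 1 if w[0] == "H" else 0  # depends only on the first toss
Y = lambda w: 1 if w[1] == "H" else 0  # depends only on the second toss

def prob(event):
    """P(A) for an event A ⊆ Ω."""
    return sum(P[w] for w in event)

# σ(X) = {∅, {X=0}, {X=1}, Ω}, so checking the atoms {X=a}, {Y=b} suffices.
for a in (0, 1):
    for b in (0, 1):
        A = {w for w in omega if X(w) == a}
        B = {w for w in omega if Y(w) == b}
        assert prob(A & B) == prob(A) * prob(B)
print("X and Y are independent")
```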
Example 1 (Bernoulli random variable). Consider the probability space (Ω , F, P) corresponding to tossing a biased coin, where Ω = {H,T}, and F = {{H}, {T}, Ω , ∅}, with P({H}) = p. The random variable X defined as
X(ω) = { 1  if ω = H
       { 0  if ω = T
is called a Bernoulli random variable with parameter p. In this case the range of X is S = {0, 1} and events in F map to events in S = {{0}, {1}, ∅,S}.
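A Bernoulli random variable and its pushforward measure PX can be written out directly. In this sketch the value p = 1/3 is an arbitrary illustrative choice:

```python
from fractions import Fraction

# Bernoulli random variable as a measurable function on Ω = {H, T}.
# p = 1/3 is an arbitrary choice for illustration.
p = Fraction(1, 3)
P = {"H": p, "T": 1 - p}  # probability measure on Ω
X = {"H": 1, "T": 0}      # the map X : Ω → {0, 1}

def P_X(A):
    """Pushforward measure: P_X(A) = P(X^{-1}(A)) for A ⊆ {0, 1}."""
    preimage = {w for w in P if X[w] in A}
    return sum(P[w] for w in preimage)

print(P_X({1}))     # P(X = 1) = p = 1/3
print(P_X({0, 1}))  # P(X ∈ {0, 1}) = 1
```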
Measurable functions can have other co-domains S, which define other random objects studied in probability and statistics.
- If S = R, the measurable function is called a random variable,
- If S = Rd, the measurable function is called a random vector,
- If S = Rd×p, the measurable function is called a random matrix,
- If S is the space of graphs with n vertices, the measurable function is a random graph.
Other objects such as random trees, random walks, random functions are defined in a similar manner by choosing an appropriate co-domain S suitable to the application.
Definition 4 (Cumulative distribution function). For a random variable X : Ω → R, the cumulative distribution function (CDF), denoted FX, is the function FX : R → [0, 1] defined as

FX(u) = PX({X ≤ u}) = P(X−1((−∞, u])).
Example 2. Suppose we toss a fair coin twice, which leads to the sample space Ω = {HH, TT, HT, TH}, and all outcomes are equally likely, i.e., P({ω}) = 1/4 for all ω ∈ Ω. Define the random variable X : Ω → R which counts the number of heads. Hence

X(ω) = { 2  if ω = HH
       { 1  if ω ∈ {HT, TH}
       { 0  if ω = TT

so PX({X = 0}) = 1/4, PX({X = 2}) = 1/4, and PX({X = 1}) = 1/2. The CDF is given by the following function.
2.2 Properties of the CDF
Let FX : R → [0, 1] be the CDF of a random variable X : Ω → R. Then FX satisfies:
1. Non-decreasing: x1 < x2 implies that FX(x1) ≤ FX(x2).
2. Normalization: lim_{x→−∞} FX(x) = 0, and lim_{x→+∞} FX(x) = 1.
3. Right-continuous: for any x ∈ R, FX(x) = lim_{y↓x} FX(y), i.e., the function value equals the limit from the right. The limit from the left, FX(x−) = lim_{y↑x} FX(y), need not equal FX(x).
Note 2. Conversely, if a function F : R → [0, 1] satisfies the above 3 properties, there exists a random variable with F as its CDF.
Exercise 1. Verify that the above CDF in the figure satisfies all the properties listed above.
Remark 2. (Technical! Can be omitted.) When the codomain is S = Rd, we need an additional condition for F : Rd → [0, 1] to be a valid CDF of a random vector. For any rectangle A = (a1, b1] × (a2, b2] × ... × (ad, bd] ⊂ Rd, let C be the set of 2^d corners of A. Then the function must satisfy

∑_{c∈C} sign(c) F(c) ≥ 0,

where sign(c) = (−1)^{# of ai's in c}.
Example 3. Continuing Example 2 (tossing a coin twice), observe that we have PX({X = 1}) = 1/2. Using the last property of the CDF, we have another way of calculating this quantity. Note that FX(1−) = 1/4, whereas FX(1) = 3/4, whereby we have PX({X = 1}) = FX(1) − FX(1−) = 3/4 − 1/4 = 1/2.
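This jump-of-the-CDF computation can be sketched directly, comparing F(1) with the left limit F(1−):

```python
from fractions import Fraction

# PMF of X = number of heads in two fair tosses (Example 2).
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def F(u):
    """CDF: F(u) = P(X <= u)."""
    return sum(p for k, p in pmf.items() if k <= u)

def F_left(u):
    """Left limit F(u-) = P(X < u)."""
    return sum(p for k, p in pmf.items() if k < u)

jump = F(1) - F_left(1)
print(jump)  # P(X = 1) = 3/4 - 1/4 = 1/2
```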
Definition 5 (Probability mass function). A random variable X is said to be discrete if the range of X is a countable subset of R, i.e., the set {X(ω) : ω ∈ Ω} is countable. We define the probability mass function (PMF) of X as the function u ↦ PX({X = u}). The PMF satisfies

∑_{k∈range(X)} PX({X = k}) = 1. (PMF-normalization)
Note 3. The summation of the PMF being 1 is just the normalization property of P, since the sets X−1(k) are disjoint subsets of Ω and ∪k X−1(k) = Ω, whereby

1 = P(Ω) = ∑k P(X−1(k)) = ∑k P({X = k}).
Remark 3. It is a good idea to verify that the PMF adds to 1, whenever defining a PMF.
Example 4 (Geometric random variable). Let p ∈ (0, 1]. The random variable X ∈ N0 := N ∪ {0} (number of failures before the first success) has PMF

PX({X = k}) = p(1 − p)^k  ∀k ∈ N0.

The PMF normalizes:

∑_{k=0}^∞ PX({X = k}) = p ∑_{k=0}^∞ (1 − p)^k = p · 1/(1 − (1 − p)) = p · (1/p) = 1,

where we have used the formula for the infinite sum of a geometric series.
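The normalization can be checked numerically by truncating the infinite sum; the value p = 0.3 below is an arbitrary illustrative choice:

```python
# Partial sums of the geometric PMF p(1-p)^k approach 1.
# p = 0.3 is an arbitrary illustrative choice; the geometric tail
# beyond k = 1000 is smaller than machine precision.
p = 0.3
partial = sum(p * (1 - p) ** k for k in range(1000))
print(partial)  # ≈ 1
```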
Example 5 (Binomial random variable). Suppose we toss a fair coin n times and let X = number of heads observed, with sample space Ω = {HH...H, HH...HT, ..., TT...T} (all strings of n tosses). Observe that |Ω| = 2^n and all outcomes are equally likely, so

PX({X = k}) = (n choose k) / 2^n.
If instead the tossed coin were biased where P({H}) = p, and P({T}) = 1−p, for some parameter p ∈ [0, 1], we have the PMF,
PX({X = k}) = (n choose k) p^k (1 − p)^{n−k}.
Observe that the PMF satisfies the identity
∑_{k=0}^n PX({X = k}) = ∑_{k=0}^n (n choose k) p^k (1 − p)^{n−k} = 1,
where the summation is the binomial expansion of (p + (1 − p))n .
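A quick numerical sketch of the binomial PMF and its normalization, using Python's built-in binomial coefficient (the values n = 5, p = 0.4 are arbitrary illustrative choices):

```python
from math import comb

# Binomial PMF; n = 5 and p = 0.4 are arbitrary illustrative parameters.
n, p = 5, 0.4

def pmf(k):
    """P(X = k) = C(n, k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

total = sum(pmf(k) for k in range(n + 1))
print(total)  # = 1 by the binomial theorem (up to float rounding)
```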
Note 4. It is important to note that p is a parameter that defines the distribution, whereas k is a particular value that the random variable X : (Ω, F, P) → (R, B, PX) takes.
2.3 Expectation
The expectation of a function of a random variable is the average value of the function, accounting for the randomness.
Definition 6. For a function g : R → R of a discrete random variable, the expectation of g is defined as
Eg(X) := ∑_k g(k) PX({X = k}).
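The definition translates directly into a sum over the PMF. This sketch reuses the two-coin-toss distribution of Example 2:

```python
from fractions import Fraction

# E g(X) = Σ_k g(k) P(X = k), for X = number of heads in two fair tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expect(g):
    """Expectation of g(X) for a discrete random variable given by its PMF."""
    return sum(g(k) * pk for k, pk in pmf.items())

print(expect(lambda k: k))       # EX   = 0·(1/4) + 1·(1/2) + 2·(1/4) = 1
print(expect(lambda k: k ** 2))  # EX^2 = 0 + 1/2 + 1 = 3/2
```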
Remark 4. The expectation is the best scalar approximation (in the mean-squared-error sense) that describes the distribution. In fact, as we will see in later lectures, if we are provided samples {Xi} from a distribution, then as the number of samples n grows large, the sample average (1/n) ∑_{i=1}^n g(Xi) → Eg(X) almost surely, under mild conditions on g. This is called the law of large numbers.
Note 5. The expectation is also known as the mean, often denoted by µX or simply µ .
Example 6 (Bernoulli distribution). Let the random variable X represent a binary coin flip on the sample space Ω = {H, T}, which comprises two possible outcomes, with X(H) = 1 and X(T) = 0. The probability of the event "heads" is P({H}) = p and the probability of the event "tails" is P({T}) = 1 − p. The expected value of X equals the probability of the event "heads" occurring, given by
EX = 0 · PX({X = 0}) + 1 · PX({X = 1}) = 0 · (1 − p) + 1 · p = p
Example 7 (Geometric distribution). Following example 4, the geometric probability distribution models the number of failures before the first success in a sequence of Bernoulli trials. The expected value can be calculated by
EX = ∑_{k=0}^∞ k · p(1 − p)^k = (1 − p)/p.

The formula tells us that we expect (1 − p)/p failures before the first success, on average.

Proof. Observe that, due to the normalization property of the PMF,
∑_{k≥0} p(1 − p)^k = 1.

Differentiating with respect to p gives us that

∑_{k≥0} (1 − p)^k − ∑_{k≥0} kp(1 − p)^{k−1} = 0.

Thus we have

∑_{k≥0} kp(1 − p)^{k−1} = ∑_{k≥0} (1 − p)^k = 1/p.

Multiplying both sides by (1 − p) gives EX = ∑_{k≥0} kp(1 − p)^k = (1 − p)/p, which proves the claim.
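The closed form EX = (1 − p)/p can be checked against a truncated numerical sum; p = 0.25 is an arbitrary illustrative choice, for which the mean is 3:

```python
# Check EX = (1-p)/p for the geometric (failures-before-first-success) PMF.
# p = 0.25 is an arbitrary illustrative choice; the truncation at k = 10000
# discards a tail far below machine precision.
p = 0.25
mean = sum(k * p * (1 - p) ** k for k in range(10_000))
print(mean, (1 - p) / p)  # both ≈ 3
```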
Property 1 (Linearity of expectation). For any two functions f, g : R → R and any two numbers α, β ∈ R, the expected value of the sum of the scaled functions equals the sum of the scaled expected values. Mathematically, this means

E(αf + βg)(X) = E[αf(X)] + E[βg(X)] = α · Ef(X) + β · Eg(X).
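Linearity can be verified exactly on a small discrete distribution; the choices of PMF, f, g, α, and β below are arbitrary illustrative values:

```python
from fractions import Fraction

# Verify E(αf + βg)(X) = αEf(X) + βEg(X) on the two-coin-toss PMF of Example 2.
# α = 2, β = -3, f(k) = k, g(k) = k^2 are arbitrary illustrative choices.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
alpha, beta = Fraction(2), Fraction(-3)
f = lambda k: k
g = lambda k: k * k

def expect(h):
    """E h(X) = Σ_k h(k) P(X = k)."""
    return sum(h(k) * pk for k, pk in pmf.items())

lhs = expect(lambda k: alpha * f(k) + beta * g(k))
rhs = alpha * expect(f) + beta * expect(g)
print(lhs == rhs)  # True: both equal 2·1 - 3·(3/2) = -5/2
```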