
DSC 212 — Probability and Statistics for Data Science

Lecture 2

January 12, 2023

2.1    Random Variables

In statistics, the connection between sample spaces, events, and data is established through the use of random variables. Recall that a measurable space (Ω, F) consists of a sample space Ω and a σ-algebra F of subsets of Ω.

Definition  1  (Measurable function). Let (Ω , F) and (S,S) be two measurable spaces.  For a function X : Ω → S we define the pre-image of a set A ∈ S to be

X⁻¹(A) := {ω ∈ Ω | X(ω) ∈ A}.

A function X : Ω → S is said to be F-measurable if X⁻¹(A) ∈ F for all A ∈ S.

Definition 2 (Random variable). Consider the measurable space (R, B), where B denotes the Borel σ-algebra on R, and a probability space (Ω, F, P). An F-measurable function X : Ω → R is called a random variable.  The probability measure of a random variable X, denoted PX, is defined as

PX({X ∈ A}) = P(X⁻¹(A))       ∀A ∈ B.

Note  1.  Random variables are just a special case of measurable functions with S = R. They map a sample space into R, where we can conveniently perform calculus operations.

Remark  1.  In practice F may be excessively large.  We define σ(X) ⊂ F to be the smallest σ-algebra over Ω such that X is measurable with respect to it.  Note that σ(X) consists of events (subsets of outcomes) in Ω. Properties such as independence of random variables are stated in terms of events in σ(X).

Definition 3 (Independent random variables). Random variables X : Ω → R and Y : Ω → R are independent if all events A ∈ σ(X) and B ∈ σ(Y) are independent. Mathematically, this means

P(A ∩ B) = P(A)P(B)       ∀A ∈ σ(X),B ∈ σ(Y).
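As an illustration of this definition (not part of the original notes), here is a minimal Python sketch assuming two independent fair coin tosses, where X and Y are the indicators of heads on the first and second toss; it enumerates the events in σ(X) and σ(Y) and checks the product rule for every pair.

```python
from itertools import chain, combinations

# Sample space of two independent fair coin tosses; each outcome has probability 1/4.
omega = ["HH", "HT", "TH", "TT"]
P = {w: 0.25 for w in omega}

def X(w):
    """Indicator that the first toss is heads."""
    return 1 if w[0] == "H" else 0

def Y(w):
    """Indicator that the second toss is heads."""
    return 1 if w[1] == "H" else 0

def prob(event):
    """Probability of an event (a set of outcomes)."""
    return sum(P[w] for w in event)

def sigma(rv):
    """Events generated by rv: pre-images of every subset of its range."""
    values = sorted({rv(w) for w in omega})
    subsets = chain.from_iterable(combinations(values, r) for r in range(len(values) + 1))
    return [frozenset(w for w in omega if rv(w) in s) for s in subsets]

# Verify P(A ∩ B) = P(A) P(B) for every A in sigma(X) and B in sigma(Y).
for A in sigma(X):
    for B in sigma(Y):
        assert abs(prob(A & B) - prob(A) * prob(B)) < 1e-12
print("all events in sigma(X) and sigma(Y) satisfy the product rule")
```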

Example 1 (Bernoulli random variable). Consider the probability space (Ω , F, P) corresponding to tossing a biased coin, where Ω = {H,T}, and F = {{H}, {T}, Ω , ∅}, with P({H}) = p. The random variable X defined as

X(ω) =  1   if ω = H,
        0   if ω = T,

is called a Bernoulli random variable with parameter p. In this case the range of X is S = {0, 1}, and events in F map to events in the σ-algebra S = {{0}, {1}, ∅, S} on the range.
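A minimal computational sketch of this example (the bias p = 0.3 is an arbitrary choice, not from the notes): represent X explicitly as a function on Ω = {H, T} and compute its push-forward measure PX.

```python
p = 0.3  # arbitrary bias for illustration; P({H}) = p

# Probability measure on the sample space Omega = {"H", "T"}.
P = {"H": p, "T": 1 - p}

def X(omega):
    """Bernoulli random variable: 1 on heads, 0 on tails."""
    return 1 if omega == "H" else 0

def PX(A):
    """Push-forward measure: PX(A) = P(X^{-1}(A)) for A a subset of {0, 1}."""
    return sum(prob for omega, prob in P.items() if X(omega) in A)

print(PX({1}), PX({0}), PX({0, 1}))  # p, 1 - p, 1.0
```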

Measurable functions can have other co-domains S, which define other random objects studied in probability and statistics.

- If S = R, the measurable function is called a random variable,

- If S = R^d, the measurable function is called a random vector,

- If S = R^{d×p}, the measurable function is called a random matrix,

- If S is the space of graphs with n vertices, the measurable function is a random graph.

Other objects such as random trees, random walks, random functions are defined in a similar manner by choosing an appropriate co-domain S suitable to the application.

Definition 4 (Cumulative distribution function). For a random variable X : Ω → R, the cumulative distribution function (CDF), denoted FX, is the function

FX : R → [0, 1],    defined as    FX(u) = PX({X ≤ u}) = P(X⁻¹((−∞, u])).

Example 2. Suppose we toss a fair coin twice, which leads to the sample space Ω = {HH, TT, HT, TH}, and all outcomes are equally likely, i.e., P({ω}) = 1/4 for all ω ∈ Ω.  Define the random variable X : Ω → R which counts the number of heads. Hence

X(ω) =  2   if ω = HH,
        1   if ω ∈ {HT, TH},
        0   if ω = TT,

so, PX({X = 0}) = 1/4, PX({X = 2}) = 1/4, and PX({X = 1}) = 1/2.  The CDF is given by the following function.

[Figure: the CDF FX, a step function equal to 0 for u < 0, 1/4 for 0 ≤ u < 1, 3/4 for 1 ≤ u < 2, and 1 for u ≥ 2.]
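The CDF above can be reproduced directly from the definition; a small Python sketch (not part of the notes) using exact fractions:

```python
from fractions import Fraction

# Two fair coin tosses, each outcome with probability 1/4.
P = {w: Fraction(1, 4) for w in ["HH", "HT", "TH", "TT"]}

def X(omega):
    """Number of heads in the outcome."""
    return omega.count("H")

def F(u):
    """CDF: F_X(u) = P({omega : X(omega) <= u})."""
    return sum(prob for omega, prob in P.items() if X(omega) <= u)

for u in [-1, 0, 0.5, 1, 1.5, 2, 3]:
    print(u, F(u))  # 0, 1/4, 1/4, 3/4, 3/4, 1, 1
```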

2.2    Properties of the CDF

Let FX  : R → [0, 1] be the CDF of a random variable X : Ω → R. Then FX  satisfies:

1. Non-decreasing: x1 < x2 implies that FX(x1) ≤ FX(x2).

2. Normalization:  lim_{x→−∞} FX(x) = 0, and lim_{x→+∞} FX(x) = 1.

3. Right-continuous: For any x ∈ R, FX(x) = lim_{y↓x} FX(y), i.e., the function value equals its limit from the right.

Note  2.  Conversely, if a function F  : R → [0, 1] satisfies the above 3 properties, there exists a random variable with F as its CDF.

Exercise 1. Verify that the above CDF in the figure satisfies all the properties listed above.

Remark  2.  (Technical!  Can be omitted.)  When the codomain is S = R^d, we need an additional condition for F : R^d → [0, 1] to be a valid CDF of a random vector.  For any rectangle A = (a1, b1] × (a2, b2] × ... × (ad, bd] ⊂ R^d, let C be the set of 2^d corners of A. Then the function must satisfy

Σ_{c∈C} sign(c) F(c) ≥ 0,

where sign(c) = (−1)^{number of ai coordinates in c}.

Example 3. Continuing Example 2 (tossing a fair coin twice), observe that we have PX({X = 1}) = 1/2. Using the last property of the CDF (right-continuity), we have another way of calculating this quantity.  Note that FX(1) = 3/4, whereas the left limit FX(1⁻) = 1/4, whereby we have PX({X = 1}) = FX(1) − FX(1⁻) = 3/4 − 1/4 = 1/2.

Definition 5 (Probability mass function). A random variable X is said to be discrete if the range of X is a countable subset of R, i.e., the set {X(ω) : ω ∈ Ω} is countable.  We define the probability mass function (PMF) of X as the function u ↦ PX({X = u}). The PMF satisfies

Σ_{k ∈ range(X)} PX({X = k}) = 1.                                        (PMF-normalization)

Note 3. The summation of the PMF being 1 is just the normalization property of P, since the sets {X⁻¹({k})} are all disjoint subsets of Ω and ∪_k X⁻¹({k}) = Ω, whereby

1 = P(Ω) = Σ_k P(X⁻¹({k})) = Σ_k PX({X = k}).

Remark  3.  It is a good idea to verify that the PMF sums to 1 whenever defining a PMF.

Example 4 (Geometric random variable). Let p ∈ (0, 1]. The random variable X ∈ N0 := N ∪ {0} counting the number of failures before the first success in a sequence of independent Bernoulli trials, with PMF PX({X = k}) = p(1 − p)^k for all k ∈ N0, is called a geometric random variable with parameter p.

To check normalization,

Σ_{k=0}^{∞} PX({X = k}) = p Σ_{k=0}^{∞} (1 − p)^k = p · 1/(1 − (1 − p)) = p · (1/p) = 1,

where we have used the formula for the infinite sum of a geometric series.
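A quick numerical sanity check of this normalization (a sketch, not from the notes; p = 0.3 and the truncation point are arbitrary choices):

```python
p = 0.3  # arbitrary success probability

def pmf(k, p):
    """Geometric PMF: probability of k failures before the first success."""
    return p * (1 - p) ** k

# Truncate the infinite series; the tail beyond k = 200 is negligible for p = 0.3.
print(sum(pmf(k, p) for k in range(200)))  # ~= 1.0
```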

Example 5 (Binomial random variable). Suppose we toss a fair coin n times and let X be the number of heads observed, so that Ω = {HH⋯H, HH⋯HT, ..., TT⋯T} is the set of all length-n sequences of H's and T's.  Observe that |Ω| = 2^n and all outcomes are equally likely, so

PX({X = k}) = C(n, k) / 2^n,       k = 0, 1, ..., n,

where C(n, k) = n!/(k!(n − k)!) is the binomial coefficient, i.e., the number of outcomes with exactly k heads.

If instead the tossed coin were biased where P({H}) = p, and P({T}) = 1−p, for some parameter p ∈ [0, 1], we have the PMF,

PX({X = k}) = C(n, k) p^k (1 − p)^{n−k},       k = 0, 1, ..., n.

Observe that the PMF satisfies the identity

Σ_{k=0}^{n} PX({X = k}) = Σ_{k=0}^{n} C(n, k) p^k (1 − p)^{n−k} = 1,

where the summation is the binomial expansion of (p + (1 − p))^n.
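This identity is easy to confirm numerically (a sketch, not from the notes; n = 10 and p = 0.3 are arbitrary choices):

```python
from math import comb

n, p = 10, 0.3  # arbitrary parameters for illustration

def pmf(k, n, p):
    """Binomial PMF: probability of exactly k heads in n biased tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

print(sum(pmf(k, n, p) for k in range(n + 1)))  # == 1.0 up to floating-point error
```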

Note 4.  It is important to note that p is a parameter that defines a distribution, whereas k ∈ R is a particular value that the random variable X : (Ω, F, P) → (R, B, PX) takes.

2.3    Expectation

The expectation of a function of a random variable is the average value that the function takes, accounting for the randomness.

Definition 6. For a function g : R → R and a discrete random variable X, the expectation of g(X) is defined as

Eg(X) := Σ_{k ∈ range(X)} g(k) PX({X = k}).

Remark  4.  The expectation is the best scalar approximation describing the distribution (for instance, the constant c = EX minimizes the mean squared error E(X − c)²).  In fact, as we will see in later lectures, if we are provided samples {Xi} from a distribution, then as the number of samples n grows large the sample average (1/n) Σ_{i=1}^{n} g(Xi) → Eg(X) almost surely, under mild conditions on g. This is called the law of large numbers.

Note  5.  The expectation is also known as the mean, often denoted by µX  or simply µ .

Example 6 (Bernoulli Distribution). Let the random variable X represent a binary coin flip on the sample space Ω = {H, T}, which comprises two possible outcomes: X(H) = 1 and X(T) = 0. The outcome of the coin flip is thus assigned a numerical value X ∈ {0, 1}. The probability of the event "heads" occurring is P({H}) = p and the probability of the event "tails" occurring is P({T}) = 1 − p.  The expected value of X equals the probability of the event "heads" occurring, given by

EX = 0 · PX({X = 0}) + 1 · PX({X = 1}) = 0 · (1 − p) + 1 · p = p
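Tying this back to Remark 4, the sample average of repeated Bernoulli draws approaches EX = p; a simulation sketch (not from the notes; p = 0.3 and the sample size are arbitrary choices):

```python
import random

p = 0.3          # arbitrary bias for illustration
n = 100_000      # number of simulated flips

samples = [1 if random.random() < p else 0 for _ in range(n)]
print(sum(samples) / n)  # close to EX = p, by the law of large numbers
```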

Example 7 (Geometric distribution). Following example 4, the geometric probability distribution models the number of failures before the first success in a sequence of Bernoulli trials. The expected value can be calculated by

EX = Σ_{k=0}^{∞} k · p(1 − p)^k = (1 − p)/p.

The formula tells us that, on average, we expect (1 − p)/p failures before the first success.

Proof.  Observe that, due to the normalization property of the PMF,

Σ_{k≥0} p(1 − p)^k = 1.

Differentiating with respect to p gives us that

Σ_{k≥0} (1 − p)^k + Σ_{k≥0} k p (1 − p)^{k−1} · (−1) = 0.

Thus we have

Σ_{k≥0} k p (1 − p)^{k−1} = Σ_{k≥0} (1 − p)^k = 1/p.

Multiplying both sides by (1 − p) gives EX = Σ_{k≥0} k p (1 − p)^k = (1 − p)/p, which proves the claim.
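The claim can also be checked numerically (a sketch, not from the notes; p = 0.3 and the truncation point are arbitrary choices):

```python
p = 0.3  # arbitrary success probability

# Truncated expectation sum: sum over k of k * p * (1 - p)^k.
approx = sum(k * p * (1 - p) ** k for k in range(1000))
print(approx, (1 - p) / p)  # both ~= 2.3333
```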

Property  1  (Linearity of Expectation). For any two functions f and g : R → R, and any two numbers α and β ∈ R, the expected value of the sum of the scaled functions is equal to the sum of the expected values of the scaled functions. Mathematically, this means

E(αf + βg)(X) = E(αf(X) + βg(X)) = α · Ef(X) + β · Eg(X).
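A small numerical illustration of linearity for a discrete random variable (a sketch, not from the notes; the PMF, f, g, α, and β below are arbitrary choices):

```python
# Arbitrary PMF of a discrete random variable X on {0, 1, 2}.
pmf = {0: 0.25, 1: 0.5, 2: 0.25}

def E(h):
    """Expectation of h(X) under the PMF above."""
    return sum(h(k) * prob for k, prob in pmf.items())

def f(x):
    return x ** 2

def g(x):
    return 3 * x + 1

alpha, beta = 2.0, -0.5

lhs = E(lambda x: alpha * f(x) + beta * g(x))
rhs = alpha * E(f) + beta * E(g)
print(lhs, rhs)  # equal: linearity of expectation
```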