Algorithmic Foundations of Learning
Problem Sheet 1
1.1 Slow Rate $1/\sqrt{n}$ (Question type: A)
1. Let $X_1, X_2, \ldots$ be i.i.d. random variables with $\mu = \mathbb{E} X_1$ and $\sigma^2 = \mathrm{Var}\, X_1$. Show that
$$\sqrt{\mathbb{E}\left[\left(\frac{1}{n}\sum_{i=1}^n X_i - \mu\right)^2\right]} = \frac{\sigma}{\sqrt{n}}. \tag{1.1}$$
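As a quick sanity check of (1.1), the identity can be verified exactly on a small discrete example. The sketch below (not part of the sheet) enumerates all $2^n$ outcomes for i.i.d. Bernoulli$(1/2)$ variables, for which $\mu = 1/2$ and $\sigma^2 = 1/4$:

```python
# Exact check of (1.1) for X_i ~ Bernoulli(1/2): mu = 1/2, sigma = 1/2.
# All 2^n equally likely outcomes are enumerated, so no sampling error.
from itertools import product
from math import sqrt, isclose

def rmse_of_sample_mean(n):
    """Exact sqrt(E[(sample mean - mu)^2]) for n i.i.d. Bernoulli(1/2)."""
    mu = 0.5
    total = 0.0
    for xs in product([0, 1], repeat=n):   # each outcome has probability 2^-n
        total += (sum(xs) / n - mu) ** 2
    return sqrt(total / 2 ** n)

sigma = 0.5
for n in (1, 2, 4, 8):
    assert isclose(rmse_of_sample_mean(n), sigma / sqrt(n))
```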
2. Let $Z$ be a real-valued random variable such that for any $u \geq 0$ we have $\mathbb{P}(|Z| \geq u) \leq 2e^{-u^2/(2\sigma^2)}$, for a positive constant $\sigma^2 > 0$ (this statement holds if $Z$ is Gaussian with mean $0$ and variance $\sigma^2$, as we will prove later on in the course). Prove that for all $p \geq 1$ we have
$$\left(\mathbb{E}[|Z|^p]\right)^{1/p} \leq 2\sqrt{\sigma^2}\sqrt{p}.$$
[Hint: Recall that for any non-negative random variable $X$ we have $\mathbb{E} X = \int_0^\infty \mathbb{P}(X > \varepsilon)\, d\varepsilon$. Recall also that the Gamma function is defined as $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\, dx$ for any $t > 0$. You might want to use that $\Gamma(t) \leq t^t$ for any $t > 0$ and that $p^{1/p} \leq 2$ for any $p > 0$.]
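For Gaussian $Z$ the absolute moments have a closed form, $\mathbb{E}|Z|^p = \sigma^p \, 2^{p/2}\, \Gamma\!\left(\frac{p+1}{2}\right) / \sqrt{\pi}$, so the claimed bound can be checked numerically in that special case. A minimal sketch (not part of the sheet; the right-hand side $2\sqrt{\sigma^2}\sqrt{p}$ is taken from the statement above):

```python
# Numeric check of (E|Z|^p)^{1/p} <= 2*sigma*sqrt(p) for Z ~ N(0, sigma^2),
# using the closed form E|Z|^p = sigma^p * 2^(p/2) * Gamma((p+1)/2) / sqrt(pi).
from math import gamma, sqrt, pi

def abs_moment_gaussian(p, sigma=1.0):
    """E|Z|^p for Z ~ N(0, sigma^2), via the Gamma-function formula."""
    return sigma ** p * 2 ** (p / 2) * gamma((p + 1) / 2) / sqrt(pi)

sigma = 1.5
for p in range(1, 21):
    lhs = abs_moment_gaussian(p, sigma) ** (1 / p)
    assert lhs <= 2 * sigma * sqrt(p)
```

The bound is loose for a standard Gaussian (e.g. $p = 2$ gives $\sigma \leq 2\sigma\sqrt{2}$), which is expected: it must hold for the whole tail-inequality class, not just Gaussians.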
3. Let $X_1, X_2, \ldots$ be i.i.d. Gaussian random variables with mean $\mu$ and variance $\sigma^2$. Show that for any $p \geq 2$ we have
$$\left(\mathbb{E}\left[\left|\frac{1}{n}\sum_{i=1}^n X_i - \mu\right|^p\right]\right)^{1/p} \leq \frac{c_p\, \sigma}{\sqrt{n}}, \tag{1.2}$$
where $c_p$ is a quantity dependent on $p$ (but independent of $n$) that you should state.
Along with the Central Limit Theorem, which is an asymptotic statement, the non-asymptotic statements (1.1) and (1.2) illustrate how the “slow” rate $1/\sqrt{n}$ permeates statistics. In machine learning, we will see that there are settings (essentially exploiting low noise or some convexity properties) where we can achieve the “fast” rate $1/n$, or be somewhere in between: $1/n^\alpha$ with $\alpha \in [1/2, 1]$.
It turns out that the inequality (1.2) holds not just for Gaussian random variables, but for a much bigger class of random variables called sub-Gaussian, which can be defined by the tail inequality $\mathbb{P}(|Z| \geq u) \leq 2e^{-u^2/(2\sigma^2)}$. Here, $\sigma^2$ is called the variance proxy. Sub-Gaussian random variables will be properly introduced later on in the course and will play a crucial role throughout.
Remark 1.1 (Monte Carlo) While the rate $1/\sqrt{n}$ is “slow”, it is the very reason why Monte Carlo approximation methods are used to approximate integrals of the form $\int f(x)p(x)\,dx$ for a given high-dimensional function $f : \mathbb{R}^d \to \mathbb{R}$ and density $p$. In fact, note that by the same arguments above, if we assume to have access to i.i.d. samples $X_1, \ldots, X_n$ from the density $p$, then the Monte Carlo estimate $\frac{1}{n}\sum_{i=1}^n f(X_i)$ yields
$$\left(\mathbb{E}\left[\left|\frac{1}{n}\sum_{i=1}^n f(X_i) - \int f(x)p(x)\,dx\right|^p\right]\right)^{1/p} \leq \frac{c_p\, \sigma}{\sqrt{n}}.$$
In this context, the Monte Carlo estimate is a good one as the rate $1/\sqrt{n}$ is independent of the dimension $d$. Note that deterministic “grid” methods to approximate the integral would give a dimension-dependent rate $1/n^{1/d}$, which is strictly worse than the Monte Carlo rate for $d \geq 3$ and yields the so-called curse of dimensionality: to have precision $\varepsilon$ one would need the number of samples $n$ to be of the order $(1/\varepsilon)^d$, which scales exponentially with $d$. Using Monte Carlo, one only needs $(1/\varepsilon)^2$ samples. In fact, there is active research on developing “quasi-Monte Carlo” methods to get rates close to the “fast” rate $1/n$.
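The dimension-independence of the Monte Carlo rate can be seen in a small experiment. The sketch below (not part of the sheet) uses an arbitrary test function, the coordinate average, and the uniform density on $[0,1]^d$, whose integral is exactly $0.5$ for every $d$:

```python
# Monte Carlo integration sketch: the error at fixed n stays comparable
# as the dimension d grows, illustrating the dimension-free 1/sqrt(n) rate.
import random

def mc_estimate(f, d, n, rng):
    """Average of f over n i.i.d. samples, each uniform on [0,1]^d."""
    return sum(f([rng.random() for _ in range(d)]) for _ in range(n)) / n

f = lambda x: sum(x) / len(x)   # coordinate average; true integral is 0.5
rng = random.Random(0)          # fixed seed for reproducibility
for d in (1, 10, 100):
    est = mc_estimate(f, d, n=2000, rng=rng)
    assert abs(est - 0.5) < 0.05
```

A grid method with the same budget of $2000$ points would place at most $2000^{1/d}$ points per axis, which for $d = 100$ is a single point per axis.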
1.2 Bayes Decision Rules (Question type: A)
Consider the setting of Section 1.2 in the Lecture Notes. Show that the following different choices of loss functions yield the corresponding Bayes decision rule.
1. In regression, the choice $\phi(\hat y, y) = (\hat y - y)^2$ leads to $a^\star(x) = \mathbb{E}[Y \mid X = x]$.
Hint: Prove that $\mathbb{E}\, \phi(a^\star(X), Y) \leq \mathbb{E}\, \phi(a(X), Y)$ for any $a \in \mathcal{A}$ by considering $\mathbb{E}[(a^\star(X) - a(X) + a(X) - Y)^2]$.
2. In classification, the choice $\phi(\hat y, y) = \mathbf{1}_{\hat y \neq y}$ leads to $a^\star(x) = \operatorname{argmax}_{\hat y \in \mathcal{Y}} \mathbb{P}(Y = \hat y \mid X = x)$.
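Both Bayes rules can be sanity-checked on a toy conditional distribution. In the sketch below (not part of the sheet) the pmf of $Y \mid X = x$ is made up for illustration; the squared loss should be minimized by the conditional mean and the 0-1 loss by the conditional mode:

```python
# Toy check of both Bayes rules on a made-up pmf for Y given X = x.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}   # P(Y = y | X = x), arbitrary example

# Squared loss: risk of predicting the constant c, minimized over a grid.
mean = sum(y * p for y, p in pmf.items())                     # conditional mean
sq_risk = lambda c: sum(p * (c - y) ** 2 for y, p in pmf.items())
grid = [i / 100 for i in range(0, 201)]
best_sq = min(grid, key=sq_risk)
assert abs(best_sq - mean) < 1e-6      # grid minimizer = E[Y | X = x]

# 0-1 loss: risk of predicting label c is P(Y != c), minimized by the mode.
zero_one_risk = lambda c: sum(p for y, p in pmf.items() if y != c)
best_01 = min(pmf, key=zero_one_risk)
assert best_01 == max(pmf, key=pmf.get)   # argmax of the conditional pmf
```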
1.3 Statements in Probability (Question type: B)
Let $\mathbb{P}(X \leq \varepsilon_1) \geq 1 - \delta_1$ and $\mathbb{P}(Y \leq \varepsilon_2) \geq 1 - \delta_2$. Show that $\mathbb{P}(X + Y \leq \varepsilon_1 + \varepsilon_2) \geq 1 - (\delta_1 + \delta_2)$.
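One way to sanity-check the statement is via complements: it is equivalent to bounding the probability of the sum exceeding the combined threshold. The sketch below (not part of the sheet) verifies this exhaustively on an arbitrary dependent discrete joint law, for every pair of thresholds:

```python
# Exhaustive check on a small discrete joint law that
# P(X + Y > e1 + e2) <= P(X > e1) + P(Y > e2) for all thresholds e1, e2.
from itertools import product

vals = [0, 1, 2, 3]
# An arbitrary (dependent) joint pmf on vals x vals, then normalized.
joint = {(x, y): x + 2 * y + 1 for x, y in product(vals, vals)}
z = sum(joint.values())
joint = {k: v / z for k, v in joint.items()}

for e1, e2 in product(vals, vals):
    lhs = sum(p for (x, y), p in joint.items() if x + y > e1 + e2)
    rhs = (sum(p for (x, y), p in joint.items() if x > e1)
           + sum(p for (x, y), p in joint.items() if y > e2))
    assert lhs <= rhs + 1e-12
```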
1.4 Excess Risk, Prediction Error and Estimation Error with the Square Loss (Question type: B)
Consider a supervised learning setting where $X = (X_1, \ldots, X_d) \in \mathbb{R}^d$ is a $d$-dimensional feature vector and $Y \in \mathbb{R}$ is its response. For any non-random predictor $a : \mathbb{R}^d \to \mathbb{R}$, consider the expected risk:
$$r(a) = \mathbb{E}[(a(X) - Y)^2].$$
1. Let $a^\star \in \operatorname{argmin}_a r(a)$. Show that the excess risk $r(a) - r(a^\star)$ can be written as
$$\underbrace{r(a) - r(a^\star)}_{\text{excess risk}} = \underbrace{\mathbb{E}[(a(X) - a^\star(X))^2]}_{\text{prediction error}}.$$
2. For any (possibly random) predictor $A : \mathbb{R}^d \to \mathbb{R}$, prove the following bias-variance decomposition:
$$\mathbb{E}\big[r(A) - r(a^\star)\big] = \underbrace{\mathbb{E}\Big[\big(\mathbb{E}[A(X) \mid X] - a^\star(X)\big)^2\Big]}_{\text{expected squared bias}} + \mathbb{E}\big[\mathrm{Var}[A(X) \mid X]\big].$$
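Conditionally on $X = x$, the decomposition reduces to the elementary identity $\mathbb{E}[(A - c)^2] = (\mathbb{E} A - c)^2 + \mathrm{Var}(A)$ for a random variable $A$ and a constant $c$. An exact check on a made-up pmf (not part of the sheet):

```python
# Pointwise bias-variance identity: E[(A - c)^2] = (E[A] - c)^2 + Var(A).
# Here pmf is an arbitrary distribution for A(x) and c stands in for a*(x).
pmf = {0.0: 0.3, 1.0: 0.4, 3.0: 0.3}
c = 0.7

mean = sum(a * p for a, p in pmf.items())
var = sum(p * (a - mean) ** 2 for a, p in pmf.items())
mse = sum(p * (a - c) ** 2 for a, p in pmf.items())
assert abs(mse - ((mean - c) ** 2 + var)) < 1e-9
```

Taking expectations over $X$ then gives the two terms of the displayed decomposition.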
Henceforth, consider the class of linear predictors given by $\mathcal{A} = \{a : a(x) = \langle x, w_a \rangle \text{ for some } w_a \in \mathbb{R}^d\}$, where $\langle x, w_a \rangle = x^\top w_a$ denotes the inner product between $x$ and $w_a$. Assume $Y = a^\star(X) + \xi$, where $a^\star(X) = \langle X, w_{a^\star} \rangle$ and $\xi$ is an independent random variable with mean $0$.
3. Show that the irreducible risk is $r(a^\star) = \mathrm{Var}(\xi)$.
4. Prove that for any $a \in \mathcal{A}$ defined as $a(x) = \langle x, w_a \rangle$ the following equivalence holds:
$$r(a) - r(a^\star) = (w_a - w_{a^\star})^\top m\, (w_a - w_{a^\star}),$$
where $m$ is a matrix that you should specify.
5. What does the previous result imply when $X$ is isotropic, i.e., $\mathbb{E}[X_i X_j] = \mathbf{1}_{i=j}$?
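By part 1, the excess risk equals the prediction error $\mathbb{E}[(a(X) - a^\star(X))^2]$, which for an isotropic discrete $X$ can be evaluated exactly by enumeration. The sketch below (not part of the sheet; the weight vectors are arbitrary) uses $X$ uniform on $\{-1, +1\}^d$, which satisfies $\mathbb{E}[X_i X_j] = \mathbf{1}_{i=j}$, and gives a way to check the answer you obtain in part 5:

```python
# X uniform on {-1,+1}^d is isotropic. Compute the prediction error
# E[<X, w_a - w_star>^2] exactly by enumerating all 2^d sign vectors.
from itertools import product
from math import isclose

d = 4
w_star = [0.5, -1.0, 2.0, 0.0]      # arbitrary "true" weights
w_a = [1.0, 0.0, 1.5, -0.5]         # arbitrary candidate weights
delta = [wa - ws for wa, ws in zip(w_a, w_star)]

excess = sum(sum(x * dl for x, dl in zip(xs, delta)) ** 2
             for xs in product([-1, 1], repeat=d)) / 2 ** d
# Compare the enumerated excess risk with your closed-form answer.
assert isclose(excess, sum(dl ** 2 for dl in delta))
```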
1.5 Examples of Rademacher Complexity (Question type: B)
Compute the Rademacher complexity Rad(T) of the following sets.
1. Let $T = \{t\}$ for a given $t \in \mathbb{R}^n$.
2. Let $T = \{(1, 3), (-2, 3)\} \subset \mathbb{R}^2$.
3. Let $T$ be the subset of $\{-1, 0, 1\}^n$ containing all elements with $k$-sparse components, i.e., with exactly $k$ components different from $0$.
4. For $c > 0$, let $T$ be the subset of $[-c, c]^n$ containing all elements with $k$-sparse components, i.e., with exactly $k$ components different from $0$.
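None of the four sets above is worked out here, but for small finite $T \subset \mathbb{R}^n$ the Rademacher complexity can be computed exactly by enumerating all $2^n$ sign vectors, which is handy for checking answers on toy examples. A sketch (not part of the sheet), assuming the definition $\mathrm{Rad}(T) = \frac{1}{n}\, \mathbb{E}\, \sup_{t \in T} \sum_{i=1}^n \varepsilon_i t_i$ with i.i.d. Rademacher $\varepsilon_i$:

```python
# Exact Rademacher complexity of a finite set T of vectors in R^n,
# by averaging the supremum over all 2^n equally likely sign vectors.
from itertools import product

def rad(T):
    n = len(T[0])
    total = 0.0
    for eps in product([-1, 1], repeat=n):
        total += max(sum(e * ti for e, ti in zip(eps, t)) for t in T)
    return total / (2 ** n * n)

# Example not taken from the sheet: T = {(1,0), (0,1)} in R^2.
assert abs(rad([(1, 0), (0, 1)]) - 0.25) < 1e-12
```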
1.6 Rademacher Complexity of a Finite Set (Question type: B)
Prove Massart’s lemma, Lemma 2.9 in the Lecture Notes.
Hint: Follow the proof strategy of Proposition 2.2 to prove $\mathrm{Rad}(T) \leq \max_{t \in T} \|t\|_2\, \frac{\sqrt{2 \log |T|}}{n}$ and use the properties of the Rademacher complexity under scalar translations, Proposition 2.6.
1.7 Contraction Property of Rademacher Complexity (Question type: C)
Prove Talagrand’s lemma, Lemma 2.10 in the Lecture Notes.
Hint:
1. First, prove that for any set $T \subset \mathbb{R}^2$ and for any 1-Lipschitz function $f : \mathbb{R} \to \mathbb{R}$ we have
$$\sup_{t \in T}\,(t_1 + f(t_2)) + \sup_{t \in T}\,(t_1 - f(t_2)) \leq \sup_{t \in T}\,(t_1 + t_2) + \sup_{t \in T}\,(t_1 - t_2).$$
2. Second, for any $k \in \{1, \ldots, n\}$, by conditioning on $\{\Omega_1, \ldots, \Omega_{k-1}, \Omega_{k+1}, \ldots, \Omega_n\}$ prove that
$$\mathbb{E} \sup_{t \in T} \sum_{i=1}^n \Omega_i f_i(t_i) \leq \mathbb{E} \sup_{t \in T} \left( \sum_{i \neq k} \Omega_i f_i(t_i) + \Omega_k t_k \right).$$
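Step 1 of the hint can be sanity-checked numerically for a particular 1-Lipschitz choice. The sketch below (not part of the sheet) uses $f = |\cdot|$ on randomly generated finite sets $T \subset \mathbb{R}^2$:

```python
# Numeric check of step 1 of the hint with the 1-Lipschitz f(x) = |x|:
# sup(t1 + f(t2)) + sup(t1 - f(t2)) <= sup(t1 + t2) + sup(t1 - t2).
import random

rng = random.Random(0)   # fixed seed for reproducibility
f = abs
for _ in range(1000):
    T = [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(5)]
    lhs = max(t1 + f(t2) for t1, t2 in T) + max(t1 - f(t2) for t1, t2 in T)
    rhs = max(t1 + t2 for t1, t2 in T) + max(t1 - t2 for t1, t2 in T)
    assert lhs <= rhs + 1e-9
```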
2022-07-23