
CSE 250A. Assignment 1

Out: Tues Oct 3

Due: Thurs Oct 12 (by 11:59 PM, Pacific Time, via gradescope)

Reading: Russell & Norvig, Chapter 13 (optional)

Grace period: 48 hours

1.0 Basics

You should submit your homework assignments via gradescope:

https://www.gradescope.com/courses/641197

Late assignments will be accepted without penalty during the 48-hour grace period after the due date; however, such assignments may not be graded in a timely fashion. No assignments will be accepted beyond the grace period. Here is a primer on submitting PDF homework via gradescope:

https://tinyurl.com/gradescope-guide

If you have not done this before, please allow some extra time to familiarize yourself with this process.

1.1 Conditioning on background evidence

It is often useful to consider the impact of specific events in the context of general background evidence, rather than in the absence of information.

(a) Denoting such evidence by E, prove the conditionalized version of the product rule:

P(X, Y |E) = P(X|Y, E)P(Y |E).

(b) Also, prove the conditionalized version of Bayes rule:

P(X|Y, E) = P(Y |X, E)P(X|E)/P(Y |E).

(c) Also, prove the conditionalized version of marginalization:

P(X|E)  =  Σy P(X, Y = y|E).

1.2 Conditional independence

Show that the following three statements about random variables X, Y, and E are equivalent:

(1)   P(X, Y |E)   =   P(X|E)P(Y |E)

(2)   P(X|Y, E)   =   P(X|E)

(3)   P(Y |X, E)   =   P(Y |E)

In other words, show that (1) implies (2) & (3), that (2) implies (1) & (3), and that (3) implies (1) & (2). You should become fluent with all these ways of expressing that X is conditionally independent of Y given E.


1.3 Creative writing

This problem does not involve any calculations: simply attach events to the binary random variables X, Y, and Z that are consistent with the following patterns of commonsense reasoning. You may use different events for the different parts of the problem. Also, please be creative: do not use the same events (e.g., burglaries, earthquakes, alarms) that were considered in lecture.

(a) Cumulative evidence:

P(X =1) < P(X =1|Y =1) < P(X =1|Y =1, Z =1)

(b) Explaining away:

P(X =1|Y =1)   >   P(X =1),

P(X =1|Y =1, Z =1)   < P(X =1|Y =1)

(c) Conditional independence:

P(X =1, Y =1)   ≠   P(X =1)P(Y =1),

P(X =1, Y =1|Z=1) = P(X =1|Z =1)P(Y =1|Z =1)

1.4 Bayes Rule

Suppose that 1% of competitive cyclists use performance-enhancing drugs and that a particular drug test has a 5% false positive rate and a 10% false negative rate.

(a) Let D ∈ {0, 1} indicate whether a cyclist is doping, and let T ∈ {0, 1} indicate the outcome of the drug test. Draw the belief network for these random variables, and use the information above to deduce the (conditional) probability tables for P(D) and P(T|D).

(b) Cyclist A tests negative for drug use. What is the probability that Cyclist A is not using drugs?

(c) Cyclist B tests positive for drug use. What is the probability that Cyclist B is using drugs?
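Although no programming is required here, the following minimal sketch (with variable names of my own choosing) shows how the numbers above can be plugged into Bayes rule as a sanity check on a hand calculation:

    # Sanity-check sketch for Problem 1.4 (illustrative only).
    # Numbers come from the problem statement: 1% of cyclists dope,
    # 5% false positive rate, 10% false negative rate.
    p_dope = 0.01                 # P(D = 1)
    p_pos_given_clean = 0.05      # P(T = 1 | D = 0), false positive rate
    p_neg_given_dope = 0.10       # P(T = 0 | D = 1), false negative rate

    # Marginal probability of each test outcome (marginalizing over D).
    p_pos = (1 - p_neg_given_dope) * p_dope + p_pos_given_clean * (1 - p_dope)
    p_neg = 1 - p_pos

    # Bayes rule: P(D | T) = P(T | D) P(D) / P(T).
    p_clean_given_neg = (1 - p_pos_given_clean) * (1 - p_dope) / p_neg   # part (b)
    p_dope_given_pos = (1 - p_neg_given_dope) * p_dope / p_pos           # part (c)

    print(p_clean_given_neg, p_dope_given_pos)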

1.5 Entropy

(a)  Let X be a discrete random variable with P(X = xi) = pi for i ∈ {1, 2,..., n}. The entropy H[X] of the random variable X is a measure of its uncertainty. It is defined as:

H[X]  =  −Σi pi log pi.

Show that the entropy H[X] is maximized when pi = 1/n. You should do this by computing the gradient with respect to pi and using Lagrange multipliers to enforce the constraint that Σi pi = 1. Later in the course, we will use similar calculations for learning probabilistic models.
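The proof itself is analytical, but as an optional numerical sanity check one can compare the entropy of the uniform distribution against randomly drawn distributions. The sketch below assumes numpy and is not part of the required solution:

    import numpy as np

    # Optional check (not a proof): among distributions over n outcomes,
    # the uniform distribution p_i = 1/n should have the largest entropy.
    def entropy(p):
        p = np.asarray(p, dtype=float)
        return -np.sum(p * np.log(p))        # natural log; assumes all p_i > 0

    n = 5
    uniform = np.full(n, 1.0 / n)
    rng = np.random.default_rng(0)
    samples = rng.dirichlet(np.ones(n), size=1000)   # random distributions

    assert all(entropy(p) <= entropy(uniform) + 1e-12 for p in samples)
    print(entropy(uniform), np.log(n))               # these two numbers agree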

(b)  The joint entropy of n discrete random variables (X1, X2 ,..., Xn) is defined as:

H(X1, X2, ..., Xn)  =  −Σx1 Σx2 ... Σxn P(x1, x2, ..., xn) log P(x1, x2, ..., xn),

where the sums range over all possible instantiations of X1, X2, ..., Xn. Show that if the variables Xi are independent, then their joint entropy is the sum of their individual entropies: namely,

P(x1, x2, ..., xn)  =  ∏_{i=1}^{n} P(xi)    implies    H(X1, X2, ..., Xn)  =  Σ_{i=1}^{n} H(Xi).
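Again as an optional illustration (not a substitute for the proof), this additivity can be checked numerically for two independent variables with made-up marginals:

    import numpy as np

    # Check H(X1, X2) = H(X1) + H(X2) when P(x1, x2) = P(x1) P(x2).
    def entropy(p):
        p = np.asarray(p, dtype=float)
        return -np.sum(p[p > 0] * np.log(p[p > 0]))

    p1 = np.array([0.2, 0.3, 0.5])      # made-up marginal for X1
    p2 = np.array([0.6, 0.4])           # made-up marginal for X2
    joint = np.outer(p1, p2)            # independence assumption

    print(entropy(joint), entropy(p1) + entropy(p2))   # the two values agree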

1.6 Kullback-Leibler distance

Often it is useful to measure the difference between two probability distributions over the same random variable. For example, as shorthand let

pi = P(X = i|E),        qi = P(X = i|E′)

denote the conditional distributions over the random variable X for different pieces of evidence E ≠ E′. Note that Σi pi = Σi qi = 1. The Kullback-Leibler (KL) distance between these distributions (also known as the relative entropy) is defined as:

KL(p,q)  =  Σi pi log(pi/qi).

(a)  Consider the natural logarithm (in base e).   By sketching graphs of log(x) and x - 1, verify the inequality:

log(x)  ≤  x − 1,

with equality if and only if x=1. Confirm this result by differentiation of log(x) - (x - 1).

(b)  Use the previous result to prove that KL(p,q)  ≥  0, with equality if and only if the two distributions pi and qi are equal.

(c)  Using the inequality in (a), as well as the simple equality log x = 2 log √x, derive the tighter lower bound:

KL(p,q)  ≥  Σi (√pi − √qi)².

(d)  Provide a counterexample to show that the KL distance is not a symmetric function of its arguments:

KL(p,q)  ≠  KL(q, p).

Despite this asymmetry, it is still common to refer to KL(p,q) as a measure of distance. Many algorithms for machine learning are based on minimizing KL distances between probability distributions.
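As a quick illustration of parts (b) and (d), the sketch below computes KL(p,q) for two made-up distributions; it is only a numerical check, not a proof:

    import numpy as np

    # Illustrative only: KL(p, q) is nonnegative, zero when p = q,
    # and in general not symmetric in its two arguments.
    def kl(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        return np.sum(p * np.log(p / q))     # assumes strictly positive entries

    p = np.array([0.9, 0.1])
    q = np.array([0.5, 0.5])

    print(kl(p, q), kl(q, p))    # both nonnegative, but unequal
    print(kl(p, p))              # exactly zero when the distributions coincide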

1.7 Mutual information

The mutual information I(X, Y) between two discrete random variables X and Y is defined as

I(X, Y)  =  Σx Σy P(x,y) log [ P(x,y) / (P(x) P(y)) ],

where the sum is over all possible values of the random variables X and Y. Note how the mutual information is related to the definitions in the previous two problems.

(a)  Show that the mutual information is nonnegative. (Hint: use the result of the previous problem.)

(b)  Show that the mutual information I(X, Y) vanishes if and only if X and Y are independent random variables. (Thus, I(X, Y) provides one quantitative measure of dependence between X and Y.)
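The following sketch (again optional, with made-up joint distributions) shows how I(X, Y) can be computed directly from a joint probability table and how it vanishes when the joint factorizes:

    import numpy as np

    # Illustrative only: compute I(X, Y) from a joint table P(x, y).
    def mutual_information(joint):
        joint = np.asarray(joint, dtype=float)
        px = joint.sum(axis=1, keepdims=True)    # marginal P(x)
        py = joint.sum(axis=0, keepdims=True)    # marginal P(y)
        mask = joint > 0
        return np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask]))

    dependent = np.array([[0.4, 0.1],
                          [0.1, 0.4]])                 # X and Y tend to agree
    independent = np.outer([0.5, 0.5], [0.3, 0.7])     # P(x, y) = P(x) P(y)

    print(mutual_information(dependent))     # strictly positive
    print(mutual_information(independent))   # zero (up to rounding)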

1.8 Compare and contrast

Consider the different belief networks (BNs) shown below for the discrete random variables X, Y, and Z.

[Figure: belief networks #1, #2, and #3 over the nodes X, Y, and Z]

(a)  Does the first belief network imply a statement of marginal or conditional independence that is not implied by the second? If yes, provide an example.

(b)  Does the second belief network imply a statement of marginal or conditional independence that is not implied by the third? If yes, provide an example.

(c)  Does the third belief network imply a statement of marginal or conditional independence that is not implied by the first? If yes, provide an example.

1.9 Hangman

Consider the belief network shown below, where the random variable W stores a five-letter word and the random variable Li  ∈ {A, B, . . . , Z} reveals only the word’s ith letter.  Also, suppose that these five-letter words are chosen at random from a large corpus of text according to their frequency:

P(W = w)  =  COUNT (w) / Σw′ COUNT (w′),

where COUNT (w) denotes the number of times that w appears in the corpus and where the denominator is a sum over all five-letter words.  Note that in this model the conditional probability tables for the random variables Li are particularly simple:

P(Li = ℓ|W = w)  =  { 1 if ℓ is the ith letter of w,
                      0 otherwise.

Now imagine a game in which you are asked to guess the word w one letter at a time. The rules of this game are as follows:  after each letter (A through Z) that you guess, you’ll be told whether the letter appears in the word and also where it appears. Given the evidence that you have at any stage in this game, the critical question is what letter to guess next.

Let’s work an example. Suppose that after three guesses (the letters D, I, M) you’ve learned that the letter I does not appear, and that the letters D and M appear as follows:

M   _   D   _   M

Now consider your next guess: call it ℓ. In this game the best guess is the letter ℓ that maximizes

P(L2 = ℓ or L4 = ℓ | L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M}).

In other words, pick the letter ℓ that is most likely to appear in the blank (unguessed) spaces of the word. For any letter ℓ we can compute this probability as follows:

P(L2 = ℓ or L4 = ℓ | L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M})

   =  Σw P(W = w, L2 = ℓ or L4 = ℓ | L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M})

   =  Σw P(W = w | L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M}) P(L2 = ℓ or L4 = ℓ | W = w),

where in the third line we have exploited the conditional independence (CI) of the letters Li  given the word W. Inside this sum there are two terms, and they are both easy to compute. In particular, the second term is more or less trivial:

P(L2 = ℓ or L4 = ℓ | W = w)  =  { 1 if ℓ is the second or fourth letter of w,
                                  0 otherwise.

And the first term we obtain from Bayes rule:


P(W = w | L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M})

   =  P(L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M} | W = w) P(W = w) / P(L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M}).

In the numerator of Bayes rule are two terms; the left term is equal to zero or one (depending on whether the evidence is compatible with the word w), and the right term is the prior probability P(W = w), as determined by the empirical word frequencies. The denominator of Bayes rule is given by:

P(L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M})

   =  Σw P(W = w, L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M})

   =  Σw P(W = w) P(L1 = M, L3 = D, L5 = M, L2 ∉ {D, I, M}, L4 ∉ {D, I, M} | W = w),

where again all the right terms inside the sum are equal to zero or one.  Note that the denominator merely sums the empirical frequencies of words that are compatible with the observed evidence.

Now let’s consider the general problem.  Let E denote the evidence at some intermediate round of the game: in general, some letters will have been guessed correctly and their places revealed in the word, while other letters will have been guessed incorrectly and thus revealed to be absent.  There are two essential computations. The first is the posterior probability, obtained from Bayes rule:

P(W = w|E)  =  P(E|W = w) P(W = w) / P(E).

The second key computation is the predictive probability, based on the evidence, that the letter ℓ appears somewhere in the word:

P(Li = ℓ for some i ∈ {1, 2, 3, 4, 5} | E)  =  Σw P(Li = ℓ for some i ∈ {1, 2, 3, 4, 5} | W = w) P(W = w | E).
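The problem statement above fully specifies these two computations; the sketch below is one possible way to organize them in code, assuming a dictionary word_counts that maps each five-letter word to its corpus count. The names word_counts, pattern, and guessed are my own, not prescribed by the problem, and following the worked example the predictive probability sums only over the blank positions:

    # Sketch only. Evidence E is encoded by `pattern` (revealed letters, with
    # None for blanks) and `guessed` (all letters guessed so far, right or wrong).

    def consistent(word, pattern, guessed):
        """Return True if the word is compatible with the evidence E."""
        for w_ch, p_ch in zip(word, pattern):
            if p_ch is not None and w_ch != p_ch:
                return False        # a revealed position must match exactly
            if p_ch is None and w_ch in guessed:
                return False        # a blank cannot hold an already-guessed letter
        return True

    def posterior(word_counts, pattern, guessed):
        """P(W = w | E): renormalize the counts of the compatible words."""
        weights = {w: c for w, c in word_counts.items()
                   if consistent(w, pattern, guessed)}
        total = sum(weights.values())
        return {w: c / total for w, c in weights.items()}

    def predictive(post, letter, pattern):
        """P(L_i = letter for some blank position i | E)."""
        return sum(prob for word, prob in post.items()
                   if any(p_ch is None and w_ch == letter
                          for w_ch, p_ch in zip(word, pattern)))

    # Example matching the worked round above (guesses D, I, M):
    #   pattern = ['M', None, 'D', None, 'M']
    #   guessed = {'D', 'I', 'M'}
    #   post = posterior(word_counts, pattern, guessed)
    #   best = max((l for l in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' if l not in guessed),
    #              key=lambda l: predictive(post, l, pattern))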