### ECE-GY 9123 / Spring 2021 Homework 5

**Please upload your assignments on or before April 26, 2021.**

• You are encouraged to discuss ideas with each other; but
• you must acknowledge your collaborator, and
• you must compose your own writeup and/or code independently.
• We require answers to theory questions to be written in LaTeX, and answers to coding questions in Python (Jupyter notebooks).
• Upload your answers in the form of a single PDF on Gradescope.

1. (**3 points**) *Policy gradients*. In class we derived a general form of policy gradients. Let us consider a special case here: a one-step setting in which past actions and states do not matter, and different actions ai give rise to different rewards Ri. Suppose the step size is η.

a. Define the policy π such that π(ai) = exp(θi) / Σj exp(θj), i.e., the softmax over the parameters θ1, . . . , θk, where k is the total number of actions and θi is a scalar parameter encoding the value of action ai. Show that if action ai is sampled, then the change in the parameters in REINFORCE is given by:

∆θi = ηRi(1 − π(ai)).

b. Intuitively explain the dynamics of the above gradient updates.
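As a sanity check on the update in part (a), the identity behind it, namely that the derivative of log π(ai) with respect to θi equals 1 − π(ai) for a softmax policy, can be verified numerically. The sketch below (parameter values and the finite-difference step are illustrative, not part of the problem) compares a finite-difference estimate of that derivative against the closed form:

```python
import math

# Softmax policy over k actions with scalar parameters theta_i.
def softmax(theta):
    m = max(theta)                      # subtract max for numerical stability
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

# Finite-difference check that d/d(theta_i) log pi(a_i) = 1 - pi(a_i),
# which yields the REINFORCE update  delta(theta_i) = eta * R_i * (1 - pi(a_i)).
theta = [0.5, -0.2, 1.0]                # illustrative parameter values
i, eps = 2, 1e-6
pi = softmax(theta)

theta_plus = list(theta)
theta_plus[i] += eps
grad_fd = (math.log(softmax(theta_plus)[i]) - math.log(pi[i])) / eps

assert abs(grad_fd - (1 - pi[i])) < 1e-4
```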

2. (**3 points**) *Designing rewards in Q-learning*. Suppose we are trying to solve a maze containing a goal and a (stationary) monster in some location, and the objective is to reach the goal in the minimum number of moves. We are tasked with designing a suitable reward function for Q-learning. There are two options:

a. We declare a reward of +2 for reaching the goal, -1 for running into a monster, and 0 for every other move.

b. We declare a reward of +1.5 for reaching the goal, -1.5 for running into a monster, and -0.5 for every other move.

Which of these reward functions might lead to better policies?

(*Hint: For a general case, how does the expected discounted return change if a constant offset is added to all rewards?*)

3. (**4 points**) Open the (incomplete) Jupyter notebook provided as an attachment to this homework in Google Colab (or another environment of your choice) and complete the missing items. Save your finished notebook in PDF format and upload it, along with your answers to the theory questions above, as a single PDF.

2021-04-14