ECE-GY 9123 / Spring 2021

Homework 5

• You are encouraged to discuss ideas with each other, but

• you must acknowledge your collaborator, and

• you must compose your own writeup and/or code independently.

• We require answers to theory questions to be written in LaTeX, and answers to coding questions in Python (Jupyter notebooks).

1. (3 points) Policy gradients. In class we derived a general form of the policy gradient. Let us consider a special case here. Suppose the step size is η, and consider the setting where past states and actions do not matter; each action ai yields a reward Ri.

a. Define the policy π such that π(ai) = exp(θi) / Σj exp(θj) (i.e., the softmax of θ = (θ1, . . . , θk)) for i = 1, . . . , k, where k is the total number of actions and θi is a scalar parameter encoding the value of action ai. Show that if action ai is sampled, then the change in its parameter under REINFORCE is given by:

∆θi = ηRi(1 − π(ai)).
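As a sanity check (not a substitute for the derivation you are asked to write), the identity behind this update, ∂/∂θi log π(ai) = 1 − π(ai), can be verified numerically with a finite-difference estimate; the values of k, θ, and the sampled index below are arbitrary illustrative choices:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability; softmax is shift-invariant
    e = np.exp(z)
    return e / e.sum()

k = 4
theta = np.array([0.5, -1.0, 2.0, 0.0])  # arbitrary scalar parameters
i = 2                                    # suppose action a_i was sampled
pi = softmax(theta)

# Finite-difference estimate of d/d theta_i of log pi(a_i)
eps = 1e-6
theta_plus = theta.copy()
theta_plus[i] += eps
grad_fd = (np.log(softmax(theta_plus)[i]) - np.log(pi[i])) / eps

# Matches the score function 1 - pi(a_i), so the REINFORCE update
# for the sampled action is eta * R_i * (1 - pi(a_i)).
print(np.isclose(grad_fd, 1 - pi[i], atol=1e-4))
```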

2. (3 points) Designing rewards in Q-learning. Suppose we are trying to solve a maze containing a goal and a (stationary) monster at fixed locations, and the objective is to reach the goal in the minimum number of moves. We are tasked with designing a suitable reward function for Q-learning. There are two options:

a. We declare a reward of +2 for reaching the goal, -1 for running into a monster, and 0 for every other move.

b. We declare a reward of +1.5 for reaching the goal, -1.5 for running into a monster, and -0.5 for every other move.

Which of these reward functions might lead to better policies?

(Hint: For a general case, how does the expected discounted return change if a constant offset is added to all rewards?)
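If it helps to experiment, here is a minimal tabular Q-learning sketch on a toy 1-D corridor showing where the two candidate reward schemes plug in. The corridor layout, start state, and hyperparameters are illustrative assumptions, not part of the problem statement, and the sketch deliberately does not answer which scheme is better:

```python
import numpy as np

N = 6                  # states 0..5; monster at 0, goal at 5 (assumed layout)
GOAL, MONSTER = 5, 0
START = 3
ACTIONS = [-1, +1]     # move left / move right

def reward(s_next, scheme):
    if s_next == GOAL:
        return scheme["goal"]
    if s_next == MONSTER:
        return scheme["monster"]
    return scheme["step"]

def q_learn(scheme, episodes=2000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N, len(ACTIONS)))
    for _ in range(episodes):
        s = START
        while s not in (GOAL, MONSTER):
            # epsilon-greedy action selection
            a = rng.integers(len(ACTIONS)) if rng.random() < eps else Q[s].argmax()
            s_next = min(max(s + ACTIONS[a], 0), N - 1)
            r = reward(s_next, scheme)
            # terminal states contribute no bootstrapped value
            target = r if s_next in (GOAL, MONSTER) else r + gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

scheme_a = {"goal": 2.0, "monster": -1.0, "step": 0.0}
scheme_b = {"goal": 1.5, "monster": -1.5, "step": -0.5}
Q_a, Q_b = q_learn(scheme_a), q_learn(scheme_b)
```

Comparing the learned Q-tables (and the greedy paths they induce) under the two schemes is one way to build intuition for the hint about constant reward offsets.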

3. (4 points) Open the (incomplete) Jupyter notebook provided as an attachment to this homework in Google Colab (or another environment of your choice) and complete the missing items. Save your finished notebook in PDF format and upload it, along with your answers to the above theory questions, as a single PDF.