Assignment 7: Reinforcement Learning (Non-Computational)
Assigned Tuesday, April 8th
Due Thursday, April 17th, by 11:59pm - submit electronically.


Questions

(To be answered based on the lectures and readings--you should not have to search the internet for the answers.)
  1. Why do we think that exponential discounting is not the way humans discount future rewards? As part of your answer, describe how procrastination indicates that exponential discounting is incorrect.
  2. In what ways do animals trade off exploration versus exploitation of their environment?
  3. How is prediction error encoded in the brain?

Q-learning

Consider a Q-learning agent operating in the gridworld environment from the lectures. The current Q-values are shown in Figure (a) below. In one episode, the agent executes the actions shown in Figure (b). The start state is specified by the tail of the arrow in Figure (b) (bottom row, column 3). The intermediate states are the middle and top rows of column 3. The final state (shown as the head of the arrow) is the goal state with value +1.

Assume that the immediate reward for taking any action from any state is zero: R(s, a, s') = 0. Assume a discount factor gamma = 0.9 and a learning rate alpha = 0.5.
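For reference, the tabular Q-learning update the assignment asks you to apply can be sketched as below. The state/action names and Q-values in the usage example are placeholders for illustration, not the values from Figure (a); you must read those off the figure yourself.

```python
GAMMA = 0.9   # discount factor from the problem statement
ALPHA = 0.5   # learning rate from the problem statement

def q_update(q, s, a, r, s_next, actions):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    q is a dict mapping (state, action) pairs to Q-values;
    actions lists the actions available in s_next."""
    sample = r + GAMMA * max(q[(s_next, a2)] for a2 in actions)
    q[(s, a)] = q[(s, a)] + ALPHA * (sample - q[(s, a)])
    return q[(s, a)]

# Hypothetical example: with Q(s,a) = 0, r = 0, and max Q(s',a') = 0.5,
# the update gives 0 + 0.5 * (0 + 0.9 * 0.5 - 0) = 0.225.
q = {((0, 0), 'up'): 0.0, ((0, 1), 'up'): 0.5, ((0, 1), 'down'): 0.2}
q_update(q, (0, 0), 'up', 0, (0, 1), ['up', 'down'])
```

Because R(s, a, s') = 0 everywhere in this problem, each update propagates only the discounted maximum Q-value of the successor state back toward the start of the episode.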

Q-values in Gridworld
Figure (a): Q-values after 1000 episodes on the gridworld (from lecture slides).

episode in Gridworld
Figure (b): A new episode starting from the previous gridworld Q-values (Figure (a)).

Questions:

  1. Why did we not specify the model (the transition function) for the problem? (2 points)
  2. Work out the updates to the Q-values for the episode depicted in Figure (b), starting from the initial Q-values shown in Figure (a). (8 points)