Consider a Q-learning agent operating on the gridworld environment from the lectures. The current Q-values are shown in Figure (a) below. In one episode, the agent executes the actions shown in Figure (b). The start state is specified by the tail of the arrow in Figure (b) (bottom row, column 3); the intermediate states are the middle and top rows of column 3; and the final state (shown as the head of the arrow) is the goal state with value +1.
Assume that the immediate reward for taking any action from any state is zero, i.e., R(s,a,s') = 0. Assume a discount factor gamma = 0.9 and a learning rate alpha = 0.5.
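For reference, these parameters plug into the standard tabular Q-learning update rule (this is the general form, not a value from the figures):

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha \Big[ R(s,a,s') + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big]
```

With R(s,a,s') = 0, gamma = 0.9, and alpha = 0.5 as assumed above, each update simplifies to Q(s,a) <- 0.5 Q(s,a) + 0.45 max_a' Q(s',a').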
Figure (a). Q-values after 1000 Episodes on the Gridworld (from lecture slides)
Figure (b): A new episode starting from the previous gridworld Q-values (Figure (a)).
Questions:
Work out the updates to the Q-values for the episode depicted in Figure (b), starting from the Q-values shown in Figure (a). (8 points)
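The update you are asked to compute can be checked numerically. The sketch below implements the tabular Q-learning update under the stated assumptions (R = 0, gamma = 0.9, alpha = 0.5); the input Q-values in the example call are placeholders, not the actual numbers from Figure (a), which you should substitute from the lecture slides.

```python
GAMMA = 0.9   # discount factor, as given
ALPHA = 0.5   # learning rate, as given

def q_update(q_sa, reward, max_q_next):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    """
    return q_sa + ALPHA * (reward + GAMMA * max_q_next - q_sa)

# Illustrative call only -- 0.5 and 1.0 are placeholder values, not
# the Figure (a) Q-values. For the last transition into the goal,
# max_a' Q(s',a') would be the goal's value of +1.
print(q_update(0.5, 0.0, 1.0))
```

Apply `q_update` once per transition in the episode, in order, using each state's current Q-value for the taken action and the max Q-value of the successor state.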