MDP solvers
and reinforcement learning.
Make a robot crawl.
In this checkpoint, you will experiment with both value iteration for known MDPs and Q-learning for reinforcement learning. You will test your systems on a simple Gridworld domain, but also apply them to the task of teaching a simple simulated robot to crawl.
The code for this checkpoint contains the following files, available as a zip archive.
| Checkpoint files: | |
agent.py |
The file in which you will write your agents. |
mdp.py |
Abstract class for general MDPs. |
environment.py |
Abstract class for general reinforcement learning environments (compare to mdp.py). |
gridworld.py |
The Gridworld code and test harness. |
graphicsGridworldDisplay.py |
Plug-in for the Gridworld graphical display. You can ignore this file entirely. |
graphicsUtils.py |
GUI code. You can ignore this file entirely. |
textGridworldDisplay.py |
Plug-in for the Gridworld text interface. You can ignore this file entirely. |
crawler.py |
The crawler code and test harness. You will run this but not edit it, and so can ignore the contents. |
graphicsCrawlerDisplay.py |
GUI for the crawler robot. You can ignore this file entirely. |
What to submit: You will fill in portions of (only) agent.py
during the assignment. You should submit this file along with a
text or pdf file containing answers to the discussion questions. If
you change other files, submit those as well, but you should not need to.
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. Your answers to discussion questions will also be graded.
Academic Dishonesty: We will be checking your code against other submissions in the class for logical redundancy. If you copy someone else's code and submit it with minor changes, we will know. These cheat detectors are quite hard to fool, so please don't try. We trust you all to submit your own work only; please don't let us down. Instead, contact the course staff if you are having trouble.
To get started, run the Gridworld harness in interactive mode:
python gridworld.py -m
You will see the two-exit Gridworld from class. Your agent's position
is given by the blue dot, and you can move with the arrow keys. Notice
that the agent's value estimates are shown, and are all zero. Manual
control may be a little frustrating if the noise level is not turned down (-n), since
you will sometimes move in an unexpected direction. Such is the life of a
Gridworld agent! You can control many aspects of the simulation. A
full list is available by running:
python gridworld.py -h
You check out the other grids, change the noise or discount, change the
number of episodes to run and so on. If you drop the manual flag (-m) you
will get the RandomAgent by default. Try:
python gridworld.py -g MazeGrid
You should see the random agent bounce around the grid until it happens upon the exit. Not the finest hour for an AI agent; we will build better ones soon.
Next, either use the text interface (-t) or look at the console output that
accompanies the graphical output. Do a manual run through any grid you
like, and notice that unless you specify quiet (-q) output, you will be told
about each transition the agent experiences. As in previous code,
coordinates are in (row, col) format and any arrays are indexed by [row][col],
with 'north' being the direction of decreasing row, etc. By default, most
transitions will receive a reward of zero, though you can change this with the
living reward option (-r). Note particularly, that the MDP is such that
you first must enter a pre-terminal state and then take the special 'exit' action before the
episode actually ends (in the true terminal state (-1, -1)). Your total
return may have been less than you
expected, due to the discount rate (-d).
You should definitely look at agent.py, mdp.py, and environment.py closely,
and investigate parts of gridworld.py as needed. The support code should
be ignored.
Hint: The util.Counter class in util.py will make your life much
easier in this assignment. It acts like a dictionary, but has a getCount()
method which returns zero for items not in the Counter (rather than raising an
exception like a dictionary).
Question 1 (10 points) Write a value
iteration agent in ValueIterationAgent, which has been partial specified for you
in agent.py. You can select this agent with '-a value'. Your value
iteration agent is an offline planner, not a reinforcement agent, and so the
relevant training option is the number of iterations of value iteration it
should run (-i). It should take an MDP on construction and run value
iteration for the specified iterations on that MDP before the constructor
returns. Recall that value iteration computes estimates of the optimal
values. From these value estimates, you should synthesize responses to
getPolicy(state) and getQValue(state, action). (If your implementation is
such that the policy or q-values are precomputed, you may simply return
them.) You may assume that 100 iterations is enough for
convergence in the questions below. (Note: to actually run your agent with the learned policy,
use the -k option; press enter to cycle through viewing the learned values,
q-values and policy execution.)
Written short questions (two sentence answers, max!):
MazeGrid becomes non-zero? Why?BridgeGrid with the default discount of 0.9
and the default noise of 0.2. Which one of these two parameters must
we change before the agent dares to cross the bridge, and to what value?DiscountGrid, give parameter values which produce the following
optimal policy types or state that they are impossible:
MazeGrid, with the default parameters, compare the value estimate
of the start state after 100 iterations to the empirical returns from
running 10 episodes (-k 10 -q). How close is the value estimate to the
empirical average discounted rewards? How about if we run 10000 episodes (-k 10000 -q)? Does the value estimate
predict the long-term average discounted rewards correctly?Note that your value iteration agent does not actually learn from experience. Rather, it ponders its knowledge of the MDP to arrive at an optimal policy before ever interacting with a real environment. This distinction may be subtle in a simulated environment like a Gridword, but it's very important in the real world, where the real MDP is not available.
Question 2 (9 points) You will now write a
Q-learning agent, which does very little on construction, but then learns by
trial and error interactions with the environment through its update(state,
action, nextState, reward) method. A stub of a Q-learner is specified in
QLearningAgent in agent.py, and you can select it with
the option '-a q'. You should first
run your Q-learner through several episodes under manual control (e.g. '-k 5
-m'), for example on the MazeGrid. Watch how the agent learns about the
state it was just in and 'leaves learning in its wake.' Your agent should
be an epsilon-greedy learner, meaning it chooses random actions epsilon of the
time, and follows its current best q-values otherwise.
Written short questions:
-n 0.0) for
100 episodes. How do the learned q-values compare to those of the
value iteration agent? Why will your agent usually not learn the
optimal policy?Question 3 (1 point) Try the following command:
python crawler.py
This will invoke the crawling robot from class using your Q-learner. Play around with the various learning parameters to see how they affect the agent's policies and actions. Your formal requirement for this question is simply to ensure that your Q-learner works in this non-episodic environment.