MDP solvers and reinforcement learning. Make a robot crawl.
In this project, you will experiment with both value iteration for known MDPs and Q-learning for reinforcement learning. You will test your systems on a simple Gridworld domain, but also apply them to the task of teaching a simple simulated robot to crawl.
The code for this project contains the following files:
agent.py - The file in which you will write your agents.
mdp.py - Abstract class for general MDPs.
environment.py - Abstract class for general reinforcement learning environments (compare to mdp.py).
gridworld.py - The Gridworld code and test harness.
graphicsGridworldDisplay.py - Plug-in for the Gridworld graphical display. You can ignore this file entirely.
graphicsUtils.py - GUI code. You can ignore this file entirely.
textGridworldDisplay.py - Plug-in for the Gridworld text interface. You can ignore this file entirely.
crawler.py - The crawler code and test harness. You will run this but not edit it, and so can ignore the contents.
graphicsCrawlerDisplay.py - GUI for the crawler robot. You can ignore this file entirely.
util.py - Utilities. The Counter may be useful, as usual. It now provides a new method, argMaxFair(), which may come in handy for your Q-learner.
What to submit: You will fill in portions of (only) agent.py during the assignment. You should submit this file along with a text or pdf file containing answers to the discussion questions. If you change other files, submit those as well, but you should not need to.
Evaluation: Your code will be autograded for technical correctness. Please do not change the names of any provided functions or classes within the code, or you will wreak havoc on the autograder. Your answers to discussion questions will also be graded.
Academic Dishonesty: We trust you all to submit your own work only; please don't let us down. Instead, contact the course staff if you are having trouble.
To get started, run the Gridworld harness in interactive mode:
python gridworld.py -m

You will see the two-exit Gridworld from class. Your agent's position is given by the blue dot, and you can move with the arrow keys. Notice that the agent's value estimates are shown, and are all zero. Manual control may be a little frustrating if the noise level is not turned down (-n), since you will sometimes move in an unexpected direction. Such is the life of a Gridworld agent! You can control many aspects of the simulation. A full list is available by running:

python gridworld.py -h
You can check out the other grids, change the noise or discount, change the number of episodes to run, and so on. If you drop the manual flag (-m) you will get the RandomAgent by default. Try:

python gridworld.py -g MazeGrid
You should see the random agent bounce around the grid until it happens upon the exit. Not the finest hour for an AI agent; we will build better ones soon.
Next, either use the text interface (-t) or look at the console output that accompanies the graphical output. Do a manual run through any grid you like, and notice that unless you specify quiet (-q) output, you will be told about each transition the agent experiences. As in previous code, coordinates are in (row, col) format and any arrays are indexed by [row][col], with 'north' being the direction of decreasing row, etc. By default, most transitions will receive a reward of zero, though you can change this with the living reward option (-r). Note particularly that the MDP is such that you must first enter a pre-terminal state and then take the special 'exit' action before the episode actually ends (in the true terminal state (-1, -1)). Your total return may have been less than you expected, due to the discount rate (-d).
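To see why, recall that an episode's return discounts each successive reward by another factor of the discount rate. A minimal, self-contained sketch (illustration only, not project code):

# Discounted return: G = r_0 + d*r_1 + d^2*r_2 + ...
def discountedReturn(rewards, discount):
    total = 0.0
    for t, r in enumerate(rewards):
        total += (discount ** t) * r
    return total

# Example: a five-step episode whose only reward is the +1 exit reward.
# With a discount of 0.9 the return is 0.9**4 = 0.6561, not 1.0.
print(discountedReturn([0, 0, 0, 0, 1], 0.9))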
You should definitely look at agent.py, mdp.py, and environment.py closely, and investigate parts of gridworld.py as needed. The remaining support code can be ignored.
Hint: The util.Counter class in util.py will make your life much easier in this assignment. It acts like a dictionary, but has a getCount() method which returns zero for items not in the Counter (rather than raising an exception like a dictionary).
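For instance, q-values can be stored in a Counter keyed by (state, action) pairs. The snippet below is only an illustration: getCount() and argMaxFair() are the methods described in this handout, while the dictionary-style writes are an assumption based on "acts like a dictionary," so check util.py before relying on them.

from util import Counter

qValues = Counter()                                  # q-values keyed by (state, action)

# Reads of unseen keys return zero instead of raising a KeyError:
print(qValues.getCount(('someState', 'north')))      # -> 0

# Dictionary-style write (assumed from "acts like a dictionary"):
qValues[('someState', 'north')] = 0.5

# argMaxFair() is handy for greedy action selection:
actionValues = Counter()
for action in ['north', 'south', 'east', 'west']:
    actionValues[action] = qValues.getCount(('someState', action))
print(actionValues.argMaxFair())                     # -> 'north'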
Question 1 (10 points) Write a value iteration agent in ValueIterationAgent, which has been partially specified for you in agent.py. You can select this agent with '-a value'. Your value iteration agent is an offline planner, not a reinforcement learning agent, and so the relevant training option is the number of iterations of value iteration it should run (-i). It should take an MDP on construction and run value iteration for the specified number of iterations on that MDP before the constructor returns. Recall that value iteration computes estimates of the optimal values. From these value estimates, you should synthesize responses to getPolicy(state) and getQValue(state, action). (If your implementation is such that the policy or q-values are precomputed, you may simply return them.) You may assume that 100 iterations is enough for convergence in the questions below. (Note: to actually run your agent with the learned policy, use the -k option; press enter to cycle through viewing the learned values, q-values and policy execution.)
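For reference, each iteration applies the Bellman backup V_{k+1}(s) = max_a sum_{s'} T(s, a, s') [R(s, a, s') + gamma V_k(s')] to every state. The sketch below shows one way the loop can be organized; the MDP accessor names (getStates, getPossibleActions, getTransitionStatesAndProbs, getReward) are assumptions about the interface, so match them against what mdp.py actually provides.

# A sketch of batch value iteration over a known MDP (illustration only).
def runValueIteration(mdp, discount, iterations):
    values = {}                                  # V_k(s); missing states count as zero
    for _ in range(iterations):
        newValues = {}                           # compute V_{k+1} from a frozen copy of V_k
        for state in mdp.getStates():
            actions = mdp.getPossibleActions(state)
            if not actions:                      # terminal state: value stays zero
                continue
            bestQ = None
            for action in actions:
                q = 0.0
                for nextState, prob in mdp.getTransitionStatesAndProbs(state, action):
                    reward = mdp.getReward(state, action, nextState)
                    q += prob * (reward + discount * values.get(nextState, 0.0))
                if bestQ is None or q > bestQ:
                    bestQ = q
            newValues[state] = bestQ
        values = newValues
    return values

With values in hand, getQValue(state, action) corresponds to the inner sum for a single action, and getPolicy(state) is the action achieving the max.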
Written short questions (two sentence answers, max!):
1. How many iterations of value iteration are needed before the value estimate of the start state of MazeGrid becomes non-zero? Why?
2. Consider BridgeGrid with the default discount of 0.9 and the default noise of 0.2. Which one of these two parameters must we change before the agent dares to cross the bridge, and to what value?
3. On DiscountGrid, give parameter values which produce the following optimal policy types or state that they are impossible:
4. On MazeGrid, with the default parameters, compare the value estimate of the start state after 100 iterations to the empirical returns from running 10 episodes (-k 10 -q). How close is the value estimate to the empirical average discounted rewards? How about if we run 10000 episodes (-k 10000 -q)? Does the value estimate predict the long-term average discounted rewards correctly?

Note that your value iteration agent does not actually learn from experience. Rather, it ponders its knowledge of the MDP to arrive at an optimal policy before ever interacting with a real environment. This distinction may be subtle in a simulated environment like a Gridworld, but it's very important in the real world, where the real MDP is not available.
Question 2 (8 points) You will now write a Q-learning agent, which does very little on construction, but then learns by trial and error interactions with the environment through its update(state, action, nextState, reward) method. A stub of a Q-learner is specified in QLearningAgent in agent.py, and you can select it with the option '-a q'. You should first run your Q-learner through several episodes under manual control (e.g. '-k 5 -m'), for example on MazeGrid. Watch how the agent learns about the state it was just in and 'leaves learning in its wake.' Your agent should be an epsilon-greedy learner, meaning it chooses a random action an epsilon fraction of the time, and follows its current best q-values otherwise.
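For reference, after observing a transition (s, a, s', r), Q-learning performs the update Q(s, a) <- (1 - alpha) Q(s, a) + alpha [r + gamma max_{a'} Q(s', a')]. The helpers below are only a sketch: the parameter names and the legal-action arguments are illustrative rather than the stub's actual interface, and the dictionary-style write on the Counter is an assumption to verify against util.py.

import random
from util import Counter

# One Q-learning update on a Counter of q-values keyed by (state, action).
def qLearningUpdate(qValues, state, action, nextState, reward,
                    legalNextActions, alpha, discount):
    nextQs = [qValues.getCount((nextState, a)) for a in legalNextActions]
    sample = reward + discount * (max(nextQs) if nextQs else 0.0)
    old = qValues.getCount((state, action))
    qValues[(state, action)] = (1 - alpha) * old + alpha * sample

# Epsilon-greedy action selection from the current q-values.
def epsilonGreedy(qValues, state, legalActions, epsilon):
    if random.random() < epsilon:
        return random.choice(legalActions)       # explore
    actionValues = Counter()
    for a in legalActions:
        actionValues[a] = qValues.getCount((state, a))
    return actionValues.argMaxFair()             # exploit, breaking ties fairly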
Written short questions:
Run your Q-learning agent on MazeGrid with the noise turned off (-n 0.0) for 100 episodes. How do the learned q-values compare to those of the value iteration agent? Why will your agent usually not learn the optimal policy?

Question 3 (2 points) Try the following command:
python crawler.py
This will invoke the crawling robot from class using your Q-learner. Play around with the various learning parameters to see how they affect the agent's policies and actions. Note that the step delay is a parameter of the simulation, whereas the learning rate and epsilon are parameters of your learning algorithm, and the discount factor is a property of the environment.

For a fixed discount factor, how can you set (or vary) the other parameters so that your agent quickly learns an optimal policy? How does the optimal policy (i.e. gait) learned depend on the discount factor, both qualitatively and quantitatively (in terms of average velocity)? Hint: you probably want to turn epsilon down to near zero before assessing what your agent has learned.