\documentclass{article}
\usepackage{cs294-40}
\include{cs294-40-macros}
\begin{lecture}{15}{Policy Gradient Methods}{Fernando Garcia Bermudez}{10/16/2008}
\section{Lecture outline}
\begin{itemize}
\item Introduction
\item Finite difference methods
\item Likelihood ratio methods
\end{itemize}
\section{Introduction}
Let's choose a parametrized policy class $\{\pi_\theta : \theta \in
\mathbb{R}^d\}$. As an example, we will consider helicopter flight,
with a policy parameterization like the one used by Ng
et al. (2004):
\begin{itemize}
\item $x,y,z$: $x$ points forward along the helicopter, $y$ sideways to the right, $z$ downward.
\item $n_x, n_y, n_z$: rotation vector that brings the helicopter back to a level position (expressed in the helicopter frame).
\end{itemize}
\begin{equation*}
u_{collective} = \theta_1 \cdot f_1(z^*-z) + \theta_2 \cdot \dot{z}
\end{equation*}
\begin{equation*}
u_{elevator} = \theta_3 \cdot f_2(x^*-x) + \theta_4 f_4(\dot{x}) + \theta_5 \cdot q + \theta_6 \cdot n_y
\end{equation*}
\begin{equation*}
u_{aileron} = \theta_7 \cdot f_3(y^*-y) + \theta_8 f_5(\dot{y}) + \theta_9 \cdot p + \theta_{10} \cdot n_x
\end{equation*}
\begin{equation*}
u_{rudder} = \theta_{11} \cdot r + \theta_{12} \cdot n_z
\end{equation*}
Note that the difficulty of policy search depends on the number of parameters rather than on whether the policy is linear in them. Here one can use a controller that is linear per axis, because integrating the dynamics over time supplies the nonlinearity.
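As a concrete sketch of this parameterization in Python (the saturating error maps $f_1,\ldots,f_5$ are not specified in the notes, so a simple clip is used as a hypothetical stand-in, and the state layout is an assumption):

```python
import numpy as np

def clip(x, lim=1.0):
    # Hypothetical stand-in for the saturating error maps f_1..f_5.
    return np.clip(x, -lim, lim)

def helicopter_controls(theta, state, target):
    """Hover controls, linear in the parameters theta (theta[0..11]
    correspond to theta_1..theta_12 above).  `state` is a hypothetical
    dict holding position, velocity, angular rates (p, q, r), and the
    level-rotation vector (n_x, n_y, n_z)."""
    x, y, z = state["pos"]
    xd, yd, zd = state["vel"]
    p, q, r = state["rates"]
    nx, ny, nz = state["n"]
    xs, ys, zs = target
    u_collective = theta[0] * clip(zs - z) + theta[1] * zd
    u_elevator = (theta[2] * clip(xs - x) + theta[3] * clip(xd)
                  + theta[4] * q + theta[5] * ny)
    u_aileron = (theta[6] * clip(ys - y) + theta[7] * clip(yd)
                 + theta[8] * p + theta[9] * nx)
    u_rudder = theta[10] * r + theta[11] * nz
    return u_collective, u_elevator, u_aileron, u_rudder
```

Note that each control is a linear combination of fixed features of the state, so the policy is linear in $\theta$ even though the closed-loop behavior is not.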
\section{Finite difference methods}
We estimate $U(\theta)$ from sample means (in a simulated environment or in the real world).
We can compute the gradient $g$ using standard finite difference methods, as follows:
\begin{equation*}
g_j = \frac{U(\theta^{(i)} + \epsilon e_j) - U(\theta^{(i)} - \epsilon e_j)}{2\epsilon}
\qquad e_j = \left(\begin{array}{cc}
0 & \\
0 & \\
\vdots & \\
0 & \\
1 & \leftarrow \text{j$^\text{th}$ entry} \\
0 & \\
\vdots & \\
0 &
\end{array} \right)
\end{equation*}
Then we can make a (small) step in the direction of the gradient to try to improve the utility.
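A minimal sketch of this procedure in Python, using a toy quadratic utility as a hypothetical stand-in for a rollout-based estimate of $U$:

```python
import numpy as np

def fd_gradient(U, theta, eps=1e-2):
    """Central-difference estimate of the gradient of U at theta:
    g_j = (U(theta + eps e_j) - U(theta - eps e_j)) / (2 eps)."""
    d = len(theta)
    g = np.zeros(d)
    for j in range(d):
        e_j = np.zeros(d)
        e_j[j] = 1.0
        g[j] = (U(theta + eps * e_j) - U(theta - eps * e_j)) / (2 * eps)
    return g

# Toy utility with maximum at theta = (1, 1, 1); in practice U would be
# estimated from sample rollouts and would be noisy.
U = lambda th: -np.sum((th - 1.0) ** 2)

# Gradient ascent: take small steps in the direction of the estimated gradient.
theta = np.zeros(3)
for _ in range(100):
    theta = theta + 0.1 * fd_gradient(U, theta)
```

Each gradient estimate costs $2d$ utility evaluations, which is why the method becomes expensive when every evaluation is a real-world rollout.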
However, in some robotics settings we might not be willing (or able) to
evaluate the policy at the specific perturbations these finite
differences require, especially when testing on real systems (rather
than in simulation). We can, alternatively, estimate the gradient
through the following computation.
Recall that, locally around our current estimate $\theta$, we can
approximate the utility $U(\theta^{(i)})$ at a nearby point
$\theta^{(i)}$ with a linear approximation:
\begin{equation*}
U(\theta^{(i)}) \approx U(\theta) + g^\top (\theta^{(i)} - \theta)
\end{equation*}
Let's say we have a set of small perturbations
$\theta^{(0)},\theta^{(1)},\ldots,\theta^{(m)}$ of $\theta$ for which we are
willing to evaluate the utility $U(\theta^{(i)})$. We can then
estimate the gradient $g$ by solving the following set of
equations in the least-squares sense:
\begin{eqnarray*}
U(\theta^{(0)}) & = & U(\theta) + (\theta^{(0)}-\theta)^{\top} g\\
U(\theta^{(1)}) & = & U(\theta) + (\theta^{(1)}-\theta)^{\top} g\\
& \vdots & \\
U(\theta^{(m)}) & = & U(\theta) + (\theta^{(m)}-\theta)^{\top} g
\end{eqnarray*}
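The least-squares solve can be sketched in a few lines of Python (again with a hypothetical toy utility standing in for rollout estimates):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a rollout-based utility estimate;
# its true gradient at theta = 0 is 2 * (1, 1, 1, 1).
def U(th):
    return -np.sum((th - 1.0) ** 2)

theta = np.zeros(4)   # current parameter estimate
m = 10                # number of perturbed evaluations

# Small perturbations of theta at which we are willing to evaluate U.
perturbations = [theta + 1e-3 * rng.standard_normal(4) for _ in range(m)]

# Each equation reads U(theta_i) - U(theta) = (theta_i - theta)^T g;
# stack them and solve for g by least squares.
A = np.stack([p - theta for p in perturbations])        # m x d matrix
b = np.array([U(p) - U(theta) for p in perturbations])  # m right-hand sides
g, *_ = np.linalg.lstsq(A, b, rcond=None)
```

Unlike the coordinate-wise central differences above, the perturbations here can be arbitrary (e.g., whatever parameter settings we happen to be willing to test on the real system), as long as they span the parameter space.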
Note that estimating the derivative of a stochastic function may be
problematic, as Figure \ref{fig:derivative} illustrates. To get a
better (but still noisy) estimate, one can draw many samples and
average them.
\begin{figure}[htbp]
\centering
\includegraphics[scale=0.5]{figures/derivative.eps}
\caption{Problem with estimating the derivative of a stochastic
function. The figure shows a stochastic function whose slope we
want to estimate between the two vertical lines; because we sample a
point from a random distribution at each of the two values of
$\theta$, we can end up with two points that give the wrong slope for
the function between them.}
\label{fig:derivative}
\end{figure}
For the simulated setting, rather than requiring a large number of
samples, the PEGASUS algorithm (Ng and Jordan, 2000) turns the
optimization into an optimization over a deterministic function. In
particular, they observe that when you fix the random seed of the
simulation, the simulation becomes deterministic, and computing
gradients will be less susceptible to noise introduced by
stochasticity.
Essentially, every random seed corresponds to a function, and one
averages them to get the (deterministic) function one is optimizing
(see Figure \ref{fig:fixedseed}). This gives less noisy gradient
estimates.
\begin{figure}[htbp]
\centering
\includegraphics[scale=0.5]{figures/fixedseed.eps}
\caption{Target and the estimated functions using a fixed random seed.}
\label{fig:fixedseed}
\end{figure}
Using the PEGASUS algorithm, searching for the policy parameters of a
new helicopter took about 2 hours.
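The fixed-seed idea can be illustrated with a toy simulator in Python (the dynamics and policy below are hypothetical, chosen only to show that fixing the seed makes the objective deterministic):

```python
import numpy as np

def rollout_return(theta, seed, H=20):
    """Return of a toy stochastic 1-D system under the linear policy
    u = -theta * s.  Fixing `seed` fixes every noise draw, so the
    return is a deterministic function of theta (the PEGASUS idea)."""
    rng = np.random.default_rng(seed)
    s, total = 1.0, 0.0
    for _ in range(H):
        u = -theta * s
        s = 0.9 * s + u + 0.1 * rng.standard_normal()
        total += -s ** 2
    return total

# Each seed defines one deterministic function of theta; averaging over
# a fixed set of seeds gives the deterministic objective we optimize.
def U(theta, seeds=range(10)):
    return np.mean([rollout_return(theta, sd) for sd in seeds])
```

Because $U$ is now deterministic in $\theta$, finite differences compare the same noise realizations at both perturbed points, eliminating the spurious-slope problem of Figure \ref{fig:derivative}.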
%There is another method, called reward
%shaping (Ng et al, 1999), that could also be used, but this method not
%only depends on the reward but also requires more computational power
%and can simulate less time ahead than PEGASUS.
\section{Likelihood ratio methods}
%These methods were pioneered in operations research where they were
%called \emph{reinforce} (Williams, 1992) and were later reinvented and
%improved by the reinforcement learning community.
Let $\tau$ denote a trajectory and $H$ the time horizon:
\begin{equation*}
\tau = (s_0, a_0, s_1, a_1, \hdots, s_H, a_H)
\end{equation*}
Let,
\begin{equation*}
R(\tau) = \sum_{t=0}^{H}{R(s_t,a_t)}
\end{equation*}
\begin{equation*}
U(\theta) = \sum_\tau{p_\theta(\tau) R(\tau)}
\end{equation*}
The last equation defines the utility; it is a smooth function of $\theta$ even for a discrete system, which allows the following:
\begin{align*}
\nabla_\theta U(\theta) & = \sum_\tau \nabla_\theta p_\theta(\tau) R(\tau) \\
{}& = \sum_\tau \frac{p_\theta(\tau)}{p_\theta(\tau)} \nabla_\theta p_\theta(\tau) R(\tau) \\
{}& = \sum_\tau p_\theta(\tau) \frac{ \nabla_\theta {p_\theta(\tau)}}{p_\theta(\tau)} R(\tau) \\
{}& \approx \frac{1}{m} \sum_{i=1}^m \frac{\nabla_\theta p_\theta(\tau^{(i)})}{p_\theta(\tau^{(i)})} R(\tau^{(i)})
\end{align*}
Here $i$ indexes over the $m$ sample trajectories $\tau^{(i)}$ which are obtained while
executing the current policy $\pi_{\theta}$ and which are used to
obtain a finite sample estimate of the above expectation.
The above derivation shows that the empirical average can be used to
estimate the gradient. Note in the last step that the ratio
$\nabla_\theta p_\theta(\tau) / p_\theta(\tau)$ is often easier to
evaluate than the gradient $\nabla_\theta p_\theta(\tau)$ alone, as we
now show.
The probability of a trajectory factors into dynamics terms and policy terms:
\begin{equation*}
p_\theta(\tau) = \prod_{t=0}^H p(s_{t+1} \vert s_t, a_t) \pi_\theta(a_t \vert s_t)
\end{equation*}
\begin{align*}
\nabla_\theta p_\theta(\tau) & = \prod_{t=0}^H p(s_{t+1} \vert s_t, a_t) \nabla_\theta \left( \prod_{t=0}^H \pi_\theta(a_t \vert s_t) \right) \\
{} & = \prod_{t=0}^H p(s_{t+1} \vert s_t, a_t) \prod_{t=0}^H \pi_\theta(a_t \vert s_t) \sum_{t'=0}^H \frac{\nabla_\theta \pi_\theta(a_{t'} \vert s_{t'})}{\pi_\theta(a_{t'} \vert s_{t'})}
\end{align*}
\begin{align*}
\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} &=
\frac{\prod_{t=0}^H p(s_{t+1} \vert s_t, a_t)}{\prod_{t=0}^H p(s_{t+1} \vert s_t, a_t)}
\frac{\prod_{t=0}^H \pi_\theta(a_t \vert s_t)}{\prod_{t=0}^H \pi_\theta(a_t \vert s_t)}
\sum_{t'=0}^H \frac{\nabla_\theta \pi_\theta(a_{t'} \vert s_{t'})}{\pi_\theta(a_{t'} \vert s_{t'})} \\
{} & = \sum_{t'=0}^H \frac{\nabla_\theta \pi_\theta(a_{t'} \vert s_{t'})}{\pi_\theta(a_{t'} \vert s_{t'})}
\end{align*}
Pay close attention to the last equality: because the dynamics terms
cancel in the ratio, the dynamics model has disappeared, and we do not
need to know it to perform this calculation. Plugging this result into
the gradient of the utility function, one gets:
\begin{align*}
\nabla_\theta U(\theta) & = \sum_\tau p_\theta(\tau) \left( \sum_{t'=0}^H \frac{\nabla_\theta \pi_\theta(a_{t'} \vert s_{t'})}{\pi_\theta(a_{t'} \vert s_{t'})} \right) R(\tau) \\
{} & \approx \frac{1}{m} \sum_{i=1}^m \sum_{t'=0}^H \frac{\nabla_\theta \pi_\theta(a_{t'}^{(i)} \vert s_{t'}^{(i)})}{\pi_\theta(a_{t'}^{(i)} \vert s_{t'}^{(i)})} R(\tau^{(i)})
\end{align*}
The last equality shows that the REINFORCE gradient estimate,
$g_{RF}$, can be computed as an empirical average without a dynamics
model. Note that the estimator's expectation does not change when the
reward is offset by a constant, so one can subtract such a constant
baseline from the reward to reduce the variance of $g_{RF}$. Note also
that this is an unbiased estimator, since $\expect g_{RF} =
\nabla_\theta U(\theta)$.
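A minimal sketch of this estimator in Python, for a toy single-state problem with two actions and $H=0$ (so $R(\tau) = R(a_0)$) and a softmax policy over action preferences $\theta$; the problem and all names are hypothetical, chosen so the exact gradient is easy to compute:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Fixed rewards for the two actions: R(a=0) = 1, R(a=1) = 0.
rewards = np.array([1.0, 0.0])

def g_RF(theta, m=200_000):
    """Likelihood-ratio (REINFORCE) estimate of grad_theta U from m
    sampled trajectories (here, single actions)."""
    pi = softmax(theta)
    actions = rng.choice(len(theta), size=m, p=pi)
    # For a softmax policy, the per-sample ratio grad pi / pi is the
    # score function: grad_theta log pi_theta(a) = e_a - pi_theta.
    scores = np.eye(len(theta))[actions] - pi
    return (scores * rewards[actions][:, None]).mean(axis=0)

theta = np.zeros(2)
# Exact gradient for comparison: dU/dtheta_a = pi(a) (R(a) - U), which
# at theta = 0 (pi = [0.5, 0.5], U = 0.5) equals (+0.25, -0.25).
```

Only samples of actions and rewards are used; the environment's transition model never appears, exactly as the derivation above promises.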
%% Following we show the gradient of reinforce with the offset added:
%% \begin{equation*}
%% g_{RF} = \frac{1}{nH} \sum_{i=1}^n \sum_{t'=0}^H \frac{\nabla_\theta
%% \pi_\theta(a_{t'}^{(i)} \vert s_{t'}^{(i)})}{\pi_\theta(a_{t'}^{(i)}
%% \vert s_{t'}^{(i)})} \sum_{t=0}^H (R(s_t^{(i)}, a_t^{(i)}) -
%% b_t^{(i)}) = g_{PGF} = g_{G(PO)MDP}
%% \end{equation*}
%% We proceed to assume that for $t'