\documentclass[fleqn]{article}

\usepackage{haldefs}
\usepackage{notes}
\usepackage{url}
\usepackage{graphicx}

\begin{document}
\lecture{Artificial Intelligence}{HW5: Reinforcement Learning}{CS5300, Spring 2009}

% IF YOU'RE USING THIS .TEX FILE AS A TEMPLATE, PLEASE REPLACE
% "CS5300, Spring 2009" WITH YOUR NAME AND UID.

% Hand in at: http://www.cs.utah.edu/~hal/handin.pl?course=cs5300

\section{TD and Q in Gridworld}

Consider the gridworld figure shown on Slide 5 from the Q-learning
lecture (day10).  Suppose that we run two episodes that yield the
following sequences of (state, action, reward) tuples:

\begin{tabular}{ccc|ccc}
\multicolumn{3}{c|}{\bf Episode 1} & \multicolumn{3}{c}{\bf Episode 2} \\
{\bf S} & {\bf A} & {\bf R} & {\bf S} & {\bf A} & {\bf R} \\
\hline
(1,1) & up    & -1     & (1,1) & up    & -1 \\
(2,1) & left  & -1     & (1,2) & up    & -1 \\
(1,1) & up    & -1     & (1,3) & right & -1 \\
(1,2) & up    & -1     & (2,3) & right & -1 \\
(1,3) & up    & -1     & (2,3) & right & -1 \\
(2,3) & right & -1     & (3,3) & right & -1 \\
(3,3) & right & -1     & (4,3) & exit  & +100 \\
(4,3) & exit  & +100   & (done)&       & \\
(done)&       &        &       &       & \\
\end{tabular}

\bee
\i According to direct estimation, what are the values for every state
in the grid?

\i According to model-based learning, what are the transition
probabilities for every (state, action, state) triple?  Don't bother
listing the ones about which we have no information.

\i Suppose that we run Q-learning.  However, instead of initializing all
our Q values to zero, we initialize them to some large positive number
(``large'' with respect to the maximum reward possible in the world:
say, 10 times the max reward).  I claim that this will cause a
Q-learning agent to initially explore a lot and then eventually start
exploiting.  Why should this be true?  Justify your answer in a short
paragraph.
\ene
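
For reference, the three quantities above are usually computed as
shown below.  This is a sketch under assumptions not stated in this
assignment: a discount factor $\gamma$, a learning rate $\alpha$, and
counts $N$ taken over the observed episodes.  The notation used in
lecture may differ.

\begin{eqnarray*}
\hat V(s) & = & \frac{1}{N(s)} \sum_{k=1}^{N(s)} G_k(s) \\
\hat T(s,a,s') & = & \frac{N(s,a,s')}{N(s,a)} \\
Q(s,a) & \leftarrow & (1-\alpha) \, Q(s,a)
   + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right]
\end{eqnarray*}

Here $G_k(s)$ is the (discounted) sum of rewards observed from the
$k$th visit to $s$ until the end of that episode, $N(s,a,s')$ counts
how often taking $a$ in $s$ led to $s'$, and the Q update is applied
once per observed transition $(s,a,r,s')$, with the $\max$ term taken
to be zero once the episode is done.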

\section{Policy Gradient}

{\bf (6300 only)} In order to do policy gradient, we need to be able
to compute the gradient of the evaluation function $\rho$ with respect
to a parameter vector $\vec \th$: $\grad_{\vec \th} \rho(\vec \th)$.  By
our algebraic magic, we expressed this as:

\begin{equation} \label{eq:grad}
\grad_{\vec \th} \rho(\vec \th) 
  = \sum_a \pi_{\vec \th}(s_0,a) R(a) 
             \underbrace{\grad_{\vec \th} \log \left( \pi_{\vec \th}(s_0,a) \right)}_{g(s_0,a)}
\end{equation}

If we use a linear function passed through a soft-max as our stochastic
policy, we have:

\begin{equation}
\pi_{\vec \th}(s,a) 
  = \frac {\exp \left( \sum_{i=1}^n \th_i f_i(s,a) \right)}
          {\sum_{a'} \exp \left( \sum_{i=1}^n \th_i f_i(s,a') \right)}
\end{equation}

Compute a closed-form expression for $g(s_0,a)$.  Explain in a few
sentences \emph{why} this leads to a sensible update for gradient
ascent (i.e., if we plug this into Eq.~\eqref{eq:grad} and do gradient
ascent, why is the derived form reasonable?).
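
As a reminder of where $g(s_0,a)$ ends up being used, a plain gradient
ascent step can be sketched as follows.  The step size $\eta > 0$ is
an assumption here; it is not defined anywhere in this assignment.

\[
\vec \th \leftarrow \vec \th + \eta \, \grad_{\vec \th} \rho(\vec \th)
\]

By Eq.~\eqref{eq:grad}, each action's $g(s_0,a)$ therefore contributes
to the step in proportion to $\pi_{\vec \th}(s_0,a) R(a)$.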

\end{document}
