Review



What are MDPs?


What do MDPs consist of?


$S, A, T, R, \gamma$



What are POMDPs?


What do POMDPs consist of?


$S, A, T, R, \gamma, O, \Omega$
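As a memory aid, here is a minimal sketch of these tuples as Python containers. The class and field names are purely illustrative assumptions (and which of $O$ and $\Omega$ denotes the observation set versus the observation function varies by source).

```python
from dataclasses import dataclass
from typing import Callable, Set

# Illustrative containers only; field names are assumptions, not a standard API.
@dataclass
class MDP:
    states: Set        # S
    actions: Set       # A
    T: Callable        # T(s, a, s') -> transition probability
    R: Callable        # R(s, a, s') -> reward
    gamma: float       # discount factor

@dataclass
class POMDP(MDP):
    observations: Set  # observation set
    O: Callable        # O(s', a, o) -> observation probability
```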



Review



Bellman Equation


$V_\pi(s) = \sum_{s'} T(s,\pi(s),s')\left[ R(s,\pi(s),s') + \gamma V_\pi(s') \right]$
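For concreteness, a minimal sketch of iterative policy evaluation using this equation; `states`, `policy`, `T`, and `R` are assumed inputs (a finite state list, a dict mapping states to actions, and two callables).

```python
def evaluate_policy(states, policy, T, R, gamma=0.9, iters=100):
    """Repeatedly apply the Bellman equation for a fixed policy.

    T(s, a, s') and R(s, a, s') are assumed callables; policy maps s -> a.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: sum(
                T(s, policy[s], sp) * (R(s, policy[s], sp) + gamma * V[sp])
                for sp in states
            )
            for s in states
        }
    return V
```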


POMDPs


Equivalent to MDPs over the set of belief states


Since $b(s)$ is continuous, there are infinitely many belief states


Reinforcement Learning


Real life is much harder!


Assume we don't know $T(s,a,s')$


We also don't know $R(s,a,s')$


Another Game of Chance

You pick action A or B.


You get a reward.


Try to maximize your reward.

MDPs

  • Offline
  • Known environment dynamics
  • Maximize expected rewards


POMDPs

  • Offline
  • Incomplete knowledge of states
  • Estimate $b(s)$


RL

  • Online
  • Unknown environment dynamics
  • Estimate through interactions

RL Framework



RL Framework



Agent interacts with environment


Agent receives reward


Agent learns to maximize reward


RL Framework



Let's see a (complex) demo


(Local code run)

Reinforcement Learning



Assume we don't know $T(s,a,s')$


We also don't know $R(s,a,s')$


First Steps



Assume an underlying MDP


Let's revisit the game of chance


What does the MDP look like?

First Steps



What does the MDP look like?


You gain some information as you play!

Model-Based Learning



Approximate the MDP


Solve for the values, assuming the approximation is correct


Model-Based Learning



Model-Based Monte Carlo



Estimate $T(s,a,s')$ and $R(s,a,s')$


How might we do that?

Average over data!


Compute $V_\pi(s)$ as in MDPs


Model-Based Monte Carlo


Estimate $T(s,a,s')$ and $R(s,a,s')$


Play to learn information

Model-Based Monte Carlo


Estimate $T(s,a,s')$ and $R(s,a,s')$


Play to learn information


Update Probabilities

Model-Based Monte Carlo


Play to learn information


Update Probabilities


Eventually, we hit the end state.


Model-Based Monte Carlo


Consider the following sequence of plays


Model-Based Monte Carlo

Consider the following sequence of plays

\[\hat{T}(s,a,s') = \frac{\#(s,a,s')}{\#(s,a)}\]

\[\hat{T}(\textcolor{#22b496}{S}, \textcolor{#2c0328}{A}, \textcolor{#22b496}{S}) = \frac{107}{107} = 1\]

Model-Based Monte Carlo

Consider the following sequence of plays

\[\hat{T}(s,a,s') = \frac{\#(s,a,s')}{\#(s,a)}\]

\[\hat{T}(\textcolor{#22b496}{S}, \textcolor{#2c0328}{A}, \textcolor{#f79cfb}{E}) = \frac{0}{107} = 0\]

Model-Based Monte Carlo

Consider the following sequence of plays

\[\hat{T}(s,a,s') = \frac{\#(s,a,s')}{\#(s,a)}\]

\[\hat{T}(\textcolor{#22b496}{S}, \textcolor{#4faeea}{B}, \textcolor{#22b496}{S}) = \frac{11}{14} \approx 0.786\]

Model-Based Monte Carlo

Consider the following sequence of plays

\[\hat{T}(s,a,s') = \frac{\#(s,a,s')}{\#(s,a)}\]

\[\hat{T}(\textcolor{#22b496}{S}, \textcolor{#4faeea}{B}, \textcolor{#f79cfb}{E}) = \frac{3}{14} \approx 0.214\]
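A sketch of the counting estimates above, under the assumption that the played episodes are logged as (s, a, r, s') tuples:

```python
from collections import defaultdict

def estimate_model(transitions):
    """Estimate T_hat(s, a, s') and R_hat(s, a, s') by averaging over data.

    `transitions` is an iterable of (s, a, r, s_next) tuples (an assumed
    logging format for the played episodes).
    """
    counts_sas = defaultdict(int)     # #(s, a, s')
    counts_sa = defaultdict(int)      # #(s, a)
    reward_sum = defaultdict(float)   # total reward seen for (s, a, s')

    for s, a, r, s_next in transitions:
        counts_sa[(s, a)] += 1
        counts_sas[(s, a, s_next)] += 1
        reward_sum[(s, a, s_next)] += r

    T_hat = {k: counts_sas[k] / counts_sa[k[:2]] for k in counts_sas}
    R_hat = {k: reward_sum[k] / counts_sas[k] for k in counts_sas}
    return T_hat, R_hat
```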

Model-Free Monte Carlo


Learn $Q_\pi(s,a)$ directly, without estimating $T$ or $R$


Running average over incoming data

Model-Free Monte Carlo


Estimate $Q_\pi(s,a)$


Utility, $u_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots$



$\hat{Q}_\pi(s,a) = \textrm{avg}(u_t)$ where $s_{t-1} = s,\ a_t = a$
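A minimal every-visit Monte Carlo sketch of this estimator, assuming each episode is logged as a list of (s, a, r) steps collected while following $\pi$:

```python
from collections import defaultdict

def mc_q_estimate(episodes, gamma=0.9):
    """Average the discounted utilities u_t for each (s, a) visited under pi.

    `episodes` is a list of episodes; each episode is a list of (s, a, r)
    steps (an assumed logging format).
    """
    totals = defaultdict(float)
    visits = defaultdict(int)
    for episode in episodes:
        # Walk backwards so u accumulates r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
        u = 0.0
        for s, a, r in reversed(episode):
            u = r + gamma * u
            totals[(s, a)] += u
            visits[(s, a)] += 1
    return {sa: totals[sa] / visits[sa] for sa in totals}
```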

Model-Free Monte Carlo


Consider the following sequence of plays again



Model-Free Monte Carlo


\[ \hat{Q}_{\pi}(\textcolor{#22b496}{S}, \textcolor{#2c0328}{A}) = \frac{94 + 93 - 100}{3} = 29\]

Model-Free Monte Carlo


Try this for the other dice game, with actions Stay or Quit

Model-Free Monte Carlo


Can be interpreted as an iterative convex combination (running average)...



The SARSA algorithm: for each observed $(s, a, r, s', a')$,
\[ \hat{Q}^t_{\pi}(s, a) = (1-\eta)\hat{Q}^{t-1}_{\pi}(s, a) + \eta \left[ r + \gamma \hat{Q}^{t-1}_{\pi}(s', a') \right] \]
where $\eta = \frac{1}{1 + \textrm{no. of updates to }\hat{Q}_\pi(s,a)}$
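A sketch of a single SARSA update with the count-based step size above; `Q` and `updates` are assumed to be plain dicts keyed by (s, a).

```python
def sarsa_update(Q, updates, s, a, r, s_next, a_next, gamma=0.9):
    """One SARSA update: move Q(s, a) toward r + gamma * Q(s', a')."""
    n = updates.get((s, a), 0)                      # updates to Q(s, a) so far
    eta = 1.0 / (1 + n)
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * target
    updates[(s, a)] = n + 1
```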

Model-Free Monte Carlo


For the optimal policy, for each observed $(s, a, r, s')$:



The Q-Learning Algorithm:
\[ \hat{Q}^t_{opt}(s, a) = (1-\eta)\hat{Q}^{t-1}_{opt}(s, a) + \eta \left[ r + \gamma \hat{V}_{opt}(s') \right] \]
where $\hat{V}_{opt}(s') = \max_{a'} \hat{Q}^{t-1}_{opt}(s', a')$ and $\eta = \frac{1}{1 + \textrm{no. of updates to }\hat{Q}_{opt}(s,a)}$
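A matching sketch for one Q-learning update, taking $\hat{V}_{opt}(s') = \max_{a'} \hat{Q}_{opt}(s', a')$; `actions` is the (assumed) set of actions available in $s'$.

```python
def q_learning_update(Q, updates, actions, s, a, r, s_next, gamma=0.9):
    """One Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    n = updates.get((s, a), 0)                      # updates to Q(s, a) so far
    eta = 1.0 / (1 + n)
    v_opt = max((Q.get((s_next, ap), 0.0) for ap in actions), default=0.0)
    Q[(s, a)] = (1 - eta) * Q.get((s, a), 0.0) + eta * (r + gamma * v_opt)
    updates[(s, a)] = n + 1
```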

Exploration vs. Exploitation


Action sequences must be random enough to cover the search space


But the policy we follow cannot be too random if we want to exploit what we have learned

Exploration vs. Exploitation


Strategies to narrow down favorite restaurants


Fastest way to commute to campus

Exploration vs. Exploitation


The $\epsilon$-greedy approach

Start by exploring at random.
Reduce randomness over time to prefer optimal policy.

Exploration vs. Exploitation


The $\epsilon$-greedy approach

Initialize $\epsilon$ close to 1.
At each time step: \[ \begin{cases} \textrm{with probability $\epsilon$, take a random action}\\ \textrm{with probability $1-\epsilon$, take the optimal action}\\ \end{cases} \]
Perform Q-learning updates.
Reduce $\epsilon$
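A minimal sketch of the $\epsilon$-greedy action choice above; the decay schedule in the comments is just one common choice (the rate and floor are assumptions), not a prescription.

```python
import random

def epsilon_greedy_action(Q, actions, s, epsilon):
    """With probability epsilon explore at random; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(list(actions))
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

# Illustrative schedule:
#   epsilon = 1.0
#   each step: a = epsilon_greedy_action(Q, actions, s, epsilon)
#              ... perform the Q-learning update ...
#              epsilon = max(0.05, epsilon * 0.995)
```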