
What are MDPs?

What do MDPs consist of?

$S, A, T, R, \gamma$

What are POMDPs?

What do POMDPs consist of?

$S, A, T, R, \gamma, O, \Omega$


Bellman Equation

$V_\pi(s) = \sum_{s'} T(s,\pi(s),s')\left[ R(s,\pi(s),s') + \gamma V_\pi(s') \right]$
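
As a rough illustration, the Bellman equation for a fixed policy can be applied repeatedly until the values stop changing. This is a minimal sketch, not a definitive implementation: the function name `evaluate_policy`, the dictionary layout for `T` and `R`, and the tiny two-state example are all assumptions made for illustration.

```python
def evaluate_policy(states, policy, T, R, gamma, sweeps=1000, tol=1e-10):
    """Iterate V(s) = sum_s' T(s,a,s') [R(s,a,s') + gamma V(s')] with a = policy[s].

    T maps (s, a) to {s': probability}; R maps (s, a, s') to a reward.
    States missing from `policy` are treated as terminal with value 0.
    """
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        new_V = {}
        for s in states:
            a = policy.get(s)
            if a is None:  # terminal state: nothing more to collect
                new_V[s] = 0.0
                continue
            new_V[s] = sum(
                p * (R[(s, a, s2)] + gamma * V[s2])
                for s2, p in T[(s, a)].items()
            )
        delta = max(abs(new_V[s] - V[s]) for s in states)
        V = new_V
        if delta < tol:
            break
    return V


# Hypothetical example: from S, action "B" stays in S or ends in E with
# probability 0.5 each, paying a reward of 1 either way.
states = ["S", "E"]
policy = {"S": "B"}
T = {("S", "B"): {"S": 0.5, "E": 0.5}}
R = {("S", "B", "S"): 1.0, ("S", "B", "E"): 1.0}
V = evaluate_policy(states, policy, T, R, gamma=1.0)
```

In this made-up example the fixed point satisfies $V(S) = 1 + 0.5\,V(S)$, so the iteration approaches $V(S) = 2$.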


POMDPs are equivalent to MDPs over the set of belief states

Since $b(s)$ is continuous, there are infinitely many belief states

Reinforcement Learning

Real life is much harder!

Assume we don't know $T(s,a,s')$

We also don't know $R(s,a,s')$

Another Game of Chance

You pick action A or B.

You get a reward.

Try to maximize your reward.


MDPs
  • Offline
  • Known environment dynamics
  • Maximize expected rewards


POMDPs
  • Offline
  • Incomplete knowledge of states
  • Estimate $b(s)$


Reinforcement Learning
  • Online
  • Unknown environment dynamics
  • Estimate through interactions

RL Framework

Agent interacts with environment

Agent receives reward

Agent learns to maximize reward
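
The three steps above (interact, receive a reward, learn) can be sketched as a simple loop. Everything here is an assumption for illustration: the `ChanceGame` class, its payoff numbers, and the purely random action choice are invented, since the slides do not specify how the game pays out.

```python
import random


class ChanceGame:
    """Hypothetical two-action game; the payoff numbers are invented."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def step(self, action):
        # Assumed payoffs: A pays a sure 1, B pays 0 or 4 with equal chance.
        if action == "A":
            return 1.0
        return self.rng.choice([0.0, 4.0])


def run_agent(env, steps=200, seed=1):
    """Play at random and keep a per-action average of observed rewards."""
    rng = random.Random(seed)
    totals = {"A": 0.0, "B": 0.0}
    counts = {"A": 0, "B": 0}
    for _ in range(steps):
        action = rng.choice(["A", "B"])  # agent interacts with the environment
        reward = env.step(action)        # agent receives a reward
        totals[action] += reward         # agent learns from what it observed
        counts[action] += 1
    return {a: totals[a] / counts[a] for a in totals if counts[a] > 0}


estimates = run_agent(ChanceGame())
```

An agent that wanted to maximize reward would then favor the action with the higher estimated average, rather than continuing to choose at random.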

Let's see a (complex) demo

(Local code run)

First Steps

Assume an underlying MDP

Let's revisit the game of chance

What does the MDP look like?

You gain some information as you play!

Model Based Learning

Approximate the MDP

Solve for values assuming approximation is correct

Model-Based Monte Carlo

Estimate $T(s,a,s')$ and $R(s,a,s')$

How may we do that?

Average over data!

Compute $V_\pi(s)$ as in MDPs

Play to learn information

Update Probabilities

Eventually, we hit the end state.

Consider the following sequence of plays

\[\hat{T}(s,a,s') = \frac{\#(s,a,s')}{\#(s,a)}\]

\[\hat{T}(\textcolor{#22b496}{S}, \textcolor{#2c0328}{A}, \textcolor{#22b496}{S}) = \frac{107}{107} = 1\]

\[\hat{T}(\textcolor{#22b496}{S}, \textcolor{#2c0328}{A}, \textcolor{#f79cfb}{E}) = \frac{0}{107} = 0\]

\[\hat{T}(\textcolor{#22b496}{S}, \textcolor{#4faeea}{B}, \textcolor{#22b496}{S}) = \frac{11}{14} \approx 0.786\]

\[\hat{T}(\textcolor{#22b496}{S}, \textcolor{#4faeea}{B}, \textcolor{#f79cfb}{E}) = \frac{3}{14} \approx 0.214\]
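
The counting estimate $\hat{T}(s,a,s') = \#(s,a,s') / \#(s,a)$ can be sketched in a few lines. The function name `estimate_transitions` is an assumption, and the `plays` list is rebuilt only from the counts quoted above (107, 11, and 3); the actual order of plays is not recoverable from the text.

```python
from collections import Counter


def estimate_transitions(transitions):
    """Return T_hat[(s, a, s')] = #(s, a, s') / #(s, a) from observed triples."""
    sa_counts = Counter((s, a) for s, a, _ in transitions)
    sas_counts = Counter(transitions)
    return {
        (s, a, s2): n / sa_counts[(s, a)]
        for (s, a, s2), n in sas_counts.items()
    }


# Transition counts taken from the worked numbers above; ordering is arbitrary.
plays = (
    [("S", "A", "S")] * 107
    + [("S", "B", "S")] * 11
    + [("S", "B", "E")] * 3
)
T_hat = estimate_transitions(plays)

# Triples that were never observed are absent, so read them with a default of 0.
t_sae = T_hat.get(("S", "A", "E"), 0.0)
```

A triple that never occurs gets an estimate of 0, which is why more exploratory play can change the estimated model substantially.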

Model-Free Monte Carlo

Learn $Q_{opt}(s,a)$ directly

Running average over incoming data

Estimate $Q_\pi(s,a)$

Utility, $u_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots$

$\hat{Q}_\pi(s,a) = \textrm{avg}(u_t)$ over all $t$ where $S_{t-1} = s, A_t = a$

Consider the following sequence of plays again

\[ \hat{Q}_{\pi}(\textcolor{#22b496}{S}, \textcolor{#2c0328}{A}) = \frac{94 + 93 - 100}{3} = 29\]
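
A minimal sketch of the model-free estimate: compute the discounted utility from a reward sequence, then average the utilities observed for each $(s,a)$ pair. The function names and the `(state_action, utility)` sample layout are assumptions; the three utilities are the ones used in the average above.

```python
from collections import defaultdict


def discounted_utility(rewards, gamma):
    """u_t = R_t + gamma * R_{t+1} + gamma^2 * R_{t+2} + ..."""
    u = 0.0
    for r in reversed(rewards):
        u = r + gamma * u
    return u


def mc_q_estimate(samples):
    """Average the utilities observed after each (s, a) pair.

    `samples` is a list of ((s, a), utility) pairs.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for sa, u in samples:
        sums[sa] += u
        counts[sa] += 1
    return {sa: sums[sa] / counts[sa] for sa in sums}


# Utilities 94, 93 and -100 observed after taking A in S, as in the example.
Q_hat = mc_q_estimate([(("S", "A"), 94), (("S", "A"), 93), (("S", "A"), -100)])
```

Note that no transition or reward model is built here; only the observed utilities are stored and averaged.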

Try this for the other dice game, with actions Stay or Quit

Can be interpreted as an iterative convex combination (running average)...

The SARSA algorithm: for each observed $(s,a,r,s',a')$,

\[ \hat{Q}^t_{\pi}(s, a) = (1-\eta)\hat{Q}^{t-1}_{\pi}(s, a) + \eta \left[ r + \gamma \hat{Q}_{\pi}(s', a') \right] \]

where $\eta = \frac{1}{1 + \textrm{no. of updates to }\hat{Q}_\pi(s,a)}$
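
The SARSA update can be sketched as follows. The class name `SarsaEstimator` and the convention of passing `None` as the next action at a terminal state are assumptions for illustration; unseen pairs default to a value of 0.

```python
from collections import defaultdict


class SarsaEstimator:
    """Running-average SARSA estimate of Q_pi, with eta = 1 / (1 + #updates)."""

    def __init__(self, gamma):
        self.gamma = gamma
        self.Q = defaultdict(float)      # unseen (s, a) pairs start at 0
        self.updates = defaultdict(int)  # number of updates per (s, a)

    def update(self, s, a, r, s2, a2):
        eta = 1.0 / (1 + self.updates[(s, a)])
        # Bootstrapped target: observed reward plus the current estimate
        # for the next state-action pair actually taken.
        target = r + self.gamma * self.Q[(s2, a2)]
        self.Q[(s, a)] = (1 - eta) * self.Q[(s, a)] + eta * target
        self.updates[(s, a)] += 1


est = SarsaEstimator(gamma=1.0)
est.update("S", "A", 4.0, "E", None)  # first update: eta = 1
est.update("S", "A", 2.0, "E", None)  # second update: eta = 1/2
```

With this step size the estimate is exactly the running average of the targets seen so far, so the two updates above leave the value at 3.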

The Q-Learning algorithm: to estimate the values of the optimal policy, for each observed $(s,a,r,s')$,

\[ \hat{Q}^t_{opt}(s, a) = (1-\eta)\hat{Q}^{t-1}_{opt}(s, a) + \eta \left[ r + \gamma \hat{V}_{opt}(s') \right] \]

where $\hat{V}_{opt}(s') = \max_{a'} \hat{Q}^{t-1}_{opt}(s', a')$ and $\eta = \frac{1}{1 + \textrm{no. of updates to }\hat{Q}_{opt}(s,a)}$
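
A minimal sketch of one Q-learning update, assuming $\hat{V}_{opt}(s')$ is the maximum of the current $\hat{Q}_{opt}(s',a')$ over the actions available in $s'$. The helper `actions_in` and the state names are hypothetical; a terminal state is modeled as having no actions, so its value is 0.

```python
from collections import defaultdict


def actions_in(state):
    """Hypothetical action set: the end state E has no actions."""
    return [] if state == "E" else ["A", "B"]


def q_learning_update(Q, updates, s, a, r, s2, gamma):
    """Apply one Q-learning update for the observed (s, a, r, s')."""
    eta = 1.0 / (1 + updates[(s, a)])
    # V_opt(s') is the best current estimate over next actions (0 if none).
    v_next = max((Q[(s2, a2)] for a2 in actions_in(s2)), default=0.0)
    Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * (r + gamma * v_next)
    updates[(s, a)] += 1


Q = defaultdict(float)
updates = defaultdict(int)
Q[("S", "B")] = 10.0  # pretend B already looks good in S

q_learning_update(Q, updates, "S", "A", 1.0, "S", gamma=0.5)
first = Q[("S", "A")]   # 1 + 0.5 * 10
q_learning_update(Q, updates, "S", "A", 2.0, "E", gamma=0.5)
second = Q[("S", "A")]  # average of 6 and 2
```

Unlike SARSA, the target does not depend on which action was actually taken next, only on the best action according to the current estimates.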

Exploration vs. Exploitation

Sequences have to be random enough to cover search space

Optimal policy cannot be too random

Strategies to narrow down favorite restaurants

Fastest way to commute to campus

The $\epsilon$-greedy approach

Start by exploring at random.
Reduce randomness over time to prefer optimal policy.

Initialize $\epsilon$ close to 1.
At each time step:
\[ \begin{cases} \textrm{with probability $\epsilon$, take a random action}\\ \textrm{with probability $1-\epsilon$, take the best action under the current $\hat{Q}_{opt}$} \end{cases} \]
Perform Q-learning updates.
Reduce $\epsilon$.
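
The action-selection step can be sketched as below. The slides only say to reduce $\epsilon$ over time, so the multiplicative `decay` schedule, its `rate` and `floor` parameters, and the dictionary layout of `Q` are assumptions chosen for illustration.

```python
import random


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    # Exploit: best action under the current estimates (unseen pairs count as 0).
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


def decay(epsilon, rate=0.99, floor=0.05):
    """Assumed schedule: shrink epsilon geometrically, never below `floor`."""
    return max(floor, epsilon * rate)


Q = {("S", "A"): 1.0, ("S", "B"): 2.0}
greedy_choice = epsilon_greedy(Q, "S", ["A", "B"], epsilon=0.0)
random_choice = epsilon_greedy(Q, "S", ["A", "B"], epsilon=1.0)
```

Keeping a small floor on $\epsilon$ is one design choice among several; it retains a little exploration so that early, poorly estimated values can still be corrected.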