It's Friday night. How would you get to Cambridge as quickly as possible?



  • Drive
  • Take a Bus
  • Take the T
  • Uber/Lyft
  • Swim across the Charles

Recall environment uncertainties from previous lectures!


What are the unknowns in this problem?

How do stochastic transitions feature in this problem?

Uncertainty in the real world



Robotics


Weather models & agriculture


Route planning


Frodo's Dilemma



Frodo's Dilemma



Goal: Get to Mount Doom
(volcano in the top row)


Seen by Sauron (the 3 red cells):
Damage to health


Cave next to Frodo:
Shelter, but game ends

Frodo's Dilemma



Rewards!

Now imagine Frodo is a bit clumsy!



Probability of Slipping = 0.1

Slipping takes Frodo in a random direction


Should he still try?
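One way to write this slip model as transition probabilities, assuming a slip sends Frodo uniformly in one of the four directions and ignoring walls (the slides leave the exact slip distribution open):

\[
T(s, a, s') =
\begin{cases}
(1 - p_{\text{slip}}) + \dfrac{p_{\text{slip}}}{4}, & s' \text{ is the cell that action } a \text{ aims at}\\[4pt]
\dfrac{p_{\text{slip}}}{4}, & s' \text{ is one of the other three neighbouring cells}
\end{cases}
\]

with $p_{\text{slip}} = 0.1$ here, and $0.3$ or $0.5$ on the following slides.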

Now imagine Frodo is a bit clumsy!



Probability of Slipping = 0.3


Should he still try?

Now imagine Frodo is a bit clumsy!



What about Probability of Slipping = 0.5?

A Game of Dice!



  • Choose to stay or quit


  • If you quit, the game ends and you get $10

  • If you stay, you get $4, and we roll a die


  • If we roll a 1 or 2, the game ends; otherwise we play another round (a quick simulation sketch follows below)
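A quick sanity check of the dice game above, by simulation. This is a sketch under two assumptions not stated on the slide: a fair six-sided die and no discounting.

```python
import random

def play_always_stay(n_games: int = 100_000) -> float:
    """Estimate the expected winnings of always choosing STAY (fair die, no discounting assumed)."""
    total = 0
    for _ in range(n_games):
        winnings = 0
        while True:
            winnings += 4                      # staying pays $4 this round
            if random.randint(1, 6) <= 2:      # a roll of 1 or 2 ends the game
                break
        total += winnings
    return total / n_games

print(play_always_stay())   # empirically close to 12, versus a guaranteed 10 for QUIT
```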

Markov Decision Processes



  • Start State

  • State Space

  • Set of Actions

  • Transition Probabilities

  • Rewards

  • Discount Factor
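As a concrete instance of these components, here is one possible encoding of the dice game as an MDP. All names are illustrative, rewards are written as a function of (state, action) since that is all this game needs, and $\gamma = 1$ is an assumption (the slide does not fix a discount factor).

```python
# A minimal sketch of the dice game as an MDP (illustrative names throughout).
START, END = "Start", "End"

start_state = START
states = [START, END]                          # state space (END is terminal)
actions = {START: ["stay", "quit"], END: []}   # available actions per state
gamma = 1.0                                    # assumed discount factor

# T[s][a] = list of (next_state, probability); R[s][a] = immediate reward
T = {
    START: {
        "stay": [(START, 2/3), (END, 1/3)],    # roll 3-6: keep playing; roll 1-2: game ends
        "quit": [(END, 1.0)],
    },
}
R = {
    START: {"stay": 4, "quit": 10},
}
```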

Solutions: Policies



A mapping from each state to an action

Frodo's Dilemma





Example policy: { (4,1): UP, (3,1): UP, ...}


Remember: because transitions are stochastic, the chosen action may not actually take you to the intended cell!

What about the dice game?



Example policy: { Start: QUIT }


Example policy: { Start: STAY }


Given a policy, how do we evaluate it?


Following a policy gives us a random path.



The utility of one such path is the discounted sum of the rewards collected along it.



Value of a policy = Expected Utility
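Written out, with $r_t$ denoting the reward collected at step $t$ of the path and $\gamma$ the discount factor:

\[
\text{Utility of a path} = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots,
\qquad
V_\pi(s) = \mathbb{E}\!\left[\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \;\middle|\; s_0 = s,\ \pi \right].
\]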

Policy Evaluation



  • Expected Value

    \[V_\pi(s) = \begin{cases} 0, & \text{if } s \text{ is terminal}\\ Q_\pi(s,\pi(s)), & \text{otherwise} \end{cases} \]

  • \[ Q_\pi(s,a) = \sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_\pi(s')] \]
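For example, in the dice game (taking $\gamma = 1$, an assumption), the two single-decision policies evaluate to:

\[
Q(\text{Start}, \text{quit}) = 10,
\qquad
Q(\text{Start}, \text{stay}) = \tfrac{1}{3}\,(4 + 0) + \tfrac{2}{3}\,\big(4 + V_\pi(\text{Start})\big),
\]

so under the STAY policy $V_\pi(\text{Start}) = 4 + \tfrac{2}{3} V_\pi(\text{Start})$, i.e. $V_\pi(\text{Start}) = 12$, compared with $10$ for quitting immediately.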

Policy Evaluation



Iterative Updates to Expected Utility


Bellman Equation


\[ V_\pi^{t}(s) = \sum_{s'} T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_\pi^{t-1}(s')] \]

Dice game under the STAY policy:

State    t=0    t=1    t=2     t=3    ...    t→∞
Start     0      4     6.67    8.44   ...    ~12
End       0      0     0       0      ...     0
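A short sketch of this iterative update for the dice game under the STAY policy (again assuming $\gamma = 1$), which reproduces the row above:

```python
# Iterative policy evaluation for the dice game, fixed policy pi(Start) = stay.
# gamma = 1 is an assumption; probabilities and rewards come from the game rules.
gamma = 1.0
V = {"Start": 0.0, "End": 0.0}   # V^0: initialise all values to 0

for t in range(1, 20):
    # Bellman backup for the fixed policy; End is terminal, so its value stays 0.
    V = {
        "End": 0.0,
        "Start": (1/3) * (4 + gamma * 0.0) + (2/3) * (4 + gamma * V["Start"]),
    }
    print(t, round(V["Start"], 2))   # 4, 6.67, 8.44, ... -> converges towards 12
```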

Now how do we find the optimal policy?


Hint: Similar iterative algorithm

Value Iteration to Learn Optimal Policy



\[ V_{opt}(s) = \max_{a\in A}\sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_{opt}(s')] \]


\[ \pi_{opt}(s) = \operatorname*{arg\,max}_{a\in A} \sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_{opt}(s')]\]
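A generic sketch of these two equations in code. The data layout mirrors the dice-game encoding sketched earlier (`states`, `actions`, `T`, `R`, `gamma` are the illustrative names from that sketch, not anything fixed by the slides).

```python
def value_iteration(states, actions, T, R, gamma, iters=100):
    """Repeatedly apply the Bellman optimality backup, then read off a greedy policy.

    T[s][a] is a list of (next_state, probability); R[s][a] is the immediate reward.
    States with no available actions are treated as terminal (value 0).
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                (sum(p * (R[s][a] + gamma * V[s2]) for s2, p in T[s][a])
                 for a in actions[s]),
                default=0.0,        # terminal states keep value 0
            )
            for s in states
        }
    # Greedy policy extraction: the argmax on the slide.
    pi = {
        s: max(actions[s],
               key=lambda a: sum(p * (R[s][a] + gamma * V[s2]) for s2, p in T[s][a]))
        for s in states if actions[s]
    }
    return V, pi
```

On the dice-game MDP sketched earlier, this yields $V(\text{Start}) \approx 12$ and STAY as the action at the start state.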

Limitations of MDPs



Calculating $V_\pi(s)$ is expensive!


Must know transition probabilities


Must know reward function


Must fully observe every state


A Tourist in Mordor

Frodo does not know the entire state space.


Imagine that Frodo can only see 1 cell in each direction.


Each observation uniquely identifies the state.

Now Consider This Grid-World

Robot perceives obstacles.


Observations do not uniquely identify the state.

Could be Worse!

Sensors are not 100% accurate.


Observations are now stochastic.

Partially Observable MDPs

POMDPs



States are not fully observable.


Observations are stochastic.


Actions are stochastic.


Rewards are stochastic.


POMDPs



Set of States, $S$


Set of Actions, $A$


Set of Transition Probabilities, $T(s,a,s')$


Set of Rewards, $R(s,a,s')$


Set of Observations, $O$


Set of Observation Probabilities, $\Omega(o|s)$
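Gathered into a single container, these components might look as follows (a sketch only; the field names are ours):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    """The POMDP components listed above, in one structure (illustrative field names)."""
    states: Sequence[str]                     # S
    actions: Sequence[str]                    # A
    T: Callable[[str, str, str], float]       # T(s, a, s'): transition probability
    R: Callable[[str, str, str], float]       # R(s, a, s'): reward
    observations: Sequence[str]               # O
    Omega: Callable[[str, str], float]        # Omega(o | s): observation probability
```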

POMDPs



Belief State, $b(s) = P(s|\text{History})$


Rewards and Transitions can be defined over belief states
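Concretely, after taking action $a$ and receiving observation $o$, the belief is updated with a Bayes-filter step (written with the deck's $\Omega(o \mid s)$; some treatments also condition the observation on the action):

\[
b'(s') \;\propto\; \Omega(o \mid s') \sum_{s} T(s, a, s')\, b(s),
\]

normalised so that $\sum_{s'} b'(s') = 1$.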


POMDPs


Equivalent to MDPs over the set of belief states


Since $b(s)$ is continuous, there are infinitely many belief states.


Solving POMDPs

For a finite horizon, the space of belief states can be divided into regions.


Within each region, one alpha vector is optimal.


Each alpha vector represents a policy.
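Concretely, with a set $\Gamma$ of alpha vectors, the finite-horizon value function is piecewise linear and convex in the belief:

\[
V(b) = \max_{\alpha \in \Gamma} \sum_{s} \alpha(s)\, b(s),
\]

and the maximising alpha vector at a belief $b$ tells us which policy to follow from $b$.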