It's Friday night. How would you get to Cambridge as quickly as possible?



  • Drive
  • Take a Bus
  • Take the T
  • Uber/Lyft
  • Swim across the Charles

Recall environment uncertainties from previous lectures!


What are the unknowns in this problem?

How do stochastic transitions feature in this problem?

Uncertainty in the real world



Robotics


Weather models & agriculture


Route planning


Frodo's Dilemma



Frodo's Dilemma



Goal: Get to Mount Doom
(volcano in the top row)


Seen by Sauron (the 3 red cells):
Damage to health


Cave next to Frodo:
Shelter, but game ends

Frodo's Dilemma



Rewards!

Now imagine Frodo is a bit clumsy!



Probability of Slipping = 0.1

Slipping takes Frodo in a random direction


Should he still try?
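One way to write this slip model as transition probabilities, assuming a slip sends Frodo uniformly in one of the four directions and ignoring walls (the slides leave the exact slip distribution open):

\[
T(s, a, s') =
\begin{cases}
(1 - p_{\text{slip}}) + \dfrac{p_{\text{slip}}}{4}, & s' \text{ is the cell that action } a \text{ aims at}\\[4pt]
\dfrac{p_{\text{slip}}}{4}, & s' \text{ is one of the other three neighbouring cells}
\end{cases}
\]

with $p_{\text{slip}} = 0.1$ here, and $0.3$ or $0.5$ on the following slides.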

Now imagine Frodo is a bit clumsy!



Probability of Slipping = 0.3


Should he still try?

Now imagine Frodo is a bit clumsy!



What about Probability of Slipping = 0.5?

A Game of Dice!



  • Choose to stay or quit


  • If you quit, the game ends and you get $10

  • If you stay, you get $4, and we roll a die


  • If we roll a 1 or 2, the game ends; otherwise we play another round (a quick simulation sketch follows below)
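A quick sanity check of the dice game above, by simulation. This is a sketch under two assumptions not stated on the slide: a fair six-sided die and no discounting.

```python
import random

def play_always_stay(n_games: int = 100_000) -> float:
    """Estimate the expected winnings of always choosing STAY (fair die, no discounting assumed)."""
    total = 0
    for _ in range(n_games):
        winnings = 0
        while True:
            winnings += 4                      # staying pays $4 this round
            if random.randint(1, 6) <= 2:      # a roll of 1 or 2 ends the game
                break
        total += winnings
    return total / n_games

print(play_always_stay())   # empirically close to 12, versus a guaranteed 10 for QUIT
```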

Markov Decision Processes



  • Start State

  • State Space

  • Set of Actions

  • Transition Probabilities

  • Rewards

  • Discount Factor
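As a concrete instance of these components, here is one possible encoding of the dice game as an MDP. All names are illustrative, rewards are written as a function of (state, action) since that is all this game needs, and $\gamma = 1$ is an assumption (the slide does not fix a discount factor).

```python
# A minimal sketch of the dice game as an MDP (illustrative names throughout).
START, END = "Start", "End"

start_state = START
states = [START, END]                          # state space (END is terminal)
actions = {START: ["stay", "quit"], END: []}   # available actions per state
gamma = 1.0                                    # assumed discount factor

# T[s][a] = list of (next_state, probability); R[s][a] = immediate reward
T = {
    START: {
        "stay": [(START, 2/3), (END, 1/3)],    # roll 3-6: keep playing; roll 1-2: game ends
        "quit": [(END, 1.0)],
    },
}
R = {
    START: {"stay": 4, "quit": 10},
}
```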

Solutions: Policies



A mapping from each state to an action

Frodo's Dilemma





Example policy: { (4,1): UP, (3,1): UP, ...}


Remember: because transitions are stochastic, the chosen action may not actually take you to the intended cell!

What about the dice game?



Example policy: { Start: QUIT }


Example policy: { Start: STAY }


Given a policy, how do we evaluate it?


Following a policy gives us a random path.



The utility of one such path is the discounted sum of the rewards collected along it.



Value of a policy = Expected Utility
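Written out, with $r_t$ denoting the reward collected at step $t$ of the path and $\gamma$ the discount factor:

\[
\text{Utility of a path} = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots,
\qquad
V_\pi(s) = \mathbb{E}\!\left[\, r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots \;\middle|\; s_0 = s,\ \pi \right].
\]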

Policy Evaluation



  • Expected Value

    \[V_\pi(s) = \begin{cases} 0, & \text{if } s \text{ is terminal}\\ Q_\pi(s,\pi(s)), & \text{otherwise} \end{cases} \]

  • \[ Q_\pi(s,a) = \sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_\pi(s')] \]
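For example, in the dice game (taking $\gamma = 1$, an assumption), the two single-decision policies evaluate to:

\[
Q(\text{Start}, \text{quit}) = 10,
\qquad
Q(\text{Start}, \text{stay}) = \tfrac{1}{3}\,(4 + 0) + \tfrac{2}{3}\,\big(4 + V_\pi(\text{Start})\big),
\]

so under the STAY policy $V_\pi(\text{Start}) = 4 + \tfrac{2}{3} V_\pi(\text{Start})$, i.e. $V_\pi(\text{Start}) = 12$, compared with $10$ for quitting immediately.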

Policy Evaluation



Iterative Updates to Expected Utility


Bellman Equation


\[ V_\pi^{t}(s) = \sum_{s'} T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_\pi^{t-1}(s')] \]

Dice game under the STAY policy:

State    t=0    t=1    t=2     t=3    ...    t→∞
Start     0      4     6.67    8.44   ...    ~12
End       0      0     0       0      ...     0
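A short sketch of this iterative update for the dice game under the STAY policy (again assuming $\gamma = 1$), which reproduces the row above:

```python
# Iterative policy evaluation for the dice game, fixed policy pi(Start) = stay.
# gamma = 1 is an assumption; probabilities and rewards come from the game rules.
gamma = 1.0
V = {"Start": 0.0, "End": 0.0}   # V^0: initialise all values to 0

for t in range(1, 20):
    # Bellman backup for the fixed policy; End is terminal, so its value stays 0.
    V = {
        "End": 0.0,
        "Start": (1/3) * (4 + gamma * 0.0) + (2/3) * (4 + gamma * V["Start"]),
    }
    print(t, round(V["Start"], 2))   # 4, 6.67, 8.44, ... -> converges towards 12
```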

Now how do we find the optimal policy?


Hint: Similar iterative algorithm

Value Iteration to Learn Optimal Policy



\[ V_{opt}(s) = \max_{a\in A}\sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_{opt}(s')] \]


\[ \pi_{opt}(s) = \operatorname*{arg\,max}_{a\in A} \sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_{opt}(s')]\]
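A generic sketch of these two equations in code. The data layout mirrors the dice-game encoding sketched earlier (`states`, `actions`, `T`, `R`, `gamma` are the illustrative names from that sketch, not anything fixed by the slides).

```python
def value_iteration(states, actions, T, R, gamma, iters=100):
    """Repeatedly apply the Bellman optimality backup, then read off a greedy policy.

    T[s][a] is a list of (next_state, probability); R[s][a] is the immediate reward.
    States with no available actions are treated as terminal (value 0).
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                (sum(p * (R[s][a] + gamma * V[s2]) for s2, p in T[s][a])
                 for a in actions[s]),
                default=0.0,        # terminal states keep value 0
            )
            for s in states
        }
    # Greedy policy extraction: the argmax on the slide.
    pi = {
        s: max(actions[s],
               key=lambda a: sum(p * (R[s][a] + gamma * V[s2]) for s2, p in T[s][a]))
        for s in states if actions[s]
    }
    return V, pi
```

On the dice-game MDP sketched earlier, this yields $V(\text{Start}) \approx 12$ and STAY as the action at the start state.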

Limitations of MDPs



Calculating $V_\pi(s)$ is expensive!


Must know transition probabilities


Must know reward function


Must fully observe every state


A Tourist in Mordor

Frodo does not know the entire state space.


Imagine that Frodo can only see 1 cell in each direction.


Each observation uniquely identifies the state.

Now Consider This Grid-World

Robot perceives obstacles.


Observations do not uniquely identify the state.

Could be Worse!

Sensors are not 100% accurate.


Observations are now stochastic.

Partially Observable MDPs

POMDPs



States are not fully observable.


Observations are stochastic.


Actions are stochastic.


Rewards are stochastic.


POMDPs



Set of States, $S$


Set of Actions, $A$


Set of Transition Probabilities, $T(s,a,s')$


Set of Rewards, $R(s,a,s')$


Set of Observations, $O$


Set of Observation Probabilities, $\Omega(o|s)$
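Gathered into a single container, these components might look as follows (a sketch only; the field names are ours):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class POMDP:
    """The POMDP components listed above, in one structure (illustrative field names)."""
    states: Sequence[str]                     # S
    actions: Sequence[str]                    # A
    T: Callable[[str, str, str], float]       # T(s, a, s'): transition probability
    R: Callable[[str, str, str], float]       # R(s, a, s'): reward
    observations: Sequence[str]               # O
    Omega: Callable[[str, str], float]        # Omega(o | s): observation probability
```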

POMDPs



Belief State, $b(s) = P(s|\text{History})$


Rewards and Transitions can be defined over belief states
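Concretely, after taking action $a$ and receiving observation $o$, the belief is updated with a Bayes-filter step (written with the deck's $\Omega(o \mid s)$; some treatments also condition the observation on the action):

\[
b'(s') \;\propto\; \Omega(o \mid s') \sum_{s} T(s, a, s')\, b(s),
\]

normalised so that $\sum_{s'} b'(s') = 1$.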


POMDPs


Equivalent to MDPs over the set of belief states


Since $b(s)$ is continuous, there are infinitely many belief states.


Solving POMDPs

For a finite horizon, the space of belief states can be divided into regions.


Within each region, one alpha vector is optimal.


Each alpha vector represents a policy.
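Concretely, with a set $\Gamma$ of alpha vectors, the finite-horizon value function is piecewise linear and convex in the belief:

\[
V(b) = \max_{\alpha \in \Gamma} \sum_{s} \alpha(s)\, b(s),
\]

and the maximising alpha vector at a belief $b$ tells us which policy to follow from $b$.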