It's Friday night. How would you get to Cambridge as quickly as possible?
Drive
Take a Bus
Take the T
Uber/Lyft
Swim across the Charles
Recall environment uncertainties from previous lectures!
What are the unknowns in this problem?
How do stochastic transitions feature in this problem?
Uncertainty in the real world
Robotics
Weather models & agriculture
Route planning
Frodo's Dilemma
Frodo's Dilemma
Goal: Get to Mount Doom (volcano in the top row)
Seen by Sauron (the 3 red cells): Damage to health
Cave next to Frodo: Shelter, but game ends
Frodo's Dilemma
Rewards!
Now imagine Frodo is a bit clumsy!
Probability of Slipping = 0.1
Slipping takes Frodo in a random direction
Should he still try?
Now imagine Frodo is a bit clumsy!
Probability of Slipping = 0.3
Should he still try?
Now imagine Frodo is a bit clumsy!
What about Probability of Slipping = 0.5?
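As a rough sketch of how the slipping dynamics can be modeled (the function name and the four-direction action set here are my assumptions, and "random direction" is read as uniform over all four directions), the intended move succeeds with probability $1-p$ and otherwise a uniformly random direction is taken:

```python
import random

def slip_transition(intended_move, p_slip=0.1):
    """Direction Frodo actually moves: with probability p_slip he slips
    and moves in a uniformly random direction instead of the intended one."""
    directions = ["UP", "DOWN", "LEFT", "RIGHT"]
    if random.random() < p_slip:
        return random.choice(directions)  # slipped: uniformly random direction
    return intended_move                  # no slip: move as intended
```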
A Game of Dice!
Choose to stay or quit
If you quit, the game ends and you get $10
If you stay, you get $4, and we roll a die
If we roll a 1 or 2, the game ends; otherwise we play another round
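A quick back-of-the-envelope check (my calculation, not spelled out on the slide): if you always choose stay, the number of rounds played is geometric with stopping probability $1/3$, so

\[ \mathbb{E}[\text{rounds}] = \frac{1}{1/3} = 3 \quad\Rightarrow\quad \mathbb{E}[\text{total if you always stay}] = 3 \times \$4 = \$12, \]

which is more than the guaranteed $10 for quitting.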
Markov Decision Processes
Start State
State Space
Set of Actions
Transition Probabilities
Rewards
Discount Factor
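As an illustration of these components, here is a minimal sketch of the dice game as an MDP (the two-state encoding, the dictionary layout, and $\gamma = 1$ are my assumptions; the numbers come from the game description):

```python
# Dice game as an MDP. States: "in" = still playing, "end" = terminal.
start_state = "in"
states = ["in", "end"]
actions = ["stay", "quit"]

# Transition probabilities: T[(s, a)] = list of (next_state, probability)
T = {
    ("in", "quit"): [("end", 1.0)],
    ("in", "stay"): [("in", 2/3), ("end", 1/3)],  # a roll of 1 or 2 ends the game
}

# Rewards: R[(s, a, s')] = reward received on that transition
R = {
    ("in", "quit", "end"): 10,
    ("in", "stay", "in"): 4,
    ("in", "stay", "end"): 4,
}

gamma = 1.0  # discount factor (assumed)
```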
Solutions:
Policies
A mapping from each state to an action
Frodo's Dilemma
Example policy: {(4,1): UP, (3,1): UP, ...}
Remember: because Frodo may slip, following the policy may not actually lead him to the intended cell!
What about the dice game?
Example policy: {Start: QUIT}
Example policy: {Start: STAY}
Given a policy, how do we evaluate it?
Following a policy gives us a random path.
The utility of a policy is the discounted sum of rewards on the path.
Value of a policy = Expected Utility
Policy Evaluation
Expected Value
\[V_\pi(s) = \begin{cases} 0, & \text{if } s \text{ is terminal}\\ Q_\pi(s,\pi(s)), & \text{otherwise} \end{cases} \]
\[ Q_\pi(s,a) = \sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_\pi(s')] \]
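As a worked example of these definitions on the dice game, take the policy that always chooses stay and assume $\gamma = 1$. The recursion can be solved directly:

\[ V_\pi(\text{Start}) = \tfrac{1}{3}\big(4 + \gamma \cdot 0\big) + \tfrac{2}{3}\big(4 + \gamma\, V_\pi(\text{Start})\big) \;\Rightarrow\; V_\pi(\text{Start}) = 12 \]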
Policy Evaluation
Iterative Updates to Expected Utility
Bellman Equation
\[ V_\pi^{t}(s) = \sum_{s'} T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_\pi^{t-1}(s')] \]
Policy Evaluation
Iterative Updates to Expected Utility
Bellman Equation
\[ V_\pi^{t}(s) = \sum_{s'} T(s,\pi(s),s')[R(s,\pi(s),s')+\gamma V_\pi^{t-1}(s')] \]
State | t=0 | t=1 | t=2  | t=3  | ... | t→∞
Start | 0   | 4   | 6.66 | 8... | ... | ~12
End   | 0   | 0   | 0    | 0    | ... | 0
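A minimal sketch of this iterative update for the dice game under the always-stay policy, which reproduces the Start row above (assuming $\gamma = 1$; the variable names are mine):

```python
# Iterative policy evaluation for the dice game under the "stay" policy.
gamma = 1.0     # discount factor (assumed to be 1 here)
v_start = 0.0   # V^0(Start) = 0
v_end = 0.0     # End is terminal, so its value stays 0

for t in range(1, 30):
    # Bellman update: V^t(s) = sum_{s'} T(s, pi(s), s') [R(s, pi(s), s') + gamma V^{t-1}(s')]
    v_start = (1/3) * (4 + gamma * v_end) + (2/3) * (4 + gamma * v_start)
    if t <= 3:
        print(t, round(v_start, 2))  # prints 4.0, 6.67, 8.44

print(round(v_start, 2))             # approaches 12.0
```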
Now how do we find the optimal policy?
Hint: Similar iterative algorithm
Value Iteration to Learn Optimal Policy
\[ V_{opt}(s) = \max_{a\in A}\sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_{opt}(s')] \]
\[ \pi_{opt}(s) = \arg\max_{a\in A} \sum_{s'} T(s,a,s')[R(s,a,s')+\gamma V_{opt}(s')]\]
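A minimal sketch of value iteration on the dice game (reusing the dictionary encoding from the earlier MDP sketch and assuming $\gamma = 1$):

```python
# Value iteration on the dice game: repeatedly back up the best action's Q-value.
gamma = 1.0
T = {("in", "quit"): [("end", 1.0)],
     ("in", "stay"): [("in", 2/3), ("end", 1/3)]}
R = {("in", "quit", "end"): 10,
     ("in", "stay", "in"): 4,
     ("in", "stay", "end"): 4}

V = {"in": 0.0, "end": 0.0}  # "end" is terminal, so its value stays 0
for _ in range(100):
    Q = {a: sum(p * (R[("in", a, s2)] + gamma * V[s2]) for s2, p in T[("in", a)])
         for a in ("stay", "quit")}
    V["in"] = max(Q.values())

print(round(V["in"], 2), max(Q, key=Q.get))  # approaches 12.0; optimal action is "stay"
```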
Limitations of MDPs
Calculating $V_\pi(s)$ is expensive!
Must know transition probabilities
Must know reward function
Must fully observe every state
A Tourist in Mordor
Frodo does not know the entire state space.
Imagine that Frodo can only see 1 cell in each direction.
Each observation uniquely identifies the state.
Now Consider This Grid-World
Robot perceives obstacles.
Observations do not uniquely identify the state.
Could be Worse!
Sensors are not 100% accurate.
Observations are now stochastic.
Partially Observable MDPs
POMDPs
States are not fully observable.
Observations are stochastic.
Actions are stochastic.
Rewards are stochastic.
POMDPs
Set of States, $S$
Set of Actions, $A$
Set of Transition Probabilities, $T(s,a,s')$
Set of Rewards, $R(s,a,s')$
Set of Observations, $O$
Set of Observation Probabilities, $\Omega(o|s)$
POMDPs
Belief State, $b(s) = P(s|\text{History})$
Rewards and Transitions can be defined over belief states
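For concreteness, the standard belief update after taking action $a$ and receiving observation $o$ (standard background, not spelled out on these slides) is

\[ b'(s') \propto \Omega(o \mid s') \sum_{s} T(s, a, s')\, b(s), \]

normalized so that $b'$ sums to 1.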
POMDPs
Equivalent to MDPs over the set of belief states
Since $b(s)$ is continuous, there are infinitely many belief states
Solving POMDPs
For a finite horizon, $b(s)$ can be divided into regions.
Within each region, one alpha vector is optimal.
Each alpha vector represents a policy.
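In standard notation (not taken from these slides), if $\Gamma$ is the set of alpha vectors, the finite-horizon value function is piecewise linear in the belief:

\[ V(b) = \max_{\alpha \in \Gamma} \sum_{s} \alpha(s)\, b(s) \]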