Try this for the other dice game, with actions Stay or Quit
Model-Free Monte Carlo
The update can be interpreted as an iterative convex combination (a running average) of the old estimate and the new sample target.
The SARSA algorithm: for each observed $(s, a, r, s', a')$,\[ \hat{Q}^t_{\pi}(s, a) = (1-\eta)\hat{Q}^{t-1}_{\pi}(s, a) + \eta \left[ r + \gamma \hat{Q}^{t-1}_{\pi}(s', a') \right] \]where $\eta = \frac{1}{1 + \textrm{no. of updates to }\hat{Q}_\pi(s,a)}$
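Below is a minimal Python sketch of this update, assuming transitions arrive one at a time as $(s, a, r, s', a')$ tuples generated by following the policy $\pi$; the names make_sarsa_updater, Q, and num_updates are illustrative, not from the slides.

```python
import collections

def make_sarsa_updater(gamma):
    """Returns (Q, update): running-average SARSA estimates of Q_pi."""
    Q = collections.defaultdict(float)          # Q[(s, a)] -> current estimate of Q_pi(s, a)
    num_updates = collections.defaultdict(int)  # number of updates applied to each (s, a)

    def update(s, a, r, s_next, a_next):
        # Step size eta = 1 / (1 + no. of updates to Q(s, a)),
        # so each update is a convex combination (running average).
        eta = 1.0 / (1 + num_updates[(s, a)])
        target = r + gamma * Q[(s_next, a_next)]   # bootstrap on the next action chosen by pi
        Q[(s, a)] = (1 - eta) * Q[(s, a)] + eta * target
        num_updates[(s, a)] += 1

    return Q, update
```

Calling update(s, a, r, s_next, a_next) once per observed transition reproduces the running-average formula above.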
Model-Free Monte Carlo
The Q-Learning Algorithm: to estimate the optimal policy, for each observed $(s, a, r, s')$,
\[ \hat{Q}^t_{opt}(s, a) = (1-\eta)\hat{Q}^{t-1}_{opt}(s, a) + \eta \left[ r + \gamma \hat{V}^{t-1}_{opt}(s') \right] \]where $\hat{V}^{t-1}_{opt}(s') = \max_{a'} \hat{Q}^{t-1}_{opt}(s', a')$ and $\eta = \frac{1}{1 + \textrm{no. of updates to }\hat{Q}_{opt}(s,a)}$
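A companion sketch of the Q-learning update, under the assumption that the actions available in each state can be enumerated; the callable actions(s) and the names make_q_learning_updater and Q_opt are assumptions for illustration.

```python
import collections

def make_q_learning_updater(gamma, actions):
    """Q-learning: update toward r + gamma * max_{a'} Q_opt(s', a').

    `actions(s)` is assumed to return the list of actions available in state s.
    """
    Q_opt = collections.defaultdict(float)
    num_updates = collections.defaultdict(int)

    def update(s, a, r, s_next):
        eta = 1.0 / (1 + num_updates[(s, a)])
        # V_opt(s') = max over a' of the current estimate (0.0 if s' has no actions, e.g. terminal).
        v_next = max((Q_opt[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
        Q_opt[(s, a)] = (1 - eta) * Q_opt[(s, a)] + eta * (r + gamma * v_next)
        num_updates[(s, a)] += 1

    return Q_opt, update
```

Because the target takes a max over $a'$ rather than using the action actually taken next, this update estimates the optimal Q-values regardless of which (sufficiently exploratory) policy generated the data.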
Exploration vs. Exploitation
Action sequences have to be random enough to cover the state-action (search) space (exploration).
But the policy cannot be too random, or it rarely exploits the high-reward actions it has already found (exploitation).
Exploration vs. Exploitation
Everyday example: strategies to narrow down your favorite restaurants
Everyday example: finding the fastest way to commute to campus
Exploration vs. Exploitation
The $\epsilon$-greedy approach
Start by exploring at random, then reduce the randomness over time to increasingly prefer the estimated optimal policy.
Exploration vs. Exploitation
The $\epsilon$-greedy approach
Initialize $\epsilon$ close to 1.
At each time step:
\[ \begin{cases}
\textrm{with probability $\epsilon$, take a random action}\\
\textrm{with probability $1-\epsilon$, take the current greedy action } \arg\max_{a} \hat{Q}_{opt}(s, a)\\
\end{cases} \]
Perform the Q-learning update on the observed $(s, a, r, s')$
Gradually reduce $\epsilon$ (a sketch of the full loop follows below)
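Putting the pieces together, here is a hedged sketch of $\epsilon$-greedy Q-learning, reusing make_q_learning_updater from the sketch above and assuming a Gym-style environment whose step(a) returns (next state, reward, done); epsilon_greedy_action, decay, and num_episodes are illustrative names, not part of the slides.

```python
import random

def epsilon_greedy_action(Q_opt, s, actions, epsilon):
    """With probability epsilon act randomly (explore); otherwise act greedily w.r.t. Q_opt (exploit)."""
    acts = actions(s)
    if random.random() < epsilon:
        return random.choice(acts)                    # explore
    return max(acts, key=lambda a: Q_opt[(s, a)])     # exploit

def run_epsilon_greedy_q_learning(env, actions, gamma=0.95,
                                  num_episodes=1000, decay=0.999):
    Q_opt, q_update = make_q_learning_updater(gamma, actions)
    epsilon = 1.0                                     # initialize epsilon close to 1: mostly explore
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy_action(Q_opt, s, actions, epsilon)
            s_next, r, done = env.step(a)             # assumed environment interface
            q_update(s, a, r, s_next)                 # Q-learning update on (s, a, r, s')
            s = s_next
        epsilon *= decay                              # reduce randomness over time
    return Q_opt
```

Multiplicative decay is only one common choice for reducing $\epsilon$; any schedule that explores heavily early on and shifts toward exploitation later fits the recipe above.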