Recap - Deep Learning
Neural Networks with multiple hidden layers
Loss Functions & Gradient Descent
What do real loss functions look like?
Dealing with Local Minima
Thoughts?
Learning Rate Scheduling
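One common approach is step decay, where the learning rate shrinks as training progresses. A minimal sketch (the decay factor and interval below are illustrative assumptions, not values from the lecture):

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every 10 epochs (illustrative schedule)."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# e.g. step_decay(0.1, epoch=25) -> 0.1 * 0.5**2 = 0.025
```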
Momentum
Remember the direction of previous gradients
Momentum keeps the parameters moving in that direction
Momentum
Assume parameter $w$ is being updated
Standard Gradient Update
$w = w - \eta \nabla L_w$
Update with Momentum
$v_t = \beta v_{t-1} - \eta \nabla L_w$
$w = w + v_t$
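A minimal NumPy sketch of the update above (the default $\beta = 0.9$ and the variable names are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One momentum update: v remembers previous gradients, w keeps moving along v."""
    v = beta * v - lr * grad   # accumulate the direction of previous gradients plus the current step
    w = w + v                  # parameters continue updating in that direction
    return w, v
```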
The Adam Optimizer
Adaptive Moment Estimation
Momentum + Higher Order Gradient Moments
Adaptive Learning Rates
The Adam Optimizer
Adaptive Moment Estimation
$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla L_w$
$\quad\leftarrow$ First moment: mean of gradient
$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, (\nabla L_w)^2$
$\quad\leftarrow$ Second moment: uncentered variance of gradient
Bias Correction
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w = w - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
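A minimal sketch of one Adam step following the equations above (the default hyperparameters are the commonly used values, shown here as assumptions):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment: mean of gradient
    v = beta2 * v + (1 - beta2) * grad**2         # second moment: uncentered variance
    m_hat = m / (1 - beta1**t)                    # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive per-parameter step size
    return w, m, v
```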
Natural Language Processing
Word Vectors
Bag of Words
Tf-Idf (see the sketch below)
$\textrm{count of }t \cdot \log\frac{\textrm{No. of docs}}{\textrm{No. of docs containing }t}$
Word Embeddings
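A minimal sketch of the Tf-Idf formula above on a toy corpus (the documents and whitespace tokenization are illustrative):

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]                                              # toy corpus (illustrative)

def tf_idf(term, doc, docs):
    """Tf-Idf per the formula above: raw count of the term times log(N / docs containing it)."""
    tf = doc.split().count(term)
    n_containing = sum(term in d.split() for d in docs)
    return tf * math.log(len(docs) / n_containing)

print(tf_idf("cat", docs[0], docs))            # rare term -> higher weight
print(tf_idf("the", docs[0], docs))            # term in most docs -> down-weighted
```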
Natural Language Processing
Word Embeddings
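A word embedding is in effect a learned lookup table from tokens to dense vectors; a minimal sketch (the vocabulary, dimension, and random initialization are illustrative):

```python
import numpy as np

vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}   # toy vocabulary (illustrative)
emb_dim = 8
E = np.random.randn(len(vocab), emb_dim) * 0.01          # embedding matrix, learned during training

def embed(word):
    """Look up the dense vector for a word."""
    return E[vocab[word]]
```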
Deep Sequence Modeling
Real-world problems often involve sequential input
Most language problems
Speech recognition
Genomic Data
How do we model sequences?
Introducing Time to Neurons
Recurrent Neural Networks
Consider a stack of perceptrons
Now imagine applying that same stack at each time step
Recurrent Neural Networks
Input $x_t$
Update Hidden State
$h_t = \sigma(\mathbf{w}_{hh}\cdot h_{t-1} + \mathbf{w}_{xh}\cdot x_t)$
Output: $y_t = \mathbf{w}_{hy}\cdot h_t$
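A minimal sketch of one time step, mirroring the equations above; tanh is assumed as the nonlinearity $\sigma$, and bias terms are omitted to match the slide:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, W_hy):
    """One recurrent time step."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # update hidden state from previous state and input
    y_t = W_hy @ h_t                            # output is a linear readout of the hidden state
    return h_t, y_t
```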
Recurrent Neural Networks
Computing Loss
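By convention the sequence loss is the sum of the per-step losses (a standard choice, stated here as an assumption):

$$L = \sum_{t=1}^{T} L_t$$

where $L_t$ compares the output $y_t$ with its target at step $t$.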
Recurrent Neural Networks
Computing Gradients
Backpropagation through time
Recurrent Neural Networks
Computing Gradients
Repeated multiplications through time
Vanishing/Exploding Gradients (illustrated numerically below)
Solution: LSTM
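Before turning to the LSTM, a numeric illustration of the repeated-multiplication problem (the recurrent weight magnitudes are arbitrary):

```python
T = 50                       # number of time steps to backpropagate through
for w_hh in (0.5, 1.5):      # recurrent weight magnitudes (illustrative)
    grad_factor = w_hh ** T  # backprop multiplies roughly T copies of the same factor
    print(f"w_hh={w_hh}: gradient scaled by ~{grad_factor:.2e} after {T} steps")
# 0.5**50 ~ 9e-16 (vanishes), 1.5**50 ~ 6e+8 (explodes)
```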
LSTM (gated cells)
Selectively add or remove information at each step
Forget
Discard irrelevant information
Store
Store relevant new information from the current input
Update
Selectively update cell state
Output
Return filtered version of cell state
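A minimal sketch of one gated-cell update, with the four operations above marked in comments; the parameter layout (dicts W, U, b keyed by gate) is an illustrative assumption:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b hold per-gate weights and biases (illustrative layout)."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget: discard irrelevant cell state
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # store: decide what new info to keep
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # candidate values from current input
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output: filter the cell state
    c_t = f * c_prev + i * g                               # update: selectively update cell state
    h_t = o * np.tanh(c_t)                                 # return filtered version of cell state
    return h_t, c_t
```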
Attention Mechanism
Fundamental idea behind transformers, including GPT-3/4
Automatically learn masks over inputs
Figure out which part of input to pay attention to
Attention Mechanism
Consider a position-aware vector encoding of inputs
For each word (in parallel), calculate attention
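A minimal sketch of single-head scaled dot-product self-attention over such position-aware encodings; the projections W_q, W_k, W_v stand in for learned weights (illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    return np.exp(z) / np.exp(z).sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Self-attention over encodings X (one row per word), computed for all words in parallel."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values for every word
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how much each word should attend to every other
    weights = softmax(scores, axis=-1)         # the learned "mask" over the inputs
    return weights @ V                         # weighted combination of the values
```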