Recap - Deep Learning


Neural Networks with multiple hidden layers


Loss Functions & Gradient Descent

What do real loss functions look like?

Dealing with Local Minima


Thoughts?



Learning Rate Scheduling
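As a concrete illustration, here is a minimal sketch of one common schedule (step decay); the decay factor and interval are assumed values, not taken from the slides:

```python
def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs (illustrative values)."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Learning rate over the first 30 epochs
lrs = [step_decay_lr(0.1, e) for e in range(30)]
print(lrs[0], lrs[10], lrs[20])  # 0.1, 0.05, 0.025
```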


Momentum

Momentum


Remember the direction of previous gradients


Momentum causes parameter updates to continue in that direction

Momentum


Assume parameter $w$ is being updated



Standard Gradient Update

$w = w - \eta \nabla L_w$


Update with Momentum

$v_t = \beta v_{t-1} - \eta \nabla L_w$
$w = w + v_t$
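A minimal NumPy-free sketch of the two update rules above; the quadratic loss and the hyperparameter values are illustrative assumptions:

```python
# Illustrative loss L(w) = w^2, so grad L = 2w
def grad_L(w):
    return 2.0 * w

eta, beta = 0.1, 0.9          # learning rate and momentum coefficient (assumed values)
w_plain, w_mom, v = 5.0, 5.0, 0.0

for t in range(20):
    # Standard gradient update
    w_plain = w_plain - eta * grad_L(w_plain)

    # Update with momentum: v_t = beta * v_{t-1} - eta * grad, then w = w + v_t
    v = beta * v - eta * grad_L(w_mom)
    w_mom = w_mom + v

print(w_plain, w_mom)  # both converge toward the minimum at w = 0
```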


The Adam Optimizer

Adaptive Moment Estimation



Momentum + Higher Order Gradient Moments


Adaptive Learning Rates



The Adam Optimizer

Adaptive Moment Estimation



$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla L_w \quad\leftarrow$ Mean of Gradient


$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, (\nabla L_w)^2 \quad\leftarrow$ Variance of Gradient


Bias Correction
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w = w - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
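A minimal NumPy sketch of the Adam update above; the loss is illustrative, and the hyperparameter values ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are the commonly used defaults, assumed here:

```python
import numpy as np

def grad_L(w):                       # illustrative loss L(w) = w^2
    return 2.0 * w

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 5.0, 0.0, 0.0

for t in range(1, 101):              # t starts at 1 for the bias correction
    g = grad_L(w)
    m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradient)
    v = beta2 * v + (1 - beta2) * g**2       # second moment (variance of gradient)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # close to the minimum at w = 0
```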

Natural Language Processing


Natural Language Processing


Word Vectors


Bag of Words


Tf-Idf
$\textrm{count of }t \cdot \log\frac{\textrm{No. of docs}}{\textrm{No. of docs containing }t}$


Word Embeddings
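A minimal sketch of the Tf-Idf weighting above on a toy corpus; the documents and terms are made up for illustration:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]

def tf_idf(term, doc, docs):
    tf = doc.split().count(term)                            # count of t in this document
    df = sum(1 for d in docs if term in d.split())          # no. of docs containing t
    idf = math.log(len(docs) / df) if df else 0.0           # log(no. of docs / df)
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # "cat" is rare in the corpus, so it gets a higher weight
print(tf_idf("the", docs[0], docs))  # "the" appears in most docs, so its weight is lower
```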

Natural Language Processing


Word Embeddings
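A minimal sketch of what a word embedding is mechanically: a learned lookup table mapping each word to a dense vector. The vocabulary, dimensions, and random initialisation below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "mat", "sat"]     # toy vocabulary (illustrative)
embedding_dim = 3

# One dense vector per word; in practice these values are learned, not random
embeddings = rng.normal(size=(len(vocab), embedding_dim))
word_to_index = {w: i for i, w in enumerate(vocab)}

def embed(word):
    return embeddings[word_to_index[word]]

print(embed("cat"))  # a 3-dimensional dense vector for "cat"
```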


Deep Sequence Modeling


Real-world problems often involve sequential input



Most language problems

Speech recognition

Genomic Data



How do we model sequences?

Introducing Time to Neurons

Recurrent Neural Networks



Consider a stack of perceptrons

Now think about doing this at each time step

Recurrent Neural Networks


Recurrent Neural Networks


Input $x_t$


Update Hidden State
$h_t = \sigma(\mathbf{w}_{hh}\cdot h_{t-1} + \mathbf{w}_{xh}\cdot x_t)$


Output, $y_t = \mathbf{w}_{hy}\cdot h_t$
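A minimal NumPy sketch of the recurrence above; the dimensions, weight initialisation, and choice of $\sigma = \tanh$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3   # assumed sizes

# Weight matrices from the slide: w_hh, w_xh, w_hy
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def rnn_step(h_prev, x_t):
    # h_t = sigma(w_hh h_{t-1} + w_xh x_t)
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    # y_t = w_hy h_t
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 time steps
    h, y = rnn_step(h, x_t)
print(y.shape)  # (3,)
```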

Recurrent Neural Networks

Computing Loss


Recurrent Neural Networks

Computing Gradients



Backpropagation through time

Recurrent Neural Networks

Computing Gradients



Repeated multiplications through time


Vanishing/Exploding Gradients


Solution: LSTM

LSTM (gated cells)

Selectively add or remove information at each step



Forget
Discard irrelevant information


Store
Store relevant information from the current input


Update
Selectively update cell state


Output
Return filtered version of cell state
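A minimal NumPy sketch of one LSTM cell step, following the forget / store / update / output structure above; the gate equations follow the standard LSTM formulation, and the dimensions, initialisation, and omitted biases are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8                  # assumed sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [h_{t-1}, x_t] (biases omitted for brevity)
W_f, W_i, W_c, W_o = [rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
                      for _ in range(4)]

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)              # forget: discard irrelevant parts of the cell state
    i = sigmoid(W_i @ z)              # store: decide which new information to keep
    c_tilde = np.tanh(W_c @ z)        # candidate values from the current input
    c_t = f * c_prev + i * c_tilde    # update: selectively update the cell state
    o = sigmoid(W_o @ z)              # output gate
    h_t = o * np.tanh(c_t)            # output: filtered version of the cell state
    return h_t, c_t

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 time steps
    h, c = lstm_step(h, c, x_t)
print(h.shape)  # (8,)
```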


Attention Mechanism



Fundamental idea behind transformers, including GPT-3 and GPT-4


Automatically learn soft masks (weightings) over the inputs


Figure out which part of input to pay attention to


Attention Mechanism

Consider a position-aware vector encoding of inputs

For each word (in parallel), calculate attention
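A minimal NumPy sketch of (scaled dot-product) self-attention over a toy sequence of position-aware word vectors; the projection matrices, dimensions, and random inputs are illustrative assumptions, and this shows only the core attention computation, not a full transformer layer:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                  # assumed: 5 words, 16-dim encodings

X = rng.normal(size=(seq_len, d_model))   # position-aware vector encodings of the inputs

# Learned projections to queries, keys, values (random here for illustration)
W_q, W_k, W_v = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3)]
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention weights: each word scores every other word, computed in parallel
scores = Q @ K.T / np.sqrt(d_model)                                     # (seq_len, seq_len)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)    # softmax per row

# Each output is a weighted mix of the values: which parts of the input to attend to
output = weights @ V
print(weights.shape, output.shape)  # (5, 5) (5, 16)
```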