Recap - Deep Learning


Neural Networks with multiple hidden layers


Loss Functions & Gradient Descent

What do real loss functions look like?

Dealing with Local Minima


Thoughts?



Learning Rate Scheduling
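As a concrete illustration, here is a minimal sketch of one common schedule (step decay); the decay factor and interval are assumed values, not taken from the slides:

```python
def step_decay_lr(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` every `epochs_per_drop` epochs (illustrative values)."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Learning rate over the first 30 epochs
lrs = [step_decay_lr(0.1, e) for e in range(30)]
print(lrs[0], lrs[10], lrs[20])  # 0.1, 0.05, 0.025
```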


Momentum

Momentum


Remember the direction of previous gradients


Momentum causes parameter updates to continue in that direction

Momentum


Assume parameter $w$ is being updated



Standard Gradient Update

$w = w - \eta \nabla L_w$


Update with Momentum

$v_t = \beta v_{t-1} - \eta \nabla L_w$
$w = w + v_t$
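A minimal NumPy-free sketch of the two update rules above; the quadratic loss and the hyperparameter values are illustrative assumptions:

```python
# Illustrative loss L(w) = w^2, so grad L = 2w
def grad_L(w):
    return 2.0 * w

eta, beta = 0.1, 0.9          # learning rate and momentum coefficient (assumed values)
w_plain, w_mom, v = 5.0, 5.0, 0.0

for t in range(20):
    # Standard gradient update
    w_plain = w_plain - eta * grad_L(w_plain)

    # Update with momentum: v_t = beta * v_{t-1} - eta * grad, then w = w + v_t
    v = beta * v - eta * grad_L(w_mom)
    w_mom = w_mom + v

print(w_plain, w_mom)  # both converge toward the minimum at w = 0
```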


The Adam Optimizer

Adaptive Moment Estimation



Momentum + Higher Order Gradient Moments


Adaptive Learning Rates



The Adam Optimizer

Adaptive Moment Estimation



$m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, \nabla L_w \quad\leftarrow$ Mean of Gradient


$v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, (\nabla L_w)^2 \quad\leftarrow$ Variance of Gradient


Bias Correction
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$w = w - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
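A minimal NumPy sketch of the Adam update above; the loss is illustrative, and the hyperparameter values ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) are the commonly used defaults, assumed here:

```python
import numpy as np

def grad_L(w):                       # illustrative loss L(w) = w^2
    return 2.0 * w

eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 5.0, 0.0, 0.0

for t in range(1, 101):              # t starts at 1 for the bias correction
    g = grad_L(w)
    m = beta1 * m + (1 - beta1) * g          # first moment (mean of gradient)
    v = beta2 * v + (1 - beta2) * g**2       # second moment (variance of gradient)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # close to the minimum at w = 0
```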

Natural Language Processing


Natural Language Processing


Word Vectors


Bag of Words


Tf-Idf
$\textrm{count of }t \cdot \log\frac{\textrm{No. of docs}}{\textrm{No. of docs containing }t}$


Word Embeddings
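A minimal sketch of the Tf-Idf weighting above on a toy corpus; the documents and terms are made up for illustration:

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat",
    "cats and dogs",
]

def tf_idf(term, doc, docs):
    tf = doc.split().count(term)                            # count of t in this document
    df = sum(1 for d in docs if term in d.split())          # no. of docs containing t
    idf = math.log(len(docs) / df) if df else 0.0           # log(no. of docs / df)
    return tf * idf

print(tf_idf("cat", docs[0], docs))  # "cat" is rare in the corpus, so it gets a higher weight
print(tf_idf("the", docs[0], docs))  # "the" appears in most docs, so its weight is lower
```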

Natural Language Processing


Word Embeddings
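A minimal sketch of what a word embedding is mechanically: a learned lookup table mapping each word to a dense vector. The vocabulary, dimensions, and random initialisation below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "mat", "sat"]     # toy vocabulary (illustrative)
embedding_dim = 3

# One dense vector per word; in practice these values are learned, not random
embeddings = rng.normal(size=(len(vocab), embedding_dim))
word_to_index = {w: i for i, w in enumerate(vocab)}

def embed(word):
    return embeddings[word_to_index[word]]

print(embed("cat"))  # a 3-dimensional dense vector for "cat"
```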


Deep Sequence Modeling


Real-world problems often involve sequential input



Most language problems

Speech recognition

Genomic Data



How do we model sequences?

Introducing Time to Neurons

Recurrent Neural Networks



Consider a stack of perceptrons

Now think about doing this at each time step

Recurrent Neural Networks


Recurrent Neural Networks


Input $x_t$


Update Hidden State
$h_t = \sigma(\mathbf{w}_{hh}\cdot h_{t-1} + \mathbf{w}_{xh}\cdot x_t)$


Output, $y_t = \mathbf{w}_{hy}\cdot h_t$
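A minimal NumPy sketch of the recurrence above; the dimensions, weight initialisation, and choice of $\sigma = \tanh$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3   # assumed sizes

# Weight matrices from the slide: w_hh, w_xh, w_hy
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))

def rnn_step(h_prev, x_t):
    # h_t = sigma(w_hh h_{t-1} + w_xh x_t)
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    # y_t = w_hy h_t
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 time steps
    h, y = rnn_step(h, x_t)
print(y.shape)  # (3,)
```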

Recurrent Neural Networks

Computing Loss


Recurrent Neural Networks

Computing Gradients



Backpropagation through time

Recurrent Neural Networks

Computing Gradients



Repeated multiplications through time


Vanishing/Exploding Gradients


Solution: LSTM

LSTM (gated cells)

Selectively add or remove information at each step



Forget
Discard irrelevant information


Store
Store relevant information from the current input


Update
Selectively update cell state


Output
Return filtered version of cell state
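A minimal NumPy sketch of one LSTM cell step, following the forget / store / update / output structure above; the gate equations follow the standard LSTM formulation, and the dimensions, initialisation, and omitted biases are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 8                  # assumed sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on [h_{t-1}, x_t] (biases omitted for brevity)
W_f, W_i, W_c, W_o = [rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
                      for _ in range(4)]

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W_f @ z)              # forget: discard irrelevant parts of the cell state
    i = sigmoid(W_i @ z)              # store: decide which new information to keep
    c_tilde = np.tanh(W_c @ z)        # candidate values from the current input
    c_t = f * c_prev + i * c_tilde    # update: selectively update the cell state
    o = sigmoid(W_o @ z)              # output gate
    h_t = o * np.tanh(c_t)            # output: filtered version of the cell state
    return h_t, c_t

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # a toy sequence of 5 time steps
    h, c = lstm_step(h, c, x_t)
print(h.shape)  # (8,)
```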


Attention Mechanism



Fundamental idea behind transformers, including GPT-3 and GPT-4


Automatically learn soft masks (weightings) over the inputs


Figure out which part of input to pay attention to


Attention Mechanism

Consider a position-aware vector encoding of inputs

For each word (in parallel), calculate attention
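A minimal NumPy sketch of (scaled dot-product) self-attention over a toy sequence of position-aware word vectors; the projection matrices, dimensions, and random inputs are illustrative assumptions, and this shows only the core attention computation, not a full transformer layer:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                  # assumed: 5 words, 16-dim encodings

X = rng.normal(size=(seq_len, d_model))   # position-aware vector encodings of the inputs

# Learned projections to queries, keys, values (random here for illustration)
W_q, W_k, W_v = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3)]
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Attention weights: each word scores every other word, computed in parallel
scores = Q @ K.T / np.sqrt(d_model)                                     # (seq_len, seq_len)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)    # softmax per row

# Each output is a weighted mix of the values: which parts of the input to attend to
output = weights @ V
print(weights.shape, output.shape)  # (5, 5) (5, 16)
```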