Recap - Deep Learning

Neural Networks with multiple hidden layers

Loss Functions & Gradient Descent

What do real loss functions look like?

Dealing with Local Minima


Learning Rate Scheduling



Remember the direction of previous gradients

Momentum keeps each parameter moving in the direction of its previous updates


Assume parameter $w$ is being updated

Standard Gradient Update

$w = w - \eta \nabla L_w$

Update with Momentum

$v_t = \beta v_{t-1} - \eta \nabla L_w$
$w = w + v_t$
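The two update rules above can be sketched in a few lines of Python. This is only an illustration: the toy loss $L(w) = (w - 3)^2$, the function names, and the values of `eta`, `beta`, and `steps` are assumptions chosen for the example, not part of the lecture.

```python
def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2 (hypothetical example loss).
    return 2.0 * (w - 3.0)


def gradient_descent(w, eta=0.1, steps=500):
    # Standard update: w = w - eta * grad
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


def gradient_descent_momentum(w, eta=0.1, beta=0.9, steps=500):
    v = 0.0  # velocity starts at zero
    for _ in range(steps):
        v = beta * v - eta * grad(w)  # v_t = beta * v_{t-1} - eta * grad
        w = w + v                     # w = w + v_t
    return w


w_plain = gradient_descent(0.0)
w_momentum = gradient_descent_momentum(0.0)
```

Both variants should end near the minimum at $w = 3$ on this toy loss. The velocity term `v` is what carries the direction of earlier gradients forward.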

The Adam Optimizer

Adaptive Moment Estimation

Momentum + Higher Order Gradient Moments

Adaptive Learning Rates

$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla L_w \quad\leftarrow$ Mean of gradient (first moment)

$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla L_w)^2 \quad\leftarrow$ Uncentered variance of gradient (second moment)

Bias Correction

$$\hat{m}_t = \frac{m_t}{1 - (\beta_1)^t}, \quad \hat{v}_t = \frac{v_t}{1 - (\beta_2)^t}$$

$$w = w - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
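A minimal sketch of the Adam update for a single scalar parameter, following the equations above. The toy loss $L(w) = (w - 3)^2$ and the hyperparameter values are assumptions made for illustration. `m_hat` and `v_hat` denote the bias-corrected moments.

```python
import math


def grad(w):
    # Gradient of the toy loss L(w) = (w - 3)^2 (hypothetical example loss).
    return 2.0 * (w - 3.0)


def adam(w, eta=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m, v = 0.0, 0.0  # running first and second moments of the gradient
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # mean of gradient
        v = beta2 * v + (1 - beta2) * g * g    # mean of squared gradient
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - eta * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_adam = adam(0.0)
```

Dividing by `sqrt(v_hat)` rescales each step by the recent gradient magnitude, which is what makes the learning rate adaptive per parameter.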

Natural Language Processing

Word Vectors

Bag of Words

TF-IDF weight of term $t$ in a document: $\textrm{count of }t \cdot \log\frac{\textrm{No. of docs}}{\textrm{No. of docs containing }t}$
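The weighting above can be sketched directly in Python. The three-document corpus and the function name `tf_idf` are made up for this example, and the natural logarithm is an assumption, since the slide does not fix the base.

```python
import math

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "cat"],
]


def tf_idf(term, doc, docs):
    count = doc.count(term)                           # count of t in this doc
    n_containing = sum(1 for d in docs if term in d)  # docs containing t
    if n_containing == 0:
        return 0.0  # term never appears, so give it no weight
    return count * math.log(len(docs) / n_containing)


weight_the = tf_idf("the", docs[0], docs)  # appears in every doc
weight_cat = tf_idf("cat", docs[2], docs)  # appears twice, in 2 of 3 docs
```

A word that occurs in every document, such as "the" here, gets weight zero because $\log 1 = 0$, while rarer words are weighted up.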

Word Embeddings

Deep Sequence Modeling

Real-world problems often involve sequential input

Most language problems

Speech recognition

Genomic Data

How do we model sequences?

Introducing Time to Neurons

Recurrent Neural Networks

Consider a stack of perceptrons

Now think about doing this at each time step

Input $x_t$

Update Hidden State
$h_t = \sigma(\mathbf{w}_{hh}\cdot h_{t-1} + \mathbf{w}_{xh}\cdot x_t)$

Output, $y_t = \mathbf{w}_{hy}\cdot h_t$
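One time step of the recurrence above, sketched in plain Python. The sketch assumes $\sigma$ is the logistic sigmoid, omits bias terms as the slide does, and uses small hand-picked weight matrices that are purely hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    # Multiply matrix W (list of rows) by vector x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
    # h_t = sigma(W_hh . h_{t-1} + W_xh . x_t)
    pre = [a + b for a, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t))]
    h_t = [sigmoid(z) for z in pre]
    # y_t = W_hy . h_t
    y_t = matvec(W_hy, h_t)
    return h_t, y_t


# Hypothetical sizes: 2 hidden units, 1 input feature, 1 output.
W_hh = [[0.5, -0.3], [0.2, 0.1]]
W_xh = [[1.0], [-1.0]]
W_hy = [[1.0, -1.0]]

h = [0.0, 0.0]
outputs = []
for x_t in [[1.0], [0.5], [-1.0]]:  # a short input sequence
    h, y = rnn_step(h, x_t, W_hh, W_xh, W_hy)
    outputs.append(y)
```

The same three weight matrices are reused at every time step. Only the hidden state `h` changes as the sequence is consumed.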

Computing Loss

Computing Gradients

Backpropagation through time

Repeated multiplications through time

Vanishing/Exploding Gradients
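A rough scalar illustration of why repeated multiplication causes trouble. It assumes the gradient flowing back through each time step is scaled by the same constant factor, a simplification of the real per-step Jacobian. The factors 0.5 and 1.5 are arbitrary example values.

```python
def gradient_scale(factor, steps):
    # Multiply the same per-step factor once for every time step.
    scale = 1.0
    for _ in range(steps):
        scale *= factor
    return scale


vanishing = gradient_scale(0.5, 50)  # factor below 1 shrinks toward zero
exploding = gradient_scale(1.5, 50)  # factor above 1 grows without bound
```

Over 50 steps the first value is tiny and the second is huge, so early time steps either receive almost no learning signal or destabilize training.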

Solution: LSTM

LSTM (gated cells)

Selectively add or remove information at each step

Discard irrelevant information

Store relevant input from current cell

Selectively update cell state

Return filtered version of cell state
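The four steps above map onto the gates of a standard LSTM cell. The sketch below uses scalar states to keep it short. The parameter layout (`params` as a dict of `(w_x, w_h, b)` tuples) and the weight values are hypothetical, and the gate equations are the common textbook formulation rather than anything specified on the slide.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    def pre(name):
        # Pre-activation for one gate: w_x * x_t + w_h * h_{t-1} + b
        w_x, w_h, b = params[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("f"))      # forget gate: discard irrelevant information
    i = sigmoid(pre("i"))      # input gate: how much new input to store
    g = math.tanh(pre("g"))    # candidate values from the current input
    c_t = f * c_prev + i * g   # selectively update the cell state
    o = sigmoid(pre("o"))      # output gate
    h_t = o * math.tanh(c_t)   # return a filtered version of the cell state
    return h_t, c_t


# Hypothetical weights for illustration only.
params = {
    "f": (0.5, 0.1, 1.0),
    "i": (0.8, -0.2, 0.0),
    "g": (1.0, 0.3, 0.0),
    "o": (0.4, 0.2, 0.0),
}

h, c = 0.0, 0.0
for x_t in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x_t, h, c, params)
```

Because the cell state is updated additively (`f * c_prev + i * g`) rather than by repeated matrix multiplication alone, gradients can flow across many time steps more easily.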

Attention Mechanism

Fundamental idea behind transformers, including GPT-3 and GPT-4

Automatically learn masks over inputs

Figure out which part of input to pay attention to

Consider a position-aware vector encoding of inputs

For each word (in parallel), calculate attention
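A minimal sketch of how attention could be computed for each word. It assumes scaled dot-product self-attention, with the input vectors used directly as queries, keys, and values. Real transformers apply learned projections first, which are omitted here. The input matrix `X` is a made-up example standing in for position-aware encodings.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    d = len(X[0])
    outputs, weights = [], []
    for q in X:  # each word's computation is independent, so it can run in parallel
        # Similarity of this word to every word, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        w = softmax(scores)  # soft mask over the inputs
        weights.append(w)
        # Weighted sum of all input vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d)])
    return outputs, weights


# Hypothetical encodings for a three-word input.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attn_weights = self_attention(X)
```

Each row of `attn_weights` sums to 1 and acts as the learned mask over inputs: larger entries mark the words that position attends to most.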