Neural Networks
Key Idea: Nonlinear activations over linear combinations of features
Neuron/Perceptron
Fundamental building block
Neuron
Consider the nonlinear activation over a linear combination of features:
$\sigma(\mathbf{w}\cdot \phi(x))$
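As a rough illustration (my own sketch, not from the slides; the feature vector and weights are made up), a single neuron can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    # nonlinear activation sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, phi_x):
    # sigma(w . phi(x)): nonlinear activation over a linear combination
    return sigmoid(np.dot(w, phi_x))

phi_x = np.array([1.0, 0.5, -2.0])   # example feature vector phi(x)
w = np.array([0.3, -0.1, 0.8])       # example weights
print(neuron(w, phi_x))
```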
Feedforward Neural Network
Combines several neurons into a single layered model
Acts as a (weighted) composition of functions
Key idea: Learn $V$ and $w$ jointly!
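A minimal sketch of such a network (my own illustration; the layer sizes are arbitrary): a hidden layer of neurons parameterized by a weight matrix $V$, whose outputs are combined by an output neuron with weights $w$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(V, w, phi_x):
    # hidden layer: sigma(V phi(x)); output neuron: sigma(w . hidden)
    hidden = sigmoid(V @ phi_x)
    return sigmoid(np.dot(w, hidden))

V = np.random.randn(4, 3) * 0.1   # hidden-layer weights: 3 features -> 4 hidden units
w = np.random.randn(4) * 0.1      # output weights over the 4 hidden units
phi_x = np.array([1.0, 0.5, -2.0])
print(feedforward(V, w, phi_x))
```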
Training
How do we learn $V$ and $w$?
Gradient Descent, of course...
... gradient of what?
Loss Functions
Training
Average the loss over the training data
Training
Update $w$ and $V$ using Gradient Descent
Need to compute
$\nabla_{w} \frac{1}{N} \sum_i L(\sigma(w\cdot\sigma(V\cdot\phi_i)), y_i)$
and
$\nabla_{V} \frac{1}{N} \sum_i L(\sigma(w\cdot\sigma(V\cdot\phi_i)), y_i)$
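As a sanity check (my own sketch, not from the slides), these averaged-loss gradients can be approximated numerically with finite differences; squared error is used here as a concrete choice of $L$, and the data, names, and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_loss(w, V, Phi, y):
    # (1/N) sum_i L(sigma(w . sigma(V phi_i)), y_i), with squared-error L
    preds = sigmoid(sigmoid(Phi @ V.T) @ w)
    return np.mean((y - preds) ** 2)

def numerical_grad(f, x, eps=1e-6):
    # central finite differences over every entry of x (works for w or V)
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        hi = f()
        x[idx] = old - eps
        lo = f()
        x[idx] = old
        g[idx] = (hi - lo) / (2 * eps)
    return g

rng = np.random.default_rng(0)
Phi = rng.normal(size=(8, 3))            # 8 examples, 3 features each
y = rng.random(8)                        # targets in [0, 1]
V = rng.normal(scale=0.1, size=(4, 3))
w = rng.normal(scale=0.1, size=4)

grad_w = numerical_grad(lambda: avg_loss(w, V, Phi, y), w)
grad_V = numerical_grad(lambda: avg_loss(w, V, Phi, y), V)
```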
Training
For example, consider the Least-Squares Loss
To train the network, compute the following gradients:
\[\nabla_{w}L = \frac{\partial}{\partial w} \frac{1}{N} \sum_i \biggl( y_i - \frac{1}{1+\mathrm{e}^{- w\cdot\frac{1}{1+ \mathrm{e}^{-V\cdot\phi_i}}}}\biggr)^2 \]and\[\nabla_{V}L = \frac{\partial}{\partial V} \frac{1}{N} \sum_i \biggl( y_i - \frac{1}{1+\mathrm{e}^{- w\cdot\frac{1}{1+ \mathrm{e}^{-V\cdot\phi_i}}}}\biggr)^2 \]
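For reference, here is one way to expand these gradients with the chain rule (my own working, not from the slides), writing $h_i = \sigma(V\cdot\phi_i)$, $\hat{y}_i = \sigma(w\cdot h_i)$, and using $\sigma'(z) = \sigma(z)(1-\sigma(z))$:
\[\nabla_{w}L = \frac{1}{N}\sum_i -2\,(y_i - \hat{y}_i)\,\hat{y}_i(1-\hat{y}_i)\,h_i\]
\[\nabla_{V}L = \frac{1}{N}\sum_i -2\,(y_i - \hat{y}_i)\,\hat{y}_i(1-\hat{y}_i)\,\bigl(w \odot h_i \odot (1-h_i)\bigr)\,\phi_i^{\top}\]
where $\odot$ denotes elementwise multiplication.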
Training
Finally, to update weights, use backpropagation
Forward pass - compute model output
Backward pass - do gradient updates
$w = w - \eta \nabla_w L(f_{V,w}(\phi),y)$
$V = V - \eta \nabla_V L(f_{V,w}(\phi),y)$
Alternate until convergence
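Putting the pieces together, a sketch of the full loop (illustrative, not the official course code) that alternates forward and backward passes and applies the updates above, using the chain-rule gradients derived earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(Phi, y, n_hidden=4, eta=0.5, n_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    V = rng.normal(scale=0.1, size=(n_hidden, d))
    w = rng.normal(scale=0.1, size=n_hidden)
    for _ in range(n_epochs):
        # forward pass: compute hidden activations and model output
        H = sigmoid(Phi @ V.T)                 # rows are sigma(V phi_i)
        y_hat = sigmoid(H @ w)                 # model outputs
        # backward pass: chain rule for the averaged squared error
        delta = -2.0 * (y - y_hat) * y_hat * (1 - y_hat)
        grad_w = (delta @ H) / n
        grad_V = ((delta[:, None] * H * (1 - H) * w).T @ Phi) / n
        # gradient descent updates
        w = w - eta * grad_w
        V = V - eta * grad_V
    return V, w

# toy usage: 100 examples with 3 features and made-up binary targets
Phi = np.random.randn(100, 3)
y = (Phi[:, 0] > 0).astype(float)
V, w = train(Phi, y)
```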
Generalization
Is loss minimization sufficient to train robust models?
What if the model memorizes the training data? (100% training accuracy, i.e., near-zero training loss)
What strategies can you think of to prevent overfitting?
Generalization
Regularization: add a penalty for certain behaviors when updating weights
Early stopping using a validation set
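As a sketch of both remedies (assumed, not from the slides; `train_step` and `val_loss` are hypothetical helpers standing in for one epoch of gradient descent and the validation-set loss):

```python
import numpy as np

def l2_regularized_step(w, grad_w, eta, lam):
    # L2 regularization: minimizing L + lam * ||w||^2 adds 2*lam*w to the gradient
    return w - eta * (grad_w + 2 * lam * w)

def early_stopping(train_step, val_loss, patience=10):
    # stop once the validation loss has not improved for `patience` epochs
    best, since_best = np.inf, 0
    while since_best < patience:
        train_step()          # one epoch of gradient descent on the training set
        loss = val_loss()     # loss on the held-out validation set
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
    return best
```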
Deep Learning
Neural Networks with multiple hidden layers
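A small sketch of the same idea with more layers (my own illustration; layer sizes are arbitrary): the forward pass simply composes more hidden layers before the output neuron.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_forward(hidden_weights, w_out, phi_x):
    # apply each hidden layer in turn, then the output neuron
    h = phi_x
    for W in hidden_weights:
        h = sigmoid(W @ h)
    return sigmoid(np.dot(w_out, h))

hidden_weights = [np.random.randn(5, 3) * 0.1,   # layer 1: 3 features -> 5 units
                  np.random.randn(4, 5) * 0.1]   # layer 2: 5 units -> 4 units
w_out = np.random.randn(4) * 0.1
print(deep_forward(hidden_weights, w_out, np.array([1.0, 0.5, -2.0])))
```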
Loss Functions & Gradient Descent
What do real loss functions look like?