Neural Networks



Key Idea: Nonlinear activations over linear combinations of features


Neuron/Perceptron

Fundamental building block

Neuron


Consider a nonlinear activation applied to a linear combination of features:

$\sigma(\mathbf{w}\cdot \phi(x))$
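
As a quick illustration (a sketch, not code from the slides; the names `w` and `phi_x` are illustrative), a single neuron is just a sigmoid applied to a dot product:

```python
# A minimal sketch of a single neuron: a nonlinear activation (sigmoid)
# over a linear combination of the features phi(x).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, phi_x):
    """Compute sigma(w . phi(x)) for a weight vector w and feature vector phi(x)."""
    return sigmoid(np.dot(w, phi_x))
```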

Feedforward Neural Network



Combines several neurons into a single layered model


Acts as a (weighted) composition of functions



Key idea: Learn $\mathbf{v}$ and $\mathbf{w}$ jointly!
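
As a sketch of this "composition of functions" view (not code from the slides; it assumes a single hidden layer with sigmoid activations, with a first-layer weight matrix $V$ and output weights $\mathbf{w}$ learned jointly):

```python
# A minimal sketch of a one-hidden-layer feedforward network:
# f(phi) = sigmoid(w . sigmoid(V @ phi)). Names are illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(V, w, phi):
    hidden = sigmoid(V @ phi)           # first layer: one neuron per row of V
    return sigmoid(np.dot(w, hidden))   # output layer: a neuron over the hidden units
```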

Training



How do we learn $\mathbf{v}$ and $\mathbf{w}$?


Gradient Descent, of course...

... gradient of what?


Loss Functions

Training



Must average the loss over the training data
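
Concretely, for a dataset $\{(\phi_i, \textcolor{#3c1d8a}{y_i})\}_{i=1}^N$ the training objective averages the per-example loss:

\[\frac{1}{N} \sum_{i=1}^{N} L\bigl(f_{V,w}(\phi_i), \textcolor{#3c1d8a}{y_i}\bigr)\]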



Training


Update $\textcolor{#3b0a7a}{w}$ and $\textcolor{#361900}{V}$ using Gradient Descent



Need to compute

$\nabla_{\textcolor{#3b0a7a}{w}} \frac{1}{N} \sum_i L(\sigma(\textcolor{#3b0a7a}{w}\cdot\sigma(\textcolor{#361900}{V}\cdot\phi_i)), \textcolor{#3c1d8a}{y_i})$
and
$\nabla_{\textcolor{#361900}{V}} \frac{1}{N} \sum_i L(\sigma(\textcolor{#3b0a7a}{w}\cdot\sigma(\textcolor{#361900}{V}\cdot\phi_i)), \textcolor{#3c1d8a}{y_i})$
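
These gradients are derived by the chain rule (backpropagation, below). As a hedged sketch, hand-derived gradients are commonly sanity-checked with finite differences; the example below uses the least-squares loss introduced on the next slide, and all names are illustrative:

```python
# A minimal finite-difference check of d(loss)/d(w_j): perturb one weight,
# re-evaluate the averaged loss, and take the symmetric difference quotient.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_loss(w, V, Phi, y):
    """Averaged least-squares loss of the 2-layer network on (Phi, y)."""
    p = sigmoid(sigmoid(Phi @ V.T) @ w)
    return np.mean((y - p) ** 2)

def numerical_grad_w(w, V, Phi, y, eps=1e-6):
    grad = np.zeros_like(w)
    for j in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[j] += eps
        w_minus[j] -= eps
        grad[j] = (avg_loss(w_plus, V, Phi, y) - avg_loss(w_minus, V, Phi, y)) / (2 * eps)
    return grad
```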

Training



For example, consider the Least-Squares Loss


To train the network, compute the following gradients:


\[\nabla_{\textcolor{#3b0a7a}{w}}L = \frac{\partial}{\partial \textcolor{#3b0a7a}{w}} \frac{1}{N} \sum_i \biggl( \textcolor{#3c1d8a}{y_i} - \Bigl(\frac{1}{1+\mathrm{e}^{- \textcolor{#3b0a7a}{w}\cdot\frac{1}{1+ \mathrm{e}^{-\textcolor{#361900}{V}\cdot\phi_i}}}}\Bigr)\biggr)^2 \]
and
\[\nabla_{\textcolor{#361900}{V}}L = \frac{\partial}{\partial \textcolor{#361900}{V}} \frac{1}{N} \sum_i \biggl( \textcolor{#3c1d8a}{y_i} - \Bigl(\frac{1}{1+\mathrm{e}^{- \textcolor{#3b0a7a}{w}\cdot\frac{1}{1+ \mathrm{e}^{-\textcolor{#361900}{V}\cdot\phi_i}}}}\Bigr)\biggr)^2 \]
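
As a sketch of how the chain rule unrolls here (shorthand introduced only for brevity, not from the slides): write $a_i = \sigma(\textcolor{#361900}{V}\cdot\phi_i)$ and $p_i = \sigma(\textcolor{#3b0a7a}{w}\cdot a_i)$, and use $\sigma'(z) = \sigma(z)(1-\sigma(z))$:

\[\nabla_{\textcolor{#3b0a7a}{w}}L = \frac{1}{N} \sum_i -2\,\bigl(\textcolor{#3c1d8a}{y_i} - p_i\bigr)\, p_i (1 - p_i)\, a_i\]

$\nabla_{\textcolor{#361900}{V}}L$ follows from one more application of the chain rule through $a_i$.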

Training



Finally, to update the weights, use backpropagation


Forward pass - compute the model output

Backward pass - compute gradients and update the weights

$w = w - \eta \nabla_w L(f_{V,w}(\phi),y)$
$V = V - \eta \nabla_V L(f_{V,w}(\phi),y)$


Alternate until convergence
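
Below is a minimal sketch of this loop, assuming the 2-layer sigmoid network $f(\phi)=\sigma(w\cdot\sigma(V\phi))$, the least-squares loss, and full-batch gradient descent; the variable names (`Phi`, `y`, `eta`, `n_steps`) are illustrative, not from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(Phi, y, hidden_dim, eta=0.1, n_steps=1000, seed=0):
    """Phi: (N, d) feature matrix, y: (N,) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    N, d = Phi.shape
    V = rng.normal(scale=0.1, size=(hidden_dim, d))   # first-layer weights
    w = rng.normal(scale=0.1, size=hidden_dim)        # output-layer weights

    for _ in range(n_steps):
        # Forward pass: compute the model output
        a = sigmoid(Phi @ V.T)          # (N, hidden_dim) hidden activations
        p = sigmoid(a @ w)              # (N,) predictions

        # Backward pass: gradients of the averaged least-squares loss
        dL_dp = -2.0 * (y - p) / N                              # (N,)
        delta_out = dL_dp * p * (1.0 - p)                       # through output sigmoid
        grad_w = a.T @ delta_out                                # (hidden_dim,)
        delta_hidden = np.outer(delta_out, w) * a * (1.0 - a)   # through hidden sigmoid
        grad_V = delta_hidden.T @ Phi                           # (hidden_dim, d)

        # Gradient descent updates
        w -= eta * grad_w
        V -= eta * grad_V
    return V, w
```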

Generalization



Is loss minimization sufficient to train robust models?


What if the model memorizes the training data? (100% training accuracy, near-zero training loss)


What strategies can you think of to prevent overfitting?

Generalization



Regularization: add a penalty term to the loss that discourages certain behaviors (e.g. large weights) when updating weights; see the sketch below


Early stopping using a validation set
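
One common form of such a penalty (a sketch; the specific choice is not fixed by the slides) is an L2 penalty on the weights, added to the averaged loss:

\[\min_{\textcolor{#3b0a7a}{w},\,\textcolor{#361900}{V}} \; \frac{1}{N} \sum_i L\bigl(f_{V,w}(\phi_i), \textcolor{#3c1d8a}{y_i}\bigr) + \lambda \bigl(\lVert \textcolor{#3b0a7a}{w} \rVert^2 + \lVert \textcolor{#361900}{V} \rVert_F^2\bigr)\]

where $\lambda$ controls how strongly large weights are penalized.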

Deep Learning


Neural Networks with multiple hidden layers
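
As a brief sketch (an assumption for illustration, not code from the slides), a deep network is the same construction with several hidden layers stacked, i.e. a longer composition of "activation over a linear combination" steps:

```python
# A minimal sketch of a deep feedforward pass: apply sigmoid(W @ h) once per layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_forward(phi, weights):
    """weights: list of layer matrices [W1, W2, ..., Wk], applied in order."""
    h = phi
    for W in weights:
        h = sigmoid(W @ h)
    return h
```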


Loss Functions & Gradient Descent

What do real loss functions look like?