Neural Networks
Key Idea: Nonlinear activations over linear combinations of features
Neuron/Perceptron
Fundamental building block
Neuron
Consider the nonlinear activation over a linear combination of features:
$\sigma(\mathbf{w}\cdot \phi(x))$
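As a rough illustration (my own sketch, not from the slides; the feature vector and weights are made up), a single neuron can be written in a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    # nonlinear activation sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w, phi_x):
    # sigma(w . phi(x)): nonlinear activation over a linear combination
    return sigmoid(np.dot(w, phi_x))

phi_x = np.array([1.0, 0.5, -2.0])   # example feature vector phi(x)
w = np.array([0.3, -0.1, 0.8])       # example weights
print(neuron(w, phi_x))
```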
Feedforward Neural Network
Combines several neurons into a single layered model
Acts as a (weighted) composition of functions
Key idea: Learn $V$ and $w$ jointly!
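A minimal sketch of such a network (my own illustration; the layer sizes are arbitrary): a hidden layer of neurons parameterized by a weight matrix $V$, whose outputs are combined by an output neuron with weights $w$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(V, w, phi_x):
    # hidden layer: sigma(V phi(x)); output neuron: sigma(w . hidden)
    hidden = sigmoid(V @ phi_x)
    return sigmoid(np.dot(w, hidden))

V = np.random.randn(4, 3) * 0.1   # hidden-layer weights: 3 features -> 4 hidden units
w = np.random.randn(4) * 0.1      # output weights over the 4 hidden units
phi_x = np.array([1.0, 0.5, -2.0])
print(feedforward(V, w, phi_x))
```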
Training
How do we learn $V$ and $w$?
Gradient Descent, of course...
... gradient of what?
Loss Functions
Training
Average the loss over the training data
Training
Update $w$ and $V$ using Gradient Descent
Need to compute
$\nabla_{w} \frac{1}{N} \sum_i L(\sigma(w\cdot\sigma(V\cdot\phi_i)), y_i)$
and
$\nabla_{V} \frac{1}{N} \sum_i L(\sigma(w\cdot\sigma(V\cdot\phi_i)), y_i)$
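As a sanity check (my own sketch, not from the slides), these averaged-loss gradients can be approximated numerically with finite differences; squared error is used here as a concrete choice of $L$, and the data, names, and sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_loss(w, V, Phi, y):
    # (1/N) sum_i L(sigma(w . sigma(V phi_i)), y_i), with squared-error L
    preds = sigmoid(sigmoid(Phi @ V.T) @ w)
    return np.mean((y - preds) ** 2)

def numerical_grad(f, x, eps=1e-6):
    # central finite differences over every entry of x (works for w or V)
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        hi = f()
        x[idx] = old - eps
        lo = f()
        x[idx] = old
        g[idx] = (hi - lo) / (2 * eps)
    return g

rng = np.random.default_rng(0)
Phi = rng.normal(size=(8, 3))            # 8 examples, 3 features each
y = rng.random(8)                        # targets in [0, 1]
V = rng.normal(scale=0.1, size=(4, 3))
w = rng.normal(scale=0.1, size=4)

grad_w = numerical_grad(lambda: avg_loss(w, V, Phi, y), w)
grad_V = numerical_grad(lambda: avg_loss(w, V, Phi, y), V)
```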
Training
For example, consider the Least-Squares Loss
To train the network, compute the following gradients:
\[\nabla_{w}L = \frac{\partial}{\partial w} \frac{1}{N} \sum_i \biggl( y_i - \frac{1}{1+\mathrm{e}^{- w\cdot\frac{1}{1+ \mathrm{e}^{-V\cdot\phi_i}}}}\biggr)^2 \]and\[\nabla_{V}L = \frac{\partial}{\partial V} \frac{1}{N} \sum_i \biggl( y_i - \frac{1}{1+\mathrm{e}^{- w\cdot\frac{1}{1+ \mathrm{e}^{-V\cdot\phi_i}}}}\biggr)^2 \]
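For reference, here is one way to expand these gradients with the chain rule (my own working, not from the slides), writing $h_i = \sigma(V\cdot\phi_i)$, $\hat{y}_i = \sigma(w\cdot h_i)$, and using $\sigma'(z) = \sigma(z)(1-\sigma(z))$:
\[\nabla_{w}L = \frac{1}{N}\sum_i -2\,(y_i - \hat{y}_i)\,\hat{y}_i(1-\hat{y}_i)\,h_i\]
\[\nabla_{V}L = \frac{1}{N}\sum_i -2\,(y_i - \hat{y}_i)\,\hat{y}_i(1-\hat{y}_i)\,\bigl(w \odot h_i \odot (1-h_i)\bigr)\,\phi_i^{\top}\]
where $\odot$ denotes elementwise multiplication.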
Training
Finally, to update weights, use backpropagation
Forward pass - compute model output
Backward pass - do gradient updates
$w = w - \eta \nabla_w L(f_{V,w}(\phi),y)$
$V = V - \eta \nabla_V L(f_{V,w}(\phi),y)$
Alternate until convergence
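Putting the pieces together, a sketch of the full loop (illustrative, not the official course code) that alternates forward and backward passes and applies the updates above, using the chain-rule gradients derived earlier:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(Phi, y, n_hidden=4, eta=0.5, n_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = Phi.shape
    V = rng.normal(scale=0.1, size=(n_hidden, d))
    w = rng.normal(scale=0.1, size=n_hidden)
    for _ in range(n_epochs):
        # forward pass: compute hidden activations and model output
        H = sigmoid(Phi @ V.T)                 # rows are sigma(V phi_i)
        y_hat = sigmoid(H @ w)                 # model outputs
        # backward pass: chain rule for the averaged squared error
        delta = -2.0 * (y - y_hat) * y_hat * (1 - y_hat)
        grad_w = (delta @ H) / n
        grad_V = ((delta[:, None] * H * (1 - H) * w).T @ Phi) / n
        # gradient descent updates
        w = w - eta * grad_w
        V = V - eta * grad_V
    return V, w

# toy usage: 100 examples with 3 features and made-up binary targets
Phi = np.random.randn(100, 3)
y = (Phi[:, 0] > 0).astype(float)
V, w = train(Phi, y)
```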
Generalization
Is loss minimization sufficient to train robust models?
What if the model memorizes the training data? (100% training accuracy, i.e., near-zero training loss)
What strategies can you think of to prevent overfitting?
Generalization
Regularization: add a penalty for certain behaviors when updating weights
Early stopping using a validation set
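As a sketch of both remedies (assumed, not from the slides; `train_step` and `val_loss` are hypothetical helpers standing in for one epoch of gradient descent and the validation-set loss):

```python
import numpy as np

def l2_regularized_step(w, grad_w, eta, lam):
    # L2 regularization: minimizing L + lam * ||w||^2 adds 2*lam*w to the gradient
    return w - eta * (grad_w + 2 * lam * w)

def early_stopping(train_step, val_loss, patience=10):
    # stop once the validation loss has not improved for `patience` epochs
    best, since_best = np.inf, 0
    while since_best < patience:
        train_step()          # one epoch of gradient descent on the training set
        loss = val_loss()     # loss on the held-out validation set
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
    return best
```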
Deep Learning
Neural Networks with multiple hidden layers
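A small sketch of the same idea with more layers (my own illustration; layer sizes are arbitrary): the forward pass simply composes more hidden layers before the output neuron.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def deep_forward(hidden_weights, w_out, phi_x):
    # apply each hidden layer in turn, then the output neuron
    h = phi_x
    for W in hidden_weights:
        h = sigmoid(W @ h)
    return sigmoid(np.dot(w_out, h))

hidden_weights = [np.random.randn(5, 3) * 0.1,   # layer 1: 3 features -> 5 units
                  np.random.randn(4, 5) * 0.1]   # layer 2: 5 units -> 4 units
w_out = np.random.randn(4) * 0.1
print(deep_forward(hidden_weights, w_out, np.array([1.0, 0.5, -2.0])))
```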
Loss Functions & Gradient Descent
What do real loss functions look like?