Supervised Learning



Supervised Learning



Training over labeled examples


Two key tasks:


Classification

Regression

Classification



Given labeled examples, learn a function that
maps each example to a class label



Regression



Given labeled examples, learn a function that
maps each example to a real-valued output



Supervised Learning



What does the training data look like?




Training inputs $X$: \[ X = \{X_1, X_2, \dots, X_n\}\quad \textrm{where} \\ X_i = \{x_{i1}, x_{i2}, \dots, x_{im}\} \]


Training labels $Y$: \[ Y = \{ y_1, y_2, \dots, y_n \} \]
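
As a concrete sketch (the arrays below are placeholders, not real data), $X$ is an $n \times m$ array and $Y$ holds one label per example:

```python
# A minimal sketch of the shapes of X and Y: n examples, m features each.
import numpy as np

n, m = 5, 3
X = np.arange(n * m, dtype=float).reshape(n, m)   # row i is X_i = (x_i1, ..., x_im)
Y = np.array([0, 1, 1, 0, 1])                     # one label y_i per example
print(X.shape, Y.shape)                           # (5, 3) (5,)
```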

Naïve Bayes Classifier



A simple probabilistic classifier based on Bayes' Theorem


Item Weight Color Shape Taste Size
Apple Light Red Round Sweet Medium
Banana Light Other Not Round Sweet Small
Carrot Light Other Not Round Other Small
Watermelon Heavy Green Round Sweet Large
Grapes Light Other Round Sweet Small
Cucumber Heavy Green Not Round Other Large
Strawberry Light Red Round Sweet Small
Eggplant Heavy Other Not Round Bitter Large
Lemon Light Other Round Sour Small
Bell Pepper Light Red Not Round Other Medium

Consider this input data



Replace item with type (fruit/veg)

Type Weight Color Shape Taste Size
Fruit Light Red Round Sweet Medium
Fruit Light Other Not Round Sweet Small
Veg Light Other Not Round Other Small
Fruit Heavy Green Round Sweet Large
Fruit Light Other Round Sweet Small
Veg Heavy Green Not Round Other Large
Fruit Light Red Round Sweet Small
Veg Heavy Other Not Round Bitter Large
Fruit Light Other Round Sour Small
Veg Light Red Not Round Other Medium

Is the following new data
point a fruit or a veg?



Light, Red, Not Round, Sweet, Small



\[ P(y|X) = \frac{P(y)\prod_jP(X_j|y)}{P(X)} \]

In practice, we often work in log space: \[ \log P(y|X) = \log P(y) + \sum_j \log P(X_j | y) - \log P(X) \] and drop $\log P(X)$, since it is the same for every class $y$.

What would you do for continuous variables?


Binning!
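
For instance, a minimal binning sketch with NumPy (the values and bin edges below are made up for illustration):

```python
# Bin a continuous feature so Naive Bayes can treat it as categorical.
import numpy as np

weights_kg = np.array([0.1, 0.15, 2.0, 0.005, 1.2, 0.3])
edges = np.array([0.05, 0.5, 1.5])        # 4 bins: <0.05, 0.05-0.5, 0.5-1.5, >1.5
bins = np.digitize(weights_kg, edges)     # integer bin index per value
print(bins)                               # [1 1 3 0 2 1]
```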

Problems with Naïve Bayes?


What if we see a feature value that is never observed?

\[ P(x_j | y) = 0 \]


Laplace smoothing: pretend we've seen it once before

\[ P(x_j | y) = \frac{\#(x_j,y) + 1 }{\#y + m} \] where $m$ is the number of possible values of feature $j$
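
Putting the pieces together, here is a minimal Naïve Bayes sketch on the fruit/veg table above, using the +1 smoothing formula with $m$ equal to the number of values of each feature (the code layout is an assumption, not taken from the slides):

```python
# Naive Bayes with +1 (Laplace) smoothing on the Type/Weight/Color/Shape/Taste/Size table.
import math
from collections import Counter, defaultdict

# (Type, Weight, Color, Shape, Taste, Size) rows from the slide.
data = [
    ("Fruit", "Light", "Red",   "Round",     "Sweet",  "Medium"),
    ("Fruit", "Light", "Other", "Not Round", "Sweet",  "Small"),
    ("Veg",   "Light", "Other", "Not Round", "Other",  "Small"),
    ("Fruit", "Heavy", "Green", "Round",     "Sweet",  "Large"),
    ("Fruit", "Light", "Other", "Round",     "Sweet",  "Small"),
    ("Veg",   "Heavy", "Green", "Not Round", "Other",  "Large"),
    ("Fruit", "Light", "Red",   "Round",     "Sweet",  "Small"),
    ("Veg",   "Heavy", "Other", "Not Round", "Bitter", "Large"),
    ("Fruit", "Light", "Other", "Round",     "Sour",   "Small"),
    ("Veg",   "Light", "Red",   "Not Round", "Other",  "Medium"),
]

label_counts = Counter(row[0] for row in data)
n_features = len(data[0]) - 1

counts = defaultdict(int)                     # counts[(j, value, y)] = #(x_j, y)
values = [set() for _ in range(n_features)]   # possible values of each feature
for row in data:
    y = row[0]
    for j, v in enumerate(row[1:]):
        counts[(j, v, y)] += 1
        values[j].add(v)

def log_posterior(x, y):
    # log P(y) + sum_j log P(x_j | y), with +1 smoothing (m = #values of feature j)
    lp = math.log(label_counts[y] / len(data))
    for j, v in enumerate(x):
        lp += math.log((counts[(j, v, y)] + 1) / (label_counts[y] + len(values[j])))
    return lp

x_new = ("Light", "Red", "Not Round", "Sweet", "Small")
print(max(label_counts, key=lambda y: log_posterior(x_new, y)))  # -> Fruit
```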

A more general ML framework


Let's start with Linear Predictors

Features


Consider r.venkatesaramani@northeastern.edu
as an input to an email filter


What features would you use?

Consider r.venkatesaramani@northeastern.edu
as an input to an email filter


\[ \phi(x) = \begin{bmatrix} \textrm{has first initial: } 1 \\ \textrm{has first name: } 0 \\ \textrm{has separator: } 1 \\ \textrm{has last initial: } 0 \\ \textrm{has last name: } 1 \\ \textrm{has NEU domain: } 1 \\ \end{bmatrix} = \begin{bmatrix} 1\\0\\1\\0\\1\\1 \end{bmatrix} \]


Each feature has a relative importance



\[ \textbf{w} = \begin{bmatrix} \textrm{has first initial: } 1.5 \\ \textrm{has first name: } 2 \\ \textrm{has separator: } -1 \\ \textrm{has last initial: } 1.5 \\ \textrm{has last name: } 2 \\ \textrm{has NEU domain: } 14 \\ \end{bmatrix} = \begin{bmatrix} 1.5\\2\\-1\\1.5\\2\\14 \end{bmatrix} \]

Consider r.venkatesaramani@northeastern.edu
as an input to an email filter


\[ \phi(x) = \begin{bmatrix} 1\\0\\1\\0\\1\\1 \end{bmatrix} \]

Each feature has a relative importance



\[ \textbf{w} = \begin{bmatrix} 1.5\\2\\-1\\1.5\\2\\14 \end{bmatrix} \]

Linear Predictors


Score is a linear function of the input features

\[ Score = \textbf{w}\cdot \phi(x) = \sum_j w_j \phi(x)_j \]


For classification, the prediction
$f_\textbf{w}(x) = \textrm{sign}(Score)$


$f_\textbf{w}(x) = \textrm{sign}(\textbf{w}\cdot \phi(x))$
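
As a quick sketch using the email example above (the feature order and weights are the ones shown on the previous slides):

```python
# Score and prediction for the email feature vector and weights from the slides.
import numpy as np

phi = np.array([1, 0, 1, 0, 1, 1])        # phi(x) for r.venkatesaramani@northeastern.edu
w = np.array([1.5, 2, -1, 1.5, 2, 14])    # relative importance of each feature

score = w @ phi                           # w . phi(x) = 1.5 - 1 + 2 + 14 = 16.5
print(score, np.sign(score))              # 16.5, prediction +1
```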

$f_\textbf{w}(x) = \textrm{sign}(\textbf{w}\cdot \phi(x))$


\[ f_\textbf{w}(x) = \begin{cases} 1 & \textbf{w}\cdot \phi(x) > 0 \\ -1 & \textbf{w}\cdot \phi(x) < 0 \\ ? & \textbf{w}\cdot \phi(x) = 0 \\ \end{cases} \]


Consider $\textbf{w} = [2,-1]$ and
$\phi(X) = \{ [2,0], [0,2], [2,4] \}$
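
Working through this example in code (a small sketch; the three feature vectors are the ones listed above):

```python
# A small check of f_w(x) = sign(w . phi(x)) for each phi(x) in the example.
import numpy as np

w = np.array([2, -1])
for phi in [np.array([2, 0]), np.array([0, 2]), np.array([2, 4])]:
    score = w @ phi
    label = 1 if score > 0 else -1 if score < 0 else "?"
    print(phi, score, label)
# [2 0] -> score  4 -> +1
# [0 2] -> score -2 -> -1
# [2 4] -> score  0 -> ? (on the decision boundary)
```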


How do we learn $\textbf{w}$?


We need a loss function


Quantifies how well we are doing


$ Score = \textbf{w}\cdot\phi(x) $


Prediction, $f_\textbf{w}(x) = \textrm{sign}(\textbf{w}\cdot\phi(x)) $


$ \textrm{Margin} = (\textbf{w}\cdot\phi(x))y $


Margin measures how correct we are!

Zero-One Loss
$L_{0-1}(x,y,\textbf{w}) = \mathbb{1}[f_\textbf{w}(x) \ne y]$
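
A tiny numeric check tying these together (the labels $y$ below are hypothetical, chosen for illustration): a positive margin means the prediction is correct, so the zero-one loss is 0.

```python
# Margin = (w . phi(x)) y and zero-one loss for two hypothetical labeled points.
import numpy as np

w = np.array([2, -1])
examples = [(np.array([2, 0]), 1),    # hypothetical label y = +1
            (np.array([0, 2]), 1)]    # hypothetical label y = +1

for phi, y in examples:
    score = w @ phi
    margin = score * y
    loss_01 = int(np.sign(score) != y)
    print(margin, loss_01)            # margin 4 -> loss 0; margin -2 -> loss 1
```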



Learning $\textbf{w}$ for regression


$y' = f_\textbf{w}(x) = \textbf{w}\cdot\phi(x)$


Residual, $y' - y = \textbf{w}\cdot\phi(x)-y$




More loss functions



Squared Loss
$L_{\textrm{squared}}(x,y,\textbf{w}) = (f_\textbf{w}(x) - y)^2$

More loss functions



Absolute Deviation
$L_{\textrm{abs-dev}}(x,y,\textbf{w}) = |f_\textbf{w}(x) - y|$

Use Gradient Descent to minimize loss
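
For example, here is a minimal sketch of gradient descent on the squared loss for a linear predictor (the data below is synthetic, just for illustration):

```python
# Gradient descent on the mean squared loss for f_w(x) = w . phi(x).
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 3))                 # feature vectors phi(x), one per row
w_true = np.array([1.5, -2.0, 0.5])
y = phi @ w_true + 0.1 * rng.normal(size=100)   # noisy real-valued labels

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    residual = phi @ w - y                      # f_w(x) - y for every example
    grad = 2 * phi.T @ residual / len(y)        # gradient of the mean squared loss
    w -= lr * grad

print(w)                                        # close to w_true
```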

More on Linear Classifiers



Consider the following distribution of data.


More on Linear Classifiers



Consider this decision boundary:


More on Linear Classifiers



Now consider this new data point:


More on Linear Classifiers



Gets classified as a Pass!


How can we deal with this?



Idea: find "equidistant" separator





The distance between the decision boundary and the
nearest data point on either side is called the margin

Support Vector Classifier



Find the decision boundary that maximizes the margin


Equivalently, minimize $||\textbf{w}||$


Subject to the constraint that each training point is classified correctly with margin at least 1: $y_i(\textbf{w}\cdot\phi(x_i)) \geq 1$ for all $i$

Hinge Loss



How to minimize $||\textbf{w}||$?


Use gradient descent on hinge loss:


$\ell_{\textrm{hinge}}(\textbf{w}) = \sum_i \max(0, 1 - y_i\,\textbf{w}\cdot\phi(x_i))$
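
A minimal sketch of (sub)gradient descent on this loss. The data is synthetic, and a small $\lambda\|\textbf{w}\|^2$ penalty is added, as in the usual soft-margin objective, so that $\|\textbf{w}\|$ stays small:

```python
# Subgradient descent on the hinge loss plus a small L2 penalty (soft-margin style).
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 2))                     # synthetic 2-D features
y = np.where(phi[:, 0] - phi[:, 1] > 0, 1, -1)      # labels in {-1, +1}

w = np.zeros(2)
lr, lam = 0.01, 0.01
for _ in range(1000):
    margins = y * (phi @ w)
    violated = margins < 1                          # points with margin below 1
    # subgradient of sum_i max(0, 1 - y_i w.phi(x_i)) + lam ||w||^2
    grad = -(y[violated, None] * phi[violated]).sum(axis=0) + 2 * lam * w
    w -= lr * grad

print(w, (np.sign(phi @ w) == y).mean())            # learned w, training accuracy
```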


Great, but...



What about outliers?


What about outliers?




Maximal-Margin isn't great here

What about outliers?




Soft-Margin SVC

Allow a few misclassifications

In higher dimensions

"Linear" Classifiers



Now consider the following distribution of data.


Can a Linear Classifier separate the classes?

"Linear" Classifiers



Let's get creative with features

Let's get creative with features!




Augment features as $\phi(x)=\{x, x^3\}$
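
Since the slide's figure isn't reproduced here, the sketch below uses stand-in data: 1-D points whose label flips at three thresholds cannot be separated by a single threshold on $x$, but they become linearly separable after augmenting to $\phi(x) = \{x, x^3\}$.

```python
# Feature augmentation: a linear classifier on [x, x^3] can realize a cubic
# decision rule on the original 1-D input.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
y = np.where(x**3 - 3 * x > 0, 1, -1)      # labels flip at -sqrt(3), 0, +sqrt(3)

phi = np.column_stack([x, x**3])           # augmented features phi(x) = [x, x^3]
w = np.array([-3.0, 1.0])                  # a separating w in the augmented space
print((np.sign(phi @ w) == y).mean())      # 1.0: perfectly separated
```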


Kernels Visualization

Logistic Regression



What if we want to predict probabilities?


Logistic function


$\sigma(x) = \frac{1}{1+e^{-x}}$




Logistic Regression



How to use this for classification?


Use it as a non-linear activation function!


$\sigma(\textbf{w}\cdot\phi(x))$


$\sigma(\textbf{w}\cdot\phi(x)) \geq 0.5 \implies \textbf{w}\cdot\phi(x) \geq 0$


$\sigma(\textbf{w}\cdot\phi(x)) < 0.5 \implies \textbf{w}\cdot\phi(x) < 0$
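
A quick numeric check of this equivalence (just a sketch):

```python
# sigma(z) >= 0.5 exactly when z >= 0, so thresholding the probability at 0.5
# gives the same decision as thresholding the score at 0.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-2.0, -0.1, 0.0, 0.1, 2.0]:
    print(z, sigma(z), sigma(z) >= 0.5, z >= 0)   # last two columns always agree
```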




Logistic Regression



Let's consider this example again


Logistic Regression



What's the probability of passing if you study for 9 hours?


Logistic Regression



Consider $\phi(x) = x$

We need to fit a curve to this

Logistic Regression



Remember that $\sigma$ is a function of a linear combination of $\phi(x)$


$\sigma(\textbf{w}\cdot\phi(x)) = \frac{1}{1+\mathrm{e}^{-\textbf{w}\cdot\phi(x)}}$


$\textbf{w}\cdot\phi(x) = w_0 + \sum_i w_i \phi(x)_i$




Logistic Regression



We need a loss function that captures binary misclassification using $\textbf{w}$


Binary Cross Entropy


Binary Cross Entropy



$\ell_{\textrm{BCE}}(\textbf{w}) = -\frac{1}{N}\sum_i \Big[\, y_i \log\sigma(\textbf{w}\cdot\phi(x_i)) + (1-y_i) \log \big(1-\sigma(\textbf{w}\cdot\phi(x_i))\big) \Big]$


$ = -\frac{1}{N} \sum_i \Big[\, y_i \log P(y=1 \mid x_i) + (1-y_i) \log \big(1-P(y=1 \mid x_i)\big) \Big]$, with labels $y_i \in \{0,1\}$
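
A minimal sketch of logistic regression trained by gradient descent on this loss. The study-hours data and its parameters are made up for illustration; labels are in $\{0,1\}$ as the loss assumes.

```python
# Logistic regression fit by gradient descent on the BCE loss (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 200
hours = rng.uniform(0, 12, size=n)                  # hypothetical hours-studied feature
p_true = 1 / (1 + np.exp(-(hours - 6)))             # made-up "true" pass probability
y = (rng.uniform(size=n) < p_true).astype(float)    # labels in {0, 1}

phi = np.column_stack([np.ones(n), hours])          # [1, x] so w_0 acts as the bias
w = np.zeros(2)
lr = 0.05
for _ in range(5000):
    p = 1 / (1 + np.exp(-(phi @ w)))                # sigma(w . phi(x))
    grad = phi.T @ (p - y) / n                      # gradient of the BCE loss
    w -= lr * grad

print(w)                                            # roughly recovers bias ~ -6, slope ~ 1
print(1 / (1 + np.exp(-(w[0] + w[1] * 9))))         # estimated P(pass | 9 hours)
```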


Unsupervised Learning



What if we don't have labels?


Can we still learn useful representations?


What sort of task does clustering accomplish?

Unsupervised Learning

K-Means Clustering



Initialize K cluster centers


Assign each point to the nearest cluster center


Update cluster centers to be the mean of the points in the cluster


Repeat until convergence
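
A minimal sketch of these four steps (the two synthetic blobs below stand in for unlabeled data):

```python
# K-Means following the steps above: initialize, assign, update, repeat.
import numpy as np

rng = np.random.default_rng(0)
# Two blobs of 2-D points; no labels are used anywhere below.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

K = 2
centers = X[rng.choice(len(X), K, replace=False)]    # 1. initialize K cluster centers
for _ in range(100):
    # 2. assign each point to the nearest cluster center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # 3. update each center to the mean of the points assigned to it
    new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
    # 4. repeat until convergence (centers stop moving)
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)   # roughly (0, 0) and (5, 5)
```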