Supervised Learning



Supervised Learning



Training over labeled examples


Two key tasks:


Classification

Regression

Classification



Given labeled examples, learn a function that
maps each example to a class label



Regression



Given labeled examples, learn a function that
maps each example to a real-valued output



Supervised Learning



What does the training data look like?




Training inputs $X$: \[ X = \{X_1, X_2, \dots, X_n\}\quad \textrm{where} \\ X_i = \{x_{i1}, x_{i2}, \dots, x_{im}\} \]


Training labels $Y$: \[ Y = \{ y_1, y_2, \dots, y_n \} \]
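
As a concrete sketch (the arrays below are placeholders, not real data), $X$ is an $n \times m$ array and $Y$ holds one label per example:

```python
# A minimal sketch of the shapes of X and Y: n examples, m features each.
import numpy as np

n, m = 5, 3
X = np.arange(n * m, dtype=float).reshape(n, m)   # row i is X_i = (x_i1, ..., x_im)
Y = np.array([0, 1, 1, 0, 1])                     # one label y_i per example
print(X.shape, Y.shape)                           # (5, 3) (5,)
```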

Naïve Bayes Classifier



A simple probabilistic classifier based on Bayes' Theorem


Item Weight Color Shape Taste Size
Apple Light Red Round Sweet Medium
Banana Light Other Not Round Sweet Small
Carrot Light Other Not Round Other Small
Watermelon Heavy Green Round Sweet Large
Grapes Light Other Round Sweet Small
Cucumber Heavy Green Not Round Other Large
Strawberry Light Red Round Sweet Small
Eggplant Heavy Other Not Round Bitter Large
Lemon Light Other Round Sour Small
Bell Pepper Light Red Not Round Other Medium

Consider this input data



Replace item with type (fruit/veg)

Type Weight Color Shape Taste Size
Fruit Light Red Round Sweet Medium
Fruit Light Other Not Round Sweet Small
Veg Light Other Not Round Other Small
Fruit Heavy Green Round Sweet Large
Fruit Light Other Round Sweet Small
Veg Heavy Green Not Round Other Large
Fruit Light Red Round Sweet Small
Veg Heavy Other Not Round Bitter Large
Fruit Light Other Round Sour Small
Veg Light Red Not Round Other Medium

Is the following new data
point a fruit or a veg?



Light, Red, Not Round, Sweet, Small



\[ P(y|X) = \frac{P(y)\prod_jP(X_j|y)}{P(X)} \]

In practice, we often work in log space: \[ \log P(y|X) = \log P(y) + \sum_j \log P(X_j | y) - \log P(X) \] and drop $\log P(X)$, since it is the same for every class $y$.

What would you do for continuous variables?


Binning!
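
For instance, a minimal binning sketch with NumPy (the values and bin edges below are made up for illustration):

```python
# Bin a continuous feature so Naive Bayes can treat it as categorical.
import numpy as np

weights_kg = np.array([0.1, 0.15, 2.0, 0.005, 1.2, 0.3])
edges = np.array([0.05, 0.5, 1.5])        # 4 bins: <0.05, 0.05-0.5, 0.5-1.5, >1.5
bins = np.digitize(weights_kg, edges)     # integer bin index per value
print(bins)                               # [1 1 3 0 2 1]
```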

Problems with Naïve Bayes?


What if we see a feature value that is never observed?

\[ P(x_j | y) = 0 \]


Laplace smoothing: pretend we've seen it once before

\[ P(x_j | y) = \frac{\#(x_j,y) + 1 }{\#y + m} \] where $m$ is the number of possible values of feature $j$
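
Putting the pieces together, here is a minimal Naïve Bayes sketch on the fruit/veg table above, using the +1 smoothing formula with $m$ equal to the number of values of each feature (the code layout is an assumption, not taken from the slides):

```python
# Naive Bayes with +1 (Laplace) smoothing on the Type/Weight/Color/Shape/Taste/Size table.
import math
from collections import Counter, defaultdict

# (Type, Weight, Color, Shape, Taste, Size) rows from the slide.
data = [
    ("Fruit", "Light", "Red",   "Round",     "Sweet",  "Medium"),
    ("Fruit", "Light", "Other", "Not Round", "Sweet",  "Small"),
    ("Veg",   "Light", "Other", "Not Round", "Other",  "Small"),
    ("Fruit", "Heavy", "Green", "Round",     "Sweet",  "Large"),
    ("Fruit", "Light", "Other", "Round",     "Sweet",  "Small"),
    ("Veg",   "Heavy", "Green", "Not Round", "Other",  "Large"),
    ("Fruit", "Light", "Red",   "Round",     "Sweet",  "Small"),
    ("Veg",   "Heavy", "Other", "Not Round", "Bitter", "Large"),
    ("Fruit", "Light", "Other", "Round",     "Sour",   "Small"),
    ("Veg",   "Light", "Red",   "Not Round", "Other",  "Medium"),
]

label_counts = Counter(row[0] for row in data)
n_features = len(data[0]) - 1

counts = defaultdict(int)                     # counts[(j, value, y)] = #(x_j, y)
values = [set() for _ in range(n_features)]   # possible values of each feature
for row in data:
    y = row[0]
    for j, v in enumerate(row[1:]):
        counts[(j, v, y)] += 1
        values[j].add(v)

def log_posterior(x, y):
    # log P(y) + sum_j log P(x_j | y), with +1 smoothing (m = #values of feature j)
    lp = math.log(label_counts[y] / len(data))
    for j, v in enumerate(x):
        lp += math.log((counts[(j, v, y)] + 1) / (label_counts[y] + len(values[j])))
    return lp

x_new = ("Light", "Red", "Not Round", "Sweet", "Small")
print(max(label_counts, key=lambda y: log_posterior(x_new, y)))  # -> Fruit
```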

A more general ML framework


Let's start with Linear Predictors

Features


Consider r.venkatesaramani@northeastern.edu
as an input to an email filter


What features would you use?

Consider r.venkatesaramani@northeastern.edu
as an input to an email filter


\[ \phi(x) = \begin{bmatrix} \textrm{has first initial: } 1 \\ \textrm{has first name: } 0 \\ \textrm{has separator: } 1 \\ \textrm{has last initial: } 0 \\ \textrm{has last name: } 1 \\ \textrm{has NEU domain: } 1 \\ \end{bmatrix} = \begin{bmatrix} 1\\0\\1\\0\\1\\1 \end{bmatrix} \]


Each feature has a relative importance



\[ \textbf{w} = \begin{bmatrix} \textrm{has first initial: } 1.5 \\ \textrm{has first name: } 2 \\ \textrm{has separator: } -1 \\ \textrm{has last initial: } 1.5 \\ \textrm{has last name: } 2 \\ \textrm{has NEU domain: } 14 \\ \end{bmatrix} = \begin{bmatrix} 1.5\\2\\-1\\1.5\\2\\14 \end{bmatrix} \]

Consider r.venkatesaramani@northeastern.edu
as an input to an email filter


\[ \phi(x) = \begin{bmatrix} 1\\0\\1\\0\\1\\1 \end{bmatrix} \]

Each feature has a relative importance



\[ \textbf{w} = \begin{bmatrix} 1.5\\2\\-1\\1.5\\2\\14 \end{bmatrix} \]

Linear Predictors


Score is a linear function of the input features

\[ Score = \textbf{w}\cdot \phi(x) = \sum_j w_j \phi(x)_j \]


For classification, the prediction
$f_\textbf{w}(x) = \textrm{sign}(Score)$


$f_\textbf{w}(x) = \textrm{sign}(\textbf{w}\cdot \phi(x))$
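
As a quick sketch using the email example above (the feature order and weights are the ones shown on the previous slides):

```python
# Score and prediction for the email feature vector and weights from the slides.
import numpy as np

phi = np.array([1, 0, 1, 0, 1, 1])        # phi(x) for r.venkatesaramani@northeastern.edu
w = np.array([1.5, 2, -1, 1.5, 2, 14])    # relative importance of each feature

score = w @ phi                           # w . phi(x) = 1.5 - 1 + 2 + 14 = 16.5
print(score, np.sign(score))              # 16.5, prediction +1
```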

$f_\textbf{w}(x) = \textrm{sign}(\textbf{w}\cdot \phi(x))$


\[ f_\textbf{w}(x) = \begin{cases} 1 & \textbf{w}\cdot \phi(x) > 0 \\ -1 & \textbf{w}\cdot \phi(x) < 0 \\ ? & \textbf{w}\cdot \phi(x) = 0 \\ \end{cases} \]


Consider $\textbf{w} = [2,-1]$ and
$\phi(X) = \{ [2,0], [0,2], [2,4] \}$
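
Working through this example in code (a small sketch; the three feature vectors are the ones listed above):

```python
# A small check of f_w(x) = sign(w . phi(x)) for each phi(x) in the example.
import numpy as np

w = np.array([2, -1])
for phi in [np.array([2, 0]), np.array([0, 2]), np.array([2, 4])]:
    score = w @ phi
    label = 1 if score > 0 else -1 if score < 0 else "?"
    print(phi, score, label)
# [2 0] -> score  4 -> +1
# [0 2] -> score -2 -> -1
# [2 4] -> score  0 -> ? (on the decision boundary)
```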


How do we learn $\textbf{w}$?


We need a loss function


Quantifies how well we are doing


$ Score = \textbf{w}\cdot\phi(x) $


Prediction, $f_\textbf{w}(x) = \textrm{sign}(\textbf{w}\cdot\phi(x)) $


$ \textrm{Margin} = (\textbf{w}\cdot\phi(x))y $


Margin measures how correct we are!

Zero-One Loss
$L_{0-1}(x,y,\textbf{w}) = \mathbb{1}[f_\textbf{w}(x) \ne y]$
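
A tiny numeric check tying these together (the labels $y$ below are hypothetical, chosen for illustration): a positive margin means the prediction is correct, so the zero-one loss is 0.

```python
# Margin = (w . phi(x)) y and zero-one loss for two hypothetical labeled points.
import numpy as np

w = np.array([2, -1])
examples = [(np.array([2, 0]), 1),    # hypothetical label y = +1
            (np.array([0, 2]), 1)]    # hypothetical label y = +1

for phi, y in examples:
    score = w @ phi
    margin = score * y
    loss_01 = int(np.sign(score) != y)
    print(margin, loss_01)            # margin 4 -> loss 0; margin -2 -> loss 1
```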



Learning $\textbf{w}$ for regression


$y' = f_\textbf{w}(x) = \textbf{w}\cdot\phi(x)$


Residual, $y' - y = \textbf{w}\cdot\phi(x)-y$




More loss functions



Squared Loss
$L_{\textrm{squared}}(x,y,\textbf{w}) = (f_\textbf{w}(x) - y)^2$

More loss functions



Absolute Deviation
$L_{\textrm{abs-dev}}(x,y,\textbf{w}) = |f_\textbf{w}(x) - y|$

Use Gradient Descent to minimize loss
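
For example, here is a minimal sketch of gradient descent on the squared loss for a linear predictor (the data below is synthetic, just for illustration):

```python
# Gradient descent on the mean squared loss for f_w(x) = w . phi(x).
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(100, 3))                 # feature vectors phi(x), one per row
w_true = np.array([1.5, -2.0, 0.5])
y = phi @ w_true + 0.1 * rng.normal(size=100)   # noisy real-valued labels

w = np.zeros(3)
lr = 0.1
for _ in range(200):
    residual = phi @ w - y                      # f_w(x) - y for every example
    grad = 2 * phi.T @ residual / len(y)        # gradient of the mean squared loss
    w -= lr * grad

print(w)                                        # close to w_true
```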

More on Linear Classifiers



Consider the following distribution of data.


More on Linear Classifiers



Consider this decision boundary:


More on Linear Classifiers



Now consider this new data point:


More on Linear Classifiers



Gets classified as a Pass!


How can we deal with this?



Idea: find "equidistant" separator





The distance between the decision boundary and the
nearest data point on either side is called the margin

Support Vector Classifier



Find the decision boundary that maximizes the margin


Equivalently, minimize $||\textbf{w}||$


Subject to the constraint that each training point is classified correctly with margin at least 1: $y_i(\textbf{w}\cdot\phi(x_i)) \geq 1$ for all $i$

Hinge Loss



How to minimize $||\textbf{w}||$?


Use gradient descent on hinge loss:


$\ell_{\textrm{hinge}}(\textbf{w}) = \sum_i \max(0, 1 - y_i\,\textbf{w}\cdot\phi(x_i))$
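
A minimal sketch of (sub)gradient descent on this loss. The data is synthetic, and a small $\lambda\|\textbf{w}\|^2$ penalty is added, as in the usual soft-margin objective, so that $\|\textbf{w}\|$ stays small:

```python
# Subgradient descent on the hinge loss plus a small L2 penalty (soft-margin style).
import numpy as np

rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 2))                     # synthetic 2-D features
y = np.where(phi[:, 0] - phi[:, 1] > 0, 1, -1)      # labels in {-1, +1}

w = np.zeros(2)
lr, lam = 0.01, 0.01
for _ in range(1000):
    margins = y * (phi @ w)
    violated = margins < 1                          # points with margin below 1
    # subgradient of sum_i max(0, 1 - y_i w.phi(x_i)) + lam ||w||^2
    grad = -(y[violated, None] * phi[violated]).sum(axis=0) + 2 * lam * w
    w -= lr * grad

print(w, (np.sign(phi @ w) == y).mean())            # learned w, training accuracy
```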


Great, but...



What about outliers?


What about outliers?




Maximal-Margin isn't great here

What about outliers?




Soft-Margin SVC

Allow a few misclassifications

In higher dimensions

"Linear" Classifiers



Now consider the following distribution of data.


Can a Linear Classifier separate the classes?

"Linear" Classifiers



Let's get creative with features

Let's get creative with features!




Augment features as $\phi(x)=\{x, x^3\}$
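
Since the slide's figure isn't reproduced here, the sketch below uses stand-in data: 1-D points whose label flips at three thresholds cannot be separated by a single threshold on $x$, but they become linearly separable after augmenting to $\phi(x) = \{x, x^3\}$.

```python
# Feature augmentation: a linear classifier on [x, x^3] can realize a cubic
# decision rule on the original 1-D input.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
y = np.where(x**3 - 3 * x > 0, 1, -1)      # labels flip at -sqrt(3), 0, +sqrt(3)

phi = np.column_stack([x, x**3])           # augmented features phi(x) = [x, x^3]
w = np.array([-3.0, 1.0])                  # a separating w in the augmented space
print((np.sign(phi @ w) == y).mean())      # 1.0: perfectly separated
```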


Kernels Visualization

Logistic Regression



What if we want to predict probabilities?


Logistic function


$\sigma(x) = \frac{1}{1+e^{-x}}$




Logistic Regression



How to use this for classification?


Use it as a non-linear activation function!


$\sigma(\textbf{w}\cdot\phi(x))$


$\sigma(\textbf{w}\cdot\phi(x)) \geq 0.5 \implies \textbf{w}\cdot\phi(x) \geq 0$


$\sigma(\textbf{w}\cdot\phi(x)) < 0.5 \implies \textbf{w}\cdot\phi(x) < 0$
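
A quick numeric check of this equivalence (just a sketch):

```python
# sigma(z) >= 0.5 exactly when z >= 0, so thresholding the probability at 0.5
# gives the same decision as thresholding the score at 0.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [-2.0, -0.1, 0.0, 0.1, 2.0]:
    print(z, sigma(z), sigma(z) >= 0.5, z >= 0)   # last two columns always agree
```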




Logistic Regression



Let's consider this example again


Logistic Regression



What's the probability of passing if you study for 9 hours?


Logistic Regression



Consider $\phi(x) = x$

We need to fit a curve to this

Logistic Regression



Remember that $\sigma$ is a function of a linear combination of $\phi(x)$


$\sigma(\textbf{w}\cdot\phi(x)) = \frac{1}{1+\mathrm{e}^{-\textbf{w}\cdot\phi(x)}}$


$\textbf{w}\cdot\phi(x) = w_0 + \sum_i w_i \phi(x)_i$




Logistic Regression



We need a loss function that captures binary misclassification using $\textbf{w}$


Binary Cross Entropy


Binary Cross Entropy



$\ell_{\textrm{BCE}}(\textbf{w}) = -\frac{1}{N}\sum_i \Big[\, y_i \log\sigma(\textbf{w}\cdot\phi(x_i)) + (1-y_i) \log \big(1-\sigma(\textbf{w}\cdot\phi(x_i))\big) \Big]$


$ = -\frac{1}{N} \sum_i \Big[\, y_i \log P(y=1 \mid x_i) + (1-y_i) \log \big(1-P(y=1 \mid x_i)\big) \Big]$, with labels $y_i \in \{0,1\}$
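
A minimal sketch of logistic regression trained by gradient descent on this loss. The study-hours data and its parameters are made up for illustration; labels are in $\{0,1\}$ as the loss assumes.

```python
# Logistic regression fit by gradient descent on the BCE loss (synthetic data).
import numpy as np

rng = np.random.default_rng(0)
n = 200
hours = rng.uniform(0, 12, size=n)                  # hypothetical hours-studied feature
p_true = 1 / (1 + np.exp(-(hours - 6)))             # made-up "true" pass probability
y = (rng.uniform(size=n) < p_true).astype(float)    # labels in {0, 1}

phi = np.column_stack([np.ones(n), hours])          # [1, x] so w_0 acts as the bias
w = np.zeros(2)
lr = 0.05
for _ in range(5000):
    p = 1 / (1 + np.exp(-(phi @ w)))                # sigma(w . phi(x))
    grad = phi.T @ (p - y) / n                      # gradient of the BCE loss
    w -= lr * grad

print(w)                                            # roughly recovers bias ~ -6, slope ~ 1
print(1 / (1 + np.exp(-(w[0] + w[1] * 9))))         # estimated P(pass | 9 hours)
```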


Unsupervised Learning



What if we don't have labels?


Can we still learn useful representations?


What sort of task does clustering accomplish?

Unsupervised Learning

K-Means Clustering



Initialize K cluster centers


Assign each point to the nearest cluster center


Update cluster centers to be the mean of the points in the cluster


Repeat until convergence
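
A minimal sketch of these four steps (the two synthetic blobs below stand in for unlabeled data):

```python
# K-Means following the steps above: initialize, assign, update, repeat.
import numpy as np

rng = np.random.default_rng(0)
# Two blobs of 2-D points; no labels are used anywhere below.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

K = 2
centers = X[rng.choice(len(X), K, replace=False)]    # 1. initialize K cluster centers
for _ in range(100):
    # 2. assign each point to the nearest cluster center
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # 3. update each center to the mean of the points assigned to it
    new_centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])
    # 4. repeat until convergence (centers stop moving)
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)   # roughly (0, 0) and (5, 5)
```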