So far we have discussed search algorithms and developed a probabilistic reasoning model for cases where our world is static. Now, consider, for instance, the example of monitoring patients for food intake and insulin based on blood sugar levels. In this case, we encounter a shift from static to dynamic environments. Unlike static scenarios where the state sequence remains unchanged, dynamic environments introduce variability over multiple time intervals, requiring a different modeling approach. To tackle this, we adopt discrete-time models, dividing time into slices to capture snapshots of the state space at specific intervals. This enables us to observe states and make predictions based on these discrete time increments. Integral to this approach is the construction of transitional models, representing the probability distribution of the next state based on the current states.
In machine learning and data processing, feature extraction is the process of transforming raw data into a structured set of informative features that can be fed into algorithms. This process involves selecting key characteristics from the data that are most relevant for the task at hand, often based on domain knowledge.
For the email address r.venkatesaramani@northeastern.edu
, you might extract the following features:
ϕ(x) = [ has first initial: 1, has first name: 0, has separator: 1, has last initial: 0, has last name: 1, has NEU domain: 1 ] = [1, 0, 1, 0, 1, 1]
This feature vector encodes whether the email address contains a first initial, first name, a separator (dot or underscore), last initial, last name, and a domain that belongs to Northeastern University (NEU).
Each feature in this representation plays a role in determining the classification, but not all features have equal importance. Some may weigh more heavily in the decision-making process.
The importance of each feature can be represented by weights. These weights show how strongly each feature contributes to the final prediction. For the email filter, the weights for each feature could look like this:
w = [ has first initial: 1.5, has first name: 2, has separator: -1, has last initial: 1.5, has last name: 2, has NEU domain: 4 ] = [1.5, 2, -1, 1.5, 2, 4]
These weights reflect the relative importance of each feature for making predictions. For instance, if the email comes from an NEU domain, it may be more important than having a first initial.
In machine learning, a common approach to classification problems is using linear predictors. The final decision is made by calculating a score based on the dot product of the feature vector ϕ(x) and the weight vector w:
Score = w ⋅ ϕ(x) = ∑ wj ϕ(x)j
For example, if we consider w = [1.5, 2, -1, 1.5, 2, 4] and ϕ(x) = [1, 0, 1, 0, 1, 1], the score is calculated as:
Score = (1.5 × 1) + (2 × 0) + (-1 × 1) + (1.5 × 0) + (2 × 1) + (4 × 1) = 6.5
For binary classification tasks (e.g., spam vs. not spam), a simple sign function can be used to make predictions based on the score:
fw(x) = sign(Score) = { 1, if w ⋅ ϕ(x) > 0 -1, if w ⋅ ϕ(x) < 0 0, if w ⋅ ϕ(x) = 0 }
In this case, a positive score could correspond to "important email," while a negative score could represent "spam."
If we consider multiple email addresses as inputs, represented as feature vectors, we can compute scores for each one. For example, consider:
ϕ(X) = { [2, 0], [0, 2], [2, 4] }
If w = [2, -1], the scores for each input are:
Score for [2, 0] = 2 × 2 + (-1) × 0 = 4 Score for [0, 2] = 2 × 0 + (-1) × 2 = -2 Score for [2, 4] = 2 × 2 + (-1) × 4 = 0
Based on the sign of these scores, we can classify the inputs accordingly.
Feature extraction and vector representations are fundamental in transforming raw data into meaningful inputs for machine learning models. By assigning importance (weights) to features, a model can predict outcomes, such as classifying emails, based on the combination of features present in the input data.
The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes' Theorem. It assumes independence between the features given the class. Despite its simplicity, it is widely used for text classification, spam detection, and more.
Type | Weight | Color | Shape | Taste | Size |
---|---|---|---|---|---|
Fruit | Light | Red | Round | Sweet | Medium |
Fruit | Light | Other | Not Round | Sweet | Small |
Veg | Light | Other | Not Round | Other | Small |
Fruit | Heavy | Green | Round | Sweet | Large |
Fruit | Light | Other | Round | Sweet | Small |
Veg | Heavy | Green | Not Round | Other | Large |
Fruit | Light | Red | Round | Sweet | Small |
Veg | Heavy | Other | Not Round | Bitter | Large |
Fruit | Light | Other | Round | Sour | Small |
Veg | Light | Red | Not Round | Other | Medium |
We are asked to classify the following data point: Light, Red, Not Round, Sweet, Small. Is it a fruit or a vegetable?
The probability of class y given features X is calculated as:
P(y|X) = (P(y) ∏ P(Xj|y)) / P(X)
In practice, we often use the logarithmic form to avoid precision errors due to very small probabilities:
log P(y|X) = log P(y) + Σ log P(Xj|y)
P(Xj|y) = (#(Xj, y) + 1) / (#y + m)
Despite its simplicity and assumptions of feature independence, Naïve Bayes is effective for many real-world classification problems. Its ability to handle missing or previously unseen data with Laplace Smoothing makes it robust even with smaller datasets.
A linear classifier is a type of machine learning model that classifies data by computing a linear combination of input features. Here's an overview of how linear classifiers work:
In a linear classifier, the score is calculated as a linear function of the input features. The score function is defined as:
Score = w ⋅ ϕ(x) = ∑j wj ϕ(x)j
where w
represents the weight vector, and ϕ(x)
is the feature vector of the input x
.
The prediction of a linear classifier is based on the sign of the score:
fw(x) = sign(Score) = sign(w ⋅ ϕ(x))
In other words, the classifier outputs:
fw(x) = { 1 if w ⋅ ϕ(x) > 0 -1 if w ⋅ ϕ(x) < 0 ? if w ⋅ ϕ(x) = 0 }
Consider the weight vector w = [2, -1]
and the feature vectors ϕ(X) = {[2, 0], [0, 2], [2, 4]}
. The prediction for each feature vector can be calculated as follows:
fw([2, 0]) = sign([2, -1] ⋅ [2, 0]) = sign(4) = 1 fw([0, 2]) = sign([2, -1] ⋅ [0, 2]) = sign(-2) = -1 fw([2, 4]) = sign([2, -1] ⋅ [2, 4]) = sign(2) = 1
w
To learn the weight vector w
, we need a loss function to quantify how well the model is performing. The score is defined as:
Score = w ⋅ ϕ(x)
The prediction function is:
fw(x) = sign(w ⋅ ϕ(x))
The margin is defined as:
Margin = (w ⋅ ϕ(x)) y
where y
is the true label. The margin measures how correctly the classifier is predicting the label. A larger margin indicates a more confident prediction.
The zero-one loss is used for classification tasks and is defined as:
L0-1(x, y, w) = 1[fw(x) ≠ y]
It gives a penalty of 1 if the prediction does not match the true label y
, and 0 otherwise.
For regression tasks, the prediction is:
y' = fw(x) = w ⋅ ϕ(x)
The residual is:
Residual = y' − y = w ⋅ ϕ(x) − y
The squared loss function is:
Lsquared(x, y, w) = (fw(x) − y)2
The absolute deviation loss function is:
Labs-dev(x, y, w) = |fw(x) − y|
To learn the optimal weight vector w
, gradient descent or other optimization algorithms can be used to minimize the chosen loss function.
Support Vector Machines (SVM) are a powerful class of supervised learning algorithms used for classification and regression tasks. One of the key principles of SVM is to find a decision boundary that maximizes the margin between different classes. This is known as maximal margin classification.
To build a Support Vector Classifier, we aim to find the decision boundary that maximizes the margin between the two classes. Mathematically, this can be formulated as:
\[ \text{Maximize } \frac{2}{\| \mathbf{w} \|} \text{ subject to } y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \text{ for all } i \]
where \(\mathbf{w}\) is the weight vector, \(b\) is the bias term, \(y_i\) are the class labels, and \(\mathbf{x}_i\) are the input features. The goal is to maximize the margin \(\frac{2}{\| \mathbf{w} \|}\), which is the distance between the decision boundary and the closest data points from either class.
To find the optimal weight vector \(\mathbf{w}\), we minimize the hinge loss function. The hinge loss is defined as:
\[ \ell_{\text{hinge}}(\mathbf{w}) = \sum_{i} \max(0, 1 - y_i (\mathbf{w}^T \mathbf{x}_i + b)) \]
The hinge loss penalizes misclassified points that lie within the margin or are on the wrong side of the decision boundary. Minimizing this loss function helps in finding the optimal weight vector \(\mathbf{w}\) and bias term \(b\).
Gradient descent is commonly used to minimize the hinge loss. The gradient of the hinge loss with respect to \(\mathbf{w}\) is computed, and the weight vector \(\mathbf{w}\) is updated iteratively to minimize the loss.
One limitation of the maximal margin classifier is its sensitivity to outliers. Outliers can significantly affect the margin and thus the decision boundary. To address this issue, we use the concept of Soft-Margin SVC.
The Soft-Margin Support Vector Classifier allows for some misclassification of training data. This approach introduces slack variables to tolerate violations of the margin constraints. The optimization problem becomes:
\[ \text{Minimize } \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i} \xi_i \]
Subject to: \[ y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i \] and \(\xi_i \geq 0\), where \(\xi_i\) are slack variables and \(C\) is a regularization parameter that controls the trade-off between maximizing the margin and minimizing classification errors.
In some cases, a linear classifier may not effectively separate the classes. To address this, we can augment the feature space to make it possible to use a linear classifier in a higher-dimensional space.
For example, consider augmenting features as: \[ \phi(\mathbf{x}) = \{ \mathbf{x}, x^3 \} \] This transformation allows the linear classifier to handle more complex decision boundaries by projecting the data into a higher-dimensional space.
Kernels are functions that enable the SVM to work in high-dimensional spaces without explicitly computing the transformation. By using kernel functions, we can map the data into a higher-dimensional space implicitly and apply the linear classification in this space.
Common kernel functions include:
By applying these kernels, we can handle complex datasets and find effective decision boundaries in higher-dimensional feature spaces.
Logistic Regression is used when the outcome is categorical, especially binary classification problems. To predict probabilities, we use the logistic function, also known as the sigmoid function:
σ(x) = \frac{1}{1 + e^{-x}}
Logistic Regression uses the sigmoid function as a non-linear activation function. Here's how it works:
σ(w ⋅ ϕ(x)) ≥ 0.5
, then w ⋅ ϕ(x) ≥ 0
σ(w ⋅ ϕ(x)) < 0.5
, then w ⋅ ϕ(x) < 0
Here, ϕ(x)
represents the feature vector and w
represents the weight vector. The sigmoid function maps the linear combination of features to a probability between 0 and 1.
Consider a simple example where you want to predict the probability of passing an exam based on hours studied. Let ϕ(x) = x
. We need to fit a logistic curve to this data.
The logistic function is applied to a linear combination of the features:
σ(w, ϕ(x)) = \frac{1}{1 + e^{-w ⋅ ϕ(x)}}
Where w ⋅ ϕ(x) = w0 + ∑i wiϕ(x)i
.
To train a logistic regression model, we need a loss function that captures binary misclassification. The Binary Cross Entropy (BCE) loss function is used:
ℓBCE(w) = -\frac{1}{N} ∑i [yi log(σ(w ⋅ ϕ(xi))) + (1 - yi) log(1 - σ(w ⋅ ϕ(xi)))]
Where yi
is the true label and σ(w ⋅ ϕ(xi))
is the predicted probability for each instance.