Probabilistic Reasoning

So far we have discussed search algorithms and developed a probabilistic reasoning model for cases where the world is static. Now consider, for instance, the problem of monitoring patients' food intake and insulin dosing based on blood sugar levels. Here we encounter a shift from static to dynamic environments. Unlike static scenarios, where the state does not change while we reason about it, dynamic environments introduce variability over multiple time intervals and require a different modeling approach. To handle this, we adopt discrete-time models, dividing time into slices that capture snapshots of the state space at specific intervals. This enables us to observe states and make predictions at these discrete time increments. Integral to this approach is the construction of transition models, which represent the probability distribution over the next state given the current state.


Feature Extraction and Vector Representations

In machine learning and data processing, feature extraction is the process of transforming raw data into a structured set of informative features that can be fed into algorithms. This process involves selecting key characteristics from the data that are most relevant for the task at hand, often based on domain knowledge.

Feature Extraction Example: Email Address

For the email address r.venkatesaramani@northeastern.edu, you might extract the following features:

                ϕ(x) = 
                [
                    has first initial: 1,
                    has first name: 0,
                    has separator: 1,
                    has last initial: 0,
                    has last name: 1,
                    has NEU domain: 1
                ]
                =
                [1, 0, 1, 0, 1, 1]
                    

This feature vector encodes whether the email address contains a first initial, first name, a separator (dot or underscore), last initial, last name, and a domain that belongs to Northeastern University (NEU).
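
As a concrete illustration, here is a minimal Python sketch of such a feature extractor. The function name, the simple string checks, and the example first name are assumptions made purely for illustration, not part of any particular library:

    import re

    def extract_features(email, first_name, last_name):
        """Return the six binary features described above for an email address."""
        local, _, domain = email.lower().partition("@")
        parts = re.split(r"[._]", local)                     # split on dot/underscore
        return [
            int(any(p == first_name[0] for p in parts)),     # has first initial
            int(first_name in parts),                        # has first name
            int("." in local or "_" in local),               # has separator
            int(any(p == last_name[0] for p in parts)),      # has last initial
            int(last_name in parts),                         # has last name
            int(domain == "northeastern.edu"),               # has NEU domain
        ]

    # "rohan" is a made-up first name; only its initial matters here.
    print(extract_features("r.venkatesaramani@northeastern.edu",
                           "rohan", "venkatesaramani"))      # [1, 0, 1, 0, 1, 1]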

Each feature in this representation plays a role in determining the classification, but not all features have equal importance. Some may weigh more heavily in the decision-making process.

Feature Importance (Weights)

The importance of each feature can be represented by weights. These weights show how strongly each feature contributes to the final prediction. For the email filter, the weights for each feature could look like this:

                w = 
                [
                    has first initial: 1.5,
                    has first name: 2,
                    has separator: -1,
                    has last initial: 1.5,
                    has last name: 2,
                    has NEU domain: 4
                ]
                =
                [1.5, 2, -1, 1.5, 2, 4]
                    

These weights reflect the relative importance of each feature for making predictions. For instance, whether the email comes from an NEU domain (weight 4) matters more here than whether it contains a first initial (weight 1.5).

Linear Predictors

In machine learning, a common approach to classification problems is using linear predictors. The final decision is made by calculating a score based on the dot product of the feature vector ϕ(x) and the weight vector w:

                Score = w ⋅ ϕ(x) = ∑j wj ϕ(x)j
                    

For example, if we consider w = [1.5, 2, -1, 1.5, 2, 4] and ϕ(x) = [1, 0, 1, 0, 1, 1], the score is calculated as:

                Score = (1.5 × 1) + (2 × 0) + (-1 × 1) + (1.5 × 0) + (2 × 1) + (4 × 1) = 6.5
                    

Classification Decision

For binary classification tasks (e.g., spam vs. not spam), a simple sign function can be used to make predictions based on the score:

                fw(x) = sign(Score) = 
                {
                    1, if w ⋅ ϕ(x) > 0
                   -1, if w ⋅ ϕ(x) < 0
                    0, if w ⋅ ϕ(x) = 0
                }
                    

In this case, a positive score could correspond to "important email," while a negative score could represent "spam."
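
Putting the score and the sign function together for the email example above, a short sketch (numpy is assumed; the weights and features are the ones given earlier):

    import numpy as np

    w = np.array([1.5, 2, -1, 1.5, 2, 4])    # feature weights from above
    phi_x = np.array([1, 0, 1, 0, 1, 1])     # features of the example email

    score = w @ phi_x                        # dot product, equals 6.5
    prediction = np.sign(score)              # +1 -> "important email", -1 -> "spam"
    print(score, prediction)                 # 6.5 1.0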

Vector Representations of Multiple Inputs

If we consider multiple inputs, each represented as a feature vector, we can compute a score for each one. For example, with two-dimensional feature vectors:

                ϕ(X) = 
                {
                    [2, 0],
                    [0, 2],
                    [2, 4]
                }
                    

If w = [2, -1], the scores for each input are:

                Score for [2, 0] = 2 × 2 + (-1) × 0 = 4
                Score for [0, 2] = 2 × 0 + (-1) × 2 = -2
                Score for [2, 4] = 2 × 2 + (-1) × 4 = 0
                    

Based on the sign of these scores, we can classify the inputs accordingly.
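
Stacking the feature vectors into a matrix makes it easy to compute all scores at once; a small numpy sketch:

    import numpy as np

    Phi = np.array([[2, 0],
                    [0, 2],
                    [2, 4]])        # one feature vector per row
    w = np.array([2, -1])

    scores = Phi @ w                # [ 4, -2,  0]
    print(scores, np.sign(scores))  # signs: [ 1, -1,  0]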

Conclusion

Feature extraction and vector representations are fundamental in transforming raw data into meaningful inputs for machine learning models. By assigning importance (weights) to features, a model can predict outcomes, such as classifying emails, based on the combination of features present in the input data.


Naïve Bayes Classifier: Binary Naive Bayes

The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes' Theorem. It assumes independence between the features given the class. Despite its simplicity, it is widely used for text classification, spam detection, and more.

Example Dataset:

Type   Weight  Color  Shape      Taste   Size
Fruit  Light   Red    Round      Sweet   Medium
Fruit  Light   Other  Not Round  Sweet   Small
Veg    Light   Other  Not Round  Other   Small
Fruit  Heavy   Green  Round      Sweet   Large
Fruit  Light   Other  Round      Sweet   Small
Veg    Heavy   Green  Not Round  Other   Large
Fruit  Light   Red    Round      Sweet   Small
Veg    Heavy   Other  Not Round  Bitter  Large
Fruit  Light   Other  Round      Sour    Small
Veg    Light   Red    Not Round  Other   Medium

Prediction Problem:

We are asked to classify the following data point: Light, Red, Not Round, Sweet, Small. Is it a fruit or a vegetable?

Bayes' Theorem for Classification:

The probability of class y given features X is calculated as:

        P(y|X) = (P(y) ∏ P(Xj|y)) / P(X)
    

In practice, we often use the logarithmic form to avoid underflow from multiplying many small probabilities:

        log P(y|X) = log P(y) + ∑j log P(Xj|y) − log P(X)

Since P(X) is the same for every class, the last term can be dropped when comparing classes.
    

Challenges with Naïve Bayes:

  • One challenge arises when a feature value is never observed in the training data for a given class. This makes P(Xj|y) = 0, which drives the entire product to 0 and eliminates that class from consideration.
  • To address this, we use a technique called Laplace Smoothing or Additive Smoothing: we pretend to have observed every possible value of each feature at least once for each class. The estimate becomes (a worked example follows this list):

                P(Xj|y) = (#(Xj, y) + 1) / (#y + m)

  • Here, #(Xj, y) is the number of training examples of class y in which feature j takes the value Xj, #y is the number of training examples of class y, and m is the number of possible values of feature j.
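
To make this concrete, the short Python sketch below applies Naïve Bayes with Laplace smoothing to the toy dataset above and scores the query (Light, Red, Not Round, Sweet, Small). The data structures and function names are ours, chosen purely for illustration:

    import math
    from collections import Counter

    # Toy dataset from the table above: (Weight, Color, Shape, Taste, Size) -> Type
    data = [
        ("Fruit", ("Light", "Red",   "Round",     "Sweet",  "Medium")),
        ("Fruit", ("Light", "Other", "Not Round", "Sweet",  "Small")),
        ("Veg",   ("Light", "Other", "Not Round", "Other",  "Small")),
        ("Fruit", ("Heavy", "Green", "Round",     "Sweet",  "Large")),
        ("Fruit", ("Light", "Other", "Round",     "Sweet",  "Small")),
        ("Veg",   ("Heavy", "Green", "Not Round", "Other",  "Large")),
        ("Fruit", ("Light", "Red",   "Round",     "Sweet",  "Small")),
        ("Veg",   ("Heavy", "Other", "Not Round", "Bitter", "Large")),
        ("Fruit", ("Light", "Other", "Round",     "Sour",   "Small")),
        ("Veg",   ("Light", "Red",   "Not Round", "Other",  "Medium")),
    ]
    query = ("Light", "Red", "Not Round", "Sweet", "Small")

    class_counts = Counter(y for y, _ in data)            # #y for each class
    n = len(data)
    # m_j: number of possible values of feature j, read off the training data
    m = [len({x[j] for _, x in data}) for j in range(len(query))]

    def log_posterior(y):
        """log P(y) + sum_j log P(X_j | y), with Laplace smoothing (log P(X) dropped)."""
        lp = math.log(class_counts[y] / n)
        for j, value in enumerate(query):
            count = sum(1 for cy, x in data if cy == y and x[j] == value)
            lp += math.log((count + 1) / (class_counts[y] + m[j]))
        return lp

    scores = {y: log_posterior(y) for y in class_counts}
    print(scores)                                 # log-scores for Fruit and Veg
    print(max(scores, key=scores.get))            # the predicted class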

Conclusion:

Despite its simplicity and its assumption of feature independence, Naïve Bayes is effective for many real-world classification problems. Its ability to handle previously unseen feature values via Laplace Smoothing keeps it reasonably robust even on small datasets.


Linear Classifiers

A linear classifier is a type of machine learning model that classifies data by computing a linear combination of input features. Here's an overview of how linear classifiers work:

1. Linear Predictors

In a linear classifier, the score is calculated as a linear function of the input features. The score function is defined as:

        Score = w ⋅ ϕ(x) = ∑j wj ϕ(x)j
                

where w represents the weight vector, and ϕ(x) is the feature vector of the input x.

2. Prediction

The prediction of a linear classifier is based on the sign of the score:

        fw(x) = sign(Score) = sign(w ⋅ ϕ(x))
                

In other words, the classifier outputs:

        fw(x) = 
          { 
            1  if  w ⋅ ϕ(x) > 0 
            -1 if  w ⋅ ϕ(x) < 0 
            ?  if  w ⋅ ϕ(x) = 0 
          }
                

3. Example Calculation

Consider the weight vector w = [2, -1] and the feature vectors ϕ(X) = {[2, 0], [0, 2], [2, 4]}. The prediction for each feature vector can be calculated as follows:

        fw([2, 0]) = sign([2, -1] ⋅ [2, 0]) = sign(4) = 1
        fw([0, 2]) = sign([2, -1] ⋅ [0, 2]) = sign(-2) = -1
        fw([2, 4]) = sign([2, -1] ⋅ [2, 4]) = sign(0) = ?
                

4. Learning the Weight Vector w

To learn the weight vector w, we need a loss function to quantify how well the model is performing. The score is defined as:

        Score = w ⋅ ϕ(x)
                

The prediction function is:

        fw(x) = sign(w ⋅ ϕ(x))
                

Margin

The margin is defined as:

        Margin = (w ⋅ ϕ(x)) y
                

where y ∈ {−1, +1} is the true label. The margin is positive when the prediction is correct and negative when it is wrong; a larger margin indicates a more confident correct prediction.

Zero-One Loss

The zero-one loss is used for classification tasks and is defined as:

        L0-1(x, y, w) = 1[fw(x) ≠ y]
                

It gives a penalty of 1 if the prediction does not match the true label y, and 0 otherwise.
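
Using the example vectors from earlier and labels assumed purely for illustration, a quick numpy sketch of the margin and the zero-one loss:

    import numpy as np

    w = np.array([2, -1])
    Phi = np.array([[2, 0], [0, 2], [2, 4]])
    y = np.array([1, -1, 1])                          # hypothetical true labels

    margins = (Phi @ w) * y                           # [4, 2, 0]
    zero_one = (np.sign(Phi @ w) != y).astype(int)    # [0, 0, 1]: loss only where f(x) != y
    print(margins, zero_one)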

5. Regression

For regression tasks, the prediction is:

        y' = fw(x) = w ⋅ ϕ(x)
                

The residual is:

        Residual = y' − y = w ⋅ ϕ(x) − y
                

Squared Loss

The squared loss function is:

        Lsquared(x, y, w) = (fw(x) − y)²
                

Absolute Deviation

The absolute deviation loss function is:

        Labs-dev(x, y, w) = |fw(x) − y|
                

6. Optimization

To learn the optimal weight vector w, gradient descent or other optimization algorithms can be used to minimize the chosen loss function.
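
As one concrete option, here is a minimal sketch of batch gradient descent on the squared loss for a linear predictor (the toy data, learning rate, and iteration count are made up for illustration):

    import numpy as np

    # Toy regression data generated by y = 3x, for illustration only
    Phi = np.array([[1.0], [2.0], [3.0], [4.0]])   # feature vectors phi(x)
    y = np.array([3.0, 6.0, 9.0, 12.0])            # true targets

    w = np.zeros(1)       # initial weights
    eta = 0.05            # learning rate (step size)

    for _ in range(200):
        residuals = Phi @ w - y                    # f_w(x) - y for each example
        grad = 2 * Phi.T @ residuals / len(y)      # gradient of the mean squared loss
        w -= eta * grad                            # gradient descent update

    print(w)              # approaches [3.], the slope that generated the data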


Support Vector Machines (SVM) as Maximal Margin Classification

Support Vector Machines (SVM) are a powerful class of supervised learning algorithms used for classification and regression tasks. One of the key principles of SVM is to find a decision boundary that maximizes the margin between different classes. This is known as maximal margin classification.

Support Vector Classifier

To build a Support Vector Classifier, we aim to find the decision boundary that maximizes the margin between the two classes. Mathematically, this can be formulated as:

\[ \text{Maximize } \frac{2}{\| \mathbf{w} \|} \text{ subject to } y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \text{ for all } i \]

where \(\mathbf{w}\) is the weight vector, \(b\) is the bias term, \(y_i\) are the class labels, and \(\mathbf{x}_i\) are the input features. The goal is to maximize the margin \(\frac{2}{\| \mathbf{w} \|}\), which is the distance between the decision boundary and the closest data points from either class.

Hinge Loss

To find the optimal weight vector \(\mathbf{w}\), we minimize the hinge loss function. The hinge loss is defined as:

\[ \ell_{\text{hinge}}(\mathbf{w}) = \sum_{i} \max(0, 1 - y_i (\mathbf{w}^T \mathbf{x}_i + b)) \]

The hinge loss penalizes misclassified points that lie within the margin or are on the wrong side of the decision boundary. Minimizing this loss function helps in finding the optimal weight vector \(\mathbf{w}\) and bias term \(b\).

Gradient Descent on Hinge Loss

Gradient descent is commonly used to minimize the hinge loss. The gradient of the hinge loss with respect to \(\mathbf{w}\) is computed, and the weight vector \(\mathbf{w}\) is updated iteratively to minimize the loss.
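
A minimal sketch of (sub)gradient descent on the hinge loss as written above, with the bias b kept as a separate variable (the toy data, step size, and iteration count are assumptions for illustration):

    import numpy as np

    # Toy linearly separable data, for illustration only
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
    y = np.array([1, 1, -1, -1])

    w = np.zeros(2)
    b = 0.0
    eta = 0.1

    for _ in range(100):
        margins = y * (X @ w + b)
        active = margins < 1                            # points violating the margin
        # subgradient of sum_i max(0, 1 - y_i (w . x_i + b))
        grad_w = -(y[active][:, None] * X[active]).sum(axis=0)
        grad_b = -y[active].sum()
        w -= eta * grad_w
        b -= eta * grad_b

    print(w, b, np.sign(X @ w + b))                     # predictions match y

A soft-margin classifier (discussed next) would additionally include the gradient of the regularization term \( \frac{1}{2} \| \mathbf{w} \|^2 \) in the update.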

Handling Outliers

One limitation of the maximal margin classifier is its sensitivity to outliers. Outliers can significantly affect the margin and thus the decision boundary. To address this issue, we use the concept of Soft-Margin SVC.

Soft-Margin SVC

The Soft-Margin Support Vector Classifier allows for some misclassification of training data. This approach introduces slack variables to tolerate violations of the margin constraints. The optimization problem becomes:

\[ \text{Minimize } \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i} \xi_i \]

Subject to: \[ y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i \] and \(\xi_i \geq 0\), where \(\xi_i\) are slack variables and \(C\) is a regularization parameter that controls the trade-off between maximizing the margin and minimizing classification errors.

Linear Classifiers in Higher Dimensions

In some cases, a linear classifier may not effectively separate the classes. To address this, we can augment the feature space to make it possible to use a linear classifier in a higher-dimensional space.

For example, for a one-dimensional input \(x\) we might augment the features as \[ \phi(x) = (x, x^3) \] This transformation allows the linear classifier to capture more complex decision boundaries by projecting the data into a higher-dimensional space.

Kernels and Visualization

Kernels are functions that enable the SVM to work in high-dimensional spaces without explicitly computing the transformation. By using kernel functions, we can map the data into a higher-dimensional space implicitly and apply the linear classification in this space.

Common kernel functions include:

  • Polynomial Kernel: \(K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T \mathbf{x}' + c)^d\)
  • Radial Basis Function (RBF) Kernel: \(K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \| \mathbf{x} - \mathbf{x}' \|^2)\)
  • Sigmoid Kernel: \(K(\mathbf{x}, \mathbf{x}') = \tanh(\alpha \mathbf{x}^T \mathbf{x}' + c)\)

By applying these kernels, we can handle complex datasets and find effective decision boundaries in higher-dimensional feature spaces.
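
For reference, these kernels can be written directly as small numpy functions (a sketch; the parameter values shown are arbitrary):

    import numpy as np

    def polynomial_kernel(x, xp, c=1.0, d=3):
        return (x @ xp + c) ** d

    def rbf_kernel(x, xp, gamma=0.5):
        return np.exp(-gamma * np.sum((x - xp) ** 2))

    def sigmoid_kernel(x, xp, alpha=0.1, c=0.0):
        return np.tanh(alpha * (x @ xp) + c)

    x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(polynomial_kernel(x, xp), rbf_kernel(x, xp), sigmoid_kernel(x, xp))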

Logistic Regression

What if we want to predict probabilities?

Logistic Regression is used when the outcome is categorical, especially in binary classification problems. To predict probabilities, we use the logistic function, also known as the sigmoid function:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

How to use this for classification?

Logistic Regression uses the sigmoid function as a non-linear activation function. Here's how it works:

  • If σ(w ⋅ ϕ(x)) ≥ 0.5, then w ⋅ ϕ(x) ≥ 0
  • If σ(w ⋅ ϕ(x)) < 0.5, then w ⋅ ϕ(x) < 0

Here, ϕ(x) represents the feature vector and w represents the weight vector. The sigmoid function maps the linear combination of features to a probability between 0 and 1.
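
A tiny sketch of this decision rule (the weights and features are made up; numpy is assumed):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.array([2.0, -1.0])         # illustrative weights
    phi_x = np.array([1.0, 0.5])      # illustrative feature vector

    p = sigmoid(w @ phi_x)            # probability of the positive class
    label = 1 if p >= 0.5 else -1     # same as checking whether w . phi(x) >= 0
    print(p, label)                   # approx 0.82, 1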

Example: Predicting Probability

Consider a simple example where you want to predict the probability of passing an exam based on hours studied. Let ϕ(x) = x. We need to fit a logistic curve to this data.

Logistic Function in Logistic Regression

The logistic function is applied to a linear combination of the features:

\[ \sigma(\mathbf{w} \cdot \phi(\mathbf{x})) = \frac{1}{1 + e^{-\mathbf{w} \cdot \phi(\mathbf{x})}} \]

where \( \mathbf{w} \cdot \phi(\mathbf{x}) = w_0 + \sum_i w_i \, \phi(\mathbf{x})_i \), with \(w_0\) acting as the bias term.

Loss Function: Binary Cross Entropy

To train a logistic regression model, we need a loss function that captures binary misclassification. The Binary Cross Entropy (BCE) loss function is used:

\[ \text{BCE}(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \sigma(\mathbf{w} \cdot \phi(\mathbf{x}_i)) + (1 - y_i) \log\left(1 - \sigma(\mathbf{w} \cdot \phi(\mathbf{x}_i))\right) \right] \]

where \(y_i\) is the true label and \(\sigma(\mathbf{w} \cdot \phi(\mathbf{x}_i))\) is the predicted probability for instance \(i\).
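
To make the training loop concrete, here is a minimal sketch of gradient descent on the BCE loss for the hours-studied example; the data points, learning rate, and iteration count are made up for illustration, and a constant feature is appended so that \(w_0\) is learned along with the rest:

    import numpy as np

    # Hypothetical (hours studied, passed?) data, for illustration only
    hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

    # phi(x) = [1, x]: the leading 1 lets w[0] act as the bias term w_0
    Phi = np.column_stack([np.ones_like(hours), hours])

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    w = np.zeros(2)
    eta = 0.1

    for _ in range(5000):
        p = sigmoid(Phi @ w)                  # predicted probabilities
        grad = Phi.T @ (p - y) / len(y)       # gradient of the BCE loss w.r.t. w
        w -= eta * grad

    print(w)                                  # learned [w_0, w_1]
    print(sigmoid(w @ np.array([1.0, 2.25]))) # predicted probability after 2.25 hours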