Having studied search algorithms, techniques for dealing with real-world uncertainties, and learning from experience, we turn our attention in this final section of the course to learning from data. This is the domain of machine learning models: algorithms whose goal is to enable computers to learn from, and make predictions or decisions based on, data. Machine learning models are often classified into supervised learning and unsupervised learning. In supervised learning, models are trained on labeled data, where both the input and the output are known. In unsupervised learning, on the other hand, models are trained on unlabeled data, and the goal is to discover patterns or relationships in the data. Reinforcement learning may also be thought of as a form of machine learning, where the data is generated through interactions with an environment.
In machine learning, the first step is to transform the raw data at hand into a structured set of informative features that can be fed into algorithms. This involves selecting the characteristics of the data that are most relevant to the task at hand, often based on domain knowledge, and converting them into numerical representations from which we can compute complex relationships. This process is called feature extraction.
There are a host of techniques for converting raw data into numerical information, and the choice of which to use almost always depends on the problem, the model, and the data distribution. This is best explained with the following example.
Spam Classification: Given a set of emails labeled as spam or ham (not spam), our goal is to build a machine learning algorithm that learns what differentiates spam emails from ham emails, based on a set of features provided to the model. In other words, given some numeric feature information about an email, we want to predict whether it is spam or ham. To do so, we must first select the features that are most relevant to the task. If we consider building such a classifier for use at Northeastern University, the following features of an email will likely be useful:

- Sender domain
- Call to action/sense of urgency
- Presence of attachments
- Presence of links
- Typos in the email
Now given an email, we must convert these features into numerical representations, since the driving idea behind machine learning models is to learn mathematical relationships between the input and output data. Let us now reason about how each of these features can be converted into numerical representations.
Sender Domain: The sender's email address can, in theory, be registered with one of infinitely many email domains. However, not every unique domain is necessarily relevant to the task of spam classification. For instance, the use of a domain such as northeastern.edu or khoury.ccs.edu is likely to be more indicative of a ham email than a spam email; but all we are concerned with is whether one of these trusted domains was used. If we have a list of trusted senders, we can reduce this feature to a binary representation that only captures whether the email was sent from a trusted domain: the presence of a trusted domain is represented as a 1, and the presence of an untrusted domain as a 0.
Call to action/sense of urgency: The presence of a call to action or a sense of urgency in an email is often indicative of a spam email. (Think about emails that contain phrases such as 'requires your immediate attention'.) The more urgent an email appears, the more likely it is to be spam. We can convert this feature into a numerical representation in many ways; for now, let us assume we have access to a natural language tool that can analyze the text of an email and return a score that represents the urgency of the email. This score can then be used as a feature in our model. For simplicity, let us assume that the score ranges from 0 to 1, where 0 indicates no urgency and 1 indicates high urgency.
Presence of attachments: Attachments in emails are often used to spread malware or viruses, and are therefore indicative of spam emails. We can convert this feature into a binary representation, where the presence of an attachment is represented as a 1, and the absence of an attachment is represented as a 0.
Presence of links: Links in emails are often used to redirect users to malicious websites, and are therefore indicative of spam emails. We can convert this feature into a binary representation, where the presence of a link is represented as a 1, and the absence of a link is represented as a 0.
Typos in the email: Spam emails often contain typos, as they are frequently generated by bots or non-native speakers. We can convert this feature into a numerical representation by counting the number of typos in an email: if an email contains 3 typos, we represent this feature as 3. However, this method is not ideal, as the number of typos in an email is not necessarily bounded, and a large typo count can quickly dominate all other features. A better approach may be to normalize the feature by dividing the number of typos by the number of words in the email, so that we use the frequency of typos rather than the raw count. Another possible approach is binning: if the email contains 0-2 typos, we represent the feature as 0, if it contains 3-5 typos, we represent it as 1, and so on. Intuitively, this approach also preserves the idea that a larger bin index indicates a larger number of typos in the email.
Finally, once we have converted these features into numerical representations, we concatenate them into a single vector, which we can then use as input to our machine learning model. This vector representation is often referred to as a feature vector, and is the input to the model. The process of extracting features and converting them into a numerical vector is called vectorization. Feature extraction and vectorization are crucial steps in building machine learning models, as the quality of the features directly impacts the performance of the model. In practice, feature extraction is often an iterative process, where we experiment with different feature sets and representations to find the best combination for the task at hand.
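To make the feature extraction step concrete, here is a minimal sketch in Python of how the five features above might be extracted and vectorized. The trusted-domain list, the "click here" link heuristic, and the assumption that an urgency score and a typo count are supplied by external tools are all our own simplifications for illustration, not a prescribed implementation:

```python
import re

TRUSTED_DOMAINS = {"northeastern.edu", "khoury.ccs.edu"}  # assumed trusted list

def extract_features(sender, body, has_attachment, urgency_score, typo_count):
    """Vectorize one email as [trusted domain, urgency, attachment, link, typo frequency]."""
    domain = sender.split("@")[-1].lower()
    trusted = 1 if domain in TRUSTED_DOMAINS else 0
    attachment = 1 if has_attachment else 0
    # Crude heuristic for the presence of a link
    link = 1 if re.search(r"https?://|click here", body, re.IGNORECASE) else 0
    n_words = max(len(body.split()), 1)   # guard against empty bodies
    typo_freq = typo_count / n_words      # normalized typo frequency
    return [trusted, urgency_score, attachment, link, typo_freq]

x = extract_features("registrar@northeastern.edu",
                     "URGENT: click here to view the updated calendar",
                     has_attachment=False, urgency_score=0.8, typo_count=0)
print(x)  # [1, 0.8, 0, 1, 0.0]
```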
Naive Bayes classification is a simple probabilistic classifier based on the application of Bayes' theorem, where we (naively) assume that the features are conditionally independent given the class label. This assumption greatly simplifies the computation of the posterior probability. Naive Bayes (NB) classification is a supervised learning approach, i.e., we require a set of labeled training data to build the model. The model is then used to predict the class label of new, unseen instances based on the features provided. Despite its simplicity, NB classifiers have been found to perform well in practice, especially in text classification tasks such as spam filtering and sentiment analysis. Let us walk through how an NB classifier works in practice with the same example of spam classification, with the set of features we discussed earlier.
The first thing we need to do is to build a training dataset, where each instance is represented by a feature vector and has an associated true class label. For instance, we may have the following raw training data:
| # | Email Contents | Class |
|---|---|---|
| Email 1 | From: raj.venkat@northeastern.edu Body: URGENT: Your account has been compromised. Click here to reset your password. | Spam |
| Email 2 | From: texjohnson707@gmail.com Body: Hi, I hope you are doing well. Attached is the document you requested. Attachments: barrel.jpg | Ham |
| Email 3 | From: noreply@insurer.com Body: ACTION REQD: You are the named beneficiery of a $1M insurance polcy. To collect, call (716) 776-2323 immidiately. | Spam |
| Email 4 | From: noreply@nsfgrfp.org Body: The U.S. National Science Foundation (NSF) is seeking reviewers for the FY25 Graduate Research Fellowship Program (GRFP). The deadline to register to be considered as a reviewer for this year's competition has been extended to November 4, 2024. | Ham |
Next, each email must be converted into its feature vector. Taking email 3 as an example, applying the feature extraction scheme described above yields the feature vector [0, 0.9, 0, 0, 1], and its true label (spam) is represented as -1. (Conversely, the labels for ham emails are represented as 1.)
After similarly processing all the emails, the equivalent training dataset would look as follows:

| # | Trusted Domain | Urgency Score | Attachment | Link | Typos | Class |
|---|---|---|---|---|---|---|
| Email 1 | 1 | 0.8 | 0 | 1 | 0 | -1 |
| Email 2 | 0 | 0.2 | 1 | 0 | 0 | 1 |
| Email 3 | 0 | 0.9 | 0 | 0 | 1 | -1 |
| Email 4 | 1 | 0.1 | 0 | 0 | 0 | 1 |
Great! Now that we have some training data, we can build our NB classifier. This is a fairly straightforward application of Bayes' theorem, where we compute the posterior probability of each class given the features of an email, and predict the class with the highest probability. To understand this, let us take a new email as an example and walk through the process of classifying it as spam or ham. Consider the following email, and its corresponding feature representation:
| Email Contents |
|---|
| From: registrar@northeastern.edu Body: URGENT. The academic calendar has been updated. Please click here to view the updated calendar. In case of unforeseen scheduling conflicts, please reply to this email by the end of day. |

| Trusted Domain | Urgency Score | Attachment | Link | Typos |
|---|---|---|---|---|
| 1 | 0.8 | 0 | 1 | 0 |
By Bayes' theorem, the posterior probability of the test email $x_t$ being spam is $P(\textrm{spam} | x_t) \propto P(x_t | \textrm{spam})P(\textrm{spam})$, where the constant of proportionality, $1/P(x_t)$, is the same for both classes and can be ignored when comparing them. To compute the likelihood $P(x_t | \textrm{spam})$, we make the naive assumption that the features are conditionally independent given the class label. This allows us to write: \begin{equation} P(x_t | \textrm{spam}) = P(\textrm{Trusted Domain} | \textrm{spam}) P(\textrm{Urgency} | \textrm{spam}) P(\textrm{Attachment} | \textrm{spam}) P(\textrm{Link} | \textrm{spam}) P(\textrm{Typos} | \textrm{spam}) \end{equation} where each term on the right-hand side can be computed from the training data. For instance, $P(\textrm{Trusted Domain} | \textrm{spam})$ is the fraction of spam emails that have a trusted domain, which in our case happens to be one out of two (of the two spam emails, namely emails 1 and 3, only email 1 is from a trusted domain). We can similarly compute the other terms. Abbreviating the features as TD (Trusted Domain), U (Urgency), A (Attachment), L (Link), and T (Typos), the posterior probability of spam for the given email is therefore: \begin{gather} P(\textrm{spam} | x_t) \\ \propto P(x_t | \textrm{spam})P(\textrm{spam}) \\ = P(\textrm{TD}=1|\textrm{spam}) \times P(\textrm{U}=0.8|\textrm{spam}) \times P(\textrm{A}=0|\textrm{spam}) \times P(\textrm{L}=1|\textrm{spam}) \times P(\textrm{T}=0|\textrm{spam}) \times 0.5 \\ = \frac{1}{2} \times \frac{1}{2} \times 1 \times \frac{1}{2} \times \frac{1}{2} \times 0.5 = 0.03125 \end{gather} Similarly, we can compute the posterior probability of ham as: \begin{gather} P(\textrm{ham} | x_t) \\ \propto P(x_t | \textrm{ham})P(\textrm{ham}) \\ = P(\textrm{TD}=1|\textrm{ham}) \times P(\textrm{U}=0.8|\textrm{ham}) \times P(\textrm{A}=0|\textrm{ham}) \times P(\textrm{L}=1|\textrm{ham}) \times P(\textrm{T}=0|\textrm{ham}) \times 0.5 \\ = \frac{1}{2} \times 0 \times \frac{1}{2} \times 0 \times 1 \times 0.5 = 0 \end{gather} From our computed probabilities, we classify the test email as spam. This may surprise you, since the test email, on inspection, does not appear to be spam. However, the NB classifier has made its prediction based on a very limited dataset and the handful of features we provided. In reality, the features used are often more complex, and the model is trained on a much larger dataset. Despite its simplicity, NB classification is often used as a baseline model for text classification tasks, and can perform surprisingly well in practice.
One final nuance here is that if any of the conditional probabilities in the posterior computation evaluate to $0$, the entire probability goes to $0$. For instance, when we computed the probability of the urgency being $0.8$ given a ham class email ($P(\textrm{U}=0.8|\textrm{ham})$), this value is 0 simply because such a combination does not exist in our training set. However, this does not mean that such a combination will never exist or will not be encountered at test time. In its current form, our NB classifier will assign a probability of $0$ to any combination of features and labels it has not yet observed, and therefore does not generalize well. To account for this, we can use Laplace smoothing, where we add a small constant to each count to ensure that no probability is ever $0$. This is a common technique used in practice to make the model more robust to unseen data. Here is a more detailed set of notes for smoothed NB classification [CSE446: University of Washington].
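To tie the computation together, here is a minimal sketch of a Laplace-smoothed NB classifier over discrete feature values, written in plain Python. The function names are our own, and treating the continuous urgency score as just another categorical value is a simplification that mirrors the worked example above:

```python
from collections import defaultdict

def train_nb(X, y, alpha=1.0):
    """Fit a Naive Bayes classifier over discrete feature values.
    alpha is the Laplace smoothing constant; alpha=0 recovers the
    unsmoothed counts used in the worked example above."""
    classes = sorted(set(y))
    n_class = {c: sum(1 for label in y if label == c) for c in classes}
    priors = {c: n_class[c] / len(y) for c in classes}
    counts = defaultdict(int)   # (class, feature index, value) -> count
    values = defaultdict(set)   # feature index -> set of observed values
    for xi, c in zip(X, y):
        for j, v in enumerate(xi):
            counts[(c, j, v)] += 1
            values[j].add(v)

    def predict(x):
        def score(c):
            p = priors[c]
            for j, v in enumerate(x):
                # Laplace-smoothed conditional probability P(feature j = v | class c)
                p *= (counts[(c, j, v)] + alpha) / (n_class[c] + alpha * len(values[j]))
            return p
        return max(classes, key=score)

    return predict

# The four training emails from the table above; -1 = spam, 1 = ham.
X = [[1, 0.8, 0, 1, 0], [0, 0.2, 1, 0, 0], [0, 0.9, 0, 0, 1], [1, 0.1, 0, 0, 0]]
y = [-1, 1, -1, 1]
predict = train_nb(X, y)
print(predict([1, 0.8, 0, 1, 0]))  # -1: still classified as spam after smoothing
```

Note that with smoothing, the ham posterior is no longer exactly zero; the test email is simply less probable under the ham class than under the spam class.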
While NB classification works quite well for simple problems such as text classification, we often wish to extract explicit relationships between features and the class label. For instance, from our NB classification, it is hard to tell which of the provided features are the most important for spam/ham classification. To tackle this problem, we now build a more general framework for classification, using linear decision boundaries.
To understand the problem at hand, we must first ensure that we are comfortable thinking about data as points or vectors in some $n$-dimensional space. Recall that for a given email, we vectorize it to a format that looks like [0, 0.9, 0, 0, 1], for instance. This vector is simply a point in a 5-dimensional space. Each data point in our original training set is represented similarly as a vector of length 5, and corresponds to a unique point in 5-D space. Moving forward, we use the notation $\mathbf{x}_i$ to represent a feature vector in this space, where the subscript $i$ denotes that this feature vector corresponds to the $i^{th}$ data point. Since the feature vector above represents email 3, we can therefore say that $\mathbf{x}_3 = [0, 0.9, 0, 0, 1]$.
While humans cannot visualize 5-dimensional spaces, the theory works just as well in 1, 2 or even 10000 dimensions - so for the sake of building intuition, we will consider 2-D and 3-D spaces in this section. However, it is important to realize that everything we discuss below applies to feature vectors of arbitrary length. For simplicity, let us consider a different sample dataset, where we have two features for each data point, and the class label is binary. The dataset is as follows:
| # | Feature 1 | Feature 2 | Class |
|---|---|---|---|
| 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 1 |
| 3 | 1 | 2 | 0 |
| 4 | 2 | 1 | 0 |
A natural question to ask, then, is: which of these lines is the best? One might say that the best line is the one that maximizes the separation between the two classes. In machine learning, we call this quantity the margin, defined (in the case of binary classification) as the distance between the decision boundary and the closest point of either class. The decision boundary that maximizes this margin is the most robust to noise in the data, and the most likely to generalize well to unseen data. To understand this, let us first consider a counterexample, where the decision boundary is closer to one class, but still perfectly classifies the data.
Now, in this space, let us place a new data point, shown in the following figure in white. Given the decision boundary shown, we would classify this point as Class $+1$. However, looking at the overall data distribution, it is clear that this point is closer to the other Class $-1$ points, and should be classified as such.
This is why we instead aim for a decision boundary that acts as the maximal separator; in other words, the decision boundary that maximizes the margin. Such a decision boundary would look something like this:
The process of finding the decision boundary that maximizes the margin is called maximal margin classification, and is a key idea in the field of machine learning. The decision boundary that maximizes the margin is called the maximum margin hyperplane, and the algorithm that finds this boundary is called the support vector machine (SVM). In the next section, we will discuss the SVM algorithm in more detail, and understand how it works. But before we do so, we must cover one more aspect of learning linear decision boundaries - and that is the representation of a decision boundary in the form of a mathematical equation using a set of parameters.
Recall how we said that a linear classifier in 2-D space is a line. In general, a linear classifier in $n$ dimensions is a hyperplane, which can be represented as the set of points $\mathbf{x}$ that satisfy the equation: $\mathbf{w}\cdot \mathbf{x} + b = 0$. Here, $\mathbf{w}$ is a vector of weights, and $b$ is a bias term that serves the same purpose as the intercept of a line in 2-dimensions. The decision boundary is the set of points where the dot product of the weights and the feature vector plus the bias term is equal to 0. For instance, in 2-D space, if the weight vector is $\mathbf{w}=[w_1, w_2]$ and the bias term is $b$, the equation of a line represented by these parameters is $w_1 x_1 + w_2 x_2 + b = 0$. Similarly, in 3-D space, the equation of a plane is $w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0$, and so on. The process of training an SVM is nothing but figuring out appropriate values for the weights and the bias term such that they maximize the margin, and therefore, the generalization of the model.
One final piece of information that will come in useful later is the geometric interpretation of the weights and biases in the representation of a hyperplane. The weights $\mathbf{w}$ are simply a point or a vector in the $n$-dimensional space, and the hyperplane is orthogonal to the weight vector. To fully appreciate this, let us consider the so-called 'best' decision boundary we saw earlier ($y=-2x+12.5$ or equivalently $x_2 = -2 x_1 + 12.5$, where the vertical axis is represented by $x_2$ instead of $y$ for the sake of generality), and think about its representation using weights and a bias term. Say we are given the weights, $\mathbf{w}=[0.8, 0.4]$ and the bias term $b=-5$. We begin by first just plotting the weight vector in a 2-dimensional space.
The weight vector is shown in blue. Now, we draw an orthogonal to this vector that passes through the origin. You will notice that this resembles the line $x_2 = -2 x_1$. That is because if we consider the equation of a line defined by our weight vector without considering the bias term, we get $0.8 x_1 + 0.4 x_2 = 0$, which simplifies to $x_2 = -2 x_1$.
Finally, we add the bias term to shift the hyperplane along the weight vector. Plugging this into the equation for a line, we get $0.8 x_1 + 0.4 x_2 - 5 = 0 \implies x_2 = -2 x_1 + \frac{5}{0.4} = -2 x_1 + 12.5$. The final hyperplane is shown below:
Three important things to note:

a) the orthogonal distance of the decision boundary from the origin is $\frac{|b|}{||\mathbf{w}||}$,

b) the same decision boundary could be obtained using the weights $\mathbf{w}=[4, 2]$ and the bias term $b=-20$, indicating that the weights and bias can be scaled by a constant factor and still represent the same hyperplane, and

c) the same line can be obtained using the weights $\mathbf{w}=[-0.8, -0.4]$ and bias term $b=5$; however, while this represents the same line, the interpretation as a decision boundary for classification will be slightly different, as we will see next.
Consider our 'best' classifier again. To make a prediction for a test data point, we simply need to see which side of the line the point is on. If the point is in the direction that the weight vector points in (in this case, the right half of the plane), we say that the point is classified as a Class $+1$ sample, and otherwise, we say that the point is classified as a Class $-1$ sample. For example, consider the point $(6,5)$. This point lies on the right side of the decision boundary, and is therefore classified as Class $+1$. Similarly, the point $(3, 2)$ lies on the left side of the decision boundary, and is classified as Class $-1$.
How do we know if a point lies on one side of the decision boundary or the other without actually plotting everything? The answer lies in the sign of the quantity $\mathbf{w}\cdot \mathbf{x} + b$. If this quantity is positive, the point lies on one side of the decision boundary, and if it is negative, the point lies on the other side. Let us try this for the two test points in the above figure. For the point $(6,5)$, we have $$\mathbf{w}\cdot \mathbf{x} + b = [0.8, 0.4]\cdot [6, 5]^\mathrm{T} - 5$$ $$= 0.8*6 + 0.4*5 - 5 $$$$= 4.8 + 2 - 5 = 1.8 > 0$$ Similarly, for the point $(3,2)$, we have $$\mathbf{w}\cdot \mathbf{x} + b = [0.8, 0.4]\cdot [3, 2]^\mathrm{T} - 5 $$$$= 0.8*3 + 0.4*2 - 5 $$$$= 2.4 + 0.8 - 5 = -1.8 < 0$$Therefore, the point $(6,5)$ is classified as Class $+1$, and the point $(3,2)$ is classified as Class $-1$.
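As a quick sanity check, the sign computation above is a few lines with NumPy (the weights and bias are the ones from the running example):

```python
import numpy as np

w = np.array([0.8, 0.4])   # weight vector from the running example
b = -5.0                   # bias term

def classify(x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(np.dot(w, [6, 5]) + b, classify([6, 5]))  #  1.8 -> +1
print(np.dot(w, [3, 2]) + b, classify([3, 2]))  # -1.8 -> -1
```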
Now think about what happens when the signs of $\mathbf{w}$ and $b$ are flipped: even though we get the same line as the decision boundary, the left half of the plane would now be classified as Class $+1$ and the right half as Class $-1$. This is why a decision boundary is uniquely specified by its weights and bias, and not by the line alone.
Armed with the knowledge of linear classifiers and the representation of hyperplanes, we are now ready to understand the Support Vector Machine (SVM) algorithm. The fundamental idea is to find a linear separator that not only classifies our data correctly, but also maximizes the margin between the two classes. Let us re-visit the figure that showed the 'optimal' hyperplane, with a bit of additional information:
Notice the two points on either side of the decision boundary that are closest to it. These points are called the support vectors (hence the name!). The margin is defined as the distance between the hyperplane and the closest point of either class, and the goal is to find the hyperplane that maximizes this distance. Equivalently, we can think of this as maximizing the distance between the two dashed lines, subject to accuracy as a constraint, such that the decision boundary lies between the two dashed lines. Now, in SVMs, we make a key assumption that any point, in order to be classified correctly, must lie not only on the correct side of the decision boundary, but also outside the margin. To simplify our calculations, we further adopt the convention that the support vectors in particular satisfy $|\mathbf{w}\cdot \mathbf{x} + b| = 1$. This is a common convention in SVMs, and the resulting hyperplane is called the canonical hyperplane.
The process of training an SVM, therefore, is to find the hyperplane that maximizes the margin while ensuring that all the data points are classified correctly. In other words, in order to leverage techniques such as gradient descent, we need to first construct a loss function that accounts for both objectives: accuracy, and maximizing the margin. Let us start by considering accuracy. We know that a point $\mathbf{x}_i$ is classified correctly if $\mathbf{w}\cdot \mathbf{x}_i + b > 0$ for Class $y_i = +1$ points, and $\mathbf{w}\cdot \mathbf{x}_i + b < 0$ for Class $y_i = -1$ points. We can combine these conditions into a single inequality as follows: $y_i (\mathbf{w}\cdot \mathbf{x}_i + b) > 0$. Further, by our canonical hyperplane convention for the support vectors, all correctly classified points must satisfy at least as strong a condition, so we can re-write the inequality as $y_i(\mathbf{w}\cdot \mathbf{x}_i + b) \ge 1$. This inequality is satisfied if the point is classified correctly, and is violated if the point is either misclassified or lies within the margin. To define an appropriate loss function, it is useful to consider not only the number of misclassifications, but also by how much each point is misclassified, and penalize these errors accordingly. A natural thing to do is to consider how far the value of $y_i(\mathbf{w}\cdot \mathbf{x}_i +b)$ is from $1$. If the value is greater than $1$, the point is both classified correctly and lies outside the margin, and should yield zero loss. On the other hand, if the value is less than $1$, the point is either misclassified or within the margin, and should yield a non-zero loss that scales with the distance from the margin (the further the point is in the wrong direction, the greater the loss). This is captured by the hinge loss function, defined as $\max(0, 1 - y_i(\mathbf{w}\cdot \mathbf{x}_i + b))$. The hinge loss is a convex function; it is zero when the point is correctly classified and outside the margin, and increases linearly with the distance from the margin otherwise. The following figure shows the hinge loss function:
Now that we have a convex function that represents classification accuracy, we turn our attention to maximizing the margin. The margin is defined as the distance between the hyperplane and the support vectors. To derive the equation, consider the two support vectors in the figure below. The distance we wish to maximize is the sum of the orthogonal distances of the two points from the decision boundary. In other words, if we were to imagine a vector connecting the two support vectors ($\mathbf{x}_A$ and $\mathbf{x}_B$), the projection of this vector ($\mathbf{x}_B - \mathbf{x}_A$) onto the weight vector $\mathbf{w}$ would be the distance that we are interested in maximizing. This projection is given by $(\mathbf{x}_B-\mathbf{x}_A)\cdot\frac{\mathbf{w}}{||\mathbf{w}||}$, and is shown in the figure below:
Now, recall our canonical hyperplane convention for the support vectors. Taking $\mathbf{x}_B$ to lie on the positive side, we have $\mathbf{w}\cdot \mathbf{x}_B + b = 1$ and $\mathbf{w}\cdot \mathbf{x}_A + b = -1$. Subtracting the two equations gives us $\mathbf{w}\cdot (\mathbf{x}_B - \mathbf{x}_A) = 2$. Therefore, the margin is given by $\frac{(\mathbf{x}_B - \mathbf{x}_A)\cdot\mathbf{w}}{||\mathbf{w}||} = \frac{2}{||\mathbf{w}||}$. To maximize the margin, we need to minimize $||\mathbf{w}||$, subject to the constraint that all points are classified correctly. For mathematical convenience in the computation of gradients, we equivalently minimize $\frac{1}{2}||\mathbf{w}||^2$. This is called a regularization term, and minimizing this quantity allows us to avoid overfitting; in this case, by learning a 'more general' classifier in the sense of achieving maximum separation between classes. The final loss function that we wish to minimize is therefore the sum of the regularization term (to maximize the margin) and the hinge loss averaged over the training dataset (for accuracy): $\frac{1}{2}||\mathbf{w}||^2 + \frac{C}{N}\sum_{i=1}^{N} \max(0, 1 - y_i(\mathbf{w}\cdot \mathbf{x}_i + b))$. The parameter $C$ is a hyperparameter that controls the tradeoff between maximizing the margin and minimizing the hinge loss: a large value of $C$ prioritizes minimizing the hinge loss, while a small value prioritizes maximizing the margin. Instead of averaging over the training data, we may equivalently sum the hinge loss over the training data and pick an appropriate value for $C$; the choice of whether to average or sum is a design choice.
Finally, to learn the weights and bias term that minimize this loss function, we can use gradient descent. The gradient of the hinge loss term with respect to the weights is $-y_i \mathbf{x}_i$ when the point violates the margin constraint (i.e., when $y_i(\mathbf{w}\cdot \mathbf{x}_i + b) < 1$), and zero otherwise. The gradient of the regularization term is simply $\mathbf{w}$. Therefore, the gradient of the loss function with respect to the weights is given by $\nabla_\mathbf{w}\mathcal{L}=\mathbf{w} - \frac{C}{N}\sum_{i=1}^{N} y_i \mathbf{x}_i \mathbb{1}(y_i(\mathbf{w}\cdot \mathbf{x}_i + b) < 1)$, and the gradient with respect to the bias term is given by $\nabla_b\mathcal{L}=-\frac{C}{N}\sum_{i=1}^{N} y_i \mathbb{1}(y_i(\mathbf{w}\cdot \mathbf{x}_i + b) < 1)$. Here, $\mathbb{1}(\cdot)$ is the indicator function that returns 1 if the condition inside the brackets is true, and 0 otherwise.
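Putting the loss and its gradients together, here is a minimal subgradient-descent training loop for a linear SVM, written with NumPy on a small made-up dataset. The hyperparameter values are arbitrary choices for illustration, not tuned recommendations:

```python
import numpy as np

def train_svm(X, y, C=1.0, lr=0.05, epochs=2000):
    """Minimize 0.5*||w||^2 + (C/N) * sum(max(0, 1 - y_i(w.x_i + b)))
    via (sub)gradient descent, using the gradients derived above."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                          # indicator 1(y_i(w.x_i + b) < 1)
        grad_w = w - (C / N) * (y[active] @ X[active])
        grad_b = -(C / N) * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# A small, linearly separable toy dataset with labels in {-1, +1}
X = np.array([[6.0, 5.0], [7.0, 4.0], [3.0, 2.0], [2.0, 4.0]])
y = np.array([1, 1, -1, -1])
w, b = train_svm(X, y)
print(w, b, np.sign(X @ w + b))  # the predicted signs should match y
```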
So far, we have discussed linear classifiers and SVMs, which are used for binary classification tasks. However, in many real-world scenarios, we wish to predict probabilities of a data point belonging to a particular class, rather than simply classifying it as one of two classes. Despite its name, logistic regression is a classification algorithm, and is used to predict the probability of a data point belonging to a particular class. The key idea in logistic regression is to model the probability of a data point belonging to a particular class (the positive class) as a function of the features of the data point. Before you read further, think about what we already know about linear classifiers, and how we can extend this idea to predict probabilities.
Recall that in linear classifiers, the decision boundary is a hyperplane that separates the two classes. The sign of the quantity $\mathbf{w}\cdot \mathbf{x} + b$ determines the class of the data, and the magnitude of this quantity determines the distance of the data point from the decision boundary. In logistic regression, we simply take this quantity and pass it through a function that maps it to the range $[0, 1]$ so that we can interpret it as a probability. To be more precise, the output gives us the probability of the input belonging to the positive or $1$ class, since large positive inputs will be mapped to $1$, whereas large negative values (distances from the hyperplane in a direction opposite to the weight vector) will be mapped to $0$. This function is called the sigmoid function, and is defined as $f(z) = \frac{1}{1 + e^{-z}}$. The sigmoid function is shown in the figure below:
The sigmoid function lends itself naturally to interpreting the output as a probability. Note, for instance, what happens when, for a linear classifier defined by weights $\mathbf{w}=[1, 1]$ and bias term $b=0$, the distance of a point from the hyperplane happens to be 0: the sigmoid function evaluates to $\frac{1}{1 + e^{-0}} = 0.5$, indicating that the point is equally likely to belong to either class. If the distance is large and positive, the sigmoid function evaluates to a value close to $1$, indicating that the point likely belongs to the positive class. Similarly, if the distance is large and negative, the sigmoid function evaluates to a value close to $0$, indicating that the point likely belongs to the negative class. Due to this mapping, in the context of logistic regression, we represent the negative class as $0$ and the positive class as $1$, instead of $-1$ and $1$ as we did in the context of linear classifiers. This is also mathematically convenient, as we will see shortly.
The probability of a data point belonging to the positive class is therefore given by $$P(y=1|\mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w}\cdot \mathbf{x} + b)}}$$ The probability of the data point belonging to the negative class is simply the complement of this probability, and is given by $P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x})$. Now, given what we know about probabilistic modeling from previous sections, try to think about the following question: what is the total probability of our training data given the weights and bias term? Another way to phrase this question would be as follows: given weights $\mathbf{w}$ and a bias term $b$, what is the total probability that the data we have seen is generated by the model? Yet another way to think about this is: what is the total probability that the given weights and biases label the training data correctly?
The answer to any of the above questions derives from the sort of reasoning we employed in previous sections. Start by considering a single point, $\mathbf{x}_i = (3,4)$, with the corresponding true label $y_i = 0$ (negative class), and the same weights and bias as before. What is the probability that those weights and bias label this point as $0$? Evaluating the point using the weights and bias, we get $\mathbf{w}\cdot \mathbf{x}_i + b = [0.8, 0.4]\cdot[3,4]^\mathrm{T} - 5 = 0.8*3 + 0.4*4 - 5 = 2.4 + 1.6 - 5 = -1$. Plugging this into the sigmoid function, we get $P(1|\mathbf{x}_i) = \sigma(\mathbf{w}\cdot\mathbf{x}_i + b) = \sigma(-1) \approx 0.27$. Since this is the probability of the point belonging to the positive class, the probability of the point belonging to the negative class is $1 - 0.27 = 0.73$. Therefore, the probability of the point being labeled correctly is $0.73$.
Now, if we have a training set of two points, $\mathbf{x}_1 = (3,4)$ and $\mathbf{x}_2 = (6,5)$, with corresponding labels $y_1 = 0$ and $y_2 = 1$, and the same weights and bias as before, the total probability of the training data being labeled correctly is $P(y_1=0|\mathbf{x}_1)P(y_2=1|\mathbf{x}_2) = 0.73*0.86 \approx 0.63$, where the probabilities are multiplied because both events (both correct classifications) must happen. Hopefully, the general form of the total probability is now starting to reveal itself to you. The total probability of the training data being labeled correctly is given by the product of the individual probabilities of each point being labeled correctly! Since $y_i$ is now either $0$ or $1$ in logistic regression, the trick of expressing correctness as $y_i(\mathbf{w}\cdot \mathbf{x}_i + b) > 0$ no longer works. Instead, we wish the following to be true: when $y_i=1$, the probability of the point being labeled as $1$ should be high, and when $y_i=0$, the probability of the point being labeled as $0$ should be high. This is captured by the following equation: $$P(y_i|\mathbf{x}_i) = \begin{cases} \sigma(\mathbf{w}\cdot \mathbf{x}_i + b) & \text{if } y_i = 1 \\ 1 - \sigma(\mathbf{w}\cdot \mathbf{x}_i + b) & \text{if } y_i = 0 \end{cases}$$ We can combine these two cases into a single equation as follows: $$P(y_i|\mathbf{x}_i) = P(1|\mathbf{x}_i)^{y_i} P(0|\mathbf{x}_i)^{1-y_i}$$ such that if $y_i=0$, the first term, $P(1|\mathbf{x}_i)^{y_i}$, evaluates to $1$, and the second term gives us the probability of a correct prediction. Similarly, if $y_i=1$, the second term evaluates to $1$. Therefore, the total probability of the training data being labeled correctly is given by $$P(y|\mathbf{x}) = \prod_{i=1}^{N} P(y_i|\mathbf{x}_i) = \prod_{i=1}^{N} \Big[P(1|\mathbf{x}_i)^{y_i} P(0|\mathbf{x}_i)^{1-y_i}\Big]$$
Now, note that this total probability is a function of the weights and bias term. This implies that we can find a set of weights and a bias term such that this probability (of correctly labeling or generating our training data) is maximized. Therefore, the optimization problem we wish to solve is to maximize the total probability of the training data given the weights and bias term. This is the same as maximizing the log of the total probability, since the logarithm is a monotonic function. Logarithmic values are also more numerically stable, since multiplying several probabilities between $0$ and $1$ could lead to numerical underflow. Therefore, we equivalently maximize the log-likelihood of the training data, which is given by $$\log P(y|\mathbf{x}) = \sum_{i=1}^{N} y_i \log P(1|\mathbf{x}_i) + (1-y_i) \log P(0|\mathbf{x}_i)$$ Substituting in the expressions for $P(1|\mathbf{x}_i)$ and $P(0|\mathbf{x}_i)$, we get $$\log P(y|\mathbf{x}) = \sum_{i=1}^{N} y_i \log \sigma(\mathbf{w}\cdot \mathbf{x}_i + b) + (1-y_i) \log (1 - \sigma(\mathbf{w}\cdot \mathbf{x}_i + b))$$ This is the objective function that we wish to maximize. Since in machine learning we tend to think of optimization as loss minimization, we can equivalently write our objective as minimizing the negative log-likelihood, given by $$\mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^{N} y_i \log \sigma(\mathbf{w}\cdot \mathbf{x}_i + b) + (1-y_i) \log (1 - \sigma(\mathbf{w}\cdot \mathbf{x}_i + b))$$ This is referred to as the Binary Cross Entropy (BCE) loss. Now, observe what happens when a point is classified either correctly or incorrectly. If the true label is $y_i=1$, the first term inside the sum evaluates to $\log P(1|\mathbf{x}_i)$ and the second term evaluates to $0$. If the predicted probability $P(1|\mathbf{x}_i)\approx 1$, then $\log P(1|\mathbf{x}_i) \approx 0$, and if the predicted probability $P(1|\mathbf{x}_i)\approx 0$, then $\log P(1|\mathbf{x}_i) \to -\infty$. Therefore, the BCE loss is minimized when the predicted probability is close to the true label, and grows without bound as the predicted probability moves away from the true label. The same reasoning applies when the true label is $y_i=0$.
Finally, the BCE loss is a convex function, and can be minimized using gradient descent. The gradient of the BCE loss with respect to the weights and bias term is given by $$\nabla_\mathbf{w}\mathcal{L} = \sum_{i=1}^{N} \Big(\sigma(\mathbf{w}\cdot \mathbf{x}_i + b) - y_i\Big)\mathbf{x}_i$$ $$\nabla_b\mathcal{L} = \sum_{i=1}^{N} \Big(\sigma(\mathbf{w}\cdot \mathbf{x}_i + b) - y_i\Big)$$
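The pieces above translate directly into a short gradient-descent loop for logistic regression. The following NumPy sketch uses a small made-up dataset and arbitrary hyperparameters for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=5000):
    """Minimize the BCE loss by gradient descent, using the gradients above."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)      # predicted P(y = 1 | x) for every point
        error = p - y               # the common factor (sigma(w.x_i + b) - y_i)
        w -= lr * (X.T @ error)     # gradient step for the weights
        b -= lr * error.sum()       # gradient step for the bias
    return w, b

# Labels are 0/1 in logistic regression
X = np.array([[3.0, 4.0], [2.0, 3.0], [6.0, 5.0], [7.0, 6.0]])
y = np.array([0, 0, 1, 1])
w, b = train_logreg(X, y)
print(sigmoid(X @ w + b).round(2))  # probabilities approaching [0, 0, 1, 1]
```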
Special Topic: Derivative of the Sigmoid function $\sigma(x)$
The sigmoid function is a key component of logistic regression (and many neural networks), and it is important to understand its derivative. The derivative of the sigmoid function is given by $$\frac{d}{dx}\sigma(x) = \frac{d}{dx}\frac{1}{1 + e^{-x}}$$ Applying the chain rule, we get $$\frac{d}{dx}\sigma(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$$ Employing a bit of high-school algebraic trickery, we can rewrite this as $$\frac{d}{dx}\sigma(x) = \frac{1+e^{-x}-1}{(1 + e^{-x})^2}$$ $$ = \frac{(1+e^{-x})}{(1+e^{-x})^2} - \frac{1}{(1+e^{-x})^2}$$ $$= \frac{1}{1 + e^{-x}} - \Bigg(\frac{1}{(1 + e^{-x})}\Bigg)^2$$ $$= \sigma(x) - \sigma(x)^2$$ $$= \sigma(x)(1 - \sigma(x))$$
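This identity is easy to verify numerically; the following snippet compares the closed form against a centered finite-difference approximation of the derivative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # finite differences
closed_form = sigmoid(x) * (1 - sigmoid(x))             # the derived identity
print(np.allclose(numeric, closed_form, atol=1e-8))     # True
```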
Neural networks are a class of machine learning models that are inspired by the structure and function of the human brain. Much like neurons in the brain, neural networks are composed of interconnected nodes called perceptrons (also called neurons), that are organized in layers. Before we delve into the specifics of neural networks, let us first understand the structure of a single perceptron.
A perceptron is a simple mathematical model, which takes a set of inputs, performs a weighted sum of these inputs, adds a bias term, and passes the result through an activation function to produce an output. The weighted sum of the inputs and the bias term is given by $$z = \mathbf{w}\cdot \mathbf{x} + b$$ where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the input vector, and $b$ is the bias term. The output of the perceptron is given by $$y = f(z)$$ where $f(\cdot)$ is an activation function. An activation function is a non-linear function that introduces non-linearity into the model, and is what allows neural networks to learn very complex patterns in the data.
The most commonly used activation functions are the now familiar sigmoid ($\sigma$) function, the hyperbolic tangent ($\tanh$) function, and the rectified linear unit (ReLU) function, shown below:
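All three activations are one-liners in NumPy; here is a small sketch showing how each squashes the pre-activation value $z$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps z to (0, 1)

def tanh(z):
    return np.tanh(z)                 # maps z to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero for negative z, identity otherwise

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # [0.119 0.5   0.881]
print(tanh(z))     # [-0.964  0.     0.964]
print(relu(z))     # [0. 0. 2.]
```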
Now, consider a perceptron that uses a sigmoid activation, in particular. Such a perceptron is shown below. Upon inspection, you may realize that this is equivalent to the function $\sigma(\mathbf{w}\cdot \mathbf{x}+b) = P(y=1|\mathbf{x})$ in logistic regression!
In other words, we may think of each perceptron in a neural network as a smaller model that serves either to make a binary decision based on its inputs (for instance, by producing outputs converging to $0$ or $1$ for the sigmoid activation, or to $-1$ or $1$ for the $\tanh$ activation), or to extract some useful information from the data (such as with ReLU, which retains $\mathbf{w}\cdot\mathbf{x} + b$ when this quantity is greater than $0$). Each neuron, therefore, is a building block that can be used to model complex patterns in the data by serving as an intermediate feature extractor.
To fully appreciate why this is important, we can look at a simple example. Imagine the four points in the following figure, which are not linearly separable (i.e., a single linear classifier would not be able to separate these points into two classes). However, by using a neural network with a single hidden layer, we can learn a non-linear decision boundary by combining two linear boundaries, such that it separates the two classes.
The figure above shows the XOR problem, a classic example of a problem that cannot be solved by a single linear classifier. Before we build such a network, however, think about how you might solve this problem using two linear classifiers instead of one.
To solve the XOR problem by combining two linear classifiers, we can use the fact that the XOR gate is a combination of the AND and the OR gates, combined with a NOT gate, such that $A\ \mathrm{XOR}\ B = (A\ \mathrm{OR}\ B)\ \mathrm{AND}\ \mathrm{NOT}\ (A\ \mathrm{AND}\ B)$. Here, both the OR and AND gates are linear classifiers, and can be represented by a line in 2D space in our example. The NOT gate is a linear classifier as well, but let's put that thought aside for a moment. The figure below shows one possible decision boundary each for the AND and OR gates respectively.
The OR decision boundary labels everything to its right as $1$, and everything to its left as $0$; the AND decision boundary does the same. The XOR decision boundary, therefore, needs to classify the points that lie to the right of the OR line as $1$, and the points that lie to the right of the AND line as $0$. We know from a previous section that flipping the signs of the weights and bias for the AND decision boundary will label the points to its left as $1$ instead. What remains is to combine these two decision boundaries into a single final prediction.
Let us now finally construct a neural network with a single hidden layer, and think about how it may achieve such a combination. Imagine a neural network, with two input nodes, two nodes in the first hidden layer, and a single output node. Further, we can set the network up in such a way that the top node in the hidden layer represents the OR decision boundary from the above figure, and the bottom node represents the AND decision boundary (whether the direction of prediction is flipped about the line does not actually matter in this setting). How may we do this? Well, we know that the first node should represent a logistic regression that has the same behavior as the OR gate. In other words, its output should be close to $1$ if $A\ \mathrm{OR}\ B$ is $1$, and $0$ otherwise. This, in turn, is nothing but a probability that is based on distance from a line in 2-D space in our example (which is the same OR boundary from our figure). Finally, we know that we can represent a line in two dimensions using a weight vector and a bias term. Therefore, the weights and bias term for the top node in the hidden layer should be such that it represents the OR decision boundary. From the figure above, we can deduce that $\mathbf{w}_1=[1,1], b_1 = -0.5$ produces such a line. Similarly, the weights and bias term for the bottom node in the hidden layer should represent the AND decision boundary, and from our figure we can say that $\mathbf{w}_2 = [1,1], b_2=-1.5$ produces such a line. The final output node should then take the outputs from the two nodes in the hidden layer, and combine them to produce the XOR decision boundary. This setup is shown below:
Now, let us work through the network and see what is actually happening internally. The top node in the hidden layer takes the input $(A,B)$, and produces the value $\sigma(w_{11}x_1 + w_{12}x_2 + b_1) = \sigma(\mathbf{w}_1\cdot \mathbf{x} + b_1)$. Since we have deliberately set $\mathbf{w}_1 = [1,1], b_1=-0.5$, we know that the output roughly corresponds to the behavior of the OR gate. For instance, for the point $(0,0)$, we get $z_1 = 1*0 + 1*0 - 0.5 = -0.5$, and $\sigma(-0.5) \approx 0.38$, which is closer to $0$ than it is to $1$. Similarly, for the point $(1,0)$, we get $z_1 = 1*1 + 1*0 - 0.5 = 0.5$, and $\sigma(0.5) \approx 0.62$, which is closer to $1$. The bottom node in the hidden layer takes the same input $(A,B)$, and produces the value $\sigma(w_{21}x_1 + w_{22}x_2 + b_2) = \sigma(\mathbf{w}_2\cdot \mathbf{x} + b_2)$. Since we have deliberately set $\mathbf{w}_2 = [1,1], b_2=-1.5$, we know that the output roughly corresponds to the behavior of the AND gate. For instance, for the point $(0,0)$, we get $z_2 = 1*0 + 1*0 - 1.5 = -1.5$, and $\sigma(-1.5) \approx 0.18$, which is closer to $0$ than it is to $1$. Similarly, for the point $(1,1)$, we get $z_2 = 1*1 + 1*1 - 1.5 = 0.5$, and $\sigma(0.5) \approx 0.62$, which is closer to $1$, as expected. Now, before we move on to the final output node, let us see what happens when we simply plot the outputs of the two nodes in the hidden layer for all four points in our original dataset. For instance, for $(0,0)$, the output values are $\mathbf{z} = [0.38, 0.18]$, which forms one point in our new 2-D space. The following figure shows the outputs of the two nodes in the hidden layer for all four points in the dataset, with the original location of each point denoted in the legend:
Perhaps the role of the final output node is now becoming clearer. The final output node, in this example, learns its own decision boundary in this latent space, such that it classifies the points that lie to the right of the OR line and to the left of the AND line as $1$. This is the essence of how a neural network can learn complex patterns in the data by combining simpler decision boundaries learned by individual nodes in the hidden layer! In this new space (above figure), we can draw a linear decision boundary that separates the blue points ($(0,1)$ and $(1,0)$) from the orange points ($(0,0)$ and $(1,1)$), and this line acts as the XOR decision boundary. Specifically, we want the blue points to be classified as $1$ and the orange points to be classified as $0$. The weights and bias for the final output node can be set to $\mathbf{v} = [1,-1], c = -0.22$ to produce such a line, shown below.
The final output node, therefore, takes the outputs of the two nodes in the hidden layer, and produces the value $\hat{y} = \sigma(\mathbf{v}\cdot \mathbf{z} + c)$. For instance, for the point $(0,0)$, we had $\mathbf{z} = [0.38, 0.18]$, and thus $\hat{y} = \sigma(1*0.38 - 1*0.18 - 0.22) = \sigma(-0.02) \approx 0.495$. Since this value is closer to $0$ than to $1$, the final prediction for this point is technically correct, although barely so. This is simply down to the fact that we picked the decision boundaries, and therefore the weights and biases, based on intuition, rather than analytically finding the best possible values.
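We can verify the entire forward pass with a few lines of NumPy, using the hand-picked weights and biases from the discussion above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, 1.0],    # hidden node 1: the OR boundary
              [1.0, 1.0]])   # hidden node 2: the AND boundary
b = np.array([-0.5, -1.5])
v = np.array([1.0, -1.0])    # output node weights
c = -0.22                    # output node bias

def xor_net(x):
    z = sigmoid(W @ x + b)       # hidden layer outputs
    return sigmoid(v @ z + c)    # final probability of class 1

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p = xor_net(np.array(x, dtype=float))
    print(x, round(p, 3), int(p > 0.5))  # predicted classes: 0, 1, 1, 0
```

As expected, all four points are classified correctly, though each prediction hovers just on the correct side of $0.5$.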
The XOR problem is a classic example of a problem that cannot be solved by a single linear classifier, and requires a non-linear decision boundary. Neural networks, by combining multiple perceptrons in hidden layers, are able to learn such non-linear decision boundaries. The XOR problem is a simple example, but neural networks are capable of learning far more complex patterns, especially when the network is deeper and wider. A deeper neural network is one that has more hidden layers, and a wider neural network is one that has more nodes in each hidden layer. Since the neural network acts as a composition of simple linear classifiers with nonlinear activations, a deeper network can learn more complex patterns in the data by combining the simpler patterns learned by individual layers. One way to see this is to recall how we can approximate any function using a combination of simple functions, all the way down to approximating curves with straight lines; in the same way, a neural network approximates complex patterns in the data by composing the simpler patterns learned by individual layers.
Now that we have discussed the theory behind linear classifiers, logistic regression, and neural networks, let us turn our attention to how these models are trained in practice. The process of training a model involves finding the weights and biases that minimize a loss function that captures how badly our model is performing based on our current estimates of its parameters. This is typically done using gradient descent. Think about the XOR problem once again. What is the loss function that we wish to minimize in this case?
Since our network has a single output that lies in the range $[0,1]$, we can interpret the model's output as the probability of the positive class. The loss function that we wish to minimize is thus, once again, the Binary Cross Entropy (BCE) loss, which we have already seen in logistic regression. The only difference here is that the probability of the positive class is given by the output of a neural network, rather than a logistic regression model. For example, for the XOR neural network, the loss is given by $$\mathcal{L}(\mathbf{w}, \mathbf{v}, b, c) = -\sum_{i=1}^{N} y_i \log \sigma(\mathbf{v}\cdot \sigma(\mathbf{w}\cdot \mathbf{x}_i + b) + c) + (1-y_i) \log (1 - \sigma(\mathbf{v}\cdot \sigma(\mathbf{w}\cdot \mathbf{x}_i + b) + c))$$ In general, the BCE loss for a neural network is given by $$\mathcal{L}(\Theta) = -\sum_{i=1}^{N} y_i \log \sigma(f(\mathbf{x}_i; \Theta)) + (1-y_i) \log (1 - \sigma(f(\mathbf{x}_i; \Theta)))$$ where $\Theta$ is the set of all weights and biases in the neural network, and $f(\cdot)$ is the function that computes the output of the neural network. Note that in the general form, we specify the sigmoid function outside of $f(\cdot)$, since the output of the neural network could be anything (such as when using a ReLU activation at the final layer), and we want to map it to the range $[0,1]$ before computing the loss.
The gradient of the loss function with respect to the weights and biases can be computed by applying the chain rule. Modern optimizers (such as the ones in PyTorch) do this automatically using a technique called automatic differentiation: internally, they build a computation graph of the operations performed on the input data, and then compute the gradients by traversing this graph backwards. This process is what we discussed in the section on gradient descent in local search. The gradients are then used to update the weights and biases in the direction that minimizes the loss function. This process is repeated for a number of iterations, or until the loss converges. The learning rate is a hyperparameter that controls the size of the steps taken in the direction of the gradients, and is typically set to a small value. Training a model also involves splitting the data into a training set and a validation set: the model is trained on the training set, and its performance is evaluated on the validation set. The loss on the validation set is used to tune hyperparameters such as the learning rate, the number of hidden layers, the number of nodes in each hidden layer, and so on. This process is called hyperparameter tuning, and is an important part of training a model. The model is finally evaluated on an unseen test set to get an unbiased estimate of its performance.
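As an illustration of what this looks like in practice, here is a minimal PyTorch training loop for the XOR network. The architecture, optimizer, learning rate, and epoch count are illustrative choices; with only two hidden units, an unlucky random initialization can occasionally get stuck in a local minimum, in which case re-running suffices:

```python
import torch
import torch.nn as nn

# The XOR dataset: inputs and 0/1 labels
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# A 2-2-1 network with sigmoid activations, as in the example above
model = nn.Sequential(nn.Linear(2, 2), nn.Sigmoid(),
                      nn.Linear(2, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for epoch in range(2000):
    optimizer.zero_grad()         # reset accumulated gradients
    loss = loss_fn(model(X), y)   # forward pass + BCE loss
    loss.backward()               # backpropagation via automatic differentiation
    optimizer.step()              # parameter update

print(model(X).detach().round().flatten())  # expected: 0, 1, 1, 0
```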
Multiclass Classification: So far, we have discussed binary classification tasks, where the output is either $0$ or $1$. However, in many real-world scenarios, we wish to classify data into more than two classes. This is called multiclass classification. One way to extend binary classifiers to multiclass classifiers is to use the one-vs-all strategy. In this strategy, we train a binary classifier for each class, where the class is treated as the positive class, and all other classes are treated as the negative class. The final prediction is then made by selecting the class that has the highest probability according to the binary classifiers. This is a simple and effective strategy, and is used in practice for many multiclass classification tasks. In the context of neural networks, the output layer is simply modified to have as many nodes as there are classes. Each output node may be thought of as an independent binary classifier, classifying each input as either belonging to the $j^{th}$ class, or not. To be precise, this interpretation strictly works only when the activation function at the output layer is the sigmoid function, but in general, any other activations may be thought of as a weighted composition of features used to make the final decision instead.
However, in this form, the outputs of the neural network do not necessarily sum to $1$. To convert the outputs to a valid probability distribution, the output of the network is passed through a softmax function. The softmax function is a generalization of the sigmoid function to multiple classes, and is defined as $$\mathrm{SM}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$ where $z$ is the output of the neural network, and $K$ is the number of classes. The output of the softmax function is a probability distribution over the classes, and the class with the highest probability is selected as the final prediction. The loss function used in this case is called the Categorical Cross Entropy loss, and is given by $$\mathcal{L}(\Theta) = -\sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij} \log \mathrm{SM}(f(\mathbf{x}_i; \Theta))_j$$ where $y_{ij}$ is the true label of the $i^{th}$ data point for the $j^{th}$ class, and $\mathrm{SM}(f(\mathbf{x}_i; \Theta))_j$ is the predicted probability of the $i^{th}$ data point belonging to the $j^{th}$ class. To further simplify notation, we can use the fact that both the true label and the predicted probability are vectors, and rewrite the loss function as $$\mathcal{L}(\Theta) = -\sum_{i=1}^{N} \mathbf{y}_i \cdot \log \mathrm{SM}(f(\mathbf{x}_i; \Theta))$$ where $\mathbf{y}_i$ is a one-hot encoded representation of the true label of the $i^{th}$ data point, and $\mathrm{SM}(f(\mathbf{x}_i; \Theta))$ is a vector of the same size as $\mathbf{y}_i$, which contains the predicted probabilities of the $i^{th}$ data point belonging to each class.
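A short NumPy sketch makes the softmax and the categorical cross entropy concrete (subtracting the maximum inside the softmax is a standard trick for numerical stability):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def categorical_cross_entropy(y_onehot, probs):
    """CCE for a single data point: -y . log(SM(z))."""
    return -np.sum(y_onehot * np.log(probs))

z = np.array([2.0, 1.0, 0.1])    # raw network outputs for K = 3 classes
probs = softmax(z)
print(probs, probs.sum())        # [0.659 0.242 0.099], sums to 1.0

y = np.array([1, 0, 0])          # one-hot true label: class 0
print(categorical_cross_entropy(y, probs))  # ~0.417: low loss, correct class is likeliest
```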