The preceding discussions on search algorithms delved into methods for traversing state-space graphs from an initial state to a goal, emphasizing the construction and the preservation of a path throughout the process. However, many real-life problems do not focus on the specific path taken to reach the goal. Examples of such problems include job-shop scheduling, automatic programming, telecommunications network optimization, vehicle routing, and portfolio management. In these cases, the focus is primarily on the final configuration of the solution, rather than the journey taken to attain it.
With this in mind, we now turn our attention to local search methods. Unlike their counterparts, local searches operate exclusively on the current node, often choosing to evaluate its immediate neighbors without retaining any path in memory. Local search algorithms offer two key advantages in scenarios where the path to the goal is inconsequential: they use very little memory (usually a constant amount), and they can often find reasonable solutions in large or even infinite state spaces where systematic, path-preserving searches are impractical.
Hill-climbing is a type of local search algorithm that iteratively moves in an increasing (ascent) or decreasing (descent) direction, depending on the problem objective. The algorithm terminates when it reaches a "peak" in the case of a maximization problem, i.e., no neighbor has a higher objective function value, or a "valley" in the case of a minimization problem, i.e., no neighbor has a lower value. Hill-climbing can be summarized with the following general structure:
function HILL-CLIMBING(G, s, goal):
    current = s
    while True:
        # Examine every neighbor and keep the one with the best objective value.
        neighbors = get_neighbors(G, current)
        best_neighbor = None
        best_value = calculate_value(current)
        for neighbor in neighbors:
            neighbor_value = calculate_value(neighbor)
            if neighbor_value > best_value:
                best_neighbor = neighbor
                best_value = neighbor_value
        # No neighbor improves on the current state: we have reached a peak.
        if best_neighbor is None:
            return current
        current = best_neighbor
This particular approach is one of several related variants of hill-climbing, and is called Steepest-Ascent Hill Climbing. Here, we iterate over all neighbors to pick the best one at each step. In contrast, another commonly seen implementation is to generate a neighboring state, and simply accept it if the score is better than the current state, rather than iterating over all neighbors to find the best one. This latter approach is known as First-Choice Hill Climbing.
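The sketch below illustrates First-Choice Hill Climbing in Python. It is a minimal, hedged example: the problem is supplied as a pair of functions (get_neighbors, calculate_value) mirroring the helper names in the pseudocode above, and the toy problem at the bottom is an illustrative assumption.

import random

def first_choice_hill_climbing(get_neighbors, calculate_value, s):
    current = s
    while True:
        current_value = calculate_value(current)
        improved = False
        # Examine neighbors in random order and accept the first improvement found.
        neighbors = get_neighbors(current)
        random.shuffle(neighbors)
        for neighbor in neighbors:
            if calculate_value(neighbor) > current_value:
                current, improved = neighbor, True
                break
        if not improved:
            return current

# Toy example: maximize -(x - 7)^2 over the integers, moving in steps of +/- 1.
print(first_choice_hill_climbing(lambda x: [x - 1, x + 1], lambda x: -(x - 7) ** 2, s=0))  # 7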
A useful measure that can help us take hill-climbing steps in the right direction, as well as inform how big of a step we should take, is the rate of change of our objective function with respect to its inputs. For a univariate function, this quantity is known as the derivative of the function with respect to its input. For a multivariate function, it is known as the gradient of the objective function, which is simply the collection of partial derivatives with respect to every variable that the objective is a function of. The gradient is a vector that points in the direction of the greatest rate of increase of the objective function, and it is perpendicular to the contour lines of the objective function. Depending on whether our problem is one of maximization or minimization, we take steps either in the gradient's direction (the direction of steepest ascent), or away from it (the direction of steepest descent).
In general, given a function $f$ of $n$ variables, the gradient of $f$ is defined as follows: $$\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \cdots, \frac{\partial f}{\partial x_n}\right]$$ where $\frac{\partial f}{\partial x_i}$ is the partial derivative of $f$ with respect to the function input $x_i$. For example, let $f(x,y) = x^2 + y^2$. Writing $f = u + v$ with $u = x^2$ and $v = y^2$, the partial derivative of the objective function with respect to $x$ (i.e., $\frac{\partial f}{\partial x}$) can be computed using the chain rule as follows: $$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial x} + \frac{\partial f}{\partial v}\frac{\partial v}{\partial x} = (1)(2x) + (1)(0) = 2x$$ Similarly, $\frac{\partial f}{\partial y} = 2y$. Therefore, $\nabla f = \left[2x, 2y\right]$.
In order to maximize a function $f$ of $n$ variables, we can take a step in the direction of the gradient of $f$ at the current point. This is known as gradient ascent. Similarly, in order to minimize a function $f$ of $n$ variables, we can take a step in the direction opposite to the gradient of $f$ at the current point. This is known as gradient descent. Let $\mathbf{x}$ be a vector of $n$ variables, let $f$ be a function of $\mathbf{x}$, and let $\alpha$ be the learning rate. The gradient descent algorithm can be summarized as follows:
function GRADIENT-DESCENT(f, 𝐱, α, threshold):
    while True:
        gradient = calculate_gradient(f, 𝐱)
        # Stop once the gradient is (close to) zero, i.e., at a stationary point.
        if magnitude(gradient) < threshold:
            return 𝐱
        # Step against the gradient, scaled by the learning rate.
        𝐱 = 𝐱 - α * gradient
Gradient ascent (i.e., maximization) proceeds similarly, except the update equation becomes:
𝐱 = 𝐱 + α * gradient
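As a concrete illustration, here is a minimal, runnable Python sketch of the update loop above, applied to $f(x,y) = x^2 + y^2$, whose gradient $[2x, 2y]$ we derived earlier. The starting point, learning rate, threshold, and iteration cap are illustrative choices.

def gradient_descent(grad_f, x, alpha=0.1, threshold=1e-6, max_iters=10_000):
    for _ in range(max_iters):
        gradient = grad_f(x)
        # Stop once the gradient's magnitude falls below the threshold.
        if sum(g * g for g in gradient) ** 0.5 < threshold:
            break
        # Step against the gradient, scaled by the learning rate.
        x = [xi - alpha * gi for xi, gi in zip(x, gradient)]
    return x

# Gradient of f(x, y) = x^2 + y^2 is [2x, 2y].
grad_f = lambda v: [2 * v[0], 2 * v[1]]
print(gradient_descent(grad_f, [3.0, -4.0]))   # converges towards [0, 0]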
For a refresher on how to compute derivatives, partial derivatives and gradients, please refer to Chapter 5 of Mathematics for Machine Learning.
Here, instead, we focus our attention on one specific method of computing gradients of composite functions using the chain rule. The chain rule is a fundamental concept in calculus that allows us to compute the derivative of a composite function. For example, let $f(x) = g(h(x))$. The chain rule states that the derivative of $f$ with respect to $x$ is given by: $$\frac{df}{dx} = \frac{dg}{dh}\frac{dh}{dx}$$ This extends to multivariate functions. Let $f(x,y) = g(h(x,y))$. The chain rule states that the partial derivative of $f$ with respect to $x$ is given by: $$\frac{\partial f}{\partial x} = \frac{\partial g}{\partial h}\frac{\partial h}{\partial x}$$ and the partial derivative of $f$ with respect to $y$ is given by: $$\frac{\partial f}{\partial y} = \frac{\partial g}{\partial h}\frac{\partial h}{\partial y}$$ Indeed, this is exactly what we did in the earlier example to compute the gradient of $f(x,y) = x^2 + y^2$: we computed the partial derivatives of $f$ with respect to $x$ and $y$, and combined them to get the gradient of $f$.

In class, we discussed how many problems in AI boil down to defining our goals as objective functions that we want to maximize or minimize. In many such cases, we can use gradient descent to find the optimal solution. For example, in the case of linear regression, we can use gradient descent to find the values of the parameters that minimize the Mean Squared Error across all data points. Now, you may notice that gradient descent is an iterative algorithm, i.e., we need to recompute the derivatives/gradients after every update to our set of parameters. This can be computationally expensive, especially when the number of variables is large. However, modern machine learning toolkits such as PyTorch do this with seemingly little effort! How does this work?
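As a quick, hedged illustration of that convenience (assuming PyTorch is installed), the snippet below recovers the gradient of $f(x,y) = x^2 + y^2$ that we derived by hand; the evaluation point is arbitrary.

import torch

# f(x, y) = x^2 + y^2, evaluated at x = 3, y = -4.
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(-4.0, requires_grad=True)

f = x**2 + y**2          # the forward pass builds a computational graph
f.backward()             # the reverse pass applies the chain rule automatically

print(x.grad, y.grad)    # tensor(6.) tensor(-8.) -- i.e., 2x and 2y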
In order to efficiently compute derivatives of complicated objective functions, we rely on an internal representation of the objective function in the form of a computational graph. A computational graph is a directed graph that represents the flow of data through a series of operations. Each node in the graph represents an operation (as well as the partial function value evaluated up to that operation), and each edge represents the flow of data from one operation to another. The graph is constructed by breaking down the objective function into a series of elementary operations, and connecting them in the order in which they are executed. The graph, when traversed in a depth-first manner, allows us to compute the objective function value by applying the operations in the correct order.
Building such a graph allows us to compute the derivative of the objective function with respect to its inputs by applying the chain rule in reverse order through the graph. This process is known as automatic differentiation, and is central to modern machine learning frameworks. Let's take an example, and explore how we can construct a computation graph by splitting the function by operators. Consider the function $f(x,y) = (x+y)^2$. We can split this function into elementary operations as $f(x,y) = (x+y) \times (x+y)$. We then look at the 'outermost' operator - in other words, the operator that is applied at the final stage of the function evaluation. In this case, it is the multiplication operator, which takes two inputs and multiplies them. In our case, both inputs are the same, i.e., $(x+y)$. Now, we split each $(x+y)$ by the addition operator, which also accepts two inputs, namely $x$ and $y$. We can now construct a computation graph that looks like this:
In this graph, each operator node also represents the function below it. For instance, each of the nodes with the $+$ symbol represents the (sub)function $x+y$, and the node with the $\times$ symbol combines these under multiplication to give us our original function $(x+y) \times (x+y)$. The graph is traversed in a forward manner (bottom to top) to compute the function value.
Next, we will extend this graph to compute the derivative of the function with respect to $x$. We can do this by applying the chain rule in a reverse manner through the graph. The chain rule states that the derivative of a composite function is the product of the derivative of the outer function with respect to the inner function, and the derivative of the inner function with respect to the input. Take for instance, our function, $f(x,y) = (x+y)^2$. The derivative of $f$ with respect to $x$ is given by: $$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial (x+y)}\frac{\partial (x+y)}{\partial x}$$ To represent the computation of this derivative on the graph, we label each edge with the partial derivative of the parent node with respect to the input along that edge. For example, $$\frac{\partial f(x,y)}{\partial (x+y)} = (x+y)$$ Similarly, $$\frac{\partial (x+y)}{\partial x} = 1,\quad \textrm{and}\quad \frac{\partial (x+y)}{\partial y} = 1$$ This gives us the following graph:
The key benefit of drawing such a graph is that it allows us to compute the derivative of the objective function with respect to any of its inputs by traversing the graph in a top-down fashion, also referred to as backpropagation. To find, for instance, the derivative of $f(x,y)$ with respect to $x$, we first isolate all the paths that go from the root node down to a leaf with a value of $x$. For each path, we multiply the derivatives along the branches that make up the path, and finally sum these values together. In our graph, we have two paths that lead to $x$, starting from the top: reachable by traversing left$\rightarrow$left, or right$\rightarrow$left from the root. For the first path, the derivatives along the branches are $(x+y)$ and $1$ respectively, so multiplying them together gives us $(x+y)$. For the second path, the derivatives along the branches are also $(x+y)$ and $1$ respectively, so multiplying them together once again gives us $(x+y)$. Summing these two values gives us $2(x+y)$, which is the derivative of $f(x,y)$ with respect to $x$.
With a function as simple as our example, the improvements in computational efficiency may not be immediately obvious, but for a more complex function with many variables, the computational savings can be significant. This is because we can reuse the intermediate results of the forward pass to compute the gradients in the backward pass. Further, the graph only needs to be constructed once, and for any value of the function's inputs, we can quickly compute the function value and its gradient. Before we try a more complex example, let's first put together a set of basic rules that we can use to construct these graphs, based on some of the most common elementary operations.
Multiplication: the derivative along one branch is simply the input along the other branch. This is because if $f = ab$, then $\nabla_a f = b$ and $\nabla_b f = a$.
Addition: the derivative along both branches is 1. This is because if $f = a+b$, then $\nabla_a f = 1$ and $\nabla_b f = 1$.
Subtraction: the derivative along the left branch is 1, and the derivative along the right branch is -1. This is because if $f = a-b$, then $\nabla_a f = 1$ and $\nabla_b f = -1$.
Power: the derivative along the base branch is the exponent times the base raised to the power of the exponent minus one. This is because if $f = a^b$ (for a constant exponent $b$), then $\nabla_a f = b \cdot a^{b-1}$. Note that the power operator only takes one input.
Division: the derivative along the numerator branch is the reciprocal of the denominator, and the derivative along the denominator branch is the negative of the numerator divided by the square of the denominator. This is because if $f = \frac{a}{b}$, then $\nabla_a f = \frac{1}{b}$ and $\nabla_b f = -\frac{a}{b^2}$.
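To make these rules concrete, here is a minimal, hedged Python sketch of a scalar computation-graph node that records the local derivative on each edge and then multiplies derivatives along paths and sums them during the backward pass. The class name and interface are illustrative assumptions, not the implementation used by any particular framework.

class Node:
    """A scalar value in a computation graph, with reverse-mode differentiation."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_derivative) pairs
        self.grad = 0.0

    # Each operator records the local derivatives described in the rules above.
    def __add__(self, other):      # f = a + b: df/da = 1, df/db = 1
        return Node(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __sub__(self, other):      # f = a - b: df/da = 1, df/db = -1
        return Node(self.value - other.value, [(self, 1.0), (other, -1.0)])

    def __mul__(self, other):      # f = a * b: df/da = b, df/db = a
        return Node(self.value * other.value, [(self, other.value), (other, self.value)])

    def __truediv__(self, other):  # f = a / b: df/da = 1/b, df/db = -a/b^2
        return Node(self.value / other.value,
                    [(self, 1.0 / other.value), (other, -self.value / other.value ** 2)])

    def __pow__(self, exponent):   # f = a^k (constant k): df/da = k * a^(k-1)
        return Node(self.value ** exponent, [(self, exponent * self.value ** (exponent - 1))])

    def backward(self, upstream=1.0):
        """Traverse the graph top-down, multiplying derivatives along each path
        and summing the contributions that reach each leaf (the chain rule)."""
        self.grad += upstream
        for parent, local_derivative in self.parents:
            parent.backward(upstream * local_derivative)

# f(x, y) = (x + y)^2 at x = 1, y = 2
x, y = Node(1.0), Node(2.0)
f = (x + y) * (x + y)
f.backward()
print(f.value, x.grad, y.grad)   # 9.0 6.0 6.0 -- matching 2(x + y)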
Let's now take a more complex example to illustrate how we can construct a computation graph and compute the gradient of the objective function with respect to its inputs. Recall the example of Linear Regression in a 2-dimensional space from class, where the objective was to minimize the Mean Squared Error. Let $S$ be the set of points to which we are trying to fit a line, and let the line be given by the equation $y = mx+c$. The objective function was defined as $MSE(m,c) = \frac{1}{|S|}\sum_{(i,j) \in S}(j - (mi+c))^2$. We can construct a computation graph for this function as follows.
Given what we know about elementary operations from earlier in these notes, most of this graph should be fairly easy to understand. One exception is the summation operator. Recall that a summation operator simply represents a sum of many terms. In our case, the sum operator represents the sum of all the squared errors across all points in our dataset. Here, we recall the sum rule in calculus, which states that the derivative of a sum of functions is the sum of the derivatives of the functions. In other words, $$\frac{\partial}{\partial x}\sum_i f_i(x) = \sum_i\frac{\partial}{\partial x} f_i(x)$$
Armed with this, we can now compute the gradient of the objective function, MSE, with respect to its inputs ($m$ and $c$). From the top of the graph, there is one path that leads us to the leaf node $m$ (in reality there are $|S|$ paths, one for each data point in our data set). Multiplying the derivatives along this branch, and using the sum rule from above, we get: $$\nabla_m MSE(m,c) = \frac{1}{|S|}\sum_{(i,j)\in S} 2(j-(mi+c))\times -1 \times 1 \times i$$ Similarly, for $c$, we get: $$\nabla_c MSE(m,c) = \frac{1}{|S|}\sum_{(i,j)\in S} 2(j-(mi+c))\times -1 \times 1$$
Finally, we are ready to solve Linear Regression using Gradient Descent. In order to do so, we start with some initial estimate of $m$ and $c$, say $m=0$ and $c=0$. We then iteratively update $m$ and $c$ using the gradients we computed above. Recall that the update equations are given by: $$m = m - \alpha \nabla_m MSE(m,c)$$ $$c = c - \alpha \nabla_c MSE(m,c)$$ where $\alpha$ is the learning rate. We repeat this process, updating both the slope term $m$ and the intercept term $c$ at each step, until the gradients are close to zero, or until we reach a specified number of iterations without much change in our objective function. While we end our discussion of gradient descent here, we will once again encounter it when we discuss Neural Networks later in the course.
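The sketch below puts the pieces together in Python: it fits $y = mx + c$ by gradient descent using the gradients of the MSE derived above. The toy data, learning rate, and iteration count are illustrative assumptions.

def fit_line(points, alpha=0.01, iterations=5000):
    m, c = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # grad_m = (1/|S|) * sum 2(j - (m*i + c)) * -1 * i
        # grad_c = (1/|S|) * sum 2(j - (m*i + c)) * -1
        grad_m = sum(2 * (j - (m * i + c)) * -i for i, j in points) / n
        grad_c = sum(2 * (j - (m * i + c)) * -1 for i, j in points) / n
        m -= alpha * grad_m
        c -= alpha * grad_c
    return m, c

# Points lying near the line y = 2x + 1.
data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8), (4, 9.1)]
print(fit_line(data))   # approximately (2, 1)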
Here's another really nice resource about computation graphs and automatic differentiation: Blog post by Christopher Olah. Please note that there are minor differences between the representations of the computation graph used by Christopher and me. For grading purposes, please use the notation we've discussed in class - reflected in my notes above - in all problem sets.
Hill-climbing approaches, while often effective, may fail to produce an optimal solution in certain instances. This may happen as a result of the search getting stuck in a local optimum. This can be thought of as exploring the "landscape" of an objective function in an $n$-dimensional vector space. The objective function's landscape may contain many peaks and valleys, causing the search to get trapped in a "valley" (local minimum) or on a "peak" (local maximum). Since our local search only ever moves in an improving direction, it may completely miss better solutions (deeper valleys or taller peaks) as a result.
Ridges: Ridges are sequences of local optima. The presence of multiple local optima makes it challenging for the algorithm to navigate and converge towards a global optimum, as it may become trapped at any one of them.
Plateaus: A plateau is a flat area in the state-space landscape; it can be a flat local optimum or a shoulder from which some future progress is possible. Plateaus pose challenges for hill-climbing algorithms because no successor may offer a better value, leaving the search stranded without further progress.
Dealing with Local Optima: Here, we discuss two commonly used strategies for dealing with local optima. To motivate both strategies, consider the following objective function, $g(x)=\sin(3x)+0.1x+0.3$, plotted below.
If we start at $x=2$ and attempt to maximize our objective, then hill-climbing leads us to the local maximum at $x\approx 2.6$, whereas the global maximum lies at $x\approx 4.6$. This is a result of the algorithm being stuck in a local optimum. However, if we had either started at $x=4$, or had somehow been able to move from the local optimum to $x=4$, then we could have subsequently reached the global maximum.
Random Restarts: One of the most popular methods to circumvent getting stuck in local optima is to randomly restart the hill-climbing process with a different initial state. In the above example, if the first run started at $x=2$, then a random restart could, in theory, get us to start from, say, $x=4$ (really any value on either slope of the optimal peak), which would lead us to the global optimum.
Random Downhill Steps: A second way to wiggle our way out of local optima is to allow steps that lead to a worse objective function value with a small probability. In the above example, from the local optimum peak around $x\approx 2.6$, if we allowed the search algorithm to accept a neighbor on its right (combined with the step-size being large enough), then a subsequent step could get us to the global optimum. This approach is also known as Stochastic Hill Climbing. Simulated Annealing makes this idea systematic: treating the objective as an "energy" to be minimized, it accepts worse moves with a probability governed by a temperature $T$ that is gradually reduced by a decay factor $d$, so that random downhill steps become increasingly unlikely as the search progresses:
function SIMULATED-ANNEALING(G, s, goal, T, d):
    current = s
    best = s
    while True:
        neighbor = generate_neighbor(G, current)
        if Energy(neighbor) <= Energy(current):
            # Always accept a neighbor that is at least as good.
            current = neighbor
        else:
            # Accept a worse neighbor with a probability that shrinks as T decreases.
            if random(0,1) <= e^((Energy(current) - Energy(neighbor)) / T):
                current = neighbor
        # Keep track of the best (lowest-energy) state seen so far.
        if Energy(current) < Energy(best):
            best = current
        T = T * d
        if converged:
            return best
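Before unpacking the acceptance probability, here is a minimal, runnable Python sketch of the pseudocode above, minimizing the energy $f(x) = x^4 - 3x^2 + x$ over a real-valued $x$. The neighbor proposal, initial temperature, decay factor, and stopping rule are illustrative assumptions.

import math
import random

def simulated_annealing(energy, s, T=10.0, d=0.99, T_min=1e-4):
    current, best = s, s
    while T > T_min:                                   # a simple "converged" test
        neighbor = current + random.gauss(0, 0.5)      # generate a nearby state
        delta = energy(neighbor) - energy(current)
        # Accept improvements always; accept worse moves with probability e^(-delta/T).
        if delta <= 0 or random.random() <= math.exp(-delta / T):
            current = neighbor
        if energy(current) < energy(best):             # track the best state seen so far
            best = current
        T *= d                                         # cool the temperature
    return best

f = lambda x: x**4 - 3 * x**2 + x
print(simulated_annealing(f, s=2.0))   # usually lands near the global minimum x ≈ -1.3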
At each iteration, the algorithm accepts the newly generated neighbor if its energy is lower than the current state's, and accepts it with a probability of $\mathrm{e}^{(\textrm{Energy(current)}-\textrm{Energy(neighbor)})/T}$ if the energy is higher, where $\mathrm{e}$ is Euler's number. The probability of accepting a worse solution decreases exponentially as the temperature $T$ decreases. Here's some insight into how this works. First, note that we only need to compute this probability when the energy of the generated neighbor is higher than the energy of our current solution. Therefore, the numerator in the exponent, i.e. $\textrm{Energy(current)}-\textrm{Energy(neighbor)}$, is always negative, giving us a function of the form $f(x) = \mathrm{e}^{-x}$. Now, the exponent is inversely proportional to the temperature $T$, and thus as the temperature decreases, the absolute value of the exponent increases. Essentially, we're looking at the behaviour of the function $f(x) = \mathrm{e}^{-x}, x \ge 0$, plotted below. You can observe that for a high temperature, the resulting probability of accepting a worse solution is close to one, but as the temperature $T$ decreases, this probability decreases exponentially.

Genetic Algorithms are a class of search algorithms inspired by the process of natural selection, and can be viewed as an extension/combination of some of the local search techniques we have discussed so far. The algorithm, much like local beam search, maintains a population of candidate solutions, and iteratively applies genetic operators to the population to generate new candidate solutions. These genetic operators broadly consist of mutations and crossovers, discussed below. The algorithm proceeds by iteratively generating new candidates using these operators, and then using a fitness function to retain some subset of the best-performing solutions, in a workflow similar to local beam search.
Mutations: These are random changes to a candidate solution, much like generating a random neighbor in hill climbing. For example, in the case of a binary string, a mutation could involve flipping a single bit from 0 to 1 or vice-versa. In the case of a real-valued vector, a mutation could involve adding a small random number to one of the vector's elements. In the case of the traveling salesman problem, a 2-opt swap could be a mutation. Mutations are used to introduce new information into the population, and to prevent the algorithm from getting stuck in local optima. The rate at which these random changes are applied to each candidate solution is controlled by a parameter called the mutation rate, and may vary throughout the execution of the program (similar to how we decrease the probability of large changes over time in Simulated Annealing).
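Two hedged examples of the mutation operators just described are sketched below; the mutation rate and step size are illustrative assumptions.

import random

def mutate_bitstring(bits, mutation_rate=0.05):
    """Flip each bit independently with probability mutation_rate."""
    return [1 - b if random.random() < mutation_rate else b for b in bits]

def mutate_vector(vector, step=0.1):
    """Add a small random perturbation to one randomly chosen element."""
    i = random.randrange(len(vector))
    mutated = list(vector)
    mutated[i] += random.gauss(0, step)
    return mutated

print(mutate_bitstring([0, 1, 1, 0, 1]))
print(mutate_vector([1.0, 2.0, 3.0]))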
Crossovers: These are operations that combine two candidate solutions to produce one or more new candidate solutions. For example, in the case of a binary string or a real-valued 1-dimensional vector, one possible crossover could involve taking the first half of one parent and the second half of another parent to produce a new child string/vector. In the case of the traveling salesman problem, crossovers might involve more steps and some logical parsing of the two parent solutions (see below). Crossovers are used to combine information from two candidate solutions that are already good, in the hopes of producing an even better solution.
Complex crossovers may often be necessary, especially in cases where the validity of a solution depends on the structure of its representation. For example, consider the traveling salesman problem with 5 cities, named A, B, C, D and E. Assuming all cities are connected, one possible solution could be the string "ABCDEA", which represents the order in which the cities are visited. A second candidate solution could be AECDBA. Now, let's try and combine these solutions using a crossover operation. A simple crossover between the two parents "ABCDEA" and "AECDBA" where we combine the first half of the first parent with the second half of the second parent would yield the string "ABCDBA", which is not a valid solution to the TSP. This is because the child visits city B twice, and does not visit city E at all. A more complex crossover could involve parsing the two parents and combining them in a way that ensures that the child visits each city exactly once. This is known as the Order Crossover (OX) operator. For further reading on genetic operations, please refer to these notes by Dr. Lluís Alsedà, Universitat Autònoma de Barcelona.
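To make the contrast concrete, here is a hedged Python sketch of a naive one-point crossover that can produce invalid tours, alongside a simplified variant of the Order Crossover that keeps every city appearing exactly once. Tours are written as permutations of the cities (the return to the starting city is left implicit), and the cut points are illustrative assumptions.

def naive_crossover(p1, p2):
    """One-point crossover: first half of p1 + second half of p2.
    May produce invalid tours (repeated or missing cities)."""
    half = len(p1) // 2
    return p1[:half] + p2[half:]

def order_crossover(p1, p2, i, j):
    """Simplified variant of the Order Crossover (OX): keep p1[i:j], then fill
    the remaining positions with p2's cities in order, skipping any city
    already copied, so the child visits each city exactly once."""
    segment = p1[i:j]
    remainder = [city for city in p2 if city not in segment]
    return "".join(remainder[:i] + list(segment) + remainder[i:])

p1, p2 = "ABCDE", "AECDB"
print(naive_crossover(p1, p2))         # "ABCDB" -- visits B twice, never visits E
print(order_crossover(p1, p2, 1, 3))   # keeps "BC" from p1, fills the rest from p2 -> a valid tour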