Chapter 2: Machine Learning
In the last chapter we were concerned with real-valued circuits that computed possibly complex expressions of their inputs (the forward pass), and also we could compute the gradients of these expressions on the original inputs (backward pass). In this chapter we will see how useful this extremely simple mechanism is in Machine Learning.
This post was originally posted here on karpathy.github.io/
Binary Classification
As we did before, lets start out simple. The simplest, common and yet very practical problem in Machine Learning is binary classification. A lot of very interesting and important problems can be reduced to it. The setup is as follows: We are given a dataset of N
vectors and every one of them is labeled with a +1
or a -1
. For example, in two dimensions our dataset could look as simple as:
vector -> label
---------------
[1.2, 0.7] -> +1
[-0.3, 0.5] -> -1
[-3, -1] -> +1
[0.1, 1.0] -> -1
[3.0, 1.1] -> -1
[2.1, -3] -> +1
Here, we have N = 6
datapoints, where every datapoint has two features (D = 2
). Three of the datapoints have label +1
and the other three label -1
. This is a silly toy example, but in practice a +1/-1 dataset could be very useful things indeed: For example spam/no spam emails, where the vectors somehow measure various features of the content of the email, such as the number of times certain enhancement drugs are mentioned.
Goal. Our goal in binary classification is to learn a function that takes a 2-dimensional vector and predicts the label. This function is usually parameterized by a certain set of parameters, and we will want to tune the parameters of the function so that its outputs are consistent with the labeling in the provided dataset. In the end we can discard the dataset and use the learned parameters to predict labels for previously unseen vectors.
Training protocol
We will eventually build up to entire neural networks and complex expressions, but lets start out simple and train a linear classifier very similar to the single neuron we saw at the end of Chapter 1. The only difference is that we’ll get rid of the sigmoid because it makes things unnecessarily complicated (I only used it as an example in Chapter 1 because sigmoid neurons are historically popular but modern Neural Networks rarely, if ever, use sigmoid non-linearities). Anyway, lets use a simple linear function:
In this expression we think of x
and y
as the inputs (the 2D vectors) and a,b,c
as the parameters of the function that we will want to learn. For example, if a = 1, b = -2, c = -1
, then the function will take the first datapoint ([1.2, 0.7]
) and output 1 * 1.2 + (-2) * 0.7 + (-1) = -1.2
. Here is how the training will work:
- We select a random datapoint and feed it through the circuit
- We will interpret the output of the circuit as a confidence that the datapoint has class
+1
. (i.e. very high values = circuit is very certain datapoint has class+1
and very low values = circuit is certain this datapoint has class-1
.) - We will measure how well the prediction aligns with the provided labels. Intuitively, for example, if a positive example scores very low, we will want to tug in the positive direction on the circuit, demanding that it should output higher value for this datapoint. Note that this is the case for the the first datapoint: it is labeled as
+1
but our predictor unction only assigns it value-1.2
. We will therefore tug on the circuit in positive direction; We want the value to be higher. - The circuit will take the tug and backpropagate it to compute tugs on the inputs
a,b,c,x,y
- Since we think of
x,y
as (fixed) datapoints, we will ignore the pull onx,y
. If you’re a fan of my physical analogies, think of these inputs as pegs, fixed in the ground. - On the other hand, we will take the parameters
a,b,c
and make them respond to their tug (i.e. we’ll perform what we call a parameter update). This, of course, will make it so that the circuit will output a slightly higher score on this particular datapoint in the future. - Iterate! Go back to step 1.
The training scheme I described above, is commonly referred as Stochastic Gradient Descent. The interesting part I’d like to reiterate is that a,b,c,x,y
are all made up of the same stuff as far as the circuit is concerned: They are inputs to the circuit and the circuit will tug on all of them in some direction. It doesn’t know the difference between parameters and datapoints. However, after the backward pass is complete we ignore all tugs on the datapoints (x,y
) and keep swapping them in and out as we iterate over examples in the dataset. On the other hand, we keep the parameters (a,b,c
) around and keep tugging on them every time we sample a datapoint. Over time, the pulls on these parameters will tune these values in such a way that the function outputs high scores for positive examples and low scores for negative examples.
Learning a Support Vector Machine
As a concrete example, lets learn a Support Vector Machine. The SVM is a very popular linear classifier; Its functional form is exactly as I’ve described in previous section, f(x,y)=ax+by+cf(x,y)=ax+by+c. At this point, if you’ve seen an explanation of SVMs you’re probably expecting me to define the SVM loss function and plunge into an explanation of slack variables, geometrical intuitions of large margins, kernels, duality, etc. But here, I’d like to take a different approach. Instead of definining loss functions, I would like to base the explanation on the force specification (I just made this term up by the way) of a Support Vector Machine, which I personally find much more intuitive. As we will see, talking about the force specification and the loss function are identical ways of seeing the same problem. Anyway, here it is:
Support Vector Machine “Force Specification”:
- If we feed a positive datapoint through the SVM circuit and the output value is less than 1, pull on the circuit with force
+1
. This is a positive example so we want the score to be higher for it. - Conversely, if we feed a negative datapoint through the SVM and the output is greater than -1, then the circuit is giving this datapoint dangerously high score: Pull on the circuit downwards with force
-1
. - In addition to the pulls above, always add a small amount of pull on the parameters
a,b
(notice, not onc
!) that pulls them towards zero. You can think of botha,b
as being attached to a physical spring that is attached at zero. Just as with a physical spring, this will make the pull proprotional to the value of each ofa,b
(Hooke’s law in physics, anyone?). For example, ifa
becomes very high it will experience a strong pull of magnitude|a|
back towards zero. This pull is something we call regularization, and it ensures that neither of our parametersa
orb
gets disproportionally large. This would be undesirable because botha,b
get multiplied to the input featuresx,y
(remember the equation isa*x + b*y + c
), so if either of them is too high, our classifier would be overly sensitive to these features. This isn’t a nice property because features can often be noisy in practice, so we want our classifier to change relatively smoothly if they wiggle around.
Lets quickly go through a small but concrete example. Suppose we start out with a random parameter setting, say, a = 1, b = -2, c = -1
. Then:
- If we feed the point
[1.2, 0.7]
, the SVM will compute score1 * 1.2 + (-2) * 0.7 - 1 = -1.2
. This point is labeled as+1
in the training data, so we want the score to be higher than 1. The gradient on top of the circuit will thus be positive:+1
, which will backpropagate toa,b,c
. Additionally, there will also be a regularization pull ona
of-1
(to make it smaller) and regularization pull onb
of+2
to make it larger, toward zero. - Suppose instead that we fed the datapoint
[-0.3, 0.5]
to the SVM. It computes1 * (-0.3) + (-2) * 0.5 - 1 = -2.3
. The label for this point is-1
, and since-2.3
is smaller than-1
, we see that according to our force specification the SVM should be happy: The computed score is very negative, consistent with the negative label of this example. There will be no pull at the end of the circuit (i.e it’s zero), since there no changes are necessary. However, there will still be the regularization pull ona
of-1
and onb
of+2
.
Okay there’s been too much text. Lets write the SVM code and take advantage of the circuit machinery we have from Chapter 1:
// A circuit: it takes 5 Units (x,y,a,b,c) and outputs a single Unit
// It can also compute the gradient w.r.t. its inputs
var Circuit = function() {
// create some gates
this.mulg0 = new multiplyGate();
this.mulg1 = new multiplyGate();
this.addg0 = new addGate();
this.addg1 = new addGate();
};
Circuit.prototype = {
forward: function(x,y,a,b,c) {
this.ax = this.mulg0.forward(a, x); // a*x
this.by = this.mulg1.forward(b, y); // b*y
this.axpby = this.addg0.forward(this.ax, this.by); // a*x + b*y
this.axpbypc = this.addg1.forward(this.axpby, c); // a*x + b*y + c
return this.axpbypc;
},
backward: function(gradient_top) { // takes pull from above
this.axpbypc.grad = gradient_top;
this.addg1.backward(); // sets gradient in axpby and c
this.addg0.backward(); // sets gradient in ax and by
this.mulg1.backward(); // sets gradient in b and y
this.mulg0.backward(); // sets gradient in a and x
}
}
That’s a circuit that simply computes a*x + b*y + c
and can also compute the gradient. It uses the gates code we developed in Chapter 1. Now lets write the SVM, which doesn’t care about the actual circuit. It is only concerned with the values that come out of it, and it pulls on the circuit.
// SVM class
var SVM = function() {
// random initial parameter values
this.a = new Unit(1.0, 0.0);
this.b = new Unit(-2.0, 0.0);
this.c = new Unit(-1.0, 0.0);
this.circuit = new Circuit();
};
SVM.prototype = {
forward: function(x, y) { // assume x and y are Units
this.unit_out = this.circuit.forward(x, y, this.a, this.b, this.c);
return this.unit_out;
},
backward: function(label) { // label is +1 or -1
// reset pulls on a,b,c
this.a.grad = 0.0;
this.b.grad = 0.0;
this.c.grad = 0.0;
// compute the pull based on what the circuit output was
var pull = 0.0;
if(label === 1 && this.unit_out.value < 1) {
pull = 1; // the score was too low: pull up
}
if(label === -1 && this.unit_out.value > -1) {
pull = -1; // the score was too high for a positive example, pull down
}
this.circuit.backward(pull); // writes gradient into x,y,a,b,c
// add regularization pull for parameters: towards zero and proportional to value
this.a.grad += -this.a.value;
this.b.grad += -this.b.value;
},
learnFrom: function(x, y, label) {
this.forward(x, y); // forward pass (set .value in all Units)
this.backward(label); // backward pass (set .grad in all Units)
this.parameterUpdate(); // parameters respond to tug
},
parameterUpdate: function() {
var step_size = 0.01;
this.a.value += step_size * this.a.grad;
this.b.value += step_size * this.b.grad;
this.c.value += step_size * this.c.grad;
}
};
Now lets train the SVM with Stochastic Gradient Descent:
var data = []; var labels = [];
data.push([1.2, 0.7]); labels.push(1);
data.push([-0.3, -0.5]); labels.push(-1);
data.push([3.0, 0.1]); labels.push(1);
data.push([-0.1, -1.0]); labels.push(-1);
data.push([-1.0, 1.1]); labels.push(-1);
data.push([2.1, -3]); labels.push(1);
var svm = new SVM();
// a function that computes the classification accuracy
var evalTrainingAccuracy = function() {
var num_correct = 0;
for(var