Perceptron math part 2

The Update Rule

How does the hyperplane move? How can we ensure all our points are classified correctly?

Given a misclassified sample $(x_i, y_i)$, the perceptron update is:

$$w_{\text{new}} = w + \eta y_i x_i$$

$$b_{\text{new}} = b + \eta y_i$$
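
In code, the update is a one-liner per parameter. Here is a minimal NumPy sketch; the function name and default learning rate are just for illustration, not part of the math above:

```python
import numpy as np

def perceptron_update(w, b, x_i, y_i, eta=0.1):
    """One perceptron update for a misclassified sample (x_i, y_i)."""
    w_new = w + eta * y_i * x_i   # pull the weights toward x_i (y_i = +1) or away from it (y_i = -1)
    b_new = b + eta * y_i         # shift the boundary in the direction of the label
    return w_new, b_new
```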

A concrete example:

Let’s say we have a misclassified point with feature vector $x_i = [3, 5, 6, 7]$ and true label $y_i = +1$. Our current weight vector is $w = [-1, -2, -3, -4]$, our current bias is $b = 2$, and our learning rate is $\eta = 0.1$. (The point really is misclassified: its score is $w \cdot x_i + b = -59 + 2 = -57 \le 0$, yet its true label is $+1$.)

To update the weights, we calculate $\eta y_i x_i = 0.1 \times (+1) \times [3, 5, 6, 7] = [0.3, 0.5, 0.6, 0.7]$. Then our new weight vector becomes $w_{\text{new}} = [-1, -2, -3, -4] + [0.3, 0.5, 0.6, 0.7] = [-0.7, -1.5, -2.4, -3.3]$.

For the bias, we calculate $\eta y_i = 0.1 \times (+1) = 0.1$, so our new bias becomes $b_{\text{new}} = 2 + 0.1 = 2.1$.
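
You can verify the arithmetic directly; a small NumPy check using the same numbers as the example:

```python
import numpy as np

# Numbers from the worked example above
w, b, eta = np.array([-1.0, -2.0, -3.0, -4.0]), 2.0, 0.1
x_i, y_i = np.array([3.0, 5.0, 6.0, 7.0]), +1

print(w @ x_i + b)           # -57.0: negative score but label +1, so the point is misclassified
print(w + eta * y_i * x_i)   # [-0.7 -1.5 -2.4 -3.3] (up to floating-point rounding)
print(b + eta * y_i)         # 2.1
```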

Understanding the notation:

Let’s break down what each symbol means in these update equations. The symbol $w_{\text{new}}$ represents the new weight vector after we make an update, while $w$ is the current weight vector before the update. The weight vector contains all the parameters that define the direction of our decision boundary.

The symbol $\eta$ (pronounced “eta”) is called the learning rate. It’s a small positive number that controls how big of a step we take when adjusting our weights. Think of it like the size of your steps when walking: too large and you might overshoot your destination, too small and it takes forever to get there.

The symbol $y_i$ is the true label of the misclassified point, which is either $+1$ or $-1$. This tells us which side of the boundary the point should actually be on. The symbol $x_i$ is the feature vector of the misclassified point: just the coordinates of the point in our feature space (like height and weight, or any other measurements we’re using).

For the bias term, $b_{\text{new}}$ is the new bias value after the update, and $b$ is the current bias before the update. The bias is a single number that shifts the decision boundary up or down (or left and right, depending on how you visualize it).

When we multiply $\eta y_i x_i$, we’re scaling the point’s coordinates by both the learning rate and its true label. If the point should be positive ($y_i = +1$), we add a scaled version of the point to our weights, pulling the boundary toward it. If it should be negative ($y_i = -1$), we subtract a scaled version, pushing the boundary away from it.
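
To see the pull-toward / push-away behavior concretely, here is a tiny sketch with made-up 2D numbers, showing both label cases:

```python
import numpy as np

w, eta = np.array([1.0, 0.0]), 0.5
x = np.array([2.0, 1.0])

print(w + eta * (+1) * x)   # [2.  0.5]  -> weights pulled toward x when the label is +1
print(w + eta * (-1) * x)   # [0. -0.5]  -> weights pushed away from x when the label is -1
```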

How do we know a point is on the wrong side?

A point is on the wrong side exactly when the sign of its score disagrees with its true label, i.e. when $y_i(w \cdot x_i + b) \le 0$. When that happens, we “pull” the hyperplane toward the point by moving the weights in the direction of the point if it should be positive, or away from it if it should be negative. The learning rate $\eta$ controls how big each step is. If the learning rate is too large, you overshoot and might make things worse. If it’s too small, learning becomes very slow. Think of it like adjusting a line on a graph: if a blue point ends up in the red region, you tilt the line to move it to the correct side.
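
In code, the check is a single comparison; a sketch assuming NumPy arrays:

```python
import numpy as np

def is_misclassified(w, b, x_i, y_i):
    """True when the sign of the score w . x_i + b disagrees with the label y_i (+1 or -1)."""
    return y_i * (w @ x_i + b) <= 0

# With the numbers from the earlier example: score is -57, label is +1, so this prints True.
print(is_misclassified(np.array([-1.0, -2.0, -3.0, -4.0]), 2.0,
                       np.array([3.0, 5.0, 6.0, 7.0]), +1))
```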

Why this works (the detailed math):

First, a crucial clarification:

We don’t have a weight for each point. Instead, we have one shared weight vector $w$ that applies to all points. If your data has $d$ features (like height, weight, and age), then $w$ has $d$ components: $w = (w_1, w_2, ..., w_d)$. Every point uses the same $w$ to compute its score: $\text{score} = w \cdot x + b$. When we update $w$ based on a misclassified point, we’re adjusting the shared model, which affects how all points are classified going forward.
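
A quick way to see that one shared $w$ scores every point: stack the points into a matrix and apply the same weights to each row (the numbers below are made up for illustration):

```python
import numpy as np

X = np.array([[3.0, 5.0],    # one row per point
              [1.0, 2.0],
              [4.0, 0.5]])
w = np.array([0.2, -0.1])    # a single shared weight vector with d = 2 components
b = 0.3

print(X @ w + b)             # one score per point, all computed with the same w and b
```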

The intuitive picture:

Think of the weight vector $w$ as defining the “direction” of your decision boundary. When a point is misclassified, we adjust $w$ accordingly. If a point should be class $+1$ but is on the $-1$ side, we adjust $w$ to point more toward that point’s location. If a point should be class $-1$ but is on the $+1$ side, we adjust $w$ to point more away from that point’s location. Each update is a small nudge that makes the boundary better for that specific point, while hopefully not breaking correct classifications.

Now the detailed math:

When a point $x_i$ is misclassified with true label $y_i$, we know $y_i(w \cdot x_i + b) \le 0$. The update rule accomplishes several important things:

Nudging the hyperplane in the right direction:

The update adds $\eta y_i x_i$ to the weight vector: $w \leftarrow w + \eta y_i x_i$. Let’s see a concrete example: if $x_i = (3, 2)$ should be class $+1$ but was misclassified, we add $\eta \cdot (+1) \cdot (3, 2) = (\eta \cdot 3, \eta \cdot 2)$ to $w$. This makes $w$ point more in the direction of $(3, 2)$, which helps classify this point correctly.

After the update, the new score for $x_i$ becomes $(w + \eta y_i x_i) \cdot x_i + (b + \eta y_i)$. When we expand this, we get $w \cdot x_i + b + \eta y_i (x_i \cdot x_i + 1)$. Since $x_i \cdot x_i = \|x_i\|^2 \ge 0$, the added term $\eta y_i (\|x_i\|^2 + 1)$ has the same sign as $y_i$ when $\eta > 0$. This moves the score in the direction of $y_i$, pushing the point toward or across the hyperplane. The key insight here is that the update uses the point’s own coordinates ($x_i$) to adjust $w$, which means points with larger feature values have more influence on the corresponding weights.
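
You can check this identity numerically with the numbers from the worked example above: the signed score $y_i(w \cdot x_i + b)$ rises by exactly $\eta(\|x_i\|^2 + 1)$ after one update.

```python
import numpy as np

w, b, eta = np.array([-1.0, -2.0, -3.0, -4.0]), 2.0, 0.1
x_i, y_i = np.array([3.0, 5.0, 6.0, 7.0]), +1

before = y_i * (w @ x_i + b)                    # signed score before the update: -57.0
w2, b2 = w + eta * y_i * x_i, b + eta * y_i     # one perceptron update
after = y_i * (w2 @ x_i + b2)                   # signed score after the update: ~ -45.0

print(after - before)                           # ~ 12.0
print(eta * (x_i @ x_i + 1))                    # ~ 12.0 = eta * (||x_i||^2 + 1)
```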

Increasing alignment with the true separator:

If a perfect separator $w^*$ exists (one that correctly classifies all points), we want our $w$ to point in a similar direction. The dot product $w \cdot w^*$ measures how aligned our weights are with the perfect solution: the larger this value, the more aligned we are. After an update, we have $(w + \eta y_i x_i) \cdot w^* = w \cdot w^* + \eta y_i (x_i \cdot w^*)$. For linearly separable data, $y_i (x_i \cdot w^*) \ge 0$ because every point is on the correct side of $w^*$. So each update adds a non-negative term, meaning $w \cdot w^*$ increases and our weights become more aligned with the true separator. The intuition is that each correction moves us closer to the “ideal” weight vector that would perfectly separate the data.
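
One standard way to make this quantitative is a sketch under the extra assumption that every point satisfies $y_i (x_i \cdot w^*) \ge \gamma$ for some margin $\gamma > 0$ (an assumption introduced here, not required by the text above). Each update then adds at least $\eta\gamma$ to the alignment, so after $T$ updates

$$w_T \cdot w^* \ge w_0 \cdot w^* + \eta \gamma T,$$

and the alignment grows at least linearly in the number of updates.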

Margin improvement:

The margin is the minimum distance from any point to the hyperplane. As $w$ aligns better with $w^*$, correctly classified points tend to end up further from the hyperplane. This increases the “safety zone” around the decision boundary, making our classifier more robust.

Bounded weight growth:

While $w \cdot w^*$ increases, the norm $\|w\|$ grows at a controlled rate. This bounded growth is crucial for the convergence proof because it ensures we don’t need infinite updates to reach a solution.
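
To sketch why the growth is controlled (folding the bias into $w$ by appending a constant feature of $1$ to every $x_i$, as the classical convergence argument does): expanding the squared norm after one update on a misclassified point gives

$$\|w + \eta y_i x_i\|^2 = \|w\|^2 + 2\eta\, y_i (w \cdot x_i) + \eta^2 \|x_i\|^2 \le \|w\|^2 + \eta^2 R^2,$$

because the middle term is non-positive for a misclassified point ($y_i (w \cdot x_i) \le 0$) and $R$ denotes the largest point norm. So $\|w\|^2$ grows by at most $\eta^2 R^2$ per update, while $w \cdot w^*$ grows by at least $\eta\gamma$ per update; comparing these two rates is what bounds the total number of updates for separable data.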

The key insight: Each update makes a small correction that moves misclassified points toward the correct side, increases alignment with the optimal solution (if one exists), and does so in a controlled way that guarantees convergence for separable data.