Perceptron math part 2
The Update Rule
How does the hyperplane move? How can we ensure all our points are classified correctly?
Given a misclassified sample (x, y), the perceptron update is:
w_new = w_old + η·y·x
b_new = b_old + η·y
A concrete example:
Let’s say we have a misclassified point with feature vector x = (2, 1) and true label y = +1. Our current weight vector is w = (1, −1), our current bias is b = 0, and our learning rate is η = 0.5.
To update the weights, we calculate η·y·x = 0.5 · (+1) · (2, 1) = (1, 0.5). Then our new weight vector becomes w_new = (1, −1) + (1, 0.5) = (2, −0.5).
For the bias, we calculate η·y = 0.5 · (+1) = 0.5, so our new bias becomes b_new = 0 + 0.5 = 0.5.
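The update arithmetic can be checked in a few lines of Python. The values here are illustrative (x = (2, 1), y = +1, w = (1, −1), b = 0, η = 0.5); any consistent numbers work the same way:

```python
# Perceptron update for one misclassified point (illustrative values).
eta = 0.5          # learning rate
x = [2.0, 1.0]     # feature vector of the misclassified point
y = 1              # true label (+1 or -1)
w = [1.0, -1.0]    # current weight vector
b = 0.0            # current bias

# w_new = w_old + eta * y * x  (component-wise)
w_new = [w_i + eta * y * x_i for w_i, x_i in zip(w, x)]
# b_new = b_old + eta * y
b_new = b + eta * y

print(w_new, b_new)  # [2.0, -0.5] 0.5
```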
Understanding the notation:
Let’s break down what each symbol means in these update equations. The symbol w_new represents the new weight vector after we make an update, while w_old is the current weight vector before the update. The weight vector contains all the parameters that define the direction of our decision boundary.
The symbol η (pronounced “eta”) is called the learning rate. It’s a small positive number that controls how big of a step we take when adjusting our weights. Think of it like the size of your steps when walking—too large and you might overshoot your destination, too small and it takes forever to get there.
The symbol y is the true label of the misclassified point, which is either +1 or -1. This tells us which side of the boundary the point should actually be on. The symbol x is the feature vector of the misclassified point—this is just the coordinates of the point in our feature space (like height and weight, or any other measurements we’re using).
For the bias term, b_new is the new bias value after the update, and b_old is the current bias before the update. The bias is a single number that shifts the decision boundary up or down (or left and right, depending on how you visualize it).
When we multiply η·y·x, we’re scaling the point’s coordinates by both the learning rate and its true label. If the point should be positive (y = +1), we add a scaled version of the point to our weights, pulling the boundary toward it. If it should be negative (y = −1), we subtract a scaled version, pushing the boundary away from it.
How do we know a point is on the wrong side?
A point is on the wrong side exactly when its label disagrees with the sign of its score: a misclassification means y·(w·x + b) ≤ 0. When that happens, we “pull” the hyperplane toward the point by moving the weights in the direction of the point if it should be positive, or away from it if it should be negative. The learning rate controls how big each step is. If the learning rate is too large, you overshoot and might make things worse. If it’s too small, learning becomes very slow. Think of it like adjusting a line on a graph: if a blue point ends up in the red region, you tilt the line to move it to the correct side.
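Concretely, “is this point on the wrong side?” is just a sign test on the point’s score. A minimal sketch:

```python
def is_misclassified(w, b, x, y):
    """A point is on the wrong side when y * (w . x + b) <= 0."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return y * score <= 0

# A +1 point whose score is negative is misclassified.
print(is_misclassified([1.0, -1.0], 0.0, [0.0, 2.0], 1))  # True
# A +1 point whose score is positive is fine.
print(is_misclassified([1.0, -1.0], 0.0, [2.0, 0.0], 1))  # False
```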
Why this works (the detailed math):
First, a crucial clarification:
We don’t have a weight for each point. Instead, we have one shared weight vector w that applies to all points. If your data has d features (like height, weight, and age, so d = 3), then w has d components: w = (w_1, w_2, …, w_d). Every point uses the same w to compute its score: w·x + b. When we update w based on a misclassified point, we’re adjusting the shared model, which affects how all points are classified going forward.
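To make the “one shared weight vector” point concrete, here is a small sketch (with made-up values) where every point is scored by the same w and b:

```python
def score(w, b, x):
    # Every point is scored with the same shared parameters w and b.
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b

w, b = [0.5, -1.0, 2.0], 0.25   # one model for all points (d = 3 features)
points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
scores = [score(w, b, x) for x in points]
print(scores)  # [0.75, -0.75, 2.25]
```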
The intuitive picture:
Think of the weight vector w as defining the “direction” of your decision boundary. When a point is misclassified, we adjust w accordingly. If a point should be class +1 but is on the -1 side, we adjust w to point more toward that point’s location. If a point should be class -1 but is on the +1 side, we adjust w to point more away from that point’s location. Each update is a small nudge that makes the boundary better for that specific point, while hopefully not breaking correct classifications.
Now the detailed math:
When a point x is misclassified with true label y, we know y·(w·x + b) ≤ 0. The update rule accomplishes several important things:
Nudging the hyperplane in the right direction:
The update adds η·y·x to the weight vector: w_new = w_old + η·y·x. Let’s see a concrete example: if x should be class +1 but was misclassified, we add η·x to w. This makes w point more in the direction of x, which helps classify this point correctly.
After the update, the new score for x becomes w_new·x + b_new. When we expand this, we get (w_old + η·y·x)·x + (b_old + η·y) = w_old·x + b_old + η·y·(‖x‖² + 1). Since ‖x‖² + 1 > 0, we’re adding a positive term when y = +1 (and a negative term when y = −1). This shifts the score in the direction of y, moving the point toward or across the hyperplane. The key insight here is that the update uses the point’s own coordinates (x) to adjust w, which means points with larger feature values have more influence on the corresponding weights.
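This algebra, namely that the score changes by exactly η·y·(‖x‖² + 1), can be verified numerically. A sketch with arbitrary illustrative values:

```python
eta = 0.1
x = [2.0, 1.0]
y = 1
w = [-1.0, 0.5]
b = 0.0

old_score = sum(wi * xi for wi, xi in zip(w, x)) + b
# Apply the update: w += eta*y*x, b += eta*y.
w = [wi + eta * y * xi for wi, xi in zip(w, x)]
b = b + eta * y
new_score = sum(wi * xi for wi, xi in zip(w, x)) + b

norm_sq = sum(xi * xi for xi in x)
# The change in score equals eta * y * (||x||^2 + 1).
print(new_score - old_score, eta * y * (norm_sq + 1))
```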
Increasing alignment with the true separator:
If a perfect separator w* exists (one that correctly classifies all points), we want our w to point in a similar direction. The dot product w·w* measures how aligned our weights are with the perfect solution—the larger this value, the more aligned we are. After an update, we have w_new·w* = w_old·w* + η·y·(x·w*). For correctly separable data, y·(x·w*) > 0 because points are on the correct side of w*. So each update adds a positive term, meaning w·w* increases and our weights become more aligned with the true separator. The intuition is that each correction moves us closer to the “ideal” weight vector that would perfectly separate the data.
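A small numerical check of the alignment claim, with a hypothetical separator w_star that classifies the point correctly (y·(x·w_star) > 0):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w_star = [1.0, 1.0]     # hypothetical perfect separator direction
x, y = [2.0, 0.5], 1    # a point w_star gets right: y * dot(x, w_star) > 0
eta = 0.1
w = [-0.5, 0.2]         # current (imperfect) weights

before = dot(w, w_star)
w = [wi + eta * y * xi for wi, xi in zip(w, x)]   # perceptron update
after = dot(w, w_star)

# The alignment increases by eta * y * dot(x, w_star) = 0.25.
print(before, after)
```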
Margin improvement:
The margin is the minimum distance from any point to the hyperplane. As w aligns better with w*, correctly classified points move further from the hyperplane. This increases the “safety zone” around the decision boundary, making our classifier more robust.
Bounded weight growth:
While w·w* increases with each update, the squared norm ‖w‖² grows at a controlled rate: expanding ‖w_new‖² = ‖w_old‖² + 2·η·y·(w_old·x) + η²·‖x‖² (absorbing the bias into w for simplicity), the middle term is ≤ 0 for a misclassified point, so ‖w_new‖² ≤ ‖w_old‖² + η²·‖x‖². This bounded growth is crucial for the convergence proof because it ensures we don’t need infinite updates to reach a solution.
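The bounded-growth claim can also be checked directly on a misclassified point (here with bias taken as 0 for simplicity, and illustrative values):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

eta = 0.5
w = [1.0, -1.0]
x, y = [0.0, 2.0], 1      # misclassified: y * dot(w, x) = -2 <= 0

norm_sq_before = dot(w, w)
w_new = [wi + eta * y * xi for wi, xi in zip(w, x)]
norm_sq_after = dot(w_new, w_new)

# ||w_new||^2 = ||w||^2 + 2*eta*y*(w.x) + eta^2*||x||^2
#            <= ||w||^2 + eta^2*||x||^2   because y*(w.x) <= 0.
bound = norm_sq_before + eta**2 * dot(x, x)
print(norm_sq_after, bound)  # 1.0 3.0
```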
The key insight: Each update makes a small correction that moves misclassified points toward the correct side, increases alignment with the optimal solution (if one exists), and does so in a controlled way that guarantees convergence for separable data.
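Putting the pieces together, here is a minimal sketch of the full training loop on a tiny, made-up linearly separable dataset: it repeats the update rule above until no point is misclassified.

```python
def train_perceptron(points, labels, eta=1.0, max_epochs=100):
    d = len(points[0])
    w, b = [0.0] * d, 0.0            # one shared weight vector and bias
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:       # misclassified: apply the update rule
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                errors += 1
        if errors == 0:              # converged: every point is correct
            break
    return w, b

# Tiny separable dataset: class +1 on the right, class -1 on the left.
points = [[2.0, 1.0], [3.0, 2.0], [-1.0, 1.0], [-2.0, -1.0]]
labels = [1, 1, -1, -1]
w, b = train_perceptron(points, labels)

all_correct = all(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0
                  for x, y in zip(points, labels))
print(all_correct)  # True
```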
⸻