
Chapter 4: Training Models



Introduction - Beyond Black Boxes

Up until now, as the book says, we’ve treated ML models and their training algorithms mostly like black boxes. We fed them data, they gave us results, and we learned to evaluate those results. You’ve optimized regression, improved classifiers, even built a spam filter, often without peeking under the hood. And that’s okay for many practical purposes!

However, understanding how these things work internally is incredibly powerful. It helps you:

  • Choose the right model and algorithm: Knowing the mechanics helps you match the tool to the task.
  • Select good hyperparameters: Hyperparameters often control the learning process itself. Understanding that process helps you tune them effectively.
  • Debug issues and perform error analysis: When things go wrong, or your model makes weird mistakes, knowing the “why” is crucial.
  • Foundation for advanced topics: Especially for neural networks (Part II of the book), the concepts here are fundamental.

This chapter focuses on Linear Regression as a starting point because it’s simple, yet we can train it in very different ways. We’ll explore two main approaches:

  1. A direct “closed-form” equation (The Normal Equation): This is like having a magic formula that directly spits out the best model parameters in one go.
  2. An iterative optimization approach (Gradient Descent): This is more like a trial-and-error process. We start with a guess for the parameters and gradually tweak them, step by step, to minimize the error, eventually (hopefully!) arriving at the same best parameters. We’ll see different “flavors” of Gradient Descent: Batch, Mini-batch, and Stochastic.

Then, the chapter will touch on:

  • Polynomial Regression: How to use linear models for non-linear data.
  • Learning Curves: Tools to diagnose overfitting or underfitting.
  • Regularization: Techniques to prevent overfitting.
  • Logistic Regression and Softmax Regression: Models commonly used for classification tasks.

The scorpion icon on page 112 gives a fair warning: there will be some math (linear algebra, calculus). If you’re “allergic,” the book suggests you can still get the concepts by focusing on the text. My job is to make sure you get those concepts, regardless of how comfortable you are with the equations. We’ll always ask: “What is this equation ultimately trying to achieve?”

Linear Regression - The Model

Remember our life satisfaction model from Chapter 1? life_satisfaction = θ₀ + θ₁ × GDP_per_capita. That was a simple linear regression with one feature.

More generally, a linear model predicts a value by:

  1. Taking all the input features (like a house’s square footage, number of bedrooms, age, etc.).
  2. Multiplying each feature by a specific weight (a model parameter).
  3. Summing up these weighted features.
  4. Adding a constant bias term (another model parameter, also called the intercept).

Equation 4-1 (page 112): Linear Regression model prediction ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ

  • ŷ (y-hat): The predicted value.
  • n: The number of features.
  • xᵢ: The value of the i-th feature.
  • θ₀: The bias term (theta-zero). What it’s ultimately trying to achieve: It’s the baseline prediction if all feature values were zero. It allows the line/plane to shift up or down.
  • θ₁ to θₙ: The feature weights. θⱼ is the weight for the j-th feature. What they’re ultimately trying to achieve: They represent how much a one-unit change in that feature xⱼ affects the predicted value ŷ, holding other features constant. A positive weight means the feature positively influences the prediction; a negative weight means it negatively influences it. The magnitude shows the strength of the influence.

Equation 4-2 (page 113): Vectorized form ŷ = h_θ(x) = θ · x

This is just a more compact way to write Equation 4-1 using vector notation.

  • θ (theta): Is now a parameter vector [θ₀, θ₁, ..., θₙ].
  • x: Is the instance’s feature vector [x₀, x₁, ..., xₙ]. Here, we add a “dummy” feature x₀ which is always set to 1. This allows us to include the bias term θ₀ neatly into the dot product (because θ₀ * x₀ = θ₀ * 1 = θ₀).
  • θ · x: This is the dot product of the two vectors. It’s exactly the sum θ₀x₀ + θ₁x₁ + ... + θₙxₙ.
  • h_θ(x): This is our hypothesis function (our model), parameterized by θ. Given an input x, it predicts ŷ.

The bird sidebar (page 113) explains that vectors are often column vectors (2D arrays with one column). So, if θ and x are column vectors, the dot product can be written as a matrix multiplication: ŷ = θᵀx (where θᵀ is the transpose of θ, making it a row vector). Don’t let this bog you down; it’s a notational convenience. The goal is the same: calculate a weighted sum of features plus a bias.
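To make Equations 4-1 and 4-2 concrete, here is a tiny sketch (the parameter and feature values are my own toy numbers, not from the book) showing that the written-out weighted sum and the dot product are the same computation:

```python
import numpy as np

# Hypothetical parameters: bias θ0 = 4, weight θ1 = 3
theta = np.array([4.0, 3.0])

# One instance with a single feature x1 = 2.0, plus the dummy feature x0 = 1
x = np.array([1.0, 2.0])

# Equation 4-1 written out term by term...
y_hat_sum = theta[0] * x[0] + theta[1] * x[1]

# ...and Equation 4-2, the same thing as a dot product
y_hat_dot = theta.dot(x)

assert y_hat_sum == y_hat_dot == 10.0  # 4·1 + 3·2
```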

How do we train it? Training means finding the values for the parameters θ (the bias θ₀ and the weights θ₁ to θₙ) that make the model “best fit” the training data.

To do this, we need a way to measure how well (or poorly) the model fits. We learned in Chapter 2 that for regression, a common measure is RMSE (Root Mean Square Error).

  • What it’s ultimately trying to achieve: Quantify the typical prediction error. However, for mathematical convenience during training, it’s often easier to minimize the MSE (Mean Squared Error) instead. Minimizing MSE will also minimize RMSE (since the square root function is monotonic). The footnote on page 113 is important: the function we optimize during training (the cost function, here MSE) might be different from the final performance metric we use to evaluate the model (e.g., RMSE). This is often because the cost function has nice mathematical properties (like being easily differentiable) that make optimization easier.

MSE Cost Function & The Normal Equation

Equation 4-3 (page 114): MSE cost function for a Linear Regression model MSE(X, h_θ) = (1/m) * Σᵢ (θᵀx⁽ⁱ⁾ - y⁽ⁱ⁾)² (summing from i=1 to m, where m is the number of instances)

  • What it’s ultimately trying to achieve:
    • For each training instance i:
      • θᵀx⁽ⁱ⁾ is the model’s prediction for that instance.
      • y⁽ⁱ⁾ is the actual target value.
      • (θᵀx⁽ⁱ⁾ - y⁽ⁱ⁾) is the error for that instance.
      • We square this error: (error)². (Why square? It makes all errors positive, and it penalizes larger errors more heavily).
    • We sum these squared errors over all m training instances: Σᵢ (error)².
    • We divide by m to get the mean of the squared errors. This function tells us, on average, how “bad” our model’s predictions are for a given set of parameters θ. Our goal in training is to find the θ that makes this MSE as small as possible.
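Here is Equation 4-3 as a few lines of NumPy (the three instances and the candidate θ are my own toy values), so you can see each piece of the formula in code:

```python
import numpy as np

# Toy data: 3 instances, 1 feature, plus the dummy x0 = 1 column
X_b = np.array([[1.0, 1.0],
                [1.0, 2.0],
                [1.0, 3.0]])
y = np.array([[6.0], [9.0], [11.0]])

# Hypothetical candidate parameters to evaluate
theta = np.array([[4.0], [3.0]])

m = len(X_b)
errors = X_b.dot(theta) - y    # (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) for every instance
mse = (errors ** 2).sum() / m  # Equation 4-3: mean of the squared errors

# predictions are 7, 10, 13; errors 1, 1, 2; mse == (1 + 1 + 4) / 3 == 2.0
```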

The Normal Equation: A Direct Solution

For Linear Regression with an MSE cost function, there’s a wonderful mathematical shortcut. Instead of iteratively searching for the best θ, there’s a direct formula that gives you the θ that minimizes the cost function in one shot! This is called the Normal Equation.

Equation 4-4 (page 114): Normal Equation θ̂ = (XᵀX)⁻¹ Xᵀy

  • θ̂ (theta-hat): This is the value of θ that minimizes the cost function.
  • X: The matrix of input features for all training instances (each row is an instance, x₀ for each instance is 1).
  • y: The vector of actual target values for all training instances.
  • Xᵀ: The transpose of matrix X.
  • (XᵀX)⁻¹: The inverse of the matrix XᵀX.

What it’s ultimately trying to achieve: This equation, derived using calculus (setting the derivative of the cost function to zero and solving for θ), directly calculates the optimal parameter vector θ̂ that makes the linear model fit the training data as closely as possible (in the mean squared error sense). It’s like a direct recipe: plug in your data X and y, and out pops the best θ.

Testing the Normal Equation

The book generates some linear-looking data (Figure 4-1): X = 2 * np.random.rand(100, 1) (100 instances, 1 feature) y = 4 + 3 * X + np.random.randn(100, 1) (True model is y = 4 + 3x₁ + noise) So, the ideal θ₀ is 4, and the ideal θ₁ is 3.

To use the Normal Equation, we need to add x₀ = 1 to each instance in X: X_b = np.c_[np.ones((100, 1)), X] (np.c_ concatenates arrays column-wise)

Now, apply the Normal Equation: theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y) The result is something like [[4.215...], [2.770...]]. So, θ₀̂ ≈ 4.215 and θ₁̂ ≈ 2.770. It’s close to the true values (4 and 3), but not exact because of the random noise we added to y. The noise makes it impossible to recover the exact original parameters.

We can then use this theta_best to make predictions (Figure 4-2).
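The book's snippets above, collected into one self-contained script (the random seed is my addition for reproducibility, so your exact numbers will differ slightly):

```python
import numpy as np

np.random.seed(42)  # my addition; the book uses unseeded random data
X = 2 * np.random.rand(100, 1)             # 100 instances, 1 feature
y = 4 + 3 * X + np.random.randn(100, 1)    # true model: y = 4 + 3x1 + noise

X_b = np.c_[np.ones((100, 1)), X]          # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # Equation 4-4

# Predictions at x1 = 0 and x1 = 2, the endpoints of the line in Figure 4-2
X_new_b = np.c_[np.ones((2, 1)), np.array([[0.0], [2.0]])]
y_predict = X_new_b.dot(theta_best)
```

With the noise added to y, theta_best lands near (but not exactly at) the true values θ₀ = 4 and θ₁ = 3.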

Scikit-Learn and Computational Complexity

  • Scikit-Learn LinearRegression: from sklearn.linear_model import LinearRegression lin_reg = LinearRegression() lin_reg.fit(X, y) Scikit-Learn handles adding the x₀=1 feature (or rather, it separates the bias term lin_reg.intercept_ from the feature weights lin_reg.coef_). The results are the same as the Normal Equation.

  • Underlying Method (scipy.linalg.lstsq): Scikit-Learn’s LinearRegression actually uses scipy.linalg.lstsq() (“least squares”). This function computes θ̂ = X⁺y, where X⁺ is the pseudoinverse (or Moore-Penrose inverse) of X. You can compute X⁺ using np.linalg.pinv(). The pseudoinverse is calculated using a technique called Singular Value Decomposition (SVD).

    • Why SVD/pseudoinverse instead of the direct Normal Equation (XᵀX)⁻¹ Xᵀy?
      1. More efficient: SVD is generally more computationally efficient.
      2. Handles edge cases: The Normal Equation requires XᵀX to be invertible. If it’s not (e.g., if you have more features than instances, m < n, or if some features are redundant/linearly dependent), the Normal Equation breaks down. The pseudoinverse is always defined, making SVD more robust.
  • Computational Complexity:

    • Normal Equation (inverting XᵀX): About O(n²·⁴) to O(n³) where n is the number of features. This gets very slow if you have many features (e.g., 100,000 features). Doubling features can increase time by 5-8x.
    • SVD (used by Scikit-Learn): About O(n²). Doubling features increases time by ~4x. Still slow for very large n.
    • Both are O(m) with respect to the number of instances m. So, they handle large numbers of training instances efficiently, as long as the data fits in memory.
    • Predictions: Once trained, making predictions is very fast: O(m) and O(n) – linear with number of instances and features.
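You can verify that the pseudoinverse route gives the same answer as the Normal Equation when XᵀX is invertible; a minimal sketch (seeded data is my own):

```python
import numpy as np

np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

theta_normal = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # Normal Equation
theta_pinv = np.linalg.pinv(X_b).dot(y)  # SVD-based pseudoinverse: θ̂ = X⁺y

# Same solution here; pinv keeps working even when XᵀX is singular
assert np.allclose(theta_normal, theta_pinv)
```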

The Problem with Normal Equation/SVD: They get slow with many features and require all data to be in memory. This leads us to the next method…

Gradient Descent - The Iterative Approach

When the Normal Equation is too slow (too many features) or the dataset is too large to fit in memory, we need a different approach. Enter Gradient Descent.

  • The Core Idea: Gradient Descent is a generic optimization algorithm. It iteratively tweaks model parameters to minimize a cost function.

    • Imagine you’re lost in a foggy mountain valley. You can only feel the slope of the ground under your feet. To get to the bottom, you’d take a step in the direction of the steepest downhill slope. Repeat.
    • This is Gradient Descent:
      1. It measures the local gradient of the cost function (e.g., MSE) with respect to the parameter vector θ. The gradient tells you the direction of steepest ascent.
      2. It takes a step in the opposite direction (descending gradient) to reduce the cost.
      3. Repeat until the gradient is zero (or very close), meaning you’ve reached a minimum.
  • The Process (Figure 4-3, page 118):

    Figure 4-3

    1. Random Initialization: Start with random values for θ.
    2. Iterative Improvement: In each step:
      • Calculate the gradient of the cost function at the current θ.
      • Update θ by taking a small step in the negative gradient direction.
    3. Convergence: Continue until the algorithm converges to a minimum (cost stops decreasing significantly).
  • Learning Rate (η - eta):

    • This is a crucial hyperparameter that determines the size of the steps.
    • Too small (Figure 4-4): Many iterations needed to converge (very slow).
      Figure 4-4
    • Too large (Figure 4-5): Might jump across the valley, diverge, and fail to find a solution.
      Figure 4-5
  • Challenges (Figure 4-6, page 119):

    • Local Minima: If the cost function has multiple minima (not a smooth bowl), GD might converge to a local minimum, which isn’t as good as the global minimum.
    • Plateaus: If the cost function has flat areas, GD can take a very long time to cross them.
    • Irregular Terrains: Holes, ridges make convergence difficult.
  • Good News for Linear Regression MSE:

    • The MSE cost function for Linear Regression is a convex function.
      • What this means: It’s shaped like a bowl. It has no local minima, only one global minimum.
      • It’s also continuous with a slope that doesn’t change abruptly.
    • Consequence: Gradient Descent is guaranteed to approach the global minimum if you wait long enough and the learning rate isn’t too high.
  • Feature Scaling Matters! (Figure 4-7, page 120):

    • If features have very different scales (e.g., feature 1 ranges 0-1, feature 2 ranges 0-1000), the cost function “bowl” becomes elongated.
    • GD will take a long, zig-zag path to the minimum.
    • Solution: Ensure all features have a similar scale (e.g., using StandardScaler). GD will then converge much faster.
  • Parameter Space: Training a model is essentially a search in the model’s parameter space for the combination of parameters that minimizes the cost. More parameters = higher dimensional space = harder search. For Linear Regression (convex cost), it’s like finding the bottom of a D-dimensional bowl.
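On the feature-scaling point, here's a minimal example of what StandardScaler does before you run Gradient Descent (the toy feature values are mine):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales (e.g., a 0-1 ratio and a 0-1000 count)
X = np.array([[0.2, 150.0],
              [0.8, 900.0],
              [0.5, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column now has mean 0 and unit variance, so the cost "bowl" is round
# and GD heads straight for the minimum instead of zig-zagging
```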

Batch Gradient Descent - BGD

To implement GD, we need the gradient of the cost function with respect to each model parameter θⱼ. This is the partial derivative ∂MSE(θ) / ∂θⱼ.

  • Equation 4-5 (page 121): Partial derivative of MSE w.r.t. θⱼ ∂MSE(θ)/∂θⱼ = (2/m) * Σᵢ (θᵀx⁽ⁱ⁾ - y⁽ⁱ⁾) * xⱼ⁽ⁱ⁾

    • What it’s ultimately trying to achieve: For each parameter θⱼ, it calculates how much the MSE would change if θⱼ changed a tiny bit.
      • (θᵀx⁽ⁱ⁾ - y⁽ⁱ⁾) is the prediction error for instance i.
      • We multiply this error by the value of the j-th feature of instance i, xⱼ⁽ⁱ⁾. (If xⱼ⁽ⁱ⁾ is large, θⱼ has a bigger impact on the prediction and thus on the error).
      • We average this product over all instances m.
  • Equation 4-6 (page 122): Gradient vector ∇_θ MSE(θ) ∇_θ MSE(θ) = (2/m) Xᵀ(Xθ - y)

    • This is the compact, vectorized way to compute all partial derivatives at once. ∇_θ MSE(θ) is a vector containing ∂MSE(θ)/∂θ₀, ∂MSE(θ)/∂θ₁, …, ∂MSE(θ)/∂θₙ.
    • What it’s ultimately trying to achieve: It gives the direction of steepest increase in the cost function.

    Crucial point for BATCH GD: This formula uses the entire training set X at each step to calculate the gradients. This is why it’s called Batch Gradient Descent.

    • Consequence: Terribly slow on very large training sets.
    • Advantage: Scales well with the number of features (unlike Normal Equation).
  • Equation 4-7 (page 122): Gradient Descent step θ⁽ⁿᵉˣᵗ ˢᵗᵉᵖ⁾ = θ - η ∇_θ MSE(θ)

    • What it’s ultimately trying to achieve: Update the current parameters θ by taking a step of size η (learning rate) in the direction opposite to the gradient (downhill).
  • Implementation (page 122): The code shows a loop: for iteration in range(n_iterations): gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y) theta = theta - eta * gradients With eta = 0.1 and n_iterations = 1000, the resulting theta is exactly what the Normal Equation found! Perfect.

  • Effect of Learning Rate η (Figure 4-8, page 123):

    • η = 0.02 (left): Too slow, many steps to converge.
    • η = 0.1 (middle): Good, converges quickly.
    • η = 0.5 (right): Too high, diverges, jumps around.
  • Finding a good learning rate: Grid search (Chapter 2).

  • Setting number of iterations: If too low, far from optimum. If too high, waste time after convergence.

    • Solution: Set many iterations, but stop when the gradient vector becomes tiny (its norm < ϵ, a small tolerance). This means we’re (almost) at the minimum.
  • Convergence Rate (sidebar, page 124): For convex cost functions like MSE, BGD with fixed η takes O(1/ϵ) iterations to reach within ϵ of the optimum. To get 10x more precision (divide ϵ by 10), you need ~10x more iterations.
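The book's Batch GD loop, assembled into a self-contained script (the data generation matches the earlier Normal Equation example; the seed is my addition):

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

eta = 0.1           # learning rate
n_iterations = 1000
theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)  # Equation 4-6
    theta = theta - eta * gradients                    # Equation 4-7

# theta now matches the Normal Equation's solution.
# To stop early instead, you could break when np.linalg.norm(gradients)
# falls below a small tolerance ϵ.
```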

Stochastic Gradient Descent - SGD

Batch GD is slow on large datasets because it uses all training data at each step.

  • Stochastic Gradient Descent (SGD):

    • At each step, picks one random instance from the training set and computes gradients based only on that single instance.
    • Advantages:
      • Much faster per step (very little data to process).
      • Can train on huge datasets (only one instance in memory at a time – good for out-of-core learning, Ch 1).
    • Disadvantages (Figure 4-9, page 124):
      • Stochastic nature: The path to the minimum is much more erratic (“bouncy”) than BGD. Cost function goes up and down, decreasing only on average.
      • Never settles: Once near the minimum, it keeps bouncing around, never perfectly settling. Final parameters are good, but not optimal.
    • Advantage of randomness: If cost function is irregular (non-convex, Figure 4-6), SGD’s randomness can help it jump out of local minima and find a better global minimum.
  • Learning Schedule (page 125):

    • To help SGD settle at the minimum, we can gradually reduce the learning rate η.
    • Start with large η (quick progress, escape local minima).
    • Make η smaller over time (settle at global minimum).
    • This is like simulated annealing in metallurgy.
    • The function determining η at each iteration is the learning schedule.
    • If η reduces too quickly -> stuck in local minimum or frozen half-way.
    • If η reduces too slowly -> bounce around minimum for long, or stop too early with suboptimal solution.
  • Implementation of SGD (page 125):

    • Outer loop for epochs (an epoch is one pass through the entire training set, by convention).
    • Inner loop for m iterations (number of instances). In each inner iteration:
      • Pick a random_index.
      • Get xi and yi for that instance.
      • Gradients calculated using only xi and yi: gradients = 2 * xi.T.dot(xi.dot(theta) - yi). (Note: 2 not 2/m because m=1 here).
      • Update eta using learning_schedule(epoch * m + i).
      • Update theta.
    • After 50 epochs (much fewer iterations than BGD’s 1000), it finds a theta very close to BGD’s.
    • Figure 4-10 (page 126) shows the irregular first 20 steps.
      Figure 4-10
  • Important note on SGD (sidebar, page 126):

    • Training instances must be independent and identically distributed (IID) for SGD to converge to global optimum on average.
    • Shuffle instances during training (pick randomly, or shuffle set at start of each epoch). If data is sorted (e.g., by label), SGD will optimize for one label, then the next, and not find global minimum.
  • SGD with Scikit-Learn (SGDRegressor, page 126):

    • Defaults to optimizing MSE.
    • sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
      • max_iter: max epochs.
      • tol: stop if loss drops by less than this in one epoch.
      • penalty=None: no regularization (more later).
      • eta0: initial learning rate. Uses its own default learning schedule.
    • sgd_reg.fit(X, y.ravel()) (.ravel() flattens y into 1D array, often needed by Scikit-Learn).
    • Resulting intercept and coefficient are very close to Normal Equation’s.
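The manual SGD loop described above (page 125) as a self-contained script; the t0/t1 learning-schedule values follow the book, while the data seed is my addition:

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)  # η shrinks as training progresses

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)       # pick one random instance
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)  # m = 1, so no 1/m factor
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```

After 50 epochs theta ends up close to the Batch GD / Normal Equation solution, despite each step using only one instance.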

Mini-batch Gradient Descent

A compromise between Batch GD and Stochastic GD.

  • At each step, computes gradients on small random sets of instances called mini-batches.
  • Main advantage over SGD: Performance boost from hardware optimization of matrix operations (especially on GPUs).
  • Behavior (Figure 4-11, page 127):
    Figure 4-11
    • Less erratic path than SGD.
    • Ends up closer to the minimum than SGD (with fixed η).
    • But may be harder to escape local minima (for non-convex problems) compared to SGD.
    • All three (Batch, Stochastic, Mini-batch) end up near the minimum. Batch stops at the minimum. SGD and Mini-batch would also reach minimum with a good learning schedule.
    • But Batch GD takes much longer per step.
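The text doesn't print mini-batch code here, so this is my own minimal sketch (the batch size, learning rate, and seed are arbitrary choices): each step averages gradients over a small random slice instead of one instance or the whole set.

```python
import numpy as np

np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

n_epochs = 50
batch_size = 20  # assumption: a typical small mini-batch size
eta = 0.1        # fixed learning rate, kept simple for the sketch

theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)  # shuffle each epoch (IID assumption)
    for start in range(0, m, batch_size):
        idx = shuffled[start:start + batch_size]
        xi, yi = X_b[idx], y[idx]
        gradients = 2 / len(idx) * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients
```

The matrix multiplication over 20 rows at a time is where the hardware-optimization advantage over SGD comes from.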

Comparison Table 4-1

This table nicely summarizes Normal Equation, SVD, Batch GD, Stochastic GD, Mini-batch GD for Linear Regression based on:

  • Large m (instances): Normal Eq/SVD are fast if data fits memory. GD methods are also fast (SGD/Mini-batch can do out-of-core).
  • Out-of-core support: Only SGD/Mini-batch.
  • Large n (features): Normal Eq/SVD are slow. GD methods are fast.
  • Hyperparameters: Normal Eq/SVD have 0. BGD has 2 (η, iterations). SGD/Mini-batch have >=2 (η, iterations, schedule params).
  • Scaling required: No for Normal Eq/SVD. Yes for GD methods.
  • Scikit-Learn class: LinearRegression for SVD. SGDRegressor for GDs.
    Table 4-1

The key is that after training, all these methods (if converged) produce very similar models and make predictions in the same way. The difference is in how they get there (the training process).

Excellent! Glad you’re with me. Let’s push on to the next sections of Chapter 4. We’ve just laid the groundwork for how Linear Regression models are trained. Now, let’s see how we can adapt these ideas for more complex scenarios.

Polynomial Regression

So far, Linear Regression assumes a straight-line (or flat plane/hyperplane) relationship between features and the target. But what if your data is more complex, like the curved data in Figure 4-12 (page 129)?

Figure 4-12
The book shows data generated by a simple quadratic equation: y = 0.5 * X² + X + 2 + noise. Clearly, a straight line won’t fit this well.

This is where Polynomial Regression comes in.

  • The Core Idea: You can still use a linear model to fit nonlinear data!

  • How? By adding powers of each feature as new features, and then training a linear model on this extended set of features.

    • For example, if you have one feature X, you can create a new feature X². Then your linear model becomes ŷ = θ₀ + θ₁X + θ₂X². Even though the equation is quadratic in X, it’s linear in terms of the parameters θ₀, θ₁, θ₂ and the features X and X².
  • Scikit-Learn’s PolynomialFeatures (page 129): This class transforms your data. from sklearn.preprocessing import PolynomialFeatures poly_features = PolynomialFeatures(degree=2, include_bias=False)

    • degree=2: We want to add 2nd-degree polynomial features (i.e., squares).
    • include_bias=False: PolynomialFeatures can add a column of 1s (for the bias term), but LinearRegression handles that, so we set it to False to avoid redundancy. X_poly = poly_features.fit_transform(X) If X[0] was [-0.75], then X_poly[0] becomes [-0.75, 0.566] (original X, and X²).
  • Train a Linear Model on Extended Data: Now, you just fit a standard LinearRegression model to this X_poly: lin_reg = LinearRegression() lin_reg.fit(X_poly, y) The model predictions are shown in Figure 4-13 (page 130). It’s a nice curve that fits the data much better than a straight line! The estimated equation (e.g., ŷ = 0.56x₁² + 0.93x₁ + 1.78) is close to the original y = 0.5x₁² + 1.0x₁ + 2.0 + noise.

    Figure 4-13

  • Multiple Features and Combinations (page 130): If you have multiple features (say, a and b) and use PolynomialFeatures(degree=3), it will add a², a³, b², b³, AND also the combination terms like ab, a²b, ab². This allows the model to find relationships between features.

  • Warning: Combinatorial Explosion! (Scorpion icon, page 130) The number of features can explode if you have many original features and use a high polynomial degree. The formula is (n+d)! / (d!n!) where n is original features, d is degree. Be careful! This can make the model very slow and prone to overfitting.
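Pulling the book's polynomial-regression snippets together into one runnable script, plus a quick check of the combinatorial explosion (the seed and the 10-feature demo at the end are my additions):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(100, 1)  # quadratic data + noise

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)  # columns: [x1, x1²]

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
# intercept_ lands near 2, coef_ near [1, 0.5], matching the generating equation

# Combinatorial explosion: (n+d)!/(d!n!) terms for n features at degree d.
# With n = 10 and d = 3, that's 286 terms, or 285 without the bias column:
X_demo = PolynomialFeatures(degree=3, include_bias=False).fit_transform(
    np.zeros((1, 10)))
print(X_demo.shape[1])  # 285 features from just 10 originals
```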

Learning Curves

With high-degree Polynomial Regression, you can fit the training data very well, maybe too well. Figure 4-14 (page 131) shows:

Figure 4-14

  • A 300-degree polynomial model: Wiggles wildly to hit every training point. This is severe overfitting.
  • A plain linear model: Misses the curve. This is underfitting.
  • A quadratic model (2nd degree): Fits best, which makes sense as the data was generated quadratically.

But how do you know in general if your model is too complex (overfitting) or too simple (underfitting), especially if you don’t know the true underlying function?

  1. Cross-Validation (from Chapter 2):

    • If model performs well on training data but poorly on cross-validation, it’s overfitting.
    • If it performs poorly on both, it’s underfitting.
  2. Learning Curves (page 131):

    • What they are: Plots of the model’s performance (e.g., RMSE) on the training set and the validation set, as a function of the training set size (or training iteration).

    • How to generate them: Train the model multiple times on different-sized subsets of the training set.

    • The book provides a plot_learning_curves function (page 132).

    • Learning Curves for an Underfitting Model (e.g., plain Linear Regression on the quadratic data - Figure 4-15, page 132):

      Figure 4-15

      • Training error: Starts at 0 (perfect fit with 1-2 instances), then rises as more noisy/nonlinear data is added, eventually plateauing.
      • Validation error: Starts very high (model trained on few instances generalizes poorly), then decreases as model learns from more examples, eventually plateauing close to the training error.
      • Key characteristics of underfitting curves: Both curves plateau, they are close together, and the error is fairly high.
      • What it’s ultimately trying to achieve: The plot tells you that the model is too simple. Adding more training examples will not help if the curves have plateaued (as the scorpion icon on page 133 says). You need a more complex model or better features.
    • Learning Curves for an Overfitting Model (e.g., 10th-degree polynomial on the quadratic data - Figure 4-16, page 133):

      • Training error: Much lower than the Linear Regression model’s training error. It fits the training data well.
      • Validation error: There’s a gap between the training error and the validation error. The model performs significantly better on the training data than on the validation data. This is the hallmark of overfitting.
      • If you had a much larger training set, these two curves would eventually get closer.
      • What it’s ultimately trying to achieve: The plot tells you the model is too complex for the current amount of data. One way to improve an overfitting model is to feed it more training data (as the scorpion icon on page 134 suggests), until the validation error meets the training error (or use regularization, which we’ll see next).
  • The Bias/Variance Trade-off (sidebar, page 134): This is a fundamental concept in statistics and ML. A model’s generalization error (how well it performs on unseen data) can be expressed as the sum of three components:

    1. Bias:
      • Error due to wrong assumptions made by the model. E.g., assuming data is linear when it’s quadratic.
      • A high-bias model is likely to underfit.
      • What it’s ultimately trying to achieve (in a bad way): It has a strong preconceived notion of how the data should look, and it sticks to it even if the data says otherwise.
    2. Variance:
      • Error due to the model’s excessive sensitivity to small variations in the training data. A model with many degrees of freedom (like a high-degree polynomial) can have high variance.
      • A high-variance model is likely to overfit. It learns the noise in the training data, not just the signal.
      • What it’s ultimately trying to achieve (in a bad way): It tries to fit every little nook and cranny of the training data, making it unstable and perform poorly on new, slightly different data.
    3. Irreducible Error:
      • Error due to the inherent noisiness of the data itself.
      • This part cannot be reduced by model changes; only by cleaning the data (e.g., fixing broken sensors, removing outliers).

    The Trade-off:

    • Increasing model complexity typically increases variance (more likely to overfit) and reduces bias (better fit to complex patterns).
    • Decreasing model complexity typically increases bias (more likely to underfit) and reduces variance. The goal is to find a sweet spot, balancing bias and variance.
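The learning-curve procedure described above can be sketched as the book's plot_learning_curves function minus the matplotlib plotting; this version just computes the two error curves (the quadratic data generation and seed are my stand-ins):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X ** 2 + X + 2 + np.random.randn(100, 1)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

train_errors, val_errors = [], []
model = LinearRegression()  # deliberately too simple for quadratic data
for m in range(1, len(X_train)):
    model.fit(X_train[:m], y_train[:m])           # retrain on growing subset
    y_train_pred = model.predict(X_train[:m])
    y_val_pred = model.predict(X_val)
    train_errors.append(mean_squared_error(y_train[:m], y_train_pred))
    val_errors.append(mean_squared_error(y_val, y_val_pred))

# Plot np.sqrt(train_errors) and np.sqrt(val_errors) against m to get
# Figure 4-15's RMSE curves: training error starts at 0 and rises,
# validation error falls, and both plateau at a fairly high value.
```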

Regularized Linear Models

We saw that overfitting is a problem with complex models. Regularization is a way to reduce overfitting by constraining the model.

  • What it’s ultimately trying to achieve: For linear models, this usually means constraining the model’s weights. The idea is to prevent the weights from becoming too large, which can happen when the model tries too hard to fit the noise in the training data. Smaller weights generally lead to simpler, smoother models that generalize better.

The book discusses three types of regularized linear models: Ridge, Lasso, and Elastic Net.

  • Ridge Regression (also Tikhonov regularization) (page 135):

    • It adds a regularization term to the MSE cost function.
    • Equation 4-8: Ridge Regression cost function J(θ) = MSE(θ) + α * (1/2) * Σᵢ (θᵢ)² (sum from i=1 to n, so bias term θ₀ is NOT regularized)
      • The regularization term is α * (1/2) * Σᵢ (θᵢ)². This is α/2 times the sum of the squares of the feature weights. This is related to the ℓ₂ norm of the weight vector w = [θ₁, ..., θₙ], specifically (1/2) * ||w||₂².
      • α (alpha) is a hyperparameter that controls how much you want to regularize.
        • If α = 0, it’s just Linear Regression.
        • If α is very large, all weights θᵢ (for i>0) end up very close to zero, and the model becomes a flat line through the data’s mean.
      • What it’s ultimately trying to achieve: The learning algorithm is now forced to not only fit the data (minimize MSE) but also keep the model weights as small as possible.
    • Important: Scale the data (e.g., StandardScaler) before performing Ridge Regression, as it’s sensitive to the scale of input features (scorpion icon, page 136). This is true for most regularized models.
    • Figure 4-17 (page 136) shows the effect of α:
      Figure 4-17
      • Left: Plain Ridge Regression on linear data. Higher α makes predictions flatter.
      • Right: Polynomial features (degree=10) + scaling + Ridge. Increasing α leads to flatter (less extreme, more reasonable) predictions, reducing variance but increasing bias.
    • Training Ridge Regression:
      • Closed-form solution (Equation 4-9, page 137): θ̂ = (XᵀX + αA)⁻¹ Xᵀy (where A is an identity matrix with a 0 at the top-left for the bias term). Scikit-Learn’s Ridge(alpha=..., solver="cholesky") uses this.
      • Gradient Descent (page 137): Add αw to the MSE gradient vector (where w is the vector of weights, excluding bias). Scikit-Learn’s SGDRegressor(penalty="l2") does this. The "l2" means add a regularization term equal to half the square of the ℓ₂ norm of the weights.
  • Lasso Regression (Least Absolute Shrinkage and Selection Operator) (page 137):

    • Also adds a regularization term to the MSE cost function, but uses the ℓ₁ norm of the weight vector.
    • Equation 4-10: Lasso Regression cost function J(θ) = MSE(θ) + α * Σᵢ |θᵢ| (sum from i=1 to n)
      • The regularization term is α times the sum of the absolute values of the weights.
    • Figure 4-18 (page 138) shows Lasso models, similar to Figure 4-17 for Ridge, but with smaller α values.
      Figure 4-18
    • Key characteristic of Lasso: It tends to completely eliminate the weights of the least important features (i.e., set them to zero). It automatically performs feature selection and outputs a sparse model (a model with few non-zero feature weights).
    • Why does Lasso do this? (Figure 4-19, page 139):
      • This is a bit more advanced, but the intuition comes from looking at the “shape” of the ℓ₁ penalty vs. the ℓ₂ penalty.
      • The ℓ₁ penalty |θ₁| + |θ₂| has “corners” along the axes in the parameter space. When minimizing the combined MSE + ℓ₁ penalty, the optimization path often hits these corners, forcing one of the parameters to zero.
      • The ℓ₂ penalty θ₁² + θ₂² is circular. The optimization path approaches the origin smoothly, shrinking weights but not usually making them exactly zero.
      • What it’s ultimately trying to achieve: The ℓ₁ penalty encourages sparsity by pushing less important feature weights all the way to zero.
    • Training Lasso Regression:
      • The Lasso cost function is not differentiable at θᵢ = 0. However, Gradient Descent can still work if you use a subgradient vector (Equation 4-11, page 140). This is a technical detail; the main idea is that an iterative approach can still find the minimum.
      • Scikit-Learn: Lasso(alpha=...) or SGDRegressor(penalty="l1").
      • When using Lasso with GD, you often need to gradually reduce the learning rate to help it converge without bouncing around the optimum (due to the “sharp corners” of the ℓ₁ penalty).
  • Elastic Net (page 140):

    • A middle ground between Ridge and Lasso. Its regularization term is a simple mix of both Ridge (ℓ₂) and Lasso (ℓ₁) terms.
    • Equation 4-12: Elastic Net cost function J(θ) = MSE(θ) + rα Σᵢ|θᵢ| + ((1-r)/2)α Σᵢθᵢ²
      • r is the mix ratio.
        • If r = 0, Elastic Net is equivalent to Ridge.
        • If r = 1, Elastic Net is equivalent to Lasso.
    • Scikit-Learn: ElasticNet(alpha=..., l1_ratio=...) (where l1_ratio is r).
  • When to use which? (page 140):

    • Plain Linear Regression (no regularization) should generally be avoided. Some regularization is almost always better.
    • Ridge is a good default.
    • If you suspect only a few features are useful, prefer Lasso or Elastic Net because they perform feature selection by shrinking useless feature weights to zero.
    • Elastic Net is generally preferred over Lasso because Lasso can behave erratically when the number of features is greater than the number of training instances, or when several features are strongly correlated. Elastic Net is more stable in these cases.
  • Early Stopping (page 141):

    • A very different way to regularize iterative learning algorithms like Gradient Descent.
    • The Idea: Stop training as soon as the validation error reaches a minimum.
    • Figure 4-20 (page 141) shows a complex model (high-degree polynomial) being trained with Batch GD:
      • As epochs go by, training error (RMSE) goes down.
      • Validation error also goes down initially, but then starts to go back up. This indicates the model has started to overfit.
      • With early stopping, you just stop training when the validation error is at its minimum.
    • It’s a simple and efficient regularization technique. Geoffrey Hinton called it a “beautiful free lunch.”
    • Implementation (page 142):
      • Loop for epochs.
      • In each epoch, train the model (e.g., SGDRegressor with warm_start=True so it continues training, max_iter=1 so it does one epoch).
      • Evaluate on validation set.
      • If validation error is less than current minimum_val_error, save the model (clone it) and update minimum_val_error and best_epoch.
      • After the loop, the best_model is your regularized model.
    • With SGD/Mini-batch GD, validation curves are noisy. You might stop only after validation error has been above minimum for a while, then roll back to the best model.
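The early-stopping loop described above can be sketched as follows. This is a minimal sketch, not the book's exact code: the toy quadratic data, learning rate, and epoch count are my own choices, and the best model is saved with `deepcopy` rather than Scikit-Learn's `clone`.

```python
from copy import deepcopy

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# Toy data (my own): a noisy quadratic, split into train and validation sets
rng = np.random.default_rng(42)
X_train = 6 * rng.random((80, 1)) - 3
y_train = 0.5 * X_train[:, 0] ** 2 + X_train[:, 0] + 2 + rng.normal(size=80)
X_val = 6 * rng.random((20, 1)) - 3
y_val = 0.5 * X_val[:, 0] ** 2 + X_val[:, 0] + 2 + rng.normal(size=20)

# warm_start=True: each fit() continues from the previous weights
# max_iter=1: exactly one epoch per fit() call; tol=None: no internal stopping
sgd_reg = SGDRegressor(max_iter=1, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.005,
                       tol=None, random_state=42)

minimum_val_error = float("inf")
best_epoch, best_model = None, None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)          # continues where it left off
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:      # new best: remember it
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)     # snapshot of the best model so far
```

After the loop, `best_model` is the model frozen at the epoch where validation error bottomed out, which is exactly the "stop at the minimum" idea.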

Fantastic! It’s great that the intuition behind L1 and L2 regularization clicked. Let’s continue our journey through Chapter 4, moving on to models designed for classification.

Logistic Regression

We’ve seen Linear Regression for predicting continuous values. Now, what if we want to predict a class? For example, is an email spam or not spam? This is a binary classification problem.

The book points out (as we saw in Chapter 1) that some regression algorithms can be adapted for classification. Logistic Regression (also called Logit Regression) is a prime example.

  • What it’s ultimately trying to achieve: Logistic Regression estimates the probability that an instance belongs to a particular class (typically the “positive” class, labeled ‘1’). If this probability is greater than a certain threshold (usually 50%), the model predicts ‘1’; otherwise, it predicts ‘0’.

  • Estimating Probabilities (Page 143):

    1. Linear Combination: Just like Linear Regression, it first computes a weighted sum of the input features plus a bias term: t = xᵀθ. (This t is often called the logit).
    2. Logistic Function (Sigmoid): Instead of outputting t directly, it passes t through the logistic function (also called the sigmoid function), denoted as σ(t).
      • Equation 4-13: Logistic Regression model estimated probability (vectorized form): p̂ = h_θ(x) = σ(xᵀθ), where p̂ (p-hat) is the estimated probability that the instance x belongs to the positive class.
      • Equation 4-14: Logistic function σ(t) = 1 / (1 + exp(-t))
      • Figure 4-21 (page 143) shows the characteristic S-shape of the sigmoid function.
        • What the sigmoid is ultimately trying to achieve: It squashes any input value t (which can range from -∞ to +∞) into an output value between 0 and 1. This output can then be interpreted as a probability.
          • If t is large and positive, exp(-t) is close to 0, so σ(t) is close to 1.
          • If t is large and negative, exp(-t) is very large, so σ(t) is close to 0.
          • If t = 0, exp(-t) = 1, so σ(t) = 1/2 = 0.5.
  • Making Predictions (Page 143):

    • Equation 4-15: Logistic Regression model prediction: ŷ = 0 if p̂ < 0.5; ŷ = 1 if p̂ ≥ 0.5.
    • Since σ(t) ≥ 0.5 when t ≥ 0, the model predicts 1 if xᵀθ (the logit) is positive, and 0 if it’s negative.
  • Training and Cost Function (Page 144):

    • Objective: We want to set the parameter vector θ so that the model estimates a high probability for positive instances (actual y=1) and a low probability for negative instances (actual y=0).
    • Cost Function for a Single Instance (Equation 4-16): c(θ) = -log(p̂) if y = 1; c(θ) = -log(1 - p̂) if y = 0.
      • What this cost function is ultimately trying to achieve:
        • If y=1 (actual is positive):
          • If model predicts close to 1 (correct), -log(p̂) is close to 0 (low cost).
          • If model predicts close to 0 (incorrect), -log(p̂) is very large (high cost).
        • If y=0 (actual is negative):
          • If model predicts close to 0 (so 1-p̂ is close to 1, correct), -log(1-p̂) is close to 0 (low cost).
          • If model predicts close to 1 (so 1-p̂ is close to 0, incorrect), -log(1-p̂) is very large (high cost). This cost function penalizes the model heavily when it’s confident and wrong.
    • Cost Function for the Whole Training Set (Log Loss - Equation 4-17): J(θ) = - (1/m) * Σᵢ [ y⁽ⁱ⁾log(p̂⁽ⁱ⁾) + (1 - y⁽ⁱ⁾)log(1 - p̂⁽ⁱ⁾) ] This is just the average cost over all training instances. It’s a single, clever expression that combines the two cases from Equation 4-16.
    • Good news: This log loss cost function is convex. So, Gradient Descent (or other optimization algorithms) can find the global minimum.
    • Bad news: There’s no closed-form solution (like the Normal Equation for Linear Regression) to find the θ that minimizes this cost function. We must use an iterative optimization algorithm like Gradient Descent.
    • Partial Derivatives (Equation 4-18, page 145): ∂J(θ)/∂θⱼ = (1/m) * Σᵢ (σ(θᵀx⁽ⁱ⁾) - y⁽ⁱ⁾) * xⱼ⁽ⁱ⁾
      • This looks very similar to the partial derivative for Linear Regression’s MSE (Equation 4-5)!
      • σ(θᵀx⁽ⁱ⁾) is the predicted probability p̂⁽ⁱ⁾.
      • (p̂⁽ⁱ⁾ - y⁽ⁱ⁾) is the prediction error.
      • This error is multiplied by the feature value xⱼ⁽ⁱ⁾ and averaged. Once you have these gradients, you can use Batch GD, Stochastic GD, or Mini-batch GD to find the optimal θ.
  • Decision Boundaries (Page 145-147): Let’s use the Iris dataset to illustrate. We’ll try to classify Iris virginica based only on petal width.

    • Load data: X = iris["data"][:, 3:] (petal width), y = (iris["target"] == 2).astype(int) (1 if virginica, else 0).
    • Train a LogisticRegression():
      log_reg = LogisticRegression()
      log_reg.fit(X, y)
    • Figure 4-23 (page 146): Shows estimated probabilities vs. petal width.
      • The S-shape is clear.
      • For petal widths > ~2cm, probability of being Iris virginica is high.
      • For petal widths < ~1cm, probability is low.
      • The decision boundary (where p̂ = 0.5) is around 1.6 cm. If petal width > 1.6cm, it predicts virginica; otherwise, not virginica.
    • Figure 4-24 (page 147): Shows decision boundary using two features (petal width and petal length).
      • The dashed line is where the model estimates 50% probability – this is the linear decision boundary.
      • Other parallel lines show other probability contours (e.g., 15%, 90%).
    • Regularization: Logistic Regression models in Scikit-Learn use ℓ₂ regularization by default. The hyperparameter is C (inverse of α): higher C means less regularization.
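Pulling the pieces together (Equations 4-13, 4-14, and 4-18), here is a from-scratch Batch Gradient Descent sketch for Logistic Regression. The one-feature toy dataset, learning rate, and iteration count are my own, purely for illustration:

```python
import numpy as np

def sigmoid(t):
    # Equation 4-14: squashes any real t into (0, 1)
    return 1 / (1 + np.exp(-t))

# Toy binary data (my own): one feature, classes split around x = 1.5 with noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 1.5).astype(float)

X_b = np.c_[np.ones((len(X), 1)), X]   # add bias feature x0 = 1
theta = np.zeros(2)
eta = 0.5                              # learning rate

for _ in range(2000):
    p_hat = sigmoid(X_b @ theta)                    # Equation 4-13: estimated probabilities
    gradients = X_b.T @ (p_hat - y) / len(X_b)      # Equation 4-18: error times features, averaged
    theta -= eta * gradients                        # Batch GD step

# The model predicts 1 when x^T θ ≥ 0, so the decision boundary is at x = -θ0/θ1
boundary = -theta[0] / theta[1]
```

The learned boundary `-θ₀/θ₁` should land near the threshold of 1.5 used to generate the labels, mirroring how the book's petal-width model ends up with a boundary around 1.6 cm.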

Softmax Regression

Logistic Regression is for binary classification. What if we have more than two classes, and we want a model that handles them directly (not with OvR/OvO strategies)? Enter Softmax Regression (or Multinomial Logistic Regression).

  • The Idea:

    1. For a given instance x, compute a score sₖ(x) for each class k. This is done just like Linear Regression: sₖ(x) = xᵀθ⁽ᵏ⁾ (Equation 4-19). Each class k has its own dedicated parameter vector θ⁽ᵏ⁾. These are often stored as rows in a parameter matrix Θ.
    2. Estimate the probability p̂ₖ that the instance belongs to class k by applying the softmax function (also called normalized exponential) to the scores.
      • Equation 4-20: Softmax function p̂ₖ = σ(s(x))ₖ = exp(sₖ(x)) / Σⱼ exp(sⱼ(x)) (sum over all classes j=1 to K)
        • What it’s ultimately trying to achieve: It takes a vector of arbitrary scores s(x) for all classes, computes the exponential of each score (making them all positive), and then normalizes them by dividing by their sum. The result is a set of probabilities (p̂ₖ) that are all between 0 and 1 and sum up to 1 across all classes. The class with the highest initial score sₖ(x) will get the highest probability p̂ₖ.
  • Prediction (Equation 4-21, page 149): The classifier predicts the class k that has the highest estimated probability p̂ₖ (which is simply the class with the highest score sₖ(x)). ŷ = argmaxₖ p̂ₖ

    • Softmax Regression predicts only one class at a time (mutually exclusive classes). It’s multiclass, not multioutput.
  • Training and Cost Function (Cross Entropy - page 149):


    • Objective: Estimate a high probability for the target class and low probabilities for other classes.
    • Cost Function (Cross Entropy - Equation 4-22): J(Θ) = - (1/m) * Σᵢ Σₖ yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾)
      • yₖ⁽ⁱ⁾ is the target probability that instance i belongs to class k (usually 1 if it’s the target class, 0 otherwise).
      • p̂ₖ⁽ⁱ⁾ is the model’s estimated probability that instance i belongs to class k.
      • What it’s ultimately trying to achieve: This cost function penalizes the model when it estimates a low probability for the actual target class. It’s a common measure for how well a set of estimated class probabilities matches the target classes.
      • The sidebar on “Cross Entropy” (page 150) gives some information theory background – it measures the average number of bits needed to encode events based on your probability estimates vs. the true probabilities. Lower is better.
      • When there are only two classes (K=2), this cross-entropy cost function is equivalent to the log loss for Logistic Regression.
    • Gradient Vector (Equation 4-23, page 150): ∇_θ⁽ᵏ⁾ J(Θ) = (1/m) * Σᵢ (p̂ₖ⁽ⁱ⁾ - yₖ⁽ⁱ⁾) * x⁽ⁱ⁾ This gives the gradient for the parameter vector θ⁽ᵏ⁾ of a specific class k. Again, very similar form to previous gradient equations! You compute this for every class, then use an optimization algorithm (like GD) to find the parameter matrix Θ that minimizes the cost.
  • Using Softmax Regression in Scikit-Learn (page 150):

    • LogisticRegression can perform Softmax Regression by setting:
      • multi_class="multinomial"
      • solver="lbfgs" (or another solver that supports multinomial, like “sag” or “newton-cg”)
      • It also applies ℓ₂ regularization by default (controlled by C).
    • Example: Classify Iris flowers into all 3 classes using petal length and width:
      X = iris["data"][:, (2, 3)]
      y = iris["target"]
      softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
      softmax_reg.fit(X, y)
    • To predict: softmax_reg.predict([[5, 2]]) might give array([2]) (Iris virginica).
    • To get probabilities: softmax_reg.predict_proba([[5, 2]]) might give [[6.3e-07, 0.057, 0.942]], meaning 94.2% prob for class 2 (virginica), 5.7% for class 1 (versicolor), and near 0% for class 0 (setosa).
    • Figure 4-25 (page 151) shows the decision boundaries. They are linear between any two classes.
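To connect the Scikit-Learn call back to Equations 4-19, 4-20, and 4-23, here is a from-scratch Batch GD sketch of Softmax Regression on the same two Iris features. The standardization step, learning rate, and iteration count are my own choices (the book simply uses LogisticRegression):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]                  # petal length, petal width
y = iris["target"]
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize to keep plain GD stable

X_b = np.c_[np.ones((len(X), 1)), X]         # add bias feature x0 = 1
m, K = len(X_b), 3
Y_onehot = np.eye(K)[y]                      # targets y_k^(i): one-hot rows

Theta = np.zeros((X_b.shape[1], K))          # one parameter vector θ^(k) per class (columns)
eta = 0.5

for _ in range(2000):
    scores = X_b @ Theta                                        # Equation 4-19: s_k(x)
    exp_s = np.exp(scores - scores.max(axis=1, keepdims=True))  # shift scores to avoid overflow
    P_hat = exp_s / exp_s.sum(axis=1, keepdims=True)            # Equation 4-20: softmax
    gradients = X_b.T @ (P_hat - Y_onehot) / m                  # Equation 4-23: per-class gradients
    Theta -= eta * gradients                                    # Gradient Descent step

accuracy = (P_hat.argmax(axis=1) == y).mean()
```

Each row of `P_hat` sums to 1, and predicting the argmax class recovers most of the training labels, matching the near-linear separability visible in Figure 4-25.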

And that brings us to the end of the core content of Chapter 4! We’ve gone from simple Linear Regression to Polynomial Regression, learned about diagnosing model fit with learning curves, seen how to regularize models to prevent overfitting (Ridge, Lasso, Elastic Net, Early Stopping), and finally explored Logistic and Softmax Regression for classification tasks.

The recurring theme for training, especially for models without closed-form solutions, is defining a cost function that captures how “bad” our model is, and then using an iterative algorithm like Gradient Descent to find the model parameters that minimize this cost. The specific form of the cost function and how predictions are made change from model to model, but the underlying optimization principle is often very similar.

This chapter is dense, but the concepts are absolutely key for understanding how models actually learn. How are you feeling about Logistic and Softmax Regression? Any particular part of “what they are trying to achieve” that needs more light?


Glossary

The difference between L1 (Lasso) and L2 (Ridge) regularization: let’s break down the intuition about the “shapes” and “corners.”

Imagine we have a model with just two weights, θ₁ and θ₂. We want to find the values of θ₁ and θ₂ that minimize our main cost function (let’s say MSE), subject to some penalty on the size of these weights.

Visualizing the Penalties (Loss Functions for Weights)

Think of the penalty term as defining a “budget” or a “constraint region” for our weights. The optimization process is trying to find the best MSE it can, while staying within or close to this budget defined by the penalty.

  1. L₂ Penalty (Ridge): θ₁² + θ₂² ≤ C

    • The equation θ₁² + θ₂² = constant describes a circle (or a sphere/hypersphere in higher dimensions) centered at the origin (θ₁=0, θ₂=0).
    • So, the L2 penalty encourages the weights (θ₁, θ₂) to stay within a circular region around the origin.
    • Shape: Smooth and round. No sharp corners.
    • Imagine the contours of this penalty function: they are concentric circles.
  2. L₁ Penalty (Lasso): |θ₁| + |θ₂| ≤ C

    • The equation |θ₁| + |θ₂| = constant describes a diamond (or a rotated square in 2D, and a similar shape with “pointy” corners on the axes in higher dimensions).
    • So, the L1 penalty encourages the weights (θ₁, θ₂) to stay within a diamond-shaped region around the origin.
    • Shape: Has sharp corners that lie on the axes. For our 2D example, the corners are at points like (C, 0), (-C, 0), (0, C), and (0, -C).
    • Imagine the contours of this penalty function: they are concentric diamonds.

Visualizing the Optimization Process (Figure 4-19)

Now, let’s consider the main cost function, the MSE. The contours of the MSE (if we ignore the penalty for a moment) are typically elliptical. The center of these ellipses is the point where MSE is minimized without any regularization – let’s call this the “unconstrained optimum.”

The regularization process is trying to find a point that: a. Is on the lowest possible MSE contour (meaning good fit to the data). b. Satisfies the “budget” imposed by the L1 or L2 penalty (meaning weights are small).

This can be visualized as finding the point where an MSE ellipse first “touches” the boundary of the penalty region (the circle for L2, the diamond for L1).

  • With L₂ Penalty (Ridge - bottom-right plot in Figure 4-19):

    • Imagine an expanding MSE ellipse (as we try to get lower MSE) until it just kisses the circular L2 penalty region.
    • Because the circle is smooth and round, the point where the ellipse touches it can be anywhere on the circle’s boundary.
    • It’s very unlikely that this touch point will be exactly on an axis (e.g., where θ₁ = 0 or θ₂ = 0).
    • Result: The L2 penalty shrinks both θ₁ and θ₂ towards zero, but it’s rare for either of them to become exactly zero. The optimization path (white dots) goes smoothly towards a point where both weights are small but likely non-zero.
  • With L₁ Penalty (Lasso - top-right plot in Figure 4-19):

    • Imagine an expanding MSE ellipse until it just kisses the diamond-shaped L1 penalty region.
    • Because the diamond has sharp corners on the axes, it’s much more likely that the ellipse will first touch the diamond at one of these corners.
    • If it touches at a corner on, say, the θ₁-axis, then at that point, θ₂ will be exactly zero.
    • Result: The L1 penalty also shrinks weights towards zero, but because of its shape, it often forces some weights to become exactly zero. This is the “feature selection” property. The optimization path (white dots) is shown hitting the θ₂=0 axis and then “rolling down the gutter” along that axis.

Why the “Corners” Matter for Hitting Zero:

Think about the gradient (the direction of steepest descent) of the penalty term itself:

  • L₂ Penalty: The gradient of θ₁² + θ₂² always points directly towards the origin. It smoothly pulls all weights smaller.
  • L₁ Penalty: The gradient of |θ₁| + |θ₂| is a bit trickier.
    • If θ₁ > 0, the derivative w.r.t. θ₁ is 1.
    • If θ₁ < 0, the derivative w.r.t. θ₁ is -1.
    • If θ₁ = 0, it’s undefined (this is the “corner”). This means the L1 penalty applies a constant “force” (either +1 or -1) pulling a weight towards zero, regardless of how small the weight already is (as long as it’s not zero). Once a weight hits zero, this “force” can effectively keep it there if the pull from the MSE term isn’t strong enough to move it away. In contrast, for L2, the “force” 2θ₁ pulling θ₁ towards zero gets weaker as θ₁ gets smaller. So, it might get very close to zero but not quite hit it unless the MSE term also strongly favors zero.
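You can see this zeroing-out behavior directly with Scikit-Learn. A minimal sketch on synthetic data of my own: y depends only on the first feature, and the second feature is pure noise.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (my own): y = 3*x0 + small noise; x1 is irrelevant
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# The L1 penalty's constant "force" drives the useless weight all the way to zero;
# the L2 penalty's force weakens near zero, so Ridge only shrinks it
print("Lasso coefficients:", lasso.coef_)   # second coefficient exactly 0.0
print("Ridge coefficients:", ridge.coef_)   # second coefficient small but typically non-zero
```

This is the "feature selection" property in action: Lasso produces a sparse model, Ridge does not.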

Intuitive Summary:

  • L₂ (Ridge) is like saying: “Try to make your weights small, and distribute the ‘smallness’ somewhat evenly. Keep them inside this round ball.” It pulls weights towards zero but doesn’t aggressively zero them out.
  • L₁ (Lasso) is like saying: “Try to make your weights small, and if possible, make some of them exactly zero. Keep them inside this pointy diamond. It’s okay to focus your ‘budget’ on just a few important weights and eliminate the others.”

The “corners” of the L1 penalty region are key because they represent solutions where one or more weights are exactly zero. The optimization process, in trying to balance minimizing MSE with staying within the L1 budget, often finds its optimal solution precisely at one of these corners.

Does that help clarify why the shape of the penalty (round vs. pointy with corners on axes) leads to these different behaviors in terms of zeroing out weights?

Okay, absolutely! Let’s revisit Softmax Regression (pages 148-150) and really focus on what it’s ultimately trying to achieve, especially in contrast to just using multiple binary Logistic Regressions.

You’re spot on: Logistic Regression is great for “is it A or not A?” (binary). But what if you have “is it A, B, C, or D?” (multiclass).

The Problem with Just Using Multiple Binary Classifiers (OvR/OvO):

We saw in Chapter 3 that you can use binary classifiers for multiclass problems:

  • One-vs-Rest (OvR): Train a separate binary Logistic Regression for each class.

    • Classifier 1: “Is it class A (vs. B, C, D)?”
    • Classifier 2: “Is it class B (vs. A, C, D)?”
    • …and so on.
    • To make a prediction, you run your input through all these binary classifiers and pick the class whose classifier gives the highest “confidence” score (or probability, if the binary classifier outputs that).
  • One-vs-One (OvO): Train a binary classifier for every pair of classes (A vs B, A vs C, A vs D, B vs C, etc.). Pick the class that wins the most “duels.”

Limitations/Quirks of OvR/OvO for Probabilities:

While these strategies work for getting a class label, there’s a slight awkwardness if you want well-calibrated probabilities for each class that naturally sum to 1.

  • With OvR, each binary Logistic Regression outputs a probability for its class versus all others. For example, P(A | not A). These probabilities from different classifiers aren’t inherently guaranteed to sum to 1 when you look across all classes for a single instance. You might get P(A)=0.7, P(B)=0.4, P(C)=0.1. These don’t sum to 1, and it’s not immediately clear how to turn them into a proper probability distribution over A, B, and C. You usually just pick the class with the highest score.

Softmax Regression: The “Direct” Multiclass Probabilistic Approach

Softmax Regression is designed from the ground up to handle multiple classes directly and produce a consistent set of probabilities that sum to 1 across all classes.

Here’s the core idea and “what it’s trying to achieve”:

  1. Goal: For any given input instance (e.g., an image of a digit), we want to output a probability for each possible class (e.g., P(digit is 0), P(digit is 1), …, P(digit is 9)). Critically, these probabilities should all add up to 100%.

  2. Step 1: Calculate a “Score” for Each Class (Equation 4-19)

    • Just like Linear Regression or Logistic Regression, for each class k, Softmax Regression calculates a linear score: sₖ(x) = xᵀθ⁽ᵏ⁾
    • x is the input feature vector.
    • θ⁽ᵏ⁾ (theta-k) is a separate vector of weights specifically for class k. So, if you have 10 classes (digits 0-9), you will have 10 different θ vectors.
    • What these scores sₖ(x) are ultimately trying to achieve: They are like raw “evidence” or “suitability scores” for each class, given the input x. A higher score sₖ(x) suggests that class k is a more likely candidate for this input. These scores can be any real number (positive, negative, large, small).
  3. Step 2: Convert Scores into Probabilities (The Softmax Function - Equation 4-20)

    • The raw scores sₖ(x) are not probabilities yet (they don’t sum to 1, and they can be negative). We need a way to transform them into a valid probability distribution. This is where the softmax function (also called “normalized exponential”) comes in.
    • For each class k, the probability p̂ₖ is calculated as: p̂ₖ = exp(sₖ(x)) / Σⱼ exp(sⱼ(x)) (where the sum in the denominator is over all possible classes j)
    • What the softmax function is ultimately trying to achieve:
      • exp(sₖ(x)) (Exponential): First, it takes the exponential of each score. This has two effects:
        • It makes all scores positive (since e to any power is positive).
        • It tends to exaggerate differences: if score A is slightly higher than score B, exp(A) will be significantly higher than exp(B). The “softmax” is “soft” in that it doesn’t just pick the max score and give it 100% probability, but it does give more weight to higher scores.
      • Σⱼ exp(sⱼ(x)) (Sum of Exponentials): It then sums up these positive, exponentiated scores for all classes. This sum acts as a normalization constant.
      • Division: Dividing each exp(sₖ(x)) by this total sum ensures two things:
        1. Each p̂ₖ will be between 0 and 1.
        2. All the p̂ₖ values will sum up to 1. So, we get a proper probability distribution across all classes. The class that had the highest initial score sₖ(x) will end up with the largest probability p̂ₖ.
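The two-step recipe above is only a few lines of NumPy. A minimal sketch (the max-subtraction trick is a standard numerical-stability detail, not something Equation 4-20 requires):

```python
import numpy as np

def softmax(scores):
    # Subtracting the max score changes nothing mathematically (it cancels in
    # the ratio) but prevents exp() from overflowing on large scores
    exps = np.exp(scores - np.max(scores))
    return exps / exps.sum()

s = np.array([2.0, 1.0, -1.0])   # raw class scores s_k(x): any real numbers
p = softmax(s)
# p is a valid probability distribution: entries in (0, 1) that sum to 1,
# with the highest raw score receiving the highest probability
```

Note that adding any constant to all the scores leaves the probabilities unchanged, which is exactly why the stability shift is safe.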

Analogy for Softmax:

Imagine you have several candidates for a job (the classes).

  1. You give each candidate a raw “suitability score” (the sₖ(x)). Some might be high, some low, some even negative if they seem really unsuitable.
  2. To decide how to allocate a “probability of being hired” that sums to 100% across all candidates:
    • You first want to ensure everyone’s considered “positively” and amplify the scores of strong candidates: you “exponentiate” their scores. A candidate with a score of 3 becomes e³ ≈ 20, while a candidate with a score of 1 becomes e¹ ≈ 2.7. The difference is magnified.
    • Then, you add up all these amplified, positive scores to get a “total pool of amplified suitability.”
    • Finally, each candidate’s share of this total pool becomes their probability of being hired.

Why is this better than just running multiple OvR Logistic Regressions for probabilities?

  • Direct Probabilistic Interpretation: Softmax directly outputs a set of probabilities that are inherently linked and sum to 1. It’s designed for this purpose. With OvR Logistic Regression, you’d have to do some extra (potentially ad-hoc) normalization step if you wanted the “probabilities” from different binary classifiers to sum to 1 for a given instance.
  • Shared Information During Training (via the cost function): When Softmax Regression is trained (using the cross-entropy cost function, which we’ll get to), the updates to the weights θ⁽ᵏ⁾ for one class are influenced by the scores and target probabilities of all other classes because of that denominator in the softmax function. This allows the model to learn the relationships and distinctions between all classes simultaneously in a more coupled way. With independent OvR classifiers, each classifier only learns to distinguish its class from “everything else” without explicitly considering the fine-grained distinctions between those “everything else” classes.

Making a Prediction and Training:

  • Prediction (Equation 4-21): Once you have the probabilities p̂ₖ for all classes, you simply predict the class k that has the highest probability. This is equivalent to picking the class with the highest initial score sₖ(x).
  • Training (Cross-Entropy Cost Function - Equation 4-22):
    • The goal is to find the parameter matrix Θ (which contains all the θ⁽ᵏ⁾ vectors) such that for a given training instance, the predicted probability p̂ₖ is high for the correct class k and low for all other classes.
    • The cross-entropy cost function is used for this. J(Θ) = - (1/m) * Σᵢ Σₖ yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾)
      • yₖ⁽ⁱ⁾ is 1 if class k is the true class for instance i, and 0 otherwise.
      • What it’s ultimately trying to achieve: It heavily penalizes the model if log(p̂ₖ⁽ⁱ⁾) is a large negative number for the true class k (which happens if p̂ₖ⁽ⁱ⁾ is close to 0 for the true class). It wants to maximize the log-probability of the correct class.
    • This cost function is convex, so Gradient Descent can find the optimal Θ. The gradients (Equation 4-23) are computed, and Θ is updated iteratively.

In essence, Softmax Regression is a generalization of Logistic Regression to handle multiple classes by:

  1. Calculating a linear score for each class.
  2. Using the softmax function to convert these scores into a proper probability distribution over all classes.
  3. Being trained with a cost function (cross-entropy) that encourages high probabilities for correct classes.

It provides a more principled and direct way to get multiclass probabilities compared to trying to combine multiple independent binary logistic regressions.

Focus on the Cross-Entropy cost function

Let’s break it down slowly, focusing on what it’s ultimately trying to achieve.

Recall the equation: J(Θ) = - (1/m) * Σᵢ Σₖ yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾)

Where:

  • J(Θ): The total cost for our model parameters Θ. We want to minimize this.
  • m: The number of training instances. (1/m) means we’re averaging the cost over all instances.
  • Σᵢ: Sum over all training instances i (from 1 to m).
  • Σₖ: Sum over all possible classes k (from 1 to K).
  • yₖ⁽ⁱ⁾: This is the true target probability that instance i belongs to class k.
    • For most classification problems (like digit recognition), this is a “one-hot encoded” value. If instance i is truly a ‘digit 3’, then:
      • y₃⁽ⁱ⁾ = 1 (the probability of being class ‘3’ is 100%)
      • y₀⁽ⁱ⁾ = 0, y₁⁽ⁱ⁾ = 0, y₂⁽ⁱ⁾ = 0, y₄⁽ⁱ⁾ = 0, …, y₉⁽ⁱ⁾ = 0 (the probability of being any other class is 0%).
  • p̂ₖ⁽ⁱ⁾: This is the model’s predicted probability that instance i belongs to class k (this comes from the softmax function).
  • log(p̂ₖ⁽ⁱ⁾): The natural logarithm of the model’s predicted probability.

Understanding the Core Term: yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾)

Let’s focus on a single instance i and a single class k.

The term yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾) is the heart of it. Since yₖ⁽ⁱ⁾ is either 0 or 1 (for the one-hot encoded true label):

  1. Case 1: Class k is NOT the true class for instance i.

    • Then yₖ⁽ⁱ⁾ = 0.
    • So, yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾) = 0 * log(p̂ₖ⁽ⁱ⁾) = 0.
    • This means: For all the classes that are not the true class, this term contributes nothing to the sum Σₖ. This makes sense – we don’t directly care about the exact log-probability the model assigns to the incorrect classes, as long as it assigns a high probability to the correct class.
  2. Case 2: Class k IS the true class for instance i.

    • Then yₖ⁽ⁱ⁾ = 1.
    • So, yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾) = 1 * log(p̂ₖ⁽ⁱ⁾) = log(p̂ₖ⁽ⁱ⁾).
    • This means: For the one true class, this term contributes log(p̂ₖ⁽ⁱ⁾) to the sum Σₖ.

So, for a single instance i, the inner sum Σₖ yₖ⁽ⁱ⁾ log(p̂ₖ⁽ⁱ⁾) simplifies to just log(p̂_true_class⁽ⁱ⁾). It’s the logarithm of the probability that the model assigned to the actual correct class for that instance.
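A tiny numeric check of this simplification (the probabilities are made-up numbers of my own):

```python
import numpy as np

# One instance, 3 classes; the true class is class 2 (one-hot target)
y = np.array([0.0, 0.0, 1.0])
p_hat = np.array([0.2, 0.1, 0.7])

# The full inner sum Σ_k y_k log(p̂_k) ...
inner_sum = np.sum(y * np.log(p_hat))
# ... reduces to the log of the probability assigned to the true class
assert np.isclose(inner_sum, np.log(0.7))

cost = -inner_sum   # ≈ 0.357: a low-ish cost, since the model favored the right class
```

Had the model put only 0.01 on the true class, the cost would jump to -log(0.01) ≈ 4.6, which is the heavy penalty described next.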

Why log? And why the negative sign in J(Θ)?

Now let’s consider the log and the overall negative sign in J(Θ) = - (1/m) * Σᵢ log(p̂_true_class⁽ⁱ⁾).

  • Probabilities are between 0 and 1.

  • The logarithm of a number between 0 and 1 is always negative (or zero if p̂=1).

    • log(1) = 0
    • log(0.5) ≈ -0.69
    • log(0.1) ≈ -2.3
    • As p̂ approaches 0, log(p̂) approaches -∞.
  • What the cost function is trying to achieve:

    • We want p̂_true_class⁽ⁱ⁾ to be as close to 1 as possible. This means our model is very confident and correct.
      • If p̂_true_class⁽ⁱ⁾ = 1 (perfect prediction for this instance), then log(p̂_true_class⁽ⁱ⁾) = log(1) = 0.
      • The negative sign makes it -0 = 0. This is the minimum possible cost for this instance. Good!
    • We want to heavily penalize the model if p̂_true_class⁽ⁱ⁾ is close to 0. This means our model is very confident but wrong about the true class.
      • If p̂_true_class⁽ⁱ⁾ is very small (e.g., 0.0001), then log(p̂_true_class⁽ⁱ⁾) is a large negative number (e.g., log(0.0001) ≈ -9.2).
      • The negative sign in J(Θ) makes this - (large negative number) = large positive number. This is a very high cost. Good! This is what we want the cost function to do – penalize bad, confident predictions for the true class.

So, minimizing J(Θ) = - Σ log(p̂_true_class) is equivalent to maximizing Σ log(p̂_true_class). And since log is a monotonically increasing function, maximizing the sum of log-probabilities is equivalent to maximizing the product of the probabilities (this is related to Maximum Likelihood Estimation, a very common principle in statistics and machine learning).

In simpler terms, the cross-entropy cost function is ultimately trying to:

  • Make the model assign the highest possible probability to the correct class for each training example.
  • It does this by looking at the logarithm of the predicted probability for the true class.
  • A perfect score (probability 1 for the true class) results in a log-probability of 0, contributing zero to the (negative) cost.
  • A very poor score (probability near 0 for the true class) results in a very large negative log-probability, which, when negated by the cost function, becomes a very large positive cost.
  • The cost function therefore “rewards” the model for being confident and correct, and heavily “punishes” it for being confident and incorrect (or even just very uncertain about the correct class).

Connection to Information Theory (as mentioned in the sidebar):

Cross-entropy is a concept from information theory.

  • Imagine you have a true distribution of events y (e.g., true digit classes).
  • You have a model that produces a predicted distribution p̂.
  • Cross-entropy H(y, p̂) measures the average number of bits you would need to encode an event drawn from the true distribution y, if you used an encoding scheme optimized for your predicted distribution p̂.
  • If your predicted distribution p̂ perfectly matches the true distribution y, then the cross-entropy is minimized and is equal to the entropy of y itself (the inherent minimum bits needed to describe y).
  • If p̂ is different from y, the cross-entropy will be higher. The difference H(y, p̂) - H(y) is called the Kullback-Leibler (KL) divergence, which measures how much your predicted distribution diverges from the true one.

So, when we minimize the cross-entropy cost function, we are essentially trying to make our model’s predicted probability distribution as close as possible to the true underlying distribution y of the class labels.
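The decomposition H(y, p̂) = H(y) + KL(y ‖ p̂) can be confirmed with a couple of lines of numpy. A sketch with arbitrary example distributions:

```python
import numpy as np

y = np.array([0.7, 0.2, 0.1])      # hypothetical "true" distribution
p_hat = np.array([0.5, 0.3, 0.2])  # hypothetical predicted distribution

entropy_y = -np.sum(y * np.log(y))           # H(y)
cross_entropy = -np.sum(y * np.log(p_hat))   # H(y, p̂)
kl = np.sum(y * np.log(y / p_hat))           # KL(y || p̂)

# Cross-entropy = entropy + KL divergence, and KL is never negative.
assert np.isclose(cross_entropy, entropy_y + kl)
assert kl >= 0
```

Since H(y) is fixed by the data, minimizing the cross-entropy is exactly the same as minimizing the KL divergence, i.e., pushing p̂ toward y.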

Why is it better than, say, trying to minimize the squared error between p̂ and y for classification?

  1. Range of Output: p̂ from softmax is a probability (0 to 1). y is usually 0 or 1. Squared error (p̂ - y)² could work, but log loss (cross-entropy) has properties that often make optimization more stable and faster, especially for probabilities.
  2. Penalizing Confident Wrong Predictions: The log term in cross-entropy means that if the model predicts a very low probability (near 0) for the true class, the cost goes towards infinity. This is a very strong penalty. Squared error doesn’t penalize confident wrong predictions as harshly. For y=1, if p̂=0.01, squared error is (0.01-1)² ≈ 0.98. If p̂=0.0001, squared error is (0.0001-1)² ≈ 0.9998. The increase isn’t as dramatic as log(0.01) vs log(0.0001).
  3. Mathematical Properties for Gradient Descent: The gradients derived from the cross-entropy cost function for models like Softmax Regression (and logistic regression) have a nice, simple form (as seen in Equation 4-23), which is computationally convenient.
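The contrast in point 2 is easy to see side by side. A short sketch tabulating both penalties for the y = 1 case as the predicted probability for the true class shrinks:

```python
import numpy as np

# True class has y = 1; compare penalties as the predicted probability shrinks.
probs = [0.5, 0.1, 0.01, 0.0001]
log_losses = [-np.log(p) for p in probs]          # grows without bound
squared_errors = [(p - 1.0) ** 2 for p in probs]  # saturates below 1

for p, ll, se in zip(probs, log_losses, squared_errors):
    print(f"p_hat={p:<8} log loss={ll:7.2f}  squared error={se:.4f}")
```

The squared error never exceeds 1 no matter how wrong the model is, while the log loss keeps climbing (roughly doubling from p̂ = 0.01 to p̂ = 0.0001), which is the "heavy penalty for confident wrong predictions" in action.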

This cost function is the workhorse for most classification problems where the model outputs probabilities. It effectively pushes the model to be accurate and appropriately confident in its predictions for the correct classes.
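As a closing check on point 3 above, the simple gradient form of softmax cross-entropy (per instance, the gradient with respect to each class's parameter vector is (p̂ₖ - yₖ)·x, as in Equation 4-23) can be verified against numerical finite differences. A sketch with made-up dimensions (4 features, 3 classes, one instance):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical setup: 4 features, 3 classes, a single training instance.
rng = np.random.default_rng(42)
x = rng.normal(size=4)
Theta = rng.normal(size=(4, 3))  # parameter matrix, one column per class
y = np.array([0.0, 1.0, 0.0])    # one-hot true class

def cost(Theta):
    p_hat = softmax(x @ Theta)
    return -np.sum(y * np.log(p_hat))  # cross-entropy for one instance

# Analytic gradient: outer(x, p_hat - y), one column per class.
p_hat = softmax(x @ Theta)
grad_analytic = np.outer(x, p_hat - y)

# Numerical gradient via central finite differences.
eps = 1e-6
grad_numeric = np.zeros_like(Theta)
for i in range(Theta.shape[0]):
    for j in range(Theta.shape[1]):
        Tp, Tm = Theta.copy(), Theta.copy()
        Tp[i, j] += eps
        Tm[i, j] -= eps
        grad_numeric[i, j] = (cost(Tp) - cost(Tm)) / (2 * eps)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-5)
```

The clean "prediction minus target, times input" form is exactly what makes Gradient Descent on this cost function so cheap per step.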