Linear Regression & Normal Equation

In machine learning, Linear Regression is one of the most fundamental supervised learning algorithms. It models the relationship between features and continuous target variables using a linear equation.

Following the formulation from the classic article on Machine Learning Cơ Bản, we will explore how we can optimize this model using two completely different paradigms:

  1. The Analytical Closed-form Solution (The Normal Equation)
  2. The Iterative Numerical Solution (Gradient Descent)

The Mathematical Formulation

Let our dataset consist of $N$ training examples: ${(\mathbf{x}i, y_i)}{i=1}^N$, where $\mathbf{x}_i$ is the input vector and $y_i$ is the target value.

For simple linear regression (1D feature), we express the model output as:

\(y_i \approx f(\mathbf{x}_i) = w_1 x_i + w_0\)

To simplify calculations, we define the parameter vector $\mathbf{w} = [w_1, w_0]^T$ and augment the input feature vector with a bias coordinate $1$, forming $\mathbf{\bar{x}}_i = [x_i, 1]^T$. The prediction becomes:

\(f(\mathbf{\bar{x}}_i) = \mathbf{\bar{x}}_i^T \mathbf{w}\)

We can group all training inputs into a single design matrix $\mathbf{\bar{X}}$ of shape $N \times 2$, and all target labels into a column vector $\mathbf{y}$ of shape $N \times 1$:

\(\mathbf{\bar{X}} = \begin{bmatrix} \mathbf{\bar{x}}_1^T \\ \mathbf{\bar{x}}_2^T \\ \vdots \\ \mathbf{\bar{x}}_N^T \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}\)

Our predictions vector is $\mathbf{\hat{y}} = \mathbf{\bar{X}}\mathbf{w}$. The difference between predictions and targets is the error vector: $\mathbf{e} = \mathbf{\bar{X}}\mathbf{w} - \mathbf{y}$.


The Loss Function (MSE)

Our optimization metric is the sum of squared errors, represented as the Mean Squared Error (MSE) loss:

\(\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( \mathbf{\bar{x}}_i^T \mathbf{w} - y_i \right)^2 = \frac{1}{2N} \|\mathbf{\bar{X}}\mathbf{w} - \mathbf{y}\|_2^2\)

We can expand this vector norm:

\(\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \left( \mathbf{w}^T \mathbf{\bar{X}}^T \mathbf{\bar{X}} \mathbf{w} - 2\mathbf{w}^T \mathbf{\bar{X}}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} \right)\)


Solution 1: The Analytical Closed-form (Normal Equation)

To minimize the loss function $\mathcal{L}(\mathbf{w})$, we compute its gradient with respect to $\mathbf{w}$ and set it to zero:

\(\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{1}{N} \mathbf{\bar{X}}^T (\mathbf{\bar{X}}\mathbf{w} - \mathbf{y}) = 0\)

\(\mathbf{\bar{X}}^T \mathbf{\bar{X}} \mathbf{w} = \mathbf{\bar{X}}^T \mathbf{y}\)

This is known as the Normal Equation. If the matrix $\mathbf{\bar{X}}^T \mathbf{\bar{X}}$ is invertible (non-singular), we obtain the analytical optimal solution $\mathbf{w}_{\text{opt}}$ in a single step:

\(\mathbf{w}_{\text{opt}} = (\mathbf{\bar{X}}^T \mathbf{\bar{X}})^{-1} \mathbf{\bar{X}}^T \mathbf{y}\)

  • Pros: No need to choose a learning rate ($\alpha$), no convergence thresholds or iterations. Finds the mathematically exact global minimum.
  • Cons: Computing the inverse $(\mathbf{\bar{X}}^T \mathbf{\bar{X}})^{-1}$ takes $O(D^3)$ time, where $D$ is the number of features. For datasets with hundreds of thousands of features, it is extremely slow. Also, if features are collinear (redundant), the matrix is singular and cannot be inverted.

Solution 2: The Iterative Numerical Approach (Gradient Descent)

For large datasets, we use Gradient Descent. Starting from an initial point (e.g. $\mathbf{w} = [0, 0]^T$), we update parameters iteratively:

\(\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \frac{\alpha}{N} \mathbf{\bar{X}}^T (\mathbf{\bar{X}}\mathbf{w} - \mathbf{y})\)

  • Pros: Extremely efficient for massive datasets. Time complexity per step is $O(N D)$. Works even when the design matrix is not invertible.
  • Cons: Requires careful tuning of hyperparameters ($\alpha$) and monitoring of convergence.

Interactive Demonstration

Test these two optimization approaches below:

  • Click the canvas to add data points (click near an existing point to delete it)
  • Run GD to watch gradient descent iterate toward the minimum
  • Closed-form Fit to instantly compute the Normal Equation solution
Configuration
Epoch0
Loss (MSE)0.0000
w₁ (slope)0.000
w₀ (bias)0.000
Data & Regression Line
Click to add/remove points
Gradient Descent fit
Closed-form optimal
Residuals
Loss Surface & GD Path
GD trajectory
Optimal w*
Loss (MSE) over Epochs
Weight Convergence
Gradient Magnitude ‖∇L‖

What the Charts Show

Chart What It Reveals
Data & Regression Line The model’s current linear fit overlaid on the data. Watch the solid line converge onto the dashed optimal.
Loss Surface & GD Path A heatmap of $\mathcal{L}(w_1, w_0)$ centered around the optimal point. The GD trajectory crawls downhill toward the minimum (blue dot).
Loss over Epochs The classic training curve. Should decrease monotonically if $\alpha$ is well-chosen.
Weight Convergence Tracks $w_1$ and $w_0$ independently over training iterations. Both should stabilize at the closed-form values.
Gradient Magnitude $|\nabla \mathcal{L}|$ shrinks toward zero as the model approaches the minimum — the hallmark of convergence.

Key Observations to Try

  • Learning rate too small ($\alpha \approx 0.001$): Training is stable but very slow. The GD path on the loss surface takes tiny steps.
  • Learning rate too large ($\alpha > 0.05$): The weights overshoot the minimum and oscillate. The gradient magnitude chart spikes instead of decaying.
  • Outlier sensitivity: Load the With Outliers preset. The optimal fit line gets pulled toward the outliers — a well-known weakness of MSE-based regression.

Conclusion

Whether we compute weights analytically via matrix inversion in the Normal Equation or step slowly down the loss gradient with Gradient Descent, simple linear regression highlights the core principles of machine learning models.

By comparing the two approaches side-by-side and observing the loss surface, weight trajectories, and gradient dynamics, we build the mathematical and geometric intuitions needed for more advanced optimization algorithms in deep learning.




Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Third-Party Libraries Demo
  • how to write a blog post
  • test sidebar table of contents