Linear Regression & Normal Equation
In machine learning, Linear Regression is one of the most fundamental supervised learning algorithms. It models the relationship between features and continuous target variables using a linear equation.
Following the formulation from the classic article on Machine Learning Cơ Bản, we will explore how we can optimize this model using two completely different paradigms:
- The Analytical Closed-form Solution (The Normal Equation)
- The Iterative Numerical Solution (Gradient Descent)
The Mathematical Formulation
Let our dataset consist of $N$ training examples: ${(\mathbf{x}i, y_i)}{i=1}^N$, where $\mathbf{x}_i$ is the input vector and $y_i$ is the target value.
For simple linear regression (1D feature), we express the model output as:
\(y_i \approx f(\mathbf{x}_i) = w_1 x_i + w_0\)
To simplify calculations, we define the parameter vector $\mathbf{w} = [w_1, w_0]^T$ and augment the input feature vector with a bias coordinate $1$, forming $\mathbf{\bar{x}}_i = [x_i, 1]^T$. The prediction becomes:
\(f(\mathbf{\bar{x}}_i) = \mathbf{\bar{x}}_i^T \mathbf{w}\)
We can group all training inputs into a single design matrix $\mathbf{\bar{X}}$ of shape $N \times 2$, and all target labels into a column vector $\mathbf{y}$ of shape $N \times 1$:
\(\mathbf{\bar{X}} = \begin{bmatrix} \mathbf{\bar{x}}_1^T \\ \mathbf{\bar{x}}_2^T \\ \vdots \\ \mathbf{\bar{x}}_N^T \end{bmatrix} = \begin{bmatrix} x_1 & 1 \\ x_2 & 1 \\ \vdots & \vdots \\ x_N & 1 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}\)
Our predictions vector is $\mathbf{\hat{y}} = \mathbf{\bar{X}}\mathbf{w}$. The difference between predictions and targets is the error vector: $\mathbf{e} = \mathbf{\bar{X}}\mathbf{w} - \mathbf{y}$.
The Loss Function (MSE)
Our optimization metric is the sum of squared errors, represented as the Mean Squared Error (MSE) loss:
\(\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \sum_{i=1}^{N} \left( \mathbf{\bar{x}}_i^T \mathbf{w} - y_i \right)^2 = \frac{1}{2N} \|\mathbf{\bar{X}}\mathbf{w} - \mathbf{y}\|_2^2\)
We can expand this vector norm:
\(\mathcal{L}(\mathbf{w}) = \frac{1}{2N} \left( \mathbf{w}^T \mathbf{\bar{X}}^T \mathbf{\bar{X}} \mathbf{w} - 2\mathbf{w}^T \mathbf{\bar{X}}^T \mathbf{y} + \mathbf{y}^T \mathbf{y} \right)\)
Solution 1: The Analytical Closed-form (Normal Equation)
To minimize the loss function $\mathcal{L}(\mathbf{w})$, we compute its gradient with respect to $\mathbf{w}$ and set it to zero:
\(\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{1}{N} \mathbf{\bar{X}}^T (\mathbf{\bar{X}}\mathbf{w} - \mathbf{y}) = 0\)
\(\mathbf{\bar{X}}^T \mathbf{\bar{X}} \mathbf{w} = \mathbf{\bar{X}}^T \mathbf{y}\)
This is known as the Normal Equation. If the matrix $\mathbf{\bar{X}}^T \mathbf{\bar{X}}$ is invertible (non-singular), we obtain the analytical optimal solution $\mathbf{w}_{\text{opt}}$ in a single step:
\(\mathbf{w}_{\text{opt}} = (\mathbf{\bar{X}}^T \mathbf{\bar{X}})^{-1} \mathbf{\bar{X}}^T \mathbf{y}\)
- Pros: No need to choose a learning rate ($\alpha$), no convergence thresholds or iterations. Finds the mathematically exact global minimum.
- Cons: Computing the inverse $(\mathbf{\bar{X}}^T \mathbf{\bar{X}})^{-1}$ takes $O(D^3)$ time, where $D$ is the number of features. For datasets with hundreds of thousands of features, it is extremely slow. Also, if features are collinear (redundant), the matrix is singular and cannot be inverted.
Solution 2: The Iterative Numerical Approach (Gradient Descent)
For large datasets, we use Gradient Descent. Starting from an initial point (e.g. $\mathbf{w} = [0, 0]^T$), we update parameters iteratively:
\(\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{w} - \frac{\alpha}{N} \mathbf{\bar{X}}^T (\mathbf{\bar{X}}\mathbf{w} - \mathbf{y})\)
- Pros: Extremely efficient for massive datasets. Time complexity per step is $O(N D)$. Works even when the design matrix is not invertible.
- Cons: Requires careful tuning of hyperparameters ($\alpha$) and monitoring of convergence.
Interactive Demonstration
Test these two optimization approaches below:
- Click the canvas to add data points (click near an existing point to delete it)
- Run GD to watch gradient descent iterate toward the minimum
- Closed-form Fit to instantly compute the Normal Equation solution
Configuration
Data & Regression Line
Loss Surface & GD Path
Loss (MSE) over Epochs
Weight Convergence
Gradient Magnitude ‖∇L‖
What the Charts Show
| Chart | What It Reveals |
|---|---|
| Data & Regression Line | The model’s current linear fit overlaid on the data. Watch the solid line converge onto the dashed optimal. |
| Loss Surface & GD Path | A heatmap of $\mathcal{L}(w_1, w_0)$ centered around the optimal point. The GD trajectory crawls downhill toward the minimum (blue dot). |
| Loss over Epochs | The classic training curve. Should decrease monotonically if $\alpha$ is well-chosen. |
| Weight Convergence | Tracks $w_1$ and $w_0$ independently over training iterations. Both should stabilize at the closed-form values. |
| Gradient Magnitude | $|\nabla \mathcal{L}|$ shrinks toward zero as the model approaches the minimum — the hallmark of convergence. |
Key Observations to Try
- Learning rate too small ($\alpha \approx 0.001$): Training is stable but very slow. The GD path on the loss surface takes tiny steps.
- Learning rate too large ($\alpha > 0.05$): The weights overshoot the minimum and oscillate. The gradient magnitude chart spikes instead of decaying.
- Outlier sensitivity: Load the With Outliers preset. The optimal fit line gets pulled toward the outliers — a well-known weakness of MSE-based regression.
Conclusion
Whether we compute weights analytically via matrix inversion in the Normal Equation or step slowly down the loss gradient with Gradient Descent, simple linear regression highlights the core principles of machine learning models.
By comparing the two approaches side-by-side and observing the loss surface, weight trajectories, and gradient dynamics, we build the mathematical and geometric intuitions needed for more advanced optimization algorithms in deep learning.
Enjoy Reading This Article?
Here are some more articles you might like to read next: