2024-09-30

The Beginning of Humanity's Last Invention

I recently decided to go back and relearn how different concepts in AI and machine learning actually work. At some point, I realized that my understanding wasn't as concrete as I'd thought - it was more that I knew about the concepts but didn't really understand how they worked under the hood. So, I figured it was time to dive straight into the mathematics and logic behind these models.

As I went through this process, I remembered a quote by Richard Feynman: "If you want to master something, teach it." That's what inspired me to write this piece. By breaking down the inner workings of neural networks, I hope to not only solidify my own understanding but also offer a resource for others who want to grasp the fundamental mechanics of these incredible models.

Part 1: Forward Propagation

Think of a neural network as a mathematical function that takes in an input x and produces an output y. Our goal is to compute the output y, which we represent as f(x) = y. However, to allow our neural networks to capture complex patterns in data, we introduce two key parameters: weights w and biases b.

These parameters play roles similar to those in a linear equation, like y = mx + c, where m is the slope and c is the intercept. In this case, the weights w control how much influence each input has on the output, and the bias b shifts the output.

Example with a Single Neuron

An example of a single neuron

Figure 1: Visualization of a single neuron in a neural network. The inputs x_1, x_2, \dots, x_n are multiplied by their corresponding weights w_1, w_2, \dots, w_n and summed along with a bias b. The sum is then passed through an activation function g(\cdot) to produce the output \hat{y}. Source here

Let's start with a simple example: a single neuron. Suppose we have three inputs x_1, x_2, x_3. Each input is multiplied by a corresponding weight and then summed together. To this sum, we add a bias term. We can write the neuron's output as:

y = w_1 x_1 + w_2 x_2 + w_3 x_3 + b

This equation shows how the inputs are weighted and summed, with the bias adjusting the final value. But neural networks rarely consist of just one neuron. To tackle more complex problems, we typically use multiple neurons.
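
To make this concrete, here is a minimal sketch of that single-neuron computation in Python (the input, weight, and bias values are arbitrary, chosen only for illustration):

```python
# A single neuron: multiply each input by its weight, sum, and add the bias.
x = [0.5, 0.1, 0.3]   # inputs (arbitrary values)
w = [0.4, 0.2, 0.7]   # one weight per input (arbitrary values)
b = 0.1               # bias term

# y = w1*x1 + w2*x2 + w3*x3 + b
y = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
print(y)  # 0.53
```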

Mathematical Representation of Forward Propagation

When we have multiple neurons in a layer, we use matrix operations to represent forward propagation. For a layer with several neurons, the output is given by:

y = Wx + b

Where:

  • W is a matrix of weights (each row holds one neuron's weights),
  • x is the input vector,
  • b is the bias vector (one bias per neuron).

This equation can be interpreted as each neuron taking the dot product of its weight row with the input vector x and then adding its own bias. Repeating this for every neuron in the layer produces the output vector y, with one entry per neuron.


Derivation of the Forward Propagation Formula

Let's break this down with a simple example. Suppose we have three inputs, x_1, x_2, x_3, each paired with a weight w_1, w_2, w_3. After multiplying each input by its respective weight, we add the bias term. This can be expressed as:

y = \begin{bmatrix} w_1 & w_2 & w_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + b

In compact form, this is y = w^T x + b for a single neuron. Stacking one such weight row per neuron gives the layer equation:

y = Wx + b

For a dataset with m examples, we apply the same transformation to each example:

y^{(i)} = W x^{(i)} + b, \quad i = 1, \dots, m
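
In NumPy, this whole layer computation is a single matrix product. A minimal sketch, using made-up weights for a layer of two neurons over three inputs:

```python
import numpy as np

# Each row of W holds the weights of one neuron (2 neurons, 3 inputs).
W = np.array([[0.4, 0.2, 0.7],
              [0.3, 0.7, 0.1]])      # arbitrary values
b = np.array([0.1, -0.3])            # one bias per neuron
x = np.array([0.5, 0.1, 0.3])        # a single input example

y = W @ x + b                        # shape (2,): one output per neuron
print(y)

# For a dataset X with one example per row (shape (m, 3)),
# the same transformation applies to every example at once:
X = np.array([[0.5, 0.1, 0.3],
              [0.9, 0.2, 0.4]])
Y = X @ W.T + b                      # shape (m, 2)
print(Y)
```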

Activation Functions: Adding Non-Linearity

Once we compute the weighted sum, we pass the result through an activation function. This step introduces non-linearity, which is crucial because it allows the network to model more complex patterns beyond what a simple linear function can achieve.

Popular activation functions include Sigmoid and Tanh, but here we'll focus on the Rectified Linear Unit (ReLU), which is commonly used in modern neural networks. The ReLU activation function works as follows:

  • If the input y is greater than 0, it returns y,
  • If y is less than or equal to 0, it returns 0.

Mathematically, this can be written as:

\text{ReLU}(y) = \max(0, y)

ReLU is effective because it introduces non-linearity while being simple to compute.
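
A quick sketch of ReLU in NumPy, applied element-wise to a vector of pre-activations:

```python
import numpy as np

def relu(z):
    # Element-wise ReLU: returns z where z > 0, else 0.
    return np.maximum(0, z)

print(relu(np.array([-1.5, 0.0, 2.3])))  # [0.  0.  2.3]
```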


Summary of the Forward Propagation Process

To summarize, the forward propagation process involves:

  1. Input Transformation: Inputs are multiplied by weights and added to biases, creating a weighted sum.
  2. Activation: The weighted sum is passed through an activation function to introduce non-linearity.
  3. Output: The final value is the neuron's output, which can be passed to the next layer or used as the network's prediction.
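
Putting the three steps together, a minimal two-layer forward pass might look like the following sketch (the layer sizes, random initialization, and choice of ReLU are illustrative assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1   # Step 1: input transformation
    a1 = relu(z1)      # Step 2: activation (non-linearity)
    z2 = W2 @ a1 + b2  # Step 3: hidden activations feed the output layer
    return z2          # linear output (no activation at the final layer)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                         # one input example
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # output layer: 1 neuron
print(forward(x, W1, b1, W2, b2))
```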

Part 2: Cost Function - Measuring Prediction Quality

Once our neural network has made a prediction, the next step is to measure how accurate the prediction is. To do this, we use a cost function. The cost function provides a way to quantify how far the prediction is from the actual value, and our objective is to minimize this error over time.

One widely used cost function is the Mean Squared Error (MSE). The formula for MSE is:

C = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2

Where:

  • \hat{y}_i is the predicted value,
  • y_i is the actual value,
  • m is the number of samples, and
  • C is the cost (or error) that we want to minimize.

Intuition Behind MSE

The Mean Squared Error measures the average squared difference between the predicted and actual values. By squaring the difference, we ensure that larger errors are penalized more heavily. Squaring also guarantees that all errors are positive, regardless of whether the prediction is too high or too low.

Step-by-Step Derivation of MSE

Let’s break down the MSE calculation step by step:

  1. Calculate the Difference: For each prediction \hat{y}, subtract the actual value y. This difference is the prediction error.

    \text{Error} = \hat{y} - y
  2. Square the Difference: To ensure that all errors are positive and to give larger errors more weight, we square the differences:

    (\hat{y} - y)^2
  3. Sum Over All Predictions: For models with multiple predictions, we sum these squared errors:

    \sum_{i=1}^{m} (\hat{y}_i - y_i)^2
  4. Average the Errors: To find the average squared error, divide the sum by the number of predictions:

    C = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2

The goal of training is to minimize C, steadily reducing the gap between the model's predictions and the true target values.
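
The same computation as a short NumPy sketch:

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean squared error: average of the squared prediction errors.
    return np.mean((y_pred - y_true) ** 2)

y_pred = np.array([0.9, 0.2, 0.4])  # hypothetical predictions
y_true = np.array([1.0, 0.0, 0.5])  # hypothetical targets
print(mse(y_pred, y_true))          # 0.02
```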

Part 3: Backpropagation - Learning from Mistakes

So far, we've learned how a neural network makes predictions and measures their accuracy. However, none of this is useful unless the model can learn from its mistakes and improve. This is where backpropagation comes in.

An example of backpropagation

Figure 2: Visualization of the gradient descent process on a 3D error surface. The path shows the iterative steps taken by the algorithm as it moves from the initial point (blue) towards the global minimum (green), minimizing the error (cost) at each step.

Backpropagation is the process that allows the model to adjust its internal parameters—weights and biases—to minimize the cost function. This adjustment is guided by the gradients of the cost function with respect to the weights and biases.

The update rule for these parameters is:

\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_{\theta} C(\theta_{\text{old}})

Where:

  • \theta represents the weights and biases,
  • \alpha is the learning rate, and
  • \nabla_{\theta} C(\theta_{\text{old}}) is the gradient of the cost function with respect to those parameters.
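
In code, one gradient-descent step is exactly this rule applied to each parameter array. A minimal sketch (the parameter and gradient values below are made up for illustration):

```python
import numpy as np

def gradient_step(params, grads, lr=0.1):
    # theta_new = theta_old - alpha * gradient, applied to each array.
    return [p - lr * g for p, g in zip(params, grads)]

W = np.array([[0.4, 0.2], [0.3, 0.7]])
grad_W = np.array([[0.01, -0.02], [0.00, 0.03]])  # hypothetical dC/dW values
(W_new,) = gradient_step([W], [grad_W])
print(W_new)
```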

Deriving the Backpropagation Equations

Backpropagation works by moving backwards through the network to calculate the necessary adjustments to the weights and biases. Let's break this process down step by step:

Before diving into the backpropagation steps, let's quickly recap some key notations:

  1. Weighted input:

    z^{l} = W^{l} a^{l-1} + b^{l}
  2. Activation:

    a^{l} = \sigma(z^{l})

Where:

  • \sigma is the activation function (applied element-wise),
  • a^{l-1} is the activation from the previous layer,
  • W^{l} is the weight matrix, and
  • b^{l} is the bias vector.

The goal now is to compute:

\frac{\partial C}{\partial W^{l}} \quad \text{and} \quad \frac{\partial C}{\partial b^{l}}

efficiently, where C is the cost function.


Equation 1: Output Layer Error

\delta^{L} = \nabla_{a}C \odot \sigma'(z^{L})

Explanation:

  • \delta^{L} represents the error in the output layer (layer L).
  • \nabla_{a} C is the derivative of the cost function with respect to the activation a^{L} in the output layer.
  • \sigma'(z^{L}) is the derivative of the activation function with respect to the weighted input z^{L}.

Derivative:

  • We want to compute how the cost function C changes with respect to the weighted input z^{L}.

  • Using the chain rule, we can expand this as:

    \delta^{L} = \frac{\partial C}{\partial z^{L}} = \frac{\partial C}{\partial a^{L}} \frac{\partial a^{L}}{\partial z^{L}}

Backpropagation Tie-In:

  • This equation initializes the backpropagation process by calculating the error at the output layer. It shows how the network's final prediction affects the cost and sets up the error to be propagated backward.
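
As a small sketch: for a squared-error cost C = \frac{1}{2}(a^L - y)^2, the gradient \nabla_a C is simply (a^L - y), so the output error is one line of code. The numbers below are borrowed from the hand computation in Part 4, where the output layer is linear and \sigma'(z^L) = 1:

```python
import numpy as np

a_L = np.array([0.392])  # network output (from the Part 4 example)
y   = np.array([0.8])    # true target

# Equation 1: delta_L = grad_a C * sigma'(z_L); here sigma'(z_L) = 1.
delta_L = (a_L - y) * 1.0
print(delta_L)  # [-0.408]
```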

Equation 2: Hidden Layer Error

\delta^{l} = ((W^{l+1})^{T}\delta^{l+1}) \odot \sigma'(z^{l})

Explanation:

  • \delta^{l} represents the error at the hidden layer l.
  • (W^{l+1})^{T}\delta^{l+1} carries the errors from layer l+1 backward through the weights.
  • \sigma'(z^{l}) adjusts this error based on the activation function's derivative at layer l.

Derivative:

  • To compute how the cost changes with respect to the inputs of the current layer, we use the chain rule:

    \frac{\partial C}{\partial z^{l}} = \frac{\partial a^{l}}{\partial z^{l}} \frac{\partial z^{l+1}}{\partial a^{l}} \frac{\partial C}{\partial z^{l+1}}

Backpropagation Tie-In:

  • This recursive formula allows us to propagate the error backward through each hidden layer. It effectively updates the error term, which we will use to adjust the weights and biases.
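
In code, each step of the recursion is one matrix product and one element-wise multiply. A sketch with ReLU as the activation, again using the numbers from Part 4:

```python
import numpy as np

def relu_prime(z):
    # ReLU derivative: 1 where z > 0, else 0.
    return (z > 0).astype(float)

W_next     = np.array([[0.6, 0.9]])    # weights of layer l+1 (shape 1x2)
delta_next = np.array([-0.408])        # error at layer l+1
z_l        = np.array([0.32, -0.08])   # weighted inputs at layer l

# Equation 2: delta_l = (W_next^T @ delta_next) * sigma'(z_l)
delta_l = (W_next.T @ delta_next) * relu_prime(z_l)
print(delta_l)  # [-0.2448, 0.0]
```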

Equation 3: Gradient with Respect to Bias

\frac{\partial C}{\partial b^{l}} = \delta^{l}

Explanation:

  • The gradient of the cost function with respect to the bias at layer l is simply the error \delta^{l} at that layer.

Derivative:

  • By applying the chain rule, we get:

    \frac{\partial C}{\partial b^{l}} = \frac{\partial z^{l}}{\partial b^{l}} \frac{\partial C}{\partial z^{l}}
  • From Equation 2, we know that \frac{\partial C}{\partial z^{l}} = \delta^{l}, and since z^{l} = W^{l} a^{l-1} + b^{l}, we have \frac{\partial z^{l}}{\partial b^{l}} = 1.

Backpropagation Tie-In:

  • Once we have δl\delta^l, we directly obtain the gradient of the bias. This not only simplifies the computation but also enables straightforward updates to the bias in each layer during training.

Equation 4: Gradient with Respect to Weights

\frac{\partial C}{\partial W^{l}} = \delta^{l} (a^{l-1})^{T}

Explanation:

  • This equation shows how the cost function changes with respect to the weights in layer l. The gradient is the product of the error \delta^{l} and the activation from the previous layer, a^{l-1}.
  • The expression (a^{l-1})^{T} is the transpose of the activations, ensuring the gradient has the correct dimensions for the weight matrix.

Derivative:

  • To find how the cost changes with respect to the weights, we use the chain rule:

    \frac{\partial C}{\partial W^{l}} = \frac{\partial z^{l}}{\partial W^{l}} \frac{\partial C}{\partial z^{l}}
  • From Equation 2, we know that \frac{\partial C}{\partial z^{l}} = \delta^{l}, and \frac{\partial z^{l}}{\partial W^{l}} = a^{l-1}.

Backpropagation Tie-In:

  • This equation is crucial as it shows how the weights in each layer contribute to the cost. By using the gradient, we can update the weights during training to reduce the error.

Summary of the Backpropagation Process

  1. Forward Pass: Compute z^{l} and a^{l} for each layer l up to the output layer L. This step involves applying weights, biases, and activation functions to generate the network's output.

  2. Backward Pass:

    • Output Layer (Layer L): Compute the error \delta^{L} using Equation 1.
    • Hidden Layers: Starting from layer L-1 and moving backward to layer 1, compute \delta^{l} for each hidden layer using Equation 2. This step propagates the error back through the network.
  3. Gradient Computation: For each layer l:

    • Calculate the gradient with respect to the biases using Equation 3:

      \frac{\partial C}{\partial b^{l}} = \delta^{l}
    • Calculate the gradient with respect to the weights using Equation 4:

      \frac{\partial C}{\partial W^{l}} = \delta^{l}(a^{l-1})^{T}
  4. Parameter Update:

    • Adjust the weights and biases using the computed gradients. For example, with gradient descent:

      W^{l}_{\text{new}} = W^{l}_{\text{old}} - \eta \frac{\partial C}{\partial W^{l}}, \quad b^{l}_{\text{new}} = b^{l}_{\text{old}} - \eta \frac{\partial C}{\partial b^{l}}
    • Where \eta is the learning rate.


Explanation:

  • This outline captures the main steps of the backpropagation process, highlighting how errors are propagated backward and how gradients are used to update parameters.
  • By repeating these steps across multiple iterations, the network learns to reduce the overall cost, improving its predictions.
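
Putting Equations 1 through 4 together, here is a minimal sketch of one full training step for a two-layer ReLU network with a squared-error cost (the architecture and random initialization are illustrative choices, not prescribed above):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def train_step(x, y, W1, b1, W2, b2, lr=0.1):
    # Forward pass: cache z and a for every layer.
    z1 = W1 @ x + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2                          # linear output layer
    a2 = z2

    # Backward pass.
    delta2 = a2 - y                            # Eq. 1 (sigma' = 1 at a linear output)
    delta1 = (W2.T @ delta2) * relu_prime(z1)  # Eq. 2

    grad_W2 = np.outer(delta2, a1)             # Eq. 4
    grad_b2 = delta2                           # Eq. 3
    grad_W1 = np.outer(delta1, x)
    grad_b1 = delta1

    # Parameter update (gradient descent).
    return (W1 - lr * grad_W1, b1 - lr * grad_b1,
            W2 - lr * grad_W2, b2 - lr * grad_b2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
x, y = np.array([0.5, 0.1]), np.array([0.8])
W1, b1, W2, b2 = train_step(x, y, W1, b1, W2, b2)
```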

Part 4: Hand Computation

In this section, we’ll manually compute a simple neural network's forward and backward passes to build a deeper understanding of the calculations.

A neural network with weights and biases

Network Structure:

  • Inputs: x_1 = 0.5, x_2 = 0.1
  • Hidden Layer: 2 neurons
  • Output Layer: 1 neuron
  • True Output: y = 0.8
  • Learning Rate: \eta = 0.1

Weights and Biases:

  • From input to hidden layer:

    W_1 = \begin{bmatrix} 0.4 & 0.2 \\ 0.3 & 0.7 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}
  • From hidden to output layer:

    W_2 = \begin{bmatrix} 0.6 & 0.9 \end{bmatrix}, \quad b_2 = 0.2

Step 1: Forward Propagation

  1. Input to Hidden Layer Calculation:

    • First, we compute the weighted input to the hidden layer:

      z_1 = W_1 x + b_1
    • Substituting the values, we get:

      z_1 = \begin{bmatrix} 0.4 & 0.2 \\ 0.3 & 0.7 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.1 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix} = \begin{bmatrix} 0.32 \\ -0.08 \end{bmatrix}
  2. Activation (ReLU) for Hidden Layer:

    • Apply the ReLU activation function to each element of z1z_1:

      a_1 = \begin{bmatrix} \text{ReLU}(0.32) \\ \text{ReLU}(-0.08) \end{bmatrix} = \begin{bmatrix} 0.32 \\ 0 \end{bmatrix}
    • Here, ReLU returns 0 for negative inputs and the input itself for positive values.

  3. Hidden to Output Layer Calculation:

    • Next, we compute the weighted input to the output layer:

      z_2 = W_2 a_1 + b_2
    • Substituting the values, we get:

      z_2 = \begin{bmatrix} 0.6 & 0.9 \end{bmatrix} \begin{bmatrix} 0.32 \\ 0 \end{bmatrix} + 0.2 = 0.392
  4. Final Output:

    • Since we're not applying an additional activation function at the output layer, our final output for this forward pass is 0.392.

Step 2: Cost Calculation

  • To measure how far off our prediction is from the actual output, we use a squared-error cost (the factor of \frac{1}{2} is a common convention that cancels neatly when differentiating):

    C = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2 = \frac{1}{2}(0.392 - 0.8)^2 \approx 0.0832
  • This cost value indicates the error in the current predictions, which we will now work to minimize using backpropagation.


Step 3: Backward Propagation

  1. Output Layer Error:

    • Using the chain rule and the derivative of the cost function (the output layer is linear, so \sigma'(z^L) = 1), the error term for the output layer is:

      \delta^L = \frac{\partial C}{\partial z^L} = (y_{\text{pred}} - y_{\text{true}}) \cdot \sigma'(z^L) = (0.392 - 0.8) \cdot 1 = -0.408
  2. Update Bias for Output Layer:

    • The gradient with respect to the bias at the output layer is simply the error term:

      \frac{\partial C}{\partial b^L} = \delta^L = -0.408
    • Update the bias using gradient descent:

      b_{\text{new}} = b_{\text{old}} - \eta \cdot \delta^L = 0.2 - 0.1 \cdot (-0.408) = 0.2408
  3. Update Weights for Output Layer:

    • The gradient with respect to the weights is:

      \frac{\partial C}{\partial W_2} = \delta^L \cdot a_1^T = -0.408 \cdot \begin{bmatrix} 0.32 & 0 \end{bmatrix} = \begin{bmatrix} -0.1306 & 0 \end{bmatrix}
    • Update the weights:

      W_2 = \begin{bmatrix} 0.6 - 0.1 \cdot (-0.1306) & 0.9 - 0.1 \cdot 0 \end{bmatrix} \approx \begin{bmatrix} 0.613 & 0.9 \end{bmatrix}
  4. Hidden Layer Error:

    • Calculate the error term for each neuron in the hidden layer using Equation 2. ReLU's derivative is 1 for the first neuron (z = 0.32 > 0) and 0 for the second (z = -0.08 \le 0):

      \delta_1 = (0.6 \cdot -0.408) \cdot 1 = -0.2448, \quad \delta_2 = (0.9 \cdot -0.408) \cdot 0 = 0
  5. Update Bias for Hidden Layer:

    • For the first neuron:

      \frac{\partial C}{\partial b^1} = \delta_1 = -0.2448, \quad b^1_{\text{new}} = b^1_{\text{old}} - \eta \cdot \delta_1 = 0.1 - 0.1 \cdot (-0.2448) = 0.12448
    • For the second neuron, no change as δ2=0\delta^2 = 0:

      b^2_{\text{new}} = -0.3
  6. Update Weights for Hidden Layer:

    • For the first hidden neuron:

      \frac{\partial C}{\partial W_{1,1}} = \delta_1 \cdot x_1 = -0.2448 \cdot 0.5 = -0.1224, \quad \frac{\partial C}{\partial W_{2,1}} = \delta_1 \cdot x_2 = -0.2448 \cdot 0.1 = -0.02448
    • Update the weights:

      W_{1,1} = 0.4 - 0.1 \cdot (-0.1224) = 0.41224, \quad W_{2,1} = 0.2 - 0.1 \cdot (-0.02448) \approx 0.2024
    • For the second hidden neuron, no change:

      W_{1,2} = 0.3, \quad W_{2,2} = 0.7
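
As a sanity check, the following short NumPy sketch reproduces the numbers above:

```python
import numpy as np

relu = lambda z: np.maximum(0, z)
relu_prime = lambda z: (z > 0).astype(float)

x  = np.array([0.5, 0.1]); y = 0.8; lr = 0.1
W1 = np.array([[0.4, 0.2], [0.3, 0.7]]); b1 = np.array([0.1, -0.3])
W2 = np.array([[0.6, 0.9]]);             b2 = np.array([0.2])

# Forward pass
z1 = W1 @ x + b1              # [ 0.32 -0.08]
a1 = relu(z1)                 # [ 0.32  0.  ]
z2 = W2 @ a1 + b2             # [0.392]
cost = 0.5 * (z2 - y) ** 2    # [0.0832...]

# Backward pass
delta2 = z2 - y                            # [-0.408]
delta1 = (W2.T @ delta2) * relu_prime(z1)  # [-0.2448  0.]

# Gradient-descent updates
W2 = W2 - lr * np.outer(delta2, a1)  # [[0.613056 0.9]]
b2 = b2 - lr * delta2                # [0.2408]
W1 = W1 - lr * np.outer(delta1, x)   # [[0.41224 0.202448], [0.3 0.7]]
b1 = b1 - lr * delta1                # [ 0.12448 -0.3]
print(W1, b1, W2, b2, sep="\n")
```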

Conclusion

These steps illustrate how a neural network learns through forward propagation, cost evaluation, and backpropagation. By manually updating the weights and biases, we gain a clearer understanding of how the model gradually improves its predictions. This process repeats over many iterations, reducing the error and fine-tuning the network's performance.

Closing Thoughts

Neural networks have shown they can learn just about anything, from recognizing faces in photos to translating languages. As Geoffrey Hinton, one of the pioneers of deep learning, put it: "Neural Networks are capable of learning anything if you give them enough data and compute." This really shows how flexible and powerful these systems are.

With all the excitement around artificial intelligence today, especially with the rise of large language models (LLMs), it's more important than ever to understand how neural networks actually work. And as we push forward in the race towards Artificial General Intelligence (AGI), knowing the ins and outs of these networks gives us a better grip on where we're headed. By understanding the nuts and bolts of neural networks, we can ensure we're using this technology wisely and steering it in the right direction.