2024-09-30

The Beginning of Humanity's Last Invention

I recently decided to go back and relearn how different concepts in AI and machine learning actually work. At some point, I realized that my understanding wasn't as concrete as I'd thought - it was more that I knew about the concepts but didn't really understand how they worked under the hood. So, I figured it was time to dive straight into the mathematics and logic behind these models.

As I went through this process, I remembered a quote by Richard Feynman: "If you want to master something, teach it." That's what inspired me to write this piece. By breaking down the inner workings of neural networks, I hope to not only solidify my own understanding but also offer a resource for others who want to grasp the fundamental mechanics of these incredible models.

Part 1: Forward Propagation

Think of a neural network as a mathematical function that takes in an input x and produces an output y. Our goal is to compute the output y, which we represent as f(x) = y. However, to allow our neural networks to capture complex patterns in data, we introduce two key parameters: weights w and biases b.

These parameters play roles similar to those in a linear equation, like y = mx + c, where m is the slope and c is the intercept. In this case, the weights w control how much influence each input has on the output, and the bias b shifts the output.

Example with a Single Neuron

An example of a single neuron

Figure 1: Visualization of a single neuron in a neural network. The inputs x_1, x_2, \dots, x_n are multiplied by their corresponding weights w_1, w_2, \dots, w_n and summed along with a bias b. The sum is then passed through an activation function g(\cdot) to produce the output \hat{y}. Source here

Let's start with a simple example: a single neuron. Suppose we have three inputs x_1, x_2, x_3. Each input is multiplied by a corresponding weight and then summed together. To this sum, we add a bias term. We can write the neuron's output as:

y = w_1 x_1 + w_2 x_2 + w_3 x_3 + b

This equation shows how the inputs are weighted and summed, with the bias adjusting the final value. But neural networks rarely consist of just one neuron. To tackle more complex problems, we typically use multiple neurons.
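
To make this concrete, here is a minimal sketch of that single-neuron computation in Python (the input, weight, and bias values are arbitrary, chosen only for illustration):

```python
# A single neuron: multiply each input by its weight, sum, and add the bias.
x = [0.5, 0.1, 0.3]   # inputs (arbitrary values)
w = [0.4, 0.2, 0.7]   # one weight per input (arbitrary values)
b = 0.1               # bias term

# y = w1*x1 + w2*x2 + w3*x3 + b
y = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
print(y)  # 0.53
```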

Mathematical Representation of Forward Propagation

When we have multiple neurons in a layer, we use matrix operations to represent forward propagation. For a layer with several neurons, the output is given by:

y = Wx + b

Where:

  • W is a matrix of weights (each row holds one neuron's weights),
  • x is the input vector,
  • b is the bias vector (one bias per neuron).

This equation can be interpreted as each neuron taking the dot product of its weight row with the input vector x and then adding its own bias. Repeating this for every neuron in the layer produces the output vector y, with one entry per neuron.


Derivation of the Forward Propagation Formula

Let's break this down with a simple example. Suppose we have three inputs, x_1, x_2, x_3, each paired with a weight w_1, w_2, w_3. After multiplying each input by its respective weight, we add the bias term. This can be expressed as:

y = \begin{bmatrix} w_1 & w_2 & w_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + b

In compact form, this is y = w^T x + b for a single neuron. Stacking one such weight row per neuron gives the layer equation:

y = Wx + b

For a dataset with m examples, we apply the same transformation to each example:

y^{(i)} = W x^{(i)} + b, \quad i = 1, \dots, m
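
In NumPy, this whole layer computation is a single matrix product. A minimal sketch, using made-up weights for a layer of two neurons over three inputs:

```python
import numpy as np

# Each row of W holds the weights of one neuron (2 neurons, 3 inputs).
W = np.array([[0.4, 0.2, 0.7],
              [0.3, 0.7, 0.1]])      # arbitrary values
b = np.array([0.1, -0.3])            # one bias per neuron
x = np.array([0.5, 0.1, 0.3])        # a single input example

y = W @ x + b                        # shape (2,): one output per neuron
print(y)

# For a dataset X with one example per row (shape (m, 3)),
# the same transformation applies to every example at once:
X = np.array([[0.5, 0.1, 0.3],
              [0.9, 0.2, 0.4]])
Y = X @ W.T + b                      # shape (m, 2)
print(Y)
```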

Activation Functions: Adding Non-Linearity

Once we compute the weighted sum, we pass the result through an activation function. This step introduces non-linearity, which is crucial because it allows the network to model more complex patterns beyond what a simple linear function can achieve.

Popular activation functions include Sigmoid and Tanh, but here we'll focus on the Rectified Linear Unit (ReLU), which is commonly used in modern neural networks. The ReLU activation function works as follows:

  • If the input y is greater than 0, it returns y,
  • If y is less than or equal to 0, it returns 0.

Mathematically, this can be written as:

\text{ReLU}(y) = \max(0, y)

ReLU is effective because it introduces non-linearity while being simple to compute.
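
A quick sketch of ReLU in NumPy, applied element-wise to a vector of pre-activations:

```python
import numpy as np

def relu(z):
    # Element-wise ReLU: returns z where z > 0, else 0.
    return np.maximum(0, z)

print(relu(np.array([-1.5, 0.0, 2.3])))  # [0.  0.  2.3]
```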


Summary of the Forward Propagation Process

To summarize, the forward propagation process involves:

  1. Input Transformation: Inputs are multiplied by weights and added to biases, creating a weighted sum.
  2. Activation: The weighted sum is passed through an activation function to introduce non-linearity.
  3. Output: The final value is the neuron's output, which can be passed to the next layer or used as the network's prediction.
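
Putting the three steps together, a minimal two-layer forward pass might look like the following sketch (the layer sizes, random initialization, and choice of ReLU are illustrative assumptions):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1   # Step 1: input transformation
    a1 = relu(z1)      # Step 2: activation (non-linearity)
    z2 = W2 @ a1 + b2  # Step 3: hidden activations feed the output layer
    return z2          # linear output (no activation at the final layer)

rng = np.random.default_rng(0)
x = rng.normal(size=3)                         # one input example
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # output layer: 1 neuron
print(forward(x, W1, b1, W2, b2))
```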

Part 2: Cost Function - Measuring Prediction Quality

Once our neural network has made a prediction, the next step is to measure how accurate the prediction is. To do this, we use a cost function. The cost function provides a way to quantify how far the prediction is from the actual value, and our objective is to minimize this error over time.

One widely used cost function is the Mean Squared Error (MSE). The formula for MSE is:

C = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2

Where:

  • \hat{y}_i is the predicted value,
  • y_i is the actual value,
  • m is the number of samples, and
  • C is the cost (or error) that we want to minimize.

Intuition Behind MSE

The Mean Squared Error measures the average squared difference between the predicted and actual values. By squaring the difference, we ensure that larger errors are penalized more heavily. Squaring also guarantees that all errors are positive, regardless of whether the prediction is too high or too low.

Step-by-Step Derivation of MSE

Let’s break down the MSE calculation step by step:

  1. Calculate the Difference: For each prediction \hat{y}, subtract the actual value y. This difference is the prediction error.

    \text{Error} = \hat{y} - y
  2. Square the Difference: To ensure that all errors are positive and to give larger errors more weight, we square the differences:

    (\hat{y} - y)^2
  3. Sum Over All Predictions: For models with multiple predictions, we sum these squared errors:

    \sum_{i=1}^{m} (\hat{y}_i - y_i)^2
  4. Average the Errors: To find the average squared error, divide the sum by the number of predictions:

    C = \frac{1}{m}\sum_{i=1}^{m} (\hat{y}_i - y_i)^2

The goal of training is to minimize C, steadily reducing the gap between the model's predictions and the true target values.
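
The same computation as a short NumPy sketch:

```python
import numpy as np

def mse(y_pred, y_true):
    # Mean squared error: average of the squared prediction errors.
    return np.mean((y_pred - y_true) ** 2)

y_pred = np.array([0.9, 0.2, 0.4])  # hypothetical predictions
y_true = np.array([1.0, 0.0, 0.5])  # hypothetical targets
print(mse(y_pred, y_true))          # 0.02
```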

Part 3: Backpropagation - Learning from Mistakes

So far, we've learned how a neural network makes predictions and measures their accuracy. However, none of this is useful unless the model can learn from its mistakes and improve. This is where backpropagation comes in.

An example of backpropagation

Figure 2: Visualization of the gradient descent process on a 3D error surface. The path shows the iterative steps taken by the algorithm as it moves from the initial point (blue) towards the global minimum (green), minimizing the error (cost) at each step.

Backpropagation is the process that allows the model to adjust its internal parameters—weights and biases—to minimize the cost function. This adjustment is guided by the gradients of the cost function with respect to the weights and biases.

The update rule for these parameters is:

\theta_{\text{new}} = \theta_{\text{old}} - \alpha \nabla_{\theta} C(\theta_{\text{old}})

Where:

  • \theta represents the weights and biases,
  • \alpha is the learning rate, and
  • \nabla_{\theta} C(\theta_{\text{old}}) is the gradient of the cost function with respect to those parameters.
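
In code, one gradient-descent step is exactly this rule applied to each parameter array. A minimal sketch (the parameter and gradient values below are made up for illustration):

```python
import numpy as np

def gradient_step(params, grads, lr=0.1):
    # theta_new = theta_old - alpha * gradient, applied to each array.
    return [p - lr * g for p, g in zip(params, grads)]

W = np.array([[0.4, 0.2], [0.3, 0.7]])
grad_W = np.array([[0.01, -0.02], [0.00, 0.03]])  # hypothetical dC/dW values
(W_new,) = gradient_step([W], [grad_W])
print(W_new)
```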

Deriving the Backpropagation Equations

Backpropagation works by moving backwards through the network to calculate the necessary adjustments to the weights and biases. Let's break this process down step by step:

Before diving into the backpropagation steps, let's quickly recap some key notations:

  1. Weighted input:

    z^{l} = W^{l} a^{l-1} + b^{l}
  2. Activation:

    a^{l} = \sigma(z^{l})

Where:

  • \sigma is the activation function (applied element-wise),
  • a^{l-1} is the activation from the previous layer,
  • W^{l} is the weight matrix, and
  • b^{l} is the bias vector.

The goal now is to compute:

\frac{\partial C}{\partial W^{l}} \quad \text{and} \quad \frac{\partial C}{\partial b^{l}}

efficiently, where C is the cost function.


Equation 1: Output Layer Error

\delta^{L} = \nabla_{a}C \odot \sigma'(z^{L})

Explanation:

  • \delta^{L} represents the error in the output layer (layer L).
  • \nabla_{a} C is the derivative of the cost function with respect to the activation a^{L} in the output layer.
  • \sigma'(z^{L}) is the derivative of the activation function with respect to the weighted input z^{L}.

Derivative:

  • We want to compute how the cost function C changes with respect to the weighted input z^{L}.

  • Using the chain rule, we can expand this as:

    \delta^{L} = \frac{\partial C}{\partial z^{L}} = \frac{\partial C}{\partial a^{L}} \frac{\partial a^{L}}{\partial z^{L}}

Backpropagation Tie-In:

  • This equation initializes the backpropagation process by calculating the error at the output layer. It shows how the network's final prediction affects the cost and sets up the error to be propagated backward.
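
As a small sketch: for a squared-error cost C = \frac{1}{2}(a^L - y)^2, the gradient \nabla_a C is simply (a^L - y), so the output error is one line of code. The numbers below are borrowed from the hand computation in Part 4, where the output layer is linear and \sigma'(z^L) = 1:

```python
import numpy as np

a_L = np.array([0.392])  # network output (from the Part 4 example)
y   = np.array([0.8])    # true target

# Equation 1: delta_L = grad_a C * sigma'(z_L); here sigma'(z_L) = 1.
delta_L = (a_L - y) * 1.0
print(delta_L)  # [-0.408]
```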

Equation 2: Hidden Layer Error

\delta^{l} = ((W^{l+1})^{T}\delta^{l+1}) \odot \sigma'(z^{l})

Explanation:

  • \delta^{l} represents the error at the hidden layer l.
  • (W^{l+1})^{T}\delta^{l+1} carries the errors from layer l+1 backward through the weights.
  • \sigma'(z^{l}) adjusts this error based on the activation function's derivative at layer l.

Derivative:

  • To compute how the cost changes with respect to the inputs of the current layer, we use the chain rule:

    \frac{\partial C}{\partial z^{l}} = \frac{\partial a^{l}}{\partial z^{l}} \frac{\partial z^{l+1}}{\partial a^{l}} \frac{\partial C}{\partial z^{l+1}}

Backpropagation Tie-In:

  • This recursive formula allows us to propagate the error backward through each hidden layer. It effectively updates the error term, which we will use to adjust the weights and biases.
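
In code, each step of the recursion is one matrix product and one element-wise multiply. A sketch with ReLU as the activation, again using the numbers from Part 4:

```python
import numpy as np

def relu_prime(z):
    # ReLU derivative: 1 where z > 0, else 0.
    return (z > 0).astype(float)

W_next     = np.array([[0.6, 0.9]])    # weights of layer l+1 (shape 1x2)
delta_next = np.array([-0.408])        # error at layer l+1
z_l        = np.array([0.32, -0.08])   # weighted inputs at layer l

# Equation 2: delta_l = (W_next^T @ delta_next) * sigma'(z_l)
delta_l = (W_next.T @ delta_next) * relu_prime(z_l)
print(delta_l)  # [-0.2448, 0.0]
```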

Equation 3: Gradient with Respect to Bias

\frac{\partial C}{\partial b^{l}} = \delta^{l}

Explanation:

  • The gradient of the cost function with respect to the bias at layer l is simply the error \delta^{l} at that layer.

Derivative:

  • By applying the chain rule, we get:

    \frac{\partial C}{\partial b^{l}} = \frac{\partial z^{l}}{\partial b^{l}} \frac{\partial C}{\partial z^{l}}
  • From Equation 2, we know that \frac{\partial C}{\partial z^{l}} = \delta^{l}, and since z^{l} = W^{l} a^{l-1} + b^{l}, we have \frac{\partial z^{l}}{\partial b^{l}} = 1.

Backpropagation Tie-In:

  • Once we have δl\delta^l, we directly obtain the gradient of the bias. This not only simplifies the computation but also enables straightforward updates to the bias in each layer during training.

Equation 4: Gradient with Respect to Weights

\frac{\partial C}{\partial W^{l}} = \delta^{l} (a^{l-1})^{T}

Explanation:

  • This equation shows how the cost function changes with respect to the weights in layer l. The gradient is the product of the error \delta^{l} and the activation from the previous layer, a^{l-1}.
  • The expression (a^{l-1})^{T} is the transpose of the activations, ensuring the gradient has the correct dimensions for the weight matrix.

Derivative:

  • To find how the cost changes with respect to the weights, we use the chain rule:

    \frac{\partial C}{\partial W^{l}} = \frac{\partial z^{l}}{\partial W^{l}} \frac{\partial C}{\partial z^{l}}
  • From Equation 2, we know that \frac{\partial C}{\partial z^{l}} = \delta^{l}, and \frac{\partial z^{l}}{\partial W^{l}} = a^{l-1}.

Backpropagation Tie-In:

  • This equation is crucial as it shows how the weights in each layer contribute to the cost. By using the gradient, we can update the weights during training to reduce the error.

Summary of the Backpropagation Process

  1. Forward Pass: Compute z^{l} and a^{l} for each layer l up to the output layer L. This step involves applying weights, biases, and activation functions to generate the network's output.

  2. Backward Pass:

    • Output Layer (Layer L): Compute the error \delta^{L} using Equation 1.
    • Hidden Layers: Starting from layer L-1 and moving backward to layer 1, compute \delta^{l} for each hidden layer using Equation 2. This step propagates the error back through the network.
  3. Gradient Computation: For each layer l:

    • Calculate the gradient with respect to the biases using Equation 3:

      \frac{\partial C}{\partial b^{l}} = \delta^{l}
    • Calculate the gradient with respect to the weights using Equation 4:

      \frac{\partial C}{\partial W^{l}} = \delta^{l}(a^{l-1})^{T}
  4. Parameter Update:

    • Adjust the weights and biases using the computed gradients. For example, with gradient descent:

      W^{l}_{\text{new}} = W^{l}_{\text{old}} - \eta \frac{\partial C}{\partial W^{l}}, \quad b^{l}_{\text{new}} = b^{l}_{\text{old}} - \eta \frac{\partial C}{\partial b^{l}}
    • Where \eta is the learning rate.


Explanation:

  • This outline captures the main steps of the backpropagation process, highlighting how errors are propagated backward and how gradients are used to update parameters.
  • By repeating these steps across multiple iterations, the network learns to reduce the overall cost, improving its predictions.
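
Putting Equations 1 through 4 together, here is a minimal sketch of one full training step for a two-layer ReLU network with a squared-error cost (the architecture and random initialization are illustrative choices, not prescribed above):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    return (z > 0).astype(float)

def train_step(x, y, W1, b1, W2, b2, lr=0.1):
    # Forward pass: cache z and a for every layer.
    z1 = W1 @ x + b1
    a1 = relu(z1)
    z2 = W2 @ a1 + b2                          # linear output layer
    a2 = z2

    # Backward pass.
    delta2 = a2 - y                            # Eq. 1 (sigma' = 1 at a linear output)
    delta1 = (W2.T @ delta2) * relu_prime(z1)  # Eq. 2

    grad_W2 = np.outer(delta2, a1)             # Eq. 4
    grad_b2 = delta2                           # Eq. 3
    grad_W1 = np.outer(delta1, x)
    grad_b1 = delta1

    # Parameter update (gradient descent).
    return (W1 - lr * grad_W1, b1 - lr * grad_b1,
            W2 - lr * grad_W2, b2 - lr * grad_b2)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
x, y = np.array([0.5, 0.1]), np.array([0.8])
W1, b1, W2, b2 = train_step(x, y, W1, b1, W2, b2)
```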

Part 4: Hand Computation

In this section, we’ll manually compute a simple neural network's forward and backward passes to build a deeper understanding of the calculations.

A neural network with weights and biases

Network Structure:

  • Inputs: x_1 = 0.5, x_2 = 0.1
  • Hidden Layer: 2 neurons
  • Output Layer: 1 neuron
  • True Output: y = 0.8
  • Learning Rate: \eta = 0.1

Weights and Biases:

  • From input to hidden layer:

    W_1 = \begin{bmatrix} 0.4 & 0.2 \\ 0.3 & 0.7 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix}
  • From hidden to output layer:

    W_2 = \begin{bmatrix} 0.6 & 0.9 \end{bmatrix}, \quad b_2 = 0.2

Step 1: Forward Propagation

  1. Input to Hidden Layer Calculation:

    • First, we compute the weighted input to the hidden layer:

      z_1 = W_1 x + b_1
    • Substituting the values, we get:

      z_1 = \begin{bmatrix} 0.4 & 0.2 \\ 0.3 & 0.7 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.1 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.3 \end{bmatrix} = \begin{bmatrix} 0.32 \\ -0.08 \end{bmatrix}
  2. Activation (ReLU) for Hidden Layer:

    • Apply the ReLU activation function to each element of z1z_1:

      a_1 = \begin{bmatrix} \text{ReLU}(0.32) \\ \text{ReLU}(-0.08) \end{bmatrix} = \begin{bmatrix} 0.32 \\ 0 \end{bmatrix}
    • Here, ReLU returns 0 for negative inputs and the input itself for positive values.

  3. Hidden to Output Layer Calculation:

    • Next, we compute the weighted input to the output layer:

      z_2 = W_2 a_1 + b_2
    • Substituting the values, we get:

      z_2 = \begin{bmatrix} 0.6 & 0.9 \end{bmatrix} \begin{bmatrix} 0.32 \\ 0 \end{bmatrix} + 0.2 = 0.392
  4. Final Output:

    • Since we're not applying an additional activation function at the output layer, our final output for this forward pass is 0.392.

Step 2: Cost Calculation

  • To measure how far off our prediction is from the actual output, we use a squared-error cost (the factor of \frac{1}{2} is a common convention that cancels neatly when differentiating):

    C = \frac{1}{2} (y_{\text{pred}} - y_{\text{true}})^2 = \frac{1}{2}(0.392 - 0.8)^2 \approx 0.0832
  • This cost value indicates the error in the current predictions, which we will now work to minimize using backpropagation.


Step 3: Backward Propagation

  1. Output Layer Error:

    • Using the chain rule and the derivative of the cost function (the output layer is linear, so \sigma'(z^L) = 1), the error term for the output layer is:

      \delta^L = \frac{\partial C}{\partial z^L} = (y_{\text{pred}} - y_{\text{true}}) \cdot \sigma'(z^L) = (0.392 - 0.8) \cdot 1 = -0.408
  2. Update Bias for Output Layer:

    • The gradient with respect to the bias at the output layer is simply the error term:

      \frac{\partial C}{\partial b^L} = \delta^L = -0.408
    • Update the bias using gradient descent:

      b_{\text{new}} = b_{\text{old}} - \eta \cdot \delta^L = 0.2 - 0.1 \cdot (-0.408) = 0.2408
  3. Update Weights for Output Layer:

    • The gradient with respect to the weights is:

      \frac{\partial C}{\partial W_2} = \delta^L \cdot a_1^T = -0.408 \cdot \begin{bmatrix} 0.32 & 0 \end{bmatrix} = \begin{bmatrix} -0.1306 & 0 \end{bmatrix}
    • Update the weights:

      W_2 = \begin{bmatrix} 0.6 - 0.1 \cdot (-0.1306) & 0.9 - 0.1 \cdot 0 \end{bmatrix} \approx \begin{bmatrix} 0.613 & 0.9 \end{bmatrix}
  4. Hidden Layer Error:

    • Calculate the error term for each neuron in the hidden layer using Equation 2. ReLU's derivative is 1 for the first neuron (z = 0.32 > 0) and 0 for the second (z = -0.08 \le 0):

      \delta_1 = (0.6 \cdot -0.408) \cdot 1 = -0.2448, \quad \delta_2 = (0.9 \cdot -0.408) \cdot 0 = 0
  5. Update Bias for Hidden Layer:

    • For the first neuron:

      \frac{\partial C}{\partial b^1} = \delta_1 = -0.2448, \quad b^1_{\text{new}} = b^1_{\text{old}} - \eta \cdot \delta_1 = 0.1 - 0.1 \cdot (-0.2448) = 0.12448
    • For the second neuron, no change as δ2=0\delta^2 = 0:

      b^2_{\text{new}} = -0.3
  6. Update Weights for Hidden Layer:

    • For the first hidden neuron:

      \frac{\partial C}{\partial W_{1,1}} = \delta_1 \cdot x_1 = -0.2448 \cdot 0.5 = -0.1224, \quad \frac{\partial C}{\partial W_{2,1}} = \delta_1 \cdot x_2 = -0.2448 \cdot 0.1 = -0.02448
    • Update the weights:

      W_{1,1} = 0.4 - 0.1 \cdot (-0.1224) = 0.41224, \quad W_{2,1} = 0.2 - 0.1 \cdot (-0.02448) \approx 0.2024
    • For the second hidden neuron, no change:

      W_{1,2} = 0.3, \quad W_{2,2} = 0.7
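
As a sanity check, the following short NumPy sketch reproduces the numbers above:

```python
import numpy as np

relu = lambda z: np.maximum(0, z)
relu_prime = lambda z: (z > 0).astype(float)

x  = np.array([0.5, 0.1]); y = 0.8; lr = 0.1
W1 = np.array([[0.4, 0.2], [0.3, 0.7]]); b1 = np.array([0.1, -0.3])
W2 = np.array([[0.6, 0.9]]);             b2 = np.array([0.2])

# Forward pass
z1 = W1 @ x + b1              # [ 0.32 -0.08]
a1 = relu(z1)                 # [ 0.32  0.  ]
z2 = W2 @ a1 + b2             # [0.392]
cost = 0.5 * (z2 - y) ** 2    # [0.0832...]

# Backward pass
delta2 = z2 - y                            # [-0.408]
delta1 = (W2.T @ delta2) * relu_prime(z1)  # [-0.2448  0.]

# Gradient-descent updates
W2 = W2 - lr * np.outer(delta2, a1)  # [[0.613056 0.9]]
b2 = b2 - lr * delta2                # [0.2408]
W1 = W1 - lr * np.outer(delta1, x)   # [[0.41224 0.202448], [0.3 0.7]]
b1 = b1 - lr * delta1                # [ 0.12448 -0.3]
print(W1, b1, W2, b2, sep="\n")
```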

Conclusion

These steps illustrate how a neural network learns through forward propagation, cost evaluation, and backpropagation. By manually updating the weights and biases, we gain a clearer understanding of how the model gradually improves its predictions. This process repeats over many iterations, reducing the error and fine-tuning the network's performance.

Closing Thoughts

Neural networks have shown they can learn just about anything, from recognizing faces in photos to translating languages. As Geoffrey Hinton, one of the pioneers of deep learning, put it: "Neural Networks are capable of learning anything if you give them enough data and compute." This really shows how flexible and powerful these systems are.

With all the excitement around artificial intelligence today, especially with the rise of large language models (LLMs), it's more important than ever to understand how neural networks actually work. And as we push forward in the race towards Artificial General Intelligence (AGI), knowing the ins and outs of these networks gives us a better grip on where we're headed. By understanding the nuts and bolts of neural networks, we can ensure we're using this technology wisely and steering it in the right direction.