You've used PyTorch. You've called loss.backward(). But do you truly understand what happens? Let's build a neural network from scratch: no frameworks, just NumPy and clear thinking.
By the end of this tutorial, you'll implement forward propagation, backpropagation, and gradient descent for a multi-layer network that learns XOR, the classic problem that showed single-layer perceptrons weren't enough.
[Interactive demo: Live Neural Network Trainer. Watch backpropagation update weights in real time, with live epoch, loss, accuracy, and gradient-norm stats.]
The Forward Pass
Let's build a 2-layer network: input(2) → hidden(4) → output(1)
Step 1: Initialize Weights
import numpy as np
# He initialization: scale each weight matrix by sqrt(2 / fan_in)
W1 = np.random.randn(2, 4) * np.sqrt(2.0 / 2)  # (input, hidden)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * np.sqrt(2.0 / 4)  # (hidden, output)
b2 = np.zeros((1, 1))
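The training loop at the end calls an initialize_weights() helper; a minimal sketch is just the code above wrapped in a function (the name comes from that loop, so adjust it if yours differs):

def initialize_weights():
    # He-scaled random weights, zero biases (same as above)
    W1 = np.random.randn(2, 4) * np.sqrt(2.0 / 2)
    b1 = np.zeros((1, 4))
    W2 = np.random.randn(4, 1) * np.sqrt(2.0 / 4)
    b2 = np.zeros((1, 1))
    return W1, b1, W2, b2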
Step 2: Forward Propagation
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Layer 1
    z1 = X @ W1 + b1   # Linear transformation
    a1 = sigmoid(z1)   # Activation
    # Layer 2
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2
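To check the shapes, here's a quick usage sketch on the four XOR inputs (X is the same dataset defined in the training section below):

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W1, b1, W2, b2 = initialize_weights()
z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)
print(a1.shape)  # (4, 4): 4 samples, 4 hidden activations
print(a2.shape)  # (4, 1): one predicted probability per sample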
The Loss Function
Binary Cross-Entropy for classification:
L = -1/m × Σ[y·log(ŷ) + (1-y)·log(1-ŷ)]
def compute_loss(y_true, y_pred):
    epsilon = 1e-15  # Prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # np.mean already averages over the m samples, giving the -1/m factor
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss
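As a quick sanity check, a usage sketch with hand-picked values (illustrative inputs, not outputs from the trained network):

y_true = np.array([[1.0], [0.0]])
y_pred = np.array([[0.9], [0.1]])
print(compute_loss(y_true, y_pred))  # ≈ 0.105: confident, correct predictions give a small loss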
⬅️ Backpropagation: The Chain Rule
This is where the magic happens. We compute gradients layer by layer, from output to input.
Output Layer Gradient
# dL/dz2 = a2 - y (for sigmoid + BCE)
dz2 = a2 - y
# dL/dW2 = a1.T @ dz2
dW2 = (a1.T @ dz2) / m
db2 = np.mean(dz2, axis=0, keepdims=True)
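Why does this collapse to a2 - y? Chaining the per-sample BCE derivative with the sigmoid derivative, the a2·(1 - a2) factors cancel:
dL/da2 = -(y/a2) + (1-y)/(1-a2)
da2/dz2 = a2·(1 - a2)
dL/dz2 = dL/da2 × da2/dz2 = -y·(1 - a2) + (1-y)·a2 = a2 - y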
Hidden Layer Gradient (Chain Rule!)
# Backpropagate through W2
da1 = dz2 @ W2.T
# Backpropagate through sigmoid
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dz1 = da1 * a1 * (1 - a1)
# Gradients for W1, b1
dW1 = (X.T @ dz1) / m
db1 = np.mean(dz1, axis=0, keepdims=True)
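The training loop below calls a single backward() function. Here's a minimal sketch that simply collects the gradient steps above, with the argument order taken from that loop (z1 and z2 are passed through for completeness but aren't needed, since the sigmoid derivative is written in terms of a1):

def backward(X, y, z1, a1, z2, a2, W1, W2):
    m = X.shape[0]
    # Output layer
    dz2 = a2 - y
    dW2 = (a1.T @ dz2) / m
    db2 = np.mean(dz2, axis=0, keepdims=True)
    # Hidden layer (chain rule through W2 and the sigmoid)
    da1 = dz2 @ W2.T
    dz1 = da1 * a1 * (1 - a1)
    dW1 = (X.T @ dz1) / m
    db1 = np.mean(dz1, axis=0, keepdims=True)
    return dW1, db1, dW2, db2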
Gradient Descent Update
def update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, lr):
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    return W1, b1, W2, b2
Complete Training Loop
# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Initialize
W1, b1, W2, b2 = initialize_weights()

# Train
for epoch in range(10000):
    # Forward
    z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)
    # Loss
    loss = compute_loss(y, a2)
    # Backward
    dW1, db1, dW2, db2 = backward(X, y, z1, a1, z2, a2, W1, W2)
    # Update
    W1, b1, W2, b2 = update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, lr=0.5)
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Test
predictions = forward(X, W1, b1, W2, b2)[-1]
print("Predictions:", predictions.round())
💡 Key Insights
- XOR requires hidden layers: A single perceptron can only learn linearly separable problems
- Gradient flow: Backprop is just the chain rule applied recursively
- Vanishing gradients: Deep sigmoid networks suffer because sigmoid'(z) ≤ 0.25 (see the quick calculation after this list)
- Learning rate matters: Too high → diverge, too low → slow
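On the vanishing-gradients point, the quick calculation: the sigmoid derivative peaks at z = 0, and every sigmoid layer the gradient flows through multiplies it by at most that peak value:
sigmoid'(z) = sigmoid(z)·(1 - sigmoid(z))
sigmoid'(0) = 0.5 × 0.5 = 0.25 (the maximum)
After 10 sigmoid layers, the gradient is scaled by at most 0.25^10 ≈ 10^-6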
Beyond the Basics
Once you understand this foundation:
- ReLU: Replace sigmoid with max(0, z) for better gradient flow (see the sketch after this list)
- Batch Normalization: Normalize activations to stabilize training
- Adam optimizer: Adaptive learning rates per parameter
- Dropout: Randomly zero neurons for regularization
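As a sketch of the first point, here is what changes if the hidden activation is swapped from sigmoid to ReLU (the output layer keeps sigmoid + BCE, so dz2 = a2 - y still holds):

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

# In forward():  a1 = relu(z1)                    instead of a1 = sigmoid(z1)
# In backward(): dz1 = da1 * relu_derivative(z1)  instead of dz1 = da1 * a1 * (1 - a1)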
Further Reading
- 3Blue1Brown. "Neural Networks" YouTube series
- Goodfellow et al. "Deep Learning" Chapter 6: Deep Feedforward Networks
- Karpathy. "micrograd" - Minimal autograd engine