You've used PyTorch. You've called loss.backward(). But do you truly understand what happens? Let's build a neural network from scratch: no frameworks, just NumPy and clear thinking.
By the end of this tutorial, you'll implement forward propagation, backpropagation, and gradient descent for a multi-layer network that learns XOR, the classic problem that showed single-layer perceptrons weren't enough.
[Interactive demo: Live Neural Network Trainer. Watch backpropagation update weights in real time, with live epoch, loss, accuracy, and gradient-norm stats.]
The Forward Pass
Let's build a 2-layer network: input(2) → hidden(4) → output(1)
Step 1: Initialize Weights
import numpy as np
# He initialization: scale each weight matrix by sqrt(2 / fan_in)
W1 = np.random.randn(2, 4) * np.sqrt(2.0 / 2)  # (input, hidden)
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * np.sqrt(2.0 / 4)  # (hidden, output)
b2 = np.zeros((1, 1))
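The training loop at the end calls an initialize_weights() helper; a minimal sketch is just the code above wrapped in a function (the name comes from that loop, so adjust it if yours differs):

def initialize_weights():
    # He-scaled random weights, zero biases (same as above)
    W1 = np.random.randn(2, 4) * np.sqrt(2.0 / 2)
    b1 = np.zeros((1, 4))
    W2 = np.random.randn(4, 1) * np.sqrt(2.0 / 4)
    b2 = np.zeros((1, 1))
    return W1, b1, W2, b2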
Step 2: Forward Propagation
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Layer 1
    z1 = X @ W1 + b1   # Linear transformation
    a1 = sigmoid(z1)   # Activation
    # Layer 2
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2
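To check the shapes, here's a quick usage sketch on the four XOR inputs (X is the same dataset defined in the training section below):

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
W1, b1, W2, b2 = initialize_weights()
z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)
print(a1.shape)  # (4, 4): 4 samples, 4 hidden activations
print(a2.shape)  # (4, 1): one predicted probability per sample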
The Loss Function
Binary Cross-Entropy for classification:
L = -1/m × Σ[y·log(ŷ) + (1-y)·log(1-ŷ)]
def compute_loss(y_true, y_pred):
    epsilon = 1e-15  # Prevent log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    # np.mean already averages over the m samples, giving the -1/m factor
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss
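As a quick sanity check, a usage sketch with hand-picked values (illustrative inputs, not outputs from the trained network):

y_true = np.array([[1.0], [0.0]])
y_pred = np.array([[0.9], [0.1]])
print(compute_loss(y_true, y_pred))  # ≈ 0.105: confident, correct predictions give a small loss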
⬅️ Backpropagation: The Chain Rule
This is where the magic happens. We compute gradients layer by layer, from output to input.
Output Layer Gradient
# dL/dz2 = a2 - y (for sigmoid + BCE)
dz2 = a2 - y
# dL/dW2 = a1.T @ dz2
dW2 = (a1.T @ dz2) / m
db2 = np.mean(dz2, axis=0, keepdims=True)
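Why does this collapse to a2 - y? Chaining the per-sample BCE derivative with the sigmoid derivative, the a2·(1 - a2) factors cancel:
dL/da2 = -(y/a2) + (1-y)/(1-a2)
da2/dz2 = a2·(1 - a2)
dL/dz2 = dL/da2 × da2/dz2 = -y·(1 - a2) + (1-y)·a2 = a2 - y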
Hidden Layer Gradient (Chain Rule!)
# Backpropagate through W2
da1 = dz2 @ W2.T
# Backpropagate through sigmoid
# sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
dz1 = da1 * a1 * (1 - a1)
# Gradients for W1, b1
dW1 = (X.T @ dz1) / m
db1 = np.mean(dz1, axis=0, keepdims=True)
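The training loop below calls a single backward() function. Here's a minimal sketch that simply collects the gradient steps above, with the argument order taken from that loop (z1 and z2 are passed through for completeness but aren't needed, since the sigmoid derivative is written in terms of a1):

def backward(X, y, z1, a1, z2, a2, W1, W2):
    m = X.shape[0]
    # Output layer
    dz2 = a2 - y
    dW2 = (a1.T @ dz2) / m
    db2 = np.mean(dz2, axis=0, keepdims=True)
    # Hidden layer (chain rule through W2 and the sigmoid)
    da1 = dz2 @ W2.T
    dz1 = da1 * a1 * (1 - a1)
    dW1 = (X.T @ dz1) / m
    db1 = np.mean(dz1, axis=0, keepdims=True)
    return dW1, db1, dW2, db2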
Gradient Descent Update
def update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, lr):
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    return W1, b1, W2, b2
Complete Training Loop
# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# Initialize
W1, b1, W2, b2 = initialize_weights()

# Train
for epoch in range(10000):
    # Forward
    z1, a1, z2, a2 = forward(X, W1, b1, W2, b2)
    # Loss
    loss = compute_loss(y, a2)
    # Backward
    dW1, db1, dW2, db2 = backward(X, y, z1, a1, z2, a2, W1, W2)
    # Update
    W1, b1, W2, b2 = update_weights(W1, b1, W2, b2, dW1, db1, dW2, db2, lr=0.5)
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

# Test
predictions = forward(X, W1, b1, W2, b2)[-1]
print("Predictions:", predictions.round())
💡 Key Insights
- XOR requires hidden layers: A single perceptron can only learn linearly separable problems
- Gradient flow: Backprop is just the chain rule applied recursively
- Vanishing gradients: Deep sigmoid networks suffer because sigmoid'(z) ≤ 0.25 (see the quick calculation after this list)
- Learning rate matters: Too high → diverge, too low → slow
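On the vanishing-gradients point, the quick calculation: the sigmoid derivative peaks at z = 0, and every sigmoid layer the gradient flows through multiplies it by at most that peak value:
sigmoid'(z) = sigmoid(z)·(1 - sigmoid(z))
sigmoid'(0) = 0.5 × 0.5 = 0.25 (the maximum)
After 10 sigmoid layers, the gradient is scaled by at most 0.25^10 ≈ 10^-6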
Beyond the Basics
Once you understand this foundation:
- ReLU: Replace sigmoid with max(0, z) for better gradient flow (see the sketch after this list)
- Batch Normalization: Normalize activations to stabilize training
- Adam optimizer: Adaptive learning rates per parameter
- Dropout: Randomly zero neurons for regularization
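As a sketch of the first point, here is what changes if the hidden activation is swapped from sigmoid to ReLU (the output layer keeps sigmoid + BCE, so dz2 = a2 - y still holds):

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)

# In forward():  a1 = relu(z1)                    instead of a1 = sigmoid(z1)
# In backward(): dz1 = da1 * relu_derivative(z1)  instead of dz1 = da1 * a1 * (1 - a1)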
Further Reading
- 3Blue1Brown. "Neural Networks" YouTube series
- Goodfellow et al. "Deep Learning" Chapter 6: Deep Feedforward Networks
- Karpathy. "micrograd" - Minimal autograd engine