One neuron changes everything: the 63 vs 64 puzzle
A 3-layer MLP learning F = m*a behaves completely differently with 63 neurons vs 64. The investigation reveals three training regimes, winner-take-all gradient dynamics, and a lesson about what you're actually ablating when you change architecture.
I was trying to figure out the smallest network that can learn multiplication. Not matrix multiplication, not anything fancy - just F = m * a. Two numbers in, one number out. The simplest nonlinear function I could think of.
I set up a [2, 64, 64, 1] MLP with LeakyReLU and it worked. Cost dropped, the network learned the product. Good. Then I wanted to know how small the first layer could be while still learning, so I tried [2, 63, 64, 1]. One fewer neuron in the first hidden layer.
It didn't work. Cost plateaued around 137 and refused to drop further.
I went back to 64. It worked again. Same seed, same learning rate, same everything. The only difference was one neuron in the first layer. And the training stories weren't even close - the 64-neuron version initially looked worse, sitting at cost 1,430 for seventy epochs while the 63-neuron version had already settled. Then around epoch 80, the 64-neuron network did something sharp. Cost crashed from 1,252 to 151 in twenty epochs. A phase transition.
One neuron. Same seed, same learning rate, same data. The 63-neuron network plateaus politely. The 64-neuron network looks dead for 70 epochs, then suddenly collapses to a better solution. That didn't make sense to me. So I started pulling the thread.
The setup
The task is deliberately trivial. Generate 1,000 random (m, a) pairs where m and a are uniform in [1, 10]. Target is F = m * a. Split 70/30 into train and validation. Train a fully-connected network with He initialization, LeakyReLU (alpha=0.01), and vanilla gradient descent at learning rate 1e-5 for 150 epochs.
The architecture is [2, N, 64, 1] - two inputs, a variable-width first hidden layer, a fixed 64-wide second hidden layer, one linear output. Once I saw the 63-vs-64 discrepancy, I swept the first hidden layer width across eight values to see whether the pattern extended beyond those two: 63, 64, 65, 80, 96, 105, 110, 128.
All eight networks use np.random.seed(1) for initialization.
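In code, the setup amounts to something like this - a minimal sketch in plain NumPy, where the data seed and the `init_weights` helper are illustrative rather than the exact script:

```python
import numpy as np

# 1,000 random (m, a) pairs, uniform in [1, 10]; the target is the product F = m * a.
rng = np.random.default_rng(0)           # data seed is illustrative
X = rng.uniform(1, 10, size=(1000, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)

# 70/30 train/validation split.
X_train, X_val = X[:700], X[700:]
y_train, y_val = y[:700], y[700:]

# He-initialized weights for a [2, N, 64, 1] network; np.random.seed(1) is set
# immediately before drawing the weights, which matters later in the post.
def init_weights(sizes, seed=1):
    np.random.seed(seed)
    return [np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

W1, W2, W3 = init_weights([2, 64, 64, 1])
```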
I expected the results to be boring. Wider is better, maybe with diminishing returns. Instead I got this:
Three completely different behaviors from the same task and the same code. The grey lines (widths 80, 96, 105) are stuck above cost 1,200 - basically dead. The warm lines (63, 65) drop quickly to ~137 and plateau. And then there's width 64 (the thick line), doing its own thing entirely - frozen for 70 epochs, then a dramatic collapse.
It gets weirder
If this were a simple capacity story - wider is better - you'd expect a smooth curve. That's not what happens at all.
Widths 63 and 65 land at roughly the same cost (~137). Widths 80, 96, and 105 are catastrophically stuck above 1,200. And then widths 110 and 128 are fine again, converging to ~135.
This is not a capacity curve. This is a phase diagram. There are three distinct regimes, and which one you land in has nothing to do with how many parameters you have. Width 80 has more parameters than width 65, and it's ten times worse.
Three regimes
I tracked every internal variable during training - activations, pre-activations, gradients, weight norms - and a clear picture emerged.
Plateau (widths 63, 65): Activations spread across many neurons in layer 2. Nobody dominates. Z2_max hovers around 8-9. The network settles into a passable-but-mediocre solution where many neurons contribute small amounts. It can approximate multiplication as a sum of weak piecewise-linear functions, but it can't learn the sharp product structure. Cost: ~137.
Dead (widths 80, 96, 105): Every pre-activation in layer 2 is negative. Z2_max is below zero. LeakyReLU at alpha=0.01 means these neurons pass only 1% of their signal. The network is effectively disconnected between layers 1 and 2. Nothing gets through. The gradients flowing back are 100x weaker than they should be. Cost: ~1,300.
Converge (widths 64, 110, 128): One or two neurons in layer 2 develop large positive activations. Z2_max shoots up to 14-48. These winners get full gradients (no LeakyReLU attenuation), so they grow faster, which makes them even more dominant. Rich get richer. The network concentrates its computation through a few dominant channels and actually learns the multiplication structure. Cost: ~135-145.
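Reading the regime off a trace is mechanical once you know what to look for. Here is a rough classifier over the peak layer-2 pre-activation - the thresholds are read off the numbers reported in this post, not derived from anything principled:

```python
def classify_regime(z2_max: float) -> str:
    """Label a run by its peak layer-2 pre-activation late in training."""
    if z2_max < 0:
        return "dead"       # every layer-2 neuron negative; LeakyReLU passes ~1% of signal
    if z2_max < 12:         # cutoff between ~9.5 (plateau) and ~14 (converge) is illustrative
        return "plateau"    # many weakly positive neurons, no dominant channel
    return "converge"       # one or two strongly positive winners carry the learning signal

# Z2_max at epoch 100 for the eight widths in the sweep
z2_max_by_width = {63: 8.7, 64: 47.3, 65: 9.5, 80: -0.5, 96: -1.3,
                   105: -1.5, 110: 14.0, 128: 24.5}
print({w: classify_regime(z) for w, z in z2_max_by_width.items()})
```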
The smoking gun
The variable that separates the three regimes is Z2_max - the peak pre-activation in the second hidden layer at epoch 100.
Width 64 has a Z2_max of 47.3 - five times larger than its neighbors at width 63 (8.7) and 65 (9.5). The dead regime widths have Z2_max below zero. The converge regime widths (110, 128) land at 14-25.
This is the mechanism. LeakyReLU creates a 100x gradient asymmetry between positive and negative pre-activations. If your initialization happens to produce even one strongly positive neuron in layer 2, that neuron gets 100x more gradient signal than its negative neighbors. It grows. Its neighbors shrink further. Winner takes all.
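The factor of 100 is nothing deeper than the activation's derivative; a minimal sketch:

```python
import numpy as np

def leaky_relu_grad(z, alpha=0.01):
    # Derivative of LeakyReLU: 1 for positive pre-activations, alpha for negative ones.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.2, 47.3])
print(leaky_relu_grad(z))   # [0.01 0.01 1.   1.  ]
# A neuron with a positive pre-activation receives 1 / 0.01 = 100x more gradient
# than a negative neighbor, so it grows faster and stays positive. Rich get richer.
```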
The gradient norms confirm it: width 64's middle-layer gradient norm (dW2) is 37.5, nearly 4x larger than width 63's 9.7. The signal is concentrated, not distributed. This is why the 64-neuron network can break out of the plateau that traps the 63-neuron version: it has a dominant channel funneling learning signal through.
So why does one neuron matter?
Here's the part that really bothered me. 63 and 64 neurons, same seed. Why would one neuron change the activation landscape so dramatically?
I traced the random number generation. When you call np.random.randn(2, 63) followed by np.random.randn(63, 64), you consume 126 random numbers for W1 and then start W2 at index 126 in the sequence. When you call np.random.randn(2, 64) followed by np.random.randn(64, 64), W1 consumes 128 numbers and W2 starts at index 128.
That two-index offset means W2 for the 63-neuron network and W2 for the 64-neuron network are drawing from completely different parts of the random stream. The first layer weights share 126 out of 128 values. But the second layer - the layer that determines whether you get a dominant positive neuron or not - is initialized with entirely different random numbers.
```python
import numpy as np

np.random.seed(1)
seq_63_w1 = np.random.randn(2, 63)    # consumes draws 0-125
seq_63_w2 = np.random.randn(63, 64)   # starts at draw 126

np.random.seed(1)
seq_64_w1 = np.random.randn(2, 64)    # consumes draws 0-127
seq_64_w2 = np.random.randn(64, 64)   # starts at draw 128

# The first 126 W1 values are the same draws, just arranged in different shapes:
print(np.allclose(seq_63_w1.ravel(), seq_64_w1.ravel()[:126]))          # True
# W2 is offset by two draws, so it is a completely different matrix:
print(np.allclose(seq_63_w2.ravel()[:100], seq_64_w2.ravel()[:100]))    # False
```

The 63-vs-64 difference has nothing to do with 63 or 64. It has to do with which slice of the random number generator happens to produce a neuron with a large positive pre-activation in layer 2, for this particular data distribution.
Confirmation: the 20-seed test
To verify this, I ran both architectures across 20 different random seeds, training for 10,000 epochs each. If the difference is architectural, 64 should consistently beat 63. If it's initialization luck, the results should be noisy.
They're noisy. Some seeds favor 63. Some favor 64. The mean performance is roughly similar. Seed 1 just happened to give width 64 a lucky W2 that produced a dominant neuron, while width 63 with the same seed got a W2 where the activations were more uniform.
The width sweep with fine granularity around 60-68 shows the same thing: which regime you land in depends more on the interaction between seed and width than on width alone. There's no magic number. There's a phase boundary that you cross when initialization happens to produce the right conditions for winner-take-all dynamics.
What multiplication demands
This raises a deeper question: why is multiplication so sensitive to this? Addition isn't. A network learning F = m + a converges reliably regardless of width. What makes multiplication different?
Multiplication requires the network to represent the interaction between inputs, not just their sum. In a ReLU network, this means learning piecewise-linear surfaces that approximate the hyperbolic contours of m * a. You need neurons that respond to specific combinations of m and a - not neurons that respond to m alone or a alone.
The plateau regime fails because distributed, uniform activations approximate multiplication as a sum of independent contributions from m and a. That gets you close (cost ~137) but can't capture the cross-term. The converge regime succeeds because concentrated activations through a dominant neuron can encode the interaction: that neuron's W1 weights form a direction in (m, a) space that captures the product structure.
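One way to see why a single well-aimed direction can carry the cross-term - this identity is standard algebra, not something from the experiment - is that the product is exactly a difference of squares of two linear combinations, and a squared linear readout is the kind of one-dimensional curve a few piecewise-linear neurons can approximate:

```python
import numpy as np

m = np.random.uniform(1, 10, 1000)
a = np.random.uniform(1, 10, 1000)

# Quarter-square identity: m * a = ((m + a)^2 - (m - a)^2) / 4.
# Each term depends on a single linear combination of the inputs, i.e. one
# direction in (m, a) space - exactly what a dominant hidden neuron encodes.
reconstructed = ((m + a) ** 2 - (m - a) ** 2) / 4
print(np.allclose(reconstructed, m * a))   # True
```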
The results
| Width | Final cost | Regime | Z2_max | dW2 norm | What happened |
|---|---|---|---|---|---|
| 63 | 137 | Plateau | 8.7 | 9.7 | Uniform activations, no winner |
| 64 | 145 | Converge | 47.3 | 37.5 | Lucky dominant neuron, phase transition |
| 65 | 137 | Plateau | 9.5 | 10.4 | Same as 63, different slice of randomness |
| 80 | 1,346 | Dead | -0.5 | 19.0 | All layer-2 neurons negative, signal killed |
| 96 | 1,271 | Dead | -1.3 | 20.7 | Same death, slightly different trajectory |
| 105 | 1,360 | Dead | -1.5 | 5.5 | Same death |
| 110 | 135 | Converge | 14.0 | 4.2 | Lucky initialization, clean convergence |
| 128 | 138 | Converge | 24.5 | 30.6 | Dominant neuron, late phase transition |
What this doesn't tell me
I ran this on a toy problem with a specific activation function (LeakyReLU at alpha=0.01). The 100x gradient asymmetry is what drives the winner-take-all dynamics. With ReLU (alpha=0), the dead neurons are truly dead and can't recover. With a smoother activation like GELU or SiLU, the asymmetry is less extreme and the three-regime structure might soften or disappear. I haven't tested that.
I also haven't connected this to the initialization literature rigorously. The lottery ticket hypothesis [1] is about trained networks containing sparse subnetworks that can match full performance. This is a related but distinct phenomenon: it's about untrained networks containing one lucky neuron whose initial activation happens to be large enough to trigger winner-take-all dynamics. The lottery ticket is found after training. This "lucky neuron" exists at initialization and determines the entire training trajectory.
Whether this matters for large models is an open question. Transformers use different activation functions, different initialization schemes, and much larger hidden dimensions. The probability of landing in the dead regime decreases as width increases (more chances for at least one positive neuron). But the sensitivity to initialization is a general phenomenon, and the specific mechanism - activation-dependent gradient asymmetry creating winner-take-all dynamics - is not specific to toy problems.
The actual lesson
This is a cautionary tale for anyone doing hyperparameter sweeps. If your comparison uses the same random seed, you are not running a controlled experiment. The seed interacts with the architecture in ways that can produce spurious differences that look systematic but are actually coincidental. The fix is trivial: average over many seeds. But it's remarkable how often this doesn't happen, even in published work.
I started this investigation thinking I'd found an architecture difference. I ended up finding an initialization story. And that reframing - not "63 vs 64" but "lucky W2 vs unlucky W2" - changes what the experiment actually tells you. You're not measuring architecture. You're measuring one draw from a random number generator.
The code
The full training script with trace collection across all 8 widths:
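A condensed sketch of what that loop does - reconstructed from the setup above; the cost normalization, the omitted biases, and the helper names are assumptions, not the original code:

```python
import numpy as np

ALPHA = 0.01     # LeakyReLU slope
LR = 1e-5        # learning rate
EPOCHS = 150

def leaky_relu(z):
    return np.where(z > 0, z, ALPHA * z)

def leaky_relu_grad(z):
    return np.where(z > 0, 1.0, ALPHA)

def init_weights(sizes, seed=1):
    np.random.seed(seed)
    return [np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def train(X, y, width, epochs=EPOCHS, lr=LR, seed=1):
    W1, W2, W3 = init_weights([2, width, 64, 1], seed=seed)
    trace = []
    for epoch in range(epochs):
        # Forward pass (biases omitted for brevity).
        Z1 = X @ W1;  A1 = leaky_relu(Z1)
        Z2 = A1 @ W2; A2 = leaky_relu(Z2)
        pred = A2 @ W3
        cost = np.mean((pred - y) ** 2)    # exact normalization in the post is unknown

        # Backward pass.
        d_pred = 2 * (pred - y) / len(y)
        dW3 = A2.T @ d_pred
        dZ2 = (d_pred @ W3.T) * leaky_relu_grad(Z2)
        dW2 = A1.T @ dZ2
        dZ1 = (dZ2 @ W2.T) * leaky_relu_grad(Z1)
        dW1 = X.T @ dZ1

        # Vanilla gradient descent.
        W1 -= lr * dW1; W2 -= lr * dW2; W3 -= lr * dW3

        trace.append({"epoch": epoch, "cost": cost,
                      "Z2_max": Z2.max(), "dW2_norm": np.linalg.norm(dW2)})
    return trace

# The width sweep: same data, same init seed, eight first-layer widths.
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(700, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)
traces = {w: train(X, y, w) for w in (63, 64, 65, 80, 96, 105, 110, 128)}
```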
The systematic investigation - 20-seed sensitivity analysis, width sweep, random sequence tracing:
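And a sketch of the 20-seed check, reusing `train` from the block above; the 10,000-epoch budget comes from the experiment described earlier, while the reporting format is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(700, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)

# Train both widths from 20 different init seeds for 10,000 epochs each.
final_costs = {63: [], 64: []}
for seed in range(20):
    for width in (63, 64):
        trace = train(X, y, width, epochs=10_000, seed=seed)
        final_costs[width].append(trace[-1]["cost"])

for width, costs in final_costs.items():
    print(f"width {width}: mean {np.mean(costs):.1f}, "
          f"best {np.min(costs):.1f}, worst {np.max(costs):.1f}")
```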
References
1. Frankle, J. & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv:1803.03635