One neuron changes everything: the 63 vs 64 puzzle
A 3-layer MLP learning F = m*a behaves completely differently with 63 neurons vs 64. The investigation reveals three training regimes, winner-take-all gradient dynamics, and a lesson about what you're actually ablating when you change architecture.
I was trying to figure out the smallest network that can learn multiplication. Not matrix multiplication, not anything fancy - just F = m * a. Two numbers in, one number out. The simplest nonlinear function I could think of.
I set up a [2, 64, 64, 1] MLP with LeakyReLU and it worked. Cost dropped, the network learned the product. Good. Then I wanted to know how small the first layer could be while still learning, so I tried [2, 63, 64, 1]. One fewer neuron in the first hidden layer.
It didn't work. Cost plateaued around 137 and refused to drop further.
I went back to 64. It worked again. Same seed, same learning rate, same everything. The only difference was one neuron in the first layer. And the training stories weren't even close - the 64-neuron version initially looked worse, sitting at cost 1,430 for seventy epochs while the 63-neuron version had already settled. Then around epoch 80, the 64-neuron network did something sharp. Cost crashed from 1,252 to 151 in twenty epochs. A phase transition.
One neuron. Same seed, same learning rate, same data. The 63-neuron network plateaus politely. The 64-neuron network looks dead for 70 epochs, then suddenly collapses to a better solution. That didn't make sense to me. So I started pulling the thread.
The setup
The task is deliberately trivial. Generate 1,000 random (m, a) pairs where m and a are uniform in [1, 10]. Target is F = m * a. Split 70/30 into train and validation. Train a fully-connected network with He initialization, LeakyReLU (alpha=0.01), and vanilla gradient descent at learning rate 1e-5 for 150 epochs.
The architecture is [2, N, 64, 1] - two inputs, a variable-width first hidden layer, a fixed 64-wide second hidden layer, one linear output. Once I saw the 63-vs-64 discrepancy, I swept the first hidden layer width across eight values to see whether the pattern extended beyond those two: 63, 64, 65, 80, 96, 105, 110, 128.
All eight networks use np.random.seed(1) for initialization.
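In code, the setup amounts to something like this - a minimal sketch in plain NumPy, where the data seed and the `init_weights` helper are illustrative rather than the exact script:

```python
import numpy as np

# 1,000 random (m, a) pairs, uniform in [1, 10]; the target is the product F = m * a.
rng = np.random.default_rng(0)           # data seed is illustrative
X = rng.uniform(1, 10, size=(1000, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)

# 70/30 train/validation split.
X_train, X_val = X[:700], X[700:]
y_train, y_val = y[:700], y[700:]

# He-initialized weights for a [2, N, 64, 1] network; np.random.seed(1) is set
# immediately before drawing the weights, which matters later in the post.
def init_weights(sizes, seed=1):
    np.random.seed(seed)
    return [np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

W1, W2, W3 = init_weights([2, 64, 64, 1])
```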
I expected the results to be boring. Wider is better, maybe with diminishing returns. Instead I got this:
Three completely different behaviors from the same task and the same code. The grey lines (widths 80, 96, 105) are stuck above cost 1,200 - basically dead. The warm lines (63, 65) drop quickly to ~137 and plateau. And then there's width 64 (the thick line), doing its own thing entirely - frozen for 70 epochs, then a dramatic collapse.
It gets weirder
If this were a simple capacity story - wider is better - you'd expect a smooth curve. That's not what happens at all.
Widths 63 and 65 land at roughly the same cost (~137). Widths 80, 96, and 105 are catastrophically stuck above 1,200. And then widths 110 and 128 are fine again, converging to ~135.
This is not a capacity curve. This is a phase diagram. There are three distinct regimes, and which one you land in has nothing to do with how many parameters you have. Width 80 has more parameters than width 65, and it's ten times worse.
Three regimes
I tracked every internal variable during training - activations, pre-activations, gradients, weight norms - and a clear picture emerged.
Plateau (widths 63, 65): Activations spread across many neurons in layer 2. Nobody dominates. Z2_max hovers around 8-9. The network settles into a passable-but-mediocre solution where many neurons contribute small amounts. It can approximate multiplication as a sum of weak piecewise-linear functions, but it can't learn the sharp product structure. Cost: ~137.
Dead (widths 80, 96, 105): Every pre-activation in layer 2 is negative. Z2_max is below zero. LeakyReLU at alpha=0.01 means these neurons pass only 1% of their signal. The network is effectively disconnected between layers 1 and 2. Nothing gets through. The gradients flowing back are 100x weaker than they should be. Cost: ~1,300.
Converge (widths 64, 110, 128): One or two neurons in layer 2 develop large positive activations. Z2_max shoots up to 14-48. These winners get full gradients (no LeakyReLU attenuation), so they grow faster, which makes them even more dominant. Rich get richer. The network concentrates its computation through a few dominant channels and actually learns the multiplication structure. Cost: ~135-145.
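Reading the regime off a trace is mechanical once you know what to look for. Here is a rough classifier over the peak layer-2 pre-activation - the thresholds are read off the numbers reported in this post, not derived from anything principled:

```python
def classify_regime(z2_max: float) -> str:
    """Label a run by its peak layer-2 pre-activation late in training."""
    if z2_max < 0:
        return "dead"       # every layer-2 neuron negative; LeakyReLU passes ~1% of signal
    if z2_max < 12:         # cutoff between ~9.5 (plateau) and ~14 (converge) is illustrative
        return "plateau"    # many weakly positive neurons, no dominant channel
    return "converge"       # one or two strongly positive winners carry the learning signal

# Z2_max at epoch 100 for the eight widths in the sweep
z2_max_by_width = {63: 8.7, 64: 47.3, 65: 9.5, 80: -0.5, 96: -1.3,
                   105: -1.5, 110: 14.0, 128: 24.5}
print({w: classify_regime(z) for w, z in z2_max_by_width.items()})
```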
The smoking gun
The variable that separates the three regimes is Z2_max - the peak pre-activation in the second hidden layer at epoch 100.
Width 64 has a Z2_max of 47.3 - five times larger than its neighbors at width 63 (8.7) and 65 (9.5). The dead regime widths have Z2_max below zero. The converge regime widths (110, 128) land at 14-25.
This is the mechanism. LeakyReLU creates a 100x gradient asymmetry between positive and negative pre-activations. If your initialization happens to produce even one strongly positive neuron in layer 2, that neuron gets 100x more gradient signal than its negative neighbors. It grows. Its neighbors shrink further. Winner takes all.
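The factor of 100 is nothing deeper than the activation's derivative; a minimal sketch:

```python
import numpy as np

def leaky_relu_grad(z, alpha=0.01):
    # Derivative of LeakyReLU: 1 for positive pre-activations, alpha for negative ones.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-3.0, -0.5, 0.2, 47.3])
print(leaky_relu_grad(z))   # [0.01 0.01 1.   1.  ]
# A neuron with a positive pre-activation receives 1 / 0.01 = 100x more gradient
# than a negative neighbor, so it grows faster and stays positive. Rich get richer.
```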
The gradient norms confirm it: width 64's middle-layer gradient norm (dW2) is 37.5, nearly 4x larger than width 63's 9.7. The signal is concentrated, not distributed. This is why the 64-neuron network can break out of the plateau that traps the 63-neuron version: it has a dominant channel funneling learning signal through.
So why does one neuron matter?
Here's the part that really bothered me. 63 and 64 neurons, same seed. Why would one neuron change the activation landscape so dramatically?
I traced the random number generation. When you call np.random.randn(2, 63) followed by np.random.randn(63, 64), you consume 126 random numbers for W1 and then start W2 at index 126 in the sequence. When you call np.random.randn(2, 64) followed by np.random.randn(64, 64), W1 consumes 128 numbers and W2 starts at index 128.
That two-index offset means W2 for the 63-neuron network and W2 for the 64-neuron network are drawing from completely different parts of the random stream. The first layer weights share 126 out of 128 values. But the second layer - the layer that determines whether you get a dominant positive neuron or not - is initialized with entirely different random numbers.
```python
import numpy as np

np.random.seed(1)
seq_63_w1 = np.random.randn(2, 63)    # consumes draws 0-125
seq_63_w2 = np.random.randn(63, 64)   # starts at draw 126

np.random.seed(1)
seq_64_w1 = np.random.randn(2, 64)    # consumes draws 0-127
seq_64_w2 = np.random.randn(64, 64)   # starts at draw 128

# The first 126 W1 values are the same draws, just arranged in different shapes:
print(np.allclose(seq_63_w1.ravel(), seq_64_w1.ravel()[:126]))          # True
# W2 is offset by two draws, so it is a completely different matrix:
print(np.allclose(seq_63_w2.ravel()[:100], seq_64_w2.ravel()[:100]))    # False
```

The 63-vs-64 difference has nothing to do with 63 or 64. It has to do with which slice of the random number generator happens to produce a neuron with a large positive pre-activation in layer 2, for this particular data distribution.
Confirmation: the 20-seed test
To verify this, I ran both architectures across 20 different random seeds, training for 10,000 epochs each. If the difference is architectural, 64 should consistently beat 63. If it's initialization luck, the results should be noisy.
They're noisy. Some seeds favor 63. Some favor 64. The mean performance is roughly similar. Seed 1 just happened to give width 64 a lucky W2 that produced a dominant neuron, while width 63 with the same seed got a W2 where the activations were more uniform.
The width sweep with fine granularity around 60-68 shows the same thing: which regime you land in depends more on the interaction between seed and width than on width alone. There's no magic number. There's a phase boundary that you cross when initialization happens to produce the right conditions for winner-take-all dynamics.
What multiplication demands
This raises a deeper question: why is multiplication so sensitive to this? Addition isn't. A network learning F = m + a converges reliably regardless of width. What makes multiplication different?
Multiplication requires the network to represent the interaction between inputs, not just their sum. In a ReLU network, this means learning piecewise-linear surfaces that approximate the hyperbolic contours of m * a. You need neurons that respond to specific combinations of m and a - not neurons that respond to m alone or a alone.
The plateau regime fails because distributed, uniform activations approximate multiplication as a sum of independent contributions from m and a. That gets you close (cost ~137) but can't capture the cross-term. The converge regime succeeds because concentrated activations through a dominant neuron can encode the interaction: that neuron's W1 weights form a direction in (m, a) space that captures the product structure.
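One way to see why a single well-aimed direction can carry the cross-term - this identity is standard algebra, not something from the experiment - is that the product is exactly a difference of squares of two linear combinations, and a squared linear readout is the kind of one-dimensional curve a few piecewise-linear neurons can approximate:

```python
import numpy as np

m = np.random.uniform(1, 10, 1000)
a = np.random.uniform(1, 10, 1000)

# Quarter-square identity: m * a = ((m + a)^2 - (m - a)^2) / 4.
# Each term depends on a single linear combination of the inputs, i.e. one
# direction in (m, a) space - exactly what a dominant hidden neuron encodes.
reconstructed = ((m + a) ** 2 - (m - a) ** 2) / 4
print(np.allclose(reconstructed, m * a))   # True
```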
The results
| Width | Final cost | Regime | Z2_max | dW2 norm | What happened |
|---|---|---|---|---|---|
| 63 | 137 | Plateau | 8.7 | 9.7 | Uniform activations, no winner |
| 64 | 145 | Converge | 47.3 | 37.5 | Lucky dominant neuron, phase transition |
| 65 | 137 | Plateau | 9.5 | 10.4 | Same as 63, different slice of randomness |
| 80 | 1,346 | Dead | -0.5 | 19.0 | All layer-2 neurons negative, signal killed |
| 96 | 1,271 | Dead | -1.3 | 20.7 | Same death, slightly different trajectory |
| 105 | 1,360 | Dead | -1.5 | 5.5 | Same death |
| 110 | 135 | Converge | 14.0 | 4.2 | Lucky initialization, clean convergence |
| 128 | 138 | Converge | 24.5 | 30.6 | Dominant neuron, late phase transition |
What this doesn't tell me
I ran this on a toy problem with a specific activation function (LeakyReLU at alpha=0.01). The 100x gradient asymmetry is what drives the winner-take-all dynamics. With ReLU (alpha=0), the dead neurons are truly dead and can't recover. With a smoother activation like GELU or SiLU, the asymmetry is less extreme and the three-regime structure might soften or disappear. I haven't tested that.
I also haven't connected this to the initialization literature rigorously. The lottery ticket hypothesis [1] is about trained networks containing sparse subnetworks that can match full performance. This is a related but distinct phenomenon: it's about untrained networks containing one lucky neuron whose initial activation happens to be large enough to trigger winner-take-all dynamics. The lottery ticket is found after training. This "lucky neuron" exists at initialization and determines the entire training trajectory.
Whether this matters for large models is an open question. Transformers use different activation functions, different initialization schemes, and much larger hidden dimensions. The probability of landing in the dead regime decreases as width increases (more chances for at least one positive neuron). But the sensitivity to initialization is a general phenomenon, and the specific mechanism - activation-dependent gradient asymmetry creating winner-take-all dynamics - is not specific to toy problems.
The actual lesson
This is a cautionary tale for anyone doing hyperparameter sweeps. If your comparison uses the same random seed, you are not running a controlled experiment. The seed interacts with the architecture in ways that can produce spurious differences that look systematic but are actually coincidental. The fix is trivial: average over many seeds. But it's remarkable how often this doesn't happen, even in published work.
I started this investigation thinking I'd found an architecture difference. I ended up finding an initialization story. And that reframing - not "63 vs 64" but "lucky W2 vs unlucky W2" - changes what the experiment actually tells you. You're not measuring architecture. You're measuring one draw from a random number generator.
The code
The full training script with trace collection across all 8 widths:
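A condensed sketch of what that loop does - reconstructed from the setup above; the cost normalization, the omitted biases, and the helper names are assumptions, not the original code:

```python
import numpy as np

ALPHA = 0.01     # LeakyReLU slope
LR = 1e-5        # learning rate
EPOCHS = 150

def leaky_relu(z):
    return np.where(z > 0, z, ALPHA * z)

def leaky_relu_grad(z):
    return np.where(z > 0, 1.0, ALPHA)

def init_weights(sizes, seed=1):
    np.random.seed(seed)
    return [np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def train(X, y, width, epochs=EPOCHS, lr=LR, seed=1):
    W1, W2, W3 = init_weights([2, width, 64, 1], seed=seed)
    trace = []
    for epoch in range(epochs):
        # Forward pass (biases omitted for brevity).
        Z1 = X @ W1;  A1 = leaky_relu(Z1)
        Z2 = A1 @ W2; A2 = leaky_relu(Z2)
        pred = A2 @ W3
        cost = np.mean((pred - y) ** 2)    # exact normalization in the post is unknown

        # Backward pass.
        d_pred = 2 * (pred - y) / len(y)
        dW3 = A2.T @ d_pred
        dZ2 = (d_pred @ W3.T) * leaky_relu_grad(Z2)
        dW2 = A1.T @ dZ2
        dZ1 = (dZ2 @ W2.T) * leaky_relu_grad(Z1)
        dW1 = X.T @ dZ1

        # Vanilla gradient descent.
        W1 -= lr * dW1; W2 -= lr * dW2; W3 -= lr * dW3

        trace.append({"epoch": epoch, "cost": cost,
                      "Z2_max": Z2.max(), "dW2_norm": np.linalg.norm(dW2)})
    return trace

# The width sweep: same data, same init seed, eight first-layer widths.
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(700, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)
traces = {w: train(X, y, w) for w in (63, 64, 65, 80, 96, 105, 110, 128)}
```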
The systematic investigation - 20-seed sensitivity analysis, width sweep, random sequence tracing:
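And a sketch of the 20-seed check, reusing `train` from the block above; the 10,000-epoch budget comes from the experiment described earlier, while the reporting format is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(700, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)

# Train both widths from 20 different init seeds for 10,000 epochs each.
final_costs = {63: [], 64: []}
for seed in range(20):
    for width in (63, 64):
        trace = train(X, y, width, epochs=10_000, seed=seed)
        final_costs[width].append(trace[-1]["cost"])

for width, costs in final_costs.items():
    print(f"width {width}: mean {np.mean(costs):.1f}, "
          f"best {np.min(costs):.1f}, worst {np.max(costs):.1f}")
```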
References
1. Frankle, J. & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv:1803.03635