The genome hypothesis: a thought experiment about what pretraining actually finds

A constraint-based argument for why pretraining might be evolution's outer loop, what the missing developmental stage looks like, and what that means for how we build AI.

This is a living document. It captures a line of reasoning I've been developing about what pretrained neural networks are, what they're not, and what might be missing from how we build AI. No experimental results here. Just constraints, analogies, boundary conditions, and predictions. I'll update this as the thinking evolves.

Current raw factors/questions influencing this exploration:

  1. Take a model trying to learn m * a. What is it actually trying to fit? At 1000 examples with seed A and the code as given in the example, why does it need 64 neurons/dimensions to fit the data? And why does it not generalize to the rule?
  2. How does the number of neurons matter for an arbitrary dataset? What's the dependency? Do more neurons mean a better model? If so, in what configuration of width/depth? If more neurons generally mean a better model, why are elephants not more general-purpose than humans?
  3. What really is the reality perceived by an intelligent brain? Is it the environment? Every biological organism lives on the same planet, yet each perceives its own world differently. Why? A cockroach's reality is not the same as a human's.
  4. In an abstracted representation, why is a pretrained model so bad at sample efficiency? It's almost as if it's starting from scratch and doing the analogue of evolution. In that same abstract space, can we say gradient descent is a credit assignment algorithm, much akin to evolution? Both are sample-inefficient, and both are searching randomly in a high-dimensional space to find the intelligence that fits the reality/dataset. In this abstract world, specifics such as evolution having survival as a direct result of the process may not matter. Or maybe every single pass of the learning algorithm is the analogue of survival. How do we map the abstract high-level processes through a sufficiently zoomed-out lens so that specifics matter exactly where they should? And if we can, what does it tell us about what's missing?
  5. What could be the key to biological intelligence being so much more sample-efficient and energy-efficient? Embodiment, but that's a physical constraint of the system itself, not of the intelligence. Or is it? What if intelligence was the only thing evolution ever tried to optimize for, without physical embodiment constraints? What would be the necessary conditions for such a process? What does such a world look like? Does a pretrained model with no embodiment match the output of such a world? And if yes, then the way evolution built intelligence could be the answer. Evolution didn't create a brain, it created an encoding/map for an intelligence. RNA -> DNA -> Genome -> Development -> Brain. The brain of a human child at birth is not blank; it's already good at so many things.
  6. At sufficient scale of data, can we presume that we are approaching the reality of humans itself? How do we answer how much data is enough? Pure text-based self-supervised learning can model the world vividly enough to make comprehension possible in modern LLMs, but the causal chains break down. Why? Is it the lack of a physical world model? If so, can forcing it to learn the physical properties of the world improve the causal understanding? Forget the work everyone is doing around JEPA/world models and focus on it from first principles. When you ask a model to draw a tank 10 metres above the ground, what does it need to understand to actually draw it such that it's physically plausible? And how does this affect the weird jaggedness of intelligence?
  7. Abstract representation irrespective of modality: Why is a human with no eyesight, no auditory input, or no ability to talk able to perceive reality the same way as people who have all of these? A cleaner test is to look at people lacking these inputs from birth. And when a person is both blind and deaf, why does that make them so much harder to teach? One documented case is Marie Heurtin (https://en.wikipedia.org/wiki/Marie_Heurtin), who was wild in her childhood and was finally able to learn to navigate the world. How was she impaired structurally, and how did she come to build an abstract representation of the world? Even modern massive LLMs with multimodal capabilities converge internally to the same representations of the world, even though the input/output modalities are different. What is this a property of? What happens when you train identical NNs to learn the same data but from different modalities, and even alter the statistical relationships in the data? What remains the same and what changes if the architecture can be made identical? What does it tell us about the representation itself? Why can't we separate the input modality from the representation itself the way humans can? Is it the lack of modularity, or the lack of understanding of what representations even are in high-dimensional space?
  8. Infantile amnesia: At birth, a human child, or the newborn of any organism, is remarkable in what it can do. The brain surely can't be a blank-slate model. It needs some very, very strong priors encoded. What's the trace and origin of each such prior for each organism? Why do the offspring of different animals take different amounts of time to learn a skill? Why does a human baby take the most time to walk? Why is the development so slow? And during this development, the neurons are constantly rewiring themselves, such that humans hardly remember the first 5 years of their lives. Why? Why not let them remember? What defines how a baby learns? (A podcast reference on learning in human babies is a good resource for understanding this.)
  9. Octopus intelligence and convergent evolution. Tracing the nature of intelligence across organisms that evolved it independently. Octopuses evolved complex cognition on a completely separate branch from vertebrates. What's shared and what's different tells you what's necessary versus contingent in the architecture of intelligence.
  10. Emergence of language, writing, speech. What's the mandatory condition for tribal knowledge to compound rapidly? Is language causal to the acceleration of human intelligence or is it a consequence? Tracing the sequence — speech, writing, printing, internet — as successive compression and transmission layers for collective knowledge. And whether compounding is even the right frame.
  11. CNN→ViT→genome compression pipeline. Evolution as brute force search. Genome as compressed extracted inductive biases, not a replay of the search. The missing mechanism for extracting and compressing learned biases from large-scale trained models into priors for sample-efficient next-layer systems. Organism-specific reality as the selection criterion for which biases get encoded. My failed compression experiments as a data point that direct extraction doesn't work cleanly.

If you want the experiments, they're in a separate post: What if pretrained weights are a genome, not a brain?

A lot of the findings in this blog are actually known empirical or previously proposed findings. A lack of citation should not be taken to mean the findings are novel. In fact, counterintuitively, almost nothing in this document at this point is novel unless explicitly said so.

Everything after this point was written in March 2026 and hasn't been updated to reflect the new constraints/questions.

The starting observation

Evolution and backpropagation are both search algorithms. Forget the mechanics of how. Focus on what they do. Both search for structure in the data they operate on. Both find biases. And when you press different search algorithms against the same data, they converge on the same structure.

CNNs and Vision Transformers are architecturally alien. One is built on local spatial priors. The other has almost no spatial prior baked in. But when you look at their learned representations, they converge. The features they find are remarkably similar.

The searcher is interchangeable. The data is the message. The biases found are properties of the territory, not the algorithm.

This isn't a metaphor. It's an empirical observation with real evidence behind it.

The instantaneous evolution thought experiment

Imagine evolution, but it runs at a single instant. One frozen snapshot of reality. No temporal variation. No changing seasons. No predator-prey arms races. No shifting environments. The entire dataset is one frame.

What kind of intelligent solution would this produce?

It wouldn't need memory. Memory is a hack for handling the fact that reality changes. Gone. It wouldn't need prediction. Prediction is compression of temporal regularity. Irrelevant. It wouldn't need adaptation, learning, or plasticity. All of those are responses to non-stationarity in the data.

What's left is a perfect structural mirror of that instant. Every spatial relationship, every compositional hierarchy, every scale of organization captured exactly. Not approximately, not as a useful generalization. Exactly. Because there's no variation to generalize across. You can overfit completely because there's only one data point and it never changes.

But even a single instant has enormous depth. Atoms composing molecules composing structures composing systems. Multi-scale organization exists spatially, not just temporally. So the solution still has hierarchy. Still has abstraction across scale. It just doesn't have flexibility.

It would be a crystal. Maximally ordered. Perfectly fitted. Completely rigid. Intelligence without time is a frozen map of everything, at every scale, right now.

That's what LLMs are. Trained on a frozen corpus. Frozen weights at deployment. Every regularity mapped at high resolution. The crystal.

Which makes you realize: everything we call "intelligence" is actually just the machinery for handling the fact that reality won't sit still.

Three stages, not two

Evolution didn't produce a crystal. It ran against a constantly changing reality and produced something far more interesting. The difference isn't about optimization or scale. It's about stages.

Stage 1: Evolution produces the genome. A search over deep time that finds compressed structural biases. The output is small relative to the search space. 3 billion base pairs. This is pretraining. Backprop searches weight space and compresses what it finds into parameters.

Stage 2: The genome produces the brain. This is development. The genome doesn't specify every synapse. It specifies wiring rules, density patterns, connectivity principles. Over roughly five years, these rules interact with sensory experience to build a cognitive architecture. Infantile amnesia covers this phase. And nearly all five-year-olds have the same cognitive machinery regardless of whether they were born in a Pacific island jungle in 3000 BC or San Francisco in 2025. Same object permanence. Same causal reasoning. Same theory of mind. Different knowledge. Same architecture.

Stage 3: The brain learns. The architecture that development built now acquires knowledge and skill. This runs on 20 watts. It's sample efficient. It resists catastrophic forgetting. It can invent things no training example demonstrated.

Current AI does Stage 1 and jumps straight to Stage 3. We pretrain (evolution), then fine-tune and deploy (try to make it learn and reason). We completely skip Stage 2.

The six questions

To properly compare evolution and backprop without borrowing mechanics from either, you need mechanistically neutral questions. The same questions asked of both processes, in vocabulary that doesn't belong to either domain.

What is the search space? Evolution explores a combinatorial space of physical structures. The space is discrete, astronomically large, and has no fixed dimensionality. New dimensions get created as the search proceeds. An eye doesn't exist in the search space until the components that make an eye possible have already been found. Backprop explores a continuous, fixed-dimensional space of real-valued parameters. The dimensionality is set before search begins. No new dimensions appear.

What is the signal? Evolution gets a single bit, delayed. You reproduced or you didn't. Noisy, sparse, deeply entangled. Backprop gets a dense, immediate gradient. For every parameter, a direction and magnitude. Absurdly rich by comparison. But backprop's signal is narrower in scope. Evolution's signal encodes survival against the totality of reality. The loss function is reality itself.

What is the relationship between the search and the data? Evolution operates inside its own data distribution. Organisms are part of the environment other organisms adapt to. The search modifies what it's searching over. Predators make prey faster. Prey make predators smarter. The landscape moves because you're walking on it. Backprop operates outside its data. The training set doesn't change because the model got better.

What persists between steps? Evolution carries forward entire organisms. Working, integrated, viable solutions. Every intermediate step must be a functioning system. You can't have half a heart. Backprop carries forward a weight vector. Intermediate checkpoints don't need to be coherent. This forces evolution toward compositionality and modularity. Things need to work together at every step.

What is the scope of a single step? Evolution makes small, mostly random perturbations. Occasionally large jumps. But the typical step is local and blind. Backprop makes coordinated, global updates to all parameters simultaneously. Every weight moves in a direction informed by the gradient.
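The contrast in step scope can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: the quadratic `loss`, the step sizes, and the function names are all hypothetical choices, and the "blind" searcher is a bare accept-if-not-worse perturbation, not a model of biological evolution. It applies 50 steps of each kind to the same 100-dimensional starting point.

```python
import numpy as np

def loss(w):
    # Toy objective (an assumption for illustration): a quadratic bowl.
    return float(np.sum(w ** 2))

def gradient_step(w, lr=0.1):
    # Coordinated update: every coordinate moves along the exact gradient 2w.
    return w - lr * 2.0 * w

def blind_step(w, rng, sigma=0.1):
    # Local, undirected perturbation, kept only if it is not worse.
    candidate = w + sigma * rng.standard_normal(w.shape)
    return candidate if loss(candidate) <= loss(w) else w

rng = np.random.default_rng(0)
w0 = rng.standard_normal(100)

w_grad, w_blind = w0.copy(), w0.copy()
for _ in range(50):
    w_grad = gradient_step(w_grad)
    w_blind = blind_step(w_blind, rng)

print("start:", loss(w0))
print("after 50 gradient steps:", loss(w_grad))
print("after 50 blind steps:", loss(w_blind))
```

On a smooth, fixed objective like this one the gradient wins by a wide margin, which is the point of the comparison: the informed step is only available because the landscape is fixed and differentiable.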

What is the nature of the intelligence created, and how does it persist? Evolution produced layers. At the bottom, hardcoded reflexes. Above that, nervous systems that learn within a lifetime. Above that, culture that persists across organisms. The search produced a hierarchy of rigidity and flexibility. Backprop produces one layer. The crystal. Frozen weights that encode every regularity found during search.

Evolution's deepest trick wasn't finding good solutions. It was finding solutions that themselves search. The search created more search. Backprop hasn't done this.

The energy constraint

A datacenter running an LLM: 100 megawatts. A human brain: 20 watts. The brain is not a small thing. It has roughly 86 billion neurons and on the order of 100 trillion synapses. That is at least the scale of a large language model's parameter count.

So the energy gap isn't about the size of the structure. Both are massive. Both need trillions of connections to represent reality. The gap is in the access pattern.

The crystal is read densely. Every token flows through every weight. The whole map, every time. The brain runs sparse activation. At any instant, a tiny fraction of neurons are firing. The search process at inference time is selecting a small dynamic subnetwork from an enormous substrate.

Same warehouse. One strategy turns on every light every time someone asks a question. The other has a flashlight and knows where to point it.
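To put rough numbers on the flashlight picture, here is a minimal sketch that counts multiply-accumulate operations for a dense read versus a routed read of the same set of weights. Everything in it is an assumption made for illustration: the block sizes, the linear `router`, and the top-k rule (borrowed from mixture-of-experts style routing) are hypothetical and are not a claim about how the brain selects its active subnetwork.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, d = 16, 64                             # hypothetical sizes
blocks = rng.standard_normal((n_blocks, d, d))   # the "warehouse" of weights
router = rng.standard_normal((n_blocks, d))      # scores each block's relevance

def dense_read(x):
    # Every block is used for every input.
    out = sum(W @ x for W in blocks)
    macs = n_blocks * d * d
    return out, macs

def sparse_read(x, k=2):
    # Only the k highest-scoring blocks are used for this input.
    scores = router @ x
    active = np.argsort(scores)[-k:]
    out = sum(blocks[i] @ x for i in active)
    macs = n_blocks * d + k * d * d   # routing cost plus the active blocks
    return out, macs, active

x = rng.standard_normal(d)
_, dense_macs = dense_read(x)
_, sparse_macs, active = sparse_read(x)
print(dense_macs, sparse_macs, dense_macs / sparse_macs)
```

The stored structure is identical in both cases. Only the access pattern changes, and the cost scales with how many blocks are touched rather than how many exist.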

The 20-watt brain doesn't know less than the 100-megawatt model. It stores less actively. It compensates by searching in real time: perceive, attend, reason, imagine, retrieve, compose. All search operations running on minimal power because they operate on live data streaming in, not on a massive static index.

If your model still needs 100 megawatts, it's still a crystal. The wattage is the test.

The genome is not compact, but what it builds is efficient

An early version of this argument confused me. I was claiming the brain is a "compact search program" versus the LLM as a "massive map." But the brain has trillions of synapses. Same scale as an LLM. It's not compact at all.

The correction: the parameter count is the same because both need a large structure. Reality is complex. What changes is not storage but access. The genome specifies the structure. Development builds it. And the built structure runs sparse, dynamic, context-dependent activation. The genome didn't make a small thing. It made a large thing that knows how to use itself efficiently.

What distillation tells us

A trillion-parameter model's knowledge transfers into a model orders of magnitude smaller with surprisingly little loss. The actual information content of pretraining is vastly smaller than the parameter count that found it. The large model isn't large because the knowledge is large. It's large because the search needed that workspace. The knowledge itself is compressible.
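For readers who have not seen it written down, the usual distillation objective is a divergence between the teacher's and the student's softened output distributions. The sketch below is a minimal NumPy version under stated assumptions: the function names and the temperature value are hypothetical, the logits are random stand-ins for real model outputs, and the common extra scaling by the squared temperature is omitted.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)   # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # averaged over the batch.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

rng = np.random.default_rng(0)
teacher = rng.standard_normal((4, 10))       # stand-in teacher logits
student_far = rng.standard_normal((4, 10))   # an untrained student
print(distillation_loss(teacher, teacher))       # matching student
print(distillation_loss(teacher, student_far))   # mismatched student
```

Note what the loss sees: only the teacher's outputs. Nothing about the teacher's parameter count enters the objective, which is why the student is free to be much smaller.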

That's the genomic bottleneck. Evolution explored a space of all possible organisms. The result that persists? 3 billion base pairs. The ratio of search workspace to compressed output is absurd. And the compressed version isn't a degraded copy. It's the point.

Distillation is us accidentally performing the compression step without understanding what we're compressing. We deploy the student as a smaller crystal. Biology deploys the genome as a developmental program. Same compression. Completely different decompression.

What DNA actually does

DNA doesn't compute. DNA doesn't see, hear, or think. DNA builds a machine that computes. The machine is a different physical structure from the genome. It has its own parameters (synaptic weights) that are not the genome's parameters. The genome specifies the architecture, the wiring rules, the initial conditions. Then the machine learns on its own, in its own weights, which the genome never touches again.

Cortical Labs grew neurons in a petri dish and taught them to play Pong. The cells are the genome's output. Computational units that learn. The genome itself doesn't play Pong. It builds things that play Pong.

Current ML has nothing like this. We have one set of weights. We train them, deploy them, maybe fine-tune them. There is no second structure. The "genome" and the "brain" are the same object.

We're trying to make the genome reason, instead of using the genome to build something that reasons.

The planet and the continent

The genome is not universal across all possible realities. It's universal across all life on this planet. Change the planet (different chemistry, different physics) and you get a different genome. The genome is shaped by the interaction between the search process and the environment it operates in.

For a transformer, the "planet" is the architecture. The attention mechanism, the MLP structure, the residual connections, the layer norms, the specific way information flows through a stack of these blocks. That's fixed. That's the physics.

The data is the "continent." Language is one continent. Code is another. Vision is another. Different species evolve on different continents, and they look different. But they all share the same DNA machinery, the same cellular structure, the same basic metabolic processes. Because those were shaped by the physics, not the geography.

So the prediction isn't that structural information from different training data is identical. The prediction is that it shares a common core shaped by the architecture, and differs at the margins shaped by the specific data. The core is where computation must concentrate in any transformer regardless of task, because of how attention and MLPs interact with residual streams.

Permutation symmetry and why topology matters

Neural networks have permutation symmetry. Neuron 47 in layer 3 has no intrinsic identity. Swap it with neuron 200, swap all incoming and outgoing weights accordingly, and the network computes exactly the same function. The numbering is arbitrary.
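This is easy to check directly. The sketch below builds a small one-hidden-layer ReLU network (sizes and names are arbitrary choices for illustration), shuffles the hidden units, applies the same shuffle to the incoming and outgoing weights, and compares outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 32, 4
W1 = rng.standard_normal((d_hidden, d_in))
b1 = rng.standard_normal(d_hidden)
W2 = rng.standard_normal((d_out, d_hidden))

def mlp(x, W1, b1, W2):
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    return W2 @ h

perm = rng.permutation(d_hidden)
W1_p, b1_p = W1[perm], b1[perm]   # reorder each hidden unit's incoming weights and bias
W2_p = W2[:, perm]                # reorder the matching outgoing weights

x = rng.standard_normal(d_in)
same = np.allclose(mlp(x, W1, b1, W2), mlp(x, W1_p, b1_p, W2_p))
print(same)
```

The two weight sets look different entry by entry, yet they define the same function. Any analysis that depends on which index a neuron happens to have is reading the coordinate system, not the network.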

This means any structural information tied to specific neuron addresses should be meaningless in a different coordinate system. A mask that says "the connection from neuron 47 to neuron 200 is important" only makes sense relative to what those neurons became during one specific training run.

But here's the thing. At low sparsity, permutation symmetry is nearly intact. Almost every neuron can talk to almost every other neuron. You can swap freely. At high sparsity, each neuron has a very specific, sparse connectivity profile. Permutation symmetry is broken by the mask itself. You can no longer freely swap neurons because different neurons have different wiring.

So a sparse mask doesn't just select connections. It imposes a topology. A graph structure. And graph structures have properties that are coordinate-invariant. Degree distribution. Clustering coefficient. Block structure. Hub-and-spoke versus uniform connectivity. These are properties of the graph as a whole. They survive renaming every node.
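A quick way to see the coordinate-invariance is to rename every node of a sparse graph and recompute a couple of these properties. The sketch below is a simplification: it treats the mask as a generic undirected graph rather than a layered network, and the graph size and density are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
upper = np.triu(rng.random((n, n)) < 0.15, k=1)
A = (upper | upper.T).astype(int)   # symmetric adjacency matrix, no self-loops

def degree_sequence(A):
    # Sorted degrees: a property of the graph, not of the node labels.
    return np.sort(A.sum(axis=1))

def triangle_count(A):
    # trace(A^3) counts each triangle 6 times in an undirected graph.
    return int(np.trace(A @ A @ A)) // 6

perm = rng.permutation(n)
A_renamed = A[np.ix_(perm, perm)]   # same graph, every node renamed

print(np.array_equal(degree_sequence(A), degree_sequence(A_renamed)))
print(triangle_count(A), triangle_count(A_renamed))
```

The adjacency matrices differ, but the degree sequence and the triangle count (a raw ingredient of the clustering coefficient) come out the same.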

At low sparsity, pattern doesn't matter because everything is connected. At high sparsity, pattern becomes everything because topology is the only thing that distinguishes one sparse network from another.

Reading the genome at the wrong level

Magnitude-based pruning asks: which individual weight entries are large? That's looking at a matrix element by element. One number in isolation. But a weight matrix is a transformation. It has structure that lives in the relationships between weights.

A matrix can be decomposed (via SVD) into its principal directions. The directions along which the transformation does real work. A direction with a large singular value is a direction where the transformation is active. A direction with a tiny singular value is a direction where it does almost nothing.

Instead of asking "which individual weights are big," you can ask "which directions does this matrix use?" This is looking at the matrix as a whole. The structural information isn't in any single entry. It's in the geometry of the transformation.

Magnitude-based extraction is like reading DNA by weighing nucleotides. You'd learn something crude. But the actual information is in the sequence, the pattern, the relationships. SVD reads the relationships.
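The difference between the two readings shows up even on a toy matrix. The sketch below is a constructed example, so treat it as an illustration and not as evidence about real pretrained weights: the matrix `W` is hypothetical, built as a rank-2 transformation plus small noise so that no single entry stands out. It compares keeping the largest 10% of entries against keeping the top two singular directions, which is a comparable storage budget (about 410 numbers versus about 256).

```python
import numpy as np

rng = np.random.default_rng(0)
n, rank = 64, 2
U = rng.standard_normal((n, rank))
V = rng.standard_normal((n, rank))
# Hypothetical weight matrix: two real directions plus small noise.
W = U @ V.T / n + 0.001 * rng.standard_normal((n, n))

# Entry-by-entry view: keep the largest 10% of weights by magnitude.
threshold = np.quantile(np.abs(W), 0.9)
W_pruned = np.where(np.abs(W) >= threshold, W, 0.0)

# Direction view: keep the top singular directions.
U_s, s, Vt = np.linalg.svd(W)
W_lowrank = (U_s[:, :rank] * s[:rank]) @ Vt[:rank]

def rel_error(approx):
    # Relative Frobenius-norm error of an approximation to W.
    return float(np.linalg.norm(W - approx) / np.linalg.norm(W))

print("leading singular values:", s[:4])
print("magnitude pruning error:", rel_error(W_pruned))
print("top-2 directions error:", rel_error(W_lowrank))
```

The singular values drop sharply after the second one, and the two-direction reconstruction is far closer to `W` than the pruned matrix. The example is rigged in SVD's favor by construction; what it shows is only that the two methods are measuring different things.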

What the genome makes possible

The deepest prediction of this framework isn't about speed. It's about reachability.

If structural information just helps you learn faster, then a system without it would eventually catch up given enough time. The structure is a shortcut. Useful but not fundamental.

But if the structure defines which directions computation can move in under resource constraints, then wrong directions mean some solutions are permanently inaccessible. Not slow to reach. Unreachable. No amount of training closes the gap.

The genome doesn't make brains learn faster. It makes certain kinds of learning possible that would otherwise be impossible in a resource-constrained system. A randomly wired neural mass of the same size, given unlimited experience, would never develop the same cognitive capabilities. The wiring is what makes the capability space reachable.

The question isn't whether structure helps. It's whether structure defines what's achievable.
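A deliberately tiny version of the reachability claim can be checked in closed form. In the sketch below (all names, sizes, and masks are hypothetical), a masked linear layer is fitted to a target in which output 0 depends only on input 5. Instead of training, it computes the least-squares optimum under each mask, which is the best any amount of training could reach with that wiring.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d = 2000, 6
X = rng.standard_normal((n_samples, d))
W_true = np.zeros((d, d))
W_true[0, 5] = 1.0          # output 0 depends only on input 5
Y = X @ W_true.T

def best_fit_under_mask(X, Y, mask):
    # Least-squares optimum per output, using only the inputs its mask allows.
    W = np.zeros(mask.shape)
    for i in range(mask.shape[0]):
        allowed = np.flatnonzero(mask[i])
        if allowed.size:
            coef, *_ = np.linalg.lstsq(X[:, allowed], Y[:, i], rcond=None)
            W[i, allowed] = coef
    return W

def mse(W):
    return float(np.mean((X @ W.T - Y) ** 2))

right_mask = np.zeros((d, d), dtype=bool)
right_mask[0, 5] = True      # one connection, in the right place
wrong_mask = np.ones((d, d), dtype=bool)
wrong_mask[:, 5] = False     # thirty connections, but input 5 is cut off

mse_right = mse(best_fit_under_mask(X, Y, right_mask))
mse_wrong = mse(best_fit_under_mask(X, Y, wrong_mask))
print(mse_right, mse_wrong)
```

The mask with a single well-placed connection fits the target essentially exactly. The mask with thirty connections in the wrong places is stuck with a residual error at its optimum. This is a linear toy and says nothing by itself about deep networks or brains, but it is the simplest instance of "unreachable, not slow."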

What evolution has that backprop doesn't: two stages of search

Evolution is not one process. It's two.

Stage one: blind variation. Random mutation. Not directed, not "smart." Noise.

Stage two: selection. Keep what works, discard what doesn't.

Neither stage works alone. Blind variation alone is noise. Selection alone converges to local optima.
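The claim that neither stage works alone can be seen in a toy genetic algorithm. The sketch below is an assumption-laden illustration, not a model of biology or of any AI system: the objective (count the 1-bits in a 64-bit genome), the mutation rate, the truncation selection rule, and the function names are all hypothetical choices.

```python
import numpy as np

def fitness(pop):
    # Toy objective (an assumption): number of 1-bits in each genome.
    return pop.sum(axis=1)

def mutate(pop, rng, rate=0.02):
    # Blind variation: flip each bit independently with a small probability.
    flips = rng.random(pop.shape) < rate
    return np.where(flips, 1 - pop, pop)

def select(pop):
    # Selection: keep the top half, duplicate it to restore population size.
    order = np.argsort(fitness(pop))[::-1]
    survivors = pop[order[: len(pop) // 2]]
    return np.concatenate([survivors, survivors])

def run(use_variation, use_selection, generations=200, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.integers(0, 2, size=(40, 64))
    start_best = int(fitness(pop).max())
    for _ in range(generations):
        if use_variation:
            pop = mutate(pop, rng)
        if use_selection:
            pop = select(pop)
    return start_best, int(fitness(pop).max())

for name, flags in [("variation only", (True, False)),
                    ("selection only", (False, True)),
                    ("both", (True, True))]:
    start, final = run(*flags)
    print(f"{name}: best fitness {start} -> {final}")
```

With selection only, the population collapses onto the best genome it started with and never moves past it. With variation only, fitness drifts without direction. Only the combination climbs.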

Current AI is all selection, zero variation. When a model proposes an experiment or generates an idea, it's not generating blind variation. It's retrieving ideas that "make sense" from its training distribution. That's why every instance of Claude running Karpathy's autoresearch converges to the same playbook. 462 experiments across the community. All HP tuning. Zero architectural invention. They're all drawing from the same distribution of "sensible" ideas.

The AI has no mechanism for generating something it hasn't seen. Something that doesn't "make sense" yet. Something genuinely random that selection can then filter.

Predictions this framework makes

If this line of reasoning is correct, several things follow.

Scaling laws are explained and bounded. More parameters, more data means you're making the genome higher resolution. Finer-grained biases. But there's a ceiling. The human genome is 3 billion base pairs. It stopped growing. What scaled was what the genome builds.

Sample efficiency stays terrible under the current paradigm. If the weights are the genome and we never run Stage 2, then every new task requires retraining or massive in-context prompting. A child learns to catch a ball in 50 tries. An LLM needs millions of examples because it's trying to do with the genome what should be done by the thing the genome builds.

In-context learning is the genome expressing itself, not learning. The forward pass reads the genome linearly and produces behavior that mimics flexibility. But nothing persists. No new structure is built. It's DNA being read, not a brain being used.

RLHF is cosmetic surgery on the genome. It tweaks the biases so the crystal's surface looks more aligned. It's not structural. That's why alignment is fragile.

The energy gap is a prediction, not just an observation. If someone builds Stage 2, if they find a process that takes pretrained weights and builds a sparse, task-specific, dynamically activated cognitive architecture from them, the energy cost should drop by orders of magnitude. The wattage is the proof that you've left Stage 1.

The autoresearch result is predicted exactly. Current AI can't do novel research because it has a genome but no cognitive architecture built from that genome. It can express its biases (HP tuning, pattern matching over known techniques) but it can't build new cognitive machinery to think about the problem differently.

What I don't know

I don't know what Stage 2 is. I don't know how to build it. I don't know if the structural information I've found in pretrained models is actually the right kind of information to serve as a genome. I don't know if the analogy to biological development is deep or superficial. I don't know if the boundary between "good initialization" and "developmental blueprint" is even a meaningful distinction computationally.

I have a strong prior toward this framing that I can't fully separate from confirmation bias. Multiple startup failures have made me distrust my own certainty. But six independent lines of evidence (energy gap, sample efficiency, autoresearch convergence, infant universality, distillation compression, and the structural information experiments) all resolve under one explanation. That's either a real insight or the most seductive confirmation bias I've ever experienced.

This document exists so I can be honest about which parts are solid, which are speculative, and which are wishful thinking. I'll keep updating it.