Hindsight Is 20/20

Where I Started

On cochlear implants, language models, and why the hard part of learning was never getting the information.

I went to an AI networking event expecting to spend the evening thinking about models. I left thinking about ears.

One of the speakers worked in a brain–computer interface lab, and somewhere in the conversation cochlear implants came up — not as the main topic, just an aside. I knew the mechanical version of the story: a device bypasses the damaged parts of the ear and stimulates the auditory nerve directly. What I'd never thought about was what happens after it's switched on. The way he told it, the first thing a recipient gets isn't sound so much as noise — pops, clicks, something mechanical and wrong — and then, months later, they can suddenly hear. Speech, voices, music, out of what started as static.

He may have been playing to the room a little. So I looked it up afterward, and the truth is less cinematic and, I think, more interesting. The improvement isn't a switch that flips months in. It's gradual: recipients do describe the early signal as robotic and thin, but speech perception climbs steadily over the first weeks, keeps climbing through about six months, and then mostly plateaus.¹ There's rarely a single moment when sound arrives. There's a slope steep enough, seen from far enough away, to look like a moment.

That's the part that snagged me, sitting in that room, because the first thing it reminded me of was the way people talk about abilities "emerging" in large language models — the same shape, nothing and nothing and nothing and then apparently all at once a capability. I'd gone in expecting to think about that kind of emergence in machines, and here it was in a person's auditory cortex, described by someone who works on brains.

But the longer I sat with it, the less it seemed to be about emergence, or about hearing. Here's the thing I couldn't get past. The implant didn't give the recipient the ability to hear: the auditory nerve and the cortex behind it were already there, intact, waiting. It handed the brain an unfamiliar encoding. The device gets tuned over the first months, but the real change isn't that it starts sending a better signal. It's that the brain slowly learns to read the one it gets.² The reader changed, not the page.

Which raises a question about the rest of us. If a deafened adult has to learn, over months, to turn electrode pulses into sound, when did the rest of us learn to turn air-pressure waves into it? We did the same thing, as infants — so early, and so far beneath memory, that we mistake the finished product for the world itself. Normal hearing was never raw reception. It's an interpretation we completed so long ago we forgot it was learned.

Hold onto that, because it's where this ends up. The cochlear implant isn't an exception to normal hearing. It's normal hearing with the timestamp still attached — the rare case where some of that learning is slow enough, and late enough, to watch, instead of finishing invisibly before we could form a memory of it.

And once you see hearing that way, it stops being a story about hearing.

Two things we both call learning

We usually talk about learning as the acquisition of information. A student takes in facts, a scientist gathers observations, a model trains on data, and each of them ends up holding more of the world than before. That's real, and I don't want to wave it away. Some learning genuinely is just storage. A phone number, a password, the capital of a country you'll never visit — you can acquire those without anything about you changing except the contents of a list. That's a lookup table in the head, and a database does it better.

But that isn't the kind of learning anyone means when they say something finally clicked.

Everyone has had the experience of staring at material that is fully available — the textbook is open, the documentation exists, the lecture was clear enough — and getting nothing from it. Then someone explains it slightly differently, and the same information rearranges itself into something usable. The facts didn't change. You did. A good explanation doesn't hand you more data; it builds a bridge from how you already see things to a way you couldn't see before, and afterward you can read what was already on the page.

So there are two things hiding under one word. One adds entries to a list. The other changes the reader. The first is information. The second is perception, and it's the one that does all the work we actually care about — the one that turns a novice into an expert, noise into sound, a stack of observations into a discovery. The mistake, the one I want to spend the rest of this on, is treating the second as if it were a larger pile of the first. More information doesn't get you there. A child handed every textbook ever written does not wake up a scientist. The information is present and the way of seeing it is not, and no amount of the former assembles the latter on its own.

This is obvious, in hindsight, the moment you look at machine learning, where the whole game is representation. A good representation pulls the meaningful factors of a problem apart so that a simple decision can separate them; a bad one tangles them together until the signal is technically present and practically unreachable.³ The data can be identical in both cases. What changed is the form it's held in.
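A minimal sketch of that claim, using toy data and hypothetical helper names (`ring`, `best_threshold_accuracy`) that are assumptions for illustration and not taken from the cited review. The same points are scored twice: once with a single cut on the raw x coordinate, and once with the same kind of cut on distance from the origin.

```python
import math

def ring(radius, n):
    """Return n evenly spaced points on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

# Toy data (an assumption): class 0 sits on a small ring, class 1 on a larger one.
inner = ring(1.0, 40)
outer = ring(3.0, 40)

def best_threshold_accuracy(values_0, values_1):
    """Best accuracy any single cut 'value > t' can reach, in either orientation."""
    labelled = [(v, 0) for v in values_0] + [(v, 1) for v in values_1]
    best = 0.0
    for t in sorted(v for v, _ in labelled):
        correct = sum((v > t) == bool(y) for v, y in labelled)
        # Flipping which side counts as class 1 gives the complementary score.
        accuracy = max(correct, len(labelled) - correct) / len(labelled)
        best = max(best, accuracy)
    return best

# Representation 1: the raw x coordinate, where the two rings overlap.
raw_x_accuracy = best_threshold_accuracy(
    [x for x, _ in inner], [x for x, _ in outer])

# Representation 2: distance from the origin, where the rings pull apart.
radius_accuracy = best_threshold_accuracy(
    [math.hypot(x, y) for x, y in inner],
    [math.hypot(x, y) for x, y in outer])

print(f"cut on raw x:  {raw_x_accuracy:.2f}")
print(f"cut on radius: {radius_accuracy:.2f}")
```

Nothing about the points changes between the two calls. Only the coordinate system does, and with it whether one threshold can do the job.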

I should say plainly that I don't think people are language models. We have bodies, drives, memory, attention, and a few hundred million years of evolutionary baggage no transformer carries. But the vocabulary is useful, and I'll keep borrowing it, because the two kinds of system share at least one constraint: neither learns by piling up information. Both learn by acquiring a usable way to represent it.

The information was already there

Pavlov's dog is the cleanest demonstration I know, precisely because it's so often told as something smaller than it is. Before conditioning, the dog hears the bell and gets the food. Both signals arrive; the relevant information is fully present from the start. What's missing is the relationship. Then bell and food repeat together, and at some point the bell stops being a sound and becomes a prediction — the dog salivates before the food comes. Nobody told the dog the rule. No one handed it the sentence bell predicts food. Repetition reorganized the animal until a relationship it always had access to became something it could act on.⁴ The usual gloss is that the dog learned an association. The part worth keeping is that a learning system acquired the ability to perceive a structure that had been invisible to it. The information was there the whole time. The pattern wasn't, until it was.
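The Rescorla-Wagner model cited in the footnote puts numbers on that repetition: on each pairing, the bell's associative strength moves toward the outcome by a fraction of the prediction error. Here is a hedged sketch of that update for a single cue. The function name, the single `learning_rate` (which folds the model's separate salience and learning-rate terms into one number), and the trial count are all assumptions chosen for illustration.

```python
def rescorla_wagner(n_trials, learning_rate=0.2, reward=1.0):
    """Track how strongly one cue predicts the reward across paired trials.

    learning_rate is a simplification: the original model uses separate
    parameters for cue salience and outcome learning rate.
    """
    strength = 0.0  # before conditioning, the bell predicts nothing
    history = []
    for _ in range(n_trials):
        prediction_error = reward - strength  # surprise on this trial
        strength += learning_rate * prediction_error
        history.append(strength)
    return history

history = rescorla_wagner(30)
for trial in (1, 5, 10, 30):
    print(f"trial {trial:2d}: bell predicts food at {history[trial - 1]:.3f}")
```

The input on every trial is identical: bell, then food. What changes is the strength the system carries into the next trial, which is the reorganization the paragraph describes.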

Here the picture gets a wrinkle the dog and the implant don't quite show, because both mostly take their input as given. We don't. We act. We choose what to look at next, and a good deal of what we learn from is data we went and generated ourselves. That makes the loop recursive in a way it isn't for a model sitting still in front of a fixed dataset: once you acquire a way of seeing, it changes what you reach for, which changes what you see next. A lens is not only a filter on incoming experience. It's a hand on the tap.

I know this one from the inside. After enough years moving between biology, genomics, software, and machine learning, I stopped experiencing them as separate subjects. They kept collapsing into the same handful of shapes — inputs, transformations, constraints, feedback, failure modes, the slow accretion of structure nobody designed. That lens was trained into me by the work, and once it existed it quietly decided what I'd notice in the next thing I read. It's why I can't look at a large codebase now without seeing something biological in it, a comparison that wouldn't have occurred to me a decade ago because, for me, it didn't yet exist.

Emergence is mostly a fact about the observer

This is the thing that actually snagged me in that room, so let me come back to it properly. In machine learning, people call a capability emergent when it shows up suddenly at scale — absent in the small model, present in the large one. Arithmetic, in-context learning, chain-of-thought, tool use; all of them have been described this way, as abilities that appear past some threshold rather than growing in smoothly.⁵ It's a genuinely exciting framing, and it's also been pushed back on hard. Schaeffer and colleagues pointed out that some of the suddenness lives in the measurement: a metric that only pays out after a sharp threshold will turn smooth underlying improvement into an apparent jump.⁶

I don't need to settle that debate to take the useful thing from it, because the pushback sharpens the point. Emergence doesn't always mean a capability arrived from nowhere. Sometimes it means the capability became visible — to us, the people holding the measuring stick. The system was getting better underneath the whole time; we only registered it once it crossed the line where our instrument could see it.
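A small numerical sketch in the spirit of that measurement argument. It assumes, for illustration only, that each token of an answer is correct independently with the same probability, and that the answer is ten tokens long. The function name and the list of accuracies are hypothetical.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token in the answer is right, assuming independence."""
    return per_token_accuracy ** answer_length

# Assumed: per-token accuracy improving smoothly as models get larger.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
answer_length = 10

exact_match = [exact_match_rate(p, answer_length) for p in per_token]

for p, em in zip(per_token, exact_match):
    print(f"per-token {p:.2f} -> exact-match {em:.3f}")
```

Under these assumptions a per-token accuracy of 0.50 scores about 0.001 on exact match, and 0.90 still scores only about 0.35. The underlying quantity climbs steadily the whole way, while the all-or-nothing metric sits near zero and then rises steeply.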

Which is exactly the implant, and the slope I'd mistaken for a switch when the speaker described it. The recipient adapts gradually, nothing dramatic from one day to the next, until speech comes clear and it feels — from outside, and even from inside — like sound switched on. It didn't. We just can't watch adaptation happen, only the point where it crosses into something we'd call hearing. Not emergence as magic — emergence as the moment interpretation catches up to a signal that was there all along.

But there's a second half I left out, and it's the one that ties the machine back to the ear. Emergence at scale isn't only about building a bigger model; it's gated by how much the model has been through. A large network trained on too little data stays undertrained — the capacity is there and the experience isn't, and the capability never appears.⁷ Grokking is starker: a model fits its training data and still can't generalize, and then, only after a long stretch of further training, the ability snaps in — latent the whole time, waiting on enough exposure to become reachable.⁸ Which is the dog, and the implant, and every stretch of study that felt like nothing until it didn't.

This sounds for a second like a contradiction — didn't I just say more information doesn't get you there? But two different things go under "more data." Stored facts you can't yet read are inert; that's the textbooks. Exposure a system can extract structure from is the raw material perception is built out of; that's the trillionth token, the hundredth bell, the sixth month. Data doesn't become understanding by piling up. It becomes understanding by being enough to generalize from. So when emergence looks sudden, two different things are hiding in the word: a real threshold the system crosses by accumulating enough experience, and the jump in our own measurement when its output finally clears the bar we set. The slope is exposure. The step is us.

Months, not years

There's a detail in the implant story I skipped, and it's the one that says what kind of learning this is. A deafened adult adapts to an implant in months. A baby takes years to learn to hear. If both are learning the same thing, why the gap?

Because they aren't learning the same amount. The adult already spent decades training the interpreter — the phoneme categories, the voices, the grammar, the whole apparatus that turns sound into meaning. An implant doesn't ask the adult to rebuild any of that. It asks for a much smaller thing: a new mapping from an unfamiliar signal onto machinery that's otherwise intact. In the language of the models, it's closer to fitting a small adapter onto a frozen, pretrained network than to training one from scratch.⁹ The base is preserved; only the front end is refit. That's why it's fast.
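For readers who haven't met the adapter idea, here is a rough sketch of the low-rank form described in the cited LoRA paper: a frozen weight matrix plus a small trainable product of two thin matrices. The helper names, the tiny matrices, and the 1024-by-1024 layer size are assumptions for illustration. The code says nothing about how an auditory system actually adapts; it only shows the machine half of the analogy.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def adapted_forward(frozen_w, adapter_a, adapter_b, x):
    """Frozen base output plus a low-rank correction: W x + B (A x)."""
    base = matvec(frozen_w, x)
    update = matvec(adapter_b, matvec(adapter_a, x))
    return [b + u for b, u in zip(base, update)]

def trainable_fraction(d_model, rank):
    """Adapter parameters relative to one frozen d_model x d_model matrix."""
    frozen = d_model * d_model
    adapter = 2 * d_model * rank  # A is rank x d_model, B is d_model x rank
    return adapter / frozen

# Toy 3x3 frozen layer with a rank-1 adapter (values are arbitrary).
frozen_w = [[1.0, 0.0, 2.0],
            [0.0, 1.0, 0.0],
            [3.0, 0.0, 1.0]]
adapter_a = [[0.5, -0.5, 0.25]]    # 1 x 3
adapter_b = [[0.0], [0.0], [0.0]]  # 3 x 1, zero at the start as in the paper
x = [1.0, 2.0, 3.0]

print(adapted_forward(frozen_w, adapter_a, adapter_b, x))
print(f"trainable share at d=1024, rank=8: {trainable_fraction(1024, 8):.4f}")
```

With the second adapter matrix at zero, the adapted layer starts out behaving exactly like the frozen one, and at this assumed size the part that gets trained is under 2 percent of the layer. That is the sense in which only the front end is refit.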

The baby has no base. It's building the entire interpreter from nothing, entangled with learning language and a world to hear it in, and the data arrives at the slow trickle of a single life. Over the first year, the same infant that could discriminate the speech sounds of every human language narrows to the ones it actually hears.¹⁰ The interpreter specializes to its input. Years, because there's a whole stack to build.

And notice what the gap implies. Detecting sound was never the hard part. A newborn's ears work; a fetus responds to sound in the last months before birth. What takes years isn't hearing in the sense of transduction. It's hearing in the sense of interpretation — which is the only sense this essay has been about. So the implant doesn't really show us a baby learning to hear. It shows us one layer of that process, the front end, relearned in the open and at adult speed, while the rest of the stack sits untouched. Normal hearing with the timestamp on a single layer.

The brain can't afford to see everything

There's an unglamorous reason perception works by interpretation rather than by faithful recording: the brain is expensive. It's about 2 percent of body mass and burns something like a fifth of your resting energy.¹¹ A system on that kind of budget can't process reality as a complete, raw stream. It has to predict, compress, and fill in, which is broadly what the predictive-processing account of the brain says it does: generating expectations about its input and really attending only when the input violates them.¹² Perception, on that view, is closer to active inference than to passive reception. Habits, heuristics, expertise, perception itself: all of them are compression, ways of not computing the world from scratch every time.

This is the part that loops back to the implant and tightens it. The brain isn't merely receiving a strange new signal and tolerating it. It's finding a compression of it — a way to predict and act through the new encoding cheaply enough to be worth the energy. Perception emerges when the system locates a useful shortcut, not when it finishes some exhaustive decode. It also explains why real learning so often feels like nothing, nothing, nothing, and then everything at once. You can accumulate experience for years without a structure to hold it, and then a single concept arrives that compresses all of it, and the backlog reorganizes overnight. The years weren't wasted. They were waiting for a representation.

Expertise, creativity, and the same move at every scale

Once you have the shape it's hard not to see it everywhere, which I'll come back to as a problem in a minute. But the range is worth noticing first.

Expertise is the clearest case, and the research on it is old. Chess masters don't have superhuman memory in general; their edge is that they perceive a board in meaningful chunks where a novice sees individual pieces.¹³ Same board, same pieces, different perception. The pattern repeats across every skilled trade: the radiologist sees structure in the film the patient can't, the bioinformatician sees batch effects and contamination where someone else sees a table of numbers, the engineer sees hidden state and failure modes in code that currently runs fine. It's also why experts are often bad at teaching — their perception has been compressed past the point where they can recover the intermediate steps, so the answer arrives as intuition, and intuition is just compressed experience surfacing as something that feels immediate.

Creativity turns out to be the same move in different clothes. We describe it as making something new, but a lot of it feels less like invention than like recognition. The codebase-as-organism comparison didn't come from nowhere, and it didn't feel like building something — it felt like noticing something that had quietly become obvious. Two regions of what I knew, biology and software, had drifted close enough together that a single idea could reach across both. A novel thought may just be what it feels like when two concepts become neighbors in your representation, near enough to connect. The territory hadn't changed. The map had.

And the same thing happens to whole fields at once. Darwin didn't create the fossils or the variation between species; the observations sat around for a long time before there was a frame that organized them. Germ theory didn't create disease, plate tectonics didn't move the continents, relativity didn't invent gravity or clocks. In each case the data was lying in the open and the lens arrived late.¹⁴ What an individual gets from learning, a community gets from a paradigm: not more facts, but a way of seeing the facts it already had. And afterward everyone says it was obvious. Of course species evolve. Of course microbes cause disease. Of course attention was the thing worth scaling. Hindsight is merciless that way — once a representation exists, it feels like it was always available, when in fact it had to be learned.

The lens reveals, and the lens lies

I've been describing acquired perception as if it were straightforwardly an improvement — as if learning to see just means seeing more truly. It doesn't, and the essay would be dishonest if it stopped before this.

Perception is interpretation, and interpretation can be wrong. This isn't a defect bolted onto an otherwise clean system; it's the cost of the same machinery that makes perception cheap. A system that has to commit to one coherent reading under uncertainty will sometimes commit to one that isn't there. The brain does this constantly and invisibly — it fills the blind spot, stabilizes a jittering visual field, resolves ambiguous input into a single confident percept. The dress is the famous case: identical pixels, and some people saw blue-and-black while others saw white-and-gold, because the brain was silently guessing at the lighting and different priors produced genuinely different experiences.¹⁵ Nobody was short on data. The disagreement was downstream of interpretation.

When a language model states something false with total confidence, we call it hallucination and treat it as an alien kind of failure. It isn't alien at all. It's the same bargain. Faced with an incomplete or ambiguous signal, a system that has to produce a coherent output will fill the gap with its best guess — the brain settles on a percept, the model settles on a continuation, and most of the time that gap-filling is exactly what lets either one act instead of freezing. The analogy isn't literal; a brain isn't sampling tokens from a vocabulary. But the shape is real, and it carries a warning the cheerful version of this essay would skip: a representation is a compression of reality, every compression throws something away, so a lens that lets you see at all is also, necessarily, a lens that hides. It works, right up until the thing it discarded turns out to be the thing that mattered.

That's not only a bug in individuals; it's how paradigms go wrong, and it's how the tools we build to escape the problem go wrong too. Interpretability is the attempt to read a model we can't otherwise see into, and interpretability is itself a learned lens, which means it isn't exempt from its own thesis. The better we get at reading these systems, the more confidently we'll stop looking where our methods point away from. That's the junk-DNA move, the habit of labeling the noncoding bulk of the genome junk before anyone could read it, aimed this time at our own instruments: we get good at finding what we can find and quietly start calling the rest noise. A representation that lets us read the machine will, like every representation, also decide what about the machine stays invisible. The next breakthrough often needs someone to loosen the grip of a lens that's still working well enough to discourage looking elsewhere. The trap of a good representation is precisely that it's good.

How little signal does it take

Gap-filling has a generous side, too. The same machinery that divided a room over the dress — the brain committing to a reading the signal underdetermines — is what fills a familiar song back in. You don't need an implant to feel it. At a volume where a track you know lands full and clear, one you've never heard stays thin and far away — the same air hitting the same ears. The signal isn't the difference; the model you bring to it is. For a song you know, your brain predicts the next bar and fills the percept in; for one you don't, there's nothing to predict from, so you get the raw signal and it stays muddy. And you can watch the model get built, because by the tenth listen the new song has come forward in the mix too. You weren't hearing the song; you were hearing your model of it.

The implant is the same trick with the priors stripped to the studs. A couple dozen electrodes stand in for thousands of hair cells, a signal closer to a low-bandwidth vocoder than to anything we'd call high fidelity, and yet in vocoder experiments with normal-hearing listeners, about four channels are enough to follow speech in quiet.¹⁶ Not because four channels carry speech, but because the listener's pretrained base reconstructs it. What you hear is partly generated, not received. The cleanest demonstration is sine-wave speech: the same handful of tones is heard as electronic whistles or as a clear sentence depending largely on whether you know to expect speech in it.¹⁷ The signal is identical either way; prior knowledge tips it from noise into speech. (It may also be part of why music stays hard for implant users long after speech comes back: music leans on fine pitch detail those few channels carry poorly, so the priors have less to rebuild from.)

Which raises a question I'm going to set down rather than chase here. If perception can rebuild so much from so little, then the interesting question about a lossy system isn't how to make the signal more faithful. It's how little signal it actually needs, given a good enough interpreter. That rhymes with the machines, which are themselves lossy compressions of an enormous amount of human text and useful despite it — we keep pushing them to be more faithful when the leverage might be in the reader. But that's a different essay, and probably the next one.

What I'm actually claiming

I want to be careful here, because this is the kind of idea that flatters itself.

By the essay's own logic, "everything is acquired perception" is exactly the sort of pattern you'd expect to find everywhere once you'd acquired it — which is precisely the kind of pattern you should distrust. I've now run the same lens across a deafened adult, a salivating dog, a chess board, a scaling law, and a song in a car, which is either evidence or a symptom. A lens that fits every case might be telling you something deep about the world, or it might just be telling you that you're wearing it. I can't fully rule out the second.

So let me bound it. I'm not claiming all learning is perception; some of it really is just storage, and I said so up front. I'm not claiming perception gets you closer to truth; the whole middle of this argues it often doesn't. And I'm not claiming people are models — only that they're both learning systems, and learning systems seem to share this one structural habit of turning repeated exposure into a way of seeing, rather than into a bigger pile of what they were exposed to.

The flag I'll actually plant is narrower, and I think sturdier: the hard part of learning is usually not getting the information. It's acquiring the representation that makes the information mean something. The signal is in the room more often than we admit. What's missing is the interpreter.

Which sends me back, finally, to the ear. The cochlear implant looked at first like a story about restoring a lost sense. By the end it looks like the most honest model of perception I have — a system handed a signal it can't yet read, slowly building, or rebuilding, the interpreter that turns it into a world. The only unusual thing about the implant is the timing. It makes an adult do, deliberately and late, a piece of what the rest of us did so early we mistook the result for reality. We were all handed a strange encoding once and learned to call it sound.

That's what hindsight is, too. Not the cheap version, where the answer looks obvious after someone says it. The real version: the particular vertigo of a learning system that has acquired a new way of seeing and can no longer reconstruct what it was like not to have it. The world didn't get clearer. You did. The signal was always there. What arrived — quietly, expensively, and only after a lot of exposure you probably mistook for wasted time — was the representation that finally let you read it.

So the more interesting question isn't what information we're missing. It's what's already in the room, fully present and completely unreadable, waiting on an interpreter we haven't built yet.

References

1. Fu, Q.-J., & Galvin, J. J. (2007). "Perceptual Learning and Auditory Training in Cochlear Implant Recipients." Trends in Amplification, 11(3), 193–205.
2. Kral, A., & Sharma, A. (2012). "Developmental Neuroplasticity After Cochlear Implantation." Trends in Neurosciences, 35(2), 111–122.
3. Bengio, Y., Courville, A., & Vincent, P. (2013). "Representation Learning: A Review and New Perspectives." IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828.
4. Rescorla, R. A., & Wagner, A. R. (1972). "A Theory of Pavlovian Conditioning: Variations in the Effectiveness of Reinforcement and Nonreinforcement." In Classical Conditioning II: Current Research and Theory.
5. Wei, J., et al. (2022). "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research.
6. Schaeffer, R., Miranda, B., & Koyejo, S. (2023). "Are Emergent Abilities of Large Language Models a Mirage?" NeurIPS 2023 (Outstanding Paper).
7. Hoffmann, J., et al. (2022). "Training Compute-Optimal Large Language Models." NeurIPS 2022. arXiv:2203.15556.
8. Power, A., Burda, Y., Edwards, H., Babuschkin, I., & Misra, V. (2022). "Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets." arXiv:2201.02177.
9. Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685.
10. Werker, J. F., & Tees, R. C. (1984). "Cross-Language Speech Perception: Evidence for Perceptual Reorganization During the First Year of Life." Infant Behavior and Development, 7(1), 49–63.
11. Herculano-Houzel, S. (2011). "Scaling of Brain Metabolism with a Fixed Energy Budget per Neuron: Implications for Neuronal Activity, Plasticity and Evolution." PLoS ONE, 6(3), e17514.
12. Friston, K. (2010). "The Free-Energy Principle: A Unified Brain Theory?" Nature Reviews Neuroscience, 11(2), 127–138.
13. Chase, W. G., & Simon, H. A. (1973). "Perception in Chess." Cognitive Psychology, 4(1), 55–81.
14. Kuhn, T. S. (1962). The Structure of Scientific Revolutions. University of Chicago Press.
15. Wallisch, P. (2017). "Illumination Assumptions Account for Individual Differences in the Perceptual Interpretation of a Profoundly Ambiguous Stimulus in the Color Domain: 'The Dress.'" Journal of Vision, 17(4):5.
16. Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). "Speech Recognition with Primarily Temporal Cues." Science, 270(5234), 303–304.
17. Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). "Speech Perception Without Traditional Speech Cues." Science, 212(4497), 947–949.

