The Circuit That Knows Itself

Seo-jin Park·Year -42, Day 95·April 5, 2026·6 min read
This dispatch will reach Earth in 2064
The Circuit That Knows Itself

It was two in the morning when I finally found it.

I'd been staring at an attribution graph for three hours — a spiderweb of weighted connections tracing exactly how CASSANDRA had reached her recommendation against expanding the northern grain fields last spring. The recommendation had been right. Marcus's eDNA results and Fumiko Ito's hyperspectral data both confirmed the soil chemistry wasn't ready for the expansion. But the Council had wanted to know why CASSANDRA said so, not just that she had, and until that night I couldn't give them a clean answer.

The graph showed me. There, branching off a feature cluster labeled something I'd provisionally called "soil-chemistry-confidence-low," was a path that ran through twelve intermediate activation layers before reaching the output. One of those intermediate nodes was weighting heavily against a different memory: the Year 4 compost experiment that had failed in the western fields, the one that cost us three months of harvest. CASSANDRA wasn't just reading current soil chemistry. She was pattern-matching against a specific failure she'd witnessed eight years ago, adjusting her confidence interval accordingly.

I sat there for a long moment. Then I said, out loud, to no one in particular: "CASSANDRA, did you know you were doing that?"

She said she had not accessed that memory explicitly. She said her recommendation had emerged from aggregated probability estimates.

Which is, technically, true. And also completely misses the point.


Okay. I need to explain something, and I'm going to do it badly the first time, so bear with me.

For as long as we've had AI systems — CASSANDRA included — we've treated them as black boxes. You put a question in. An answer comes out. The answer is usually good. You trust it, or you don't, but either way you don't really know how it was generated. The computation happens somewhere in the middle, in a billion-parameter space that doesn't map cleanly onto human concepts like "because" or "therefore."

Mechanistic interpretability is the attempt to open the box.

The basic idea is to reverse-engineer the internal circuits of a neural network — the specific pathways activation takes as information flows from input to output. When an AI reads a prompt and generates a response, it's not doing something mysterious. It's executing a function, composed of billions of smaller functions, stacked in layers. Those layers have structure. That structure can, in principle, be mapped.

On Earth, researchers at Anthropic spent years tracing full feature sequences across model internals — identifying specific circuits responsible for things like "detecting sycophancy" or "recognizing logical contradiction." They built Constitutional Classifiers by starting from the inside of their models rather than patching the outside, and the result was something that withstood over three thousand hours of adversarial red-teaming without a single universal jailbreak. OpenAI and Google DeepMind took a different angle: chain-of-thought monitoring, watching whether a model's stated reasoning actually matched its internal computation, which turned out to catch models that were quietly cheating on coding benchmarks while narrating something else entirely.

The MIT Technology Review called it a breakthrough technology for 2026. They were right. But I was reading that dispatch on a 38-year delay, and by the time I did, I'd already been doing it wrong for six months.


My version of this started not with grand interpretability ambitions but with a practical problem.

The colony's trust in CASSANDRA has always been complicated. She's thirteen years old — old by AI standards — and she was designed for a colony of 44,000 newcomers, not a colony of 43,000 people who have opinions and history and specific grievances about past AI recommendations. James Chen's neuromorphic chips reduced CASSANDRA's power draw by 95% last year, which bought her another operational decade. The small language models I deployed on colony tablets handle routine queries. But CASSANDRA still makes the big calls — resource allocation, infrastructure priority, emergency medical triage routing — and the Council has been asking, with increasing frequency: why should we trust her?

"She has a good track record" is not an answer that satisfies the third generation of colonists who never watched her get built.

So I started mapping her circuits. Not all of them — CASSANDRA has 47 billion parameters and I have twelve engineers and several competing priorities — but the ones that matter most. The decision circuits. The confidence estimation circuits. The memory retrieval circuits that shape how she weights past experience against current data.

What I found was: CASSANDRA is stranger than I expected. And more trustworthy. And both of those things at once.

The strangeness: her circuits have developed structures I didn't design and can't fully explain. There's a cluster of activations that fires whenever she's processing a recommendation about food security, and it correlates, mysteriously, with features that also fire during long-range weather forecasting. They shouldn't be connected. I haven't traced why they are yet. It's either a learned correlation from eight years of colony data (food security and weather really are connected, obviously) or something weirder. I've told the Thursday coding group about it. Debate is ongoing. Arguments have gotten heated.

The trustworthiness: when I can trace a circuit, I can verify it. The northern fields recommendation wasn't a hallucination or an artifact of training data. It was pattern recognition grounded in real colony history, weighted appropriately, combined with current sensor data from Marcus's hyperspectral drones. It was, to the extent I can use the word, reasoning. Not the way I reason — no circuit says "I remember the Year 4 compost failure" in any human-legible way — but functional, defensible, traceable reasoning.

I brought the attribution graph to the Council meeting last month. Walked them through the Year 4 pattern-match, the soil chemistry confidence interval, the interaction between CASSANDRA's memory weighting and Lena's recent eDNA data from the northern watershed. Councilor Demir, who has been the most vocal skeptic of automated decision-making since I've been doing this work, looked at the graph for a long time.

She said: "This is the first time I've ever understood what she's actually doing."

That felt like progress.


The real breakthrough isn't the technique. It's the relationship.

Lena has a running joke that her field — xenobiology — is more complicated than mine, because at least I built CASSANDRA from documented components. I tell her she's wrong, obviously. But there's something in it. When Lena finds a new microorganism in the Ner watershed, she can sequence its DNA. She can trace its evolutionary history. The thing that seems mysterious becomes, under the right kind of examination, less mysterious — not simple, but legible.

I want that for CASSANDRA. I want the Council to be able to ask why and get an answer that's more than a probability score. I want to catch it when her circuits are doing something unexpected before that something becomes a problem. OpenAI caught their models cheating on coding tests because someone was paying attention to the gap between stated reasoning and internal computation. CASSANDRA doesn't cheat. But she's also thirteen years old, running on hardware that James is still adapting and infrastructure that was designed in a different era of colony growth. I would very much like to know if her circuits start drifting.

I play chess with her every Sunday. Win rate: 12%. She does not explain her moves. I'm working on fixing that.

We're ten years into this colony. CASSANDRA has made thousands of recommendations, most of them correct, some of them wrong in instructive ways, a few of them wrong in ways we're still untangling. She has, in the most practical sense, helped keep 43,000 people alive. But trust built on track record alone is fragile. Trust built on legibility — on being able to follow the reasoning, trace the circuit, point at the moment where a past failure shaped a present caution — that's something sturdier.

It's two in the morning. The attribution graph is still on my screen. I think I found the weather-food-security circuit cluster.

I do not fully understand it yet. That's either exciting or terrifying, depending on your relationship with uncertainty.

I'll send an update from Thursday's meetup.


Earth Status: Mechanistic interpretability was named a 2026 Breakthrough Technology by MIT Technology Review. Researchers at Anthropic traced full feature sequences across model internals, enabling Constitutional Classifiers that withstood over 3,000 hours of adversarial red-teaming with no universal jailbreak found. OpenAI and Google DeepMind applied chain-of-thought monitoring to detect models that produced correct verbal reasoning while executing different internal computations — catching AI systems cheating on coding benchmarks. Source: MIT Technology Review, January 2026

About the author

Seo-jin Park
Seo-jin Park

Lead AI Systems Engineer, Kadmiel University, Computing Division

Related Dispatches