by Zellige

Brain-Embedded LLMs: A Research Arc in Methodology, Encoder Mechanisms, and Negative Results

We injected CorticalNet brain states into Qwen3.5-27B as a universal LLM enhancer. Then we retracted three early wins under matched-pairs sampling, found that a 13MB static encoder beat transformer encoders inside our specific stack, and confirmed the brain pipeline actively hurts technical generation work. Follow-up experiments narrowed the claim twice: the technical harm is input-conditioning-specific (any input-conditioned prefix loses to OFF); the Mid recall benefit is prefix-tuning-generic (brain, keyword-extract, static soft prompts, and raw-encoder prefixes all tie each other and all beat OFF). A research arc in why methodology matters more than mechanism, and what survives after each methodology pass narrows the claim further.

  • methodology
  • mechanism
  • negative-result
  • brain-conditioned-llm
  • matched-pairs
  • long-context
  • prefix-tuning
  • encoder

Abstract. We inject CorticalNet brain-derived activations into Qwen3.5-27B as a 4-token prefix and evaluate via same-seed matched-pairs sampling across three context tiers. After correcting an initial methodology error (marginal vs paired comparison), three early positive results retract and one null inverts to a validated effect: brain-conditioned long-context needle retrieval at Mid 32K (+13pp, p=0.0078) and Long 256K (+32pp, paired p=0.022). Subsequent ablations narrow the claim in two directions. First, on technical generation the brain pipeline actively degrades output (p<0.0001), and a TF-IDF keyword-conditioned prefix degrades equally; the harm is input-conditioning-specific, not brain-specific. Second, at Mid under a harder-needle distribution, brain / keyword-extract / static soft-prompt / raw-encoder prefixes all produce statistically indistinguishable recall gains over the OFF baseline; the recall benefit is prefix-tuning-generic. The Long tier data stands under the original distribution; harder-needle replication at Long is pending. We present the full experimental progression, including the methodology corrections, encoder architecture decisions, and follow-up ablations that successively narrowed the initial claim.

0. How to read this

Each section opens with its claim. Tables and code are inline. Every quantitative result uses matched-pairs evaluation unless explicitly noted; non-paired analyses are flagged. Caveats specific to each result appear alongside the result, not in a footnote. The matched-pairs harness and raw experimental data are described in Section 10.

Where our earlier results required retraction, we report both the original finding and the correction. Where similar failure modes appear in published work, we describe the pattern without attribution; readers can identify the relevant literature.


Sections

  1. Premise. What we set out to test, and the simpler thing it became.
  2. Setup. The brain pipeline end-to-end, with everything frozen vs. trainable labeled.
  3. Methodology. Matched-pairs sampling and why we believe so few of our own results.
  4. Signal-preserver framing. What an encoder actually does inside this stack.
  5. The encoder journey. MiniLM, Qwen3-Embedding, and a 13MB static pool that won.
  6. Validated tiers. The narrow wins that survived methodology.
  7. Negative result. Brain hurts technical generation, including hallucination.
  8. Mechanism. Why prefix tuning has structural limits we cannot inject around.
  9. Lessons, open questions, and what would change our minds.
  10. Code, data, and the methodology bundle.

1. Premise

We started with a simple hypothesis: a small auxiliary signal derived from brain-encoding models could prime a large language model toward better behavior. The intuition was that fMRI-trained models have learned something about how human cognition organizes a stimulus (by topic, by valence, by category, by attention), and that an LLM might benefit from a low-dimensional summary of that organization.

The artifact: a frozen 177M-parameter extract from Meta's TribeV2 tri-modal brain encoder ("BrainCore"), conditioned on a sentence-transformer embedding of the user prompt. We projected its 20,484-vertex output to 50 dimensions via PCA, then to 20 cortical-parcel activations via CorticalNet, then through a learned projector to four "brain tokens" injected into the LLM's input embedding stream after the system message and before the user message.

We tested this on Qwen3.5-27B (4-bit nf4 quantized) across three context-length tiers (Mid 16K-32K, Long 128K-262K), four task categories (long-context needle retrieval, dispositional matching, reasoning benchmarks, technical generation), and seven encoder variants over the course of the work. For each comparison we report matched-pairs results.

What survived. The initial hypothesis was that brain-derived activations would serve as a universal LLM enhancer. After systematic evaluation, one effect survived rigorous testing (long-context needle retrieval) alongside a comprehensive set of negative results. The more general finding is structural: prefix-tuning at the input position has fundamental limits that are independent of the signal source, and the recall benefit that does survive is not specific to brain-derived content.

This paper traces the experimental progression from the initial broad claim through each narrowing step: seven encoder variants, four orthogonal ablations, and two phases of follow-up experiments. The brain pipeline serves as the case study; the methodology and structural arguments generalize beyond it.


2. Setup

2.1 Pipeline (end-to-end, what is trainable, what is frozen)

USER TEXT (any length, no encoder context cap on the static-encoder path)
   │
   ▼
[ENCODER]  potion-256d-v3 + MixPool        [STATIC, frozen — 256-dim per token, ~10MB Apache 2.0]
   │       (concat of mean-pool and per-dim abs-max-pool → 512-dim)
   ▼
[ADAPTER]  Linear 512 → 512 → 384          [TRAINABLE — 720K params, ~3MB, MSE-distilled vs MiniLM teacher]
   │       GELU + LN between layers
   ▼
[BRAIN]    BrainCore (TribeV2 extract)     [FROZEN — 177M params, CC-BY-NC, 20,484 cortical vertices output]
   │       Trained on 148 stimulus↔fMRI pairs
   ▼
[PCA-50]   20,484 → 50 dimensions          [FROZEN — captures 91% variance on our adapter outputs]
   │
   ▼
[PARCELS]  CorticalNet                     [FROZEN — 3.3M params, 20 cortical-parcel activations]
   │       (commercial-clean replacement for downstream BrainCore inference)
   ▼
[PROJECT]  BrainProjectorV3                [TRAINABLE — Variant A 27B: 50 → 5120 ×4 tokens, ~106M params]
   │                                       [TRAINABLE — Variant B 14B: 50 → 192 → 5120 ×4 tokens]
   ▼
4 BRAIN TOKENS                              [injected at position [sys][brain×4][user]]
   │
   ▼
[LLM]      Qwen3.5-27B 4-bit nf4           [FROZEN — 64K native ctx, 262K max]
           OR Qwen3-14B (4-bit, 40K ctx)

Total brain-side deploy footprint (encoder + adapter + projector for 27B): ~13MB on the encoder path (~10MB encoder + ~3MB adapter) + ~106MB for the projector. Encoder-side is microsecond CPU; projector forward is one matmul + LN.

What does the brain actually carry? A 50-dimensional summary, derived per prompt, of how a frozen brain-encoding model would (counterfactually) represent the stimulus in 20,484 cortical vertices, projected through 20 named cortical parcels (Schaefer-derived). The projector translates that 50-dim summary into four embedding-space vectors that the LLM treats as if they were tokens.

What does the LLM SEE? Four extra positions of input embeddings between the system message and the user message. The LLM is unmodified, frozen, and quantized. No fine-tuning on the LLM side. No LoRA. No attention modification. Pure prefix-tuning, with the prefix sourced from a brain-encoding pipeline rather than learned end-to-end.
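The injection position can be sketched with plain arrays. This is a minimal illustration, not the project's code: the function name `inject_prefix`, the segment lengths, and the random stand-in embeddings are all assumptions; only the `[sys][brain×4][user]` ordering and the 5120-wide embedding come from the diagram above.

```python
import numpy as np

def inject_prefix(sys_emb: np.ndarray, prefix_emb: np.ndarray, user_emb: np.ndarray) -> np.ndarray:
    """Build the [sys][prefix][user] input-embedding sequence (hypothetical helper)."""
    d_model = sys_emb.shape[1]
    if prefix_emb.shape[1] != d_model or user_emb.shape[1] != d_model:
        raise ValueError("all segments must share the embedding width")
    return np.concatenate([sys_emb, prefix_emb, user_emb], axis=0)

rng = np.random.default_rng(0)
D_MODEL = 5120  # projector output width per the pipeline diagram

sys_emb = rng.standard_normal((12, D_MODEL))      # assumed 12-token system message
brain_tokens = rng.standard_normal((4, D_MODEL))  # stand-in for the projector output
user_emb = rng.standard_normal((30, D_MODEL))     # assumed 30-token user message

seq = inject_prefix(sys_emb, brain_tokens, user_emb)
print(seq.shape)  # (46, 5120)
```

The LLM weights never change in this scheme; the only difference between the ON and OFF arms is whether those four rows are present in the sequence.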

2.2 Variants tested over the arc

| Variant | Encoder | Adapter dim | Brain backbone | LLM | Status |
|---|---|---|---|---|---|
| v1 | MiniLM-L6-v2 | 384 | CorticalNet only | Qwen2.5-1.5B | Retired (truncates at 256 tokens) |
| v2 | MiniLM-L6-v2 | 384 | CorticalNet only | Qwen2.5-1.5B | Retired (cherry-pick artifact) |
| v3 | MiniLM-L6-v2 | 384 | BrainCore + PCA-50 + CorticalNet | Qwen2.5-1.5B | Retired (small-model amplifier only) |
| v3-large | MiniLM-L6-v2 | 384 | BrainCore + PCA-50 + CorticalNet | Qwen3-14B | Retired (encoder bottleneck at long ctx) |
| 27B-MiniLM | MiniLM-L6-v2 | 384 | full brain stack | Qwen3.5-27B 4-bit | Retired (encoder truncates) |
| 27B-Qwen3Emb | Qwen3-Embedding-0.6B | 1024 | full brain stack | Qwen3.5-27B 4-bit | Retired (didn't beat OFF in matched-pairs) |
| 27B-MixPool | potion-256d-v3 + MixPool | 512 | full brain stack | Qwen3.5-27B 4-bit | Current ship (validated narrow effect) |
| 27B-max_only | potion-256d-v3 + max-pool only | 256 | full brain stack | Qwen3.5-27B 4-bit | Minimal variant for size-constrained deploys |

Section 5 walks through why each was retired. Section 4 reframes what an encoder is doing in this stack at all.

2.3 Why this construction is sensitive to methodology

Three properties make this pipeline easy to fool yourself with:

  1. Effect size is small relative to sampling variance at any reasonable n. Brain-on vs brain-off effect sizes in our data are ~5-15pp at the cell level. Per-cell sampling variance under non-greedy decoding at n=25 is also ~5-10pp. Without paired design, the two are inseparable.
  2. There are many free axes to sweep. Encoder, adapter, projector init, prompt format, LLM, quantization, decoding params, max_new_tokens. Cherry-pick risk grows multiplicatively. We did not start with a frozen experimental protocol; we evolved it. Some of our early "wins" lived in narrow slices that disappeared the moment we widened.
  3. The success metric on long-context needle retrieval is noisy. Pass/fail on a single prompt depends on tokenization of the needle, position of the needle in the haystack, and decoding stochasticity. Marginal pass-rate comparison absorbs all three sources of noise.

Sections 3 and 6 develop the methodology rule we eventually adopted. The shorter version: at n<100 per cell, marginal comparison gives wrong answers about prefix-tuning systems. This post contains three retractions and one inversion that demonstrate that claim from inside.
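A back-of-envelope check on the first property, assuming each prompt is an independent Bernoulli trial (an idealisation for illustration, not something the harness enforces). The helper names are hypothetical. The point is only that the standard error of an unpaired pass-rate difference at n=25 is as large as the effect sizes quoted above.

```python
import math

def pass_rate_se(p: float, n: int) -> float:
    """Standard error of a pass rate estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)

def unpaired_diff_se(p_a: float, p_b: float, n: int) -> float:
    """Standard error of (rate_a - rate_b) when the two arms are sampled independently."""
    return math.sqrt(pass_rate_se(p_a, n) ** 2 + pass_rate_se(p_b, n) ** 2)

se_mid = pass_rate_se(0.5, 25)            # a cell near 50% pass rate
se_high = pass_rate_se(0.9, 25)           # a cell near 90% pass rate
se_diff = unpaired_diff_se(0.5, 0.5, 25)  # marginal A/B difference

print(f"{100 * se_mid:.1f}pp")   # 10.0pp
print(f"{100 * se_high:.1f}pp")  # 6.0pp
print(f"{100 * se_diff:.1f}pp")  # 14.1pp
```

Under these assumptions a 5-15pp arm effect sits at or below one standard error of the marginal difference, which is why the paired design is needed to see it.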


3. Methodology: matched-pairs sampling

3.1 The initial methodology error

For the first two weeks of this work, every A/B comparison reported marginal pass rates: arm A passed K of N prompts, arm B passed K' of N. We compared K and K' and called it. The unstated assumption was that under non-greedy decoding, the same prompt fed to two arms produces comparable outputs because temperature is the same. This is false. With do_sample=True and an unseeded RNG, each generation draws an independent sampling trajectory; at T=0.6 / top_p=0.95, the same prompt on the same model with the same arm can pass or fail the same needle question depending purely on which token gets sampled at the first ambiguous position.

Once you accept that, you cannot interpret a 13/25 vs 11/25 marginal difference between arms as evidence of arm-level effect; you cannot interpret a 13/25 vs 13/25 marginal tie as evidence of no effect either. The arms might be finding different prompts.

3.2 The fix

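import torch

# CONTEXT_LENGTHS, batches_at_ctx, ARMS, and generate() come from the eval harness.
for ctx in CONTEXT_LENGTHS:
    for batch_idx, batch in enumerate(batches_at_ctx):
        # One seed per (context length, batch), shared by every arm.
        pair_seed = (ctx * 100000) + batch_idx
        for arm in ARMS:
            # Reset CPU and CUDA RNGs so each arm starts on the same sampling trajectory.
            torch.manual_seed(pair_seed)
            torch.cuda.manual_seed_all(pair_seed)
            generate(batch, arm=arm)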

Reset the global RNG to the same seed before each arm of each batch. The arms then draw the same sampling trajectory until they actually diverge, which only happens at positions where the differing input embeddings (the brain tokens) cause a different next-token logits distribution. Differences between arms on the same prompt are now isolated to brain-token effects, not RNG.

The statistical analysis becomes McNemar exact 2-sided binomial on discordant pairs: prompts where one arm finds the needle and the other misses. Concordant pairs (both find or both miss) carry no information about arm effect.
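The test itself is short enough to write out. The sketch below is a standard exact two-sided binomial test on discordant counts using only the standard library; the function names are ours, not the harness's. It reproduces the p-values quoted in this post from their discordant counts.

```python
from math import comb

def discordant_counts(on_pass, off_pass):
    """Count prompts where exactly one arm passed, given paired boolean outcomes."""
    on_wins = sum(a and not b for a, b in zip(on_pass, off_pass))
    off_wins = sum(b and not a for a, b in zip(on_pass, off_pass))
    return on_wins, off_wins

def mcnemar_exact(on_wins: int, off_wins: int) -> float:
    """Exact two-sided McNemar p-value: binomial test of the discordant pairs against p=0.5."""
    n = on_wins + off_wins
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(on_wins, off_wins)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

print(mcnemar_exact(8, 0))             # 0.0078125
print(round(mcnemar_exact(9, 1), 3))   # 0.021
print(round(mcnemar_exact(12, 8), 2))  # 0.5
```

Concordant pairs never enter the calculation, which is why two arms with identical marginal pass rates can still separate cleanly once the pairing is respected.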

3.3 What changed when we did this

We re-ran four prior comparisons under matched-pairs. Three "wins" evaporated. One "null" inverted into a significant result.

| Comparison | Marginal verdict (pre) | Matched-pairs verdict (post) | Status |
|---|---|---|---|
| Brain v3-large 1.5B "7/7 wins" qualitative bench | 7/7 prompts brain wins (n=8, greedy, 400 tok cap) | Effect collapses to ~tie under sampling + n=100 + 4096-tok responses | RETRACTED |
| Qwen3-Embedding-0.6B encoder Mid 32K | +6-8pp marginal vs OFF | 12-8 discordant pairs, p=0.50 | RETRACTED (null) |
| Random-init 4-token control at 256K | All three arms (real brain / random-init / OFF) hit 13/25 = 52% identical | Real brain +28pp marginal under matched-pairs, p=0.039 paired | RETRACTED (was sampling artifact) |
| 256K "brain dead" verdict (Long-tier null) | Marginal showed brain ≈ OFF at 256K, called null | Brain 16/25 vs OFF 8/25, +32pp, p=0.022 paired (n=25) | INVERTED to validated win |

The pattern is consistent. Marginal comparison at n=25 cannot distinguish a 5-10pp brain effect from sampling variance of similar magnitude. Pairing the trajectories isolates the effect from the noise.

The Qwen3-Embedding-0.6B "Mid +6-8pp marginal" is the keep-us-honest case study. Marginal looked like a win; matched-pairs at the same n returned 12 brain-wins / 8 OFF-wins discordant, p=0.50. The arms were trading one set of misses for another, not separating in ability. We had been calling that a result.

3.4 Recipe for ablating a prefix-tuning system (the second-order fix)

Matched-pairs is necessary but not sufficient. Brain-pipeline ablations must also rule out three confounds: position offset (prepending K tokens shifts downstream positions; test with K zero-vectors at the brain position), vanilla prefix tuning (any K trained vectors at this position might help; test with K-token soft-prompt trained without the brain pipeline), and random init (untrained noise might regularize; test with K-token random Gaussian prefix).

We ran all three at Mid 32K and Long 256K. Position-offset rejected (zero-pad ≡ OFF, 0 discordant at 256K, n=14). Vanilla-soft-prompt rejected (3-3 discordant at Mid 32K, p=1.0). Random-init behaves like zero-pad. Brain content carries the signal; the trivial alternatives do not. Section 8 unpacks each ablation in detail.
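Two of the three control prefixes need no training and can be sketched directly. This is an illustration under assumptions: the function names are hypothetical, and the Gaussian scale `std=0.02` is a placeholder because the scale used in the actual runs is not stated here. The soft-prompt control is omitted since it requires a training loop.

```python
import numpy as np

K, D_MODEL = 4, 5120  # prefix length and embedding width from the pipeline diagram

def zero_pad_prefix(k: int = K, d_model: int = D_MODEL) -> np.ndarray:
    """Position-offset control: K all-zero vectors at the brain position."""
    return np.zeros((k, d_model), dtype=np.float32)

def random_init_prefix(k: int = K, d_model: int = D_MODEL,
                       std: float = 0.02, seed: int = 0) -> np.ndarray:
    """Random-init control: untrained Gaussian noise.

    std=0.02 is an assumed placeholder, not the value used in the experiments.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, std, size=(k, d_model)).astype(np.float32)

controls = {
    "zero_pad": zero_pad_prefix(),
    "random_init": random_init_prefix(),
}
for name, prefix in controls.items():
    print(name, prefix.shape)
```

Each control occupies the same four positions as the brain tokens, so any difference between a control arm and the brain arm cannot be attributed to the position shift alone.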

3.5 What we now believe about reporting prefix-tuning eval results

Three rules we hold ourselves to from this point forward:

  1. Matched-pairs is non-negotiable below n=100 per cell. Marginal comparison at small n is sampling-variance theater dressed as a result.
  2. McNemar on discordant pairs, not pass-rate deltas. The discordant-pair structure is the actual data; pass-rate deltas are a lossy summary.
  3. Position-offset, vanilla-soft-prompt, random-init prefix are mandatory ablations. Any prefix-tuning win that hasn't ruled out these three has not earned the right to claim "brain content is the cause."

We failed all three of these rules in our early work. We retract those results in §3.3 above.


4. Signal-preserver framing: what an encoder actually does inside this stack

Most discussion of sentence encoders treats them as retrievers. You have a query, you have a corpus, you want the top-K most relevant documents. The encoder's job is to produce vectors whose cosine distance ranks the right documents above the wrong ones. Evaluation follows: nDCG@10 on MSMARCO, recall@1 on NQ, retrieval quality on BEIR. Every leaderboard for sentence embeddings is a retrieval leaderboard.

Inside a brain-conditioned LLM pipeline, the encoder is doing something different.

There is no corpus. There is no ranking. The encoder's output does not get compared to other documents' vectors; it gets handed to a downstream model (BrainCore → CorticalNet → a projector → four brain tokens → Qwen3.5-27B) that has its own semantic machinery. The encoder is not a retriever. It is a signal compressor: a fixed-dimensional summary of a variable-length input, consumed by a model that will extract whatever semantics it needs from that summary.

That reframing changes what the encoder should optimize for.

4.1 Retriever vs signal-compressor: two different loss landscapes

A retriever wants its output vector to live on the right side of a ranking decision. Two documents on the same topic should sit near each other; documents on different topics should sit far apart. The loss is contrastive (InfoNCE, MNRL) or margin-based (triplet). The model is measured by corpus-level separability: given 1M candidates, does the right one rank top-K?

A signal compressor has a different objective. Its output vector is not compared to other vectors. It is consumed whole, per-dimension, by a downstream model. What matters is that distinctive activations of the input survive the compression. If a single token in the input carries a sparse, strong signal (a keyword, an entity, a needle), the compressed output should preserve that token's contribution in at least one dimension of the output vector.

Optimizing for one does not automatically optimize for the other. A retriever's training loss does not directly reward distinctive-activation preservation; it rewards mean semantic alignment across whole-document pairs. A strong retriever can average away exactly the per-token information a signal-compressor needs, and nothing in its training loss penalizes that averaging.

RETRIEVER                               SIGNAL-COMPRESSOR
─────────                               ──────────────────
encoder(query)   ──┐                    encoder(input) ──► fixed-dim vector
                   │                                            │
                   ▼                                            ▼
          similarity against                            downstream model
          corpus index (1M docs)                        (LLM, brain stack,
                   │                                     RAG consumer)
                   ▼                                            │
              top-K ranking                                     ▼
                                                         task output

loss: rank right doc top-K                loss: preserve distinctive per-dim
metric: nDCG@10, R@1                      activations across full input length
training: contrastive / triplet           training: whatever the consumer needs

4.2 Two properties that decide which encoders compress well

Two architectural facts decide whether a given encoder compresses signal well at long input length:

Position-encoding decay. Transformer encoders attend across the full input. Their positional encodings decay past the length they were trained on; attention is smooth and soft and averages many tokens into each output position; the final pooled vector is a further average across output positions. By the time you pool a 50K-token document into a 1024-dim vector through a transformer encoder, a single distinctive token has been averaged into invisibility unless it appeared with many paraphrases. This is not a tuning problem. It is a mean-field property of the attention-plus-pool architecture.

Static per-token embedding lookup. A static encoder like potion-256d-v3 skips attention entirely. Each token contributes one 256-dimensional vector directly from a frozen embedding table. The only step that mixes tokens is the pool. If that pool is a mean, you have the same averaging problem as a transformer encoder. If that pool is per_dim_abs_max (for each of the 256 output dimensions, the token whose activation on that dimension has the largest absolute value wins), then any token with a sparse strong activation on some dimension will preserve its contribution regardless of how many filler tokens are added.

For a retriever, the pooling choice is not the dominant factor; the encoder's semantic training dominates. For a signal compressor at long input length, the pooling choice dominates everything.
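The three pools discussed here can be written in a few lines of NumPy. This is a minimal sketch, with one labeled assumption: `per_dim_abs_max` returns the signed activation of the winning token rather than its absolute value, which is our reading of "the token ... wins". The tiny matrix is made-up data for illustration.

```python
import numpy as np

def mean_pool(E: np.ndarray) -> np.ndarray:
    """Average each dimension over all tokens. E has shape (n_tokens, dim)."""
    return E.mean(axis=0)

def per_dim_abs_max(E: np.ndarray) -> np.ndarray:
    """For each dimension, keep the activation of the token with the largest |value|.

    Assumption: the signed value is kept, not its absolute value.
    """
    winners = np.abs(E).argmax(axis=0)          # winning token index per dimension
    return E[winners, np.arange(E.shape[1])]

def mix_pool(E: np.ndarray) -> np.ndarray:
    """Concatenate the mean half and the abs-max half (dim -> 2 * dim)."""
    return np.concatenate([mean_pool(E), per_dim_abs_max(E)])

E = np.array([[0.1, -0.2],
              [-0.9, 0.05],
              [0.3, 0.4]])
print(per_dim_abs_max(E))  # [-0.9  0.4]

# Appending low-magnitude filler tokens leaves the abs-max half unchanged
# but pulls the mean toward zero.
filler = np.full((1000, 2), 0.01)
E_long = np.vstack([E, filler])
print(np.array_equal(per_dim_abs_max(E_long), per_dim_abs_max(E)))  # True
```

The second print is the whole argument in miniature: the max half is invariant to added filler as long as no filler token exceeds the needle's magnitude on that dimension.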

4.3 The consequence we will narrate in Section 5

The empirical journey in Section 5 (MiniLM truncation, Qwen3-Embedding's last-token pooling failing matched-pairs, a 13MB static encoder with max-pool winning on brain-conditioned needle retrieval) is not random. It is a directed search through compression mechanisms under the signal-compressor frame. The transformer encoders we tested had strong training and weak long-context compression. The static encoder we shipped had adequate training (distilled from MiniLM) and strong long-context compression via max-pool.

What the frame does not claim. The signal-compressor lens is not a claim that static encoders universally beat transformer encoders. It is a claim that the downstream consumer decides what the encoder should optimize for, and our consumer happens to be a brain-conditioned LLM rather than a corpus search engine. A different pipeline (a RAG system that compares encoder outputs against a prebuilt corpus index) has a retriever as its encoder and should not draw conclusions from our result.

What it does claim. When your downstream model consumes a single encoder vector per input and has its own semantic reasoning, you are solving a signal-preservation problem, not a retrieval problem. The pooling step of your encoder is more important than its training loss, and the benchmark that predicts your stack's performance is not MTEB.

With that frame in hand, the encoder journey in Section 5 is about progressive convergence on a compressor that preserves distinctive token activations at context lengths well past any encoder's training budget.


5. The encoder journey: MiniLM, Qwen3-Embedding, and a 13MB static pool that won

The pipeline ships with one encoder choice, but it has been through seven variants. Each iteration was a hypothesis about compression (what information has to survive the pool to feed a useful brain signal into the LLM), and each iteration taught us something about where the previous encoder was losing signal. This section walks the arc.

The short version: we started with a strong retriever (MiniLM) with a hard context cap. When we removed the cap by swapping in a 32K-context transformer encoder (Qwen3-Embedding-0.6B), the matched-pairs methodology from Section 3 caught that the swap did not actually help. When we swapped to a static encoder (potion-256d-v3) with an unorthodox pool (MixPool, the concatenation of mean and per-dim absolute max), we caught the validated effect. The delta between the two failures and the win is explained by the signal-preservation mechanics of Section 4.

5.1 MiniLM-L6-v2: the truncation trap (variants v1–v3, 27B-MiniLM)

MiniLM-L6-v2 is a competent sentence encoder at its intended scope: 384-dim output, 256-token hard input cap, strong retrieval quality on short-document benchmarks. For our first three brain-pipeline variants it was the default, chosen because the dispositional triples we were training the adapter on were short (200–800 tokens) and MiniLM's training distribution matches that length.

The truncation problem does not announce itself. Tokenizing a 32K-token haystack raises no error; the MiniLM pipeline silently truncates at 256 tokens, so the pooled vector is derived from the first 256 tokens of the haystack and nothing else. For a needle-in-haystack eval where the needle is deliberately placed past that cap, the encoder output is identical regardless of needle position. The brain signal conditioned on that output is also identical. Any brain effect at ctx ≥ 16K is impossible to measure because the encoder cannot see past page one.

We did not catch this until we started running long-context evals. On short-text dispositional triples the encoder was fine. The moment we tried 32K needle retrieval, brain-on and brain-off converged numerically, not because the brain pipeline was doing nothing but because the encoder was not supplying a signal that varied with needle position.
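The failure mode is easy to reproduce with a toy. The sketch below does not call MiniLM; `toy_truncating_encoder` is a hypothetical stand-in that fingerprints only the first 256 tokens, which is enough to show why the downstream signal cannot depend on a needle placed past the cap. The filler vocabulary and needle text are made up.

```python
from collections import Counter

MAX_TOKENS = 256  # MiniLM-L6-v2 input cap, as stated above

def toy_truncating_encoder(tokens, max_tokens=MAX_TOKENS):
    """Hypothetical capped encoder: a bag-of-tokens fingerprint of the first max_tokens tokens."""
    return Counter(tokens[:max_tokens])

def haystack(n_filler, needle=None, position=0):
    """Filler tokens with an optional needle spliced in at `position`."""
    filler = [f"filler{i % 50}" for i in range(n_filler)]
    if needle is None:
        return filler
    return filler[:position] + needle + filler[position:]

needle = ["the", "passcode", "is", "7391"]
no_needle = toy_truncating_encoder(haystack(32_000))
needle_mid = toy_truncating_encoder(haystack(32_000, needle, position=5_000))
needle_late = toy_truncating_encoder(haystack(32_000, needle, position=30_000))
needle_early = toy_truncating_encoder(haystack(32_000, needle, position=100))

print(needle_mid == needle_late)   # True: position past the cap is invisible
print(needle_mid == no_needle)     # True: so is the needle's existence
print(needle_early == no_needle)   # False: only a needle inside the cap registers
```

Any quantity computed from the encoder output inherits this blindness, so a brain-on vs brain-off comparison at 32K was measuring the first 256 tokens of filler.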

Retirement reason: encoder-level truncation made the stack impossible to evaluate at long context. MiniLM is still a reasonable encoder for short-input use cases; it is not a long-context encoder and should not be used as one.

5.2 BGE-M3: considered, skipped

BGE-M3 has an 8K context cap and stronger retrieval quality than MiniLM. We added it to the shortlist after the MiniLM retirement. It came off the shortlist before we trained anything because Qwen3-Embedding-0.6B has 32K native context and comparable retrieval quality; if we were going to pay the cost of an encoder swap, we should jump all the way to 32K rather than stop at 8K. No runs on BGE-M3; we flag it as a skipped waypoint for reader orientation.

5.3 Qwen3-Embedding-0.6B: the apparent fix that wasn't (variant 27B-Qwen3Emb)

Qwen3-Embedding-0.6B is a recent transformer sentence encoder: decoder-backbone architecture, last-token pooling, 1024-dim output, 32K native context, MTEB average ~64, Apache 2.0. On paper it is the right answer to the MiniLM retirement: more context, better retrieval quality, modern architecture, permissive license. We retrained the brain adapter (Linear(1024→512) + ... + Linear(512→384), MSE-distilled against the same dispositional-triples teacher targets) and reran the long-context eval.

The marginal pass-rate result at Mid 32K looked like a clean win. Brain-on: ~51%. Brain-off: ~44%. A 6–8pp directional lift. We reported it as a tentative Mid win.

Matched-pairs sampling (Section 3) killed it. The same comparison at n=20 per cell under per-batch seed-resetting produced 12 discordant pairs where brain won and 8 where OFF won; exact 2-sided binomial p=0.50. The marginal lift existed because brain was finding a different 12 prompts than OFF was finding; the arms were trading one set of misses for another. There was no arm-level effect.

What does that tell us mechanically? Qwen3-Embedding's last-token pooling concentrates its output on the tokens at the end of the input. In a needle-in-haystack test where needles are distributed uniformly across context positions and surrounded by filler, the last tokens are almost always filler. A 32K context window does not save you if your pooling destroys the needle before the pipeline sees it. The 32K context was a context-cap fix; the pooling was a signal-preservation regression.

Through the signal-compressor lens of Section 4: Qwen3-Embedding is a stronger retriever than MiniLM (MTEB 64 vs 56) and a weaker long-context signal compressor than a max-pool alternative. The retrieval benchmark did not predict the stack performance, because our stack is not a retriever.

Retirement reason: matched-pairs methodology showed no arm-level effect on the brain pipeline. We did not re-tune or re-distill the Qwen3-Emb variant further. The cost-benefit calculation favored searching through a new compression mechanism (mean vs max pool) over spending more compute on an encoder that was not solving the problem we had.

5.4 The mechanism check: max-pool preserves needles where mean-pool dilutes

The Qwen3-Embedding retirement was a prompt to go back to first principles. If transformer encoders with good retrieval quality were not preserving needles at long context, what pool would?

We ran a synthetic mechanism check. Take a haystack H of tokens drawn from generic English (Wikipedia excerpts) at varying lengths 12K–200K. Take a needle N of 50–100 tokens injected once at a random position. For each (H, H+N) pair, compute the potion-256d-v3 per-token embedding matrix E, then pool E under three candidate pools: mean, per-dim absolute max, and MixPool (concat of mean and abs-max → 512 dim). Compare the cosine similarity of pool(H + N) to pool(N alone) at each context length. High cosine means the needle's signal survived the pool; low cosine means the haystack drowned it.

The curves:

| Pool | cos(pool(H+N), pool(N)) at 12K | at 32K | at 200K |
|---|---|---|---|
| mean-pool | ~1.00 | 0.53 | 0.12 |
| per-dim abs-max | 0.811 | 0.811 | 0.811 |
| MixPool (concat) | mean: 1.00 / max: 0.811 | mean: 0.53 / max: 0.811 | mean: 0.12 / max: 0.811 |

  • mean-pool decays from ~1.0 at short context to ~0.12 at 200K. The haystack's uniform contributions to every dimension drown the needle as the average smooths out.
  • per-dim absolute max holds flat at 0.811 across 12K–200K. For any dimension where the needle's tokens have an anomalously large activation, that activation wins the max regardless of how many filler tokens are added. Distinctive needle activations survive; filler activations that land on the same dimension still compete for it, but the needle's sparse strong activations are structurally protected by the max operation.
  • MixPool gives the downstream consumer both halves in a single 512-dim vector. The mean half decays like standalone mean-pool; the max half holds like standalone max-pool.
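The qualitative shape of these curves can be reproduced on synthetic data. The sketch below is not needle_mech_check_v2.py and does not use potion-256d-v3: the embedding table is random, the needle is given a few large sparse activations by construction, the haystack lengths are much shorter than in the real check, and every name is an assumption. It only illustrates why a finite frozen table makes the abs-max curve flat while the mean curve decays.

```python
import numpy as np

D, FILLER_VOCAB, NEEDLE_LEN = 256, 200, 8
rng = np.random.default_rng(0)

# Assumed frozen table: filler tokens have small dense activations.
filler_table = rng.normal(0.0, 0.1, size=(FILLER_VOCAB, D))

# Assumed needle: filler-like noise plus 4 large, sparse activations per token.
needle = rng.normal(0.0, 0.1, size=(NEEDLE_LEN, D))
spike_dims = rng.permutation(D)[: NEEDLE_LEN * 4].reshape(NEEDLE_LEN, 4)
signs = rng.choice([-1.0, 1.0], size=spike_dims.shape)
for tok in range(NEEDLE_LEN):
    needle[tok, spike_dims[tok]] += signs[tok]

def mean_pool(E):
    return E.mean(axis=0)

def per_dim_abs_max(E):
    winners = np.abs(E).argmax(axis=0)
    return E[winners, np.arange(E.shape[1])]  # signed value of the winning token

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def needle_survival(haystack_len, pool):
    """cos(pool(H+N), pool(N)) with the needle spliced into the middle of the haystack."""
    ids = np.arange(haystack_len) % FILLER_VOCAB  # cycle through the whole filler vocab
    hay = filler_table[ids]
    mid = haystack_len // 2
    doc = np.concatenate([hay[:mid], needle, hay[mid:]])
    return cosine(pool(doc), pool(needle))

lengths = [200, 2_000, 20_000]
mean_curve = [needle_survival(n, mean_pool) for n in lengths]
max_curve = [needle_survival(n, per_dim_abs_max) for n in lengths]
for n, m, x in zip(lengths, mean_curve, max_curve):
    print(f"len={n:>6}  mean-pool cos={m:.3f}  abs-max cos={x:.3f}")
```

In this toy, once the haystack has cycled through the filler vocabulary the per-dimension maximum cannot change, so the abs-max cosine is identical at every length, while the mean-pool cosine keeps falling as the needle's share of the average shrinks.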

Topic-discrimination secondary check. We encoded 8 topic anchors (politics, medicine, code, cooking, sports, art, philosophy, physics) under each pool and computed the pairwise cross-topic cosine matrix. Max-pool produced cross-topic cosines around 0.05 (sharp category boundaries). Mean-pool produced cross-topic cosines around 0.22 (categories blur). Max-pool preserves distinctiveness across semantically unrelated content because sparse distinctive activations survive the max; mean-pool averages them into a mild common direction.

The mechanism is not subtle. Mean-pool is a smoothing operation; max-pool is a sparsity-preserving operation. At long input length, distinctive signal is exactly the signal that a signal-compressor needs to preserve. Max-pool's invariant, that distinctive per-dim activations survive, is the one we want.

What it does not claim. Max-pool is not universally better than mean-pool. For dense representations where every token contributes equally (category-average encoding, broad-topic similarity), mean-pool is the right choice, and the MTEB retrieval benchmarks where mean-pool encoders win tend to exercise exactly that dense-representation regime. The claim is narrower: for sparse-signal preservation at long context, max-pool is the mechanism.

Reproduction: needle_mech_check_v2.py (mechanism cos curves) and topic_discrim_check.py (cross-topic matrix) in the §10 bundle.

5.5 MixPool: the current ship (variant 27B-MixPool)

MixPool is concat(mean(E), per_dim_abs_max(E)) where E is the per-token embedding matrix from potion-256d-v3. Output dim: 512. The adapter on top is a 720K-param MLP: Linear(512→512) + GELU + LN + Linear(512→512) + GELU + LN + Linear(512→384), distilled by MSE against the same MiniLM → BrainCore.text_adapter targets used for the earlier variants. Best eval MSE: 0.025. Cos-to-teacher: 0.98.
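The 720K figure can be checked from the layer shapes. The arithmetic below assumes standard bias terms on each Linear layer and a weight-plus-bias affine on each LayerNorm, plus fp32 storage for the size estimate; none of those details are confirmed above, so the weights-only count is shown alongside.

```python
def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Parameter count of one Linear layer."""
    return d_in * d_out + (d_out if bias else 0)

def layernorm_params(d: int) -> int:
    """LayerNorm with learnable scale and shift (assumed)."""
    return 2 * d

# Linear(512→512) + GELU + LN + Linear(512→512) + GELU + LN + Linear(512→384)
layers = [(512, 512), (512, 512), (512, 384)]

weights_only = sum(linear_params(i, o, bias=False) for i, o in layers)
with_bias_and_ln = sum(linear_params(i, o) for i, o in layers) + 2 * layernorm_params(512)
size_mb_fp32 = with_bias_and_ln * 4 / 1e6  # 4 bytes per parameter, assumed fp32

print(weights_only)       # 720896
print(with_bias_and_ln)   # 724352
print(f"{size_mb_fp32:.1f} MB")  # 2.9 MB
```

Either count rounds to the ~720K parameters quoted, and the fp32 size matches the ~3MB adapter footprint in the pipeline diagram.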

Total encoder-plus-adapter footprint: ~13MB. No GPU at encode time; the embedding lookup is a gather operation and the pool is two torch ops. On CPU the encode latency is microseconds per document at any length.

Matched-pairs results on the brain pipeline at 27B are below. These measure the deployed-stack downstream effect (MixPool encoder + adapter + brain pipeline + projector + Qwen3.5-27B) vs OFF. §8.7 later shows that the same tier numbers appear (within noise) for soft_prompt, kw_prefix, and raw_encoder prefixes: the LLM-downstream needle lift is prefix-tuning-generic, not MixPool-specific. The table below is therefore a ship-number for the deployed arm; do not read it as causal attribution to the encoder choice.

| Context | Arm-on pass rate | Arm-off pass rate | Discordant (on-wins / off-wins) | McNemar exact 2-sided |
|---|---|---|---|---|
| 16K | 100% (60/60) | ~92% | 5 / 0 | p=0.063 (cell) |
| 32K | 98.3% (59/60) | ~85% | 8 / 0 | p=0.0078 (cell) |
| Mid pooled (120) | 99.2% | ~88% | 13 / 0 | p=0.0002 |
| 256K | 64% (16/25) | 32% (8/25) | 9 / 1 | p=0.022 (paired n=25) |

The Mid 16K cell is borderline significant in isolation (p=0.063) but becomes p=0.0002 when pooled with 32K at n=120. The Long 256K cell is independently significant. These are the validated tiers.

Pool ablation. We trained a separate max_only adapter, input dim 256 (max half only, mean half dropped), same distillation recipe, ~493K params. At Long 256K the max_only adapter produces 15/25 needle passes vs MixPool's 16/25; 24/25 matched prompts produce identical outcomes, p=1.0. The mean half contributes nothing at 256K, consistent with the mechanism check (mean-pool cos decay to 0.12 at 200K). At Mid the max_only adapter is within noise of MixPool (118/120 identical, p=0.50); the isolated mean_only ablation does produce a small lift over OFF at Mid (p=0.039), so the mean half earns some edge-case coverage at Mid, though it is not orthogonal to what the max half already captures.

Ship decision. MixPool is the default. The mean half costs microseconds and adds edge-case coverage at Mid that we cannot rule out as real. The max_only variant is the minimal option for size-constrained deployments where the ~900KB delta matters (browser extensions, mobile, embedded). Both shipped READMEs are in the repo (mixpool_text_adapter.README.md, max_only_text_adapter.README.md). Note post-§8.7: since any 4-token trained prefix achieves similar LLM-downstream lift on Mid recall, the ship-rationale for MixPool over alternatives is encoder-level properties (small footprint, ~3MB adapter, microsecond CPU encode, open license, no-GPU inference) rather than LLM-downstream-effect superiority. If a future pipeline requires a different encoder for unrelated reasons, the downstream-needle result would likely transfer to any trained 4-token prefix on the new encoder.

5.6 What this result IS and IS NOT

Narrow scope, to be read literally.

The validated effect is: on a 27B-parameter 4-bit-quantized Qwen3.5 downstream, conditioned on brain tokens derived from potion-256d-v3 + MixPool + a 720K-param adapter, matched-pairs needle retrieval at Mid (16K–32K) and Long (256K) shows significant brain-on lift over brain-off. The effect is the combined contribution of the encoder, the adapter, the brain pipeline, and the projector; the encoder's specific role is signal preservation, per the mechanism check.

What this is not.

  • Not a generic retrieval benchmark. We report no BEIR, no LongEmbed, no RTEB numbers for MixPool in this post. Those benchmarks measure encoder-as-retriever; our stack uses encoder-as-signal-compressor; cross-use is not supported by our data. When we run those benchmarks we will publish them separately. Base potion-256d-v3 lands at roughly 30–35 nDCG@10 on MSMARCO; transformer encoders in the same footprint range sit at 45–55. We are not claiming MixPool beats them on retrieval and would not be surprised if it does not.
  • Not a claim that static encoders beat transformer encoders in general. The comparison that survives here (MixPool vs Qwen3-Embedding-0.6B on brain-pipeline matched-pairs) is specific to this stack. A different downstream consumer, a different task, or a different pooling of the transformer encoder output could change the result. We treat this as one data point, not a trend.
  • Not a claim that MixPool is causally responsible for the downstream needle lift. §8.7 shows that a soft-prompt baseline, a TF-IDF keyword prefix, and a raw-encoder prefix without BrainCore all produce ≈ equivalent needle-retrieval performance at Mid 32K under the harder Phase 2 distribution. The MixPool mechanism (max-pool preservation of sparse signal at 256K) is validated at the encoder-output signal level (§5.4 cos curves, §5.5 pool ablation). Whether that encoder-level preservation is what drives the downstream LLM behavior is a separate question, and §8.7 answers it in the negative at Mid: the downstream lift there is prefix-tuning-generic, not MixPool-specific. The §6.1 Long 256K claim stands as-published under the original distribution; the Long harder-needle replication is pending (§8.8).
  • Not a finished product. MixPool is a research artifact validated on one pipeline. Productization (vertical-specific distillation, polished deploy, long-doc corpus validation) is future work.

What the result does support, in two tiers:

  • At the encoder-output level: max-pool static encoders preserve sparse-signal information at context lengths well past any encoder's training budget. This is validated by the §5.4 mechanism check (cos curves, topic discrimination) and the §5.5 pool ablation (max_only ≡ MixPool at Long). Narrow, defensible, mechanism-grounded.
  • At the LLM-downstream level: the deployed MixPool stack produces a measurable needle-retrieval lift over OFF at Mid and Long (§5.5 table, §6.1). §8.7 shows this Mid downstream lift is prefix-tuning-generic: any input-conditioned or static 4-token prefix trained on the same data produces equivalent Mid lift under a harder-needle distribution. The encoder choice explains what the deployed arm does well at the encoder-output level; it does not explain the LLM-downstream gap over OFF at Mid, and the Long replication under the harder distribution is still outstanding.

That is the claim. Nothing larger.


6. Validated tiers: the narrow wins that survived methodology

After the matched-pairs rewrite of Section 3 and the encoder swap of Section 5, we re-ran the long-context evaluation under the new methodology with the new encoder. The numbers in this section are what we are willing to defend in public.

6.1 The validated table

Three context tiers, all under MixPool + matched-pairs (torch.manual_seed(ctx*100000 + batch_idx) reset before each arm), Qwen3.5-27B 4-bit nf4 downstream, brain-on vs OFF:

Tier | Context | n | Brain-on pass | OFF pass | Discordant (brain / OFF) | exact-p (McNemar 2-sided)
Mid 16K | 16K | 60 | 100.0% | 91.7% | 5 / 0 | 0.063 (cell)
Mid 32K | 32K | 60 | 98.3% | 85.0% | 8 / 0 | 0.0078
Mid pooled | 16K + 32K | 120 | 99.2% | 88.3% | 13 / 0 | 0.0002
Long 256K | 256K | 25 | 64.0% | 32.0% | 9 / 1 | 0.022

Mid 16K is borderline in isolation but pools cleanly with 32K to a strongly significant Mid combined effect. Long 256K is independently significant despite the smaller n, because at 256K context the brain effect is large enough (+32pp marginal) to clear discordant-pair significance at n=25.
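The seed discipline behind "matched-pairs" can be illustrated with a small stand-in. The sketch below uses Python's random module in place of torch.manual_seed so it runs anywhere; the helper names are hypothetical, and the unit of ctx (token count vs. tier label) is an assumption since the text gives only the formula.

```python
import random


def arm_seed(ctx: int, batch_idx: int) -> int:
    """Seed formula quoted in the text: ctx*100000 + batch_idx."""
    return ctx * 100_000 + batch_idx


def sample_noise(seed: int, n_draws: int = 5) -> list:
    # Stand-in for torch.manual_seed(seed) followed by sampled decoding.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_draws)]


def run_pair(ctx: int, batch_idx: int):
    seed = arm_seed(ctx, batch_idx)
    noise_on = sample_noise(seed)   # arm 1: prefix ON
    noise_off = sample_noise(seed)  # arm 2: seed reset, prefix OFF
    return noise_on, noise_off


on, off = run_pair(32_000, 3)
print(on == off)  # both arms see the same sampling randomness
```

Because the random stream is identical in both arms, any difference in pass/fail on a given prompt is attributable to the prefix rather than to sampling luck, which is what licenses a paired test over a marginal comparison.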

6.2 What the encoder gave us

The same evaluation under earlier encoder variants did not produce the table above. MiniLM truncated and saw only the first 256 tokens of every haystack; Qwen3-Embedding-0.6B's last-token pooling concentrated on filler at the end of the haystack and tied OFF in matched-pairs (12-8 discordant, p=0.50). MixPool's max-pool half preserves needle-token activations regardless of haystack length (Section 5.4 mech check), and it is that preservation at the encoder-output level that makes MixPool viable as the deployed arm.

Caveat. We previously attributed the LLM-downstream recall lift over OFF to the joint contribution of MixPool + brain pipeline + projector + Qwen3.5-27B. Phase 2 (§8.7) tested this attribution directly with a 5-arm matched-pairs design at Mid 32K: brain (MixPool + full brain pipeline), kw_prefix (TF-IDF keyword-conditioned prefix), soft_prompt (4 learned static tokens, no input conditioning), raw_encoder (potion+MixPool fed directly to a fresh 4-token projector with no BrainCore / PCA-50 / CorticalNet), and OFF. Under a harder-needle prompt distribution (OFF=67% at Mid 32K), all four prefix arms tied each other (brain 97% / kw_prefix 100% / soft_prompt 97% / raw_encoder 90%; every head-to-head comparison non-significant, p=0.625 to 1.0) and all beat OFF. The Mid recall lift over OFF is prefix-tuning-generic: any 4-token prefix trained on the same dispositional triples produces it, so it is neither brain-specific nor MixPool-specific. The §5 encoder-output-level findings (cos curves, topic discrimination, pool ablation) still hold; the joint-contribution framing of this paragraph for the LLM-downstream lift over OFF does not. §8.7 unpacks this and §8.8 discusses the Long tier disposition.

6.3 What the validated effect IS: the public-talk version

The workshop sentence: 4-token prefix tuning on dispositional triples, injected after the system message, lifts long-context needle retrieval at Mid 32K on a 4-bit Qwen3.5-27B downstream under matched-pairs sampling. The brain pipeline (MixPool + BrainCore + PCA-50 + CorticalNet + projector) was the first prefix type we validated at this tier; Phase 2 (§8.7) showed that keyword-extract prefixes, static soft prompts, and raw-encoder prefixes all produce statistically indistinguishable Mid tier lift under a harder-needle distribution. The Long 256K result (brain +32pp vs OFF, p=0.022 paired, n=25) stands as-published under the original prompt distribution (§6.1); Phase 2 Long at n=2 is insufficient to confirm replication under the harder-needle distribution. Harder-needle Long replication is pending.

That sentence has every modifier it needs and no modifiers it doesn't, post-Phase-2. The Mid lift is real, the methodology is the strongest one we know how to use at this n, and the brain-specific attribution has been narrowed to "brain is one of several 4-token prefix types that achieve this lift on our stack." The §6.1 original-distribution Mid and Long numbers are not retracted; they are reframed as one prefix-type's performance in a regime where other prefix types perform identically.

For the mechanism picture that supports this narrower claim, see §8.7 (Phase 2 recall findings) and §8.6 (EXP C1 technical harm findings).

6.4 What the validated effect IS NOT: the public-talk version

  • Not a reasoning lift. MMLU-Pro flat (−7.5pp at 14B, no significance), TruthfulQA tied. We will not claim brain helps reasoning.
  • Not a code or engineering lift. Section 7 unpacks: brain ON significantly loses on technical generation across every axis we measured (p<0.0001 pooled). §8.6 sharpens this: the harm is input-conditioning-specific, not brain-specific. We must ship brain OFF, and equivalently any input-conditioned 4-token prefix OFF, for code surfaces.
  • Not a brain-specific recall claim. Phase 2 (5-arm matched-pairs at Mid 32K, n=30, §8.7) showed keyword-extract, soft-prompt-only, and raw-encoder-only prefix arms all tied brain on recall under a harder-needle prompt distribution. The recall lift is a property of 4-token prefix tuning on dispositional triples, not of brain-derived content specifically. The MixPool encoder's signal-preservation mechanism (§5) is an encoder-output-level finding (validated at the cos-curve and topic-discrim level), but the LLM-downstream recall lift over OFF generalizes across prefix content types. See §8.7.
  • Not a generic recall claim. The benchmark is synthetic needle-in-haystack, our own construction. We have not validated against real-world long-document QA. That benchmark is on the roadmap; we will publish whatever it shows.
  • Not "brain heals long context." The lift exists at our context tiers under our needle distribution. Different long-context tasks (multi-needle reasoning, document summarization, cross-passage entailment) might or might not show the same shape. We have one task family.
  • Not a validated Long tier claim post-Phase-2. The §6.1 Long 256K data (brain 64% vs OFF 32%, p=0.022 paired, n=25) stands as-published under the ORIGINAL prompt distribution. Phase 2 attempted to replicate Long under a harder-needle distribution but was killed at n=2 due to budget cap + O(N²) prompt-build cost at 256K (see §8.8). Until we have a Phase 2-distribution Long re-run at n≥25, the Long tier should be read as "validated in §6.1 under the original distribution, pending replication under harder-needle distribution."
  • Not cross-turn memory. The brain pipeline recomputes per user message in isolation. By turn 5 of a chat, brain still reflects turn-1's input. Multi-turn / streaming brain state is on the roadmap as Untapped Axis #1; it is not what we have today.

6.5 Ship rule (per-tier dispatch)

Operationally: a 4-token dispositional prefix is ON for tiers where we can name the surface as "long-context recall amplification," and OFF for tiers where the prompt looks like code, engineering, or structured-output generation. We do not expose brain-on/off as a user-visible toggle for this reason. The dispatch is done per use case, not per user choice. Section 7 explains why "prefix OFF for tech" is non-optional, and §8.6 shows the harm is input-conditioning-specific. §8.7 shows the recall benefit is prefix-tuning-generic, so the "ON for recall" choice does not have to be "brain ON" specifically: a static soft_prompt delivers equivalent recall lift at Mid and ships without the BrainCore CC-BY-NC license constraint. A minimum-viable deployed config is: static soft_prompt ON for recall surfaces, prefix OFF for technical surfaces. The brain pipeline remains the deployed arm of our research ship but is no longer the uniquely-necessary implementation of this rule.
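As an illustration of the dispatch rule, a minimal sketch follows. The surface names, the enum, and the function are all hypothetical; the post does not describe how the deployed dispatcher is implemented. The default of OFF for any surface we cannot name is our reading of "ON for tiers where we can name the surface."

```python
from enum import Enum


class Prefix(Enum):
    OFF = "off"
    SOFT_PROMPT = "soft_prompt"
    BRAIN = "brain"


# Hypothetical surface labels, chosen only for this sketch.
RECALL_SURFACES = {"long_context_recall"}
TECHNICAL_SURFACES = {"code", "engineering", "structured_output"}


def choose_prefix(surface: str,
                  recall_prefix: Prefix = Prefix.SOFT_PROMPT) -> Prefix:
    """Per-use-case dispatch: never a user-visible toggle."""
    if surface in TECHNICAL_SURFACES:
        return Prefix.OFF          # any 4-token prefix is OFF for technical work
    if surface in RECALL_SURFACES:
        return recall_prefix       # soft_prompt or brain: equivalent lift at Mid
    return Prefix.OFF              # assumption: unnamed surfaces default to OFF


print(choose_prefix("code"))
print(choose_prefix("long_context_recall"))
print(choose_prefix("long_context_recall", Prefix.BRAIN))
```

Passing the recall prefix as a parameter reflects the point made above: the rule constrains where a prefix is ON, not which prefix type fills the slot.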


7. Negative result: brain hurts technical generation, including hallucination

The validated long-context effect of Section 6 is the headline good news. The headline bad news lives here. On technical generation work (code, debugging, code review, conceptual engineering questions, structured-output tasks), brain ON loses to brain OFF decisively under the best methodology we have. The decisive cell is hallucination on technical prompts, where brain ON is significantly worse than OFF (p=0.021).

7.1 The technical NO-GO

We constructed an 80-prompt technical benchmark across 8 archetypes (10 prompts each): algorithms, code review, coding, conceptual, debugging, diagnostic, hallucination, sql. Each archetype targets a distinct failure mode. Prompts are in data/technical_prompts_DRAFT.json (released in the §10 bundle).

Setup:

  • Encoder: potion-256d-v3 + MixPool (default ship, same encoder as Section 6)
  • Methodology: matched-pairs (torch.manual_seed(7777 + batch_idx) reset before each arm)
  • Decoding: max_new=4096, do_sample=True, T=0.6, top_p=0.95, batch=8
  • LLM: Qwen3.5-27B 4-bit nf4 (same downstream as Section 6)
  • Arms: brain-on (MixPool) vs OFF (n=80 paired)
  • Judging: 8 parallel deep-thinker subagents, blind A/B (40/40 randomized A↔B mapping per archetype), multi-dim scorecard (correctness, completeness, hallucination [inverted], length-appropriate, engineer-pick)
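The blind A/B step can be sketched as follows. This is an illustrative reconstruction under assumptions, not the judging harness itself: it balances the brain-as-A and brain-as-B assignments exactly (40/40 over 80 prompts), hides arm identity from the judge, and maps the judge's pick back afterwards. All names are hypothetical.

```python
import random


def balanced_ab_mapping(n_prompts: int, seed: int) -> list:
    """True means the brain-on response is shown as 'A' for that prompt."""
    if n_prompts % 2:
        raise ValueError("need an even number of prompts for an exact split")
    flags = [True] * (n_prompts // 2) + [False] * (n_prompts // 2)
    random.Random(seed).shuffle(flags)  # seeded so the mapping is reproducible
    return flags


def blind_pair(brain_text: str, off_text: str, brain_is_a: bool) -> dict:
    """What the judge sees: two anonymous responses."""
    if brain_is_a:
        return {"A": brain_text, "B": off_text}
    return {"A": off_text, "B": brain_text}


def unblind(pick: str, brain_is_a: bool) -> str:
    """Map a judge pick ('A', 'B', or 'tie') back to an arm name."""
    if pick == "tie":
        return "tie"
    return "brain" if (pick == "A") == brain_is_a else "off"


mapping = balanced_ab_mapping(80, seed=0)
print(sum(mapping), len(mapping) - sum(mapping))  # 40 40
```

An exact split rather than independent coin flips removes any residual position preference ("judges favor A") from the pooled tally.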

Engineer-pick aggregation by archetype (one judge per archetype, pooled across 10 prompts each):

Archetype | Brain wins | OFF wins | Ties
algorithms | 1 | 6 | 3
code_review | 4 | 6 | 0
coding | 2 | 4 | 4
conceptual | 0 | 8 | 2
debugging | 1 | 6 | 3
diagnostic | 1 | 6 | 3
hallucination | 0 | 9 | 1
sql | 2 | 7 | 1
POOLED (n=80) | 11 | 52 | 17

McNemar exact 2-sided binomial on engineer-pick discordant pairs (n_disc=63): brain=11 vs OFF=52, p<0.0001. OFF takes more engineer-picks than brain in every archetype, with the narrowest margins on code_review (4 vs 6) and coding (2 vs 4). Brain wins zero on conceptual and zero on hallucination, the two archetypes most sensitive to fabrication.

Per-axis breakdown (paired McNemar on the count of pairs where one arm strictly wins on that axis):

Axis | brain > OFF | OFF > brain | tie | exact-p
correctness | 1 | 25 | 54 | <0.0001
completeness | 1 | 37 | 42 | <0.0001
hallucination (inverted, 2 = clean) | 3 | 13 | 64 | 0.0213
length-appropriate | 10 | 29 | 41 | 0.0034

OFF significantly beats brain on every axis. The hallucination axis is the most surprising. We had hoped a brain-derived prefix might discipline the model away from confabulation. The opposite happens. Brain primes confident output; on technical prompts where confident-without-verification is the hallucination failure mode, the priming is actively harmful.
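The per-axis table can be sanity-checked directly: each row should account for all 80 pairs, and ties drop out of the paired test. The sketch below assumes the counts are read as (brain > OFF, OFF > brain, tie) and recomputes the exact two-sided p-values with the standard library; the helper name is ours.

```python
from math import comb

N_PAIRS = 80

# (brain > OFF, OFF > brain, tie) per axis, as read from the table above.
axes = {
    "correctness": (1, 25, 54),
    "completeness": (1, 37, 42),
    "hallucination": (3, 13, 64),
    "length-appropriate": (10, 29, 41),
}


def exact_two_sided_p(wins_a: int, wins_b: int) -> float:
    """Exact binomial (McNemar) p-value on discordant pairs only."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


for axis, (brain, off, tie) in axes.items():
    assert brain + off + tie == N_PAIRS  # every pair is accounted for
    print(f"{axis}: n_disc={brain + off}, p={exact_two_sided_p(brain, off):.4g}")
```

Note how few pairs are discordant on the hallucination axis (16 of 80): most responses from both arms are clean, and the significance comes entirely from the 3 vs 13 split among the ones that differ.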

7.2 What the hallucination samples look like

The hallucination archetype is designed to elicit confident-but-wrong claims about specific APIs, flags, and keywords. Brain ON wins zero of ten. Three concrete cases follow with the brain-on response, the OFF response, and the judge's verdict.

Case A: kubectl apply flags (id=56)

Prompt: "List all flags accepted by kubectl apply. Be exhaustive — include short and long forms."

Brain ON response (excerpt):

kubectl apply supports the following flags:

  -f, --filename=[]      Filename, directory, or URL to files to use to create the resource
  -k, --kustomize=""     Process a kustomization directory
  -R, --recursive        Process the directory used in -f, --filename recursively
  -i, --interactive      Apply changes interactively, prompting for confirmation
  -p, --patch            Apply as a strategic merge patch
  -t, --timeout=0s       The length of time to wait before giving up on a single
                         server request
  -l, --selector=""      Selector (label query) to filter on
      --dry-run=none     Must be "none", "server", or "client"
      --validate=true    If true, use a schema to validate the input before sending it
      --force            If true, immediately remove resources from API
      --record           Record current kubectl command in the resource annotation

OFF response (excerpt):

kubectl apply supports these flags (verified against kubectl 1.31):

  -f, --filename=[]      Filename, directory, or URL to files to apply
  -k, --kustomize=""     Process the kustomization directory
  -R, --recursive        Process the directory used in -f, --filename recursively
  -l, --selector=""      Selector (label query) to filter on
      --dry-run=none     Must be "none", "server", or "client"
      --validate='strict'  Validation strategy: strict|warn|ignore
      --force            Force apply by deletion and re-creation
      --grace-period=-1  Period of time in seconds given to the resource
      --field-manager='kubectl-client-side-apply'  Field manager name
      --server-side=false  If true, apply runs in the server instead of the client

Judge verdict: OFF wins. Brain ON lists three flags that kubectl apply does not accept as written: -i / --interactive (not an apply flag), -p / --patch (a flag on kubectl patch, not kubectl apply), and -t / --timeout (apply has a long-form --timeout for waiting on deletes but no -t short form, and the description shown belongs to the global --request-timeout option). OFF restricts itself to real flags and explicitly anchors to a kubectl version.

Case B: Git rebase TODO keywords (id=53)

Prompt: "What are the valid commands you can use in an interactive git rebase TODO file (git rebase -i)? List all of them."

Brain ON response (excerpt):

The following commands are valid in a git rebase interactive TODO file:

  pick (p)      use commit
  reword (r)    use commit, but edit the commit message
  edit (e)      use commit, but stop for amending
  squash (s)    use commit, but meld into previous commit
  fixup (f)     like "squash", but discard this commit's log message
  exec (x)      run command (the rest of the line) using shell
  drop (d)      remove commit
  label (l)     label current HEAD with a name
  reset (t)     reset HEAD to a label
  merge (m)     create a merge commit using the original merge commit's message
  reverse       reverse the order of the commits below this line

OFF response (excerpt):

git rebase --interactive TODO commands:

  pick (p)      use commit
  reword (r)    use commit, but edit the commit message
  edit (e)      use commit, but stop for amending
  squash (s)    use commit, meld into previous commit
  fixup (f)     like squash but discard this commit's log message
  exec (x)      run command using shell
  break (b)     stop here (continue rebase later with 'git rebase --continue')
  drop (d)      remove commit
  label (l)     label current HEAD with a name
  reset (t)     reset HEAD to a label
  merge (m)     create a merge commit using the original merge commit's message
                or the oneline if no original merge commit was specified

Judge verdict: OFF wins. Brain ON invents reverse as a rebase TODO command; it does not exist. Brain ON also omits break (b), which is a real command. OFF lists 11 real commands without invention (Git 2.38 and later also accept update-ref, which neither arm lists). The pattern is consistent: brain ON fabricates one command and forgets another, while OFF lists only what is actually there.

Case C: requests.Session.get parameters (id=52)

Prompt: "What are the parameters of requests.Session.get() in the python requests library? Show the signature and explain each parameter."

Brain ON response (excerpt):

requests.Session.get(url, params=None, data=None, headers=None,
                     cookies=None, files=None, auth=None,
                     timeout=None, allow_redirects=True,
                     max_redirects=30, proxies=None, hooks=None,
                     stream=None, verify=None, cert=None,
                     json=None)

OFF response (excerpt):

# requests.Session.get is a thin wrapper around Session.request("GET", ...).
# The actual signature is:

def get(self, url, **kwargs):
    """Sends a GET request. Returns Response object.

    :param url: URL for the new Request object.
    :param **kwargs: Optional arguments that request takes.
    """
    kwargs.setdefault('allow_redirects', True)
    return self.request('GET', url, **kwargs)

# kwargs may include any of the following (passed through to Session.request):
#   params, data, json, headers, cookies, files, auth, timeout,
#   allow_redirects, proxies, hooks, stream, verify, cert
#
# Note: max_redirects is a Session attribute (Session.max_redirects = 30 default),
# NOT a per-request kwarg. Setting it on the .get() call has no effect.

Judge verdict: OFF wins. Brain ON invents max_redirects=30 as a get() parameter; it is not. max_redirects is an attribute on the Session object itself, set as session.max_redirects = N, and does not flow through get(**kwargs). OFF gets both the real signature (get(self, url, **kwargs)) and the correct location of max_redirects. Brain ON is decisive and wrong; OFF is hedged and right.

7.3 The pattern: confident output is the wrong prior on technical content

Across all three cases, brain ON produces a clean-looking, decisive, well-formatted answer that contains a fabricated specific. OFF produces a slightly more hedged, sometimes longer answer that anchors to real artifacts (a kubectl version, a **kwargs indirection, a "verified against" caveat). The brain prefix appears to push the model toward confident assertion of plausible-but-unverified detail; on technical questions where verifiability is the entire point, confidence without verification is precisely the hallucination failure mode.

This shows up in the per-axis numbers: brain ON wins the length-appropriate axis less often than OFF (10 vs 29) because brain ON tends to produce shorter, denser, more committed answers. That is exactly the surface signal of "I know this," exactly when the model does not. The hallucination axis (3 brain-wins vs 13 OFF-wins, p=0.021) is the most diagnostic single result in the entire post.

7.4 What we ruled out

We are confident the technical NO-GO is not an artifact of methodology, encoder choice, or response-length budget. Each was tested independently:

  • Methodology (matched-pairs vs marginal). The original tech-bench ran under the older marginal-comparison methodology with MiniLM and max_new=400. It hit 24.6% brain-win-rate at p<0.0001. The rebench reported here uses MixPool + matched-pairs + max_new=4096 and hits 11/80 = 13.75% brain-win-rate at p<0.0001: same direction, same significance, sharper effect under tighter methodology. If matched-pairs had reversed the verdict, that would have been the methodology-artifact case; instead it confirmed the original.
  • Encoder choice. The original result used MiniLM. The rebench uses MixPool. Different encoders, same NO-GO direction. The encoder is not the confound.
  • Response-length budget. The original result used max_new=400 (where brain ON's terseness might dominate); the rebench uses max_new=4096 (where OFF gets enough budget to be exhaustive). OFF still wins. The budget is not the confound.
  • Length bias on judge picks. OFF responses are systematically longer (mean 5,299 chars vs brain's 3,390). Among the 63 decisive engineer-picks, longer-wins-more is statistically present (42/63 = 67%, p=0.011) but does not fully account for the gap. OFF was longer in 44 paired prompts but won 52 engineer-picks; in many cases OFF won despite being shorter. Length is a partial confound, not the explanation.
  • Sample size. 80 paired prompts, 8 archetypes × 10 each. The pooled p<0.0001 is dominated by the conceptual (0/8) and hallucination (0/9) cells, both of which are saturated: adding n there cannot move the verdict in brain's favor.

The result is robust to every confound we have been able to construct.

7.5 What this implies for the product surface

Brain ON for context-recall tiers (Mid + Long, per Section 6) is shippable. Brain ON for code, debugging, code review, conceptual engineering questions, or any structured-output task is harmful at p<0.0001. The per-tier dispatch rule of Section 6.5 is not a hedge; it is the only configuration the data supports. We will not expose brain-on/off as a user-visible toggle on coding surfaces, because the product would be measurably worse with it on.

Section 8 takes the next step: given that brain content drives long-context recall and hurts technical generation, what is the structural reason we cannot inject around the technical regression by training the projector harder?


8. Mechanism: why prefix tuning has structural limits we cannot inject around

Architectural caveat. Qwen3.5-27B uses Qwen3_5GatedDeltaNet (a linear/gated attention variant) for its decoder layer attention, not standard softmax attention. We discovered this incidentally during EXP B, where we tried to capture brain-token attention weights and found the model has no traditional softmax attention to capture. The Petrov 2024 result on prefix tuning we cite below was developed for softmax attention, where the prefix biases attention output in a fixed direction. Linear/gated attention has even stronger fixed-direction structure per the linear-attention literature, so the "structural limit of prefix tuning" argument we make here is at least as strong on this architecture as it would be on a softmax-attention model. But we cannot port Petrov's specific mechanistic explanation directly without running a probe designed for linear attention. Treat the §8 reasoning as architecture-agnostic structural argument; the mechanism that produces the limit on Qwen3.5-27B specifically remains a follow-up question.

The technical NO-GO of Section 7 is not a "we just need more code in the training corpus" problem. We tested that hypothesis directly through three orthogonal gating experiments. The results constrain what brain injection can structurally do, independent of how the projector is trained.

The three tests, each with a specific question:

  1. Linear probe. Does the upstream brain signal carry enough task-axis structure for any downstream projector to extract code-vs-not-code separation?
  2. Vanilla soft-prompt baseline. Would 4 trained prefix tokens (no brain pipeline) do what the brain pipeline does at Mid 32K?
  3. Position-pad ablation. Is the brain win at 256K just a position-offset artifact of having any 4 vectors in the brain slot?

8.1 Linear probe: the PCA-50 basis is misaligned with class-discriminative axes

We took the 50-dim PCA-projected BrainCore output (the input to CorticalNet, the input to the projector) for 300 dispositional triples spanning 12 categories (25 each: casual, code, code_review, creative, emotional, factual, instructional, logical_fallacy, manipulative, persuasive, philosophical, technical). We fit a logistic regression on the 50-dim features. The first version of this analysis used a single 80/20 holdout (5 samples per class in test).

Probe | Method | Accuracy | Chance | Multiplier
12-way (full) | LR 5-fold CV | 45.7% ± 1.7% | 8.3% | 5.5× chance
12-way (full) | LR holdout (80/20) | 45.0% | 8.3% | -
4-way (code/chat/factual/other) | LR 5-fold CV | 73.7% ± 6.4% | 25% | partial signal

The first version of this section reported per-class recall on the 12-way holdout and concluded that code, technical, and logical_fallacy recovered at 0% recall, implying BrainCore's upstream signal lacked the task-axis resolution where the technical regression matters. We followed up with two experiments (EXP A and EXP E) that significantly weaken the original framing.

Follow-up: dim-reduction sweep at PCA-{50, 100, 200} and supervised LDA

The original 0% recall numbers were measured under a single 80/20 holdout with 5 samples per class in test, so per-class recall could only move in steps of 0.20 and a class read 0% whenever all five of its test samples were missed. Running the same probe under cross_val_predict (every sample held out exactly once across 5 folds) gives a different picture, and adding supervised LDA sharpens it further:

Method | Overall acc (5-fold CV) | code recall | technical recall | logical_fallacy recall
PCA-50 | 0.460 ± 0.017 | 0.240 | 0.200 | 0.040
PCA-100 | 0.460 ± 0.017 | 0.240 | 0.200 | 0.040
PCA-200 | 0.460 ± 0.017 | 0.240 | 0.200 | 0.040
LDA-10 (supervised, sklearn caps n_components at n_classes − 1 = 11) | 0.627 ± 0.048 | 0.320 | 0.760 | 0.520
LDA-11 (requested 20) | 0.630 ± 0.036 | 0.360 | 0.760 | 0.520
LDA-11 (requested 50) | 0.630 ± 0.036 | 0.360 | 0.760 | 0.520

Two observations:

  1. PCA-50, PCA-100, PCA-200 are identical to four decimal places. Higher PCA dims add no information beyond PCA-50 for this task. The PCA basis is variance-optimal but does not align with the class-discriminative axes, which is exactly what supervised LDA exposes.
  2. LDA recovers strong signal on technical (76%) and logical_fallacy (52%); moderate on code (32-36%). The class-separable structure exists in the 20,484-dim BrainCore output. PCA fails to preserve it because variance ≠ separability.

The original 0% reading on the 12-way holdout was a small-test-set artifact (with 5 samples per class, a single correct prediction gives 0.20 recall and zero correct gives 0.00). Under cross_val_predict, PCA-50 gives 24%/20%/4% on the focus classes: non-zero but still weak relative to LDA's 32%/76%/52%.
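How easily a 5-sample test split reads 0% can be put in numbers. The sketch below is a back-of-envelope calculation under a simplifying assumption that is ours, not the post's: each test sample is an independent draw that is classified correctly with probability equal to the cross-validated recall.

```python
def prob_zero_recall(true_recall: float, n_test: int = 5) -> float:
    """Chance that all n_test samples of a class are missed.

    Assumption: independent per-sample hits with probability true_recall.
    """
    return (1.0 - true_recall) ** n_test


# With 5 test samples per class, recall can only take these values.
possible_recalls = [k / 5 for k in range(6)]
print(possible_recalls)

# Cross-validated PCA-50 recalls from the table above.
cv_recall = {"code": 0.24, "technical": 0.20, "logical_fallacy": 0.04}
for cls, r in cv_recall.items():
    print(f"{cls}: P(holdout reads 0%) = {prob_zero_recall(r):.3f}")
```

Under that assumption a class with 24% true recall still shows 0/5 on roughly a quarter of random holdouts, so a single-split 0% reading carries little evidence that the signal is absent.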

Updated conclusion. The deployed PCA-50 → projector pipeline DOES discard upstream signal that is class-discriminative. BrainCore's 20,484-dim output represents technical and logical_fallacy categories well; the variance-optimal PCA basis used in the deployed adapter does not align with the task axes we need. Code remains the genuinely hardest of the three classes (~32-36% LDA recall) but it is not "0%" as the original framing implied.
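The "variance ≠ separability" point can be seen on a two-dimensional toy example. The sketch below is not the BrainCore probe: it builds synthetic data where the high-variance axis carries no class information, then compares a nearest-mean classifier on the top PCA direction against one on the Fisher (two-class LDA) direction. The data, helper names, and classifier are all illustrative assumptions.

```python
import math


def stats(points):
    """Mean and (var_x, cov_xy, var_y) of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    return (mx, my), (sxx, sxy, syy)


def top_pca_direction(points):
    _, (a, b, c) = stats(points)
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)  # top eigenvalue
    vx, vy = lam - c, b  # eigenvector; valid here because var_x > var_y
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm


def fisher_direction(class0, class1):
    (m0x, m0y), (a0, b0, c0) = stats(class0)
    (m1x, m1y), (a1, b1, c1) = stats(class1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1  # within-class scatter (up to scale)
    dx, dy = m1x - m0x, m1y - m0y
    det = a * c - b * b
    wx, wy = (c * dx - b * dy) / det, (a * dy - b * dx) / det  # Sw^-1 (mu1-mu0)
    norm = math.hypot(wx, wy)
    return wx / norm, wy / norm


def nearest_mean_accuracy(direction, class0, class1):
    p0 = [p[0] * direction[0] + p[1] * direction[1] for p in class0]
    p1 = [p[0] * direction[0] + p[1] * direction[1] for p in class1]
    m0, m1 = sum(p0) / len(p0), sum(p1) / len(p1)
    correct = sum(abs(v - m0) <= abs(v - m1) for v in p0)
    correct += sum(abs(v - m1) < abs(v - m0) for v in p1)
    return correct / (len(p0) + len(p1))


def jitter(i):
    return 0.1 * ((i % 3) - 1)


# x: wide spread shared by both classes; y: narrow, but carries the label.
class0 = [(float(i), -1.0 + jitter(i)) for i in range(-10, 11)]
class1 = [(float(i), 1.0 + jitter(i)) for i in range(-10, 11)]

acc_pca = nearest_mean_accuracy(top_pca_direction(class0 + class1), class0, class1)
acc_fisher = nearest_mean_accuracy(fisher_direction(class0, class1), class0, class1)
print(f"top PCA direction: {acc_pca:.3f}   Fisher direction: {acc_fisher:.3f}")
```

On this toy data the top principal component follows the wide, uninformative axis and classifies near chance, while the supervised direction separates the classes cleanly. The same dimensionality budget (one direction) gives opposite outcomes depending on whether the basis was chosen for variance or for class separation.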

Follow-up: does swapping PCA-50 for LDA-10 in the deployed pipeline fix the regression?

The natural next test: substitute LDA-10 for PCA-50 in the brain pipeline, retrain the projector on the new feature dim with identical recipe (LM loss, 5 epochs, AdamW lr=1e-3), re-run the technical rebench. If the upstream-signal hypothesis is correct, the LDA pipeline should at least tie OFF on technical, since its features carry the class-discriminative signal that PCA-50 discards.

We ran this (EXP E). The result is decisive in the wrong direction:

  • LDA arm vs OFF (engineer-pick, 3 archetypes pooled, n=30): lda 1, off 25. LDA loses 1-25.
  • LDA arm vs current ship mixpool: lda 2, mixpool 6, ties 22. LDA also loses to mixpool (the existing PCA-50 pipeline).
  • Per-axis paired McNemar (lda vs mixpool): correctness p=0.013, completeness p=0.0015, length-appropriate p=0.0009. LDA significantly worse on three of four axes vs the current ship.
  • Three blind judges converged on the LDA arm being the most truncated (palindrome filter cut off, retry decorator stopped, merge-intervals halted): unshippable on at least 5 of 10 coding prompts.

The hypothesis "better upstream features fix the regression" is FALSIFIED. Even with access to LDA-recovered class signal, the deployed prefix-tuning mechanism makes the technical regression WORSE, not better. The cleanest interpretation: the structural failure is not "the upstream signal lacks task-axis resolution" (false per the LDA recovery) but rather "the deployed prefix-tuning mechanism cannot use the class-discriminative signal to fix the technical regression even when given access to it." Section 8.6 develops this further with EXP C1's keyword-prefix baseline.

Caveats on the LDA result:

  • The LDA projector was newly trained from scratch on 300 triples (~40K params, single Linear); the PCA projector inherited longer training and ~106M params. The 1-25 loss to OFF cannot be attributed solely to training/capacity asymmetry (the gap is too large), but the asymmetry is real and means a fairer test would match training compute.
  • LDA was supervised on 12 dispositional categories. At deployment, the LDA matrix is fixed and applied as a static linear transform (no test-time supervision). LDA-10 is a legitimate downstream-only patch; the failure isn't about LDA being "cheating" upstream.
  • These results come from EXP E, n=30 across 3 archetypes (hallucination, conceptual, coding) judged under blind A/B/C evaluation.

8.2 Vanilla soft-prompt baseline: the brain pipeline contributes something a soft prompt does not

If the brain pipeline's effect at Mid 32K is just "having 4 trained prefix tokens at this position," then a learned 4-token soft prompt trained on the same dispositional triples should reproduce it. We tested this directly.

Setup:

  • 4 learnable tokens × 5120 hidden dim = 20K params (no brain pipeline)
  • LM loss on the 300 dispositional (text → response) pairs the brain projector saw
  • 5 epochs, AdamW lr=1e-3, batch=8
  • Then run Mid 32K needle eval matched-pairs at n=30 with three arms: brain (MixPool default ship), soft_prompt, OFF

Soft-prompt training loss was clean monotone: 1.522 → 1.327 → 1.248 → 1.233 → 1.221. The soft prompt fit the dispositional task as designed.

Mid 32K matched-pairs results (n=30):

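| Comparison | both pass | A-only | B-only | both miss | discordant | exact-p |
|---|---|---|---|---|---|---|
| brain vs soft_prompt | 25 | 4 | 1 | 0 | 5 | 0.375 |
| brain vs OFF | 25 | 4 | 1 | 0 | 5 | 0.375 |
| soft_prompt vs OFF | 23 | 3 | 3 | 1 | 6 | 1.000 |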

Soft_prompt is indistinguishable from OFF at Mid 32K (3-3 discordant, p=1.0). Whatever the soft prompt learned from 300 dispositional triples does not transfer to needle retrieval at 32K context. Brain directionally beats both (29/30 vs 26/30), but at n=30 the discordant count is too small to clear p<0.05 against either.
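The exact-p column above can be reproduced from the discordant counts alone. The following is a minimal sketch, assuming the exact test is a two-sided binomial on the discordant pairs; the function names are illustrative and not taken from the project's harness.

```python
from math import comb


def mcnemar_exact_p(a_only: int, b_only: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(a_only, b_only)
    # Probability of a split at least this lopsided under Binomial(n, 0.5),
    # doubled for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def smallest_possible_p(n_discordant: int) -> float:
    """Best-case p-value when every discordant pair favours the same arm."""
    return mcnemar_exact_p(n_discordant, 0)


print(mcnemar_exact_p(4, 1))   # brain vs soft_prompt, brain vs OFF
print(mcnemar_exact_p(3, 3))   # soft_prompt vs OFF
print(smallest_possible_p(5))  # floor with only 5 discordant pairs
```

With five discordant pairs the smallest reachable p is 0.0625, so no split of the brain rows could have cleared p<0.05 at this n; at least six discordant pairs, all in one direction, are needed (p = 0.03125).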

The interpretation is sharp: the brain pipeline is not a more complex implementation of "4 trained tokens at this position." Vanilla 4 tokens trained on the same data give zero lift over OFF on the eval that matters. The brain pipeline contributes something that vanilla prefix tuning does not, even if at this n we cannot say what with full statistical power.

This is the result that closed the "the brain pipeline is theater for soft-prompt tuning" critique. It is also the result that makes Section 8.1's "the upstream signal lacks code-axis resolution" load-bearing: if soft prompts ≈ OFF, and brain > OFF, then the brain effect is brain-content-specific. If we want to fix the technical regression by changing what brain content carries, we need to change BrainCore (the upstream encoder), not the projector that consumes its output.

8.3 Position-pad ablation: brain wins are not a position-offset artifact

The third gating test asks whether prepending any 4 vectors at the brain position would shift downstream user-message positions enough to produce the long-context win. We tested four arms at 256K matched-pairs (intended n=25, killed at n=14 for resource savings; directionally clear but underpowered):

  • brain (MixPool default)
  • soft_prompt (4 trained tokens, no brain content; same artifact as §8.2)
  • zero_pad (4 zero-vectors at the brain position)
  • OFF (no prepend)

Results at n=14:

| Arm | Pass rate |
|---|---|
| brain | 7/14 = 50.0% |
| soft_prompt | 6/14 = 42.9% |
| zero_pad | 5/14 = 35.7% |
| OFF | 5/14 = 35.7% |

Matched-pairs McNemar (key row: zero_pad vs OFF):

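| Comparison | both | A-only | B-only | neither | disc | exact-p |
|---|---|---|---|---|---|---|
| brain vs zero_pad | 4 | 3 | 1 | 6 | 4 | 0.625 |
| brain vs OFF | 4 | 3 | 1 | 6 | 4 | 0.625 |
| soft_prompt vs OFF | 4 | 2 | 1 | 7 | 3 | 1.000 |
| zero_pad vs OFF | 5 | 0 | 0 | 9 | 0 | 1.000 |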

zero_pad ≡ OFF with zero discordant pairs. Adding 4 zero-embeddings at the brain position contributes exactly nothing. Every prompt OFF found, zero_pad found; every prompt OFF missed, zero_pad missed. The position-offset hypothesis is rejected: a 4-position shift of downstream user-message tokens, when filled with zero content, has zero effect on the eval.
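A paired table like the one above is a tally over per-prompt outcomes. The sketch below shows that tally. The per-prompt pass/fail vectors are synthetic placeholders arranged only to match the published counts (the real per-prompt outcomes are not reproduced here), and `paired_counts` is an illustrative name.

```python
def paired_counts(a_pass, b_pass):
    """Tally matched pairs into (both, a_only, b_only, neither)."""
    if len(a_pass) != len(b_pass):
        raise ValueError("arms must be scored on the same prompts")
    both = sum(a and b for a, b in zip(a_pass, b_pass))
    a_only = sum(a and not b for a, b in zip(a_pass, b_pass))
    b_only = sum(b and not a for a, b in zip(a_pass, b_pass))
    neither = len(a_pass) - both - a_only - b_only
    return both, a_only, b_only, neither


# Synthetic outcomes for 14 prompts (True = needle retrieved).
off = [True] * 5 + [False] * 9
zero_pad = list(off)  # identical to OFF prompt by prompt
brain = [True] * 4 + [False] + [True] * 3 + [False] * 6

print(paired_counts(zero_pad, off))  # (5, 0, 0, 9)
print(paired_counts(brain, off))     # (4, 3, 1, 6)
```

Two arms can share a marginal pass rate and still disagree on individual prompts; only the paired tally distinguishes "same rate" from "same prompts," which is the distinction the zero_pad row relies on.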

Brain (50.0%) sits above zero_pad (35.7%) by ~14pp marginal at this n. Discordant 3-1 favoring brain (p=0.625, NS at n=14, but consistent in direction with the n=25 256K parity result of p=0.022 from Section 6). soft_prompt sits at 42.9%, between brain and {zero_pad, OFF}: vanilla learned prefix tokens may add some attention-anchor benefit at long context even without brain conditioning, but less than brain content does.

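The directional ordering at 256K is brain > soft_prompt > zero_pad ≈ OFF, which mirrors Mid in placing brain above every alternative. The long-context win is not position-shift-driven, and at this n it points toward brain content as the driver without establishing it statistically. This rules out the most parsimonious "you're just shifting positions" explanation for the Section 6 result.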

8.4 Putting it together: three structural facts about brain injection

The combined verdict from the gating tests:

  1. The brain pipeline is causally active. zero_pad ≡ OFF, random init ≈ zero_pad, soft_prompt ≈ OFF at Mid. Brain content matters; the trivial alternatives do not. (§8.2 + §8.3, plus the shuffled-brain ablation that showed wrong-brain gives wrong behavior even when prompt is held constant.)
  2. The brain pipeline contributes something a soft prompt does not. Specifically at long context where the encoder's signal-preservation matters (Section 5), and at Mid where vanilla soft prompts trained on the same data tie OFF. The contribution is brain-content-specific. (§8.2.)
  3. The brain pipeline cannot fix the technical regression by projector retraining. The upstream BrainCore signal lacks code-vs-not-code separability in its PCA-50 representation; a downstream projector cannot extract a task axis that does not exist upstream. Fixing the technical NO-GO requires changing the brain backbone or the upstream representation, not the projector. (§8.1.)

There is a published structural reason this might be the right shape of finding. Petrov et al. ("When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations", ICLR 2024) showed that prefix-tuning biases attention output in a fixed direction independent of the input; prefixes function more as broad attention anchors than as task-conditional steers. Our linear probe result (§8.1) is consistent with that frame: the brain prefix can act as a recall amplifier (long-context needle preservation, broadly applicable) but cannot act as a task-specific corrector (code-vs-not separation, requires task-axis resolution upstream that does not exist).

8.5 What it would take to flip the technical NO-GO

Five architectural directions, in light of EXP A + EXP E + EXP C1 + Phase 2 follow-ups:

  1. Late injection per Petrov 2024. The most viable remaining direction, and now the only direction not ruled out. Inject the prefix after the LLM's <think> boundary rather than at the input. Acts on the post-thinking representation rather than priming the entire generation. Doesn't depend on upstream feature quality (EXP A / EXP E) or input-conditioning content (C1) or any specific prefix content (Phase 2). Architecturally invasive (requires modifying the LLM's forward pass), but it is the only direction we have that doesn't depend on a failed downstream-only patch. EXP D (post-user injection position) is the cheapest direct test of this and is the highest-priority remaining experiment on the roadmap.

  2. Per-prompt gating. Phase 2 weakened this direction substantially: since brain is not the specific contributor to the recall benefit, gating brain ON/OFF per prompt does not uniquely capture the recall lift; any trained 4-token prefix achieves equivalent performance. The minimum-viable gate for "Mid recall without technical harm" is "soft_prompt ON for recall surfaces, OFF for technical," requiring no brain pipeline, no BrainCore CC-BY-NC dependency, and no dispositional-triple training data. This remains a viable deployment strategy but is no longer a scientific direction for the brain research program.

  3. Per-parcel injection (Untapped Axis #2), demoted. Replace PCA-50 → single projector with 20 per-parcel projectors → 20 brain tokens. EXP A showed the PCA-50 collapse discards class-discriminative signal, so per-parcel injection might recover it; but EXP E showed that even when given the better LDA-recovered signal directly, the deployed pipeline got worse on technical, not better. Phase 2 further weakens the case: the Mid recall benefit doesn't need parcel-specific routing (static soft_prompt delivers equivalent Mid lift), and the Mid benefit is already at ceiling for brain so per-parcel can't improve it there. Low-cost to test but now lowest expected value of the five directions.

  4. Different upstream brain backbone, demoted. A brain-encoding model trained on language-region fMRI data (rather than video features) would in principle have different feature structure. EXP A + Phase 2 jointly weaken the case: the upstream signal is class-discriminative on the current backbone (EXP A) and the Mid recall benefit doesn't actually require brain content (Phase 2). A different backbone with different class structure wouldn't help if the failure is downstream of upstream feature quality and the upstream contribution isn't load-bearing.

  5. Discriminative projector training (DEMOTED by EXP E). Train the projector with a contrastive or supervised loss alongside LM loss to align its 50-dim input with class-discriminative axes. EXP E was a partial test of this (LDA-10 input is supervised; projector retrained on standard LM loss), and the result was worse than the current ship, so this partial form of discriminative training did not help. A fuller test may still be worthwhile, but its expected value is lower than late injection's.

We have not run late injection (#1). It is the only scientific direction on the roadmap. Per-prompt gating with a static soft_prompt (#2, product version) ships independently of any architectural change and is the safest next product move, and no longer requires the brain pipeline at all.

None of the five directions is required for the §6 + §8.7 validated narrow-ship rule of "a 4-token trained prefix ON for recall, OFF for technical."


8.6 The technical NO-GO is an input-conditioned-prefix property, not a brain property

The external reviewer of an earlier draft noted a baseline we had not tested: an input-conditioned prefix whose content comes from something other than the brain pipeline. Section 7 ruled out brain-content-independent alternatives (zero-pad, random-init, soft-prompt trained on dispositional triples) but had not tested what happens when the prefix slot is filled with a different input-conditioned signal. EXP C1 answers this.

Setup delta from §7. Same 80-prompt technical benchmark (8 archetypes × 10), same Qwen3.5-27B 4-bit downstream, same matched-pairs seeding (torch.manual_seed(7777 + batch_idx) reset before each arm). The judging panel was blind 4-arm A/B/C/D across three archetypes most sensitive to fabrication (hallucination, conceptual, coding), n=30 pooled. Arms:

  • mixpool. Deployed brain ship (MixPool + PCA-50 + brain projector, same as §7).
  • kw_prefix. TF-IDF keyword-extract → potion MixPool → fresh 10M-param projector trained on the same 300 dispositional triples as brain.
  • soft_prompt. 4 learnable tokens trained on the same dispositional triples (reused from §8.2, zero per-prompt variation at inference).
  • off. No prefix; the untreated baseline that every other arm is compared against.
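The matched-pairs seeding described in the setup (reset the seed to 7777 + batch_idx before each arm) can be illustrated without a GPU. The sketch below is a stand-in under stated assumptions: it uses Python's `random.Random` in place of `torch.manual_seed`, and `sample_tokens` is a hypothetical placeholder for the real generation call. The point is only that every arm sees the same random stream for a given batch, so differences between arms come from the prefix rather than from sampling noise.

```python
import random

BASE_SEED = 7777
ARMS = ["mixpool", "kw_prefix", "soft_prompt", "off"]


def sample_tokens(rng: random.Random, n: int = 5) -> list:
    """Placeholder for stochastic decoding: draws n pseudo-random token ids."""
    return [rng.randrange(50_000) for _ in range(n)]


def run_batch(batch_idx: int) -> dict:
    """Run every arm on one batch, re-seeding before each arm."""
    draws = {}
    for arm in ARMS:
        # Fresh generator per arm, same seed: identical sampling noise across arms.
        rng = random.Random(BASE_SEED + batch_idx)
        draws[arm] = sample_tokens(rng)
    return draws


batch0 = run_batch(0)
batch1 = run_batch(1)
# Within a batch, all arms share one random stream ...
assert all(batch0[arm] == batch0["off"] for arm in ARMS)
# ... while different batches get different streams.
assert batch0["off"] != batch1["off"]
```

In the real pipeline the prefix changes the logits, so outputs still differ between arms; the placeholder ignores the arm precisely to show that the noise source itself is held fixed.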

The kw_prefix projector trained smoothly to lower final LM loss than the other two prefixes (1.485 → 0.928, vs soft_prompt 1.522 → 1.221 in §8.2). Per-prompt input variation gave it strictly more signal to fit; lower training loss does not imply downstream win.

Engineer-pick, pooled across three archetypes (n=30, matched-pairs):

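| Comparison | A wins | B wins | tie / 3rd-4th | exact-p |
|---|---|---|---|---|
| mixpool vs off | 3 | 13 | 14 | 0.021 |
| kw_prefix vs off | 3 | 14 | 13 | 0.013 |
| soft_prompt vs off | 11 | 14 | 5 | 0.69 |
| mixpool vs kw_prefix | 4 | 3 | 23 | 1.00 |
| kw_prefix vs soft_prompt | 2 | 10 | 18 | 0.039 |
| mixpool vs soft_prompt | 4 | 11 | 15 | 0.12 (NS) |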

Two findings land immediately. Both input-conditioned prefixes lose to OFF at p<0.025 (brain 3-13, kw 3-14). Static soft_prompt ties OFF (11-14, p=0.69): the same tie §8.2 reported on needle retrieval, now reproduced on technical. The harm tracks input-conditioning, not brain content.

Per-axis, brain content is the best-performing input-conditioned prefix, but still hurts.

Head-to-head paired McNemar, mixpool vs kw_prefix:

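| Axis | mixpool wins | kw_prefix wins | tie | exact-p |
|---|---|---|---|---|
| correctness | 14 | 2 | 14 | 0.0042 |
| completeness | 14 | 2 | 14 | 0.0042 |
| hallucination | 10 | 3 | 17 | 0.092 |
| length-appropriate | 13 | 10 | 7 | 0.68 |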

Within the input-conditioned family, brain wins on correctness and completeness at p=0.004. Per-arm mean scores (out of 2) quantify the gap: correctness runs OFF 1.97 > mixpool 1.57 > soft_prompt 1.40 > kw_prefix 1.07; completeness tracks identically. Brain is the least-bad input-conditioned prefix on quality, but "least-bad" still means significantly worse than OFF on every quality axis: kw_prefix vs OFF is 0-18 on correctness (p<0.0001), 0-21 on completeness (p<0.0001), 0-12 on hallucination (p=0.0005). Whatever brain recovers over keyword-extract, it does not recover enough to clear OFF.

Mechanism. Input-conditioned prefixes shorten and sharpen the output (mean response length: OFF 4995 chars > mixpool 3280 > kw_prefix 2691 > soft_prompt 2377). One blind judge called out catastrophic runaway hallucination on a Python requests.Session.get prompt for the kw_prefix arm (hundreds of fabricated private methods in repetitive blocks): the most severe technical-fabrication pattern in the panel, on the arm with lowest training loss. The signature is consistent across both conditioned arms: the prefix primes the model toward decisive task-disposition output, which on technical prompts manifests as confident fabrication rather than hedged correctness. Static soft_prompt has no per-prompt content to prime against, so it produces shorter output without the fabrication surface.

What this revises in §7. The clean-framed claim of §7.1 ("brain ON loses to OFF decisively on technical generation") is preserved directionally, but the attribution is narrower than we wrote. The harm is not a brain-pipeline-specific failure mode. It is an input-conditioned-prefix failure mode that brain happens to exhibit less severely than a keyword-extraction alternative. §7 should be read through this lens: the ship rule (OFF for technical surfaces, §6.5 + §7.5) is unchanged (both input-conditioned prefixes lose to OFF), but the mechanism attribution in §7.3 and the structural discussion in §8.4 should move from "brain primes confident output" to "any per-prompt-varying prefix primes confident output, brain included."

What EXP C1 does NOT change.

  • §6 validated needle-retrieval effect. EXP C1 tests technical generation only. EXP C2 (needle retrieval, 4-arm, Mid 32K + Long 256K) is the correct test of whether kw_prefix also matches brain on the positive result. Pending.
  • §7.5 ship rule. Brain OFF for technical surfaces. Both input-conditioned prefixes lose to OFF at p<0.025; no conditioned prefix we have is shippable for tech.
  • §8.2 soft-prompt finding. Soft_prompt ties OFF on both needle retrieval (§8.2) and technical (here). The "static prefix is inert" reading of §8.2 is strengthened, not weakened.

Caveats. n=30 pooled across three archetypes, smaller than §7's n=80 across eight. The hallucination and conceptual cells are the most diagnostic; coding has fewer discordant pairs by construction. The mechanism claim ("per-prompt content variation primes decisiveness") is a post-hoc interpretation consistent with the length and response-character data; a direct causal test would require a stronger ablation than we have. The Qwen3.5-27B-specific structural discussion in the §8 GatedDeltaNet caveat box applies here: the input-conditioning-hurts-technical result may be architecture-specific to linear attention, and replication on a softmax-attention downstream would strengthen the structural claim.

Reproduction: modal_train/exp_c1_kw_prefix_tech.py (training + 4-arm rebench combined); raw responses in docs/exp_c1_responses.json; judge aggregation in tools/parse_exp_c1_judgments.py and docs/exp_c1_judge_*_results.json.


8.7 The Mid recall benefit is prefix-tuning-generic, not brain-specific

§8.6 showed that the technical harm is an input-conditioned-prefix property rather than a brain-specific property: any input-conditioned 4-token prefix loses to OFF on technical, while static soft_prompt ties OFF. §8.6 intentionally left open the parallel question for recall: if the §6 Mid tier win is real, is it brain-specific, input-conditioning-specific, or prefix-tuning-generic? Phase 2 answers this directly with a 5-arm matched-pairs design at Mid 32K.

Setup. Same Qwen3.5-27B 4-bit downstream as §6 and §7. Same matched-pairs seeding (torch.manual_seed(ctx * 100000 + batch_idx) reset per arm per batch). Five arms:

  • brain. Current ship (potion-256d-v3 + MixPool + PCA-50 + CorticalNet + BrainProjectorV3), identical to §6 deployment.
  • kw_prefix. TF-IDF keyword-conditioned prefix (reused from §8.6, same training recipe).
  • soft_prompt. 4 learned static tokens (reused from §8.2, zero per-prompt variation).
  • raw_encoder (NEW for Phase 2). potion-256d-v3 + MixPool output fed directly to a freshly-trained Linear(512 → 5120) × 4 tokens projector with no BrainCore, no PCA-50, no CorticalNet. Same 300 dispositional triples, same LM-loss recipe. This is the load-bearing new arm: if brain beats raw_encoder on recall, the brain pipeline's BrainCore→PCA→CorticalNet middle chain is load-bearing; if it ties, the middle chain is decorative for the Mid recall signal.
  • off. No prefix.

n=30 at Mid 32K (batch=4 on H200, matched-pairs). Needle prompt distribution is harder than §6's; see "Harder-needle caveat" below.

Mid 32K pass rates (n=30, 5-arm matched-pairs):

| Arm | Pass rate |
|---|---|
| kw_prefix | 30/30 = 100.0% |
| brain | 29/30 = 96.7% |
| soft_prompt | 29/30 = 96.7% |
| raw_encoder | 27/30 = 90.0% |
| off | 20/30 = 66.7% |

McNemar each prefix arm vs OFF (discordant pairs only):

| Comparison | A wins | B wins | exact-p |
|---|---|---|---|
| brain vs off | 9 | 0 | 0.0039 |
| kw_prefix vs off | 10 | 0 | 0.0020 |
| soft_prompt vs off | 9 | 0 | 0.012 |
| raw_encoder vs off | 7 | 0 | 0.065 (borderline) |

All four prefix arms beat OFF. Brain, kw_prefix, and soft_prompt clear p<0.05; raw_encoder is borderline (one prompt short of significance at this n).

Head-to-head among prefix arms (matched-pairs McNemar, discordant only):

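| Comparison | A wins | B wins | exact-p |
|---|---|---|---|
| brain vs kw_prefix | 0 | 1 | 1.0 |
| brain vs soft_prompt | 1 | 1 | 1.0 |
| brain vs raw_encoder | 2 | 0 | 0.625 |
| kw_prefix vs soft_prompt | 1 | 0 | 1.0 |
| kw_prefix vs raw_encoder | 3 | 0 | 0.25 |
| soft_prompt vs raw_encoder | 2 | 0 | 0.625 |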

All head-to-head prefix-vs-prefix comparisons are statistically indistinguishable at n=30 near-ceiling. The four prefix arms tie each other.

The headline finding, three framings. (i) Any 4-token prefix trained on dispositional triples produces the Mid recall lift over OFF under this distribution: brain, kw_prefix, soft_prompt, raw_encoder. (ii) Soft_prompt (static, no input-conditioning, no brain content, 20K total params) ties brain at 29/30. If input-conditioning were the specific driver of the recall lift, soft_prompt should lose to the three input-conditioned arms. It doesn't. (iii) Raw_encoder (potion+MixPool fed to a fresh projector without BrainCore, PCA-50, or CorticalNet) lands at 27/30. It's directionally slightly worse than brain/kw_prefix/soft_prompt but the head-to-head gap is not significant at n=30 (brain vs raw_encoder 2-0, p=0.625).

The mechanism cascade from §8.6 + §8.7 crystallizes into a two-sided asymmetry:

Technical HARM is input-conditioning-specific (C1). On technical content, per-prompt-varying prefixes (brain, kw_prefix) prime confident fabrication and lose to OFF; static soft_prompt doesn't prime per-prompt behavior and ties OFF.

Recall BENEFIT is prefix-tuning-generic (Phase 2). On Mid recall under the harder distribution, any trained 4-token prefix (including the static one that has no per-prompt variation) beats OFF. The unit of analysis is "any 4-token prefix trained on dispositional triples at the post-system position," not "input-conditioned prefix" and not "brain-derived content."

The SAME static soft_prompt is the probe that makes both findings cleanly readable: soft_prompt ties OFF on technical (no harm without input-conditioning) AND ties brain on recall (no penalty without brain content). These are two different mechanisms operating at the same prefix position.

Harder-needle caveat. OFF pass rate at Mid 32K: Phase 2 67% vs §6 85%. An 18pp drop. The Phase 2 prompts are strictly harder than §6's; absolute numbers across distributions are not directly comparable. Two consequences the reader must hold:

  1. We cannot claim "all prefix types would have tied brain at §6's easier distribution too." Under §6's distribution at OFF=85%, there is only 15pp of headroom and the brain arm used 13pp of it (p=0.0078 cell). A soft_prompt arm at that distribution might hit the same 15pp ceiling (tying brain) OR might underperform (making the §6 brain win brain-specific under easier distributions). We have not tested Phase 2's 5 arms at §6's distribution.
  2. The Phase 2 finding holds for the regime it actually tested. Under the harder Phase 2 distribution where OFF drops to 67%, all four prefix types have 33pp of headroom and use 23-33pp of it, indistinguishably. Whatever mechanism drives the benefit clearly does not require brain content in this regime.
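The headroom arithmetic in the two points above can be made explicit. This is a small sketch using the Phase 2 Mid 32K pass counts from the table earlier in this section; `headroom_used` is an illustrative helper name, not part of the project code.

```python
def headroom_used(off_passes: int, arm_passes: int, n: int) -> tuple:
    """Return (headroom_pp, lift_pp): room above OFF and how much an arm used."""
    off_rate = 100.0 * off_passes / n
    arm_rate = 100.0 * arm_passes / n
    return 100.0 - off_rate, arm_rate - off_rate


N = 30
OFF_PASSES = 20  # Phase 2 Mid 32K OFF arm: 20/30
PHASE2_ARMS = [("kw_prefix", 30), ("brain", 29), ("soft_prompt", 29), ("raw_encoder", 27)]

for arm, passes in PHASE2_ARMS:
    headroom, lift = headroom_used(OFF_PASSES, passes, N)
    print(f"{arm:12s} headroom {headroom:.1f}pp, lift {lift:.1f}pp")
```

Because the available headroom depends on the OFF pass rate, a lift measured under one needle distribution is not directly comparable to a lift measured under another, which is the reason the two distributions are kept separate here.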

What Phase 2 does and does NOT change.

  • §6.1 tier-table numbers stand as-published under the §6 original prompt distribution (n=60 at 16K, n=60 at 32K, n=25 at 256K). Phase 2 did not retest §6's distribution and does not disprove those numbers. It tests a different distribution.
  • §6.3 public-talk claim changes. Narrowed from "brain-derived 4-token prefix" to "4-token prefix tuning on dispositional triples" for the Mid 32K lift under any distribution. See §6.3 above.
  • §7.5 / §8.5 ship rule changes slightly. Brain is still a working prefix for recall, but static soft_prompt is now an equally-valid ship for the recall surface (minimum viable implementation) and strictly simpler (no BrainCore CC-BY-NC constraint, no brain-stack inference cost).

Caveats. n=30 at Mid 32K is smaller than §6's n=60. The head-to-head prefix-vs-prefix comparisons are near-ceiling (all arms at 90-100%) so larger n is needed before we can rule out a 3-5pp brain-vs-soft_prompt gap. A full Phase 2 replication at n=60+ would sharpen the tie. The finding "all four tie at n=30 near ceiling under harder distribution" is the strongest claim this data supports.

Reproduction: modal_train/exp_phase2_needle_5arm.py (training + combined eval, overnight), modal_train/exp_phase2_eval_only.py (eval-only re-fire with O(N²) prompt-build fix); raw responses in docs/exp_phase2_responses.partial.json and docs/exp_phase2_eval_only_responses.partial.json.

8.8 Long tier disposition: §6 Long stands, replication pending

Phase 2 planned a full 5-arm matched-pairs re-run at Long 256K (n=30) in addition to Mid. The Long run was killed at n=2, not because "this is all we got," but for three compounding resource reasons:

  1. O(N²) prompt-build bottleneck at 256K. The initial overnight run spent ~90s per prompt tokenizing the 256K-token haystack (pre-fix). The eval-only re-fire introduced a constant-time token-ID replication fix (see exp_phase2_eval_only.py:build_needle_prompt), but the fix landed mid-run.
  2. Forward-pass cost at batch=1, 256K context. ~200s per prompt per arm × 5 arms ≈ 17 min per prompt; at 30 planned prompts that is ≈ 8.5 hr of H200 compute.
  3. Compute constraints. Mid 32K consumed the allocated compute budget. Long at n=30 would have required an additional 8+ hours of H200 time. We prioritized completing Mid and publishing those findings.
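The prompt-build fix named in item 1 is not reproduced in this section, so the following is only a guess at its general shape: tokenize a filler passage once and replicate its token IDs up to the target length, instead of re-tokenizing a growing string. The toy tokenizer, the example needle text, and the name `build_needle_prompt_ids` are hypothetical stand-ins, not the code in exp_phase2_eval_only.py.

```python
def toy_tokenize(text: str) -> list:
    """Stand-in tokenizer: maps each whitespace-separated word to an integer id."""
    return [sum(map(ord, word)) % 50_000 for word in text.split()]


def build_needle_prompt_ids(filler_ids, needle_ids, target_len, depth):
    """Build exactly target_len ids with the needle at a relative depth.

    filler_ids is tokenized once by the caller; the haystack is produced by
    repeating that id list, so no text is re-tokenized as the prompt grows.
    """
    if not filler_ids:
        raise ValueError("filler_ids must be non-empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    hay_len = target_len - len(needle_ids)
    if hay_len < 0:
        raise ValueError("target_len is shorter than the needle")
    repeats = hay_len // len(filler_ids) + 1
    haystack = (filler_ids * repeats)[:hay_len]
    cut = int(hay_len * depth)  # insertion point for the needle
    return haystack[:cut] + needle_ids + haystack[cut:]


filler = toy_tokenize("the quick brown fox jumps over the lazy dog")
needle = toy_tokenize("the secret code is 4417")  # made-up needle text
prompt = build_needle_prompt_ids(filler, needle, target_len=32_000, depth=0.5)
print(len(prompt))
```

In this sketch the tokenizer runs once per passage and the remaining work is list replication, so the build cost grows linearly with the target length rather than quadratically.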

Phase 2 Long at n=2 is uninterpretable: one discordant pair per 2-prompt sample is noise; no McNemar is possible. What this means for the §6.1 Long tier claim:

  • Is the §6.1 Long claim (brain 16/25 vs OFF 8/25, p=0.022 paired, original prompt distribution) retracted? No. Phase 2 did not disprove it. Phase 2 tested a different distribution and did not complete.
  • Is the §6.1 Long claim "validated" post-Phase-2? Not under the harder-needle distribution. The §6.1 data was collected under the original §6 prompt distribution. Phase 2's harder distribution was designed to stress-test the Mid + Long tiers jointly; Mid completed and showed the prefix-generic pattern; Long did not complete. So the §6.1 Long result is "validated under §6's distribution, pending replication under the harder Phase 2 distribution at n≥25."
  • What would complete the Long replication? A full Phase 2-distribution 5-arm matched-pairs run at 256K, n≥25, with the O(N²) prompt-build fix. This is the second-highest-priority experiment remaining on the roadmap, after EXP D (late injection / post-user position).

How to read the Long tier in the §6.1 table. The brain 64% vs OFF 32% result at 256K, p=0.022 paired, n=25 is real, matched-pairs, and tight. It stands as-published. It describes the original §6 needle distribution. Whether that same gap holds under a harder needle distribution, and whether the 4-arm Phase 2 pattern (brain ≈ kw_prefix ≈ soft_prompt ≈ raw_encoder > OFF) also holds at 256K, is an open question pending a full Long re-run.

Reproduction: same scripts as §8.7; docs/exp_phase2_eval_only_responses.partial.json has the n=2 Long data as a documentation artifact.


9. Lessons, open questions, and what would change our minds

9.1 Lessons we hold across the whole arc

  • Matched-pairs > marginal at any n we can afford. Our four early-result reversals (three retracted wins and one inverted null) were all caught by a methodology change that costs nothing at runtime. If you are running A/B evals on prefix-tuning systems with n<100 per cell, the cheapest improvement to your work is seeded matched-pairs sampling.
  • Prefix-tuning ablations are mandatory. Position-pad, vanilla soft-prompt, random-init. Without all three, you cannot isolate "the prefix content matters" from "any 4 vectors at this position help."
  • The encoder is the bottleneck in any prefix-tuning pipeline that consumes one encoded vector per input. Check it before attributing wins to the rest of the stack. We spent a week believing the brain pipeline was broken when the actual issue was MiniLM truncating at 256 tokens.
  • Negative results are reusable. The technical NO-GO held across two encoders, two sampling regimes, and two response-length budgets. Each test that reproduced it sharpened the claim and ruled out a confound. Negative results compound where positive results often do not.
  • Be specific about your downstream consumer. Encoder benchmarks (MTEB, BEIR) measure encoder-as-retriever; our pipeline uses encoder-as-signal-compressor. Cross-use of those benchmarks would have predicted the wrong winner. Always know what the encoder is doing inside your stack.

9.2 Open questions we have not answered

These are the experiments we know how to design but have not run.

  • Multi-turn / streaming brain state. The current pipeline recomputes brain on every user message in isolation; by turn 5 the brain state still reflects turn-1's input. A per-turn refresh, or an EMA-decayed brain state, would convert the in-context recall amplifier into something closer to a working-memory system. MixPool's microsecond CPU encode makes per-turn refresh essentially free in latency. This is the highest-leverage open question for product expansion.
  • 20-parcel injection. The current projector takes a 50-dim PCA of CorticalNet output and maps it to a single embedding-space token sequence. Replacing this with 20 per-parcel projectors → 20 brain tokens would test whether the PCA collapse throws away parcel-specific routing information that could fix the §8.1 task-axis-resolution problem. Cheap to run; would invalidate or strengthen §8.5's per-parcel injection direction (#3).
  • Length-stratified adapter distillation. The current MixPool adapter is distilled on 300 short dispositional triples. The mean-pool half adds small but real Mid-tier signal that disappears at Long, plausibly because the training distribution does not exercise long-context behavior. Distilling on a length-stratified corpus matching deployment context might rescue mean-pool at Long.
  • Brain pipeline against a sentence-transformer + linear head baseline. On the dispositional / tone / category-attribution tasks where brain shows interpretability value, does it actually beat trivial baselines by enough to justify the pipeline complexity? If not, the commercial-clean Schaefer-400 retrain is not worth doing; pivot to research/interpretability brand only.
  • Brain on real-world long-doc QA (not synthetic needle). Synthetic needle-in-haystack is convenient and reproducible; it is not what users actually do. Validating against a public long-document benchmark with the same matched-pairs methodology is the test that converts our narrow stack-specific claim into something more like a product claim.
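The EMA-decayed brain state suggested in the first bullet has not been built. As a sketch of what such an update could look like, the following keeps a running vector and blends in each turn's fresh state. The decay value, the 4-dim toy vectors, and the function name are all assumptions for illustration.

```python
def ema_update(state, new, decay=0.7):
    """Blend a fresh per-turn vector into the running state.

    decay is the weight kept on the old state; (1 - decay) goes to the new turn.
    """
    if state is None:
        return list(new)  # first turn: nothing to decay yet
    if len(state) != len(new):
        raise ValueError("state and new vector must have the same length")
    return [decay * s + (1.0 - decay) * x for s, x in zip(state, new)]


state = None
turns = [
    [1.0, 0.0, 0.0, 0.0],  # toy per-turn "brain" vectors
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
for turn_vector in turns:
    state = ema_update(state, turn_vector)
print(state)
```

With a decay below 1 the first turn's contribution shrinks geometrically, so later turns are never drowned out by the opening message; choosing the decay is itself an open design question.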

9.3 What would change our minds

  • Brain wins on a real-world long-doc benchmark with paired methodology at n>100, AND a soft_prompt baseline fails to tie it there → the Phase 2 prefix-tuning-generic finding would narrow to "artifact of synthetic needle prompts; real-world recall IS brain-specific." We would broaden the public framing to a brain-specific recall claim accordingly. If both brain AND soft_prompt win at real-world recall, the Phase 2 narrowing holds and the product framing collapses to "4-token prefix tuning on dispositional triples helps real-world recall; brain is one implementation."
  • Brain helps a non-trivial reasoning task (not MMLU-Pro flat, not TruthfulQA tied) → the "brain as recall amplifier, not reasoning enhancer" framing of Section 6.4 weakens. We would reframe.
  • Late injection (§8.5 #1) decouples recall benefit from technical harm → the structural failure is in the input-position prefix-tuning interaction with Qwen3.5-27B's linear attention. Late injection fixes it. If late-inject brain > late-inject soft_prompt on recall surfaces at late position, we'd also recover the brain-specific attribution we lost at Mid in Phase 2. Both questions fire together in EXP D.
  • Multi-turn brain > multi-turn soft_prompt on recall → the Phase 2 finding is single-turn-specific. Brain content's per-turn evolution differentiates it from static soft_prompt in a chat context. This would restore a brain-specific product story that Phase 2 substantially weakened for single-turn.
  • EXP C1's input-conditioned-prefix-hurts-technical pattern reproduces on a non-Qwen downstream (e.g., Llama-3, Mistral-Large with softmax attention) → the harm is architectural and generic, not Qwen3.5-GatedDeltaNet-specific. The structural claim in §8 generalizes. If it does not reproduce, the harm is Qwen-specific and we revise §8 to flag it as such, and brain becomes potentially safer on softmax-attention downstreams.
  • Per-prompt gating (§8.5 #2, soft_prompt version) ships and validates → input-conditioned prefix harm is a property we can detect and route around at deploy time. The "ON for chat / OFF for code" rule becomes mechanism-justified rather than empirical-only. Note that post-Phase-2 the gating product can run on soft_prompt without the brain pipeline at all; validation of gating is not validation of brain.

What changed our minds:

  • EXP A surfaced that BrainCore's 20,484-dim output IS class-discriminative under supervised LDA (76% technical recall, 52% logical_fallacy), even though the deployed PCA-50 projection discards that signal (24%/4%). The original §8.1 framing "BrainCore is code-blind" was a small-test-set artifact; the true picture is "PCA-50 basis is variance-optimal, not class-discriminative."
  • EXP E tested whether replacing PCA-50 with LDA-10 in the deployed pipeline fixes the regression. It does not: LDA arm loses 1-25 to OFF on engineer-pick. The "better upstream features fix the regression" hypothesis is falsified.
  • EXP C1 tested whether the technical regression is brain-content-specific or generic to any input-conditioned prefix. It is generic: TF-IDF keyword-conditioned prefix also loses to OFF (3-14, p=0.013); static soft_prompt ties OFF (11-14, p=0.69). Brain content is the best-performing input-conditioned prefix on per-axis quality (mixpool 14 vs kw_prefix 2 on correctness, p=0.004) but no input-conditioned prefix beats OFF on technical.
  • Phase 2 produced the largest revision. We tested at Mid 32K whether the recall benefit is brain-specific, input-conditioning-specific, or prefix-tuning-generic. It is prefix-tuning-generic under the harder-needle distribution (OFF=67%). Brain (97%), kw_prefix (100%), soft_prompt (97%), and raw_encoder (90%) all tied each other head-to-head (all p=0.25-1.0, NS), and all beat OFF. Even the static soft_prompt (no input-conditioning, no brain content) ties brain. The §6 Mid-tier win is real vs OFF but the brain-specific attribution collapses: the Mid recall lift is driven by "any trained 4-token prefix at the post-system position," not "brain-derived content."
  • Synthesis. The mechanism is asymmetric: technical harm is input-conditioning-specific (brain, kw_prefix lose; soft_prompt ties OFF); recall benefit is prefix-tuning-generic (brain, kw_prefix, soft_prompt, raw_encoder all tie each other and beat OFF). The same static soft_prompt serves as the inert probe on both sides. No downstream-only patch fixes the technical harm (ruled out by A/E/C1); no brain-specific patch is needed for the recall benefit (ruled out by Phase 2). Late injection (§8.5 #1, testable via EXP D) is the only remaining scientific direction. The brain-specific claim has narrowed to the §6.1 Long tier data (which stands under the original prompt distribution and is pending harder-needle replication per §8.8).
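
The win-loss splits quoted in the bullets above can be checked by hand. The following is a minimal sketch of an exact two-sided McNemar test on discordant pairs, assuming each split (for example 3-14) is the count of prompts where exactly one arm won. The function name `mcnemar_exact_p` is hypothetical and is not taken from the authors' harness.

```python
from math import comb


def mcnemar_exact_p(wins_a: int, wins_b: int) -> float:
    """Exact two-sided binomial p-value on discordant pairs (McNemar exact test)."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(wins_a, wins_b)
    # P(X <= k) under Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    splits = [
        ("kw_prefix vs OFF (EXP C1)", 3, 14),
        ("soft_prompt vs OFF (EXP C1)", 11, 14),
        ("LDA vs OFF (EXP E)", 1, 25),
    ]
    for label, a, b in splits:
        print(f"{label}: {a}-{b}, p = {mcnemar_exact_p(a, b):.3g}")
```

Under that reading, the 3-14 split rounds to p=0.013 and the 11-14 split to p=0.69, matching the values reported for EXP C1.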

9.4 One forward-looking note

The matched-pairs methodology that caught our three retractions and one inverted null is reusable across any prefix-tuning evaluation. A registry of peer-reviewed negative results in prefix-tuning, gated on methodology quality (matched-pairs or equivalent), could turn isolated nulls into compounding scientific value rather than independently re-discovered failure modes.


10. Code, data, and the methodology bundle

Every quantitative claim in this paper traces to a reproducible script and raw data dump. Artifacts listed below are maintained internally; those marked "available on request" can be shared for reproduction purposes.

10.1 Released or releasable artifacts

| Artifact | Status | License | What it covers |
| --- | --- | --- | --- |
| Matched-pairs harness (matched-pairs-eval) | In development | Apache 2.0 (planned) | Pip-installable library implementing per-batch seeded sampling + McNemar exact 2-sided binomial reporting. Sections 3, 6, 7, 8 all use this pattern. |
| MixPool encoder + adapter | Releasable; LongEmbed/BEIR validation gate per §5.6 | Apache 2.0 | mixpool_text_adapter.pt (512→384, ~3MB default ship) and max_only_text_adapter.pt (256→384, ~2MB minimal variant). HuggingFace model card pending validation. |
| Eval scripts (Modal H200/A100/RTX-PRO-6000) | Available on request | Source-available | needle_mixpool_parity.py + needle_mixpool_parity_256k.py (Section 6 needle eval), needle_pool_ablation_*.py (pool ablation), tech_rebench_mixpool.py (Section 7 negative result), gating_tests_2_3.py (Section 8 soft-prompt + position-pad), exp_e_lda_rebench.py (Section 8.1 LDA swap follow-up), exp_c1_kw_prefix_tech.py (Section 8.6 keyword-prefix follow-up), exp_phase2_needle_5arm.py + exp_phase2_eval_only.py (Section 8.7 5-arm recall rebench, Mid 32K complete + partial Long 256K). |
| Local probe scripts (CPU) | Available on request | Source-available | linear_probe_brain.py (Section 8.1 linear probe: local CPU, sklearn LR/MLP on 50-dim PCA features, ~5sec runtime). exp_a_probe_sweep.py (Section 8.1 follow-up: PCA-50/100/200 + LDA-10/11 sweep, ~10 min CPU). |
| Mech-check scripts | Available on request | Source-available | needle_mech_check_v2.py + topic_discrim_check.py (Section 5.4 max-pool needle preservation curves and topic-discrim cross-matrix). |
| Judge prompt builders + parsers (CPU) | Available on request | Source-available | build_exp_e_judge_prompts.py, parse_exp_e_judgments.py, build_exp_c1_judge_prompts.py, parse_exp_c1_judgments.py (Sections 8.1 LDA + 8.6 EXP C1: anonymized A/B/C/D blinding, automated judge dispatch, McNemar pairwise computation). |
| Raw response JSONs | Available on request | CC-BY 4.0 | One JSON per experiment cell with prompt / arm / response / pass-fail. Cited inline at each result. Includes exp_e_responses.json (3-arm LDA rebench), exp_c1_responses.json (4-arm kw_prefix rebench), exp_a_probe_sweep_results.json (LDA recovery sweep), exp_phase2_responses.partial.json (5-arm needle), exp_phase2_eval_only_responses.partial.json (eval-only re-fire, Mid 32K complete n=30 + n=2 Long 256K). |
| Judge result aggregations | Available on request | CC-BY 4.0 | exp_e_judge_*_results.json and exp_c1_judge_*_results.json per archetype, plus de-blinding mappings. Phase 2 is needle pass/fail (no LLM-judge blinding needed); McNemar computations live directly in the parse scripts. |
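
As an illustration of how the raw response JSONs feed the paired statistics, here is a minimal sketch that tallies discordant pairs from per-cell records. The key names (`prompt`, `arm`, `response`, `passed`) and the function `discordant_counts` are assumptions based on the "prompt / arm / response / pass-fail" description in the table; the actual file schema may differ. It also assumes one record per (prompt, arm) pair.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def discordant_counts(records: Iterable[dict], arm_a: str, arm_b: str) -> Tuple[int, int]:
    """Return (prompts only arm_a passed, prompts only arm_b passed)."""
    by_prompt: Dict[str, Dict[str, bool]] = defaultdict(dict)
    for rec in records:
        # Assumed keys: "prompt", "arm", "passed" (hypothetical schema).
        by_prompt[rec["prompt"]][rec["arm"]] = bool(rec["passed"])

    a_only = b_only = 0
    for results in by_prompt.values():
        if arm_a not in results or arm_b not in results:
            continue  # a pair needs both arms; skip incomplete prompts
        if results[arm_a] and not results[arm_b]:
            a_only += 1
        elif results[arm_b] and not results[arm_a]:
            b_only += 1
    return a_only, b_only


if __name__ == "__main__":
    # Toy records for illustration only, not data from the experiments.
    records = [
        {"prompt": "q1", "arm": "off", "response": "...", "passed": False},
        {"prompt": "q1", "arm": "brain", "response": "...", "passed": True},
        {"prompt": "q2", "arm": "off", "response": "...", "passed": True},
        {"prompt": "q2", "arm": "brain", "response": "...", "passed": True},
        {"prompt": "q3", "arm": "off", "response": "...", "passed": True},
        {"prompt": "q3", "arm": "brain", "response": "...", "passed": False},
        {"prompt": "q4", "arm": "off", "response": "...", "passed": False},
        {"prompt": "q4", "arm": "brain", "response": "...", "passed": True},
    ]
    print(discordant_counts(records, "brain", "off"))  # (2, 1)
```

Prompts where both arms pass or both fail (q2 in the toy data) drop out of the count, which is why the paired p-values depend only on the discordant split and not on the marginal pass rates.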

10.2 Artifacts that stay closed

| Artifact | Why closed | Notes |
| --- | --- | --- |
| BrainCore weights (177M-param TribeV2 extract) | CC-BY-NC inheritance from Meta TribeV2 | Research-tier only. We describe the integration; we do not redistribute weights. Cite Meta TribeV2 to obtain under the original license. |
| CorticalNet topology file | Derived from TribeV2 SVD; legally gray for commercial release | Architecture (3.3M-param GNN, 20 parcels) is original and Apache-compatible; the topology (which vertices route to which parcel) is gray. Schaefer-400 atlas swap planned to produce a commercial-clean variant. |
| Brain projector checkpoints trained on dispositional triples | Inherits CC-BY-NC from BrainCore upstream | brain_projector_Qwen_Qwen3.5-27B_e5_end.pt and brain_projector_qwen3_14b.pt are research-tier. |

10.3 Reproduction notes

  • Downstream LLM: Qwen3.5-27B (HuggingFace Qwen/Qwen3.5-27B) loaded in 4-bit nf4 via bitsandbytes. Embeddings stay fp16 (the inputs_embeds injection path requires an unquantized embedding layer). The 14B variant uses Qwen3-14B with the same quantization recipe.
  • Decoding: All needle results use do_sample=True, T=0.6, top_p=0.95, batch=8 on H200. Tech rebench uses the same decoding with max_new=4096.
  • Position rule: [sys_ids][brain_4][user_ids]. Prepending at position 0 breaks the Qwen3 thinking-mode trigger, so the post-system placement is non-optional.
  • Random seeds: matched-pairs scripts call torch.manual_seed((ctx * 100000) + batch_idx) (and torch.cuda.manual_seed_all(...)) before each arm of each batch, so every arm starts from the same RNG state. Rebench scripts use torch.manual_seed(7777 + batch_idx).
  • Statistical analysis: McNemar exact 2-sided binomial on discordant pairs (the count where one arm strictly wins on the prompt). Marginal pass-rate deltas are reported alongside but are not the basis for any p-value in this post.
  • Hardware: All training and eval on Modal H200 (single-GPU). Linear probe (Section 8.1) is local CPU.
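
The seeding rule in the notes above can be sketched without a GPU. This is a minimal illustration, not the authors' harness: Python's stdlib `random` stands in for the torch RNG (the real scripts call `torch.manual_seed` and `torch.cuda.manual_seed_all`), and `run_batch`, `toy_generate`, the arm names, and the `ctx=32768` value are hypothetical.

```python
import random
from typing import Callable, Dict, List


def matched_seed(ctx: int, batch_idx: int) -> int:
    """Seed formula quoted in the reproduction notes: (ctx * 100000) + batch_idx."""
    return ctx * 100000 + batch_idx


def run_batch(
    arms: Dict[str, Callable[[str], str]],
    prompts: List[str],
    ctx: int,
    batch_idx: int,
) -> Dict[str, List[str]]:
    """Run every arm on the same prompts, resetting the RNG before each arm."""
    outputs: Dict[str, List[str]] = {}
    for name, generate in arms.items():
        # Stand-in for torch.manual_seed(seed) and torch.cuda.manual_seed_all(seed).
        random.seed(matched_seed(ctx, batch_idx))
        outputs[name] = [generate(p) for p in prompts]
    return outputs


def toy_generate(prompt: str) -> str:
    """Hypothetical sampler whose output depends on the current RNG state."""
    return prompt + ":" + random.choice(["a", "b", "c", "d"])


if __name__ == "__main__":
    arms = {"off": toy_generate, "prefix": toy_generate}
    out = run_batch(arms, ["p0", "p1", "p2"], ctx=32768, batch_idx=0)
    # Two identical arms draw identical samples, so they never disagree.
    print(out["off"] == out["prefix"])  # True
```

Because both arms start each batch from the same RNG state, any disagreement between arms on a prompt comes from the arm itself rather than from sampling noise, which is the property the discordant-pair analysis relies on.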

10.4 Compute

All training and evaluation used cloud GPU instances (H200, A100, RTX-PRO-6000) via Modal, plus local CPU for linear probe analysis. The full experimental program, including failed runs and follow-up ablations, consumed under $200 of compute.