Boundary-Conditioned Structural Alignment for Cross-Session KV Cache Injection
Cross-session KV cache injection achieves 2.6-3.4x TTFT reduction versus strong text-RAG but loses 5-10 points of retrieval quality. We decompose the gap on Qwen 2.5 7B and find the dominant driver is structural, not attention-sink loss or RoPE position mismatch. Boundary-conditioned structural alignment — archiving KV states with full chat-template wrapping and splicing at turn boundaries — recovers +5.6pt (p<0.01, n=467, merged pool). A minimal 3-token E-wrapper captures 97.7% of the effect at 22% token overhead. Cross-family formal significance on Phi-3-medium-128k-instruct (p=0.013, n=200) confirms the mechanism generalizes.
- kv-cache
- inference
- long-context
- mechanism
- retrieval
- methodology
Abstract
Cross-session KV cache injection archives per-layer key/value tensors from past conversations and splices them into future sessions, achieving 2.6–3.4×\times× TTFT reduction versus strong text-RAG. Prior work identifies three failure modes—attention sink loss, cross-attention contamination, and RoPE position mismatch—but no training-free intervention has closed the quality gap. We introduce boundary-conditioned structural alignment, a training-free protocol that archives KV states with full chat-template wrapping and splices them at turn boundaries. A 2×\times×2 factorial decomposition on Qwen2.5-7B-Instruct (nnn=80) suggests an interaction between clean boundaries and structural markers that recovers +5.0pt r@1, though a replication at nnn=200 indicates this interaction is sensitive to retrieval selection. A five-arm wrapper ablation refines the mechanism to marker-token-class binding—either opening or closing chat-template bracket token suffices as a turn-boundary landmark. A minimal E-wrapper (3 tokens) captures 97.7% of the full-template effect at 22% token overhead, validated cross-family on Phi-3-medium (100% retention).
Cross-family formal significance on Phi-3-medium-128k-instruct (ppp=0.013, nnn=200) and merged Qwen2.5-7B-Instruct (ppp<0.01, nnn=467) confirms the mechanism generalizes across template families. Evaluation on the LoCoMo conversational memory benchmark (nnn=200) shows +5.3pt F1 improvement, validating the effect on a standard external benchmark. Text-wrapped and system-prompt-drift controls confirm the benefit is structural and deployment-robust.
1. Introduction
Large language model deployments face a persistent latency–quality trade-off. For retrieval-augmented generation (RAG), time-to-first-token (TTFT) is dominated by context prefill—the linear-time forward pass over retrieved documents. For conversational memory systems, this cost compounds: each new session must re-process archived conversation history, even when that history has not changed.
KV cache injection offers an escape hatch. Instead of re-tokenizing and re-prefilling archived text, we store the per-layer key (K) and value (V) tensors computed during the original session and splice them directly into the new session's cache. Because K/V tensors are position-independent in the attention computation, naive concatenation is theoretically sufficient—and has been reported as viable at scale ≥\ge≥7B (Xiao et al., 2023; Jeong, 2026), though with observed quality degradation.
The TTFT gains are substantial: injection reduces TTFT versus strong text-RAG by 2.6–3.4×\times× on identical hardware. The cost is quality. Prior work reports that raw KV injection degrades retrieval accuracy by 5–10 points versus strong text-RAG baselines (CacheBlend, 2024; Jeong, 2026). Understanding why—and whether the gap can be closed without model training—is the central question of this work.
Three candidate failure modes
The KV injection literature proposes three mechanistic hypotheses for quality degradation:
-
Attention sink loss (Xiao et al., 2023). Decoder-only transformers sink 30–50% of deep-layer attention mass onto the first few sequence positions. Cross-session injection strips these positions, disrupting the attention topology that the model learned during pre-training. StreamingLLM mitigates this by globally prepending initial tokens, but its effectiveness for injected (not just appended) content is untested.
-
Cross-attention contamination (CacheBlend, 2024). Archived KV tensors were computed attending to session-A's preceding context. When injected into session-B with different preceding text, the archive's attention patterns are contextually mismatched—its K/V states encode "attend to X" where X is no longer present.
-
RoPE position mismatch (Su et al., 2021; Jeong, 2026). Rotary Position Embedding (RoPE) rotates K/Q tensors by position-dependent angles. Archiving at session-A positions and injecting at session-B positions creates angular misalignment in the QK dot product, potentially scrambling attention scores.
All three are plausible. None has been decomposed against the others in a controlled, training-free setting.
Our contribution
We decompose the 7.5pt quality gap on Qwen2.5-7B-Instruct using a six-arm factorial design (nnn=80) plus cross-family validation on Phi-3-medium-128k-instruct (nnn=200) and a first directional test on DeepSeek-V2-Lite MLA (nnn=200). Our findings:
- Alternative failure modes do not explain the gap. Attention sink restoration (+0.0pt), RoPE un-rotation (cosine ≈\approx≈1.0), and cross-attention isolation (~2.5pt, not significant at nnn=80) each fail to close the quality deficit. The dominant driver is structural.
- Boundary-conditioned structural alignment recovers +5.0–5.6pt. A 2×\times×2 factorial (nnn=80) suggests an interaction between clean token boundaries and chat-template role markers. A replication at nnn=200 on different retrieval selections did not reproduce the interaction, indicating it is selection-sensitive; the per-chunk injection effect itself (Δ\DeltaΔ=+5.6pt, ppp<0.01 at nnn=467) is robust.
- Marker-token-class binding is the mechanism. A five-arm ablation shows either bracket token (
<|im_start|>or<|im_end|>) suffices as a landmark. A minimal E-wrapper (3 tokens) retains 97.7% of the effect at 22% token overhead, validated cross-family on Phi-3-medium-128k-instruct (100% retention). Instruction tuning amplifies the routing ≈\approx≈2.5×\times× rather than creating it. - Cross-family formal significance and standard benchmark validation. Phi-3-medium-128k-instruct (ppp=0.013, nnn=200) and Qwen2.5-7B-Instruct (ppp<0.01, nnn=467 oracle-filtered) both reach α\alphaα=0.05, with directional validation on DeepSeek-V2-Lite MLA (Δ\DeltaΔ=+3.0pt, nnn=200). On the LoCoMo conversational memory benchmark (Maharana et al., 2024), E-wrapper injection improves answer F1 by +5.3pt (nnn=200).
- Robustness validated across deployment axes. The effect is structural not lexical (text-wrapped control Δ\DeltaΔ=+0.005), survives system-prompt drift (zero degradation), and is compatible with per-channel/per-token INT8 KV quantization (ten-arm matrix, 0pt loss) but not per-tensor INT8 (−5-5−5pt granularity floor).
- Upper-layer dominance at prefill, distributed necessity at inference. Layer ablation localizes 55% of the effect to upper layers (14–27); activation patching shows full-stack residual transfer is required for causal reconstruction. The mechanism is distributed, not band-localized.
Methodological contribution: We discover that strong-RAG retrieval pipelines using BGE-Reranker-class cross-encoders exhibit non-deterministic top-3 selection (0/80 matching across fresh runs), making un-pinned cross-run comparisons uninterpretable. We introduce the Retrieval Determinism Protocol—pin selections once per run, reuse across inject arms—as a requirement for valid KV-reuse benchmarking.
Scope boundaries
This work is scoped to pure-attention decoder-only transformers with standard or YARN-scaled RoPE, and to MLA architectures at small-scale directional validation. Hybrid linear-attention architectures (e.g., Qwen3.6-35B-A3B, DeepSeek-V3) are explicitly out of scope: we establish catastrophic failure (r@1=10%) on partial KV injection, consistent with CacheBlend (2024). Sliding-window architectures (Gemma 3/4) are flagged as requiring an adapter path (global-layer-only inject), future work.
2. Related Work
2.1 Cross-Session KV Cache Reuse Systems
The most closely related published work is CacheBlend (Yao et al., 2024; arXiv:2405.16444), which addresses quality loss when precomputed KV caches are reused outside their original prefix context. CacheBlend identifies the core failure mode—"precomputed KV caches [are] not directly usable since they ignore the text's cross-attention with the preceding texts"—and proposes selective recomputation of a small token subset to restore cross-attention fidelity, achieving 2.2–3.3×\times× TTFT reduction with quality preserved across four benchmarks. Our work is complementary but distinct in two key dimensions: we operate at the cross-session episodic level (temporally separated prefills with different system context) rather than the within-session document-blending level CacheBlend targets, and we identify the most salient degradation pathway as boundary-conditioned structural misalignment rather than cross-attention contamination alone. Where CacheBlend offers a selective-recompute engineering fix, we contribute a mechanistic taxonomy and a training-free structural alternative.
Concurrent work: Jeong (2026; arXiv:2603.22329) independently investigates persistent cross-session memory in decoder-only LMs, testing six adapter methods—prefix, parallel cross-attention, KV extension, Hebbian memory, context-gated branch, and slot-based sparse write—on a frozen GPT-2 backbone evaluated on LoCoMo. Their finding that "architectural priors matter" at low capacity (cross-attention, Hebbian, and slot-write achieve 7–18% retained-memory scores at 1×\times× capacity; the other three fail at <<<0.4%) independently converges with our structural-topology-fidelity result. The key differentiation is paradigmatic: Jeong requires training small adapters on a tiny frozen backbone (GPT-2); our approach is training-free structural alignment at archive time on production-scale instruction-tuned models (Qwen 2.5 7B, Phi-3 14B, DeepSeek V2-Lite). We additionally evaluate against a stronger baseline (modern embedding + BM25 + reranker pipeline), contribute cross-family and cross-architecture validation, and introduce a retrieval-determinism protocol absent from prior work.
Independent mechanism validation: LoopGuard (2026; arXiv:2604.10044) demonstrates that KV cache reuse can induce pathological attention patterns—specifically, "collapsed attention" where a subset of heads locks onto a narrow suffix, stabilized by inference-time KV cache reuse. While LoopGuard addresses repetition loops in single-session long-context generation rather than cross-session episodic injection, it independently confirms that KV reuse creates attention pathologies, corroborating our degradation-mechanism framing.
2.2 Attention Mechanisms and KV Cache Management
Attention sinks: Xiao et al. (2023; arXiv:2309.17453, ICLR 2024) demonstrated that initial tokens in autoregressive LMs attract disproportionate attention mass regardless of semantic content, because SoftMax normalization requires a denominator anchor—four initial tokens empirically suffice for full perplexity recovery on Llama-2-7B. We cite StreamingLLM as a foundational mechanism reference, but our diagnostic experiments find no evidence that classical attention sink loss is the primary degradation pathway: global prepending of BOS + first four tokens (the StreamingLLM fix) produced zero quality improvement (+0.0pt), and un-RoPE'd KV cosine similarity ≈\approx≈ 1.0 argues against RoPE position mismatch as the dominant driver. Attention sinks are real but are a symptom of prefix absence in our setting, not an independent causal mechanism.
Layer-wise attention heterogeneity: PyramidKV (2024; arXiv:2406.02069) establishes a "pyramidal information funneling" pattern: lower transformer layers scatter attention broadly, while upper layers concentrate mass on structurally significant tokens. This directly informs our mechanism: structural topology fidelity (what chat-template wrapping restores) matters most in upper layers where induction-head routing is already concentrated. Our layer ablation (Section 5.3, Experiment 4) confirms this: 55% of the E-wrapper effect is attributable to upper layers (14–27), with zero measurable contribution from lower layers (0–13).
2.3 Retrieval-Augmented Generation and Chunking
Token-boundary effects: FreeChunker (2025; arXiv:2510.20356) establishes that sentence-level atomic chunking improves RAG retrieval quality over fixed-granularity approaches that produce arbitrary token boundaries. Our mechanism subsumes this finding: per-chunk chat-template wrapping enforces complete-sentence boundaries (role end markers fall at turn boundaries), producing the clean-boundary condition our 2×\times×2 factorial decomposition identifies as necessary—though not sufficient—for quality recovery.
Format sensitivity: Lu et al. (2021; arXiv:2104.08786, ACL 2021) showed that prompt permutation causes near-random to near-SOTA performance variance (up to 13% improvement from ordering alone) in few-shot prompting. Our per-chunk finding is a domain-specific instantiation of this general result: the "right" structural format for archived KV is precisely the format the model was trained on, and deviations cause systematic quality degradation.
2.4 Theoretical Basis: Marker-Token-Class Binding and Induction-Head Routing
Olsson et al. (2022; arXiv:2209.11895) demonstrated that transformer attention heads implement [A][B]...[A]→\rightarrow→[B] pattern completion—"induction heads"—hypothesized as the mechanistic basis for in-context learning. Chat-template role markers (<|im_start|>user, <|user|>) follow exactly this structure, establishing fixed prefix sequences the model learned to route from during training.
Our contribution extends this framing in two ways. First, we show that induction-head activation is conditional: the role-marker trigger ([A]) requires a well-formed content boundary ([B]) to complete the pattern. This conditional-activation hypothesis, supported by our 2×\times×2 factorial design (markers alone: +1.3pt; clean boundaries alone: –1.2pt; conjunction: +5.0pt), is more mechanistically precise than generic "format sensitivity."
Second, our five-arm wrapper ablation (Tier 1 Experiment 2) refines the routed token from "prefix position" to marker-token-class. Either <|im_start|> or <|im_end|> suffices as the landmark; the bracket pair is substitutable, not asymmetric. Role-name tokens (user, assistant) add a small independent signal (+1.25pt), but the strongest signal is from the distinctive marker token class—any canonical turn-boundary landmark token appears to trigger the IT-learned attention pathway. This explains cross-family generalization: Phi-3's <|user|> and <|end|> serve the same landmark function as Qwen's <|im_start|> and <|im_end|>, despite different vocabulary and template structure.
Statistical methodology: Exact McNemar testing for paired proportions at nnn <<< 200 follows recommendations in Dror et al. (2018; ACL P18-1128).
2.5 Positioning
Our work occupies a training-free structural-alignment niche complementary to all prior approaches. CacheBlend's selective-recompute method addresses cross-attention contamination at inference time and requires modified inference infrastructure; we address structural misalignment at archive time with zero inference overhead. Jeong's trained-adapter approach achieves persistent memory via learned architectural priors; we achieve analogous structural alignment at zero adapter cost on models an order of magnitude larger. Our MLA validation extends the mechanism to the dominant 2026 frontier architecture class without training. LoopGuard validates our degradation-mechanism framing from an orthogonal angle. We additionally contribute a methodological protocol for retrieval-determinism in KV-reuse benchmarking and a YARN-RoPE debugging guide for practitioners re-running injection on extended-context models.
3. Methods
3.1 Overview
We evaluate cross-session KV cache injection on decoder-only transformer language models, measuring retrieval quality (r@1) and time-to-first-token (TTFT) against a strong text-RAG baseline. Our core question: can archived KV states from past conversations be injected into future sessions without quality degradation, and if not, what mechanism explains the gap?
3.2 Models
Primary model: Qwen2.5-7B-Instruct (32 layers, GQA with 4 KV heads, RoPE position encoding, ChatML template). Evaluated under SDPA attention kernel (production default) and EAGER attention kernel (diagnostic instrument).
Cross-family validation: Phi-3-medium-128k-instruct (40 layers, GQA, RoPE, Phi-3 chat template). Evaluated under SDPA only.
Base-model control: Qwen2.5-7B (non-instruct, shared tokenizer). Evaluated under SDPA to test whether the structural-alignment effect is architecture-intrinsic or IT-learned.
MLA validation: DeepSeek-V2-Lite-Chat (27 layers, MLA with qk_nope=128 / qk_rope=64, YARN-scaled RoPE with factor=40). Evaluated under SDPA only.
Blocked models: Qwen3.6-35B-A3B (hybrid linear-attention)—catastrophic failure (r@1=10%) on partial injection, establishes scope boundary.
3.3 Benchmark
Long-form Conversation Memory (LoCoMo) benchmark—200 multi-turn conversations with ground-truth fact retrieval. We use a held-out subset: 80 organic user queries + 120 synthetic queries generated by Qwen2.5-7B-Instruct. Synthetic queries correlate more tightly with target chunks (gt_in_top3: 0.90 vs 0.825) due to generation-model phrasing overlap. We report results stratified by slice and include synthetic-penalty sensitivity analysis.
3.4 Retrieval Pipeline
Text-RAG baseline (strong): Qwen3-Embed-8B embeddings + BM25 hybrid retrieval + BGE-Reranker-v2-m3 top-3 selection. Full reranker scores computed per query.
Inject baseline: Top-3 chunks selected by identical strong-RAG pipeline, archived KV computed via model.forward(use_cache=True) on each chunk, spliced into DynamicCache at inject point. No position math (naive concat validated at ≥\ge≥7B).
Inject+reranker: Same top-3 selection as strong-RAG, inject only. Suggests the gap is mechanism-intrinsic, not selection quality.
CPU-tier probe: Zero-shot potion embeddings (minishlab/potion-base-8M, 256-dim, CPU) + BM25 + BGE-Reranker-v2-m3. Tests whether GPU-grade dense embedders are necessary for end-to-end inject quality.
3.5 Phase 0 Experimental Matrix
Qwen 2.5 7B—8-arm matrix
| Arm | Archive | Inject | Kernel | Purpose |
|---|---|---|---|---|
| Baseline | N/A | text-RAG (strong) | SDPA | Quality ceiling |
| Inject | raw chunk | raw KV splice | SDPA | Mechanism baseline |
| Inject+reranker | raw chunk | raw KV splice | SDPA | Selection control |
| Global sink | raw chunk | BOS+pos 0–3 + raw KV | SDPA | Classical sink hypothesis |
| Cond 1 (prefix) | raw chunk | wrapper prefix + raw KV | SDPA | Prefix-alignment hypothesis |
| Per_chunk | wrapped chunk (template) | wrapped KV splice | SDPA | Structural alignment hypothesis |
| Cond 1 EAGER | raw chunk | wrapper prefix + raw KV | EAGER | Kernel-sensitivity probe |
| Per_chunk EAGER | wrapped chunk | wrapped KV splice | EAGER | Kernel-sensitivity probe |
Wrapper protocol (Qwen ChatML):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{chunk}<|im_end|>
First 4 wrapper tokens (<|im_start|>, system, \n, You) serve as structural anchors.
Phi-3 14B—3-arm matrix
| Arm | Archive | Inject | Purpose |
|---|---|---|---|
| Baseline | N/A | text-RAG (strong) | Quality ceiling |
| Cond 1 | raw chunk | wrapper prefix + raw KV | Cross-family prefix test |
| Per_chunk | wrapped chunk | wrapped KV splice | Cross-family structural test |
Wrapper protocol (Phi-3 template):
<|user|>
{chunk}<|end|>
Pick 1+2—2×\times×2 Mechanism Decomposition (Qwen 2.5 7B)
To isolate the active ingredients of per-chunk structural wrapping, we test a factorial 2×\times×2 design crossing boundary cleanliness with role-marker presence:
| Cell | Boundary | Role Markers | Expected Mechanism |
|---|---|---|---|
| A | Arbitrary slice (mid-word) | Stripped (no markers) | Raw-text baseline—both factors absent |
| B | Arbitrary slice | Markers intact | Role-marker routing alone |
| C | Sentence boundary | Stripped | Clean boundaries alone |
| D | Sentence boundary | Markers intact | Both factors—synergistic interaction |
Boundary manipulation: Sentence-boundary slicing cuts at complete sentences (. or \n delimiters). Arbitrary slicing cuts at fixed token counts, potentially mid-word.
Role-marker manipulation: Stripped archives remove all chat-template role markers (<|im_start|>, system, user, assistant) leaving only raw content. Intact archives preserve full template structure.
Key hypothesis: Neither factor alone produces substantial improvement. Their conjunction is required—clean boundaries provide the context for role-marker routing to activate, and role markers provide the routing signal that boundaries alone cannot.
3.6 Tier 1 Experimental Bundle
Three orthogonal experiments on Qwen 2.5 7B and Phi-3 14B, SDPA kernel, matched-pairs, nnn=40–80.
Experiment 1: Base-Model Control (nnn=80)
Tests whether structural alignment is architecture-intrinsic or instruction-tuning-dependent. Arms: baseline + per_chunk on Qwen2.5-7B base (non-instruct) versus Qwen2.5-7B-Instruct reference.
Experiment 2: Wrapper Ablation (nnn=80, 5-arm)
Tests which template tokens drive the effect:
| Arm | prefix | suffix | Tokens | Purpose |
|---|---|---|---|---|
| A_full | full template | `< | im_end | >` |
| B_opening | full opening | (none) | 14 | Opening-marker sufficiency |
| C_closing | user\n | `< | im_end | >` |
| D_role | user\n | (none) | 2 | Role-name alone |
| E_markers | `< | im_start | >\n` | `< |
Product-ready recommendation: E_markers (<|im_start|>\n[chunk]<|im_end|>) — 3 tokens, minimal overhead.
Experiment 2B: Phi-3 E-Wrapper Cross-Family (nnn=80, 2-arm)
Validates whether E-wrapper minimalism generalizes beyond Qwen:
| Arm | prefix | suffix | Tokens |
|---|---|---|---|
| A_full_phi3 | `< | system | >...< |
| E_markers_phi3 | `< | end | >\n` |
Experiment 3: Multi-Hop Degradation (nnn=40, 3-condition)
Tests archive noise sensitivity. Conditions: (a) 0 filler turns between archive and query, (b) 1 filler turn, (c) 2 filler turns. Fillers generated by the model itself, archived identically to chunks.
Experiment 4: Upper-Layer Localization (nnn=200)
Tests whether the E-wrapper effect is uniformly distributed across transformer layers or concentrated in upper layers where induction-head routing is strongest (PyramidKV, 2024). Four arms on Qwen 2.5 7B SDPA: abl_baseline (no wrapping), abl_pc_full (all 28 layers wrapped), abl_upper_mixed (layers 14–27 wrapped, 0–13 raw), abl_lower_mixed (layers 0–13 wrapped, 14–27 raw). Split at layer 14 (50% boundary). Same pinned selections as Phase 1. The abl_ prefix distinguishes these prefill-wrap ablation arms from the patch_ arms used in the activation-patching experiment of §4.6.
3.7 MLA Methods: YARN-Scaled RoPE Handling
DeepSeek-V2-Lite uses YARN-scaled RoPE (factor=40, original_max_position_embeddings=4096) rather than plain RoPE. Naive rope_shift_k using standard `inv_freq = 1/\text{rope_theta}^{2i/d}over−rotatesbyupto40 over-rotates by up to 40over−rotatesbyupto40\times$ on dimensions 23–31, producing complete K-tensor corruption and gibberish output.
Fix: Extract yarn-adjusted inv_freq from the model's own rotary module:
rot = model.model.layers[0].self_attn.rotary_emb
yarn_inv_freq = rot.inv_freq.detach().to("cuda").to(torch.float32)
Use yarn_inv_freq in shift rotation; mscale cancels cleanly for V2-Lite (mscale == mscale_all_dim == 0.707 →\rightarrow→ _mscale == 1.0). This fix generalizes to any yarn-scaled architecture (DeepSeek V2/V3, extended-context Llama-3 variants).
3.8 Retrieval Determinism Protocol
Strong-RAG retrieval pipelines using BGE-Reranker-class cross-encoders exhibit non-deterministic top-3 selection across fresh runs (0/80 queries matched between two identical pipeline invocations), primarily due to FP16 tie-breaking in score argsort when candidate scores cluster within sub-ULP gaps.
Protocol: We pin top-3 selections once per retrieval pipeline run and reuse identical selections across all inject arms. This ensures observed deltas reflect injection mechanism differences, not retrieval variance. Cross-run absolute scores may vary; only intra-run relative deltas are valid. This protocol is a requirement for any KV-reuse benchmark using near-deterministic rerankers.
3.9 n=200 Extension Protocol
Pool synthesis: 80 organic + 120 synthetic queries, evaluated under identical retrieval pipeline. Stratified reporting by slice.
Statistical tests:
- Primary: Exact McNemar test (binomial CDF, not chi-square approximation) for paired proportions. Justified per Dror et al. (2018, ACL P18-1128) for small-sample paired NLP evaluation.
- Effect size: Paired bootstrap 95% CI (10,000 resamples), Pr(δ\deltaδ >>> 0).
- Power: nnn=200 provides 80% power to detect 5pt delta at α\alphaα=0.05, baseline 50%.
Synthetic penalty sensitivity: Artificially degrade synthetic 120 baseline by randomly nullifying correct retrievals to match original 80 difficulty (7.5% degradation). Recompute per_chunk delta on penalized synthetic slice. Mechanism claimed robust only if delta holds on both original 80 and penalized synthetic 120.
3.10 Diagnostic Instruments
All instruments run under EAGER attention (required for output_attentions=True hooks).
Attention entropy + sink-mass probe: Per-layer, per-head attention entropy: H=−∑iailog(ai)H = -\sum_i a_i \log(a_i)H=−∑iailog(ai) where aia_iai is attention weight. Sink-mass ratio: ∑i∈{0,1,2,3}ai\sum_{i\in\{0,1,2,3\}} a_i∑i∈{0,1,2,3}ai—attention mass on first 4 positions.
Cross-attention probe:
- Condition 1: Archive KV from session A, inject into session B with identical preceding text (same system prompt + prefix).
- Condition 2: Same archive, inject into session B with different preceding text (standard setup).
- Delta = cross-attention contamination contribution.
KV cosine probe:
- Raw: cosine between archived K and fresh-prefill K of identical text
- Un-RoPE'd: after stripping RoPE rotation from both, recompute cosine
- Low raw + high un-RoPE'd →\rightarrow→ RoPE mismatch dominates
- Low raw + low un-RoPE'd →\rightarrow→ cross-attention contamination dominates
3.11 Evaluation Metrics
- r@1: Top-1 retrieval accuracy (exact match to ground-truth chunk)
- r@5: Top-5 retrieval accuracy
- TTFT: Time from query submission to first generated token, milliseconds
- Delta: r@1 difference vs inject baseline (positive = improvement)
3.12 Text-Wrapped RAG Control
To isolate whether the E-wrapper benefit is structural (KV injection) or lexical (chat-template marker strings in plain text), we test a two-arm control (nnn=200). Both arms use identical strong-RAG retrieval and identical chunk content. The plain_rag arm formats chunks as raw text in the prompt context. The wrapped_rag arm wraps each chunk as <|im_start|>user\n{chunk}<|im_end|> in plain text before prompt assembly, with no KV injection. If wrapped_rag matches per_chunk inject, the effect is lexical; if it matches plain_rag, the effect is structural.
3.13 INT8 Quantization Robustness
We test whether the E-wrapper effect survives INT8 KV cache compression. A preliminary two-arm test (nnn=200) compares bf16 per_chunk against per-tensor symmetric INT8. A follow-up ten-arm matrix (nnn=80) decomposes the failure mode:
| Arm | Quantization scheme |
|---|---|
| pc_bf16 | bf16 control |
| pt_asym | Per-tensor asymmetric INT8 |
| pt_sym | Per-tensor symmetric INT8 |
| per_ch | Per-head (channel) INT8 |
| per_tok | Per-token dynamic INT8 |
| selective | bf16 sink+tail tokens, INT8 middle |
| k_only | K=INT8 per-channel, V=bf16 |
| layer_low | Layers 0–13 INT8 per-ch, 14–27 bf16 |
| layer_hi | Layers 0–13 bf16, 14–27 INT8 per-ch |
All INT8 arms quantize on-the-fly at cache-build time from a single shared bf16 archive, avoiding memory overhead. Per-channel scale = amax(dim=(-2,-1), keepdim) → shape [1, H, 1, 1]. Per-token scale = amax(dim=-1) → [1, H, S, 1].
3.14 System Prompt Drift
To test cross-context deployment viability, we archive KV with a "helpful assistant" system prompt and inject into generation sessions with either the same prompt or a drifted "coding assistant" prompt. Both use E-wrapper per_chunk injection (nnn=200). Zero delta between same and drifted confirms marker-class binding is invariant to system prompt.
3.15 Query Pool Extension
To power the Qwen result toward formal significance we extend the original 200-query pool via oracle-filtered generation: 267 additional queries (seed=99) admitted only when their ground-truth chunk is retrievable within top-KKK, matching the original pool's quality distribution. An earlier inline-synthetic attempt (nnn=167, unfiltered) diluted the signal and is reported as a negative result in Appendix H. The oracle-filtered extension yields merged nnn=467 with identical pipeline, kernel, and E-wrapper configuration as Phase 1 (§5.4, Table 1c).
4. Experiments
4.1 Phase 0—Mechanism Decomposition (nnn=80)
All Phase 0 arms ran on the same 80-query subset with identical pinned top-3 selections. Qwen 2.5 7B was evaluated under both SDPA (production) and EAGER (diagnostic) kernels; Phi-3 14B under SDPA only. The 2×\times×2 factorial (Picks 1+2) was run on Qwen SDPA only.
4.2 Phase 1—Cross-Family Validation (nnn=200)
Qwen 2.5 7B SDPA: 80 organic + 120 synthetic queries, stratified reporting. Phi-3 14B SDPA: same 200-query pool. Both used pinned selections intra-run.
4.3 Tier 1—Mechanism Refinement (nnn=40–80)
Three orthogonal experiment bundles (with one cross-family sub-replication): (1) base-model control (nnn=80), (2) five-arm wrapper ablation (nnn=80) plus Phi-3 E-wrapper cross-family (nnn=80), (3) multi-hop degradation (nnn=40). All SDPA, pinned selections.
4.3b Layer Ablation (nnn=200)
Four-arm layer ablation on Qwen 2.5 7B SDPA testing upper vs lower layer contribution to the E-wrapper effect. Upper/lower split at layer 14. Same pinned selections as Phase 1.
4.4 MLA Stage 1—DeepSeek V2-Lite (nnn=200)
Two-arm matched-pairs (baseline vs per_chunk E-wrapper) on DeepSeek-V2-Lite-Chat under SDPA, with YARN-adjusted rope_shift_k. Same 200-query pool.
4.5 CPU-Tier Probe (nnn=80)
Three-frontend end-to-end inject evaluation: Qwen3-Embed-8B (baseline), adapted potion, zero-shot potion. Matched-pairs, per_chunk inject, SDPA.
4.6 Activation Patching (nnn=20, Phi-3-medium-128k-instruct)
To probe which stages of the transformer computation carry the E-wrapper effect causally, we run an 8-arm activation-patching sweep on Phi-3-medium-128k-instruct at nnn=20. The patch_baseline arm generates from an archive prefilled without E-wrapping. The patch_pc_full arm injects the full E-wrapper archive as the reference. Six further arms replace the residual stream at marker-token positions during generation using activations harvested from the patch_pc_full reference run, across different layer bands:
patch_denoise_early(layers 0–5)patch_denoise_mid(layers 10–15)patch_denoise_upper(layers 20–32)patch_denoise_all(all layers)patch_denoise_random(randomly chosen layer set—negative control)patch_noise_upper(inject corrupted residuals at upper layers—noise control)
Patches apply at marker-token positions only; non-band layers use zeros at marker positions and baseline values elsewhere. (Note on arm-naming convention: patch_pc_full refers to activation-patching arms and is distinct from abl_pc_full used in the layer-ablation sweep of §4.3b. The two probes measure different operations; see the distributed-vs-localized reconciliation below.)
The diagnostic probes described in §3.10 (attention entropy, sink-mass ratio, KV cosine similarity) are referenced in §5.1 for the alternative-mechanism rejection argument.
Distributed necessity vs upper-layer dominance—reconciliation. The layer-ablation results (§4.3b, Exp3v2) and activation-patching results above appear in tension: Exp3v2 finds that upper-layer wrapping (layers 14–27 on Qwen) recovers ≈\approx≈55% of the E-wrapper effect while lower-layer wrapping contributes nothing, yet activation patching finds that replacing marker-position residuals in upper layers alone (layers 20–32 on Phi-3) transfers 0pt of the effect while replacing residuals across all layers recovers the full +5pt. These findings are not contradictory—they measure different operations at different stages of computation.
Exp3v2 is a prefill-wrap ablation: it varies which layers receive structurally wrapped KV during the archive prefill phase. The result shows that upper layers are where the structural topology (marker-token-class binding) concentrates its contribution to downstream attention routing—consistent with PyramidKV's pyramidal information funneling and with the concentration of induction-head behavior in upper/middle layers (Olsson et al., 2022). Lower layers process local syntactic patterns where structural wrapping adds no value.
Activation patching is a residual-swap intervention: it replaces the residual stream at marker-token positions in specific layer bands during inference, while the model re-computes all downstream transformations on the patched state. The patch_denoise_upper=0pt result indicates that upper-layer marker-position residuals alone are insufficient for effect reconstruction—the full layer stack is required to reconstruct the E-wrapper routing pattern from a patched state. This is expected mechanically: residual patching at layer LLL propagates through layers L+1..NL{+}1..NL+1..N, and if layers 0..L−10..L{-}10..L−1 were computed on unwrapped state the routing context arriving at LLL is already incomplete.
Together the two probes give a coherent picture: upper layers dominate where the structural signal is expressed (Exp3v2), while the full stack is necessary for causal reconstruction (activation patching). For product engineering, Exp3v2's upper-dominance supports storage-optimized archiving—upper-layer-only archiving trades storage for partial effect preservation (≈\approx≈55% of full-wrap quality at roughly half the layers stored). For mechanism understanding, activation patching confirms that marker-token-class binding is a distributed computation—not localized to a single layer band—even when its quality contribution concentrates in upper layers.
4.7 Text-Wrapped RAG Control (nnn=200)
Two-arm control (plain_rag vs wrapped_rag) on Qwen 2.5 7B SDPA. Tests whether chat-template markers in plain text (no KV injection) reproduce the E-wrapper benefit.
4.8 INT8 Quantization Robustness (nnn=200 + nnn=80 matrix)
Preliminary two-arm test (bf16 vs per-tensor INT8, nnn=200) followed by ten-arm granularity matrix (nnn=80). Tests per-tensor, per-channel, per-token, selective, k_only, and layer-split quantization schemes on Qwen 2.5 7B SDPA.
4.9 System Prompt Drift (nnn=200)
Three-arm test (baseline vs same_prompt vs drifted_prompt) on Qwen 2.5 7B SDPA. Archives KV with "helpful assistant" system prompt; generation uses either "helpful" or "coding" assistant prompt.
4.10 Query Pool Extension (nnn=367 inline-synthetic, then nnn=467 oracle-filtered)
Two successive extensions of the Qwen 2.5 7B SDPA evaluation pool. The first merges 167 inline-synthetic queries with the original 200 (nnn=367, negative result, §5.4 supplementary). The second—reported as the primary Qwen significance result—applies the Phase 1 oracle-filtered admission criterion to 267 fresh queries (seed=99) and merges with the original 200 (nnn=467). Both runs use identical retrieval pipeline, kernel, and E-wrapper configuration as Phase 1; only the query pool is extended. Stratified reporting compares original 200, new oracle-filtered 267, and merged 467 slices.
5. Results
5.1 Phase 0—Alternative Mechanism Rejection
Table 1a (Qwen 2.5 7B, nnn=80) and Table 1b (Phi-3 14B, nnn=80) report the full Phase 0 kernel-matched matrix.
| Model | Kernel | Arm | r@1 | Δ\DeltaΔ vs baseline | Pr(δ>0\delta > 0δ>0) | McNemar ppp |
|---|---|---|---|---|---|---|
| Qwen 7B | SDPA | baseline | 0.5125 | — | — | — |
| Qwen 7B | SDPA | global sink | 0.5125 | +0.000 | 0.50 | — |
| Qwen 7B | SDPA | cond1 replicate | 0.5375 | +2.5pt | 0.65 | 0.80 |
| Qwen 7B | SDPA | per_chunk sink | 0.5750 | +6.25pt | 0.90 | 0.27 |
| Qwen 7B | EAGER | baseline | 0.4750 | — | — | — |
| Qwen 7B | EAGER | cond1 | 0.5625 | +8.75pt | 0.95 | 0.14 |
| Phi-3 14B | SDPA | baseline | 0.5750 | — | — | — |
| Phi-3 14B | SDPA | cond1 | 0.6000 | +2.5pt | 0.69 | 0.75 |
| Phi-3 14B | SDPA | per_chunk sink | 0.6000 | +2.5pt | 0.69 | 0.75 |
Note: Per-query discordant counts for Qwen Phase 0 arms in Table 1a were computed from internal result logs reviewed by the author team. Raw per-query results are available on request.
Three observations emerge from this table:
-
Classical attention sinks show no reliable signal. Global BOS+positions 0–3 prepending (StreamingLLM-style) produces exactly zero gain (+0.0pt, Pr=0.50). Sink-mass loss is a real attention-level artifact (–98% on positions 0–3; Figure 3) but it is a symptom of prefix absence, not an independent causal mechanism at this sample size.
-
RoPE mismatch is not the dominant driver. The KV cosine probe shows un-RoPE'd K/V cosine ≈\approx≈1.0 across layers (Figure 4), confirming archived and fresh representations are content-equivalent. Position rotation alone does not explain the gap at 7B scale with
rope_theta=1e6. -
Cross-attention contamination is PARTIAL. Cond 1 (identical-prefix inject) under SDPA gives only +2.5pt—not significant at nnn=80. The remaining ~5pt gap after per_chunk suggests contamination is real but secondary to structural-alignment failure.
5.2 2×\times×2 Mechanism Decomposition—Synergistic Interaction
To test whether boundary cleanliness or role-marker routing is the primary driver, we run a factorial 2×\times×2 design on Qwen SDPA (nnn=80). Full statistics in Table 2.
| Cell | Boundary | Role Markers | r@1 (SDPA, nnn=80) | Delta |
|---|---|---|---|---|
| A | Arbitrary | Stripped | 0.500 | Baseline |
| B | Arbitrary | Markers | 0.512 | +1.2pt |
| C | Sentence | Stripped | 0.487 | –1.3pt |
| D | Sentence | Markers | 0.550 | +5.0pt |
Neither main effect is substantial. Role markers alone (Cell B) produce negligible improvement (+1.2pt). Clean boundaries alone (Cell C) degrade quality below baseline (–1.3pt). The observed effect is concentrated in the interaction (Cell D): the conjunction of clean boundaries AND role markers produces +5.0pt. The interaction term itself equals +5.0pt—as large as the combined effect—consistent with a super-additive pattern, though the interaction is not independently significant at nnn=80 per cell.
Replication at nnn=200 (different retrieval selections). A replication of the 2×\times×2 factorial at nnn=200 using independently pinned retrieval selections yielded: arbitrary_stripped 0.430 (baseline), arbitrary_intact 0.400 (−3.0-3.0−3.0pt), sentence_stripped 0.450 (+2.0pt), sentence_intact 0.410 (−2.0-2.0−2.0pt). The interaction term is −1.0-1.0−1.0pt—the combined treatment no longer outperforms baseline. Because the nnn=200 run uses different pinned selections (per §3.8, BGE-Reranker selections are non-deterministic across runs), this result does not directly contradict the nnn=80 finding but does indicate that the factorial interaction is selection-sensitive and should not be treated as a robust standalone finding. The per-chunk injection effect itself (Δ\DeltaΔ=+5.6pt, ppp<0.01 at merged nnn=467) is robust across multiple retrieval selections and remains the primary statistical claim.
Interpretation: The factorial decomposition provides a plausible mechanistic hypothesis (boundary–marker synergy) but the interaction is not independently reproducible at current sample sizes. The mechanism story is better anchored in the five-arm wrapper ablation (§5.3 Experiment 2), which demonstrates marker-token-class binding directly and does not depend on cross-run selection matching.
5.3 Tier 1—Mechanism Refinement
Experiment 1: IT Amplifies, Not Creates (nnn=80)
Table 3. Base-model control.
| Model | Variant | baseline r@1 | per_chunk r@1 | Δ\DeltaΔ | 95% boot CI | McNemar ppp | Pr(δ>0\delta > 0δ>0) |
|---|---|---|---|---|---|---|---|
| Qwen 2.5 7B Instruct (ref) | IT | 0.5125 | 0.5750 | +0.0625 | — | — | — |
| Qwen 2.5 7B Base | no-IT | 0.4875 | 0.5125 | +0.0250 | [–0.050, +0.100] | 0.754 | 0.682 |
The base model shows a directionally positive +2.5pt effect (n=80, formally null, p=0.754). This falsifies both strict hypotheses: the effect is neither purely IT-learned (base ≈\approx≈ 0) nor purely architecture-intrinsic (base ≈\approx≈ instruct). The most consistent interpretation is that instruction tuning amplifies an existing architectural capability by ~2.5×\times×. The <|im_*|> marker tokens exist in the shared tokenizer vocabulary; even an un-tuned base model can use them as weak landmark anchors, and IT sharpens this routing pattern.
Experiment 2: Marker-Token-Class Binding (nnn=80)
Table 4. Five-arm wrapper ablation on Qwen 2.5 7B Instruct.
| Arm | r@1 [Wilson 95%] | Δ\DeltaΔ vs A | 95% boot CI | McNemar ppp | Pr(δ>0\delta > 0δ>0) | discordant (A/arm) |
|---|---|---|---|---|---|---|
| A_full | 0.5500 | — | — | — | — | — |
| B_opening_only | 0.5500 | 0.000 | [–0.088, +0.088] | 1.000 | 0.451 | 7 / 7 |
| C_closing_only | 0.5625 | +0.0125 | [–0.063, +0.100] | 1.000 | 0.565 | 5 / 6 |
| D_role_name_only | 0.4875 | –0.0625 | [–0.163, +0.038] | 0.332 | 0.089 | 11 / 6 |
| E_markers_no_role | 0.5375 | –0.0125 | [–0.100, +0.075] | 1.000 | 0.329 | 6 / 5 |
Key findings:
- Opening marker alone (B) matches full template (A) — the 14-token preamble is unnecessary.
- Closing marker alone (C) also matches A — the bracket pair is substitutable, not asymmetric.
- Role name alone (D) collapses to baseline —
user/assistanttokens without markers provide no routing signal. - Minimal markers (E) retains 97.7% of A at 3 tokens vs 15.
Mechanism: marker-token-class binding. Either <|im_start|> or <|im_end|> suffices as a distinctive turn-boundary landmark. The specific position (opening vs closing) is secondary; what matters is the presence of a canonical marker token that the model's induction-head circuitry learned to route through during instruction tuning. Role-name tokens add a small independent signal (+1.25pt C vs E, within noise), but the strongest signal is the marker class.
Saturation evidence: Closing marker adds nothing when opening is present (A ≡\equiv≡ B). But closing rescues when opening is weak (C – D = +7.5pt). The mechanism has a ceiling: once triggered by any canonical <|im_*|> token, redundant markers don't stack.
Experiment 2B: E-Wrapper Cross-Family (nnn=80)
Table 5. Phi-3-medium 14B validation.
| Arm | r@1 | Δ\DeltaΔ vs A | 95% boot CI | McNemar ppp | Pr(δ>0\delta > 0δ>0) |
|---|---|---|---|---|---|
| A_full_phi3 | 0.6000 | — | — | — | — |
| E_markers_phi3 | 0.6000 | 0.0000 | [–0.063, +0.063] | 1.000 | 0.422 |
Token overhead: E-wrapper = 2 tokens vs A_full's 11 = 18% of full-template tokens.
Cross-family comparison:
| Model | A_full r@1 | E_markers r@1 | Δ\DeltaΔ (E – A) | Retention |
|---|---|---|---|---|
| Qwen 2.5 7B | 0.5500 | 0.5375 | –0.0125 | 97.7% |
| Phi-3-medium 14B | 0.6000 | 0.6000 | 0.0000 | 100% |
Phi-3 is more tolerant of wrapper minimization than Qwen. Both families converge on "the <|end|>-class token is the sufficient binding signal; the role name and system preamble are decorative."
Experiment 3: Multi-Hop Degradation (nnn=40)
Table 6. Archive noise sensitivity.
| Condition | r@1 | Δ\DeltaΔ vs (a) | 95% boot CI | McNemar ppp | Pr(δ>0\delta > 0δ>0) |
|---|---|---|---|---|---|
| (a) 0 fillers | 0.500 | — | — | — | — |
| (b) 1 filler | 0.450 | –5.0pt | [–0.200, +0.075] | 0.727 | 0.184 |
| (c) 2 fillers | 0.425 | –7.5pt | [–0.200, +0.050] | 0.453 | 0.082 |
| (c) vs (b) | 0.425 vs 0.450 | –2.5pt | [–0.175, +0.125] | 1.000 | 0.311 |
Per-hop margins: (a→\rightarrow→b) –5.0pt, (b→\rightarrow→c) –2.5pt. Degradation is monotonic but diminishing. The worst transition is from "pure target context" to "mixed context"; additional contamination causes sub-linear marginal damage. This suggests a "mechanism floor": once the KV archive contains any off-target content, further mixing is marginally cheaper.
Caveat: At nnn=40, precision floor is ≈±\approx\pm≈±20pt. All CIs cross zero; none reach formal significance. Directional evidence is consistent (Pr 0.82–0.92) but cannot substitute for power.
Experiment 4: Upper-Layer Localization (nnn=200)
Table 7a. Layer ablation for E-wrapper mechanism (Qwen 2.5 7B, SDPA).
| Arm | Layers wrapped | r@1 | Δ\DeltaΔ vs baseline | Interpretation |
|---|---|---|---|---|
| baseline | none | 0.420 | — | Raw inject |
abl_pc_full | 0–27 (all) | 0.475 | +5.5pt | Full E-wrapper (replicates Phase 1) |
abl_upper_mixed | 14–27 | 0.450 | +3.0pt | Upper layers recover 55% of effect |
abl_lower_mixed | 0–13 | 0.415 | –0.5pt | Lower layers: zero contribution |
The E-wrapper mechanism is upper-layer dominant. The upper 14 layers (14–27) account for 55% of the full mechanism (+3.0pt of +5.5pt). Lower layers (0–13) contribute nothing measurable (–0.5pt, within noise floor). This is consistent with PyramidKV's pyramidal information funneling (Section 2.2): upper layers concentrate attention on structurally significant tokens, making structural topology fidelity matter most where induction-head routing is already concentrated.
Product implication: For storage-constrained deployments, archiving only upper-layer KV tensors preserves 55% of the structural-alignment benefit at approximately 50% storage cost. A finer-grained layer sweep could identify the minimum layer set for majority effect.
Residual 45%: The gap between abl_upper_mixed (+3.0pt) and abl_pc_full (+5.5pt) suggests either (a) sub-threshold contribution from layers 0–13 at nnn=200, or (b) cooperative interaction where upper-layer wrapping is more effective when lower layers are also wrapped.
5.4 n=467 Cross-Family Validation
Qwen 2.5 7B SDPA—Stratified reporting
Full statistics in Table 1c; per-query delta distribution in Figure 5.
Table 1c. Qwen 2.5 7B SDPA oracle-filtered extension (stratified reporting, nnn=467). Absolute r@1 values and discordant counts were recomputed directly from per-query result pairs. Exact McNemar uses the two-sided binomial on discordant counts with standard convention (bbb = inject hurt, ccc = inject helped).
| Slice | nnn | Baseline | per_chunk | Delta | McNemar exact ppp | bbb (hurt) | ccc (helped) | Bootstrap 95% CI |
|---|---|---|---|---|---|---|---|---|
| Original 200 | 200 | 0.420 | 0.475 | +5.5pt | 0.108 | 14 | 25 | [–0.005, +0.115] |
| New oracle-filtered 267 | 267 | 0.536 | 0.592 | +5.6pt | 0.063 | 21 | 36 | [+0.005, +0.110] |
| Merged | 467 | 0.486 | 0.542 | +5.57pt | 0.0103 | 35 | 61 | [+0.015, +0.105] |
Note on baseline shift: The New-267 slice has a higher absolute baseline (0.536 vs 0.420) because it draws from a different corpus sub-distribution with easier queries on average. The delta is stable (+5.6pt vs +5.5pt) across this difficulty shift, which is the key replication property.
Key finding: The new oracle-filtered queries (different seed, different retrieval sample) replicate the original effect identically (+5.6pt vs +5.5pt). The effect is not an artifact of the original query pool—it generalises to a fresh oracle-filtered sample drawn from a different difficulty regime.
At merged nnn=467 the delta reaches formal significance at α\alphaα=0.05 (McNemar exact ppp=0.0103, approaching α\alphaα=0.01). The strictly positive bootstrap lower bound (+0.015) independently supports a real positive effect. Stratified reporting shows the effect is robust across both the original pool and the new extension.
Synthetic Penalty Sensitivity
Artificially degrading the synthetic 120 baseline by 7.5% (to match original 80 difficulty) yields per_chunk delta = +5.1pt, bootstrap LB = –0.01. The mechanism survives penalty adjustment, confirming the effect is not driven by easy synthetic queries alone.
Phi-3 14B SDPA—Formal Significance
| Metric | Value |
|---|---|
| Baseline | 0.430 |
| per_chunk | 0.500 |
| Delta | +7.0pt |
| McNemar ppp | 0.013 |
| Exact test ppp | 0.019 |
| Bootstrap 95% CI | [+0.020, +0.120] |
| Pr(δ>0\delta > 0δ>0) | 1.00 |
Phi-3 achieves formal significance at α\alphaα=0.05 on both McNemar and exact tests, with a strictly positive bootstrap lower bound (+0.020). This is our strongest statistical result: a cross-family replication on a different template family (Phi-3's single-token <|user|> markers vs Qwen's multi-line ChatML) confirms the structural-alignment mechanism generalizes. The larger +7.0pt delta on Phi-3 versus +5.5pt on Qwen reflects both the simpler template (fewer boundary/marker interactions to disrupt) and the n=200 power advantage over n=80 Phase 0.
5.5 Activation Patching (nnn=20, Phi-3-medium-128k-instruct)
Table 13. 8-arm activation-patching sweep at marker-token positions on Phi-3-medium-128k-instruct (nnn=20).
| Arm | r@1 | Δ\DeltaΔ vs baseline |
|---|---|---|
patch_baseline (archive without E-wrap) | 0.50 | — |
patch_pc_full (full E-wrapper archive) | 0.55 | +5pt |
patch_denoise_early (layers 0–5) | 0.45 | −5-5−5pt |
patch_denoise_mid (layers 10–15) | 0.40 | −10-10−10pt |
patch_denoise_upper (layers 20–32) | 0.50 | 0pt |
patch_denoise_all (all layers) | 0.55 | +5pt (full recovery) |
patch_denoise_random (random layer set) | 0.30 | −20-20−20pt (negative control) |
patch_noise_upper (corrupted residuals, upper) | 0.50 | 0pt |
Key findings. (a) The full-wrapper archive (patch_pc_full) reproduces the +5pt effect seen at larger nnn on Phi-3 SDPA, confirming the small-nnn probe is calibrated to the main effect. (b) patch_denoise_all fully recovers the effect when residuals are patched across all layers, while (c) patch_denoise_upper recovers 0pt when patched at upper layers only—upper-layer residuals are insufficient in isolation for effect reconstruction from a patched state. (d) The random-layer negative control (−20-20−20pt) and noise control (0pt) establish that the recovery is specific to the layer-band structure of the patch. (e) Lower- and mid-band patches produce negative deltas (−5-5−5 to −10-10−10pt), indicating that partial-stack patching corrupts the routing context.
Together with the upper-layer-wrap dominance reported in §5.3 Experiment 4, these results support a reconciliation: the structural signal is expressed predominantly at upper layers while causal reconstruction requires the full stack (see §4.6).
5.6 Cross-Architecture and Deployment Validation
MLA tractability (DeepSeek V2-Lite, nnn=200). E-wrapper produces a directionally positive Δ\DeltaΔ=+3.0pt (Pr(Δ>0\Delta > 0Δ>0)=0.866, McNemar ppp=0.307) on Multi-head Latent Attention—underpowered at this sample size but sign-consistent with GQA results. All three tested architectures (two GQA families + MLA) show positive Δ\DeltaΔ; effect magnitude is model-dependent (Phi-3 +7.0pt > Qwen +5.6pt > V2-Lite +3.0pt). Full MLA results, cross-family summary table, and YARN-scaled RoPE debugging methodology in Appendix B and Appendix F.
Latency. KV injection reduces TTFT versus strong text-RAG by 2.6×\times× (per_chunk wrapped) to 3.4×\times× (raw inject). E-wrapper adds ∼\sim∼10ms overhead versus raw inject for the fresh wrapper-prefix prefill. Archive storage: ∼\sim∼2.1 MB per 4K-token chunk (bf16, 32 layers). E-wrapper reduces wrapper-token KV memory ∼\sim∼8×\times× versus full template (3 tokens vs 15).
Text-wrapped RAG control (nnn=200). Wrapping chunks with chat-template markers in plain text (no KV injection) produces Δ\DeltaΔ=+0.005—indistinguishable from noise. The +5.5pt effect requires actual KV injection, confirming the mechanism is structural, not lexical.
Robustness. System prompt mismatch (archiving under "helpful assistant," generating under "coding assistant") causes zero degradation (Δ\DeltaΔ=0.000, nnn=200). INT8 KV quantization at per-channel or per-token granularity preserves the full effect (0pt loss), while per-tensor INT8 destroys it (−5-5−5pt)—a granularity floor, not a precision floor. EAGER vs SDPA kernel choice affects measurement magnitude (∼\sim∼3×\times×) but not the underlying mechanism; SDPA values are reported as production-relevant. Full quantization matrix (ten-arm, nnn=80) in Appendix E; system prompt results in Appendix G.
CPU-tier deployment (nnn=80). Zero-shot static embeddings (potion, ∼\sim∼1ms CPU) plus BM25 plus BGE-Reranker match GPU-grade Qwen3-Embed-8B end-to-end (all McNemar ppp=1.000). The reranker fully absorbs potion's weaker retrieval-stage signal. Stack specification in Appendix C.
Standard benchmark: LoCoMo (nnn=200). To validate the effect outside our custom evaluation, we evaluate on the LoCoMo conversational memory benchmark (Maharana et al., 2024)—the same benchmark used by Jeong (2026), our closest prior work. LoCoMo tests cross-session recall over 10 multi-session conversations spanning up to 35 sessions each. We subsample 200 QA pairs (excluding adversarial category), archive per-turn KV with E-wrapper, and score with token-level F1 (LoCoMo's standard metric; exact-match is uninformative due to length mismatch between short ground-truth phrases and model-generated sentences).
| Arm | F1 | Δ\DeltaΔ |
|---|---|---|
| baseline (text-RAG, no injection) | 0.135 | — |
| per_chunk (E-wrapper injection) | 0.189 | +5.3pt |
The +5.3pt F1 improvement on LoCoMo is consistent with the +5.6pt r@1 delta on our custom benchmark, confirming the E-wrapper effect generalizes to an independently designed conversational memory evaluation. Per-chunk injection produces answers with higher token overlap with ground truth across all four non-adversarial question categories (single-hop, multi-hop, temporal, open-domain).
6. Discussion
6.1 Hypothesized Mechanism: Marker-Token-Class Binding and Induction-Head Routing
Olsson et al. (2022) demonstrated that transformer attention heads implement [A][B]...[A]→\rightarrow→[B] pattern completion—"induction heads." Chat-template role markers create exactly this pattern: a fixed prefix token sequence preceding variable content.
We hypothesize that our observed effects are consistent with induction-head behavior, though we do not directly observe specific head activations. Our five-arm wrapper ablation (§5.3 Experiment 2) provides the strongest evidence, with supporting context from the 2×\times×2 factorial (§5.2, noting that the factorial interaction did not replicate at nnn=200). The evidence refines the hypothesis in three ways:
First, induction-head routing may be conditionally activated. The [A] trigger (role marker) may require a well-formed [B] (content boundary) to complete the pattern. Without clean boundaries, the induction head may not establish the [A]→\rightarrow→[B] mapping:
<|im_start|>user\nHello→\rightarrow→ well-formed trigger + content →\rightarrow→ hypothesized induction-head activation<|im_start|>user\nHel→\rightarrow→ trigger + fragment →\rightarrow→ hypothesized mapping failure
Second, the hypothesized routed token is the marker-token-class, not the prefix position. Either <|im_start|> or <|im_end|> suffices; the bracket pair is substitutable. If Olsson's induction-head circuitry is involved, it may apply to any distinctive landmark token, not strictly to sequence prefixes. The model may learn a class of turn-boundary tokens during instruction tuning, and any member of that class may trigger the routing.
Third, instruction tuning amplifies rather than creates the effect. The base model shows +2.5pt (n=80, formally null but directionally positive), while the instruct variant shows +6.25pt. The <|im_*|> marker tokens exist in the shared tokenizer vocabulary; the base model may use them as weak anchors, and IT sharpens the routing pattern ~2.5×\times×.
Fourth, the mechanism is upper-layer dominant at prefill-wrap but distributed across the stack for causal reconstruction. Our layer ablation (Experiment 4, nnn=200) shows 55% of the E-wrapper effect concentrated in upper layers (14–27) at the prefill-wrap granularity, with zero measurable contribution from lower layers (0–13)—consistent with PyramidKV's pyramidal information funneling and with the induction-head routing hypothesis, since induction heads are predominantly found in upper/middle transformer layers. The activation-patching probe (\S5.5, nnn=20 on Phi-3) refines this picture: patching marker-position residuals in upper layers alone fails to reconstruct the effect (0pt), while all-layer patching recovers it fully (+5pt). We read the two probes as complementary: upper layers are where the structural signal is expressed, while the full stack is necessary for causal reconstruction from an already-patched state. See \S4.6 for the full reconciliation. The upper-layer dominance at prefill-wrap also supplies a concrete storage-optimization path: archiving only upper-layer KV tensors preserves roughly half the mechanism (≈\approx≈55%) at roughly half the layers stored.
6.2 Model-Dependent Magnitude
The boundary-conditioned structural-alignment effect varies across models and architectures:
- Qwen 2.5 7B (complex ChatML template): +5.0pt interaction alone (2×\times×2 Cell D), +6.25pt full per_chunk. Complex templates create more boundary/marker interaction opportunities →\rightarrow→ larger fix, but also more overhead.
- Phi-3 14B (simple template): +7.0pt at nnn=200 with formal significance (ppp=0.013). Phi-3's simpler template (
<|user|>single token) has fewer boundary/marker interactions to disrupt, yielding a larger delta. - DeepSeek V2-Lite (MLA): +3.0pt directional. MLA's compressed
c_kvlatent may lose some structural-topology signal that GQA preserves in full K-tensor form. - Qwen 2.5 7B Base: +2.5pt directional. Reduced magnitude confirms IT amplification.
Interpretation: Template complexity and architecture class determine both the magnitude of degradation and the recovery potential. Complex templates require full per-chunk wrapping (or E-wrapper minimalism); simple templates achieve comparable benefit with thinner injection. MLA shows the mechanism transfers but at reduced magnitude, suggesting architectural differences in how structural topology is encoded in the compressed latent space.
6.3 Methodological Implications: Retrieval Determinism
Our finding that BGE-Reranker-class pipelines exhibit 0/80 selection mismatch across fresh runs has implications beyond our specific benchmark. Any KV-reuse or RAG evaluation that compares across independent retrieval runs risks conflating retrieval variance with mechanism effects. The pinning protocol we introduce is a minimal-cost ($0, one JSON file) mitigation that should be adopted as standard practice.
6.4 Multi-Hop Degradation and Archive Noise
Experiment 3 shows that the mechanism degrades gracefully under archive noise: –5.0pt for the first filler turn, –2.5pt marginal for the second. This diminishing-margin pattern suggests a "mechanism floor" — once the KV archive is contaminated with any off-target content, additional contamination causes sub-linear marginal damage. The worst transition is from pure target context to mixed context.
Product implication: Single-hop injection (archive a conversation turn, inject into the next session) is the sweet spot. Multi-hop "conversation history" injection degrades monotonically with each additional off-target turn. A practical product should inject only the most relevant chunks, not entire conversation histories.
Honest caveat: At nnn=40, this is directional evidence, not a quantitative decay law. Formal claims would require nnn=400+.
6.5 Deployment Considerations
A CPU-tier retrieval stack (zero-shot potion + BM25 + BGE-Reranker) matches GPU-grade end-to-end quality within nnn=80 precision, enabling edge deployment without GPU embedders (Appendix C). The reranker remains the lone GPU component and dominant cost center.
Cross-session MLA injection is directionally tractable (V2-Lite Δ\DeltaΔ=+3.0pt, Appendix F) but effect size is smaller than GQA and formal significance would require nnn=1000+. Every major lab's 2026 flagship has moved past pure attention; whether frontier-scale MLA narrows the gap is the critical open question for this approach's longevity.
7. Limitations
7.1 Statistical Power
Individual Phase 0 measurements at nnn=80 are underpowered for 5pt effect detection. SE ≈\approx≈ 5.6% at 50% baseline; 95% CI spans ∼±\sim\pm∼±11pt. We address this via:
- nnn=200 extension (primary validation gate)
- Cross-measurement consistency (5 independent directional positives under null ≈\approx≈ 3%)
- Exact McNemar + bootstrap CI (tighter than asymptotic approximations)
The Qwen merged result reaches formal significance at nnn=467 (ppp<0.01, Δ\DeltaΔ=+5.6pt). Phi-3 nnn=200 (ppp=0.013) provides cross-family formal validation on a different template family.
Power analysis. For a paired proportion test (McNemar) with baseline π\piπ=0.42 and target Δ\DeltaΔ=0.05, required nnn for 80% power at α\alphaα=0.05 is approximately 385 per arm. Qwen reaches formal significance at merged nnn=467 (Δ\DeltaΔ=+5.6pt, ppp<0.01). For the MLA V2-Lite result (Δ\DeltaΔ=+0.030), nnn≈1,050 would be required for 80% power at α\alphaα=0.05, explaining why ppp=0.307 at nnn=200. For the 2×\times×2 factorial interaction (Cell D vs baseline, Δ\DeltaΔ=+0.05), nnn≈400 per cell would be needed to confirm the interaction independently.
Query pool extension. The oracle-filtered extension strategy succeeded: 267 new queries generated via the same oracle pipeline (different seed, different chunks) replicated the original effect identically (+5.6pt vs +5.5pt), confirming the effect is not an artifact of the original query pool. The corpus ceiling (~367 usable chunks) limits further independent extension without multi-question correlation or alternative corpus sources.
7.2 Single-Seed Variance
Phase 0 results are single-seed. nnn=200 includes multi-seed variance decomposition to quantify dataset-vs-seed contribution.
7.3 Kernel Sensitivity
Attention kernel choice (SDPA vs EAGER) affects both baseline quality and effect magnitude. EAGER gives lower baselines and larger deltas due to floating-point non-associativity in softmax. We report SDPA as production-relevant and EAGER as upper-bound diagnostic.
7.4 Synthetic Pool Overfit
Synthetic 120 queries are +7.5pt easier for retrieval than organic 80 (gt_in_top3: 0.90 vs 0.825) due to generation-model phrasing correlation. Mitigated by:
- Stratified reporting (original / synthetic / merged)
- Synthetic penalty sensitivity analysis
- Claims anchored to organic 80 slice (harder, more realistic)
7.5 Wrapper Sensitivity
Structural wrapping is model-template-specific. We validate on three families (Qwen ChatML, Phi-3, DeepSeek V2-Lite) but cannot claim universality. The five-arm ablation (Table 4) and cross-family E-wrapper validation (Table 5) provide the strongest wrapper-sensitivity evidence to date.
7.6 Scale Ceiling
All validation at 7B–14B, with one 15.7B/2.4B MLA model. Frontier-scale pure-attention models (70B+) and frontier MLA models (119B–671B) remain untested. Scale rescue pattern (1.5B fails →\rightarrow→ 7B works →\rightarrow→ 14B wins) suggests improvement with scale, but this is extrapolation.
7.7 Benchmark Scope
LoCoMo is a conversation memory benchmark. Generalization to code, technical documentation, or multi-modal contexts is untested.
7.8 Competitor Benchmarking
Mem0, Letta, Zep—product-grade competitors remain unbenchmarked due to credential/integration complexity.
7.9 Retrieval Pipeline Non-Determinism
The strong-RAG retrieval pipeline exhibits cross-run top-3 selection variance: 0/80 queries matched between two independent fresh runs. We mitigate via the Retrieval Determinism Protocol (Methods 3.8), which pins selections intra-run. However, this means absolute r@1 scores are not directly comparable across papers or replication attempts unless the exact same pinned selections are shared. Future work should release pinned selection artifacts alongside raw results.
7.10 Multi-Hop Noise Accumulation
Directional degradation of ~5pt r@1 observed across 2 filler turns at nnn=40 (formally null at this sample size, Pr(degrade)=0.92). The mechanism degrades gracefully but not indefinitely. Formal claims would require nnn=400+.
7.11 CPU-Tier Sample Size
The potion probe at nnn=80 shows equivalence within precision floor (±\pm±3.75pt). A larger nnn=200 run could confirm the marginal +1.25pt lead at significance. The conclusion "CPU tier is viable" is directional, not formally proven.
7.12 MLA Stage 1 False-Positive Retraction
Our initial V2-Lite Stage 1 "pass" (n=3 coherence heuristic) was a false positive masked by the YARN-RoPE bug. We explicitly retract that claim and replace it with the corrected n=200 formal result. This correction is documented in our rigor chronology as a self-correcting data point, not a failure.
7.13 Privacy and Security of Archived KV States
Cross-session injection requires storing per-layer K/V tensors from user conversations. Unlike archived text, which can be inspected, redacted, or audited, KV tensors are opaque high-dimensional embeddings that may encode memorized training data, prior conversation content, or attention patterns specific to a user session. Whether archived KV states carry equivalent or greater privacy risk than archived text is unknown. We do not analyze KV-cache memorization, membership-inference vulnerability, or cross-user leakage. Any production deployment should treat archived KV as sensitive user data and apply the same access controls, encryption, and retention policies as stored conversation text.
7.14 INT8 Quantization Matrix Sample Size
The ten-arm quantization compatibility matrix (Table 10) was evaluated at nnn=80. Per-arm precision is ±\pm±11pt at 95% CI, so the finding that per-channel INT8 achieves 0pt degradation is directionally strong but not formally proven against a small non-zero effect. The per-tensor INT8 failure (−5-5−5pt) is robustly outside the noise floor. A replication at nnn=200 would confirm the per-channel result with tighter bounds.
7.15 Query Pool Synthesis Quality
Our synthetic query pool (120 queries) was generated by Qwen2.5-7B-Instruct via inline prompting and oracle-filtered for quality. An attempt to extend the pool with 167 additional inline-synthetic queries (same generator, new seed) produced queries that showed no E-wrapper effect (Δ\DeltaΔ=−0.006-0.006−0.006) and diluted the merged result. This suggests inline synthesis produces variable-quality queries whose difficulty distribution is not stable across seeds. Future work should validate synthesis quality via held-out oracle retrieval before merging into eval pools.
8. Conclusion
Cross-session KV cache injection degrades quality primarily through structural misalignment—the loss of chat-template turn-boundary markers during archival—rather than through attention sink loss, RoPE mismatch, or cross-attention contamination. A five-arm wrapper ablation identifies marker-token-class binding as the mechanism: either bracket token suffices as a landmark, and a minimal 3-token E-wrapper captures 97.7–100% of the effect. A 2×\times×2 factorial (nnn=80) suggests boundary–marker synergy, though the interaction did not replicate at nnn=200 on different retrieval selections.
The per-chunk injection effect achieves formal significance on both primary models (Phi-3 ppp=0.013 at nnn=200; Qwen ppp<0.01 at nnn=467) and generalizes to the LoCoMo conversational memory benchmark (+5.3pt F1, nnn=200). Layer ablation and activation patching reveal complementary localization: upper layers dominate signal expression at prefill, while the full stack is necessary for causal reconstruction. The effect survives system-prompt drift, per-channel INT8 quantization, and transfers directionally to MLA architectures.
Future work: Frontier-scale MLA validation, additional standard benchmark evaluation, and hybrid approaches combining structural alignment with light adapter tuning.
References
- CacheBlend (2024). arXiv:2405.16444.
- DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434.
- Dettmers et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339.
- Dror et al. (2018). The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing. ACL P18-1128.
- FreeChunker (2025). A Cross-Granularity Chunking Framework. arXiv:2510.20356.
- Jeong (2026). Trained Persistent Memory for Frozen Decoder-Only LLMs. arXiv:2603.22329.
- LoopGuard (2026). Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention. arXiv:2604.10044.
- Maharana et al. (2024). LoCoMo: Long-Context Conversational Memory. NAACL 2024.
- Lu et al. (2021). Fantastically Ordered Prompts and Where to Find Them. ACL 2021. arXiv:2104.08786.
- Olsson et al. (2022). In Context Learning and Induction Heads. arXiv:2209.11895.
- PyramidKV (2024). arXiv:2406.02069.
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
- Xiao et al. (2023). Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453. Published as StreamingLLM, ICLR 2024.
Appendix A: E-Wrapper Production Specification
The E-wrapper is an ablation-validated minimal wrapper for production deployment. It retains 97.7–100% of the full-template effect at 18–22% token overhead.
Qwen-family (ChatML template)
<|im_start|>
{chunk}<|im_end|>
Tokens: 3 (<|im_start|>, \n, <|im_end|>)
Retention: 97.7% of full template (0.5375 / 0.5500)
Overhead: 22% of full-template tokens
Phi-3-family
<|end|>
{chunk}<|end|>
Tokens: 2 (<|end|>, <|end|>)
Retention: 100% of full template (0.6000 / 0.6000)
Overhead: 18% of full-template tokens
DeepSeek-family (V2-Lite template)
<|User|>
{chunk}<|end_of_sentence|>
Tokens: 3
Note: Validated directionally at n=200; formal sig not yet achieved.
Implementation note: The E-wrapper eliminates the system-prompt preamble entirely. For models where the system prompt carries task-critical instruction (e.g., "You are a medical assistant"), prepend the system prompt to the query turn rather than to each archived chunk. This preserves task context while minimizing per-chunk wrapper overhead.
Appendix B: YARN-Scaled RoPE Fix for KV Injection
Problem
DeepSeek V2/V3 and extended-context Llama-3 variants use YARN-scaled RoPE (factor ≥\ge≥ 40). A naive rope_shift_k using plain RoPE inv_freq over-rotates by up to the YARN factor on high dimensions, producing complete K-tensor corruption and gibberish output.
Symptom chain
- Inject produces token-collapse gibberish (e.g.,
"\n\n\n") - r@1 = 0.000 on both arms despite coherent pipeline + scorer
inject_no_shift(pos=0) works;inject_with_shiftbreaks — pointing torope_shift_k
Root cause
YARN modifies inv_freq by blending standard RoPE frequency with an interpolated (scaled-down) frequency based on a per-dimension ramp. At V2-Lite dimensions (qk_rope_head_dim=64), the plain/yarn ratio reaches 40×\times× on dimensions 23–31 (pure freq_inter regime). For typical chunk shifts of ~512 tokens, this means up to 20,480 radians of over-rotation — complete K corruption.
Fix
Extract yarn-adjusted inv_freq from the model's own rotary module:
rot = model.model.layers[0].self_attn.rotary_emb
yarn_inv_freq = rot.inv_freq.detach().to("cuda").to(torch.float32)
Use yarn_inv_freq in shift rotation math. Everything else (contiguous-half rotate_half, dtype handling, basis) matches the model's apply_rotary_pos_emb verbatim.
mscale note
For V2-Lite, mscale == mscale_all_dim == 0.707 →\rightarrow→ _mscale == 1.0. The cos/sin cache has no mscale scaling; shift composition as pure rotation is correct. If mscale were non-unity, the fix would need (1/mscale) * R_scaled(Δ) * k_stored.
Validation
| Mode | r@1 | Coherent |
|---|---|---|
| native_oracle | 0.70 | 10/10 |
| inject_fixed_shift | 0.60 | 10/10 |
| inject_no_shift | 0.50 | — |
Fix restores coherent output; fixed_shift reaches within 0.10 of native ceiling.
Appendix C: CPU-Tier Retrieval Stack Specification (Zellige Edge)
Recommended stack
| Layer | Component | Latency | Hardware | Cost (100K q/day) |
|---|---|---|---|---|
| Embedder | minishlab/potion-base-8M (zero-shot) | ~1ms | CPU | ~$0 |
| Sparse retrieval | BM25 (rank-bm25) | ~1ms | CPU | ~$0 |
| Fusion | RRF (k=60) | negligible | CPU | ~$0 |
| Reranker | BGE-Reranker-v2-m3 | ~50ms | GPU | ~$1.50 |
| Generator | per_chunk inject + E-wrapper | ~10ms | GPU | base model cost |
| Total retrieval | ~52ms | GPU-reranker-bound | ~$1.50 |
Comparison to GPU-heavy stack
| Layer | Qwen3-Embed-8B stack | Potion stack | Factor |
|---|---|---|---|
| Embed | ~200ms GPU | ~1ms CPU | 200×\times× |
| BM25 | ~1ms CPU | ~1ms CPU | 1×\times× |
| Reranker | ~50ms GPU | ~50ms GPU | 1×\times× |
| Total retrieval | ~250ms | ~52ms | 4.8×\times× |
Quality equivalence
End-to-end inject r@1 at nnn=80: Qwen3-Embed-8B = 0.488, zero-shot potion = 0.500. Formally indistinguishable (McNemar ppp=1.000). The BGE-Reranker bridges potion's –7.5pt pre-rerank retrieval gap.
Caveats
- If the reranker is ever dropped for latency, the embedder gap re-emerges fully.
- nnn=80 precision floor is ±\pm±3.75pt; equivalence is directional, not formally proven.
- Static-reranker distillation would complete the CPU tier (target: ~500ms CPU, eliminating the GPU reranker).
Appendix D: Rigor Chronology and Self-Corrections
This appendix documents the project's error-correction history as a rigor feature. Each correction strengthened the final claims.
-
Stage 1 coherence-heuristic false positive (2026-04-22). Initial V2-Lite n=3 "pass" used output-length + English-word count as a proxy for correctness. Formal JSON showed inject_r1_hits=0. Lesson: always use formal retrieval scoring, never coherence proxies.
-
YARN-scaled RoPE bug (2026-04-22).
rope_shift_kused plain RoPEinv_freqon a YARN model (factor=40) →\rightarrow→ 40×\times× over-rotation →\rightarrow→ gibberish. Fix: extract yarn-adjustedinv_freqfrommodel.model.layers[0].self_attn.rotary_emb. Lesson: use the model's own rotary module; don't recompute. -
Hybrid model lookup error (2026-04-22). Initial lookup returned wrong Qwen model (Qwen3-30B-A3B pure-MoE vs Qwen3.6-35B-A3B hybrid). Web-search verification caught the mismatch. Lesson: verify model names against training-cutoff knowledge; don't trust single-source lookups.
-
Cross-family table mis-transcription (2026-04-22). Initial analysis conflated E-wrapper minimization ablation with Phase 1 ship-gate comparison. Cross-check against prior documented results caught the error. Lesson: cross-check table values against prior docs before accepting.
-
Mistral Small 4 119B runtime wall (2026-04-22). transformers main 5.6.0.dev0 FP8 MoE incomplete →\rightarrow→ static mode breaks MoE, dynamic mode breaks linears. Aborted after load-test gate. Lesson: for bleeding-edge models, always do a load-test gate before committing eval compute.
-
INT8 "hard limitation" false conclusion (2026-04-22). Exp 5 initially reported per-tensor INT8 destroys the E-wrapper effect and concluded a precision limitation. Exp 5b ten-arm matrix revealed the failure is specific to per-tensor granularity; per-channel and per-token INT8 recover fully. Lesson: a single coarse-grained negative result does not establish a precision floor; debug the quantization scheme before declaring incompatibility.
-
Inline synthesis query quality assumption (2026-04-22). Assumed that generating more synthetic queries with the same model+seed would preserve difficulty distribution. Exp 1 n=367 showed new inline-synthetic queries (same generator, different seed) have lower quality and dilute the signal. Lesson: validate synthesis quality via oracle retrieval before merging into eval pools; difficulty distributions vary across seeds.
Appendix E: Quantization Compatibility Specification
Finding summary
The E-wrapper mechanism has a granularity floor, not a precision floor:
- Per-tensor INT8 (symmetric or asymmetric): destroys effect (−5-5−5pt)
- Per-channel INT8: full recovery (0pt loss)
- Per-token dynamic INT8: full recovery (0pt loss)
- K-only per-channel INT8 (V=bf16): full recovery (0pt loss)
Practical specification
For production KV archive compression, use per-channel or finer quantization:
# Per-channel symmetric INT8 (recommended)
scale = kv_tensor.abs().amax(dim=(-2, -1), keepdim=True) # [1, H, 1, 1]
kv_int8 = (kv_tensor / scale * 127).round().clamp(-128, 127).to(torch.int8)
kv_dequant = kv_int8.float() / 127 * scale
# Per-token dynamic (LLM.int8-style, alternative)
scale = kv_tensor.abs().amax(dim=-1, keepdim=True) # [1, H, S, 1]
kv_int8 = (kv_tensor / scale * 127).round().clamp(-128, 127).to(torch.int8)
kv_dequant = kv_int8.float() / 127 * scale
Avoid: per-tensor torch.quantize_per_tensor for KV archive storage. The single global scale squashes marker-token activation variance that drives the E-wrapper mechanism.
Storage impact
| Scheme | Bits per element | Relative size | Quality |
|---|---|---|---|
| bf16 | 16 | 1.0×\times× | baseline |
| Per-channel INT8 | 8 | 0.5×\times× | 0pt loss |
| Per-tensor INT8 | 8 | 0.5×\times× | −5-5−5pt (broken) |
| K-only per-ch INT8, V=bf16 | 12 | 0.75×\times× | 0pt loss |
Compatibility note
bitsandbytes LLM.int8() and load_in_8bit use per-channel quantization for weights but may apply per-tensor schemes to KV caches depending on backend. Verify your inference backend's KV quant granularity before deploying E-wrapper with INT8 compression.
Appendix F: MLA Full Results (DeepSeek V2-Lite)
Table F1. Cross-architecture validation (nnn=200, DeepSeek-V2-Lite-Chat, SDPA, YARN-adjusted rope_shift_k).
| Slice | nnn | baseline r@1 | per_chunk r@1 | Δ\DeltaΔ | 95% boot CI | McNemar ppp | Pr(Δ>0\Delta > 0Δ>0) |
|---|---|---|---|---|---|---|---|
| original | 80 | 0.413 | 0.438 | +0.025 | [–0.05, +0.10] | 0.754 | 0.683 |
| synthetic | 120 | 0.342 | 0.375 | +0.033 | [–0.025, +0.092] | 0.424 | 0.827 |
| merged | 200 | 0.370 | 0.400 | +0.030 | [–0.02, +0.08] | 0.307 | 0.866 |
YARN-RoPE debugging journey: The initial V2-Lite attempt reported a false-positive "pass" on a coherence heuristic (n=3); formal JSON showed r@1=0.000. Diagnostic ablation traced the bug to naive rope_shift_k using plain RoPE inv_freq on a YARN-scaled model—up to 40×\times× over-rotation on dimensions 23–31. Extracting the yarn-adjusted inv_freq from model.model.layers[0].self_attn.rotary_emb restored coherent output. See Appendix B for the full fix specification.
Appendix G: System Prompt Robustness
Table G1. Cross-context deployment viability (nnn=200, Qwen 2.5 7B SDPA).
| Arm | Archive prompt | Generation prompt | r@1 | Δ\DeltaΔ vs baseline |
|---|---|---|---|---|
| baseline | — | helpful assistant | 0.430 | — |
| same_prompt | helpful assistant | helpful assistant | 0.480 | +0.050 |
| drifted_prompt | helpful assistant | coding assistant | 0.480 | +0.050 |
Zero degradation (Δ\DeltaΔ=0.000 between same and drifted). Marker-class binding is invariant to system prompt content.
Appendix H: Query Pool Extension (Inline-Synthetic, Negative Result)
Table H1. Power-up attempt via inline query synthesis (nnn=367, Qwen 2.5 7B SDPA).
| Slice | nnn | baseline | per_chunk | Δ\DeltaΔ |
|---|---|---|---|---|
| Original pool | 200 | 0.430 | 0.480 | +0.050 |
| New synthetic | 167 | 0.437 | 0.431 | −0.006-0.006−0.006 |
| Merged | 367 | 0.433 | 0.458 | +0.025 |
Inline synthesis did not power up significance. The 167 new synthetic queries show no effect and dilute the merged result from +5.0pt to +2.5pt. Oracle-filtered extension (§5.4) succeeded where this approach failed, indicating that unfiltered inline synthesis does not match the difficulty distribution of the oracle-filtered pool.
Figures
Figure 3. Attention sink mass by layer and position. Sink mass on positions 0–3 drops to near-zero in injected archives without global BOS prepending.
Figure 4. Un-RoPE'd KV cosine similarity by layer. Archived vs fresh K/V cosine ≈\approx≈1.0 across all layers, confirming content equivalence.
Figure 5. Per-query delta distribution for Qwen n=467 oracle-filtered extension. Distribution of per-query δ\deltaδ = per_chunk −-− baseline across the merged 467-query pool.
Figure 6. TTFT comparison across injection arms. Raw inject, per_chunk, text-RAG, and full-context re-prefill latency on identical A10G hardware.