Boundary-Conditioned Structural Alignment for Cross-Session KV Cache Injection

Abstract

Cross-session KV cache injection archives per-layer key/value tensors from past conversations and splices them into future sessions, achieving 2.6–3.4 $×\times$ TTFT reduction versus strong text-RAG. Prior work identifies three failure modes (attention sink loss, cross-attention contamination, and RoPE position mismatch), but no training-free intervention has closed the quality gap. We introduce boundary-conditioned structural alignment, a training-free protocol that archives KV states with full chat-template wrapping and splices them at turn boundaries. A 2 $×\times$ 2 factorial decomposition on Qwen2.5-7B-Instruct ( $n$ =80) suggests an interaction between clean boundaries and structural markers that recovers +5.0pt r@1, though a replication at $n$ =200 indicates this interaction is sensitive to retrieval selection. A five-arm wrapper ablation refines the mechanism to marker-token-class binding: either opening or closing chat-template bracket token suffices as a turn-boundary landmark. A minimal E-wrapper (3 tokens) captures 97.7% of the full-template effect at 22% token overhead, validated cross-family on Phi-3-medium (100% retention).

Cross-family formal significance on Phi-3-medium-128k-instruct ( $p$ =0.013, $n$ =200) and merged Qwen2.5-7B-Instruct ( $p$ <0.01, $n$ =467) confirms the mechanism generalizes across template families. Evaluation on the LoCoMo conversational memory benchmark ( $n$ =200) shows +5.3pt F1 improvement, validating the effect on a standard external benchmark. Text-wrapped and system-prompt-drift controls confirm the benefit is structural and deployment-robust.

1. Introduction

Large language model deployments face a persistent latency–quality trade-off. For retrieval-augmented generation (RAG), time-to-first-token (TTFT) is dominated by context prefill, the linear-time forward pass over retrieved documents. For conversational memory systems, this cost compounds: each new session must re-process archived conversation history, even when that history has not changed.

KV cache injection offers an escape hatch. Instead of re-tokenizing and re-prefilling archived text, we store the per-layer key (K) and value (V) tensors computed during the original session and splice them directly into the new session's cache. Because K/V tensors are position-independent in the attention computation, naive concatenation is theoretically sufficient, and has been reported as viable at scale $≥\ge$ 7B (Xiao et al., 2023; Jeong, 2026), though with observed quality degradation.

The TTFT gains are substantial: injection reduces TTFT versus strong text-RAG by 2.6–3.4 $×\times$ on identical hardware. The cost is quality. Prior work reports that raw KV injection degrades retrieval accuracy by 5–10 points versus strong text-RAG baselines (CacheBlend, 2024; Jeong, 2026). Understanding why, and whether the gap can be closed without model training, is the central question of this work.

Three candidate failure modes

The KV injection literature proposes three mechanistic hypotheses for quality degradation:

Attention sink loss (Xiao et al., 2023). Decoder-only transformers sink 30–50% of deep-layer attention mass onto the first few sequence positions. Cross-session injection strips these positions, disrupting the attention topology that the model learned during pre-training. StreamingLLM mitigates this by globally prepending initial tokens, but its effectiveness for injected (not just appended) content is untested.
Cross-attention contamination (CacheBlend, 2024). Archived KV tensors were computed attending to session-A's preceding context. When injected into session-B with different preceding text, the archive's attention patterns are contextually mismatched: its K/V states encode "attend to X" where X is no longer present.
RoPE position mismatch (Su et al., 2021; Jeong, 2026). Rotary Position Embedding (RoPE) rotates K/Q tensors by position-dependent angles. Archiving at session-A positions and injecting at session-B positions creates angular misalignment in the QK dot product, potentially scrambling attention scores.

All three are plausible. None has been decomposed against the others in a controlled, training-free setting.

Our contribution

We decompose the 7.5pt quality gap on Qwen2.5-7B-Instruct using a six-arm factorial design ( $n$ =80) plus cross-family validation on Phi-3-medium-128k-instruct ( $n$ =200) and a first directional test on DeepSeek-V2-Lite MLA ( $n$ =200). Our findings:

Alternative failure modes do not explain the gap. Attention sink restoration (+0.0pt), RoPE un-rotation (cosine $≈\approx$ 1.0), and cross-attention isolation (~2.5pt, not significant at $n$ =80) each fail to close the quality deficit. The dominant driver is structural.
Boundary-conditioned structural alignment recovers +5.0–5.6pt. A 2 $×\times$ 2 factorial ( $n$ =80) suggests an interaction between clean token boundaries and chat-template role markers. A replication at $n$ =200 on different retrieval selections did not reproduce the interaction, indicating it is selection-sensitive; the per-chunk injection effect itself ( $Δ\Delta$ =+5.6pt, $p$ <0.01 at $n$ =467) is robust.
Marker-token-class binding is the mechanism. A five-arm ablation shows either bracket token (<|im_start|> or <|im_end|>) suffices as a landmark. A minimal E-wrapper (3 tokens) retains 97.7% of the effect at 22% token overhead, validated cross-family on Phi-3-medium-128k-instruct (100% retention). Instruction tuning amplifies the routing $≈\approx$ 2.5 $×\times$ rather than creating it.
Cross-family formal significance and standard benchmark validation. Phi-3-medium-128k-instruct ( $p$ =0.013, $n$ =200) and Qwen2.5-7B-Instruct ( $p$ <0.01, $n$ =467 oracle-filtered) both reach $α\alpha$ =0.05, with directional validation on DeepSeek-V2-Lite MLA ( $Δ\Delta$ =+3.0pt, $n$ =200). On the LoCoMo conversational memory benchmark (Maharana et al., 2024), E-wrapper injection improves answer F1 by +5.3pt ( $n$ =200).
Robustness validated across deployment axes. The effect is structural, not lexical (text-wrapped control $Δ\Delta$ =+0.005), survives system-prompt drift (zero degradation), and is compatible with per-channel/per-token INT8 KV quantization (ten-arm matrix, 0pt loss) but not per-tensor INT8 ( $- 5$ pt granularity floor).
Upper-layer dominance at prefill, distributed necessity at inference. Layer ablation localizes 55% of the effect to upper layers (14–27); activation patching shows full-stack residual transfer is required for causal reconstruction. The mechanism is distributed, not band-localized.

Methodological contribution: We discover that strong-RAG retrieval pipelines using BGE-Reranker-class cross-encoders exhibit non-deterministic top-3 selection (0/80 matching across fresh runs), making un-pinned cross-run comparisons uninterpretable. We introduce the Retrieval Determinism Protocol (pin selections once per run, reuse across inject arms) as a requirement for valid KV-reuse benchmarking.

Scope boundaries

This work is scoped to pure-attention decoder-only transformers with standard or YARN-scaled RoPE, and to MLA architectures at small-scale directional validation. Hybrid linear-attention architectures (e.g., Qwen3.6-35B-A3B, DeepSeek-V3) are explicitly out of scope: we establish catastrophic failure (r@1=10%) on partial KV injection, consistent with CacheBlend (2024). Sliding-window architectures (Gemma 3/4) are flagged as requiring an adapter path (global-layer-only inject), future work.

2. Related Work

2.1 Cross-Session KV Cache Reuse Systems

The most closely related published work is CacheBlend (Yao et al., 2024; arXiv:2405.16444), which addresses quality loss when precomputed KV caches are reused outside their original prefix context. CacheBlend identifies the core failure mode, "precomputed KV caches [are] not directly usable since they ignore the text's cross-attention with the preceding texts", and proposes selective recomputation of a small token subset to restore cross-attention fidelity, achieving 2.2–3.3 $×\times$ TTFT reduction with quality preserved across four benchmarks. Our work is complementary but distinct in two key dimensions: we operate at the cross-session episodic level (temporally separated prefills with different system context) rather than the within-session document-blending level CacheBlend targets, and we identify the most salient degradation pathway as boundary-conditioned structural misalignment rather than cross-attention contamination alone. Where CacheBlend offers a selective-recompute engineering fix, we contribute a mechanistic taxonomy and a training-free structural alternative.

Concurrent work: Jeong (2026; arXiv:2603.22329) independently investigates persistent cross-session memory in decoder-only LMs, testing six adapter methods (prefix, parallel cross-attention, KV extension, Hebbian memory, context-gated branch, and slot-based sparse write) on a frozen GPT-2 backbone evaluated on LoCoMo. Their finding that "architectural priors matter" at low capacity (cross-attention, Hebbian, and slot-write achieve 7–18% retained-memory scores at 1 $×\times$ capacity; the other three fail at $<$ 0.4%) independently converges with our structural-topology-fidelity result. The key differentiation is paradigmatic: Jeong requires training small adapters on a tiny frozen backbone (GPT-2); our approach is training-free structural alignment at archive time on production-scale instruction-tuned models (Qwen 2.5 7B, Phi-3 14B, DeepSeek V2-Lite). We additionally evaluate against a stronger baseline (modern embedding + BM25 + reranker pipeline), contribute cross-family and cross-architecture validation, and introduce a retrieval-determinism protocol absent from prior work.

Independent mechanism validation: LoopGuard (2026; arXiv:2604.10044) demonstrates that KV cache reuse can induce pathological attention patterns: specifically, "collapsed attention" where a subset of heads locks onto a narrow suffix, stabilized by inference-time KV cache reuse. While LoopGuard addresses repetition loops in single-session long-context generation rather than cross-session episodic injection, it independently confirms that KV reuse creates attention pathologies, corroborating our degradation-mechanism framing.

2.2 Attention Mechanisms and KV Cache Management

Attention sinks: Xiao et al. (2023; arXiv:2309.17453, ICLR 2024) demonstrated that initial tokens in autoregressive LMs attract disproportionate attention mass regardless of semantic content, because SoftMax normalization requires a denominator anchor: four initial tokens empirically suffice for full perplexity recovery on Llama-2-7B. We cite StreamingLLM as a foundational mechanism reference, but our diagnostic experiments find no evidence that classical attention sink loss is the primary degradation pathway: global prepending of BOS + first four tokens (the StreamingLLM fix) produced zero quality improvement (+0.0pt), and un-RoPE'd KV cosine similarity $≈\approx$ 1.0 argues against RoPE position mismatch as the dominant driver. Attention sinks are real but are a symptom of prefix absence in our setting, not an independent causal mechanism.

Layer-wise attention heterogeneity: PyramidKV (2024; arXiv:2406.02069) establishes a "pyramidal information funneling" pattern: lower transformer layers scatter attention broadly, while upper layers concentrate mass on structurally significant tokens. This directly informs our mechanism: structural topology fidelity (what chat-template wrapping restores) matters most in upper layers where induction-head routing is already concentrated. Our layer ablation (Section 5.3, Experiment 4) confirms this: 55% of the E-wrapper effect is attributable to upper layers (14–27), with zero measurable contribution from lower layers (0–13).

2.3 Retrieval-Augmented Generation and Chunking

Token-boundary effects: FreeChunker (2025; arXiv:2510.20356) establishes that sentence-level atomic chunking improves RAG retrieval quality over fixed-granularity approaches that produce arbitrary token boundaries. Our mechanism subsumes this finding: per-chunk chat-template wrapping enforces complete-sentence boundaries (role end markers fall at turn boundaries), producing the clean-boundary condition our 2 $×\times$ 2 factorial decomposition identifies as necessary, though not sufficient, for quality recovery.

Format sensitivity: Lu et al. (2021; arXiv:2104.08786, ACL 2021) showed that prompt permutation causes near-random to near-SOTA performance variance (up to 13% improvement from ordering alone) in few-shot prompting. Our per-chunk finding is a domain-specific instantiation of this general result: the "right" structural format for archived KV is precisely the format the model was trained on, and deviations cause systematic quality degradation.

2.4 Theoretical Basis: Marker-Token-Class Binding and Induction-Head Routing

Olsson et al. (2022; arXiv:2209.11895) demonstrated that transformer attention heads implement [A][B]...[A] $→\rightarrow$ [B] pattern completion ("induction heads"), hypothesized as the mechanistic basis for in-context learning. Chat-template role markers (<|im_start|>user, <|user|>) follow exactly this structure, establishing fixed prefix sequences the model learned to route from during training.

Our contribution extends this framing in two ways. First, we show that induction-head activation is conditional: the role-marker trigger ([A]) requires a well-formed content boundary ([B]) to complete the pattern. This conditional-activation hypothesis, supported by our 2 $×\times$ 2 factorial design (markers alone: +1.3pt; clean boundaries alone: –1.2pt; conjunction: +5.0pt), is more mechanistically precise than generic "format sensitivity."

Second, our five-arm wrapper ablation (Tier 1 Experiment 2) refines the routed token from "prefix position" to marker-token-class. Either <|im_start|> or <|im_end|> suffices as the landmark; the bracket pair is substitutable, not asymmetric. Role-name tokens (user, assistant) add a small independent signal (+1.25pt), but the strongest signal is from the distinctive marker token class: any canonical turn-boundary landmark token appears to trigger the IT-learned attention pathway. This explains cross-family generalization: Phi-3's <|user|> and <|end|> serve the same landmark function as Qwen's <|im_start|> and <|im_end|>, despite different vocabulary and template structure.

Statistical methodology: Exact McNemar testing for paired proportions at $n$ $<$ 200 follows recommendations in Dror et al. (2018; ACL P18-1128).

2.5 Positioning

Our work occupies a training-free structural-alignment niche complementary to all prior approaches. CacheBlend's selective-recompute method addresses cross-attention contamination at inference time and requires modified inference infrastructure; we address structural misalignment at archive time with zero inference overhead. Jeong's trained-adapter approach achieves persistent memory via learned architectural priors; we achieve analogous structural alignment at zero adapter cost on models an order of magnitude larger. Our MLA validation extends the mechanism to the dominant 2026 frontier architecture class without training. LoopGuard validates our degradation-mechanism framing from an orthogonal angle. We additionally contribute a methodological protocol for retrieval-determinism in KV-reuse benchmarking and a YARN-RoPE debugging guide for practitioners re-running injection on extended-context models.

3. Methods

3.1 Overview

We evaluate cross-session KV cache injection on decoder-only transformer language models, measuring retrieval quality (r@1) and time-to-first-token (TTFT) against a strong text-RAG baseline. Our core question: can archived KV states from past conversations be injected into future sessions without quality degradation, and if not, what mechanism explains the gap?

3.2 Models

Primary model: Qwen2.5-7B-Instruct (32 layers, GQA with 4 KV heads, RoPE position encoding, ChatML template). Evaluated under SDPA attention kernel (production default) and EAGER attention kernel (diagnostic instrument).

Cross-family validation: Phi-3-medium-128k-instruct (40 layers, GQA, RoPE, Phi-3 chat template). Evaluated under SDPA only.

Base-model control: Qwen2.5-7B (non-instruct, shared tokenizer). Evaluated under SDPA to test whether the structural-alignment effect is architecture-intrinsic or IT-learned.

MLA validation: DeepSeek-V2-Lite-Chat (27 layers, MLA with qk_nope=128 / qk_rope=64, YARN-scaled RoPE with factor=40). Evaluated under SDPA only.

Blocked models: Qwen3.6-35B-A3B (hybrid linear-attention) fails catastrophically (r@1=10%) on partial injection, establishing the scope boundary.

3.3 Benchmark

Long-form Conversation Memory (LoCoMo) benchmark: 200 multi-turn conversations with ground-truth fact retrieval. We use a held-out subset: 80 organic user queries + 120 synthetic queries generated by Qwen2.5-7B-Instruct. Synthetic queries correlate more tightly with target chunks (gt_in_top3: 0.90 vs 0.825) due to generation-model phrasing overlap. We report results stratified by slice and include synthetic-penalty sensitivity analysis.

3.4 Retrieval Pipeline

Text-RAG baseline (strong): Qwen3-Embed-8B embeddings + BM25 hybrid retrieval + BGE-Reranker-v2-m3 top-3 selection. Full reranker scores computed per query.

Inject baseline: Top-3 chunks selected by identical strong-RAG pipeline, archived KV computed via model.forward(use_cache=True) on each chunk, spliced into DynamicCache at inject point. No position math (naive concat validated at $≥\ge$ 7B).

Inject+reranker: Same top-3 selection as strong-RAG, inject only. Suggests the gap is mechanism-intrinsic, not selection quality.

CPU-tier probe: Zero-shot potion embeddings (minishlab/potion-base-8M, 256-dim, CPU) + BM25 + BGE-Reranker-v2-m3. Tests whether GPU-grade dense embedders are necessary for end-to-end inject quality.

3.5 Phase 0 Experimental Matrix

Qwen 2.5 7B: 8-arm matrix

Arm	Archive	Inject	Kernel	Purpose
Baseline	N/A	text-RAG (strong)	SDPA	Quality ceiling
Inject	raw chunk	raw KV splice	SDPA	Mechanism baseline
Inject+reranker	raw chunk	raw KV splice	SDPA	Selection control
Global sink	raw chunk	BOS+pos 0–3 + raw KV	SDPA	Classical sink hypothesis
Cond 1 (prefix)	raw chunk	wrapper prefix + raw KV	SDPA	Prefix-alignment hypothesis
Per_chunk	wrapped chunk (template)	wrapped KV splice	SDPA	Structural alignment hypothesis
Cond 1 EAGER	raw chunk	wrapper prefix + raw KV	EAGER	Kernel-sensitivity probe
Per_chunk EAGER	wrapped chunk	wrapped KV splice	EAGER	Kernel-sensitivity probe

Wrapper protocol (Qwen ChatML):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{chunk}<|im_end|>

First 4 wrapper tokens (<|im_start|>, system, \n, You) serve as structural anchors.

Phi-3 14B: 3-arm matrix

Arm	Archive	Inject	Purpose
Baseline	N/A	text-RAG (strong)	Quality ceiling
Cond 1	raw chunk	wrapper prefix + raw KV	Cross-family prefix test
Per_chunk	wrapped chunk	wrapped KV splice	Cross-family structural test

Wrapper protocol (Phi-3 template):

<|user|>
{chunk}<|end|>

Pick 1+2: 2 $×\times$ 2 Mechanism Decomposition (Qwen 2.5 7B)

To isolate the active ingredients of per-chunk structural wrapping, we test a factorial 2 $×\times$ 2 design crossing boundary cleanliness with role-marker presence:

Cell	Boundary	Role Markers	Expected Mechanism
A	Arbitrary slice (mid-word)	Stripped (no markers)	Raw-text baseline—both factors absent
B	Arbitrary slice	Markers intact	Role-marker routing alone
C	Sentence boundary	Stripped	Clean boundaries alone
D	Sentence boundary	Markers intact	Both factors—synergistic interaction

Boundary manipulation: Sentence-boundary slicing cuts at complete sentences (. or \n delimiters). Arbitrary slicing cuts at fixed token counts, potentially mid-word.

Role-marker manipulation: Stripped archives remove all chat-template role markers (<|im_start|>, system, user, assistant) leaving only raw content. Intact archives preserve full template structure.

Key hypothesis: Neither factor alone produces substantial improvement. Their conjunction is required: clean boundaries provide the context for role-marker routing to activate, and role markers provide the routing signal that boundaries alone cannot.

3.6 Tier 1 Experimental Bundle

Three orthogonal experiments on Qwen 2.5 7B and Phi-3 14B, SDPA kernel, matched-pairs, $n$ =40–80.

Experiment 1: Base-Model Control ( $n$ =80)

Tests whether structural alignment is architecture-intrinsic or instruction-tuning-dependent. Arms: baseline + per_chunk on Qwen2.5-7B base (non-instruct) versus Qwen2.5-7B-Instruct reference.

Experiment 2: Wrapper Ablation ( $n$ =80, 5-arm)

Tests which template tokens drive the effect:

Arm	prefix	suffix	Tokens	Purpose
A_full	full template	`<	im_end	>`
B_opening	full opening	(none)	14	Opening-marker sufficiency
C_closing	`user\n`	`<	im_end	>`
D_role	`user\n`	(none)	2	Role-name alone
E_markers	`<	im_start	>\n`	`<

Product-ready recommendation: E_markers (<|im_start|>\n[chunk]<|im_end|>): 3 tokens, minimal overhead.

Experiment 2B: Phi-3 E-Wrapper Cross-Family ( $n$ =80, 2-arm)

Validates whether E-wrapper minimalism generalizes beyond Qwen:

Arm	prefix	suffix	Tokens
A_full_phi3	`<	system	>...<
E_markers_phi3	`<	end	>\n`

Experiment 3: Multi-Hop Degradation ( $n$ =40, 3-condition)

Tests archive noise sensitivity. Conditions: (a) 0 filler turns between archive and query, (b) 1 filler turn, (c) 2 filler turns. Fillers generated by the model itself, archived identically to chunks.

Experiment 4: Upper-Layer Localization ( $n$ =200)

Tests whether the E-wrapper effect is uniformly distributed across transformer layers or concentrated in upper layers where induction-head routing is strongest (PyramidKV, 2024). Four arms on Qwen 2.5 7B SDPA: abl_baseline (no wrapping), abl_pc_full (all 28 layers wrapped), abl_upper_mixed (layers 14–27 wrapped, 0–13 raw), abl_lower_mixed (layers 0–13 wrapped, 14–27 raw). Split at layer 14 (50% boundary). Same pinned selections as Phase 1. The abl_ prefix distinguishes these prefill-wrap ablation arms from the patch_ arms used in the activation-patching experiment of §4.6.

3.7 MLA Methods: YARN-Scaled RoPE Handling

DeepSeek-V2-Lite uses YARN-scaled RoPE (factor=40, original_max_position_embeddings=4096) rather than plain RoPE. Naive rope_shift_k using standard `inv_freq = 1/\text{rope_theta}^{2i/d} $o v er - r o t a t es b y u pt o 40$ \times$ on dimensions 23–31, producing complete K-tensor corruption and gibberish output.

Fix: Extract yarn-adjusted inv_freq from the model's own rotary module:

rot = model.model.layers[0].self_attn.rotary_emb
yarn_inv_freq = rot.inv_freq.detach().to("cuda").to(torch.float32)

Use yarn_inv_freq in shift rotation; mscale cancels cleanly for V2-Lite (mscale == mscale_all_dim == 0.707 $→\rightarrow$ _mscale == 1.0). This fix generalizes to any yarn-scaled architecture (DeepSeek V2/V3, extended-context Llama-3 variants).

3.8 Retrieval Determinism Protocol

Strong-RAG retrieval pipelines using BGE-Reranker-class cross-encoders exhibit non-deterministic top-3 selection across fresh runs (0/80 queries matched between two identical pipeline invocations), primarily due to FP16 tie-breaking in score argsort when candidate scores cluster within sub-ULP gaps.

Protocol: We pin top-3 selections once per retrieval pipeline run and reuse identical selections across all inject arms. This ensures observed deltas reflect injection mechanism differences, not retrieval variance. Cross-run absolute scores may vary; only intra-run relative deltas are valid. This protocol is a requirement for any KV-reuse benchmark using near-deterministic rerankers.

3.9 n=200 Extension Protocol

Pool synthesis: 80 organic + 120 synthetic queries, evaluated under identical retrieval pipeline. Stratified reporting by slice.

Statistical tests:

Primary: Exact McNemar test (binomial CDF, not chi-square approximation) for paired proportions. Justified per Dror et al. (2018, ACL P18-1128) for small-sample paired NLP evaluation.
Effect size: Paired bootstrap 95% CI (10,000 resamples), Pr( $δ\delta$ $>$ 0).
Power: $n$ =200 provides 80% power to detect 5pt delta at $α\alpha$ =0.05, baseline 50%.

Synthetic penalty sensitivity: Artificially degrade synthetic 120 baseline by randomly nullifying correct retrievals to match original 80 difficulty (7.5% degradation). Recompute per_chunk delta on penalized synthetic slice. Mechanism claimed robust only if delta holds on both original 80 and penalized synthetic 120.

3.10 Diagnostic Instruments

All instruments run under EAGER attention (required for output_attentions=True hooks).

Attention entropy + sink-mass probe: Per-layer, per-head attention entropy: $-\sum_i a_i \log(a_i)$ where $a_i$ is attention weight. Sink-mass ratio: $∑i∈{0,1,2,3}ai\sum_{i\in\{0,1,2,3\}} a_i$ , the attention mass on the first 4 positions.

Cross-attention probe:

Condition 1: Archive KV from session A, inject into session B with identical preceding text (same system prompt + prefix).
Condition 2: Same archive, inject into session B with different preceding text (standard setup).
Delta = cross-attention contamination contribution.

KV cosine probe:

Raw: cosine between archived K and fresh-prefill K of identical text
Un-RoPE'd: after stripping RoPE rotation from both, recompute cosine
Low raw + high un-RoPE'd $→\rightarrow$ RoPE mismatch dominates
Low raw + low un-RoPE'd $→\rightarrow$ cross-attention contamination dominates

3.11 Evaluation Metrics

r@1: Top-1 retrieval accuracy (exact match to ground-truth chunk)
r@5: Top-5 retrieval accuracy
TTFT: Time from query submission to first generated token, milliseconds
Delta: r@1 difference vs inject baseline (positive = improvement)

3.12 Text-Wrapped RAG Control

To isolate whether the E-wrapper benefit is structural (KV injection) or lexical (chat-template marker strings in plain text), we test a two-arm control ( $n$ =200). Both arms use identical strong-RAG retrieval and identical chunk content. The plain_rag arm formats chunks as raw text in the prompt context. The wrapped_rag arm wraps each chunk as <|im_start|>user\n{chunk}<|im_end|> in plain text before prompt assembly, with no KV injection. If wrapped_rag matches per_chunk inject, the effect is lexical; if it matches plain_rag, the effect is structural.

3.13 INT8 Quantization Robustness

We test whether the E-wrapper effect survives INT8 KV cache compression. A preliminary two-arm test ( $n$ =200) compares bf16 per_chunk against per-tensor symmetric INT8. A follow-up ten-arm matrix ( $n$ =80) decomposes the failure mode:

Arm	Quantization scheme
pc_bf16	bf16 control
pt_asym	Per-tensor asymmetric INT8
pt_sym	Per-tensor symmetric INT8
per_ch	Per-head (channel) INT8
per_tok	Per-token dynamic INT8
selective	bf16 sink+tail tokens, INT8 middle
k_only	K=INT8 per-channel, V=bf16
layer_low	Layers 0–13 INT8 per-ch, 14–27 bf16
layer_hi	Layers 0–13 bf16, 14–27 INT8 per-ch

All INT8 arms quantize on-the-fly at cache-build time from a single shared bf16 archive, avoiding memory overhead. Per-channel scale = amax(dim=(-2,-1), keepdim) → shape [1, H, 1, 1]. Per-token scale = amax(dim=-1) → [1, H, S, 1].

3.14 System Prompt Drift

To test cross-context deployment viability, we archive KV with a "helpful assistant" system prompt and inject into generation sessions with either the same prompt or a drifted "coding assistant" prompt. Both use E-wrapper per_chunk injection ( $n$ =200). Zero delta between same and drifted confirms marker-class binding is invariant to system prompt.

3.15 Query Pool Extension

To power the Qwen result toward formal significance we extend the original 200-query pool via oracle-filtered generation: 267 additional queries (seed=99) admitted only when their ground-truth chunk is retrievable within top- $K$ , matching the original pool's quality distribution. An earlier inline-synthetic attempt ( $n$ =167, unfiltered) diluted the signal and is reported as a negative result in Appendix H. The oracle-filtered extension yields merged $n$ =467 with identical pipeline, kernel, and E-wrapper configuration as Phase 1 (§5.4, Table 1c).

4. Experiments

4.1 Phase 0: Mechanism Decomposition ( $n$ =80)

All Phase 0 arms ran on the same 80-query subset with identical pinned top-3 selections. Qwen 2.5 7B was evaluated under both SDPA (production) and EAGER (diagnostic) kernels; Phi-3 14B under SDPA only. The 2 $×\times$ 2 factorial (Picks 1+2) was run on Qwen SDPA only.

4.2 Phase 1: Cross-Family Validation ( $n$ =200)

Qwen 2.5 7B SDPA: 80 organic + 120 synthetic queries, stratified reporting. Phi-3 14B SDPA: same 200-query pool. Both used pinned selections intra-run.

4.3 Tier 1: Mechanism Refinement ( $n$ =40–80)

Three orthogonal experiment bundles (with one cross-family sub-replication): (1) base-model control ( $n$ =80), (2) five-arm wrapper ablation ( $n$ =80) plus Phi-3 E-wrapper cross-family ( $n$ =80), (3) multi-hop degradation ( $n$ =40). All SDPA, pinned selections.

4.3b Layer Ablation ( $n$ =200)

Four-arm layer ablation on Qwen 2.5 7B SDPA testing upper vs lower layer contribution to the E-wrapper effect. Upper/lower split at layer 14. Same pinned selections as Phase 1.

4.4 MLA Stage 1: DeepSeek V2-Lite ( $n$ =200)

Two-arm matched-pairs (baseline vs per_chunk E-wrapper) on DeepSeek-V2-Lite-Chat under SDPA, with YARN-adjusted rope_shift_k. Same 200-query pool.

4.5 CPU-Tier Probe ( $n$ =80)

Three-frontend end-to-end inject evaluation: Qwen3-Embed-8B (baseline), adapted potion, zero-shot potion. Matched-pairs, per_chunk inject, SDPA.

4.6 Activation Patching ( $n$ =20, Phi-3-medium-128k-instruct)

To probe which stages of the transformer computation carry the E-wrapper effect causally, we run an 8-arm activation-patching sweep on Phi-3-medium-128k-instruct at $n$ =20. The patch_baseline arm generates from an archive prefilled without E-wrapping. The patch_pc_full arm injects the full E-wrapper archive as the reference. Six further arms replace the residual stream at marker-token positions during generation using activations harvested from the patch_pc_full reference run, across different layer bands:

patch_denoise_early (layers 0–5)
patch_denoise_mid (layers 10–15)
patch_denoise_upper (layers 20–32)
patch_denoise_all (all layers)
patch_denoise_random (randomly chosen layer set, negative control)
patch_noise_upper (inject corrupted residuals at upper layers, noise control)

Patches apply at marker-token positions only; non-band layers use zeros at marker positions and baseline values elsewhere. (Note on arm-naming convention: patch_pc_full refers to activation-patching arms and is distinct from abl_pc_full used in the layer-ablation sweep of §4.3b. The two probes measure different operations; see the distributed-vs-localized reconciliation below.)

The diagnostic probes described in §3.10 (attention entropy, sink-mass ratio, KV cosine similarity) are referenced in §5.1 for the alternative-mechanism rejection argument.

Distributed necessity vs upper-layer dominance: reconciliation. The layer-ablation results (§4.3b, Exp3v2) and activation-patching results above appear in tension: Exp3v2 finds that upper-layer wrapping (layers 14–27 on Qwen) recovers $≈\approx$ 55% of the E-wrapper effect while lower-layer wrapping contributes nothing, yet activation patching finds that replacing marker-position residuals in upper layers alone (layers 20–32 on Phi-3) transfers 0pt of the effect while replacing residuals across all layers recovers the full +5pt. These findings are not contradictory: they measure different operations at different stages of computation.

Exp3v2 is a prefill-wrap ablation: it varies which layers receive structurally wrapped KV during the archive prefill phase. The result shows that upper layers are where the structural topology (marker-token-class binding) concentrates its contribution to downstream attention routing, consistent with PyramidKV's pyramidal information funneling and with the concentration of induction-head behavior in upper/middle layers (Olsson et al., 2022). Lower layers process local syntactic patterns where structural wrapping adds no value.

Activation patching is a residual-swap intervention: it replaces the residual stream at marker-token positions in specific layer bands during inference, while the model re-computes all downstream transformations on the patched state. The patch_denoise_upper=0pt result indicates that upper-layer marker-position residuals alone are insufficient for effect reconstruction: the full layer stack is required to reconstruct the E-wrapper routing pattern from a patched state. This is expected mechanically: residual patching at layer $L$ propagates through layers $L{+}1..N$ , and if layers $0..L{-}1$ were computed on unwrapped state the routing context arriving at $L$ is already incomplete.

Together the two probes give a coherent picture: upper layers dominate where the structural signal is expressed (Exp3v2), while the full stack is necessary for causal reconstruction (activation patching). For product engineering, Exp3v2's upper-dominance supports storage-optimized archiving: upper-layer-only archiving trades storage for partial effect preservation ( $≈\approx$ 55% of full-wrap quality at roughly half the layers stored). For mechanism understanding, activation patching confirms that marker-token-class binding is a distributed computation, not localized to a single layer band, even when its quality contribution concentrates in upper layers.

4.7 Text-Wrapped RAG Control ( $n$ =200)

Two-arm control (plain_rag vs wrapped_rag) on Qwen 2.5 7B SDPA. Tests whether chat-template markers in plain text (no KV injection) reproduce the E-wrapper benefit.

4.8 INT8 Quantization Robustness ( $n$ =200 + $n$ =80 matrix)

Preliminary two-arm test (bf16 vs per-tensor INT8, $n$ =200) followed by ten-arm granularity matrix ( $n$ =80). Tests per-tensor, per-channel, per-token, selective, k_only, and layer-split quantization schemes on Qwen 2.5 7B SDPA.

4.9 System Prompt Drift ( $n$ =200)

Three-arm test (baseline vs same_prompt vs drifted_prompt) on Qwen 2.5 7B SDPA. Archives KV with "helpful assistant" system prompt; generation uses either "helpful" or "coding" assistant prompt.

4.10 Query Pool Extension ( $n$ =367 inline-synthetic, then $n$ =467 oracle-filtered)

Two successive extensions of the Qwen 2.5 7B SDPA evaluation pool. The first merges 167 inline-synthetic queries with the original 200 ( $n$ =367, negative result, §5.4 supplementary). The second, reported as the primary Qwen significance result, applies the Phase 1 oracle-filtered admission criterion to 267 fresh queries (seed=99) and merges with the original 200 ( $n$ =467). Both runs use identical retrieval pipeline, kernel, and E-wrapper configuration as Phase 1; only the query pool is extended. Stratified reporting compares original 200, new oracle-filtered 267, and merged 467 slices.

5. Results

5.1 Phase 0: Alternative Mechanism Rejection

Table 1a (Qwen 2.5 7B, $n$ =80) and Table 1b (Phi-3 14B, $n$ =80) report the full Phase 0 kernel-matched matrix.

Model	Kernel	Arm	r@1	$Δ\Delta$ vs baseline	Pr( $δ>0\delta > 0$ )	McNemar $p$
Qwen 7B	SDPA	baseline	0.5125	—	—	—
Qwen 7B	SDPA	global sink	0.5125	+0.000	0.50	—
Qwen 7B	SDPA	cond1 replicate	0.5375	+2.5pt	0.65	0.80
Qwen 7B	SDPA	per_chunk sink	0.5750	+6.25pt	0.90	0.27
Qwen 7B	EAGER	baseline	0.4750	—	—	—
Qwen 7B	EAGER	cond1	0.5625	+8.75pt	0.95	0.14
Phi-3 14B	SDPA	baseline	0.5750	—	—	—
Phi-3 14B	SDPA	cond1	0.6000	+2.5pt	0.69	0.75
Phi-3 14B	SDPA	per_chunk sink	0.6000	+2.5pt	0.69	0.75

Note: Per-query discordant counts for Qwen Phase 0 arms in Table 1a were computed from internal result logs reviewed by the author team. Raw per-query results are available on request.

Three observations emerge from this table:

Classical attention sinks show no reliable signal. Global BOS+positions 0–3 prepending (StreamingLLM-style) produces exactly zero gain (+0.0pt, Pr=0.50). Sink-mass loss is a real attention-level artifact (–98% on positions 0–3; Figure 3) but it is a symptom of prefix absence, not an independent causal mechanism at this sample size.
RoPE mismatch is not the dominant driver. The KV cosine probe shows un-RoPE'd K/V cosine $≈\approx$ 1.0 across layers (Figure 4), confirming archived and fresh representations are content-equivalent. Position rotation alone does not explain the gap at 7B scale with rope_theta=1e6.
Cross-attention contamination is PARTIAL. Cond 1 (identical-prefix inject) under SDPA gives only +2.5pt, not significant at $n$ =80. The remaining ~5pt gap after per_chunk suggests contamination is real but secondary to structural-alignment failure.

5.2 2 $×\times$ 2 Mechanism Decomposition: Synergistic Interaction

To test whether boundary cleanliness or role-marker routing is the primary driver, we run a factorial 2 $×\times$ 2 design on Qwen SDPA ( $n$ =80). Full statistics in Table 2.

Cell	Boundary	Role Markers	r@1 (SDPA, $n$ =80)	Delta
A	Arbitrary	Stripped	0.500	Baseline
B	Arbitrary	Markers	0.512	+1.2pt
C	Sentence	Stripped	0.487	–1.3pt
D	Sentence	Markers	0.550	+5.0pt

Neither main effect is substantial. Role markers alone (Cell B) produce negligible improvement (+1.2pt). Clean boundaries alone (Cell C) degrade quality below baseline (–1.3pt). The observed effect is concentrated in the interaction (Cell D): the conjunction of clean boundaries AND role markers produces +5.0pt. The interaction term itself equals +5.0pt (as large as the combined effect), consistent with a super-additive pattern, though the interaction is not independently significant at $n$ =80 per cell.

Replication at $n$ =200 (different retrieval selections). A replication of the 2 $×\times$ 2 factorial at $n$ =200 using independently pinned retrieval selections yielded: arbitrary_stripped 0.430 (baseline), arbitrary_intact 0.400 ( $- 3.0$ pt), sentence_stripped 0.450 (+2.0pt), sentence_intact 0.410 ( $- 2.0$ pt). The interaction term is $- 1.0$ pt: the combined treatment no longer outperforms baseline. Because the $n$ =200 run uses different pinned selections (per §3.8, BGE-Reranker selections are non-deterministic across runs), this result does not directly contradict the $n$ =80 finding but does indicate that the factorial interaction is selection-sensitive and should not be treated as a robust standalone finding. The per-chunk injection effect itself ( $Δ\Delta$ =+5.6pt, $p$ <0.01 at merged $n$ =467) is robust across multiple retrieval selections and remains the primary statistical claim.

Interpretation: The factorial decomposition provides a plausible mechanistic hypothesis (boundary–marker synergy) but the interaction is not independently reproducible at current sample sizes. The mechanism story is better anchored in the five-arm wrapper ablation (§5.3 Experiment 2), which demonstrates marker-token-class binding directly and does not depend on cross-run selection matching.

5.3 Tier 1: Mechanism Refinement

Experiment 1: IT Amplifies, Not Creates ( $n$ =80)

Table 3. Base-model control.

Model	Variant	baseline r@1	per_chunk r@1	$Δ\Delta$	95% boot CI	McNemar $p$	Pr( $δ>0\delta > 0$ )
Qwen 2.5 7B Instruct (ref)	IT	0.5125	0.5750	+0.0625	—	—	—
Qwen 2.5 7B Base	no-IT	0.4875	0.5125	+0.0250	[–0.050, +0.100]	0.754	0.682

The base model shows a directionally positive +2.5pt effect (n=80, formally null, p=0.754). This falsifies both strict hypotheses: the effect is neither purely IT-learned (base $≈\approx$ 0) nor purely architecture-intrinsic (base $≈\approx$ instruct). The most consistent interpretation is that instruction tuning amplifies an existing architectural capability by ~2.5 $×\times$ . The <|im_*|> marker tokens exist in the shared tokenizer vocabulary; even an un-tuned base model can use them as weak landmark anchors, and IT sharpens this routing pattern.

Experiment 2: Marker-Token-Class Binding ( $n$ =80)

Table 4. Five-arm wrapper ablation on Qwen 2.5 7B Instruct.

Arm	r@1 [Wilson 95%]	$Δ\Delta$ vs A	95% boot CI	McNemar $p$	Pr( $δ>0\delta > 0$ )	discordant (A/arm)
A_full	0.5500	—	—	—	—	—
B_opening_only	0.5500	0.000	[–0.088, +0.088]	1.000	0.451	7 / 7
C_closing_only	0.5625	+0.0125	[–0.063, +0.100]	1.000	0.565	5 / 6
D_role_name_only	0.4875	–0.0625	[–0.163, +0.038]	0.332	0.089	11 / 6
E_markers_no_role	0.5375	–0.0125	[–0.100, +0.075]	1.000	0.329	6 / 5

Key findings:

Opening marker alone (B) matches full template (A): the 14-token preamble is unnecessary.
Closing marker alone (C) also matches A: the bracket pair is substitutable, not asymmetric.
Role name alone (D) collapses to baseline: user/assistant tokens without markers provide no routing signal.
Minimal markers (E) retains 97.7% of A at 3 tokens vs 15.

Mechanism: marker-token-class binding. Either <|im_start|> or <|im_end|> suffices as a distinctive turn-boundary landmark. The specific position (opening vs closing) is secondary; what matters is the presence of a canonical marker token that the model's induction-head circuitry learned to route through during instruction tuning. Role-name tokens add a small independent signal (+1.25pt C vs E, within noise), but the strongest signal is the marker class.

Saturation evidence: Closing marker adds nothing when opening is present (A $≡\equiv$ B). But closing rescues when opening is weak (C – D = +7.5pt). The mechanism has a ceiling: once triggered by any canonical <|im_*|> token, redundant markers don't stack.

Experiment 2B: E-Wrapper Cross-Family ( $n$ =80)

Table 5. Phi-3-medium 14B validation.

Arm	r@1	$Δ\Delta$ vs A	95% boot CI	McNemar $p$	Pr( $δ>0\delta > 0$ )
A_full_phi3	0.6000	—	—	—	—
E_markers_phi3	0.6000	0.0000	[–0.063, +0.063]	1.000	0.422

Token overhead: E-wrapper = 2 tokens vs A_full's 11 = 18% of full-template tokens.

Cross-family comparison:

Model	A_full r@1	E_markers r@1	$Δ\Delta$ (E – A)	Retention
Qwen 2.5 7B	0.5500	0.5375	–0.0125	97.7%
Phi-3-medium 14B	0.6000	0.6000	0.0000	100%

Phi-3 is more tolerant of wrapper minimization than Qwen. Both families converge on "the <|end|>-class token is the sufficient binding signal; the role name and system preamble are decorative."

Experiment 3: Multi-Hop Degradation ( $n$ =40)

Table 6. Archive noise sensitivity.

Condition	r@1	$Δ\Delta$ vs (a)	95% boot CI	McNemar $p$	Pr( $δ>0\delta > 0$ )
(a) 0 fillers	0.500	—	—	—	—
(b) 1 filler	0.450	–5.0pt	[–0.200, +0.075]	0.727	0.184
(c) 2 fillers	0.425	–7.5pt	[–0.200, +0.050]	0.453	0.082
(c) vs (b)	0.425 vs 0.450	–2.5pt	[–0.175, +0.125]	1.000	0.311

Per-hop margins: (a $→\rightarrow$ b) –5.0pt, (b $→\rightarrow$ c) –2.5pt. Degradation is monotonic but diminishing. The worst transition is from "pure target context" to "mixed context"; additional contamination causes sub-linear marginal damage. This suggests a "mechanism floor": once the KV archive contains any off-target content, further mixing is marginally cheaper.

Caveat: At $n$ =40, precision floor is $≈±\approx\pm$ 20pt. All CIs cross zero; none reach formal significance. Directional evidence is consistent (Pr 0.82–0.92) but cannot substitute for power.

Experiment 4: Upper-Layer Localization ( $n$ =200)

Table 7a. Layer ablation for E-wrapper mechanism (Qwen 2.5 7B, SDPA).

Arm	Layers wrapped	r@1	$Δ\Delta$ vs baseline	Interpretation
baseline	none	0.420	—	Raw inject
`abl_pc_full`	0–27 (all)	0.475	+5.5pt	Full E-wrapper (replicates Phase 1)
`abl_upper_mixed`	14–27	0.450	+3.0pt	Upper layers recover 55% of effect
`abl_lower_mixed`	0–13	0.415	–0.5pt	Lower layers: zero contribution

The E-wrapper mechanism is upper-layer dominant. The upper 14 layers (14–27) account for 55% of the full mechanism (+3.0pt of +5.5pt). Lower layers (0–13) contribute nothing measurable (–0.5pt, within noise floor). This is consistent with PyramidKV's pyramidal information funneling (Section 2.2): upper layers concentrate attention on structurally significant tokens, making structural topology fidelity matter most where induction-head routing is already concentrated.

Product implication: For storage-constrained deployments, archiving only upper-layer KV tensors preserves 55% of the structural-alignment benefit at approximately 50% storage cost. A finer-grained layer sweep could identify the minimum layer set for majority effect.

Residual 45%: The gap between abl_upper_mixed (+3.0pt) and abl_pc_full (+5.5pt) suggests either (a) sub-threshold contribution from layers 0–13 at $n$ =200, or (b) cooperative interaction where upper-layer wrapping is more effective when lower layers are also wrapped.

5.4 n=467 Cross-Family Validation

Qwen 2.5 7B SDPA: Stratified reporting

Full statistics in Table 1c; per-query delta distribution in Figure 5.

Table 1c. Qwen 2.5 7B SDPA oracle-filtered extension (stratified reporting, $n$ =467). Absolute r@1 values and discordant counts were recomputed directly from per-query result pairs. Exact McNemar uses the two-sided binomial on discordant counts with standard convention ( $b$ = inject hurt, $c$ = inject helped).

Slice	$n$	Baseline	per_chunk	Delta	McNemar exact $p$	$b$ (hurt)	$c$ (helped)	Bootstrap 95% CI
Original 200	200	0.420	0.475	+5.5pt	0.108	14	25	[–0.005, +0.115]
New oracle-filtered 267	267	0.536	0.592	+5.6pt	0.063	21	36	[+0.005, +0.110]
Merged	467	0.486	0.542	+5.57pt	0.0103	35	61	[+0.015, +0.105]

Note on baseline shift: The New-267 slice has a higher absolute baseline (0.536 vs 0.420) because it draws from a different corpus sub-distribution with easier queries on average. The delta is stable (+5.6pt vs +5.5pt) across this difficulty shift, which is the key replication property.

Key finding: The new oracle-filtered queries (different seed, different retrieval sample) replicate the original effect identically (+5.6pt vs +5.5pt). The effect is not an artifact of the original query pool: it generalises to a fresh oracle-filtered sample drawn from a different difficulty regime.

At merged $n$ =467 the delta reaches formal significance at $α\alpha$ =0.05 (McNemar exact $p$ =0.0103, approaching $α\alpha$ =0.01). The strictly positive bootstrap lower bound (+0.015) independently supports a real positive effect. Stratified reporting shows the effect is robust across both the original pool and the new extension.

Synthetic Penalty Sensitivity

Artificially degrading the synthetic 120 baseline by 7.5% (to match original 80 difficulty) yields per_chunk delta = +5.1pt, bootstrap LB = –0.01. The mechanism survives penalty adjustment, confirming the effect is not driven by easy synthetic queries alone.

Phi-3 14B SDPA: Formal Significance

Metric	Value
Baseline	0.430
per_chunk	0.500
Delta	+7.0pt
McNemar $p$	0.013
Exact test $p$	0.019
Bootstrap 95% CI	[+0.020, +0.120]
Pr( $δ>0\delta > 0$ )	1.00

Phi-3 achieves formal significance at $α\alpha$ =0.05 on both McNemar and exact tests, with a strictly positive bootstrap lower bound (+0.020). This is our strongest statistical result: a cross-family replication on a different template family (Phi-3's single-token <|user|> markers vs Qwen's multi-line ChatML) confirms the structural-alignment mechanism generalizes. The larger +7.0pt delta on Phi-3 versus +5.5pt on Qwen reflects both the simpler template (fewer boundary/marker interactions to disrupt) and the n=200 power advantage over n=80 Phase 0.

5.5 Activation Patching ( $n$ =20, Phi-3-medium-128k-instruct)

Table 13. 8-arm activation-patching sweep at marker-token positions on Phi-3-medium-128k-instruct ( $n$ =20).

Arm	r@1	$Δ\Delta$ vs baseline
`patch_baseline` (archive without E-wrap)	0.50	—
`patch_pc_full` (full E-wrapper archive)	0.55	+5pt
`patch_denoise_early` (layers 0–5)	0.45	$- 5$ pt
`patch_denoise_mid` (layers 10–15)	0.40	$- 10$ pt
`patch_denoise_upper` (layers 20–32)	0.50	0pt
`patch_denoise_all` (all layers)	0.55	+5pt (full recovery)
`patch_denoise_random` (random layer set)	0.30	$- 20$ pt (negative control)
`patch_noise_upper` (corrupted residuals, upper)	0.50	0pt

Key findings. (a) The full-wrapper archive (patch_pc_full) reproduces the +5pt effect seen at larger $n$ on Phi-3 SDPA, confirming the small- $n$ probe is calibrated to the main effect. (b) patch_denoise_all fully recovers the effect when residuals are patched across all layers, while (c) patch_denoise_upper recovers 0pt when patched at upper layers only: upper-layer residuals are insufficient in isolation for effect reconstruction from a patched state. (d) The random-layer negative control ( $- 20$ pt) and noise control (0pt) establish that the recovery is specific to the layer-band structure of the patch. (e) Lower- and mid-band patches produce negative deltas ( $- 5$ to $- 10$ pt), indicating that partial-stack patching corrupts the routing context.

Together with the upper-layer-wrap dominance reported in §5.3 Experiment 4, these results support a reconciliation: the structural signal is expressed predominantly at upper layers while causal reconstruction requires the full stack (see §4.6).

5.6 Cross-Architecture and Deployment Validation

MLA tractability (DeepSeek V2-Lite, $n$ =200). E-wrapper produces a directionally positive $Δ\Delta$ =+3.0pt (Pr( $Δ>0\Delta > 0$ )=0.866, McNemar $p$ =0.307) on Multi-head Latent Attention, underpowered at this sample size but sign-consistent with GQA results. All three tested architectures (two GQA families + MLA) show positive $Δ\Delta$ ; effect magnitude is model-dependent (Phi-3 +7.0pt > Qwen +5.6pt > V2-Lite +3.0pt). Full MLA results, cross-family summary table, and YARN-scaled RoPE debugging methodology in Appendix B and Appendix F.

Latency. KV injection reduces TTFT versus strong text-RAG by 2.6 $×\times$ (per_chunk wrapped) to 3.4 $×\times$ (raw inject). E-wrapper adds $∼\sim$ 10ms overhead versus raw inject for the fresh wrapper-prefix prefill. Archive storage: $∼\sim$ 2.1 MB per 4K-token chunk (bf16, 32 layers). E-wrapper reduces wrapper-token KV memory $∼\sim$ 8 $×\times$ versus full template (3 tokens vs 15).

Text-wrapped RAG control ( $n$ =200). Wrapping chunks with chat-template markers in plain text (no KV injection) produces $Δ\Delta$ =+0.005, indistinguishable from noise. The +5.5pt effect requires actual KV injection, confirming the mechanism is structural, not lexical.

Robustness. System prompt mismatch (archiving under "helpful assistant," generating under "coding assistant") causes zero degradation ( $Δ\Delta$ =0.000, $n$ =200). INT8 KV quantization at per-channel or per-token granularity preserves the full effect (0pt loss), while per-tensor INT8 destroys it ( $- 5$ pt): a granularity floor, not a precision floor. EAGER vs SDPA kernel choice affects measurement magnitude ( $∼\sim$ 3 $×\times$ ) but not the underlying mechanism; SDPA values are reported as production-relevant. Full quantization matrix (ten-arm, $n$ =80) in Appendix E; system prompt results in Appendix G.

CPU-tier deployment ( $n$ =80). Zero-shot static embeddings (potion, $∼\sim$ 1ms CPU) plus BM25 plus BGE-Reranker match GPU-grade Qwen3-Embed-8B end-to-end (all McNemar $p$ =1.000). The reranker fully absorbs potion's weaker retrieval-stage signal. Stack specification in Appendix C.

Standard benchmark: LoCoMo ( $n$ =200). To validate the effect outside our custom evaluation, we evaluate on the LoCoMo conversational memory benchmark (Maharana et al., 2024), the same benchmark used by Jeong (2026), our closest prior work. LoCoMo tests cross-session recall over 10 multi-session conversations spanning up to 35 sessions each. We subsample 200 QA pairs (excluding adversarial category), archive per-turn KV with E-wrapper, and score with token-level F1 (LoCoMo's standard metric; exact-match is uninformative due to length mismatch between short ground-truth phrases and model-generated sentences).

Arm	F1	$Δ\Delta$
baseline (text-RAG, no injection)	0.135	—
per_chunk (E-wrapper injection)	0.189	+5.3pt

The +5.3pt F1 improvement on LoCoMo is consistent with the +5.6pt r@1 delta on our custom benchmark, confirming the E-wrapper effect generalizes to an independently designed conversational memory evaluation. Per-chunk injection produces answers with higher token overlap with ground truth across all four non-adversarial question categories (single-hop, multi-hop, temporal, open-domain).

6. Discussion

6.1 Hypothesized Mechanism: Marker-Token-Class Binding and Induction-Head Routing

Olsson et al. (2022) demonstrated that transformer attention heads implement [A][B]...[A] $→\rightarrow$ [B] pattern completion ("induction heads"). Chat-template role markers create exactly this pattern: a fixed prefix token sequence preceding variable content.

We hypothesize that our observed effects are consistent with induction-head behavior, though we do not directly observe specific head activations. Our five-arm wrapper ablation (§5.3 Experiment 2) provides the strongest evidence, with supporting context from the 2 $×\times$ 2 factorial (§5.2, noting that the factorial interaction did not replicate at $n$ =200). The evidence refines the hypothesis in three ways:

First, induction-head routing may be conditionally activated. The [A] trigger (role marker) may require a well-formed [B] (content boundary) to complete the pattern. Without clean boundaries, the induction head may not establish the [A] $→\rightarrow$ [B] mapping:

<|im_start|>user\nHello $→\rightarrow$ well-formed trigger + content $→\rightarrow$ hypothesized induction-head activation
<|im_start|>user\nHel $→\rightarrow$ trigger + fragment $→\rightarrow$ hypothesized mapping failure

Second, the hypothesized routed token is the marker-token-class, not the prefix position. Either <|im_start|> or <|im_end|> suffices; the bracket pair is substitutable. If Olsson's induction-head circuitry is involved, it may apply to any distinctive landmark token, not strictly to sequence prefixes. The model may learn a class of turn-boundary tokens during instruction tuning, and any member of that class may trigger the routing.

Third, instruction tuning amplifies rather than creates the effect. The base model shows +2.5pt (n=80, formally null but directionally positive), while the instruct variant shows +6.25pt. The <|im_*|> marker tokens exist in the shared tokenizer vocabulary; the base model may use them as weak anchors, and IT sharpens the routing pattern ~2.5 $×\times$ .

Fourth, the mechanism is upper-layer dominant at prefill-wrap but distributed across the stack for causal reconstruction. Our layer ablation (Experiment 4, $n$ =200) shows 55% of the E-wrapper effect concentrated in upper layers (14–27) at the prefill-wrap granularity, with zero measurable contribution from lower layers (0–13), consistent with PyramidKV's pyramidal information funneling and with the induction-head routing hypothesis, since induction heads are predominantly found in upper/middle transformer layers. The activation-patching probe (\S5.5, $n$ =20 on Phi-3) refines this picture: patching marker-position residuals in upper layers alone fails to reconstruct the effect (0pt), while all-layer patching recovers it fully (+5pt). We read the two probes as complementary: upper layers are where the structural signal is expressed, while the full stack is necessary for causal reconstruction from an already-patched state. See \S4.6 for the full reconciliation. The upper-layer dominance at prefill-wrap also supplies a concrete storage-optimization path: archiving only upper-layer KV tensors preserves roughly half the mechanism ( $≈\approx$ 55%) at roughly half the layers stored.

6.2 Model-Dependent Magnitude

The boundary-conditioned structural-alignment effect varies across models and architectures:

Qwen 2.5 7B (complex ChatML template): +5.0pt interaction alone (2 $×\times$ 2 Cell D), +6.25pt full per_chunk. Complex templates create more boundary/marker interaction opportunities $→\rightarrow$ larger fix, but also more overhead.
Phi-3 14B (simple template): +7.0pt at $n$ =200 with formal significance ( $p$ =0.013). Phi-3's simpler template (<|user|> single token) has fewer boundary/marker interactions to disrupt, yielding a larger delta.
DeepSeek V2-Lite (MLA): +3.0pt directional. MLA's compressed c_kv latent may lose some structural-topology signal that GQA preserves in full K-tensor form.
Qwen 2.5 7B Base: +2.5pt directional. Reduced magnitude confirms IT amplification.

Interpretation: Template complexity and architecture class determine both the magnitude of degradation and the recovery potential. Complex templates require full per-chunk wrapping (or E-wrapper minimalism); simple templates achieve comparable benefit with thinner injection. MLA shows the mechanism transfers but at reduced magnitude, suggesting architectural differences in how structural topology is encoded in the compressed latent space.

6.3 Methodological Implications: Retrieval Determinism

Our finding that BGE-Reranker-class pipelines exhibit 0/80 selection mismatch across fresh runs has implications beyond our specific benchmark. Any KV-reuse or RAG evaluation that compares across independent retrieval runs risks conflating retrieval variance with mechanism effects. The pinning protocol we introduce is a minimal-cost ($0, one JSON file) mitigation that should be adopted as standard practice.

6.4 Multi-Hop Degradation and Archive Noise

Experiment 3 shows that the mechanism degrades gracefully under archive noise: –5.0pt for the first filler turn, –2.5pt marginal for the second. This diminishing-margin pattern suggests a "mechanism floor": once the KV archive is contaminated with any off-target content, additional contamination causes sub-linear marginal damage. The worst transition is from pure target context to mixed context.

Product implication: Single-hop injection (archive a conversation turn, inject into the next session) is the sweet spot. Multi-hop "conversation history" injection degrades monotonically with each additional off-target turn. A practical product should inject only the most relevant chunks, not entire conversation histories.

Honest caveat: At $n$ =40, this is directional evidence, not a quantitative decay law. Formal claims would require $n$ =400+.

6.5 Deployment Considerations

A CPU-tier retrieval stack (zero-shot potion + BM25 + BGE-Reranker) matches GPU-grade end-to-end quality within $n$ =80 precision, enabling edge deployment without GPU embedders (Appendix C). The reranker remains the lone GPU component and dominant cost center.

Cross-session MLA injection is directionally tractable (V2-Lite $Δ\Delta$ =+3.0pt, Appendix F) but effect size is smaller than GQA and formal significance would require $n$ =1000+. Every major lab's 2026 flagship has moved past pure attention; whether frontier-scale MLA narrows the gap is the critical open question for this approach's longevity.

7. Limitations

7.1 Statistical Power

Individual Phase 0 measurements at $n$ =80 are underpowered for 5pt effect detection. SE $≈\approx$ 5.6% at 50% baseline; 95% CI spans $∼±\sim\pm$ 11pt. We address this via:

$n$ =200 extension (primary validation gate)
Cross-measurement consistency (5 independent directional positives under null $≈\approx$ 3%)
Exact McNemar + bootstrap CI (tighter than asymptotic approximations)

The Qwen merged result reaches formal significance at $n$ =467 ( $p$ <0.01, $Δ\Delta$ =+5.6pt). Phi-3 $n$ =200 ( $p$ =0.013) provides cross-family formal validation on a different template family.

Power analysis. For a paired proportion test (McNemar) with baseline $π\pi$ =0.42 and target $Δ\Delta$ =0.05, required $n$ for 80% power at $α\alpha$ =0.05 is approximately 385 per arm. Qwen reaches formal significance at merged $n$ =467 ( $Δ\Delta$ =+5.6pt, $p$ <0.01). For the MLA V2-Lite result ( $Δ\Delta$ =+0.030), $n$ ≈1,050 would be required for 80% power at $α\alpha$ =0.05, explaining why $p$ =0.307 at $n$ =200. For the 2 $×\times$ 2 factorial interaction (Cell D vs baseline, $Δ\Delta$ =+0.05), $n$ ≈400 per cell would be needed to confirm the interaction independently.

Query pool extension. The oracle-filtered extension strategy succeeded: 267 new queries generated via the same oracle pipeline (different seed, different chunks) replicated the original effect identically (+5.6pt vs +5.5pt), confirming the effect is not an artifact of the original query pool. The corpus ceiling (~367 usable chunks) limits further independent extension without multi-question correlation or alternative corpus sources.

7.2 Single-Seed Variance

Phase 0 results are single-seed. $n$ =200 includes multi-seed variance decomposition to quantify dataset-vs-seed contribution.

7.3 Kernel Sensitivity

Attention kernel choice (SDPA vs EAGER) affects both baseline quality and effect magnitude. EAGER gives lower baselines and larger deltas due to floating-point non-associativity in softmax. We report SDPA as production-relevant and EAGER as upper-bound diagnostic.

7.4 Synthetic Pool Overfit

Synthetic 120 queries are +7.5pt easier for retrieval than organic 80 (gt_in_top3: 0.90 vs 0.825) due to generation-model phrasing correlation. Mitigated by:

Stratified reporting (original / synthetic / merged)
Synthetic penalty sensitivity analysis
Claims anchored to organic 80 slice (harder, more realistic)

7.5 Wrapper Sensitivity

Structural wrapping is model-template-specific. We validate on three families (Qwen ChatML, Phi-3, DeepSeek V2-Lite) but cannot claim universality. The five-arm ablation (Table 4) and cross-family E-wrapper validation (Table 5) provide the strongest wrapper-sensitivity evidence to date.

7.6 Scale Ceiling

All validation at 7B–14B, with one 15.7B/2.4B MLA model. Frontier-scale pure-attention models (70B+) and frontier MLA models (119B–671B) remain untested. Scale rescue pattern (1.5B fails $→\rightarrow$ 7B works $→\rightarrow$ 14B wins) suggests improvement with scale, but this is extrapolation.

7.7 Benchmark Scope

LoCoMo is a conversation memory benchmark. Generalization to code, technical documentation, or multi-modal contexts is untested.

7.8 Competitor Benchmarking

Product-grade competitors (Mem0, Letta, Zep) remain unbenchmarked due to credential/integration complexity.

7.9 Retrieval Pipeline Non-Determinism

The strong-RAG retrieval pipeline exhibits cross-run top-3 selection variance: 0/80 queries matched between two independent fresh runs. We mitigate via the Retrieval Determinism Protocol (Methods 3.8), which pins selections intra-run. However, this means absolute r@1 scores are not directly comparable across papers or replication attempts unless the exact same pinned selections are shared. Future work should release pinned selection artifacts alongside raw results.

7.10 Multi-Hop Noise Accumulation

Directional degradation of ~5pt r@1 observed across 2 filler turns at $n$ =40 (formally null at this sample size, Pr(degrade)=0.92). The mechanism degrades gracefully but not indefinitely. Formal claims would require $n$ =400+.

7.11 CPU-Tier Sample Size

The potion probe at $n$ =80 shows equivalence within precision floor ( $±\pm$ 3.75pt). A larger $n$ =200 run could confirm the marginal +1.25pt lead at significance. The conclusion "CPU tier is viable" is directional, not formally proven.

7.12 MLA Stage 1 False-Positive Retraction

Our initial V2-Lite Stage 1 "pass" (n=3 coherence heuristic) was a false positive masked by the YARN-RoPE bug. We explicitly retract that claim and replace it with the corrected n=200 formal result. This correction is documented in our rigor chronology as a self-correcting data point, not a failure.

7.13 Privacy and Security of Archived KV States

Cross-session injection requires storing per-layer K/V tensors from user conversations. Unlike archived text, which can be inspected, redacted, or audited, KV tensors are opaque high-dimensional embeddings that may encode memorized training data, prior conversation content, or attention patterns specific to a user session. Whether archived KV states carry equivalent or greater privacy risk than archived text is unknown. We do not analyze KV-cache memorization, membership-inference vulnerability, or cross-user leakage. Any production deployment should treat archived KV as sensitive user data and apply the same access controls, encryption, and retention policies as stored conversation text.

7.14 INT8 Quantization Matrix Sample Size

The ten-arm quantization compatibility matrix (Table 10) was evaluated at $n$ =80. Per-arm precision is $±\pm$ 11pt at 95% CI, so the finding that per-channel INT8 achieves 0pt degradation is directionally strong but not formally proven against a small non-zero effect. The per-tensor INT8 failure ( $- 5$ pt) is robustly outside the noise floor. A replication at $n$ =200 would confirm the per-channel result with tighter bounds.

7.15 Query Pool Synthesis Quality

Our synthetic query pool (120 queries) was generated by Qwen2.5-7B-Instruct via inline prompting and oracle-filtered for quality. An attempt to extend the pool with 167 additional inline-synthetic queries (same generator, new seed) produced queries that showed no E-wrapper effect ( $Δ\Delta$ = $- 0.006$ ) and diluted the merged result. This suggests inline synthesis produces variable-quality queries whose difficulty distribution is not stable across seeds. Future work should validate synthesis quality via held-out oracle retrieval before merging into eval pools.

8. Conclusion

Cross-session KV cache injection degrades quality primarily through structural misalignment (the loss of chat-template turn-boundary markers during archival) rather than through attention sink loss, RoPE mismatch, or cross-attention contamination. A five-arm wrapper ablation identifies marker-token-class binding as the mechanism: either bracket token suffices as a landmark, and a minimal 3-token E-wrapper captures 97.7–100% of the effect. A 2 $×\times$ 2 factorial ( $n$ =80) suggests boundary–marker synergy, though the interaction did not replicate at $n$ =200 on different retrieval selections.

The per-chunk injection effect achieves formal significance on both primary models (Phi-3 $p$ =0.013 at $n$ =200; Qwen $p$ <0.01 at $n$ =467) and generalizes to the LoCoMo conversational memory benchmark (+5.3pt F1, $n$ =200). Layer ablation and activation patching reveal complementary localization: upper layers dominate signal expression at prefill, while the full stack is necessary for causal reconstruction. The effect survives system-prompt drift, per-channel INT8 quantization, and transfers directionally to MLA architectures.

Future work: Frontier-scale MLA validation, additional standard benchmark evaluation, and hybrid approaches combining structural alignment with light adapter tuning.

References

CacheBlend (2024). arXiv:2405.16444.
DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434.
Dettmers et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. arXiv:2208.07339.
Dror et al. (2018). The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing. ACL P18-1128.
FreeChunker (2025). A Cross-Granularity Chunking Framework. arXiv:2510.20356.
Jeong (2026). Trained Persistent Memory for Frozen Decoder-Only LLMs. arXiv:2603.22329.
LoopGuard (2026). Breaking Self-Reinforcing Attention Loops via Dynamic KV Cache Intervention. arXiv:2604.10044.
Maharana et al. (2024). LoCoMo: Long-Context Conversational Memory. NAACL 2024.
Lu et al. (2021). Fantastically Ordered Prompts and Where to Find Them. ACL 2021. arXiv:2104.08786.
Olsson et al. (2022). In Context Learning and Induction Heads. arXiv:2209.11895.
PyramidKV (2024). arXiv:2406.02069.
Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
Xiao et al. (2023). Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453. Published as StreamingLLM, ICLR 2024.

Appendix A: E-Wrapper Production Specification

The E-wrapper is an ablation-validated minimal wrapper for production deployment. It retains 97.7–100% of the full-template effect at 18–22% token overhead.

Qwen-family (ChatML template)

<|im_start|>
{chunk}<|im_end|>

Tokens: 3 (<|im_start|>, \n, <|im_end|>)
Retention: 97.7% of full template (0.5375 / 0.5500)
Overhead: 22% of full-template tokens

Phi-3-family

<|end|>
{chunk}<|end|>

Tokens: 2 (<|end|>, <|end|>)
Retention: 100% of full template (0.6000 / 0.6000)
Overhead: 18% of full-template tokens

DeepSeek-family (V2-Lite template)

<|User|>
{chunk}<|end_of_sentence|>

Tokens: 3
Note: Validated directionally at n=200; formal sig not yet achieved.

Implementation note: The E-wrapper eliminates the system-prompt preamble entirely. For models where the system prompt carries task-critical instruction (e.g., "You are a medical assistant"), prepend the system prompt to the query turn rather than to each archived chunk. This preserves task context while minimizing per-chunk wrapper overhead.

Appendix B: YARN-Scaled RoPE Fix for KV Injection

Problem

DeepSeek V2/V3 and extended-context Llama-3 variants use YARN-scaled RoPE (factor $≥\ge$ 40). A naive rope_shift_k using plain RoPE inv_freq over-rotates by up to the YARN factor on high dimensions, producing complete K-tensor corruption and gibberish output.

Symptom chain

Inject produces token-collapse gibberish (e.g., "\n\n\n")
r@1 = 0.000 on both arms despite coherent pipeline + scorer
inject_no_shift (pos=0) works; inject_with_shift breaks, pointing to rope_shift_k

Root cause

YARN modifies inv_freq by blending standard RoPE frequency with an interpolated (scaled-down) frequency based on a per-dimension ramp. At V2-Lite dimensions (qk_rope_head_dim=64), the plain/yarn ratio reaches 40 $×\times$ on dimensions 23–31 (pure freq_inter regime). For typical chunk shifts of ~512 tokens, this means up to 20,480 radians of over-rotation: complete K corruption.

Fix

Extract yarn-adjusted inv_freq from the model's own rotary module:

rot = model.model.layers[0].self_attn.rotary_emb
yarn_inv_freq = rot.inv_freq.detach().to("cuda").to(torch.float32)

Use yarn_inv_freq in shift rotation math. Everything else (contiguous-half rotate_half, dtype handling, basis) matches the model's apply_rotary_pos_emb verbatim.

mscale note

For V2-Lite, mscale == mscale_all_dim == 0.707 $→\rightarrow$ _mscale == 1.0. The cos/sin cache has no mscale scaling; shift composition as pure rotation is correct. If mscale were non-unity, the fix would need (1/mscale) * R_scaled(Δ) * k_stored.

Validation

Mode	r@1	Coherent
native_oracle	0.70	10/10
inject_fixed_shift	0.60	10/10
inject_no_shift	0.50	—

Fix restores coherent output; fixed_shift reaches within 0.10 of native ceiling.

Appendix C: CPU-Tier Retrieval Stack Specification (Zellige Edge)

Recommended stack

Layer	Component	Latency	Hardware	Cost (100K q/day)
Embedder	minishlab/potion-base-8M (zero-shot)	~1ms	CPU	~$0
Sparse retrieval	BM25 (rank-bm25)	~1ms	CPU	~$0
Fusion	RRF (k=60)	negligible	CPU	~$0
Reranker	BGE-Reranker-v2-m3	~50ms	GPU	~$1.50
Generator	per_chunk inject + E-wrapper	~10ms	GPU	base model cost
Total retrieval		~52ms	GPU-reranker-bound	~$1.50

Comparison to GPU-heavy stack

Layer	Qwen3-Embed-8B stack	Potion stack	Factor
Embed	~200ms GPU	~1ms CPU	200 $×\times$
BM25	~1ms CPU	~1ms CPU	1 $×\times$
Reranker	~50ms GPU	~50ms GPU	1 $×\times$
Total retrieval	~250ms	~52ms	4.8 $×\times$

Quality equivalence

End-to-end inject r@1 at $n$ =80: Qwen3-Embed-8B = 0.488, zero-shot potion = 0.500. Formally indistinguishable (McNemar $p$ =1.000). The BGE-Reranker bridges potion's –7.5pt pre-rerank retrieval gap.

Caveats

If the reranker is ever dropped for latency, the embedder gap re-emerges fully.
$n$ =80 precision floor is $±\pm$ 3.75pt; equivalence is directional, not formally proven.
Static-reranker distillation would complete the CPU tier (target: ~500ms CPU, eliminating the GPU reranker).

Appendix D: Rigor Chronology and Self-Corrections

This appendix documents the project's error-correction history as a rigor feature. Each correction strengthened the final claims.

Stage 1 coherence-heuristic false positive (2026-04-22). Initial V2-Lite n=3 "pass" used output-length + English-word count as a proxy for correctness. Formal JSON showed inject_r1_hits=0. Lesson: always use formal retrieval scoring, never coherence proxies.
YARN-scaled RoPE bug (2026-04-22). rope_shift_k used plain RoPE inv_freq on a YARN model (factor=40) $→\rightarrow$ 40 $×\times$ over-rotation $→\rightarrow$ gibberish. Fix: extract yarn-adjusted inv_freq from model.model.layers[0].self_attn.rotary_emb. Lesson: use the model's own rotary module; don't recompute.
Hybrid model lookup error (2026-04-22). Initial lookup returned wrong Qwen model (Qwen3-30B-A3B pure-MoE vs Qwen3.6-35B-A3B hybrid). Web-search verification caught the mismatch. Lesson: verify model names against training-cutoff knowledge; don't trust single-source lookups.
Cross-family table mis-transcription (2026-04-22). Initial analysis conflated E-wrapper minimization ablation with Phase 1 ship-gate comparison. Cross-check against prior documented results caught the error. Lesson: cross-check table values against prior docs before accepting.
Mistral Small 4 119B runtime wall (2026-04-22). transformers main 5.6.0.dev0 FP8 MoE incomplete $→\rightarrow$ static mode breaks MoE, dynamic mode breaks linears. Aborted after load-test gate. Lesson: for bleeding-edge models, always do a load-test gate before committing eval compute.
INT8 "hard limitation" false conclusion (2026-04-22). Exp 5 initially reported per-tensor INT8 destroys the E-wrapper effect and concluded a precision limitation. Exp 5b ten-arm matrix revealed the failure is specific to per-tensor granularity; per-channel and per-token INT8 recover fully. Lesson: a single coarse-grained negative result does not establish a precision floor; debug the quantization scheme before declaring incompatibility.
Inline synthesis query quality assumption (2026-04-22). Assumed that generating more synthetic queries with the same model+seed would preserve difficulty distribution. Exp 1 n=367 showed new inline-synthetic queries (same generator, different seed) have lower quality and dilute the signal. Lesson: validate synthesis quality via oracle retrieval before merging into eval pools; difficulty distributions vary across seeds.

Appendix E: Quantization Compatibility Specification

Finding summary

The E-wrapper mechanism has a granularity floor, not a precision floor:

Per-tensor INT8 (symmetric or asymmetric): destroys effect ( $- 5$ pt)
Per-channel INT8: full recovery (0pt loss)
Per-token dynamic INT8: full recovery (0pt loss)
K-only per-channel INT8 (V=bf16): full recovery (0pt loss)

Practical specification

For production KV archive compression, use per-channel or finer quantization:

# Per-channel symmetric INT8 (recommended)
scale = kv_tensor.abs().amax(dim=(-2, -1), keepdim=True)  # [1, H, 1, 1]
kv_int8 = (kv_tensor / scale * 127).round().clamp(-128, 127).to(torch.int8)
kv_dequant = kv_int8.float() / 127 * scale

# Per-token dynamic (LLM.int8-style, alternative)
scale = kv_tensor.abs().amax(dim=-1, keepdim=True)  # [1, H, S, 1]
kv_int8 = (kv_tensor / scale * 127).round().clamp(-128, 127).to(torch.int8)
kv_dequant = kv_int8.float() / 127 * scale

Avoid: per-tensor torch.quantize_per_tensor for KV archive storage. The single global scale squashes marker-token activation variance that drives the E-wrapper mechanism.

Storage impact

Scheme	Bits per element	Relative size	Quality
bf16	16	1.0 $×\times$	baseline
Per-channel INT8	8	0.5 $×\times$	0pt loss
Per-tensor INT8	8	0.5 $×\times$	$- 5$ pt (broken)
K-only per-ch INT8, V=bf16	12	0.75 $×\times$	0pt loss

Compatibility note

bitsandbytes LLM.int8() and load_in_8bit use per-channel quantization for weights but may apply per-tensor schemes to KV caches depending on backend. Verify your inference backend's KV quant granularity before deploying E-wrapper with INT8 compression.

Appendix F: MLA Full Results (DeepSeek V2-Lite)

Table F1. Cross-architecture validation ( $n$ =200, DeepSeek-V2-Lite-Chat, SDPA, YARN-adjusted rope_shift_k).

Slice	$n$	baseline r@1	per_chunk r@1	$Δ\Delta$	95% boot CI	McNemar $p$	Pr( $Δ>0\Delta > 0$ )
original	80	0.413	0.438	+0.025	[–0.05, +0.10]	0.754	0.683
synthetic	120	0.342	0.375	+0.033	[–0.025, +0.092]	0.424	0.827
merged	200	0.370	0.400	+0.030	[–0.02, +0.08]	0.307	0.866

YARN-RoPE debugging journey: The initial V2-Lite attempt reported a false-positive "pass" on a coherence heuristic (n=3); formal JSON showed r@1=0.000. Diagnostic ablation traced the bug to naive rope_shift_k using plain RoPE inv_freq on a YARN-scaled model: up to 40 $×\times$ over-rotation on dimensions 23–31. Extracting the yarn-adjusted inv_freq from model.model.layers[0].self_attn.rotary_emb restored coherent output. See Appendix B for the full fix specification.

Appendix G: System Prompt Robustness

Table G1. Cross-context deployment viability ( $n$ =200, Qwen 2.5 7B SDPA).

Arm	Archive prompt	Generation prompt	r@1	$Δ\Delta$ vs baseline
baseline	—	helpful assistant	0.430	—
same_prompt	helpful assistant	helpful assistant	0.480	+0.050
drifted_prompt	helpful assistant	coding assistant	0.480	+0.050

Zero degradation ( $Δ\Delta$ =0.000 between same and drifted). Marker-class binding is invariant to system prompt content.

Appendix H: Query Pool Extension (Inline-Synthetic, Negative Result)

Table H1. Power-up attempt via inline query synthesis ( $n$ =367, Qwen 2.5 7B SDPA).

Slice	$n$	baseline	per_chunk	$Δ\Delta$
Original pool	200	0.430	0.480	+0.050
New synthetic	167	0.437	0.431	$- 0.006$
Merged	367	0.433	0.458	+0.025

Inline synthesis did not power up significance. The 167 new synthetic queries show no effect and dilute the merged result from +5.0pt to +2.5pt. Oracle-filtered extension (§5.4) succeeded where this approach failed, indicating that unfiltered inline synthesis does not match the difficulty distribution of the oracle-filtered pool.

Figures

Figure 3. Attention sink mass by layer and position. Sink mass on positions 0–3 drops to near-zero in injected archives without global BOS prepending.

Figure 4. Un-RoPE'd KV cosine similarity by layer. Archived vs fresh K/V cosine $≈\approx$ 1.0 across all layers, confirming content equivalence.

Figure 5. Per-query delta distribution for Qwen n=467 oracle-filtered extension. Distribution of per-query $δ\delta$ = per_chunk $-$ baseline across the merged 467-query pool.

Figure 6. TTFT comparison across injection arms. Raw inject, per_chunk, text-RAG, and full-context re-prefill latency on identical A10G hardware.

Abstract

1. Introduction

Three candidate failure modes

Our contribution

Scope boundaries

2. Related Work

2.1 Cross-Session KV Cache Reuse Systems

2.2 Attention Mechanisms and KV Cache Management

2.3 Retrieval-Augmented Generation and Chunking

2.4 Theoretical Basis: Marker-Token-Class Binding and Induction-Head Routing

2.5 Positioning

3. Methods

3.1 Overview

3.2 Models

3.3 Benchmark

3.4 Retrieval Pipeline

3.5 Phase 0 Experimental Matrix

Qwen 2.5 7B: 8-arm matrix

Phi-3 14B: 3-arm matrix

Pick 1+2: 2×\times×2 Mechanism Decomposition (Qwen 2.5 7B)

3.6 Tier 1 Experimental Bundle

Experiment 1: Base-Model Control (nnn=80)

Experiment 2: Wrapper Ablation (nnn=80, 5-arm)

Experiment 2B: Phi-3 E-Wrapper Cross-Family (nnn=80, 2-arm)

Experiment 3: Multi-Hop Degradation (nnn=40, 3-condition)

Experiment 4: Upper-Layer Localization (nnn=200)

3.7 MLA Methods: YARN-Scaled RoPE Handling

3.8 Retrieval Determinism Protocol

3.9 n=200 Extension Protocol

3.10 Diagnostic Instruments

3.11 Evaluation Metrics

3.12 Text-Wrapped RAG Control

3.13 INT8 Quantization Robustness

3.14 System Prompt Drift

3.15 Query Pool Extension

4. Experiments

4.1 Phase 0: Mechanism Decomposition (nnn=80)

4.2 Phase 1: Cross-Family Validation (nnn=200)

4.3 Tier 1: Mechanism Refinement (nnn=40–80)

4.3b Layer Ablation (nnn=200)

4.4 MLA Stage 1: DeepSeek V2-Lite (nnn=200)

4.5 CPU-Tier Probe (nnn=80)

4.6 Activation Patching (nnn=20, Phi-3-medium-128k-instruct)

4.7 Text-Wrapped RAG Control (nnn=200)

4.8 INT8 Quantization Robustness (nnn=200 + nnn=80 matrix)

4.9 System Prompt Drift (nnn=200)

4.10 Query Pool Extension (nnn=367 inline-synthetic, then nnn=467 oracle-filtered)

5. Results

5.1 Phase 0: Alternative Mechanism Rejection

5.2 2×\times×2 Mechanism Decomposition: Synergistic Interaction

5.3 Tier 1: Mechanism Refinement

Experiment 1: IT Amplifies, Not Creates (nnn=80)

Experiment 2: Marker-Token-Class Binding (nnn=80)

Experiment 2B: E-Wrapper Cross-Family (nnn=80)

Experiment 3: Multi-Hop Degradation (nnn=40)

Experiment 4: Upper-Layer Localization (nnn=200)

5.4 n=467 Cross-Family Validation

Qwen 2.5 7B SDPA: Stratified reporting

Synthetic Penalty Sensitivity

Phi-3 14B SDPA: Formal Significance

5.5 Activation Patching (nnn=20, Phi-3-medium-128k-instruct)

5.6 Cross-Architecture and Deployment Validation

6. Discussion

6.1 Hypothesized Mechanism: Marker-Token-Class Binding and Induction-Head Routing

6.2 Model-Dependent Magnitude

6.3 Methodological Implications: Retrieval Determinism

6.4 Multi-Hop Degradation and Archive Noise

6.5 Deployment Considerations

7. Limitations

7.1 Statistical Power

7.2 Single-Seed Variance

7.3 Kernel Sensitivity

7.4 Synthetic Pool Overfit

7.5 Wrapper Sensitivity

7.6 Scale Ceiling

7.7 Benchmark Scope

7.8 Competitor Benchmarking

7.9 Retrieval Pipeline Non-Determinism

7.10 Multi-Hop Noise Accumulation

7.11 CPU-Tier Sample Size

Pick 1+2: 2 $×\times$ 2 Mechanism Decomposition (Qwen 2.5 7B)

Experiment 1: Base-Model Control ( $n$ =80)

Experiment 2: Wrapper Ablation ( $n$ =80, 5-arm)

Experiment 2B: Phi-3 E-Wrapper Cross-Family ( $n$ =80, 2-arm)

Experiment 3: Multi-Hop Degradation ( $n$ =40, 3-condition)

Experiment 4: Upper-Layer Localization ( $n$ =200)

4.1 Phase 0: Mechanism Decomposition ( $n$ =80)

4.2 Phase 1: Cross-Family Validation ( $n$ =200)

4.3 Tier 1: Mechanism Refinement ( $n$ =40–80)

4.3b Layer Ablation ( $n$ =200)

4.4 MLA Stage 1: DeepSeek V2-Lite ( $n$ =200)

4.5 CPU-Tier Probe ( $n$ =80)

4.6 Activation Patching ( $n$ =20, Phi-3-medium-128k-instruct)

4.7 Text-Wrapped RAG Control ( $n$ =200)

4.8 INT8 Quantization Robustness ( $n$ =200 + $n$ =80 matrix)

4.9 System Prompt Drift ( $n$ =200)

4.10 Query Pool Extension ( $n$ =367 inline-synthetic, then $n$ =467 oracle-filtered)

5.2 2 $×\times$ 2 Mechanism Decomposition: Synergistic Interaction

Experiment 1: IT Amplifies, Not Creates ( $n$ =80)

Experiment 2: Marker-Token-Class Binding ( $n$ =80)

Experiment 2B: E-Wrapper Cross-Family ( $n$ =80)

Experiment 3: Multi-Hop Degradation ( $n$ =40)

Experiment 4: Upper-Layer Localization ( $n$ =200)

5.5 Activation Patching ( $n$ =20, Phi-3-medium-128k-instruct)