Zellige AI Research
Papers & notes
Longer-form work on frontier models, evaluations, and what we've learned building them. Usually shipped alongside open weights and reference code.
Parametric Memory Cannot Replace Retrieval: Why Titans-Style Neural Memory Fails for Cross-File Symbol Resolution
We integrate Titans-style parametric memory into Qwen 3.5-9B and evaluate on CrossCodeEval for cross-file code completion. Across four architecture iterations, parametric memory produces zero measurable improvement over a memory-ablated baseline (1.0% EM vs 1.0% EM), while explicit context concatenation yields 2.5x higher exact match. Per-sample analysis traces the failure to a mechanism-task mismatch: inner-loop SGD on a low-dimensional MLP compresses lossily, but cross-file symbol resolution requires lossless recall of specific token sequences.
Eliminating Autograd from Memory-Augmented Transformers
Memory-augmented transformers (Titans, TTT) require per-sample backward passes at inference time, a systems artifact rather than a fundamental requirement. We eliminate autograd entirely via two complementary methods: exact manual gradient kernels (cos=1.0, 5.5x speedup, 53% VRAM reduction) and learned Forward Alignment Networks (cos=0.91, architecture-agnostic). Same-seed verification on a 40M MAC transformer confirms identical training dynamics (BPC gap = 0.0005). Amdahl's-law accounting shows 1.41x end-to-end throughput at 40M, scaling favorably to 6.1x kernel speedup at dim_head=128.
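The Amdahl's-law accounting mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's own accounting: the function names are hypothetical, and pairing the 5.5x kernel speedup with the 1.41x end-to-end figure is an assumption made only to show how the two numbers would relate.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime is accelerated by `kernel_speedup`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    # Amdahl's law: the untouched part keeps its cost, the rest shrinks by kernel_speedup.
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def accelerated_fraction(end_to_end: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: the runtime share the kernel must occupy to explain `end_to_end`."""
    return (1.0 - 1.0 / end_to_end) / (1.0 - 1.0 / kernel_speedup)


# Assumption: the 5.5x kernel speedup and the 1.41x end-to-end figure describe the same setup.
share = accelerated_fraction(end_to_end=1.41, kernel_speedup=5.5)
print(f"implied memory-kernel share of runtime: {share:.2f}")
```

Under that assumption, the memory update would account for a bit over a third of baseline step time, which is why a large kernel speedup translates into a much smaller end-to-end gain.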
Importance Is Not Fragility: Why High-Fisher Layers Survive Low-Bit Quantization
Mixed-precision quantization allocates different bit-widths to different layers. The standard heuristic assigns more bits to high-Fisher layers. We show this points in the wrong direction: on Qwen2.5 at 3-bit, the inverse allocation improves perplexity 3.7x. We decompose per-layer quantization damage into Fisher trace and quantization-error covariance, and propose Quantization Visibility, a metric that predicts per-layer fragility at p < 0.001 where Fisher trace fails.
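The two allocation policies contrasted above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the layer names, the Fisher-trace values, and the 4-bit/2-bit split averaging to a 3-bit budget are hypothetical, and the code does not implement the paper's Quantization Visibility metric.

```python
from __future__ import annotations


def allocate_bits(
    fisher_trace: dict[str, float],
    high_bits: int = 4,
    low_bits: int = 2,
    inverse: bool = False,
) -> dict[str, int]:
    """Give `high_bits` to half the layers and `low_bits` to the rest.

    inverse=False: standard heuristic, high-Fisher layers get more bits.
    inverse=True:  inverse allocation, low-Fisher layers get more bits.
    """
    # Sort descending by Fisher trace for the standard rule, ascending for the inverse.
    ranked = sorted(fisher_trace, key=fisher_trace.get, reverse=not inverse)
    half = len(ranked) // 2
    return {name: (high_bits if i < half else low_bits) for i, name in enumerate(ranked)}


# Hypothetical per-layer Fisher traces, not measured values.
scores = {"layer0": 0.9, "layer1": 0.1, "layer2": 0.5, "layer3": 0.3}
print(allocate_bits(scores))                # standard heuristic
print(allocate_bits(scores, inverse=True))  # inverse allocation
```

Both policies spend the same average bit budget; they differ only in which layers receive the higher precision.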
Brain-Embedded LLMs: A Research Arc in Methodology, Encoder Mechanisms, and Negative Results
We injected CorticalNet brain states into Qwen3.5-27B as a universal LLM enhancer. Then we retracted three early wins under matched-pairs sampling, found that a 13MB static encoder beat transformer encoders inside our specific stack, and confirmed the brain pipeline actively hurts technical generation work. Follow-up experiments narrowed the claim twice: the technical harm is input-conditioning-specific (any input-conditioned prefix loses to OFF); the Mid recall benefit is prefix-tuning-generic (brain, keyword-extract, static soft prompts, and raw-encoder prefixes all tie each other and all beat OFF). The arc is a case study in why methodology matters more than mechanism, and in what survives as each methodology pass narrows the claim.
Boundary-Conditioned Structural Alignment for Cross-Session KV Cache Injection
Cross-session KV cache injection achieves 2.6-3.4x TTFT reduction versus strong text-RAG but loses 5-10 points of retrieval quality. We decompose the gap on Qwen 2.5 7B and find the dominant driver is structural, not attention-sink loss or RoPE position mismatch. Boundary-conditioned structural alignment (archiving KV states with full chat-template wrapping and splicing at turn boundaries) recovers +5.6pt (p<0.01, n=467, merged pool). A minimal 3-token E-wrapper captures 97.7% of the effect at 22% token overhead. A cross-family test on Phi-3-medium-128k-instruct reaches formal significance (p=0.013, n=200), indicating that the mechanism generalizes.