Zellige AI Research

Papers & notes

Longer-form work on frontier models, evaluations, and what we've learned building them. Often shipped alongside open weights and reference code.


Jul 14, 2026
Tessera-Preview-9B: Compressed Reasoning at 18x Fewer Tokens, and What It Costs
Tessera-Preview-9B is a proof of concept: a 9B coding model fine-tuned on a 10K-example corpus to reason internally in a dense CJK notation minted by a compressor we own, while emitting ordinary code and tool calls. On the full LiveCodeBench set (1,055 problems, matched 16K greedy budget, paired harness) it reaches 34.9% against its base's 39.5% while spending a median 17.7x fewer output tokens per problem. The two models fail in entirely different ways: the base almost never writes wrong code (95.9% of its completions pass) but thinks into the token ceiling on 58.8% of problems, while Tessera completes 78% of problems cheaply and errs by writing wrong code. The compressed channel is causally load-bearing (in an archived 50-problem ablation, forcing an empty think collapses accuracy to 4%), and the register itself accounts for only ~1.7x of the saving; the rest comes from a trained brevity curriculum. The measured costs are equally plain: -4.6 points of coding accuracy at matched greedy budget, a large instruction-following deficit on prompts far from the 100%-code training distribution, and none of the base's sampling headroom. The failure pattern tracks training-data coverage closely enough that the small, narrow corpus is likely a major cause, which is what the scaled successor now in preparation is designed to test.
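The headline figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below is not taken from the paper. It assumes that every problem on which the base hits the token ceiling is scored as a failure, and that the 17.7x token saving decomposes multiplicatively into a register factor and a brevity factor. All function and variable names are illustrative.

```python
# Illustrative arithmetic only; the input figures come from the summary above.
# Assumption: a run that hits the token ceiling is scored as a failure.

def accuracy(completion_rate: float, pass_rate_given_completion: float) -> float:
    """Overall accuracy when only completed problems can pass."""
    return completion_rate * pass_rate_given_completion

# Base model: hits the ceiling on 58.8% of problems, passes 95.9% of the rest.
base_completion = 1.0 - 0.588
base_accuracy = accuracy(base_completion, 0.959)

# Tessera: 34.9% overall with 78% of problems completed, so the pass rate
# among completed problems is implied rather than reported.
tessera_implied_pass_rate = 0.349 / 0.78

# Assumption: total saving = register factor * brevity-curriculum factor.
brevity_factor = 17.7 / 1.7

print(f"base accuracy from rates:     {base_accuracy:.3f}")
print(f"Tessera implied pass rate:    {tessera_implied_pass_rate:.3f}")
print(f"implied brevity-only factor:  {brevity_factor:.1f}x")
```

Under these assumptions the base's reported 39.5% is reproduced by its completion and pass rates, which is consistent with nearly all of its failures being truncations rather than wrong code.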
May 18, 2026
Parametric Memory Cannot Replace Retrieval: Why Titans-Style Neural Memory Fails for Cross-File Symbol Resolution
We integrate Titans-style parametric memory into Qwen 3.5-9B and evaluate on CrossCodeEval for cross-file code completion. Across four architecture iterations, parametric memory produces zero measurable improvement over a memory-ablated baseline (1.0% EM vs 1.0% EM, an underpowered comparison given the base model's 2-7/200 solve rate), while explicit context concatenation yields 2.5x higher exact match. Per-sample analysis points to a mechanism-task mismatch: inner-loop SGD on a low-dimensional MLP compresses lossily, but cross-file symbol resolution requires lossless recall of specific token sequences.
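To make "underpowered" concrete, the sketch below computes a 95% Wilson score interval for a 1.0% exact-match rate on 200 samples. It is an illustration, not the paper's analysis: it reads "2-7/200" as between 2 and 7 exact matches out of 200, and the choice of the Wilson interval is an assumption made here.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width

# 2 exact matches out of 200 corresponds to the reported 1.0% EM.
low, high = wilson_interval(2, 200)
print(f"95% interval for 2/200: [{low:.4f}, {high:.4f}]")

# Upper end of the assumed 2-7/200 range for the base model.
print(f"7/200 = {7 / 200:.4f}")
```

Under this reading, 7/200 still falls inside the interval around 2/200, so a sample of this size cannot separate small differences between the memory and memory-ablated conditions.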
May 10, 2026
Eliminating Autograd from Memory-Augmented Transformers
Memory-augmented transformers (Titans, TTT) require per-sample backward passes at inference time, a systems artifact rather than a fundamental requirement. We eliminate autograd entirely via two complementary methods: exact manual gradient kernels (cos=1.0, 5.5x speedup, 53% VRAM reduction) and learned Forward Alignment Networks (cos=0.91, architecture-agnostic). Same-seed verification on a 40M MAC transformer confirms identical training dynamics (BPC gap = 0.0005). Amdahl's-law accounting shows 1.41x end-to-end throughput at 40M, scaling favorably to 6.1x kernel speedup at dim_head=128.
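The relation between the kernel speedup and the end-to-end figure follows Amdahl's law. The sketch below is a hedged illustration rather than the paper's accounting: it assumes the 5.5x kernel speedup is the one in effect for the 40M end-to-end measurement, and it back-solves the accelerated fraction of runtime, which the summary does not state.

```python
def amdahl_speedup(fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime is sped up by `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)

def implied_fraction(overall_speedup: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law to recover the accelerated fraction of runtime."""
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / kernel_speedup)

# Assumption: 5.5x kernel speedup and 1.41x end-to-end refer to the same setup.
fraction = implied_fraction(1.41, 5.5)
print(f"implied accelerated fraction: {fraction:.3f}")

# Round-trip check: plugging the fraction back in recovers the overall figure.
print(f"end-to-end speedup: {amdahl_speedup(fraction, 5.5):.2f}x")
```

Under that assumption, roughly a third of the original runtime sits in the accelerated path, which is why a 5.5x kernel gain translates into a much smaller end-to-end gain.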
May 5, 2026
Importance Is Not Fragility: Why High-Fisher Layers Survive Low-Bit Quantization
Mixed-precision quantization allocates different bit-widths to different layers. The standard heuristic assigns more bits to high-Fisher layers. We show this points in the wrong direction: on Qwen2.5 at 3-bit, the inverse allocation improves perplexity 3.7x. We decompose per-layer quantization damage into Fisher trace and quantization-error covariance, and propose Quantization Visibility, a metric that predicts per-layer fragility at p < 0.001 where Fisher trace fails.
May 2, 2026
Brain-Embedded LLMs: A Research Arc in Methodology, Encoder Mechanisms, and Negative Results
We injected CorticalNet brain states into Qwen3.5-27B as a universal LLM enhancer. Then we retracted three early wins under matched-pairs sampling, found that a 13MB static encoder beat transformer encoders inside our specific stack, and confirmed the brain pipeline actively hurts technical generation work. Follow-up experiments narrowed the claim twice: the technical harm is input-conditioning-specific (any input-conditioned prefix loses to OFF); the Mid recall benefit is prefix-tuning-generic (brain, keyword-extract, static soft prompts, and raw-encoder prefixes all tie each other and all beat OFF). A research arc in why methodology matters more than mechanism, and what survives after each methodology pass narrows the claim further.
Apr 24, 2026
Boundary-Conditioned Structural Alignment for Cross-Session KV Cache Injection
Cross-session KV cache injection achieves 2.6-3.4x TTFT reduction versus strong text-RAG but loses 5-10 points of retrieval quality. We decompose the gap on Qwen 2.5 7B and find the dominant driver is structural, not attention-sink loss or RoPE position mismatch. Boundary-conditioned structural alignment (archiving KV states with full chat-template wrapping and splicing at turn boundaries) recovers +5.6pt (p<0.01, n=467, merged pool). A minimal 3-token E-wrapper captures 97.7% of the effect at 22% token overhead. Cross-family formal significance on Phi-3-medium-128k-instruct (p=0.013, n=200) confirms the mechanism generalizes.
