by Zellige

Importance Is Not Fragility: Why High-Fisher Layers Survive Low-Bit Quantization

Mixed-precision quantization allocates different bit-widths to different layers. The standard heuristic assigns more bits to high-Fisher layers. We show this points in the wrong direction: on Qwen2.5 at 3-bit, the inverse allocation improves perplexity 3.7x. We decompose per-layer quantization damage into Fisher trace and quantization-error covariance, and propose Quantization Visibility, a metric that predicts per-layer fragility at p < 0.001 where Fisher trace fails.

  • quantization
  • fisher-information
  • mixed-precision
  • transformers

Abstract

Mixed-precision quantization allocates different bit-widths to different layers. The standard heuristic, used by HAWQ, BAQ, MixLLM, and others, assigns more bits to layers with higher Fisher or Hessian sensitivity. We show this heuristic points in the wrong direction. On Qwen2.5 models at 3-bit, allocating more bits to high-Fisher layers produces 91.25 perplexity, while the inverse allocation (more bits to low-Fisher layers) produces 24.71, a $3.7\times$ improvement at the same average bit budget. We explain this through a decomposition of per-layer quantization damage into Fisher trace and quantization-error covariance: high-Fisher layers have concentrated curvature that quantization noise largely avoids, while low-Fisher layers have broad sensitivity that quantization noise hits directly. We propose Quantization Visibility ($D_\ell$), a scalar metric that measures the gradient-weighted impact of actual quantization error rather than Fisher sensitivity alone. $D_\ell$ predicts per-layer fragility at $p < 0.001$ (Spearman $\rho = 0.785$ on Qwen2.5-0.5B, $\rho = 0.533$ on Qwen2.5-1.5B) where Fisher trace fails entirely ($p = 0.25$). The effect persists under GPTQ quantization ($\rho = 0.554$, $p = 0.0006$) and reproduces cross-family on Microsoft Phi-2 ($\rho = 0.477$, $p = 0.016$). At 7B scale, damage concentrates in a single layer that $D_\ell$ identifies with Pearson $r = 0.972$. These results hold in the degraded-but-functional quantization regime and suggest that the relationship between layer importance and quantization robustness in trained transformers is inverted relative to standard assumptions.

1. Introduction

Deploying large language models under memory constraints requires aggressive weight quantization. At 4 bits per weight, most models retain near-full quality. At 3 bits and below, quality degrades, and the question becomes which layers to protect with extra precision.

The field's default answer relies on sensitivity metrics derived from the Fisher information matrix or the Hessian of the loss. HAWQ [Dong et al., 2019] uses Hessian eigenvalues. BAQ [Zhang et al., 2025] formulates a convex allocation from a Hessian proxy. MixLLM [Zheng et al., 2024] assigns larger bit-width to globally high-salience output features. The shared assumption is straightforward: layers where the loss changes most under perturbation should receive more bits to protect against quantization noise.

We test this assumption directly on Qwen2.5 models at 3-bit average precision. The result is unambiguous: the assumption is wrong. Allocating more bits to high-Fisher layers and fewer to low-Fisher layers improves on uniform allocation, but inverting the allocation at the same average bit budget produces dramatically better perplexity (24.71 versus 91.25 on Qwen2.5-1.5B).

This is not a failure of Fisher information as a concept. It is a failure of the assumption that Fisher sensitivity implies quantization fragility. Fisher information measures how much the loss changes under arbitrary perturbations. Quantization does not produce arbitrary perturbations. It produces structured noise whose geometry depends on weight distributions, group scales, and outlier patterns. When that structured noise happens to avoid the directions Fisher cares about, a high-Fisher layer can be robust to quantization despite being important to the model.

We formalize this through the decomposition $\Delta \text{KL}_\ell \approx \frac{1}{2} \text{Tr}(G_\ell \cdot C_{q,\ell}^{(b)})$, where $G_\ell$ is the output-side Fisher and $C_{q,\ell}^{(b)}$ is the actual covariance of quantization-induced output perturbation. Fisher-only allocation optimizes $\text{Tr}(G_\ell)$ and ignores $C_{q,\ell}$. Our metric, Quantization Visibility ($D_\ell$), estimates the full product cheaply from a single forward and backward pass on calibration data.

Our contributions

  1. We demonstrate that the standard Fisher/Hessian allocation direction is wrong at sub-4-bit precision, with empirical evidence across three Qwen model sizes (0.5B, 1.5B, 7B) and cross-family validation on Microsoft Phi-2.
  2. We explain the inversion through Fisher-error overlap: high-Fisher layers have sparse curvature that quantization noise avoids.
  3. We propose $D_\ell = \frac{1}{2} \mathbb{E}[(g_\ell^\top \delta y_\ell)^2]$, a cheap per-layer fragility score that predicts actual perplexity impact where Fisher trace does not.
  4. We characterize the regime where this metric operates: degraded-but-functional quantization (perplexity 13 to 200), not catastrophic collapse or minimal degradation.

2. Background

2.1 Post-training quantization and mixed precision

Post-training quantization (PTQ) converts full-precision weights to lower bit-width representations without retraining. Round-to-nearest (RTN) with per-group symmetric scaling is the simplest approach. GPTQ [Frantar et al., 2022] improves on RTN by using Hessian-weighted error feedback during rounding. AWQ [Lin et al., 2023] protects activation-aware salient weights. Recent work pushes toward information-theoretic optimality: WaterSIC [Lifar et al., 2026] achieves within 0.255 bits of the reverse-waterfilling limit.
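As a point of reference, symmetric per-group RTN can be sketched in a few lines of NumPy. This is a minimal illustration, not the exact implementation used in the experiments: the function name `rtn_quantize`, the integer grid `[-qmax, qmax]` with `qmax = 2^(bits-1) - 1`, and the max-abs scale rule are assumptions (other symmetric conventions exist), and `bits` must be at least 2.

```python
import numpy as np


def rtn_quantize(weight: np.ndarray, bits: int, group_size: int = 128) -> np.ndarray:
    """Symmetric per-group round-to-nearest; returns the dequantized weights."""
    out_features, in_features = weight.shape
    if in_features % group_size != 0:
        raise ValueError("in_features must be divisible by group_size")
    # Assumed grid: integers in [-qmax, qmax], e.g. [-3, 3] at 3 bits.
    qmax = 2 ** (bits - 1) - 1
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, set so the largest magnitude lands on qmax.
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(groups / scale), -qmax, qmax)
    return (q * scale).reshape(out_features, in_features)


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 256))  # synthetic weights, not from a real model
mse_by_bits = {b: float(np.mean((rtn_quantize(W, b) - W) ** 2)) for b in (3, 4, 8)}
print(mse_by_bits)
```

The printed dictionary shows the weight-space MSE growing as the bit-width drops, which is the raw error that GPTQ and AWQ then try to shape or redistribute.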

Mixed-precision allocation assigns different bit-widths to different layers (or finer-grained units) under a total memory budget. The key question is which layers to protect. HAWQ and its successors use Hessian or Fisher trace as the sensitivity signal. ScaleBITS [2026] warns that static sensitivity can fail after quantization. Our work identifies a specific failure mode: the sensitivity signal points in the wrong direction at low bit-widths.

2.2 Fisher information for quantization

For a linear layer $y = Wx$ with perturbation $E = \hat{W} - W$, the second-order loss proxy is:

$$\Delta \mathcal{L}_\ell \approx \frac{1}{2} \text{Tr}(G_\ell \cdot C_{q,\ell}), \quad C_{q,\ell} = \mathbb{E}[\delta y_\ell \delta y_\ell^\top]$$

where $G_\ell$ is the output-side Fisher (or Hessian) and $C_{q,\ell}$ is the quantization-error covariance in the output space. Standard allocation methods approximate this as $\text{Tr}(G_\ell) \cdot \sigma_q^2$, treating quantization noise as isotropic. This approximation fails when $C_{q,\ell}$ has directional structure that differs from $G_\ell$.
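A small numerical sketch makes the gap between the two expressions concrete. Everything here is synthetic and hypothetical: `G` is a made-up Fisher with curvature concentrated along one direction `u`, and the three error covariances share the same total variance but point it in different directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Hypothetical output-side Fisher: sharp curvature along u, nearly flat elsewhere.
u = rng.normal(size=d)
u /= np.linalg.norm(u)
G = 100.0 * np.outer(u, u) + 0.01 * np.eye(d)

# A unit direction v orthogonal to u.
v = rng.normal(size=d)
v -= (v @ u) * u
v /= np.linalg.norm(v)

sigma2 = 1e-2
C_iso = sigma2 * np.eye(d)              # isotropic noise
C_avoid = sigma2 * d * np.outer(v, v)   # same trace, all of it orthogonal to u
C_hit = sigma2 * d * np.outer(u, u)     # same trace, all of it along u


def damage(G: np.ndarray, C: np.ndarray) -> float:
    """Second-order proxy 0.5 * Tr(G C)."""
    return 0.5 * float(np.trace(G @ C))


iso_proxy = 0.5 * float(np.trace(G)) * sigma2   # the Tr(G) * sigma^2 shortcut
print("isotropic proxy:", iso_proxy)
print("actual, isotropic noise:", damage(G, C_iso))
print("actual, noise avoids u:", damage(G, C_avoid))
print("actual, noise hits u:", damage(G, C_hit))
```

The shortcut matches the true proxy only in the isotropic case. With the same noise budget, the damage can be far below or far above the shortcut depending on whether the error lines up with the high-curvature direction.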

3. The Inverse-Fisher Anomaly

We quantize Qwen2.5-1.5B to 3-bit average precision using symmetric per-group RTN (group size 128). We compute per-layer output-Fisher trace from 32 calibration sequences using empirical Fisher estimation (sample from model predictions, backpropagate cross-entropy loss, collect output gradients). We then test three allocation strategies at the same average bit budget:

  • Uniform: every layer at 3 bits.
  • Fisher-gated: top 20% of layers by Fisher trace receive 4 bits, rest receive 3 bits (with adjustment to maintain average).
  • Inverse-Fisher: top 20% of layers by Fisher trace receive 3 bits (or lower), bottom 80% receive 4 bits.
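The two non-uniform strategies differ only in which side of the Fisher ranking gets the extra bit. A minimal sketch of that split follows; `binary_allocation` and `average_bits` are hypothetical helper names, and the sketch only reports the resulting average rather than reproducing the adjustment used to hold the budget fixed, which is not specified in detail here.

```python
import numpy as np


def binary_allocation(fisher_trace, high_bits, low_bits, top_frac=0.2, protect="high"):
    """Split layers by Fisher trace and give `high_bits` to the protected side."""
    fisher_trace = np.asarray(fisher_trace, dtype=float)
    n = len(fisher_trace)
    k = max(1, int(round(top_frac * n)))
    top = np.argsort(-fisher_trace)[:k]          # indices of the top-k Fisher layers
    if protect == "high":                        # Fisher-gated
        bits = np.full(n, low_bits, dtype=int)
        bits[top] = high_bits
    elif protect == "low":                       # inverse-Fisher
        bits = np.full(n, high_bits, dtype=int)
        bits[top] = low_bits
    else:
        raise ValueError("protect must be 'high' or 'low'")
    return bits


def average_bits(bits, n_params):
    """Parameter-weighted average bit-width."""
    bits = np.asarray(bits, dtype=float)
    n_params = np.asarray(n_params, dtype=float)
    return float(np.sum(bits * n_params) / np.sum(n_params))


# Made-up Fisher traces for ten equally sized layers.
fisher = [5.0, 1.0, 3.0, 2.0, 4.0, 0.5, 0.1, 9.0, 7.0, 6.0]
gated = binary_allocation(fisher, high_bits=4, low_bits=3, protect="high")
inverse = binary_allocation(fisher, high_bits=4, low_bits=3, protect="low")
print(gated, average_bits(gated, np.ones(10)))
print(inverse, average_bits(inverse, np.ones(10)))
```

Because the two raw splits land on different averages, a fair comparison needs a further step that lowers or raises some layers until both match the target budget.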

Results

| Allocation | 3-bit average PPL | 4-bit average PPL |
| --- | --- | --- |
| FP16 baseline | 11.44 | 11.44 |
| Uniform RTN | 178.14 | 15.06 |
| Fisher-gated (protect high-Fisher) | 91.25 | 13.80 |
| Inverse (protect low-Fisher) | 24.71 | 13.06 |

The inverse allocation is $7.2\times$ better than uniform and $3.7\times$ better than Fisher-gated at 3-bit average. The effect is consistent at 4-bit but smaller in magnitude (13.06 vs 15.06), as expected when quantization damage is less severe overall.
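The ratios quoted above follow directly from the table. A quick check, using only the reported perplexities:

```python
# Perplexities copied from the results table.
ppl_3bit = {"uniform": 178.14, "fisher_gated": 91.25, "inverse": 24.71}
ppl_4bit = {"uniform": 15.06, "fisher_gated": 13.80, "inverse": 13.06}

ratio_vs_uniform_3bit = ppl_3bit["uniform"] / ppl_3bit["inverse"]
ratio_vs_gated_3bit = ppl_3bit["fisher_gated"] / ppl_3bit["inverse"]
ratio_vs_uniform_4bit = ppl_4bit["uniform"] / ppl_4bit["inverse"]
ratio_vs_gated_4bit = ppl_4bit["fisher_gated"] / ppl_4bit["inverse"]

print(f"3-bit: {ratio_vs_uniform_3bit:.2f}x vs uniform, {ratio_vs_gated_3bit:.2f}x vs Fisher-gated")
print(f"4-bit: {ratio_vs_uniform_4bit:.2f}x vs uniform, {ratio_vs_gated_4bit:.2f}x vs Fisher-gated")
```

At 4-bit the same ratios shrink to roughly 1.15 and 1.06, which is the "smaller in magnitude" effect described above.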

Both Fisher-gated and inverse allocation beat uniform, so both mixed-precision splits tested here help. But the direction matters: Fisher-gated helps by accident (some layers happen to benefit from extra bits), while inverse allocation gives the extra bits to the layers that are actually fragile.

4. Explanation: Importance and Fragility Are Different

4.1 The decomposition

The per-layer damage from quantization is not $\text{Tr}(G_\ell)$. It is $\text{Tr}(G_\ell \cdot C_{q,\ell})$. These are different when the eigenvectors of $G_\ell$ and $C_{q,\ell}$ are not aligned.

A layer can have high Fisher trace (important) but low damage (robust) when its Fisher-sensitive directions do not overlap with the directions where quantization introduces error. Concretely:

  • Sparse Fisher curvature. High-Fisher layers often have curvature concentrated in a few directions (low Fisher participation ratio $d_F = (\text{Tr } G)^2 / \text{Tr}(G^2)$). Quantization noise distributes across all directions, so most of it falls outside the sensitive subspace.
  • Structured quantization error. RTN with per-group scaling produces error that depends on weight distribution shape, outlier channels, and group dynamic range. This error is not isotropic and can systematically avoid high-curvature directions.
  • RMSNorm suppression. In pre-norm transformer blocks, perturbations that are radial (scaling the residual stream) get suppressed by the next normalization layer. Only tangential perturbations (changing the direction) propagate. If high-Fisher layers produce mostly radial quantization error, normalization makes them robust.
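Two of these mechanisms are easy to illustrate on synthetic data. The sketch below computes the participation ratio for an isotropic and a rank-one Fisher, then shows that a gain-free RMSNorm (an assumption: real blocks also apply a learned per-channel gain) is essentially blind to radial perturbations but not to tangential ones of the same size. The helper names are hypothetical.

```python
import numpy as np


def participation_ratio(G: np.ndarray) -> float:
    """d_F = (Tr G)^2 / Tr(G^2): the effective number of sensitive directions."""
    return float(np.trace(G) ** 2 / np.trace(G @ G))


def rmsnorm(h: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm without the learned gain."""
    return h / np.sqrt(np.mean(h ** 2) + eps)


rng = np.random.default_rng(0)
d = 64
u = rng.normal(size=d)
u /= np.linalg.norm(u)

pr_isotropic = participation_ratio(np.eye(d))       # curvature spread over all d directions
pr_rank_one = participation_ratio(np.outer(u, u))   # curvature in a single direction

# Radial vs tangential perturbations of equal norm on a residual-stream vector h.
h = rng.normal(size=d)
radial = 0.1 * h
t = rng.normal(size=d)
t -= (t @ h) / (h @ h) * h                          # remove the component along h
tangential = t / np.linalg.norm(t) * np.linalg.norm(radial)

radial_shift = float(np.linalg.norm(rmsnorm(h + radial) - rmsnorm(h)))
tangential_shift = float(np.linalg.norm(rmsnorm(h + tangential) - rmsnorm(h)))
print(pr_isotropic, pr_rank_one, radial_shift, tangential_shift)
```

With a participation ratio near 1 in a 64-dimensional output space, isotropic noise puts only about 1/64 of its energy into the sensitive direction, which is the intuition behind the first bullet.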

4.2 Why the field assumed otherwise

The standard Fisher/Hessian allocation framework was developed and tested under conditions where quantization error is approximately isotropic: 8-bit and 4-bit regimes where the noise is small and well-distributed. In that regime, $\text{Tr}(G_\ell) \cdot \sigma_q^2$ is a reasonable approximation because $C_{q,\ell} \approx \sigma_q^2 I$.

At 3 bits and below, quantization noise becomes large and highly structured. Group clipping, outlier distortion, and the coarse quantization grid create directional error patterns. The isotropic approximation breaks, and Fisher trace stops predicting damage.
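One of these effects, outlier distortion on a coarse grid, can be seen on a single synthetic group. The sketch below is illustrative only: the weights are random, the outlier value is made up, and `rtn_group` is a hypothetical helper using the same max-abs symmetric grid assumed for RTN elsewhere.

```python
import numpy as np


def rtn_group(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest on one group with a single shared scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale


rng = np.random.default_rng(0)
group = rng.normal(scale=0.02, size=128)   # synthetic weights, not from a real model
group_outlier = group.copy()
group_outlier[0] = 1.0                     # one outlier stretches the shared scale

zeroed = {}
for bits in (8, 4, 3):
    for name, w in (("plain", group), ("outlier", group_outlier)):
        w_hat = rtn_group(w, bits)
        zeroed[(name, bits)] = float(np.mean(w_hat == 0))
        rel_err = np.linalg.norm(w_hat - w) / np.linalg.norm(w)
        print(f"{bits}-bit {name:8s} zeroed={zeroed[(name, bits)]:.2f} rel_err={rel_err:.3f}")
```

At 8 bits the outlier costs the rest of the group little. At 3 bits the grid step is so large that nearly every non-outlier weight rounds to zero, so the error is determined by the weight pattern itself rather than behaving like small independent noise.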

5. Quantization Visibility ($D_\ell$)

We define Quantization Visibility as:

$$D_\ell(b) = \frac{1}{2} \mathbb{E}_{x,g}\left[(g_\ell^\top \delta y_\ell^{(b)})^2\right]$$

where $g_\ell = \partial \mathcal{L} / \partial y_\ell$ is the gradient of the loss with respect to the layer output (from a backward pass on calibration data) and $\delta y_\ell^{(b)} = (Q_b(W_\ell) - W_\ell) x_\ell$ is the output perturbation induced by $b$-bit quantization on calibration input $x_\ell$.

$D_\ell$ is a scalar contraction of the Fisher-error overlap: it measures how much the actual quantization perturbation projects onto the gradient-sensitive direction. It does not require computing or storing the full Fisher matrix $G_\ell$ or the error covariance $C_{q,\ell}$. It requires one forward pass (to collect $x_\ell$ and compute $\delta y_\ell$) and one backward pass (to collect $g_\ell$) on calibration data.
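Given the collected inputs and output gradients, the score reduces to a few array operations. The following is a minimal NumPy sketch under stated assumptions: `quantization_visibility` is a hypothetical name, each row of `X` and `G_out` is one token position, and the fixed-step rounding used for `W_hat` is a stand-in for a real quantizer.

```python
import numpy as np


def quantization_visibility(W, W_hat, X, G_out) -> float:
    """0.5 * mean over samples of (g^T dy)^2, with dy = (W_hat - W) x.

    W, W_hat: (out, in) original and quantized weights.
    X: (n, in) layer inputs; G_out: (n, out) loss gradients w.r.t. layer outputs.
    """
    delta_y = X @ (W_hat - W).T                     # (n, out) output perturbation
    projection = np.sum(G_out * delta_y, axis=1)    # g^T dy for each sample
    return 0.5 * float(np.mean(projection ** 2))


rng = np.random.default_rng(0)
n_tokens, d_in, d_out = 256, 32, 16
W = rng.normal(size=(d_out, d_in))
W_hat = np.round(W / 0.25) * 0.25                   # stand-in quantizer (assumption)
X = rng.normal(size=(n_tokens, d_in))               # synthetic calibration inputs
G_out = rng.normal(size=(n_tokens, d_out))          # synthetic output gradients

d_score = quantization_visibility(W, W_hat, X, G_out)
print(d_score)
```

Only the per-sample scalar projection is kept, so memory scales with the number of calibration tokens rather than with the square of the output dimension. Doubling the weight error quadruples the score, as expected for a second-order quantity.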

5.1 Comparison to alternatives

| Metric | What it measures | Correlation with $V_\ell(3)$ on Qwen2.5-1.5B |
| --- | --- | --- |
| Fisher trace | Layer importance (arbitrary perturbations) | $\rho = 0.274$, $p = 0.111$ |
| Weight quant MSE | Raw quantization error magnitude | $\rho = -0.212$, $p = 0.221$ |
| FEO (normalized) | Fisher-error overlap ratio | $\rho = 0.181$, $p = 0.298$ |
| $D_\ell$ | Gradient-weighted quantization damage | $\rho = 0.533$, $p = 0.001$ |

Fisher trace has no significant correlation with actual fragility. Weight quantization MSE is negatively correlated (layers with more raw error are less fragile, consistent with the inverse-Fisher finding). The normalized Fisher-Error Overlap (FEO) loses magnitude information that makes $D_\ell$ predictive. $D_\ell$ is the only metric that significantly predicts which layers are actually damaged by quantization.
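The correlations in this comparison are Spearman rank correlations between each metric and the measured fragility. For readers who want to reproduce that step on their own per-layer scores, a NumPy-only sketch follows; the helper names are hypothetical, and in practice a statistics library would also supply the p-value.

```python
import numpy as np


def average_ranks(values) -> np.ndarray:
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman_rho(a, b) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


# Toy scores for five layers (made up): the metric orders most layers correctly.
metric = [1.0, 2.0, 3.0, 4.0, 5.0]
fragility = [2.0, 1.0, 4.0, 3.0, 5.0]
rho_toy = spearman_rho(metric, fragility)
print(rho_toy)
```

Because only ranks enter the computation, any monotone rescaling of a metric leaves its Spearman correlation unchanged, which makes metrics with very different units comparable.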

6. Experiments

6.1 Per-layer fragility measurement

For each model, we quantize all linear layers to $b$-bit, then restore each target layer one at a time to FP16 and measure the perplexity change. The fragility score $V_\ell(b)$ is the perplexity drop from restoring layer $\ell$:

$$V_\ell(b) = \text{PPL}_{\text{all-}b\text{-bit}} - \text{PPL}_{\text{restore-}\ell}$$

Positive $V_\ell$ means the layer benefits from restoration (is fragile). Negative $V_\ell$ means restoration hurts (the layer participates in a compensation equilibrium with other quantized layers).
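The measurement loop itself is simple once a perplexity evaluator is available. The sketch below assumes a hypothetical callback `eval_ppl(restored)` that returns perplexity with every layer quantized except those named in `restored`; the toy evaluator and its numbers are made up purely to exercise the loop.

```python
from typing import Callable, Dict, FrozenSet, Iterable


def fragility_scores(
    layer_names: Iterable[str],
    eval_ppl: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """V = PPL(all layers quantized) - PPL(all quantized except one restored layer)."""
    ppl_all_quantized = eval_ppl(frozenset())
    return {
        name: ppl_all_quantized - eval_ppl(frozenset({name}))
        for name in layer_names
    }


def toy_eval_ppl(restored: FrozenSet[str]) -> float:
    """Stand-in evaluator with made-up perplexities."""
    table = {
        frozenset(): 100.0,
        frozenset({"a"}): 60.0,    # restoring "a" helps
        frozenset({"b"}): 104.0,   # restoring "b" hurts
    }
    return table[restored]


scores = fragility_scores(["a", "b"], toy_eval_ppl)
print(scores)
```

Each score costs one full perplexity evaluation, so the procedure scales linearly with the number of target layers.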

We evaluate on WikiText-2 with 60,000 tokens. Fisher trace and $D_\ell$ are computed from 32 calibration sequences of 512 tokens from C4. Target layers are sampled from 5 evenly-spaced decoder blocks (7 layer types each: q/k/v/o projections, gate/up/down MLP projections).
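The target set of 5 blocks times 7 layer types can be enumerated as below. This is a sketch: the rounding rule for "evenly spaced" and the `self_attn.*` module names are assumptions that follow the `layers.27.mlp.down_proj` naming pattern used later in the results, and other model families name their projections differently.

```python
import numpy as np

# Assumed module suffixes for the seven layer types.
LAYER_TYPES = [
    "self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj",
    "mlp.gate_proj", "mlp.up_proj", "mlp.down_proj",
]


def sample_target_layers(num_blocks: int, num_sampled: int = 5) -> list:
    """Pick evenly spaced decoder blocks (first and last included) and list their linear layers."""
    positions = np.linspace(0, num_blocks - 1, num_sampled)
    blocks = np.unique(np.round(positions).astype(int))
    return [f"layers.{b}.{t}" for b in blocks for t in LAYER_TYPES]


targets = sample_target_layers(num_blocks=28)   # e.g. a 28-block decoder
print(len(targets), targets[0], targets[-1])
```

For a 28-block decoder this yields 35 target layers, matching the 35 tested layers reported for Qwen2.5-1.5B.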

6.2 Results across model sizes

| Model | Family | Quantizer | Bits | Baseline PPL | $D_\ell$ Spearman $\rho$ | $p$-value |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-0.5B | Qwen | RTN | 3 | 196 | 0.785 | $< 0.0001$ |
| Qwen2.5-1.5B | Qwen | RTN | 3 | 63 | 0.533 | 0.001 |
| Qwen2.5-1.5B | Qwen | GPTQ | 3 | 63 | 0.554 | 0.0006 |
| Phi-2 (2.7B) | Microsoft | RTN | 3 | 13 | 0.477 | 0.016 |
| Qwen2.5-7B | Qwen | RTN | 3 | 25 | 0.126 | 0.47 |

On 0.5B and 1.5B, $D_\ell$ has strong rank correlation with actual layer fragility. Under GPTQ quantization on the same 1.5B model, $D_\ell$ remains significant ($\rho = 0.554$, $p = 0.0006$), confirming the effect is not an artifact of the RTN quantizer. On Phi-2, a completely different architecture family (Microsoft, different training pipeline, different tokenizer, different normalization), $D_\ell$ reproduces with $\rho = 0.477$ ($p = 0.016$). This confirms the finding generalizes across both quantizers and architecture families.

On 7B, the Spearman correlation is not significant, but the Pearson correlation is 0.972. Damage at 7B concentrates in a single extreme outlier: restoring layers.27.mlp.down_proj drops perplexity from 25.21 to 8.49 ($V_\ell = 16.72$), while all other layers cluster near $V_\ell \approx 0$. $D_\ell$ correctly identifies this outlier but the rank ordering among the remaining near-zero layers is noise. At larger model scale, quantization fragility becomes more concentrated, making correct identification of the few critical layers even more important.
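The split between a high Pearson and a low Spearman correlation is what one would expect from a single dominant outlier. The synthetic sketch below reproduces the pattern: 34 layers with unrelated near-zero scores plus one layer that is extreme in both. Only the value 16.72 is taken from the text; the metric value for the outlier and the noise scales are made up.

```python
import numpy as np


def ranks(values: np.ndarray) -> np.ndarray:
    """Ordinal ranks (no tie handling; the synthetic values are continuous)."""
    order = np.argsort(values)
    out = np.empty(len(values))
    out[order] = np.arange(len(values))
    return out


rng = np.random.default_rng(0)
n_layers = 35

# Near-zero fragility and an unrelated near-zero metric for most layers.
fragility = rng.normal(scale=0.05, size=n_layers)
metric = np.abs(rng.normal(scale=0.05, size=n_layers))

# One layer dominates both quantities.
fragility[-1] = 16.72
metric[-1] = 10.0   # hypothetical metric value for the outlier layer

pearson = float(np.corrcoef(fragility, metric)[0, 1])
spearman = float(np.corrcoef(ranks(fragility), ranks(metric))[0, 1])
print(f"Pearson {pearson:.3f}, Spearman {spearman:.3f}")
```

Pearson is driven almost entirely by the one extreme point, while Spearman gives that point the same weight as any other rank and so mostly reflects the noise among the remaining layers.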

6.3 Regime dependence

$D_\ell$ is predictive in the degraded-but-functional regime (baseline perplexity roughly 13 to 200, the range spanned by the significant results in Section 6.2). Outside this regime, it is not informative:

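| Model | Bits | Baseline PPL | Regime | $D_\ell$ Spearman $\rho$ |
| --- | --- | --- | --- | --- |
| Qwen2.5-3B | 3 | 51,057 | Catastrophic | 0.085 |
| Qwen2.5-3B | 4 | 8.4 | Minimal degradation | 0.006 |
| OPT-1.3B | 3 | 4,054 | Catastrophic | 0.211 |
| OPT-1.3B | 4 | 12.0 | Minimal degradation | 0.080 |
| Pythia-2.8B | 3 | 25.3 | Degraded | 0.000 |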

When quantization catastrophically destroys the model (PPL > 1000), per-layer restoration cannot recover meaningful function, and all metrics are noise. When quantization barely degrades the model (PPL near FP16), there is no fragility signal to predict. $D_\ell$ operates at the quantization cliff, which is exactly where bit allocation decisions matter most.

6.4 Layer-type patterns

Across all models and blocks, down_proj (the MLP down-projection) is consistently the most fragile layer type. On Qwen2.5-7B, a single down_proj in the final block accounts for 66% of total quantization damage. Attention projections (q/k/v) sometimes show negative $V_\ell$: restoring them to FP16 in an otherwise 3-bit model hurts perplexity, suggesting they participate in compensation dynamics with other quantized layers.

6.5 Negative $V_\ell$ layers

9 of 35 tested layers on Qwen2.5-1.5B have negative $V_\ell$: restoring them makes the model worse. This occurs because quantized layers develop compensation patterns. When the entire model operates at low precision, error distributions partially cancel across layers. Restoring one layer to full precision disrupts this equilibrium.

This has a practical implication: per-layer sensitivity analysis in a fully-quantized context is not equivalent to per-layer sensitivity in a mixed context. The compensation effect means brute-force "restore one layer" measurements overestimate the benefit of protecting individual layers in isolation, and underestimate the benefit of protecting the truly critical layers while leaving the compensating layers alone.
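A toy two-layer example shows how a negative score can arise from cancellation. Everything here is hypothetical: the "layers" are scalar weights, the "quantized" values are hand-picked so their errors partially cancel, and absolute output error stands in for perplexity.

```python
# Full-precision scalar "layers": output = w2 * w1 * x.
w1, w2 = 1.0, 1.0
# Hand-picked quantized values: one overshoots, the other undershoots.
w1_q, w2_q = 1.25, 0.85
x = 1.0
reference = w2 * w1 * x


def output_error(a: float, b: float) -> float:
    """Absolute deviation of the two-layer output from the full-precision output."""
    return abs(b * a * x - reference)


err_all_quantized = output_error(w1_q, w2_q)   # errors partially cancel
err_restore_layer1 = output_error(w1, w2_q)    # only layer 2 is quantized
err_restore_layer2 = output_error(w1_q, w2)    # only layer 1 is quantized

# Analogue of V: positive means restoring the layer helps.
v_layer1 = err_all_quantized - err_restore_layer1
v_layer2 = err_all_quantized - err_restore_layer2
print(err_all_quantized, err_restore_layer1, err_restore_layer2, v_layer1, v_layer2)
```

Both scores come out negative: each layer's error was offsetting the other's, so restoring either one alone exposes the remaining error in full.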

7. Related Work

Hessian/Fisher-based allocation. HAWQ [Dong et al., 2019] and HAWQ-V2 [Dong et al., 2020] use Hessian eigenvalues for per-layer bit allocation, explicitly assigning more bits to layers with larger Hessian trace. BAQ [Zhang et al., 2025] formulates bit allocation as convex optimization from a Hessian-proxy sensitivity model. LampQ [Kim et al., 2025] uses Fisher-based metrics for layer-wise allocation in vision transformers. All of these allocate in the direction opposite to what we find works at sub-4-bit.

Output-feature mixed precision. MixLLM [Zheng et al., 2024] assigns precision by output-feature salience. CMPQ [Chen et al., 2024] explores channel-wise mixed precision. These methods operate on a finer granularity than per-layer allocation but share the sensitivity-implies-fragility assumption.

Dynamic sensitivity. ScaleBITS [2026] argues that static sensitivity estimates fail after quantization and proposes dynamic sensitivity search. This is the closest prior work to our finding: it warns that the standard metric can fail, though it does not identify the specific inversion we observe or explain it through Fisher-error overlap.

Rate-distortion theory. WaterSIC [Lifar et al., 2026] applies reverse waterfilling to achieve near-information-theoretic quantization. Its distortion metric is input-covariance-weighted MSE ($\text{Tr}(E \Sigma_x E^\top)$), which does not include the output-side Fisher term. Our decomposition extends this to $\text{Tr}(G_\ell E \Sigma_x E^\top)$, showing that the output-side term cannot be reduced to a scalar weight when quantization noise has directional structure.
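The link between the two distortion measures follows in one line from the definitions in Section 2.2. This is a sketch assuming $\Sigma_x = \mathbb{E}[x_\ell x_\ell^\top]$ denotes the uncentered input second moment and that the weight error $E$ is fixed once the layer is quantized:

$$
\begin{aligned}
\delta y_\ell &= E x_\ell, \\
C_{q,\ell} &= \mathbb{E}[\delta y_\ell \delta y_\ell^\top] = E \, \mathbb{E}[x_\ell x_\ell^\top] \, E^\top = E \Sigma_x E^\top, \\
\text{Tr}(G_\ell C_{q,\ell}) &= \text{Tr}(G_\ell E \Sigma_x E^\top).
\end{aligned}
$$

In the special case $G_\ell = \gamma I$, the last expression reduces to $\gamma \, \text{Tr}(E \Sigma_x E^\top)$, which is the input-covariance-weighted MSE up to a constant. The two measures therefore agree on rankings only when the output-side curvature is isotropic.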

Sensitivity warnings. PQI/ReQuant [2025] demonstrates that gradient-based sensitivity metrics can underestimate quantization impact by orders of magnitude. QEP [2025] identifies cross-layer error propagation as a central bottleneck at low bit-widths. Both support the conclusion that local Fisher is insufficient, though neither identifies the directional inversion.

8. Discussion

Why the standard heuristic was not caught earlier

At 4-bit and above, the inversion effect is small (13.06 vs 13.80). The standard heuristic is only clearly wrong in the sub-4-bit regime, which has only recently become practically relevant with methods like BitNet b1.58 and WaterSIC pushing toward extreme compression. Prior mixed-precision work validated at 4-8 bits, where the isotropic approximation holds and Fisher direction is roughly correct.

Connection to training dynamics

An alternative explanation for why high-Fisher layers are robust: layers that receive more gradient signal during training are pushed toward wider minima in those directions, making them naturally resistant to perturbation. This is consistent with the observation that Fisher-sensitive directions coincide with well-explored regions of the loss landscape. We do not test this hypothesis directly, but it suggests the inversion may be a general property of SGD-trained networks, not specific to quantization.

Limitations

  • Architecture coverage. Our primary results are on the Qwen2.5 family (0.5B, 1.5B, 7B) with cross-family confirmation on Microsoft Phi-2 (2.7B). Pythia-2.8B showed no signal despite being in the degraded regime (PPL 25), and OPT-1.3B did not enter the degraded regime at tested bit-widths. Additional architecture families (Llama, Mistral) would strengthen generality.
  • Quantizer. We test RTN and GPTQ. $D_\ell$ is significant under both ($\rho = 0.533$ for RTN, $\rho = 0.554$ for GPTQ on Qwen2.5-1.5B). AWQ has not been tested.
  • No end-to-end allocator. We show $D_\ell$ predicts fragility but do not build a complete allocation system. The inverse-Fisher shootout uses a crude binary split, not continuous $D_\ell$-guided allocation. A proper allocator that greedily assigns bits by marginal $D_\ell$ reduction is the natural next step.
  • Perplexity only. We evaluate on WikiText-2 perplexity. Downstream task evaluation (MMLU, ARC, HumanEval) may show different sensitivity patterns.
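To make the allocator limitation concrete, the greedy scheme mentioned above could look roughly like the following. This is a hypothetical sketch, not something built or evaluated in this work: `greedy_allocate`, the table layout `d_table[layer][bits]`, and the gain-per-extra-bit rule are all assumptions, and the example scores are made up.

```python
from typing import Dict, List


def greedy_allocate(
    d_table: Dict[str, Dict[int, float]],
    n_params: Dict[str, int],
    bit_options: List[int],
    avg_bits_budget: float,
) -> Dict[str, int]:
    """Start at the lowest bit-width, then repeatedly upgrade the layer with the
    largest visibility reduction per extra stored bit while the budget allows."""
    options = sorted(bit_options)
    alloc = {name: options[0] for name in d_table}
    budget = avg_bits_budget * sum(n_params.values())
    used = sum(alloc[name] * n_params[name] for name in alloc)
    while True:
        best, best_gain, best_cost, best_next = None, 0.0, 0, 0
        for name, bits in alloc.items():
            idx = options.index(bits)
            if idx + 1 == len(options):
                continue                      # already at the highest option
            nxt = options[idx + 1]
            cost = (nxt - bits) * n_params[name]
            if used + cost > budget:
                continue                      # upgrade would exceed the budget
            gain = (d_table[name][bits] - d_table[name][nxt]) / cost
            if gain > best_gain:
                best, best_gain, best_cost, best_next = name, gain, cost, nxt
        if best is None:
            return alloc
        alloc[best] = best_next
        used += best_cost


# Made-up visibility scores at 3 and 4 bits for three equally sized layers.
d_table = {"a": {3: 10.0, 4: 1.0}, "b": {3: 0.5, 4: 0.4}, "c": {3: 4.0, 4: 0.5}}
n_params = {"a": 100, "b": 100, "c": 100}
allocation = greedy_allocate(d_table, n_params, bit_options=[3, 4], avg_bits_budget=3.7)
print(allocation)
```

A scheme like this treats per-layer damage as additive. The compensation effects described in Section 6.5 suggest that assumption may not hold, so scores would likely need to be recomputed after each upgrade.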

9. Conclusion

The relationship between Fisher sensitivity and quantization robustness in trained transformers is inverted at low bit-widths. Layers the model relies on most (high Fisher trace) are the most resistant to quantization noise, because their curvature is concentrated in directions that quantization noise avoids. Layers that appear less important (low Fisher trace) are fragile because their broad, diffuse sensitivity intersects with quantization error across many directions.

$D_\ell$ captures this by measuring what Fisher trace alone cannot: the projection of actual quantization perturbation onto the gradient-sensitive direction. It predicts per-layer fragility at $p < 0.001$ where Fisher trace fails entirely, and it operates at the quantization cliff where allocation decisions have the most impact.

For practitioners deploying quantized models, the implication is direct: protect the quiet layers, not the loud ones.

References

  • Dong, Z., Yao, Z., Gholami, A., et al. "HAWQ: Hessian AWare Quantization." ICCV 2019.
  • Dong, Z., Yao, Z., Cai, Y., et al. "HAWQ-V2: Hessian Aware trace-Weighted Quantization." NeurIPS 2020.
  • Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
  • Lin, J., Tang, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
  • Zhang, C., Wang, L., Lasaulce, S., Debbah, M. "BAQ: Efficient Bit Allocation Quantization for Large Language Models." arXiv:2506.05664, 2025.
  • Zheng, Z., Song, X., Liu, C. "MixLLM: LLM Quantization with Global Mixed-precision between Output-features." arXiv:2412.14590, 2024.
  • Lifar, E., Savkin, S., Ordentlich, O., Polyanskiy, Y. "WaterSIC: Information-theoretically (near) optimal linear layer quantization." arXiv:2603.04956, 2026.
  • Kim, J., et al. "LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers." AAAI 2026.
  • Chen, Z., Xie, B., Li, J., Shen, C. "CMPQ: Channel-wise Mixed-Precision Quantization." arXiv:2410.13056, 2024.

Appendix A: Experimental Details

Unless otherwise noted (the GPTQ run in Section 6.2), all experiments use symmetric per-group RTN quantization with group size 128. Fisher trace is estimated from 32 C4 calibration sequences of 512 tokens using empirical Fisher (sample from model predictions, backpropagate cross-entropy, collect output gradients). $D_\ell$ uses the same calibration data. Perplexity is evaluated on WikiText-2 with 60,000 tokens. Target layers are sampled from 5 evenly-spaced decoder blocks per model. All computation runs on NVIDIA RTX A6000 (48GB).

Appendix B: Per-Layer Fragility Tables

Full per-layer $V_\ell(3)$, Fisher trace, $D_\ell$, and weight quantization MSE for all tested models are available at [zellige.ai/research/importance-not-fragility].