Importance Is Not Fragility: Why High-Fisher Layers Survive Low-Bit Quantization

Abstract

Mixed-precision quantization allocates different bit-widths to different layers. The standard heuristic, used by HAWQ, BAQ, MixLLM, and others, assigns more bits to layers with higher Fisher or Hessian sensitivity. We show this heuristic points in the wrong direction. On Qwen2.5-1.5B at 3-bit average precision, allocating more bits to high-Fisher layers produces 91.25 perplexity, while the inverse allocation (more bits to low-Fisher layers) produces 24.71, a 3.7$\times$ improvement at the same average bit budget. We explain this through a decomposition of per-layer quantization damage into the output-side Fisher and the quantization-error covariance: high-Fisher layers have concentrated curvature that quantization noise largely avoids, while low-Fisher layers have broad sensitivity that quantization noise hits directly. We propose Quantization Visibility ($D_\ell$), a scalar metric that measures the gradient-weighted impact of actual quantization error rather than Fisher sensitivity alone. $D_\ell$ predicts per-layer fragility at $p < 0.001$ (Spearman $\rho = 0.785$ on Qwen2.5-0.5B, $\rho = 0.533$ on Qwen2.5-1.5B) where Fisher trace fails entirely ($p = 0.25$). The effect persists under GPTQ quantization ($\rho = 0.554$, $p = 0.0006$) and reproduces cross-family on Microsoft Phi-2 ($\rho = 0.477$, $p = 0.016$). At 7B scale, damage concentrates in a single layer that $D_\ell$ identifies with Pearson $r = 0.972$. These results hold in the degraded-but-functional quantization regime and suggest that the relationship between layer importance and quantization robustness in trained transformers is inverted relative to standard assumptions.

1. Introduction

Deploying large language models under memory constraints requires aggressive weight quantization. At 4 bits per weight, most models retain near-full quality. At 3 bits and below, quality degrades, and the question becomes which layers to protect with extra precision.

The field's default answer relies on sensitivity metrics derived from the Fisher information matrix or the Hessian of the loss. HAWQ [Dong et al., 2019] uses Hessian eigenvalues. BAQ [Zhang et al., 2025] formulates a convex allocation from a Hessian proxy. MixLLM [Zheng et al., 2024] assigns larger bit-width to globally high-salience output features. The shared assumption is straightforward: layers where the loss changes most under perturbation should receive more bits to protect against quantization noise.

We test this assumption directly on Qwen2.5 models at 3-bit average precision. The result is unambiguous: the assumption is wrong. Allocating more bits to high-Fisher layers and fewer to low-Fisher layers does improve on uniform allocation, but inverting the allocation, so that the low-Fisher layers receive the extra bits, produces dramatically better perplexity still.

This is not a failure of Fisher information as a concept. It is a failure of the assumption that Fisher sensitivity implies quantization fragility. Fisher information measures how much the loss changes under arbitrary perturbations, but quantization noise is far from arbitrary: its geometry depends on weight distributions, group scales, and outlier patterns. When that structured noise happens to avoid the directions Fisher cares about, a high-Fisher layer can be robust to quantization despite being important to the model.

We formalize this through the decomposition $\Delta \text{KL}_\ell \approx \frac{1}{2} \text{Tr}(G_\ell \cdot C_{q,\ell}^{(b)})$, where $G_\ell$ is the output-side Fisher and $C_{q,\ell}^{(b)}$ is the actual covariance of quantization-induced output perturbation. Fisher-only allocation optimizes $\text{Tr}(G_\ell)$ and ignores $C_{q,\ell}$. Our metric, Quantization Visibility ($D_\ell$), estimates the full product cheaply from a single forward and backward pass on calibration data.

Our contributions

We demonstrate that the standard Fisher/Hessian allocation direction is wrong at sub-4-bit precision, with empirical evidence across three Qwen model sizes (0.5B, 1.5B, 7B) and cross-family validation on Microsoft Phi-2.
We explain the inversion through Fisher-error overlap: high-Fisher layers have sparse curvature that quantization noise avoids.
We propose $D_\ell = \frac{1}{2} \mathbb{E}[(g_\ell^\top \delta y_\ell)^2]$, a cheap per-layer fragility score that predicts actual perplexity impact where Fisher trace does not.
We characterize the regime where this metric operates: degraded-but-functional quantization (perplexity 13 to 200), not catastrophic collapse or minimal degradation.

2. Background

2.1 Post-training quantization and mixed precision

Post-training quantization (PTQ) converts full-precision weights to lower bit-width representations without retraining. Round-to-nearest (RTN) with per-group symmetric scaling is the simplest approach. GPTQ [Frantar et al., 2022] improves on RTN by using Hessian-weighted error feedback during rounding. AWQ [Lin et al., 2023] protects activation-aware salient weights. Recent work pushes toward information-theoretic optimality: WaterSIC [Lifar et al., 2026] achieves within 0.255 bits of the reverse-waterfilling limit.
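As a concrete reference point for the RTN baseline, the following is a minimal sketch of symmetric per-group round-to-nearest in plain Python. The function name `quantize_rtn_symmetric`, the flat-list weight layout, and the integer grid running from `-qmax` to `qmax` are assumptions made for illustration; production implementations operate on tensors and may use a different grid convention.

```python
def quantize_rtn_symmetric(weights, bits, group_size=128):
    """Quantize then dequantize a flat list of weights with per-group RTN.

    Each contiguous group of `group_size` weights shares one scale, chosen so
    that the largest magnitude in the group maps to the top grid level.
    (Illustrative sketch; grid convention is an assumption.)
    """
    qmax = 2 ** (bits - 1) - 1  # 3 levels each side for 3-bit, 7 for 4-bit
    dequantized = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        if max_abs == 0.0:
            # An all-zero group is represented exactly.
            dequantized.extend(0.0 for _ in group)
            continue
        scale = max_abs / qmax
        for w in group:
            # Python's round() rounds exact halves to the nearest even integer.
            level = round(w / scale)
            level = max(-qmax, min(qmax, level))  # clamp to the symmetric grid
            dequantized.append(level * scale)
    return dequantized


example = quantize_rtn_symmetric([1.0, -1.0, 0.5, 0.0], bits=3, group_size=4)
```

With a 3-bit grid the step size inside a group is one third of the group's largest magnitude, so a single outlier weight coarsens the grid for every other weight that shares its group.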

Mixed-precision allocation assigns different bit-widths to different layers (or finer-grained units) under a total memory budget. The key question is which layers to protect. HAWQ and its successors use Hessian or Fisher trace as the sensitivity signal. ScaleBITS [2026] warns that static sensitivity can fail after quantization. Our work identifies a specific failure mode: the sensitivity signal points in the wrong direction at low bit-widths.

2.2 Fisher information for quantization

For a linear layer $y = W x$ with perturbation $E = \hat{W} - W$, the second-order loss proxy is:

$\Delta \mathcal{L}_\ell \approx \frac{1}{2} \text{Tr}(G_\ell \cdot C_{q,\ell}), \quad C_{q,\ell} = \mathbb{E}[\delta y_\ell \delta y_\ell^\top]$

where $G_\ell$ is the output-side Fisher (or Hessian) and $C_{q,\ell}$ is the quantization-error covariance in the output space. Standard allocation methods approximate this as $\text{Tr}(G_\ell) \cdot \sigma_q^2$, treating quantization noise as isotropic. This approximation fails when $C_{q,\ell}$ has directional structure that differs from $G_\ell$.
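A toy calculation makes the gap between the exact proxy and the isotropic approximation concrete. The sketch below uses hand-picked 2x2 diagonal matrices, which are illustrative assumptions and not measurements from any model; the helper names are likewise hypothetical. The factor of one half is kept in both quantities so that they are directly comparable.

```python
def trace(m):
    return sum(m[i][i] for i in range(len(m)))


def trace_of_product(a, b):
    # Tr(A B) = sum_ij A[i][j] * B[j][i], without forming the product.
    n = len(a)
    return sum(a[i][j] * b[j][i] for i in range(n) for j in range(n))


def damage_exact(G, C):
    return 0.5 * trace_of_product(G, C)


def damage_isotropic(G, C):
    # Replace C by sigma^2 * I with the same total noise power.
    sigma2 = trace(C) / len(C)
    return 0.5 * trace(G) * sigma2


# Curvature concentrated in the first direction.
G = [[10.0, 0.0], [0.0, 0.1]]
# Same total noise power, placed away from or onto the sensitive direction.
C_avoid = [[0.01, 0.0], [0.0, 1.0]]
C_hit = [[1.0, 0.0], [0.0, 0.01]]

exact_avoid = damage_exact(G, C_avoid)   # about 0.1
exact_hit = damage_exact(G, C_hit)       # about 5.0
isotropic = damage_isotropic(G, C_avoid)  # about 2.55 for both noise shapes
```

The isotropic estimate is identical for both noise shapes, while the exact trace differs by a factor of about fifty, which is the kind of divergence the approximation cannot see.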

3. The Inverse-Fisher Anomaly

We quantize Qwen2.5-1.5B to 3-bit average precision using symmetric per-group RTN (group size 128). We compute per-layer output-Fisher trace from 32 calibration sequences using empirical Fisher estimation (sample from model predictions, backpropagate cross-entropy loss, collect output gradients). We then test three allocation strategies at the same average bit budget:

Uniform: every layer at 3 bits.
Fisher-gated: top 20% of layers by Fisher trace receive 4 bits, rest receive 3 bits (with adjustment to maintain average).
Inverse-Fisher: top 20% of layers by Fisher trace receive 3 bits (or lower), bottom 80% receive 4 bits.
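The three strategies can be sketched as a ranking followed by a binary split. This is a minimal illustration under stated assumptions: the function names, the dictionary layout, and the fixed 3-bit/4-bit pair are hypothetical, and the budget-matching adjustment mentioned above is not specified in enough detail to reproduce, so the sketch only reports the resulting average rather than enforcing it.

```python
def allocate_bits(fisher_trace, strategy, low_bits=3, high_bits=4, top_frac=0.2):
    """Map layer name -> bit-width from a dict of per-layer Fisher traces."""
    names = sorted(fisher_trace, key=fisher_trace.get, reverse=True)
    n_top = max(1, round(top_frac * len(names)))
    top = set(names[:n_top])  # the highest-Fisher layers
    if strategy == "uniform":
        return {name: low_bits for name in names}
    if strategy == "fisher_gated":
        return {name: high_bits if name in top else low_bits for name in names}
    if strategy == "inverse_fisher":
        return {name: low_bits if name in top else high_bits for name in names}
    raise ValueError(f"unknown strategy: {strategy}")


def average_bits(bits, num_params):
    """Parameter-weighted average bit-width of an allocation."""
    total = sum(num_params[name] for name in bits)
    return sum(bits[name] * num_params[name] for name in bits) / total


# Hypothetical Fisher traces for five layers of equal size.
fisher = {"l0": 9.0, "l1": 1.0, "l2": 2.0, "l3": 0.5, "l4": 3.0}
params = {name: 100 for name in fisher}
gated = allocate_bits(fisher, "fisher_gated")
inverse = allocate_bits(fisher, "inverse_fisher")
```

Without an adjustment step the two mixed allocations land on different averages (3.2 and 3.8 bits in this toy case), which is why a budget-matching step is needed before the strategies can be compared fairly.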

Results

Allocation	3-bit average PPL	4-bit average PPL
FP16 baseline	11.44	11.44
Uniform RTN	178.14	15.06
Fisher-gated (protect high-Fisher)	91.25	13.80
Inverse (protect low-Fisher)	24.71	13.06

The inverse allocation is 7.2$\times$ better than uniform and 3.7$\times$ better than Fisher-gated at 3-bit average. The effect is consistent at 4-bit but smaller in magnitude (13.06 for inverse vs 13.80 for Fisher-gated and 15.06 for uniform), as expected when quantization damage is less severe overall.

Both Fisher-gated and inverse allocation beat uniform, so either mixed-precision scheme helps here. But the direction matters: the gain from Fisher-gated allocation appears incidental (some of the protected layers happen to benefit from extra bits), while inverse allocation directs the extra bits to the layers that are actually fragile.

4. Explanation: Importance and Fragility Are Different

4.1 The decomposition

The per-layer damage from quantization is $\text{Tr}(G_\ell \cdot C_{q,\ell})$, not $\text{Tr}(G_\ell)$ alone. The two diverge when the eigenvectors of $G_\ell$ and $C_{q,\ell}$ are not aligned.

A layer can have high Fisher trace (important) but low damage (robust) when its Fisher-sensitive directions do not overlap with the directions where quantization introduces error. Concretely:

Sparse Fisher curvature. High-Fisher layers often have curvature concentrated in a few directions (low Fisher participation ratio $d_F = (\text{Tr}\, G)^2 / \text{Tr}(G^2)$). Quantization noise distributes across all directions, so most of it falls outside the sensitive subspace.
Structured quantization error. RTN with per-group scaling produces error that depends on weight distribution shape, outlier channels, and group dynamic range. This error is not isotropic and can systematically avoid high-curvature directions.
RMSNorm suppression. In pre-norm transformer blocks, perturbations that are radial (scaling the residual stream) get suppressed by the next normalization layer. Only tangential perturbations (changing the direction) propagate. If high-Fisher layers produce mostly radial quantization error, normalization makes them robust.
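Two of the mechanisms above are easy to check numerically. The sketch below computes the participation ratio from a list of eigenvalues and shows that a purely radial perturbation leaves a normalized vector unchanged while a tangential one does not. The eigenvalue spectra and the 2-dimensional vectors are made-up illustrations, and the normalization here omits the learned gain and the epsilon term of a real RMSNorm layer.

```python
import math


def participation_ratio(eigenvalues):
    """d_F = (Tr G)^2 / Tr(G^2), computed from the eigenvalues of G."""
    total = sum(eigenvalues)
    total_sq = sum(v * v for v in eigenvalues)
    return total * total / total_sq


def rms_norm(x):
    """RMS normalization without learned gain or epsilon (simplified)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]


# Flat spectrum: all four directions matter equally, d_F = 4.
flat = participation_ratio([1.0, 1.0, 1.0, 1.0])
# Concentrated spectrum: one direction dominates, d_F close to 1.
peaked = participation_ratio([100.0, 0.1, 0.1, 0.1])

x = [3.0, 4.0]
radial = [1.1 * v for v in x]          # x plus a perturbation along x
tangential = [3.0 - 0.4, 4.0 + 0.3]    # x plus a perturbation orthogonal to x

radial_shift = max(abs(a - b) for a, b in zip(rms_norm(radial), rms_norm(x)))
tangential_shift = max(abs(a - b) for a, b in zip(rms_norm(tangential), rms_norm(x)))
```

Here `radial_shift` is zero up to rounding, while `tangential_shift` is clearly nonzero, matching the claim that only direction-changing error survives the next normalization.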

4.2 Why the field assumed otherwise

The standard Fisher/Hessian allocation framework was developed and tested under conditions where quantization error is approximately isotropic: 8-bit and 4-bit regimes where the noise is small and well-distributed. In that regime, $\text{Tr}(G_\ell) \cdot \sigma_q^2$ is a reasonable approximation because $C_{q,\ell} \approx \sigma_q^2 I$.

At 3 bits and below, quantization noise becomes large and highly structured. Group clipping, outlier distortion, and the coarse quantization grid create directional error patterns. The isotropic approximation breaks, and Fisher trace stops predicting damage.

5. Quantization Visibility ($D_\ell$)

We define Quantization Visibility as:

$D_\ell(b) = \frac{1}{2} \mathbb{E}_{x,g}\left[(g_\ell^\top \delta y_\ell^{(b)})^2\right]$

where $g_\ell = \partial \mathcal{L} / \partial y_\ell$ is the gradient of the loss with respect to the layer output (from a backward pass on calibration data) and $\delta y_\ell^{(b)} = (Q_b(W_\ell) - W_\ell) x_\ell$ is the output perturbation induced by $b$-bit quantization on calibration input $x_\ell$.

$D_\ell$ is a scalar contraction of the Fisher-error overlap: it measures how much the actual quantization perturbation projects onto the gradient-sensitive direction. It does not require computing or storing the full Fisher matrix $G_\ell$ or the error covariance $C_{q,\ell}$. It requires one forward pass (to collect $x_\ell$ and compute $\delta y_\ell$) and one backward pass (to collect $g_\ell$) on calibration data.
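The definition translates almost directly into code. The following is a minimal sketch for one linear layer, assuming the calibration inputs and output gradients have already been collected as plain Python lists; the function names and the tiny 2x2 weights are hypothetical and chosen so the arithmetic can be followed by hand.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def quantization_visibility(W, W_q, inputs, output_grads):
    """Estimate D = 0.5 * mean over samples of (g^T (W_q - W) x)^2."""
    error = [[q - w for q, w in zip(row_q, row_w)] for row_q, row_w in zip(W_q, W)]
    total = 0.0
    for x, g in zip(inputs, output_grads):
        delta_y = matvec(error, x)                 # output perturbation
        projection = sum(gi * di for gi, di in zip(g, delta_y))
        total += projection * projection
    return 0.5 * total / len(inputs)


W = [[1.0, 0.0], [0.0, 1.0]]
W_q = [[1.0, 0.0], [0.0, 0.5]]   # quantization error only in the second row
x = [1.0, 2.0]                   # gives delta_y = [0.0, -1.0]

# Large gradient, but orthogonal to the error: the damage is invisible.
d_orthogonal = quantization_visibility(W, W_q, [x], [[5.0, 0.0]])
# Smaller gradient aligned with the error: the damage is visible.
d_aligned = quantization_visibility(W, W_q, [x], [[0.0, 2.0]])
```

In this toy case `d_orthogonal` is 0.0 and `d_aligned` is 2.0, even though the first gradient has the larger norm; gradient magnitude alone does not determine the score.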

5.1 Comparison to alternatives

Metric	What it measures	Correlation with $V_\ell(3)$ on Qwen2.5-1.5B
Fisher trace	Layer importance (arbitrary perturbations)	$\rho = 0.274$, $p = 0.111$
Weight quant MSE	Raw quantization error magnitude	$\rho = -0.212$, $p = 0.221$
FEO (normalized)	Fisher-error overlap ratio	$\rho = 0.181$, $p = 0.298$
$D_\ell$	Gradient-weighted quantization damage	$\rho = 0.533$, $p = 0.001$

Fisher trace has no significant correlation with actual fragility. Weight quantization MSE is weakly negatively correlated, though not significantly (layers with more raw error tend to be less fragile, consistent with the inverse-Fisher finding). The normalized Fisher-Error Overlap (FEO) loses the magnitude information that makes $D_\ell$ predictive. $D_\ell$ is the only metric that significantly predicts which layers are actually damaged by quantization.

6. Experiments

6.1 Per-layer fragility measurement

For each model, we quantize all linear layers to $b$-bit, then restore each target layer one at a time to FP16 and measure the perplexity change. The fragility score $V_\ell(b)$ is the perplexity drop from restoring layer $\ell$:

$V_\ell(b) = \text{PPL}_{\text{all-}b\text{-bit}} - \text{PPL}_{\text{restore-}\ell}$

Positive $V_\ell$ means the layer benefits from restoration (is fragile). Negative $V_\ell$ means restoration hurts (the layer participates in a compensation equilibrium with other quantized layers).
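The restore-one-layer procedure can be sketched as a loop around a perplexity evaluator. In this sketch `evaluate_ppl` is a hypothetical callable supplied by the caller that takes a mapping from layer name to precision and returns perplexity; the additive toy evaluator below exists only to exercise the loop and is not a model of real quantization behavior.

```python
def fragility_scores(layer_names, evaluate_ppl, bits=3):
    """Return (baseline PPL, {layer: V}) where V = PPL_all_quant - PPL_restored."""
    all_quantized = {name: bits for name in layer_names}
    baseline = evaluate_ppl(all_quantized)
    scores = {}
    for name in layer_names:
        config = dict(all_quantized)
        config[name] = "fp16"          # restore exactly one layer
        scores[name] = baseline - evaluate_ppl(config)
    return baseline, scores


# Toy stand-in for a real evaluation: each quantized layer adds a fixed
# (possibly negative) amount to a full-precision perplexity of 10.
toy_penalty = {"down_proj": 5.0, "q_proj": -0.5, "k_proj": 0.25}


def toy_evaluate_ppl(config):
    return 10.0 + sum(toy_penalty[n] for n, b in config.items() if b != "fp16")


baseline, scores = fragility_scores(list(toy_penalty), toy_evaluate_ppl)
```

Each score costs one full evaluation pass, so the procedure needs one pass per target layer plus one for the baseline, which is the cost a cheap proxy such as $D_\ell$ is meant to avoid.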

We evaluate on WikiText-2 with 60,000 tokens. Fisher trace and $D_\ell$ are computed from 32 calibration sequences of 512 tokens from C4. Target layers are sampled from 5 evenly-spaced decoder blocks (7 layer types each: q/k/v/o projections, gate/up/down MLP projections).

6.2 Results across model sizes

Model	Family	Quantizer	Bits	Baseline PPL	$D_\ell$ Spearman $\rho$	$p$-value
Qwen2.5-0.5B	Qwen	RTN	3	196	0.785	$< 0.0001$
Qwen2.5-1.5B	Qwen	RTN	3	63	0.533	$0.001$
Qwen2.5-1.5B	Qwen	GPTQ	3	63	0.554	$0.0006$
Phi-2 (2.7B)	Microsoft	RTN	3	13	0.477	$0.016$
Qwen2.5-7B	Qwen	RTN	3	25	0.126	0.47

On 0.5B and 1.5B, $D_\ell$ has a significant rank correlation with actual layer fragility. Under GPTQ quantization on the same 1.5B model, $D_\ell$ remains significant ($\rho = 0.554$, $p = 0.0006$), indicating the effect is not an artifact of the RTN quantizer. On Phi-2, a completely different architecture family (Microsoft, different training pipeline, different tokenizer, different normalization), $D_\ell$ reproduces with $\rho = 0.477$ ($p = 0.016$). Together these results indicate that the finding generalizes across both quantizers and architecture families.

On 7B, the Spearman correlation is not significant, but the Pearson correlation is 0.972. Damage at 7B concentrates in a single extreme outlier: restoring layers.27.mlp.down_proj drops perplexity from 25.21 to 8.49 ($V_\ell = 16.72$), while all other layers cluster near $V_\ell \approx 0$. $D_\ell$ correctly identifies this outlier, but the rank ordering among the remaining near-zero layers is noise. At larger model scale, quantization fragility becomes more concentrated, making correct identification of the few critical layers even more important.
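The split between a high Pearson and a low Spearman correlation is what one expects when a single outlier carries the signal. The sketch below reproduces the pattern on synthetic numbers; the seven values in each list are invented for illustration and are not the measured 7B data, and the rank helper assumes there are no ties.

```python
import math


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """1-based ranks; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(a, b):
    # Spearman correlation is Pearson correlation applied to the ranks.
    return pearson(ranks(a), ranks(b))


# One dominant layer, six layers near zero whose ordering is unrelated.
fragility = [10.0, 0.03, -0.02, 0.01, -0.01, 0.02, 0.0]
visibility = [50.0, 0.1, 0.6, 0.2, 0.5, 0.3, 0.4]

r = pearson(fragility, visibility)      # close to 1: the outlier dominates
rho = spearman(fragility, visibility)   # slightly negative: ranks disagree
```

Pearson rewards getting the one large value right, while Spearman weights every pairwise ordering equally, including orderings among layers whose fragility is indistinguishable from zero.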

6.3 Regime dependence

$D_\ell$ is predictive in the degraded-but-functional regime (baseline perplexity 13 to 200). Outside this regime it is not informative, and one model inside it (Pythia-2.8B) also shows no signal:

Model	Bits	Baseline PPL	Regime	$D_\ell$ Spearman $\rho$
Qwen2.5-3B	3	51,057	Catastrophic	0.085
Qwen2.5-3B	4	8.4	Minimal degradation	0.006
OPT-1.3B	3	4,054	Catastrophic	0.211
OPT-1.3B	4	12.0	Minimal degradation	0.080
Pythia-2.8B	3	25.3	Degraded	0.000

When quantization catastrophically destroys the model (PPL > 1000), per-layer restoration cannot recover meaningful function, and all metrics are noise. When quantization barely degrades the model (PPL near FP16), there is no fragility signal to predict. $D_\ell$ operates at the quantization cliff, which is exactly where bit allocation decisions matter most.

6.4 Layer-type patterns

Across all models and blocks, down_proj (the MLP down-projection) is consistently the most fragile layer type. On Qwen2.5-7B, a single down_proj in the final block accounts for 66% of total quantization damage. Attention projections (q/k/v) sometimes show negative $V_\ell$: restoring them to FP16 in an otherwise 3-bit model hurts perplexity, suggesting they participate in compensation dynamics with other quantized layers.

6.5 Negative $V_\ell$ layers

9 of 35 tested layers on Qwen2.5-1.5B have negative $V_\ell$: restoring them makes the model worse. A plausible explanation is compensation between quantized layers: when the entire model operates at low precision, errors partially cancel across layers, and restoring one layer to full precision disrupts this equilibrium.

This has a practical implication: per-layer sensitivity analysis in a fully-quantized context is not equivalent to per-layer sensitivity in a mixed context. The compensation effect means brute-force "restore one layer" measurements overestimate the benefit of protecting individual layers in isolation, and underestimate the benefit of protecting the truly critical layers while leaving the compensating layers alone.

7. Related Work

Hessian/Fisher-based allocation. HAWQ [Dong et al., 2019] and HAWQ-V2 [Dong et al., 2020] use Hessian eigenvalues for per-layer bit allocation, explicitly assigning more bits to layers with larger Hessian trace. BAQ [Zhang et al., 2025] formulates bit allocation as convex optimization from a Hessian-proxy sensitivity model. LampQ [Kim et al., 2025] uses Fisher-based metrics for layer-wise allocation in vision transformers. All of these allocate in the direction opposite to what we find works at sub-4-bit.

Output-feature mixed precision. MixLLM [Zheng et al., 2024] assigns precision by output-feature salience. CMPQ [Chen et al., 2024] explores channel-wise mixed precision. These methods operate on a finer granularity than per-layer allocation but share the sensitivity-implies-fragility assumption.

Dynamic sensitivity. ScaleBITS [2026] argues that static sensitivity estimates fail after quantization and proposes dynamic sensitivity search. This is the closest prior work to our finding: it warns that the standard metric can fail, though it does not identify the specific inversion we observe or explain it through Fisher-error overlap.

Rate-distortion theory. WaterSIC [Lifar et al., 2026] applies reverse waterfilling to achieve near-information-theoretic quantization. Its distortion metric is input-covariance-weighted MSE ($\text{Tr}(E \Sigma_x E^\top)$), which does not include the output-side Fisher term. Our decomposition extends this to $\text{Tr}(G_\ell E \Sigma_x E^\top)$, showing that the output-side term cannot be collapsed to a scalar weight when quantization noise has directional structure.

Sensitivity warnings. PQI/ReQuant [2025] demonstrates that gradient-based sensitivity metrics can underestimate quantization impact by orders of magnitude. QEP [2025] identifies cross-layer error propagation as a central bottleneck at low bit-widths. Both support the conclusion that local Fisher is insufficient, though neither identifies the directional inversion.

8. Discussion

Why the standard heuristic was not caught earlier

At 4-bit and above, the inversion effect is small (13.06 vs 13.80). The standard heuristic is only clearly wrong in the sub-4-bit regime, which has only recently become practically relevant with methods like BitNet b1.58 and WaterSIC pushing toward extreme compression. Prior mixed-precision work validated at 4-8 bits, where the isotropic approximation holds and Fisher direction is roughly correct.

Connection to training dynamics

An alternative explanation for why high-Fisher layers are robust: layers that receive more gradient signal during training are pushed toward wider minima in those directions, making them naturally resistant to perturbation. This is consistent with the observation that Fisher-sensitive directions coincide with well-explored regions of the loss landscape. We do not test this hypothesis directly, but it suggests the inversion may be a general property of SGD-trained networks, not specific to quantization.

Limitations

Architecture coverage. Our primary results are on the Qwen2.5 family (0.5B, 1.5B, 7B) with cross-family confirmation on Microsoft Phi-2 (2.7B). Pythia-2.8B showed no signal despite being in the degraded regime (PPL 25), and OPT-1.3B did not enter the degraded regime at tested bit-widths. Additional architecture families (Llama, Mistral) would strengthen generality.
Quantizer. We test RTN and GPTQ. $D_\ell$ is significant under both ($\rho = 0.533$ for RTN, $\rho = 0.554$ for GPTQ on Qwen2.5-1.5B). AWQ has not been tested.
No end-to-end allocator. We show $D_\ell$ predicts fragility but do not build a complete allocation system. The inverse-Fisher shootout uses a crude binary split, not continuous $D_\ell$-guided allocation. A proper allocator that greedily assigns bits by marginal $D_\ell$ reduction is the natural next step.
Perplexity only. We evaluate on WikiText-2 perplexity. Downstream task evaluation (MMLU, ARC, HumanEval) may show different sensitivity patterns.
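To make the allocator mentioned under "No end-to-end allocator" concrete, here is one hypothetical reading of "greedily assigns bits by marginal $D_\ell$ reduction". It has not been evaluated in this work; the function name, the candidate bit-widths, the per-bit cost normalization, and the toy score table are all assumptions for illustration.

```python
def greedy_allocate(d_table, num_params, budget_bits, bit_choices=(2, 3, 4)):
    """Upgrade layers one step at a time by best D reduction per extra bit.

    d_table[name][b] is a visibility score for layer `name` at b bits;
    budget_bits is the total number of weight bits available.
    """
    bits = {name: bit_choices[0] for name in d_table}
    used = sum(num_params[name] * bits[name] for name in bits)
    while True:
        best = None  # (gain per bit, layer name, extra cost, next bit-width)
        for name in bits:
            step = bit_choices.index(bits[name])
            if step + 1 == len(bit_choices):
                continue  # already at the highest precision
            next_bits = bit_choices[step + 1]
            cost = num_params[name] * (next_bits - bits[name])
            if used + cost > budget_bits:
                continue  # upgrade does not fit in the budget
            gain = (d_table[name][bits[name]] - d_table[name][next_bits]) / cost
            if best is None or gain > best[0]:
                best = (gain, name, cost, next_bits)
        if best is None:
            return bits
        _, name, cost, next_bits = best
        bits[name] = next_bits
        used += cost


# Hypothetical scores: layer "a" is very fragile at 2 bits, "b" much less so.
toy_d = {"a": {2: 8.0, 3: 1.0, 4: 0.5}, "b": {2: 2.0, 3: 1.0, 4: 0.9}}
toy_params = {"a": 100, "b": 100}
allocation = greedy_allocate(toy_d, toy_params, budget_bits=700)
```

On the toy table the loop upgrades "a" first, then "b", then "a" again, ending at 4 and 3 bits. Treating per-layer scores as additive ignores the cross-layer compensation discussed in Section 6.5, so a real allocator would likely need to re-estimate scores as the allocation changes.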

9. Conclusion

The relationship between Fisher sensitivity and quantization robustness in trained transformers is inverted at low bit-widths. Layers the model relies on most (high Fisher trace) are the most resistant to quantization noise, because their curvature is concentrated in directions that quantization noise avoids. Layers that appear less important (low Fisher trace) are fragile because their broad, diffuse sensitivity intersects with quantization error across many directions.

$D_\ell$ captures this by measuring what Fisher trace alone cannot: the projection of actual quantization perturbation onto the gradient-sensitive direction. It predicts per-layer fragility at $p < 0.001$ where Fisher trace fails entirely, and it operates at the quantization cliff where allocation decisions have the most impact.

For practitioners deploying quantized models, the implication is direct: protect the quiet layers, not the loud ones.

References

Dong, Z., Yao, Z., Gholami, A., et al. "HAWQ: Hessian AWare Quantization." ICCV 2019.
Dong, Z., Yao, Z., Cai, Y., et al. "HAWQ-V2: Hessian Aware trace-Weighted Quantization." NeurIPS 2020.
Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
Lin, J., Tang, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
Zhang, C., Wang, L., Lasaulce, S., Debbah, M. "BAQ: Efficient Bit Allocation Quantization for Large Language Models." arXiv:2506.05664, 2025.
Zheng, Z., Song, X., Liu, C. "MixLLM: LLM Quantization with Global Mixed-precision between Output-features." arXiv:2412.14590, 2024.
Lifar, E., Savkin, S., Ordentlich, O., Polyanskiy, Y. "WaterSIC: Information-theoretically (near) optimal linear layer quantization." arXiv:2603.04956, 2026.
Kim, J., et al. "LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers." AAAI 2026.
Chen, Z., Xie, B., Li, J., Shen, C. "CMPQ: Channel-wise Mixed-Precision Quantization." arXiv:2410.13056, 2024.

Appendix A: Experimental Details

Unless noted otherwise (the GPTQ run in Section 6.2), all experiments use symmetric per-group RTN quantization with group size 128. Fisher trace is estimated from 32 C4 calibration sequences of 512 tokens using empirical Fisher (sample from model predictions, backpropagate cross-entropy, collect output gradients). $D_\ell$ uses the same calibration data. Perplexity is evaluated on WikiText-2 with 60,000 tokens. Target layers are sampled from 5 evenly-spaced decoder blocks per model. All computation runs on NVIDIA RTX A6000 (48GB).
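The Fisher estimation recipe (sample a label from the model's own prediction, backpropagate cross-entropy, collect the output gradient) can be illustrated at the logits, where the gradient has a closed form: softmax probabilities minus the one-hot sampled label. The sketch below covers only that final-output case under this simplification; interior layers need a real backward pass, and the function names and sample counts are assumptions.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fisher_trace_at_logits(logit_rows, samples_per_row=1, seed=0):
    """Monte Carlo estimate of E[||g||^2] with g = softmax(z) - onehot(y), y ~ softmax(z)."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for logits in logit_rows:
        probs = softmax(logits)
        for _ in range(samples_per_row):
            # Sample the label from the model's own predictive distribution.
            label = rng.choices(range(len(probs)), weights=probs)[0]
            grad = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]
            total += sum(g * g for g in grad)
            count += 1
    return total / count


# Uniform two-way prediction: every sampled gradient has squared norm 0.5.
uniform_trace = fisher_trace_at_logits([[0.0, 0.0]], samples_per_row=10)
# Peaked prediction: the estimate approaches 1 - sum(p^2) as samples grow.
peaked_trace = fisher_trace_at_logits([[2.0, 0.0, 0.0]], samples_per_row=20000)
```

For a single softmax the expectation works out to one minus the sum of squared probabilities, so confident predictions yield a small trace and uncertain ones a large trace.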

Appendix B: Per-Layer Fragility Tables

Full per-layer $V_\ell(3)$, Fisher trace, $D_\ell$, and weight quantization MSE for all tested models are available at [zellige.ai/research/importance-not-fragility].