Importance Is Not Fragility: Why High-Fisher Layers Survive Low-Bit Quantization
Mixed-precision quantization allocates different bit-widths to different layers. The standard heuristic assigns more bits to high-Fisher layers. We show this points in the wrong direction: on Qwen2.5 at 3-bit, the inverse allocation improves perplexity 3.7x. We decompose per-layer quantization damage into Fisher trace and quantization-error covariance, and propose Quantization Visibility, a metric that predicts per-layer fragility at p < 0.001 where Fisher trace fails.
- quantization
- fisher-information
- mixed-precision
- transformers
Abstract
Mixed-precision quantization allocates different bit-widths to different layers. The standard heuristic, used by HAWQ, BAQ, MixLLM, and others, assigns more bits to layers with higher Fisher or Hessian sensitivity. We show this heuristic points in the wrong direction. On Qwen2.5 models at 3-bit, allocating more bits to high-Fisher layers produces 91.25 perplexity, while the inverse allocation (more bits to low-Fisher layers) produces 24.71, a $3.7\times$ improvement at the same average bit budget. We explain this through a decomposition of per-layer quantization damage into Fisher trace and quantization-error covariance: high-Fisher layers have concentrated curvature that quantization noise largely avoids, while low-Fisher layers have broad sensitivity that quantization noise hits directly. We propose Quantization Visibility ($D_\ell$), a scalar metric that measures the gradient-weighted impact of actual quantization error rather than Fisher sensitivity alone. $D_\ell$ predicts per-layer fragility at $p < 0.001$ (Spearman $\rho = 0.785$ on Qwen2.5-0.5B, $\rho = 0.533$ on Qwen2.5-1.5B) where Fisher trace fails entirely ($p = 0.25$). The effect persists under GPTQ quantization ($\rho = 0.554$, $p = 0.0006$) and reproduces cross-family on Microsoft Phi-2 ($\rho = 0.477$, $p = 0.016$). At 7B scale, damage concentrates in a single layer that $D_\ell$ identifies with Pearson $r = 0.972$. These results hold in the degraded-but-functional quantization regime and suggest that the relationship between layer importance and quantization robustness in trained transformers is inverted relative to standard assumptions.
1. Introduction
Deploying large language models under memory constraints requires aggressive weight quantization. At 4 bits per weight, most models retain near-full quality. At 3 bits and below, quality degrades, and the question becomes which layers to protect with extra precision.
The field's default answer relies on sensitivity metrics derived from the Fisher information matrix or the Hessian of the loss. HAWQ [Dong et al., 2019] uses Hessian eigenvalues. BAQ [Zhang et al., 2025] formulates a convex allocation from a Hessian proxy. MixLLM [Zheng et al., 2024] assigns larger bit-width to globally high-salience output features. The shared assumption is straightforward: layers where the loss changes most under perturbation should receive more bits to protect against quantization noise.
We test this assumption directly on Qwen2.5 models at 3-bit average precision. The result is unambiguous: the assumption is wrong. Allocating more bits to high-Fisher layers and fewer to low-Fisher layers improves on uniform allocation, but inverting the allocation at the same average bit budget produces dramatically better perplexity.
This is not a failure of Fisher information as a concept. It is a failure of the assumption that Fisher sensitivity implies quantization fragility. Fisher information measures how much the loss changes under arbitrary perturbations. Quantization does not produce arbitrary perturbations. It produces structured noise whose geometry depends on weight distributions, group scales, and outlier patterns. When that structured noise happens to avoid the directions Fisher cares about, a high-Fisher layer can be robust to quantization despite being important to the model.
We formalize this through the decomposition $\Delta \text{KL}_\ell \approx \frac{1}{2} \text{Tr}(G_\ell \cdot C_{q,\ell}^{(b)})$, where $G_\ell$ is the output-side Fisher and $C_{q,\ell}^{(b)}$ is the actual covariance of quantization-induced output perturbation. Fisher-only allocation optimizes $\text{Tr}(G_\ell)$ and ignores $C_{q,\ell}$. Our metric, Quantization Visibility ($D_\ell$), estimates the full product cheaply from a single forward and backward pass on calibration data.
Our contributions
- We demonstrate that the standard Fisher/Hessian allocation direction is wrong at sub-4-bit precision, with empirical evidence across three Qwen model sizes (0.5B, 1.5B, 7B) and cross-family validation on Microsoft Phi-2.
- We explain the inversion through Fisher-error overlap: high-Fisher layers have sparse curvature that quantization noise avoids.
- We propose $D_\ell = \frac{1}{2} \mathbb{E}[(g_\ell^\top \delta y_\ell)^2]$, a cheap per-layer fragility score that predicts actual perplexity impact where Fisher trace does not.
- We characterize the regime where this metric operates: degraded-but-functional quantization (perplexity 13 to 200), not catastrophic collapse or minimal degradation.
2. Background
2.1 Post-training quantization and mixed precision
Post-training quantization (PTQ) converts full-precision weights to lower bit-width representations without retraining. Round-to-nearest (RTN) with per-group symmetric scaling is the simplest approach. GPTQ [Frantar et al., 2022] improves on RTN by using Hessian-weighted error feedback during rounding. AWQ [Lin et al., 2023] protects activation-aware salient weights. Recent work pushes toward information-theoretic optimality: WaterSIC [Lifar et al., 2026] achieves within 0.255 bits of the reverse-waterfilling limit.
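As a point of reference, symmetric per-group RTN can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function name `rtn_quantize`, the integer grid $[-q_{\max}, q_{\max}]$ with $q_{\max} = 2^{b-1} - 1$, and grouping along the input dimension are all assumptions, since the text specifies only "symmetric per-group" scaling.

```python
import numpy as np

def rtn_quantize(weight: np.ndarray, bits: int = 3, group_size: int = 128) -> np.ndarray:
    """Symmetric per-group round-to-nearest; returns the dequantized weights."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be divisible by group_size"
    qmax = 2 ** (bits - 1) - 1  # assumed symmetric grid, e.g. -3..3 at 3 bits
    groups = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, set by the largest magnitude in that group.
    max_abs = np.abs(groups).max(axis=-1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(groups / scale), -qmax, qmax)
    return (q * scale).reshape(out_features, in_features)

# Toy weights (random, for illustration only).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 256)).astype(np.float32)
W3 = rtn_quantize(W, bits=3)
W4 = rtn_quantize(W, bits=4)
mse3 = float(np.mean((W3 - W) ** 2))
mse4 = float(np.mean((W4 - W) ** 2))
```

Because each group's scale is tied to its largest weight, a single outlier widens the grid for all 128 weights in the group, which is one source of the structured error discussed later.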
Mixed-precision allocation assigns different bit-widths to different layers (or finer-grained units) under a total memory budget. The key question is which layers to protect. HAWQ and its successors use Hessian or Fisher trace as the sensitivity signal. ScaleBITS [2026] warns that static sensitivity can fail after quantization. Our work identifies a specific failure mode: the sensitivity signal points in the wrong direction at low bit-widths.
2.2 Fisher information for quantization
For a linear layer $y = Wx$ with perturbation $E = \hat{W} - W$, the second-order loss proxy is:
$$\Delta \mathcal{L}_\ell \approx \frac{1}{2} \text{Tr}(G_\ell \cdot C_{q,\ell}), \quad C_{q,\ell} = \mathbb{E}[\delta y_\ell \delta y_\ell^\top]$$
where $G_\ell$ is the output-side Fisher (or Hessian) and $C_{q,\ell}$ is the quantization-error covariance in the output space. Standard allocation methods approximate this as $\text{Tr}(G_\ell) \cdot \sigma_q^2$, treating quantization noise as isotropic. This approximation fails when $C_{q,\ell}$ has directional structure that differs from $G_\ell$.
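A small numerical sketch may help show how far the isotropic proxy can drift from the full trace. The matrices below are synthetic and chosen purely for illustration (a Fisher matrix with one dominant direction, and three error covariances with identical total noise power); none of the values come from the experiments.

```python
import numpy as np

d = 64
# Synthetic output-side Fisher: curvature concentrated in direction e_0.
g_diag = np.full(d, 0.01)
g_diag[0] = 100.0
G = np.diag(g_diag)

def damage(G, C):
    """Second-order proxy 0.5 * Tr(G C)."""
    return 0.5 * np.trace(G @ C)

sigma2 = 1.0
total_noise = sigma2 * d  # same total noise power in all three cases

C_iso = sigma2 * np.eye(d)                    # isotropic noise
C_aligned = np.zeros((d, d))
C_aligned[0, 0] = total_noise                 # all noise along the sensitive direction
c_diag = np.full(d, total_noise / (d - 1))
c_diag[0] = 0.0
C_avoid = np.diag(c_diag)                     # noise avoids the sensitive direction

isotropic_proxy = 0.5 * np.trace(G) * sigma2  # what Tr(G) * sigma^2 predicts
damage_iso = damage(G, C_iso)                 # matches the proxy
damage_aligned = damage(G, C_aligned)         # far larger than the proxy
damage_avoid = damage(G, C_avoid)             # far smaller than the proxy
```

With the same Fisher trace and the same total noise power, the three cases differ by four orders of magnitude, so ranking layers by $\text{Tr}(G_\ell)$ alone says little once the noise is anisotropic.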
3. The Inverse-Fisher Anomaly
We quantize Qwen2.5-1.5B to 3-bit average precision using symmetric per-group RTN (group size 128). We compute per-layer output-Fisher trace from 32 calibration sequences using empirical Fisher estimation (sample from model predictions, backpropagate cross-entropy loss, collect output gradients). We then test three allocation strategies at the same average bit budget:
- Uniform: every layer at 3 bits.
- Fisher-gated: top 20% of layers by Fisher trace receive 4 bits, rest receive 3 bits (with adjustment to maintain average).
- Inverse-Fisher: top 20% of layers by Fisher trace receive 3 bits (or lower), bottom 80% receive 4 bits.
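The binary splits above can be sketched as follows. The helper names `split_allocation` and `average_bits`, the layer names, and the Fisher values are hypothetical. The budget adjustment mentioned in the list is not specified in the text, so this sketch only reports the parameter-weighted average that a raw split produces and leaves the adjustment out.

```python
def split_allocation(fisher_trace, protect, frac, hi_bits=4, lo_bits=3):
    """Give hi_bits to a fraction of layers ranked by Fisher trace, lo_bits to the rest.

    protect="high" protects the top `frac` of layers, protect="low" the bottom `frac`.
    """
    if protect not in ("high", "low"):
        raise ValueError("protect must be 'high' or 'low'")
    ranked = sorted(fisher_trace, key=fisher_trace.get, reverse=True)
    k = round(frac * len(ranked))
    protected = set(ranked[:k]) if protect == "high" else set(ranked[len(ranked) - k:])
    return {name: hi_bits if name in protected else lo_bits for name in ranked}

def average_bits(bits, n_params):
    """Parameter-weighted average bit-width of an allocation."""
    total = sum(n_params[name] for name in bits)
    return sum(bits[name] * n_params[name] for name in bits) / total

# Toy example: ten equally sized layers with made-up Fisher traces 10, 9, ..., 1.
fisher = {f"layer{i}": float(10 - i) for i in range(10)}
n_params = {name: 1 for name in fisher}

fisher_gated = split_allocation(fisher, protect="high", frac=0.2)
inverse = split_allocation(fisher, protect="low", frac=0.8)
avg_gated = average_bits(fisher_gated, n_params)
avg_inverse = average_bits(inverse, n_params)
```

On this toy input the raw splits average 3.2 and 3.8 bits, so comparing them at one budget requires lowering some layers below 3 bits or otherwise rebalancing, which is the role of the adjustment step.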
Results
| Allocation | 3-bit average PPL | 4-bit average PPL |
|---|---|---|
| FP16 baseline | 11.44 | 11.44 |
| Uniform RTN | 178.14 | 15.06 |
| Fisher-gated (protect high-Fisher) | 91.25 | 13.80 |
| Inverse (protect low-Fisher) | 24.71 | 13.06 |
The inverse allocation is $7.2\times$ better than uniform and $3.7\times$ better than Fisher-gated at 3-bit average. The effect is consistent at 4-bit but smaller in magnitude (13.06 vs 15.06), as expected when quantization damage is less severe overall.
Both Fisher-gated and inverse beat uniform, confirming that any mixed-precision allocation helps. But the direction matters: Fisher-gated helps by accident (some layers happen to benefit from extra bits), while inverse allocation correctly identifies which layers are actually fragile.
4. Explanation: Importance and Fragility Are Different
4.1 The decomposition
The per-layer damage from quantization is not $\text{Tr}(G_\ell)$. It is $\text{Tr}(G_\ell \cdot C_{q,\ell})$. These are different when the eigenvectors of $G_\ell$ and $C_{q,\ell}$ are not aligned.
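The role of alignment can be made explicit with a standard identity. Assuming both matrices are symmetric positive semidefinite with eigendecompositions $G_\ell = \sum_i \lambda_i u_i u_i^\top$ and $C_{q,\ell} = \sum_j \mu_j v_j v_j^\top$ (notation introduced here for illustration), the trace of the product expands as:

$$\text{Tr}(G_\ell \cdot C_{q,\ell}) = \sum_i \sum_j \lambda_i \, \mu_j \, (u_i^\top v_j)^2$$

Each term is weighted by the squared overlap between a Fisher eigenvector and an error eigenvector. If $C_{q,\ell} = \sigma_q^2 I$, then $\sum_j (u_i^\top v_j)^2 = 1$ for every $i$ and the sum collapses to $\sigma_q^2 \, \text{Tr}(G_\ell)$. If the high-variance error directions are nearly orthogonal to the high-curvature directions, the sum stays small even when $\text{Tr}(G_\ell)$ is large.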
A layer can have high Fisher trace (important) but low damage (robust) when its Fisher-sensitive directions do not overlap with the directions where quantization introduces error. Concretely:
- Sparse Fisher curvature. High-Fisher layers often have curvature concentrated in a few directions (low Fisher participation ratio $d_F = (\text{Tr } G)^2 / \text{Tr}(G^2)$). Quantization noise distributes across all directions, so most of it falls outside the sensitive subspace.
- Structured quantization error. RTN with per-group scaling produces error that depends on weight distribution shape, outlier channels, and group dynamic range. This error is not isotropic and can systematically avoid high-curvature directions.
- RMSNorm suppression. In pre-norm transformer blocks, perturbations that are radial (scaling the residual stream) get suppressed by the next normalization layer. Only tangential perturbations (changing the direction) propagate. If high-Fisher layers produce mostly radial quantization error, normalization makes them robust.
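Two of the mechanisms above are easy to check numerically. The sketch below computes the participation ratio $d_F$ for two extreme matrices and compares how a radial and a tangential perturbation of equal norm pass through a gain-free RMSNorm. The `rms_norm` helper omits the learned scale and uses an assumed `eps`; the vectors are random toy data.

```python
import numpy as np

def participation_ratio(G):
    """d_F = (Tr G)^2 / Tr(G^2): roughly the number of directions carrying curvature."""
    return np.trace(G) ** 2 / np.trace(G @ G)

def rms_norm(x, eps=1e-6):
    """RMSNorm without the learned gain (a simplification for this sketch)."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

d_f_flat = participation_ratio(np.eye(16))            # curvature spread over 16 directions
spiky = np.zeros((16, 16))
spiky[0, 0] = 1.0
d_f_spiky = participation_ratio(spiky)                # curvature in a single direction

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
radial = 0.1 * x                                      # perturbation parallel to x
t = rng.standard_normal(256)
tangential = t - (t @ x) / (x @ x) * x                # remove the component along x
tangential *= np.linalg.norm(radial) / np.linalg.norm(tangential)  # match norms

radial_effect = np.linalg.norm(rms_norm(x + radial) - rms_norm(x))
tangential_effect = np.linalg.norm(rms_norm(x + tangential) - rms_norm(x))
```

The radial perturbation is almost entirely removed by the normalization, while the tangential one passes through at nearly full strength, which is the asymmetry the RMSNorm argument relies on.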
4.2 Why the field assumed otherwise
The standard Fisher/Hessian allocation framework was developed and tested under conditions where quantization error is approximately isotropic: 8-bit and 4-bit regimes where the noise is small and well-distributed. In that regime, $\text{Tr}(G_\ell) \cdot \sigma_q^2$ is a reasonable approximation because $C_{q,\ell} \approx \sigma_q^2 I$.
At 3 bits and below, quantization noise becomes large and highly structured. Group clipping, outlier distortion, and the coarse quantization grid create directional error patterns. The isotropic approximation breaks, and Fisher trace stops predicting damage.
5. Quantization Visibility ($D_\ell$)
We define Quantization Visibility as:
$$D_\ell(b) = \frac{1}{2} \mathbb{E}_{x,g}\left[(g_\ell^\top \delta y_\ell^{(b)})^2\right]$$
where $g_\ell = \partial \mathcal{L} / \partial y_\ell$ is the gradient of the loss with respect to the layer output (from a backward pass on calibration data) and $\delta y_\ell^{(b)} = (Q_b(W_\ell) - W_\ell) x_\ell$ is the output perturbation induced by $b$-bit quantization on calibration input $x_\ell$.
$D_\ell$ is a scalar contraction of the Fisher-error overlap: it measures how much the actual quantization perturbation projects onto the gradient-sensitive direction. It does not require computing or storing the full Fisher matrix $G_\ell$ or the error covariance $C_{q,\ell}$. It requires one forward pass (to collect $x_\ell$ and compute $\delta y_\ell$) and one backward pass (to collect $g_\ell$) on calibration data.
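Given stored gradients and inputs, the definition reduces to a few array operations. The sketch below is one possible estimator, not the authors' code: the function name `quantization_visibility`, the array layout, and the choice to average over tokens as the empirical expectation are assumptions. The tiny example is hand-checkable.

```python
import numpy as np

def quantization_visibility(g, x, W, W_q):
    """D = 0.5 * mean over tokens of (g_t . delta_y_t)^2.

    g:   (tokens, d_out) loss gradients w.r.t. the layer output
    x:   (tokens, d_in)  layer inputs from the same forward pass
    W:   (d_out, d_in)   full-precision weight
    W_q: (d_out, d_in)   dequantized b-bit weight
    """
    delta_y = x @ (W_q - W).T            # output perturbation per token
    proj = np.sum(g * delta_y, axis=1)   # g^T delta_y per token
    return 0.5 * float(np.mean(proj ** 2))

# One token, one input feature, two output features.
x = np.array([[2.0]])
W = np.array([[1.0], [1.0]])
W_q = np.array([[1.5], [1.0]])           # quantization error only in output 0

g_visible = np.array([[1.0, 0.0]])       # loss is sensitive to output 0
g_hidden = np.array([[0.0, 1.0]])        # loss is sensitive to output 1 only

D_visible = quantization_visibility(g_visible, x, W, W_q)  # error lands on the sensitive output
D_hidden = quantization_visibility(g_hidden, x, W, W_q)    # same error, invisible to the loss
```

The weight error is identical in both calls; only its overlap with the gradient changes, which is the distinction between weight-space MSE and $D_\ell$.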
5.1 Comparison to alternatives
| Metric | What it measures | Correlation with $V_\ell(3)$ on Qwen2.5-1.5B |
|---|---|---|
| Fisher trace | Layer importance (arbitrary perturbations) | $\rho = 0.274$, $p = 0.111$ |
| Weight quant MSE | Raw quantization error magnitude | $\rho = -0.212$, $p = 0.221$ |
| FEO (normalized) | Fisher-error overlap ratio | $\rho = 0.181$, $p = 0.298$ |
| $D_\ell$ | Gradient-weighted quantization damage | $\rho = 0.533$, $p = 0.001$ |
Fisher trace has no significant correlation with actual fragility. Weight quantization MSE is weakly negatively correlated, though not significantly ($p = 0.221$); the sign suggests that layers with more raw error are, if anything, less fragile, consistent with the inverse-Fisher finding. The normalized Fisher-Error Overlap (FEO) loses the magnitude information that makes $D_\ell$ predictive. $D_\ell$ is the only metric that significantly predicts which layers are actually damaged by quantization.
6. Experiments
6.1 Per-layer fragility measurement
For each model, we quantize all linear layers to $b$-bit, then restore each target layer one at a time to FP16 and measure the perplexity change. The fragility score $V_\ell(b)$ is the perplexity drop from restoring layer $\ell$:
$$V_\ell(b) = \text{PPL}_{\text{all-}b\text{-bit}} - \text{PPL}_{\text{restore-}\ell}$$
Positive $V_\ell$ means the layer benefits from restoration (is fragile). Negative $V_\ell$ means restoration hurts (the layer participates in a compensation equilibrium with other quantized layers).
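The restore-one-layer protocol can be written as a short loop around a perplexity evaluator. In this sketch `eval_ppl` is a hypothetical callback that takes a per-layer bit-width map and returns perplexity; in a real run it would quantize the model and evaluate on WikiText-2. The toy evaluator and its penalty values are invented so the example runs on its own.

```python
def fragility_scores(layers, eval_ppl, bits=3):
    """V[l] = PPL(all layers at `bits`) - PPL(all at `bits` except l at FP16)."""
    all_quant = {name: bits for name in layers}
    base = eval_ppl(all_quant)
    scores = {}
    for name in layers:
        config = dict(all_quant)
        config[name] = 16                # restore this one layer to FP16
        scores[name] = base - eval_ppl(config)
    return scores

# Toy stand-in for a real evaluation (made-up numbers): each quantized layer
# adds a fixed penalty to an FP16 perplexity of 10. A negative penalty mimics
# a layer whose quantization error happens to compensate for others.
toy_penalty = {"down_proj": 5.0, "q_proj": -0.5, "up_proj": 1.0}

def toy_eval_ppl(config):
    return 10.0 + sum(toy_penalty[n] for n, b in config.items() if b < 16)

V = fragility_scores(list(toy_penalty), toy_eval_ppl, bits=3)
```

The cost is one full evaluation per target layer plus one for the baseline, which is why a cheap predictor of $V_\ell$ is attractive.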
We evaluate on WikiText-2 with 60,000 tokens. Fisher trace and DℓD_\ellDℓ are computed from 32 calibration sequences of 512 tokens from C4. Target layers are sampled from 5 evenly-spaced decoder blocks (7 layer types each: q/k/v/o projections, gate/up/down MLP projections).
6.2 Results across model sizes
| Model | Family | Quantizer | Bits | Baseline PPL (all layers quantized) | $D_\ell$ Spearman $\rho$ | $p$-value |
|---|---|---|---|---|---|---|
| Qwen2.5-0.5B | Qwen | RTN | 3 | 196 | 0.785 | $< 0.0001$ |
| Qwen2.5-1.5B | Qwen | RTN | 3 | 63 | 0.533 | 0.001 |
| Qwen2.5-1.5B | Qwen | GPTQ | 3 | 63 | 0.554 | 0.0006 |
| Phi-2 (2.7B) | Microsoft | RTN | 3 | 13 | 0.477 | 0.016 |
| Qwen2.5-7B | Qwen | RTN | 3 | 25 | 0.126 | 0.47 |
On 0.5B and 1.5B, $D_\ell$ has strong rank correlation with actual layer fragility. Under GPTQ quantization on the same 1.5B model, $D_\ell$ remains significant ($\rho = 0.554$, $p = 0.0006$), confirming the effect is not an artifact of the RTN quantizer. On Phi-2, a completely different architecture family (Microsoft, different training pipeline, different tokenizer, different normalization), $D_\ell$ reproduces with $\rho = 0.477$ ($p = 0.016$). This confirms the finding generalizes across both quantizers and architecture families.
On 7B, the Spearman correlation is not significant, but the Pearson correlation is 0.972. Damage at 7B concentrates in a single extreme outlier: restoring layers.27.mlp.down_proj drops perplexity from 25.21 to 8.49 ($V_\ell = 16.72$), while all other layers cluster near $V_\ell \approx 0$. $D_\ell$ correctly identifies this outlier, but the rank ordering among the remaining near-zero layers is noise. At larger model scale, quantization fragility becomes more concentrated, making correct identification of the few critical layers even more important.
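The gap between the two correlation coefficients is what one would expect when a single point dominates. The sketch below implements Pearson and Spearman in plain Python and applies them to synthetic data with one large outlier and five near-zero points in scrambled order; the numbers are invented to mimic the pattern and are not the measured values.

```python
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic scores: five near-zero layers in arbitrary order plus one outlier.
D = [0.1, 0.2, 0.3, 0.4, 0.5, 50.0]
V = [0.05, -0.02, 0.01, -0.03, 0.00, 16.0]

r_pearson = pearson(D, V)      # dominated by the outlier, close to 1
rho_spearman = spearman(D, V)  # dominated by the noisy near-zero ranks, close to 0
```

Pearson rewards getting the one large value right, while Spearman weights every rank equally, so a metric can locate the critical layer and still show a weak rank correlation.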
6.3 Regime dependence
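$D_\ell$ is predictive in the degraded-but-functional regime (baseline perplexity 13 to 200). Outside this regime, it is not informative; the table also includes Pythia-2.8B, which falls inside the regime yet shows no signal:
| Model | Bits | Baseline PPL (all layers quantized) | Regime | $D_\ell$ Spearman $\rho$ |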
|---|---|---|---|---|
| Qwen2.5-3B | 3 | 51,057 | Catastrophic | 0.085 |
| Qwen2.5-3B | 4 | 8.4 | Minimal degradation | 0.006 |
| OPT-1.3B | 3 | 4,054 | Catastrophic | 0.211 |
| OPT-1.3B | 4 | 12.0 | Minimal degradation | 0.080 |
| Pythia-2.8B | 3 | 25.3 | Degraded | 0.000 |
When quantization catastrophically destroys the model (PPL > 1000), per-layer restoration cannot recover meaningful function, and all metrics are noise. When quantization barely degrades the model (PPL near FP16), there is no fragility signal to predict. $D_\ell$ operates at the quantization cliff, which is exactly where bit allocation decisions matter most.
6.4 Layer-type patterns
Across all models and blocks, down_proj (the MLP down-projection) is consistently the most fragile layer type. On Qwen2.5-7B, a single down_proj in the final block accounts for 66% of total quantization damage. Attention projections (q/k/v) sometimes show negative $V_\ell$: restoring them to FP16 in an otherwise 3-bit model hurts perplexity, suggesting they participate in compensation dynamics with other quantized layers.
6.5 Negative $V_\ell$ layers
9 of 35 tested layers on Qwen2.5-1.5B have negative $V_\ell$: restoring them makes the model worse. This occurs because quantized layers develop compensation patterns. When the entire model operates at low precision, error distributions partially cancel across layers. Restoring one layer to full precision disrupts this equilibrium.
This has a practical implication: per-layer sensitivity analysis in a fully-quantized context is not equivalent to per-layer sensitivity in a mixed context. The compensation effect means brute-force "restore one layer" measurements overestimate the benefit of protecting individual layers in isolation, and underestimate the benefit of protecting the truly critical layers while leaving the compensating layers alone.
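A deliberately simple toy shows how restoring a layer can increase error. It assumes two scalar "layers" whose outputs add into a shared stream, with quantization errors of opposite sign; real transformer layers compose rather than simply sum, so this illustrates the cancellation mechanism only and is not a model of the measured effect.

```python
def output(weights, x):
    """Two parallel branches writing into the same residual stream."""
    return sum(w * x for w in weights)

w_fp = [1.0, 2.0]      # full-precision weights
w_q = [1.25, 1.75]     # quantized: errors of +0.25 and -0.25 cancel in the sum
x = 4.0

reference = output(w_fp, x)
err_all_quant = abs(output(w_q, x) - reference)                  # errors cancel
err_restore_first = abs(output([w_fp[0], w_q[1]], x) - reference)  # cancellation broken
```

With both branches quantized the output is exact; restoring only the first branch exposes the second branch's error, so the "restored" configuration is worse.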
7. Related Work
Hessian/Fisher-based allocation. HAWQ [Dong et al., 2019] and HAWQ-V2 [Dong et al., 2020] use Hessian eigenvalues for per-layer bit allocation, explicitly assigning more bits to layers with larger Hessian trace. BAQ [Zhang et al., 2025] formulates bit allocation as convex optimization from a Hessian-proxy sensitivity model. LampQ [Kim et al., 2025] uses Fisher-based metrics for layer-wise allocation in vision transformers. All of these allocate in the direction opposite to what we find works at sub-4-bit.
Output-feature mixed precision. MixLLM [Zheng et al., 2024] assigns precision by output-feature salience. CMPQ [Chen et al., 2024] explores channel-wise mixed precision. These methods operate on a finer granularity than per-layer allocation but share the sensitivity-implies-fragility assumption.
Dynamic sensitivity. ScaleBITS [2026] argues that static sensitivity estimates fail after quantization and proposes dynamic sensitivity search. This is the closest prior work to our finding: it warns that the standard metric can fail, though it does not identify the specific inversion we observe or explain it through Fisher-error overlap.
Rate-distortion theory. WaterSIC [Lifar et al., 2026] applies reverse waterfilling to achieve near-information-theoretic quantization. Its distortion metric is input-covariance-weighted MSE ($\text{Tr}(E \Sigma_x E^\top)$), which does not include the output-side Fisher term. Our decomposition extends this to $\text{Tr}(G_\ell E \Sigma_x E^\top)$, showing that the output-side term is not a scalar when quantization noise has directional structure.
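The link between the two distortion measures follows in one step from the definitions in Section 2.2. Writing $\delta y_\ell = E x_\ell$ and $\Sigma_x = \mathbb{E}[x_\ell x_\ell^\top]$ (the uncentered second moment, with $E$ treated as fixed once the quantizer is chosen, both assumptions of this sketch):

$$C_{q,\ell} = \mathbb{E}[\delta y_\ell \, \delta y_\ell^\top] = E \, \mathbb{E}[x_\ell x_\ell^\top] \, E^\top = E \Sigma_x E^\top, \qquad \text{Tr}(G_\ell \cdot C_{q,\ell}) = \text{Tr}(G_\ell E \Sigma_x E^\top)$$

Setting $G_\ell = cI$ for a constant $c$ gives $c \, \text{Tr}(E \Sigma_x E^\top)$, so the input-covariance-weighted MSE is the special case in which every output direction is equally sensitive.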
Sensitivity warnings. PQI/ReQuant [2025] demonstrates that gradient-based sensitivity metrics can underestimate quantization impact by orders of magnitude. QEP [2025] identifies cross-layer error propagation as a central bottleneck at low bit-widths. Both support the conclusion that local Fisher is insufficient, though neither identifies the directional inversion.
8. Discussion
Why the standard heuristic was not caught earlier
At 4-bit and above, the inversion effect is small (13.06 vs 13.80). The standard heuristic is only clearly wrong in the sub-4-bit regime, which has only recently become practically relevant with methods like BitNet b1.58 and WaterSIC pushing toward extreme compression. Prior mixed-precision work validated at 4-8 bits, where the isotropic approximation holds and Fisher direction is roughly correct.
Connection to training dynamics
An alternative explanation for why high-Fisher layers are robust: layers that receive more gradient signal during training are pushed toward wider minima in those directions, making them naturally resistant to perturbation. This is consistent with the observation that Fisher-sensitive directions coincide with well-explored regions of the loss landscape. We do not test this hypothesis directly, but it suggests the inversion may be a general property of SGD-trained networks, not specific to quantization.
Limitations
- Architecture coverage. Our primary results are on the Qwen2.5 family (0.5B, 1.5B, 7B) with cross-family confirmation on Microsoft Phi-2 (2.7B). Pythia-2.8B showed no signal despite being in the degraded regime (PPL 25), and OPT-1.3B did not enter the degraded regime at tested bit-widths. Additional architecture families (Llama, Mistral) would strengthen generality.
- Quantizer. We test RTN and GPTQ. $D_\ell$ is significant under both ($\rho = 0.533$ for RTN, $\rho = 0.554$ for GPTQ on Qwen2.5-1.5B). AWQ has not been tested.
- No end-to-end allocator. We show $D_\ell$ predicts fragility but do not build a complete allocation system. The inverse-Fisher shootout uses a crude binary split, not continuous $D_\ell$-guided allocation. A proper allocator that greedily assigns bits by marginal $D_\ell$ reduction is the natural next step.
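To make the proposed next step concrete, here is one way a greedy allocator could look. It is a hypothetical sketch, not something built or evaluated in this work: the function `greedy_allocate`, the per-bit visibility table `D[name][b]`, the gain-per-parameter criterion, and all numbers are assumptions for illustration.

```python
import heapq

def greedy_allocate(D, n_params, budget_bits, start_bits=2, max_bits=8):
    """Repeatedly add one bit to the layer with the largest D reduction per parameter.

    D[name][b] is the estimated visibility of layer `name` at b bits and must
    be defined for every b from start_bits to max_bits.
    """
    bits = {name: start_bits for name in D}
    used = sum(start_bits * n_params[name] for name in D)

    def gain(name):
        b = bits[name]
        return (D[name][b] - D[name][b + 1]) / n_params[name]

    heap = [(-gain(name), name) for name in D if start_bits < max_bits]
    heapq.heapify(heap)
    while heap:
        _, name = heapq.heappop(heap)
        if used + n_params[name] > budget_bits:
            continue  # upgrade no longer affordable; the budget only shrinks
        bits[name] += 1
        used += n_params[name]
        if bits[name] < max_bits:
            heapq.heappush(heap, (-gain(name), name))
    return bits

# Made-up visibility table for three equally sized layers.
D_table = {
    "a": {2: 10.0, 3: 2.0, 4: 1.8},
    "b": {2: 1.0, 3: 0.5, 4: 0.4},
    "c": {2: 4.0, 3: 3.0, 4: 0.5},
}
sizes = {"a": 1, "b": 1, "c": 1}
allocation = greedy_allocate(D_table, sizes, budget_bits=9, start_bits=2, max_bits=4)
```

A greedy rule of this kind is only guaranteed to be optimal when each layer's $D_\ell(b)$ has diminishing returns in $b$, and it ignores the cross-layer compensation discussed in Section 6.5, so it should be read as a starting point.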
- Perplexity only. We evaluate on WikiText-2 perplexity. Downstream task evaluation (MMLU, ARC, HumanEval) may show different sensitivity patterns.
9. Conclusion
The relationship between Fisher sensitivity and quantization robustness in trained transformers is inverted at low bit-widths. Layers the model relies on most (high Fisher trace) are the most resistant to quantization noise, because their curvature is concentrated in directions that quantization noise avoids. Layers that appear less important (low Fisher trace) are fragile because their broad, diffuse sensitivity intersects with quantization error across many directions.
$D_\ell$ captures this by measuring what Fisher trace alone cannot: the projection of actual quantization perturbation onto the gradient-sensitive direction. It predicts per-layer fragility at $p < 0.001$ where Fisher trace fails entirely, and it operates at the quantization cliff where allocation decisions have the most impact.
For practitioners deploying quantized models, the implication is direct: protect the quiet layers, not the loud ones.
References
- Dong, Z., Yao, Z., Gholami, A., et al. "HAWQ: Hessian AWare Quantization." ICCV 2019.
- Dong, Z., Yao, Z., Cai, Y., et al. "HAWQ-V2: Hessian Aware trace-Weighted Quantization." NeurIPS 2020.
- Frantar, E., Ashkboos, S., Hoefler, T., Alistarh, D. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
- Lin, J., Tang, J., et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." MLSys 2024.
- Zhang, C., Wang, L., Lasaulce, S., Debbah, M. "BAQ: Efficient Bit Allocation Quantization for Large Language Models." arXiv:2506.05664, 2025.
- Zheng, Z., Song, X., Liu, C. "MixLLM: LLM Quantization with Global Mixed-precision between Output-features." arXiv:2412.14590, 2024.
- Lifar, E., Savkin, S., Ordentlich, O., Polyanskiy, Y. "WaterSIC: Information-theoretically (near) optimal linear layer quantization." arXiv:2603.04956, 2026.
- Kim, J., et al. "LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers." AAAI 2026.
- Chen, Z., Xie, B., Li, J., Shen, C. "CMPQ: Channel-wise Mixed-Precision Quantization." arXiv:2410.13056, 2024.
Appendix A: Experimental Details
Unless stated otherwise, experiments use symmetric per-group RTN quantization with group size 128; the GPTQ row in Section 6.2 is the exception. Fisher trace is estimated from 32 C4 calibration sequences of 512 tokens using empirical Fisher (sample from model predictions, backpropagate cross-entropy, collect output gradients). $D_\ell$ uses the same calibration data. Perplexity is evaluated on WikiText-2 with 60,000 tokens. Target layers are sampled from 5 evenly-spaced decoder blocks per model. All computation runs on NVIDIA RTX A6000 (48GB).
Appendix B: Per-Layer Fragility Tables
Full per-layer $V_\ell(3)$, Fisher trace, $D_\ell$, and weight quantization MSE for all tested models are available at [zellige.ai/research/importance-not-fragility].