Fast KV Compaction via Attention Matching

This fork implements the algorithm from arXiv:2602.16284 by Zweiger et al. (MIT Han Lab, 2026).

Paper Summary

KV cache compaction replaces traditional context eviction (truncation, sliding window) with a learned compression. Given a full KV cache of N tokens, compaction produces a smaller cache of M tokens (M << N) such that the model's attention behavior is preserved.

Key Insight

Instead of selecting which tokens to keep and which to discard, Attention Matching finds a small set of synthetic key-value pairs that reproduce the original attention distribution. The compacted cache is not a subset of the original — it is a new, learned representation.
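In symbols, the matching problem can be sketched as follows. The notation here is assumed for illustration, not taken verbatim from the paper: K, V are the original N×d keys and values, K̃, Ṽ are the compacted M×d tensors, and β is an M-dimensional additive bias.

```latex
\min_{\tilde{K},\,\tilde{V},\,\beta}\;
\mathbb{E}_{q}\left\|
\operatorname{softmax}\!\left(\frac{q\tilde{K}^{\top}}{\sqrt{d}} + \beta\right)\tilde{V}
\;-\;
\operatorname{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d}}\right)V
\right\|_2^2
```

The expectation is over a set of probe queries q; in this fork those are produced by self-study (autoregressive generation) rather than captured from training.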

Algorithm

  1. Position Selection — Choose M positions from the original N using attention-weighted scoring. Each position contributes proportionally to its attention mass.
  2. Beta Fitting (NNLS) — Solve for additive bias terms that correct the attention distribution. Uses Non-Negative Least Squares with projected gradient descent.
  3. Value Fitting (Least-Squares) — Solve for compressed V such that the weighted sum of values matches the original. Uses QR factorization with Cholesky fallback.
  4. Execution — During inference, compacted K/V/beta tensors are prepended to the live KV cache. Beta is injected as an additive attention bias.
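Step 2 above can be sketched as a generic projected-gradient NNLS solver. This is a minimal illustration under assumed interfaces, not the fork's NEON/Metal implementation; the matrix `A` and target `b` stand in for whatever attention-residual system the real solver assembles.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>

// Solve min_{beta >= 0} ||A*beta - b||^2 by projected gradient descent.
// A is (rows x cols), row-major. Generic NNLS sketch: the update is
//   beta <- max(0, beta - eta * A^T (A*beta - b)).
std::vector<double> nnls_pgd(const std::vector<double>& A,
                             const std::vector<double>& b,
                             int rows, int cols,
                             int iters = 500) {
    std::vector<double> beta(cols, 0.0), r(rows);
    // Conservative step size: ||A||_F^2 upper-bounds ||A^T A||_2.
    double fro2 = 0.0;
    for (double a : A) fro2 += a * a;
    const double eta = 1.0 / std::max(fro2, 1e-12);
    for (int it = 0; it < iters; ++it) {
        // Residual r = A*beta - b.
        for (int i = 0; i < rows; ++i) {
            double s = -b[i];
            for (int j = 0; j < cols; ++j) s += A[i * cols + j] * beta[j];
            r[i] = s;
        }
        // Gradient step on A^T r, then project onto the nonnegative orthant.
        for (int j = 0; j < cols; ++j) {
            double g = 0.0;
            for (int i = 0; i < rows; ++i) g += A[i * cols + j] * r[i];
            beta[j] = std::max(0.0, beta[j] - eta * g);
        }
    }
    return beta;
}
```

With `A` the identity, the solution is simply `max(0, b)` componentwise, which makes the projection behavior easy to check.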
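Step 3 can similarly be sketched via the normal equations. The fork reportedly uses QR with a Cholesky fallback; only the Cholesky path is shown here, and the names (`lsq_values`, `W`, `O`) are illustrative, not taken from the codebase.

```cpp
#include <vector>
#include <cmath>
#include <stdexcept>

// Solve the least-squares problem min ||W*X - O||_F^2 via the normal
// equations (W^T W) X = W^T O with a dense Cholesky factorization.
// W is (q x m) attention weights, O is (q x d) target outputs,
// X is the (m x d) compressed value matrix. All row-major.
std::vector<double> lsq_values(const std::vector<double>& W,
                               const std::vector<double>& O,
                               int q, int m, int d) {
    // Form G = W^T W and B = W^T O.
    std::vector<double> G(m * m, 0.0), B(m * d, 0.0);
    for (int k = 0; k < q; ++k)
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < m; ++j) G[i*m+j] += W[k*m+i] * W[k*m+j];
            for (int j = 0; j < d; ++j) B[i*d+j] += W[k*m+i] * O[k*d+j];
        }
    // In-place Cholesky: G = L L^T, L stored in the lower triangle.
    for (int i = 0; i < m; ++i)
        for (int j = 0; j <= i; ++j) {
            double s = G[i*m+j];
            for (int k = 0; k < j; ++k) s -= G[i*m+k] * G[j*m+k];
            if (i == j) {
                if (s <= 0.0) throw std::runtime_error("W^T W not SPD");
                G[i*m+i] = std::sqrt(s);
            } else {
                G[i*m+j] = s / G[j*m+j];
            }
        }
    // Forward-substitute L y = B, then back-substitute L^T X = y,
    // one output column at a time.
    std::vector<double> X(m * d);
    for (int c = 0; c < d; ++c) {
        std::vector<double> y(m);
        for (int i = 0; i < m; ++i) {
            double s = B[i*d+c];
            for (int k = 0; k < i; ++k) s -= G[i*m+k] * y[k];
            y[i] = s / G[i*m+i];
        }
        for (int i = m - 1; i >= 0; --i) {
            double s = y[i];
            for (int k = i + 1; k < m; ++k) s -= G[k*m+i] * X[k*d+c];
            X[i*d+c] = s / G[i*m+i];
        }
    }
    return X;
}
```

A quick sanity check: with a single compacted position whose weight is 1 for every query, the fitted value is the mean of the targets.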

Fork vs Paper Reference

This implementation extends the paper in several dimensions:

| Feature | Paper (MIT) | This Fork |
| --- | --- | --- |
| Language | Python (reference) | C++ (production) |
| Solver | NumPy/SciPy | Pure C++ with NEON SIMD + Metal GPU |
| Selection | Attention scoring | + OMP, progressive schedule |
| Query source | Captured from training | Self-study (autoregressive generation) |
| Long context | Not addressed | Chunked self-study for >8K |
| Flash attention | Not addressed | Per-layer flash hybrid (zero-beta) |
| Iterative refinement | Not addressed | Quality-gated convergence loop |
| Architectures | Standard attention | + iSWA, hybrid SSM, IMROPE |
| State persistence | Not addressed | Save/restore (version 2 serialization) |
| Quantized cache | Not addressed | Q8_0, Q4_K extraction |

Implementation Scale

- Compaction Pipelines: 7 (select, solver, OMP, self-study, chunked, on-policy, sequential)
- Fix Commits: 29 (18 Critical/Major bugs found and fixed)
- Models Validated: 17 (15 pass the 0.95 cosine quality gate)

Citation

@article{zweiger2026fast,
  title={Fast KV Compaction via Attention Matching},
  author={Zweiger, Adam and others},
  journal={arXiv preprint arXiv:2602.16284},
  year={2026}
}

Links