This fork implements the algorithm from arXiv:2602.16284 by Zweiger et al. (MIT Han Lab, 2026).
KV cache compaction replaces traditional context eviction (truncation, sliding window) with learned compression. Given a full KV cache of N tokens, compaction produces a smaller cache of M tokens (M << N) that preserves the model's attention behavior.
Instead of selecting which tokens to keep and which to discard, Attention Matching finds a small set of synthetic key-value pairs that reproduce the original attention distribution. The compacted cache is not a subset of the original — it is a new, learned representation.
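To make the idea concrete, here is a minimal NumPy sketch in the spirit of the paper's reference implementation. It is a toy, not the paper's solver: it learns M synthetic key-value pairs by plain gradient descent on the mean-squared error between the attention output of the full cache and that of the compacted cache, using a hand-derived softmax backward pass. All names (`compact_kv`, the initialization from a random token subset, the learning rate and step count) are illustrative assumptions, not this fork's API.

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attn(Q, K, V, scale):
    # Standard scaled dot-product attention output.
    return softmax(Q @ K.T * scale) @ V

def compact_kv(Q, K, V, M, lr=0.05, steps=300, seed=0):
    """Toy attention matching: fit synthetic (Kc, Vc) of M rows so that
    attention against the compacted cache reproduces the output of the
    full N-token cache for the given queries Q. Illustrative only."""
    rng = np.random.default_rng(seed)
    N, d = K.shape
    scale = 1.0 / np.sqrt(d)
    O = attn(Q, K, V, scale)                 # target: full-cache output
    idx = rng.choice(N, M, replace=False)    # init from a random token subset
    Kc, Vc = K[idx].copy(), V[idx].copy()
    losses = []
    for _ in range(steps):
        S = Q @ Kc.T * scale                 # logits against synthetic keys
        A = softmax(S)
        R = A @ Vc - O                       # residual vs. full-cache output
        losses.append(float((R ** 2).sum()))
        # Gradients of the squared-error loss (constant factor 2 folded into lr).
        G = R @ Vc.T                                      # dL/dA
        dS = A * (G - (G * A).sum(-1, keepdims=True))     # softmax backward
        Kc -= lr * (dS.T @ Q) * scale
        Vc -= lr * (A.T @ R)
    return Kc, Vc, losses

# Usage: compact a 32-token cache to 4 synthetic pairs for 16 probe queries.
rng = np.random.default_rng(1)
Q = rng.normal(size=(16, 8))
K = rng.normal(size=(32, 8))
V = rng.normal(size=(32, 8))
Kc, Vc, losses = compact_kv(Q, K, V, M=4)
```

Note that `Kc` and `Vc` are free parameters, not a subset of `K` and `V`, which is the distinction the paragraph above draws. The production path in this fork replaces this loop with the C++/NEON/Metal solver listed in the table below.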
This implementation extends the paper in several directions:
| Feature | Paper (MIT) | This Fork |
|---|---|---|
| Language | Python (reference) | C++ (production) |
| Solver | NumPy/SciPy | Pure C++ with NEON SIMD + Metal GPU |
| Selection | Attention scoring | + OMP, progressive schedule |
| Query source | Captured from training | Self-study (autoregressive generation) |
| Long context | Not addressed | Chunked self-study for >8K |
| Flash attention | Not addressed | Per-layer flash hybrid (zero-beta) |
| Iterative refinement | Not addressed | Quality-gated convergence loop |
| Architectures | Standard attention | + iSWA, hybrid SSM, IMROPE |
| State persistence | Not addressed | Save/restore (version 2 serialization) |
| Quantized cache | Not addressed | Q8_0, Q4_K extraction |