Benchmark Dashboard

Live benchmark data from CI. Updated after each successful perf run.

- Models Tested: 17 (15 pass the quality gate)
- CI Tests: 49 (main-label, all passing)
- Engine Test Tiers: 8 (from pytest suites through the live dashboard)
- CI Workflows: 7 (active on every push/PR)

Decode Speed After Compaction

| Model | Context | Ratio | Cosine | Compacted (t/s) | Baseline (t/s) | Speedup |
|---|---|---|---|---|---|---|
| Qwen3-8B | 8K | 8x | 0.997 | 9.3 | 5.7 | +63% |
| Qwen3-30B-A3B | 4K | 8x | 0.999 | 20.2 | 13.5 | +50% |
| Qwen3-8B | 4K | 8x | 0.997 | 13.8 | 10.7 | +29% |
| Qwen3-8B | 16K | 8x | 0.999 | 5.3 | 4.8 | +10% |
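The speedup column is simply the ratio of compacted to baseline decode throughput, expressed as a percentage. A minimal check against the rows above:

```python
def speedup_pct(compacted_tps: float, baseline_tps: float) -> int:
    """Percentage decode speedup of the compacted run over the baseline."""
    return round((compacted_tps / baseline_tps - 1) * 100)

# Qwen3-8B at 8K context, 8x compaction (values from the table above)
print(speedup_pct(9.3, 5.7))  # → 63
```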

Compaction Quality Heatmap

Logit cosine similarity (select pipeline). Higher is better. Green >= 0.99, yellow >= 0.95.

| Model | 4K/2x | 4K/4x | 4K/8x | 8K/2x | 8K/4x | 8K/8x |
|---|---|---|---|---|---|---|
| Qwen3-8B | 0.999 | 0.999 | 0.997 | 0.999 | 0.998 | 0.997 |
| Qwen3-30B-A3B | 0.999 | 0.999 | 0.999 | — | — | — |
| Qwen3-14B | 0.995 | 0.996 | 0.992 | 0.999 | 0.997 | 0.973 |
| DeepSeek-R1-14B | 0.999 | 0.998 | 0.993 | — | — | — |
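The heatmap metric is the cosine similarity between logit vectors from the compacted and baseline runs. A minimal pure-Python sketch of the metric itself (not the actual eval pipeline, which is assumed to operate on full model logits):

```python
import math

def logit_cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two logit vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical logits score 1.0; values >= 0.99 mean compaction barely
# perturbs the model's output distribution.
print(logit_cosine([2.0, -1.0, 0.5], [2.0, -1.0, 0.5]))  # → 1.0
```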

Fork vs Upstream vs Ollama (Baseline)

No compaction applied: a pure decode-speed comparison, confirming zero regression from the compaction code paths.

256K Iterative Context Extension

64K physical KV cache, 49-57 compaction cycles, 256K effective tokens.

100%: 10-fact recall at every checkpoint through 256K tokens.
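The figures above imply how much new room each compaction cycle opens up. A back-of-the-envelope check using only the numbers reported in this section (a rough estimate, not the engine's internal accounting):

```python
physical_kv = 64 * 1024   # physical KV cache size, in tokens
effective = 256 * 1024    # effective context reached
cycles = (49, 57)         # range of compaction cycles reported

# Tokens beyond the physical cache must be absorbed by compaction,
# so overflow / cycle count approximates tokens freed per cycle.
overflow = effective - physical_kv
per_cycle = [round(overflow / c) for c in cycles]
print(per_cycle)  # → [4012, 3449]
```

So each compaction cycle frees room for roughly 3.4K-4K new tokens.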

Platform Build Status

| Platform | CI Workflow | Status |
|---|---|---|
| macOS (Apple Silicon) | modelai-ci | Active |
| macOS (Server smoke) | modelai-server-smoke | Active |
| Windows (MSVC x64) | modelai-ci-windows | Build-only |
| Perf regression | modelai-perf-smoke | Active |
| Upstream sync | modelai-upstream-sync | Weekly |

Upstream Sync Health

- Sync Schedule: Saturday 2 PM PDT (automated, CI-gated merge)
- Branches: 3 (modelai-main, upstream-master, upstream-sync)
- Security Patches: same-day (emergency sync for CVEs)