Dense Bench

a series · 5 of 5 published

Dense Bench is the investigation that started with me buying a Mac Studio (M3 Ultra, 512GB of unified memory) and wanting to know whether I'd bought a workstation or a very expensive TV. The question: can a $10K box do real frontier-lab work on long-context dense LLM inference, or is it a toy? I built a harness because the public benchmarks weren't asking what I wanted to know. The answer turned out to be more interesting, and more conditional, than I expected. Posts are numbered. New entries land here as I finish them.

00.
Dense Bench, Part 0: Why I built a benchmark I didn't need.
the setup. why a Mac Studio, why existing benchmarks weren't enough, what I wanted to find out.
01.
Dense Bench, Part 1: Prefill and decode curves from 4K to 128K context.
the numbers: prefill, decode, time-to-first-token (TTFT), and memory curves from 4K to 128K on two dense 70B-class models.
02.
Dense Bench, Part 2: The Llama 128K cliff.
one number from the sweep that didn't match my prior, and what I learned chasing it.
03.
Dense Bench, Part 3: Compressed KV at the retrieval boundary.
what compressing the KV cache to int8 and int4 costs you at the retrieval boundary — and why bit width turned out not to be the variable that mattered.
04.
Dense Bench, Part 4: Memory doesn't predict it — agentic reliability vs quantization.
whether a quantized 70B can actually finish a multi-step job — and why the smallest, cheapest config was the least reliable, while more precision helped one model family and hurt the other.