Dense Bench
a series · 5 of 5 published
Dense Bench is the investigation that started with me buying a Mac Studio M3 Ultra 512GB and wanting to know whether I'd bought a workstation or a very expensive TV. The question: can a $10K box do real frontier-lab work on long-context dense LLM inference, or is it a toy? I built a harness because the public benchmarks weren't asking what I wanted to know. The answer turned out to be more interesting — and more conditional — than I expected. Posts are numbered. New entries land here as I finish them.
- 00. the setup. why a Mac Studio, why existing benchmarks weren't enough, what I wanted to find out.
- 01. the numbers: prefill, decode, TTFT, and memory curves from 4K to 128K on two dense 70B-class models.
- 02. one number from the sweep that didn't match my prior, and what I learned chasing it.
- 03. what compressing the KV cache to int8 and int4 costs you at the retrieval boundary, and why bit width turned out not to be the variable that mattered.
- 04. whether a quantized 70B can actually finish a multi-step job, and why the smallest, cheapest config was the least reliable, while more precision helped one model family and hurt the other.