abhi ram salammagari · fremont, ca

Dense Bench

a series · 5 of 5 published

Dense Bench is the investigation that started with me buying a Mac Studio M3 Ultra 512GB and wanting to know whether I'd bought a workstation or a very expensive TV. The question: can a $10K box do real frontier-lab work on long-context dense LLM inference, or is it a toy? I built a harness because the public benchmarks weren't asking what I wanted to know. The answer turned out to be more interesting — and more conditional — than I expected. Posts are numbered. New entries land here as I finish them.

  00. the setup. why a Mac Studio, why existing benchmarks weren't enough, what I wanted to find out.

  01. the numbers: prefill, decode, TTFT, and memory curves from 4K to 128K on two dense 70B-class models.

  02. one number from the sweep that didn't match my prior, and what I learned chasing it.

  03. what compressing the KV cache to int8 and int4 costs you at the retrieval boundary, and why bit width turned out not to be the variable that mattered.

  04. whether a quantized 70B can actually finish a multi-step job, and why the smallest, cheapest config was the least reliable, while more precision helped one model family and hurt the other.