
Dense Bench, Part 0: Why I built a benchmark I didn't need.

2026-05-07 · updated 2026-05-07 · 755 words · 4 min · tags: benchmarks, mlx, llm, mac-studio, m3-ultra, dense-bench

Part 0 of the dense-bench series. Part 1 is the numbers; Part 2 is the result I didn't expect.

I bought a Mac Studio M3 Ultra with 512 GB of unified memory and the first thing I wanted to know was whether I'd bought a workstation or a very expensive TV.

On paper, the buying decision was easy. 512 GB of unified memory means a dense 70B-class model at 4-bit quantization fits with hundreds of gigabytes of headroom for KV cache. On the spec sheet, this is the machine I want for the work I want to do. The question is what "fits" actually buys you in practice as you push context out to where people need it: 32K, 64K, 128K. Throughput? Latency? A graceful curve, or a cliff?
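Roughly, assuming Llama-3.1-70B's published shape (80 transformer layers, 8 KV heads, head dim 128), an fp16 KV cache, and ignoring quantization overhead, the back-of-envelope math looks like this:

```python
# Back-of-envelope, not a measurement. Model dims are Llama-3.1-70B's
# published config (80 layers, 8 KV heads, head_dim 128); KV cache in fp16.
params = 70e9
weight_bytes = params * 0.5                            # 4-bit ~= 0.5 bytes/param

layers, kv_heads, head_dim = 80, 8, 128
kv_per_token = 2 * layers * kv_heads * head_dim * 2    # K and V, 2 bytes each
kv_bytes = kv_per_token * 128 * 1024                   # 128K context

print(f"weights  ~ {weight_bytes / 2**30:.0f} GiB")    # ~33 GiB
print(f"KV @128K ~ {kv_bytes / 2**30:.0f} GiB")        # ~40 GiB
```

Call it roughly 75 GiB all-in for one 70B at full context, which is why 512 GB reads as headroom rather than a tight fit.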

The honest answer was that I didn't know, and the public data didn't tell me.

What's already out there

Two data points were the closest I could find.

Awni Hannun's MLX numbers on DeepSeek V3. Awni is on the MLX team at Apple, so his benchmarks are the de facto reference: excellent measurements, Apple Silicon, large model, real long context. But DeepSeek V3 is a mixture-of-experts model: total parameter count is enormous, active count per token is much smaller. That changes the memory-bandwidth story completely. The KV-cache curve for an MoE model and the KV-cache curve for a dense 70B do not live in the same neighborhood.

arXiv:2511.05502. Closest publicly available data on dense models in MLX at long context. The hardware is M2 Ultra 192 GB and the dense Qwen runs cap at 100K tokens — both because that's where the memory budget runs out, and because the experiment was designed around a different question. It's a good paper. It just isn't measuring my hardware or the regime I care about.

So the configuration I'd just paid for — M3 Ultra 512 GB, dense 70B-class models, contexts to 128K — wasn't in either dataset. Which is fine, except that I'd rather know than not know, and "I'd rather know" is the version of me that ends up writing benchmark harnesses on weekends.

What I wanted to measure

Prefill throughput, decode throughput, time-to-first-token, and peak memory across the matrix (model × prompt-type × length), with length running 4K, 8K, 16K, 32K, 64K, 128K. Two models: Llama-3.1-70B-Instruct and Qwen-2.5-72B-Instruct, both 4-bit. Two prompt types: a summarize-style prompt over public-domain prose, and a single-needle retrieval task in the style of RULER's NIAH (single-needle, not the full RULER 13-task suite; quality is a separate experiment). Twenty-four runs total. The numbers needed to be wall-clock and reproducible.
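The matrix is small enough to write down directly. A sketch of what it expands to (the repo names here are the usual mlx-community 4-bit conversions, and the field names are illustrative rather than mlx-dense-bench's actual config format):

```python
from itertools import product

# 2 models x 2 prompt types x 6 lengths = 24 cells. Names illustrative.
MODELS = [
    "mlx-community/Meta-Llama-3.1-70B-Instruct-4bit",
    "mlx-community/Qwen2.5-72B-Instruct-4bit",
]
PROMPTS = ["summarize", "niah-single"]
LENGTHS = [4_096, 8_192, 16_384, 32_768, 65_536, 131_072]

RUNS = [
    {"model": m, "prompt": p, "context_len": n}
    for m, p, n in product(MODELS, PROMPTS, LENGTHS)
]
assert len(RUNS) == 24
```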

The harness needed three properties that other MLX benchmarks I'd pulled off GitHub didn't have together (the first two are sketched after the list):

  1. Resumable. A 24-cell matrix on a 70B at long context takes five to six hours. If I Ctrl-C in hour four, I want to come back to a half-finished JSONL and pick up where I left off, not redo the work.
  2. Honest about swap. Unified memory means the OS can page silently before MLX itself OOMs. A benchmark that doesn't flag decode < 0.5 tok/s as suspicious is going to publish a number that's actually measuring how fast macOS can move pages around.
  3. Library cross-check. mlx-lm reports its own internal prefill and decode rates. Wall-clock and library-reported numbers should agree. When they don't, I want to know which one to trust.
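A minimal sketch of the first two properties, assuming one JSON object per finished run in a results.jsonl (the real harness's field names and threshold handling may differ):

```python
import json
from pathlib import Path

RESULTS = Path("results.jsonl")
SWAP_SUSPECT_TOK_S = 0.5   # decode below this likely means macOS is paging

def completed(path: Path) -> set[tuple]:
    """One JSON object per finished run; a (model, prompt, length) key per cell."""
    done = set()
    if path.exists():
        for line in path.read_text().splitlines():
            r = json.loads(line)
            done.add((r["model"], r["prompt"], r["context_len"]))
    return done

def record(path: Path, result: dict) -> None:
    """Append a finished cell, flagging results that smell like swap."""
    if result["decode_tok_s"] < SWAP_SUSPECT_TOK_S:
        result["swap_suspect"] = True   # keep the number, but mark it
    with path.open("a") as f:
        f.write(json.dumps(result) + "\n")

# Resuming is then just skipping cells that already have a line:
#   done = completed(RESULTS)
#   for run in RUNS:
#       if (run["model"], run["prompt"], run["context_len"]) in done:
#           continue
#       result = run_cell(run)          # hypothetical: the actual benchmark call
#       record(RESULTS, result)
```

The wall-clock versus library-reported cross-check follows the same pattern: record both numbers per cell and diff them afterward.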

I built mlx-dense-bench to have those three properties and nothing else. It's small. It's boring. The interesting work is downstream.

Why bother shipping it

If I were the only person who ever ran this, I wouldn't bother publishing the harness. I'd run the matrix, look at the numbers, maybe lose them on a hard drive in two years.

The reason it's open source is that the next person who buys an M3 Ultra 512 GB shouldn't have to write the same scaffolding I did. The configuration is going to be relevant for a while: this is the box for any small team or solo researcher who wants to do long-context dense work without renting H100s. The numbers should be in the open and the tooling to reproduce them should be in the open. Anyone running their own model or their own prompts should be able to drop in a config file and get clean comparable data back.

That's the setup. Part 1 is the numbers — what the curves actually look like. Part 2 is the result I didn't expect: at 128K, one of the two models stops talking entirely.

~~~
