
Dense Bench, Part 2: The Llama 128K cliff.

2026-05-07 · updated 2026-05-07 · 854 words · 4 min · tags: benchmarks, mlx, llm, mac-studio, m3-ultra, dense-bench, llama, qwen

Part 2 of the dense-bench series. Part 0 is the why; Part 1 is the numbers.

The matrix in Part 1 has 24 rows. Twenty-two of them tell the story they were supposed to tell. The other two stop mid-sentence.

At 128K context, Llama-3.1-70B-Instruct-4bit emits the end-of-turn token as its very first generated token, in both the lorem (summarize) and ruler-niah (single-needle) prompt types. That is the entire generation. Generation length: 1. Decode throughput: 0 tok/s. The model finished prefill — about thirty minutes of work — then chose to stop talking.

Qwen-2.5-72B-Instruct-4bit, same hardware, same prompt-length target, on the same day, produces 99 tokens (lorem) and 128 tokens (niah) and decodes them at 6.06 tok/s.

That's a cliff, and it's worth understanding.

What the data looks like

The 128K rows from the matrix:

| model | prompt | prompt_tokens | gen_tokens | decode_tps | flags |
|---|---|---:|---:|---:|---|
| Llama-3.1-70B-Instruct-4bit | lorem | 131,114 | 1 | 0.00 | early_eos, !valid_decode |
| Llama-3.1-70B-Instruct-4bit | niah | 131,133 | 1 | 0.00 | early_eos, !valid_decode |
| Qwen-2.5-72B-Instruct-4bit | lorem | 131,108 | 99 | 6.06 | — |
| Qwen-2.5-72B-Instruct-4bit | niah | 131,121 | 128 | 6.06 | — |

Prefill and TTFT columns omitted here; they're valid for all four rows (the prefill pass completed in every case). See Part 1 for the full table.

The decode_tps values are zero for Llama because there's nothing to decode after the first token. The model emits <|eot_id|> and the generation halts. This is real model behavior, not a harness bug. The harness flagged both rows automatically as early_eos=True and valid_decode=False and excluded them from the decode-throughput plot. Without that flag, those zeros would have ended up in a chart as if Llama "decoded at zero tokens per second at 128K," which is a meaninglessly wrong way to read what's happening.
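A minimal sketch of that flagging step. The field names and the cutoff are illustrative assumptions, not the harness's actual schema; the row values are taken from the table above.

```python
# Illustrative post-processing of benchmark rows. Field names and the
# MIN_VALID_GEN_TOKENS cutoff are assumptions, not the real harness schema.
raw_rows = [
    {"model": "Llama-3.1-70B-Instruct-4bit", "prompt": "lorem", "gen_tokens": 1,  "decode_tps": 0.00},
    {"model": "Qwen-2.5-72B-Instruct-4bit",  "prompt": "lorem", "gen_tokens": 99, "decode_tps": 6.06},
]

MIN_VALID_GEN_TOKENS = 8  # assumed cutoff: below this, decode_tps is mostly noise

def flag_row(row: dict) -> dict:
    """Mark generations that stopped at (or right after) the first token."""
    row["early_eos"] = row["gen_tokens"] <= 1
    row["valid_decode"] = row["gen_tokens"] >= MIN_VALID_GEN_TOKENS
    return row

rows = [flag_row(dict(r)) for r in raw_rows]
```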

Where the cliff comes from

Llama-3.1-70B-Instruct's nominal context limit is 131,072 tokens. "128K" is the headline; the spec is 131,072. Round numbers in ML are usually approximations. This one isn't. 128K, on the dot. Our 128K prompts run a hair past that. After the chat template wraps the prompt (system prefix, role tokens, end-of-turn boundaries), the actual tokenized input is 131,114 tokens for lorem and 131,133 for niah. Forty-two tokens over for one prompt, sixty-one for the other. From the model's perspective, we asked it to attend to positions outside the range it was trained on, and the path of least resistance is to declare the conversation over.
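If you want to see that overhead for your own prompts, the quickest check is to tokenize the payload with and without the chat template and compare against 131,072. A sketch, assuming a local Hugging Face tokenizer; the model id and payload are placeholders:

```python
from transformers import AutoTokenizer

# Placeholder model id; point at whatever local copy you actually have.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")

payload = "..."  # your actual long document, elided here

raw_tokens = len(tok(payload)["input_ids"])

# The chat template adds the system prefix, role tokens, and end-of-turn
# boundaries around the payload.
wrapped = tok.apply_chat_template(
    [{"role": "user", "content": payload}],
    tokenize=True,
    add_generation_prompt=True,
)

CONTEXT_LIMIT = 131_072
print(f"payload: {raw_tokens}  wrapped: {len(wrapped)}  "
      f"over the limit by: {len(wrapped) - CONTEXT_LIMIT}")
```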

Qwen-2.5-72B-Instruct tells a different story. Its 128K is native: no RoPE-scaling extension, no "we trained on shorter and stretched the positional embeddings" footnote. Llama 3.1's 128K is RoPE-scaled from a shorter base; Qwen 2.5 was trained at 128K directly. Forty-two extra tokens inside Qwen's window don't put it outside any extension; they put it inside the regime the model already saw during training, plus a rounding error.

Different model, different relationship to "128K." Same harness, same prompt generator, same chat-template logic. The numbers shake out the way they shake out.

What's still valid in the row

TTFT and prefill throughput for Llama at 128K. The prefill pass completed cleanly: 131,114 / 131,133 tokens, attention applied across all of them, KV cache populated, peak memory measured. The model didn't crash; it chose to stop at the next step.

So if you're reasoning about whether you can put a long-context Llama-3.1-70B prompt on Apple Silicon and at least get an answer about cost, the prefill numbers from Part 1 tell you what an attempt costs. They just don't tell you anything about whether the answer will be useful. At 128K, with the chat template, it won't.

What to do about it

Three takeaways for anyone shipping long-context Llama 3.1:

  1. 128K is not a payload budget. It's 131,072 minus your chat template, system prompt, and any tool-call boundaries. Budget for the overhead before you call your prompt a 128K prompt.
  2. If you actually need 128K of payload, Qwen-2.5-72B-Instruct handles it cleanly on this hardware. Llama 3.1 doesn't, at least not at the absolute edge of its window.
  3. Whatever benchmark you trust, make sure it flags early_eos (a sketch of that guard follows this list). A row with one generated token and zero decode throughput will silently corrupt any aggregate that doesn't know to exclude it. You'll publish a number, someone will plot it, and the chart will tell a story that isn't true.
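For that third point, here is what the guard looks like on the aggregation side, reusing the illustrative field names from the earlier snippet:

```python
# Any aggregate over decode throughput should drop flagged rows first.
# Field names are the same illustrative ones used above.
def mean_decode_tps(rows: list[dict]) -> float:
    valid = [r for r in rows if r["valid_decode"] and not r["early_eos"]]
    if not valid:
        raise ValueError("no rows with a valid decode phase")
    return sum(r["decode_tps"] for r in valid) / len(valid)

# With the Llama 128K rows included, a naive mean would average in a 0.00
# and report a throughput that no real decode ever produced.
```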

Why this is a Part 2 and not a footnote

The cliff isn't visible from the numbers alone. The numbers just say "Llama at 128K = 0 tok/s." Without the harness telling me the generation length was 1, I'd have written "decode collapses on Llama at 128K," and that's the wrong sentence. Decode doesn't collapse; the model declines to decode, because the prompt is past its trained range.

This is what a benchmark is for. Not the headline tok/s in a table, but the conditions and flags that tell you which rows are saying what they look like they're saying. The harness took an extra week to build out properly. Part 2 wouldn't exist without it.

~~~
