abhi ram salammagari · online · fremont, ca

Writing

  • 2026-04-18
    Hello

    What this site is, why it exists, what will live here.

  • 2026-05-07
    Dense Bench, Part 0: Why I built a benchmark I didn't need.

    Why existing MLX benchmark data didn't answer the question I had about dense 70B inference at long context on M3 Ultra 512GB — and the three properties I needed in a harness before I trusted any of it.

  • 2026-05-03
    Dense Bench, Part 1: Prefill and decode curves from 4K to 128K context.

    Real prefill, TTFT, decode, and memory curves for Llama-3.1-70B-Instruct-4bit and Qwen-2.5-72B-Instruct-4bit at 4K–128K context on Mac Studio M3 Ultra 512GB. Methodology, raw tables, and a reproducible harness.

  • 2026-05-07
    Dense Bench, Part 2: The Llama 128K cliff.

    At 128K context, Llama-3.1-70B-Instruct emits end-of-turn as its first generated token. Qwen-2.5-72B-Instruct doesn't. The 42 tokens past the nominal context limit, the harness flag that caught it, and what to take away if you're shipping long-context Llama at the edge.

  • 2026-05-12
    Dense Bench, Part 3: Compressed KV at the retrieval boundary.

    120 runs sweeping int8 / int4 / FP16 KV cache across 5 needle depths and 5–7 context lengths for dense Llama-3.1-70B and Qwen-2.5-72B on M3 Ultra 512GB. Bit width turned out not to be the variable that mattered. A cache-key bug in the 192K probe, caught after publish, is documented in §6.

  • 2026-06-24
    Dense Bench, Part 4: Memory doesn't predict it — agentic reliability vs quantization.

    640 graded multi-turn tool-calling episodes across Llama-3.1-70B and Qwen-2.5-72B at 4-bit and 8-bit on M3 Ultra 512GB. Peak memory does not predict task completion, and the quantization effect reverses sign between the two model families. Includes a Plain-English reading mode.