Benchmarks I'm watching now

Whenever a new model is released, I read the benchmarks. I haven’t seen a good list of them anywhere; I tend to see them referenced and then bookmark them.

Here’s my list.

Benchmarks

  • 16x Eval Model Evaluation - Small number of coding, writing, and image analysis tasks. More sophisticated rubric. Missing Claude Sonnet 4.5 and Opus 4.1.
  • AI Productivity Index - Attempt to benchmark AIs doing real work, like investment banking, legal, medical.
  • Aider LLM Leaderboards - Coding challenges. Missing Claude Sonnet 4.5. Last updated September 2.
  • ARC Prize - Aims to capture tasks that are easy for humans but hard for AIs, so I’m not really sure it’s a good representation of what people typically use AI for. It’s striking though that Grok 4, GPT-5, and Sonnet 4.5 do the best at this. Grok 4 occasionally comes out on top in benchmarks, and this is one of those cases.
  • Artificial Analysis LLM Leaderboard - Independent evaluations on several benchmarks, with a composite score. Excellent benchmark. This is the place I think I’d send someone who asked for just one website.
  • Berkeley Function Calling Leaderboard (BFCL) V4 - Tool-calling benchmark. Tends to show Claude models above Grok and GPT-5.
  • BiomedArena.AI - Safety and response quality in a biomedical benchmark. Has only published an automated benchmark so far. Hopefully a crowdsourced one will come soon.
  • CompileBench - Cool real-world example. Can AI do things like compile old software?
  • Design Arena - Which AI is best at design. Personally I find GPT-5’s designs to be better than Sonnet 4.5’s. This benchmark disagrees, though the Elo difference isn’t huge.
  • Dubesor LLM Benchmark - Small benchmark by an individual. It has the classic flaw where you can’t tell some of the parameters of the models. GPT-5 is lower down, but is that with default thinking? Would GPT-5-high be higher? Sonnet 4.5 is lower down than you’d expect. Did they try it with thinking?
  • EQ-Bench 3 - Really cool idea to test emotional intelligence and reasoning. I worry a little bit that it’s LLM-as-judge, and LLMs might prefer their own outputs. That said, I tend to find Claude better at EQ, but this benchmark puts it lower than I’d expect. It does find Sonnet 4.5 to be an excellent writer.
  • Evals | Roo Code - I’m happy to see coding assistants test models. However, I’m concerned this benchmark is saturated, which makes differentiating between models less reliable.
  • FACTS Grounding Leaderboard - Whether a model’s response accurately reflects a long-form document. No Sonnet 4.5.
  • ForecastBench - Baseline - Can LLMs forecast the future? Kind of. The public and superforecasters are still better, though LLMs are gradually improving.
  • FutureSearch Benchmarks - Confusingly this is a deep research benchmark, not a future prediction benchmark. Sonnet 4.5 does surprisingly well, followed by GPT-5.
  • Kagi LLM Benchmarking Project - Small private benchmark from the maintainers of the search engine. I find this one pretty useful, and it often has things that other benchmarks don’t (GPT-5 Pro). However, it’s sometimes missing things (it has Sonnet 4.5 but not Sonnet 4.5 thinking).
  • Kotlin-bench Leaderboard - SWE-bench but for Kotlin.
  • lechmazur - Clever benchmarks, like NYT Connections puzzles and storytelling that must incorporate specific elements. The latter is one of the benchmarks that Kimi does surprisingly well on. (Another is EQ-Bench, above.)
  • LiveBench - Automated benchmark (as opposed to human-judged like LMArena or LLM-judged like EQ-Bench) that I think best reflects my experience of coding strength. (That is, the global score does, not the coding score.)
  • LMArena Overview - The OG. Curiously Gemini 2.5 continues to top this benchmark, but I don’t think I see it at the top of many others, except vision benchmarks. They offer a bunch of benchmarks, including web dev and search.
  • Lmgame Bench - It’s missing models and evals, and the results are surprising, but having models play games is a great idea, so I keep it around to look back at it occasionally.
  • MathArena - Math competition benchmark. Shows GPT-5 and Grok 4 towards the top, as expected, but then puts GLM-4.6 in first place!
  • ModelSlant.com - Interesting idea to test model slant, but I’m not sure what this shows, since the numbers are so small.
  • OpenRouter LLM Rankings - Not a benchmark. This measures usage, and is therefore a good proxy for buzz around models, even if they don’t top whatever benchmark.
  • Probing LLM Social Intelligence via Werewolf - Wonderful idea. If only this had Claude.
  • SEAL LLM Leaderboards - One of my favorites. A dozen evaluations, including ones you know (Humanity’s Last Exam) and ones you don’t (evaluating model honesty when pressured to lie, and puzzle questions).
  • SEAL Showdown | Human Evaluation - SEAL’s answer to LMArena.
  • SimpleBench - A little bit like ARC-AGI. These are questions that humans find easy and LLMs find hard. Gemini 2.5 Pro does well on this.
  • Terminal-Bench - Represented elsewhere, e.g., in Artificial Analysis, but this records the agent type.
  • Vals.ai - Academic and non-academic (finance, legal) benchmarks, as well as a composite score weighted by GDP. The composite puts Sonnet 4.5 ahead of GPT-5, though I worry that’s because OpenAI’s success is split between GPT-5 and GPT-5 Codex.
  • Vellum LLM Leaderboard - A bit out of date, but I keep it around because it tracks several benchmarks.
  • Vending-Bench: Testing long-term coherence in agents | Andon Labs - Coherence over long periods of time. Grok 4 does well here.
  • VoxelBench - A Minecraft-style benchmark. I used to follow MCBench, but it wasn’t updated often enough, so I switched to VoxelBench. Surprising result that Gemini 2.5 Deep Think beats GPT-5-high and GPT-5 Pro.
  • Wolfram LLM Benchmarking Project - Test cases from Wolfram’s book about the Wolfram Language. GPT-5 does poorly enough that I wonder how GPT-5-high would do.
  • Yupp Leaderboard - Competitor to LMArena. Puts Gemini 2.5 Pro lower down than LMArena, which I’d expect, but then has Opus 4 nearly 100 Elo points higher than Opus 4.1, which surprises me.
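A note on the Elo numbers several of these leaderboards (LMArena, Design Arena, Yupp) report: the standard Elo model makes rating gaps easy to translate into head-to-head win probabilities, which helps calibrate claims like "the Elo difference isn’t huge" or "nearly 100 Elo points higher." Here’s a minimal sketch of that standard formula; the specific ratings are just illustrative, not taken from any leaderboard:

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A modest 20-point gap is only about a 53% win rate...
print(round(elo_win_probability(1320, 1300), 3))  # ≈ 0.529
# ...while a 100-point gap implies roughly 64%.
print(round(elo_win_probability(1400, 1300), 3))  # ≈ 0.64
```

So even a 100-point swing, like the Opus 4 vs. Opus 4.1 result on Yupp, means the higher-rated model wins only about two matchups in three.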