Blog on samek.fyi

When a change feels risky

Fri, 12 Jun 2026 00:00:00 -0400

I’m trying to get into a new habit. Previously, when I wrote code that felt risky, I would reread it more carefully, read the surrounding code, try it out locally or in staging, and monitor closely after deploying.

When I write code with AI, I’m grateful to have the intuition that a particular code change is risky, since that represents years of experience with the types of things I work on. However, the next intuition, to reach for the things I used to try, now skips the best next move.

From working in the business to working on it

Sun, 12 Apr 2026 00:00:00 +0000

There’s an old business idea, usually attributed to Michael Gerber, about the difference between working in your business and working on your business. Working in the business is doing the thing. Working on the business is improving how the business works. I think this distinction works in an analogous way in software engineering. Writing code is working in the business. Everything around the code is working on it.

I wrote a feature at work in a day that previously would have taken a week. Writing the code was faster, but everything else around it stayed the same: chatting with users, building consensus with stakeholders, making sure the architecture made sense, running tests, deploying to staging and production.

Let the LLM try first

Wed, 11 Mar 2026 16:23:00 +0000

I had a deck in Anki, a flashcards app, with an annoying problem that I had caused when I imported the cards. On some cards, the front of the card had text that linked to a URL. On others, the URL was only on the back, which meant I couldn’t click it when I actually wanted it.

My first instinct was to export the deck to CSV, use AI to write a script to transform it, and then import it again. But in a moment of laziness and curiosity, I just asked Claude Code to fix it. I assumed Claude Code didn’t have access to the application, so it couldn’t fix it. I could have used Claude Code’s Chrome extension, since Anki has a web version, but I didn’t want to deal with the slowness of all that clicking.

Reflections on prototyping a sysadmin benchmark

Fri, 13 Feb 2026 19:42:00 +0000

I’ve been spending a lot of time keeping up with AI and experimenting with it. It’s still a surprise to me that I can create a Chrome extension in minutes, having never written one before, and having rarely programmed in JavaScript.

I wrote a Chrome extension to copy page content, and another to send the same prompt to multiple LLMs, then a CLI tool to do the same with Simon Willison’s llm CLI. Then a website to keep track of my BJJ notes. Then a harness to evaluate LLMs.

Benchmarks I'm watching now

Mon, 20 Oct 2025 00:59:00 +0000

Whenever a new model is released, I read the benchmarks. I haven’t seen a good list of benchmarks. I tend to see them referenced and then bookmark them.

~~Here’s my list.~~ I keep a continually updated list of benchmarks at samek.fyi/benchmarks. I deleted the old list that used to live in this post so that I could link to the updated one instead.

Dia browser

Fri, 13 Jun 2025 05:14:00 +0000

I tried Dia, a new Chromium-based browser with built-in AI features from the creators of Arc. It has an AI chat sidebar that can reference any open tab, plus writing assistance that pulls context from multiple tabs. Their YouTube video shows someone writing an email in Gmail based on their Google Calendar tab.

It’s a UX improvement over existing options. Brave’s Leo only works with the current tab and requires your own API key for the models I want to use. Firefox awkwardly opens ChatGPT or Claude’s actual UI and copy-pastes content. Chrome’s Gemini Flash integration is clean but single-tab only.

Coding agent with Simon Willison's llm

Wed, 28 May 2025 01:46:00 +0000

Simon Willison released tool calling in his llm CLI. A tool is just a Python function. Inspired by Coding agent in 94 lines of Ruby, I wrote a coding agent in Python. Here’s the code.

aider helpfully explains the various editing formats people use for LLMs:

whole returns an entirely new document.
diff is a series of search and replace blocks.
diff-fenced is like diff, but puts the filename inside the fence. Gemini models prefer this.
udiff is a more familiar diff format, but only used for GPT-4 Turbo.

So I picked diff.
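
To make the diff format concrete, here is a minimal sketch of what a search-and-replace edit tool could look like as a plain Python function. It is not the code from the post: the names apply_edit and edit_file, and the rule that the search text must match exactly once, are assumptions made for illustration.

```python
def apply_edit(text: str, search: str, replace: str) -> str:
    """Apply one search/replace block to text.

    Assumption for this sketch: the search text must appear exactly once,
    so an ambiguous or stale edit fails loudly instead of changing the
    wrong place.
    """
    if not search:
        raise ValueError("search text must not be empty")
    count = text.count(search)
    if count != 1:
        raise ValueError(f"expected exactly 1 match for search text, found {count}")
    return text.replace(search, replace, 1)


def edit_file(path: str, search: str, replace: str) -> str:
    """Hypothetical tool: rewrite one file using one search/replace block."""
    with open(path, encoding="utf-8") as f:
        original = f.read()
    updated = apply_edit(original, search, replace)
    with open(path, "w", encoding="utf-8") as f:
        f.write(updated)
    # Return a short status string that the model can read back.
    return f"edited {path}"
```

Raising an error on zero or multiple matches gives the model a clear message to react to, so it can retry with a longer, more specific search string.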

Pareto frontier LLMs, Aider edition

Thu, 08 May 2025 04:42:00 +0000

Google released an impressive improvement to Gemini, which now tops the WebDev Arena Leaderboard. That position had been held by Claude for quite a while.

Here’s a Pareto frontier for the Aider polyglot coding leaderboard. The analysis of March’s Gemini didn’t include reasoning tokens in the cost. They fixed the problem for May’s release.

It’s too bad it doesn’t include o4-mini (medium), especially because the new Gemini sits right between o4-mini (high) and o3-plus-GPT-4.1. It’s nice to see that the corrected pricing makes the Pareto frontier make sense again. Gemini 2.5 Pro wasn’t some wild outlier.
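
For readers who want to build a chart like this themselves, here is a minimal sketch of how a Pareto frontier over cost and score can be computed. The function name pareto_frontier and the model names and numbers in the example are made up for illustration; they are not leaderboard data.

```python
def pareto_frontier(models):
    """Return the models that are not dominated on (cost, score).

    models: list of (name, cost, score) tuples.
    Lower cost is better; higher score is better.
    """
    frontier = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, best score first.
    for name, cost, score in sorted(models, key=lambda m: (m[1], -m[2])):
        # A model joins the frontier only if it beats every cheaper model.
        if score > best_score:
            frontier.append((name, cost, score))
            best_score = score
    return frontier


# Hypothetical data: (name, cost in dollars, benchmark score).
example = [
    ("model-a", 1.0, 40.0),
    ("model-b", 5.0, 60.0),
    ("model-c", 6.0, 55.0),
    ("model-d", 20.0, 80.0),
    ("model-e", 20.0, 70.0),
]
print([name for name, _, _ in pareto_frontier(example)])
```

With this definition, an undercounted cost moves a model to the left on the chart, which can make it look like it dominates everything near it.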

Pareto frontier LLMs, Kagi edition

Sat, 03 May 2025 17:58:00 +0000

Kagi benchmarks LLMs on a private benchmark. I asked Claude to build a Pareto frontier chart for Kagi’s April 17th, 2025, update (source code).

Claude has the strange property of doing well on some benchmarks and not others. It creates half as much slop in longform writing as Gemini and OpenAI models, for example, and does well in a number of coding benchmarks, while doing much worse than other frontier models in other benchmarks.

Model Context Protocol, simply

Thu, 01 May 2025 20:06:00 +0000

I keep seeing descriptions of Model Context Protocol (MCP) that are either vague metaphors or get too specific too quickly.

From the official docs:

MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.

o3 is the research assistant I wanted

Wed, 30 Apr 2025 19:03:00 +0000

Deep Research (three separate products! from Perplexity, OpenAI, and Gemini) is usually more than I want. Instead I want something that stops when it has the answer. Here’s a great example about predicting the next pope. It searched the web to find out what prediction markets said, aggregated and normalized the data, and gave me a chart of probabilities. Then it made a second chart introducing me to each candidate.

ChatGPT's improved search

Mon, 28 Apr 2025 21:12:23 +0000

A few days ago I wrote that Perplexity is better for quick searches. That may have changed.

OpenAI announced some search improvements, though sources were not among them, and Sam Altman mentioned improvements to the model. ChatGPT now does a better job at my example query, explicitly noting that the two models have different behavior.

I’m curious if this will be reflected on Search Arena.

Steam engine time

Sun, 27 Apr 2025 21:27:00 +0000

o3 and Gemini 2.5 Pro being a single Elo point apart on LMArena with style control feels very steam-engine time.

There’s an idea in the science-fiction community called steam-engine time, which is what people call it when suddenly twenty or thirty different writers produce stories about the same idea. It’s called steam-engine time because nobody knows why the steam engine happened when it did. Ptolemy demonstrated the mechanics of the steam engine, and there was nothing technically stopping the Romans from building big steam engines.

Perplexity is still better at quick searches

Fri, 25 Apr 2025 02:13:00 +0000

Simon Willison writes that “AI assisted search-based research actually works now.” o3 and o4-mini agentic research is truly impressive, as are ChatGPT’s and Gemini’s Deep Research.

But I consistently find queries where Perplexity gets it right, and ChatGPT and Gemini get it wrong.

An example: “how is chatgpt-4o-latest different from gpt-4o.” Only Perplexity correctly identifies that the model is priced differently than gpt-4o. Gemini 2.5 Pro focuses entirely on gpt-4o being stable while chatgpt-4o-latest changes, and ChatGPT does the same. I tried asking the question a few different ways, and Perplexity found different sources to explain that the March release improved in instruction following, coding, and creativity. I think this demonstrates that Perplexity didn’t just get lucky with a source.