<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Blog on samek.fyi</title>
		<link>https://samek.fyi/posts/</link>
		<description>Recent content in Blog on samek.fyi</description>
		<generator>Hugo</generator>
		<language>en-us</language>
		
		
		
		
			<lastBuildDate>Fri, 12 Jun 2026 00:00:00 -0400</lastBuildDate>
		
			<atom:link href="https://samek.fyi/posts/index.xml" rel="self" type="application/rss+xml" />
			<item>
				<title>When a change feels risky</title>
				<link>https://samek.fyi/risky-ai-changes/</link>
				<pubDate>Fri, 12 Jun 2026 00:00:00 -0400</pubDate>
				<guid>https://samek.fyi/risky-ai-changes/</guid>
				<description>&lt;p&gt;I&amp;rsquo;m trying to get into a new habit. Previously, when I wrote some code and it felt risky, I would reread the code more carefully, read the surrounding code, try it out locally or in staging, and monitor closely after deploying.&lt;/p&gt;&#xA;&lt;p&gt;When I write code with AI, I&amp;rsquo;m grateful to have the intuition that a particular code change is risky, since that represents years of experience with the types of things I work on. However, the next intuition, to reach for the things I used to try, now skips the best next move.&lt;/p&gt;</description>
			</item>
			<item>
				<title>From working in the business to working on it</title>
				<link>https://samek.fyi/from-working-in-the-business-to-working-on-it/</link>
				<pubDate>Sun, 12 Apr 2026 00:00:00 +0000</pubDate>
				<guid>https://samek.fyi/from-working-in-the-business-to-working-on-it/</guid>
				<description>&lt;p&gt;There&amp;rsquo;s an old business idea, usually attributed to Michael Gerber, about the difference between working &lt;em&gt;in&lt;/em&gt; your business and working &lt;em&gt;on&lt;/em&gt; your business. Working in the business is doing the thing. Working on the business is improving how the business works. I think this distinction works in an analogous way in software engineering. Writing code is working &lt;em&gt;in&lt;/em&gt; the business. Everything around the code is working &lt;em&gt;on&lt;/em&gt; it.&lt;/p&gt;&#xA;&lt;p&gt;I wrote a feature at work in a day that previously would have taken a week. Writing the code was faster, but everything else around it stayed the same: chatting with users, building consensus with stakeholders, making sure the architecture made sense, running tests, deploying to staging and production.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Let the LLM try first</title>
				<link>https://samek.fyi/let-the-llm-try-first/</link>
				<pubDate>Wed, 11 Mar 2026 16:23:00 +0000</pubDate>
				<guid>https://samek.fyi/let-the-llm-try-first/</guid>
				<description>&lt;p&gt;I had a deck in Anki, a flashcards app, with an annoying problem that I had caused when I imported the cards. On some cards, the front of the card had text that linked to a URL. On others, the URL was only on the back, which meant I couldn&amp;rsquo;t click it when I actually wanted it.&lt;/p&gt;&#xA;&lt;p&gt;My first instinct was to export the deck to CSV, use AI to write a script to transform it, and then import it again. But in a moment of laziness and curiosity, I just asked Claude Code to fix it. I assumed Claude Code didn&amp;rsquo;t have access to the application, so it couldn&amp;rsquo;t fix it. I could have used Claude Code&amp;rsquo;s Chrome extension, since Anki has a web version, but I didn&amp;rsquo;t want to deal with the slowness of all that clicking.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Reflections on prototyping a sysadmin benchmark</title>
				<link>https://samek.fyi/reflections-on-prototyping-a-sysadmin-benchmark/</link>
				<pubDate>Fri, 13 Feb 2026 19:42:00 +0000</pubDate>
				<guid>https://samek.fyi/reflections-on-prototyping-a-sysadmin-benchmark/</guid>
				<description>&lt;p&gt;I&amp;rsquo;ve been keeping up with AI and experimenting with it a lot. It&amp;rsquo;s still a surprise to me that I can create a Chrome extension in minutes, having never written one before, and having rarely programmed in JavaScript.&lt;/p&gt;&#xA;&lt;p&gt;I wrote a Chrome extension to copy page content, and another to send the same prompt to multiple LLMs, then a CLI tool to do the same with Simon Willison&amp;rsquo;s &lt;a href=&#34;https://github.com/simonw/llm&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;llm CLI&lt;/a&gt;. Then a website to keep track of my BJJ notes. Then a harness to evaluate LLMs.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Benchmarks I&#39;m watching now</title>
				<link>https://samek.fyi/benchmarks-im-watching-now/</link>
				<pubDate>Mon, 20 Oct 2025 00:59:00 +0000</pubDate>
				<guid>https://samek.fyi/benchmarks-im-watching-now/</guid>
				<description>&lt;p&gt;Whenever a new model is released, I read the benchmarks. I haven&amp;rsquo;t seen a good list of benchmarks. I tend to see them referenced and then bookmark them.&lt;/p&gt;&#xA;&lt;p&gt;&lt;del&gt;Here&amp;rsquo;s my list.&lt;/del&gt; I keep a continually updated list of benchmarks at &lt;a href=&#34;https://samek.fyi/benchmarks/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;samek.fyi/benchmarks&lt;/a&gt;. I deleted the old list that used to live in this post so that I could link to the updated one instead.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Dia browser</title>
				<link>https://samek.fyi/dia-browser/</link>
				<pubDate>Fri, 13 Jun 2025 05:14:00 +0000</pubDate>
				<guid>https://samek.fyi/dia-browser/</guid>
				<description>&lt;p&gt;I tried &lt;a href=&#34;https://www.diabrowser.com/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Dia&lt;/a&gt;, a new Chromium-based browser with built-in AI features from the creators of &lt;a href=&#34;https://arc.net/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Arc&lt;/a&gt;. It has an AI chat sidebar that can reference any open tab, plus writing assistance that pulls context from multiple tabs. Their YouTube video shows someone writing an email in Gmail based on their Google Calendar tab.&lt;/p&gt;&#xA;&lt;p&gt;It&amp;rsquo;s a UX improvement over existing options. Brave&amp;rsquo;s Leo only works with the current tab and requires your own API key for the models I want to use. Firefox awkwardly opens ChatGPT or Claude&amp;rsquo;s actual UI and copy-pastes content. Chrome&amp;rsquo;s Gemini Flash integration is clean but single-tab only.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Coding agent with Simon Willison&#39;s llm</title>
				<link>https://samek.fyi/coding-agent-with-simon-willisons-llm/</link>
				<pubDate>Wed, 28 May 2025 01:46:00 +0000</pubDate>
				<guid>https://samek.fyi/coding-agent-with-simon-willisons-llm/</guid>
				<description>&lt;p&gt;Simon Willison &lt;a href=&#34;https://simonwillison.net/2025/May/27/llm-tools/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;released&lt;/a&gt; tool calling in his &lt;a href=&#34;https://llm.datasette.io/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;llm&lt;/a&gt; CLI. A tool is just a Python function. Inspired by &lt;a href=&#34;https://radanskoric.com/articles/coding-agent-in-ruby&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Coding agent in 94 lines of Ruby&lt;/a&gt;, I wrote a coding agent in Python. Here&amp;rsquo;s the &lt;a href=&#34;https://github.com/bsamek/llm-tools-editor/blob/c26911127fe16a7d475fcccc6d9c43ecc47ae7d4/llm_tools_editor.py&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;code&lt;/a&gt;.&lt;/p&gt;&#xA;&lt;p&gt;aider helpfully &lt;a href=&#34;https://aider.chat/docs/more/edit-formats.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;explains&lt;/a&gt; the various editing formats people use for LLMs:&lt;/p&gt;&#xA;&lt;ol&gt;&#xA;&lt;li&gt;&lt;strong&gt;whole&lt;/strong&gt; returns an entirely new document.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;diff&lt;/strong&gt; is a series of search and replace blocks.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;diff-fenced&lt;/strong&gt; is like diffs but puts the filename inside the fence. Gemini models prefer this.&lt;/li&gt;&#xA;&lt;li&gt;&lt;strong&gt;udiff&lt;/strong&gt; is a more familiar diff format, but only used for GPT-4 Turbo.&lt;/li&gt;&#xA;&lt;/ol&gt;&#xA;&lt;p&gt;So I picked diff.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Pareto frontier LLMs, Aider edition</title>
				<link>https://samek.fyi/pareto-frontier-llms-aider-edition/</link>
				<pubDate>Thu, 08 May 2025 04:42:00 +0000</pubDate>
				<guid>https://samek.fyi/pareto-frontier-llms-aider-edition/</guid>
				<description>&lt;p&gt;Google released an impressive improvement to Gemini, which now tops the &lt;a href=&#34;https://web.lmarena.ai/leaderboard&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;WebDev Arena Leaderboard&lt;/a&gt;. That position had been held by Claude for quite a while.&lt;/p&gt;&#xA;&lt;p&gt;Here&amp;rsquo;s a Pareto frontier for the &lt;a href=&#34;https://aider.chat/docs/leaderboards/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Aider polyglot coding leaderboard&lt;/a&gt;. The analysis of March&amp;rsquo;s Gemini &lt;a href=&#34;https://aider.chat/2025/05/07/gemini-cost.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;didn&amp;rsquo;t include&lt;/a&gt; reasoning tokens in the cost. They fixed the problem for May&amp;rsquo;s release.&lt;/p&gt;&#xA;&lt;p&gt;&lt;img src=&#34;https://samek.fyi/images/pareto-aider.webp&#34; alt=&#34;pareto-aider&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;It&amp;rsquo;s too bad it doesn&amp;rsquo;t include o4-mini (medium), especially because the new Gemini sits right between o4-mini (high) and o3-plus-GPT-4.1. It&amp;rsquo;s nice to see that the corrected pricing makes the Pareto frontier make sense again. Gemini 2.5 Pro wasn&amp;rsquo;t some wild outlier.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Pareto frontier LLMs, Kagi edition</title>
				<link>https://samek.fyi/pareto-frontier-models-kagi-edition/</link>
				<pubDate>Sat, 03 May 2025 17:58:00 +0000</pubDate>
				<guid>https://samek.fyi/pareto-frontier-models-kagi-edition/</guid>
				<description>&lt;p&gt;Kagi &lt;a href=&#34;https://help.kagi.com/kagi/ai/llm-benchmark.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;benchmarks&lt;/a&gt; LLMs on a private benchmark. I asked Claude to build a Pareto frontier chart for Kagi&amp;rsquo;s April 17th, 2025, update (&lt;a href=&#34;https://github.com/bsamek/pareto-llm/blob/main/pareto-kagi.py&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;source code&lt;/a&gt;).&lt;/p&gt;&#xA;&lt;p&gt;&lt;img src=&#34;https://samek.fyi/images/pareto-2.webp&#34; alt=&#34;pareto-2&#34;&gt;&lt;/p&gt;&#xA;&lt;p&gt;Claude has the strange property of doing well on some benchmarks and not others. It creates &lt;a href=&#34;https://eqbench.com/creative_writing_longform.html&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;half as much slop&lt;/a&gt; in longform writing as Gemini and OpenAI models, for example, and &lt;a href=&#34;https://web.lmarena.ai/leaderboard&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;does&lt;/a&gt; &lt;a href=&#34;https://livebench.ai/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;well&lt;/a&gt; in a number of coding benchmarks, while doing &lt;a href=&#34;https://scale.com/leaderboard/humanitys_last_exam_text_only&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;much worse&lt;/a&gt; than other frontier models in other benchmarks.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Model Context Protocol, simply</title>
				<link>https://samek.fyi/model-context-protocol-simply/</link>
				<pubDate>Thu, 01 May 2025 20:06:00 +0000</pubDate>
				<guid>https://samek.fyi/model-context-protocol-simply/</guid>
				<description>&lt;p&gt;I keep seeing descriptions of Model Context Protocol (MCP) that are either vague metaphors or get too specific too quickly.&lt;/p&gt;&#xA;&lt;p&gt;From the &lt;a href=&#34;https://modelcontextprotocol.io/introduction&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;official docs&lt;/a&gt;:&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.&lt;/p&gt;</description>
			</item>
			<item>
				<title>o3 is the research assistant I wanted</title>
				<link>https://samek.fyi/o3-is-the-research-assistant-i-wanted/</link>
				<pubDate>Wed, 30 Apr 2025 19:03:00 +0000</pubDate>
				<guid>https://samek.fyi/o3-is-the-research-assistant-i-wanted/</guid>
				<description>&lt;p&gt;Deep Research (three separate products! from Perplexity, OpenAI, and Gemini) is usually more than I want. Instead I want something that stops when it has the answer. Here&amp;rsquo;s a &lt;a href=&#34;https://chatgpt.com/share/680ba0f9-c7e0-800f-b4b4-165e375cb8c7&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;great example&lt;/a&gt; about predicting the next pope. It searched the web to find out what prediction markets said, aggregated and normalized the data, and gave me a chart of probabilities. Then it made a second chart introducing me to each candidate.&lt;/p&gt;</description>
			</item>
			<item>
				<title>ChatGPT&#39;s improved search</title>
				<link>https://samek.fyi/chatgpts-improved-search/</link>
				<pubDate>Mon, 28 Apr 2025 21:12:23 +0000</pubDate>
				<guid>https://samek.fyi/chatgpts-improved-search/</guid>
				<description>&lt;p&gt;A few days ago I &lt;a href=&#34;https://samek.fyi/perplexity-better-at-quick-searches/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;wrote&lt;/a&gt; that Perplexity is better for quick searches. That may have changed.&lt;/p&gt;&#xA;&lt;p&gt;OpenAI announced some search improvements, though sources were not among them, and Sam Altman &lt;a href=&#34;https://x.com/sama/status/1916625892123742290&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;mentioned&lt;/a&gt; improvements to the model. ChatGPT &lt;a href=&#34;https://chatgpt.com/share/680fe920-5014-800f-9a5e-04d54eacee28&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;now does a better job&lt;/a&gt; at my example query, explicitly noting that the two models have different behavior.&lt;/p&gt;&#xA;&lt;p&gt;I&amp;rsquo;m curious if this will be reflected on &lt;a href=&#34;https://beta.lmarena.ai/leaderboard/search&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;Search Arena&lt;/a&gt;.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Steam engine time</title>
				<link>https://samek.fyi/steam-engine-time/</link>
				<pubDate>Sun, 27 Apr 2025 21:27:00 +0000</pubDate>
				<guid>https://samek.fyi/steam-engine-time/</guid>
				<description>&lt;p&gt;o3 and Gemini 2.5 Pro being a single Elo point apart on LMArena with style control feels very &lt;em&gt;steam-engine time&lt;/em&gt;.&lt;/p&gt;&#xA;&lt;blockquote&gt;&#xA;&lt;p&gt;There’s an idea in the science-fiction community called steam-engine time, which is what people call it when suddenly twenty or thirty different writers produce stories about the same idea. It’s called steam-engine time because nobody knows why the steam engine happened when it did. Ptolemy demonstrated the mechanics of the steam engine, and there was nothing technically stopping the Romans from building big steam engines.&lt;/p&gt;</description>
			</item>
			<item>
				<title>Perplexity is still better at quick searches</title>
				<link>https://samek.fyi/perplexity-better-at-quick-searches/</link>
				<pubDate>Fri, 25 Apr 2025 02:13:00 +0000</pubDate>
				<guid>https://samek.fyi/perplexity-better-at-quick-searches/</guid>
				<description>&lt;p&gt;Simon Willison &lt;a href=&#34;https://simonwillison.net/2025/Apr/21/ai-assisted-search/&#34; target=&#34;_blank&#34; rel=&#34;noopener&#34;&gt;writes&lt;/a&gt; that &amp;ldquo;AI assisted search-based research actually works now.&amp;rdquo; o3 and o4-mini agentic research is truly impressive, as are ChatGPT&amp;rsquo;s and Gemini&amp;rsquo;s Deep Research.&lt;/p&gt;&#xA;&lt;p&gt;But I consistently find queries where Perplexity gets it right, and ChatGPT and Gemini get it wrong.&lt;/p&gt;&#xA;&lt;p&gt;An example: &amp;ldquo;how is chatgpt-4o-latest different from gpt-4o.&amp;rdquo; Only Perplexity correctly identifies that the model is priced differently than gpt-4o. Gemini 2.5 Pro focuses entirely on gpt-4o being stable, while chatgpt-4o-latest changes. ChatGPT does the same. I tried asking the question a few different ways, and Perplexity found &lt;em&gt;different&lt;/em&gt; sources to explain that the March release improved in instruction following, coding, and creativity. I think this demonstrates that Perplexity didn&amp;rsquo;t just get lucky with a source.&lt;/p&gt;</description>
			</item>
	</channel>
</rss>
