ZB Field Notes

llms.txt: robots.txt for the age of AI agents

Your site has a new kind of reader

For twenty years, the two readers that mattered for a public site were people and search-engine crawlers. We learned to serve both: HTML for humans, robots.txt and sitemap.xml for Googlebot.

There is a third reader now. When someone asks an AI assistant “what has Zakaria written about Kafka?”, an agent may go and read my site in real time to answer. And that reader has very different tastes from a browser. It does not want my nav bar, my web fonts, my client-side routing, or three kilobytes of analytics script. It wants the words — cheaply, in tokens.

llms.txt is the small convention that gives it exactly that. I added it to this site over the weekend; here is what it is, why it is suddenly worth caring about, and how I wired it into the Spring Boot app that serves this blog.

What llms.txt actually is

It is a Markdown file at a well-known path — /llms.txt at your domain root — proposed by Jeremy Howard of Answer.AI in 2024. Think robots.txt, but instead of telling crawlers where not to go, it hands LLMs a clean, curated index of what is worth reading.

The format is deliberately boring: an H1 title, a one-line summary in a blockquote, then H2 sections of links with short descriptions.

# FastMCP

> The fast, Pythonic way to build MCP servers.

## Docs
- [Quickstart](https://gofastmcp.com/getting-started/quickstart.md)
- [CLI](https://gofastmcp.com/cli/overview.md): the fastmcp command-line interface

That is a real one — gofastmcp.com/llms.txt. There is an optional companion, llms-full.txt, that inlines the entire documentation as one big Markdown blob you can drop straight into a context window.
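Because the format is that regular, a consumer needs very little code to read it. Here is a minimal sketch of how an agent-side tool might pull the link bullets out of an llms.txt file. The class name LlmsTxtLinks, the Link record, and the regex are my own assumptions for illustration, not part of the proposal or of any particular agent's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical reader for the link bullets of an llms.txt file. */
final class LlmsTxtLinks {

    /** One "- [title](url): description" bullet; description is "" when absent. */
    record Link(String section, String title, String url, String description) {}

    // Matches "- [title](url)" with an optional ": description" tail.
    private static final Pattern BULLET =
            Pattern.compile("^\\s*[-*]\\s*\\[([^\\]]+)\\]\\(([^)\\s]+)\\)(?::\\s*(.*))?\\s*$");

    static List<Link> parse(String markdown) {
        List<Link> links = new ArrayList<>();
        String section = "";
        for (String line : markdown.split("\\R")) {
            if (line.startsWith("## ")) {
                section = line.substring(3).trim(); // remember the current H2 section
                continue;
            }
            Matcher m = BULLET.matcher(line);
            if (m.matches()) {
                String description = m.group(3) == null ? "" : m.group(3).trim();
                links.add(new Link(section, m.group(1), m.group(2), description));
            }
        }
        return links;
    }
}
```

Fed the FastMCP excerpt above, parse returns two links under the "Docs" section, the second carrying its description. Titles that contain escaped brackets would need a more careful pattern than this one.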

Two things to be clear about. It is a convention, not a spec with teeth: like robots.txt, tools honour it by goodwill, not by force. And it is pulled at inference time, not baked into training, so it keeps what a model can say about your site current long after its training cut-off.

Why it matters more now

A few reasons it crossed the line from “cute idea” to “worth ten minutes of my evening”:

  • HTML is expensive in tokens. Rendered pages are mostly chrome. Markdown is signal. An agent reading llms.txt spends its budget on your content, not your cookie banner.

  • Gated and JS-heavy pages are invisible. My CV dossier sits behind a LinkedIn login and renders client-side. A crawler — or an agent — sees nothing useful. A static llms.txt lets me say “the dossier is private, but the blog is here” without exposing anything.

  • You control the framing. Rather than hope an agent reconstructs what you do from scraped fragments, you hand it the summary you would want it to repeat.

Wiring it into a Spring Boot site

My setup is a little unusual and made for a fun constraint. One jar serves two hostnames: the gated single-page app on the apex (zakaria.lu), and this server-rendered, crawlable blog on blog.zakaria.lu. A host-dispatch filter decides which is which.

The blog index is just a controller method. It pulls the published posts I already query for the sitemap and renders one bullet each — no new data, no caching, always current:

@GetMapping(value = "/blog/llms.txt", produces = "text/plain;charset=UTF-8")
public String llmsText() {
    var base = properties.normalizedBlogBaseUrl();
    var md = new StringBuilder(LLMS_HEADER);
    for (var post : blogService.allPublished()) {
        md.append("- [").append(oneLine(post.getTitle())).append("](")
          .append(base).append("/").append(post.getSlug()).append(")");
        // ... append the excerpt as the link description
        md.append('\n'); // end the bullet so each post sits on its own line
    }
    return md.toString();
}

One path, two files

Here is the part I liked. The apex deserves its own llms.txt too — it is the domain an agent tries first. But it should not list blog posts; it should introduce me and point at the blog. So /llms.txt resolves to two different controllers depending on host:

  • On blog.zakaria.lu, the host filter forwards /llms.txt to /blog/llms.txt: the post index.

  • On zakaria.lu, it hits a separate controller that serves a short front door: who I am, a note that the dossier is gated, and a link to the blog's machine-readable index.

Neither controller contains a single if (host == ...). The filter discriminates once, up front, and each handler stays blissfully host-agnostic.
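The routing decision itself can be sketched as a pure function, which keeps it easy to test apart from the servlet plumbing. This is an illustration under assumptions, not the filter that runs on this site: the class name HostDispatch, the method names, and the port-stripping are hypothetical; only the two hostnames and the /llms.txt to /blog/llms.txt forward come from the description above.

```java
import java.util.Locale;

/** Hypothetical routing decision for a host-dispatch filter. */
final class HostDispatch {

    static final String BLOG_HOST = "blog.zakaria.lu";

    /**
     * Returns the internal path a request should be forwarded to,
     * or the original path when no rewrite applies.
     */
    static String internalPath(String hostHeader, String path) {
        String host = normalizeHost(hostHeader);
        if (BLOG_HOST.equals(host) && "/llms.txt".equals(path)) {
            return "/blog/llms.txt"; // blog host: forward to the post index
        }
        return path; // any other host: leave the path for its own controller
    }

    /** Lower-cases the Host value and strips an optional ":port" suffix (hostnames only, no IPv6 literals). */
    static String normalizeHost(String hostHeader) {
        if (hostHeader == null) {
            return "";
        }
        String host = hostHeader.trim().toLowerCase(Locale.ROOT);
        int colon = host.lastIndexOf(':');
        return colon >= 0 ? host.substring(0, colon) : host;
    }
}
```

In a real servlet filter, the result would presumably feed request.getRequestDispatcher(target).forward(request, response) whenever it differs from the incoming path. The controllers never see the host at all.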

The small gotchas

  • Serve UTF-8 explicitly. Excerpts carry em dashes and ellipses; if the response does not declare a charset, a client may decode it as something other than UTF-8 and mangle them. produces = "text/plain;charset=UTF-8" fixes it.

  • Do not let content break the Markdown. A stray ] in a post title closes the link early; a newline in an excerpt splits the bullet. A one-line flatten-and-escape helper handles both.

  • The apex 404’d at first — correctly. The SPA uses hash routing, so the server never serves arbitrary paths and /llms.txt matched nothing. That 404 was not a bug to fix; it was the app telling me I had not decided what the apex should say yet.
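For the second gotcha, a flatten-and-escape helper can be as small as the sketch below. It is a guess at the shape of such a helper rather than the exact oneLine used in the controller above: the class name MarkdownText and the choice to backslash-escape brackets (instead of stripping them) are assumptions.

```java
/** Hypothetical flatten-and-escape helper for text placed inside a Markdown link bullet. */
final class MarkdownText {

    static String oneLine(String text) {
        if (text == null) {
            return "";
        }
        return text
                .replaceAll("\\s+", " ")  // newlines, tabs, and runs of spaces become one space
                .strip()
                .replace("\\", "\\\\")    // escape backslashes first so later escapes stay intact
                .replace("[", "\\[")      // a stray bracket must not open or close the link text
                .replace("]", "\\]");
    }
}
```

With this, a title such as "Arrays [part 2]" followed by a line break and more text comes out as a single line with \[ and \] in place of the raw brackets, so the bullet and its link survive intact.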

Is it actually worth it?

Honestly? The payoff is speculative. No major assistant guarantees it reads llms.txt, and adoption is uneven. But the cost was an evening and two tiny controllers, it cannot hurt, and it is the kind of cheap insurance that ages well: if agents-reading-sites becomes normal, I am already legible to them; if it does not, I have lost nothing but a few lines of Java.

I kept mine to an index for now and skipped llms-full.txt — inlining full posts means an HTML-to-Markdown step I did not want to own yet. That is the next experiment. For a convention this cheap, “start small and see who shows up” feels exactly right.