llms.txt & AI Crawlers: How to Get Cited by LLMs

Learn what llms.txt actually does, how to control GPTBot, ClaudeBot and PerplexityBot in robots.txt, and what really drives AI citations in 2025–2026.

AI crawlers now make up roughly 16.9% of all web traffic — and GPTBot requests alone grew 305% year-over-year as of May 2025 (Cloudflare). Yet most site owners still treat every AI bot the same way: either block all or allow all. That binary is costing them either IP protection or citation visibility — sometimes both.

This guide covers the llms.txt standard (what it actually does vs. what marketers claim it does), how to configure robots.txt for each AI crawler by lane, and what the evidence shows actually drives citations in ChatGPT, Perplexity, Claude, and Google AI Overviews.

Quick answer:

llms.txt is a Markdown file at your domain root that helps AI agents navigate your content during inference — it does not block training crawlers, and studies across 300,000+ domains show negligible citation impact for most sites. To control training access, use robots.txt directives per bot. To get cited, prioritise JSON-LD schema, fresh content with visible dates, and answer-first structure — these have measurable citation lift. llms.txt is low-cost experimental infrastructure, not a core tactic.

What is llms.txt?

The llms.txt standard was proposed by Jeremy Howard on 3 September 2024. It is a Markdown-formatted file placed at https://yourdomain.com/llms.txt that provides AI language models with a curated index of a site's most important pages (Grounding Page).

Three things it is not:

A training dataset blocker
An access control mechanism
An official IETF, W3C, or ISO standard

It is explicitly an inference-time convenience — when an AI agent fetches your site during a live query, llms.txt gives it a shortcut to your best content without crawling dozens of pages (Glasp, Digital Applied). Think of it as a table of contents for AI agents, not a gatekeeper.

The companion file llms-full.txt inlines the full Markdown content of all linked pages. Some agents fetch it twice as often as llms.txt (Webscraft), but Cloudflare's version runs to 3.7 million tokens — well beyond most model context windows.

Who has adopted it?

Adoption numbers vary wildly by measurement method:

SE Ranking (early 2026): 10.13% of ~300,000 analysed domains (DerivateX)
Rankability (June 2025): 0.3% of the top 1,000 most-visited sites globally (Rankability)
BuiltWith (October 2025): 844,473 live websites
Averi/Semrush (July 2025): only ~951 domains with proper files (Averi)

The discrepancy reflects auto-generated stubs from Yoast and Rank Math inflating raw counts. By comparison, Schema.org structured data runs on 45 million domains (Averi).

Heaviest adopters are technical documentation sites and SaaS products: Anthropic, Cloudflare, Vercel, Supabase, Stripe, NVIDIA, Coinbase, Zapier, Mintlify, Cursor. Marketing sites and local businesses have the lowest adoption (Sygnal, Webscraft).

What Google and industry experts say

John Mueller (Google) compared llms.txt to the old keywords meta tag and called building separate Markdown pages for bots "a stupid idea," noting that "none of the AI services have said they're using LLMs.TXT" (Search Engine Journal).

Gary Illyes (Google) confirmed at Search Central Live Asia Pacific 2025 that Google was not pursuing llms.txt (Search Engine Journal).

OtterlyAI removed the llms.txt checker from its GEO Audit product because the file showed marginal impact on AI citations (OtterlyAI).

That said, Google Lighthouse 13.3 added an "Agentic browsing audits" section that includes an llms.txt audit — signalling Google sees it as an agentic browsing tool for AI agents navigating sites, not a search ranking factor (Grounding Page).

AI crawler landscape 2025–2026

The major AI companies have split their crawling into separate bots for training, search indexing, and user-initiated fetches. Each has different robots.txt compliance behaviour. Understanding the distinction is critical: blocking a training bot does not block the search bot.

AI crawler reference table

User-Agent	Company	Purpose	Respects robots.txt	Recommended action
GPTBot	OpenAI	Training	Yes	Disallow
OAI-SearchBot	OpenAI	ChatGPT Search index	Yes	Allow
ChatGPT-User	OpenAI	User-initiated fetch	Exempted Dec 2024	Allow (no control)
ClaudeBot	Anthropic	Training	Yes	Disallow
Claude-SearchBot	Anthropic	Web answers	Yes	Allow
Claude-User	Anthropic	User-initiated fetch	Yes	Allow
PerplexityBot	Perplexity	Search index	Yes (stealth crawling reported)	Allow with caution
Perplexity-User	Perplexity	User fetch	Claims not required	Allow
Google-Extended	Google	Gemini training	Yes	Disallow
Googlebot	Google	Traditional search	Yes	Allow
CCBot	Common Crawl	Training	Yes	Disallow
Bytespider	ByteDance	Training	No — non-compliant	Block at WAF
Meta-WebIndexer	Meta	Training	No	Block at WAF
Amazonbot	Amazon	Alexa/AI	Yes	Disallow
Applebot	Apple	Siri/Search	Yes	Allow if desired

Sources: Soar Agency, Digital Applied, Anagram.ai

Traffic growth context

According to Cloudflare's May 2025 report (Cloudflare):

GPTBot requests: +305% year-over-year (7.7% of AI bot traffic)
ChatGPT-User requests: +2,825% year-over-year
PerplexityBot requests: +157,490% year-over-year
ClaudeBot: down 46% from its 2024 peak, but still 5.4% of AI crawler traffic
Overall AI crawling: +32% year-over-year

The Bytespider and stealth crawler problem

Not all bots play by the rules. HAProxy data showed nearly 90% of AI crawler traffic across their customer base came from Bytespider (ByteDance), much of it ignoring disallow directives (Soar Agency). Cloudflare published evidence in August 2025 showing Perplexity using undeclared crawlers that rotate user-agents, IPs, and ASNs to evade no-crawl directives (Soar Agency).

Meta-WebIndexer has been the most aggressive by volume in some log samples, with zero robots.txt checks (wislr.com). In April 2026, major publishers including NBCUniversal, CNN, and Vox Media sent a demand letter to Common Crawl seeking enforceable opt-out mechanisms (Digital Applied).

This means robots.txt alone is insufficient. For non-compliant bots, you need WAF rules or Cloudflare's AI Audit tool, which enforces blocking at the network edge before the origin server is reached.
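
For sites without a WAF, the same deny-list idea can be approximated at the origin. The sketch below is a minimal Python WSGI middleware, offered as an illustration rather than a substitute for edge enforcement. The names `BLOCKED_AGENT_TOKENS`, `block_ai_bots`, `demo_app`, and `status_for` are assumptions made for this example, and because user-agent strings can be spoofed, crawlers that rotate identities will still get through.

```python
from typing import Callable, Iterable

# Hypothetical deny list: user-agent substrings for the bots this article
# describes as ignoring robots.txt. Adjust it to your own log evidence.
BLOCKED_AGENT_TOKENS = ("bytespider", "meta-webindexer")


def block_ai_bots(app: Callable) -> Callable:
    """Wrap a WSGI app and answer 403 for deny-listed user agents."""

    def middleware(environ: dict, start_response: Callable) -> Iterable[bytes]:
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_AGENT_TOKENS):
            body = b"Forbidden"
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))],
            )
            return [body]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)

    return middleware


def demo_app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


def status_for(user_agent: str) -> str:
    """Send one fake request through the wrapped app and return its status line."""
    seen = []
    wrapped = block_ai_bots(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    return seen[0]
```

Matching on substrings keeps the list short, but it only stops bots that identify themselves honestly. Treat it as a stopgap until network-level rules are in place.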

Controlling AI access: the practical setup

robots.txt by lane

The key principle: list each bot separately. Blocking GPTBot does not block OAI-SearchBot.

# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /

# Allow search/retrieval bots
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
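
One way to sanity-check lane rules before deploying them is Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: it parses the rules shown above and reports which user agents may fetch a sample URL. The `example.com` URL and the `crawl_permissions` helper are assumptions for this example, and real crawlers may handle edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The two-lane rules from this section, inlined so the check runs offline.
ROBOTS_TXT = """\
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Amazonbot
Disallow: /

# Allow search/retrieval bots
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /
"""


def crawl_permissions(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return {agent: allowed} for one URL under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


PERMISSIONS = crawl_permissions(
    ROBOTS_TXT,
    ["GPTBot", "ClaudeBot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "https://example.com/blog/post",  # hypothetical URL
)
```

With these rules, the training bots come back as disallowed and the search bots as allowed. An agent with no matching group is allowed by default, because the file has no `User-agent: *` fallback.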

About 21% of the top 1,000 websites have explicit rules for GPTBot in their robots.txt (Paul Calvano). robots.txt is widely respected by compliant bots — but approximately 13.26% of AI bot requests now ignore it entirely (Heather Scott via LinkedIn).

The cost of blocking everything

A difference-in-differences analysis of large news publishers found that blocking LLM crawlers via robots.txt was followed by a 23.1% decline in total monthly visits (SimilarWeb data) and a 13.9% decline in human traffic (Comscore data), both statistically significant. By 2025, around 80% of top news publishers block LLM crawlers. Whether that tradeoff is right depends on your content and business model, but the data is worth knowing before deploying blanket blocks.

Network-level enforcement

Cloudflare AI Audit blocks requests at the edge, before they reach the origin server. It overrides robots.txt and can enforce per-bot policies for compliant and non-compliant bots alike. In July 2025, Cloudflare flipped its default for new domains to block AI crawlers, shifting roughly 20% of the public web from open to closed by default (Glasp). Over 1 million sites have enabled Cloudflare's AI blocking features.

Cloudflare Pay-per-Crawl (launched mid-2025, GA August 2025) allows sites to charge AI crawlers micro-fees via HTTP 402 — the first widely deployed mechanism for monetising AI access (Glasp, Digital Applied).

For verified bot identity, use reverse DNS + forward DNS: legitimate ClaudeBot IPs resolve to anthropic.com domains. OpenAI, Anthropic, and Perplexity all publish CIDR IP ranges.
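
The two checks mentioned above can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: `EXAMPLE_RANGES` uses a reserved documentation range rather than any vendor's real CIDR list, the hostname suffix is passed in as a parameter instead of being hard-coded, and the function names are invented for this example. The resolver functions are injectable so the logic can be exercised without network access.

```python
import ipaddress
import socket
from typing import Callable, Iterable

# Placeholder from the RFC 5737 documentation block, NOT a real crawler range.
# Replace it with the CIDR lists each vendor publishes.
EXAMPLE_RANGES = ("192.0.2.0/24",)


def ip_in_ranges(ip: str, cidrs: Iterable[str]) -> bool:
    """True if the address falls inside any of the given CIDR blocks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in cidrs)


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address to primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(hostname: str) -> list[str]:
    """Forward DNS: hostname to IPv4 addresses (gethostbyname_ex is IPv4-only)."""
    return socket.gethostbyname_ex(hostname)[2]


def forward_confirmed(
    ip: str,
    allowed_suffix: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], list[str]] = _forward,
) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm it resolves back."""
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Require a dot boundary so "evil-example.com" cannot pass for "example.com".
    if hostname != allowed_suffix and not hostname.endswith("." + allowed_suffix):
        return False
    try:
        return ip in forward(hostname)
    except OSError:
        return False
```

The forward lookup matters because anyone who controls reverse DNS for their own IP block can claim any hostname. Only the domain owner can make that hostname resolve back to the same address.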

Does llms.txt actually improve citations?

The honest answer is: for most sites, no.

SE Ranking (129,000 domains, 216,524 pages): Including llms.txt showed negligible impact on citation frequency; removing the feature improved predictive accuracy of their citation model (Search Engine Journal).

Trakkr (37,000+ domains, 337,000+ citations): Sites with llms.txt averaged 6.8 citations vs 6.7 without — p-value 0.85. Statistically indistinguishable from random.

OtterlyAI (90-day experiment): Only 84 requests out of 62,100 AI bot hits targeted /llms.txt, or roughly 0.14% of AI visits (OtterlyAI).

Wislr.com (48-day log analysis, 12,099 AI bot requests): Zero AI bots requested /llms.txt across the entire period (wislr.com).

ALLMO study (94,614 cited URLs, 11,867 AI responses): Only 1 cited URL was an llms.txt page (0.001%).

There are two exceptions worth noting. Deepak Gupta's GEO Measurement Study (50,431 citations, 90 days, 6 engines) found that llms.txt and llms-full.txt produced an 11% lift overall, but the gain was concentrated entirely on Claude and Perplexity. ChatGPT Search and Google AI Overviews showed no measurable change (Deepak Gupta). One agency experiment (dev5310, February 2026) saw Google AI Mode cite its llms.txt as the #1 source for 4 queries within 24 hours of submission to Search Console, but it was a single-site, narrow-scope test (dev5310).

Bottom line: Implement llms.txt as low-cost experimental infrastructure if you run a documentation-heavy or developer-facing site. Keep it under 3,000 tokens, and include a blockquote summary plus H2 sections that link to Markdown versions of key pages. Do not treat it as a primary GEO tactic.
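
The log-based studies above are easy to repeat on your own server. The sketch below is a rough Python illustration that assumes access logs in the common "combined" format. The `AI_BOT_TOKENS` list, the `llms_txt_share` helper, and the sample log lines are all made up for this example and are not data from any of the cited studies.

```python
import re
from collections import Counter

# User-agent substrings to count, taken from the crawler table in this article.
AI_BOT_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# Matches the request, status, size, referrer, and user-agent fields of a
# combined-format log line.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def llms_txt_share(log_lines):
    """Count AI bot requests overall and those that target /llms.txt."""
    totals, llms_hits = Counter(), Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        agent = match["agent"].lower()
        bot = next((t for t in AI_BOT_TOKENS if t.lower() in agent), None)
        if bot is None:
            continue  # not one of the tracked AI crawlers
        totals[bot] += 1
        if match["path"].split("?", 1)[0] == "/llms.txt":
            llms_hits[bot] += 1
    return totals, llms_hits


# Fabricated sample lines, for illustration only.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Mar/2026:10:00:00 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.6 - - [01/Mar/2026:10:00:05 +0000] "GET /blog/post HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.7 - - [01/Mar/2026:10:00:09 +0000] "GET /pricing HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '203.0.113.5 - - [01/Mar/2026:10:01:00 +0000] "GET /docs HTTP/1.1" 200 100 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
]

TOTALS, LLMS_HITS = llms_txt_share(SAMPLE_LOG)
```

Dividing each bot's `/llms.txt` count by its total gives the same share the published log analyses report. A few weeks of real logs will show whether the file is being fetched at all.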

What actually drives AI citations

If llms.txt is not the lever, what is? The evidence points clearly at a set of technical and content signals.

1. JSON-LD schema markup

This is the highest-confidence citation driver in the research:

Princeton GEO research: content with clear structural signals saw up to 40% higher visibility in AI-generated responses (SurferSEO)
GPT-5 accuracy improved from 16% to 54% when content used structured data, roughly a 3.4x improvement (Averi)
Pages with valid structured data are 2.3x more likely to appear in Google AI Overviews (Duane Forrester)
Deep sameAs arrays in Person/Organization JSON-LD (at least 8 entries: Wikidata, Crunchbase, LinkedIn, GitHub, etc.) drive +34% across all engines, +52% on Gemini and Bing (Deepak Gupta)
Sharp HealthCare: 843% increase in clicks from AI-generated search features within 9 months using comprehensive schema (SurferSEO)

See the Structured Data & Schema Markup guide for implementation detail.

One caveat from Mark Williams-Cook: LLM tokenisation can destroy schema structure, so that "@type": "Organization" becomes separate tokens for type and Organization. Publish schema as JSON-LD inside a <script type="application/ld+json"> tag rather than as inline attributes on HTML elements, and validate it with Google's Rich Results Test.
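
As a rough illustration of the script-tag approach, the Python sketch below serialises an Organization object with a `sameAs` array into a JSON-LD block. The company name and every URL are placeholders, `json_ld_script` is a helper invented for this example, and the `sameAs` list is shortened for space (the study cited above suggests at least eight entries).

```python
import json


def json_ld_script(data: dict) -> str:
    """Serialise a schema.org object into a JSON-LD <script> element."""
    payload = json.dumps(data, indent=2, ensure_ascii=False)
    # Escape "</" so a string value cannot close the script element early.
    # "\/" is a valid JSON escape for "/", so the data is unchanged.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder entity: swap in your real profiles. All URLs are hypothetical.
ORGANIZATION = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Co",
    "url": "https://example.com",
    "sameAs": [
        "https://www.wikidata.org/wiki/Q0",
        "https://www.crunchbase.com/organization/example-co",
        "https://www.linkedin.com/company/example-co",
        "https://github.com/example-co",
    ],
}

SNIPPET = json_ld_script(ORGANIZATION)
```

Generating the block from a data structure, rather than hand-editing JSON in templates, avoids the syntax errors that validators most often flag.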

2. Content freshness and dating

Explicit dateModified discipline (visible "Published" and "Last updated" dates + matching JSON-LD): +22% overall, +41% on Claude, +18% on Perplexity (Deepak Gupta)
Content labeled "updated two hours ago" cited 38% more often than month-old content (SurferSEO)
AI systems favour content that is 25.7% fresher than URLs in organic search (Ahrefs, via SurferSEO)
Freshness accounts for roughly 40% of Perplexity's ranking factors (SurferSEO)
Claude is the strictest — it penalises content without visible dateModified within the last 6 months
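
One way to keep visible dates and JSON-LD in step is to render both from the same values. The Python sketch below shows the idea under a few assumptions: the `article_dates` and `is_fresh` helpers are invented for this example, the visible date format is arbitrary, and the 183-day threshold is only an approximation of the six-month window mentioned above.

```python
from datetime import date, timedelta


def article_dates(published: date, modified: date) -> tuple[dict, str]:
    """Build matching JSON-LD date fields and a visible byline from one source."""
    if modified < published:
        raise ValueError("dateModified cannot be earlier than datePublished")
    json_ld = {
        "@context": "https://schema.org",
        "@type": "Article",
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    # Month names follow the process locale (English under the default C locale).
    visible = f"Published {published:%d %B %Y} | Last updated {modified:%d %B %Y}"
    return json_ld, visible


def is_fresh(modified: date, today: date, max_age_days: int = 183) -> bool:
    """True if the page was modified within roughly the last six months."""
    return timedelta(0) <= today - modified <= timedelta(days=max_age_days)


JSON_LD, BYLINE = article_dates(date(2025, 1, 15), date(2026, 2, 3))
```

Because both outputs come from the same two `date` objects, the markup and the on-page text cannot drift apart. A scheduled `is_fresh` check can then flag pages that are due for review.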

3. Content structure

Answer-first structure: 67% more citations than pages that build to the answer (PixelMojo)
Articles over 2,900 words: average 5.1 citations vs 3.2 for under 800 words (SE Ranking)
120–180 words between headings: 70% more citations than sections under 50 words (SE Ranking)
Proper H1/H2/H3 nesting: a sequential heading hierarchy makes a page 2.8x more likely to be cited (Incremys)
Chunk-level structure (H2/H3 followed by single-sentence "claim" then supporting paragraphs): +18% lift on definitional and implementation queries
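
Two of these signals, heading nesting and section length, can be checked mechanically. The Python sketch below is a simplified illustration that assumes the page source is Markdown with `#`-style headings. The `audit_structure` helper is invented for this example, it does not skip `#` lines inside code blocks, and sections with identical titles would share one counter.

```python
import re

# ATX-style Markdown heading: one to six "#" characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def audit_structure(markdown: str) -> tuple[list[str], dict[str, int]]:
    """Return headings that skip a level, plus the word count under each heading."""
    skipped: list[str] = []
    word_counts: dict[str, int] = {}
    previous_level = 0
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match[1]), match[2].strip()
            # Jumping from H2 straight to H4 (for example) breaks the hierarchy.
            if previous_level and level > previous_level + 1:
                skipped.append(title)
            previous_level = level
            current = title
            word_counts[current] = 0
        elif current is not None:
            word_counts[current] += len(line.split())
    return skipped, word_counts


SAMPLE_PAGE = """\
# Guide
Intro words here.
## Setup
Install the package first.
#### Deep detail
Skipped a level.
"""

SKIPPED, WORD_COUNTS = audit_structure(SAMPLE_PAGE)
```

Comparing each section's word count against the 120 to 180 word range cited above gives a quick list of sections to expand or split.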

4. Statistics, quotes, and methodology

Pages with 19+ statistics: 5.4 citations vs 2.8 with minimal data (SE Ranking)
Expert quotes: 4.1 citations vs 2.4 without (SE Ranking)
Adding statistics (Princeton/Georgia Tech GEO study): +40% citations; adding direct quotations: +35% citations (SurferSEO)
Methodology pages: +9% overall, +24% on buyer-intent queries (Deepak Gupta)

5. Domain authority and trust signals

Over 32K referring domains: 3.5x more likely to be cited (SE Ranking)
Domain Trust > 90: almost 4x more citations than DT < 43 (SE Ranking)
Over 190K monthly visitors: nearly twice as many citations (SE Ranking)
Millions of brand mentions on Quora/Reddit: roughly 4x higher citation chances (SE Ranking)

6. Page speed

First Contentful Paint (FCP) under 0.4 seconds: average 6.7 citations vs 2.1 for pages slower than 1.13s, roughly 3x as many (SE Ranking)

What does not work

Generic FAQ schema — now actively devalued by Claude, Gemini, and ChatGPT when used without genuine Q&A
Gated content — gated whitepaper landing pages received 0.03% of all AI citations vs 100% of ungated equivalents in Deepak Gupta's study
Keyword density — zero correlation
Backlinks within the retrieved set — minimal effect on citation share once already retrieved

Citation patterns by AI engine

Different engines cite differently. Knowing where your content fits changes which optimisations to prioritise.

Engine	Share of citations	Key behaviour
ChatGPT Search	36%	Most generous on definitional/implementation queries; concentrates on top 3 sources
Perplexity	26%	Cites 6–12 sources per answer; easiest to enter, hardest to dominate
Google AI Overviews	17%	Heavy tilt toward properties already ranking in Google top 5
Claude (web search)	11%	Strictest about freshness; penalises content without visible dateModified within 6 months
Gemini	7%	Entity authority matters most; relies on knowledge graph lookups
Bing Copilot	3.6%	Also heavily relies on knowledge graph

Source: Deepak Gupta GEO Measurement Study (50,431 citations, 90 days)

Notable citation patterns:

67.82% of cited sources in Google AI Overviews do not rank in Google's top 10 for the same query (SurferSEO) — AI citation is not just SEO rank
86% of AI citations come from brand-controlled sources (Yext analysis of 6.8M citations) (SurferSEO)
46.7% of Perplexity's top citations come from Reddit (SurferSEO)
YouTube is cited 200× more than any other video platform; ~14% of Perplexity citations from YouTube (SurferSEO)
AI referral traffic reached 1.13 billion visits in June 2025 — a 357% year-over-year increase, with ChatGPT accounting for 78% (PixelMojo)
ChatGPT traffic converts at 15.9% vs 1.76% for Google organic (Seer Interactive, via PixelMojo)

For a full breakdown of answer engine optimisation tactics, see the Answer Engine Optimisation guide.

What llms.txt should look like if you implement it

If you decide to implement it, do it properly:

# Company Name

> One to three sentence summary of what the site covers and who it is for. 
> Keep this under 150 words. AI agents use this to decide whether to read further.

## Core Documentation

- [Getting Started Guide](https://example.com/docs/getting-started.md): Introduction and setup
- [API Reference](https://example.com/docs/api.md): Full endpoint documentation
- [Changelog](https://example.com/docs/changelog.md): Version history

## Key Articles

- [How X Works](https://example.com/blog/how-x-works.md): Technical deep-dive
- [Pricing FAQ](https://example.com/pricing.md): Plan comparison and billing

## Optional

- [llms-full.txt](https://example.com/llms-full.txt): Full content of all pages above

Keep it under 3,000 tokens. Link to .md versions of pages where possible — this reduces token consumption by 50–70% compared to HTML (Webscraft). Review every 3–6 months as the bot landscape changes.
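
The structural advice above can be turned into a quick pre-publish check. The Python sketch below is a rough illustration, not an official validator: `check_llms_txt` is a helper invented for this example, and the token count is estimated with a crude four-characters-per-token heuristic. For an exact figure, use the tokeniser of the model you care about.

```python
import re

# A Markdown list item that starts with a link: "- [Title](https://...)".
LINK_ITEM = re.compile(r"^- \[[^\]]+\]\((https?://[^)\s]+)\)")


def check_llms_txt(text: str, max_tokens: int = 3000) -> list[str]:
    """Return a list of problems found; an empty list means the basics are in place."""
    lines = [line.rstrip() for line in text.splitlines()]
    non_empty = [line for line in lines if line]
    problems = []
    if not non_empty or not non_empty[0].startswith("# "):
        problems.append("first line should be an H1 title")
    if not any(line.startswith("> ") for line in lines):
        problems.append("missing blockquote summary")
    if not any(line.startswith("## ") for line in lines):
        problems.append("no H2 sections")
    if not any(LINK_ITEM.match(line) for line in lines):
        problems.append("no link list items")
    estimated_tokens = len(text) // 4  # assumption: about 4 characters per token
    if estimated_tokens > max_tokens:
        problems.append(f"about {estimated_tokens} tokens, over the {max_tokens} budget")
    return problems


SAMPLE = """\
# Company Name

> One to three sentence summary of what the site covers.

## Core Documentation

- [Getting Started Guide](https://example.com/docs/getting-started.md): Introduction and setup
"""

PROBLEMS = check_llms_txt(SAMPLE)
```

Running this in CI alongside the periodic review keeps auto-generated stubs and oversized files from shipping unnoticed.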

The future: MCP, pay-per-crawl, and what comes next

Three shifts are worth tracking:

Model Context Protocol (MCP): Introduced by Anthropic in late 2024, now adopted by OpenAI, Google DeepMind, and the Linux Foundation — with 97 million monthly SDK downloads by 2026 (Duane Forrester). MCP allows agents to interact with sites (write/execute actions), while llms.txt provides "read" discovery. LangChain's mcpdoc already fetches llms.txt files to give agents a fetch_docs tool. This is the infrastructure layer that makes llms.txt relevant long-term — not as an SEO signal, but as a machine-readable entry point for agentic tasks.

Cloudflare Pay-per-Crawl (HTTP 402): Launched GA August 2025 — the first widely deployed mechanism for charging AI crawlers micro-fees for content access. This is a structural shift from the open-by-default web to a permission-and-payment model (Glasp, Digital Applied).

IETF AIPREF Working Group: Drafting a proper machine-readable standard for AI training and usage preferences — expected to absorb ai.txt and formalise the licensing layer (Glasp). When this arrives, it will likely supersede the current robots.txt extension approach.

Zero-click search is now nearly 60% of all queries, up 2.5× since AI Overviews launched (Incremys). Semrush data predicts AI search visitors will overtake traditional search by 2028. For most sites, the window to build AI citation authority through content quality and structured data is now — before the permission and payment layers lock in.

Frequently asked questions

Does llms.txt stop AI companies from training on my content?

No. llms.txt has no effect on training crawlers. It is an inference-time guidance file only. To opt out of training, use robots.txt directives targeting specific training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) and WAF rules for non-compliant bots like Bytespider that ignore robots.txt (Digital Applied, Soar Agency).

If I block GPTBot, will I stop appearing in ChatGPT answers?

Blocking GPTBot (the training crawler) does not block OAI-SearchBot (ChatGPT Search indexing) or ChatGPT-User (real-time user fetches). To remain visible in ChatGPT Search, explicitly allow OAI-SearchBot in your robots.txt. Note that ChatGPT-User was exempted from robots.txt compliance by OpenAI in December 2024, so it will reach your site regardless of directives (Soar Agency).

What is the single highest-impact thing I can do to get cited by AI?

Implement JSON-LD schema markup with deep entity connections. Princeton GEO research found up to 40% higher AI visibility from structural signals (SurferSEO), and GPT-5 accuracy improved from 16% to 54% with structured data (Averi). Pair this with visible dateModified dates in both the HTML and your JSON-LD, and write in answer-first structure. These three changes have more evidence behind them than any other tactic.

Is it safe to allow all AI crawlers?

The "allow all" approach maximises citation visibility but means your content is used for training by default. The Crawl-to-Referral Ratio is a useful framing: Anthropic's ClaudeBot (at its June 2025 peak) crawled approximately 70,900 pages for every referral it sent back, a ratio that makes unrestricted training crawls a poor economic deal for most publishers (Digital Applied). The recommended approach is to block training bots and explicitly allow search/retrieval bots, as in the two-lane robots.txt setup described in this article.

Will llms.txt become an official standard?

It is currently a community-driven convention with no IETF, W3C, or ISO backing. The IETF AIPREF Working Group is drafting a proper machine-readable standard for AI training and usage preferences that may absorb some of llms.txt's intent — but in a more formal and enforceable form. Google has confirmed it has no plans to support llms.txt as a search signal (GetPublii, Search Engine Journal).

Originally published in the EcomExperts SEO library.