Discoverability & Structure

Which AI crawlers should you allow? A policy that actually means something

Not all AI crawlers do the same job, so "allow everything" and "block everything" are both wrong. Welcome the crawlers that put you in live AI answers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended), because they drive citations and visibility. Decide separately on training-only crawlers like CCBot (Common Crawl) and Applebot-Extended: they feed model training but don't surface you in answers. And here's the catch most sites miss: a Content-Signal of ai-train=no is only a stated preference, and most training crawlers never read it. To actually keep your content out of training you must Disallow those bots in robots.txt. The fix is to make your declared policy and your access rules agree.

By Paul Masterson · Published 2026-06-10 · Last reviewed 2026-06-11 · 6 min read

The two questions you're actually answering

When people talk about "AI crawler policy" they blur two separate questions that need separate answers:

Access: can this crawler fetch my pages at all? This is what robots.txt Allow and Disallow control. For any crawler that honours robots.txt, it's a gate.
Usage: what may it do with what it fetches? This is what a Content-Signal declares: search=yes, ai-input=yes, ai-train=no. It's a stated preference.

Most sites pick one lever, ignore the other, and end up with a policy that quietly contradicts itself. Getting this right takes about ten minutes, yet almost everyone gets it wrong.

Most policies are accidentally inconsistent

There are three common failure modes:

Block everything. A blanket Disallow: / for AI bots feels safe, but it also removes you from ChatGPT, Claude, Perplexity and Google's AI answers, the exact places your buyers are now asking questions. You've protected content that no AI answer will ever cite.
Allow everything by silence. You address GPTBot and call it done, leaving a dozen other crawlers to fall through to the wildcard. Your policy is whatever each vendor defaults to — not a decision you made.
Say one thing, do another. You publish Content-Signal: ai-train=no and then let Common Crawl walk the whole site. Your stated policy and your actual access rules disagree.

That last one is the subtle killer, and it deserves its own section.
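The second failure mode, allowing by silence, is easy to spot mechanically. Here is a minimal Python sketch, assuming you already have the robots.txt text in hand: it lists which crawlers on a watchlist have no group of their own and therefore fall through to the wildcard. The function names, the sample file, and the watchlist are illustrative assumptions, and the matching is a plain exact-name comparison rather than a full robots.txt parser.

```python
def named_agents(robots_txt: str) -> set[str]:
    """Return the lowercased User-agent tokens that appear in the file."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def falls_through(robots_txt: str, crawlers: list[str]) -> list[str]:
    """List the crawlers that have no explicit group (wildcard only)."""
    named = named_agents(robots_txt)
    return [bot for bot in crawlers if bot.lower() not in named]


# Hypothetical robots.txt that only addresses GPTBot.
EXAMPLE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /
"""

# Crawler names taken from this article; extend the list for your own audit.
WATCHLIST = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]
```

Calling falls_through(EXAMPLE, WATCHLIST) returns every watchlist name except GPTBot, which is the set of crawlers for which no explicit decision was made.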

Not all AI crawlers are the same

Lumping every bot into "AI crawlers" is the root mistake. There are really three buckets, and they earn very different treatment:

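Answer crawlers: GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended. These power live answers and citations. When one of them reads you, you can show up when someone asks an AI about your topic. This is the visibility you're chasing. Welcome them. One nuance: Google-Extended is a robots.txt control token rather than a separate crawler. Google fetches pages with its existing user agents and consults the token to decide whether the content may be used for Gemini.
Training-only crawlers: CCBot (Common Crawl) and Applebot-Extended. These feed model training corpora. They don't put you in an answer tomorrow; they fold your words into a model months from now. There's a real decision here, and it's about how you feel about your content being training data. Applebot-Extended is likewise a control token: it does not crawl pages itself, and a Disallow for it tells Apple not to use what Applebot fetched for model training.
Abusive crawlers: Bytespider and friends. Aggressive, little upside, happy to burn your bandwidth. Block them and move on.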

The trap: a preference isn't a rule

Here's the part almost every "AI-ready" checklist gets wrong.

A Content-Signal is a declaration. ai-train=no says "please don't train on this." Well-behaved, signal-aware crawlers may honour it. But Common Crawl and most training crawlers don't read Content-Signal at all — they fetch, archive, and the content ends up in training sets regardless of what you declared.

So if your actual goal is to keep your content out of training, a Content-Signal alone is a sign on an unlocked door. The lever that training-only crawlers such as CCBot actually act on is a Disallow in robots.txt. To be consistent, your declaration and your access rules have to point the same way:

User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

Now the policy means something: you welcome the crawlers that drive answers, and you enforce — not just request — the training stance you publish.
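To check that a declaration and the access rules agree, something like the following Python sketch could be run against a robots.txt file. It reads the ai-train value from the Content-Signal line and, if that value is no, reports any training-only crawler the rules would still let in. The helper names and the TRAINING_ONLY list are assumptions made for illustration. The standard library's urllib.robotparser skips lines it does not recognise, so the sketch pulls the Content-Signal line out with a regular expression.

```python
import re
from urllib.robotparser import RobotFileParser

# Training-only crawlers as named in this article; adjust to your own list.
TRAINING_ONLY = ["CCBot", "Applebot-Extended"]


def content_signals(robots_txt: str) -> dict[str, str]:
    """Collect key=value pairs from the first Content-Signal line found."""
    match = re.search(r"^\s*content-signal\s*:\s*(.+)$", robots_txt,
                      flags=re.IGNORECASE | re.MULTILINE)
    if not match:
        return {}
    value = match.group(1).split("#", 1)[0]  # drop a trailing comment
    pairs = (item.split("=", 1) for item in value.split(",") if "=" in item)
    return {key.strip().lower(): val.strip().lower() for key, val in pairs}


def contradictions(robots_txt: str) -> list[str]:
    """Return training-only crawlers that may still fetch "/" despite ai-train=no."""
    if content_signals(robots_txt).get("ai-train") != "no":
        return []  # no training reservation declared, nothing to enforce
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in TRAINING_ONLY if parser.can_fetch(bot, "/")]
```

An empty list means the declared stance and the access rules point the same way. Any name in the result is a crawler the file asks not to train while still letting it in.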

The honest caveat

There's no perfectly clean line. GPTBot and ClaudeBot also contribute to training, yet we allow them — because they're the same crawlers that surface you in live answers, and you can't have the visibility without them. So the pragmatic, defensible rule is:

Allow crawlers that drive live AI answers, even though they may also train. Block crawlers whose only job is training, and block abusive ones.

That's a trade-off, not a loophole, and it's worth being honest with yourself about it. If reserving training matters more to you than appearing in AI answers, tighten further. If answers matter most, this is the sensible middle.
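One way to keep that rule honest is to hold the buckets as data and generate the robots.txt groups from them, so the training stance is a single switch. The Python sketch below is an illustration under assumptions: the BUCKETS mapping, the build_robots function, and the reserve_training parameter are hypothetical names, and the explicit Allow: / in the wildcard group is a choice made by this sketch.

```python
# Buckets as described in this article; membership is illustrative.
BUCKETS = {
    "answer": ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot",
               "Google-Extended"],
    "training_only": ["CCBot", "Applebot-Extended"],
    "abusive": ["Bytespider"],
}


def build_robots(buckets: dict[str, list[str]], reserve_training: bool = True) -> str:
    """Render robots.txt text: allow answer crawlers, block abusive ones,
    and block training-only crawlers only when training is reserved."""
    train = "no" if reserve_training else "yes"
    groups = [
        "User-agent: *\n"
        f"Content-Signal: search=yes, ai-input=yes, ai-train={train}\n"
        "Allow: /"
    ]
    for bot in buckets["answer"]:
        groups.append(f"User-agent: {bot}\nAllow: /")
    # The declared signal and the access rule are derived from the same flag,
    # so they cannot drift apart.
    training_rule = "Disallow: /" if reserve_training else "Allow: /"
    for bot in buckets["training_only"]:
        groups.append(f"User-agent: {bot}\n{training_rule}")
    for bot in buckets["abusive"]:
        groups.append(f"User-agent: {bot}\nDisallow: /")
    return "\n\n".join(groups) + "\n"
```

With reserve_training=True the output declares ai-train=no and disallows CCBot and Applebot-Extended. With False it declares ai-train=yes and allows them explicitly. Abusive crawlers are disallowed in both cases.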

How to set it up

Write explicit robots.txt groups for the crawlers you care about — don't leave them to the wildcard.
Add a Content-Signal directive that matches your real stance, and send it as an HTTP response header too, so it travels with every resource.
Disallow the training-only crawlers if your signal says ai-train=no, and disallow the abusive crawlers regardless.
Check it from the outside — confirm the bots you meant to allow really are allowed, and the ones you blocked really are blocked.
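The last step, checking from the outside, can be sketched in Python as follows. It downloads the published robots.txt and compares what each crawler is allowed to do against a table of what you intended. The EXPECTED table, the function names, and the policy-check/0.1 user agent string are assumptions for illustration. The check only covers what robots.txt says, not any firewall or CDN rule that might treat a crawler differently.

```python
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

# Intended outcome per crawler: True = may fetch, False = must be blocked.
# Names come from this article; the mapping is an example stance.
EXPECTED = {
    "GPTBot": True,
    "ClaudeBot": True,
    "CCBot": False,
    "Applebot-Extended": False,
    "Bytespider": False,
}


def fetch_robots(site: str, timeout: float = 10.0) -> str:
    """Download robots.txt the way an outside client would see it."""
    request = Request(site.rstrip("/") + "/robots.txt",
                      headers={"User-Agent": "policy-check/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def mismatches(robots_txt: str, expected: dict[str, bool],
               path: str = "/") -> dict[str, bool]:
    """Map each crawler whose actual access differs from the intended
    access to what the file actually grants it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    actual = {bot: parser.can_fetch(bot, path) for bot in expected}
    return {bot: allowed for bot, allowed in actual.items()
            if allowed != expected[bot]}
```

A typical call would be mismatches(fetch_robots("https://example.com"), EXPECTED), with example.com standing in for your own origin. An empty result means every listed crawler gets the access you intended.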

We practise this

This site runs exactly the policy above — you can read our robots.txt and see it. Answer crawlers welcomed, training-only crawlers disallowed to enforce our ai-train=no signal, abusive ones blocked. A crawler policy isn't a checkbox; it's a position. Make sure yours is one you actually hold — and one your server actually enforces.
