Which AI crawlers should you allow? A policy that actually means something
The two questions you're actually answering
When people talk about "AI crawler policy" they blur two separate questions that need separate answers:
- Access: can this crawler fetch my pages at all? This is what robots.txt Allow and Disallow control. It's a hard gate.
- Usage: what may it do with what it fetches? This is what a Content-Signal declares: search=yes, ai-input=yes, ai-train=no. It's a stated preference.
Most sites pick one lever, ignore the other, and end up with a policy that quietly contradicts itself. Getting this right takes about ten minutes and almost everyone gets it wrong.
Most policies are accidentally inconsistent
There are three common failure modes:
- Block everything. A blanket Disallow: / for AI bots feels safe, but it also removes you from ChatGPT, Claude, Perplexity and Google's AI answers, the exact places your buyers are now asking questions. You've protected content nobody will be cited for.
- Allow everything by silence. You address GPTBot and call it done, leaving a dozen other crawlers to fall through to the wildcard. Your policy is whatever each vendor defaults to, not a decision you made.
- Say one thing, do another. You publish Content-Signal: ai-train=no and then let Common Crawl walk the whole site. Your stated policy and your actual access rules disagree.
That last one is the subtle killer, and it deserves its own section.
Not all AI crawlers are the same
Lumping every bot into "AI crawlers" is the root mistake. There are really three buckets, and they earn very different treatment:
- Answer crawlers: GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended (strictly a robots.txt control token rather than a separate fetcher, but you address it the same way). These power live answers and citations. When one of them reads you, you can show up when someone asks an AI about your topic. This is the visibility you're chasing. Welcome them.
- Training-only crawlers: CCBot (Common Crawl) and Applebot-Extended (also a control token: it governs how pages fetched by Applebot may be used). These feed model training corpora. They don't put you in an answer tomorrow; they fold your words into a model months from now. There's a real decision here, and it's about how you feel about your content being training data.
- Abusive crawlers: Bytespider and friends. Aggressive, little upside, happy to burn your bandwidth. Block them and move on.
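The three buckets can be kept as data and turned into explicit robots.txt groups, so no crawler is left to the wildcard by accident. This is a minimal sketch: the names BUCKETS, ALLOW_BUCKET and render_robots_groups are hypothetical, and the bucket assignments simply mirror the list above rather than any authoritative registry.

```python
# Hypothetical bucket assignments, mirroring the list above; adjust to taste.
BUCKETS = {
    "answer": ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended"],
    "training_only": ["CCBot", "Applebot-Extended"],
    "abusive": ["Bytespider"],
}

# One access decision per bucket: True renders "Allow: /", False "Disallow: /".
ALLOW_BUCKET = {"answer": True, "training_only": False, "abusive": False}


def render_robots_groups(buckets, allow_bucket):
    """Return robots.txt text with one explicit group per named crawler."""
    groups = []
    for bucket, agents in buckets.items():
        rule = "Allow: /" if allow_bucket[bucket] else "Disallow: /"
        for agent in agents:
            groups.append(f"User-agent: {agent}\n{rule}")
    # Blank lines between groups keep each crawler's rules visibly separate.
    return "\n\n".join(groups) + "\n"


robots_txt = render_robots_groups(BUCKETS, ALLOW_BUCKET)
print(robots_txt)
```

Changing your stance on a whole bucket then means flipping one value in ALLOW_BUCKET instead of editing each group by hand.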
The trap: a preference isn't a rule
Here's the part almost every "AI-ready" checklist gets wrong.
A Content-Signal is a declaration. ai-train=no says "please don't train on
this." Well-behaved, signal-aware crawlers may honour it. But Common Crawl and most
training crawlers don't read Content-Signal at all — they fetch, archive, and the
content ends up in training sets regardless of what you declared.
So if your actual goal is to reserve training, a Content-Signal alone is a sign on
an unlocked door. The only thing that genuinely stops a training-only crawler is a
Disallow in robots.txt. To be consistent, your declaration and your access rules
have to point the same way:
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: CCBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
Now the policy means something: you welcome the crawlers that drive answers, and you enforce — not just request — the training stance you publish.
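Whether a declaration and the access rules point the same way can be checked mechanically. The sketch below uses Python's standard urllib.robotparser, which ignores the Content-Signal line, so that line is parsed separately. The helper names and the TRAINING_ONLY list are assumptions for illustration, and Python's parser is only an approximation of how any given crawler reads the file.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumption: these are the crawlers you treat as training-only.
TRAINING_ONLY = ["CCBot", "Applebot-Extended"]


def declared_signals(robots_txt):
    """Collect key=value pairs from every Content-Signal line."""
    signals = {}
    for line in robots_txt.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() != "content-signal":
            continue
        for key, val in re.findall(r"([\w-]+)\s*=\s*(\w+)", value):
            signals[key.lower()] = val.lower()
    return signals


def training_contradictions(robots_txt, training_bots=TRAINING_ONLY):
    """Return training-only bots that may still fetch "/" despite ai-train=no."""
    if declared_signals(robots_txt).get("ai-train") != "no":
        return []
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in training_bots if parser.can_fetch(bot, "/")]


CONSISTENT = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""

# Same declaration, but the training-only crawlers fall through unblocked.
CONTRADICTORY = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: GPTBot
Allow: /
"""

print(training_contradictions(CONSISTENT))     # → []
print(training_contradictions(CONTRADICTORY))  # → ['CCBot', 'Applebot-Extended']
```

To run the same check against a live site, RobotFileParser also offers set_url() followed by read(), which fetches the file over the network; the Content-Signal line would then need to be read from the fetched text in the same way.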
The honest caveat
There's no perfectly clean line. GPTBot and ClaudeBot also contribute to training,
yet we allow them — because they're the same crawlers that surface you in live answers,
and you can't have the visibility without them. So the pragmatic, defensible rule is:
Allow crawlers that drive live AI answers, even though they may also train. Block crawlers whose only job is training, and block abusive ones.
That's a trade-off, not a loophole, and it's worth being honest with yourself about it. If reserving training matters more to you than appearing in AI answers, tighten further. If answers matter most, this is the sensible middle.
How to set it up
- Write explicit robots.txt groups for the crawlers you care about; don't leave them to the wildcard.
- Add a Content-Signal directive that matches your real stance, and send it as an HTTP response header too, so it travels with every resource.
- Disallow the training-only and abusive crawlers if your signal says ai-train=no.
- Check it from the outside: confirm the bots you meant to allow really are allowed, and the ones you blocked really are blocked.
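For the response-header step, one possible shape is a small piece of middleware that stamps every response. This is a sketch only, written as plain WSGI so it needs no framework. It assumes the header is named Content-Signal and carries the same value as the robots.txt directive, which you should confirm against whichever specification you follow; ContentSignalMiddleware and hello_app are hypothetical names.

```python
# Assumption: the header is named "Content-Signal" and carries the same
# value as the robots.txt directive; confirm against the spec you follow.
CONTENT_SIGNAL = "search=yes, ai-input=yes, ai-train=no"


class ContentSignalMiddleware:
    """WSGI middleware that adds a Content-Signal header to every response."""

    def __init__(self, app, signal=CONTENT_SIGNAL):
        self.app = app
        self.signal = signal

    def __call__(self, environ, start_response):
        def start_with_signal(status, headers, exc_info=None):
            # Drop any existing copy so the header is sent exactly once.
            headers = [(k, v) for k, v in headers if k.lower() != "content-signal"]
            headers.append(("Content-Signal", self.signal))
            return start_response(status, headers, exc_info)

        return self.app(environ, start_with_signal)


def hello_app(environ, start_response):
    """Stand-in application used only to demonstrate the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


app = ContentSignalMiddleware(hello_app)
```

The wrapped app can be served locally with wsgiref.simple_server.make_server("", 8000, app) to inspect the header from the outside. In practice the same header is often easier to set once in the web server or CDN configuration.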
We practise this
This site runs exactly the policy above — you can read our
robots.txt and see it. Answer crawlers welcomed, training-only crawlers
disallowed to enforce our ai-train=no signal, abusive ones blocked. A crawler policy
isn't a checkbox; it's a position. Make sure yours is one you actually hold — and one
your server actually enforces.