
Which AI crawlers should you allow? A policy that actually means something

The two questions you're actually answering

When people talk about "AI crawler policy" they blur two separate questions that need separate answers:

  1. Access: can this crawler fetch my pages at all? This is what robots.txt Allow and Disallow control. It's a hard gate.
  2. Usage: what may it do with what it fetches? This is what a Content-Signal declares: search=yes, ai-input=yes, ai-train=no. It's a stated preference.
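To make the second question concrete: the usage declaration is just a comma-separated list of key=value pairs. Here is a minimal Python sketch that reads such a line into a dictionary. The helper name parse_content_signal is hypothetical, and the three keys are simply the ones used in this article.

```python
def parse_content_signal(line: str) -> dict[str, bool]:
    """Parse a 'Content-Signal: key=yes, key=no' line into {key: bool}.

    Hypothetical helper for illustration; not part of any library.
    """
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError("not a Content-Signal line")
    signals = {}
    for part in value.split(","):
        key, sep, flag = part.strip().partition("=")
        if not sep:
            continue  # skip empty or malformed entries
        signals[key.strip().lower()] = flag.strip().lower() == "yes"
    return signals


signal = parse_content_signal(
    "Content-Signal: search=yes, ai-input=yes, ai-train=no"
)
print(signal)  # {'search': True, 'ai-input': True, 'ai-train': False}
```

Note what the result is: a record of what you asked for. Nothing in it controls whether a crawler can fetch the page.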

Most sites pick one lever, ignore the other, and end up with a policy that quietly contradicts itself. Getting this right takes about ten minutes, yet almost everyone gets it wrong.

Most policies are accidentally inconsistent

There are three common failure modes:

  • Block everything. A blanket Disallow: / for AI bots feels safe, but it also removes you from ChatGPT, Claude, Perplexity and Google's AI answers, the exact places your buyers are now asking questions. You've protected content that no AI answer will ever cite.
  • Allow everything by silence. You address GPTBot and call it done, leaving a dozen other crawlers to fall through to the wildcard. Your policy is whatever each vendor defaults to — not a decision you made.
  • Say one thing, do another. You publish Content-Signal: ai-train=no and then let Common Crawl walk the whole site. Your stated policy and your actual access rules disagree.

That last one is the subtle killer, and it deserves its own section.
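The "allow everything by silence" failure is easy to spot mechanically. A rough Python sketch, assuming you keep a watchlist of crawler names you care about: it lists which of them your robots.txt never names, meaning they fall through to the wildcard. The function names and the sample watchlist are illustrative, not a standard.

```python
def named_user_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent token named in a robots.txt body."""
    agents = set()
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def left_to_wildcard(robots_txt: str, watchlist: list[str]) -> list[str]:
    """Return watchlist crawlers that have no explicit group of their own."""
    named = named_user_agents(robots_txt)
    return [bot for bot in watchlist if bot.lower() not in named]


# Illustrative input: a site that addressed GPTBot and called it done.
ROBOTS = """\
User-agent: *
Allow: /

User-agent: GPTBot
Allow: /
"""
WATCHLIST = ["GPTBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]

print(left_to_wildcard(ROBOTS, WATCHLIST))
# ['ClaudeBot', 'PerplexityBot', 'CCBot', 'Bytespider']
```

Every name in that output is a crawler whose access you never actually decided.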

Not all AI crawlers are the same

Lumping every bot into "AI crawlers" is the root mistake. There are really three buckets, and they earn very different treatment:

  • Answer crawlers: GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, Google-Extended. These power live answers and citations. When one of them reads you, you can show up when someone asks an AI about your topic. This is the visibility you're chasing. Welcome them.
  • Training-only crawlers: CCBot (Common Crawl) and Applebot-Extended. These feed model training corpora. They don't put you in an answer tomorrow; they fold your words into a model months from now. There's a real decision here, and it's about how you feel about your content being training data.
  • Abusive crawlers: Bytespider and friends. Aggressive, little upside, happy to burn your bandwidth. Block them and move on.
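If you want the three buckets in a form a script can use, a small lookup table is enough. The Python sketch below encodes the grouping from this article only. The names Bucket, CRAWLER_BUCKETS and classify are hypothetical, and the list is not exhaustive.

```python
from enum import Enum
from typing import Optional


class Bucket(Enum):
    ANSWER = "answer"
    TRAINING_ONLY = "training-only"
    ABUSIVE = "abusive"


# Grouping taken from the article; keys are lowercased user-agent tokens.
CRAWLER_BUCKETS = {
    "gptbot": Bucket.ANSWER,
    "claudebot": Bucket.ANSWER,
    "perplexitybot": Bucket.ANSWER,
    "oai-searchbot": Bucket.ANSWER,
    "google-extended": Bucket.ANSWER,
    "ccbot": Bucket.TRAINING_ONLY,
    "applebot-extended": Bucket.TRAINING_ONLY,
    "bytespider": Bucket.ABUSIVE,
}


def classify(user_agent_token: str) -> Optional[Bucket]:
    """Return the bucket for a crawler token, or None if it is unknown."""
    return CRAWLER_BUCKETS.get(user_agent_token.strip().lower())


print(classify("CCBot"))       # Bucket.TRAINING_ONLY
print(classify("SomeNewBot"))  # None
```

Returning None for an unknown crawler is deliberate: a bot you haven't classified is a decision you haven't made yet.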

The trap: a preference isn't a rule

Here's the part almost every "AI-ready" checklist gets wrong.

A Content-Signal is a declaration. ai-train=no says "please don't train on this." Well-behaved, signal-aware crawlers may honour it. But Common Crawl and most training crawlers don't read Content-Signal at all — they fetch, archive, and the content ends up in training sets regardless of what you declared.

So if your actual goal is to reserve training, a Content-Signal alone is a sign on an unlocked door. The only thing that genuinely stops a training-only crawler is a Disallow in robots.txt. To be consistent, your declaration and your access rules have to point the same way:

User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

Now the policy means something: you welcome the crawlers that drive answers, and you enforce — not just request — the training stance you publish.
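You can sanity-check that file without touching a live server. The Python sketch below feeds the same robots.txt to the standard library's urllib.robotparser and asks, crawler by crawler, whether the site root may be fetched. Two caveats: Python's parser ignores lines it doesn't recognise (including Content-Signal), and real crawlers may match rules differently, so treat this as a rough check. The access_table helper and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
"""


def access_table(robots_txt: str, agents: list[str],
                 url: str = "https://example.com/") -> dict[str, bool]:
    """Map each crawler name to whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


table = access_table(
    ROBOTS_TXT, ["GPTBot", "ClaudeBot", "CCBot", "Applebot-Extended"]
)
for agent, allowed in table.items():
    print(f"{agent:20} {'allowed' if allowed else 'blocked'}")
```

The two answer crawlers should come back allowed and the two training-only crawlers blocked, which is exactly what the ai-train=no declaration promises.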

The honest caveat

There's no perfectly clean line. GPTBot and ClaudeBot also contribute to training, yet we allow them — because they're the same crawlers that surface you in live answers, and you can't have the visibility without them. So the pragmatic, defensible rule is:

Allow crawlers that drive live AI answers, even though they may also train. Block crawlers whose only job is training, and block abusive ones.

That's a trade-off, not a loophole, and it's worth being honest with yourself about it. If reserving training matters more to you than appearing in AI answers, tighten further. If answers matter most, this is the sensible middle.
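Written as code, the rule is only a few lines. This Python sketch takes a bucket name and returns an access decision. The reserve_training_first flag is one possible reading of "tighten further" (blocking answer crawlers too); that reading is an assumption, as is the function name decide.

```python
def decide(bucket: str, reserve_training_first: bool = False) -> str:
    """Return 'allow' or 'disallow' for a crawler bucket.

    bucket is one of 'answer', 'training-only', 'abusive'.
    reserve_training_first=True is an assumed stricter stance that
    gives up AI-answer visibility to keep content out of training.
    """
    if bucket in ("training-only", "abusive"):
        return "disallow"
    if bucket == "answer":
        return "disallow" if reserve_training_first else "allow"
    raise ValueError(f"unknown bucket: {bucket!r}")


print(decide("answer"))                               # allow
print(decide("training-only"))                        # disallow
print(decide("answer", reserve_training_first=True))  # disallow
```

The trade-off lives in a single flag, which is a fair picture of how small the decision is once the buckets are clear.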

How to set it up

  1. Write explicit robots.txt groups for the crawlers you care about — don't leave them to the wildcard.
  2. Add a Content-Signal directive that matches your real stance, and send it as an HTTP response header too, so it travels with every resource.
  3. Disallow the training-only crawlers if your signal says ai-train=no, and disallow abusive crawlers regardless.
  4. Check it from the outside — confirm the bots you meant to allow really are allowed, and the ones you blocked really are blocked.
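One way to keep steps 1 to 3 from drifting apart is to generate robots.txt from a single policy definition and refuse to ship it when the signal and the block list disagree. The Python sketch below is a starting point, not a finished tool. The helper names, the TRAINING_ONLY list and the sample allow/block lists are assumptions drawn from this article.

```python
# Training-only crawlers as grouped in this article (not exhaustive).
TRAINING_ONLY = ["CCBot", "Applebot-Extended"]


def render_robots(signal: dict[str, bool], allow: list[str],
                  block: list[str]) -> str:
    """Render a robots.txt with a wildcard signal and explicit groups."""
    value = ", ".join(f"{k}={'yes' if v else 'no'}" for k, v in signal.items())
    groups = [f"User-agent: *\nContent-Signal: {value}"]
    groups += [f"User-agent: {bot}\nAllow: /" for bot in allow]
    groups += [f"User-agent: {bot}\nDisallow: /" for bot in block]
    return "\n\n".join(groups) + "\n"


def unenforced_training_blocks(signal: dict[str, bool],
                               block: list[str]) -> list[str]:
    """List training-only crawlers that ai-train=no fails to block."""
    if signal.get("ai-train") is not False:
        return []  # no ai-train=no declared, so nothing to enforce
    return [bot for bot in TRAINING_ONLY if bot not in block]


SIGNAL = {"search": True, "ai-input": True, "ai-train": False}
ALLOW = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot"]
BLOCK = ["CCBot", "Applebot-Extended", "Bytespider"]

missing = unenforced_training_blocks(SIGNAL, BLOCK)
if missing:
    raise SystemExit(f"signal says ai-train=no but these are open: {missing}")
print(render_robots(SIGNAL, ALLOW, BLOCK))
```

Step 4 still has to happen against the deployed file, because what the server actually serves is what crawlers see.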

We practise this

This site runs exactly the policy above — you can read our robots.txt and see it. Answer crawlers welcomed, training-only crawlers disallowed to enforce our ai-train=no signal, abusive ones blocked. A crawler policy isn't a checkbox; it's a position. Make sure yours is one you actually hold — and one your server actually enforces.