# Which AI crawlers should you allow? A policy that actually means something

*2026-06-10 · Discoverability & Structure*

> "Allow everything" and "block everything" are both wrong answers. Here's how to set an AI crawler policy that welcomes the bots that drive citations, reserves the ones that only train, and — crucially — actually enforces what you say.

## The two questions you're actually answering

When people talk about "AI crawler policy" they blur two separate questions that need
separate answers:

1. **Access** — *can this crawler fetch my pages at all?* This is what `robots.txt`
   `Allow` and `Disallow` control. For crawlers that honour `robots.txt`, it's a gate.
2. **Usage** — *what may it do with what it fetches?* This is what a **Content-Signal**
   declares: `search=yes`, `ai-input=yes`, `ai-train=no`. It's a stated preference.

Most sites pick one lever, ignore the other, and end up with a policy that quietly
contradicts itself. Getting this right takes about ten minutes, yet almost everyone
gets it wrong.
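
To make the two levers concrete, here is a minimal Python sketch that reads them separately from one `robots.txt`. It assumes the signal sits on a single `Content-Signal:` line, as in the examples in this post. The sample file and the helper names `access_allowed` and `declared_usage` are illustrative, not part of any standard tooling.

```python
from urllib.robotparser import RobotFileParser

# Illustrative sample, not a recommendation: one usage signal, one access rule.
ROBOTS_TXT = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /

User-agent: CCBot
Disallow: /
"""


def access_allowed(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Access lever: what the Allow/Disallow rules say for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


def declared_usage(robots_txt: str) -> dict[str, str]:
    """Usage lever: key=value pairs from the first Content-Signal line, if any."""
    for line in robots_txt.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "content-signal":
            pairs = (item.split("=", 1) for item in value.split(",") if "=" in item)
            return {key.strip(): setting.strip() for key, setting in pairs}
    return {}


print(access_allowed(ROBOTS_TXT, "CCBot"))  # access question
print(declared_usage(ROBOTS_TXT))           # usage question
```

Nothing in the parser connects the two results: `access_allowed` answers only the first question and `declared_usage` only the second, which is how a policy ends up contradicting itself.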

## Most policies are accidentally inconsistent

There are three common failure modes:

- **Block everything.** A blanket `Disallow: /` for AI bots feels safe, but it also
  removes you from ChatGPT, Claude, Perplexity and Google's AI answers — the exact
  places your buyers are now asking questions. You've protected content that no AI
  answer will ever cite.
- **Allow everything by silence.** You address `GPTBot` and call it done, leaving a
  dozen other crawlers to fall through to the wildcard. Your policy for them is
  whatever the `*` group happens to say (or full access if there isn't one), not a
  decision you made.
- **Say one thing, do another.** You publish `Content-Signal: ai-train=no` and then
  let Common Crawl walk the whole site. Your *stated* policy and your *actual* access
  rules disagree.

That last one is the subtle killer. It gets its own section below, once the crawler
types are sorted out.
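
The second failure mode, allowing by silence, is the easiest to detect mechanically. The Python sketch below lists which crawlers have no group of their own in a `robots.txt`. The crawler list is simply the set of names used in this post and is not exhaustive; `named_agents` and `falls_through_to_wildcard` are hypothetical helper names.

```python
# Crawler names mentioned in this post; extend the list for your own audit.
KNOWN_AI_CRAWLERS = [
    "GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot",
    "Google-Extended", "CCBot", "Applebot-Extended", "Bytespider",
]


def named_agents(robots_txt: str) -> set[str]:
    """Collect every User-agent value in the file, lower-cased."""
    agents = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "user-agent":
            agents.add(value.strip().lower())
    return agents


def falls_through_to_wildcard(robots_txt: str, crawlers=KNOWN_AI_CRAWLERS) -> list[str]:
    """Crawlers with no explicit group: their policy was never decided."""
    named = named_agents(robots_txt)
    return [bot for bot in crawlers if bot.lower() not in named]


print(falls_through_to_wildcard("User-agent: GPTBot\nAllow: /\n"))
```

Anything this returns is a crawler you have not yet made a decision about.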

## Not all AI crawlers are the same

Lumping every bot into "AI crawlers" is the root mistake. There are really three
buckets, and they earn very different treatment:

- **Answer crawlers** — `GPTBot`, `ClaudeBot`, `PerplexityBot`, `OAI-SearchBot`,
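  `Google-Extended`. These power live answers and citations. When one of them reads
  you, you can *show up* when someone asks an AI about your topic. This is the
  visibility you're chasing. **Welcome them.** (`Google-Extended` is a `robots.txt`
  control token rather than a separate fetcher: Google's existing crawlers do the
  fetching, and the token governs whether Gemini models may train on or ground
  answers in what they find.)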
- **Training-only crawlers** — `CCBot` (Common Crawl) and `Applebot-Extended`. These
  feed model *training* corpora. They don't put you in an answer tomorrow; they fold
  your words into a model months from now. There's a real decision here, and it's
  about how you feel about your content being training data.
- **Abusive crawlers** — `Bytespider` and friends. Aggressive, little upside, happy to
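  burn your bandwidth. **Block them** and move on. Bots in this bucket may ignore
  `robots.txt`, so back the `Disallow` with a server or CDN rule if they keep coming.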
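
One way to keep the three buckets honest is to store them as data and generate the access rules from them. The Python sketch below does that. The bucket assignments come from the list above, while `BUCKETS`, `build_robots_txt`, `is_allowed` and the `reserve_training` flag are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Bucket membership as described in this post; adjust to your own judgement.
BUCKETS = {
    "answer": ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot", "Google-Extended"],
    "training_only": ["CCBot", "Applebot-Extended"],
    "abusive": ["Bytespider"],
}


def build_robots_txt(reserve_training: bool) -> str:
    """Emit one explicit group per crawler, so none falls through to the wildcard."""
    rule_for = {
        "answer": "Allow: /",
        "training_only": "Disallow: /" if reserve_training else "Allow: /",
        "abusive": "Disallow: /",
    }
    groups = [
        f"User-agent: {agent}\n{rule_for[bucket]}"
        for bucket, agents in BUCKETS.items()
        for agent in agents
    ]
    return "\n\n".join(groups) + "\n"


def is_allowed(robots_txt: str, agent: str) -> bool:
    """Round-trip check: parse the generated text and ask about the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "/")


print(build_robots_txt(reserve_training=True))
```

The only real decision is the `reserve_training` flag; the other two buckets get the same treatment either way.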

## The trap: a preference isn't a rule

Here's the part almost every "AI-ready" checklist gets wrong.

A **Content-Signal** is a *declaration*. `ai-train=no` says "please don't train on
this." Well-behaved, signal-aware crawlers may honour it. But Common Crawl and most
training crawlers **don't read Content-Signal at all** — they fetch, archive, and the
content ends up in training sets regardless of what you declared.

So if your actual goal is to **reserve training**, a Content-Signal alone is a sign on
an unlocked door. What a training-only crawler such as `CCBot` does act on is a
`Disallow` in `robots.txt`, which Common Crawl says its crawler obeys. To be
consistent, your declaration and your access rules have to point the same way:

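```
# Usage preference: a declaration, not an access rule
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /

# Answer crawlers: welcomed
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

# Training-only crawlers: blocked to enforce ai-train=no
User-agent: CCBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Abusive crawlers: blocked
User-agent: Bytespider
Disallow: /
```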

Now the policy means something: you welcome the crawlers that drive answers, and you
*enforce* — not just request — the training stance you publish.
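
The consistency requirement can be checked automatically. The following Python sketch flags training-only crawlers that can still fetch the site while the file declares `ai-train=no`. It assumes the signal appears on a `Content-Signal:` line inside `robots.txt`, uses the two training-only names from this post, and relies on Python's standard-library parser as a stand-in for how a compliant crawler reads the rules. `train_signal`, `unenforced_crawlers` and the `MISMATCHED` sample are illustrative names.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser

TRAINING_ONLY = ("CCBot", "Applebot-Extended")

# Deliberately inconsistent sample: ai-train=no, but only CCBot is disallowed.
MISMATCHED = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /

User-agent: CCBot
Disallow: /
"""


def train_signal(robots_txt: str) -> Optional[str]:
    """Return the declared ai-train value, or None if no signal is present."""
    for line in robots_txt.splitlines():
        field, _, value = line.partition(":")
        if field.strip().lower() == "content-signal":
            for item in value.split(","):
                key, _, setting = item.partition("=")
                if key.strip() == "ai-train":
                    return setting.strip()
    return None


def unenforced_crawlers(robots_txt: str) -> list[str]:
    """Training-only crawlers still allowed at / despite ai-train=no."""
    if train_signal(robots_txt) != "no":
        return []  # nothing was declared, so there is nothing to contradict
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in TRAINING_ONLY if parser.can_fetch(bot, "/")]


print(unenforced_crawlers(MISMATCHED))
```

An empty list means the stated preference and the access rules agree for the crawlers checked; it says nothing about crawlers missing from `TRAINING_ONLY`.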

## The honest caveat

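There's no perfectly clean line. `GPTBot` and `ClaudeBot` also contribute to training,
yet we allow them, because we treat them as part of the path to appearing in live
answers. Some vendors publish separate user agents for search and for training
(OpenAI lists `OAI-SearchBot` alongside `GPTBot`), so check each vendor's current
crawler documentation before deciding how finely to split them. As a starting point,
the pragmatic, defensible rule is: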

> Allow crawlers that drive live AI answers, even though they may also train. Block
> crawlers whose **only** job is training, and block abusive ones.

That's a trade-off, not a loophole, and it's worth being honest with yourself about it.
If reserving training matters more to you than appearing in AI answers, tighten further.
If answers matter most, this is the sensible middle.

## How to set it up

1. Write explicit `robots.txt` groups for the crawlers you care about — don't leave
   them to the wildcard.
2. Add a `Content-Signal` directive that matches your real stance, and send it as an
   HTTP response header too, so it travels with every resource.
3. `Disallow` the abusive crawlers regardless, and the training-only crawlers if your
   signal says `ai-train=no`.
4. Check it from the outside — confirm the bots you meant to allow really are allowed,
   and the ones you blocked really are blocked.
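
The last step can be sketched with Python's standard-library `urllib.robotparser`. The `EXPECTED` table mirrors the policy in this post, and `load_live`, `load_text` and `audit` are hypothetical helper names. This checks what your published `robots.txt` says to one compliant parser; it does not prove how any particular vendor's crawler interprets the file or whether it obeys it.

```python
from urllib.robotparser import RobotFileParser

# Intended outcome per crawler (True = allowed), taken from this post's policy.
EXPECTED = {
    "GPTBot": True,
    "ClaudeBot": True,
    "CCBot": False,
    "Applebot-Extended": False,
    "Bytespider": False,
}


def load_live(site: str) -> RobotFileParser:
    """Fetch and parse robots.txt from a live origin (makes a network request)."""
    parser = RobotFileParser()
    parser.set_url(site.rstrip("/") + "/robots.txt")
    parser.read()
    return parser


def load_text(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content held in a string, for offline checks."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def audit(parser: RobotFileParser, expected=EXPECTED, path: str = "/") -> dict[str, bool]:
    """Return only the crawlers whose actual access differs from the intent."""
    mismatches = {}
    for bot, should_allow in expected.items():
        actual = parser.can_fetch(bot, path)
        if actual != should_allow:
            mismatches[bot] = actual
    return mismatches


# Usage against a live site (example.com is a placeholder for your own origin):
#     mismatches = audit(load_live("https://example.com"))
```

An empty result means every crawler in the table gets the access you intended. For the abusive bucket, server logs remain the only way to confirm the bot actually stays away.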

## We practise this

This site runs exactly the policy above — you can read our
[robots.txt](/robots.txt) and see it. Answer crawlers welcomed, training-only crawlers
disallowed to enforce our `ai-train=no` signal, abusive ones blocked. A crawler policy
isn't a checkbox; it's a position. Make sure yours is one you actually hold — and one
your server actually enforces.