Block GPTBot if you want to opt out of AI training - but never block OAI-SearchBot, or you delete yourself from ChatGPT search for zero benefit. Every major AI company now splits crawling into three separate, independently controllable bots: one for training, one for search citation, one for live user fetches. The costly mistake is a copy-pasted "block all AI bots" rule that also kills the citation crawlers. Block training if your policy requires it; always allow the search bots.
The single most expensive misconception in AI SEO is that "AI bot" means one thing. It doesn't - and the confusion has a poster child: GPTBot and OAI-SearchBot. Both belong to OpenAI, their names look alike, and getting one of them wrong in your robots.txt can quietly erase you from ChatGPT's search answers while achieving nothing you intended. Let's fix that permanently.
1. The one rule that matters
Training crawlers and citation crawlers are different bots - treat every AI company as three switches, not one. If you remember nothing else: blocking a training bot (GPTBot, ClaudeBot, Google-Extended) protects your content from model training and has no effect on whether you can be cited. Blocking a citation bot (OAI-SearchBot, PerplexityBot, Claude-SearchBot) removes you from that engine's answers and protects nothing extra, because training is already handled by the separate training bot. There is almost never a good reason to block the citation bots.
2. The three-job model
OpenAI, Anthropic, Perplexity, Apple and Meta now all split crawling by job, and there are three jobs to look for (not every vendor runs all three). Once you see this pattern, every vendor's bot list becomes readable:
- Training - collects content to train foundation models (GPTBot, ClaudeBot, CCBot). Block these to opt out of training.
- Search / citation - indexes content so it can be surfaced and cited in AI answers (OAI-SearchBot, PerplexityBot, Claude-SearchBot, Googlebot). Allow these if you want AI visibility.
- Live user fetch - retrieves a specific page in response to a user's prompt in real time (ChatGPT-User, Perplexity-User, Claude-User). These often ignore robots.txt because they're user-initiated.
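If you want the three-job model in a form you can reuse in scripts, here is a minimal sketch. The bot names are the ones listed above; the `BOT_JOBS` dictionary and the `classify_bot` helper are illustrative names of my own, and the substring match is a simplification of how user-agents appear in real traffic.

```python
# Illustrative grouping of the bots named above by job.
# BOT_JOBS and classify_bot are hypothetical helper names, not a vendor API.
BOT_JOBS = {
    "training": ["GPTBot", "ClaudeBot", "CCBot"],
    "search": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot"],
    "user_fetch": ["ChatGPT-User", "Perplexity-User", "Claude-User"],
}


def classify_bot(user_agent: str) -> str:
    """Return 'training', 'search', 'user_fetch' or 'unknown' for a user-agent."""
    ua = user_agent.lower()
    for job, tokens in BOT_JOBS.items():
        for token in tokens:
            # Case-insensitive substring match on the bot token.
            if token.lower() in ua:
                return job
    return "unknown"


if __name__ == "__main__":
    for sample in ["GPTBot/1.1", "OAI-SearchBot/1.0", "ChatGPT-User/1.0", "curl/8.0"]:
        print(sample, "->", classify_bot(sample))
```

A lookup like this only tells you what a request claims to be; it says nothing about whether the claim is genuine.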
The economics behind all this: Cloudflare found training now drives ~80–82% of all AI crawling, while search crawling fell to ~18%. That imbalance - plus tiny referral rates - is why mass blocking and pay-per-crawl models are rising. But blocking indiscriminately throws away citations you'd actually want.
3. GPTBot vs OAI-SearchBot, precisely
GPTBot trains; OAI-SearchBot cites. Blocking the first is a content-policy choice; blocking the second is self-sabotage. Straight from OpenAI's documentation:
| | GPTBot | OAI-SearchBot | ChatGPT-User |
|---|---|---|---|
| Job | Train foundation models | Index for ChatGPT search | Live user-triggered fetch |
| Blocking effect | Opts out of training | Removes you from ChatGPT search | Little - it's user-initiated |
| Respects robots.txt | Yes | Yes | Often no |
| Recommended | Block if you must | Always allow | Allow |
OpenAI's docs are explicit: sites that opt out of OAI-SearchBot "will not be shown in ChatGPT search answers." So the widespread instinct to "block GPTBot to keep my content out of AI" is fine - it just has nothing to do with search visibility, which is governed by a completely different bot.
Blocking GPTBot protects your training data. Blocking OAI-SearchBot deletes you from ChatGPT search. They are not the same decision.
4. Every AI crawler that matters
Here's the full landscape, sorted by what each bot does and what you should do about it. Notice how the training/search/fetch split repeats across every vendor:
| Bot | Operator | Purpose | Recommended action |
|---|---|---|---|
| GPTBot | OpenAI | Training | Block to opt out of training |
| OAI-SearchBot | OpenAI | Search citation | Allow |
| ChatGPT-User | OpenAI | Live user fetch | Allow |
| Googlebot | Google | Search + AI Overviews | Allow |
| Google-Extended | Google | Gemini training opt-out | Block to exit Gemini training |
| ClaudeBot | Anthropic | Training | Block if opting out |
| Claude-SearchBot | Anthropic | Search citation | Allow |
| PerplexityBot | Perplexity | Search citation | Allow |
| CCBot | Common Crawl | Open dataset (feeds training) | Block to avoid training corpora |
| Bytespider | ByteDance | Training | Block (may need WAF) |
| Applebot-Extended | Apple | Apple Intelligence training opt-out | Block; keep Applebot allowed |
| Meta-ExternalAgent | Meta | Training (Llama) | Block if desired |
And here's where the crawling volume actually concentrates. Googlebot dwarfs everything (~50% of all crawler requests), which is why the original chart showed only the AI-bot subset.
[Chart not reproduced: crawl volume by AI bot, excluding Googlebot.]
5. The robots.txt to copy
Here's the default I recommend for most sites: stay fully visible in AI search, opt out of training.
```
# --- Allow AI SEARCH & CITATION bots ---
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

# --- Block AI TRAINING bots ---
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /
```
If you're running a pure GEO/visibility play and don't care about training, the simplest option is to allow everything (User-agent: * / Allow: /) - every crawl is then a chance to be cited. The block list above is for teams that want the citation upside and a training opt-out.
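Before you deploy a robots.txt, it's worth checking that it does what you think. A minimal sketch using Python's standard-library `urllib.robotparser` follows; the `check_policy` helper and the shortened sample file are illustrative. Treat the result as a sanity check: each crawler uses its own parser, and Python's matching rules may not be identical to theirs.

```python
from urllib.robotparser import RobotFileParser

# Shortened sample of the policy above (illustrative, not the full file).
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def check_policy(robots_txt: str, bots: list[str],
                 url: str = "https://example.com/") -> dict[str, bool]:
    """Map each bot name to True if this robots.txt lets it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# PerplexityBot has no rule here and there is no wildcard group, so it is allowed.
result = check_policy(ROBOTS_TXT, ["OAI-SearchBot", "GPTBot", "PerplexityBot"])

if __name__ == "__main__":
    for bot, allowed in result.items():
        print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

Run the same check against a file containing only `User-agent: *` and `Disallow: /` and you'll see the trap this guide warns about: OAI-SearchBot comes back blocked along with everything else.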
6. Google's no-clean-lever trap
Google is the one exception where you don't get a clean choice. Google-Extended lets you opt out of Gemini and Vertex training, which is genuinely useful. But AI Overviews are not a separate crawler - they're generated from the standard Googlebot search index. That means you cannot appear in Google Search while excluding yourself from AI Overviews. If you're in Google's index, you're eligible for Overviews, full stop. Anyone promising you an "AI Overviews opt-out" that keeps your rankings is selling something that doesn't exist.
7. robots.txt is an honour system
robots.txt is a request, not a wall - the reputable bots comply, but some don't. GPTBot, OAI-SearchBot, ClaudeBot, Googlebot and Applebot honour it reliably. CCBot and Bytespider have documented compliance gaps, and in August 2025 Cloudflare de-listed Perplexity as a verified bot after catching it rotating IPs and spoofing a Chrome user-agent to crawl sites that had blocked it. If you have content you genuinely must keep out of AI systems, robots.txt alone won't do it - layer a WAF, Cloudflare's managed AI Crawl Control, or IP-based firewall rules on top.
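To make "layer something on top" concrete, here is a minimal sketch of application-level enforcement. This is not a WAF; it's the same idea expressed as a small WSGI middleware, and the `deny_ai_bots` name and the deny list are hypothetical. It only stops bots that announce themselves, so a crawler spoofing a browser user-agent still needs IP-based rules.

```python
from typing import Callable, Iterable

# Hypothetical deny list using tokens named in the text; adjust to your policy.
DENIED_TOKENS = ("bytespider", "ccbot")


def deny_ai_bots(app: Callable, denied: Iterable[str] = DENIED_TOKENS) -> Callable:
    """Wrap a WSGI app so requests from denied user-agents receive a 403."""
    denied_tokens = tuple(token.lower() for token in denied)

    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in ua for token in denied_tokens):
            # Refuse at the server instead of relying on the bot's goodwill.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware
```

Wrapping is a one-liner (`app = deny_ai_bots(app)`), and citation bots such as OAI-SearchBot pass through untouched because they aren't in the deny list.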
8. Verify a bot is really who it claims
Never trust a user-agent string alone - verify by IP. Because user-agents are trivially spoofed, confirm any bot against its operator's published IP ranges. OpenAI publishes machine-readable lists at openai.com/gptbot.json, openai.com/searchbot.json and openai.com/chatgpt-user.json; Google, Anthropic and others publish equivalents. Reverse-DNS plus an IP match is how you separate the real crawlers from impersonators in your logs - and it's the foundation of any serious log analysis, which is the same technique you'll use to track your first AI citations.
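The IP half of that check can be sketched in a few lines of standard-library Python. Two assumptions to flag: the JSON shape (a `prefixes` list with `ipv4Prefix` / `ipv6Prefix` keys) is my assumption about how such range files are laid out, so confirm it against the file you actually download; and the sample prefixes are documentation-only addresses, not real crawler ranges.

```python
import ipaddress
import json

# Assumed file shape; the prefixes are RFC 5737 documentation ranges,
# NOT real crawler addresses.
SAMPLE_RANGES_JSON = """
{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv4Prefix": "198.51.100.0/28"}]}
"""


def load_networks(raw_json: str) -> list:
    """Parse a published range file into a list of ip_network objects."""
    doc = json.loads(raw_json)
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_from_operator(ip: str, networks: list) -> bool:
    """True if `ip` falls inside any of the operator's published networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


if __name__ == "__main__":
    nets = load_networks(SAMPLE_RANGES_JSON)
    for ip in ["192.0.2.55", "198.51.100.200", "203.0.113.9"]:
        print(ip, "->", "verified" if is_from_operator(ip, nets) else "not in ranges")
```

In practice you'd download each operator's file on a schedule and cache it, since published ranges change over time.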
9. The other vendors, decoded
OpenAI's three-bot split is the clearest, but every major AI company now follows the same pattern - once you learn it, their bot lists read themselves.
Anthropic has the cleanest separation after OpenAI. ClaudeBot is the training crawler (block it to opt out of training); Claude-SearchBot indexes for Claude's search answers (allow it); Claude-User is the live user-triggered fetch (allow it). Anthropic says all three respect robots.txt. There are also two deprecated tokens, anthropic-ai and Claude-Web, worth keeping in old block lists.
Perplexity declares no training crawler - it says it doesn't train foundation models - so both PerplexityBot (indexing) and Perplexity-User (live fetch) exist for visibility, and you'd generally allow both. The asterisk: Cloudflare caught Perplexity in 2025 using undeclared crawlers with rotating IPs and a spoofed Chrome user-agent to reach sites that had blocked it, and de-listed it as a verified bot. So Perplexity honours robots.txt in principle but has a documented enforcement gap in practice.
Apple splits by suffix: Applebot powers Siri and Spotlight search (keep it allowed), while Applebot-Extended is purely a training opt-out token (block it to exit Apple Intelligence training without losing Siri visibility). Meta uses Meta-ExternalAgent for Llama training and Meta-ExternalFetcher for live fetches. Common Crawl's CCBot has no citation product - its dataset feeds many models' training, so block it if you're avoiding training corpora. And ByteDance's Bytespider is notorious for weak compliance; if you must keep it out, expect to enforce at the firewall, not in robots.txt.
10. The economics: take, not give
The reason this whole topic suddenly matters is money - AI crawling has become wildly lopsided in what it takes versus what it returns. Cloudflare's network-scale data tells the story. Training now drives roughly 80–82% of all AI crawling, up from 72% a year earlier, while search crawling fell to ~18%. And the referrals AI engines send back are minuscule: for every visitor referred, Anthropic's crawlers made about 38,000 requests, OpenAI's about 1,091, and Perplexity's about 195. Google, by comparison, sits near 5-to-1.
That imbalance is fuelling two reactions. The first is mass blocking: GPTBot is now the single most-blocked AI crawler, disallowed by roughly half of top news sites, and AI-blocking among reputable sites climbed from ~23% in late 2023 to ~60% by mid-2025. The second is monetisation - Cloudflare launched pay-per-crawl, letting sites charge bots for access and returning HTTP 402 ("Payment Required") responses at scale. The direction of travel is clear: the free-for-all era of AI crawling is ending, and site owners are gaining leverage.
Where does that leave you? It depends on your goal. If you're a publisher whose business is content, blocking training crawlers (and eventually charging for access) is a defensible stance - just keep the citation bots allowed so you don't vanish from AI answers. If you're a brand or business using GEO to be discovered, the calculus inverts: you want the crawls, because being cited in an AI answer is worth far more to you than the crawl costs. Decide which camp you're in before you touch robots.txt.
11. Reading your logs for AI bots
Your server logs are the only place you can see AI crawlers actually visiting - and they're your earliest signal of AI visibility. Every request records a timestamp, path, user-agent and source IP. To turn that into insight:
- Filter by user-agent for the bots that matter - GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, PerplexityBot, Bytespider. (Google-Extended is a robots.txt token only; it has no user-agent of its own, so it won't show up in logs.)
- Verify by IP against each operator's published ranges to filter out spoofers before you trust the counts.
- Track cadence - how often each bot revisits. Studies of agentic crawling found GPTBot re-fetches high-value pages roughly every 2.4 days once it's discovered them; a rising cadence means you're firmly in the index.
- Note the on-demand bots - Perplexity-User and ChatGPT-User fetch at query time rather than on a schedule, so a hit from them can mean a real user's question just surfaced your page.
Because crawling is the prerequisite to citation, this log signal leads your actual citations by days or weeks. It's the same telemetry you'll lean on to track a new site's climb in Zero to Cited - the first GPTBot or OAI-SearchBot hit is the moment you know the door has opened.
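The filter-and-cadence steps above can be sketched as a short script. It assumes your server writes the common "combined" access-log format; the regex, the helper names and the sample lines (with simplified user-agent strings) are illustrative. Google-Extended is left out of the list because it is a robots.txt token, not a user-agent that appears in logs.

```python
import re
from collections import defaultdict
from datetime import datetime

# User-agent tokens to look for (taken from the list above).
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
           "Claude-SearchBot", "PerplexityBot", "Bytespider"]

# Assumes the "combined" log format; adapt the pattern to your server.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def bot_hits(lines):
    """Group (timestamp, ip, path) tuples by the AI bot named in the user-agent."""
    hits = defaultdict(list)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ua = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in ua:
                ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
                hits[bot].append((ts, match.group("ip"), match.group("path")))
                break
    return hits


def mean_revisit_days(timestamps):
    """Average gap in days between consecutive visits, or None with fewer than two."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps) / 86400


# Illustrative sample lines; IPs are documentation addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Oct/2025:08:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '192.0.2.10 - - [03/Oct/2025:08:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '198.51.100.7 - - [03/Oct/2025:09:30:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
]

if __name__ == "__main__":
    for bot, entries in bot_hits(SAMPLE_LOG).items():
        cadence = mean_revisit_days([ts for ts, _, _ in entries])
        print(bot, len(entries), "hits, mean revisit (days):", cadence)
```

The counts are only as trustworthy as the user-agents, so run the IP verification on each hit's source address before you read anything into the numbers.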
12. So should you block anything at all?
Here's a decision framework to settle it for your specific site - because the right answer genuinely differs by business model. Work through these questions in order:
- Is being discovered in AI answers valuable to you? For almost every business, brand or creator using content to attract customers, the answer is yes. If so, allow every search and citation bot - OAI-SearchBot, PerplexityBot, Claude-SearchBot, Googlebot, Bingbot. This is non-negotiable; blocking them is self-erasure.
- Is your content itself your product? If you're a publisher, a research firm, or anyone whose paid offering is the writing, you have a legitimate reason to resist having it used as free training data. In that case, block the training crawlers - GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended - while keeping the citation bots open.
- Do you have content you truly need to keep out of AI entirely? If yes, robots.txt isn't enough - the non-compliant crawlers will ignore it. Move to WAF rules, Cloudflare's AI Crawl Control, or authentication. But recognise the trade-off: fully walling off content also removes it from the answers where discovery now happens.
- Are you unsure? Then default to openness. Allow everything. For the overwhelming majority of sites, the visibility upside of being crawled dwarfs the theoretical downside of being trained on - and you can always tighten later.
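If it helps to see the four questions as logic, here is a rough sketch. The function names, the `None`-means-unsure convention and the shape of the returned plan are my own illustration, not a standard; the bot lists are the ones named in the questions above.

```python
# Bot lists as named in the questions above.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended", "Applebot-Extended"]
CITATION_BOTS = ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "Googlebot", "Bingbot"]


def recommend(visibility_matters=None, content_is_product=None,
              must_exclude_entirely=None) -> dict:
    """Turn the answers into a plan. None means 'unsure' (hypothetical convention)."""
    answers = (visibility_matters, content_is_product, must_exclude_entirely)
    if all(answer is None for answer in answers):
        # Unsure on every question: default to openness.
        return {"allow": ["*"], "block": [], "needs_enforcement": False}
    plan = {"allow": [], "block": [], "needs_enforcement": False}
    if visibility_matters:
        plan["allow"] = list(CITATION_BOTS)
    if content_is_product:
        plan["block"] = list(TRAINING_BOTS)
    if must_exclude_entirely:
        # robots.txt alone is not enough; add WAF rules or authentication.
        plan["needs_enforcement"] = True
    return plan


def to_robots_txt(plan: dict) -> str:
    """Render the allow/block lists as robots.txt groups."""
    lines = []
    for bot in plan["allow"]:
        lines += [f"User-agent: {bot}", "Allow: /", ""]
    for bot in plan["block"]:
        lines += [f"User-agent: {bot}", "Disallow: /", ""]
    return "\n".join(lines).strip() + "\n"


if __name__ == "__main__":
    print(to_robots_txt(recommend(visibility_matters=True, content_is_product=True)))
```

The value here isn't the code itself but that it forces each answer to be explicit, which is the opposite of inheriting a preset nobody reviewed.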
The mistake to avoid isn't allowing too much or blocking too much in the abstract - it's making the decision accidentally. A robots.txt copied from a "block AI bots" blog post, a CMS default nobody reviewed, a security plugin's aggressive preset: these are how sites end up invisible to AI search without anyone choosing it. Open your robots.txt right now, read it against the tables in this guide, and make sure every line reflects a deliberate choice. That five-minute review is the single highest-value thing you can do for your AI-search visibility today.
And revisit it periodically. The bot landscape shifts - new crawlers appear (Claude-SearchBot and OAI-SearchBot are both recent additions), vendors change their compliance behaviour, and new "Extended"-style opt-out tokens keep arriving. A robots.txt that was perfectly configured a year ago may be silently blocking a citation bot that didn't exist when you wrote it. Put a quarterly reminder on the calendar to re-check it against the current bot list.
Get this right once and the upkeep is a quick periodic check: allow the search bots, decide deliberately on the training bots, verify by IP, and never let a lazy "block all AI" rule quietly cost you the citations you're working so hard to earn.