Does blocking GPTBot remove me from ChatGPT?

No. GPTBot only controls training-data collection. ChatGPT's live search uses OAI-SearchBot (indexing) and ChatGPT-User (live fetch). Per OpenAI's docs, blocking GPTBot leaves your ChatGPT search visibility fully intact - but blocking OAI-SearchBot removes you from ChatGPT search answers.

What's the actual difference between GPTBot and OAI-SearchBot?

GPTBot crawls content that may be used to train OpenAI's foundation models. OAI-SearchBot exists to surface websites in ChatGPT's search results. Different jobs, different user-agents, different IP ranges - controlled independently in robots.txt.

Does robots.txt actually stop these bots?

It's an honour system. OpenAI, Anthropic, Google, Apple and Perplexity's declared crawlers respect it. CCBot and Bytespider have mixed records, and Cloudflare de-listed Perplexity in 2025 for stealth crawling around blocks. For real enforcement, use a WAF or Cloudflare's AI Crawl Control.

Can I stay in Google AI Overviews but block Gemini training?

You can block Gemini training with Google-Extended, but you cannot exclude yourself from AI Overviews while staying in Search - Overviews run on the standard Googlebot index. It's all-or-nothing with Google Search.

How do I verify a bot is really GPTBot and not a spoofer?

Check the request's source IP against OpenAI's published ranges at openai.com/gptbot.json, searchbot.json and chatgpt-user.json. User-agent strings are trivially faked - Perplexity was caught spoofing a Chrome user-agent to evade blocks.

Do ChatGPT-User and Perplexity-User obey robots.txt?

Largely no - both are user-initiated fetches. OpenAI says robots.txt rules may not apply to ChatGPT-User, and Perplexity-User generally ignores robots.txt. Blocking them in robots.txt is mostly symbolic; use a WAF if you truly must block.

If AI bots barely send traffic, why allow them at all?

The crawl-to-referral ratios are lopsided (OpenAI ~1,091:1, Anthropic ~38,066:1), but the referrals that do come are high-intent, and being cited builds brand presence in AI answers. The upside is visibility and citation, not raw traffic.

Will blocking training bots hurt my SEO?

No. GPTBot, ClaudeBot, Google-Extended and Applebot-Extended are separate from the search-indexing bots. Blocking them doesn't affect Google/Bing rankings or AI-search citation, as long as you leave Googlebot, Bingbot, OAI-SearchBot, PerplexityBot and Claude-SearchBot allowed.

GPTBot vs OAI-SearchBot: The AI Crawler Guide

The short answer

Block GPTBot if you want to opt out of AI training - but never block OAI-SearchBot, or you delete yourself from ChatGPT search for zero benefit. The major AI companies now split crawling by job into separate, independently controllable bots: one for training, one for search citation, one for live user fetches. The costly mistake is a copy-pasted "block all AI bots" rule that also kills the citation crawlers. Block training if your policy requires it; always allow the search bots.

The single most expensive misconception in AI SEO is that "AI bot" means one thing. It doesn't - and the confusion has a poster child: GPTBot and OAI-SearchBot. Both belong to OpenAI, both have similar-sounding names, and getting one of them wrong in your robots.txt can quietly erase you from ChatGPT's search answers while achieving nothing you intended. Let's fix that permanently.

1. The one rule that matters

Training crawlers and citation crawlers are different bots - treat every AI company as three switches, not one. If you remember nothing else: blocking a training bot (GPTBot, ClaudeBot, Google-Extended) protects your content from model training and has no effect on whether you can be cited. Blocking a citation bot (OAI-SearchBot, PerplexityBot, Claude-SearchBot) removes you from that engine's answers and protects nothing extra, because training is already handled by the separate training bot. There is almost never a good reason to block the citation bots.

2. The three-job model

OpenAI and Anthropic separate crawling into three distinct jobs, and Perplexity, Apple and Meta follow the same split with fewer bots. Once you see this pattern, every vendor's bot list becomes readable:

Training - collects content to train foundation models (GPTBot, ClaudeBot, CCBot). Block these to opt out of training.
Search / citation - indexes content so it can be surfaced and cited in AI answers (OAI-SearchBot, PerplexityBot, Claude-SearchBot, Googlebot). Allow these if you want AI visibility.
Live user fetch - retrieves a specific page in response to a user's prompt in real time (ChatGPT-User, Perplexity-User, Claude-User). These often ignore robots.txt because they're user-initiated.
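
The three jobs above can be written down as a small lookup table. This is a minimal Python sketch, not anyone's official classification: the mapping only covers the bots named in this guide, `classify_bot` is a hypothetical helper name, and matching on a user-agent substring says nothing about whether the request is genuine.

```python
from typing import Optional

# Bot tokens and their jobs, taken from the three-job list in this guide.
BOT_JOBS = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "Claude-SearchBot": "search",
    "Googlebot": "search",
    "ChatGPT-User": "user-fetch",
    "Perplexity-User": "user-fetch",
    "Claude-User": "user-fetch",
}


def classify_bot(user_agent: str) -> Optional[str]:
    """Return the job of the first known bot token found in the user-agent.

    Hypothetical helper: matching is a case-insensitive substring test,
    and None is returned for anything not in BOT_JOBS.
    """
    ua = user_agent.lower()
    for token, job in BOT_JOBS.items():
        if token.lower() in ua:
            return job
    return None
```

A lookup like this is mainly useful for labelling log lines or auditing a robots.txt: anything that comes back as "search" is a bot you almost certainly want allowed.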

The economics behind all this: Cloudflare found training now drives ~80–82% of all AI crawling, while search crawling fell to ~18%. That imbalance - plus tiny referral rates - is why mass blocking and pay-per-crawl models are rising. But blocking indiscriminately throws away citations you'd actually want.

3. GPTBot vs OAI-SearchBot, precisely

GPTBot trains; OAI-SearchBot cites. Blocking the first is a content-policy choice; blocking the second is self-sabotage. Based on OpenAI's documentation, with my recommendation in the last row:

	GPTBot	OAI-SearchBot	ChatGPT-User
Job	Train foundation models	Index for ChatGPT search	Live user-triggered fetch
Blocking effect	Opts out of training	Removes you from ChatGPT search	Little - it's user-initiated
Respects robots.txt	Yes	Yes	Often no
Recommended	Block if you must	Always allow	Allow

OpenAI's docs are explicit: sites that opt out of OAI-SearchBot "will not be shown in ChatGPT search answers." So the widespread instinct to "block GPTBot to keep my content out of AI" is fine - it just has nothing to do with search visibility, which is governed by a completely different bot.

Blocking GPTBot protects your training data. Blocking OAI-SearchBot deletes you from ChatGPT search. They are not the same decision.

4. Every AI crawler that matters

Here's the full landscape, sorted by what each bot does and what you should do about it. Notice how the training/search/fetch split repeats across every vendor:

Bot	Operator	Purpose	Recommended action
GPTBot	OpenAI	Training	Block to opt out of training
OAI-SearchBot	OpenAI	Search citation	Allow
ChatGPT-User	OpenAI	Live user fetch	Allow
Googlebot	Google	Search + AI Overviews	Allow
Google-Extended	Google	Gemini training opt-out	Block to exit Gemini training
ClaudeBot	Anthropic	Training	Block if opting out
Claude-SearchBot	Anthropic	Search citation	Allow
PerplexityBot	Perplexity	Search citation	Allow
CCBot	Common Crawl	Open dataset (feeds training)	Block to avoid training corpora
Bytespider	ByteDance	Training	Block (may need WAF)
Applebot-Extended	Apple	Apple Intelligence training opt-out	Block; keep Applebot allowed
Meta-ExternalAgent	Meta	Training (Llama)	Block if desired

And here's where the crawling volume actually concentrates. Googlebot dwarfs everything (~50% of all crawler requests), so the figures below show the AI-bot subset for readability:

Share of total crawler requests - AI bots only

Bot	Share
GPTBot	7.7%
ClaudeBot	5.4%
Amazonbot	4.2%
Bytespider	2.9%
ChatGPT-User	1.3%
Applebot	1.2%
PerplexityBot	0.2%

Cloudflare, May 2025, across 30+ crawlers. Googlebot (~50%, not shown) leads overall; GPTBot leads the AI pack. PerplexityBot crawls little because it fetches on-demand at query time.

5. The robots.txt to copy

Here's the default I recommend for most sites: stay fully visible in AI search, opt out of training.

# --- Allow AI SEARCH & CITATION bots ---
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Googlebot
Allow: /

# --- Block AI TRAINING bots ---
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /

If you're running a pure GEO/visibility play and don't care about training, the simplest option is to allow everything (User-agent: * / Allow: /) - every crawl is then a chance to be cited. The block list above is for teams that want the citation upside and a training opt-out.
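
Before deploying a robots.txt, it is worth checking that it does what you think. The sketch below uses Python's standard-library `urllib.robotparser` on a two-bot excerpt of the file above. The `allowed` helper and the `https://example.com/` URL are placeholders of mine, and Python's matching rules may differ in detail from how each crawler reads the file, so treat it as a sanity check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# Two groups from the recommended file: one citation bot, one training bot.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def allowed(robots_txt: str, bot: str, url: str = "https://example.com/") -> bool:
    """Return True if `bot` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, url)


for bot in ("OAI-SearchBot", "GPTBot"):
    print(bot, "allowed" if allowed(ROBOTS_TXT, bot) else "blocked")
# prints:
# OAI-SearchBot allowed
# GPTBot blocked
```

Paste your real robots.txt into `ROBOTS_TXT` and loop over every citation bot you care about; any "blocked" line for a search bot is the accidental self-erasure this guide warns about.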

6. Google's no-clean-lever trap

Google is the one exception where you don't get a clean choice. Google-Extended lets you opt out of Gemini and Vertex training, which is genuinely useful. But AI Overviews are not a separate crawler - they're generated from the standard Googlebot search index. That means you cannot appear in Google Search while excluding yourself from AI Overviews. If you're in Google's index, you're eligible for Overviews, full stop. Anyone promising you an "AI Overviews opt-out" that keeps your rankings is selling something that doesn't exist.

7. robots.txt is an honour system

robots.txt is a request, not a wall - the reputable bots comply, but some don't. GPTBot, OAI-SearchBot, ClaudeBot, Googlebot and Applebot honour it reliably. CCBot and Bytespider have documented compliance gaps, and in August 2025 Cloudflare de-listed Perplexity as a verified bot after catching it rotating IPs and spoofing a Chrome user-agent to crawl sites that had blocked it. If you have content you genuinely must keep out of AI systems, robots.txt alone won't do it - layer a WAF, Cloudflare's managed AI Crawl Control, or IP-based firewall rules on top.
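
To make "request, not a wall" concrete, here is a rough Python sketch of enforcement at the application layer: a WSGI middleware that answers 403 to any request whose user-agent contains a blocked token. It is my own illustration, not how a WAF or Cloudflare's AI Crawl Control works; `block_declared_bots` is a hypothetical name, and because it matches on the user-agent it only stops bots that identify themselves honestly. Spoofed traffic still needs IP-based rules.

```python
def block_declared_bots(app, blocked_tokens):
    """Wrap a WSGI app so requests from the listed bot tokens get a 403.

    Hypothetical sketch: matching is a case-insensitive substring test on
    the User-Agent header, so it cannot catch crawlers that spoof a browser.
    """
    lowered = [token.lower() for token in blocked_tokens]

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in lowered):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else passes through to the wrapped application.
        return app(environ, start_response)

    return middleware
```

Wrapping an existing app would look like `app = block_declared_bots(app, ["Bytespider", "CCBot"])`. Unlike robots.txt, this does not depend on the crawler choosing to comply.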

8. Verify a bot is really who it claims

Never trust a user-agent string alone - verify by IP. Because user-agents are trivially spoofed, confirm any bot against its operator's published IP ranges. OpenAI publishes machine-readable lists at openai.com/gptbot.json, openai.com/searchbot.json and openai.com/chatgpt-user.json; Google and several other operators publish equivalents. Reverse DNS plus an IP match is how you separate the real crawlers from impersonators in your logs. It's also the foundation of any serious log analysis, the same technique you'll use to track your first AI citations.
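
The IP check itself is only a few lines once you have the ranges as CIDR strings. This Python sketch uses the standard `ipaddress` module. The ranges shown are reserved documentation blocks standing in for real data, `ip_in_ranges` is a hypothetical helper, and loading the prefixes out of the operators' JSON files is left out because their exact schema isn't covered in this guide.

```python
import ipaddress


def ip_in_ranges(ip: str, cidr_ranges: list[str]) -> bool:
    """Return True if `ip` falls inside any of the given CIDR ranges."""
    address = ipaddress.ip_address(ip)
    for cidr in cidr_ranges:
        network = ipaddress.ip_network(cidr, strict=False)
        # Only compare like with like (IPv4 vs IPv4, IPv6 vs IPv6).
        if address.version == network.version and address in network:
            return True
    return False


# Placeholder ranges (reserved for documentation), NOT OpenAI's real prefixes.
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]

print(ip_in_ranges("192.0.2.44", PUBLISHED_RANGES))   # prints True
print(ip_in_ranges("203.0.113.9", PUBLISHED_RANGES))  # prints False
```

A request that claims to be GPTBot but fails this check against OpenAI's current list is a spoofer and should be dropped from your counts.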

9. The other vendors, decoded

OpenAI's three-bot split is the clearest, but every major AI company now follows the same pattern - once you learn it, their bot lists read themselves.

Anthropic has the cleanest separation after OpenAI. ClaudeBot is the training crawler (block it to opt out of training); Claude-SearchBot indexes for Claude's search answers (allow it); Claude-User is the live user-triggered fetch (allow it). Anthropic says all three respect robots.txt. There are also the deprecated anthropic-ai and Claude-Web tokens, which are worth keeping in old block lists.

Perplexity declares no training crawler - it says it doesn't train foundation models - so both PerplexityBot (indexing) and Perplexity-User (live fetch) exist for visibility, and you'd generally allow both. The asterisk: Cloudflare caught Perplexity in 2025 using undeclared crawlers with rotating IPs and a spoofed Chrome user-agent to reach sites that had blocked it, and de-listed it as a verified bot. So Perplexity honours robots.txt in principle but has a documented enforcement gap in practice.

Apple splits by suffix: Applebot powers Siri and Spotlight search (keep it allowed), while Applebot-Extended is purely a training opt-out token (block it to exit Apple Intelligence training without losing Siri visibility). Meta uses Meta-ExternalAgent for Llama training and Meta-ExternalFetcher for live fetches. Common Crawl's CCBot has no citation product - its dataset feeds many models' training, so block it if you're avoiding training corpora. And ByteDance's Bytespider is notorious for weak compliance; if you must keep it out, expect to enforce at the firewall, not in robots.txt.

10. The economics: take, not give

The reason this whole topic suddenly matters is money - AI crawling has become wildly lopsided in what it takes versus what it returns. Cloudflare's network-scale data tells the story. Training now drives roughly 80–82% of all AI crawling, up from 72% a year earlier, while search crawling fell to ~18%. And the referrals AI engines send back are minuscule: for every visitor referred, Anthropic's crawlers made about 38,000 requests, OpenAI's about 1,091, and Perplexity's about 195. Google, by comparison, sits near 5-to-1.
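
The ratios quoted above are simply crawler requests divided by referred visits, and you can compute the same figure for your own site. A small Python sketch follows; the counts are invented purely to reproduce the ~1,091:1 figure, and `crawl_to_referral_ratio` is a hypothetical helper name.

```python
def crawl_to_referral_ratio(crawl_requests: int, referred_visits: int) -> float:
    """Crawler requests made for each visitor the operator referred back."""
    if referred_visits <= 0:
        raise ValueError("need at least one referred visit to form a ratio")
    return crawl_requests / referred_visits


# Made-up counts, chosen only to reproduce the ~1,091:1 figure quoted above.
ratio = crawl_to_referral_ratio(crawl_requests=54_550, referred_visits=50)
print(f"{ratio:,.0f}:1")  # prints "1,091:1"
```

Feed it per-operator crawl counts from your logs and referral counts from your analytics to see how your own site compares with the network-wide numbers.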

That imbalance is fuelling two reactions. The first is mass blocking: GPTBot is now the single most-blocked AI crawler, disallowed by roughly half of top news sites, and AI-blocking among reputable sites climbed from ~23% in late 2023 to ~60% by mid-2025. The second is monetisation - Cloudflare launched pay-per-crawl, letting sites charge bots for access and returning HTTP 402 ("Payment Required") responses at scale. The direction of travel is clear: the free-for-all era of AI crawling is ending, and site owners are gaining leverage.

Where does that leave you? It depends on your goal. If you're a publisher whose business is content, blocking training crawlers (and eventually charging for access) is a defensible stance - just keep the citation bots allowed so you don't vanish from AI answers. If you're a brand or business using GEO to be discovered, the calculus inverts: you want the crawls, because being cited in an AI answer is worth far more to you than the crawl costs. Decide which camp you're in before you touch robots.txt.

11. Reading your logs for AI bots

Your server logs are the only place you can see AI crawlers actually visiting - and they're your earliest signal of AI visibility. Every request records a timestamp, path, user-agent and source IP. To turn that into insight:

Filter by user-agent for the bots that matter - GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, PerplexityBot, Bytespider. (Google-Extended is a robots.txt control token with no user-agent string of its own, so it never shows up in logs.)
Verify by IP against each operator's published ranges to filter out spoofers before you trust the counts.
Track cadence - how often each bot revisits. Studies of agentic crawling found GPTBot re-fetches high-value pages roughly every 2.4 days once it's discovered them; a rising cadence means you're firmly in the index.
Note the on-demand bots - PerplexityBot and ChatGPT-User often fetch at query time rather than on a schedule, so a hit from them can mean a real user's question just surfaced your page.
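
The first step above, filtering by user-agent, can be sketched in Python as follows. It assumes your server writes the common "combined" access-log format; the regular expression, the `count_ai_bot_hits` helper and the sample lines are my own illustrations, so adjust the pattern to your actual log layout. The counts are only trustworthy once you have also verified the source IPs.

```python
import re
from collections import Counter

# User-agent tokens to look for. Google-Extended is omitted because it is a
# robots.txt control token and does not appear as a user-agent in logs.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "PerplexityBot", "Bytespider",
]

# Assumed layout: combined log format (IP, time, request, status, size,
# referrer, user-agent).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_bot_hits(lines):
    """Count requests per AI bot token; unparseable lines are skipped."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        user_agent = match.group("ua").lower()
        for bot in AI_BOTS:
            if bot.lower() in user_agent:
                hits[bot] += 1
                break
    return hits


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    '192.0.2.10 - - [10/May/2025:12:00:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [10/May/2025:12:05:00 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '198.51.100.7 - - [10/May/2025:12:06:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
]
print(dict(count_ai_bot_hits(SAMPLE_LINES)))
```

Keeping the timestamp and path groups in the pattern means the same parser can later be extended to measure revisit cadence per page.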

Because crawling is the prerequisite to citation, this log signal leads your actual citations by days or weeks. It's the same telemetry you'll lean on to track a new site's climb in Zero to Cited - the first GPTBot or OAI-SearchBot hit is the moment you know the door has opened.

12. So should you block anything at all?

Here's a decision framework to settle it for your specific site - because the right answer genuinely differs by business model. Work through these questions in order:

Is being discovered in AI answers valuable to you? For almost every business, brand or creator using content to attract customers, the answer is yes. If so, allow every search and citation bot - OAI-SearchBot, PerplexityBot, Claude-SearchBot, Googlebot, Bingbot. This is non-negotiable; blocking them is self-erasure.
Is your content itself your product? If you're a publisher, a research firm, or anyone whose paid offering is the writing, you have a legitimate reason to resist having it used as free training data. In that case, block the training crawlers - GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended - while keeping the citation bots open.
Do you have content you truly need to keep out of AI entirely? If yes, robots.txt isn't enough - the non-compliant crawlers will ignore it. Move to WAF rules, Cloudflare's AI Crawl Control, or authentication. But recognise the trade-off: fully walling off content also removes it from the answers where discovery now happens.
Are you unsure? Then default to openness. Allow everything. For the overwhelming majority of sites, the visibility upside of being crawled dwarfs the theoretical downside of being trained on - and you can always tighten later.

The mistake to avoid isn't allowing too much or blocking too much in the abstract - it's making the decision accidentally. A robots.txt copied from a "block AI bots" blog post, a CMS default nobody reviewed, a security plugin's aggressive preset: these are how sites end up invisible to AI search without anyone choosing it. Open your robots.txt right now, read it against the tables in this guide, and make sure every line reflects a deliberate choice. That five-minute review is the single highest-value thing you can do for your AI-search visibility today.

And revisit it periodically. The bot landscape shifts - new crawlers appear (Claude-SearchBot and OAI-SearchBot are both recent additions), vendors change their compliance behaviour, and new "Extended"-style opt-out tokens keep arriving. A robots.txt that was perfectly configured a year ago may be silently blocking a citation bot that didn't exist when you wrote it. Put a quarterly reminder on the calendar to re-check it against the current bot list.

Get this right and keep it current: allow the search bots, decide deliberately on the training bots, verify by IP, and never let a lazy "block all AI" rule quietly cost you the citations you're working so hard to earn.


Ritik Namdev
