How to Configure robots.txt for AI Crawlers
Your robots.txt file is the gatekeeper to your website. For decades, it controlled which traditional search crawlers could access your pages. Now it plays an equally critical role in AI search visibility. If your robots.txt blocks AI crawlers, your content is invisible to ChatGPT, Claude, Perplexity, and Gemini — no matter how good it is. This guide covers everything you need to know about configuring robots.txt for the AI search era.
Why robots.txt Matters More Than Ever
In the traditional SEO era, blocking a crawler was a minor issue — you might lose some visibility in one search engine. In the AI era, blocking an AI crawler means complete invisibility in that AI engine's responses. There is no partial visibility. If GPTBot cannot crawl your page, ChatGPT will never cite it. Period.
The stakes are higher because AI search is growing rapidly. ChatGPT, Perplexity, Gemini, and Claude collectively process billions of queries daily. Each uses its own crawler, and each respects your robots.txt directives independently. A misconfigured robots.txt can silently block your site from all AI search traffic.
AI Crawlers You Need to Know
Here are the major AI crawlers and what they power:
- GPTBot — OpenAI's primary crawler, used to gather content for its models. User-agent string: GPTBot. OpenAI also operates OAI-SearchBot, which powers ChatGPT Search, and ChatGPT-User for user-initiated browsing; all three respect robots.txt. This is the most impactful AI crawler family to allow due to ChatGPT's massive user base of 300+ million weekly active users.
- ClaudeBot — Anthropic's crawler for Claude's web search features. User-agent string: ClaudeBot. Increasingly important as Claude's web search grows among professional and developer audiences.
- PerplexityBot — Perplexity's dedicated web crawler. User-agent string: PerplexityBot. Essential for visibility in Perplexity's citation-heavy answer engine.
- Google-Extended — Google's AI-specific product token, separate from standard Googlebot. User-agent string: Google-Extended. Controls whether your content can be used for Gemini model training and grounding. Note that Google's AI Overviews are built on standard Googlebot crawling, so blocking Google-Extended does not remove you from AI Overviews, and allowing it does not affect your Search ranking.
- Bytespider — ByteDance's crawler, used to gather data for its AI models and search features. User-agent string: Bytespider. Relevant for visibility in ByteDance's AI products and for global AI search coverage.
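Before changing anything, it is worth checking which of these crawlers already visit your site. A minimal sketch in Python, scanning server access-log lines for the user-agent strings above — the sample log lines here are hypothetical combined-log-format entries, not real traffic:

```python
# User-agent strings of the major AI crawlers listed above
AI_USER_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Bytespider"]

# Hypothetical access-log lines for illustration only
sample_log = [
    '66.249.66.1 - - [10/May/2025:10:00:00 +0000] "GET /blog/post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '52.70.240.2 - - [10/May/2025:10:01:00 +0000] "GET /guides/ HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '203.0.113.9 - - [10/May/2025:10:02:00 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0"',
]

def count_ai_hits(log_lines, agents=AI_USER_AGENTS):
    # Count log lines whose user-agent field mentions each AI crawler
    hits = {agent: 0 for agent in agents}
    for line in log_lines:
        for agent in agents:
            if agent in line:
                hits[agent] += 1
    return hits

print(count_ai_hits(sample_log))
```

Crawlers that never appear in your logs may be blocked upstream (robots.txt, CDN, or firewall), which the following sections address.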
Recommended Configuration
For maximum AI search visibility, add these directives to your robots.txt file:
# Allow AI crawlers for AI search visibility
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: Bytespider
Allow: /
# Standard search crawlers
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

Place this file at your website root so it is accessible at yourdomain.com/robots.txt. Ensure it returns a 200 status code — some AI bots treat a 404 or 500 response as if all crawlers were blocked.
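You can sanity-check this configuration offline with Python's standard-library robots.txt parser, which applies the same matching rules crawlers use. A sketch, parsing the directives above and confirming each AI user-agent is allowed:

```python
from urllib.robotparser import RobotFileParser

# The AI-crawler portion of the recommended robots.txt above
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Bytespider
Allow: /
"""

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "Bytespider"]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in AI_CRAWLERS:
    # can_fetch() answers: may this user-agent fetch this path?
    print(agent, parser.can_fetch(agent, "/any/page"))
```

This checks the directives' logic, not deployment — you still need to confirm the live file returns a 200 at yourdomain.com/robots.txt.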
Selective Blocking: When You Want to Restrict AI Access
Some websites may want to allow AI crawlers to access public content while blocking certain directories. This is common for sites with premium content, private user areas, or content they do not want AI models to train on:
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /premium/
Disallow: /account/
User-agent: ClaudeBot
Allow: /blog/
Allow: /guides/
Disallow: /premium/
Disallow: /account/

This approach lets you benefit from AI citations on your public content while protecting private or premium areas. Be strategic — blocking your best content from AI crawlers is blocking your best opportunity for AI citations.
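The selective rules above can also be verified offline. A sketch using Python's standard-library parser to confirm GPTBot can reach public paths but not the restricted ones (the paths are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The selective-blocking group for GPTBot from the example above
SELECTIVE = """\
User-agent: GPTBot
Allow: /blog/
Allow: /guides/
Disallow: /premium/
Disallow: /account/
"""

rp = RobotFileParser()
rp.parse(SELECTIVE.splitlines())

print(rp.can_fetch("GPTBot", "/blog/how-to"))     # public: allowed
print(rp.can_fetch("GPTBot", "/premium/report"))  # restricted: blocked
print(rp.can_fetch("GPTBot", "/account/settings"))# restricted: blocked
```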
Common Mistakes to Avoid
These are the most frequent robots.txt errors we see in our GEO audits:
- Wildcard blocking — Using User-agent: * with Disallow: / blocks every crawler that lacks its own, more specific group, including all AI bots. Many sites do this unintentionally when they only mean to block specific bad bots.
- CDN-level blocking — Services like Cloudflare, AWS WAF, and Akamai can block AI crawlers at the network level before they even reach your robots.txt. Check your CDN and WAF settings separately.
- CMS default settings — Several popular CMS platforms (WordPress plugins, Wix, Squarespace) have started adding AI bot blocking by default. Audit your CMS settings even if you have not changed your robots.txt manually.
- Missing sitemap reference — Your robots.txt should include a Sitemap: directive pointing to your XML sitemap. AI crawlers use this to discover and prioritize pages for crawling.
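The wildcard mistake in the first bullet is easy to demonstrate, and so is the fix: a crawler that has its own user-agent group ignores the * group entirely. A sketch with Python's standard-library parser:

```python
from urllib.robotparser import RobotFileParser

# The unintentional lockout: a blanket block with no specific groups
BLANKET = """\
User-agent: *
Disallow: /
"""

# The fix: GPTBot gets its own group, so the * group no longer applies to it
WITH_EXCEPTION = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

def allowed(robots_txt, agent, path="/"):
    # Parse a robots.txt string and ask whether the agent may fetch the path
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, path)

print(allowed(BLANKET, "GPTBot"))         # blocked by the wildcard group
print(allowed(WITH_EXCEPTION, "GPTBot"))  # specific group wins
```

Note that the * group still blocks every crawler without an explicit exception, which may or may not be what you intend.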
Testing Your Configuration
After updating your robots.txt, verify it works correctly. Check that the file is accessible via a web browser at yourdomain.com/robots.txt, returns a 200 status code, and contains the correct directives for each AI crawler.
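One piece of this check can be automated: confirming the file declares an explicit group for each AI crawler you care about. A minimal sketch — the function name and structure are illustrative, not a standard tool:

```python
def missing_agents(robots_txt, agents):
    # Collect every user-agent token declared in the robots.txt text
    declared = {
        line.split(":", 1)[1].strip().lower()
        for line in robots_txt.splitlines()
        if line.lower().startswith("user-agent:")
    }
    # Return the crawlers that have no explicit group
    return [a for a in agents if a.lower() not in declared]

# Hypothetical fetched robots.txt that covers GPTBot but omits other AI bots
robots_txt = """\
User-agent: GPTBot
Allow: /

User-agent: Googlebot
Allow: /
"""

print(missing_agents(robots_txt, ["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

This only checks for the presence of groups; pair it with a status-code check on the live URL to complete the verification.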
Run a free GEO scan to instantly validate your robots.txt AI crawler configuration.