AI Visibility: How to Tell If ChatGPT, Perplexity, and Google AI Overviews Can See Your Site

VisibilityIQ

When someone asks ChatGPT a question about a topic your site covers, whether your content surfaces in the answer depends on a chain that starts long before the query: a bot had to crawl your site, your content had to be included in training or retrieval, and nothing in your robots configuration blocked that process. Most site owners have never audited this chain. Here’s how it works and how to check it.

How AI search engines crawl vs traditional Googlebot

Traditional search crawlers such as Googlebot and Bingbot operate on a well-understood model: crawl, index, rank. They visit URLs, extract content and signals, store them in an index, and serve results based on ranking algorithms. The relationship between crawl access and search visibility is direct: block the crawler and the page’s content cannot be indexed, so it will not rank for that content.

AI search engines add two different access patterns on top of this:

Training crawls ingest content to train or fine-tune model weights. Content ingested during training becomes part of the model’s parametric knowledge: what it “knows” from training, independent of any live retrieval. GPTBot is the crawler OpenAI documents for this kind of training data collection. Blocking GPTBot keeps your content out of future model training runs; it does not remove content that was already ingested.

Retrieval-augmented generation (RAG) crawls retrieve current content at query time to ground the model’s response in up-to-date information. Perplexity is heavily RAG-based: it fetches live pages when answering queries and cites them directly. Google AI Overviews blend both: the model has parametric knowledge from training plus live retrieval from the index. Blocking a RAG crawler means your content is not retrieved to ground answers, even if the model has parametric knowledge of your site.

The practical implication: blocking GPTBot affects training data inclusion. Blocking PerplexityBot affects whether your pages are retrieved and cited in live Perplexity answers. These are distinct effects with different consequences depending on your goals.

Key AI crawlers and what they do

GPTBot (OpenAI) — OpenAI’s crawler for collecting content that may be used to train its models. User-initiated browsing and ChatGPT search use separate OpenAI user agents, so a GPTBot rule governs training collection rather than live retrieval. User-agent string: GPTBot. OpenAI documents how to block it in robots.txt, either entirely or at the path level.

PerplexityBot — Perplexity AI’s crawler, used for real-time retrieval to ground answers in live web content. Because Perplexity is explicitly a retrieval-first system, blocking PerplexityBot means your pages will not be cited in Perplexity answers, even for queries where you would otherwise rank. User-agent: PerplexityBot.

ClaudeBot (Anthropic) — Anthropic’s web crawler, used for training data collection and to support Claude’s web search capabilities. User-agent: ClaudeBot (also sometimes anthropic-ai). Anthropic publishes documentation on their crawler and how to manage access.

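Google-Extended — Google’s dedicated opt-out mechanism for AI training data. It is a robots.txt product token rather than a separate crawler: Google’s existing crawlers still do the fetching, and the token controls whether the crawled content may be used to train Gemini (formerly Bard) models. Standard Googlebot continues to crawl for search indexing, and blocking Google-Extended does not affect your Google Search rankings. Because AI Overviews are part of Search, they follow Googlebot and the standard Search controls rather than this token.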

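Bingbot / OAI-SearchBot — Two crawlers from two different operators. Microsoft runs Bingbot for traditional Bing search, and the Bing index also grounds Copilot answers. OAI-SearchBot belongs to OpenAI, not Microsoft: it is used to surface and link sites in ChatGPT’s search features. These are separate user agents with separate robots.txt implications.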

CCBot (Common Crawl) — Not an AI company crawler itself, but the Common Crawl corpus is widely used as training data for open-weight models and academic research. Blocking CCBot reduces the likelihood of your content appearing in open-source model training datasets.

robots.txt blocking patterns and their consequences

The robots.txt standard gives you per-user-agent control. A complete block of a specific crawler looks like this:

User-agent: GPTBot
Disallow: /

A path-level block that excludes a specific section:

User-agent: GPTBot
Disallow: /members/
Disallow: /private/

The most common mistake is an overly broad wildcard rule, written for some other purpose, that accidentally catches AI crawlers. A rule like:

User-agent: *
Disallow: /api/

does not block AI crawlers from your public pages, but a blanket:

User-agent: *
Disallow: /

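blocks everything, including GPTBot, PerplexityBot, and all others. A crawler follows the most specific group that names it, so the wildcard group applies to every crawler that has no group of its own. Sites that went through an aggressive SEO lockdown, or were misconfigured during a migration, sometimes have a variant of this in place without realizing it.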
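One way to catch an accidental wildcard block is to evaluate a robots.txt body against each AI user-agent token. The sketch below uses Python’s standard-library urllib.robotparser. The AI_CRAWLERS list, the ai_crawler_access helper, and the example.com URL are illustrative assumptions, and Python’s parser may resolve edge cases (such as overlapping Allow and Disallow rules) differently from any individual crawler.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list of user-agent tokens; extend it to match your own policy.
AI_CRAWLERS = ["GPTBot", "PerplexityBot", "ClaudeBot", "Google-Extended", "CCBot"]


def ai_crawler_access(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Map each AI crawler token to True if the rules allow it to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in AI_CRAWLERS}


# A blanket wildcard block catches every crawler that has no group of its own.
blanket = "User-agent: *\nDisallow: /\n"

# A scoped wildcard plus an explicit GPTBot block only stops GPTBot at the root.
scoped = "User-agent: *\nDisallow: /api/\n\nUser-agent: GPTBot\nDisallow: /\n"

print(ai_crawler_access(blanket))  # every crawler maps to False
print(ai_crawler_access(scoped))   # only GPTBot maps to False
```

Passing the text of your live robots.txt instead of the sample strings gives a quick per-crawler view of what the current rules permit.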

Another pattern: some security-focused reverse proxy configurations serve a robots.txt from the edge that differs from the one in your source code. The file in your repository may not be what live crawlers receive if an edge rewrite is in play.
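A simple way to detect this kind of drift is to diff the robots.txt in your repository against the body actually served at /robots.txt. The sketch below assumes you have already obtained both texts (for example, by reading the file from disk and fetching the live URL); robots_drift is a hypothetical helper name, and it ignores comments and blank lines so that only rule changes are reported.

```python
import difflib


def _rules(text: str) -> list[str]:
    """Keep only rule lines: strip whitespace, drop blanks and comments."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


def robots_drift(source_text: str, served_text: str) -> list[str]:
    """Return '+'/'-' prefixed rule lines that differ between source and served."""
    diff = list(
        difflib.unified_diff(
            _rules(source_text),
            _rules(served_text),
            fromfile="source",
            tofile="served",
            lineterm="",
            n=0,
        )
    )
    # Skip the two file-header lines, then keep only added or removed rules.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


source = "# managed in git\nUser-agent: *\nDisallow: /api/\n"
served = "User-agent: *\nDisallow: /api/\n\nUser-agent: GPTBot\nDisallow: /\n"

for change in robots_drift(source, served):
    print(change)
```

An empty result means the served rules match the source; any "+" line is a rule that exists only at the edge.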

llms.txt — what it is, why it exists, how to implement it

llms.txt is an emerging convention (not yet a formal standard) for site owners to provide AI systems with a structured, human-readable summary of their site’s content and purpose. The format, proposed by Jeremy Howard and others in 2024, defines a markdown file at https://yourdomain.com/llms.txt that contains:

  • A one-line title
  • A short description of the site
  • Optional sections with links to key pages, organized by topic

The rationale is that AI systems processing your site benefit from an explicit, curated map of what’s important rather than inferring it from crawl depth and page structure alone. A well-formed llms.txt can help AI systems understand which pages represent your canonical content, which are supporting resources, and which are boilerplate.

A minimal example:

# VisibilityIQ

> Technical SEO platform with render-parity diff, rank tracking, and AI visibility auditing.

## Features
- [Site Audit](https://visibilityiq365.com/features/site-audit): Render-parity diff and full technical audit
- [AI Visibility](https://visibilityiq365.com/features/ai-visibility): AI-crawler access scoring and llms.txt validation
- [Pricing](https://visibilityiq365.com/pricing): $7.50/mo after $25 first month

You can also provide an llms-full.txt with more complete content for systems that want the full text rather than a structured index.

llms.txt is voluntary and has no direct impact on robots.txt enforcement — a crawler you’ve blocked in robots.txt will not bypass that block because you have llms.txt. Its value is for AI systems that are already allowed to access your site and want to understand it better.
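The structure described above can be checked mechanically. The following is a minimal sketch, not a reference validator: check_llms_txt and LINK_RE are hypothetical names, and the checks simply mirror the three elements listed earlier (a title, a description, and link sections), which may be stricter than the proposal itself requires.

```python
import re

# "- [Label](https://url)" with an optional ": description" suffix.
LINK_RE = re.compile(r"^- \[([^\]]+)\]\((https?://[^)\s]+)\)(?::\s*(.*))?$")


def check_llms_txt(text: str) -> list[str]:
    """Return a list of problems; an empty list means the basic shape looks right."""
    non_blank = [line.rstrip() for line in text.splitlines() if line.strip()]
    problems = []

    if not non_blank or not non_blank[0].startswith("# "):
        problems.append("first non-blank line should be an H1 title ('# Name')")
    if not any(line.startswith("> ") for line in non_blank):
        problems.append("missing blockquote description ('> ...')")

    section = None
    valid_links = {}  # section name -> number of well-formed link lines
    for line in non_blank:
        if line.startswith("## "):
            section = line[3:].strip()
            valid_links[section] = 0
        elif line.startswith("- "):
            if section is None:
                problems.append(f"link outside any '## ' section: {line}")
            elif LINK_RE.match(line):
                valid_links[section] += 1
            else:
                problems.append(f"malformed link line: {line}")

    for name, count in valid_links.items():
        if count == 0:
            problems.append(f"section '{name}' has no valid links")
    return problems
```

Running it on the text fetched from /llms.txt and printing each returned problem is usually enough to spot a missing title or a broken markdown link.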

noai and noindex directives for AI crawlers

The noai meta tag is a proposed directive that signals to AI systems that a page’s content should not be used for AI training:

<meta name="robots" content="noai, noimageai">

noai targets text content; noimageai targets images. Neither directive is part of a formal standard, and adoption among AI crawler operators is voluntary and uneven. Whether a given crawler honors these tags depends on that company’s stated policy.

For content where you want to prevent AI training use specifically (rather than blocking crawl access entirely), noai is the appropriate signal. For blocking access to the page at all, robots.txt is the mechanism that major crawlers document they honor. It is advisory too, so a hard guarantee requires blocking at the server or firewall level. Keep in mind that a crawler disallowed in robots.txt never fetches the page, so it never sees a noai tag on it.

The X-Robots-Tag HTTP response header can carry the same directives for non-HTML content (PDFs, images) where a meta tag isn’t possible:

X-Robots-Tag: noai
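Because the meta tag and the header carry the same comma-separated directive list, one small parser can serve both. The sketch below is an illustration under assumptions: robots_directives, ai_relevant, and the AI_RELEVANT set are hypothetical names, and the parser deliberately ignores more complex header forms such as agent-prefixed values.

```python
# Directives this article treats as relevant to AI crawlers (an assumption).
AI_RELEVANT = {"noai", "noimageai", "noindex"}


def robots_directives(value: str) -> set[str]:
    """Split a robots meta content or X-Robots-Tag value into lowercase directives."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def ai_relevant(values: list[str]) -> set[str]:
    """Collect the AI-relevant directives found across several values."""
    found = set()
    for value in values:
        found |= robots_directives(value) & AI_RELEVANT
    return found


# One value from a meta tag, one from an X-Robots-Tag header.
print(sorted(ai_relevant(["noai, noimageai", "NoIndex , nofollow"])))
```

Feeding it every robots value collected for a page, from both the markup and the response headers, gives a single set of signals to review.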

How to verify AI crawler access

A systematic check covers four layers:

robots.txt analysis: Fetch your live robots.txt (not the one in your source tree — fetch it from the actual URL) and parse the rules for each AI crawler user-agent. Look for both explicit rules and wildcard rules that may catch them.

Meta tag audit: For a sample of your key pages, check the rendered DOM (not just view-source, since these can be injected by JavaScript) for <meta name="robots"> content values. Look for noai, noindex, or any other directives that might affect AI crawlers.
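For the meta tag layer, a short standard-library parser can pull out every robots meta value from a page. This sketch assumes you pass it the serialized rendered DOM (for example, exported from a headless browser, which is outside the scope of this snippet); RobotsMetaCollector and meta_robots are hypothetical names.

```python
from html.parser import HTMLParser


class RobotsMetaCollector(HTMLParser):
    """Collect the content value of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.values: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None when omitted.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() == "robots":
            self.values.append(attr.get("content", ""))


def meta_robots(html: str) -> list[str]:
    """Return robots meta content values in document order."""
    collector = RobotsMetaCollector()
    collector.feed(html)
    collector.close()
    return collector.values


page = '<html><head><meta name="robots" content="noai, noimageai"></head><body></body></html>'
print(meta_robots(page))
```

Comparing the result for the raw HTML response against the result for the rendered DOM shows whether a script is injecting or altering the directive.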

HTTP header check: Use curl -I https://yourdomain.com/your-page to inspect response headers for X-Robots-Tag directives. These are easy to miss because they’re invisible in the browser.

llms.txt validation: Fetch /llms.txt and verify it’s well-formed: a title, a description block, and link sections with correct markdown formatting.

The edge case that trips up many sites is a discrepancy between what you see in your CMS or source control and what’s actually served. CDN rules, edge workers, and reverse proxy configurations can all modify robots.txt responses or inject headers before the crawler sees them.


Want an automated check across all four layers for your site? VisibilityIQ’s AI Visibility audit scores GPTBot, PerplexityBot, ClaudeBot, and Google-Extended access against your live robots.txt, meta tags, HTTP headers, and llms.txt — included in the base subscription.