The Crawler Permission Economy: Who Gets to Train on You — and What It's Worth
AI labs are paying publishers millions for training data access. Most sites are giving it away for free via default robots.txt settings. Here is the permission economy that's emerging.
By Camille Moreau, AI Policy · May 25, 2026
Frequently Asked Questions
Should websites block AI training crawlers like GPTBot and ClaudeBot?
Whether to block AI training crawlers depends on a distinction most site operators miss: training crawlers and inference crawlers are separate bots, and blocking one does not block the other. GPTBot is OpenAI's training-data crawler; blocking it keeps your content out of future model versions but does not affect whether ChatGPT's search feature can cite you today. OAI-SearchBot is the crawler behind ChatGPT's real-time search answers; blocking it directly costs you AEO visibility. Anthropic splits the roles the same way: ClaudeBot collects content that may be used for training, while Claude-SearchBot and Claude-User handle search indexing and user-initiated fetches, so it is those two, not ClaudeBot, that determine whether Claude can cite you in real time. The calculus: if you are a publisher with unique, high-value content, blocking training crawlers while allowing inference crawlers preserves your citation surface and creates leverage for a paid licensing negotiation. If you are a B2B brand that primarily wants citation share, blocking any AI crawler is almost certainly self-defeating. Publishers that have blocked all AI crawlers without distinguishing between crawler types have typically hurt their AEO performance without gaining any monetization benefit.
How much are AI labs paying for publisher training data licensing deals?
Reported deal values range from roughly $1 million a year to more than $250 million over a multi-year term, and the spread is largely explained by traffic volume and content uniqueness. The Associated Press signed a multi-year deal with OpenAI reportedly worth $15 million per year. News Corp's agreement with OpenAI is reported at over $250 million over five years (about $50 million per year), covering the Wall Street Journal, New York Post, and other properties. Reddit's data licensing agreement with Google was valued at approximately $60 million annually ahead of its IPO. Below that range, smaller publishers with monthly traffic in the 1–5 million range are being offered between $50,000 and $500,000 annually in exploratory deals. The valuation methodology labs use is not public, but it correlates strongly with four factors: unique content that cannot be scraped elsewhere, update frequency, topic authority in categories where the model underperforms, and geographic or language coverage gaps. Publishers negotiating without understanding these valuation drivers typically leave significant money on the table.
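Because the reported figures mix annual values with multi-year totals, comparing them requires putting them on the same per-year basis. The sketch below does that arithmetic using only the numbers quoted in this answer. It assumes each total is spread evenly across its term, which real contracts may not do, and the names `reported_deals` and `annualized` are illustrative.

```python
# Reported figures quoted in this article, in millions of US dollars.
# Each value is (reported_amount, term_in_years); a term of 1 means the
# reported amount is already an annual figure.
reported_deals = {
    "AP / OpenAI": (15, 1),
    "News Corp / OpenAI": (250, 5),
    "Reddit / Google": (60, 1),
}


def annualized(amount_musd: float, term_years: int) -> float:
    """Spread a reported amount evenly over its term (a simplifying assumption)."""
    return amount_musd / term_years


per_year = {
    name: annualized(amount, years)
    for name, (amount, years) in reported_deals.items()
}

# Print the deals from largest to smallest annual value.
for name, value in sorted(per_year.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: ~${value:.0f}M per year")
```

On a per-year basis, the News Corp figure works out to about $50 million, slightly below the reported Reddit figure, even though its headline total is the largest.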
What is the trade-off between blocking AI crawlers and losing AEO visibility?
The trade-off is asymmetric and depends entirely on which type of crawler you block. Blocking training crawlers (GPTBot, ClaudeBot, Common Crawl's CCBot, and similar data-harvest bots) has no direct effect on your current AEO performance, because these crawlers feed future model training, not current inference. Anything crawled before the block may already be in current models. Blocking inference crawlers (OAI-SearchBot, PerplexityBot, Claude-SearchBot) directly removes you from the citation pool for real-time AI search answers. This is the block that costs citation share. The practical recommendation for most publishers: allow inference crawlers unconditionally, because citation visibility is the most valuable near-term asset. For training crawlers, blocking is a negotiating tactic, not a permanent strategy. The publishers generating licensing revenue block training crawlers not because blocking is valuable in itself, but because selective restriction creates the scarcity that justifies a paid-access conversation. Blocking everything by default, without a licensing strategy to convert the block into revenue, simply destroys citation value for no gain.
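Before deciding which crawlers to restrict, it can help to see which ones actually visit. The following is a minimal sketch that counts access-log lines mentioning the crawler tokens named in this article. The sample log lines, IP addresses, and version numbers are made up for illustration, and `tally_crawlers` is a hypothetical helper. User-agent strings can be spoofed, so the counts are a rough indication only.

```python
from collections import Counter

# User-agent tokens named in this article; extend the tuple for other bots.
CRAWLER_TOKENS = ("GPTBot", "CCBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot")


def tally_crawlers(log_lines):
    """Count log lines that contain each crawler token (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in lowered:
                counts[token] += 1
    return counts


# Hypothetical log lines for illustration; real logs would be read from a file.
sample_log = [
    '203.0.113.5 - - [25/May/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '203.0.113.6 - - [25/May/2026:10:00:05 +0000] "GET /b HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.5 - - [25/May/2026:10:00:09 +0000] "GET /c HTTP/1.1" 200 128 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
    '198.51.100.9 - - [25/May/2026:10:00:12 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/126.0"',
]

crawler_hits = tally_crawlers(sample_log)
for token, hits in crawler_hits.most_common():
    print(f"{token}: {hits}")
```

A tally like this shows whether a given block would change anything in practice: a training crawler that rarely visits offers little scarcity to negotiate with.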
How do you set up a robots.txt that balances AI training blocking with search crawler access?
The configuration requires distinguishing between four crawler categories: search engine crawlers (Googlebot, Bingbot), AI inference crawlers (OAI-SearchBot, PerplexityBot, Claude-SearchBot), AI training crawlers (GPTBot, ClaudeBot, Common Crawl's CCBot), and generic scrapers. A publisher pursuing the training-block-with-inference-allowed strategy would allow Googlebot, Bingbot, and all standard search crawlers unconditionally; allow OAI-SearchBot, PerplexityBot, and Claude-SearchBot unconditionally; and disallow GPTBot, CCBot, and similar training crawlers for high-value content directories while leaving marketing or public content open to them. Google is a special case: AI Overviews draw on pages crawled by Googlebot, so there is no separate inference token to allow, while Google-Extended is the robots.txt token that controls whether your content is used for Gemini training. The robots.txt entries for GPTBot and CCBot follow standard disallow syntax. The key mistake to avoid is a blanket User-agent: * Disallow: / rule, which blocks every crawler that lacks its own group, including Googlebot, and tanks organic search. Every robots.txt change for AI crawlers must be surgical, targeting specific user-agent tokens rather than wildcards, and must be audited after implementation to confirm it has not inadvertently blocked inference or search crawlers.
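The policy and the audit step described above can be sketched with Python's standard-library `urllib.robotparser`. The example embeds two robots.txt bodies as strings: a selective policy that restricts GPTBot and CCBot under `/premium/` (a hypothetical directory name), and the blanket rule to avoid. The `example.com` URLs and the `audit` helper are assumptions for illustration. Python's parser is only one interpretation of the rules, and individual crawlers may resolve Allow/Disallow conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Selective policy: only the named training crawlers are restricted, and only
# under /premium/ (a hypothetical path). Crawlers with no matching group and
# no wildcard group are allowed everywhere.
SELECTIVE = """\
User-agent: GPTBot
Disallow: /premium/

User-agent: CCBot
Disallow: /premium/
"""

# The blanket rule warned against: it applies to every crawler without
# its own group, including search engines.
BLANKET = """\
User-agent: *
Disallow: /
"""

AGENTS = ["Googlebot", "Bingbot", "OAI-SearchBot", "PerplexityBot", "GPTBot", "CCBot"]


def audit(robots_txt: str, agents, url: str) -> dict:
    """Return {agent: may_fetch} for one URL under the given robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


premium = audit(SELECTIVE, AGENTS, "https://example.com/premium/report")
public = audit(SELECTIVE, AGENTS, "https://example.com/blog/post")
blanket = audit(BLANKET, AGENTS, "https://example.com/blog/post")

for label, result in [("premium", premium), ("public", public), ("blanket", blanket)]:
    blocked = [agent for agent, allowed in result.items() if not allowed]
    print(f"{label}: blocked = {blocked}")
```

Under the selective policy only GPTBot and CCBot are blocked, and only for the premium path; under the blanket rule every agent in the list is blocked, search crawlers included. Running a check like this after each robots.txt change is one way to perform the audit.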
What is the emerging legal framework for AI training data access in 2026?
Three distinct legal approaches are emerging in 2026, and they differ by jurisdiction: unresolved fair-use litigation in the United States, opt-out regimes in the EU and UK, and a permissive exception in Japan. In the United States, the foundational question of whether training on copyrighted content constitutes fair use remains unresolved at the circuit court level, with multiple cases in active litigation. The New York Times case against OpenAI and Microsoft is the most watched, with a ruling expected in late 2026 or 2027. In the European Union, the EU AI Act and its implementing measures require providers of general-purpose AI models to publish a sufficiently detailed summary of their training content and to honor rights-holders' opt-outs under the Text and Data Mining exception in the DSM Directive. In practice, this means EU-based publishers have a stronger legal basis for requiring licensing agreements. In the UK, the government's proposed amendments to copyright law for AI training are still in parliamentary process but lean toward an opt-out regime similar to the EU's. Japan has the most permissive framework globally, generally treating AI training as non-infringing under its 2018 copyright amendments. For most publishers, the practical implication is that the EU framework offers the strongest near-term leverage for monetization conversations.
Topics: AEO, AI Training Data, robots.txt, Data Licensing, Copyright, AI Policy