Canonical Tags in the AI Search Era: How LLMs Handle Duplicate Content Differently Than Google
AI bot traffic hit 30 to 40 percent of edge requests at major publisher properties by Q1 2026. The CDN configuration you ship in the next 90 days decides whether those bots are nearly free or burn six figures of origin bandwidth.
By Raj Patel, AI & Infrastructure · May 25, 2026
CDN edge cache strategy for AI bots in 2026: Cloudflare, Fastly, and Akamai patterns for GPTBot, ClaudeBot, and PerplexityBot without burning origin bandwidth.
Frequently Asked Questions
How much of my CDN traffic is now AI bots in 2026?
For most content-heavy properties, AI bot traffic sits between 18 and 42 percent of total edge requests as of Q1 2026, with the median publisher landing near 31 percent according to Cloudflare's AI Bot Traffic report from February 2026. The composition has shifted dramatically since 2024. GPTBot and OAI-SearchBot together account for roughly 11 to 14 percent of bot traffic across the Cloudflare network, ClaudeBot and anthropic-ai for another 7 to 9 percent, PerplexityBot for 5 to 8 percent, and Google-Extended for 3 to 5 percent. The remaining 10 to 15 percent comes from a long tail of smaller training crawlers such as Common Crawl's CCBot, search-specific agents, and a growing population of unidentified scrapers using residential proxies. If you have not measured this on your own property in the last 60 days, you almost certainly have more bot traffic than you think, and you are paying for it at human-traffic rates.
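To measure this on your own property, a first pass can be as simple as classifying the User-Agent values in a sample of edge logs. The sketch below is a minimal illustration, not a production bot detector: the operator labels and the function names `classify_user_agent` and `bot_share` are hypothetical, and matching on User-Agent substrings will miss scrapers that spoof or omit their identity. Google-Extended is left out because it is a robots.txt control token rather than a distinct User-Agent string.

```python
from collections import Counter

# Substrings these crawlers include in their User-Agent header.
# The operator labels on the right are an illustrative grouping.
AI_BOT_TOKENS = {
    "GPTBot": "openai",
    "OAI-SearchBot": "openai",
    "ClaudeBot": "anthropic",
    "anthropic-ai": "anthropic",
    "PerplexityBot": "perplexity",
    "CCBot": "common-crawl",
}


def classify_user_agent(user_agent: str) -> str:
    """Return an operator label for known AI crawlers, else 'other'."""
    lowered = user_agent.lower()
    for token, label in AI_BOT_TOKENS.items():
        if token.lower() in lowered:
            return label
    return "other"


def bot_share(user_agents: list[str]) -> dict[str, float]:
    """Fraction of the supplied requests attributed to each label."""
    counts = Counter(classify_user_agent(ua) for ua in user_agents)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}
```

Feeding `bot_share` the User-Agent column from a day of sampled edge logs gives a rough per-operator breakdown that you can compare against the network-wide figures above.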
Should I block GPTBot and ClaudeBot to save bandwidth?
Almost never. Blocking these crawlers removes you from the training corpus and the live retrieval set that feeds AI search results, which is the single largest source of new referral traffic for many publishers in 2026. The right move is to serve them efficiently from your CDN edge rather than block them at origin. With aggressive edge caching, a typical GPTBot crawl costs you fractional cents per million requests because the bot is overwhelmingly hitting cached objects. The economics flip only if you are seeing pathological crawl patterns: a single user agent fetching the same URL hundreds of times per hour, or hitting expensive endpoints like search results or personalized pages. In those cases, the correct response is targeted rate limiting and cache-control rules, not a wholesale block. The Cloudflare AI Audit and Fastly Edge Cloud features now expose this data clearly enough that the decision should be data-driven, not reflexive.
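The pathological case described above, one user agent fetching the same URL hundreds of times per hour, can be detected with a sliding-window counter. The following is a sketch under assumptions: `CrawlRateMonitor` is a hypothetical helper, the default threshold of 100 fetches per hour is an illustrative number rather than a recommendation, and a real deployment would use the CDN's own rate-limiting rules rather than in-process state.

```python
from collections import defaultdict, deque


class CrawlRateMonitor:
    """Flags (bot, url) pairs fetched more than `limit` times per window."""

    def __init__(self, limit: int = 100, window_seconds: float = 3600.0):
        self.limit = limit
        self.window_seconds = window_seconds
        # (bot, url) -> timestamps of fetches still inside the window
        self._hits = defaultdict(deque)

    def record(self, bot: str, url: str, now: float) -> bool:
        """Record one fetch; return True if this pair exceeds the limit."""
        hits = self._hits[(bot, url)]
        hits.append(now)
        # Drop timestamps that have aged out of the sliding window.
        while hits and hits[0] <= now - self.window_seconds:
            hits.popleft()
        return len(hits) > self.limit
```

A `True` result is a signal to throttle that specific bot and URL pair, which keeps the response targeted instead of blocking the crawler outright.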
What is stale-while-revalidate and why does it matter for AI crawlers?
Stale-while-revalidate is an HTTP Cache-Control extension, defined in RFC 5861, that tells a cache to serve a stale response immediately while asynchronously fetching a fresh copy from origin. For AI crawlers, this is the single most important cache pattern to get right. AI bots tolerate slightly stale content well because they are not building real-time experiences and they typically re-crawl on a multi-day cadence anyway. By combining max-age with stale-while-revalidate windows of 24 to 72 hours for evergreen content, you let the bot get an instant edge response even when the cached object has technically expired, provided the object is still in the edge cache and within the stale window, while your origin handles only one revalidation request rather than every crawler hit. Cloudflare, Fastly, and Akamai all support the directive natively. Web.dev documents it as the recommended pattern for content that is mostly static but occasionally updated. Combined with surrogate keys for purging, it is the foundation of efficient AI crawler serving.
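To make the mechanics concrete, here is a minimal sketch of the header and of the decision a cache makes for an object of a given age. The function names and return labels are hypothetical, and the logic is simplified: it ignores validators, `stale-if-error`, and eviction, all of which a real CDN also considers.

```python
def cache_control(max_age: int, stale_while_revalidate: int) -> str:
    """Build a Cache-Control value with a stale-while-revalidate window."""
    return (
        f"public, max-age={max_age}, "
        f"stale-while-revalidate={stale_while_revalidate}"
    )


def cache_decision(age: int, max_age: int, stale_while_revalidate: int) -> str:
    """Decide how a cache treats a stored object that is `age` seconds old."""
    if age < max_age:
        # Still within its freshness lifetime: no origin contact needed.
        return "serve_fresh"
    if age < max_age + stale_while_revalidate:
        # Expired, but inside the stale window: answer from cache now
        # and refresh from origin in the background.
        return "serve_stale_and_revalidate"
    # Past both windows: the request has to wait for origin.
    return "fetch_from_origin"
```

For example, `cache_control(3600, 86400)` yields `public, max-age=3600, stale-while-revalidate=86400`: one hour of freshness followed by a 24-hour window in which crawlers still get an immediate edge response.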
Can I serve different cache policies to GPTBot than to human users?
Yes, and you should. Most CDNs let you key cache rules off the User-Agent header or a derived bot-class signal, and applying longer time-to-live values to crawler traffic is one of the highest-leverage configurations available. A common pattern is to serve human users a 5-minute browser cache and a 60-minute edge cache, while serving identified AI crawlers an edge cache of 24 to 72 hours with a stale-while-revalidate window of an additional 7 days. The risk to avoid is content cloaking. Google has explicitly clarified that serving a different content body to crawlers than to users violates its webmaster guidelines, but serving the same body with different cache headers is fine. Cloudflare AI Audit, Fastly Compute@Edge, and Akamai EdgeWorkers all expose bot-class detection that you can use to apply these rules without writing your own user-agent parser, and the controls are auditable from the dashboard.
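The split policy above can be sketched as a small lookup that changes only the cache headers, never the body. This is an illustration under assumptions: `CachePolicy` and `policy_for` are hypothetical names, the crawler TTL uses the 24-hour end of the range, and expressing the edge TTL as `s-maxage` is one option among several, since each CDN also has its own rule engine or edge-specific header for this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePolicy:
    browser_ttl: int             # seconds, sent as max-age
    edge_ttl: int                # seconds, sent as s-maxage
    stale_while_revalidate: int  # seconds, 0 means omit the directive

    def header(self) -> str:
        parts = [
            "public",
            f"max-age={self.browser_ttl}",
            f"s-maxage={self.edge_ttl}",
        ]
        if self.stale_while_revalidate:
            parts.append(
                f"stale-while-revalidate={self.stale_while_revalidate}"
            )
        return ", ".join(parts)


# Humans: 5-minute browser cache, 60-minute edge cache.
HUMAN = CachePolicy(browser_ttl=300, edge_ttl=3600, stale_while_revalidate=0)
# AI crawlers: 24-hour edge cache plus a 7-day stale window.
AI_CRAWLER = CachePolicy(
    browser_ttl=300, edge_ttl=86400, stale_while_revalidate=604800
)

# Illustrative substring match; a CDN's bot-class signal is more robust.
AI_CRAWLER_TOKENS = (
    "gptbot", "oai-searchbot", "claudebot", "anthropic-ai", "perplexitybot",
)


def policy_for(user_agent: str) -> CachePolicy:
    """Pick a cache policy by bot class; the response body is unchanged."""
    ua = user_agent.lower()
    if any(token in ua for token in AI_CRAWLER_TOKENS):
        return AI_CRAWLER
    return HUMAN
```

Keeping the policy as data separate from the classifier makes it straightforward to swap the substring match for whatever bot-class signal your CDN exposes.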
How do I track AI crawler frequency at the edge without overloading my origin?
Use edge key-value storage to record crawl counters per URL per crawler, then sample to your analytics system rather than logging every request. Cloudflare Workers KV, Fastly KV Store, and Akamai EdgeKV all support low-latency writes from the edge, with eventual consistency that is acceptable for analytical use. The standard pattern is to increment a counter keyed on URL and a bot-class label on every crawler-classified request, batch the counters into 60-second windows, and ship the aggregated deltas to your data warehouse. This gives you per-URL, per-crawler frequency data without sending raw logs to origin. You can then identify which URLs are being over-crawled (candidates for longer TTLs or sitemap deprioritization) and which are being missed (candidates for sitemap promotion or origin pre-warming). Doing this at the edge keeps your origin out of the analytics path entirely, which is the whole point.
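The counting pattern can be sketched as follows. This is a platform-neutral illustration, not any vendor's API: `WindowedCrawlCounter` is a hypothetical class, and a plain in-memory dictionary stands in for the edge key-value store. It shows the two steps the pattern depends on, bucketing increments into 60-second windows and flushing only windows that have fully closed.

```python
from collections import defaultdict


class WindowedCrawlCounter:
    """Counts crawler hits per (url, bot_class) in fixed time windows."""

    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        # window start -> {(url, bot_class): count}
        # An in-memory dict stands in for the edge key-value store here.
        self._windows = defaultdict(lambda: defaultdict(int))

    def _window_start(self, now: float) -> int:
        return int(now // self.window_seconds) * self.window_seconds

    def record(self, url: str, bot_class: str, now: float) -> None:
        """Increment the counter for this request's window."""
        self._windows[self._window_start(now)][(url, bot_class)] += 1

    def flush(self, now: float) -> list[dict]:
        """Return and remove the aggregated deltas for closed windows."""
        current = self._window_start(now)
        closed = sorted(w for w in self._windows if w < current)
        batch = []
        for window in closed:
            for (url, bot_class), count in self._windows.pop(window).items():
                batch.append({
                    "window_start": window,
                    "url": url,
                    "bot_class": bot_class,
                    "count": count,
                })
        return batch
```

Each dictionary returned by `flush` is one aggregated row for the data warehouse, so the volume shipped scales with the number of distinct URL and crawler pairs per window rather than with raw request count.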
Topics: AEO, CDN, Edge Cache, AI Crawlers, Performance, Infrastructure