The $200B AI Data War: Why the Next Moat Isn't the Model — It's the Training Set
Reddit sold its data for $203 million. Anthropic paid $1.5 billion to settle a piracy lawsuit. The New York Times is demanding billions from OpenAI. AI companies spent $816.7 million on content licensing in 2024, and high-quality text data will be exhausted by 2028. The AI race quietly shifted from compute to data — and the companies sitting on the richest troves of human-generated content aren't AI companies at all.
By James Whitfield, Enterprise SaaS · Mar 9, 2026
AI companies spent $816.7 million on content licensing in 2024. Reddit made $203M from data deals. Anthropic paid a $1.5B copyright settlement. With high-quality text data projected to run out by 2028, the AI moat has shifted from models and compute to training data. A breakdown of every deal, lawsuit, and market force reshaping the AI data economy.
Frequently Asked Questions
How much are AI companies paying for training data?
AI companies spent $816.7 million on content licensing in 2024, with an average deal size of $24 million. Total committed spending across all known deals reached $2.92 billion. The largest individual deals include News Corp's $250 million five-year agreement with OpenAI ($50M/year), Reddit's combined $203 million in licensing revenue (including $60M/year from Google and $70M/year from OpenAI), Stack Overflow's $200M+ in licensing deals, and Shutterstock's $104 million in AI licensing revenue in 2023 alone. OpenAI accounts for 53% of all licensing spending, followed by Google at 12%, Microsoft at 9%, and Meta at 6%. The total AI training data market was valued at $2.3-2.9 billion in 2024 and is projected to reach $3.9-7.5 billion by 2026.
What is the Anthropic Bartz copyright settlement?
Bartz v. Anthropic resulted in a $1.5 billion settlement in 2025, the largest copyright settlement in United States history. The case involved approximately 500,000 pirated works that Anthropic used to train its Claude AI models; the payout works out to roughly $3,000 per pirated book. Critically, the presiding judge ruled that training AI on piracy-sourced material does not qualify as fair use under US copyright law. The ruling set an important precedent because it distinguished between copyrighted works that were legally obtained and those sourced through piracy, making the method of data acquisition a key factor in fair use determinations for AI training.
Is AI training on copyrighted data fair use?
The legal landscape is still unsettled. As of early 2026, there have been three federal fair use rulings related to AI training: two favored the AI companies and one went against them. No appellate court has issued a decision yet. Thomson Reuters v. ROSS Intelligence was the first ruling against fair use for AI training. In Bartz v. Anthropic, the judge ruled that piracy-sourced training data is not protected by fair use. Meanwhile, in Getty v. Stability AI in the UK, a court found that model weights are not 'copies' of training data, complicating copyright claims. Over 70 AI copyright lawsuits had been filed by late 2025, more than double the roughly 30 on file at the end of 2024. The NYT v. OpenAI case, with summary judgment scheduled for April 2, 2026, may become the most consequential ruling in this area.
What is the AI training data wall problem?
The 'data wall' refers to the projected exhaustion of high-quality text data available for AI training. Research from Epoch AI predicts that quality text data — the kind needed to meaningfully improve frontier models — will be exhausted between 2026 and 2028. The problem is structural: the internet's stock of human-generated text is finite, and LLMs have already consumed most of it. Common Crawl, which holds 9.5+ petabytes across 250 billion+ web pages and supplied 80%+ of GPT-3's training tokens, has already been used by two-thirds of all large language models. As models get larger and more capable, they require exponentially more data, but the supply of novel, high-quality human text is growing linearly at best. This is why exclusive data licensing deals and proprietary data sources have become the next competitive frontier.
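A rough back-of-envelope sketch shows why those curves cross. The token counts and growth rates below are illustrative assumptions for the sake of the arithmetic, not figures from Epoch AI: a frontier training run in the low tens of trillions of tokens, demand that multiplies each year, and a stock of quality human text that grows only a few percent annually.

```python
# Illustrative "data wall" arithmetic -- all numbers are assumptions, not Epoch AI figures.
# - STOCK_TOKENS: assumed total stock of high-quality human text available for training
# - DEMAND_2024: assumed tokens consumed by a frontier training run in 2024
# - DEMAND_GROWTH: assumed yearly multiplier on training-set size
# - STOCK_GROWTH: assumed yearly growth of the text stock (roughly linear, so a small multiplier)
STOCK_TOKENS = 300e12
DEMAND_2024 = 15e12
DEMAND_GROWTH = 2.5
STOCK_GROWTH = 1.05

stock, demand = STOCK_TOKENS, DEMAND_2024
for year in range(2024, 2035):
    print(f"{year}: demand ~ {demand / 1e12:6.0f}T tokens | stock ~ {stock / 1e12:6.0f}T tokens")
    if demand >= stock:
        print(f"Under these assumptions, demand overtakes the available stock around {year}.")
        break
    demand *= DEMAND_GROWTH
    stock *= STOCK_GROWTH
```

With these particular assumptions the lines cross around 2028, at the upper end of the 2026-2028 window cited above; nudging the assumed growth rates moves the crossover by a year or two in either direction, which is why the projections are expressed as a range rather than a single date.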
How much is the AI training data market worth?
The AI training data market was valued at $2.3-2.9 billion in 2024 and is projected to reach $3.9-7.5 billion by 2026. The synthetic data segment, which is seen as a partial solution to the data wall problem, was worth approximately $486-587 million in 2025 and is projected to reach $3.1-7.2 billion by 2032-2033. Scale AI, the largest data labeling and curation company, reached a $29 billion valuation with $870 million in revenue in 2024 and $2 billion projected for 2025. Meta invested $14.3 billion for a 49% stake in Scale AI, though that deal triggered customer flight — both OpenAI and Google cut ties with Scale AI over concerns about data neutrality.
Which companies have the best AI data moats?
The strongest data moats belong to platforms with large volumes of unique, human-generated content that cannot be replicated. Reddit holds 20+ years of threaded human conversation across millions of communities and has monetized this at $203 million through deals with Google and OpenAI. Stack Overflow controls the canonical repository of developer knowledge and earned over $200 million from licensing despite a 76% traffic collapse. Shutterstock holds hundreds of millions of licensed images and earned $104 million from AI licensing in 2023, projecting $250 million by 2027. News Corp leveraged its global journalism portfolio for a $250 million OpenAI deal. Getty Images holds one of the largest curated visual datasets. Companies generating unique proprietary data at scale — including platforms like Spotify, Duolingo, and LinkedIn — hold undervalued data assets as AI companies exhaust public training data sources.