Half of Top News Sites Blocked OpenAI’s Crawlers in 2023



At the end of 2023, nearly one-half (48%) of the top news websites by reach across 10 countries blocked OpenAI’s crawlers, while nearly one-quarter (24%) blocked Google’s AI crawler, according to a study by the Reuters Institute.

The Reuters Institute analyzed the robots.txt files of the 15 online news sources with the widest reach, including titles like The New York Times, BuzzFeed News, The Wall Street Journal, The Washington Post, CNN and NPR, across countries including Germany, India, Spain, the U.K. and the U.S.

In the absence of clear regulatory frameworks governing generative artificial intelligence’s use of copyrighted material, many large publishers have taken matters into their own hands, taking AI firms to court, updating terms of service, blocking crawlers or making deals to protect premium content, data and revenues.

The study grouped outlets into three categories: legacy print publications, television and radio broadcasters and digital-born outlets.

Over one-half (57%) of the websites of legacy print publications, such as The New York Times, blocked OpenAI’s crawlers by the end of 2023, compared with 48% of television and radio broadcasters and 31% of digital-born outlets.

Similarly, 32% of print outlets blocked Google’s crawlers, while 19% of broadcasters and 17% of digital-born outlets did the same.

“The Reuters study highlights a fundamental challenge for generative AI: its dependence on authentic content generated by real people who see it as a threat to their livelihoods,” said Gartner VP distinguished analyst Andrew Frank.

Meanwhile, a recent study by Cornell University found that when new AI models are trained on data derived from prior models rather than human input, they tend to undergo “model collapse,” or degenerate, leading to increased errors and misinformation in the generated output.

“This suggests that large language model developers need to find ways to compensate people who create or report true content, not just for the sake of society, but also for their own commercial interests,” said Frank.

Website crawlers are deployed for many reasons. Crawlers like Google’s Googlebot index publisher websites for the tech giant’s search results. OpenAI’s crawler, GPTBot, collects data from across the internet to train the company’s large language models, such as those behind ChatGPT. That data helps AI tools generate accurate, up-to-date output, and news publishers are uniquely positioned to supply it: LLMs overweight premium publishers’ content by a factor of between 5 and 100. Meanwhile, AI-powered solutions are emerging as alternatives to traditional search engines.
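For readers curious how a robots.txt block works in practice, the check can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is only an illustration, not the Reuters Institute's actual method. The sample robots.txt content, the `blocked_agents` helper and the `example.com` URL are all invented for the example. `Google-Extended` is assumed here to be the token meant by "Google's AI crawler"; the article does not name it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for illustration only. It blocks two
# AI-related user agents while leaving all other crawlers unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents, url="https://example.com/"):
    """Return {agent: True if the agent may NOT fetch `url`}.

    `blocked_agents` is a hypothetical helper, not part of any library.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


result = blocked_agents(
    SAMPLE_ROBOTS_TXT, ["GPTBot", "Google-Extended", "Googlebot"]
)
print(result)
# {'GPTBot': True, 'Google-Extended': True, 'Googlebot': False}
```

In this sample, the site stays in ordinary search results because Googlebot is still allowed, while the two AI-related tokens are turned away. Note that robots.txt is a request that crawlers choose to honor, not an enforcement mechanism.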
