
Half of Top News Sites Blocked OpenAI’s Crawlers in 2023



At the end of 2023, nearly one-half (48%) of the top news websites by reach across 10 countries blocked OpenAI’s crawlers, while nearly one-quarter (24%) blocked Google’s AI crawler, according to a study by the Reuters Institute.

The Reuters Institute analyzed the robots.txt files of the 15 online news sources with the widest reach in each country, including titles like The New York Times, BuzzFeed News, The Wall Street Journal, The Washington Post, CNN and NPR. The countries covered include Germany, India, Spain, the U.K. and the U.S.

In the absence of clear regulatory frameworks governing generative artificial intelligence’s use of copyrighted material, many large publishers have taken matters into their own hands by suing AI firms, updating terms of service, blocking crawlers or striking deals to protect premium content, data and revenues.

The study grouped outlets into three categories: legacy print publications, television and radio broadcasters and digital-born outlets.

Over one-half (57%) of the websites of legacy print publications, such as The New York Times, blocked OpenAI’s crawlers by the end of 2023, compared with 48% of television and radio broadcasters and 31% of digital-born outlets.

Similarly, 32% of print outlets blocked Google’s crawlers, while 19% of broadcasters and 17% of digital-born outlets did the same.

“The Reuters study highlights a fundamental challenge for generative AI: its dependence on authentic content generated by real people who see it as a threat to their livelihoods,” said Gartner VP distinguished analyst Andrew Frank.

Meanwhile, a recent study by Cornell University found that when new AI models are trained on data derived from prior models rather than human input, they tend to suffer “model collapse,” or degenerate, leading to increased errors and misinformation in the generated output.

“This suggests that large language model developers need to find ways to compensate people who create or report true content, not just for the sake of society, but also for their own commercial interests,” said Frank.

Website crawlers are deployed for many reasons. Crawlers like Google’s Googlebot index publisher websites for the tech giant’s search results. OpenAI’s crawler, GPTBot, collects data across the internet to train the large language models behind products such as ChatGPT. That data helps AI tools generate accurate, up-to-date output, and news publishers are uniquely positioned to provide it: LLMs overweight premium publishers’ content by a factor of between 5 and 100. AI-powered solutions are also emerging as alternatives to traditional search engines.
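
To illustrate how a block of this kind can be detected, here is a minimal Python sketch that checks which crawlers a robots.txt file turns away. It is not the Reuters Institute's method. The sample file, the `news.example` URL and the `blocked_agents` helper are hypothetical, and the sketch assumes that `Google-Extended` is the robots.txt token behind the “Google’s AI crawler” figure.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, written for illustration only.
# It is not copied from any real publisher.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str],
                   url: str = "https://news.example/") -> dict[str, bool]:
    """Map each user agent to True if robots.txt disallows it from `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, url) for agent in agents}


# GPTBot and Google-Extended are the AI-related tokens assumed here;
# Googlebot is included as the search-indexing comparison.
result = blocked_agents(
    SAMPLE_ROBOTS_TXT, ["GPTBot", "Google-Extended", "Googlebot"]
)
print(result)
```

In this sample, the site stays open to Googlebot for search indexing while refusing the two AI-related agents. Note that robots.txt is a voluntary convention: it signals a publisher's wishes but only works if the crawler chooses to honor it.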
