The Battle Over AI Training Data
Major Publishers Opt Out of Apple's AI Scraping

Edited By
Mackenzie Ferguson
AI Tools Researcher & Implementation Consultant
This summer, Apple introduced Applebot-Extended, a tool that lets websites opt out of having their data used to train Apple's AI models. Prominent news outlets and social media platforms, including The New York Times, Facebook, and Instagram, quickly took advantage of it. The move signals growing unease among web publishers about how AI training bots use their data.
Apple's new tool, Applebot-Extended, extends the capabilities of its existing web crawler, Applebot. The extension lets website owners prevent their data from being used for Apple's AI training without affecting their visibility in Apple's search products such as Siri and Spotlight. The distinction is crucial: publishers can protect their intellectual property while maintaining their presence in Apple's ecosystem.
The mechanism through which publishers can block Applebot-Extended is the Robots Exclusion Protocol, commonly known as robots.txt. This protocol has been a standard for managing web crawlers for decades. However, the rise of AI and the increased focus on training data have brought new attention to this once-arcane text file. Many publishers have already begun to update their robots.txt files to block data scraping by other AI companies like OpenAI and Anthropic.
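For a publisher, the opt-out amounts to a few lines in that file. Below is a minimal robots.txt sketch of the pattern: the Applebot-Extended token governs AI-training use, while the regular Applebot token, which handles search indexing, is deliberately left alone. The GPTBot entry is included purely to illustrate how sites pair such rules for other AI crawlers.

```
# Opt out of Apple's AI training. Note that Applebot, Apple's search
# crawler, is not blocked here, so search visibility is unaffected.
User-agent: Applebot-Extended
Disallow: /

# Publishers often add parallel rules for other AI crawlers, e.g. OpenAI's:
User-agent: GPTBot
Disallow: /
```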
Despite Apple's efforts, Applebot-Extended is still relatively under the radar. An analysis by AI-detection startup Originality AI and AI agent watchdog service Dark Visitors found that only around 6-7% of high-traffic websites are currently blocking Applebot-Extended. This suggests that many website owners may be either unaware of the new tool or indifferent to Apple’s data usage practices.
A deeper dive into how specific industries are responding reveals a mixed picture. Data journalist Ben Welsh's recent analysis shows that around 25% of surveyed news websites block Applebot-Extended, compared with 53% blocking OpenAI's bot and 43% blocking Google-Extended. This suggests that Apple's new tool may simply not yet be as well known among publishers as those from other major AI players.
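Surveys like these can be approximated with nothing more than the standard library. Here is a minimal sketch, assuming Python 3.9+ and a hand-picked site list (the URLs below are placeholders): it fetches each site's robots.txt and asks urllib.robotparser whether each crawler token may fetch the homepage.

```python
from urllib import robotparser

# User-agent tokens for the AI crawlers discussed above.
AI_BOTS = ["Applebot-Extended", "GPTBot", "Google-Extended"]

# Placeholder survey list; substitute the sites you want to check.
SITES = ["https://example-news-site.com", "https://another-outlet.example"]

def blocked_bots(site: str) -> dict[str, bool]:
    """Map each bot token to True if the site's robots.txt bars it from '/'."""
    base = site.rstrip("/")
    parser = robotparser.RobotFileParser()
    parser.set_url(base + "/robots.txt")
    parser.read()  # fetch and parse; a missing file means everything is allowed
    return {bot: not parser.can_fetch(bot, base + "/") for bot in AI_BOTS}

if __name__ == "__main__":
    for site in SITES:
        try:
            status = blocked_bots(site)
        except OSError as exc:  # network failures, bad hostnames, etc.
            print(f"{site}: could not fetch robots.txt ({exc})")
            continue
        report = ", ".join(
            f"{bot}: {'blocked' if hit else 'allowed'}"
            for bot, hit in status.items()
        )
        print(f"{site} -> {report}")
```

Real surveys like Welsh's track many more tokens and handle redirects, timeouts, and wildcard rules more carefully; this sketch only illustrates the mechanics.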
Economic considerations also play a significant role in the decisions publishers make regarding data access. Many publishers, such as Condé Nast and BuzzFeed, have historically blocked AI bots unless a paid partnership is in place. This points to a strategic approach to data sharing, in which companies negotiate deals that benefit them financially rather than blocking AI training bots outright.
Some media executives are closely involved in deciding which bots to block; at several outlets, CEOs weigh in on these decisions directly. That executive-level attention shows how central managing AI access has become to preserving the value and integrity of published content.
Legal implications are also at the forefront. The New York Times, which is suing OpenAI for copyright infringement, argues that the opt-out nature of Applebot-Extended is insufficient. The Times maintains that scraping content for commercial purposes without explicit permission violates copyright law, regardless of technical blockades like robots.txt files.
The landscape of AI training data and its management is evolving rapidly. With major entities like Apple introducing tools like Applebot-Extended, the ongoing battle over data usage is visible in the updates to robots.txt files across the web. This seemingly small text file has become a battleground for one of the most significant technological advancements of our time.
In summary, Apple's introduction of Applebot-Extended is part of a broader trend where web publishers are taking a more active role in managing how their data is used for AI training. While the adoption rate of this new tool is still growing, it represents a crucial effort to balance visibility and intellectual property rights in an increasingly data-driven landscape.