OpenAI Accused of Training GPT-4o on Copyrighted Books Illegally

A new study from the Social Science Research Council’s AI Disclosures Project suggests that OpenAI’s latest AI model, GPT-4o, may have been trained on non-public, copyrighted book content without authorization. The working paper, Beyond Public Access in LLM Pre-Training Data, uses a legally obtained dataset of 34 copyrighted O’Reilly Media books to test whether OpenAI’s models recognize paywalled content at an unusually high rate.

Using the DE-COP membership inference attack method, the researchers found that GPT-4o achieved an AUROC score of 82% on non-public O’Reilly book content, compared with lower recognition rates for publicly available book samples. In contrast, GPT-3.5 Turbo, an earlier OpenAI model, showed greater recognition of public content, with AUROC scores of 64% (public) vs. 54% (non-public).

Meanwhile, OpenAI’s GPT-4o Mini, a smaller and less capable version, appeared to have little to no knowledge of either public or non-public O’Reilly Media content, with AUROC scores hovering around 50%, indicating random chance.
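For readers unfamiliar with the metric, AUROC here measures how reliably a membership inference test separates text a model has seen from text it has not: 50% is random chance, 100% is perfect separation. The sketch below is a generic illustration with made-up scores, not the paper’s actual DE-COP implementation; it computes AUROC directly from its rank-statistic definition.

```python
def auroc(member_scores, nonmember_scores):
    """AUROC as a rank statistic: the probability that a randomly chosen
    member (in-training) sample scores higher than a randomly chosen
    non-member sample, with ties counted as half a win."""
    pairs = len(member_scores) * len(nonmember_scores)
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / pairs

# Hypothetical "recognition" scores from a membership inference test.
seen = [0.91, 0.84, 0.77, 0.70]    # books assumed to be in the training data
unseen = [0.35, 0.42, 0.28, 0.50]  # books assumed to be outside it

print(f"AUROC: {auroc(seen, unseen):.2f}")  # clear separation -> 1.00
```

An AUROC near 0.50, as reported for GPT-4o Mini, means the scores for seen and unseen books are statistically indistinguishable; values well above 0.50, like GPT-4o’s 82% on non-public content, are what raise the memorization question.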

"These results highlight the urgent need for increased corporate transparency regarding pre-training data sources as a means to develop formal licensing frameworks for AI content training," the researchers wrote.

The study underscores concerns over AI companies training models on paywalled or proprietary data without disclosure or consent. The researchers note that “such access violations might have occurred via the LibGen database, as all of the O’Reilly books tested were found in it.”

This finding aligns with ongoing legal disputes over AI training practices, including lawsuits from publishers and news organizations challenging whether AI companies are scraping proprietary content.

The report also warns that OpenAI and other companies have advocated for pre-training exemptions from copyright law, which, if granted, could allow AI firms to train on copyrighted materials without compensation.

"If adopted, copyright holders and content creators may be unable to sustain themselves and their creations, with profound implications for the survival of the internet’s traffic-driven business model," the researchers caution.

The AI Disclosures Project, led by technologist Tim O’Reilly and economist Ilan Strauss, is focused on improving transparency in AI training practices. The study argues that clear disclosure standards for AI model training could function similarly to financial disclosure regulations—establishing trust, preventing exploitation, and encouraging innovation.

"Disclosures are vital for well-functioning markets yet remain lacking in AI," the report states. It further suggests that “standardized AI disclosures can build trust, expedite adoption, and spur innovation.”

While AI companies often cite proprietary concerns when refusing to reveal their training datasets, the researchers suggest that more transparency could facilitate commercial markets for training data licensing and remuneration.

"Liability provisions that incentivize improved corporate transparency in disclosing data provenance may be an important step to facilitating commercial markets for training data licensing and remuneration," the study proposes.

With AI models becoming increasingly reliant on vast amounts of training data, the tension between AI developers and content creators is unlikely to subside anytime soon. This latest report adds to the growing body of evidence that suggests some AI models may be built, in part, on data they were never meant to access.

Data Sources:

  1. Social Science Research Council, AI Disclosures Project. Beyond Public Access in LLM Pre-Training Data (2024).
  2. Carlini, N., et al. Membership Inference Attacks Against Machine Learning Models (2021). arXiv.
  3. Bhattacharjee, Y. How Sci-Hub is Blurring the Lines of Copyright Infringement (2022). Science Magazine.
  4. The New York Times Co. v. OpenAI, Microsoft (2023).
  5. Copyright Office Hearing on AI Training Exemptions (2024). U.S. Copyright Office.
  6. O’Reilly, T. The Future of AI Transparency and Disclosures (2024). O’Reilly Media.
  7. Strauss, I. AI Data Markets and the Economics of Transparency (2024). Social Science Research Council.