An investigation by Proof News has revealed that Apple, NVIDIA, and Anthropic have used transcripts from over 173,000 YouTube videos to train their artificial intelligence models without obtaining the necessary permissions. The dataset, assembled by the nonprofit EleutherAI, draws on material from more than 48,000 YouTube channels, raising serious ethical questions about the unapproved use of this content.
Composition of the Dataset
The dataset in question primarily comprises video transcripts and not actual footage. Included are transcripts from popular YouTube figures such as Marques Brownlee and MrBeast, as well as major news platforms like The New York Times, BBC, and ABC News. Content from The Verge, Vox, and Engadget is also part of this compilation.
Brownlee, a prominent tech commentator, voiced his concerns on X (formerly Twitter). He pointed out that his transcripts were among the data scraped for AI model training without his consent, highlighting a growing problem for content creators who are unaware that their work is being used in this way.
Apple has sourced data for their AI from several companies
One of them scraped tons of data/transcripts from YouTube videos, including mine
Apple technically avoids “fault” here because they’re not the ones scraping
But this is going to be an evolving problem for a long time https://t.co/U93riaeSlY
— Marques Brownlee (@MKBHD) July 16, 2024
Legal and Ethical Concerns
Google, YouTube’s owner, has reiterated that using YouTube data to train AI systems breaches its terms of service. This stance was confirmed by YouTube CEO Neal Mohan and Alphabet CEO Sundar Pichai. Despite these clear terms, Apple, NVIDIA, and Anthropic have not addressed the findings of Proof News’ investigation.
The opacity of the data sources used for AI training remains a significant issue. This month, Apple drew criticism for not disclosing the origins of the data used to train its generative AI tool, Apple Intelligence. The lack of transparency raises further concerns over data privacy as the tool is set to be integrated into millions of Apple devices.
EleutherAI and The Pile
The identified transcripts are part of a broader collection known as The Pile, maintained by EleutherAI. This compilation also includes a variety of other datasets, ranging from books to Wikipedia articles. Previously, the Books3 dataset within The Pile came under scrutiny when it was revealed which specific works it contained, resulting in lawsuits from affected authors.
To accompany its findings, Proof News has launched an interactive tool. This resource allows users to check whether their content or that of their favorite YouTubers is included in the dataset, promoting a level of transparency for affected creators.
Last Updated on November 7, 2024 3:35 pm CET