
Investigation Finds Apple, NVIDIA, Anthropic Used YouTube Transcripts for AI Training

Tech giants Apple, NVIDIA, and Anthropic reportedly trained AI using transcripts from 173,000 YouTube videos without permission.


An investigation by Proof News has revealed that Apple, NVIDIA, and Anthropic used transcripts from over 173,000 YouTube videos to train their artificial intelligence models without obtaining the necessary permissions. The dataset, assembled by the nonprofit organization EleutherAI, draws on material from more than 48,000 YouTube channels, raising serious ethical questions about the unapproved use of this content.

Composition of the Dataset

The dataset in question consists of video transcripts rather than the footage itself. It includes transcripts from popular YouTube figures such as Marques Brownlee and MrBeast, as well as major news outlets like The New York Times, BBC, and ABC News. Content from The Verge, Vox, and Engadget is also part of the compilation.

Brownlee, a prominent tech commentator, voiced his concerns on X (formerly Twitter). He pointed out that his transcripts were among the data scraped for AI model training without his consent, indicating a growing issue for content creators who are unaware that their work is being utilized in this manner.

Legal and Ethical Concerns

Google, YouTube’s owner, has reiterated that using YouTube data to train AI systems breaches its terms of service. This stance was confirmed by YouTube CEO Neal Mohan and Alphabet CEO Sundar Pichai. Despite these clear rules, Apple, NVIDIA, and Anthropic have not addressed the findings of Proof News’ investigation.

The opacity of the data sources used for AI training remains a significant issue. This month, Apple drew criticism for not disclosing the origins of the data used to train its generative AI tool, Apple Intelligence. The lack of transparency raises further concerns over data privacy as the tool is set to be integrated into millions of Apple devices.

EleutherAI and The Pile

The identified transcripts are part of a broader collection known as The Pile, maintained by EleutherAI. This compilation also includes a variety of other datasets, ranging from books to Wikipedia articles. Previously, scrutiny of the Books3 dataset within The Pile revealed which specific works it contained, resulting in lawsuits from affected authors.

To accompany its findings, Proof News has launched an interactive tool. This resource allows users to check whether their content or that of their favorite YouTubers is included in the dataset, promoting a level of transparency for affected creators.

Last Updated on November 7, 2024 3:35 pm CET

Luke Jones
Luke has been writing about Microsoft and the wider tech industry for over 10 years. With a degree in creative and professional writing, Luke looks for the interesting spin when covering AI, Windows, Xbox, and more.
