Nvidia is facing accusations of scraping large quantities of video content from platforms like YouTube and Netflix to train its AI models. According to 404 Media, the initiative, referred to internally as Cosmos, is meant to improve Nvidia's AI capabilities for applications such as its Omniverse 3D world-building tool and autonomous vehicle systems.
Operating Methods and Legal Defenses
According to documents and internal communications, Nvidia employees were directed to use tools like yt-dlp, an open-source YouTube downloader, along with virtual machines to get around IP-address-based restrictions and download substantial amounts of video content. Nvidia maintains that its methods comply with copyright law and fall within fair use. An Nvidia spokesperson said, “Our models and our research efforts are in full compliance with the letter and the spirit of copyright law,” adding that copyright law protects specific expressions but not underlying facts, ideas, or data.
Employees involved in the Cosmos project expressed concerns regarding the legality and ethics of using copyrighted material without permission. Internal conversations suggest that management often dismissed these concerns, saying the matter had been settled by “executive decisions.” Sources indicate that Nvidia scraped content from various libraries, including academic datasets designated for non-commercial use. For instance, the HD-VG-130M dataset from Peking University, one of the datasets reportedly involved, is meant purely for research purposes.
Wider Industry Context and Upcoming Regulations
Nvidia's practices underscore a pervasive issue in the AI sector, where firms like Runway and OpenAI also face scrutiny over their data collection techniques. The law on using copyrighted content to train AI remains unsettled, with several lawsuits in progress and increasing calls for transparency. Proposed legislation, such as the AI Foundation Model Transparency Act and the Generative AI Copyright Disclosure Act, seeks to establish clearer norms and mandate disclosure of the data sources used to train AI.
YouTube has reiterated its stance that scraping its content violates its terms of service. In a message to Engadget, YouTube's policy communications manager pointed to an earlier statement by CEO Neal Mohan, who emphasized that using YouTube content for AI training contravenes the platform's terms. To avoid YouTube's detection mechanisms, Nvidia reportedly rotated IP addresses using virtual machines: employees were instructed to restart AWS instances to acquire new public IP addresses, thus evading bans.
Diverse Data Sources and Internal Approvals
Reports also indicate that Nvidia instructed its staff to use various other datasets, including MovieNet's database of movie trailers, internal video game libraries, and datasets hosted on GitHub such as WebVid and InternVid-10M. The WebVid dataset has since been removed following a cease-and-desist order. Some of the datasets Nvidia allegedly used were intended for academic or non-commercial purposes only. Despite this, Nvidia claimed that these resources were permissible for its commercial AI projects.
Internal communications suggest a culture within Nvidia that prioritizes technological development over legal and ethical concerns. Employees raising ethical or legal objections were reportedly told by their managers that the practices had high-level approval within the company.