Hugging Face has announced FineWeb, a substantial dataset designed to improve the pretraining of large language models (LLMs). The dataset, which consists of 15 trillion tokens and spans 44 terabytes, is one of the most comprehensive ever released for LLM training.
FineWeb is derived from 96 CommonCrawl snapshots. This dataset surpasses previous benchmarks like RefinedWeb and C4 in terms of both size and variety. Its extensive and carefully curated content is set to significantly enhance LLM capabilities.
Rigorous Deduplication and Filtering
Significant effort went into deduplication using MinHash, a fuzzy-hashing technique that flags near-duplicate documents. Removing this repetitive content reduces redundancy and makes training more efficient. The dataset underwent both per-snapshot (individual) and global deduplication, with per-snapshot deduplication proving especially effective at retaining high-quality data.
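The core idea behind MinHash can be sketched in a few lines of Python: documents are broken into overlapping word n-grams ("shingles"), and agreement between per-seed minimum hashes estimates the Jaccard similarity of two documents. The shingle size, seed count, and use of MD5 below are illustrative assumptions, not FineWeb's actual configuration:

```python
import hashlib

def shingles(text, n=5):
    # Split the text into overlapping word n-grams ("shingles").
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_perm=64):
    # For each seed, keep the minimum hash over all shingles.
    # Two similar documents share many shingles, so their minima often agree.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text)
        ))
    return sig

def jaccard_estimate(sig_a, sig_b):
    # The fraction of matching signature slots approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a real pipeline these fixed-size signatures, rather than full documents, are compared (typically via locality-sensitive hashing buckets), which is what makes fuzzy deduplication tractable at trillion-token scale.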
The dataset also includes advanced filtering to remove low-quality content. Initial measures involved language classification and URL filtering to exclude non-English text and adult content. Building on C4's groundwork, additional heuristic filters were applied, such as removing documents dominated by boilerplate and discarding lines that do not end in terminal punctuation.
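One such C4-style line-level heuristic can be sketched as follows. The word-count threshold and punctuation set here are assumptions for illustration, not the exact values used for FineWeb:

```python
TERMINAL_PUNCT = (".", "!", "?", '"')

def c4_style_filter(text, min_words=5):
    # Keep only lines that end in terminal punctuation and contain enough
    # words; this strips menus, buttons, and other boilerplate fragments.
    kept = [
        line for line in text.splitlines()
        if line.strip().endswith(TERMINAL_PUNCT) and len(line.split()) >= min_words
    ]
    # Drop the document entirely if no line survives.
    return "\n".join(kept) if kept else None
```

For example, a page consisting only of navigation labels ("Home | About | Contact") is dropped outright, while a mixed page keeps its sentence-like lines and loses the rest.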
Educational Component: FineWeb-Edu
FineWeb also features a subset named FineWeb-Edu, specifically tailored for academic content. This educational subset was created using synthetic annotations generated by Llama-3-70B-Instruct, which evaluated 500,000 samples on their educational merit. A classifier trained on these annotations was applied to the entire dataset, isolating content of academic value. The result is a dataset containing 1.3 trillion tokens, optimized for academic benchmarks like MMLU, ARC, and OpenBookQA.
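This two-stage approach, annotating a sample with a large LLM and then applying a cheap trained classifier at scale, can be mimicked with a toy scorer. The keyword-based `mock_edu_score` below merely stands in for the real classifier, and the 0-5 scale with a threshold of 3 is an assumption for illustration:

```python
def mock_edu_score(text):
    # Stand-in for the trained classifier: counts education-flavored keywords.
    # A real pipeline would score with the model distilled from the LLM
    # annotations; this keyword proxy is purely illustrative.
    keywords = ("theorem", "experiment", "learn", "explain", "definition")
    return min(5, sum(word in text.lower() for word in keywords))

def build_edu_subset(docs, threshold=3):
    # Keep only documents whose educational score clears the threshold.
    return [d for d in docs if mock_edu_score(d) >= threshold]
```

The design point is that the expensive LLM judges only a small sample (500,000 documents here), while the distilled classifier scores all 15 trillion tokens cheaply.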
FineWeb has undergone rigorous testing against several benchmarks, consistently outperforming other open web-scale datasets. Its effectiveness has been demonstrated through a series of “early-signal” benchmarks using small models, including evaluations on CommonsenseQA, HellaSwag, and OpenBookQA. FineWeb-Edu showed outstanding improvements, validating the use of synthetic annotations for educational content filtering.
Recent Hugging Face Breach
Hugging Face is currently managing a data breach of its Spaces platform, having discovered unauthorized access to a limited amount of sensitive information. In response, the company revoked potentially compromised tokens and emailed affected users. To improve security, it recommends switching to new, fine-grained access tokens, which provide more control over access to AI models.
Hugging Face will eventually replace “classic” read and write tokens with fine-grained access tokens once feature parity is achieved. The company remains committed to strengthening security across its infrastructure and is continuing to investigate any related incidents.
Last Updated on November 7, 2024 7:54 pm CET