
Hugging Face Unveils Open LLM Leaderboard v2 With Chinese Model on Top

The latest leaderboard by Hugging Face measures LLMs using four key tasks with six benchmarks.


Hugging Face has introduced an updated leaderboard for open large language models (LLMs), highlighting the strong performance of Chinese AI models. Alibaba's Qwen models have captured three positions in the top ten, showing their abilities across a range of tasks.

Improved Evaluation Criteria and Benchmarks

The latest leaderboard by Hugging Face measures LLMs using four key tasks: knowledge assessment, reasoning with extended contexts, complex mathematics, and instruction following.

Six benchmarks underpin these evaluations, which range from answering science questions to generating truthful responses and solving high-school-level math problems.

MMLU-Pro is an enhanced version of the MMLU dataset, featuring ten answer choices instead of four and requiring more reasoning on questions. It has been expertly reviewed to reduce noise, making it a higher-quality and more challenging benchmark.

GPQA (Google-Proof Q&A Benchmark) is a highly difficult knowledge dataset designed by domain experts to be challenging for laypersons but manageable for experts. The dataset is access-restricted to minimize contamination and ensure accurate evaluation of models' knowledge and reasoning abilities.

MuSR (Multistep Soft Reasoning) is a dataset of complex, algorithmically generated problems around 1,000 words long, including murder mysteries and team allocation optimizations. Solving these problems requires advanced reasoning and long-range context parsing, with most models performing no better than random.

MATH (Mathematics Aptitude Test of Heuristics) is a compilation of high-school-level competition math problems, formatted consistently with LaTeX for equations and Asymptote for figures. The benchmark focuses on the hardest problems and tests models' mathematical reasoning and problem-solving skills.

IFEval (Instruction Following Evaluation) tests models' abilities to follow explicit instructions accurately, such as adhering to specific formatting or keyword inclusion. The evaluation emphasizes precision in following instructions rather than content quality.

BBH (Big Bench Hard) is a subset of 23 challenging tasks from the BigBench dataset, chosen for their objective metrics, difficulty, and sufficient sample sizes for statistical significance. The tasks include multistep arithmetic, algorithmic reasoning, language understanding, and world knowledge, correlating well with human preference.
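To make the IFEval idea more concrete, the short Python sketch below shows what automatically checkable instructions such as keyword inclusion or a fixed bullet count might look like. It is only an illustration: the function names, the sample response, and the scoring rule are assumptions made for this example and are not taken from IFEval or from Hugging Face's evaluation code.

```python
import re

# Hypothetical checkers, written for illustration only; they are not
# IFEval's actual implementation.

def has_keywords(response: str, keywords: list[str]) -> bool:
    """True if every keyword appears in the response (case-insensitive)."""
    lowered = response.lower()
    return all(keyword.lower() in lowered for keyword in keywords)

def bullet_count_is(response: str, expected: int) -> bool:
    """True if the response has exactly `expected` lines starting with '-' or '*'."""
    bullets = [line for line in response.splitlines()
               if re.match(r"^\s*[-*]\s+", line)]
    return len(bullets) == expected

def within_word_limit(response: str, limit: int) -> bool:
    """True if the response has at most `limit` whitespace-separated tokens."""
    return len(response.split()) <= limit

def instruction_score(response: str, checks) -> float:
    """Fraction of instruction checks the response satisfies."""
    results = [check(response) for check in checks]
    return sum(results) / len(results)

# A made-up model response and three made-up instructions.
response = "- Qwen leads the board\n- Llama follows"
checks = [
    lambda r: has_keywords(r, ["qwen", "llama"]),  # must mention both names
    lambda r: bullet_count_is(r, 2),               # must use exactly two bullets
    lambda r: within_word_limit(r, 5),             # must stay within five tokens
]
score = instruction_score(response, checks)
```

Each check looks only at whether the instruction was followed, not at whether the answer is good, which mirrors the emphasis on precision over content quality described for IFEval. Here the response meets the keyword and bullet instructions but exceeds the word limit, so it satisfies two of the three checks.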

Top Performers and Notable Absences

Alibaba's Qwen models have emerged as top contenders, securing the 1st, 3rd, and 10th spots. Meta's Llama3-70B also appears on the list, along with several smaller open-source projects outperforming many well-established models. OpenAI's ChatGPT is absent from the leaderboard since Hugging Face focuses solely on open-source models to guarantee reproducibility.

Infrastructure and Evaluation Process

The evaluations leverage Hugging Face's infrastructure, which utilizes 300 Nvidia H100 GPUs. The platform's open-source nature allows for new model submissions, with popular ones getting prioritized via a voting system. Users can filter the leaderboard to highlight significant models, preventing an overload of minor entries.

Hugging Face's initial leaderboard, launched last year, quickly became a popular tool for comparing LLM performance. However, models began to be overly optimized for the specific benchmarks, causing a performance decline in real-world applications. This overfitting issue led to the creation of the second leaderboard aimed at providing a more comprehensive and meaningful evaluation.

Meta's Performance and Over-Specialization

Meta's updated Llama models have shown weaker performance on the new leaderboard than in prior rankings. This decline is attributed to their specialization on the earlier benchmarks, negatively impacting their real-world usefulness. This situation highlights the necessity of diverse training data to sustain robust AI performance.

Hugging Face updates the leaderboard weekly, enabling ongoing evaluation and enhancement of models. This approach ensures that the rankings reflect the latest performance data. Detailed analysis of each model's performance across individual benchmarks is also provided, giving insights into their strengths and weaknesses.

The leaderboard's open-source framework promotes transparency and reproducibility, with all models and their evaluation results available for public scrutiny. This design aims to mitigate the risk of models being overly fine-tuned to specific benchmarks by incorporating a diverse array of evaluation criteria.

Markus Kasanmascheff
Markus is the founder of WinBuzzer and has been playing with Windows and technology for more than 25 years. He holds a Master's degree in International Economics and previously worked as Lead Windows Expert for Softonic.com.