Stability AI, in partnership with its CarperAI lab, has introduced two innovative large language models, FreeWilly1 and FreeWilly2. These models, now accessible for non-commercial use, have shown remarkable performance across a variety of benchmarks, securing leading positions on the Hugging Face Open LLM Leaderboard.
Superior Language Comprehension
Built on Meta's LLaMA 65B and LLaMA 2 70B foundation models, FreeWilly1 and FreeWilly2 were fine-tuned on a synthetically generated dataset using supervised fine-tuning (SFT) in the standard Alpaca format. Notably, FreeWilly2 rivals GPT-3.5, the model that originally powered OpenAI's ChatGPT, on certain tasks.
Anel Islamovic, a spokesperson for Stability AI, pointed to the models' benchmark results. “Both models demonstrate exceptional reasoning ability across varied benchmarks,” Islamovic said. “FreeWilly1 leverages the original LLaMA 65B foundation model and was carefully fine-tuned with a new synthetically-generated dataset using Supervised Fine-Tune (SFT) in standard Alpaca format. Similarly, FreeWilly2 leverages the LLaMA 2 70B foundation model to reach a performance that compares favorably with GPT-3.5 for some tasks.”
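For readers unfamiliar with the Alpaca format mentioned above, the following is a minimal sketch of how a single instruction record can be turned into an SFT training string. The prompt templates follow the ones published by the Stanford Alpaca project; the article does not say which exact template Stability AI used, so the template text, the function names, and the sample record here are illustrative assumptions rather than the company's actual pipeline.

```python
# Sketch only: templates follow the Stanford Alpaca project; whether
# FreeWilly used these exact strings is an assumption, not stated in the article.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def format_alpaca_prompt(example: dict) -> str:
    """Render the prompt part of an Alpaca-style record (hypothetical helper)."""
    if example.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=example["instruction"], input=example["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=example["instruction"])


def build_sft_text(example: dict) -> str:
    """Concatenate prompt and target output into one training string."""
    return format_alpaca_prompt(example) + example["output"]


if __name__ == "__main__":
    # Made-up record, used only to show the shape of the data.
    record = {
        "instruction": "Name the capital of France.",
        "input": "",
        "output": "Paris",
    }
    print(build_sft_text(record))
```

In a real fine-tuning run, an end-of-sequence token would typically be appended to the output, and the loss is often computed only on the response tokens; both details are omitted here for brevity.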
Cutting-Edge Training Methodology
The training approach for the FreeWilly models was inspired by Microsoft's methodology, as outlined in its paper Orca: Progressive Learning from Complex Explanation Traces of GPT-4. Stability AI adapted this approach, however, using different data sources. The training dataset of 600,000 data points consists of high-quality instructions generated by language models, drawn from datasets created by Enrico Shippole. Although the dataset is a tenth the size of the one used in the original Orca paper, the models have demonstrated exceptional performance, which Stability AI sees as validating its approach to synthetically generated datasets.
Setting a New Benchmark in Open Access LLMs
To evaluate the models' performance, Stability AI used EleutherAI's lm-eval-harness, with the addition of AGIEval, a human-centric benchmark for evaluating foundation models. The results showed that both FreeWilly models excel at intricate reasoning, understanding linguistic subtleties, and answering complex questions in specialized domains such as law and mathematical problem-solving. Stability AI researchers independently verified the results, which Hugging Face reproduced on July 21, 2023, and published on its leaderboard, where it evaluates and ranks LLMs and chatbots as they are released.
Emphasis on Responsible Release Practices
Stability AI emphasized its focus on responsible release practices for FreeWilly. According to the company, the models underwent internal red team testing for potential harms, and it is actively encouraging external feedback to further strengthen its safety measures.
FreeWilly1 and FreeWilly2 mark a significant milestone for open access LLMs. Stability AI expects them to advance research, improve natural language understanding, and enable complex tasks.
“We are excited about the endless possibilities that these models will bring to the AI community, and the new applications they will inspire,” Islamovic said. “We would like to express our sincere gratitude to our passionate team of researchers, engineers, and collaborators, whose remarkable efforts and dedication have enabled us to reach this significant milestone.”