
Scale AI Steps Up to Forge U.S. Defense Department’s AI Evaluation Framework

Pentagon partners with AI firm to test language models used in military applications. The initiative builds a robust evaluation framework to ensure AI reliability.


The Pentagon's Chief Digital and Artificial Intelligence Office (CDAO) has contracted Scale AI, a San Francisco-based tech firm, to develop a comprehensive framework for assessing large language models (LLMs). This collaboration aims to ensure the reliability and safety of technologies that have the potential to augment military planning and operations. The project, announced by the CDAO, responds to the growing need for a robust testing and evaluation (T&E) procedure that can accurately measure the performance of complex AI systems within the Department of Defense (DoD).

Enhancing Military Decision-making with AI

Large language models represent a class of AI that can generate text, images, and other media responses from human prompts. While promising for military applications, their complexity introduces challenges in ensuring their reliability and appropriateness for sensitive military contexts. The new framework from Scale AI is set to address these challenges by providing the DoD with the tools necessary for deploying AI capabilities safely. It will offer benchmarks for model performance, real-time feedback mechanisms, and specialized evaluation sets tailored for military needs. These advancements are designed to complement the objectives of Task Force Lima and leverage generative AI technologies to their full potential.

A Rigorous Process for Trustworthy AI

The Scale AI initiative will implement a rigorous T&E process, reflecting the intricacies of evaluating generative AI. Unlike the more straightforward T&E methods used for other types of algorithms, evaluating large language models requires a nuanced approach due to the variability in linguistic expression and the lack of absolute “ground truth” in language-based responses. Scale AI plans to incorporate “holdout datasets,” which will include input from DoD insiders to ensure that AI responses meet the high standards expected in military contexts. This method aims to fine-tune AI models to operate within the specific needs of the DoD, ensuring their applications are both reliable and relevant.
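To make the holdout-dataset idea concrete, here is a minimal, hypothetical sketch of how model responses might be scored against expert-curated references. The data, scoring rule, and function names are illustrative assumptions only; the actual DoD evaluation sets and metrics described in the article are not public.

```python
# Hypothetical sketch: scoring model outputs against a held-out
# reference set curated by domain experts. Since language responses
# lack a single "ground truth", a soft score (here, keyword overlap)
# stands in for exact-match accuracy.

def keyword_overlap(response: str, reference_keywords: set) -> float:
    """Fraction of expected keywords present in the model response."""
    tokens = set(response.lower().split())
    if not reference_keywords:
        return 0.0
    return len(reference_keywords & tokens) / len(reference_keywords)

def evaluate_holdout(model_fn, holdout: list) -> float:
    """Average keyword-overlap score across a holdout dataset.

    Each holdout item pairs a prompt with keywords a reviewer
    expects to see in an acceptable answer.
    """
    scores = [
        keyword_overlap(model_fn(item["prompt"]), item["keywords"])
        for item in holdout
    ]
    return sum(scores) / len(scores)

# Toy model and holdout set, purely for illustration
def toy_model(prompt: str) -> str:
    return "route the convoy via the northern corridor at dawn"

holdout = [
    {"prompt": "Plan a convoy movement.", "keywords": {"convoy", "route"}},
    {"prompt": "Recommend a departure time.", "keywords": {"dawn"}},
]

score = evaluate_holdout(toy_model, holdout)
print(f"mean holdout score: {score:.2f}")
```

In a real pipeline, the keyword heuristic would be replaced by richer judgments (human review, rubric scoring, or model-assisted grading), but the structure stays the same: responses are compared against expert expectations that the model never saw during training.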

Furthermore, the ambition is to automate the T&E process as much as possible, enabling efficient and ongoing assessment of AI models as technology evolves. By establishing a set of evaluation metrics and model cards, DoD officials will gain a clear understanding of each model's strengths and potential limitations within secure environments. This meticulous approach underscores the commitment to integrating AI technologies in a way that enhances the robustness and operational efficacy of the U.S. military's capabilities.
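A "model card" of the kind described above can be sketched as a simple structured record that pairs evaluation metrics with stated limitations. The field names and values below are illustrative assumptions; the CDAO's actual card format has not been published.

```python
# Hypothetical sketch of a model card summarizing evaluation results
# for decision-makers. All fields and figures are invented examples.

from dataclasses import dataclass, field

@dataclass
class ModelCard:
    name: str
    version: str
    intended_use: str
    metrics: dict = field(default_factory=dict)
    known_limitations: list = field(default_factory=list)

    def summary(self) -> str:
        lines = [f"{self.name} v{self.version}: {self.intended_use}"]
        lines += [f"  {k}: {v:.2f}" for k, v in sorted(self.metrics.items())]
        lines += [f"  limitation: {item}" for item in self.known_limitations]
        return "\n".join(lines)

card = ModelCard(
    name="example-llm",
    version="1.0",
    intended_use="summarizing unclassified planning documents",
    metrics={"holdout_score": 0.87, "refusal_rate": 0.03},
    known_limitations=["not evaluated on classified material"],
)
print(card.summary())
```

Keeping strengths and limitations in one machine-readable record is what allows the assessment to be automated and re-run as models evolve, which is the efficiency goal the paragraph above describes.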

Scale AI's collaboration with the Pentagon signifies a crucial step towards the responsible deployment of AI in national defense. By setting new benchmarks for the evaluation of generative AI, this project promises to pave the way for advanced technologies that bolster military effectiveness while ensuring the highest standards of safety and reliability. Last August, Scale partnered with OpenAI to bring model fine-tuning to enterprise users.

Luke Jones
Luke has been writing about all things tech for more than five years. He is following Microsoft closely to bring you the latest news about Windows, Office, Azure, Skype, HoloLens and all the rest of their products.
