Symflower has introduced DevQualityEval, a new benchmark and framework for evaluating the quality of code produced by large language models (LLMs). The tool is designed to help developers improve LLM performance in real software development settings by providing a standardized way to measure and compare how well different LLMs generate high-quality code.
Using DevQualityEval v0.4.0, Symflower analyzed 138 different LLMs for code generation in Java and Go. The evaluation involved sorting the models by score and cost, removing weaker models within the same vendor family, and renaming the remaining models for clarity. The results show that while GPT-4 Turbo offers superior capabilities, Meta's Llama-3 70B scores nearly as high and is significantly more cost-effective. Anthropic's Claude 3 Sonnet and Haiku models, Mistral Medium, and WizardLM-2 8x22B, Microsoft AI's most advanced Wizard model, which was withdrawn in April due to missed toxicity checks, also achieved comparable code quality.
The DevQualityEval framework features tasks that simulate real-world programming scenarios, such as writing unit tests in different programming languages. It provides metrics such as code compilation success rates, test coverage ratios, and qualitative evaluations of code style and accuracy, allowing developers to gauge the capabilities and limitations of different LLMs.
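To make the task type concrete, the hypothetical Go snippet below shows the shape of such an exercise: a small function with two branches, and the kind of compiling, fully covering test a model would be rewarded for producing. The function and test are illustrative only and are not taken from the benchmark's own task repositories.

```go
// divide_test.go — a hypothetical, self-contained illustration of the kind
// of task DevQualityEval poses: given a small function, produce compiling
// tests that reach full coverage. Save as a *_test.go file and run `go test`.
package example

import (
	"errors"
	"testing"
)

// divide is the function under test. It is defined here only to keep the
// example in a single file; in the benchmark the code under test would live
// in the package being evaluated.
func divide(a, b int) (int, error) {
	if b == 0 {
		return 0, errors.New("division by zero")
	}
	return a / b, nil
}

// TestDivide covers both branches, the happy path and the error path,
// which is what full statement coverage requires for this function.
func TestDivide(t *testing.T) {
	if got, err := divide(10, 2); err != nil || got != 5 {
		t.Errorf("divide(10, 2) = %d, %v; want 5, nil", got, err)
	}
	if _, err := divide(1, 0); err == nil {
		t.Error("divide(1, 0): expected an error, got nil")
	}
}
```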
Comparative Insights and Practical Performance
DevQualityEval assesses models on how accurately and efficiently they solve programming tasks. It allocates points for factors such as error-free responses, the inclusion of executable code, and the attainment of complete test coverage. The framework also evaluates each model's token economy and the relevance of its responses, deducting points for verbosity or irrelevance. This emphasis on functional performance makes DevQualityEval a useful resource for developers aiming to deploy LLMs in real-world settings.
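As a rough illustration of this scoring idea, the Go sketch below assigns points for the criteria listed above. The struct fields, weights, and penalty rate are assumptions chosen for demonstration and do not reflect Symflower's actual implementation, which measures compilation and coverage by building and running the generated code.

```go
// score.go — an illustrative sketch (not Symflower's scoring code) of the
// point allocation described above: reward error-free responses, executable
// and compiling code, and test coverage; penalize excess verbosity.
package scoring

// Result captures the facts about a single model response that this
// hypothetical scorer cares about.
type Result struct {
	ResponseWithoutError bool // the model returned a valid, error-free response
	ContainsCode         bool // the response included executable code
	CompilingCode        bool // the extracted code compiled
	StatementsCovered    int  // statements reached by the generated tests
	ExcessTokens         int  // tokens beyond what the task required ("chattiness")
}

// Score turns a Result into points. All weights here are hypothetical;
// the real benchmark defines its own scoring rules.
func Score(r Result) int {
	points := 0
	if r.ResponseWithoutError {
		points++
	}
	if r.ContainsCode {
		points++
	}
	if r.CompilingCode {
		points++
	}
	// Coverage is rewarded per covered statement.
	points += r.StatementsCovered
	// Verbose or irrelevant output is penalized (hypothetical penalty rate).
	points -= r.ExcessTokens / 100
	return points
}
```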
Setting up DevQualityEval is straightforward. Developers need to install Git and Go, clone the repository, and run the installation commands. The benchmark can then be executed using the `eval-dev-quality` binary, which generates detailed logs and evaluation results. Developers can specify which models to evaluate and obtain comprehensive reports in formats such as CSV and Markdown. The framework currently supports openrouter.ai as the LLM provider, with plans to expand support to additional providers.
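For readers who want to script repeated runs, a minimal Go wrapper such as the sketch below could invoke the binary. It assumes `eval-dev-quality` is already installed and on the PATH; the `evaluate` subcommand, the `--model` flag, and the example model identifier are taken from the project's documentation at the time of writing and should be checked against the current README, since the CLI may change between versions.

```go
// runeval.go — a minimal sketch for automating a benchmark run from Go.
// It simply shells out to the eval-dev-quality binary and prints its output.
package main

import (
	"fmt"
	"log"
	"os/exec"
)

func main() {
	// Model identifiers follow openrouter.ai naming; this ID is only an example.
	model := "openrouter/meta-llama/llama-3-70b-instruct"

	// Assumed CLI: `eval-dev-quality evaluate --model <model-id>`.
	cmd := exec.Command("eval-dev-quality", "evaluate", "--model", model)
	out, err := cmd.CombinedOutput()
	if err != nil {
		log.Fatalf("evaluation failed: %v\n%s", err, out)
	}
	fmt.Printf("%s", out)
}
```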
These insights help developers make informed decisions based on their requirements and budget constraints. The evaluation also highlighted the need for a more nuanced treatment of costs and of how "chattiness", that is, overly verbose responses, affects cost-effectiveness.
Future versions of DevQualityEval will include additional features such as stability assessments, more detailed coverage reports, and more complex test generation cases. Symflower encourages feedback and collaboration from the community to further improve the benchmark and its evaluations.