Microsoft Debuts Phi-4 Reasoning Models, Aiming for Big Performance Gains

Microsoft has released three AI models: Phi-4-reasoning and Phi-4-reasoning-plus (both 14B parameters) and the compact Phi-4-mini-reasoning (3.8B), all showing strong reasoning performance.

Microsoft has introduced a trio of new artificial intelligence models under its Phi banner, intensifying its focus on smaller, efficient systems capable of complex problem-solving. The company released Phi-4-reasoning and Phi-4-reasoning-plus, both containing 14 billion parameters, alongside the compact Phi-4-mini-reasoning, which has 3.8 billion parameters.

According to Microsoft, "Reasoning models are trained to leverage inference-time scaling to perform complex tasks that demand multi-step decomposition and internal reflection." The new models aim to deliver performance comparable to much larger AI systems while remaining efficient. They are available now through Microsoft's Azure AI Foundry and the Hugging Face platform under permissive licenses.

Pushing Reasoning Boundaries with Fewer Parameters

The central claim is that these smaller models can hold their own against industry heavyweights. Microsoft’s technical documentation asserts that Phi-4-reasoning-plus, enhanced through reinforcement learning, performs competitively with OpenAI’s o3-mini and approaches the capability of DeepSeek-R1 (a 671B parameter model) on certain mathematical evaluations like the AIME 2025 test.

Both 14B models reportedly outperform Anthropic’s Claude 3.7 Sonnet and Google’s Gemini 2 Flash Thinking on most benchmarks, though exceptions were noted for GPQA science questions and BA-Calendar planning tasks. The technical report highlights significant gains over the base Phi-4 on general benchmarks too, with Phi-4-reasoning-plus showing a 22-point improvement on IFEval (instruction following) and a 10-point gain on ArenaHard (human preference evaluation).

However, the report also cautions about performance variance, noting that on the 30-question AIME 2025 benchmark, accuracy for models like DeepSeek-R1-Distill-Llama-70B can range from 30% to 70% across 50 runs, making single-run comparisons potentially unreliable.
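Given that spread, a more robust practice is to report accuracy averaged over repeated runs with its standard deviation rather than a single score. A toy illustration (the run scores below are invented for demonstration, not figures from the report):

```python
import statistics

# On a 30-question benchmark, per-run accuracy can swing widely, so a
# single run is a noisy estimate. Averaging many runs, and reporting the
# spread, gives a fairer comparison. These numbers are made up.
run_accuracies = [0.30, 0.47, 0.53, 0.70, 0.50, 0.43, 0.60, 0.37]

mean = statistics.mean(run_accuracies)
spread = statistics.stdev(run_accuracies)
print(f"mean accuracy {mean:.2f} +/- {spread:.2f} over {len(run_accuracies)} runs")
```

With only 30 questions, a single wrong answer moves accuracy by more than 3 points, which is why the report's 50-run protocol matters.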

Phi-4-mini-reasoning, despite its 3.8B parameter size, is reported to surpass models like OpenThinker-7B on several math benchmarks and supports an extensive 128,000-token context length with a 200k+ vocabulary size. Microsoft stated these models “balance size and performance,” allowing “even resource-limited devices to perform complex reasoning tasks efficiently.”

Inside the Training Process and Model Specifications

Achieving this performance involved specific training strategies. Phi-4-reasoning is a supervised fine-tuning (SFT) of the original Phi-4 base model, using over 1.4 million examples with reasoning steps generated by OpenAI’s o3-mini.

This SFT process, using data with a public cutoff of March 2025, occurred between January and April 2025. Phi-4-reasoning-plus adds a layer of reinforcement learning on top, primarily using mathematical problems and Group Relative Policy Optimization (GRPO), an algorithm that refines model outputs based on relative preferences among different generated responses.

This results in higher accuracy in math but also produces responses that are, on average, 1.5 times longer than Phi-4-reasoning's, a difference less pronounced in coding and planning tasks. The Phi-4-mini-reasoning model was trained separately in February 2025 on over a million synthetic math problems (sourced from DeepSeek R1 output) covering a wide range of difficulty.
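The group-relative idea behind GRPO can be sketched in a few lines: sample several responses to the same prompt, score each one, and normalize every score against its own group's mean and spread, so no separate value network is needed. This is a simplified illustration of the core computation, not Microsoft's training code:

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled response is scored
    against the mean (and spread) of its own group of samples."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against uniform groups
    return [(r - mean) / std for r in rewards]

# Four responses to the same math prompt, scored by a verifier (1.0 = correct):
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
# Responses above the group mean get positive advantage and are reinforced.
```

In full GRPO these advantages weight a clipped policy-gradient update, but the group-normalization step above is what distinguishes it from PPO-style training with a learned critic.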

To accommodate the detailed reasoning chains, the 14B models had their context capacity doubled from the original Phi-4’s 16k to 32k tokens. Microsoft also suggests specific inference settings (like temperature 0.8) for optimal results with the Phi-4-reasoning-plus model.
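In practice those settings map onto standard sampling parameters passed at generation time. A sketch of such a configuration, where temperature 0.8 comes from the article and the other values are illustrative assumptions:

```python
# Decoding settings for Phi-4-reasoning-plus. Temperature 0.8 is the value
# Microsoft suggests per the article; top_p and max_new_tokens are
# illustrative assumptions, not documented recommendations.
phi4_reasoning_plus_settings = {
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.95,
    "max_new_tokens": 4096,  # leave headroom for long reasoning chains
}

# Typical usage with a Hugging Face transformers model:
#   model.generate(**inputs, **phi4_reasoning_plus_settings)
```

Long reasoning chains are the reason the token budget matters: with the 32k context, a generous `max_new_tokens` avoids truncating the model's step-by-step work.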

Evolution of the Phi Family and Strategic Context

The launch marks a continuation of Microsoft's Phi project, which began gaining attention with the original 14B-parameter Phi-4 in December 2024. That initial Phi-4 model was noted for strong math performance, achieving a 91.8 score on AMC 12 tests, ahead of competitors like Gemini Pro 1.5 (89.8) at the time. Microsoft followed up by fully open-sourcing Phi-4 in January 2025, releasing its weights on Hugging Face under an MIT license.

At that time, Microsoft engineer Shital Shah posted on X, "A lot of folks had been asking us for weight release… Few even uploaded bootlegged phi-4 weights on HuggingFace 😬. Well, wait no more. We are releasing today official phi-4 model on HuggingFace! With MIT licence!!" The family saw further expansion in February 2025 with the addition of a different text-based mini model and the Phi-4-multimodal variant. The current reasoning models build directly on the SFT and synthetic data techniques used previously.

The models underscore Microsoft's strategy of cultivating highly capable smaller models (often termed Small Language Models, or SLMs) alongside its investments in large-scale AI like OpenAI's GPT series. SLMs are gaining industry interest due to potential advantages like reduced training costs and easier domain-specific fine-tuning. This approach targets efficiency and accessibility, potentially lowering the barrier for enterprises and developers. Microsoft integrates Phi models into its ecosystem, such as the Phi Silica variant optimized for NPUs in Copilot+ PCs.

For broader access, Phi-4-mini-reasoning is also available in GGUF, a popular format for running models locally on consumer hardware, via projects like Unsloth. Microsoft emphasized that the Phi models are developed following its Responsible AI principles, though it acknowledges limitations such as the 32k-token context for the 14B models and the primary focus on English.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.
