DeepSeek is betting that aligned AI models don’t have to be trained endlessly—they need better ways to reason through their outputs as they generate them. In collaboration with Tsinghua University, the company has introduced a new method called Self-Principled Critique Tuning (SPCT), a generative reward modeling technique designed to operate during inference rather than requiring large-scale preference data during training.
SPCT was introduced in a research paper published on April 4 and tested in a model called DeepSeek-GRM-27B. The results are striking.
Rather than depending on static human annotations, SPCT enables models to refine their outputs dynamically using self-generated principles and critique loops during inference. The result: reduced costs, better scalability, and state-of-the-art performance with smaller models.
At its core, SPCT is an inference-first approach that achieves high-quality alignment by optimizing how models reason about their own responses. The 27-billion-parameter DeepSeek-GRM model using SPCT achieves an MT-Bench score of 8.35—surpassing models trained with Direct Preference Optimization (DPO), which scores 7.58—without increasing model size.
Independent benchmarks further confirm that SPCT enables smaller models to match the performance of much larger counterparts, such as 671B-scale models, by leveraging inference-time computation with 32 samples per query.
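In practice, this kind of inference-time scaling amounts to sampling many independent reward judgments for the same query and aggregating them. The sketch below is an illustrative reduction of that idea in Python; `sample_reward` is a hypothetical stand-in for one generative reward-model pass, not DeepSeek's implementation.

```python
import random
from statistics import mean

def sample_reward(query: str, response: str) -> float:
    """Stand-in for one generative reward-model pass.
    In the real system this would prompt the model to write a critique
    and extract a numeric score; here it is simulated with noise."""
    return random.gauss(7.5, 1.0)

def scaled_reward(query: str, response: str, n_samples: int = 32) -> float:
    """Aggregate many sampled judgments (the paper reports 32 samples per query)."""
    scores = [sample_reward(query, response) for _ in range(n_samples)]
    return mean(scores)  # simple averaging; voting schemes are another option

if __name__ == "__main__":
    print(scaled_reward("Explain MoE routing", "A draft answer...", n_samples=32))
```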
This alignment process is designed to scale with model size. According to the paper, SPCT’s advantage becomes more apparent as models grow larger, offering a promising path forward for AI developers looking to avoid the compute-intensive route of reinforcement learning from human feedback (RLHF).
The Recursive Architecture Behind SPCT
At the heart of SPCT is a multi-stage alignment pipeline that replaces static human labels with a loop of principle synthesis, response generation, critique filtering, and principle refinement. Each stage builds upon the last to incrementally improve the quality and alignment of the model’s output.
The process begins with the generation of context-specific principles using chain-of-thought prompting. For example, when handling coding-related tasks, the model might determine that memory efficiency should take priority over runtime and readability. These principles guide the next phase, in which the model generates an initial response within a constrained 4,096-token window.
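That flow can be pictured as two chained generation calls: one asking the model to write down principles for the query, and one answering under those principles within the token budget. The sketch below assumes a generic `llm_generate(prompt, max_tokens)` helper standing in for whatever model backend is used; the prompt wording is illustrative, not taken from the paper.

```python
def llm_generate(prompt: str, max_tokens: int) -> str:
    """Placeholder for a call into whatever model backend you run."""
    return f"<model output ({max_tokens}-token budget) for prompt: {prompt[:60]}...>"

def synthesize_principles(query: str) -> str:
    """Chain-of-thought prompt asking the model to derive task-specific principles."""
    prompt = (
        "Think step by step about what matters most when answering the task below, "
        "then list 3-5 principles in priority order.\n\nTask:\n" + query
    )
    return llm_generate(prompt, max_tokens=512)

def initial_response(query: str, principles: str) -> str:
    """Generate a first answer constrained to the 4,096-token window."""
    prompt = f"Principles:\n{principles}\n\nFollow the principles and answer:\n{query}"
    return llm_generate(prompt, max_tokens=4096)
```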
Once an initial response is produced, the model engages in self-critique. It evaluates its output against the synthesized principles and generates feedback for improvement. These critiques are filtered in real-time by a meta reward model (Meta-RM), which uses a 512-dimensional reward embedding to score the quality of each critique. Poor-quality critiques are discarded to ensure the integrity of the refinement cycle.
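A minimal way to picture the Meta-RM filter is as an embedding of each critique scored by a small head, with low-scoring critiques dropped before they can influence refinement. The sketch below uses a random projection into a 512-dimensional vector and a random linear head purely for illustration; the actual Meta-RM is a trained model.

```python
import numpy as np

EMBED_DIM = 512  # the paper's reward-embedding width

rng = np.random.default_rng(0)
projection = rng.normal(size=(EMBED_DIM, 256))   # stand-in text encoder
scoring_head = rng.normal(size=EMBED_DIM)        # stand-in trained scoring head

def embed_critique(critique: str) -> np.ndarray:
    """Toy featurizer: hash characters into a 256-dim bag, then project to 512 dims."""
    feats = np.zeros(256)
    for ch in critique:
        feats[ord(ch) % 256] += 1.0
    return projection @ feats

def filter_critiques(critiques: list[str], threshold: float = 0.0) -> list[str]:
    """Keep only critiques whose Meta-RM-style score clears the threshold."""
    kept = []
    for c in critiques:
        score = float(scoring_head @ embed_critique(c))
        if score >= threshold:
            kept.append(c)
    return kept
```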
The final step in the loop is principle refinement. Using gradient-based optimization, the model adjusts its internal alignment heuristics based on how well the critique matches the intended response. This recursive tuning allows the model to iteratively converge on high-quality outputs, adapting dynamically to the specifics of each query without requiring external intervention or retraining.
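Putting the stages together, one round of SPCT-style refinement looks like the loop below. This is an illustrative composition of the steps described above, reusing the hypothetical helpers from the earlier sketches rather than DeepSeek's actual code; the gradient-based principle update described in the paper is approximated here by simply re-prompting with the surviving critiques.

```python
def refine(query: str, max_rounds: int = 3) -> str:
    """Illustrative SPCT-style loop: principles -> response -> critique -> refine."""
    principles = synthesize_principles(query)
    response = initial_response(query, principles)
    for _ in range(max_rounds):
        critiques = [
            llm_generate(
                f"Principles:\n{principles}\n\nResponse:\n{response}\n\n"
                "Critique the response against the principles.",
                max_tokens=512,
            )
            for _ in range(4)  # sample several candidate critiques
        ]
        critiques = filter_critiques(critiques)  # Meta-RM-style filtering
        if not critiques:
            break
        # The paper describes gradient-based principle refinement; this sketch
        # approximates it by folding the surviving critiques back into the prompt.
        principles = llm_generate(
            "Revise these principles in light of the critiques.\n"
            f"Principles:\n{principles}\nCritiques:\n" + "\n".join(critiques),
            max_tokens=512,
        )
        response = initial_response(query, principles)
    return response
```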
Optimizing Inference Through Hardware-Aware Design
SPCT’s efficiency is made possible through a hardware-conscious architecture that includes a Mixture-of-Experts (MoE) setup. The GRM-27B model employs 16 experts, with only two activated per token, and supports context windows of up to 128,000 tokens. Speculative execution further enhances performance by precomputing potential critique paths, reducing latency during inference.
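The MoE arrangement described above, 16 experts with two activated per token, can be sketched as a simple top-2 router. The NumPy example below is a generic illustration of that routing pattern, not DeepSeek's implementation.

```python
import numpy as np

NUM_EXPERTS, TOP_K, HIDDEN = 16, 2, 64

rng = np.random.default_rng(0)
router_w = rng.normal(size=(HIDDEN, NUM_EXPERTS)) * 0.02
experts = [rng.normal(size=(HIDDEN, HIDDEN)) * 0.02 for _ in range(NUM_EXPERTS)]

def moe_forward(token_h: np.ndarray) -> np.ndarray:
    """Route one token's hidden state to its top-2 experts and mix their outputs."""
    logits = token_h @ router_w
    top = np.argsort(logits)[-TOP_K:]                          # the 2 chosen experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen two
    return sum(w * (token_h @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.normal(size=HIDDEN))
print(out.shape)  # (64,)
```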
Performance benchmarking demonstrates that SPCT achieves significant throughput advantages. When processing single-query batches, the system records a latency of 1.4 seconds and a throughput of 42 tokens per second. For batch sizes of eight, latency increases to 3.1 seconds while throughput scales to 208 tokens per second.
| Batch Size | Latency | Throughput |
|---|---|---|
| 1 | 1.4 s | 42 tokens/second |
| 8 | 3.1 s | 208 tokens/second |
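As a quick sanity check on these figures, throughput grows considerably faster than latency when batching, which is where the efficiency claim comes from. The short calculation below uses only the numbers in the table:

```python
# Figures from the table above
latency = {1: 1.4, 8: 3.1}      # seconds per batch
throughput = {1: 42, 8: 208}    # tokens per second

speedup = throughput[8] / throughput[1]   # ~4.95x more tokens per second
slowdown = latency[8] / latency[1]        # ~2.21x longer per batch
print(f"throughput gain: {speedup:.2f}x, latency increase: {slowdown:.2f}x")
```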
This efficient inference strategy allows SPCT to scale alignment capabilities without scaling model size. The result is a practical, cost-effective method that maintains performance parity with much larger models.
Benchmarking Costs and Performance Across Models
A comparative analysis reveals that SPCT significantly reduces the cost of training and deploying high-performance models. The DeepSeek-GRM model, with 27 billion parameters and using SPCT, achieves a training cost of approximately $12,000 while delivering a strong MT-Bench score of 8.35. By contrast, Nemotron-4, a 340B parameter model, incurs costs over $1.2 million to reach an MT-Bench score of 8.41. OpenAI’s GPT-4o, with 1.8 trillion parameters, scores 8.72 at an estimated cost of $6.3 million.
| Model | Size | MT-Bench | Approx. Training Cost |
|---|---|---|---|
| DeepSeek-GRM | 27B | 8.35 | $12,000 |
| Nemotron-4 | 340B | 8.41 | $1.2 million |
| GPT-4o | 1.8T | 8.72 | $6.3 million |
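The cost gap is easier to appreciate as a ratio. The short calculation below works only from the reported figures, which are estimates rather than audited numbers:

```python
# Reported figures from the comparison table
models = {
    "DeepSeek-GRM": {"cost": 12_000,    "mt_bench": 8.35},
    "Nemotron-4":   {"cost": 1_200_000, "mt_bench": 8.41},
    "GPT-4o":       {"cost": 6_300_000, "mt_bench": 8.72},
}

base = models["DeepSeek-GRM"]
for name, m in models.items():
    cost_ratio = m["cost"] / base["cost"]
    score_gap = m["mt_bench"] - base["mt_bench"]
    print(f"{name}: {cost_ratio:.0f}x the cost for {score_gap:+.2f} MT-Bench points")
```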
These comparisons underscore a central advantage of SPCT: it achieves state-of-the-art results using a fraction of the computational and financial resources required by brute-force scaling.
Beyond performance, SPCT offers compelling advantages in sustainability and flexibility. It eliminates nearly 90 percent of the human annotation typically required for alignment, drastically reducing labor and time investments. Moreover, it lowers energy consumption by 73 percent compared to DPO, making it an environmentally responsible option for AI development.
SPCT’s capacity for real-time adaptation also sets it apart. Traditional alignment methods are limited by the quality and scope of their training datasets, making them slow to adjust to novel or evolving tasks. In contrast, SPCT’s recursive inference strategy enables models to generate and refine principles on the fly, allowing them to handle unpredictable inputs and changing objectives without retraining.
This capability opens new frontiers in domains such as robotics, where systems must respond to dynamic environments, and multimodal AI, where alignment across text, vision, and sensor data is essential. The DeepSeek team is actively exploring SPCT’s application in real-time robotics control and distributed learning systems, where collaboration among multiple agents requires adaptive alignment mechanisms.
Shifting From Scale to Architecture
SPCT appears to be a central component of DeepSeek’s strategy for scaling AI performance through smarter architecture rather than bigger models. On March 24, DeepSeek released an open-weight update of its DeepSeek-V3 model to Hugging Face under an MIT license, dubbed DeepSeek V3.1. The model weighs in at 641GB, yet can run on high-end local hardware once quantized.
Developer Awni Hannun, testing a quantized 4-bit version on a 512GB Apple Mac Studio, reported inference speeds exceeding 20 tokens per second, writing: “It’s the most powerful model I’ve ever run on my laptop.”
The model weights are available on Hugging Face for developers seeking to experiment with them directly.
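For readers who want to attempt a local run like Hannun’s, the sketch below uses the `mlx-lm` package on Apple silicon. The repository id and quantization are assumptions for illustration (a community 4-bit conversion), not an official pointer from DeepSeek, and the hardware requirements remain substantial.

```python
# pip install mlx-lm  (Apple silicon only)
from mlx_lm import load, generate

# Hypothetical 4-bit community conversion; substitute whichever repo you actually use.
MODEL_ID = "mlx-community/DeepSeek-V3-0324-4bit"

model, tokenizer = load(MODEL_ID)
reply = generate(
    model,
    tokenizer,
    prompt="Summarize what Mixture-of-Experts routing does.",
    max_tokens=200,
)
print(reply)
```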
The V3-0324/V3.1 model is built on a Mixture-of-Experts (MoE) design, in which only about 37 billion of its total 685 billion parameters are active during any single inference step. This setup enables memory-efficient generation and is augmented by architectural features like Multi-Head Latent Attention (MLA) and Multi-Token Prediction (MTP), both designed to improve output speed and accuracy.
The DeepSeek-GRM-27B model used to test SPCT shares architectural similarities with V3-0324, suggesting that inference-time alignment could eventually be available in publicly released versions of DeepSeek’s commercial models as well.
Enterprise Adoption Under Pressure
DeepSeek’s approach is already being validated by enterprise adoption. Tencent confirmed during its Q4 2024 earnings call that it had integrated DeepSeek models across products like WeChat. A Tencent executive stated: “The industry and we, within the industry, are getting much higher productivity on a large language model training from existing GPUs without needing to add additional GPUs at the pace previously expected.”
The company’s decision to shift from GPU-hungry infrastructure toward optimized inference models comes at a time when U.S. export controls have restricted the availability of Nvidia’s top-tier AI chips in China. In 2023, the U.S. blocked sales of the A800 and H800 models. In response, Tencent reportedly placed bulk orders for the H20, a lower-powered chip still allowed under current rules.
DeepSeek’s earlier model, R1, was trained using only 2,048 H800 GPUs—an unusually low number for a foundation model of its size. SPCT further aligns with this strategy by enabling better performance without increasing the number of training samples or relying on large-scale preference annotation.
R2 Fast-Tracked as Rivals Surge
SPCT’s emergence is also strategically timed: DeepSeek’s next model, R2, is being rushed to market. As reported on February 26, the company accelerated its original May timeline to keep pace with rivals. The R1 model had drawn attention for its efficiency but fell short in areas like reasoning, multilingual accuracy, and code generation.
Competitors are also moving aggressively. Microsoft integrated OpenAI’s o1 model into Copilot at no additional cost, then soon after upgraded it to o3-mini-high. xAI released Grok 3, which outperforms GPT-4o. In March, Google unveiled Gemini 2.5 Pro Experimental, reclaiming top positions in various benchmarks, and shortly afterward opened free access to the model for all users.
OpenAI has reacted to these developments as well: after deciding in February to cancel a standalone release of its most powerful o3 model, it now plans to ship o3 and o4-mini in the near future, most likely out of concern about falling further behind in the AI race.
Meta, meanwhile, rushed out its new Llama 4 models this weekend: Llama 4 Scout and Llama 4 Maverick, two open-weight frontier large language models that introduce major architectural changes while expanding the company’s presence across consumer apps and cloud platforms.