A new large language model from DeepSeek has quietly appeared online, and it is already drawing interest for an unexpected reason: it runs at upwards of 20 tokens per second locally on an Apple Mac Studio.
The 641-gigabyte open-weight model, officially named DeepSeek-V3-0324, is an updated version of DeepSeek’s V3 model from last year. It was uploaded to Hugging Face today under an MIT license, giving developers the freedom to modify it and deploy it commercially. What makes it stand out, however, is its ability to operate efficiently on consumer-grade hardware.
Unlike many model launches from enterprise labs, DeepSeek’s latest drop came with no accompanying whitepaper, research blog, or marketing push. Developer Awni Hannun first flagged the release after testing it locally.
The new Deep Seek V3 0324 in 4-bit runs at > 20 toks/sec on a 512GB M3 Ultra with mlx-lm! pic.twitter.com/wFVrFCxGS6
— Awni Hannun (@awnihannun) March 24, 2025
Running the 4-bit quantized version on a 512GB M3 Ultra Mac Studio with the mlx-lm library, he reported inference speeds above 20 tokens per second. “It’s the most powerful model I’ve ever run on my laptop,” he wrote.
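For readers who want to try a similar setup, below is a minimal sketch using mlx-lm’s Python API. The quantized repo id is a hypothetical placeholder, not a confirmed checkpoint name; substitute whatever 4-bit MLX conversion you actually use.

```python
# Minimal sketch: running a quantized checkpoint with mlx-lm on Apple silicon.
# Requires: pip install mlx-lm
# NOTE: the repo id below is a hypothetical placeholder for a 4-bit MLX
# conversion of DeepSeek-V3-0324; substitute the checkpoint you actually use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V3-0324-4bit")  # assumed repo id

prompt = "Explain mixture-of-experts routing in two sentences."
text = generate(model, tokenizer, prompt=prompt, max_tokens=200, verbose=True)
print(text)
```

mlx-lm also exposes the same workflow as a command-line tool (mlx_lm.generate) for quick experiments.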
The model page includes configuration files and weights, but no formal documentation or performance evaluation, reinforcing the low-key nature of the release. It is also available for demo access via OpenRouter, where users can interact with it directly.
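For those testing the hosted route instead, the sketch below calls the model through OpenRouter’s OpenAI-compatible chat completions endpoint. The model slug is an assumption and should be verified against OpenRouter’s model list.

```python
# Minimal sketch: querying the model via OpenRouter's OpenAI-compatible API.
# Requires: pip install requests, plus an OPENROUTER_API_KEY in the environment.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "deepseek/deepseek-chat-v3-0324",  # assumed slug; check the listing
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```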
Early testers have noted substantial improvements compared to the previous version. AI researcher Xeophon stated in a post on X: “Tested the new DeepSeek V3 on my internal bench and it has a huge jump in all metrics on all tests. It is now the best non-reasoning model, dethroning Sonnet 3.5.”
Tested the new DeepSeek V3 on my internal bench and it has a huge jump in all metrics on all tests. It is now the best non-reasoning model, dethroning Sonnet 3.5.

Congrats @deepseek_ai! pic.twitter.com/efEu2FQSBe

— Xeophon (@TheXeophon) March 24, 2025
Checkpoint release reflects efficiency-first strategy
DeepSeek-V3-0324 is not a brand-new model but an updated open-weight checkpoint of the broader DeepSeek V3 architecture introduced in late 2024.
This release makes that architecture available under the permissive MIT license, and it comes with built-in support for FP8 quantization, a reduced-precision format that balances memory efficiency with computational accuracy.
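To make the FP8 trade-off concrete, here is a toy numpy sketch of per-tensor quantization to an e4m3-style grid: scale into the representable range, then round to a 3-bit mantissa. This emulates the number format in float32 for illustration only; it is not DeepSeek’s actual FP8 pipeline, which relies on hardware support.

```python
# Toy sketch: per-tensor FP8 (e4m3-style) quantization emulated in numpy.
# Ignores subnormals and the low end of the exponent range for brevity.
import numpy as np

E4M3_MAX = 448.0  # largest finite value in the e4m3 format

def round_to_e4m3_grid(x: np.ndarray) -> np.ndarray:
    """Round to the nearest value with a 3-bit mantissa (plus implicit bit)."""
    m, e = np.frexp(x)                    # x == m * 2**e, with 0.5 <= |m| < 1
    return np.round(m * 16.0) / 16.0 * np.exp2(e)

def quantize(x: np.ndarray):
    """Scale into the e4m3 range, clip, and snap to the grid; return (q, scale)."""
    scale = np.abs(x).max() / E4M3_MAX
    q = round_to_e4m3_grid(np.clip(x / scale, -E4M3_MAX, E4M3_MAX))
    return q, scale

w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
q, s = quantize(w)
print("max abs rounding error:", np.abs(w - q * s).max())
```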
The underlying model design is structured around a Mixture-of-Experts (MoE) architecture. While the model totals 685 billion parameters, only around 37 billion are active at any time during inference, which significantly reduces hardware demands.
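The sketch below shows why that matters, using toy dimensions: a router scores all experts, but only the top-k winners actually run for each token, so most of the parameter count sits idle on any given forward pass. None of the sizes or routing details here match DeepSeek’s implementation.

```python
# Toy sketch of top-k mixture-of-experts routing in numpy. Only the selected
# experts execute per token, which is how a very large MoE model can activate
# a small fraction of its parameters at inference time.
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2

router_w = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]                        # indices of the k winners
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over winners
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

token = rng.standard_normal(d_model)
print(moe_forward(token).shape)  # (16,) -- same shape as the input token
```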
It also includes two performance-focused innovations: Multi-Head Latent Attention (MLA), which compresses the attention key-value cache to cut memory use during long-context inference, and Multi-Token Prediction (MTP), which enables the model to generate multiple tokens per step rather than just one.
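Of the two, MTP is easier to picture. In the hypothetical sketch below, a shared hidden state feeds one output head per future token offset; DeepSeek’s published MTP design chains sequential prediction modules rather than using independent heads like this, so treat it purely as a conceptual illustration.

```python
# Conceptual sketch of multi-token prediction: several heads read the same
# hidden state, each predicting a token at a different future offset.
# NOTE: this is a toy illustration, not DeepSeek's MTP architecture.
import numpy as np

rng = np.random.default_rng(1)
d_model, vocab_size, n_predict = 16, 100, 2

heads = [rng.standard_normal((d_model, vocab_size)) for _ in range(n_predict)]

def predict_multi(hidden: np.ndarray) -> list:
    """Greedy predictions for the next n_predict token positions."""
    return [int(np.argmax(hidden @ w)) for w in heads]

hidden_state = rng.standard_normal(d_model)
print(predict_multi(hidden_state))  # e.g. token ids for t+1 and t+2
```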
These optimizations helped the DeepSeek V3 family achieve notable benchmark results when first profiled in December. The previous model version scored 90.2 on the MATH-500 test, outperforming GPT-4o’s 74.6. It also reached 79.8 on MGSM and matched GPT-4o on HumanEval-Mul, a programming benchmark. While those results don’t reflect the performance of V3-0324 specifically, they provide a window into the architecture’s potential.
Tencent adopts DeepSeek to offset GPU demands
Beyond local experimentation, DeepSeek models are already being used in production environments. Tencent confirmed during its Q4 2024 earnings call that it had adopted DeepSeek for services like WeChat, using it to optimize GPU utilization amid growing infrastructure constraints.
“Chinese companies are generally prioritizing efficiency and utilization — efficient utilization of the GPU servers… DeepSeek’s success really sort of symbolize and solidify — demonstrated that — that reality,” said one Tencent executive.
This strategy aligns with the company’s broader approach. While Tencent is also developing its in-house Hunyuan Turbo S model, DeepSeek’s lightweight and efficient architecture has proven attractive for handling multilingual and reasoning-heavy workloads.
Such efficiency is particularly valuable given U.S. restrictions on advanced Nvidia chips. DeepSeek reportedly trained its earlier V3 model on just 2,048 Nvidia H800 GPUs, far fewer than is typical for models of this scale. In response to growing demand, Chinese firms have turned to the lower-powered Nvidia H20; demand for running DeepSeek models was reportedly a key factor behind a spike in H20 orders earlier this year.