DeepSeek’s New 641GB AI Model Lands Quietly — and Runs Surprisingly Fast on a Mac

DeepSeek’s V3-0324 AI model has launched quietly, offering efficient performance on Mac Studio and drawing interest for its open-weight availability.

A new large language model from DeepSeek has quietly appeared online — and it’s already drawing interest for an unexpected reason: it runs fast locally on an Apple Mac Studio.

The 641-gigabyte open-weight model, officially named DeepSeek-V3-0324, is an updated version of DeepSeek's V3 model released last year. It was uploaded to Hugging Face today under an MIT license, which gives developers the freedom to modify and deploy it commercially. What makes it stand out, however, is its ability to run efficiently on consumer-grade hardware.

Unlike many model launches from enterprise labs, DeepSeek’s latest drop came with no accompanying whitepaper, research blog, or marketing push. Developer Awni Hannun first flagged the release after testing it locally.

Running the 4-bit quantized version on a 512GB Mac Studio with Apple's MLX framework (via the mlx-lm package), he reported inference speeds above 20 tokens per second. "It's the most powerful model I've ever run on my laptop," he wrote.
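
For context, a local run of this kind is only a few lines of Python with mlx-lm. The sketch below is a minimal example under stated assumptions: the 4-bit repository name follows common community-conversion naming and is not confirmed by the release, and the machine needs Apple Silicon with enough unified memory to hold the quantized weights.

```python
# Minimal sketch of a local run with mlx-lm (pip install mlx-lm).
# Assumption: the repo name below is a hypothetical community 4-bit
# conversion; substitute whatever conversion you actually use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V3-0324-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain mixture-of-experts models in two sentences.",
    max_tokens=256,
    verbose=True,  # prints generation stats such as tokens per second
)
```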

The model page includes configuration files and weights, but no formal documentation or performance evaluation, reinforcing the low-key nature of the release. It is also available for demo access via OpenRouter, where users can interact with it directly.

Early testers have noted substantial improvements compared to the previous version. AI researcher Xeophon stated in a post on X: “Tested the new DeepSeek V3 on my internal bench and it has a huge jump in all metrics on all tests. It is now the best non-reasoning model, dethroning Sonnet 3.5.”

Checkpoint release reflects efficiency-first strategy

DeepSeek-V3-0324 is not a brand-new model but the first open-weight checkpoint of the broader DeepSeek V3 architecture introduced in late 2024.

This release makes that architecture publicly accessible, and it comes with built-in support for FP8 quantization — a precision format that balances memory efficiency with computational accuracy.
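
To make the storage numbers concrete, here is a quick back-of-the-envelope sketch covering weights only, ignoring embeddings, activations, and the KV cache:

```python
# Rough memory footprint of a 685B-parameter model at different
# weight precisions (weights only; real checkpoints differ slightly).
PARAMS = 685e9

for name, bytes_per_param in [("FP16", 2), ("FP8", 1), ("4-bit", 0.5)]:
    gb = PARAMS * bytes_per_param / 1e9
    print(f"{name:>5}: ~{gb:,.0f} GB")

# FP16: ~1,370 GB   FP8: ~685 GB   4-bit: ~343 GB
# The ~641GB upload is consistent with FP8 weights, and a 4-bit
# quantization is what brings it within reach of a 512GB Mac Studio.
```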

The underlying model design is structured around a Mixture-of-Experts (MoE) architecture. While the model totals 685 billion parameters, only around 37 billion are active at any time during inference, which significantly reduces hardware demands.
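
A toy sketch of the routing idea, not DeepSeek's actual implementation, shows why compute tracks the selected experts rather than the full parameter count:

```python
import numpy as np

# Illustrative Mixture-of-Experts routing: a router scores all experts
# per token, but only the top-k actually run, so per-token compute
# scales with active parameters (~37B) rather than total (~685B).
rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 16

router = rng.normal(size=(d_model, n_experts))            # routing weights
experts = rng.normal(size=(n_experts, d_model, d_model))  # toy expert FFNs

def moe_layer(x):
    scores = x @ router                     # one logit per expert
    chosen = np.argsort(scores)[-top_k:]    # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()                # softmax over selected experts
    # Only the chosen experts execute; the rest stay idle for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (16,)
```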

It also includes two performance-focused innovations: Multi-Head Latent Attention (MLA), which compresses the attention key-value cache into a compact latent representation to cut memory use during long-context inference, and Multi-Token Prediction (MTP), which enables the model to generate multiple tokens per step rather than just one.
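
A heavily simplified sketch of the multi-token prediction idea follows; it is illustrative only, since DeepSeek's actual MTP modules are additional transformer layers rather than the independent linear heads shown here.

```python
import numpy as np

# Toy multi-token prediction: beside the usual next-token head, extra
# heads propose tokens further ahead, so one forward pass can emit
# several candidate tokens instead of one.
rng = np.random.default_rng(1)
d_model, vocab, extra_heads = 16, 100, 2

hidden = rng.normal(size=d_model)  # final hidden state for one position
# One standard next-token head plus `extra_heads` look-ahead heads:
heads = rng.normal(size=(extra_heads + 1, vocab, d_model))

proposed = [int(np.argmax(h @ hidden)) for h in heads]
print(proposed)  # token ids proposed for t+1, t+2, t+3 in a single step
```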

These optimizations helped the DeepSeek V3 family achieve notable benchmark results when first profiled in December. The previous model version scored 90.2 on the MATH-500 test, outperforming GPT-4o’s 74.6. It also reached 79.8 on MGSM and matched GPT-4o on HumanEval-Mul, a programming benchmark. While those results don’t reflect the performance of V3-0324 specifically, they provide a window into the architecture’s potential.

Tencent adopts DeepSeek to offset GPU demands

Beyond local experimentation, DeepSeek models are already being used in production environments. Tencent confirmed during its Q4 2024 earnings call that it had adopted DeepSeek for services like WeChat, using it to optimize GPU utilization amid growing infrastructure constraints.

"Chinese companies are generally prioritizing efficiency and utilization — efficient utilization of the GPU servers… DeepSeek's success really sort of symbolize and solidify — demonstrated that — that reality," said one Tencent executive.

This strategy aligns with the company’s broader approach. While Tencent is also pursuing its in-house HunYuan Turbo S model, DeepSeek’s lightweight and efficient architecture has proven attractive for handling multilingual and reasoning-heavy workloads.

Such efficiency is particularly valuable given U.S. restrictions on advanced Nvidia chips. DeepSeek's original V3 model was reportedly trained on just 2,048 Nvidia H800 GPUs, far fewer than is typical for models of its scale. In response to growing demand, Chinese firms have turned to the lower-powered Nvidia H20, and DeepSeek's adoption was a key factor behind a spike in H20 orders earlier this year.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master's degree in International Economics and is the founder and managing editor of Winbuzzer.com.