Is DeepSeek Training its AI with Data from Google Gemini? New Distillation Claims Emerge

DeepSeek faces new claims that its R1-0528 AI model was trained on data from Google Gemini, following earlier scrutiny over the alleged use of OpenAI models in training.

Chinese AI lab DeepSeek faces new questions over its training data. Speculation has emerged that its new R1-0528 model was trained on output from Google’s Gemini AI. This follows earlier accusations from March regarding the alleged use of output from OpenAI’s ChatGPT. The recent claims, as TechCrunch reports, stem from researchers who noticed that the new DeepSeek model’s language and internal “traces” resemble those of Google’s Gemini 2.5 Pro.

With DeepSeek’s latest model, debates on AI ethics and intellectual property intensify again. The practice of “distillation,” in which one model learns from another model’s outputs during training, is central to this. If proven, DeepSeek could face legal and reputational consequences, and such a finding would also call into question the efficacy of safeguards put in place by major AI labs. The situation highlights the fierce AI competition between the U.S. and China and could also affect user trust.
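For readers unfamiliar with the technique, the sketch below shows what sequence-level distillation looks like in its simplest form: a frozen “teacher” model generates outputs that a “student” model is then trained to imitate. Everything here, the toy model class, the hyperparameters, the random prompts, is an illustrative assumption, not a description of DeepSeek’s or Google’s actual systems.

```python
# Minimal sketch of sequence-level knowledge distillation (illustrative only).
# A frozen "teacher" model generates text; a "student" model is trained to
# imitate it. The toy models and settings here are assumptions for the sake
# of the example, not anyone's real pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 1000, 64

class TinyLM(nn.Module):
    """A toy next-token language model standing in for a real LLM."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # logits over the vocabulary

teacher, student = TinyLM(), TinyLM()
teacher.eval()  # the teacher is frozen; only the student is trained
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

prompts = torch.randint(0, VOCAB, (8, 16))  # stand-in for real prompts
with torch.no_grad():
    # "Synthetic data": the teacher's own (greedy) continuations become
    # the training targets for the student.
    targets = teacher(prompts).argmax(dim=-1)

for step in range(100):
    logits = student(prompts)
    # Hard-label variant: train the student on the teacher's generated tokens.
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Notably, a closed model accessed through an API exposes only generated text, not its internal probabilities, so real-world distillation of this kind would resemble the hard-label variant above, the very setup that tends to leave the stylistic fingerprints researchers say they observed.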

The current concerns were amplified by developers Sam Paech and the creator of SpeechMap. They pointed to stylistic and structural resemblances. While not conclusive, this echoes past incidents: when asked, DeepSeek’s V3 model has sometimes identified itself as ChatGPT.

A Pattern Of Accusations

Allegations of improper data use by DeepSeek are not new. Earlier in 2025, OpenAI found evidence linking DeepSeek to distillation. Around the same time, Microsoft reportedly detected significant data exfiltration via OpenAI developer accounts, which OpenAI suspected were tied to DeepSeek, as per Bloomberg. OpenAI’s terms explicitly forbid using its outputs to build rival AI.

Furthering these concerns, a study found DeepSeek R1 shared 74.2% of its writing style with ChatGPT. DeepSeek, however, stated its R1 model was developed from its V3 model. The company claimed V3 used internal synthetic data and third-party open-source information, not direct OpenAI outputs.

DeepSeek described the R1-0528 model, launched in late May, as a “minor trial upgrade.” Yet, on its Hugging Face page, the company positioned the model’s performance as “approaching that of leading models, such as o3 and Gemini 2.5 Pro.”

This assertion of near-parity with leading systems like Gemini 2.5 Pro offers a potential motive for learning from such advanced AI. The R1-0528 model, utilizing a Mixture-of-Experts (MoE) architecture, was promoted for significant enhancements in reasoning, mathematics, and programming.

Challenges In Proving Distillation

Training AI on new web data faces one major constraint: the industry grapples with “AI slop,” web content that is itself increasingly generated by AI. This contamination can lead to models unintentionally developing similar characteristics. However, some experts find deliberate distillation plausible.

AI researcher Nathan Lambert suggested on X: “If I was DeepSeek, I would definitely create a ton of synthetic data from the best API model out there,” noting that the company is “short on GPUs and flush with cash. It’s literally effectively more compute for them.”

But proving illicit data distillation is notoriously hard; it often comes down to analyzing output patterns, which can be suggestive but rarely conclusive on their own. The line blurs between direct distillation and indirect influence when models are trained on web-scale data increasingly populated by outputs from other advanced AIs. Stylistic convergence is almost inevitable to some degree.
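As a concrete illustration of what such output-pattern analysis involves, the snippet below compares word-level n-gram profiles of two hypothetical model responses. Studies like the writing-style comparison cited earlier use far more sophisticated features; the sample texts and the similarity measure here are simplified assumptions for illustration.

```python
# Simplified sketch of the kind of stylistic-overlap analysis used to argue
# for (but not prove) distillation: compare n-gram profiles of two models'
# outputs. The sample texts below are invented for illustration.
from collections import Counter

def ngram_profile(text: str, n: int = 3) -> Counter:
    """Count word-level n-grams in a text."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def overlap_score(a: str, b: str, n: int = 3) -> float:
    """Weighted Jaccard similarity between two n-gram profiles (0 to 1)."""
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    shared = sum((pa & pb).values())   # n-grams both texts contain
    total = sum((pa | pb).values())    # all n-grams in either text
    return shared / total if total else 0.0

# Hypothetical outputs from two models answering the same prompt.
model_a = "The capital of France is Paris, a city known for its rich history."
model_b = "The capital of France is Paris, a city famous for its rich history."
print(f"trigram overlap: {overlap_score(model_a, model_b):.2f}")
```

A high score from such a comparison is suggestive at best: two models trained independently on the same public web data will also overlap heavily, which is precisely why this kind of evidence remains circumstantial.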

In response to these risks, major AI companies are enhancing security. OpenAI began ID verification in April. Google has started “summarizing” the reasoning traces generated by models available through its AI Studio developer platform. Similarly, Anthropic said in May that it would start summarizing its own models’ traces, citing a need to protect its “competitive advantages,” as TechCrunch reported.

The Geopolitical And Regulatory Landscape

DeepSeek’s progress occurs amid considerable geopolitical headwinds. A US House Select Committee on the CCP has labeled DeepSeek a national security risk. Committee Chairman John Moolenaar asserted, “DeepSeek isn’t just another AI app — it’s a weapon in the Chinese Communist Party’s arsenal, designed to spy on Americans, steal our technology, and subvert U.S. law.” This scrutiny adds to prior reports on DeepSeek R1’s factual accuracy and security issues.

The company has emphasized computational efficiency, partly in response to U.S. export controls on advanced Nvidia GPUs. This focus also led Tencent, a Chinese competitor that is developing its own AI models, to adopt DeepSeek models in late 2024 for GPU optimization.

As of now, DeepSeek continues its rapid development. The latest R1-0528 model is available under an MIT License, permitting commercial use and distillation, and DeepSeek’s next-generation R2 model is expected to be released soon.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
