Study: DeepSeek R1 Output Matches ChatGPT by 74%, Pointing to Heavy Use of Data from OpenAI Models

According to a study by Copyleaks, DeepSeek R1 shares 74% of its writing style with ChatGPT, raising concerns over potential OpenAI data use.

Updated on March 4, 2025 9:23 am CET: We’ve revised this story to clarify that, according to DeepSeek, its distillation process was carried out internally using its own V3 model, not by directly harvesting OpenAI outputs, and to emphasize that DeepSeek has consistently stated it relies on third-party open-source data rather than on OpenAI’s proprietary models.


A new forensic analysis by Copyleaks reveals that DeepSeek’s latest reasoning model, DeepSeek R1, shares 74.2% of its writing style with OpenAI’s ChatGPT.

The study examined subtle linguistic markers—including sentence structure, word choice, and phrasing—to arrive at this figure, suggesting that DeepSeek’s internal distillation process may be a key factor behind the model’s performance in reasoning tasks.
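Copyleaks has not published its classifiers, but the general idea behind this kind of stylometric comparison can be illustrated with a toy example: build frequency profiles of character n-grams (a common stylometric feature) for two texts and measure their cosine similarity. The code below is a minimal sketch of that idea, not a reconstruction of the study's actual method; the sample sentences and the n-gram length are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text, n=3):
    """Frequency profile of character n-grams, a common stylometric feature."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine similarity between two frequency profiles (0 = disjoint, 1 = identical)."""
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two stylistically similar (hypothetical) model outputs
sample_a = "The model generates fluent, well-structured answers to user queries."
sample_b = "The model produces fluent, well-organized responses to user questions."
score = cosine_similarity(char_ngrams(sample_a), char_ngrams(sample_b))
print(f"stylistic similarity: {score:.2f}")
```

Production stylometry systems combine many more features (sentence length, function-word usage, punctuation habits) and trained classifiers, but the underlying comparison of statistical fingerprints works along these lines.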

The findings by Copyleaks shared with WinBuzzer suggest that DeepSeek may have relied on ChatGPT-generated outputs during its training via distillation techniques, raising ethical and legal concerns about AI model development.

According to DeepSeek’s published research, the R1 model was developed by transferring reasoning capabilities from its internally trained V3 model. The V3 model itself appears to have been partially trained on data derived from OpenAI models. How DeepSeek might have obtained this data remains unknown.

The process DeepSeek appears to have used — known as knowledge distillation — utilizes synthetic data generated from its own models and data from third-party open-source sources, rather than relying on outputs from OpenAI’s proprietary systems directly.
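In sequence-level knowledge distillation, a stronger "teacher" model's outputs become supervised training pairs for a "student" model. The sketch below illustrates only the data-generation step under stated assumptions: the teacher here is a canned stand-in function, not any real DeepSeek or OpenAI API, and the prompts are invented for the example.

```python
def teacher_model(prompt: str) -> str:
    """Hypothetical stronger model whose outputs are harvested as training labels."""
    canned = {
        "What is 2 + 2?": "2 + 2 equals 4.",
        "Name a prime number.": "7 is a prime number.",
    }
    return canned.get(prompt, "I don't know.")

def build_distillation_set(prompts):
    """Each (prompt, teacher_output) pair becomes one fine-tuning example for the student."""
    return [(p, teacher_model(p)) for p in prompts]

dataset = build_distillation_set(["What is 2 + 2?", "Name a prime number."])
for prompt, target in dataset:
    print(f"train example -> prompt: {prompt!r}, target: {target!r}")
```

The controversy in DeepSeek's case turns entirely on whose model plays the teacher role: distilling from one's own V3 model is standard practice, while distilling from a competitor's proprietary outputs would raise the terms-of-service questions OpenAI is investigating.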

While critics have raised concerns about potential data harvesting, DeepSeek consistently maintains that its method is entirely self-contained.

The Copyleaks study employed three advanced AI classifiers that unanimously confirmed the 74.2% stylistic match, lending weight to the finding. DeepSeek maintains that this high degree of similarity reflects the systematic application of reinforcement learning and distillation within its own development pipeline, rather than any direct copying from ChatGPT.
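A unanimity requirement like the one Copyleaks describes can be sketched as follows: a match is reported only if every classifier in the ensemble returns the same label. The classifier outputs below are toy stand-ins, not Copyleaks' actual models.

```python
def unanimous_verdict(classifier_outputs):
    """Return the shared label if all classifiers agree, else None (no verdict)."""
    labels = set(classifier_outputs)
    return labels.pop() if len(labels) == 1 else None

print(unanimous_verdict(["openai-style", "openai-style", "openai-style"]))  # agreement
print(unanimous_verdict(["openai-style", "openai-style", "other"]))         # disagreement
```

Requiring unanimous agreement trades recall for precision: it reduces false positives, which is why the study presents the three classifiers' concurrence as strengthening the result.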

The findings come amid OpenAI’s ongoing investigation into unusual API activity linked to developer accounts in China. OpenAI has detected irregular patterns suggesting mass extraction of ChatGPT responses, which, if connected to DeepSeek, could indicate unauthorized model training based on OpenAI-generated content.

The Copyleaks analysis reveals that 74.2% of DeepSeek’s text shares a stylistic fingerprint with OpenAI’s models, a level of similarity not found in other AI systems tested. The findings suggest DeepSeek may have incorporated OpenAI-generated outputs in its training, though further research is needed. (Source: Copyleaks)

Microsoft’s Role: Hosting DeepSeek AI Despite OpenAI’s Investigation

Microsoft, OpenAI’s largest investor, has integrated DeepSeek-R1 into its Azure AI Foundry, making it accessible to developers worldwide. This move has sparked debate over Microsoft’s due diligence, given that OpenAI is simultaneously investigating potential unauthorized data use by the same model.

The situation presents a complex dynamic, as Microsoft benefits from both OpenAI’s exclusivity and DeepSeek’s accessibility. If OpenAI determines that DeepSeek was trained using its data without permission, Microsoft may face pressure to reconsider its support for the model.

Market Impact: Nvidia’s $593 Billion Stock Drop

DeepSeek’s emergence has had financial repercussions. Shortly after its rise to prominence, Nvidia temporarily lost $593 billion in market value, as investors reevaluated the demand for high-performance GPUs.

DeepSeek’s ability to generate AI-driven responses with lower computational costs raised concerns that AI firms may shift towards more efficient models, reducing reliance on Nvidia’s high-end AI training hardware.

While Nvidia remains the leading supplier of AI chips, DeepSeek’s approach could indicate a shift in how companies prioritize cost efficiency over raw computing power, potentially altering market expectations for AI model development.

OpenAI’s recently released GPT-4.5 model also points in that direction. GPT-4.5 was built on the older training paradigm of progressively increasing the amount of training data, and it has been found to underperform models that emphasize reasoning-oriented approaches such as Mixture-of-Experts architectures and Chain-of-Thought prompting.

DeepSeek’s Many Controversies

How DeepSeek obtained its training data is not the only controversy the company is involved in.

A recent NewsGuard study found that DeepSeek-R1 failed 83% of factual accuracy tests, ranking it among the least reliable AI models reviewed. Users reported instances of incorrect or misleading responses, raising concerns about the model’s dependability for critical applications.

Furthermore, the study found that DeepSeek’s outputs frequently aligned with Chinese government narratives, even in non-political queries. The model’s apparent bias, combined with its factual inaccuracies, has led to speculation about whether it was trained with state-approved datasets, further complicating its credibility in international markets.

As a result, Perplexity has released R1 1776, an open-source AI model built on DeepSeek R1 that removes the filtering mechanisms restricting responses on politically sensitive topics.

DeepSeek is Outdated and Easy to Jailbreak

Another issue with DeepSeek R1 is that its knowledge base appears outdated: it frequently cites pre-2024 events as if they were current. This raises concerns about whether the model is being actively maintained or whether its training data is limited by external restrictions.

Security assessments have revealed vulnerabilities in DeepSeek-R1’s safeguards. A Palo Alto Networks study found that the model’s safeguards failed against all of the researchers’ jailbreak attempts, indicating that DeepSeek lacks adequate protections against misuse.

Similarly, Adversa AI’s research confirmed that DeepSeek is highly susceptible to prompt injections, allowing users to bypass safety mechanisms and generate content that should otherwise be restricted. Unlike OpenAI’s ChatGPT, which has undergone multiple security updates, DeepSeek appears to lack comparable content moderation safeguards.

U.S. Lawmakers and European Regulators Target DeepSeek

The concerns surrounding DeepSeek have triggered responses from policymakers in both the United States and Europe. In Washington, legislators are reviewing a proposal to ban DeepSeek AI from federal agencies, citing security risks and concerns over its ties to China.

Officials worry that its vulnerabilities could be exploited for misinformation campaigns or unauthorized data collection, raising national security implications.

Texas has already taken independent action, becoming the first U.S. state to blacklist DeepSeek from government use. The decision follows broader state-led efforts to regulate foreign AI models amid concerns over data privacy and potential cybersecurity threats.

In Europe, scrutiny has focused on data protection. Italy’s data protection authority has launched an investigation into whether DeepSeek complies with GDPR. If the AI model is found to be processing data in ways that violate EU privacy laws, it could face significant operational restrictions in the region.

China’s AI Expansion and the Rush for Nvidia’s H20 Chips

DeepSeek’s development is occurring in a broader geopolitical context, where AI technology is increasingly intertwined with national security concerns.

As the U.S. tightens export controls on high-performance chips, Chinese companies have been securing as many AI processors as possible. DeepSeek is among the firms that have contributed to a surge in demand for Nvidia’s H20 processors, one of the few AI chips still available for export to China.

The restrictions have forced Chinese AI developers to adapt, potentially relying more on optimized software efficiency rather than hardware acceleration. DeepSeek’s emphasis on achieving high performance with lower computational demands suggests a shift in strategy to work within these limitations.

DeepSeek Races to Release R2 Model Amid Intensifying Scrutiny

Despite the challenges, DeepSeek is accelerating its AI development timeline. The company has opted to fast-track the release of its R2 model, moving up its launch schedule in an attempt to maintain its momentum in the AI race.

This decision suggests that DeepSeek is prioritizing market presence, even as concerns about its training data, security vulnerabilities, and factual reliability remain unresolved.

The rush to release R2 could be a strategic move to strengthen DeepSeek’s position against competitors such as OpenAI, Google, and Alibaba. However, if the new model suffers from the same weaknesses as R1, including factual inaccuracy and security gaps, it may face resistance in Western markets.

The Future of AI Ethics, Security, and Regulation

The DeepSeek controversy highlights broader debates about AI training ethics, intellectual property, and security. As AI models become more sophisticated, questions about how they are developed—and whether they rely on data extracted from competitors—are becoming central to discussions on AI governance.

If OpenAI formally concludes that its outputs were used in DeepSeek’s training, it could set a precedent for how AI companies handle intellectual property disputes.

Such a ruling could lead to tighter regulations requiring greater transparency in AI training datasets and possibly legal consequences for companies found to have leveraged competitor-generated data without authorization.

Regulators are also likely to impose stricter compliance measures on AI models operating in major markets. With scrutiny from U.S. lawmakers and European regulators increasing, AI firms may soon face heightened oversight on data privacy, security, and content moderation.

Markus Kasanmascheff
Markus has been covering the tech industry for more than 15 years. He holds a Master’s degree in International Economics and is the founder and managing editor of Winbuzzer.com.
