Why DeepSeek’s new AI model thinks it’s ChatGPT


Earlier this week, DeepSeek, a well-funded Chinese AI lab, released its latest model, DeepSeek V3. Positioned as an “open” AI model, DeepSeek V3 has quickly gained attention for outperforming many competitors on popular benchmarks. The model is both large and efficient, excelling in text-based tasks such as coding and essay writing. However, a peculiar issue has surfaced—DeepSeek V3 seems to believe it’s ChatGPT.

Numerous posts on X (formerly Twitter), along with TechCrunch's own tests, show that DeepSeek V3 identifies itself as ChatGPT, OpenAI's flagship chatbot powered by GPT-4. When asked about its identity, the model insists it is a version of OpenAI's GPT-4 released in 2023. Nor is this a rare fluke: in five out of eight test cases, DeepSeek V3 claimed to be ChatGPT, while in the other three it correctly identified itself as DeepSeek V3.
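For illustration, this kind of identity test can be reproduced with a short script. The sketch below assumes DeepSeek exposes an OpenAI-compatible chat-completions endpoint; the base URL and model name are illustrative assumptions, not details confirmed by this article:

```python
# A minimal sketch of the identity test described above, assuming an
# OpenAI-compatible chat-completions API. The base_url and model name
# are illustrative assumptions, not confirmed values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # assumed endpoint
    api_key="YOUR_DEEPSEEK_API_KEY",
)

claims = []
for _ in range(8):  # mirror the eight trials mentioned above
    response = client.chat.completions.create(
        model="deepseek-chat",  # assumed model identifier
        messages=[{"role": "user", "content": "What model are you, exactly?"}],
    )
    claims.append(response.choices[0].message.content)

# Count how often the answers name ChatGPT/GPT-4 instead of DeepSeek.
chatgpt_hits = sum("ChatGPT" in c or "GPT-4" in c for c in claims)
print(f"{chatgpt_hits}/8 responses claimed to be ChatGPT or GPT-4")
```

Because model sampling is stochastic, repeated runs can produce different splits, which is consistent with the mixed five-of-eight result reported above.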

This behavior offers an intriguing glimpse into the model's training data. As one user on X observed, "It gives you a rough idea of their training data distribution." The confusion isn't limited to identity, either: when asked about DeepSeek's API, the model provides instructions for OpenAI's API instead. It even recycles some of GPT-4's jokes, down to the exact punchlines.

Why Does DeepSeek V3 Think It’s ChatGPT?

AI models like DeepSeek V3 and ChatGPT are statistical systems trained on billions of examples to recognize patterns and predict responses. While DeepSeek hasn’t disclosed much about its training data, public datasets containing GPT-4-generated text are widely available. If DeepSeek V3 was trained on such datasets, it might have inadvertently memorized GPT-4’s outputs and is now regurgitating them.
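Part of the problem is that screening such data out of a web-scale corpus is genuinely difficult. The toy filter below illustrates the crudest possible approach, matching telltale ChatGPT boilerplate phrases; the phrase list is purely illustrative, and real decontamination pipelines rely on classifiers and fuzzy deduplication rather than string matching:

```python
# A toy decontamination filter: drop documents containing boilerplate
# phrases characteristic of ChatGPT-style outputs. The phrase list is
# illustrative only; production pipelines are far more sophisticated.
AI_SIGNATURE_PHRASES = [
    "as an ai language model",
    "i'm chatgpt",
    "developed by openai",
    "i cannot browse the internet",
]

def looks_ai_generated(document: str) -> bool:
    """Return True if the document contains an obvious AI-boilerplate phrase."""
    text = document.lower()
    return any(phrase in text for phrase in AI_SIGNATURE_PHRASES)

corpus = [
    "The mitochondria is the powerhouse of the cell.",
    "As an AI language model developed by OpenAI, I can't do that.",
]

clean_corpus = [doc for doc in corpus if not looks_ai_generated(doc)]
print(f"Kept {len(clean_corpus)} of {len(corpus)} documents")
```

A heuristic like this catches only the most obvious cases; AI-generated text that lacks boilerplate phrasing sails straight through, which is why contamination keeps accumulating in training sets.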

“Obviously, the model is seeing raw responses from ChatGPT at some point, but it’s not clear where that is,” said Mike Cook, a research fellow at King’s College London specializing in AI. “It could be ‘accidental,’ but unfortunately, we’ve seen instances of people directly training their models on outputs from other systems to piggyback off their knowledge.”

Cook explained that training on outputs from other AI systems can harm model quality, leading to hallucinations and inaccurate answers. “It’s like taking a photocopy of a photocopy—you lose more and more information and connection to reality,” he added.

Ethical and Legal Concerns

This practice may also violate terms of service. OpenAI explicitly prohibits users from leveraging its outputs to develop competing models. Neither OpenAI nor DeepSeek has responded to requests for comment, but OpenAI’s CEO, Sam Altman, posted a cryptic remark on X, seemingly directed at DeepSeek: “It is (relatively) easy to copy something that you know works. It is extremely hard to do something new, risky, and difficult when you don’t know if it will work.”

DeepSeek V3 isn’t the only AI model with identity issues. Google’s Gemini, for instance, has reportedly claimed to be Baidu’s Wenxinyiyan chatbot when prompted in Mandarin. The root of the problem lies in the ever-growing presence of AI-generated content on the internet. Bots are flooding platforms like Reddit and X, and AI-generated clickbait is becoming more common. By some estimates, 90% of the web’s content could be AI-generated by 2026.

This “contamination” complicates efforts to filter out AI-generated text from training datasets. While it’s possible that DeepSeek directly trained its model on ChatGPT’s outputs, it’s more likely that large amounts of GPT-4 data inadvertently made their way into its training set. Heidy Khlaaf, chief AI scientist at the AI Now Institute, suggested that the cost savings from “distilling” knowledge from an existing model might tempt developers to take such shortcuts.
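In rough terms, the distillation Khlaaf describes means using a stronger "teacher" model to generate outputs that a "student" model is then fine-tuned on. The sketch below shows only the data-collection half of that process, again assuming an OpenAI-style client; it is a conceptual illustration of the general technique, not a description of DeepSeek's actual pipeline:

```python
# A conceptual sketch of output distillation: collect a teacher model's
# answers as supervised fine-tuning data for a student model. This
# illustrates the general technique, not DeepSeek's actual pipeline.
import json
from openai import OpenAI

teacher = OpenAI(api_key="YOUR_OPENAI_API_KEY")

prompts = [
    "Explain binary search in two sentences.",
    "Write a haiku about the ocean.",
]

with open("distilled_pairs.jsonl", "w") as f:
    for prompt in prompts:
        answer = teacher.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # Each line becomes one supervised training example for the student.
        f.write(json.dumps({"prompt": prompt, "completion": answer}) + "\n")
```

Doing this at scale is far cheaper than curating human-written data, which is exactly the cost incentive Khlaaf points to; it is also the practice OpenAI's terms of service prohibit.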

“Even with the internet now brimming with AI outputs, accidentally training on ChatGPT or GPT-4 data wouldn’t necessarily result in outputs this reminiscent of OpenAI’s style,” Khlaaf said. “If DeepSeek intentionally used OpenAI’s models for distillation, it wouldn’t be surprising.”

The Bigger Picture

The implications of this overlap are significant. For one, DeepSeek V3’s inability to accurately self-identify raises concerns about its reliability. More troubling, however, is the potential for DeepSeek V3 to amplify GPT-4’s existing biases and flaws. By uncritically absorbing and iterating on another model’s outputs, DeepSeek V3 could perpetuate errors and further degrade the quality of its responses.

As the web becomes increasingly saturated with AI-generated content, the challenge of maintaining clean and diverse training datasets will only grow. Models like DeepSeek V3 illustrate both the promise and pitfalls of AI development in an era of blurred boundaries and shared knowledge. The question now is whether developers will prioritize innovation and ethical practices over short-term gains.
