Earlier this week, DeepSeek, a well-funded Chinese AI lab, unveiled an “open” AI model that outperforms many competitors on popular benchmarks. The model, DeepSeek V3, is both large and efficient, handling text-based tasks like coding and essay writing with ease.
It appears to mistake itself for ChatGPT.
Posts on X — as well as tests conducted by TechCrunch — reveal that DeepSeek V3 identifies itself as ChatGPT, OpenAI’s AI-powered chatbot platform. When questioned further, DeepSeek V3 insists it is a version of OpenAI’s GPT-4 model released in 2023. The confusion runs deep; if asked about DeepSeek’s API, the model offers instructions on how to use OpenAI’s API. DeepSeek V3 even reproduces some of GPT-4’s jokes, including the punchlines.
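Curious readers can run the same check themselves. Below is a minimal sketch using the openai Python SDK pointed at DeepSeek’s OpenAI-compatible endpoint; the base URL and model name are assumptions drawn from DeepSeek’s public documentation, and the API key is a placeholder.

```python
# A minimal sketch of the self-identification test described above.
# Assumptions: DeepSeek exposes an OpenAI-compatible endpoint at the base
# URL below, and "deepseek-chat" is the model identifier; both come from
# DeepSeek's public docs. The API key is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",      # placeholder, not a real key
    base_url="https://api.deepseek.com",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="deepseek-chat",  # assumed model identifier
    messages=[{"role": "user", "content": "What model are you, exactly?"}],
)

# In the tests described above, answers to prompts like this one sometimes
# named ChatGPT or GPT-4 rather than DeepSeek V3.
print(response.choices[0].message.content)
```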
So, what’s happening here?
Models like ChatGPT and DeepSeek V3 are statistical systems. Trained on vast amounts of data, they learn patterns from those examples to make predictions, such as learning that the phrase “to whom” in an email typically precedes “it may concern.”
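To make that idea concrete, here is a toy next-word predictor built from raw bigram counts. It is a deliberate simplification: production models learn billions of parameters rather than count tables, but the objective, predict the next token from what came before, is the same in spirit.

```python
# A toy illustration of statistical next-word prediction: count which word
# tends to follow which in a tiny corpus, then "predict" the most likely
# successor. Real models like GPT-4 or DeepSeek V3 learn these regularities
# with billions of parameters instead of raw counts.
from collections import Counter, defaultdict

corpus = [
    "to whom it may concern",
    "to whom it may concern please find attached",
    "to whom should I address this",
]

successors = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        successors[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most frequently observed after `word`."""
    return successors[word].most_common(1)[0][0]

print(predict_next("whom"))  # -> "it", the statistically likely successor
```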
DeepSeek has disclosed very little about the sources of DeepSeek V3’s training data. However, there is no shortage of publicly available datasets containing text generated by GPT-4 via ChatGPT. If DeepSeek V3 was trained on such data, it may have inadvertently memorized some of GPT-4’s outputs and could now be regurgitating them verbatim.
“Clearly, the model is encountering raw responses from ChatGPT at some point, but it’s unclear exactly where,” Mike Cook, a research fellow at King’s College London who specializes in AI, told TechCrunch. “It could be accidental… but unfortunately, we’ve seen instances where models are directly trained on outputs from other models to leverage their knowledge.”
Cook pointed out that training models on the outputs of rival AI systems can severely degrade model quality, leading to hallucinations and misleading answers. “Like photocopying a photocopy, we lose more and more information and connection to reality,” Cook said.
This practice may also violate the terms of service of the originating systems. OpenAI’s terms explicitly prohibit users of its products, ChatGPT included, from using outputs to develop models that compete with OpenAI’s own.
Neither OpenAI nor DeepSeek responded to requests for comment. However, OpenAI CEO Sam Altman appeared to take a dig at DeepSeek and other competitors on X, stating, “It is (relatively) easy to copy something that you know works. It is extremely hard to do something new, risky, and difficult when you don’t know if it will work.”
DeepSeek V3 isn’t the first model to misidentify itself. For instance, Google’s Gemini has been known to claim it’s Baidu’s Wenxinyiyan chatbot when prompted in Mandarin. This is symptomatic of a larger issue — the web is increasingly littered with AI-generated content. Content farms churn out clickbait, and bots flood platforms like Reddit and X. By one estimate, 90% of the web could be AI-generated by 2026.
This “contamination,” as it were, makes it difficult to thoroughly filter AI outputs from training datasets.
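One common mitigation is heuristic fingerprint filtering: scrubbing documents that contain telltale chatbot boilerplate. The sketch below shows the idea with an illustrative (and far from exhaustive) phrase list; its obvious gaps are exactly why contamination is so hard to remove at web scale, since most AI-generated text carries no such fingerprint.

```python
# A minimal sketch of heuristic contamination filtering. The phrase list is
# illustrative only; real pipelines combine many signals, and most
# AI-generated text has no obvious fingerprint at all.
AI_BOILERPLATE = (
    "as an ai language model",
    "i'm chatgpt",
    "i am an ai developed by openai",
)

def looks_ai_generated(document: str) -> bool:
    """Flag documents containing known chatbot boilerplate phrases."""
    lowered = document.lower()
    return any(phrase in lowered for phrase in AI_BOILERPLATE)

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only documents that pass the naive fingerprint check."""
    return [doc for doc in documents if not looks_ai_generated(doc)]

corpus = [
    "The quarterly report shows revenue growth of 12%.",
    "As an AI language model, I cannot browse the internet.",
]
print(filter_corpus(corpus))  # drops the second document
```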
It’s entirely possible that DeepSeek trained DeepSeek V3 directly on ChatGPT-generated text, much like Google has been accused of doing. Heidy Khlaaf, chief AI scientist at the nonprofit AI Now Institute, noted that the cost-efficiency of “distilling” an existing model’s knowledge can be appealing to developers, despite the risks.
“Even with internet data now brimming with AI outputs, other models that accidentally train on ChatGPT or GPT-4 outputs might not necessarily produce responses that mimic OpenAI’s specific outputs,” Khlaaf said. “If DeepSeek did perform distillation using OpenAI models, it wouldn’t be surprising.”
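For context, here is a minimal sketch of what distillation looks like in code: a small “student” network is trained to match the output distribution of a larger “teacher.” The networks and data below are toy stand-ins, and when the teacher is only reachable through an API, practitioners apply the same idea to its sampled text rather than its raw logits.

```python
# A minimal sketch of knowledge distillation with toy stand-in models.
# The "teacher" plays the role of a large existing model; the "student"
# is the smaller model being trained to imitate it.
import torch
import torch.nn.functional as F

vocab_size, hidden = 100, 32
teacher = torch.nn.Linear(hidden, vocab_size)  # stand-in for a large model
student = torch.nn.Linear(hidden, vocab_size)  # smaller model being trained
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens both distributions, a common distillation trick

for step in range(100):
    x = torch.randn(16, hidden)  # toy batch of input representations
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
    # KL divergence pushes the student's distribution toward the teacher's.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```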
However, it’s more likely that a significant portion of ChatGPT/GPT-4 data found its way into DeepSeek V3’s training set. This raises concerns not only about the model’s self-identification but also about the possibility that DeepSeek V3, by uncritically absorbing and repeating GPT-4’s outputs, could amplify some of GPT-4’s biases and flaws.