OpenAI's GPT-4o has a Chinese Porn and Spam Problem

Chinese tokens used to train GPT-4o seem to be predominantly spam and pornographic phrases

OpenAI's latest multimodal model, GPT-4o, is facing scrutiny over problems with its Chinese token data. The issues stem from inadequate data cleaning, which can lead to performance problems and misuse, according to a researcher who took a closer look at the model's public token library.

Tokens are the basic units in language models, representing words, expressions, or characters. They allow the model to process text more efficiently by recognizing consistent strings of characters. GPT-4o's new tokenizer includes 200,000 tokens, 25% of them in non-English languages, a change aimed at improving multilingual tasks. However, the Chinese tokens are predominantly spam and pornographic phrases that are not common in everyday language. The mismatch between the token list and everyday Chinese is attributed to insufficient data filtering during the training phase.
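To illustrate why a vocabulary with longer entries lets a model cover the same text in fewer units, here is a minimal toy sketch. It uses a simple greedy longest-match split, which is an assumption made for illustration and not the algorithm GPT-4o's tokenizer actually uses; the function name and the small vocabularies are hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest matching vocabulary entries, left to right.

    Toy illustration only: characters not covered by any vocabulary
    entry are emitted as single-character tokens.
    """
    max_len = max((len(entry) for entry in vocab), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: the second one contains a longer entry.
print(greedy_tokenize("tokens", {"to", "ken"}))           # ['to', 'ken', 's']
print(greedy_tokenize("tokens", {"to", "ken", "token"}))  # ['token', 's']
```

The same string needs three units with the smaller vocabulary and two with the larger one. A long token only pays off if the phrase it covers is actually common in the text the model sees, which is why long spam phrases in the vocabulary are a wasted slot.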

Impact on Model Performance

The presence of these inappropriate tokens can cause the model to generate nonsensical or unrelated responses. Researchers have shown that these tokens can also be exploited to bypass OpenAI's safety mechanisms, enabling the model to produce unsafe content. Tianle Cai, a PhD student at Princeton University, identified the issue by analyzing the longest Chinese tokens in GPT-4o's public token library, finding that most were related to gambling and pornography.
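The kind of inspection described above can be approximated in a few lines. The following is a rough sketch, not the researcher's actual script: it assumes the vocabulary is available as a list of raw byte strings, and the helper names and the sample data are hypothetical.

```python
def contains_cjk(text):
    """Return True if any character is in the CJK Unified Ideographs block."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def longest_chinese_tokens(token_bytes, top_n=10):
    """Decode raw token bytes and return the longest tokens with Chinese characters."""
    decoded = []
    for raw in token_bytes:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Byte-level tokens are not always valid UTF-8 on their own; skip them.
            continue
        if contains_cjk(text):
            decoded.append(text)
    # Length is measured in characters, longest first.
    return sorted(decoded, key=len, reverse=True)[:top_n]


# Hypothetical stand-in for a real vocabulary.
sample = [b"hello", "中文".encode("utf-8"), "你好世界".encode("utf-8"), b"\xe4\xb8"]
print(longest_chinese_tokens(sample, top_n=2))  # ['你好世界', '中文']
```

To run this against the real vocabulary, one option would be the open-source `tiktoken` package, where `tiktoken.get_encoding("o200k_base").token_byte_values()` should return the byte strings for the encoding used by GPT-4o. That call downloads the vocabulary on first use, so it is left out of the sketch.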

Data Cleaning and Solutions

Experts suggest that the problem arises from the training data being polluted by spam websites that hijack unrelated content to boost their visibility. This issue was not present in previous versions of the tokenizer used in GPT-3.5 and GPT-4. Solutions to this problem include applying rigorous data cleaning processes and ensuring that the tokenizer and the language model are trained on consistent data sets. Simple techniques, such as automatic translation of detected keywords, could significantly reduce the prevalence of spam.
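As a rough illustration of keyword-based cleaning, the sketch below drops documents that contain known spam terms before they would be used to train a tokenizer. It is an assumption about what such a filter might look like, not OpenAI's pipeline: the keyword list, threshold, and function names are all hypothetical, and the translation step mentioned above is not implemented here.

```python
# Hypothetical keyword list: "lottery" and "casino".
SPAM_KEYWORDS = {"彩票", "赌场"}


def spam_hits(document, keywords=SPAM_KEYWORDS):
    """Count how many times any spam keyword occurs in the document."""
    return sum(document.count(keyword) for keyword in keywords)


def filter_corpus(documents, max_hits=0, keywords=SPAM_KEYWORDS):
    """Keep only documents with at most max_hits spam keyword occurrences."""
    return [doc for doc in documents if spam_hits(doc, keywords) <= max_hits]


corpus = ["今天天气很好", "彩票赌场彩票"]
print(filter_corpus(corpus))  # ['今天天气很好']
```

A real filter would need a much larger keyword list and some tolerance for legitimate pages that mention these terms, which is what the `max_hits` threshold is meant to hint at.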

The issue underscores the importance of thorough data cleaning in the development of language models, particularly for non-English languages. As OpenAI continues to refine its models, addressing these data quality issues will be essential for improving performance and maintaining user trust.

Markus Kasanmascheff
Markus is the founder of WinBuzzer and has been playing with Windows and technology for more than 25 years. He holds a Master's degree in International Economics and previously worked as Lead Windows Expert for Softonic.com.