OpenAI's latest multimodal model, GPT-4o, is facing scrutiny over problems with its Chinese token data. According to a researcher who examined GPT-4o's public token library, the issues stem from inadequate data cleaning, which can lead to degraded performance and potential misuse.
Tokens are the basic units in language models, representing words, expressions, or characters. They allow the model to process text more efficiently by recognizing recurring strings of characters. GPT-4o's new tokenizer includes 200,000 tokens, with about 25 percent covering non-English languages, a change aimed at improving performance on multilingual tasks. However, many of the longest Chinese tokens turn out to be spam and pornographic phrases rather than expressions commonly used in everyday language. This discrepancy points to insufficient filtering of the data used to train the tokenizer.
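This kind of inspection can be reproduced with a short script against the public token library. The following sketch assumes the open-source tiktoken package and its o200k_base encoding, which is used by GPT-4o; it enumerates the vocabulary, keeps tokens containing Chinese characters, and lists the longest ones.

```python
# Minimal sketch: list the longest Chinese tokens in GPT-4o's tokenizer.
# Assumes a tiktoken version that ships the o200k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's ~200k-token vocabulary

def contains_cjk(text: str) -> bool:
    """Rough check for CJK Unified Ideographs."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)

chinese_tokens = []
for token_id in range(enc.n_vocab):
    try:
        raw = enc.decode_single_token_bytes(token_id)
        text = raw.decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        # Skip unused ids and byte sequences that are not valid UTF-8 on their own.
        continue
    if contains_cjk(text):
        chinese_tokens.append((token_id, text))

# Sort longest first; researchers report these are dominated by spam phrases.
chinese_tokens.sort(key=lambda item: len(item[1]), reverse=True)
for token_id, text in chinese_tokens[:20]:
    print(token_id, repr(text))
```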
Impact on Model Performance
The presence of these inappropriate tokens can cause the model to generate nonsensical or unrelated responses. Researchers have shown that these tokens can also be exploited to bypass OpenAI's safety mechanisms, enabling the model to produce unsafe content. Tianle Cai, a PhD student at Princeton University, identified the issue by analyzing the longest Chinese tokens in GPT-4o's public token library, finding that most were related to gambling and pornography.
Just wrote a script to further investigate how the corpus used to train the gpt4o tokenizer is polluted by Internet scams. The results are quite interesting… 🤦‍♂️🤦‍♂️🤦‍♂️ https://t.co/Fc2T4rSHix https://t.co/Q1Syh9amJn pic.twitter.com/lQ1u5aQoAs
— Tianle Cai (@tianle_cai) May 13, 2024
Data Cleaning and Solutions
Experts suggest that the problem arises from training data polluted by spam websites that hijack unrelated content to boost their visibility. This issue was not present in the tokenizer used by GPT-3.5 and GPT-4. Proposed remedies include applying rigorous data cleaning and ensuring that the tokenizer and the language model are trained on consistent datasets. Even simple techniques, such as automatically translating candidate tokens and screening them against known spam keywords, could significantly reduce the prevalence of spam.
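As a rough illustration of such keyword-based screening, and not a description of OpenAI's actual pipeline, the sketch below filters candidate training lines against a small, hypothetical blocklist of spam-related terms before they would reach tokenizer training.

```python
# Simplified sketch of keyword-based corpus filtering before tokenizer training.
# The blocklist and sample lines are hypothetical placeholders.

SPAM_KEYWORDS = {"彩票", "赌场", "色情"}  # hypothetical gambling/adult spam terms ("lottery", "casino", "porn")

def is_clean(line: str, blocklist: set = SPAM_KEYWORDS) -> bool:
    """Return True if the line contains none of the blocked keywords."""
    return not any(keyword in line for keyword in blocklist)

def filter_corpus(lines):
    """Yield only lines that pass the keyword screen."""
    for line in lines:
        if is_clean(line):
            yield line

if __name__ == "__main__":
    sample = [
        "今天天气很好",        # "The weather is nice today" - clean
        "免费领取彩票大奖",    # "Claim your free lottery jackpot" - spam-like
    ]
    print(list(filter_corpus(sample)))  # only the clean line survives
```

In practice a production pipeline would combine this with translation-based checks, frequency analysis, and source-level filtering of spam domains, but the principle of screening the tokenizer's training corpus remains the same.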
The issue underscores the importance of thorough data cleaning in the development of language models, particularly for non-English languages. As OpenAI continues to refine its models, addressing these data quality issues will be essential for improving performance and maintaining user trust.