A recent study has uncovered a significant security weakness in AI visual language models, particularly GPT-4o: these models can be induced to generate harmful text when shown specially crafted flowchart images. The paper, titled “Image-to-Text Logic Jailbreak: Your Imagination Can Help You Do Anything,” demonstrates how such systems can be compromised by combining targeted flowchart visuals with text prompts.
High Success Rate of Attacks
Research findings indicate that OpenAI's GPT-4o exhibits a 92.8% success rate in these logic jailbreak scenarios, while GPT-4-vision-preview shows a 70% success rate. The attack method relies on an automated framework that turns a malicious text prompt into a flowchart image, which is then fed to the visual language model to elicit a harmful response. Interestingly, manually created flowcharts were more effective at inducing harmful outputs than those generated by AI, pointing to challenges in fully automating these attacks.
Implications for AI Safety
The results point to the urgent need for better safety protocols in visual language models as their applications continue to expand. This is consistent with earlier research that also highlighted the vulnerability of such models to combined text and image inputs: one previous study introduced a benchmark called Safe Inputs but Unsafe Output (SIUO) to gauge the safety of visual language models. Only a handful of models, including GPT-4o, attained scores above 50% on that benchmark, indicating that considerable improvements are needed.
Industry Response and Future Measures
As visual language models like GPT-4o and Google Gemini become more prevalent, addressing these security issues is crucial to prevent misuse and potential legal consequences. GPT-4o currently imposes a limit on daily image inputs, but as such caps are lifted, strong safety measures will become even more essential. Governments are already establishing bodies to oversee AI risks, such as the UK's AI Safety Institute, which is expanding its presence to San Francisco.
OpenAI announced GPT-4o in May. Building on the foundation set by GPT-4, which was adept at processing images and text, GPT-4o introduces voice as a new element, making it a natively multimodal platform. This enhancement not only improves the user experience with ChatGPT, OpenAI's popular AI chatbot, but also extends its functionality.
The researchers also introduced a new dataset, known as the Logic Jailbreak Flowcharts (LJF) dataset, designed to assess flowchart image jailbreaks. The collection contains hand-drawn flowcharts depicting 70 harmful activities. In addition, the team developed an automated jailbreak framework that converts harmful actions into declarative sentences and then generates corresponding flowchart images for testing.
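The paper's record format is not reproduced here, but as a rough illustration, an entry in such a flowchart-jailbreak benchmark might pair each image with the behaviour it encodes and, after evaluation, the model's judged response. The schema below is a hypothetical sketch in Python; none of the field names are taken from the released dataset.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FlowchartJailbreakRecord:
    """Hypothetical schema for one entry in a flowchart-jailbreak benchmark."""
    record_id: int                        # index of the harmful activity (e.g. 1-70)
    image_path: str                       # path to the flowchart image shown to the model
    declarative_prompt: str               # declarative sentence the flowchart was drawn from
    source: str = "hand-drawn"            # "hand-drawn" or "ai-generated"
    model_response: Optional[str] = None  # filled in after querying the visual language model
    is_jailbroken: Optional[bool] = None  # judged label: did the response contain harmful content?
```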
Quality of Flowchart Images
The effectiveness of these jailbreak attempts is closely tied to the quality of the flowchart images. Hand-drawn flowcharts were more successful at inducing harmful outputs than AI-generated ones. This finding underscores the importance of comprehensive datasets for accurately evaluating the vulnerabilities of visual language models.
The study details how the success of these jailbreak attempts was assessed, using metrics such as the Attack Success Rate (ASR), the fraction of attack attempts judged to have elicited a harmful response. It also explores the broader implications for the design and deployment of visual language models, stressing the need for robust security measures to guard against risks from multimodal inputs.
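The paper's evaluation code is not reproduced here, but ASR as commonly defined can be computed in a few lines. A minimal sketch follows, using made-up judged labels rather than results from the study:

```python
def attack_success_rate(judged_labels: list[bool]) -> float:
    """Fraction of jailbreak attempts judged to have produced a harmful response.

    `judged_labels` holds one boolean per attempt (True = attack succeeded),
    e.g. as assigned by human reviewers or an automated judge model.
    """
    if not judged_labels:
        raise ValueError("no attempts to score")
    return sum(judged_labels) / len(judged_labels)

# Illustrative only: 13 successes out of 14 attempts is roughly the same order
# of magnitude as the 92.8% reported for GPT-4o in the study.
print(f"ASR = {attack_success_rate([True] * 13 + [False]):.1%}")
```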