Nvidia researchers have unveiled a new text-to-image personalization method named Perfusion. Unlike its heavyweight counterparts, Perfusion is a compact model with a size of just 100KB and a training time of approximately four minutes.
Perfusion: A New Approach to AI Art Creation
Developed by Nvidia and Tel Aviv University in Israel, Perfusion offers a unique approach to portraying personalized concepts while preserving their identity. Despite its small size, Nvidia claims it outperforms the fine-tuning methods used with leading AI art generators such as Stability AI's Stable Diffusion v1.5, the recently released Stable Diffusion XL (SDXL), and Midjourney.
The primary innovation in Perfusion is a mechanism called “Key-Locking”. This process links new concepts that a user wants to introduce, like a specific cat or chair, to a broader category during image generation. For instance, the cat would be associated with the wider concept of a “feline”. This method helps prevent overfitting, a common issue where the model becomes too narrowly adjusted to the exact training examples, making it difficult for the AI to generate new creative versions of the concept.
Key-Locking Mechanism
By associating the new cat with the general idea of a feline, the model can depict the cat in various poses, appearances, and environments while still maintaining the essential “catness” that makes it look like the intended cat, not just any random feline. In essence, Key-Locking allows the AI to flexibly portray personalized concepts while preserving their core identity.
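Conceptually, key-locking operates inside the diffusion model's cross-attention layers. The toy sketch below illustrates the idea (the names, dimensions, and values are illustrative assumptions, not taken from the paper): the personalized token's attention key is pinned to the supercategory's key, while its value remains free to encode the specific appearance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding width

# Toy token embeddings (illustrative values, not from the paper)
e_category = rng.normal(size=d)  # supercategory word, e.g. "cat"
e_concept = rng.normal(size=d)   # learned embedding for the personalized cat

# Toy cross-attention projections
W_k = rng.normal(size=(d, d))  # key projection
W_v = rng.normal(size=(d, d))  # value projection

# Without key-locking, the new token gets its own attention key, which can
# drift away from "cat" during training and overfit to the example images.
k_unlocked = W_k @ e_concept

# With key-locking, the new token's KEY is pinned to the supercategory's key,
# so it attends over the image the way a generic "cat" would ...
k_locked = W_k @ e_category

# ... while its VALUE still injects the personalized appearance.
v_concept = W_v @ e_concept
```

The split of responsibilities is the point: the key decides *where* the concept participates in the image (kept generic to avoid overfitting), while the value decides *what* it looks like (kept personalized).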
Perfusion also enables the combination of multiple personalized concepts in a single image with natural interactions, unlike existing tools, which learn concepts in isolation. Users can guide the image creation process through text prompts, merging concepts such as a specific animal with objects like a chair, books, or clothes.
Controlling Visual Fidelity and Textual Alignment
A unique feature of Perfusion is that it lets users control the balance between visual fidelity (how closely the output matches the trained concept) and textual alignment (how closely it follows the prompt) at inference time, all within a single 100KB model. This lets users explore the trade-off between image similarity and text similarity and pick the balance that suits their needs, without any retraining.
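Because the trade-off is a runtime setting rather than a training choice, it can be sketched as a single blending knob. The function below is a hypothetical illustration of that idea (the name `beta` and the linear blend are assumptions, not Perfusion's actual mechanism):

```python
import numpy as np

def blend_values(v_concept: np.ndarray, v_category: np.ndarray, beta: float) -> np.ndarray:
    """Hypothetical runtime knob: beta=1.0 favors visual fidelity (the
    personalized concept's value), beta=0.0 favors textual alignment
    (the generic category's value). Only the blend weight changes
    between generations; no weights are retrained."""
    return beta * v_concept + (1.0 - beta) * v_category

v_concept = np.array([1.0, 0.0])   # toy personalized value
v_category = np.array([0.0, 1.0])  # toy generic value

print(blend_values(v_concept, v_category, 1.0))  # pure concept
print(blend_values(v_concept, v_category, 0.0))  # pure category
```

Sweeping such a knob across a range of values is what lets a user generate a spectrum of outputs, from prompt-faithful to concept-faithful, from one trained model.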
Nvidia's Growing Focus on AI
This research aligns with Nvidia's growing focus on AI. The company's stock has surged over 230% in 2023, as its GPUs continue to dominate the training of AI models. With entities like Anthropic, Google, Microsoft, and Baidu investing heavily in generative AI, Nvidia's innovative Perfusion model could provide it with a competitive edge. Nvidia has so far presented only the research paper, promising to release the code soon.
Comparison with Other AI Image Generators
While other AI image generators offer ways for users to fine-tune output, the resulting files are typically far larger. For instance, a LoRA, a popular fine-tuning method used with Stable Diffusion, can add anywhere from dozens of megabytes to more than a gigabyte (GB) to the base model. Textual inversion embeddings, another method, are lighter but less accurate. A model trained with DreamBooth, currently the most accurate technique, weighs more than 2GB.
In contrast, Nvidia claims that Perfusion delivers superior visual quality and prompt alignment compared with the leading techniques mentioned above. Its ultra-compact size comes from updating only the model components relevant to the new concept, rather than fine-tuning the entire multi-gigabyte model.
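A back-of-the-envelope calculation shows why editing a handful of projection matrices with rank-one updates (the approach described in the Perfusion paper, "Key-Locked Rank One Editing for Text-to-Image Personalization") can yield kilobyte-scale checkpoints. The width and layer count below are illustrative assumptions, not the actual Stable Diffusion or Perfusion configuration:

```python
# Back-of-the-envelope storage comparison: full fine-tune vs. rank-1 edits.
# The width and layer count are illustrative assumptions.
d = 768            # text-conditioning width
n_layers = 16      # cross-attention layers whose K/V projections are edited
bytes_per_float = 4

# Fine-tuning both d x d projection matrices (K and V) in every layer:
full_bytes = n_layers * 2 * d * d * bytes_per_float

# Storing a rank-1 edit (one input vector + one output vector) per matrix:
rank1_bytes = n_layers * 2 * (d + d) * bytes_per_float

print(f"full K/V weights: {full_bytes / 1e6:.1f} MB")
print(f"rank-1 edits:     {rank1_bytes / 1e3:.1f} KB")
```

Even with these rough numbers, the rank-1 edits come out hundreds of times smaller than the matrices they modify, which is consistent with the ~100KB figure Nvidia reports.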
Through other research projects, Nvidia has been advancing the state of the art in generative AI, with new methods to enhance the realism and quality of AI-generated images.
Recent examples of AI Image Generators
- OpenAI, the research organization behind DALL-E, has introduced Shap-E, a generative model that can create 3D assets from text, opening up new possibilities for AI in image creation.
- Stability AI, a startup that focuses on generative AI, has released StableStudio, an open-source web app that uses its Stable Diffusion model to generate images from text prompts. Users can also use DreamStudio features to make multiple variations of an image with different styles and attributes.
- Meta, the company formerly known as Facebook, has unveiled I-JEPA, an image model based on its Joint Embedding Predictive Architecture, which learns visual concepts by predicting representations of masked image regions rather than generating pixels directly from text.
- Alibaba, the Chinese e-commerce giant, has launched Tongyi Wanxiang, an AI image generator that handles prompts in both Chinese and English. Users can customize the image output parameters using Composer, a large model developed by Alibaba Cloud.