Meta's Fundamental AI Research (FAIR) team has launched a suite of new AI models spanning audio generation, mixed text-and-image tasks, and watermarking.
“Today, we're excited to share some of the most recent FAIR research models with the global community. We're publicly releasing five models including image-to-text and text-to-music generation models, a multi-token prediction model and a technique for detecting AI-generated speech. By publicly sharing this research, we hope to inspire iterations and ultimately help advance AI in a responsible way,” Meta said in the release.
New Text-to-Music Generation Tool
Leading this announcement is JASCO, which stands for Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation. This tool allows users to compose and adjust music through text commands, enabling control over elements like chords and drums. The JASCO inference code will be released under an MIT license, while the pre-trained model will be released under a Creative Commons license for non-commercial use.
JASCO can produce high-quality music samples conditioned on global text descriptions and detailed local controls. It utilizes the Flow Matching modeling paradigm in conjunction with an innovative conditioning technique, enabling the generation of controllable music at both the local level, such as chords, and the global level through text descriptions.
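For readers unfamiliar with the paradigm: flow matching trains a network to predict a velocity field that transports random noise into data samples. In the common linear-path formulation (a general sketch of the paradigm, not necessarily the exact variant JASCO uses), the training objective is

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1} \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2, \qquad x_t = (1 - t)\, x_0 + t\, x_1,$$

where $x_0$ is noise, $x_1$ is a data sample, and $t \in [0, 1]$. At inference time, integrating the learned velocity field from $t = 0$ to $t = 1$ turns noise into an audio sample.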
This allows both symbolic and audio-based conditions to be incorporated into the same text-to-music model. JASCO lets users experiment with various symbolic control signals (e.g., chords, melody) as well as audio representations (e.g., separated drum tracks, full-mix audio).
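To make the idea concrete, here is a sketch of how such joint conditioning could be driven from Python. The class name, checkpoint identifier, and method signatures below are assumptions for illustration, not the confirmed AudioCraft API:

```python
# Hypothetical sketch of JASCO-style joint conditioning.
# JASCO, the checkpoint id, and generate_music() are assumed names,
# not the confirmed AudioCraft API; check the repository for specifics.
from audiocraft.models import JASCO  # assumed entry point

model = JASCO.get_pretrained("facebook/jasco-chords-drums")  # assumed checkpoint id

# Global control: a free-text description of the overall track.
text = "90s rock with heavy drums and a clean electric guitar"

# Local control: chord symbols with their onset times in seconds.
chords = [("C", 0.0), ("Am", 2.0), ("F", 4.0), ("G", 6.0)]

# Generate audio conditioned jointly on the text and the symbolic chord track.
wav = model.generate_music(descriptions=[text], chords=chords)
```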
JASCO promises to reshape AI-driven music production by letting users adjust sounds through text inputs, adding a new layer of customization over individual musical elements. The model joins Meta's AudioCraft audio library, offering musicians and producers new creative avenues to explore.
AudioSeal for Watermarking
Meta FAIR also rolled out AudioSeal, a new tool focusing on watermarking AI-generated speech. AudioSeal helps distinguish AI-generated audio from human-produced content, enhancing verification processes.
AudioSeal jointly trains two components: a generator that embeds an imperceptible watermark in audio and a detector that identifies that watermark within longer segments, even after editing. It delivers top-tier detection for both natural and synthetic speech at sample-level resolution (one sample at 16 kHz, i.e., 1/16,000 of a second), with minimal impact on signal quality and robustness against various types of audio editing.
Designed with a swift, single-pass detector, AudioSeal avoids the slower, more complex decoding of earlier watermarking approaches and is optimized for large-scale, real-time applications. According to Meta, it can detect AI-generated content up to 485 times faster than traditional methods. Unlike JASCO, AudioSeal will be released under a commercial license.
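For illustration, here is a minimal sketch using the audioseal Python package Meta published alongside the model (exact signatures may differ between releases; consult the repository):

```python
import torch
from audioseal import AudioSeal

# Load the watermark generator and embed a watermark into 16 kHz audio.
generator = AudioSeal.load_generator("audioseal_wm_16bits")
wav = torch.randn(1, 1, 16000)  # placeholder: (batch, channels, samples) at 16 kHz
watermark = generator.get_watermark(wav, 16000)
watermarked = wav + watermark  # the watermark is additive and imperceptible

# Load the detector and estimate whether the clip carries the watermark.
detector = AudioSeal.load_detector("audioseal_detector_16bits")
result, message = detector.detect_watermark(watermarked, 16000)
print(result)  # probability that the audio is watermarked
```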
Chameleon Models for Text and Vision
Additionally, Meta FAIR unveiled its Chameleon mixed-modal models in two sizes, 7B and 34B parameters. These models handle combined visual and textual tasks, such as image captioning.
Unlike traditional models that process text and images through separate pipelines, Chameleon integrates them from the outset, mapping both modalities into a shared token space so a single model can understand and generate mixed-modal content.
Chameleon outperforms many specialized models on image captioning and text generation tasks, and its unified handling of mixed content allows it to understand and produce complex interleaved documents more efficiently.
The FAIR team applied dedicated training techniques to stabilize this early-fusion approach and ensure strong performance on mixed content. These advances allow Chameleon to surpass other prominent models, including Gemini Pro and GPT-4V, in human evaluations.
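To illustrate the early-fusion idea (a toy sketch, not Meta's implementation): images are quantized into discrete tokens that share a vocabulary with text tokens, so one transformer models the interleaved sequence.

```python
import torch
import torch.nn as nn

# Toy sketch of early fusion (illustrative only, not Meta's code):
# text and image both become discrete tokens in one shared vocabulary,
# so a single transformer models the interleaved sequence.
TEXT_VOCAB, IMAGE_VOCAB, D_MODEL = 32000, 8192, 512
VOCAB = TEXT_VOCAB + IMAGE_VOCAB  # image codes sit past the text ids

embed = nn.Embedding(VOCAB, D_MODEL)
layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
trunk = nn.TransformerEncoder(layer, num_layers=2)
lm_head = nn.Linear(D_MODEL, VOCAB)

text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))       # e.g. "caption this image:"
image_ids = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))  # e.g. discrete codes for image patches
sequence = torch.cat([text_ids, image_ids], dim=1)     # one mixed-modal sequence

# Causal mask so each position only attends to earlier tokens (autoregressive).
mask = nn.Transformer.generate_square_subsequent_mask(sequence.size(1))
logits = lm_head(trunk(embed(sequence), mask=mask))  # next-token logits over text AND image tokens
print(logits.shape)  # torch.Size([1, 80, 40192])
```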
Image generation features will not be available at this stage. These text models will be released under a research-only license, providing researchers with the tools necessary for advanced studies.
Advancements in Language Model Training
FAIR's latest innovation also includes a multi-token prediction approach for language model training. This enables models to predict multiple upcoming words at once, improving both the efficiency and accuracy of tasks like code completion. This method is likely to enhance the capabilities of large language models across various applications.
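A minimal sketch of the idea (illustrative only; FAIR's actual architecture and training setup differ in detail): a shared trunk feeds several output heads, and head k is trained to predict the token k+1 steps ahead.

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_FUTURE = 32000, 512, 4  # predict 4 future tokens per position

class MultiTokenLM(nn.Module):
    """Toy multi-token predictor: shared trunk, one head per future offset."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        # Head k is trained to predict token t + k + 1 from position t.
        self.heads = nn.ModuleList([nn.Linear(D_MODEL, VOCAB) for _ in range(N_FUTURE)])

    def forward(self, ids):
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        h = self.trunk(self.embed(ids), mask=mask)  # causal shared representation
        return torch.stack([head(h) for head in self.heads], dim=1)

model = MultiTokenLM()
logits = model(torch.randint(0, VOCAB, (1, 32)))
print(logits.shape)  # torch.Size([1, 4, 32, 32000]) -> (batch, future offset, position, vocab)
```

At inference, the extra heads can be dropped, keeping standard next-token prediction, or used to draft several tokens per forward pass in a speculative-decoding style, which is where speedups on tasks like code completion come from.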
Meta FAIR also introduced a model aimed at improving geographical and cultural diversity in text-to-image generation systems. Alongside it, the team is releasing its geographic-disparities evaluation code and annotations, with the goal of improving how text-to-image models are evaluated for their representation of different cultures and regions.
Meta's investment in AI and metaverse technologies is reflected in its projected capital expenditures, expected to be between $35 billion and $40 billion by the end of 2024. This is $5 billion more than initially anticipated, highlighting the company's strategic priorities in these areas.