Google DeepMind has introduced Zipper, a novel approach designed to merge generative foundation models dealing with text, speech, and images. The goal is to improve performance on tasks that require integrating information from different types of data.
Combining models trained on distinct data types can be problematic. Aligning these modalities while keeping unimodal effectiveness intact is complex. Traditional methods, like expanding model vocabularies or aligning multimodal datasets, fall short by being cumbersome and requiring extensive aligned data.
Google's new Zipper Architecture
Zipper addresses these issues by linking independently trained unimodal decoders with cross-attention mechanisms. This makes it possible to combine different types of data, such as text, speech, and images. Typically, each type of data (or modality) is processed by a model specifically trained for that particular data type. These specific models are referred to as “unimodal decoders”, where “unimodal” means they focus on one mode or type of data.
In the Zipper architecture, these unimodal decoders are trained separately, allowing them to become highly effective at handling their respective data types. To make the separate decoders work together, Zipper employs “cross-attention mechanisms”. These mechanisms enable each decoder to attend to and process information from the other decoders' outputs.
This means that even though the decoders are trained separately, they can still interact and integrate their findings, thus enhancing the model's ability to handle tasks that involve multiple types of data simultaneously. For example, this can be beneficial in scenarios like understanding a video where you need to process and integrate both the visual information and the spoken words.
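The cross-attention idea described above can be illustrated with a minimal sketch. It is not the implementation from the Zipper paper: the function names, the toy hidden states, and the omission of learned query, key, and value weights are all assumptions made to keep the example short. It only shows how positions from one decoder can weight and mix the outputs of another.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Let each query row (one decoder) attend over the rows of another decoder.

    Assumption: learned query/key/value weights are left out for brevity.
    """
    scale = math.sqrt(len(keys[0]))
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values), weights

# Hypothetical hidden states: 2 text positions, 3 speech positions, width 2.
text_states = [[1.0, 0.0], [0.0, 1.0]]
speech_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Text positions act as queries; speech positions supply keys and values.
attended, weights = cross_attention(text_states, speech_states, speech_states)
```

Each row of `weights` sums to 1 and says how strongly one text position draws on each speech position. `attended` holds the resulting mix of speech states, one row per text position.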
Use of Projection Layers
In the Zipper architecture, “projection layers” are used to ensure that the data from different modalities (like text, speech, and images) can be effectively combined. These layers adjust the data's dimensions so that they align with each other, making it easier for the model to manage transitions and interactions between these different types of data during the process of cross-attention.
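As a rough illustration of this dimension alignment, the sketch below applies a linear projection that maps hidden states from one width to another. The sizes, weights, and the `project` helper are hypothetical and chosen for readability. In a trained model the weights would be learned, and the article does not say what form Zipper's projection layers take.

```python
def project(states, weight, bias):
    """Apply a linear projection to every row: row @ weight + bias."""
    return [
        [sum(x * w for x, w in zip(row, col)) + b
         for col, b in zip(zip(*weight), bias)]
        for row in states
    ]

# Hypothetical sizes: the speech decoder emits width-3 states,
# while the text decoder expects width-2 states.
speech_states = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shape 3 x 2, learned in practice
bias = [0.5, -0.5]

projected = project(speech_states, weight, bias)
print(projected)  # [[4.5, 4.5], [0.5, 0.5]]
```

After the projection, every speech state has the same width as a text state, so the two can be compared directly inside cross-attention.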
When generating outputs such as text or speech, the model follows a predefined sequence of steps to complete the task. This approach is somewhat similar to expanding the model's vocabulary to understand and generate more complex data combinations. However, Zipper's method is more flexible and allows for better integration and reuse of the individual unimodal decoders, enhancing the model's overall effectiveness in handling complex multimodal tasks.
The Zipper architecture can flexibly compose multimodal generative models from independently pre-trained unimodal decoders, and those decoders can be reused and repurposed in new multimodal combinations.
Tests with PaLM2 Models
Experiments used PaLM2 models for the text component and a comparable architecture for the speech side, initially trained on the LibriLight dataset. Results showed that automatic speech recognition (ASR) performance didn't drop when the text backbone was frozen. Zipper excelled particularly in text-to-speech (TTS) tasks when the speech component was not frozen. The model delivered impressive results using just 1% of the original training data, outperforming baselines with much less aligned data.
On speech-generative TTS tasks with unfrozen modality backbones, Zipper reduces word error rate (WER) by 12 absolute points (a 40% relative error reduction) compared to vocabulary expansion baselines.
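The absolute and relative figures can be related with a little arithmetic. The sketch below assumes that both numbers are measured against the same vocabulary expansion baseline and that neither is heavily rounded. The article does not report the underlying error rates, so the values it derives are implied estimates only.

```python
def implied_baseline(absolute_drop_points, relative_drop_percent):
    """Baseline error rate implied by an absolute drop and its relative share."""
    return absolute_drop_points * 100 / relative_drop_percent

# Figures quoted in the article: 12 absolute points, 40% relative reduction.
baseline_wer = implied_baseline(12, 40)  # implied baseline WER, in percent
zipper_wer = baseline_wer - 12           # implied Zipper WER, in percent

print(baseline_wer, zipper_wer)  # 30.0 18.0
```

Under those assumptions, a drop of 12 points that amounts to 40% of the baseline implies a baseline WER of roughly 30% and a Zipper WER of roughly 18%.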
Future research aims to scale Zipper to accommodate more modalities and larger datasets. The architecture's potential for expanding generative modeling across various fields appears promising. For more detail, see the research paper, which is available on arXiv.