A global team of researchers has unveiled SignLLM, a multilingual model for generating sign language gestures from textual inputs. The model seeks to address longstanding issues in sign language data processing and to advance the recognition and generation of multiple sign languages.
Challenges in Sign Language Data Processing
Sign Language Production (SLP) aims to develop avatars that can produce sign language from text. The conventional pipeline first transforms text into gloss, a written transcription of signs, then generates videos that simulate signing motion, and finally refines those videos into more lifelike avatar animations. Each of these stages has historically posed significant challenges.
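To make the staged structure concrete, here is a minimal Python sketch of that conventional text-to-gloss-to-pose-to-video pipeline. The stage functions and the tiny gloss lexicon are hypothetical placeholders for illustration, not part of SignLLM or any published system.

```python
# Hypothetical sketch of the conventional SLP pipeline: text -> gloss -> poses -> rendered avatar.
from typing import List

# Toy text-to-gloss lookup; real systems use a trained translation model.
GLOSS_LEXICON = {"hello": "HELLO", "my": "MY", "name": "NAME"}

def text_to_gloss(text: str) -> List[str]:
    """Map each word to a gloss token, dropping unknown words."""
    return [GLOSS_LEXICON[w] for w in text.lower().split() if w in GLOSS_LEXICON]

def gloss_to_poses(glosses: List[str]) -> List[List[float]]:
    """Produce one placeholder skeletal pose (a flat keypoint vector) per gloss."""
    return [[0.0] * 50 for _ in glosses]  # 50 dummy keypoint coordinates per frame

def poses_to_video(poses: List[List[float]]) -> str:
    """Stand-in for rendering: report how many frames would be animated."""
    return f"rendered avatar animation with {len(poses)} frames"

if __name__ == "__main__":
    print(poses_to_video(gloss_to_poses(text_to_gloss("Hello my name"))))
```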
Over the past decade, datasets such as PHOENIX14T for German Sign Language (DGS) and others used for Sign Language Production (SLP), Sign Language Recognition (SLR), and Sign Language Translation (SLT) have been difficult to work with. The lack of standardized tools and slow progress on minority sign languages have further complicated research, and work on American Sign Language (ASL) remains at an early stage.
New Prompt2Sign Dataset
In response, researchers from institutions including Rutgers University, Australian National University, Data61/CSIRO, Carnegie Mellon University, the University of Texas at Dallas, and the University of Central Florida have created the Prompt2Sign dataset. It captures the upper-body movements of sign language demonstrators and covers eight distinct sign languages, sourced from publicly available online videos and existing datasets.
Using OpenPose, a pose-estimation tool, the team extracts and standardizes keypoint data from video frames, reducing redundancy and simplifying training with sequence-to-sequence and text-to-text models. They also automatically generate prompt words, lessening the need for manual annotation and making the processing tools more efficient.
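As a rough illustration of what such preprocessing involves, the sketch below flattens per-frame OpenPose JSON output into a compact, normalized feature vector suitable for sequence-to-sequence training. The OpenPose JSON keys ("people", "pose_keypoints_2d", the hand keypoint arrays) are the tool's documented output; the choice of upper-body joints and the neck-centered, shoulder-scaled normalization are illustrative assumptions, not necessarily Prompt2Sign's exact recipe.

```python
# Sketch: turn per-frame OpenPose JSON into normalized upper-body feature vectors.
import json
from pathlib import Path
from typing import List

# BODY_25 indices for the upper body: nose, neck, shoulders, elbows, wrists.
UPPER_BODY = [0, 1, 2, 3, 4, 5, 6, 7]
NECK, R_SHOULDER, L_SHOULDER = 1, 2, 5


def frame_to_vector(frame_json: dict) -> List[float]:
    """Flatten one OpenPose frame into a normalized (x, y) feature vector."""
    person = frame_json["people"][0]                    # assume one signer per frame
    body = person["pose_keypoints_2d"]                  # [x0, y0, c0, x1, y1, c1, ...]
    hands = person["hand_left_keypoints_2d"] + person["hand_right_keypoints_2d"]

    # Center on the neck and scale by shoulder width so camera position
    # and signer size do not leak into the training data (assumed scheme).
    cx, cy = body[NECK * 3], body[NECK * 3 + 1]
    shoulder_width = abs(body[R_SHOULDER * 3] - body[L_SHOULDER * 3]) or 1.0

    def normalize(points: List[float], indices=None) -> List[float]:
        idx = indices if indices is not None else range(len(points) // 3)
        out = []
        for i in idx:
            out.append((points[i * 3] - cx) / shoulder_width)
            out.append((points[i * 3 + 1] - cy) / shoulder_width)
        return out

    return normalize(body, UPPER_BODY) + normalize(hands)


def video_to_sequence(json_dir: str) -> List[List[float]]:
    """Read a directory of per-frame OpenPose JSON files into one training sequence."""
    return [
        frame_to_vector(json.loads(p.read_text()))
        for p in sorted(Path(json_dir).glob("*_keypoints.json"))
    ]
```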
Building on the Prompt2Sign dataset, the researchers developed SignLLM, the first large-scale multilingual SLP model capable of generating skeletal poses for eight different sign languages from text prompts. It offers two operational modes: the Multi-Language Switching Framework (MLSF), which dynamically adds encoder-decoder groups to handle multiple sign languages in parallel, and the Prompt2LangGloss module, which supports static encoder-decoder pairs.
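The following PyTorch sketch illustrates the multi-language switching idea: one encoder-decoder group per sign language, selected at runtime by a language tag. The layer sizes, GRU-based seq2seq core, and placeholder language tags are assumptions made for illustration, not the released SignLLM architecture.

```python
# Sketch of a multi-language switching setup with per-language encoder-decoder groups.
import torch
import torch.nn as nn

# ASL and DGS are mentioned in the article; the remaining tags are placeholders
# for the other sign languages covered by Prompt2Sign.
LANGUAGES = ["ASL", "DGS"] + [f"lang_{i}" for i in range(6)]


class Seq2SeqGroup(nn.Module):
    """One encoder-decoder pair mapping text embeddings to pose frames."""

    def __init__(self, text_dim=512, hidden=256, pose_dim=150):
        super().__init__()
        self.encoder = nn.GRU(text_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(pose_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    def forward(self, text_emb, prev_poses):
        _, context = self.encoder(text_emb)          # summarize the text prompt
        out, _ = self.decoder(prev_poses, context)   # condition pose decoding on it
        return self.to_pose(out)                     # one skeletal pose per time step


class MultiLanguageSwitcher(nn.Module):
    """Route each batch to the encoder-decoder group for its language."""

    def __init__(self):
        super().__init__()
        self.groups = nn.ModuleDict({lang: Seq2SeqGroup() for lang in LANGUAGES})

    def forward(self, language: str, text_emb, prev_poses):
        return self.groups[language](text_emb, prev_poses)


if __name__ == "__main__":
    model = MultiLanguageSwitcher()
    text = torch.randn(2, 12, 512)    # batch of 2 prompts, 12 token embeddings each
    poses = torch.randn(2, 30, 150)   # 30 previous pose frames of 150 keypoint values
    print(model("ASL", text, poses).shape)  # -> torch.Size([2, 30, 150])
```

New encoder-decoder groups can be added to the `ModuleDict` without retraining the others, which is the appeal of keeping one group per language rather than a single shared pair.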
To see how this works, the team behind SignLLM has created some demo videos, which you can watch on the SignLLM GitHub page. As they write, the “videos are not direct output from our model and actually a lot of trouble to make, they are just for demonstration purposes. These videos are the result of reprocessing the pose videos output of our model using the style transfer model, a lot of the media spontaneously exaggerate our work, it is not what we want to see.”