Meta AI builds a speech-to-speech translator that works across many languages
Towards Universal Translation: The Seamless Communication Team Tackles the Challenges of Making Translation Systems More Accurate and More Efficient
The team at Meta built on its previous work on speech-to-speech translation2 as well as on a project called No Language Left Behind3, which aimed to provide text-to-text translation for some 200 languages. Through experience, researchers at Meta and elsewhere have found that making translation systems multilingual can improve their performance even in translating languages with limited training data; why this happens is unclear.
Existing speech technologies have many problems. Speech in widely used dialects is more likely to be transcribed accurately than speech in non-standard ones. And translation quality suffers when the data used to train a system are not representative of the target language, which affects any language that appears infrequently on the Internet, from Afrikaans to Zulu4.
The Seamless Communication team addresses these challenges by developing key technologies that could make universal translation a reality.
SEAMLESSM4T: A Machine-Translation System Trained on Multilingual Speech and Text Data
To train their AI model, the researchers relied on methods called self-supervised and semi-supervised learning. These approaches help a model to learn from huge amounts of raw data — such as text, speech and video — without requiring humans to annotate the data with specific labels or categories that provide context. Such labels might be accurate transcripts or translations, for example.
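The core idea of self-supervised learning, that the "labels" come from the raw data itself rather than from human annotators, can be sketched with a toy masking example. This is an illustration of the general principle, not the team's actual training objective; the function name and `<MASK>` token are invented for the sketch.

```python
def make_selfsupervised_pairs(tokens, mask="<MASK>"):
    """Build (masked_sequence, target) training pairs from raw text alone.

    No human labelling is needed: each word in turn is hidden, and the
    hidden word itself becomes the training target.
    """
    pairs = []
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        pairs.append((masked, target))
    return pairs

# Three raw words yield three supervised-looking examples, for free.
pairs = make_selfsupervised_pairs("the cat sat".split())
```

A model trained to fill in the masks learns the statistics of the data without anyone writing transcripts or category labels.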
A massive data set containing 4.5 million hours of multilingual spoken audio was used to pre-train the part of the model that is responsible for translation. This kind of pre-training makes it easier for the model to learn general patterns in the data, and simpler to later fine-tune it for specific tasks.
One of the team's savviest strategies was to mine the Internet for audio snippets that match text subtitles. Starting with some data that they knew to be reliable, the authors trained a model to recognize when two pieces of content (such as a video clip and a corresponding subtitle) actually match in meaning. From Internet-derived data, they collected 443,000 hours of audio aligned with text and about 30,000 hours of aligned speech pairs, which they used to further train their model.
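One common way to implement this kind of mining is to embed audio and text in a shared vector space and keep only pairs whose embeddings are close. The sketch below assumes such embeddings already exist (here they are toy 2-D vectors) and uses cosine similarity with a threshold; the function names and the threshold value are illustrative, not the team's actual pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(audio_embs, text_embs, threshold=0.9):
    """Match each audio embedding to its nearest text embedding,
    keeping only (audio_index, text_index) pairs that clear the
    similarity threshold - a stand-in for parallel-data mining."""
    pairs = []
    for i, a in enumerate(audio_embs):
        j, score = max(
            ((j, cosine(a, t)) for j, t in enumerate(text_embs)),
            key=lambda x: x[1],
        )
        if score >= threshold:
            pairs.append((i, j))
    return pairs

# Toy embeddings: each audio clip has one close text match plus a decoy.
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
mined = mine_pairs(audio, text)
```

At Internet scale the same idea runs over billions of candidates with approximate nearest-neighbour search rather than the brute-force loop shown here.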
The Massively Multilingual and Multimodal Machine Translation (SEAMLESSM4T) system can also translate speech to text, text to speech and text to text. The results are described in Nature.
Meta is a supporter of open-source language technology. Its research team was instrumental in developing PyTorch, a software library for training AI models, which is widely used by companies such as OpenAI and Tesla, as well as by many researchers around the world. The model introduced here adds to Meta's arsenal of foundational language technology models, such as the Llama family of large language models2, which can be used to create applications akin to ChatGPT. This level of openness is an advantage for researchers who lack the huge computational resources needed to build such models from scratch.
Speech technologies are increasingly being used for high-stakes tasks such as taking notes during medical examinations and legal proceedings. The Seamless team's model could speed progress in this area. But the users of these models (doctors and courtroom officials, for example) should be made aware of the fallibility of speech technologies, as should the individuals whose voices are the inputs.
The researchers quantified the toxicity associated with their model, an important first step towards establishing a baseline against which future models can be tested. Extra care must be taken when a model is used to translate into a language other than those in which it has been tested, because the performance of existing models varies wildly across languages. This effort should be mirrored by computer-vision researchers, who are working to improve the poor performance of image-recognition models for under-represented groups.
The researchers also audited the translations produced by their model for gender bias. Their analysis examined whether the model over-represented one gender when translating gender-neutral phrases into gendered languages: does "I am a teacher" in English translate to the masculine "Soy profesor" or to the feminine "Soy profesora" in Spanish? But such analyses are restricted to languages with binary masculine and feminine forms, and future audits should broaden the scope of linguistic biases studied8.
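The teacher example suggests a simple form such an audit can take: translate many gender-neutral source phrases and measure how often the output uses a masculine rather than a feminine form. The sketch below is a minimal illustration of that counting step, assuming the translations have already been produced; the word lists and sample outputs are invented for the example.

```python
def masculine_share(translations, masc_forms, fem_forms):
    """Fraction of gendered translations that use a masculine form.

    Tokenizes each translation and counts whole-word matches, so that
    'profesor' does not accidentally match inside 'profesora'.
    """
    masc = fem = 0
    for t in translations:
        words = set(t.lower().split())
        if words & masc_forms:
            masc += 1
        elif words & fem_forms:
            fem += 1
    return masc / (masc + fem)

# Hypothetical outputs for the gender-neutral prompt "I am a teacher".
outputs = ["Soy profesor", "Soy profesor", "Soy profesora"]
share = masculine_share(outputs, {"profesor"}, {"profesora"})
```

An unbiased model would keep this share near 0.5 across a large set of gender-neutral prompts; a value near 1.0 signals masculine over-representation.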
Open-Source Release: Meta Makes SEAMLESSM4T Available for Researchers to Build On
Following the success of its LLaMA large language models, Meta is making SEAMLESSM4T open source for other researchers who want to build on it.
The team collected millions of hours of speech audio, along with human-generated translations, from the Internet and other sources. For some of those recordings, the authors were also able to collect transcripts.