GoURMET - Global Under-Resourced MEdia Translation
Training machine translation models on news output can help global media organisations utilise better MT solutions for low-resource languages.
What is GoURMET?
GoURMET stands for Global Under-Resourced MEdia Translation.
It is a 3.5-year multilingual, multinational project supported by the European Union's Horizon 2020 programme to improve machine translation, particularly for less common languages.
The project serves a dual purpose: training custom machine translation models on data from the news domain, and developing tools and ideas that use these models to support journalists in multilingual newsrooms.
Machine translation for a global newsroom
Generally, the more data available to develop a machine translation (MT) model, the better the result. Typically, the models require millions of translated sentences in order to reach acceptable performance.
For several languages, including many of the 40+ languages BBC World Service reports in, compiling high-quality training datasets is difficult. News content in these languages offers a valuable resource.
The project, run as part of News Labs' multilingual solutions stream, brought global media giants BBC and Deutsche Welle together with academic trailblazers from Alicante, Amsterdam and Edinburgh Universities to explore how under-resourced languages can be better served by MT solutions in a media setting.
We selected 16 languages to be trained on news data from the BBC and Deutsche Welle.
These are: Amharic, Bulgarian, Burmese, Gujarati, Hausa, Igbo, Kyrgyz, Macedonian, Pashto, Serbian, Swahili, Tamil, Tigrinya, Turkish, Urdu, Yoruba.
Since they are trained on the news provider's output, these models are aligned to the organisation's narrative style. Being custom models, they can process sensitive material securely and can be enhanced over time. When processing large volumes, they also have the potential to reduce costs.
The project identified three areas with potential benefits:
- Monitoring: Removing language barriers so that all content is visible across the newsroom in each language.
- Content creation: Supporting the efficient transfer of content across languages via human validation and correction.
- Domain enhancement: Experimenting with developing glossary-led solutions for fields with highly specialised terms.
To use these models efficiently, and to explore how useful MT solutions can be, News Labs created a multilingual suite of prototypes for BBC journalists. The suite was shortlisted for a News Innovation Award in 2021. It comprises:
- Live Pages Monitor: A monitoring tool enabling BBC journalists to follow Live updates from any BBC Service in any language and immediately build on the local expertise.
- Frank: A discovery tool accumulating original and impactful content that can be reworked and reversioned for distribution across BBC outlets.
- Multilingual GST: A tool that employs MT models to let under-resourced languages benefit from machine learning solutions such as semi-automated graphics generation.
Over 200 BBC and DW journalists contributed to the project as data validators and evaluators, with many more contributing to the tool trials.
Our work has demonstrated that it is possible to compete with, and surpass, the results from global tech giants even on single iterations of training.
The project ran for 42 months, from January 2019 to June 2022, and produced 70+ academic research papers across the GoURMET Consortium.
The models developed in the project have been open-sourced and are available to download on the project page. The prototypes and tools developed are available for use internally.
The project's EU reviewers recommended sustaining the line of research to explore further retraining options to ensure the models can improve over time.
News Labs continues to explore multilingual solutions in a bid to remove language barriers across BBC newsrooms.
GoURMET project partners
- Edinburgh University (coordinator)
- BBC News Labs
- Deutsche Welle
- University of Alicante
- University of Amsterdam