How might we efficiently scale BBC News' structured storytelling across a growing set of languages globally?
BBC News Labs is exploring ways of integrating Language Technology into the News production process. We experiment with innovative approaches to delivering News to our multilingual global audiences.
What is Language Technology?
Language Tech is all around us: we use it more frequently, in more and more places (sometimes unexpected ones), and often without realising it. Language Technology (LT) covers all aspects of spoken and written human language, such as…
- Speech-to-Text (STT) - you’ll find this in many telephone services around the world: any time a voice asks you to say “one” or “yes” or “no” into your phone, a speech engine at the other end decodes what you’ve just said.
- Text-to-Speech (TTS) - you know this from e-book readers, satnavs, tannoy announcements… and of course your smartphone’s personal assistant, which uses both STT and TTS. A voice engine takes your written text and converts it into synthetic speech.
- Machine Translation (MT) - translating from one human language into another. Most people will have tried Google Translate or Bing Translator at some point. These systems learn from large collections of existing human translations, which is why the output sometimes sounds remarkably good. When not enough human translations are available for a language pair, the results can be a bit … well … garbled.
- Automatic Speech Recognition (ASR) - essentially another name for STT: the conversion of spoken human language into text.
- Speaker Diarisation - this process automatically identifies the different speakers in a stream of audio, for example when analysing an interview or a conversation. The audio stream is partitioned into segments according to speaker identity, which gives you a fairly good idea of who spoke when.
- Speech Synthesis - this is essentially TTS. There are two popular methods of creating synthetic voices: unit selection and statistical parametric synthesis. Unit selection takes recordings of a human voice and fragments them into tiny phonetic units: phonemes, consonant clusters, etc. When the speech engine decodes written text, it looks into its phonetic ‘library’, assembles the relevant fragments and turns them into quite natural-sounding, coherent audio. Statistical parametric synthesis works with a smaller library of phonetic fragments and sounds slightly less natural, but it lets you control stress and intonation - you can choose which words should be stressed in a sentence. A lot of current research focusses on how to combine the best of both methods to create more human-sounding voices. All this frequently raises the unanswerable question: what is “natural sounding” anyway?
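For the technically curious, the idea behind phrase-based machine translation - reusing fragments of existing human translations - can be sketched in a few lines. This is a toy illustration only: the phrase table below is invented and hand-made, whereas real MT systems learn millions of weighted fragments from parallel text.

```python
# Toy sketch of phrase-based MT: a hypothetical, hand-made "phrase table"
# of fragments taken from human translations (English -> German here).
PHRASE_TABLE = {
    "good morning": "guten Morgen",
    "the news": "die Nachrichten",
    "good": "gut",
    "morning": "Morgen",
}

def translate(sentence: str) -> str:
    """Greedily match the longest known phrase at each position."""
    words = sentence.lower().split()
    out = []
    i = 0
    while i < len(words):
        # Try the longest span first, shrinking until a phrase matches.
        for j in range(len(words), i, -1):
            phrase = " ".join(words[i:j])
            if phrase in PHRASE_TABLE:
                out.append(PHRASE_TABLE[phrase])
                i = j
                break
        else:
            # Unknown word: passes through untranslated - this is where
            # real MT output starts to sound "garbled".
            out.append(words[i])
            i += 1
    return " ".join(out)

print(translate("Good morning the news today"))
# → guten Morgen die Nachrichten today
```

The untranslated "today" shows why coverage matters: the more human translations a system has seen, the fewer gaps like this remain.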
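The final step of speaker diarisation - turning per-frame speaker decisions into "who spoke when" - can also be sketched simply. This assumes the hard part (inferring a speaker label for each second of audio from acoustic features) has already been done; the labels below are invented.

```python
# Toy sketch of diarisation output: collapse per-second speaker labels
# (which a real system would infer from the audio) into segments.
def diarise(labels):
    """Turn a list of frame labels into (speaker, start, end) segments."""
    segments = []
    for t, speaker in enumerate(labels):
        if segments and segments[-1][0] == speaker:
            # Same speaker continues: extend the current segment.
            segments[-1] = (speaker, segments[-1][1], t + 1)
        else:
            # Speaker change: start a new segment.
            segments.append((speaker, t, t + 1))
    return segments

# One label per second of audio, e.g. an interview between A and B.
frames = ["A", "A", "A", "B", "B", "A", "A"]
print(diarise(frames))
# → [('A', 0, 3), ('B', 3, 5), ('A', 5, 7)]
```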
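The unit-selection idea - look up phonetic fragments in a library and concatenate them - can likewise be sketched. Everything here is a stand-in: the "audio fragments" are single placeholder bytes and the phoneme names are simplified, whereas a real engine stores thousands of recorded units and chooses between alternatives so the joins sound smooth.

```python
# Toy sketch of unit selection: a hypothetical library mapping phonetic
# units to (imaginary) audio fragments, concatenated to form speech.
UNIT_LIBRARY = {
    "HH": b"\x01", "AH": b"\x02", "L": b"\x03", "OW": b"\x04",
}

def synthesise(phonemes):
    """Look up each phonetic unit and join its audio fragment onto the output."""
    return b"".join(UNIT_LIBRARY[p] for p in phonemes)

# "hello" as a (simplified) phoneme sequence: HH AH L OW
audio = synthesise(["HH", "AH", "L", "OW"])
print(audio)  # → b'\x01\x02\x03\x04'
```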
What’s the purpose of this workstream?
- We explore opportunities to make multilingual journalism more efficient (free up time for journalists to curate the News - not spend it on arduous tasks!)
- We want to give our audiences a new experience of following the news (not everybody is comfortable in English… why should they miss out on news stories?)
- Our method is to demonstrate the art of the possible through prototypes (save time on discussions and disconnected views - try it out!)
- We track the state of the art, so that BBC and partners can take advantage when the tech reaches readiness.
What’s our approach?
- Collaborate with universities and research groups (University of Edinburgh, UCL, Cambridge, Alan Turing Institute…)
- Collaborate with international news broadcasters (e.g. Deutsche Welle)
- Experiment with what’s already available and find out what works / what doesn’t work