From speech to text: automated transcription in the newsroom


Labber Alex Norton shares four potential benefits of automated transcription technology that we’ve found from our experiments in the newsroom.

There are a lot of reasons for newsrooms to be excited about automatic transcription. Traditional video editing workflows in organisations like the BBC revolve around transcripts of content, and traditionally, the process of getting those transcripts involves a human playing back the video and typing out the words they hear. Speech-to-text technology, which uses machine learning to produce transcripts of audio, has the potential to save journalists the time they spend manually typing up their interviews and news organisations the money they pay third-party companies to do the same work.

In recent years the technology itself has advanced rapidly. Automatic transcription of speech content is getting faster, cheaper and more accurate. Big players such as Microsoft, Google and IBM all offer speech-to-text APIs, and services like Temi and Trint are taking aim at the prosumer market. There’s activity in the open-source world too, with projects such as autoEdit and FrameTrail.

Still, we think that there is room for innovative applications of speech-to-text technology designed specifically with newsrooms in mind. We’ve been using a speech recognition engine developed by our colleagues in Research and Development to build prototypes for the past few years, experimenting with video editing, archival search and subtitling and captioning.

This autumn, I co-presented a paper on our work at the International Broadcasting Conference in Amsterdam, showcasing our most interesting experiments and their impact on journalists’ workflows. Here are four of the potential benefits of automated transcription we’ve found from trialling these experiments in the newsroom.

1. Time-aware transcripts make it possible to edit video and audio by text selection.

Cutting out the back-and-forth translation between timecodes and text speeds up the video editing process by allowing editors to work directly from a transcript.

The challenge

Transcripts serve as the working draft of a programme for a producer, who selects the comments and quotations they want to include from the text. But a video editor then needs to translate those selected segments back into time codes in the a/v file in order to create the programme. That means searching for the exact time that a quote begins within an editing application, letting the audio or video play until the end of the selection and then trimming sub-clips from much longer footage.

The solution

Our Octo tool automatically generates a transcript of a video file and allows producers to make edits directly from the text. Selecting a passage of text in the transcript automatically creates a sub-clip of the video, setting the in and out points to match the highlighted words. Users can download the clip to their desktop or send it to one of the BBC’s production asset management systems for use in other tools.

Transcript-based clipping at work in the Octo editor

Octo is possible because our speech recognition engine returns timing information for each word, accurate down to the millisecond. However, we’ve also built in a manual override that lets users adjust the in and out points to add a few seconds of buffer space on either side of their selection.
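To make that concrete, here’s a minimal sketch of how per-word timings can drive a sub-clip, assuming a simple list of timed words from the engine and using ffmpeg for the trim; neither detail is taken from Octo itself.

    # Sketch only: derive sub-clip in/out points from word-level timings, then trim.
    # The transcript format and the ffmpeg call are illustrative assumptions.
    import subprocess

    # Hypothetical engine output: one entry per word, times in seconds.
    words = [
        {"text": "climate",      "start": 12.48, "end": 12.91},
        {"text": "change",       "start": 12.91, "end": 13.35},
        {"text": "is",           "start": 13.35, "end": 13.50},
        {"text": "accelerating", "start": 13.50, "end": 14.32},
    ]

    def clip_bounds(words, first_idx, last_idx, buffer_s=2.0):
        """Map a highlighted span of words to in/out points, with optional buffer."""
        start = max(0.0, words[first_idx]["start"] - buffer_s)
        end = words[last_idx]["end"] + buffer_s
        return start, end

    start, end = clip_bounds(words, 0, 3)
    # Trim the source file without re-encoding (cuts snap to keyframes with -c copy).
    subprocess.run(
        ["ffmpeg", "-i", "interview.mp4", "-ss", str(start), "-to", str(end),
         "-c", "copy", "clip.mp4"],
        check=True,
    )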

Speech recognition never generates perfect transcriptions, but they’re usually good enough to let editors search for passages by keyword, which greatly speeds up the editing process.

2. Transcripts make the contents of a/v files searchable.

The ability to scan a video file for keywords saves time in the editing process, but it also greatly improves search and discovery when it comes to selecting relevant archive footage.

The challenge

Searching for content in production archives typically relies on the titles, descriptions and tags that have been assigned by humans to give clues as to the concepts, topics, people and locations covered. Unfortunately we find that this metadata is often limited and in some cases downright inaccurate, making it difficult to search for relevant material on a particular topic.

Even when a piece of multimedia has been tagged and labelled well, the metadata never captures everything that’s said within a file that’s even a few minutes long. A journalist has to play back a clip in its entirety to know what it contains.

The solution

Our Window on the Newsroom system automatically transcribes content arriving in our production archives. Having a full transcript available means that everything that’s said in an a/v file is now exposed to search queries, rather than just selected tags, titles and descriptions.
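As a rough illustration of why that matters, the sketch below builds a tiny inverted index from a transcript’s timed words, so a keyword query returns the moments in the file where the word is spoken. It’s a stand-in for the real search backend, which this post doesn’t describe.

    # Sketch: a tiny inverted index from transcript words to the times they occur.
    # A stand-in for the production search backend, for illustration only.
    from collections import defaultdict

    def build_index(words):
        """words: list of {"text": str, "start": float} entries from a transcript."""
        index = defaultdict(list)
        for word in words:
            index[word["text"].lower()].append(word["start"])
        return index

    index = build_index([
        {"text": "Parliament", "start": 4.2},
        {"text": "voted",      "start": 4.8},
        {"text": "Parliament", "start": 61.5},
    ])
    print(index["parliament"])  # [4.2, 61.5] - jump straight to those moments in the file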

WON automatically extracts entities such as places and organisations from an automatically generated transcript.

Window on the Newsroom also performs additional language processing to allow journalists to write more complex search queries. The transcripts are run through an entity extraction engine to detect references to people, places and organisations using the open DBpedia knowledge graph. These entities can then be used to retrieve content in ways that wouldn’t be possible with manually entered descriptions. For example, using a traditional metadata-based search engine you could ask for all the videos where Donald Trump is mentioned. But with our entity-based search engine, you could ask for all the videos that reference people who work for Donald Trump and were born in Russia.
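The post doesn’t name the entity extraction engine we use, but as an illustration of the approach, the public DBpedia Spotlight API can link raw transcript text to DBpedia entities:

    # Sketch: link transcript text to DBpedia entities with the public DBpedia Spotlight API.
    # Illustrates the general approach, not necessarily the engine behind Window on the Newsroom.
    import requests

    response = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "Donald Trump met his advisers in Moscow", "confidence": 0.5},
        headers={"Accept": "application/json"},
        timeout=10,
    )
    for resource in response.json().get("Resources", []):
        # Each resource pairs the surface form in the text with the DBpedia entity it links to.
        print(resource["@surfaceForm"], "->", resource["@URI"])

Once mentions are linked to entities rather than plain strings, relationships in the knowledge graph, such as who works for whom or where someone was born, become available to the query, which is what makes the second kind of search possible.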

3. Auto-transcription can speed up captioning and subtitling.

As long as auto-generated transcripts can be corrected by a human where needed, the timing information they contain can be used to automatically subtitle and caption web videos.

The challenge

Subtitling or captioning a video for the web is labour-intensive and reduces the number of videos that can be produced. Using out-of-the-box video editing software, editors must either play back the video and transcribe speech as they hear it, or copy and paste portions of a transcript and align them with the spoken words. Either way, the process involves stop-and-start playback of the media files, which is time-consuming and can take several times the actual length of the video to complete.

The solution

The rich timing information in automatically generated transcripts makes subtitling and captioning an obvious application of the technology. But before we can burn captions and subtitles into video files, we need to give editors and producers the ability to correct errors in the transcript.

We did this in Octo by using the timing information that our speech recognition engine returns when it transcribes a video. If a journalist needs to replace an incorrect word with the correct one, the timing information associated with the original word is simply transferred to the correction. After the transcript is corrected for the portion of the video that an editor wishes to sub-clip, subtitles can be burned in or downloaded as SRT (SubRip Subtitle) files, the standard format for social platforms such as Facebook.
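Here’s a minimal sketch of both ideas, carrying a word’s timing over to its correction and writing cues out as an SRT file; the data structures are assumptions for illustration rather than Octo’s actual code.

    # Sketch: keep a word's timing when correcting it, then write cues as an SRT file.
    # Data structures and formatting choices are illustrative assumptions only.

    def correct_word(word, new_text):
        """Replace the recognised text but keep the original start/end times."""
        return {"text": new_text, "start": word["start"], "end": word["end"]}

    def srt_timestamp(seconds):
        """Format seconds as the HH:MM:SS,mmm timestamps SRT expects."""
        ms = int(round(seconds * 1000))
        hours, ms = divmod(ms, 3_600_000)
        minutes, ms = divmod(ms, 60_000)
        secs, ms = divmod(ms, 1000)
        return f"{hours:02}:{minutes:02}:{secs:02},{ms:03}"

    def write_srt(cues, path):
        """cues: list of {"text", "start", "end"} dicts, one per subtitle."""
        with open(path, "w", encoding="utf-8") as f:
            for i, cue in enumerate(cues, start=1):
                f.write(f"{i}\n")
                f.write(f"{srt_timestamp(cue['start'])} --> {srt_timestamp(cue['end'])}\n")
                f.write(cue["text"] + "\n\n")

    # A mis-recognised place name: the correction inherits the original timing.
    word = {"text": "Broughton", "start": 3.1, "end": 3.6}
    write_srt([correct_word(word, "Brighton")], "clip.srt")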

4. Machine transcription can help solve unexpected challenges in the newsroom

We didn’t begin experimenting with speech recognition because we thought it could help promote our programmes on social media, but we found that the technology had the potential to transform social media producers’ workflows, especially in radio.

The challenge

Facebook and Twitter massively favour uploaded video content, making it more difficult for our colleagues in radio to promote their programmes on social media. Our team found a partial solution in the open-source Audiogram project from WNYC. Out of the box, the Audiogram generator lets users upload audio files and convert them into videos with animated waveforms and customisable captions. However, the captions included in the templates are meant to serve as branding for specific programmes, and therefore can’t be customised to reflect specific radio content without a developer going in and modifying the code for the templates.

When we spun up a version for BBC teams to use, we learned that most of our social media producers would not post a video file that didn’t have animated captions. We therefore needed a way to support captioning that was easy to use and scalable enough to fit within our editorial colleagues’ workflows.

The solution

By integrating the transcript editor we built for Octo into the Audiogram generator, we were able to support captioning using the text from automatically generated transcripts. This means that even on social platforms where videos autoplay without sound, the content of a radio programme is still viewable by our audiences.

Using an automatically generated transcript to produce captions with the Audiogram generator

The timing information associated with each word is also used to align the captions. My colleague Jonty Usborne has written previously about how the tool works and the things he learned in creating it.
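One simple way to picture that alignment, as a sketch rather than the tool’s actual logic: group the timed words into short caption chunks, each taking its start time from its first word and its end time from its last.

    # Sketch: group timed words into short caption chunks for on-screen display.
    # The 32-character limit and the greedy grouping are illustrative choices only.

    def chunk_captions(words, max_chars=32):
        """words: list of {"text", "start", "end"}; returns timed caption chunks."""
        chunks, current = [], []
        for word in words:
            candidate = " ".join(w["text"] for w in current + [word])
            if current and len(candidate) > max_chars:
                chunks.append({"text": " ".join(w["text"] for w in current),
                               "start": current[0]["start"],
                               "end": current[-1]["end"]})
                current = []
            current.append(word)
        if current:
            chunks.append({"text": " ".join(w["text"] for w in current),
                           "start": current[0]["start"],
                           "end": current[-1]["end"]})
        return chunks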

The future

Looking to the future, it’s clear that we’re only scratching the surface of what’s possible with speech-to-text in the newsroom.

The potential for transcript-based editing tools is enormous. We want to build on our Octo tool to support more complex edits, bringing together clips from multiple sources, all represented by passages of text rather than timelines.

There are obvious parallels with work our colleagues in BBC R&D have been doing around object-based broadcasting. The automatically produced transcript can effectively just be treated as one of many objects that make up a piece of media, and used to assemble different experiences for the consumer based on their device or environment. This could clearly be a benefit for our low-bandwidth audiences, who might not have the connectivity required for rich video, but could potentially be served with transcripts and keyframes, for example.

We also want to move our existing projects out of the prototype stage and into properly supported, scalable production tools. The benefits of this technology are only truly realised when we’ve rolled them out to our whole organisation.

Further reading

There are project pages for all of our current experiments on our website, including our work on speech-to-text.

For a more technical look at how we’ve been building our text editors for tools such as Octo and the Audiogram generator, you can watch a talk I gave at the React London Meetup in February. You can also take a look at BBC Research and Development’s speech-to-text project page and listen to Andrew McParland talk about their work on our speech recognition engine from the IBC.

Last January, we held a #newsHACK on automated transcription and published a round-up of some of the early-stage ideas for potential applications of the technology in the newsroom.