Roundup: Hacking automated transcription technologies at Text AV
This autumn, Labber Pietro Passarelli organised the second edition of Text Audio/Visual (AV) — an annual two-day collaborative event that promotes a better understanding of how automated transcription technologies can be used for storytelling.
The format was a bit different to our ordinary #newsHACKs. We kicked off with presentations from attendees on some of their current projects, which ranged from automated fact-checking efforts to methods for annotating collections at the British Library. Afterwards, participants formed groups to discuss commonly faced challenges in the field and to brainstorm possible solutions.
Here's a quick look at some of the ideas that attendees had during the working session, broken down by the problem areas that they address. We've also included links to videos and further documentation of project ideas. Beyond encouraging knowledge-sharing between participants, we also hoped that Text AV might kick off some new open-source collaborations; if you're interested in a particular project, please do get in touch!
Programmatic comparison of speech-to-text services
Let's say you're starting a project and want to choose a speech-to-text (STT) service to work with. How can you compare the accuracy of what's available, and how do the various services compare to human transcribers?
Project: STT Benchmarking
This group looked into creating a GitHub page that individuals can use to compare speech-to-text services from multiple providers. They started by gathering requirements and setting up a repository for future work. In the project's final stage, the group imagines that individuals will be able to generate an analysis of the various services using examples of their own a/v content. Participants from Speechmatics additionally looked into stripping down an internal tool with similar functionality to make it available on the command line.
"Hopefully, eventually we'll get to a point where you can just turn up with a .zip file with your audio and transcripts, [which] gets automatically sent off to the various providers."
"Then you get a whole load of diagnostic information back — not just word error rates, but how things are going with capitalisation, punctuation, and other things."
- GitHub: hyperaudio/stt-benchmarking
- Text AV GitBook: Fast Forward Audio prototype, BBC R&D New News Team
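For readers unfamiliar with the word error rate mentioned in the quote above, it is usually calculated as the word-level edit distance between a reference transcript and a service's output, divided by the number of words in the reference. The Python sketch below is an illustration only and is not taken from the group's repository; the function names and the provider names in the usage note are hypothetical, and the normalisation (lower-casing and splitting on whitespace) is a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def rank_services(reference, outputs):
    """Return (service, WER) pairs sorted from most to least accurate.

    `outputs` maps a hypothetical service name to its transcript.
    """
    scores = {name: word_error_rate(reference, text)
              for name, text in outputs.items()}
    return sorted(scores.items(), key=lambda item: item[1])
```

Calling `rank_services(human_transcript, {"provider_a": text_a, "provider_b": text_b})` would then list the providers in order of accuracy against the human transcript. Because this sketch lower-cases everything and leaves punctuation attached to words, it does not separate out capitalisation or punctuation errors, which the group hoped to report on as well.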
Fact-checking a/v content
Project: Tweet that clip
This group from Full Fact and BBC News Labs presented an automated system for tweeting fact-checked video footage during a live interview session.
The prototype's interface presents users with an automatically generated transcript, which is time-coded to correspond to a video file on the right-hand side. Users can select a claim in the transcript to generate a tweet that includes:
- a video clip showing the interviewee making the claim
- text that fact-checks the claim against Full Fact's database
The prototype improves on Full Fact's current social media fact-checks, which link to video claims but don't play the clips directly on Twitter, leaving users who aren't following the event live without context.
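To give a sense of how a selection in a time-coded transcript can be turned into a video clip, here is a minimal Python sketch. It is not the group's implementation: the `Word` structure, the `clip_for_selection` function and the half-second padding are all assumptions made for illustration, and the sketch assumes the transcript provides a start and end time for every word.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the beginning of the video
    end: float


def clip_for_selection(words, first, last, padding=0.5):
    """Return (clip_start, clip_end, quote) for words[first..last] inclusive."""
    if not 0 <= first <= last < len(words):
        raise IndexError("selection is outside the transcript")
    selected = words[first:last + 1]
    # Pad the clip slightly so the speaker isn't cut off mid-breath,
    # but never start before the beginning of the video.
    clip_start = max(0.0, selected[0].start - padding)
    clip_end = selected[-1].end + padding
    quote = " ".join(word.text for word in selected)
    return clip_start, clip_end, quote
```

The returned times could be passed to a video-trimming tool, and the quote used to look up a matching fact-check, before both are attached to the tweet.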
A separate group of News Labbers built a Chrome extension that highlights the claims made in the text of a tweet. Claims determined to be true are coloured blue, whereas claims determined to be false are coloured red.
In the future, the group imagines the extension could be integrated with Full Fact's fact-checking API.
Cognitive insights in transcriptions
Speech-to-text and automatic speech recognition services are becoming more and more accurate. While many attendees have focused on giving editors the ability to correct inaccuracies in transcription, fewer projects have imagined what insight we could get from transcriptions. A group from the Financial Times and The Times and The Sunday Times decided to explore this space with a prototype for podcast sharing and discovery.
Project: "Selective Hearing" — Concept Clustering in Podcasts
This prototype allows users to discover content in more personalised ways, and also to share more personalised content. Different sections in a podcast are represented on a timeline displayed on an episode's webpage; users can hover over a section to see which categories appear prominently during that time period.
By clicking on a category label, users are redirected to an index of all podcast segments tagged with the same topic.
In order to build the prototype, the group transformed data in the WebVTT format into JSON. From there, they used IBM Watson's natural language understanding API to generate categories, and custom code to cluster them.
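The first step, turning WebVTT cues into JSON, can be sketched in a few lines of Python. This is an illustration rather than the group's code: the function names and the output field names (`start`, `end`, `text`) are assumptions, the parser ignores cue settings and styling, and the call to a categorisation service is left out entirely.

```python
import json
import re

# Matches a cue timing line such as "00:00:01.000 --> 00:00:04.500".
# The hours component is optional in WebVTT timestamps.
CUE_TIMING = re.compile(
    r"(?P<start>(?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+"
    r"(?P<end>(?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(timestamp):
    """Convert "hh:mm:ss.ttt" or "mm:ss.ttt" into seconds."""
    parts = timestamp.split(":")
    seconds = float(parts[-1])
    minutes = int(parts[-2])
    hours = int(parts[0]) if len(parts) == 3 else 0
    return hours * 3600 + minutes * 60 + seconds


def webvtt_to_segments(vtt_text):
    """Return a list of {"start", "end", "text"} dicts, one per cue."""
    segments = []
    normalised = vtt_text.replace("\r\n", "\n").strip()
    # Blocks (the header, notes and cues) are separated by blank lines.
    for block in re.split(r"\n\s*\n", normalised):
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if match:
                text = " ".join(l.strip() for l in lines[index + 1:] if l.strip())
                segments.append({
                    "start": to_seconds(match["start"]),
                    "end": to_seconds(match["end"]),
                    "text": text,
                })
                break  # a block holds at most one cue
    return segments


def webvtt_to_json(vtt_text):
    return json.dumps(webvtt_to_segments(vtt_text), indent=2)
```

Each segment's text could then be sent to a natural language understanding service, with the returned categories stored alongside the start and end times for clustering.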
- GitHub: debugwand/bbctextav-segments
- GitHub: seanmtracey/Vector-Clustering
- Text AV Gitbook: "Selective Hearing" - Concept Clustering in Podcasts
Annotating A/V Transcripts
Say you want to associate an object — text, an image, or a video clip — with a particular moment in a piece of audio or video content. Do you structure your data so that the annotations correspond to a transcript of the content, or to a specific time code within it? Do you build the annotation into a representation of the transcript itself?
At the beginning of the day, Tom Crane shared the British Library's work with the International Image Interoperability Framework (IIIF), a standard that's commonly used by museums and libraries to annotate 2D objects such as maps and images. His seven-minute presentation is a great introduction to how IIIF works, but we've also included some key vocabulary below to help explain the projects in this space.
- Canvas: A virtual container that represents an individual page or view.
- Manifest: A resource that contains the description and structure of a digital representation of a piece of content — including how to render that representation in a viewer.
- Universal Viewer: An open-source media player that renders content described in a manifest file.
Using IIIF to annotate a/v content is still relatively new, so many attendees were eager to explore the potential of the format. Below are a few projects addressing opportunities in this space.
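To make the vocabulary above more concrete, here is a rough Python sketch of a manifest containing one canvas for an audio file and a set of time-based annotations. The layout is loosely modelled on version 3.0 of the IIIF Presentation API and should be checked against the specification before use; the `build_manifest` function, the URL scheme and the choice of the "commenting" motivation are assumptions for illustration, not something taken from the projects described here.

```python
import json


def build_manifest(base_url, title, media_url, duration, notes):
    """Build a manifest dict with one audio canvas.

    `notes` is a list of (start_seconds, end_seconds, text) tuples.
    """
    canvas_id = f"{base_url}/canvas/1"
    # The "painting" annotation places the audio file onto the canvas.
    painting = {
        "id": f"{canvas_id}/page/1/media",
        "type": "Annotation",
        "motivation": "painting",
        "body": {"id": media_url, "type": "Sound",
                 "format": "audio/mpeg", "duration": duration},
        "target": canvas_id,
    }
    # Each note targets a time range on the canvas using a "#t=start,end"
    # media fragment.
    comments = [
        {
            "id": f"{canvas_id}/notes/{number}",
            "type": "Annotation",
            "motivation": "commenting",
            "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
            "target": f"{canvas_id}#t={start},{end}",
        }
        for number, (start, end, text) in enumerate(notes, start=1)
    ]
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/manifest.json",
        "type": "Manifest",
        "label": {"en": [title]},
        "items": [{
            "id": canvas_id,
            "type": "Canvas",
            "duration": duration,
            "items": [{"id": f"{canvas_id}/page/1",
                       "type": "AnnotationPage", "items": [painting]}],
            "annotations": [{"id": f"{canvas_id}/notes",
                             "type": "AnnotationPage", "items": comments}],
        }],
    }
```

Serialising the result with `json.dumps` gives a file that a IIIF-aware viewer could, in principle, load. In this layout the annotations point at a time range on the canvas rather than at a position in a transcript.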
Project: Collaborative podcast annotation workflow
This group investigated a possible workflow that lets multiple users annotate a podcast script, converts the annotated text into IIIF format, and then turns the IIIF annotations into a script for voice devices based on the original podcast.
In the imagined workflow, each editor adds a comment to a Google Doc to create an annotation. Each annotation is tagged with one of the following:
- who, what, when, where, why, how, more or reference
The text is then extracted and force-aligned with the original audio file, and the annotations are extracted, before both are reformatted into an IIIF manifest document. The group imagined this file could be used as a voice user interface script, allowing users to ask questions about specific elements of a podcast before moving on to the next segment.
Project: IIIF video segmentation
This group attempted to make A/V media more navigable, specifically the content of TheirStory, a project aimed at collecting the personal memories of parents.
Whereas currently finding stories within a file requires scrubbing through video timelines, this group experimented with adding a collapsible tree element that allows viewers to skip to specific stories within the file. The group demoed their project both as a standalone web page and also within the British Library's Universal Viewer, showcasing the ability of the IIIF format to work in conjunction with various software.
Project: IIIF Interactive Transcript - Parliamentary Debates
This group explored whether IIIF could allow viewers to fast-forward through a video by scrolling through an associated time-coded transcript. They worked to reproduce functionality demoed by Tristan Ferne and the BBC's New News group at the beginning of the event.
The group took approximately 3,500 German parliamentary debates and converted their time-coded transcripts into IIIF. They then built a custom player that displayed the text fragments alongside the video of the debate. By scrolling through the text on the right-hand side, viewers were able to fast-forward the video to the moment in the debate that they were most interested in seeing.
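The link between a transcript and a video player boils down to two lookups: from a text fragment to a time to seek to, and from the current playback time back to the fragment being spoken. The Python sketch below shows one way to do this with a binary search. It is an assumption-laden illustration, not the group's player: the function names are hypothetical, and the fragments are assumed to be dictionaries with a `start` time in seconds, sorted by that time.

```python
from bisect import bisect_right


def seek_time(fragments, index):
    """Time to jump to when the viewer scrolls to or clicks a fragment."""
    return fragments[index]["start"]


def fragment_at(fragments, time_seconds):
    """Index of the fragment being spoken at time_seconds.

    Returns None if the time falls before the first fragment.
    Assumes `fragments` is sorted by start time.
    """
    starts = [fragment["start"] for fragment in fragments]
    # bisect_right finds the first fragment starting after time_seconds,
    # so the one before it is the fragment currently playing.
    index = bisect_right(starts, time_seconds) - 1
    return index if index >= 0 else None
```

A player would call `seek_time` when the viewer scrolls the transcript, and `fragment_at` on each playback update to highlight the current passage. For very long debates, the list of start times could be built once and reused rather than rebuilt on every call.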
The application of automated transcription technologies is a long-standing interest of our group, and we plan to hold a third iteration of Text AV in 2019.
Are you working on a related project? Interested in learning more about the Text AV community? Have ideas you want to talk to us about? We'd love to hear from you!
You can follow us on Twitter for updates on our work or else send us an email at newslabs(at)bbc.co.uk.