Roundup: Hacking automated transcription technologies at Text AV

This autumn, Labber Pietro Passarelli organised the second edition of Text Audio/Visual (AV) — an annual two-day collaborative event that promotes a better understanding of how automated transcription technologies can be used for storytelling.

The format was a bit different to our ordinary #newsHACKs. We kicked off with presentations from attendees on some of their current projects, which range from automated fact-checking efforts to methods for annotating collections at the British Library. Afterwards, participants formed groups to discuss commonly faced challenges in the field — and to brainstorm some possible solutions.

Here’s a quick look at some of the ideas that attendees had during the working session, broken down by the problem areas that they address. We’ve also included links to videos and further documentation of project ideas. Beyond encouraging knowledge-sharing between participants, we also hoped that Text AV might kick off some new open-source collaborations; if you’re interested in a particular project, please do get in touch!

Programmatic comparison of speech-to-text services

Let’s say you’re starting a project and want to choose a speech-to-text (STT) service to work with. How can you compare the accuracy of the available services, both with each other and with human transcribers?

Project: STT Benchmarking

This group looked into creating a GitHub page that can be used by individuals to compare multiple speech-to-text services across providers. They started by gathering requirements and setting up a repository for future work. In its final stage, the group imagines that individuals can generate analysis on the various services using examples of their own a/v content. Participants from Speechmatics additionally looked into stripping down an internal tool with similar functionality to make it available on the command line.

“Hopefully, eventually we’ll get to a point where you can just turn up with a .zip file with your audio and transcripts, [which] gets automatically sent off to the various providers.”

“Then you get a whole load of diagnostic information back — not just word error rates, but how things are going with capitalisation, punctuation, and other things.”
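
Word error rate, the first metric mentioned above, is conventionally the word-level edit distance between a reference transcript and a service's output (substitutions, deletions and insertions), divided by the number of words in the reference. The following is a minimal sketch of that calculation, not the group's tool; the function name and the normalisation (lowercasing, splitting on whitespace) are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalisation: case-insensitive, whitespace-separated words.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Because this sketch lowercases both transcripts and leaves punctuation attached to words, it cannot separate capitalisation or punctuation errors from word errors; a fuller benchmark of the kind described above would report those as their own diagnostics.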

More information

Fact-checking a/v content

Project: Tweet that clip

This group from Full Fact and BBC News Labs presented an automated system for tweeting fact-checked video footage during a live interview session.

A screenshot of the fact-checking system

The prototype’s interface presents users with an automatically generated transcript, which is time-coded to correspond to a video file on the right-hand side. Users can select a claim in the transcript to generate a tweet that includes:

  • a video clip showing the interviewee making the claim
  • text that fact-checks the claim against Full Fact’s database
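
Turning a selected claim into a video clip depends on the transcript being time-coded. As a rough sketch of that step, assuming word-level timings (the `Word` type, the `clip_bounds` function and the half-second padding are all hypothetical, not taken from the prototype):

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the video
    end: float


def clip_bounds(words, first, last, padding=0.5):
    """Return (start, end) in seconds for the words[first..last] selection."""
    if not 0 <= first <= last < len(words):
        raise ValueError("selection is outside the transcript")
    # Pad the clip slightly so the claim is not cut off mid-word,
    # without letting the start fall before the beginning of the video.
    start = max(0.0, words[first].start - padding)
    end = words[last].end + padding
    return start, end
```

The resulting pair of timestamps is what a video-trimming step would need in order to cut the clip that accompanies the tweet.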

The prototype improves on Full Fact’s current social media fact-checks, which link to video claims but don’t play the clips directly on Twitter, removing context for users who aren’t following the event live.

More information

Project: Farfetchd

This group of News Labbers built a Chrome extension that highlights all of the claims made in the text of a tweet. Claims within a tweet’s text that are determined to be true are coloured in blue, whereas claims determined to be false are coloured in red.

In the future, the group imagines the extension could be integrated with Full Fact’s fact-checking API.

More information

Cognitive insights in transcriptions

Speech-to-text and automated speech recognition services are becoming more and more accurate. While many attendees have focused on giving editors the ability to correct for inaccuracies in transcription, fewer projects have imagined what insight we could get from transcriptions. A group from the Financial Times and Times and Sunday Times decided to explore this space with a prototype for podcast sharing and discovery.

Project: “Selective Hearing” — Concept Clustering in Podcasts

A screenshot of the tagged and segmented podcast timeline

This prototype allows users to discover content in more personalised ways, and also to share more personalised content. Different sections in a podcast are represented on a timeline displayed on an episode’s webpage; users can hover over a section to see which categories appear prominently during that time period.

By clicking on a category label, users are redirected to an index of all podcast segments tagged with the same topic.

In order to build the prototype, the group transformed data in the WebVTT format into JSON. From there, they used IBM Watson’s natural language understanding API to generate categories, and custom code to cluster them.
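
The first step, converting WebVTT into JSON, can be sketched as follows. This is an illustrative parser rather than the group's code: the function names and the output keys (`start`, `end`, `text`) are assumptions, and it handles only plain cues, ignoring cue settings and skipping header, NOTE and STYLE blocks.

```python
import json
import re

# WebVTT timestamps are either mm:ss.ttt or hh:mm:ss.ttt.
TIMESTAMP = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})")


def to_seconds(stamp):
    match = TIMESTAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"not a WebVTT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = match.groups()
    return (int(hours or 0) * 3600 + int(minutes) * 60
            + int(seconds) + int(millis) / 1000)


def webvtt_to_cues(vtt_text):
    """Return a list of {"start", "end", "text"} dicts, one per cue."""
    cues = []
    # Blocks in a WebVTT file are separated by blank lines.
    blocks = re.split(r"\n\s*\n", vtt_text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        timing = next((i for i, line in enumerate(lines) if "-->" in line), None)
        if timing is None:
            continue  # header, NOTE or STYLE block: no timing line
        start, _, rest = lines[timing].partition("-->")
        end = rest.split()[0]  # drop any cue settings after the end time
        text = " ".join(line.strip() for line in lines[timing + 1:])
        cues.append({"start": to_seconds(start.strip()),
                     "end": to_seconds(end),
                     "text": text})
    return cues


def webvtt_to_json(vtt_text):
    return json.dumps(webvtt_to_cues(vtt_text), indent=2)
```

Each cue's text, together with its start and end times, could then be sent to a categorisation service and the returned categories attached to the same time range.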

More information

Annotating A/V Transcripts

Say you want to associate an object — text, an image, or a video clip — with a particular moment in a piece of audio or video content. Do you structure your data so that the annotations correspond to a transcript of the content, or to a specific time code within it? Do you build the annotation into a representation of the transcript itself?

At the beginning of the day, Tom Crane shared the British Library’s work with the International Image Interoperability Framework (IIIF), a standard that’s commonly used by museums and libraries to annotate 2D objects such as maps and images. His seven-minute presentation is a great introduction to how IIIF works, but we’ve also included some key vocabulary below to help explain the projects in this space.

  • Canvas: A virtual container that represents an individual page or view.
  • Manifest: A resource that contains the description and structure of a digital representation of a piece of content — including how to render that representation in a viewer.
  • Universal viewer: An open-source media player that renders content described in a manifest file.

Using IIIF to annotate a/v content is still relatively new, so many attendees were eager to explore the format’s potential. Below are a few projects addressing opportunities in this space.

Project: Collaborative podcast annotation workflow

This group investigated a possible workflow in which multiple users annotate a podcast script, the text is converted into IIIF format, and the IIIF annotations are then turned into a script for voice devices based on the original podcast.

In the imagined workflow, each editor adds a comment to a Google Doc to create an annotation. Each annotation is tagged with one of the following labels:

  • who, what, when, where, why, how, more or reference

A sample script with annotations

The extracted text is then force-aligned with the original audio file and the annotations are extracted, before both are reformatted into an IIIF manifest document. The group imagined this file could be used as a voice user interface script, allowing users to ask questions about specific elements of a podcast before moving on to the next segment.
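
To make the final step more concrete, here is a sketch of how aligned, tagged annotations might be packed into a manifest. It assumes the general shape of the IIIF Presentation API 3.0 (a Manifest containing a Canvas with a `duration`, and annotations that target a time range using a `#t=start,end` fragment); the group's actual output may well have differed. The function name, the example.org URLs, the label and the choice of the `commenting` motivation are all hypothetical.

```python
# Tags taken from the workflow described above.
ALLOWED_TAGS = {"who", "what", "when", "where", "why", "how", "more", "reference"}


def build_manifest(base_url, audio_url, duration, annotations):
    """annotations: iterable of (tag, text, start_seconds, end_seconds)."""
    canvas_id = f"{base_url}/canvas/1"
    items = []
    for index, (tag, text, start, end) in enumerate(annotations, start=1):
        if tag not in ALLOWED_TAGS:
            raise ValueError(f"unknown tag: {tag!r}")
        items.append({
            "id": f"{canvas_id}/annotations/1/{index}",
            "type": "Annotation",
            "motivation": "commenting",
            "body": {"type": "TextualBody",
                     "value": f"{tag}: {text}",
                     "format": "text/plain"},
            # A temporal media fragment ties the note to a stretch of audio.
            "target": f"{canvas_id}#t={start},{end}",
        })
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{base_url}/manifest.json",
        "type": "Manifest",
        "label": {"en": ["Annotated podcast"]},
        "items": [{
            "id": canvas_id,
            "type": "Canvas",
            "duration": duration,
            # The audio itself is "painted" onto the canvas.
            "items": [{
                "id": f"{canvas_id}/page/1",
                "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/page/1/audio",
                    "type": "Annotation",
                    "motivation": "painting",
                    "body": {"id": audio_url, "type": "Sound",
                             "format": "audio/mpeg", "duration": duration},
                    "target": canvas_id,
                }],
            }],
            # Editors' tagged comments sit alongside it as annotations.
            "annotations": [{
                "id": f"{canvas_id}/annotations/1",
                "type": "AnnotationPage",
                "items": items,
            }],
        }],
    }
```

A voice interface could then walk the annotation list in time order and answer a question such as "who?" by reading out the matching annotation for the current segment.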

More information

Project: IIIF video segmentation

This group attempted to make A/V media more navigable, specifically the content of TheirStory, a project aimed at collecting the personal memories of parents.

A screenshot with a video and a collapsible tree structure on the left-hand side
The collapsible tree structure allows users to skip to different stories in a file without scrubbing through the video timeline

Finding stories within a file currently requires scrubbing through video timelines, so this group experimented with adding a collapsible tree element that allows viewers to skip to specific stories within the file. The group demoed their project both as a standalone web page and within the British Library’s Universal Viewer, showcasing the ability of the IIIF format to work in conjunction with various software.

More information

Project: IIIF Interactive Transcript (Parliamentary Debates)

This group explored whether IIIF could allow viewers to fast-forward through a video by scrolling through an associated time-coded transcript. They worked to reproduce functionality demoed by Tristan Ferne and the BBC’s New News group at the beginning of the event.

The group took approximately 3,500 German parliamentary debates and converted their time-coded transcripts into IIIF. They then built a custom player that displayed the text fragments alongside the video of the debate. By scrolling through the text on the right-hand side, viewers were able to fast-forward the video to the moment in the debate that they were most interested in seeing.
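
The link between scrolling and seeking comes down to a lookup between transcript fragments and their start times. A small sketch of that lookup in both directions, assuming the fragments' start times are available as a sorted list of seconds (the function names are hypothetical, and the group's player may have worked differently):

```python
from bisect import bisect_right


def seek_time_for_fragment(starts, index):
    """Time to jump to when the fragment at `index` is scrolled into view."""
    return starts[index]


def fragment_at(starts, time):
    """Index of the fragment being spoken at `time`, for highlighting."""
    # bisect_right counts the fragments that start at or before `time`;
    # the last of those is the one currently playing.
    return max(bisect_right(starts, time) - 1, 0)
```

With `starts = [0.0, 4.2, 9.8]`, scrolling to the second fragment would seek the video to 4.2 seconds, and a playback position of 5.0 seconds maps back to that same fragment.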

More information

Get involved

The application of automated transcription technologies is a long-standing interest of our group, and we plan to hold a third iteration of Text AV in 2019.

Are you working on a related project? Interested in learning more about the Text AV community? Have ideas you want to talk to us about? We’d love to hear from you!

You can follow us on Twitter for updates on our work, or send us an email at newslabs(at)bbc.co.uk.

