06 August 2018

Adventures in fuzzy matching with Match of the Day

Dan Clark
Former Google News Lab Fellow with BBC News Labs

How Google News Lab Fellow Dan Clark discovered that Gary Lineker sticks to his script.

Hi, my name is Dan. I’m a 24-year-old student studying for an MSc in Data and Computational Journalism at Cardiff University. Over the past few months, I’ve been working for the BBC News Labs team as part of a Google News Lab Fellowship.

During the application process, I was intrigued by the work BBC News Labs was doing. I read up on some of their past projects and, as someone with professional experience in software development and a passion for writing, I was interested in how the team was building tools to make journalists’ work more interesting and innovative. It’s fair to say I was both thrilled and honoured when I got the call offering me the position. I couldn’t wait to get started and see what the News Labs team was about.

The project

During my eight weeks at the BBC, I worked as part of a small, five-person team on a project called Radio Dicer. This was our brief:

To research and create an API that allows us to match script sections in text form to the time code of when they occur in an audio file, and assess how effective this approach is to segmenting content.

The idea behind the project was to build on the success of another News Labs prototype called Slicer TV, an experimental app that uses algorithms to automatically “slice” news programmes into individual stories. The experiment received positive feedback when it was user-tested with audiences, and so we were asked to explore whether we could achieve something similar for the BBC’s radio programmes.

Currently, all of the BBC’s radio programmes are available as single audio files after they air. The start and end times for individual stories within a programme aren’t captured anywhere within our systems, meaning that audiences can’t skip through the master file to the stories that they’re most interested in.

But even though we don’t know the start and end times for individual stories, we do have access to the story scripts that journalists write before a programme airs. These are stored in a BBC system called ENPS. Our job was to build a service that can find the start and end times for the stories by matching a transcript of the audio with the individual story scripts.

It was a broad, slightly daunting challenge, but one I was extremely excited to get stuck into.

The Radio Dicer API would need to take two inputs:

  • The programme script from ENPS, which is already broken down into sections for each story.
  • An automatically generated transcript of the programme, which we produce by running the master audio file through our internal speech-to-text service, BBC Kaldi.

The automatically generated transcript is formatted as an array of words and their start and end times within the audio file. In essence, we would be writing code to find the journalists’ story scripts within that array, in order to retrieve the start and end time for each story in the programme.
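To make that concrete, here is a minimal sketch of the idea rather than the actual Radio Dicer code. The field names (word, start, end), the sample data and the locateSection function are all assumptions made for illustration. It slides a window the length of the script section across the transcript and keeps the position where the most words line up.

```javascript
// Hypothetical transcript shape: one entry per spoken word, times in seconds.
// Note the speech-to-text slip: "plans" was heard as "plants".
const transcript = [
  { word: "good", start: 0.0, end: 0.3 },
  { word: "evening", start: 0.3, end: 0.8 },
  { word: "the", start: 1.0, end: 1.1 },
  { word: "prime", start: 1.1, end: 1.4 },
  { word: "minister", start: 1.4, end: 1.9 },
  { word: "has", start: 1.9, end: 2.1 },
  { word: "announced", start: 2.1, end: 2.7 },
  { word: "new", start: 2.7, end: 2.9 },
  { word: "plants", start: 2.9, end: 3.4 },
  { word: "and", start: 3.6, end: 3.7 },
  { word: "in", start: 3.7, end: 3.8 },
  { word: "sport", start: 3.8, end: 4.2 },
];

// Reduce script text to lowercase word tokens, dropping punctuation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Slide a window of the script's length over the transcript and score each
// position by the fraction of words that agree. Returns the best window's
// score along with its start and end times.
function locateSection(scriptText, words) {
  const target = tokenize(scriptText);
  let best = { score: -1, start: null, end: null };
  if (target.length === 0) return best;

  for (let i = 0; i + target.length <= words.length; i++) {
    let hits = 0;
    for (let j = 0; j < target.length; j++) {
      if (words[i + j].word.toLowerCase() === target[j]) hits++;
    }
    const score = hits / target.length;
    if (score > best.score) {
      best = {
        score,
        start: words[i].start,
        end: words[i + target.length - 1].end,
      };
    }
  }
  return best;
}

const match = locateSection(
  "The Prime Minister has announced new plans.",
  transcript
);
console.log(match);
```

Even with one mis-heard word, the window starting at “the” still scores highest, which is why an approximate match is more useful here than an exact one.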

I was immediately pushed into the deep end. I began by exploring existing libraries that could assist us in achieving the end goal. After some trial and error, I landed on fuzzyset, a JavaScript library for fuzzy searching (or approximate string matching).
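For anyone new to approximate string matching, the sketch below shows the general idea with no dependencies. It does not reproduce fuzzyset’s own scoring. As a stand-in, it uses the Sørensen-Dice coefficient over character pairs, and the function names bigrams and similarity are invented for this example.

```javascript
// Split a string into overlapping two-character chunks ("bigrams").
function bigrams(text) {
  const s = text.toLowerCase().replace(/[^a-z0-9 ]/g, "");
  const grams = [];
  for (let i = 0; i < s.length - 1; i++) {
    grams.push(s.slice(i, i + 2));
  }
  return grams;
}

// Sørensen-Dice coefficient: 2 * shared bigrams / total bigrams.
// Returns a value between 0 (nothing in common) and 1 (same bigrams).
// Strings too short to form a bigram score 0.
function similarity(a, b) {
  const ga = bigrams(a);
  const gb = bigrams(b);
  if (ga.length === 0 || gb.length === 0) return 0;

  // Count the bigrams in the first string.
  const counts = new Map();
  for (const g of ga) {
    counts.set(g, (counts.get(g) || 0) + 1);
  }

  // Use up one count for every bigram of the second string that matches.
  let shared = 0;
  for (const g of gb) {
    const n = counts.get(g) || 0;
    if (n > 0) {
      shared++;
      counts.set(g, n - 1);
    }
  }
  return (2 * shared) / (ga.length + gb.length);
}

console.log(similarity("Brexit", "Brexit"));
console.log(similarity("Brexit", "Breakfast"));
```

A score like this lets code treat two strings as “close enough” instead of demanding that they are identical, which matters when one of them comes from automatic transcription.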

Gif showing segmented Match of the Day content

A prototype Dan made using the Radio Dicer API

Fortunately, I have a lot of experience coding with JavaScript and so the fuzzyset library wasn’t anything new to me. Prior to my time at the BBC, I’d worked for over five years as a web developer for a digital marketing agency in the automotive industry. A big part of my job there was to create front-end interactivity on our clients’ websites, often linked to third-party API services, to drive new and used car sales.

However, for this project, the team decided we should develop the API in JavaScript using Node.js, a server-side environment that was completely different from the front-end work I was used to. Thanks to a couple of willing (and patient) News Labs teachers, I was able to pick this up pretty quickly. It’s funny how easily you can learn something new if you’re fully immersed in it!

Overall, the solution we came up with worked well. When we processed shows such as the 6 O’Clock News, the API’s results were 100% accurate against a set of manually created benchmarks, and it segmented the show well.

We did come across some flaws, though. For example, if the show’s host went off-script, the segmenting was less accurate because the generated transcript no longer matched the pre-written script. We also had problems with BBC Kaldi, which struggled to transcribe certain words in the audio files (it heard “Brexit” as “Breakfast”, for example). But on the whole, we were pleased with what we created.
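One way to cope with off-script moments would be to flag low-scoring sections for a human to check. This is only a sketch of that idea, not something the Radio Dicer API is known to do. The data, the needsReview function and the 0.75 threshold are all made up for illustration.

```javascript
// Hypothetical match results: one per script section, with a score
// between 0 and 1 describing how well the script matched the transcript.
const matches = [
  { story: "Headlines", start: 0.0, end: 42.5, score: 0.97 },
  { story: "Brexit latest", start: 42.5, end: 180.2, score: 0.58 },
  { story: "Sport", start: 180.2, end: 260.0, score: 0.91 },
];

// Return the names of sections whose score falls below the threshold.
// The default threshold is an arbitrary assumption, not a tuned value.
function needsReview(results, threshold = 0.75) {
  return results
    .filter((m) => m.score < threshold)
    .map((m) => m.story);
}

console.log(needsReview(matches));
```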

Working in an agile(ish) development environment

Across the project, we had regular Radio Dicer catch-ups to discuss next steps and reflect on how we’d been progressing since the last meeting. This included sprint planning sessions every two weeks, to come up with a list of objectives that we’d aim to work towards before the next session, and team retrospectives to feed back on how the project was progressing.

On top of this, the whole BBC News Labs team has fortnightly ‘Family Days’, during which each small project team shares updates on their different projects. Although I’d worked in an agile-style environment before, this setup was new to me, and I found it useful: I was so engrossed in the challenging work I was doing day-to-day that I otherwise didn’t have time to keep an eye on the projects that other teams were working on simultaneously.

The eight-week project cycles in News Labs culminate with a “two weeks of tweaks” period, allowing time for completing projects or working on something completely different. This concept really appealed to me, as it allowed me to be creative with how I spent my final days.

Match of the Day Slicer

During the two-week sprint, I wanted to work on something that I was really passionate about: football. I decided to look into an idea that my friends and I had discussed even before my time at the BBC: a Match of the Day slicer.

Screenshot of early prototype of MOTD slicer

An early version of the MOTD prototype

For those not familiar, Match of the Day (MOTD) is a weekly football show, containing highlights of the weekend’s fixtures, including manager/player reaction and analysis. As a big Chelsea fan, I’m most interested in the part of the programme showing my team’s highlights. At the moment, I have to manually skip through the show to find this section, which is where my idea for a MOTD “slicer” came in.

Using the Radio Dicer API the team had created during the first six weeks of the project, I processed a MOTD show to see how it would handle the segmentation. To my surprise, it worked brilliantly. It turns out Gary Lineker sticks to his script!

The evolution of the prototype was rapid and by the end of the two weeks I was able to demo something to the rest of the team that looked positively BBC-ish!

Being part of a close-knit team helped me integrate and feel a part of News Labs. Before any internship or work experience, there’s a certain dread that you’ll be asked to do something mundane, and be left feeling unchallenged and like a nuisance. That couldn’t have been further from the truth; not only was I challenged, but I was treated as an equal member of the team and came away feeling like I really contributed.

It was a great eight weeks: I got to work on an exciting project with a very skilled and, more importantly, always willing-to-help team. It also reaffirmed my interest in working in a fast-paced environment, with the opportunity to get involved in a variety of projects and constantly learn about new technologies.

The News Labs team wishes Dan the best of luck in his future endeavours!

To read more about past Google News Lab Fellows’ work on our team, check out this reflection from Liam Bolton, who worked with us in 2016.

