active

Multilingual Article Tracker (Mat)

Matching translated news articles with the original English-language text.

Aims

Match translated World Service news articles with the original English-language source content. This will allow us to measure distribution and readership, and help inform editors’ commissioning decisions.

The problem

The BBC publishes news stories in more than 40 languages.

A central team called Digihub writes content in English for their colleagues in the language services to translate and publish to their respective audiences.

But currently there is no adequate automated way to monitor the stories the language services translate and publish.

In turn this makes it difficult to know which of the original stories appealed to diverse language services and their audiences.

This is important because the missing information would help editors decide what kind of stories their journalists should write.

Our solution

We designed a tool that would automatically ingest the English-language content from Digihub along with the published news articles from the BBC’s language services.

The tool would then compare the published, translated articles to the original English-language content to find similarities and match the source content with the translated text.

Graphic showing the flow of information

We called it Mat – the Multilingual Article Tracker.

How we built Mat

We translated the published articles back into English using a BBC News Labs translation prototype, Frank, and the BBC's transcription tool Volt.

We used a pre-trained machine learning model to turn each text into a dense vector in a multi-dimensional space: essentially, a very long list of numbers.
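To illustrate what "a very long list of numbers" means in practice, here is a minimal Python sketch. It does not use the pre-trained model Mat relied on, which this page does not name. Instead, the hypothetical `toy_embed` function hashes each word into one of 64 slots and scales the result to unit length. That captures word overlap only, whereas a real embedding model also captures meaning. The function name and the dimension count are assumptions made for illustration.

```python
import hashlib
import math
import re

DIMENSIONS = 64  # hypothetical size; real embedding models typically use hundreds of dimensions


def toy_embed(text: str, dimensions: int = DIMENSIONS) -> list[float]:
    """Map text to a fixed-length vector by hashing its words.

    Stand-in for a pre-trained model: it reflects word overlap, not meaning.
    """
    vector = [0.0] * dimensions
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # A stable hash, so the same word always lands in the same slot.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dimensions
        vector[index] += 1.0
    norm = math.sqrt(sum(value * value for value in vector))
    if norm == 0.0:
        return vector  # text with no words stays the zero vector
    # Scale to unit length so long and short texts are comparable.
    return [value / norm for value in vector]


original = toy_embed("Why Germany delayed sending tanks to Ukraine")
print(len(original))  # the vector always has DIMENSIONS entries
```

Whatever the length of the input text, the output has the same number of entries, which is what makes two articles directly comparable.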

We then used the k-nearest neighbours algorithm to match the original content with the translated articles, by finding the articles whose vectors lay closest to the vector of the original.
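The matching step can be sketched as a brute-force nearest-neighbour search. This is an illustration under stated assumptions, not Mat's actual code: the page does not say which distance measure or library was used, so the sketch assumes cosine similarity, and the function names, article labels and three-dimensional vectors are all invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def nearest_neighbours(
    query: list[float], candidates: dict[str, list[float]], k: int
) -> list[tuple[str, float]]:
    """Return the k candidates most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical vectors, invented for illustration only.
source_vector = [0.9, 0.1, 0.2]
article_vectors = {
    "article-a": [0.88, 0.12, 0.21],
    "article-b": [0.5, 0.6, 0.1],
    "article-c": [0.0, 0.1, 0.95],
}

matches = nearest_neighbours(source_vector, article_vectors, k=2)
for name, score in matches:
    print(f"{name}: {score:.3f}")
```

Comparing the query against every candidate is fine for a day's worth of articles. A larger archive would usually call for an indexed search instead.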

Outcome

We tested the Mat prototype with content from Thursday 26 January 2023.

The Digihub team emailed the BBC’s language services a story examining why Germany had delayed sending tanks to Ukraine, comprising analysis by the BBC correspondent Katya Adler.

We asked Mat to find the 10 nearest matches to this English-language content.

Mat correctly picked up that the content had been translated into Spanish, Vietnamese and Japanese.

Graphic showing the translated articles

It gave a confidence rating of more than 97% that these were the right match.

It also identified seven other stories, in Russian, Arabic, Korean, Japanese, Vietnamese, Spanish and Nepali, but gave them lower confidence ratings. This indicated that they were very similar to the English-language original but might not be the right match.

It appeared that these seven articles were the news story about Germany sending tanks to Ukraine, but not the analysis piece that our Digihub colleagues shared, thus justifying the lower confidence rating.
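The distinction between likely matches and merely similar stories can be sketched as a simple cut-off on the score. This page does not describe how Mat calculated its confidence rating, so the sketch assumes a similarity score between 0 and 1 and borrows the 97% figure reported above as the threshold. The function name, candidate labels and scores are invented for illustration and are not results from the test.

```python
THRESHOLD = 0.97  # assumption: borrowed from the "more than 97%" figure reported above


def split_by_confidence(
    scored: list[tuple[str, float]], threshold: float = THRESHOLD
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Separate candidates scoring above the threshold from the rest."""
    likely = [(name, score) for name, score in scored if score > threshold]
    similar = [(name, score) for name, score in scored if score <= threshold]
    return likely, similar


# Invented scores, not taken from Mat's output.
scored = [
    ("candidate-1", 0.98),
    ("candidate-2", 0.975),
    ("candidate-3", 0.91),
    ("candidate-4", 0.88),
]

likely, similar = split_by_confidence(scored)
print("Likely translations:", [name for name, _ in likely])
print("Similar but unconfirmed:", [name for name, _ in similar])
```

A fixed cut-off like this would need tuning against articles that editors have checked by hand, because a news story and an analysis piece about the same event can score close together.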

Next steps

News Labs is currently considering whether to spend more time refining the prototype.

Results

  • In a test, Mat correctly identified three translated articles which matched the English-language content.
