News in Space

Published: 13 May 2019

Aims

What if we could compare and recommend news across media and topic types?

News is put into pre-defined sections to help readers browse by particular interest: Business, Sport, Politics, etc. However, this leaves someone reading an article stuck inside these top-down subject silos. We know as humans that categories are loose: an article about a takeover of a football club is mainly about sport, but also somewhat about business. The sections model makes it harder for us to encourage readers to explore both worlds.

One way of overcoming this is to use tagging. A subject - say the Prime Minister - can be tagged and found in multiple articles, no matter the section. An article can also have more than one tag.

However, this still lacks the human notion of loose concepts. When searching for ‘Prime Minister’, we wouldn’t be able to find articles tagged with similar subjects, such as ‘Downing Street’, even though we know that the two terms are so similar that they’re sometimes used interchangeably by journalists.

The News in Space project attempts to find the similarity between all pieces of BBC content, no matter the format or subject. We use machine learning algorithms over large BBC datasets to create a mathematical space representing “distance” between pieces of content. For text corpora, this is done by finding words that commonly occur together and then learning a vector for each word, so that similar words (dog→cat, Canada→USA) will have a smaller distance between them than words that co-occur less often. Using this, we might learn that an article about Queen Elizabeth is very close to a documentary about Buckingham Palace yet further away from audio about the Houses of Parliament.
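To make the idea of distance from co-occurrence concrete, here is a minimal sketch in Python. It is not the project's actual pipeline: the toy corpus is invented, the function names are hypothetical, and raw co-occurrence counts stand in for the learned vectors described above. It only shows that words used in similar contexts end up closer together than words used in different ones.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration (not BBC data).
CORPUS = [
    "the dog chased the ball in the park",
    "the cat chased the mouse in the park",
    "the dog and the cat sleep in the house",
    "parliament debated the budget in westminster",
    "ministers debated the budget in parliament",
]


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


vectors = cooccurrence_vectors(CORPUS)
dog_cat = cosine_distance(vectors["dog"], vectors["cat"])
dog_budget = cosine_distance(vectors["dog"], vectors["budget"])

print(f"dog -> cat:    {dog_cat:.3f}")
print(f"dog -> budget: {dog_budget:.3f}")
```

On this corpus, "dog" sits much closer to "cat" than to "budget", because the first pair shares most of its neighbouring words. A real system would train dense vectors on a far larger corpus rather than compare raw counts.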

This modelling, when applied to entire articles, allows us to serve up related content that varies in similarity. Some onward journeys might be about the exact same subject, while others might be related in a more conceptual way. Because this distance is based on descriptions of the content, we can serve up media in multiple formats and show the richness of the BBC's output.
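One simple way to move from word vectors to whole pieces of content is to average the vectors of the words in each description and rank other items by cosine similarity. The sketch below assumes that approach; the source does not say which method the project uses. The three-dimensional word vectors, the catalogue entries and the function names are all hypothetical, hand-picked for illustration.

```python
import math

# Hypothetical hand-picked vectors; the axes loosely mean (royalty, politics, sport).
WORD_VECTORS = {
    "queen": (0.9, 0.2, 0.0),
    "elizabeth": (0.9, 0.1, 0.0),
    "buckingham": (0.8, 0.1, 0.0),
    "palace": (0.8, 0.2, 0.0),
    "parliament": (0.2, 0.9, 0.0),
    "commons": (0.1, 0.9, 0.0),
    "debate": (0.0, 0.8, 0.1),
    "football": (0.0, 0.1, 0.9),
    "takeover": (0.0, 0.3, 0.6),
}

# Invented catalogue mixing formats; only the description is used for matching.
CATALOGUE = [
    {"format": "article", "description": "Queen Elizabeth"},
    {"format": "documentary", "description": "Inside Buckingham Palace"},
    {"format": "audio", "description": "Commons debate in Parliament"},
    {"format": "video", "description": "Football club takeover"},
]


def embed(description):
    """Average the vectors of the known words in a description."""
    known = [WORD_VECTORS[w] for w in description.lower().split() if w in WORD_VECTORS]
    if not known:
        return (0.0, 0.0, 0.0)
    return tuple(sum(axis) / len(known) for axis in zip(*known))


def cosine_similarity(a, b):
    """Cosine similarity of two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def related(item, catalogue):
    """Return every other catalogue item, most similar first."""
    query = embed(item["description"])
    others = [c for c in catalogue if c is not item]
    return sorted(
        others,
        key=lambda c: cosine_similarity(query, embed(c["description"])),
        reverse=True,
    )


ranked = related(CATALOGUE[0], CATALOGUE)
for entry in ranked:
    print(entry["format"], "-", entry["description"])
```

Because the ranking depends only on the descriptions, the documentary, audio and video items compete on equal terms with articles, which is what lets a single list of onward journeys mix formats.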

It also raises questions about the intent of the person viewing. If we know that all these articles, videos and audio clips are related, which of them does the user want to see? Can we infer the intent of the user?