Topic Modelling on BBC data

Recently, we’ve been focussing on the classification of BBC News articles. Not that we aren’t interested in other news outlets, but using BBC content gave us an important head start: we’ve got some editorial classification already, because the BBC News website is divided into news sections (see the image above).

This manual classification served as a dataset for topic modelling - the method we chose to analyse news articles.

Why are we trying to classify news content in the first place?

As a journalist, as a reader, as a human, it is easy to look at a news article and understand and name the topics covered in the story. We may use different words, but we understand the concept - even if the topic isn’t explicitly represented in the text. An article can be about education but never actually mention the word “education” itself.

Using BBC News data from January to June 2016, we wanted to see if we could distinguish articles from different news sections based on word frequency. Some of the news sections refer to locations: regions of the UK, Africa, Europe, Asia and elsewhere; some to concepts: business, entertainment, technology and so on.

From half a year’s worth of BBC News we used around 15,000 articles, ignoring local news articles and blogs for the moment.

We focused on the news sections: from Africa to the US, from business to tech, we had 12 sections in total.

Starting by counting the different words within each news article, we used LDA (latent Dirichlet allocation), a form of topic modelling, to segment this news corpus into 12 groups without looking at the section labels.
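As a rough illustration of that first step (not the code behind these experiments), here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. The four toy articles, the function names and the hyperparameters alpha and beta are assumptions made for the example; on a real corpus one would more likely reach for a library implementation, and two topics stand in for our 12.

```python
import random
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace and strip simple punctuation."""
    tokens = (w.strip(".,!?\"'()").lower() for w in text.split())
    return [t for t in tokens if t]

def fit_lda(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of token lists.

    alpha and beta are assumed smoothing hyperparameters.
    Returns (doc_topic_counts, topic_word_counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Repeatedly resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return doc_topic, topic_word, vocab

# Hypothetical toy corpus; section labels are deliberately not used.
articles = [
    "Shares in the bank fell as prices rose",
    "The company reported higher prices and shares",
    "Scientists found a new planet near a distant star",
    "The telescope observed the star and the planet",
]
docs = [tokenize(a) for a in articles]
word_counts = [Counter(doc) for doc in docs]  # per-article word frequencies

doc_topic, topic_word, vocab = fit_lda(docs, n_topics=2)
# Assign each article to the topic that claims most of its tokens.
dominant_topic = [max(range(2), key=lambda k: row[k]) for row in doc_topic]
print(dominant_topic)
```

The sampler works on token lists, which carry the same information as the per-article word counts. With a corpus this small the grouping is not guaranteed to be meaningful; the point is only to show the mechanics.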

In a second step, we assigned the section labels, looking at the distribution of articles from each section amongst topics #1-12 identified by LDA. If the vocabulary of each section were distinct enough and the segmentation nearly perfect, each bar in this graph would be dominated by one colour (one news section) only.
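The comparison behind such a graph can be sketched as a simple cross-tabulation. The snippet below is an illustration under assumptions: the six section/topic pairs are made up, and the function names are hypothetical. It counts which sections end up in which topic and reports how strongly each topic is dominated by a single section.

```python
from collections import Counter, defaultdict

def section_mix_per_topic(sections, topics):
    """Count how many articles from each section fall into each topic."""
    mix = defaultdict(Counter)
    for section, topic in zip(sections, topics):
        mix[topic][section] += 1
    return mix

def label_topics(mix):
    """Label each topic with its most common section and that section's share."""
    labels = {}
    for topic, counts in mix.items():
        section, n = counts.most_common(1)[0]
        labels[topic] = (section, n / sum(counts.values()))
    return labels

# Hypothetical toy data: editorial section and LDA topic for six articles.
sections = ["business", "business", "business", "tech", "tech", "health"]
topics = [1, 1, 2, 2, 2, 1]

mix = section_mix_per_topic(sections, topics)
labels = label_topics(mix)
for topic in sorted(labels):
    section, share = labels[topic]
    print(f"topic #{topic}: {section} ({share:.0%} of its articles)")
```

A share close to 100% corresponds to a bar dominated by one colour; shares near 1/12 would mean the topic tells us almost nothing about the section.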

Well, clearly we didn’t do well here at all.

The problem is simple: an article can be about business in China or health services in the UK, and therefore use vocabulary that is typical of both the location and the ‘topic’ itself. Trying to distinguish geography and concepts at the same time is hard.

Let’s make the problem simpler. Instead of mixing locations and topics in our segmentation, what if we focus on only a few abstract categories - topics, really - from the same period and explore whether we can distinguish them based on word frequency?

Even without any evaluation metric, we can see that LDA did well segmenting different news sections.

Topic #1 is dominated by business articles, and its most relevant words (“company”, “shares”, “bank”, “prices”…) are business-related. The same holds true for topics #2 and #3, dominated by articles from the science and technology sections respectively.

Education and health articles share a representation in topic #4. It seems they share a lot of vocabulary, which makes it hard for LDA to tell them apart.

Only a minority of news articles, classified into topic #5 in our experiment, were not distinguished by the “typical” business, education, health, science or tech vocabulary. For those articles, an editorial, human assignment of sections will remain the best method - for now.

Where to go from here? - Is topic modelling the answer?

Topic modelling as a way of bringing structure into unstructured, plain text and data is a tempting idea in newsrooms facing severe pressure to run more efficiently and give audiences greater access to more personalised news. It certainly is a lot of fun to play around with, and it makes you realise how much we humans take “understanding of text” for granted.

In the context of news, topic modelling is not something we can just throw at text and expect automation to replace the journalists’ time that goes into tagging and categorisation of articles. Here are a few reasons why:

(1) Topic != Topic.

Topics in topic modelling are not the same as topics in human understanding. “Brexit” is a topic for us, but for LDA or any other topic modelling algorithm a topic is a distribution over tokens: one in which “UK”, “Europe”, “EU”, the names of certain politicians, and maybe abstract terms such as “immigration” and “economy” are the most likely words.
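To make “a distribution over tokens” concrete, here is a small sketch. The counts are invented for illustration, and beta is an assumed smoothing constant; the function simply turns one topic's word counts into probabilities and lists the most likely tokens.

```python
def top_tokens(topic_word_counts, n=3, beta=0.01):
    """Turn one topic's word counts into smoothed probabilities; return the top n."""
    total = sum(topic_word_counts.values()) + beta * len(topic_word_counts)
    probs = {w: (c + beta) / total for w, c in topic_word_counts.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)[:n]

# Hypothetical counts of how often each word was assigned to one topic.
brexit_like_topic = {
    "uk": 40, "eu": 35, "europe": 20,
    "immigration": 10, "economy": 8, "planet": 0,
}

for word, prob in top_tokens(brexit_like_topic):
    print(f"{word}: {prob:.2f}")
```

Nothing in the output says “Brexit”: the model only provides the ranked words, and a person still has to read them and name the topic.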

A topic in this context doesn’t have a label, but the label of a topic is what connects one article to another and conveys the meaning from journalists to readers. To assign labels to topics we might still need humans or a human-curated dataset to inform a supervised machine learning method.

(2) Vocabulary. News is usually new.

Topic modelling has been applied in various fields from historic texts to genome analysis. The key here is that the vocabulary of the corpus is relatively static.

But news by its very nature is changing all the time. Natural disasters, political changes, crimes and cultural events bring with them new vocabulary - “Brexit”, for example, did not exist five years ago.

In addition to that, the vocabulary we choose to talk and write about a topic today will vary from the vocabulary in use 20 years ago or 20 years in the future. News is fluid and so is its dictionary. To account for that one could retrain a topic model very frequently, or use time-dependent models to account for the shift. Our Masters student Nantianjie Deng from UCL is currently looking into that and we hope she will share her knowledge here when she’s ready.
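One simple way to notice that a model's dictionary is falling behind is to measure how many tokens in newly published articles it has never seen. The sketch below is an assumption about how such a check could look, not a description of any existing system; the vocabulary, the new articles and the idea of using the rate as a retraining signal are all hypothetical.

```python
def oov_rate(model_vocab, new_docs):
    """Share of tokens in new articles that are missing from the model vocabulary."""
    tokens = [w for doc in new_docs for w in doc]
    if not tokens:
        return 0.0
    unseen = sum(1 for w in tokens if w not in model_vocab)
    return unseen / len(tokens)

# Hypothetical vocabulary of a model trained before "brexit" entered the news.
model_vocab = {"uk", "eu", "referendum", "economy", "vote"}
new_docs = [["brexit", "vote", "uk"], ["brexit", "economy", "eu"]]

rate = oov_rate(model_vocab, new_docs)
print(f"{rate:.0%} of new tokens are unknown to the model")
```

A rising rate would be one hint that it is time to retrain; which threshold makes sense is an open question that depends on the corpus.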

(3) A question of granularity: The optimal number of topics.

The tricky bit with topic modelling is that it isn’t magic. We still have to make an informed choice on how many topics we want to segment our articles into. Are we just expecting them to fall into two big groups or are we looking for hundreds, or maybe thousands of subgroups? We looked into using “topic similarity” to help make an objective decision, but the answer will be a different one depending on vocabulary and number of articles given.

In the end, choosing the number of topics becomes a question of granularity. Is “science” topical enough, or do we want to dig down to separate “consumer technology”, “quantum physics” and “genome research” articles? Working collaboratively with journalists will probably lead to a more sensible, pragmatic solution than searching for the local minimum algorithmically.
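As an illustration of the “topic similarity” idea mentioned above, one possible measure (an assumption for this sketch, not necessarily the one we tried) is the average cosine similarity between the word distributions of every pair of topics. Low values suggest the topics are well separated; high values suggest that some of them overlap and that fewer topics might do. The two sets of distributions are invented.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_topic_similarity(topic_word):
    """Average cosine similarity over all pairs of topics (needs at least two)."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Hypothetical topic-word distributions over a four-word vocabulary.
distinct = [[0.7, 0.3, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
overlapping = [[0.4, 0.3, 0.2, 0.1], [0.3, 0.4, 0.1, 0.2]]

print(mean_topic_similarity(distinct))
print(mean_topic_similarity(overlapping))
```

In practice one would compute this for models fitted with different numbers of topics and compare the results, keeping in mind that the numbers shift with the vocabulary and the number of articles.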

(4) Evaluation. An iterative process.

We can apply a topic model to a corpus of news text and split into a given number of groups, but how do we know that makes sense?

Topic modelling is unsupervised by its nature, but it can profit from an iterative process going back and forth with human judgement. In the context of news, we could present the journalist who is working on a news article with “topic suggestions”: word clouds that represent the most likely topic groups this article might fall into. A simple yes or no on each of these suggestions from hundreds of journalists, thousands of times a month, could make for a valuable dataset to iteratively improve the model, fine-tune the vocabulary, increase or decrease the granularity of topics and - in the long run - make the lives of journalists easier and news stories more connected.
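How such yes/no feedback might be tallied can be sketched in a few lines. Everything here is hypothetical: the vote format, the topic numbers and the 50% threshold are assumptions chosen for the example.

```python
from collections import defaultdict

def acceptance_rates(votes):
    """votes: iterable of (topic_id, accepted) pairs -> share of 'yes' per topic."""
    yes = defaultdict(int)
    total = defaultdict(int)
    for topic_id, accepted in votes:
        total[topic_id] += 1
        yes[topic_id] += int(accepted)
    return {t: yes[t] / total[t] for t in total}

# Hypothetical feedback: did the journalist accept the suggested topic?
votes = [
    (1, True), (1, True), (1, False),
    (4, False), (4, False), (4, True), (4, False),
]

rates = acceptance_rates(votes)
# Topics rejected more often than accepted are candidates for a closer look.
needs_review = [t for t, r in rates.items() if r < 0.5]
print(rates, needs_review)
```

Topics that are rejected again and again could point to a group that should be split, merged with another or retrained on fresher vocabulary.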

It’s all very exciting.