Classification of news articles in topic groupings.
News is a rolling stream of stories - by definition it lives in the present. But news is also a valuable resource for academics who want to study the past. How was the Ebola epidemic reported in Europe and Africa? Which news outlets cover climate change the most, and how? What were the topics of discussion surrounding elections over the past century?
Coping with a large volume
News is a contemporary witness of society and can tell us a lot about things that went right or wrong. But the number of news articles produced in the world in a day is overwhelming; impossibly large for manual, human processing. According to Chartbeat, over 92,000 articles are posted to the web every 24 hours.
Beyond the standard topic groups
We can of course use computers to extract useful information from news texts, such as names of people, organisations, political parties, geographical locations and pretty much any word that has a definition in Wikipedia. But for a computer to nail down ‘What is this article about?’ in a few words - just like you and I would answer intuitively - is really tricky. Topic Modeling is a method to use computers for this more complex task.
We want to apply computational classification methods to answer the question “What is this news article about?” on a large scale. Luckily for us, the BBC is sitting on a huge number of news stories that we can play with, so data isn’t a limiting factor.
Prototyping for a solution
To classify articles into topical bins, there are three avenues we want to explore and compare:
- Article Classification with Principal Component Analysis
- Topic Modeling (not) given the number (and/or names) of the topics
- Machine learning with topic labels assigned to a training set of news articles
The starting point for all these approaches will be word frequency distributions. What does that mean? A news article about immigration will likely contain more geographical references, names of countries with conflict zones, and politicians who play a role in that context, while a Champions League article will draw on a rather different vocabulary.
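To make "word frequency distribution" concrete, here is a minimal sketch; the snippets and the tokeniser are invented for illustration, not what we run in production:

```python
from collections import Counter
import re

def word_frequencies(text):
    """Lowercase the text, tokenise on letter runs, and count occurrences."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

immigration = "Ministers said border controls and asylum policy dominate the debate."
football = "The striker scored twice as the Champions League match ended in a draw."

print(word_frequencies(immigration).most_common(3))
print(word_frequencies(football).most_common(3))
```

Each article becomes a vector of counts over the vocabulary; everything below operates on such vectors.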
Say we have a bag of news articles, but we don't know how many groups or topics they fall into or which (combination of) words may tell them apart. Principal component analysis can give us a feeling for this problem. It may not scale to a huge number of news articles, but it's the first, simplest step.
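A toy sketch of the PCA idea, assuming we already have a term-count matrix (the counts below are made up; real articles would have thousands of vocabulary columns):

```python
import numpy as np

# Toy term-count matrix: rows = articles, columns = vocabulary terms.
# vocab: ["border", "asylum", "goal", "match"] (counts invented for illustration)
X = np.array([
    [4, 3, 0, 1],   # immigration story
    [5, 2, 1, 0],   # immigration story
    [0, 1, 5, 4],   # football story
    [1, 0, 3, 5],   # football story
], dtype=float)

# Centre each column, then take the SVD; the rows of Vt are the
# principal components (directions of greatest variance).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project every article onto the first two components.
coords = Xc @ Vt[:2].T
print(coords.round(2))
```

If the first component separates the two vocabularies, the articles split into visible clusters even before we know how many topics there are - exactly the "feeling for the problem" we are after.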
@sytpp is playing with this right now - if you have ideas and input, give her a ping!
Step #2: Serious topic modelling based on Latent Semantic Analysis (LSA). Digital humanities specialist Scott Weingart explains the concept much better than we could ever do - so read his blog post for an idea about how we are planning to anatomise the news.
@sytpp will come to that once she’s given up on PCA - so tune back in for updates.
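For a rough idea of the mechanics: LSA boils down to a truncated SVD of the term-document matrix, keeping only the k largest singular values as latent "topic" dimensions. A minimal sketch with invented counts:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
# terms: ["election", "vote", "goal", "striker"] (counts invented for illustration)
A = np.array([
    [3, 2, 0, 0],
    [2, 3, 1, 0],
    [0, 0, 3, 2],
    [0, 1, 2, 3],
], dtype=float)

# Truncated SVD: keep the k largest singular values, so each document
# is described by a k-dimensional latent "topic" vector.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_topics = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

print(doc_topics.round(2))
```

Documents that share vocabulary end up close together in this latent space, which is what lets LSA group articles by topic without ever being told what the topics are.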
Step #3. This approach requires a bit more knowledge to start with. Assume we have a number (I mean, a large number…) of news article texts, each with a human-annotated topic label. We could then learn the connection between word frequency distributions and the chance of a news article falling into one topic space or another.
Dr Chenghua Lin from the University of Aberdeen and Dong Liu are the people focussing on this approach.
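One classic way to learn that connection is a multinomial Naive Bayes classifier over word counts. The sketch below is a from-scratch illustration with a hand-invented four-article "training set", not the team's actual method or data:

```python
import math
import re
from collections import Counter, defaultdict

def tokenise(text):
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)   # per-label word counts
        self.label_counts = Counter(labels)       # how many docs per label
        vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenise(text)
            self.word_counts[label].update(tokens)
            vocab.update(tokens)
        self.vocab_size = len(vocab)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_docs in self.label_counts.items():
            score = math.log(n_docs / total_docs)          # log prior
            counts = self.word_counts[label]
            total = sum(counts.values())
            for tok in tokenise(text):                     # log likelihoods
                score += math.log((counts[tok] + 1) / (total + self.vocab_size))
            scores[label] = score
        return max(scores, key=scores.get)

# Tiny hand-labelled training set (invented for illustration).
texts = [
    "asylum seekers crossed the border as ministers debated policy",
    "the home office tightened visa and border rules",
    "the striker scored a late goal in the champions league",
    "a penalty decided the match after extra time",
]
labels = ["immigration", "immigration", "football", "football"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("new border checks for asylum claims"))
```

With enough labelled articles, the same recipe (or a stronger model trained the same way) assigns topic labels to unseen stories at scale.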