Early last week, Iain took part in a two-day internal BBC News hackathon.
Among all the great projects presented, he worked for a bit on a Breaking News Stream, which displays information from articles linked to from Twitter.
This small project is a nice way of exploring what’s trending, or breaking, for a topic.
Here's how it works:
- It monitors Twitter for a set of keywords and/or tags, such as “breaking news”, “Charlie Hebdo” or other breaking topics.
- It looks for URLs linked to in the tweet.
- If a URL is found, it fetches the page, extracting the text and images.
- Then, with some natural language processing (NLP) magic, it extracts tags and concepts, based on their frequency in the article’s body.
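The first two steps above can be sketched roughly as follows. This is a minimal illustration, not the hackathon code: the keyword list, the URL pattern and the function names `matches_keywords` and `extract_urls` are all assumptions made for the example.

```python
import re

# Hypothetical keyword list; the post only gives these as examples.
KEYWORDS = ["breaking news", "charlie hebdo"]

# Simple pattern for http(s) URLs; real tweets may need more careful parsing.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def matches_keywords(tweet_text, keywords=KEYWORDS):
    """Return True if the tweet mentions any monitored keyword (case-insensitive)."""
    lowered = tweet_text.lower()
    return any(keyword in lowered for keyword in keywords)


def extract_urls(tweet_text):
    """Return all URLs found in the tweet, with trailing punctuation stripped."""
    return [url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(tweet_text)]
```

A tweet that passes `matches_keywords` would then have each result of `extract_urls` handed to the page fetcher.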
Even multi-word concepts are recognised, such as “David Cameron” or “Je Suis Charlie”. This works without any index of tags: it dynamically discovers topics in articles using natural language processing algorithms.
It uses stop words to ignore useless or generic words and expressions, e.g. Facebook “Share This” buttons or common words like “Comments” or “YouTube”.
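To give a feel for frequency-based tagging with stop words, here is a small sketch. It swaps in plain n-gram counting for whatever NLP the project actually uses, and the stop list, the `extract_tags` name and the three-word limit are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical stop list: a few generic words plus page-furniture terms.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "to", "in", "is", "on", "for",
    "share", "this", "comments", "youtube",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, n):
    """Yield every run of n adjacent tokens as a tuple."""
    return zip(*(tokens[i:] for i in range(n)))


def extract_tags(text, top_n=10, max_words=3, stop_words=STOP_WORDS):
    """Count one- to max_words-word phrases that contain no stop word."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_words + 1):
        for gram in ngrams(tokens, n):
            if not any(word in stop_words for word in gram):
                counts[" ".join(gram)] += 1
    return counts.most_common(top_n)
```

Counting phrases as well as single words is what lets a name like “David Cameron” surface as one tag. A real system would also want to avoid joining words across sentence boundaries, which this sketch does not attempt.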
It would be useful to compare the results against a whitelist (such as topics in BBC Things, or topics that have pages in Wikipedia) and that’s something we hope to explore.
If you connect with a browser, the server returns a webpage which streams tags, articles and images, updated in real time. The top 10 trending tags are displayed in a graph in the browser, which is also continually updated.
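One way a continually updated top 10 could be kept is with a sliding time window, so that old tags fall out of the chart. The sketch below assumes exactly that; the `TrendingTags` class, the one-hour window and the requirement that timestamps arrive in order are illustrative choices, not a description of the actual server.

```python
from collections import Counter, deque


class TrendingTags:
    """Track tag counts over a sliding time window (timestamps in seconds)."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tag) pairs, oldest first
        self._counts = Counter()

    def add(self, tags, now):
        """Record the tags extracted from one article at time `now`."""
        for tag in tags:
            self._events.append((now, tag))
            self._counts[tag] += 1

    def _expire(self, now):
        """Drop events older than the window and decrement their counts."""
        while self._events and now - self._events[0][0] > self.window_seconds:
            _, tag = self._events.popleft()
            self._counts[tag] -= 1
            if self._counts[tag] == 0:
                del self._counts[tag]

    def top(self, now, n=10):
        """Return the n most frequent tags still inside the window."""
        self._expire(now)
        return self._counts.most_common(n)
```

The server would call `add` for each processed article and push the result of `top` to connected browsers whenever it changes.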
We haven’t open-sourced the codebase yet, as it is missing a rate-limiting function for the sites it scrapes. This means that, potentially, a URL retweeted thousands of times would be scraped thousands of times. And, because we’re good people, we wouldn’t want to DDoS another news organisation (or release something people could accidentally use to do that).
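As a rough idea of what such a safeguard might look like, the sketch below skips URLs that have already been scraped and enforces a minimum delay between requests to the same host. It is only one possible approach: the `PoliteFetchGate` name and the five-second interval are assumptions, and this is not the change we will necessarily ship.

```python
from urllib.parse import urlsplit


class PoliteFetchGate:
    """Decide whether a URL may be scraped right now (timestamps in seconds)."""

    def __init__(self, min_interval_seconds=5.0):
        self.min_interval_seconds = min_interval_seconds
        self._seen_urls = set()
        self._last_fetch_by_host = {}

    def should_fetch(self, url, now):
        # Never scrape the same URL twice, however often it is retweeted.
        if url in self._seen_urls:
            return False
        # Leave a gap between requests to the same site.
        host = urlsplit(url).netloc.lower()
        last = self._last_fetch_by_host.get(host)
        if last is not None and now - last < self.min_interval_seconds:
            return False
        self._seen_urls.add(url)
        self._last_fetch_by_host[host] = now
        return True
```

A URL refused only because its host was contacted too recently is not marked as seen, so it can still be fetched when it next appears in a tweet.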
Over the coming days, we will look at implementing the necessary changes so we can publish the code.
We are interested in seeing how we can use/abuse this code for other hacks, so keep an eye on @BBC_News_Labs for more!