Datastringer, Juicer and The Computational Journalism Symposium in New York


We received some fantastic news from the 3rd Computation + Journalism Symposium last week: the two papers we submitted for this event were accepted.

News Labs is going over to New York City

Datastringer: easy dataset monitoring for journalists has been invited for an oral presentation; Covering the EU Elections with Linked Data will be presented as a poster and demo.

We’re thrilled by the welcome these projects received, and very grateful to the reviewers and organising committee for inviting us to participate in this event.

Datastringer: easy dataset monitoring for journalists

This paper entered the Platforms category:

Platforms that support journalistic work and which enable new ways of finding, producing, curating, or disseminating stories and other news content.

Here is its abstract:

We created software enabling journalists to define a set of criteria they would like to see applied regularly to a constantly-updated dataset, sending them an alert when these criteria are met, thus signaling to them that there may be a story to write. The main challenges were to keep the product scalable and powerful, while making sure that it could be used by journalists who would not possess all the technical knowledge to exploit it fully. In order to do so, we had to choose JavaScript as our main language, as well as design the code in such a way that it would allow re-usability and further improvements. This project is a proof of concept being tested in a real-life environment, and will be developed towards more and more accessibility.
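To make the idea in the abstract concrete, here is a minimal sketch of the "apply criteria to a dataset, raise an alert when they are met" loop. It is written in Python for brevity (Datastringer itself is written in JavaScript), and every name in it (`Stringer`, `check_dataset`, the sample rows) is hypothetical and not taken from the Datastringer code base.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


@dataclass
class Stringer:
    """Hypothetical criterion: a label plus a predicate applied to one dataset row."""
    name: str
    criterion: Callable[[Row], bool]


def check_dataset(rows: Iterable[Row], stringers: List[Stringer]) -> List[str]:
    """Return one alert message for each (row, stringer) pair whose criterion is met."""
    alerts = []
    for row in rows:
        for stringer in stringers:
            if stringer.criterion(row):
                alerts.append(f"{stringer.name}: {row}")
    return alerts


# Made-up sample data standing in for a regularly updated dataset.
rows = [
    {"area": "A", "month": "2014-08", "count": 12},
    {"area": "B", "month": "2014-08", "count": 57},
]
stringers = [Stringer("count above 50", lambda r: r["count"] > 50)]

alerts = check_dataset(rows, stringers)
for alert in alerts:
    print(alert)
```

In a real deployment the check would presumably run on a schedule against freshly fetched data, and the alerts would be emailed rather than printed; both of those parts are left out of this sketch.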

Covering the EU Elections with Linked Data

This second paper entered the Research papers category:

Research papers which explore a question of interest in journalism or information studies, or in data and computing sciences as it relates back to journalism and news information.

And here is its abstract:

We tested in the context of the 2014 European Election in the UK our linked data engine, developed to map out the relationships between entities mentioned in the news. This experimental project provided for the first time a real-life environment and raised a number of questions and improvements required to ensure the model’s reliability and the confidence journalists could have in this purely data-based tool. The project aimed to produce background information a journalist could use to write a story. It surfaced a very unequal media coverage received by the political parties, as well as relationships between entities that needed to be explained.

If you feel curious…

These papers will eventually be made public by C+J. The Datastringer paper should even appear in the American Journalism Review. In the meantime, you can read a less scientific write-up on the OpenNews Source blog.

Project of the week

Our Iain Collins crunched some data to study the news coverage received by Bills before the UK Parliament.

Do have a look at the GitHub Gist with the results!

“The example above only includes Bills that have actually been published online by Parliament - Bills that are still to be published are less likely to have been written about in the media.”

“Only the top 10 matching articles from a select list of sources are shown here. The total for the number of articles refers to articles published in the last 3 months that were found and indexed by the BBC News Juicer. You can read about the Juicer here and here.”

“This output was created by mashing up the node modules newsQuery, psuk-parliament and gramophone. It’s a simple but practical example of how I want to leverage existing tools and data to create a site that helps explain new legislation, and the issues related to it.

“As well as curating a list of related content and discovering potentially useful tags (which I have limited to 5 per Bill here) that could be used to help include and link to relevant content, the number of related articles in the media helps show which Bills are important and topical.”

“By looking at the media coverage over time and filtering by publication (and the content FROM each publication) you can get an immediate idea of what the key issues are and how important a Bill is.”

“Not shown here is all the semantic data associated with the articles that have been found, or additional media (including news clips, BBC News and relevant BBC Parliament coverage).”

“There’s a small number of erroneous articles for little known Bills that haven’t had any press coverage but overall I’m hugely happy with the results of this smoke test and looking forward to expanding on it in the next few weeks.”

The data architects’ corner

Let’s kick off with Karl Sutt’s recent work on the Juicer.

A big change happened under the hood recently: “Content extraction open source library Newspaper replaced our previous one, Goose. Newspaper is already in production and so far, we’re really happy with it! We’ve got some upstream patches for Newspaper and hopefully have more on the way.”

“I also recently implemented, using Elasticsearch, a kind of semantic “find related” functionality, which works by exploiting the tf-idf (term frequency, inverse document frequency) algorithm. The results are really great so far, people seem impressed. It’s available through the Juicer API, as well as newsQuery, as well as in the Juicer Demo.”
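For readers unfamiliar with tf-idf, here is a minimal, self-contained sketch of how a "find related" ranking can be built on it. This is not the Juicer's implementation (which relies on Elasticsearch); the function names, the whitespace tokenizer, the cosine-similarity ranking and the three sample headlines are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase, split on whitespace, keep alphabetic words only."""
    return [word for word in text.lower().split() if word.isalpha()]


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Weight each term by its frequency in the document times log(N / document frequency)."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_related(index: int, docs: List[str]) -> List[int]:
    """Return the indices of the other documents, most similar first."""
    vectors = tfidf_vectors(docs)
    scored = [(cosine(vectors[index], vec), i)
              for i, vec in enumerate(vectors) if i != index]
    return [i for _, i in sorted(scored, reverse=True)]


# Made-up headlines, used only to exercise the functions.
docs = [
    "parliament debates new housing bill",
    "housing bill passes second reading in parliament",
    "football club signs new striker",
]
related = find_related(0, docs)
print(related)
```

Terms that appear in many documents get a low weight, so two articles are ranked as related mainly when they share terms that are rare elsewhere in the collection.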

“The current Juicer Demo is old and not ideal, so I also started working on a new demo app. This will be in Angular and will serve as a pretty facade for the Juicer, demonstrating the different querying capabilities it has. It’s possible that I will add API endpoints as part of this work, if it turns out that the demo could benefit from them.”

“On a more fun note, I deployed a Kibana interface to our Elasticsearch instance for showing pretty (and hopefully insightful) graphs. It’s yet another facet into our dataset, good for demoing stuff visually to relevant stakeholders and keeping an eye out for trends and patterns.”

There is a major piece of work planned for the weeks to come: “Planning and redesigning the Juicer infrastructure for its expansion. We want to have about 150-200 sources in the Juicer by the end of November, in order to get better coverage of news and be more representative (we have 51 sources at the moment). The current ingest and processing infrastructure is not scalable in the sense that we can’t easily scale out our worker instances when more sources are added. Also part of the scaling effort is migration to larger AWS instances and building a proper Elasticsearch cluster to handle a larger influx of data.”