The Juicer

Published: 18 September 2015

Aims

How might we support exploration and understanding of journalism at a global, meta level?

Outline

The Juicer is a news aggregation and content extraction API. It takes articles from the BBC and other news sites, automatically parses them and tags them with related DBpedia entities. The entities are grouped in four categories: people, places, organisations and things (everything that doesn't fall in the first three).
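As a rough illustration of the four-way grouping, the sketch below sorts a handful of tagged entities into people, places, organisations and things. The `Entity` class, the `kind` labels and the `group_entities` function are assumptions made for this example. They are not taken from the Juicer API, whose response format is not described here.

```python
from dataclasses import dataclass

# Hypothetical category names; the Juicer's real identifiers are not given here.
CATEGORIES = ("people", "places", "organisations", "things")

# Assumed mapping from an entity type label to one of the first three categories.
KIND_TO_CATEGORY = {
    "person": "people",
    "place": "places",
    "organisation": "organisations",
}


@dataclass(frozen=True)
class Entity:
    label: str  # human-readable name
    uri: str    # DBpedia resource URI
    kind: str   # assumed type label, e.g. "person"


def group_entities(entities):
    """Group entity labels into the four categories; anything else is a 'thing'."""
    groups = {category: [] for category in CATEGORIES}
    for entity in entities:
        category = KIND_TO_CATEGORY.get(entity.kind, "things")
        groups[category].append(entity.label)
    return groups


example = [
    Entity("Angela Merkel", "http://dbpedia.org/resource/Angela_Merkel", "person"),
    Entity("Greece", "http://dbpedia.org/resource/Greece", "place"),
    Entity("European Central Bank",
           "http://dbpedia.org/resource/European_Central_Bank", "organisation"),
    Entity("Euro", "http://dbpedia.org/resource/Euro", "currency"),
]
print(group_entities(example))
```

The fallback to "things" mirrors the description above: whatever does not fall in the first three categories ends up in the fourth.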

The Juicer API

Documentation for the Juicer API is publicly available at this link, although the API has been shut down since 2018.

Web interfaces

We built a beta web interface to the Juicer (no longer supported), which you could use if you were connected to a BBC network. Additionally, the WAT webapp used the Juicer API to generate graphical representations of which news organisations were covering what.

During its development, we worked to give everyone more ways to play with the data available in the Juicer, including a public search interface, trending topics and more jaw-dropping visualisations.

FAQs

BBC Juicer: What it is and what it is NOT.

BBC Juicer is a news aggregation 'pipeline'. It ingests news articles and extracts the best from them, just like a fruit juicer does. The BBC Juicer pipeline watches the RSS feeds of news outlets. When a new article is published on one of these RSS feeds, BBC Juicer scrapes the article, both raw text and metadata (e.g. date, time, title, news source). In the next step, BBC Juicer identifies and tags concepts mentioned in the article text, making them searchable and therefore useful for trend analysis. The Juicer API allows users to retrieve JSON representations of the news articles.
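The first stage of such a pipeline, spotting new items on an RSS feed and capturing their metadata, might look roughly like the sketch below. It uses only the Python standard library and an inline sample feed. The function name `new_items`, the field names in the output and the example.org URLs are assumptions for illustration and do not describe the Juicer's actual code or JSON schema.

```python
import json
import xml.etree.ElementTree as ET

# Inline sample feed so the example runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>https://example.org/news/1</link>
      <pubDate>Fri, 18 Sep 2015 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.org/news/2</link>
      <pubDate>Fri, 18 Sep 2015 10:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_links):
    """Return metadata for feed items whose link has not been seen before.

    Links of returned items are added to seen_links as a side effect.
    """
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        if not link or link in seen_links:
            continue
        seen_links.add(link)
        fresh.append({
            "title": item.findtext("title", default="").strip(),
            "url": link,
            "published": item.findtext("pubDate", default="").strip(),
        })
    return fresh


# Pretend the first story was already ingested on an earlier poll.
seen = {"https://example.org/news/1"}
articles = new_items(SAMPLE_FEED, seen)
print(json.dumps(articles, indent=2))
```

A real pipeline would then fetch each new URL, extract the article text and pass it to the tagging step. Those stages are omitted here.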

Which news sources did you include and why?

At the moment, BBC Juicer monitors around 850 RSS feeds from international, national and local news outlets. We started with British and other English language sources and are expanding into other languages. Importantly, the list of sources does not claim to be comprehensive, nor does it claim to provide a representative set of sources.

We do NOT ingest content that is behind a paywall. Any news outlet that does not provide news content for free will not be ingested in BBC Juicer. However, if you wish the BBC to remove your content from BBC Juicer, please get in touch: newslabs@bbc.co.uk. Please provide details of the content to be removed, including the date of any article or headline.

If you work for a newspaper or other news outlet that provides free online content and would like it to be part of BBC Juicer, or if you are planning a project that requires content from other news sources, please get in touch: newslabs@bbc.co.uk. Please provide details of how you will provide open access to your technology. We will look into including your source.

Where are the tags to each article coming from?

Tagging is not manual or editorially controlled; it is an algorithmic process. The algorithm analyses the raw text of the article to find concepts, such as people, places and organisations, that appear in the text. Whether a term is assigned a tag depends on the context in which it appears within the article and on confidence thresholds.

The exact wording of a tag doesn't need to appear in the text. An article may, for example, be tagged with Greece because it contains either the word Greece or the phrase Hellenic Republic.
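The two ideas above, several surface forms mapping to one tag and a confidence threshold deciding whether a tag is assigned, can be sketched with a toy lookup table. This is a deliberately simplified assumption: the real engine derives its confidence from the context of each mention, whereas the scores, the `SURFACE_FORMS` table and the `tag_text` function here are fixed, made-up values for illustration only.

```python
import re

# Toy lookup: surface form -> (DBpedia URI, assumed confidence score).
# Both the entries and the scores are invented for this example.
SURFACE_FORMS = {
    "greece": ("http://dbpedia.org/resource/Greece", 0.95),
    "hellenic republic": ("http://dbpedia.org/resource/Greece", 0.99),
    "athens": ("http://dbpedia.org/resource/Athens", 0.80),
    "apple": ("http://dbpedia.org/resource/Apple_Inc.", 0.40),
}


def tag_text(text, threshold=0.5):
    """Return the URIs whose surface forms occur in text with enough confidence."""
    lowered = text.lower()
    tags = set()
    for form, (uri, confidence) in SURFACE_FORMS.items():
        if confidence < threshold:
            continue  # too ambiguous to tag at this threshold
        # Match whole words only, so a form is not found inside a longer word.
        if re.search(rf"\b{re.escape(form)}\b", lowered):
            tags.add(uri)
    return tags


print(tag_text("The Hellenic Republic held elections on Sunday."))
print(tag_text("An apple a day keeps the doctor away."))
print(tag_text("An apple a day keeps the doctor away.", threshold=0.3))
```

Lowering the threshold yields more tags at the cost of more false assignments, which is the trade-off the confidence thresholds control.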

The result of the 'concept extraction' depends on the current version of Wikipedia that underlies BBC Juicer (currently updated once a month) and on a number of parameters used to adjust the engine. All this is still very much in development. It is an experimental project and we don't take responsibility for incomplete or false assignment of tags.

I found an article on Juicer the other day and now the link is broken. Why?

If an article was amended by the original news source, this will not trigger a re-tagging within BBC Juicer unless the article was republished on RSS. If the article is no longer accessible on the publisher's website, the link to the original source will break. BBC Juicer will however keep title, text, metadata and tags assigned to the original article.
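The behaviour described above can be modelled with a small in-memory store: an article is only (re-)tagged when it arrives via RSS, and the stored copy stays available whatever happens to the original page. The `ArticleStore` class, its method names and the `toy_tagger` function are hypothetical and exist only to illustrate the logic. They say nothing about how BBC Juicer is actually implemented.

```python
class ArticleStore:
    """Toy in-memory store keyed by article URL."""

    def __init__(self, tagger):
        self._tagger = tagger  # callable: text -> set of tags
        self._articles = {}

    def ingest_from_rss(self, url, title, text):
        """Store and tag an article; called only when it (re)appears on RSS."""
        self._articles[url] = {
            "title": title,
            "text": text,
            "tags": self._tagger(text),
        }

    def get(self, url):
        """Return the stored copy, whether or not the URL still resolves."""
        return self._articles.get(url)


def toy_tagger(text):
    """Stand-in tagger: tag a name if it appears verbatim in the text."""
    return {name for name in ("Greece", "Athens") if name in text}


store = ArticleStore(toy_tagger)
url = "https://example.org/news/1"
store.ingest_from_rss(url, "Talks continue", "Talks continue in Greece.")

# If the publisher edits the page without republishing it on RSS, nothing
# reaches the store, so the stored text and tags stay as they were.
print(store.get(url)["tags"])

# If the amended article is republished on RSS, it is ingested and tagged again.
store.ingest_from_rss(url, "Talks continue", "Talks continue in Athens, Greece.")
print(sorted(store.get(url)["tags"]))
```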

I have more questions, I have ideas and I want to contribute!

Are you a student or working in research and would like to use BBC Juicer for a project?

If you are looking for a data source for an academic project, we are very happy for you to use BBC Juicer. You can access BBC Juicer through the simple graphical user interface we provide, together with a key. Depending on your project you may quickly want to do more. You can access the full database of articles collected since autumn 2014 via the BBC Juicer API. BBC Juicer is free to use for academic projects, but we ask you to attribute BBC News Labs and we would be happy to consider writing about the results of your research on our BBC News Labs Blog.

Are you a journalist and would like to use BBC Juicer for your daily work?

We are actively exploring, through hackathons, how BBC Juicer may be useful for a journalist in their daily work. If you have a bit of time on your hands to get involved with us, and discuss how BBC Juicer may be useful to you or your newsroom, we would love to hear from you: newslabs@bbc.co.uk

Are you a UX designer, front end or back end developer?

BBC Juicer is still in development and we are always open to bringing in people who can contribute their expertise to develop the system further, both on the back end and in developing new front ends that speak to journalists and other user groups. Get in touch: newslabs@bbc.co.uk.

I think something isn't working here. Who do I tell?

If you have a question about BBC Juicer, experience problems using it or think there is something in there that should not be, please do get in touch with us via: newslabs@bbc.co.uk

Do I have to pay to use BBC Juicer?

You can access BBC Juicer for free provided that your use is non-commercial and you comply with all applicable terms of use.

Results

The BBC Juicer was very popular at most of our hacks for many years.
It provided the inspiration for a successful application outside the BBC for Google Digital News Initiative funding.
Since we failed to find any other sustainable uses, we shut it down in 2018.