News doesn’t break in just one language, which is why the BBC is a partner on the pan-European Summa consortium to help develop better tools for multilingual media monitoring.
Financed by the EU’s Horizon 2020 fund, Scalable Understanding of Multilingual MediA (Summa) is a three-year project exploring how automation and algorithms can assist journalists in monitoring global news.
The idea for a global monitoring tool for journalists originated at a language technology #newsHACK we hosted in 2014. Since then, the Summa consortium has been working to marry pioneering technology from research institutions with tooling to match newsroom needs and workflows. The final product will be a monitoring platform with features that include:
- story clustering based on global news trends;
- entity extraction;
- automated transcription and translation of original news and broadcast items into English;
- automated summarisation of individual news items.
This week, we hosted a hackathon in collaboration with the Summa partners and BBC Connected Studio to explore new applications of the project’s cutting-edge language technology in the newsroom. Participants were asked to build prototypes using the Summa platform’s database of global news stories that could help editorial teams make sense of the data. Here’s what they built and where we go from here.
We asked teams to submit entries in one of three categories:
- Best editorial tool;
- Best audience-facing experience;
- Surprise Us!
Winners were selected by a panel of four judges: Paul Bradshaw, course leader and founder of the MA in Data Journalism and MA in Multimedia and Mobile Journalism at Birmingham City University; Federica Cherubini, head of knowledge sharing at Condé Nast International; Steve Herrmann, editorial director of BBC Monitoring; and Tom Wills, data journalism editor at the Times and Sunday Times.
Best editorial tool: Deutsche Welle’s Summa Slack bot
This research tool allows journalists to search the Summa database for recently published articles on a given topic by querying a Slack bot. Users can ask the bot for information on a keyword they’re interested in, as well as the language of the media they want to see. The bot then returns an automatically generated graph showing the number of stories on the topic. Users can also request to see automatically generated summaries of the most recently published stories.
The team from Deutsche Welle presenting their Slack bot
Judges’ note: The judging panel noted that the Summa Slack bot felt like a well-defined product that could be easily integrated into journalists’ workflows, and that it was built with team collaboration in mind.
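To give a sense of what could sit behind the bot's story-count graph, here is a minimal sketch of counting stories per day for a keyword and language. The record fields (`title`, `language`, `published`) and the sample stories are invented for illustration; this post does not describe the Summa data format or how the Deutsche Welle team built their bot.

```python
from collections import Counter
from datetime import date

# Invented sample records; the field names are assumptions, not the Summa schema.
STORIES = [
    {"title": "Election date announced", "language": "en", "published": date(2017, 11, 6)},
    {"title": "Parties react to election call", "language": "en", "published": date(2017, 11, 6)},
    {"title": "Wahltermin steht fest", "language": "de", "published": date(2017, 11, 6)},
    {"title": "Election campaign begins", "language": "en", "published": date(2017, 11, 7)},
    {"title": "Markets open higher", "language": "en", "published": date(2017, 11, 7)},
]


def stories_per_day(stories, keyword, language):
    """Count stories per publication date whose title mentions the keyword."""
    needle = keyword.lower()
    counts = Counter(
        story["published"]
        for story in stories
        if story["language"] == language and needle in story["title"].lower()
    )
    # Sort by date so the result can be plotted directly as a time series.
    return dict(sorted(counts.items()))


daily_counts = stories_per_day(STORIES, "election", "en")
```

A bot could pass a result like `daily_counts` to any charting library to draw the graph it sends back to the channel.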
Best audience-facing experience: Q-Tee-am’s video bot
Citing research showing that audiences prefer to consume news via video rather than lengthy text, this team from the Qatar Computing Research Institute designed a tool to produce snappy web videos from story summaries on a given topic. “To really know what’s happening, you have to read a lot, a lot, a lot,” they told the judges. Their video bot instead pulls images and text from the Summa database to automatically create four-minute captioned videos for the web. It also integrates text-to-speech technology to create voice-overs using story summaries from Summa.
Judges’ note: Although there are tools available to help journalists assemble video content in a similar manner, the judges noted that an audience-facing video bot is a new proposition, making it possible for users to create and watch their own news summary on topics they’re interested in.
Surprise Us!: Factmata — Fake newsworthy?
This team integrated data from Summa with Factmata datasets to explore news stories’ political bias. Using Elasticsearch to look for overlap between the Summa data and their own, they built an interface with a feed of articles on the left-hand side and charts on the right showing:
- where stories from the Summa database fell on a political bias spectrum, based on which media outlets published similar stories;
- which stories from the Summa database were also reported by news websites known to spread misinformation or conspiracy theories.
The tool could help users explore content with different political leanings, the team said, as well as highlight content that may have originated in less-than-trustworthy places.
Judges’ note: The judges said that this tool could encourage journalists to address misinformation when writing their own pieces by bringing the fake news ecosystem to their attention.
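One simple way to read the two charts described above is as a pair of calculations per story: an average of the bias scores of the outlets that ran similar stories, and a check against a list of flagged sites. The sketch below assumes exactly that. The outlet names, scores and flagged list are invented, and the team's actual scoring method is not described in this post.

```python
from statistics import mean

# Invented outlets and scores for illustration (-1.0 = left, +1.0 = right).
OUTLET_BIAS = {"outlet-a": -0.5, "outlet-b": -0.25, "outlet-c": 0.5, "outlet-d": 0.75}
# Invented stand-in for a list of sites known to spread misinformation.
FLAGGED_OUTLETS = {"outlet-d"}


def place_story(similar_story_outlets):
    """Return (bias score, flagged) based on the outlets that ran similar stories."""
    scores = [OUTLET_BIAS[o] for o in similar_story_outlets if o in OUTLET_BIAS]
    # With no scored outlets there is nothing to average, so report None.
    bias = mean(scores) if scores else None
    flagged = any(o in FLAGGED_OUTLETS for o in similar_story_outlets)
    return bias, flagged


left_leaning = place_story(["outlet-a", "outlet-b"])
right_and_flagged = place_story(["outlet-c", "outlet-d"])
```

Averaging is only one possible choice here; a weighted score or a distribution across outlets would fit the same description.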
DPA-Newslab: Social Index for breaking the sourcing bubble
This trio from the German Press Agency built a prototype for helping journalists find sources outside their usual monitoring platforms. Integrating the Summa data with BuzzRank, the DPA’s internal monitoring tool, the team combined social and traditional media monitoring in an interface that displays the popularity of keywords over time. Journalists can explore trending stories by interacting with related tags that the interface displays, as well as a bar graph showing a topic’s popularity across different news sources over time.
DPA-Newslab presents Social Index to the judges.
Judges’ note: The judges applauded the integration of social and traditional media, as well as the possibility of exploring how topics move from traditional to social platforms.
NRK: Exploring trends
This team from Norway’s public broadcaster developed a live dashboard combining a timeline filter, heat map and word cloud to display how keywords are covered over time. By moving the timeline selector at the top of the interface, users can watch how media coverage ebbs and flows across the world — while also watching for new phrases to surface in a corresponding word cloud. The team demoed their prototype using the keyword search “Catalonia,” showing how coverage increased globally at the time of the independence vote and how the terms “Barcelona” and “Madrid” started appearing in coverage at around the same time. Colour-coding in the word cloud also gives users a visual cue that coverage of a topic increased from the previous reporting period.
Judges’ note: Although other teams also developed visualisations marrying heat maps and word clouds to compare topic trends over time, the judges thought this dashboard was particularly well-executed, and especially liked the time scale functionality.
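The colour-coding idea, flagging terms whose coverage rose since the previous reporting period and spotting newly surfaced phrases, can be sketched as a comparison of term counts between two periods. This is an illustrative assumption about the logic, not NRK's code, and the term lists below are invented.

```python
from collections import Counter


def term_trends(previous_terms, current_terms):
    """Label each term in the current period as 'new', 'up', 'down' or 'flat'."""
    before, now = Counter(previous_terms), Counter(current_terms)
    trends = {}
    for term, count in now.items():
        if term not in before:
            trends[term] = "new"    # did not appear in the previous period
        elif count > before[term]:
            trends[term] = "up"     # more mentions than before
        elif count < before[term]:
            trends[term] = "down"   # fewer mentions than before
        else:
            trends[term] = "flat"
    return trends


# Invented term lists for two consecutive reporting periods.
last_period = ["catalonia", "vote", "vote"]
this_period = ["catalonia", "catalonia", "vote", "barcelona", "madrid"]
trends = term_trends(last_period, this_period)
```

A word cloud could then map each label to a colour, so that "new" and "up" terms stand out as the timeline selector moves.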
Sheffield FEVER: Fact Extraction and VERification
The team from Sheffield University built a fact-checking tool for matching claims against Wikipedia data. The prototype works in two stages: first, by extracting “evidence” about a claim from Wikipedia and second, by reasoning about whether the evidence supports or refutes the original claim. The team did not integrate the Summa data, but said that going forward, it could be an additional way for journalists to compare stories and explore how claims have changed over time in media coverage.
Dataflo: Summa + DBpedia
Integrating the Summa database with DBpedia, the Dataflo duo explored new ways of labelling and tagging the story clusters that the Summa platform generates. The team first extracted nouns from Summa story clusters, then matched them against DBpedia entities and finally displayed a combination of the data in word clouds configured to highlight the phrases of most interest. They said that this technology could be used to help journalists generate better tags, both for the Summa clusters and for individual stories, based on community-sourced Wikipedia content.
Dataflo presents their demo to the judges
UCL Team Truthiness: alternative fact engine
What are other organisations saying about the claims in the story you’re reading? Participants from UCL built a tool for exploring other outlets’ “alternative facts”. Users search the Truthiness interface for the most recent articles on a given topic, and are taken to a story view when they select a headline. A behind-the-scenes question engine generates relevant queries based on claims it detects in the article and provides the answers it finds in other media sources. The team said the prototype could encourage readers to seek out diversity when they read the news by showing that there are claims other than the ones they’re currently viewing.
Decsis: dashboard views of Summa data
This two-person team from Portugal developed a dashboard with two views of the Summa data: a broad-ranging analytical view and a document-based filter view. The first gives users the power to explore news by numbers, showing popular keywords and publication languages over a specified period of time. The second allows users to filter their search based on the type of media they want to see, as well as additional fields included in the Summa data, such as publication language and date of publication.
Edinburghers: timeline sentiment analysis
This team of political researchers and data scientists built three different visual analyses of the Summa data:
- an experimental timeline showing sentiment analysis of news articles on a given topic;
- geolocated news stories over time;
- policy agenda analysis over time, using Summa data and data from the “UK Topics Codebook”.
Sheffield Team 1: multimodal story clustering
This team from Sheffield University explored story clustering based on images rather than text. Using visual representations of video content, the team showed how individual frames could be grouped to give a non-textual representation of news content. The team said it could explore integrating the frame clustering with text clustering in the future, but did not assess the usefulness of the image clustering during the hack.
The exciting ideas generated in this hack event will allow us to refine and improve our own ideas around how a journalist might use the Summa platform, and in turn these ideas will shape the project’s ongoing development during 2018. The BBC will continue to support the project as an active partner and will be organising a second hackathon on the Summa platform in Europe next year.