Stories by numbers: Experimenting with semi-automated journalism
You might have seen some stories on the BBC News website that say they were generated using "some automation". BBC News Labs' Roo Hutton explains how we're experimenting with semi-automated journalism to bring you even more relevant local stories.
The BBC's network of journalists brings millions of people the news from their region across TV, BBC Local Radio, and online every day. To get the best value for licence fee payers, it's important that journalists cover stories that are relevant to our audiences - and we know that those stories tend to be local. How can we cover more of these stories, and is generating them automatically part of the answer?
Thanks in part to social media, people have a greater understanding of what's going on in their local community, and there is an expectation that local news media should match the frequency and specificity of this coverage. It's well known that sustaining local news has become a difficult challenge for the industry, with local radio and newspapers struggling to adapt to the expectations of this new audience. How can we satisfy the audience's thirst for quality content in an environment with limited resources?
In BBC News Labs, we want to help bring rich, data-driven storytelling to our local news teams without substantially increasing their workload. For the last few months, we've been working with colleagues in BBC English Regions on a project called Salco (Semi-Automated Local Content). Our team of two developers, myself and Tamsin Green, developed a pipeline that can generate over 100 unique stories every month, allowing our audience to learn about their local hospital's A&E performance right on the Live pages where they get local news every day. This is a new way for the BBC to tell stories, and it's the kind of agile collaboration with our editorial partners that News Labs does best.
The BBC is not the first news organisation to adopt automated journalism. The Associated Press have been generating stories based on quarterly earnings reports since 2014, and within minutes of an earthquake, the LA Times' QuakeBot will have a write-up. The Press Association's RADAR project has been generating thousands of data-driven stories for regional media outlets in the UK to consume. Salco is just the first step in the BBC's thoughtful experimentation with automated journalism, and we've benefited from seeing what our predecessors have done. Our approach differs, however, as we are able to generate stories enriched with graphics and bring them to relevant audiences through our familiar online local news offering.
To pull this off took a thoughtful mix of editorial and technical innovation, with some difficult (but interesting!) questions along the way:
- Is automated journalism editorially acceptable to the BBC?
- Will our journalists and editors be comfortable publishing articles that they didn't write themselves?
- Will our audience be happy to read stories that were generated by a machine?
And from a technical point of view, as this is a completely different way of preparing stories:
- How do we integrate with the BBC's existing publishing systems, while still allowing for the sort of journalistic oversight our editors expect?
Salco combines data processing, story generation and editorial approval into a simple "one click" process that takes raw data and automatically generates rich local stories based on templates designed by journalists. That simplicity, however, masks a complex pipeline consisting of five parts:
- process the data we get from the NHS and extract the bits we're interested in
- produce a text story for each NHS trust based on a template prepared by a journalist
- generate a graphic for each story that summarises the data in the BBC's house style
- preview each story so a journalist can verify and approve them
- publish each story to the appropriate location topic pages
Processing the data
[Image: Visual Journalism's NHS tracker analyses A&E, planned operations, cancer and mental health care performance.]
The NHS releases a number of datasets illustrating how the health service performs each month. The BBC already uses this data to power the popular NHS tracker, which lets readers compare their local NHS trust's performance with the rest of the country.
Our friends in Visual Journalism have written Python scripts that download and process this data for the tracker each month. These scripts download years of historic performance data and account for the differences in how each nation's health service structures its data and in the targets each has set. The result is a refined dataset consisting of:
- the big numbers, such as the percentage of patients seen within 4 hours
- additional analysis based on the historic data, such as the month a target was last met
- comparative analysis across the dataset, such as the ranking of each trust
- editorial context drawn from other sources, such as the colloquial name for a local hospital
We built on these scripts, modifying them to work in the BBC's cloud infrastructure and focusing on A&E data for England. The dataset that the scripts generate bridges the raw data provided by the NHS and the prose of a final story, and we refer to it as the story model. Rather than being expressed purely as numbers, it includes data interpreted as natural language, such as "35 of 131 trusts" and "Not met since trust was formed in 2017", which can be embedded directly in a story.
By the end of this process, we have a dataset where each row represents a potential story, and the columns represent the full context a journalist might need to draw on. This dataset is stored in an Amazon S3 bucket to be picked up by the next stage in the process.
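To make the idea of a story model more concrete, here is a minimal sketch of how one row per trust might be built, with a ranking expressed as ready-to-embed text. It is not the Visual Journalism code: the `TrustRecord` type, the field names, the sample figures and the default target value are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TrustRecord:
    """Hypothetical input row: one trust's headline figure for the month."""
    name: str
    pct_seen_in_4h: float  # percentage of A&E patients seen within 4 hours


def build_story_model(records: list[TrustRecord], target_pct: float = 95.0) -> list[dict]:
    """Return one story-model row per trust, best performer first.

    target_pct is an illustrative default, not a value taken from the article.
    Ties are not handled specially; tied trusts keep their input order.
    """
    ranked = sorted(records, key=lambda r: r.pct_seen_in_4h, reverse=True)
    total = len(ranked)
    rows = []
    for position, record in enumerate(ranked, start=1):
        rows.append({
            "trust": record.name,
            "pct_seen_in_4h": record.pct_seen_in_4h,
            "target_pct": target_pct,
            "target_met": record.pct_seen_in_4h >= target_pct,
            # Natural-language field that a template can embed as-is.
            "rank_text": f"{position} of {total} trusts",
        })
    return rows


# Made-up sample data, for illustration only.
records = [
    TrustRecord("Trust A", 88.2),
    TrustRecord("Trust B", 96.1),
    TrustRecord("Trust C", 79.4),
]
story_model = build_story_model(records)
```

Each dictionary corresponds to one potential story, and text fields such as `rank_text` spare the template from having to assemble that phrasing itself.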
Automatically turning data into prose is known as natural language generation (NLG). For this project, we turned to Arria NLG Studio, a third-party tool that allows journalists to produce the complex templates needed to transform data into news articles, then try them out on sample data to see how well the different output stories read. This allows for an iterative process where the journalist can see how the story improves by enhancing the template.
Unlike conventional story writing, the journalist isn't just writing in response to the data in front of them on a given day, but anticipating the range of stories that could come from wildly different results in the data. Some examples include:
- an NHS trust reaching its target for the first time in years
- a hospital maintaining its unimpeachable record
- a sudden collapse in performance after a winter flu outbreak
This is a particularly difficult task, and calls on the creativity of the journalist to produce a rich template where the gaps aren't just plugged with simple numbers and percentages from the story model. Working with our colleagues in BBC English Regions' digital team, and in the East of England where we are piloting the project, we studied previous BBC News articles on A&E performance to identify repeatable structures and the narrative threads used to tell such stories. From that, we discovered new fields to add to our story model described above, and further fleshed out the template to leverage them.
The processed story model is downloaded from its S3 bucket and passed to Arria's API, which generates a story for each row in the data using this template. The resulting stories are then written to a MySQL database via Amazon's Relational Database Service (RDS) so that they can be shown in our editorial dashboard.
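The branching a template needs can be sketched in plain Python. This is a stand-in for illustration only: it does not use Arria NLG Studio or its API, and the field names (`target_met_previous_month`, `rank_text` and so on) and the wording are assumptions rather than the templates our journalists wrote.

```python
def render_story(row: dict) -> str:
    """Render one sentence pair from a story-model row (hypothetical fields)."""
    pct = f"{row['pct_seen_in_4h']:.1f}%"
    opening = f"{row['trust']} saw {pct} of A&E patients within four hours last month"

    # The template has to anticipate different outcomes, not just fill in numbers.
    if row["target_met"] and not row["target_met_previous_month"]:
        verdict = "meeting the target after missing it the month before"
    elif row["target_met"]:
        verdict = "meeting the target again"
    else:
        verdict = f"missing the {row['target_pct']:.0f}% target"

    return f"{opening}, {verdict}. The trust ranked {row['rank_text']}."


# Made-up row, for illustration only.
row = {
    "trust": "Example Hospitals NHS Trust",
    "pct_seen_in_4h": 87.5,
    "target_pct": 95.0,
    "target_met": False,
    "target_met_previous_month": False,
    "rank_text": "35 of 131 trusts",
}
story = render_story(row)
```

A real template would carry many more branches and far richer language, but the principle is the same: one template, applied to every row, yields a different story depending on what the data says.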
[Image: An example of a datapic generated for a story.]
The In-Depth Toolkit (IDT) is the BBC's tool for adding data visualisations and other graphics to news stories. Normally, a journalist would prepare an individual graphic when needed for a particular story, but clearly this approach wouldn't scale if we're generating hundreds of stories. Working with the Data Presentation team who maintain IDT, we created a system that would populate a template describing the layout of a "datapic" graphic, which emphasises the most impactful numbers in a story. This is then rendered as an image when the final story is published. This has been a particularly exciting development, as it has shown that IDT's existing infrastructure can be used in this novel way to automatically generate templated graphics.
For each story, we generate a JSON representation of a graphic. For added variety, our system chooses from a pool of stock photos to include in the graphic. We wrote a Node.js Lambda that checks that the graphic is valid and stores it in IDT's infrastructure, giving us a unique identifier we can use to embed the graphic in the final story.
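As a rough illustration of this step, the sketch below builds a JSON-serialisable description of a graphic, picks a stock photo, and checks the result before it would be stored. It is written in Python to match the data-processing scripts rather than Node.js, and the schema, field names and photo filenames are hypothetical: IDT's real format is not described in this post.

```python
import json
import random

# Hypothetical pool of stock photos and a hypothetical schema.
STOCK_PHOTOS = ["ambulance.jpg", "ward.jpg", "waiting-room.jpg"]
REQUIRED_FIELDS = ("headline", "big_number", "caption", "photo")


def build_datapic(row: dict, rng: random.Random) -> dict:
    """Describe a datapic for one story-model row."""
    return {
        "headline": row["trust"],
        "big_number": f"{row['pct_seen_in_4h']:.1f}%",
        "caption": "of A&E patients seen within four hours",
        "photo": rng.choice(STOCK_PHOTOS),  # varied so the graphics don't all look alike
    }


def validate_datapic(graphic: dict) -> list[str]:
    """Return a list of problems; an empty list means the graphic passed."""
    errors = []
    for name in REQUIRED_FIELDS:
        value = graphic.get(name)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"missing or empty field: {name}")
    if graphic.get("photo") not in STOCK_PHOTOS:
        errors.append("photo is not in the stock pool")
    return errors


row = {"trust": "Example Hospitals NHS Trust", "pct_seen_in_4h": 87.5}
graphic = build_datapic(row, random.Random(0))
problems = validate_datapic(graphic)
payload = json.dumps(graphic)  # what would be handed on for storage
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a graphic in a single pass.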
We built a simple dashboard that lists the stories that have been generated, and allows them to be sent to our publishing platform, Vivo. This enables journalists to check the quality and accuracy of the stories which our system has generated. The dashboard, written as a React web app, renders information for all the stories stored in our RDS database. When a journalist is ready to publish the stories, we make a series of calls to the BBC's Vivo API to create story drafts in the appropriate live stream for the region.
[Image: Journalists can check each story in Vivo before it is published.]
When a user enters their postcode on the BBC News website, they are shown a relevant stream of local news within a customisable radius, based on location tags embedded in stories. This is driven by the BBC's Vivo platform, which allows journalists to curate streams combining short text updates with relevant images and video, and embedded content such as tweets.
Salco generates a draft Vivo post for each generated story, combining the text from Arria and our IDT datapic. It then automatically tags the story with the location of the relevant NHS trust so it can be shown to the right audience. The journalist responsible for curating that region's live stream can then publish the stories as they arrive, although we expect this process to be completely automated once we have established confidence in the quality of the stories that Salco produces. Our stories then appear in the streams for anyone who lives near a given hospital, while preventing the audience from being overwhelmed with dozens of similar stories from across their region.
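The shape of this last step might look something like the sketch below: the story text and the datapic identifier are combined into a draft, the draft is tagged with a location, and nothing is marked as published until a journalist approves it. The `DraftPost` type, its fields and the sample values are hypothetical, and no real Vivo API calls are shown.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DraftPost:
    """Hypothetical draft post; not Vivo's actual data model."""
    body: str
    datapic_id: str
    location_tags: tuple[str, ...]
    status: str = "draft"


def build_draft(story_text: str, datapic_id: str, trust_location: str) -> DraftPost:
    """Combine the generated text and graphic, tagged with the trust's location."""
    return DraftPost(
        body=story_text,
        datapic_id=datapic_id,
        location_tags=(trust_location,),
    )


def approve(post: DraftPost) -> DraftPost:
    """Return a published copy; the original draft is left unchanged."""
    return replace(post, status="published")


# Made-up values, for illustration only.
draft = build_draft(
    story_text="Example Hospitals NHS Trust missed its A&E target last month.",
    datapic_id="datapic-123",
    trust_location="Exampleshire",
)
published = approve(draft)
```

Keeping approval as an explicit, separate step mirrors the editorial oversight described above: generation produces drafts, and only a journalist's decision changes their status.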
What we've learned
With this trial, we hoped to build the infrastructure needed - both technically and editorially - to support the BBC's first steps into automated storytelling. The BBC is increasingly willing to experiment with new ways of telling stories, which was shown by the editorial and technical support we received from a number of teams. Writing good templates for automated journalism is not a trivial task, and it can be uncomfortable for journalists to see their craft deconstructed into algorithmically-assembled blocks. The perception of "the machines" writing stories is sometimes derided as "robo-journalism", and we wanted to share some of our thoughts about how to sensitively adopt automation in journalism, so that it enhances, rather than replaces, journalistic effort.
In this article we've mostly discussed the technical infrastructure we needed to put in place, but we have always considered the challenge of automated journalism to be fundamentally editorial.
No amount of automation can replace the skill of structuring and telling good stories, and we think of Salco as enabling this craft, rather than replacing it.
Getting this right means leaning on the domain knowledge of journalists and embracing a willingness to do things differently, whether through new technology or a new editorial mindset. A challenge of adopting automation in journalism is the significant up-front effort needed to adapt to new tools and workflows and to build robust templates. However, this is offset by the value the template continues to yield as an asset long after the journalists responsible are back to working on other stories.
In this project, after spending some time with journalists to familiarise them with our tools, we found it worked best to leave them to write expressive templates themselves. There are several reasons for doing this: most importantly, it means the stories we produce have the same voice and personality as the BBC's other stories, rather than feeling like a dry re-telling of statistics. It's also consistent with our values. Automated journalism isn't about replacing journalists or making them obsolete. It's about giving them the power to tell a greater range of stories - whether they are directly publishing the stories we generate, or using them as the starting point to tell their own stories - while saving them the time otherwise needed to analyse the underlying data.
This initial pilot of Salco has been a success, fulfilling our initial aim to create the tools and processes needed to automate storytelling on a limited scale on the BBC News website.
Now that we have more experience in telling stories this way, the next step is to better understand how this proposition meets the needs of our audience. We know that our audience values local stories, but is this the way they want to read them? It's still early days for this project, and we hope to expand the kinds of stories we tell this way, and the richness with which we tell them. In a future blog, we hope to talk more about our plans, and the results of our audience's first experience with these stories.