Scott Klein

Visualizing the Truth

Interview by John Grimwade

Perhaps there has never been a more important time for independent, non-partisan journalism. In the divisive U.S. political climate, where viewpoints have become so polarized, ProPublica has a vital role to play. You are probably already familiar with some of their excellent online projects, which bring together all forms of visual storytelling and are built as integral parts of ProPublica investigations.

What is ProPublica? This nonprofit organization, which is funded mainly through donations, was founded in 2007 and began publishing in 2008. The team of more than 75 journalists investigates important topics in the public interest, and does so in a completely impartial way, without the influence of commercial pressures. This is an old-fashioned (but essential) journalism-school ideal. And ProPublica is using very modern tools to achieve its goals. In 2010, they became the first online news source to win a Pulitzer Prize. They have since won three more.

Scott Klein leads a group of journalist/programmers, and is one of the key people in U.S. visual journalism. Recently I asked him to give us some insight into the ProPublica approach.

How did you get into the news business?

I come from a publishing technology background, but I don’t come from a computer science background. I studied 19th-century British religious poetry in college. When I was young I was a generalist and a nerd. I wanted to work with words, so I found my way into news through technology. Early in my career, I spent 10 years building and rebuilding the website of The Nation, a much-revered and long-lived progressive weekly magazine. I had a variety of very impressive-sounding titles, but the work I was doing is now called “product management.”

The Nation was founded by abolitionists in 1865, and prides itself on being the oldest continuously published weekly magazine in the country. It wasn’t always easy to build new things there, as there was an explicit sense among managers that we had to be careful not to be the generation that killed an icon. We even talked about being the “temporary stewards of a historical artifact.”

Why did you join ProPublica?

When I left The Nation to come to ProPublica in early 2008, I was leaving a venerated and long-lived publication for one that hadn’t even really been born yet. The chance to start something new at a place whose whole reason to exist was an experiment was exhilarating.

The news industry in 2008 was in terrible trouble. There were so many layoffs that there was a blog dedicated to covering them. It was called “Paper Cuts.” There were a lot of hopes pinned to ProPublica, and once it was announced I had taken a job there, I started getting offers of advice from people who were rooting for it to succeed. One was Elizabeth Osder, who had introduced me to ProPublica and who taught me how to do real-world, no-bullshit product management. I also spoke to Zephyr Teachout, who told me to be on the lookout for two trends, both then cutting-edge: One was bringing readers into journalism through user-generated content and crowdsourcing, the other was the notion that open data and computation were going to grow quickly, and that ProPublica had an opportunity to capitalize on using data in new ways, especially as we didn’t have a print newspaper driving our decisions on form and function.

How did you get started?

So when we started, my team played many roles. We built the website, ran daily production, and created multimedia resources around our biggest projects. That included things like a small explainer about fracking in New York State. We also did a static graphic that used a jumble of proportional circles to compare the size of various historical government bailouts. Both went viral, and my team was given the task of building more graphics that might expand our audience while keeping our core journalistic mission in mind.

Were you looking for new digital storytelling solutions?

Definitely. We knew that we’d have a chance and even a responsibility to think differently about how we approached our storytelling. Before the site really launched, I met Aron Pilhofer at The New York Times, who at the time was managing a team of “journalist/developers” who built real software products to help tell news stories. On the day I met him, his team was launching a project to display documents that Hillary Clinton’s presidential campaign released showing her White House schedule when she was the First Lady. While other newsrooms were simply publishing downloadable PDFs of the document dump, Aron’s team had created a gorgeous, fully searchable, browsable interface. Thinking that we could use it for stories that used long government documents as evidence, I asked Aron if he’d be willing to share the code with ProPublica.

Aron didn’t just say yes. He and other people at the Times committed to open-sourcing the document viewer. That software eventually became the foundation of DocumentCloud, which Aron and I and a few others founded in 2009.

Aron did something else in ProPublica’s first year: He sponsored a week-long intensive class in programming Ruby on Rails. I attended that, as did my teammate Dan Nguyen (and my future teammate, Jeff Larson).

Shortly after that class, Dan was creating a Rails app using some data he had scraped from a pharmaceutical website showing payments to doctors for things like speaking, consulting, etc. He published a blog post about it. My colleague Charlie Ornstein ran across the newsroom when he read the post and asked Dan if he could do the same for six other pharma companies that had made similar disclosures. Dan agreed, thinking it would be easy. About eight months later we had the first version of Dollars for Docs. If our earlier graphics had gone viral, Dollars for Docs was an absolute supernova. It melted down our servers, and once everything had stabilized it became the most popular thing on our website almost every day.

With lots of support from ProPublica, we were able to turn these successes into further successes by staffing up around these interactive databases, which Brian Boyer and I decided to call “news applications.”

Scott Klein working at his desk.

In this era of “fake news,” non-profit investigative journalism has never been more important. How do infographics support ProPublica’s mission?

ProPublica’s mission is to use “the moral force of investigative journalism to spur reform through the sustained spotlighting of wrongdoing.” Our whole newsroom thinks and talks about real-world impact pretty much every day. Our team is no exception to that, though we have given ourselves a team mission that feeds into the institutional mission. That is “to create visual and data journalism that helps people inform, empower and protect themselves.”

There are a lot of people talking about the loss of trust in journalism. It’s clear people don’t have the confidence in parts of the news they once did, and with the exponential growth in information sources since the web was born, it’s become difficult to know who to trust.

Herein lies an advantage of publishing data the way we do. We can show people the data and let them come to their own conclusions. Nobody has to take our word for it!

Take education stories, for instance. In a story I can quote experts, report summary statistics, and show readers the example of a few schools. It’s possible my examples don’t match your mental picture of schools, drawn from your personal experience, so the story doesn’t really seem true.

In an interactive database, I can show data on EVERY school, and if I do my job well it becomes easy for you to find yours. When you start exploring data with your own school as an example, you can calibrate your understanding of the wider story through that familiar lens. I can help you draw inferences by grounding you in something you understand.

Sometimes, your school will exhibit the characteristics of the national phenomenon I’m covering. Sometimes it won’t. That’s how statistics work, after all. But if I show you how your school fits into the broader universe, and let you see that even though your school doesn’t have a problem, some nearby schools might, I’ve helped you understand and explore the boundaries of the problem and let you come to your own conclusions.

Of course there’s no such thing as unbiased data. Where we get our data from, how we process it, and especially how we display it can change the way data is understood. It’s our job as journalists to work as fairly as we can, but we are also transparent about our methods. We make almost all of our data available for download, along with long papers about our data sources and methods.

There’s nothing new about data journalism. I can show you examples going back hundreds of years. In fact, publishing data for popular consumption predates newspapers entirely. What news applications do is harness what the Internet made possible (rapid software development, findability through search engines, cheap servers and, later, mobile web ubiquity) to enable people to find their own example in a big data set.

What is the process of creating visual explanations at ProPublica?

There are lots of different paths that a project can take, but for the most part they’d all seem very familiar to traditional journalists. Sometimes a developer on our team will have an idea and pitch it to me, sometimes a reporter in the wider newsroom will bring us a data set and ask if it’s something we’d be interested in collaborating on. Sometimes, even reporters in other newsrooms have great data and come to us seeking a collaboration.

We think about interactive database or visual explainer projects a lot like a reporter and editor think about a story. We ask questions like: What “story” are we trying to tell? What’s new about our approach or our data? What’s the lede? What’s our evidence?

But we also have to answer questions that are quite different from the ones traditional journalists face: Who are our users? What are their information needs? What do we want them to do with this information? Not to mention a host of technical questions like: What library or software framework do we want to use? What server will we put this on? How do we keep the data up to date?

Do you have to clean up the data before you can use it?

Always. A big part of our work is simply cleaning data; in fact, for most projects, preparing data is the longest step. We rarely get data that is completely clean and ready to be used. It’s useful for us to be involved in that process, though, because it gives us an opportunity to get familiar with the data and to start thinking ahead to what we want to visualize and what story or stories we want to tell.
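To give a flavor of that step, here is a minimal cleanup sketch in Ruby. Everything in it (the file name, the columns, the rules) is invented for illustration; it is not ProPublica’s actual pipeline:

    require "csv"

    # Hypothetical cleanup pass: the file, columns, and rules are
    # invented for illustration.
    def clean_payment_row(row)
      {
        doctor:  row["doctor_name"].to_s.strip.squeeze(" "),  # collapse stray whitespace
        company: row["company"].to_s.strip.upcase,            # one canonical spelling
        amount:  row["amount"].to_s.gsub(/[$,]/, "").to_f     # "$1,200.50" -> 1200.5
      }
    end

    rows = CSV.read("payments.csv", headers: true)
              .map { |row| clean_payment_row(row) }
              .reject { |r| r[:doctor].empty? || r[:amount] <= 0 }  # drop unusable records

    puts "Kept #{rows.size} usable rows"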

Some developers like to sketch before they build; others like diving right into building prototypes in Ruby code using real data. You sometimes don’t know if your data is suitable for an idea until you’re really trying it, so I’d say we lean more on prototyping than on sketching, but there’s definitely plenty of both in most projects.

From there it’s very much like editing a story. The editor and developer will meet and hopefully the editor can help sharpen and elevate the developer’s work. The developer will call sources and perhaps file records requests to support or deepen the project. The copy will get written and edited and lawyered. And we have a very robust process of bulletproofing and spot-checking data to make sure we haven’t made any errors in our analysis and we don’t have bugs in our code that make us show incorrect data.
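That spot-checking idea can be sketched in a few lines of Ruby. The checks and data below are invented; the point is simply that a project should fail loudly before publication if the processed data violates a basic expectation:

    # Hypothetical spot checks, invented for illustration: raise an
    # error if the processed data violates a basic expectation.
    def check!(label, ok)
      raise "Bulletproofing failed: #{label}" unless ok
    end

    schools = [
      { name: "Adams High", ap_classes: 12 },
      { name: "Baker High", ap_classes: 2  },
    ]

    check!("every school has a name",       schools.all? { |s| !s[:name].to_s.empty? })
    check!("AP class counts are plausible", schools.all? { |s| (0..100).cover?(s[:ap_classes]) })
    check!("row count matches the source",  schools.size == 2)

    puts "All checks passed."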

How is developing an exploratory database different from writing a narrative, or even creating a traditional infographic?

We don’t assume people will have read a traditional news story that will guide them through our findings. We employ user experience design to give our databases a story-like structure while still letting our users explore the data freely. We craft our information architecture and our user experience design in much the same way as other journalists craft a narrative. We may not have a beginning, middle, and end in mind, but we do have a particular path through the interactive, and we have the equivalent of a “nut graf,” that is, a point we want you to take away from the experience. We very carefully choose what to show a user and when.

There are two storytelling techniques you’ll see us use all the time:

First is the concept of the “near” and the “far.” These are the levels of information abstraction that let users take in huge amounts of data without feeling small. The “far” view of an interactive database is usually the front page. Context and overview are important here. We show the things you’ll need to understand the data: introductory text, instructions on how to begin interacting with the data, big versions of the tools that are available (search and other affordances). The far view also puts the data in national context: a list of states showing how each state is affected by whatever the app is about, or perhaps the largest examples of whatever the app is about.

The “near” view shows the data that is closest to the user. Personal connection and detail are what matter here. This is where we show the user’s own school, their doctor, their whatever.

The far view helps set boundaries around the data so users know what’s big and what’s small. The near view is what makes the information real and important to them.

Think of the near and the far much as you’d think of different zoom levels on a map. A map of the whole country isn’t useful for driving directions, but it helps me understand why some states are warmer than others, that Texas is bigger than Virginia, etc. At the same time, a map of my neighborhood is incredibly useful for finding my way around but not for helping me understand weather patterns.

Both the far view and the near view are important and you need them both. Many graphics choose one approach, either giving readers only very broad summaries of information or simply letting them look up some data without context or a sense of “what’s a big deal.”
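To make the two views concrete, here is a minimal sketch in plain Ruby, with invented schools and numbers: the same records drive a “far” state-level summary and a “near” single-school lookup.

    # Hypothetical data: one record per school, invented for illustration.
    SCHOOLS = [
      { name: "Adams High",  state: "OH", ap_classes: 12 },
      { name: "Baker High",  state: "OH", ap_classes: 2  },
      { name: "Custer High", state: "TX", ap_classes: 7  },
    ]

    # Far view: national context, one summary line per state.
    def far_view(schools)
      schools.group_by { |s| s[:state] }.map do |state, group|
        avg = group.sum { |s| s[:ap_classes] }.to_f / group.size
        format("%s: %d schools, %.1f AP classes on average", state, group.size, avg)
      end
    end

    # Near view: the reader's own school, in full detail.
    def near_view(schools, name)
      schools.find { |s| s[:name] == name }
    end

    puts far_view(SCHOOLS)
    p near_view(SCHOOLS, "Baker High")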

The second technique you’ll see us use is that we try to avoid showing numbers out of context. If we tell you how often a doctor prescribes opioids, we say how that compares to doctors like them. We’ll sometimes do these as comparisons to some average, but we may also do these as ranks. We try never to show numbers in isolation, unless we can’t avoid it. These techniques are also a window into the kinds of projects we like to pick: we like projects that have national data that we can make personally relevant for millions of people. Remember, we want to help people “inform, empower and protect themselves.”
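Here is one simple version of that idea, again with invented numbers: rather than printing a bare rate, show where it falls among a doctor’s peers.

    # Hypothetical prescribing rates, invented for illustration.
    rates = {
      "Dr. A" => 0.12,
      "Dr. B" => 0.31,
      "Dr. C" => 0.05,
      "Dr. D" => 0.22,
    }

    doctor = "Dr. B"
    # Rank 1 = highest rate among these peers.
    rank = rates.values.count { |r| r > rates[doctor] } + 1

    puts "#{doctor} prescribes opioids at a rate of #{(rates[doctor] * 100).round}%, " \
         "ranking #{rank} of #{rates.size} among similar doctors."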

Scott Klein with his team.

What are your favorite ProPublica projects?

It’s an unfair question! I think all of them are above average, of course. I’ve talked about our Dollars for Docs app, which is the one I think we’re best known for, but here are three others I especially like to talk about:

Our Opportunity Gap news application is a great example of a lot of the things we try to do. It’s old now (the data has been updated since we published, and the visualization seems a little old-fashioned compared to what we’re capable of today), but it’s a useful demonstration.

The data is from the U.S. Department of Education and covers every public school in the country in districts with more than 3,000 students (52,000 schools in all!). It shows how fairly districts distribute resources like advanced placement classes, chemistry, physics, etc. It answers the question: does each district distribute resources equally to its rich and poor schools?

The answer is complex! Some districts are better than others and most districts are good in some ways and bad in others. The app doesn’t try to eliminate that complexity, but helps the reader understand it and feel empowered by it.

First, it lets every user start the interaction with their own school (something they know a lot about) and tells the rest of the visual story in comparison to that school. It lets you understand a complex topic through the lens of something familiar, and lets you see how your school fits into a much larger national phenomenon.

We added a button that lets you compare your school’s results with two other schools from your state: one a wealthy school, one a high-poverty school. In many cases you can see that as the level of poverty goes up, access to resources goes down. We’ve let readers run their own correlations! Of course we don’t tell them that’s what they’ve done.

Our Hell and High Water project was about what would happen to Houston, Texas if it were hit directly by a strong hurricane. We let people play out different hurricane scenarios interactively to see what would flood in different theoretical circumstances. We even let them put in their address so they could see how the flood would affect them personally. I like this project in particular because we built a graphical story that explained Houston’s vulnerability to heavy floods, which is just what happened a few months later when Hurricane Harvey hit. We also had to understand a lot of very complex scientific analysis and make it understandable to regular readers.

China’s Memory Hole was a group project that collects all of the images that were deleted by censors from the Chinese social media service Weibo. I love it because it required some technical sleuthing to differentiate censorship from other kinds of image deletion. It also has a beautiful interface, data-rich and revealing, that lets the images themselves tell the story.

What do you see as the next stage in the development of online visual journalism?

First, I think we’ll see much more sophisticated methodologies being used. Things like machine learning and probabilistic models can make predictions and unearth hidden patterns. The challenge will be explaining those models and their results to readers. It won’t be easy! Their output isn’t straightforward and often doesn’t come out in units people understand (and sometimes the models don’t even yield the same answer every time you run them).

Second, I think we’ll see more real-time data visualizations that keep themselves updated. We have a few of these, including a campaign data tracker called Election DataBot. Think of a Bloomberg Terminal, but for everything, or think of taking what we’ve learned about building election-night visualizations and applying that to other data that’s available via an API.

Of course I’m also interested in things like smart speakers and in augmented reality. Both seem like natural homes for data communicators — visualizers, designers, coders, etc. 

Favorites

Publication (apart from ProPublica): I subscribe to both The New York Times and The Washington Post. Only partly because I love their graphics teams so much.

Current information designer: Too many to list, but on that list would certainly be: Amanda Cox, Nigel Holmes, Alberto Cairo, John Grimwade, Martin Wattenberg and Fernanda Viegas.

Historical information designer: William Playfair, the grandfather to us all!

Data project: It’s a bit old now but I love the Mapping L.A. project that the L.A. Times did. It crowdsourced a map of Los Angeles’s neighborhoods. I love it not only because it was a great service to the city but because they deliberately did it to create a whole slew of reporting that was never before possible: Crime by neighborhood, schools by neighborhood, etc.

Infographic: It’s a long list, but near the top of it will always be Adolfo Arranz’s infographic on Kowloon Walled City, called City of Anarchy. It’s so breathtakingly detailed that I see something new every time I look at it.

Music: I work in a cubicle so Spotify’s algorithm thinks my favorite music is white noise.
