Data wrangling for data journalism

Chris Knox

Data Editor at the New Zealand Herald (in October 2020)

chris@functionalvis.com

13 October, 2020

What is data journalism?

Paul Bradshaw from Birmingham City University says:

Data can be the source of data journalism, or it can be the tool with which the story is told — or it can be both.

The Bureau of Investigative Journalism says:

Data journalism is simply journalism.

The former is a new and trendy term but ultimately, it is just a way of describing journalism in the modern world.

What does that mean for Data Scientists?

  • Consider journalism

Both in the sense of spending some of your career as a data journalist, and in the work you produce. Ask yourself:

Are there stories in this data that are of public interest?

Journalists are overworked and deadline driven

The easier your data is to understand and consume, the more likely it is to be picked up by a journalist

Why use data in journalism?

  • Context
  • Trust
  • Clarity and/or conciseness
  • Engagement

What does data give the narrative?

Probabilities people assign to different phrases

Narrative/Cognitive tension?

Not sure exactly what to call it - but I think it is important.

What roles does data have in journalism?

Background

Why did I decide to move from data science/visualisation to journalism?

Let’s change gears and talk about data wrangling

How much have you talked about reproducible research?

In particular, version control (git) and continuous-integration-like workflows?

Run one command and publish

You should be able to run a single command that updates your data, runs your analysis, creates your assets, and then publishes your article/report

Examples

NZH Covid update

NZH Stat of the Nation Project

But there’s a problem

  • Automated workflows can breed inflexibility

Things change all the time - and more interesting things change more often. Don’t become King Canute and try to stop the tide coming in.

If you are not in control of data collection, and your workflow tries to control the data collection and collation, your workflow will break

  • example/rant
    New Zealand Covid Data

There’s another problem

Programming Languages and tools

Use the right tool

  • R
  • Python
  • Julia
  • Fortran
  • SQL (incl. Postgis)
  • Javascript
  • GDAL
  • QGIS

Automated workflows can lock you into a single technology, restricting your ability to make use of the best tools for a job.

And there’s another problem

TIME!

  • You develop a project and publish it
  • One year passes
  • You get asked to update the project with this year’s data
  • Has the data changed?
  • Can you run your script safely?
  • Can you even understand your script without a couple of days' work?

What’s my solution?

Use the compiler, Luke

Definitions

  • Compiled languages
  • Dynamic languages
  • Strongly typed languages
  • Weakly typed languages

Flexible (dynamic and often weakly typed) languages are the mainstay for analysis - especially exploratory analysis

Haskell

Haskell is a wonderful language with a steep learning curve - find a mentor if you want to learn it

  • Haskell is a strongly typed, compiled, functional language
  • In Haskell it is easy to use the type system to capture your assumptions about the state of the system.

You probably don’t want to do this as part of your actual analysis workflow - it is possible - but I have not found it to be very efficient.
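
As a minimal sketch of what "capturing an assumption in the type system" can look like: the assumption here is that a case count is never negative. The names CaseCount, mkCaseCount and totalCases are hypothetical and chosen only for illustration. The check happens once, when the value is built, and everything downstream can rely on it.

```haskell
module Main where

-- A count that is assumed to be non-negative (hypothetical example type).
newtype CaseCount = CaseCount Int
  deriving (Show, Eq)

-- The intended way to build a CaseCount: the assumption is checked once, here.
mkCaseCount :: Int -> Either String CaseCount
mkCaseCount n
  | n < 0     = Left ("negative case count: " ++ show n)
  | otherwise = Right (CaseCount n)

-- Downstream code can rely on the assumption without re-checking it.
totalCases :: [CaseCount] -> Int
totalCases cs = sum [n | CaseCount n <- cs]

main :: IO ()
main = do
  print (mkCaseCount 12)
  print (mkCaseCount (-3))
  -- traverse fails as a whole if any single value breaks the assumption.
  print (totalCases <$> traverse mkCaseCount [1, 2, 3])
```

In a real project the CaseCount constructor would normally be hidden behind a module export list, so that mkCaseCount is the only way in.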

Use Haskell as glue

  • Do your analysis however you like
  • Write Haskell code to consume one stage of your pipeline - check it for consistency - and then spit it out for the next stage
  • That’s all

The point is that your Haskell pipeline will break - on your computer - if your assumptions are no longer true
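
A glue stage of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline used in the projects above: it assumes the previous stage writes a simple comma-separated file with the header "date,cases", no quoted fields, and a non-negative integer count per row. The names splitComma, checkRow and checkStage are hypothetical.

```haskell
module Main where

import System.Exit (exitFailure)
import System.IO (hPutStrLn, stderr)
import Text.Read (readMaybe)

-- Split a line on commas (a naive splitter: no support for quoted fields).
splitComma :: String -> [String]
splitComma s = case break (== ',') s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitComma rest

-- Check one row against the assumptions: two fields, non-negative integer count.
checkRow :: (Int, String) -> Either String String
checkRow (lineNo, row) = case splitComma row of
  [date, cases] -> case readMaybe cases :: Maybe Int of
    Just n | n >= 0 -> Right (date ++ "," ++ show n)
    _               -> Left ("line " ++ show lineNo ++ ": bad case count " ++ show cases)
  _ -> Left ("line " ++ show lineNo ++ ": expected 2 fields")

-- Check the whole input: the header must match exactly, then every row must pass.
checkStage :: String -> Either String String
checkStage input = case lines input of
  [] -> Left "empty input"
  (header : rows)
    | header /= "date,cases" -> Left ("unexpected header: " ++ show header)
    | otherwise -> unlines . (header :) <$> traverse checkRow (zip [2 ..] rows)

-- Read the previous stage from stdin; pass it on, or fail loudly.
main :: IO ()
main = do
  input <- getContents
  case checkStage input of
    Right output -> putStr output
    Left err     -> hPutStrLn stderr err >> exitFailure
```

Compiled to an executable, a stage like this sits between two steps of the pipeline, reading one step's output on stdin and writing to stdout for the next. If the source data changes shape, the run stops with a non-zero exit code and a line number instead of quietly publishing something wrong.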

Example: Maybe
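
Assuming the example here refers to Haskell's Maybe type, the sketch below shows how it makes a "this value might be missing" assumption explicit. The data in rawCounts and the names countFor and describe are made up for illustration.

```haskell
module Main where

import Text.Read (readMaybe)

-- Hypothetical stage output: region name paired with a raw count string.
rawCounts :: [(String, String)]
rawCounts = [("Auckland", "12"), ("Wellington", "n/a")]

-- Both steps can fail, and the Maybe type says so:
-- the region may be missing, or its count may not be a number.
countFor :: String -> Maybe Int
countFor region = lookup region rawCounts >>= readMaybe

-- The compiler requires both the Just and the Nothing case to be handled.
describe :: String -> String
describe region = case countFor region of
  Just n  -> region ++ ": " ++ show n
  Nothing -> region ++ ": no usable count"

main :: IO ()
main = mapM_ (putStrLn . describe) ["Auckland", "Wellington", "Otago"]
```

Because countFor returns Maybe Int rather than Int, any caller has to decide what to do when the count is absent. That decision cannot be forgotten a year later.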

This approach can be implemented in other languages too