readwrite

November 17, 2025

Edition 3 - Journalism from the command line, part 1

Today's newsletter shows some command line tools that Jan uses in his day-to-day work. The idea here is not to claim the sole solution to any problem, but to help people get into the habit of typing some slightly more sophisticated stuff into that mysterious black box on the screen.

As always, email us at readwritenewsletter@proton.me with feedback and ideas.


For a Journalist, I Use the Command Line a Lot. Here Are the Tools That Make My Life Easier

Most journalists live in note-taking apps, browsers, spreadsheets, and CMS dashboards. I spend my days in a terminal window.

It lets me focus on the task itself, whether that's cleaning data, searching text, or running a scraper, instead of wrestling with interfaces. And it makes everything reproducible. Every action can be saved, repeated, and explained later.

In this post, I show how I use it. There are thousands of tools out there, and I am sure that for everything I talk about, other solutions exist. I've split this into two parts: this week covers inspecting data and automating the boring stuff; next week, I'll talk about command line comfort.

NB: I use a MacBook for my day-to-day work, and at DARC, my company, we rely on Debian servers for most things. None of this has been tested in Windows PowerShell, but luckily, you can have your own little Linux on a Windows computer if you want to.
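
On recent versions of Windows, I believe setting that up boils down to a single command in an administrator PowerShell (it installs Ubuntu by default):

wsl --install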


Inspecting Data

Say hello to the grep family

grep was released over 50 years ago, so if we ignore the shell builtins, it must be one of the oldest commands I use almost every day.

Its use case is simple: find things in files. Typical usage looks like this:

grep "offshore account" ~/investigations/

A simple command, keyword-searching thousands of files in seconds.

When I’m exploring a leak or document dump, grep lets me test things instantly, adjusting keywords and patterns until something meaningful surfaces.
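
A typical first pass (the folder here is a made-up example) is to see which files mention a term at all, then pull every hit with line numbers and a bit of context around it:

grep -ril "shell company" ~/investigations/leak/
grep -rin -C 2 "shell company" ~/investigations/leak/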

Technically, when I say grep, I often mean one of two other tools from the same family: ripgrep (rg) and pdfgrep.

  • grep — perfect for searching command output or small files. POSIX-clean, installed everywhere, predictable regex behavior.
  • ripgrep — grep rewritten in Rust, faster and a bit more flexible. My default for directories, repos, archives, and anything large or messy.
  • pdfgrep — does the same, but on PDFs:
  pdfgrep -i "tax evasion" documents/*.pdf

It works surprisingly well across large collections.
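
If the PDFs live in nested folders, pdfgrep can also recurse through a directory and prefix each hit with its page number, which helps when you need to cite the exact page later:

pdfgrep -rin "tax evasion" documents/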

The actual power comes from using the grep family with regular expressions. This opens a whole new world and could be a newsletter topic on its own. But still, here are some examples:

Imagine you FOIA'd 500 CSV files that contain budget data from districts in your country. The command below will search all files for the words donation and payment, ignoring case, and create a matches.txt file with the results:

rg -i "donation|payment" budget_files/*.csv > matches.txt

How beautiful is that? But it gets better: let's say you suspect that people used consultancy agreements to funnel out taxpayer money. This will find all rows in all files that contain the relevant keywords with a dollar amount next to them:

rg -i '\b(consultant|consulting|consultancy|contractor)\b.*\$[0-9,]+' budget_files/*.csv 

This catches lines like Consulting Services, $45,250 in your data. Once you've broken your data down like this, why not look for suspiciously round numbers? They're a common red flag for fraud; real invoices rarely end in triple zeroes. The command below finds all occurrences of dollar values starting with a digit between 1 and 9, followed by at least three zeroes (with or without a comma separating the thousands), ending in .00, and writes them to a separate file.

rg '\$[1-9],?0{3,}\.00\b' budget_files/*.csv > round-amounts.txt
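
And once you have that, a quick way to see which file, read: which district, sticks out is to count the hits per file and sort by that count:

rg -c '\$[1-9],?0{3,}\.00\b' budget_files/*.csv | sort -t: -k2 -nr | head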

jq

APIs love to return deeply nested JSON; I, however, hate it. jq makes these results usable: it's a filter language built for slicing through complicated, structured data.

Basic example:

curl -s https://api.gov/contracts | jq '.results[] | {title, value, date}'

This takes all the results from the (fictitious) contracts API and shows title, value, and date for each.

Not only does it filter, it also aggregates and does so much more. Here's how you would get the values of all contracts added up:

curl -s https://api.gov/contracts | jq '[.results[].value] | add'

It streams input efficiently, which makes it faster and cleaner than hacking around with Python in exploratory mode. It’s the perfect sanity check for what an API actually returns vs. what the interface claims.
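
When I don't yet know what an endpoint returns, I usually look at the top-level keys and a single sample record first (same fictitious API as above) before writing a real filter:

curl -s https://api.gov/contracts | jq 'keys'
curl -s https://api.gov/contracts | jq '.results[0]'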


exiftool

Every digital file carries metadata: timestamps, device IDs, camera models, software signatures, sometimes even GPS coordinates. This hidden information often tells a different story than the visible content. exiftool extracts all of it.

Here are some examples:

exiftool -time:all -a -G1 photo.jpg

This shows every timestamp in the file, grouped by metadata source. Photos typically have three timestamps: when the picture was taken (DateTimeOriginal), when the file was created (CreateDate), and when it was last modified (ModifyDate).

If someone claims a photo was taken "yesterday" but DateTimeOriginal says 2019, you should ask some more questions.

exiftool -GPS* suspicious-photo.jpg

Some smartphones embed GPS coordinates in photos. This can verify (or contradict) claims about where something happened. The command above will pull them out.

To batch-extract metadata from a whole folder, you can run:

exiftool -json *.jpg > metadata.json

A one-command quick audit (which you can then pipe into jq, maybe?).
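
For example, to pull out just the camera model and GPS fields from that JSON (the exact tag names depend on what each device actually wrote into the files):

exiftool -json *.jpg | jq '.[] | {SourceFile, Model, GPSLatitude, GPSLongitude}'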

The tool also works on other file types. Here's how to check a PDF for signs of manipulation:

exiftool -all leaked-memo.pdf

Look for the Creator, Producer, and ModifyDate fields. If they don't line up with what your source or the government spokesperson told you, you might be onto something.
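
To pull just those three fields, and nothing else, from a stack of PDFs:

exiftool -Creator -Producer -ModifyDate *.pdf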


Automating the Boring Stuff

The section title is stolen from a book that I can recommend.

caffeinate

This is macOS only, I believe: caffeinate prevents sleep during long-running jobs:

caffeinate -i python scrape.py

This will stop the system from going to sleep as long as the process (python, in this example) is running. You can also use it with the -d flag, which will keep the display on.
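
And if I just want the machine to stay fully awake for a fixed stretch, say an hour while a big download finishes, the flags combine with a timeout in seconds:

caffeinate -d -i -t 3600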


cron

cron runs commands on a fixed schedule on virtually every UNIX-like system. The way it works is that you encode when and how often you want the computer to do a thing, and, well, the computer then goes and does the thing and does not bother you again.

The syntax is a bit weird at first, but basically there are five positions to fill, usually represented by five stars: minute, hour, day of the month, month, day of the week.

Here's what it would look like:

0 6 * * * /usr/local/bin/python ~/scripts/update_dataset.py
0 23 * * 0 /usr/bin/find ~/data/tmp -type f -mtime +14 -delete

These two rows translate to:

0 6 * * * → Run every day at 06:00 (6 AM). → In this case, it runs a Python script to update a dataset.

0 23 * * 0 → Run at 23:00 (11 PM) every Sunday (the week is zero-indexed, starting with Sunday, day 0). → Here, it finds and deletes temporary files older than 14 days.
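
To actually install lines like these, you edit your personal crontab; cron picks the changes up on its own:

crontab -e    # opens your crontab in an editor, paste the lines there
crontab -l    # lists what is currently scheduled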

Luckily, there are tools out there, like crontab.guru, that translate cron expressions into plain English while you write them.


GitHub Actions

Not strictly a CLI tool, but the same philosophy: workflows as code. Instead of just scheduling commands on a single machine, GitHub Actions lets you define repeatable automation that runs in the cloud. And yes, it even understands cron syntax, so you can schedule jobs just like you would with a local crontab.

The big difference is that GitHub Actions can spin up a whole fresh virtual machine (or container) to do the work. That means you can run complex pipelines without touching your own computer.

Here’s a simple example:

name: Update dataset
on:
  schedule:
    - cron: '0 5 * * *'
permissions:
  contents: write
jobs:
  update:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: python scripts/fetch_latest.py
      - run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git commit -am "Auto update" && git push

This workflow translates to:

  • 0 5 * * * → every day at 05:00 am
  • Spin up an Ubuntu machine
  • Check out the code from the current GitHub repository
  • Run a Python script (from the repo) to fetch the latest dataset.
  • Commit and push the changes back to the repo.

A super simple way of creating a versioned history of, let's say, daily copies of a website.
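
For that website example, the fetch step could be as small as a single curl call (the URL and folder are placeholders, of course):

      - run: |
          mkdir -p snapshots
          curl -s https://example.gov/press-releases.html > snapshots/$(date +%F).html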


Why I think it's worth it

I understand that this is complicated stuff, and it comes with a quite steep learning curve. For journalists, though, it also comes with a simple appeal: clarity and reproducibility. It keeps you close to your data and far from abstraction (ever tried to convince Excel that something is, or is not, a date?).

These tools don’t make me a better reporter, but they make me a faster, calmer researcher. They reduce friction, remove uncertainty, and make complex tasks predictable.

And sometimes, when someone asks how I filtered a dataset, searched a million lines, or downloaded an archive in one go, I just smile and say: “I use the command line.”
