The Digital Evidence Preservation Toolkit

In response to a quick-turnaround investigation, we put together a single-page preservation pipeline that turns a researcher’s Raindrop.io bookmarks into a tamper-evident, cryptographically verifiable web archive, with a full-page WACZ capture in Browsertrix Cloud behind every saved URL. The researcher clicks “save” in the browser they already use; a preservation chain runs behind the scenes.

It builds on Simon Willison’s very 2020 “git scraping” pattern, which uses GitHub Actions as a scheduled runner.


How it works

A researcher clips a URL in Raindrop.io, either the bookmark manager they already use or, at the very least, a pleasant and fast browser extension. That’s it from their perspective.

On a schedule, a GitHub Actions workflow pulls new bookmarks from the Raindrop API and writes each one as its own JSON file to a dedicated bookmarks-data git branch, along with some metadata, including the accountId of the researcher who saved it. The branch is append-only(-ish) and diffable: a versioned record of who saw what, and when.
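A minimal sketch of the "one JSON file per bookmark" step, not the project's actual code. It assumes each bookmark arrives as a dict with an `_id` field (as Raindrop API items appear to have); the function name `write_bookmark` and the `<id>.json` file layout are hypothetical.

```python
import json
from pathlib import Path


def write_bookmark(item: dict, account_id: str, out_dir: Path) -> bool:
    """Write one bookmark to <out_dir>/<id>.json; return False if already recorded."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{item['_id']}.json"  # assumption: `_id` is the bookmark ID
    if path.exists():
        # Append-only(-ish): never rewrite a bookmark that is already on the branch.
        return False
    # Attach the researcher's account so provenance travels with the record.
    record = {**item, "accountId": account_id}
    # Sorted keys and fixed indentation keep git diffs small and stable.
    text = json.dumps(record, indent=2, sort_keys=True, ensure_ascii=False)
    path.write_text(text + "\n", encoding="utf-8")
    return True
```

Skipping files that already exist means re-running the workflow over the same bookmarks leaves the branch history untouched, which is what makes the record readable as a timeline.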

Each new commit fires a second workflow that hands the URL to a Browsertrix Cloud account for a single-page crawl. The result is a hashed, replayable archive of the full page as it appeared at capture time, which lands in a per-project collection. The crawl ID is logged to crawls.csv on a separate crawls-data branch, acting as the join key between the bookmark and its archived form.
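To illustrate how crawls.csv can act as a join key, here is a small sketch using only the standard library. The column names `bookmarkId` and `crawl_id` are assumed to be the real headers, and the helper names `log_crawl` and `crawls_for` are hypothetical. The call to Browsertrix itself is deliberately left out.

```python
import csv
from pathlib import Path

FIELDS = ["bookmarkId", "crawl_id"]  # assumed column names


def log_crawl(csv_path: Path, bookmark_id: str, crawl_id: str) -> None:
    """Append one bookmark-to-crawl row, writing the header on first use."""
    is_new = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({"bookmarkId": bookmark_id, "crawl_id": crawl_id})


def crawls_for(csv_path: Path, bookmark_id: str) -> list[str]:
    """Return every crawl ID logged for a bookmark, in the order logged."""
    with csv_path.open(newline="", encoding="utf-8") as fh:
        return [
            row["crawl_id"]
            for row in csv.DictReader(fh)
            if row["bookmarkId"] == bookmark_id
        ]
```

Because rows are only ever appended, a bookmark that is crawled repeatedly (a recurring capture, say) simply accumulates several crawl IDs under the same bookmark ID.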

   Researcher clicks "save" in Raindrop.io
                       │
                       ▼
   ╭───────────────────────────────────╮
   │  ① bookmark JSON  (git branch)    │  who · when · tags
   ╰───────────────────┬───────────────╯
                       │  triggers a crawl
                       ▼
   ╭───────────────────────────────────╮
   │  ② crawls.csv row (git branch)    │  bookmarkId ↔ crawl_id
   ╰───────────────────┬───────────────╯
                       │  join key
                       ▼
   ╭───────────────────────────────────╮
   │  ③ WACZ artefact (Browsertrix)    │  full page, hashed
   ╰───────────────────────────────────╯

   Three artefacts. Three independent stores.
   One verifiable trail from click to capture.
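As a rough illustration of what "hashed" buys you at the end of that trail, the sketch below re-checks a downloaded WACZ file against its own manifest. It assumes the published WACZ layout, where the archive is a ZIP whose datapackage.json lists each resource with a `path` and a `hash` of the form `sha256:<hex>`. The function name `verify_wacz` is hypothetical, and this is not necessarily how the toolkit verifies captures.

```python
import hashlib
import json
import zipfile


def verify_wacz(path) -> list[str]:
    """Return the resource paths whose SHA-256 does not match datapackage.json."""
    mismatches = []
    with zipfile.ZipFile(path) as wacz:
        package = json.loads(wacz.read("datapackage.json"))
        for resource in package.get("resources", []):
            algo, _, expected = resource["hash"].partition(":")
            if algo != "sha256":
                # Only SHA-256 is handled in this sketch; flag anything else.
                mismatches.append(resource["path"])
                continue
            digest = hashlib.sha256()
            # Stream in 1 MiB chunks so large WARC files are not loaded whole.
            with wacz.open(resource["path"]) as fh:
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected:
                mismatches.append(resource["path"])
    return mismatches
```

An empty list means every listed resource still matches its recorded hash. This checks internal consistency only: it does not validate any signature, nor does it compare the file against a digest stored elsewhere.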

Multi-account mode lets a team pool separate Raindrop libraries into a single shared corpus, with each item tagged with its origin. A small projects.json routes bookmarks to per-project Browsertrix collections, so several investigations can run in parallel without cross-contamination. We gave two researchers separate Raindrop accounts and can still attribute who requested which crawl.
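The routing idea can be sketched as follows. The schema shown for projects.json is entirely hypothetical (the real file may route on something other than tags), as are the `route` function and the placeholder collection IDs.

```python
import json
from typing import Optional

# Hypothetical projects.json content: one routing tag per project.
EXAMPLE_PROJECTS_JSON = """
{
  "projects": [
    {"name": "project-a", "tag": "alpha", "collection": "collection-id-1"},
    {"name": "project-b", "tag": "beta", "collection": "collection-id-2"}
  ]
}
"""


def route(bookmark: dict, config: dict) -> Optional[str]:
    """Return the collection of the first project whose tag is on the bookmark."""
    tags = set(bookmark.get("tags", []))
    for project in config["projects"]:
        if project["tag"] in tags:
            return project["collection"]
    # No matching project: the caller decides whether to skip or use a default.
    return None


config = json.loads(EXAMPLE_PROJECTS_JSON)
collection = route({"tags": ["beta", "misc"]}, config)  # "collection-id-2"
```

Keeping the routing in a small data file rather than in workflow code means a new investigation can be added by editing one JSON entry.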


How it compares

✅ No servers ✅ No databases ✅ Per-researcher provenance ✅ Team-ready-ish

The alternatives, in brief:


Who is it for?

Investigative journalists · OSINT researchers · Human rights documenters · Cross-border reporting teams · Litigation support investigators · Any team running a long-form investigation built on web sources


Status

This project is in pilot deployment with Airwars on a cross-border investigation into the apparent smuggling of luxury cars from Europe to Russia. It has preserved more than 9,000 unique URLs totalling 98 GiB, including a six-week recurring crawl of a Belarus–Lithuania border webcam (~3,800 captures) that turned a self-overwriting page into a preserved time series.


Stay informed

Reach out to <hi@digitalevidencetoolkit.org> or subscribe to our newsletter:

#Web-Archiving #Integrity #Evidence-Preservation