Scraping the Wayback Machine for data recovery

Using the Wayback Machine to recover website data

A cyberattack hit my old school and took down its main website. I had a few ideas and wanted to get involved in the recovery process. So, we scraped the Internet Archive’s Wayback Machine.

The Problem

I downloaded a few open-source scrapers, and they worked fine; however, they didn’t grab every file the Wayback Machine had indexed. When I hosted the recovered website locally, I noticed files were missing that I could still find indexed in the Wayback Machine.

Thus these scripts were made.

The Solution: A Python-Powered Recovery Plan

1. Getting a List of Everything

The first step was to get a master list of every URL the Wayback Machine had ever archived for our domain. A simple curl request to the Archive’s CDX API did the trick, dumping a raw text file with the URLs needed.
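
A minimal sketch of that query, using Python’s requests instead of curl (the domain and output filename here are placeholders, not the real ones):

```python
import requests

DOMAIN = "example-school.ac.uk"  # placeholder for the real domain

# The CDX API returns one line per capture; collapsing on urlkey keeps
# a single row per distinct URL instead of one per snapshot.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": f"{DOMAIN}/*",
        "output": "text",
        "fl": "original",       # only return the original URL column
        "collapse": "urlkey",   # de-duplicate repeat captures of the same URL
    },
    timeout=120,
)
resp.raise_for_status()

with open("wayback_urls_raw.txt", "w", encoding="utf-8") as fh:
    fh.write(resp.text)
```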

2. Cleaning Up the List with Tab Formatting.py

The raw list was a bit of a mess, full of duplicate URLs with different query parameters (like page.html?id=123). My Tab Formatting.py script ran through the list and stripped everything after a ?, leaving a clean list of unique pages to download.
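
Roughly what that clean-up looks like (the filenames here are placeholders rather than the exact ones the script uses):

```python
RAW_LIST = "wayback_urls_raw.txt"     # hypothetical input from the CDX dump
CLEAN_LIST = "wayback_urls_clean.txt" # hypothetical cleaned output

def clean_urls(raw_path: str, clean_path: str) -> None:
    """Strip query strings and drop duplicate URLs, preserving order."""
    seen = set()
    cleaned = []
    with open(raw_path, encoding="utf-8") as fh:
        for line in fh:
            url = line.strip().split("?", 1)[0]  # drop everything after '?'
            if url and url not in seen:
                seen.add(url)
                cleaned.append(url)
    with open(clean_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(cleaned) + "\n")

if __name__ == "__main__":
    clean_urls(RAW_LIST, CLEAN_LIST)
```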

3. Downloading It All with Batch Downloader.py

This script reads each URL from the cleaned text file and uses wayback-machine-downloader to fetch the latest available version of it. I just set it running, and it methodically fetched the entire site.
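
A rough sketch of that loop, assuming the wayback_machine_downloader CLI (the Ruby gem) is installed and on the PATH; the filenames, output directory, and exact flags are assumptions rather than the script’s real settings:

```python
import subprocess

CLEAN_LIST = "wayback_urls_clean.txt"  # output of the clean-up step (placeholder name)
OUTPUT_DIR = "recovered_site"          # hypothetical download directory

with open(CLEAN_LIST, encoding="utf-8") as fh:
    urls = [line.strip() for line in fh if line.strip()]

for url in urls:
    # --exact-url asks for just this URL's latest snapshot rather than the whole
    # site; -d sets the output directory (flags per the gem's documentation).
    subprocess.run(
        ["wayback_machine_downloader", url, "--exact-url", "-d", OUTPUT_DIR],
        check=False,  # keep going even if one URL fails to download
    )
```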

4. Fixing the Messy Filenames with Formatting.py

The downloader, by default, saved files with URL-encoded names (e.g., a folder named about%2Fus instead of about/us). This script went through the downloaded folders and renamed everything to be clean and human-readable, restoring the original site structure.
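
A minimal sketch of that renaming pass (the site folder path is a placeholder, and the real script may handle more edge cases):

```python
import os
import shutil
from urllib.parse import unquote

SITE_ROOT = "recovered_site"  # hypothetical path to the downloaded site

def decode_tree(root: str) -> None:
    """Rename URL-encoded files and folders (e.g. about%2Fus) back to real paths."""
    # Walk bottom-up so entries inside a folder are moved before the folder itself.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            decoded = unquote(name)
            if decoded == name:
                continue  # nothing URL-encoded in this name
            src = os.path.join(dirpath, name)
            # %2F decodes to "/", so the decoded name may introduce new subfolders.
            dst = os.path.join(dirpath, *decoded.split("/"))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)

if __name__ == "__main__":
    decode_tree(SITE_ROOT)
```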

You can find the source code for all the scripts on my repositories page!