Scraping the Wayback Machine for data recovery
Using the Wayback Machine to recover website data
A cyberattack hit my old school and took the main website down with it. I had a few ideas and wanted to get involved in the recovery effort, so we scraped the Internet Archive’s Wayback Machine.
The Problem
I downloaded a few open-source scrapers and they worked fine; however, they didn’t grab every file the Wayback Machine had indexed. When I hosted the recovered website locally, I noticed missing files that I could still find indexed on the Wayback Machine.
Thus, these scripts were made.
The Solution: A Python-Powered Recovery Plan
1. Getting a List of Everything
The first step was to get a master list of every URL the Wayback Machine had ever archived for our domain. A simple curl request to the Archive’s CDX API did the trick, dumping a raw text file with all the URLs I needed.
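For reference, here is a minimal sketch of that listing step in Python (the domain and output filename are placeholders; the same query works equally well as a one-line curl command):

```python
import requests

# Ask the Wayback Machine's CDX API for every capture under the domain.
# "example-school.edu" and "wayback_urls.txt" are placeholder names.
params = {
    "url": "example-school.edu/*",  # everything archived under the domain
    "output": "text",
    "fl": "original",               # return only the original URL column
    "collapse": "urlkey",           # one row per unique URL
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params)
resp.raise_for_status()

with open("wayback_urls.txt", "w") as f:
    f.write(resp.text)
```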
2. Cleaning Up the Links with Tab Formatting.py
The raw list was a bit of a mess, full of duplicate URLs with different query parameters (like page.html?id=123). My Tab Formatting.py script ran through the list and stripped everything after the ?, leaving a clean list of unique pages to download.
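The clean-up logic boils down to a few lines; here is a sketch of the idea (the input and output filenames are illustrative, not the actual script’s):

```python
INPUT = "wayback_urls.txt"   # raw CDX output, one URL per line
OUTPUT = "clean_urls.txt"

seen = set()
with open(INPUT) as src, open(OUTPUT, "w") as dst:
    for line in src:
        url = line.strip().split("?", 1)[0]  # drop the query string
        if url and url not in seen:          # keep each page only once
            seen.add(url)
            dst.write(url + "\n")
```

Deduplicating here matters because every page.html?id=... variant would otherwise be downloaded as a separate copy of the same page.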
3. Downloading It All with Batch Downloader.py
This script uses the wayback-machine-downloader library to read each URL from my text file and download the latest available version of it. I just set it running, and it methodically fetched the entire site.
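Conceptually, it is a loop that hands each URL to the downloader in turn. Here is a simplified sketch, assuming the wayback_machine_downloader CLI is installed and on your PATH; the flags and output directory are illustrative, based on the tool’s documented options:

```python
import subprocess

# Call the wayback_machine_downloader CLI once per cleaned-up URL.
# "clean_urls.txt" and "recovered_site" are illustrative names.
with open("clean_urls.txt") as f:
    for url in f:
        url = url.strip()
        if not url:
            continue
        subprocess.run(
            [
                "wayback_machine_downloader", url,
                "--exact-url",                 # fetch just this URL, not the whole prefix
                "--directory", "recovered_site",
            ],
            check=True,
        )
```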
4. Fixing the Messy Filenames with Formatting.py
The downloader, by default, saved files with URL-encoded names (e.g., a folder named about%2Fus instead of about/us). This script went through the downloaded folders and renamed everything to be clean and human-readable, restoring the original site structure.
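The renaming pass is essentially urllib.parse.unquote applied to every file and folder name. A minimal sketch of the idea (the root path is a placeholder):

```python
import os
import urllib.parse

ROOT = "recovered_site"  # placeholder for the download directory

# Walk bottom-up so a directory's contents are renamed before the
# directory itself is.
for dirpath, dirnames, filenames in os.walk(ROOT, topdown=False):
    for name in filenames + dirnames:
        decoded = urllib.parse.unquote(name)
        if decoded == name:
            continue                          # nothing URL-encoded here
        src = os.path.join(dirpath, name)
        dst = os.path.join(dirpath, decoded)
        # %2F decodes to "/", so the target may need new parent folders;
        # os.renames creates them and prunes the now-empty source dirs.
        os.renames(src, dst)
```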
You can find the source code for all the scripts on my repositories page!