Scraping the Wayback Machine for data recovery
Using the Wayback Machine to recover website data
A cyberattack hit my old school and took the main website down with it. I had a few ideas and wanted to get involved in the recovery effort, so we scraped the Internet Archive’s Wayback Machine.
The Problem
I downloaded a few open-source scrapers and they worked fine; however, they didn’t grab every file the Wayback Machine had indexed. When I hosted the recovered website locally, I noticed missing files that I could still find indexed on the Wayback Machine.
Thus, these scripts were made.
The Solution: A Python-Powered Recovery Plan
1. Getting a List of Everything
The first step was to get a master list of every URL the Wayback Machine had ever archived for our domain. A simple curl request to the Archive’s CDX API did the trick, dumping a raw text file with all the URLs I needed.
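For reference, here is a minimal sketch of that listing step in Python (the domain and output filename are placeholders; the same query works equally well as a one-line curl command):

```python
import requests

# Ask the Wayback Machine's CDX API for every capture under the domain.
# "example-school.edu" and "wayback_urls.txt" are placeholder names.
params = {
    "url": "example-school.edu/*",  # everything archived under the domain
    "output": "text",
    "fl": "original",               # return only the original URL column
    "collapse": "urlkey",           # one row per unique URL
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params)
resp.raise_for_status()

with open("wayback_urls.txt", "w") as f:
    f.write(resp.text)
```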
2. Cleaning Up the Links with Tab Formatting.py
The raw list was a bit of a mess, full of duplicate URLs with different query parameters (like page.html?id=123). My Tab Formatting.py script ran through the list and stripped everything after the ?, leaving a clean list of unique pages to download.
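The clean-up logic boils down to a few lines; here is a sketch of the idea (the input and output filenames are illustrative, not the actual script’s):

```python
INPUT = "wayback_urls.txt"   # raw CDX output, one URL per line
OUTPUT = "clean_urls.txt"

seen = set()
with open(INPUT) as src, open(OUTPUT, "w") as dst:
    for line in src:
        url = line.strip().split("?", 1)[0]  # drop the query string
        if url and url not in seen:          # keep each page only once
            seen.add(url)
            dst.write(url + "\n")
```

Deduplicating here matters because every page.html?id=... variant would otherwise be downloaded as a separate copy of the same page.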
3. Downloading It All with Batch Downloader.py
This script uses the wayback-machine-downloader library to read each URL from my text file and download the latest available version of it. I just set it running, and it methodically fetched the entire site.
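Conceptually, it is a loop that hands each URL to the downloader in turn. Here is a simplified sketch, assuming the wayback_machine_downloader CLI is installed and on your PATH; the flags and output directory are illustrative, based on the tool’s documented options:

```python
import subprocess

# Call the wayback_machine_downloader CLI once per cleaned-up URL.
# "clean_urls.txt" and "recovered_site" are illustrative names.
with open("clean_urls.txt") as f:
    for url in f:
        url = url.strip()
        if not url:
            continue
        subprocess.run(
            [
                "wayback_machine_downloader", url,
                "--exact-url",                 # fetch just this URL, not the whole prefix
                "--directory", "recovered_site",
            ],
            check=True,
        )
```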
4. Fixing the Messy Filenames with Formatting.py
The downloader, by default, saved files with URL-encoded names (e.g., a folder named about%2Fus instead of about/us). This script went through the downloaded folders and renamed everything to be clean and human-readable, restoring the original site structure.
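The renaming pass is essentially urllib.parse.unquote applied to every file and folder name. A minimal sketch of the idea (the root path is a placeholder):

```python
import os
import urllib.parse

ROOT = "recovered_site"  # placeholder for the download directory

# Walk bottom-up so a directory's contents are renamed before the
# directory itself is.
for dirpath, dirnames, filenames in os.walk(ROOT, topdown=False):
    for name in filenames + dirnames:
        decoded = urllib.parse.unquote(name)
        if decoded == name:
            continue                          # nothing URL-encoded here
        src = os.path.join(dirpath, name)
        dst = os.path.join(dirpath, decoded)
        # %2F decodes to "/", so the target may need new parent folders;
        # os.renames creates them and prunes the now-empty source dirs.
        os.renames(src, dst)
```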
You can find the source code for all the scripts on my repositories page!