You built a website years ago. It's gone now — hosting expired, CMS crashed, hard drive died. But the Wayback Machine remembers. This is the complete, no-nonsense guide to getting it back. Every page, every image, every trick.
The Internet Archive's Wayback Machine has been crawling the web since 1996. As of 2026, it holds over 866 billion web pages. But it doesn't save everything, and understanding what it does capture will save you hours of frustration.
Think of the Wayback Machine as someone who walked through your house taking photographs. They captured what the rooms looked like — but they didn't copy the plumbing, the electrical, or the contents of your drawers. You're recovering the appearance of your site, not its engine.
This varies wildly. A popular site that was online for ten years might have hundreds of snapshots. A small personal site might have three. Completeness depends on how popular the site was, how long it stayed online, and whether robots.txt blocked the crawler (many hosting defaults did).

Go to web.archive.org and type your domain in the search bar. Don't include https:// — just the bare domain:
baylesshigh.com
Hit enter. You'll land on the calendar view — a year-by-year timeline showing every snapshot of your domain.
Don't forget www. — www.baylesshigh.com and baylesshigh.com are treated as different URLs. Try both.

To see snapshots of one specific page, add its path: web.archive.org/web/*/baylesshigh.com/alumni.html

The wildcard search web.archive.org/web/*/baylesshigh.com/* shows every URL ever crawled under your domain — a complete list of every archived URL. This is your site map from the past, and it's how you find pages you forgot existed. Copy this list; it tells you exactly what's recoverable, and some of the content may surprise you.
The calendar view is your time machine's control panel. Here's how to read it:
Not all snapshots are equal. Your strategy:
Sites often degrade before they die. The 2005 version of your site might be a vibrant 20-page community hub. The 2019 version might be a parked domain page from GoDaddy. Always browse multiple years. The peak is usually years before the domain expired.
Every Wayback Machine URL follows this pattern:
https://web.archive.org/web/20051122190731/http://www.baylesshigh.com/
Breakdown:
20051122190731 = timestamp: 2005-11-22 at 19:07:31 UTC
http://www.baylesshigh.com/ = the original URL
That timestamp is your key. You can construct URLs for any page at any point in time by changing the timestamp and the path. This becomes important when you're recovering individual pages from different dates.
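To make that concrete, here's a tiny helper (hypothetical, not part of any tool) that assembles a snapshot URL from those two pieces:

```shell
# wayback_url: build a Wayback Machine snapshot URL from a 14-digit
# timestamp (YYYYMMDDhhmmss) and the original URL. Made-up helper name.
wayback_url() {
  printf 'https://web.archive.org/web/%s/%s\n' "$1" "$2"
}

wayback_url 20051122190731 "http://www.baylesshigh.com/alumni.html"
# → https://web.archive.org/web/20051122190731/http://www.baylesshigh.com/alumni.html
```

Change the timestamp to jump to a different date, or the path to jump to a different page of the same snapshot.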
Best for: Small sites (1-10 pages). Total control over what you get.
This is the simplest method and the one you should try first unless your site had dozens of pages.
Navigate to your best snapshot. You'll see the page rendered with the Wayback Machine toolbar at the top. Ignore the toolbar — look at your content below it.
Right-click anywhere on the page and select "View Page Source" (Ctrl+U on Windows, Cmd+Option+U on Mac). This shows you the raw HTML that was archived.
Your browser's "Save As" feature will save the page as rendered by the Wayback Machine, including the toolbar, the rewritten URLs, and archive.org's own CSS/JS. You'll get a mess. Instead, copy the source code and clean it manually. More work upfront, cleaner result.
Select all the source code (Ctrl+A) and paste it into a text editor. Save it as index.html (or whatever the original filename was).
Look through the HTML for references to images, CSS files, and other assets. Each one will be rewritten to point to the Wayback Machine. For example:
<!-- Original URL was: -->
<img src="/images/logo.gif">
<!-- Wayback rewrote it to: -->
<img src="https://web.archive.org/web/20051122190731im_/http://www.baylesshigh.com/images/logo.gif">
To download the image, take that full Wayback URL and open it in your browser. If it loads, right-click and save it. Create an images/ folder and save it as logo.gif. Then update your HTML to use the local path again.
Go through each page on your site. Use the wildcard search from Chapter 2 to find all pages, then visit each one in the best snapshot year and repeat the process.
Once you have the HTML, use find-and-replace to change all https://web.archive.org/web/20051122190731/http://www.baylesshigh.com references back to relative paths. One operation fixes dozens of links.
Best for: Sites with many pages (10+). Automated. Gets everything at once.
This is a Ruby gem that talks to the Wayback Machine's CDX API and downloads every archived file for a domain. It's the closest thing to a "give me my whole site back" button.
# Install Ruby if you don't have it
# macOS: comes preinstalled, or use brew install ruby
# Windows: use https://rubyinstaller.org
# Linux: sudo apt install ruby
# Install the gem
gem install wayback_machine_downloader
# Download the latest version of every file ever archived
wayback_machine_downloader https://baylesshigh.com
# Files are saved to ./websites/baylesshigh.com/
# Only download files from 2005
wayback_machine_downloader https://baylesshigh.com \
--from 20050101 \
--to 20051231
# Only download files from a single exact snapshot
wayback_machine_downloader https://baylesshigh.com \
--timestamp 20051122
| Flag | What it does |
|---|---|
| --only "/images/" | Only download URLs containing /images/ |
| --exclude "/cgi-bin/" | Skip URLs containing /cgi-bin/ |
| --all-timestamps | Download every version of every file, not just the latest. Creates timestamped directories. |
| --list | Don't download anything — just list all available files. Great for reconnaissance. |
| --concurrency 10 | Download 10 files at a time instead of the default 1. Faster, but be polite to archive.org. |
# First, see what's available
wayback_machine_downloader https://baylesshigh.com --list
# Output shows every archived URL and timestamp:
# /index.html 20051122190731
# /alumni.html 20050815142200
# /images/logo.gif 20051122190731
# /images/gym.jpg 20050815142200
# /css/style.css 20051122190731
# ... (47 files)
# Now download everything from the best year
wayback_machine_downloader https://baylesshigh.com \
--from 20050101 --to 20060101 \
--concurrency 5
Run --list first. It shows you every file the Wayback Machine has for your domain. This is priceless — it's a complete inventory of what's recoverable. Save this output. It's the table of contents for your restoration project.
Best for: Quick grabs when you don't want to install anything.
The Wayback Machine has a lesser-known feature: you can download the raw archived file by modifying the URL. Add id_ after the timestamp:
# Normal Wayback view (with toolbar, rewritten links):
https://web.archive.org/web/20051122190731/http://baylesshigh.com/
# Raw original file (no toolbar, no rewriting):
https://web.archive.org/web/20051122190731id_/http://baylesshigh.com/
That id_ suffix tells the Wayback Machine: "Give me the original file exactly as it was, don't wrap it in your viewer." This works for HTML, CSS, JavaScript, images — everything.
The normal Wayback view injects a toolbar, rewrites every URL on the page to point to archive.org, and adds tracking JavaScript. The id_ version gives you the clean original. It's the difference between getting a photocopy with someone's notes scribbled in the margins versus getting the original document.
# Download the raw HTML
curl -o index.html \
"https://web.archive.org/web/20051122190731id_/http://www.baylesshigh.com/"
# Download a raw image
curl -o images/logo.gif \
"https://web.archive.org/web/20051122190731id_/http://www.baylesshigh.com/images/logo.gif"
# Download a raw CSS file
curl -o css/style.css \
"https://web.archive.org/web/20051122190731id_/http://www.baylesshigh.com/css/style.css"
If you use the id_ URLs, the files you download are exactly what was on your server originally. No cleanup needed for Wayback toolbar code. The only things you might need to fix are links that pointed to external sites that have since disappeared.
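If you have more than a handful of assets, you can generate the id_ URLs from a plain list of paths instead of typing each curl command by hand. This is a sketch; the timestamp, domain, and assets.txt file are assumptions for illustration:

```shell
# Build raw-download (id_) URLs from a list of asset paths.
# assets.txt is a hypothetical file, one path per line.
timestamp=20051122190731
domain="http://www.baylesshigh.com"

printf '/images/logo.gif\n/css/style.css\n' > assets.txt

while read -r path; do
  echo "https://web.archive.org/web/${timestamp}id_/${domain}${path}"
done < assets.txt
# Feed the printed URLs to curl one at a time (with a sleep between
# requests) to fetch each file exactly as it was originally served.
```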
Best for: When you need specific pages and want to follow links automatically.
wget is a command-line download tool available on every Unix system and Windows (via Git Bash or WSL). It can mirror a site by following links — and it works on Wayback Machine URLs.
# Mirror the site from a specific Wayback snapshot
wget --mirror \
--convert-links \
--adjust-extension \
--page-requisites \
--no-parent \
--wait=1 \
--limit-rate=200k \
"https://web.archive.org/web/20051122190731/http://www.baylesshigh.com/"
| Flag | Purpose |
|---|---|
| --mirror | Recursive download, following links, preserving directory structure |
| --convert-links | After downloading, rewrite links to point to local files instead of URLs |
| --adjust-extension | Add .html extension to files that need it |
| --page-requisites | Download CSS, images, and JS needed for each page to display |
| --no-parent | Don't crawl up into archive.org's parent directories |
| --wait=1 | Wait 1 second between requests — be polite to archive.org |
| --limit-rate=200k | Limit download speed — don't hammer their servers |
The Internet Archive is a nonprofit running on donations. They serve billions of pages for free. Always use --wait and --limit-rate when downloading. If you're recovering a small site, the download takes an extra minute. If everyone hammered their servers, they couldn't exist. Respect the resource.
wget creates a directory structure mirroring the URL path, which means your files end up in something like:
web.archive.org/web/20051122190731/http:/www.baylesshigh.com/
├── index.html
├── alumni.html
├── images/
│ ├── logo.gif
│ └── gym.jpg
└── css/
└── style.css
Just move the contents of that deep folder up to your project root. The --convert-links flag already rewrote the internal links to be relative, so they'll work once you flatten the structure.
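That flattening step can be sketched as a small function (the name is made up; the deep path matches the example tree above):

```shell
# flatten: copy everything out of wget's deep mirror directory into the
# current directory, then delete the now-redundant web.archive.org tree.
# Hypothetical helper, not a wget feature.
flatten() {
  cp -R "$1"/. . && rm -rf "${1%%/*}"
}

# Usage (note wget's single slash after "http:"):
# flatten "web.archive.org/web/20051122190731/http:/www.baylesshigh.com"
```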
Unless you used the id_ method from Chapter 6, your downloaded HTML will be littered with Wayback Machine code. Here's exactly what to look for and remove.
The most obvious artifact. It's a block of HTML, CSS, and JavaScript injected at the top of every page:
<!-- BEGIN WAYBACK TOOLBAR INSERT -->
<script src="/_static/js/bundle-playback.js"></script>
<script>__wm.init("https://web.archive.org/web")</script>
<div id="wm-ipp-base">
... (50-100 lines of toolbar HTML)
</div>
<!-- END WAYBACK TOOLBAR INSERT -->
Delete everything between the BEGIN and END comments, inclusive.
Every link, image source, CSS reference, and script source has been rewritten to point through archive.org:
# These all need to be fixed:
# Links to other pages on your site
href="https://web.archive.org/web/20051122190731/http://www.baylesshigh.com/alumni.html"
→ href="alumni.html"
# Image sources
src="https://web.archive.org/web/20051122190731im_/http://www.baylesshigh.com/images/logo.gif"
→ src="images/logo.gif"
# CSS links
href="https://web.archive.org/web/20051122190731cs_/http://www.baylesshigh.com/css/style.css"
→ href="css/style.css"
In any decent text editor (VS Code, Sublime Text, Notepad++), use regex find-and-replace:
# Match any Wayback-rewritten URL and extract the original path
# Find (regex):
https?://web\.archive\.org/web/\d{14}(im_|cs_|js_|fw_|if_)?/https?://(?:www\.)?baylesshigh\.com(/[^"'\s]*)?
# Replace with:
$2
# This captures the path after your domain and discards the archive.org wrapper.
# /alumni.html stays as /alumni.html
# /images/logo.gif stays as /images/logo.gif
# The root URL becomes / (or empty string — manually fix to index.html)
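The same cleanup can be scripted for a whole project. This is a sketch using GNU sed with the example domain (substitute your own, and run it on a copy of your files first):

```shell
# One substitution that strips the archive.org wrapper and leaves the
# original path behind. Example domain: baylesshigh.com.
strip='s#https?://web\.archive\.org/web/[0-9]{14}(im_|cs_|js_|fw_|if_)?/https?://(www\.)?baylesshigh\.com##g'

# Demo on a single rewritten line:
echo '<img src="https://web.archive.org/web/20051122190731im_/http://www.baylesshigh.com/images/logo.gif">' \
  | sed -E "$strip"
# → <img src="/images/logo.gif">

# Apply in place to every HTML file (GNU sed; on macOS use: sed -i '' -E):
# find . -name '*.html' -exec sed -i -E "$strip" {} +
```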
Look for and remove these script blocks:
<!-- Any script referencing these paths: -->
/_static/js/bundle-playback.js
/_static/js/wombat.js
/_static/js/ruffle/
/web/20*/js_/
<!-- And any inline script containing: -->
__wm.init
__wm.wombat
WB_wombat_
_wmWindow
After your big find-and-replace, search the entire project for any remaining references:
# Search all files for leftover archive.org references
grep -r "web.archive.org" .
grep -r "wm-ipp" .
grep -r "_static/js" .
grep -r "wombat" .
If any of those return results, you have more cleanup to do. When all four return empty, you're clean.
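You can bundle those four checks into one pass. A sketch (the function name is made up):

```shell
# check_clean: succeed only if no Wayback artifacts remain anywhere
# under the given directory. Combines the four grep patterns above.
check_clean() {
  ! grep -rEq 'web\.archive\.org|wm-ipp|_static/js|wombat' "$1"
}

# check_clean ./site && echo "Clean" || echo "Cleanup needed"
```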
This is the part where most restorations hit a wall. The HTML is clean, the structure is right, but half the images are broken. Here's how to deal with it.
Use the wildcard search to see all image files that were archived:
https://web.archive.org/web/*/baylesshigh.com/images/*
This shows every image URL that was ever crawled. Click through different timestamps — an image that's missing from the 2005 snapshot might exist in the 2003 snapshot.
If an image URL returns a 404 at one timestamp, try others. The Wayback Machine stores different snapshots of the same file:
# Image missing at this timestamp:
https://web.archive.org/web/20051122id_/http://baylesshigh.com/images/gym.jpg
→ 404 Not Found
# But exists at an earlier date:
https://web.archive.org/web/20040815id_/http://baylesshigh.com/images/gym.jpg
→ 200 OK ✓
If the Wayback Machine doesn't have an image, your options:
Search Google for site:baylesshigh.com and check any cached copies of your pages — Google's cache sometimes held images the Wayback Machine doesn't, though Google has been phasing out public cached results.

The Wayback Machine's CDX API can tell you every archived version of a specific file. Query it like this:
https://web.archive.org/cdx/search/cdx?url=baylesshigh.com/images/gym.jpg&output=text
This returns a list of every timestamp that file was archived, the HTTP status code, and the file size. Try every 200-status timestamp until one gives you the image.
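The raw CDX text response is whitespace-separated, so filtering for good captures is a one-liner. The sample response below is fabricated for the demo; field 5 is the HTTP status code:

```shell
# A CDX text response has the fields:
#   urlkey timestamp original mimetype statuscode digest length
# Sample data (made up for illustration):
cdx='com,baylesshigh)/images/gym.jpg 20040815142200 http://baylesshigh.com/images/gym.jpg image/jpeg 200 AAAA 14302
com,baylesshigh)/images/gym.jpg 20051122190731 http://baylesshigh.com/images/gym.jpg image/jpeg 404 - 0'

# To work with live data instead of the sample:
# cdx=$(curl -s "https://web.archive.org/cdx/search/cdx?url=baylesshigh.com/images/gym.jpg&output=text")

# Print only timestamps whose capture returned HTTP 200:
echo "$cdx" | awk '$5 == 200 { print $2 }'
# → 20040815142200
```

Each surviving timestamp can then be plugged into an id_ URL to attempt the download.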
You now have the raw materials. The big decision: do you put the old site back online exactly as it was, or do you use the content to build something modern?
This is usually the right answer. Keep the content, rebuild the container:
This is exactly what we did with baylesshigh.com. The original 2005 site was a beautiful mess of early-web design. The content — alumni memories, sports records, school history — was the real value. We poured it into a modern single-file design and added a link to the 2005 archive so visitors can see where it came from.
You have clean HTML files, local images, and maybe a CSS file. Now you need somewhere to put them. DigitalOcean's App Platform is purpose-built for this.
Put your site in a GitHub repository and let DigitalOcean deploy it automatically every time you push.
# In your project folder:
git init
git add -A
git commit -m "Restored site from Wayback Machine"
# Create a repo on GitHub (or use the gh CLI):
gh repo create baylesshigh.com --public --source=. --push
# Now go to DigitalOcean:
# 1. Log in at cloud.digitalocean.com
# 2. Click "Apps" in the left sidebar
# 3. Click "Create App"
# 4. Choose "GitHub" as source
# 5. Select your repo
# 6. It auto-detects "Static Site"
# 7. Pick the Starter plan (free for static sites, or $4/mo for custom domain)
# 8. Deploy
Within 2-3 minutes, your site is live at your-app-xxxx.ondigitalocean.app. Every future git push automatically redeploys.
If you don't want to use GitHub:
# Install the DigitalOcean CLI
brew install doctl # macOS
snap install doctl # Linux
scoop install doctl # Windows
# Authenticate
doctl auth init # paste your API token
# Create an app spec file
cat > app.yaml << 'EOF'
name: baylesshigh-com
static_sites:
  - name: baylesshigh
    source_dir: /
    github:
      repo: youruser/baylesshigh.com
      branch: main
      deploy_on_push: true
    routes:
      - path: /
EOF
# Deploy
doctl apps create --spec app.yaml
If your restored site has hundreds of images or files, Spaces (their S3-compatible object storage) with a CDN is cheaper and faster:
# Put a CDN endpoint in front of your Space (create the Space itself
# first, in the dashboard or with s3cmd)
doctl compute cdn create \
--origin baylesshigh.nyc3.digitaloceanspaces.com \
--ttl 3600
# Upload your site files
s3cmd sync ./site/ s3://baylesshigh/ \
--acl-public \
--no-mime-magic \
--guess-mime-type
You absolutely can. They're free for static sites. The reason this guide focuses on DigitalOcean is that if you ever need to scale beyond static — add a database, run a small API, host email — everything is already in one place. But for a simple restored site, Netlify Drop (drag and drop your folder) is hard to beat for simplicity.
Your domain registrar (GoDaddy, Namecheap, Cloudflare, etc.) controls where your domain points. You need to update the DNS records to point to DigitalOcean.
# For the root domain (baylesshigh.com):
Type: A Name: @ Value: (IP from DigitalOcean dashboard)
# For www subdomain (www.baylesshigh.com):
Type: CNAME Name: www Value: your-app-xxxx.ondigitalocean.app.
Alternatively, point your domain's nameservers to DigitalOcean entirely. At your registrar, change nameservers to:
ns1.digitalocean.com
ns2.digitalocean.com
ns3.digitalocean.com
Then manage all DNS records in DigitalOcean's dashboard. This is cleaner if DigitalOcean is your only hosting provider.
App Platform handles SSL automatically. Once your domain's DNS is pointed and propagated, DigitalOcean issues a Let's Encrypt certificate. No configuration needed. Your site will be available at https:// within minutes of DNS propagation.
DNS changes take time to spread across the internet — typically anywhere from a few minutes to 48 hours, depending on the records' TTL values.
Check propagation status at dnschecker.org — it shows whether DNS has updated across servers worldwide.
Your domain had a life before. Google remembers. Here's how to reclaim that authority.
Create a simple sitemap.xml file listing all your pages:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.baylesshigh.com/</loc>
<lastmod>2026-04-01</lastmod>
</url>
<url>
<loc>https://www.baylesshigh.com/alumni.html</loc>
<lastmod>2026-04-01</lastmod>
</url>
</urlset>
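For more than a few pages, generating the sitemap beats hand-writing it. A sketch (make_sitemap is a made-up helper; adjust the base URL to your domain):

```shell
# make_sitemap: write a minimal sitemap.xml covering every .html file
# in a directory. $1 = site directory, $2 = base URL (no trailing slash).
make_sitemap() {
  {
    echo '<?xml version="1.0" encoding="UTF-8"?>'
    echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    for f in "$1"/*.html; do
      name=$(basename "$f")
      if [ "$name" = "index.html" ]; then loc="$2/"; else loc="$2/$name"; fi
      printf '  <url><loc>%s</loc></url>\n' "$loc"
    done
    echo '</urlset>'
  } > "$1/sitemap.xml"
}

# make_sitemap ./site "https://www.baylesshigh.com"
```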
Submit it in Search Console under Sitemaps > Add a new sitemap.
In Search Console, use the URL Inspection tool. Enter your homepage URL and click "Request Indexing." Google will prioritize crawling it. Do this for your most important pages.
This is critical. If your old site had a page at /alumni.html and other sites linked to that URL, you must keep that same path. If you change it to /alumni/ or /our-alumni.html, all those old backlinks hit a 404 and the SEO value evaporates.
Use the --list flag from wayback_machine_downloader to see every old URL, then make sure your restored site has content at each of those paths. For paths you're not restoring, create redirects:
# Old path → new path (Netlify-style _redirects format; adapt to
# whatever redirect mechanism your host supports)
/guestbook.html / 301
/cgi-bin/counter / 301
/~admin/old-page / 301
Use a free tool like Ahrefs Backlink Checker or Google Search Console's Links report to see who still links to your domain. These are your SEO assets. Make sure the pages they link to are live and serving content.
This matters. Let's be clear about it.
You're fine. You're recovering your own work. The Wayback Machine archived publicly accessible web pages. You own the copyright to content you created. Downloading your own work from an archive is no different from pulling files off a backup drive.
This is where it gets nuanced:
If you didn't create it, don't publish it. Use the archived content to understand what the domain was about, then create your own original content on the same topic. You get the SEO benefit of the domain's age and backlinks without the legal risk of republishing someone else's work.
The Internet Archive allows individuals to access and use their archived content. Their terms ask that you:
Recovering your own site — even using tools like wayback_machine_downloader — is well within their intended use case. They exist to preserve the web. You're preserving your part of it.
An important quirk: the Wayback Machine respects the current robots.txt on a domain, even for old snapshots. If someone buys an expired domain and puts up a robots.txt that blocks the Wayback Machine, the archive stops serving old snapshots.
This has been controversial (entire website histories have been erased this way), but it means: if you own the domain and you want the archives to be accessible, make sure your robots.txt doesn't block the Internet Archive's crawler (ia_archiver).
User-agent: *
Allow: /
# Explicitly welcome the Internet Archive
User-agent: ia_archiver
Allow: /
This means your hosting expired before most of the crawls happened. The Wayback Machine dutifully archived the GoDaddy/Namecheap parked page. Look at the earliest timestamps — the real site is often there before the parking page took over. Use the calendar view to go back year by year until you find real content.
Try the id_ URL pattern from Chapter 6. Sometimes the regular Wayback view serves images through a CDN wrapper that breaks direct downloads, but the id_ URL serves the raw file.
Also try different timestamps for the same image. The CDX API query from Chapter 9 shows every archived version.
Sort of. The Wayback Machine has integrated Ruffle, a Flash emulator, into their viewer. Old Flash content plays in the Wayback Machine's browser player. But you can't meaningfully "restore" Flash content to a modern site. Extract whatever text and images were in the Flash files and rebuild in HTML. The Flash era is over and browsers don't support it.
This happens with single-page apps (React, Angular, Vue). The HTML is just a shell; the content was rendered by JavaScript. The Wayback Machine sometimes captures the rendered output and sometimes just the shell. If you only have the shell, the content is effectively lost unless you can find a snapshot where the rendered HTML was captured.
For JavaScript-heavy sites, try archive.today — it renders pages before archiving, so it sometimes captures content that archive.org missed.
If the Wayback Machine didn't capture your CSS files, the HTML will render as unstyled text. Options:
If you're a researcher, journalist, historian, or librarian — the Wayback Machine is there for you. Access and reference freely. But publishing someone else's content on a domain you own crosses into copyright territory. See Chapter 14.
The Internet Archive serves an enormous amount of data on a nonprofit budget. If you're hitting rate limits, add --wait 2 or higher to your download commands and lower your concurrency.

If you're the verified owner of a domain and need comprehensive recovery, the Internet Archive's team can sometimes help. They have more data than the public interface exposes. Reach them through archive.org/about/contact. Be polite, be specific about what you need, and remember they're a nonprofit team handling millions of requests.
This guide was written by Paul Walhus, who restored baylesshigh.com from a 2005 Wayback Machine snapshot and rebuilt it as a modern static site. The domain, the stories, and the Bronchos live on.