Self-Hosting ArchiveBox with Docker Compose

What Is ArchiveBox?

ArchiveBox is a self-hosted web archiver that saves snapshots of web pages in multiple formats — HTML, PDF, screenshot, WARC, and more. Feed it URLs from bookmarks, RSS feeds, or browser history, and it builds a searchable, offline-accessible archive. Think of it as your own personal Wayback Machine. It replaces reliance on archive.org, Pocket’s saved pages, and browser bookmark rot.

Official site: archivebox.io | GitHub: github.com/ArchiveBox/ArchiveBox

Docker Compose Configuration

Create a directory for ArchiveBox:

mkdir archivebox && cd archivebox
mkdir data

Create a docker-compose.yml file:

services:
  archivebox:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox
    command: server --quick-init 0.0.0.0:8000
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - ALLOWED_HOSTS=*
      - PUBLIC_INDEX=true
      - PUBLIC_SNAPSHOTS=true
      - PUBLIC_ADD_VIEW=false
      - SEARCH_BACKEND_ENGINE=ripgrep
      - MEDIA_MAX_SIZE=750m
      - TIMEOUT=60
      - CHECK_SSL_VALIDITY=true
      - SAVE_ARCHIVE_DOT_ORG=true
    volumes:
      - ./data:/data
    networks:
      - archivebox-net

  # Optional: Sonic full-text search (faster than ripgrep for large archives)
  # sonic:
  #   image: archivebox/sonic:latest  # ArchiveBox's Sonic image, preconfigured via env vars
  #   container_name: archivebox-sonic
  #   restart: unless-stopped
  #   environment:
  #     - SEARCH_BACKEND_PASSWORD=changeme_sonic
  #   volumes:
  #     - sonic-data:/var/lib/sonic/store
  #   networks:
  #     - archivebox-net

volumes:
  sonic-data:

networks:
  archivebox-net:

Initialize the archive and create an admin account:

docker compose run --rm archivebox init --setup
docker compose run --rm archivebox manage createsuperuser

Start the server:

docker compose up -d

Prerequisites

  • A Linux server (Ubuntu 22.04+ recommended)
  • Docker and Docker Compose installed (guide)
  • 5 GB of free disk space (grows with your archive)
  • 1 GB of RAM minimum, 2 GB recommended
  • A domain name (optional, for remote access)

Initial Setup

Access ArchiveBox at http://your-server-ip:8000. Log in with the superuser credentials you created during initialization.

To add URLs to your archive:

Via the web UI: Click “Add” in the top bar and paste URLs (one per line).

Via CLI:

# Add a single URL
docker compose run --rm archivebox add "https://example.com/article"

# Add from a bookmarks file
docker compose run --rm archivebox add < bookmarks.html

# Add from an RSS feed (--depth=1 also archives the pages the feed links to,
# not just the feed XML itself)
docker compose run --rm archivebox add --depth=1 "https://example.com/feed.xml"

What Gets Archived

For each URL, ArchiveBox saves multiple output formats:

| Format      | Tool                | Description                            |
|-------------|---------------------|----------------------------------------|
| HTML        | wget                | Full static HTML snapshot with assets  |
| PDF         | Chrome/Chromium     | Rendered PDF of the page               |
| Screenshot  | Chrome/Chromium     | Full-page PNG screenshot               |
| WARC        | wget                | Web ARChive format (industry standard) |
| Readability | Mozilla Readability | Clean article text extraction          |
| SingleFile  | SingleFile          | Complete page in one HTML file         |
| Git         | git                 | Clone entire git repos                 |
| Media       | yt-dlp              | Download videos, audio, playlists      |
| Headers     | curl                | HTTP response headers                  |

Configuration

Key environment variables:

| Variable              | Default | Description                                                         |
|-----------------------|---------|---------------------------------------------------------------------|
| ALLOWED_HOSTS         | *       | Hostnames the server will respond to (comma-separated)              |
| PUBLIC_INDEX          | true    | Make the archive index publicly accessible                          |
| PUBLIC_SNAPSHOTS      | true    | Make individual snapshots publicly accessible                       |
| PUBLIC_ADD_VIEW       | false   | Allow unauthenticated users to add URLs                             |
| SEARCH_BACKEND_ENGINE | ripgrep | Search backend: ripgrep, sonic, or sqlite                           |
| MEDIA_MAX_SIZE        | 750m    | Maximum file size for media downloads                               |
| TIMEOUT               | 60      | Download timeout in seconds per extractor                           |
| CHECK_SSL_VALIDITY    | true    | Verify SSL certificates; set to false to archive pages with invalid certs |
| SAVE_ARCHIVE_DOT_ORG  | true    | Also submit each URL to the Wayback Machine as an off-site backup   |
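Environment variables only apply to the container they are set on. Settings can also be persisted into the archive itself with the config subcommand, which writes them to ./data/ArchiveBox.conf (commands shown assume the current CLI; check archivebox config --help on your version):

```shell
# Show the current effective configuration
docker compose run --rm archivebox config

# Persist a setting into ./data/ArchiveBox.conf
docker compose run --rm archivebox config --set PUBLIC_INDEX=False
```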

Scheduled Archiving

Add a second service that runs ArchiveBox's built-in schedule command in the foreground, so tracked URLs are re-archived automatically:

  scheduler:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox-scheduler
    command: schedule --foreground --every=day --depth=0
    restart: unless-stopped
    environment:
      - TIMEOUT=120
    volumes:
      - ./data:/data
    networks:
      - archivebox-net
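The scheduler container executes whatever schedules have been registered in the data directory. A schedule can be registered once with the schedule subcommand; the feed URL below is a placeholder for your own:

```shell
# Register a daily re-import of a feed; the scheduler container then runs it
docker compose run --rm archivebox schedule --every=day --depth=1 "https://example.com/feed.xml"
```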

Full-Text Search with Sonic

For large archives (10,000+ pages), switch from ripgrep to Sonic for faster full-text search. Uncomment the sonic service in the Compose file and update:

SEARCH_BACKEND_ENGINE=sonic
SEARCH_BACKEND_HOST_NAME=sonic
SEARCH_BACKEND_PASSWORD=changeme_sonic
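Existing snapshots are not reindexed automatically when the backend changes. After recreating the container, the search index can be rebuilt without re-downloading anything; the --index-only flag exists in recent releases, but verify it with archivebox update --help on your version:

```shell
# Recreate the container so the new environment variables take effect
docker compose up -d

# Rebuild the search index for existing snapshots
docker compose run --rm archivebox update --index-only
```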

Reverse Proxy

ArchiveBox serves on port 8000. For HTTPS with a reverse proxy, see Reverse Proxy Setup.

Set ALLOWED_HOSTS to your domain name when using a reverse proxy:

ALLOWED_HOSTS=archive.example.com
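As a rough sketch of what the proxy side looks like (assuming nginx on the same host and the placeholder domain archive.example.com; TLS directives are omitted here and covered in the Reverse Proxy Setup guide):

```nginx
server {
    listen 80;
    server_name archive.example.com;

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```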

Backup

The entire archive lives in the ./data directory. Back up this directory to preserve:

  • ./data/archive/ — all archived page snapshots
  • ./data/index.sqlite3 — the database of all URLs and metadata
  • ./data/ArchiveBox.conf — your configuration

For a quick manual backup:

tar czf archivebox-backup-$(date +%Y%m%d).tar.gz ./data

For a comprehensive backup strategy, see Backup Strategy.
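The one-off tar command above can be grown into a small rotating job. A minimal sketch; the backups directory and the 14-day retention are arbitrary choices:

```shell
#!/bin/sh
# Nightly backup sketch for ArchiveBox. Assumes it runs from the compose
# directory (the one containing ./data). Adjust paths and retention to taste.
set -eu

DATA_DIR="./data"
BACKUP_DIR="./backups"
STAMP=$(date +%Y%m%d)

# mkdir -p is a no-op when the directories already exist
mkdir -p "$DATA_DIR" "$BACKUP_DIR"

tar czf "$BACKUP_DIR/archivebox-backup-$STAMP.tar.gz" "$DATA_DIR"

# Prune backups older than two weeks
find "$BACKUP_DIR" -name 'archivebox-backup-*.tar.gz' -mtime +14 -delete
```

For a consistent copy of index.sqlite3, prefer running this while the containers are stopped (docker compose stop) or while the server is idle.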

Troubleshooting

Chrome/Chromium Fails to Start

Symptom: PDF and screenshot extraction fails with “Chrome not found” or “Failed to launch Chrome.”

Fix: The Docker image ships with Chromium. If you’re running outside Docker, install Chromium:

apt install chromium-browser

“Permission Denied” on Data Directory

Symptom: ArchiveBox can’t write to /data inside the container.

Fix: Set ownership on the host data directory:

sudo chown -R 911:911 ./data

Large Archives Slow Down

Symptom: Search and browsing become slow above 5,000+ snapshots.

Fix: Switch from ripgrep to the Sonic search backend. Add the Sonic service and update SEARCH_BACKEND_ENGINE=sonic.

yt-dlp Errors on Media Downloads

Symptom: Video downloads fail with “Unable to extract” or similar errors.

Fix: yt-dlp needs frequent updates as sites change. Update the container image or run:

docker compose exec archivebox pip install --upgrade yt-dlp

Resource Requirements

  • RAM: ~300 MB idle, spikes to 1-2 GB during active archiving (Chrome rendering)
  • CPU: Medium — Chrome PDF/screenshot generation is CPU-intensive
  • Disk: ~1 MB per page average (varies widely — media-heavy pages use much more)

Verdict

ArchiveBox is the most comprehensive self-hosted web archiver available. The multi-format approach (HTML + PDF + screenshot + WARC) means you have redundant copies of everything. It’s ideal for researchers, journalists, or anyone who’s lost a crucial bookmark to link rot. The trade-off is resource usage — Chrome-based archiving is heavy. For lighter use cases where you just want to save article text, Wallabag or Hoarder are simpler options.

Frequently Asked Questions

What does ArchiveBox actually save?

ArchiveBox saves multiple copies of each URL: full HTML with assets, a PDF snapshot, a screenshot, a WARC archive, plain text extraction, Git history (for repos), and media files (via yt-dlp). This multi-format approach provides redundancy against format obsolescence.

How much disk space does ArchiveBox use?

Roughly 1 MB per page on average, but it varies widely. Text-heavy pages use less; media-heavy sites with videos can use gigabytes per page. Set SAVE_MEDIA=false and SAVE_WARC=false to reduce disk usage if you only need HTML and screenshots.

Can ArchiveBox replace the Wayback Machine?

For personal use, yes. ArchiveBox creates local, searchable archives of web pages. It doesn’t provide public access like the Internet Archive does (unless you expose the web UI). It’s best for personal research, journalism, and protecting against link rot.

Does ArchiveBox archive dynamic/JavaScript-rendered pages?

Yes. ArchiveBox uses headless Chrome for rendering, which executes JavaScript before saving the page. This captures content from SPAs and dynamically loaded pages that wget-based tools miss.

Can I schedule automatic archiving?

Yes. ArchiveBox ships with an archivebox schedule command (used by the scheduler service shown above) that re-imports and re-archives URLs on a recurring basis. Alternatively, use cron to run archivebox add on a schedule, feeding it an RSS feed, a bookmarks file, or a list of URLs.
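For the cron route, an entry like the following would work; the /opt/archivebox path, feed URL, and log path are placeholders for your own:

```
# m h dom mon dow  command
0 3 * * *  cd /opt/archivebox && docker compose run --rm archivebox add --depth=1 "https://example.com/feed.xml" >> /var/log/archivebox-cron.log 2>&1
```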

How does ArchiveBox compare to Wallabag?

Wallabag is a read-later app that extracts and saves article text. ArchiveBox is a full web archiver that saves complete pages with all assets. Wallabag is lighter and better for reading; ArchiveBox is better for preservation.
