Self-Hosting ArchiveBox with Docker Compose
What Is ArchiveBox?
ArchiveBox is a self-hosted web archiver that saves snapshots of web pages in multiple formats: HTML, PDF, screenshot, WARC, and more. Feed it URLs from bookmarks, RSS feeds, or browser history, and it builds a searchable, offline-accessible archive. Think of it as your own personal Wayback Machine. It reduces your reliance on archive.org and Pocket’s saved pages, and it protects your bookmarks against link rot.
Official site: archivebox.io | GitHub
Docker Compose Configuration
Create a directory for ArchiveBox:
mkdir archivebox && cd archivebox
mkdir data
Create a docker-compose.yml file:
services:
  archivebox:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox
    command: server --quick-init 0.0.0.0:8000
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - ALLOWED_HOSTS=*
      - PUBLIC_INDEX=true
      - PUBLIC_SNAPSHOTS=true
      - PUBLIC_ADD_VIEW=false
      - SEARCH_BACKEND_ENGINE=ripgrep
      - MEDIA_MAX_SIZE=750m
      - TIMEOUT=60
      - CHECK_SSL_VALIDITY=true
      - SAVE_ARCHIVE_DOT_ORG=true
    volumes:
      - ./data:/data
    networks:
      - archivebox-net
  # Optional: Sonic full-text search (faster than ripgrep for large archives)
  # sonic:
  #   image: valeriansaliou/sonic:v1.4.9
  #   container_name: archivebox-sonic
  #   restart: unless-stopped
  #   environment:
  #     - SEARCH_BACKEND_PASSWORD=changeme_sonic
  #   volumes:
  #     - sonic-data:/var/lib/sonic/store
  #   networks:
  #     - archivebox-net
volumes:
  sonic-data:
networks:
  archivebox-net:
Initialize the archive and create an admin account:
docker compose run --rm archivebox init --setup
docker compose run --rm archivebox manage createsuperuser
Start the server:
docker compose up -d
Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed (guide)
- 5 GB of free disk space (grows with your archive)
- 1 GB of RAM minimum, 2 GB recommended
- A domain name (optional, for remote access)
Initial Setup
Access ArchiveBox at http://your-server-ip:8000. Log in with the superuser credentials you created during initialization.
To add URLs to your archive:
Via the web UI: Click “Add” in the top bar and paste URLs (one per line).
Via CLI:
# Add a single URL
docker compose run --rm archivebox add "https://example.com/article"
# Add from a bookmarks file
docker compose run --rm -T archivebox add < bookmarks.html
# Add every link found in an RSS feed (--depth=1 follows the feed's entries)
docker compose run --rm archivebox add --depth=1 "https://example.com/feed.xml"
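If you keep a plain-text list of URLs, it can help to filter it before importing. The sketch below is a hypothetical helper, not part of ArchiveBox: the function name clean_url_list and the file name urls.txt are assumptions. It keeps only lines that start with http:// or https:// and drops repeated entries, so the result can be piped to archivebox add on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ArchiveBox): print the http(s) URLs found in
# a text file, one per line, skipping blanks, comments, and repeated entries.
clean_url_list() {
  local file="$1"
  # grep keeps lines that start with http:// or https://;
  # awk prints each distinct line only the first time it appears.
  grep -E '^https?://' "$file" | awk '!seen[$0]++'
}

# Example usage (urls.txt is an assumed file name; -T disables the TTY so that
# stdin can be piped into the container):
#   clean_url_list urls.txt | docker compose run --rm -T archivebox add
```

Filtering on the host keeps the import reproducible: the same cleaned list can be reviewed or re-run later without touching the container.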
What Gets Archived
For each URL, ArchiveBox saves multiple output formats:
| Format | Tool | Description |
|---|---|---|
| HTML | wget | Full static HTML snapshot with assets |
| PDF | Chrome/Chromium | Rendered PDF of the page |
| Screenshot | Chrome/Chromium | Full-page PNG screenshot |
| WARC | wget | Web Archive format (industry standard) |
| Readability | Mozilla Readability | Clean article text extraction |
| SingleFile | SingleFile | Complete page in one HTML file |
| Git | git | Clone entire git repos |
| Media | yt-dlp | Download videos, audio, playlists |
| Headers | curl | HTTP response headers |
Configuration
Key environment variables:
| Variable | Default | Description |
|---|---|---|
| ALLOWED_HOSTS | * | Restrict access by domain (comma-separated) |
| PUBLIC_INDEX | true | Make archive index publicly accessible |
| PUBLIC_SNAPSHOTS | true | Make individual snapshots publicly accessible |
| PUBLIC_ADD_VIEW | false | Allow unauthenticated users to add URLs |
| SEARCH_BACKEND_ENGINE | ripgrep | Search backend: ripgrep, sonic, or sqlite |
| MEDIA_MAX_SIZE | 750m | Maximum file size for media downloads |
| TIMEOUT | 60 | Download timeout in seconds per extractor |
| CHECK_SSL_VALIDITY | true | Validate SSL certificates; set to false to archive pages with invalid or self-signed certificates |
| SAVE_ARCHIVE_DOT_ORG | true | Submit URLs to the Wayback Machine as backup |
Scheduled Archiving
Add a scheduler service to automatically re-archive URLs on a schedule:
  scheduler:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox-scheduler
    command: schedule --foreground --every=day --depth=0
    restart: unless-stopped
    environment:
      - TIMEOUT=120
    volumes:
      - ./data:/data
    networks:
      - archivebox-net
Full-Text Search with Sonic
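For large archives (10,000+ pages), switch from ripgrep to Sonic for faster full-text search. Uncomment the sonic service in the Compose file and set these variables in the archivebox service’s environment (the password must match the one given to the sonic service):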
SEARCH_BACKEND_ENGINE=sonic
SEARCH_BACKEND_HOST_NAME=sonic
SEARCH_BACKEND_PASSWORD=changeme_sonic
Reverse Proxy
ArchiveBox serves on port 8000. For HTTPS with a reverse proxy, see Reverse Proxy Setup.
Set ALLOWED_HOSTS to your domain name when using a reverse proxy:
ALLOWED_HOSTS=archive.example.com
Backup
The entire archive lives in the ./data directory. Back up this directory to preserve:
- ./data/archive/: all archived page snapshots
- ./data/index.sqlite3: the database of all URLs and metadata
- ./data/ArchiveBox.conf: your configuration
tar czf archivebox-backup-$(date +%Y%m%d).tar.gz ./data
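Dated archives pile up quickly, so a rotation step is a common companion to the tar command above. The following is a minimal sketch under stated assumptions: prune_backups is a hypothetical helper, the backups directory is an assumed location, and the retention count of 7 is arbitrary. Stopping the containers before running tar is suggested only as a precaution against copying the SQLite index while it is being written.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete all but the newest N archivebox-backup-*.tar.gz
# files in a directory. Relies on the YYYYMMDD date in the file name, so that
# alphabetical order is also chronological order.
prune_backups() {
  local dir="$1" keep="$2"
  # List matching archives newest-first, skip the first $keep entries,
  # and delete whatever remains.
  ls -1 "$dir"/archivebox-backup-*.tar.gz 2>/dev/null \
    | sort -r \
    | tail -n +"$((keep + 1))" \
    | while IFS= read -r old; do rm -f -- "$old"; done
}

# Example usage (backups/ and the count of 7 are assumptions):
#   docker compose stop    # precaution: avoid copying the index mid-write
#   tar czf "backups/archivebox-backup-$(date +%Y%m%d).tar.gz" ./data
#   docker compose start
#   prune_backups backups 7
```

Because the helper only matches the archivebox-backup-*.tar.gz pattern, other files in the same directory are left alone.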
For a comprehensive backup strategy, see Backup Strategy.
Troubleshooting
Chrome/Chromium Fails to Start
Symptom: PDF and screenshot extraction fails with “Chrome not found” or “Failed to launch Chrome.”
Fix: The Docker image ships with Chromium. If you’re running outside Docker, install Chromium:
apt install chromium-browser
“Permission Denied” on Data Directory
Symptom: ArchiveBox can’t write to /data inside the container.
Fix: Set ownership on the host data directory:
sudo chown -R 911:911 ./data
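To confirm whether ownership is actually the problem before changing it, you can compare the directory's owner with the expected value. This is a small sketch, assuming GNU stat as found on typical Linux servers; owned_by is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if a path is owned by the given UID:GID.
# Uses GNU stat (the -c flag), so it is intended for Linux hosts.
owned_by() {
  local path="$1" want="$2"
  [ "$(stat -c '%u:%g' "$path")" = "$want" ]
}

# Example usage, with the 911:911 owner from the fix above:
#   owned_by ./data 911:911 || sudo chown -R 911:911 ./data
```

The check only looks at the top-level directory; files created inside it by another user would still need the recursive chown.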
Large Archives Slow Down
Symptom: Search and browsing become slow once the archive grows past roughly 5,000 snapshots.
Fix: Switch from ripgrep to the Sonic search backend. Add the Sonic service and update SEARCH_BACKEND_ENGINE=sonic.
yt-dlp Errors on Media Downloads
Symptom: Video downloads fail with “Unable to extract” or similar errors.
Fix: yt-dlp needs frequent updates as sites change. Update the container image or run:
docker compose exec archivebox pip install --upgrade yt-dlp
Resource Requirements
- RAM: ~300 MB idle, spikes to 1-2 GB during active archiving (Chrome rendering)
- CPU: Medium — Chrome PDF/screenshot generation is CPU-intensive
- Disk: ~1 MB per page average (varies widely — media-heavy pages use much more)
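The disk figure can be turned into a quick planning estimate: multiply the number of pages by the average size per page. The sketch below uses a hypothetical helper, estimate_disk_gb, and treats the 1 MB average as a default that you should replace with your own measurement.

```shell
#!/usr/bin/env bash
# Rough disk estimate: pages * average MB per page, converted to whole GB
# (rounded up). The 1 MB default is the article's average, not a guarantee.
estimate_disk_gb() {
  local pages="$1" mb_per_page="${2:-1}"
  local total_mb=$(( pages * mb_per_page ))
  echo $(( (total_mb + 1023) / 1024 ))
}

# Example usage:
#   estimate_disk_gb 10000      # 10,000 pages at 1 MB each
#   estimate_disk_gb 10000 5    # the same archive if pages average 5 MB
```

Media-heavy pages can skew the average far upward, so it is worth re-running the estimate with a figure taken from your own ./data directory once you have a few hundred snapshots.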
Verdict
ArchiveBox is the most comprehensive self-hosted web archiver available. The multi-format approach (HTML + PDF + screenshot + WARC) means you have redundant copies of everything. It’s ideal for researchers, journalists, or anyone who’s lost a crucial bookmark to link rot. The trade-off is resource usage — Chrome-based archiving is heavy. For lighter use cases where you just want to save article text, Wallabag or Hoarder are simpler options.
Frequently Asked Questions
What does ArchiveBox actually save?
ArchiveBox saves multiple copies of each URL: full HTML with assets, a PDF snapshot, a screenshot, a WARC archive, plain text extraction, Git history (for repos), and media files (via yt-dlp). This multi-format approach provides redundancy against format obsolescence.
How much disk space does ArchiveBox use?
Roughly 1 MB per page on average, but it varies widely. Text-heavy pages use less; media-heavy sites with videos can use gigabytes per page. Set SAVE_MEDIA=false and SAVE_WARC=false to reduce disk usage if you only need HTML and screenshots.
Can ArchiveBox replace the Wayback Machine?
For personal use, yes. ArchiveBox creates local, searchable archives of web pages. It doesn’t provide public access like the Internet Archive does (unless you expose the web UI). It’s best for personal research, journalism, and protecting against link rot.
Does ArchiveBox archive dynamic/JavaScript-rendered pages?
Yes. ArchiveBox uses headless Chrome for rendering, which executes JavaScript before saving the page. This captures content from SPAs and dynamically loaded pages that wget-based tools miss.
Can I schedule automatic archiving?
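Yes. ArchiveBox includes a schedule command that can run as its own container, as shown in the Scheduled Archiving section. Alternatively, you can use cron on the host to run archivebox add on a schedule. Feed it an RSS feed, a bookmarks file, or a list of URLs to archive periodically.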
How does ArchiveBox compare to Wallabag?
Wallabag is a read-later app that extracts and saves article text. ArchiveBox is a full web archiver that saves complete pages with all assets. Wallabag is lighter and better for reading; ArchiveBox is better for preservation.
Related
- Guide to Self-Hosted Web Archiving
- ArchiveBox vs Kiwix: Which to Self-Host?
- ArchiveBox vs Wallabag: Which Should You Self-Host?
- ArchiveBox vs Wayback Machine: Self-Hosted Archiving
- Self-Hosting Wallabag with Docker Compose
- Self-Hosting Linkwarden with Docker Compose
- Self-Hosting Hoarder with Docker Compose
- Best Self-Hosted Bookmarks & Read Later Tools
- Docker Compose Basics
- Reverse Proxy Setup
- Backup Strategy