A wrapper for aria2 to mirror open directories.
This CLI tool emulates the behavior of wget --recursive, but adds filters,
checks, and caching, and uses aria2c to perform the actual downloads.
The crawler/link extraction works on HTTP(S) directory listings and HTML pages; individual resources can be downloaded from any scheme supported by aria2 (e.g. HTTP(S), FTP/FTPS).
go build .
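The resulting binary can then be run directly (assuming go build produced a raria2 executable in the current directory):
# print the usage summary shown below
./raria2 --help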
Usage: raria2 [--output OUTPUT] [--dry-run] [--max-connection-per-server CONNECTIONS] [--max-concurrent-downloads DOWNLOADS] [--threads THREADS] [--aria2-session-size SIZE] [--max-depth DEPTH] [--accept EXT] [--reject EXT] [--accept-filename GLOB] [--reject-filename GLOB] [--case-insensitive-paths] [--accept-path PATTERN] [--reject-path PATTERN] [--visited-cache FILE] [--write-batch FILE] [--http-timeout DURATION] [--user-agent UA] [--log-level LEVEL] [--rate-limit RATE] [--respect-robots] [--accept-mime TYPES] [--reject-mime TYPES] URL [-- ARIA2_OPTS...]
Positional arguments:
URL The URL to fetch resources from
ARIA2_OPTS Options forwarded to aria2c after the URL (use -- before them if
they look like flags)
Options:
--output OUTPUT, -o OUTPUT
Output directory. If omitted, raria2 mirrors into
<host>/<path>/ derived from the URL (similar to wget)
--dry-run, -d Dry run (do not download anything) [default: false]
--max-connection-per-server CONNECTIONS, -x CONNECTIONS
Parallel connections per download [default: 5]
--max-concurrent-downloads DOWNLOADS, -j DOWNLOADS
Maximum concurrent downloads [default: 5]
--threads THREADS, -w THREADS
Concurrent crawler threads [default: 5]
--aria2-session-size SIZE
Number of links to feed a single aria2 process before
closing stdin and restarting it. Defaults to 100 entries;
set to 0 to keep a single session. See "Aria2 stdin bug
workaround" below for details.
[default: 100]
--max-depth DEPTH Maximum HTML depth to crawl (-1 for unlimited) [default: -1]
--accept EXT Comma-separated list(s) of extensions to include (no dot, case-insensitive)
--reject EXT Comma-separated list(s) of extensions to exclude
--accept-filename GLOB Comma-separated list(s) of filename globs to include
--reject-filename GLOB Comma-separated list(s) of filename globs to exclude
--case-insensitive-paths
Make path matching case-insensitive
--accept-path PATTERN Path glob (default) or regex:<expr> that must match to crawl/download
--reject-path PATTERN Path glob or regex to skip
--visited-cache FILE Persist visited URLs to this file so interrupted runs can resume
--write-batch FILE Write aria2 input file to disk instead of executing
--http-timeout DURATION
HTTP client timeout as Go duration string (e.g. 30s, 2m) [default: 30s]
--user-agent UA Custom User-Agent string [default: raria2/1.0]
--log-level LEVEL Log level (panic,fatal,error,warn,info,debug,trace) [default: info]
--rate-limit RATE Rate limit for HTTP requests in requests per second (0 disables limiting) [default: 0]
--respect-robots Respect robots.txt when crawling [default: false]
--accept-mime TYPES Comma-separated list of MIME types to include
--reject-mime TYPES Comma-separated list of MIME types to exclude
--help, -h display this help and exit
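If --output is omitted, the mirror directory is derived from the URL, as described under --output; the URL below is a placeholder:
# no -o given: files are mirrored into example.com/pub/isos/ (host/path, like wget)
raria2 'https://example.com/pub/isos/'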
# dry run mirroring into host/path structure automatically
raria2 -d 'https://proof.ovh.net/files/' -- --max-download-limit=1M
# explicitly setting output directory and concurrency knobs (crawler + aria2)
raria2 -d -o output -x 10 -j 8 -w 12 'https://mirror.nforce.com/pub/speedtests/' -- --max-download-limit=1M
# customize the HTTP timeout (here: 2 minutes)
raria2 --http-timeout=2m 'https://example.com/pub/'
# limit crawl depth to first directory level
raria2 --max-depth=1 'https://example.com/pub/'
# only download .iso files inside /iso/ paths and persist visited cache
raria2 --accept=iso --accept-path='glob:/iso/**' --visited-cache=visited.txt 'https://mirror.example.com/'
# Advanced filtering: filename globs and case-insensitive paths
raria2 --accept-filename='release-*' --reject-filename='*.tmp' --case-insensitive-paths --accept-path='glob:**/Files/**' 'https://example.com/pub/'
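The path filters also accept regular expressions via the regex: prefix documented under --accept-path; the expression below is illustrative and assumes it is matched against the URL path:
# regex instead of glob: only crawl paths under /pub/iso/ or /pub/images/
raria2 --accept-path='regex:^/pub/(iso|images)/' 'https://example.com/'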
# Generate batch file for later use instead of immediate download
raria2 --write-batch downloads.txt --dry-run 'https://example.com/pub/'
# Later use: aria2c --input-file=downloads.txt
# Rate-limited crawling with custom User-Agent
raria2 --rate-limit=2 --user-agent='MyBot/1.0' 'https://example.com/pub/'
# Respect robots.txt while crawling
raria2 --respect-robots 'https://example.com/pub/'
# MIME-type filtering - only download PDFs and images
raria2 --accept-mime 'application/pdf,image/jpeg,image/png' 'https://example.com/files/'
# Exclude binary executables and archives
raria2 --reject-mime 'application/octet-stream,application/x-executable,application/zip' 'https://example.com/pub/'
Aria2 stdin bug workaround

Aria2 has an open bug (aria2/aria2#1161)
where downloads fed via stdin might not start until the input stream closes.
When crawling large trees, use --aria2-session-size to periodically close and
restart aria2 so downloads begin before the crawl finishes. This option only
applies when streaming URLs directly to aria2 (normal mode), not when
--write-batch is used.
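For example, to restart aria2c after every 50 queued links (the chunk size here is illustrative):
# smaller sessions mean downloads start well before the crawl completes
raria2 --aria2-session-size=50 'https://example.com/pub/'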
The --visited-cache option allows you to persist visited URLs between runs, enabling resume functionality:
# First run - crawl and cache visited URLs
raria2 --visited-cache=visited.txt 'https://example.com/pub/'
# Interrupted run - resume from where you left off
raria2 --visited-cache=visited.txt 'https://example.com/pub/'

Use --write-batch to create an aria2 input file for manual control. Because
the URLs are written to a file rather than streamed to aria2c's stdin, this
mode is also unaffected by the aria2 stdin bug (aria2/aria2#1161) described
above.
# Create batch file without downloading
raria2 --write-batch downloads.txt 'https://example.com/pub/'
# Use the batch file later with aria2
aria2c --input-file=downloads.txt
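aria2c accepts its usual flags alongside --input-file, so download concurrency can still be tuned at that stage:
# 8 concurrent downloads, up to 10 connections per server (standard aria2c flags)
aria2c --input-file=downloads.txt -j 8 -x 10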