Dimensionless Developments presents Rust Link Web Crawler: a high-performance web crawler and broken link checker built with Rust. This project demonstrates modern async Rust development with a full-stack web application featuring server-side rendering, concurrent HTTP requests, and comprehensive link validation.
Broken web links harm user experience, SEO, and compliance, and carry direct financial consequences.
- User Experience & Trust: Broken links frustrate users and drive up bounce rates. Research shows 88% of online consumers are less likely to return to a site after a bad experience, which damages brand credibility and reduces the chance of conversion.
SEO Impact:
- Crawl Budget Waste: Search engine bots waste time on broken links, reducing efficient crawling of valid pages, which can harm indexing and visibility.
- Lower Rankings: Google considers user behavior signals like bounce rate and dwell time; high bounce rates from broken links can negatively affect rankings.
- Loss of Link Equity (Link Juice): Broken internal and external links fail to pass SEO value. This reduces the authority of linked pages and wastes the value of backlinks from other sites.
Financial Implications:
- Lost Revenue: Broken call-to-action (CTA) links, product page links, or demo links directly result in missed sales, conversions, and leads.
- Reduced Referral Traffic: When external sites link to your broken content, their visitors bounce, harming your referral traffic and partnership value.
- Compliance Fines: In regulated industries like finance or healthcare, broken links to privacy policies, licensing, or legal documents can lead to regulatory penalties (e.g., GDPR, CCPA, Reg Z) and legal risks.
- Long-Term Costs: Time and resources spent on content creation are wasted if the content is inaccessible due to broken links. Fixing links proactively is far cheaper than recovering lost traffic and reputation.
These are the problems the Dimensionless Rust Link Crawler is built to solve.
Rust Link Crawler is a production-ready web crawler that:
- Traverses websites with configurable depth levels to discover all links
- Validates link status by checking HTTP response codes (working vs broken)
- Runs concurrently using Rust's async/await for lightning-fast performance
- Provides a beautiful UI with real-time crawl results and detailed link analysis
- Ensures memory safety with Rust's compile-time guarantees (no null pointer errors, no data races)
- Filters same-domain links to avoid crawling the entire internet
- Fast: Concurrent HTTP requests powered by the Tokio async runtime
- Safe: Compile-time memory safety with zero-cost abstractions
- Reliable: Robust error handling with fallback mechanisms (HEAD → GET requests)
- User-friendly: Beautiful web interface with live progress updates
- Configurable: Adjust crawl depth (0-5 levels) for different needs
1. Start with root URL
2. Fetch the page → Extract all `<a>` links
3. Check status of each link (working or broken?)
4. Add unvisited same-domain links to queue
5. Repeat for each URL up to max_depth
6. Return all results
- Rust 1.70+ (Install Rust)
- Windows, macOS, or Linux
```bash
# Clone the repository
git clone https://github.com/DimensionlessDevelopments/Dimensionless-Rust-Web-Crawler.git
cd Dimensionless-Rust-Web-Crawler

# Build the project
cargo build -p dimensionless_crawler_core
cargo build -p server

# Run the server
cargo run -p server
```

The crawler will start on http://127.0.0.1:3000.
- Open your browser to http://127.0.0.1:3000
- Enter a website URL (e.g., https://example.com)
- Set crawl depth (0-5):
  - 0: Only check the starting URL
  - 1: Check the page + all links on that page
  - 2+: Recursively follow links deeper
- Click "Start Crawling"
- View results with ✅ working links and ❌ broken links
Example Output:
```
=== Crawl Results ===
Total links found: 42
✅ Working links: 40
❌ Broken links: 2

--- Broken Links ---
❌ https://example.com/broken-page (Status: 404)
❌ https://example.com/old-resource (Status: 410)

--- All Links ---
✅ https://example.com (Status: 200)
✅ https://example.com/about (Status: 200)
...
```
| Technology | Version | Purpose |
|---|---|---|
| Rust | 2021 Edition | Language & type system |
| Tokio | 1.49 | Async runtime (multi-threaded) |
| Axum | 0.6 | Web framework & HTTP routing |
| Reqwest | 0.11 | HTTP client with connection pooling |
| Scraper | 0.13 | HTML parsing & CSS selectors |
| Technology | Purpose |
|---|---|
| HTML5 | Semantic markup |
| Tailwind CSS | Utility-first styling (CDN) |
| Vanilla JavaScript | DOM manipulation & API calls |
| Tool | Purpose |
|---|---|
| Cargo | Rust package manager & build system |
| Tower-HTTP | Static file serving middleware |
```
DIMENSIONLESS-RUST-WEB-CRAWLER/
├── Cargo.toml                    # Workspace manifest
├── dimensionless_crawler_core/   # Crawling logic (library)
│   ├── Cargo.toml
│   └── src/
│       └── lib.rs                # Core crawling algorithm
├── server/                       # Web server (binary)
│   ├── Cargo.toml
│   ├── src/
│   │   └── main.rs               # Axum server & routes
│   └── static/
│       └── index.html            # Frontend HTML/JS
└── README.md
```
Purpose: Implements the core web crawling logic
Main Module: lib.rs
```rust
pub async fn crawl_and_check(
    start: &str,
    max_depth: usize
) -> Result<Vec<LinkResult>, Box<dyn Error + Send + Sync>>
```

Key Structures:
- `LinkResult { url: String, status: Option<u16>, ok: bool }` - Represents a checked link

Key Dependencies:
- `reqwest::Client` - HTTP requests with connection pooling
- `scraper::Html` - DOM parsing
- `url::Url` - URL parsing and joining
- `std::collections::{HashSet, VecDeque}` - Graph traversal data structures
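For orientation, here is a minimal sketch of calling the library from a standalone binary. The crate, function, and `LinkResult` fields come from the signature above; the `#[tokio::main]` setup, the example URL, and the depth of 2 are assumptions for illustration.

```rust
use dimensionless_crawler_core::{crawl_and_check, LinkResult};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // Crawl the start URL and same-domain links up to two levels deep.
    let results: Vec<LinkResult> = crawl_and_check("https://example.com", 2).await?;

    // Report broken links (ok == false) with their status codes.
    for link in results.iter().filter(|l| !l.ok) {
        println!("Broken: {} (Status: {:?})", link.url, link.status);
    }
    println!("Checked {} links in total", results.len());
    Ok(())
}
```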
Purpose: Provides HTTP endpoints and serves the web interface
Main Module: main.rs
Routes:
- `GET /` - Serves the static HTML frontend
- `POST /api/crawl` - Accepts crawl requests, returns results

Key Structures:
- `CrawlRequest { url: String, depth: Option<usize> }` - Client request
- `CrawlResponse { links: Vec<LinkResult> }` - Server response

Key Dependencies:
- `axum::Router` - HTTP routing
- `tower_http::services::ServeDir` - Static file serving
- `serde` - JSON serialization/deserialization
- `tokio::main` - Async runtime entry point
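To make the wiring concrete, here is a rough sketch of how these routes could be assembled with Axum 0.6. It is not the project's actual main.rs: the handler name, the default depth of 1, and the error handling via `unwrap_or_default` are assumptions, and static file serving is only noted in a comment.

```rust
use axum::{routing::post, Json, Router};
use dimensionless_crawler_core::{crawl_and_check, LinkResult};
use serde::{Deserialize, Serialize};

// Request/response shapes mirror the structures listed above.
#[derive(Deserialize)]
struct CrawlRequest {
    url: String,
    depth: Option<usize>,
}

#[derive(Serialize)]
struct CrawlResponse {
    links: Vec<LinkResult>,
}

// POST /api/crawl: run the crawler and return the results as JSON.
async fn crawl_handler(Json(req): Json<CrawlRequest>) -> Json<CrawlResponse> {
    let links = crawl_and_check(&req.url, req.depth.unwrap_or(1))
        .await
        .unwrap_or_default();
    Json(CrawlResponse { links })
}

#[tokio::main]
async fn main() {
    // GET / (the static frontend) would be layered on with tower-http's ServeDir.
    let app = Router::new().route("/api/crawl", post(crawl_handler));

    axum::Server::bind(&"127.0.0.1:3000".parse().unwrap())
        .serve(app.into_make_service())
        .await
        .unwrap();
}
```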
The crawler uses Rust's async/await to handle many HTTP requests concurrently without dedicating an OS thread to each request:
```rust
// Check a single link without blocking the thread while waiting for the response
let status = match client.head(url).send().await {
    Ok(r) => Some(r.status().as_u16()),
    Err(_) => None,
};
```

Why it matters: Traditional threads are expensive (~2 MB of memory each). Tokio's async tasks are lightweight (~64 bytes), allowing thousands of concurrent operations.
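For contrast, here is a hedged sketch (not taken from lib.rs) of how a batch of links could be checked at the same time by spawning one lightweight Tokio task per URL; the function name and return shape are invented for illustration.

```rust
use reqwest::Client;

// Check a batch of URLs concurrently; each check runs as its own Tokio task.
async fn check_all(client: &Client, urls: Vec<String>) -> Vec<(String, Option<u16>)> {
    let handles: Vec<_> = urls
        .into_iter()
        .map(|url| {
            let client = client.clone(); // Client is cheap to clone (shared connection pool)
            tokio::spawn(async move {
                let status = match client.head(&url).send().await {
                    Ok(r) => Some(r.status().as_u16()),
                    Err(_) => None,
                };
                (url, status)
            })
        })
        .collect();

    // Await all tasks; requests overlap instead of running one after another.
    let mut results = Vec::new();
    for handle in handles {
        if let Ok(pair) = handle.await {
            results.push(pair);
        }
    }
    results
}
```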
Rust's ownership system prevents use-after-free bugs and data races at compile time:
```rust
// The parsed HTML is dropped before the next async await point
let resolved: Vec<Url> = {
    let document = Html::parse_document(&body); // Borrows &body
    // ... extract URLs into `res` ...
    res // Returns an owned Vec<Url>
}; // `document` is dropped here (no dangling pointers!)

// Safe to use `resolved` across await points
for url in resolved { /* ... */ }
```

Why it matters: No garbage collection overhead, zero-cost abstractions.
Rust uses Result<T, E> instead of exceptions:
```rust
let status = match client.head(url).send().await {
    Ok(r) => Some(r.status().as_u16()), // Success path
    Err(e) => {
        eprintln!("Failed: {}", e);     // Error path (explicit)
        None
    }
};
```

Why it matters: Errors must be handled explicitly. No silent failures.
Traits define shared behavior across types:
```rust
pub async fn crawl_and_check(
    start: &str,
    max_depth: usize
) -> Result<Vec<LinkResult>, Box<dyn Error + Send + Sync>>
```

- `Error + Send + Sync` - Any error type that's thread-safe
- `Send` - Can be transferred between threads
- `Sync` - Can be safely referenced from multiple threads
- `dyn` - Dynamic dispatch (runtime polymorphism)
Why it matters: Generic, reusable error handling without knowing exact error type.
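As a small illustration of why the boxed trait object helps, the `?` operator converts two unrelated error types into the same return type. The helper below is hypothetical, not part of the project.

```rust
use std::error::Error;
use url::Url;

// Hypothetical helper: url::ParseError and reqwest::Error both funnel
// into one Box<dyn Error + Send + Sync> via `?`.
async fn page_size(
    client: &reqwest::Client,
    start: &str,
) -> Result<usize, Box<dyn Error + Send + Sync>> {
    let url = Url::parse(start)?;                            // url::ParseError -> boxed
    let body = client.get(url).send().await?.text().await?; // reqwest::Error -> boxed
    Ok(body.len())
}
```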
Instead of raw pointers, Rust has safe, compiler-checked references:
```rust
// Immutable borrow (multiple readers allowed)
let body: String = client.get(url.clone()).send().await?.text().await?;

// Parse without moving `body`
let document = Html::parse_document(&body); // Borrows &body
let selector = Selector::parse("a").unwrap();

// `body` is still valid here!
println!("Parsed {} bytes", body.len());
```

Why it matters: The compiler ensures no use-after-free and no buffer overflows.
Rust's match is exhaustive and more powerful than switch statements:
```rust
match depth {
    0 => println!("Check only this URL"),
    1..=5 => println!("Traverse {} levels", depth),
    _ => println!("Too deep!"),
}
```

Why it matters: Handles all cases; the compiler forces you to be exhaustive.
Rust tracks how long references are valid:
```rust
// The (elided) lifetime on &str ensures `start` stays valid for the whole call
fn crawl_and_check(start: &str, max_depth: usize) -> Result<Vec<LinkResult>, /* ... */> {
    let url = Url::parse(start)?; // `start` must be valid for this function
    // ...
}
```

Why it matters: Prevents dangling references at compile time.
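As a generic illustration (not project code) of what the borrow checker rejects, a function cannot hand out a reference that outlives the value it borrows from:

```rust
// Rejected by the compiler: the returned &str would outlive `body`.
//
// fn scheme(body: String) -> &str {
//     &body[0..5]
// } // `body` is dropped here, so the reference would dangle

// Accepted: return owned data instead of a borrow into a dropped value.
fn scheme(body: String) -> String {
    body[0..5].to_string()
}

fn main() {
    println!("{}", scheme("https://example.com".to_string())); // prints "https"
}
```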
High-level Rust code compiles to efficient machine code:
```rust
// This iterator chain has zero extra runtime overhead
data.links
    .iter()
    .filter(|l| !l.ok)            // Inlined
    .map(|link| link.url.clone()) // Inlined
    .collect::<Vec<_>>()          // Single allocation
```

Compiles to the same machine code as an equivalent hand-written loop.
Reduce boilerplate with compile-time code generation:
```rust
#[derive(Serialize, Deserialize)]
struct LinkResult {
    url: String,
    status: Option<u16>,
    ok: bool,
}
// Automatically implements JSON serialization/deserialization
```

Why it matters: Less code, fewer bugs, same performance.
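For example, pairing the derives with serde_json (an assumed dependency here, used only to demonstrate) gives a JSON round trip with no hand-written serialization code:

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
struct LinkResult {
    url: String,
    status: Option<u16>,
    ok: bool,
}

fn main() -> Result<(), serde_json::Error> {
    let link = LinkResult {
        url: "https://example.com".into(),
        status: Some(200),
        ok: true,
    };

    // Derived Serialize: struct -> JSON string
    let json = serde_json::to_string(&link)?;
    println!("{json}"); // {"url":"https://example.com","status":200,"ok":true}

    // Derived Deserialize: JSON string -> struct
    let parsed: LinkResult = serde_json::from_str(&json)?;
    println!("{parsed:?}");
    Ok(())
}
```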
Rust's type system guarantees thread-safety:
```rust
// Client wraps an Arc internally, so it is safe to share across async tasks
let client = Client::builder().build()?;
let client_clone = client.clone();

tokio::spawn(async move {
    client_clone.get(url).send().await // No race conditions!
}).await?;
```

Why it matters: The compiler catches data races before runtime; most concurrency bugs are prevented at compile time.
```rust
// Simplified walkthrough of crawl_and_check's breadth-first loop
let mut queue: VecDeque<(Url, usize)> = VecDeque::new(); // (url, depth)
let mut seen: HashSet<String> = HashSet::new();          // Track visited URLs
let mut results: Vec<LinkResult> = Vec::new();           // Collect results

seen.insert(start_url.to_string());
queue.push_back((start_url, 0));

while let Some((url, depth)) = queue.pop_front() {
    // 1. Fetch page
    let body = client.get(url).send().await?.text().await?;

    // 2. Parse HTML & extract links (scoped so the parsed document is dropped before the next await)
    let resolved: Vec<Url> = {
        let document = Html::parse_document(&body);
        let mut res = Vec::new();
        for element in document.select(&selector) { // `selector` matches `a` elements
            if let Some(href) = element.value().attr("href") {
                res.push(/* parse and join URL */);
            }
        }
        res
    };

    // 3. Check each link
    for link in resolved {
        let status = client.head(link.clone()).send().await?.status().as_u16();
        results.push(LinkResult {
            url: link.to_string(),
            status: Some(status),
            ok: status < 400,
        });

        // 4. Add to queue if not visited (`insert` returns false for already-seen URLs)
        if depth + 1 <= max_depth && seen.insert(link.to_string()) {
            queue.push_back((link, depth + 1));
        }
    }
}

Ok(results)
```

| Operation | Complexity | Notes |
|---|---|---|
| Fetch URL | O(n) | n = bytes to download |
| Parse HTML | O(n) | n = HTML size |
| Check links | O(m) parallel | m = number of links |
| Overall | O(n + m*d) | d = depth, m = links per page |
Example: Crawling 100-page site with 50 links per page at depth 2:
- Synchronous: ~30 seconds (sequential)
- Async (Rust): ~2 seconds (concurrent)
```bash
# Build in development mode (faster compile, slower runtime)
cargo build

# Build in release mode (slower compile, faster runtime)
cargo build --release

# Run tests
cargo test

# Check code without building
cargo check

# Format code
cargo fmt

# Lint code
cargo clippy
```

| Pattern | Example | Purpose |
|---|---|---|
| Option | `status: Option<u16>` | Value might not exist |
| Result<T, E> | `Result<Vec<LinkResult>, Error>` | Operation might fail |
| Iterators | `.filter().map().collect()` | Lazy, composable data processing |
| Match | `match result { Ok(x) => ..., Err(e) => ... }` | Exhaustive pattern matching |
| Closure | `\|link\| link.ok` | Inline anonymous function |
| Trait Objects | `Box<dyn Error>` | Runtime polymorphism |
| Scoping | `{ /* scope */ }` | Explicit resource cleanup |
```bash
# Check if port 3000 is in use
netstat -ano | findstr :3000   # Windows
lsof -i :3000                  # macOS/Linux

# Kill the process or use a different port
```

- Check the browser console for JavaScript errors
- Ensure you're using https:// (some sites require it)
- Try depth=0 first to verify the URL is accessible
- Check server logs for detailed error messages
```bash
# Update Rust
rustup update

# Clean and rebuild
cargo clean
cargo build
```

- Rate limiting to respect robots.txt
- Proxy support for anonymity
- Custom headers & authentication
- Export results to CSV/JSON
- Screenshot website previews
- Broken link notifications via email
- Database persistence
- Distributed crawling
MIT License - Feel free to use this project for learning or production!
- Rust Book - Official Rust guide
- Tokio Documentation - Async runtime
- Axum Documentation - Web framework
- Reqwest Documentation - HTTP client
- Rust by Example - Interactive examples
Built with ❤️ using Rust, Axum, and Tokio
Made by Dimensionless Developments. Visit our website at https://www.dimensionlessdevelopments.com or email contact@dimensionlessdevelopments.com.