Is there a way to estimate or predict deep crawl completion time in Crawl4AI? #1741
Unanswered
aman-chicmic asked this question in Forums - Q&A

Hi everyone,
I’m using Crawl4AI for deep crawling websites and was wondering if there’s a recommended way to determine or estimate the crawl completion time.
Since the total number of pages and links is often unknown before the crawl starts, is there:
- Any built-in support for ETA or progress estimation?
- A best practice for estimating remaining time during an active crawl?
- Suggested strategies (e.g., sampling, limits, metrics) to make crawl duration more predictable?
I’m currently relying on depth and page limits, but I’d like to better understand if dynamic ETA calculation or progress tracking is possible or recommended.
Thanks in advance for any insights or guidance!

Replies: 2 comments
@unclecode Can you please review my question above? Thanks!
Hey! There's no built-in ETA for deep crawls, since the total number of pages isn't known upfront. But you can track progress yourself by streaming results and computing a running rate:

```python
import asyncio
import time

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def deep_crawl_with_progress():
    max_pages = 30  # Set limit for predictability

    strategy = BFSDeepCrawlStrategy(
        max_depth=2,
        include_external=False,
        max_pages=max_pages,
    )
    config = CrawlerRunConfig(
        deep_crawl_strategy=strategy,
        prefetch=True,  # Fast mode
        stream=True,    # Required for progress tracking
    )

    start_time = time.time()
    count = 0
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        async for result in await crawler.arun(
            "https://docs.crawl4ai.com/",
            config=config,
        ):
            count += 1
            elapsed = time.time() - start_time
            rate = count / elapsed if elapsed > 0 else 0
            eta = (max_pages - count) / rate if rate > 0 else 0
            print(
                f"[{count}/{max_pages}] "
                f"Depth: {result.metadata.get('depth', 0)} | "
                f"Rate: {rate:.1f}/s | "
                f"ETA: {eta:.0f}s | "
                f"{result.url[:50]}"
            )
    print(f"\nDone! {count} pages in {time.time() - start_time:.1f}s")


asyncio.run(deep_crawl_with_progress())
```

Tips for predictable crawls:
- Always set max_pages; it gives the ETA a fixed denominator, and without it there is no meaningful "remaining" count.
- Keep max_depth small (2 or 3); the number of discovered pages grows roughly exponentially with depth.
- Use stream=True so results arrive as they complete instead of all at once at the end.
- Treat the first few ETA readings as noise; the pages-per-second rate only stabilizes after a dozen or so pages.
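One refinement on top of that, since the raw count/elapsed rate can swing a lot early in a crawl: a small helper that smooths per-page timings with an exponential moving average gives a steadier readout. This is just a sketch, not part of Crawl4AI; the SmoothedETA class and its alpha parameter are hypothetical names.

```python
# Hypothetical helper (not part of Crawl4AI): smooths the pages/second
# rate with an exponential moving average so the printed ETA jumps
# around less on sites where page load times vary widely.
import time
from typing import Optional


class SmoothedETA:
    def __init__(self, total: int, alpha: float = 0.3):
        self.total = total    # expected page count, e.g. max_pages
        self.alpha = alpha    # weight given to the newest sample
        self.count = 0        # pages completed so far
        self.rate: Optional[float] = None  # smoothed pages/second
        self.last_ts = time.time()

    def update(self) -> Optional[float]:
        """Record one completed page and return the ETA in seconds."""
        now = time.time()
        dt, self.last_ts = now - self.last_ts, now
        self.count += 1
        if dt <= 0:
            return None
        sample = 1.0 / dt  # instantaneous pages/second
        self.rate = sample if self.rate is None else (
            self.alpha * sample + (1 - self.alpha) * self.rate
        )
        remaining = max(self.total - self.count, 0)
        return remaining / self.rate
```

To use it, create `tracker = SmoothedETA(max_pages)` before the crawl and call `eta = tracker.update()` inside the async for loop in place of the raw rate arithmetic above.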
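For the "sampling" strategy mentioned in the question, one rough approach is a cheap depth-1 pre-pass: count the pages it finds, treat that as the branching factor, and extrapolate a geometric-series page count for the full depth. This is a sketch under the same API assumptions as the example above (non-streaming deep crawls returning a list of results), not a Crawl4AI feature, and the estimate_total_pages function is a hypothetical name.

```python
# Sketch of a sampling pre-pass (assumes the same Crawl4AI API as the
# example above): crawl to depth 1 only, derive a branching factor b,
# and estimate a depth-d crawl as roughly 1 + b + b**2 + ... + b**d.
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, BrowserConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy


async def estimate_total_pages(url: str, target_depth: int = 2) -> int:
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=1,
            include_external=False,
        ),
    )
    async with AsyncWebCrawler(config=BrowserConfig(headless=True)) as crawler:
        # With stream left off, the deep crawl collects results into a list.
        results = await crawler.arun(url, config=config)
    b = max(len(results) - 1, 1)  # children of the root page
    # Geometric-series extrapolation; real sites fan out unevenly, so
    # treat this as an order-of-magnitude guess, not a guarantee.
    return sum(b ** d for d in range(target_depth + 1))


if __name__ == "__main__":
    estimate = asyncio.run(estimate_total_pages("https://docs.crawl4ai.com/"))
    print(f"Estimated pages at depth 2: ~{estimate}")
```

Multiplying that estimate by the seconds-per-page observed during the pre-pass gives a rough total duration before the real crawl starts; cap the real crawl with max_pages as usual so the guess can't run away.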