
Conversation

@malvads (Owner) commented Feb 3, 2026

No description provided.

@gemini-code-assist

Summary of Changes

Hello @malvads, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a significant architectural shift, moving the core crawling engine from a synchronous, libcurl-based approach to a modern, asynchronous model powered by C++20 coroutines and Boost.Asio. The changes aim to enhance performance, scalability, and maintainability by introducing a more efficient concurrency model, replacing external HTTP/WebSocket libraries with Boost.Beast, and modularizing key components like storage and proxy handling. This refactoring lays the groundwork for a more robust and flexible web crawler.

Highlights

  • Concurrency Model Overhaul: The crawling engine has been refactored to leverage C++20 coroutines and Boost.Asio for asynchronous I/O, replacing the previous thread-per-request model. This introduces 'virtual threads' (coroutines) and dedicated 'worker threads' for CPU/disk-bound tasks, significantly improving scalability and resource utilization. A minimal sketch of this model is shown just after these highlights.
  • HTTP Client Replacement: The libcurl HTTP client has been entirely removed and replaced with a new implementation based on Boost.Beast. This change aligns with the asynchronous I/O model and provides native support for HTTP/HTTPS requests, including various proxy types (HTTP CONNECT, SOCKS4, SOCKS5).
  • Chromium CDP Client Update: The Chromium DevTools Protocol (CDP) client, used for browser automation, has been rewritten to utilize Boost.Beast for WebSocket communication and Boost.Asio coroutines for its asynchronous operations, replacing the libwebsockets library.
  • Modular Storage System: A new modular storage system has been introduced, starting with a DiskStorage implementation. This allows for flexible handling of crawled content, separating storage concerns from the core crawling logic.
  • Enhanced Proxy and Robots.txt Handling: The proxy server and client-side proxy handling have been updated to support SOCKS4/SOCKS5 handshakes using Boost.Asio. Additionally, the robots.txt parsing now includes support for Crawl-delay directives, improving politeness towards target websites.
  • Configuration and Build System Updates: The project now requires C++20, and the CMake build system has been updated to reflect the new dependencies (Boost.Asio, Boost.Beast, OpenSSL) and removed libcurl and libwebsockets. New configuration options for virtual threads, worker threads, and proxy connection timeouts have been added.
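
To make the new concurrency model concrete, the following is a minimal, hedged sketch of the pattern described in the first highlight: one io_context driven by a small number of IO threads, a configurable number of coroutines ("virtual threads") co_spawned onto it, and a boost::asio::thread_pool for CPU/disk-bound work. The names fetch, process_blocking, and worker_loop are illustrative stand-ins, not the project's actual classes.

```cpp
#include <boost/asio.hpp>
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

namespace asio = boost::asio;

// Stand-in for an asynchronous fetch performed by the Beast-based client.
asio::awaitable<std::string> fetch(std::string url) {
    co_return "<html>stub body for " + url + "</html>";
}

// Blocking work (parsing, disk writes) that must stay off the IO threads.
void process_blocking(const std::string& body) {
    std::printf("processed %zu bytes\n", body.size());
}

// One "virtual thread": an ordinary coroutine scheduled on the io_context.
asio::awaitable<void> worker_loop(int id, asio::thread_pool& cpu_pool) {
    for (int i = 0; i < 3; ++i) {
        std::string body =
            co_await fetch("https://example.com/" + std::to_string(id));
        // Hand off CPU/disk-bound processing to the worker pool so the
        // IO threads keep servicing network events.
        asio::post(cpu_pool, [body] { process_blocking(body); });
    }
}

int main() {
    asio::io_context ioc;           // served by the IO threads ("threads")
    asio::thread_pool cpu_pool(4);  // "worker_threads"
    const int virtual_threads = 16; // "virtual_threads"

    for (int i = 0; i < virtual_threads; ++i)
        asio::co_spawn(ioc, worker_loop(i, cpu_pool), asio::detached);

    std::vector<std::thread> io_threads;
    for (int i = 0; i < 2; ++i)
        io_threads.emplace_back([&ioc] { ioc.run(); });
    for (auto& t : io_threads) t.join();

    cpu_pool.join();
}
```

The new config values map directly onto this sketch: threads sizes the IO thread set, virtual_threads bounds the number of concurrent coroutines, and worker_threads sizes the CPU/disk pool.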
Changelog
  • CMakeLists.txt
    • Updated C++ standard to C++20.
    • Removed CURL and WEBSOCKETS_LIBRARY dependencies.
    • Added OpenSSL, Boost::headers for new network stack.
    • Added src/storage subdirectory.
  • example_config.yaml
    • Added virtual_threads for max concurrent coroutines.
    • Added worker_threads for CPU/Disk worker threads.
    • Clarified threads as 'Number of IO threads'.
  • src/binary/reader.cpp
    • Added new file: Binary reader implementation for various integer types and endianness.
  • src/binary/reader.hpp
    • Added new file: Header for binary reader class.
  • src/binary/writer.cpp
    • Added new file: Binary writer implementation for various integer types and endianness.
  • src/binary/writer.hpp
    • Added new file: Header for binary writer class.
  • src/browser/CMakeLists.txt
    • Removed WEBSOCKETS_LIBRARY from mojo_browser target link libraries.
    • Added Boost::headers, OpenSSL::SSL, OpenSSL::Crypto for new CDP client.
  • src/browser/browser.cpp
    • Updated Page and Browser methods to return boost::asio::awaitable for asynchronous operations.
    • Modified Browser constructor and connect method to accept boost::asio::io_context&.
  • src/browser/browser.hpp
    • Included boost/asio/io_context.hpp and boost/asio/awaitable.hpp.
    • Updated Browser and Page interfaces to use boost::asio::awaitable for async methods.
    • Added ioc_ member to Browser class.
  • src/browser/browser_client.cpp
    • Replaced CurlClient with BeastClient for HTTP requests.
    • Updated render_to_response and get methods to be boost::asio::awaitable.
  • src/browser/browser_client.hpp
    • Added #pragma once.
    • Updated BrowserClient methods to return boost::asio::awaitable<Response>.
    • Added head method to BrowserClient interface.
    • Added ioc_ member to BrowserClient.
  • src/browser/cdp/cdp_client.cpp
    • Rewritten CDP client to use Boost.Beast WebSockets and Boost.Asio coroutines.
    • Removed libwebsockets context and callbacks.
    • Implemented get_web_socket_url, connect, send_message, read_message, wait_for_id, wait_for_event, navigate, evaluate, render as awaitable functions.
  • src/browser/cdp/cdp_client.hpp
    • Replaced libwebsockets.h includes with Boost.Beast and Boost.Asio headers.
    • Updated all public methods to return boost::asio::awaitable.
    • Replaced Context struct with direct members for Boost.Beast WebSocket stream, connection state, message IDs, and message queues.
  • src/browser/page.hpp
    • Included boost/asio/awaitable.hpp.
    • Updated all virtual methods to return boost::asio::awaitable.
  • src/core/config/config.cpp
    • Added parsing for virtual_threads, worker_threads, and proxy_connect_timeout from YAML and command line.
    • Improved error handling for opening proxy list file.
  • src/core/config/config.hpp
    • Added virtual_threads and worker_threads members to Config struct.
    • Added proxy_connect_timeout member to Config struct with a default value.
  • src/core/types/constants.hpp
    • Adjusted DEFAULT_THREADS to 2 (IO Threads).
    • Added DEFAULT_VIRTUAL_THREADS (16) and DEFAULT_WORKER_THREADS (4).
    • Increased REQUEST_TIMEOUT_SECONDS to 10.
  • src/engine/CMakeLists.txt
    • Added new crawler implementation files: crawler/impl/lifecycle.cpp, crawler/impl/worker.cpp, crawler/impl/storage.cpp, crawler/impl/robots.cpp.
    • Linked mojo_storage and Boost::headers.
  • src/engine/crawler/crawler.cpp
    • Major refactoring: removed synchronous worker loop and task processing.
    • Updated constructor to initialize new thread pools and IO context.
    • Removed old HttpClient and BrowserLauncher usage from this file.
  • src/engine/crawler/crawler.hpp
    • Included boost/asio.hpp.
    • Added num_virtual_threads_, num_worker_threads_, proxy_connect_timeout_ members.
    • Replaced std::vector<std::thread> workers_ with boost::asio::io_context ioc_, std::vector<std::thread> io_threads_, boost::asio::thread_pool worker_pool_.
    • Removed std::condition_variable cv_.
    • Added is_shutdown_ atomic flag and boost::asio::signal_set signals_.
    • Introduced domain_last_access_ and domain_mutex_ for politeness.
    • Added storage_ unique pointer for modular storage.
    • Replaced synchronous methods with new init_*, spawn_workers, await_completion, shutdown methods, and boost::asio::awaitable methods for core crawling logic.
  • src/engine/crawler/impl/lifecycle.cpp
    • Added new file: Implements the crawler's lifecycle, including initialization of IO services, signal handling, proxy server, browser launcher, storage, and worker spawning.
  • src/engine/crawler/impl/robots.cpp
    • Added new file: Implements robots.txt fetching, caching, and politeness logic using boost::asio::awaitable and boost::asio::steady_timer for crawl delays (a minimal sketch of this pattern appears after the changelog).
  • src/engine/crawler/impl/storage.cpp
    • Added new file: Implements content saving to the new storage module, including filename generation and asynchronous saving to the worker pool.
  • src/engine/crawler/impl/worker.cpp
    • Added new file: Implements the main worker loop using Boost.Asio coroutines, handling task fetching, URL politeness, retries, and content processing (binary/text, link extraction).
  • src/main.cpp
    • Removed curl_global_init and curl_global_cleanup.
    • Passed virtual_threads and worker_threads from config to CrawlerConfig.
  • src/network/CMakeLists.txt
    • Removed CURL::libcurl dependency.
    • Added Boost::headers, OpenSSL::SSL, OpenSSL::Crypto for Boost.Beast.
    • Replaced http/curl_client.cpp with http/beast_client.cpp and proxy/socks_handshake.cpp.
  • src/network/http/beast_client.cpp
    • Added new file: Implements an HTTP client using Boost.Beast and Boost.Asio coroutines, supporting HTTP/HTTPS and various proxy types (HTTP CONNECT, SOCKS4, SOCKS5).
  • src/network/http/beast_client.hpp
    • Added new file: Header for the Boost.Beast HTTP client, defining its interface and internal awaitable methods.
  • src/network/http/curl_client.cpp
    • Removed file: libcurl-based HTTP client implementation.
  • src/network/http/curl_client.hpp
    • Removed file: Header for the libcurl-based HTTP client.
  • src/network/http/http_client.hpp
    • Included boost/asio.hpp.
    • Updated HttpClient interface methods to return boost::asio::awaitable<Response>.
    • Added set_connect_timeout virtual method.
    • Added head virtual method.
  • src/network/proxy/socks_handshake.cpp
    • Added new file: Implements SOCKS4 and SOCKS5 handshake logic using Boost.Asio.
  • src/network/proxy/socks_handshake.hpp
    • Added new file: Header for SOCKS handshake utility functions.
  • src/proxy/server/connection.cpp
    • Rewritten connection handling to use Boost.Asio coroutines (start_impl, do_resolve, do_connect_upstream, do_socks5_handshake, start_tunnel, transfer).
    • Integrated SocksHandshake for SOCKS proxy connections.
    • Removed all synchronous and callback-based asynchronous logic.
  • src/proxy/server/connection.hpp
    • Replaced callback-based asynchronous methods with boost::asio::awaitable methods.
    • Updated parse_target to parse_target_request.
  • src/proxy/server/proxy_server.cpp
    • Rewritten do_accept to use Boost.Asio coroutines for accepting new connections.
    • Updated logging message for proxy server startup.
  • src/proxy/server/proxy_server.hpp
    • Updated do_accept to return boost::asio::awaitable<void>.
  • src/storage/CMakeLists.txt
    • Added new file: CMake configuration for the storage module.
  • src/storage/disk_storage.cpp
    • Added new file: Implementation of DiskStorage for saving content to the local filesystem.
  • src/storage/disk_storage.hpp
    • Added new file: Header for DiskStorage class.
  • src/storage/storage.hpp
    • Added new file: Abstract base class Storage defining the storage interface.
  • src/utils/CMakeLists.txt
    • Added ../binary/reader.cpp and ../binary/writer.cpp to mojo_utils target sources.
  • src/utils/robotstxt/robotstxt.cpp
    • Included <optional> and absl/strings/match.h.
    • Implemented get_crawl_delay method to extract crawl delay from robots.txt.
  • src/utils/robotstxt/robotstxt.hpp
    • Added get_crawl_delay method to RobotsTxt class.
  • src/utils/url/url.cpp
    • Improved URL parsing logic to correctly identify schemes and extract query and fragment components.
  • src/utils/url/url.hpp
    • Added query and fragment members to UrlParsed struct.
  • tests/integration/test_integration.cpp
    • Changed default logger level to LOG_INFO for better test visibility.
    • Added config.virtual_threads = 4 to crawler configurations in tests.
    • Added RobotsTxtEnforcement integration test.
    • Added RetryLogic integration test.
  • tests/unit/CMakeLists.txt
    • Added new unit test files: test_coroutines.cpp, test_proxy_server.cpp, test_storage.cpp, test_crawler.cpp, test_http_client.cpp.
    • Linked mojo_storage, mojo_engine, and httplib::httplib for new tests.
  • tests/unit/test_coroutines.cpp
    • Added new file: Basic unit tests for Boost.Asio coroutine functionality.
  • tests/unit/test_crawler.cpp
    • Updated namespace usage.
  • tests/unit/test_http_client.cpp
    • Rewritten HTTP client tests to use BeastClient instead of CurlClient.
  • tests/unit/test_proxy_server.cpp
    • Added new file: Unit tests for the Boost.Asio coroutine-based proxy server using httplib.
  • tests/unit/test_storage.cpp
    • Added new file: Unit tests for the DiskStorage implementation.
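
The Crawl-delay support added in src/engine/crawler/impl/robots.cpp and src/utils/robotstxt/robotstxt.cpp lends itself to a simple coroutine-friendly politeness pattern. The sketch below is an assumption about how such logic can look, modeled loosely on the domain_last_access_ / domain_mutex_ members mentioned above; respect_crawl_delay and the free-standing map are hypothetical names, not the PR's actual code.

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <mutex>
#include <string>
#include <unordered_map>

namespace asio = boost::asio;
using Clock = std::chrono::steady_clock;

// Hypothetical per-domain bookkeeping, modeled on the domain_last_access_ /
// domain_mutex_ members mentioned in the changelog.
std::unordered_map<std::string, Clock::time_point> domain_last_access;
std::mutex domain_mutex;

// Suspend the calling coroutine (without blocking an IO thread) until the
// domain's crawl delay has elapsed since its last scheduled access.
asio::awaitable<void> respect_crawl_delay(const std::string& domain,
                                          std::chrono::seconds crawl_delay) {
    Clock::duration wait{};
    {
        std::lock_guard<std::mutex> lock(domain_mutex);
        auto now = Clock::now();
        auto it = domain_last_access.find(domain);
        if (it != domain_last_access.end() && it->second + crawl_delay > now)
            wait = it->second + crawl_delay - now;
        // Record when this request will actually go out, so concurrent
        // coroutines targeting the same domain queue up behind it.
        domain_last_access[domain] = now + wait;
    }
    if (wait > Clock::duration::zero()) {
        auto ex = co_await asio::this_coro::executor;
        asio::steady_timer timer(ex, wait);
        co_await timer.async_wait(asio::use_awaitable);
    }
}

int main() {
    asio::io_context ioc;
    asio::co_spawn(ioc, respect_crawl_delay("example.com", std::chrono::seconds(1)),
                   asio::detached);
    asio::co_spawn(ioc, respect_crawl_delay("example.com", std::chrono::seconds(1)),
                   asio::detached);
    ioc.run();  // the second coroutine waits roughly one second behind the first
}
```

The important property is that waiting happens on a steady_timer via co_await, so a delayed coroutine suspends instead of tying up one of the few IO threads.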
Activity
  • The pull request introduces a comprehensive refactoring of the core crawling logic and network stack.
  • It replaces libcurl and libwebsockets with Boost.Asio and Boost.Beast for all asynchronous I/O operations (a minimal Beast request sketch appears after this list).
  • A new concurrency model using C++20 coroutines and thread pools has been implemented.
  • New modules for binary I/O and storage have been added.
  • The proxy server and robots.txt handling have been enhanced with new features and asynchronous capabilities.
  • Existing tests have been updated, and new tests have been added to cover the new functionality.
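
As a companion to the bullet above on replacing libcurl with Boost.Asio and Boost.Beast, here is a minimal, hedged sketch of an awaitable HTTP GET. It covers plain HTTP only; the PR's BeastClient additionally handles HTTPS and HTTP CONNECT / SOCKS4 / SOCKS5 proxies. The function name http_get and the User-Agent string are illustrative, not taken from the project.

```cpp
#include <boost/asio.hpp>
#include <boost/beast.hpp>
#include <iostream>
#include <string>

namespace asio  = boost::asio;
namespace beast = boost::beast;
namespace http  = beast::http;

// Plain-HTTP GET expressed as a coroutine; errors surface as exceptions.
asio::awaitable<std::string> http_get(std::string host, std::string target) {
    auto ex = co_await asio::this_coro::executor;

    asio::ip::tcp::resolver resolver(ex);
    auto endpoints =
        co_await resolver.async_resolve(host, "80", asio::use_awaitable);

    beast::tcp_stream stream(ex);
    co_await stream.async_connect(endpoints, asio::use_awaitable);

    http::request<http::empty_body> req{http::verb::get, target, 11};
    req.set(http::field::host, host);
    req.set(http::field::user_agent, "mojo-example/0.1");  // illustrative UA

    co_await http::async_write(stream, req, asio::use_awaitable);

    beast::flat_buffer buffer;
    http::response<http::string_body> res;
    co_await http::async_read(stream, buffer, res, asio::use_awaitable);

    beast::error_code ec;
    stream.socket().shutdown(asio::ip::tcp::socket::shutdown_both, ec);

    co_return res.body();
}

asio::awaitable<void> run() {
    std::string body = co_await http_get("example.com", "/");
    std::cout << "fetched " << body.size() << " bytes\n";
}

int main() {
    asio::io_context ioc;
    asio::co_spawn(ioc, run(), asio::detached);
    ioc.run();
}
```

Connect and request timeouts (the new proxy_connect_timeout / set_connect_timeout mentioned in the changelog) would typically be applied with beast::tcp_stream::expires_after before each asynchronous operation.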
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This is a significant and impressive refactoring that replaces the previous threading model and libcurl/libwebsockets dependencies with a modern, asynchronous approach using Boost.Asio and C++20 coroutines. The separation of concerns into IO threads and a worker thread pool for CPU/disk-bound tasks is excellent. The introduction of a storage abstraction and improved robots.txt handling (including Crawl-delay) are also great additions. I've found a few critical issues related to data correctness and security, along with some high-severity resource leaks and suggestions for improvement. Overall, this is a fantastic step forward for the project's architecture.

@malvads (Owner, Author) commented Feb 3, 2026

/gemini review


@gemini-code-assist bot left a comment


Code Review

This is a significant and impressive refactoring of the crawler's core networking and concurrency model. The transition from libcurl and std::thread to Boost.Asio with C++20 coroutines is a major architectural improvement that greatly enhances readability and maintainability. The introduction of a dedicated storage abstraction and the support for Crawl-delay are also excellent additions. I've identified a critical bug in the new termination logic that could cause the crawler to hang, along with a few other potential issues related to error handling and robustness. Once these are addressed, this will be a very solid foundation for the project.

@malvads force-pushed the improvemnts/virt_thrds_hdw_thrds branch 2 times, most recently from a46404d to 94eae51, on February 4, 2026 at 12:21
@malvads force-pushed the improvemnts/virt_thrds_hdw_thrds branch from 458c96c to dcb4848 on February 4, 2026 at 12:39
@malvads merged commit 6521acd into main on Feb 4, 2026
5 of 6 checks passed