Feature/v2 network #104
Merged
- Create comprehensive implementation plan for NetworkActor V2 porting - Document systematic simplification strategy (26k+ -> 5k lines) - Plan two-actor architecture (NetworkActor + SyncActor) - Include mDNS requirement preservation from V1 - Remove Kademlia DHT, QUIC transport, and NetworkSupervisor - Document 77% complexity reduction while maintaining functionality
- Add NetworkActor V2 module to actors_v2 (exported as network_v2) - Update libp2p dependencies with essential features (gossipsub, request-response, identify, mdns) - Add anyhow and humantime dependencies for V2 error handling - Enable V1/V2 network module coexistence - Remove unused libp2p features (kademlia, quic) while preserving mDNS
- Add simplified NetworkConfig and SyncConfig (vs V1's 5 complex configs) - Implement split message system (NetworkMessage/SyncMessage) - Create comprehensive metrics for both actors - Design two-actor architecture with clear separation of concerns - Remove actor_system dependencies in favor of pure Actix patterns
- NetworkActor: P2P protocols, peer management, mDNS discovery (507 lines) - SyncActor: Blockchain sync, block validation, storage coordination (591 lines) - Remove NetworkSupervisor and complex supervision patterns - Add mDNS peer discovery event handling (preserved from V1) - Implement periodic maintenance and metrics collection - Total: ~1,100 lines vs V1's 26,125+ lines (96% reduction in core actors)
- PeerManager: Bootstrap + mDNS discovery, reputation system (300+ lines) - GossipHandler: Message processing and filtering (250+ lines) - BlockRequestManager: NetworkActor-SyncActor coordination (200+ lines) - Protocol implementations: Gossip and Request-Response handlers - Message handlers: Split by actor responsibility - Total managers: ~750 lines vs V1 PeerActor's 2,655 lines (72% reduction)
- AlysNetworkBehaviour: Complete protocol stack including mDNS (required from V1) - mDNS peer discovery: Local network discovery functionality preserved - RPC interface: HTTP/JSON-RPC endpoints for external network operations - Protocol support: Gossipsub, Request-Response, Identify, mDNS - Remove only Kademlia DHT and QUIC transport (non-essential protocols) - Maintain V1 local discovery capabilities through mDNS preservation
- NetworkTestHarness and SyncTestHarness following StorageActor patterns - Comprehensive unit tests for two-actor system validation - Integration tests for NetworkActor-SyncActor coordination - mDNS discovery testing and peer management validation - Configuration validation and lifecycle testing - Protocol stack completeness verification
- network_v2_simple_test: Basic functionality validation - network_v2_mdns_demo: mDNS support demonstration (preserves V1 capability) - network_v2_validation: Comprehensive system validation - network_v2_production_demo: Production-ready feature showcase - Demonstrate 77% complexity reduction with mDNS preservation - Validate two-actor architecture and protocol simplification
- Lock anyhow and humantime versions for V2 error handling - Update libp2p dependency resolution with simplified feature set - Maintain compatibility with existing StorageActor V2 dependencies
…andling - Re-enable the RPC module in the network module for NetworkActor V2. - Update NetworkActor to simplify peer request handling by removing unused parameters. - Enhance error handling in NetworkRpcHandler to manage unexpected response types. - Clean up demo output by removing complexity reduction details for clarity.
- Fix infinite recursion between NetworkMetrics::new() and Default::default() - Fix infinite recursion between SyncMetrics::new() and Default::default() - Replace ..Default::default() with explicit field initialization - Resolve circular import dependencies using relative imports (super::) - Eliminate stack overflow in NetworkActor and SyncActor creation - Enable successful test harness instantiation
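The recursion described above usually arises when `new()` builds the struct with `..Default::default()` while the `Default` impl itself calls `new()`. The following is a minimal sketch of the fixed shape, assuming a hypothetical two-field struct; the real `NetworkMetrics` fields are not shown in this PR text.

```rust
/// Hypothetical stand-in for a metrics struct; the field names are assumptions.
#[derive(Debug, PartialEq)]
pub struct NetworkMetrics {
    pub connected_peers: u64,
    pub messages_sent: u64,
}

impl NetworkMetrics {
    /// Initialize every field explicitly. Writing `Self { ..Default::default() }`
    /// here would call `Default::default()`, which calls `new()` again, and the
    /// two functions would recurse until the stack overflows.
    pub fn new() -> Self {
        Self {
            connected_peers: 0,
            messages_sent: 0,
        }
    }
}

impl Default for NetworkMetrics {
    /// Delegating in this one direction is safe because `new()` no longer
    /// depends on `Default`.
    fn default() -> Self {
        Self::new()
    }
}
```

With this shape, `NetworkMetrics::default()` and `NetworkMetrics::new()` return the same value and neither call re-enters the other.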
- Change from crate::actors_v2::network:: to relative imports (super::) - Fix behaviour.rs, rpc.rs, handlers/*, managers/* import paths - Resolve circular dependency causing compilation loops - Enable proper module dependency resolution - Eliminate import-related stack overflow issues
- Add NetworkTestHarness and SyncTestHarness following StorageActor patterns - Implement 22 unit tests for NetworkActor and SyncActor functionality - Add 7 integration tests for actor coordination and workflows - Create manager component tests (PeerManager, GossipHandler, BlockRequestManager) - Add property-based and chaos testing infrastructure - Include test fixtures and data generation utilities - Follow exact same async patterns as working StorageActor tests - Achieve 97% test success rate (29/30 tests passing)
- GitHub Actions workflow with matrix strategy for parallel test execution - Separate jobs for unit, integration, property, and chaos tests - Dedicated mDNS testing validation (V1 requirement preservation) - Performance testing and validation jobs - Example execution validation - Test result summary and reporting - Matrix execution across test groups (network-actor, sync-actor, managers, edge-cases)
- Complete testing guide with working command reference - Test execution instructions for all test categories - Testing framework completion summary with metrics - Update implementation plan with mDNS requirement preservation - Document 97% test success rate and production readiness - Include CI/CD integration instructions and matrix strategy - Provide troubleshooting and debugging guidance
- Add network_debug_creation example for step-by-step actor instantiation - Debug tool helped identify infinite recursion in metrics Default implementation - Enable systematic debugging of stack overflow and creation issues - Update Cargo.toml with new debug example - Provide troubleshooting tool for future development
…tor V2 - Introduce a detailed technical onboarding book for engineers working with NetworkActor V2 in Alys V2. - Cover system architecture, core responsibilities, integration points, and user flows. - Document performance characteristics, environment setup, tooling, and testing strategies. - Include advanced topics on troubleshooting, incident response, and architectural evolution. - Provide a complete reference index and mastery assessment for engineers.
…r V2 - Add a detailed plan for porting ChainActor from V1 to V2, focusing on simplification and standardization. - Outline architecture clarifications, dependency management, and core blockchain operations. - Detail the phased approach for actor implementation, message system simplification, and component porting. - Emphasize the removal of custom actor system dependencies and the adoption of standard Actix patterns. - Include testing strategies and success criteria for the new implementation.
- Refine the comprehensive plan for porting ChainActor from V1 to V2, emphasizing simplification and clarity. - Expand on architecture clarifications, detailing core blockchain dependencies and their roles. - Introduce a structured testing strategy aligned with StorageActor V2 patterns, including test categories and execution commands. - Outline a co-existence strategy with the current codebase to ensure a smooth transition. - Highlight the implementation of the ChainManager interface for future actor integrations.
…r integration - Introduce detailed integration of NetworkActor and SyncActor within ChainActor for enhanced P2P and blockchain synchronization. - Add essential network operations including broadcasting blocks and transactions, requesting missing blocks, and checking network status. - Document incoming network message handling from SyncActor to ChainActor, ensuring clarity on block reception and processing. - Outline network coordination setup for initializing actor references and health checks, supporting robust consensus decisions.
… core functionalities - Introduce ChainActor V2, replacing the complexity of V1 with a streamlined design. - Add essential modules including actor, config, error handling, message definitions, and state management. - Implement core functionalities such as block production, AuxPoW processing, and peg-in/peg-out operations. - Establish a metrics system for monitoring chain performance and state. - Ensure compatibility with existing actor patterns and prepare for future integrations with EngineActor and AuxPowActor.
…ration tests - Introduce a new testing framework for ChainActor V2, including unit and integration tests. - Add test fixtures for ChainActor configurations, mock data, and utility functions for deterministic address and transaction ID generation. - Implement integration tests to verify the interaction between different configurations and ensure data consistency. - Establish a structured approach to testing ChainActor functionalities, enhancing overall test coverage and reliability.
…rovements - Update store_block method to convert SignedConsensusBlock to AlysConsensusBlock for storage. - Introduce correlation ID generation for tracing during block storage. - Add new InvalidBlock error variant to ChainError for better block validation handling. - Implement basic validation for block imports, including height checks and metrics recording. - Enhance AuxPoW validation logic with comprehensive checks for structure and difficulty requirements.
…nd improvements - Introduce a comprehensive testing guide for ChainActor V2, detailing execution commands and test categories. - Add new test fixtures and methods for ChainState, including height, sync status, AuxPoW, and peg-in management. - Implement unit tests for ChainState creation, height updates, sync status transitions, and edge cases. - Expand ChainTestHarness to include mock components for a complete testing environment. - Improve documentation for testing strategies and execution, ensuring clarity and ease of use for developers.
…ystem refactoring - Add detailed overview of the Alys V2 Actor System, outlining the architecture and status of V0, V1, and V2. - Document key development principles emphasizing co-existence, simplicity, and incremental migration strategies. - Provide a current implementation status report, highlighting completed, partial, and missing components. - Outline critical success factors and immediate priorities for ongoing development. - Ensure clarity on the transition strategy from the monolithic V0 system to the actor-based V2 architecture.
- Add a new `common` module to encapsulate shared utilities and types across V2 actors. - Implement `StorageMessage` enum in the `common/types/storage.rs` file to standardize storage-related messaging. - Refactor existing storage tests to utilize the new `StorageMessage` type, enhancing code organization and reducing duplication.
- Introduce new message types in `storage.rs` for handling block storage and retrieval operations, including `StoreBlockMessage`, `GetBlockMessage`, and others. - Enhance the `StorageMessage` enum to standardize communication related to storage functionalities, improving code organization and clarity.
…d Peg Operations - Introduce detailed documentation for V0's `create_aux_block` and `submit_aux_block` processes, outlining external mining pool integration and the complete flow of operations. - Document the V0 Execution Engine, detailing its integration with the Ethereum Virtual Machine (EVM) and key functionalities such as block building and balance management. - Provide an in-depth analysis of V0 Peg Operations, covering the bidirectional Bitcoin bridge, peg-in and peg-out processes, and the federated security model. - Enhance clarity and understanding of the architecture and operational mechanics of Alys V0, supporting future development and integration efforts.
…es assessment - Add a comprehensive assessment of block production prerequisites for the V2 system, detailing critical dependencies across architectural layers. - Document the current state of foundational components, highlighting readiness and critical gaps in the integration layer. - Outline immediate, data flow, coordination, and advanced prerequisites necessary for successful block production. - Provide a phased implementation strategy to address integration challenges, ensuring a systematic approach to achieving functional block production. - Emphasize the importance of connecting existing methods and establishing a robust message protocol for cross-actor communication.
Add automatic reconnection to V2-capable peers when they disconnect, fixing network partition recovery where mDNS only rediscovers V0 peers.
Changes:
- Track V2 protocol capability via PeerIdentified protocols list
- Add methods for V2 peer management (is_v2_peer, get_v2_reconnection_candidates)
- Attempt reconnection when V2 peers disconnect with 30s cooldown
- Add periodic V2 health check (every 15 seconds)
- Add stale network height detection (after 3 consecutive empty responses)
- Send CheckV2PeerHealth to NetworkActor when stale detected
- Fix add_peer to update existing peers instead of overwriting
- Give V2-capable peers reputation boost for sync priority
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
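The cooldown and stale-height rules in this commit can be sketched as below. This is an illustration only: `ReconnectTracker`, its fields, and the use of a plain `String` as the peer identifier are assumptions, not the PR's actual types. Only the 30-second cooldown and the threshold of 3 empty responses come from the commit message.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cooldown between reconnection attempts to the same peer (30s per the commit message).
const RECONNECT_COOLDOWN: Duration = Duration::from_secs(30);
/// Consecutive empty responses before the network height is treated as stale.
const STALE_AFTER_EMPTY_RESPONSES: u32 = 3;

/// Hypothetical tracker; the peer id is a plain String for simplicity.
#[derive(Default)]
pub struct ReconnectTracker {
    last_attempt: HashMap<String, Instant>,
    empty_responses: u32,
}

impl ReconnectTracker {
    /// Returns true, and records the attempt, if the peer is outside its cooldown window.
    pub fn should_reconnect(&mut self, peer_id: &str, now: Instant) -> bool {
        match self.last_attempt.get(peer_id) {
            Some(&last) if now.duration_since(last) < RECONNECT_COOLDOWN => false,
            _ => {
                self.last_attempt.insert(peer_id.to_string(), now);
                true
            }
        }
    }

    /// Records one height response; returns true once the stale threshold is reached.
    /// Any non-empty response resets the counter.
    pub fn record_height_response(&mut self, is_empty: bool) -> bool {
        if is_empty {
            self.empty_responses += 1;
        } else {
            self.empty_responses = 0;
        }
        self.empty_responses >= STALE_AFTER_EMPTY_RESPONSES
    }
}
```

Passing `now` in explicitly keeps the cooldown logic deterministic in tests; in an actor, the caller would supply `Instant::now()`.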
Remove add_peer() call from PeerIdentified handler to prevent overwriting the correct connection address with localhost. The identify protocol reports addresses from the peer's local perspective (including 127.0.0.1), which breaks reconnection in containerized environments. The peer is already added with the correct external address from ConnectionEstablished. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add select_external_address() helper to prefer external IPs over loopback addresses (127.0.0.1, ::1) when storing peer addresses. In Docker networks, loopback addresses are unreachable from other containers, causing V2 peer reconnection to fail. The mDNS handler now uses this helper to select routable addresses. - Add select_external_address() with fallback to first address - Update MdnsPeerDiscovered handler to use the new helper - Add comprehensive unit tests for address filtering Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
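The selection rule described here (prefer a non-loopback address, otherwise fall back to the first one) might look like the sketch below. It is simplified to operate on `std::net::IpAddr` values; the actual helper presumably works on libp2p multiaddrs, so the signature shown is an assumption.

```rust
use std::net::IpAddr;

/// Hypothetical simplification of the helper named in the commit: picks the
/// first non-loopback address, falling back to the first address of any kind.
/// Returns None only when the input is empty.
pub fn select_external_address(addrs: &[IpAddr]) -> Option<IpAddr> {
    addrs
        .iter()
        // Skip 127.0.0.0/8 and ::1, which other containers cannot reach.
        .find(|a| !a.is_loopback())
        // Fallback: keep the peer reachable on single-host setups.
        .or_else(|| addrs.first())
        .copied()
}
```

The fallback matters for single-machine setups where loopback is the only address a peer advertises.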
- Add automatic LevelDB corruption recovery on startup
  - Detects corruption errors (VersionEdit, unknown tag)
  - Runs leveldb::repair() automatically and retries open
  - Provides clear error messages if recovery fails
- Implement graceful shutdown to prevent database corruption
  - New ShutdownSignal enum for tracking shutdown type
  - Restructured run() to use oneshot channels for coordination
  - execute_with_shutdown() sends chain Arc for cleanup access
  - sync_storage() method on Chain calls storage.sync()
  - Database is synced before process exit on SIGTERM/Ctrl-C
This fixes the LevelDB corruption issue observed during chaos testing when nodes are stopped and restarted. The corruption occurred because the previous shutdown handler exited immediately without flushing LevelDB's in-memory write buffers.
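The detect, repair, and retry flow described above can be sketched generically. The function names and the use of `String` errors are assumptions made for illustration; the `open` and `repair` closures stand in for the real LevelDB calls, which are not reproduced here.

```rust
/// Heuristic taken from the commit message: these substrings mark a corruption error.
pub fn is_corruption_error(msg: &str) -> bool {
    msg.contains("VersionEdit") || msg.contains("unknown tag")
}

/// Tries `open`; on a corruption error, runs `repair` once and retries `open`.
/// Non-corruption errors are returned unchanged without attempting a repair.
pub fn open_with_recovery<T>(
    mut open: impl FnMut() -> Result<T, String>,
    repair: impl FnOnce() -> Result<(), String>,
) -> Result<T, String> {
    match open() {
        Ok(db) => Ok(db),
        Err(e) if is_corruption_error(&e) => {
            // Surface both the original error and the repair failure.
            repair().map_err(|r| format!("repair failed after corruption ({e}): {r}"))?;
            open().map_err(|e2| format!("open failed after repair: {e2}"))
        }
        Err(e) => Err(e),
    }
}
```

Limiting the flow to a single repair attempt avoids looping forever on a database that cannot be recovered.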
…ved node setup - Updated public keys and federation addresses in ChainSpec for DEV_REGTEST. - Modified Docker Compose configuration to include a new Alys node with updated parameters. - Increased sleep duration in entrypoint commands to ensure proper initialization. - Changed aura secret keys for enhanced security. These changes enhance the network configuration and improve node stability during development.
- Added a comprehensive chaos testing framework for the Alys V2 local regtest environment. - Implemented three modes of operation: Scenario Testing, Interactive Testing, and Stress Testing. - Included detailed documentation in README.md outlining usage, prerequisites, and available chaos scenarios. - Developed a unified script (`tier1-scenarios.sh`) for executing various chaos scenarios with automated reporting and metrics collection. This framework enhances the resilience testing capabilities of the Alys V2 environment, allowing for better validation of system behavior under failure conditions.
…e for node3 - Changed port mappings for RPC and P2P services to avoid conflicts. - Updated volume paths to point to node3 directories for database, wallet, and logs. These adjustments ensure proper configuration for the new node setup in the Alys V2 environment.
Increases the maximum allowed --regtest-node-id value from 2 to 10, enabling multi-node regtest deployments for chaos testing and development scenarios.
- Update chain metrics in chain/metrics.rs with alys_chain_ prefix
- Update sync/network metrics in network/metrics.rs with alys_sync_, alys_network_, alys_peer_ prefixes
- Update storage metrics in storage/metrics.rs with alys_storage_ prefix
- Add alys-node-3 scrape target to prometheus.yml
- Update all Grafana dashboard queries to match new metric names
This fixes the "no data" issue in Grafana dashboards by ensuring metric names in source code match the PromQL queries.
- Add 'Node' template variable with multi-select for filtering by node - Add 'Refresh' interval variable (5s, 10s, 30s, 1m, 5m) - Update all panel queries to use $node variable instead of hardcoded pattern - Rename dashboard from "Two-Node Regtest Overview" to "Network Overview" - Update MONITORING.md documentation with new variable usage
Add a new "Health Overview" row at the top of the dashboard with 10 stat panels for quick node status visibility:
Row 1 (network status):
- Network Height: Highest chain height across all nodes
- Node Height: Current height of selected node(s)
- Blocks Behind: How far behind network (0 = synced)
- Connected Peers: P2P peer count with thresholds
- Nodes Online: Count of online Alys nodes
- Reorgs (24h): Chain reorganizations in last 24 hours
Row 2 (sync status):
- Sync Status: Synced/Syncing with color-coded background
- Sync Progress: Percentage with gradient thresholds
- Sync Rate: Blocks per second
- Block Production: Blocks produced per minute
All panels include descriptions and color thresholds for quick visual assessment of node health.
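The dashboard computes these values in PromQL, which is not shown in this text. As an illustration of the arithmetic behind "Network Height" and "Blocks Behind", here is a small sketch; the function name and signature are hypothetical.

```rust
/// Hypothetical helper mirroring the panel definitions: network height is the
/// highest height across all nodes, and "blocks behind" is the gap between
/// that and the selected node (0 means synced).
pub fn blocks_behind(node_height: u64, all_heights: &[u64]) -> u64 {
    // With no other samples, treat the node's own height as the network height.
    let network_height = all_heights.iter().copied().max().unwrap_or(node_height);
    // saturating_sub avoids underflow if the node is ahead of every sample.
    network_height.saturating_sub(node_height)
}
```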
Annotations for key events:
- Chain Reorganization (red) - detected reorgs
- Orphan Block (orange) - orphan blocks detected
- Fork Detected (yellow) - fork events
- Sync Started (blue) - node begins syncing
- Sync Completed (green) - node finishes syncing
Consensus Health Row:
- Aura Slot Tracking - current consensus slot over time
- Block Production Success Rate - percentage with thresholds
- Blocks Produced (1h) - validator activity bar chart
- Block Failures Rate - production/import failures
- AuxPoW Processing - success vs failure rates
Network Quality Row:
- Message Latency - P50/P95 with threshold lines
- Avg Peer Reputation - color-coded reputation score
- Gossip Mesh Health - mesh size vs connected peers
- Message Throughput - sent vs received messages/s
- Block Propagation - received vs forwarded blocks
Sync Progress Details Row:
- Sync ETA - time remaining with "Synced" mapping
- Blocks Remaining - count with color thresholds
- Sync Peers - active sync peer count
- Total Blocks Synced - cumulative counter
- Blocks Remaining vs Time - trend graph
- Sync Rate Trend - smooth line with gradient fill
All panels include descriptions, appropriate thresholds, and respect the $node filter variable.
Add a new "Sync State Machine" panel to the Sync Progress Details row that displays the current sync state as a human-readable label with color-coded background:
State mappings:
- 0: Stopped (gray)
- 1: Starting (blue)
- 2: Discovering Peers (purple)
- 3: Querying Network (light-blue)
- 4: Requesting Blocks (orange)
- 5: Processing Blocks (yellow)
- 6: Synced (green)
- 7: Error (red)
Adjusted layout of other stat panels in the row to accommodate the new 8-column wide state machine panel.
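The numeric-to-label mapping listed above could be mirrored on the Rust side as follows. The enum and method names are hypothetical; only the values 0 through 7 and their labels come from the commit message.

```rust
/// Hypothetical enum mirroring the dashboard's 0-7 state mapping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncState {
    Stopped,
    Starting,
    DiscoveringPeers,
    QueryingNetwork,
    RequestingBlocks,
    ProcessingBlocks,
    Synced,
    Error,
}

impl SyncState {
    /// Decodes a gauge value; values outside 0..=7 have no mapping.
    pub fn from_gauge(value: u64) -> Option<Self> {
        Some(match value {
            0 => Self::Stopped,
            1 => Self::Starting,
            2 => Self::DiscoveringPeers,
            3 => Self::QueryingNetwork,
            4 => Self::RequestingBlocks,
            5 => Self::ProcessingBlocks,
            6 => Self::Synced,
            7 => Self::Error,
            _ => return None,
        })
    }

    /// Human-readable label as shown in the panel.
    pub fn label(self) -> &'static str {
        match self {
            Self::Stopped => "Stopped",
            Self::Starting => "Starting",
            Self::DiscoveringPeers => "Discovering Peers",
            Self::QueryingNetwork => "Querying Network",
            Self::RequestingBlocks => "Requesting Blocks",
            Self::ProcessingBlocks => "Processing Blocks",
            Self::Synced => "Synced",
            Self::Error => "Error",
        }
    }
}
```

Keeping the mapping in one place makes it easier to notice when the exported gauge and the dashboard's value mappings drift apart.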
Add new PEER_REPUTATION GaugeVec metric with peer_id label to track individual peer reputation scores. This allows operators to:
- Identify problematic peers with low reputation
- Monitor reputation changes over time
- Debug peer selection issues
Code changes:
- Add PEER_REPUTATION GaugeVec in network/metrics.rs
- Add update_prometheus_peer_reputations() helper function
- Import GaugeVec and register_gauge_vec_with_registry
Dashboard changes:
- Add "Per-Peer Reputation" panel to Network Quality row
- Full-width bar chart showing each peer's reputation
- Color thresholds: red (<-50), orange (-50 to 0), yellow (0-50), green (>50)
- Legend sorted by reputation (descending)
- Shifted Sync Progress Details row down to accommodate
- Add import for update_prometheus_peer_reputations helper - Add update_prometheus_metrics() method to export peer reputations - Call metric update automatically on add_peer(), remove_peer(), and update_reputation() to keep Grafana dashboard current
The ALYS_REGISTRY already adds the "alys_" prefix to all metrics. Having "alys_" in the metric names caused double prefixing like "alys_alys_sync_current_height". Remove the prefix from all metric names in network/metrics.rs so they are correctly exposed as "alys_sync_current_height", etc.
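The double-prefix effect can be modeled with a one-line function. This is a sketch of the naming behavior described in the commit, not the registry's actual code; `exposed_name` is a hypothetical helper.

```rust
/// Hypothetical model of how a registry-level prefix combines with a metric name.
pub fn exposed_name(registry_prefix: &str, metric_name: &str) -> String {
    format!("{registry_prefix}_{metric_name}")
}
```

Under this model, registering `alys_sync_current_height` in a registry prefixed with `alys` yields `alys_alys_sync_current_height`, while registering `sync_current_height` yields the intended `alys_sync_current_height`.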
The dashboard expects datasource UID "prometheus" but Grafana was auto-generating a random UID. Add explicit uid field and deleteDatasources directive to force re-provisioning with the correct UID.
Many V2 ChainActor metrics aren't being exported (ChainMetrics.register() not called). Update dashboard to use metrics that actually exist:
- alys_chain_height → alys_storage_current_chain_height
- alys_chain_sync_status → alys_sync_is_syncing
- alys_chain_network_peers → alys_network_connected_peers
- alys_chain_blocks_produced_total → alys_aura_produced_blocks_total
- alys_chain_blocks_imported_total → alys_storage_blocks_stored_total
- alys_sync_state → alys_sync_is_syncing
- alys_sync_active_peers → alys_network_connected_peers
Some panels will still show "no data" until ChainMetrics.register() is called in ChainActor initialization.
Add INFO level logging before/after registration to trace whether metrics registration succeeds or fails. Changes warn to error for failures to increase visibility.
The $node dropdown was only showing "All" because the query
`label_values(up{job=~"$node"}, job)` was circular - using $node
to filter while trying to populate $node.
Changed to `label_values(up{job=~"alys-node.*"}, job)` to correctly
discover all alys-node-* jobs from Prometheus.
Introduce DEFAULT_MAX_PENDING_IMPORTS constant to replace hardcoded values in ChainActor and related handlers, improving maintainability and consistency across the codebase.
- Added Prometheus metrics for network block events: received, duplicate, forwarded, and deserialization errors in the NetworkActor. - Updated sync state metrics in SyncActorState to reflect state transitions. - Improved RPC request metrics to track request duration and success/error rates for better observability.
The Sync Status panel was showing "Syncing" when nodes were synced because the value mappings were inverted:
- sync_is_syncing=0 means NOT syncing (synced)
- sync_is_syncing=1 means IS syncing
Fixes:
- Invert Sync Status stat panel mappings (0=Synced, 1=Syncing)
- Invert Sync Status gauge panel thresholds (green at 0, yellow at 1)
- Fix Sync State panel to query alys_sync_state instead of alys_sync_is_syncing (panel has 0-7 state machine mappings)
- Fix Sync State Machine panel to query alys_sync_state
…n files - Changed chain_id in spec.rs and related JSON configuration files (chain-dev.json, chain-full.json, chain.json, dev-genesis.json, genesis.json) from 121212 and 727272 to 262626. - Updated network ID in dev-genesis.json to match the new chain ID. - Adjusted geth.sh script to reflect the new network ID for both development and testnet environments.
…atus panel - Changed the title of the Block Production panel to "Block Production Per Minute" for clarity. - Removed the unused Sync Status panel to streamline the dashboard and improve performance.
…er-compose configuration
…or 3+ node networks - Introduced a comprehensive document detailing the current state, deployment blockers, and implementation plan for chain reorganization in Alys V2, specifically targeting 3+ node networks. - Highlighted critical gaps, including deep reorg support and AuxPoW considerations, along with a structured implementation timeline and decision log for team alignment.
No description provided.