Migrate to packages/* publishing roots, wire CLI to packages/cli, and add plugin system + use-cases#65
Open
saurabhsharma2u wants to merge 16 commits into main from
Conversation
…e to packages/ directory

This update:
1. Fixes broken symlinks and node_modules in the new workspace structure.
2. Adds missing @crawlith/core workspace dependencies to all plugins.
3. Sets up independent tsconfig.json files for each plugin to enable per-package compilation.
4. Finalizes the GraphNode interface in core by adding the 'title' property.
5. Resolves type errors in plugins by applying explicit core types (SiteGraph, MetricsContext, etc.).
6. Updates the root tsconfig.json to correctly exclude build outputs.
Resolves the builtin vs. external plugin duality:
1. Gutted core's builtin.ts — it no longer contains inline plugin implementations.
The builtinPlugins array is now an empty placeholder to avoid breaking the API signature.
2. CLI's plugins.ts now imports directly from @crawlith/plugin-* packages,
making them the single source of truth for all plugin logic.
3. Added all 6 plugin packages as workspace dependencies of @crawlith/cli.
4. Added plugin packages to tsup's noExternal list so they are bundled
into the CLI binary (Node can't import raw .ts at runtime).
5. Aligned external plugins with previous builtin behavior:
- duplicate-detection: added missing { collapse: true } option
- simhash: removed unsafe 'as any' casts, uses proper GraphNode type
6. Added 'exports' field to all plugin package.json files for proper
module resolution.
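The bundling step in item 4 can be sketched as a tsup config. This is a minimal illustration, not the project's actual config file; the entry path and the plugin-package naming pattern are assumptions based on the package names listed above.

```typescript
// tsup.config.ts (sketch; entry point and package pattern are assumed)
import { defineConfig } from 'tsup';

export default defineConfig({
  entry: ['src/index.ts'],
  format: ['esm'],
  // Bundle all @crawlith/plugin-* workspace packages into the CLI
  // binary, since Node cannot import their raw .ts sources at runtime.
  noExternal: [/^@crawlith\/plugin-/],
});
```

tsup's noExternal accepts strings or regular expressions, so a single pattern covers all six plugin packages without listing them individually.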
Documents the full project structure, including:
- Package map and dependency flow
- Core engine modules (crawler, graph, analysis, DB)
- Plugin system contract, lifecycle hooks, and resolution
- CLI commands and server API endpoints
- Build pipeline (tsup single-binary bundling)
- Data flow diagrams for crawl and dashboard
- Testing strategy and key design decisions

Also fixes the @crawlith/core dependency in the infrastructure package.
…eclarations

Extended the CrawlPlugin contract with a PluginCliOption interface, allowing plugins to declare their own CLI flags (with values and defaults).

Changes:
- Added the PluginCliOption type to core plugin types
- Content Clustering plugin now declares --cluster-threshold and --min-cluster-size
- Duplicate Detection plugin now declares --no-collapse
- HITS plugin already declared --compute-hits via a flag; added a description
- Updated registerPluginFlags to handle both toggle flags and value options
- Removed hardcoded plugin flags from crawl.ts and page.ts
- Plugin flags now appear dynamically in crawl --help and page --help

Also fixed the infrastructure package's missing @crawlith/core dependency.
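The contract extension described above might look roughly like the following. The field names and the example plugin are illustrative assumptions, not the actual @crawlith/core source.

```typescript
// Hypothetical shape of the PluginCliOption contract; field names
// are assumptions based on the change description, not real source.
interface PluginCliOption {
  flag: string;                              // e.g. "--cluster-threshold <n>"
  description: string;
  defaultValue?: string | number | boolean;  // used when the flag is omitted
  takesValue?: boolean;                      // value option vs. boolean toggle
}

interface CrawlPlugin {
  name: string;
  cliOptions?: PluginCliOption[];
  // ...lifecycle hooks omitted for brevity
}

// A plugin declaring its own flags, so the CLI can register them
// dynamically instead of hardcoding them in crawl.ts / page.ts:
const contentClustering: CrawlPlugin = {
  name: 'content-clustering',
  cliOptions: [
    { flag: '--cluster-threshold <n>', description: 'Similarity threshold', defaultValue: 0.85, takesValue: true },
    { flag: '--min-cluster-size <n>', description: 'Minimum pages per cluster', defaultValue: 2, takesValue: true },
  ],
};
```

With this shape, a registerPluginFlags helper only needs to branch on takesValue to decide between registering a toggle and a value option.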
Both the CLI 'page' command and the server API endpoints now go through the same PageAnalysisUseCase, eliminating duplicate analysis code paths.

Changes:
- Extended the PageAnalysisInput DTO with all analysis options (url, live, snapshotId, seo, content, accessibility, proxy, rate, etc.)
- PageAnalysisUseCase now accepts EngineContext for logging
- CLI page.ts: uses PageAnalysisUseCase instead of calling analyzeSite
- Server GET /api/page: replaced 100+ lines of raw SQL + analyzePages with a PageAnalysisUseCase call, enriched only with graph-level metrics
- Server POST /api/page/crawl: replaced the analyzeSite call with the use case
- Added snapshotId to AnalyzeOptions so the API can target a specific snapshot for analysis
- loadCrawlData now accepts an optional snapshotId parameter
- Updated the CLI test mock to include PageAnalysisUseCase

Before: the CLI called analyzeSite() while the server used raw SQL + analyzePages(). After: both go through PageAnalysisUseCase.execute(input), guaranteeing identical analysis results.
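The unified path can be sketched as below. The DTO fields come from the change description; the class body is an illustrative stub, not the project's implementation.

```typescript
// Sketch of the shared analysis entry point. The input fields match
// the DTO described above; the execute() body is a stand-in.
interface PageAnalysisInput {
  url: string;
  live?: boolean;
  snapshotId?: number;
  seo?: boolean;
  content?: boolean;
  accessibility?: boolean;
}

class PageAnalysisUseCase {
  async execute(input: PageAnalysisInput): Promise<{ url: string; analyzed: boolean }> {
    // Both the CLI 'page' command and the server endpoints call this
    // one method, so results cannot drift between the two surfaces.
    return { url: input.url, analyzed: true };
  }
}
```

The payoff is structural: any future analysis option is added once to the input DTO and becomes available to both the CLI and the API.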
Instead of creating a new 'partial' snapshot every time page --live or POST /api/page/crawl is called, the crawler now reuses the latest existing partial snapshot for the site.

Behavior:
- If a partial snapshot exists → reuse it and upsert page data into it
- If no partial snapshot exists → create a new one
- After the crawl → touch the snapshot's created_at to keep it current
- Stale edges for re-crawled pages are cleaned up before new ones are inserted, avoiding duplicate rows (metrics already use INSERT OR REPLACE)

Changes:
- SnapshotRepository: added getLatestPartialSnapshot() and touchSnapshot()
- EdgeRepository: added deleteEdgesForPage() for cleanup
- Crawler.initialize(): checks for a reusable partial snapshot before creating a new one
- Crawler.flushEdges(): cleans up stale edges when reusingSnapshot=true
- Crawler.run(): touches the snapshot timestamp on completion if the snapshot was reused
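The reuse-or-create decision can be sketched as follows. The repository method names follow the change list above; the Snapshot shape and the function itself are assumptions for illustration.

```typescript
// Illustrative sketch of Crawler.initialize()'s snapshot decision.
// Repository method names match the description; bodies are assumed.
interface Snapshot { id: number; partial: boolean; createdAt: number; }

interface SnapshotRepository {
  getLatestPartialSnapshot(site: string): Snapshot | undefined;
  createSnapshot(site: string, partial: boolean): Snapshot;
  touchSnapshot(id: number): void;
}

function initializeSnapshot(
  repo: SnapshotRepository,
  site: string,
): { snapshot: Snapshot; reused: boolean } {
  const existing = repo.getLatestPartialSnapshot(site);
  if (existing) {
    // Reuse: page data will be upserted into this snapshot, and
    // stale edges cleaned up before new ones are inserted.
    return { snapshot: existing, reused: true };
  }
  return { snapshot: repo.createSnapshot(site, true), reused: false };
}
```

After a reused crawl completes, touchSnapshot(id) updates created_at so the snapshot remains the "latest" partial for the next live request.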
Implemented a zero-dependency, sliding-window rate limiter for the API to prevent abuse, especially around the expensive live-crawling endpoint.

- General routes: 60 requests per minute
- Intensive routes (POST /api/page/crawl): 5 requests per minute
- Automatically cleans up stale IP entries every 5 minutes
- Returns a strictly formatted 429 Too Many Requests response with a Retry-After header
- Adds X-RateLimit-Limit and X-RateLimit-Remaining headers to all responses
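A sliding-window limiter of this kind can be sketched in a few lines. This is a minimal standalone illustration of the technique, not the project's middleware; the class and method names are invented for the example.

```typescript
// Minimal sliding-window rate limiter (illustrative; names assumed).
// Keeps a per-IP list of request timestamps inside the window.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns the remaining quota after admitting the request,
  // or -1 when the request must be rejected with 429.
  check(ip: string, now = Date.now()): number {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(ip) ?? []).filter(t => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(ip, recent); // keep the pruned list
      return -1;
    }
    recent.push(now);
    this.hits.set(ip, recent);
    return this.limit - recent.length; // value for X-RateLimit-Remaining
  }
}
```

Under the limits above, general routes would use new SlidingWindowLimiter(60, 60_000) and the crawl endpoint new SlidingWindowLimiter(5, 60_000); a periodic sweep deleting map entries whose timestamps are all stale handles the 5-minute cleanup.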
…nto commands, and fix lint errors
… flags

- Created ReporterPlugin to handle all console and JSON output
- Decoupled the crawl and page commands from presentation logic
- Standardized CLI flags; removed hardcoded options in favor of plugin-injected ones
- Fixed ESM runtime errors by bundling plugins and using type-only exports
- Stabilized CLI tests with updated mocks for plugin hooks
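The decoupling in the first two points amounts to moving all formatting behind a single seam. A hypothetical sketch (the function name and summary shape are assumptions, not the ReporterPlugin's real API):

```typescript
// Hypothetical presentation seam: commands produce data, the
// reporter decides how it is rendered (console text vs. JSON).
interface CrawlSummary { pages: number; errors: number; }

function formatSummary(summary: CrawlSummary, json: boolean): string {
  return json
    ? JSON.stringify(summary)
    : `Crawled ${summary.pages} pages (${summary.errors} errors)`;
}
```

With output funneled through one plugin, the crawl and page commands never touch console formatting directly, which is what makes them testable with plain mocks.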
Motivation
Make packages/* the canonical publish roots while continuing the source migration from plugins/*.

Description

- Add packages/* packages for @crawlith/core and @crawlith/cli (and several architecture/internal packages) that re-export or copy build artifacts from existing plugins/* sources, and update package.json scripts to build/test those filters via pnpm.
- Update the root package.json to point the crawlith bin and crawlith script to packages/cli/dist/index.js and limit build/test to the public package filters via pnpm --filter.
- Add a plugin system (plugin/types.ts, plugin/loader.ts, plugin/manager.ts, plugin/resolve.ts, and plugin/builtin.ts) plus ports and use-case layers (application/usecase.ts, application/usecases.ts) and a CrawlSitegraph use-case that runs plugin hooks and post-crawl metrics.
- Wire the CLI to the plugin registry (plugins/cli/src/plugins.ts) and use the CrawlSitegraph use-case instead of direct imperative crawl/load calls.
- Add plugin packages under packages/plugins/* that consume @crawlith/core APIs (e.g. simhash, pagerank, hits, heading-health, duplicate-detection, content-clustering), plus small packages/infrastructure and packages/shared skeletons.
- Add docs/PUBLISHING_MIGRATION.md with a release checklist and next-phase notes, and update pnpm-workspace.yaml to include the new packages/* workspace paths.
- Make runPostCrawlMetrics configurable for selective metric computation via PostCrawlMetricOptions and update the metric runner accordingly.

Testing

- Added/updated tests: plugins/core/tests/plugin_system.test.ts, plugins/cli/tests/plugin_activation.test.ts, and plugins/cli/tests/cli.test.ts.
- Ran pnpm test (root: pnpm --filter @crawlith/core test && pnpm --filter @crawlith/cli test), which executed vitest for the core and CLI tests; tests completed successfully.

Codex Task