
Migrate to packages/* publishing roots, wire CLI to packages/cli, and add plugin system + use-cases #65

Open
saurabhsharma2u wants to merge 16 commits into main from codex/refactor-crawlith-architecture-to-plugin-system-ua77sg

Conversation

@saurabhsharma2u
Contributor

Motivation

  • Stabilize the public npm surface by making packages/* the canonical publish roots while continuing source migration from plugins/*.
  • Introduce a structured plugin system and application use-cases to enable command-aware plugin activation and lifecycle hooks.
  • Rewire the root CLI and scripts to run from the new package layout and provide a migration checklist for releases.

Description

  • Add packages/* packages for @crawlith/core and @crawlith/cli (plus several architecture/internal packages) that re-export or copy build artifacts from the existing plugins/* sources, and update package.json scripts to build and test those packages via pnpm --filter.
  • Update root package.json to point the crawlith bin and crawlith script to packages/cli/dist/index.js and limit build/test to the public package filters via pnpm --filter.
  • Implement a plugin framework in core: add plugin/types.ts, plugin/loader.ts, plugin/manager.ts, plugin/resolve.ts, and plugin/builtin.ts plus ports and use-case layers (application/usecase.ts, application/usecases.ts) and a CrawlSitegraph use-case that runs plugin hooks and post-crawl metrics.
  • Migrate and adapt CLI commands to resolve and register command-aware plugin flags (plugins/cli/src/plugins.ts) and use the CrawlSitegraph use-case instead of direct imperative crawl/load calls.
  • Add multiple plugin packages under packages/plugins/* that consume @crawlith/core APIs (e.g. simhash, pagerank, hits, heading-health, duplicate-detection, content-clustering), plus small packages/infrastructure and packages/shared skeletons.
  • Add docs/PUBLISHING_MIGRATION.md with a release checklist and next-phase notes, and include pnpm-workspace.yaml updates to include the new packages/* workspace paths.
  • Make runPostCrawlMetrics configurable for selective metric computation via PostCrawlMetricOptions and update metric runner accordingly.
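A rough sketch of the plugin contract and manager described above may help; the hook names and shapes below are illustrative assumptions, not the actual @crawlith/core API (the real definitions live in plugin/types.ts and plugin/manager.ts):

```typescript
// Hypothetical sketch of the CrawlPlugin contract and PluginManager.
// Hook names and signatures are assumptions for illustration only.
export interface SiteGraph {
  nodes: Map<string, { url: string; title?: string }>;
}

export interface CrawlPlugin {
  name: string;
  // Lifecycle hooks run by the CrawlSitegraph use-case.
  onCrawlStart?(seed: string): void | Promise<void>;
  onPageCrawled?(url: string, graph: SiteGraph): void | Promise<void>;
  onCrawlComplete?(graph: SiteGraph): void | Promise<void>;
}

type HookName = "onCrawlStart" | "onPageCrawled" | "onCrawlComplete";

export class PluginManager {
  private plugins: CrawlPlugin[] = [];

  register(plugin: CrawlPlugin): void {
    this.plugins.push(plugin);
  }

  get names(): string[] {
    return this.plugins.map((p) => p.name);
  }

  // Run one lifecycle hook across every registered plugin, in order.
  async runHook(hook: HookName, ...args: unknown[]): Promise<void> {
    for (const p of this.plugins) {
      const fn = p[hook] as
        | ((...a: unknown[]) => void | Promise<void>)
        | undefined;
      if (fn) await fn.apply(p, args);
    }
  }
}
```

In this sketch the CrawlSitegraph use-case would call runHook at each lifecycle stage and then run the post-crawl metrics.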

Testing

  • Added unit tests covering the plugin system and CLI integration: plugins/core/tests/plugin_system.test.ts, plugins/cli/tests/plugin_activation.test.ts, and plugins/cli/tests/cli.test.ts.
  • Ran the workspace unit tests (pnpm test, i.e. pnpm --filter @crawlith/core test && pnpm --filter @crawlith/cli test from the root), which executed vitest for the core and CLI suites; all tests passed.

Codex Task

saurabhsharma2u and others added 13 commits March 1, 2026 04:18
…e to packages/ directory

This update:
1. Fixes broken symlinks and node_modules in the new workspace structure.
2. Adds missing @crawlith/core workspace dependencies to all plugins.
3. Sets up independent tsconfig.json files for each plugin to enable per-package compilation.
4. Finalizes the GraphNode interface in core by adding the 'title' property.
5. Resolves type errors in plugins by applying explicit core types (SiteGraph, MetricsContext, etc.).
6. Updates root tsconfig.json to correctly exclude build outputs.

Resolves the builtin vs. external plugin duality:

1. Gutted core's builtin.ts — it no longer contains inline plugin implementations.
   The builtinPlugins array is now an empty placeholder to avoid breaking the API signature.

2. CLI's plugins.ts now imports directly from @crawlith/plugin-* packages,
   making them the single source of truth for all plugin logic.

3. Added all 6 plugin packages as workspace dependencies of @crawlith/cli.

4. Added plugin packages to tsup's noExternal list so they are bundled
   into the CLI binary (Node can't import raw .ts at runtime).

5. Aligned external plugins with previous builtin behavior:
   - duplicate-detection: added missing { collapse: true } option
   - simhash: removed unsafe 'as any' casts, uses proper GraphNode type

6. Added 'exports' field to all plugin package.json files for proper
   module resolution.
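The bundling step in point 4 could look roughly like the tsup config below; the entry name and the exact noExternal pattern are assumptions for illustration:

```typescript
// tsup.config.ts (sketch): bundle the @crawlith/plugin-* workspace
// packages into the CLI binary, since Node cannot import their raw
// .ts sources at runtime.
import { defineConfig } from "tsup";

export default defineConfig({
  entry: ["src/index.ts"],
  format: ["esm"],
  // Inline every @crawlith/plugin-* workspace dependency into the bundle
  // instead of leaving it as an external require/import.
  noExternal: [/^@crawlith\/plugin-/],
});
```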
Documents the full project structure including:
- Package map and dependency flow
- Core engine modules (crawler, graph, analysis, DB)
- Plugin system contract, lifecycle hooks, and resolution
- CLI commands and Server API endpoints
- Build pipeline (tsup single-binary bundling)
- Data flow diagrams for crawl and dashboard
- Testing strategy and key design decisions

Also fixes @crawlith/core dependency in infrastructure package.
…eclarations

Extended the CrawlPlugin contract with a PluginCliOption interface,
allowing plugins to declare their own CLI flags (with values and defaults).

Changes:
- Added PluginCliOption type to core plugin types
- Content Clustering plugin now declares --cluster-threshold and --min-cluster-size
- Duplicate Detection plugin now declares --no-collapse
- HITS plugin already declared --compute-hits via flag; added description
- Updated registerPluginFlags to handle both toggle flags and value options
- Removed hardcoded plugin flags from crawl.ts and page.ts
- Plugin flags now appear dynamically in crawl --help and page --help

Also fixed infrastructure package missing @crawlith/core dependency.
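A minimal sketch of what the PluginCliOption declaration and flag collection might look like; the field names here are assumptions, and the real interface in the core plugin types may differ:

```typescript
// Hypothetical PluginCliOption shape: plugins declare their own CLI
// flags, and the CLI registers them dynamically instead of hardcoding.
export interface PluginCliOption {
  flag: string; // e.g. "--cluster-threshold <n>" or "--no-collapse"
  description: string;
  defaultValue?: string | number | boolean;
}

export interface PluginWithFlags {
  name: string;
  cliOptions?: PluginCliOption[];
}

// Gather every declared option so a registerPluginFlags-style helper can
// attach them to a command (both toggle flags and value options).
export function collectPluginFlags(
  plugins: PluginWithFlags[]
): PluginCliOption[] {
  return plugins.flatMap((p) => p.cliOptions ?? []);
}
```

With this shape, crawl --help and page --help can list whatever flags the active plugins declare, with no per-plugin code in the commands themselves.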
Both the CLI 'page' command and server API endpoints now go through
the same PageAnalysisUseCase, eliminating duplicate analysis code paths.

Changes:
- Extended PageAnalysisInput DTO with all analysis options (url, live,
  snapshotId, seo, content, accessibility, proxy, rate, etc.)
- PageAnalysisUseCase now accepts EngineContext for logging
- CLI page.ts: uses PageAnalysisUseCase instead of calling analyzeSite
- Server GET /api/page: replaced 100+ lines of raw SQL + analyzePages
  with PageAnalysisUseCase call, enriched only with graph-level metrics
- Server POST /api/page/crawl: replaced analyzeSite call with UseCase
- Added snapshotId to AnalyzeOptions so the API can target a specific
  snapshot for analysis
- loadCrawlData now accepts optional snapshotId parameter
- Updated CLI test mock to include PageAnalysisUseCase

Before: CLI called analyzeSite(), server used raw SQL + analyzePages().
After: Both go through PageAnalysisUseCase.execute(input), guaranteed
identical analysis results.
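The shared code path can be pictured with a minimal use-case sketch; the real PageAnalysisUseCase, its DTO fields, and the EngineContext wiring are richer than this:

```typescript
// Sketch: one use-case, two callers. The CLI `page` command and the
// server endpoints both call execute(), so analysis cannot diverge.
export interface PageAnalysisInput {
  url: string;
  live?: boolean;
  snapshotId?: number; // lets the API target a specific snapshot
}

export interface PageAnalysisResult {
  url: string;
  issues: string[];
}

export class PageAnalysisUseCase {
  // The analyzer is injected; in the real code it would wrap analyzeSite
  // and the repositories behind ports.
  constructor(
    private analyze: (input: PageAnalysisInput) => PageAnalysisResult
  ) {}

  execute(input: PageAnalysisInput): PageAnalysisResult {
    return this.analyze(input);
  }
}
```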
Instead of creating a new 'partial' snapshot every time page --live
or POST /api/page/crawl is called, the crawler now reuses the latest
existing partial snapshot for the site.

Behavior:
- If a partial snapshot exists → reuse it, upsert page data into it
- If no partial snapshot exists → create a new one
- After crawl → touch snapshot's created_at to keep it current
- Stale edges for re-crawled pages are cleaned up before inserting new
  ones to avoid duplicate rows (metrics already use INSERT OR REPLACE)

Changes:
- SnapshotRepository: added getLatestPartialSnapshot() and touchSnapshot()
- EdgeRepository: added deleteEdgesForPage() for cleanup
- Crawler.initialize(): checks for reusable partial before creating new
- Crawler.flushEdges(): cleans up stale edges when reusingSnapshot=true
- Crawler.run(): touches snapshot timestamp on completion if reused
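The reuse-or-create decision can be sketched as follows; the repository method names come from the commit message above, but their signatures are assumptions:

```typescript
// Sketch of partial-snapshot reuse: prefer the latest partial snapshot
// for a site; only create a new one when none exists.
interface Snapshot {
  id: number;
  status: "partial" | "complete";
}

interface SnapshotRepository {
  getLatestPartialSnapshot(siteId: number): Snapshot | undefined;
  createSnapshot(siteId: number, status: "partial"): Snapshot;
  touchSnapshot(id: number): void; // bump created_at after a reused crawl
}

function resolveSnapshot(
  repo: SnapshotRepository,
  siteId: number
): { snapshot: Snapshot; reused: boolean } {
  const existing = repo.getLatestPartialSnapshot(siteId);
  if (existing) {
    return { snapshot: existing, reused: true };
  }
  return { snapshot: repo.createSnapshot(siteId, "partial"), reused: false };
}
```

When reused is true, the crawler would also clean up stale edges for re-crawled pages before inserting new ones, as described above.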
Implemented a zero-dependency, sliding-window rate limiter for the API to
prevent abuse, especially around the expensive live crawling endpoint.

- General routes: 60 requests per minute
- Intensive routes (POST /api/page/crawl): 5 requests per minute
- Automatically cleans up stale IP entries every 5 minutes
- Returns strictly formatted 429 Too Many Requests with Retry-After header
- Adds X-RateLimit-Limit and X-RateLimit-Remaining headers to all responses
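A zero-dependency sliding-window limiter along these lines might look like the sketch below; the class and method names are illustrative, and the actual middleware also sets the Retry-After and X-RateLimit-* headers:

```typescript
// Sliding-window rate limiter sketch: keep recent request timestamps
// per IP, reject once the window holds `limit` entries, and sweep
// idle IPs periodically.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  // Returns the remaining quota after this request, or -1 to signal a
  // 429 Too Many Requests response.
  check(ip: string, now: number = Date.now()): number {
    const windowStart = now - this.windowMs;
    const recent = (this.hits.get(ip) ?? []).filter((t) => t > windowStart);
    if (recent.length >= this.limit) {
      this.hits.set(ip, recent);
      return -1;
    }
    recent.push(now);
    this.hits.set(ip, recent);
    return this.limit - recent.length;
  }

  // Drop IPs with no requests inside the window (run every few minutes).
  sweep(now: number = Date.now()): void {
    for (const [ip, times] of this.hits) {
      if (times.every((t) => t <= now - this.windowMs)) this.hits.delete(ip);
    }
  }
}
```

Under these assumptions, two instances would cover the described policy: new SlidingWindowLimiter(60, 60_000) for general routes and new SlidingWindowLimiter(5, 60_000) for POST /api/page/crawl.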
… flags

- Created ReporterPlugin to handle all console and JSON output
- Decoupled crawl and page commands from presentation logic
- Standardized CLI flags; removed hardcoded options in favor of plugin-injected ones
- Fixed ESM runtime errors by bundling plugins and using type-only exports
- Stabilized CLI tests with updated mocks for plugin hooks