Performance audit: fix critical hotspots across extraction, storage, and retrieval by LeahArmstrong · Pull Request #19 · LeahArmstrong/codebase_index

LeahArmstrong · 2026-02-28T17:32:55Z

Tier 1 (Critical):

Cache estimated_tokens in ExtractedUnit (avoid repeated metadata.to_json)
Use Sets for DependencyGraph type_index/reverse (O(1) vs O(n) include?)
Memoize DependencyGraph#to_h (avoid 3x redundant serialization)
Add store_batch to VectorStore interface (pgvector multi-row INSERT, Qdrant batch upsert)
Reuse HTTP connections in Qdrant adapter (avoid TCP handshake per request)
Wrap temporal snapshot inserts in transaction (single commit vs 5K fsyncs)
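
The Set-based index item above can be sketched as follows. This is a minimal illustration, not the gem's code: `TinyGraph`, `register`, `dependents_of`, and `depends_on?` are hypothetical names, and the real `DependencyGraph` may structure its `@reverse` and `@type_index` differently.

```ruby
require 'set'

# Hypothetical, simplified stand-in for a dependency graph's reverse index.
class TinyGraph
  def initialize
    # Each dependency name maps to the Set of units that depend on it.
    @reverse = Hash.new { |hash, key| hash[key] = Set.new }
  end

  def register(unit, dependencies)
    # Set#<< is a hash insert, so no `include?` scan is needed to avoid duplicates.
    dependencies.each { |dep| @reverse[dep] << unit }
  end

  def dependents_of(name)
    # fetch avoids triggering the default block, so lookups never create keys.
    @reverse.fetch(name, Set.new).to_a
  end

  def depends_on?(unit, dependency)
    # Set#include? is a hash lookup rather than a linear Array scan.
    @reverse.key?(dependency) && @reverse[dependency].include?(unit)
  end
end

graph = TinyGraph.new
graph.register('OrdersController', %w[Order User])
graph.register('OrderMailer', %w[Order])
graph.register('OrdersController', %w[Order]) # duplicate is absorbed by the Set
```

With an Array-backed index, the duplicate guard would typically be `list << unit unless list.include?(unit)`, which scans the list on every registration.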

Tier 2 (High):

Pre-compile regex constants in CallbackAnalyzer (avoid dynamic Regexp.new in loops)
Eliminate flatten calls in orchestrator summaries (use each_value + sum)
Reuse HTTP connections in OpenAI embedding provider
Add find_batch to MetadataStore interface (single WHERE IN query)
Batch metadata lookup in Ranker and ContextAssembler
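
A rough sketch of the `find_batch` idea, assuming the method takes a list of ids and returns a Hash keyed by id. `InMemoryMetadataStore` and its `query_count` counter are hypothetical; in a SQL-backed store the single call would correspond to one `WHERE id IN (...)` query instead of one query per id.

```ruby
# Hypothetical in-memory store that counts round trips to show the difference.
class InMemoryMetadataStore
  attr_reader :query_count

  def initialize(rows)
    @rows = rows
    @query_count = 0
  end

  # One round trip per id.
  def find(id)
    @query_count += 1
    @rows[id]
  end

  # One round trip for all ids; ids with no row are absent from the result.
  def find_batch(ids)
    @query_count += 1
    @rows.slice(*ids)
  end
end

store = InMemoryMetadataStore.new(
  'User' => { type: :model },
  'Order' => { type: :model }
)
ids = %w[User Order Missing]

# Per-item lookup: three round trips.
one_by_one = ids.to_h { |id| [id, store.find(id)] }.compact
per_item_queries = store.query_count

# Batched lookup: one round trip.
batched = store.find_batch(ids)
batched_queries = store.query_count - per_item_queries
```

A caller such as a ranker can then read each candidate's metadata from the returned Hash without issuing further queries.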

Tier 3 (Medium):

Cache @nodes.keys in PageRank loop
Use max_by(limit) in GraphAnalyzer hubs (O(n) vs O(n log n))
Create path Set once in git enrichment (not per batch)
Reduce checkpoint frequency in embedding indexer (every 10 batches)
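
The `max_by(limit)` item can be illustrated with plain Ruby. The `degrees` data below is made up for the example; only `Enumerable#sort_by`, `#first`, and `#max_by` are real APIs.

```ruby
# Hypothetical node => degree counts for a small graph.
degrees = { 'User' => 12, 'Order' => 9, 'Cart' => 3, 'Coupon' => 1, 'Invoice' => 7 }
limit = 2

# Sorts every entry, then discards all but the first `limit`.
via_sort = degrees.sort_by { |_name, degree| -degree }.first(limit)

# Asks only for the top `limit` entries, returned in descending order.
via_max_by = degrees.max_by(limit) { |_name, degree| degree }
```

Both expressions select the same hubs here; the second avoids a full sort when only a few entries are needed.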

Tier 4 (Low):

Combine AR_INTERNAL_METHOD_PATTERNS into single alternation regex
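
As a sketch of the single-alternation approach, `Regexp.union` can fold a list of patterns into one regex that is matched once per name. The patterns and the `InternalMethods` module below are hypothetical; the contents of the gem's `AR_INTERNAL_METHOD_PATTERNS` are not shown on this page.

```ruby
module InternalMethods
  # Hypothetical patterns for illustration only.
  PATTERNS = [
    /\A_/,
    /\Aautosave_associated_records_for_/,
    /\Avalidate_associated_records_for_/
  ].freeze

  # One alternation regex, built once at load time.
  COMBINED = Regexp.union(PATTERNS)

  # Tries each pattern in turn.
  def self.internal_by_list?(name)
    PATTERNS.any? { |pattern| pattern.match?(name) }
  end

  # Single match against the combined regex.
  def self.internal_by_union?(name)
    COMBINED.match?(name)
  end
end

names = %w[_run_save_callbacks autosave_associated_records_for_orders full_name validate]
internal = names.select { |name| InternalMethods.internal_by_union?(name) }
```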

Also fixes: pre-existing Pathname require in shared_extractor_context,
view_template_extractor spec for root environments.

https://claude.ai/code/session_017z8KCJGSwnH3dScDiwsqov

@note

- view_template_extractor_spec: move mocks into `it` block with `a_string_matching` regex (aligns with PR #19 approach for cleaner merge)
- CacheStore#fetch: expand YARD @note documenting nil-as-cache-miss semantic so custom backend implementers preserve the contract
- SolidCacheStore#clear: log a warning when the backend lacks delete_matched instead of silently no-oping

https://claude.ai/code/session_01V3fpEonNoFGNTRxFWpHST6

Replace each_with_object({}) with to_h { ... } for find_batch stubs. https://claude.ai/code/session_017z8KCJGSwnH3dScDiwsqov

@metadata

1. Test coverage: Add direct unit tests for DependencyGraph Set-based @reverse/@type_index (including serialization round-trip restoring Sets from arrays), to_h memoization invalidation on register(), ExtractedUnit#estimated_tokens cache invalidation via setters, Pgvector#store_batch multi-row INSERT, Qdrant#store_batch batch upsert, and SnapshotStore transaction wrapping.
2. Setter bypass audit: Verified no code uses direct @source_code or @metadata ivar assignment outside ExtractedUnit's own setters.
3. Qdrant batch size: Added YARD doc note that callers are responsible for chunking into reasonable batch sizes before calling store_batch.
4. Readability: Replaced @results.each_value.flat_map(&:itself) with @results.values.flatten(1) in precompute_flows.
5. Bug fix: DependencyGraph#to_h memoization was broken: initialize and register cleared @to_h_cache but to_h memoized on @to_h. Unified to @to_h so cache invalidation actually works.

https://claude.ai/code/session_017z8KCJGSwnH3dScDiwsqov
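
The Qdrant batch size note leaves chunking to the caller. One way a caller might do that is sketched below, assuming `store_batch` accepts an Array of items. `RecordingVectorStore`, `store_in_chunks`, and the batch size of 100 are hypothetical and chosen only for the example.

```ruby
# Hypothetical store that records the size of each batch it receives.
class RecordingVectorStore
  attr_reader :batch_sizes

  def initialize
    @batch_sizes = []
  end

  def store_batch(items)
    @batch_sizes << items.size
  end
end

# Splits items into fixed-size chunks before handing them to the store.
# The default of 100 is an arbitrary illustration, not a recommended value.
def store_in_chunks(store, items, batch_size: 100)
  items.each_slice(batch_size) { |chunk| store.store_batch(chunk) }
end

store = RecordingVectorStore.new
items = Array.new(250) { |i| { id: "unit-#{i}", vector: [0.0, 1.0] } }
store_in_chunks(store, items)
```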

Call http.start to open a persistent TCP connection. Without it, Net::HTTP opens and closes a connection for every request, which makes keep_alive_timeout a no-op. Guard with started? so a dropped connection gets a fresh client on the next call. Add ECONNRESET retry logic to the OpenAI provider (matching the existing Qdrant pattern). Add retry specs for both providers.
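
The connection-reuse pattern described here can be sketched with Ruby's standard `Net::HTTP`. `PersistentClient`, its `get` method, and the single-retry policy are assumptions made for this illustration and are not the adapters' actual code; `start`, `started?`, and `keep_alive_timeout=` are real `Net::HTTP` methods.

```ruby
require 'net/http'
require 'uri'

# Hypothetical client that keeps one started Net::HTTP session between calls.
class PersistentClient
  def initialize(base_url, keep_alive_timeout: 30)
    @uri = URI(base_url)
    @keep_alive_timeout = keep_alive_timeout
    @http = nil
  end

  def get(path)
    attempts = 0
    begin
      attempts += 1
      connection.request(Net::HTTP::Get.new(path))
    rescue Errno::ECONNRESET, EOFError
      # Drop the broken session so the next attempt opens a fresh one.
      @http = nil
      retry if attempts < 2
      raise
    end
  end

  private

  # Returns the open session, or builds and starts a new one.
  def connection
    return @http if @http&.started?

    http = Net::HTTP.new(@uri.host, @uri.port)
    http.use_ssl = @uri.scheme == 'https'
    http.keep_alive_timeout = @keep_alive_timeout
    http.start # without this, each request opens and closes its own connection
    @http = http
  end
end
```

Two consecutive `get` calls on one `PersistentClient` should travel over the same TCP connection as long as the server keeps it open.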

Extractors mutate metadata in-place (unit.metadata[:git] = ..., unit.metadata[:callbacks] = ...) so setter-based invalidation can't catch those mutations. The computation is cheap (metadata.to_json.length / 4.0) so memoization isn't worth the correctness risk.
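
The hazard described above can be reproduced in a few lines. `CachedUnit` and `UncachedUnit` are hypothetical stand-ins for `ExtractedUnit`, and the formula uses only the `metadata.to_json.length / 4.0` expression quoted in this commit message; the real method may include other inputs.

```ruby
require 'json'

# Hypothetical unit that memoizes and invalidates only through its setter.
class CachedUnit
  attr_reader :metadata

  def initialize(metadata)
    @metadata = metadata
  end

  def metadata=(value)
    @metadata = value
    @estimated_tokens = nil
  end

  def estimated_tokens
    @estimated_tokens ||= metadata.to_json.length / 4.0
  end
end

# Hypothetical unit that recomputes on every call.
class UncachedUnit
  attr_accessor :metadata

  def initialize(metadata)
    @metadata = metadata
  end

  def estimated_tokens
    metadata.to_json.length / 4.0
  end
end

cached = CachedUnit.new({ name: 'User' })
before = cached.estimated_tokens
cached.metadata[:git] = { commits: 42 } # in-place mutation never calls the setter
stale = cached.estimated_tokens         # still the value computed before the mutation

uncached = UncachedUnit.new({ name: 'User' })
uncached.metadata[:git] = { commits: 42 }
fresh = uncached.estimated_tokens       # reflects the mutated metadata
```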

Callers like extractor.rb mutate the returned hash (graph_data[:pagerank] = ...), which pollutes the memoized @to_h. Returning @to_h.dup gives callers their own copy while preserving the memoization benefit for the expensive transform_values calls.
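
A minimal sketch of memoizing the expensive build while handing each caller its own top-level copy. `MemoGraph` and its `builds` counter are hypothetical; the real `DependencyGraph#to_h` produces a larger structure.

```ruby
# Hypothetical graph that memoizes its serialized form.
class MemoGraph
  attr_reader :builds

  def initialize
    @nodes = {}
    @to_h = nil
    @builds = 0
  end

  def register(name, dependencies)
    @nodes[name] = dependencies
    @to_h = nil # invalidate the same ivar that to_h memoizes on
  end

  def to_h
    @to_h ||= build_hash
    @to_h.dup # shallow copy: callers can add top-level keys safely
  end

  private

  def build_hash
    @builds += 1
    { nodes: @nodes.transform_values(&:dup) }
  end
end

graph = MemoGraph.new
graph.register('User', ['Order'])

graph_data = graph.to_h
graph_data[:pagerank] = { 'User' => 1.0 } # mutates only the caller's copy

clean = graph.to_h # served from the cache, without :pagerank
```

Because `dup` is shallow, nested values such as the `:nodes` Hash are still shared with the cache, so this protects only against top-level key assignment like the `graph_data[:pagerank] = ...` case mentioned above.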

LeahArmstrong mentioned this pull request Feb 28, 2026

Fix Prism cross-version compat and spec portability #20

Merged

2 tasks

claude and others added 6 commits February 28, 2026 14:51

Fix rubocop Style/ReduceToHash offenses in ranker_spec.rb

8ff7ac7

LeahArmstrong force-pushed the claude/gem-performance-audit-OwiTQ branch from 542a7d2 to 0dcbc2b Compare February 28, 2026 19:58

LeahArmstrong merged commit 17d706b into main Feb 28, 2026
5 checks passed

LeahArmstrong deleted the claude/gem-performance-audit-OwiTQ branch February 28, 2026 19:58
