Performance audit: fix critical hotspots across extraction, storage, and retrieval #19

Merged
LeahArmstrong merged 6 commits into main from claude/gem-performance-audit-OwiTQ on Feb 28, 2026

@LeahArmstrong (Owner)

Tier 1 (Critical):

  • Cache estimated_tokens in ExtractedUnit (avoid repeated metadata.to_json)
  • Use Sets for DependencyGraph type_index/reverse (O(1) vs O(n) include?)
  • Memoize DependencyGraph#to_h (avoid 3x redundant serialization)
  • Add store_batch to VectorStore interface (pgvector multi-row INSERT, Qdrant batch upsert)
  • Reuse HTTP connections in Qdrant adapter (avoid TCP handshake per request)
  • Wrap temporal snapshot inserts in transaction (single commit vs 5K fsyncs)
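
The Set change for the graph indexes can be illustrated with a minimal sketch. `MiniGraph` and its methods are hypothetical names, not the gem's `DependencyGraph` API; the sketch only shows how a `Set` replaces an array scan for membership checks and duplicate suppression, and how the index can still serialize to plain arrays.

```ruby
require "set"

# Hypothetical stand-in for a dependency graph; not the gem's actual class.
class MiniGraph
  def initialize
    # Each missing key lazily gets its own empty Set.
    @reverse = Hash.new { |hash, key| hash[key] = Set.new }
  end

  # Record that `from` depends on `to`.
  def register(from, to)
    # Set#add ignores duplicates with a hash lookup, replacing an
    # Array-based `list << from unless list.include?(from)` linear scan.
    @reverse[to].add(from)
  end

  # True when `from` is a recorded dependent of `to`.
  def dependent?(to, from)
    @reverse.key?(to) && @reverse[to].include?(from)
  end

  # Serialize Sets as plain arrays (for example, before JSON output).
  def to_h
    @reverse.transform_values(&:to_a)
  end
end
```

Registering the same edge twice leaves a single entry, so callers no longer need their own `include?` guard.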

Tier 2 (High):

  • Pre-compile regex constants in CallbackAnalyzer (avoid dynamic Regexp.new in loops)
  • Eliminate flatten calls in orchestrator summaries (use each_value + sum)
  • Reuse HTTP connections in OpenAI embedding provider
  • Add find_batch to MetadataStore interface (single WHERE IN query)
  • Batch metadata lookup in Ranker and ContextAssembler
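
As a rough illustration of the `flatten` item above, the sketch below counts units in a results hash two ways. The method names and the shape of `results` (extractor type mapped to an array of units) are assumptions for illustration, not the orchestrator's real code.

```ruby
# Assumed shape: { extractor_type => [unit, unit, ...] }.

# Builds a throwaway array containing every unit just to count them.
def total_units_flatten(results)
  results.values.flatten(1).size
end

# Sums the per-type counts directly; no intermediate arrays are allocated.
def total_units_sum(results)
  results.each_value.sum(&:size)
end
```

Both return the same total; the second avoids materializing the combined array when only a count is needed.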

Tier 3 (Medium):

  • Cache @nodes.keys in PageRank loop
  • Use max_by(limit) in GraphAnalyzer hubs (O(n) vs O(n log n))
  • Create path Set once in git enrichment (not per batch)
  • Reduce checkpoint frequency in embedding indexer (every 10 batches)
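
The `max_by(limit)` item can be sketched as follows. The `degrees` hash (node name mapped to dependent count) and both method names are hypothetical; the gem's `GraphAnalyzer` may structure its data differently.

```ruby
# Assumed input: { node_name => number_of_dependents }.

# Sorts every node, then keeps only the first `limit`.
def top_hubs_sort(degrees, limit)
  degrees.sort_by { |_name, degree| -degree }.first(limit).map(&:first)
end

# Asks Enumerable for the `limit` largest entries without a full sort.
def top_hubs_max_by(degrees, limit)
  degrees.max_by(limit) { |_name, degree| degree }.map(&:first)
end
```

`Enumerable#max_by(n)` returns the selected elements in descending order, so the two methods agree whenever the degrees are distinct; the order among ties is not guaranteed to match.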

Tier 4 (Low):

  • Combine AR_INTERNAL_METHOD_PATTERNS into single alternation regex
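
One way to combine a pattern list into a single alternation is `Regexp.union`, sketched below. The patterns here are made up for illustration; the actual contents of `AR_INTERNAL_METHOD_PATTERNS` are not shown in this PR, and the gem may build its combined regex differently.

```ruby
# Hypothetical patterns, standing in for the gem's real list.
PATTERNS = [
  /\A_/,
  /\Aautosave_associated_records_for_/,
  /\Avalidate_associated_records_for_/
].freeze

# Built once at load time: a single regex matching any of the patterns.
COMBINED = Regexp.union(PATTERNS)

# One regex match per pattern, per method name.
def internal_method_each?(name)
  PATTERNS.any? { |pattern| pattern.match?(name) }
end

# One regex match per method name.
def internal_method_combined?(name)
  COMBINED.match?(name)
end
```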

Also fixes: pre-existing Pathname require in shared_extractor_context,
view_template_extractor spec for root environments.

https://claude.ai/code/session_017z8KCJGSwnH3dScDiwsqov

LeahArmstrong pushed a commit that referenced this pull request Feb 28, 2026
- view_template_extractor_spec: move mocks into `it` block with
  `a_string_matching` regex (aligns with PR #19 approach for cleaner
  merge)
- CacheStore#fetch: expand YARD @note documenting nil-as-cache-miss
  semantic so custom backend implementers preserve the contract
- SolidCacheStore#clear: log a warning when the backend lacks
  delete_matched instead of silently no-oping

https://claude.ai/code/session_01V3fpEonNoFGNTRxFWpHST6
claude and others added 6 commits February 28, 2026 14:51
1. Test coverage: Add direct unit tests for DependencyGraph Set-based
   @reverse/@type_index (including serialization round-trip restoring
   Sets from arrays), to_h memoization invalidation on register(),
   ExtractedUnit#estimated_tokens cache invalidation via setters,
   Pgvector#store_batch multi-row INSERT, Qdrant#store_batch batch
   upsert, and SnapshotStore transaction wrapping.

2. Setter bypass audit: Verified no code uses direct @source_code or
   @metadata ivar assignment outside ExtractedUnit's own setters.

3. Qdrant batch size: Added YARD doc note that callers are responsible
   for chunking into reasonable batch sizes before calling store_batch.

4. Readability: Replaced @results.each_value.flat_map(&:itself) with
   @results.values.flatten(1) in precompute_flows.

5. Bug fix: DependencyGraph#to_h memoization was broken: initialize
   and register cleared @to_h_cache, but to_h memoized on @to_h.
   Unified on @to_h so cache invalidation actually works.

https://claude.ai/code/session_017z8KCJGSwnH3dScDiwsqov
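
The memoization bug in item 5 comes down to clearing one instance variable while caching on another. A minimal sketch of the corrected shape, using a hypothetical `MemoGraph` class (the `builds` counter exists only to make the caching observable):

```ruby
# Hypothetical graph; only the cache handling mirrors the fix described above.
class MemoGraph
  attr_reader :builds

  def initialize
    @nodes = {}
    @to_h = nil  # the single cache ivar
    @builds = 0  # counts rebuilds, for illustration only
  end

  def register(name, deps)
    @nodes[name] = deps
    @to_h = nil  # invalidate the same ivar that to_h memoizes on
  end

  def to_h
    @to_h ||= begin
      @builds += 1
      { nodes: @nodes.transform_values(&:dup) }
    end
  end
end
```

Had `register` cleared a differently named variable, the second `to_h` call after a registration would have kept returning the stale hash.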

Call http.start to open a persistent TCP connection. Without it,
Net::HTTP auto-opens and auto-closes a connection per request, making
keep_alive_timeout a no-op. Guard with started? so that a dropped
connection gets a fresh client on the next call.

Add ECONNRESET retry logic to OpenAI provider (matching existing
Qdrant pattern). Add retry specs for both providers.
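
The connection handling described in this commit might look roughly like the sketch below. `PersistentClient`, `MAX_RETRIES`, the 30-second timeout, and the injectable `connector:` argument are all assumptions for illustration; only `Net::HTTP#start`, `#started?`, and `#keep_alive_timeout=` are taken from the standard library.

```ruby
require "net/http"

# Hypothetical client; not the gem's Qdrant or OpenAI adapter.
class PersistentClient
  MAX_RETRIES = 1 # assumed retry budget

  def initialize(host, port, connector: nil)
    @host = host
    @port = port
    # `connector` lets callers (or specs) supply their own connection object.
    @connector = connector || method(:open_connection)
    @http = nil
  end

  def get(path)
    attempts = 0
    begin
      connection.request(Net::HTTP::Get.new(path))
    rescue Errno::ECONNRESET
      @http = nil # drop the dead connection so the next attempt reconnects
      attempts += 1
      retry if attempts <= MAX_RETRIES
      raise
    end
  end

  private

  # Reuse the open connection; build a fresh one if it was never started
  # or has been closed.
  def connection
    @http = @connector.call unless @http&.started?
    @http
  end

  def open_connection
    http = Net::HTTP.new(@host, @port)
    http.keep_alive_timeout = 30 # assumed value
    http.start # opens the TCP connection and keeps it open across requests
    http
  end
end
```

Injecting the connector keeps the retry path testable without a live server.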

Extractors mutate metadata in-place (unit.metadata[:git] = ...,
unit.metadata[:callbacks] = ...) so setter-based invalidation can't
catch those mutations. The computation is cheap (metadata.to_json.length
/ 4.0) so memoization isn't worth the correctness risk.
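
The stale-cache problem described above can be reproduced in a few lines. `CachedUnit` and `PlainUnit` are hypothetical and simplified: the estimate uses only metadata, and the `.ceil` rounding is an assumption, not necessarily what `ExtractedUnit` does.

```ruby
require "json"

# Hypothetical unit that memoizes and invalidates only through its setter.
class CachedUnit
  attr_reader :metadata

  def initialize(metadata)
    @metadata = metadata
  end

  def metadata=(value)
    @metadata = value
    @estimated_tokens = nil # setter-based invalidation
  end

  def estimated_tokens
    @estimated_tokens ||= (metadata.to_json.length / 4.0).ceil
  end
end

# Same estimate, recomputed on every call.
class PlainUnit
  attr_reader :metadata

  def initialize(metadata)
    @metadata = metadata
  end

  def estimated_tokens
    (metadata.to_json.length / 4.0).ceil
  end
end
```

After `unit.metadata[:git] = {...}`, `CachedUnit` keeps returning the old estimate because the setter never ran, while `PlainUnit` reflects the new metadata immediately.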

Callers like extractor.rb mutate the returned hash
(graph_data[:pagerank] = ...), which pollutes the memoized @to_h.
Returning @to_h.dup gives callers their own copy while preserving
the memoization benefit for the expensive transform_values calls.
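
A minimal sketch of the copy-on-return pattern, using a hypothetical `SnapshotGraph` class rather than the gem's `DependencyGraph`:

```ruby
# Hypothetical class; shows why the memoized hash is copied on return.
class SnapshotGraph
  def initialize(nodes)
    @nodes = nodes
  end

  def to_h
    # The expensive build happens once and is cached.
    @to_h ||= { nodes: @nodes.keys }
    # Each caller receives its own top-level copy.
    @to_h.dup
  end
end
```

Note that `dup` is a shallow copy: adding a top-level key such as `:pagerank` no longer leaks into the cache, but mutating a nested value in place would still be visible to later callers.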

LeahArmstrong force-pushed the claude/gem-performance-audit-OwiTQ branch from 542a7d2 to 0dcbc2b on February 28, 2026 19:58
LeahArmstrong merged commit 17d706b into main on Feb 28, 2026
5 checks passed
LeahArmstrong deleted the claude/gem-performance-audit-OwiTQ branch on February 28, 2026 19:58