Performance audit: fix critical hotspots across extraction, storage, and retrieval #19
Merged
LeahArmstrong merged 6 commits into main on Feb 28, 2026
LeahArmstrong pushed a commit that referenced this pull request on Feb 28, 2026

- view_template_extractor_spec: move mocks into `it` block with `a_string_matching` regex (aligns with PR #19 approach for cleaner merge)
- CacheStore#fetch: expand YARD @note documenting nil-as-cache-miss semantic so custom backend implementers preserve the contract
- SolidCacheStore#clear: log a warning when the backend lacks delete_matched instead of silently no-oping

https://claude.ai/code/session_01V3fpEonNoFGNTRxFWpHST6
…and retrieval

Tier 1 (Critical):
- Cache estimated_tokens in ExtractedUnit (avoid repeated metadata.to_json)
- Use Sets for DependencyGraph type_index/reverse (O(1) vs O(n) include?)
- Memoize DependencyGraph#to_h (avoid 3x redundant serialization)
- Add store_batch to VectorStore interface (pgvector multi-row INSERT, Qdrant batch upsert)
- Reuse HTTP connections in Qdrant adapter (avoid TCP handshake per request)
- Wrap temporal snapshot inserts in transaction (single commit vs 5K fsyncs)

Tier 2 (High):
- Pre-compile regex constants in CallbackAnalyzer (avoid dynamic Regexp.new in loops)
- Eliminate flatten calls in orchestrator summaries (use each_value + sum)
- Reuse HTTP connections in OpenAI embedding provider
- Add find_batch to MetadataStore interface (single WHERE IN query)
- Batch metadata lookup in Ranker and ContextAssembler

Tier 3 (Medium):
- Cache @nodes.keys in PageRank loop
- Use max_by(limit) in GraphAnalyzer hubs (O(n) vs O(n log n))
- Create path Set once in git enrichment (not per batch)
- Reduce checkpoint frequency in embedding indexer (every 10 batches)

Tier 4 (Low):
- Combine AR_INTERNAL_METHOD_PATTERNS into single alternation regex

Also fixes: pre-existing Pathname require in shared_extractor_context, view_template_extractor spec for root environments.

https://claude.ai/code/session_017z8KCJGSwnH3dScDiwsqov
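A few of the smaller items above can be sketched in isolation. This is not code from the PR: the variable names, the sample counts, and the three patterns in `INTERNAL_PATTERNS` are made up for illustration, and only standard-library calls are used.

```ruby
require "set"

# Hypothetical reverse index: dependency name => Set of dependents.
# A Set makes include? a hash lookup instead of a linear Array scan.
reverse = Hash.new { |hash, key| hash[key] = Set.new }
reverse["User"] << "PostsController"
reverse["User"] << "Post"
reverse["User"] << "Post" # duplicate insert is ignored by Set

# Summing sizes per value avoids building one large flattened Array.
results = { models: %w[User Post], controllers: %w[PostsController] } # made-up data
total = results.each_value.sum(&:size)

# max_by(n) picks the top n entries without sorting the whole collection.
dependent_counts = { "User" => 12, "Post" => 7, "Comment" => 3, "Tag" => 1 } # made-up counts
hubs = dependent_counts.max_by(2) { |_name, count| count }

# One alternation regex, built once, instead of testing each pattern in a loop.
INTERNAL_PATTERNS = [/\A_/, /\Aautosave_/, /\Avalidate_associated_/].freeze # illustrative only
INTERNAL_METHOD = Regexp.union(INTERNAL_PATTERNS)

puts reverse["User"].include?("Post")             # membership check on the Set
puts total
puts hubs.inspect
puts INTERNAL_METHOD.match?("_run_save_callbacks")
```

The Set and the combined regex trade a little setup cost for cheaper checks inside hot loops, which is the trade the tier list describes.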
Replace each_with_object({}) with to_h { ... } for find_batch stubs.
https://claude.ai/code/session_017z8KCJGSwnH3dScDiwsqov
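For reference, the two stub-building forms are equivalent. The sketch below uses a made-up `ids` list and a placeholder record hash; it is not the actual spec code.

```ruby
ids = %w[User Post] # hypothetical identifiers

# Older form: accumulate into an explicit hash.
stub_old = ids.each_with_object({}) { |id, hash| hash[id] = { id: id } }

# to_h with a block (Ruby 2.6+): return [key, value] pairs directly.
stub_new = ids.to_h { |id| [id, { id: id }] }

puts stub_old == stub_new
```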
1. Test coverage: Add direct unit tests for DependencyGraph Set-based @reverse/@type_index (including serialization round-trip restoring Sets from arrays), to_h memoization invalidation on register(), ExtractedUnit#estimated_tokens cache invalidation via setters, Pgvector#store_batch multi-row INSERT, Qdrant#store_batch batch upsert, and SnapshotStore transaction wrapping.
2. Setter bypass audit: Verified no code uses direct @source_code or @metadata ivar assignment outside ExtractedUnit's own setters.
3. Qdrant batch size: Added YARD doc note that callers are responsible for chunking into reasonable batch sizes before calling store_batch.
4. Readability: Replaced @results.each_value.flat_map(&:itself) with @results.values.flatten(1) in precompute_flows.
5. Bug fix: DependencyGraph#to_h memoization was broken: initialize and register cleared @to_h_cache but to_h memoized on @to_h. Unified to @to_h so cache invalidation actually works.

https://claude.ai/code/session_017z8KCJGSwnH3dScDiwsqov
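The memoization bug in item 5 and the Set round-trip in item 1 can be illustrated with a reduced stand-in. `TinyGraph` is hypothetical and far smaller than the real DependencyGraph; it only shows that `register` must clear the same instance variable that `to_h` memoizes on, and that Sets serialized as arrays can be rebuilt on load.

```ruby
require "set"

# Hypothetical, simplified stand-in for DependencyGraph (not the project's class).
class TinyGraph
  def initialize
    @reverse = Hash.new { |hash, key| hash[key] = Set.new }
    @to_h = nil
  end

  def register(unit, dependency)
    @reverse[dependency] << unit
    @to_h = nil # clear the same ivar that to_h memoizes on
    self
  end

  def to_h
    # Sets become arrays so the result is plain serializable data.
    @to_h ||= { reverse: @reverse.transform_values(&:to_a) }
  end

  # Rebuild a graph (and its Sets) from the arrays produced by to_h.
  def self.from_h(data)
    graph = new
    data[:reverse].each do |dependency, units|
      units.each { |unit| graph.register(unit, dependency) }
    end
    graph
  end
end

graph = TinyGraph.new.register("Post", "User")
first = graph.to_h
graph.register("Comment", "User") # invalidates the memoized hash
puts graph.to_h.equal?(first)      # a new hash is built after register
```

Had `register` cleared a differently named variable (the `@to_h_cache` versus `@to_h` mismatch described above), the second `to_h` call would have kept returning the stale hash.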
Call http.start to open a persistent TCP connection — without it, Net::HTTP auto-opens and auto-closes per request, making keep_alive_timeout a no-op. Guard with started? so a dropped connection gets a fresh client on next call. Add ECONNRESET retry logic to OpenAI provider (matching existing Qdrant pattern). Add retry specs for both providers.
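A minimal sketch of the pattern described above, assuming a wrapper class around `Net::HTTP`. `PersistentClient`, `MAX_RETRIES`, the 30-second timeout, and the injectable factory block are all illustrative choices, not the providers' actual code; the factory exists only so the retry path can be exercised without a network.

```ruby
require "net/http"

# Hypothetical wrapper; names and limits are illustrative.
class PersistentClient
  MAX_RETRIES = 1

  def initialize(host, port, &factory)
    @host = host
    @port = port
    @factory = factory || method(:build_http)
    @http = nil
  end

  def request(req)
    attempts = 0
    begin
      connection.request(req)
    rescue Errno::ECONNRESET
      @http = nil # drop the dead connection so the next attempt builds a fresh one
      attempts += 1
      retry if attempts <= MAX_RETRIES
      raise
    end
  end

  private

  # Reuse the client only while its connection is still open.
  def connection
    return @http if @http&.started?

    @http = @factory.call
  end

  def build_http
    http = Net::HTTP.new(@host, @port)
    http.use_ssl = (@port == 443)
    http.keep_alive_timeout = 30
    http.start # without start, Net::HTTP opens and closes a connection per request
    http
  end
end
```

In real use the client would be constructed without a block, for example `PersistentClient.new("api.example.com", 443)`, and passed `Net::HTTP::Post` or `Net::HTTP::Get` request objects; the host here is a placeholder.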
Extractors mutate metadata in-place (unit.metadata[:git] = ..., unit.metadata[:callbacks] = ...) so setter-based invalidation can't catch those mutations. The computation is cheap (metadata.to_json.length / 4.0) so memoization isn't worth the correctness risk.
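The failure mode is easy to reproduce with a reduced example. `CachedUnit` below is a hypothetical stand-in for ExtractedUnit that estimates tokens from metadata alone, using the `metadata.to_json.length / 4.0` expression quoted above.

```ruby
require "json"

# Hypothetical unit with setter-based cache invalidation, for illustration only.
class CachedUnit
  attr_reader :metadata

  def initialize(metadata)
    @metadata = metadata
    @estimated_tokens = nil
  end

  def metadata=(value)
    @metadata = value
    @estimated_tokens = nil # invalidation only happens through the setter
  end

  def estimated_tokens
    @estimated_tokens ||= metadata.to_json.length / 4.0
  end
end

unit = CachedUnit.new({ name: "User" })
before = unit.estimated_tokens

unit.metadata[:git] = { author: "someone", commits: 42 } # in-place mutation; setter never runs

stale = unit.estimated_tokens              # still the value cached before the mutation
fresh = unit.metadata.to_json.length / 4.0 # what an uncached computation returns

puts stale == before
puts fresh > stale
```

Freezing the metadata hash would be another way to close the gap, but it would break the extractors that rely on in-place mutation, so dropping the cache is the less invasive choice.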
Callers like extractor.rb mutate the returned hash (graph_data[:pagerank] = ...), which pollutes the memoized @to_h. Returning @to_h.dup gives callers their own copy while preserving the memoization benefit for the expensive transform_values calls.
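A reduced sketch of the `dup` approach, assuming a hypothetical `MemoGraph` class with a single `nodes` key; the real `to_h` has more keys and more expensive transforms.

```ruby
# Hypothetical stand-in for the memoized to_h; not the project's class.
class MemoGraph
  def initialize(nodes)
    @nodes = nodes
    @to_h = nil
  end

  def to_h
    @to_h ||= { nodes: @nodes.transform_values(&:dup) } # expensive part runs once
    @to_h.dup # shallow copy: callers can add top-level keys without touching the memo
  end
end

graph = MemoGraph.new({ "User" => { type: :model } })

graph_data = graph.to_h
graph_data[:pagerank] = { "User" => 1.0 } # caller-side mutation, as in extractor.rb

puts graph.to_h.key?(:pagerank) # the memoized hash is unaffected
```

Note that `dup` is shallow: nested hashes such as `graph_data[:nodes]` are still shared with the memoized copy, so this protects against added or replaced top-level keys only.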
542a7d2 to
0dcbc2b