Add find_in_batches to query services by maxkadel · Pull Request #986 · samvera/valkyrie

maxkadel · 2025-12-11T11:04:17Z

Adds the #find_in_batches method to each query service. This allows for more efficient batch processing, especially with the Postgres and Solr query services.

Connected to #985
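To illustrate the intended usage, here is a minimal in-memory sketch of a `find_in_batches`-style method. It is not Valkyrie's implementation: the `SketchQueryService`, `Book`, and `Page` names are hypothetical, the `batch_size:` and `except_models:` keywords are borrowed from the initializer quoted in the review below, and yielding whole batches (as ActiveRecord's `find_in_batches` does) is an assumption.

```ruby
# Hypothetical in-memory stand-in for a query service (not Valkyrie code).
class SketchQueryService
  def initialize(resources)
    @resources = resources
  end

  # Yields arrays of at most batch_size resources, skipping any resource
  # whose class is listed in except_models.
  def find_in_batches(batch_size: 100, except_models: [])
    @resources
      .reject { |resource| except_models.include?(resource.class) }
      .each_slice(batch_size) { |batch| yield batch }
  end
end

# Hypothetical resource classes used only for this sketch.
Book = Struct.new(:id)
Page = Struct.new(:id)

service = SketchQueryService.new([Book.new(1), Page.new(2), Book.new(3), Book.new(4)])

batches = []
service.find_in_batches(batch_size: 2, except_models: [Page]) do |batch|
  batches << batch.map(&:id)
end
```

The caller only ever holds one batch in memory at a time, which is the main reason to prefer this over loading every resource at once.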

tpendragon

Just some questions around Solr I'm thinking about, but this is a solid feature! Well written, good job.

tpendragon · 2025-12-15T17:03:41Z

lib/valkyrie/persistence/solr/queries/find_in_batches_query.rb

+    # @param [RSolr::Client] connection
+    # @param [ResourceFactory] resource_factory
+    def initialize(connection:, resource_factory:, start:, batch_size:, except_models:)
+      Valkyrie.logger.warn("You are trying to query from Solr in batches larger than 1_000, this may cause issues for large Solr documents") if batch_size > 1_000


Question: Is this problem because of Solr or because of Valkyrie? If it's because of Solr, I wonder if we can separate the paging from the batch size somehow? Might not be too important, I'm not sure when I'd actually use this query for Solr...

maxkadel · Dec 15, 2025

Ah, yeah, good point. Let me think on how to do that.
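One way to separate the page size used against the backend from the batch size yielded to the caller is to buffer pages and re-slice them. The sketch below assumes a hypothetical `fetch_page` callable that returns an array of documents for an offset and a limit; it is an illustration of the idea, not the code in this PR.

```ruby
# Hypothetical helper: fetch_page is any callable taking (offset, limit)
# and returning an array of documents (empty when nothing is left).
def each_batch(fetch_page:, page_size:, batch_size:)
  buffer = []
  offset = 0
  loop do
    page = fetch_page.call(offset, page_size)
    break if page.empty?

    buffer.concat(page)
    offset += page.size
    # Emit full batches as soon as enough documents are buffered.
    while buffer.size >= batch_size
      yield buffer.shift(batch_size)
    end
  end
  # Emit whatever is left over as a final, smaller batch.
  yield buffer unless buffer.empty?
end

corpus = (1..7).to_a
fetch = ->(offset, limit) { corpus[offset, limit] || [] }

rebatched = []
each_batch(fetch_page: fetch, page_size: 3, batch_size: 5) { |batch| rebatched << batch }
```

With this shape, the backend request size can stay small for large Solr documents while callers still receive the batch size they asked for.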

tpendragon · 2025-12-15T17:04:14Z

lib/valkyrie/persistence/solr/queries/find_in_batches_query.rb

+    def run
+      docs = Paginator.new(start: start, batch_size: batch_size)
+      while docs.has_next?
+        docs = connection.paginate(docs.next_page, docs.per_page, "select", params: { q: query })["response"]["docs"]


If there's no sort parameter, I think this might return results in an inconsistent order, especially between replicas. In Postgres it works because I'm pretty sure AR's handling those internals.

I think we could either get all the IDs at once and resolve them to full documents, or add a sort param. Or maybe this works; I'm really not sure. I might just be thinking about SolrCloud edge cases.

maxkadel · Dec 15, 2025

I think you're right about inconsistent performance - I was just worried about the performance implications of sorting, but maybe I should do some benchmarking to find out how big of a hit it makes.
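The sort-parameter option could look roughly like the sketch below. It assumes the unique key field is named `id` and uses RSolr's `paginate(page, per_page, path, opts)` call shape; `FakeSolr` is a hypothetical stand-in so the example runs without a Solr server, and `each_sorted_page` is an illustrative name, not a method from this PR.

```ruby
# Walks every page with an explicit sort on the unique key so that page
# boundaries are stable from one request to the next.
def each_sorted_page(connection:, query:, per_page:, sort: "id asc")
  page = 1
  loop do
    response = connection.paginate(page, per_page, "select", params: { q: query, sort: sort })
    docs = response["response"]["docs"]
    break if docs.empty?

    yield docs
    page += 1
  end
end

# Hypothetical stand-in for RSolr::Client, used only to make the sketch runnable.
class FakeSolr
  def initialize(docs)
    @docs = docs
  end

  def paginate(page, per_page, _path, opts = {})
    sorted = opts.dig(:params, :sort) == "id asc" ? @docs.sort_by { |doc| doc["id"] } : @docs
    { "response" => { "docs" => sorted.each_slice(per_page).to_a[page - 1] || [] } }
  end
end

solr = FakeSolr.new([{ "id" => "c" }, { "id" => "a" }, { "id" => "b" }])

sorted_pages = []
each_sorted_page(connection: solr, query: "*:*", per_page: 2) do |docs|
  sorted_pages << docs.map { |doc| doc["id"] }
end
```

Sorting on a unique field gives every document a fixed position in the result set, though it does not protect against documents being added or deleted between page requests.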

dchandekstark · 2025-12-16T14:27:56Z

I haven't followed through the use of start, but I think it could be confusing to use it differently than Solr does -- i.e. you are defaulting to 1 instead of 0. FWIW, I have found "pagination" to be somewhat unreliable in the general case. You have probably referred to https://solr.apache.org/guide/solr/latest/query-guide/pagination-of-results.html -- and the issue of skipped/dropped results because of updates is real. I do think using a cursor is better for iterating through a large result set, and I have had success using it.
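For comparison, cursor-based iteration as described in the Solr pagination guide can be sketched as follows. Solr requires the sort to include the unique key field (assumed here to be `id`), starts from `cursorMark=*`, and signals the end when the returned `nextCursorMark` equals the cursor that was sent. `FakeCursorSolr` is a hypothetical stand-in that mimics that contract, and `each_cursor_page` is an illustrative name rather than anything in this PR.

```ruby
# Iterates a result set with Solr's cursorMark instead of start/rows offsets.
def each_cursor_page(connection:, query:, rows:, sort: "id asc")
  cursor = "*"
  loop do
    response = connection.get("select", params: { q: query, rows: rows, sort: sort, cursorMark: cursor })
    docs = response["response"]["docs"]
    yield docs unless docs.empty?

    next_cursor = response["nextCursorMark"]
    # Solr returns the same cursor again once the result set is exhausted.
    break if next_cursor == cursor

    cursor = next_cursor
  end
end

# Hypothetical stand-in for RSolr::Client; the cursor is simply the last id seen.
class FakeCursorSolr
  def initialize(docs)
    @docs = docs.sort_by { |doc| doc["id"] }
  end

  def get(_path, opts = {})
    params = opts.fetch(:params)
    cursor = params[:cursorMark]
    remaining = cursor == "*" ? @docs : @docs.select { |doc| doc["id"] > cursor }
    page = remaining.first(params[:rows])
    { "response" => { "docs" => page }, "nextCursorMark" => page.empty? ? cursor : page.last["id"] }
  end
end

cursor_solr = FakeCursorSolr.new([{ "id" => "b" }, { "id" => "a" }, { "id" => "c" }])

cursor_pages = []
each_cursor_page(connection: cursor_solr, query: "*:*", rows: 2) do |docs|
  cursor_pages << docs.map { |doc| doc["id"] }
end
```

Because each request continues from the last sort value rather than from a numeric offset, concurrent updates are less likely to cause skipped or repeated documents than with offset paging.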


maxkadel force-pushed the i985_find_in_batches branch from 1da9422 to eb19fb7 Compare December 11, 2025 15:07

maxkadel changed the title ~~Postgres version of find_in_batches~~ Add find_in_batches to query services Dec 11, 2025

maxkadel marked this pull request as ready for review December 11, 2025 15:51

maxkadel requested a review from tpendragon December 11, 2025 15:51

maxkadel force-pushed the i985_find_in_batches branch from 89994cf to b2d3de9 Compare December 11, 2025 15:53

maxkadel marked this pull request as draft December 11, 2025 17:03

maxkadel removed the request for review from tpendragon December 11, 2025 17:03

maxkadel marked this pull request as ready for review December 14, 2025 14:48

maxkadel requested a review from tpendragon December 15, 2025 16:34

tpendragon requested changes Dec 15, 2025


maxkadel force-pushed the i985_find_in_batches branch 3 times, most recently from 69303b6 to b667a7f Compare December 16, 2025 14:07

maxkadel added 8 commits December 18, 2025 17:22

Older versions of RDF cause failing specs

43a4849

Oldest version of rdf that makes reliance on BigDecimal explicit.

Postgres version of find_in_batches

d43e2fb

Will not pass tests because other query_services don't have method yet Connected to #985

Solr version of find_in_batches

8e5c8a4

Will not pass tests because other query_services don't have method yet How can we figure out whether this will kill a "normal" machine's memory for a "normal" solr corpus? Connected to #985

Fedora version of find_in_batches

47b3bb4

Will not pass tests because other query_services don't have method yet Connected to #985

Memory version of find_in_batches

bbb1832

Connected to #985

Use more consistent documentation pattern

6f0c9c3

Respond to review - separate pagination from yielded batch size & ord…

29103fc

…er to get deterministic results

Add failing test for sort

903d581

maxkadel force-pushed the i985_find_in_batches branch from 4671fae to 903d581 Compare December 18, 2025 16:30

maxkadel wants to merge 8 commits into main from i985_find_in_batches


3 participants
