Skip to content

Add find_in_batches to query services#986

Open
maxkadel wants to merge 8 commits intomainfrom
i985_find_in_batches
Open

Add find_in_batches to query services#986
maxkadel wants to merge 8 commits intomainfrom
i985_find_in_batches

Conversation

@maxkadel
Copy link
Contributor

@maxkadel maxkadel commented Dec 11, 2025

Adds the #find_in_batches method to each query service. This allows for more batch processing, especially with Postgres and Solr query services.

Connected to #985

@maxkadel maxkadel force-pushed the i985_find_in_batches branch from 1da9422 to eb19fb7 Compare December 11, 2025 15:07
@maxkadel maxkadel changed the title Postgres version of find_in_batches Add find_in_batches to query services Dec 11, 2025
@maxkadel maxkadel marked this pull request as ready for review December 11, 2025 15:51
@maxkadel maxkadel requested a review from tpendragon December 11, 2025 15:51
@maxkadel maxkadel force-pushed the i985_find_in_batches branch from 89994cf to b2d3de9 Compare December 11, 2025 15:53
@maxkadel maxkadel marked this pull request as draft December 11, 2025 17:03
@maxkadel maxkadel removed the request for review from tpendragon December 11, 2025 17:03
@maxkadel maxkadel marked this pull request as ready for review December 14, 2025 14:48
@maxkadel maxkadel requested a review from tpendragon December 15, 2025 16:34
Copy link
Collaborator

@tpendragon tpendragon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some questions around Solr I'm thinking about, but this is a solid feature! Well written, good job.

# @param [RSolr::Client] connection
# @param [ResourceFactory] resource_factory
def initialize(connection:, resource_factory:, start:, batch_size:, except_models:)
Valkyrie.logger.warn("You are trying to query from Solr in batches larger than 1_000, this may cause issues for large Solr documents") if batch_size > 1_000
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Is this problem because of Solr or because of Valkyrie? If it's because of Solr, I wonder if we can separate the paging from the batch size somehow? Might not be too important, I'm not sure when I'd actually use this query for Solr...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, good point. Let me think on how to do that.

def run
docs = Paginator.new(start: start, batch_size: batch_size)
while docs.has_next?
docs = connection.paginate(docs.next_page, docs.per_page, "select", params: { q: query })["response"]["docs"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no sort parameter I think this might return inconsistently, especially between replicas. In Postgres it works because I'm pretty sure AR's handling those internals.

I think we could either get all the IDs at once and resolve them to full documents, or add a sort param. Or maybe this works, I'm really not sure, I might just be thinking about SolrCloud edge cases..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right about inconsistent performance - I was just worried about the performance implications of sorting, but maybe I should do some benchmarking to find out how big of a hit it makes.

@maxkadel maxkadel force-pushed the i985_find_in_batches branch 3 times, most recently from 69303b6 to b667a7f Compare December 16, 2025 14:07
@dchandekstark
Copy link
Member

I haven't followed through the use of start, but I think it could be confusing to use it differently than Solr does -- i.e. you are defaulting to 1 instead of 0. FWIW, I have found "pagination" to be somewhat unreliable in the general case. You have probably referred to https://solr.apache.org/guide/solr/latest/query-guide/pagination-of-results.html -- and the issue of skipped/dropped results because of updates is real. I do think using a cursor is better for iterating through a large result set, and I have had success using it.

Oldest version of rdf that makes reliance on BigDecimal explicit.
Will not pass tests because other query_services don't have method yet

Connected to #985
Will not pass tests because other query_services don't have method yet
How can we figure out whether this will kill a "normal" machine's memory for a "normal" solr corpus?

Connected to #985
Will not pass tests because other query_services don't have method yet

Connected to #985
@maxkadel maxkadel force-pushed the i985_find_in_batches branch from 4671fae to 903d581 Compare December 18, 2025 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants