
Bulk read write #33

Open

pattonw wants to merge 25 commits into main from bulk-read-write

Conversation

Contributor

@pattonw pattonw commented Feb 11, 2026

Features:
Add a bulk write API with the following features:

  • A context manager that handles "bulk write mode". It removes and rebuilds indexes and sets other flags that speed up writing large amounts of data. It also takes a flag to distinguish worker processes from the server process, since you don't want to be rebuilding indexes in worker processes. A usage sketch follows this list.
  • Bulk write alternatives to write_graph, write_nodes, and write_edges.
  • Tests to ensure bulk writes match the existing write API.
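
A minimal usage sketch. The method names (`bulk_write_mode`, `bulk_write_nodes`, `bulk_write_edges`) are assumptions for illustration; the PR text does not pin down the exact spellings:

```python
# Hypothetical usage sketch; method and data names are assumptions,
# not the confirmed API surface of this PR.

provider = ...  # stands in for a SQL-backed graph provider instance
nodes = []      # node data, shape assumed for illustration
edges = []      # edge data, shape assumed for illustration

# Server process: owns the expensive index drop/rebuild around the load.
with provider.bulk_write_mode(worker=False):
    provider.bulk_write_nodes(nodes)
    provider.bulk_write_edges(edges)

# Worker process: same fast-write settings, but leaves indexes alone so
# concurrent workers don't repeatedly drop and rebuild them.
with provider.bulk_write_mode(worker=True):
    provider.bulk_write_nodes(nodes)
```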

Improved edge reads

  • Rather than fetching all nodes in an ROI and then serializing them into a massive string for the containment check, we use a join to perform edge queries in one request (a sketch follows this list).
  • Less likely to run into memory errors from sending a multi-GB string of node ids to check for containment.
  • Faster in many cases.
  • Still falls back on the original containment logic when nodes are passed explicitly or when reading a graph with a node attribute filter.
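
A sketch of the join-based query, assuming an `edges` table whose `u` column references `nodes.id`, spatial columns on the nodes table, and containment decided by the source node; the real schema and containment rule may differ:

```python
from types import SimpleNamespace

# Stand-in for a Roi: begin is inclusive, end is exclusive (half-open).
roi = SimpleNamespace(begin=(0, 0, 0), end=(100, 100, 100))
dims = ("z", "y", "x")  # assumed names of the spatial columns

containment = " AND ".join(
    f"n.{d} >= {b} AND n.{d} < {e}"  # half-open check, done server-side
    for d, b, e in zip(dims, roi.begin, roi.end)
)
query = (
    "SELECT e.* FROM edges e "
    "JOIN nodes n ON e.u = n.id "  # edge is in the ROI if its source node is
    f"WHERE {containment}"
)
# No multi-GB "u IN (id1, id2, ...)" string ever leaves the database.
```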

Handle larger int columns

  • Int columns previously defaulted to int32 in the Postgres DB; they now default to int64.
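
For concreteness, the PostgreSQL types involved (the mapping code in this PR may be shaped differently):

```python
# INTEGER is 4 bytes, BIGINT is 8 bytes (PostgreSQL). 64-bit values such
# as fragment ids overflow the old default but fit the new one.
old_column = "size INTEGER"  # max 2_147_483_647
new_column = "size BIGINT"   # max 9_223_372_036_854_775_807
ddl = f"CREATE TABLE nodes (id BIGINT PRIMARY KEY, {new_column})"
```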

Bug fixes:

  • Removed duplicate application of attribute filters.
  • Fixed cases where containment was not using the half-open interval and was returning nodes on the upper boundary of the ROI.
  • Fixed cases where fail_if_exists was either ignored, used incorrectly, or not behaving as expected.
  • Added a close method to graph providers to close open connections to the DB, and turned the db_factory test fixture into a context manager that closes connections. This resolved an annoying bug causing tests to hang permanently in rare cases (a fixture sketch follows this list).
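
A minimal sketch of the fixture pattern, assuming a hypothetical `make_graph_provider` helper; only `close()` itself is new in this PR:

```python
import pytest

@pytest.fixture
def db_factory():
    providers = []

    def factory(*args, **kwargs):
        provider = make_graph_provider(*args, **kwargs)  # hypothetical helper
        providers.append(provider)
        return provider

    try:
        yield factory
    finally:
        # Close every connection even if the test failed; leaked
        # connections were the cause of the rare permanent hangs.
        for provider in providers:
            provider.close()
```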

Expanded tests to cover many of the above cases

will and others added 23 commits February 11, 2026 00:16
…nly case

The elif branch handling an upper-bound-only ROI dimension was checking
roi.begin[dim] (always False at that point) instead of roi.end[dim],
causing the upper-bound constraint to be silently dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Roi uses half-open intervals where end is exclusive. SQL BETWEEN is
inclusive on both ends, so nodes exactly at roi.end were incorrectly
included. Replace with >= begin AND < end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
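
For one dimension with begin 0 and end 10, the fix amounts to the following (a sketch; the generated SQL in the codebase may be shaped differently):

```python
buggy = "z BETWEEN 0 AND 10"  # inclusive on both ends: matches z == 10
fixed = "z >= 0 AND z < 10"   # half-open: excludes the upper boundary
```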
The __attr_query() call already generates WHERE conditions for all
attr_filter entries. The subsequent for-loop over attr_filter appended
the same conditions again, producing redundant SQL like
"WHERE foo=1 AND foo=1".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same issue as read_nodes — __attr_query() already generates the full
filter clause, but the subsequent for-loop re-appended identical
conditions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
write_nodes was hardcoding fail_if_exists=True in the _insert_query
call, silently ignoring the caller's parameter. Duplicate node inserts
with fail_if_exists=False would crash instead of being ignored.

Add test_graph_duplicate_insert_behavior to verify both flags work
correctly for nodes and edges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
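
A sketch of how a query builder can honor the flag, assuming PostgreSQL's ON CONFLICT clause is how the behavior is implemented; the actual _insert_query may differ:

```python
def insert_query(table, columns, fail_if_exists=True):
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    query = f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"
    if not fail_if_exists:
        # Silently skip rows whose key already exists instead of raising.
        query += " ON CONFLICT DO NOTHING"
    return query

# insert_query("nodes", ["id", "size"], fail_if_exists=False)
# -> "INSERT INTO nodes (id, size) VALUES (%s, %s) ON CONFLICT DO NOTHING"
```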
Also adds a join query to handle cases where edges are fetched by ROI rather than by a list of nodes. This is more efficient because it takes a single round-trip query.
Previous behavior was to silently ignore `roi`.
Reading by ROI can be significantly more efficient than reading by node list, since the node list can be huge and would need to be serialized to a string.
This allows testing on DBs other than just a locally running one.
There are now bulk versions of `write_nodes`, `write_edges`, and `write_graph`. These are faster but do not support some features such as fail_if_exists, and thus require more care from the user to guarantee that the data being passed is valid.
There are also helper context managers that drop and rebuild indexes and disable and re-enable synchronous commits, which can be used to further speed up writes (see the sketch below).

Tests have been expanded to make sure that the new API matches the features of the base implementation, and to test that it is actually faster.
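
A sketch of what such helpers can look like (names are assumptions; `synchronous_commit` is a real PostgreSQL setting):

```python
from contextlib import contextmanager

@contextmanager
def unsynchronized_commits(cursor):
    # Trades crash durability for much faster commits during the bulk load.
    cursor.execute("SET synchronous_commit TO off")
    try:
        yield
    finally:
        cursor.execute("SET synchronous_commit TO on")

@contextmanager
def dropped_indexes(cursor, table, indexes):
    # indexes: list of (name, column_list) pairs, e.g. [("pos", "(z, y, x)")]
    for name, _ in indexes:
        cursor.execute(f"DROP INDEX IF EXISTS {name}")
    try:
        yield  # bulk inserts happen here, unburdened by index maintenance
    finally:
        for name, columns in indexes:
            cursor.execute(f"CREATE INDEX {name} ON {table} {columns}")
```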
…nections.

Fixes a very frustrating permanent hang that can occur due to unclosed PostgreSQL connections.
… large numbers like fragment ids much easier
Contributor Author

pattonw commented Feb 11, 2026

Resolves issues:
#18 #31 #32

