
Bulk read write #33

Open

pattonw wants to merge 25 commits into main from bulk-read-write

Conversation

Contributor

@pattonw pattonw commented Feb 11, 2026

Features:
Add a bulk write API with the following features:

  • A context manager that handles "bulk write mode". It removes and rebuilds indexes and sets other flags that speed up writing large amounts of data. It also takes a flag to distinguish worker processes from the server process, since you don't want to be rebuilding indexes in worker processes. A usage sketch follows this list.
  • Bulk write alternatives to write_graph, write_nodes, and write_edges.
  • Tests to ensure bulk writes match the existing write API.
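
A minimal usage sketch. The method names (`bulk_write_mode`, `bulk_write_nodes`, `bulk_write_edges`) are assumptions for illustration; the PR text does not pin down the exact spellings:

```python
# Hypothetical usage sketch; method and data names are assumptions,
# not the confirmed API surface of this PR.

provider = ...  # stands in for a SQL-backed graph provider instance
nodes = []      # node data, shape assumed for illustration
edges = []      # edge data, shape assumed for illustration

# Server process: owns the expensive index drop/rebuild around the load.
with provider.bulk_write_mode(worker=False):
    provider.bulk_write_nodes(nodes)
    provider.bulk_write_edges(edges)

# Worker process: same fast-write settings, but leaves indexes alone so
# concurrent workers don't repeatedly drop and rebuild them.
with provider.bulk_write_mode(worker=True):
    provider.bulk_write_nodes(nodes)
```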

Improved edge reads

  • Rather than fetching all nodes in an ROI and then serializing them into a massive string for the containment check, we use a join to perform edge queries in one request (a sketch follows this list).
  • Less likely to run into memory errors from sending a multi-GB string of node ids to check for containment.
  • Faster in many cases.
  • Still falls back on the original containment logic when nodes are passed explicitly or when reading a graph with a node attribute filter.
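
A sketch of the join-based query, assuming an `edges` table whose `u` column references `nodes.id`, spatial columns on the nodes table, and containment decided by the source node; the real schema and containment rule may differ:

```python
from types import SimpleNamespace

# Stand-in for a Roi: begin is inclusive, end is exclusive (half-open).
roi = SimpleNamespace(begin=(0, 0, 0), end=(100, 100, 100))
dims = ("z", "y", "x")  # assumed names of the spatial columns

containment = " AND ".join(
    f"n.{d} >= {b} AND n.{d} < {e}"  # half-open check, done server-side
    for d, b, e in zip(dims, roi.begin, roi.end)
)
query = (
    "SELECT e.* FROM edges e "
    "JOIN nodes n ON e.u = n.id "  # edge is in the ROI if its source node is
    f"WHERE {containment}"
)
# No multi-GB "u IN (id1, id2, ...)" string ever leaves the database.
```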

Handle larger int columns

  • Int columns previously defaulted to int32 in the Postgres DB; they now default to int64.
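
For concreteness, the PostgreSQL types involved (the mapping code in this PR may be shaped differently):

```python
# INTEGER is 4 bytes, BIGINT is 8 bytes (PostgreSQL). 64-bit values such
# as fragment ids overflow the old default but fit the new one.
old_column = "size INTEGER"  # max 2_147_483_647
new_column = "size BIGINT"   # max 9_223_372_036_854_775_807
ddl = f"CREATE TABLE nodes (id BIGINT PRIMARY KEY, {new_column})"
```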

Bug fixes:

  • Removed duplicate application of attribute filters.
  • Fixed cases where containment was not using the half-open interval and was returning nodes on the upper boundary of the ROI.
  • Fixed cases where fail_if_exists was either ignored, used incorrectly, or not behaving as expected.
  • Added a close method to graph providers to close open connections to the DB, and turned the db_factory test fixture into a context manager that closes connections. This resolved an annoying bug causing tests to hang permanently in rare cases (a fixture sketch follows this list).
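
A minimal sketch of the fixture pattern, assuming a hypothetical `make_graph_provider` helper; only `close()` itself is new in this PR:

```python
import pytest

@pytest.fixture
def db_factory():
    providers = []

    def factory(*args, **kwargs):
        provider = make_graph_provider(*args, **kwargs)  # hypothetical helper
        providers.append(provider)
        return provider

    try:
        yield factory
    finally:
        # Close every connection even if the test failed; leaked
        # connections were the cause of the rare permanent hangs.
        for provider in providers:
            provider.close()
```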

Expanded tests to cover many of the above cases

will and others added 23 commits February 11, 2026 00:16
…nly case

The elif branch handling an upper-bound-only ROI dimension was checking
roi.begin[dim] (always False at that point) instead of roi.end[dim],
causing the upper-bound constraint to be silently dropped.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Roi uses half-open intervals where end is exclusive. SQL BETWEEN is
inclusive on both ends, so nodes exactly at roi.end were incorrectly
included. Replace with >= begin AND < end.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
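
For one dimension with begin 0 and end 10, the fix amounts to the following (a sketch; the generated SQL in the codebase may be shaped differently):

```python
buggy = "z BETWEEN 0 AND 10"  # inclusive on both ends: matches z == 10
fixed = "z >= 0 AND z < 10"   # half-open: excludes the upper boundary
```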
The __attr_query() call already generates WHERE conditions for all
attr_filter entries. The subsequent for-loop over attr_filter appended
the same conditions again, producing redundant SQL like
"WHERE foo=1 AND foo=1".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Same issue as read_nodes — __attr_query() already generates the full
filter clause, but the subsequent for-loop re-appended identical
conditions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
write_nodes was hardcoding fail_if_exists=True in the _insert_query
call, silently ignoring the caller's parameter. Duplicate node inserts
with fail_if_exists=False would crash instead of being ignored.

Add test_graph_duplicate_insert_behavior to verify both flags work
correctly for nodes and edges.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
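
A sketch of how a query builder can honor the flag, assuming PostgreSQL's ON CONFLICT clause is how the behavior is implemented; the actual _insert_query may differ:

```python
def insert_query(table, columns, fail_if_exists=True):
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    query = f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"
    if not fail_if_exists:
        # Silently skip rows whose key already exists instead of raising.
        query += " ON CONFLICT DO NOTHING"
    return query

# insert_query("nodes", ["id", "size"], fail_if_exists=False)
# -> "INSERT INTO nodes (id, size) VALUES (%s, %s) ON CONFLICT DO NOTHING"
```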
Also adds a join query to handle cases where edges are fetched by ROI rather than by a list of nodes. This is more efficient because it takes a single round-trip query.
Previous behavior was to silently ignore `roi`.
Reading by ROI can be significantly more efficient than reading by node list, since the node list can be huge and would need to be serialized to a string.
This allows testing on DBs other than just a locally running one.
There are now bulk versions of `write_nodes`, `write_edges`, and `write_graph`. These are faster but do not support some features such as fail_if_exists, and thus require more care from the user to guarantee that the data being passed is valid.
There are also helper context managers that drop and rebuild indexes and disable and re-enable synchronous commits, which can be used to further speed up writes (see the sketch below).

Tests have been expanded to make sure that the new API matches the features of the base implementation, and to test that it is actually faster.
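
A sketch of what such helpers can look like (names are assumptions; `synchronous_commit` is a real PostgreSQL setting):

```python
from contextlib import contextmanager

@contextmanager
def unsynchronized_commits(cursor):
    # Trades crash durability for much faster commits during the bulk load.
    cursor.execute("SET synchronous_commit TO off")
    try:
        yield
    finally:
        cursor.execute("SET synchronous_commit TO on")

@contextmanager
def dropped_indexes(cursor, table, indexes):
    # indexes: list of (name, column_list) pairs, e.g. [("pos", "(z, y, x)")]
    for name, _ in indexes:
        cursor.execute(f"DROP INDEX IF EXISTS {name}")
    try:
        yield  # bulk inserts happen here, unburdened by index maintenance
    finally:
        for name, columns in indexes:
            cursor.execute(f"CREATE INDEX {name} ON {table} {columns}")
```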
…nections.

Fixes a very frustrating permanent hang that can occur due to unclosed PostgreSQL connections.
… large numbers like fragment ids much easier
Contributor Author

pattonw commented Feb 11, 2026

Resolves issues:
#18 #31 #32

