@gvanrossum-ms
Collaborator

@gvanrossum-ms gvanrossum-ms commented Jan 14, 2026

  • Generalize IngestedSources to store a status field for each source_id.

    This changes the schema for IngestedSources, adding a text column 'status' that records whether indexing of a source_id succeeded, failed, or ended in some other state. The well-known values are 'ingested' and 'failed', but mark_source_ingested() can set it to any other string value, and get_source_status() retrieves it. The existing is_source_ingested() returns True only if the status field has the exact value 'ingested'.

    The behavior of add_messages_with_indexing() changes subtly: it now sets the status to 'failed' before attempting any work; it sets it to 'ingested' once it is done indexing.

    To migrate a previously existing database, run this using sqlite3:

    ALTER TABLE IngestedSources
    ADD COLUMN status TEXT NOT NULL DEFAULT 'ingested';
    

    That sets the status for every row present to 'ingested', matching the original behavior. (The in-memory storage provider implements the API but always reports the status to be 'ingested' when set.)

  • Increase TypeChat timeout_seconds from 10 (default) to 30.

  • Switch from email_id to filename as source_id -- this allows us to skip parsing the message when it's already ingested.

  • Remove quadratic behavior from SqliteRelatedTermsFuzzy and SqliteMessageTextIndex constructors.

  • Change signature of VectorBase.add_embeddings() -- a new argument keys precedes the embeddings argument. (Less backwards compatible, but consistent with add_embedding().)

  • Change output of utils.timelog(): label shows first and immediately, time follows; and all goes to stderr.

  • Streamlining and improvements to ingest_email.py verbose output, e.g. print only one line for already-ingested files, and decode RFC 2047 encoded-word strings.
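
A minimal sketch of the status lifecycle described in the first bullet, using only the standard sqlite3 module. The function names mark_source_ingested(), get_source_status() and is_source_ingested() come from the description above; their signatures here, the open_db() and ingest() helpers, and the two-column table layout are assumptions for illustration, not the code in this PR.

```python
import sqlite3

STATUS_INGESTED = "ingested"
STATUS_FAILED = "failed"


def open_db(path=":memory:"):
    # Hypothetical two-column layout; the real table may differ.
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS IngestedSources ("
        " source_id TEXT PRIMARY KEY,"
        " status TEXT NOT NULL DEFAULT 'ingested')"
    )
    return conn


def mark_source_ingested(conn, source_id, status=STATUS_INGESTED):
    # INSERT OR REPLACE lets a later call overwrite an earlier status.
    conn.execute(
        "INSERT OR REPLACE INTO IngestedSources (source_id, status) VALUES (?, ?)",
        (source_id, status),
    )
    conn.commit()


def get_source_status(conn, source_id):
    row = conn.execute(
        "SELECT status FROM IngestedSources WHERE source_id = ?", (source_id,)
    ).fetchone()
    return row[0] if row else None  # None means the source was never seen


def is_source_ingested(conn, source_id):
    # True only for the exact value 'ingested'.
    return get_source_status(conn, source_id) == STATUS_INGESTED


def ingest(conn, source_id, index_fn):
    # Record 'failed' before doing any work, so an exception or crash
    # during indexing leaves a visible marker behind.
    mark_source_ingested(conn, source_id, STATUS_FAILED)
    index_fn(source_id)
    mark_source_ingested(conn, source_id, STATUS_INGESTED)
```

With this shape, a source whose indexing raised keeps the status 'failed', so a later run can decide whether to skip or retry it.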

Fixes #166 amongst other things.
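
For reference, decoding RFC 2047 encoded-word strings (as mentioned for the verbose output) can be done with the standard library alone. This is a sketch of one common approach; decode_rfc2047() is a hypothetical helper name, and ingest_email.py may do this differently.

```python
from email.header import decode_header, make_header


def decode_rfc2047(value: str) -> str:
    # decode_header() splits the value into (bytes-or-str, charset) chunks;
    # make_header() reassembles them into a Header that str() renders as text.
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    print(decode_rfc2047("=?utf-8?q?Caf=C3=A9?="))
```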

@gvanrossum-ms changed the title from "Generalize ingestion status to a string, and mark/skip failed ingestions" to "Various improvements to email ingestion pipeline" on Jan 15, 2026
@robgruen
Collaborator

Why have a string column instead of a new table with status and reference the id? Probably more complicated if you added the functionality for tracking new status ids for custom status fields, but should be less storage consumed in the long run.

@gvanrossum
Collaborator

> Why have a string column instead of a new table with status and reference the id? Probably more complicated if you added the functionality for tracking new status ids for custom status fields, but should be less storage consumed in the long run.

I somehow doubt that this table is going to be one of the bigger ones -- when I get back to the machine where I am ingesting thousands of emails I will measure all tables. (Based on a small sample, I suspect that tables with embeddings like RelatedTermsFuzzy will be by far the largest.)

We could use NULL to represent "ingested"; that would probably take less space than a string, but I'm reluctant to play too many games here unless it's a known bottleneck. What I really do want is the semantics of storing various status strings for failures (see other comment).
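
To make the NULL alternative concrete, here is a rough sketch of what it could look like. This is not what the PR does; the nullable column and the COALESCE-based lookup are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical variant: status is nullable, and NULL stands for 'ingested'.
conn.execute(
    "CREATE TABLE IngestedSources (source_id TEXT PRIMARY KEY, status TEXT)"
)
conn.executemany(
    "INSERT INTO IngestedSources (source_id, status) VALUES (?, ?)",
    [("a.eml", None), ("b.eml", "failed")],
)


def get_source_status(conn, source_id):
    # COALESCE maps the stored NULL back to the well-known string.
    row = conn.execute(
        "SELECT COALESCE(status, 'ingested') FROM IngestedSources"
        " WHERE source_id = ?",
        (source_id,),
    ).fetchone()
    return row[0] if row else None
```

The cost is that every reader has to know about the NULL convention, which is the kind of game the comment above is wary of.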

@gvanrossum gvanrossum deployed to build-pipeline January 16, 2026 05:22 — with GitHub Actions Active

Successfully merging this pull request may close these issues.

SQLite create_storage_provider() is slow when database has data already

4 participants