Various improvements to email ingestion pipeline #162

gvanrossum-ms · 2026-01-14T22:11:45Z

Generalize IngestedSources to store a status field for each source_id.

This changes the schema for IngestedSources, adding a text column 'status' that describes whether indexing of a source_id succeeded, failed, or ended in some other state. The well-known values are 'ingested' and 'failed', but mark_source_ingested() can set the status to any other string value, and get_source_status() retrieves it. The existing is_source_ingested() returns True only if the status field has the exact value 'ingested'.
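To make the semantics above concrete, here is a minimal sketch of the three calls against an in-memory SQLite table. The class name `IngestedSourcesSketch`, the `status` default argument, and the exact column layout are assumptions for illustration, not the real typeagent storage provider.

```python
import sqlite3
from typing import Optional

# Name borrowed from a later commit in this PR; the value is the well-known status.
STATUS_INGESTED = "ingested"


class IngestedSourcesSketch:
    """Toy stand-in for the IngestedSources table (hypothetical, not the real class)."""

    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        # Assumed layout: one row per source_id plus the new status column.
        self.db.execute(
            "CREATE TABLE IngestedSources ("
            "source_id TEXT PRIMARY KEY, "
            "status TEXT NOT NULL DEFAULT 'ingested')"
        )

    def mark_source_ingested(
        self, source_id: str, status: str = STATUS_INGESTED
    ) -> None:
        # INSERT OR REPLACE lets an existing source_id get a new status.
        self.db.execute(
            "INSERT OR REPLACE INTO IngestedSources (source_id, status) "
            "VALUES (?, ?)",
            (source_id, status),
        )

    def get_source_status(self, source_id: str) -> Optional[str]:
        row = self.db.execute(
            "SELECT status FROM IngestedSources WHERE source_id = ?",
            (source_id,),
        ).fetchone()
        return row[0] if row else None

    def is_source_ingested(self, source_id: str) -> bool:
        # True only for the exact value 'ingested'; any other string is "not ingested".
        return self.get_source_status(source_id) == STATUS_INGESTED
```

In this sketch an unknown source_id yields `None` from `get_source_status()`; whether the real API does the same is not stated here.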

The behavior of add_messages_with_indexing() changes subtly: it now sets the status to 'failed' before attempting any work; it sets it to 'ingested' once it is done indexing.
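The ordering described above (write 'failed' first, overwrite with 'ingested' at the end) can be sketched as follows. The helper `index_with_status` and the plain dict standing in for the status store are hypothetical. Note that a later commit in this PR is titled "add_messages_with_indexing() should never mark source ids as failures", so the merged code may well differ from this description.

```python
from typing import Callable


def index_with_status(
    statuses: dict[str, str],
    source_id: str,
    do_indexing: Callable[[], None],
) -> None:
    """Hypothetical helper: pessimistic status first, success status last."""
    statuses[source_id] = "failed"    # written before any work is attempted
    do_indexing()                     # if this raises, the status stays 'failed'
    statuses[source_id] = "ingested"  # reached only when indexing completed
```

The point of this ordering is that a crash or exception halfway through leaves a visible 'failed' marker instead of no record at all.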

To alter a precious existing database, do this using sqlite3:
```
ALTER TABLE IngestedSources
ADD COLUMN status TEXT NOT NULL DEFAULT 'ingested';
```
That sets the status for every row present to 'ingested', matching the original behavior. (The in-memory storage provider implements the API but always reports the status to be 'ingested' when set.)
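If you would rather run the migration from Python, something like the following should work. It is a sketch under assumptions: the helper name `ensure_status_column` is made up here, and it only checks for a column named `status` so that running it twice is harmless.

```python
import sqlite3


def ensure_status_column(db: sqlite3.Connection) -> bool:
    """Add the status column if it is missing; return True if the table was altered."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = [row[1] for row in db.execute("PRAGMA table_info(IngestedSources)")]
    if "status" in columns:
        return False
    db.execute(
        "ALTER TABLE IngestedSources "
        "ADD COLUMN status TEXT NOT NULL DEFAULT 'ingested'"
    )
    db.commit()
    return True
```

Typical use would be `ensure_status_column(sqlite3.connect(path))`, where `path` is the database file you want to upgrade.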
Increase TypeChat timeout_seconds from 10 (default) to 30.
Switch from email_id to filename as source_id -- this allows us to skip parsing the message when it's already ingested.
Remove quadratic behavior from SqliteRelatedTermsFuzzy and SqliteMessageTextIndex constructors.
Change signature of VectorBase.add_embeddings() -- a new argument keys precedes the embeddings argument. (Less backwards compatible, but consistent with add_embedding().)
Change output of utils.timelog(): label shows first and immediately, time follows; and all goes to stderr.
Streamlining and improvements to ingest_email.py verbose output, e.g. print only one line for already-ingested files, and decode RFC 2047 encoded-word strings.
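For readers unfamiliar with RFC 2047 encoded-words (e.g. `=?utf-8?b?...?=` in a Subject header), here is a minimal sketch of decoding them with the standard library. This is not the `decode_encoded_word()` from ingest_email.py; the function name `decode_encoded_words` and the fallback to ASCII with replacement characters are choices made for this example.

```python
from email.header import decode_header


def decode_encoded_words(value: str) -> str:
    """Decode RFC 2047 encoded-words in a header value (illustrative sketch)."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # charset is None for unencoded segments; fall back to ASCII.
            # An unrecognized charset name would raise LookupError here.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # decode_header returns str when the value has no encoded-words.
            parts.append(chunk)
    # Collect the pieces and join once instead of repeated string concatenation.
    return "".join(parts)
```

For example, `decode_encoded_words("=?utf-8?b?Q2Fmw6k=?=")` returns `"Café"`.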

Fixes #166 amongst other things.

robgruen · 2026-01-15T23:17:17Z

Why have a string column instead of a new table with status and reference the id? Probably more complicated if you added the functionality for tracking new status ids for custom status fields, but should be less storage consumed in the long run.

src/typeagent/knowpro/conversation_base.py

src/typeagent/storage/sqlite/schema.py

tools/ingest_email.py

gvanrossum · 2026-01-16T02:01:19Z

> Why have a string column instead of a new table with status and reference the id? Probably more complicated if you added the functionality for tracking new status ids for custom status fields, but should be less storage consumed in the long run.

I somehow doubt that this table is going to be one of the bigger ones -- when I get back to the machine where I am ingesting thousands of emails I will measure all tables. (Based on a small sample, I suspect that tables with embeddings like RelatedTermsFuzzy will be by far the largest.)

We could use NULL to represent "ingested"; that would probably take less space than a string, but I'm reluctant to play too many games here unless it's a known bottleneck. What I really do want is the semantics of storing various status strings for failures (see other comment).

Generalize ingestion status to a string, and mark/skip failed ingestions

23b1f80

gvanrossum-ms had a problem deploying to build-pipeline January 14, 2026 22:12 — with GitHub Actions Failure

gvanrossum-ms temporarily deployed to build-pipeline January 14, 2026 22:16 — with GitHub Actions Inactive

gvanrossum-ms requested a review from robgruen January 15, 2026 01:56

gvanrossum-ms added 10 commits January 15, 2026 10:03

Improve database speed (see #166)

3547f19

Add test for add_embeddings

eed216a

Send timelog to stderr, to avoid breaking MCP protocol

b3516e8

Formatted vectorbase.py and convknowledge.py

32f3abb

Fix test_timelog (capture stderr)

c0c3956

Allow updating existing source_id status

8a4ee7f

Remove quadratic code in SqliteMessageTextIndex

1cff9c0

Remove unnecessary timelog calls

3524065

Format messageindex.py

4e9604c

Cosmetic (output) changes to ingest_email.py

7e6c341

gvanrossum-ms had a problem deploying to build-pipeline January 15, 2026 21:58 — with GitHub Actions Error

gvanrossum-ms changed the title ~~Generalize ingestion status to a string, and mark/skip failed ingestions~~ Various improvements to email ingestion pipeline Jan 15, 2026

Update TODO list in ingest_email.py

1c9810a

gvanrossum-ms had a problem deploying to build-pipeline January 15, 2026 22:11 — with GitHub Actions Error

Remove unused imports

35bf3bf

gvanrossum-ms temporarily deployed to build-pipeline January 15, 2026 22:16 — with GitHub Actions Inactive

robgruen approved these changes Jan 15, 2026

src/typeagent/knowpro/conversation_base.py (outdated, resolved)

src/typeagent/storage/sqlite/schema.py (outdated, resolved)

tools/ingest_email.py (outdated, resolved)

gvanrossum added 3 commits January 15, 2026 18:07

Optimize decode_encoded_word() using join()

ecf45b5

add_messages_with_indexing() should never mark source ids as failures

96f4fc1

Upon Exception, set file status to exception name

85318b9

gvanrossum had a problem deploying to build-pipeline January 16, 2026 04:45 — with GitHub Actions Error

Make 'ingested' into STATUS_INGESTED constant

9d006bf

gvanrossum had a problem deploying to build-pipeline January 16, 2026 04:58 — with GitHub Actions Failure

Move STATUS_INGESTED definition to interfaces_storage.py

770d134

gvanrossum temporarily deployed to build-pipeline January 16, 2026 05:15 — with GitHub Actions Inactive

Merge branch 'main' into ingstatus

e0a5513

gvanrossum deployed to build-pipeline January 16, 2026 05:22 — with GitHub Actions Active

robgruen approved these changes Jan 16, 2026


4 participants
