Various improvements to email ingestion pipeline #162
Why have a string column instead of a new table with status and reference the id? Probably more complicated if you added the functionality for tracking new status ids for custom status fields, but should be less storage consumed in the long run.
I somehow doubt that this table is going to be one of the bigger ones -- when I get back to the machine where I am ingesting thousands of emails I will measure all tables. (Based on a small sample, I suspect that tables with embeddings like RelatedTermsFuzzy will be by far the largest.) We could use NULL to represent "ingested", that would probably use up less space than a string, but I'm reluctant to play too many games here unless it's a known bottleneck. What I really do want is the semantics of storing various status strings for failures (see other comment).
Generalize IngestedSources to store a status field for each source_id.
This changes the schema for IngestedSources, adding a text column 'status' that can describe whether indexing of a source_id succeeded, failed, or resulted in some other status. Well-known values are 'ingested', 'failed', but using mark_source_ingested() it can be set to any other string value, and the status can be retrieved using get_source_status(). The existing is_source_ingested() returns True only if the status field has the exact value 'ingested'.
The behavior of add_messages_with_indexing() changes subtly: it now sets the status to 'failed' before attempting any work; it sets it to 'ingested' once it is done indexing.
To alter a precious existing database, do this using sqlite3:
That sets the status for every row present to 'ingested', matching the original behavior. (The in-memory storage provider implements the API but always reports the status to be 'ingested' when set.)
Increase TypeChat timeout_seconds from 10 (default) to 30.
Switch from email_id to filename as source_id -- this allows us to skip parsing the message when it's already ingested.
Remove quadratic behavior from SqliteRelatedTermsFuzzy and SqliteMessageTextIndex constructors.
Change signature of VectorBase.add_embeddings() -- a new argument keys precedes the embeddings argument. (Less backwards compatible, but consistent with add_embedding().)
Change output of utils.timelog(): label shows first and immediately, time follows; and all goes to stderr.
Streamlining and improvements to ingest_email.py verbose output, e.g. print only one line for already-ingested files, and decode RFC 2047 encoded-word strings.
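For readers unfamiliar with RFC 2047 encoded-words, here is a small sketch of decoding them with Python's standard email.header module. The helper name decode_encoded_words is hypothetical, and this may not be how ingest_email.py does it.

```python
from email.header import decode_header, make_header


def decode_encoded_words(value: str) -> str:
    # decode_header() splits the value into (bytes-or-str, charset) parts;
    # make_header() reassembles them into a Header that str() renders as text.
    return str(make_header(decode_header(value)))


print(decode_encoded_words("=?utf-8?q?Caf=C3=A9?="))  # Café
print(decode_encoded_words("=?UTF-8?B?SGVsbG8=?="))   # Hello
print(decode_encoded_words("Plain subject"))          # Plain subject
```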
Fixes #166 amongst other things.
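The new utils.timelog() output order (label first and immediately, elapsed time afterwards, everything on stderr) could be sketched as a context manager like the one below. The name timelog_sketch, the context-manager form, and the exact output format are assumptions for illustration, not the function's real signature.

```python
import sys
import time
from contextlib import contextmanager


@contextmanager
def timelog_sketch(label):
    # Print the label right away (flushed) so it is visible while the work runs.
    print(f"{label}...", end="", file=sys.stderr, flush=True)
    start = time.perf_counter()
    try:
        yield
    finally:
        # Append the elapsed time on the same line once the block exits.
        elapsed = time.perf_counter() - start
        print(f" {elapsed:.3f}s", file=sys.stderr)


with timelog_sketch("Summing numbers"):
    total = sum(range(1_000_000))
```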