Lightweight utilities to preprocess healthcare-claims CSVs and ingest them into sqlite files for downstream exploration and feature engineering with pandas.
- Run the interactive demo (creates a temporary DB by default):

  ```
  python -m claims_prep.demo
  ```

- Create one sqlite database per dataset subdirectory under `data/` and write them to `databases/` (recommended for multiple independent datasets):

  ```
  python -m claims_prep --create-db --all-datasets --data-dir .\data --databases-dir .\databases
  ```

- Create a single DB for a specific dataset directory:

  ```
  python -m claims_prep --create-db --data-dir .\data\dataset1 --db-path .\databases\dataset1.db
  ```
Primary launcher: `python -m claims_prep [OPTIONS]` (run from the repository root). The most commonly used options are:
- `--create-db` — ingest CSVs into one sqlite DB and exit (use with `--all-datasets` or supply `--data-dir`).
- `--all-datasets` — when used with `--create-db`, create one DB per immediate subdirectory of `--data-dir` and write them to `--databases-dir`.
- `--data-dir PATH` (default: `data`) — data root or dataset directory to ingest.
- `--databases-dir PATH` (default: `databases`) — when using `--all-datasets`, destination folder for per-dataset DBs.
- `--db-path PATH` — DB path when creating a single DB. If omitted, the CLI derives a sensible default of `databases/<dataset_name>.db` based on `--data-dir`.
- `--no-preprocess` — skip cleaning/typing (column normalization, date parsing, numeric downcast) when ingesting CSVs.
Other useful options, for single-CSV processing and interactive checks (a combined example follows the list):
- `--input` / `-i PATH` — path to a single CSV to process (required unless `--create-db` is used).
- `--output` / `-o PATH` — output path for the cleaned CSV (default: `processed_claims.csv`).
- `--nrows INT` — read only the first N rows (useful for quick tests).
- `--hash-ids` / `--id-salt` — de-identify detected ID columns with deterministic hashing.
- `--compute-features` / `--features-output` — run lightweight feature engineering and save a features CSV.
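For instance, a quick single-file run that de-identifies IDs and emits features might look like the following (a sketch: whether `--id-salt` takes a string value this way is an assumption, so check `--help` for the exact form):

```
python -m claims_prep --input .\data\dataset1\claims.csv --output .\processed_claims.csv --nrows 1000 --hash-ids --id-salt "example-salt" --compute-features --features-output .\claims_features.csv
```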
`claims_prep.demo.demo_create_and_preview()` builds a sqlite DB from `data/` and previews a chosen table (`claims` by default). When `db_path` is omitted, a temporary file is created and its path is logged so you can inspect or persist it.
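Called from Python, that looks roughly like the sketch below (assuming `db_path` is accepted as a keyword argument, as the temporary-file behaviour above implies):

```python
from pathlib import Path

from claims_prep.demo import demo_create_and_preview

# Omit db_path to get a temporary, auto-logged database;
# pass one to persist the DB for later inspection.
demo_create_and_preview(db_path=Path("databases/demo.db"))
```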
You can run the demo against the committed small sample dataset:

```
python -m claims_prep.demo --use-sample
```

This uses the `data/sample_small` CSVs committed for examples and tests.
- `create_sqlite_db_from_dir(data_dir: Path, db_path: Path, csv_glob: str = "*.csv", chunk_size: int = 100_000, preprocess: bool = True, if_exists: str = "replace")` — ingest each CSV in `data_dir` into a table named after the file stem. Streams files in chunks to limit memory usage.
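A minimal sketch of a single-dataset ingest (the top-level import path is an assumption; adjust it to wherever the function is actually exported):

```python
from pathlib import Path

from claims_prep import create_sqlite_db_from_dir  # assumed export location

create_sqlite_db_from_dir(
    data_dir=Path("data/dataset1"),
    db_path=Path("databases/dataset1.db"),
    chunk_size=50_000,  # smaller chunks trade speed for lower peak memory
    preprocess=True,    # clean column names, parse dates, downcast numerics
)
```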
- `create_sqlite_databases_for_data_root(data_root: Path, databases_dir: Path, csv_glob: str = "*.csv", chunk_size: int = 100_000, preprocess: bool = True, if_exists: str = "replace")` — create one sqlite DB per dataset directory and write it into `databases_dir`.
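The multi-dataset equivalent (again assuming a top-level export):

```python
from pathlib import Path

from claims_prep import create_sqlite_databases_for_data_root  # assumed export

# One DB per immediate subdirectory of data/, written to databases/.
create_sqlite_databases_for_data_root(
    data_root=Path("data"),
    databases_dir=Path("databases"),
)
```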
- `list_db_tables(db_path: Path) -> List[str]` — list tables in a sqlite file.
- `read_table(db_path: Path, table: str, sql: Optional[str] = None) -> pandas.DataFrame` — read a table or query into pandas.
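Inspection from Python might then look like this sketch (assuming that when `sql` is given it runs in place of a plain table read, per the signature above):

```python
from pathlib import Path

from claims_prep import list_db_tables, read_table  # assumed exports

db = Path("databases/dataset1.db")
print(list_db_tables(db))          # e.g. ['claims', 'members']
claims = read_table(db, "claims")  # whole table as a DataFrame
sample = read_table(db, "claims", sql="SELECT * FROM claims LIMIT 100")
```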
- Datasets may have different CSV filenames and column sets. Ingestion treats each dataset independently and does not align schemas across datasets. For cross-dataset modeling, add a schema-normalization step (rename, cast, fill) before ingestion, as sketched below.
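One way to do that pre-ingestion normalization with pandas (every column name here is hypothetical; substitute your own mapping):

```python
import pandas as pd

# Hypothetical per-dataset rename map; extend one per source schema.
RENAME = {"clm_id": "claim_id", "svc_dt": "service_date", "amt_paid": "paid_amount"}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename, cast, and fill so every dataset lands on one shared schema."""
    df = df.rename(columns=RENAME)
    if "service_date" in df.columns:
        df["service_date"] = pd.to_datetime(df["service_date"], errors="coerce")
    if "paid_amount" in df.columns:
        df["paid_amount"] = pd.to_numeric(df["paid_amount"], errors="coerce").fillna(0.0)
    return df
```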
- The default preprocessing (clean names, parse dates, downcast numeric types) is safe for most exploratory workflows; use `--no-preprocess` if you need raw ingestion.
- The demo writes to a temporary DB by default to avoid overwriting local files; pass `db_path` if you need a persistent DB.