Turn messy CSV/JSON into clean, structured outputs with an AI‑assisted pipeline. DataMorph handles upload → profiling → canonicalization → transformation → export, with a simple API and a lightweight React UI.
Stack: .NET 9, React + TypeScript, Docker, Google Cloud (Cloud Run, Firestore, Firebase Auth, Cloud Storage, Pub/Sub, Eventarc)
DataMorph lets you upload any messy dataset (CSV or JSON) and have Gemini perform the necessary ETL on it. Gemini can also answer questions about the dataset, clean it up for you, or even generate another dataset based on what you've provided as input. Pretty much anything! The sky is the limit on what you can do with DataMorph.
You're probably wondering why you'd use DataMorph when you could upload a file to any LLM directly and ask for a response. For a quick one-off, you'd be right! However, a raw LLM UI falls over when you need reliability, scale, governance, and integration. Here's what DataMorph adds:
- Schema enforcement: you define exactly what you need, and DataMorph rejects or flags violations.
- DataMorph emits strict JSONL with per-row errors; no prose, no markdown.
- DataMorph can handle large files (currently disabled, to avoid huge Gemini bills) via GCS uploads, chunked processing, and idempotency keys. LLM UIs cap out pretty quickly.
- Every run is tied to a hash, so the exact same prompt can be re-run to give you the exact same results.
- Costs: a preflight profiler estimates tokens and costs, and cheap rule-based transformations run before anything reaches an LLM, avoiding unnecessary LLM calls.
In a nutshell: DataMorph takes something messy and cleans it up with versioned transform plans, scalable chunked processing, and cost control, which an ad-hoc chat with an LLM cannot operationalize.
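To make the reproducibility point concrete, here is a minimal TypeScript sketch of how a run could be tied to a hash of its exact inputs. The function name and input fields are hypothetical, not DataMorph's actual implementation:

```typescript
import { createHash } from "node:crypto";

// Hypothetical idempotency key: a SHA-256 over the prompt, target schema,
// and raw file bytes. Identical inputs always yield the same key.
function runHash(prompt: string, targetSchema: string, fileBytes: Buffer): string {
  return createHash("sha256")
    .update(prompt)
    .update("\0") // separator so "ab" + "c" never collides with "a" + "bc"
    .update(targetSchema)
    .update("\0")
    .update(fileBytes)
    .digest("hex");
}
```

Two runs with byte-identical inputs produce the same key, so a repeated run can be served from cached output instead of calling Gemini again.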
- Init a job → `POST /pipelines/init` with your prompt to get a `jobId` and a signed upload URL.
- Upload file → the site then automatically uploads the CSV/JSON directly to GCS using the signed URL.
- Auto‑parse → Eventarc triggers the `DataMorph-Parser`, which emits canonical JSONL + a profile.
- Transform → the `Transformer` (pulls from Pub/Sub) applies the AI transform plan and writes the file to GCS.
- Done → Firestore updates the state to `DONE` if it's all successful, allowing the user to download the processed file.
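The first two steps above can be driven from a client like this sketch. Only the `POST /pipelines/init` route and the signed-URL PUT come from the flow described here; the request body shape and the response field names (`jobId`, `uploadUrl`) are assumptions:

```typescript
// Minimal response surface we need from a fetch-like function; injecting it
// keeps the sketch testable without a network.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string | Uint8Array },
) => Promise<{ ok: boolean; json(): Promise<any> }>;

async function initAndUpload(
  apiBase: string,
  prompt: string,
  file: Uint8Array,
  fetchFn: FetchLike,
): Promise<string> {
  // Step 1: init the job (assumed field names: prompt, jobId, uploadUrl).
  const res = await fetchFn(`${apiBase}/pipelines/init`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
  });
  if (!res.ok) throw new Error("init failed");
  const { jobId, uploadUrl } = await res.json();

  // Step 2: upload the raw file straight to GCS via the signed URL.
  const put = await fetchFn(uploadUrl, { method: "PUT", body: file });
  if (!put.ok) throw new Error("upload failed");
  return jobId;
}
```

From here the client only watches Firestore state; the rest of the pipeline runs server-side.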
Firestore state is constantly updated so that the user can keep track of which file is being processed and its current status.
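The status tracking above implies a small state machine. The states `INIT`, `PARSING`, `TRANSFORMING`, and `DONE` come from this flow; `FAILED` is an assumed extra state for error handling, not confirmed by the docs:

```typescript
type JobState = "INIT" | "PARSING" | "TRANSFORMING" | "DONE" | "FAILED";

// Legal forward transitions for a job document in Firestore (sketch only).
// FAILED is a hypothetical terminal state for any step that errors out.
const TRANSITIONS: Record<JobState, JobState[]> = {
  INIT: ["PARSING", "FAILED"],
  PARSING: ["TRANSFORMING", "FAILED"],
  TRANSFORMING: ["DONE", "FAILED"],
  DONE: [],
  FAILED: [],
};

function canAdvance(from: JobState, to: JobState): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Guarding writes with a check like `canAdvance` keeps out-of-order worker updates from moving a job backwards.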
```mermaid
flowchart LR
    subgraph WEB["React + TypeScript UI"]
        UI[DataMorph UI]
    end
    UI -->|POST /pipelines/init - targetSchema & prompt| API[(Cloud Run .NET API)]
    API -->|Create Firestore job state INIT| FS[(Firestore)]
    API -->|Return jobId + signedUploadUrl| UI
    UI -->|PUT file to GCS via signed URL| RAW[(GCS raw-bucket)]
    RAW -- finalize --> EA[Eventarc]
    EA --> PARSER[Parser Cloud Run]
    PARSER -->|canonical.jsonl and profile.json| STAGE[(GCS staging-bucket)]
    PARSER -->|Publish transform.requests| PS[(Pub/Sub)]
    PS --> XFORM[Transformer Run]
    XFORM -->|data.json and data.jsonl| OUT[(GCS output-bucket)]
    PARSER -->|state PARSING| FS
    XFORM -->|state TRANSFORMING| FS
    XFORM -->|state DONE + download URLs| FS
    FS --> UI
    OUT --> UI
```
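The chunked processing step in this pipeline can be sketched as a simple batching helper. The real Transformer's batching strategy and sizes are not documented here, so this is illustrative only:

```typescript
// Hypothetical sketch: split canonical JSONL rows into fixed-size batches so
// each Gemini call stays within token limits and failures retry per chunk.
function chunkRows<T>(rows: T[], size: number): T[][] {
  if (size <= 0) throw new Error("chunk size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < rows.length; i += size) {
    out.push(rows.slice(i, i + size));
  }
  return out;
}
```

Combined with the per-run idempotency key, already-completed chunks can be skipped when a job is retried.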
- GCS Buckets
  - `raw-bucket-gcs` – client uploads land here (via signed URL)
  - `staging-bucket-gcs` – `canonical.jsonl` + `profile.json`
  - `raw-bucket-gcs-output` – final JSON files
- Pub/Sub Topic
  - `datamorph-pubsub` – the Pub/Sub topic on GCP; the Parser publishes transform requests to it and the Transformer pulls from it
Notes
- The signed URL is time‑limited and scope‑limited to the job path. The download signed URL has a 7‑day expiry, after which the user will no longer be able to download the file.
- Rate limiting allows one file upload per user every 60 seconds. This prevents abuse of the system and avoids the massive server and Gemini bills that DataMorph would otherwise rack up!
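A minimal in-memory sketch of the per-user upload limit described above. The real implementation presumably lives in the Cloud Run API and may use a distributed store; the class and method names here are hypothetical:

```typescript
// One upload per user per window (default 60 s). In-memory only, so this
// sketch assumes a single instance; multiple Cloud Run instances would need
// shared state (e.g. Firestore) instead.
class UploadRateLimiter {
  private lastUpload = new Map<string, number>();

  constructor(private windowMs = 60_000) {}

  // Returns true and records the attempt if the user is outside the window.
  tryAcquire(userId: string, nowMs: number): boolean {
    const prev = this.lastUpload.get(userId);
    if (prev !== undefined && nowMs - prev < this.windowMs) return false;
    this.lastUpload.set(userId, nowMs);
    return true;
  }
}
```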
A sample skeleton of the configuration can be found in the `appsettings.json` file in both the Api and Parser projects.
- To run the .NET APIs locally, install the latest version of .NET 9. You can use `Visual Studio` as the IDE on Windows; on macOS, `JetBrains Rider` is the preferred IDE.
- To run the frontend, ensure that you have `yarn` installed.
- Run `npm install` in the `datamorph-ui` directory to install the `npm` modules.
- Run `yarn dev` to kick off local development.
- A sample `.env` file for the frontend would look like:

```
VITE_FIREBASE_API_KEY=API_KEY
VITE_FIREBASE_AUTH_DOMAIN=FIREBASE_AUTH_DOMAIN
VITE_FIREBASE_PROJECT_ID=FIREBASE_PROJECT_ID
VITE_FIREBASE_APP_ID=FIREBASE_APP_ID
VITE_GOOGLE_OAUTH_ID=GOOGLE_OAUTH_ID
VITE_API_URL=API_URL
```
- PRs are welcome! Please open an issue describing the change before making any contributions.
- Follow the branching convention: `feature/issue-number` for features, `bug/issue-number` for bugs, `security/issue-number` for security issues.
- Open all PRs against the `dev` branch, never directly against `master`; any PRs to `master` will be automatically rejected. `dev` is merged into `master` periodically by the maintainer.