ameysunu/DataMorph

DataMorph — AI‑Driven ETL on Google Cloud Run

Turn messy CSV/JSON into clean, structured outputs with an AI‑assisted pipeline. DataMorph handles upload → profiling → canonicalization → transformation → export, with a simple API and a lightweight React UI.

Stack: .NET 9, React + TypeScript, Docker, Google Cloud (Cloud Run, Firestore, Firebase Auth, Cloud Storage, Pub/Sub, Eventarc)


How does this work?

This tool lets you upload any messy dataset (CSV or JSON) and have Gemini perform the necessary ETL on it. Gemini can also answer any question you may have about the dataset, clean it up for you, or even generate another dataset based on what you've provided as input. Pretty much anything! Sky is the limit on what you can do with DataMorph.

DataMorph vs LLM Chat UI

You're probably wondering why you'd use DataMorph when you could upload a file to any LLM directly and ask for a response. And you'd be mostly right! However, a raw LLM UI falls over when you need reliability, scale, governance, and integration. Here's what DataMorph adds:

  • You can enforce exactly the schema you need and reject/flag violations.
  • DataMorph emits strict JSONL with per-row errors; no prose, no markdown.
  • DataMorph can handle large files (currently disabled, to avoid huge Gemini bills) via GCS uploads, chunked processing, and idempotency keys. LLM UIs cap out pretty quickly.
  • Every run is tied to a hash, so re-running the exact same prompt gives you the exact same results.
  • Costs: a preflight profiler estimates tokens/costs, and cheap rule-based transformations run before anything reaches the LLM, avoiding unnecessary LLM calls.

In a nutshell: DataMorph takes something messy and cleans it up, with versioned transform plans, scalable chunked processing, and cost control. That's something an ad-hoc LLM chat cannot operationalize.
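The run hash mentioned above can be sketched in TypeScript. DataMorph's actual hash inputs aren't documented here, so hashing the prompt plus the raw file bytes is an assumption:

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch: derive a deterministic run key from the prompt and the
// raw file bytes, so re-running the same prompt on the same input maps to the
// same result. The field choice (prompt + bytes) is an assumption, not
// DataMorph's documented scheme.
export function runHash(prompt: string, fileBytes: Uint8Array): string {
  const h = createHash("sha256");
  h.update(prompt);
  h.update(fileBytes);
  return h.digest("hex");
}
```

Because identical inputs always yield the identical key, a cached result for that key can be returned instead of re-invoking the LLM.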

Behind the scenes

  1. Init a job → POST /pipelines/init with your prompt to get a jobId and a signed upload URL.
  2. Upload file → the UI uploads the CSV/JSON directly to GCS using the signed URL.
  3. Auto‑parse → Eventarc triggers the DataMorph-Parser, which emits canonical JSONL + a profile.
  4. Transform → the Transformer (pulling from Pub/Sub) applies the AI transform plan and writes the output file to GCS.
  5. Done → if everything succeeded, Firestore updates the state to DONE, allowing the user to download the processed file.

Firestore state is updated throughout, so the user can track each file's processing status.
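The job lifecycle above can be sketched as a tiny state machine. The INIT/PARSING/TRANSFORMING/DONE states come from the flow described here; a FAILED terminal state is an assumption added for illustration:

```typescript
// Job states observed in the pipeline. FAILED is an assumed terminal state
// for errors; the rest appear in the flow above.
type JobState = "INIT" | "PARSING" | "TRANSFORMING" | "DONE" | "FAILED";

const transitions: Record<JobState, JobState[]> = {
  INIT: ["PARSING", "FAILED"],
  PARSING: ["TRANSFORMING", "FAILED"],
  TRANSFORMING: ["DONE", "FAILED"],
  DONE: [],   // terminal
  FAILED: [], // terminal
};

// Guard a service could run before writing a state update to Firestore (sketch only).
export function canTransition(from: JobState, to: JobState): boolean {
  return transitions[from].includes(to);
}
```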

Architecture

```mermaid
flowchart LR
subgraph React + TypeScript UI
UI[DataMorph UI]
end

UI -->|POST /pipelines/init - targetSchema & prompt| API[(Cloud Run .NET API)]
API -->|Create Firestore job state INIT| FS[(Firestore)]
API -->|Return jobId + signedUploadUrl| UI
UI -->|PUT file to GCS via signed URL| RAW[(GCS raw-bucket)]

RAW -- finalize --> EA[Eventarc]
EA --> PARSER[Parser Cloud Run]
PARSER -->|canonical.jsonl and profile.json| STAGE[(GCS staging-bucket)]
PARSER -->|Publish transform.requests| PS[(Pub/Sub)]

PS --> XFORM[Transformer Run]
XFORM -->|data.json and data.jsonl| OUT[(GCS output-bucket)]

PARSER -->|state PARSING| FS
XFORM -->|state TRANSFORMING| FS
XFORM -->|state DONE + download URLs| FS

FS --> UI
OUT --> UI
```
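When the Transformer receives a transform.requests message over a Pub/Sub push subscription, the payload arrives as a JSON envelope with base64-encoded message.data (that envelope shape is standard Pub/Sub push). The TransformRequest fields below are assumptions about DataMorph's payload, shown for illustration:

```typescript
// Standard Pub/Sub push envelope as received over HTTP.
interface PubSubPush {
  message: {
    data: string; // base64-encoded JSON payload
    messageId?: string;
    attributes?: Record<string, string>;
  };
  subscription: string;
}

// Assumed fields of DataMorph's transform.requests payload (illustrative only).
interface TransformRequest {
  jobId: string;
  stagingPath: string; // e.g. where canonical.jsonl was written
}

// Decode the base64 message body back into the request object.
export function decodeTransformRequest(body: PubSubPush): TransformRequest {
  const json = Buffer.from(body.message.data, "base64").toString("utf8");
  return JSON.parse(json) as TransformRequest;
}
```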

Buckets & Topics

  • GCS Buckets

    • raw-bucket-gcs — client uploads land here (via signed URL)
    • staging-bucket-gcs — canonical.jsonl + profile.json
    • raw-bucket-gcs-output — final JSON files
  • Pub/Sub Topic

    • datamorph-pubsub — the Pub/Sub topic on GCP that Eventarc uses to trigger /parse
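One way to picture how a job's objects are laid out across the three buckets is the helper below. The actual key scheme DataMorph uses is not documented here, so these paths are purely illustrative:

```typescript
// Hypothetical per-job object layout across the three buckets described above.
// The bucket names come from the README; the path structure is an assumption.
export function jobObjects(jobId: string) {
  return {
    raw: `raw-bucket-gcs/${jobId}/input`,                     // signed-URL upload target
    canonical: `staging-bucket-gcs/${jobId}/canonical.jsonl`, // parser output
    profile: `staging-bucket-gcs/${jobId}/profile.json`,      // parser output
    output: `raw-bucket-gcs-output/${jobId}/data.jsonl`,      // transformer output
  };
}
```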

Notes

  • Signed URLs are time‑limited and scope‑limited to the job path. The download signed URL has a 7‑day expiry, after which the user will no longer be able to download the file.
  • Rate limiting is set to one file upload per user per 60 seconds. This prevents abuse of the system and avoids the massive server and Gemini bills that DataMorph would otherwise rack up!
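A minimal sketch of the 60-second-per-user rate limit, assuming a simple in-memory last-upload timestamp per user (DataMorph's actual enforcement mechanism may differ):

```typescript
// Minimal per-user upload rate limiter sketch: one upload per 60-second window.
// In-memory state is an assumption; a real deployment across Cloud Run
// instances would need shared state (e.g. Firestore).
const WINDOW_MS = 60_000;
const lastUpload = new Map<string, number>();

export function allowUpload(userId: string, nowMs: number = Date.now()): boolean {
  const last = lastUpload.get(userId);
  if (last !== undefined && nowMs - last < WINDOW_MS) {
    return false; // still inside the 60s window: reject
  }
  lastUpload.set(userId, nowMs);
  return true;
}
```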

Configuration

A sample skeleton of the configuration can be found in the appsettings.json file in both the Api and Parser projects.

Local Development

  • To run the .NET APIs locally, install the latest version of .NET 9. You can use Visual Studio as your IDE on Windows; on macOS, JetBrains Rider is the preferred IDE.
  • To run the frontend, ensure that you have yarn installed.
    • Run yarn install in the datamorph-ui directory to install dependencies.

    • Run yarn dev to kick off local development.

    • A sample .env file for the frontend looks like:

      VITE_FIREBASE_API_KEY=API_KEY
      VITE_FIREBASE_AUTH_DOMAIN=FIREBASE_AUTH_DOMAIN
      VITE_FIREBASE_PROJECT_ID=FIREBASE_PROJECT_ID
      VITE_FIREBASE_APP_ID=FIREBASE_APP_ID
      VITE_GOOGLE_OAUTH_ID=GOOGLE_OAUTH_ID
      VITE_API_URL=API_URL
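
A small sketch of validating that all of the variables above are set before the app boots; the check itself is illustrative, not part of the DataMorph codebase:

```typescript
// Required frontend env keys, mirroring the sample .env above.
const REQUIRED = [
  "VITE_FIREBASE_API_KEY",
  "VITE_FIREBASE_AUTH_DOMAIN",
  "VITE_FIREBASE_PROJECT_ID",
  "VITE_FIREBASE_APP_ID",
  "VITE_GOOGLE_OAUTH_ID",
  "VITE_API_URL",
];

// Return the names of any required keys that are missing or empty.
export function missingEnvKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED.filter((k) => !env[k]);
}

// In a Vite app this would be called as missingEnvKeys(import.meta.env)
// at startup, failing fast with a clear message instead of a runtime error.
```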
      

Contributing

PRs are welcome! Please open an issue describing the change before contributing. Follow the branching convention: feature/issue-number for features, bug/issue-number for bugs, and security/issue-number for security issues. Open all PRs against the dev branch, not directly against master; any PRs to master will be automatically rejected. I merge dev into master periodically.
