A lightweight video analysis engine using VLM for automated video captioning and activity indexing.

VisionQuery

VisionQuery is a high-performance visual discovery tool for querying massive image datasets with natural language. It leverages Modal for serverless GPU orchestration and LanceDB for multi-modal vector indexing, with the goal of sub-second retrieval across millions of assets.

Status: Work in Progress. The current build supports high-performance indexing for locally saved videos and CIFAR-10 images. Support for direct S3 video streaming is under development.

1. Setup Environment

Ensure you have Python 3.10+ and the Modal CLI installed.

```bash
pip install -r requirements.txt
modal auth
```
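Because the project requires Python 3.10+, it can help to fail fast before any heavier imports run. A minimal sketch (the helper name is illustrative, not part of the project):

```python
import sys

def check_python(min_version=(3, 10)):
    """Return True when the running interpreter meets the version floor.

    Checking up front avoids confusing import-time errors later in the
    pipeline. `check_python` is a hypothetical helper, not VisionQuery API.
    """
    return sys.version_info[:2] >= min_version
```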

2. Deploy the VLM Worker

Deploy the Moondream2 model to Modal's cloud GPU infrastructure:

```bash
modal deploy src/vlm_worker.py
```
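Once the worker is deployed, callers typically group frames into batches so each remote call amortizes network and model-invocation overhead. A minimal sketch of that batching step (function and parameter names are hypothetical, not the project's actual API):

```python
def batch_frames(frame_paths, batch_size=8):
    """Split a list of frame paths into fixed-size batches.

    Each batch would become one call to the deployed Moondream2 worker;
    the trailing batch may be smaller than `batch_size`.
    """
    return [frame_paths[i:i + batch_size]
            for i in range(0, len(frame_paths), batch_size)]
```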

3. Run the Pipeline

Run the indexing pipeline:

```bash
python main.py
```
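A video-captioning pipeline like this generally does not caption every frame; it samples frames at a fixed rate before sending them to the VLM. A minimal sketch of that sampling arithmetic (names and the default rate are assumptions for illustration):

```python
def sample_indices(total_frames, video_fps, sample_fps=1.0):
    """Indices of frames to caption, sampled at roughly `sample_fps` per second.

    For a 30 fps video sampled at 1 fps, every 30th frame is kept.
    """
    step = max(1, round(video_fps / sample_fps))
    return list(range(0, total_frames, step))
```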

4. Search the Data

Run the search script to query your indexed frames via natural language:

```bash
python search.py --query "a blue truck"
```
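Under the hood, natural-language search amounts to embedding the query and ranking stored frame embeddings by similarity. LanceDB handles this with an approximate index; the brute-force sketch below only illustrates the ranking concept (all names are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=3):
    """Rank (frame_id, embedding) pairs by similarity to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [frame_id for frame_id, _ in ranked[:k]]
```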

Next Steps

- [ ] S3 Integration: direct ingestion from AWS S3 buckets.
- [ ] Visual Interface: a web-based UI to browse and filter indexed frames visually.
- [ ] LoRA Fine-tuning: Low-Rank Adaptation to enforce specific captioning styles for niche domains.
