Video Annotator – Batch Ingest + Index Pipeline (Box → Speech → Segments → Embeddings → Search)

This repo contains scripts to:

- Enumerate .m4a files from a Box shared folder and generate a manifest (videos.jsonl)
- Run each file through the Azure Functions pipeline:
  - Submit batch transcription (TranscribeHttp)
  - Write 30s segments JSON to Blob (segments/<video_id>.json)
  - Embed + index segments into Azure AI Search (EmbedAndIndex)
- Query indexed segments (SearchSegments)

Prerequisites

- Python 3.11+ recommended
- Azure Functions already deployed (or runnable locally)
- Box shared folder link that contains .m4a files
- Working Box API token:
  - EITHER a Developer Token (quick to obtain, but expires after 60 minutes)
  - OR OAuth tokens (BOX_ACCESS_TOKEN + BOX_REFRESH_TOKEN + client id/secret)

Repo Layout (expected)

transcribe/
  scripts/
    box_auth.py
    box_shared_folder_manifest.py
  import_videos.py
  videos.jsonl            # generated
  requirements.txt
  .env                    # you create this (NOT committed)

1) Create a virtual environment + install deps

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If you don’t have a requirements.txt for scripts yet, minimally you’ll need:

requests
python-dotenv

(Box listing can be done via raw REST calls, so you may not need boxsdk.)

2) Create your `.env`

Create a .env file in the project root (same directory you run scripts from):

Box settings

Set the shared folder URL:

BOX_SHARED_FOLDER_URL=https://tulane.box.com/s/<shared-folder-token>

Choose one auth method:

Option A (fastest): Developer Token

BOX_TOKEN=<your_box_developer_token>

Option B (durable): OAuth refresh tokens

BOX_CLIENT_ID=<your_box_client_id>
BOX_CLIENT_SECRET=<your_box_client_secret>
BOX_ACCESS_TOKEN=<your_box_access_token>
BOX_REFRESH_TOKEN=<your_box_refresh_token>

Note: refresh tokens can become invalid if rotated/revoked. If you see invalid_grant, re-run your OAuth login flow and update .env.
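
For reference, a token refresh against Box's OAuth endpoint might look roughly like the sketch below. It is not the code in scripts/box_auth.py; the function names are hypothetical, and it uses only the standard library. Box normally returns a new refresh token with each refresh, so both values would need to be written back to .env.

```python
import json
import urllib.parse
import urllib.request

BOX_TOKEN_URL = "https://api.box.com/oauth2/token"


def build_refresh_request(client_id: str, client_secret: str,
                          refresh_token: str) -> urllib.request.Request:
    """Build the form-encoded POST that exchanges a refresh token."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return urllib.request.Request(BOX_TOKEN_URL, data=form, method="POST")


def refresh_tokens(client_id: str, client_secret: str,
                   refresh_token: str) -> tuple[str, str]:
    """Return (access_token, refresh_token) from Box.

    urlopen raises urllib.error.HTTPError on a 4xx response, which is
    where an invalid_grant error would surface.
    """
    req = build_refresh_request(client_id, client_secret, refresh_token)
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return payload["access_token"], payload["refresh_token"]
```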

Azure Function endpoints

These should be the full function URLs, including ?code=...:

TRANSCRIBE_URL=https://<yourapp>.azurewebsites.net/api/TranscribeHttp?code=...
EMBED_INDEX_URL=https://<yourapp>.azurewebsites.net/api/EmbedAndIndex?code=...
SEARCH_FN_URL=https://<yourapp>.azurewebsites.net/api/SearchSegments?code=...

Optional runner settings

SEGMENTS_CONTAINER=segments
POLL_SECONDS=15
MAX_ACTIVE=10
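
As an illustration of how a script could consume these variables, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical, and it assumes that the three endpoint URLs are required and that the optional values shown above serve as fallbacks; the real scripts may behave differently. It reads from the process environment, so .env would need to be loaded first (for example with python-dotenv).

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Assumption: all three endpoints are treated as required.
REQUIRED = ("TRANSCRIBE_URL", "EMBED_INDEX_URL", "SEARCH_FN_URL")


@dataclass(frozen=True)
class Settings:
    transcribe_url: str
    embed_index_url: str
    search_fn_url: str
    segments_container: str
    poll_seconds: int
    max_active: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Validate required endpoints and apply fallbacks for optional values."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return Settings(
        transcribe_url=env["TRANSCRIBE_URL"],
        embed_index_url=env["EMBED_INDEX_URL"],
        search_fn_url=env["SEARCH_FN_URL"],
        # Fallbacks mirror the example values above (an assumption).
        segments_container=env.get("SEGMENTS_CONTAINER", "segments"),
        poll_seconds=int(env.get("POLL_SECONDS", "15")),
        max_active=int(env.get("MAX_ACTIVE", "10")),
    )
```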

3) Generate the manifest from Box (`videos.jsonl`)

This script reads your Box shared folder and outputs videos.jsonl with one line per .m4a:

{"video_id":"vid_123","media_url":"https://..."}
{"video_id":"vid_456","media_url":"https://..."}

Run:

python scripts/box_shared_folder_manifest.py
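
Before importing, it can help to check that the generated file has the shape shown above. The following is a minimal sketch of such a check; `check_manifest` is a hypothetical helper, not part of the repo, and the https-only rule is an assumption.

```python
import json
from typing import Iterable


def check_manifest(lines: Iterable[str]) -> list[str]:
    """Return a list of human-readable problems found in manifest lines."""
    problems: list[str] = []
    seen: set[str] = set()
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(entry, dict):
            problems.append(f"line {n}: expected a JSON object")
            continue
        video_id = entry.get("video_id")
        media_url = entry.get("media_url")
        if not video_id:
            problems.append(f"line {n}: missing video_id")
        elif video_id in seen:
            problems.append(f"line {n}: duplicate video_id {video_id}")
        else:
            seen.add(video_id)
        # Assumption: media URLs are always served over https.
        if not isinstance(media_url, str) or not media_url.startswith("https://"):
            problems.append(f"line {n}: media_url is missing or not https")
    return problems
```

Usage would be along the lines of `check_manifest(open("videos.jsonl", encoding="utf-8"))`, proceeding only if the returned list is empty.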

Sanity check one URL

Pick one entry from videos.jsonl and confirm it downloads:

python - <<'PY'
import json
with open("videos.jsonl","r") as f:
    print(json.loads(next(f)))
PY

curl -I -L "<media_url>"

The final response (after any redirects) should be 200 OK with an audio content type, not an HTML page or a 404. If this fails, Speech won’t be able to fetch the file either.
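
The same check can be scripted across many manifest entries. The sketch below is one possible approach using only the standard library; `probe` and `looks_like_media` are hypothetical names, and treating any non-HTML 200 response as acceptable is an assumption.

```python
import urllib.error
import urllib.request


def looks_like_media(status: int, content_type: str) -> bool:
    """True for a 200 response whose body is not an HTML page."""
    if status != 200:
        return False
    return not content_type.lower().startswith("text/html")


def probe(url: str, timeout: float = 30.0) -> tuple[int, str]:
    """Fetch the URL (following redirects) and return (status, content type).

    The body is not read; the connection is closed once headers arrive.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.headers.get_content_type()
    except urllib.error.HTTPError as exc:
        return exc.code, exc.headers.get_content_type()
```

Calling `looks_like_media(*probe(media_url))` for each entry flags login pages and dead links before any transcription job is submitted.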

4) Run the pipeline import (`import_videos.py`)

This script:

reads videos.jsonl
submits transcription jobs via TranscribeHttp
polls until each completes
indexes segments via EmbedAndIndex

Run:

python import_videos.py
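
The submit/poll/index flow with a cap on concurrent jobs can be pictured with the sketch below. It is not the code in import_videos.py: `run_bounded` and its callback parameters are hypothetical, and the real HTTP calls to TranscribeHttp and EmbedAndIndex are left to the caller.

```python
import time
from collections import deque
from typing import Callable, Iterable


def run_bounded(
    video_ids: Iterable[str],
    submit: Callable[[str], str],       # starts a job, returns a job handle
    is_done: Callable[[str], bool],     # polls one job handle
    on_done: Callable[[str], None],     # e.g. trigger indexing for a video
    max_active: int = 10,
    poll_seconds: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Keep at most max_active jobs in flight until every video is handled."""
    if max_active < 1:
        raise ValueError("max_active must be at least 1")
    pending = deque(video_ids)
    active: dict[str, str] = {}  # video_id -> job handle
    while pending or active:
        # Top up the active set as slots free up.
        while pending and len(active) < max_active:
            video_id = pending.popleft()
            active[video_id] = submit(video_id)
        # Poll each active job once per round.
        for video_id, job in list(active.items()):
            if is_done(job):
                del active[video_id]
                on_done(video_id)
        if active:
            sleep(poll_seconds)
```

In this shape, MAX_ACTIVE bounds how many transcription jobs are outstanding at once and POLL_SECONDS sets the pause between polling rounds.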

Progress + resume

The importer writes a pipeline_state.json file as it runs. If the script stops, you can rerun it and it will resume from the saved state.
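
To make the resume idea concrete, here is a minimal sketch of a state file that survives interruption. The actual layout of pipeline_state.json is not documented here, so the `{video_id: status}` shape, the "indexed" status value, and the function names are all assumptions.

```python
import json
import os
from pathlib import Path


def load_state(path: Path) -> dict[str, str]:
    """Return saved state, or an empty dict on first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_state(path: Path, state: dict[str, str]) -> None:
    """Write to a temp file, then swap it in, so a crash mid-write
    cannot leave a truncated state file behind."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    os.replace(tmp, path)


def remaining(video_ids: list[str], state: dict[str, str]) -> list[str]:
    """Videos that still need work (assumed terminal status: 'indexed')."""
    return [v for v in video_ids if state.get(v) != "indexed"]
```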

5) Verify search

Once a few videos are indexed, query your SearchSegments function:

curl -X POST "$SEARCH_FN_URL" \
  -H "Content-Type: application/json" \
  -d '{"q":"measles","mode":"hybrid","top":5,"k":40}'

If you get results, your segments are searchable.
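
The same query can be issued from Python. This is a minimal standard-library sketch with hypothetical function names; the request body mirrors the curl example above, and since the response shape is not documented here, the parsed JSON is returned as-is.

```python
import json
import urllib.request


def build_search_request(search_fn_url: str, query: str, mode: str = "hybrid",
                         top: int = 5, k: int = 40) -> urllib.request.Request:
    """Build the POST request sent to the SearchSegments function."""
    body = json.dumps({"q": query, "mode": mode, "top": top, "k": k})
    return urllib.request.Request(
        search_fn_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(search_fn_url: str, query: str, **options):
    """Send the query and return the decoded JSON response."""
    req = build_search_request(search_fn_url, query, **options)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

With SEARCH_FN_URL set in the environment, `search(os.environ["SEARCH_FN_URL"], "measles")` would be the equivalent of the curl call (after `import os`).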

Troubleshooting

Box links return 404

Ensure the manifest script is producing working media_urls
Validate with curl -I -L "<media_url>" (must end in 200)
If a shared link works in a browser but not via curl, it may rely on cookies or redirects. The manifest script should output a direct download URL.

Importer submits jobs but never completes

Speech batch jobs can take time; check your TranscribeHttp function logs / Application Insights
Consider increasing POLL_SECONDS to reduce throttling
Reduce MAX_ACTIVE if you see rate-limit behavior

`EmbedAndIndex` fails with invalid document key

Azure AI Search document keys may contain only letters, digits, underscore (_), dash (-), and equal sign (=). If you use segment keys like vid:0001, replace : with _ or -.
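
A small helper can apply that replacement consistently wherever keys are built. This is a sketch with a hypothetical function name; note that distinct raw keys can collapse to the same sanitized key (for example a:b and a/b), so it only suits key schemes where that cannot happen.

```python
import re

# Anything outside letters, digits, underscore, dash, and equal sign.
_INVALID_KEY_CHARS = re.compile(r"[^A-Za-z0-9_\-=]")


def safe_key(raw: str, replacement: str = "_") -> str:
    """Replace characters that Azure AI Search rejects in document keys."""
    return _INVALID_KEY_CHARS.sub(replacement, raw)
```

For example, `safe_key("vid:0001")` gives `vid_0001`, while a key that is already valid, such as `vid_123-0001`, passes through unchanged.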

Security notes

Do not commit .env, pipeline_state.json, or any token/key material.
Prefer query keys (read-only) for Search in front-end scenarios.
For long-term automation, use a Box app auth method approved by your org (not developer token).
