Skip to content

Azure transcription endpoint for video annotator tool

Notifications You must be signed in to change notification settings

tapilab/video-annotator

Repository files navigation

Video Annotator – Batch Ingest + Index Pipeline (Box → Speech → Segments → Embeddings → Search)

This repo contains scripts to:

  1. Enumerate .m4a files from a Box shared folder and generate a manifest (videos.jsonl)

  2. Run each file through the Azure Functions pipeline:

    • Submit batch transcription (TranscribeHttp)
    • Write 30s segments JSON to Blob (segments/<video_id>.json)
    • Embed + index segments into Azure AI Search (EmbedAndIndex)
  3. Query indexed segments (SearchSegments)

Prerequisites

  • Python 3.11+ recommended

  • Azure Functions already deployed (or runnable locally)

  • Box shared folder link that contains .m4a files

  • Working Box API token:

    • EITHER a Developer Token (quick + expires)
    • OR OAuth tokens (BOX_ACCESS_TOKEN + BOX_REFRESH_TOKEN + client id/secret)

Repo Layout (expected)

transcribe/
  scripts/
    box_auth.py
    box_shared_folder_manifest.py
  import_videos.py
  videos.jsonl            # generated
  requirements.txt
  .env                    # you create this (NOT committed)

1) Create a virtual environment + install deps

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

If you don’t have a requirements.txt for scripts yet, minimally you’ll need:

requests
python-dotenv

(Box listing can be done via raw REST calls, so you may not need boxsdk.)

2) Create your .env

Create a .env file in the project root (same directory you run scripts from):

Box settings

Set the shared folder URL:

BOX_SHARED_FOLDER_URL=https://tulane.box.com/s/<shared-folder-token>

Choose one auth method:

Option A (fastest): Developer Token

BOX_TOKEN=<your_box_developer_token>

Option B (durable): OAuth refresh tokens

BOX_CLIENT_ID=<your_box_client_id>
BOX_CLIENT_SECRET=<your_box_client_secret>
BOX_ACCESS_TOKEN=<your_box_access_token>
BOX_REFRESH_TOKEN=<your_box_refresh_token>

Note: refresh tokens can become invalid if rotated/revoked. If you see invalid_grant, re-run your OAuth login flow and update .env.

Azure Function endpoints

These should be the full function URLs, including ?code=...:

TRANSCRIBE_URL=https://<yourapp>.azurewebsites.net/api/TranscribeHttp?code=...
EMBED_INDEX_URL=https://<yourapp>.azurewebsites.net/api/EmbedAndIndex?code=...
SEARCH_FN_URL=https://<yourapp>.azurewebsites.net/api/SearchSegments?code=...

Optional runner settings

SEGMENTS_CONTAINER=segments
POLL_SECONDS=15
MAX_ACTIVE=10

3) Generate the manifest from Box (videos.jsonl)

This script reads your Box shared folder and outputs videos.jsonl with one line per .m4a:

{"video_id":"vid_123","media_url":"https://..."}
{"video_id":"vid_456","media_url":"https://..."}

Run:

python scripts/box_shared_folder_manifest.py

Sanity check one URL

Pick one entry from videos.jsonl and confirm it downloads:

python - <<'PY'
import json
with open("videos.jsonl","r") as f:
    print(json.loads(next(f)))
PY

curl -I -L "<media_url>"

You want 200 OK (not HTML/404). If this fails, Speech won’t be able to fetch it either.

4) Run the pipeline import (import_videos.py)

This script:

  • reads videos.jsonl
  • submits transcription jobs via TranscribeHttp
  • polls until each completes
  • indexes segments via EmbedAndIndex

Run:

python import_videos.py

Progress + resume

The importer writes a pipeline_state.json file as it runs. If the script stops, you can rerun it and it will resume from the saved state.

5) Verify search

Once a few videos are indexed, query your SearchSegments function:

curl -X POST "$SEARCH_FN_URL" \
  -H "Content-Type: application/json" \
  -d '{"q":"measles","mode":"hybrid","top":5,"k":40}'

If you get results, your segments are searchable.

Troubleshooting

Box links return 404

  • Ensure the manifest script is producing working media_urls
  • Validate with curl -I -L "<media_url>" (must end in 200)
  • If a shared link works in browser but not via curl, it may rely on cookies/redirects. The manifest script should output a direct download URL.

Importer submits jobs but never completes

  • Speech batch jobs can take time; check your TranscribeHttp function logs / Application Insights
  • Consider increasing POLL_SECONDS to reduce throttling
  • Reduce MAX_ACTIVE if you see rate-limit behavior

EmbedAndIndex fails with invalid document key

  • Azure AI Search keys can’t contain : etc. If you use segment keys like vid:0001, replace : with _ or -.

Security notes

  • Do not commit .env, pipeline_state.json, or any token/key material.
  • Prefer query keys (read-only) for Search in front-end scenarios.
  • For long-term automation, use a Box app auth method approved by your org (not developer token).

About

Azure transcription endpoint for video annotator tool

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors