
Evaluation: STT #571

Open
AkhileshNegi wants to merge 46 commits into main from feature/stt-evaluation

Conversation

@AkhileshNegi
Collaborator

AkhileshNegi commented Feb 2, 2026

Summary

Target issue is #533

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

  • New Features

    • End-to-end Speech-to-Text evaluation: upload audio, create datasets/samples, start runs, view transcriptions, and record human feedback.
    • Provider batch transcription integration (Gemini) with batch submission, polling, and result processing.
    • Signed URL expiry cap and MIME detection for uploads; supported audio formats and size limits.
    • Database migration adding STT sample/result tables and extending dataset/run metadata.
  • Functional Requirements Testing (see the request sketch after this list)

  • Upload audio file to S3 via API
  • Create STT dataset from request
  • List datasets with pagination
  • Get dataset with samples
  • Start STT evaluation run
  • Use the Gemini Batch API to create a new job to get transcriptions
  • Transcriptions update once the batch completes
  • Handle transcription errors gracefully
  • Poll evaluation status
  • Get evaluation with results
  • Update human feedback
  • List evaluations with pagination
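
A minimal sketch of exercising this flow over HTTP with the requests dependency added in this PR. The /runs paths and run payload fields match the API tests in this branch; the auth header name, the upload form field, and the dataset-creation endpoint/payload are assumptions and may differ from the final API.

import time

import requests

BASE = "http://localhost:8000/api/v1/evaluations/stt"  # assumed local deployment
HEADERS = {"X-API-KEY": "<project-api-key>"}  # header name is an assumption

# 1. Upload an audio file ("/files/audio" per the route module; form field name assumed)
with open("sample.mp3", "rb") as f:
    upload = requests.post(
        f"{BASE}/files/audio",
        headers=HEADERS,
        files={"file": ("sample.mp3", f, "audio/mpeg")},
    ).json()
file_id = upload["data"]["file_id"]

# 2. Create a dataset referencing the uploaded file (endpoint shape assumed)
dataset = requests.post(
    f"{BASE}/datasets",
    headers=HEADERS,
    json={
        "name": "demo_dataset",
        "samples": [{"file_id": file_id, "ground_truth": "hello world"}],
    },
).json()
dataset_id = dataset["data"]["id"]

# 3. Start an evaluation run (payload fields match the tests in this PR)
run = requests.post(
    f"{BASE}/runs",
    headers=HEADERS,
    json={"run_name": "demo_run", "dataset_id": dataset_id, "providers": ["gemini-2.5-pro"]},
).json()
run_id = run["data"]["id"]

# 4. Poll until the Gemini batch finishes and results are processed
while True:
    status = requests.get(f"{BASE}/runs/{run_id}", headers=HEADERS).json()["data"]["status"]
    if status in ("completed", "failed"):
        break
    time.sleep(30)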

Summary by CodeRabbit

  • New Features

    • Speech-to-Text (STT) evaluation: create/list datasets, upload audio, start/list/get runs, view per-run results, and submit human feedback
    • Audio upload support (mp3, wav, flac, m4a, ogg, webm; max 200 MB)
    • Gemini-backed batch transcription for scalable STT runs
  • Documentation

    • User docs for dataset creation, audio upload, runs, results, and feedback
  • Tests

    • Extensive tests covering STT APIs, services, storage, and Gemini integration

@coderabbitai
coderabbitai bot commented Feb 2, 2026

📝 Walkthrough

Adds full Speech-to-Text (STT) evaluation support: API endpoints, models, CRUD, Gemini batch provider, storage/file handling, async polling, services for audio/dataset handling, DB migrations, and extensive tests.

Changes

• API docs & router wiring: backend/app/api/docs/stt_evaluation/*, backend/app/api/main.py, backend/app/api/routes/stt_evaluations/router.py
  Adds markdown docs for STT endpoints and mounts a new "/evaluations/stt" APIRouter in main.
• API routes: backend/app/api/routes/stt_evaluations/files.py, .../dataset.py, .../evaluation.py, .../result.py, .../__init__.py
  New FastAPI routers for audio upload, dataset CRUD, starting/listing/getting runs, and result feedback, with permission dependencies and documented responses.
• Models & DB migration: backend/app/models/stt_evaluation.py, backend/app/models/evaluation.py, backend/app/alembic/versions/044_add_stt_evaluation_tables.py
  Introduces STT models (STTSample, STTResult, DTOs), extends EvaluationDataset/Run with type/language/providers, and adds migration creating file, stt_sample, stt_result tables and new columns.
• CRUD & orchestration: backend/app/crud/stt_evaluations/..., backend/app/crud/file.py, backend/app/crud/__init__.py
  Implements dataset/sample/run/result CRUD, file CRUD, batch submission orchestration, polling/processing cron, and re-exports file helpers.
• Batch provider & batch core: backend/app/core/batch/gemini.py, backend/app/core/batch/__init__.py, backend/app/core/providers.py
  Adds GeminiBatchProvider, BatchJobState enum, create_stt_batch_requests helper, and registers GEMINI provider config.
• Storage utils & cloud: backend/app/core/storage_utils.py, backend/app/core/cloud/storage.py
  Refactors uploads to generic upload_to_object_store, adds get_mime_from_url, and clamps signed URL expiry to 24 hours.
• Services: backend/app/services/stt_evaluations/*, .../gemini/client.py
  Audio validation/upload service, dataset CSV upload orchestration, STT constants, and GeminiClient wrapper for credentials/connection.
• Tests: backend/app/tests/... (api/routes/test_stt_evaluation.py, core/batch/test_gemini.py, core/test_storage_utils.py, services/stt_evaluations/*)
  Adds comprehensive unit and integration tests covering API routes, Gemini batch provider, storage utils, audio/dataset services, and Gemini client.
• Cron/processing integration: backend/app/crud/evaluations/cron.py, backend/app/crud/evaluations/processing.py
  Integrates STT polling into per-org cron, merges STT/text summaries, and narrows pending-evaluation filtering in one text-specific path.
• Project deps & tests update: backend/pyproject.toml, backend/app/tests/crud/test_credentials.py
  Adds runtime deps google-genai and requests; test adjustment for credential/provider expectation.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as STT API
    participant Service
    participant Storage
    participant DB as Database

    Client->>API: POST /files/audio (multipart)
    API->>Service: validate_audio_file(file)
    Service->>Service: check extension & size
    API->>Storage: upload file to object store
    Storage-->>API: object_store_url
    API->>DB: create_file record
    DB-->>API: file metadata
    API-->>Client: AudioUploadResponse (s3_url, file_id)
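A rough sketch of the upload flow in this diagram. The helper names come from this PR (validate_audio_file, upload_to_object_store, create_file), but their exact signatures, the create_file fields, and the subdirectory are assumptions; the real route lives in backend/app/api/routes/stt_evaluations/files.py.

from fastapi import UploadFile

from app.core.storage_utils import upload_to_object_store
from app.crud.file import create_file  # field names below are assumptions
from app.services.stt_evaluations.audio import validate_audio_file


async def handle_audio_upload(file: UploadFile, storage, session) -> dict:
    # 1. Reject unsupported extensions / oversized files (raises on failure)
    validate_audio_file(file)

    # 2. Push the bytes to the object store and keep the resulting URL
    content = await file.read()
    object_store_url = upload_to_object_store(
        storage=storage,
        content=content,
        filename=file.filename,
        subdirectory="stt_audio",  # assumed subdirectory
        content_type=file.content_type or "application/octet-stream",
    )

    # 3. Persist a file record and return the fields the diagram shows in the response
    db_file = create_file(
        session=session,
        object_store_url=object_store_url,
        filename=file.filename,
    )
    return {"s3_url": object_store_url, "file_id": db_file.id}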
sequenceDiagram
    participant Client
    participant API as STT API
    participant CRUD
    participant BatchSvc as Batch Service
    participant Gemini

    Client->>API: POST /runs (start evaluation)
    API->>CRUD: validate dataset & list samples
    CRUD-->>API: samples
    API->>CRUD: create run & create result records
    API->>BatchSvc: start_stt_evaluation_batch(run, samples)
    BatchSvc->>Storage: generate signed URLs for sample files
    BatchSvc->>BatchSvc: build JSONL requests
    BatchSvc->>Gemini: submit batch job
    Gemini-->>BatchSvc: provider batch id/status
    BatchSvc->>CRUD: update run (processing + batch id)
    API-->>Client: STTEvaluationRunPublic (processing)
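On the provider side, a rough sketch of submitting a Gemini batch job with the google-genai SDK that this PR adds. The files.upload call mirrors the one discussed later in this review (including the JSONL mime-type question); the batches.create call and the JSONL request shape follow the SDK's batch mode and are assumptions about how GeminiBatchProvider wires this up, not the PR's actual code.

import json

from google import genai
from google.genai import types

client = genai.Client(api_key="<gemini-api-key>")  # the PR resolves credentials via GeminiClient

# One JSONL line per sample; the exact payload (signed audio URL + prompt) is an assumption.
lines = [
    json.dumps(
        {
            "key": "sample-1",
            "request": {"contents": [{"parts": [{"text": "Transcribe the attached audio."}]}]},
        }
    )
]
with open("stt_batch_requests.jsonl", "w") as f:
    f.write("\n".join(lines) + "\n")

uploaded = client.files.upload(
    file="stt_batch_requests.jsonl",
    config=types.UploadFileConfig(display_name="stt-batch", mime_type="text/plain"),
)
job = client.batches.create(model="gemini-2.5-pro", src=uploaded.name)
print(job.name, job.state)  # starts in JOB_STATE_PENDING, per the BatchJobState enum below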
sequenceDiagram
    participant Cron
    participant CRUD as Run CRUD
    participant Gemini
    participant Results as Result CRUD
    participant DB as Database

    Cron->>CRUD: get_pending_stt_runs(org)
    CRUD-->>Cron: pending runs with batch_job_id
    loop each run
        Cron->>Gemini: get_batch_status(batch_id)
        Gemini-->>Cron: state
        alt terminal (succeeded/failed)
            Cron->>Gemini: download_batch_results
            Gemini-->>Cron: results JSONL
            Cron->>Results: update_stt_result entries
            Results->>DB: persist updates
            Cron->>CRUD: update_stt_run(status=completed/failed)
        else
            Cron-->>Cron: keep processing
        end
    end
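A simplified sketch of one cron pass from this diagram, using the function names shown (get_pending_stt_runs, get_batch_status, download_batch_results, update_stt_result, update_stt_run); their real signatures and the shape of the downloaded result items are assumptions.

# get_pending_stt_runs / update_stt_run / update_stt_result come from
# app/crud/stt_evaluations in this PR (exact modules and signatures assumed)
def poll_all_pending_stt_evaluations(session, provider, org_id) -> None:
    for run in get_pending_stt_runs(session=session, org_id=org_id):
        state = provider.get_batch_status(run.batch_job_id)
        if state not in ("JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED"):
            continue  # still pending/running; check again on the next tick

        if state == "JOB_STATE_SUCCEEDED":
            for item in provider.download_batch_results(run.batch_job_id):
                # how each JSONL line maps back to its stt_result row is an assumption
                update_stt_result(
                    session=session,
                    result_id=item["result_id"],
                    status="completed",
                    transcription=item["text"],
                )
            update_stt_run(session=session, run_id=run.id, status="completed")
        else:
            update_stt_run(session=session, run_id=run.id, status="failed")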

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • kartpop
  • Prajna1999

Poem

🐰
I hopped in with a tape and a song,
Uploaded sounds to where they belong.
Batches hum, Gemini sings through the night,
Datasets, runs, and results take flight.
Feedback nibbles make transcripts bright. 🎧✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
  • Title check (❓ Inconclusive): The title 'Evaluation: STT' is vague and generic, using a broad category label without clearly conveying the primary change or scope of this substantial PR. Resolution: Consider a more descriptive title such as 'Add Speech-to-Text (STT) evaluation feature with Gemini batch processing' to clearly communicate the main change.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 90.87%, which is sufficient. The required threshold is 80.00%.


AkhileshNegi self-assigned this Feb 2, 2026
AkhileshNegi added the enhancement (New feature or request) label Feb 2, 2026
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py`:
- Around line 110-113: The log is referencing a non-existent key 'batch_jobs' in
batch_result so it always prints an empty set; update the logger.info in
start_stt_evaluation to reference the real key (e.g., use
batch_result.get('jobs', {}).keys() if the batch job entries are under 'jobs')
or, if you want to show all returned fields, use list(batch_result.keys())
instead of batch_result.get('batch_jobs', {}).keys() so the log prints the
actual batch info alongside run.id.
🧹 Nitpick comments (1)
backend/app/crud/evaluations/cron.py (1)

127-130: Consider clarifying the failure count semantics in org-level errors.

When an org-level exception is caught (e.g., if poll_all_pending_stt_evaluations throws), total_failed is incremented by 1. However, this single increment may not accurately represent the actual number of text + STT runs that were in progress for that organization. This was arguably the same issue before STT integration, but now it's more pronounced since two subsystems are being polled.

This is a minor semantic inconsistency in the reporting—not a functional bug—so it can be addressed later if more precise failure counting is needed.

Comment on lines +110 to +113
logger.info(
f"[start_stt_evaluation] STT evaluation batch submitted | "
f"run_id: {run.id}, batch_jobs: {batch_result.get('batch_jobs', {}).keys()}"
)

⚠️ Potential issue | 🟡 Minor

Fix batch log field name.
batch_result doesn’t contain batch_jobs, so the log always prints an empty set.

🛠️ Suggested fix
-        logger.info(
-            f"[start_stt_evaluation] STT evaluation batch submitted | "
-            f"run_id: {run.id}, batch_jobs: {batch_result.get('batch_jobs', {}).keys()}"
-        )
+        logger.info(
+            f"[start_stt_evaluation] STT evaluation batch submitted | "
+            f"run_id: {run.id}, batch_job_id: {batch_result.get('batch_job_id')}, "
+            f"provider_batch_id: {batch_result.get('provider_batch_id')}"
+        )
🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 110 - 113,
The log is referencing a non-existent key 'batch_jobs' in batch_result so it
always prints an empty set; update the logger.info in start_stt_evaluation to
reference the real key (e.g., use batch_result.get('jobs', {}).keys() if the
batch job entries are under 'jobs') or, if you want to show all returned fields,
use list(batch_result.keys()) instead of batch_result.get('batch_jobs',
{}).keys() so the log prints the actual batch info alongside run.id.

coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/core/batch/gemini.py`:
- Around line 285-325: The upload_file method is setting mime_type="jsonl" in
the types.UploadFileConfig passed to self._client.files.upload; change that
mime_type to "text/plain" to ensure Gemini Files API accepts the JSONL file.
Locate the upload_file function and the call to self._client.files.upload (and
the types.UploadFileConfig instantiation) and replace mime_type="jsonl" with
mime_type="text/plain", keeping the rest (display_name, file path/tmp_path
handling, and cleanup) unchanged.
🧹 Nitpick comments (2)
backend/app/models/stt_evaluation.py (1)

15-16: Consider making SUPPORTED_STT_PROVIDERS a constant tuple or frozenset.

Using a list for a constant that shouldn't be modified could lead to accidental mutation. A tuple or frozenset would be safer.

Suggested change
 # Supported STT providers for evaluation
-SUPPORTED_STT_PROVIDERS = ["gemini-2.5-pro"]
+SUPPORTED_STT_PROVIDERS = frozenset({"gemini-2.5-pro"})

Note: If you change to frozenset, update the error message in the validator (line 328) to use sorted(SUPPORTED_STT_PROVIDERS) for consistent ordering.
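
For the validator message, a tiny sketch of the adjusted check (validator wiring omitted; this is not the PR's actual validator code):

for provider in providers:
    if provider not in SUPPORTED_STT_PROVIDERS:
        raise ValueError(
            f"Unsupported STT provider '{provider}'; "
            f"expected one of {sorted(SUPPORTED_STT_PROVIDERS)}"
        )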

backend/app/services/stt_evaluations/gemini/client.py (1)

95-113: Consider caching or limiting the model list call in validate_connection.

The list(self._client.models.list()) call fetches all available models, which could be slow or resource-intensive. For connection validation, checking just one model or using a lighter API call might be more efficient.

Alternative approach
     def validate_connection(self) -> bool:
         """Validate that the client can connect to Gemini.

         Returns:
             bool: True if connection is valid
         """
         try:
-            # List models to verify connection
-            models = list(self._client.models.list())
+            # Fetch first model to verify connection (lighter than full list)
+            models = self._client.models.list()
+            first_model = next(iter(models), None)
             logger.info(
                 f"[validate_connection] Connection validated | "
-                f"available_models_count: {len(models)}"
+                f"connection_verified: {first_model is not None}"
             )
             return True

ProjectTech4DevAI deleted a comment from coderabbitai bot Feb 4, 2026
coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@backend/app/tests/services/stt_evaluations/test_audio.py`:
- Around line 22-378: Add missing type annotations to all test methods (e.g.,
change def test_valid_mp3_file(self) to def test_valid_mp3_file(self: Self) ->
None) across the file so every test method has parameter and return type hints;
then remove the duplicated helpers TestValidateAudioFile._create_upload_file and
TestUploadAudioFile._create_upload_file and replace them with a single shared
factory fixture in the module (e.g., a pytest fixture named upload_file_factory)
that returns a callable to produce MagicMock UploadFile instances used by
validate_audio_file and upload_audio_file tests; update tests to call the
fixture instead of the class-level helper.
🧹 Nitpick comments (3)
backend/app/tests/services/stt_evaluations/test_audio.py (1)

139-150: Consolidate UploadFile creation via a factory fixture.
_create_upload_file is duplicated (Line 139 and Line 286). Please replace these helpers with a shared pytest factory fixture to align with the test fixture guideline and reduce duplication.

♻️ Suggested fixture-based factory (apply in this file)
+from collections.abc import Callable
@@
+@pytest.fixture()
+def upload_file_factory() -> Callable[..., UploadFile]:
+    def _factory(
+        filename: str | None = "test.mp3",
+        content_type: str | None = "audio/mpeg",
+        size: int | None = 1024,
+    ) -> UploadFile:
+        mock_file = MagicMock(spec=UploadFile)
+        mock_file.filename = filename
+        mock_file.content_type = content_type
+        mock_file.size = size
+        return mock_file
+    return _factory
@@
-    def test_valid_mp3_file(self):
+    def test_valid_mp3_file(self, upload_file_factory: Callable[..., UploadFile]) -> None:
         """Test validation of valid MP3 file."""
-        file = self._create_upload_file(filename="test.mp3")
+        file = upload_file_factory(filename="test.mp3")

As per coding guidelines, Use factory pattern for test fixtures in backend/app/tests/.

backend/app/tests/core/test_storage_utils.py (2)

88-107: Use factory-style fixtures for storage mocks.

Returning a factory keeps instance creation explicit per test and aligns with the test-fixture convention in this repo.

♻️ Proposed refactor (apply similarly to other mock_storage fixtures)
 @pytest.fixture
 def mock_storage(self):
-    storage = MagicMock()
-    storage.put.return_value = "s3://bucket/test/file.txt"
-    return storage
+    def _factory():
+        storage = MagicMock()
+        storage.put.return_value = "s3://bucket/test/file.txt"
+        return storage
+    return _factory
@@
-    def test_successful_upload(self, mock_storage):
+    def test_successful_upload(self, mock_storage):
         content = b"test content"
-        result = upload_to_object_store(
-            storage=mock_storage,
+        storage = mock_storage()
+        result = upload_to_object_store(
+            storage=storage,
             content=content,
             filename="test.txt",
             subdirectory="uploads",
             content_type="text/plain",
         )

As per coding guidelines, "Use factory pattern for test fixtures in backend/app/tests/".


22-25: Add type hints to test and fixture signatures.

Repo guidelines require parameter and return annotations on all functions; please apply this across all test methods and fixtures in this file.

🧩 Example pattern (apply broadly)
+from typing import Self
@@
-    def test_mp3_url(self):
+    def test_mp3_url(self: Self) -> None:
         """Test MIME detection for MP3 files."""
         url = "https://bucket.s3.amazonaws.com/audio/test.mp3"
         assert get_mime_from_url(url) == "audio/mpeg"
#!/bin/bash
python - <<'PY'
import ast, pathlib
path = pathlib.Path("backend/app/tests/core/test_storage_utils.py")
tree = ast.parse(path.read_text())
def check(fn):
    missing = [a.arg for a in fn.args.args if a.annotation is None]
    if fn.returns is None:
        missing.append("return")
    if missing:
        print(f"{fn.name} @ line {fn.lineno} missing: {missing}")
for node in ast.walk(tree):
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
        check(node)
PY

As per coding guidelines, "Always add type hints to all function parameters and return values in Python code".

OPENAI = "openai"
AWS = "aws"
LANGFUSE = "langfuse"
GEMINI = "gemini"
Collaborator

This should be GOOGLE

Collaborator

agree

filename: str,
subdirectory: str = "datasets",
subdirectory: str,
content_type: str = "application/octet-stream",
Collaborator

Is octet-stream a generic content_type?

Collaborator Author

it's used here as the default content_type parameter for upload_to_object_store — so if a caller doesn't specify what type the file is (CSV, JSON, audio, etc.), it falls back to this safe generic type.
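
Concretely, the fallback only applies when a caller omits the argument; audio and CSV call sites are expected to pass an explicit type. The parameter names match the storage-utils tests in this PR, while the call sites themselves are illustrative:

# explicit type for an audio upload
upload_to_object_store(
    storage=storage,
    content=audio_bytes,
    filename="clip.mp3",
    subdirectory="stt_audio",
    content_type="audio/mpeg",
)

# no content_type given -> falls back to "application/octet-stream"
upload_to_object_store(
    storage=storage,
    content=raw_bytes,
    filename="blob.bin",
    subdirectory="uploads",
)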



def upload_csv_to_object_store(
storage: CloudStorage,
Collaborator

This is an extra wrapper on top of upload_to_object_store

TTS = "tts"


class STTResultStatus(str, Enum):
Collaborator

The status can be similar to the job status. It could probably be reused.

FAILED = "failed"


class STTSample(SQLModel, table=True):
Collaborator

Is there a way to merge these two evaluation models into a single model? IMO all types of evals should be handled by one eval table, with a JSONB column for fields that make sense to store as is.

Collaborator Author

We had the same thought earlier. But since each eval works differently and users interact with them in different ways, we decided to go with this approach.
For text evals, it’s pretty straightforward—it’s a one-shot process and the user just sees the final result. STT evals are more involved. They include audio files that need to be converted into signed URLs so users can listen to them. Users also label each sample as correct or not and can add comments.
Because of these extra steps and interactions, it made more sense to create a separate table for STT evals.


from . import dataset, evaluation, files, result

router = APIRouter(prefix="/evaluations/stt", tags=["STT Evaluation"])
Collaborator

nitpick: this router can be nested inside routes/evaluations/stt_evaluations and evaluation.py at the same level

"""Gemini batch job states."""

PENDING = "JOB_STATE_PENDING"
RUNNING = "JOB_STATE_RUNNING"
Collaborator

We could ignore a few states like RUNNING to make it congruent with the other batch job statuses used elsewhere.

Collaborator Author

Keeping it as is since we have two different providers, OpenAI and Gemini. We'll revisit once we start seeing a pattern, or keep a single enum, since the statuses are usually pending, failed, and running.

coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/models/evaluation.py (1)

374-394: ⚠️ Potential issue | 🟠 Major

EvaluationRunPublic and EvaluationDatasetPublic are missing the new type, language_id, and providers fields.

The table models were extended with type, language_id (on both), and providers (on EvaluationRun), but their corresponding public response models don't expose these fields. API consumers won't see the new metadata.

Proposed fix

Add to EvaluationRunPublic:

 class EvaluationRunPublic(SQLModel):
     """Public model for evaluation runs."""

     id: int
     run_name: str
     dataset_name: str
+    type: str
+    language_id: int | None
+    providers: list[str] | None
     config_id: UUID | None

Add to EvaluationDatasetPublic:

 class EvaluationDatasetPublic(SQLModel):
     """Public model for evaluation datasets."""

     id: int
     name: str
     description: str | None
+    type: str
+    language_id: int | None
     dataset_metadata: dict[str, Any]
🤖 Fix all issues with AI agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py`:
- Around line 137-162: The code re-fetches the run with get_stt_run_by_id (which
returns EvaluationRun | None) then immediately dereferences run fields; add a
None check after the call and return a 404/error response if run is None, or
replace the re-query with session.refresh(run) to reload the existing ORM object
instead of calling get_stt_run_by_id; update the logic around run (used in
STTEvaluationRunPublic and APIResponse.success_response) to only proceed when
run is present.
- Line 108: The dict comprehension sample_to_result = {r.stt_sample_id: r.id for
r in results} loses entries when multiple providers produce results for the same
stt_sample_id; change sample_to_result in evaluation.py to map each
stt_sample_id to a list of result ids (e.g., append to a list per key) so no
results are overwritten, and then update batch.py (functions around
start_stt_evaluation_batch and the batch failure handling) to accept and iterate
list-valued mapping entries when collecting result IDs to mark as FAILED
(flatten lists or handle both single id and list cases) so all provider-specific
results transition from PENDING to FAILED on batch failure.

In `@backend/app/crud/file.py`:
- Line 8: The import of FileType from app.models.file is unused in
backend/app/crud/file.py; remove FileType from the import statement (leave File)
or, if intended, use FileType in function signatures or type hints (e.g., in
functions that create or filter files) so it is referenced; update the import
line that currently references File and FileType accordingly and run linters to
ensure no other unused imports remain.

In `@backend/app/crud/stt_evaluations/batch.py`:
- Around line 144-156: The except block in start_stt_evaluation_batch currently
raises a new Exception which loses the original traceback; change the final
raise to preserve the original exception (either re-raise the caught exception
or use raise Exception(...) from e) so the cause chain is kept, and keep the
existing logging and update_stt_result calls (referencing update_stt_result,
STTResultStatus, sample_to_result, and logger) intact.

In `@backend/app/services/stt_evaluations/dataset.py`:
- Around line 69-72: The metadata calculation currently counts ground truth by
truthiness which skips empty strings; update the has_ground_truth_count
computation to count samples where s.ground_truth is not None (e.g., replace the
truthiness check with an explicit "s.ground_truth is not None") so empty string
ground truths are counted; change the generator used to compute
metadata["has_ground_truth_count"] in the dataset.py code that builds metadata
for samples.
- Around line 54-103: upload_stt_dataset currently calls create_stt_dataset and
then create_stt_samples in separate commits, which can leave an orphan dataset
if sample creation (e.g., validate_file_ids) fails; wrap both dataset and sample
creation inside a single DB transaction (use session.begin_nested() or
session.begin()) so that a failure during create_stt_samples triggers a
rollback, and if you prefer cleanup instead, catch exceptions around
create_stt_samples, call session.delete(dataset) and session.commit() (or
rollback then delete) to remove the dataset; update upload_stt_dataset to
perform these changes and reference the existing functions create_stt_dataset,
create_stt_samples and the failing validator validate_file_ids to ensure all DB
writes are atomic.

In `@backend/app/tests/api/routes/test_stt_evaluation.py`:
- Around line 543-607: Add a happy-path test in the TestSTTEvaluationRun suite
that posts to "/api/v1/evaluations/stt/runs" with a valid dataset (create or
fixture with samples) and a run_name/providers payload, mock the
start_stt_evaluation_batch function to avoid external calls, assert a 200
response (or 201 per API) and that response.json() contains success True and
expected run metadata, and verify the mocked start_stt_evaluation_batch was
called once with the created dataset id and the provided providers/run_name;
reference the test helper names used in this file and the
start_stt_evaluation_batch symbol for locating the code to mock and assert.
- Around line 514-541: The fixture test_dataset_with_samples creates samples but
leaves the dataset metadata (created via create_test_stt_dataset) with a stale
sample_count=0, which will make evaluations reject it; after creating the three
samples with create_test_file and create_test_stt_sample, update the
EvaluationDataset instance returned by create_test_stt_dataset by setting its
dataset_metadata["sample_count"] to the actual number of created samples (e.g.,
3) and persist that change using the provided db Session (flush/commit or the
project/dataset update helper) so the dataset reflects the correct sample_count
when used by start_stt_evaluation.
🧹 Nitpick comments (18)
backend/app/tests/services/stt_evaluations/test_audio.py (2)

299-327: test_successful_upload asserts on s3_url but doesn't verify the generated S3 key contains a UUID.

The service generates a unique filename with a UUID before uploading. The test asserts the exact URL returned by mock_storage.put.return_value, which is fine for verifying the pass-through, but doesn't verify storage.put was called with a UUID-based key. Consider asserting on mock_storage.put.call_args to verify the key pattern.


330-346: Importing HTTPException inside the test method body is unconventional.

The from app.core.exception_handlers import HTTPException import on Line 332 (and similarly Line 352) could be moved to the module's top-level imports for clarity and consistency.

backend/app/services/stt_evaluations/dataset.py (1)

144-148: Use logger.warning instead of logger.info when the upload returns None.

When upload_csv_to_object_store returns None, it signals a failure to persist the CSV. Logging this at info level makes it easy to miss in production. The error path on Line 153 already uses logger.warning, so this branch should be consistent.

Proposed fix
         else:
-            logger.info(
+            logger.warning(
                 "[_upload_samples_to_object_store] Upload returned None | "
                 "continuing without object store storage"
             )
backend/app/alembic/versions/044_add_stt_evaluation_tables.py (2)

110-115: size_bytes uses sa.Integer() — this limits file sizes to ~2.1 GB.

PostgreSQL INTEGER is a signed 32-bit type (max ~2,147,483,647 bytes ≈ 2 GB). While there's an application-level MAX_FILE_SIZE_BYTES check, the file table is generic and may store non-audio files in the future. Consider using sa.BigInteger() for forward-compatibility.


290-401: Consider adding a unique constraint on (stt_sample_id, evaluation_run_id, provider) in stt_result.

Without this, nothing prevents duplicate transcription results for the same sample, run, and provider combination. A retry or idempotency bug could insert duplicates silently.
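
If that constraint is added, a minimal Alembic sketch could look like this (constraint name is illustrative; column names assumed to match the migration):

from alembic import op


def upgrade() -> None:
    op.create_unique_constraint(
        "uq_stt_result_sample_run_provider",
        "stt_result",
        ["stt_sample_id", "evaluation_run_id", "provider"],
    )


def downgrade() -> None:
    op.drop_constraint(
        "uq_stt_result_sample_run_provider", "stt_result", type_="unique"
    )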

backend/app/models/file.py (1)

87-99: FilePublic extends BaseModel — other public models in this PR use SQLModel.

EvaluationDatasetPublic and EvaluationRunPublic in evaluation.py extend SQLModel, while FilePublic extends pydantic.BaseModel. Both work, but being consistent within the project makes the pattern predictable. Consider aligning with the existing convention.

Proposed fix
-from pydantic import BaseModel
 from sqlmodel import Field as SQLField
 from sqlmodel import SQLModel

...

-class FilePublic(BaseModel):
+class FilePublic(SQLModel):
     """Public model for file responses."""
backend/app/crud/file.py (1)

13-67: create_file commits immediately — this prevents callers from using it within a larger transaction.

The upload_stt_dataset workflow in dataset.py calls create_stt_dataset (which commits) and then create_stt_samples (which also commits). If create_file is invoked in upstream flows that also need atomic operations, the eager commit would break transactional guarantees. Consider accepting a commit flag or using session.flush() by default, letting the caller decide when to commit.
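
One possible shape for that, sketched against an assumed create_file signature (the **fields shorthand stands in for the real columns):

def create_file(session: Session, *, commit: bool = True, **fields) -> File:
    db_file = File(**fields)
    session.add(db_file)
    if commit:
        session.commit()
        session.refresh(db_file)
    else:
        # caller owns the transaction boundary; changes are flushed but not committed
        session.flush()
    return db_file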

backend/app/models/evaluation.py (1)

104-109: Use EvaluationType enum for the type field in both EvaluationDataset and EvaluationRun.

The type fields at lines 104-109 and 209-214 accept any string up to 20 characters, but they should only hold one of three values: "text", "stt", or "tts". An EvaluationType enum already exists in backend/app/models/stt_evaluation.py with these exact values. Import and apply it to enforce type safety at both the model and database levels.
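
A minimal sketch of what the enum-typed field buys at the model level; the enum values come from the review above, while the actual column configuration (VARCHAR vs. native enum) is left to the migration:

from enum import Enum

from sqlmodel import Field, SQLModel


class EvaluationType(str, Enum):  # already defined in app/models/stt_evaluation.py per this review
    TEXT = "text"
    STT = "stt"
    TTS = "tts"


class EvaluationDatasetSketch(SQLModel):
    # pydantic rejects anything outside {"text", "stt", "tts"} at validation time
    type: EvaluationType = Field(default=EvaluationType.TEXT)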

backend/app/services/stt_evaluations/audio.py (1)

165-170: Silent exception swallowing hides debugging information.

The bare except Exception discards the error detail. If the S3 size lookup consistently fails for a particular storage configuration, this will be invisible in logs.

Suggested fix
         try:
             size_kb = storage.get_file_size_kb(s3_url)
             size_bytes = int(size_kb * 1024)
-        except Exception:
+        except Exception as e:
+            logger.warning(
+                f"[upload_audio_file] Could not get file size from S3 | "
+                f"s3_url: {s3_url}, error: {str(e)}"
+            )
             # If we can't get size from S3, use the upload file size
             size_bytes = file.size or 0
backend/app/crud/stt_evaluations/dataset.py (2)

59-70: Inconsistent timestamp usage — two separate now() calls vs. a shared variable.

create_stt_samples (Line 173) uses a single timestamp = now() variable for both fields, but here now() is called twice. Consider storing it in a local variable for consistency.

Suggested fix
+    timestamp = now()
     dataset = EvaluationDataset(
         name=name,
         description=description,
         type=EvaluationType.STT.value,
         language_id=language_id,
         object_store_url=object_store_url,
         dataset_metadata=dataset_metadata or {},
         organization_id=org_id,
         project_id=project_id,
-        inserted_at=now(),
-        updated_at=now(),
+        inserted_at=timestamp,
+        updated_at=timestamp,
     )

196-198: Redundant flush() before commit().

session.commit() implicitly flushes, so the explicit session.flush() on Line 197 is unnecessary.

Suggested fix
     session.add_all(created_samples)
-    session.flush()
     session.commit()
backend/app/crud/stt_evaluations/result.py (2)

68-70: Redundant flush() before commit().

Same pattern as in dataset.py: commit() implies a flush.

Suggested fix
     session.add_all(results)
-    session.flush()
     session.commit()

345-353: Simplify dict comprehension to dict(rows).

Per static analysis (Ruff C416), the dict comprehension is unnecessary since rows already yields (key, value) tuples.

Suggested fix
-    return {status: count for status, count in rows}
+    return dict(rows)
backend/app/crud/stt_evaluations/batch.py (2)

113-114: Use a more specific exception type instead of bare Exception.

Line 114 raises a generic Exception. Consider using RuntimeError or a custom exception for clearer error handling upstream.

Suggested fix
-        raise Exception("Failed to generate signed URLs for any audio files")
+        raise RuntimeError("Failed to generate signed URLs for any audio files")

1-40: This module contains orchestration logic beyond CRUD — consider relocating to services/.

This module initializes a Gemini client, generates signed URLs, builds JSONL payloads, and submits batch jobs — all of which are business/orchestration logic. As per coding guidelines, backend/app/crud/ should contain database access operations, while backend/app/services/ should implement business logic. Consider moving this to backend/app/services/stt_evaluations/batch.py.

backend/app/api/routes/stt_evaluations/evaluation.py (1)

240-261: Consider extracting a helper to build STTEvaluationRunPublic from an EvaluationRun.

The field-by-field construction of the public model is repeated verbatim in start_stt_evaluation (lines 145-161) and here (lines 241-258). A small helper (or model_validate / from_orm) would eliminate this duplication and reduce the risk of the two copies drifting apart.
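
A small sketch of such a helper, assuming STTEvaluationRunPublic is a SQLModel/pydantic-v2 model whose field names line up with EvaluationRun:

def to_stt_run_public(run: EvaluationRun) -> STTEvaluationRunPublic:
    # pydantic v2 / SQLModel reads attributes straight off the ORM object
    return STTEvaluationRunPublic.model_validate(run, from_attributes=True)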

backend/app/tests/api/routes/test_stt_evaluation.py (2)

382-410: Pagination test doesn't assert metadata["total"], and len(data) == 2 may be fragile.

If other tests in the session leave datasets in the DB (depending on test isolation), len(data) == 2 could fail. test_list_stt_datasets_with_data correctly uses >= 2. Also consider asserting metadata["total"] to verify the server-side count is consistent.


610-642: Consider adding a happy-path test for feedback update.

Both tests here cover error cases. A test that creates a result record and then successfully updates feedback would verify the core feedback flow works end-to-end.

providers=run_create.providers,
)

sample_to_result = {r.stt_sample_id: r.id for r in results}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Check how sample_to_result is consumed inside start_stt_evaluation_batch
rg -n "sample_to_result" backend/app/crud/stt_evaluations/batch.py -B 2 -A 5

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1556


🏁 Script executed:

# Find where create_stt_results is called and what it creates
rg -n "create_stt_results\|sample_to_result" backend/app/api/routes/stt_evaluations/evaluation.py -B 3 -A 3

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Read the evaluation.py file around line 108 to see full context
cat -n backend/app/api/routes/stt_evaluations/evaluation.py | head -150 | tail -60

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2444


🏁 Script executed:

# Find the create_stt_results function definition
rg -n "def create_stt_results" backend/ -A 30

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2302


🏁 Script executed:

# Get the full implementation of create_stt_results
cat -n backend/app/crud/stt_evaluations/result.py | sed -n '22,80p'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1978


🏁 Script executed:

# Check if there are any other uses of sample_to_result beyond batch.py
rg -n "sample_to_result" backend/app --type py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 801


sample_to_result mapping silently drops results when multiple providers are used.

The dict comprehension {r.stt_sample_id: r.id for r in results} maps each stt_sample_id to a single result.id. Since create_stt_results creates one result per sample per provider, later provider entries overwrite earlier ones in the dict. On batch failure (line 149 of batch.py), start_stt_evaluation_batch iterates only over remaining result IDs, so only the last provider's results get marked as FAILED while earlier providers remain in PENDING status.

The proposed fix direction is correct but requires changes in both files. Beyond the dict structure change in evaluation.py, batch.py must be updated to handle the list values:

Complete fix — handle list-valued mapping in both files

evaluation.py (line 108):

-    sample_to_result = {r.stt_sample_id: r.id for r in results}
+    from collections import defaultdict
+    sample_to_result: dict[int, list[int]] = defaultdict(list)
+    for r in results:
+        sample_to_result[r.stt_sample_id].append(r.id)

batch.py (line 36, 105-111, 149):

-    sample_to_result: dict[int, int],
+    sample_to_result: dict[int, list[int]],
             if sample.id in sample_to_result:
+                for result_id in sample_to_result[sample.id]:
-                update_stt_result(
+                    update_stt_result(
                     session=session,
-                    result_id=sample_to_result[sample.id],
+                        result_id=result_id,
                     status=STTResultStatus.FAILED.value,
                     error_message=f"Failed to generate signed URL: {str(e)}",
-                )
+                    )
-        for result_id in sample_to_result.values():
+        for result_ids in sample_to_result.values():
+            for result_id in result_ids:
             update_stt_result(
🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` at line 108, The dict
comprehension sample_to_result = {r.stt_sample_id: r.id for r in results} loses
entries when multiple providers produce results for the same stt_sample_id;
change sample_to_result in evaluation.py to map each stt_sample_id to a list of
result ids (e.g., append to a list per key) so no results are overwritten, and
then update batch.py (functions around start_stt_evaluation_batch and the batch
failure handling) to accept and iterate list-valued mapping entries when
collecting result IDs to mark as FAILED (flatten lists or handle both single id
and list cases) so all provider-specific results transition from PENDING to
FAILED on batch failure.

Comment on lines +137 to +162
run = get_stt_run_by_id(
session=_session,
run_id=run.id,
org_id=auth_context.organization_.id,
project_id=auth_context.project_.id,
)

return APIResponse.success_response(
data=STTEvaluationRunPublic(
id=run.id,
run_name=run.run_name,
dataset_name=run.dataset_name,
type=run.type,
language_id=run.language_id,
providers=run.providers,
dataset_id=run.dataset_id,
status=run.status,
total_items=run.total_items,
score=run.score,
error_message=run.error_message,
organization_id=run.organization_id,
project_id=run.project_id,
inserted_at=run.inserted_at,
updated_at=run.updated_at,
)
)

⚠️ Potential issue | 🟡 Minor

Missing None check after re-fetching the run.

get_stt_run_by_id returns EvaluationRun | None. If the run were missing (e.g., concurrent deletion), accessing run.id on line 146 would raise an AttributeError. Add a guard or use session.refresh(run) instead, which is simpler and avoids re-querying.

Proposed fix — use session.refresh instead
-    # Refresh run to get updated status
-    run = get_stt_run_by_id(
-        session=_session,
-        run_id=run.id,
-        org_id=auth_context.organization_.id,
-        project_id=auth_context.project_.id,
-    )
+    # Refresh run to get updated status
+    _session.refresh(run)
🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 137 - 162,
The code re-fetches the run with get_stt_run_by_id (which returns EvaluationRun
| None) then immediately dereferences run fields; add a None check after the
call and return a 404/error response if run is None, or replace the re-query
with session.refresh(run) to reload the existing ORM object instead of calling
get_stt_run_by_id; update the logic around run (used in STTEvaluationRunPublic
and APIResponse.success_response) to only proceed when run is present.

from sqlmodel import Session, select

from app.core.util import now
from app.models.file import File, FileType

⚠️ Potential issue | 🟡 Minor

FileType is imported but unused.

FileType from app.models.file is imported on Line 8 but never referenced in any function body or signature in this file.

Proposed fix
-from app.models.file import File, FileType
+from app.models.file import File
🤖 Prompt for AI Agents
In `@backend/app/crud/file.py` at line 8, The import of FileType from
app.models.file is unused in backend/app/crud/file.py; remove FileType from the
import statement (leave File) or, if intended, use FileType in function
signatures or type hints (e.g., in functions that create or filter files) so it
is referenced; update the import line that currently references File and
FileType accordingly and run linters to ensure no other unused imports remain.

Comment on lines +144 to +156
except Exception as e:
logger.error(
f"[start_stt_evaluation_batch] Failed to submit batch | "
f"model: {model}, error: {str(e)}"
)
for result_id in sample_to_result.values():
update_stt_result(
session=session,
result_id=result_id,
status=STTResultStatus.FAILED.value,
error_message=f"Batch submission failed: {str(e)}",
)
raise Exception(f"Batch submission failed: {str(e)}")

⚠️ Potential issue | 🟡 Minor

Re-raised exception loses the original traceback.

Wrapping in a new Exception discards the cause chain. Use raise ... from e to preserve it, or use a domain-specific exception type.

Suggested fix
-        raise Exception(f"Batch submission failed: {str(e)}")
+        raise RuntimeError(f"Batch submission failed: {str(e)}") from e
🤖 Prompt for AI Agents
In `@backend/app/crud/stt_evaluations/batch.py` around lines 144 - 156, The except
block in start_stt_evaluation_batch currently raises a new Exception which loses
the original traceback; change the final raise to preserve the original
exception (either re-raise the caught exception or use raise Exception(...) from
e) so the cause chain is kept, and keep the existing logging and
update_stt_result calls (referencing update_stt_result, STTResultStatus,
sample_to_result, and logger) intact.

Comment on lines +54 to +103
logger.info(
f"[upload_stt_dataset] Uploading STT dataset | name={name} | "
f"sample_count={len(samples)} | org_id={organization_id} | "
f"project_id={project_id}"
)

# Step 1: Convert samples to CSV and upload to object store
object_store_url = _upload_samples_to_object_store(
session=session,
project_id=project_id,
dataset_name=name,
samples=samples,
)

# Step 2: Calculate metadata
metadata = {
"sample_count": len(samples),
"has_ground_truth_count": sum(1 for s in samples if s.ground_truth),
}

# Step 3: Create dataset record
dataset = create_stt_dataset(
session=session,
name=name,
org_id=organization_id,
project_id=project_id,
description=description,
language_id=language_id,
object_store_url=object_store_url,
dataset_metadata=metadata,
)

logger.info(
f"[upload_stt_dataset] Created dataset record | "
f"id={dataset.id} | name={name}"
)

# Step 4: Create sample records
created_samples = create_stt_samples(
session=session,
dataset=dataset,
samples=samples,
)

logger.info(
f"[upload_stt_dataset] Created sample records | "
f"dataset_id={dataset.id} | sample_count={len(created_samples)}"
)

return dataset, created_samples

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, verify the existence and structure of the CRUD file
fd 'dataset.py' backend/app/crud/stt_evaluations/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 118


🏁 Script executed:

# Check for session.commit() calls in the CRUD code
rg -n 'session\.commit\(\)' backend/app/crud/stt_evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 127


🏁 Script executed:

# Also check the service file to see the full context and any transaction handling
cat -n backend/app/services/stt_evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 6087


🏁 Script executed:

# Let's examine the create_stt_dataset and create_stt_samples functions in the CRUD file
cat -n backend/app/crud/stt_evaluations/dataset.py | head -100

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3443


🏁 Script executed:

# Get the section around line 74 and 198 to see what functions they're in
sed -n '60,90p' backend/app/crud/stt_evaluations/dataset.py
sed -n '185,210p' backend/app/crud/stt_evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1569


🏁 Script executed:

# Check if there's any transaction/session management wrapping in the service or a middleware
rg -n 'try|except|rollback|begin|transaction' backend/app/services/stt_evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 118


🏁 Script executed:

# Check the imports and session type to understand if it's auto-commit or explicit
rg -n 'from sqlmodel|from sqlalchemy' backend/app/services/stt_evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 105


🏁 Script executed:

# Get the full create_stt_samples function
sed -n '155,210p' backend/app/crud/stt_evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1681


🏁 Script executed:

# Check if there's any API-level session management or rollback handling
rg -n 'upload_stt_dataset' backend/app --type py -A 5 -B 5 | head -50

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 4299


🏁 Script executed:

# Check if the API endpoint or caller wraps this in error handling
rg -n 'upload_stt_dataset' backend/app/api --type py -B 5 -A 10

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2672


🏁 Script executed:

# Also check if there's any router/endpoint level transaction management
fd 'router.py\|route.py' backend/app/api --type py | head -5

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 244


🏁 Script executed:

# Get the full endpoint code
cat -n backend/app/api/routes/stt_evaluations/dataset.py | head -80

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3174


🏁 Script executed:

# Check if there's any Depends or session management in the route
rg -n 'get_session\|SessionDep\|Depends' backend/app/api/routes/stt_evaluations/dataset.py | head -10

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check the core dependencies to understand session management
rg -n 'SessionDep\|get_session' backend/app/api/deps.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check the validate_file_ids function to see if it can throw exceptions
rg -n 'def validate_file_ids' backend/app/crud/stt_evaluations/dataset.py -A 20

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 690


🏁 Script executed:

# Check if the deps.py file exists to understand SessionDep
fd 'deps.py' backend/app/api/ | xargs cat -n | head -50

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1903


🏁 Script executed:

# Final verification: check if there's any error handling at API level that might clean up orphaned datasets
sed -n '37,80p' backend/app/api/routes/stt_evaluations/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1591


Dataset and samples are created in separate, non-transactional commits — partial failure leaves an orphaned dataset.

create_stt_dataset (Step 3) commits the dataset to the database, then create_stt_samples (Step 4) commits samples separately. If sample creation fails (e.g., validate_file_ids raises HTTPException for invalid file IDs), the dataset persists in the database without any samples. The API endpoint has no error handling to roll back or clean up the orphaned dataset.

Consider wrapping both operations in a single database transaction using session.begin_nested() with explicit rollback on failure, or implement cleanup logic that removes the dataset if sample creation fails.
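
A minimal sketch of the cleanup variant described above, fitted to the calls already in upload_stt_dataset (error-handling details are a judgment call, not the PR's code):

    try:
        created_samples = create_stt_samples(
            session=session,
            dataset=dataset,
            samples=samples,
        )
    except Exception:
        # sample creation failed after the dataset was committed; remove the orphan
        session.rollback()
        session.delete(dataset)
        session.commit()
        raise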

🤖 Prompt for AI Agents
In `@backend/app/services/stt_evaluations/dataset.py` around lines 54 - 103,
upload_stt_dataset currently calls create_stt_dataset and then
create_stt_samples in separate commits, which can leave an orphan dataset if
sample creation (e.g., validate_file_ids) fails; wrap both dataset and sample
creation inside a single DB transaction (use session.begin_nested() or
session.begin()) so that a failure during create_stt_samples triggers a
rollback, and if you prefer cleanup instead, catch exceptions around
create_stt_samples, call session.delete(dataset) and session.commit() (or
rollback then delete) to remove the dataset; update upload_stt_dataset to
perform these changes and reference the existing functions create_stt_dataset,
create_stt_samples and the failing validator validate_file_ids to ensure all DB
writes are atomic.

Comment on lines +69 to +72
metadata = {
"sample_count": len(samples),
"has_ground_truth_count": sum(1 for s in samples if s.ground_truth),
}

⚠️ Potential issue | 🟡 Minor

has_ground_truth_count uses truthiness check — empty string "" would be counted as no ground truth.

sum(1 for s in samples if s.ground_truth) treats "" as falsy. If a sample's ground_truth is explicitly set to "", it won't be counted. If the intent is to count non-None values, use s.ground_truth is not None instead.

🤖 Prompt for AI Agents
In `@backend/app/services/stt_evaluations/dataset.py` around lines 69 - 72, The
metadata calculation currently counts ground truth by truthiness which skips
empty strings; update the has_ground_truth_count computation to count samples
where s.ground_truth is not None (e.g., replace the truthiness check with an
explicit "s.ground_truth is not None") so empty string ground truths are
counted; change the generator used to compute metadata["has_ground_truth_count"]
in the dataset.py code that builds metadata for samples.

Comment on lines +514 to +541
@pytest.fixture
def test_dataset_with_samples(
self, db: Session, user_api_key: TestAuthContext
) -> EvaluationDataset:
"""Create a test dataset with samples for evaluation."""
dataset = create_test_stt_dataset(
db=db,
organization_id=user_api_key.organization_id,
project_id=user_api_key.project_id,
name="eval_test_dataset",
)
# Create some samples (file will be created automatically)
for i in range(3):
file = create_test_file(
db=db,
organization_id=user_api_key.organization_id,
project_id=user_api_key.project_id,
object_store_url=f"s3://bucket/audio_{i}.mp3",
filename=f"audio_{i}.mp3",
)
create_test_stt_sample(
db=db,
dataset_id=dataset.id,
organization_id=user_api_key.organization_id,
project_id=user_api_key.project_id,
file_id=file.id,
)
return dataset

⚠️ Potential issue | 🟡 Minor

Fixture test_dataset_with_samples is unused and has a stale sample_count.

This fixture is defined but never referenced by any test method. Additionally, create_test_stt_dataset initializes dataset_metadata={"sample_count": 0, ...}, so even if this fixture were used in a start_stt_evaluation test, the endpoint would reject it with "Dataset has no samples" (evaluation.py line 65).

Proposed fix — update metadata after adding samples
         for i in range(3):
             file = create_test_file(
                 db=db,
                 organization_id=user_api_key.organization_id,
                 project_id=user_api_key.project_id,
                 object_store_url=f"s3://bucket/audio_{i}.mp3",
                 filename=f"audio_{i}.mp3",
             )
             create_test_stt_sample(
                 db=db,
                 dataset_id=dataset.id,
                 organization_id=user_api_key.organization_id,
                 project_id=user_api_key.project_id,
                 file_id=file.id,
             )
+        # Update metadata to reflect actual sample count
+        dataset.dataset_metadata = {"sample_count": 3, "has_ground_truth_count": 0}
+        db.add(dataset)
+        db.commit()
+        db.refresh(dataset)
         return dataset
🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_stt_evaluation.py` around lines 514 - 541,
The fixture test_dataset_with_samples creates samples but leaves the dataset
metadata (created via create_test_stt_dataset) with a stale sample_count=0,
which will make evaluations reject it; after creating the three samples with
create_test_file and create_test_stt_sample, update the EvaluationDataset
instance returned by create_test_stt_dataset by setting its
dataset_metadata["sample_count"] to the actual number of created samples (e.g.,
3) and persist that change using the provided db Session (flush/commit or the
project/dataset update helper) so the dataset reflects the correct sample_count
when used by start_stt_evaluation.

Comment on lines +543 to +607
    def test_start_stt_evaluation_invalid_dataset(
        self,
        client: TestClient,
        user_api_key_header: dict[str, str],
    ) -> None:
        """Test starting an STT evaluation with invalid dataset ID."""
        response = client.post(
            "/api/v1/evaluations/stt/runs",
            json={
                "run_name": "test_run",
                "dataset_id": 99999,
                "providers": ["gemini-2.5-pro"],
            },
            headers=user_api_key_header,
        )

        assert response.status_code == 404
        response_data = response.json()
        error_str = response_data.get("detail", response_data.get("error", ""))
        assert "not found" in error_str.lower()

    def test_start_stt_evaluation_without_authentication(
        self,
        client: TestClient,
    ) -> None:
        """Test starting an STT evaluation without authentication."""
        response = client.post(
            "/api/v1/evaluations/stt/runs",
            json={
                "run_name": "test_run",
                "dataset_id": 1,
                "providers": ["gemini-2.5-pro"],
            },
        )

        assert response.status_code == 401

    def test_list_stt_runs_empty(
        self,
        client: TestClient,
        user_api_key_header: dict[str, str],
    ) -> None:
        """Test listing STT runs when none exist."""
        response = client.get(
            "/api/v1/evaluations/stt/runs",
            headers=user_api_key_header,
        )

        assert response.status_code == 200
        response_data = response.json()
        assert response_data["success"] is True
        assert isinstance(response_data["data"], list)

    def test_get_stt_run_not_found(
        self,
        client: TestClient,
        user_api_key_header: dict[str, str],
    ) -> None:
        """Test getting a non-existent STT run."""
        response = client.get(
            "/api/v1/evaluations/stt/runs/99999",
            headers=user_api_key_header,
        )

        assert response.status_code == 404

🛠️ Refactor suggestion | 🟠 Major

No happy-path test for starting an STT evaluation run.

All tests in TestSTTEvaluationRun cover error/edge paths (invalid dataset, missing auth, empty list, not found). The core flow — creating a run against a valid dataset with samples and verifying the response — is untested. This is the primary feature of the PR and warrants at least one test (mocking start_stt_evaluation_batch to avoid external calls).
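A hedged sketch of such a test, assuming the existing client/user_api_key_header fixtures, the test_dataset_with_samples fixture above (with its sample_count metadata corrected), and that start_stt_evaluation_batch is imported into the evaluation route module; the patch target and expected status code may need adjusting. unittest.mock.patch is used to avoid external calls.

    def test_start_stt_evaluation_success(
        self,
        client: TestClient,
        user_api_key_header: dict[str, str],
        test_dataset_with_samples: EvaluationDataset,
    ) -> None:
        """Happy path: start a run against a dataset that has samples."""
        with patch(
            "app.api.routes.stt_evaluations.evaluation.start_stt_evaluation_batch"
        ) as mock_batch:
            response = client.post(
                "/api/v1/evaluations/stt/runs",
                json={
                    "run_name": "test_run",
                    "dataset_id": test_dataset_with_samples.id,
                    "providers": ["gemini-2.5-pro"],
                },
                headers=user_api_key_header,
            )

        assert response.status_code == 200
        response_data = response.json()
        assert response_data["success"] is True
        mock_batch.assert_called_once()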

🤖 Prompt for AI Agents
In `@backend/app/tests/api/routes/test_stt_evaluation.py` around lines 543-607:
Add a happy-path test in the TestSTTEvaluationRun suite that posts to
"/api/v1/evaluations/stt/runs" with a valid dataset (create or fixture with
samples) and a run_name/providers payload, mock the start_stt_evaluation_batch
function to avoid external calls, assert a 200 response (or 201 per API) and
that response.json() contains success True and expected run metadata, and verify
the mocked start_stt_evaluation_batch was called once with the created dataset
id and the provided providers/run_name; reference the test helper names used in
this file and the start_stt_evaluation_batch symbol for locating the code to
mock and assert.

@nishika26 linked an issue Feb 6, 2026 that may be closed by this pull request


@router.post(
    "/files/audio",
Collaborator

Wondering if the /audio path segment is required... the whole path will be /evaluations/stt/files/audio, right?
/stt/files itself already indicates that the files are audio; a minimal sketch of the flattened route is below.
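A minimal sketch of the suggested flattening, assuming the real handler keeps its current signature, response model, and dependencies; the body here is a placeholder, not the actual implementation.

from fastapi import APIRouter, UploadFile

router = APIRouter()

# Shorter path: the /evaluations/stt prefix already implies audio uploads
@router.post("/files")
async def upload_audio_file(file: UploadFile):
    ...

Clients would then POST multipart audio to /api/v1/evaluations/stt/files instead of .../files/audio.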



@router.post(
    "/runs",
Collaborator

I think we can optimize the number of API calls here. Current flow:

  • n /files calls for n samples
  • a POST /datasets call
  • a POST /runs call

The POST /datasets call can be folded into the POST /runs call. Can we instead do:

  • n /files calls for n samples (we can look at reducing this too in the future, but it's fine for now)
  • a single POST /runs call that takes a list of audio sample ids, creates the dataset entry in the db, and then proceeds to the evaluation run logic (rough sketch of the request body below)

The GET endpoints for /datasets can still remain, but the POST endpoint for them seems redundant.
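A rough sketch of the combined request body, assuming POST /runs would accept the already-uploaded audio file ids and create the dataset entry itself; the class and field names are illustrative, not the existing schema.

from pydantic import BaseModel, Field


class STTRunWithSamplesCreate(BaseModel):
    """Hypothetical request body for a POST /runs that also creates the dataset."""

    run_name: str
    file_ids: list[int] = Field(..., description="Audio files already uploaded via the files endpoint")
    providers: list[str] = Field(..., description="STT providers to evaluate, e.g. gemini-2.5-pro")
    language_id: int | None = Field(None, description="Optional language for the dataset")

The handler would then create the EvaluationDataset and its STTSample rows from file_ids before handing off to the existing run logic.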

OPENAI = "openai"
AWS = "aws"
LANGFUSE = "langfuse"
GEMINI = "gemini"
Collaborator

agree

summary = await poll_all_pending_evaluations(
    session=session, org_id=org.id
)

Collaborator

Maybe in a separate PR, but flagging it here: we should move away from per-org processing (a rough sketch of a single global pass follows).
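A rough sketch of a single org-agnostic pass; the helper and model names below are hypothetical and would need to match the actual CRUD layer and run model.

from sqlmodel import Session, select


async def poll_all_pending_stt_runs(session: Session) -> dict[str, int]:
    """Hypothetical org-agnostic poller: one query for every pending run."""
    pending_runs = session.exec(
        select(EvaluationRun).where(EvaluationRun.status == "pending")  # hypothetical model/filter
    ).all()
    summary = {"polled": 0}
    for run in pending_runs:
        await poll_run(session=session, run_id=run.id)  # hypothetical per-run poller
        summary["polled"] += 1
    return summary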

@ProjectTech4DevAI deleted a comment from codecov bot Feb 6, 2026
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py`:
- Around lines 116-127: The except block in start_stt_evaluation currently
returns str(e) to the client, leaking internal error details; instead keep
logging the full error and saving the detailed message via
update_stt_run(session=_session, run_id=run.id, status="failed",
error_message=str(e)) but change the HTTPException detail to a generic message
(optionally include a non-sensitive identifier like run.id or an internal error
code) so the client receives no internal stack/secret data; locate the except
block that references logger.error, update_stt_run, and raise HTTPException and
replace the HTTPException detail payload accordingly.
- Around lines 62-80: The run's total_items is being computed from
dataset.dataset_metadata.sample_count which can be stale; instead fetch the
actual samples first (the variable named samples used later) and compute
total_items = len(samples) * len(run_create.providers) before calling
create_stt_run (update the create_stt_run call to pass that computed value).
Ensure you still fall back to dataset.dataset_metadata.get("sample_count", 0)
only if samples is empty or the samples fetch fails, and keep using language_id
= dataset.language_id and other create_stt_run parameters unchanged.
🧹 Nitpick comments (7)
backend/app/models/stt_evaluation.py (5)

74-82: default_factory=dict produces {} but column is nullable=True — pick one semantic.

The Python default is {} (via default_factory=dict), so sample_metadata will never be None when created through the ORM without an explicit None assignment, yet the DB column allows NULL. This may cause confusion for downstream consumers checking is None vs == {}. Consider aligning: either use default=None (matching nullable=True) or set nullable=False, server_default=text("'{}'").
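One way to align the two (a sketch, assuming a JSON/JSONB column; adjust the imports and field definition to the real model):

from sqlalchemy import Column, text
from sqlalchemy.dialects.postgresql import JSONB

sample_metadata: dict = SQLField(
    default_factory=dict,
    sa_column=Column(JSONB, nullable=False, server_default=text("'{}'"), comment="Sample metadata"),
)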


136-147: provider and status on STTResult are bare str — consider enum validation or constraints.

status has an STTResultStatus enum defined but the column is typed as str with no DB-level check constraint. Similarly, provider could drift from SUPPORTED_STT_PROVIDERS. At minimum, adding a CheckConstraint on status ensures DB-level integrity.
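A minimal sketch of a DB-level guard, assuming STTResultStatus has pending/processing/completed/failed members; in practice the values should be derived from the enum rather than hard-coded.

from sqlalchemy import CheckConstraint

# Inside the STTResult model definition:
__table_args__ = (
    CheckConstraint(
        "status IN ('pending', 'processing', 'completed', 'failed')",
        name="ck_stt_result_status",
    ),
)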


273-277: STTFeedbackUpdate allows an empty payload (both fields None) — this is a no-op.

If both is_correct and comment are None, the update request does nothing meaningful. Consider a model-level validator to require at least one field.

Example validator
 class STTFeedbackUpdate(BaseModel):
     """Request model for updating human feedback on a result."""
 
     is_correct: bool | None = Field(None, description="Is the transcription correct?")
     comment: str | None = Field(None, description="Feedback comment")
+
+    @model_validator(mode="after")
+    def check_at_least_one_field(self) -> "STTFeedbackUpdate":
+        if self.is_correct is None and self.comment is None:
+            raise ValueError("At least one of 'is_correct' or 'comment' must be provided")
+        return self

108-112: updated_at uses default_factory=now which only fires on INSERT—but CRUD updates explicitly set it, making this a best-practice refactoring suggestion.

Both STTSample.updated_at and STTResult.updated_at use default_factory=now, which only executes on INSERT. However, inspection of the CRUD layer shows that all update operations explicitly set updated_at = now() (e.g., backend/app/crud/stt_evaluations/run.py:222, backend/app/crud/stt_evaluations/result.py:238). While this explicit management works, consider adding SQLAlchemy's onupdate to the column definition for additional safety against future updates that might miss manual assignment.

Example using sa_column with onupdate
-    updated_at: datetime = SQLField(
-        default_factory=now,
-        nullable=False,
-        sa_column_kwargs={"comment": "Timestamp when the sample was last updated"},
-    )
+    updated_at: datetime = SQLField(
+        default_factory=now,
+        nullable=False,
+        sa_column_kwargs={
+            "comment": "Timestamp when the sample was last updated",
+            "onupdate": now,
+        },
+    )

Also applies to: 216-219


232-265: Consider adding model_config = ConfigDict(from_attributes=True) to STTSamplePublic and STTResultPublic for idiomatic Pydantic usage.

These models could benefit from Pydantic v2's from_attributes=True configuration. While the current explicit keyword argument construction (e.g., STTSamplePublic(id=sample.id, file_id=sample.file_id, ...)) works correctly, adopting from_attributes=True would enable the idiomatic pattern model_validate(orm_instance) and reduce duplication across construction sites.
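A small sketch of the suggested configuration; the field lists are omitted here and would stay exactly as defined today.

from pydantic import BaseModel, ConfigDict


class STTSamplePublic(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    # ...existing fields unchanged...


# Construction sites then shrink to:
# sample_public = STTSamplePublic.model_validate(sample)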

backend/app/api/routes/stt_evaluations/evaluation.py (2)

137-155: Verbose manual field-by-field model construction — use model_validate or from_orm.

Both STTEvaluationRunPublic and STTEvaluationRunWithResults are constructed by manually mapping every field from the ORM object. This is error-prone (easy to miss a field when the model evolves) and verbose. If you add model_config = ConfigDict(from_attributes=True) to the Pydantic models, you can replace this with:

STTEvaluationRunPublic.model_validate(run)

This would also simplify STTEvaluationRunWithResults construction.

Also applies to: 233-253


190-206: Response model mismatch when include_results=False.

The endpoint declares response_model=APIResponse[STTEvaluationRunWithResults], but when include_results=False, results will be an empty list and results_total will be 0. While this technically validates, it's semantically misleading — the response schema always advertises results. Consider using a union type or separate endpoint, or at minimum document this behavior clearly.
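One possible shape, sketched under the assumption that the existing model names are kept and that a union response model is acceptable here; only the decorator is shown.

@router.get(
    "/runs/{run_id}",
    response_model=APIResponse[STTEvaluationRunPublic | STTEvaluationRunWithResults],
)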

Comment on lines +62 to +80
    sample_count = (dataset.dataset_metadata or {}).get("sample_count", 0)

    if sample_count == 0:
        raise HTTPException(status_code=400, detail="Dataset has no samples")

    # Use language_id from the dataset
    language_id = dataset.language_id

    # Create run record
    run = create_stt_run(
        session=_session,
        run_name=run_create.run_name,
        dataset_id=run_create.dataset_id,
        dataset_name=dataset.name,
        org_id=auth_context.organization_.id,
        project_id=auth_context.project_.id,
        providers=run_create.providers,
        language_id=language_id,
        total_items=sample_count * len(run_create.providers),

⚠️ Potential issue | 🟠 Major

total_items derived from metadata may diverge from actual sample count.

sample_count is read from dataset.dataset_metadata (line 62), but the actual samples are fetched separately on line 84. If metadata becomes stale (e.g., samples added/removed without metadata update), total_items stored on the run will be incorrect. Consider computing total_items from len(samples) after fetching them.

Proposed fix
+    # Get samples for the dataset
+    samples = get_samples_by_dataset_id(
+        session=_session,
+        dataset_id=run_create.dataset_id,
+        org_id=auth_context.organization_.id,
+        project_id=auth_context.project_.id,
+    )
+
+    if not samples:
+        raise HTTPException(status_code=400, detail="Dataset has no samples")
+
     # Create run record
     run = create_stt_run(
         session=_session,
         run_name=run_create.run_name,
         dataset_id=run_create.dataset_id,
         dataset_name=dataset.name,
         org_id=auth_context.organization_.id,
         project_id=auth_context.project_.id,
         providers=run_create.providers,
         language_id=language_id,
-        total_items=sample_count * len(run_create.providers),
+        total_items=len(samples) * len(run_create.providers),
     )
-
-    # Get samples for the dataset
-    samples = get_samples_by_dataset_id(
-        session=_session,
-        dataset_id=run_create.dataset_id,
-        org_id=auth_context.organization_.id,
-        project_id=auth_context.project_.id,
-    )
🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 62-80:
The run's total_items is being computed from
dataset.dataset_metadata.sample_count which can be stale; instead fetch the
actual samples first (the variable named samples used later) and compute
total_items = len(samples) * len(run_create.providers) before calling
create_stt_run (update the create_stt_run call to pass that computed value).
Ensure you still fall back to dataset.dataset_metadata.get("sample_count", 0)
only if samples is empty or the samples fetch fails, and keep using language_id
= dataset.language_id and other create_stt_run parameters unchanged.

Comment on lines +116 to +127
    except Exception as e:
        logger.error(
            f"[start_stt_evaluation] Batch submission failed | "
            f"run_id: {run.id}, error: {str(e)}"
        )
        update_stt_run(
            session=_session,
            run_id=run.id,
            status="failed",
            error_message=str(e),
        )
        raise HTTPException(status_code=500, detail=f"Batch submission failed: {e}")

⚠️ Potential issue | 🟡 Minor

Internal error details leaked to the client.

str(e) is included in the HTTP 500 response detail, which may expose internal implementation details (stack traces, service URLs, credentials in connection strings, etc.) to API consumers.

Proposed fix
-        raise HTTPException(status_code=500, detail=f"Batch submission failed: {e}")
+        raise HTTPException(status_code=500, detail="Batch submission failed")
🤖 Prompt for AI Agents
In `@backend/app/api/routes/stt_evaluations/evaluation.py` around lines 116-127:
The except block in start_stt_evaluation currently returns str(e) to the client,
leaking internal error details; instead keep logging the full error and saving
the detailed message via update_stt_run(session=_session, run_id=run.id,
status="failed", error_message=str(e)) but change the HTTPException detail to a
generic message (optionally include a non-sensitive identifier like run.id or an
internal error code) so the client receives no internal stack/secret data;
locate the except block that references logger.error, update_stt_run, and raise
HTTPException and replace the HTTPException detail payload accordingly.

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluation: STT evals

3 participants