Add file size validation to document upload endpoint #584
ankit-mehta07 wants to merge 4 commits into ProjectTech4DevAI:main from
Conversation
📝 Walkthrough

Adds project and organization context to jobs via database migrations and model updates. Introduces file size validation for document uploads with a configurable maximum. Updates job creation across services to propagate the new context fields. Includes test coverage for upload size restrictions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant RouteHandler as /documents/upload
    participant Validator as validate_document_file
    participant Storage as Cloud Storage
    participant Database as Database
    Client->>RouteHandler: POST file + metadata
    RouteHandler->>Validator: validate_document_file(file)
    alt File size exceeds limit
        Validator-->>RouteHandler: HTTP 413 (Payload Too Large)
        RouteHandler-->>Client: 413 Error
    else File is empty
        Validator-->>RouteHandler: HTTP 422 (Unprocessable Entity)
        RouteHandler-->>Client: 422 Error
    else File size valid
        Validator->>Validator: Seek to end, verify size, reset pointer
        Validator-->>RouteHandler: Return file_size (bytes)
        RouteHandler->>Storage: Upload file
        Storage-->>RouteHandler: Upload success
        RouteHandler->>Database: Create document record
        Database-->>RouteHandler: Record created
        RouteHandler-->>Client: 200 Success
    end
```
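The diagram only names the checks; below is a minimal sketch of what such a validator could look like, assuming a FastAPI `UploadFile` and the `MAX_DOCUMENT_UPLOAD_SIZE_MB` setting discussed later in this review. The actual helper in the PR may differ in signature and error wording.

```python
# Hypothetical sketch of the validator described in the diagram above;
# the real helper lives in the PR diff and may differ.
from fastapi import HTTPException, UploadFile

MAX_DOCUMENT_UPLOAD_SIZE_MB = 512  # mirrors the setting discussed in this review


def validate_document_file(file: UploadFile) -> int:
    """Return the file size in bytes, rejecting empty or oversized uploads."""
    # Seek to the end of the underlying stream to measure its size.
    file.file.seek(0, 2)
    file_size = file.file.tell()
    # Reset the pointer so the subsequent upload reads from the start.
    file.file.seek(0)

    if file_size == 0:
        raise HTTPException(status_code=422, detail="Uploaded file is empty")

    max_bytes = MAX_DOCUMENT_UPLOAD_SIZE_MB * 1024 * 1024
    if file_size > max_bytes:
        raise HTTPException(
            status_code=413,
            detail=f"File exceeds the {MAX_DOCUMENT_UPLOAD_SIZE_MB}MB limit",
        )
    return file_size
```

Measuring the size via seek/tell avoids reading the whole file into memory before the check.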
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/services/response/jobs.py (1)
1-6: ⚠️ Potential issue | 🟡 Minor

Add a type hint for task_instance.

task_instance is untyped on Line 61, which breaks the project's type-hint requirement. As per coding guidelines, "Always add type hints to all function parameters and return values in Python code".

✅ Suggested fix

```diff
+from typing import Any
 @@
 def execute_job(
     request_data: dict,
     project_id: int,
     organization_id: int,
     job_id: str,
     task_id: str,
-    task_instance,
+    task_instance: Any,
 ) -> None:
```

Also applies to: 55-62
🧹 Nitpick comments (2)
backend/app/models/job.py (1)
92-94: Consider adding back_populates for bidirectional navigation.

The relationships lack back_populates, meaning you cannot navigate from Organization or Project to their associated jobs. If bidirectional access is needed (e.g., organization.jobs), you'll need to add corresponding relationship fields to those models.

♻️ Example with back_populates

```diff
 # Relationships
-organization: Optional["Organization"] = Relationship()
-project: Optional["Project"] = Relationship()
+organization: Optional["Organization"] = Relationship(back_populates="jobs")
+project: Optional["Project"] = Relationship(back_populates="jobs")
```

Then add to Organization and Project models:

```python
jobs: list["Job"] = Relationship(back_populates="organization", cascade_delete=True)
```

backend/app/crud/jobs.py (1)
15-31: Consider adding a log statement for job creation.

Per the coding guidelines, log messages should be prefixed with the function name. Adding a log entry here would improve observability for job creation events.
📝 Proposed logging addition
```diff
 self.session.add(new_job)
 self.session.commit()
 self.session.refresh(new_job)
+logger.info(
+    f"[create] Job created | job_id={new_job.id}, job_type={job_type}, "
+    f"project_id={project_id}, organization_id={organization_id}"
+)
 return new_job
```
backend/app/alembic/versions/043_add_project_org_to_job_table.py

```python
op.add_column(
    "job",
    sa.Column(
        "organization_id",
        sa.Integer(),
        nullable=False,
        comment="Reference to the organization",
    ),
)

# Add project_id column
op.add_column(
    "job",
    sa.Column(
        "project_id",
        sa.Integer(),
        nullable=False,
        comment="Reference to the project",
    ),
)
```
Non-nullable columns without defaults will fail if the job table has existing rows.
Adding organization_id and project_id as nullable=False without a server_default will cause the migration to fail on databases with existing job records. Consider one of these approaches:
- Add a server_default and backfill with valid IDs, then remove the default
- Make columns nullable initially, backfill, then alter to non-nullable
- If the table is guaranteed empty in all environments, document this assumption
🛠️ Suggested fix: two-phase migration (add as nullable, backfill, then set NOT NULL)
```diff
 op.add_column(
     "job",
     sa.Column(
         "organization_id",
         sa.Integer(),
-        nullable=False,
+        nullable=True,
         comment="Reference to the organization",
     ),
 )
 op.add_column(
     "job",
     sa.Column(
         "project_id",
         sa.Integer(),
-        nullable=False,
+        nullable=True,
         comment="Reference to the project",
     ),
 )
+
+# TODO: Backfill existing rows with valid organization_id and project_id
+# op.execute("UPDATE job SET organization_id = ..., project_id = ... WHERE organization_id IS NULL")
+
+# Then alter columns to NOT NULL after backfill
+op.alter_column("job", "organization_id", nullable=False)
+op.alter_column("job", "project_id", nullable=False)
```

🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/043_add_project_org_to_job_table.py` around
lines 22 - 41, The migration adds non-nullable columns organization_id and
project_id to the job table using op.add_column without a server_default, which
will fail if rows already exist; update the migration to perform a safe
two-phase change: either (A) add organization_id and project_id with a sensible
server_default (or temporary default value) so existing rows get backfilled,
commit, then remove the server_default and alter nullable to False, or (B) add
both columns as nullable (nullable=True) via op.add_column, run a data backfill
step to populate them, then run a follow-up ALTER to set nullable=False; adjust
the op.add_column calls for "organization_id" and "project_id" accordingly and
include a follow-up migration step to remove defaults or flip nullable once
backfill is done.
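For reference, a hedged end-to-end shape of the two-phase approach suggested above; the backfill values depend on project data and are left as a placeholder rather than real SQL.

```python
# Hypothetical full upgrade() for the two-phase pattern; not the PR's actual migration.
import sqlalchemy as sa
from alembic import op


def upgrade() -> None:
    # Phase 1: add the columns as nullable so the ALTER succeeds on populated tables.
    op.add_column(
        "job",
        sa.Column("organization_id", sa.Integer(), nullable=True,
                  comment="Reference to the organization"),
    )
    op.add_column(
        "job",
        sa.Column("project_id", sa.Integer(), nullable=True,
                  comment="Reference to the project"),
    )

    # Phase 2: backfill existing rows with valid IDs (project-specific; not shown).
    # op.execute("UPDATE job SET organization_id = ..., project_id = ... WHERE organization_id IS NULL")

    # Phase 3: tighten the constraint once every row has a value.
    op.alter_column("job", "organization_id", nullable=False)
    op.alter_column("job", "project_id", nullable=False)
```

Option A (a temporary server_default) follows the same pattern, with op.alter_column(..., server_default=None) dropping the default in place of the backfill step.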
backend/app/alembic/versions/044_optimize_conversation_query.py

```python
# revision identifiers, used by Alembic.
revision = "044"
down_revision = "043"
branch_labels = None
depends_on = None
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# List all Alembic revision IDs to confirm the latest revision number.
python - <<'PY'
import re, pathlib
path = pathlib.Path("backend/app/alembic/versions")
revs = []
for f in path.glob("*.py"):
    txt = f.read_text(encoding="utf-8")
    m = re.search(r'revision\s*=\s*"(\d+)"', txt)
    if m:
        revs.append((int(m.group(1)), f.name))
for rev, name in sorted(revs):
    print(f"{rev:03d} -> {name}")
PY
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 2069

🏁 Script executed:

```bash
cat backend/app/alembic/versions/044_optimize_conversation_query.py
```

Repository: ProjectTech4DevAI/kaapi-backend
Length of output: 934
Add type hints to upgrade() and downgrade() functions.
The revision numbering (044) is correct. However, both the upgrade() and downgrade() functions are missing required type hints. Update them to:
```python
def upgrade() -> None:
def downgrade() -> None:
```

🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/044_optimize_conversation_query.py` around lines
11 - 15, The migration functions upgrade() and downgrade() lack return type
annotations; update their definitions to include explicit return types by
changing them to "def upgrade() -> None:" and "def downgrade() -> None:" so both
functions are annotated as returning None (keep bodies unchanged and only adjust
the function signatures for upgrade and downgrade).
backend/app/alembic/versions/044_optimize_conversation_query.py

```python
def upgrade():
    # Create composite index to optimize the get_conversation_by_ancestor_id query
    # This query filters by: ancestor_response_id, project_id, is_deleted
    # and orders by: inserted_at DESC
    op.create_index(
        "ix_openai_conversation_ancestor_project_active_time",
        "openai_conversation",
        ["ancestor_response_id", "project_id", "is_deleted", "inserted_at"],
        unique=False,
    )


def downgrade():
    op.drop_index(
        "ix_openai_conversation_ancestor_project_active_time",
        table_name="openai_conversation",
    )
```
🛠️ Refactor suggestion | 🟠 Major
Add required return type hints to migration functions.
Both upgrade and downgrade should be annotated to comply with the mandatory type-hints guideline.
✅ Suggested fix
```diff
-def upgrade():
+def upgrade() -> None:
 @@
-def downgrade():
+def downgrade() -> None:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def upgrade() -> None:
    # Create composite index to optimize the get_conversation_by_ancestor_id query
    # This query filters by: ancestor_response_id, project_id, is_deleted
    # and orders by: inserted_at DESC
    op.create_index(
        "ix_openai_conversation_ancestor_project_active_time",
        "openai_conversation",
        ["ancestor_response_id", "project_id", "is_deleted", "inserted_at"],
        unique=False,
    )


def downgrade() -> None:
    op.drop_index(
        "ix_openai_conversation_ancestor_project_active_time",
        table_name="openai_conversation",
    )
```
🤖 Prompt for AI Agents
In `@backend/app/alembic/versions/044_optimize_conversation_query.py` around lines
18 - 34, The migration functions upgrade and downgrade lack explicit return type
annotations; update their signatures (functions named upgrade and downgrade in
this migration) to include return type hints (i.e., -> None) so they comply with
the project's mandatory type-hints guideline, leaving the function bodies
unchanged and keeping the existing op.create_index/op.drop_index calls intact.
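To make the index's intent concrete, here is a hedged reconstruction of the query shape the migration comment describes. The `OpenAIConversation` field definitions and the function signature are assumptions inferred from that comment, not the repository's actual code.

```python
# Illustrative only: a query matching the composite index
# (ancestor_response_id, project_id, is_deleted, inserted_at).
from datetime import datetime
from typing import Optional

from sqlmodel import Field, Session, SQLModel, select


class OpenAIConversation(SQLModel, table=True):
    __tablename__ = "openai_conversation"
    id: Optional[int] = Field(default=None, primary_key=True)
    ancestor_response_id: str
    project_id: int
    is_deleted: bool = False
    inserted_at: datetime = Field(default_factory=datetime.utcnow)


def get_conversation_by_ancestor_id(
    session: Session, ancestor_response_id: str, project_id: int
) -> Optional[OpenAIConversation]:
    # Equality filters come first, then the inserted_at ordering,
    # mirroring the index column order created by the migration.
    statement = (
        select(OpenAIConversation)
        .where(
            OpenAIConversation.ancestor_response_id == ancestor_response_id,
            OpenAIConversation.project_id == project_id,
            OpenAIConversation.is_deleted == False,  # noqa: E712
        )
        .order_by(OpenAIConversation.inserted_at.desc())
    )
    return session.exec(statement).first()
```

Because the equality-filtered columns lead the index and inserted_at comes last, the planner can satisfy both the filters and the DESC ordering from the index alone.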
backend/app/api/docs/documents/upload.md

```markdown
### File Size Restrictions

- **Maximum file size**: 50MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
- Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
- Empty files will be rejected with a 422 (Unprocessable Entity) error
```
Documentation states 50MB but the code default is 512MB.
The documentation says "Maximum file size: 50MB" but backend/app/core/config.py defines MAX_DOCUMENT_UPLOAD_SIZE_MB: int = 512. Either update the documentation to reflect the actual default or change the config default to 50.
📝 Option: Update documentation to match code
```diff
 ### File Size Restrictions
-- **Maximum file size**: 50MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
+- **Maximum file size**: 512MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
 - Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
 - Empty files will be rejected with a 422 (Unprocessable Entity) error
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```markdown
### File Size Restrictions

- **Maximum file size**: 512MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable)
- Files exceeding the size limit will be rejected with a 413 (Payload Too Large) error
- Empty files will be rejected with a 422 (Unprocessable Entity) error
```
🤖 Prompt for AI Agents
In `@backend/app/api/docs/documents/upload.md` around lines 7 - 11, The docs claim
a 50MB max but the code default constant MAX_DOCUMENT_UPLOAD_SIZE_MB is 512;
update the documentation in the upload.md text to reflect the actual default
(change "Maximum file size: 50MB" to "Maximum file size: 512MB (configurable via
MAX_DOCUMENT_UPLOAD_SIZE_MB environment variable)") and ensure any related lines
about rejection behavior remain unchanged; reference the
MAX_DOCUMENT_UPLOAD_SIZE_MB symbol so readers know the source of truth.
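For context, the setting at the center of this mismatch would presumably be declared along these lines in backend/app/core/config.py. Only the MAX_DOCUMENT_UPLOAD_SIZE_MB name and its 512 default come from the review; the surrounding Settings class is an assumption.

```python
# Assumed shape of the setting; the real config module is not shown in this excerpt.
from pydantic_settings import BaseSettings


class Settings(BaseSettings):
    # Default used by the upload validator; override with the
    # MAX_DOCUMENT_UPLOAD_SIZE_MB environment variable.
    MAX_DOCUMENT_UPLOAD_SIZE_MB: int = 512


settings = Settings()
max_upload_bytes = settings.MAX_DOCUMENT_UPLOAD_SIZE_MB * 1024 * 1024
```

Changing either the default here or the figure in upload.md resolves the discrepancy; the review leaves that choice to the authors.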
Prajna1999
left a comment
Is this not a duplicate PR?
Summary
Result
Improves reliability and prevents oversized uploads.