Add file size validation to document upload endpoint#576
Add file size validation to document upload endpoint#576ankit-mehta07 wants to merge 2 commits intoProjectTech4DevAI:mainfrom
Conversation
📝 WalkthroughWalkthroughAdds a configurable document upload size limit (default 50MB), enforces size and emptiness validation before S3 upload, updates docs, and adds tests covering oversized, empty, and valid uploads. Changes
Sequence DiagramsequenceDiagram
participant Client
participant Handler as Upload Handler
participant Validator
participant S3
participant DB as Database
Client->>Handler: POST /upload (multipart file)
activate Handler
Handler->>Validator: validate_document_file(file)
activate Validator
alt validation fails
Validator-->>Handler: HTTPException (413 or 422)
Handler-->>Client: Error response
else validation passes
Validator-->>Handler: file_size (bytes)
deactivate Validator
Handler->>Handler: pre_transform_validation(...)
Handler->>S3: Upload file
activate S3
S3-->>Handler: Upload success (URL/metadata)
deactivate S3
Handler->>DB: Create Document record
activate DB
DB-->>Handler: Record created
deactivate DB
Handler-->>Client: 201 Created (document metadata)
end
deactivate Handler
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
Important Action Needed: IP Allowlist UpdateIf your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:
Failure to add the new IP will result in interrupted reviews. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/api/routes/documents.py (1)
108-123:⚠️ Potential issue | 🟡 MinorAdd return type annotation to
upload_doc.The function is missing a return type annotation. As per coding guidelines, all function parameters and return values must have type hints.
✅ Suggested change
-): +) -> APIResponse[DocumentUploadResponse]:
🤖 Fix all issues with AI agents
In `@backend/app/services/documents/validators.py`:
- Line 53: The log line should be prefixed with the function name and mask
sensitive values: update the logger.info call in the function (e.g.,
validate_document_file) to use the "[validate_document_file]" prefix and call
mask_string on file.filename and any other sensitive fields (e.g.,
logger.info(f"[validate_document_file] Document file validated:
{mask_string(file.filename)} ({file_size} bytes)")). Ensure you import/use the
existing mask_string utility and keep file_size non-sensitive.
🧹 Nitpick comments (1)
backend/app/tests/api/routes/documents/test_route_document_upload.py (1)
329-424: Use a temp-file factory fixture (and avoid 51MB writes).
These tests duplicate file-creation logic and write large files byte-by-byte. A factory fixture aligns with test guidelines and speeds up IO.As per coding guidelines, Use factory pattern for test fixtures in `backend/app/tests/`.💡 Example refactor
+@pytest.fixture +def temp_file_factory(): + def _make(size_bytes: int, suffix: str) -> Path: + with NamedTemporaryFile(mode="wb", suffix=suffix, delete=False) as fp: + fp.truncate(size_bytes) + return Path(fp.name) + return _make- with NamedTemporaryFile(mode="wb", suffix=".pdf", delete=False) as fp: - chunk_size = 1024 * 1024 - for _ in range(51): - fp.write(b"0" * chunk_size) - fp.flush() - large_file = Path(fp.name) + large_file = temp_file_factory(51 * 1024 * 1024, ".pdf")
| detail="Empty file uploaded" | ||
| ) | ||
|
|
||
| logger.info(f"Document file validated: {file.filename} ({file_size} bytes)") |
There was a problem hiding this comment.
Prefix and mask the log message.
Logging must include the function-name prefix and mask sensitive values like filenames.
🔧 Suggested change
+from app.utils import mask_string
...
- logger.info(f"Document file validated: {file.filename} ({file_size} bytes)")
+ logger.info(
+ f"[validate_document_file] Document file validated: "
+ f"{mask_string(file.filename)} ({file_size} bytes)"
+ )🤖 Prompt for AI Agents
In `@backend/app/services/documents/validators.py` at line 53, The log line should
be prefixed with the function name and mask sensitive values: update the
logger.info call in the function (e.g., validate_document_file) to use the
"[validate_document_file]" prefix and call mask_string on file.filename and any
other sensitive fields (e.g., logger.info(f"[validate_document_file] Document
file validated: {mask_string(file.filename)} ({file_size} bytes)")). Ensure you
import/use the existing mask_string utility and keep file_size non-sensitive.
|
@ankit-mehta07 we follow a branch naming standard which is documented in our contributing.md, please refer to that and rename your branch accordingly |
| @@ -83,7 +83,7 @@ def recover_password(email: str, session: SessionDep) -> Message: | |||
| return Message(message="Password recovery email sent") | |||
There was a problem hiding this comment.
why does this pr have changes of the other PR you made, this PR should only have file changes for this issue and nothing else
There was a problem hiding this comment.
I applied the changes can you please check
| Returns: | ||
| File size in bytes if valid | ||
|
|
||
| Raises: |
There was a problem hiding this comment.
this "raises" part is not needed in the function doc
| Raises: | ||
| HTTPException: If validation fails | ||
| """ | ||
| if not file.filename: |
There was a problem hiding this comment.
i do not think that it is possible that a file can be uploaded which does not have a filename
| detail="Empty file uploaded" | ||
| ) | ||
|
|
||
| logger.info(f"Document file validated: {file.filename} ({file_size} bytes)") |
There was a problem hiding this comment.
we put function name in a bracket at the start of the log sentence so that its trackable, check our rest of the code
| uploader: WebUploader, | ||
| ) -> None: | ||
| """Test that files exceeding the size limit are rejected.""" | ||
| aws = AmazonCloudStorageClient() |
There was a problem hiding this comment.
are you hitting the real aws with this, or mock?
| @@ -72,7 +72,7 @@ def test_reset_password(client: TestClient, db: Session) -> None: | |||
| data = {"new_password": new_password, "token": token} | |||
There was a problem hiding this comment.
again, the changes from the other pr should not be highlighted here
|
|
||
| ### File Size Restrictions | ||
|
|
||
| - **Maximum file size**: 50MB (configurable via `MAX_DOCUMENT_UPLOAD_SIZE_MB` environment variable) |
There was a problem hiding this comment.
these docs are for non-technical people also, they do not need to know about the environment variable part
| CALLBACK_READ_TIMEOUT: int = 10 | ||
|
|
||
| # Document upload size limit (in MB) | ||
| MAX_DOCUMENT_UPLOAD_SIZE_MB: int = 50 |
There was a problem hiding this comment.
since primarily we upload these files from s3 to openai, if we want to do anything with those files, and in openai single file size limit is 512 MB then we should keep it that only for now, or smth near that value -
Summary
Fixes #569
Summary
Motivation
This change avoids unnecessary S3 storage usage and prevents downstream issues when handling very large documents.
Summary by CodeRabbit
New Features
Documentation
Tests
Bug Fix / API