-
-
Notifications
You must be signed in to change notification settings - Fork 1
feat: implement smart text chunking for embed fields #141
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Resolves GH #75 - Don't break markdown links in Ruff output. - Add smart_chunk_text() that respects markdown structures - Preserves markdown links, inline code, and code blocks - Prefers natural split points (paragraphs > sentences > newlines) - Update Ruff command to use smart chunking - Add 15 tests for the new chunking function Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
🚅 Deployed to the byte-pr-141 environment in byte
|
Reviewer's GuideImplements a smart markdown-aware text chunking utility and wires it into the Ruff rule embed flow, along with comprehensive unit tests verifying chunk boundaries and markdown preservation. Sequence diagram for Ruff rule explanation chunking with smart_chunk_textsequenceDiagram
actor User
participant DiscordClient
participant AstralPlugin as AstralPlugin_ruff_rule
participant Utils as smart_chunk_text
participant Embed
User->>DiscordClient: invoke /ruff rule <code>
DiscordClient->>AstralPlugin: ruff_rule(interaction, rule)
AstralPlugin->>AstralPlugin: format_ruff_rule(rule)
AstralPlugin->>AstralPlugin: build docs_field
AstralPlugin->>Utils: smart_chunk_text(explanation, 1000)
Utils-->>AstralPlugin: list_of_chunks
loop for each chunk in list_of_chunks
AstralPlugin->>Embed: add_field(name, chunk, inline=False)
end
AstralPlugin->>Embed: add_field(Fix, fix_text)
AstralPlugin->>Embed: add_field(Documentation, docs_field)
AstralPlugin-->>DiscordClient: send embed response
DiscordClient-->>User: display Ruff rule embed with intact markdown
Class diagram for utils smart_chunk_text and Astral Ruff plugin integrationclassDiagram
class UtilsModule {
+_find_protected_regions(text: str) list~tuple~
+_is_position_protected(pos: int, regions: list~tuple~) bool
+_find_split_point(text_segment: str, start_offset: int, max_size: int, protected_regions: list~tuple~) int
+smart_chunk_text(text: str, max_size: int) list~str~
}
class AstralPluginModule {
+ruff_rule(interaction: Interaction, rule: str) None
}
AstralPluginModule ..> UtilsModule : uses smart_chunk_text
File-Level Changes
Assessment against linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey - I've found 2 issues, and left some high level feedback:
- The
_is_position_protectedlookup is called in tight loops inside_find_split_pointand currently scans the fullprotected_regionslist each time; consider keepingprotected_regionssorted and advancing an index or using a more efficient interval lookup to avoid quadratic behavior on long texts with many markdown regions. - When a single protected region (e.g., a long code block) is longer than
max_sizeand covers the entire firstmax_sizecharacters,_find_split_pointwill fall through and returnmax_size, allowing a split inside that protected region; you may want to special-case this to either allow an oversized chunk for that region or relax protection in a controlled way rather than silently violating the no-split guarantee.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `_is_position_protected` lookup is called in tight loops inside `_find_split_point` and currently scans the full `protected_regions` list each time; consider keeping `protected_regions` sorted and advancing an index or using a more efficient interval lookup to avoid quadratic behavior on long texts with many markdown regions.
- When a single protected region (e.g., a long code block) is longer than `max_size` and covers the entire first `max_size` characters, `_find_split_point` will fall through and return `max_size`, allowing a split inside that protected region; you may want to special-case this to either allow an oversized chunk for that region or relax protection in a controlled way rather than silently violating the no-split guarantee.
## Individual Comments
### Comment 1
<location> `services/bot/src/byte_bot/lib/utils.py:195-204` </location>
<code_context>
+ return max_size
+
+
+def smart_chunk_text(text: str, max_size: int = 1000) -> list[str]:
+ """Split text into chunks without breaking markdown structures.
+
+ Respects markdown links, inline code, and code blocks. Prefers splitting at
+ natural boundaries: paragraphs > sentences > newlines > spaces.
+
+ Args:
+ text: The text to chunk.
+ max_size: Maximum characters per chunk.
+
+ Returns:
+ List of text chunks.
+ """
+ if not text:
+ return []
+
+ if len(text) <= max_size:
+ return [text]
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Guard against non-positive max_size to avoid an infinite loop.
If `max_size` is 0 or negative, `_find_split_point` returns 0, so `current_pos` never advances and the `while current_pos < len(text)` loop may not terminate. Consider validating `max_size` at the start of `smart_chunk_text`, e.g. raising `ValueError` when `max_size <= 0` or coercing it to at least 1.
</issue_to_address>
### Comment 2
<location> `tests/unit/bot/lib/test_utils.py:1029-1036` </location>
<code_context>
+ combined = " ".join(result)
+ assert "`my_function()`" in combined
+
+ def test_preserves_code_blocks(self) -> None:
+ """Test that code blocks are not broken when they fit within max_size."""
+ text = "Example:\n\n```python\ndef foo():\n pass\n```\n\nEnd of content."
+ result = smart_chunk_text(text, 60)
+ combined = "".join(result)
+ assert "```python" in combined
+ assert "def foo():" in combined
+ assert "```" in combined
+
+ def test_chunks_do_not_exceed_max_size(self) -> None:
</code_context>
<issue_to_address>
**suggestion (testing):** Code block test should ensure the triple backticks and block content stay in the same chunk
Currently this only checks that the markers and content exist in the recombined text, not that they stay together. To more directly validate `smart_chunk_text`, consider asserting that a single chunk contains the opening ``` with language, at least one code line (e.g. `"def foo():"`), and the closing ```; and that no chunk contains an unmatched opening or closing ``` without its pair.
```suggestion
def test_preserves_code_blocks(self) -> None:
"""Test that code blocks are not broken when they fit within max_size."""
text = "Example:\n\n```python\ndef foo():\n pass\n```\n\nEnd of content."
result = smart_chunk_text(text, 60)
block_chunk_found = False
for chunk in result:
backtick_fence_count = chunk.count("```")
# No chunk should contain an unmatched code fence
assert backtick_fence_count in (0, 2)
if "```python" in chunk:
# The same chunk that has the opening fence with language
# should also contain the code line and the closing fence.
assert "def foo():" in chunk
assert "```" in chunk
block_chunk_found = True
# Ensure we actually found a chunk that contained the whole code block
assert block_chunk_found
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
- Add max_size validation to prevent infinite loop (raises ValueError if <= 0) - Improve code block test to verify fences stay in same chunk - Add test for max_size validation Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
|
Addressed the critical issues in a558827: 1. max_size validation - Now raises 2. Code block test - Now verifies fences stay in same chunk with matched count assertions Regarding the other feedback:
All 1141 tests pass. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request implements smart text chunking for Discord embed fields to prevent breaking markdown structures when displaying Ruff linting rule explanations. The implementation addresses issue #75 by ensuring that markdown links, inline code, and code blocks are not split across chunks.
Changes:
- Added
smart_chunk_text()function inservices/bot/src/byte_bot/lib/utils.pywith intelligent chunking that respects markdown structures - Updated the
/ruffcommand inservices/bot/src/byte_bot/plugins/astral.pyto usesmart_chunk_textinstead of the naivechunk_sequenceapproach - Added 15 comprehensive unit tests covering various chunking scenarios including edge cases
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| services/bot/src/byte_bot/lib/utils.py | Implements smart_chunk_text() and helper functions for markdown-aware text chunking with regex-based protection of markdown structures |
| services/bot/src/byte_bot/plugins/astral.py | Replaces naive chunking with smart chunking in the Ruff rule display command |
| tests/unit/bot/lib/test_utils.py | Adds 15 test cases validating chunking behavior, markdown preservation, and size limits |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for i in range(max_size, search_start, -1): | ||
| if not _is_position_protected(start_offset + i, protected_regions): | ||
| return i | ||
|
|
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fallback logic does not handle the case where a protected region (markdown link, inline code, or code block) exceeds max_size. If all positions from search_start to max_size are protected, the function returns max_size, which could split in the middle of a protected region. Consider adding logic to either skip the entire protected region or handle oversized protected regions explicitly.
| # Fallback: max_size may fall inside a protected region. Try to avoid | |
| # splitting inside that region by moving the split before or after it. | |
| absolute_max = start_offset + max_size | |
| for start, end in protected_regions: | |
| if start < absolute_max < end: | |
| # Position lies inside this protected region. | |
| rel_before = start - start_offset | |
| rel_after = end - start_offset | |
| region_len = end - start | |
| # Prefer splitting immediately before the region if there is | |
| # any unprotected prefix to keep this intact and ≤ max_size. | |
| if rel_before > 0: | |
| return rel_before | |
| # Region starts at the very beginning of this segment. | |
| # If the region itself fits within max_size, return its end so | |
| # the whole region stays intact in a single chunk. | |
| if region_len <= max_size: | |
| return rel_after | |
| # Oversized protected region: cannot keep it intact within | |
| # max_size, so fall back to max_size as a last resort. | |
| break |
| def smart_chunk_text(text: str, max_size: int = 1000) -> list[str]: | ||
| """Split text into chunks without breaking markdown structures. | ||
| Respects markdown links, inline code, and code blocks. Prefers splitting at | ||
| natural boundaries: paragraphs > sentences > newlines > spaces. | ||
| Args: | ||
| text: The text to chunk. | ||
| max_size: Maximum characters per chunk (must be > 0). | ||
| Returns: | ||
| List of text chunks. | ||
Copilot
AI
Jan 22, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function does not validate that max_size is positive. If max_size is 0 or negative, the function could produce unexpected results or enter an infinite loop. Consider adding validation to ensure max_size > 0 at the start of the function.
|
Documentation preview will be available shortly at https://jacobcoffee.github.io/byte-docs-preview/141 |
Summary
smart_chunk_text()function that respects markdown structuresChanges
smart_chunk_text(text, max_size=1000)inservices/bot/src/byte_bot/lib/utils.py[text](url)`code````code```/ruffcommand now usessmart_chunk_textinstead of naivechunk_sequenceTest plan
🤖 Generated with Claude Code
Summary by Sourcery
Add smart markdown-aware text chunking for Ruff rule explanations and wire it into the Astral /ruff command.
New Features:
Enhancements:
Tests: