
Conversation

@JacobCoffee (Owner) commented Jan 22, 2026

Summary

Changes

  • New function: smart_chunk_text(text, max_size=1000) in services/bot/src/byte_bot/lib/utils.py
    • Preserves markdown links [text](url)
    • Preserves inline code `code`
    • Preserves code blocks ```code```
    • Prefers split points: paragraphs > sentences > newlines > spaces
  • Updated: /ruff command now uses smart_chunk_text instead of naive chunk_sequence
  • Added: 15 comprehensive tests
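
The split-preference order above can be sketched as follows. This is a minimal, self-contained illustration with a hypothetical name `sketch_chunk_text`; the real implementation lives in `byte_bot.lib.utils` and additionally protects markdown regions:

```python
import re


def sketch_chunk_text(text: str, max_size: int = 1000) -> list[str]:
    """Illustration only: split at the best natural boundary within max_size."""
    if max_size <= 0:
        raise ValueError("max_size must be > 0")
    chunks = []
    while len(text) > max_size:
        window = text[:max_size]
        # Preference order: paragraph break > sentence end > newline > space.
        for pattern in (r"\n\n", r"(?<=[.!?]) ", r"\n", r" "):
            matches = list(re.finditer(pattern, window))
            if matches:
                split = matches[-1].end()
                break
        else:
            split = max_size  # no natural boundary found: hard split
        chunks.append(text[:split].rstrip())
        text = text[split:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

For example, chunking `"para one.\n\npara two is a bit longer."` with `max_size=15` splits first at the paragraph break, then at a space, and every chunk stays within the limit.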

Test plan

  • Markdown links are not broken mid-chunk
  • Code blocks are not broken
  • Inline code is not broken
  • Chunks don't exceed max_size
  • All 1140 tests pass

🤖 Generated with Claude Code

Summary by Sourcery

Add smart markdown-aware text chunking for Ruff rule explanations and wire it into the Astral /ruff command.

New Features:

  • Introduce smart_chunk_text utility to split text into size-limited chunks while preserving markdown links, inline code, and code blocks.

Enhancements:

  • Update Ruff rule embed generation to use smart markdown-aware chunking instead of generic sequence chunking for explanations.

Tests:

  • Add comprehensive unit tests covering smart_chunk_text behavior, including markdown preservation, boundary preferences, and size limits.

Resolves GH #75 - Don't break markdown links in Ruff output.

- Add smart_chunk_text() that respects markdown structures
- Preserves markdown links, inline code, and code blocks
- Prefers natural split points (paragraphs > sentences > newlines)
- Update Ruff command to use smart chunking
- Add 15 tests for the new chunking function

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 22, 2026 14:22
railway-app bot commented Jan 22, 2026

🚅 Deployed to the byte-pr-141 environment in byte

Service: byte · Status: ✅ Success · Web · Updated (UTC): Jan 22, 2026 at 2:28 pm

@railway-app railway-app bot temporarily deployed to byte / byte-pr-141 January 22, 2026 14:22 Destroyed
sourcery-ai bot (Contributor) commented Jan 22, 2026

Reviewer's Guide

Implements a smart markdown-aware text chunking utility and wires it into the Ruff rule embed flow, along with comprehensive unit tests verifying chunk boundaries and markdown preservation.

Sequence diagram for Ruff rule explanation chunking with smart_chunk_text

sequenceDiagram
    actor User
    participant DiscordClient
    participant AstralPlugin as AstralPlugin_ruff_rule
    participant Utils as smart_chunk_text
    participant Embed

    User->>DiscordClient: invoke /ruff rule <code>
    DiscordClient->>AstralPlugin: ruff_rule(interaction, rule)
    AstralPlugin->>AstralPlugin: format_ruff_rule(rule)
    AstralPlugin->>AstralPlugin: build docs_field
    AstralPlugin->>Utils: smart_chunk_text(explanation, 1000)
    Utils-->>AstralPlugin: list_of_chunks
    loop for each chunk in list_of_chunks
        AstralPlugin->>Embed: add_field(name, chunk, inline=False)
    end
    AstralPlugin->>Embed: add_field(Fix, fix_text)
    AstralPlugin->>Embed: add_field(Documentation, docs_field)
    AstralPlugin-->>DiscordClient: send embed response
    DiscordClient-->>User: display Ruff rule embed with intact markdown

Class diagram for utils smart_chunk_text and Astral Ruff plugin integration

classDiagram
    class UtilsModule {
        +_find_protected_regions(text: str) list~tuple~
        +_is_position_protected(pos: int, regions: list~tuple~) bool
        +_find_split_point(text_segment: str, start_offset: int, max_size: int, protected_regions: list~tuple~) int
        +smart_chunk_text(text: str, max_size: int) list~str~
    }

    class AstralPluginModule {
        +ruff_rule(interaction: Interaction, rule: str) None
    }

    AstralPluginModule ..> UtilsModule : uses smart_chunk_text

File-Level Changes

Add markdown-aware smart text chunking utility that avoids splitting protected markdown regions while respecting max_size.
  • Introduce _find_protected_regions helper using a regex to detect code blocks, inline code, and markdown links and return their index ranges.
  • Introduce _is_position_protected helper to determine if a candidate split index falls within any protected region.
  • Implement _find_split_point to choose the best split location within max_size, preferring paragraph, sentence, newline, then space separators, while backing off to any non-protected character if needed.
  • Implement smart_chunk_text that iteratively slices the input text using _find_split_point, trims trailing whitespace, skips leading whitespace in the next chunk, and early-returns when the text is empty or already within max_size.
services/bot/src/byte_bot/lib/utils.py
Use smart_chunk_text for Ruff rule explanation embeds instead of naive sequence chunking.
  • Update astral Ruff plugin imports to pull in smart_chunk_text instead of chunk_sequence for explanation handling.
  • Replace chunk_sequence-based loop with smart_chunk_text so each embed field receives a plain string chunk, preserving markdown structures in explanations.
services/bot/src/byte_bot/plugins/astral.py
Add unit tests covering smart_chunk_text chunking behavior and markdown preservation guarantees.
  • Import smart_chunk_text into the utils test module.
  • Add tests for edge cases like empty input, text within limit, and default max_size behavior.
  • Add tests verifying preferred split hierarchy (paragraphs, sentences, newlines, spaces) and that chunks never exceed max_size.
  • Add tests ensuring markdown links, inline code, and code blocks—including multi-line blocks and Ruff-like real-world content—are not broken across chunks.
tests/unit/bot/lib/test_utils.py
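
The two region helpers described above can be approximated like this. This is a sketch under the assumption that the detection regex covers fenced code blocks, inline code, and markdown links; the actual pattern in `utils.py` may differ:

```python
import re

# Hypothetical pattern; the real regex lives in byte_bot.lib.utils.
PROTECTED = re.compile(
    r"```.*?```"              # fenced code blocks (multi-line via DOTALL)
    r"|`[^`\n]+`"             # inline code
    r"|\[[^\]]+\]\([^)]+\)",  # markdown links [text](url)
    re.DOTALL,
)


def find_protected_regions(text: str) -> list[tuple[int, int]]:
    """Return (start, end) index ranges that a split must not land inside."""
    return [m.span() for m in PROTECTED.finditer(text)]


def is_position_protected(pos: int, regions: list[tuple[int, int]]) -> bool:
    """True if pos falls strictly inside any protected region."""
    return any(start < pos < end for start, end in regions)
```

Because the fenced-block alternative comes first, a ```` ``` ```` fence wins over the inline-backtick pattern when both could match at the same position.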

Assessment against linked issues

Issue #75: Implement smarter chunking for astral embed field text (particularly Ruff rule explanations) so that chunks stay within Discord's embed field limits without breaking markdown structures such as links, inline code, or code blocks, and avoid awkward line breaks.


@sourcery-ai sourcery-ai bot left a comment


Hey, I've found 2 issues and left some high-level feedback:

  • The _is_position_protected lookup is called in tight loops inside _find_split_point and currently scans the full protected_regions list each time; consider keeping protected_regions sorted and advancing an index or using a more efficient interval lookup to avoid quadratic behavior on long texts with many markdown regions.
  • When a single protected region (e.g., a long code block) is longer than max_size and covers the entire first max_size characters, _find_split_point will fall through and return max_size, allowing a split inside that protected region; you may want to special-case this to either allow an oversized chunk for that region or relax protection in a controlled way rather than silently violating the no-split guarantee.
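
The interval-lookup idea from the first point could be sketched with `bisect` (hypothetical helper name; assumes regions are non-overlapping, as regex matches are):

```python
import bisect
from typing import Callable


def make_protected_checker(regions: list[tuple[int, int]]) -> Callable[[int], bool]:
    """Given non-overlapping (start, end) regions, return an O(log n)
    membership test instead of a linear scan per candidate position."""
    regions = sorted(regions)
    starts = [s for s, _ in regions]

    def is_protected(pos: int) -> bool:
        # Find the rightmost region starting at or before pos,
        # then check whether pos falls strictly inside it.
        i = bisect.bisect_right(starts, pos) - 1
        return i >= 0 and regions[i][0] < pos < regions[i][1]

    return is_protected
```

Each lookup is then logarithmic in the number of regions, so repeated probing inside `_find_split_point` no longer degrades quadratically.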
## Individual Comments

### Comment 1
<location> `services/bot/src/byte_bot/lib/utils.py:195-204` </location>
<code_context>
+    return max_size
+
+
+def smart_chunk_text(text: str, max_size: int = 1000) -> list[str]:
+    """Split text into chunks without breaking markdown structures.
+
+    Respects markdown links, inline code, and code blocks. Prefers splitting at
+    natural boundaries: paragraphs > sentences > newlines > spaces.
+
+    Args:
+        text: The text to chunk.
+        max_size: Maximum characters per chunk.
+
+    Returns:
+        List of text chunks.
+    """
+    if not text:
+        return []
+
+    if len(text) <= max_size:
+        return [text]
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Guard against non-positive max_size to avoid an infinite loop.

If `max_size` is 0 or negative, `_find_split_point` returns 0, so `current_pos` never advances and the `while current_pos < len(text)` loop may not terminate. Consider validating `max_size` at the start of `smart_chunk_text`, e.g. raising `ValueError` when `max_size <= 0` or coercing it to at least 1.
</issue_to_address>
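
The suggested guard might look like this. This is a sketch only; the naive tail split stands in for the PR's markdown-aware loop:

```python
def smart_chunk_text(text: str, max_size: int = 1000) -> list[str]:
    """Sketch of the validated entry checks; not the shipped implementation."""
    if max_size <= 0:
        # Without this guard, a zero or negative max_size would stall the
        # chunking loop because the split point could never advance.
        msg = f"max_size must be positive, got {max_size}"
        raise ValueError(msg)
    if not text:
        return []
    if len(text) <= max_size:
        return [text]
    # Placeholder: the real loop calls _find_split_point here.
    return [text[i : i + max_size] for i in range(0, len(text), max_size)]
```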

### Comment 2
<location> `tests/unit/bot/lib/test_utils.py:1029-1036` </location>
<code_context>
+            combined = " ".join(result)
+            assert "`my_function()`" in combined
+
+    def test_preserves_code_blocks(self) -> None:
+        """Test that code blocks are not broken when they fit within max_size."""
+        text = "Example:\n\n```python\ndef foo():\n    pass\n```\n\nEnd of content."
+        result = smart_chunk_text(text, 60)
+        combined = "".join(result)
+        assert "```python" in combined
+        assert "def foo():" in combined
+        assert "```" in combined
+
+    def test_chunks_do_not_exceed_max_size(self) -> None:
</code_context>

<issue_to_address>
**suggestion (testing):** Code block test should ensure the triple backticks and block content stay in the same chunk

Currently this only checks that the markers and content exist in the recombined text, not that they stay together. To more directly validate `smart_chunk_text`, consider asserting that a single chunk contains the opening ``` with language, at least one code line (e.g. `"def foo():"`), and the closing ```; and that no chunk contains an unmatched opening or closing ``` without its pair.

```suggestion
    def test_preserves_code_blocks(self) -> None:
        """Test that code blocks are not broken when they fit within max_size."""
        text = "Example:\n\n```python\ndef foo():\n    pass\n```\n\nEnd of content."
        result = smart_chunk_text(text, 60)

        block_chunk_found = False
        for chunk in result:
            backtick_fence_count = chunk.count("```")
            # No chunk should contain an unmatched code fence
            assert backtick_fence_count in (0, 2)

            if "```python" in chunk:
                # The same chunk that has the opening fence with language
                # should also contain the code line and the closing fence.
                assert "def foo():" in chunk
                assert "```" in chunk
                block_chunk_found = True

        # Ensure we actually found a chunk that contained the whole code block
        assert block_chunk_found
```
</issue_to_address>


- Add max_size validation to prevent infinite loop (raises ValueError if <= 0)
- Improve code block test to verify fences stay in same chunk
- Add test for max_size validation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@railway-app railway-app bot temporarily deployed to byte / byte-pr-141 January 22, 2026 14:27 Destroyed
@JacobCoffee (Owner, Author) commented

Addressed the critical issues in a558827:

1. max_size validation - Now raises ValueError if max_size <= 0 to prevent infinite loop

2. Code block test - Now verifies fences stay in same chunk with matched count assertions

Regarding the other feedback:

  • Performance of _is_position_protected: Acceptable for Discord embed text sizes (typically < 6000 chars with few protected regions). Would only optimize if profiling shows it's a bottleneck.

  • Long code blocks > max_size: The current behavior (allowing oversized chunks) is intentional - it's better to have one large chunk containing the complete code block than to break the markdown. This matches the "best effort" nature of the function.

All 1141 tests pass.
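
That "best effort" tradeoff can be illustrated with a self-contained sketch (hypothetical `chunk_keeping_block`, not the shipped function): a fenced block longer than max_size comes back as one oversized chunk, so the fences stay paired and the markdown still renders.

```python
def chunk_keeping_block(text: str, max_size: int) -> list[str]:
    """Sketch: if the whole text is one fenced block longer than max_size,
    return it as a single oversized chunk rather than splitting the fence."""
    stripped = text.strip()
    if stripped.startswith("```") and stripped.endswith("```") and len(stripped) > max_size:
        return [stripped]  # oversized on purpose: the code block stays intact
    # Otherwise fall back to a plain fixed-size split (stand-in for the real loop).
    return [text[i : i + max_size] for i in range(0, len(text), max_size)]
```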


Copilot AI left a comment


Pull request overview

This pull request implements smart text chunking for Discord embed fields to prevent breaking markdown structures when displaying Ruff linting rule explanations. The implementation addresses issue #75 by ensuring that markdown links, inline code, and code blocks are not split across chunks.

Changes:

  • Added smart_chunk_text() function in services/bot/src/byte_bot/lib/utils.py with intelligent chunking that respects markdown structures
  • Updated the /ruff command in services/bot/src/byte_bot/plugins/astral.py to use smart_chunk_text instead of the naive chunk_sequence approach
  • Added 15 comprehensive unit tests covering various chunking scenarios including edge cases

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Reviewed files:

  • services/bot/src/byte_bot/lib/utils.py: Implements smart_chunk_text() and helper functions for markdown-aware text chunking with regex-based protection of markdown structures
  • services/bot/src/byte_bot/plugins/astral.py: Replaces naive chunking with smart chunking in the Ruff rule display command
  • tests/unit/bot/lib/test_utils.py: Adds 15 test cases validating chunking behavior, markdown preservation, and size limits


for i in range(max_size, search_start, -1):
    if not _is_position_protected(start_offset + i, protected_regions):
        return i


Copilot AI Jan 22, 2026


The fallback logic does not handle the case where a protected region (markdown link, inline code, or code block) exceeds max_size. If all positions from search_start to max_size are protected, the function returns max_size, which could split in the middle of a protected region. Consider adding logic to either skip the entire protected region or handle oversized protected regions explicitly.

Suggested change
    # Fallback: max_size may fall inside a protected region. Try to avoid
    # splitting inside that region by moving the split before or after it.
    absolute_max = start_offset + max_size
    for start, end in protected_regions:
        if start < absolute_max < end:
            # Position lies inside this protected region.
            rel_before = start - start_offset
            rel_after = end - start_offset
            region_len = end - start
            # Prefer splitting immediately before the region if there is
            # any unprotected prefix to keep this intact and ≤ max_size.
            if rel_before > 0:
                return rel_before
            # Region starts at the very beginning of this segment.
            # If the region itself fits within max_size, return its end so
            # the whole region stays intact in a single chunk.
            if region_len <= max_size:
                return rel_after
            # Oversized protected region: cannot keep it intact within
            # max_size, so fall back to max_size as a last resort.
            break

Comment on lines +195 to +207
def smart_chunk_text(text: str, max_size: int = 1000) -> list[str]:
    """Split text into chunks without breaking markdown structures.

    Respects markdown links, inline code, and code blocks. Prefers splitting at
    natural boundaries: paragraphs > sentences > newlines > spaces.

    Args:
        text: The text to chunk.
        max_size: Maximum characters per chunk (must be > 0).

    Returns:
        List of text chunks.

Copilot AI Jan 22, 2026


The function does not validate that max_size is positive. If max_size is 0 or negative, the function could produce unexpected results or enter an infinite loop. Consider adding validation to ensure max_size > 0 at the start of the function.

@github-actions

Documentation preview will be available shortly at https://jacobcoffee.github.io/byte-docs-preview/141

@JacobCoffee JacobCoffee merged commit 319bdd4 into main Jan 22, 2026
5 checks passed
@JacobCoffee JacobCoffee deleted the feature/smart-embed-chunking branch January 22, 2026 14:30

Linked issue: Enhancement(astral): Smart chunking of embed field data