
Conversation

@JacobCoffee (Owner) commented Jan 22, 2026

Summary

Changes

  • New function: smart_chunk_text(text, max_size=1000) in services/bot/src/byte_bot/lib/utils.py
    • Preserves markdown links [text](url)
    • Preserves inline code `code`
    • Preserves code blocks ```code```
    • Prefers split points: paragraphs > sentences > newlines > spaces
  • Updated: /ruff command now uses smart_chunk_text instead of naive chunk_sequence
  • Added: 15 comprehensive tests
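
The split-preference order above can be sketched as follows. This is a minimal, self-contained illustration with a hypothetical name `sketch_chunk_text`; the real implementation lives in `byte_bot.lib.utils` and additionally protects markdown regions:

```python
import re


def sketch_chunk_text(text: str, max_size: int = 1000) -> list[str]:
    """Illustration only: split at the best natural boundary within max_size."""
    if max_size <= 0:
        raise ValueError("max_size must be > 0")
    chunks = []
    while len(text) > max_size:
        window = text[:max_size]
        # Preference order: paragraph break > sentence end > newline > space.
        for pattern in (r"\n\n", r"(?<=[.!?]) ", r"\n", r" "):
            matches = list(re.finditer(pattern, window))
            if matches:
                split = matches[-1].end()
                break
        else:
            split = max_size  # no natural boundary found: hard split
        chunks.append(text[:split].rstrip())
        text = text[split:].lstrip()
    if text:
        chunks.append(text)
    return chunks
```

For example, chunking `"para one.\n\npara two is a bit longer."` with `max_size=15` splits first at the paragraph break, then at a space, and every chunk stays within the limit.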

Test plan

  • Markdown links are not broken mid-chunk
  • Code blocks are not broken
  • Inline code is not broken
  • Chunks don't exceed max_size
  • All 1140 tests pass

🤖 Generated with Claude Code

Summary by Sourcery

Add smart markdown-aware text chunking for Ruff rule explanations and wire it into the Astral /ruff command.

New Features:

  • Introduce smart_chunk_text utility to split text into size-limited chunks while preserving markdown links, inline code, and code blocks.

Enhancements:

  • Update Ruff rule embed generation to use smart markdown-aware chunking instead of generic sequence chunking for explanations.

Tests:

  • Add comprehensive unit tests covering smart_chunk_text behavior, including markdown preservation, boundary preferences, and size limits.

Resolves GH #75 - Don't break markdown links in Ruff output.

- Add smart_chunk_text() that respects markdown structures
- Preserves markdown links, inline code, and code blocks
- Prefers natural split points (paragraphs > sentences > newlines)
- Update Ruff command to use smart chunking
- Add 15 tests for the new chunking function

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings January 22, 2026 14:22
railway-app bot commented Jan 22, 2026

🚅 Deployed to the byte-pr-141 environment in byte

Service: byte · Status: ✅ Success · Web · Updated (UTC): Jan 22, 2026 at 2:28 pm

@railway-app railway-app bot temporarily deployed to byte / byte-pr-141 January 22, 2026 14:22 Destroyed
sourcery-ai bot (Contributor) commented Jan 22, 2026

Reviewer's Guide

Implements a smart markdown-aware text chunking utility and wires it into the Ruff rule embed flow, along with comprehensive unit tests verifying chunk boundaries and markdown preservation.

Sequence diagram for Ruff rule explanation chunking with smart_chunk_text

sequenceDiagram
    actor User
    participant DiscordClient
    participant AstralPlugin as AstralPlugin_ruff_rule
    participant Utils as smart_chunk_text
    participant Embed

    User->>DiscordClient: invoke /ruff rule <code>
    DiscordClient->>AstralPlugin: ruff_rule(interaction, rule)
    AstralPlugin->>AstralPlugin: format_ruff_rule(rule)
    AstralPlugin->>AstralPlugin: build docs_field
    AstralPlugin->>Utils: smart_chunk_text(explanation, 1000)
    Utils-->>AstralPlugin: list_of_chunks
    loop for each chunk in list_of_chunks
        AstralPlugin->>Embed: add_field(name, chunk, inline=False)
    end
    AstralPlugin->>Embed: add_field(Fix, fix_text)
    AstralPlugin->>Embed: add_field(Documentation, docs_field)
    AstralPlugin-->>DiscordClient: send embed response
    DiscordClient-->>User: display Ruff rule embed with intact markdown

Class diagram for utils smart_chunk_text and Astral Ruff plugin integration

classDiagram
    class UtilsModule {
        +_find_protected_regions(text: str) list~tuple~
        +_is_position_protected(pos: int, regions: list~tuple~) bool
        +_find_split_point(text_segment: str, start_offset: int, max_size: int, protected_regions: list~tuple~) int
        +smart_chunk_text(text: str, max_size: int) list~str~
    }

    class AstralPluginModule {
        +ruff_rule(interaction: Interaction, rule: str) None
    }

    AstralPluginModule ..> UtilsModule : uses smart_chunk_text

File-Level Changes

Add markdown-aware smart text chunking utility that avoids splitting protected markdown regions while respecting max_size.
  • Introduce _find_protected_regions helper using a regex to detect code blocks, inline code, and markdown links and return their index ranges.
  • Introduce _is_position_protected helper to determine if a candidate split index falls within any protected region.
  • Implement _find_split_point to choose the best split location within max_size, preferring paragraph, sentence, newline, then space separators, while backing off to any non-protected character if needed.
  • Implement smart_chunk_text that iteratively slices the input text using _find_split_point, trims trailing whitespace, skips leading whitespace in the next chunk, and early-returns when the text is empty or already within max_size.
services/bot/src/byte_bot/lib/utils.py
Use smart_chunk_text for Ruff rule explanation embeds instead of naive sequence chunking.
  • Update astral Ruff plugin imports to pull in smart_chunk_text instead of chunk_sequence for explanation handling.
  • Replace chunk_sequence-based loop with smart_chunk_text so each embed field receives a plain string chunk, preserving markdown structures in explanations.
services/bot/src/byte_bot/plugins/astral.py
Add unit tests covering smart_chunk_text chunking behavior and markdown preservation guarantees.
  • Import smart_chunk_text into the utils test module.
  • Add tests for edge cases like empty input, text within limit, and default max_size behavior.
  • Add tests verifying preferred split hierarchy (paragraphs, sentences, newlines, spaces) and that chunks never exceed max_size.
  • Add tests ensuring markdown links, inline code, and code blocks—including multi-line blocks and Ruff-like real-world content—are not broken across chunks.
tests/unit/bot/lib/test_utils.py
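
The two region helpers described above can be approximated like this. This is a sketch under the assumption that the detection regex covers fenced code blocks, inline code, and markdown links; the actual pattern in `utils.py` may differ:

```python
import re

# Hypothetical pattern; the real regex lives in byte_bot.lib.utils.
PROTECTED = re.compile(
    r"```.*?```"              # fenced code blocks (multi-line via DOTALL)
    r"|`[^`\n]+`"             # inline code
    r"|\[[^\]]+\]\([^)]+\)",  # markdown links [text](url)
    re.DOTALL,
)


def find_protected_regions(text: str) -> list[tuple[int, int]]:
    """Return (start, end) index ranges that a split must not land inside."""
    return [m.span() for m in PROTECTED.finditer(text)]


def is_position_protected(pos: int, regions: list[tuple[int, int]]) -> bool:
    """True if pos falls strictly inside any protected region."""
    return any(start < pos < end for start, end in regions)
```

Because the fenced-block alternative comes first, a ```` ``` ```` fence wins over the inline-backtick pattern when both could match at the same position.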

Assessment against linked issues

Issue #75: Implement smarter chunking for astral embed field text (particularly Ruff rule explanations) so that chunks stay within Discord's embed field limits without breaking markdown structures such as links, inline code, or code blocks, and avoid awkward line breaks.


@sourcery-ai sourcery-ai bot left a comment


Hey, I've found 2 issues and left some high-level feedback:

  • The _is_position_protected lookup is called in tight loops inside _find_split_point and currently scans the full protected_regions list each time; consider keeping protected_regions sorted and advancing an index or using a more efficient interval lookup to avoid quadratic behavior on long texts with many markdown regions.
  • When a single protected region (e.g., a long code block) is longer than max_size and covers the entire first max_size characters, _find_split_point will fall through and return max_size, allowing a split inside that protected region; you may want to special-case this to either allow an oversized chunk for that region or relax protection in a controlled way rather than silently violating the no-split guarantee.
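
The interval-lookup idea from the first point could be sketched with `bisect` (hypothetical helper name; assumes regions are non-overlapping, as regex matches are):

```python
import bisect
from typing import Callable


def make_protected_checker(regions: list[tuple[int, int]]) -> Callable[[int], bool]:
    """Given non-overlapping (start, end) regions, return an O(log n)
    membership test instead of a linear scan per candidate position."""
    regions = sorted(regions)
    starts = [s for s, _ in regions]

    def is_protected(pos: int) -> bool:
        # Find the rightmost region starting at or before pos,
        # then check whether pos falls strictly inside it.
        i = bisect.bisect_right(starts, pos) - 1
        return i >= 0 and regions[i][0] < pos < regions[i][1]

    return is_protected
```

Each lookup is then logarithmic in the number of regions, so repeated probing inside `_find_split_point` no longer degrades quadratically.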
## Individual Comments

### Comment 1
<location> `services/bot/src/byte_bot/lib/utils.py:195-204` </location>
<code_context>
+    return max_size
+
+
+def smart_chunk_text(text: str, max_size: int = 1000) -> list[str]:
+    """Split text into chunks without breaking markdown structures.
+
+    Respects markdown links, inline code, and code blocks. Prefers splitting at
+    natural boundaries: paragraphs > sentences > newlines > spaces.
+
+    Args:
+        text: The text to chunk.
+        max_size: Maximum characters per chunk.
+
+    Returns:
+        List of text chunks.
+    """
+    if not text:
+        return []
+
+    if len(text) <= max_size:
+        return [text]
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Guard against non-positive max_size to avoid an infinite loop.

If `max_size` is 0 or negative, `_find_split_point` returns 0, so `current_pos` never advances and the `while current_pos < len(text)` loop may not terminate. Consider validating `max_size` at the start of `smart_chunk_text`, e.g. raising `ValueError` when `max_size <= 0` or coercing it to at least 1.
</issue_to_address>
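
The suggested guard might look like this. This is a sketch only; the naive tail split stands in for the PR's markdown-aware loop:

```python
def smart_chunk_text(text: str, max_size: int = 1000) -> list[str]:
    """Sketch of the validated entry checks; not the shipped implementation."""
    if max_size <= 0:
        # Without this guard, a zero or negative max_size would stall the
        # chunking loop because the split point could never advance.
        msg = f"max_size must be positive, got {max_size}"
        raise ValueError(msg)
    if not text:
        return []
    if len(text) <= max_size:
        return [text]
    # Placeholder: the real loop calls _find_split_point here.
    return [text[i : i + max_size] for i in range(0, len(text), max_size)]
```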

### Comment 2
<location> `tests/unit/bot/lib/test_utils.py:1029-1036` </location>
<code_context>
+            combined = " ".join(result)
+            assert "`my_function()`" in combined
+
+    def test_preserves_code_blocks(self) -> None:
+        """Test that code blocks are not broken when they fit within max_size."""
+        text = "Example:\n\n```python\ndef foo():\n    pass\n```\n\nEnd of content."
+        result = smart_chunk_text(text, 60)
+        combined = "".join(result)
+        assert "```python" in combined
+        assert "def foo():" in combined
+        assert "```" in combined
+
+    def test_chunks_do_not_exceed_max_size(self) -> None:
</code_context>

<issue_to_address>
**suggestion (testing):** Code block test should ensure the triple backticks and block content stay in the same chunk

Currently this only checks that the markers and content exist in the recombined text, not that they stay together. To more directly validate `smart_chunk_text`, consider asserting that a single chunk contains the opening ``` with language, at least one code line (e.g. `"def foo():"`), and the closing ```; and that no chunk contains an unmatched opening or closing ``` without its pair.

```suggestion
    def test_preserves_code_blocks(self) -> None:
        """Test that code blocks are not broken when they fit within max_size."""
        text = "Example:\n\n```python\ndef foo():\n    pass\n```\n\nEnd of content."
        result = smart_chunk_text(text, 60)

        block_chunk_found = False
        for chunk in result:
            backtick_fence_count = chunk.count("```")
            # No chunk should contain an unmatched code fence
            assert backtick_fence_count in (0, 2)

            if "```python" in chunk:
                # The same chunk that has the opening fence with language
                # should also contain the code line and the closing fence.
                assert "def foo():" in chunk
                assert "```" in chunk
                block_chunk_found = True

        # Ensure we actually found a chunk that contained the whole code block
        assert block_chunk_found
```
</issue_to_address>


- Add max_size validation to prevent infinite loop (raises ValueError if <= 0)
- Improve code block test to verify fences stay in same chunk
- Add test for max_size validation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@railway-app railway-app bot temporarily deployed to byte / byte-pr-141 January 22, 2026 14:27 Destroyed
@JacobCoffee (Owner, Author) commented

Addressed the critical issues in a558827:

1. max_size validation - Now raises ValueError if max_size <= 0 to prevent infinite loop

2. Code block test - Now verifies fences stay in same chunk with matched count assertions

Regarding the other feedback:

  • Performance of _is_position_protected: Acceptable for Discord embed text sizes (typically < 6000 chars with few protected regions). Would only optimize if profiling shows it's a bottleneck.

  • Long code blocks > max_size: The current behavior (allowing oversized chunks) is intentional - it's better to have one large chunk containing the complete code block than to break the markdown. This matches the "best effort" nature of the function.

All 1141 tests pass.
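
That "best effort" tradeoff can be illustrated with a self-contained sketch (hypothetical `chunk_keeping_block`, not the shipped function): a fenced block longer than max_size comes back as one oversized chunk, so the fences stay paired and the markdown still renders.

```python
def chunk_keeping_block(text: str, max_size: int) -> list[str]:
    """Sketch: if the whole text is one fenced block longer than max_size,
    return it as a single oversized chunk rather than splitting the fence."""
    stripped = text.strip()
    if stripped.startswith("```") and stripped.endswith("```") and len(stripped) > max_size:
        return [stripped]  # oversized on purpose: the code block stays intact
    # Otherwise fall back to a plain fixed-size split (stand-in for the real loop).
    return [text[i : i + max_size] for i in range(0, len(text), max_size)]
```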


Copilot AI left a comment


Pull request overview

This pull request implements smart text chunking for Discord embed fields to prevent breaking markdown structures when displaying Ruff linting rule explanations. The implementation addresses issue #75 by ensuring that markdown links, inline code, and code blocks are not split across chunks.

Changes:

  • Added smart_chunk_text() function in services/bot/src/byte_bot/lib/utils.py with intelligent chunking that respects markdown structures
  • Updated the /ruff command in services/bot/src/byte_bot/plugins/astral.py to use smart_chunk_text instead of the naive chunk_sequence approach
  • Added 15 comprehensive unit tests covering various chunking scenarios including edge cases

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Reviewed files:

  • services/bot/src/byte_bot/lib/utils.py: Implements smart_chunk_text() and helper functions for markdown-aware text chunking with regex-based protection of markdown structures
  • services/bot/src/byte_bot/plugins/astral.py: Replaces naive chunking with smart chunking in the Ruff rule display command
  • tests/unit/bot/lib/test_utils.py: Adds 15 test cases validating chunking behavior, markdown preservation, and size limits


for i in range(max_size, search_start, -1):
    if not _is_position_protected(start_offset + i, protected_regions):
        return i


Copilot AI Jan 22, 2026


The fallback logic does not handle the case where a protected region (markdown link, inline code, or code block) exceeds max_size. If all positions from search_start to max_size are protected, the function returns max_size, which could split in the middle of a protected region. Consider adding logic to either skip the entire protected region or handle oversized protected regions explicitly.

Suggested change
    # Fallback: max_size may fall inside a protected region. Try to avoid
    # splitting inside that region by moving the split before or after it.
    absolute_max = start_offset + max_size
    for start, end in protected_regions:
        if start < absolute_max < end:
            # Position lies inside this protected region.
            rel_before = start - start_offset
            rel_after = end - start_offset
            region_len = end - start
            # Prefer splitting immediately before the region if there is
            # any unprotected prefix to keep this intact and ≤ max_size.
            if rel_before > 0:
                return rel_before
            # Region starts at the very beginning of this segment.
            # If the region itself fits within max_size, return its end so
            # the whole region stays intact in a single chunk.
            if region_len <= max_size:
                return rel_after
            # Oversized protected region: cannot keep it intact within
            # max_size, so fall back to max_size as a last resort.
            break

Comment on lines +195 to +207
def smart_chunk_text(text: str, max_size: int = 1000) -> list[str]:
    """Split text into chunks without breaking markdown structures.

    Respects markdown links, inline code, and code blocks. Prefers splitting at
    natural boundaries: paragraphs > sentences > newlines > spaces.

    Args:
        text: The text to chunk.
        max_size: Maximum characters per chunk (must be > 0).

    Returns:
        List of text chunks.

Copilot AI Jan 22, 2026


The function does not validate that max_size is positive. If max_size is 0 or negative, the function could produce unexpected results or enter an infinite loop. Consider adding validation to ensure max_size > 0 at the start of the function.

@github-actions

Documentation preview will be available shortly at https://jacobcoffee.github.io/byte-docs-preview/141

@JacobCoffee JacobCoffee merged commit 319bdd4 into main Jan 22, 2026
5 checks passed
@JacobCoffee JacobCoffee deleted the feature/smart-embed-chunking branch January 22, 2026 14:30

Linked issue: Enhancement(astral): Smart chunking of embed field data