Contributor

@SharafMohamed SharafMohamed commented Jan 26, 2026

Description

This PR adds documentation for the new tagged DFA used in CLP via LogSurgeon:

  • Summarizes the background on DFAs.
  • Explains the additions needed to go from a DFA to a TDFA.
  • Demonstrates how the TDFA is used in compression.
  • Demonstrates how the TDFA is used in search (dynamic programming algorithm).

Summary by CodeRabbit

  • Documentation
    • Added comprehensive documentation on schema-based compression and search workflow to the Dev Docs.
    • Added "Schema" entry to the Dev Docs index for easier navigation.

@SharafMohamed SharafMohamed requested a review from a team as a code owner January 26, 2026 15:29
Contributor

coderabbitai bot commented Jan 26, 2026

Walkthrough

Adds comprehensive documentation for schema-based compression and search functionality. A new documentation file explains schema concepts, NFA/DFA construction, TDFA extensions, register operations, compression pipeline, and search workflow with illustrative examples. Documentation index and toctree entries are updated to reference the new file.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation Index (`docs/src/dev-docs/index.md`) | Adds a new grid-item-card entry for "Schema" and corresponding toctree blocks to link to the schema-based compression and search documentation within the Components section. |
| Schema Documentation (`docs/src/dev-docs/schema-based-compression-and-search.md`) | New comprehensive documentation file detailing schema-based compression and search workflow, including schema definition, NFA/DFA construction, TDFA extensions, register operations, compression pipeline, search procedure, and detailed end-to-end examples with TDFA diagrams and subquery generation. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding documentation about tagged DFA internals with compression and search examples. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@SharafMohamed SharafMohamed changed the title from "doc: Add documentation summarizing the internals of the tagged DFA and illustrating compression and search usage." to "docs: Add documentation summarizing the internals of the tagged DFA and illustrating compression and search usage." on Jan 26, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 42-47: The markdown file has multiple fenced code blocks missing
language identifiers (markdownlint MD040); update each triple-backtick block
(including the table block shown with the header "Variable Name | Regex Pattern
| Input | Match") to include an appropriate language tag (e.g., ```text for
plain tables, ```regex for regex examples, ```yaml where YAML appears, or
```none when no highlighting is desired) so all instances listed in the comment
are annotated consistently and lint errors are resolved.
- Around line 49-50: The documentation contains multiple typos and grammatical
errors that confuse readers; update the text in docs/src/dev-docs/tagged-dfa.md
by correcting misspellings and grammar (e.g., change “durign” to “during”,
“Literal character produce” to “Literal characters produce”, “reasons” to
“reason”, complete the
fragment “Above, we i” into a full sentence, “out-going” to “outgoing”, and
“thats” to “that’s”); apply these fixes consistently in the noted sections
(around the existing sentence that mentions NFA/DFA construction and the other
referenced blocks) and re-read nearby sentences for similar small errors to
ensure clarity and correct pluralization/possessives.
- Around line 151-185: The doc has incomplete sections: finish or remove the
"Capture Groups in Regex", "Tagged NFA", and "Ambiguity and Leftmost-Greedy
Resolution" placeholders—specifically, complete "Capture Groups in Regex" to
describe how regex capture groups map to start/end tags and how those tags are
recorded into registers (referencing final(tag) and intermediate(tag,i)); add a
short "Tagged NFA" subsection explaining how an NFA is augmented with tag
actions on transitions and how those are compiled into TDFA operations; and add
an "Ambiguity and Leftmost-Greedy Resolution" paragraph that defines the
leftmost-greedy tie-breaker and how TDFA resolves ambiguous matches. If you
prefer not to author full text, remove the empty headings so the document
contains only the finished "Tags and Registers in the DFA", "Tagged
Transitions", and "Match Semantics" sections.
- Around line 8-13: Update the table of contents anchor links so they match the
actual heading texts: replace the entries linking to `#5-compression` and
`#6-search` with anchors that correspond to the headings "Compression Example" and
"Search Example" (e.g. use `#compression-example` and `#search-example`), and make
the same fixes for the other occurrences that reference the Compression and
Search headings; ensure the TOC entries that reference "Compression" and
"Search" exactly match the generated anchors for the headings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 56-58: The Markdown heading "### Example Schema" and the adjacent
fenced code block (```regex ... ```) need blank lines before and after them to
satisfy markdownlint; add an empty line above the "### Example Schema" heading,
an empty line between the heading and the opening ```regex fence, and an empty
line after the closing ``` fence (and apply the same to the fenced block at
lines 132-134) so headings and fenced code blocks are separated by
blank lines.
- Around line 20-21: Revise the two bullet definitions so grammar is clear and
parallel: change "- **Variables** are text in the log that contain information
pertinent to the user." to a tighter form like "Variables are text within a log
entry that convey user-relevant information." and change "- **Static-text** is
the remaining, non-variable, text in the log." to a parallel form like "Static
text is the remaining non-variable content of the log." Apply the same
grammatical tightening to the corresponding lines (137-139) to ensure
consistency.
♻️ Duplicate comments (2)
docs/src/dev-docs/tagged-dfa.md (2)

49-50: Fix remaining typos/grammar for readability.
These lingering errors reduce clarity and were flagged earlier; please correct them consistently.

✏️ Proposed edits
-Each regex rule is used to construct an NFA, which is eventually considered durign DFA construction.
+Each regex rule is used to construct an NFA, which is eventually considered during DFA construction.

-- Literal character produce linear sequences of states.
-- Characters classes, quantifiers, and optional segments produce branches.
+- Literal characters produce linear sequences of states.
+- Character classes, quantifiers, and optional segments produce branches.

-At this stage, a single input can lead to multiple possible next states. This nondeterminism is the
-reasons NFAs are inefficient for traversal. For runtime performance, an NFA must be converted to
+At this stage, a single input can lead to multiple possible next states. This nondeterminism is the
+reason NFAs are inefficient for traversal. For runtime performance, an NFA must be converted to

-Above, we i
+Above, we introduce capture groups in regex patterns.

-At various DFA states register value are set or copied into other registers based on the register
-action corresponding to the out-going symbol.
+At various DFA states, register values are set or copied into other registers based on the register
+action corresponding to the outgoing symbol.

-My log has user_id=55 session=AB23 and thats it.
+My log has user_id=55 session=AB23 and that's it.

-My log has user_id=<user_id> session=<session> and thats it.
+My log has user_id=<user_id> session=<session> and that's it.

Also applies to: 76-79, 92-94, 153-154, 167-169, 275-276, 283-284


151-185: Complete or remove placeholder sections.
These empty subsections make the doc feel unfinished and leave key concepts undefined.
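
For the "Capture Groups in Regex" placeholder, a small illustration of the intended behaviour may help. The sketch below uses Python's backtracking `re` module, not a TDFA, so it only mimics the result: each group's start and end offsets stand in for the start/end tag positions that the TDFA would record in registers. The `user_id` rule mirrors the schema quoted in this review; the `session` rule, the `RULES` list, and `to_logtype` are assumptions made for the example.

```python
import re

# Python spells named groups (?P<name>...); the doc's schema uses (?<name>...).
# The user_id rule comes from the schema in the doc; the session rule is hypothetical.
RULES = [
    re.compile(r"user_id=(?P<user_id>\d+)"),
    re.compile(r"session=(?P<session>[A-Za-z0-9]+)"),
]


def to_logtype(line: str):
    """Replace each captured value with <name>; return the logtype and the captures."""
    captures = {}
    spans = []
    for rule in RULES:
        for m in rule.finditer(line):
            for name in rule.groupindex:
                start, end = m.span(name)  # analogous to start/end tag positions
                captures[name] = (start, end, m.group(name))
                spans.append((start, end, name))
    logtype = line
    # Substitute from right to left so earlier offsets stay valid.
    for start, end, name in sorted(spans, reverse=True):
        logtype = logtype[:start] + f"<{name}>" + logtype[end:]
    return logtype, captures
```

Calling `to_logtype` on the example log line from the doc yields the logtype with `<user_id>` and `<session>` placeholders, plus the offsets and text of each captured value.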


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 414-420: Update the subsection heading "D.Grep" to include a space
after the period so it matches the other subsections' format; locate the heading
text "D.Grep against the original query:" and change it to "D. Grep against the
original query:" ensuring the spacing is consistent with subsections A, B, and
C.
- Around line 152-162: Reword two sentences for clarity and grammar: change the
awkward sentence that begins "During TDFA traversal," to a clearer phrasing like
"During TDFA traversal, each tag corresponds to one or more registers:" and fix
the subject-verb agreement in the sentence that begins "At various DFA states
register value are set..." to "At various DFA states, register values are set or
copied into other registers based on the register action corresponding to the
outgoing symbol." Ensure references to final(tag), intermediate(tag,i), TDFA
traversal, and registers remain unchanged.
- Around line 58-64: Replace the invalid Pygments lexer identifier "regex" in
the fenced code blocks with a supported lexer like "text" (or "none") to
eliminate build warnings; locate the triple-backtick blocks that start with
```regex containing the snippet lines "delimiters: \n\r\t", "int:-?\d+",
"float:-?\d+\.\d+", and "tagged_user_id:user_id=(?<user_id>\d+)" and change
their opening fence to ```text (apply the same replacement for the two other
occurrences that contain the same snippet).


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Line 125: The sentence contains a subject-verb agreement error: the plural
noun phrase "no valid transitions" is paired with the singular verb "exists";
update the text fragment "no valid transitions exists" to use the plural verb
"exist" so it reads "no valid transitions exist".
- Around line 268-269: Wrap the character classes that are being parsed as
footnote refs in inline code formatting so MyST/Sphinx won't treat them as
footnotes: replace the plain occurrences of [^0-9] and [^A-Za-z] in the
tagged-dfa.md text with inline code spans (e.g., use backticks around the
expressions) wherever they appear in the sentence about the true end of the
capture so the docs build error is resolved.
- Around line 316-330: Change the sentence to use correct subject-verb agreement
("no valid transitions exist") and remove leading/trailing spaces inside inline
code spans (e.g., change `My ` to `My`, and ensure `R2`, `R1`, `R0`, `S0`, `S8`,
`S9`, `user_id=` are all formatted without extra spaces inside backticks); keep
the rest of the paragraph semantics the same.
- Around line 270-277: Fix inconsistent Markdown list indentation by reducing
the nested list indentation from 4 spaces to 2 spaces so it matches project
linting; update the block describing the final transition (lines mentioning S9,
set R6, and the accepting operations copying R6→R0, setting R1, and copying
negated R4/R5→R2/R3) to use 2-space indents for the nested bullets and their
sub-bullets to ensure consistent rendering and lint compliance.
- Around line 196-217: The fenced ASCII diagram blocks lack a language
identifier; update the opening fence for the diagram shown (the block starting
with ``` on the TDFA/Log Surgeon/CLP diagram) to ```text so Markdownlint passes,
and apply the same change to the other ASCII diagram block referenced (the
second fenced diagram at the later section). Ensure you only add the word text
immediately after the opening triple backticks (```text) for both blocks.
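
The footnote problem with `[^0-9]` and `[^A-Za-z]` is easy to reintroduce, so a quick scan may be worth keeping around. This is a rough sketch under simplifying assumptions: `bare_negated_classes` is a hypothetical helper, fenced blocks are assumed to use triple backticks, and inline code is assumed to use single backticks on one line.

```python
import re

CODE_SPAN = re.compile(r"`[^`\n]*`")           # single-backtick inline code
NEGATED_CLASS = re.compile(r"\[\^[^\]\s]+\]")  # text shaped like [^0-9] or [^A-Za-z]


def bare_negated_classes(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, text) for [^...] patterns outside code spans and fences."""
    hits = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence line
            continue
        if in_fence:
            continue
        visible = CODE_SPAN.sub("", line)  # ignore anything already in backticks
        hits.extend((lineno, m.group(0)) for m in NEGATED_CLASS.finditer(visible))
    return hits
```

Note that the scan also reports genuine footnote references, since they share the same `[^label]` shape, so each hit still needs a manual look.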

@junhaoliao junhaoliao added this to the February 2026 milestone Jan 27, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 366-368: The markdown lacks a blank line before the fenced code
block that begins with ```text after the sentence "After TDFA execution Log
Surgeon produces:", so add a single blank line between that sentence and the
opening ```text fence to satisfy markdown formatting; locate the snippet
containing "After TDFA execution Log Surgeon produces:" and the following
"```text" (and the inner "LogType:") and insert one empty line immediately
before the fenced code block.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/src/dev-docs/schema-based-compression-and-search.md`:
- Around line 340-341: Remove the stray spaces inside inline code spans (e.g.,
change `My ` to `My`) so MD038 no longer flags them; scan the same document for
similar occurrences (the review notes another instance around the `S0`/`My`
spans) and update each inline code span to have no leading or trailing spaces
inside the backticks.
- Around line 133-134: Add a blank line immediately after the heading "###
Acceptance and Match Semantics" so there is an empty line between the heading
and the following paragraph ("A DFA state is marked as **accepting**..."); this
satisfies Markdownlint MD022 by ensuring headings are surrounded by blank lines.
- Around line 138-140: Insert a comma after the phrase "At this level" in the
sentence that currently reads "At this level the DFA only indicates a list of
matches…" so it becomes "At this level, the DFA only indicates a list of
matches…"; edit the sentence in the block starting "A match succeeds if the
given input..." (the clause beginning with "At this level") to add the missing
comma for correct punctuation and readability.

Comment on lines +133 to +134
### Acceptance and Match Semantics
A DFA state is marked as **accepting** if any of its corresponding NFA states are accepting,

⚠️ Potential issue | 🟡 Minor

Add a blank line after the heading.
Markdownlint MD022 expects blank lines around headings.

✏️ Proposed fix
-### Acceptance and Match Semantics
-A DFA state is marked as **accepting** if any of its corresponding NFA states are accepting,
+### Acceptance and Match Semantics
+
+A DFA state is marked as **accepting** if any of its corresponding NFA states are accepting,
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 133-133: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

🤖 Prompt for AI Agents
In `@docs/src/dev-docs/schema-based-compression-and-search.md` around lines 133 -
134, Add a blank line immediately after the heading "### Acceptance and Match
Semantics" so there is an empty line between the heading and the following
paragraph ("A DFA state is marked as **accepting**..."); this satisfies
Markdownlint MD022 by ensuring headings are surrounded by blank lines.

Comment on lines +138 to +140
A match succeeds if the given input has a traversal that does not fail and ends in an accepting
state. At this level the DFA only indicates a list of matches, ordered by variable priority in the
schema. This type of non-tagged DFA does not produce any semantic information beyond the matching

⚠️ Potential issue | 🟡 Minor

Insert the missing comma after “At this level”.
Improves readability and matches standard grammar.

✏️ Proposed edit
-A match succeeds if the given input has a traversal that does not fail and ends in an accepting
-state. At this level the DFA only indicates a list of matches, ordered by variable priority in the
+A match succeeds if the given input has a traversal that does not fail and ends in an accepting
+state. At this level, the DFA only indicates a list of matches, ordered by variable priority in the
🧰 Tools
🪛 LanguageTool

[typographical] ~139-~139: It appears that a comma is missing.
Context: ...and ends in an accepting state. At this level the DFA only indicates a list of matche...

(DURING_THAT_TIME_COMMA)

🤖 Prompt for AI Agents
In `@docs/src/dev-docs/schema-based-compression-and-search.md` around lines 138 -
140, Insert a comma after the phrase "At this level" in the sentence that
currently reads "At this level the DFA only indicates a list of matches…" so it
becomes "At this level, the DFA only indicates a list of matches…"; edit the
sentence in the block starting "A match succeeds if the given input..." (the
clause beginning with "At this level") to add the missing comma for correct
punctuation and readability.
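
The acceptance and match semantics quoted in the two comments above can be sketched with a toy, hand-built DFA. This is illustrative only and is not Log Surgeon's API: DFA states are modelled as sets of NFA state ids, the `int` variable follows the schema's `-?\d+` rule, and the second variable `port` (`\d+`), the state ids, and all function names are assumptions made for the example.

```python
# Hypothetical toy data: accepting NFA states mapped to their schema variable,
# and each variable's priority (a lower number means earlier in the schema).
NFA_ACCEPTING = {1: "int", 2: "port"}
PRIORITY = {"int": 0, "port": 1}

# Each DFA state is the set of NFA states it corresponds to.
S0, NEG = frozenset({0}), frozenset({3})
BOTH, INT_ONLY = frozenset({1, 2}), frozenset({1})

TRANSITIONS = {(S0, "-"): NEG}
for d in "0123456789":
    TRANSITIONS[(S0, d)] = BOTH          # a leading digit fits both rules
    TRANSITIONS[(BOTH, d)] = BOTH
    TRANSITIONS[(NEG, d)] = INT_ONLY     # after '-', only int can still match
    TRANSITIONS[(INT_ONLY, d)] = INT_ONLY


def matched_variables(dfa_state: frozenset) -> list:
    """Variables accepted by a DFA state, ordered by schema priority."""
    names = {NFA_ACCEPTING[s] for s in dfa_state if s in NFA_ACCEPTING}
    return sorted(names, key=PRIORITY.__getitem__)


def match(text: str) -> list:
    """Traverse the DFA; an empty list means the traversal failed or did not accept."""
    state = S0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return []  # no valid transition: the traversal fails
    return matched_variables(state)
```

With this toy automaton, `match("55")` reports both variables in priority order, while `match("-5")` reports only `int`. As the doc notes, an untagged DFA says which variables matched but nothing about where a capture starts or ends.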

Comment on lines +340 to +341
- `My ` is emitted as **static-text** because no valid transitions exist from `S0` for these
characters.

⚠️ Potential issue | 🟡 Minor

Remove spaces inside code spans.
MD038 flags spaces inside backticks.

✏️ Proposed edit
-- `My ` is emitted as **static-text** because no valid transitions exist from `S0` for these
+- `My` is emitted as **static-text** because no valid transitions exist from `S0` for these
   characters.
...
-- Remaining characters, ` line.`, are emitted as static-text.
+- Remaining characters, `line.`, are emitted as static-text.

Also applies to: 359-359

🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 340-340: Spaces inside code span elements

(MD038, no-space-in-code)

🤖 Prompt for AI Agents
In `@docs/src/dev-docs/schema-based-compression-and-search.md` around lines 340 -
341, Remove the stray spaces inside inline code spans (e.g., change `My ` to
`My`) so MD038 no longer flags them; scan the same document for similar
occurrences (the review notes another instance around the `S0`/`My` spans) and
update each inline code span to have no leading or trailing spaces inside the
backticks.

