docs: Add documentation summarizing the internals of the tagged DFA and illustrating compression and search usage. #1908

SharafMohamed · 2026-01-26T15:29:16Z

Description

This PR adds documentation for the new tagged DFA (TDFA) used in CLP via Log Surgeon:

- Summarizes the background on DFAs.
- Explains the additions needed to go from a DFA to a TDFA.
- Demonstrates how the TDFA is used in compression.
- Demonstrates how the TDFA is used in search (dynamic programming algorithm).

Summary by CodeRabbit

Documentation
- Added comprehensive documentation on schema-based compression and search workflow to the Dev Docs.
- Added "Schema" entry to the Dev Docs index for easier navigation.

coderabbitai · 2026-01-26T15:29:41Z

Walkthrough

Adds comprehensive documentation for schema-based compression and search functionality. A new documentation file explains schema concepts, NFA/DFA construction, TDFA extensions, register operations, compression pipeline, and search workflow with illustrative examples. Documentation index and toctree entries are updated to reference the new file.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation Index `docs/src/dev-docs/index.md` | Adds a new grid-item-card entry for "Schema" and corresponding toctree blocks to link to the schema-based compression and search documentation within the Components section. |
| Schema Documentation `docs/src/dev-docs/schema-based-compression-and-search.md` | New comprehensive documentation file detailing schema-based compression and search workflow, including schema definition, NFA/DFA construction, TDFA extensions, register operations, compression pipeline, search procedure, and detailed end-to-end examples with TDFA diagrams and subquery generation. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: adding documentation about tagged DFA internals with compression and search examples. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


coderabbitai

Actionable comments posted: 4

🤖 Fix all issues with AI agents

In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 42-47: The markdown file has multiple fenced code blocks missing
language identifiers (markdownlint MD040); update each triple-backtick block
(including the table block shown with the header "Variable Name | Regex Pattern
| Input | Match") to include an appropriate language tag (e.g., ```text for
plain tables, ```regex for regex examples, ```yaml where YAML appears, or
```none when no highlighting is desired) so all instances listed in the comment
are annotated consistently and lint errors are resolved.
- Around line 49-50: The documentation contains multiple typos and grammatical
errors that confuse readers; update the text in docs/src/dev-docs/tagged-dfa.md
by correcting misspellings and grammar (e.g., change “durign” to “during”,
“Literal character produce” to “Literal characters produce”, “reasons” to “reason”, complete the
fragment “Above, we i” into a full sentence, “out-going” to “outgoing”, and
“thats” to “that’s”); apply these fixes consistently in the noted sections
(around the existing sentence that mentions NFA/DFA construction and the other
referenced blocks) and re-read nearby sentences for similar small errors to
ensure clarity and correct pluralization/possessives.
- Around line 151-185: The doc has incomplete sections: finish or remove the
"Capture Groups in Regex", "Tagged NFA", and "Ambiguity and Leftmost-Greedy
Resolution" placeholders—specifically, complete "Capture Groups in Regex" to
describe how regex capture groups map to start/end tags and how those tags are
recorded into registers (referencing final(tag) and intermediate(tag,i)); add a
short "Tagged NFA" subsection explaining how an NFA is augmented with tag
actions on transitions and how those are compiled into TDFA operations; and add
an "Ambiguity and Leftmost-Greedy Resolution" paragraph that defines the
leftmost-greedy tie-breaker and how TDFA resolves ambiguous matches. If you
prefer not to author full text, remove the empty headings so the document
contains only the finished "Tags and Registers in the DFA", "Tagged
Transitions", and "Match Semantics" sections.
- Around line 8-13: Update the table of contents anchor links so they match the
actual heading texts: replace the entries linking to `#5-compression` and
`#6-search` with anchors that correspond to the headings "Compression Example" and
"Search Example" (e.g. use `#compression-example` and `#search-example`), and make
the same fixes for the other occurrences that reference the Compression and
Search headings; ensure the TOC entries that reference "Compression" and
"Search" exactly match the generated anchors for the headings.
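
For reference while fixing the TOC, here is a minimal sketch of how a GitHub-style renderer derives an anchor from a heading. The function name `heading_to_anchor` and the exact slug rules are assumptions for illustration; the MyST/Sphinx build used for these docs may generate anchors differently, so the rendered output should be checked.

```python
import re


def heading_to_anchor(heading: str) -> str:
    """Approximate a GitHub-style heading anchor (illustrative only)."""
    slug = heading.strip().lower()
    # Drop everything except word characters, spaces, and hyphens.
    slug = re.sub(r"[^\w\- ]", "", slug)
    # Each remaining space becomes a hyphen.
    return "#" + slug.replace(" ", "-")


# A numbered heading and an unnumbered one produce different anchors,
# which is the kind of mismatch the comment above describes.
numbered = heading_to_anchor("5. Compression")
unnumbered = heading_to_anchor("Compression Example")
```

Under these assumed rules, `#5-compression` only resolves if the heading text is literally "5. Compression".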

docs/src/dev-docs/tagged-dfa.md

coderabbitai

Actionable comments posted: 2

🤖 Fix all issues with AI agents

In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 56-58: The Markdown heading "### Example Schema" and the adjacent
fenced code block (```regex ... ```) need blank lines before and after them to
satisfy markdownlint; add an empty line above the "### Example Schema" heading,
an empty line between the heading and the opening ```regex fence, and an empty
line after the closing ``` fence (and apply the same around the fenced block at
lines referenced 132-134) so headings and fenced code blocks are separated by
blank lines.
- Around line 20-21: Revise the two bullet definitions so grammar is clear and
parallel: change "- **Variables** are text in the log that contain information
pertinent to the user." to a tighter form like "Variables are text within a log
entry that convey user-relevant information." and change "- **Static-text** is
the remaining, non-variable, text in the log." to a parallel form like "Static
text is the remaining non-variable content of the log." Apply the same
grammatical tightening to the corresponding lines referenced (137-139) to ensure
consistency.

♻️ Duplicate comments (2)

docs/src/dev-docs/tagged-dfa.md (2)

49-50: Fix remaining typos/grammar for readability.
These lingering errors reduce clarity and were flagged earlier; please correct them consistently.

✏️ Proposed edits

-Each regex rule is used to construct an NFA, which is eventually considered durign DFA construction.
+Each regex rule is used to construct an NFA, which is eventually considered during DFA construction.

-- Literal character produce linear sequences of states.
-- Characters classes, quantifiers, and optional segments produce branches.
+- Literal characters produce linear sequences of states.
+- Character classes, quantifiers, and optional segments produce branches.

-At this stage, a single input can lead to multiple possible next states. This nondeterminism is the
-reasons NFAs are inefficient for traversal. For runtime performance, an NFA must be converted to
+At this stage, a single input can lead to multiple possible next states. This nondeterminism is the
+reason NFAs are inefficient for traversal. For runtime performance, an NFA must be converted to

-Above, we i
+Above, we introduce capture groups in regex patterns.

-At various DFA states register value are set or copied into other registers based on the register
-action corresponding to the out-going symbol.
+At various DFA states, register values are set or copied into other registers based on the register
+action corresponding to the outgoing symbol.

-My log has user_id=55 session=AB23 and thats it.
+My log has user_id=55 session=AB23 and that's it.

-My log has user_id=<user_id> session=<session> and thats it.
+My log has user_id=<user_id> session=<session> and that's it.

Also applies to: 76-79, 92-94, 153-154, 167-169, 275-276, 283-284
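
To make the register sentence in the proposed edit concrete, the sketch below walks a hand-written automaton for `user_id=(?<user_id>\d+)` and applies a "set" action on one transition and "copy" actions on acceptance. The register names (`R_start`, `R_end`, `final_start`, `final_end`) and the state numbering are hypothetical; this is not Log Surgeon's implementation.

```python
PREFIX = "user_id="
DIGITS = "0123456789"


def run_tagged_dfa(text):
    """Return (start, end) of the user_id capture, or None if text does not match."""
    regs = {}                      # register name -> input position
    state = 0                      # states 0..len(PREFIX) walk the literal prefix
    accept = len(PREFIX) + 1       # digit-loop state, the only accepting state
    for pos, ch in enumerate(text):
        if state < len(PREFIX) and ch == PREFIX[state]:
            state += 1             # plain transition, no register action
        elif state == len(PREFIX) and ch in DIGITS:
            regs["R_start"] = pos  # "set": record where the capture begins
            state = accept
        elif state == accept and ch in DIGITS:
            pass                   # stay in the digit loop, no register action
        else:
            return None            # no valid transition
    if state != accept:
        return None
    regs["R_end"] = len(text)      # "set": the capture ends where the input ends
    # "copy": move the working registers into the final ones on acceptance.
    regs["final_start"] = regs["R_start"]
    regs["final_end"] = regs["R_end"]
    return regs["final_start"], regs["final_end"]


span = run_tagged_dfa("user_id=55")
```

Slicing the input with the returned span (`"user_id=55"[8:10]`) recovers the captured value.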

151-185: Complete or remove placeholder sections.
These empty subsections make the doc feel unfinished and leave key concepts undefined.
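
If the "Ambiguity and Leftmost-Greedy Resolution" section is kept, a small runnable illustration may help readers. The sketch below uses Python's backtracking `re` engine, which follows a Perl-style policy commonly described as leftmost-greedy; whether it agrees with Log Surgeon's TDFA in every case is an assumption, not something this PR states.

```python
import re

# Two greedy groups compete for the same characters: the first one takes them all.
first, second = re.fullmatch(r"(a*)(a*)", "aaa").groups()

# Alternation prefers the leftmost alternative that lets the match succeed,
# even though a longer overall match ("ab") is available.
alt = re.match(r"(a|ab)", "ab").group(1)
```

Here `first` is `"aaa"`, `second` is empty, and `alt` is `"a"`; a leftmost-longest (POSIX-style) policy would instead pick `"ab"` for the alternation.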

docs/src/dev-docs/schema-based-compression-and-search.md

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 414-420: Update the subsection heading "D.Grep" to include a space
after the period so it matches the other subsections' format; locate the heading
text "D.Grep against the original query:" and change it to "D. Grep against the
original query:" ensuring the spacing is consistent with subsections A, B, and
C.
- Around line 152-162: Reword two sentences for clarity and grammar: change the
awkward sentence that begins "During TDFA traversal," to a clearer phrasing like
"During TDFA traversal, each tag corresponds to one or more registers:" and fix
the subject-verb agreement in the sentence that begins "At various DFA states
register value are set..." to "At various DFA states, register values are set or
copied into other registers based on the register action corresponding to the
outgoing symbol." Ensure references to final(tag), intermediate(tag,i), TDFA
traversal, and registers remain unchanged.
- Around line 58-64: Replace the invalid Pygments lexer identifier "regex" in
the fenced code blocks with a supported lexer like "text" (or "none") to
eliminate build warnings; locate the triple-backtick blocks that start with
```regex containing the snippet lines "delimiters: \n\r\t", "int:-?\d+",
"float:-?\d+\.\d+", and "tagged_user_id:user_id=(?<user_id>\d+)" and change
their opening fence to ```text (apply the same replacement for the two other
occurrences that contain the same snippet).
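
As a quick way to sanity-check the schema snippet quoted above, the sketch below loads the three variable rules into Python's `re` module. Two assumptions: the `delimiters` line is left out for brevity, and `(?<name>...)` is rewritten to `(?P<name>...)` because that is how Python spells named groups. The helper `load_rules` is hypothetical and unrelated to Log Surgeon's parser.

```python
import re

SCHEMA = r"""
int:-?\d+
float:-?\d+\.\d+
tagged_user_id:user_id=(?<user_id>\d+)
"""


def load_rules(schema_text):
    """Split each `name:regex` line and compile the regex with Python's `re`."""
    rules = {}
    for line in schema_text.strip().splitlines():
        name, pattern = line.split(":", 1)
        # Translate (?<name>...) into Python's (?P<name>...) syntax.
        pattern = re.sub(r"\(\?<([A-Za-z_]\w*)>", r"(?P<\1>", pattern)
        rules[name] = re.compile(pattern)
    return rules


rules = load_rules(SCHEMA)
captured = rules["tagged_user_id"].fullmatch("user_id=55").group("user_id")
```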

docs/src/dev-docs/tagged-dfa.md

coderabbitai

Actionable comments posted: 5

🤖 Fix all issues with AI agents

In `@docs/src/dev-docs/tagged-dfa.md`:
- Line 125: The sentence contains a subject-verb agreement error: the plural
noun phrase "no valid transitions" is paired with the singular verb "exists";
update the text fragment "no valid transitions exists" to use the plural verb
"exist" so it reads "no valid transitions exist".
- Around line 268-269: Wrap the character classes that are being parsed as
footnote refs in inline code formatting so MyST/Sphinx won't treat them as
footnotes: replace the plain occurrences of [^0-9] and [^A-Za-z] in the
tagged-dfa.md text with inline code spans (e.g., use backticks around the
expressions) wherever they appear in the sentence about the true end of the
capture so the docs build error is resolved.
- Around line 316-330: Change the sentence to use correct subject-verb agreement
("no valid transitions exist") and remove leading/trailing spaces inside inline
code spans (e.g., change `My ` to `My`, and ensure `R2`, `R1`, `R0`, `S0`, `S8`,
`S9`, `user_id=` are all formatted without extra spaces inside backticks); keep
the rest of the paragraph semantics the same.
- Around line 270-277: Fix inconsistent Markdown list indentation by reducing
the nested list indentation from 4 spaces to 2 spaces so it matches project
linting; update the block describing the final transition (lines mentioning S9,
set R6, and the accepting operations copying R6→R0, setting R1, and copying
negated R4/R5→R2/R3) to use 2-space indents for the nested bullets and their
sub-bullets to ensure consistent rendering and lint compliance.
- Around line 196-217: The fenced ASCII diagram blocks lack a language
identifier; update the opening fence for the diagram shown (the block starting
with ``` on the TDFA/Log Surgeon/CLP diagram) to ```text so Markdownlint passes,
and apply the same change to the other ASCII diagram block referenced (the
second fenced diagram at the later section). Ensure you only add the word text
immediately after the opening triple backticks (```text) for both blocks.

docs/src/dev-docs/tagged-dfa.md

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@docs/src/dev-docs/tagged-dfa.md`:
- Around line 366-368: The markdown lacks a blank line before the fenced code
block that begins with ```text after the sentence "After TDFA execution Log
Surgeon produces:", so add a single blank line between that sentence and the
opening ```text fence to satisfy markdown formatting; locate the snippet
containing "After TDFA execution Log Surgeon produces:" and the following
"```text" (and the inner "LogType:") and insert one empty line immediately
before the fenced code block.

docs/src/dev-docs/schema-based-compression-and-search.md

coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@docs/src/dev-docs/schema-based-compression-and-search.md`:
- Around line 340-341: Remove the stray spaces inside inline code spans (e.g.,
change `My ` to `My`) so MD038 no longer flags them; scan the same document for
similar occurrences (the review notes another instance around the `S0`/`My`
spans) and update each inline code span to have no leading or trailing spaces
inside the backticks.
- Around line 133-134: Add a blank line immediately after the heading "###
Acceptance and Match Semantics" so there is an empty line between the heading
and the following paragraph ("A DFA state is marked as **accepting**..."); this
satisfies Markdownlint MD022 by ensuring headings are surrounded by blank lines.
- Around line 138-140: Insert a comma after the phrase "At this level" in the
sentence that currently reads "At this level the DFA only indicates a list of
matches…" so it becomes "At this level, the DFA only indicates a list of
matches…"; edit the sentence in the block starting "A match succeeds if the
given input..." (the clause beginning with "At this level") to add the missing
comma for correct punctuation and readability.

coderabbitai · 2026-02-05T13:46:19Z

docs/src/dev-docs/schema-based-compression-and-search.md

+### Acceptance and Match Semantics
+A DFA state is marked as **accepting** if any of its corresponding NFA states are accepting,


⚠️ Potential issue | 🟡 Minor

Add a blank line after the heading.
Markdownlint MD022 expects blank lines around headings.

✏️ Proposed fix

-### Acceptance and Match Semantics
-A DFA state is marked as **accepting** if any of its corresponding NFA states are accepting,
+### Acceptance and Match Semantics
+
+A DFA state is marked as **accepting** if any of its corresponding NFA states are accepting,
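
For readers of the sentence under review, a minimal subset-construction sketch shows what "any of its corresponding NFA states are accepting" means in practice. The toy NFA (for `a|ab`, without epsilon transitions) and all names are hypothetical and chosen only for illustration.

```python
from collections import deque

# Toy NFA for the pattern a|ab: (state, symbol) -> set of next states.
NFA = {
    (0, "a"): {1, 2},
    (2, "b"): {3},
}
NFA_ACCEPTING = {1, 3}


def subset_construct(start, nfa, accepting, alphabet):
    """Build DFA states as frozensets of NFA states and mark the accepting ones."""
    start_set = frozenset([start])
    transitions, accepting_sets = {}, set()
    queue, seen = deque([start_set]), {start_set}
    while queue:
        current = queue.popleft()
        # A DFA state accepts if it contains at least one accepting NFA state.
        if current & accepting:
            accepting_sets.add(current)
        for sym in alphabet:
            target = frozenset(t for s in current for t in nfa.get((s, sym), ()))
            if not target:
                continue  # no valid transition on this symbol
            transitions[(current, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return transitions, accepting_sets


transitions, accepting_sets = subset_construct(0, NFA, NFA_ACCEPTING, "ab")
```

The DFA state `{1, 2}` is accepting because it contains NFA state 1, even though NFA state 2 is not accepting.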


🧰 Tools

🪛 markdownlint-cli2 (0.20.0)

[warning] 133-133: Headings should be surrounded by blank lines
Expected: 1; Actual: 0; Below

(MD022, blanks-around-headings)

coderabbitai · 2026-02-05T13:46:19Z

docs/src/dev-docs/schema-based-compression-and-search.md

+A match succeeds if the given input has a traversal that does not fail and ends in an accepting
+state. At this level the DFA only indicates a list of matches, ordered by variable priority in the
+schema. This type of non-tagged DFA does not produce any semantic information beyond the matching


⚠️ Potential issue | 🟡 Minor

Insert the missing comma after “At this level”.
Improves readability and matches standard grammar.

✏️ Proposed edit

-A match succeeds if the given input has a traversal that does not fail and ends in an accepting
-state. At this level the DFA only indicates a list of matches, ordered by variable priority in the
+A match succeeds if the given input has a traversal that does not fail and ends in an accepting
+state. At this level, the DFA only indicates a list of matches, ordered by variable priority in the
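
The "list of matches, ordered by variable priority" can be pictured with the sketch below, which uses Python's `re` as a stand-in for the untagged DFA. The `int` and `float` rules come from the schema quoted earlier in this thread; the `hex` rule is hypothetical and added only so that one token matches more than one variable.

```python
import re

# Rules in schema order; earlier entries have higher priority.
RULES = [
    ("int", r"-?\d+"),
    ("float", r"-?\d+\.\d+"),
    ("hex", r"[0-9a-f]+"),  # hypothetical rule, not part of the documented schema
]


def matching_variables(token):
    """Return the names of every rule that fully matches the token, in priority order."""
    return [name for name, pattern in RULES if re.fullmatch(pattern, token)]


both = matching_variables("55")
only_float = matching_variables("5.5")
nothing = matching_variables("xyz")
```

The result carries variable names only, with no capture positions, which is the gap the tagged DFA fills.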


🧰 Tools

🪛 LanguageTool

[typographical] ~139-~139: It appears that a comma is missing.
Context: ...and ends in an accepting state. At this level the DFA only indicates a list of matche...

(DURING_THAT_TIME_COMMA)

coderabbitai · 2026-02-05T13:46:19Z

docs/src/dev-docs/schema-based-compression-and-search.md

+- `My ` is emitted as **static-text** because no valid transitions exist from `S0` for these
+  characters.


⚠️ Potential issue | 🟡 Minor

Remove spaces inside code spans.
MD038 flags spaces inside backticks.

✏️ Proposed edit

-- `My ` is emitted as **static-text** because no valid transitions exist from `S0` for these
+- `My` is emitted as **static-text** because no valid transitions exist from `S0` for these
   characters.
...
-- Remaining characters, ` line.`, are emitted as static-text.
+- Remaining characters, `line.`, are emitted as static-text.
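
To show how static text and captured variables separate, the sketch below reproduces the logtype from the earlier example line in this thread (`My log has user_id=55 session=AB23 and that's it.`). It uses Python's `re` rather than a TDFA, and the `session` pattern is an assumption, since only the `user_id` rule appears in the quoted schema.

```python
import re

# One alternative per tagged rule; the session pattern is assumed for illustration.
CAPTURE_RULES = re.compile(
    r"user_id=(?P<user_id>\d+)|session=(?P<session>[A-Za-z0-9]+)"
)


def encode(line):
    """Split a log line into a logtype (static text plus placeholders) and variables."""
    logtype, variables, cursor = [], [], 0
    for match in CAPTURE_RULES.finditer(line):
        name = match.lastgroup             # the capture group that matched
        start, end = match.span(name)
        logtype.append(line[cursor:start])  # static text, including "user_id="
        logtype.append(f"<{name}>")         # placeholder for the captured value
        variables.append((name, match.group(name)))
        cursor = end
    logtype.append(line[cursor:])           # trailing static text
    return "".join(logtype), variables


logtype, variables = encode("My log has user_id=55 session=AB23 and that's it.")
```

Everything outside a capture, including the leading `My log has `, lands in the logtype unchanged.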

Also applies to: 359-359

🧰 Tools

🪛 markdownlint-cli2 (0.20.0)

[warning] 340-340: Spaces inside code span elements

(MD038, no-space-in-code)

SharafMohamed added 2 commits January 26, 2026 09:49

Doc added.

fec5a7f

Add search example.

1d76e0a

SharafMohamed requested a review from a team as a code owner January 26, 2026 15:29

SharafMohamed changed the title ~~doc: Add documentation summarizing the internals of the tagged DFA and illustrating compression and search usage.~~ docs: Add documentation summarizing the internals of the tagged DFA and illustrating compression and search usage. Jan 26, 2026

coderabbitai bot reviewed Jan 26, 2026

Add labels.

37ea775

coderabbitai bot reviewed Jan 26, 2026

SharafMohamed added 4 commits January 26, 2026 10:50

Fix typos.

7a7c290

Add new line before heading fence.

e4e8524

Remove unused headings.

a2dfafd

Update search section.

9c83157

coderabbitai bot reviewed Jan 27, 2026

Update compression example.

f3a24dd

coderabbitai bot reviewed Jan 27, 2026

junhaoliao added this to the February 2026 milestone Jan 27, 2026

SharafMohamed added 3 commits January 27, 2026 19:35

Update compression example execution table and explanation; Update TDFA section to explain lookahead and operations better.

632aad1

Switch to none for codeblocks.

513d16e

Fix heading spacing.

baf6a10

coderabbitai bot reviewed Jan 28, 2026

SharafMohamed added 4 commits January 27, 2026 19:39

Grammar fix.

bdfdd45

Fix accidental footnote.

2fc8e26

Fix indentation.

c171930

Grammar fix.

bb7fc90

SharafMohamed requested a review from davidlion January 28, 2026 00:44

SharafMohamed added 3 commits January 27, 2026 19:46

Change title and overview statement.

6b96a5a

Rename readme.

5b03a29

Add new doc to index.

d4605ba

coderabbitai bot reviewed Feb 5, 2026
