diff --git a/docs/src/dev-docs/index.md b/docs/src/dev-docs/index.md
index c9146763b1..ab837cd113 100644
--- a/docs/src/dev-docs/index.md
+++ b/docs/src/dev-docs/index.md
@@ -36,6 +36,13 @@ Components
 Docs specific to each component.
 :::
 
+:::{grid-item-card}
+:link: schema-based-compression-and-search
+Schema
+^^^
+Docs about compression and search using user-defined schemas.
+:::
+
 :::{grid-item-card}
 :link: tooling-containers
 Tooling
@@ -82,6 +89,13 @@ components-core/index
 components-webui
 :::
 
+:::{toctree}
+:caption: Schema
+:hidden:
+
+schema-based-compression-and-search
+:::
+
 :::{toctree}
 :caption: Tooling
 :hidden:
diff --git a/docs/src/dev-docs/schema-based-compression-and-search.md b/docs/src/dev-docs/schema-based-compression-and-search.md
new file mode 100644
index 0000000000..754e8ca551
--- /dev/null
+++ b/docs/src/dev-docs/schema-based-compression-and-search.md
@@ -0,0 +1,558 @@

# Schema-Based Compression and Search

This document explains the inner workings of the schema-based deterministic finite automaton (DFA)
implemented in Log Surgeon and how it is used for compression and search in CLP.

## Contents

- [Schema](#1-schema)
- [NFA](#2-nfa)
- [DFA](#3-dfa)
- [Tagged DFA](#4-tagged-dfa)
- [Compression](#5-compression-example)
- [Search](#6-search)

## 1. Schema

A schema is an ordered collection of user-defined rules that describe how log messages should be
interpreted by identifying variables and static-text:

- **Variables** are text in the log that contains information pertinent to the user.
- **Static-text** is the remaining, non-variable text in the log.

The schema consists of two primary components:

- **Delimiters**, which define variable boundaries
- **Regex rules**, which define higher-level patterns matched over the input

### Delimiters

Delimiters are characters that define the boundaries of variables and static-text.
While variables can contain delimiters, all variables must be surrounded by delimiters (or the
start/end of the log).

Delimiters are used to ensure that variables are matched in a valid context. They do not tokenize
or split the log.

### Regex Rules

Regex rules define patterns that match sequences of characters. Each regex rule contains a variable
name and a regex pattern. Examples include:

```none
| Variable Name | Regex Pattern       | Input      | Match |
|---------------|---------------------|------------|-------|
| float         | \-?\d+\.\d+         | 12.34      | 12.34 |
| user_id       | user_id=(?<uid>\d+) | user_id=55 | 55    |
```

Each regex rule is used to construct an NFA, which is eventually considered during DFA construction.

### Priority and Ordering

When multiple regex rules match the same input, the ordering defined in the schema is used to
determine priority.

### Example Schema

```none
delimiters: \n\r\t

int:-?\d+
float:-?\d+\.\d+
tagged_user_id:user_id=(?<uid>\d+)
```

## 2. NFA

A nondeterministic finite automaton (NFA) is a state machine where each state can have multiple
outgoing transitions for the same input symbol, including epsilon transitions (transitions that
consume no input).

In Log Surgeon, NFAs are built from the regex rules. They form an intermediate representation
between the user-defined regex rules and DFAs.

### Construction

Each regex rule is compiled into an individual NFA:
- Literal characters produce linear sequences of states.
- Character classes, quantifiers, and optional segments produce branches.

The final state of the sequence is marked as an accepting state, indicating the regex has been
matched.

Multiple NFAs can be combined into a single NFA representing the entire schema. This allows matching
any regex rule from a single starting point.

### States and Transitions

NFA states are connected by labeled transitions:
- **Literal transitions** consume a specific character.
- **Epsilon transitions** consume no character and may be taken at any point.

At this stage, a single input symbol can lead to multiple possible next states. This nondeterminism
is the reason NFAs are inefficient to traverse directly. For runtime performance, an NFA must be
converted to a DFA.

## 3. DFA

A deterministic finite automaton (DFA) consists of a finite set of states connected by labeled
transitions. At any point during traversal, the automaton is in exactly one state, and for a given
input symbol there is at most one outgoing transition.

In Log Surgeon, DFAs are used as the execution model for lexing logs.

### States and Transitions

Each DFA state represents a set of possible NFA states. Transitions are labeled by literal bytes.

For any state `S` and input symbol `c`, there is at most one transition:

```none
S--c-->S'
```

This property guarantees deterministic execution.

### Deterministic Traversal

Traversal proceeds left-to-right over the input:

1. Start in the initial DFA state.
2. For each input symbol:
   - follow the unique matching transition,
   - or fail if no valid transition exists.
3. If no failure occurs, traversal ends in a single final state.

No backtracking or ambiguity resolution occurs during traversal. This makes DFA execution:
- Linear-time in the input size.
- Predictable in performance.
- Suitable for high-throughput parsing and search.

### Acceptance and Match Semantics

A DFA state is marked as **accepting** if any of its corresponding NFA states are accepting,
meaning the input so far matches an underlying regex pattern. As a DFA state can map to multiple
accepting NFA states, an accepting DFA state can indicate a match for multiple regex patterns.

A match succeeds if the given input has a traversal that does not fail and ends in an accepting
state. At this level the DFA only indicates a list of matches, ordered by variable priority in the
schema.
This type of non-tagged DFA does not produce any semantic information beyond the matching
variable types; capture semantics and match metadata are introduced by the tagged DFA.

## 4. Tagged DFA

A tagged DFA (TDFA) extends a standard DFA by attaching **tags** to transitions and states. Tags
allow the automaton to record semantic information during traversal, such as the start and end
positions of capture groups.

In Log Surgeon, TDFAs are used to extract sub-matches inside variables while preserving the
**deterministic** and **linear-time** properties of standard DFAs.

### Tags and Registers

Each capture group in a regex rule is associated with **start** and **end** tags. During TDFA
traversal, each tag is tracked using **registers**:

- **Intermediate registers** - record transient information during traversal.
- **Final registers** - record the final value of the tag, to be used post-lexing.

There are three types of register operations:

- **Set operations** - Assign the current input position to a register.
- **Negate operations** - Mark a register as invalid.
- **Copy operations** - Copy a value from one register to another.

At various TDFA transitions, register values are set, negated, or copied into other registers based
on the transition's register operations. Once TDFA traversal is finished, the value in the tag's
**final register** is used as the tag's value.

### Tagged Transitions

TDFA transitions are labeled like a standard DFA's, but may also carry **register operations** (set,
negate, copy). Multiple register operations can occur along a single transition.

A key implementation detail for efficiency is that register operations are assigned using
a 1-symbol lookahead. That is, instead of executing operations upon entering a state, operations
are applied only on outgoing transitions that stay along the desired capture path.
For example, consider the following two regexes:

- `tagged_user_digit:user_id=(?<uid>\d)`,
- `tagged_user_id:user_id=(?<uid>\d+)`,

and the input:

- `user_id=56`.

Upon reaching `5`:

- With a 0-lookahead, registers for both variable captures would be set immediately.
- With a 1-lookahead, we wait until seeing the `6`, then set a register for the start of the
  `tagged_user_id` capture and negate the registers for `tagged_user_digit`.

In general, TDFAs can use an **n-lookahead**, but each extra symbol of lookahead increases
cumulative runtime overhead. Using **1-lookahead** is common practice, as it experimentally provides
the best trade-off between minimizing register operations and keeping lookahead overhead low.

### Match Semantics

A match succeeds if:

1. The traversal consumes the input and reaches an accepting state, and
2. Registers have been updated according to the taken transitions.

The TDFA reports both:

- **Match success** (which rules matched)
- **Capture variables** (via start/end positions stored in registers)

This enables subsequent stages, such as compression and search, to operate directly on the semantic
contents of the log message.

## 5. Compression Example

The compression problem is to convert raw logs into compact archives, with each archive containing
metadata in the form of a logtype dictionary and a variable dictionary. The dictionaries allow
search queries to later operate efficiently by only decompressing matching archives. To solve this
problem, a schema is used by Log Surgeon to generate a TDFA and interpret the logs.
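As a minimal sketch of that interpretation step — using Python's `re` module and a hypothetical
`compress` helper in place of Log Surgeon's TDFA (which does not backtrack) — the logtype/variable
split looks like this:

```python
import re

# Illustrative stand-in for the schema rule tagged_user_id:user_id=(?<uid>\d+);
# Python spells the named capture group as (?P<uid>...).
RULE = re.compile(r"user_id=(?P<uid>\d+)")

def compress(line: str) -> tuple[str, list[str]]:
    """Split a log line into a logtype (captures replaced by a placeholder)
    and the list of captured variable values."""
    variables = [m.group("uid") for m in RULE.finditer(line)]
    logtype = RULE.sub("user_id=<uid>", line)
    return logtype, variables

print(compress("My user_id=56 line."))  # ('My user_id=<uid> line.', ['56'])
```

Lines without a match pass through unchanged and contribute only static-text to the logtype.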
```none
  ----------     -----------
  | Schema |     | Raw Log |
  ----------     -----------
      |               |
      V               V
  --------------------------------------------
  | Log Surgeon                              |
  | - Generates a TDFA from the schema       |
  | - Parses the log using schema            |
  | - Capture groups allow for submatches    |
  | - Outputs logtype and variables          |
  --------------------------------------------
                      |
                      V
  --------------------------------
  | CLP: compress log            |
  | - Update logtype dictionary  |
  | - Update variable dictionary |
  | - Add log to archive         |
  --------------------------------
```

This example shows how a TDFA produced from a schema extracts structured variables from raw text
while preserving the surrounding static-text.

### A. Schema

```none
delimiters: \n\r\t

tagged_user_id:user_id=(?<uid>\d+)
tagged_user_name:user=(?<name>\w+)
```

### B. TDFA

```none
                                                                       [R0=R6, set R1]
                           [R4=-1,R5=-1]            [set R6]   [R2=R4, R3=R5]
S0-u->S1-s->S2-e->S3-r->S4-_->S5-i->S6-d->S7-=->S8-[0-9]->S9-[0-9]->S10----[^0-9]->(tagged_user_id)
                         |                                |          ^  |                ^
                         |                                |          |_[0-9]             |
                         |                                |                              |
                         |                                |  [set R6]                    |
                         |                                ------------[^0-9]--------------
                         |
                         |                             [R0=R7, R1=R8]
                         |         [R7=-1,R8=-1]   [set R9]   [R2=R9, set R3]
                         --=->S10-[A-Za-z]->S11-[A-Za-z]->S12----[^A-Za-z]->(tagged_user_name)
                                             |             ^  |                  ^
                                             |             |_[A-Za-z]            |
                                             |                                   |
                                             |  [set R9]                         |
                                             ------------[^A-Za-z]----------------
```

**Explanation:**
- The TDFA shows the shared path of `user` from `S0` to `S4`.
- A branch occurs immediately after `user`, depending on whether the next character is `_`
  (`tagged_user_id`) or `=` (`tagged_user_name`).
- Upon entering one of the two branches, the TDFA does not immediately commit capture register
  updates because it uses a 1-symbol lookahead for register operations:
  - For the `tagged_user_id` path, only after reading `i` does it negate the intermediate registers
    of `tagged_user_name` (`R4`, `R5`).
  - For the `tagged_user_name` path, only after reading an alphabetic character does it negate the
    intermediate registers of `tagged_user_id` (`R7`, `R8`).
- In the `tagged_user_id` branch, due to 1-lookahead semantics, the intermediate start register
  `R6` is not set when the first digit is consumed. Instead, it is set on all outgoing transitions
  from `S9`. The same behavior occurs with `R9` on the `tagged_user_name` branch.
- Reaching states such as `S9`, `S10`, `S11`, and `S12` indicates the input is currently in an
  accepting configuration, but the capture is not yet finalized.
- The true end of the capture occurs when the input leaves the character class (`[^0-9]` or
  `[^A-Za-z]`), as this marks the point where the match can no longer continue.
- At this transition:
  - The final transition operations are performed (e.g., `S9` performs its delayed `set R6`
    operation).
  - The accepting operations are performed. For example, in the `tagged_user_id` case:
    - `R6` is copied into the final start register `R0`.
    - The final end register `R1` is set to the current position.
    - The final registers of `tagged_user_name` (`R2`, `R3`) copy the negated values from `R4` and
      `R5`.

### C. Input Log Line

```none
My user_id=56 line.
```

### D. TDFA Execution

The TDFA processes the line character by character, updating **states** and **registers** for
captured variables.
```none
States          | Input | Action/Tag Updates                | Registers
-----------------------------------------------------------------------------------------------
S0              | 'M'   | [fail]                            |
S0              | 'y'   | [fail]                            |
S0              | ' '   | [fail]                            |
S0              | 'u'   | -> S1                             |
S1              | 's'   | -> S2                             |
S2              | 'e'   | -> S3                             |
S3              | 'r'   | -> S4                             |
S4              | '_'   | -> S5                             |
S5              | 'i'   | Negate R4, R5 -> S6               | R4 = -1, R5 = -1
S6              | 'd'   | -> S7                             |
S7              | '='   | -> S8                             |
S8              | '5'   | -> S9                             |
S9 [accepting]  | '6'   | Set R6 -> S10                     | R6 = 8
S10 [accepting] | ' '   |                                   |
Accepting Ops   |       | R0 = R6, Set R1, R2 = R4, R3 = R5 | R0 = 8, R1 = 10, R2 = -1, R3 = -1
S0              | ' '   | [fail]                            |
S0              | 'l'   | [fail]                            |
S0              | 'i'   | [fail]                            |
S0              | 'n'   | [fail]                            |
S0              | 'e'   | [fail]                            |
S0              | '.'   | [fail]                            |
```

**Explanation**:
- `My ` is emitted as **static-text** because no valid transitions exist from `S0` for these
  characters.
- The space after `My` is a delimiter, so Log Surgeon restarts matching from the start state at the
  next position.
- `user_` drives deterministic transitions `S0->S5`, matching the schema's static prefix.
- Reading `i` transitions to `S6` and triggers the 1-lookahead branching register operations to
  negate the **intermediate registers** `R4` and `R5`.
- `d=` drives deterministic transitions `S6->S8`, matching the schema's static prefix.
- Reading `5` transitions to `S9`. At this point, the **intermediate register** `R6` is **not yet
  set** because of the 1-lookahead delay.
- Reading `6` transitions to `S10` and triggers a **register operation** for the start tag, storing
  the pre-transition position, `8`, in `R6`.
- Reading ` ` has no valid transition. Since ` ` is a delimiter and `S10` is accepting:
  - The TDFA reports a successful match.
  - Cleanup register operations are performed:
    - Copies `R6` into `R0`, the **final start register** for `tagged_user_id`.
    - Stores the end position, `10`, into `R1`, the **final end register** for `tagged_user_id`.
    - Copies `R4` into `R2`, the **final start register** for `tagged_user_name`.
    - Copies `R5` into `R3`, the **final end register** for `tagged_user_name`.
- The remaining characters, ` line.`, are emitted as static-text.

The finalized register values allow Log Surgeon to replace the matched substring with the variable
placeholder while preserving the surrounding static-text.

### E. Resulting Compressed Log

After TDFA execution, Log Surgeon produces:

```none
LogType:
  1. My user_id=<uid> line.
Variables:
  1. user_id: 56
```

- The **logtype** replaces each captured value with a placeholder (`<uid>`); the context around the
  capture (e.g. `user_id=`) is considered static-text.
- The **variable dictionary** stores the actual captured values.

This split between structure (the logtype dictionary) and data (the variable dictionary) is what
enables efficient search without decompressing every archive.

## 6. Search

The search problem is to find which logs match a given wildcard query. Logs are already compressed
into archives with a logtype dictionary and a variable dictionary, so the search algorithm compares
the query against these dictionaries to minimize archive decompression.
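As a toy sketch of that dictionary comparison (the `archive_may_match` helper is hypothetical, not
CLP's actual API), an archive can be ruled out before any decompression like so:

```python
from fnmatch import fnmatchcase

def archive_may_match(dict_var_patterns: list[str], variable_dictionary: set[str]) -> bool:
    """Return True only if every dictionary-variable wildcard in the subquery
    matches at least one entry in the archive's variable dictionary."""
    return all(
        any(fnmatchcase(entry, pattern) for entry in variable_dictionary)
        for pattern in dict_var_patterns
    )

variable_dictionary = {"22", "41", "55", "AB11", "AB23", "AC45"}
print(archive_may_match(["AB*"], variable_dictionary))  # True: archive must be searched
print(archive_may_match(["ZZ*"], variable_dictionary))  # False: archive can be skipped
```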
```none
  ------------------
  | Wildcard Query |
  ------------------
          |
          V
  ---------------------------------------------------------
  | Log Surgeon                                           |
  | - Parses query using schema                           |
  | - Generates interpretations using dynamic programming |
  |   - Static-text tokens                                |
  |   - Variable tokens: <type>(value) (TDFA-based)       |
  ---------------------------------------------------------
          |
          V
  -------------------------------
  | CLP: generate subqueries    |
  | - Encoded variables         |
  | - Dictionary variables      |
  | - 2^k wildcard combinations |
  -------------------------------
          |
          V
  -------------------------------------------
  | CLP: compare against dictionaries       |
  | - Logtype dictionary                    |
  | - Variable dictionary                   |
  | - Only decompress archives with matches |
  -------------------------------------------
          |
          V
  ---------------------------------
  | CLP: grep decompressed logs   |
  | - Filtered by original query  |
  | - Output matched logs         |
  ---------------------------------
```

### A. Get Query Interpretations from Log Surgeon

Log Surgeon converts a query into **interpretations**, where each token may be:
- Static-text (literal), or
- A variable (`<type>(value)`), as determined by the schema and TDFA intersection.

It uses a **dynamic programming algorithm** over all substrings of the query to minimize the
number of TDFA intersections needed to generate the complete set of interpretations.
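The combinatorial core of this step can be sketched as a cross product over per-token
possibilities. The possibilities below are hard-coded for illustration; in Log Surgeon they come
from intersecting each substring of the query with the schema's TDFA:

```python
from itertools import product

# Hypothetical per-token possibilities for the query "12* user_id=* 34.56 session=AB*".
options = [
    ("12*", "<int>(12*)", "<float>(12*)"),  # "12*" may be static-text, an int, or a float
    ("user_id=*", "user_id=<uid>(*)*"),     # "*" may or may not contain a uid capture
    ("<float>(34.56)",),                    # "34.56" can only be a float
    ("session=AB*", "session=<sid>(AB*)*"), # "AB*" may or may not contain a sid capture
]

interpretations = [" ".join(tokens) for tokens in product(*options)]
print(len(interpretations))  # 3 * 2 * 1 * 2 = 12
```

The dynamic programming part avoids re-deriving the possibilities of a substring that appears in
multiple interpretations; the cross product above only assembles the final list.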
#### Schema Example:

```none
delimiters: \n\r\t

int:-?\d+
float:-?\d+\.\d+
tagged_user_id:user_id=(?<uid>\d+)
tagged_session:session=(?<sid>\w+\d+)
```

#### Query Example:

```none
12* user_id=* 34.56 session=AB*
```

#### Interpretations Produced by Log Surgeon:

```none
12* user_id=* <float>(34.56) session=AB*
12* user_id=<uid>(*)* <float>(34.56) session=AB*
12* user_id=* <float>(34.56) session=<sid>(AB*)*
12* user_id=<uid>(*)* <float>(34.56) session=<sid>(AB*)*
<int>(12*) user_id=* <float>(34.56) session=AB*
<int>(12*) user_id=<uid>(*)* <float>(34.56) session=AB*
<int>(12*) user_id=* <float>(34.56) session=<sid>(AB*)*
<int>(12*) user_id=<uid>(*)* <float>(34.56) session=<sid>(AB*)*
<float>(12*) user_id=* <float>(34.56) session=AB*
<float>(12*) user_id=<uid>(*)* <float>(34.56) session=AB*
<float>(12*) user_id=* <float>(34.56) session=<sid>(AB*)*
<float>(12*) user_id=<uid>(*)* <float>(34.56) session=<sid>(AB*)*
```

### B. Generate Subqueries with Encoded Variables

For literal integer/float variables, the interpretation considers the variable to be encoded
directly into the archive if it is encodable (e.g., an integer of 16 characters or fewer). However,
if such a variable contains a wildcard, CLP generates the **2^k combinations** of:
- **Encoded variables** (stored in the archive)
- **Dictionary variables** (stored in the variable dictionary)

#### Example subqueries for the query above:

```none
12* user_id=* <enc_var>(34.56) session=AB*
12* user_id=<dict_var>(*)* <enc_var>(34.56) session=AB*
12* user_id=* <enc_var>(34.56) session=<dict_var>(AB*)*
12* user_id=<dict_var>(*)* <enc_var>(34.56) session=<dict_var>(AB*)*
<enc_var>(12*)* user_id=* <enc_var>(34.56) session=AB*
<enc_var>(12*)* user_id=<dict_var>(*)* <enc_var>(34.56) session=AB*
<enc_var>(12*)* user_id=* <enc_var>(34.56) session=<dict_var>(AB*)*
<enc_var>(12*)* user_id=<dict_var>(*)* <enc_var>(34.56) session=<dict_var>(AB*)*
<dict_var>(12*)* user_id=* <enc_var>(34.56) session=AB*
<dict_var>(12*)* user_id=<dict_var>(*)* <enc_var>(34.56) session=AB*
<dict_var>(12*)* user_id=* <enc_var>(34.56) session=<dict_var>(AB*)*
<dict_var>(12*)* user_id=<dict_var>(*)* <enc_var>(34.56) session=<dict_var>(AB*)*
```

> Note: `<dict_var>` denotes a dictionary variable and `<enc_var>` denotes an encoded variable.

### C. Validate Against Dictionaries

#### Logs:

```none
My log has user_id=55 34.56 session=AB23 and that's it.
123 user_id=22 34.56 session=AC45
123 user_id=41 34.56 session=AB11
123456789123456789 user_id=88 34.56 session=AB22
12a user_id=141 34.56 session=AB11
```

#### Logtype Dictionary:

```none
My log has user_id=<uid> <enc_var> session=<sid> and that's it.
<enc_var> user_id=<uid> <enc_var> session=<sid>
<dict_var> user_id=<uid> <enc_var> session=<sid>
12a user_id=<uid> <enc_var> session=<sid>
```

#### Variable Dictionary:

```none
123456789123456789
141
22
41
55
88
AB11
AB22
AB23
AC45
```

#### Logtypes matching the subqueries:

```none
<enc_var> user_id=<uid> <enc_var> session=<sid>
<dict_var> user_id=<uid> <enc_var> session=<sid>
12a user_id=<uid> <enc_var> session=<sid>
```

- Each dictionary variable (e.g. `<dict_var>`) is validated against the variable dictionary.
- An archive is only decompressed if at least one of the subqueries matches a logtype and all of
  that subquery's dictionary variables are in the variable dictionary.

#### Decompressed Logs:

```none
123 user_id=22 34.56 session=AC45
123 user_id=41 34.56 session=AB11
123456789123456789 user_id=88 34.56 session=AB22
12a user_id=141 34.56 session=AB11
```

### D. Grep against the original query:

```none
123 user_id=41 34.56 session=AB11
123456789123456789 user_id=88 34.56 session=AB22
12a user_id=141 34.56 session=AB11
```
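The grep step above can be sketched with Python's `fnmatch`, whose `*` behaves like the wildcard in
the original query:

```python
from fnmatch import fnmatchcase

QUERY = "12* user_id=* 34.56 session=AB*"

decompressed_logs = [
    "123 user_id=22 34.56 session=AC45",
    "123 user_id=41 34.56 session=AB11",
    "123456789123456789 user_id=88 34.56 session=AB22",
    "12a user_id=141 34.56 session=AB11",
]

# Final filter: only logs matching the original wildcard query are returned;
# the session=AC45 line is dropped here.
matches = [log for log in decompressed_logs if fnmatchcase(log, QUERY)]
print(matches)
```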