maxspl/json2splunk-rs

json2splunk-rs

json2splunk-rs is a rework of json2splunk.

This tool ingests JSONL and CSV files into Splunk through the HTTP Event Collector (HEC).

Features

  • CSV files: Supports CSV files in addition to JSONL.
  • Multiprocessing Support: Utilizes multiple CPUs to process events concurrently.
  • Flexible File Matching: Configurable file matching rules based on file name/path patterns and path suffixes, allowing selective processing of files.
  • Splunk Integration: Automates the creation of Splunk indices and HEC tokens, ensuring that data is ingested smoothly and efficiently into Splunk.
  • Test Mode: Allows running the script in a test configuration where no data is actually sent to Splunk, useful for debugging and validation.
  • Vector Remap Language (VRL): In-memory field transformation (to Elastic Common Schema, for example) before events are sent to Splunk.

Setup

  1. Get the latest released binary.

  2. Configure Splunk Settings: Update splunk_configuration.yml with your Splunk instance details:

    splunk:
      host: {splunk_FQDN_or_IP}
      user: {splunk_user}
      password: {splunk_password}
      port: {splunk_port} # Default is 8000
      mport: {splunk_mport} # Default is 8089
      ssl: {splunk_enable_ssl} # Default is False
  3. Set File Matching Rules: Edit indexer_patterns.yml to define the patterns for the files you want to ingest:

    <source_name>:
      name_rex:         # regex matching the file name (optional if path_suffix or path_rex is set). Regex applied on FILE PATH (including filename)
      path_suffix:      # suffix path to files to index (optional if name_rex or path_rex is set). Matches ending path.
                        # Example: "path_suffix: evtx" will match files under .../evtx/ (respecting ext filter if used)
      path_rex:         # regex matching the file parent directory (optional if name_rex or path_suffix is set).
                        # Regex applied on FILE DIRECTORY (without filename)
      sourcetype:       # Splunk sourcetype (optional). If not specified, defaults to <source_name>
      normalize:        # list of VRL scripts to apply for normalization (optional).
                        # Each entry is a file name or path to a .vrl script, resolved relative to --vrl_dir (or as absolute paths).
                        # Example:
                        #   normalize:
                        #     - "evtx_common.vrl"
                        #     - "evtx_4688.vrl"
      timestamp_path:   # list of JSON key paths (first existing key in the event is used) containing the event timestamp.
                        # Populates Splunk _time field.
                        # Applied AFTER VRL normalization.
                        # Example:
                        #   timestamp_path:
                        #     - "Event.System.TimeCreated.#attributes.SystemTime"
                        #     - "@timestamp"
      timestamp_format: # format of the timestamp extracted. Example: "%Y-%m-%dT%H:%M:%S.%fZ" (optional)
                        # Applied AFTER VRL normalization.
      host_path:        # path to the JSON key containing the event host. Populates Splunk host field.
                        # Applied AFTER VRL normalization.
                        # Example: "Event.System.Computer" (optional)
      host_rex:         # regex to extract the hostname from the filename or the file path. Populates Splunk host field. (optional)
      artifact:         # source_name alternative (optional) – can be useful to define a global name like "EVTX" where
                        # source_name is very specific like "windows:evtx:powershell". If not specified, defaults to <source_name>.
      encoding:         # encoding of the input file (optional). Currently "utf8" is recognized for fast path;
                        # other values fall back to a lossy UTF-8 reader.
                        # Example: "utf-8"
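
The three matchers above can be read as a single predicate over a file path. The sketch below is only an illustration in Python, not the tool's actual implementation: the function name rule_matches is hypothetical, and it assumes that every matcher present in a rule must accept the file and that path_suffix must equal the trailing components of the parent directory.

```python
import re
from pathlib import PurePosixPath


def rule_matches(file_path: str, rule: dict) -> bool:
    """Return True when every matcher present in the rule accepts the file."""
    path = PurePosixPath(file_path)
    checks = []
    if rule.get("name_rex"):
        # name_rex is searched in the full path, file name included.
        checks.append(re.search(rule["name_rex"], str(path)) is not None)
    if rule.get("path_suffix"):
        # Assumption: the parent directory must end with the suffix components.
        suffix = PurePosixPath(rule["path_suffix"]).parts
        checks.append(path.parent.parts[-len(suffix):] == suffix)
    if rule.get("path_rex"):
        # path_rex is searched in the directory only, without the file name.
        checks.append(re.search(rule["path_rex"], str(path.parent)) is not None)
    # A rule without any matcher never matches.
    return bool(checks) and all(checks)
```

With this reading, a rule such as {"name_rex": r"\.jsonl$", "path_suffix": "evtx"} accepts /input_sample/evtx/event1.jsonl and rejects a .csv file in the same directory.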

Usage

Run the script with the required parameters. Example usage:

json2splunk-rs --input /path/to/logs --index my_index
json2splunk-rs --input /path/to/logs --index my_index --config_spl /opt/json2splunk/splunk_configuration.yml --indexer_patterns /opt/json2splunk/indexer_patterns.yml
json2splunk-rs --input /path/to/logs --index my_index --nb_cpu 4
json2splunk-rs --input /path/to/logs --index my_index --ext ".csv,.jsonl"
json2splunk-rs --input /path/to/logs --index my_index --vrl_dir /opt/json2splunk/vrl
json2splunk-rs --input /path/to/logs --normalize-test-dir ./normalized_output

Parameters

  • --input: Mandatory. Directory containing the log files to process.
  • --index: Mandatory unless --normalize-test-dir is used. The name of the Splunk index to use.
  • --nb_cpu: Optional. Specifies the number of CPUs to use for processing. Defaults to the number of available CPUs.
  • --test: Optional. Enables test mode where no data is sent to Splunk. Useful for debugging.
  • --config_spl: Optional. Specifies the path to the Splunk configuration file. Defaults to splunk_configuration.yml.
  • --indexer_patterns: Optional. Specifies the path to the file patterns configuration. Defaults to indexer_patterns.yml.
  • --ext: Optional. Specifies a comma-separated list of extensions used to prefilter the input directory. Defaults to None.
  • --vrl_dir: Optional. Directory where VRL scripts referenced in indexer_patterns.yml are located. Defaults to the current directory.
  • --normalize-test-dir: Optional. Writes normalized (post-VRL) JSONL files to a directory instead of sending them to Splunk. Useful for testing transformations.
  • --verbosity: Optional. Controls log verbosity (DEBUG, INFO, WARNING, ERROR). Defaults to INFO.
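
As an illustration of the --ext prefilter, the Python sketch below splits a value such as ".csv,.jsonl" and keeps only files whose final extension is listed. The helper names parse_ext and keep_file are hypothetical, and the case-insensitive comparison is an assumption rather than documented behavior.

```python
from pathlib import Path


def parse_ext(value):
    """Turn '.csv,.jsonl' into {'.csv', '.jsonl'}; None means no prefilter."""
    if not value:
        return None
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def keep_file(path, exts):
    """Keep every file when no filter is set, else compare the last extension."""
    # Assumption: the comparison ignores case.
    return exts is None or Path(path).suffix.lower() in exts
```

Only the last extension is compared, so a name ending in .data.jsonl is kept by a ".jsonl" filter.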

VRL Support

You can dynamically transform events using VRL files. VRL scripts are referenced in indexer_patterns.yml under the normalize section and loaded from the directory specified by --vrl_dir.

The order of processing is:

  1. Raw event ingestion
    The file is read, parsed (JSON or CSV), and converted into a structured event.

  2. VRL normalization (optional)
    All VRL scripts listed in the matching rule (normalize:) are applied in order.
    These scripts can add, remove, rename, or enrich fields.

Example of transform.vrl:

.tenant = "acme"
if exists(.timestamp) {
  ._time = .timestamp
}
  3. Post-normalization metadata extraction
    The following rule-based extractions occur after VRL has finished:
  • timestamp_path (first matching key is used)
  • timestamp_format
  • host_path

  4. Source / sourcetype / artifact assignment
    Values from the matching rule are applied to prepare metadata for Splunk ingestion.

  5. Output stage

  • If the --normalize-test-dir option is provided:
    The normalized and enriched output is written as JSONL files (no ingestion occurs).
  • Otherwise:
    Events are batched and sent to Splunk via HEC.
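
The post-normalization steps can be sketched in Python as follows. This is a minimal illustration, not the tool's code: get_path and to_hec_event are hypothetical names, using the rule name as the Splunk source is an assumption, and timestamps without an offset are assumed to be UTC. The envelope keys (time, host, source, sourcetype, index, event) are the standard fields of a HEC event payload.

```python
from datetime import datetime, timezone


def get_path(event, dotted):
    """Walk a dotted key path such as 'Event.System.Computer'; None if absent."""
    node = event
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def to_hec_event(event, rule, source_name, index):
    """Build a HEC envelope from an event that VRL has already normalized."""
    envelope = {
        "index": index,
        "source": source_name,  # assumption: the rule name is used as source
        "sourcetype": rule.get("sourcetype", source_name),
        "event": event,
    }
    # The first timestamp_path entry that exists in the event is used.
    candidates = (get_path(event, p) for p in rule.get("timestamp_path", []))
    raw = next((value for value in candidates if value is not None), None)
    if raw is not None and rule.get("timestamp_format"):
        # Assumption: the parsed timestamp is interpreted as UTC.
        parsed = datetime.strptime(raw, rule["timestamp_format"])
        envelope["time"] = parsed.replace(tzinfo=timezone.utc).timestamp()
    host = get_path(event, rule["host_path"]) if rule.get("host_path") else None
    if host is not None:
        envelope["host"] = host
    return envelope
```

Python's strptime directives differ in places from those of other date libraries, so a format string that works here may need adjusting for the tool itself.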

Test Mode

Test mode is designed to validate the setup without pushing data to Splunk. It simulates the entire process, from file scanning to data preparation, without making any actual data transmissions to Splunk.

This mode also generates a dataframe (named test_files_to_index.json) containing matched files and patterns, which can be reviewed to ensure correct file handling before live deployment.

For example, the dataframe can be used to review the patterns matched by each file:

[
  {
    "file_path": "input_sample/prefetch/SRV-DA09DKL--prefetch-AA4646DB4646A841_2000000016FC0_D000000018CE8_4_TABBY.EXE-D326E1BD.pf_{00000000-0000-0000-0000-000000000000}.data.jsonl",
    "file_name": "SRV-DA09DKL--prefetch-AA4646DB4646A841_2000000016FC0_D000000018CE8_4_TABBY.EXE-D326E1BD.pf_{00000000-0000-0000-0000-000000000000}.data.jsonl",
    "source": [
      "prefetch",
      "all"
    ],
    "sourcetype": "_json",
    "timestamp_path": "",
    "timestamp_format": "",
    "host": "SRV-DA09DKL",
    "host_path": null
  },
  {
    "file_path": "input_sample/evtx/SRV-DA09DKL--evtx-AA4646DB4646A841_10000000014B3_E0000000249F8_4_Microsoft-Windows-StorageSettings%4Diagnostic.evtx_{00000000-0000-0000-0000-000000000000}.data.jsonl",
    "file_name": "SRV-DA09DKL--evtx-AA4646DB4646A841_10000000014B3_E0000000249F8_4_Microsoft-Windows-StorageSettings%4Diagnostic.evtx_{00000000-0000-0000-0000-000000000000}.data.jsonl",
    "source": [
      "evtx",
      "all"
    ],
    "sourcetype": "_json",
    "timestamp_path": [
      "Event.System.TimeCreated.#attributes.SystemTime"
    ],
    "timestamp_format": "%Y-%m-%dT%H:%M:%S.%fZ",
    "host": "Unknown", // Normal as host_path is extracted after the dataframe creation
    "host_path": "Event.System.Computer"
  }
]
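
A short Python sketch for reviewing this file is given below. It assumes the records have the shape shown in the sample (a list of objects with source, file_name and timestamp_path) and that the generated file is plain JSON without the inline comment used above for explanation. The function names are hypothetical.

```python
from collections import Counter


def summarize(records):
    """Count matched files per source, using the first listed source name."""
    counts = Counter()
    for rec in records:
        source = rec.get("source")
        # "source" is a list in the sample; the first entry is taken as the match.
        first = source[0] if isinstance(source, list) and source else source
        counts[first] += 1
    return counts


def files_without_timestamp(records):
    """List files whose rule defines no timestamp_path (empty string or list)."""
    return [rec["file_name"] for rec in records if not rec.get("timestamp_path")]
```

The records can be loaded with json.load on test_files_to_index.json and passed to both functions to spot files matched by an unexpected rule or lacking a timestamp source.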

Example

Directory Structure Example

Let's ingest these files:

/input_sample
├── output
│   ├── app
│   │   ├── error
│   │   │   └── app_error.jsonl
│   │   ├── info
│   │   │   └── app_info.jsonl
│   │   └── debug
│   │       └── app_debug.jsonl
├── prefetch
│   ├── HOST-A--prefetch1.jsonl
│   ├── HOST-A--prefetch2.jsonl
│   └── HOST-A--prefetch3.jsonl
└── evtx
    ├── event1.jsonl
    ├── event2.jsonl
    └── event3.jsonl

Patterns Configuration (indexer_patterns.yml)

This YAML file specifies which files json2splunk-rs should process. You can define multiple criteria based on file name (or file path) regex patterns and path suffixes. Each entry defines a pattern that matches certain files, together with the processing rules applied to them for Splunk ingestion.

Warning: Fields required: sourcetype, and one of: name_rex, path_suffix.

Warning: If a file matches several artifacts, the first one is selected.

windows:evtx:powershell:
    name_rex: Windows_PowerShell.*\.jsonl$
    path_suffix: evtx
    host_path: "Event.System.Computer" # Extract the host from the event
    timestamp_path:  # Extract the timestamp from the event
      - "Event.System.TimeCreated.#attributes.SystemTime"
      - "Event.Timestamp"
    timestamp_format: "%Y-%m-%dT%H:%M:%S.%fZ" # Specify the timestamp format
    artifact: EVTX
evtx:
    name_rex: \.jsonl$
    path_suffix: evtx
    sourcetype: _json
    normalize:
      - normalize/windows/evtx.vrl
    host_path: ".host.name" # Extract the host AFTER VRL normalization from the event
    timestamp_path: # Extract the timestamp from the event AFTER VRL normalization
        - "timestamp"
    timestamp_format: "%Y-%m-%dT%H:%M:%SZ" # Timestamp after VRL normalization
prefetch:
    name_rex: \.jsonl$
    path_rex: ".*prefetch"
    sourcetype: _json
    host_rex: (^[\w-]+)-- # Extract host from file path
    normalize:
      - normalize/windows/prefetch.vrl
    timestamp_path: # Extract the timestamp from the event AFTER VRL normalization
      - "timestamp"
    timestamp_format: "%Y-%m-%dT%H:%M:%SZ" # Timestamp after VRL normalization
reg:
    name_rex: --hives_hk
    host_rex: ([\w\.-]+)--
    sourcetype: _json
    normalize:
      - normalize/windows/hives.vrl                   
    timestamp_path:
      - "timestamp" 
    timestamp_format: "%Y-%m-%dT%H:%M:%SZ"
application:
    path_suffix: output/app
    sourcetype: _json
    host_rex: (^[\w-]+)--
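
To see what the host_rex values in this configuration capture, the Python sketch below applies them to file names. It is an illustration under assumptions: extract_host is a hypothetical helper, Python's re module stands in for whatever regex engine the tool uses, and trying the file name before the full path is a guess based on the description of host_rex. The second file name in the usage note is invented for the example.

```python
import re
from pathlib import PurePosixPath


def extract_host(file_path, host_rex):
    """Return the first capture group of host_rex, or None when nothing matches."""
    # Assumption: the file name is tried first, then the full path.
    for candidate in (PurePosixPath(file_path).name, file_path):
        match = re.search(host_rex, candidate)
        if match and match.groups():
            return match.group(1)
    return None
```

For /input_sample/prefetch/HOST-A--prefetch1.jsonl, the prefetch rule's (^[\w-]+)-- yields HOST-A. The reg rule's ([\w\.-]+)-- also accepts dots, so a hypothetical srv01.corp--hives_hk.jsonl yields srv01.corp.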

Dry run before Splunk ingestion

json2splunk-rs --indexer_patterns patterns.yml --input input_sample/ --normalize-test-dir ./normalized_output

Then inspect the output files in the ./normalized_output directory.

Index files in Splunk

json2splunk-rs --indexer_patterns patterns.yml --config_spl splunk_configuration.yml --input input_sample/ --index my_index

About

JSONL/CSV/RAW to Splunk ingestor
