[Fix] Epoch assignment by izzet · Pull Request #38 · llnl/dfanalyzer

izzet · 2025-12-04T10:46:04Z

This pull request introduces several improvements to the dftracer.py analyzer, focusing on more robust epoch assignment, file category labeling, and data sanitization. The changes enhance how epochs are set for events, expand file category handling, and improve the treatment of size and offset values.

Epoch assignment improvements:

Refactored the _set_epochs method to assign epochs based on matching both pid and time intervals, iterating over each epoch boundary for more precise labeling.
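
For illustration, here is a minimal sketch of the interval matching described above, run on toy data. The column names (`pid`, `time_start`, `time_end`, `epoch`) follow the `_set_epochs` excerpt quoted in the review below; the standalone function name `set_epochs` and all sample values are made up for this example.

```python
import pandas as pd


def set_epochs(df: pd.DataFrame, epoch_boundaries: pd.DataFrame) -> pd.DataFrame:
    """Label each event with the epoch whose [time_start, time_end) interval
    contains the event's start time for the same pid."""
    df = df.copy()
    df["epoch"] = pd.NA  # events that match no boundary stay missing

    for _, boundary in epoch_boundaries.iterrows():
        mask = (
            (df["pid"] == boundary["pid"])
            & (df["time_start"] >= boundary["time_start"])
            & (df["time_start"] < boundary["time_end"])
        )
        df.loc[mask, "epoch"] = boundary["epoch"]
    return df


# Hypothetical sample data: three events for pid 1, one for pid 2.
events = pd.DataFrame({"pid": [1, 1, 1, 2], "time_start": [5, 10, 25, 5]})
boundaries = pd.DataFrame(
    {"pid": [1, 1], "time_start": [0, 10], "time_end": [10, 20], "epoch": [1, 2]}
)
labeled = set_epochs(events, boundaries)
```

The interval is half-open, so the event starting at exactly 10 lands in epoch 2 rather than epoch 1. Events outside every interval, or belonging to a pid with no boundaries, keep `pd.NA`.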

File category and data sanitization enhancements:

Added a new _fix_file_posix_category method to append purpose-based (e.g., _reader, _checkpoint) and filesystem-based (e.g., _lustre, _ssd) suffixes to the cat column for files matching certain patterns.
Introduced a _sanitize_size_offset method to replace zero values in the size and offset columns with NaN, improving data quality for downstream analysis.
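
A rough sketch of how the two steps behave on a small frame, assuming the path patterns and suffixes shown in the review excerpt below. The sample rows are invented, and the `na=False` argument is an addition in this sketch (not taken from the PR) so that rows without a file name produce a clean boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical sample events.
df = pd.DataFrame(
    {
        "cat": ["posix", "stdio", "posix", "compute"],
        "file_name": ["/lustre/data/a.npz", "/ssd/checkpoint/m.pt", None, "/data/b.npz"],
        "size": [0, 4096, 0, 128],
    }
)

# Only POSIX/STDIO rows that have a file name are eligible for relabeling.
eligible = df["cat"].str.contains("posix|stdio") & df["file_name"].notna()

# Purpose suffixes are applied first, then filesystem suffixes, so they stack
# in that order (e.g. posix -> posix_reader -> posix_reader_lustre).
suffix_maps = (
    {"/data": "_reader", "/checkpoint": "_checkpoint"},
    {"/lustre": "_lustre", "/ssd": "_ssd"},
)
for mapping in suffix_maps:
    for path, suffix in mapping.items():
        mask = eligible & df["file_name"].str.contains(path, na=False)
        df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

# Zero sizes become NaN; this upcasts the integer column to float64.
df["size"] = df["size"].replace(0, np.nan)
```

Two things worth keeping in mind: `str.contains` matches the pattern anywhere in the path, so `/data` would also match a path such as `/database/...`, and replacing zeros with NaN changes the column dtype from integer to float.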

Copilot

Pull request overview

This pull request introduces improvements to epoch assignment, file category labeling, and data sanitization in the dftracer analyzer. However, the implementation contains critical issues with duplicate method definitions that must be addressed before merging.

Key Changes:

Refactored epoch assignment to use pid-based matching with time intervals instead of simple binning
Added file category enrichment with purpose-based and filesystem-based suffixes
Implemented size and offset sanitization to replace zero values with NaN

Comments suppressed due to low confidence (2)

python/dftracer/analyzer/dftracer.py:706

The _fix_file_posix_category method is defined twice in this file (lines 638-661 and 689-706). This creates a bug where only the last definition will be used at runtime. Please remove the duplicate definition and keep only one implementation.

    def _fix_file_posix_category(df: pd.DataFrame):
        base_condition = (df["cat"].str.contains("posix|stdio") & ~df["file_name"].isna())
    
        # Step 1: Map file purpose suffixes first
        purpose_updates = {
            "/data": "_reader",
            "/checkpoint": "_checkpoint"
        }
        
        for path, suffix in purpose_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix
        
        # Step 2: Map filesystem suffixes
        filesystem_updates = {
            "/lustre": "_lustre",
            "/ssd": "_ssd"
        }
        
        for path, suffix in filesystem_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        return df

    @staticmethod
    def _sanitize_size_offset(df: pd.DataFrame):
        df["size"] = df["size"].replace(0, np.nan)
        if "offset" in df.columns:
            df["offset"] = df["offset"].replace(0, np.nan)
        return df

    @staticmethod
    def _set_epochs(df: pd.DataFrame, epoch_boundaries: pd.DataFrame):
        df["epoch"] = pd.NA

        # Iterate over each epoch boundary to find matching events
        for _, epoch_boundary in epoch_boundaries.iterrows():
            pid = epoch_boundary["pid"]
            start = epoch_boundary["time_start"]
            end = epoch_boundary["time_end"]

            # Find rows in the partition that match the pid and fall within the time interval
            mask = (df["pid"] == pid) & (df["time_start"] >= start) & (df["time_start"] < end)

            # Assign the epoch number to the matching rows
            df.loc[mask, "epoch"] = epoch_boundary["epoch"]

        return df

    @staticmethod
    def _fix_file_posix_category(df: pd.DataFrame):
        base_condition = df["cat"].str.contains("posix|stdio") & ~df["file_name"].isna()

        # Step 1: Map file purpose suffixes first
        purpose_updates = {"/data": "_reader", "/checkpoint": "_checkpoint"}

        for path, suffix in purpose_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        # Step 2: Map filesystem suffixes
        filesystem_updates = {"/lustre": "_lustre", "/ssd": "_ssd"}

        for path, suffix in filesystem_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        return df
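
The runtime effect of a duplicate definition can be reproduced in isolation. This is a minimal sketch with a hypothetical class and method name, not code from the PR:

```python
class Analyzer:
    @staticmethod
    def _sanitize(value):
        return "first definition"

    @staticmethod
    def _sanitize(value):  # same name: silently rebinds the class attribute
        return "second definition"


# Python raises no error; the later definition simply wins.
result = Analyzer._sanitize(0)
```

Because the interpreter accepts this without complaint, the first body becomes dead code. Linters typically catch it, for example pyflakes reports it as a redefinition (F811 under flake8).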

python/dftracer/analyzer/dftracer.py:713

The _sanitize_size_offset method is defined twice in this file (lines 664-668 and 709-713). Additionally, the two implementations are inconsistent: the first uses np.nan (lines 665 and 667) while the second uses pd.NA (lines 710 and 712). As a result, only the last definition will be used at runtime. Please remove the duplicate definition and settle on one missing-value marker (either np.nan or pd.NA).

    def _sanitize_size_offset(df: pd.DataFrame):
        df["size"] = df["size"].replace(0, np.nan)
        if "offset" in df.columns:
            df["offset"] = df["offset"].replace(0, np.nan)
        return df

    @staticmethod
    def _set_epochs(df: pd.DataFrame, epoch_boundaries: pd.DataFrame):
        df["epoch"] = pd.NA

        # Iterate over each epoch boundary to find matching events
        for _, epoch_boundary in epoch_boundaries.iterrows():
            pid = epoch_boundary["pid"]
            start = epoch_boundary["time_start"]
            end = epoch_boundary["time_end"]

            # Find rows in the partition that match the pid and fall within the time interval
            mask = (df["pid"] == pid) & (df["time_start"] >= start) & (df["time_start"] < end)

            # Assign the epoch number to the matching rows
            df.loc[mask, "epoch"] = epoch_boundary["epoch"]

        return df

    @staticmethod
    def _fix_file_posix_category(df: pd.DataFrame):
        base_condition = df["cat"].str.contains("posix|stdio") & ~df["file_name"].isna()

        # Step 1: Map file purpose suffixes first
        purpose_updates = {"/data": "_reader", "/checkpoint": "_checkpoint"}

        for path, suffix in purpose_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        # Step 2: Map filesystem suffixes
        filesystem_updates = {"/lustre": "_lustre", "/ssd": "_ssd"}

        for path, suffix in filesystem_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        return df

    @staticmethod
    def _sanitize_size_offset(df: pd.DataFrame):
        df["size"] = df["size"].replace(0, pd.NA)
        if "offset" in df.columns:
            df["offset"] = df["offset"].replace(0, pd.NA)
        return df
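
To make the np.nan versus pd.NA choice concrete, here is a small sketch comparing the resulting dtypes. The first variant mirrors the `replace(0, np.nan)` call from the excerpt; the second is an alternative shown only for comparison, assuming one wants to keep integer values by using pandas' nullable `Int64` dtype together with `Series.mask`. The sample values are invented.

```python
import numpy as np
import pandas as pd

sizes = pd.Series([0, 4096, 128])

# np.nan is a float, so the integer column is upcast to float64.
as_float = sizes.replace(0, np.nan)

# Nullable-integer alternative: convert first, then mask zeros to <NA>.
as_nullable = sizes.astype("Int64")
as_nullable = as_nullable.mask(as_nullable == 0)
```

Both results report the zero entry as missing via `isna()`, but the first is `float64` and the second stays `Int64`. Whichever marker is chosen, downstream aggregations should be checked against the resulting dtype.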

izzet and others added 14 commits August 30, 2025 17:34

Add epoch handling to DFTracerAnalyzer and update related configurations

6726281

Add epoch assignment handling for DFTracerAnalyzer in test setup

c39081b

Merge branch 'main' into feat/assign-epochs

e5cd7fb

Remove unused percentile parameter from test_e2e function

3f60f26

Fix epoch layer definition reference in DFTracerAnalyzer

233c4b6

refactor: update unique_set and unique_set_flatten functions to use BetterSet for improved performance and clarity; add comprehensive tests for unique_set functionality

729366b

refactor: add pytestmark for smoke and full CI modes

9a33f69

Merge branch 'fix/dask_agg_unique_set'

d9466db

Merge branch 'main' of https://github.com/izzet/dfanalyzer into main

01ebe67

Update import path for dask aggregation utilities in test file

b906d16

Add assign_epochs option to DFTracerAnalyzerConfig

cc7c715

Merge branch 'LLNL:develop' into main

7f4d65d

Merge branch 'LLNL:develop' into main

24b7b6b

Merge branch 'LLNL:develop' into main

3dffefc

izzet requested a review from Copilot December 4, 2025 10:46

izzet self-assigned this Dec 4, 2025

izzet added the enhancement New feature or request label Dec 4, 2025

Copilot started reviewing on behalf of izzet December 4, 2025 10:46

Copilot finished reviewing on behalf of izzet December 4, 2025 10:48

Copilot AI reviewed Dec 4, 2025

Fix pandas condition

5dd46e4

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

izzet merged commit 6d90ae0 into llnl:develop Dec 4, 2025
3 checks passed

izzet merged 15 commits into llnl:develop from izzet:main
