[Fix] Epoch assignment #38

Merged
izzet merged 15 commits into llnl:develop from izzet:main
Dec 4, 2025

@izzet
Collaborator

@izzet izzet commented Dec 4, 2025

This pull request improves the dftracer.py analyzer in three areas: more robust epoch assignment for events, expanded file category labeling, and sanitization of size and offset values.

Epoch assignment improvements:

  • Refactored the _set_epochs method to iterate over each epoch boundary and assign its epoch number to events whose pid matches and whose time_start falls within the boundary's half-open interval [time_start, time_end).
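
The matching rule can be tried on a toy table. The following is a minimal sketch, not the analyzer's code: the function name assign_epochs and the sample data are hypothetical, and it uses a nullable Int64 epoch column and a copy of the input, which the method in this PR does not.

```python
import pandas as pd


def assign_epochs(events: pd.DataFrame, boundaries: pd.DataFrame) -> pd.DataFrame:
    """Label each event with the epoch whose pid and [start, end) interval it matches."""
    out = events.copy()
    # Assumption: a nullable integer column, so unmatched events stay <NA>.
    out["epoch"] = pd.array([pd.NA] * len(out), dtype="Int64")
    for b in boundaries.itertuples(index=False):
        mask = (
            (out["pid"] == b.pid)
            & (out["time_start"] >= b.time_start)
            & (out["time_start"] < b.time_end)
        )
        out.loc[mask, "epoch"] = b.epoch
    return out


# Hypothetical sample data: pid 1 has two epochs, pid 2 has none.
events = pd.DataFrame({"pid": [1, 1, 1, 2], "time_start": [0.5, 1.5, 2.5, 0.5]})
boundaries = pd.DataFrame(
    {"pid": [1, 1], "epoch": [1, 2], "time_start": [0.0, 1.0], "time_end": [1.0, 2.0]}
)
labeled = assign_epochs(events, boundaries)
```

Events outside every interval, or belonging to a pid with no boundaries, keep a missing epoch instead of falling into the nearest bin.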

File category and data sanitization enhancements:

  • Added a new _fix_file_posix_category method to append purpose-based (e.g., _reader, _checkpoint) and filesystem-based (e.g., _lustre, _ssd) suffixes to the cat column for files matching certain patterns.
  • Introduced a _sanitize_size_offset method to replace zero values in the size and offset columns with NaN, improving data quality for downstream analysis.
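
As an illustration of the suffixing behavior, here is a small sketch on made-up rows. The helper name label_categories and the sample paths are assumptions; unlike the method in this PR, it works on a copy, matches path fragments literally (regex=False), and treats missing values as non-matches (na=False).

```python
import pandas as pd

# Purpose suffixes are applied before filesystem suffixes, as in the description.
SUFFIX_RULES = [
    ("/data", "_reader"),
    ("/checkpoint", "_checkpoint"),
    ("/lustre", "_lustre"),
    ("/ssd", "_ssd"),
]


def label_categories(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Only posix/stdio events that have a file name are eligible.
    base = out["cat"].str.contains("posix|stdio", na=False) & out["file_name"].notna()
    for path, suffix in SUFFIX_RULES:
        mask = base & out["file_name"].str.contains(path, regex=False, na=False)
        out.loc[mask, "cat"] = out.loc[mask, "cat"] + suffix
    return out


# Hypothetical sample rows.
sample = pd.DataFrame(
    {
        "cat": ["posix", "stdio", "posix", "compute"],
        "file_name": ["/lustre/data/a.bin", "/ssd/checkpoint/x", None, "/lustre/data/x"],
    }
)
labeled_cats = label_categories(sample)["cat"].tolist()
```

Because the match is a substring test, a path such as /lustre/data/a.bin receives both a purpose suffix and a filesystem suffix, while rows without a file name or outside the posix and stdio categories are left unchanged.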

@izzet izzet requested a review from Copilot December 4, 2025 10:46
@izzet izzet self-assigned this Dec 4, 2025
@izzet izzet added the enhancement New feature or request label Dec 4, 2025
Contributor

Copilot AI left a comment

Pull request overview

This pull request introduces improvements to epoch assignment, file category labeling, and data sanitization in the dftracer analyzer. However, the implementation contains critical issues with duplicate method definitions that must be addressed before merging.

Key Changes:

  • Refactored epoch assignment to use pid-based matching with time intervals instead of simple binning
  • Added file category enrichment with purpose-based and filesystem-based suffixes
  • Implemented size and offset sanitization to replace zero values with NaN

Comments suppressed due to low confidence (2)

python/dftracer/analyzer/dftracer.py:706

  • The _fix_file_posix_category method is defined twice in this file (lines 638-661 and 689-706). This creates a bug where only the last definition will be used at runtime. Please remove the duplicate definition and keep only one implementation.
    def _fix_file_posix_category(df: pd.DataFrame):
        base_condition = (df["cat"].str.contains("posix|stdio") & ~df["file_name"].isna())
    
        # Step 1: Map file purpose suffixes first
        purpose_updates = {
            "/data": "_reader",
            "/checkpoint": "_checkpoint"
        }
        
        for path, suffix in purpose_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix
        
        # Step 2: Map filesystem suffixes
        filesystem_updates = {
            "/lustre": "_lustre",
            "/ssd": "_ssd"
        }
        
        for path, suffix in filesystem_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        return df

    @staticmethod
    def _sanitize_size_offset(df: pd.DataFrame):
        df["size"] = df["size"].replace(0, np.nan)
        if "offset" in df.columns:
            df["offset"] = df["offset"].replace(0, np.nan)
        return df

    @staticmethod
    def _set_epochs(df: pd.DataFrame, epoch_boundaries: pd.DataFrame):
        df["epoch"] = pd.NA

        # Iterate over each epoch boundary to find matching events
        for _, epoch_boundary in epoch_boundaries.iterrows():
            pid = epoch_boundary["pid"]
            start = epoch_boundary["time_start"]
            end = epoch_boundary["time_end"]

            # Find rows in the partition that match the pid and fall within the time interval
            mask = (df["pid"] == pid) & (df["time_start"] >= start) & (df["time_start"] < end)

            # Assign the epoch number to the matching rows
            df.loc[mask, "epoch"] = epoch_boundary["epoch"]

        return df

    @staticmethod
    def _fix_file_posix_category(df: pd.DataFrame):
        base_condition = df["cat"].str.contains("posix|stdio") & ~df["file_name"].isna()

        # Step 1: Map file purpose suffixes first
        purpose_updates = {"/data": "_reader", "/checkpoint": "_checkpoint"}

        for path, suffix in purpose_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        # Step 2: Map filesystem suffixes
        filesystem_updates = {"/lustre": "_lustre", "/ssd": "_ssd"}

        for path, suffix in filesystem_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        return df
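
The runtime effect described in this comment can be reproduced with a minimal class. This is an illustrative sketch with hypothetical names, not code from dftracer.py: when a class body defines the same name twice, the second definition rebinds the name and the first becomes unreachable.

```python
class Analyzer:
    @staticmethod
    def _sanitize(value):
        return "first"

    @staticmethod
    def _sanitize(value):  # rebinds the name; the definition above is discarded
        return "second"


result = Analyzer._sanitize(0)
```

Python raises no error for this, so the duplicate is easy to miss in review. Linters that flag redefinitions, such as pyflakes rule F811, can catch it.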

python/dftracer/analyzer/dftracer.py:713

  • The _sanitize_size_offset method is defined twice in this file (lines 664-668 and 709-713). Additionally, the two implementations are inconsistent: the first uses np.nan (lines 665, 667) while the second uses pd.NA (lines 710, 712). This creates a bug where only the last definition will be used. Please remove the duplicate definition and decide on a consistent approach (either np.nan or pd.NA).
    def _sanitize_size_offset(df: pd.DataFrame):
        df["size"] = df["size"].replace(0, np.nan)
        if "offset" in df.columns:
            df["offset"] = df["offset"].replace(0, np.nan)
        return df

    @staticmethod
    def _set_epochs(df: pd.DataFrame, epoch_boundaries: pd.DataFrame):
        df["epoch"] = pd.NA

        # Iterate over each epoch boundary to find matching events
        for _, epoch_boundary in epoch_boundaries.iterrows():
            pid = epoch_boundary["pid"]
            start = epoch_boundary["time_start"]
            end = epoch_boundary["time_end"]

            # Find rows in the partition that match the pid and fall within the time interval
            mask = (df["pid"] == pid) & (df["time_start"] >= start) & (df["time_start"] < end)

            # Assign the epoch number to the matching rows
            df.loc[mask, "epoch"] = epoch_boundary["epoch"]

        return df

    @staticmethod
    def _fix_file_posix_category(df: pd.DataFrame):
        base_condition = df["cat"].str.contains("posix|stdio") & ~df["file_name"].isna()

        # Step 1: Map file purpose suffixes first
        purpose_updates = {"/data": "_reader", "/checkpoint": "_checkpoint"}

        for path, suffix in purpose_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        # Step 2: Map filesystem suffixes
        filesystem_updates = {"/lustre": "_lustre", "/ssd": "_ssd"}

        for path, suffix in filesystem_updates.items():
            mask = base_condition & df["file_name"].str.contains(path)
            df.loc[mask, "cat"] = df.loc[mask, "cat"] + suffix

        return df

    @staticmethod
    def _sanitize_size_offset(df: pd.DataFrame):
        df["size"] = df["size"].replace(0, pd.NA)
        if "offset" in df.columns:
            df["offset"] = df["offset"].replace(0, pd.NA)
        return df
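
The choice of missing-value marker matters because it affects the column dtype. The sketch below uses made-up sizes: replacing 0 with np.nan upcasts an int64 column to float64, while a nullable Int64 column keeps integer values and stores <NA>. The Series.mask variant is an assumption offered for comparison, not what either definition in this PR does.

```python
import numpy as np
import pandas as pd

sizes = pd.Series([0, 4096, 8192])  # plain int64 column

# Approach from the first definition: 0 -> np.nan upcasts the column to float64.
with_nan = sizes.replace(0, np.nan)

# Hypothetical alternative: convert to nullable Int64 and mask the zeros,
# which keeps integer values and marks the masked rows as <NA>.
with_na = sizes.astype("Int64").mask(sizes == 0)
```

Both results report the first row as missing through isna(), so downstream code that only checks for missing values behaves the same; code that depends on the dtype does not.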

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@izzet izzet merged commit 6d90ae0 into llnl:develop Dec 4, 2025
3 checks passed