-
Notifications
You must be signed in to change notification settings - Fork 55
Add log manager to tracker experiement logs #507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @pan-x-c, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new trinity log command to monitor experiment logs in real-time. The core logic is encapsulated in a new LogManager. The changes also include a small refactoring in the Config class to centralize the construction of the checkpoint job directory path.
My review focuses on improving robustness and performance. I've identified a potential high-memory usage issue in the log searching feature, a case of silent exception handling that could hide bugs, and an opportunity to simplify some redundant code in the new CLI command. Overall, this is a great addition for improving observability during training.
| f"{Colors.CYAN}[INFO] Searching for pattern '{self.search_pattern}' in {self.filepath}{Colors.RESET}" | ||
| ) | ||
| self.file.seek(0) | ||
| lines = self.file.readlines() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the entire file into memory with self.file.readlines() can lead to high memory consumption and potentially a MemoryError if the log file is very large.
To make this more memory-efficient, consider processing the file line-by-line. You could use a collections.deque with a fixed size to maintain a sliding window of lines, which would allow you to print context around a match without loading the entire file.
| if not log_dir: | ||
| cfg = load_config(config) | ||
| checkpoint_job_dir = cfg.get_checkpoint_job_dir() | ||
| # we do not use check_and_update here because user may use this command | ||
| # in another environment | ||
| log_dir = os.path.join(checkpoint_job_dir, "log") | ||
| if not os.path.exists(log_dir): | ||
| raise FileNotFoundError(f"Log directory not found: {log_dir}") | ||
| if not os.path.exists(log_dir): | ||
| raise FileNotFoundError(f"Log directory not found: {log_dir}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic for determining and checking log_dir can be simplified. The os.path.exists(log_dir) check is performed twice in the case where log_dir is derived from the config. You can consolidate this into a single check after log_dir has been determined.
| if not log_dir: | |
| cfg = load_config(config) | |
| checkpoint_job_dir = cfg.get_checkpoint_job_dir() | |
| # we do not use check_and_update here because user may use this command | |
| # in another environment | |
| log_dir = os.path.join(checkpoint_job_dir, "log") | |
| if not os.path.exists(log_dir): | |
| raise FileNotFoundError(f"Log directory not found: {log_dir}") | |
| if not os.path.exists(log_dir): | |
| raise FileNotFoundError(f"Log directory not found: {log_dir}") | |
| if not log_dir: | |
| cfg = load_config(config) | |
| checkpoint_job_dir = cfg.get_checkpoint_job_dir() | |
| # we do not use check_and_update here because user may use this command | |
| # in another environment | |
| log_dir = os.path.join(checkpoint_job_dir, "log") | |
| if not os.path.exists(log_dir): | |
| raise FileNotFoundError(f"Log directory not found: {log_dir}") |
trinity/manager/log_manager.py
Outdated
| except Exception: | ||
| return False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This broad except Exception: block silently catches all exceptions and returns False. This can hide potential bugs or unexpected issues during file stat checks (e.g., permission errors). It's better to log the exception to aid in debugging, similar to how other exception blocks in this class are handled.
| except Exception: | |
| return False | |
| except Exception as e: | |
| print(f"{Colors.RED}[ERROR] Error checking file rotation for {self.filepath}: {e}{Colors.RESET}") | |
| return False |
Description
Using
trinity logto track the logs during training.Checklist
Please check the following items before code is ready to be reviewed.