Darwin PMDA Phase 2: Apple Silicon metrics expansion #2466
Open
tallpsmith wants to merge 57 commits into performancecopilot:main
Focus on M-series Macs with graceful degradation for Intel. Catalogs ~100 additional metrics across thermal monitoring, GPU utilization, battery/power, enhanced process I/O, IPv6, disk queues, and system limits.
Expose system resource limits via sysctl-based metrics providing visibility into kernel-enforced process and file descriptor limits.
New metrics:
- kernel.limits.maxproc (kern.maxproc)
- kernel.limits.maxprocperuid (kern.maxprocperuid)
- kernel.limits.maxfiles (kern.maxfiles)
- kernel.limits.maxfilesperproc (kern.maxfilesperproc)
- vfs.vnodes.recycled (kern.num_recycledvnodes)
Adds visibility into macOS memory compression performance and health for diagnosing memory pressure. Implements Phase 2 step 2 with six new sysctl-based metrics: timing buckets (30s/60s/300s), thrashing detection, major compactions, and LZ4 compression counts.
Add proc.io.logical_writes and proc.memory.footprint to surface data already fetched via proc_pid_rusage(). Uses rusage_info_v4 with v3 fallback for older macOS versions.
Documents architecture for IOKit-based GPU monitoring on macOS. Covers utilization and memory metrics with TDD approach.
Enable visibility into GPU workloads on macOS via IOKit IOAccelerator. Exposes utilization and memory usage for both Apple Silicon and Intel GPUs.
Completes GPU monitoring by integrating metrics into build system, instance domains, fetch callbacks, and test suite.
Adds completion tracking showing 19/100 metrics (19%) implemented across Wave 1 and Wave 2. Documents completed work: GPU monitoring, memory compression deep dive, system limits, and process I/O metrics.
Document planned macstat views for GPU, power/battery, and thermal monitoring. Includes ready-to-implement macstat-gpu view plus updates to macstat-x (GPU util) and macstat-mem (compression timing).
Test was using incorrect darwin.* prefixes; actual PMNS defines metrics without the prefix
Route GPU count through CLUSTER_GPU to reach fetch handler
Renames README.md files to CLAUDE.md to align with hierarchical project documentation system. Fixes path typos and adds cross-references.
Three critical constraints now prominently documented:
1. PCP is NOT installed locally - read pmns file instead of running pminfo
2. Git commit required before VM tests - VM clones repo, can't see uncommitted changes
3. Unit tests local, integration tests VM-only
Updated files:
- src/pmdas/darwin/Claude.md: Add constraint warning box at top
- .claude/skills/macos-qa-test/SKILL.md: Fix "When to Use", require commit first
- .claude/agents/macos-darwin-pmda-qa.md: Add git status check for uncommitted changes
- build/mac/CLAUDE.md: Add constraint box, clarify VM-only integration tests
- CLAUDE.md: Add Available Agents section, macOS constraints
Agent now refuses to run if uncommitted changes detected in darwin or build/mac dirs.
Root cause: hinv.ngpu registered at wrong cluster (4 vs 19) causing "Unknown metric" errors.
Changes:
- Fix PMNS cluster for hinv.ngpu (DARWIN:4:99 -> DARWIN:19:99)
- Add debug logging to gpu_iokit.c for IOKit enumeration failures
- Add debug logging to gpu.c for initialization/refresh tracking
- Update integration test to accept 0 GPUs as valid (VM environment)
- Improve value extraction regex in test for reliability
The VM environment (Tart/GitHub Actions) has no GPU hardware, so 0 GPUs is expected. Debug logs now surface in pmcd.log/darwin.log automatically on test failures.
VM environment has virtual GPU driver (IOAccelerator service) but no actual performance statistics. Test now validates that metrics exist and have correct structure, but accepts missing values as valid in VM context. Fixes bash arithmetic error when util_value is empty.
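The empty-value failure mentioned above is easy to reproduce: a numeric test on an empty string is an error in bash. The following is a minimal sketch of one way to guard against it, not the code from the actual test script. The function name `check_gpu_util` and the 0-100 range are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a GPU utilization reading from a test run.
# An empty string stands for "metric exists but has no value" (the VM case).
check_gpu_util() {
    local util_value="$1"

    # Guard first: a numeric test on an empty string is an error in bash.
    if [ -z "$util_value" ]; then
        echo "skip: no value (acceptable in a VM)"
        return 0
    fi

    # Reject anything that is not a plain non-negative integer.
    case "$util_value" in
        *[!0-9]*) echo "fail: non-numeric value '$util_value'"; return 1 ;;
    esac

    # Assumed valid range for a percentage-style utilization metric.
    if [ "$util_value" -le 100 ]; then
        echo "ok: utilization ${util_value}%"
    else
        echo "fail: utilization ${util_value}% out of range"
        return 1
    fi
}
```

Calling `check_gpu_util ""` takes the skip path instead of reaching the numeric comparison, which is the behavior the commit describes for the VM context.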
…purposes). Under VMs like these there's no GPU, so it'll be 0, but a valid value is still good.
Implements the final Wave 1 metrics via sysctl reads: mbuf clusters, max socket buffer size, socket listen backlog, and defunct socket calls. Follows established VFS pattern with dedicated ipc.c/h module wired into pmda.c refresh/fetch cycles.
The ipc metrics were defined but not linked to the root namespace, causing PMNS parsing failures during build.
Mark Category 7.2 complete. Wave 1 now fully implemented: 21 metrics across 14 clusters (202 total Darwin PMDA metrics).
Renamed Claude.md to CLAUDE.md for consistent capitalization. Added critical documentation on PMNS root namespace requirement and VFS-pattern template for adding new metric clusters to prevent "Disconnected subtree" build errors.
Expose 21 new Darwin PMDA metrics through operator-friendly pmrep views. Created macstat-gpu for GPU monitoring, added compression timing to macstat-mem, and quick GPU utilization to macstat-x.
Why: Wave 1 metrics exist but lack discoverability - operators need views to use them effectively for performance troubleshooting.
The .txt files were intended for dbpmda but provide no value over existing bash integration tests (test-gpu-metrics.sh, etc.) which validate actual metric values and handle VM/hardware variations better.
Why: Misunderstood test architecture. dbpmda tests would be lower-level smoke tests, but bash scripts already provide superior coverage through pminfo/pmcd. No point maintaining two test suites for same metrics.
Creates bash integration tests following the GPU metrics pattern. Tests validate metric existence and value ranges, with VM-aware handling for power metrics (no battery in VMs).
Power tests (13 metrics):
- Battery presence, charging state, charge level (0-100%)
- Time remaining, health, cycle count
- Temperature (°C×100), voltage (mV), amperage (mA)
- Design/max capacity (mAh)
- AC connection status, power source string
- Handles VM case: battery_present=0, ac_connected=1
IPC tests (4 metrics):
- mbuf clusters, maxsockbuf, somaxconn (all > 0)
- defunct socket count (>= 0)
Why: TDD penance - these tests should have been written BEFORE implementing the features. Better late than never. Tests run in VM integration phase.
Reflects recent implementation progress:
- Wave 1 complete: 17/25 metrics (System Limits 5, Memory Compression 6, Process I/O 1, Process Memory 1, IPC 4)
- Wave 2 partial: 17/30 metrics (GPU 4, Battery/Power 13)
- Phase 2 total: 34/100 metrics (34% complete)
Category updates:
- Cat 2 GPU: Complete (4/4)
- Cat 3 Power: Partial (13/15 - missing 3.3 power consumption which needs root)
- Cat 7 System Limits & IPC: Complete (9/9)
- Cat 10 Memory Compression: Complete (6/6)
Why: Keep research doc accurate for tracking progress and identifying remaining work.
…than a mere suggestion.
Completes research doc Wave 2 requirements by adding 13 new metrics:
- 6 IPv6 protocol statistics via net.inet6.ip6.stats sysctl
- 7 per-process QoS-tier CPU time via RUSAGE_INFO_V4
All Wave 2 items now complete (47/99 metrics, 47% done):
- IPv6 network statistics (6 metrics)
- Process QoS CPU time (7 metrics)
- Process file descriptor count (verified)
Added critical maintenance requirement section.
Extends IOBlockStorageDriver stats extraction with error counts, retry counts, timing data, and derived metrics (avgrq_sz, await). Adds new APFS module exposing container and volume metrics via IOKit.
Per-device disk metrics (71-78):
- read_errors, write_errors: Error counts from IOBlockStorageDriver
- read_retries, write_retries: Retry counts
- total_read_time, total_write_time: Cumulative I/O time (nanoseconds)
- avgrq_sz: Average request size (derived)
- await: Average wait time (derived)
Aggregate disk.all metrics (79-86): Same metrics summed across devices
APFS container metrics (87-98):
- ncontainer: Container count
- per-container: block_size, bytes_read/written, read/write_requests, transactions, cache_hits/evictions, read/write_errors
APFS volume metrics (99-100):
- nvolume: Volume count
- per-volume: encrypted, locked status
Implementation:
- Extended disk.c with new IOKit key extractions
- Created apfs.c/apfs.h following gpu module pattern
- Added CLUSTER_APFS, APFS_CONTAINER_INDOM, APFS_VOLUME_INDOM
- Wired into pmda.c refresh/fetch callbacks
- Added integration tests for both disk and APFS metrics
Following TDD: Tests written first, then implementation.
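The commit message does not spell out how the derived metrics are computed. As a rough sketch only, assuming avgrq_sz is total bytes divided by total operations and await is cumulative I/O time divided by total operations, a test could sanity-check reported values like this. The helper name `derive_avg` is hypothetical, and the real PMDA computes these in C.

```shell
#!/usr/bin/env bash
# Assumed definitions (not taken from the PMDA source):
#   avgrq_sz = total bytes transferred / total operations
#   await    = cumulative I/O time     / total operations
# Integer division is used; a device with no completed I/O yields 0
# rather than a division-by-zero error.
derive_avg() {
    local total="$1" ops="$2"
    if [ "$ops" -eq 0 ]; then
        echo 0
    else
        echo $((total / ops))
    fi
}

# Illustrative counters, not real measurements.
derive_avg 1048576 256    # bytes per request
derive_avg 5120000 256    # nanoseconds per request
derive_avg 0 0            # idle device
```

The zero-operations guard matters most for freshly attached or idle devices, where both counters are still zero.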
Corrects copy-paste error where disk.apfs.container.bytes_written used PMID 93 instead of 91, causing metric descriptor mismatch.
Also updates research doc with Wave 3a completion status:
- 77 total Phase 2 metrics now complete (77% of target)
- Documents what could not be implemented (queue_depth, inflight, etc.)
- Notes IOKit API limitations for certain disk metrics
Documents need for automated validation to prevent PMID mismatches like the Wave 3a bug (bytes_written using PMID 93 instead of 91).
Task includes:
- Full specification of what to validate
- Implementation approach
- CI integration points
- Success criteria
Priority: HIGH - should be done before continuing Wave 3b/4.
Prevent PMID mismatch bugs by validating pmns ↔ metrics.c consistency.
Validates:
- Every PMID in pmns exists in metrics.c
- No duplicate PMIDs within clusters
- Cluster enum definitions resolve correctly
Catches bugs like Wave 3a disk.apfs.container.bytes_written mismatch (pmns:23:91 vs metrics.c:23:93) at build time instead of runtime.
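To illustrate the first two checks, here is a minimal sketch that is not the validator from this PR. It assumes PMNS leaf lines carry a `DARWIN:<cluster>:<item>` token and that the pairs registered in metrics.c have already been resolved to numeric `cluster:item` form. The function names and the sample items other than 23:91 and 23:93 are made up for illustration.

```shell
#!/usr/bin/env bash
# Extract "cluster:item" pairs from PMNS-style leaf lines such as
#   bytes_written   DARWIN:23:91
# Lines without a DARWIN:<cluster>:<item> token are ignored.
pmns_pmids() {
    sed -n 's/.*DARWIN:\([0-9][0-9]*\):\([0-9][0-9]*\).*/\1:\2/p'
}

# Print every pair that occurs more than once on stdin.
duplicate_pmids() {
    sort | uniq -d
}

# Print pairs from the first argument that are absent from the second.
# Both arguments are newline-separated lists of "cluster:item" pairs.
missing_pmids() {
    printf '%s\n' "$1" | while read -r pmid; do
        [ -n "$pmid" ] || continue
        if ! printf '%s\n' "$2" | grep -qxF -- "$pmid"; then
            echo "$pmid"
        fi
    done
}

# Illustrative input modeled on the Wave 3a mismatch.
pmns='bytes_read      DARWIN:23:90
bytes_written   DARWIN:23:91
read_requests   DARWIN:23:92'
# Stand-in for the pairs registered in metrics.c.
registered='23:90
23:93
23:92'

printf '%s\n' "$pmns" | pmns_pmids | duplicate_pmids
missing_pmids "$(printf '%s\n' "$pmns" | pmns_pmids)" "$registered"
```

With this sample input the duplicate check prints nothing and the missing check prints `23:91`, the PMID that the namespace defines but the metric table does not register.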
Run validator after build but before unit tests to catch PMID mismatches early in the build process.
Track per-process network connection counts by inspecting file descriptors and socket info via libproc. Extends FD enumeration to identify IPv4/IPv6 TCP and UDP sockets, enabling network activity monitoring at process granularity.
…vior
Desktop/laptop users need thermal visibility for performance diagnosis, especially on Apple Silicon where thermal management is opaque.
Implements 13 metrics via SMC and thermal pressure API:
- Temperature sensors (CPU/GPU die, package, ambient)
- Fan metrics (RPM, target, mode, min/max per-fan)
- Thermal pressure level/state (always available, no SMC required)
SMC access is community reverse-engineered (not Apple-supported). Code degrades gracefully when unavailable.
Raises Phase 2 high-value metric completion from 79% to ~85%.
The Darwin PMDA was failing to load with 'Undefined instance domain serial (9)' because FAN_INDOM was declared in the enum but not added to the indomtab array. This caused pmdaInit() to reject the entire PMDA, breaking all Darwin metrics. Add FAN_INDOM entry to indomtab with initial NULL instances (populated dynamically by thermal subsystem).
Critical learnings from thermal implementation:
- Instance domains must be added to both darwin.h enum AND pmda.c indomtab array
- Missing indomtab entry causes 'Undefined instance domain serial' error
- This is the standard PMDA pattern (not Darwin-specific)
- Document direct manipulation of it_set/it_numinst (no helper function exists)
Wave 3b thermal monitoring was implemented in commits a89dace and b15b4a3 but the research document was never updated, creating confusion about project status.
Updates:
- Mark Wave 3b complete with 13 thermal metrics
- Update total metrics from 79 to 92
- Update Wave 3 total from 32 to 45 metrics
- Update Category 1 (Thermal) to Complete status (13/15)
- Update Category 6 (Disk) to Complete status (30/30)
- Update overall completion to ~85% (92/99 metrics)
- Mark all pmrep views as Ready (thermal, power, gpu all unblocked)
Completes Category 11 (pmrep views) for Darwin PMDA Phase 2. New views expose the power/battery and thermal metrics added in earlier waves to end users via simple pmrep commands.
Merge conflict left CLUSTER_LIMITS with duplicate comment "18", causing all subsequent cluster comments to be off by one. Actual enum values were correct (C auto-increments), but documentation was misleading for anyone reading the code.
Merge conflict fallout caused systematic cluster numbering errors where kernel.limits metrics occupied cluster 18 (should be 19), pushing all subsequent clusters off by one. Runtime metric fetches failed with "Requested metric not defined" errors.
Additionally, the PMID consistency validator had a critical bug where error counting happened in a subshell (pipe to while loop), causing it to always report PASSED even when detecting errors.
Fixes:
- pmns: Update cluster numbers 18-24 to 19-25 for limits/gpu/ipc/power/ipv6/apfs/thermal
- pmns: Keep LOGIN metrics (nusers/nroots/nsessions) at cluster 18
- test-pmid-consistency.sh: Replace pipe-to-while with heredoc to fix error counting
Validator now correctly fails when PMIDs mismatch, catching these issues early.
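The subshell problem described above can be reproduced in a few lines. This is a minimal sketch, not the actual test-pmid-consistency.sh. The function names and the OK/ERROR line format are made up for illustration.

```shell
#!/usr/bin/env bash
# Sample validator findings; the "ERROR" prefix is an assumed convention.
findings='OK    kernel.limits.maxproc
ERROR disk.apfs.container.bytes_written
ERROR hinv.ngpu'

# Buggy shape: the while loop is the tail of a pipeline, so bash runs it
# in a subshell and the increments are lost when the pipeline ends.
count_errors_pipe() {
    local errors=0
    printf '%s\n' "$1" | while read -r status _; do
        if [ "$status" = "ERROR" ]; then
            errors=$((errors + 1))
        fi
    done
    echo "$errors"
}

# Fixed shape: feed the loop from a heredoc so it runs in the current
# shell and the counter survives the loop.
count_errors_heredoc() {
    local errors=0
    while read -r status _; do
        if [ "$status" = "ERROR" ]; then
            errors=$((errors + 1))
        fi
    done <<EOF
$1
EOF
    echo "$errors"
}

echo "pipe:    $(count_errors_pipe "$findings")"
echo "heredoc: $(count_errors_heredoc "$findings")"
```

Under bash's default settings the pipe variant reports 0 and the heredoc variant reports 2, which is why a validator built on the first shape can report PASSED while having seen errors.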
PCP doesn't recognize mAh as a unit, causing PM_ERR_CONV errors. Metrics display raw values in milliamp-hours.
Prevent PM_ERR_CONV errors by documenting which units PCP recognizes.
Wave 4 Deferred
Wave 4 optional metrics have been split out for future consideration in Issue #2484. This PR delivers Waves 1-3 (92 metrics) - the core Phase 2 value.
Summary
Phased expansion of Darwin PMDA with ~100 additional metrics for Apple Silicon Macs, focused on system observability, thermal monitoring, and storage analytics.
Current Progress: 92/100 metrics implemented (~92% complete) 🎯
✅ Wave 1: Quick Wins (COMPLETE)
15 metrics - Low complexity, high value additions
- kernel.limits.* for maxproc, maxfiles, vnodes
- proc.io.logical_writes, proc.memory.footprint
Commits: 42f2708, fde9fed, 91c1cb3, abfab40, 20c9d64
✅ Wave 2: Medium Effort (COMPLETE)
32 metrics - Medium complexity, production-grade monitoring
Commits: 0283412, 11d49b8, f0ce125, 68d095a
✅ Wave 3: Higher Effort (COMPLETE - 45/~45 metrics)
Wave 3a: Disk & APFS Statistics (30 metrics)
Extended Disk I/O Metrics (16 metrics)
Per-device and aggregate metrics from IOBlockStorageDriver:
- disk.{dev,all}.{read,write}_errors
- disk.{dev,all}.{read,write}_retries
- disk.{dev,all}.total_{read,write}_time (nanoseconds)
- disk.{dev,all}.avgrq_sz (avg request size), disk.{dev,all}.await (avg wait time)
APFS Statistics (14 metrics)
Container and volume metrics via IOKit:
- disk.apfs.{ncontainer,nvolume}
Implementation Notes:
- queue_depth, inflight, util NOT implemented (IOKit doesn't expose these)
Commits: afdf044, 8f2c3a4
Wave 3b: Thermal Monitoring (13 metrics)
SMC-based thermal and fan monitoring with graceful degradation:
- thermal.cpu.die, thermal.cpu.proximity, thermal.gpu.die, thermal.package, thermal.ambient
- hinv.nfan, thermal.fan.{speed,target,mode,min,max} (per-fan instance domain)
- thermal.pressure.level, thermal.pressure.state
Platform behavior:
- hinv.nfan=0 (MacBook Air M1/M2, Mac mini M1/M2, Mac Studio base)
Commits: a89dace, b15b4a3
Wave 3c: Process Network Connections (2 metrics)
Per-process TCP/UDP socket counts:
- proc.net.tcp_count, proc.net.udp_count via PROC_PIDFDSOCKETINFO enumeration
Commit: 55c1740
🔲 Wave 4: Optional/Specialized (DEFERRED)
Wave 4 has been deferred to Issue #2484 for future consideration.
Wave 4 scope includes ~22 metrics across Device Enumeration, Power Consumption (requires root), Scheduler Counters, and Advanced Network statistics. These were deprioritized as they represent specialized/low-value use cases or are blocked by entitlement requirements.
See: Issue #2484 - Darwin PMDA Phase 2 Wave 4
Technical Highlights
New Clusters Added
- CLUSTER_GPU (19): GPU device statistics
- CLUSTER_IPC (20): IPC resource limits
- CLUSTER_POWER (21): Battery & power management
- CLUSTER_APFS (23): APFS filesystem statistics
- CLUSTER_THERMAL (24): SMC thermal & fan monitoring
New Instance Domains
- GPU_INDOM: Per-GPU device metrics
- APFS_CONTAINER_INDOM: Per-APFS-container metrics
- APFS_VOLUME_INDOM: Per-APFS-volume metrics
- FAN_INDOM: Per-fan thermal metrics
Integration Test Coverage
Architectural Patterns
Current Metrics Totals
Documentation
See darwin-pmda-phase2-research.md for:
Tracking: Issue #2465