Darwin PMDA Phase 2: Apple Silicon metrics expansion #2466

Open
tallpsmith wants to merge 57 commits into performancecopilot:main from tallpsmith:darwin-pmda-phase2-apple-silicon
tallpsmith commented Jan 23, 2026

Summary

Phased expansion of Darwin PMDA with ~100 additional metrics for Apple Silicon Macs, focused on system observability, thermal monitoring, and storage analytics.

Current Progress: 92/100 metrics implemented (~92% complete) 🎯

✅ Wave 1: Quick Wins (COMPLETE)

15 metrics - Low complexity, high value additions

  • System Resource Limits (5 metrics): kernel.limits.* for maxproc, maxfiles, vnodes
  • Memory Compression Deep Dive (6 metrics): Timing buckets, thrashing detection, LZ4 stats
  • Process I/O & Memory (2 metrics): proc.io.logical_writes, proc.memory.footprint
  • IPC & Socket Pools (4 metrics): mbuf clusters, socket limits, defunct sockets

Commits: 42f2708, fde9fed, 91c1cb3, abfab40, 20c9d64

✅ Wave 2: Medium Effort (COMPLETE)

32 metrics - Medium complexity, production-grade monitoring

  • GPU Monitoring (4 metrics): Device count, utilization %, VRAM usage/free
  • Battery & Power (13 metrics): Charge state, health, cycle count, temperature, voltage, amperage, capacity tracking, AC status
  • Enhanced IPv6 (6 metrics): Packet counts, discards, fragments, reassembly
  • Process QoS CPU Time (7 metrics): Per-QoS-class CPU usage (default, maintenance, background, utility, legacy, user-initiated, user-interactive)
  • Process File Descriptors (1 metric): Open FD count per process
  • Process Network (1 metric): TCP connection count per process

Commits: 0283412, 11d49b8, f0ce125, 68d095a

✅ Wave 3: Higher Effort (COMPLETE - 45/~45 metrics)

Wave 3a: Disk & APFS Statistics (30 metrics)

Extended Disk I/O Metrics (16 metrics)

Per-device and aggregate metrics from IOBlockStorageDriver:

  • Error tracking: disk.{dev,all}.{read,write}_errors
  • Retry counts: disk.{dev,all}.{read,write}_retries
  • Timing data: disk.{dev,all}.total_{read,write}_time (nanoseconds)
  • Derived metrics: disk.{dev,all}.avgrq_sz (avg request size), disk.{dev,all}.await (avg wait time)
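The derived metrics can be sketched from the raw counters. This is only an illustration: the PR text does not spell out the formulas, so the definitions below (total bytes per operation for avgrq_sz, total I/O time per operation for await, in nanoseconds) are assumptions modelled on the usual iostat meaning, and the function name `derive_disk_stats` is hypothetical.

```shell
#!/bin/sh
# Sketch only: assumed definitions, not taken from disk.c.
#   avgrq_sz = (bytes_read + bytes_written) / (reads + writes)
#   await    = (total_read_time + total_write_time) / (reads + writes)
# usage: derive_disk_stats READS WRITES BYTES_READ BYTES_WRITTEN READ_NS WRITE_NS
derive_disk_stats() {
    ops=$(( $1 + $2 ))
    # An idle device has no operations; report zeros instead of dividing by zero.
    if [ "$ops" -eq 0 ]; then
        echo "avgrq_sz=0 await_ns=0"
        return 0
    fi
    echo "avgrq_sz=$(( ($3 + $4) / ops )) await_ns=$(( ($5 + $6) / ops ))"
}

# Example: 100 reads and 50 writes of 4 KiB each, 4.5 ms of total I/O time.
derive_disk_stats 100 50 409600 204800 3000000 1500000
```

Integer division is used here for brevity; a real implementation would likely want floating point and would compute deltas between samples rather than lifetime totals.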

APFS Statistics (14 metrics)

Container and volume metrics via IOKit:

  • Inventory: disk.apfs.{ncontainer,nvolume}
  • Per-container (11 metrics): block size, bytes read/written, I/O request counts, transactions flushed, cache hits/evictions, metadata errors
  • Per-volume (2 metrics): encryption status, locked status

Implementation Notes:

  • ❌ queue_depth, inflight, util NOT implemented (IOKit doesn't expose these)
  • ❌ APFS snapshot metrics NOT implemented (per-volume complexity)
  • ❌ Container size/free NOT implemented (not in IORegistry)

Commits: afdf044, 8f2c3a4

Wave 3b: Thermal Monitoring (13 metrics)

SMC-based thermal and fan monitoring with graceful degradation:

  • Temperature sensors (5 metrics): thermal.cpu.die, thermal.cpu.proximity, thermal.gpu.die, thermal.package, thermal.ambient
  • Fan monitoring (6 metrics): hinv.nfan, thermal.fan.{speed,target,mode,min,max} (per-fan instance domain)
  • Thermal pressure (2 metrics): thermal.pressure.level, thermal.pressure.state

Platform behavior:

  • Apple Silicon: Full support with Tp*/Tg* SMC keys
  • Intel Macs: Metrics registered but return no values (graceful degradation)
  • Fanless Macs: Report hinv.nfan=0 (e.g. MacBook Air M1/M2)

Commits: a89dace, b15b4a3

Wave 3c: Process Network Connections (2 metrics)

Per-process TCP/UDP socket counts:

  • proc.net.tcp_count, proc.net.udp_count via PROC_PIDFDSOCKETINFO enumeration

Commit: 55c1740

🔲 Wave 4: Optional/Specialized (DEFERRED)

Wave 4 has been deferred to Issue #2484 for future consideration.

Wave 4 scope includes ~22 metrics across Device Enumeration, Power Consumption (requires root), Scheduler Counters, and Advanced Network statistics. These were deprioritized as they represent specialized/low-value use cases or are blocked by entitlement requirements.

See: Issue #2484 - Darwin PMDA Phase 2 Wave 4


Technical Highlights

New Clusters Added

  • CLUSTER_GPU (19): GPU device statistics
  • CLUSTER_IPC (20): IPC resource limits
  • CLUSTER_POWER (21): Battery & power management
  • CLUSTER_APFS (23): APFS filesystem statistics
  • CLUSTER_THERMAL (24): SMC thermal & fan monitoring

New Instance Domains

  • GPU_INDOM: Per-GPU device metrics
  • APFS_CONTAINER_INDOM: Per-APFS-container metrics
  • APFS_VOLUME_INDOM: Per-APFS-volume metrics
  • FAN_INDOM: Per-fan thermal metrics

Integration Test Coverage

  • GPU metrics validation (utilization, VRAM)
  • Power metrics validation (battery health, charging)
  • IPv6 protocol statistics
  • Process QoS CPU time buckets
  • Extended disk I/O metrics (errors, retries, timing)
  • APFS container/volume statistics
  • Thermal sensors and fan monitoring

Architectural Patterns

  • IOKit integration for hardware metrics (GPU, APFS, thermal)
  • SMC interface with graceful degradation for thermal data
  • IOPMCopyBatteryInfo for power management
  • sysctl for IPC/IPv6 statistics
  • rusage/proc_pid_rusage for process QoS time

Current Metrics Totals

  • Darwin PMDA: ~278 metrics across 17 clusters (was ~186 in Phase 1)
  • Darwin_proc PMDA: ~37 per-process metrics (was ~35 in Phase 1)

Documentation

See darwin-pmda-phase2-research.md for:

  • Detailed metric specifications
  • API research & implementation notes
  • Platform compatibility matrix
  • Wave-by-wave progress tracking

Tracking: Issue #2465

Focus on M-series Macs with graceful degradation for Intel. Catalogs ~100 additional metrics across thermal monitoring, GPU utilization, battery/power, enhanced process I/O, IPv6, disk queues, and system limits.

tallpsmith self-assigned this Jan 23, 2026
tallpsmith added the macOS label (For issues specific or related to macOS) Jan 23, 2026
tallpsmith reopened this Jan 23, 2026

Expose system resource limits via sysctl-based metrics providing
visibility into kernel-enforced process and file descriptor limits.

New metrics:
- kernel.limits.maxproc (kern.maxproc)
- kernel.limits.maxprocperuid (kern.maxprocperuid)
- kernel.limits.maxfiles (kern.maxfiles)
- kernel.limits.maxfilesperproc (kern.maxfilesperproc)
- vfs.vnodes.recycled (kern.num_recycledvnodes)

Adds visibility into macOS memory compression performance and health
for diagnosing memory pressure. Implements Phase 2 step 2 with six new
sysctl-based metrics: timing buckets (30s/60s/300s), thrashing detection,
major compactions, and LZ4 compression counts.

Add proc.io.logical_writes and proc.memory.footprint to surface data
already fetched via proc_pid_rusage(). Uses rusage_info_v4 with v3
fallback for older macOS versions.

Documents architecture for IOKit-based GPU monitoring on macOS.
Covers utilization and memory metrics with TDD approach.

Enable visibility into GPU workloads on macOS via IOKit IOAccelerator.
Exposes utilization and memory usage for both Apple Silicon and Intel GPUs.

Completes GPU monitoring by integrating metrics into build system, instance domains, fetch callbacks, and test suite.

Adds completion tracking showing 19/100 metrics (19%) implemented across
Wave 1 and Wave 2. Documents completed work: GPU monitoring, memory
compression deep dive, system limits, and process I/O metrics.

Document planned macstat views for GPU, power/battery, and thermal
monitoring. Includes ready-to-implement macstat-gpu view plus updates
to macstat-x (GPU util) and macstat-mem (compression timing).

Test was using incorrect darwin.* prefixes; actual PMNS defines metrics without the prefix.

Route GPU count through CLUSTER_GPU to reach fetch handler.

Renames README.md files to CLAUDE.md to align with hierarchical
project documentation system. Fixes path typos and adds cross-references.

Three critical constraints now prominently documented:
1. PCP is NOT installed locally - read pmns file instead of running pminfo
2. Git commit required before VM tests - VM clones repo, can't see uncommitted changes
3. Unit tests local, integration tests VM-only

Updated files:
- src/pmdas/darwin/Claude.md: Add constraint warning box at top
- .claude/skills/macos-qa-test/SKILL.md: Fix "When to Use", require commit first
- .claude/agents/macos-darwin-pmda-qa.md: Add git status check for uncommitted changes
- build/mac/CLAUDE.md: Add constraint box, clarify VM-only integration tests
- CLAUDE.md: Add Available Agents section, macOS constraints

Agent now refuses to run if uncommitted changes detected in darwin or build/mac dirs.
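A guard of this kind can be sketched with `git status --porcelain` restricted to the two directories. The function name `require_clean_tree` is hypothetical and this is not the agent's actual check; only the paths src/pmdas/darwin and build/mac come from the notes above.

```shell
#!/bin/sh
# Sketch only: refuse to continue when the given paths have uncommitted
# or untracked changes, since the VM clones the repo and cannot see them.
# usage: require_clean_tree PATH...
require_clean_tree() {
    # --porcelain prints one line per changed or untracked path and
    # nothing at all when the selected paths are clean.
    changes=$(git status --porcelain -- "$@") || return 2
    if [ -n "$changes" ]; then
        echo "uncommitted changes detected, commit before running VM tests:" >&2
        echo "$changes" >&2
        return 1
    fi
}

# Intended call site, run from the repository root:
#   require_clean_tree src/pmdas/darwin build/mac || exit 1
```

Returning 2 when git itself fails keeps "not a repository" distinct from "dirty tree".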
Root cause: hinv.ngpu registered at wrong cluster (4 vs 19) causing "Unknown metric" errors.

Changes:
- Fix PMNS cluster for hinv.ngpu (DARWIN:4:99 -> DARWIN:19:99)
- Add debug logging to gpu_iokit.c for IOKit enumeration failures
- Add debug logging to gpu.c for initialization/refresh tracking
- Update integration test to accept 0 GPUs as valid (VM environment)
- Improve value extraction regex in test for reliability

The VM environment (Tart/GitHub Actions) has no GPU hardware, so 0 GPUs is expected.
Debug logs now surface in pmcd.log/darwin.log automatically on test failures.
VM environment has virtual GPU driver (IOAccelerator service) but no actual
performance statistics. Test now validates that metrics exist and have correct
structure, but accepts missing values as valid in VM context.

Fixes bash arithmetic error when util_value is empty.
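The empty-value case can be handled by checking the string before any numeric comparison. A minimal sketch, assuming the test treats utilization as an integer percentage; `valid_utilization` is a hypothetical helper, not the function used in test-gpu-metrics.sh.

```shell
#!/bin/sh
# Sketch only: accept an empty value (metric present but no statistics,
# as in a VM) or an integer percentage in 0..100.
valid_utilization() {
    value=$1
    # Empty is valid in the VM context; return before any arithmetic,
    # because a numeric test on "" is an error in the shell.
    [ -z "$value" ] && return 0
    case $value in
        *[!0-9]*) return 1 ;;   # anything other than digits is rejected
    esac
    [ "$value" -ge 0 ] && [ "$value" -le 100 ]
}

valid_utilization ""  && echo "empty: accepted"
valid_utilization 42  && echo "42: accepted"
valid_utilization 150 || echo "150: rejected"
```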
…purposes).

In a VM like this there is no GPU, so the count will be 0, which is still a valid value.
Implements the final Wave 1 metrics via sysctl reads: mbuf clusters, max socket buffer size, socket listen backlog, and defunct socket calls. Follows established VFS pattern with dedicated ipc.c/h module wired into pmda.c refresh/fetch cycles.

The ipc metrics were defined but not linked to the root namespace, causing PMNS parsing failures during build.

Mark Category 7.2 complete. Wave 1 now fully implemented: 21 metrics across 14 clusters (202 total Darwin PMDA metrics).

Renamed Claude.md to CLAUDE.md for consistent capitalization. Added critical documentation on PMNS root namespace requirement and VFS-pattern template for adding new metric clusters to prevent "Disconnected subtree" build errors.
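The root-namespace requirement lends itself to a quick pre-build check. The sketch below runs on a simplified, hypothetical PMNS snippet (the real Darwin pmns is larger and may be split across files) and reports top-level subtrees that are defined but not listed under root; `find_disconnected` is an illustrative name.

```shell
#!/bin/sh
# Hypothetical, simplified PMNS text: "ipc" has a subtree but is not
# listed under root, which is the situation described above.
pmns_text='root {
    kernel
    hinv
}
kernel {
    limits
}
hinv {
    ngpu    DARWIN:19:99
}
ipc {
    mbuf
}'

# Print top-level subtrees (names without a dot) that root does not list.
find_disconnected() {
    printf '%s\n' "$1" | awk '
        $1 == "root" && $2 == "{" { in_root = 1; next }
        in_root && $1 == "}"      { in_root = 0; next }
        in_root                   { listed[$1] = 1; next }
        $2 == "{" && $1 !~ /\./   { order[++n] = $1 }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in listed)) print order[i]
        }'
}

find_disconnected "$pmns_text"
```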
Expose 21 new Darwin PMDA metrics through operator-friendly pmrep views.
Created macstat-gpu for GPU monitoring, added compression timing to
macstat-mem, and quick GPU utilization to macstat-x.

Why: Wave 1 metrics exist but lack discoverability - operators need
views to use them effectively for performance troubleshooting.

The .txt files were intended for dbpmda but provide no value over existing
bash integration tests (test-gpu-metrics.sh, etc.) which validate actual
metric values and handle VM/hardware variations better.

Why: Misunderstood test architecture. dbpmda tests would be lower-level
smoke tests, but bash scripts already provide superior coverage through
pminfo/pmcd. No point maintaining two test suites for same metrics.

Creates bash integration tests following the GPU metrics pattern. Tests
validate metric existence and value ranges, with VM-aware handling for
power metrics (no battery in VMs).

Power tests (13 metrics):
- Battery presence, charging state, charge level (0-100%)
- Time remaining, health, cycle count
- Temperature (°C×100), voltage (mV), amperage (mA)
- Design/max capacity (mAh)
- AC connection status, power source string
- Handles VM case: battery_present=0, ac_connected=1

IPC tests (4 metrics):
- mbuf clusters, maxsockbuf, somaxconn (all > 0)
- defunct socket count (>= 0)

Why: TDD penance - these tests should have been written BEFORE implementing
the features. Better late than never. Tests run in VM integration phase.
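The °C×100 encoding in the power tests means a raw reading has to be scaled before it is shown to a person. A small sketch of that conversion, assuming non-negative integer readings; `centi_to_celsius` is a hypothetical helper and the sample values are made up.

```shell
#!/bin/sh
# Sketch only: convert a temperature reported in hundredths of a degree
# Celsius into a human-readable "DD.dd" string.
# Assumes a non-negative integer input.
centi_to_celsius() {
    printf '%d.%02d\n' "$(( $1 / 100 ))" "$(( $1 % 100 ))"
}

centi_to_celsius 3512   # 35.12
centi_to_celsius 2905   # 29.05
```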
Reflects recent implementation progress:
- Wave 1 complete: 17/25 metrics (System Limits 5, Memory Compression 6, Process I/O 1, Process Memory 1, IPC 4)
- Wave 2 partial: 17/30 metrics (GPU 4, Battery/Power 13)
- Phase 2 total: 34/100 metrics (34% complete)

Category updates:
- Cat 2 GPU: Complete (4/4)
- Cat 3 Power: Partial (13/15 - missing 3.3 power consumption which needs root)
- Cat 7 System Limits & IPC: Complete (9/9)
- Cat 10 Memory Compression: Complete (6/6)

Why: Keep research doc accurate for tracking progress and identifying remaining work.

Completes research doc Wave 2 requirements by adding 13 new metrics:
- 6 IPv6 protocol statistics via net.inet6.ip6.stats sysctl
- 7 per-process QoS-tier CPU time via RUSAGE_INFO_V4

All Wave 2 items now complete (47/99 metrics, 47% done):
- IPv6 network statistics (6 metrics)
- Process QoS CPU time (7 metrics)
- Process file descriptor count (verified)

Added critical maintenance requirement section.

Extends IOBlockStorageDriver stats extraction with error counts,
retry counts, timing data, and derived metrics (avgrq_sz, await).
Adds new APFS module exposing container and volume metrics via IOKit.

Per-device disk metrics (71-78):
- read_errors, write_errors: Error counts from IOBlockStorageDriver
- read_retries, write_retries: Retry counts
- total_read_time, total_write_time: Cumulative I/O time (nanoseconds)
- avgrq_sz: Average request size (derived)
- await: Average wait time (derived)

Aggregate disk.all metrics (79-86): Same metrics summed across devices

APFS container metrics (87-98):
- ncontainer: Container count
- per-container: block_size, bytes_read/written, read/write_requests,
  transactions, cache_hits/evictions, read/write_errors

APFS volume metrics (99-100):
- nvolume: Volume count
- per-volume: encrypted, locked status

Implementation:
- Extended disk.c with new IOKit key extractions
- Created apfs.c/apfs.h following gpu module pattern
- Added CLUSTER_APFS, APFS_CONTAINER_INDOM, APFS_VOLUME_INDOM
- Wired into pmda.c refresh/fetch callbacks
- Added integration tests for both disk and APFS metrics

Following TDD: Tests written first, then implementation.

Corrects copy-paste error where disk.apfs.container.bytes_written
used PMID 93 instead of 91, causing metric descriptor mismatch.

Also updates research doc with Wave 3a completion status:
- 77 total Phase 2 metrics now complete (77% of target)
- Documents what could not be implemented (queue_depth, inflight, etc.)
- Notes IOKit API limitations for certain disk metrics

Documents need for automated validation to prevent PMID mismatches
like the Wave 3a bug (bytes_written using PMID 93 instead of 91).

Task includes:
- Full specification of what to validate
- Implementation approach
- CI integration points
- Success criteria

Priority: HIGH - should be done before continuing Wave 3b/4.

Prevent PMID mismatch bugs by validating pmns ↔ metrics.c consistency.

Validates:
- Every PMID in pmns exists in metrics.c
- No duplicate PMIDs within clusters
- Cluster enum definitions resolve correctly

Catches bugs like Wave 3a disk.apfs.container.bytes_written mismatch
(pmns:23:91 vs metrics.c:23:93) at build time instead of runtime.
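The core comparison can be sketched as follows. The input format is a deliberately simplified stand-in (one "metric cluster:item" pair per line) rather than the real pmns and metrics.c syntax, and `check_pmids` is a hypothetical name; only the bytes_written 23:91 vs 23:93 mismatch is taken from the description above, the other rows are made up.

```shell
#!/bin/sh
# Simplified stand-ins for the two sources of truth.
pmns_ids="disk.apfs.container.bytes_read 23:90
disk.apfs.container.bytes_written 23:91
disk.apfs.container.read_requests 23:92"

code_ids="disk.apfs.container.bytes_read 23:90
disk.apfs.container.bytes_written 23:93
disk.apfs.container.read_requests 23:92"

# Print one line per inconsistency and exit non-zero if any were found.
# usage: check_pmids PMNS_LISTING CODE_LISTING
check_pmids() {
    printf '%s\n---\n%s\n' "$1" "$2" | awk '
        $0 == "---"    { in_code = 1; next }
        !in_code       { pmns[$1] = $2; next }
        !($1 in pmns)  { print "MISSING " $1; bad++; next }
        pmns[$1] != $2 { print "MISMATCH " $1 " pmns:" pmns[$1] " code:" $2; bad++ }
        END            { exit (bad > 0) }'
}

check_pmids "$pmns_ids" "$code_ids" || echo "validation FAILED"
```

Counting inside a single awk process and returning the result through the exit status keeps the pass/fail decision out of any shell loop.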
Run validator after build but before unit tests to catch PMID
mismatches early in the build process.

Track per-process network connection counts by inspecting file
descriptors and socket info via libproc.

Extends FD enumeration to identify IPv4/IPv6 TCP and UDP sockets,
enabling network activity monitoring at process granularity.

Mark PMID consistency validator as complete (commits a183326, 9aeab03).
Mark Wave 3c (Process Network Connections) as complete (commit 55c1740).
Update metrics count: 79 total (+2 proc.net.tcp_count, udp_count).
Category 4 now 11/11 complete, overall progress 49/99 (49%).

…vior

Desktop/laptop users need thermal visibility for performance diagnosis, especially
on Apple Silicon where thermal management is opaque. Implements 13 metrics via SMC
and thermal pressure API:
- Temperature sensors (CPU/GPU die, package, ambient)
- Fan metrics (RPM, target, mode, min/max per-fan)
- Thermal pressure level/state (always available, no SMC required)

SMC access is community reverse-engineered (not Apple-supported). Code degrades
gracefully when unavailable. Moves Phase 2 high-value metric completion from 79% to ~85%.

The Darwin PMDA was failing to load with 'Undefined instance domain serial (9)'
because FAN_INDOM was declared in the enum but not added to the indomtab array.
This caused pmdaInit() to reject the entire PMDA, breaking all Darwin metrics.

Add FAN_INDOM entry to indomtab with initial NULL instances (populated dynamically
by thermal subsystem).

Critical learnings from thermal implementation:
- Instance domains must be added to both darwin.h enum AND pmda.c indomtab array
- Missing indomtab entry causes 'Undefined instance domain serial' error
- This is the standard PMDA pattern (not Darwin-specific)
- Document direct manipulation of it_set/it_numinst (no helper function exists)

Wave 3b thermal monitoring was implemented in commits a89dace and
b15b4a3 but the research document was never updated, creating
confusion about project status.

Updates:
- Mark Wave 3b complete with 13 thermal metrics
- Update total metrics from 79 to 92
- Update Wave 3 total from 32 to 45 metrics
- Update Category 1 (Thermal) to Complete status (13/15)
- Update Category 6 (Disk) to Complete status (30/30)
- Update overall completion to ~85% (92/99 metrics)
- Mark all pmrep views as Ready (thermal, power, gpu all unblocked)

Completes Category 11 (pmrep views) for Darwin PMDA Phase 2.
New views expose the power/battery and thermal metrics added
in earlier waves to end users via simple pmrep commands.

Merge conflict left CLUSTER_LIMITS with duplicate comment "18",
causing all subsequent cluster comments to be off by one.
Actual enum values were correct (C auto-increments), but
documentation was misleading for anyone reading the code.

Merge conflict fallout caused systematic cluster numbering errors where
kernel.limits metrics occupied cluster 18 (should be 19), pushing all
subsequent clusters off by one. Runtime metric fetches failed with
"Requested metric not defined" errors.

Additionally, the PMID consistency validator had a critical bug where
error counting happened in a subshell (pipe to while loop), causing it
to always report PASSED even when detecting errors.

Fixes:
- pmns: Update cluster numbers 18-24 to 19-25 for limits/gpu/ipc/power/ipv6/apfs/thermal
- pmns: Keep LOGIN metrics (nusers/nroots/nsessions) at cluster 18
- test-pmid-consistency.sh: Replace pipe-to-while with heredoc to fix error counting

Validator now correctly fails when PMIDs mismatch, catching these issues early.
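The subshell pitfall is easy to reproduce in isolation. In the sketch below the sample data and both function names are hypothetical; it shows the general pattern, not the code in test-pmid-consistency.sh.

```shell
#!/bin/sh
# Hypothetical validation results, one per line.
results="ok
MISMATCH disk.apfs.container.bytes_written
ok
MISMATCH hinv.ngpu"

# Buggy pattern: the while loop is the last stage of a pipeline, so under
# bash's default settings (and in dash) it runs in a subshell. The
# increments happen there and are lost when the pipeline finishes.
count_errors_piped() {
    errors=0
    printf '%s\n' "$results" | while read -r kind rest; do
        if [ "$kind" = "MISMATCH" ]; then errors=$((errors + 1)); fi
    done
    echo "$errors"
}

# Fixed pattern: feed the loop from a heredoc so it runs in the current
# shell and the counter survives.
count_errors_heredoc() {
    errors=0
    while read -r kind rest; do
        if [ "$kind" = "MISMATCH" ]; then errors=$((errors + 1)); fi
    done <<EOF
$results
EOF
    echo "$errors"
}

echo "piped:   $(count_errors_piped)"
echo "heredoc: $(count_errors_heredoc)"
```

With the piped form the counter stays at 0 regardless of input, which is why the validator always reported PASSED.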
PCP doesn't recognize mAh as a unit, causing PM_ERR_CONV errors.
Metrics display raw values in milliamp-hours.

Prevent PM_ERR_CONV errors by documenting which units PCP recognizes.

tallpsmith marked this pull request as ready for review February 7, 2026 02:26
tallpsmith requested a review from natoscott February 7, 2026 02:26

tallpsmith commented

Wave 4 Deferred

Wave 4 optional metrics have been split out for future consideration in Issue #2484.

This PR delivers Waves 1-3 (92 metrics) - the core Phase 2 value.
