@bryancall commented Jan 29, 2026

Summary

Adds two new metrics to track cache lock contention:

  1. proxy.process.cache.stripe.lock_contention - counts stripe mutex contention
  2. proxy.process.cache.writer.lock_contention - counts writer VC mutex contention during read aggregation

The stripe metric is also available per-volume as proxy.process.cache.volume_N.stripe.lock_contention.

Background

When ATS is configured with more threads than cache volumes, threads contend heavily for the stripe mutex, causing throughput degradation. These metrics make contention visible so operators can tune their configuration.

Benchmark Results

Testing on a 16-core system with 100 cached URLs:

Threads  Volumes  Throughput     Contentions/s
16       1        476k req/s     12,095k
16       16       1,160k req/s   177k
24       32       1,260k req/s   161k

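With only 1 volume, 16 threads are slower than 4 threads due to contention. Adding volumes removes most of the bottleneck: at 16 threads, going from 1 to 16 volumes cuts contentions from 12,095k/s to 177k/s and raises throughput from 476k to 1,160k req/s.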
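One way to read the benchmark numbers is as failed lock attempts per served request. The sketch below is only illustrative arithmetic on the three rows reported above; the struct, the array, and the function names are hypothetical and not part of ATS. With these figures the single-volume run works out to roughly 25 contentions per request, while both multi-volume runs stay well under one.

```cpp
#include <cstdio>

// One row of the benchmark results (both rates in thousands per second).
struct BenchmarkRow {
  int    threads;
  int    volumes;
  double kilo_requests_per_sec;
  double kilo_contentions_per_sec;
};

// Values copied from the PR description; this sketch measures nothing itself.
static const BenchmarkRow kRows[] = {
  {16, 1, 476.0, 12095.0},
  {16, 16, 1160.0, 177.0},
  {24, 32, 1260.0, 161.0},
};

// Average number of failed lock attempts per served request.
inline double contentions_per_request(const BenchmarkRow &row)
{
  return row.kilo_contentions_per_sec / row.kilo_requests_per_sec;
}

// Print the ratio for every row, one line per configuration.
inline void print_contentions_per_request()
{
  for (const BenchmarkRow &row : kRows) {
    std::printf("%2d threads, %2d volumes: %.2f contentions per request\n",
                row.threads, row.volumes, contentions_per_request(row));
  }
}
```

Calling print_contentions_per_request() from a small main() prints the three ratios side by side.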

Usage

# Stripe lock contention (global)
traffic_ctl metric get proxy.process.cache.stripe.lock_contention

# Stripe lock contention (per-volume)
traffic_ctl metric match 'volume.*stripe.lock_contention'

# Writer lock contention
traffic_ctl metric get proxy.process.cache.writer.lock_contention
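Because the metrics are counters that only increase, a single reading says little; a per-second figure like the Contentions/s column needs two readings taken some interval apart. A minimal sketch of that conversion follows. The helper name and its signature are hypothetical, and treating a decreasing value as a reset (for example after a restart) is an assumption, not documented ATS behavior.

```cpp
#include <cstdint>

// Hypothetical helper: turns two readings of a cumulative contention counter
// into an average rate over the sampling interval.
// Returns false (leaving rate_out untouched) when the interval is not positive
// or when the counter went backwards, which this sketch ASSUMES means a reset.
inline bool contention_rate(std::uint64_t earlier, std::uint64_t later,
                            double interval_seconds, double &rate_out)
{
  if (interval_seconds <= 0.0 || later < earlier) {
    return false;
  }
  rate_out = static_cast<double>(later - earlier) / interval_seconds;
  return true;
}
```

For example, readings of 1000 and 6000 taken 5 seconds apart give 1000 contentions per second.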

Implementation

Stripe Lock Contention Call Sites (VC_SCHED_LOCK_RETRY / VC_LOCK_RETRY_EVENT)

All for stripe->mutex:

CacheRead.cc:

  • L210 - openReadClose
  • L428 - openReadReadDone
  • L456 - openReadReadDone
  • L653 - openReadMain
  • L705 - openReadMain
  • L766 - openReadStartEarliest
  • L932 - openReadVecWrite
  • L988 - openReadStartHead
  • L1210 - openReadDirDelete

CacheVC.cc:

CacheWrite.cc:

  • L78 - handleWriteLock
  • L84 - handleWriteLock
  • L278 - openWriteCloseDir
  • L331 - openWriteCloseHeadDone
  • L410 - openWriteCloseDataDone
  • L504 - openWriteWriteDone
  • L648 - openWriteOverwrite
  • L681 - openWriteOverwrite
  • L794 - openWriteMain

Writer Lock Contention Call Site (VC_SCHED_WRITER_LOCK_RETRY)

For write_vc->mutex (not stripe):

CacheRead.cc:

  • L278 - openReadFromWriter (read aggregation)

Files Changed

  • P_CacheStats.h: Add stripe_lock_contention and writer_lock_contention counters
  • CacheProcessor.cc: Register both metrics
  • P_CacheInternal.h: Add metric increments to retry macros, add VC_SCHED_WRITER_LOCK_RETRY()
  • CacheRead.cc: Use VC_SCHED_WRITER_LOCK_RETRY() for writer mutex case
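The idea behind the macro change can be sketched outside ATS as "try the lock once; on failure, bump the matching counter and retry later". The code below is not the ATS implementation: TryLock, ContentionStats, LockOutcome, and try_lock_or_count are hypothetical stand-ins, and the real macros reschedule the CacheVC rather than returning a value. Only the two counter names mirror the PR.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a mutex with try-lock semantics (not an ATS type).
class TryLock
{
public:
  // Returns true if the lock was free and is now held by the caller.
  bool try_lock() { return !held_.exchange(true, std::memory_order_acquire); }
  void unlock() { held_.store(false, std::memory_order_release); }

private:
  std::atomic<bool> held_{false};
};

// Counters named after the two metrics; the struct itself is an assumption.
struct ContentionStats {
  std::atomic<std::uint64_t> stripe_lock_contention{0};
  std::atomic<std::uint64_t> writer_lock_contention{0};
};

enum class LockOutcome { Acquired, RetryLater };

// Try the lock once. On failure, count one contention on the given counter
// and tell the caller to retry later instead of blocking.
inline LockOutcome try_lock_or_count(TryLock &lock, std::atomic<std::uint64_t> &counter)
{
  if (lock.try_lock()) {
    return LockOutcome::Acquired;
  }
  counter.fetch_add(1, std::memory_order_relaxed);
  return LockOutcome::RetryLater;
}
```

Passing the counter explicitly is what keeps the two metrics separate in this sketch: a failed attempt on a stripe lock is charged to stripe_lock_contention, and a failed attempt on a writer lock to writer_lock_contention.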

Adds proxy.process.cache.stripe.lock_contention counter that increments
each time a thread fails to acquire the stripe mutex. This helps identify
cache lock contention issues when tuning thread counts vs volume counts.

Also available per-volume as proxy.process.cache.volume_N.stripe.lock_contention.

Add VC_SCHED_LOCK_RETRY_NO_METRIC() macro for lock retries that are not
for stripe->mutex (e.g., write_vc->mutex in read aggregation). This ensures
the stripe_lock_contention metric only counts actual stripe mutex contention.

Add proxy.process.cache.writer.lock_contention to track contention on
the writer VC mutex during read aggregation (separate from stripe mutex).