@himaschal himaschal commented Dec 8, 2025

ℹ️ Release Coordination (Downstream of ctutils)

This feature depends on digicert/ctutils (RA-8279).
Status: Ready for review (ensure ctutils v1.0.0 tag is available).
Plan:

  1. Wait for digicert/ctutils PR1 merge & v1.0.0 tag.
  2. Update go.mod in this PR to use digicert/ctutils v1.0.0.
  3. Merge this PR.

Summary

Integrates the digicert/ctutils shared logging library to enable OpenTelemetry distributed tracing across Trillian Log Server and Log Signer. This allows full request tracing from the frontend (CTFE) through to the storage layer.

Features

  • Distributed Tracing: End-to-end trace propagation (gRPC interceptors).
  • Unified Logging: Replaces ad-hoc logging with ctutils adapters.
  • Database Tracing: (If applicable, check code for SQL wrappers).
  • Configuration: Standard OTEL_* environment variables.

Configuration

See the README Observability Section for full details.

| Variable | Description |
|---|---|
| OTEL_ENABLED | Master switch for tracing |
| OTEL_EXPORTER | otlp (collector) or stdout (debug) |
| LOG_LEVEL | Minimum log severity (parsed by the application itself) |

Testing

Integration verified in typical deployment scenarios:

  • Local K8s: Verified traces appear in Jaeger when triggered from CTFE.

See full e2e testing here

Refs: RA-8279

… integration

This change integrates the digicert/ctutils shared logging library to enable
OpenTelemetry-compliant distributed tracing across the Trillian log server and
signer components.

Key changes:
- Add config/config.go with InitLogging() for centralized OTEL configuration
- Update log_server and log_signer main.go to call config.InitLogging()
- Add chained gRPC interceptors for trace context propagation
- Add Dockerfile.unified with SSH access for private ctutils dependency
- Update go.mod/go.sum for ctutils v0.1.6 and OTEL dependencies
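
The interceptor chaining listed above can be illustrated with a minimal, standard-library-only sketch. It swaps in net/http middleware for the gRPC interceptors so that it runs without external modules; it is an analogy for the chaining idea, not the ctutils or grpc-go API. The names `chain`, `withTraceID`, `withTraceHeader`, `probe`, and the `X-Trace-Id` response header are hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// ctxKey is a private context key type for the trace ID (hypothetical).
type ctxKey struct{}

type middleware func(http.Handler) http.Handler

// chain wraps h so that the first middleware listed runs outermost.
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// withTraceID copies the trace-id field of a traceparent header into the
// request context. Validation is omitted to keep the sketch short.
func withTraceID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		parts := strings.Split(r.Header.Get("traceparent"), "-")
		if len(parts) == 4 {
			r = r.WithContext(context.WithValue(r.Context(), ctxKey{}, parts[1]))
		}
		next.ServeHTTP(w, r)
	})
}

// withTraceHeader echoes the trace ID found in the context. It must run
// inside withTraceID, which is why ordering in chain matters.
func withTraceHeader(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id, ok := r.Context().Value(ctxKey{}).(string); ok {
			w.Header().Set("X-Trace-Id", id)
		}
		next.ServeHTTP(w, r)
	})
}

// probe sends one in-memory request through the chain and returns the
// echoed trace ID ("" when no traceparent header was supplied).
func probe(traceparent string) string {
	final := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	h := chain(final, withTraceID, withTraceHeader)
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	if traceparent != "" {
		req.Header.Set("traceparent", traceparent)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Result().Header.Get("X-Trace-Id")
}

func main() {
	// The span ID below is an illustrative value, not taken from this PR.
	fmt.Println(probe("00-4dd4c18687c1d0f1b54bba5f7223efd7-00f067aa0ba902b7-01"))
}
```

Reversing the order passed to `chain` would make `withTraceHeader` run before the trace ID is in the context, which is the kind of ordering bug chained interceptors need to avoid.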

The logging configuration is driven by environment variables:
- OTEL_ENABLED: Enable/disable OpenTelemetry (default: false)
- OTEL_EXPORTER: Exporter type ('otlp' or 'stdout')
- OTEL_COLLECTOR_ENDPOINT: OTLP collector URL
- OTEL_SERVICE_NAME: Service name for traces
- OTEL_SAMPLE_RATIO: Sampling ratio (0.0-1.0)
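
As a rough sketch of how the variables listed above might be read and validated: the struct, the function name, and the sample-ratio default of 1.0 are assumptions for illustration, and the actual config.InitLogging() in this PR may behave differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// OTELConfig is a hypothetical holder for the OTEL_* settings.
type OTELConfig struct {
	Enabled           bool
	Exporter          string
	CollectorEndpoint string
	ServiceName       string
	SampleRatio       float64
}

// loadOTELConfig reads settings through getenv so that it can be exercised
// without modifying the process environment.
func loadOTELConfig(getenv func(string) string) (OTELConfig, error) {
	cfg := OTELConfig{
		Exporter:          getenv("OTEL_EXPORTER"),
		CollectorEndpoint: getenv("OTEL_COLLECTOR_ENDPOINT"),
		ServiceName:       getenv("OTEL_SERVICE_NAME"),
		SampleRatio:       1.0, // assumed default; not stated in this PR
	}
	// OTEL_ENABLED defaults to false when unset.
	if v := getenv("OTEL_ENABLED"); v != "" {
		enabled, err := strconv.ParseBool(v)
		if err != nil {
			return cfg, fmt.Errorf("OTEL_ENABLED: %w", err)
		}
		cfg.Enabled = enabled
	}
	if v := getenv("OTEL_SAMPLE_RATIO"); v != "" {
		ratio, err := strconv.ParseFloat(v, 64)
		if err != nil {
			return cfg, fmt.Errorf("OTEL_SAMPLE_RATIO: %w", err)
		}
		// Written this way so that NaN is rejected as well.
		if !(ratio >= 0 && ratio <= 1) {
			return cfg, fmt.Errorf("OTEL_SAMPLE_RATIO %v is outside 0.0-1.0", ratio)
		}
		cfg.SampleRatio = ratio
	}
	if cfg.Enabled && cfg.Exporter != "otlp" && cfg.Exporter != "stdout" {
		return cfg, fmt.Errorf("OTEL_EXPORTER %q: want otlp or stdout", cfg.Exporter)
	}
	return cfg, nil
}

func main() {
	cfg, err := loadOTELConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "invalid OTEL configuration:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps unit tests free of environment mutation.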

This enables end-to-end request tracing from CTFE through Trillian backends,
allowing operators to correlate logs and traces across the CT infrastructure.

Refs: RA-8279

Copilot AI left a comment


Pull request overview

This PR integrates OpenTelemetry distributed tracing into Trillian log server and signer components through the digicert/ctutils shared logging library, enabling end-to-end request tracing and structured logging.

Changes:

  • Added OTEL-compliant distributed tracing via ctutils dependency
  • Configured gRPC and HTTP middleware for trace propagation and logging
  • Added environment-variable based OTEL configuration

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 13 comments.

| File | Description |
|---|---|
| config/config.go | New centralized logging initialization with OTEL setup |
| cmd/trillian_log_server/main.go | Initialize logging on server startup |
| cmd/trillian_log_signer/main.go | Initialize logging on signer startup |
| cmd/internal/serverutil/main.go | Add gRPC/HTTP interceptors for trace propagation |
| go.mod/go.sum | Add ctutils v0.1.13-test and updated OTEL dependencies |
| examples/deployment/docker//Dockerfile | Docker build configuration for private ctutils dependency |
| .github/workflows/*.yaml | CI authentication for private ctutils repository |
| README.md | Documentation for OTEL configuration |
| experimental/batchmap/batchmap.shims.go | Auto-generated code formatting updates |


Removed accidental duplicate comments introduced during previous edits to ensure clean and readable Dockerfiles.
Removed a commented-out require statement for ctutils that was causing confusion, as the actual requirement is correctly defined later in the file.
Replaced os.Setenv with t.Setenv to fix errcheck lint errors and improve test cleanup.

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 18 changed files in this pull request and generated 6 comments.



@himaschal changed the title from "feat(otel): Add OpenTelemetry distributed tracing and ctutils logging integration" to "RA-8279: feat(otel): Add OpenTelemetry distributed tracing and ctutils logging integration" on Jan 23, 2026
himaschal and others added 5 commits January 29, 2026 14:25
Co-authored-by: himaschal <himaschal@users.noreply.github.com>

Test version for golem rollout
…cessary and not recommended according to best practice. This also keeps the logserver / logsigner logs free of noise from scheduled jobs such as health checks and metrics scraping.
@sadhana-angara

QA Evidence for CT Log :

Test 1: Deploy Jaeger and Configure OTLP Export
Status : PASS
Evidence :

Jaeger pod deployed successfully -
Screenshot 2026-02-05 at 10 49 43 AM

Services configured with OTLP export -
Screenshot 2026-02-05 at 10 49 43 AM (1)

Port forwarding established -
Screenshot 2026-02-09 at 12 21 40 PM

Jaeger UI accessible at http://localhost:16686
Screenshot 2026-02-09 at 12 23 16 PM

CTFE API accessible at http://localhost:6962
Screenshot 2026-02-09 at 11 50 50 AM

========================================================================================

Test 2: Verify Distributed Tracing in Jaeger UI
Status : PASS
Evidence :

  • Multiple traces visible in search results (6+ traces) : Yes
  • Traces show both ctfe and trillian-logserver services : Yes
  • Span hierarchy shows parent-child relationships : Yes
  • Timing data present for each span (typically 3-25ms) : Yes
  • Operation names visible (e.g., ctfe_internal_sth_force) : Yes
Screenshot 2026-02-09 at 12 32 48 PM Screenshot 2026-02-09 at 12 31 19 PM

========================================================================================

Test 3: W3C Trace Context Propagation
Status : PASS
Evidence :

  • Request succeeds (200 OK) : Yes
  • Custom trace_id appears in Jaeger UI : Yes
  • Trace shows in search results within 10 seconds : Yes
Screenshot 2026-02-05 at 3 03 32 PM Screenshot 2026-02-05 at 3 13 29 PM Screenshot 2026-02-10 at 10 08 23 AM
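
For context on Test 3, W3C Trace Context carries the trace in a `traceparent` header of the form `version-traceid-parentid-flags`. The following is a minimal sketch of parsing version `00` of that header; it is not the ctutils implementation, and the span ID in `main` is an illustrative value rather than one observed in this test run.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// traceParent holds the fields of a version-00 traceparent header.
type traceParent struct {
	TraceID string
	SpanID  string
	Sampled bool
}

// isLowerHex reports whether s contains only the characters 0-9 and a-f.
func isLowerHex(s string) bool {
	for _, c := range s {
		isDigit := c >= '0' && c <= '9'
		isAtoF := c >= 'a' && c <= 'f'
		if !isDigit && !isAtoF {
			return false
		}
	}
	return true
}

// parseTraceParent validates and splits a version-00 traceparent value.
func parseTraceParent(header string) (traceParent, error) {
	parts := strings.Split(header, "-")
	if len(parts) != 4 {
		return traceParent{}, errors.New("traceparent: want 4 dash-separated fields")
	}
	version, traceID, spanID, flags := parts[0], parts[1], parts[2], parts[3]
	switch {
	case version != "00":
		return traceParent{}, fmt.Errorf("traceparent: unsupported version %q", version)
	case len(traceID) != 32 || !isLowerHex(traceID) || traceID == strings.Repeat("0", 32):
		return traceParent{}, errors.New("traceparent: trace-id must be 32 lowercase hex chars and not all zero")
	case len(spanID) != 16 || !isLowerHex(spanID) || spanID == strings.Repeat("0", 16):
		return traceParent{}, errors.New("traceparent: parent-id must be 16 lowercase hex chars and not all zero")
	case len(flags) != 2 || !isLowerHex(flags):
		return traceParent{}, errors.New("traceparent: flags must be 2 lowercase hex chars")
	}
	bits, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return traceParent{}, fmt.Errorf("traceparent: flags: %w", err)
	}
	// The lowest flag bit marks the trace as sampled.
	return traceParent{TraceID: traceID, SpanID: spanID, Sampled: bits&1 == 1}, nil
}

func main() {
	tp, err := parseTraceParent("00-4dd4c18687c1d0f1b54bba5f7223efd7-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Printf("trace_id=%s span_id=%s sampled=%t\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```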

========================================================================================

Test 4: Cross-Service Trace Propagation
Status : PASS
Evidence :

  • Same trace_id found in CTFE logs : Yes
  • Same trace_id found in Trillian logserver logs : Yes
  • Logs show parent-child relationship via span_id : Yes
Screenshot 2026-02-05 at 3 53 46 PM Screenshot 2026-02-05 at 4 36 30 PM

========================================================================================

Test 5: Structured Logging with Trace Context
Status : PASS
Evidence :

  • Logs in JSON format : Yes
  • trace_id field present (32-char hex) : Yes
  • span_id field present (16-char hex) : Yes
  • parent_source field present (values: client_header, grpc_metadata, or system_generated) : Yes
  • elapsed_ms field present (timing data) : Yes
Screenshot 2026-02-09 at 12 47 57 PM
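
The checks in Test 5 could be automated along the following lines. This is a sketch only: the field names come from the list above, but the assumption that `elapsed_ms` is a JSON number, the `validateLogLine` helper, and the sample line in `main` are all hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
)

var (
	traceIDPattern = regexp.MustCompile(`^[0-9a-f]{32}$`)
	spanIDPattern  = regexp.MustCompile(`^[0-9a-f]{16}$`)
)

// logEntry lists only the fields that Test 5 checks. ElapsedMS is a pointer
// so that a missing field can be told apart from a value of zero.
type logEntry struct {
	TraceID      string   `json:"trace_id"`
	SpanID       string   `json:"span_id"`
	ParentSource string   `json:"parent_source"`
	ElapsedMS    *float64 `json:"elapsed_ms"`
}

// validateLogLine returns nil when a single log line meets the Test 5 criteria.
func validateLogLine(line []byte) error {
	var e logEntry
	if err := json.Unmarshal(line, &e); err != nil {
		return fmt.Errorf("not valid JSON: %w", err)
	}
	if !traceIDPattern.MatchString(e.TraceID) {
		return fmt.Errorf("trace_id %q is not 32-char hex", e.TraceID)
	}
	if !spanIDPattern.MatchString(e.SpanID) {
		return fmt.Errorf("span_id %q is not 16-char hex", e.SpanID)
	}
	switch e.ParentSource {
	case "client_header", "grpc_metadata", "system_generated":
	default:
		return fmt.Errorf("unexpected parent_source %q", e.ParentSource)
	}
	if e.ElapsedMS == nil {
		return errors.New("elapsed_ms is missing")
	}
	return nil
}

func main() {
	// Illustrative line, not copied from the QA logs.
	line := []byte(`{"msg":"get-sth","trace_id":"4dd4c18687c1d0f1b54bba5f7223efd7","span_id":"00f067aa0ba902b7","parent_source":"grpc_metadata","elapsed_ms":3.39}`)
	if err := validateLogLine(line); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("PASS")
}
```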

========================================================================================

Test 6: Multiple Requests with Shared Trace
Status : PASS
Evidence :

- All 3 requests succeed :

Screenshot 2026-02-06 at 11 24 20 AM

- Logs show 3+ entries with same trace_id :

Screenshot 2026-02-06 at 11 27 56 AM

- Each entry has different span_id :

Screenshot 2026-02-06 at 11 28 53 AM Screenshot 2026-02-06 at 11 29 10 AM
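
One way a scenario like Test 6 could be reproduced is to send several requests that share a trace ID in the `traceparent` header while using a fresh parent span ID each time. The sketch below only builds the requests and does not send them. It assumes the shared trace is injected through `traceparent`, which the comment above does not state explicitly; the helper names are hypothetical, and the URL reuses the port-forwarded CTFE address from Test 1 without a real endpoint path.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
)

// newSpanID returns 8 random bytes as 16 lowercase hex characters.
func newSpanID() (string, error) {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return hex.EncodeToString(b[:]), nil
}

// traceParentHeader formats a version-00 traceparent with the sampled flag set.
func traceParentHeader(traceID, spanID string) string {
	return "00-" + traceID + "-" + spanID + "-01"
}

// buildRequests prepares n GET requests that share traceID but carry
// different parent span IDs.
func buildRequests(url, traceID string, n int) ([]*http.Request, error) {
	reqs := make([]*http.Request, 0, n)
	for i := 0; i < n; i++ {
		spanID, err := newSpanID()
		if err != nil {
			return nil, err
		}
		req, err := http.NewRequest(http.MethodGet, url, nil)
		if err != nil {
			return nil, err
		}
		req.Header.Set("traceparent", traceParentHeader(traceID, spanID))
		reqs = append(reqs, req)
	}
	return reqs, nil
}

func main() {
	reqs, err := buildRequests("http://localhost:6962/", "4dd4c18687c1d0f1b54bba5f7223efd7", 3)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range reqs {
		fmt.Println(r.Header.Get("traceparent"))
	}
}
```

Each request could then be passed to `http.DefaultClient.Do`, after which the server logs can be searched for the shared trace ID.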

========================================================================================

Test 7: Performance and Timing Analysis
Status : PASS
Evidence : In Jaeger UI, examine 5-10 different traces

1. Trace ID : 4dd4c18687c1d0f1b54bba5f7223efd7 :
Typical get-sth operation:  5.47ms
gRPC call to Trillian: 3.39ms on server side
CTFE overhead : 5.1 - 3.39 = 1.71ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 10 54 12 AM

2. Trace ID : 089ff8e936e3df81afbfb9fe416f3194 :
Typical get-sth operation:  3.26ms
gRPC call to Trillian: 2.2ms on server side
CTFE overhead : 3.04 - 2.2 = 0.84ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 10 58 09 AM

3. Trace ID : b3aec802d23c219702cc36f7d7a3b1e6 :
Typical get-sth operation:  5.1ms
gRPC call to Trillian: 3.37ms on server side
CTFE overhead : 4.75 - 3.37 = 1.38ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 11 01 43 AM

4. Trace ID : 9a4723af4335407b39318948fb92d9b6 :
Typical get-sth operation:  4.52ms
gRPC call to Trillian: 2.94ms on server side
CTFE overhead : 4.52 - 2.94 = 1.58ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 11 05 40 AM

5. Trace ID : bea9a080ada3f6739735ab113f1ad7c2 :
Typical get-sth operation:  7.95ms
gRPC call to Trillian: 3.78ms on server side
CTFE overhead : 7.09 - 3.78 = 3.31ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 11 18 05 AM

6. Trace ID : 4ae35982188663fee3d8eeb622866356 :
Typical get-sth operation:  8.76ms
gRPC call to Trillian: 4.69ms on server side
CTFE overhead : 7.83 - 4.69 = 3.14ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 11 20 55 AM

7. Trace ID : 8018b3ff51e0b343206a6b984802c18b :
Typical get-sth operation:  6.14ms
gRPC call to Trillian: 4.21ms on server side
CTFE overhead : 6.14 - 4.21 = 1.93ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 11 23 07 AM

8. Trace ID : a3699b3aef49a80fc73aae7bb5bd6e5f :
Typical get-sth operation:  5.2ms
gRPC call to Trillian: 3.61ms on server side
CTFE overhead : 4.97 - 3.61 = 1.36ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 11 29 11 AM

9. Trace ID : 4a4f82d3b182303ed0b773c0460b9169 :
Typical get-sth operation:  7.65ms
gRPC call to Trillian: 4.19ms on server side
CTFE overhead : 7.24 - 4.19 = 3.05ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 11 31 50 AM

10. Trace ID : 26a99abfc5ad08be508a00b301abb032 :
Typical get-sth operation:  5.08ms
gRPC call to Trillian: 2.26ms on server side
CTFE overhead : 4.59 - 2.26 = 2.33ms
No traces showing errors or timeouts.

Screenshot 2026-02-09 at 11 34 23 AM
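
The per-trace overhead figures above follow one subtraction: the CTFE-side duration used in each "CTFE overhead" line minus the server-side gRPC duration. A small sketch that recomputes them and summarizes the ten samples is shown below. The values are the operands of the overhead lines above, converted to microseconds; the type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// sample pairs the CTFE-side duration from an overhead line with the
// server-side gRPC duration of the same trace.
type sample struct {
	ctfe, server time.Duration
}

// us converts whole microseconds to a time.Duration.
func us(n int64) time.Duration { return time.Duration(n) * time.Microsecond }

// samples holds the ten traces listed in Test 7, in order.
var samples = []sample{
	{us(5100), us(3390)}, {us(3040), us(2200)}, {us(4750), us(3370)},
	{us(4520), us(2940)}, {us(7090), us(3780)}, {us(7830), us(4690)},
	{us(6140), us(4210)}, {us(4970), us(3610)}, {us(7240), us(4190)},
	{us(4590), us(2260)},
}

// ctfeOverhead is the time spent outside the server-side gRPC call.
func ctfeOverhead(s sample) time.Duration { return s.ctfe - s.server }

// summarize returns the mean and the largest overhead across samples.
func summarize(samples []sample) (mean, worst time.Duration) {
	if len(samples) == 0 {
		return 0, 0
	}
	var total time.Duration
	for _, s := range samples {
		o := ctfeOverhead(s)
		total += o
		if o > worst {
			worst = o
		}
	}
	return total / time.Duration(len(samples)), worst
}

func main() {
	mean, worst := summarize(samples)
	fmt.Printf("mean overhead %v, worst overhead %v\n", mean, worst)
}
```

Integer microseconds are used instead of floating-point milliseconds so that the subtractions are exact.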

========================================================================================

Test 8: Service Dependencies Visualization
Status : PASS
Evidence :

  • Clear call chain: ctfe → trillian-logserver : Yes
  • gRPC communication visible in spans : Pending
  • No unexpected service dependencies : Pending
Screenshot 2026-02-06 at 11 48 09 AM Screenshot 2026-02-06 at 11 48 35 AM

@sadhana-angara sadhana-angara self-requested a review February 11, 2026 15:33