
Summary

Introduce a unified, cross-language OpenTelemetry (OTel) observability layer across TensorCast components (Python Global Store, C++ Store Daemon, and C++ Core). The design standardizes tracing, metrics, and logs with consistent naming, low overhead, and seamless context propagation across gRPC boundaries. It preserves existing in-house trace macro semantics via an OTel bridge and enables trace–log correlation with minimal code changes.

Goals / Non‑Goals

Goals

  • End-to-end traces across Global Store ↔ Store Daemon ↔ Core, including P2P data paths.
  • Single naming system: spans as <Component>/<Operation>, metrics prefixed tc_*, logs enriched with trace_id and span_id.
  • gRPC context propagation using W3C Trace Context across Python and C++.
  • Low overhead by default: no-op when sampling is 0 or no exporter is configured.
  • Clean API surfaces and helper utilities to minimize bespoke instrumentation.

Non‑Goals

  • Building a custom observability stack or replacing OTel SDKs/Collector.
  • Defining every fine-grained span; micro-instrumentation of hot loops is intentionally out of scope.
  • Changing business logic, data models, or persistence schema.
  • Mandating a specific backend; any OTLP-compatible backend or the OTel Collector is acceptable.

Architecture & Interfaces

Overview

  • Signals: Tracing (primary), Metrics (unified prefix), Logs (trace correlation).
  • Transport: W3C Trace Context propagated over gRPC (client and server sides in both Python and C++).
  • Bridge: Existing SC_TRACE_* C++ macros map to OTel spans with equivalent scope semantics.
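To make the propagation step concrete, the following is a minimal, standard-library-only sketch of how a W3C traceparent value can be written into and read from gRPC-style metadata. It is an illustration, not the project's propagation helper: the SpanContext type and the inject, extract, and child_of functions are hypothetical names, metadata is modeled as a plain dict rather than real gRPC metadata, and only the four-field layout of version 00 is handled.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags,
# all lowercase hex with 2, 32, 16, and 2 characters respectively.
_TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)


class SpanContext(NamedTuple):
    """Hypothetical stand-in for an OTel span context."""
    trace_id: str   # 32 hex characters
    span_id: str    # 16 hex characters
    sampled: bool


def inject(ctx: SpanContext, metadata: dict) -> None:
    """Client side: write the current context into outgoing metadata."""
    flags = "01" if ctx.sampled else "00"
    metadata["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(metadata: dict) -> Optional[SpanContext]:
    """Server side: read the caller's context; None means start a new trace."""
    match = _TRACEPARENT_RE.match(metadata.get("traceparent", ""))
    if match is None or match["version"] == "ff":
        return None
    # All-zero trace or span IDs are invalid in W3C Trace Context.
    if set(match["trace_id"]) == {"0"} or set(match["span_id"]) == {"0"}:
        return None
    sampled = bool(int(match["flags"], 16) & 0x01)
    return SpanContext(match["trace_id"], match["span_id"], sampled)


def child_of(parent: SpanContext) -> SpanContext:
    """A child span keeps the trace_id and gets a fresh span_id."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

A server handler would call extract on the incoming metadata, create its span with child_of, and call inject before making any downstream call, so that every hop shares one trace_id.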

Components & Entry Points

  • Python
      • Initialization: tensorcast/observability/otel.py sets Tracer/Meter/Logger providers and gRPC instrumentation.
      • Global Store: initialize OTel at process start and instrument RPC handlers in tensorcast/global_store/grpc_service.py.
      • Clients & APIs: client init in tensorcast/daemon_ctl.py and high-level APIs in tensorcast/api create parent spans and attach low-cardinality tc.* attributes. Client SDK defaults to OTel disabled; enable by setting observability.otel.enabled: true in ClientConfig.
      • Logs/Metrics: tensorcast/logger.py injects trace_id/span_id; tensorcast/global_store/metrics.py defines tc_* metrics.
  • C++
      • Initialization: core/common/otel/init.h/.cc provides init_from_config(obs, role); call from daemon/server entry.
      • Propagation: core/common/otel/grpc_propagation.h extracts/injects context on grpc::{Server,Client}Context.
      • Trace bridge: core/common/otel/trace_scope_bridge.h maps SC_TRACE_* macro scopes to OTel spans/events.
      • Logs: core/common/otel/logging_sink.* optionally records logs annotated with trace_id/span_id for Collector ingestion.
      • Engine/Transport: instrument stage-level spans/events in core/store/store_engine.cc and core/communicator/transport/*.

Span Model & Conventions

  • Span names: <Component>/<Operation> (examples: GlobalStore/RequestReplicaTransport, StoreDaemon/MaterializeReplica, StoreEngine/P2PIngest).
  • Relationships: use child-of for causal chains; use Span Links for async or staged boundaries (e.g., P2P pipeline stages).
  • Attributes
      • Standard: rpc.system=grpc, rpc.service, rpc.method, rpc.grpc.status_code.
      • Business (low-cardinality): tc.artifact.id, tc.replica.id, tc.device.id, tc.size.bytes, tc.source.type=remote|disk, tc.location=gpu|cpu.
  • Cost controls: avoid span creation in tight loops; prefer stage or batch spans. Tune rate via sampling.
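One way to keep these conventions from drifting is a small lint-style check that can run in tests. The sketch below is illustrative only: check_span is a hypothetical helper, the attribute keys and enumerated values are copied from the list above, and the CamelCase requirement on both halves of the span name is an assumption inferred from the examples.

```python
import re

# Business attribute keys, copied from the conventions above.
ALLOWED_TC_KEYS = frozenset({
    "tc.artifact.id", "tc.replica.id", "tc.device.id",
    "tc.size.bytes", "tc.source.type", "tc.location",
})

# Keys with a closed set of values, also taken from the conventions above.
ENUM_VALUES = {
    "tc.source.type": {"remote", "disk"},
    "tc.location": {"gpu", "cpu"},
}

# <Component>/<Operation>; requiring CamelCase on both sides is an assumption.
_SPAN_NAME_RE = re.compile(r"^[A-Z][A-Za-z0-9]*/[A-Z][A-Za-z0-9]*$")


def check_span(name: str, attributes: dict) -> list:
    """Return a list of convention violations; an empty list means OK."""
    problems = []
    if not _SPAN_NAME_RE.match(name):
        problems.append(f"span name {name!r} is not <Component>/<Operation>")
    for key, value in attributes.items():
        if not key.startswith("tc."):
            continue  # rpc.* and other standard attributes are not checked here
        if key not in ALLOWED_TC_KEYS:
            problems.append(f"unknown business attribute {key!r}")
        elif key in ENUM_VALUES and value not in ENUM_VALUES[key]:
            allowed = sorted(ENUM_VALUES[key])
            problems.append(f"{key}={value!r} is not one of {allowed}")
    return problems
```

Rejecting unknown tc.* keys at test time, rather than at runtime, keeps the check off the hot path while still catching ad hoc attributes before they reach a backend.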

Metrics

  • Unified prefix: all metrics start with tc_*.
  • Scope: memory pool usage, P2P throughput, load latencies, and gRPC request telemetry.
  • Export: OTLP push or Collector scrape via Prometheus receiver; deprecated in-process HTTP /metrics endpoints are removed.
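The unified prefix is easiest to enforce if metric names are built in one place. The helper below is a hedged sketch: metric_name and its subsystem/name/unit layout are assumptions, and the example names it produces are hypothetical, not the metrics defined in tensorcast/global_store/metrics.py. The validation pattern is the character set Prometheus accepts for metric names.

```python
import re

METRIC_PREFIX = "tc_"

# Character set Prometheus accepts for metric names.
_PROM_NAME_RE = re.compile(r"^[a-zA-Z_:][a-zA-Z0-9_:]*$")


def metric_name(subsystem: str, name: str, unit: str = "") -> str:
    """Build a tc_-prefixed metric name, for example tc_p2p_throughput_bytes."""
    parts = [p for p in (subsystem, name, unit) if p]
    joined = "_".join(parts).lower().replace(".", "_").replace("-", "_")
    full = METRIC_PREFIX + joined
    if not _PROM_NAME_RE.match(full):
        raise ValueError(f"invalid metric name: {full!r}")
    return full
```

For instance, metric_name("p2p", "throughput", "bytes") yields tc_p2p_throughput_bytes, while a name containing a space raises ValueError instead of being exported.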

Logs

  • Python logs automatically include trace_id and span_id when an active span exists.
  • C++ can install an absl::LogSink that writes enriched records for Collector ingestion.

Configuration

  • Observability is driven by configuration (e.g., observability.otel.*):
      • service.name and service role (daemon, global_store, client)
      • exporter endpoint(s), protocol, and timeouts
      • sampling ratio and span limits
      • enable/disable signals (traces, metrics, logs)
  • With no exporter or with sampling=0, runtime overhead remains near zero.
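The no-op rule can be expressed as a single predicate over the configuration. The dataclass below is a hypothetical mirror of observability.otel.*: apart from enabled, the field names (endpoint, sampling_ratio, and the per-signal toggles) are assumptions, and the disabled-by-default value is documented only for the client SDK.

```python
from dataclasses import dataclass

VALID_ROLES = ("daemon", "global_store", "client")


@dataclass(frozen=True)
class OtelConfig:
    """Hypothetical mirror of observability.otel.*; field names are assumptions."""
    enabled: bool = False         # the client SDK is documented as off by default
    service_name: str = "tensorcast"
    role: str = "client"
    endpoint: str = ""            # empty means no exporter is configured
    sampling_ratio: float = 1.0   # 0.0 means no traces are sampled
    traces: bool = True
    metrics: bool = True
    logs: bool = True

    def __post_init__(self):
        if self.role not in VALID_ROLES:
            raise ValueError(f"unknown role: {self.role!r}")
        if not 0.0 <= self.sampling_ratio <= 1.0:
            raise ValueError("sampling_ratio must be within [0, 1]")

    def tracing_is_noop(self) -> bool:
        """True when tracing should install nothing and cost near zero."""
        return (not self.enabled or not self.traces
                or not self.endpoint or self.sampling_ratio == 0.0)


def from_dict(raw: dict) -> OtelConfig:
    """Build a config from a mapping, ignoring keys this sketch does not know."""
    known = {k: v for k, v in raw.items() if k in OtelConfig.__dataclass_fields__}
    return OtelConfig(**known)
```

An initializer would check tracing_is_noop() first and return before constructing any provider or exporter, which is what keeps the disabled path cheap. For example, from_dict({"enabled": True, "endpoint": "localhost:4317", "role": "daemon"}) enables tracing, while the same mapping with sampling_ratio set to 0.0 does not.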

Verification (Dev Loop)

  • Start an OTel Collector via tools/otel/collector-dev.yaml.
  • Start Global Store and Daemon (Fake CUDA is supported for CPU-only machines).
  • Trigger any client operation (or tools/otel_smoke.py) to see cross-service traces, metrics, and correlated logs.

Schema Changes (if any)

None. This design does not introduce or modify persistent data schemas.

Trade‑offs & Risks

  • Overhead vs. fidelity: coarse stage-level spans are preferred on hot paths; sampling mitigates cost but reduces visibility.
  • Attribute cardinality: enforce low-cardinality tc.* attributes to avoid backend cost explosions and high-cardinality pitfalls.
  • Cross-language consistency: rely on shared propagation helpers and naming conventions to avoid drift.
  • Bridge semantics: the SC_TRACE_* mapping must preserve scope lifetimes; regressions could break existing debug workflows.
  • PII and data hygiene: avoid embedding sensitive data in attributes or logs.
  • Operational dependencies: Collector/backend outages must degrade gracefully (non-blocking, bounded buffering).
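The "non-blocking, bounded buffering" requirement can be illustrated with a drop-on-full queue placed between the request path and a background exporter. This is a sketch of the behavior only, with hypothetical names (BoundedExportBuffer, offer, drain) and arbitrary default sizes; OTel SDK batch processors provide comparable bounded queues, so production code would normally rely on those rather than a hand-rolled buffer.

```python
import queue


class BoundedExportBuffer:
    """Drop-newest buffer: producers never block and memory use is capped."""

    def __init__(self, max_items: int = 2048):
        self._queue = queue.Queue(maxsize=max_items)
        # Count of items lost because the buffer was full. The counter is
        # approximate under heavy multi-threaded contention.
        self.dropped = 0

    def offer(self, item) -> bool:
        """Called on the request path; returns False instead of blocking."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self, max_batch: int = 512) -> list:
        """Called by a background exporter thread to collect one batch."""
        batch = []
        while len(batch) < max_batch:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch
```

During a Collector outage the exporter stops draining, the buffer fills, and further telemetry is counted and discarded, so request latency and process memory stay unaffected.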

Compatibility & Acceptance Criteria

Compatibility

  • Client SDK default: OTel is disabled unless explicitly enabled via ClientConfig (observability.otel.enabled: true).
  • Default behavior is no-op if exporters are disabled or sampling=0.
  • Existing SC_TRACE_* macro semantics are preserved via the bridge.
  • Cross-language propagation uses W3C Trace Context; legacy code without OTel remains functional.

Acceptance Criteria

  • Traces: A single request produces a coherent trace across Global Store, Daemon, and Core, including P2P stages with appropriate child relationships or links.
  • Metrics: tc_* metrics for memory pools, P2P throughput, load latency, and gRPC telemetry are emitted and visible via OTLP or Prometheus.
  • Logs: Log records in Python and (optionally) C++ include trace_id/span_id for correlation.
  • Config: Toggling sampling and exporters affects cost and signal export without code changes.

References

  • Code surfaces
      • Python: tensorcast/observability/otel.py, tensorcast/global_store/__main__.py, tensorcast/global_store/grpc_service.py, tensorcast/daemon_ctl.py, tensorcast/api, tensorcast/logger.py, tensorcast/global_store/metrics.py
      • C++: core/common/otel/{init.h,grpc_propagation.h,trace_scope_bridge.h,logging_sink.*}, daemon/app/server_main.cc, daemon/service/grpc_service_impl.cc, core/store/store_engine.cc, core/communicator/transport/*, core/store/components/*
  • Tools & configs: tools/otel/collector-dev.yaml