Store Daemon Deployment¶
This page describes how to run the C++ StoreDaemon (daemon/tensorcast_daemon) in development and production using the unified runtime configuration.
Binary¶
- Build from source (development):
- Packaged wheel includes the daemon at tensorcast/bin/tensorcast_daemon, and the Python CLI uses it automatically.
Launch via Python CLI¶
Use the unified YAML config and start via CLI (default --global-store-mode is none):
uv run tensorcast-cli daemon start --global-store-mode connect --global-store-address 127.0.0.1:50051
If you omit --config, the CLI tries $TENSORCAST_DAEMON_CONFIG, then
examples/config/store_daemon_config.yaml (repo checkout or packaged wheel),
and errors if no config is found. Set listen/advertise addresses through the
config file instead of CLI flags.
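The lookup order above can be sketched in Python as follows. This is a minimal illustration, not the CLI's actual code: the function name resolve_config_path and the search_roots parameter (standing in for "repo checkout or packaged wheel") are hypothetical, and the sketch does not check whether an explicit --config path exists.

```python
import os
from pathlib import Path
from typing import Optional, Sequence

# Relative location of the bundled example config named in this page.
DEFAULT_CONFIG = Path("examples/config/store_daemon_config.yaml")


def resolve_config_path(cli_config: Optional[str], search_roots: Sequence[Path]) -> Path:
    """Pick a config path using the documented precedence (hypothetical helper)."""
    # 1. An explicit --config value wins.
    if cli_config:
        return Path(cli_config)
    # 2. Otherwise fall back to $TENSORCAST_DAEMON_CONFIG.
    env_value = os.environ.get("TENSORCAST_DAEMON_CONFIG")
    if env_value:
        return Path(env_value)
    # 3. Otherwise look for the bundled example under each candidate root
    #    (assumed here to be the repo checkout and the installed wheel).
    for root in search_roots:
        candidate = root / DEFAULT_CONFIG
        if candidate.is_file():
            return candidate
    # 4. Nothing found: fail loudly, as the CLI is described to do.
    raise FileNotFoundError(
        "no daemon config found; pass --config or set TENSORCAST_DAEMON_CONFIG"
    )
```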
You can also override config values inline with --set KEY=VALUE (repeatable).
Example: --set engine.memory_tiers.stable_bytes=4GB.
Common shortcuts: --stable-bytes, --mem-pool-size-bytes, --enable-rdma, --log-level.
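To make the --set KEY=VALUE semantics concrete, the following sketch applies dotted-key overrides to a nested config dictionary. It assumes that dots in KEY address nested mappings and leaves values as plain strings; how the real CLI parses size suffixes such as 4GB is not covered here, and apply_overrides is a hypothetical name.

```python
from typing import Any, Dict, Iterable


def apply_overrides(config: Dict[str, Any], overrides: Iterable[str]) -> Dict[str, Any]:
    """Apply KEY=VALUE overrides with dotted keys to a nested dict (hypothetical helper)."""
    for item in overrides:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        *parents, leaf = key.split(".")
        node = config
        # Walk (or create) each intermediate mapping named by the dotted key.
        for part in parents:
            node = node.setdefault(part, {})
        # The value is stored as a string; unit parsing is left to the consumer.
        node[leaf] = value
    return config
```

For example, apply_overrides(cfg, ["engine.memory_tiers.stable_bytes=4GB"]) replaces only that leaf and leaves sibling keys untouched.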
The CLI locates the binary from the wheel or development path automatically
and extends LD_LIBRARY_PATH with the TensorCast shared library bundle as
well as the PyTorch, NVIDIA, and auxiliary CUDA runtime directories (including
packages such as cusparselt that are installed outside the nvidia
namespace) that live inside the active Python environment. This allows the
daemon to resolve libstore_engine, libtorch and CUDA components even
when only the binary is present on disk.
If you need extra launcher environment for the daemon binary, set envs in the
daemon config. envs.LD_LIBRARY_PATH is merged in this order: inherited shell
entries, then configured entries, then the auto-discovered TensorCast/PyTorch/CUDA
directories. Other envs.* keys are passed through directly to the daemon
process.
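The merge order for envs.LD_LIBRARY_PATH can be illustrated with a short sketch. The function name is hypothetical, and dropping duplicate or empty entries is an assumption of this example rather than documented launcher behavior.

```python
import os
from typing import Iterable, List


def merge_library_path(inherited: str, configured: str, discovered: Iterable[str]) -> str:
    """Join path entries in the documented order: inherited, configured, auto-discovered."""
    merged: List[str] = []
    groups = (
        inherited.split(os.pathsep),   # entries from the calling shell
        configured.split(os.pathsep),  # envs.LD_LIBRARY_PATH from the daemon config
        discovered,                    # TensorCast/PyTorch/CUDA directories
    )
    for group in groups:
        for entry in group:
            # Assumption: skip empty entries and keep only the first occurrence.
            if entry and entry not in merged:
                merged.append(entry)
    return os.pathsep.join(merged)
```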
By default the daemon runs in the background after tensorcast-cli daemon start
returns, and logs are persisted under ~/.tensorcast/hosts/<host_id>/sessions/<id>/logs
(view them with uv run tensorcast-cli daemon logs). Add --blocking to keep the daemon
attached to the CLI, stream logs directly to the terminal (stdio inherited to
avoid buffered crash output), and stop it when the CLI exits (SIGTERM with a
~35s grace before SIGKILL). In blocking mode, logs are not persisted to the
session log files.
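The stop sequence (SIGTERM, a grace period, then SIGKILL) follows a common pattern that can be sketched with the standard library. This is an illustration of the pattern only; stop_with_grace is a hypothetical helper, and the 35-second default simply mirrors the approximate grace period quoted above.

```python
import subprocess


def stop_with_grace(proc: subprocess.Popen, grace_seconds: float = 35.0) -> int:
    """Ask a child process to exit, then force it if the grace period expires."""
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX
        return proc.wait()
```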
Manage Daemon Sessions¶
Daemon sessions are tracked under ~/.tensorcast/hosts/<host_id>/sessions/<session_id> and the
current session id is stored in ~/.tensorcast/hosts/<host_id>/current_session
(where <host_id> is derived from the hostname and machine-id).
Session metadata is written as soon as the daemon process starts (before
readiness), so SDK tc.init(mode="connect") can discover the session while
startup continues; retry if the daemon is still initializing. The CLI emits
periodic readiness status messages if startup takes longer than a few seconds
and probes both the configured listen host and loopback to avoid hanging on a
non-local listen address. Transient readiness probe errors are suppressed during
startup and only surfaced if the daemon fails to become ready.
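Because session metadata appears before the daemon is ready, a client may need to retry its connection. A minimal retry wrapper is sketched below; it takes any zero-argument callable, for example a lambda wrapping tc.init(mode="connect"). The helper name, the attempt count, the delay, and the broad exception handling are all assumptions, since the SDK's exception types are not described on this page.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def connect_with_retry(connect: Callable[[], T], attempts: int = 10, delay_seconds: float = 1.0) -> T:
    """Call connect() until it succeeds or the attempt budget is exhausted."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return connect()
        except Exception as exc:  # assumption: any failure may mean "still initializing"
            last_error = exc
            if attempt + 1 < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(f"daemon not ready after {attempts} attempts") from last_error
```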
Only one Store Daemon instance is allowed per host-scoped runtime root under
$TENSORCAST_HOME. If you run tensorcast-cli daemon start while another daemon
is already running, the CLI
returns an error and prints the existing daemon session details; reuse that
instance or stop it before starting a new one.
When tensorcast-cli daemon stop is invoked without a session id, the CLI
resolves the active daemon from ~/.tensorcast/hosts/<host_id>/runtime/state.json
first, then falls back to ~/.tensorcast/hosts/<host_id>/current_session.
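The two-step resolution used by daemon stop can be sketched as below. The layout of state.json is not documented here, so the "session_id" key is a hypothetical placeholder, as is the function name; only the file locations and the fallback order come from the text above.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_active_session(host_root: Path) -> Optional[str]:
    """Resolve the active session id under ~/.tensorcast/hosts/<host_id> (sketch)."""
    # First choice: runtime/state.json.
    try:
        state = json.loads((host_root / "runtime" / "state.json").read_text())
        session_id = state.get("session_id")  # hypothetical key name
        if session_id:
            return str(session_id)
    except (OSError, ValueError, AttributeError):
        pass  # missing, unreadable, or unexpected content: use the fallback
    # Fallback: the plain-text current_session file.
    try:
        text = (host_root / "current_session").read_text().strip()
    except OSError:
        return None
    return text or None
```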
Common commands:
# Status (connects to daemon gRPC if available, otherwise shows process info)
uv run tensorcast-cli daemon status
# Logs (stdout by default, --stderr for stderr; add -f to follow)
uv run tensorcast-cli daemon logs -f
# Stop current session (SIGTERM with a ~35s grace before SIGKILL)
uv run tensorcast-cli daemon stop
Observability¶
Metrics are exposed via the unified system; the daemon no longer provides an HTTP metrics endpoint.
Store Client Sessions¶
- uv run tensorcast daemon status now prints a Store Sessions section after the daemon health report. Data is sourced from ~/.tensorcast/store_sessions/<session_id>.json, which the Python SDK refreshes whenever a Store verb completes. Use this view to spot clients that still hold leases or in-flight futures before forcing revocation.
- Each session entry includes daemon endpoint, client PID, timestamps, active lease count, pending futures, and any capabilities reported by Store.__init__ (pool size, transfer slice, lease support).
Store Client Metrics (Grafana Example)¶
{
  "title": "Store Operation Latency",
  "type": "timeseries",
  "fieldConfig": {
    "defaults": {
      "unit": "s",
      "transformations": []
    },
    "overrides": []
  },
  "targets": [
    {
      "expr": "histogram_quantile(0.95, sum by (le, verb) (rate(tc_store_operation_latency_seconds_bucket{daemon=\"$daemon\"}[5m])))",
      "legendFormat": "{{verb}} p95"
    },
    {
      "expr": "sum by (verb) (rate(tc_store_operation_errors_total{daemon=\"$daemon\"}[5m]))",
      "legendFormat": "{{verb}} errors/s",
      "yaxis": 2
    }
  ],
  "options": {
    "tooltip": {
      "mode": "single"
    }
  }
}
Pair this panel with a counter visualization for tc_store_operation_retries_total to highlight retry-heavy verbs. Filter on the daemon label to compare multiple Store sessions in the same dashboard.
Store Session API Rollout & Backout¶
Rollout checklist¶
- Version alignment: Ensure the staged Global Store schema, Store Daemon binary, and Python SDK wheel come from the same release. Run uv run tensorcast --version and uv run tensorcast daemon status to confirm the daemon reports the expected build metadata.
- Pre-traffic validation: Against staging, execute uv run pytest tests/python/test_register_lease_in_place_helper.py, uv run pytest tests/python/test_register_vram_leased_and_dvmp_stream.py, and bazel test //daemon:session_lifecycle_test --test_env=TENSORCAST_CUDA_BACKEND=fake. These suites cover lease renewal, VRAM leased-in-place flows, and daemon session lifecycle.
- Metrics watch: Monitor the OpenTelemetry metrics defined in Design 0010 (tc_store_operation_latency_seconds, tc_store_operation_errors_total, and tc_store_operation_retries_total) while introducing production traffic. Alert thresholds should track the historical p95 latency and error envelopes before legacy helpers are disabled.
- Session audit: Use uv run tensorcast daemon status to inspect the Store Sessions section and verify the session registry under ~/.tensorcast/store_sessions reflects active clients with the expected lease/future counts.
- Release checklist: Cross-check the deployment steps against the Store Session Release Checklist before announcing completion.
Backout checklist¶
- Binary rollback: Redeploy the previous Store Daemon binary and Python SDK wheel (pre-Store-session release). Older clients ignore the .tensorcast/store_sessions manifests, so no cleanup is required beyond optional file pruning.
- Verification: Re-run the validation suites above and confirm observability indicators (tc_store_operation_errors_total, tc_store_operation_retries_total) return to baseline values.
- Communication: Notify on-call and consumer teams when rollback occurs, document the failure mode, and schedule a postmortem before attempting another rollout.
Logging¶
- observability.logging.level drives the daemon's stderr threshold and minimum log level (DEBUG is routed through VLOG).
- observability.logging.vlog_level sets the global VLOG verbosity; values <= 0 disable verbose logging.
- observability.logging.file writes plain-text logs to disk in addition to stderr; the sink is hot-swappable at runtime via config reloads.
- When observability.logging.otel_context_enabled and observability.logging.sink_file are set, the daemon writes a second log file enriched with OpenTelemetry trace_id/span_id for correlation.
Configuration¶
All runtime parameters are configured via the unified config. The daemon only
accepts --config=/path/to/file. See examples/config/store_daemon_config.yaml.
Enum fields accept friendly values and are normalized (case-insensitive): observability.otel.exporter_protocol: grpc | http/protobuf, observability.logging.level: debug|info|warn|error.
Launcher-only daemon environment can also be declared in config via envs, for
example:
When HA is enabled, the daemon advertises a routable address to the Global Store. If server.advertise.host is set but non-routable, startup fails; if it is unset, the daemon resolves it using a routable server.listen.host, the outbound route IP to the Global Store endpoint, and finally the default interface IP. The resolved advertise address is logged at startup.
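The advertise-address fallback chain can be sketched as follows. This is an illustration under stated assumptions, not the daemon's implementation: "routable" is approximated as a literal IP that is not unspecified, loopback, link-local, or multicast; the default interface IP is passed in by the caller; and all function names are hypothetical. The outbound route lookup uses the standard trick of calling connect() on a UDP socket, which sends no packets and only asks the kernel to select a source address.

```python
import ipaddress
import socket
from typing import Optional, Tuple


def is_routable(host: str) -> bool:
    """Approximate routability check (assumption: literal IP addresses only)."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames are not resolved in this sketch
    return not (addr.is_unspecified or addr.is_loopback or addr.is_link_local or addr.is_multicast)


def outbound_ip(target: Tuple[str, int]) -> Optional[str]:
    """Return the local source IP the kernel would use to reach target."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.connect(target)  # UDP connect: no traffic is sent
            return sock.getsockname()[0]
    except OSError:
        return None


def resolve_advertise_host(
    advertise_host: Optional[str],
    listen_host: str,
    global_store: Tuple[str, int],
    default_interface_ip: Optional[str] = None,
) -> str:
    # An explicit but non-routable advertise host is a startup error.
    if advertise_host:
        if not is_routable(advertise_host):
            raise ValueError(f"advertise host {advertise_host!r} is not routable")
        return advertise_host
    # Otherwise try each fallback in the documented order.
    candidates = (listen_host, outbound_ip(global_store), default_interface_ip)
    for candidate in candidates:
        if candidate and is_routable(candidate):
            return candidate
    raise ValueError("could not resolve a routable advertise address")
```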
For long-lived clients, prefer a non-zero server.grpc.max_connection_idle and
explicit keepalive settings (for example keepalive_time: 30s,
keepalive_timeout: 10s) to avoid idle GOAWAY churn. The Python SDK reuses a
single gRPC channel/stub per DaemonCtl instance, so keep the instance around
instead of recreating it for each RPC.
Replica promotion/export for P2P routing is controlled via the promotion block
(policy, require_verified, demotion_drain_timeout). The default
promotion.policy is never, which means replicas remain presence-only unless
promotion is explicitly enabled and requested by the SDK.
network:
high_availability:
heartbeat_interval_ms: 5000
chunk_sync_interval_ms: 10000
lifecycle:
evict_on_dead_pid: false
enable_periodic_eviction: false
eviction_check_interval_s: 30.0
gpu_memory_limit_fraction: 0.75
global_store_address: 127.0.0.1:6000
Communicator Configuration (YAML)¶
The P2P/RDMA communicator is configured via a single YAML/JSON file (no per-field flags). Example:
communicator:
  enable_rdma: false
  stager:
    stage_cpu_for_rdma: true
    buffers_per_flow: 16
    expected_gpu_channels: 0
  rdma:
    outstanding_wr: 64
    ack_ttl_ms: 30000
    traffic_class: 186
    qp_timeout: 20
    qp_retry: 7
  transport:
    tcp_conn_count: 20
    connect_timeout_sec: 10
    tcp_tos: 0
  topology_discovery:
    enable: false
    lldp:
      file_path: /host-config/lldp-info.txt
      required: false
    nvlink:
      source: SOURCE_DISABLED # SOURCE_SNAPSHOT_FILE | SOURCE_RUNTIME_PROBE
      snapshot_file_path: ""
      required: false
    merge_policy:
      emit_rail_switch_endpoints: true
      require_connected: false
Pinned staging pool sizing and chunking come from the daemon-wide pinned memory configuration (DaemonConfig.pinned_memory) via the comm_gpu / comm_cpu classes (slice_bytes + pool_bytes), not from CommunicatorConfig.
When topology_discovery.enable=true, communicator topology generation uses simple_numa as the baseline and then merges LLDP/NVLINK discovery data:
- LLDP maps NIC endpoints to per-rail switch endpoints (netsw_rail_<id>), with netsw_unknown fallback when LLDP data is partial and lldp.required=false.
- NVLINK snapshot mode (SOURCE_SNAPSHOT_FILE) adds nvlink_<gpu_uuid> endpoints and bidirectional NVLINK links for discovered GPU pairs.
- NVLINK runtime probe mode (SOURCE_RUNTIME_PROBE) executes:
  nvidia-smi --query-gpu=index,uuid --format=csv,noheader,nounits
  nvidia-smi topo -m
  and derives edge counts from NV* matrix cells (NV1, NV2, ...). Non-NV tokens (SYS, PHB, PIX, X, N/A) are treated as no NVLINK edge.
- merge_policy.require_connected controls whether final topology validation enforces a fully connected graph.
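The runtime probe's matrix handling can be illustrated with a small parser for nvidia-smi topo -m style output. This is a sketch under assumptions: it treats the first non-blank line as the header, reads the number after NV as the edge count for that GPU pair, and strips ANSI styling codes that some nvidia-smi versions emit. The function name and the returned dictionary shape are hypothetical, not the communicator's actual data model.

```python
import re
from typing import Dict, Tuple

_ANSI = re.compile(r"\x1b\[[0-9;]*m")  # terminal styling sequences
_GPU = re.compile(r"^GPU(\d+)$")
_NV = re.compile(r"^NV(\d+)$")


def parse_nvlink_edges(topo_text: str) -> Dict[Tuple[int, int], int]:
    """Map (low_gpu_index, high_gpu_index) -> NVLINK edge count (sketch)."""
    lines = [ln for ln in _ANSI.sub("", topo_text).splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    edges: Dict[Tuple[int, int], int] = {}
    for line in lines[1:]:
        tokens = line.split()
        row = _GPU.match(tokens[0])
        if not row:
            continue  # NIC rows, legend lines, and other text are ignored
        src = int(row.group(1))
        # Device columns come first in the header, so cells pair up by position.
        for column, cell in zip(header, tokens[1:]):
            col = _GPU.match(column)
            link = _NV.match(cell)
            # Non-NV tokens (SYS, PHB, PIX, X, N/A) never match and add no edge.
            if col and link and int(col.group(1)) != src:
                dst = int(col.group(1))
                edges[(min(src, dst), max(src, dst))] = int(link.group(1))
    return edges
```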
For offline verification, bazel run //core/communicator:simple_numa_topology_tool -- <config> prints a discovery summary line to stderr containing lldp_source, lldp_records, nvlink_source, nvlink_gpus, nvlink_edges, and per-source degrade reasons when fallback is taken. The repository ships lldp-info.txt and nvlink-snapshot.txt for repeatable local checks.
RDMA Environment Variables¶
These are optional and only affect RDMA device selection. Runtime parameters still live in the unified config file.
TENSORCAST_IB_HCA¶
Specifies InfiniBand HCA device names to use for RDMA. Multiple values are
comma-separated. Any = characters in the value are stripped.
If unset, TensorCast auto-discovers available devices.
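The parsing rules for this variable are simple enough to sketch. The helper below is hypothetical; trimming whitespace and dropping empty entries are assumptions beyond the documented comma splitting and = stripping.

```python
from typing import List, Optional


def parse_ib_hca(value: Optional[str]) -> Optional[List[str]]:
    """Return the requested HCA names, or None to signal auto-discovery."""
    if not value:
        return None  # unset or empty: fall back to auto-discovery
    # Strip every '=' first, then split on commas.
    names = [part.strip() for part in value.replace("=", "").split(",")]
    names = [name for name in names if name]  # assumption: ignore empty entries
    return names or None
```

A caller would typically pass os.environ.get("TENSORCAST_IB_HCA") as the argument.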
TENSORCAST_LLDP_FILE_NAME¶
Path to an LLDP-style mapping file for rail ID selection in multi-rail configurations. Each non-comment line maps a network interface to PCI path, mlx5 device name, and rail ID.
Lines starting with # and blank lines are ignored. If unset, rail IDs are
derived from the mlx5 device name (for example mlx5_0 -> rail 0).
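Both rail-selection paths can be sketched in a few lines. The exact file layout is not shown on this page, so the example assumes four whitespace-separated columns in the order listed above (interface, PCI path, mlx5 device name, rail ID); both function names are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional

_MLX5 = re.compile(r"^mlx5_(\d+)$")


def rail_from_device_name(device: str) -> Optional[int]:
    """Default rule when no mapping file is set: mlx5_<n> -> rail <n>."""
    match = _MLX5.match(device)
    return int(match.group(1)) if match else None


def parse_rail_map(lines: Iterable[str]) -> Dict[str, int]:
    """Map mlx5 device name -> rail ID from an LLDP-style mapping file (sketch)."""
    rails: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        fields = line.split()
        if len(fields) != 4:
            continue  # assumption: skip lines that do not have four columns
        _interface, _pci_path, device, rail = fields
        rails[device] = int(rail)
    return rails
```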
This variable is a legacy fallback used by NetDev rail detection. For topology discovery, prefer typed config:
communicator.topology_discovery.lldp.file_path.
Deprecation schedule:
- 2026-03-04: legacy compatibility only (typed config is primary).
- 2026-06-30: removed from new deployment examples except compatibility notes.
- 2026-10-31: CI/integration should use typed config path only.
- 2026-12-31 (target): fallback removal from NetDev.
Launch Example (Unified Config)¶
Metrics Exposure (Unified)¶
- The daemon no longer serves metrics directly. Use the central observability pipeline for tc_* metrics:
Metrics include, e.g.:
- tc_memory_pool_bytes{location=cpu|gpu,device_id?,memory_type=total|free}
- tc_p2p_bytes_total
Note: The HTTP metrics sidecar is no longer spawned by the CLI.