
High Availability

Overview

TensorCast HA keeps the single Global Store resilient and consistent with Store Daemon state by combining startup recovery, enhanced heartbeats, typed reconcile, and drift pruning. Configuration is file-based (proto-backed) and orchestrated via the tensorcast CLI.

Features

Global Store

  • Startup recovery: marks workers and replicas stale, cleans orphaned replicas, and preserves persisted state_version/state_checksum.
  • Reconcile pipeline: enhanced heartbeats advertise version + checksum + registered artifacts (heartbeats require state_version >= 1); ReconcileWorkerState applies additions/removals with “addition over removal” semantics, transactionally updates version/checksum on success, and returns typed outcomes (APPLIED, NOOP, IGNORED_STALE, RETRY_LATER, REBASE_REQUIRED, FATAL).
  • Identity guardrails: rejects loopback/unspecified registration addresses; HealthCheck surfaces cluster_token, while GetServerInfo returns bind (listen_*) endpoints, advertise (advertise_*) endpoints, metrics port, and version.
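
The "addition over removal" rule can be illustrated with a small sketch. This is not the Global Store's implementation: the function name `resolve_reconcile_delta`, the string memory types, and the in-memory sets are assumptions made for illustration. It shows that when the same replica key appears in both the additions and the removals of one request, the addition wins.

```python
from typing import Iterable, Set, Tuple

# Hypothetical replica key: (artifact_id, memory_type, device_id).
ReplicaKey = Tuple[str, str, int]


def resolve_reconcile_delta(
    current: Set[ReplicaKey],
    additions: Iterable[ReplicaKey],
    removals: Iterable[ReplicaKey],
) -> Set[ReplicaKey]:
    """Apply a delta where an addition beats a removal of the same key."""
    added = set(additions)
    # Drop any removal that conflicts with an addition in the same request.
    effective_removals = set(removals) - added
    return (set(current) - effective_removals) | added


current = {("artifact1", "GPU", 0), ("artifact2", "GPU", 0)}
result = resolve_reconcile_delta(
    current,
    additions=[("artifact2", "GPU", 0), ("artifact3", "CPU", 0)],
    removals=[("artifact1", "GPU", 0), ("artifact2", "GPU", 0)],
)
# artifact2 is listed in both sets, so it stays; artifact1 is removed.
print(sorted(result))
```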

Store Daemon

  • Routable registration: requires a non-loopback advertise host and non-zero server.p2p_listen.port before enabling HA.
  • Initial drift pruning: on startup (and on recovery re-registration) the daemon runs a reconcile snapshot; when the server returns REBASE_REQUIRED, the daemon unloads local replicas (keyed by (artifact_id, memory_type, device_id)) that the Global Store does not expect.
  • Registration seeding: initializes state_version from RegisterWorkerResponse.expected_state_version before the first heartbeat.
  • Enhanced heartbeat loop: sends version/checksum/inventory and queues background reconcile work; if the server returns NOT_FOUND but the channel is healthy, the daemon re-registers with the previous worker id and re-enters reconcile bootstrap.
  • Reconcile handling: applies server-requested removals and prefetches ADD_REPLICA updates; toggles remote access for UPDATE_REPLICA.
  • Bounded retries: all RPCs use jittered exponential backoff (max_retries=3, base 100ms).
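
A bounded, jittered exponential backoff with the parameters above (max_retries=3, base 100ms) might look like the following sketch. The helper name `call_with_retries`, the "full jitter" strategy, and the reading of max_retries as retries after the first attempt are assumptions; the daemon's actual retry code may differ, and a real client would retry only on retryable gRPC status codes.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying up to max_retries times with jittered backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # Full jitter: sleep a random time in [0, base * 2**attempt].
            sleep(random.uniform(0.0, base_s * (2 ** attempt)))
    raise RuntimeError("max_retries must be >= 0")


# Demo: a call that fails twice, then succeeds. Delays are recorded
# instead of slept so the example runs instantly.
delays: List[float] = []
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


outcome = call_with_retries(flaky, sleep=delays.append)
print(outcome, len(delays))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.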

Configuration

Global Store (proto: tensorcast.config.v1.GlobalStoreConfig)

database:
  db_file: /var/lib/tensorcast/global_store.duckdb   # persistent for HA
server:
  listen:
    host: 0.0.0.0
    port: 50051
  advertise:
    host: 10.0.0.5            # routable address for clients/GetServerInfo
  max_workers: 20
worker_policy:
  heartbeat_timeout: 30s
  cleanup_interval: 60s
  default_heartbeat_interval: 5s
  memory_tiers:
    snapshot_retention: 10m
    snapshot_max_rows: 200
    publish_interval: 5s
meta:
  cluster_token: <optional-cluster-id>               # echoed by HealthCheck

If server.advertise.host is set, it must be routable (non-loopback/unspecified) or the Global Store will fail fast on startup. Leave it unset to let the service auto-detect a routable IPv4 address.
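
As a rough illustration of the routability rule, the sketch below rejects loopback and unspecified addresses using Python's standard `ipaddress` module. The function name `is_routable_host` and the treatment of DNS names (accepted without resolution, except `localhost`) are assumptions, not the Global Store's actual validation logic.

```python
import ipaddress


def is_routable_host(host: str) -> bool:
    """Return False for empty, loopback, or unspecified hosts."""
    host = host.strip()
    if not host or host.lower() == "localhost":
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal; this sketch accepts DNS names unresolved.
        return True
    return not (addr.is_loopback or addr.is_unspecified)


for candidate in ("10.0.0.5", "127.0.0.1", "0.0.0.0", "::", "::1"):
    print(candidate, is_routable_host(candidate))
```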

Store Daemon (proto: tensorcast.config.v1.DaemonConfig)

server:
  listen:
    host: 0.0.0.0
    port: 50052
  advertise:
    host: 10.0.0.20           # must be routable (non-loopback/unspecified)
  p2p_listen:
    host: 0.0.0.0
    port: 65090               # required for HA registration
high_availability:
  enabled: true
  global_store_endpoints:
    - host: 10.0.0.5          # first entry is used today
      port: 50051
  heartbeat_interval: 5s      # default when omitted
  periodic_sync_interval: 10s # chunk sync loop; 0 disables
  heartbeat_rpc_timeout: 2s   # optional per-RPC overrides
  state_sync_rpc_timeout: 5s

The CLI (tensorcast daemon start) injects high_availability.global_store_endpoints when you pass --global-store-address or --global-store-endpoints, and auto-fills any ports that are set to 0. Keep server.advertise.host routable to avoid registration failures.

Note: listen.host: 0.0.0.0 is a bind-all address (server-side). Clients should connect using a routable IP/DNS name (e.g. 10.0.0.5:50051). TensorCast uses advertise_* for dial targets when routable and ignores unspecified values like 0.0.0.0 or :: in bind/advertise metadata, falling back to a connectable host.
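
The dial-target selection described in the note can be sketched as follows. The function `pick_dial_host`, its argument names, and the exact precedence (advertise first, then bind, then a caller-supplied fallback) are assumptions for illustration; TensorCast's own fallback logic is not shown in this document.

```python
import ipaddress
from typing import Optional


def _ip_or_none(host: str):
    """Parse an IP literal, or return None for DNS names."""
    try:
        return ipaddress.ip_address(host)
    except ValueError:
        return None


def pick_dial_host(
    advertise_host: Optional[str],
    bind_host: Optional[str],
    fallback_host: str,
) -> str:
    """Prefer a routable advertise host; never dial 0.0.0.0 or ::."""
    if advertise_host:
        addr = _ip_or_none(advertise_host)
        if addr is None or not (addr.is_unspecified or addr.is_loopback):
            return advertise_host
    if bind_host:
        addr = _ip_or_none(bind_host)
        if addr is None or not addr.is_unspecified:
            return bind_host
    # Both values were missing or unusable: use the caller's fallback.
    return fallback_host


print(pick_dial_host("10.0.0.5", "0.0.0.0", "gs.example.internal"))
print(pick_dial_host("0.0.0.0", "::", "gs.example.internal"))
```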

Usage

Start Global Store (single instance)

uv run tensorcast global start --config=/etc/tensorcast/global_store.yaml
  • Use a persistent DuckDB path (database.db_file) for recovery.
  • The CLI persists the currently active local cluster_token under ~/.tensorcast/hosts/<host_id>/runtime/cluster_token (where <host_id> is derived from hostname + machine-id). Healthy local GS instances are reused; when a fresh local GS is started after cleanup or crash recovery, a new token is generated unless you explicitly pin one.
  • Add --blocking to keep the Global Store attached to the CLI and stop it when the CLI exits (Ctrl+C triggers shutdown and session cleanup).
  • If you omit --config, the CLI uses $TENSORCAST_GLOBAL_STORE_CONFIG when set, otherwise examples/config/global_store_config.yaml (repo checkout or packaged wheel); if neither is found, startup fails. Defaults come from that file (including listen.host: 0.0.0.0).
  • The example config defaults database.db_file to null (in-memory); set a persistent path for HA. When set, ~ is expanded and parent directories are created on startup.

Start Store Daemon with HA

uv run tensorcast daemon start \
  --config=/etc/tensorcast/daemon.yaml \
  --global-store-mode connect \
  --global-store-address 10.0.0.5:50051
  • The orchestrator writes an effective config that enables HA, injects the Global Store endpoint, and fills missing listen/p2p ports.
  • Startup will fail fast if server.p2p_listen.port is zero or if advertise.host is loopback/unspecified.
  • Add --blocking to keep the daemon attached to the CLI and stop it when the CLI exits (SIGTERM with a ~35s grace before SIGKILL).
  • If you omit --config, the CLI uses $TENSORCAST_DAEMON_CONFIG when set, otherwise examples/config/store_daemon_config.yaml (repo checkout or packaged wheel); if neither is found, startup fails.

Manual RPC examples (generated stubs)

Enhanced heartbeat:

import time
from tensorcast.proto.global_store.v1 import global_store_pb2

req = global_store_pb2.WorkerHeartbeatRequest(
    worker_id="worker-node-1",
    mem_pool_available_size=7 * 1024**3,
    accepting_new_requests=True,
    state_version=15,
    state_checksum="local_state_md5",
    registered_artifact_ids=["artifact1", "artifact2"],
    last_successful_sync=int(time.time()),
    global_store_status=global_store_pb2.CONNECTION_STATUS_CONNECTED,
)
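# `stub` is a generated gRPC client stub for the service that exposes
# WorkerHeartbeat, created from your channel (not shown here).
resp = stub.WorkerHeartbeat(req)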
if resp.state_sync_required:
    # initiate sync
    ...

Reconcile:

from tensorcast.proto.global_store.v1 import global_store_pb2
from tensorcast.proto.common.v1 import common_pb2

replica = common_pb2.ReplicaInfo()
replica.ref.artifact_id = "artifact1"
replica.memory_info.memory_type = common_pb2.MEMORY_TYPE_GPU
replica.memory_info.device_id = 0
replica.memory_info.memory_size = 1 * 1024**3
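# `stub` is a generated gRPC client stub for the service that exposes
# ReconcileWorkerState, created from your channel (not shown here).
resp = stub.ReconcileWorkerState(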
    global_store_pb2.ReconcileWorkerStateRequest(
        worker_id="worker-node-1",
        daemon_id="daemon-node-1",
        generation=1,
        request_seq=1,
        request_kind=global_store_pb2.RECONCILE_REQUEST_KIND_SNAPSHOT,
        inventory=[replica],
    ),
)

Failure Scenarios

  • Global Store crash/restart: recovery marks workers/replicas stale; daemons continue heartbeating, re-register on NOT_FOUND, run reconcile bootstrap, and prune drift. A persistent db_file preserves registry state across the restart.
  • Daemon crash/restart: a fresh process registers a new worker id, runs reconcile bootstrap, and unloads replicas not expected by the Global Store.
  • Network partition: RPC retries are bounded; once connectivity returns, the next heartbeat triggers sync. No offline queue is kept; state is rebuilt from the current engine snapshot.
  • Registration rejected: ensure advertise.host is routable and p2p_listen.port is set.
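
The drift pruning that follows a restart in the scenarios above amounts to a set difference over replica keys. The sketch below is illustrative only: `ReplicaKey`, `replicas_to_unload`, and the string memory types are hypothetical names, and the real daemon works on protobuf messages rather than tuples.

```python
from typing import Iterable, List, NamedTuple


class ReplicaKey(NamedTuple):
    """Key used for drift comparison, per the Store Daemon feature list."""
    artifact_id: str
    memory_type: str
    device_id: int


def replicas_to_unload(
    local: Iterable[ReplicaKey],
    expected: Iterable[ReplicaKey],
) -> List[ReplicaKey]:
    """Return local replicas that the Global Store does not expect."""
    expected_set = set(expected)
    return sorted(key for key in set(local) if key not in expected_set)


local = [
    ReplicaKey("artifact1", "GPU", 0),
    ReplicaKey("artifact1", "GPU", 1),  # same artifact, different device
    ReplicaKey("artifact2", "CPU", 0),
]
expected = [ReplicaKey("artifact1", "GPU", 0)]
stale = replicas_to_unload(local, expected)
print(stale)
```

Because the key includes memory_type and device_id, a replica of an expected artifact on an unexpected device still counts as drift.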

Monitoring

  • Metrics: Global Store exports Prometheus metrics (e.g., tc_state_sync_total, tc_state_sync_seconds, tc_active_workers, tc_replicas_total, tc_grpc_server_handled_total). Scrape http://<host>:<metrics_port>/metrics.
  • Health:
      • gRPC: grpcurl -plaintext <host>:50051 tensorcast.global_store.v1.ClusterAdminService/HealthCheck (status + cluster token)
      • gRPC: grpcurl -plaintext <host>:50051 tensorcast.global_store.v1.ClusterAdminService/GetServerInfo (bind/advertise endpoints, metrics, version)
  • Inventory: grpcurl -plaintext <host>:50051 tensorcast.global_store.v1.ClusterRuntimeService/ListActiveWorkers
  • Daemon visibility: OTEL spans wrap all Global Store RPCs; counters (hb_success/failure, sync_success/failure) are exported via the configured OTEL sink when enabled.
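
For a quick check without a full Prometheus setup, the metrics endpoint's text output can be parsed directly. The sketch below is a minimal parser for the Prometheus text exposition format. The helper name `sum_samples` is hypothetical, and the sample payload (including the `memory_type` label) is invented for illustration, not captured from a real Global Store.

```python
from typing import Dict, Iterable


def sum_samples(text: str, wanted: Iterable[str]) -> Dict[str, float]:
    """Sum sample values per metric name across all label sets."""
    wanted_set = set(wanted)
    totals: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "}" in line:
            head, _, rest = line.rpartition("}")
            name = head.split("{", 1)[0]
        else:
            name, _, rest = line.partition(" ")
        fields = rest.split()  # value, then an optional timestamp
        if name in wanted_set and fields:
            totals[name] = totals.get(name, 0.0) + float(fields[0])
    return totals


# Invented sample payload; real label names may differ.
SAMPLE = """\
# HELP tc_active_workers Active workers
# TYPE tc_active_workers gauge
tc_active_workers 3
tc_replicas_total{memory_type="gpu"} 5
tc_replicas_total{memory_type="cpu"} 2
"""

totals = sum_samples(SAMPLE, ["tc_active_workers", "tc_replicas_total"])
print(totals)
```

In practice the text would come from an HTTP GET against http://<host>:<metrics_port>/metrics, for example with `urllib.request.urlopen`.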

Best Practices

  • Use persistent storage for the Global Store database; back up cluster_token with it.
  • Set a routable server.advertise.host and non-zero server.p2p_listen.port before enabling HA.
  • The start commands always wait for readiness and return only once services are healthy (or on error), so clients can connect as soon as the command returns; --blocking keeps the process attached after readiness.
  • Keep configs proto-valid (strict parsing) and avoid relying on multiple endpoints: only the first global_store_endpoints entry is used today.

Migration from Legacy Setup

  1. Add a persistent database.db_file and optional meta.cluster_token to the Global Store config, then restart with uv run tensorcast global start --config=....
  2. Update daemon config to include high_availability and a routable server.advertise.host/p2p_listen.port, then restart with uv run tensorcast daemon start --global-store-address <addr>.
  3. Verify via HealthCheck + GetServerInfo, metrics scrape, and a daemon restart that completes reconcile bootstrap before rolling out broadly.

Limitations

  • Single Global Store instance; no automatic failover or multi-endpoint rotation (first endpoint only).
  • RPC retries are bounded; there is no durable queue of state changes while disconnected.
  • Eventual consistency: state converges on the next heartbeat or reconcile loop iteration, not instantly.

Future Enhancements

  • Endpoint rotation/failover across multiple Global Store addresses.
  • Multi-master Global Store with consensus and cross-region replication.
  • Automated standby promotion and richer conflict resolution for concurrent updates.