High Availability¶
Overview¶
TensorCast HA keeps the single Global Store resilient and consistent with Store Daemon state by combining startup recovery, enhanced heartbeats, typed reconcile, and drift pruning. Configuration is file-based (proto-backed) and orchestrated via the tensorcast CLI.
Features¶
Global Store¶
- Startup recovery: marks workers and replicas stale, cleans orphaned replicas, and preserves persisted
state_version/state_checksum. - Reconcile pipeline: enhanced heartbeats advertise version + checksum + registered artifacts (heartbeats require
state_version >= 1);ReconcileWorkerStateapplies additions/removals with “addition over removal” semantics, transactionally updates version/checksum on success, and returns typed outcomes (APPLIED,NOOP,IGNORED_STALE,RETRY_LATER,REBASE_REQUIRED,FATAL). - Identity guardrails: rejects loopback/unspecified registration addresses;
HealthChecksurfacescluster_token, whileGetServerInforeturns bind (listen_*) endpoints, advertise (advertise_*) endpoints, metrics port, and version.
Store Daemon¶
- Routable registration: requires a non-loopback advertise host and non-zero
server.p2p_listen.portbefore enabling HA. - Initial drift pruning: on startup (and recovery re-registration) runs reconcile snapshot; when the server returns
REBASE_REQUIRED, the daemon unloads local replicas (keyed by(artifact_id, memory_type, device_id)) not expected by the Global Store. - Registration seeding: initializes
state_versionfromRegisterWorkerResponse.expected_state_versionbefore the first heartbeat. - Enhanced heartbeat loop: sends version/checksum/inventory and queues background reconcile work; if the server returns
NOT_FOUNDbut the channel is healthy, the daemon re-registers with the previous worker id and re-enters reconcile bootstrap. - Reconcile handling: applies server-requested removals and prefetches
ADD_REPLICAupdates; toggles remote access forUPDATE_REPLICA. - Bounded retries: all RPCs use jittered exponential backoff (
max_retries=3, base 100ms).
Configuration¶
Global Store (proto: tensorcast.config.v1.GlobalStoreConfig)¶
database:
db_file: /var/lib/tensorcast/global_store.duckdb # persistent for HA
server:
listen:
host: 0.0.0.0
port: 50051
advertise:
host: 10.0.0.5 # routable address for clients/GetServerInfo
max_workers: 20
worker_policy:
heartbeat_timeout: 30s
cleanup_interval: 60s
default_heartbeat_interval: 5s
memory_tiers:
snapshot_retention: 10m
snapshot_max_rows: 200
publish_interval: 5s
meta:
cluster_token: <optional-cluster-id> # echoed by HealthCheck
If server.advertise.host is set, it must be routable (non-loopback/unspecified) or the Global Store will fail fast on startup. Leave it unset to let the service auto-detect a routable IPv4 address.
Store Daemon (proto: tensorcast.config.v1.DaemonConfig)¶
server:
listen:
host: 0.0.0.0
port: 50052
advertise:
host: 10.0.0.20 # must be routable (non-loopback/unspecified)
p2p_listen:
host: 0.0.0.0
port: 65090 # required for HA registration
high_availability:
enabled: true
global_store_endpoints:
- host: 10.0.0.5 # first entry is used today
port: 50051
heartbeat_interval: 5s # default when omitted
periodic_sync_interval: 10s # chunk sync loop; 0 disables
heartbeat_rpc_timeout: 2s # optional per-RPC overrides
state_sync_rpc_timeout: 5s
The CLI (
tensorcast daemon start) will injecthigh_availability.global_store_endpointswhen you pass--global-store-addressor--global-store-endpoints, and will auto-fill ports when set to 0. Keepserver.advertise.hostroutable to avoid registration failures.Note:
listen.host: 0.0.0.0is a bind-all address (server-side). Clients should connect using a routable IP/DNS name (e.g.10.0.0.5:50051). TensorCast usesadvertise_*for dial targets when routable and ignores unspecified values like0.0.0.0or::in bind/advertise metadata, falling back to a connectable host.
Usage¶
Start Global Store (single instance)¶
- Use a persistent DuckDB path (
database.db_file) for recovery. - The CLI persists the currently active local
cluster_tokenunder~/.tensorcast/hosts/<host_id>/runtime/cluster_token(where<host_id>is derived from hostname + machine-id). Healthy local GS instances are reused; when a fresh local GS is started after cleanup or crash recovery, a new token is generated unless you explicitly pin one. - Add
--blockingto keep the Global Store attached to the CLI and stop it when the CLI exits (Ctrl+C triggers shutdown and session cleanup). - If you omit
--config, the CLI uses$TENSORCAST_GLOBAL_STORE_CONFIGwhen set, otherwiseexamples/config/global_store_config.yaml(repo checkout or packaged wheel); if neither is found, startup fails. Defaults come from that file (includinglisten.host: 0.0.0.0). - The example config defaults
database.db_fileto null (in-memory); set a persistent path for HA. When set,~is expanded and parent directories are created on startup.
Start Store Daemon with HA¶
uv run tensorcast daemon start \
--config=/etc/tensorcast/daemon.yaml \
--global-store-mode connect \
--global-store-address 10.0.0.5:50051
- The orchestrator writes an effective config that enables HA, injects the Global Store endpoint, and fills missing listen/p2p ports.
- Startup will fail fast if
server.p2p_listen.portis zero or ifadvertise.hostis loopback/unspecified. - Add
--blockingto keep the daemon attached to the CLI and stop it when the CLI exits (SIGTERM with a ~35s grace before SIGKILL). - If you omit
--config, the CLI uses$TENSORCAST_DAEMON_CONFIGwhen set, otherwiseexamples/config/store_daemon_config.yaml(repo checkout or packaged wheel); if neither is found, startup fails.
Manual RPC examples (generated stubs)¶
Enhanced heartbeat:
import time
from tensorcast.proto.global_store.v1 import global_store_pb2
req = global_store_pb2.WorkerHeartbeatRequest(
worker_id="worker-node-1",
mem_pool_available_size=7 * 1024**3,
accepting_new_requests=True,
state_version=15,
state_checksum="local_state_md5",
registered_artifact_ids=["artifact1", "artifact2"],
last_successful_sync=int(time.time()),
global_store_status=global_store_pb2.CONNECTION_STATUS_CONNECTED,
)
resp = stub.WorkerHeartbeat(req)
if resp.state_sync_required:
# initiate sync
...
Reconcile:
from tensorcast.proto.global_store.v1 import global_store_pb2
from tensorcast.proto.common.v1 import common_pb2
replica = common_pb2.ReplicaInfo()
replica.ref.artifact_id = "artifact1"
replica.memory_info.memory_type = common_pb2.MEMORY_TYPE_GPU
replica.memory_info.device_id = 0
replica.memory_info.memory_size = 1 * 1024**3
resp = stub.ReconcileWorkerState(
global_store_pb2.ReconcileWorkerStateRequest(
worker_id="worker-node-1",
daemon_id="daemon-node-1",
generation=1,
request_seq=1,
request_kind=global_store_pb2.RECONCILE_REQUEST_KIND_SNAPSHOT,
inventory=[replica],
),
)
Failure Scenarios¶
- Global Store crash/restart: recovery marks workers/replicas stale; daemons continue heartbeating, re-register on
NOT_FOUND, run reconcile bootstrap, and prune drift. Persistentdb_filekeeps registry state. - Daemon crash/restart: a fresh process registers a new worker id, runs reconcile bootstrap, and unloads replicas not expected by the Global Store.
- Network partition: RPC retries are bounded; once connectivity returns the next heartbeat triggers sync. No offline queue is kept—state is rebuilt from the current engine snapshot.
- Registration rejected: ensure
advertise.hostis routable andp2p_listen.portis set.
Monitoring¶
- Metrics: Global Store exports Prometheus metrics (e.g.,
tc_state_sync_total,tc_state_sync_seconds,tc_active_workers,tc_replicas_total,tc_grpc_server_handled_total). Scrapehttp://<host>:<metrics_port>/metrics. - Health:
- gRPC:
grpcurl -plaintext <host>:50051 tensorcast.global_store.v1.ClusterAdminService/HealthCheck(status + cluster token) - gRPC:
grpcurl -plaintext <host>:50051 tensorcast.global_store.v1.ClusterAdminService/GetServerInfo(bind/advertise endpoints, metrics, version) - Inventory:
grpcurl -plaintext <host>:50051 tensorcast.global_store.v1.ClusterRuntimeService/ListActiveWorkers - Daemon visibility: OTEL spans wrap all Global Store RPCs; counters (hb_success/failure, sync_success/failure) are exported via the configured OTEL sink when enabled.
Best Practices¶
- Use persistent storage for the Global Store database; back up
cluster_tokenwith it. - Set a routable
server.advertise.hostand non-zeroserver.p2p_listen.portbefore enabling HA. - Start always waits for readiness and returns only when services are healthy (or on error) before clients connect;
--blockingkeeps the process attached after readiness. - Keep configs proto-valid (strict parsing) and avoid relying on multiple endpoints—the first
global_store_endpointsentry is used today.
Migration from Legacy Setup¶
- Add a persistent
database.db_fileand optionalmeta.cluster_tokento the Global Store config, then restart withuv run tensorcast global start --config=.... - Update daemon config to include
high_availabilityand a routableserver.advertise.host/p2p_listen.port, then restart withuv run tensorcast daemon start --global-store-address <addr>. - Verify via
HealthCheck+GetServerInfo, metrics scrape, and a daemon restart that completes reconcile bootstrap before rolling out broadly.
Limitations¶
- Single Global Store instance; no automatic failover or multi-endpoint rotation (first endpoint only).
- RPC retries are bounded; there is no durable queue of state changes while disconnected.
- Eventual consistency: reconciles on heartbeat/reconcile loops, not instantly.
Future Enhancements¶
- Endpoint rotation/failover across multiple Global Store addresses.
- Multi-master Global Store with consensus and cross-region replication.
- Automated standby promotion and richer conflict resolution for concurrent updates.