TensorCast Startup and Integration User Guide

This guide explains how users should bootstrap TensorCast in real deployments. It focuses on:

  • tensorcast.init(...) behavior and mode selection
  • CLI-managed vs SDK-managed daemon lifecycle
  • API/SDK integration patterns
  • Tensor-parallel (TP) or other multi-process workloads sharing one daemon

1. Startup Model at a Glance

tensorcast.init is the main startup entry point (tensorcast.startup.init). It supports three modes:

| Mode | What it does | Daemon owner | Typical usage | Main risk |
| --- | --- | --- | --- | --- |
| connect | Connects to an existing daemon only | External (CLI/operator/other process) | Production services, TP workers | Fails if no reachable daemon |
| create | Starts a local daemon and connects to it | Current process | Single-process tools, local dev | Process exit may stop daemon |
| auto | Singleflight connect-or-create under the same runtime root | Leader process selected at runtime | Concurrent local workers booting together | Config mismatch across processes causes startup failure |

2. Recommended Patterns by Environment

| Environment | Recommended pattern | Why |
| --- | --- | --- |
| Production service / long-lived inference | Start daemon via CLI, app uses init(mode="connect") | Clean lifecycle boundaries, easier ops |
| Local development or notebook | init(mode="create") | Fast self-contained bootstrap |
| Many local processes start at the same time | init(mode="auto") | Prevents duplicate daemon launches |
| TP/forked workers | Pre-start daemon, each worker uses connect | Most stable and predictable |

3. Configuration Resolution

3.1 Daemon config path (create / auto)

| Priority | Source |
| --- | --- |
| 1 | daemon_config_path parameter |
| 2 | TENSORCAST_DAEMON_CONFIG |
| 3 | examples/config/store_daemon_config.yaml (repo or packaged wheel) |

If none is found, startup fails.
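
The priority order above can be sketched in a few lines of plain Python. This is an illustration only, not TensorCast's resolver: the function name, the `defaults` argument, and the choice to check file existence only for the bundled default are assumptions made for the example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Sequence

# Default location named in the priority table above.
DEFAULT_CONFIG = Path("examples/config/store_daemon_config.yaml")


def resolve_daemon_config_path(
    daemon_config_path: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    defaults: Sequence[Path] = (DEFAULT_CONFIG,),
) -> Path:
    """Return the first config path found, following the priority table.

    Hypothetical helper: it mirrors the documented order, not the library code.
    """
    env = os.environ if env is None else env
    # 1) An explicit parameter wins.
    if daemon_config_path:
        return Path(daemon_config_path)
    # 2) Then the environment variable.
    from_env = env.get("TENSORCAST_DAEMON_CONFIG")
    if from_env:
        return Path(from_env)
    # 3) Finally the bundled default, if it exists on disk.
    for candidate in defaults:
        if candidate.is_file():
            return candidate
    # Nothing matched: fail, as the guide says startup does.
    raise FileNotFoundError(
        "no daemon config found; pass daemon_config_path "
        "or set TENSORCAST_DAEMON_CONFIG"
    )
```

Passing `env` explicitly keeps the helper easy to test without mutating the real process environment.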

3.2 Global Store orchestration

| global_store_mode | Behavior | When to use |
| --- | --- | --- |
| none | No Global Store orchestration | Local-only or minimal setups |
| connect | Connect to an existing Global Store | Production clusters with managed GS |
| start | Start a new local Global Store first, then the daemon; fail if one already exists locally | Local all-in-one workflows |

3.3 Optional port overrides (create / auto)

SDK-managed launch now accepts a structured port_config object so callers can override daemon / Global Store ports without writing ad-hoc config files.

| Field | Meaning |
| --- | --- |
| daemon_listen_port | Daemon gRPC port |
| daemon_p2p_port | Daemon P2P/data-plane port |
| global_store_listen_port | Global Store gRPC port |
| global_store_metrics_port | Global Store Prometheus metrics port |

Rules:

| Rule | Behavior |
| --- | --- |
| Port value 0 | Auto-pick a free port at launch |
| connect mode | port_config is ignored because no local process is started |
| global_store_mode != "start" | Global Store port overrides are ignored |
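
To make the three rules concrete, here is a small model of how they combine. It is a sketch under assumptions: `PortOverrides` is a stand-in dataclass with the four documented field names (it is not `tc.PortConfig`), `effective_ports` is a hypothetical helper, and treating an unset field as "no override" is an assumption. Binding to port 0 and then releasing the socket is a common way to find a free port, but it is inherently racy and may differ from what TensorCast does at launch.

```python
import socket
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PortOverrides:
    """Stand-in with the four fields from the table (not the real tc.PortConfig)."""
    daemon_listen_port: Optional[int] = None
    daemon_p2p_port: Optional[int] = None
    global_store_listen_port: Optional[int] = None
    global_store_metrics_port: Optional[int] = None


def pick_free_port() -> int:
    """Ask the OS for a TCP port that is free right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def effective_ports(mode: str, global_store_mode: str, ports: PortOverrides) -> Dict[str, int]:
    """Apply the documented rules and return only the overrides that take effect."""
    # Rule: connect mode starts no local process, so overrides are ignored.
    if mode == "connect":
        return {}
    result: Dict[str, int] = {}
    for name, value in asdict(ports).items():
        if value is None:
            continue  # assumption: an unset field means "use the config file value"
        # Rule: Global Store overrides only apply when the SDK starts the store.
        if name.startswith("global_store_") and global_store_mode != "start":
            continue
        # Rule: 0 means "auto-pick a free port".
        result[name] = pick_free_port() if value == 0 else value
    return result
```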

4. Integration Patterns

Pattern A: CLI-managed daemon (connect)

Use the CLI to manage the daemon lifecycle, and keep application processes stateless with respect to daemon ownership.

# 1) Start Global Store (if needed)
uv run tensorcast-cli global start --config=examples/config/global_store_config.yaml

# 2) Start Store Daemon
uv run tensorcast-cli daemon start \
  --config=examples/config/store_daemon_config.yaml \
  --global-store-mode connect \
  --global-store-address 127.0.0.1:50051

Then connect from the application process:

import tensorcast as tc

tc.init(mode="connect", address="127.0.0.1:50052", show_daemon_logs=False)

artifact = tc.artifact(key="model:latest")
tensors = artifact.tensor_dict(device="cuda:0")

tc.shutdown()
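
Because connect fails when no daemon is reachable, a service that starts alongside the daemon may want to wait for the endpoint before calling tc.init. The helper below is a minimal sketch using only the Python standard library. `wait_for_endpoint` is a hypothetical name, and a successful TCP connection only shows that something is listening on the port, not that the daemon is healthy.

```python
import socket
import time


def wait_for_endpoint(
    address: str,
    timeout_s: float = 10.0,
    interval_s: float = 0.2,
    connect_timeout_s: float = 1.0,
) -> bool:
    """Poll a host:port address until a TCP connection succeeds or time runs out."""
    host, _, port = address.rpartition(":")
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection as a reachability probe.
            with socket.create_connection((host, int(port)), timeout=connect_timeout_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A caller would typically check the return value for the daemon address (for example 127.0.0.1:50052) and only then call tc.init(mode="connect", address=...), exiting with a clear message otherwise.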

Pattern B: SDK self-managed launch (create)

Good for local dev and simple scripts.

import tensorcast as tc

tc.init(
    mode="create",
    daemon_config_path="examples/config/store_daemon_config.yaml",
    global_store_mode="start",
    global_store_config_path="examples/config/global_store_config.yaml",
    show_daemon_logs=False,
)

# register / get / artifact operations...

tc.shutdown()
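
In create mode the calling process owns the daemon, so it helps to guarantee that shutdown runs even when the work in between raises. One way to do that is a small context manager. This is a sketch, not part of the TensorCast API: `tensorcast_session` is a hypothetical wrapper that relies only on the `init` and `shutdown` calls shown in this guide, and it takes the module as an argument so it can be exercised with a stand-in object.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def tensorcast_session(tc: Any, **init_kwargs: Any) -> Iterator[Any]:
    """Call tc.init(**init_kwargs) on entry and tc.shutdown() on exit.

    `tc` is expected to be the imported tensorcast module (or a test double
    exposing the same two functions).
    """
    tc.init(**init_kwargs)
    try:
        yield tc
    finally:
        # Runs on normal exit and on exceptions alike.
        tc.shutdown()
```

With the real module this would read `with tensorcast_session(tc, mode="create", ...) as session:` followed by the register / get / artifact operations.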

Pattern B1: SDK self-managed launch with explicit ports

import tensorcast as tc

tc.init(
    mode="create",
    daemon_config_path="examples/config/store_daemon_config.yaml",
    global_store_mode="start",
    global_store_config_path="examples/config/global_store_config.yaml",
    port_config=tc.PortConfig(
        daemon_listen_port=50052,
        daemon_p2p_port=0,
        global_store_listen_port=50051,
        global_store_metrics_port=18008,
    ),
    show_daemon_logs=False,
)

global_store_mode="start" is exclusive within the current runtime root. If a healthy local Global Store is already recorded under the same TENSORCAST_HOME, startup fails rather than reusing that instance. Either stop the existing Global Store first or switch to global_store_mode="connect".

Pattern C: Concurrent local startup (auto)

auto is useful when many local processes may start at once and should converge to one daemon.

import tensorcast as tc

tc.init(
    mode="auto",
    daemon_config_path="examples/config/store_daemon_config.yaml",
    global_store_mode="connect",
    global_store_address="127.0.0.1:50051",
    show_daemon_logs=False,
)

Best practice for auto:

| Rule | Reason |
| --- | --- |
| Keep init parameters identical across participating processes | auto validates a config hash and rejects mismatches |
| Do not pass different explicit session_id values per process | Breaks process-group singleflight expectations |
| Prefer connect for long-lived production worker pools | Owner process semantics in auto are harder to operate |
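
The first rule is easier to follow with a picture of what a config hash check looks like. The snippet below is only an illustration of the idea: TensorCast's actual hash inputs and algorithm are not described in this guide, and `config_fingerprint` is a hypothetical helper. It shows why argument order does not matter but any changed value does.

```python
import hashlib
import json
from typing import Any, Mapping


def config_fingerprint(init_kwargs: Mapping[str, Any]) -> str:
    """Hash startup arguments in a key-order-independent way (illustrative only)."""
    # Sorting keys gives a canonical form; default=str covers non-JSON values.
    canonical = json.dumps(
        init_kwargs, sort_keys=True, separators=(",", ":"), default=str
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Logging such a fingerprint from each process before calling init(mode="auto") can make a later AUTO_CONFIG_MISMATCH easier to trace back to the process whose arguments differ.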

5. API/SDK Usage Patterns

5.1 Module-level API (simple and common)

import tensorcast as tc

tc.init(mode="connect", address="127.0.0.1:50052")
# some_cuda_tensor is a placeholder for a tensor you have already created on a CUDA device.
tc.register({"w": some_cuda_tensor}, key="model:v1")
art = tc.artifact(key="model:v1")
weights = art.tensor_dict(device="cuda:0")

Note: init(mode="connect") only attaches to an existing daemon. Passing global_store_mode, global_store_address, or global_store_config_path does not reconfigure that daemon. Configure the Global Store when the daemon is created or started, either through init(mode="create"|"auto", ...) or through uv run tensorcast-cli daemon start ....

5.2 Explicit Store object (advanced tuning)

import tensorcast as tc
from tensorcast.api.store.types import RetryPolicy

tc.init(mode="connect", address="127.0.0.1:50052")

store = tc.store(
    opts=tc.StoreOptions(
        get=tc.GetArtifactOptions(source="local_only"),
        retry_overrides={"get": RetryPolicy(20.0, 2, 0.1, 2.0, 0.5)},
    )
)

art = store.artifact(key="model:latest")

6. TP / Multi-Process / Fork Best Practices

For tensor-parallel (TP) or any other multi-process setup, treat the daemon as shared infrastructure.

6.1 Recommended flow

| Step | Recommendation |
| --- | --- |
| 1 | Start the daemon once (CLI/system service) |
| 2 | Spawn/fork worker processes |
| 3 | In each worker process, call tc.init(mode="connect", address=...) |
| 4 | Build per-rank views (artifact.view(slices=...)) and materialize locally |

6.2 Do / Don’t

| Do | Don’t |
| --- | --- |
| Initialize TensorCast inside each worker process | Call tc.init() in the parent before fork |
| Use an explicit daemon address in distributed launches | Rely on implicit local discovery across hosts |
| Use connect for stable long-running TP services | Use create in every rank |
| Keep one daemon endpoint per process | Attempt to switch daemon address after the client is initialized |

6.3 Can forked workers call auto directly?

Yes. You do not need to pre-start the daemon if each forked worker calls tc.init(mode="auto") after the fork. One worker becomes the leader and starts the daemon; the others wait and then connect.

Required conditions:

| Condition | Why |
| --- | --- |
| Call auto in the child process (after fork) | Avoid inheriting parent-initialized runtime/client state |
| Keep startup args identical across workers | auto enforces config-hash consistency |
| Share the same runtime root (TENSORCAST_HOME) | Singleflight election happens under one runtime root |

Operational caveat:

| Caveat | Impact |
| --- | --- |
| Leader process is daemon owner | If the owner exits early, the daemon can be stopped and other workers are affected |

For long-running TP services, prefer a dedicated daemon process and have each worker use connect.
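
For readers unfamiliar with singleflight startup, the general technique can be shown with an exclusive file create under a shared directory: exactly one process wins and becomes the leader, and the rest learn that a leader already exists. This is a generic sketch and not TensorCast's mechanism, which this guide does not describe; `try_become_leader` and the `daemon.leader` file name are hypothetical.

```python
import os
from pathlib import Path


def try_become_leader(runtime_root: Path, name: str = "daemon.leader") -> bool:
    """Return True for exactly one caller per runtime root (illustrative only)."""
    runtime_root.mkdir(parents=True, exist_ok=True)
    marker = runtime_root / name
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so at most one
        # process can create the marker.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record who owns the daemon
    return True
```

The sketch also shows why a shared runtime root matters: processes pointing at different roots each elect their own leader. A real implementation additionally has to handle a leader that exits and leaves a stale marker, which is the same ownership concern the caveat above describes.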

6.4 Worker template

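def worker_main(rank: int, daemon_addr: str) -> None:
    # Import inside the worker so no TensorCast state is inherited from the parent.
    import tensorcast as tc
    import torch

    torch.cuda.set_device(rank)
    tc.init(mode="connect", address=daemon_addr, show_daemon_logs=False)
    try:
        # build_rank_slices is a user-supplied helper (not defined in this guide)
        # that returns the slices this rank should load.
        view = tc.artifact(key="model:latest").view(slices=build_rank_slices(rank))
        _ = view.tensor_dict(device=f"cuda:{rank}")
    finally:
        # Release the client even if materialization fails.
        tc.shutdown()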
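
The template leaves build_rank_slices to the caller. Its return format depends on what artifact.view(slices=...) expects, which this guide does not specify, so the sketch below covers only the arithmetic such a helper usually needs: splitting one tensor dimension as evenly as possible across ranks. `shard_bounds` is a hypothetical name.

```python
from typing import Tuple


def shard_bounds(dim_size: int, rank: int, world_size: int) -> Tuple[int, int]:
    """Return the half-open [start, stop) range of a dimension owned by `rank`.

    When dim_size is not divisible by world_size, the first
    (dim_size % world_size) ranks each receive one extra element.
    """
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    base, extra = divmod(dim_size, world_size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop
```

For example, a dimension of size 10 split across 4 ranks yields the ranges (0, 3), (3, 6), (6, 8), and (8, 10).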

7. Troubleshooting

| Symptom | Likely cause | Action |
| --- | --- | --- |
| No local daemon session found in connect mode | No running daemon, or no discovered local session | Start the daemon via CLI or pass an explicit address |
| AUTO_CONFIG_MISMATCH in auto mode | Different init/config params across processes | Make all auto startup args identical |
| Materialization requires a Global Store connection | Daemon not connected to a Global Store for the requested operation | Configure the Global Store at daemon startup (init with mode="create" or mode="auto" and global_store_..., or tensorcast-cli daemon start --global-store-...) |
| client already initialized for address ... refusing second client | Same process tried to bind to another daemon address | Use one daemon endpoint per process; restart the process if switching is needed |
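
Services that wrap startup in their own error handling sometimes attach these hints to log output. The lookup below is a minimal sketch: `suggest_action` is a hypothetical helper, and it matches on the message fragments quoted in the table, which is an assumption about how the errors are surfaced (exception types and exact wording may differ).

```python
from typing import Optional

# (message fragment, suggested action) pairs taken from the troubleshooting table.
_HINTS = (
    ("No local daemon session found",
     "Start the daemon via CLI or pass an explicit address."),
    ("AUTO_CONFIG_MISMATCH",
     "Make all auto startup args identical across processes."),
    ("Materialization requires a Global Store connection",
     "Configure the Global Store at daemon startup."),
    ("client already initialized for address",
     "Use one daemon endpoint per process; restart the process to switch."),
)


def suggest_action(error_message: str) -> Optional[str]:
    """Return the table's suggested action for a known symptom, else None."""
    for fragment, hint in _HINTS:
        if fragment in error_message:
            return hint
    return None
```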

8. Production Checklist

| Item | Status |
| --- | --- |
| Daemon lifecycle managed outside app process (CLI/system) | |
| App uses init(mode="connect", address=...) | |
| Global Store mode selected intentionally (none/connect/start) | |
| TP workers initialize TensorCast after process start | |
| All startup configs are deterministic and version-controlled | |