Artifact Loading Workflow¶
This diagram shows the complete artifact loading workflow in TensorCast, including the interaction between different components.
Related docs:
- docs/architecture/artifact-views-and-retrieval.md
- docs/architecture/api/materialization-flow.md
- docs/internals/byte-range-mapping-and-execution.md
- docs/internals/disk-load-strategy.md
- docs/designs/0084-binding-unified-model-and-contract.md
- docs/designs/0108-tensor-aware-materialization-strategy-plane.md
0108 Strategy Plane¶
The loading stack now has an explicit strategy-lowering seam between semantic resolution and executor choice.
- controller code resolves semantic truth first:
  - ArtifactSelection
  - resolved view_id/ViewPlan
  - target layout
  - RepresentationTransformContract
- the common runtime then lowers that semantic plan plus source facts into executor strategy inside MaterializationFacade
- internal contracts for this seam live in core/store/runtime/ingestion/materialization_strategy_types.h:
  - ResolvedMaterializationPlan
  - RepresentationTransformContract
  - RepresentationWorkPlan
  - ResolvedSourceBinding
  - ExecutionCommitReport
Current rule:
- residual generic byte-range work must be explicit in RepresentationWorkPlan,
- executors do not rediscover semantic truth from source/view metadata during execution,
- executors must not implicitly reconstruct fallback work after a partial fast path attempt.
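The current rule can be illustrated with a minimal sketch. The names WorkPlanSketch, fast_path_ranges, residual_ranges, and execute are hypothetical and do not reflect the real layout of RepresentationWorkPlan; the sketch only shows an executor that runs the work the plan lists and raises instead of inventing fallback work after a failed fast path.

```python
from dataclasses import dataclass

ByteRange = tuple[int, int]  # (offset, length); hypothetical representation


@dataclass(frozen=True)
class WorkPlanSketch:
    """Hypothetical stand-in for RepresentationWorkPlan."""
    fast_path_ranges: tuple[ByteRange, ...] = ()
    residual_ranges: tuple[ByteRange, ...] = ()  # generic byte-range work, listed explicitly


class FastPathFailed(RuntimeError):
    """Raised instead of silently falling back to generic work."""


def execute(plan, fast_copy, generic_copy):
    """Run exactly the work named in the plan and report what was done."""
    done = []
    for rng in plan.fast_path_ranges:
        if not fast_copy(rng):
            # No implicit fallback: surface the failure so the caller can re-plan.
            raise FastPathFailed(f"fast path failed for range {rng}")
        done.append(("fast", rng))
    for rng in plan.residual_ranges:
        generic_copy(rng)
        done.append(("generic", rng))
    return done
```

In this sketch a failed fast-path range never turns into generic work inside the executor; any fallback would have to appear in a new plan.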
System Components¶
- InferenceInstance: Python + CXX EXT
  - Entrypoint: tensorcast.artifact(...).tensor_dict / .tensor_dict_into (facade over the process Store)
  - CLI class: client.py::DaemonCtl
  - CXX: checkpoint_py.cc
- LocalStoreDaemon: C++ gRPC service (see ../architecture/api/api-design.md)
  - Binary: daemon/tensorcast_daemon
  - Service: store_daemon.StoreDaemonService (ResolveKeyMapping / MaterializeReplica / ConfirmReplica / UnloadReplica / MaterializeIntoTarget)
  - SessionLifecycle & LIP Runtime: SessionLifecycleManager, RefTracker, and LipManager (under daemon/state/session_lifecycle.* and daemon/state/lip_manager.*) track per-PID references, mint UseLeases for GPU replicas, and satisfy selection-first MaterializeReplica requests from already-resident CUDA IPC handles when a matching Lease-In-Place replica exists.
- GlobalStore: Python
  - Entrypoint: tensorcast/global_store/grpc_service.py::GlobalStoreServicer
- RemoteStoreDaemon: Same as LocalStoreDaemon
Runtime Initialization¶
All client processes call tensorcast.init(mode=...) to establish the daemon session. Initialization pins
the daemon endpoint and constructs the shared Store used by the module-level helpers. The Store owns
retry policy, lease keepalive, and fallback orchestration for the process. Advanced integrations can
access it via tensorcast.store(), but day-to-day usage goes through the functional helpers.
Typed runtime rollout for the 0108 strategy plane now comes from daemon config
under engine.materialization_strategy, not from ambient process environment.
That config is lowered into StoreEngineOptions::MaterializationStrategyConfig
and shared by the common runtime plus replica-side loaders.
For a path-by-path summary of current disk-backed loading decisions, including
TP-aware rank-local slicing, local SSD vs shared filesystem behavior, and
collective/local-batched/generic fallback selection, see
docs/internals/disk-load-strategy.md.
Artifact Loading Sequence¶
sequenceDiagram
participant InferenceInstance
participant LocalStoreDaemon
participant GlobalStore
participant RemoteStoreDaemon
participant DiskSource
InferenceInstance->>LocalStoreDaemon: 0. Malloc CUDA Memory
Note right of InferenceInstance: Local: store_engine.py::allocate_cuda_memory
InferenceInstance->>LocalStoreDaemon: 1. ResolveKeyMapping (optional)
Note left of LocalStoreDaemon: RPC: ResolveKeyMapping
InferenceInstance->>LocalStoreDaemon: 2. MaterializeReplica (selection, alloc + async load)
Note left of LocalStoreDaemon: RPC: MaterializeReplica
LocalStoreDaemon->>GlobalStore: 3. RequestReplicaTransport
Note left of GlobalStore: RPC: RequestReplicaTransport
opt Replica already resident (LIP fast path)
LocalStoreDaemon->>LocalStoreDaemon: 3a. try_satisfy_from_lip(device_id)
Note right of LocalStoreDaemon: LipManager returns CUDA IPC + ReplicaKey
LocalStoreDaemon-->>InferenceInstance: Return CUDA IPC handle + leases
end
GlobalStore-->>LocalStoreDaemon: 4. Remote session or disk-only hint
Note left of GlobalStore: RPC Resp: transport descriptor
alt Have remote replica
LocalStoreDaemon-->>RemoteStoreDaemon: 5.1 Ingest via P2P communicator.read_tensor
else Disk-only hint
LocalStoreDaemon->>DiskSource: 5.2 Ingest from disk/object store (MaterializationFacade fallback)
end
InferenceInstance->>LocalStoreDaemon: 6. Finish loading
Note left of LocalStoreDaemon: RPC: ConfirmReplica (await ready future)
alt If have Global Store
LocalStoreDaemon->>GlobalStore: 7. CompleteReplicaTransport + RegisterReplica
Note left of GlobalStore: RPC: CompleteReplicaTransport + RegisterReplica
end
InferenceInstance->>LocalStoreDaemon: 8. Exit, unregister
Note left of LocalStoreDaemon: RPC: UnloadReplica (drops refs/leases)
Region-backed tensor_dict_into¶
Region-backed into calls bypass daemon-owned replicas by streaming bytes into a
client-registered CUDA region using MaterializeIntoTarget. The SDK computes a
coalesced TargetLayout over canonical or view-indexed ByteSpaces (including
packed subsets), validates against the selected index, and invokes the v2 RPC
directly. The daemon maps the IPC handles (single or ordered multi-storage),
streams bytes from P2P or disk, and releases the region reference on completion
without allocating VRAM.
See tensor_dict_into dataflow for the detailed sequence and constraints.
Binding-Based Inplace Materialization¶
The canonical binding contract lives in
docs/designs/0084-binding-unified-model-and-contract.md. This section summarizes
where binding sits in the loading pipeline.
Binding exposes the inplace-update path directly while still using the region-backed data plane:
- Artifact.bind(...) allocates client-owned CUDA target tensors, registers them as VRAM regions, performs one MaterializeIntoTarget RPC, and returns a Binding.
- Artifact.bind_into(...) adopts user-owned CUDA tensors and performs the same single-RPC fill.
- Artifact.subset(...).view(...).bind(...) captures a rank-local source selection once; later binding.swap("model:v2") reuses that same selection against the new full artifact version.
- bind_into(..., mapping=copy_plan) captures mapped binding intent once; the daemon lowers it into the shared RepresentationTransformContract family, and later binding.swap(...) reuses that same lowered semantic shape without Python copy loops.
- binding.publish_replica() or binding.swap(..., publish=True) publishes the current bound layout once the local overwrite succeeds.
This publish path is the ordinary artifact-backed replica path from 0084. It
is not the serving-artifact publication or representation_publish closeout
path used by source-to-serving builder work.
Serving-Artifact Runtime Preflight¶
When runtime consumes a serving artifact, TensorCast now performs a serving artifact preflight before accepting it into the steady-state loading path.
Phase-1 rules:
- the phase-1 manifest carrier is tensor:__tensorcast_meta__.manifest_json
- artifacts without that reserved manifest tensor continue to load as ordinary non-serving artifacts
- strict serving runtime is now explicit rather than inferred from every generic materialization request: PublishedModelVersion.require_serving_runtime_policy(), RepresentationPublishContract.to_runtime_policy(), and ServingArtifactManifest.to_runtime_policy() produce a ServingRuntimePolicy that callers can pass into artifact.bind(...), artifact.bind_into(...), and binding.swap(...)
- artifacts with that reserved manifest tensor must pass:
  - manifest JSON parseability
  - schema_version == 1
  - artifact_kind == "serving"
  - non-empty framework_name, adapter_version, serving_abi_version, representation_contract_hash, serving_build_digest, tensor_schema_hash, builder_mode, and build_pipeline_version
  - serving_manifest_ref agreement between the manifest and the runtime policy when strict serving runtime is requested
  - canonical tensor count equality between manifest and canonical index
  - tensor schema hash equality between manifest and the canonical index with the reserved manifest tensor excluded
Current daemon coverage:
- MaterializeReplica
- MaterializeIntoTarget
- source-bound owned-binding create/refill paths
This keeps serving-artifact publication-time validation and runtime acceptance validation on the same contract, so runtime no longer silently accepts a manifest-bearing serving artifact whose self-description is inconsistent with its canonical tensor layout.
Important distinction:
- generic artifact load remains fail-open for ordinary non-serving artifacts
- strict serving runtime is opt-in through ServingRuntimePolicy
- this lets serving startup and reload fail closed without turning the whole artifact runtime into a serving-only surface
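The phase-1 checks listed above can be sketched as a single validation function. This is a minimal illustration, not the daemon's implementation: the function name preflight, the tensor_count manifest key, and the error strings are assumptions, and the canonical-index tensor count and schema hash are taken as precomputed inputs because their derivation is not described here.

```python
import json

REQUIRED_FIELDS = (
    "framework_name", "adapter_version", "serving_abi_version",
    "representation_contract_hash", "serving_build_digest",
    "tensor_schema_hash", "builder_mode", "build_pipeline_version",
)


def preflight(manifest_json, index_tensor_count, index_schema_hash,
              policy_manifest_ref=None):
    """Return a list of violations; an empty list means the manifest passes.

    index_schema_hash is assumed to be computed by the caller over the
    canonical index with the reserved manifest tensor excluded.
    policy_manifest_ref is only supplied when strict serving runtime is requested.
    """
    try:
        manifest = json.loads(manifest_json)
    except (TypeError, ValueError):
        return ["manifest is not parseable JSON"]
    if not isinstance(manifest, dict):
        return ["manifest is not a JSON object"]

    errors = []
    if manifest.get("schema_version") != 1:
        errors.append("schema_version must be 1")
    if manifest.get("artifact_kind") != "serving":
        errors.append('artifact_kind must be "serving"')
    for name in REQUIRED_FIELDS:
        if not manifest.get(name):
            errors.append(f"{name} must be non-empty")
    if (policy_manifest_ref is not None
            and manifest.get("serving_manifest_ref") != policy_manifest_ref):
        errors.append("serving_manifest_ref disagrees with runtime policy")
    # "tensor_count" is a hypothetical key name for the manifest's tensor count.
    if manifest.get("tensor_count") != index_tensor_count:
        errors.append("tensor count differs from canonical index")
    if manifest.get("tensor_schema_hash") != index_schema_hash:
        errors.append("tensor schema hash differs from canonical index")
    return errors
```

Returning every violation at once, rather than stopping at the first, is a design choice of the sketch; a fail-closed caller only needs to check whether the list is empty.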
Serving-Builder Guardrails¶
The Python serving builder keeps artifact identity as the source authority for compiled serving recipes:
- SourceCatalog.source_artifact_ref must be a real artifact identity. The builder accepts mi2 content identities and daemon-attested msa1 mounted sources; synthetic disk:, key:, path, and cache-local references are rejected before compile.
- ServingBindingPlan is the single recipe, compile, resolved-spec, and realization identity. It must agree with TensorcastServingFacts for framework_name, adapter_version, and serving_abi_version; its compile payload also carries source/schema/realization digests and destination tensor schema coverage for every trace-plan destination.
- Mapped binding lowering accepts only single-axis source and destination ranges. Destination MultiRange slices stay on the binding-realization path until the mapped-binding protocol has an explicit flattened-layout contract; coverage validation fails closed for unsupported MultiRange writes.
- Retained serving binding authority may carry a group_realization_acquire reference. Acquire passes it through to the Store Daemon, and retained attachment handles own lease release unless ownership has been explicitly transferred to runtime.
Lease-In-Place Fast Path & Use Leases¶
MaterializeReplica consumes ArtifactSelection and uses selection.artifact_id as the request identity. Key workflows resolve key mapping first through ResolveKeyMapping, then issue MaterializeReplica. Before coordinating transport, the controller asks LipManager::try_satisfy_from_lip for a replica that already lives on the requested GPU. When this fast path hits, the daemon reuses the existing CUDA IPC handle, marks the status as ALLOCATED, and returns immediately (plus optional daemon-selected used_disk_path) without invoking the bulk materialization pipeline. If the fast path misses, the controller immediately falls through to the engine-backed path described below.
Both the fast path and the engine path increment the caller’s PID in RefTracker, create a UseLease inside SessionLifecycleManager, and stash the resulting ReplicaSession under the supplied replica_uuid. This keeps eviction and TTL orchestration consistent: ConfirmReplica waits on the stored future before reporting success, and UnloadReplica (or PID death) releases the lease so ReplicaRuntime knows when it is safe to evict the GPU allocation.
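The reference-tracking behaviour described above can be modelled with a small per-PID counter. This is a toy sketch under stated assumptions: RefTrackerSketch and its method names are hypothetical, and TTL expiry and the UseLease object itself are not modelled.

```python
from collections import defaultdict


class RefTrackerSketch:
    """Toy model of per-PID reference counts per replica."""

    def __init__(self):
        # replica key -> pid -> reference count
        self._refs = defaultdict(lambda: defaultdict(int))

    def acquire(self, replica_key, pid):
        self._refs[replica_key][pid] += 1

    def release(self, replica_key, pid):
        pids = self._refs.get(replica_key)
        if not pids or pids.get(pid, 0) == 0:
            return  # nothing held: releasing is a no-op
        pids[pid] -= 1
        if pids[pid] == 0:
            del pids[pid]

    def release_pid(self, pid):
        """Drop every reference held by a process that has died."""
        for pids in self._refs.values():
            pids.pop(pid, None)

    def evictable(self, replica_key):
        """A replica may be evicted only when no PID holds a reference."""
        return not self._refs.get(replica_key)
```

With two consumers holding the same replica, the allocation stays pinned until both the explicit release and the PID-death path have cleared their entries.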
Runtime Events and Publish Context IDs¶
- The daemon-side Store constructs a publish_context_id for every ingestion request via RuntimeContext::mint_publish_context_id() before the pipeline starts. MaterializationFacade publishes typed IngestionStartedEvent/IngestionCompletedEvent payloads through IngestionEventHub so subscribers receive identical metadata (ingestion source, target device, request_id, publish context, and any resolved view hints).
- MetadataGateway subscribes to ingestion_completed events and reuses the publish_context_id to dedupe synchronous publish requests against auto-publish flows: whichever arrives first performs the Global Store RPC, and the later call becomes a no-op/TTL refresh.
- ReplicaRuntime also listens for the same events to keep UMA telemetry in sync and to attribute pipeline metrics (bytes, duration, success/failure) to the correct request id.
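The dedupe rule for publish_context_id can be sketched as a first-arrival-wins guard. PublishDeduperSketch and its publish method are hypothetical names, and the sketch deliberately leaves out TTL refresh and error recovery (an id stays marked even if the RPC fails).

```python
import threading


class PublishDeduperSketch:
    """First caller for a given publish_context_id performs the RPC."""

    def __init__(self, publish_rpc):
        self._publish_rpc = publish_rpc  # callable standing in for the Global Store RPC
        self._seen = set()
        self._lock = threading.Lock()

    def publish(self, publish_context_id, payload):
        """Return True if this call performed the RPC, False if it was deduped."""
        with self._lock:
            if publish_context_id in self._seen:
                return False  # later arrival: no-op (a real system might refresh a TTL)
            self._seen.add(publish_context_id)
        self._publish_rpc(payload)
        return True
```

Either the synchronous publish or the auto-publish event handler can call publish; the lock only guards the membership check, so the RPC itself runs outside the critical section.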
Key Steps Explained¶
- Request Construction & Materialization: The SDK builds one ArtifactSelection (selection.artifact_id, optional selection.view_spec/selection.view_id, optional subset fields) and issues MaterializeReplica with device selectors, optional pinned_allocation_timeout_ms, caller pid, and replica_uuid for ConfirmReplica/UnloadReplica.
- Key Resolution & Daemon-Owned Disk Source Selection: Key-based callers first resolve the human key via Global Store (ResolveKeyMapping) to obtain artifact_id, then materialize by selection. If disk fallback is allowed, the daemon resolves managed/shared-disk or local-import bindings internally and, when available, populates used_disk_path. Clients do not provide retrieval disk paths.
- Lease-In-Place Reuse: LipManager::try_satisfy_from_lip checks whether the target GPU already holds the replica. On a hit, the daemon reuses the resident CUDA IPC handle, marks the status as ALLOCATED, and immediately responds with the original artifact_id and any used_disk_path while still tracking the caller’s PID/lease.
- Replica Selection & Transfer: LIP misses call into StoreEngine::materialize_replica, which routes through MaterializationFacade and MaterializeOrchestrator to request a remote replica (RequestReplicaTransport). Successful transports stream bytes over the communicator; failures fall back to ingest_from_disk using the resolved disk path when available.
- Lease Binding & Reference Tracking: Every granted replica increments RefTracker for the caller PID and acquires a GPU UseLease from SessionLifecycleManager. These handles block eviction until the PID drops its reference or TTL expires, keeping telemetry and scheduler state consistent.
- GPU Transfer & Confirmation: MaterializeReplica returns as soon as memory is allocated (resident or freshly loaded). The client must call ConfirmReplica with the replica_uuid; the daemon waits (up to the gRPC deadline, capped at 30s) for the ingestion future before confirming success, surfacing any loader failures back to the caller.
- Registration & Publish: Once ingestion completes, MaterializationFacade marks the Global Store transport session finished, registers or refreshes the replica metadata, and publishes ingestion_completed events with the original publish_context_id so ReplicaRuntime and MetadataGateway stay in lock-step.
- Cleanup: When inference exits or releases the tensors, it calls UnloadReplica. The daemon drops the PID reference, releases the associated UseLease, and only tears down the GPU allocation once no active references remain, ensuring shared replicas survive across overlapping consumers. If the replica never reached allocation, UnloadReplica is a no-op and returns success.
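The confirmation wait in the GPU Transfer & Confirmation step can be sketched with a standard future. This is an illustration only: confirm_replica and the returned status strings are hypothetical, and the only detail taken from the text is that the wait is bounded by the gRPC deadline and capped at 30 seconds.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeout

CONFIRM_CAP_SECONDS = 30.0  # cap stated in the text above


def confirm_replica(ingestion_future, grpc_deadline_seconds):
    """Wait for ingestion, bounded by min(gRPC deadline, 30 s)."""
    timeout = min(grpc_deadline_seconds, CONFIRM_CAP_SECONDS)
    try:
        ingestion_future.result(timeout=timeout)
    except FutureTimeout:
        return "TIMEOUT"
    except Exception as exc:
        # Loader failures are surfaced to the caller rather than swallowed.
        return f"FAILED: {exc}"
    return "OK"


# Usage: an ingestion that has already finished confirms immediately.
done = Future()
done.set_result(None)
print(confirm_replica(done, grpc_deadline_seconds=5.0))  # prints OK
```

A caller passing a 120-second deadline would still wait at most 30 seconds in this sketch, because the cap is applied before the wait starts.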
In-Memory Registration (Store API)¶
- Unified API: BeginRegisterArtifact → FeedRegisterArtifactStream → CommitRegisteredArtifact.
- Realization Plans:
  - Coalesced VRAM: daemon allocates a single VRAM segment and exposes CUDA IPC to the SDK, which writes tensor bytes directly.
  - VRAM Lease (FDML): client exports CUDA IPC handles for unique storage blocks and feeds LeaseSegments; daemon computes hash by compiling the canonical ByteRangeMap (PAD=0) and streaming leased memory through the unified byte-range program.
LeaseSegments ↔ ByteRangeMap¶
- Robust protocol: each LeasedSegment includes artifact_offset (logical byte offset in the canonical artifact layout) plus a storage_id reference. This removes any ordering assumption when sending lease segments.
- Daemon behavior:
  - Builds the canonical ByteRangeMap from canonical index bytes and compiles it into a ByteRangeProgram.
  - Treats all PAD intervals as zero-filled for hashing and for any materialization copies.
  - Reads DATA intervals from the referenced storage windows (StorageEntry + mapping_base_offset + storage_offset) regardless of feed order.
- Client behavior:
  - SDK computes a coalesced logical layout and sets artifact_offset per unique storage block.
  - Ordering no longer matters, but the SDK still sorts for stable traces.
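The order-independence of lease segments can be illustrated by hashing a layout of DATA and PAD intervals keyed by artifact_offset. This is a sketch under assumptions: the Interval type and hash_layout function are hypothetical, SHA-256 stands in for whatever hash the daemon actually uses, and storages are plain byte strings rather than CUDA IPC mappings.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    kind: str             # "DATA" or "PAD"
    artifact_offset: int  # logical offset in the canonical layout
    length: int
    storage_id: int = -1  # only meaningful for DATA
    storage_offset: int = 0


def hash_layout(intervals, storages):
    """Hash the canonical byte stream: DATA read from storages, PAD as zeros."""
    h = hashlib.sha256()  # stand-in hash; the real function is not specified here
    cursor = 0
    # Sorting by artifact_offset makes the result independent of feed order.
    for iv in sorted(intervals, key=lambda i: i.artifact_offset):
        if iv.artifact_offset != cursor:
            raise ValueError("intervals must tile the layout without gaps or overlap")
        if iv.kind == "PAD":
            h.update(bytes(iv.length))  # zero-filled padding
        else:
            buf = storages[iv.storage_id]
            chunk = buf[iv.storage_offset:iv.storage_offset + iv.length]
            if len(chunk) != iv.length:
                raise ValueError("DATA interval exceeds its storage window")
            h.update(chunk)
        cursor += iv.length
    return h.hexdigest()
```

Feeding the same intervals in reverse order yields the same digest, which is the property the artifact_offset field is meant to provide.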
Shared Storage Graph Helper¶
- SDK registration flows call tensorcast.api._tensor_graph.build_tensor_storage_graph() before feeding lease segments.
- The helper deduplicates torch.Storage objects and emits a TensorStorageGraph containing StorageEntry rows (unique storage id, device id, base pointer, storage length) plus TensorAlias metadata (tensor name, storage id, storage offset, logical byte length, shape, stride, dtype).
- Clients transmit the deduplicated storage table via storage_entries and alias metadata via tensor_aliases. The daemon reconstructs canonical index JSON from these structures, producing byte-for-byte parity with disk persistence and opening each CUDA IPC handle only once per unique storage.
- On materialization, the SDK maps the returned CUDA IPC handle once and constructs all torch.Tensor views from that mapping in one pass so the mapping is reference-counted across tensors and closed exactly once.
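The deduplication idea behind the storage graph can be shown without torch. In this sketch TensorLike is a hypothetical stand-in for a tensor that views a shared buffer, and build_graph is an illustrative function, not the signature of build_tensor_storage_graph(); device ids, base pointers, shapes, strides, and dtypes are omitted.

```python
from dataclasses import dataclass


@dataclass
class TensorLike:
    """Hypothetical stand-in for a torch.Tensor: a window into a shared buffer."""
    storage: bytearray
    storage_offset: int  # in bytes
    nbytes: int


def build_graph(named_tensors):
    """Return (storage_entries, tensor_aliases) with each storage listed once."""
    storage_ids = {}  # id(storage object) -> row index in the storage table
    entries, aliases = [], []
    for name, t in named_tensors.items():
        key = id(t.storage)  # identity, not equality: tied tensors share one object
        if key not in storage_ids:
            storage_ids[key] = len(entries)
            entries.append({"storage_id": storage_ids[key], "length": len(t.storage)})
        aliases.append({
            "name": name,
            "storage_id": storage_ids[key],
            "storage_offset": t.storage_offset,
            "nbytes": t.nbytes,
        })
    return entries, aliases
```

Two tensors that alias the same buffer produce one storage row and two alias rows, so a consumer would open that storage only once.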
Recommended: rely on tensorcast.register(...) (or register_async) with
RegisterArtifactOptions(lease_in_place=True) and an explicit ttl_ms. The Store
manages keepalives automatically and surfaces the committed descriptor through the
returned RegisteredArtifact. Same-machine consumers materialize into daemon-owned
coalesced VRAM (CUDA IPC) for zero-copy use.
Client Facade & Store helpers¶
tensorcast.register(...)(lease-in-place) andtensorcast.put(...)(daemon-owned coalesced VRAM) return aRegisteredArtifactdescribing the canonical index, replica metadata, and lease handle when applicable. Both functions call into the shared Store and offer async variants viatensorcast.store().register_async(...).- Retrieval is handle-first:
tensorcast.artifact(...).tensor_dict(...)streams tensors to the requested device (sync) andtensor_dict_async(...)mirrors it asynchronously. In-place copies useartifact.tensor_dict_into(...)or the convenienceartifact.tensor_into(name, target, ...). The Store validates shapes/strides/device before mutating buffers, zero-fills PAD segments to keep tensors consistent on failure, and unloads daemon-backed replicas immediately after copy/validation. StoreOptions.getand per-callGetArtifactOptionscarry execution-scoped retrieval policy (source) and topology hints without sprinkling ad-hoc flags across call sites.- Low-level lease feeding and commit orchestration are handled internally by the Store, so most integrations rely entirely on the functional facade and its cancellation hooks.
Python SDK Updates¶
- Plan selection uses a typed enum PlanType instead of raw strings to avoid typos:
  - PlanType.VRAM_COALESCED (aliases: "coalesced")
  - PlanType.VRAM_LEASED (aliases: "lease")
- RegisterArtifactOptions is now a frozen dataclass with slots for immutability.
- Loading helpers with fixed return types now hang off Artifact:
  - Synchronous: tensorcast.artifact(...).tensor_dict(...) -> dict[str, torch.Tensor]
  - Asynchronous: tensorcast.artifact(...).tensor_dict_async(...) -> ArtifactFuture[dict[str, torch.Tensor]]
- ArtifactFuture.done() / result(timeout) / cancel() mirror the standard concurrent.futures contract. Cancellation propagates to daemon RPCs (AbortRegisteredArtifact, RevokeRegisteredArtifact) and records telemetry for observability.
- Unified error model under TensorCastError with readable subclasses like DaemonUnavailable, DeviceMismatch, and IndexParseError.
- Key-resolution loads raise a clear runtime error when a key is absent, including the daemon address and guidance for registering artifacts.
- Key→artifact-id lookups are cached inside the Store for 30 seconds by default (override with TENSORCAST_STORE_KEY_CACHE_TTL_SECONDS); disk fallback relies on daemon-resolved managed disk locations rather than client-side disk hints.
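The key cache behaviour can be sketched as a small TTL map. KeyCacheSketch is a hypothetical class; only the 30-second default and the TENSORCAST_STORE_KEY_CACHE_TTL_SECONDS variable name come from the text, and how the real Store parses that variable is an assumption here.

```python
import os
import time

DEFAULT_TTL_SECONDS = 30.0  # default stated in the text above


class KeyCacheSketch:
    """Cache key -> artifact_id lookups for a fixed time-to-live."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve  # callable: key -> artifact_id
        self._clock = clock      # injectable so expiry can be tested
        # Assumption: the override is read once and parsed as a float.
        self._ttl = float(os.environ.get(
            "TENSORCAST_STORE_KEY_CACHE_TTL_SECONDS", DEFAULT_TTL_SECONDS))
        self._entries = {}       # key -> (artifact_id, expires_at)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]        # fresh entry: skip the resolver
        artifact_id = self._resolve(key)
        self._entries[key] = (artifact_id, now + self._ttl)
        return artifact_id
```

Injecting the clock keeps the sketch testable without sleeping; a lookup after the TTL has elapsed calls the resolver again and restarts the window.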
Registration Semantics¶
- Commit returns RFC-0007 content-addressed descriptor (artifact_id = mi2:index_multihash:data_multihash).
- Python: RegisteredArtifact.commit() returns CommitResult with fields:
  - descriptor (ArtifactDescriptor)
  - existed (bool): true when the commit hit an existing replica and joined a reference
- Idempotent success on duplicates: if the same mi2: artifact already has a replica on the same device, the daemon reclaims the new allocation and returns OK with the existing descriptor plus existed=true.
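The idempotent-commit rule can be sketched with an in-memory replica table. CommitResultSketch and ReplicaTableSketch are hypothetical, the multihash strings are opaque placeholders, and reclaiming the duplicate allocation is reduced to a comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitResultSketch:
    artifact_id: str
    existed: bool


class ReplicaTableSketch:
    """Toy model: one replica per (artifact_id, device_id)."""

    def __init__(self):
        self._replicas = {}  # (artifact_id, device_id) -> joined reference count

    def commit(self, index_multihash, data_multihash, device_id):
        # Content-addressed id in the mi2:index_multihash:data_multihash form.
        artifact_id = f"mi2:{index_multihash}:{data_multihash}"
        key = (artifact_id, device_id)
        existed = key in self._replicas
        # A duplicate joins the existing replica; a real daemon would also
        # reclaim the newly allocated memory at this point.
        self._replicas[key] = self._replicas.get(key, 0) + 1
        return CommitResultSketch(artifact_id, existed)
```

Committing identical content twice on the same device returns the same artifact_id with existed set on the second call, while the same content on another device counts as a new replica.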
Variant-Aware Views (v1)¶
- core/store/materialization/dataplane/view/view_planner.{h,cc} materializes a ViewPlan from canonical index JSON plus a ViewSpec. v1 supports single-dimension narrow (slice) operations and emits both the variant layout (view_index_json) and a SelectionPlan describing canonical byte ranges.
- core/store/materialization/dataplane/view/view_plan_source.{h,cc} wraps any SeekableSource and executes the SelectionPlan, streaming minimal bytes (zero-filling PAD regions) to downstream consumers.
- StoreEngine now exposes static helpers:
  - compute_view_plan(...) → Loader-backed planning entry point surfaced to the daemon.
  - view_plan_allows_alias(plan) → Returns true when the selection is contiguous and segment-aligned so the engine can hand out zero-copy aliases.
  - compute_view_data_hash_from_source(source, plan, leaf_bytes) → Delegates to ViewHashComputer, which reuses the TreeHash pipeline to verify variant byte spaces across disk, GPU, and replica-resident sources.
These APIs keep view normalization, selection, and hashing anchored in the C++ core so the Python daemon and SDK share a single implementation.
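How a single-dimension narrow turns into canonical byte ranges can be sketched for a contiguous row-major tensor. The functions narrow_byte_ranges and allows_alias are hypothetical and are not the C++ planner's API; the sketch assumes a contiguous row-major layout and models only the contiguity half of the alias check, not segment alignment.

```python
from math import prod


def narrow_byte_ranges(shape, itemsize, dim, start, length, base_offset=0):
    """Byte ranges (offset, size) selected by narrow(dim, start, length).

    Assumes a contiguous row-major tensor stored at base_offset.
    """
    if not 0 <= dim < len(shape):
        raise ValueError("dim out of range")
    if start < 0 or length <= 0 or start + length > shape[dim]:
        raise ValueError("narrow out of bounds")
    outer = prod(shape[:dim])                       # number of strided runs
    inner_bytes = prod(shape[dim + 1:]) * itemsize  # bytes per index along dim
    size = length * inner_bytes
    ranges = []
    for o in range(outer):
        offset = base_offset + (o * shape[dim] + start) * inner_bytes
        if ranges and ranges[-1][0] + ranges[-1][1] == offset:
            # Merge runs that touch, e.g. when the slice spans the whole dim.
            ranges[-1] = (ranges[-1][0], ranges[-1][1] + size)
        else:
            ranges.append((offset, size))
    return ranges


def allows_alias(ranges):
    """Contiguity check only; segment alignment is not modelled."""
    return len(ranges) == 1
```

For a (4, 6) float32 tensor, narrowing rows 1..2 along dim 0 gives one range of 48 bytes at offset 24, while narrowing columns 2..4 along dim 1 gives four 12-byte ranges, so only the first selection is a candidate for a zero-copy alias in this model.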
- Join/Lease semantics for duplicates: when existed=true, the daemon also joins a lightweight reference for the caller’s PID. If a TTL was provided at BeginRegisterArtifact, KeepAliveRegisterArtifact can extend the TTL, and the unified SessionLifecycleTask drops the joined reference when the TTL expires. This mirrors the lifecycle of a self-created replica.
Client Reuse & Resiliency¶
- The Python SDK establishes a single gRPC client per process during tensorcast.init(mode=...); all subsequent API calls reuse the same Store session obtained via tensorcast.store(). This ensures every helper targets the same daemon endpoint and prevents accidental cross-daemon usage.
- The underlying client enables gRPC keepalive and performs a light retry with channel refresh on transient errors (UNAVAILABLE, INTERNAL, UNKNOWN, DEADLINE_EXCEEDED).
- In registration flows, RegisteredArtifact holds a cached client for its lifetime (keepalive thread, commit/abort/revoke, and feed helpers reuse the same channel).
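The light retry with channel refresh can be sketched without a gRPC dependency. RpcError, call_with_retry, and the default of two attempts are assumptions for illustration; only the set of transient status codes comes from the text.

```python
TRANSIENT = frozenset({"UNAVAILABLE", "INTERNAL", "UNKNOWN", "DEADLINE_EXCEEDED"})


class RpcError(Exception):
    """Hypothetical error type carrying a gRPC status code name."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def call_with_retry(make_channel, rpc, max_attempts=2):
    """Call rpc(channel); on a transient code, rebuild the channel and retry."""
    channel = make_channel()
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc(channel)
        except RpcError as err:
            # Non-transient codes and the final attempt propagate unchanged.
            if err.code not in TRANSIENT or attempt == max_attempts:
                raise
            channel = make_channel()  # channel refresh before the next attempt
```

Non-transient codes such as PERMISSION_DENIED are raised on the first attempt, so the retry never masks a genuine request error.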