Disk Load Strategy¶
This document describes the current load-from-disk strategy for TensorCast-backed
model loading, with emphasis on:
- TP-aware rank-local slicing in the current vllm TensorCast loader,
- local SSD versus shared filesystem behavior,
- daemon-side executor selection under the converged 0108 strategy plane,
- the difference between ordinary tensor_dict startup and mapped into-target paths.
Related docs:
- docs/internals/model-loading.md
- docs/architecture/api/materialization-flow.md
- docs/internals/byte-range-mapping-and-execution.md
- docs/designs/0108-tensor-aware-materialization-strategy-plane.md
Scope¶
In scope:
- Ordinary disk-backed startup through Artifact.tensor_dict(...).
- TP-aware request shaping before materialization.
- Explicit separation between retrieval policy and execution-topology context.
- Common-runtime strategy planning for generic, local-batched, and collective executors.
- Disk-backed MaterializeIntoTarget / mapped binding strategy.
Out of scope:
- P2P-first loads from remote replicas.
- Serving-artifact startup and reload-by-swap beyond their interaction with disk fallback.
- Canonical index and byte-range compiler details beyond what is needed to explain strategy choice.
Layered View¶
The current disk strategy is split into four boundaries:
- Integration-side request shaping decides what the current rank needs and whether to attach explicit same-host collective context.
- Daemon normalization lowers retrieval policy and execution-topology context separately into one internal request object.
- Common runtime builds one ExecutionStrategyPlan from:
  - resolved semantic/view truth,
  - acquired source capabilities,
  - execution-topology and locality facts,
  - typed engine.materialization_strategy policy.
- Replica/runtime execution consumes the selected plan; it is no longer the architecture owner of AUTO.
This is the important post-0108 ownership split:
- integrations still own TP-local selection shaping,
- daemon normalization owns policy/topology separation,
- MaterializationFacade owns ordinary disk executor candidacy and selection,
- Replica / ReplicaLoadController execute the selected plan and emit commit diagnostics.
Important boundaries:
- the generic TensorCast SDK does not infer collective mode from ambient environment,
- collective_load_group, source locality, and source-sharing hints are execution-topology facts, not retrieval policy,
- executor-private work items do not travel through SDK or proto request surfaces,
- ordinary disk startup and mapped-target execution now use the same strategy family contracts.
Typed Daemon Defaults¶
The strategy entry point is daemon config, not ad-hoc environment variables.
The example below matches the generic repository sample config rather than the
audited vllm packaged serving config:
engine:
  materialization_strategy:
    enable_local_batched_disk_load: true
    enable_tensor_aware_mapped_executor: true
    allow_mixed_execution: true
    enable_owner_file_collective: false
    executor_preference: MATERIALIZATION_STRATEGY_EXECUTOR_PREFERENCE_AUTO
    owner_file_collective_peak_bytes_budget: 8GB
    owner_file_collective_batch_bytes: 512MB
    owner_file_collective_dim1_staging_bytes: 256MB
    owner_file_collective_max_inflight_batches: 1
    owner_file_collective_shared_fs_only: true
    owner_file_collective_max_owner_skew_ratio: 1.5
    owner_file_collective_min_dedup_saving_bytes: 512MB
    owner_file_collective_group_assemble_timeout: 15s
    owner_file_collective_allow_mixed_residual: false
    owner_file_collective_planner_cache_entries: 256
See examples/config/store_daemon_config.yaml.
Operationally, this means:
- local-batched disk load is enabled by default,
- tensor-aware mapped execution remains available for eligible mapped paths,
- owner-file collective now uses a bounded batched executor when selected,
- eager owner-file preload and root whole-source preload are no longer part of the selected collective steady path.
End-To-End Decision Tree For Ordinary Disk Startup¶
The main startup path is ordinary disk-backed tensor_dict(...), not mapped
runtime binding.
flowchart TD
A["TensorCast loader startup"] --> B{"explicit source artifact key?"}
B -- "yes" --> B1["not ordinary disk path\nuse artifact-key source"]
B -- "no" --> C["ordinary disk path\nartifact_ref = disk:<hf_folder>"]
C --> D["build TP-local trace plan\nmaterialize_names + tensorcast_slices"]
D --> E{"TP world size > 1?"}
E -- "no" --> F["no collective hint"]
E -- "yes" --> G{"integration collective policy\nenabled?"}
G -- "no" --> F
G -- "yes" --> H["attach explicit CollectiveLoadGroup hint"]
F --> I["daemon disk materialization"]
H --> I
I --> J["daemon normalization<br>retrieval policy + execution topology"]
J --> K["common runtime<br>ExecutionEnvironmentFacts"]
K --> L["MaterializationFacade<br>ExecutionStrategyPlan"]
L --> M{"selected executor"}
M -- "TensorBatchedLocalExecutor" --> N["local_batched_disk_load"]
M -- "OwnerFileCollectiveExecutor" --> O["collective_disk_load batched owner-file path"]
M -- "GenericByteRangeExecutor" --> P["generic byte-range / mapped fallback"]
Step 1: TP-Aware Request Shaping In The vllm Loader¶
The current vllm TensorCast loader does not send a full-model load request per rank. Instead, each rank first traces its own checkpoint access pattern and builds a rank-local plan:
- materialize_names: the checkpoint tensors this rank needs,
- tensorcast_slices: the rank-local slice hull used to build the view.
The disk request is then:
artifact.subset(materialize_names).view(slices=tensorcast_slices).tensor_dict(...)
This is the first TP-specific optimization. It reduces disk work before the daemon chooses any executor.
This step is integration-owned, not a generic TensorCast SDK behavior. The
current implementation lives in the vllm loader, which traces
model.load_weights(...), builds TracePlan, derives tensorcast_slices, and
caches the result per TP rank and world size.
Important consequences:
- TP handling starts in the loader, not only in the daemon.
- The trace cache is per tp_rank and tp_world_size, so each rank carries its own stable trace-plan cache entry.
- TP=1 naturally degenerates to a non-collective single-rank request.
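The rank-local shaping described in this step can be illustrated with a minimal sketch. It does not reproduce the vllm loader, which derives its plan by tracing model.load_weights(...); here the partition dimension of each tensor is simply declared up front. The CHECKPOINT table, rank_local_slice, and build_rank_plan are hypothetical names introduced for illustration, and the sketch assumes evenly divisible partitions.

```python
from functools import lru_cache
from typing import Dict, Optional, Tuple

# Hypothetical checkpoint description: name -> (shape, partition dim or None).
# None means the tensor is replicated on every rank.
CHECKPOINT: Dict[str, Tuple[Tuple[int, ...], Optional[int]]] = {
    "embed.weight": ((8, 4), 0),
    "mlp.down.weight": ((4, 8), 1),
    "norm.weight": ((4,), None),
}


def rank_local_slice(shape, partition_dim, tp_rank, tp_world_size):
    """Return the per-dimension (start, stop) ranges this rank needs."""
    ranges = [(0, size) for size in shape]
    if partition_dim is not None:
        size = shape[partition_dim]
        if size % tp_world_size != 0:
            raise ValueError("sketch assumes evenly divisible partitions")
        chunk = size // tp_world_size
        ranges[partition_dim] = (tp_rank * chunk, (tp_rank + 1) * chunk)
    return tuple(ranges)


@lru_cache(maxsize=None)
def build_rank_plan(tp_rank: int, tp_world_size: int):
    """Build (materialize_names, slices) once per (tp_rank, tp_world_size)."""
    names = tuple(sorted(CHECKPOINT))
    slices = tuple(
        (name, rank_local_slice(*CHECKPOINT[name], tp_rank, tp_world_size))
        for name in names
    )
    return names, slices
```

With tp_world_size = 1 every range covers the full dimension, which matches the observation that TP=1 degenerates to an ordinary single-rank request. The lru_cache key mirrors the per-rank, per-world-size cache described above.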
Step 2: Integration-Level Collective Hinting¶
For ordinary disk:<model_path> startup, collective remains an explicit
TensorCast API contract. The generic SDK does not attach a collective hint on
its own.
The current vllm loader may synthesize that explicit
CollectiveLoadGroup on behalf of the caller by constructing
CallContext.collective, but this is an integration policy above the API
boundary rather than a TensorCast-core SDK default.
Current rule:
- local filesystems:
  - xfs, ext2, ext3, ext4, btrfs, zfs, tmpfs
  - default to collective = false
- anything else:
  - currently also defaults to collective = false
  - reason: integration-side collective hint rollout remains conservative until the shared-FS benchmark and serving matrix is fully recaptured
- explicit overrides remain available, and once the request carries explicit shared-source proof the daemon-side owner-file batched executor is now the preferred collective route
This is not a hard-coded JuiceFS or JFS special case by name. The current
vllm loader treats them through filesystem type detection:
- host-local SSD paths usually resolve to one of the local filesystem types and therefore default to non-collective startup,
- JuiceFS/JFS/shared filesystems usually do not fall into that local whitelist, but ordinary disk-backed startup is currently still kept on the non-collective path by default as a temporary regression mitigation,
- if mount type detection fails, the code currently also biases toward non-collective.
Current vllm behavior therefore is:
- local filesystems default to no collective hint,
- shared/non-local filesystems also default to no collective hint today,
- explicit integration override remains available for collective experiments.
This choice only affects whether the request carries a collective hint. The daemon still performs a second-stage eligibility check before it actually uses a collective executor.
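The integration-side rule above can be summarized as a small decision function. This is a sketch of the policy as described in this section, not the loader's actual code: should_attach_collective_hint, its parameters, and the representation of a failed mount-type detection as None are assumptions made for illustration.

```python
from typing import Optional

# Filesystem types this document lists as "local" for the vllm loader.
LOCAL_FS_TYPES = frozenset(
    {"xfs", "ext2", "ext3", "ext4", "btrfs", "zfs", "tmpfs"}
)


def should_attach_collective_hint(
    fs_type: Optional[str],
    tp_world_size: int,
    explicit_override: Optional[bool] = None,
) -> bool:
    """Decide whether the request should carry a collective hint."""
    if tp_world_size <= 1:
        return False  # a single rank has nothing to cooperate with
    if explicit_override is not None:
        return explicit_override  # explicit integration override wins
    if fs_type is None:
        return False  # mount type detection failed: bias to non-collective
    if fs_type in LOCAL_FS_TYPES:
        return False  # host-local media default to non-collective
    # Shared / non-local filesystems also default to non-collective today.
    return False
```

The last two branches return the same value on purpose: they are kept separate so that a future change to the shared-filesystem default would touch only one line. Even when this function returns True, the daemon still applies its own eligibility check.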
Why Local SSD And Shared Filesystems Split Here¶
The split exists because the best-known disk strategy differs by storage class:
- on host-local SSD, direct per-rank reading plus local batching is usually better than coordinating a TP collective read,
- on shared filesystems, letting TP ranks cooperate often reduces redundant IO and improves effective source-side locality.
This is why the common TensorCast runtime no longer uses ambient environment as the control surface for collective selection. The current intended behavior is:
- keep the generic SDK contract explicit,
- allow integrations to synthesize explicit per-call hints when they have the right TP topology context,
- keep daemon-side executor policy centralized in typed strategy config,
- avoid forcing shared-fs collective from ambient loader heuristics until the benchmark and serving rollout gates are fully recaptured.
Step 3: Common-Runtime Strategy Planning For tensor_dict¶
Once the disk request reaches the daemon, ordinary executor selection is owned by common runtime rather than replica-layer branch order.
MaterializationFacade now builds one ExecutionStrategyPlan for ordinary
GPU <- DISK startup and evaluates three candidates:
- GenericByteRangeExecutor
- TensorBatchedLocalExecutor
- OwnerFileCollectiveExecutor
The chosen executor is logged together with selection reason, rejected
candidates, residual bytes, and commit accounting. Replica and
ReplicaLoadController consume that selected plan; they do not own AUTO.
Collective Eligibility¶
OwnerFileCollectiveExecutor is only considered when all of the following hold:
- target is GPU <- DISK,
- the request carries a collective_load_group,
- execution-topology locality policy does not reject collective for host-local media,
- canonical_index_json is available,
- source_index_json is available,
- variant_identity is available,
- the requested view does not require extra server-side materialization transform,
- the source is backed by a DiskLoader with usable shared context,
- typed budget and threshold checks pass.
If any of these fail, the daemon logs why collective was skipped and continues to the next candidate.
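The "all conditions must hold, and every failure is logged" behavior can be sketched as a predicate that returns the list of rejection reasons. RequestFacts and its field names are hypothetical stand-ins for the conditions listed above; the daemon's real data structures and log wording are not shown in this document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestFacts:
    # Hypothetical flags, one per eligibility condition in the list above.
    target_is_gpu_from_disk: bool = True
    has_collective_load_group: bool = True
    locality_allows_collective: bool = True
    has_canonical_index_json: bool = True
    has_source_index_json: bool = True
    has_variant_identity: bool = True
    needs_server_side_transform: bool = False
    has_disk_loader_shared_context: bool = True
    budget_checks_pass: bool = True


def collective_rejection_reasons(facts: RequestFacts) -> List[str]:
    """Return every failed condition; an empty list means eligible."""
    checks = [
        (facts.target_is_gpu_from_disk, "target is not GPU <- DISK"),
        (facts.has_collective_load_group, "no collective_load_group"),
        (facts.locality_allows_collective, "locality policy rejects collective"),
        (facts.has_canonical_index_json, "canonical_index_json missing"),
        (facts.has_source_index_json, "source_index_json missing"),
        (facts.has_variant_identity, "variant_identity missing"),
        (not facts.needs_server_side_transform, "view needs server-side transform"),
        (facts.has_disk_loader_shared_context, "no DiskLoader shared context"),
        (facts.budget_checks_pass, "budget or threshold check failed"),
    ]
    return [reason for ok, reason in checks if not ok]
```

Collecting all reasons instead of stopping at the first failure is a design choice of this sketch; it makes the "why was collective skipped" log line complete in one pass.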
Local-Batched Eligibility¶
TensorBatchedLocalExecutor is considered when:
- target is GPU,
- canonical_index_json is available,
- source_index_json is available,
- variant_identity is available,
- DiskLoader shared context is available,
- enable_local_batched_disk_load = true,
- no residual/fill-only/server-transform constraint forces generic fallback.
If the planner rejects it, the request falls through to generic execution with an explicit selection reason.
Under AUTO, local-batched is now also cost-gated against the exact generic
path:
- the daemon keeps host-local requests on GenericByteRangeExecutor when the local-batched planner estimate would read more unique source bytes than the exact generic fallback,
- this currently protects dim1-staged host-local views whose row-block staging would amplify source IO beyond the rank-local slice coverage,
- explicit executor_preference = TENSOR_AWARE_LOCAL can still force the local executor for targeted experiments.
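To make the cost gate concrete, the sketch below compares two byte estimates for a dim1-partitioned tensor. It assumes, purely for illustration, that row-block staging reads whole rows while the exact generic path reads only the columns the rank needs; the planner's real estimate is not specified in this document. dim1_estimates and choose_local_executor are hypothetical helper names.

```python
def dim1_estimates(rows, cols, col_start, col_stop, itemsize):
    """Return (exact_generic_bytes, row_block_staged_bytes) for one tensor."""
    exact = rows * (col_stop - col_start) * itemsize
    # Assumption: staging pulls in complete rows before extracting columns.
    staged = rows * cols * itemsize
    return exact, staged


def choose_local_executor(
    local_batched_unique_bytes: int,
    generic_unique_bytes: int,
    executor_preference: str = "AUTO",
) -> str:
    """Apply the AUTO cost gate between local-batched and generic."""
    if executor_preference == "TENSOR_AWARE_LOCAL":
        return "TensorBatchedLocalExecutor"  # explicit override skips the gate
    if local_batched_unique_bytes > generic_unique_bytes:
        return "GenericByteRangeExecutor"  # local plan would amplify source IO
    return "TensorBatchedLocalExecutor"
```

For a 4096 x 8192 tensor of 2-byte elements where a rank needs one eighth of the columns, the staged estimate is eight times the exact one, so the gate keeps the request on the generic path unless the preference forces the local executor.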
Important current invariant:
- the same local-batched admission summary is now used by both ordinary strategy planning and executor execution,
- unsupported local tensor-job shapes are rejected at planning time,
- and selecting TensorBatchedLocalExecutor no longer relies on a late replica-layer "try local then silently hand off to generic" fallback.
Generic Fallback¶
GenericByteRangeExecutor remains the correctness backstop:
- canonical index and source index are composed when available,
- view selection is composed if needed,
- the request is lowered through ByteRangeMappedSource,
- transfer runs through the generic pump path.
It is still the dominant fallback IR and explainability surface, but it is no
longer the architecture owner of ordinary disk AUTO.
Internal Shape Of local_batched_disk_load¶
local_batched_disk_load is still tensor-aware, but it is single-rank and
local. It splits disk jobs into:
- replicated jobs,
- dim0 partitioned jobs,
- dim1 partitioned jobs.
flowchart TD
A["local_batched_disk_load"] --> B["build tensor jobs"]
B --> C["replicated + dim0\n-> direct segments -> batched pump"]
B --> D["dim1\n-> execute_local_dim1_jobs"]
Operationally:
- replicated and dim0-partitioned regions are lowered into direct source segments and pumped efficiently,
- identical direct source slices are now deduplicated into one primary source read plus GPU D2D reuse when possible,
- dim1-partitioned tensors use a dedicated execution path,
- the daemon emits local_batched_disk_load timings with job counts and timing breakdowns for verification.
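The deduplication of identical direct source slices can be sketched as a grouping step: each distinct source segment is read once into a primary destination, and the remaining destinations become device-to-device copies. The Segment tuple layout and dedup_direct_segments are assumptions made for illustration; the sketch only plans the work and performs no IO or GPU copies.

```python
from typing import Dict, List, Tuple

Segment = Tuple[str, int, int]  # (source file, byte offset, byte length)


def dedup_direct_segments(jobs: List[Tuple[str, Segment]]):
    """Plan one source read per distinct segment plus D2D reuse copies.

    jobs is a list of (destination name, source segment) pairs.
    Returns (reads, copies): reads are (segment, primary destination),
    copies are (primary destination, other destination).
    """
    grouped: Dict[Segment, List[str]] = {}
    for dest, segment in jobs:
        grouped.setdefault(segment, []).append(dest)

    reads, copies = [], []
    for segment, dests in grouped.items():  # dicts keep insertion order
        primary, *others = dests
        reads.append((segment, primary))
        copies.extend((primary, other) for other in others)
    return reads, copies
```

In this sketch two destinations that reference the same (file, offset, length) cost one disk read and one copy instead of two disk reads.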
Internal Shape Of collective_disk_load¶
collective_disk_load is the executor runtime used by the current same-host
collective path. When owner-file collective is selected, the ranks form a
same-host clique, build weighted owner batches, and execute tensor jobs
cooperatively.
flowchart TD
A["collective_disk_load"] --> B{"owner-file collective enabled?"}
B -- "yes" --> C["batched owner-file path\nweighted ownership + bounded staging"]
B -- "no" --> D["request is not admitted to collective executor"]
C --> E["peer distribution via NCCL send/recv or broadcast"]
Current defaults and caveats:
- enable_owner_file_collective exists,
- the generic sample config shown above keeps it false, while the audited vllm packaged serving config may enable it for the same-binding mounted path,
- owner-file batch planning is now the steady-state owner collective path,
- once the owner-file collective route is selected, planner failure no longer drops the request back to legacy whole-source collective scaffolding,
- group-assemble timeout is now typed config,
- the current owner-file rollout remains zero-residual-only by explicit policy unless owner_file_collective_allow_mixed_residual=true,
- requests with residual fallback bytes therefore stay on local-batched or generic execution by default.
As with local-batched, collective execution is still tensor-aware:
- replicated tensors,
- dim0-partitioned tensors,
- dim1-partitioned tensors,
- and mapped-target rect2d tensor slices when the local typed executor owns the destination shard
are handled separately with different execution routines.
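The idea of weighted owner batches with bounded staging can be sketched as follows. This is one plausible planning scheme, not the daemon's planner: it assigns source files to owner ranks greedily by size, measures skew as the maximum owner load divided by the mean load (an assumed reading of owner_file_collective_max_owner_skew_ratio), and splits each owner's files into batches capped by a byte limit. All function names are hypothetical.

```python
import heapq
from typing import Dict, List


def assign_owners(file_sizes: Dict[str, int], num_ranks: int) -> Dict[int, List[str]]:
    """Give each file to the currently least-loaded rank, largest files first."""
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    owners: Dict[int, List[str]] = {rank: [] for rank in range(num_ranks)}
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, rank = heapq.heappop(heap)
        owners[rank].append(name)
        heapq.heappush(heap, (load + size, rank))
    return owners


def owner_skew_ratio(file_sizes: Dict[str, int], owners: Dict[int, List[str]]) -> float:
    """Assumed definition: max owner load divided by mean owner load."""
    loads = [sum(file_sizes[n] for n in names) for names in owners.values()]
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 1.0


def split_into_batches(names: List[str], file_sizes: Dict[str, int], batch_bytes: int):
    """Group files into batches of at most batch_bytes.

    A single file larger than batch_bytes still gets its own batch.
    """
    batches, current, current_bytes = [], [], 0
    for name in names:
        size = file_sizes[name]
        if current and current_bytes + size > batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

A planner built this way could reject the collective route when owner_skew_ratio exceeds the configured limit, and would keep at most a bounded number of batches in flight so that temporary staging stays under the peak byte budget.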
Filesystem-Specific Summary¶
Host-Local SSD¶
Expected behavior:
- no collective hint from the loader,
- TP ranks still perform rank-local subset + slice shaping,
- daemon usually lands on local_batched_disk_load for direct, dim0, and dedup-friendly requests,
- daemon stays on GenericByteRangeExecutor when the admissible local-batched plan would amplify host-local source IO beyond the exact generic path,
- generic fallback is only the residual safety path.
This is the current best-known default path for ordinary host-local startup.
JuiceFS / JFS / Shared Filesystems¶
Expected behavior:
- current vllm loader still defaults to no collective hint,
- daemon therefore usually stays on the non-collective local-batched or generic path,
- explicit collective experiments can still attach a group and exercise collective_disk_load when eligibility is satisfied.
These filesystems are currently handled by storage-class inference rather than by a product-specific special-case table. Even so, the current default policy still biases toward non-collective startup.
Mapped into-target / bind_into Disk Path¶
Mapped disk execution follows a related but different strategy tree. This is
used by MaterializeIntoTarget, bind(...), and bind_into(...) flows.
flowchart TD
A["disk-backed mapped into-target request"] --> B{"disk + collective hint + representation work plan?"}
B -- "yes" --> C["try collective mapped executor"]
B -- "no" --> D{"source-ordered path available?"}
D -- "yes" --> E["SourceOrderedMappedTargetExecutor"]
D -- "no" --> F["MappedTargetStreamingExecutor"]
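The flowchart above can be read as a plain function. The sketch returns only the first candidate the tree selects; it does not model what happens when the collective attempt is tried and then declined, because the flowchart does not show that branch. choose_mapped_executor and its parameter names are hypothetical.

```python
def choose_mapped_executor(
    is_disk: bool,
    has_collective_hint: bool,
    has_representation_work_plan: bool,
    source_ordered_available: bool,
) -> str:
    """Return the first executor candidate for a mapped into-target request."""
    if is_disk and has_collective_hint and has_representation_work_plan:
        return "collective mapped executor"
    if source_ordered_available:
        return "SourceOrderedMappedTargetExecutor"
    return "MappedTargetStreamingExecutor"
```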
The key differences from ordinary tensor_dict startup are:
- the request already arrives as a mapped target layout,
- the runtime considers source-ordered execution using source layout metadata,
- collective mapped execution is only possible when a derived representation work plan and collective group are both present,
- the overlap/dedup gate admits only replicated source-overlap work to the owner-file collective lane,
- same-host disjoint TP shards are routed to the local tensor-aware mapped executor when the work plan exposes enough tensor metadata,
- that local executor handles full/dim0/dim1 jobs plus 2D rectangular source/destination slices, and subtracts its covered destination ranges before building any byte-range residual,
- any remaining residual executor is still the byte-range fallback path, but it now reports through typed source-bound executor names rather than the generic replica-path label.
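The step in which the local executor "subtracts its covered destination ranges before building any byte-range residual" is an interval subtraction. A minimal sketch, assuming half-open destination byte ranges and using the hypothetical helper name subtract_ranges:

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, stop) destination byte range


def subtract_ranges(requested: List[Range], covered: List[Range]) -> List[Range]:
    """Return the parts of requested that no covered range overlaps."""
    residual: List[Range] = []
    covered = sorted(covered)
    for start, stop in sorted(requested):
        cursor = start
        for c_start, c_stop in covered:
            if c_stop <= cursor:
                continue  # covered range lies entirely before the cursor
            if c_start >= stop:
                break  # covered range lies entirely after this request
            if c_start > cursor:
                residual.append((cursor, c_start))  # uncovered gap
            cursor = max(cursor, c_stop)
            if cursor >= stop:
                break
        if cursor < stop:
            residual.append((cursor, stop))
    return residual
```

When the tensor-aware executor covers every requested destination byte, the result is empty and no byte-range residual needs to be built, which is the situation the closure profile below reports as actual_generic_backend_bytes=0.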
The anonymized Example TP Model TP8 same-host closure profile is the reference
for this mixed mapped strategy: 137 local rect2d jobs absorbed the previous
873,562,112 byte residual tail, actual_generic_backend_bytes=0, and ready
time improved to 136.271s.
Runtime Binding And Ordinary Disk Startup¶
Ordinary disk startup should not be confused with serving-artifact runtime binding.
Current rule:
- runtime initialization settings do not by themselves turn ordinary disk:<model_path> startup into the preferred mapped startup path,
- startup binding is attempted only through an explicit serving artifact locator, retained serving binding authority, or local source bootstrap,
- otherwise ordinary disk startup stays on the tensor_dict strategy described above.
This guard exists because disk-backed mapped source binding is not currently the best-known startup path for plain local checkpoint directories.
Practical Verification Checklist¶
When validating which disk path actually executed, use the following signals.
For host-local SSD:
- loader should not attach a collective hint,
- daemon should emit local_batched_disk_load timings,
- materialization diagnostics should not show source=local_replica unless the request was intentionally satisfied from an already-local source.
For shared filesystem startup:
- do not expect a collective group by default today,
- if you explicitly enable collective experiments, then the loader should attach a collective group,
- in that explicit-collective case, the daemon should log collective eligibility and successful runs should emit collective_disk_load timings.
For mapped into-target requests:
- inspect materialize_mapped_into_target execution_commit,
- check:
  - collective_handled,
  - direct_write_supported,
  - source_ordered,
  - dominant_executor.
- when collective is used, also inspect typed execution diagnostics for:
  - collective_unique_source_bytes
  - collective_peer_transfer_bytes
  - collective_peak_temporary_bytes
  - collective_batch_count
  - collective_dedup_saving_bytes
Current Intended Defaults¶
The intended operator-facing behavior today is:
- ordinary host-local disk startup:
  - use tensor_dict path,
  - rely on rank-local trace slicing,
  - let the daemon select local-batched disk load only when it does not lose to the exact generic path on source-byte cost,
  - otherwise keep AUTO on exact generic execution and treat local-batched as an explicit experiment-only override for that workload,
- shared filesystem startup:
  - still default to non-collective startup today,
  - reserve collective for explicit experiments until owner-file collective is production-ready,
  - let the daemon decide whether collective eligibility is satisfied once a collective hint is actually present,
- mapped runtime binding:
  - reserve for explicit serving/source-artifact workflows, not plain disk startup.
This gives TensorCast a single consistent strategy stack:
- TP-aware request shaping in the loader,
- integration-owned explicit collective hint synthesis,
- typed daemon strategy selection,
- byte-range fallback only for residual coverage.