Steptron and vLLM Binding Integration

Goal

Describe the concrete landing path for the binding work from:

docs/designs/0084-binding-unified-model-and-contract.md
docs/designs/0085-distributed-binding-assembly-and-coordinator.md

This guide is intentionally grounded in the current steptron training code and the current vllm TensorCast runtime so the design can be mapped to real integration points instead of staying abstract.

Repository roots referenced below:

steptron: /opt/example-framework
vllm: /opt/vllm

Current Steptron Training Path

The current steptron path for the referenced PPO experiment is:

experiment setup resolves the HF bootstrap path in playground/users/ys/step4air/rl/step4air_toy_fully_async_dual8_h800.py (_resolve_init_hf_path around line 248, Exp.__init__ around line 693).
PPOTrainer.before_train() calls self.load_checkpoint() and then creates PackedModel(...) in steptron/core/ppo_trainer.py around lines 147-162.
PackedModel.setup_model() constructs model chunks through model_config.build_model() and then moves them onto CUDA in steptron/core/multi_model.py around lines 122-167.
load_model_checkpoint() dispatches to model.load_hf_state_dict(...) in steptron/core/utils.py around lines 247-269.
load_hf_state_dict() performs reshaper.forward(...) and then load_state_dict(...) in steptron/model/module.py around lines 290-305.

Important current properties:

model construction happens before HF weights are loaded
@init_weight_callback recursively runs init_model_weight() immediately after model construction in steptron/model/utils.py around lines 130-146
tensor-parallel layers allocate real parameter storage in their constructors unless told otherwise
TP layers already expose weight_memory_loc hooks in steptron/core/tensor_parallel/layers.py around lines 1205-1277 and 1469-1542

Current vllm TensorCast Path

The current vllm TensorCast runtime already has the serving-side consumer model we want to preserve:

tensorcast_loader.py can:
    load directly from a serving artifact with artifact.bind(...)
    or bootstrap a local/source input into a serving artifact and then bind it
    or acquire a retained serving binding through explicit retained authority
    and later reload through binding.swap(...)
gpu_model_runner.py consumes the dedicated TensorCast serving reload path using ServingArtifactLocator and manifest runtime policy
api_server.py gates /reload_serving_artifact with drain + serialized reload, while /set_model_weight is rejected for load_format="tensorcast"

Concrete current anchors:

direct serving-artifact startup path in vllm/vllm/model_executor/model_loader/tensorcast_loader.py (TensorcastModelLoader.load_model(...))
retained binding acquire path in vllm/vllm/tensorcast/retained_binding.py
local bootstrap path in vllm/vllm/model_executor/model_loader/tensorcast_builder/local_dir_prepare.py
reload-by-swap path in vllm/vllm/model_executor/model_loader/tensorcast_loader.py
HTTP drain + serialized serving reload in vllm/vllm/entrypoints/openai/api_server.py

This means the integration goal is not to redesign vLLM reload semantics. It is to make training publish the right versions in the right representation.

TensorCast-Side Readiness And Diagnostics Contract

The current TensorCast-side source-bound readiness surface for this integration has one active cut point:

source_bound_contract_version >= 4
    same-binding Binding.realize_from(...) / Store.realize_into_binding(...) are execution-only ingress points that return BindingUpdateEpoch rather than a sealed current value;
    typed hash diagnostics on the surviving seal path are part of the stable downstream contract;
    downstream builder code must gate on this version before depending on the update_epoch response shape.

The public SDK surfaces to consume for that cut point are:

binding.last_execution_diagnostics: actual execution facts such as actual_collective_committed_bytes, actual_local_typed_bytes, actual_generic_backend_bytes, collective_failure_class, dominant_executor, and direct_write_supported
binding.last_source_bound_plan_diagnostics: planner facts such as execution_plan_kind, planned_collective_candidate_bytes, planned_collective_admitted_bytes, planned_local_typed_bytes, planned_non_admitted_typed_bytes, planned_generic_residual_bytes, planner_reject_reason_buckets, planner_version, and plan_hash

Downstream integration summaries should read those typed surfaces instead of parsing daemon logs or relying on repo-version heuristics.
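
A downstream builder might gate on the contract version and collect the typed diagnostics roughly as follows. This is only a sketch: the field names and the version threshold come from this guide, but how source_bound_contract_version is exposed to callers, the mapping-style access to the two diagnostics surfaces, and the helper names require_update_epoch_contract and summarize_diagnostics are all assumptions made for illustration.

```python
from typing import Any, Dict, Mapping

# Cut point named in this guide.
MIN_SOURCE_BOUND_CONTRACT_VERSION = 4

# Field names as listed in this guide. Treating the diagnostics as
# mappings is an assumption, not a documented SDK shape.
EXECUTION_FIELDS = (
    "actual_collective_committed_bytes",
    "actual_local_typed_bytes",
    "actual_generic_backend_bytes",
    "collective_failure_class",
    "dominant_executor",
    "direct_write_supported",
)
PLAN_FIELDS = (
    "execution_plan_kind",
    "planned_collective_candidate_bytes",
    "planned_collective_admitted_bytes",
    "planned_local_typed_bytes",
    "planned_non_admitted_typed_bytes",
    "planned_generic_residual_bytes",
    "planner_reject_reason_buckets",
    "planner_version",
    "plan_hash",
)


def require_update_epoch_contract(version: int) -> None:
    """Hypothetical gate: refuse to rely on the update_epoch shape too early."""
    if version < MIN_SOURCE_BOUND_CONTRACT_VERSION:
        raise RuntimeError(
            f"source_bound_contract_version={version} is below "
            f"{MIN_SOURCE_BOUND_CONTRACT_VERSION}; update_epoch shape not guaranteed"
        )


def summarize_diagnostics(
    execution: Mapping[str, Any], plan: Mapping[str, Any]
) -> Dict[str, Dict[str, Any]]:
    """Copy only the named typed fields; absent fields become None."""
    return {
        "execution": {name: execution.get(name) for name in EXECUTION_FIELDS},
        "plan": {name: plan.get(name) for name in PLAN_FIELDS},
    }
```

Restricting the summary to a fixed field list keeps it stable if the diagnostics objects later grow additional attributes.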

Integration Principle

The binding work should land around one simple separation:

steptron owns framework construction, optimizer execution, and local weight mutation
TensorCast owns stable weight location lifecycle, sealing, publishability, and distributed assembly
vllm continues to consume published serving versions through the existing runtime binding path

Local Binding Landing in Steptron

Recommended Initialization Shape

Use a layout-first initialization path.

The target shape is:

do a planning build that determines the full local weight layout
create one local contiguous binding slab for all local model weights
attach model parameters to views of that slab
load HF weights directly into that final storage

This is preferable to:

allocating one normal model weight graph first and adopting it later
or forcing a framework-side gather/dump path before publish

Why Planning Is Needed

The binding design assumes one coalesced local slab. That requires knowing the full local parameter layout before final allocation.

In steptron, a naive online storage provider is not enough because:

PackedModel.setup_model() loops over all local model chunks
init_weight_callback runs immediately after construction
TP layers allocate storage in constructors

The provider therefore must be driven by a full local layout plan, not by incremental “give me the next tensor” allocation.
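
The difference between plan-first and incremental allocation can be illustrated with a toy planner. The sketch below uses only the Python standard library and byte buffers in place of CUDA tensors; TensorSpec, plan_layout, attach_views, and the 256-byte alignment are hypothetical and are not part of steptron or TensorCast. The point is that the slab size and every offset are known before any storage is allocated.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple

ALIGNMENT = 256  # hypothetical per-tensor alignment in bytes


@dataclass(frozen=True)
class TensorSpec:
    name: str
    nbytes: int


def plan_layout(
    specs: Iterable[TensorSpec], alignment: int = ALIGNMENT
) -> Tuple[Dict[str, Tuple[int, int]], int]:
    """Return ({name: (offset, nbytes)}, total_bytes) for one contiguous slab."""
    offsets: Dict[str, Tuple[int, int]] = {}
    cursor = 0
    for spec in specs:
        if spec.name in offsets:
            raise ValueError(f"duplicate tensor name: {spec.name}")
        cursor = -(-cursor // alignment) * alignment  # round up to alignment
        offsets[spec.name] = (cursor, spec.nbytes)
        cursor += spec.nbytes
    return offsets, cursor


def attach_views(slab: bytearray, offsets: Dict[str, Tuple[int, int]]):
    """Create writable views into the slab; writes land in final storage."""
    whole = memoryview(slab)
    return {name: whole[off:off + n] for name, (off, n) in offsets.items()}


specs = [
    TensorSpec("embed.weight", 1000),
    TensorSpec("layer0.weight", 300),
    TensorSpec("layer0.bias", 12),
]
offsets, total = plan_layout(specs)   # planning pass: no storage yet
slab = bytearray(total)               # one allocation for everything
views = attach_views(slab, offsets)
views["layer0.bias"][:] = b"\x01" * 12  # a "load" writes straight into the slab
```

An incremental provider would have to allocate at the first request, before total is known, which is exactly what the planning pass avoids.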

Practical Hook Points

The practical hooks in steptron today are:

PackedModel.setup_model() in steptron/core/multi_model.py
    around lines 122-167
    natural place to add a planning mode and a binding-backed build mode
model_config.build_model()
    Step4Model builder is in playground/pretrain/step4air_v4_f.py around lines 593-605
    Step4RewardModel builder is in the RL recipe file around lines 517-530
    natural framework entry to thread a planning or weight-provider object
TP layer constructors in steptron/core/tensor_parallel/layers.py
    already accept weight_memory_loc
init_weight_callback in steptron/model/utils.py
    must be disabled or made binding-aware in planning mode
load_hf_state_dict() in steptron/model/module.py
    around lines 290-305
    should continue to own HF -> steptron parameter mapping
    should write directly into binding-backed tensors

Recommended Sequence

Add a plan build mode for steptron model construction.
Use the planning build to collect the full local tensor set, shapes, dtypes, alias/tie relations, and persistent buffers.
Convert that result into BindingLayout.
Create one local Binding with one contiguous slab.
Rebuild or attach the real model to binding-backed tensors.
Run load_hf_state_dict() so reshaped HF weights land directly in the final slab.
Call seal_current(...) for the initial local sealed value once load completes.
Start an assembly attempt on the existing assembly/layout trunk and contribute that sealed value into the required contribution contract, even for the single-rank case.
For single-rank landing, use a legal contract on the trunk: canonical_full or deterministic multi-view coverage.
Do not use a binding-local full-coverage piece shortcut.
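
The ordering constraints in this sequence (fill, then seal, then contribute under a legal contract) can be mimicked with a toy stand-in. Nothing below is the TensorCast SDK: ToyBinding, contribute_single_rank, the method signatures, and the contract identifier deterministic_multi_view are hypothetical, and a SHA-256 digest merely stands in for whatever identity a sealed value carries.

```python
import hashlib

# "canonical_full" is named in this guide; the second identifier is a
# hypothetical spelling of "deterministic multi-view coverage".
LEGAL_SINGLE_RANK_CONTRACTS = frozenset({"canonical_full", "deterministic_multi_view"})


class ToyBinding:
    """Hypothetical in-memory stand-in for a local binding with one slab."""

    def __init__(self, nbytes: int) -> None:
        self._slab = bytearray(nbytes)
        self._updating = False
        self.sealed_digest = None

    def begin_update(self) -> memoryview:
        # Opening an update invalidates any previously sealed value.
        self._updating = True
        self.sealed_digest = None
        return memoryview(self._slab)

    def seal_current(self) -> str:
        if not self._updating:
            raise RuntimeError("seal_current() called without an open update")
        self._updating = False
        self.sealed_digest = hashlib.sha256(self._slab).hexdigest()
        return self.sealed_digest


def contribute_single_rank(binding: ToyBinding, contract: str) -> dict:
    """Contribute a sealed value under a legal trunk contract only."""
    if contract not in LEGAL_SINGLE_RANK_CONTRACTS:
        raise ValueError(f"contract {contract!r} is not legal on the trunk")
    if binding.sealed_digest is None:
        raise RuntimeError("only sealed values may be contributed")
    return {"contract": contract, "digest": binding.sealed_digest}
```

In this toy, an attempt to contribute before sealing, or under a shortcut contract, fails loudly instead of publishing a half-written slab.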

Distributed Publish Landing for Steptron

Phase-1 Scope

The first production landing should stay within:

PP / VPP / EP / CP
TP1
subset-only coverage

This matches the target experiment class better than a general TP > 1 transform-aware design.

What Each Rank Owns

For the current production direction:

one local process may own multiple local model chunks because PackedModel loops over vp_size
EP ownership is a tensor-subset problem once global expert ids are fixed
CP does not change weight-byte ownership materially
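
To make the EP point concrete: once global expert ids are fixed, each rank's ownership reduces to a set of tensor names. The sketch below assumes a contiguous block partition of experts across EP ranks and a made-up parameter naming scheme; steptron's actual expert placement and parameter names may differ, and both helper functions are hypothetical.

```python
from typing import List


def owned_expert_ids(num_experts: int, ep_size: int, ep_rank: int) -> range:
    """Global expert ids owned by one EP rank (assumed block partition)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    if not 0 <= ep_rank < ep_size:
        raise ValueError("ep_rank out of range")
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return range(start, start + per_rank)


def owned_tensor_names(
    layer: int, num_experts: int, ep_size: int, ep_rank: int
) -> List[str]:
    """Tensor subset for one rank; the name pattern is illustrative only."""
    return [
        f"layers.{layer}.experts.{expert}.weight"
        for expert in owned_expert_ids(num_experts, ep_size, ep_rank)
    ]
```

Because the subsets are derived from global ids, the union over all EP ranks covers every expert exactly once, which is the property a subset-only coverage contract relies on.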

The correct distributed publish shape is therefore:

one local binding per process
one global LayoutSpec for the full model family, including the expected contribution view set for layout-derived disjoint-piece contracts
one planner-derived BindingContributionPlan per local binding
one explicit snapped contribution contract per publish attempt
many local sealed-value contribute_to_assembly(...) operations
one coordinator-triggered source seal_assembly(...)
one published model-version result that carries source lineage now, with a serving-lineage extension only after typed child closeout contracts exist
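
The interaction between a snapped contribution contract, many contributions, and one coordinator-triggered seal can be sketched with a toy tracker. ToyAssemblyAttempt and its method signatures are hypothetical; the real contribute_to_assembly(...) and seal_assembly(...) calls are owned by TensorCast and are not reproduced here.

```python
from typing import Dict, Iterable, List


class ToyAssemblyAttempt:
    """Hypothetical tracker: seal only when the snapped view set is complete."""

    def __init__(self, expected_views: Iterable[str]) -> None:
        # The expected view set is snapped once, at attempt start.
        self._expected = frozenset(expected_views)
        self._contributed: Dict[str, int] = {}

    def contribute_to_assembly(self, rank: int, views: Iterable[str]) -> None:
        views = frozenset(views)
        unknown = views - self._expected
        if unknown:
            raise ValueError(f"views outside the snapped contract: {sorted(unknown)}")
        overlap = views & set(self._contributed)
        if overlap:
            raise ValueError(f"views already contributed: {sorted(overlap)}")
        for view in views:
            self._contributed[view] = rank

    def missing(self) -> frozenset:
        return self._expected - set(self._contributed)

    def seal_assembly(self) -> List[str]:
        if self.missing():
            raise RuntimeError(f"incomplete coverage: {sorted(self.missing())}")
        return sorted(self._contributed)
```

The tracker rejects both unknown views and duplicate coverage, so disjoint-piece completeness is checked against the snapshot rather than against whatever each rank restates.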

What TensorCast Must Do

TensorCast must own:

LayoutSpec capture, validation, and expected-view completeness enforcement
explicit contribution-contract capture and snapshot
local coverage capture
binding-backed contribution that compiles to the same structural view-or-piece registration trunk used by other assembly frontends
phase 1 does not require a LIP-backed piece path
distributed completion tracking
contributor liveness through the existing lease/guard/finalizer runtime
final source seal_assembly(...)
source immutable version-key publication in the current dependency-ready wave
optional source -> serving builder or publisher only in the successor wave after typed child closeout contracts exist
final serving-key or serving-manifest publication only in that successor wave

steptron should not:

gather the full model
dump per-rank fragments for later reassembly outside TensorCast
restate tensor subsets every update

Serving-Artifact Hand-off to vllm

Preferred End State

The best consumer contract remains:

steptron publishes serving-compatible versions
vllm receives those versions as a serving selector (selector.kind="version_key" or selector.kind="artifact_ref")
vllm reloads through /reload_serving_artifact and the current runtime binding path

This keeps vllm close to its current design:

startup may still use direct serving-artifact bind or bootstrap bind-into
reload stays binding.swap(...)
selector durability and unhealthy handling stay in place
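
A training-side publisher might build the reload request body along these lines. Only the two selector kinds and the /reload_serving_artifact route come from this guide; the JSON shape, the value field, and the helper name build_reload_body are assumptions, so the real request schema should be taken from api_server.py rather than from this sketch.

```python
import json

# Selector kinds named in this guide.
ALLOWED_SELECTOR_KINDS = ("version_key", "artifact_ref")


def build_reload_body(kind: str, value: str) -> str:
    """Serialize a serving selector; the body layout is hypothetical."""
    if kind not in ALLOWED_SELECTOR_KINDS:
        raise ValueError(f"unsupported selector.kind: {kind!r}")
    if not value:
        raise ValueError("selector value must be non-empty")
    return json.dumps({"selector": {"kind": kind, "value": value}}, sort_keys=True)
```

Rejecting other kinds at the publisher keeps local paths and training-local identifiers from reaching the reload endpoint.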

Representation Boundary

If the training-local weight layout is not already the final serving representation, then the integration must preserve the existing source vs serving split:

training local binding first seals a local value; TensorCast then promotes it through the assembly trunk into a source artifact
the current dependency-ready wave stops at source published lineage and returns PublishedModelVersion with serving fields unset
a later typed closeout wave may extend that same lineage with a serving artifact
vllm consumes the serving artifact

Do not force vllm back into consuming arbitrary training-local layouts at reload time.
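
One way to enforce this boundary in glue code is to refuse to derive a serving selector from a source-only result. The dataclass below is a hypothetical stand-in for PublishedModelVersion; its field names are invented for illustration, and only the idea that the serving fields stay unset in the current wave comes from this guide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyPublishedModelVersion:
    """Hypothetical stand-in; field names are illustrative only."""

    source_version_key: str
    serving_version_key: Optional[str] = None  # unset until the closeout wave


def serving_selector(version: ToyPublishedModelVersion) -> Dict[str, str]:
    """Return a selector for vllm, or fail if only source lineage exists."""
    if version.serving_version_key is None:
        raise ValueError(
            "source-only lineage: no serving artifact has been published "
            f"for {version.source_version_key!r}"
        )
    return {"kind": "version_key", "value": version.serving_version_key}
```

With a guard like this, handing vllm a source-layout version becomes a publisher-side error instead of a reload-time failure.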

Minimal Change Matrix

Steptron

add planning build mode
add binding-backed build mode
thread weight-provider or attach path through PackedModel.setup_model()
keep load_hf_state_dict() as the mapping/fill step
add calls to begin_update(), seal_current(), and sealed-value contribute_to_assembly()

TensorCast

implement layout-first local binding from 0084
implement binding-backed contribution on top of the existing assembly/layout/view-registration trunk from 0085
expose a published model-version result with source or serving lineage instead of a bare assembly artifact outcome
keep publish/retire/activation daemon-mediated

vllm

keep current runtime binding and reload path as-is
optionally add only the minimal integration needed to resolve the new published version keys or manifests

Practical Rollout Order

Land local binding in steptron and prove one-slab initialization works.
Land single-rank seal_current(...) plus legal same-trunk publish (canonical_full or deterministic multi-view contract).
Land distributed sealed-value contribution + source assembly for TP1.
Land typed child closeout contracts for source -> serving publication.
Land serving publication and immutable serving-key output on that same lineage.
Point vllm at the published serving selector and validate /reload_serving_artifact.
Only after that, expand toward TP > 1 range coverage and transform-aware assembly.

Common Failure Modes To Watch

planning build and real build diverge in tensor set or alias structure
init_weight_callback initializes tensors in planning mode accidentally
load_hf_state_dict() writes into a staging allocation instead of the final binding-backed storage
EP global-id mapping drifts between initialization and publish
attempt-time contribution contract drifts from what the planner produced
distributed assembly succeeds on incomplete or stale contributor data
source assembly succeeds but serving-artifact publication or immutable serving-key publication does not
vllm is handed a source-layout version when it expects a serving artifact
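
The first failure mode, divergence between the planning build and the real build, lends itself to a cheap automated check. The sketch below compares two layout descriptions by fingerprint and reports the differences; the dictionary shapes and the function names layout_fingerprint and layout_diff are hypothetical and are not taken from BindingLayout.

```python
import hashlib
import json
from typing import Dict, List, Sequence, Tuple

# Hypothetical description: {name: (shape, dtype)}
Layout = Dict[str, Tuple[Sequence[int], str]]


def layout_fingerprint(tensors: Layout, aliases: Dict[str, str]) -> str:
    """Order-independent digest over tensor set, shapes, dtypes, and aliases."""
    canonical = {
        "tensors": sorted(
            (name, list(shape), dtype) for name, (shape, dtype) in tensors.items()
        ),
        "aliases": sorted(aliases.items()),
    }
    blob = json.dumps(canonical, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def layout_diff(plan: Layout, real: Layout) -> Dict[str, List[str]]:
    """Name-level differences between the planned and the real layout."""
    return {
        "missing_in_real": sorted(plan.keys() - real.keys()),
        "extra_in_real": sorted(real.keys() - plan.keys()),
        "changed": sorted(
            name for name in plan.keys() & real.keys() if plan[name] != real[name]
        ),
    }
```

Comparing fingerprints right after the real build, and printing layout_diff only on mismatch, keeps the check cheap on the success path.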

When To Revisit The Design

Revisit 0084 and 0085 if any of the following become true:

steptron needs TP > 1 for the production path
local weight layouts require transform-aware distributed assembly
daemon-owned training slabs become necessary instead of client-owned slabs
vllm can consume the training-local layout directly without a source -> serving conversion boundary