Error, Retry, Observability
This document consolidates error semantics, retry behavior, and observability signals for the SDK and daemon.
Related docs:
- Public contracts and where errors surface: API Design
- Source selection and disk verification knobs: Materialization Flow
- Persistence degradation semantics: Policy & Persistence
Error Model
The SDK raises ArtifactError with:
- status_code aligned to gRPC canonical codes
- retryable hint for callers
Why both fields exist:
- status_code is a stable, interoperable taxonomy (matches gRPC canonical codes).
- retryable captures SDK knowledge about which errors are expected to clear with time (e.g. transient daemon unavailability) without callers needing to re-encode complex rules.
Type definitions:
- ArtifactError: tensorcast/api/store/types.py
- Error mapping: tensorcast/api/store/retry.py
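To make the two fields concrete, here is a minimal sketch of what such an error type could look like. It is an illustrative stand-in, not the definition in tensorcast/api/store/types.py: the `StatusCode` enum name, the constructor signature, and the default for `retryable` are assumptions. Only the numeric values are taken from the gRPC canonical codes.

```python
from enum import IntEnum


class StatusCode(IntEnum):
    """Subset of gRPC canonical codes referenced in this document."""

    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    RESOURCE_EXHAUSTED = 8
    FAILED_PRECONDITION = 9
    ABORTED = 10
    UNAVAILABLE = 14
    DATA_LOSS = 15


class ArtifactError(Exception):
    """Hypothetical stand-in for the SDK error; the real class may differ."""

    def __init__(self, message: str, status_code: StatusCode, retryable: bool = False):
        super().__init__(message)
        self.status_code = status_code  # stable taxonomy shared with gRPC
        self.retryable = retryable      # SDK hint: is this expected to clear with time?


# Callers can branch on either field without parsing the message text.
try:
    raise ArtifactError("daemon not reachable", StatusCode.UNAVAILABLE, retryable=True)
except ArtifactError as err:
    summary = f"{err.status_code.name} retryable={err.retryable}"
```

Keeping `retryable` as a separate boolean lets the SDK refine its retry knowledge without changing the status code a caller sees.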
Common mappings:
- INVALID_ARGUMENT for input validation
- FAILED_PRECONDITION for state mismatches
- NOT_FOUND for missing artifacts or disk paths
- RESOURCE_EXHAUSTED for capacity issues
- UNAVAILABLE and DEADLINE_EXCEEDED for transient failures
- DATA_LOSS for integrity mismatches
Caller guidance (typical handling patterns)
- INVALID_ARGUMENT, FAILED_PRECONDITION: fix the caller inputs or state; do not retry blindly.
- NOT_FOUND: treat as a missing artifact/key/path; retry only if you expect eventual consistency.
- RESOURCE_EXHAUSTED: retry may help (evictions/capacity changes), but also consider adjusting policy.
- UNAVAILABLE, DEADLINE_EXCEEDED, ABORTED: safe to retry within a deadline.
- DATA_LOSS: indicates corruption or verification failure; retrying may not help without changing source.
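The guidance above can be sketched as a small lookup from status code name to a recommended action. Everything here is illustrative: the `Action` enum, the `recommended_action` helper, and the choice to escalate unknown codes are assumptions, not part of the SDK.

```python
from enum import Enum


class Action(Enum):
    FIX_CALLER = "fix inputs or state, do not retry blindly"
    RETRY_IF_EVENTUALLY_CONSISTENT = "retry only if eventual consistency is expected"
    RETRY_OR_ADJUST_POLICY = "retry may help, also consider adjusting policy"
    RETRY_WITHIN_DEADLINE = "safe to retry within a deadline"
    CHANGE_SOURCE = "retry is unlikely to help without changing source"
    ESCALATE = "unknown code, surface to the operator"  # assumption for unmapped codes


# Keys are canonical code names; the mapping mirrors the bullet list above.
_GUIDANCE = {
    "INVALID_ARGUMENT": Action.FIX_CALLER,
    "FAILED_PRECONDITION": Action.FIX_CALLER,
    "NOT_FOUND": Action.RETRY_IF_EVENTUALLY_CONSISTENT,
    "RESOURCE_EXHAUSTED": Action.RETRY_OR_ADJUST_POLICY,
    "UNAVAILABLE": Action.RETRY_WITHIN_DEADLINE,
    "DEADLINE_EXCEEDED": Action.RETRY_WITHIN_DEADLINE,
    "ABORTED": Action.RETRY_WITHIN_DEADLINE,
    "DATA_LOSS": Action.CHANGE_SOURCE,
}


def recommended_action(status_name: str) -> Action:
    """Return the handling pattern for a canonical status code name."""
    return _GUIDANCE.get(status_name, Action.ESCALATE)
```

In application code this kind of table usually sits at the boundary where an ArtifactError is caught, so the decision logic stays in one place.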
Retry Policy
The Store runtime applies bounded retries for transient errors. Defaults are
implemented in tensorcast/api/store/retry.py and can be overridden via
StoreOptions.retry_overrides.
Why retries are bounded:
- Unbounded retries hide outages and create thundering herds.
- Deadlines enforce an upper bound on caller-perceived latency.
- Backoff+jitter reduces correlated retries across workers.
Default policy summary:
- register: 30s deadline, 2 attempts
- put: 45s deadline, 2 attempts
- get: 40s deadline, 3 attempts
- get_into: 40s deadline, 3 attempts
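As a reading aid, the defaults above can be written as a small table of per-operation budgets. This is a sketch only: `RetryBudget`, `DEFAULT_BUDGETS`, and `with_overrides` are hypothetical names, and the actual shape accepted by StoreOptions.retry_overrides is not specified here.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RetryBudget:
    deadline_s: float   # overall deadline for the operation, in seconds
    max_attempts: int   # total attempts, including the first call


# Values copied from the default policy summary above.
DEFAULT_BUDGETS: Dict[str, RetryBudget] = {
    "register": RetryBudget(deadline_s=30.0, max_attempts=2),
    "put": RetryBudget(deadline_s=45.0, max_attempts=2),
    "get": RetryBudget(deadline_s=40.0, max_attempts=3),
    "get_into": RetryBudget(deadline_s=40.0, max_attempts=3),
}


def with_overrides(overrides: Dict[str, RetryBudget]) -> Dict[str, RetryBudget]:
    """Return a new table where per-operation overrides replace the defaults."""
    merged = dict(DEFAULT_BUDGETS)
    merged.update(overrides)
    return merged


# Example: tighten only the `get` budget, leave the rest at defaults.
tight = with_overrides({"get": RetryBudget(deadline_s=10.0, max_attempts=1)})
```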
The SDK retries only when all of the following are true:
- the error is an ArtifactError with retryable=True
- status_code is one of UNAVAILABLE, DEADLINE_EXCEEDED, ABORTED
- the attempt count and overall deadline budget allow another attempt
Observability Signals
SDK metrics:
- tc_store_artifact_cache_hits_total
- tc_store_artifact_cache_misses_total
- tc_store_artifact_cache_evictions_total
- tc_store_artifact_cache_invalidations_total
- tc_store_operation_latency_seconds
- tc_store_operation_errors_total
- tc_store_operation_retries_total
- tc_store_batch_hits_total, tc_store_batch_coalesced_total, tc_store_batch_window_seconds
- tc_store_prefetch_events_total
- tc_store_region_backed_fallback_total, tc_store_region_backed_verification_skipped_total
Daemon metrics:
- tc_local_stable_tier_total{op,status,requirement}
- tc_local_stable_tier_seconds{op,status}
- tc_store_materialize_into_target_total{result,reason,source}
- tc_store_materialize_into_target_verification_skipped_total
- tc_persist_tasks_active{state}
- tc_persist_errors_total{stage,reason}
- tc_persist_progress_ratio
Logs and traces:
- SDK fallback logs include tc.store.mode, tc.store.artifact_id, and tc.store.key.
- Persistence logs include task_id, plan_id, artifact_id, and degraded reasons.
What to watch (practical signals)
- Rising tc_store_operation_errors_total{status="UNAVAILABLE"} usually indicates daemon connectivity / overload.
- Sustained tc_persist_tasks_active{state="running"} with flat tc_persist_progress_ratio suggests persistence stalls.
- Increased tc_store_region_backed_fallback_total{reason="layout_mismatch"} indicates callers are attempting region-backed get_into with non-canonical targets.
- tc_local_stable_tier_total{status="DEGRADED"} spikes indicate local stable capacity pressure; correlate with policy usage and overflow_policy.
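The persistence-stall signal above is a conjunction of two series, which can be sketched as a check over recent samples. The `looks_stalled` helper, the window length, and the `epsilon` tolerance are assumptions for illustration; in practice this condition would more likely be expressed as an alerting rule in the monitoring system.

```python
from typing import List, Tuple


def looks_stalled(samples: List[Tuple[int, float]],
                  window: int = 5,
                  epsilon: float = 1e-6) -> bool:
    """Heuristic stall check over time-ordered samples.

    Each sample is (running_tasks, progress_ratio), standing in for
    tc_persist_tasks_active{state="running"} and tc_persist_progress_ratio.
    """
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough history to call it "sustained"
    if any(running <= 0 for running, _ in recent):
        return False  # no running tasks for part of the window
    ratios = [ratio for _, ratio in recent]
    return max(ratios) - min(ratios) <= epsilon  # progress is flat


stalled = looks_stalled([(2, 0.40)] * 5)
progressing = looks_stalled([(2, 0.1), (2, 0.2), (2, 0.3), (2, 0.4), (2, 0.5)])
```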
Code Map
- Error mapping: tensorcast/api/store/retry.py
- SDK metrics: tensorcast/api/_metrics.py
- Daemon local-stable metrics: daemon/service/controllers/registration_controller.cc
- Daemon materialization metrics: daemon/service/controllers/materialization_controller.cc
- Persistence metrics: daemon/state/persistence_manager.cc