Global Store Deployment Best Practices¶

The Global Store is a centralized artifact registry service that manages artifact metadata, worker registration, and state synchronization across the distributed artifact serving infrastructure.

Architecture Overview¶

The Global Store provides:

- Centralized Artifact Registry: Tracks all artifacts and their replicas across the cluster
- Worker Management: Handles worker registration, heartbeats, and health monitoring
- State Synchronization: Ensures consistency across distributed workers
- High Availability: Supports persistent storage and recovery mechanisms
- Metrics: Prometheus-compatible metrics endpoint

Configuration¶

The Global Store is configured through a single file (YAML or JSON, parsed into a Proto message with strict validation). Apart from --config itself, CLI flags and environment variables are not supported. See examples/config/global_store_config.yaml and pass it via --config. The example sets database.db_file to null, which keeps all state in memory; set a persistent path for production deployments.

Deployment Methods¶

1. Direct Python Module Execution (Unified Config)¶

Development/Testing¶

# Use the configuration file (strictly validated)
uv run -m tensorcast.global_store --config=examples/config/global_store_config.yaml

Production with Persistence¶

Set the persistent database path, thread counts, and other parameters in the configuration file, and start with --config.
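Because the example configuration leaves database.db_file as null, it can help to check a production config before starting the service. The following is a minimal sketch, assuming the config is written as JSON (YAML would need a third-party parser) and inspecting only the database.db_file field named above. The helper find_persistence_problems is hypothetical and not part of tensorcast.

```python
import json
from pathlib import PurePosixPath


def find_persistence_problems(config_text: str) -> list[str]:
    """Return human-readable problems with the database settings, if any."""
    config = json.loads(config_text)
    # database.db_file is the field named in the Configuration section;
    # a null (None) value means the store keeps its state in memory only.
    db_file = (config.get("database") or {}).get("db_file")
    if db_file is None:
        return ["database.db_file is null: state is in-memory and lost on restart"]
    if not isinstance(db_file, str):
        return [f"database.db_file must be a string, got {type(db_file).__name__}"]
    problems = []
    # A relative path depends on the working directory of the service,
    # which is easy to get wrong under systemd, Docker, or Kubernetes.
    if not PurePosixPath(db_file).is_absolute():
        problems.append(f"database.db_file {db_file!r} is not an absolute path")
    return problems
```

An empty list means the database settings look suitable for a persistent deployment; the strict validation performed by the service itself still applies.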

2. Environment Variable Configuration (Removed)¶

Environment variables are no longer supported. Use the unified configuration file instead.

3. Docker Deployment¶

# Build the image
./docker/build.sh

# Run with persistent storage (mount config). For the data to persist,
# database.db_file in the mounted config must point to a path under /data.
docker run -d \
  --name global-store \
  -p 50051:50051 \
  -p 8001:8001 \
  -v /var/lib/global_store:/data \
  -v "$(pwd)/examples/config/global_store_config.yaml:/etc/tensorcast/global_store.yaml:ro" \
  ghcr.io/tensorcast-ai/global-store:latest \
  uv run -m tensorcast.global_store --config=/etc/tensorcast/global_store.yaml

4. Kubernetes Deployment (High Availability)¶

See the Kubernetes High Availability section below for detailed StatefulSet configuration.

High Availability Configuration¶

1. Persistent Storage¶

For production deployments, always use persistent storage:

# Ensure data directory exists with proper permissions
mkdir -p /var/lib/global_store
chown -R app:app /var/lib/global_store

# Start with persistent database via unified config
uv run -m tensorcast.global_store --config=/etc/tensorcast/global_store.yaml

Benefits:

- Automatic Recovery: Global Store recovers state from database on restart
- Worker State Preservation: Maintains worker registrations and artifact assignments
- Audit Trail: Preserves historical data for debugging

2. Recovery Mechanisms¶

The Global Store implements several recovery features:

Database Recovery¶

- Automatically initiated on startup when using persistent storage
- Restores worker registrations, artifact metadata, and replica assignments
- Validates data integrity during recovery

Worker Re-registration¶

- Workers can perform recovery registration after Global Store restart
- Preserves worker identity and artifact assignments when possible
- Supports state synchronization after recovery

State Synchronization¶

- Enhanced heartbeat protocol includes state versioning
- Automatic detection of state divergence
- Full state sync available for major discrepancies
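The versioning idea can be illustrated with a small sketch. It assumes that state versions are monotonically increasing integers carried in each heartbeat and that a fixed gap threshold separates minor from major discrepancies. Both are illustrative assumptions; the actual heartbeat protocol and its thresholds are not described here, and SyncAction and choose_sync_action are hypothetical names.

```python
from enum import Enum


class SyncAction(Enum):
    NONE = "none"                # versions agree, nothing to do
    INCREMENTAL = "incremental"  # small gap: replay the missing updates
    FULL = "full"                # large gap or worker ahead: full state sync


def choose_sync_action(store_version: int, worker_version: int,
                       max_incremental_gap: int = 16) -> SyncAction:
    """Pick a sync action from the versions carried in a heartbeat.

    Assumption: versions are monotonically increasing integers, and
    max_incremental_gap is an illustrative threshold, not a tensorcast setting.
    """
    if worker_version == store_version:
        return SyncAction.NONE
    gap = store_version - worker_version
    # A worker that is ahead of the store (gap < 0), for example after the
    # store was restored from an older backup, cannot be repaired by
    # replaying updates, so it is treated as a major discrepancy.
    if gap < 0 or gap > max_incremental_gap:
        return SyncAction.FULL
    return SyncAction.INCREMENTAL
```

Treating a worker that is ahead of the store as a full-sync case is a design choice of this sketch: it covers the situation where the store restarts from an older database snapshot.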

3. Monitoring and Health Checks¶

Prometheus Metrics¶

Exposed via the unified metrics system:

- Worker registration/deregistration counts
- Active worker count
- Artifact registration metrics
  - tc_register_begin_coalesced_total
  - tc_register_begin_cpu_total
  - tc_register_begin_lease_total
  - tc_register_commit_coalesced_total
  - tc_register_commit_cpu_total
  - tc_register_commit_lease_total
  - tc_register_abort_total
  - tc_register_keepalive_total
  - tc_register_revoke_total
  - tc_register_feed_cpu_bytes_total
  - tc_register_feed_lease_segments_total
  - tc_register_feed_lease_bytes_total
  - tc_register_pending_gauge (current number of in-flight registrations per daemon)
  - tc_register_commit_seconds (latency histogram with result label: ok/aborted/expired)
- Request latencies
- Error rates
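To show how these counters might be consumed outside of Prometheus, here is a minimal sketch that parses the Prometheus text format and derives an abort ratio. It assumes the text has already been fetched from the metrics port (the conventional path is /metrics, which is not confirmed here) and that commit and abort counters count disjoint outcomes. The helper names are hypothetical.

```python
from collections import defaultdict


def sum_samples(exposition_text: str) -> dict[str, float]:
    """Sum sample values per metric name from Prometheus text-format output."""
    totals: dict[str, float] = defaultdict(float)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        if "{" in line:
            # name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            # name value [timestamp]
            name, *fields = line.split()
        if not fields:
            continue  # malformed line without a value
        totals[name] += float(fields[0])  # fields[1], if present, is a timestamp
    return dict(totals)


def abort_ratio(totals: dict[str, float]) -> float:
    """Aborted registrations as a fraction of all finished registrations.

    Assumption: the commit counters and the abort counter are disjoint.
    """
    commits = sum(totals.get(f"tc_register_commit_{kind}_total", 0.0)
                  for kind in ("coalesced", "cpu", "lease"))
    aborts = totals.get("tc_register_abort_total", 0.0)
    finished = commits + aborts
    return aborts / finished if finished else 0.0
```

Summing across label sets gives a cluster-wide value for per-daemon metrics such as tc_register_pending_gauge; keep the labels if per-daemon values are needed.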

Health Check Endpoints¶

# gRPC health check
grpc_health_probe -addr=localhost:50051

# HTTP health via metrics endpoint
curl http://localhost:8001/health
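In deployment scripts it is often useful to wait until the health endpoint answers before starting workers. The sketch below polls the HTTP endpoint shown above using only the standard library. The retry count and interval are illustrative defaults, and both helper names are hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def wait_until_healthy(probe: Callable[[], bool], attempts: int = 30,
                       interval_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe up to `attempts` times, sleeping between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:  # no need to sleep after the last try
            sleep(interval_s)
    return False
```

A typical call would be wait_until_healthy(lambda: http_ok("http://localhost:8001/health")). Passing the probe as a callable keeps the retry loop independent of the transport, so a gRPC health check could be substituted.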

Kubernetes High Availability¶

StatefulSet Configuration¶

For production Kubernetes deployments, use a StatefulSet with persistent volumes:

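# See docker/k8s/global_store.yaml for complete configuration
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: global-store
spec:
  replicas: 1  # Single instance with persistent storage
  serviceName: global-store-headless
  selector:
    matchLabels:
      app: global-store
  template:
    metadata:
      labels:
        app: global-store  # Must match the Service selectors
    spec:
      containers:
      - name: global-store
        image: ghcr.io/tensorcast-ai/global-store:latest
        # Environment variables are not supported. Set database.db_file in the
        # config file to a path under the data mount, for example
        # /var/lib/tensorcast/global-store/models.db
        args: ["uv", "run", "-m", "tensorcast.global_store", "--config=/etc/tensorcast/global_store.yaml"]
        volumeMounts:
        - name: data
          mountPath: /var/lib/tensorcast/global-store
        - name: config
          mountPath: /etc/tensorcast
          readOnly: true
      volumes:
      - name: config
        configMap:
          name: global-store-config  # Assumed ConfigMap containing global_store.yaml
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 20Gi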

Service Configuration¶

# Headless service for StatefulSet
apiVersion: v1
kind: Service
metadata:
  name: global-store-headless
spec:
  clusterIP: None
  ports:
  - name: grpc
    port: 50051
  selector:
    app: global-store

---
# Load-balanced service for client access
apiVersion: v1
kind: Service
metadata:
  name: global-store
spec:
  type: ClusterIP
  ports:
  - name: grpc
    port: 50051
  - name: metrics
    port: 8001
  selector:
    app: global-store

Best Practices¶

1. Resource Allocation¶

- CPU: 2-4 cores for moderate load, 4-8 cores for high load
- Memory: 4-8 GB minimum, 16 GB recommended for large deployments
- Storage: 20-50 GB SSD for database, depending on artifact count

2. Database Maintenance¶

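# Periodic backup. Copying the file while the service is running can capture
# a partially written database; stop the service first or use the database's
# own online backup mechanism.
cp /var/lib/global_store/models.db /backup/models.db.$(date +%Y%m%d)

# Database optimization runs automatically. Environment variables are not
# supported, so set the interval (for example 1800000 ms = 30 minutes) in the
# configuration file instead of exporting GLOBAL_STORE_OPTIMIZE_INTERVAL_MS.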
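If the database file is SQLite (an assumption suggested by the single .db file, not confirmed here), a consistent backup can be taken while the service is running by using SQLite's online backup API instead of a plain file copy. The following sketch uses Python's standard sqlite3 module; backup_database is a hypothetical helper.

```python
import sqlite3
from pathlib import Path


def backup_database(source_path: str, backup_path: str) -> None:
    """Copy a live SQLite database to backup_path as a consistent snapshot."""
    # Open the source read-only so that a wrong path fails with an error
    # instead of silently creating an empty database file.
    source_uri = Path(source_path).resolve().as_uri() + "?mode=ro"
    source = sqlite3.connect(source_uri, uri=True)
    target = sqlite3.connect(backup_path)
    try:
        # Connection.backup uses SQLite's online backup API, which produces
        # a transactionally consistent copy even while other connections
        # are writing to the source database.
        source.backup(target)
    finally:
        target.close()
        source.close()
```

For example, backup_database("/var/lib/global_store/models.db", "/backup/models.db.20250101") writes a dated snapshot; the target file name here is only illustrative.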

3. Security Considerations¶

- Run as non-root user
- Use TLS for gRPC connections in production
- Restrict network access to trusted sources

4. Scaling Strategy¶

The Global Store is designed to run as a single instance with persistent storage, so scaling options are limited:

- Vertical Scaling: Increase CPU/memory for the single instance
- Read Replicas: Future versions may support read-only replicas
- Backup Instance: Maintain a standby instance with replicated database

5. Integration with Store Daemons¶

Ensure Store Daemons are configured to:

# Point to Global Store service
export GLOBAL_STORE_ADDRESS=global-store.namespace.svc.cluster.local:50051

# Enable reconnection with exponential backoff
export ENABLE_GLOBAL_STORE_RECONNECT=true
export RECONNECT_MAX_RETRIES=10
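To give a sense of what exponential backoff with 10 retries means in practice, the sketch below computes a retry schedule. The base delay, cap, and jitter are illustrative assumptions; the actual values used by the Store Daemon are not documented here, and backoff_delays is a hypothetical helper.

```python
import random
from typing import Optional


def backoff_delays(max_retries: int = 10, base_s: float = 0.5,
                   cap_s: float = 30.0, jitter: float = 0.0,
                   rng: Optional[random.Random] = None) -> list[float]:
    """Delay before each retry: base * 2**attempt, capped, with optional jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap_s, base_s * 2 ** attempt)
        # jitter=0.2 spreads each delay uniformly over [0.8*delay, 1.2*delay]
        # so that daemons restarted together do not reconnect in lockstep.
        delay *= 1 + rng.uniform(-jitter, jitter)
        delays.append(delay)
    return delays
```

With these defaults and no jitter, the schedule starts at 0.5 s, doubles on each attempt, and stays at the 30 s cap from the seventh retry onward, so ten retries cover roughly two minutes of Global Store downtime.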

Troubleshooting¶

Common Issues¶

- Worker Registration Failures
  - Check network connectivity
  - Verify Global Store address in worker configuration
  - Check for port conflicts
- Database Corruption
  - Stop service immediately
  - Restore from backup
  - Run with fresh database if needed
- High Memory Usage
  - Check worker count and cleanup settings
  - Increase cleanup frequency
  - Monitor for memory leaks
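For the network connectivity check, a quick TCP probe from the worker host can rule out firewall and DNS problems before looking at the gRPC layer. This is a minimal sketch using only the standard library; can_connect is a hypothetical helper, and a successful TCP connection does not prove that the gRPC service itself is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable networks,
        # and host names that cannot be resolved.
        return False
```

For example, can_connect("localhost", 50051) checks the gRPC port used throughout this guide.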

Debug Logging¶

# Enable debug logging
export LOG_LEVEL=DEBUG
uv run -m tensorcast.global_store --config=/etc/tensorcast/global_store.yaml

Recovery Procedures¶

Complete System Recovery

# Stop the Global Store service
systemctl stop global-store

# Restore database from backup (use the file name of the backup to restore)
cp /backup/models.db.latest /var/lib/global_store/models.db

# Start Global Store
systemctl start global-store

# Workers will auto-reconnect and re-register

Worker State Resync

# Trigger reconcile snapshot for specific worker
grpcurl -d '{"worker_id":"worker-123","daemon_id":"daemon-123","generation":"1","request_seq":"1","request_kind":"RECONCILE_REQUEST_KIND_SNAPSHOT","inventory":[]}' \
  localhost:50051 \
  tensorcast.global_store.v1.ClusterRuntimeService/ReconcileWorkerState
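Hand-written JSON inside shell quotes is easy to get wrong when resyncing many workers. The sketch below builds the same grpcurl command line programmatically, using only the field names and values that appear in the example above. The helper reconcile_snapshot_command and its default arguments are hypothetical.

```python
import json
import shlex


def reconcile_snapshot_command(worker_id: str, daemon_id: str,
                               generation: int = 1, request_seq: int = 1,
                               address: str = "localhost:50051") -> str:
    """Build the grpcurl command line for a snapshot reconcile request."""
    payload = {
        "worker_id": worker_id,
        "daemon_id": daemon_id,
        # Sent as strings to match the example request shown above.
        "generation": str(generation),
        "request_seq": str(request_seq),
        "request_kind": "RECONCILE_REQUEST_KIND_SNAPSHOT",
        "inventory": [],
    }
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join([
        "grpcurl", "-d", json.dumps(payload), address,
        "tensorcast.global_store.v1.ClusterRuntimeService/ReconcileWorkerState",
    ])
```

Printing reconcile_snapshot_command("worker-123", "daemon-123") yields a command equivalent to the one above that can be pasted into a shell.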

Conclusion¶

The Global Store is designed for high availability through:

- Persistent storage with automatic recovery
- Robust worker management with heartbeat monitoring
- State synchronization protocols
- Comprehensive monitoring and metrics

Follow these best practices to ensure reliable operation of your artifact serving infrastructure.