INTELLIGENT SCHEDULER

The intelligent scheduler
that makes every inference cheaper, faster and compliant.

Semantic cache + dynamic batching + multi-cloud routing + two-way moderation, behind an OpenAI-compatible API. Zero migration cost, 20–40% lower spend.

Overall architecture

Five layers, an end-to-end enterprise inference pipeline.

Enterprise Clients
SaaS · Finance · Retail · Education · Public sector
TokensChain Gateway
Auth · Rate Limit · Request Validation · Load Balancer
Intelligent Scheduler
Semantic cache · batch optimization · model routing · content safety · audit logs · cost accounting
Semantic Cache
Batch Optimizer
Model Router
Content Safety
Audit Logger
Cost Calculator
Alibaba Qwen
Multi-Cloud GPU Endpoints
Tencent Hunyuan
Multi-Cloud GPU Endpoints
Huawei Pangu
Multi-Cloud GPU Endpoints
Baidu ERNIE
Multi-Cloud GPU Endpoints
Volcano Doubao
Multi-Cloud GPU Endpoints
Self-hosted GPU cluster
Multi-Cloud GPU Endpoints
Monitoring & Analytics
Cache Hit Rate · GPU Utilization · Cost Savings Dashboard

Core components

Four engines, channeling the engineering of Fireworks.ai.

2.1 SEMANTIC CACHE

Semantic cache engine

Not string matching — embedding-similarity caching. Milvus / Weaviate vector search backed by Redis. Threshold 0.95, instant hits, zero GPU cost.

  • Four-tier cache: Exact / Semantic / Template / System
  • Hit rate 30–50%, latency <10ms (vs 200–2000ms)
  • TTL + model-version invalidation
2.2 BATCH OPTIMIZER

Dynamic batch optimizer

Adaptive batch sizing merges compatible requests by model, max_tokens and temperature. Four strategies (Time / Size / Adaptive / Priority) with VIP priority queue.

  • GPU utilization 30% → 70%+
  • 3–5× throughput
  • Per-token cost down 40–60%
2.3 OPENAI COMPATIBLE API

Unified API surface

Drop-in OpenAI Chat Completions API. Enhances it with enterprise auth, smart scheduling, audit logging and billing.

  • /v1/chat/completions full parity
  • Stream, user tagging, enterprise tiering
  • Zero migration cost
2.4 COMPLIANCE WRAPPER

Compliance wrapper

Alibaba Green Net + Tencent Tianyu + in-house keyword filters, bidirectional moderation. Algorithm-filing IDs baked in, audit logs persisted structurally.

  • Parallel moderation across providers, aggregated
  • Algorithm filing ID auto-injected into responses
  • Audit logs meet MLPS 2.0 and CAC requirements

Performance targets

Quantifiable engineering goals.

30-50%
Cache Hit Rate
Daily
<500ms
P95 Latency
Real-time
70%+
GPU Utilization
Real-time
99.9%
System Availability
Monthly

Economics

A cost structure where everyone wins.

Direct cloud-vendor connection
$0.0020 / 1K tokens
  • GPU utilization ~30%
  • No cache — every request is paid
  • Compliance & audit on the customer
TokensChain scheduler
$0.0016 / 1K tokens (20% off)
  • Semantic cache hit: $0 GPU cost
  • Miss: GPU cost $0.0014 after batching
  • TokensChain margin $0.0006 (30% GM)

Competitive advantage

Why TokensChain.

DimensionDirect cloudSiliconFlow / Infini-AITokensChain
CostBaselineSlightly lower20–40% lower
CacheNoneBasic cacheSemantic + multi-tier
BatchingNoneBasic batchingDynamic adaptive batching
ComplianceDIYPartialBuilt-in dual moderation + audit + filing
EnterpriseStandard SLAStandard SLACustom SLA + dedicated routing
MigrationAdaptation neededAdaptation neededZero migration (OpenAI compatible)

Deployment architecture

Production-grade, observable, multi-region.

Multi-region deployment

South (Guangzhou) + East (Shanghai) + North (Beijing) on Kubernetes (ACK / EKS / GKE)

Data layer

Redis Cluster (3 master / 3 replica) + Milvus (2 shards / 2 replicas) + Kafka (3 brokers)

Observability

Prometheus + Grafana + AlertManager; ELK log stack; multi-dimensional alert rules

Core value

"We don't sell GPU capacity — we're the software layer that uses it efficiently.
Plug into our API and cut costs 20–40% instantly, with compliance done for you and zero migration."