
TOKENSCHAIN PLATFORM · THE CHINA EDITION OF FIREWORKS
Fireworks made open-source LLMs feel like a single API for the world. TokensChain goes further — as an AI middleware MaaS + compute-scheduling infrastructure layer, aggregating GPU inference across Alibaba, Tencent, Huawei and Volcano with smart routing, semantic caching, dynamic batching and end-to-end compliance to deliver China's compute efficiency to global enterprises.
Build · Tune · Scale
Serverless inference with no cold starts, billed per token. Graduate seamlessly to on-demand GPU endpoints that auto-scale — no GPU procurement, no cluster wrangling.
LoRA / QLoRA, reinforcement learning and quantization-aware training — all in-country and compliant. Tuned models share the same API as the base model, so apps don't change.
Smart routing balances live traffic across Alibaba / Tencent / Huawei / Volcano, with multi-AZ HA and a 99.9% SLA. Dedicated VPC and on-prem options available.
System architecture
MaaS Core Modules
TokensChain is more than an API gateway — it is a full MaaS middleware spanning model access, compute orchestration, security compliance and continuous optimization. Enterprises consume LLM capabilities like utilities, with zero infrastructure to build.
One OpenAI-compatible API connects DeepSeek, Qwen, Kimi, GLM, MiniMax, Doubao, Hunyuan and other leading Chinese models, plus LLaMA, Mistral and other international open models. Built-in version canaries, A/B testing and tiered API key authorization let you switch models without touching app code.
Core Capabilities
Typical Scenarios
E-commerce Intelligent Customer Service
Switch seamlessly between Qwen for product inquiries and DeepSeek for complex return-policy interpretation through the same API — no frontend changes required.
Financial Document Analysis
One endpoint automatically identifies Chinese vs. English content and routes to GLM for Chinese contract review or LLaMA for English report analysis.
Education AI Tutor
Use Doubao for K12 math tutoring and Kimi for long-document reading comprehension under one API key with per-subject usage tracking.
Millisecond-level collection of latency, price, load and availability feeds a weighted scoring engine that routes every request to the optimal cloud node. Semantic cache hits >30%; dynamic batching boosts GPU utilization up to 4× and cuts enterprise compute costs by 35%+.
Core Capabilities
Typical Scenarios
Flash Sale Marketing Campaign
Handle 10× traffic spikes during Double-11 with auto multi-cloud scaling; semantic cache absorbs repetitive product-description queries, GPU cost only grows 2×.
Real-time Code Assistant
IDE auto-completion demands P99 < 500 ms; geo-aware routing and priority queuing maintain a silky experience even during peak development hours.
Batch Legal Document Review
Process 100,000+ contracts overnight using dynamic batching on reserved GPU instances — 75% faster than on-demand serverless.
Bidirectional input/output moderation integrates Alibaba Green Net, Tencent Tianyu and other engines. High-risk content is blocked and fully audit-logged. Supports MLPS 2.0 Level 3, CAC algorithm registration, data-sovereign deployment and GM cryptographic upgrades — meeting the strictest government and finance compliance requirements.
Core Capabilities
Typical Scenarios
Government Smart City
All citizen service inference runs in Huawei Cloud's government zone — data never leaves the municipality; complete audit trails available for regulatory inspection at any time.
Digital Banking Chatbot
Every customer-facing AI response undergoes dual-review and SM4 encryption, meeting the central bank's fintech innovation compliance requirements.
Hospital Diagnostic Assistant
Patient data is inferred only inside the hospital's private cloud; automated filtering of risky medical advice content with complete audit logs retained for health authority review.
Distributed tracing from gateway to GPU exit. Real-time latency histograms, error trends, cost attribution and anomalous request replay. Data-driven auto-tuning recommendations continuously optimize cache TTL, batch size and routing weights — driving per-token cost down month over month.
Core Capabilities
Typical Scenarios
Multi-tenant SaaS Platform
Provide each customer with isolated usage dashboards showing exact token consumption, model distribution and per-department cost breakdown.
Enterprise Knowledge Base
Semantic cache analytics revealed 40% of queries were repetitive FAQ questions; pre-warming cache cut GPU costs by 35% — ROI clearly visible.
Game Studio NPC Dialogue
Latency heatmaps revealed peak GPU contention during evening gaming hours; shifting non-critical model traffic to cost-optimized regions saved 28% on compute spend.
FAQ
Delivery Promise
Core capabilities
Live latency, price, load and availability metrics feed a weighted scoring engine that picks the best cloud endpoint per request. Geo-aware routing, cost-first policies and 5-second failover keep workloads running transparently.
Embedding similarity matching auto-caches responses for repeated prompts. >30% hit rate, millisecond returns with zero GPU cost. TTL eviction, popularity weighting and multi-tier cache architecture reduce downstream load.
Dynamic coalescing packs concurrent requests into single GPU batches. Adaptive batch size, padding alignment and priority queueing lift GPU utilization up to 4× and cut per-token cost by 35%+.
Bidirectional input/output moderation integrates Alibaba Green Net, Tencent Tianyu and other engines. High-risk content is blocked and audit-logged, meeting MLPS 2.0 and CAC requirements.
Millisecond-accurate token metering with automated multi-cloud reconciliation. Multi-account hierarchy, cost attribution, budget alerts and VAT special invoices plug into enterprise finance flows.
Distributed tracing from gateway to GPU exit. Real-time latency histograms, error trends, cost attribution and request replay. Prometheus + Grafana dashboards pinpoint issues in minutes.
Version-pinned canary release, A/B traffic splitting and gradual ramp-up. Real-time KPI monitoring with one-click rollback minimizes launch risk.
Rate limits by tenant, API key, model and time window. Burst buffering, priority queues and budget caps protect GPU clusters and keep spend predictable.
K8s Helm charts and turnkey delivery for Xinchuang environments. Gateway and cache run entirely in customer networks, with support for domestic chips, GM-crypto and air-gapped deployments.
Capability parity
We mirror Fireworks' validated product surface on China's clouds — and add what only matters in-country: deep compliance and localization.
Fine-tuning workflow
A fully-managed pipeline with zero infra overhead. Every step — upload to production — stays in-country.
Securely upload private data via the console or API. JSONL, CSV and Parquet supported, with automatic quality checks and at-rest encryption.
Pick a base model, tune LoRA / QLoRA / RL hyperparameters, set budget and wall-clock caps. Hit start — GPU clusters spin up automatically.
Watch loss, throughput and eval metrics live. When training ends, deploy to a serverless endpoint or reserved capacity in one click — same API as the base model.
Choose how you pay
Both share one OpenAI-compatible API and can coexist in a single project: reserve capacity for core pipelines, run elastic and experimental traffic on serverless.
Invoke any model instantly — zero setup, per-token billing. Ideal for bursty traffic, prototyping and SMB-scale production.
Dedicated GPUs for mission-critical workloads — predictable latency, throughput and enterprise SLA. 30–50% cheaper than on-demand at scale.
Deployment modes
Turnkey, pay-as-you-go. Built for SMBs and developers.
Gateway runs inside your VPC — data never leaves your cloud account.
Source-code delivery into your network. Xinchuang hardware and GM crypto supported.