TOKENSCHAIN PLATFORM · THE CHINA EDITION OF FIREWORKS

From inference to intelligence.
China compute, global developer experience.

Fireworks made open-source LLMs feel like a single API for the world. TokensChain goes further — as an AI middleware MaaS + compute-scheduling infrastructure layer, aggregating GPU inference across Alibaba, Tencent, Huawei and Volcano with smart routing, semantic caching, dynamic batching and end-to-end compliance to deliver China's compute efficiency to global enterprises.

Build · Tune · Scale

One platform for the entire LLM lifecycle.

BUILD

From prompt to production in seconds

Serverless inference with no cold starts, billed per token. Graduate seamlessly to on-demand GPU endpoints that auto-scale — no GPU procurement, no cluster wrangling.

TUNE

Fine-tune any open model on your private data

LoRA / QLoRA, reinforcement learning and quantization-aware training — all in-country and compliant. Tuned models share the same API as the base model, so apps don't change.

SCALE

Scale across clouds, regions and compliance zones

Smart routing balances live traffic across Alibaba / Tencent / Huawei / Volcano, with multi-AZ HA and a 99.9% SLA. Dedicated VPC and on-prem options available.

System architecture

Asset-light. Pure software. Built to scale.

Customer app

OpenAI SDK · LangChain · custom backend

↓

TokensChain Gateway

Auth · throttle · routing · billing

↓

Semantic cache

Redis Stack · vector search

Batch scheduler

Queue coalescing · dynamic batch

Compliance

Moderation API · audit logs

↓

Alibaba Cloud

GPU inference endpoint

Tencent Cloud

GPU inference endpoint

Huawei Cloud

GPU inference endpoint

Baidu AI Cloud

GPU inference endpoint

MaaS Core Modules

Four layers of model-as-a-service infrastructure.

TokensChain is more than an API gateway — it is a full MaaS middleware spanning model access, compute orchestration, security compliance and continuous optimization. Enterprises consume LLM capabilities like utilities, with zero infrastructure to build.

Unified Model Gateway

One OpenAI-compatible API connects DeepSeek, Qwen, Kimi, GLM, MiniMax, Doubao, Hunyuan and other leading Chinese models, plus LLaMA, Mistral and other international open models. Built-in version canaries, A/B testing and tiered API key authorization let you switch models without touching app code.

Core Capabilities

·Multi-vendor model unification (10+ providers, 50+ models)
·100% OpenAI SDK & REST API compatible, one-line migration
·Automatic model version canary, blue-green deployment & one-click rollback
·Tiered API key authorization (tenant / project / environment isolation)
·Automatic request/response format transformation & normalization
·Prompt template library & version management for team reuse

Typical Scenarios

E-commerce Intelligent Customer Service

Switch seamlessly between Qwen for product inquiries and DeepSeek for complex return-policy interpretation through the same API — no frontend changes required.

Financial Document Analysis

One endpoint automatically identifies Chinese vs. English content and routes to GLM for Chinese contract review or LLaMA for English report analysis.

Education AI Tutor

Use Doubao for K12 math tutoring and Kimi for long-document reading comprehension under one API key with per-subject usage tracking.

Intelligent Compute Scheduling

Millisecond-level collection of latency, price, load and availability feeds a weighted scoring engine that routes every request to the optimal cloud node. Semantic cache hits >30%; dynamic batching boosts GPU utilization up to 4× and cuts enterprise compute costs by 35%+.

Core Capabilities

·Real-time multi-cloud GPU node health scoring (latency / price / load / availability)
·5-second automatic failover across Alibaba / Tencent / Huawei / Volcano / Baidu
·Embedding vector semantic cache with >30% hit rate, zero GPU cost for repeated requests
·Dynamic request batching: adaptive batch size + priority queue, 4× GPU utilization
·Auto-scaling compute capacity based on queue depth & SLA targets
·Cost-first / performance-first / compliance-first routing policies

Typical Scenarios

Flash Sale Marketing Campaign

Handle 10× traffic spikes during Double-11 with auto multi-cloud scaling; semantic cache absorbs repetitive product-description queries, GPU cost only grows 2×.

Real-time Code Assistant

IDE auto-completion demands P99 < 500 ms; geo-aware routing and priority queuing maintain a silky experience even during peak development hours.

Batch Legal Document Review

Process 100,000+ contracts overnight using dynamic batching on reserved GPU instances — 75% faster than on-demand serverless.

Security & Compliance

Bidirectional input/output moderation integrates Alibaba Green Net, Tencent Tianyu and other engines. High-risk content is blocked and fully audit-logged. Supports MLPS 2.0 Level 3, CAC algorithm registration, data-sovereign deployment and GM cryptographic upgrades — meeting the strictest government and finance compliance requirements.

Core Capabilities

·Bidirectional I/O content moderation (Alibaba Green Net + Tencent Tianyu + custom rules)
·Tamper-proof full-chain audit logs with custom retention & compliance export
·MLPS 2.0 Level 3 / CAC filings / algorithm registration / data classification compliance
·Data-sovereign deployment options: domestic-cloud-only, air-gapped, no cross-border transfer
·GM cryptographic algorithm support (SM2 / SM3 / SM4) for finance & government crypto mandates
·RBAC fine-grained permissions per model, per API and per tenant

Typical Scenarios

Government Smart City

All citizen service inference runs in Huawei Cloud's government zone — data never leaves the municipality; complete audit trails available for regulatory inspection at any time.

Digital Banking Chatbot

Every customer-facing AI response undergoes dual-review and SM4 encryption, meeting the central bank's fintech innovation compliance requirements.

Hospital Diagnostic Assistant

Patient data is inferred only inside the hospital's private cloud; automated filtering of risky medical advice content with complete audit logs retained for health authority review.

Observability & Optimization

Distributed tracing from gateway to GPU exit. Real-time latency histograms, error trends, cost attribution and anomalous request replay. Data-driven auto-tuning recommendations continuously optimize cache TTL, batch size and routing weights — driving per-token cost down month over month.

Core Capabilities

·End-to-end distributed tracing: from API gateway to GPU inference kernel
·Real-time latency percentile analysis (P50 / P95 / P99) with anomaly detection alerts
·Per-tenant / per-model / per-cloud cost attribution & budget tracking
·Intelligent anomaly detection: auto-identifies slow requests, error spikes & cost anomalies
·Token throughput & GPU utilization dashboards with trend forecasting
·Automated tuning recommendations: cache policy, batch parameters & routing weights continuously optimized

Typical Scenarios

Multi-tenant SaaS Platform

Provide each customer with isolated usage dashboards showing exact token consumption, model distribution and per-department cost breakdown.

Enterprise Knowledge Base

Semantic cache analytics revealed 40% of queries were repetitive FAQ questions; pre-warming cache cut GPU costs by 35% — ROI clearly visible.

Game Studio NPC Dialogue

Latency heatmaps revealed peak GPU contention during evening gaming hours; shifting non-critical model traffic to cost-optimized regions saved 28% on compute spend.

FAQ

Common questions about the four core modules.

Unified Model Gateway

Intelligent Compute Scheduling

Security & Compliance

Observability & Optimization

Delivery Promise

Not just a feature list — a quantified commitment.

99.9%

Service availability SLA

Multi-AZ disaster recovery, ≤8.76 hrs downtime per year

< 5s

Auto failover

Traffic migrated within 5 seconds, zero user impact

15 min

Enterprise ticket response

7×24 dedicated support, 15-minute first response on business days

100%

Data security assurance

On-premise, GM-crypto and air-gapped deployment supported

Core capabilities

Built for enterprise AI workloads.

ROUTING

Smart routing engine

Live latency, price, load and availability metrics feed a weighted scoring engine that picks the best cloud endpoint per request. Geo-aware routing, cost-first policies and 5-second failover keep workloads running transparently.

CACHE

Semantic cache

Embedding similarity matching auto-caches responses for repeated prompts. >30% hit rate, millisecond returns with zero GPU cost. TTL eviction, popularity weighting and multi-tier cache architecture reduce downstream load.

BATCHING

Request batching

Dynamic coalescing packs concurrent requests into single GPU batches. Adaptive batch size, padding alignment and priority queueing lift GPU utilization up to 4× and cut per-token cost by 35%+.

SAFETY

Two-way content moderation

Bidirectional input/output moderation integrates Alibaba Green Net, Tencent Tianyu and other engines. High-risk content is blocked and audit-logged, meeting MLPS 2.0 and CAC requirements.

BILLING

Unified metering & billing

Millisecond-accurate token metering with automated multi-cloud reconciliation. Multi-account hierarchy, cost attribution, budget alerts and VAT special invoices plug into enterprise finance flows.

OBSERVABILITY

End-to-end observability

Distributed tracing from gateway to GPU exit. Real-time latency histograms, error trends, cost attribution and request replay. Prometheus + Grafana dashboards pinpoint issues in minutes.

RELEASE

Canary release & rollback

Version-pinned canary release, A/B traffic splitting and gradual ramp-up. Real-time KPI monitoring with one-click rollback minimizes launch risk.

QUOTA

Multi-dimensional quotas

Rate limits by tenant, API key, model and time window. Burst buffering, priority queues and budget caps protect GPU clusters and keep spend predictable.

PRIVATE

Private deployment

K8s Helm charts and turnkey delivery for Xinchuang environments. Gateway and cache run entirely in customer networks, with support for domestic chips, GM-crypto and air-gapped deployments.

Capability parity

Fireworks for the world. TokensChain for China.

We mirror Fireworks' validated product surface on China's clouds — and add what only matters in-country: deep compliance and localization.

Capability

Fireworks · Global

TokensChain · China

Serverless inference

LLaMA / DeepSeek / Qwen and other open models

DeepSeek / Qwen / Kimi / GLM / MiniMax / Doubao / Hunyuan

OpenAI-compatible API

Yes

Yes · one-line migration

Fine-tuning / RL

LoRA · RL · quantization-aware

LoRA · RL · quantization-aware · data stays in-country

Multi-cloud routing

AWS · GCP · Azure

Alibaba · Tencent · Huawei · Volcano · Baidu

Enterprise compliance

SOC2 · HIPAA · GDPR

MLPS 2.0 · CAC filings · algorithm registration · two-way content moderation

Self-hosted

BYOC · enterprise tier

BYOC · Xinchuang ready · GM crypto · air-gapped

Billing & procurement

USD card · enterprise contracts

RMB / USD · VAT special invoice · in-country entity

Fine-tuning workflow

Fine-tune any model in three simple steps.

A fully-managed pipeline with zero infra overhead. Every step — upload to production — stays in-country.

Upload your dataset

Securely upload private data via the console or API. JSONL, CSV and Parquet supported, with automatic quality checks and at-rest encryption.

Configure & launch training

Pick a base model, tune LoRA / QLoRA / RL hyperparameters, set budget and wall-clock caps. Hit start — GPU clusters spin up automatically.

Track & deploy

Watch loss, throughput and eval metrics live. When training ends, deploy to a serverless endpoint or reserved capacity in one click — same API as the base model.

Choose how you pay

Serverless or Reserved — mix to fit your workload.

Both share one OpenAI-compatible API and can coexist in a single project: reserve capacity for core pipelines, run elastic and experimental traffic on serverless.

On-demand

Serverless inference

Invoke any model instantly — zero setup, per-token billing. Ideal for bursty traffic, prototyping and SMB-scale production.

·No infrastructure to manage
·Pay only for what you use
·Auto-scales with traffic spikes
·Ideal for startups and variable workloads

Start now

Reserved

Reserved GPU instances

Dedicated GPUs for mission-critical workloads — predictable latency, throughput and enterprise SLA. 30–50% cheaper than on-demand at scale.

·Guaranteed capacity, zero queueing
·Isolated infra, physical security
·Predictable pricing and billing
·Ideal for steady production & enterprise apps

Talk to sales

Deployment modes

From public cloud to sovereign — every environment covered.

Public cloud SaaS

Turnkey, pay-as-you-go. Built for SMBs and developers.

— Free tier
— 5-minute setup
— Fully managed

Dedicated VPC

Gateway runs inside your VPC — data never leaves your cloud account.

— Isolated billing
— Dedicated routing
— VPN connectivity

On-premise license

Source-code delivery into your network. Xinchuang hardware and GM crypto supported.

— Source license
— Xinchuang ready
— On-site support

See what's in the model marketplace →

Browse models

From inference to intelligence.China compute, global developer experience.

One platform for the entire LLM lifecycle.

From prompt to production in seconds

Fine-tune any open model on your private data

Scale across clouds, regions and compliance zones

Asset-light. Pure software. Built to scale.

Four layers of model-as-a-service infrastructure.

Unified Model Gateway

Intelligent Compute Scheduling

Security & Compliance

Observability & Optimization

Common questions about the four core modules.

Unified Model Gateway

Do I need to change a lot of code to integrate with existing systems?

What happens if a model times out or returns a malformed response?

Intelligent Compute Scheduling

How is service stability and SLA guaranteed during traffic spikes like flash sales?

What happens if a cloud provider node suddenly experiences high latency?

Security & Compliance

Will data leave China? How do you meet regulatory requirements?

What if content moderation falsely blocks legitimate business requests?

Observability & Optimization

How do I integrate with an existing Prometheus / Grafana monitoring stack?

How do I quickly root-cause slow requests?

Not just a feature list — a quantified commitment.

Built for enterprise AI workloads.

Smart routing engine

Semantic cache

Request batching

Two-way content moderation

Unified metering & billing

End-to-end observability

Canary release & rollback

Multi-dimensional quotas

Private deployment

Fireworks for the world. TokensChain for China.

Fine-tune any model in three simple steps.

Upload your dataset

Configure & launch training

Track & deploy

Serverless or Reserved — mix to fit your workload.

Serverless inference

Reserved GPU instances

From public cloud to sovereign — every environment covered.

Public cloud SaaS

Dedicated VPC

On-premise license

See what's in the model marketplace →

From inference to intelligence.
China compute, global developer experience.