help to search the serverless gpu platforms in US a...
Created: May 29, 2025
Help me search for serverless GPU platforms in the US and China, and find their inference-abstraction libraries. Highlight their strengths (both the platforms and the libraries). I want to use these tech stacks on our own self-hosted cloud or on GPU machines rented from GPU providers.
Below is a field guide to today’s most mature serverless-GPU platforms in the United States and China, together with the inference-abstraction libraries you can lift into your own on-prem cluster or rented GPUs. A one-paragraph summary appears first, then details by region and by library.
Serverless GPUs give you pay-as-you-go, burst-to-millions concurrency without pre-provisioning nodes. In the U.S. the fastest-moving vendors are Modal, RunPod, Baseten, Replicate/Banana, and the big-cloud options (AWS, Google, Azure). In China, Alibaba Cloud Function Compute GPU is already GA, while Tencent SCF and Baidu’s BML expose GPU-backed functions in beta. For do-it-yourself deployments the open-source stacks NVIDIA Triton, KServe, Ray Serve, BentoML/OpenLLM, TorchServe and Paddle Serving reproduce nearly the same autoscaling, batching and multi-model routing you get from the hosted services—so you can start in a vendor cloud but migrate the very same model images onto bare-metal A100/H100 boxes you rent tomorrow.
Platform | Notable strengths for inference | Self-host story |
---|---|---|
Modal | Rust-based container engine gives <1 s cold-start; Python decorator `@modal.function(gpu=True)` auto-scales to zero (see the sketch after this table); BYOC support if you plug in your own AWS/GCP creds. | You can run their open-source image builder locally, then schedule workers on any Kubernetes cluster. |
RunPod | “FlashBoot” snapshots get 2 s spin-up; global fail-over, SOC2 Type 1; per-second billing; CLI to mount custom Docker images. | Their serverless runtime is SaaS-only, but the same Dockerfile can be executed on RunPod Community or your own nodes. |
Baseten | Opinionated model-serving focus; integrated model registry & UI; auto GPU/CPU mix; easy fine-tune to OpenAI-style chat endpoints. | Uses open-source Truss spec—just truss build && truss run locally. |
Replicate & Banana | Massive zoo of pre-trained models; open-source Cog spec for images; GitHub-based CI/CD; pass-through GPU pricing. | cog predict runs the very same container on any machine with Docker + NVIDIA runtime. |
AWS SageMaker “serverless” | Automatic concurrency scaling; integrates with MLOps features (Model Monitor, Pipelines). Note: today’s Serverless Inference tier is CPU-only—GPU is still on the roadmap, so you must fall back to Real-Time Endpoints for GPUs. | You can export the model tarball and launch with Triton or TorchServe on-prem. |
Google Vertex AI | Stateless “Prediction” endpoints scale GPUs on demand; integrates with TPU as well. | Models are stored in GCS; you can pull the artifact into any Docker host. |
Azure ML Managed Online Endpoints | Request-based autoscaling, GPU SKUs, can scale to zero; integrates with ACI/AKS. | Under the hood images are standard OCI—portable to self-run KServe or Triton. |
Inferless & Atlas Cloud (new) | Aggressive cold-start benchmarks (<300 ms); very low-level GPU sharing optimizations. | Early-stage but both expose Docker spec you can bring home. |
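The decorator pattern the Modal row refers to looks roughly like the sketch below. It assumes a recent Modal SDK (`modal.App`, `@app.function(gpu=...)`, `.remote()`); exact decorator names and GPU arguments vary between SDK versions, and the model itself is a placeholder.

```python
# Hedged sketch of a Modal-style serverless GPU function; check the Modal docs
# for the exact API of your SDK version.
import modal

app = modal.App("demo-inference")

# Image built in the background by Modal's builder; packages are placeholders.
image = modal.Image.debian_slim().pip_install("torch", "transformers")

@app.function(image=image, gpu="A100")  # scales to zero when idle
def predict(prompt: str) -> str:
    # Real model loading/inference would go here.
    return prompt.upper()

@app.local_entrypoint()
def main():
    # Runs the function remotely in a GPU container.
    print(predict.remote("hello serverless gpu"))
```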
Platform | Highlights | Portability |
---|---|---|
Alibaba Cloud Function Compute – Serverless GPU | GA in 2025; optimized “start/stop” pipeline for GPUs, <5 s init; millisecond-level billing. | Standard container images—can run on any Kubernetes cluster. |
Tencent Cloud SCF (GPU beta) | Same SCF UX but GPU SKUs in closed beta; millisecond billing; integrates with Serverless Framework. | Functions are OCI images—portable. |
Baidu BML / PaddlePaddle Cloud | Prediction service exposes TensorRT / Paddle optimized backend; supports up to 8 × V100 per func. | Paddle Serving (open-source) lets you mirror workloads locally. |
Library | Why it matters | Typical pairing |
---|---|---|
NVIDIA Triton Inference Server | Multi-framework (PyTorch, TensorFlow, ONNX, TensorRT); dynamic & static batching; concurrent model execution on a single GPU. | Great on bare-metal A100/H100 or as backend inside KServe. |
KServe | Cloud-agnostic, Knative-powered “scale-to-zero” on Kubernetes; GPU/CPU autoscaling; standard V2 inference protocol. | Lets you replicate a Modal-style UX on your own cluster. |
Ray Serve | Pythonic DAGs + dynamic replica count; integrates with Ray data processing and scheduling; good for multi-model ensembles (see the sketch after this table). | Drop-in for any Ray cluster—including on-prem Slurm or Kubernetes. |
BentoML / OpenLLM | bentoml build creates OCI image with model + dependencies; CLI serves models locally or pushes to Bento Cloud; OpenAI-compatible endpoints for LLMs. | Easiest path from notebook → container → GPU node. |
TorchServe | Official PyTorch project; multi-model workers, model archiver; integrates with SageMaker & Kubernetes. | Ideal if your models are already PyTorch. |
Paddle Serving | One-command deployment; extensive Chinese NLP/CV preprocessors; designed for high QPS at Baidu scale. | Natural pairing with Baidu BML / PaddlePaddle Cloud workloads you want to mirror locally. |
Cog (Replicate) | Minimal cog.yaml captures model + hardware; identical syntax for local Docker and cloud runner. | Runs the same container on any machine with Docker + the NVIDIA runtime. |
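To make the self-host story concrete, here is a minimal Ray Serve deployment of the kind the table describes: a class-based deployment with a fixed replica count and one GPU per replica. The deployment name and payload handling are placeholders, not any particular model.

```python
# Minimal Ray Serve sketch: class deployment, N replicas, 1 GPU each.
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class EchoModel:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        # A real model would run inference here instead of echoing.
        return {"echo": payload}

app = EchoModel.bind()
# Start locally (or on any Ray cluster) with:
#   serve run my_module:app
```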
(Tip: launching Triton with `--model-control-mode=poll` makes it re-scan the model repository periodically, so new model versions are picked up without a restart.) With this mix you can burst to serverless clouds when demand spikes, yet keep predictable cost control and data sovereignty by running identical images on hardware you own or lease.
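One portability payoff worth noting: the client code is identical whether Triton runs behind a hosted endpoint or on your own box. A hedged sketch with the official Python HTTP client follows; the URL, model name, and tensor names are placeholders for whatever your model config declares.

```python
# The same Triton HTTP client call works against hosted or self-hosted servers.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # placeholder URL

# Input/output names and shapes must match the model's config.pbtxt.
inp = httpclient.InferInput("INPUT0", [1, 4], "FP32")
inp.set_data_from_numpy(np.ones((1, 4), dtype=np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```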
Help me find more Chinese companies like 共绩算力 and 超神经, especially smaller ones that are trying to build developer communities.
Below is a field guide to eight smaller Chinese GPU-compute startups or community-driven platforms that resemble 共绩算力 and 超神经. All of them (a) provide low-friction GPU or serverless-inference capacity, and (b) actively cultivate a developer community through Discord/WeChat groups, tech salons, open-source projects, or free-compute giveaways.
# | Company / Project | Core offer | How they build community | Why you might care |
---|---|---|---|---|
1 | 共绩算力 (GongjiCloud) | Serverless-GPU push-to-deploy; pay-per-millisecond | WeChat dev group ≈ 4 k members, weekly code-along livestreams | A "Chinese RunPod": 0→N GPUs in <2 s, RTX 4090 at ¥1.68/h |
2 | 超神经 (HyperAI) | Colocation + inference API for LLMs | Runs HyperAI Club meet-ups with OSC, 200-person Beijing event Oct-2024 | Deep-dive talks on Triton/TensorRT; good place to hire enthusiasts |
3 | 九天·毕昇 | Always-free V100 notebook service (China Mobile) | Forum + QQ group with daily GPU-grab rushes; "free GPU" tutorials are popular on Zhihu | Handy for onboarding interns—no card-quota paperwork |
4 | OpenI 启智 | Open-source Git + datasets + free cloud ("inclusive compute") | Seasonal "启智算力贺新年" drives and hackathons; >30 k tasks run | API lets you burst research jobs into >2.6 million GPU-card-hours at zero cost |
5 | 无问芯穹 (Infini-AI) | “M × N” middleware: one model image → many Chinese GPUs/NPUs | Road-shows with chip-vendors; Slack-like community for porting kernels | Easiest path if you need domestic chips (Hopu, Biren, Cambricon) |
6 | 极星 Extreme Star | K8s-native algorithm-deployment/edge-cloud toggle | Market-place where indie devs sell CV models; Discord-style booths at CVPR China | Good if you must ship the same container to edge NVRs and GPUs |
7 | ModelScope xGPU | Free H100/4090 hours inside Alibaba ModelScope | Weekly “xGPU Friday” live demos & leaderboard challenges | One-click host of notebook → public demo; community helps debug |
8 | 北鲲云 BKun Cloud-HPC | Hour-rent A100/H100 clusters; API resembles AWS Batch | Maintains a user chat group + monthly webinars on cluster tuning | Rare Chinese SME with ISO 27001 + a patent pool—useful for compliance |
共绩算力: publishes detailed cost breakdowns and open Docker templates, and every Friday holds a live Twitch-style "run Stable Diffusion from zero" (从 0 跑通 SD) session. New users land in a WeChat group where staff answer latency-tuning questions in real time.
超神经 HyperAI: more than a hoster, it co-organises an offline meetup series with 开源中国 (OSC) so devs can demo their work in person, swap Triton configs, and earn swag.
OpenI 启智: acts like China's Hugging Face, offering Git hosting, a model hub, and a standing pool of free GPUs/NPUs donated by 20+ supercomputing centres. Its compute-sharing ("算力共享") campaigns regularly give away multi-card quotas that pull in thousands of participants.
(Other entries follow similar patterns; see the table above for a quick scan.)
- Prototype in the community sandbox: use 九天·毕昇 or ModelScope xGPU to verify container images at zero cost; both expose Jupyter/VS Code with root so you can bake in CUDA libs.
- Package once, run anywhere: most listed platforms accept vanilla Docker; stick to Triton/KServe inside so migration to your own Kubernetes or bare-metal boxes is copy-paste.
- Leverage their open-source bits.
- Community as free QA: early latency or OOM issues? Post logs in the respective groups; maintainers respond fast because solving your case improves their docs and vendor pitch.
Action | Where | Typical turnaround |
---|---|---|
Join WeChat / QQ groups | Gongji, Jiutian, 北鲲云 | instant |
Submit issue / PR | OpenI, ModelScope | <24 h reply |
Offline meetup | HyperAI Club, 极视角生态日 | Monthly |
Hackathon / free-GPU drive | OpenI “贺新年”, xGPU Friday | Quarterly |
With this shortlist you can explore China’s grass-roots GPU-cloud scene, tap lively user communities for support, and still maintain full portability to the self-hosted or rented hardware you ultimately plan to run.
Since GPU providers in China are cheaper, I want to build a helper project that replicates the Cog/Modal/Replicate/Hugging Face deployment workflows on them. Help me do this.
Below is a build-plan for a “China-ready Cog/Modal–style deployment helper”—a thin open-source layer that lets you package a model once (Cog spec) and push the image to low-cost Chinese GPU providers such as 共绩算力, 超神经 HyperAI, Alibaba Function Compute GPU, Tencent SCF GPU, OpenI 启智, Jiutian 毕昇, ModelScope xGPU, 极星 Extreme Star, 无问芯穹 Infini-AI and 北鲲云 Cloud-HPC.
The design keeps 100 % parity with Western workflows (Cog, Modal, Replicate “run”, Hugging Face Inference Endpoints) so you can burst between regions or migrate home to KServe / Triton on your own rented A100 / H100 boxes.
Scheme | What to copy | Portable artefact |
---|---|---|
Cog (Replicate) | cog.yaml → auto-build NVIDIA-optimised Docker, declarative inputs/outputs, CLI cog predict | OCI image; runs anywhere Docker + nvidia-runtime works GitHub |
Modal | Python decorator @modal.function(gpu=True) + background image builder; sub-second cold start GitHub | |
Hugging Face Inference Endpoints | Image types for TGI/TensorRT; ENV-driven config so the same image can start on CPU, GPU, or Inferentia Hugging Face | |
Replicate SaaS | GitHub-triggered CI that rebuilds and re-pushes Cog images on tag GitHub |
These all boil down to “build a GPU-ready OCI image + emit simple metadata (model name, handler, resources)”—exactly what Chinese providers already accept.
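A hypothetical sketch of that metadata, expressed as a Python dataclass, is shown below. The field names are illustrative only; they are not any provider's real schema, and each adapter would remap them to its own YAML/JSON.

```python
# Illustrative bundle metadata the helper could emit next to the OCI image.
# Field names are hypothetical, not a real provider schema.
from dataclasses import dataclass, field

@dataclass
class BundleMetadata:
    model_name: str                       # e.g. "llama2-13b-chat"
    image_uri: str                        # registry path of the built OCI image
    handler: str = "model.py:Predictor"   # entry point inside the image
    gpu_sku: str = "gpu-4090-1"           # remapped per provider by each adapter
    min_replicas: int = 0                 # scale-to-zero
    max_replicas: int = 4
    env: dict = field(default_factory=dict)
```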
Provider | API trait that matters | Community hook |
---|---|---|
共绩算力 GongjiCloud | REST + YAML spec; millisecond billing; sub-2 s “FlashBoot” cold start 共绩算力 | 4 k-user WeChat group; weekly live coding |
超神经 HyperAI | CLI & HTTP deploy; custom TensorRT kernels for Llama-2 13B GitHub | HyperAI Club meet-ups |
Alibaba Function Compute GPU | Native GPU functions; region-wide auto-scaling; pay-as-you-go AlibabaCloud | Alibaba Cloud DevRel |
Tencent SCF GPU (beta) | Same SCF YAML; millisecond billing; integrates with Serverless Framework tencentcloud.com | Closed beta Slack |
OpenI 启智 | Free burst pool (V100/A800/NPU); Git-first CI | 30 k-member hackathons OpenI |
Jiutian 毕昇 | Always-free V100 notebooks for PoC Zhihu Columns | “抢卡” QQ 群 crowds |
ModelScope xGPU | One-click public demo; free 4090/H100 hours ModelScope | Friday leaderboard |
极星 Extreme Star | Edge ↔ Cloud dual deploy; visual model shop extremevision.com.cn | CVPR China road-show |
无问芯穹 Infini-AI | M × N middleware lets Cog images run on Biren, Cambricon etc. Infini AI Cloud | Slack-like kernel porting |
北鲲云 Cloud-HPC | Hour-rent H100 clusters; ISO 27001 compliance BKun Yun | Monthly tuning webinars |
All expose plain OCI containers, so a Cog-style build is already compatible; you mainly need an adapter that translates your metadata → their YAML/JSON spec.
ModelBundle
```text
├─ bundle/
│  ├─ cog.yaml          # model metadata
│  ├─ Dockerfile        # auto-generated by Cog
│  ├─ start.sh          # runtime entry (triton, bentoml, etc.)
│  └─ requirements.txt
```

Build once with `cog build` and tag the image `model:sha`.
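When the bundle uses Cog's own runtime, the model handler follows Cog's documented `BasePredictor` interface; a minimal sketch with a stand-in model is shown here.

```python
# predict.py: minimal Cog predictor; the "model" is a stand-in for real weights.
from cog import BasePredictor, Input

class Predictor(BasePredictor):
    def setup(self):
        # Load weights once per container start.
        self.model = lambda text: text[::-1]

    def predict(self, prompt: str = Input(description="Text to run through the model")) -> str:
        return self.model(prompt)
```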
```mermaid
classDiagram
    class BaseAdapter {
        +deploy(bundle, resources)
        +delete(model_id)
        +logs(model_id)
    }
    class GongjiAdapter
    class HyperAIAdapter
    class AlibabaFcgAdapter
    class TencentScfAdapter
    BaseAdapter <|-- GongjiAdapter
    BaseAdapter <|-- HyperAIAdapter
    BaseAdapter <|-- AlibabaFcgAdapter
    BaseAdapter <|-- TencentScfAdapter
```
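A Python rendering of that interface might look like the sketch below; method names follow the diagram, while the concrete Gongji stub is purely illustrative (its endpoint and payload are hypothetical, not the provider's real API).

```python
# Adapter interface mirroring the class diagram; provider details are stubs.
from abc import ABC, abstractmethod

class BaseAdapter(ABC):
    @abstractmethod
    def deploy(self, bundle: dict, resources: dict) -> str:
        """Push the bundle's image to the provider; return a model_id."""

    @abstractmethod
    def delete(self, model_id: str) -> None:
        """Tear the deployment down."""

    @abstractmethod
    def logs(self, model_id: str) -> str:
        """Fetch recent runtime logs."""

class GongjiAdapter(BaseAdapter):
    def deploy(self, bundle: dict, resources: dict) -> str:
        # Would translate `bundle` + `resources` into Gongji's deploy payload
        # and POST it to their API (the endpoint name is hypothetical here).
        raise NotImplementedError

    def delete(self, model_id: str) -> None:
        raise NotImplementedError

    def logs(self, model_id: str) -> str:
        raise NotImplementedError
```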
Each adapter maps the bundle's metadata and resource request to the provider's own deploy spec, including the GPU SKU (e.g., `gpu-4090-1` on Gongji).

Pick one of the following runtimes:
Runtime | Why | Docs |
---|---|---|
NVIDIA Triton | Multi-framework, dynamic batching; identical on bare metal and serverless NVIDIA Docs | |
KServe | Knative “scale-to-zero” on any K8s; GPU HPA samples ready-made kserve.github.io GitHub | |
BentoML / OpenLLM | CLI bentoml build embeds GPU driver; OpenAI-style chat endpoint BentoML Documentation GitHub | |
Ray Serve | DAG composition + autoscaled replicas; good for ensembles Ray |
The helper should let the user choose the runtime (`--runtime triton|bento|ray`).
- `helper init`: scaffolds a `cog.yaml` stub and a `model.py` handler (mirrors Cog/Modal semantics).
- `helper build`: wraps `cog build` and pushes the image to a registry (e.g., Alibaba CR).
- `helper deploy --provider gongji`: calls the provider's `/v1/functions` API with the image URI & GPU SKU.
- `helper invoke` / `logs` / `scale`: invoke the deployed endpoint, stream logs, and adjust replica counts.
- `helper export kubernetes`: emits KServe/Triton manifests so the same bundle runs on your own cluster.
The CLI skeleton can be built with Typer (Python) for minimal boilerplate.
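A minimal Typer skeleton for such a CLI is sketched below; the command and option names simply mirror the workflow above and are placeholders rather than a finished tool.

```python
# helper.py: Typer skeleton for the deployment CLI (names are placeholders).
import typer

app = typer.Typer(help="Build once, deploy to serverless-GPU providers in China.")

@app.command()
def build(tag: str = typer.Option(..., help="Image tag, e.g. model:sha")):
    """Wrap `cog build` and push the image to a registry."""
    typer.echo(f"building and pushing {tag} ...")

@app.command()
def deploy(
    provider: str = typer.Option("gongji", help="gongji | hyperai | alifc | scf"),
    runtime: str = typer.Option("triton", help="triton | bento | ray"),
):
    """Look up the matching adapter and hand it the bundle metadata."""
    typer.echo(f"deploying with {runtime} runtime to {provider} ...")

if __name__ == "__main__":
    app()
```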
Step | Action | Reference |
---|---|---|
1 | Scaffold Typer CLI with pluggable adapters | Typer docs |
2 | Vendor Cog as a library (pip install cog ) for image builds GitHub | |
3 | Write GongjiAdapter: translate YAML → POST /api/v1/functions/deploy 共绩算力 | |
4 | Write HyperAIAdapter: wrap their “deploy model” REST endpoint GitHub | |
5 | Alibaba/Tencent adapters: map gpuConfig fields + OSS/CLS logging AlibabaCloud tencentcloud.com | |
6 | Generate KServe YAML using a Jinja template (see the sketch after this table); reuse the sample autoscale_gpu.yaml for hints GitHub | |
7 | Integrate Triton health probe & metrics sidecar (Prometheus) NVIDIA Docs | |
8 | Add GitHub Action to call helper deploy on tag push (mirrors Replicate flow) GitHub | |
9 | Provide examples: Stable Diffusion, Llama-2 chat, Paddle OCR (leverages Paddle Serving) BentoML Documentation | |
10 | Document provider quotas, cost calculators, region IDs. |
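For step 6, a minimal Jinja2 rendering of a KServe InferenceService manifest could look like the sketch below. The template is deliberately simplified (a real manifest usually needs more fields), and the name/image values are placeholders.

```python
# Render a simplified KServe InferenceService manifest with Jinja2.
from jinja2 import Template

KSERVE_TEMPLATE = Template("""\
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: {{ name }}
spec:
  predictor:
    minReplicas: 0              # scale-to-zero, as on the hosted platforms
    containers:
      - name: kserve-container
        image: {{ image_uri }}
        resources:
          limits:
            nvidia.com/gpu: "{{ gpus }}"
""")

manifest = KSERVE_TEMPLATE.render(
    name="llama2-chat",
    image_uri="registry.example.com/llama2:sha",  # placeholder registry path
    gpus=1,
)
print(manifest)
```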
For testing, mock the provider APIs in unit tests with the `responses` library, and add an end-to-end smoke test that deploys a tiny (`nvidia.com/gpu=1`) echo model on Gongji and tears it down.

With this blueprint you can ship a "build once, deploy anywhere" helper that bridges Western model-serving ergonomics to China's budget-friendly GPU landscape—without locking yourself out of on-prem or U.S. bursts.