# Inference Stack
Production-grade LLM inference API built from scratch: NestJS gateway + Python GPU workers. Scheduling, batching, KV cache, tensor parallelism, multi-modal, all against real GPUs.

Written in TypeScript/NestJS + Python as a learning exercise to understand the same class of problems that OpenAI, Anthropic, and Google solve: GPU resource scheduling, KV cache-aware routing, streaming, dynamic batching, tensor parallelism, and multi-modal inference.

This is not a wrapper around vLLM or TGI. Every layer is built from raw `transformers` + `grpcio` to understand what happens under the hood.

## What it does
```
  Your laptop (NestJS gateway)            Remote GPU cluster (Python workers)
 ┌─────────────────────────┐              ┌──────────────────────────────┐
 │ HTTP API (OpenAI-compat)│              │ GPU-0: worker-0 (gRPC)       │
 │ Scheduler + Batcher     │────gRPC─────→│ GPU-1: worker-1 (gRPC)      │
 │ Router + Model Manager  │              │   or: TP worker (2 GPUs)    │
 │ KV Cache Manager        │              │                              │
 │ Metrics (ClickHouse)    │              │ 8 models, 5 modalities       │
 └─────────────────────────┘              └──────────────────────────────┘
```
- 8 models across 5 modalities: text generation (SmolLM2 family + Qwen3-14B), vision-language (Qwen2.5-VL-3B), text-to-speech (Kokoro-82M), image generation (SD Turbo), video generation (CogVideoX-2B)
- Tensor parallelism: Qwen3-14B split across 2 GPUs via `tp_plan="auto"` + torchrun, with thinking mode (`<think>` tag parsing); see the sketch after this list
- Dynamic batching: 158x throughput improvement at concurrency 8 (2 TPS to 316 TPS)
- KV cache persistence: CPU DRAM-backed session cache with LRU eviction, 14% compute savings on multi-turn
- Runtime mode switching: Gateway SSHs to GPU host to switch between individual workers and tensor-parallel mode
- Full observability: ClickHouse metrics pipeline with TPS, latency percentiles, per-model breakdowns
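
For the tensor-parallel path, the worker's load step reduces to a single `from_pretrained` call launched under torchrun. A minimal sketch, assuming a recent `transformers` release with `tp_plan` support for Qwen3 (the script name is illustrative):

```python
# Run under torchrun so each process owns one GPU, e.g.:
#   torchrun --nproc-per-node 2 serve_tp.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # shard layers across the torchrun world (2 GPUs here)
)
```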
## Architecture
Three separate planes, just like production inference systems:
| Plane | Where | What |
|---|---|---|
| Gateway | Your laptop | NestJS API, scheduler, router, KV cache manager, batch collector |
| GPU Workers | Remote cluster | Python gRPC servers, one per GPU, running raw transformers |
| Metrics | Your laptop | ClickHouse for inference analytics |
The API server never runs on GPU machines. Communication is via gRPC over an SSH tunnel.
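
Concretely, a worker channel is dialed through a locally forwarded port. A minimal Python sketch (port, host, and command are assumptions; the real gateway opens its channels from NestJS):

```python
# Assumed tunnel: forward the worker's gRPC port to the laptop first, e.g.:
#   ssh -N -L 50051:localhost:50051 user@gpu-host
import grpc

# Dial the local end of the tunnel as if the worker were running locally.
channel = grpc.insecure_channel("localhost:50051")
# Block until the channel is usable (or raise after 5 seconds).
grpc.channel_ready_future(channel).result(timeout=5)
```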
## Key systems built
### Scheduler
Priority queue with per-user fairness, aging, backpressure (429 + Retry-After), and configurable timeouts. Integrates with BatchCollector for time-window batching.
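
The actual scheduler is a NestJS service; the sketch below only illustrates the aging + backpressure mechanics, with all names and constants assumed:

```python
import time

class AgingQueue:
    AGING_RATE = 0.1   # priority gained per second of waiting (assumed constant)
    MAX_DEPTH = 256    # beyond this the gateway answers 429 + Retry-After

    def __init__(self):
        # (base_priority, enqueue_timestamp, request) triples
        self.queue: list[tuple[float, float, dict]] = []

    def submit(self, req: dict, base_priority: float) -> None:
        if len(self.queue) >= self.MAX_DEPTH:
            raise RuntimeError("backpressure: reply 429 with Retry-After")
        self.queue.append((base_priority, time.monotonic(), req))

    def pop_next(self) -> dict:
        # Aging: old low-priority requests eventually outrank fresh ones,
        # which is what keeps per-user fairness live under load.
        # O(n) scan for clarity; a real queue would use a heap.
        now = time.monotonic()
        entry = max(self.queue, key=lambda e: e[0] + (now - e[1]) * self.AGING_RATE)
        self.queue.remove(entry)
        return entry[2]
```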
### Batch Collector + GPU-Side Batching
Accumulates requests within a time window, groups by model, dispatches as a single model.generate() call with left-padded inputs. The throughput difference is dramatic:
| Concurrency | Without batching | With batching |
|---|---|---|
| c=1 | 42 TPS | 43 TPS |
| c=8 | 2 TPS | 316 TPS |
| c=32 | OOM | 1,134 TPS |
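
The gateway-side grouping is plain bookkeeping; the GPU-side half is one padded `generate()` call. A hedged sketch using one of the checkpoints listed above (batch assembly details are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"  # one of the SmolLM2 family
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

# Decoder-only models generate from the last position, so pad on the LEFT
# to keep every prompt right-aligned against the generation boundary.
tok.padding_side = "left"
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

def generate_batch(prompts: list[str], max_new_tokens: int = 64) -> list[str]:
    # One generate() call for the whole batch instead of one per request.
    inputs = tok(prompts, return_tensors="pt", padding=True).to("cuda")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Slice off the (padded) prompts; keep only the newly generated tokens.
    completions = out[:, inputs["input_ids"].shape[1]:]
    return tok.batch_decode(completions, skip_special_tokens=True)
```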
### KV Cache (Disaggregated)
CPU DRAM-backed cache on the GPU host. On multi-turn conversations, restores past_key_values from CPU memory instead of recomputing from scratch. Handles transformers v4.x and v5.x cache formats.
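
A sketch of the offload/restore path, assuming the legacy tuple cache format of transformers v4.x (the newer `Cache` objects need the analogous per-layer move):

```python
import torch

def offload_to_cpu(past_key_values):
    # Tuple over layers of (key, value) CUDA tensors; park each in CPU DRAM.
    return tuple(tuple(t.to("cpu", non_blocking=True) for t in kv)
                 for kv in past_key_values)

def restore_to_gpu(cached, device="cuda:0"):
    # Inverse move when the session's next turn arrives.
    return tuple(tuple(t.to(device, non_blocking=True) for t in kv)
                 for kv in cached)

@torch.inference_mode()
def prefill_and_park(model, input_ids, session_cache, session_id):
    # Prefill once, then keep the KV tensors keyed by session so the next
    # turn skips recomputing the shared prefix (LRU eviction not shown).
    out = model(input_ids=input_ids, use_cache=True)
    session_cache[session_id] = offload_to_cpu(out.past_key_values)
    return out.logits[:, -1]
```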
### Model Manager
VRAM-aware model placement with auto-load/unload, GPU affinity, and concurrent load coalescing. Knows which models need tensor parallelism and triggers mode switches automatically.
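
Load coalescing is the subtle part: N concurrent requests for a cold model must share one load rather than trigger N. An illustrative asyncio sketch (the real manager lives in the NestJS gateway; all names are assumed):

```python
import asyncio

class ModelManager:
    """Sketch of concurrent load coalescing only."""

    def __init__(self):
        self._inflight: dict[str, asyncio.Task] = {}

    async def ensure_loaded(self, model_id: str) -> None:
        # Coalesce: if a load for this model is already in flight, await it.
        task = self._inflight.get(model_id)
        if task is None:
            task = asyncio.create_task(self._load(model_id))
            self._inflight[model_id] = task
        try:
            await task
        finally:
            # Drop the entry once the load settles so failures can retry.
            if self._inflight.get(model_id) is task and task.done():
                self._inflight.pop(model_id, None)

    async def _load(self, model_id: str) -> None:
        # Placeholder for the real steps: check free VRAM, pick the GPU with
        # the most headroom, evict idle models, then a gRPC load call.
        await asyncio.sleep(0)
```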
### Router
Picks the best worker per request: model affinity (is the model already loaded?) > least loaded > trigger load on best candidate.
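
The selection policy compresses to a few lines. A sketch with assumed worker fields:

```python
def pick_worker(workers: list[dict], model_id: str) -> dict:
    # 1) Model affinity: prefer workers that already hold the model.
    warm = [w for w in workers if model_id in w["loaded_models"]]
    # 2) Least loaded among the surviving candidates.
    candidates = warm or workers
    best = min(candidates, key=lambda w: w["active_requests"])
    # 3) No warm worker: trigger a load on the best candidate.
    if not warm:
        best["pending_load"] = model_id  # stand-in for the real load trigger
    return best
```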
### Worker Registry
Manages N gRPC worker connections dynamically. Supports runtime mode switching (individual workers <-> tensor parallel) via SSH to the GPU host.
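
A registry sketch of the mode switch; the SSH command and script path below are placeholders, not the project's actual tooling:

```python
import subprocess
import grpc

class WorkerRegistry:
    """Sketch only: tear down channels, restart the fleet, reconnect."""

    def __init__(self, endpoints: list[str]):
        self.channels = {ep: grpc.insecure_channel(ep) for ep in endpoints}

    def switch_mode(self, mode: str, gpu_host: str, new_endpoints: list[str]) -> None:
        # Close existing worker connections before restarting the fleet.
        for ch in self.channels.values():
            ch.close()
        # Placeholder command: restart workers on the GPU host in the new
        # mode ("individual" or "tp"); the real project drives this over SSH.
        subprocess.run(["ssh", gpu_host, f"./run_workers.sh --mode {mode}"], check=True)
        self.channels = {ep: grpc.insecure_channel(ep) for ep in new_endpoints}
```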