# Inference Stack
Production-grade LLM inference API built from scratch: NestJS gateway + Python GPU workers. Scheduling, batching, KV cache, tensor parallelism, multi-modal, all against real GPUs.

Written in TypeScript/NestJS + Python as a learning exercise to understand the same class of problems that OpenAI, Anthropic, and Google solve: GPU resource scheduling, KV cache-aware routing, streaming, dynamic batching, tensor parallelism, and multi-modal inference.

This is not a wrapper around vLLM or TGI. Every layer is built from raw `transformers` + `grpcio` to understand what happens under the hood.

## What it does
```
  Your laptop (NestJS gateway)            Remote GPU cluster (Python workers)
 ┌─────────────────────────┐              ┌──────────────────────────────┐
 │ HTTP API (OpenAI-compat)│              │ GPU-0: worker-0 (gRPC)       │
 │ Scheduler + Batcher     │────gRPC─────→│ GPU-1: worker-1 (gRPC)      │
 │ Router + Model Manager  │              │   or: TP worker (2 GPUs)    │
 │ KV Cache Manager        │              │                              │
 │ Metrics (ClickHouse)    │              │ 8 models, 5 modalities       │
 └─────────────────────────┘              └──────────────────────────────┘
```
- 8 models across 5 modalities: text generation (SmolLM2 family + Qwen3-14B), vision-language (Qwen2.5-VL-3B), text-to-speech (Kokoro-82M), image generation (SD Turbo), video generation (CogVideoX-2B)
- Tensor parallelism: Qwen3-14B split across 2 GPUs via `tp_plan="auto"` + torchrun, with thinking mode (`<think>` tag parsing); see the sketch after this list
- Dynamic batching: 158x throughput improvement at concurrency 8 (2 TPS to 316 TPS)
- KV cache persistence: CPU DRAM-backed session cache with LRU eviction, 14% compute savings on multi-turn
- Runtime mode switching: Gateway SSHs to GPU host to switch between individual workers and tensor-parallel mode
- Full observability: ClickHouse metrics pipeline with TPS, latency percentiles, per-model breakdowns
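
For the tensor-parallel path, the worker's load step reduces to a single `from_pretrained` call launched under torchrun. A minimal sketch, assuming a recent `transformers` release with `tp_plan` support for Qwen3 (the script name is illustrative):

```python
# Run under torchrun so each process owns one GPU, e.g.:
#   torchrun --nproc-per-node 2 serve_tp.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # shard layers across the torchrun world (2 GPUs here)
)
```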
## Architecture
Three separate planes, just like production inference systems:
| Plane | Where | What |
|---|---|---|
| Gateway | Your laptop | NestJS API, scheduler, router, KV cache manager, batch collector |
| GPU Workers | Remote cluster | Python gRPC servers, one per GPU, running raw transformers |
| Metrics | Your laptop | ClickHouse for inference analytics |
The API server never runs on GPU machines. Communication is via gRPC over an SSH tunnel.
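
Concretely, a worker channel is dialed through a locally forwarded port. A minimal Python sketch (port, host, and command are assumptions; the real gateway opens its channels from NestJS):

```python
# Assumed tunnel: forward the worker's gRPC port to the laptop first, e.g.:
#   ssh -N -L 50051:localhost:50051 user@gpu-host
import grpc

# Dial the local end of the tunnel as if the worker were running locally.
channel = grpc.insecure_channel("localhost:50051")
# Block until the channel is usable (or raise after 5 seconds).
grpc.channel_ready_future(channel).result(timeout=5)
```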
## Key systems built
### Scheduler
Priority queue with per-user fairness, aging, backpressure (429 + Retry-After), and configurable timeouts. Integrates with BatchCollector for time-window batching.
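
The actual scheduler is a NestJS service; the sketch below only illustrates the aging + backpressure mechanics, with all names and constants assumed:

```python
import time

class AgingQueue:
    AGING_RATE = 0.1   # priority gained per second of waiting (assumed constant)
    MAX_DEPTH = 256    # beyond this the gateway answers 429 + Retry-After

    def __init__(self):
        # (base_priority, enqueue_timestamp, request) triples
        self.queue: list[tuple[float, float, dict]] = []

    def submit(self, req: dict, base_priority: float) -> None:
        if len(self.queue) >= self.MAX_DEPTH:
            raise RuntimeError("backpressure: reply 429 with Retry-After")
        self.queue.append((base_priority, time.monotonic(), req))

    def pop_next(self) -> dict:
        # Aging: old low-priority requests eventually outrank fresh ones,
        # which is what keeps per-user fairness live under load.
        # O(n) scan for clarity; a real queue would use a heap.
        now = time.monotonic()
        entry = max(self.queue, key=lambda e: e[0] + (now - e[1]) * self.AGING_RATE)
        self.queue.remove(entry)
        return entry[2]
```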
### Batch Collector + GPU-Side Batching
Accumulates requests within a time window, groups by model, dispatches as a single model.generate() call with left-padded inputs. The throughput difference is dramatic:
| Concurrency | Without batching | With batching |
|---|---|---|
| c=1 | 42 TPS | 43 TPS |
| c=8 | 2 TPS | 316 TPS |
| c=32 | OOM | 1,134 TPS |
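
The gateway-side grouping is plain bookkeeping; the GPU-side half is one padded `generate()` call. A hedged sketch using one of the checkpoints listed above (batch assembly details are assumptions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "HuggingFaceTB/SmolLM2-360M-Instruct"  # one of the SmolLM2 family
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16).to("cuda")

# Decoder-only models generate from the last position, so pad on the LEFT
# to keep every prompt right-aligned against the generation boundary.
tok.padding_side = "left"
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

def generate_batch(prompts: list[str], max_new_tokens: int = 64) -> list[str]:
    # One generate() call for the whole batch instead of one per request.
    inputs = tok(prompts, return_tensors="pt", padding=True).to("cuda")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Slice off the (padded) prompts; keep only the newly generated tokens.
    completions = out[:, inputs["input_ids"].shape[1]:]
    return tok.batch_decode(completions, skip_special_tokens=True)
```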
### KV Cache (Disaggregated)
CPU DRAM-backed cache on the GPU host. On multi-turn conversations, restores past_key_values from CPU memory instead of recomputing from scratch. Handles transformers v4.x and v5.x cache formats.
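
A sketch of the offload/restore path, assuming the legacy tuple cache format of transformers v4.x (the newer `Cache` objects need the analogous per-layer move):

```python
import torch

def offload_to_cpu(past_key_values):
    # Tuple over layers of (key, value) CUDA tensors; park each in CPU DRAM.
    return tuple(tuple(t.to("cpu", non_blocking=True) for t in kv)
                 for kv in past_key_values)

def restore_to_gpu(cached, device="cuda:0"):
    # Inverse move when the session's next turn arrives.
    return tuple(tuple(t.to(device, non_blocking=True) for t in kv)
                 for kv in cached)

@torch.inference_mode()
def prefill_and_park(model, input_ids, session_cache, session_id):
    # Prefill once, then keep the KV tensors keyed by session so the next
    # turn skips recomputing the shared prefix (LRU eviction not shown).
    out = model(input_ids=input_ids, use_cache=True)
    session_cache[session_id] = offload_to_cpu(out.past_key_values)
    return out.logits[:, -1]
```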
### Model Manager
VRAM-aware model placement with auto-load/unload, GPU affinity, and concurrent load coalescing. Knows which models need tensor parallelism and triggers mode switches automatically.
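
Load coalescing is the subtle part: N concurrent requests for a cold model must share one load rather than trigger N. An illustrative asyncio sketch (the real manager lives in the NestJS gateway; all names are assumed):

```python
import asyncio

class ModelManager:
    """Sketch of concurrent load coalescing only."""

    def __init__(self):
        self._inflight: dict[str, asyncio.Task] = {}

    async def ensure_loaded(self, model_id: str) -> None:
        # Coalesce: if a load for this model is already in flight, await it.
        task = self._inflight.get(model_id)
        if task is None:
            task = asyncio.create_task(self._load(model_id))
            self._inflight[model_id] = task
        try:
            await task
        finally:
            # Drop the entry once the load settles so failures can retry.
            if self._inflight.get(model_id) is task and task.done():
                self._inflight.pop(model_id, None)

    async def _load(self, model_id: str) -> None:
        # Placeholder for the real steps: check free VRAM, pick the GPU with
        # the most headroom, evict idle models, then a gRPC load call.
        await asyncio.sleep(0)
```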
### Router
Picks the best worker per request: model affinity (is the model already loaded?) > least loaded > trigger load on best candidate.
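
The selection policy compresses to a few lines. A sketch with assumed worker fields:

```python
def pick_worker(workers: list[dict], model_id: str) -> dict:
    # 1) Model affinity: prefer workers that already hold the model.
    warm = [w for w in workers if model_id in w["loaded_models"]]
    # 2) Least loaded among the surviving candidates.
    candidates = warm or workers
    best = min(candidates, key=lambda w: w["active_requests"])
    # 3) No warm worker: trigger a load on the best candidate.
    if not warm:
        best["pending_load"] = model_id  # stand-in for the real load trigger
    return best
```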
### Worker Registry
Manages N gRPC worker connections dynamically. Supports runtime mode switching (individual workers <-> tensor parallel) via SSH to the GPU host.
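
A registry sketch of the mode switch; the SSH command and script path below are placeholders, not the project's actual tooling:

```python
import subprocess
import grpc

class WorkerRegistry:
    """Sketch only: tear down channels, restart the fleet, reconnect."""

    def __init__(self, endpoints: list[str]):
        self.channels = {ep: grpc.insecure_channel(ep) for ep in endpoints}

    def switch_mode(self, mode: str, gpu_host: str, new_endpoints: list[str]) -> None:
        # Close existing worker connections before restarting the fleet.
        for ch in self.channels.values():
            ch.close()
        # Placeholder command: restart workers on the GPU host in the new
        # mode ("individual" or "tp"); the real project drives this over SSH.
        subprocess.run(["ssh", gpu_host, f"./run_workers.sh --mode {mode}"], check=True)
        self.channels = {ep: grpc.insecure_channel(ep) for ep in new_endpoints}
```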