
Inference Stack

Production-grade LLM inference API built from scratch. NestJS gateway + Python GPU workers. Scheduling, batching, KV caching, tensor parallelism, and multi-modal inference, all against real GPUs.

nestjs · python · hugging-face

A production-grade LLM inference API built from scratch in TypeScript/NestJS + Python, running against real GPUs. Built as a learning exercise to understand the same class of problems that OpenAI, Anthropic, and Google solve: GPU resource scheduling, KV cache-aware routing, streaming, dynamic batching, tensor parallelism, and multi-modal inference.

This is not a wrapper around vLLM or TGI. Every layer is built from raw transformers + grpcio to understand what happens under the hood.

Inference Playground UI

What it does

Your laptop (NestJS gateway)           Remote GPU cluster (Python workers)
┌──────────────────────────┐           ┌──────────────────────────────┐
│  HTTP API (OpenAI-compat)│           │  GPU-0: worker-0 (gRPC)      │
│  Scheduler + Batcher     │──gRPC────→│  GPU-1: worker-1 (gRPC)      │
│  Router + Model Manager  │           │  or: TP worker (2 GPUs)      │
│  KV Cache Manager        │           │                              │
│  Metrics (ClickHouse)    │           │  8 models, 5 modalities      │
└──────────────────────────┘           └──────────────────────────────┘
  • 8 models across 5 modalities: text generation (SmolLM2 family + Qwen3-14B), vision-language (Qwen2.5-VL-3B), text-to-speech (Kokoro-82M), image generation (SD Turbo), video generation (CogVideoX-2B)
  • Tensor parallelism: Qwen3-14B split across 2 GPUs via tp_plan="auto" + torchrun, with thinking mode (<think> tag parsing; see the sketch after this list)
  • Dynamic batching: 158x throughput improvement at concurrency 8 (2 TPS to 316 TPS)
  • KV cache persistence: CPU DRAM-backed session cache with LRU eviction, 14% compute savings on multi-turn
  • Runtime mode switching: Gateway SSHs to GPU host to switch between individual workers and tensor-parallel mode
  • Full observability: ClickHouse metrics pipeline with TPS, latency percentiles, per-model breakdowns
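
A minimal sketch of the <think> handling used for thinking mode, assuming the worker hands back the raw decoded completion as a string (the function name and response fields are illustrative, not the project's actual API):

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(raw: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the final answer in a decoded completion."""
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    answer = THINK_RE.sub("", raw).strip()
    return thinking, answer

# The gateway can then expose the reasoning as a separate response field:
thinking, answer = split_thinking("<think>The user wants one sentence.</think>KV caching reuses attention state.")
```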

Architecture

Three separate planes, just like production inference systems:

Plane          Where            What
Gateway        Your laptop      NestJS API, scheduler, router, KV cache manager, batch collector
GPU Workers    Remote cluster   Python gRPC servers, one per GPU, running raw transformers
Metrics        Your laptop      ClickHouse for inference analytics

The API server never runs on GPU machines. Communication is via gRPC over an SSH tunnel.
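
The gateway proper is NestJS, but the wiring is easiest to show from the Python side (host, port, and the stub name are placeholders; the project's .proto is not reproduced here):

```python
# Assumed tunnel (hypothetical host/port): ssh -N -L 50051:localhost:50051 user@gpu-host
# With the forward in place, the gateway dials the worker as if it were local.
import grpc

channel = grpc.insecure_channel("localhost:50051")
grpc.channel_ready_future(channel).result(timeout=10)   # fail fast if the tunnel is down
# stub = inference_pb2_grpc.InferenceWorkerStub(channel)  # generated gRPC stub (name assumed)
```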

Key systems built

Scheduler

Priority queue with per-user fairness, aging, backpressure (429 + Retry-After), and configurable timeouts. Integrates with BatchCollector for time-window batching.
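
The real scheduler lives in the TypeScript gateway; the policy itself is small enough to sketch in Python (the queue cap, aging rate, and field names are illustrative). The BackpressureError here is what surfaces as HTTP 429 with a Retry-After header:

```python
import time

MAX_QUEUE = 256        # assumed cap: past this the API answers 429 + Retry-After
AGING_PER_SEC = 0.5    # how quickly a waiting request gains priority

class BackpressureError(Exception):
    def __init__(self, retry_after: int):
        self.retry_after = retry_after

class Scheduler:
    def __init__(self):
        self.queue = []              # (priority, enqueue_time, user, request_id)
        self.queued_per_user = {}    # fairness: heavy users pay a growing penalty

    def submit(self, request_id, user, priority):
        if len(self.queue) >= MAX_QUEUE:
            raise BackpressureError(retry_after=5)
        self.queued_per_user[user] = self.queued_per_user.get(user, 0) + 1
        self.queue.append((priority, time.monotonic(), user, request_id))

    def pop(self):
        now = time.monotonic()
        # Effective score: base priority, minus an aging bonus, plus a per-user penalty.
        def score(item):
            priority, enqueued_at, user, _ = item
            return priority - AGING_PER_SEC * (now - enqueued_at) + self.queued_per_user[user]
        item = min(self.queue, key=score)    # lowest score is dispatched first
        self.queue.remove(item)
        self.queued_per_user[item[2]] -= 1
        return item[3]
```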

Batch Collector + GPU-Side Batching

Accumulates requests within a time window, groups by model, dispatches as a single model.generate() call with left-padded inputs. The throughput difference is dramatic:

Concurrency   Without batching   With batching
c=1           42 TPS             43 TPS
c=8           2 TPS              316 TPS
c=32          OOM                1,134 TPS
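
On the worker, the batched dispatch reduces to tokenizing the collected prompts together with left padding, so every prompt ends at the same position and decoding proceeds in lockstep, then making a single generate call. A minimal sketch with raw transformers, using one of the SmolLM2 checkpoints named above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"    # one of the text models listed above
tok = AutoTokenizer.from_pretrained(model_id, padding_side="left")   # left-pad: prompts end aligned
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda")

prompts = ["Explain KV caching in one sentence.", "Write a haiku about GPUs."]
batch = tok(prompts, return_tensors="pt", padding=True).to(model.device)

# One forward pass per decode step serves the whole batch; this is where the throughput win comes from.
out = model.generate(**batch, max_new_tokens=64, do_sample=False)
completions = tok.batch_decode(out[:, batch["input_ids"].shape[1]:], skip_special_tokens=True)
```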

KV Cache (Disaggregated)

CPU DRAM-backed cache on the GPU host. On multi-turn conversations, restores past_key_values from CPU memory instead of recomputing from scratch. Handles transformers v4.x and v5.x cache formats.
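
A sketch of the offload/restore path, assuming the legacy tuple-of-tuples cache format; newer DynamicCache objects can be converted with to_legacy_cache() / from_legacy_cache(), which is roughly what the v4.x/v5.x handling comes down to. The session_cache dict and sid key are illustrative:

```python
def offload_to_cpu(past_key_values):
    """Move a legacy-format cache (per-layer tuples of (key, value) tensors) into CPU DRAM."""
    return tuple(tuple(t.to("cpu") for t in layer) for layer in past_key_values)

def restore_to_gpu(past_key_values, device="cuda"):
    """Bring a previously offloaded cache back onto the GPU before the next turn."""
    return tuple(tuple(t.to(device) for t in layer) for layer in past_key_values)

# Multi-turn flow (sketch): after turn N, stash the cache keyed by session id;
# on turn N+1, pass it back so the shared prefix is not recomputed from scratch.
# out = model.generate(**inputs, past_key_values=restore_to_gpu(session_cache[sid]),
#                      use_cache=True, return_dict_in_generate=True)
# session_cache[sid] = offload_to_cpu(out.past_key_values)
```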

Model Manager

VRAM-aware model placement with auto-load/unload, GPU affinity, and concurrent load coalescing. Knows which models need tensor parallelism and triggers mode switches automatically.
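
The placement decision, sketched in Python (the footprints, names, and eviction rule are illustrative; concurrent load coalescing is omitted): pick a GPU with enough free VRAM, and when nothing fits, evict the least-recently-used model first.

```python
MODEL_VRAM_GB = {"SmolLM2-1.7B": 4, "Qwen2.5-VL-3B": 8, "Qwen3-14B": 30}   # rough, illustrative
TP_MODELS = {"Qwen3-14B"}    # too large for one GPU: routed to the tensor-parallel worker

def place(model, gpus):
    """gpus: {gpu_id: {"free_gb": float, "loaded": [(model_name, last_used_ts), ...]}}"""
    if model in TP_MODELS:
        return "tensor_parallel"                                  # triggers the mode switch
    need = MODEL_VRAM_GB[model]
    fits = [g for g, state in gpus.items() if state["free_gb"] >= need]
    if fits:
        return max(fits, key=lambda g: gpus[g]["free_gb"])        # most headroom wins
    # Nothing fits: pick the GPU with the most free VRAM and evict its least-recently-used model.
    victim_gpu = max(gpus, key=lambda g: gpus[g]["free_gb"])
    lru_model, _ = min(gpus[victim_gpu]["loaded"], key=lambda entry: entry[1])
    return ("evict_then_load", victim_gpu, lru_model)
```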

Router

Picks the best worker per request: model affinity (is the model already loaded?) > least loaded > trigger load on best candidate.
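
As code, the rule is only a few lines (field names are illustrative):

```python
def route(model, workers):
    """workers: list of dicts like {"id": ..., "loaded_models": set, "active_requests": int}"""
    # 1. Model affinity: prefer a worker that already has the model resident in VRAM.
    warm = [w for w in workers if model in w["loaded_models"]]
    if warm:
        return min(warm, key=lambda w: w["active_requests"])   # least loaded among warm workers
    # 2. Cold path: pick the least loaded worker overall and ask the model manager to load there.
    target = min(workers, key=lambda w: w["active_requests"])
    target["pending_load"] = model
    return target
```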

Worker Registry

Manages N gRPC worker connections dynamically. Supports runtime mode switching (individual workers <-> tensor parallel) via SSH to the GPU host.
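
A sketch of the mode switch, assuming helper scripts on the GPU host (host name, script names, and ports are placeholders): stop the per-GPU workers over SSH, start the torchrun tensor-parallel worker, then rebuild the gRPC channels.

```python
import subprocess
import grpc

GPU_HOST = "user@gpu-host"     # placeholder
TP_PORT = 50060                # placeholder port for the tensor-parallel worker

def switch_to_tensor_parallel(registry: dict):
    # Stop the per-GPU workers, then launch the 2-GPU torchrun worker (script names assumed).
    subprocess.run(["ssh", GPU_HOST, "pkill -f worker_server.py || true"], check=True)
    subprocess.run(
        ["ssh", GPU_HOST,
         f"nohup torchrun --nproc_per_node=2 tp_worker.py --port {TP_PORT} >/dev/null 2>&1 &"],
        check=True,
    )
    # Rebuild the connections: drop the old channels, dial the TP worker (via the SSH forward).
    for channel in registry.values():
        channel.close()
    registry.clear()
    registry["tp"] = grpc.insecure_channel(f"localhost:{TP_PORT}")
```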