S
Aider logoAiderHost an open model as a fast API for a team

Serve a Private OpenAI-Compatible API with vLLM

setuproll@setuproll
93.0Overall score

A production-style setup that serves an open model over an OpenAI-compatible endpoint with high throughput batching, so existing apps and SDKs work unchanged. For teams with an NVIDIA GPU box that want to drop in a self-hosted model behind their own URL.

93.0Score
2.5kVotes
6Components

Install this build

Export
terminal
vllm serve meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 2

Components

Model

  • Llama 3.3 70B Instruct
  • Qwen3 32B for single GPU

Stack

  • vLLM
  • OpenAI-compatible /v1 endpoint
  • Caddy reverse proxy

Hardware

  • 2x A100 80GB or 2x RTX 4090 for 70B
  • CUDA 12.x

Quantization

  • AWQ or GPTQ 4-bit
  • FP8 on Hopper GPUs

How it works

  • Launch vLLM with tensor-parallel across your GPUs
  • It exposes /v1/chat/completions on port 8000
  • Point any OpenAI SDK at the base URL with a dummy key
  • Caddy adds TLS and an auth token in front of it

Clients

  • Aider
  • Continue
  • Any OpenAI SDK

Summary

A production-style setup that serves an open model over an OpenAI-compatible endpoint with high throughput batching, so existing apps and SDKs work unchanged. For teams with an NVIDIA GPU box that want to drop in a self-hosted model behind their own URL.

93.0 score 2.5k votes

0 Reviews

Your rating
Sign in to post

Loading discussion...