S
AiderHost an open model as a fast API for a team
Serve a Private OpenAI-Compatible API with vLLM
setuproll@setuproll93.0Overall score
A production-style setup that serves an open model over an OpenAI-compatible endpoint with high throughput batching, so existing apps and SDKs work unchanged. For teams with an NVIDIA GPU box that want to drop in a self-hosted model behind their own URL.
93.0Score
2.5kVotes
6Components
Install this build
terminal
vllm serve meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 2Components
Model
- Llama 3.3 70B Instruct
- Qwen3 32B for single GPU
Stack
- vLLM
- OpenAI-compatible /v1 endpoint
- Caddy reverse proxy
Hardware
- 2x A100 80GB or 2x RTX 4090 for 70B
- CUDA 12.x
Quantization
- AWQ or GPTQ 4-bit
- FP8 on Hopper GPUs
How it works
- Launch vLLM with tensor-parallel across your GPUs
- It exposes /v1/chat/completions on port 8000
- Point any OpenAI SDK at the base URL with a dummy key
- Caddy adds TLS and an auth token in front of it
Clients
- Aider
- Continue
- Any OpenAI SDK
Summary
A production-style setup that serves an open model over an OpenAI-compatible endpoint with high throughput batching, so existing apps and SDKs work unchanged. For teams with an NVIDIA GPU box that want to drop in a self-hosted model behind their own URL.
93.0 score 2.5k votes
0 Reviews
Your rating
Sign in to post
Loading discussion...