AiderHost an open model as a fast API for a team

Serve a Private OpenAI-Compatible API with vLLM

93.0Overall score

A production-style setup that serves an open model over an OpenAI-compatible endpoint with high throughput batching, so existing apps and SDKs work unchanged. For teams with an NVIDIA GPU box that want to drop in a self-hosted model behind their own URL.

93.0Score

2.5kVotes

6Components

Install this build

Export

terminal

vllm serve meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 2

Components

Model

Llama 3.3 70B Instruct
Qwen3 32B for single GPU

Stack

vLLM
OpenAI-compatible /v1 endpoint
Caddy reverse proxy

Hardware

2x A100 80GB or 2x RTX 4090 for 70B
CUDA 12.x

Quantization

AWQ or GPTQ 4-bit
FP8 on Hopper GPUs

How it works

Launch vLLM with tensor-parallel across your GPUs
It exposes /v1/chat/completions on port 8000
Point any OpenAI SDK at the base URL with a dummy key
Caddy adds TLS and an auth token in front of it

Clients

Aider
Continue
Any OpenAI SDK

Summary

93.0 score 2.5k votes

0 Reviews

Your rating

Loading discussion...

← All builds