Reproducible Benchmark
Powered by Modal — serverless GPU infra

vLLM vs SGLang vs llama.cpp
FP16 & AWQ on a single NVIDIA L4

Head-to-head LLM serving benchmark across five engine configurations, two workload regimes, and concurrency levels from 1 to 64. FP16 and AWQ quantization tested on the same hardware. All runs on a single Modal cloud GPU serving Qwen2.5-7B-Instruct. No cherry-picking — the numbers speak.

AWQ throughput king
vLLM AWQ
976 tok/s @ c=64 · +17% over FP16
FP16 throughput
SGLang
914 tok/s @ c=64 · +10% over vLLM
Lowest TTFT
vLLM
70 ms @ c=1 · 42% faster first token
Lowest latency c=1
llama.cpp
2.7 s @ c=1 · 2.8x faster E2E
AWQ efficiency
SGLang AWQ
506 tok/s @ c=64 · 3.8 GB VRAM

Methodology

Rigorous, reproducible setup designed to isolate engine behavior from confounding variables.

Hardware

GPU
NVIDIA L4 (24 GB)
Provider
Modal (per-second billing)
CUDA
12.4.1

Model

FP16
Qwen2.5-7B-Instruct (~15 GB)
AWQ
Qwen2.5-7B-Instruct-AWQ (~5.2 GB)
GGUF
Q4_K_M quantization (~4.4 GB)

Workload

Short
≤256 in, 128 out (chat)
Long
≤2048 in, 512 out (RAG)
Sampling
Greedy (temp=0)
Repeats
3x short, 1x long
vLLM
--max-num-seqs 64 · --gpu-mem-util 0.90
v0.8.5
SGLang
--max-running-req 64 · --mem-frac 0.85 · --disable-cuda-graph
v0.4.6
llama.cpp
-np 4 --parallel 4 -ngl 99 -c 16384
b5540
vLLM AWQ
--quantization awq --enforce-eager
v0.8.5
SGLang AWQ
--quantization awq --disable-cuda-graph
v0.4.6

Benchmark Process

How prompts are generated, how measurements are taken, and what the pipeline looks like end to end.

01

Prompt Generation

Synthetic prompts generated deterministically (seed=42) from two template families:

Short regime
50 CS/systems topics (mutex vs semaphore, CAP theorem, Raft consensus, Bloom filters, etc.) posed as single-turn questions. ~200 input tokens, 128 max output tokens. Models conversational chat workloads.
Long regime
10 domain pairs (distributed systems + consensus, ML + transformers, OS + scheduling, etc.) each paired with one of 10 analytical questions. Synthetic RAG context (~1200 tokens) padded to ~1800 input tokens, 512 max output tokens. Models retrieval-augmented generation workloads.

100 prompts per regime, committed to prompts/workload.jsonl. Fully deterministic — same seed, same prompts, same order.
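
A minimal sketch of what the short-regime half of that generator could look like. The real template families live in scripts/generate_workload.py; the topic list, field names, and prompt wording below are illustrative, not the repo's actual templates.

# Illustrative short-regime generator (not the repo's scripts/generate_workload.py)
import json
import random

SEED = 42
SHORT_TOPICS = ["mutex vs semaphore", "CAP theorem", "Raft consensus", "Bloom filters"]  # 50 topics in practice

def generate_short_prompts(n: int = 100) -> list[dict]:
    rng = random.Random(SEED)  # fixed seed: same prompts, same order, every run
    prompts = []
    for i in range(n):
        topic = rng.choice(SHORT_TOPICS)
        prompts.append({
            "id": f"short-{i:03d}",
            "regime": "short",
            "prompt": f"Explain {topic} to an experienced engineer, with one concrete example.",
            "max_output_tokens": 128,
        })
    return prompts

if __name__ == "__main__":
    # committed so every run, on any machine, sees identical inputs
    with open("prompts/workload.jsonl", "w") as f:
        for record in generate_short_prompts():
            f.write(json.dumps(record) + "\n")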

02

Server Startup & Warmup

Each engine is launched inside a Modal container with its respective serving flags. Client and server are colocated in the same container to eliminate network variance.

Before measurement, a warmup phase fires 2 × concurrency requests to prefill KV caches and stabilize JIT compilation / CUDA graphs.

Colocation understates real-world TTFT by ~5–20 ms (no network hop). This is documented but not subtracted.
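
A simplified warmup sketch, assuming an OpenAI-compatible endpoint already listening on localhost inside the container; the helper name, port, and request shape are assumptions, not the repo's bench_lib.py.

# Hypothetical warmup helper; assumes the engine exposes an OpenAI-compatible API on localhost.
import asyncio
import httpx

BASE_URL = "http://127.0.0.1:8000/v1"  # client and server share the container, so no network hop

async def warmup(concurrency: int, prompt: str = "Say hello.") -> None:
    # Fire 2 x concurrency short requests so KV caches, JIT compilation, and
    # CUDA graphs settle before any measurement starts.
    async with httpx.AsyncClient(base_url=BASE_URL, timeout=300) as client:
        async def one_request() -> None:
            await client.post("/chat/completions", json={
                "model": "Qwen/Qwen2.5-7B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 16,
            })
        await asyncio.gather(*(one_request() for _ in range(2 * concurrency)))

# asyncio.run(warmup(concurrency=64))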

03

Concurrency Sweep

For each (engine, regime, concurrency) configuration:

  1. Open concurrency simultaneous connections, gated by a semaphore over a shared httpx.AsyncClient.
  2. Send up to max(concurrency × 30, 100) requests — enough to saturate the server and get stable percentiles.
  3. Each request is a streaming OpenAI Chat Completions call (non-streaming for llama.cpp, which lacks stream_options.include_usage).
  4. Record per-request: wall time, TTFT (first content token), TPOT (inter-token latency), output token count, success/failure.

Short regime: 3 repeats per config. Long regime: 1 repeat. Median and spread computed across repeats.
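
A condensed sketch of that loop, under the same assumed endpoint as above. Error handling and TPOT bookkeeping are omitted, prompts are simply cycled, and counting one SSE chunk as one token is a simplification of what the real client does.

# Sketch of the measurement loop for one (engine, regime, concurrency) cell.
import asyncio
import time
import httpx

async def run_sweep(prompts: list[dict], concurrency: int, base_url: str) -> list[dict]:
    n_requests = max(concurrency * 30, 100)
    semaphore = asyncio.Semaphore(concurrency)  # caps in-flight requests
    results: list[dict] = []

    async with httpx.AsyncClient(base_url=base_url, timeout=300) as client:
        async def one(prompt: dict) -> None:
            async with semaphore:
                start = time.perf_counter()
                ttft, tokens = None, 0
                async with client.stream("POST", "/chat/completions", json={
                    "model": "Qwen/Qwen2.5-7B-Instruct",
                    "messages": [{"role": "user", "content": prompt["prompt"]}],
                    "max_tokens": prompt["max_output_tokens"],
                    "stream": True,
                }) as response:
                    async for line in response.aiter_lines():
                        if not line.startswith("data: ") or line == "data: [DONE]":
                            continue
                        if ttft is None:
                            ttft = time.perf_counter() - start  # first content token
                        tokens += 1  # simplification: one SSE chunk ~ one token
                results.append({
                    "wall_time": time.perf_counter() - start,
                    "ttft": ttft,
                    "output_tokens": tokens,
                })

        await asyncio.gather(*(one(prompts[i % len(prompts)]) for i in range(n_requests)))
    return results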

04

Metric Aggregation

From raw per-request JSONL, collect_metrics.py computes for each (engine, regime, concurrency):

Throughput
Total output tokens ÷ wall-clock seconds. Median of repeats.
TTFT p50 / p95
Time from request sent to first content token. Streaming engines only (llama.cpp: N/A).
TPOT p50 / p95
Time per output token after first token: (wall_time − TTFT) ÷ (output_tokens − 1).
E2E Latency p50 / p95 / p99
Total request wall time. Available for all engines.
Success rate
Completed requests ÷ total sent. Requests exceeding 300s timeout count as failures.
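
A sketch of that aggregation under assumed per-request field names; the actual columns and percentile handling in collect_metrics.py may differ.

# Illustrative per-config aggregation (field names assumed, not collect_metrics.py verbatim).
import statistics

def percentile(values: list[float], q: int) -> float | None:
    if len(values) < 2:
        return values[0] if values else None
    return statistics.quantiles(values, n=100)[q - 1]  # q-th percentile cut point

def aggregate(requests: list[dict], wall_clock_seconds: float) -> dict:
    ok = [r for r in requests if r.get("success", True)]
    ttfts = [r["ttft"] for r in ok if r.get("ttft") is not None]  # llama.cpp: no TTFT
    tpots = [
        (r["wall_time"] - r["ttft"]) / (r["output_tokens"] - 1)  # time per output token after the first
        for r in ok
        if r.get("ttft") is not None and r["output_tokens"] > 1
    ]
    e2e = [r["wall_time"] for r in ok]
    return {
        "throughput_tok_s": sum(r["output_tokens"] for r in ok) / wall_clock_seconds,
        "ttft_p50": percentile(ttfts, 50), "ttft_p95": percentile(ttfts, 95),
        "tpot_p50": percentile(tpots, 50), "tpot_p95": percentile(tpots, 95),
        "e2e_p50": percentile(e2e, 50), "e2e_p95": percentile(e2e, 95), "e2e_p99": percentile(e2e, 99),
        "success_rate": len(ok) / len(requests) if requests else 0.0,
    }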

Results

Throughput, first-token latency, and tail latency across concurrency sweeps.

Throughput vs Concurrency

Time to First Token (p50)

End-to-End Latency (p95)

Reproduce & Customize

Run the benchmarks yourself, or adapt them for your own model, GPU, and workload.

Quick start

git clone https://github.com/ree2raz/inference-bench
cd inference-bench
make setup          # venv + deps + modal auth

# Run individual engines (~$8 total Modal credits)
modal run modal_vllm.py --regime short
modal run modal_sglang.py --regime short
modal run modal_llamacpp.py

# AWQ quantization variants
modal run modal_vllm_awq.py
modal run modal_sglang_awq.py

# Generate charts + tables
make report

Customize the model

Edit configs/engines.yaml:

# Swap model — vLLM/SGLang use HF repo IDs
model: "meta-llama/Meta-Llama-3-8B-Instruct"

# llama.cpp — point to a GGUF repo + file
engines:
  llamacpp:
    gguf_model: "bartowski/Meta-Llama-3-8B-Instruct-GGUF"
    gguf_file: "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf"

Then update the server flags in bench_lib.py (VLLM_SERVER_ARGS, SGLANG_SERVER_ARGS, start_llamacpp_server) to match your model's requirements.
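
The exact contents of those constants aren't shown here, so the snippet below is only a hypothetical shape to illustrate the kind of edit: per-engine flag lists with the model ID swapped in.

# Hypothetical shape of the constants in bench_lib.py -- check the real file before editing.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

VLLM_SERVER_ARGS = [
    "--max-num-seqs", "64",
    "--gpu-memory-utilization", "0.90",
    "--max-model-len", "8192",  # tune to the new model's context window
]

SGLANG_SERVER_ARGS = [
    "--max-running-requests", "64",
    "--mem-fraction-static", "0.85",
    "--disable-cuda-graph",
]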

Customize the workload

Edit scripts/generate_workload.py to change prompt topics, token lengths, or template structure. Then regenerate:

python scripts/generate_workload.py
# → writes prompts/workload.jsonl

Adjust regime parameters in configs/workload_short.yaml and configs/workload_long.yaml:

max_input_tokens: 256    # short regime context length
max_output_tokens: 128    # short regime max response
concurrency_levels: [1, 4, 16, 32, 64]
repeats: 3

Change the GPU

Set GPU_TYPE in bench_lib.py. Modal supports A10G, A100, H100, L4, T4, and more. For multi-GPU, edit the @app.cls(gpu=...) decorators in the Modal app files.

# bench_lib.py — line 13
GPU_TYPE = "L4"    # change to "A100", "H100", "A10G", etc.

Budget estimate: L4 ~$0.30/hr, A100 ~$1.50/hr, H100 ~$4.00/hr. Full benchmark suite runs in ~3–4 hours on L4 (~$8).

Run a subset

Skip engines or regimes you don't need:

# Single engine only
modal run modal_vllm.py --regime short

# Combined orchestrator with filters
modal run modal_app.py --engine sglang --regime long

# Parallel all engines (fastest, if you have Modal credits)
make bench-all-parallel

Already-completed configs are automatically skipped — re-running resumes from where you left off.
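
A sketch of how such a resume check could work, assuming one raw JSONL file per (engine, regime, concurrency) cell under results/raw/; the file naming scheme and runner call are assumptions.

# Illustrative resume check; the real skip logic lives in the Modal apps.
from pathlib import Path

def already_done(engine: str, regime: str, concurrency: int,
                 raw_dir: Path = Path("results/raw")) -> bool:
    # Skip a config if its raw JSONL already exists and is non-empty.
    out = raw_dir / f"{engine}_{regime}_c{concurrency}.jsonl"
    return out.exists() and out.stat().st_size > 0

# for c in [1, 4, 16, 32, 64]:
#     if not already_done("sglang", "long", c):
#         run_config("sglang", "long", c)  # hypothetical runner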

Project structure

inference-bench/
├── bench_lib.py               # shared constants, image builders, bench logic
├── modal_vllm.py              # vLLM FP16 (standalone app)
├── modal_sglang.py            # SGLang FP16 (standalone app)
├── modal_llamacpp.py          # llama.cpp (standalone app)
├── modal_vllm_awq.py          # vLLM AWQ (standalone app)
├── modal_sglang_awq.py        # SGLang AWQ (standalone app)
├── modal_app.py               # orchestrator (all FP16 engines)
├── scripts/
│   ├── generate_workload.py   # prompt generation
│   ├── collect_metrics.py     # JSONL → summary.csv
│   └── plot_results.py        # CSV → charts
├── prompts/
│   └── workload.jsonl          # 200 prompts (committed)
├── configs/
│   ├── engines.yaml            # model, versions, flags
│   ├── workload_short.yaml
│   └── workload_long.yaml
├── results/
│   ├── raw/                    # per-run JSONL
│   ├── summary.jsonl
│   ├── summary.csv
│   └── plots/
└── Makefile

Findings

What the data actually says — and where each engine fits.

  1. SGLang leads FP16 throughput at every concurrency level

    At c=64 short, SGLang hits 914 tok/s vs vLLM's 831 tok/s (+10%). The gap holds in the long regime (840 vs 777 tok/s). Radix attention keeps SGLang's per-token overhead consistent regardless of batch size.

  2. vLLM has the lowest TTFT at low concurrency

    At c=1 short, vLLM TTFT is 70 ms vs SGLang's 119 ms — 42% faster first-token response. This advantage erodes at high concurrency as vLLM's TTFT grows faster with load.

  3. llama.cpp (Q4_K_M) is fastest at c=1 but degrades beyond c=4

    At c=1 short, llama.cpp achieves 47 tok/s with 2.7 s p95 latency — nearly 3x the throughput and 2.8x faster E2E than the FP16 engines. The win is mostly memory bandwidth: single-stream decoding is bound by weight reads, and Q4_K_M streams ~4.4 GB of weights per token instead of ~15 GB. However, limited parallelism (--parallel 4) means throughput plateaus around 190 tok/s at c=64.

  4. Per-request throughput is nearly constant for vLLM/SGLang

    Both maintain ~13–17 tok/s per request regardless of concurrency, indicating effective continuous batching. llama.cpp's per-request throughput drops sharply — from 47 tok/s at c=1 to 2.9 tok/s at c=64, which works out to roughly 186 tok/s aggregate and matches the ~190 tok/s plateau noted above.

  5. Long regime amplifies differences

    At c=64 long, SGLang's 840 tok/s is 8% faster than vLLM's 777 tok/s, and SGLang's p95 latency (40 s) is 9% lower. llama.cpp struggles with long sequences at high concurrency — only 55% success rate at c=64.

Where Each Engine Wins

vLLM

Lowest TTFT for single-request workloads (70 ms at c=1). Best choice when latency-sensitive individual responses matter more than aggregate throughput — interactive chatbots, real-time assistants.

SGLang

Highest throughput at every concurrency level. Best for batched workloads, API serving, and high-concurrency production deployments. Radix attention provides consistent TPOT (~60–70 ms) across concurrency levels.

llama.cpp

Fastest single-request latency and smallest footprint. Best for resource-constrained environments (edge, embedded, CPU-only), development/testing, and low-concurrency use cases. Q4_K_M weights are ~4.4 GB vs ~15 GB for FP16.

What This Doesn't Measure

Known limitations of the benchmark scope — important for interpreting results correctly.