Head-to-head LLM serving benchmark across five engine configurations, two workload regimes, and concurrency levels from 1 to 64. FP16 and AWQ quantization tested on the same hardware. All runs on Modal cloud GPU with Qwen2.5-7B-Instruct. No cherry-picking — the numbers speak.
Rigorous, reproducible setup designed to isolate engine behavior from confounding variables.
How prompts are generated, how measurements are taken, and what the pipeline looks like end to end.
Synthetic prompts generated deterministically (seed=42) from two template families:
100 prompts per regime, committed to prompts/workload.jsonl. Fully deterministic — same seed, same prompts, same order.
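A minimal sketch of how such deterministic generation can look; the topics, template wording, and field names below are illustrative, not the actual contents of scripts/generate_workload.py:

```python
import json
import random
from pathlib import Path

SEED = 42
TOPICS = ["databases", "networking", "compilers", "distributed systems"]  # illustrative topics

def build_prompt(topic: str, regime: str) -> dict:
    # Hypothetical templates; the committed workload.jsonl uses the repo's own wording.
    if regime == "short":
        text = f"In two sentences, explain the core idea behind {topic}."
    else:
        text = (
            f"Write a detailed technical overview of {topic}, covering history, "
            "design trade-offs, and concrete examples."
        )
    return {"regime": regime, "prompt": text}

def main() -> None:
    rng = random.Random(SEED)            # fixed seed -> identical prompts on every run
    out = Path("prompts/workload.jsonl")
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for regime in ("short", "long"):
            for _ in range(100):         # 100 prompts per regime, 200 total
                f.write(json.dumps(build_prompt(rng.choice(TOPICS), regime)) + "\n")

if __name__ == "__main__":
    main()
```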
Each engine is launched inside a Modal container with its respective serving flags. Client and server are colocated in the same container to eliminate network variance.
Before measurement, a warmup phase fires 2 × concurrency requests to prefill KV caches and stabilize JIT compilation / CUDA graphs.
Colocation understates real-world TTFT by ~5–20 ms (no network hop). This is documented but not subtracted.
For each (engine, regime, concurrency) configuration:
- concurrency simultaneous connections, driven through an httpx.AsyncClient and capped with an asyncio semaphore.
- max(concurrency × 30, 100) requests per run, enough to saturate the server and produce stable percentiles.
- Streamed responses, with token counts taken from the server-reported usage payload (stream_options.include_usage).
- Short regime: 3 repeats per config; long regime: 1 repeat. Median and spread are computed across repeats.
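A minimal sketch of this measurement loop, assuming an OpenAI-compatible /v1/chat/completions endpoint on localhost; the function names and result fields are illustrative, not the repo's actual code:

```python
import asyncio
import json
import time

import httpx

async def timed_request(client: httpx.AsyncClient, sem: asyncio.Semaphore, prompt: str) -> dict:
    async with sem:                                       # cap in-flight requests at the target concurrency
        start = time.perf_counter()
        ttft = None
        usage = None
        payload = {
            "model": "Qwen/Qwen2.5-7B-Instruct",
            "messages": [{"role": "user", "content": prompt}],
            "stream": True,
            "stream_options": {"include_usage": True},    # final chunk carries token counts
        }
        async with client.stream("POST", "/v1/chat/completions", json=payload) as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data: ") or line == "data: [DONE]":
                    continue
                chunk = json.loads(line[len("data: "):])
                if ttft is None and chunk.get("choices"):
                    ttft = time.perf_counter() - start    # first streamed chunk counts as first token
                if chunk.get("usage"):
                    usage = chunk["usage"]
        return {"ttft_s": ttft, "wall_s": time.perf_counter() - start, "usage": usage}

async def run(prompts: list[str], concurrency: int) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)
    n_requests = max(concurrency * 30, 100)               # enough traffic for stable percentiles
    async with httpx.AsyncClient(base_url="http://127.0.0.1:8000", timeout=None) as client:
        tasks = [timed_request(client, sem, prompts[i % len(prompts)]) for i in range(n_requests)]
        return await asyncio.gather(*tasks)
```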
From raw per-request JSONL, collect_metrics.py computes for each (engine, regime, concurrency):
- Throughput (output tokens per second, aggregated across requests)
- TTFT (time to first token)
- TPOT (time per output token): (wall_time − TTFT) ÷ (output_tokens − 1)
- End-to-end latency percentiles (p95)
- Success rate

Plots cover throughput, first-token latency, and tail latency across the concurrency sweeps.
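A sketch of how the per-request records might be reduced to these summary metrics; the field names and file layout are assumptions, and collect_metrics.py defines the real schema:

```python
import json
import statistics

def summarize(path: str) -> dict:
    """Reduce one run's per-request JSONL into summary metrics (illustrative field names)."""
    records = [json.loads(line) for line in open(path)]
    ok = [r for r in records if r.get("success")]

    ttfts = [r["ttft_s"] for r in ok]
    e2es = [r["wall_s"] for r in ok]
    # TPOT: decode-phase time spread over the tokens generated after the first one.
    tpots = [
        (r["wall_s"] - r["ttft_s"]) / (r["output_tokens"] - 1)
        for r in ok if r["output_tokens"] > 1
    ]
    total_tokens = sum(r["output_tokens"] for r in ok)
    wall = max(r["end_ts"] for r in ok) - min(r["start_ts"] for r in ok)

    return {
        "throughput_tok_s": total_tokens / wall,
        "ttft_p50_ms": statistics.median(ttfts) * 1000,
        "tpot_ms": statistics.median(tpots) * 1000,
        "e2e_p95_s": statistics.quantiles(e2es, n=20)[18],   # 95th percentile
        "success_rate": len(ok) / len(records),
    }
```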
Run the benchmarks yourself, or adapt them for your own model, GPU, and workload.
git clone https://github.com/ree2raz/inference-bench
cd inference-bench
make setup # venv + deps + modal auth
# Run individual engines (~$8 total Modal credits)
modal run modal_vllm.py --regime short
modal run modal_sglang.py --regime short
modal run modal_llamacpp.py
# AWQ quantization variants
modal run modal_vllm_awq.py
modal run modal_sglang_awq.py
# Generate charts + tables
make report

Edit configs/engines.yaml:
# Swap model — vLLM/SGLang use HF repo IDs
model: "meta-llama/Meta-Llama-3-8B-Instruct"
# llama.cpp — point to a GGUF repo + file
engines:
llamacpp:
gguf_model: "bartowski/Meta-Llama-3-8B-Instruct-GGUF"
gguf_file: "Meta-Llama-3-8B-Instruct-Q4_K_M.gguf"

Then update the server flags in bench_lib.py (VLLM_SERVER_ARGS, SGLANG_SERVER_ARGS, start_llamacpp_server) to match your model's requirements.
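For reference, a hedged sketch of what those constants could look like after swapping the model; the values below are plausible defaults, not the repo's shipped settings, so check each engine's documentation for your model:

```python
# bench_lib.py (illustrative values, not the shipped defaults)
VLLM_SERVER_ARGS = [
    "--model", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--max-model-len", "8192",              # fit the new model's context window
    "--gpu-memory-utilization", "0.90",
]

SGLANG_SERVER_ARGS = [
    "--model-path", "meta-llama/Meta-Llama-3-8B-Instruct",
    "--context-length", "8192",
    "--mem-fraction-static", "0.85",
]

def start_llamacpp_server(gguf_path: str) -> list[str]:
    # llama-server command line; -ngl 999 offloads all layers to the GPU.
    return [
        "llama-server",
        "-m", gguf_path,
        "-c", "8192",
        "-ngl", "999",
        "--parallel", "4",
    ]
```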
Edit scripts/generate_workload.py to change prompt topics, token lengths, or template structure. Then regenerate:
python scripts/generate_workload.py
# → writes prompts/workload.jsonl

Adjust regime parameters in configs/workload_short.yaml and configs/workload_long.yaml:
max_input_tokens: 256 # short regime context length
max_output_tokens: 128 # short regime max response
concurrency_levels: [1, 4, 16, 32, 64]
repeats: 3

Set GPU_TYPE in bench_lib.py. Modal supports A10G, A100, H100, L4, T4, and more. For multi-GPU, edit the @app.cls(gpu=...) decorators in the Modal app files, as in the sketch below.
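A hypothetical two-GPU variant of one of the app classes might look like this; the class name, app name, and tensor-parallel hint are illustrative:

```python
import modal

app = modal.App("inference-bench-vllm")

@app.cls(gpu="H100:2", timeout=60 * 60)   # request two H100s for this container
class VLLMServer:
    @modal.enter()
    def start(self):
        # Launch the engine with tensor parallelism across both GPUs,
        # e.g. by adding "--tensor-parallel-size", "2" to the server args.
        ...
```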
# bench_lib.py — line 13
GPU_TYPE = "L4"  # change to "A100", "H100", "A10G", etc.

Budget estimate: L4 ~$0.30/hr, A100 ~$1.50/hr, H100 ~$4.00/hr. Full benchmark suite runs in ~3–4 hours on L4 (~$8).
Skip engines or regimes you don't need:
# Single engine only
modal run modal_vllm.py --regime short
# Combined orchestrator with filters
modal run modal_app.py --engine sglang --regime long
# Parallel all engines (fastest, if you have Modal credits)
make bench-all-parallel

Already-completed configs are automatically skipped; re-running resumes from where you left off.
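A minimal sketch of how such resume logic can work, assuming one raw JSONL file per (engine, regime, concurrency) run; the path layout and naming scheme here are assumptions, not necessarily the repo's:

```python
from pathlib import Path

RAW_DIR = Path("results/raw")

def already_done(engine: str, regime: str, concurrency: int) -> bool:
    """Skip a config if its raw results file already exists and is non-empty."""
    out = RAW_DIR / f"{engine}_{regime}_c{concurrency}.jsonl"   # hypothetical naming scheme
    return out.exists() and out.stat().st_size > 0

def pending_configs(engines, regimes, levels):
    """Yield only the (engine, regime, concurrency) combinations that still need to run."""
    for engine in engines:
        for regime in regimes:
            for c in levels:
                if not already_done(engine, regime, c):
                    yield engine, regime, c
```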
inference-bench/
├── bench_lib.py # shared constants, image builders, bench logic
├── modal_vllm.py # vLLM FP16 (standalone app)
├── modal_sglang.py # SGLang FP16 (standalone app)
├── modal_llamacpp.py # llama.cpp (standalone app)
├── modal_vllm_awq.py # vLLM AWQ (standalone app)
├── modal_sglang_awq.py # SGLang AWQ (standalone app)
├── modal_app.py # orchestrator (all FP16 engines)
├── scripts/
│ ├── generate_workload.py # prompt generation
│ ├── collect_metrics.py # JSONL → summary.csv
│ └── plot_results.py # CSV → charts
├── prompts/
│ └── workload.jsonl # 200 prompts (committed)
├── configs/
│ ├── engines.yaml # model, versions, flags
│ ├── workload_short.yaml
│ └── workload_long.yaml
├── results/
│ ├── raw/ # per-run JSONL
│ ├── summary.jsonl
│ ├── summary.csv
│ └── plots/
└── Makefile

What the data actually says, and where each engine fits.
At c=64 short, SGLang hits 914 tok/s vs vLLM's 831 tok/s (+10%). The gap holds in the long regime (840 vs 777 tok/s). RadixAttention keeps per-token overhead consistent regardless of batch size.
At c=1 short, vLLM TTFT is 70 ms vs SGLang's 119 ms — 42% faster first-token response. This advantage erodes at high concurrency as vLLM's TTFT grows faster with load.
At c=1 short, llama.cpp achieves 47 tok/s with 2.7 s p95 latency, nearly 3x the throughput of the FP16 engines and 2.8x faster end-to-end. This comes largely from Q4's much smaller weights, which cut per-token memory traffic during decode. However, limited parallelism (--parallel 4) means throughput plateaus around 190 tok/s at c=64.
vLLM and SGLang both maintain ~13–17 tok/s per request regardless of concurrency, indicating effective continuous batching. llama.cpp's per-request throughput drops sharply, from 47 tok/s at c=1 to 2.9 tok/s at c=64.
At c=64 long, SGLang's 840 tok/s is 8% faster than vLLM's 777 tok/s, and SGLang's p95 latency (40 s) is 9% lower. llama.cpp struggles with long sequences at high concurrency — only 55% success rate at c=64.
vLLM: lowest TTFT for single-request workloads (70 ms at c=1). Best choice when latency-sensitive individual responses matter more than aggregate throughput: interactive chatbots, real-time assistants.
SGLang: highest throughput at every concurrency level. Best for batched workloads, API serving, and high-concurrency production deployments. RadixAttention provides consistent TPOT (~60–70 ms) across concurrency levels.
llama.cpp: fastest single-request latency and smallest footprint. Best for resource-constrained environments (edge, embedded, CPU-only), development/testing, and low-concurrency use cases. Q4_K_M runs in ~4.4 GB VRAM vs ~14.2 GB FP16.
Known limitations of the benchmark scope — important for interpreting results correctly.