125 benchmark runs across vLLM, SGLang, and llama.cpp on NVIDIA L4 (24 GB) and A100 (40GB, plus 80GB for the MoE study) via Modal. L4 baseline: vLLM v0.8.5 · SGLang v0.4.6. A100 tab: vLLM v0.20.1 · SGLang :latest (May 2026 sweep). Models: Qwen2.5-7B FP16 and AWQ, Qwen3-8B reasoning, Qwen3-30B-A3B MoE. Concurrency 1–64. L4 engine rankings are stable: the v2 sweep confirmed <3% drift on all SGLang and vLLM FP16 configs.
The bottom line — which engine to pick for your use case, backed by the data.
vLLM: Leads A100 FP16 at every concurrency (+32–45%) and Marlin at c=64 (+113%). Lowest TTFT on L4 (70 ms vs 119 ms, 42% faster). AWQ delivers 2.5x throughput at c=1 at 1/3 VRAM with no meaningful downside. Reasoning workloads 4.3x faster with AWQ. Default for any production deployment on A100 or with quantization.
SGLang: Leads L4 FP16 aggregate throughput at high concurrency (914 vs 831 tok/s at c=64, +10%). On A100 the pattern reverses: vLLM outperforms SGLang FP16 at every concurrency level. Best fit: budget GPU (L4/A10G), FP16 only, throughput-first workloads where AWQ checkpoints don't exist.
llama.cpp: Fastest single-request E2E latency (2.7 s vs 7 s FP16) and smallest footprint (4.4 GB VRAM). Best for edge, embedded, CPU-only, and development environments. Throughput caps at ~190 tok/s at c=64 and success rate drops to 55% on long contexts. Not suited for production API serving above c=4.
What the data actually says — and where each engine fits.
On L4, SGLang hits 914 tok/s vs vLLM's 831 tok/s at c=64 (+10% FP16). But at c=1, vLLM's TTFT is 70 ms vs SGLang's 119 ms (42% faster). The L4 pattern reverses completely on A100: vLLM v0.20.1 outperforms SGLang :latest at every concurrency — 80 vs 61 tok/s at c=1 (+32%), 3,102 vs 2,141 tok/s at c=64 (+45%). SGLang's L4 FP16 advantage is narrow and GPU-class-specific.
When this matters: On A100+ → vLLM for both throughput and latency. On L4 FP16 at high concurrency → SGLang has a narrow +10% edge. L4 AWQ → vLLM by 2x.
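The percentages quoted here are plain ratios of the reported throughputs. A minimal sketch of that arithmetic, using only figures from this page (the helper name `relative_gain` is made up for illustration and is not part of the benchmark code):

```python
def relative_gain(a: float, b: float) -> float:
    """Percent by which throughput a exceeds throughput b."""
    return (a / b - 1.0) * 100.0


# (vLLM tok/s, SGLang tok/s) as reported on this page, FP16
l4_c64 = (831, 914)
a100_c1 = (80.0, 60.5)
a100_c64 = (3102, 2141)

print(f"L4 c=64:   SGLang {relative_gain(l4_c64[1], l4_c64[0]):+.0f}% over vLLM")
print(f"A100 c=1:  vLLM {relative_gain(*a100_c1):+.0f}% over SGLang")
print(f"A100 c=64: vLLM {relative_gain(*a100_c64):+.0f}% over SGLang")
```

Running the same helper over your own measurements is a quick way to check whether a gap is large enough to matter before switching engines.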
vLLM AWQ delivers 43 tok/s at c=1 (2.5x FP16) while using only 5.2 GB VRAM vs 15 GB. At c=64, it hits 976 tok/s (+17% over FP16 vLLM). AWQ's memory savings become critical for reasoning workloads where KV caches hold 8K+ token sequences — AWQ vLLM reaches 345 tok/s at c=16, 2.4x faster than FP16 SGLang (147 tok/s). The one exception: SGLang AWQ suffers a torch 2.5.1 compatibility issue, degrading to 8 tok/s at c=1.
When this matters: Almost always — use AWQ on vLLM if a pre-quantized checkpoint exists. Skip only if your pipeline requires exact FP16 outputs or your GPU predates Ampere.
At c=1 short, Q4_K_M achieves 47 tok/s with 2.7 s p95 latency (2.8x faster E2E than FP16 engines). But limited parallelism (--parallel 4) means throughput plateaus at 190 tok/s by c=64, and long-regime success rate drops to 55%. Best for edge, embedded, and low-concurrency use cases where its ~4.4 GB footprint matters.
When this matters: Edge devices, CPU-only servers, developer laptops, or any context where VRAM is the binding constraint. Not suited for production API serving above c=4.
With Qwen3-8B generating ~6,000 thinking tokens per request, the first answer token (after thinking) takes 145–664 seconds. Qwen3 AWQ on vLLM at c=4 delivers the answer in 145 s, 4.6x faster than Qwen3 on SGLang at c=16 (664 s). For thinking models, "time to useful output" is the metric that matters, not TTFT.
When this matters: Any deployment of thinking models (DeepSeek-R1, Qwen3 thinking mode). Don't benchmark TTFT — benchmark time-to-first-answer-token. The difference is 145 s vs 664 s in this data.
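One way to measure time-to-first-answer-token is to scan the streamed text for the end of the thinking block and record when the first visible answer text arrives. The sketch below assumes the model wraps its reasoning in `<think>...</think>` tags in the streamed text and that you have already collected `(timestamp, chunk)` pairs from your client. The function name and event format are illustrative, not taken from the benchmark harness, and servers that return reasoning in a separate field would need a different check.

```python
from typing import Iterable, Optional, Tuple


def time_to_first_answer_token(
    events: Iterable[Tuple[float, str]],
    request_start: float,
    close_tag: str = "</think>",
) -> Optional[float]:
    """Seconds from request_start to the first non-whitespace text after close_tag.

    events: (timestamp_seconds, text_chunk) pairs in arrival order.
    Returns None if the stream ends before any answer text arrives.
    """
    buffer = ""
    thinking_done = False
    for timestamp, chunk in events:
        if not thinking_done:
            # Accumulate text so a tag split across two chunks is still found.
            buffer += chunk
            idx = buffer.find(close_tag)
            if idx == -1:
                continue
            thinking_done = True
            # Keep only the text that follows the closing tag.
            chunk = buffer[idx + len(close_tag):]
        if chunk.strip():
            return timestamp - request_start
    return None


# Illustrative stream: TTFT would be 0.07 s, but the answer starts much later.
stream = [(0.07, "<think>"), (10.0, "step"), (145.0, "</think>"),
          (145.2, "\n\n"), (145.3, "The answer")]
print(time_to_first_answer_token(stream, request_start=0.0))
```

The buffer is rescanned on every chunk, which is fine for a measurement script but would be worth optimizing for very long thinking traces.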
At c=64 long, SGLang's 840 tok/s is 8% faster than vLLM's 777 tok/s, with 9% lower p95 latency (40 s vs 44 s). llama.cpp drops to 55% success. Longer sequences stress KV cache management, batching efficiency, and memory bandwidth — making engine choice more consequential.
When this matters: RAG pipelines, document summarization, long-context chat. Engine choice matters more at long context — the gap between engines widens as sequence length grows.
The May 2026 v2 sweep (vLLM v0.20.1 vs SGLang :latest) found that engine rankings reverse with GPU class. On L4 FP16, SGLang leads at c=64 (+10%). On A100 FP16, vLLM leads at every concurrency: +32% at c=1 and +45% at c=64. The gap widens with Marlin quantization at high concurrency: vLLM Marlin hits 4,762 tok/s at c=64 while SGLang Marlin collapses to 2,231 tok/s (vLLM 2.1x ahead). Likely cause: SGLang's batch scheduler and Marlin kernel don't scale as efficiently to A100's higher memory bandwidth and compute headroom. vLLM v0.20.1 also improved Marlin performance significantly: 177 tok/s at c=1 on A100 vs 104 tok/s in v0.8.5 (+70%).
When this matters: If your production GPU is A100 (or H100), vLLM is clearly the stronger choice — especially with AWQ Marlin. The L4 benchmark results don't predict A100 engine rankings.
Qwen3-30B-A3B has 30.5B total parameters but only 3.3B active per token. The question: does decode speed track total params or active params? Active params win. BF16 at c=1 on A100 80GB delivers 134 tok/s vs 92 tok/s for dense 7B on the same GPU (1.5x faster), so decode speed tracks active params, not total. But the efficiency gap is significant: MoE BF16 achieves 46% memory-bandwidth efficiency vs dense FP16's 63% (vLLM v0.20.2), because expert routing adds irregular memory access that ideal batching cannot hide. AWQ Marlin efficiency drops further to 11–18% from combined dequantization and expert-load overhead. The practical implication: model MoE deployments on active parameters for throughput and total parameters for VRAM, and budget 10–20% for routing overhead the calculator cannot fully capture. One hard constraint: BF16 at c=16 OOMs on A100 80GB, because 61 GB of weights plus KV cache exceeds the 80 GB limit.
When this matters: Any MoE deployment — Qwen3-30B-A3B, DeepSeek V3/V4, Mixtral. Size VRAM on total parameters, throughput on active parameters. Most cost calculators use the wrong number for one of these.
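The "total for VRAM, active for throughput" rule can be sketched in a few lines. The parameter counts and the 80 GB card size come from this page. The bandwidth value is a round placeholder chosen only to show the ratio, not a spec-sheet figure, and both function names are hypothetical.

```python
def weight_gb(total_params: float, bytes_per_param: float) -> float:
    """Weight memory in GB (1e9 bytes). Every expert must be resident."""
    return total_params * bytes_per_param / 1e9


def decode_ceiling_tok_s(bandwidth_gb_s: float, params_read: float,
                         bytes_per_param: float) -> float:
    """Bandwidth-bound single-stream ceiling for a given number of params read per token."""
    return bandwidth_gb_s * 1e9 / (params_read * bytes_per_param)


TOTAL, ACTIVE, BF16 = 30.5e9, 3.3e9, 2.0  # Qwen3-30B-A3B figures from this page

weights = weight_gb(TOTAL, BF16)
print(f"weights: {weights:.0f} GB, leaving {80 - weights:.0f} GB of an 80 GB card "
      "for KV cache and activations")

ASSUMED_BW_GB_S = 2000.0  # ASSUMPTION: placeholder bandwidth, substitute your GPU's value
by_active = decode_ceiling_tok_s(ASSUMED_BW_GB_S, ACTIVE, BF16)
by_total = decode_ceiling_tok_s(ASSUMED_BW_GB_S, TOTAL, BF16)
print(f"ceiling sized on active params is {by_active / by_total:.1f}x "
      "the ceiling sized on total params")
```

Whatever bandwidth you plug in, the ratio between the two ceilings stays the same, which is why sizing throughput on total parameters understates an MoE model so badly.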
The LLM Deploy Cost Calculator predicts throughput from GPU specs. These benchmarks measure what actually happens when you run the code. Same hardware, same model — theory vs practice.
The calculator models decode as memory-bandwidth-bound: throughput = HBM_bandwidth / (params * bytes_per_param).
| Config (L4, c=1) | Theoretical | Measured | Efficiency |
| --- | --- | --- | --- |
| FP16 (vLLM) | 21.4 tok/s | 17.1 tok/s | 80% |
| AWQ 4-bit (vLLM) | 85.4 tok/s | 43.2 tok/s | 51% |
FP16 achieves 80% of theoretical bandwidth — kernel overhead, attention computation, and KV cache reads account for the gap. AWQ drops to 51% because dequantization and irregular memory access patterns reduce effective bandwidth utilization.
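The single-stream numbers can be reproduced from the formula given earlier in this section. This is a sketch of that arithmetic only, not the calculator's actual code. It uses the 300 GB/s and 7B figures from this page, and the function names are illustrative.

```python
def decode_tok_s(bandwidth_gb_s: float, params: float, bytes_per_param: float) -> float:
    """Bandwidth-bound decode: every weight is read once per generated token."""
    return bandwidth_gb_s * 1e9 / (params * bytes_per_param)


def efficiency(measured_tok_s: float, theoretical_tok_s: float) -> float:
    """Fraction of the bandwidth-bound ceiling actually achieved."""
    return measured_tok_s / theoretical_tok_s


# L4 (300 GB/s), Qwen2.5-7B in FP16 (2 bytes per parameter)
l4_fp16 = decode_tok_s(300, 7e9, 2)
print(f"FP16 theoretical: {l4_fp16:.1f} tok/s")
print(f"FP16 efficiency:  {efficiency(17.1, l4_fp16):.0%}")

# AWQ row: the theoretical value is taken from the table as-is.
print(f"AWQ efficiency:   {efficiency(43.2, 85.4):.0%}")
```

A flat 0.5 bytes per parameter would give a slightly higher AWQ ceiling (about 85.7 tok/s) than the 85.4 tok/s in the table, so the calculator presumably accounts for something this sketch does not.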
At high concurrency, throughput is limited by batch scheduling, KV cache pressure, and CUDA graph fragmentation.
| Config (L4, c=64) | Theoretical (tok/s) | Measured (tok/s) | Efficiency |
| --- | --- | --- | --- |
| FP16 (vLLM) | 1,294 | 831 | 64% |
| FP16 (SGLang) | 1,294 | 914 | 71% |
SGLang's RadixAttention achieves 71% of theoretical at c=64 vs vLLM's 64%; that 7-point gap is SGLang's +10% L4 FP16 throughput advantage (914 vs 831 tok/s) seen from the efficiency side. The calculator's "no continuous batching efficiency" limitation is real: engines achieve 64–71% of ideal at high concurrency.
This comparison uses the same throughput model as the LLM Deploy Cost Calculator, with L4 specs (31 TFLOPS, 300 GB/s memory bandwidth; the L4 uses GDDR6 rather than HBM), Qwen2.5-7B (7B params, GQA, 28 layers, 4 KV heads), and MFU=0.35. The "Theoretical" column is what the calculator predicts; the "Measured" column is what inference-bench recorded on the same GPU. Full benchmark data on GitHub.
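The layer and KV-head counts above are enough to estimate KV cache size, which is what drives the memory pressure at high concurrency and long sequences. The sketch below assumes a head dimension of 128 and an FP16 cache. Neither is stated on this page, so treat both as assumptions. It counts raw tensor bytes only and ignores any allocator or paging overhead an engine adds.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Raw KV cache bytes for one token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# 28 layers and 4 KV heads are from this page; head_dim=128 is an ASSUMPTION.
per_token = kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128)
per_stream_mib = per_token * 8192 / 2**20        # one 8K-token sequence
sixteen_streams_gib = per_token * 8192 * 16 / 2**30

print(f"{per_token} bytes per token")
print(f"{per_stream_mib:.0f} MiB for one 8K-token stream")
print(f"{sixteen_streams_gib:.1f} GiB for 16 concurrent 8K-token streams")
```

Under these assumptions, sixteen 8K-token streams need about 7 GiB of cache, which is roughly the room a 24 GB L4 has left after 15 GB of FP16 usage and helps explain why the smaller AWQ footprint matters for reasoning workloads.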
A100 40GB has 5.2x more bandwidth than L4 (1,555 vs 300 GB/s). Per-stream efficiency drops because compute becomes the bottleneck.
| Config (A100 40GB, c=1) | Theoretical | Measured | Efficiency |
| --- | --- | --- | --- |
| FP16 (vLLM v0.20.1) | 111 tok/s | 80.0 tok/s | 72% |
| FP16 (SGLang :latest) | 111 tok/s | 60.5 tok/s | 54% |
| AWQ Marlin (vLLM v0.20.1) | 444 tok/s | 177.3 tok/s | 40% |
| AWQ Marlin (SGLang :latest) | 444 tok/s | 194.8 tok/s | 44% |
On A100, vLLM v0.20.1 FP16 hits 72% efficiency, up from 65% in v0.8.5. Marlin kernels improved dramatically: 40% in v0.20.1 vs 23% in v0.8.5. At c=1 SGLang Marlin is slightly ahead (194.8 vs 177.3 tok/s), but at c=64 vLLM Marlin reaches 4,762 tok/s while SGLang Marlin collapses to 2,231 tok/s: SGLang's Marlin kernel doesn't scale to high-concurrency batching on A100 the way vLLM's does.
Rigorous, reproducible setup designed to isolate engine behavior from confounding variables.
Throughput, first-token latency, and tail latency across concurrency sweeps.
Baseline hardware is NVIDIA L4 (24 GB, Modal, per-second billing) with Qwen2.5-7B-Instruct. The A100 and MoE regimes run on A100 40GB or 80GB, and the MoE regime uses a different model; see the regime definitions that follow. c=N means N concurrent requests in flight simultaneously. Throughput is aggregate tokens/sec across all concurrent streams. TTFT is time to first token (p50). Latency p95 is tail end-to-end time per request. Regime definitions: Short ≤256 input tokens · Long ≤2048 input tokens · Reasoning ~6K thinking tokens (Qwen3-8B) · A100 = vLLM v0.20.1 + SGLang :latest on A100 40GB (May 2026 sweep) · MoE = Qwen3-30B-A3B active-param study on A100 80GB (BF16) and A100 40GB (AWQ).
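The three headline metrics can be computed from per-request records along these lines. This is a sketch under stated assumptions rather than the harness itself: the `RequestResult` fields are hypothetical, and the nearest-rank percentile is one common convention that may differ from the method used to produce the numbers on this page.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestResult:
    ttft_s: float        # time to first token for this request
    e2e_s: float         # end-to-end time for this request
    output_tokens: int   # tokens generated by this request


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling of pct*n/100
    return ordered[rank - 1]


def summarize(results: List[RequestResult], wall_clock_s: float) -> Dict[str, float]:
    """Aggregate throughput, median TTFT, and tail latency for one concurrency level."""
    return {
        # All tokens from all concurrent streams over the wall-clock duration of the run.
        "throughput_tok_s": sum(r.output_tokens for r in results) / wall_clock_s,
        "ttft_p50_s": percentile([r.ttft_s for r in results], 50),
        "latency_p95_s": percentile([r.e2e_s for r in results], 95),
    }
```

Dividing by wall-clock time rather than summing per-request rates is what makes the throughput figure an aggregate across streams, so queueing delay at high concurrency lowers it.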
Known limitations of the benchmark scope — important for interpreting results correctly.