Reproducible Benchmark
Powered by Modal — serverless GPU infra

vLLM vs SGLang vs llama.cpp
FP16, AWQ & Reasoning on NVIDIA L4 + A100

125 benchmark runs across vLLM, SGLang, and llama.cpp on NVIDIA L4 (24 GB) and A100 40GB via Modal. L4 baseline: vLLM v0.8.5 · SGLang v0.4.6. A100 tab: vLLM v0.20.1 · SGLang :latest (May 2026 sweep). Models: Qwen2.5-7B FP16 and AWQ, Qwen3-8B reasoning, Qwen3-30B-A3B MoE. Concurrency 1–64. L4 engine rankings are stable: the v2 sweep confirmed <3% drift on all SGLang and vLLM FP16 configs.


AWQ throughput king: vLLM AWQ · 976 tok/s @ c=64 · +17% over FP16
FP16 throughput: SGLang · 914 tok/s @ c=64 · +10% over vLLM
Lowest TTFT: vLLM · 70 ms @ c=1 · 42% faster first token
Lowest latency c=1: llama.cpp · 2.7 s @ c=1 · 2.8x faster E2E
A100 throughput king: vLLM Marlin (A100) · 4,762 tok/s @ c=64 · +60% vs v0.8.5
Reasoning throughput: Qwen3 AWQ vLLM · 345 tok/s @ c=16 · 14x total output
Scope & Freshness | L4 baseline: vLLM v0.8.5 · SGLang v0.4.6 · llama.cpp Q4_K_M. A100 tab (May 2026 v2 sweep): vLLM v0.20.1 · SGLang :latest — both engines benchmarked across FP16 and AWQ Marlin. Hardware: NVIDIA L4 (24 GB) · A100 40GB (dense/AWQ) · A100 80GB (MoE BF16 only) — H100 and FP8 not covered. L4 relative rankings are the durable finding. A100 tab reflects current engine versions.

Where Each Engine Wins

The bottom line — which engine to pick for your use case, backed by the data.

Production default: vLLM | Wins on 7 of 10 measured dimensions — including every A100 config, AWQ quantization, TTFT, and reasoning workloads. The one exception: L4 FP16 at high concurrency (c=64), where SGLang leads by +10%. If you're on A100 or using AWQ, vLLM is the stronger choice across the board. The v2 sweep (vLLM v0.20.1 vs SGLang :latest, May 2026) confirmed and widened this finding.

vLLM — production default

Leads on A100 at all concurrencies (+32–45% FP16, +113% Marlin at c=64). Lowest TTFT on L4 (70 ms vs 119 ms, 42% faster). AWQ delivers 2.5x throughput at 1/3 VRAM with no meaningful downside. Reasoning workloads 4.3x faster with AWQ. Default for any production deployment on A100 or with quantization.

SGLang — L4 FP16 throughput

Leads L4 FP16 aggregate throughput at high concurrency (914 vs 831 tok/s at c=64, +10%). On A100, vLLM outperforms SGLang at all concurrency levels — the L4 pattern reverses on higher-tier GPUs. Best fit: budget GPU (L4/A10G), FP16 only, throughput-first workloads where AWQ checkpoints don't exist.

llama.cpp — edge and dev

Fastest single-request E2E latency (2.7 s vs 7 s FP16) and smallest footprint (4.4 GB VRAM). Best for edge, embedded, CPU-only, and development environments. Throughput caps at ~190 tok/s at c=64 and success rate drops to 55% on long contexts. Not suited for production API serving above c=4.

Findings

What the data actually says — and where each engine fits.

  1. SGLang leads L4 FP16 throughput; vLLM leads everywhere else

    On L4, SGLang hits 914 tok/s vs vLLM's 831 tok/s at c=64 (+10% FP16). But at c=1, vLLM's TTFT is 70 ms vs SGLang's 119 ms (42% faster). The L4 pattern reverses completely on A100: vLLM v0.20.1 outperforms SGLang :latest at every concurrency — 80 vs 61 tok/s at c=1 (+32%), 3,102 vs 2,141 tok/s at c=64 (+45%). SGLang's L4 FP16 advantage is narrow and GPU-class-specific.

    When this matters: On A100+ → vLLM for both throughput and latency. On L4 FP16 at high concurrency → SGLang has a narrow +10% edge. L4 AWQ → vLLM by 2x.

  2. AWQ quantization on vLLM is a free lunch

    vLLM AWQ delivers 43 tok/s at c=1 (2.5x FP16) while using only 5.2 GB VRAM vs 15 GB. At c=64, it hits 976 tok/s (+17% over FP16 vLLM). AWQ's memory savings become critical for reasoning workloads where KV caches hold 8K+ token sequences — AWQ vLLM reaches 345 tok/s at c=16, 2.4x faster than FP16 SGLang (147 tok/s). The one exception: SGLang AWQ suffers a torch 2.5.1 compatibility issue, degrading to 8 tok/s at c=1.

    When this matters: Almost always — use AWQ on vLLM if a pre-quantized checkpoint exists. Skip only if your pipeline requires exact FP16 outputs or your GPU predates Ampere.

  3. llama.cpp wins at c=1, loses at scale

    At c=1 short, Q4_K_M achieves 47 tok/s with 2.7 s p95 latency (2.8x faster E2E than FP16 engines). But limited parallelism (--parallel 4) means throughput plateaus at 190 tok/s by c=64, and long-regime success rate drops to 55%. Best for edge, embedded, and low-concurrency use cases where its ~4.4 GB footprint matters.

    When this matters: Edge devices, CPU-only servers, developer laptops, or any context where VRAM is the binding constraint. Not suited for production API serving above c=4.

  4. Reasoning workloads change the metric

    With Qwen3-8B generating ~6,000 thinking tokens per request, first answer token (after thinking) takes 145-664 seconds. Qwen3 AWQ vLLM at c=4 delivers the answer in 145 s — 4.6x faster than Qwen3 SGLang at c=16 (664 s). For thinking models, "time to useful output" is the metric that matters, not TTFT.

    When this matters: Any deployment of thinking models (DeepSeek-R1, Qwen3 thinking mode). Don't benchmark TTFT — benchmark time-to-first-answer-token. The difference is 145 s vs 664 s in this data.

  5. Long regime amplifies every difference

    At c=64 long, SGLang's 840 tok/s is 8% faster than vLLM's 777 tok/s, with 9% lower p95 latency (40 s vs 44 s). llama.cpp drops to 55% success. Longer sequences stress KV cache management, batching efficiency, and memory bandwidth — making engine choice more consequential.

    When this matters: RAG pipelines, document summarization, long-context chat. Engine choice matters more at long context — the gap between engines widens as sequence length grows.

  6. On A100, vLLM dominates SGLang — the L4 ranking reverses

    The May 2026 v2 sweep (vLLM v0.20.1 vs SGLang :latest) found a GPU-class pattern reversal. On L4 FP16, SGLang leads at c=64 (+10%). On A100 FP16, vLLM leads at every concurrency: +32% at c=1, +45% at c=64. The gap widens with Marlin quantization at high concurrency: vLLM Marlin hits 4,762 tok/s at c=64 while SGLang Marlin collapses to 2,231 tok/s (vLLM 2.1x ahead). Likely cause: SGLang's batch scheduler and Marlin kernel don't scale as efficiently to A100's higher memory bandwidth and compute headroom. vLLM v0.20.1 also improved Marlin performance significantly: 177 tok/s at c=1 vs 104 tok/s in v0.8.5 (+70%).

    When this matters: If your production GPU is A100 (or H100), vLLM is clearly the stronger choice — especially with AWQ Marlin. The L4 benchmark results don't predict A100 engine rankings.

  7. MoE decode tracks active parameters, not total — but expert routing has overhead

    Qwen3-30B-A3B has 30.5B total parameters but only 3.3B active per token. The question: does decode speed track total params or active params? Active params wins. BF16 at c=1 on A100 80GB delivers 134 tok/s vs 92 tok/s for dense 7B on the same GPU (1.5x faster) — tracking active params, not total. But the efficiency gap is significant: MoE BF16 achieves 46% memory-bandwidth efficiency vs dense FP16's 63% (vLLM v0.20.2), because expert routing adds irregular memory access that ideal batching cannot hide. AWQ Marlin efficiency drops further to 11–18% from combined dequantization and expert-load overhead. The practical implication: model MoE deployments on active parameters for throughput, total parameters for VRAM, and budget 10–20% for routing overhead the calculator cannot fully capture. One hard constraint: BF16 c=16 OOMs on A100 80GB — 61 GB weights plus KV cache exceeds the 80 GB limit.

    When this matters: Any MoE deployment — Qwen3-30B-A3B, DeepSeek V3/V4, Mixtral. Size VRAM on total parameters, throughput on active parameters. Most cost calculators use the wrong number for one of these.
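The MoE sizing rule in finding 7 (VRAM on total parameters, throughput on active parameters) can be sketched in a few lines. This is a minimal illustration, not the calculator's code: `size_moe` and `MoESizing` are hypothetical names, and the 1,935 GB/s bandwidth figure for A100 80GB is an assumption taken from the published PCIe spec rather than from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class MoESizing:
    weights_gb: float            # VRAM for weights, sized on TOTAL parameters
    decode_ceiling_tok_s: float  # bandwidth-bound ceiling, sized on ACTIVE parameters


def size_moe(total_params_b: float, active_params_b: float,
             bytes_per_param: float, bandwidth_gb_s: float) -> MoESizing:
    """Size an MoE model: all experts must fit in VRAM, but each decoded
    token only reads the active parameters from memory."""
    weights_gb = total_params_b * bytes_per_param          # billions * bytes = GB
    per_token_read_gb = active_params_b * bytes_per_param  # GB read per token
    return MoESizing(weights_gb, bandwidth_gb_s / per_token_read_gb)


# Assumption: A100 80GB memory bandwidth (not stated in the benchmark).
A100_80GB_BANDWIDTH_GB_S = 1935.0

# Qwen3-30B-A3B in BF16: 30.5B total, 3.3B active, 2 bytes per parameter.
sizing = size_moe(30.5, 3.3, 2.0, A100_80GB_BANDWIDTH_GB_S)
measured_tok_s = 134.0  # BF16 at c=1 on A100 80GB, from the text

print(f"weights: {sizing.weights_gb:.0f} GB of 80 GB")
print(f"decode ceiling: {sizing.decode_ceiling_tok_s:.0f} tok/s")
print(f"efficiency: {measured_tok_s / sizing.decode_ceiling_tok_s:.0%}")
```

Under these assumptions the weights come to 61 GB, which leaves roughly 19 GB for KV cache and activations and is consistent with the c=16 OOM noted above. Sizing the ceiling on total parameters instead would understate the throughput the benchmark actually measured.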

Validated by Real Benchmarks

The LLM Deploy Cost Calculator predicts throughput from GPU specs. These benchmarks measure what actually happens when you run the code. Same hardware, same model — theory vs practice.

Decode throughput (Qwen2.5-7B on L4, c=1)

The calculator models decode as memory-bandwidth-bound: throughput = HBM_bandwidth / (params * bytes_per_param).

Configuration | Theoretical | Measured | Efficiency
FP16 (vLLM) | 21.4 tok/s | 17.1 tok/s | 80%
AWQ 4-bit (vLLM) | 85.4 tok/s | 43.2 tok/s | 51%

FP16 achieves 80% of theoretical bandwidth — kernel overhead, attention computation, and KV cache reads account for the gap. AWQ drops to 51% because dequantization and irregular memory access patterns reduce effective bandwidth utilization.
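The bandwidth-bound model above is simple enough to reproduce by hand. The following is a minimal sketch, assuming exactly 7B parameters and exactly 0.5 bytes per parameter for 4-bit AWQ (ignoring quantization scales and zero points); the function names are illustrative and not taken from the calculator.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billions: float,
                         bytes_per_param: float) -> float:
    """Bandwidth-bound decode ceiling: each generated token reads all weights once."""
    weight_gb = params_billions * bytes_per_param  # billions of params * bytes = GB
    return bandwidth_gb_s / weight_gb


def efficiency(measured_tok_s: float, ceiling_tok_s: float) -> float:
    """Fraction of the theoretical ceiling that the engine actually delivered."""
    return measured_tok_s / ceiling_tok_s


L4_BANDWIDTH_GB_S = 300.0  # L4 memory bandwidth used by the calculator
PARAMS_B = 7.0             # assumption: Qwen2.5-7B treated as exactly 7B

fp16_ceiling = decode_ceiling_tok_s(L4_BANDWIDTH_GB_S, PARAMS_B, 2.0)
awq_ceiling = decode_ceiling_tok_s(L4_BANDWIDTH_GB_S, PARAMS_B, 0.5)

print(f"FP16 ceiling: {fp16_ceiling:.1f} tok/s, "
      f"efficiency {efficiency(17.1, fp16_ceiling):.0%}")
print(f"AWQ ceiling:  {awq_ceiling:.1f} tok/s, "
      f"efficiency {efficiency(43.2, awq_ceiling):.0%}")
```

With these inputs the FP16 ceiling matches the 21.4 tok/s in the table. The AWQ ceiling comes out near 85.7 tok/s, slightly above the table's 85.4, presumably because the calculator counts a little more than 0.5 bytes per parameter.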

Aggregate throughput (c=64, short regime)

At high concurrency, throughput is limited by batch scheduling, KV cache pressure, and CUDA graph fragmentation.

Configuration | Theoretical | Measured | Efficiency
FP16 (vLLM) | 1,294 tok/s | 831 tok/s | 64%
FP16 (SGLang) | 1,294 tok/s | 914 tok/s | 71%

SGLang's RadixAttention achieves 71% of theoretical at c=64, vs vLLM's 64%. This 7-point gap is the same +10% L4 FP16 throughput advantage reported in the findings, expressed as efficiency; it does not carry over to A100. The calculator's "no continuous batching efficiency" limitation is real: engines achieve 64–71% of ideal at high concurrency.

This comparison uses the same throughput model from the LLM Deploy Cost Calculator, with L4 specs (31 TFLOPS, 300 GB/s memory bandwidth; the L4 uses GDDR6 rather than HBM), Qwen2.5-7B (7B params, GQA, 28 layers, 4 KV heads), and MFU=0.35. The "Theoretical" column is what the calculator predicts; the "Measured" column is what inference-bench recorded on the same GPU. Full benchmark data on GitHub.

Decode efficiency (Qwen2.5-7B on A100 40GB, c=1) — v0.20.1 / :latest

A100 40GB has 5.2x more bandwidth than L4 (1,555 vs 300 GB/s). Per-stream efficiency drops because compute becomes the bottleneck.

Configuration | Theoretical | Measured | Efficiency
FP16 (vLLM v0.20.1) | 111 tok/s | 80.0 tok/s | 72%
FP16 (SGLang :latest) | 111 tok/s | 60.5 tok/s | 54%
AWQ Marlin (vLLM v0.20.1) | 444 tok/s | 177.3 tok/s | 40%
AWQ Marlin (SGLang :latest) | 444 tok/s | 194.8 tok/s | 44%

On A100, vLLM v0.20.1 FP16 hits 72% efficiency, up from 65% in v0.8.5. Marlin kernels improved dramatically: 40% in v0.20.1 vs 23% in v0.8.5. At c=1, SGLang Marlin is slightly ahead (194.8 vs 177.3 tok/s), but at c=64 vLLM Marlin reaches 4,762 tok/s while SGLang Marlin collapses to 2,231 tok/s: SGLang's Marlin kernel doesn't scale to high-concurrency batching on A100 the way vLLM's does.

Methodology

Rigorous, reproducible setup designed to isolate engine behavior from confounding variables.

Hardware

GPU: NVIDIA L4 (24 GB)
Provider: Modal (per-second billing)
CUDA: 12.4.1

Model

FP16: Qwen2.5-7B-Instruct (~15 GB)
AWQ: Qwen2.5-7B-Instruct-AWQ (~5.2 GB)
GGUF: Q4_K_M quantization (~4.4 GB)
Reasoning: Qwen3-8B + Qwen3-8B-AWQ

Workload

Short: ≤256 in, 128 out (chat)
Long: ≤2048 in, 512 out (RAG)
Reasoning: ≤256 in, 8192 out (thinking)
Sampling: Greedy (temp=0)
Repeats: 3x short, 1x long, 1x reasoning

vLLM (v0.8.5): --max-num-seqs 64 · --gpu-mem-util 0.90
SGLang (v0.4.6): --max-running-req 64 · --mem-frac 0.85 · --disable-cuda-graph
llama.cpp (b5540): -np 4 --parallel 4 -ngl 99 -c 16384
vLLM AWQ (v0.8.5): --quantization awq --enforce-eager
SGLang AWQ (v0.4.6): --quantization awq --disable-cuda-graph
vLLM (A100) (v0.20.1): vllm serve · --max-num-seqs 64 · --gpu-memory-utilization 0.90 · --no-enable-log-requests
SGLang (A100) (:latest): --max-running-req 64 · --mem-frac 0.85 · --disable-cuda-graph
Qwen3 vLLM (v0.8.5): --reasoning-parser deepseek_r1 --max-model-len 16384
Qwen3 SGLang (v0.4.6): --reasoning-parser deepseek-r1 --context-length 16384
Qwen3 AWQ vLLM (v0.8.5): --quantization awq --reasoning-parser deepseek_r1 --max-model-len 16384
Qwen3 AWQ SGLang (v0.4.6): --quantization awq --reasoning-parser deepseek-r1 --context-length 16384
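
For the reasoning configurations, the metric of interest is time to first answer token rather than TTFT. One way to measure it from a token stream is sketched below. This is a minimal illustration, assuming the client can already tell thinking tokens from answer tokens (for example via the reasoning parser's output); `StreamEvent` and `first_token_times` are hypothetical names, and the timeline is made up.

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class StreamEvent(NamedTuple):
    t: float         # seconds since the request was sent
    is_answer: bool  # False for thinking tokens, True for answer tokens


def first_token_times(
    events: Iterable[StreamEvent],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (TTFT, time to first answer token); None where never observed."""
    ttft: Optional[float] = None
    first_answer: Optional[float] = None
    for ev in events:
        if ttft is None:
            ttft = ev.t          # first token of any kind
        if ev.is_answer:
            first_answer = ev.t  # first token the user can actually use
            break
    return ttft, first_answer


# Illustrative timeline: thinking starts quickly, the answer arrives much later.
events = [
    StreamEvent(0.07, False),
    StreamEvent(60.0, False),
    StreamEvent(145.0, True),
    StreamEvent(145.1, True),
]
ttft, first_answer = first_token_times(events)
print(f"TTFT: {ttft} s, first answer token: {first_answer} s")
```

A request that exhausts its output budget while still thinking yields `None` for the second value, which is worth counting separately from successful requests.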

Results

Throughput, first-token latency, and tail latency across concurrency sweeps.

Baseline hardware is NVIDIA L4 (24 GB, Modal, per-second billing) with Qwen2.5-7B-Instruct. The A100 and MoE regimes run on A100 40GB or 80GB, and the MoE regime uses a different model; see the regime definitions below. c=N means N concurrent requests in flight simultaneously. Throughput is aggregate tokens/sec across all concurrent streams. TTFT is time to first token (p50). Latency p95 is tail end-to-end time per request. Regime definitions: Short ≤256 input tokens · Long ≤2048 input tokens · Reasoning ~6K thinking tokens (Qwen3-8B) · A100 = vLLM v0.20.1 + SGLang :latest on A100 40GB (May 2026 sweep) · MoE = Qwen3-30B-A3B active-param study on A100 80GB (BF16) and A100 40GB (AWQ).
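
The three reported metrics can be derived from per-request records as sketched below. This is a minimal illustration under stated assumptions: `RequestResult`, `percentile`, and `summarize` are hypothetical names, the nearest-rank percentile method is an assumption (the benchmark's exact method is not stated), and the sample values are made up.

```python
import math
from typing import Dict, NamedTuple, Sequence


class RequestResult(NamedTuple):
    ttft_s: float       # time to first token for this request
    latency_s: float    # end-to-end time for this request
    output_tokens: int  # tokens generated by this request


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile over a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(results: Sequence[RequestResult], wall_clock_s: float) -> Dict[str, float]:
    """Aggregate throughput across all streams, plus TTFT p50 and latency p95."""
    return {
        "throughput_tok_s": sum(r.output_tokens for r in results) / wall_clock_s,
        "ttft_p50_s": percentile([r.ttft_s for r in results], 50),
        "latency_p95_s": percentile([r.latency_s for r in results], 95),
    }


# Made-up run: four concurrent requests finishing within a 3-second window.
results = [
    RequestResult(0.07, 2.5, 128),
    RequestResult(0.08, 2.6, 128),
    RequestResult(0.09, 2.7, 128),
    RequestResult(0.12, 3.0, 128),
]
summary = summarize(results, wall_clock_s=3.0)
print(summary)
```

Throughput divides total output tokens by wall-clock time for the whole batch rather than averaging per-request rates, which is what "aggregate across all concurrent streams" implies.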

[Charts: Throughput vs Concurrency · Time to First Token (p50) · End-to-End Latency (p95)]

What This Doesn't Measure

Known limitations of the benchmark scope — important for interpreting results correctly.