Dell Precision T7810 2× Xeon E5-2680 v4 (28c/56t) 128 GB DDR4 ECC 2× RTX 5060 Ti 16 GB llama.cpp · CUDA · UVM
Speed vs Quality — Scatter
X: PP512 prefill · Y: TG128 generation · dot size = model size · click → table row
Generation Speed (TG128)
Model · Arch · Size · PP512 t/s · TG128 t/s · VRAM · Overflow · ncmoe / ts · Quality · Date · Notes
Hardware & Methodology

Test Rig

Platform: Dell Precision T7810 (dual-socket workstation)
CPUs: 2× Intel Xeon E5-2680 v4 (14c/28t each, 56 threads total, Broadwell)
RAM: 128 GB DDR4 ECC 2400 MHz
GPUs: 2× NVIDIA RTX 5060 Ti 16 GB (32 GB VRAM total, PCIe 4.0)
VRAM Budget: 32 GiB (overflow via CUDA UVM → RAM)
Software: llama.cpp + ik_llama.cpp fork for MoE
Dual-socket Xeon with modern consumer GPUs — an unusual point in the inference landscape. NUMA topology affects tensor splitting. UVM allows running models larger than 32 GiB VRAM at a speed cost.
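As a sketch of how the UVM overflow path is typically enabled (assuming a standard llama.cpp CUDA build, which reads the `GGML_CUDA_ENABLE_UNIFIED_MEMORY` environment variable; the model path here is a placeholder):

```shell
# Sketch: run a model larger than the 32 GiB VRAM budget by letting
# CUDA unified memory page overflow tensors out to system RAM.
# Expect a speed penalty proportional to how much spills over.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  ./llama-cli -m ./models/large-model.gguf -ngl 99 -p "Hello"
```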

Benchmark Methodology

Tool: llama-bench (built into llama.cpp)
PP512: Prompt processing, 512-token prompt; prefill throughput (t/s)
TG128: Token generation, 128 tokens; most relevant for interactive chat
Backend: CUDA with full GPU offload (ngl=99 where possible)
Flash Attn: Enabled where the model supports it ("FA" in the build tag)
Tensor Split: Ratio of layers split across GPU 0 and GPU 1
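The settings above correspond to a llama-bench invocation roughly like the following (a sketch; the model path and the even 1/1 split ratio are placeholders, not the exact commands used for every row):

```shell
# Sketch: one benchmark run matching the methodology above.
# -p 512  -> PP512 prefill test;   -n 128 -> TG128 generation test
# -ngl 99 -> offload all layers;   -fa 1  -> flash attention on
# -ts 1/1 -> split layers evenly across the two GPUs
./llama-bench -m ./models/example.gguf -p 512 -n 128 -ngl 99 -fa 1 -ts 1/1
```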
Speed tiers: ≥60 t/s = fast (green), ≥20 = medium (orange), ≥5 = slow (yellow), <5 = impractical (red). Quality scores from 28-prompt eval suite covering reasoning, coding, and instruction following.
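The tier thresholds can be expressed as a tiny helper (a hypothetical script for reproducing the color coding, not part of the benchmark harness):

```shell
# Hypothetical helper: map a TG128 tokens/sec value to the speed tiers above.
tier() {
  awk -v t="$1" 'BEGIN {
    if (t >= 60)      print "fast"         # green
    else if (t >= 20) print "medium"       # orange
    else if (t >= 5)  print "slow"         # yellow
    else              print "impractical"  # red
  }'
}

tier 62.5   # prints "fast"
tier 12.0   # prints "slow"
```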