Dell Precision T7810 2× Xeon E5-2680 v4 (28c/56t) 128 GB DDR4 ECC 2× RTX 5060 Ti 16 GB llama.cpp · CUDA · UVM
Speed vs Quality — Scatter
X: PP512 prefill · Y: TG128 generation · dot size = model size · click → table row
Generation Speed (TG128)
Model · Arch · Size · PP512 t/s · TG128 t/s · VRAM · Overflow · ncmoe / ts · Quality · Date · Notes
Hardware & Methodology

Test Rig

Platform: Dell Precision T7810 (dual-socket workstation)
CPUs: 2× Intel Xeon E5-2680 v4 (14c/28t each, 56 threads total, Broadwell)
RAM: 128 GB DDR4 ECC 2400 MHz
GPUs: 2× NVIDIA RTX 5060 Ti 16 GB (32 GB VRAM total, PCIe 4.0)
VRAM Budget: 32 GiB (overflow via CUDA UVM → RAM)
Software: llama.cpp + ik_llama.cpp fork for MoE
Dual-socket Xeon with modern consumer GPUs — an unusual point in the inference landscape. NUMA topology affects tensor splitting. UVM allows running models larger than 32 GiB VRAM at a speed cost.
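As a sketch of how the UVM overflow path is typically enabled (assuming a standard llama.cpp CUDA build, which reads the `GGML_CUDA_ENABLE_UNIFIED_MEMORY` environment variable; the model path here is a placeholder):

```shell
# Sketch: run a model larger than the 32 GiB VRAM budget by letting
# CUDA unified memory page overflow tensors out to system RAM.
# Expect a speed penalty proportional to how much spills over.
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
  ./llama-cli -m ./models/large-model.gguf -ngl 99 -p "Hello"
```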

Benchmark Methodology

Tool: llama-bench (built into llama.cpp)
PP512: Prompt processing, 512-token prompt; prefill throughput (t/s)
TG128: Token generation, 128 tokens; most relevant for interactive chat
Backend: CUDA with full GPU offload (ngl=99 where possible)
Flash Attn: Enabled where the model supports it ("FA" in the build tag)
Tensor Split: Ratio of layers split across GPU 0 and GPU 1
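The settings above correspond to a llama-bench invocation roughly like the following (a sketch; the model path and the even 1/1 split ratio are placeholders, not the exact commands used for every row):

```shell
# Sketch: one benchmark run matching the methodology above.
# -p 512  -> PP512 prefill test;   -n 128 -> TG128 generation test
# -ngl 99 -> offload all layers;   -fa 1  -> flash attention on
# -ts 1/1 -> split layers evenly across the two GPUs
./llama-bench -m ./models/example.gguf -p 512 -n 128 -ngl 99 -fa 1 -ts 1/1
```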
Speed tiers: ≥60 t/s = fast (green), ≥20 = medium (orange), ≥5 = slow (yellow), <5 = impractical (red). Quality scores from 28-prompt eval suite covering reasoning, coding, and instruction following.
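The tier thresholds can be expressed as a tiny helper (a hypothetical script for reproducing the color coding, not part of the benchmark harness):

```shell
# Hypothetical helper: map a TG128 tokens/sec value to the speed tiers above.
tier() {
  awk -v t="$1" 'BEGIN {
    if (t >= 60)      print "fast"         # green
    else if (t >= 20) print "medium"       # orange
    else if (t >= 5)  print "slow"         # yellow
    else              print "impractical"  # red
  }'
}

tier 62.5   # prints "fast"
tier 12.0   # prints "slow"
```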