Skip to content

Benchmark Results

This file is the human-readable companion to benchmark_results.csv.

Required comparison table

Config Weight memory KV-cache note Tokens/sec Quality score Training cost Recommendation
Base fp16/bf16 Pending run KV cache still scales with context and concurrency Pending run Pending run N/A Reference baseline
Quantized (int8 or int4) Pending run Weight memory drops, KV cache does not Pending run Pending run N/A Good serving candidate if quality holds
LoRA or QLoRA Pending run Base KV cache still matters at serving time Pending run Pending run Pending run Best when behavior shift matters more than freshness

Smoke-path note

In this workspace the runnable local benchmark uses:

  • base model: sshleifer/tiny-gpt2
  • quantized stand-in: local per-tensor INT8 wrapper for GPT-2-style layers
  • adapter path: LoRA via peft

That is a local validation path. For the full hands_on_lab, replace the quantized stand-in with a real GPTQ / AWQ / GGUF artifact on the hardware you actually care about.

Current smoke snapshot

Config Weight memory KV-cache note Tokens/sec Quality score Training cost Recommendation
Base fp16-style 0.000191 GB KV cache still scales with context and concurrency 2160.038 49735.209 N/A Reference baseline
Local int8 smoke 0.000189 GB Weight memory drops, but KV cache is unchanged 1968.874 49725.439 N/A Use only as a local quantization stand-in
LoRA adapter 0.000385 GB Base KV cache remains the serving driver 2050.982 49734.687 Low adapter-only CPU/GPU run Best when behavior shift matters more than freshness

These are workflow-validation numbers, not a serious model-selection result. The tiny local model is only there to prove the benchmark and LoRA paths work end to end.