Benchmark Results
This file is the human-readable companion to benchmark_results.csv.
Required comparison table
| Config | Weight memory | KV-cache note | Tokens/sec | Quality score | Training cost | Recommendation |
|---|---|---|---|---|---|---|
| Base fp16/bf16 | Pending run | KV cache still scales with context and concurrency | Pending run | Pending run | N/A | Reference baseline |
| Quantized (int8 or int4) | Pending run | Weight memory drops, but KV cache does not | Pending run | Pending run | N/A | Good serving candidate if quality holds |
| LoRA or QLoRA | Pending run | Base KV cache still matters at serving time | Pending run | Pending run | Pending run | Best when behavior shift matters more than freshness |
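The KV-cache notes in the table follow from simple arithmetic. The sketch below uses the standard per-sequence estimate (one K and one V tensor per layer) to show why quantizing weights leaves the KV cache untouched. The function names and the 7B-style model shape are hypothetical illustrations, not values from `benchmark_results.csv`.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, concurrency, bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    The leading 2 accounts for one K and one V tensor per layer.
    bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * concurrency * bytes_per_elem)


def weight_bytes(num_params, bits_per_weight):
    """Estimate weight memory in bytes for a given precision."""
    return num_params * bits_per_weight // 8


# Hypothetical 7B-style shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=32, head_dim=128)

kv_one = kv_cache_bytes(context_len=4096, concurrency=1, **shape)
kv_eight = kv_cache_bytes(context_len=4096, concurrency=8, **shape)

fp16_weights = weight_bytes(7_000_000_000, 16)
int4_weights = weight_bytes(7_000_000_000, 4)

for label, value in [
    ("KV cache, 1 sequence", kv_one),
    ("KV cache, 8 sequences", kv_eight),
    ("Weights, fp16", fp16_weights),
    ("Weights, int4", int4_weights),
]:
    print(f"{label}: {value / 1e9:.2f} GB")
```

Under these assumptions, int4 quantization cuts the weights from 14 GB to 3.5 GB, while the KV cache at 8 concurrent 4096-token sequences stays near 17 GB regardless of weight precision.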
Smoke-path note
In this workspace the runnable local benchmark uses:
- base model: `sshleifer/tiny-gpt2`
- quantized stand-in: local per-tensor INT8 wrapper for GPT-2-style layers
- adapter path: LoRA via `peft`
This setup is a local validation path only. For the full hands_on_lab, replace the quantized stand-in with a real GPTQ / AWQ / GGUF artifact on the hardware you actually care about.
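For readers unfamiliar with the term, per-tensor INT8 quantization means a single scale factor is shared by every value in a weight tensor. The following is a minimal pure-Python sketch of the symmetric variant of that idea. It is not the workspace's actual wrapper, and the function names are hypothetical.

```python
def quantize_per_tensor_int8(weights):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude to 127; guard against an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from int8 codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.01, 1.0]
quantized, scale = quantize_per_tensor_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per element is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

A single scale is cheap to store and apply, but one outlier weight stretches the range for the whole tensor, which is one reason production formats such as GPTQ or AWQ use finer-grained schemes.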
Current smoke snapshot
| Config | Weight memory | KV-cache note | Tokens/sec | Quality score | Training cost | Recommendation |
|---|---|---|---|---|---|---|
| Base fp16-style | 0.000191 GB | KV cache still scales with context and concurrency | 2160.038 | 49735.209 | N/A | Reference baseline |
| Local int8 smoke | 0.000189 GB | Weight memory drops, but KV cache is unchanged | 1968.874 | 49725.439 | N/A | Use only as a local quantization stand-in |
| LoRA adapter | 0.000385 GB | Base KV cache remains the serving driver | 2050.982 | 49734.687 | Low adapter-only CPU/GPU run | Best when behavior shift matters more than freshness |
These are workflow-validation numbers, not a serious model-selection result. The tiny local model is only there to prove the benchmark and LoRA paths work end to end.
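As a rough illustration of how a tokens/sec figure can be measured, the sketch below times repeated calls to a generation function after one warm-up call. The helper and `fake_generate` are hypothetical stand-ins, not the benchmark script used in this workspace. A real run would wrap the model's generation call and return the number of new tokens it produced.

```python
import time


def tokens_per_second(generate_fn, num_runs=3):
    """Time num_runs calls to generate_fn, which must return a token count."""
    generate_fn()  # warm-up call, excluded from timing
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(num_runs):
        total_tokens += generate_fn()
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed


def fake_generate():
    """Hypothetical stand-in: pretend to generate 10 tokens in about 10 ms."""
    time.sleep(0.01)
    return 10


print(f"{tokens_per_second(fake_generate):.1f} tokens/sec")
```

Throughput measured this way on a tiny model mostly reflects framework overhead, which is consistent with treating the snapshot as a workflow check only.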