Benchmark Results

This file is the human-readable companion to benchmark_results.csv.

Required comparison table

Config	Weight memory	KV-cache note	Tokens/sec	Quality score	Training cost	Recommendation
Base fp16/bf16	Pending run	KV cache still scales with context and concurrency	Pending run	Pending run	N/A	Reference baseline
Quantized (int8 or int4)	Pending run	Weight memory drops, KV cache does not	Pending run	Pending run	N/A	Good serving candidate if quality holds
LoRA or QLoRA	Pending run	Base KV cache still matters at serving time	Pending run	Pending run	Pending run	Best when behavior shift matters more than freshness
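
The KV-cache notes in the table can be made concrete with a back-of-the-envelope estimate. The sketch below is not part of the benchmark code; the function name `kv_cache_bytes` and the example shape (a GPT-2-small-like 12 layers, 12 heads, head dimension 64) are assumptions chosen for illustration. It shows why the cache grows linearly with both context length and the number of concurrent sequences, independent of how the weights are stored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   context_len, concurrent_seqs, bytes_per_elem=2):
    """Estimate KV-cache size in bytes.

    Each token stores one key and one value vector (hence the factor 2)
    per layer and per KV head. bytes_per_elem=2 assumes an fp16/bf16 cache.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * concurrent_seqs


# Illustrative shape only (assumption): 12 layers, 12 heads, head_dim 64.
single = kv_cache_bytes(12, 12, 64, context_len=1024, concurrent_seqs=1)
batched = kv_cache_bytes(12, 12, 64, context_len=1024, concurrent_seqs=8)

print(f"1 sequence:  {single / 1e9:.4f} GB")
print(f"8 sequences: {batched / 1e9:.4f} GB")
```

Quantizing the weights to int8 or int4 does not change any term in this estimate, which is why the quantized and LoRA rows still carry a KV-cache caveat. Only a separately quantized cache (a smaller `bytes_per_elem`) would reduce it.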

Smoke-path note

In this workspace the runnable local benchmark uses:

base model: sshleifer/tiny-gpt2
quantized stand-in: local per-tensor INT8 wrapper for GPT-2-style layers
adapter path: LoRA via peft

This smoke path only validates the workflow locally. For the full hands_on_lab, replace the quantized stand-in with a real GPTQ, AWQ, or GGUF artifact and run it on the hardware you actually care about.
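
For readers unfamiliar with per-tensor INT8, here is a minimal sketch of the idea in plain Python. It is not the workspace's actual wrapper, which is not shown in this file; the function names and the use of a flat list as a stand-in for a weight tensor are assumptions. It uses symmetric quantization: a single scale for the whole tensor, derived from the largest absolute weight.

```python
def quantize_per_tensor_int8(weights):
    """Symmetric per-tensor INT8: one scale shared by every element."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from INT8 codes."""
    return [v * scale for v in q]


# Hypothetical weights, flattened for illustration.
weights = [0.5, -1.27, 0.01, 0.0]
q, scale = quantize_per_tensor_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("codes:", q)
print("scale:", scale)
print("max round-trip error:", max_err)
```

Because one scale covers the whole tensor, a single outlier weight coarsens the resolution for every other weight. Real GPTQ or AWQ artifacts use finer-grained schemes, which is one reason the stand-in's quality numbers should not be read as representative.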

Current smoke snapshot

Config	Weight memory	KV-cache note	Tokens/sec	Quality score	Training cost	Recommendation
Base fp16-style	0.000191 GB	KV cache still scales with context and concurrency	2160.038	49735.209	N/A	Reference baseline
Local int8 smoke	0.000189 GB	Weight memory drops, but KV cache is unchanged	1968.874	49725.439	N/A	Use only as a local quantization stand-in
LoRA adapter	0.000385 GB	Base KV cache remains the serving driver	2050.982	49734.687	Low (adapter-only CPU/GPU run)	Best when behavior shift matters more than freshness

These are workflow-validation numbers, not a serious model-selection result. The tiny local model is only there to prove the benchmark and LoRA paths work end to end.
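
When rerunning the benchmark on real hardware, the Tokens/sec column can be measured along the following lines. This is a hedged sketch, not the harness that produced the numbers above: `tokens_per_second`, its parameters, and the `fake_generate` stand-in are all hypothetical. In practice `generate` would wrap a real model call and return the number of new tokens it produced.

```python
import time


def tokens_per_second(generate, num_new_tokens, warmup=1, repeats=3,
                      clock=time.perf_counter):
    """Return generated tokens divided by wall-clock seconds over `repeats` runs.

    `generate(n)` must produce tokens and return how many it produced.
    Warmup runs are executed but not timed.
    """
    for _ in range(warmup):
        generate(num_new_tokens)

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(repeats):
        start = clock()
        total_tokens += generate(num_new_tokens)
        total_seconds += clock() - start

    if total_seconds <= 0:
        raise ValueError("elapsed time was zero; increase num_new_tokens or repeats")
    return total_tokens / total_seconds


def fake_generate(n):
    # Stand-in for a real model call (assumption): sleeps to simulate decoding.
    time.sleep(0.001 * n)
    return n


print(f"{tokens_per_second(fake_generate, num_new_tokens=20):.1f} tokens/sec")
```

Untimed warmup runs keep one-time costs such as lazy initialization out of the measurement, and averaging over several runs smooths out noise, which matters for a model as small as the one used in the smoke path.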