Validation report

⏱ latency: orange marks the first turn per backend, which includes model warmup. chart refs: the number of chart records cited. This is a COUNT, not a grounding or quality signal; the authoritative call is the human adjudication on each tile. tokens, finish_reasons, and response model are not surfaced by /chat (OTel-deferred). All metrics are deterministic; no LLM judge is used. Drag tiles within a question to rank backends; rank and adjudication export to separate files.