mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-17 10:15:27 -04:00
Chat metrics: surface backend generation speed
* Chat metrics: show backend's true generation t/s, not tokens÷wall-clock
The per-message tokens/sec read low and felt wrong because it was computed as
output_tokens / total_duration, where total_duration is wall-clock including
prefill, tool calls, and network — not pure decode time. llama.cpp already
reports the correct gen speed in its stream (timings.predicted_per_second), but
it was being dropped.
- llm_core.py: when parsing the OpenAI-compatible usage chunk, also read the
sibling `timings` block llama.cpp includes — pass predicted_per_second through
as gen_tps and prompt_per_second as prefill_tps on the usage event.
- agent_loop.py: capture backend_gen_tps/backend_prefill_tps from usage events;
in _compute_final_metrics prefer backend_gen_tps over the wall-clock division
when present (fall back to computed for cloud APIs that omit timings). Tag the
result with tps_source ("backend" vs "computed") and surface prefill_tps.
Result: the displayed t/s now matches the model's real decode speed and is
stable regardless of prompt length (a long prefill no longer deflates it).
Checks: py_compile passes; verified extraction against a real llama.cpp final
chunk (gen 79 t/s surfaced vs the deflated wall-clock figure shown before).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Chat metrics: surface true t/s on the direct-chat path too
Follow-up to the gen-tps work: the non-agent direct-chat stream path in
chat_routes turned the raw `usage` event straight into a metrics event but only
copied token counts — it never set tokens_per_second or response_time. So simple
(non-tool) replies showed "Speed: n/a" / "Time: undefineds" and the chip fell
back to a bare token count ("27 tok") instead of t/s.
Map the usage event's gen_tps (llama.cpp timings.predicted_per_second, added in
the prior commit) into tokens_per_second here too, tag tps_source=backend, and
set response_time from wall-clock for the stats popup.
Checks: py_compile passes; verified llama.cpp emits usage+timings on the final
stream chunk (gen ~90 t/s) that this path consumes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Tests: backend gen/prefill t/s passthrough and preference
Cover the two pieces of the true-t/s metric so it can be reviewed on its own:
- stream_llm surfaces llama.cpp's timings.predicted_per_second /
prompt_per_second as gen_tps / prefill_tps on the usage event (captured
llama.cpp final-chunk fixture), and omits them when the backend reports no
timings.
- _compute_final_metrics prefers backend_gen_tps over output/wall-clock,
tags tps_source ("backend" vs "computed"), and surfaces prefill_tps.
Reuses the fake-client stream harness from test_llm_core_streaming.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -856,6 +856,15 @@ def setup_chat_routes(
|
||||
pct = min(round((last_metrics["input_tokens"] / ctx.context_length) * 100, 1), 100.0)
|
||||
last_metrics["context_percent"] = pct
|
||||
last_metrics["context_length"] = ctx.context_length
|
||||
# The frontend reads `tokens_per_second`; the raw usage event
|
||||
# carries the backend's true gen speed as `gen_tps` (llama.cpp
|
||||
# timings). Map it through so this direct-chat path shows real
|
||||
# t/s instead of "n/a" → falling back to a bare token count.
|
||||
if last_metrics.get("gen_tps") and not last_metrics.get("tokens_per_second"):
|
||||
last_metrics["tokens_per_second"] = last_metrics["gen_tps"]
|
||||
last_metrics["tps_source"] = "backend"
|
||||
# Wall-clock response time for the stats popup ("Time").
|
||||
last_metrics.setdefault("response_time", round(time.time() - _chat_start, 2))
|
||||
yield f'data: {json.dumps({"type": "metrics", "data": last_metrics})}\n\n'
|
||||
except json.JSONDecodeError:
|
||||
yield chunk
|
||||
|
||||
Reference in New Issue
Block a user