* Chat metrics: show backend's true generation t/s, not tokens÷wall-clock
The per-message tokens/sec read low and felt wrong because it was computed as
output_tokens / total_duration, where total_duration is wall-clock including
prefill, tool calls, and network — not pure decode time. llama.cpp already
reports the correct gen speed in its stream (timings.predicted_per_second), but
it was being dropped.
- llm_core.py: when parsing the OpenAI-compatible usage chunk, also read the
sibling `timings` block llama.cpp includes — pass predicted_per_second through
as gen_tps and prompt_per_second as prefill_tps on the usage event.
- agent_loop.py: capture backend_gen_tps/backend_prefill_tps from usage events;
in _compute_final_metrics prefer backend_gen_tps over the wall-clock division
when present (fall back to computed for cloud APIs that omit timings). Tag the
result with tps_source ("backend" vs "computed") and surface prefill_tps.
Result: the displayed t/s now matches the model's real decode speed and is
stable regardless of prompt length (a long prefill no longer deflates it).
Checks: py_compile passes; verified extraction against a real llama.cpp final
chunk (gen 79 t/s surfaced vs the deflated wall-clock figure shown before).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Chat metrics: surface true t/s on the direct-chat path too
Follow-up to the gen-tps work: the non-agent direct-chat stream path in
chat_routes turned the raw `usage` event straight into a metrics event but only
copied token counts — it never set tokens_per_second or response_time. So simple
(non-tool) replies showed "Speed: n/a" / "Time: undefineds" and the chip fell
back to a bare token count ("27 tok") instead of t/s.
Map the usage event's gen_tps (llama.cpp timings.predicted_per_second, added in
the prior commit) into tokens_per_second here too, tag tps_source=backend, and
set response_time from wall-clock for the stats popup.
Checks: py_compile passes; verified llama.cpp emits usage+timings on the final
stream chunk (gen ~90 t/s) that this path consumes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Tests: backend gen/prefill t/s passthrough and preference
Cover the two pieces of the true-t/s metric so it can be reviewed on its own:
- stream_llm surfaces llama.cpp's timings.predicted_per_second /
prompt_per_second as gen_tps / prefill_tps on the usage event (captured
llama.cpp final-chunk fixture), and omits them when the backend reports no
timings.
- _compute_final_metrics prefers backend_gen_tps over output/wall-clock,
tags tps_source ("backend" vs "computed"), and surfaces prefill_tps.
Reuses the fake-client stream harness from test_llm_core_streaming.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both get_default_chat and _recover_empty_session_model picked the
first model from cached_models[0] without checking hidden_models.
If the first cached model was hidden (e.g. minimax-m3), it was
returned as the default or used to repair empty session models,
even though the model list endpoints already filter hidden_models.
- Add _visible_models() helper that filters cached_models by
hidden_models (mirrors the filtering in list_model_endpoints)
- Use _visible_models() in get_default_chat fallback (when no
explicit default_model is saved)
- Use _visible_models() in _recover_empty_session_model (when
repairing a session whose model field is empty before chat send)
- Add regression tests for hidden-model filtering in default chat
resolution, and unit tests for _visible_models helper
When the selected model fails before producing output, stream_llm_with_fallback
quietly switches to the next candidate and the reply is shown under the
originally selected model's name, so a misconfigured provider looks like it
works. (Concretely: a Bedrock gateway that 400s every Anthropic/Claude request
appears fine because another model silently answers under the Claude label.)
Emit a `fallback` SSE event ({selected_model, answered_by, reason}) the first
time a non-primary candidate produces output, forward it through the agent loop
and both chat-route paths, stamp the response metrics with the model that
actually answered, and show a notice + relabel the reply in the UI.
Tested: python -m pytest tests/test_llm_core_fallback.py (3 pass);
python -m py_compile src/llm_core.py src/agent_loop.py routes/chat_routes.py;
node --check static/js/chat.js.
The "don't wipe endpoint_url/model on endpoint delete" half of #587 landed
in 6a78b02 (Fix endpoint model preservation for tasks). The three remaining
follow-up pieces from the original PR — flagged in the review on #786 —
are:
- routes/model_routes.py: toggle_model_endpoint (PATCH) now accepts
api_key and base_url, so the admin UI can rotate a key or fix a typo'd
URL without going through delete+recreate. base_url is normalized the
same way the POST handler does (strip /models, /chat/completions,
/completions, /v1/messages, then _normalize_base). Cache invalidation
matches the POST/DELETE paths and the response includes base_url so the
frontend can confirm what was saved.
- routes/chat_routes.py: new _recover_empty_session_model picks
cached_models[0] from the endpoint that matches sess.endpoint_url and
persists it onto the Session row before the LLM call goes out. Wired
into both /api/chat and /api/chat_stream after the existing
_clear_orphaned_session_endpoint guard, so the order is: drop
truly-orphaned sessions first, then heal the "picker showed it, session
never knew" case.
- routes/chat_routes.py: when recovery fails (no endpoint, no cached
models) raise HTTP 400 with a clear message instead of letting
model="" reach the upstream as 401/503.
Closes#587.
Streamed deltas flagged thinking:true (reasoning-model traces) were being folded
into full_response and persisted as part of the assistant message, so saved
replies were polluted with the model's chain-of-thought. Forward those deltas to
the client (for a live thinking indicator) but exclude them from the accumulated
saved reply, in both chat and research-stream paths. Mirrors the existing rewrite
path's handling.
The /api/chat/stream_status handler did a membership test against
_active_streams followed by an indexed read of the same key. Between
those two ops, a sibling stream's finally block (or a stop / cleanup
path) can pop the entry, turning the indexed read into a KeyError that
bubbles up as a 500. The race is the exact one _stream_set was already
written to avoid; the comment on the helper at the top of the module
spells out why a single .get() is the right pattern here too.
Collapse the two-step into a single .get() call so the lookup either
returns the live record or None, and report 'detached' / 404 based on
that single read. No behavior change on the happy path; the failure
mode under concurrent stream cleanup is now handled deterministically.
Closes#658.
* feat(web-fetch): add web_fetch tool to read a specific URL's content
* test(web-fetch): add SSRF coverage and fail closed on empty DNS resolution
Add explicit SSRF regression tests for the web_fetch path covering
loopback, private LAN ranges, link-local/metadata, IPv6 private/local,
redirect-into-private, and unsupported schemes. Harden _public_http_url
to fail closed when a hostname resolves to no addresses.
chat_routes.py persisted a session's "mode" in three best-effort spots —
reading the current mode, writing the effective mode, and setting
research_pending on the stream path. Each opened a session with SessionLocal()
and called .close() as the LAST statement inside a try/except, so if anything
before close() raised (e.g. a SQLite "database is locked" under concurrent chat
streams) the except only logged and the connection was never returned to the
pool.
DATABASE_URL defaults to file-backed SQLite, whose engine uses SQLAlchemy's
default QueuePool (5 connections + 10 overflow). Repeated leaks on these hot
paths exhaust the pool; later requests then block for pool_timeout and fail
with "QueuePool limit ... reached", taking the app down until restart.
Move the logic into two best-effort helpers in core.database, next to the
existing session helpers (update_session_last_accessed, get_session_by_id):
- get_session_mode(session_id) -> Optional[str]
- set_session_mode(session_id, mode) -> bool
Both route through the existing get_db_session() context manager, which commits
on success, rolls back on error, and always closes in a finally, so the
connection is returned to the pool on every path. chat_routes.py now calls
these instead of hand-rolling sessions, also removing three copies of the same
try/except.
Add tests/test_session_mode_helpers.py: the helpers commit+close on success
and, on a mid-operation DB error, swallow + roll back + close (no leak). The
error-path tests fail against the old close()-inside-try pattern.