The previous approach polled request.is_disconnected() inside the
async-for body of the chat/agent streaming loops. That happens too
late: by the time the poll runs, __anext__() has already awaited and
consumed the next upstream chunk, so a slow or silent generation could
still run for a full round-trip (or until a read timeout) after the
client disconnected. It was also unconditional, which would have made
ordinary chat navigation/refresh/tab-close stop a run that the
detached-run design intentionally keeps going server-side.
Both problems trace back to the same root cause: chat_stream always
wraps its generator in agent_runs (the detached-run manager), which
decouples the generator's lifetime from the SSE response on purpose so
normal chat/agent streams survive the client going away. Polling
disconnection inside a detached generator can never be "prompt" — the
generator isn't tied to that request anymore — and doing so defeats the
whole point of detaching it.
Compare panes don't need (or want) that: each pane's session exists
only to drive that one generation, there's nothing meaningful to
/resume, and the user expects the pane's Stop button — which aborts the
fetch and closes the SSE — to cancel the upstream call right away. So
route compare-mode requests around the agent_runs wrapper entirely and
stream the generator directly as the SSE body. Starlette already
cancels a streaming response's body iterator (raising
CancelledError/GeneratorExit into it) the instant it notices the client
disconnected — including while the generator is mid-await on the next
upstream chunk — and the existing except (CancelledError, GeneratorExit)
handlers in both the chat-mode and agent-mode loops already save the
partial response exactly once. No polling needed; the redesign just
stops getting in its own way.
Normal (non-Compare) chat and agent streams are untouched and keep
going through agent_runs, preserving detached-run semantics (surviving
tab close / navigation / refresh, reconnect via /api/chat/resume).
Replaces the source-text assertions in
tests/test_compare_stop_disconnect_poll.py with runtime tests that
actually exercise the cancellation contract: a Compare-shaped generator
is cancelled mid-await (not after the next chunk arrives) and saves its
partial exactly once; a normal completion still saves exactly once via
the completion path; agent_runs keeps a detached run alive when its
subscriber disconnects and only stops it on an explicit stop()/cancel
(also saving the partial exactly once); and the cancellation contract
is pinned for both chat-mode- and agent-mode-shaped chunk sequences.
* feat: Add plan mode to the chat agent
Adds a plan mode: the agent investigates read-only, proposes a checklist, and
waits for approval before changing anything. On approval it runs with full
tools and checks items off as it goes. Enforcement reuses the existing
disabled_tools gate.
Includes a slash command: `/plan [on|off]` (and `/toggle plan`) to flip the
plan toggle from the chat input.
- src/tool_security.py, src/mcp_manager.py: read-only allowlist (tools + MCP).
- src/agent_loop.py, routes/chat_routes.py: union the disabled set, prepend the
plan directive, force agent mode.
- static/: plan toggle pill, Approve & Run, dockable plan window, task-list
checkboxes, and the /plan slash command.
- tests/test_plan_mode.py.
* Plan mode: persistent re-referenceable plan + agent write-back
Three improvements so a long plan survives a weak model and stays in reach:
1. Re-reference the plan (out-of-context fix). On the execution turn the frontend
sends the approved checklist back (`approved_plan`); the backend pins it as a
top-of-context `## ACTIVE PLAN` system note (kept by the context trimmer), so
the agent can always re-read the plan instead of losing the thread on a long
run. New `build_active_plan_note()` (unit-tested).
2. Re-open / dock the plan anytime. The plan checklist is stored per-session
(localStorage). When a plan exists, the plan-mode button opens a small menu
("Show plan" / "Plan mode: On/Off") that re-opens the side-dockable plan
window — so it can stay docked while the agent works. The window live-refreshes
as the plan changes.
3. Agent write-back: new `update_plan` tool. The agent calls it to tick steps
`- [x]` after finishing them, or to revise steps when the user asks. Marker
tool (no I/O) → `plan_update` SSE event → the stored plan + docked window
update live. The ACTIVE PLAN note instructs the agent to use it.
Backend: src/agent_loop.py (param + pin + note builder + emit + prompt blurb),
src/tool_execution.py (update_plan handler), routes/chat_routes.py (parse
`approved_plan`, relay `plan_update`), registration in tool_schemas / agent_tools
/ tool_index (always-available, not admin-gated).
Frontend: static/js/chat.js (plan store, send `approved_plan`, handle
`plan_update`, capture restated checklists), static/app.js (plan-button menu),
static/js/planWindow.js (`isPlanWindowOpen`), static/js/storage.js (PLAN key).
Tests: tests/test_plan_mode.py (plan-note), tests/test_update_plan_tool.py.
* Plan mode: drop bash/python, rely on read-only discovery tools
Shell can mutate (write files, hit the network) and can't be constrained to
read-only at the tool layer, so plan mode no longer relies on a prompt to keep
it well-behaved — bash/python are removed from the read-only allowlist and added
to the fail-closed block set. Discovery is covered by the dedicated read-only
tools (read_file, grep, glob, ls) instead.
Rewrites the plan-mode directive to state shell is disabled and lists the
available read-only tools positively. Addresses review feedback on #638.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Comment: note _MCP_READONLY_VERBS are prefixes not whole words
Clarifies that entries like "summar" are intentional stems matched via
startswith (covers summarise/summarize/summary), not typos. Addresses review
feedback on #638.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Plan mode: clarify why gating inverts the allowlist into a denylist
Rename _PLAN_MODE_FALLBACK_BLOCK -> _PLAN_MODE_KNOWN_MUTATORS and rewrite the
comments. The tool gate is a denylist (disabled_tools); plan mode's policy is an
allowlist, so it returns the inverse (all known tool names minus the allowlist).
The static mutator set is a backstop for the schema-derived name list, which
misses XML-only tools and can fail to import. Addresses review feedback on #638.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Plan mode: stop hardcoding the read-only tool list in the directive
The model is already shown its available (read-only) tools by _assemble_prompt,
which removes every disabled tool. Enumerating them again in the directive only
duplicated that list and would drift as tools change. Point at the tools listed
below instead. Addresses review feedback on #638.
Let the agent pause and ask the user a multiple-choice question when a
task is genuinely ambiguous and the answer changes what it does next —
choosing between approaches, confirming an assumption, picking a target —
instead of guessing.
Modeled on the existing `ui_control` marker pattern: the `ask_user` tool
returns an `ask_user` payload that the agent loop emits as an SSE event
and then ends the turn. The frontend renders the question with clickable
option buttons, a free-text "Other" input, and an x to dismiss; the user's
choice is sent as the next message and the agent resumes with it in
context.
- src/tool_execution.py: `ask_user` handler — pure UI marker, no I/O.
Validates a non-empty question + 2..6 options, normalizes string/object
options, returns the payload.
- src/agent_loop.py: emit the `ask_user` event and break the round loop so
the turn ends and waits for the user's selection. Stream the question as
assistant text so it persists/replays (prevents a re-ask loop).
- Registration: TOOL_TAGS, ALWAYS_AVAILABLE, BUILTIN_TOOL_DESCRIPTIONS,
FUNCTION_TOOL_SCHEMAS, the system-prompt blurb. Not admin-gated (any
user can be asked); the structured args serialize via the default
json.dumps path.
- routes/chat_routes.py: relay the `ask_user` event to the client.
- static/js/chat.js + static/style.css: render the question card (options +
free-text Other + dismiss x; removed once answered). Reuses CSS vars and
the .modal-close button; emoji go through the monochrome-SVG pipeline.
Bump chat.js cache pin.
- tests/test_ask_user_tool.py: payload, multi flag, string options, option
cap, validation errors, serializer round-trip, registration.
* feat: Add workspace: confine agent tools to a folder
Pick a server folder as the agent's workspace so its file/shell tools work
there and don't touch files outside it. File tools are hard-confined; bash/
python run with cwd set to the folder.
Includes a slash command: `/workspace` (alias `/ws`) — show / `set <path>` /
`clear` / `pick` (open the directory browser).
- routes/workspace_routes.py: GET /api/workspace/browse (admin-only).
- src/tool_execution.py: hard path confinement for read_file/write_file;
bash/python cwd. Threaded route → stream_agent_loop → execute_tool_block.
- src/agent_loop.py: workspace note prepended to the system prompt.
- static/: overflow menu item, input-bar pill, directory-browser modal, and
the /workspace slash command.
- tests/test_workspace_confine.py.
* Wire workspace confinement into tools that landed after this PR
edit_file (#1239) and grep/glob/ls (#1670) merged after workspace-confine was
written, so they bypassed the workspace boundary. Thread the workspace through:
- edit_file: _do_edit_file resolves via _resolve_tool_path_in_workspace
- grep/glob/ls: _resolve_search_root confines to the workspace (root + paths)
- bash/python/bg cwd: workspace or _AGENT_WORKDIR (keep the #2586 data-dir
default when no workspace is set)
Tests cover edit_file + grep/ls confinement (inside ok, outside rejected).
* Workspace picker: editable path bar + modal style cohesion + cross-platform hardening
- Make the current-folder strip an editable address bar: type/paste a full
path and press Enter to navigate (also reaches other Windows drives and
hidden dirs the up-only browser cannot).
- Reuse shared modal CSS: drop bespoke .workspace-modal-content/.workspace-btn*
in favour of base .modal-content/.modal-body and the .confirm-btn button
family; separators/hover use var(--border). Net -31 CSS lines.
- Fix the path field overflowing the modal right edge (flex stretch + margin
vs an overflow:auto scrollbar-feedback loop): full-bleed, no h-margin.
- Cross-platform confinement: normcase the workspace commonpath check so
containment holds on case-insensitive filesystems (Windows/macOS).
- Make tests OS-portable: sibling temp dirs instead of /etc, python os.getcwd()
instead of pwd. 5 pass.
* feat: round-limit handling — Continue affordance at the cap + configurable cap
When the agent loop runs out of rounds (per-message step cap, default 20)
while still actively using tools, it stopped silently mid-task. Now:
1. The loop emits a `rounds_exhausted` SSE event at the cap, and the UI shows
a "Continue" pill at the bottom of the chat that resumes the task from where
it left off. Repeated cap-hits each get a fresh Continue (multiple continues
in a row).
2. The cap is configurable in Settings → Agent ("Max steps per message"),
validated on the client, at the save endpoint, and at the read site.
- src/agent_loop.py: track `_exhausted_rounds` (set only when a full
tool-executing round completes on the last allowed round — i.e. the agent
wanted to keep going); emit `{"type":"rounds_exhausted","rounds":N}` (logged).
- routes/chat_routes.py: read `agent_max_rounds` (clamped 1..200), pass as
`max_rounds`; forward the new event through the SSE relay.
- routes/auth_routes.py: validate numeric settings on save (int + clamp;
agent_max_rounds 1..200, agent_max_tool_calls 0..1000; 400 on non-int).
- src/settings.py: default `agent_max_rounds = 20`.
- static/: Settings input + client-side clamp; the Continue pill (reuses the
existing .stopped-indicator / .continue-btn classes and theme vars
--border/--fg/--bg/--accent); appended to the chat container so it survives
the message re-render at stream finalize. chat.js cache version bumped.
* test: cover rounds_exhausted emission (cap-hit vs normal finish)
Drives the real stream_agent_loop with mocked LLM stream / tool exec / settings:
a tool block every round exhausts the cap and must emit rounds_exhausted; a
plain answer hits the done-break and must not. Guards the for/else logic.
* fix: pass owner to start_research in chat stream path
Research launched from the chat stream omits the owner parameter,
causing those research sessions to never appear in the user's
research library (which filters by owner). All other start_research
call sites in this file already pass owner=_user.
* test: assert all start_research calls in chat_routes pass owner
Uses AST inspection to verify every start_research() call site
includes the owner= keyword argument, preventing regressions where
new call sites forget to scope research by user.
* Chat metrics: show backend's true generation t/s, not tokens÷wall-clock
The per-message tokens/sec read low and felt wrong because it was computed as
output_tokens / total_duration, where total_duration is wall-clock including
prefill, tool calls, and network — not pure decode time. llama.cpp already
reports the correct gen speed in its stream (timings.predicted_per_second), but
it was being dropped.
- llm_core.py: when parsing the OpenAI-compatible usage chunk, also read the
sibling `timings` block llama.cpp includes — pass predicted_per_second through
as gen_tps and prompt_per_second as prefill_tps on the usage event.
- agent_loop.py: capture backend_gen_tps/backend_prefill_tps from usage events;
in _compute_final_metrics prefer backend_gen_tps over the wall-clock division
when present (fall back to computed for cloud APIs that omit timings). Tag the
result with tps_source ("backend" vs "computed") and surface prefill_tps.
Result: the displayed t/s now matches the model's real decode speed and is
stable regardless of prompt length (a long prefill no longer deflates it).
Checks: py_compile passes; verified extraction against a real llama.cpp final
chunk (gen 79 t/s surfaced vs the deflated wall-clock figure shown before).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Chat metrics: surface true t/s on the direct-chat path too
Follow-up to the gen-tps work: the non-agent direct-chat stream path in
chat_routes turned the raw `usage` event straight into a metrics event but only
copied token counts — it never set tokens_per_second or response_time. So simple
(non-tool) replies showed "Speed: n/a" / "Time: undefineds" and the chip fell
back to a bare token count ("27 tok") instead of t/s.
Map the usage event's gen_tps (llama.cpp timings.predicted_per_second, added in
the prior commit) into tokens_per_second here too, tag tps_source=backend, and
set response_time from wall-clock for the stats popup.
Checks: py_compile passes; verified llama.cpp emits usage+timings on the final
stream chunk (gen ~90 t/s) that this path consumes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Tests: backend gen/prefill t/s passthrough and preference
Cover the two pieces of the true-t/s metric so it can be reviewed on its own:
- stream_llm surfaces llama.cpp's timings.predicted_per_second /
prompt_per_second as gen_tps / prefill_tps on the usage event (captured
llama.cpp final-chunk fixture), and omits them when the backend reports no
timings.
- _compute_final_metrics prefers backend_gen_tps over output/wall-clock,
tags tps_source ("backend" vs "computed"), and surfaces prefill_tps.
Reuses the fake-client stream harness from test_llm_core_streaming.py.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both get_default_chat and _recover_empty_session_model picked the
first model from cached_models[0] without checking hidden_models.
If the first cached model was hidden (e.g. minimax-m3), it was
returned as the default or used to repair empty session models,
even though the model list endpoints already filter hidden_models.
- Add _visible_models() helper that filters cached_models by
hidden_models (mirrors the filtering in list_model_endpoints)
- Use _visible_models() in get_default_chat fallback (when no
explicit default_model is saved)
- Use _visible_models() in _recover_empty_session_model (when
repairing a session whose model field is empty before chat send)
- Add regression tests for hidden-model filtering in default chat
resolution, and unit tests for _visible_models helper
When the selected model fails before producing output, stream_llm_with_fallback
quietly switches to the next candidate and the reply is shown under the
originally selected model's name, so a misconfigured provider looks like it
works. (Concretely: a Bedrock gateway that 400s every Anthropic/Claude request
appears fine because another model silently answers under the Claude label.)
Emit a `fallback` SSE event ({selected_model, answered_by, reason}) the first
time a non-primary candidate produces output, forward it through the agent loop
and both chat-route paths, stamp the response metrics with the model that
actually answered, and show a notice + relabel the reply in the UI.
Tested: python -m pytest tests/test_llm_core_fallback.py (3 pass);
python -m py_compile src/llm_core.py src/agent_loop.py routes/chat_routes.py;
node --check static/js/chat.js.
The "don't wipe endpoint_url/model on endpoint delete" half of #587 landed
in 6a78b02 (Fix endpoint model preservation for tasks). The three remaining
follow-up pieces from the original PR — flagged in the review on #786 —
are:
- routes/model_routes.py: toggle_model_endpoint (PATCH) now accepts
api_key and base_url, so the admin UI can rotate a key or fix a typo'd
URL without going through delete+recreate. base_url is normalized the
same way the POST handler does (strip /models, /chat/completions,
/completions, /v1/messages, then _normalize_base). Cache invalidation
matches the POST/DELETE paths and the response includes base_url so the
frontend can confirm what was saved.
- routes/chat_routes.py: new _recover_empty_session_model picks
cached_models[0] from the endpoint that matches sess.endpoint_url and
persists it onto the Session row before the LLM call goes out. Wired
into both /api/chat and /api/chat_stream after the existing
_clear_orphaned_session_endpoint guard, so the order is: drop
truly-orphaned sessions first, then heal the "picker showed it, session
never knew" case.
- routes/chat_routes.py: when recovery fails (no endpoint, no cached
models) raise HTTP 400 with a clear message instead of letting
model="" reach the upstream as 401/503.
Closes#587.
Streamed deltas flagged thinking:true (reasoning-model traces) were being folded
into full_response and persisted as part of the assistant message, so saved
replies were polluted with the model's chain-of-thought. Forward those deltas to
the client (for a live thinking indicator) but exclude them from the accumulated
saved reply, in both chat and research-stream paths. Mirrors the existing rewrite
path's handling.
The /api/chat/stream_status handler did a membership test against
_active_streams followed by an indexed read of the same key. Between
those two ops, a sibling stream's finally block (or a stop / cleanup
path) can pop the entry, turning the indexed read into a KeyError that
bubbles up as a 500. The race is the exact one _stream_set was already
written to avoid; the comment on the helper at the top of the module
spells out why a single .get() is the right pattern here too.
Collapse the two-step into a single .get() call so the lookup either
returns the live record or None, and report 'detached' / 404 based on
that single read. No behavior change on the happy path; the failure
mode under concurrent stream cleanup is now handled deterministically.
Closes#658.
* feat(web-fetch): add web_fetch tool to read a specific URL's content
* test(web-fetch): add SSRF coverage and fail closed on empty DNS resolution
Add explicit SSRF regression tests for the web_fetch path covering
loopback, private LAN ranges, link-local/metadata, IPv6 private/local,
redirect-into-private, and unsupported schemes. Harden _public_http_url
to fail closed when a hostname resolves to no addresses.
chat_routes.py persisted a session's "mode" in three best-effort spots —
reading the current mode, writing the effective mode, and setting
research_pending on the stream path. Each opened a session with SessionLocal()
and called .close() as the LAST statement inside a try/except, so if anything
before close() raised (e.g. a SQLite "database is locked" under concurrent chat
streams) the except only logged and the connection was never returned to the
pool.
DATABASE_URL defaults to file-backed SQLite, whose engine uses SQLAlchemy's
default QueuePool (5 connections + 10 overflow). Repeated leaks on these hot
paths exhaust the pool; later requests then block for pool_timeout and fail
with "QueuePool limit ... reached", taking the app down until restart.
Move the logic into two best-effort helpers in core.database, next to the
existing session helpers (update_session_last_accessed, get_session_by_id):
- get_session_mode(session_id) -> Optional[str]
- set_session_mode(session_id, mode) -> bool
Both route through the existing get_db_session() context manager, which commits
on success, rolls back on error, and always closes in a finally, so the
connection is returned to the pool on every path. chat_routes.py now calls
these instead of hand-rolling sessions, also removing three copies of the same
try/except.
Add tests/test_session_mode_helpers.py: the helpers commit+close on success
and, on a mid-operation DB error, swallow + roll back + close (no leak). The
error-path tests fail against the old close()-inside-try pattern.