odysseus

mirror of https://github.com/pewdiepie-archdaemon/odysseus.git synced 2026-06-16 09:45:24 -04:00

Author	SHA1	Message	Date
Maruf Hasan	c3fcaf15b7	feat(providers): add NVIDIA AI provider endpoint support (#3456 ) * feat: add NVIDIA as an AI provider (integrate.api.nvidia.com) * feat: add NVIDIA option to provider settings dropdown and aliases * test: add NVIDIA provider detection and endpoint tests * Add NVIDIA to _HOST_TO_CURATED and expand non-chat model filtering - nvidia.com -> 'nvidia' curated key for proper provider routing - _NON_CHAT_PREFIXES: bge, snowflake/arctic-embed, nvidia/nv-embed - _NON_CHAT_CONTAINS: content-safety, -safety, -reward, nvclip, kosmos, fuyu, deplot, vila, neva, gliner, riva, -parse, -embedqa, -nemoretriever * Expand non-chat model filtering for NVIDIA embedding/guard/video models Add _NON_CHAT_PREFIXES: embed, recurrent Add _NON_CHAT_CONTAINS: topic-control, guard, calibration, ai-synthetic-video, cosmos-reason2 Catches remaining unfiltered non-chat models from NVIDIA catalog: embedding (llama-nemotron-embed, embed-qa), guard (llama-guard, nemoguard-topic-control), calibration (ising-calibration), video (ai-synthetic-video-detector, cosmos-reason2), recurrent (recurrentgemma-2b) * Filter non-chat models in _probe_endpoint via _is_chat_model() Previously _is_chat_model() was only used in the per-model probe and _first_chat_model(), so non-chat models still appeared in the model picker even though they were filtered in those specific paths. Applying the filter at _probe_endpoint() return ensures non-chat models (embeddings, safety guards, reward, calibration, video detectors, CLIP, VLM, translation, parsing, recurrent, etc.) never enter cached_models and never appear in the picker. * Fix _NON_CHAT_CONTAINS to catch org-prefixed embedding models Prefix checks (mid.startswith) miss models with org prefixes like baai/bge-m3, nvidia/embed-qa-4, google/recurrentgemma-2b, etc. Adding the same terms to _NON_CHAT_CONTAINS ensures they are caught regardless of the org prefix. 
Adds: embed, bge, recurrent, starcoder, gemma-2b * fix(model-routes): drop collision-prone substrings from global non-chat filter The NVIDIA PR added several substrings to the shared _NON_CHAT_PREFIXES and _NON_CHAT_CONTAINS tuples. These are intended to filter out embedding, retrieval, safety, and vision models from NVIDIA's catalog that are not chat-completions-capable. However, four of the added substrings collide with legitimate chat models served by other providers: - gemma-2b matches google/gemma-2b-it (instruct chat model) - starcoder matches bigcode/starcoder2-15b (code completion model) - recurrent matches google/recurrentgemma-2b (language model) - guard matches meta-llama/Llama-Guard-3-8B (safety classifier) Removing these four from the global tuples keeps the NVIDIA-specific filtering intact (safety, embedding, retrieval, and vision models are still caught by other tokens such as content-safety, -safety, -reward, embed, bge, -embedqa, -nemoretriever, nvclip, deplot, etc.) while preventing false negatives for instruct/code models on other providers. Tests added for gemma-2b-it, google/gemma-2b-it, and bigcode/starcoder2-15b-instruct asserting they are recognized as chat models. Co-authored-by: Kenny Van de Maele <kenny@kvandemaele.be> * fix(nvidia): remove duplicate bge/embed tokens from _NON_CHAT_CONTAINS Tokens already present in _NON_CHAT_PREFIXES, making the CONTAINS entries redundant since the prefix check runs first. Co-authored-by: Kenny Van de Maele <kenny@kvandemaele.be> * fix(nvidia): move bge to CONTAINS, add llama-guard, remove stray blanks Co-authored-by: Kenny Van de Maele <kenny@kvandemaele.be> * style: fix indentation of groq and xai test cases in test_provider_endpoints.py --------- Co-authored-by: Kenny Van de Maele <kenny@kvandemaele.be>	2026-06-09 11:06:12 +02:00
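The prefix/substring filtering described in c3fcaf15b7 can be sketched as follows. This is a minimal illustration, not the repository's code: the two tuples hold only a small subset of the tokens named in the commit message, and the function bodies are assumptions about how `_is_chat_model()` might behave.

```python
# Illustrative subset of the tokens named in the commit message; the real
# tuples in model_routes.py are longer and may differ.
_NON_CHAT_PREFIXES = ("embed", "snowflake/arctic-embed", "nvidia/nv-embed")
_NON_CHAT_CONTAINS = ("content-safety", "-reward", "bge", "llama-guard", "nvclip")


def is_chat_model(model_id: str) -> bool:
    """Return False for ids that look like embedding/safety/vision models."""
    mid = model_id.lower()
    # Prefix check: only catches ids without an org prefix.
    if mid.startswith(_NON_CHAT_PREFIXES):
        return False
    # Substring check: catches org-prefixed ids such as "baai/bge-m3".
    return not any(token in mid for token in _NON_CHAT_CONTAINS)


def filter_chat_models(model_ids: list[str]) -> list[str]:
    """Keep only chat-capable ids, as a probe might do before caching."""
    return [m for m in model_ids if is_chat_model(m)]
```

Broad tokens such as `gemma-2b` or `starcoder` are deliberately absent here, mirroring the follow-up fix that removed them to avoid hiding legitimate chat models.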
onemorethan0	8ae2b5f58c	fix(llm): suppress thinking mode for qwen3/gemma4 on Ollama /v1 endpoint (#3228 ) * fix(llm): suppress thinking for qwen3/gemma4 on Ollama /v1 compat endpoint When using qwen3, QwQ, gemma4, or other thinking models via Ollama's OpenAI-compatible /v1 endpoint, the model routes all output into its <think>...</think> reasoning block. Since Odysseus strips thinking content from round_response and only accumulates native tool_calls, this produces a round with 0 chars, 0 native calls, 0 tool blocks — the agent appears to silently do nothing. Root cause: Odysseus classifies the /v1 endpoint as provider="openai" (not "ollama"), so the payload is built as a standard OpenAI payload without any Ollama-specific options. Ollama's /v1 endpoint accepts "think": false as a top-level parameter to suppress extended thinking, but this was never sent. Fix: - Add _is_ollama_openai_compat_url() to detect local Ollama /v1 URLs - Inject "think": false in both stream_llm and llm_call_async for thinking models (qwen3, QwQ, gemma4, DeepSeek-R1, etc.) on this endpoint Verified with qwen3:14b on Ollama 0.24: with think=False the model correctly emits native tool_calls in a single streaming chunk and the agent executes bash/file/web tools as expected. * fix(llm): extend _is_ollama_openai_compat_url to match localhost on any port Per reviewer feedback on PR #3228: 1. Generalize host detection to mirror _is_ollama_native_url: match any localhost/127.0.0.1/0.0.0.0/::1 host (not just port 11434) so that custom OLLAMA_HOST ports and container remaps are also covered. 2. Add tests/test_llm_core_ollama_thinking.py covering: - _is_ollama_openai_compat_url for all positive/negative URL cases including IPv6, non-default port, native /api path, and real OpenAI - Payload injection: think:false set for Ollama /v1 thinking model, not set for non-thinking model, not set for real OpenAI endpoint, and set for localhost on a non-default port (the new case)	2026-06-09 07:35:15 +02:00
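A rough sketch of the detection and payload injection described in 8ae2b5f58c. The helper names, the host set, and the model-name hints are assumptions drawn from the commit message, not the actual `_is_ollama_openai_compat_url()` implementation.

```python
from urllib.parse import urlparse

# Hosts listed in the commit message; any port is accepted.
_LOCAL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "::1"}
# Illustrative subset of thinking-model name hints (assumption).
_THINKING_HINTS = ("qwen3", "qwq", "gemma4", "deepseek-r1")


def is_ollama_openai_compat_url(url: str) -> bool:
    """True for a local URL whose path is /v1 or lives under /v1/."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()  # hostname strips IPv6 brackets
    path = parsed.path.rstrip("/")
    return host in _LOCAL_HOSTS and (path == "/v1" or path.startswith("/v1/"))


def build_payload(url: str, model: str, messages: list) -> dict:
    """Build an OpenAI-style payload, adding think=False where it applies."""
    payload = {"model": model, "messages": messages}
    is_thinking = any(hint in model.lower() for hint in _THINKING_HINTS)
    if is_thinking and is_ollama_openai_compat_url(url):
        payload["think"] = False
    return payload
```

A host-based heuristic like this cannot tell Ollama apart from other local OpenAI-compatible servers, so the sketch treats every local `/v1` endpoint the same way.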
Ocean Bennett	e7c1d75884	fix(models): query v1 models for llama-server endpoints (#3380) * fix(models): query v1 models for llama-server endpoints * test(models): accept owner kwargs in llama-server regression	2026-06-09 01:09:02 +02:00
stocky789	1e0d9b92af	feat: add ChatGPT Subscription provider (#2876 ) * feat: Add ChatGPT Subscription support and related features - Introduced a new provider option for ChatGPT Subscription in the endpoint selection UI. - Implemented OAuth flow for ChatGPT Subscription sign-in, including polling for authorization status. - Updated admin interface to handle ChatGPT Subscription, including disabling API key input and providing user guidance. - Enhanced cost tracking logic to differentiate between subscription and non-subscription endpoints. - Added new slash commands for managing skills, including listing, searching, and invoking skills. - Implemented caching for skill catalog to optimize performance. - Updated tests to cover new ChatGPT Subscription functionality and ensure proper endpoint probing. - Refactored existing code to accommodate new features and improve maintainability. * refactor: share provider device-flow setup - reuse one device-flow backend for Copilot and ChatGPT Subscription - add one frontend device-flow helper for Settings and /setup - put GitHub Copilot back into Add Models, now as a dropdown option - make provider selection just select; clicking Add starts sign-in - stop ChatGPT Subscription setup from opening auth tabs automatically - make /setup copilot and /setup chatgpt-subscription work from chat - show ChatGPT Subscription in the /setup suggestions - show the real error message when setup fails - add focused tests for the shared flow and setup UI * feat(chatgpt-subscription): harden credential lifecycle and streamline auth UX Backend: - Resolve runtime bearer for provider-auth endpoints at probe time via a shared _resolve_probe_key() that delegates to resolve_endpoint_runtime, applied across all probe/refresh call sites. - Skip live completion probes and health pings for discovery-only providers (centralized behind _is_discovery_only_provider) — the Codex/Responses API has no such endpoints, so status is derived from cached models. 
- Never persist the short lived ChatGPT bearer to the plaintext sessions table; proactively clear any stale bearer left by an earlier code path. - Revoke orphaned ProviderAuthSession credentials when the last endpoint backing them is deleted (_delete_orphaned_provider_auth), surfaced via cleared_provider_auth in the delete response. Frontend (admin.js): - Auto-start the device-auth flow on provider selection so the authorization panel (code + Authorize) shows immediately instead of behind a "Sign in" click. - Remove the redundant top button for device auth providers, move retry into the panel via an inline "Try again". - Drop the self-evident hint text and add an execCommand clipboard fallback so Copy works in non-secure (HTTP/LAN) contexts. * fix: harden chatgpt subscription provider * chore: remove PR media from branch * Fix chatgpt subscription recovery and token handling --------- Co-authored-by: 5p00kyy <admin@5p00ky.dev>	2026-06-08 10:19:18 +02:00
adabarbulescu	a8859bb25c	fix(llm): Properly detect remote Ollama bare URLs as native endpoints (fixes #3252) (#3343)	2026-06-07 21:19:19 +02:00
M57	12cb39cbd9	feat: add OpenCode Zen and Go as provider options (#26 ) - Add OpenCode Zen (https://opencode.ai/zen/v1) and Go (https://opencode.ai/zen/go/v1) - Add provider detection via _host_match() in llm_core.py - Add curated model list entries in model_routes.py - Add webhook provider URLs - Add provider icon (providers.js) and dropdown options (index.html) - Add auto-detection patterns and setup URLs (slashCommands.js) - Whitelist opencode.ai in URL validation (admin.js) - Rebased on main to fix merge conflicts with _HOST_TO_CURATED refactor Co-authored-by: M57 <hy4ri@users.noreply.github.com>	2026-06-07 16:43:00 +02:00
Mohammed Riaz	6ccd4500d7	fix(chat): show requested and actual reply models Show requested and actual reply models in chat labels when fallback or provider routing changes the responding model.	2026-06-06 04:30:16 -06:00
nubs	47a47bf71d	fix(llm): guard against null arguments in streaming tool-call accumulator (#2923)	2026-06-05 20:57:36 +02:00
nubs	8354948a1c	fix(llm): route harmony thinking streams (#2449)	2026-06-05 15:22:08 +02:00
Isaiah Gardner	134c608466	fix: degrade missing/None content key in system messages to empty string (#2570)	2026-06-05 00:10:11 +02:00
Kenny Van de Maele	1cd0aa2b8c	feat(provider): add GitHub Copilot provider with device-flow auth (#1480 ) * feat(provider): add GitHub Copilot provider with device-flow auth Adds GitHub Copilot as a model provider, so Copilot models (gpt-4o/4.1/5, Claude, Gemini, …) work through the normal chat + agent loop, incl. native tool calling and vision. Auth is one-click via the GitHub OAuth device flow; the access token is stored as the endpoint's (encrypted) api_key and sent directly as `Authorization: Bearer` (no Copilot-token exchange, no refresh — matching how editors talk to the Copilot API). Copilot is a normal ModelEndpoint detected by host; the only provider-specific behaviour is a small set of required request headers, injected centrally. Sign-in is available from Settings → model endpoints ("Connect GitHub Copilot") and from chat via `/setup copilot`. - src/copilot.py (new), routes/copilot_routes.py (new): constants, header builders, device-flow start/poll, model discovery, owner-scoped endpoint provisioning. - src/llm_core.py, src/endpoint_resolver.py: detect `copilot`, inject headers, per-request x-initiator/vision. - src/agent_loop.py: allowlist api.githubcopilot.com for native tool schemas. - src/model_context.py: known context windows for Copilot (no unauthenticated /models probe). - static/, README, tests/test_copilot.py. Tidy copilot_routes: clarify supports_tools, note _PENDING is per-process	2026-06-04 21:13:14 +02:00
Giuseppe	6d511f6e66	fix(llm): auto-detect <think> in content stream for unregistered thinking models (#2588 ) * fix(llm): auto-detect <think> in content stream for unregistered thinking models _THINKING_MODEL_PATTERNS only covers known model families by name. Qwen3-derived models with non-standard names (e.g. Qwopus, custom QwQ forks) are not matched, so their <think>...</think> content streams through as visible chat text instead of being routed to the thinking display. When the first content delta opens with <think> and the model was not already identified as a thinking model, dynamically flag the stream as a thinking model for the remainder of the response. This enables the existing </think> repair path (line below) and ensures the frontend receives the full <think>...</think> wrapper it needs to split thinking from the final answer. The check is restricted to the very first content delta (_first_content_sent is False) to avoid misidentifying models that happen to write "<think>" mid-answer. Fixes #2225 Related: #2420 (covered by separate PR from @AmmarS-Analyst), #2224 (@RaresKeY) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(llm): replace inert _thinking_model flag with _in_think_tag state machine The original auto-detect set _thinking_model=True on the first <think> chunk but still emitted it as a regular delta and set _first_content_sent=True immediately, so no subsequent chunk could enter the repair path. Replace with _in_think_tag bool: enter thinking mode when first content starts with <think>, route all chunks to the thinking channel until </think> is found, then the tail becomes the first regular delta. Adds three regression tests. 
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix(llm): replace _first_content_sent guard with _think_open_stripped Opening-tag stripping used `not _first_content_sent` as the guard, but _first_content_sent stays False throughout the entire think block (it only flips when regular content is emitted). So `find(">")` ran on every reasoning chunk — not just the first — and silently truncated everything before the first ">" in any reasoning text containing comparisons, arrows, or code. Fix: add `_think_open_stripped = False` alongside `_in_think_tag`. Use it as the strip guard in both the "still inside <think>" path and the "</think> found in same chunk" split path. Set it True once the opening tag is consumed so all subsequent chunks reach the thinking channel unmolested. Add regression test: 3-chunk stream where the middle chunk contains "c > d" — confirms "more c " is not dropped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 20:18:19 +02:00
Giuseppe	531f426557	fix: KeyError on missing 'content' key in system messages (#2362 ) A system message that arrives without a 'content' key — possible via malformed tool results — raised a KeyError in the hot path of llm_call, llm_call_async, and stream_llm. Replace m["content"] with m.get("content") or "" in all three functions so a missing key degrades to an empty string instead of crashing. Also removes a redundant .rstrip() after .strip() in _model_activity_key. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 19:38:45 +02:00
Giuseppe	ff8f9f2188	fix: llm_call_async does not retry on HTTP 429/502/503/504 (#2364 ) The retry loop raised immediately for any non-success HTTP response regardless of attempt count. For transient upstream errors (rate limit, bad gateway, gateway timeout) the function should back off and retry within the existing attempt budget. Also lets ConnectError / ConnectTimeout retry when the host has not been cooled and attempts remain, instead of always raising on the first connect failure. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 19:35:55 +02:00
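The retry behaviour described in ff8f9f2188 might look roughly like the sketch below. `send` is a hypothetical coroutine function returning `(status, body)`; the exponential delay schedule and the parameter names are assumptions, not the values used by `llm_call_async`.

```python
import asyncio

# Transient statuses named in the commit message.
RETRYABLE_STATUS = {429, 502, 503, 504}


async def call_with_retry(send, max_attempts=3, base_delay=0.5, sleep=asyncio.sleep):
    """Call `send()` until it succeeds, retrying only transient HTTP errors."""
    for attempt in range(1, max_attempts + 1):
        status, body = await send()
        if 200 <= status < 300:
            return body
        if status in RETRYABLE_STATUS and attempt < max_attempts:
            # Back off before the next attempt: 0.5s, 1s, 2s, ...
            await sleep(base_delay * 2 ** (attempt - 1))
            continue
        # Non-retryable status, or the attempt budget is exhausted.
        raise RuntimeError(f"HTTP {status} after {attempt} attempt(s)")
    raise ValueError("max_attempts must be >= 1")
```

Passing `sleep` as a parameter keeps the backoff testable without real delays.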
Giuseppe	bc9104efe2	fix: SSE stream parser crashes with NoneType on providers sending null choice/usage/tc entries (#2389 ) * fix: SSE parser crashes with NoneType on MiniMax-M3 (and any provider sending null choice/usage/tc) Three guards added in stream_llm: 1. choices[0] null check — MiniMax (and some other providers) send a choices entry as None. `_choices[0].get("delta")` raised AttributeError. Now checks `_choices[0] is not None` before calling .get(). 2. usage null guard — j["usage"] can arrive as None (not a dict) on some providers. Added `or {}` so subsequent .get() calls don't crash. 3. tool_calls null entry skip — individual entries in the tool_calls array can be None. Added `if tc is None: continue` before tc.get("function"). All three match the `or {}` / null-guard pattern used elsewhere in the same block. Safe for all OpenAI-compatible providers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: guard null choice in elif-choices SSE branch The usage-chunk path already guarded _choices[0] is not None, but the elif "choices" branch that processes content/tool-call deltas did not. A chunk like {"choices": [null]} or {"choices": [null], "usage": null} reaches j["choices"][0].get("delta") and crashes with: 'NoneType' object has no attribute 'get' Fix: extract choices[0] into _c0 and continue to the next chunk when it is None, matching the guard already applied in the usage path. Adds three focused regressions covering the paths the maintainer flagged: - {"choices": [null]} - {"choices": [null], "usage": null} - tool_calls array containing a null entry alongside a valid call Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-04 13:53:10 +01:00
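The three null guards described in bc9104efe2 can be illustrated with a small standalone parser. This is a sketch under assumptions: the function name and return shape are invented for illustration and do not mirror the structure of `stream_llm`.

```python
import json


def parse_sse_chunk(line: str):
    """Return (content_delta, tool_calls, usage) for one data line, or None."""
    if not line.startswith("data:"):
        return None
    raw = line[len("data:"):].strip()  # tolerates "data:" with or without a space
    if not raw or raw == "[DONE]":
        return None
    j = json.loads(raw)
    usage = j.get("usage") or {}           # guard: "usage": null
    choices = j.get("choices") or []
    c0 = choices[0] if choices else None
    if c0 is None:                          # guard: "choices": [null]
        return ("", [], usage)
    delta = c0.get("delta") or {}
    # guard: null entries inside the tool_calls array
    tool_calls = [tc for tc in (delta.get("tool_calls") or []) if tc is not None]
    return (delta.get("content") or "", tool_calls, usage)
```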
tanmayraut45	f59edee611	Support extra CA bundle for private-CA LLM providers (#769 ) Adding GigaChat (Sber) or an on-premise enterprise LLM gateway as a model endpoint fails on first probe with CERTIFICATE_VERIFY_FAILED: self-signed certificate in certificate chain (_ssl.c:1000) because their TLS chain is signed by a private root CA (Russian Trusted Root CA for GigaChat; corporate CA for on-prem) that isn't part of the default system / certifi trust store. The endpoint shows offline in the picker even though the URL and API key are correct (issue #722). The right fix is to extend the trust store, not to weaken verification. This change: - src/tls_overrides.py: new module that resolves an opt-in env var LLM_CA_BUNDLE at import time, builds a shared SSLContext via ssl.create_default_context() (so the system / certifi bundle is loaded first) and layers the operator's PEM on top with load_verify_locations(). Exposes llm_verify() returning a value suitable for httpx `verify=`. Defaults to True (httpx built-in trust) when the env var is unset, when the file is missing, or when the PEM fails to load — verification is never silently disabled, the warning is logged and we fall back to the safe path. - src/llm_core.py: thread llm_verify() into the shared AsyncClient used by stream_llm / streaming completions. - routes/model_routes.py: thread llm_verify() into the five httpx.get call sites in _probe_endpoint / _ping_endpoint so adding a private-CA endpoint goes green on the very first probe and the picker stops showing it offline. - .env.example: document LLM_CA_BUNDLE with the GigaChat case as the concrete example. Deliberately NOT included: a verify=False knob (global or per-host). Disabling verification exposes the affected endpoint to MITM, and the operator-supplied bundle is the correct fix for legitimate private-CA providers — so the only switch in this PR is the safe one. Closes #722.	2026-06-04 13:18:50 +01:00
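A minimal sketch of the trust-store extension described in f59edee611, using only the standard library. Unlike the commit, which resolves `LLM_CA_BUNDLE` at import time, this version resolves it on each call and takes the environment as a parameter so it can be exercised in isolation; those choices are assumptions made for illustration.

```python
import logging
import os
import ssl

log = logging.getLogger(__name__)


def llm_verify(env=os.environ):
    """Return a value usable as httpx `verify=`: True or an SSLContext."""
    path = env.get("LLM_CA_BUNDLE")
    if not path:
        return True  # default trust store
    try:
        # Load the default CAs first, then layer the operator's PEM on top.
        ctx = ssl.create_default_context()
        ctx.load_verify_locations(cafile=path)
    except (OSError, ssl.SSLError) as exc:
        # Missing file or bad PEM: warn and keep verification enabled.
        log.warning("LLM_CA_BUNDLE %r could not be loaded (%s)", path, exc)
        return True
    return ctx
```

The function never returns `False`, so a broken bundle degrades to default verification rather than to no verification.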
Yuri	a2e691da2b	fix(models): stabilize proxy endpoint refresh behavior * fix: support large proxy model endpoint refresh Large OpenAI-compatible proxy endpoints can expose hundreds of models and make /v1/models slow. Treating those endpoints like local model servers caused model picker opens and background probes to repeatedly hit /models, producing timeouts and making otherwise usable endpoints appear offline. Make model endpoint discovery cached-first for normal UI usage, add explicit proxy/API classification and refresh policy fields, exclude proxy/API endpoints from aggressive local probing, and preserve cached models when refresh fails. Manual Test/Add/Refresh actions still fetch the full model list with longer timeouts so users can intentionally import large proxy model lists without blocking normal model picker usage. * fix: preserve endpoint ping status semantics	2026-06-04 04:56:11 +01:00
danielroytel	39848a168b	fix: recognize Gemma 4 as a thinking model and add context entry (#1642 ) Gemma 4 returns reasoning_content in streaming responses via llama-server, but the model wasn't listed in _THINKING_MODEL_PATTERNS, causing reasoning tokens to be mishandled. Add "gemma" to the pattern list and register Gemma 4's 128K context window in KNOWN_CONTEXT_WINDOWS so the agent loop budgets context correctly. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-06-03 14:23:18 +09:00
Afonso Coutinho	19e62208d2	fix: streaming drops providers that emit SSE data lines with no space (#1701)	2026-06-03 13:37:14 +09:00
Afonso Coutinho	3da4edb442	fix: token usage dropped when it rides on a non-empty finish delta (#1703)	2026-06-03 13:36:57 +09:00
lekt8	126e91e8b9	Don't attempt the same (url, model) route twice in the fallback chains (#1733 ) The fallback helpers (llm_call_with_fallback, llm_call_async_with_fallback, stream_llm_with_fallback) build their candidate list as the primary target followed by the configured fallbacks. Callers prepend the session's live (url, model) to default_model_fallbacks, so if the user also lists their current model among the fallbacks — a common misconfiguration — the chain re-attempts the very route that just failed: a wasted round-trip (and, for the streaming path, a spurious 'fallback' notice for a switch that didn't actually happen). Add a small _dedupe_candidates() helper that filters malformed entries and drops a later repeat of an already-seen (url, model), preserving order (first wins, keeping its headers). Apply it in all three fallback chains. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 13:33:50 +09:00
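The order-preserving de-duplication described in 126e91e8b9 could be sketched like this. The candidate shape `(url, model, headers)` is an assumption; only the "first wins, malformed entries dropped" behaviour comes from the commit message.

```python
def dedupe_candidates(candidates):
    """Drop malformed entries and later repeats of an already-seen (url, model)."""
    seen = set()
    out = []
    for cand in candidates:
        # Malformed: not a sequence, or too short to hold url and model.
        if not isinstance(cand, (tuple, list)) or len(cand) < 2:
            continue
        url, model = cand[0], cand[1]
        if not isinstance(url, str) or not isinstance(model, str) or not url or not model:
            continue
        key = (url, model)
        if key in seen:
            continue  # first occurrence wins, keeping its headers
        seen.add(key)
        out.append(tuple(cand))
    return out
```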
Ethan	b9c382006e	Clamp Anthropic temperature to [0.0, 1.0] in _build_anthropic_payload (#1737 ) Anthropic's Messages API rejects temperature > 1.0 with HTTP 400, but _build_anthropic_payload forwarded it verbatim. The shipped "Nietzsche" preset uses temperature 1.2 and the UI slider allows up to 2.0, so every Claude request under such a preset hard-broke. Clamp into [0.0, 1.0] in the Anthropic builder only (OpenAI keeps its wider 0.0-2.0 range). Covers all three Anthropic call paths, which build through this one function. None is passed through unchanged. Fixes #1615 Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>	2026-06-03 13:29:36 +09:00
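The clamp described in b9c382006e amounts to a few lines. The helper name below is hypothetical; in the commit the logic lives inside `_build_anthropic_payload`.

```python
def clamp_anthropic_temperature(temperature):
    """Clamp into [0.0, 1.0]; None passes through unchanged."""
    if temperature is None:
        return None
    return max(0.0, min(1.0, float(temperature)))
```

With this, the 1.2 preset mentioned in the commit would be sent as 1.0 instead of triggering an HTTP 400.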
Shreyas S Joshi	7504fedb17	fix: surface reasoning_content when content is empty (thinking models) (#1233 ) Thinking models served via llama.cpp without --reasoning-format none (e.g. Qwen3, DeepSeek-R1) route all tokens into reasoning_content and return content="". Two call paths were silently broken: - llm_call / llm_call_async (non-streaming): hard-keyed data["choices"][0]["message"]["content"] raises KeyError or returns empty string, discarding the entire response. - stream_agent_loop end-of-round fallback: when full_response is empty but round_reasoning has content, the existing code replaced the response with the generic empty-response error message, discarding all reasoning tokens that were correctly accumulated during streaming. Fix: in both non-streaming paths use msg.get("content") or msg.get("reasoning_content") or "". In the streaming fallback, surface round_reasoning as the answer before falling through to the error path.	2026-06-03 01:41:24 +09:00
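The non-streaming fallback quoted in 7504fedb17 can be shown in isolation. The function name is hypothetical; the `content` then `reasoning_content` then empty-string chain is taken from the commit message.

```python
def extract_reply_text(data: dict) -> str:
    """Prefer content, fall back to reasoning_content, then to ''."""
    choices = data.get("choices") or []
    first = choices[0] if choices else None
    msg = (first or {}).get("message") or {}
    return msg.get("content") or msg.get("reasoning_content") or ""
```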
Afonso Coutinho	65751186bd	fix: merging consecutive user messages corrupts multimodal (image) content (#1277) * fix: preserve multimodal content blocks when merging consecutive user messages * test: consecutive user-message merge keeps multimodal image blocks	2026-06-03 01:21:57 +09:00
Afonso Coutinho	a04553013d	fix: Anthropic responses with multiple text blocks lose all but the first (#1255) * fix: concatenate all Anthropic text blocks, not just the first * test: Anthropic response parsing concatenates text blocks	2026-06-03 00:57:20 +09:00
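A sketch of the text-block concatenation described in a04553013d. The function name is hypothetical, and joining with an empty separator is an assumption; the commit only states that all text blocks are concatenated.

```python
def anthropic_text(response: dict) -> str:
    """Concatenate every text block in an Anthropic Messages response."""
    blocks = response.get("content") or []
    return "".join(
        block.get("text") or ""
        for block in blocks
        # Skip non-text blocks such as tool_use.
        if isinstance(block, dict) and block.get("type") == "text"
    )
```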
pewdiepie-archdaemon	ff93a6c63b	Polish email and cookbook flows	2026-06-02 22:42:07 +09:00
SurprisedDuck	934bca9e48	Providers: omit temperature for OpenAI reasoning models * fix: omit temperature for OpenAI reasoning models (o1/o3/o4/gpt-5) These models only accept the default temperature; sending any explicit value (even 0.0) returns HTTP 400 "Only the default (1) value is supported". This broke two paths: - Endpoint probing in _probe_single_model hardcodes temperature: 0.0, so a perfectly valid o3/gpt-5 endpoint is reported as failing in the Model Endpoints health check. - Chat/stream payloads send temperature unconditionally, so a non-default temperature preset 400s on these models. The code already special-cases the same model family for max_completion_tokens, so this adds a sibling _restricts_temperature() helper and omits the field for those models, letting the API use its required default. gpt-4.5 is intentionally excluded (not a reasoning model; accepts temperature normally). Adds tests/test_llm_core_temperature.py covering the predicate and the synchronous payload builder. * fix: also omit temperature for reasoning models on the direct-POST paths The first commit only covered llm_call/llm_call_async/stream_llm and the endpoint probe. Email auto-summary, urgency-less spam classification, the email reply-summary endpoint, and gallery vision tagging build their OpenAI payloads inline and POST them directly (requests/httpx), bypassing llm_core — so a reasoning model configured there would still 400 on the temperature field. These sites already branch on _uses_max_completion_tokens, so they're the same class; added the matching _restricts_temperature guard. gallery_routes also gains the max_completion_tokens branch it was missing, so gpt-5 vision tagging works end to end. Note: email_pollers urgency scoring goes through llm_call_async and was already covered.	2026-06-02 20:58:33 +09:00
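The predicate and payload guard described in 934bca9e48 might be sketched as below. The regular expression is an assumption about how the o1/o3/o4/gpt-5 family could be matched; the real `_restricts_temperature()` may use different rules.

```python
import re

# Family names from the commit message; the lookahead avoids matching
# longer names such as a hypothetical "o30" or "gpt-50".
_REASONING_RE = re.compile(r"(o1|o3|o4|gpt-5)(?![0-9a-z])")


def restricts_temperature(model: str) -> bool:
    """True for models that only accept the default temperature."""
    name = model.lower().rsplit("/", 1)[-1]  # drop any org prefix
    return _REASONING_RE.match(name) is not None


def build_openai_payload(model: str, messages: list, temperature=None) -> dict:
    """Omit the temperature field entirely for restricted models."""
    payload = {"model": model, "messages": messages}
    if temperature is not None and not restricts_temperature(model):
        payload["temperature"] = temperature
    return payload
```

Because the pattern is anchored at the start of the name, `gpt-4.5` and `gpt-4o` fall through and keep their temperature, consistent with the exclusion noted in the commit.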
Leo	6c15dc7d33	Chat metrics: surface backend generation speed * Chat metrics: show backend's true generation t/s, not tokens÷wall-clock The per-message tokens/sec read low and felt wrong because it was computed as output_tokens / total_duration, where total_duration is wall-clock including prefill, tool calls, and network — not pure decode time. llama.cpp already reports the correct gen speed in its stream (timings.predicted_per_second), but it was being dropped. - llm_core.py: when parsing the OpenAI-compatible usage chunk, also read the sibling `timings` block llama.cpp includes — pass predicted_per_second through as gen_tps and prompt_per_second as prefill_tps on the usage event. - agent_loop.py: capture backend_gen_tps/backend_prefill_tps from usage events; in _compute_final_metrics prefer backend_gen_tps over the wall-clock division when present (fall back to computed for cloud APIs that omit timings). Tag the result with tps_source ("backend" vs "computed") and surface prefill_tps. Result: the displayed t/s now matches the model's real decode speed and is stable regardless of prompt length (a long prefill no longer deflates it). Checks: py_compile passes; verified extraction against a real llama.cpp final chunk (gen 79 t/s surfaced vs the deflated wall-clock figure shown before). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Chat metrics: surface true t/s on the direct-chat path too Follow-up to the gen-tps work: the non-agent direct-chat stream path in chat_routes turned the raw `usage` event straight into a metrics event but only copied token counts — it never set tokens_per_second or response_time. So simple (non-tool) replies showed "Speed: n/a" / "Time: undefineds" and the chip fell back to a bare token count ("27 tok") instead of t/s. 
Map the usage event's gen_tps (llama.cpp timings.predicted_per_second, added in the prior commit) into tokens_per_second here too, tag tps_source=backend, and set response_time from wall-clock for the stats popup. Checks: py_compile passes; verified llama.cpp emits usage+timings on the final stream chunk (gen ~90 t/s) that this path consumes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Tests: backend gen/prefill t/s passthrough and preference Cover the two pieces of the true-t/s metric so it can be reviewed on its own: - stream_llm surfaces llama.cpp's timings.predicted_per_second / prompt_per_second as gen_tps / prefill_tps on the usage event (captured llama.cpp final-chunk fixture), and omits them when the backend reports no timings. - _compute_final_metrics prefers backend_gen_tps over output/wall-clock, tags tps_source ("backend" vs "computed"), and surfaces prefill_tps. Reuses the fake-client stream harness from test_llm_core_streaming.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 20:52:08 +09:00
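The passthrough and preference logic described in 6c15dc7d33 can be sketched in two small functions. The function names and the exact event shape are assumptions; the field names `timings.predicted_per_second`, `timings.prompt_per_second`, `gen_tps`, `prefill_tps`, and `tps_source` come from the commit message.

```python
def usage_event(chunk: dict) -> dict:
    """Build a usage event, passing through llama.cpp timings when present."""
    usage = chunk.get("usage") or {}
    timings = chunk.get("timings") or {}
    event = {"output_tokens": usage.get("completion_tokens", 0)}
    if timings.get("predicted_per_second") is not None:
        event["gen_tps"] = timings["predicted_per_second"]
    if timings.get("prompt_per_second") is not None:
        event["prefill_tps"] = timings["prompt_per_second"]
    return event


def final_tps(output_tokens: int, wall_seconds: float, backend_gen_tps=None) -> dict:
    """Prefer the backend's decode speed; else divide tokens by wall-clock."""
    if backend_gen_tps:
        return {"tokens_per_second": backend_gen_tps, "tps_source": "backend"}
    tps = output_tokens / wall_seconds if wall_seconds > 0 else 0.0
    return {"tokens_per_second": tps, "tps_source": "computed"}
```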
Tatlatat	e084dc993e	Chat: merge consecutive user messages for strict providers After a non-native tool round, the agent appends tool results as a {role: 'user'} message next to the user's original 'user' prompt, producing two consecutive 'user' messages. Strict provider APIs (Anthropic/Claude) reject consecutive same-role messages, so the follow-up generation request fails silently — search returns sources, then nothing is generated. _sanitize_llm_messages now merges consecutive 'user' messages (joining their content). Only user/user is merged; normal chat and agent/tool turns already alternate and are untouched. Scoped down per maintainer review: the agent_loop 'output' source-extraction change is already on main (#898/#901) and the broad-mocking web-sources test was dropped. Added a focused test that runs consecutive-user messages through the real _build_anthropic_payload and asserts the payload alternates correctly.	2026-06-02 20:44:13 +09:00
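A sketch of the user/user merge described in e084dc993e, extended to keep multimodal blocks intact as in the later fix 65751186bd. The blank-line separator and the helper names are assumptions; `_sanitize_llm_messages` may join content differently.

```python
def _as_blocks(content):
    """Normalize string or list content to a list of content blocks."""
    if isinstance(content, list):
        return list(content)
    return [{"type": "text", "text": content or ""}]


def merge_consecutive_user(messages: list) -> list:
    """Merge adjacent 'user' messages; all other roles are left untouched."""
    out = []
    for m in messages:
        prev = out[-1] if out else None
        if prev is not None and prev.get("role") == "user" and m.get("role") == "user":
            a, b = prev.get("content"), m.get("content")
            if isinstance(a, list) or isinstance(b, list):
                # Multimodal: concatenate block lists so images survive.
                prev["content"] = _as_blocks(a) + _as_blocks(b)
            else:
                prev["content"] = f"{a or ''}\n\n{b or ''}"
        else:
            out.append(dict(m))  # shallow copy so the input is not mutated
    return out
```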
Ernest Hysa	a8a34bd22a	Ollama: pass discovered num_ctx in chat requests _build_ollama_payload sends options.temperature and options.num_predict to /api/chat, but never options.num_ctx. Ollama defaults num_ctx to 2048 when the option is omitted, so prompts going to any Ollama backend are silently truncated there regardless of the model's actual capability. Thread the discovered context length through the three call sites (llm_call, llm_call_async, stream_llm) and emit options.num_ctx when it is known and positive. The builder filters out the DEFAULT_CONTEXT fallback (128000) so we don't lie to Ollama about models whose window we couldn't actually discover. The issue's literal 'when > 2048' heuristic is dropped: a model with a real context smaller than 2048 would OOM if Ollama used its default, so we pass the real value regardless of size. Matches how src/context_compactor.py uses the same helper. Sister fix to PR #753 — that PR teaches the compactor the right budget, this one tells Ollama to actually use that budget on the way in.	2026-06-02 20:27:24 +09:00
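The `num_ctx` rule described in a8a34bd22a, sketched as a standalone options builder. The function name and signature are hypothetical; the `DEFAULT_CONTEXT` value of 128000 and the "known and positive" condition come from the commit message.

```python
DEFAULT_CONTEXT = 128000  # fallback value named in the commit message


def build_ollama_options(temperature, max_tokens, context_length=None) -> dict:
    """Build the `options` object for Ollama /api/chat."""
    options = {"temperature": temperature, "num_predict": max_tokens}
    # Only send num_ctx when the window was actually discovered; the
    # fallback value means "unknown" and is filtered out.
    if context_length and context_length > 0 and context_length != DEFAULT_CONTEXT:
        options["num_ctx"] = int(context_length)
    return options
```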
nsgds	5645cce6d0	Support vLLM 0.20.2 / NIM reasoning-parser output end-to-end (surface + agent context + render) (#602 ) * fix(stream): read 'reasoning' SSE field for vLLM 0.20.2 / NIM vLLM 0.20.2 / NVIDIA NIM emit reasoning-parser output in the `reasoning` delta field; older builds use `reasoning_content`. stream_llm() read only the latter, so reasoning from models like Nemotron-3-Nano (--reasoning-parser) was silently dropped and never rendered. Accept either field. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(agent): keep reasoning_content only on the latest assistant turn The agent loop echoed each round's reasoning back as `reasoning_content` on every assistant turn, assuming vendors ignore it. Nemotron's chat template re-injects ALL prior reasoning_content as <think> blocks, and the loop is trimmed only once (before it starts) — so reasoning accumulated unbounded across rounds, bloating context and feeding the model its own prior reasoning, which reinforced repetition/looping. Strip reasoning_content from earlier assistant turns so only the most recent round carries it (still satisfies DeepSeek's thinking-mode follow-up requirement). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(agent-ui): wrap each round's reasoning in its own <think> block The streamed think-tag wrapper gated on whole-message substring checks (accumulated.includes('<think>')), which only ever wrapped ONE reasoning block per message. A multi-round agent response has a reasoning phase per round, so once round 1 closed its <think>...</think>, rounds 2+ reasoning was emitted unwrapped and leaked into the visible answer. Replace the substring checks with a stateful open/close flag that toggles per think/answer cycle, so each round's reasoning gets its own collapsible block. Single-turn chat is unchanged (one open, one close). 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(stream): reasoning/reasoning_content delta surfaces as thinking chunk Covers @pewdiepie-archdaemon's requested regression: a streamed {reasoning: ...} delta emits a thinking chunk while {content: ...} streams as normal content; plus the older reasoning_content field for backward compat. Mirrors the #591 scenario. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-02 11:48:17 +09:00
James Arslan	a327df6936	Fix native tool-calling follow-up round on Gemini and Ollama (#867) The agent's multi-round (tool-result) follow-up request was rejected with HTTP 400 on two providers, so tools ran but the agent never produced an answer: - OpenAI-compatible streaming (Gemini 3) dropped the per-call thought_signature and collided parallel tool calls, which arrive with index=None: they all landed in slot 0, overwriting the first call's name and corrupting its arguments by concatenation, so the follow-up request 400'd. Capture and replay each call's extra_content (thought_signature), and give every parallel call its own accumulator slot (allocated above the max key, so sparse or mixed indices can't collide). - Native Ollama /api/chat expects object tool-call arguments, but Odysseus carries them as a JSON string, which Ollama rejected ("Value looks like object, but can't find closing '}' symbol"). Convert them to objects in the Ollama payload builder. Both compose with the no-prose null-content sanitize fix from #862. Tested: python -m pytest tests/test_llm_core_streaming.py tests/test_llm_core_ollama.py tests/test_agent_loop.py (53 pass), and python -m py_compile src/llm_core.py src/agent_loop.py.	2026-06-02 11:39:40 +09:00
James Arslan	6776c7d691	Surface silent model fallback instead of masking it (#868) When the selected model fails before producing output, stream_llm_with_fallback quietly switches to the next candidate and the reply is shown under the originally selected model's name, so a misconfigured provider looks like it works. (Concretely: a Bedrock gateway that 400s every Anthropic/Claude request appears fine because another model silently answers under the Claude label.) Emit a `fallback` SSE event ({selected_model, answered_by, reason}) the first time a non-primary candidate produces output, forward it through the agent loop and both chat-route paths, stamp the response metrics with the model that actually answered, and show a notice + relabel the reply in the UI. Tested: python -m pytest tests/test_llm_core_fallback.py (3 pass); python -m py_compile src/llm_core.py src/agent_loop.py routes/chat_routes.py; node --check static/js/chat.js.	2026-06-02 11:37:25 +09:00
mist	1007703223	Keep no-prose assistant tool-call messages through _sanitize_llm_messages (#862 ) `cb13d09` made _append_tool_results emit content=None (JSON null) for a follow-up assistant message that carries only tool_calls and no prose, because Gemini's OpenAI-compatible endpoint and Ollama reject tool_calls alongside an empty-string content with HTTP 400. But _sanitize_llm_messages strips None values and then required "content" on every message, so it dropped that assistant message entirely — leaving the role:"tool" result dangling with no parent tool_calls, which breaks the follow-up round for every provider (and regresses ones that accepted "" before, since the message is now removed rather than sent). cb13d09's tests covered _append_tool_results in isolation, so the sanitizer interaction was uncaught. Make the sanitizer role-aware: assistant messages survive with content OR tool_calls, and a tool-calls-only assistant message gets an explicit content=None re-added so the provider receives spec-correct `content: null`. tool messages still require content + tool_call_id; user/system still require content. Adds tests/test_llm_core_sanitize_tool_calls.py, which drives the real producer (_append_tool_results) into the sanitizer and asserts the assistant tool-call message survives with its tool result paired. Red before this change, green after.	2026-06-02 11:17:22 +09:00
Ethan	fd04ad353d	Add Anthropic prompt caching to the agent loop (#812) Send `system` as a structured text block with an ephemeral cache_control breakpoint and cache the last tool schema, so multi-round agent runs read the stable system+tools prefix from cache instead of re-billing it. Gate the system breakpoint so tiny tool-less prompts skip the cache-write premium. Log cache_read/creation tokens at message_start. Fixes #791 Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>	2026-06-02 11:14:31 +09:00
LittleLlama	54ecfa39cf	Provider detection: match by hostname instead of substring (re #768 ) (#815 ) * Dedupe URL routing helpers and tighten adjacent hostname checks * Match providers by hostname, not substring, in _detect_provider _detect_provider used `"anthropic.com" in url`-style substring checks, so a URL that merely contained a provider's domain in its path or query — or a look-alike host like `anthropic.com.example` — was misclassified and picked the wrong auth-header/payload shape. Switch it to the existing `_host_match` helper (hostname exact/subdomain match), the same way the human-readable labels and curated model lists already work, finishing that migration. Also harden `_host_match` against trailing-dot FQDNs. Not a credential-leak fix: _detect_provider only classifies a URL the admin already configured next to its key, and the URL — not this function — decides where the request goes. This is a correctness/consistency cleanup. Adds tests that import the real helpers (test_endpoint_resolver.py tests local copies, so it can't catch this) covering the substring false-positives. Refs #768. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Import build_headers under its real name in model_routes It was imported as `build_headers as _provider_headers`, which collides with the unrelated llm_core._provider_headers(provider, headers) — same name, different signature. Use the real name to remove the confusion. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Use hostname matching in URL builders, not raw suffix checks PR review flagged that _detect_provider() was hardened to match on hostname, but several helpers still used raw host.endswith("anthropic.com") / host.endswith("ollama.com"), which match adjacent hosts like notanthropic.com / notollama.com. Route the remaining checks through _host_match(): _is_ollama_native_url and _ollama_api_root in llm_core, and _anthropic_api_root / _ollama_api_root in endpoint_resolver. 
With _detect_provider already hostname-correct, the trailing "or host.endswith(...)" clauses in build_chat_url / build_models_url are redundant, so drop them rather than fix the substring match in place. Add builder-level tests asserting look-alike and domain-in-path hosts route to the OpenAI-compatible default. They import the real builders and fail on the pre-fix code. Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-02 11:11:17 +09:00
SurprisedDuck	7268c49992	Make LLM host health maps thread-safe The synchronous llm_call() runs in FastAPI's threadpool (sync route handlers such as POST /sessions/auto-sort), while llm_call_async() runs on the event loop. Both mutate the module-level _response_cache, _host_fails and _dead_hosts dicts, so these are touched from multiple OS threads concurrently. Two races result: - _set_cached_response() snapshots 64 keys then deletes them with `del _response_cache[key]`; if another thread evicts the same key first, the del raises KeyError mid-eviction. Switched to pop(key, None). - _mark_host_dead() does get()+1+set() on _host_fails with no lock, so concurrent connect failures lose increments and a genuinely dead host can stay under its cooldown threshold. Guarded the host-health maps with a threading.Lock (also applied to _is_host_dead / _clear_host_dead for consistent reads). Adds tests/test_llm_core_concurrency.py with deterministic regression tests (phantom snapshot key for the eviction race; a slow-read dict that forces the lost-update window for the counter). Both fail on the unpatched code and pass with the fix.	2026-06-02 05:54:23 +09:00
Areon Lundkvist	f853a3fc67	Harden streaming deltas against null payloads	2026-06-01 23:09:17 +09:00
Alexander Kenley	2c4b8b57dd	feat(ai): add OpenRouter and Ollama Cloud providers (#231) Co-authored-by: Alex Kenley <Alex.Kenley@threatvectorsecurity.com>	2026-06-01 14:26:10 +09:00
pewdiepie-archdaemon	d9d95b4855	Improve OpenRouter and Groq provider requests	2026-06-01 10:32:14 +09:00
pewdiepie-archdaemon	d026e13a5a	Fix provider setup and strip message metadata	2026-06-01 10:20:18 +09:00
pewdiepie-archdaemon	fc7f107b22	Improve Ollama setup and model endpoint handling	2026-06-01 10:00:15 +09:00
pewdiepie-archdaemon	e5c99a5eee	Odysseus v1.0	2026-05-31 23:58:26 +09:00
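
Commit 54ecfa39cf in the log above replaces substring checks with hostname matching. The difference can be sketched as follows. This is a minimal illustration, not the repository's code: `host_match` and `substring_match` are hypothetical stand-ins, and the real `_host_match` helper may differ in signature and edge-case handling.

```python
from urllib.parse import urlsplit


def substring_match(url: str, domain: str) -> bool:
    """Pre-fix style check as described in the commit message."""
    return domain in url


def host_match(url: str, domain: str) -> bool:
    """Return True if the URL's hostname is `domain` or a subdomain of it.

    Hypothetical stand-in for the `_host_match` helper named in the commit.
    """
    # urlsplit().hostname is None when the URL has no scheme/netloc.
    host = (urlsplit(url).hostname or "").rstrip(".").lower()
    domain = domain.rstrip(".").lower()
    # Exact match, or a subdomain boundary marked by a leading dot.
    return host == domain or host.endswith("." + domain)


# Look-alike host, adjacent host, and domain-in-query cases from the commit.
urls = [
    "https://api.anthropic.com/v1/messages",
    "https://anthropic.com.example/v1",
    "https://notanthropic.com/v1",
    "https://example.com/proxy?u=anthropic.com",
    "https://api.anthropic.com./v1",  # trailing-dot FQDN
]
for u in urls:
    print(u, substring_match(u, "anthropic.com"), host_match(u, "anthropic.com"))
```

In this sketch the substring check accepts every URL in the list, while the hostname check accepts only the first and last. A URL without a scheme yields no hostname and is rejected, which is an assumption of the sketch rather than documented behavior of the project.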

43 Commits
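
Commit a327df6936 describes giving each parallel tool call its own accumulator slot, allocated above the current maximum key, so that calls arriving with `index=None` no longer collide in slot 0. A rough sketch of that idea follows. The function name `accumulate_tool_calls` and the delta shape are assumptions for illustration, and the sketch assumes that every delta with `index=None` is a complete, separate call.

```python
def accumulate_tool_calls(deltas):
    """Merge streamed tool-call deltas into one slot per call.

    Hypothetical helper; the delta layout (index, function.name,
    function.arguments) is assumed, not taken from the repository.
    """
    slots = {}
    for delta in deltas:
        index = delta.get("index")
        if index is None:
            # No index supplied: allocate a fresh slot above the highest
            # existing key so sparse or mixed indices cannot collide.
            index = max(slots, default=-1) + 1
        slot = slots.setdefault(index, {"name": "", "arguments": ""})
        fn = delta.get("function") or {}
        if fn.get("name"):
            slot["name"] = fn["name"]
        # Argument fragments for the same index are concatenated in order.
        slot["arguments"] += fn.get("arguments") or ""
    return [slots[key] for key in sorted(slots)]


calls = accumulate_tool_calls([
    {"index": None, "function": {"name": "search", "arguments": '{"q": "a"}'}},
    {"index": None, "function": {"name": "fetch", "arguments": '{"id": 1}'}},
])
print(calls)
```

With a naive default of slot 0 for a missing index, the second call would overwrite the first call's name and append to its arguments, which matches the corruption the commit message reports.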