Commit Graph

12 Commits

Author SHA1 Message Date
nsgds 7ae6133d7f fix(agent): don't let a materialized default budget defeat context-window scaling (#4122)
* fix(agent): don't let a materialized default budget defeat context scaling

#1230 scales agent_input_token_budget to the model's context window unless
the user explicitly set a budget, detected via is_setting_overridden(). But
the settings-save path materializes every DEFAULT_SETTINGS key into
settings.json (load_settings merges defaults; handlers persist the merged
dict), so the persisted default 6000 reads as "overridden" and the budget
code takes the min(6000, ctx) branch — silently re-capping long-context
models at 6000 for anyone who has ever saved a setting. This reintroduces
the exact regression #1170/#1230 set out to fix.

Add is_setting_customized() (saved value != default) and gate the scaling
on it instead of mere presence. A persisted default is not a user choice.

is_setting_overridden has exactly one consumer (this budget path), so the
change is contained. Tests cover the materialized-default regression, a
deliberately-chosen budget still being honoured, and the absent-key case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent): rework context-budget fix per review (#4122)

Address RaresKeY's review:

P2 (explicitness): is_setting_customized treated a saved value equal to the
default as "not explicit", which ALSO blocked a user from deliberately pinning
the default budget. Reframe the default value itself as the AUTO sentinel —
agent_input_token_budget == DEFAULT_BUDGET means "scale to the model's context
window", any other value is an explicit cap. A materialized default still reads
as auto (fixing the original regression), and any non-default value the user
chooses is now honoured. Drop the now-unused is_setting_customized helper.

P2 (fallback context): auto-scaling trusted get_context_length() even when it
returned only the bare DEFAULT_CONTEXT fallback (no endpoint-reported / known
window), over-allocating on self-hosted/proxy setups. Add get_context_length_known()
(also returns whether the window was actually discovered); the budget block
passes 0 when unknown so auto-scaling stays conservative instead of inflating to
an unproven window.

hard_max stays auto-only — a deliberate explicit budget wins (#1190); kept that
contract and answered the reviewer's question rather than silently reversing it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(agent): lock the materialized-default budget regression (review on #4121)

Per WGlynn's review on the issue: add an end-to-end regression that saves an
UNRELATED setting (which makes the settings-save path materialize the budget
default into settings.json) and asserts the budget still auto-scales rather than
re-reading as an explicit 6000 cap — locking the exact reopening shut.

To make the test bite the production decision (not just re-derive it), extract
`budget_is_explicit()` into src/context_budget.py and use it from the agent loop.
It keys off value-vs-default (the default is the auto sentinel), NOT settings
presence — which is the whole point, since the save path materializes defaults.

Note: after this PR's rework, is_setting_overridden has ZERO production callers,
so the merged-dict materialization smell can't reach any setting through a
presence check today (WGlynn's durability concern).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent): bind the budget context window to its own provenance (review #4122)

RaresKeY caught a correctness bug in the fallback-context guard: stream_agent_loop
kept only the `known` flag from get_context_length_known() and budgeted off the
passed-in `context_length`, which can come from a *different* lookup. Two failures:
- local endpoints are re-queried, so the passed value can be a stale DEFAULT_CONTEXT
  fallback while the fresh probe proves the real (smaller) served context — we'd
  scale off the stale value;
- callers that don't pass context_length (scheduled tasks, teacher escalation,
  skill test runs, bg_monitor) were capped at 6000 even when a long window is
  discoverable.

Extract budget_context_for_model() which returns the freshly-probed window when
known else 0, binding the flag to the value it proves; the agent loop uses it.
Regression tests cover the stale-fallback, no-arg-caller, and probe-error paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(agent): fix stale budget comments + tighten to the contract (review #4122)

- settings.py: an explicit budget is clamped to the window only — hard_max is
  auto-only (#1190); drop the incorrect "and to hard_max".
- is_setting_overridden docstring: drop the stale "adaptive budgets" example;
  point value-sensitive callers at context_budget.budget_is_explicit.
- Tighten the budget-block comments to the contract (default = auto sentinel,
  non-default = explicit cap, hard_max = auto-only ceiling).

Comment/docstring-only; no behaviour change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(agent): correct budget issue citations (#1190 → merged #1230/#1273)

The context-budget contract (auto-sentinel, explicit budgets honoured,
hard_max auto-only) merged via #1230#1190 was the earlier, closed,
superseded PR. Re-point the contract comments at #1230 (the live source,
already cited for the auto-sentinel two lines up in settings.py).

The configurable hard_max setting (`agent_input_token_hard_max`) was a
reviewer requirement first raised on #1190, omitted from the merged #1230,
and actually added in #1273 — credit #1273 for it and correct the test
comment's history (it previously implied this PR completed the requirement).

Comment/docstring-only; no behaviour change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 15:17:28 +09:00
Kenny Van de Maele 263d41c58a fix(llm): stop sending llama.cpp slot-affinity fields to cloud providers (#3945)
* fix(llm): stop sending llama.cpp slot-affinity fields to cloud providers

_apply_local_cache_affinity adds session_id + cache_prompt for llama.cpp
KV-cache slot affinity (#2927), gated on _is_self_hosted_openai_compatible,
which treated any unknown OpenAI-compatible host as self-hosted. Strict
cloud providers added as custom endpoints (Mistral at api.mistral.ai)
reject unknown body fields, so every request failed with 422
extra_forbidden. Self-hosted now also requires the endpoint to resolve as
local via model_context.is_local_endpoint: loopback/private/tailscale
host, or endpoint kind explicitly configured as "local" (the escape hatch
for tunneled self-hosted servers). is_local_endpoint is promoted to a
public name since llm_core now shares it.

Fixes #3793

* test(llm): sweep cloud OpenAI-compatible hosts in affinity gating

Parametrized cases adapted from #3839 (credit: Shabablinchikow): deepseek,
x.ai, together, fireworks, and the Gemini OpenAI-compat endpoint must all
stay free of the llama.cpp extras, not just the Mistral host from #3793.

* fix(llm): narrow the Tailscale range to 100.64.0.0/10 in is_local_endpoint

Review finding on #3945: _PRIVATE_PREFIXES carried a bare "100." prefix,
treating all of 100.0.0.0/8 as local while Tailscale only uses the CGNAT
block 100.64.0.0/10. Public 100.x hosts (e.g. AWS ranges outside the
block) were classified local and still received the llama.cpp extras
this PR exists to keep away from strict providers. Match the narrowed
classification routes/model_routes.py already uses, with boundary tests
just below, inside, and just above the range.
2026-06-11 17:51:03 +02:00
Ocean Bennett e7c1d75884 fix(models): query v1 models for llama-server endpoints (#3380)
* fix(models): query v1 models for llama-server endpoints

* test(models): accept owner kwargs in llama-server regression
2026-06-09 01:09:02 +02:00
nubs 6973c5427c fix(model-context): count tool_calls in estimate_tokens so compaction sees real size (#2751) 2026-06-05 15:56:54 +02:00
nubs 19a3fc59c9 fix(model-context): key context-window cache by (endpoint, model) (#2614)
get_context_length() cached the resolved context window by model id alone,
so two different remote endpoints serving the same model id (e.g. a capped
proxy at 8k vs. the full provider at 200k) collided: the first to resolve
won process-wide and the other endpoint was served the wrong window. That
silently over-trims conversations on the larger-window endpoint (it feeds
context_compactor) or overflows the smaller one (provider 400s).

Key the cache on (endpoint_url, model). Local endpoints already always
re-query, so they are unaffected.

Fixes #2603
2026-06-05 02:50:56 +02:00
Kenny Van de Maele 1cd0aa2b8c feat(provider): add GitHub Copilot provider with device-flow auth (#1480)
* feat(provider): add GitHub Copilot provider with device-flow auth

Adds GitHub Copilot as a model provider, so Copilot models (gpt-4o/4.1/5,
Claude, Gemini, …) work through the normal chat + agent loop, incl. native
tool calling and vision.

Auth is one-click via the GitHub OAuth device flow; the access token is stored
as the endpoint's (encrypted) api_key and sent directly as `Authorization:
Bearer` (no Copilot-token exchange, no refresh — matching how editors talk to
the Copilot API). Copilot is a normal ModelEndpoint detected by host; the only
provider-specific behaviour is a small set of required request headers,
injected centrally.

Sign-in is available from Settings → model endpoints ("Connect GitHub
Copilot") and from chat via `/setup copilot`.

- src/copilot.py (new), routes/copilot_routes.py (new): constants, header
  builders, device-flow start/poll, model discovery, owner-scoped endpoint
  provisioning.
- src/llm_core.py, src/endpoint_resolver.py: detect `copilot`, inject headers,
  per-request x-initiator/vision.
- src/agent_loop.py: allowlist api.githubcopilot.com for native tool schemas.
- src/model_context.py: known context windows for Copilot (no unauthenticated
  /models probe).
- static/, README, tests/test_copilot*.py.

* Tidy copilot_routes: clarify supports_tools, note _PENDING is per-process
2026-06-04 21:13:14 +02:00
Yuri a2e691da2b fix(models): stabilize proxy endpoint refresh behavior
* fix: support large proxy model endpoint refresh

Large OpenAI-compatible proxy endpoints can expose hundreds of models and make /v1/models slow. Treating those endpoints like local model servers caused model picker opens and background probes to repeatedly hit /models, producing timeouts and making otherwise usable endpoints appear offline.

Make model endpoint discovery cached-first for normal UI usage, add explicit proxy/API classification and refresh policy fields, exclude proxy/API endpoints from aggressive local probing, and preserve cached models when refresh fails.

Manual Test/Add/Refresh actions still fetch the full model list with longer timeouts so users can intentionally import large proxy model lists without blocking normal model picker usage.

* fix: preserve endpoint ping status semantics
2026-06-04 04:56:11 +01:00
danielroytel 39848a168b fix: recognize Gemma 4 as a thinking model and add context entry (#1642)
Gemma 4 returns reasoning_content in streaming responses via
llama-server, but the model wasn't listed in _THINKING_MODEL_PATTERNS,
causing reasoning tokens to be mishandled. Add "gemma" to the pattern
list and register Gemma 4's 128K context window in KNOWN_CONTEXT_WINDOWS
so the agent loop budgets context correctly.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-06-03 14:23:18 +09:00
SurprisedDuck d06b6d87d3 Models: prefer longest known context match
KNOWN_CONTEXT_WINDOWS lists 'o1' (200k) before 'o1-mini' (128k), and
_lookup_known returned on the first substring hit — so "o1-mini" matched
'o1' and reported 200000 instead of 128000. Track the longest matching
key instead, so the most specific entry wins regardless of table order.
2026-06-02 20:33:09 +09:00
ooovenenoso cd6041477c Refresh local model context after restart
Co-authored-by: Kevin <120500656+oooindefatigable@users.noreply.github.com>
2026-06-02 05:54:06 +09:00
Elle d885c70462 Treat Docker host gateway as local
When running Odysseus in Docker and connecting to a local LLM on the host machine (e.g. `llama.cpp` or `Ollama`), the standard endpoint `http://host.docker.internal` is used to breach the container network. 

Because `host.docker.internal` was missing from `_LOCAL_HOSTS`, Odysseus incorrectly treated local self-hosted models as cloud APIs. This triggered the fallback behavior where actual API-reported context limits were being ignored and overridden by hardcoded fallbacks in `KNOWN_CONTEXT_WINDOWS`.

**Changes**
- Added `"host.docker.internal"` to the `_LOCAL_HOSTS` whitelist in `src/model_context.py` so that Dockerized deployments correctly trust and respect the context limits of locally hosted models.

**Checks Ran**
- [x] Syntax check (`python -m py_compile src/model_context.py`)
- [x] Tested manually in Docker (`docker compose up -d --build`) on a Windows host using `llama-server`. The correct API context length is now correctly reported in the UI instead of falling back to the 131k hardcode.
2026-06-02 05:49:59 +09:00
pewdiepie-archdaemon e5c99a5eee Odysseus v1.0 2026-05-31 23:58:26 +09:00