mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-16 17:55:26 -04:00
7ae6133d7f
* fix(agent): don't let a materialized default budget defeat context scaling #1230 scales agent_input_token_budget to the model's context window unless the user explicitly set a budget, detected via is_setting_overridden(). But the settings-save path materializes every DEFAULT_SETTINGS key into settings.json (load_settings merges defaults; handlers persist the merged dict), so the persisted default 6000 reads as "overridden" and the budget code takes the min(6000, ctx) branch — silently re-capping long-context models at 6000 for anyone who has ever saved a setting. This reintroduces the exact regression #1170/#1230 set out to fix. Add is_setting_customized() (saved value != default) and gate the scaling on it instead of mere presence. A persisted default is not a user choice. is_setting_overridden has exactly one consumer (this budget path), so the change is contained. Tests cover the materialized-default regression, a deliberately-chosen budget still being honoured, and the absent-key case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(agent): rework context-budget fix per review (#4122) Address RaresKeY's review: P2 (explicitness): is_setting_customized treated a saved value equal to the default as "not explicit", which ALSO blocked a user from deliberately pinning the default budget. Reframe the default value itself as the AUTO sentinel — agent_input_token_budget == DEFAULT_BUDGET means "scale to the model's context window", any other value is an explicit cap. A materialized default still reads as auto (fixing the original regression), and any non-default value the user chooses is now honoured. Drop the now-unused is_setting_customized helper. P2 (fallback context): auto-scaling trusted get_context_length() even when it returned only the bare DEFAULT_CONTEXT fallback (no endpoint-reported / known window), over-allocating on self-hosted/proxy setups. Add get_context_length_known() (also returns whether the window was actually discovered); the budget block passes 0 when unknown so auto-scaling stays conservative instead of inflating to an unproven window. hard_max stays auto-only — a deliberate explicit budget wins (#1190); kept that contract and answered the reviewer's question rather than silently reversing it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(agent): lock the materialized-default budget regression (review on #4121) Per WGlynn's review on the issue: add an end-to-end regression that saves an UNRELATED setting (which makes the settings-save path materialize the budget default into settings.json) and asserts the budget still auto-scales rather than re-reading as an explicit 6000 cap — locking the exact reopening shut. To make the test bite the production decision (not just re-derive it), extract `budget_is_explicit()` into src/context_budget.py and use it from the agent loop. It keys off value-vs-default (the default is the auto sentinel), NOT settings presence — which is the whole point, since the save path materializes defaults. Note: after this PR's rework, is_setting_overridden has ZERO production callers, so the merged-dict materialization smell can't reach any setting through a presence check today (WGlynn's durability concern). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(agent): bind the budget context window to its own provenance (review #4122) RaresKeY caught a correctness bug in the fallback-context guard: stream_agent_loop kept only the `known` flag from get_context_length_known() and budgeted off the passed-in `context_length`, which can come from a *different* lookup. Two failures: - local endpoints are re-queried, so the passed value can be a stale DEFAULT_CONTEXT fallback while the fresh probe proves the real (smaller) served context — we'd scale off the stale value; - callers that don't pass context_length (scheduled tasks, teacher escalation, skill test runs, bg_monitor) were capped at 6000 even when a long window is discoverable. Extract budget_context_for_model() which returns the freshly-probed window when known else 0, binding the flag to the value it proves; the agent loop uses it. Regression tests cover the stale-fallback, no-arg-caller, and probe-error paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(agent): fix stale budget comments + tighten to the contract (review #4122) - settings.py: an explicit budget is clamped to the window only — hard_max is auto-only (#1190); drop the incorrect "and to hard_max". - is_setting_overridden docstring: drop the stale "adaptive budgets" example; point value-sensitive callers at context_budget.budget_is_explicit. - Tighten the budget-block comments to the contract (default = auto sentinel, non-default = explicit cap, hard_max = auto-only ceiling). Comment/docstring-only; no behaviour change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(agent): correct budget issue citations (#1190 → merged #1230/#1273) The context-budget contract (auto-sentinel, explicit budgets honoured, hard_max auto-only) merged via #1230 — #1190 was the earlier, closed, superseded PR. Re-point the contract comments at #1230 (the live source, already cited for the auto-sentinel two lines up in settings.py). The configurable hard_max setting (`agent_input_token_hard_max`) was a reviewer requirement first raised on #1190, omitted from the merged #1230, and actually added in #1273 — credit #1273 for it and correct the test comment's history (it previously implied this PR completed the requirement). Comment/docstring-only; no behaviour change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
112 lines
5.9 KiB
Python
112 lines
5.9 KiB
Python
"""Agent input-token budget contract (review on #4122).
|
|
|
|
- The DEFAULT value is the AUTO sentinel: it scales to the model's context window.
|
|
Any non-default value is an explicit cap. A materialized default 6000 can't be
|
|
told apart from a deliberate 6000 (the settings-save path persists defaults), so
|
|
the default reads as auto — pin a cap with a nearby value (e.g. 5999).
|
|
- Auto-scaling only trusts a DISCOVERED context window; a bare DEFAULT_CONTEXT
|
|
fallback stays conservative instead of scaling off an unproven window.
|
|
"""
|
|
|
|
import json
|
|
from unittest.mock import patch
|
|
|
|
import src.settings as settings
|
|
import src.model_context as mc
|
|
from src.context_budget import compute_input_token_budget, DEFAULT_BUDGET, budget_is_explicit
|
|
|
|
|
|
def test_default_value_is_the_auto_sentinel():
|
|
# The settings default equals DEFAULT_BUDGET, so the agent loop (which compares
|
|
# the configured value to DEFAULT_BUDGET) treats the default as "auto".
|
|
assert settings.DEFAULT_SETTINGS["agent_input_token_budget"] == DEFAULT_BUDGET
|
|
|
|
|
|
def test_saving_an_unrelated_setting_does_not_re_cap_the_budget(tmp_path, monkeypatch):
|
|
"""End-to-end regression (WGlynn, #4121): changing ANY setting makes the
|
|
settings-save path persist the merged dict, which materializes the budget
|
|
default into settings.json. The budget must still AUTO-SCALE — it must not be
|
|
re-read as an explicit 6000 cap. This locks the exact reopening shut.
|
|
"""
|
|
settings_file = tmp_path / "settings.json"
|
|
monkeypatch.setattr(settings, "SETTINGS_FILE", str(settings_file))
|
|
settings._settings_cache = None
|
|
|
|
# Simulate a real settings save: a handler loads the merged dict (defaults +
|
|
# saved) and persists it after the user changes one *unrelated* setting.
|
|
merged = settings.load_settings()
|
|
merged["search_result_count"] = 9 # unrelated user change
|
|
settings.save_settings(merged)
|
|
settings._settings_cache = None
|
|
|
|
# The budget default is now physically materialized into the file...
|
|
raw = json.loads(settings_file.read_text())
|
|
assert raw["agent_input_token_budget"] == DEFAULT_BUDGET
|
|
assert raw["search_result_count"] == 9
|
|
|
|
# ...yet it must read as AUTO (value == default), not an explicit cap — even
|
|
# though is_setting_overridden would report True for it now.
|
|
assert settings.is_setting_overridden("agent_input_token_budget") is True
|
|
soft = int(settings.get_setting("agent_input_token_budget", DEFAULT_BUDGET) or 0)
|
|
assert budget_is_explicit(soft) is False
|
|
# And the effective budget scales to the window rather than capping at 6000.
|
|
assert compute_input_token_budget(soft, 131072, explicit=budget_is_explicit(soft)) == int(131072 * 0.85)
|
|
|
|
|
|
def test_auto_scales_on_a_known_window():
|
|
assert compute_input_token_budget(DEFAULT_BUDGET, 131072, explicit=False) == int(131072 * 0.85)
|
|
|
|
|
|
def test_auto_stays_conservative_on_unknown_window():
|
|
# P2 #2: the budget block passes context_length=0 when the window is only a
|
|
# fallback, so auto-scaling must NOT inflate to the unproven window.
|
|
assert compute_input_token_budget(DEFAULT_BUDGET, 0, explicit=False) == DEFAULT_BUDGET
|
|
|
|
|
|
def test_nondefault_value_is_an_explicit_cap():
|
|
assert compute_input_token_budget(20000, 131072, explicit=True) == 20000 # honoured
|
|
assert compute_input_token_budget(200000, 32000, explicit=True) == 32000 # clamped to window
|
|
|
|
|
|
def test_get_context_length_known_surfaces_endpoint_proven_vs_fallback():
|
|
mc._context_cache.clear()
|
|
with patch.object(mc, "_query_context_length", return_value=(131072, True)):
|
|
assert mc.get_context_length_known("http://proven/v1", "m1") == (131072, True)
|
|
mc._context_cache.clear()
|
|
with patch.object(mc, "_query_context_length", return_value=(mc.DEFAULT_CONTEXT, False)):
|
|
ctx, known = mc.get_context_length_known("http://unknown/v1", "m2")
|
|
assert ctx == mc.DEFAULT_CONTEXT and known is False
|
|
# get_context_length keeps its plain-int contract for existing callers
|
|
mc._context_cache.clear()
|
|
with patch.object(mc, "_query_context_length", return_value=(64000, True)):
|
|
assert mc.get_context_length("http://proven/v1", "m3") == 64000
|
|
|
|
|
|
def test_budget_context_binds_known_flag_to_its_own_value():
|
|
"""Regression (RaresKeY, #4122): scale the budget off the value the `known`
|
|
flag actually proves — never a stale/missing context_length from a different
|
|
lookup. Covers the local-restaleness case (fresh proven value beats a stale
|
|
fallback) and the no-arg-caller case (discovers a long window despite fallback=0).
|
|
"""
|
|
# unknown / bare fallback -> 0 (don't scale off an unproven window)
|
|
with patch.object(mc, "get_context_length_known", return_value=(128000, False)):
|
|
assert mc.budget_context_for_model("u", "m", fallback=128000) == 0
|
|
# known -> the freshly-proven value, NOT the (stale) fallback the caller passed
|
|
with patch.object(mc, "get_context_length_known", return_value=(4096, True)):
|
|
assert mc.budget_context_for_model("u", "m", fallback=128000) == 4096
|
|
# no-arg caller (fallback=0) still gets the discovered long window
|
|
with patch.object(mc, "get_context_length_known", return_value=(131072, True)):
|
|
assert mc.budget_context_for_model("u", "m", fallback=0) == 131072
|
|
# probe error -> caller's fallback (prior behaviour)
|
|
with patch.object(mc, "get_context_length_known", side_effect=RuntimeError):
|
|
assert mc.budget_context_for_model("u", "m", fallback=4096) == 4096
|
|
|
|
|
|
def test_no_arg_caller_scales_from_discovered_window_not_6000():
|
|
"""End-to-end of the fix: a caller that passes no context_length (scheduled
|
|
tasks, teacher escalation, ...) but whose endpoint reports 131072 now scales to
|
|
~111k instead of being capped at the conservative 6000."""
|
|
with patch.object(mc, "get_context_length_known", return_value=(131072, True)):
|
|
ctx = mc.budget_context_for_model("u", "m", fallback=0)
|
|
assert compute_input_token_budget(DEFAULT_BUDGET, ctx, explicit=False) == int(131072 * 0.85)
|