fix(agent): don't let a materialized default budget defeat context-window scaling (#4122)

* fix(agent): don't let a materialized default budget defeat context scaling

#1230 scales agent_input_token_budget to the model's context window unless
the user explicitly set a budget, detected via is_setting_overridden(). But
the settings-save path materializes every DEFAULT_SETTINGS key into
settings.json (load_settings merges defaults; handlers persist the merged
dict), so the persisted default 6000 reads as "overridden" and the budget
code takes the min(6000, ctx) branch — silently re-capping long-context
models at 6000 for anyone who has ever saved a setting. This reintroduces
the exact regression #1170/#1230 set out to fix.

Add is_setting_customized() (saved value != default) and gate the scaling
on it instead of mere presence. A persisted default is not a user choice.

is_setting_overridden has exactly one consumer (this budget path), so the
change is contained. Tests cover the materialized-default regression, a
deliberately-chosen budget still being honoured, and the absent-key case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent): rework context-budget fix per review (#4122)

Address RaresKeY's review:

P2 (explicitness): is_setting_customized treated a saved value equal to the
default as "not explicit", which ALSO blocked a user from deliberately pinning
the default budget. Reframe the default value itself as the AUTO sentinel —
agent_input_token_budget == DEFAULT_BUDGET means "scale to the model's context
window", any other value is an explicit cap. A materialized default still reads
as auto (fixing the original regression), and any non-default value the user
chooses is now honoured. Drop the now-unused is_setting_customized helper.

P2 (fallback context): auto-scaling trusted get_context_length() even when it
returned only the bare DEFAULT_CONTEXT fallback (no endpoint-reported / known
window), over-allocating on self-hosted/proxy setups. Add get_context_length_known()
(also returns whether the window was actually discovered); the budget block
passes 0 when unknown so auto-scaling stays conservative instead of inflating to
an unproven window.

hard_max stays auto-only — a deliberate explicit budget wins (#1190); kept that
contract and answered the reviewer's question rather than silently reversing it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(agent): lock the materialized-default budget regression (review on #4121)

Per WGlynn's review on the issue: add an end-to-end regression that saves an
UNRELATED setting (which makes the settings-save path materialize the budget
default into settings.json) and asserts the budget still auto-scales rather than
re-reading as an explicit 6000 cap — locking the exact reopening shut.

To make the test bite the production decision (not just re-derive it), extract
`budget_is_explicit()` into src/context_budget.py and use it from the agent loop.
It keys off value-vs-default (the default is the auto sentinel), NOT settings
presence — which is the whole point, since the save path materializes defaults.

Note: after this PR's rework, is_setting_overridden has ZERO production callers,
so the merged-dict materialization smell can't reach any setting through a
presence check today (WGlynn's durability concern).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent): bind the budget context window to its own provenance (review #4122)

RaresKeY caught a correctness bug in the fallback-context guard: stream_agent_loop
kept only the `known` flag from get_context_length_known() and budgeted off the
passed-in `context_length`, which can come from a *different* lookup. Two failures:
- local endpoints are re-queried, so the passed value can be a stale DEFAULT_CONTEXT
  fallback while the fresh probe proves the real (smaller) served context — we'd
  scale off the stale value;
- callers that don't pass context_length (scheduled tasks, teacher escalation,
  skill test runs, bg_monitor) were capped at 6000 even when a long window is
  discoverable.

Extract budget_context_for_model() which returns the freshly-probed window when
known else 0, binding the flag to the value it proves; the agent loop uses it.
Regression tests cover the stale-fallback, no-arg-caller, and probe-error paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(agent): fix stale budget comments + tighten to the contract (review #4122)

- settings.py: an explicit budget is clamped to the window only — hard_max is
  auto-only (#1190); drop the incorrect "and to hard_max".
- is_setting_overridden docstring: drop the stale "adaptive budgets" example;
  point value-sensitive callers at context_budget.budget_is_explicit.
- Tighten the budget-block comments to the contract (default = auto sentinel,
  non-default = explicit cap, hard_max = auto-only ceiling).

Comment/docstring-only; no behaviour change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(agent): correct budget issue citations (#1190 → merged #1230/#1273)

The context-budget contract (auto-sentinel, explicit budgets honoured,
hard_max auto-only) merged via #1230#1190 was the earlier, closed,
superseded PR. Re-point the contract comments at #1230 (the live source,
already cited for the auto-sentinel two lines up in settings.py).

The configurable hard_max setting (`agent_input_token_hard_max`) was a
reviewer requirement first raised on #1190, omitted from the merged #1230,
and actually added in #1273 — credit #1273 for it and correct the test
comment's history (it previously implied this PR completed the requirement).

Comment/docstring-only; no behaviour change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
nsgds
2026-06-15 14:17:28 +08:00
committed by GitHub
parent 589fcd314a
commit 7ae6133d7f
9 changed files with 238 additions and 59 deletions
+56 -22
View File
@@ -222,16 +222,12 @@ KNOWN_CONTEXT_WINDOWS = {
# ---------------------------------------------------------------------------
# Cache
# ---------------------------------------------------------------------------
_context_cache: Dict[Tuple[str, str], int] = {}
_context_cache: Dict[Tuple[str, str], Tuple[int, bool]] = {}
def get_context_length(endpoint_url: str, model: str) -> int:
"""Get the context window size for a model.
Queries /v1/models on the endpoint and looks for context_length
or context_window fields. Caches result per (endpoint, model).
Falls back to DEFAULT_CONTEXT if unavailable.
"""
def _get_context_length_cached(endpoint_url: str, model: str) -> Tuple[int, bool]:
"""Return (context_length, known). ``known`` is False only when the value is a
bare DEFAULT_CONTEXT fallback (no endpoint report and not in the known table)."""
configured_kind = _configured_endpoint_kind(endpoint_url)
is_local = is_local_endpoint(endpoint_url)
# Key on (endpoint_url, model): the same model id can be served by two
@@ -242,14 +238,50 @@ def get_context_length(endpoint_url: str, model: str) -> int:
if not is_local and cache_key in _context_cache:
return _context_cache[cache_key]
ctx = _query_context_length(endpoint_url, model)
ctx, known = _query_context_length(endpoint_url, model)
# Only cache non-default values to allow retry on next request.
# Local endpoints can restart with a different --max-model-len while keeping
# the same model id, so always re-query them instead of serving stale cache.
if not is_local and (ctx != DEFAULT_CONTEXT or configured_kind in ("api", "proxy")):
_context_cache[cache_key] = ctx
_context_cache[cache_key] = (ctx, known)
logger.info(f"Context length for {model}: {ctx}")
return ctx
return ctx, known
def get_context_length(endpoint_url: str, model: str) -> int:
"""Get the context window size for a model.
Queries /v1/models on the endpoint and looks for context_length
or context_window fields. Caches result per (endpoint, model).
Falls back to DEFAULT_CONTEXT if unavailable.
"""
return _get_context_length_cached(endpoint_url, model)[0]
def get_context_length_known(endpoint_url: str, model: str) -> Tuple[int, bool]:
"""Like ``get_context_length`` but also returns whether the window was actually
discovered (endpoint-reported or in the known-models table) rather than the bare
DEFAULT_CONTEXT fallback. Callers that *scale* a budget off the window must not
trust an unknown value — a fallback 128K isn't proof the model holds 128K
(review on #4122)."""
return _get_context_length_cached(endpoint_url, model)
def budget_context_for_model(endpoint_url: str, model: str, *, fallback: int = 0) -> int:
"""Context window to scale the agent input budget against.
Returns the *freshly discovered* window when it was actually proven
(endpoint-reported / known table), else 0 so auto-scaling stays conservative.
Crucially this binds the ``known`` flag to the value it proves — callers must
not pair this flag with a context length from a *different* lookup (a stale
local re-query, or a caller that didn't pass one), which would budget off an
unproven number (review on #4122). On probe error, returns ``fallback`` (the
caller's best-known value) to preserve prior behaviour."""
try:
ctx, known = get_context_length_known(endpoint_url, model)
return ctx if known else 0
except Exception:
return fallback
def _lookup_known(model: str) -> Optional[int]:
@@ -271,8 +303,9 @@ def _lookup_known(model: str) -> Optional[int]:
return best_ctx
def _query_context_length(endpoint_url: str, model: str) -> int:
"""Query the model API for context length."""
def _query_context_length(endpoint_url: str, model: str) -> Tuple[int, bool]:
"""Query the model API for context length. Returns (context_length, known) where
``known`` is False only for the bare DEFAULT_CONTEXT fallback."""
known = _lookup_known(model)
api_ctx = None
configured_kind = _configured_endpoint_kind(endpoint_url)
@@ -283,8 +316,8 @@ def _query_context_length(endpoint_url: str, model: str) -> int:
if configured_kind in ("api", "proxy"):
if known:
logger.info(f"Using known context window for {model}: {known}")
return known
return DEFAULT_CONTEXT
return known, True
return DEFAULT_CONTEXT, False
# Try llama.cpp /slots endpoint first — reports actual serving context
if is_local_endpoint(endpoint_url):
@@ -297,7 +330,7 @@ def _query_context_length(endpoint_url: str, model: str) -> int:
n_ctx = slots[0].get("n_ctx")
if n_ctx and isinstance(n_ctx, int) and n_ctx > 0:
logger.info(f"llama.cpp /slots reports n_ctx={n_ctx} for {model}")
return n_ctx
return n_ctx, True
except Exception:
pass
@@ -309,7 +342,8 @@ def _query_context_length(endpoint_url: str, model: str) -> int:
if is_copilot_base(endpoint_url):
if known:
logger.info(f"Using known context window for {model}: {known}")
return known or DEFAULT_CONTEXT
return known, True
return DEFAULT_CONTEXT, False
from src.endpoint_resolver import build_models_url
@@ -354,18 +388,18 @@ def _query_context_length(endpoint_url: str, model: str) -> int:
_is_local = is_local_endpoint(endpoint_url)
if _is_local and api_ctx < known:
logger.info(f"Local endpoint reports {api_ctx} for {model} (known max: {known}) — using API value")
return api_ctx
return api_ctx, True
result = max(api_ctx, known)
if api_ctx < known:
logger.info(f"API reported {api_ctx} for {model}, using known {known} instead")
return result
return result, True
if api_ctx:
return api_ctx
return api_ctx, True
if known:
logger.info(f"Using known context window for {model}: {known}")
return known
return known, True
return DEFAULT_CONTEXT
return DEFAULT_CONTEXT, False
def estimate_tokens(messages: List[Dict]) -> int: