fix(agent): don't let a materialized default budget defeat context-window scaling (#4122)

* fix(agent): don't let a materialized default budget defeat context scaling

#1230 scales agent_input_token_budget to the model's context window unless
the user explicitly set a budget, detected via is_setting_overridden(). But
the settings-save path materializes every DEFAULT_SETTINGS key into
settings.json (load_settings merges defaults; handlers persist the merged
dict), so the persisted default 6000 reads as "overridden" and the budget
code takes the min(6000, ctx) branch — silently re-capping long-context
models at 6000 for anyone who has ever saved a setting. This reintroduces
the exact regression #1170/#1230 set out to fix.
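
A minimal sketch of the materialization problem described above, for
illustration only. The helper names (`DEFAULTS`, `load_settings`,
`save_settings`, `is_present`, `demo`) are hypothetical stand-ins and not the
real src/settings.py code, which may differ in detail. It shows how saving one
unrelated setting makes a presence check report the budget key as set:

```python
import json
import tempfile
from pathlib import Path
from typing import Tuple

# Hypothetical stand-in for DEFAULT_SETTINGS (values chosen for illustration).
DEFAULTS = {"agent_input_token_budget": 6000, "search_result_count": 5}


def load_settings(path: Path) -> dict:
    """Merge defaults with whatever is saved; saved values win."""
    saved = json.loads(path.read_text()) if path.exists() else {}
    return {**DEFAULTS, **saved}


def save_settings(path: Path, merged: dict) -> None:
    """Persist the merged dict, which writes every default key to disk."""
    path.write_text(json.dumps(merged))


def is_present(path: Path, key: str) -> bool:
    """Presence check against the raw saved file."""
    if not path.exists():
        return False
    return key in json.loads(path.read_text())


def demo() -> Tuple[bool, bool]:
    """Return (present before save, present after an unrelated save)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "settings.json"
        before = is_present(path, "agent_input_token_budget")
        merged = load_settings(path)
        merged["search_result_count"] = 9  # unrelated user change
        save_settings(path, merged)
        after = is_present(path, "agent_input_token_budget")
        return before, after
```

The budget key was never touched by the user, yet it is present in the file
after the save, which is why a presence-based "overridden" check misfires.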

Add is_setting_customized() (saved value != default) and gate the scaling
on it instead of mere presence. A persisted default is not a user choice.

is_setting_overridden has exactly one consumer (this budget path), so the
change is contained. Tests cover the materialized-default regression, a
deliberately-chosen budget still being honoured, and the absent-key case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent): rework context-budget fix per review (#4122)

Address RaresKeY's review:

P2 (explicitness): is_setting_customized treated a saved value equal to the
default as "not explicit", which ALSO blocked a user from deliberately pinning
the default budget. Reframe the default value itself as the AUTO sentinel —
agent_input_token_budget == DEFAULT_BUDGET means "scale to the model's context
window", any other value is an explicit cap. A materialized default still reads
as auto (fixing the original regression), and any non-default value the user
chooses is now honoured. Drop the now-unused is_setting_customized helper.

P2 (fallback context): auto-scaling trusted get_context_length() even when it
returned only the bare DEFAULT_CONTEXT fallback (no endpoint-reported / known
window), over-allocating on self-hosted/proxy setups. Add get_context_length_known()
(also returns whether the window was actually discovered); the budget block
passes 0 when unknown so auto-scaling stays conservative instead of inflating to
an unproven window.

hard_max stays auto-only — a deliberate explicit budget wins (#1190); kept that
contract and answered the reviewer's question rather than silently reversing it.
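
The contract after this rework can be sketched roughly as follows. This is an
illustration, not the real compute_input_token_budget: the name
`sketch_budget` is hypothetical, the 0.85 headroom is an assumption inferred
from the tests in this PR, and 200_000 is taken from the settings default.

```python
DEFAULT_BUDGET = 6000       # the default value doubles as the auto sentinel
DEFAULT_HARD_MAX = 200_000  # matches the settings default shown in this PR
HEADROOM = 0.85             # assumption, inferred from the tests


def budget_is_explicit(configured: int, default: int = DEFAULT_BUDGET) -> bool:
    """Only a positive, non-default value counts as an explicit cap."""
    return configured > 0 and configured != default


def sketch_budget(configured: int, context_length: int,
                  hard_max: int = DEFAULT_HARD_MAX) -> int:
    """Hypothetical condensed version of the budget rules."""
    if budget_is_explicit(configured):
        # Explicit cap: honoured, clamped only to a known window.
        # hard_max deliberately does not apply here.
        return min(configured, context_length) if context_length > 0 else configured
    if context_length > 0:
        # Auto: scale to a share of the discovered window, capped at hard_max.
        scaled = int(context_length * HEADROOM)
        return max(1, min(scaled, hard_max))
    # Unknown window (0): stay on the conservative default.
    return configured if configured > 0 else DEFAULT_BUDGET
```

Under these assumptions a materialized 6000 scales with the window, 5999 is
treated as a pinned cap, and an unknown window (passed as 0) keeps the budget
at 6000.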

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(agent): lock the materialized-default budget regression (review on #4121)

Per WGlynn's review on the issue: add an end-to-end regression that saves an
UNRELATED setting (which makes the settings-save path materialize the budget
default into settings.json) and asserts the budget still auto-scales rather than
re-reading as an explicit 6000 cap — locking the exact reopening shut.

To make the test bite the production decision (not just re-derive it), extract
`budget_is_explicit()` into src/context_budget.py and use it from the agent loop.
It keys off value-vs-default (the default is the auto sentinel), NOT settings
presence — which is the whole point, since the save path materializes defaults.

Note: after this PR's rework, is_setting_overridden has ZERO production callers,
so the merged-dict materialization smell can't reach any setting through a
presence check today (WGlynn's durability concern).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(agent): bind the budget context window to its own provenance (review #4122)

RaresKeY caught a correctness bug in the fallback-context guard: stream_agent_loop
kept only the `known` flag from get_context_length_known() and budgeted off the
passed-in `context_length`, which can come from a *different* lookup. Two failures:
- local endpoints are re-queried, so the passed value can be a stale DEFAULT_CONTEXT
  fallback while the fresh probe proves the real (smaller) served context — we'd
  scale off the stale value;
- callers that don't pass context_length (scheduled tasks, teacher escalation,
  skill test runs, bg_monitor) were capped at 6000 even when a long window is
  discoverable.

Extract budget_context_for_model() which returns the freshly-probed window when
known else 0, binding the flag to the value it proves; the agent loop uses it.
Regression tests cover the stale-fallback, no-arg-caller, and probe-error paths.
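
A small sketch of the binding idea, assuming the probe is injected as a
callable so the example stays self-contained. `budget_context` and the three
probe functions are hypothetical names; the real helper calls
get_context_length_known() directly.

```python
from typing import Callable, Tuple

# A probe returns (context_length, known) for an endpoint/model pair.
Probe = Callable[[str, str], Tuple[int, bool]]


def budget_context(probe: Probe, endpoint_url: str, model: str,
                   fallback: int = 0) -> int:
    """Return the probed window only when the same probe proved it."""
    try:
        ctx, known = probe(endpoint_url, model)
    except Exception:
        # Probe failed: fall back to the caller's best-known value.
        return fallback
    # The flag and the value come from one lookup, so they cannot drift apart.
    return ctx if known else 0


def proven_small(endpoint_url: str, model: str) -> Tuple[int, bool]:
    """Fresh probe proves a small served context."""
    return (4096, True)


def unproven(endpoint_url: str, model: str) -> Tuple[int, bool]:
    """Probe only has a bare fallback value."""
    return (128000, False)


def failing(endpoint_url: str, model: str) -> Tuple[int, bool]:
    """Probe raises, e.g. because the endpoint is unreachable."""
    raise RuntimeError("probe failed")
```

With a stale caller value of 128000, the proven probe yields 4096, the
unproven probe yields 0, and only a failing probe returns the caller's
fallback.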

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(agent): fix stale budget comments + tighten to the contract (review #4122)

- settings.py: an explicit budget is clamped to the window only — hard_max is
  auto-only (#1190); drop the incorrect "and to hard_max".
- is_setting_overridden docstring: drop the stale "adaptive budgets" example;
  point value-sensitive callers at context_budget.budget_is_explicit.
- Tighten the budget-block comments to the contract (default = auto sentinel,
  non-default = explicit cap, hard_max = auto-only ceiling).

Comment/docstring-only; no behaviour change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(agent): correct budget issue citations (#1190 → merged #1230/#1273)

The context-budget contract (auto-sentinel, explicit budgets honoured,
hard_max auto-only) merged via #1230; #1190 was the earlier, closed,
superseded PR. Re-point the contract comments at #1230 (the live source,
already cited for the auto-sentinel two lines up in settings.py).

The configurable hard_max setting (`agent_input_token_hard_max`) was a
reviewer requirement first raised on #1190, omitted from the merged #1230,
and actually added in #1273 — credit #1273 for it and correct the test
comment's history (it previously implied this PR completed the requirement).

Comment/docstring-only; no behaviour change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nsgds
2026-06-15 14:17:28 +08:00
committed by GitHub
parent 589fcd314a
commit 7ae6133d7f
9 changed files with 238 additions and 59 deletions
+16 -12
@@ -2013,30 +2013,34 @@ async def stream_agent_loop(
_t3 = time.time()
try:
from src.context_compactor import trim_for_context
from src.context_budget import compute_input_token_budget, DEFAULT_HARD_MAX
from src.settings import is_setting_overridden
from src.context_budget import compute_input_token_budget, DEFAULT_HARD_MAX, DEFAULT_BUDGET, budget_is_explicit as _budget_is_explicit
from src.model_context import budget_context_for_model
soft_budget = int(get_setting("agent_input_token_budget", 6000) or 0)
soft_budget = int(get_setting("agent_input_token_budget", DEFAULT_BUDGET) or 0)
if soft_budget > 0:
before_trim_tokens = estimate_tokens(messages)
reserve_tokens = min(max(max_tokens or 1024, 512), 2048)
# Honour the configurable ceiling for the auto-derived budget path.
# No-op when the user has an explicit `agent_input_token_budget`
# (that branch ignores hard_max). Falls back to DEFAULT_HARD_MAX
# on missing/malformed values so misconfig can't zero the budget.
# Ceiling for the auto-derived budget (no effect on an explicit budget;
# see #1230). Falls back to DEFAULT_HARD_MAX on missing/malformed values
# so misconfig can't zero the budget.
try:
hard_max = int(get_setting("agent_input_token_hard_max", DEFAULT_HARD_MAX) or DEFAULT_HARD_MAX)
except (TypeError, ValueError):
hard_max = DEFAULT_HARD_MAX
if hard_max <= 0:
hard_max = DEFAULT_HARD_MAX
# Scale the default budget to the model's context window so long-context
# models aren't silently capped at 6000; an explicit user setting is
# still honoured (clamped to the window). (#1170)
# Default value = auto sentinel (scale to the window); any other value =
# explicit cap. Value-based, not presence-based, because the save path
# materializes defaults so a persisted default must still read as auto (#4121).
budget_is_explicit = _budget_is_explicit(soft_budget)
# Scale only off a window we actually discovered, bound to the value it
# proves (else 0) — not the passed-in context_length, which can be stale
# or unset for some callers (#4122 review).
ctx_for_budget = budget_context_for_model(endpoint_url, model, fallback=context_length)
effective_budget = compute_input_token_budget(
soft_budget,
context_length,
is_setting_overridden("agent_input_token_budget"),
ctx_for_budget,
budget_is_explicit,
hard_max=hard_max,
)
trimmed_messages = trim_for_context(
+27 -7
@@ -31,16 +31,22 @@ def compute_input_token_budget(
Args:
configured: the value read from settings (may be the default).
context_length: the model's discovered context window (0/unknown if none).
explicit: True if the user explicitly set ``agent_input_token_budget``.
context_length: the model's discovered context window. Pass 0 when the
window is unknown / only a bare fallback — auto-scaling then stays
conservative instead of trusting an unproven window (review on #4122).
explicit: True if the user set a NON-default budget. The default value is
the "auto" sentinel (scale to the window); any other value is an
explicit cap. (A deliberately-chosen default can't be distinguished
from a materialized default by value, so the default reads as auto.)
Rules:
- Explicit user budget is honoured exactly, only clamped to the model's
window when that window is known (never send more than the model holds).
- Otherwise (default), scale to ``headroom`` of the context window, capped
at ``hard_max`` — so long-context models use their capacity.
- When the window is unknown, fall back to the configured/default value
(preserving the previous behaviour).
window when that window is known (the user's deliberate choice wins;
``hard_max`` is an auto-budget ceiling only — see #1230).
- Otherwise (auto), scale to ``headroom`` of the context window, capped at
``hard_max`` — so long-context models use their capacity.
- When the window is unknown (context_length <= 0), use the conservative
``default`` budget and do NOT scale off the fallback.
"""
configured = int(configured or 0)
context_length = int(context_length or 0)
@@ -53,3 +59,17 @@ def compute_input_token_budget(
return max(1, min(scaled, hard_max))
return configured if configured > 0 else default
def budget_is_explicit(configured: int, *, default: int = DEFAULT_BUDGET) -> bool:
"""Whether a configured agent_input_token_budget is a deliberate explicit cap.
The default value is the "auto" sentinel (scale to the model's window), so only
a NON-default positive value counts as explicit. This keys off the VALUE, not
settings *presence* — the settings-save path materializes every default into
settings.json, so a persisted default must still read as auto (the regression
#4121 / #1230 are about). Centralised here so the materialized-default contract
is unit-testable and can't silently regress to a presence check.
"""
configured = int(configured or 0)
return configured > 0 and configured != default
+56 -22
@@ -222,16 +222,12 @@ KNOWN_CONTEXT_WINDOWS = {
# ---------------------------------------------------------------------------
# Cache
# ---------------------------------------------------------------------------
_context_cache: Dict[Tuple[str, str], int] = {}
_context_cache: Dict[Tuple[str, str], Tuple[int, bool]] = {}
def get_context_length(endpoint_url: str, model: str) -> int:
"""Get the context window size for a model.
Queries /v1/models on the endpoint and looks for context_length
or context_window fields. Caches result per (endpoint, model).
Falls back to DEFAULT_CONTEXT if unavailable.
"""
def _get_context_length_cached(endpoint_url: str, model: str) -> Tuple[int, bool]:
"""Return (context_length, known). ``known`` is False only when the value is a
bare DEFAULT_CONTEXT fallback (no endpoint report and not in the known table)."""
configured_kind = _configured_endpoint_kind(endpoint_url)
is_local = is_local_endpoint(endpoint_url)
# Key on (endpoint_url, model): the same model id can be served by two
@@ -242,14 +238,50 @@ def get_context_length(endpoint_url: str, model: str) -> int:
if not is_local and cache_key in _context_cache:
return _context_cache[cache_key]
ctx = _query_context_length(endpoint_url, model)
ctx, known = _query_context_length(endpoint_url, model)
# Only cache non-default values to allow retry on next request.
# Local endpoints can restart with a different --max-model-len while keeping
# the same model id, so always re-query them instead of serving stale cache.
if not is_local and (ctx != DEFAULT_CONTEXT or configured_kind in ("api", "proxy")):
_context_cache[cache_key] = ctx
_context_cache[cache_key] = (ctx, known)
logger.info(f"Context length for {model}: {ctx}")
return ctx
return ctx, known
def get_context_length(endpoint_url: str, model: str) -> int:
"""Get the context window size for a model.
Queries /v1/models on the endpoint and looks for context_length
or context_window fields. Caches result per (endpoint, model).
Falls back to DEFAULT_CONTEXT if unavailable.
"""
return _get_context_length_cached(endpoint_url, model)[0]
def get_context_length_known(endpoint_url: str, model: str) -> Tuple[int, bool]:
"""Like ``get_context_length`` but also returns whether the window was actually
discovered (endpoint-reported or in the known-models table) rather than the bare
DEFAULT_CONTEXT fallback. Callers that *scale* a budget off the window must not
trust an unknown value — a fallback 128K isn't proof the model holds 128K
(review on #4122)."""
return _get_context_length_cached(endpoint_url, model)
def budget_context_for_model(endpoint_url: str, model: str, *, fallback: int = 0) -> int:
"""Context window to scale the agent input budget against.
Returns the *freshly discovered* window when it was actually proven
(endpoint-reported / known table), else 0 so auto-scaling stays conservative.
Crucially this binds the ``known`` flag to the value it proves — callers must
not pair this flag with a context length from a *different* lookup (a stale
local re-query, or a caller that didn't pass one), which would budget off an
unproven number (review on #4122). On probe error, returns ``fallback`` (the
caller's best-known value) to preserve prior behaviour."""
try:
ctx, known = get_context_length_known(endpoint_url, model)
return ctx if known else 0
except Exception:
return fallback
def _lookup_known(model: str) -> Optional[int]:
@@ -271,8 +303,9 @@ def _lookup_known(model: str) -> Optional[int]:
return best_ctx
def _query_context_length(endpoint_url: str, model: str) -> int:
"""Query the model API for context length."""
def _query_context_length(endpoint_url: str, model: str) -> Tuple[int, bool]:
"""Query the model API for context length. Returns (context_length, known) where
``known`` is False only for the bare DEFAULT_CONTEXT fallback."""
known = _lookup_known(model)
api_ctx = None
configured_kind = _configured_endpoint_kind(endpoint_url)
@@ -283,8 +316,8 @@ def _query_context_length(endpoint_url: str, model: str) -> int:
if configured_kind in ("api", "proxy"):
if known:
logger.info(f"Using known context window for {model}: {known}")
return known
return DEFAULT_CONTEXT
return known, True
return DEFAULT_CONTEXT, False
# Try llama.cpp /slots endpoint first — reports actual serving context
if is_local_endpoint(endpoint_url):
@@ -297,7 +330,7 @@ def _query_context_length(endpoint_url: str, model: str) -> int:
n_ctx = slots[0].get("n_ctx")
if n_ctx and isinstance(n_ctx, int) and n_ctx > 0:
logger.info(f"llama.cpp /slots reports n_ctx={n_ctx} for {model}")
return n_ctx
return n_ctx, True
except Exception:
pass
@@ -309,7 +342,8 @@ def _query_context_length(endpoint_url: str, model: str) -> int:
if is_copilot_base(endpoint_url):
if known:
logger.info(f"Using known context window for {model}: {known}")
return known or DEFAULT_CONTEXT
return known, True
return DEFAULT_CONTEXT, False
from src.endpoint_resolver import build_models_url
@@ -354,18 +388,18 @@ def _query_context_length(endpoint_url: str, model: str) -> int:
_is_local = is_local_endpoint(endpoint_url)
if _is_local and api_ctx < known:
logger.info(f"Local endpoint reports {api_ctx} for {model} (known max: {known}) — using API value")
return api_ctx
return api_ctx, True
result = max(api_ctx, known)
if api_ctx < known:
logger.info(f"API reported {api_ctx} for {model}, using known {known} instead")
return result
return result, True
if api_ctx:
return api_ctx
return api_ctx, True
if known:
logger.info(f"Using known context window for {model}: {known}")
return known
return known, True
return DEFAULT_CONTEXT
return DEFAULT_CONTEXT, False
def estimate_tokens(messages: List[Dict]) -> int:
+18 -8
@@ -101,14 +101,22 @@ DEFAULT_SETTINGS = {
"research_run_timeout_seconds": 1800,
"agent_max_tool_calls": 0,
"agent_max_rounds": 20, # per-message agent step cap (clamped 1..200)
# Soft input-token budget for the agent loop. The DEFAULT value (6000) is the
# "auto" sentinel: it means "scale the budget to the model's context window"
# (#1230) — so long-context models aren't capped at 6000. Set ANY OTHER value
# to enforce an explicit cap (clamped to the window only — hard_max does not
# apply to explicit budgets, #1230); set 0 to disable soft-trimming. The
# default is treated as auto because the settings-save path materializes
# defaults, so a persisted 6000 can't be told apart from a deliberate 6000 —
# to pin a budget near the default, use a nearby value (e.g. 5999).
"agent_input_token_budget": 6000,
# Ceiling on the *auto-derived* input budget that #1230 introduced. Has
# no effect when `agent_input_token_budget` is explicitly set (the user's
# value is honoured regardless). Default matches
# `src.context_budget.DEFAULT_HARD_MAX`; lower this for cost-paranoid
# setups, raise it on premium APIs with very large windows that you
# Ceiling on the *auto-derived* input budget; a configurable setting since #1273
# (the merged #1230 left it a module constant). No effect on an explicit budget
# — a deliberate value is honoured (#1230). Default matches
# `src.context_budget.DEFAULT_HARD_MAX`; lower this for
# cost-paranoid setups, raise it on premium APIs with very large windows you
# want to actually use (e.g. 900_000 to fill a 1M-context model). See
# `compute_input_token_budget` in src/context_budget.py.
# `compute_input_token_budget`.
"agent_input_token_hard_max": 200_000,
"agent_stream_timeout_seconds": 300,
# Extra directory roots that read_file / write_file may access, in
@@ -223,8 +231,10 @@ def is_setting_overridden(key: str) -> bool:
``load_settings`` merges DEFAULT_SETTINGS with the saved file, so a value
equal to its default is indistinguishable from "never set" via get_setting.
Callers that need to treat an explicit user choice differently from the
default (e.g. adaptive budgets) use this to read the raw saved file.
Callers that must distinguish an explicit user choice from a default read
the raw saved file via this. (Note: a materialized default is also "present",
so value-sensitive callers should compare against the default; see
``context_budget.budget_is_explicit``.)
"""
try:
with open(SETTINGS_FILE, "r", encoding="utf-8") as f:
+111
@@ -0,0 +1,111 @@
"""Agent input-token budget contract (review on #4122).
- The DEFAULT value is the AUTO sentinel: it scales to the model's context window.
Any non-default value is an explicit cap. A materialized default 6000 can't be
told apart from a deliberate 6000 (the settings-save path persists defaults), so
the default reads as auto; pin a cap with a nearby value (e.g. 5999).
- Auto-scaling only trusts a DISCOVERED context window; a bare DEFAULT_CONTEXT
fallback stays conservative instead of scaling off an unproven window.
"""
import json
from unittest.mock import patch
import src.settings as settings
import src.model_context as mc
from src.context_budget import compute_input_token_budget, DEFAULT_BUDGET, budget_is_explicit
def test_default_value_is_the_auto_sentinel():
# The settings default equals DEFAULT_BUDGET, so the agent loop (which compares
# the configured value to DEFAULT_BUDGET) treats the default as "auto".
assert settings.DEFAULT_SETTINGS["agent_input_token_budget"] == DEFAULT_BUDGET
def test_saving_an_unrelated_setting_does_not_re_cap_the_budget(tmp_path, monkeypatch):
"""End-to-end regression (WGlynn, #4121): changing ANY setting makes the
settings-save path persist the merged dict, which materializes the budget
default into settings.json. The budget must still AUTO-SCALE; it must not be
re-read as an explicit 6000 cap. This locks the exact reopening shut.
"""
settings_file = tmp_path / "settings.json"
monkeypatch.setattr(settings, "SETTINGS_FILE", str(settings_file))
settings._settings_cache = None
# Simulate a real settings save: a handler loads the merged dict (defaults +
# saved) and persists it after the user changes one *unrelated* setting.
merged = settings.load_settings()
merged["search_result_count"] = 9 # unrelated user change
settings.save_settings(merged)
settings._settings_cache = None
# The budget default is now physically materialized into the file...
raw = json.loads(settings_file.read_text())
assert raw["agent_input_token_budget"] == DEFAULT_BUDGET
assert raw["search_result_count"] == 9
# ...yet it must read as AUTO (value == default), not an explicit cap — even
# though is_setting_overridden would report True for it now.
assert settings.is_setting_overridden("agent_input_token_budget") is True
soft = int(settings.get_setting("agent_input_token_budget", DEFAULT_BUDGET) or 0)
assert budget_is_explicit(soft) is False
# And the effective budget scales to the window rather than capping at 6000.
assert compute_input_token_budget(soft, 131072, explicit=budget_is_explicit(soft)) == int(131072 * 0.85)
def test_auto_scales_on_a_known_window():
assert compute_input_token_budget(DEFAULT_BUDGET, 131072, explicit=False) == int(131072 * 0.85)
def test_auto_stays_conservative_on_unknown_window():
# P2 #2: the budget block passes context_length=0 when the window is only a
# fallback, so auto-scaling must NOT inflate to the unproven window.
assert compute_input_token_budget(DEFAULT_BUDGET, 0, explicit=False) == DEFAULT_BUDGET
def test_nondefault_value_is_an_explicit_cap():
assert compute_input_token_budget(20000, 131072, explicit=True) == 20000 # honoured
assert compute_input_token_budget(200000, 32000, explicit=True) == 32000 # clamped to window
def test_get_context_length_known_surfaces_endpoint_proven_vs_fallback():
mc._context_cache.clear()
with patch.object(mc, "_query_context_length", return_value=(131072, True)):
assert mc.get_context_length_known("http://proven/v1", "m1") == (131072, True)
mc._context_cache.clear()
with patch.object(mc, "_query_context_length", return_value=(mc.DEFAULT_CONTEXT, False)):
ctx, known = mc.get_context_length_known("http://unknown/v1", "m2")
assert ctx == mc.DEFAULT_CONTEXT and known is False
# get_context_length keeps its plain-int contract for existing callers
mc._context_cache.clear()
with patch.object(mc, "_query_context_length", return_value=(64000, True)):
assert mc.get_context_length("http://proven/v1", "m3") == 64000
def test_budget_context_binds_known_flag_to_its_own_value():
"""Regression (RaresKeY, #4122): scale the budget off the value the `known`
flag actually proves, never a stale/missing context_length from a different
lookup. Covers the local-restaleness case (fresh proven value beats a stale
fallback) and the no-arg-caller case (discovers a long window despite fallback=0).
"""
# unknown / bare fallback -> 0 (don't scale off an unproven window)
with patch.object(mc, "get_context_length_known", return_value=(128000, False)):
assert mc.budget_context_for_model("u", "m", fallback=128000) == 0
# known -> the freshly-proven value, NOT the (stale) fallback the caller passed
with patch.object(mc, "get_context_length_known", return_value=(4096, True)):
assert mc.budget_context_for_model("u", "m", fallback=128000) == 4096
# no-arg caller (fallback=0) still gets the discovered long window
with patch.object(mc, "get_context_length_known", return_value=(131072, True)):
assert mc.budget_context_for_model("u", "m", fallback=0) == 131072
# probe error -> caller's fallback (prior behaviour)
with patch.object(mc, "get_context_length_known", side_effect=RuntimeError):
assert mc.budget_context_for_model("u", "m", fallback=4096) == 4096
def test_no_arg_caller_scales_from_discovered_window_not_6000():
"""End-to-end of the fix: a caller that passes no context_length (scheduled
tasks, teacher escalation, ...) but whose endpoint reports 131072 now scales to
~111k instead of being capped at the conservative 6000."""
with patch.object(mc, "get_context_length_known", return_value=(131072, True)):
ctx = mc.budget_context_for_model("u", "m", fallback=0)
assert compute_input_token_budget(DEFAULT_BUDGET, ctx, explicit=False) == int(131072 * 0.85)
+5 -5
@@ -47,11 +47,11 @@ def test_is_setting_overridden_reads_raw_saved_file(tmp_path, monkeypatch):
# ---------------------------------------------------------------------------
# Configurable hard_max — completes the reviewer requirement from #1190 that
# was carried over but not implemented in #1230: the ceiling on the auto-
# derived path should be a setting, not a hidden constant. Without this,
# admins on premium APIs with very large windows (1M+ context) can only
# raise the ceiling by editing src/context_budget.py.
# Configurable hard_max — the ceiling on the auto-derived path is a setting
# (`agent_input_token_hard_max`), not a hidden constant. History: a reviewer
# required it on #1190, the merged #1230 shipped without it, and #1273 added it.
# This test pins the function-level override (the `hard_max` parameter); without
# a raisable ceiling, admins on 1M+ context APIs would be stuck at the 200K default.
# ---------------------------------------------------------------------------
def test_custom_hard_max_overrides_default_in_auto_branch():
+2 -2
@@ -13,7 +13,7 @@ def _setup(monkeypatch, windows):
"""windows: {endpoint_url: context_length}. Force the remote path."""
monkeypatch.setattr(mc, "is_local_endpoint", lambda url: False)
monkeypatch.setattr(mc, "_configured_endpoint_kind", lambda url: "api")
monkeypatch.setattr(mc, "_query_context_length", lambda url, model: windows[url])
monkeypatch.setattr(mc, "_query_context_length", lambda url, model: (windows[url], True))
mc._context_cache.clear()
@@ -34,6 +34,6 @@ def test_cache_hit_still_works_per_endpoint(monkeypatch):
# Both endpoints are now cached under their own key; flip the underlying
# query to prove subsequent reads come from the per-endpoint cache, not a re-query.
monkeypatch.setattr(mc, "_query_context_length", lambda url, model: 999)
monkeypatch.setattr(mc, "_query_context_length", lambda url, model: (999, True))
assert mc.get_context_length(a, "shared-model") == 8000
assert mc.get_context_length(b, "shared-model") == 200000
+1 -1
@@ -51,7 +51,7 @@ def test_model_context_queries_models_for_v1_base(monkeypatch):
monkeypatch.setattr(model_context.httpx, "get", fake_get)
assert model_context._query_context_length("http://127.0.0.1:8080/v1", "qwen3") == 32768
assert model_context._query_context_length("http://127.0.0.1:8080/v1", "qwen3") == (32768, True)
assert seen == [
"http://127.0.0.1:8080/slots",
"http://127.0.0.1:8080/v1/models",
+2 -2
@@ -192,7 +192,7 @@ class TestGetContextLength:
def fake_query(endpoint_url, model):
calls.append((endpoint_url, model))
return 8192 if len(calls) == 1 else 27000
return (8192, True) if len(calls) == 1 else (27000, True)
monkeypatch.setattr(model_context, "_query_context_length", fake_query)
@@ -211,7 +211,7 @@ class TestGetContextLength:
def fake_query(endpoint_url, model):
calls.append((endpoint_url, model))
return 200000 if len(calls) == 1 else 12345
return (200000, True) if len(calls) == 1 else (12345, True)
monkeypatch.setattr(model_context, "_query_context_length", fake_query)