mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-18 10:45:31 -04:00
7ae6133d7f
* fix(agent): don't let a materialized default budget defeat context scaling #1230 scales agent_input_token_budget to the model's context window unless the user explicitly set a budget, detected via is_setting_overridden(). But the settings-save path materializes every DEFAULT_SETTINGS key into settings.json (load_settings merges defaults; handlers persist the merged dict), so the persisted default 6000 reads as "overridden" and the budget code takes the min(6000, ctx) branch — silently re-capping long-context models at 6000 for anyone who has ever saved a setting. This reintroduces the exact regression #1170/#1230 set out to fix. Add is_setting_customized() (saved value != default) and gate the scaling on it instead of mere presence. A persisted default is not a user choice. is_setting_overridden has exactly one consumer (this budget path), so the change is contained. Tests cover the materialized-default regression, a deliberately-chosen budget still being honoured, and the absent-key case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(agent): rework context-budget fix per review (#4122) Address RaresKeY's review: P2 (explicitness): is_setting_customized treated a saved value equal to the default as "not explicit", which ALSO blocked a user from deliberately pinning the default budget. Reframe the default value itself as the AUTO sentinel — agent_input_token_budget == DEFAULT_BUDGET means "scale to the model's context window", any other value is an explicit cap. A materialized default still reads as auto (fixing the original regression), and any non-default value the user chooses is now honoured. Drop the now-unused is_setting_customized helper. P2 (fallback context): auto-scaling trusted get_context_length() even when it returned only the bare DEFAULT_CONTEXT fallback (no endpoint-reported / known window), over-allocating on self-hosted/proxy setups. Add get_context_length_known() (also returns whether the window was actually discovered); the budget block passes 0 when unknown so auto-scaling stays conservative instead of inflating to an unproven window. hard_max stays auto-only — a deliberate explicit budget wins (#1190); kept that contract and answered the reviewer's question rather than silently reversing it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test(agent): lock the materialized-default budget regression (review on #4121) Per WGlynn's review on the issue: add an end-to-end regression that saves an UNRELATED setting (which makes the settings-save path materialize the budget default into settings.json) and asserts the budget still auto-scales rather than re-reading as an explicit 6000 cap — locking the exact reopening shut. To make the test bite the production decision (not just re-derive it), extract `budget_is_explicit()` into src/context_budget.py and use it from the agent loop. It keys off value-vs-default (the default is the auto sentinel), NOT settings presence — which is the whole point, since the save path materializes defaults. Note: after this PR's rework, is_setting_overridden has ZERO production callers, so the merged-dict materialization smell can't reach any setting through a presence check today (WGlynn's durability concern). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(agent): bind the budget context window to its own provenance (review #4122) RaresKeY caught a correctness bug in the fallback-context guard: stream_agent_loop kept only the `known` flag from get_context_length_known() and budgeted off the passed-in `context_length`, which can come from a *different* lookup. Two failures: - local endpoints are re-queried, so the passed value can be a stale DEFAULT_CONTEXT fallback while the fresh probe proves the real (smaller) served context — we'd scale off the stale value; - callers that don't pass context_length (scheduled tasks, teacher escalation, skill test runs, bg_monitor) were capped at 6000 even when a long window is discoverable. Extract budget_context_for_model() which returns the freshly-probed window when known else 0, binding the flag to the value it proves; the agent loop uses it. Regression tests cover the stale-fallback, no-arg-caller, and probe-error paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(agent): fix stale budget comments + tighten to the contract (review #4122) - settings.py: an explicit budget is clamped to the window only — hard_max is auto-only (#1190); drop the incorrect "and to hard_max". - is_setting_overridden docstring: drop the stale "adaptive budgets" example; point value-sensitive callers at context_budget.budget_is_explicit. - Tighten the budget-block comments to the contract (default = auto sentinel, non-default = explicit cap, hard_max = auto-only ceiling). Comment/docstring-only; no behaviour change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(agent): correct budget issue citations (#1190 → merged #1230/#1273) The context-budget contract (auto-sentinel, explicit budgets honoured, hard_max auto-only) merged via #1230 — #1190 was the earlier, closed, superseded PR. Re-point the contract comments at #1230 (the live source, already cited for the auto-sentinel two lines up in settings.py). The configurable hard_max setting (`agent_input_token_hard_max`) was a reviewer requirement first raised on #1190, omitted from the merged #1230, and actually added in #1273 — credit #1273 for it and correct the test comment's history (it previously implied this PR completed the requirement). Comment/docstring-only; no behaviour change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
443 lines
16 KiB
Python
443 lines
16 KiB
Python
"""
|
|
model_context.py
|
|
|
|
Query and cache model context window sizes from OpenAI-compatible APIs.
|
|
Provides token estimation for context usage tracking.
|
|
"""
|
|
|
|
import ipaddress
|
|
import logging
|
|
import sys
|
|
from typing import Dict, List, Optional, Tuple
|
|
|
|
from urllib.parse import urlparse
|
|
|
|
import httpx
|
|
|
|
logger = logging.getLogger(__name__)
|
|
|
|
_LOCAL_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "::1", "host.docker.internal"}
|
|
_PRIVATE_PREFIXES = ("10.", "172.16.", "172.17.", "172.18.", "172.19.",
|
|
"172.20.", "172.21.", "172.22.", "172.23.", "172.24.",
|
|
"172.25.", "172.26.", "172.27.", "172.28.", "172.29.",
|
|
"172.30.", "172.31.", "192.168.")
|
|
|
|
# Tailscale uses the CGNAT range 100.64.0.0/10, NOT all of 100.0.0.0/8.
|
|
# A bare "100." prefix would classify public addresses (e.g. AWS ranges
|
|
# under 100.x outside the CGNAT block) as local; routes/model_routes.py
|
|
# already narrows this the same way for endpoint classification.
|
|
_TAILSCALE_CGNAT = ipaddress.ip_network("100.64.0.0/10")
|
|
|
|
|
|
def _in_tailscale_range(host: str) -> bool:
|
|
try:
|
|
return ipaddress.ip_address(host) in _TAILSCALE_CGNAT
|
|
except ValueError:
|
|
return False
|
|
|
|
|
|
def _normalize_base_for_compare(url: str) -> str:
|
|
url = (url or "").strip().rstrip("/")
|
|
for suffix in ("/chat/completions", "/models", "/completions", "/v1/messages"):
|
|
if url.endswith(suffix):
|
|
url = url[: -len(suffix)].rstrip("/")
|
|
return url
|
|
|
|
|
|
def _configured_endpoint_kind(url: str) -> Optional[str]:
|
|
"""Return configured endpoint kind for a chat/base URL when available."""
|
|
target = _normalize_base_for_compare(url)
|
|
if not target:
|
|
return None
|
|
if "core.database" not in sys.modules:
|
|
return None
|
|
try:
|
|
from core.database import SessionLocal, ModelEndpoint
|
|
db = SessionLocal()
|
|
try:
|
|
rows = db.query(ModelEndpoint).filter(ModelEndpoint.is_enabled == True).all()
|
|
for ep in rows:
|
|
base = _normalize_base_for_compare(getattr(ep, "base_url", "") or "")
|
|
if not base:
|
|
continue
|
|
if target != base and not target.startswith(base + "/"):
|
|
continue
|
|
kind = (getattr(ep, "endpoint_kind", None) or "auto").strip().lower()
|
|
if kind in ("local", "api", "proxy"):
|
|
return kind
|
|
if getattr(ep, "api_key", None):
|
|
parsed = urlparse(base)
|
|
host = (parsed.hostname or "").lower()
|
|
path = (parsed.path or "").rstrip("/")
|
|
if parsed.port != 11434 and "ollama" not in host and (path.endswith("/v1") or "/openai" in path):
|
|
return "proxy"
|
|
return "auto"
|
|
finally:
|
|
db.close()
|
|
except Exception:
|
|
return None
|
|
|
|
|
|
def is_local_endpoint(url: str) -> bool:
|
|
"""Check if URL points to a local/private/tailscale address."""
|
|
kind = _configured_endpoint_kind(url)
|
|
if kind in ("api", "proxy"):
|
|
return False
|
|
if kind == "local":
|
|
return True
|
|
try:
|
|
host = urlparse(url).hostname or ""
|
|
return host in _LOCAL_HOSTS or host.startswith(_PRIVATE_PREFIXES) or _in_tailscale_range(host)
|
|
except Exception:
|
|
return False
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# Constants
|
|
# ---------------------------------------------------------------------------
|
|
DEFAULT_CONTEXT = 128000
|
|
REQUEST_TIMEOUT = 5
|
|
|
|
# Known context windows for major API models (used as fallback when /models
|
|
# endpoint doesn't report context_length).
|
|
# Substring matching — use the shortest unique prefix so variants get caught.
|
|
KNOWN_CONTEXT_WINDOWS = {
|
|
# --- Anthropic ---
|
|
'claude-sonnet-4-5': 200000,
|
|
'claude-sonnet-4-6': 200000,
|
|
'claude-sonnet-4': 200000,
|
|
'claude-opus-4': 200000,
|
|
'claude-haiku-4': 200000,
|
|
'claude-haiku-3-5': 200000,
|
|
'claude-3-5-sonnet': 200000,
|
|
'claude-3-5-haiku': 200000,
|
|
'claude-3-opus': 200000,
|
|
'claude-3-sonnet': 200000,
|
|
'claude-3-haiku': 200000,
|
|
|
|
# --- OpenAI ---
|
|
'gpt-5': 400000,
|
|
'gpt-4.1': 1047576,
|
|
'gpt-4.1-mini': 1047576,
|
|
'gpt-4.1-nano': 1047576,
|
|
'gpt-4o': 128000,
|
|
'gpt-4o-mini': 128000,
|
|
'gpt-4-turbo': 128000,
|
|
'gpt-4': 8192,
|
|
'gpt-3.5-turbo': 16385,
|
|
'o1': 200000,
|
|
'o1-mini': 128000,
|
|
'o1-pro': 200000,
|
|
'o3': 200000,
|
|
'o3-mini': 200000,
|
|
'o4-mini': 200000,
|
|
|
|
# --- DeepSeek ---
|
|
'deepseek-chat': 64000,
|
|
'deepseek-coder': 64000,
|
|
'deepseek-reasoner': 64000,
|
|
'deepseek-r1': 64000,
|
|
'deepseek-v3': 64000,
|
|
'deepseek-v2': 64000,
|
|
|
|
# --- Google ---
|
|
'gemini-2.5-pro': 1048576,
|
|
'gemini-2.5-flash': 1048576,
|
|
'gemini-2.0-flash': 1048576,
|
|
'gemini-1.5-pro': 1048576,
|
|
'gemini-1.5-flash': 1048576,
|
|
'gemma-4': 262144,
|
|
'gemma-3': 128000,
|
|
'gemma-2': 8192,
|
|
|
|
# --- Mistral ---
|
|
'mistral-large': 128000,
|
|
'mistral-medium': 32000,
|
|
'mistral-small': 32000,
|
|
'mistral-nemo': 128000,
|
|
'mistral-7b': 32000,
|
|
'mixtral': 32000,
|
|
'codestral': 32000,
|
|
'pixtral': 128000,
|
|
|
|
# --- xAI ---
|
|
'grok-4': 131072,
|
|
'grok-3': 131072,
|
|
'grok-2': 131072,
|
|
|
|
# --- Meta / Llama ---
|
|
'llama-4': 1048576,
|
|
'llama-3.3': 131072,
|
|
'llama-3.2': 131072,
|
|
'llama-3.1': 131072,
|
|
'llama-3': 131072,
|
|
|
|
# --- Qwen ---
|
|
'qwen3': 131072,
|
|
'qwen2.5': 131072,
|
|
'qwen2': 32768,
|
|
'qwq': 32768,
|
|
|
|
# --- Cohere ---
|
|
'command-r-plus': 128000,
|
|
'command-r': 128000,
|
|
'command-a': 256000,
|
|
|
|
# --- Perplexity ---
|
|
'sonar-pro': 200000,
|
|
'sonar': 128000,
|
|
|
|
# --- MiniMax ---
|
|
'minimax': 1000000,
|
|
|
|
# --- Moonshot / Kimi ---
|
|
'moonshot': 128000,
|
|
'kimi': 128000,
|
|
|
|
# --- Microsoft ---
|
|
'phi-4': 16000,
|
|
'phi-3': 128000,
|
|
|
|
# --- Nvidia ---
|
|
'nemotron': 131072,
|
|
|
|
# --- Yi ---
|
|
'yi-large': 32768,
|
|
'yi-1.5': 16384,
|
|
|
|
# --- 01.ai ---
|
|
'yi-lightning': 16384,
|
|
|
|
# --- Nous ---
|
|
'hermes': 131072,
|
|
'nous-hermes': 131072,
|
|
|
|
# --- Open community ---
|
|
'dolphin': 32768,
|
|
'mythomax': 4096,
|
|
'wizard': 32768,
|
|
'openchat': 8192,
|
|
'solar': 32768,
|
|
}
|
|
|
|
# ---------------------------------------------------------------------------
|
|
# Cache
|
|
# ---------------------------------------------------------------------------
|
|
_context_cache: Dict[Tuple[str, str], Tuple[int, bool]] = {}
|
|
|
|
|
|
def _get_context_length_cached(endpoint_url: str, model: str) -> Tuple[int, bool]:
|
|
"""Return (context_length, known). ``known`` is False only when the value is a
|
|
bare DEFAULT_CONTEXT fallback (no endpoint report and not in the known table)."""
|
|
configured_kind = _configured_endpoint_kind(endpoint_url)
|
|
is_local = is_local_endpoint(endpoint_url)
|
|
# Key on (endpoint_url, model): the same model id can be served by two
|
|
# different remote endpoints with different real context windows (e.g. a
|
|
# capped proxy vs. the full provider), so caching by model id alone would
|
|
# serve one endpoint's window for the other (issue #2603).
|
|
cache_key = (endpoint_url, model)
|
|
if not is_local and cache_key in _context_cache:
|
|
return _context_cache[cache_key]
|
|
|
|
ctx, known = _query_context_length(endpoint_url, model)
|
|
# Only cache non-default values to allow retry on next request.
|
|
# Local endpoints can restart with a different --max-model-len while keeping
|
|
# the same model id, so always re-query them instead of serving stale cache.
|
|
if not is_local and (ctx != DEFAULT_CONTEXT or configured_kind in ("api", "proxy")):
|
|
_context_cache[cache_key] = (ctx, known)
|
|
logger.info(f"Context length for {model}: {ctx}")
|
|
return ctx, known
|
|
|
|
|
|
def get_context_length(endpoint_url: str, model: str) -> int:
|
|
"""Get the context window size for a model.
|
|
|
|
Queries /v1/models on the endpoint and looks for context_length
|
|
or context_window fields. Caches result per (endpoint, model).
|
|
Falls back to DEFAULT_CONTEXT if unavailable.
|
|
"""
|
|
return _get_context_length_cached(endpoint_url, model)[0]
|
|
|
|
|
|
def get_context_length_known(endpoint_url: str, model: str) -> Tuple[int, bool]:
|
|
"""Like ``get_context_length`` but also returns whether the window was actually
|
|
discovered (endpoint-reported or in the known-models table) rather than the bare
|
|
DEFAULT_CONTEXT fallback. Callers that *scale* a budget off the window must not
|
|
trust an unknown value — a fallback 128K isn't proof the model holds 128K
|
|
(review on #4122)."""
|
|
return _get_context_length_cached(endpoint_url, model)
|
|
|
|
|
|
def budget_context_for_model(endpoint_url: str, model: str, *, fallback: int = 0) -> int:
|
|
"""Context window to scale the agent input budget against.
|
|
|
|
Returns the *freshly discovered* window when it was actually proven
|
|
(endpoint-reported / known table), else 0 so auto-scaling stays conservative.
|
|
Crucially this binds the ``known`` flag to the value it proves — callers must
|
|
not pair this flag with a context length from a *different* lookup (a stale
|
|
local re-query, or a caller that didn't pass one), which would budget off an
|
|
unproven number (review on #4122). On probe error, returns ``fallback`` (the
|
|
caller's best-known value) to preserve prior behaviour."""
|
|
try:
|
|
ctx, known = get_context_length_known(endpoint_url, model)
|
|
return ctx if known else 0
|
|
except Exception:
|
|
return fallback
|
|
|
|
|
|
def _lookup_known(model: str) -> Optional[int]:
|
|
"""Check known context windows by substring match.
|
|
|
|
Picks the LONGEST matching key so a short key never shadows a more specific
|
|
one. Without this, 'o1' (200k) precedes 'o1-mini' (128k) in the table and a
|
|
first-match return would report o1-mini's window as 200k.
|
|
"""
|
|
name = model.lower()
|
|
basename = name.split("/")[-1] if "/" in name else name
|
|
basename = basename.split(":")[0] # strip :free, :extended etc.
|
|
best_key: Optional[str] = None
|
|
best_ctx: Optional[int] = None
|
|
for key, ctx in KNOWN_CONTEXT_WINDOWS.items():
|
|
if key in basename or key in name:
|
|
if best_key is None or len(key) > len(best_key):
|
|
best_key, best_ctx = key, ctx
|
|
return best_ctx
|
|
|
|
|
|
def _query_context_length(endpoint_url: str, model: str) -> Tuple[int, bool]:
|
|
"""Query the model API for context length. Returns (context_length, known) where
|
|
``known`` is False only for the bare DEFAULT_CONTEXT fallback."""
|
|
known = _lookup_known(model)
|
|
api_ctx = None
|
|
configured_kind = _configured_endpoint_kind(endpoint_url)
|
|
|
|
# Large OpenAI-compatible proxies can make /models expensive. If the
|
|
# endpoint is explicitly configured as API/proxy, prefer known context
|
|
# metadata (or the default) over downloading the full catalog.
|
|
if configured_kind in ("api", "proxy"):
|
|
if known:
|
|
logger.info(f"Using known context window for {model}: {known}")
|
|
return known, True
|
|
return DEFAULT_CONTEXT, False
|
|
|
|
# Try llama.cpp /slots endpoint first — reports actual serving context
|
|
if is_local_endpoint(endpoint_url):
|
|
try:
|
|
base = endpoint_url.split("/v1")[0] if "/v1" in endpoint_url else endpoint_url.rsplit("/", 1)[0]
|
|
r = httpx.get(f"{base}/slots", timeout=REQUEST_TIMEOUT)
|
|
if r.is_success:
|
|
slots = r.json()
|
|
if isinstance(slots, list) and slots:
|
|
n_ctx = slots[0].get("n_ctx")
|
|
if n_ctx and isinstance(n_ctx, int) and n_ctx > 0:
|
|
logger.info(f"llama.cpp /slots reports n_ctx={n_ctx} for {model}")
|
|
return n_ctx, True
|
|
except Exception:
|
|
pass
|
|
|
|
# GitHub Copilot's /models requires auth + X-GitHub-Api-Version headers that
|
|
# aren't available here; an unauthenticated probe just 400s. All Copilot
|
|
# picker models are major API models covered by the known-context table, so
|
|
# rely on that instead of a doomed network call.
|
|
from src.copilot import is_copilot_base
|
|
if is_copilot_base(endpoint_url):
|
|
if known:
|
|
logger.info(f"Using known context window for {model}: {known}")
|
|
return known, True
|
|
return DEFAULT_CONTEXT, False
|
|
|
|
from src.endpoint_resolver import build_models_url
|
|
|
|
models_url = build_models_url(endpoint_url)
|
|
try:
|
|
r = httpx.get(models_url, timeout=REQUEST_TIMEOUT)
|
|
if r.is_success:
|
|
data = r.json()
|
|
models_list = data.get("data") or []
|
|
|
|
for m in models_list:
|
|
mid = m.get("id", "")
|
|
if mid == model or mid.split("/")[-1] == model.split("/")[-1]:
|
|
for field in (
|
|
"context_length",
|
|
"context_window",
|
|
"max_model_len",
|
|
"max_context_length",
|
|
"max_seq_len",
|
|
):
|
|
val = m.get(field)
|
|
if val and isinstance(val, (int, float)) and val > 0:
|
|
api_ctx = int(val)
|
|
break
|
|
|
|
if not api_ctx:
|
|
meta = m.get("meta") or m.get("model_extra") or {}
|
|
if isinstance(meta, dict):
|
|
# n_ctx is the actual serving context (set via -c flag in llama.cpp)
|
|
for field in ("n_ctx", "context_length", "context_window", "max_model_len"):
|
|
val = meta.get(field)
|
|
if val and isinstance(val, (int, float)) and val > 0:
|
|
api_ctx = int(val)
|
|
break
|
|
break
|
|
except Exception as e:
|
|
logger.debug(f"Failed to query context length for {model}: {e}")
|
|
|
|
# For local/self-hosted endpoints, trust the API value (user set --max-model-len)
|
|
# For cloud APIs, use the larger value (API can report low defaults)
|
|
if api_ctx and known:
|
|
_is_local = is_local_endpoint(endpoint_url)
|
|
if _is_local and api_ctx < known:
|
|
logger.info(f"Local endpoint reports {api_ctx} for {model} (known max: {known}) — using API value")
|
|
return api_ctx, True
|
|
result = max(api_ctx, known)
|
|
if api_ctx < known:
|
|
logger.info(f"API reported {api_ctx} for {model}, using known {known} instead")
|
|
return result, True
|
|
if api_ctx:
|
|
return api_ctx, True
|
|
if known:
|
|
logger.info(f"Using known context window for {model}: {known}")
|
|
return known, True
|
|
|
|
return DEFAULT_CONTEXT, False
|
|
|
|
|
|
def estimate_tokens(messages: List[Dict]) -> int:
|
|
"""Rough token estimate for a list of messages.
|
|
|
|
Uses chars * 0.3 which is closer to real BPE tokenizer output
|
|
than the commonly-cited chars/4 (which underestimates by ~20-30%).
|
|
Also adds ~4 tokens per message for role/formatting overhead, and counts
|
|
assistant tool_calls (name + arguments) — a tool-only turn carries
|
|
content=None with the real payload in tool_calls, so ignoring them made the
|
|
estimate (and the compaction/trim gates that rely on it) blind to large
|
|
tool arguments.
|
|
"""
|
|
total = 0
|
|
for msg in messages:
|
|
total += 4 # per-message overhead (role, separators)
|
|
content = msg.get("content", "")
|
|
if isinstance(content, str):
|
|
total += int(len(content) * 0.3)
|
|
elif isinstance(content, list):
|
|
for item in content:
|
|
if isinstance(item, dict) and item.get("type") == "text":
|
|
total += int(len(item.get("text", "")) * 0.3)
|
|
# Tool calls carry real payload too: a tool-only assistant turn is stored
|
|
# with content=None and the actual args (e.g. a create_document body) in
|
|
# tool_calls[].function.arguments. Ignoring them made large tool arguments
|
|
# read as ~0 tokens, so the compaction/trim gates missed genuine overflow.
|
|
tool_calls = msg.get("tool_calls")
|
|
if isinstance(tool_calls, list):
|
|
for tc in tool_calls:
|
|
if not isinstance(tc, dict):
|
|
continue
|
|
fn = tc.get("function") if isinstance(tc.get("function"), dict) else tc
|
|
name = fn.get("name", "") or ""
|
|
args = fn.get("arguments", "") or ""
|
|
if not isinstance(args, str):
|
|
args = str(args) # some shapes store arguments as a dict
|
|
total += 4 # per tool-call overhead (id, type, wrapper)
|
|
total += int((len(str(name)) + len(args)) * 0.3)
|
|
return total
|