Files
odysseus/tests/test_context_cache_per_endpoint.py
T
Kenny Van de Maele 263d41c58a fix(llm): stop sending llama.cpp slot-affinity fields to cloud providers (#3945)
* fix(llm): stop sending llama.cpp slot-affinity fields to cloud providers

_apply_local_cache_affinity adds session_id + cache_prompt for llama.cpp
KV-cache slot affinity (#2927), gated on _is_self_hosted_openai_compatible,
which treated any unknown OpenAI-compatible host as self-hosted. Strict
cloud providers added as custom endpoints (Mistral at api.mistral.ai)
reject unknown body fields, so every request failed with 422
extra_forbidden. Self-hosted now also requires the endpoint to resolve as
local via model_context.is_local_endpoint: loopback/private/tailscale
host, or endpoint kind explicitly configured as "local" (the escape hatch
for tunneled self-hosted servers). is_local_endpoint is promoted to a
public name since llm_core now shares it.

Fixes #3793

* test(llm): sweep cloud OpenAI-compatible hosts in affinity gating

Parametrized cases adapted from #3839 (credit: Shabablinchikow): deepseek,
x.ai, together, fireworks, and the Gemini OpenAI-compat endpoint must all
stay free of the llama.cpp extras, not just the Mistral host from #3793.

* fix(llm): narrow the Tailscale range to 100.64.0.0/10 in is_local_endpoint

Review finding on #3945: _PRIVATE_PREFIXES carried a bare "100." prefix,
treating all of 100.0.0.0/8 as local while Tailscale only uses the CGNAT
block 100.64.0.0/10. Public 100.x hosts (e.g. AWS ranges outside the
block) were classified local and still received the llama.cpp extras
this PR exists to keep away from strict providers. Match the narrowed
classification routes/model_routes.py already uses, with boundary tests
just below, inside, and just above the range.
2026-06-11 17:51:03 +02:00

40 lines
1.8 KiB
Python

"""Regression for #2603 — model context-window cache must be keyed per endpoint.
`get_context_length()` cached by model id alone, so two different remote endpoints
serving the same model id (e.g. a capped proxy at 8k vs. the full provider at 200k)
collided: whichever resolved first won process-wide and the other was served the
wrong window. The fix keys the cache on (endpoint_url, model).
"""
import src.model_context as mc
def _setup(monkeypatch, windows):
"""windows: {endpoint_url: context_length}. Force the remote path."""
monkeypatch.setattr(mc, "is_local_endpoint", lambda url: False)
monkeypatch.setattr(mc, "_configured_endpoint_kind", lambda url: "api")
monkeypatch.setattr(mc, "_query_context_length", lambda url, model: windows[url])
mc._context_cache.clear()
def test_same_model_two_remote_endpoints_get_their_own_window(monkeypatch):
a, b = "https://proxy-a.example/v1", "https://provider-b.example/v1"
_setup(monkeypatch, {a: 8000, b: 200000})
assert mc.get_context_length(a, "shared-model") == 8000
# Same model id, different endpoint: must NOT return endpoint A's cached 8000.
assert mc.get_context_length(b, "shared-model") == 200000
def test_cache_hit_still_works_per_endpoint(monkeypatch):
a, b = "https://proxy-a.example/v1", "https://provider-b.example/v1"
_setup(monkeypatch, {a: 8000, b: 200000})
mc.get_context_length(a, "shared-model")
mc.get_context_length(b, "shared-model")
# Both endpoints are now cached under their own key; flip the underlying
# query to prove subsequent reads come from the per-endpoint cache, not a re-query.
monkeypatch.setattr(mc, "_query_context_length", lambda url, model: 999)
assert mc.get_context_length(a, "shared-model") == 8000
assert mc.get_context_length(b, "shared-model") == 200000