mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-15 17:25:26 -04:00
263d41c58a
* fix(llm): stop sending llama.cpp slot-affinity fields to cloud providers

  _apply_local_cache_affinity adds session_id + cache_prompt for llama.cpp KV-cache slot affinity (#2927), gated on _is_self_hosted_openai_compatible, which treated any unknown OpenAI-compatible host as self-hosted. Strict cloud providers added as custom endpoints (Mistral at api.mistral.ai) reject unknown body fields, so every request failed with 422 extra_forbidden.

  Self-hosted now also requires the endpoint to resolve as local via model_context.is_local_endpoint: loopback/private/tailscale host, or endpoint kind explicitly configured as "local" (the escape hatch for tunneled self-hosted servers). is_local_endpoint is promoted to a public name since llm_core now shares it.

  Fixes #3793

* test(llm): sweep cloud OpenAI-compatible hosts in affinity gating

  Parametrized cases adapted from #3839 (credit: Shabablinchikow): deepseek, x.ai, together, fireworks, and the Gemini OpenAI-compat endpoint must all stay free of the llama.cpp extras, not just the Mistral host from #3793.

* fix(llm): narrow the Tailscale range to 100.64.0.0/10 in is_local_endpoint

  Review finding on #3945: _PRIVATE_PREFIXES carried a bare "100." prefix, treating all of 100.0.0.0/8 as local while Tailscale only uses the CGNAT block 100.64.0.0/10. Public 100.x hosts (e.g. AWS ranges outside the block) were classified local and still received the llama.cpp extras this PR exists to keep away from strict providers. Match the narrowed classification routes/model_routes.py already uses, with boundary tests just below, inside, and just above the range.
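The local-host classification described in the commit message can be illustrated with the standard library. This is only a sketch under stated assumptions: `looks_local` and `TAILSCALE_CGNAT` are hypothetical names, the project's real `is_local_endpoint` may be implemented differently (the message mentions a prefix list), and the configured endpoint kind "local" escape hatch is not modeled here. Hostnames are not resolved through DNS, so any non-IP host other than `localhost` is treated as remote.

```python
import ipaddress
from urllib.parse import urlsplit

# Tailscale assigns addresses from the CGNAT block only, not all of 100.0.0.0/8.
TAILSCALE_CGNAT = ipaddress.ip_network("100.64.0.0/10")


def looks_local(url: str) -> bool:
    """Hypothetical helper: True for loopback, private, or Tailscale CGNAT hosts."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal (e.g. api.mistral.ai): treat as remote, no DNS lookup.
        return False
    if addr.is_loopback or addr.is_private:
        return True
    # ipaddress does not report 100.64.0.0/10 as private, so check it explicitly.
    return addr.version == 4 and addr in TAILSCALE_CGNAT


# Boundary checks: just below, inside, and just above the CGNAT block.
for candidate in (
    "http://100.63.255.255/v1",
    "http://100.64.0.0/v1",
    "http://100.127.255.255/v1",
    "http://100.128.0.0/v1",
    "https://api.mistral.ai/v1",
):
    print(candidate, looks_local(candidate))
```

Matching on a parsed network rather than a string prefix such as "100." is what keeps public 100.x hosts outside the block from being classified as local.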
40 lines
1.8 KiB
Python
"""Regression for #2603: model context-window cache must be keyed per endpoint.

`get_context_length()` cached by model id alone, so two different remote endpoints
serving the same model id (e.g. a capped proxy at 8k vs. the full provider at 200k)
collided: whichever resolved first won process-wide and the other was served the
wrong window. The fix keys the cache on (endpoint_url, model).
"""

import src.model_context as mc


def _setup(monkeypatch, windows):
    """windows: {endpoint_url: context_length}. Force the remote path."""
    monkeypatch.setattr(mc, "is_local_endpoint", lambda url: False)
    monkeypatch.setattr(mc, "_configured_endpoint_kind", lambda url: "api")
    monkeypatch.setattr(mc, "_query_context_length", lambda url, model: windows[url])
    mc._context_cache.clear()


def test_same_model_two_remote_endpoints_get_their_own_window(monkeypatch):
    a, b = "https://proxy-a.example/v1", "https://provider-b.example/v1"
    _setup(monkeypatch, {a: 8000, b: 200000})

    assert mc.get_context_length(a, "shared-model") == 8000
    # Same model id, different endpoint: must NOT return endpoint A's cached 8000.
    assert mc.get_context_length(b, "shared-model") == 200000


def test_cache_hit_still_works_per_endpoint(monkeypatch):
    a, b = "https://proxy-a.example/v1", "https://provider-b.example/v1"
    _setup(monkeypatch, {a: 8000, b: 200000})
    mc.get_context_length(a, "shared-model")
    mc.get_context_length(b, "shared-model")

    # Both endpoints are now cached under their own key; flip the underlying
    # query to prove subsequent reads come from the per-endpoint cache, not a re-query.
    monkeypatch.setattr(mc, "_query_context_length", lambda url, model: 999)
    assert mc.get_context_length(a, "shared-model") == 8000
    assert mc.get_context_length(b, "shared-model") == 200000
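The behavior these tests pin down can be shown in isolation with a small stand-in cache. This is a minimal sketch, not the project's code: `ContextWindowCache` and `fake_query` are hypothetical names, and the real `get_context_length` in `src.model_context` may differ in structure. The only point illustrated is that the cache key is the pair (endpoint_url, model) rather than the model id alone.

```python
from typing import Callable, Dict, List, Tuple


class ContextWindowCache:
    """Hypothetical cache keyed on (endpoint_url, model), not on model alone."""

    def __init__(self, query: Callable[[str, str], int]) -> None:
        self._query = query
        self._cache: Dict[Tuple[str, str], int] = {}

    def get(self, endpoint_url: str, model: str) -> int:
        key = (endpoint_url, model)
        if key not in self._cache:
            # Query the endpoint only on a miss for this exact (endpoint, model) pair.
            self._cache[key] = self._query(endpoint_url, model)
        return self._cache[key]


a, b = "https://proxy-a.example/v1", "https://provider-b.example/v1"
windows = {a: 8000, b: 200000}
calls: List[str] = []


def fake_query(url: str, model: str) -> int:
    """Stand-in for a network lookup; records each call for inspection."""
    calls.append(url)
    return windows[url]


cache = ContextWindowCache(fake_query)
first_a = cache.get(a, "shared-model")
first_b = cache.get(b, "shared-model")
second_a = cache.get(a, "shared-model")  # served from the cache, no new query
print(first_a, first_b, second_a, len(calls))
```

With a key of the model id alone, the second lookup would have returned endpoint A's value for endpoint B, which is the collision the regression tests guard against.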