fix(chat): stabilize system prompt, sequence memory extraction, and send stable session id to preserve KV cache (#3360)

* fix(chat): stabilize system prompt, sequence memory extraction, send stable session id to preserve KV cache Fixes #2927. As diagnosed in the issue, three things in Odysseus's request pattern actively destroyed local backends' (llama.cpp / LM Studio) KV-cache continuity, forcing a full prompt re-evaluation (15-30s+) on every turn: 1. Dynamic content folded into the system prompt every turn. Both the chat preface (ChatProcessor.build_context_preface) and the agent system prompt (_build_system_prompt) injected current_datetime_prompt() — text that changes every minute — directly into system-role messages, which llm_core then concatenates into the single system message sent as the cached prefix. Any byte difference there invalidates the entire cache. Moved this to a new current_datetime_context_message() helper that returns a standalone user-role message, inserted near the end of the array (right before the latest user turn) instead of mixed into the system prompt. The static system prefix (preset prompt + safety policy + agent base prompt) now stays byte-identical across turns of the same session. 2. Memory/skill extraction side-requests competed with the main completion. run_post_response_tasks fired extract_and_store / maybe_extract_skill via asyncio.create_task — fire-and-forget coroutines that could overlap the next turn's main request and steal llama.cpp's limited processing slots, evicting the cached checkpoint. They're now queued through a new _run_extraction_jobs_sequentially helper that waits for the session's stream to go idle and runs the jobs strictly one at a time. 3. No stable session identifier was sent to local backends, so llama.cpp assigned a new processing slot via LRU every turn ("session_id=<empty> server-selected (LCP/LRU)"), losing slot affinity. Added _apply_local_cache_affinity() in llm_core, which sets session_id and cache_prompt: true on outgoing payloads — gated to self-hosted OpenAI-compatible endpoints only (never api.openai.com or other cloud providers, which reject unrecognized request fields with a 400). Threaded session_id through stream_llm / llm_call_async / stream_agent_loop from the existing Odysseus session id. Tests in tests/test_kv_cache_invalidation_2927.py exercise the real payload- assembly and scheduling code paths: byte-identical system prefix across two turns of the same session (with a regression check that genuinely changed instructions DO still change it), the dynamic time block landing as a user-role message, extraction jobs waiting for the stream to go idle and running sequentially, and the outgoing payload carrying a stable session_id (same across turns of one session, different across sessions) only for self-hosted endpoints. Updated tests/test_user_time.py for the new message placement. * fix(tests): accept owner= kwarg in normalize_model_id monkeypatch The upstream normalize_model_id signature now takes an owner= keyword argument, and chat_helpers.py passes owner=getattr(sess, "owner", None) at the call site. Update the test stub lambda to **kwargs so it handles the new argument without breaking, and update chat_helpers.py to forward the owner parameter consistently. --------- Co-authored-by: Alexandre Teixeira <111787685+alteixeira20@users.noreply.github.com>
2026-06-21 20:25:23 -04:00 · 2026-06-09 18:46:54 -03:00
parent d273085744
commit 55ff22c6d5
8 changed files with 697 additions and 28 deletions
@@ -9,7 +9,7 @@ from __future__ import annotations
 import re
 from contextvars import ContextVar
 from datetime import datetime, timedelta, timezone
-from typing import Optional
+from typing import Dict, Optional


 _USER_TZ_OFFSET_MIN: ContextVar[Optional[int]] = ContextVar("user_tz_offset_min", default=None)
@@ -136,3 +136,26 @@ def current_datetime_prompt(now_utc: Optional[datetime] = None) -> str:
        "When scheduling a task with manage_tasks, scheduled_time is in UTC: "
        "convert the user's stated local time using the UTC offset above.\n\n"
    )
+
+
+def current_datetime_context_message(now_utc: Optional[datetime] = None) -> Dict[str, str]:
+    """Build the current-date/time context as a standalone chat message.
+
+    This intentionally returns a ``user``-role message rather than a
+    ``system``-role one. The text changes every turn (it embeds the current
+    clock time down to the minute), and local OpenAI-compatible backends
+    (llama.cpp / LM Studio) key their KV-cache prefix off the system message
+    byte-for-byte — folding ever-changing timestamp text into the system
+    message would invalidate the cached prefix on every single request (see
+    issue #2927). Keeping it as a separate message placed near the end of the
+    array (right before the latest user turn) lets the static system prompt
+    stay byte-identical across turns while the model still gets fresh
+    date/time grounding for relative-date reasoning.
+    """
+    return {
+        "role": "user",
+        "content": (
+            "[Context — current date/time, refreshed each turn; not part of "
+            "your instructions]\n" + current_datetime_prompt(now_utc)
+        ),
+    }