fix(chat): stabilize system prompt, sequence memory extraction, and send stable session id to preserve KV cache (#3360)

mirror of https://github.com/pewdiepie-archdaemon/odysseus.git synced 2026-06-17 10:15:27 -04:00

* fix(chat): stabilize system prompt, sequence memory extraction, send stable session id to preserve KV cache

Fixes #2927. As diagnosed in the issue, three things in Odysseus's request
pattern actively destroyed local backends' (llama.cpp / LM Studio) KV-cache
continuity, forcing a full prompt re-evaluation (15-30s+) on every turn:

1. Dynamic content folded into the system prompt every turn. Both the chat
   preface (ChatProcessor.build_context_preface) and the agent system prompt
   (_build_system_prompt) injected current_datetime_prompt() — text that
   changes every minute — directly into system-role messages, which llm_core
   then concatenates into the single system message sent as the cached
   prefix. Any byte difference there invalidates the entire cache. Moved this
   to a new current_datetime_context_message() helper that returns a
   standalone user-role message, inserted near the end of the array (right
   before the latest user turn) instead of mixed into the system prompt. The
   static system prefix (preset prompt + safety policy + agent base prompt)
   now stays byte-identical across turns of the same session.

2. Memory/skill extraction side-requests competed with the main completion.
   run_post_response_tasks fired extract_and_store / maybe_extract_skill via
   asyncio.create_task — fire-and-forget coroutines that could overlap the
   next turn's main request and steal llama.cpp's limited processing slots,
   evicting the cached checkpoint. They're now queued through a new
   _run_extraction_jobs_sequentially helper that waits for the session's
   stream to go idle and runs the jobs strictly one at a time.

3. No stable session identifier was sent to local backends, so llama.cpp
   assigned a new processing slot via LRU every turn ("session_id=<empty>
   server-selected (LCP/LRU)"), losing slot affinity. Added
   _apply_local_cache_affinity() in llm_core, which sets session_id and
   cache_prompt: true on outgoing payloads — gated to self-hosted
   OpenAI-compatible endpoints only (never api.openai.com or other cloud
   providers, which reject unrecognized request fields with a 400). Threaded
   session_id through stream_llm / llm_call_async / stream_agent_loop from
   the existing Odysseus session id.

Tests in tests/test_kv_cache_invalidation_2927.py exercise the real payload-
assembly and scheduling code paths: byte-identical system prefix across two
turns of the same session (with a regression check that genuinely changed
instructions DO still change it), the dynamic time block landing as a
user-role message, extraction jobs waiting for the stream to go idle and
running sequentially, and the outgoing payload carrying a stable session_id
(same across turns of one session, different across sessions) only for
self-hosted endpoints. Updated tests/test_user_time.py for the new message
placement.

* fix(tests): accept owner= kwarg in normalize_model_id monkeypatch

The upstream normalize_model_id signature now takes an owner= keyword
argument, and chat_helpers.py passes owner=getattr(sess, "owner", None)
at the call site. Update the test stub lambda to **kwargs so it handles
the new argument without breaking, and update chat_helpers.py to forward
the owner parameter consistently.

---------

Co-authored-by: Alexandre Teixeira <111787685+alteixeira20@users.noreply.github.com>

This commit is contained in:

Lucas Daniel

2026-06-09 18:46:54 -03:00

committed by

GitHub

parent d273085744

commit 55ff22c6d5

8 changed files with 697 additions and 28 deletions

									
										routes/chat_routes.py
									
		+2
		
												View File
												
				@@ -400,6 +400,7 @@ def setup_chat_routes(

				            temperature=ctx.preset.temperature,

				            max_tokens=ctx.preset.max_tokens,

				            prompt_type=preset_id,

				            session_id=session,

				        )

				        _clean_reply, _clean_md = clean_thinking_for_save(reply, {"model": sess.model})

				        sess.add_message(ChatMessage("assistant", _clean_reply, metadata=_clean_md))

				@@ -988,6 +989,7 @@ def setup_chat_routes(

				                        max_tokens=ctx.preset.max_tokens,

				                        prompt_type=preset_id,

				                        tools=None,

				                        session_id=session,

				                    ):

				                        if chunk.startswith("data: ") and not chunk.startswith("data: [DONE]"):

				                            try: