fix(security): harden untrusted_context_message against delimiter spoofing (#3086)

* fix(security): harden untrusted_context_message against delimiter spoofing

  Root cause: untrusted_context_message() did not sanitise content before
  interpolating it into the <<<UNTRUSTED_SOURCE_DATA>>> /
  <<<END_UNTRUSTED_SOURCE_DATA>>> delimited sandbox block. Malicious content
  embedding the literal delimiter strings could prematurely close the sandbox
  and inject instructions that the LLM treats as trusted.

  Fix: add _escape_guard_markers() helper that replaces the guard marker
  strings with structurally inert tokens (<<<_UNTRUSTED_DATA>>> and
  <<<_END_UNTRUSTED_DATA>>>) before the content is wrapped. The function is
  applied in untrusted_context_message() after casting content to str.

  The existing ~13 call sites (chat_processor.py, agent_loop.py,
  deep_research.py, chat_helpers.py, chat_routes.py) are unaffected because
  they pass content through without inspecting the output delimiters.

  Regression tests added in tests/test_prompt_security.py covering:
  - _escape_guard_markers unit tests (open, close, both, benign passthrough)
  - untrusted_context_message integration tests (delimiter spoofing
    neutralisation, type coercion, None handling, metadata preservation)

  Resolves #3056

* fix(security): sanitize label for newlines and guard markers

  Addresses reviewer feedback on PR #3086:
  - Normalize label: strip CR/LF to prevent pre-guard line injection
  - Escape guard marker literals in label via _escape_guard_markers()
  - Add regression tests for label-based newline injection, GUARD_OPEN and
    GUARD_CLOSE in label, and exactly-one-structural-guard assertion

* fix(security): move Source label inside GUARD_OPEN block

  The reviewer correctly identified that even after sanitizing the label, any
  user-derived label text (e.g. `f"web page: {url}"`) still appeared before
  GUARD_OPEN in the trusted framing zone, where the LLM treats it as trusted
  instructions.

  Fix: move the 'Source: {label}' line to inside the guarded block so only the
  hardcoded UNTRUSTED_CONTEXT_HEADER sits before GUARD_OPEN. The raw label is
  still kept in metadata["source"] for traceability. _sanitize_label() and
  _escape_guard_markers() are kept for defence-in-depth on the label stored
  inside the block.

  Update test_label_newline_injection_is_blocked to assert no label-derived
  instruction text appears before GUARD_OPEN (pre-guard zone is now empty of
  any user-derived content).
2026-06-16 01:35:36 -04:00 · 2026-06-07 23:15:50 +02:00
parent f939cb65ce
commit d85c5e335e
2 changed files with 250 additions and 4 deletions
@@ -23,17 +23,60 @@ UNTRUSTED_CONTEXT_HEADER = (
 )


+GUARD_OPEN = "<<<UNTRUSTED_SOURCE_DATA>>>"
+GUARD_CLOSE = "<<<END_UNTRUSTED_SOURCE_DATA>>>"
+
+
+def _escape_guard_markers(text: str) -> str:
+    """Neutralise delimiter literals inside untrusted text.
+
+    If an attacker embeds the exact guard marker strings they can
+    prematurely close the sandbox block and inject instructions outside
+    it.  Replacing them with a visually distinct but structurally inert
+    token prevents the breakout while preserving the original meaning
+    for human review.
+    """
+    text = text.replace(GUARD_OPEN, "<<<_UNTRUSTED_DATA>>>")
+    text = text.replace(GUARD_CLOSE, "<<<_END_UNTRUSTED_DATA>>>")
+    return text
+
+
+def _sanitize_label(label: str) -> str:
+    """Sanitize a label for safe inclusion *inside* the guarded block.
+
+    Even though the label now lives inside the sandboxed region, we still
+    escape it for defence-in-depth:
+    1. Strips leading/trailing whitespace.
+    2. Replaces every CR/LF with a single space.
+    3. Escapes guard marker literals via _escape_guard_markers() so the
+       label cannot prematurely close the sandbox block.
+    """
+    label = label.strip()
+    label = label.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
+    label = _escape_guard_markers(label)
+    return label
+
+
 def untrusted_context_message(label: str, content: Any) -> Dict[str, Any]:
-    """Return an LLM message that keeps retrieved/source text out of system role."""
+    """Return an LLM message that keeps retrieved/source text out of system role.
+
+    The template is structured so that *only* the hardcoded
+    UNTRUSTED_CONTEXT_HEADER appears before GUARD_OPEN.  No user- or
+    caller-derived text is placed in the pre-guard trusted framing zone.
+    The source label and the body content are both placed *inside* the
+    guarded block where the LLM treats them as untrusted data.
+    """
+    safe_label = _sanitize_label(label)
    text = "" if content is None else str(content)
+    text = _escape_guard_markers(text)
    return {
        "role": "user",
        "content": (
            f"{UNTRUSTED_CONTEXT_HEADER}\n"
-            f"Source: {label}\n\n"
-            "<<<UNTRUSTED_SOURCE_DATA>>>\n"
+            f"{GUARD_OPEN}\n"
+            f"Source: {safe_label}\n"
            f"{text}\n"
-            "<<<END_UNTRUSTED_SOURCE_DATA>>>"
+            f"{GUARD_CLOSE}"
        ),
        "metadata": {"trusted": False, "source": label},
    }