fix(agent): stop treating illustrative Markdown fences as tool calls for native function-calling models (#3356)

* fix(agent): stop executing illustrative Markdown fences as tool calls for native function-calling models _resolve_tool_blocks fell back to the textual parse_tool_blocks() fenced-block parser whenever a model produced no native tool_calls, regardless of whether that model has a reliable native function-calling channel. Native models (GPT/Claude/Grok/Qwen3/DeepSeek-V, etc. - _is_api_model true) commonly write illustrative ```bash/```python/```json examples in guide-only prose; the fallback parser matched these and executed them as real commands, sometimes looping for several rounds as the model tried to clarify with more examples (#3222). Restrict the textual fenced-block fallback to non-native models, which rely on it as their only tool-invocation channel. Native models are trusted to use their structured tool_calls channel for real invocations; when they don't emit one, a bare fence in their response is prose, not an action. The native tool_calls path itself is untouched. This sits one layer below #3088's guide-only policy enforcement: that PR blocks tool exposure/execution on explicit no-tools requests, while this fixes the parser so ordinary illustrative fences are never misread as calls in the first place, on any turn. * fix(agent): gate only the fenced-example pattern for native models, preserve DSML/invoke recovery and persistence _resolve_tool_blocks previously short-circuited the entire textual parser (tool_blocks = [] if is_api_model else parse_tool_blocks(...)) for native function-calling models with no native tool_calls. That also dropped Patterns 2-5 (explicit [TOOL_CALL]/<invoke>/<tool_code>/DSML markup leaked into content as text), which are real calls a model couldn't emit on its structured channel (e.g. DeepSeek-V falling back to DSML), not illustrative examples. parse_tool_blocks/strip_tool_blocks now take a skip_fenced flag that gates ONLY Pattern 1 (the fenced ```bash/```python/```json block matcher). _resolve_tool_blocks passes skip_fenced=is_api_model so fenced examples stop being executed for native models while [TOOL_CALL]/<invoke>/<tool_code>/DSML stay fully active and recoverable. cleaned_round mirrors the same gate when persisting round text, so an illustrative fence that wasn't executed isn't stripped from saved/reloaded history either (it was streaming once and then disappearing on reload).
2026-06-16 17:55:26 -04:00 · 2026-06-08 17:25:28 -03:00
parent 8e494cc1c4
commit 0a324f20d2
3 changed files with 363 additions and 30 deletions
@@ -427,7 +427,7 @@ def _parse_tool_code_block(raw: str) -> Optional[ToolBlock]:
    return None


-def parse_tool_blocks(text: str) -> List[ToolBlock]:
+def parse_tool_blocks(text: str, skip_fenced: bool = False) -> List[ToolBlock]:
    """Extract executable tool blocks from LLM response text.

    Supports multiple formats:
@@ -436,6 +436,17 @@ def parse_tool_blocks(text: str) -> List[ToolBlock]:
    3. XML-style <tool_call>/<invoke> blocks
    4. <tool_code> blocks (MiniMax-M2.5 style)
    5. DeepSeek DSML markup (normalized to <invoke> first)
+
+    `skip_fenced`: when True, Pattern 1 (fenced ```bash/```python/```json code
+    blocks) is not matched at all. Native function-calling models (GPT/Claude/
+    Grok/Qwen3/DeepSeek-V, etc.) commonly write illustrative fenced examples in
+    prose; for those models we trust the structured tool_calls channel for real
+    invocations and treat a bare fence as display text rather than an action
+    (issue #3222). Patterns 2-5 — explicit [TOOL_CALL]/<invoke>/<tool_code>/DSML
+    markup that leaked into content as text — stay fully active regardless,
+    since that markup is never an illustrative example and dropping it would
+    silently lose real calls (e.g. DeepSeek-V falling back to DSML when it
+    can't emit structured tool_calls).
    """
    blocks = []

@@ -443,30 +454,31 @@ def parse_tool_blocks(text: str) -> List[ToolBlock]:
    # XML patterns below catch it.
    text = _normalize_dsml(text)

-    # Pattern 1: fenced code blocks
-    for m in _TOOL_BLOCK_RE.finditer(text):
-        tag = m.group(1).lower()
-        content = m.group(2).strip()
-        if not content:
-            continue
-        # If a code block's content is an <invoke> XML call (some models wrap
-        # tool calls in ```python or ```xml fences), parse the invoke instead.
-        if '<invoke' in content:
-            for inv in _XML_INVOKE_RE.finditer(content):
-                block = _parse_xml_invoke(inv)
+    # Pattern 1: fenced code blocks (skipped when `skip_fenced` — see docstring).
+    if not skip_fenced:
+        for m in _TOOL_BLOCK_RE.finditer(text):
+            tag = m.group(1).lower()
+            content = m.group(2).strip()
+            if not content:
+                continue
+            # If a code block's content is an <invoke> XML call (some models wrap
+            # tool calls in ```python or ```xml fences), parse the invoke instead.
+            if '<invoke' in content:
+                for inv in _XML_INVOKE_RE.finditer(content):
+                    block = _parse_xml_invoke(inv)
+                    if block:
+                        blocks.append(block)
+                # This fenced block is <invoke> markup, not literal code. Whether or
+                # not any call converted, never fall through to append the raw XML as
+                # a python/bash block — e.g. a hyphenated/namespaced tool name that
+                # _XML_INVOKE_RE's \w+ can't match would otherwise be executed as code.
+                continue
+            if tag in ("python", "bash"):
+                block = _parse_misfenced_web_lookup(content)
                if block:
                    blocks.append(block)
-            # This fenced block is <invoke> markup, not literal code. Whether or
-            # not any call converted, never fall through to append the raw XML as
-            # a python/bash block — e.g. a hyphenated/namespaced tool name that
-            # _XML_INVOKE_RE's \w+ can't match would otherwise be executed as code.
-            continue
-        if tag in ("python", "bash"):
-            block = _parse_misfenced_web_lookup(content)
-            if block:
-                blocks.append(block)
-                continue
-        blocks.append(ToolBlock(tag, content))
+                    continue
+            blocks.append(ToolBlock(tag, content))

    # Pattern 2: [TOOL_CALL] blocks (only if no fenced blocks found)
    if not blocks:
@@ -500,12 +512,23 @@ def parse_tool_blocks(text: str) -> List[ToolBlock]:
    return blocks


-def strip_tool_blocks(text: str) -> str:
-    """Remove executable tool blocks from text for clean display."""
+def strip_tool_blocks(text: str, skip_fenced: bool = False) -> str:
+    """Remove executable tool blocks from text for clean display.
+
+    `skip_fenced`: when True, fenced ```bash/```python/```json code blocks
+    (Pattern 1) are left intact instead of being stripped. This must mirror
+    whatever `skip_fenced` value `parse_tool_blocks` was called with for the
+    same response: if a fence wasn't executed as a tool call (because it's an
+    illustrative example from a native function-calling model), it shouldn't
+    vanish from the persisted/displayed text either — otherwise the example
+    streams once and then disappears on reload (issue #3222 follow-up).
+    Patterns 2-5 + DSML markup are always stripped, since that markup should
+    never reach the user regardless of whether it converted to a tool call.
+    """
    # Normalize DSML first so its markup gets stripped by the <invoke>
    # / <tool_call> removers below instead of leaking to the user.
    text = _normalize_dsml(text)
-    cleaned = _TOOL_BLOCK_RE.sub('', text)
+    cleaned = text if skip_fenced else _TOOL_BLOCK_RE.sub('', text)
    cleaned = _TOOL_CALL_RE.sub('', cleaned)
    cleaned = _XML_TOOL_CALL_RE.sub('', cleaned)
    cleaned = _TOOL_CODE_RE.sub('', cleaned)