Author SHA1 Message Date
nopoz fbe3a0d73b fix(security): prevent ReDoS in XML and args tool-call parsers (#4941)
* fix(security): prevent ReDoS in XML and args tool-call parsers

Four py/polynomial-redos sinks in tool_parsing.py ran lazy/greedy regexes over
untrusted model output (tool-call markup is attacker-influenced via prompt
injection). When the closing delimiter was absent, each rescanned to
end-of-string from every opener -> O(n^2):

  - args => { ... } in _parse_tool_call_block: greedy \{([\s\S]*)\} restarted
    from every `args:{` opener. Now finds the opener once and captures up to
    the last `}` (rfind): equivalent capture, O(n).
  - _XML_INVOKE_RE: lazy <invoke ...>([\s\S]*?)</invoke>. Now _iter_xml_invoke
    pairs each opener with the first reachable </invoke> and stops once no
    closer remains.
  - _XML_DIRECT_TOOL_RE and the <tag>([\s\S]*?)</\1> param scan in
    _parse_tool_code_block: lazy backreference patterns. Now _iter_backref_blocks
    pairs each opener with the nearest matching closer and memoizes tag names
    with no remaining closer, so an opener flood stays O(n).
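
A minimal sketch of the first two bullets, not the code in tool_parsing.py.
The names extract_args_body and iter_invoke_blocks and the opener pattern are
hypothetical, and attribute values containing `>` are ignored for brevity:

```python
import re

# Hypothetical opener pattern; the real one in tool_parsing.py may differ.
_ARGS_OPENER = re.compile(r"args\s*(?:=>|:)\s*\{")


def extract_args_body(block: str):
    """Return the text between the first args opener and the last '}'."""
    m = _ARGS_OPENER.search(block)
    if m is None:
        return None
    end = block.rfind("}")  # single backward scan instead of greedy backtracking
    if end < m.end():
        return None  # no closing brace after the opener
    return block[m.end():end]


def iter_invoke_blocks(text: str):
    """Yield (attrs, body) for each <invoke ...>...</invoke>, forward-only."""
    closer = "</invoke>"
    pos = 0
    while True:
        start = text.find("<invoke", pos)
        if start == -1:
            return
        gt = text.find(">", start)
        if gt == -1:
            return
        close = text.find(closer, gt + 1)
        if close == -1:
            # No closer remains, so no later opener can match either: stop
            # instead of rescanning the suffix from every remaining opener.
            return
        yield text[start + len("<invoke"):gt].strip(), text[gt + 1:close]
        pos = close + len(closer)
```

Both helpers only ever move forward through the input (or scan it once from
the end), so an input made of unclosed openers is rejected after a single pass.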

All four are output-equivalent to the originals on well-formed tool-call markup;
the lazy patterns remain defined (still re-exported via agent_tools) but no
longer drive a finditer over untrusted text. Adds tests/test_redos_xml_tool_parsers.py
pinning correctness and bounding the opener-flood inputs (old paths took 4-15s).

* fix(security): harden invoke-parameter and distinct-name tag scans

Make forward-only the two residual ReDoS paths in the XML/tool parsers that the
outer-delimiter fix left quadratic:

- _parse_xml_invoke parsed <parameter> with _XML_PARAM_RE.finditer, so a
  closed <invoke> body full of unclosed <parameter> openers rescanned the
  body from every opener (O(n^2), ~11s at 8k openers). Now scans forward-only
  via _iter_named_blocks, factored out of _iter_xml_invoke.
- _iter_backref_blocks only memoized repeated missing tag names; a flood of
  distinct unclosed names searched the suffix once per name (O(n^2)). It now
  indexes every closer by name in one linear pass and binary-searches per
  opener (O(n log n)). Covers the direct and tool_code backref scans.
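
The closer index described above can be sketched roughly as follows. This is
an illustration under assumptions, not the committed _iter_backref_blocks: the
tag-name patterns, the function name, and the (name, body) return shape are
hypothetical.

```python
import re
from bisect import bisect_left
from collections import defaultdict

# Hypothetical tag patterns; the real parser's name rules may differ.
_OPEN_RE = re.compile(r"<(\w+)>")
_CLOSE_RE = re.compile(r"</(\w+)>")


def iter_backref_blocks(text: str, ci: bool = False):
    """Yield (name, body) for each <name>...</name>, nearest closer first."""
    norm = str.lower if ci else (lambda s: s)

    # One linear pass: record the start offset of every closer, keyed by name.
    # finditer runs left to right, so each list is already sorted.
    closers = defaultdict(list)
    for m in _CLOSE_RE.finditer(text):
        closers[norm(m.group(1))].append(m.start())

    pos = 0
    while True:
        m = _OPEN_RE.search(text, pos)
        if m is None:
            return
        starts = closers.get(norm(m.group(1)), ())
        # Binary search for the first closer at or after the end of the opener.
        i = bisect_left(starts, m.end())
        if i == len(starts):
            pos = m.end()  # unclosed opener: skip it without scanning the suffix
            continue
        close_start = starts[i]
        yield m.group(1), text[m.end():close_start]
        pos = close_start + len(m.group(1)) + 3  # len("</") + name + len(">")
```

A flood of distinct unclosed names costs one dictionary lookup per opener,
since none of them has an entry in the index.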

Output-equivalent to the prior scanners (200k randomized trials match the
memoized version for both the direct ci=True and tool_code ci=False configs).
Adds regressions for the closed-invoke parameter flood and the distinct-name
floods (45k openers now run in ~0.05s, were 5-6s).
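
A randomized equivalence check of the kind mentioned above might look like the
sketch below. It is an assumption about the approach, not the committed test:
the fixed <t> tag, the token alphabet, and all function names are hypothetical.

```python
import random
import re

# Reference: a lazy pattern of the kind being replaced (fixed tag for brevity).
_LAZY_RE = re.compile(r"<t>([\s\S]*?)</t>")


def reference_scan(text):
    return [m.group(1) for m in _LAZY_RE.finditer(text)]


def forward_scan(text):
    """Forward-only candidate that should agree with reference_scan."""
    out, pos = [], 0
    while True:
        start = text.find("<t>", pos)
        if start == -1:
            return out
        close = text.find("</t>", start + 3)
        if close == -1:
            return out  # no closer remains for this or any later opener
        out.append(text[start + 3:close])
        pos = close + 4


def first_mismatch(reference, candidate, trials, seed=0):
    """Return the first random input on which the two scanners disagree."""
    rng = random.Random(seed)  # seeded so a failure is reproducible
    tokens = ["<t>", "</t>", "<", ">", "/", "t", "x", " "]
    for _ in range(trials):
        text = "".join(rng.choice(tokens) for _ in range(rng.randint(0, 12)))
        if reference(text) != candidate(text):
            return text
    return None
```

Short inputs keep the lazy reference cheap, and mixing whole tags with tag
fragments in the alphabet exercises malformed and overlapping markup.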
2026-06-27 15:42:55 -07:00