Files
odysseus/tests
nopoz c098355778 fix(security): prevent ReDoS in LLM-output tool/think parsers (#4704)
* fix(security): prevent ReDoS in LLM-output tool/think parsers

The regexes that parse untrusted model output in text_helpers.py and
tool_parsing.py are delimiter-bounded with a lazy [\s\S]*? (or an
ambiguous (\s+[^>]*)?). Applied with re.sub/re.finditer over a whole
response, they degrade to O(n^2) when the closing delimiter is absent:
the engine rescans to end-of-string from every opener. Model output is
untrusted, so a prompt-injected or malicious model can stall the agent
loop with many unclosed openers (measured ~25s on a 60KB <thought flood).

- text_helpers.py: replace ambiguous <thought(\s+[^>]*)?> with
  <thought([^>]*)> (identical capture, no \s+/[^>]* overlap); skip the
  Gemma <|channel>...<channel|> subs when no <channel|> closer is present.
- tool_parsing.py: gate _TOOL_CALL_RE, _XML_TOOL_CALL_RE and _TOOL_CODE_RE
  (in parse_tool_blocks and strip_tool_blocks) on a cheap presence check
  for their closing delimiter. With no closer the regex cannot match, so
  skipping is equivalent; only the wasted O(n^2) rescan is removed.

Resolves CodeQL py/polynomial-redos #230, #231, #232, #233, #235, #236,
#524. The _XML_OPEN_TOOL_CALL_RE alerts (#234, #477) are false positives
(its greedy [\s\S]*\Z is linear) and left untouched.

* fix(security): close ReDoS gaps in tool/think parsers from review

Addresses two review findings on the closer-guard approach:

- Whole-string "closer exists?" checks were bypassable: a stale closer
  before an opener flood, or a closer with no reachable inner `}`, kept
  the guard true while every opener still rescanned to end-of-string
  (O(n^2)). Replace the substring guards with `_iter_delimited`, a
  forward-only scan that pairs each opener with a *later* closer and
  stops once none is reachable (O(n)). `parse_tool_blocks` and
  `strip_tool_blocks` (via `_strip_delimited`) both use it for the
  [TOOL_CALL], <tool_call>/<function_call>, and <tool_code> formats.
  Verified equivalent to the original regexes on well-formed inputs.

- `<thought([^>]*)>` dropped the tag-name boundary and corrupted
  unrelated tags (`<thoughtful>` -> `<thinkful>`). Use `<thought(\s[^>]*)?>`:
  the single fixed `\s` keeps the pattern linear (no `\s+`/`[^>]*`
  overlap) while restoring the boundary; capture is byte-for-byte
  identical for real `<thought ...>` openers.

Adds regressions for stale-closer-before-opener, closer-present-without-
inner-brace, and the <thoughtful>/<thoughts> passthrough.

* fix(security): close Gemma channel ReDoS guard flagged in review

vdmkenny noted the same bypassable whole-string guard remained in
text_helpers.py: `if "<channel|>" in out.lower()` gating the Gemma
thought/response channel subs. A stale `<channel|>` before a
`<|channel>thought` opener flood keeps the guard true while every opener
still rescans to end-of-string (measured ~7.3s at 4k openers).

Replace it with `_sub_delimited`, the same forward-only scan used for the
tool-call parsers: pair each opener with a later closer, stop when none is
reachable (O(n)). Verified output-equivalent to the original capture regexes
on well-formed multi-channel inputs; the stale-closer case now runs in <2ms.
Adds a regression for stale-closer-before-opener on the Gemma path.

* fix(security): harden strip_think() think-tag ReDoS flagged in review

The earlier fixes hardened normalize_thinking_markup and the delimiter
scanners, but the production entrypoint strip_think() still ran
_THINK_CLOSED_RE / _THINK_ATTR_RE / _THINK_OPEN_RE (and the stray-tag
_THINK_TAG_RE) over untrusted model output. Those kept the same ReDoS
shapes: the lazy `<open>[\s\S]*?</close>` rescanned to end-of-string from
every opener, and `(?:\s+[^>]*)?` / `[^>]*` attribute scans ran to
end-of-string from every opener on a "many openers, no closer" flood. On
the prior head, malformed `<think` / `<thinking` / `<thought` floods took
6-14s through strip_think(). The shipped `<thought>` normalization had the
same residual: the single-opener case was linear but an opener flood was
still O(n^2) (~4.4s).

- Replace the lazy multi-pass _THINK_CLOSED_RE loop with the existing
  forward-only _sub_delimited scan (pair each opener with the first
  reachable closer, stop when none is reachable). One pass collapses
  sequential and nested blocks as before.
- Bound every opener/stray-tag attribute scan at `<` (`[^<>]` not `[^>]`)
  so a no-`>` opener flood can't drive a single match attempt to
  end-of-string. Identical capture for well-formed think/thought tags.
- email_helpers._strip_think: compute had_think from the single linear
  _THINK_TAG_RE instead of the lazy closed/open `.search()` calls, which
  had the same O(n^2) on the email reply/summary/extraction paths.

All flood variants now finish in <10ms (were 6-14s). Output verified
byte-for-byte identical to the prior implementation over a 34-case corpus
(nested, mismatched, attr, uppercase, Gemma, prose, prompt-echo). Adds
strip_think() timing regressions for malformed openers, opener floods
(all three tag names), the closed-opener flood, and the malformed-closer
flood.

* docs: trim verbose comments in think-tag ReDoS fix
2026-06-27 10:12:28 -07:00
..
2026-05-31 23:58:26 +09:00
2026-06-01 02:22:17 +00:00
2026-05-31 23:58:26 +09:00

Test Suite Notes

Purpose

This file documents the shared test helpers and the review expectations that go with them. The suite is being refactored incrementally, so this is a working reference for that effort - not a claim that the suite is already fully organized. Read it before adding a new helper or before reviewing a PR that touches tests/helpers/.

For the broader rules - test taxonomy, determinism/isolation rules, the behavioral-vs-source-text policy, and helper/factory extraction rules - see TESTING_STANDARD.md. This file is the concrete helper reference; that file is the standard the refactor works toward.

Running focused subsets (taxonomy markers)

tests/conftest.py tags every test at collection time with two markers derived from its filename by tests/_taxonomy.py: an area_* marker (e.g. area_security) and a finer sub_* marker (e.g. sub_owner_scope). This adds markers only - it moves no files and changes no test behavior. Use them to run a focused slice:

./venv/bin/python -m pytest -m area_security
./venv/bin/python -m pytest -m "area_services and sub_cookbook"

Areas are security, routes, services, cli, js, helpers, unit, and uncategorized. Classification is conservative and token-based: a file that matches no area keyword falls back to area_uncategorized with its filename as the sub-area. The area_* names are registered in pyproject.toml; the dynamic sub_* names are registered before collection by pytest_configure in tests/conftest.py, so unknown-mark warnings still flag genuine typos.

For common focused runs, use tests/run_focus.py. It validates area and sub-area names, accepts sub-areas with or without the sub_ prefix, and passes extra pytest arguments after --:

./venv/bin/python tests/run_focus.py --area security
./venv/bin/python tests/run_focus.py --area services --sub-area cookbook
./venv/bin/python tests/run_focus.py --sub-area sub_cookbook
./venv/bin/python tests/run_focus.py --keyword taxonomy
./venv/bin/python tests/run_focus.py --last-failed
./venv/bin/python tests/run_focus.py --dry-run --area services --sub-area cookbook
./venv/bin/python tests/run_focus.py --area services -- --maxfail=1 -q

Fast lane and duration visibility

--fast runs the fast lane: the tests that are not marked slow (it adds the marker expression not slow). It composes with --area/--sub-area using and. Because no tests may be marked slow yet, --fast can initially match the full focused selection; it becomes a real speed-up as slow marks are added from duration evidence. Use it for quick local or reviewer feedback; it does not replace broader focused or full-suite validation before merge.

--durations N and --durations-min FLOAT add pytest's slowest-test reporting so you can see where time goes. They are reporting only and do not count as a focus selector, so --durations must be combined with a real selector (--area, --sub-area, --keyword, --last-failed, or --fast).

Use the project Python environment before running these commands. The examples use the repo's documented ./venv/bin/python path so they do not accidentally fall back to system Python.

./venv/bin/python tests/run_focus.py --fast
./venv/bin/python tests/run_focus.py --area services --fast
./venv/bin/python tests/run_focus.py --area services --durations 25
./venv/bin/python tests/run_focus.py --area services --fast --durations 25 --durations-min 0.05

The slow marker is opt-in. Mark a test slow only with duration evidence (from --durations), not by guessing - see the fast-lane policy in TESTING_STANDARD.md. --fast is for quick reviewer feedback and must not replace the full suite before merge. A slow mark only excludes a test from the fast lane; the test stays runnable directly, e.g.:

./venv/bin/python -m pytest tests/test_auth_config_lock_concurrency.py
./venv/bin/python -m pytest -m slow

Order-sensitivity reporting (report-only)

tests/run_order_report.py runs pytest with the collected test items shuffled by a seeded RNG, to surface order-sensitive tests (hidden coupling through shared import state, module caches, databases, etc.). It is report-only: it is not wired into CI, adds no gate, and changes no normal pytest collection or ordering - the shuffle exists only inside this runner. The seed is always printed, and pytest targets/options go after a literal --:

./venv/bin/python tests/run_order_report.py --seed 123 -- tests/cli/ -q
./venv/bin/python tests/run_order_report.py -- tests/cli/ -q   # generates and prints a seed

The same seed reproduces the same order when the reported working directory, pytest target arguments, and test environment are also the same. The runner prints all command arguments with shell-safe POSIX quoting and uses the invoking Python interpreter.

A generated-seed run starts with output like:

[order-report] working directory: /path/to/odysseus
[order-report] shuffling test order with seed 284734921
[order-report] reproduce from this working directory with the same test environment:
[order-report] reproduce with: /path/to/odysseus/venv/bin/python /path/to/odysseus/tests/run_order_report.py --seed 284734921 -- tests/cli/ -q

Run the printed command from the reported working directory to reproduce the same fixed-seed order:

[order-report] working directory: /path/to/odysseus
[order-report] shuffling test order with seed 284734921
[order-report] reproduce from this working directory with the same test environment:
[order-report] reproduce with: /path/to/odysseus/venv/bin/python /path/to/odysseus/tests/run_order_report.py --seed 284734921 -- tests/cli/ -q

Pytest output remains visible between the report header and footer. A failing run ends with pytest's normal failure report followed by:

FAILED tests/example_test.py::test_example - AssertionError
[order-report] seed 284734921: pytest exit code 1 (report-only; fix order-sensitive failures in separate scoped PRs)

Failures discovered this way are real isolation bugs: fix them in separate scoped PRs - do not silence them with skip/xfail, and do not "fix" them by depending on a particular order.

The runner propagates pytest's exit code, so it composes with normal local workflows; "report-only" means it is not a CI gate, not that failures are swallowed.

Core principles

  • Keep PRs small and homogeneous: one kind of change per PR.
  • Prefer explicit local setup over hidden global fixtures.
  • Avoid expanding the root conftest.py unless absolutely necessary.
  • Do not mix file moves with logic changes in the same PR.
  • Do not weaken tests with skip/xfail just to make CI pass.
  • Validate the focused files you changed, plus any neighboring or order-sensitive groups they interact with.

Helper conventions

The helpers below live under tests/helpers/. They exist to remove repeated boilerplate that already appeared across multiple tests. Reach for one only when your test matches its intended use; do not stretch a helper to cover a new case.

tests.helpers.cli_loader.load_script

Use when a test needs to import a script under scripts/ without repeating SourceFileLoader / importlib.util boilerplate.

  • Intended for script/CLI tests that load a single file from scripts/.
  • Not for arbitrary package imports - use a normal import for those.
  • When migrating an existing test to it, keep the existing stubs and assertions unchanged. Any sys.modules stubs the script needs at import time must still be injected (e.g. via monkeypatch) before calling load_script.

tests.helpers.import_state.clear_module

Use when a test must drop one cached module and its parent-package attribute before a fresh import.

  • Clears sys.modules[name].
  • Clears the parent-package attribute when present.
  • Good replacement for local sys.modules.pop(...) + delattr(parent, child) blocks.

tests.helpers.import_state.preserve_import_state

Use when a test temporarily installs stubs into sys.modules and needs deterministic cleanup afterward.

  • Context manager: restores both sys.modules entries and parent-package attributes on exit (normal or exception).
  • Useful around module-level stubs or temporary imports.
  • Prefer narrow, explicit module names over broad ones.

tests.helpers.import_state.clear_fake_database_modules

Use only for the guarded fake/stub database cleanup pattern.

  • Preserves a real-looking core.database (one with a string __file__).
  • Removes a fake/stub core.database and the related src.database state.
  • Do not use as a general database reset fixture.

tests.helpers.import_state.clear_fake_endpoint_resolver_modules

Use only for the guarded fake/stub src.endpoint_resolver cleanup pattern.

  • Preserves real resolver modules (those with a truthy __file__).
  • Evicts fake/stub resolver modules and the dependent route modules that were cached against them.
  • Accepts explicit extra dependent module names to evict alongside the defaults.

tests.helpers.sqlite_db.make_temp_sqlite

Use for the repeated file-backed temp sqlite setup in tests.

  • Only constructs (SessionLocal, engine, tmpfile) from the repeated block.
  • Does not patch modules and does not clean up the temp file.
  • The caller must bind SessionLocal explicitly onto whatever module the code under test reads, and must keep the returned objects alive.
  • Do not use it as a general DB fixture framework.

tests.helpers.db_stubs.make_core_db_stub

Use for small import-time core.database stubs with a placeholder SessionLocal.

  • Pass model names via models when MagicMock attributes are sufficient.
  • Pass attributes when an import needs exact placeholder values.
  • Set install_core_package=True only when the test also needs a fake parent core module stub.
  • Keep custom fake sessions and route-specific database behavior local.

What not to abstract yet

Some remaining patterns should stay as-is for now rather than being forced into helpers:

  • Large mixed files such as security/review regression files.
  • Broad setup-oriented sys.modules stub installers.
  • One-off custom module patching.
  • Custom DB session, route, and app setup.

Validation expectations

Run validation locally before opening or approving a PR. Practical checks:

  • git diff --check - catch whitespace and conflict-marker errors.
  • ./venv/bin/python -m py_compile <changed files> - confirm changed files compile.
  • Focused ./venv/bin/python -m pytest on the changed test files.
  • ./venv/bin/python -m pytest on neighboring or order-sensitive test groups that share import state with the changed files.
  • grep for the old boilerplate when replacing it, to confirm no stragglers remain.
  • A fresh audit worktree when changing the helpers themselves, so stale __pycache__ or import state cannot mask a regression.

Current roadmap

  1. Import-state cleanup - complete.
  2. Document helper conventions (this file).
  3. Pilot the repeated import-time core.database stub helper.
  4. Add further tiny helpers only when the repeated semantics are clear.
  5. Start low-risk file moves only after helper conventions are documented.
  6. Avoid moving high-risk security/route regression files first.