docs(tests): define testing standard and taxonomy (#3372)

This commit is contained in:
Alexandre Teixeira
2026-06-08 00:15:47 +01:00
committed by GitHub
parent 505d8bae5a
commit 77b75ca97e
2 changed files with 220 additions and 5 deletions
+10 -5
View File
@@ -4,10 +4,15 @@
This file documents the shared test helpers and the review expectations that go This file documents the shared test helpers and the review expectations that go
with them. The suite is being refactored incrementally, so this is a working with them. The suite is being refactored incrementally, so this is a working
reference for that effort not a claim that the suite is already fully reference for that effort - not a claim that the suite is already fully
organized. Read it before adding a new helper or before reviewing a PR that organized. Read it before adding a new helper or before reviewing a PR that
touches `tests/helpers/`. touches `tests/helpers/`.
For the broader rules - test taxonomy, determinism/isolation rules, the
behavioral-vs-source-text policy, and helper/factory extraction rules - see
[`TESTING_STANDARD.md`](./TESTING_STANDARD.md). This file is the concrete helper
reference; that file is the standard the refactor works toward.
## Core principles ## Core principles
- Keep PRs small and homogeneous: one kind of change per PR. - Keep PRs small and homogeneous: one kind of change per PR.
@@ -30,7 +35,7 @@ Use when a test needs to import a script under `scripts/` without repeating
`SourceFileLoader` / `importlib.util` boilerplate. `SourceFileLoader` / `importlib.util` boilerplate.
- Intended for script/CLI tests that load a single file from `scripts/`. - Intended for script/CLI tests that load a single file from `scripts/`.
- Not for arbitrary package imports use a normal `import` for those. - Not for arbitrary package imports - use a normal `import` for those.
- When migrating an existing test to it, keep the existing stubs and assertions - When migrating an existing test to it, keep the existing stubs and assertions
unchanged. Any `sys.modules` stubs the script needs at import time must still unchanged. Any `sys.modules` stubs the script needs at import time must still
be injected (e.g. via `monkeypatch`) before calling `load_script`. be injected (e.g. via `monkeypatch`) before calling `load_script`.
@@ -96,8 +101,8 @@ helpers:
Run validation locally before opening or approving a PR. Practical checks: Run validation locally before opening or approving a PR. Practical checks:
- `git diff --check` catch whitespace and conflict-marker errors. - `git diff --check` - catch whitespace and conflict-marker errors.
- `python3 -m py_compile <changed files>` confirm changed files compile. - `python3 -m py_compile <changed files>` - confirm changed files compile.
- Focused `pytest` on the changed test files. - Focused `pytest` on the changed test files.
- `pytest` on neighboring or order-sensitive test groups that share import - `pytest` on neighboring or order-sensitive test groups that share import
state with the changed files. state with the changed files.
@@ -108,7 +113,7 @@ Run validation locally before opening or approving a PR. Practical checks:
## Current roadmap ## Current roadmap
1. Import-state cleanup complete. 1. Import-state cleanup - complete.
2. Document helper conventions (this file). 2. Document helper conventions (this file).
3. Audit fake DB / `SessionLocal` / route setup duplication. 3. Audit fake DB / `SessionLocal` / route setup duplication.
4. Add tiny helpers only when the repeated semantics are clear. 4. Add tiny helpers only when the repeated semantics are clear.
+210
View File
@@ -0,0 +1,210 @@
# Odysseus Testing Standard & Taxonomy
## Purpose
This document defines *how we write and refactor tests* in Odysseus. It is the
standard that the incremental test-suite refactor (issue #2523) works toward,
and it applies to both human contributors and coding agents.
It is intentionally split from [`tests/README.md`](./README.md):
- **`README.md`** - the concrete, current helper reference: what each helper in
`tests/helpers/` does and how to call it.
- **`TESTING_STANDARD.md`** (this file) - the rules and taxonomy: what a good
test looks like, where it belongs, and the policy refactor PRs must follow.
When the two ever disagree, this file states the *intent* and `README.md` states
the *current mechanics*; fix whichever is stale.
This document changes no test behavior. It is guidance only.
## What the test suite is for
The goal is not only to reorganize `tests/`. The goal is for the suite to be a
reliable foundation for future development: deterministic, modular, informative,
behavior-focused, and complete enough to replace manual QA wherever practical.
Run tests with the project virtualenv interpreter (`.venv/bin/python -m pytest`).
The system `python3` may be missing pinned dependencies (e.g. `nh3`), which
shows up as import/collection errors that are environmental, not real failures.
## What "done" means for a single test
Every new or refactored test should be:
- **Deterministic** - same result every run, no reliance on wall-clock, network,
RNG seeds, or collection order.
- **Behavior-first** - asserts on observable behavior, not on the source text or
AST of the code under test (see [Behavioral-first policy](#behavioral-first-policy)).
- **Explicit** - setup and expected result are visible in the test, not hidden in
broad fixtures.
- **Isolated from global process state** - no leaked `sys.modules`, `os.environ`,
CWD, or package parent-attribute mutation (see [Determinism & isolation](#determinism--isolation-rules)).
- **Order-independent** - passes regardless of which tests ran before it.
- **Environment-independent** - does not assume a venv layout, a developer's home
directory, an existing `./data` dir, or optional packages that may be absent.
- **Informative on failure** - the assertion message or structure makes the cause
obvious without a debugger.
- **Small** - understandable quickly; one behavior per test where practical.
- **Backed by shared helpers only when duplication is proven** - not abstracted
preemptively.
## Test taxonomy
Tests are classified by the categories below. Today the suite is flat under
`tests/`; the **Target dir** column is the phased layout from #2523 that we move
toward *after* helpers and determinism are stable. Until a category is moved,
new tests in that category stay in flat `tests/` but should still follow this
standard.
| Category | What it covers | Examples today | Target dir |
|---|---|---|---|
| **Route / API integration** | Real ASGI request/response, auth gates, admin gates, owner isolation through the app | files using `TestClient` | `tests/routes/` |
| **CLI / script** | `scripts/` entry points and dev tooling | `tests.helpers.cli_loader.load_script` users, `test_pr_blocker_audit.py` | `tests/cli/` |
| **Frontend / JS** | Browser-coupled JS run via Node subprocess; streaming-render invariants | `*_js.py` wrappers, `tests/streaming/*.test.mjs` | `tests/js/` |
| **Tool execution / parsing** | Tool-call parsing, malformed/nonstring args, tool policy | `test_unknown_tool_calls.py`, `test_tool_policy.py`, `*_nonstring.py` | `tests/unit/` or `tests/services/` |
| **LLM / provider** | Provider response parsing, streaming, sanitize, reasoning fallback | `test_llm_core_*`, `test_anthropic_response_parse.py` | `tests/services/` |
| **Session / history / DB** | Session lifecycle, history, schema, ownership at the data layer | `test_session_*`, `test_sqlite_foreign_keys.py` | `tests/services/` or `tests/unit/` |
| **Security / owner-scope / regression** | Owner isolation, auth, SSRF, path confinement, XSS, prompt injection, pinned regressions | `*_owner_scope.py`, `test_security_regressions.py`, `test_*ssrf*`, `test_*confinement*` | `tests/security/` |
| **Cookbook / bootstrap** | Model serve lifecycle, dependency completion | `test_cookbook_*` | `tests/services/` |
| **Scheduler / background** | Cron computation, background jobs, delivery | `test_compute_next_run_*`, `test_bg_*`, `test_task_scheduler_*` | `tests/services/` |
| **Import / module isolation** | The isolation helpers themselves and their guarantees | `test_helpers_import_state.py` | `tests/unit/` |
A test that genuinely spans categories (e.g. a route test that also pins a
security invariant) is classified by its **primary** assertion target and may be
split if it grows.
## Determinism & isolation rules
Do not mutate shared process state without a controlled helper and guaranteed
cleanup. Specifically:
- **`sys.modules` / parent-package attributes** - never assign at module scope.
Use `tests.helpers.import_state.preserve_import_state`, `clear_module`, or
`monkeypatch.setitem(sys.modules, ...)`. Restoring `sys.modules` alone is not
enough; the parent-package attribute must be restored too (the import-state
helpers handle both).
- **`os.environ`** - use `monkeypatch.setenv` / `monkeypatch.delenv`, never raw
`os.environ[...] = ...` that outlives the test.
- **Current working directory** - never `chdir` without restoring; never assert
against cwd-relative paths like `./data`. Use a temp workspace helper instead.
- **Database** - the root `conftest.py` defaults `DATABASE_URL` to an in-memory
SQLite for collection safety. A test that needs a real file-backed DB must opt
in explicitly via `tests.helpers.sqlite_db.make_temp_sqlite` and bind its
`SessionLocal` onto the module under test. Do not rely on a persistent
on-disk DB existing.
- **Optional dependencies** - do not require packages that may be absent in a
clean environment (e.g. `python-multipart`). Guard or stub them locally.
- **Node-subprocess JS tests** - skip cleanly when `node` is absent
(`shutil.which("node")`), matching the existing wrappers. Treat a skip as a
coverage gap to be aware of, not a pass.
- **Order independence** - a test must not depend on a sibling having imported,
cached, or stubbed something first. Order-sensitivity is a bug to fix, not a
constraint to encode.
## Behavioral-first policy
Prefer tests that exercise real behavior over tests that inspect source code.
- **Avoid** `read_text()` + substring assertions, `ast.parse`, and
`inspect.getsource` checks when the behavior can be driven directly. Source-text
assertions break on benign refactors (renames, reformatting) and can pass even
when behavior regresses, because the asserted string still appears somewhere.
- **Prefer** calling the function/route and asserting the outcome. Example: to
pin owner-scoping of `get_upcoming_events`, seed a temp DB with two owners and
assert one owner cannot see the other's events - rather than asserting the
source contains `q.filter(CalendarCal.owner == owner)`.
- **Narrow exception** - a source-text/AST assertion is acceptable only when the
invariant cannot be practically exercised at runtime (e.g. pinning that a
required constant or guard literally exists in a module that is hard to drive).
When used, say *why* in the test docstring so it is a deliberate choice, not a
shortcut.
- Do not convert source-text assertions to behavioral ones in the *same* PR that
moves files or changes unrelated setup.
## Helper & factory extraction rules
- Extract a shared helper only when the duplicated shape is **proven** - the same
setup repeated (ideally byte-identical) across multiple files.
- Prefer **plain functions** in `tests/helpers/` over fixtures. Reach for a
fixture only when it is clearly scoped to one directory/category, and put it in
that directory's `conftest.py`, not the root.
- Keep the **root `conftest.py` minimal** - `sys.path`, the DB-URL default, and
not-installed heavy-dependency stubs only. It is not a place for
feature-specific fixtures.
- Each helper documents its **intended use and its limits** ("do not stretch this
to cover X"), as the existing helpers in `README.md` do.
- Do not build a generic abstraction layer (factory framework, broad base
fixtures) before the repeated semantics are clear. Small and boring beats
clever and general.
- Candidate factories, to add only after the duplication audit confirms the
shapes: fake users, fake sessions, fake requests, fake DB rows, fake LLM
responses, fake tool calls.
## PR discipline for #2523 refactor slices
- Keep each PR small, reviewable, and behavior-preserving - unless the PR's stated
purpose is to add new coverage.
- **One kind of change per PR.** Do not mix:
- file moves with assertion changes;
- helper extraction with logic changes;
- import-state cleanup with DB-fixture changes.
- Do not weaken assertions, add `skip`/`xfail`, or delete coverage just to make CI
green. A red test is a signal to investigate, not to silence.
- Prefer 3-6 files per refactor batch, and only when they share the *same*
pattern.
- Distinguish a stale test expectation from a real production-policy change before
"fixing" a failing test - never edit a test to match a regression.
## Validation expectations
Run locally before opening or approving a refactor PR:
- `git diff --check` - whitespace and conflict-marker errors.
- `python3 -m py_compile <changed .py files>` - changed files compile.
- Focused `pytest` on the changed files (use `.venv/bin/python -m pytest`).
- `pytest` on neighboring / order-sensitive groups that share import state with
the changed files.
- When replacing boilerplate, `grep` for the old pattern to confirm no stragglers.
- When changing a helper itself, validate in a fresh worktree so stale
`__pycache__` or import state cannot mask a regression.
- For order-sensitivity, a randomized run (once `pytest-randomly` is available in
the dev environment) is the strongest check; record the seed on failures.
## Target directory structure (phased)
Move toward this layout *gradually*, only after helper conventions and
determinism are stable. Low-risk categories move first; oversized catch-all files
are split last.
```
tests/
conftest.py # stays minimal
README.md # helper reference
TESTING_STANDARD.md
helpers/ # plain helper functions (exists)
unit/ # pure helper/module tests
cli/ # scripts/ + CLI tests
js/ # node-subprocess + streaming tests
security/ # owner-scope, auth, SSRF, confinement, regressions
routes/ # TestClient integration (per-dir conftest for the client)
services/ # service-layer tests
integration/ # only if a cross-cutting flow needs it, later
```
Suggested move order: **js / cli first → security / routes / services → split
oversized catch-all files last.** Each move is mechanical (no assertion changes
in the same PR), with an identical pass set before and after.
## Related: CI-hardening track (tracked separately)
Making the suite an enforced gate is broader than #2523's organization scope and
should be tracked as its own effort. The intended sequence:
1. Add non-blocking randomized pytest reporting (`pytest-randomly`) so hidden
order-dependence becomes visible without changing any test.
2. Fix surfaced order-dependence in small same-pattern batches.
3. Add coverage reporting with no threshold gate.
4. Only then make the pytest job a blocking CI gate.
5. Consider `pytest-xdist` / parallel isolation after deterministic
single-process randomized runs are stable.