odysseus/tests/test_web_fetch_plaintext.py
Kenny Van de Maele 074a1e6eff fix(search): add download budgets to web_fetch with truncation notice and hard ceiling (#3955)
* fix(search): add download budgets to web_fetch with truncation notice and hard ceiling

MAX_OUTPUT_CHARS only trims what the agent sees; fetch_webpage_content
buffered and cached the entire response body first, so a large or hostile
URL could pull arbitrarily many bytes into memory and the content cache.

The fetch is now a capped streaming GET (SSRF redirect guard unchanged):
a soft default budget (WEB_FETCH_SOFT_MAX_BYTES, 2 MB), a per-call
override via full/max_bytes on the web_fetch tool, and a hard ceiling
(WEB_FETCH_HARD_MAX_BYTES, 20 MB) that the override can never exceed.
When Content-Length already declares a body over the ceiling the fetch
is refused before any body bytes are buffered. Truncated results carry
truncated/fetched_bytes/total_bytes, the tool output leads with a
partial-content notice telling the model how to re-fetch with full=true,
and the tool schema documents the flag. A truncated PDF is reported as
a budget error since a cut PDF is unparseable. The effective cap is part
of the content-cache key so a truncated fetch is never served to a
full-budget request.
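
The budget rules above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the constant names and sizes come from the commit message, while `effective_cap`, `read_capped`, and `cache_key` are hypothetical helpers, and treating `full=true` as "use the hard ceiling" is an assumption.

```python
import hashlib
from typing import Iterable, Optional, Tuple

# Sizes as stated in the commit message (2 MB soft default, 20 MB hard ceiling).
WEB_FETCH_SOFT_MAX_BYTES = 2 * 1024 * 1024
WEB_FETCH_HARD_MAX_BYTES = 20 * 1024 * 1024


def effective_cap(full: bool = False, max_bytes: Optional[int] = None) -> int:
    """Hypothetical helper: resolve the per-call budget, never above the ceiling."""
    if max_bytes is not None:
        requested = max_bytes
    elif full:
        # Assumption: full=true asks for as much as the hard ceiling allows.
        requested = WEB_FETCH_HARD_MAX_BYTES
    else:
        requested = WEB_FETCH_SOFT_MAX_BYTES
    return max(1, min(requested, WEB_FETCH_HARD_MAX_BYTES))


def read_capped(chunks: Iterable[bytes], cap: int) -> Tuple[bytes, bool]:
    """Accumulate streamed chunks up to `cap` bytes; report whether the body was cut."""
    buf = bytearray()
    for chunk in chunks:
        remaining = cap - len(buf)
        if len(chunk) > remaining:
            # Keep only what fits and stop reading; the rest is never buffered.
            buf.extend(chunk[:remaining])
            return bytes(buf), True
        buf.extend(chunk)
    return bytes(buf), False


def cache_key(url: str, cap: int) -> str:
    """Hypothetical cache key: the cap is included so budgets never share entries."""
    return hashlib.sha256(f"{url}|{cap}".encode("utf-8")).hexdigest()
```

Because the cap is part of the key, a result cut at the soft budget and a later `full=true` fetch of the same URL land in different cache entries.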

Existing tests that faked httpx.get or the old _get_public_url signature
are adapted to the streaming interface; behavior pins are unchanged.

Fixes #3812

* fix(search): close compressed-body cap bypass and protect the partial notice

Addresses RaresKeY's review on #3955:

- Force Accept-Encoding: identity for the capped fetch. With gzip/deflate the
  wire bytes (and Content-Length) can be a fraction of the decoded body, so a
  tiny compressed response could pass the hard-cap preflight and then expand
  past the ceiling in a single decoded chunk before the streamed cap could
  slice it. Identity makes Content-Length the true body size and keeps each
  streamed chunk bounded by the network read, so the hard ceiling actually
  bounds memory.
- Lead web_fetch output with the partial-content notice and cap the page
  title. The notice is the user-facing contract for partial fetches, but the
  title is untrusted, uncapped page content; placed ahead of the notice a giant
  title could push it past MAX_OUTPUT_CHARS and drop it. The notice now leads
  and the title is capped as a second guard.
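
The ordering argument in the second bullet can be illustrated with a small sketch. `render_output`, `MAX_TITLE_CHARS`, the notice wording, and the small `MAX_OUTPUT_CHARS` value are all assumptions made for the example; only the idea (notice first, title capped, then the overall trim) comes from the commit message.

```python
from typing import Any, Dict

MAX_OUTPUT_CHARS = 500  # illustrative value, not the project's real limit
MAX_TITLE_CHARS = 80    # hypothetical title cap


def render_output(result: Dict[str, Any]) -> str:
    """Hypothetical formatter: the partial notice leads so trimming cannot drop it."""
    lines = []
    if result.get("truncated"):
        total = result.get("total_bytes")
        total_text = str(total) if total is not None else "an unknown number of"
        lines.append(
            f"[Partial content: fetched {result.get('fetched_bytes', 0)} of "
            f"{total_text} bytes. Re-fetch with full=true for the rest.]"
        )
    # The title is untrusted page content, so it is capped before use.
    title = str(result.get("title", ""))[:MAX_TITLE_CHARS]
    lines.append(f"Title: {title}")
    lines.append(str(result.get("content", "")))
    # The overall trim happens last; the notice is already at the front.
    return "\n".join(lines)[:MAX_OUTPUT_CHARS]
```

With this ordering, even a title far longer than `MAX_OUTPUT_CHARS` cannot displace the notice, because the trim only ever removes text from the end.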

Adds regressions: the fetch advertises identity encoding, and a truncated
result with an oversized title still surfaces the partial notice.

* fix(search): reject compressed responses that ignore the identity request

Requesting Accept-Encoding: identity is not enough on its own: a server can
ignore it and still return Content-Encoding: gzip, and httpx.iter_bytes would
decode that, so a tiny compressed body could balloon into one decoded chunk
far past the hard cap before the streamed loop slices it (and Content-Length,
the compressed wire length, makes the preflight and size metadata unreliable).

Refuse a non-identity Content-Encoding before reading the body. Adds a
regression where the server ignores the identity request and returns gzip;
the fetch is refused before any body is decoded.
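
A header preflight along these lines could look like the sketch below. It works on a plain header mapping rather than a real httpx response, and `preflight_error`, `REQUEST_HEADERS`, and the error strings are hypothetical names chosen for illustration.

```python
from typing import Mapping, Optional

HARD_MAX_BYTES = 20 * 1024 * 1024  # hard ceiling from the commit message

# Sent with the request so Content-Length describes the real body size.
REQUEST_HEADERS = {"Accept-Encoding": "identity"}


def preflight_error(
    headers: Mapping[str, str], hard_cap: int = HARD_MAX_BYTES
) -> Optional[str]:
    """Hypothetical check run on response headers before any body byte is read."""
    lowered = {key.lower(): value for key, value in headers.items()}

    # A server may ignore the identity request; refuse instead of decoding.
    encoding = lowered.get("content-encoding", "").strip().lower()
    if encoding not in ("", "identity"):
        return f"refused: unexpected Content-Encoding {encoding!r}"

    # With identity encoding, Content-Length is the true body size.
    declared = lowered.get("content-length", "").strip()
    if declared.isdigit() and int(declared) > hard_cap:
        return f"refused: declared size {declared} exceeds hard cap {hard_cap}"

    return None
```

A response that passes this check still goes through the streamed cap, since Content-Length may be absent or wrong.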
2026-06-15 17:38:09 +00:00

111 lines
4.1 KiB
Python

"""fetch_webpage_content must return plain-text and Markdown bodies verbatim.
raw.githubusercontent.com serves Markdown as `text/plain`, and a lot of code
and tool documentation lives in `.md` / `.txt`. Those have no HTML structure,
so the HTML branch extracted nothing and web_fetch reported "no readable text
content". The plain-text branch returns the body as-is. HTML stays on the
parsing path.
"""
import types

import pytest

from services.search import content as content_mod


class _FakeResponse:
    def __init__(self, text, content_type, status_code=200):
        self.text = text
        self.content = text.encode("utf-8")
        self.headers = {"Content-Type": content_type}
        self.status_code = status_code

    def raise_for_status(self):
        return None


@pytest.fixture
def no_cache(monkeypatch, tmp_path):
    # Force a cache miss and skip disk writes so the test is hermetic.
    monkeypatch.setattr(content_mod, "CONTENT_CACHE_DIR", tmp_path)
    monkeypatch.setattr(content_mod, "_cache_result", lambda *a, **k: None)


def _patch_fetch(monkeypatch, text, content_type):
    monkeypatch.setattr(
        content_mod,
        "_get_public_url",
        lambda url, headers=None, timeout=5, **kwargs: _FakeResponse(text, content_type),
    )


MARKDOWN = "# Title\n\nSome **docs** with a [link](https://example.com).\n"


def test_markdown_text_plain_returns_body(monkeypatch, no_cache):
    _patch_fetch(monkeypatch, MARKDOWN, "text/plain; charset=utf-8")
    r = content_mod.fetch_webpage_content(
        "https://raw.githubusercontent.com/o/r/master/Documentation/Patterns.md"
    )
    assert r["success"] is True
    assert r["content"] == MARKDOWN.strip()
    assert r["title"] == "patterns.md"
    assert r["error"] == ""


def test_text_markdown_content_type_returns_body(monkeypatch, no_cache):
    _patch_fetch(monkeypatch, MARKDOWN, "text/markdown")
    r = content_mod.fetch_webpage_content("https://example.com/readme")
    assert r["success"] is True
    assert r["content"] == MARKDOWN.strip()


def test_octet_stream_with_txt_suffix_returns_body(monkeypatch, no_cache):
    # Some servers mislabel text files; the URL-suffix fallback still reads it.
    _patch_fetch(monkeypatch, "plain notes\nline two\n", "application/octet-stream")
    r = content_mod.fetch_webpage_content("https://example.com/notes.txt")
    assert r["success"] is True
    assert r["content"] == "plain notes\nline two"


def test_application_json_returns_body(monkeypatch, no_cache):
    # application/json is not text/*; it must still be returned verbatim
    # instead of being fed to the HTML parser (which yields empty content).
    body = '{"name": "odysseus", "items": [1, 2, 3]}'
    _patch_fetch(monkeypatch, body, "application/json")
    r = content_mod.fetch_webpage_content("https://api.example.com/data")
    assert r["success"] is True
    assert r["content"] == body


def test_ld_json_suffix_content_type_returns_body(monkeypatch, no_cache):
    body = '{"@context": "https://schema.org"}'
    _patch_fetch(monkeypatch, body, "application/ld+json")
    r = content_mod.fetch_webpage_content("https://example.com/meta")
    assert r["success"] is True
    assert r["content"] == body


def test_json_suffix_with_octet_stream_returns_body(monkeypatch, no_cache):
    body = '{"raw": true}'
    _patch_fetch(monkeypatch, body, "application/octet-stream")
    r = content_mod.fetch_webpage_content("https://example.com/package.json")
    assert r["success"] is True
    assert r["content"] == body


def test_empty_text_body_is_not_success(monkeypatch, no_cache):
    _patch_fetch(monkeypatch, " \n ", "text/plain")
    r = content_mod.fetch_webpage_content("https://example.com/blank.txt")
    assert r["success"] is False
    assert r["content"] == ""


def test_html_still_uses_parser(monkeypatch, no_cache):
    # An HTML body must not be short-circuited by the text branch.
    html = "<html><head><title>Hi</title></head><body><p>Hello world body text</p></body></html>"
    _patch_fetch(monkeypatch, html, "text/html; charset=utf-8")
    r = content_mod.fetch_webpage_content("https://example.com/page")
    assert r["title"] == "Hi"
    assert "Hello world body text" in r["content"]
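
The routing these tests pin down (plain text, Markdown, and JSON returned verbatim; HTML sent to the parser) might be decided by a predicate like the one below. This is only a sketch of one possible rule: `is_verbatim_body` and `TEXT_SUFFIXES` are hypothetical and are not taken from `services.search.content`.

```python
from urllib.parse import urlparse

# Assumed suffix list for the URL fallback; the real module may use a different set.
TEXT_SUFFIXES = (".md", ".txt", ".json")


def is_verbatim_body(content_type: str, url: str) -> bool:
    """Hypothetical predicate: True if the body should be returned as-is."""
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in ("text/html", "application/xhtml+xml"):
        return False  # HTML stays on the parsing path
    if mime.startswith("text/"):
        return True  # text/plain, text/markdown, ...
    if mime == "application/json" or mime.endswith("+json"):
        return True  # JSON and structured-syntax suffixes such as ld+json
    # Mislabelled files (e.g. application/octet-stream): fall back to the URL suffix.
    return urlparse(url).path.lower().endswith(TEXT_SUFFIXES)
```

Checking the media type before the URL suffix keeps an HTML page served from a `.txt`-looking URL on the parser path, which is a design choice of this sketch rather than a documented behavior.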