fix(search): add download budgets to web_fetch with truncation notice and hard ceiling (#3955)

* fix(search): add download budgets to web_fetch with truncation notice and hard ceiling

MAX_OUTPUT_CHARS only trims what the agent sees; fetch_webpage_content
buffered and cached the entire response body first, so a large or hostile
URL could pull arbitrarily many bytes into memory and the content cache.

The fetch is now a capped streaming GET (SSRF redirect guard unchanged):
a soft default budget (WEB_FETCH_SOFT_MAX_BYTES, 2 MB), a per-call
override via full/max_bytes on the web_fetch tool, and a hard ceiling
(WEB_FETCH_HARD_MAX_BYTES, 20 MB) that the override can never exceed.
When Content-Length already declares a body over the ceiling the fetch
is refused before any body bytes are buffered. Truncated results carry
truncated/fetched_bytes/total_bytes, the tool output leads with a
partial-content notice telling the model how to re-fetch with full=true,
and the tool schema documents the flag. A truncated PDF is reported as
a budget error since a cut PDF is unparseable. The effective cap is part
of the content-cache key so a truncated fetch is never served to a
full-budget request.

Existing tests that faked httpx.get or the old _get_public_url signature
are adapted to the streaming interface; behavior pins are unchanged.

Fixes #3812

* fix(search): close compressed-body cap bypass and protect the partial notice

Addresses RaresKeY's review on #3955:

- Force Accept-Encoding: identity for the capped fetch. With gzip/deflate the
  wire bytes (and Content-Length) can be a fraction of the decoded body, so a
  tiny compressed response could pass the hard-cap preflight and then expand
  past the ceiling in a single decoded chunk before the streamed cap could
  slice it. Identity makes Content-Length the true body size and keeps each
  streamed chunk bounded by the network read, so the hard ceiling actually
  bounds memory.
- Lead web_fetch output with the partial-content notice and cap the page
  title. The notice is the user-facing contract for partial fetches, but the
  title is untrusted, uncapped page content; placed ahead of the notice a giant
  title could push it past MAX_OUTPUT_CHARS and drop it. The notice now leads
  and the title is capped as a second guard.

Adds regressions: the fetch advertises identity encoding, and a truncated
result with an oversized title still surfaces the partial notice.

* fix(search): reject compressed responses that ignore the identity request

Requesting Accept-Encoding: identity is not enough on its own: a server can
ignore it and still return Content-Encoding: gzip, and httpx.iter_bytes would
decode that, so a tiny compressed body could balloon into one decoded chunk
far past the hard cap before the streamed loop slices it (and Content-Length,
the compressed wire length, makes the preflight and size metadata unreliable).

Refuse a non-identity Content-Encoding before reading the body. Adds a
regression where the server ignores the identity request and returns gzip;
the fetch is refused before any body is decoded.
This commit is contained in:
Kenny Van de Maele
2026-06-15 19:38:09 +02:00
committed by GitHub
parent 2fab378c6a
commit 074a1e6eff
8 changed files with 422 additions and 22 deletions
+32 -2
View File
@@ -57,13 +57,23 @@ class WebSearchTool:
class WebFetchTool:
async def execute(self, content: str, ctx: dict) -> dict:
from src.search.content import fetch_webpage_content
from src.constants import WEB_FETCH_HARD_MAX_BYTES
raw = content.strip()
url = ""
max_bytes = None
if raw.startswith("{"):
try:
parsed = json.loads(raw)
if isinstance(parsed, dict):
url = str(parsed.get("url") or "").strip()
# Download-budget override (#3812): "full": true raises the
# budget to the hard cap; an explicit max_bytes is clamped
# to the hard cap downstream. Default stays the soft cap.
if parsed.get("full") is True:
max_bytes = WEB_FETCH_HARD_MAX_BYTES
mb = parsed.get("max_bytes")
if isinstance(mb, int) and mb > 0:
max_bytes = mb
except json.JSONDecodeError:
url = ""
if not url:
@@ -78,7 +88,7 @@ class WebFetchTool:
loop = asyncio.get_running_loop()
try:
result = await asyncio.wait_for(
loop.run_in_executor(None, lambda: fetch_webpage_content(url, timeout=10)),
loop.run_in_executor(None, lambda: fetch_webpage_content(url, timeout=10, max_bytes=max_bytes)),
timeout=30,
)
except asyncio.TimeoutError:
@@ -94,8 +104,28 @@ class WebFetchTool:
return {"error": f"web_fetch: {url}: {err}", "exit_code": 1}
return {"error": f"web_fetch: {url}: no readable text content (not HTML, or the page needs JS/login)", "exit_code": 1}
# Tell the model when the download budget cut the body short and how
# to get the rest, instead of silently presenting a partial page as
# the whole thing.
size_note = ""
if result.get("truncated"):
fetched = result.get("fetched_bytes") or 0
total = result.get("total_bytes")
total_txt = f" of {total:,} bytes" if total else ""
size_note = (
f"[partial content: download stopped at {fetched:,} bytes{total_txt}. "
f'Re-call with {{"url": "{url}", "full": true}} to fetch up to '
f"{WEB_FETCH_HARD_MAX_BYTES:,} bytes.]\n\n"
)
# The notice must lead the output so the MAX_OUTPUT_CHARS trim below can
# never drop it. The title is untrusted, uncapped page content, so a
# giant title ahead of the notice could push it out of range; keep the
# notice first and cap the title as a second guard.
if len(title) > 300:
title = title[:300] + "..."
header = (f"# {title}\n" if title else "") + f"Source: {url}\n\n"
output = header + text
output = size_note + header + text
if len(output) > MAX_OUTPUT_CHARS:
output = output[:MAX_OUTPUT_CHARS] + "\n\n[...truncated]"
return {"output": output, "exit_code": 0}