fix(search): add download budgets to web_fetch with truncation notice and hard ceiling (#3955)

mirror of https://github.com/pewdiepie-archdaemon/odysseus.git synced 2026-06-24 05:35:31 -04:00

* fix(search): add download budgets to web_fetch with truncation notice and hard ceiling

MAX_OUTPUT_CHARS only trims what the agent sees; fetch_webpage_content
buffered and cached the entire response body first, so a large or hostile
URL could pull arbitrarily many bytes into memory and the content cache.

The fetch is now a capped streaming GET (SSRF redirect guard unchanged):
a soft default budget (WEB_FETCH_SOFT_MAX_BYTES, 2 MB), a per-call
override via full/max_bytes on the web_fetch tool, and a hard ceiling
(WEB_FETCH_HARD_MAX_BYTES, 20 MB) that the override can never exceed.
When Content-Length already declares a body over the ceiling, the fetch
is refused before any body bytes are buffered. Truncated results carry
truncated/fetched_bytes/total_bytes, the tool output leads with a
partial-content notice telling the model how to re-fetch with full=true,
and the tool schema documents the flag. A truncated PDF is reported as
a budget error since a cut PDF is unparseable. The effective cap is part
of the content-cache key so a truncated fetch is never served to a
full-budget request.
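
A minimal sketch of the budget policy described above, not the code from this commit. The constants mirror src/constants.py; the helper names (effective_cap, read_capped, cache_key) are hypothetical, and treating full=true as "raise the budget to the hard ceiling" is an assumption about the flag's semantics.

```python
from typing import Iterable, Optional, Tuple

WEB_FETCH_SOFT_MAX_BYTES = 2_000_000   # default download budget (2 MB)
WEB_FETCH_HARD_MAX_BYTES = 20_000_000  # absolute ceiling (20 MB)


def effective_cap(full: bool = False, max_bytes: Optional[int] = None) -> int:
    """Resolve the per-call budget; the override can never exceed the hard cap."""
    if max_bytes is not None and max_bytes > 0:
        requested = max_bytes
    elif full:
        # Assumption: full=true means "use the largest allowed budget".
        requested = WEB_FETCH_HARD_MAX_BYTES
    else:
        requested = WEB_FETCH_SOFT_MAX_BYTES
    return min(requested, WEB_FETCH_HARD_MAX_BYTES)


def read_capped(chunks: Iterable[bytes], cap: int) -> Tuple[bytes, bool]:
    """Accumulate streamed chunks up to cap bytes; return (body, truncated)."""
    buf = bytearray()
    for chunk in chunks:
        remaining = cap - len(buf)
        if len(chunk) > remaining:
            # Keep only what still fits, then stop reading the stream.
            buf += chunk[:remaining]
            return bytes(buf), True
        buf += chunk
    return bytes(buf), False


def cache_key(url: str, cap: int) -> Tuple[str, int]:
    """Include the cap so a truncated entry never answers a larger-budget request."""
    return (url, cap)
```

For example, read_capped([b"abc", b"def"], 4) yields (b"abcd", True), and the same URL fetched under the soft and hard budgets maps to two distinct cache keys.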

Existing tests that faked httpx.get or the old _get_public_url signature
are adapted to the streaming interface; behavior pins are unchanged.

Fixes #3812

* fix(search): close compressed-body cap bypass and protect the partial notice

Addresses RaresKeY's review on #3955:

- Force Accept-Encoding: identity for the capped fetch. With gzip/deflate the
  wire bytes (and Content-Length) can be a fraction of the decoded body, so a
  tiny compressed response could pass the hard-cap preflight and then expand
  past the ceiling in a single decoded chunk before the streamed cap could
  slice it. Identity makes Content-Length the true body size and keeps each
  streamed chunk bounded by the network read, so the hard ceiling actually
  bounds memory.
- Lead web_fetch output with the partial-content notice and cap the page
  title. The notice is the user-facing contract for partial fetches, but the
  title is untrusted, uncapped page content; placed ahead of the notice, a
  giant title could push the notice past MAX_OUTPUT_CHARS and drop it. The
  notice now leads and the title is capped as a second guard.

Adds regressions: the fetch advertises identity encoding, and a truncated
result with an oversized title still surfaces the partial notice.
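
The ordering fix can be illustrated with a small sketch, assuming a hypothetical format_output helper, a hypothetical MAX_TITLE_CHARS limit, and an invented notice wording; only MAX_OUTPUT_CHARS and the full=true hint come from the commit itself.

```python
from typing import Optional

MAX_OUTPUT_CHARS = 10_000  # from src/constants.py
MAX_TITLE_CHARS = 200      # hypothetical title cap, for illustration only


def format_output(title: str, text: str, truncated: bool,
                  fetched_bytes: int, total_bytes: Optional[int] = None) -> str:
    """Build tool output with the partial-content notice first."""
    parts = []
    if truncated:
        total = f" of {total_bytes}" if total_bytes is not None else ""
        # The notice leads, so the final trim below can never drop it.
        parts.append(
            f"[partial content: fetched {fetched_bytes}{total} bytes; "
            "re-fetch with full=true for more]"
        )
    # The title is untrusted page content, so cap it as a second guard.
    parts.append(f"Title: {title[:MAX_TITLE_CHARS]}")
    parts.append(text)
    return "\n".join(parts)[:MAX_OUTPUT_CHARS]
```

With a 50,000-character title and truncated=True, the output still starts with the notice, because the trim to MAX_OUTPUT_CHARS only ever cuts from the end.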

* fix(search): reject compressed responses that ignore the identity request

Requesting Accept-Encoding: identity is not enough on its own: a server can
ignore it and still return Content-Encoding: gzip, and httpx's
Response.iter_bytes() would decode that, so a tiny compressed body could
balloon into one decoded chunk far past the hard cap before the streamed loop
slices it. Content-Length is then the compressed wire length, which also makes
the preflight and size metadata unreliable.

Refuse a non-identity Content-Encoding before reading the body. Adds a
regression where the server ignores the identity request and returns gzip;
the fetch is refused before any body is decoded.
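
A sketch of the header preflight these two follow-up commits describe, under stated assumptions: FetchRefused and preflight are hypothetical names, and the response headers are modeled as a plain mapping rather than an httpx object so the example stays self-contained.

```python
from typing import Mapping, Optional

# Sent with the request so Content-Length reflects the true body size.
REQUEST_HEADERS = {"Accept-Encoding": "identity"}


class FetchRefused(Exception):
    """Raised before any body bytes are read (hypothetical error type)."""


def preflight(headers: Mapping[str, str], hard_cap: int) -> Optional[int]:
    """Check response headers; return the declared size, or None if unknown."""
    lowered = {k.lower(): v for k, v in headers.items()}
    encoding = lowered.get("content-encoding", "identity").strip().lower()
    if encoding not in ("", "identity"):
        # The server ignored the identity request; decoded size is unbounded.
        raise FetchRefused(f"unexpected Content-Encoding: {encoding}")
    raw = lowered.get("content-length", "").strip()
    if raw.isdigit():
        declared = int(raw)
        if declared > hard_cap:
            raise FetchRefused(f"declared body of {declared} bytes exceeds cap")
        return declared
    return None
```

A response that declares Content-Encoding: gzip, or a Content-Length above the ceiling, raises FetchRefused here, before the streaming loop starts; a missing Content-Length passes through and is bounded by the streamed cap instead.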


Kenny Van de Maele, 2026-06-15 19:38:09 +02:00, committed by GitHub

parent 2fab378c6a

commit 074a1e6eff

8 changed files with 422 additions and 22 deletions

									
src/constants.py (+8)
@@ -65,6 +65,14 @@ MAX_OUTPUT_CHARS = 10_000       # cap for bash/python/web_search/web_fetch outpu
MAX_READ_CHARS = 20_000         # cap for read_file / document preview
MAX_DIFF_LINES = 400            # cap for edit_file unified-diff display
# web_fetch response-size policy (#3812). MAX_OUTPUT_CHARS above only trims
# what the agent SEES; these caps bound what the server downloads, parses,
# and writes to the content cache. The soft cap is the default download
# budget; the agent can raise it per call (full/max_bytes) but never past
# the hard cap, so a model can't decide to pull a multi-GB file.
WEB_FETCH_SOFT_MAX_BYTES = 2_000_000    # default download budget (2 MB)
WEB_FETCH_HARD_MAX_BYTES = 20_000_000   # absolute ceiling, even with override (20 MB)
# API Configuration
MAX_CONTEXT_MESSAGES = 90
REQUEST_TIMEOUT = 20