* fix(search): add download budgets to web_fetch with truncation notice and hard ceiling
MAX_OUTPUT_CHARS only trims what the agent sees; fetch_webpage_content
buffered and cached the entire response body first, so a large or hostile
URL could pull arbitrarily many bytes into memory and the content cache.
The fetch is now a capped streaming GET (SSRF redirect guard unchanged):
a soft default budget (WEB_FETCH_SOFT_MAX_BYTES, 2 MB), a per-call
override via full/max_bytes on the web_fetch tool, and a hard ceiling
(WEB_FETCH_HARD_MAX_BYTES, 20 MB) that the override can never exceed.
When Content-Length already declares a body over the ceiling, the fetch
is refused before any body bytes are buffered. Truncated results carry
truncated/fetched_bytes/total_bytes, the tool output leads with a
partial-content notice telling the model how to re-fetch with full=true,
and the tool schema documents the flag. A truncated PDF is reported as
a budget error since a cut PDF is unparseable. The effective cap is part
of the content-cache key so a truncated fetch is never served to a
full-budget request.
Existing tests that faked httpx.get or the old _get_public_url signature
are adapted to the streaming interface; behavior pins are unchanged.
Fixes #3812
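For illustration, the budget logic described above could be sketched roughly as follows. This is not the project's code: `effective_cap`, `read_capped`, `cache_key`, and `BudgetError` are hypothetical names, the constants only mirror the documented 2 MB / 20 MB defaults, and the sketch assumes that full=true raises the budget to the hard ceiling.

```python
from typing import Iterable, Optional, Tuple

SOFT_MAX_BYTES = 2 * 1024 * 1024    # mirrors WEB_FETCH_SOFT_MAX_BYTES (2 MB)
HARD_MAX_BYTES = 20 * 1024 * 1024   # mirrors WEB_FETCH_HARD_MAX_BYTES (20 MB)


class BudgetError(Exception):
    """Raised when a response is refused outright (hypothetical name)."""


def effective_cap(full: bool = False, max_bytes: Optional[int] = None) -> int:
    """Resolve the per-call budget; the hard ceiling always wins."""
    if max_bytes is not None:
        return max(1, min(max_bytes, HARD_MAX_BYTES))
    # Assumption: full=true means "use the hard ceiling as the budget".
    return HARD_MAX_BYTES if full else SOFT_MAX_BYTES


def read_capped(chunks: Iterable[bytes], cap: int,
                content_length: Optional[int] = None) -> Tuple[bytes, bool]:
    """Return (body, truncated) without ever buffering more than `cap` bytes."""
    # Preflight: refuse before reading any body if the declared size is too big.
    if content_length is not None and content_length > HARD_MAX_BYTES:
        raise BudgetError(f"declared body of {content_length} bytes exceeds the hard ceiling")
    buf = bytearray()
    for chunk in chunks:
        room = cap - len(buf)
        if len(chunk) > room:
            buf += chunk[:room]      # keep only what still fits, then stop reading
            return bytes(buf), True
        buf += chunk
    return bytes(buf), False


def cache_key(url: str, cap: int) -> str:
    """Include the cap so a truncated entry is never served to a larger budget."""
    return f"{url}|cap={cap}"
```

In a real fetch the `chunks` iterable would come from the streamed HTTP response; here it is any iterable of byte strings, which keeps the sketch testable without a network.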
* fix(search): close compressed-body cap bypass and protect the partial notice
Addresses RaresKeY's review on #3955:
- Force Accept-Encoding: identity for the capped fetch. With gzip/deflate the
wire bytes (and Content-Length) can be a fraction of the decoded body, so a
tiny compressed response could pass the hard-cap preflight and then expand
past the ceiling in a single decoded chunk before the streamed cap could
slice it. Identity makes Content-Length the true body size and keeps each
streamed chunk bounded by the network read, so the hard ceiling actually
bounds memory.
- Lead web_fetch output with the partial-content notice and cap the page
title. The notice is the user-facing contract for partial fetches, but the
title is untrusted, uncapped page content; placed ahead of the notice a giant
title could push it past MAX_OUTPUT_CHARS and drop it. The notice now leads
and the title is capped as a second guard.
Adds regressions: the fetch advertises identity encoding, and a truncated
result with an oversized title still surfaces the partial notice.
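A minimal sketch of the ordering fix, under assumed values: `render_fetch_output`, `MAX_TITLE_CHARS`, the notice wording, and the small `MAX_OUTPUT_CHARS` used here are all illustrative and not taken from the codebase.

```python
from typing import Optional

MAX_OUTPUT_CHARS = 200   # illustrative only; the real limit is not stated here
MAX_TITLE_CHARS = 50     # hypothetical title cap


def render_fetch_output(title: str, body: str, truncated: bool,
                        fetched_bytes: int, total_bytes: Optional[int] = None) -> str:
    """Build tool output with the partial-content notice placed first."""
    parts = []
    if truncated:
        total = total_bytes if total_bytes is not None else "unknown"
        # The notice leads, so the final trim below can never cut it off.
        parts.append(
            f"[partial content: {fetched_bytes} of {total} bytes; "
            "re-fetch with full=true]"
        )
    # Second guard: the title is untrusted page content, so cap it as well.
    parts.append(title[:MAX_TITLE_CHARS])
    parts.append(body)
    return "\n".join(parts)[:MAX_OUTPUT_CHARS]
```

With this ordering, even a title of several thousand characters cannot displace the notice, because the trim removes text from the end of the output.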
* fix(search): reject compressed responses that ignore the identity request
Requesting Accept-Encoding: identity is not enough on its own: a server can
ignore it and still return Content-Encoding: gzip, and httpx.iter_bytes would
decode that, so a tiny compressed body could balloon into one decoded chunk
far past the hard cap before the streamed loop slices it (and Content-Length,
the compressed wire length, makes the preflight and size metadata unreliable).
Refuse a non-identity Content-Encoding before reading the body. Adds a
regression where the server ignores the identity request and returns gzip;
the fetch is refused before any body is decoded.
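The refusal step can be sketched as a header check that runs before any body is read. The names `check_identity_encoding` and `EncodingRefused` are hypothetical, and the sketch takes a plain mapping of response headers rather than a real HTTP client response.

```python
from typing import Mapping

# Sent with the request so a well-behaved server replies uncompressed.
REQUEST_HEADERS = {"Accept-Encoding": "identity"}


class EncodingRefused(Exception):
    """Raised when the server answers with a compressed body (hypothetical name)."""


def check_identity_encoding(response_headers: Mapping[str, str]) -> None:
    """Refuse any response whose Content-Encoding is not identity."""
    encoding = ""
    for name, value in response_headers.items():
        if name.lower() == "content-encoding":   # header names are case-insensitive
            encoding = value.strip().lower()
    if encoding not in ("", "identity"):
        raise EncodingRefused(f"unexpected Content-Encoding: {encoding}")
```

Calling this before iterating over the body means a server that ignores the identity request is rejected without a single decoded chunk being buffered.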
* test: align README presentation guards with the #4306 refresh
The 'Refresh README presentation' change (#4306) swapped the ASCII banner
for a centered wordmark image and moved the native quickstart into
docs/setup.md, which left four base tests failing on dev and froze the
merge gate:
- test_security_regressions::test_readme_native_quickstart_uses_loopback
now also accepts the loopback guidance from docs/setup.md, where the
quickstart moved (no behaviour change; the guidance is intact there).
- test_readme_ascii_fenced guards the new wordmark title instead of the
removed ASCII banner, and keeps a defensive check that any reintroduced
box-drawing banner stays inside a code fence (the original #1390 mode).
- The five unreferenced demo gifs under docs/ (chat, compare, document,
notes, research) are removed so test_docs_no_orphan_images passes; they
were de-referenced by the refresh. Recoverable from history if a docs
page wants to embed them again.
* chore: refresh PR checks
---------
Co-authored-by: Alexandre Teixeira <alexandremagteixeira@gmail.com>
* refactor(constants): single source of truth for data dir + merge core/src constants
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* docs(contributing): use named src.constants for data paths, drop core/constants references
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keeps src.request_models real and restores both sys.modules and parent routes.session_routes package attributes after temporary test stubs. Restores one focused part of the Python CI baseline tracked in #2580.
* chore: dedupe src/search/cache.py into a re-export shim
src/search/cache.py was a byte-identical copy of services/search/cache.py.
Convert it to a sys.modules alias of the canonical services module (matching
src/search/core.py, providers.py, ranking.py) so the two cannot drift, and add
an identity assertion to test_search_module_consolidation.py.
content.py and query.py are intentionally left as-is: the copies have drifted
and services lacks fixes that src has, so they need services reconciled first
before they can be shimmed safely.
* chore: dedupe src/search content.py and query.py into shims
Convert src/search/content.py and query.py to sys.modules aliases of the
canonical services/search/* (matching cache.py, core.py, providers.py,
ranking.py) so the duplicate copies cannot drift.
Repoint the two tests that were coupled to the src-copy internals onto the
canonical services surface (behaviour is equivalent):
- test_src_search_query_nonstring.py: import services.search.query instead of
loading the src file by path.
- test_security_regressions.py::test_web_fetch_guard_blocks_redirect_into_private:
mock httpx.get (services uses the module-level get, not httpx.Client) and
assert on the canonical 'Blocked' message.
Drop the now-redundant [src_content, service_content] parametrization in
test_search_content_extraction_parity.py and test_search_content_url_guards.py
(after the shim both params are the same object); add content/query identity
assertions to test_search_module_consolidation.py.
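As a rough illustration of the `sys.modules` alias technique, the sketch below uses throwaway in-memory modules with made-up names (`demo_services_search_cache`, `demo_src_search_cache`) instead of the real packages. In an actual shim file the alias line would typically be `sys.modules[__name__] = <canonical module>`.

```python
import importlib
import sys
import types

# Stand-in for the canonical module (hypothetical name and attribute).
canonical = types.ModuleType("demo_services_search_cache")
canonical.TTL_SECONDS = 300
sys.modules["demo_services_search_cache"] = canonical

# The shim: register the legacy import name as the very same module object.
sys.modules["demo_src_search_cache"] = sys.modules["demo_services_search_cache"]

# Importing the legacy name now returns the canonical object, not a copy,
# so a change made through one name is visible through the other.
legacy = importlib.import_module("demo_src_search_cache")
legacy.TTL_SECONDS = 600
```

An identity assertion of the form `legacy is canonical` is the kind of check that catches a shim regressing back into a duplicate copy.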
The visual research report is assembled from LLM output over crawled web
pages (untrusted content) and served under a relaxed `script-src
'unsafe-inline'` CSP. Two values reached that HTML without sanitization:
- `_md_to_html` rendered the report markdown via python-markdown, which
passes raw HTML through verbatim, so `<script>` / `<img onerror>` /
`<svg onload>` / `javascript:` links carried in crawled content ran in
the app origin.
- `category` (from the /api/research/start request body, no enum check) was
interpolated raw into `<body class="category-{category}">`.
Allowlist-sanitize the rendered markdown with nh3, keeping the formatting
the report emits (tables, code, details/summary, toc anchors, codehilite
classes, external-link target/rel) while dropping active content, and
html.escape the category. Adds regression tests.
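The `category` half of the fix can be illustrated with the standard library alone. `body_open_tag` is a hypothetical helper, and the nh3 allowlist for the rendered markdown is deliberately not shown because its exact configuration is not given here.

```python
import html


def body_open_tag(category: str) -> str:
    """Interpolate an untrusted category into a class attribute safely."""
    # quote=True also escapes " and ', so the value cannot close the attribute.
    return f'<body class="category-{html.escape(category, quote=True)}">'
```

A hostile value such as `x"><script>alert(1)</script>` then stays inside the attribute as inert text instead of opening a new tag.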
* fix(tests): align broken test assertions with current behavior
- test_readme_native_quickstart_uses_loopback: README warning text
moved from --host prefix to bind-to phrasing; update assertion
- test_sanitize_merges_consecutive_user_messages: consecutive user
messages ARE merged and orphan tool messages ARE dropped by the
adjacency repair pass; update expected counts and values
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix(tests): update cookbook status poll assertion for stopped state
The cookbookRunning.js ternary now handles a 'stopped' status
alongside 'error', so the exact string match in the test no longer
holds. Relax the assertion to check for the error branch presence
instead of the full ternary expression.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#622 reported "I cant even paste that hash pw and granted So auth_en
=false & localbypass= true But then the host still is showing login
page?" — the operator turned auth off in .env and still gets bounced
to /login on every page load. The flow:
The auth middleware in app.py is correctly gated on AUTH_ENABLED, so
the middleware itself does not install when AUTH_ENABLED=false. The
SPA front-end at static/app.js wraps window.fetch and redirects to
/login on ANY 401 response from any API call. So all it takes for the
operator to see a login page is one route-level 401.
src/auth_helpers.require_user — the shared FastAPI dependency mounted
on ~50 routes (email, contacts, personal, …) — was the source. It is
documented as defense-in-depth in case the middleware was bypassed
unexpectedly (SSRF from a sibling service), but the implementation
treated AUTH_ENABLED=false as one of those unexpected bypasses and
401'd anyway. The loopback fall-through that would have admitted the
operator does not fire under docker compose / a reverse proxy because
the container sees the request arriving from the bridge gateway
(172.x.x.x), not 127.0.0.1.
require_user now short-circuits to "" when AUTH_ENABLED=false so the
explicit operator opt-out reaches the route layer too. While in the
file, also mirror LOCALHOST_BYPASS=true the same way for loopback
callers — the middleware already lets them through, and routes 401'ing
the same caller would produce the same /login bounce. Non-loopback
callers under LOCALHOST_BYPASS are still rejected, matching the
middleware's _is_trusted_loopback check.
Add three focused regression tests in tests/test_security_regressions.py:
docker-bridge caller is admitted under AUTH_ENABLED=false, loopback
caller is admitted under LOCALHOST_BYPASS=true, LAN caller under
LOCALHOST_BYPASS=true is still rejected. The existing
test_require_user_rejects_unauthenticated and
test_require_user_accepts_loopback_when_unconfigured tests continue to
pass because neither sets AUTH_ENABLED, so the AUTH_ENABLED=true
default path is unchanged.
Closes #622.
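The two new short-circuits can be sketched as a pure decision function. `admit_without_session` is a hypothetical name; the real `require_user` is a FastAPI dependency with further paths (session lookup, the unconfigured loopback fall-through) that are omitted here.

```python
import ipaddress


def admit_without_session(client_ip: str, auth_enabled: bool,
                          localhost_bypass: bool) -> bool:
    """Decide whether a caller with no session is let through."""
    if not auth_enabled:
        # Explicit operator opt-out: admit everyone, including callers that
        # arrive from a docker bridge gateway such as 172.x.x.x.
        return True
    if localhost_bypass:
        try:
            # Only genuine loopback callers benefit from the bypass.
            return ipaddress.ip_address(client_ip).is_loopback
        except ValueError:
            return False   # unparseable address: fail closed
    return False
```

The three regression cases map directly onto this function: a bridge caller with auth disabled, a loopback caller with the bypass on, and a LAN caller with the bypass on.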
Route PDF lookups through UploadHandler.resolve_upload, reject poisoned pdf_source markers on document create/update, and add regression tests.
Co-authored-by: Cursor <cursoragent@cursor.com>
Fixes issues found in a security review of the current tree (separate from
the cookbook SSH PR):
- Email thread rendering (static/js/emailLibrary.js): the flat read path runs
inbound HTML through the allowlist sanitizer, but the two threaded paths
(_renderTurnsAsBubbles / _renderTurnsFromServer — the default view) injected
server-parsed `body_html` raw into the DOM. A crafted inbound email could
inject arbitrary markup (phishing/form/credential-capture/tracking; full XSS
if a deployment relaxes the script CSP). Now sanitized on all paths.
- Attachment extraction (routes/email_routes.py, routes/email_helpers.py): the
on-disk extraction dir was `ATTACHMENTS_DIR / f"{folder}_{uid}"` with
user-controlled folder/uid and no containment, so a folder like `../../tmp`
could escape ATTACHMENTS_DIR. New attachment_extract_dir() flattens both to a
single safe segment and asserts containment.
- Diagnostics routes (routes/diagnostics_routes.py): /api/db/stats,
/api/rag/stats, /api/test/youtube, /api/test-research relied only on the
global session check (any logged-in user). Now require_admin-gated.
- Defense-in-depth HTML escaping: session HTML export escapes the session name
(routes/session_routes.py); the MCP OAuth page escapes the reflected Host
header / server_id (routes/mcp_routes.py).
- Internal-tool token now compared with secrets.compare_digest (constant time)
in core/middleware.py and app.py.
Adds regression tests in tests/test_security_regressions.py.
* feat(web-fetch): add web_fetch tool to read a specific URL's content
* test(web-fetch): add SSRF coverage and fail closed on empty DNS resolution
Add explicit SSRF regression tests for the web_fetch path covering
loopback, private LAN ranges, link-local/metadata, IPv6 private/local,
redirect-into-private, and unsupported schemes. Harden _public_http_url
to fail closed when a hostname resolves to no addresses.
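The fail-closed rule can be sketched independently of DNS by checking a list of already-resolved addresses. `all_addresses_public` is a hypothetical helper and is not the body of `_public_http_url`; it assumes that "public" means globally routable according to the standard `ipaddress` module.

```python
import ipaddress
from typing import Iterable


def all_addresses_public(addresses: Iterable[str]) -> bool:
    """True only if there is at least one address and every one is public."""
    resolved = list(addresses)
    if not resolved:
        return False          # empty resolution: fail closed
    for raw in resolved:
        try:
            ip = ipaddress.ip_address(raw)
        except ValueError:
            return False      # unparseable address: fail closed
        if not ip.is_global:  # loopback, private, link-local, etc.
            return False
    return True
```

Requiring every resolved address to be public, rather than any one of them, avoids a hostname that mixes a public record with a private one slipping through.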