The outbound UA for web_fetch / web_search was inlined in four places with
two different values and nothing keeping them current: content.py pinned a
mid-2021 Chrome 91 build, and providers.py sent a bare Mozilla/5.0 in three
spots. Some sites serve a degraded or blocked page to a UA that old.
Add WEB_FETCH_USER_AGENT to src/constants.py (env-overridable, matching the
existing Copilot/Kimi UA-constant pattern) and import it in content.py and
providers.py. Default to a current, common desktop UA so pages return their
normal HTML: the market-leading desktop OS (Windows; NT 10.0 covers Windows
10 and 11) and browser (Chrome) on a current stable build. The version is now
bumped in one place.
Service-specific self-identifying agents (Copilot, Kimi, webhooks, cookbook)
are intentionally left separate. Adds a regression pinning the constant shape,
the env override, and a guard against a new inline Mozilla literal in the
search sources.
Closes#4324
* fix(search): add download budgets to web_fetch with truncation notice and hard ceiling
MAX_OUTPUT_CHARS only trims what the agent sees; fetch_webpage_content
buffered and cached the entire response body first, so a large or hostile
URL could pull arbitrarily many bytes into memory and the content cache.
The fetch is now a capped streaming GET (SSRF redirect guard unchanged):
a soft default budget (WEB_FETCH_SOFT_MAX_BYTES, 2 MB), a per-call
override via full/max_bytes on the web_fetch tool, and a hard ceiling
(WEB_FETCH_HARD_MAX_BYTES, 20 MB) that the override can never exceed.
When Content-Length already declares a body over the ceiling the fetch
is refused before any body bytes are buffered. Truncated results carry
truncated/fetched_bytes/total_bytes, the tool output leads with a
partial-content notice telling the model how to re-fetch with full=true,
and the tool schema documents the flag. A truncated PDF is reported as
a budget error since a cut PDF is unparseable. The effective cap is part
of the content-cache key so a truncated fetch is never served to a
full-budget request.
Existing tests that faked httpx.get or the old _get_public_url signature
are adapted to the streaming interface; behavior pins are unchanged.
Fixes#3812
* fix(search): close compressed-body cap bypass and protect the partial notice
Addresses RaresKeY's review on #3955:
- Force Accept-Encoding: identity for the capped fetch. With gzip/deflate the
wire bytes (and Content-Length) can be a fraction of the decoded body, so a
tiny compressed response could pass the hard-cap preflight and then expand
past the ceiling in a single decoded chunk before the streamed cap could
slice it. Identity makes Content-Length the true body size and keeps each
streamed chunk bounded by the network read, so the hard ceiling actually
bounds memory.
- Lead web_fetch output with the partial-content notice and cap the page
title. The notice is the user-facing contract for partial fetches, but the
title is untrusted, uncapped page content; placed ahead of the notice a giant
title could push it past MAX_OUTPUT_CHARS and drop it. The notice now leads
and the title is capped as a second guard.
Adds regressions: the fetch advertises identity encoding, and a truncated
result with an oversized title still surfaces the partial notice.
* fix(search): reject compressed responses that ignore the identity request
Requesting Accept-Encoding: identity is not enough on its own: a server can
ignore it and still return Content-Encoding: gzip, and httpx.iter_bytes would
decode that, so a tiny compressed body could balloon into one decoded chunk
far past the hard cap before the streamed loop slices it (and Content-Length,
the compressed wire length, makes the preflight and size metadata unreliable).
Refuse a non-identity Content-Encoding before reading the body. Adds a
regression where the server ignores the identity request and returns gzip;
the fetch is refused before any body is decoded.
providers.py redefined REQUEST_TIMEOUT = 20 locally, shadowing the same
value in src/constants.py and risking drift if the constant is bumped.
Import it from src.constants and drop the local copy; same value, one
source of truth.
Closes#4329
raw.githubusercontent.com serves Markdown as text/plain, JSON APIs and raw
config files serve application/json, and a lot of code and tool documentation
lives in .md/.txt. fetch_webpage_content only handled PDF and HTML, so a
non-HTML body produced empty content and web_fetch reported 'no readable text
content'. Add a branch that returns the body verbatim for non-HTML text/*,
JSON (application/json and +json), and a .md/.txt/.text/.json URL-suffix
fallback for mislabeled octet-stream. HTML and PDF handling unchanged.
Fixes#3808
* Switch to ddgs
duckduckgo_search was deprecated, this is the recommended replacement
* Update test_service_search_provider_guards.py
According to review comment
* fixed confusing credentials prompt
* fix(setup): return status from create_default_admin function
* fix(setup): initialize admin creation status in main function
* fix(setup): enhance admin creation feedback and status handling
* Enhance admin user login messages with conditional feedback based on creation status
* Refine admin user creation feedback messages for clarity and actionability and formatted code
* Add fallback error message for admin creation failure in setup script
* Add run script for Uvicorn with dotenv integration
* Refactor server runner to use argparse for host and port configuration
* Remove captured output print statement from server runner
* Fix server runner to ensure cross-platform compatibility and improve log handling
* removed run.py to match original repo
* Fixing custom search not working properly
* Refactor search settings event listeners for improved functionality and clarity
* Update search function signatures to use Optional for count parameter
* revert changes
* fixed broken merge issue
* Delete services/chat_data_scraper.py
added by mistake
---------
Co-authored-by: Alexandre Teixeira <111787685+alteixeira20@users.noreply.github.com>
raise_for_status() raises httpx.HTTPStatusError for 4xx/5xx responses,
but the surrounding try/except only caught httpx.RequestError (network
errors) and RateLimitError (429). Any other HTTP error code propagated
uncaught up through chat_processor -> chat_helpers -> chat_routes and
surfaced as a 500 Internal Server Error.
Added an explicit except httpx.HTTPStatusError clause that logs a warning
and returns an empty result, matching the behaviour already in place for
network errors.
Also adds focused regression tests that exercise the real
fetch_webpage_content() path with a mocked _get_public_url:
- 403/404 responses return the standard empty-result shape instead of
raising, proving the new HTTPStatusError handling works end to end.
- 429 responses still take their own dedicated rate-limit branch (the
status_code == 429 check runs before raise_for_status() is reached),
keeping that behaviour distinct from the new generic HTTPStatusError
handling.
Dropped the unrelated builtin_mcp.py change that had been carried over
from a rebase; that fix is tracked separately in #2018 and this branch
should stay scoped to the search content fetch path.
Closes#2148
services/search/cache.py set CACHE_DIR = services/cache (the source tree) and
mkdir'd it at import, unguarded. In Docker services/ is the read-only image
layer, so the mkdir fails at import (same class as the analytics bug #2366).
Move the cache under DATA_DIR/cache (writable on Docker and native) and wrap
the mkdir so an unwritable path disables disk cache instead of crashing import.
Part of #3331.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
* fix: move search analytics log to writable /app/logs volume
services/search/analytics.py opened a FileHandler at module import
time pointing to /app/services/search_engine_error.log — inside the
container image's read-only layer. The process runs as non-root so
the open() fails with PermissionError, crashing uvicorn before it
ever binds. ANALYTICS_FILE had the same problem.
Both paths now point to /app/logs (bind-mounted from the host data
directory). The FileHandler creation is wrapped in try/except so a
missing mount doesn't hard-crash on import.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: derive log dir from DATA_DIR instead of hardcoded /app/logs
Fixes reviewer feedback on #2366: /app/logs only exists inside Docker,
so native runs couldn't write the analytics file. DATA_DIR resolves to
the repo's data/ directory on native and /app/data (writable mount) in
Docker, making both the error log handler and ANALYTICS_FILE work on
every platform.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
#1473 converted the title and sports-hint matches in services/search/ranking.py
to word boundaries but left two raw substring tests:
- snippet_score: 'term in snippet.lower()' — query term 'port' hits
'transport'/'support', inflating a result's relevance.
- news_quality_adjustment: 't in text or t in netloc' for the subject term —
query 'us' substring-matches 'business'/'music', so an off-topic page
wrongly escapes the off-topic penalty on a country/subject news query.
Add a _has_word helper (the same \b...\b pattern title_score already used) and
route all three word checks (title, snippet, subject) through it, so the file
stays consistent and a future partial fix can't reintroduce the same bug class.
Pure ranking refinement: scores change only for spurious substring matches; no
API or schema change.
(cherry picked from commit 22bd23f044)
Co-authored-by: ghreprimand <203024559+ghreprimand@users.noreply.github.com>
tavily_search, serper_search and google_pse_search parsed response.json()
inside the network try block, which only caught httpx.RequestError and
RateLimitError. When a provider returned a non-JSON body (an HTML error page, a
truncated/empty body, a gateway 5xx), response.json() raised an UNCAUGHT
json.JSONDecodeError that aborted the search in the background — exactly the
'search engines other than SearXNG fail in the background' symptom.
brave_search already handles this correctly: it parses JSON in its own try
block and returns [] on json.JSONDecodeError. Mirror that in the other three
providers so a malformed provider response degrades to no-results instead of
propagating an exception.
Adds tests/test_search_provider_json.py: a non-JSON 200 body now yields [] for
tavily, serper, google_pse, and brave (the last guards the reference behaviour).
Co-authored-by: NubsCarson <nubs@nubs.site>
Make services.search.analytics tolerate missing counters in older or partial analytics files by merging loaded data over defaults, with regression coverage.
SearchService.search() did:
raw_results = await comprehensive_web_search(
query, max_results=10 * depth, fetch_content=fetch_content)
comprehensive_web_search is a synchronous function whose count knob is
`max_pages` (not `max_results`) and which has no `fetch_content` parameter, so
the call raised TypeError on argument binding; `await` on its non-coroutine
return would also fail. It returns a context string, or a (context, sources)
tuple with return_sources=True — not the list of dicts the wrapper iterates.
The method is exported in services/search/__init__.py and services/__init__.py
with a usage example in its docstring, so any caller of the documented public
API hit an immediate crash. Call it correctly via asyncio.to_thread with
max_pages + return_sources=True and use the returned source list as the rows.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
get_search_config returned SEARCH_CONFIG.copy(), and update_search_config
cached the decrypted Brave key into that shared global at startup
(app_initializer), so the unauthenticated /api/search/config route exposed
the operator's key. The cache was dead weight: brave_search reads its key
via _get_provider_key (settings/env), never SEARCH_CONFIG.
- update_search_config: no longer stores the api_key in the shared global
(accepted for backward compat; provider keys are read on demand).
- get_search_config: scrub any string-valued credential field before
returning, preserving the has_api_key presence flag.
No schema change; brave_search/_get_provider_key untouched. Adds regression
tests.
Fixes#1661
Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>
* fix: detect question words as whole words, not prefixes
* fix: same question-word prefix bug in the services search copy
* test: question-word detection rejects prefix lookalikes
* fix: only extract quotes whose closing quote matches the opening one
* fix: same mismatched-quote bug in the services search copy
* test: extract_quotes requires matching open/close quotes
invalidate_search_cache(query) built its cache key as
generate_cache_key(f"{query}|10|None"), but the write path
(searxng_search_results) replaces the caller's default count of 10 with the
admin-configured _get_result_count() (default 5) before building the key.
So a default search for "X" is cached under "X|5|None", while invalidation
looked for "X|10|None" — they never match, and invalidate_search_cache
silently failed to remove anything in the default configuration, violating
its docstring ("invalidate ... just the given query").
Derive the count from _get_result_count() so invalidation matches the
default-search entry the write path actually stores. The same bug (and fix)
applies to both the src/search and services/search copies.
Note: time-filtered variants (e.g. "X|5|day") still aren't reachable from a
query-only signature, since cache keys are opaque SHA-256 hashes with no
stored query; clearing those would need a broader cache-index redesign and is
out of scope here.
Adds tests/test_search_cache_invalidation.py covering the default-count case.
The DuckDuckGo HTML fallback returns redirect URLs (//duckduckgo.com/l/?uddg=...)
instead of actual page URLs. This caused fetch_webpage_content() to reject them
instantly because _public_http_url() requires an http/https scheme, making search
results unfetchable in deep research mode.
Added _resolve_url() to:
- Convert protocol-relative URLs to absolute (https:)
- Convert path-relative URLs to absolute
- Extract the real URL from DuckDuckGo's /l/?uddg= redirect parameters
* fix: extract full year in research query entities, not just the century
* fix: same year capture-group bug in the services search copy
* test: research query extracts the full year