raw.githubusercontent.com serves Markdown as text/plain, JSON APIs and raw
config files serve application/json, and a lot of code and tool documentation
lives in .md/.txt. fetch_webpage_content only handled PDF and HTML, so a
non-HTML body produced empty content and web_fetch reported 'no readable text
content'. Add a branch that returns the body verbatim for non-HTML text/*,
JSON (application/json and +json), and a .md/.txt/.text/.json URL-suffix
fallback for mislabeled octet-stream. HTML and PDF handling unchanged.
Fixes#3808
* Switch to ddgs
duckduckgo_search was deprecated, this is the recommended replacement
* Update test_service_search_provider_guards.py
According to review comment
* fixed confusing credentials prompt
* fix(setup): return status from create_default_admin function
* fix(setup): initialize admin creation status in main function
* fix(setup): enhance admin creation feedback and status handling
* Enhance admin user login messages with conditional feedback based on creation status
* Refine admin user creation feedback messages for clarity and actionability and formatted code
* Add fallback error message for admin creation failure in setup script
* Add run script for Uvicorn with dotenv integration
* Refactor server runner to use argparse for host and port configuration
* Remove captured output print statement from server runner
* Fix server runner to ensure cross-platform compatibility and improve log handling
* removed run.py to match original repo
* Fixing custom search not working properly
* Refactor search settings event listeners for improved functionality and clarity
* Update search function signatures to use Optional for count parameter
* revert changes
* fixed broken merge issue
* Delete services/chat_data_scraper.py
added by mistake
---------
Co-authored-by: Alexandre Teixeira <111787685+alteixeira20@users.noreply.github.com>
raise_for_status() raises httpx.HTTPStatusError for 4xx/5xx responses,
but the surrounding try/except only caught httpx.RequestError (network
errors) and RateLimitError (429). Any other HTTP error code propagated
uncaught up through chat_processor -> chat_helpers -> chat_routes and
surfaced as a 500 Internal Server Error.
Added an explicit except httpx.HTTPStatusError clause that logs a warning
and returns an empty result, matching the behaviour already in place for
network errors.
Also adds focused regression tests that exercise the real
fetch_webpage_content() path with a mocked _get_public_url:
- 403/404 responses return the standard empty-result shape instead of
raising, proving the new HTTPStatusError handling works end to end.
- 429 responses still take their own dedicated rate-limit branch (the
status_code == 429 check runs before raise_for_status() is reached),
keeping that behaviour distinct from the new generic HTTPStatusError
handling.
Dropped the unrelated builtin_mcp.py change that had been carried over
from a rebase; that fix is tracked separately in #2018 and this branch
should stay scoped to the search content fetch path.
Closes#2148
services/search/cache.py set CACHE_DIR = services/cache (the source tree) and
mkdir'd it at import, unguarded. In Docker services/ is the read-only image
layer, so the mkdir fails at import (same class as the analytics bug #2366).
Move the cache under DATA_DIR/cache (writable on Docker and native) and wrap
the mkdir so an unwritable path disables disk cache instead of crashing import.
Part of #3331.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
* fix: move search analytics log to writable /app/logs volume
services/search/analytics.py opened a FileHandler at module import
time pointing to /app/services/search_engine_error.log — inside the
container image's read-only layer. The process runs as non-root so
the open() fails with PermissionError, crashing uvicorn before it
ever binds. ANALYTICS_FILE had the same problem.
Both paths now point to /app/logs (bind-mounted from the host data
directory). The FileHandler creation is wrapped in try/except so a
missing mount doesn't hard-crash on import.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: derive log dir from DATA_DIR instead of hardcoded /app/logs
Fixes reviewer feedback on #2366: /app/logs only exists inside Docker,
so native runs couldn't write the analytics file. DATA_DIR resolves to
the repo's data/ directory on native and /app/data (writable mount) in
Docker, making both the error log handler and ANALYTICS_FILE work on
every platform.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
#1473 converted the title and sports-hint matches in services/search/ranking.py
to word boundaries but left two raw substring tests:
- snippet_score: 'term in snippet.lower()' — query term 'port' hits
'transport'/'support', inflating a result's relevance.
- news_quality_adjustment: 't in text or t in netloc' for the subject term —
query 'us' substring-matches 'business'/'music', so an off-topic page
wrongly escapes the off-topic penalty on a country/subject news query.
Add a _has_word helper (the same \b...\b pattern title_score already used) and
route all three word checks (title, snippet, subject) through it, so the file
stays consistent and a future partial fix can't reintroduce the same bug class.
Pure ranking refinement: scores change only for spurious substring matches; no
API or schema change.
(cherry picked from commit 22bd23f044)
Co-authored-by: ghreprimand <203024559+ghreprimand@users.noreply.github.com>
tavily_search, serper_search and google_pse_search parsed response.json()
inside the network try block, which only caught httpx.RequestError and
RateLimitError. When a provider returned a non-JSON body (an HTML error page, a
truncated/empty body, a gateway 5xx), response.json() raised an UNCAUGHT
json.JSONDecodeError that aborted the search in the background — exactly the
'search engines other than SearXNG fail in the background' symptom.
brave_search already handles this correctly: it parses JSON in its own try
block and returns [] on json.JSONDecodeError. Mirror that in the other three
providers so a malformed provider response degrades to no-results instead of
propagating an exception.
Adds tests/test_search_provider_json.py: a non-JSON 200 body now yields [] for
tavily, serper, google_pse, and brave (the last guards the reference behaviour).
Co-authored-by: NubsCarson <nubs@nubs.site>
Make services.search.analytics tolerate missing counters in older or partial analytics files by merging loaded data over defaults, with regression coverage.
SearchService.search() did:
raw_results = await comprehensive_web_search(
query, max_results=10 * depth, fetch_content=fetch_content)
comprehensive_web_search is a synchronous function whose count knob is
`max_pages` (not `max_results`) and which has no `fetch_content` parameter, so
the call raised TypeError on argument binding; `await` on its non-coroutine
return would also fail. It returns a context string, or a (context, sources)
tuple with return_sources=True — not the list of dicts the wrapper iterates.
The method is exported in services/search/__init__.py and services/__init__.py
with a usage example in its docstring, so any caller of the documented public
API hit an immediate crash. Call it correctly via asyncio.to_thread with
max_pages + return_sources=True and use the returned source list as the rows.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
get_search_config returned SEARCH_CONFIG.copy(), and update_search_config
cached the decrypted Brave key into that shared global at startup
(app_initializer), so the unauthenticated /api/search/config route exposed
the operator's key. The cache was dead weight: brave_search reads its key
via _get_provider_key (settings/env), never SEARCH_CONFIG.
- update_search_config: no longer stores the api_key in the shared global
(accepted for backward compat; provider keys are read on demand).
- get_search_config: scrub any string-valued credential field before
returning, preserving the has_api_key presence flag.
No schema change; brave_search/_get_provider_key untouched. Adds regression
tests.
Fixes#1661
Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>
* fix: detect question words as whole words, not prefixes
* fix: same question-word prefix bug in the services search copy
* test: question-word detection rejects prefix lookalikes
* fix: only extract quotes whose closing quote matches the opening one
* fix: same mismatched-quote bug in the services search copy
* test: extract_quotes requires matching open/close quotes
invalidate_search_cache(query) built its cache key as
generate_cache_key(f"{query}|10|None"), but the write path
(searxng_search_results) replaces the caller's default count of 10 with the
admin-configured _get_result_count() (default 5) before building the key.
So a default search for "X" is cached under "X|5|None", while invalidation
looked for "X|10|None" — they never match, and invalidate_search_cache
silently failed to remove anything in the default configuration, violating
its docstring ("invalidate ... just the given query").
Derive the count from _get_result_count() so invalidation matches the
default-search entry the write path actually stores. The same bug (and fix)
applies to both the src/search and services/search copies.
Note: time-filtered variants (e.g. "X|5|day") still aren't reachable from a
query-only signature, since cache keys are opaque SHA-256 hashes with no
stored query; clearing those would need a broader cache-index redesign and is
out of scope here.
Adds tests/test_search_cache_invalidation.py covering the default-count case.
The DuckDuckGo HTML fallback returns redirect URLs (//duckduckgo.com/l/?uddg=...)
instead of actual page URLs. This caused fetch_webpage_content() to reject them
instantly because _public_http_url() requires an http/https scheme, making search
results unfetchable in deep research mode.
Added _resolve_url() to:
- Convert protocol-relative URLs to absolute (https:)
- Convert path-relative URLs to absolute
- Extract the real URL from DuckDuckGo's /l/?uddg= redirect parameters
* fix: extract full year in research query entities, not just the century
* fix: same year capture-group bug in the services search copy
* test: research query extracts the full year