odysseus

mirror of https://github.com/pewdiepie-archdaemon/odysseus.git synced 2026-06-16 17:55:26 -04:00

Author	SHA1	Message	Date
Kenny Van de Maele	fafaf089c5	refactor(search): centralize the web-scraping User-Agent into one constant (#4325) The outbound UA for web_fetch / web_search was inlined in four places with two different values and nothing keeping them current: content.py pinned a mid-2021 Chrome 91 build, and providers.py sent a bare Mozilla/5.0 in three spots. Some sites serve a degraded or blocked page to a UA that old. Add WEB_FETCH_USER_AGENT to src/constants.py (env-overridable, matching the existing Copilot/Kimi UA-constant pattern) and import it in content.py and providers.py. Default to a current, common desktop UA so pages return their normal HTML: the market-leading desktop OS (Windows; NT 10.0 covers Windows 10 and 11) and browser (Chrome) on a current stable build. The version is now bumped in one place. Service-specific self-identifying agents (Copilot, Kimi, webhooks, cookbook) are intentionally left separate. Adds a regression pinning the constant shape, the env override, and a guard against a new inline Mozilla literal in the search sources. Closes #4324	2026-06-16 01:33:47 +00:00
Kenny Van de Maele	074a1e6eff	fix(search): add download budgets to web_fetch with truncation notice and hard ceiling (#3955) * fix(search): add download budgets to web_fetch with truncation notice and hard ceiling MAX_OUTPUT_CHARS only trims what the agent sees; fetch_webpage_content buffered and cached the entire response body first, so a large or hostile URL could pull arbitrarily many bytes into memory and the content cache. The fetch is now a capped streaming GET (SSRF redirect guard unchanged): a soft default budget (WEB_FETCH_SOFT_MAX_BYTES, 2 MB), a per-call override via full/max_bytes on the web_fetch tool, and a hard ceiling (WEB_FETCH_HARD_MAX_BYTES, 20 MB) that the override can never exceed. When Content-Length already declares a body over the ceiling the fetch is refused before any body bytes are buffered. Truncated results carry truncated/fetched_bytes/total_bytes, the tool output leads with a partial-content notice telling the model how to re-fetch with full=true, and the tool schema documents the flag. A truncated PDF is reported as a budget error since a cut PDF is unparseable. The effective cap is part of the content-cache key so a truncated fetch is never served to a full-budget request. Existing tests that faked httpx.get or the old _get_public_url signature are adapted to the streaming interface; behavior pins are unchanged. Fixes #3812 * fix(search): close compressed-body cap bypass and protect the partial notice Addresses RaresKeY's review on #3955: - Force Accept-Encoding: identity for the capped fetch. With gzip/deflate the wire bytes (and Content-Length) can be a fraction of the decoded body, so a tiny compressed response could pass the hard-cap preflight and then expand past the ceiling in a single decoded chunk before the streamed cap could slice it. Identity makes Content-Length the true body size and keeps each streamed chunk bounded by the network read, so the hard ceiling actually bounds memory. - Lead web_fetch output with the partial-content notice and cap the page title. The notice is the user-facing contract for partial fetches, but the title is untrusted, uncapped page content; placed ahead of the notice a giant title could push it past MAX_OUTPUT_CHARS and drop it. The notice now leads and the title is capped as a second guard. Adds regressions: the fetch advertises identity encoding, and a truncated result with an oversized title still surfaces the partial notice. * fix(search): reject compressed responses that ignore the identity request Requesting Accept-Encoding: identity is not enough on its own: a server can ignore it and still return Content-Encoding: gzip, and httpx.iter_bytes would decode that, so a tiny compressed body could balloon into one decoded chunk far past the hard cap before the streamed loop slices it (and Content-Length, the compressed wire length, makes the preflight and size metadata unreliable). Refuse a non-identity Content-Encoding before reading the body. Adds a regression where the server ignores the identity request and returns gzip; the fetch is refused before any body is decoded.	2026-06-15 17:38:09 +00:00
Kenny Van de Maele	2fab378c6a	refactor(search): import REQUEST_TIMEOUT from constants in providers.py (#4331) providers.py redefined REQUEST_TIMEOUT = 20 locally, shadowing the same value in src/constants.py and risking drift if the constant is bumped. Import it from src.constants and drop the local copy; same value, one source of truth. Closes #4329	2026-06-15 17:22:08 +00:00
Kenny Van de Maele	bfac1d55d6	fix(search): read plain-text, Markdown, and JSON URLs in fetch_webpage_content (#3809) raw.githubusercontent.com serves Markdown as text/plain, JSON APIs and raw config files serve application/json, and a lot of code and tool documentation lives in .md/.txt. fetch_webpage_content only handled PDF and HTML, so a non-HTML body produced empty content and web_fetch reported 'no readable text content'. Add a branch that returns the body verbatim for non-HTML text/*, JSON (application/json and +json), and a .md/.txt/.text/.json URL-suffix fallback for mislabeled octet-stream. HTML and PDF handling unchanged. Fixes #3808	2026-06-11 14:24:53 +00:00
ThomasAngel	a0b0420e6f	chore: Switch duckduckgo-search to ddgs (#3143) * Switch to ddgs duckduckgo_search was deprecated, this is the recommended replacement * Update test_service_search_provider_guards.py According to review comment	2026-06-10 17:59:47 +02:00
Boody	f605bb3864	fix: Enforce dynamic custom search result limits in backend (#2359) * fixed confusing credentials prompt * fix(setup): return status from create_default_admin function * fix(setup): initialize admin creation status in main function * fix(setup): enhance admin creation feedback and status handling * Enhance admin user login messages with conditional feedback based on creation status * Refine admin user creation feedback messages for clarity and actionability and formatted code * Add fallback error message for admin creation failure in setup script * Add run script for Uvicorn with dotenv integration * Refactor server runner to use argparse for host and port configuration * Remove captured output print statement from server runner * Fix server runner to ensure cross-platform compatibility and improve log handling * removed run.py to match original repo * Fixing custom search not working properly * Refactor search settings event listeners for improved functionality and clarity * Update search function signatures to use Optional for count parameter * revert changes * fixed broken merge issue * Delete services/chat_data_scraper.py added by mistake --------- Co-authored-by: Alexandre Teixeira <111787685+alteixeira20@users.noreply.github.com>	2026-06-09 02:20:59 +01:00
Lucas Daniel	fa7c4f8ea9	fix(search): catch HTTPStatusError so 403/404 URLs degrade gracefully instead of 500 (#2203) raise_for_status() raises httpx.HTTPStatusError for 4xx/5xx responses, but the surrounding try/except only caught httpx.RequestError (network errors) and RateLimitError (429). Any other HTTP error code propagated uncaught up through chat_processor -> chat_helpers -> chat_routes and surfaced as a 500 Internal Server Error. Added an explicit except httpx.HTTPStatusError clause that logs a warning and returns an empty result, matching the behaviour already in place for network errors. Also adds focused regression tests that exercise the real fetch_webpage_content() path with a mocked _get_public_url: - 403/404 responses return the standard empty-result shape instead of raising, proving the new HTTPStatusError handling works end to end. - 429 responses still take their own dedicated rate-limit branch (the status_code == 429 check runs before raise_for_status() is reached), keeping that behaviour distinct from the new generic HTTPStatusError handling. Dropped the unrelated builtin_mcp.py change that had been carried over from a rebase; that fix is tracked separately in #2018 and this branch should stay scoped to the search content fetch path. Closes #2148	2026-06-08 01:09:21 +01:00
Kenny Van de Maele	92300b5d67	fix(search): write cache under DATA_DIR, guard mkdir against read-only path (#3334) services/search/cache.py set CACHE_DIR = services/cache (the source tree) and mkdir'd it at import, unguarded. In Docker services/ is the read-only image layer, so the mkdir fails at import (same class as the analytics bug #2366). Move the cache under DATA_DIR/cache (writable on Docker and native) and wrap the mkdir so an unwritable path disables disk cache instead of crashing import. Part of #3331. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-07 22:37:12 +01:00
Giuseppe Castelluccio	6c9a16a7a8	fix: search analytics FileHandler crashes on startup writing to read-only image layer (#2366) * fix: move search analytics log to writable /app/logs volume services/search/analytics.py opened a FileHandler at module import time pointing to /app/services/search_engine_error.log — inside the container image's read-only layer. The process runs as non-root so the open() fails with PermissionError, crashing uvicorn before it ever binds. ANALYTICS_FILE had the same problem. Both paths now point to /app/logs (bind-mounted from the host data directory). The FileHandler creation is wrapped in try/except so a missing mount doesn't hard-crash on import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: derive log dir from DATA_DIR instead of hardcoded /app/logs Fixes reviewer feedback on #2366: /app/logs only exists inside Docker, so native runs couldn't write the analytics file. DATA_DIR resolves to the repo's data/ directory on native and /app/data (writable mount) in Docker, making both the error log handler and ANALYTICS_FILE work on every platform. --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-07 19:26:22 +02:00
ghreprimand	cfb2d17a2d	Word-boundary match for snippet and subject-term ranking (#1473 follow-up) (#2556) #1473 converted the title and sports-hint matches in services/search/ranking.py to word boundaries but left two raw substring tests: - snippet_score: 'term in snippet.lower()' — query term 'port' hits 'transport'/'support', inflating a result's relevance. - news_quality_adjustment: 't in text or t in netloc' for the subject term — query 'us' substring-matches 'business'/'music', so an off-topic page wrongly escapes the off-topic penalty on a country/subject news query. Add a _has_word helper (the same \b...\b pattern title_score already used) and route all three word checks (title, snippet, subject) through it, so the file stays consistent and a future partial fix can't reintroduce the same bug class. Pure ranking refinement: scores change only for spurious substring matches; no API or schema change. (cherry picked from commit `22bd23f044`) Co-authored-by: ghreprimand <203024559+ghreprimand@users.noreply.github.com>	2026-06-05 08:04:31 +01:00
Vykos	aaef6b1c49	fix(search): align content URL guards * Stabilize full test collection * Align search content URL guards	2026-06-04 00:34:06 +01:00
pewdiepie-archdaemon	6861c41580	Reapply "Merge branch 'main' of github.com:pewdiepie-archdaemon/odysseus" This reverts commit `cc8fe2f6e3`.	2026-06-03 22:47:00 +09:00
pewdiepie-archdaemon	cc8fe2f6e3	Revert "Merge branch 'main' of github.com:pewdiepie-archdaemon/odysseus" This reverts commit `8161c1253d`, reversing changes made to `8c2705b42a`.	2026-06-03 22:46:19 +09:00
Alexandre Teixeira	a75dd4a231	fix(search): apply recency UTC fix to live ranking module	2026-06-03 12:49:32 +01:00
Shaw	552bc15067	fix(search): degrade to empty results on non-JSON provider responses (#1129) (#1352) tavily_search, serper_search and google_pse_search parsed response.json() inside the network try block, which only caught httpx.RequestError and RateLimitError. When a provider returned a non-JSON body (an HTML error page, a truncated/empty body, a gateway 5xx), response.json() raised an UNCAUGHT json.JSONDecodeError that aborted the search in the background — exactly the 'search engines other than SearXNG fail in the background' symptom. brave_search already handles this correctly: it parses JSON in its own try block and returns [] on json.JSONDecodeError. Mirror that in the other three providers so a malformed provider response degrades to no-results instead of propagating an exception. Adds tests/test_search_provider_json.py: a non-JSON 200 body now yields [] for tavily, serper, google_pse, and brave (the last guards the reference behaviour). Co-authored-by: NubsCarson <nubs@nubs.site>	2026-06-03 14:24:23 +09:00
Afonso Coutinho	b55c970ec5	fix: sports-hint ranking penalty fires on 'transport'/'passport' substrings (#1473) * fix: sports-hint ranking penalty fires on 'transport'/'passport' substrings * Apply word-boundary sports-hint fix to src/search/ranking.py as well	2026-06-03 14:23:52 +09:00
Afonso Coutinho	eae8797e08	fix: web search content blocks numbered by fetch completion order break citations (#1672)	2026-06-03 14:22:55 +09:00
Afonso Coutinho	f29c827e6e	Merge search analytics defaults in services copy Make services.search.analytics tolerate missing counters in older or partial analytics files by merging loaded data over defaults, with regression coverage.	2026-06-03 13:45:07 +09:00
Mubashir R	535d05c142	fix: SearchService.search() calls comprehensive_web_search incorrectly (broken public API) (#1720) SearchService.search() did: raw_results = await comprehensive_web_search( query, max_results=10 * depth, fetch_content=fetch_content) comprehensive_web_search is a synchronous function whose count knob is `max_pages` (not `max_results`) and which has no `fetch_content` parameter, so the call raised TypeError on argument binding; `await` on its non-coroutine return would also fail. It returns a context string, or a (context, sources) tuple with return_sources=True — not the list of dicts the wrapper iterates. The method is exported in services/search/__init__.py and services/__init__.py with a usage example in its docstring, so any caller of the documented public API hit an immediate crash. Call it correctly via asyncio.to_thread with max_pages + return_sources=True and use the returned source list as the rows. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-03 13:33:56 +09:00
Afonso Coutinho	c5bc39de88	fix: _extract_entities crashes on a non-string query (#1724)	2026-06-03 13:30:28 +09:00
Afonso Coutinho	0c37943267	fix: search service crashes on a non-dict result row (#1725)	2026-06-03 13:30:19 +09:00
Ethan	33bf975597	Stop GET /api/search/config from leaking the Brave API key (#1661) (#1750) get_search_config returned SEARCH_CONFIG.copy(), and update_search_config cached the decrypted Brave key into that shared global at startup (app_initializer), so the unauthenticated /api/search/config route exposed the operator's key. The cache was dead weight: brave_search reads its key via _get_provider_key (settings/env), never SEARCH_CONFIG. - update_search_config: no longer stores the api_key in the shared global (accepted for backward compat; provider keys are read on demand). - get_search_config: scrub any string-valued credential field before returning, preserving the has_api_key presence flag. No schema change; brave_search/_get_provider_key untouched. Adds regression tests. Fixes #1661 Co-authored-by: Ethan <23321960+0xLeathery@users.noreply.github.com>	2026-06-03 13:24:17 +09:00
Afonso Coutinho	77313170c6	fix: search query helpers crash on a non-string query (#1604)	2026-06-03 08:36:01 +09:00
Afonso Coutinho	f62d6ea3d7	fix: research query misclassifies 'whatsapp'/'however' as questions (#1247) * fix: detect question words as whole words, not prefixes * fix: same question-word prefix bug in the services search copy * test: question-word detection rejects prefix lookalikes	2026-06-03 01:10:06 +09:00
red person	cc6e43da44	Report provider-specific search API keys correctly (#1202) * fix(search): report provider-specific API keys * fix(search): include provider env keys in status	2026-06-02 23:37:15 +09:00
Afonso Coutinho	2e2da2aefe	fix: extract_statistics drops large numbers and trailing % signs (#1153) * fix: extract_statistics misses comma-less numbers and drops trailing % * fix: same extract_statistics number/percent bug in services copy * test: extract_statistics captures full numbers and percent signs	2026-06-02 22:35:30 +09:00
Afonso Coutinho	2b2943a7b7	fix: extract_quotes accepts mismatched opening/closing quotes (#1113) * fix: only extract quotes whose closing quote matches the opening one * fix: same mismatched-quote bug in the services search copy * test: extract_quotes requires matching open/close quotes	2026-06-02 22:34:52 +09:00
ghreprimand	aa0a9e8b5a	Search: align service content extraction Co-authored-by: ghreprimand <203024559+ghreprimand@users.noreply.github.com>	2026-06-02 20:53:07 +09:00
ghreprimand	eddb9ce6db	Search: align service provider guards Co-authored-by: ghreprimand <203024559+ghreprimand@users.noreply.github.com>	2026-06-02 20:52:13 +09:00
mist	5ebe9ee67a	Fix invalidate_search_cache using a key that never matches stored entries (#852) invalidate_search_cache(query) built its cache key as generate_cache_key(f"{query}|10|None"), but the write path (searxng_search_results) replaces the caller's default count of 10 with the admin-configured _get_result_count() (default 5) before building the key. So a default search for "X" is cached under "X|5|None", while invalidation looked for "X|10|None" — they never match, and invalidate_search_cache silently failed to remove anything in the default configuration, violating its docstring ("invalidate ... just the given query"). Derive the count from _get_result_count() so invalidation matches the default-search entry the write path actually stores. The same bug (and fix) applies to both the src/search and services/search copies. Note: time-filtered variants (e.g. "X|5|day") still aren't reachable from a query-only signature, since cache keys are opaque SHA-256 hashes with no stored query; clearing those would need a broader cache-index redesign and is out of scope here. Adds tests/test_search_cache_invalidation.py covering the default-count case.	2026-06-02 10:53:33 +09:00
BSG-Walter	c0466274ed	fix: resolve DuckDuckGo redirect URLs in HTML fallback search The DuckDuckGo HTML fallback returns redirect URLs (//duckduckgo.com/l/?uddg=...) instead of actual page URLs. This caused fetch_webpage_content() to reject them instantly because _public_http_url() requires an http/https scheme, making search results unfetchable in deep research mode. Added _resolve_url() to: - Convert protocol-relative URLs to absolute (https:) - Convert path-relative URLs to absolute - Extract the real URL from DuckDuckGo's /l/?uddg= redirect parameters	2026-06-01 19:42:01 -03:00
Afonso Coutinho	9b1acf6612	Fix year extraction in research queries * fix: extract full year in research query entities, not just the century * fix: same year capture-group bug in the services search copy * test: research query extracts the full year	2026-06-01 23:09:41 +09:00
pewdiepie-archdaemon	e5c99a5eee	Odysseus v1.0	2026-05-31 23:58:26 +09:00

33 Commits
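
Several commits above (cfb2d17a2d, b55c970ec5, f62d6ea3d7) replace raw substring tests with word-boundary matches. The sketch below illustrates that technique only; it is not the repository's code. The name has_word and its signature are assumptions made for this example; the log itself mentions only a _has_word helper built on a \b...\b pattern.

```python
import re


def has_word(term: str, text: str) -> bool:
    """Return True if `term` occurs in `text` as a whole word.

    Hypothetical helper for illustration: the term is escaped and wrapped
    in \\b...\\b so it cannot match inside a longer word.
    """
    if not term:
        return False
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# A raw substring test matches inside longer words...
print("port" in "public transport")               # True (spurious hit)

# ...while the word-boundary test does not.
print(has_word("port", "public transport"))       # False
print(has_word("port", "the port of Rotterdam"))  # True
print(has_word("us", "business news"))            # False
print(has_word("what", "whatsapp outage"))        # False
```

Escaping the term with re.escape keeps query characters such as "+" or "." from being read as regex syntax.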