refactor(search): centralize the web-scraping User-Agent into one constant (#4325)

The outbound UA for web_fetch / web_search was inlined in four places with
two different values and nothing keeping them current: content.py pinned a
mid-2021 Chrome 91 build, and providers.py sent a bare Mozilla/5.0 in three
spots. Some sites serve a degraded or blocked page to a UA that old.

Add WEB_FETCH_USER_AGENT to src/constants.py (env-overridable, matching the
existing Copilot/Kimi UA-constant pattern) and import it in content.py and
providers.py. Default to a current, common desktop UA so pages return their
normal HTML: the market-leading desktop OS (Windows; NT 10.0 covers Windows
10 and 11) and browser (Chrome) on a current stable build. The version is now
bumped in one place.

Service-specific self-identifying agents (Copilot, Kimi, webhooks, cookbook)
are intentionally left separate. Adds a regression pinning the constant shape,
the env override, and a guard against a new inline Mozilla literal in the
search sources.

Closes #4324
This commit is contained in:
Kenny Van de Maele
2026-06-16 03:33:47 +02:00
committed by GitHub
parent b58af4267b
commit fafaf089c5
4 changed files with 31 additions and 6 deletions
+4 -4
View File
@@ -9,7 +9,7 @@ from urllib.parse import urljoin, urlparse, parse_qs
import httpx
from bs4 import BeautifulSoup
from src.constants import SEARXNG_INSTANCE, REQUEST_TIMEOUT
from src.constants import SEARXNG_INSTANCE, REQUEST_TIMEOUT, WEB_FETCH_USER_AGENT
from .analytics import RateLimitError, error_logger
from .query import build_enhanced_query
@@ -138,7 +138,7 @@ def searxng_search_api(query: str, count: Optional[int] = None, categories: str
count = count if count is not None else _get_result_count()
instance = _get_search_instance()
api_key = ""
headers = {"User-Agent": "Mozilla/5.0"}
headers = {"User-Agent": WEB_FETCH_USER_AGENT}
if api_key:
headers["Authorization"] = f"Bearer {api_key}"
# News/fresh queries do badly in the 'general' category — it favours
@@ -250,7 +250,7 @@ def searxng_search(query, max_results=10):
"""Search using SearXNG instance - parsing HTML."""
instance = _get_search_instance()
api_key = ""
req_headers = {"User-Agent": "Mozilla/5.0"}
req_headers = {"User-Agent": WEB_FETCH_USER_AGENT}
if api_key:
req_headers["Authorization"] = f"Bearer {api_key}"
try:
@@ -389,7 +389,7 @@ def duckduckgo_search(query: str, count: Optional[int] = None, time_filter: Opti
response = httpx.get(
"https://html.duckduckgo.com/html/",
params={"q": query, "kp": _safesearch_for("duckduckgo_html")},
headers={"User-Agent": "Mozilla/5.0"},
headers={"User-Agent": WEB_FETCH_USER_AGENT},
timeout=REQUEST_TIMEOUT,
)
response.raise_for_status()