mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-17 02:05:22 -04:00
fafaf089c5
The outbound UA for web_fetch / web_search was inlined in four places with two different values and nothing keeping them current: content.py pinned a mid-2021 Chrome 91 build, and providers.py sent a bare Mozilla/5.0 in three spots. Some sites serve a degraded or blocked page to a UA that old.

Add WEB_FETCH_USER_AGENT to src/constants.py (env-overridable, matching the existing Copilot/Kimi UA-constant pattern) and import it in content.py and providers.py. Default to a current, common desktop UA so pages return their normal HTML: the market-leading desktop OS (Windows; NT 10.0 covers Windows 10 and 11) and browser (Chrome) on a current stable build. The version is now bumped in one place.

Service-specific self-identifying agents (Copilot, Kimi, webhooks, cookbook) are intentionally left separate.

Adds a regression pinning the constant shape, the env override, and a guard against a new inline Mozilla literal in the search sources.

Closes #4324
19 lines
656 B
Python
"""The web scraping path routes its User-Agent through one constant.
|
|
|
|
Guards the dedup: web_fetch / web_search outbound UAs go through
|
|
WEB_FETCH_USER_AGENT, so a stale or bare Mozilla string cannot be re-inlined in
|
|
the search sources.
|
|
"""
|
|
from pathlib import Path
|
|
|
|
_SEARCH = Path(__file__).resolve().parent.parent / "services" / "search"
|
|
|
|
|
|
def test_search_sources_have_no_inline_mozilla_ua():
|
|
offenders = [
|
|
str(py.relative_to(_SEARCH.parent.parent))
|
|
for py in _SEARCH.rglob("*.py")
|
|
if "Mozilla/" in py.read_text(encoding="utf-8")
|
|
]
|
|
assert not offenders, f"inline Mozilla UA found; use WEB_FETCH_USER_AGENT: {offenders}"
|
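The guard above depends on the repository layout, so it cannot be run in isolation. As a rough standalone sketch of the same scan, the snippet below builds a throwaway directory and checks which files would be flagged. The helper name `find_inline_ua` and the file names are hypothetical and exist only for this example.

```python
import tempfile
from pathlib import Path


def find_inline_ua(root: Path, needle: str = "Mozilla/") -> list[str]:
    """Return sorted POSIX paths (relative to root) of .py files containing needle."""
    return sorted(
        py.relative_to(root).as_posix()
        for py in root.rglob("*.py")
        if needle in py.read_text(encoding="utf-8")
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # A file that goes through the shared constant: should not be flagged.
    (root / "clean.py").write_text("UA = WEB_FETCH_USER_AGENT\n", encoding="utf-8")
    # A nested file that re-inlines a literal: should be flagged.
    (root / "sub").mkdir()
    (root / "sub" / "stale.py").write_text('UA = "Mozilla/5.0"\n', encoding="utf-8")
    offenders = find_inline_ua(root)
```

Like the test, this is a plain substring match, so a comment or docstring that mentions "Mozilla/" in the scanned tree would also be reported.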