mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-18 02:35:23 -04:00
e87a1ad8d2
The goal-based extractor passed raw fetched webpage content straight into the LLM prompt via string substitution, bypassing the prompt-injection hardening layer in src/prompt_security.py. Split EXTRACTOR_PROMPT into EXTRACTOR_SYSTEM (task instructions + goal, trusted) and a second message built with untrusted_context_message() (raw page content, sandboxed with <<<UNTRUSTED_SOURCE_DATA>>> guards). This aligns the extractor with every other external-content injection site in the codebase (agent_loop, chat_processor, chat_routes). Fixes #3044 Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
24 lines
921 B
Python
24 lines
921 B
Python
# src/goal_based_extractor.py
|
|
"""
|
|
Goal-based content extraction prompt inspired by Alibaba Tongyi DeepResearch.
|
|
"""
|
|
|
|
EXTRACTOR_SYSTEM = """Extract relevant information from a webpage for a given research goal.
|
|
|
|
Goal: {goal}
|
|
|
|
Task guidelines:
|
|
1. Locate the specific sections directly related to the goal within the provided webpage content.
|
|
2. Identify and extract the most relevant information; output full original context where possible, up to three or more paragraphs.
|
|
3. Organize into a concise paragraph with logical flow, judging each piece of information's contribution to the goal.
|
|
|
|
Respond in JSON with exactly these fields: "rational", "evidence", "summary".
|
|
|
|
Example:
|
|
{{
|
|
"rational": "This section discusses X which directly relates to the goal of understanding Y",
|
|
"evidence": "Full quotes and context from the page...",
|
|
"summary": "Concise summary of how this information answers the goal"
|
|
}}
|
|
"""
|