mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-17 10:15:27 -04:00
cookbook agent debug loop: persistent log files, auto-adopt orphan tmux, Codex/Claude skill parity
Three converging fixes so the chat agent + external Codex/Claude skills can actually debug a crashed serve instead of staring at a post-crash neofetch banner:
* Serves now `tee` to /tmp/odysseus-tmux/SESSION.log on the host running them. Runner saves fds 3/4 before the tee and restores them right before `exec ${SHELL}`, so the post-crash interactive zsh banner does NOT pollute the log file.
* `tail_serve_output` (chat agent) and `/api/codex/cookbook/output/{sid}` (Codex+Claude skills) both prefer the persistent log file over the tmux pane. The pane is the fallback for sessions that predate the tee runner. Default tail bumped from 150 to 400 lines.
* `list_served_models` "recent log" snippet seeks to the Traceback line instead of showing the last 6 lines (which was always the bash prompt).
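The first and third bullets can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the commit's code: `build_runner_prologue`, `build_runner_epilogue`, and `recent_log_snippet` are hypothetical names, the restore line `exec 1>&3 2>&4 3>&- 4>&-` is an assumption about how the saved fds are put back, and `session_id` is assumed to be validated before it reaches this code.

```python
import shlex

LOG_DIR = "/tmp/odysseus-tmux"


def build_runner_prologue(session_id: str) -> list[str]:
    """Bash lines that save the original fds, then tee stdout+stderr to a log."""
    log_path = shlex.quote(f"{LOG_DIR}/{session_id}.log")
    return [
        "#!/bin/bash",
        f"mkdir -p {LOG_DIR} 2>/dev/null || true",
        "exec 3>&1 4>&2",                     # keep copies of the pane's stdout/stderr
        f"exec > >(tee -a {log_path}) 2>&1",  # from here on, output is mirrored to the log
    ]


def build_runner_epilogue() -> list[str]:
    """Bash lines that undo the tee before handing the pane to a shell.

    Assumption: restoring fds 1/2 from 3/4 and closing 3/4 is enough to keep
    the interactive shell's banner out of the log file.
    """
    return [
        "exec 1>&3 2>&4 3>&- 4>&-",
        'exec "${SHELL}"',
    ]


def recent_log_snippet(log_text: str, fallback_lines: int = 6, max_lines: int = 40) -> str:
    """Return the log starting at the most recent Python traceback header.

    Falls back to the last few lines when no traceback is present.
    """
    lines = log_text.splitlines()
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].startswith("Traceback (most recent call last)"):
            return "\n".join(lines[i:i + max_lines])
    return "\n".join(lines[-fallback_lines:])
```

Seeking to the traceback header instead of taking a fixed tail means the snippet stays useful even when prompt or banner lines follow the crash.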
Cookbook auto-adoption sweep on `/api/cookbook/tasks/status`: every 20s (rate-limited) the cookbook SSHes each configured server, finds `serve-*` / `cookbook-*` tmux sessions running an actual model process (vllm/python/llama-server/etc., filtered via `pane_current_command`), and writes them into state.tasks. So when the agent falls back to raw ssh+tmux, the session appears in the Cookbook UI on the next poll.
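The two gates described above (the 20s rate limit and the live-process filter) can be sketched in isolation like this. `SweepGate` and `is_live_serve` are hypothetical names used only for illustration; the process set is copied from the diff, and starting the gate with `None` so the first call always runs is a choice made for this sketch.

```python
import time
from typing import Callable, Iterable, Optional

# Foreground commands treated as "a model is actually running" (copied from the diff).
LIVE_PROCS = {"python", "python3", "vllm", "llama-server", "llama_cpp_main",
              "sglang", "lmdeploy", "ollama", "node", "uvicorn"}


class SweepGate:
    """Allows at most one sweep per `min_interval_s` seconds."""

    def __init__(self, min_interval_s: float = 20.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last: Optional[float] = None  # None means "never swept yet"

    def should_run(self) -> bool:
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval_s:
            return False
        self._last = now
        return True


def is_live_serve(session_id: str, pane_commands: Iterable[str]) -> bool:
    """True for serve-*/cookbook-* sessions whose pane runs a known model process."""
    if not session_id.startswith(("serve-", "cookbook-")):
        return False
    return any(cmd in LIVE_PROCS for cmd in pane_commands)
```

Injecting the clock keeps the gate testable without sleeping; in normal use the default `time.monotonic` is unaffected by wall-clock adjustments.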
`serve_model` error path now reads `data["detail"]` in addition to `data["error"]` so the FastAPI HTTPException message ("Invalid characters in cmd") actually reaches the agent instead of being swallowed as a generic "Serve failed". Tool description updated to warn against `cd …`/`source …`/`&&` prefixes.
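A minimal sketch of the error-path change, assuming the response body has already been parsed into `data`. `extract_error_message` is a hypothetical helper name; the only behaviour taken from the commit message is that both `data["error"]` and `data["detail"]` are consulted before falling back to the generic message.

```python
from typing import Any


def extract_error_message(data: Any, default: str = "Serve failed") -> str:
    """Pick the most specific error text from a JSON error body.

    FastAPI's HTTPException serialises its message under "detail", so a
    client that only reads "error" would miss it.
    """
    if not isinstance(data, dict):
        return default
    for key in ("error", "detail"):
        value = data.get(key)
        if isinstance(value, str):
            if value.strip():
                return value.strip()
        elif value:
            # Non-string payloads (e.g. a list of validation errors) are stringified.
            return str(value)
    return default
```

With this shape, a body such as `{"detail": "Invalid characters in cmd"}` surfaces the real reason to the agent instead of the default text.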
Intent-without-action supervisor in agent_loop: when the model writes "Let me tail the output" / "I'll check the logs" / "Let me investigate" and ends the turn without emitting a tool call, the loop injects a sharp system nudge ("You said you would X — DO IT NOW") and continues. Capped at 2 nudges per chat so a model that genuinely cannot use the tool does not pin the loop.
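The supervisor's decision can be sketched as a small pure function. Everything here is illustrative: `maybe_nudge`, the regular expression, and the nudge wording are assumptions, and only the cap of 2 nudges per chat and the "stated intent, no tool call" trigger come from the description above.

```python
import re
from typing import Optional, Sequence

MAX_NUDGES_PER_CHAT = 2

# Hypothetical pattern for "stated intent" phrases such as
# "Let me tail the output" or "I'll check the logs".
INTENT_RE = re.compile(
    r"\b(?:let me|i['’]ll|i will)\s+(?:tail|check|investigate|look at|inspect)\b",
    re.IGNORECASE,
)


def maybe_nudge(reply_text: str, tool_calls: Sequence[object],
                nudges_sent: int) -> Optional[str]:
    """Return a system nudge if the model promised an action but made no tool call.

    Returns None when a tool call was emitted, when no intent phrase is found,
    or when the per-chat nudge cap has been reached.
    """
    if tool_calls or nudges_sent >= MAX_NUDGES_PER_CHAT:
        return None
    match = INTENT_RE.search(reply_text)
    if match is None:
        return None
    return (f'You wrote "{match.group(0)}" but emitted no tool call. '
            "DO IT NOW: call the tool in this turn.")
```

The caller would append the returned string as a system message, increment its nudge counter, and continue the loop; the cap keeps a model that cannot use the tool from spinning indefinitely.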
Codex/Claude skill parity: adds `/cookbook/cached`, `/cookbook/presets`, `/cookbook/preset/{name}`, `/cookbook/adopt` so external agents have the same surface as the chat agent. SKILL.md docs + odysseus_api.py wrapper updated for both bundles.
`adopt_served_model` promoted to the always-on tool set so the agent has a documented fallback when serve_model rejects a cmd.
Also various cookbook UI tweaks accumulated alongside the above (cookbook.js, cookbookRunning.js, cookbookServe.js, cookbook-diagnosis.js, settings.js, style.css).
+183
-2
@@ -998,6 +998,21 @@ def setup_cookbook_routes() -> APIRouter:
         else:
             # ── Linux/Termux: bash + tmux (existing flow) ──
             runner_lines = ["#!/bin/bash"]
+            # Mirror every line of stdout+stderr into a persistent log file
+            # on the host running the serve. This is the file tail_serve_output
+            # reads when the tmux pane has been overwritten by the post-crash
+            # bash prompt — without it, the agent's diagnostic tool sees the
+            # neofetch banner instead of the actual Python traceback.
+            # We save the original fds to 3/4 so we can RESTORE them before
+            # `exec ${SHELL}` at the end of the script. Without that restore,
+            # the post-crash interactive shell's neofetch banner ALSO gets
+            # teed into the log file and `tail -N` returns ONLY the banner —
+            # the actual traceback ends up earlier than the tail window.
+            runner_lines.append("mkdir -p /tmp/odysseus-tmux 2>/dev/null || true")
+            runner_lines.append("exec 3>&1 4>&2")
+            runner_lines.append(
+                f"exec > >(tee -a /tmp/odysseus-tmux/{session_id}.log) 2>&1"
+            )
             runner_lines.extend(_user_shell_path_bootstrap())
             runner_lines.append('ODYSSEUS_PREFLIGHT_EXIT=""')
             # Put Odysseus's own venv bin on PATH (local runs only) so the serve
@@ -1940,6 +1955,151 @@ def setup_cookbook_routes() -> APIRouter:
 
         return {"models": out}
 
+    # Rate-limit for the orphan-tmux adoption sweep. The UI polls
+    # tasks/status every ~3s; we don't want to SSH every host on every
+    # poll. 20s is fast enough that a model the agent launched in the
+    # background shows up "almost immediately" in the UI without being
+    # wasteful.
+    _last_orphan_sweep_ts = [0.0]
+    _ORPHAN_SWEEP_MIN_INTERVAL_S = 20.0
+
+    def _maybe_sweep_orphans(tasks: list, state: dict) -> None:
+        """Scan each configured cookbook server for `serve-*` tmux sessions
+        the cookbook doesn't know about and adopt them into state.tasks.
+
+        Writes are conditional: if no orphans are found, nothing is touched.
+        Rate-limited so polling UIs don't trigger SSH on every refresh.
+        """
+        import time as _time
+        import subprocess
+        logger.info(f"_maybe_sweep_orphans: entered, last_ts={_last_orphan_sweep_ts[0]}")
+        now = _time.monotonic()
+        if now - _last_orphan_sweep_ts[0] < _ORPHAN_SWEEP_MIN_INTERVAL_S:
+            logger.info(f"_maybe_sweep_orphans: rate-limited, {now - _last_orphan_sweep_ts[0]:.1f}s since last")
+            return
+        _last_orphan_sweep_ts[0] = now
+
+        env = state.get("env") if isinstance(state, dict) else {}
+        servers = env.get("servers") if isinstance(env, dict) else []
+        logger.info(f"orphan sweep starting: {len(servers) if isinstance(servers, list) else 0} server(s), known_sids={len([t for t in tasks if isinstance(t, dict) and t.get('sessionId')])}")
+        if not isinstance(servers, list):
+            return
+
+        known_sids = {
+            t.get("sessionId") for t in tasks
+            if isinstance(t, dict) and t.get("sessionId")
+        }
+
+        adopted_any = False
+        for srv in servers:
+            if not isinstance(srv, dict):
+                continue
+            host = (srv.get("host") or "").strip()
+            if not host:
+                continue  # local-only entry; the /proc scan handles it
+            if not _REMOTE_HOST_RE.match(host):
+                continue
+            sport = str(srv.get("port") or "").strip()
+            ssh_base = ["ssh", "-o", "ConnectTimeout=4", "-o", "StrictHostKeyChecking=no"]
+            if sport and sport != "22":
+                if not _SSH_PORT_RE.match(sport):
+                    continue
+                ssh_base.extend(["-p", sport])
+
+            try:
+                ls = subprocess.run(
+                    ssh_base + [host, "tmux ls 2>/dev/null"],
+                    timeout=6, capture_output=True, text=True,
+                )
+            except Exception:
+                continue
+            for line in (ls.stdout or "").splitlines():
+                sid = line.split(":", 1)[0].strip()
+                if not sid or not _SESSION_ID_RE.match(sid):
+                    continue
+                # Only adopt sessions that LOOK like model serves; ignore
+                # bare numeric tmux sessions and unrelated work.
+                if not (sid.startswith("serve-") or sid.startswith("cookbook-")):
+                    continue
+                if sid in known_sids:
+                    continue
+                # Skip zombie / idle-shell sessions. A tmux session left
+                # over from a crashed vllm just shows a bash prompt —
+                # adopting it would pollute the UI with "running" tasks
+                # that aren't actually serving anything. pane_current_command
+                # is the foreground process in the pane right now; only
+                # real model serves leave a python/vllm/etc. process there.
+                try:
+                    pc = subprocess.run(
+                        ssh_base + [host, "tmux", "list-panes", "-t", sid,
+                                    "-F", "#{pane_current_command}"],
+                        timeout=4, capture_output=True, text=True,
+                    )
+                    cur = (pc.stdout or "").strip().splitlines()
+                except Exception:
+                    cur = []
+                LIVE_PROCS = {"python", "python3", "vllm", "llama-server",
+                              "llama_cpp_main", "sglang", "lmdeploy",
+                              "ollama", "node", "uvicorn"}
+                if not any(c in LIVE_PROCS for c in cur):
+                    continue
+                # Try to recover a plausible repo_id + port from the
+                # pane buffer. Cheap heuristic — if we can't, register
+                # with placeholder fields; the UI still shows it.
+                try:
+                    cap = subprocess.run(
+                        ssh_base + [host, "tmux", "capture-pane", "-t", sid, "-p", "-S", "-300"],
+                        timeout=6, capture_output=True, text=True,
+                    )
+                    pane = cap.stdout or ""
+                except Exception:
+                    pane = ""
+                import re as _re_orphan
+                # vLLM banner: "model /path/...". Falls back to the
+                # raw vllm-serve command if the banner already scrolled.
+                m_model = _re_orphan.search(r"model\s+(\S+)", pane)
+                model = m_model.group(1) if m_model else ""
+                if not model:
+                    m_serve = _re_orphan.search(r"vllm\s+serve\s+(\S+)", pane)
+                    model = m_serve.group(1) if m_serve else f"adopted:{sid}"
+                m_port = _re_orphan.search(r"--port\s+(\d+)", pane)
+                port = int(m_port.group(1)) if m_port else 0
+
+                import time as _t2
+                tasks.append({
+                    "id": sid,
+                    "sessionId": sid,
+                    "name": model.split("/")[-1] if "/" in model else model,
+                    "type": "serve",
+                    "status": "running",
+                    "output": f"Auto-adopted from orphan tmux session on {host}. "
+                              "Open the task to see live output.",
+                    "ts": int(_t2.time() * 1000),
+                    "payload": {
+                        "repo_id": model,
+                        "remote_host": host,
+                        "_cmd": "(orphan tmux session — original launch cmd unknown)",
+                        "port": port,
+                    },
+                    "remoteHost": host,
+                    "sshPort": sport,
+                    "platform": "linux",
+                    "_serveReady": False,
+                    "_endpointAdded": False,
+                    "_adoptedExternally": True,
+                })
+                known_sids.add(sid)
+                adopted_any = True
+                logger.info(f"auto-adopted orphan tmux session {sid!r} on {host}")
+
+        if adopted_any:
+            try:
+                from core.atomic_io import atomic_write_json
+                state["tasks"] = tasks
+                atomic_write_json(_cookbook_state_path, state)
+            except Exception as e:
+                logger.warning(f"orphan sweep: state write failed: {e}")
+
     @router.get("/api/cookbook/tasks/status")
     async def cookbook_tasks_status(request: Request):
         """Check status of all active cookbook tmux sessions.
@@ -1993,6 +2153,7 @@ def setup_cookbook_routes() -> APIRouter:
 
         # Load saved tasks from cookbook state
         tasks = []
+        state = {}
         if _cookbook_state_path.exists():
             try:
                 state = json.loads(_cookbook_state_path.read_text(encoding="utf-8"))
@@ -2004,6 +2165,21 @@ def setup_cookbook_routes() -> APIRouter:
             except Exception:
                 pass
 
+        # Orphan-tmux auto-adoption sweep. When the agent (or anyone)
+        # SSH-launches a `serve-*` tmux session — usually because
+        # serve_model rejected `source ... && vllm ...` or because of a
+        # manual relaunch via tmux send-keys — that session is invisible
+        # to the cookbook UI even though it's a live model server. The
+        # sweep finds those orphans on each configured remote host and
+        # writes them into state.tasks with _adoptedExternally=True, so
+        # they show up in the UI on the next poll without anyone having
+        # to remember to call adopt_served_model. Rate-limited via the
+        # module-level _last_orphan_sweep so we don't SSH every 3s.
+        try:
+            _maybe_sweep_orphans(tasks, state)
+        except Exception as _sweep_e:
+            logger.warning(f"orphan sweep failed (non-fatal): {_sweep_e!r}")
+
         results = []
         for task in tasks:
             session_id = task.get("sessionId", "")
@@ -2063,7 +2239,12 @@ def setup_cookbook_routes() -> APIRouter:
                 if _tport and _tport != "22":
                     ssh_base.extend(["-p", str(_tport)])
                 check_cmd = ssh_base + [remote, "tmux", "has-session", "-t", session_id]
-                capture_cmd = ssh_base + [remote, "tmux", "capture-pane", "-t", session_id, "-p", "-S", "-50"]
+                # Capture 500 lines (was 50) so a Python traceback survives
+                # the post-crash neofetch banner + bash prompt that otherwise
+                # fills the visible tail. Without this, output_tail ends up
+                # as just "Locale: C / Ubuntu_Odysseus ❯" and the agent
+                # can't diagnose the actual error.
+                capture_cmd = ssh_base + [remote, "tmux", "capture-pane", "-t", session_id, "-p", "-S", "-500"]
             elif IS_WINDOWS:
                 # LOCAL Windows task: launched as a detached process (no tmux).
                 # Liveness comes from the <session>.pid file, output from the
@@ -2072,7 +2253,7 @@ def setup_cookbook_routes() -> APIRouter:
                 capture_cmd = None
             else:
                 check_cmd = ["tmux", "has-session", "-t", session_id]
-                capture_cmd = ["tmux", "capture-pane", "-t", session_id, "-p", "-S", "-50"]
+                capture_cmd = ["tmux", "capture-pane", "-t", session_id, "-p", "-S", "-500"]
 
             local_win_task = (not remote) and IS_WINDOWS
 