cookbook agent debug loop: persistent log files, auto-adopt orphan tmux, Codex/Claude skill parity

Three converging fixes so the chat agent + external Codex/Claude skills can actually debug a crashed serve instead of staring at a post-crash neofetch banner:

* Serves now `tee` to /tmp/odysseus-tmux/SESSION.log on the host running them. Runner saves fds 3/4 before the tee and restores them right before `exec ${SHELL}`, so the post-crash interactive zsh banner does NOT pollute the log file.
* `tail_serve_output` (chat agent) and `/api/codex/cookbook/output/{sid}` (Codex+Claude skills) both prefer the persistent log file over the tmux pane. Pane is fallback for sessions predating the tee runner. Default tail bumped 150 -> 400.
* `list_served_models` "recent log" snippet seeks to the Traceback line instead of showing the last 6 lines (which was always the bash prompt).

Cookbook auto-adoption sweep on `/api/cookbook/tasks/status`: every 20s (rate-limited) the cookbook SSHes each configured server, finds `serve-*` / `cookbook-*` tmux sessions running an actual model process (vllm/python/llama-server/etc., filtered via `pane_current_command`), and writes them into state.tasks. So when the agent falls back to raw ssh+tmux, the session appears in the Cookbook UI on the next poll.

`serve_model` error path now reads `data["detail"]` in addition to `data["error"]` so the FastAPI HTTPException message ("Invalid characters in cmd") actually reaches the agent instead of being swallowed as a generic "Serve failed". Tool description updated to warn against `cd …`/`source …`/`&&` prefixes.

Intent-without-action supervisor in agent_loop: when the model writes "Let me tail the output" / "I'll check the logs" / "Let me investigate" and ends the turn without emitting a tool call, the loop injects a sharp system nudge ("You said you would X — DO IT NOW") and continues. Capped at 2 nudges per chat so a model that genuinely cannot use the tool does not pin the loop.

Codex/Claude skill parity: adds `/cookbook/cached`, `/cookbook/presets`, `/cookbook/preset/{name}`, `/cookbook/adopt` so external agents have the same surface as the chat agent. SKILL.md docs + odysseus_api.py wrapper updated for both bundles.

`adopt_served_model` promoted to the always-on tool set so the agent has a documented fallback when serve_model rejects a cmd.

Also various cookbook UI tweaks accumulated alongside the above (cookbook.js, cookbookRunning.js, cookbookServe.js, cookbook-diagnosis.js, settings.js, style.css).
This commit is contained in:
pewdiepie-archdaemon
2026-06-04 23:27:18 +09:00
parent 041c03bf11
commit 9112861d8e
19 changed files with 1529 additions and 151 deletions
+53 -2
View File
@@ -613,6 +613,20 @@ function _rerenderCachedModels() {
panelHtml += `<label class="hwfit-backend-vllm hwfit-backend-sglang">${_l('Max Seqs','Maximum concurrent requests. Lower = less memory. Default 4 — prosumer GPUs often OOM on vLLM default 256 during CUDA graph capture.')}<input type="text" class="hwfit-sf" data-field="max_seqs" value="${esc(sv('max_seqs', '4'))}" placeholder="4" /></label>`;
panelHtml += `<label>${_l('Dtype','Data type for weights. auto picks best for GPU')}<select class="hwfit-sf" data-field="dtype">${dtypeOpts}</select></label>`;
panelHtml += `<label class="hwfit-backend-vllm">${_l('KV Cache','vLLM --kv-cache-dtype. auto uses the model/runtime default; fp8 reduces KV memory for long context.')}<select class="hwfit-sf" data-field="vllm_kv_cache_dtype" style="height:32px;">${vllmKvCacheOpts}</select></label>`;
// Attention backend selector — pin the kernel impl. Default `auto` lets
// vLLM pick FlashInfer (which JITs on first use and breaks on older
// system nvcc) → FlashAttention → xformers. Forcing FLASH_ATTN skips
// the JIT entirely, fixing the `nvcc fatal: Unsupported gpu
// architecture 'compute_89'` failure mode on Ada / Hopper hosts.
const vllmAttnBackendOpts = ['auto', 'FLASH_ATTN', 'XFORMERS', 'FLASHINFER', 'TORCH_SDPA']
.map(b => `<option value="${b === 'auto' ? '' : b}"${(sv('vllm_attn_backend','') === (b === 'auto' ? '' : b)) ? ' selected' : ''}>${b}</option>`).join('');
panelHtml += `<label class="hwfit-backend-vllm">${_l('Attention','vLLM VLLM_ATTENTION_BACKEND. auto = vLLM picks (often FLASHINFER, which JITs and can fail on old nvcc). FLASH_ATTN skips the JIT entirely.')}<select class="hwfit-sf" data-field="vllm_attn_backend" style="height:32px;">${vllmAttnBackendOpts}</select></label>`;
// Free-text env-vars field. Anything pasted here is prepended to the
// launch command verbatim. Use for CUDACXX, PATH overrides, NCCL_*
// tuning, or any other KEY=VALUE pair that doesn't have a dedicated
// field. After the venv activate runs, $VIRTUAL_ENV / $PATH / etc. are
// already exported so they expand correctly here.
panelHtml += `<label class="hwfit-backend-vllm hwfit-backend-sglang" style="flex:1 1 100%;">${_l('Env','Extra KEY=VALUE env-var pairs prepended to the launch (space-separated). Example: CUDACXX=$VIRTUAL_ENV/lib/python3.10/site-packages/nvidia/cuda_nvcc/bin/nvcc — points flashinfer at the venv-bundled nvcc when the system one is too old for your GPU.')}<input type="text" class="hwfit-sf" data-field="extra_env" value="${esc(sv('extra_env',''))}" placeholder="CUDACXX=/path/to/nvcc NCCL_P2P_DISABLE=1" style="width:100%;" /></label>`;
panelHtml += `</div>`;
// Row 2b: Diffusers settings
const diffDtypeOpts = ['bfloat16','float16','float32'].map(d => `<option value="${d}"${sv('diff_dtype','bfloat16')===d?' selected':''}>${d}</option>`).join('');
@@ -1643,6 +1657,35 @@ function _rerenderCachedModels() {
// Launch button
panel.querySelector('.hwfit-serve-launch').addEventListener('click', async (ev) => {
const _launchBtn = ev.currentTarget;
// Immediate visual feedback. The GPU probe + backend-warning prompt
// below can take ~1-2s before the task UI shows up, leaving the
// button looking dead. Drop in the same whirlpool spinner the rest of
// the cookbook uses (Probe GPUs, dependency installs, etc.) right
// away; restored on any early-return / failure path below.
const _origBtnHtml = _launchBtn.innerHTML;
const _origBtnDisabled = _launchBtn.disabled;
let _launchingWp = null;
const _restoreLaunchBtn = () => {
try { _launchingWp?.destroy?.(); } catch {}
_launchingWp = null;
_launchBtn.innerHTML = _origBtnHtml;
_launchBtn.disabled = _origBtnDisabled;
};
_launchBtn.disabled = true;
_launchBtn.innerHTML = '';
const _launchingWrap = document.createElement('span');
_launchingWrap.className = 'hwfit-serve-launching';
_launchingWrap.style.cssText = 'display:inline-flex;align-items:center;gap:6px;';
_launchingWp = spinnerModule.createWhirlpool(18);
if (_launchingWp?.element) {
_launchingWp.element.style.margin = '0';
_launchingWp.element.style.transform = 'translateY(-2px)';
_launchingWrap.appendChild(_launchingWp.element);
}
const _launchingLabel = document.createElement('span');
_launchingLabel.textContent = 'Launching…';
_launchingWrap.appendChild(_launchingLabel);
_launchBtn.appendChild(_launchingWrap);
// Final safety net: never launch with ctx beyond the model's trained
// limit (or the absolute sanity ceiling when the limit is unknown). A
// stale preset or typo (e.g. 16000000) overflows and, with a quantized
@@ -1650,7 +1693,14 @@ function _rerenderCachedModels() {
// command (then we respect their literal text).
if (!_cmdManuallyEdited) _clampCtx(true);
if (!_cmdManuallyEdited) updateCmd();
const launchCmd = _cmdTextarea ? _cmdTextarea.value.trim() : panel._cmd;
// Pasted commands often carry hidden newlines / CRs / tabs from copies
// out of model cards or wrapped help text. The backend cmd allowlist
// rejects \n / \r outright (`Invalid characters in cmd`), so collapse
// all whitespace to single spaces before launch — same effect as the
// user manually re-flowing the textarea, no behavior change.
const _rawLaunchCmd = _cmdTextarea ? _cmdTextarea.value : panel._cmd;
const launchCmd = String(_rawLaunchCmd || '').replace(/\s+/g, ' ').trim();
if (_cmdTextarea && _cmdTextarea.value !== launchCmd) _cmdTextarea.value = launchCmd;
const serveState = {};
panel.querySelectorAll('.hwfit-sf').forEach(el => {
if (el.type === 'checkbox') serveState[el.dataset.field] = el.checked;
@@ -1659,6 +1709,7 @@ function _rerenderCachedModels() {
serveState.backend = serveState.backend || (_detectBackend(m).backend) || 'vllm';
const backendWarning = _serveBackendWarning(m, repo, serveState.backend, serveState);
if (backendWarning) {
_restoreLaunchBtn();
await window.styledConfirm(backendWarning.body, {
title: backendWarning.title,
confirmText: 'Edit settings',
@@ -1689,7 +1740,7 @@ function _rerenderCachedModels() {
`No GPU detected on ${_probeHost ? _probeHost : 'this host'}. ${serveState.backend.toUpperCase()} needs a visible CUDA/ROCm accelerator to start — launching now will most likely crash early.\n\nLaunch anyway?`,
{ title: 'No GPU detected', confirmText: 'Launch anyway', cancelText: 'Cancel', danger: true },
);
if (!_proceed) return;
if (!_proceed) { _restoreLaunchBtn(); return; }
}
} catch {
// Network / probe failure — don't block. Better to let the launch