mirror of
https://github.com/pewdiepie-archdaemon/odysseus.git
synced 2026-06-17 02:05:22 -04:00
Cookbook polish: auto-reconnect, ctx slider fixes, scoring, lots of UI
Backend (services/hwfit + routes):
- VRAM column sort now shows global highest first (was special-cased to
ascending then truncated top-N, which made "highest VRAM" mathematically
unreachable). Every column path uses reverse=True for the truncation.
- Hardware probe cache TTL 30min -> 24h so changing filters doesn't keep
re-probing the rig during a session; Rescan button still forces fresh.
- Multi-GPU rigs filter GGUF Q*/IQ quants (vLLM/SGLang can't serve them);
default non-prequantized to BF16 on 2+ GPUs.
- AWQ / AWQ-8bit / GPTQ-8bit get a -1.0 quality penalty so FP8 wins ties.
- Version-aware tiebreaker (parse Mn.n / Vn) — MiniMax-M2.7 ranks above M2.5.
- hf_models.json: zai-org/GLM-5.1 added; zai-org/GLM-5 quantization flipped
Q4_K_M -> BF16. DeepSeek-V4-Flash / -Pro + their -Base variants registered
with new FP4-MoE-Mixed / FP8-Mixed quant keys (calibrated BPP from the
actual 156 GB / 284 GB disk footprints).
- New FP4-MoE-Mixed + FP8-Mixed entries in QUANT_BPP / QUANT_SPEED_MULT /
QUANT_QUALITY_PENALTY / QUANT_BYTES_PER_PARAM / PREQUANTIZED_PREFIXES.
Frontend — Scan/Download:
- Engine + Quant swapped in the toolbar; Quant defaults to "All".
- Ctx (range slider) ported from origin/main: 8k/16k/32k/50k/128k/Max. Drag
re-sorts by vram ascending (smallest fitting first); back to Max → score.
- Ctx slider rail now visible — was background:transparent in a duplicate
later-cascade rule. Hardcoded grey + !important.
- Search input moved to the far right of the toolbar.
- Type/Standard default; "Context" not uppercased; Search placeholder dimmed.
- Engine "?" + Quant "?" inline help chips inside their dropdown boxes.
- Fit-column dot toggles fit-only filter; un-toggling re-sorts by VRAM desc.
- Quant column truncates to 9 chars + ellipsis ("FP4-MoE-M..."), full in
tooltip. Smart title-suffix strips the parts already in the repo name
(QuantTrio/MiniMax-M2-AWQ + quant AWQ-4bit -> just "(4bit)").
- Conditional warning for safetensors models on non-GPU rigs only.
- Dependency Install / Installed / Installed▾ / N/A all 75.85px wide.
- Rebuild llama.cpp moved into the llama_cpp dep row, styled as a tag.
- Foldable Download admin-card (h2 chevron); line under h2 only when folded.
- HF token save gets a green ✓ + "Saved" flash.
- Cached scan no longer counts stalled rows as downloaded.
- Footer: "Request it →" link with GitHub mark to the public discussion
(#1962) for model-add requests.
Frontend — Running tab:
- Strict download-finish check (DOWNLOAD_OK or /snapshots/, not bare
"Download complete"). True overall % for multi-shard downloads:
((N-1)+frac)/total instead of hf_transfer's per-shard aggregate.
- ETA in the uptime ticker: "downloading: 12m 34s · ETA 1h 23m".
- Clear button kills the tmux session too; if the output still shows a
live shard line, the pill is hidden + relabels as "reconnect" + revives
on click.
- Self-heal: on cookbook open AND every bg-monitor cycle (10s, throttled
to 8s), scan persisted done/error/crashed downloads and probe their
tmux session — if alive, flip status back to running and reattach.
- Per-launch zombie probe: clicking Download on a model whose persisted
state is done but tmux is still alive revives the existing task and
refuses to start a duplicate.
- Pre-launch GPU probe: vllm / sglang / diffusers serve check
/api/cookbook/gpus first; warns + confirms if no GPU is visible.
- Server-side state guard: rejects "done" POSTs for downloads lacking
DOWNLOAD_OK / DOWNLOAD_FAILED / /snapshots/ when the last-mentioned
shard is N<total — stale tabs can't poison persisted state any more.
- Running count includes tasks whose output looks active even if persisted
status got stuck. Dir text on the running row, font matched to uptime.
Serve panel:
- Ctx text input always resets to model max on open (default 20000 when
metadata is missing).
- Max Seqs default 8 -> 4. KV Cache dtype select 32px tall.
- Lightning icon on Launch (same as Action toggle).
- Diagnosis card simplified (no fold/copy/dismiss), suggestion font
matches body; action buttons get icons on the left (Retry/Copy/Edit/
Install/Kill/Switch/etc.).
- Incomplete-download serve warning when model status is
downloading / stalled / has_incomplete.
- MTP "?" tooltip ("supported on a few model families … up to ~3× faster").
This commit is contained in:
@@ -560,6 +560,15 @@ function _rerenderCachedModels() {
|
||||
+ `</div>`;
|
||||
|
||||
let panelHtml = `<div class="hwfit-serve-panel">${_slotsHtml}`;
|
||||
// Warn when serving a model whose download hasn't fully completed —
|
||||
// the user CAN still hit Launch (vLLM/llama-server will start, then
|
||||
// crash trying to read missing shards), but they should know.
|
||||
if (m && (m.status === 'downloading' || m.status === 'stalled' || m.has_incomplete)) {
|
||||
const _warnText = m.status === 'stalled'
|
||||
? `This model looks like a stale download shell (${esc(m.size || '0 KB')}). The weights aren't on disk — the serve will fail to load. Re-download first, or pick another model.`
|
||||
: `This model's download isn't complete yet (${esc(m.size || 'partial')}). The serve will start but is likely to crash on a missing shard. Wait for the download to finish, or relaunch after it's done.`;
|
||||
panelHtml += `<div class="hwfit-serve-warn" style="margin:0 0 8px;padding:6px 10px;border-radius:5px;font-size:11px;background:color-mix(in srgb, var(--color-warning, #f0ad4e) 14%, transparent);border:1px solid color-mix(in srgb, var(--color-warning, #f0ad4e) 40%, transparent);color:var(--color-warning, #f0ad4e);display:flex;gap:6px;align-items:flex-start;line-height:1.4;"><span aria-hidden="true">⚠</span><span>${_warnText}</span></div>`;
|
||||
}
|
||||
// Row 1: Backend + Server + Env
|
||||
panelHtml += `<div class="hwfit-serve-row">`;
|
||||
const _backendChoices = _isWindows()
|
||||
@@ -597,13 +606,13 @@ function _rerenderCachedModels() {
|
||||
panelHtml += `<label class="hwfit-backend-vllm hwfit-backend-sglang">${_l('TP','Tensor Parallelism — split model across N GPUs')}<select class="hwfit-sf" data-field="tp">${tpOpts}</select></label>`;
|
||||
// ctx resets to the model's max on every panel open (the real ctx slider
|
||||
// lives in the Scan/Download toolbar — see cookbook.js .hwfit-ctx-control).
|
||||
panelHtml += `<label>${_l('Context','Max tokens per request — resets to the model max on every open. Lower = less VRAM')}<input type="text" class="hwfit-sf" data-field="ctx" value="${esc(m.context_length || m.context || '8192')}" /></label>`;
|
||||
panelHtml += `<label>${_l('Context','Max tokens per request — resets to the model max on every open. Lower = less VRAM')}<input type="text" class="hwfit-sf" data-field="ctx" value="${esc(m.context_length || m.context || '20000')}" /></label>`;
|
||||
panelHtml += `<label>${_l('GPU','Which GPU to use. Leave empty for default')}<input type="text" class="hwfit-sf" data-field="gpu_id" value="${esc(sv('gpu_id', ''))}" placeholder="auto" style="width:50px;" /></label>`;
|
||||
panelHtml += `<label class="hwfit-backend-vllm hwfit-backend-sglang">${_l('GPU Mem','Fraction of GPU memory (0.0–1.0). Lower if OOM')}<input type="text" class="hwfit-sf" data-field="gpu_mem" value="${esc(sv('gpu_mem', '0.90'))}" /></label>`;
|
||||
panelHtml += `<label class="hwfit-backend-vllm">${_l('Swap','CPU swap space in GB. Leave empty to omit (removed in newer vLLM)')}<input type="text" class="hwfit-sf" data-field="swap" value="${esc(sv('swap', ''))}" placeholder="off" /></label>`;
|
||||
panelHtml += `<label class="hwfit-backend-vllm hwfit-backend-sglang">${_l('Max Seqs','Maximum concurrent requests. Lower = less memory. Default 8 — prosumer GPUs often OOM on vLLM default 256 during CUDA graph capture.')}<input type="text" class="hwfit-sf" data-field="max_seqs" value="${esc(sv('max_seqs', '8'))}" placeholder="8" /></label>`;
|
||||
panelHtml += `<label class="hwfit-backend-vllm hwfit-backend-sglang">${_l('Max Seqs','Maximum concurrent requests. Lower = less memory. Default 4 — prosumer GPUs often OOM on vLLM default 256 during CUDA graph capture.')}<input type="text" class="hwfit-sf" data-field="max_seqs" value="${esc(sv('max_seqs', '4'))}" placeholder="4" /></label>`;
|
||||
panelHtml += `<label>${_l('Dtype','Data type for weights. auto picks best for GPU')}<select class="hwfit-sf" data-field="dtype">${dtypeOpts}</select></label>`;
|
||||
panelHtml += `<label class="hwfit-backend-vllm">${_l('KV Cache','vLLM --kv-cache-dtype. auto uses the model/runtime default; fp8 reduces KV memory for long context.')}<select class="hwfit-sf" data-field="vllm_kv_cache_dtype">${vllmKvCacheOpts}</select></label>`;
|
||||
panelHtml += `<label class="hwfit-backend-vllm">${_l('KV Cache','vLLM --kv-cache-dtype. auto uses the model/runtime default; fp8 reduces KV memory for long context.')}<select class="hwfit-sf" data-field="vllm_kv_cache_dtype" style="height:32px;">${vllmKvCacheOpts}</select></label>`;
|
||||
panelHtml += `</div>`;
|
||||
// Row 2b: Diffusers settings
|
||||
const diffDtypeOpts = ['bfloat16','float16','float32'].map(d => `<option value="${d}"${sv('diff_dtype','bfloat16')===d?' selected':''}>${d}</option>`).join('');
|
||||
@@ -696,7 +705,7 @@ function _rerenderCachedModels() {
|
||||
if (!_specMethods.includes(_specMethod)) _specMethods.unshift(_specMethod);
|
||||
const _specOpts = _specMethods.map(m =>
|
||||
`<option value="${m}"${m === _specMethod ? ' selected' : ''}>${m}</option>`).join('');
|
||||
panelHtml += `<label class="hwfit-sf-cb hwfit-spec-group"><input type="checkbox" class="hwfit-sf" data-field="speculative" /> Speculative <select class="hwfit-sf hwfit-spec-method" data-field="spec_method" title="vLLM --speculative-config method">${_specOpts}</select><span class="hwfit-numstep"><button type="button" class="hwfit-numstep-btn" data-step="-1" tabindex="-1" aria-label="Decrease">‹</button><input type="number" class="hwfit-sf hwfit-spec-tokens" data-field="spec_tokens" value="${esc(_specTokens)}" min="1" max="10" title="num_speculative_tokens" /><button type="button" class="hwfit-numstep-btn" data-step="1" tabindex="-1" aria-label="Increase">›</button></span></label>`;
|
||||
panelHtml += `<label class="hwfit-sf-cb hwfit-spec-group"><input type="checkbox" class="hwfit-sf" data-field="speculative" /> Speculative <select class="hwfit-sf hwfit-spec-method" data-field="spec_method" title="vLLM --speculative-config method">${_specOpts}</select><span class="hwfit-numstep"><button type="button" class="hwfit-numstep-btn" data-step="-1" tabindex="-1" aria-label="Decrease">‹</button><input type="number" class="hwfit-sf hwfit-spec-tokens" data-field="spec_tokens" value="${esc(_specTokens)}" min="1" max="10" title="num_speculative_tokens" /><button type="button" class="hwfit-numstep-btn" data-step="1" tabindex="-1" aria-label="Increase">›</button></span><span class="hwfit-help-chip hwfit-help-chip-inline" title="MTP / speculative decoding is supported on a few model families only — turn it on when the model card explicitly recommends it. On supported models it can boost inference throughput up to ~3×; on unsupported models it will either be ignored or fail to launch." style="margin-left:6px;">?</span></label>`;
|
||||
}
|
||||
if (_opts2.envVars.length) panelHtml += `<label class="hwfit-sf-cb"><input type="checkbox" class="hwfit-sf" data-field="moe_env" /> MoE Env Vars</label>`;
|
||||
panelHtml += `</div>`;
|
||||
@@ -721,7 +730,7 @@ function _rerenderCachedModels() {
|
||||
// pushes Cancel + Launch to the right.
|
||||
panelHtml += `<span class="hwfit-serve-actions-spacer"></span>`;
|
||||
panelHtml += `<button class="cookbook-btn hwfit-serve-cancel" type="button" title="Close this configuration panel">Cancel</button>`;
|
||||
panelHtml += `<button class="cookbook-btn hwfit-serve-launch">Launch</button>`;
|
||||
panelHtml += `<button class="cookbook-btn hwfit-serve-launch"><svg width="11" height="11" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" style="vertical-align:-1px;margin-right:4px;flex-shrink:0;"><polygon points="13 2 3 14 12 14 11 22 21 10 12 10 13 2"/></svg>Launch</button>`;
|
||||
panelHtml += `</div>`;
|
||||
panelHtml += `</div>`;
|
||||
|
||||
@@ -1657,6 +1666,37 @@ function _rerenderCachedModels() {
|
||||
});
|
||||
return;
|
||||
}
|
||||
// Pre-launch GPU probe — common failure pattern: vLLM/SGLang launched
|
||||
// on a host where no GPU is visible (driver missing, $CUDA_VISIBLE_DEVICES
|
||||
// unset, container without --gpus). Catch it BEFORE the user spends
|
||||
// minutes watching the task fail.
|
||||
const _needsGpu = ['vllm', 'sglang'].includes(serveState.backend)
|
||||
|| (serveState.backend === 'diffusers');
|
||||
if (_needsGpu) {
|
||||
try {
|
||||
const _probeHost = (_envState.remoteHost || '').trim();
|
||||
const _probeParams = new URLSearchParams();
|
||||
if (_probeHost) {
|
||||
_probeParams.set('host', _probeHost);
|
||||
const _sp = (_envState.servers || []).find(s => s.host === _probeHost)?.port;
|
||||
if (_sp) _probeParams.set('ssh_port', _sp);
|
||||
}
|
||||
const _probeRes = await fetch('/api/cookbook/gpus' + (_probeParams.toString() ? '?' + _probeParams : ''), { credentials: 'same-origin' });
|
||||
const _probeData = await _probeRes.json();
|
||||
const _probeGpus = Array.isArray(_probeData) ? _probeData : (_probeData.gpus || []);
|
||||
if (!_probeGpus.length) {
|
||||
const _proceed = await window.styledConfirm(
|
||||
`No GPU detected on ${_probeHost ? _probeHost : 'this host'}. ${serveState.backend.toUpperCase()} needs a visible CUDA/ROCm accelerator to start — launching now will most likely crash early.\n\nLaunch anyway?`,
|
||||
{ title: 'No GPU detected', confirmText: 'Launch anyway', cancelText: 'Cancel', danger: true },
|
||||
);
|
||||
if (!_proceed) return;
|
||||
}
|
||||
} catch {
|
||||
// Network / probe failure — don't block. Better to let the launch
|
||||
// proceed than to silently refuse because the probe endpoint
|
||||
// hiccuped (the user can read the real error in the task output).
|
||||
}
|
||||
}
|
||||
// Save in the { _byRepo, _lastUsed } schema — no legacy flat keys at
|
||||
// the root so per-model state doesn't leak between models.
|
||||
try {
|
||||
|
||||
Reference in New Issue
Block a user