Cookbook polish: auto-reconnect, ctx slider fixes, scoring, lots of UI

Backend (services/hwfit + routes):
- VRAM column sort now shows global highest first (was special-cased to
  ascending then truncated top-N, which made "highest VRAM" mathematically
  unreachable). Every column path uses reverse=True for the truncation.
- Hardware probe cache TTL 30min -> 24h so changing filters doesn't keep
  re-probing the rig during a session; Rescan button still forces fresh.
- Multi-GPU rigs filter GGUF Q*/IQ quants (vLLM/SGLang can't serve them);
  default non-prequantized to BF16 on 2+ GPUs.
- AWQ / AWQ-8bit / GPTQ-8bit get a -1.0 quality penalty so FP8 wins ties.
- Version-aware tiebreaker (parse Mn.n / Vn) — MiniMax-M2.7 ranks above M2.5.
- hf_models.json: zai-org/GLM-5.1 added; zai-org/GLM-5 quantization flipped
  Q4_K_M -> BF16. DeepSeek-V4-Flash / -Pro + their -Base variants registered
  with new FP4-MoE-Mixed / FP8-Mixed quant keys (calibrated BPP from the
  actual 156 GB / 284 GB disk footprints).
- New FP4-MoE-Mixed + FP8-Mixed entries in QUANT_BPP / QUANT_SPEED_MULT /
  QUANT_QUALITY_PENALTY / QUANT_BYTES_PER_PARAM / PREQUANTIZED_PREFIXES.
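
The VRAM sort fix in the first bullet can be illustrated with a minimal sketch. It is written in JavaScript for consistency with the diff below, not taken from the backend; the `models` rows, the `vramGb` field and the `topN` helper are hypothetical. It shows why truncating after an ascending sort can never surface the largest row, while truncating after a descending sort always does.

```javascript
// Hypothetical rows; names and VRAM figures are made up for illustration.
const models = [
  { name: 'small', vramGb: 8 },
  { name: 'medium', vramGb: 24 },
  { name: 'large', vramGb: 80 },
  { name: 'huge', vramGb: 160 },
];

// Sort a copy by VRAM, then keep only the first `limit` rows.
function topN(rows, limit, descending) {
  const sorted = [...rows].sort((a, b) =>
    descending ? b.vramGb - a.vramGb : a.vramGb - b.vramGb);
  return sorted.slice(0, limit);
}

// Old behavior: ascending, then truncate. The largest row is cut off.
const ascendingCut = topN(models, 2, false).map(m => m.name);

// New behavior: descending, then truncate. The largest row comes first.
const descendingCut = topN(models, 2, true).map(m => m.name);
```

With a limit of 2, the ascending cut keeps only `small` and `medium`, so `huge` is dropped before the UI can show it.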

Frontend — Scan/Download:
- Engine + Quant swapped in the toolbar; Quant defaults to "All".
- Ctx (range slider) ported from origin/main: 8k/16k/32k/50k/128k/Max. Drag
  re-sorts by VRAM ascending (smallest fitting first); back to Max → score.
- Ctx slider rail now visible — was background:transparent in a duplicate
  later-cascade rule. Hardcoded grey + !important.
- Search input moved to the far right of the toolbar.
- Type/Standard default; "Context" not uppercased; Search placeholder dimmed.
- Engine "?" + Quant "?" inline help chips inside their dropdown boxes.
- Fit-column dot toggles fit-only filter; un-toggling re-sorts by VRAM desc.
- Quant column truncates to 9 chars + ellipsis ("FP4-MoE-M..."), full in
  tooltip. Smart title-suffix strips the parts already in the repo name
  (QuantTrio/MiniMax-M2-AWQ + quant AWQ-4bit -> just "(4bit)").
- Conditional warning for safetensors models on non-GPU rigs only.
- Dependency Install / Installed / Installed▾ / N/A all 75.85px wide.
- Rebuild llama.cpp moved into the llama_cpp dep row, styled as a tag.
- Foldable Download admin-card (h2 chevron); line under h2 only when folded.
- HF token save gets a green ✓ + "Saved" flash.
- Cached scan no longer counts stalled rows as downloaded.
- Footer: "Request it →" link with GitHub mark to the public discussion
  (#1962) for model-add requests.
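
The smart title-suffix rule from the Quant bullet above can be sketched as a standalone function. This is a simplified restatement of the logic in the diff below, not the shipped code: `quantSuffix` and its `maxLen` parameter are hypothetical names, and HTML escaping is left out.

```javascript
// Sketch of the title-suffix rule: drop quant parts the repo name already has.
function quantSuffix(repoId, quant, maxLen = 9) {
  const shortName = (repoId || '').split('/').pop();
  const lowerName = shortName.toLowerCase();
  // Split the quant tag on '-' or '_' and discard empty pieces.
  const parts = (quant || '').trim().split(/[-_]/).filter(Boolean);
  // Keep only the parts that do not already appear in the repo name.
  const remaining = parts.filter(p => !lowerName.includes(p.toLowerCase()));
  if (remaining.length === 0) return '';
  let display = remaining.join('-');
  // Truncate long tags to maxLen characters plus an ellipsis.
  if (display.length > maxLen) display = display.slice(0, maxLen) + '…';
  return `(${display})`;
}
```

For `QuantTrio/MiniMax-M2-AWQ` with quant `AWQ-4bit` the `AWQ` part is already in the name, so only `4bit` is kept. For `DeepSeek-V4-Flash` with `FP4-MoE-Mixed` nothing matches, so the full tag is kept and then truncated.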

Frontend — Running tab:
- Strict download-finish check (DOWNLOAD_OK or /snapshots/, not bare
  "Download complete"). True overall % for multi-shard downloads:
  ((N-1)+frac)/total instead of hf_transfer's per-shard aggregate.
- ETA in the uptime ticker: "downloading: 12m 34s · ETA 1h 23m".
- Clear button kills the tmux session too; if the output still shows a
  live shard line, the pill is hidden + relabels as "reconnect" + revives
  on click.
- Self-heal: on cookbook open AND every bg-monitor cycle (10s, throttled
  to 8s), scan persisted done/error/crashed downloads and probe their
  tmux session — if alive, flip status back to running and reattach.
- Per-launch zombie probe: clicking Download on a model whose persisted
  state is done but tmux is still alive revives the existing task and
  refuses to start a duplicate.
- Pre-launch GPU probe: vllm / sglang / diffusers serve check
  /api/cookbook/gpus first; warns + confirms if no GPU is visible.
- Server-side state guard: rejects "done" POSTs for downloads lacking
  DOWNLOAD_OK / DOWNLOAD_FAILED / /snapshots/ when the last-mentioned
  shard is N<total — stale tabs can't poison persisted state any more.
- Running count includes tasks whose output looks active even if persisted
  status got stuck. Dir text on the running row, font matched to uptime.
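
The overall-percent formula `((N-1)+frac)/total` from the first bullet above can be written as a small helper. This is a sketch under assumptions: N is read as the 1-based index of the shard currently in flight and `frac` as that shard's completed fraction between 0 and 1. The `overallPercent` name and the clamping are illustrative and not taken from the commit.

```javascript
// shardIndex: 1-based index N of the shard in flight (assumed meaning of N).
// shardFraction: completed fraction of that shard, expected in 0..1.
function overallPercent(shardIndex, shardFraction, totalShards) {
  if (!(totalShards > 0)) return 0; // unknown total: report no progress
  const frac = Math.min(1, Math.max(0, shardFraction));
  // Shards before N count as complete; shard N contributes its fraction.
  const done = Math.min(totalShards, Math.max(0, shardIndex - 1 + frac));
  return (done / totalShards) * 100;
}
```

For example, halfway through shard 3 of 10 gives (2 + 0.5) / 10, which is 25%, whereas a per-shard figure would report 50%.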

Serve panel:
- Ctx text input always resets to model max on open (default 20000 when
  metadata is missing).
- Max Seqs default 8 -> 4. KV Cache dtype select 32px tall.
- Lightning icon on Launch (same as Action toggle).
- Diagnosis card simplified (no fold/copy/dismiss), suggestion font
  matches body; action buttons get icons on the left (Retry/Copy/Edit/
  Install/Kill/Switch/etc.).
- Incomplete-download serve warning when model status is
  downloading / stalled / has_incomplete.
- MTP "?" tooltip ("supported on a few model families … up to ~3× faster").
pewdiepie-archdaemon
2026-06-03 20:25:25 +09:00
parent 3706d756f3
commit 562bc4dedc
12 changed files with 669 additions and 115 deletions
+71 -12
@@ -527,6 +527,9 @@ export async function _hwfitFetch(fresh = false) {
if (useCase) params.set('use_case', useCase);
if (quantPref) params.set('quant', quantPref);
if (targetCtx) params.set('ctx', String(targetCtx));
+// Fit-only filter — set by the dot in the Fit column header.
+const _fitOnly = (() => { try { return localStorage.getItem('hwfit_fit_only_v1') === '1'; } catch { return false; } })();
+if (_fitOnly) params.set('fit_only', '1');
}
const endpoint = isImageMode ? `/api/hwfit/image-models?${params}` : `/api/hwfit/models?${params}`;
const res = await fetch(endpoint);
@@ -888,9 +891,15 @@ export function _hwfitRenderList(el, models) {
arrow = isReversed ? ' \u25B2' : ' \u25BC';
}
const dataAttr = col.key ? ` data-sort="${col.key}"` : '';
-const label = (col.cls === 'hwfit-fit' && _budget)
-? `${col.label} <span style="font-size:0.75em;opacity:0.6;font-weight:normal;">(${_budget})</span>`
-: col.label;
+// Fit column gets a small dot to its left that toggles "show only models
+// that fit" — replaces the old Fits On/Off button next to the toolbar.
+let label = col.label;
+if (col.cls === 'hwfit-fit') {
+const _fitOnly = (() => { try { return localStorage.getItem('hwfit_fit_only_v1') === '1'; } catch { return false; } })();
+label = `<span class="hwfit-fit-dot${_fitOnly ? ' active' : ''}" title="${_fitOnly ? 'Showing only models that fit. Click to also show too-tight rows.' : 'Click to show only models that fit your hardware.'}" data-fit-dot>●</span>${col.label}`;
+// (Budget tag removed — the GPU/RAM/N-GPU suffix next to "Fit" was noise;
+// the toggle row already shows which budget is active.)
+}
html += `<span class="hwfit-col ${col.cls}${sortable}${active}"${dataAttr}>${label}${arrow}</span>`;
}
html += '</div>';
@@ -910,9 +919,31 @@ export function _hwfitRenderList(el, models) {
const dlDot = (_cachedModelIds && (_cachedModelIds.has(m.name) || [..._cachedModelIds].some(id => id === m.name?.split('/').pop()))) ? '<span class="hwfit-dl-dot" title="Downloaded">\u25CF</span>' : '';
html += `<div class="hwfit-row" data-model="${esc(m.name)}">`;
html += `<span class="hwfit-col hwfit-fit" style="color:${fitColor}">${esc(fitLabel)}</span>`;
-html += `<span class="hwfit-col hwfit-name">${modelLogo(m.name)}${esc(m.name?.split('/').pop() || m.name)}${moeBadge}${imgBadge}${dlDot}</span>`;
+// Append quant to the title when it's not already in the repo name. The
+// suffix strips quant-parts the name already contains — e.g. for
+// QuantTrio/MiniMax-M2-AWQ + quant=AWQ-4bit we just show "(4bit)", not
+// "(AWQ-4bit)". DeepSeek-V4-Flash + FP4-MoE-Mixed keeps the full tag
+// (none of those parts are in the repo id).
+const _short = m.name?.split('/').pop() || m.name || '';
+const _quantTag = (m.quant || '').trim();
+const _lowerShort = _short.toLowerCase();
+let _quantSuffix = '';
+if (_quantTag) {
+const _parts = _quantTag.split(/[-_]/).filter(Boolean);
+const _remaining = _parts.filter(p => !_lowerShort.includes(p.toLowerCase()));
+if (_remaining.length && _remaining.length < _parts.length + 1) { // at least one part is new
+let _display = _remaining.join('-');
+if (_display.length > 9) _display = _display.slice(0, 9) + '…';
+_quantSuffix = ` <span class="hwfit-name-quant" title="${esc(_quantTag)} — full storage format">(${esc(_display)})</span>`;
+}
+}
+html += `<span class="hwfit-col hwfit-name">${modelLogo(m.name)}${esc(_short)}${_quantSuffix}${moeBadge}${imgBadge}${dlDot}</span>`;
html += `<span class="hwfit-col hwfit-c-params">${esc(pcount)}</span>`;
-html += `<span class="hwfit-col hwfit-c-quant">${esc(m.quant || '?')}</span>`;
+// Truncate the Quant cell to 9 chars + ellipsis so long tags like
+// "FP4-MoE-Mixed" don't push neighboring columns. Full tag stays in title.
+const _qRaw = m.quant || '?';
+const _qShort = _qRaw.length > 9 ? _qRaw.slice(0, 9) + '…' : _qRaw;
+html += `<span class="hwfit-col hwfit-c-quant" title="${esc(_qRaw)}">${esc(_qShort)}</span>`;
html += `<span class="hwfit-col hwfit-c-vram">${vramLabel}</span>`;
html += `<span class="hwfit-col hwfit-c-ctx">${m.is_image_gen ? '\u2014' : ctx}</span>`;
html += `<span class="hwfit-col hwfit-c-speed">${m.is_image_gen ? '\u2014' : tps + ' t/s'}</span>`;
@@ -934,7 +965,26 @@ export function _hwfitRenderList(el, models) {
});
// Clickable header columns → sort (click again to toggle direction)
el.querySelectorAll('.hwfit-header .hwfit-sortable').forEach(col => {
-col.addEventListener('click', () => {
+col.addEventListener('click', (e) => {
+// The little dot inside the Fit header is its own toggle (fit-only
+// filter), don't let it fall through to a sort click.
+if (e.target.closest('[data-fit-dot]')) {
+const on = !e.target.classList.contains('active');
+try { localStorage.setItem('hwfit_fit_only_v1', on ? '1' : '0'); } catch {}
+// Un-toggling the fit filter (off → showing too-tight rows again) is
+// typically because the user wants to see the LARGE models they can't
+// run yet — re-sort by VRAM descending so the biggest surface first.
+if (!on) {
+const sortSel = document.getElementById('hwfit-sort');
+if (sortSel) {
+sortSel.value = 'vram';
+sortSel.dataset.reverse = '0'; // descending (biggest first)
+}
+}
+_hwfitCache = null;
+_hwfitFetch();
+return;
+}
const sortKey = col.dataset.sort;
if (!sortKey) return;
const sel = document.getElementById('hwfit-sort');
@@ -1018,7 +1068,16 @@ export function _expandModelRow(row, modelData) {
if (modelData.is_image_gen) {
html += `<div style="font-size:10px;opacity:0.5;margin-top:4px;">${esc((modelData.capabilities || []).join(' \u00B7 ') || '')}${modelData.description ? ' \u2014 ' + esc(modelData.description) : ''}</div>`;
} else if (_requiresAcceleratorBackend(modelData)) {
-html += `<div class="hwfit-panel-note">This is a safetensors GPU-serving format. Use vLLM/SGLang with a visible CUDA/ROCm accelerator, or pick a GGUF download for llama.cpp/Ollama.</div>`;
+// Only show the "needs CUDA/ROCm" note when the host doesn't already have
+// one. With a visible CUDA/ROCm accelerator the note is noise — the user
+// can already serve the model and reading the warning on every row makes
+// the panel feel like everything's broken.
+const _sys = _hwfitCache?.system || {};
+const _backend = (_sys.backend || '').toLowerCase();
+const _hasGpuAccel = !!_sys.has_gpu && (_backend === 'cuda' || _backend === 'rocm');
+if (!_hasGpuAccel) {
+html += `<div class="hwfit-panel-note">This is a safetensors GPU-serving format. Use vLLM/SGLang with a visible CUDA/ROCm accelerator, or pick a GGUF download for llama.cpp/Ollama.</div>`;
+}
}
html += `</div>`;
@@ -1243,14 +1302,14 @@ export function _hwfitInit() {
const targetCtx = _ctxValue();
try { localStorage.setItem(_CTX_KEY, String(targetCtx)); } catch {}
// Ctx drag affects sort mode: a specific ctx target (anything < Max)
-// implies the user is hunting for "what fits at this context length",
-// so re-rank by fit (lowest first). Dragging back to Max means no
-// ctx constraint → go back to the default score-based ranking.
+// implies "what runs at this context length" — sort by VRAM ascending
+// so the cheapest-fitting models surface first. Dragging back to Max
+// releases the constraint → go back to the default score ranking.
const sortSel = document.getElementById('hwfit-sort');
if (sortSel) {
if (targetCtx) {
-sortSel.value = 'fit';
-sortSel.dataset.reverse = '1';
+sortSel.value = 'vram';
+sortSel.dataset.reverse = '1'; // ascending = smallest VRAM first
} else {
sortSel.value = 'score';
sortSel.dataset.reverse = '';