Cookbook polish: auto-reconnect, ctx slider fixes, scoring, lots of UI

Backend (services/hwfit + routes): - VRAM column sort now shows global highest first (was special-cased to ascending then truncated top-N, which made "highest VRAM" mathematically unreachable). Every column path uses reverse=True for the truncation. - Hardware probe cache TTL 30min -> 24h so changing filters doesn't keep re-probing the rig during a session; Rescan button still forces fresh. - Multi-GPU rigs filter GGUF Q*/IQ quants (vLLM/SGLang can't serve them); default non-prequantized to BF16 on 2+ GPUs. - AWQ / AWQ-8bit / GPTQ-8bit get a -1.0 quality penalty so FP8 wins ties. - Version-aware tiebreaker (parse Mn.n / Vn) — MiniMax-M2.7 ranks above M2.5. - hf_models.json: zai-org/GLM-5.1 added; zai-org/GLM-5 quantization flipped Q4_K_M -> BF16. DeepSeek-V4-Flash / -Pro + their -Base variants registered with new FP4-MoE-Mixed / FP8-Mixed quant keys (calibrated BPP from the actual 156 GB / 284 GB disk footprints). - New FP4-MoE-Mixed + FP8-Mixed entries in QUANT_BPP / QUANT_SPEED_MULT / QUANT_QUALITY_PENALTY / QUANT_BYTES_PER_PARAM / PREQUANTIZED_PREFIXES. Frontend — Scan/Download: - Engine + Quant swapped in the toolbar; Quant defaults to "All". - Ctx (range slider) ported from origin/main: 8k/16k/32k/50k/128k/Max. Drag re-sorts by vram ascending (smallest fitting first); back to Max → score. - Ctx slider rail now visible — was background:transparent in a duplicate later-cascade rule. Hardcoded grey + !important. - Search input moved to the far right of the toolbar. - Type/Standard default; "Context" not uppercased; Search placeholder dimmed. - Engine "?" + Quant "?" inline help chips inside their dropdown boxes. - Fit-column dot toggles fit-only filter; un-toggling re-sorts by VRAM desc. - Quant column truncates to 9 chars + ellipsis ("FP4-MoE-M..."), full in tooltip. Smart title-suffix strips the parts already in the repo name (QuantTrio/MiniMax-M2-AWQ + quant AWQ-4bit -> just "(4bit)"). - Conditional warning for safetensors models on non-GPU rigs only. - Dependency Install / Installed / Installed▾ / N/A all 75.85px wide. - Rebuild llama.cpp moved into the llama_cpp dep row, styled as a tag. - Foldable Download admin-card (h2 chevron); line under h2 only when folded. - HF token save gets a green ✓ + "Saved" flash. - Cached scan no longer counts stalled rows as downloaded. - Footer: "Request it →" link with GitHub mark to the public discussion (#1962) for model-add requests. Frontend — Running tab: - Strict download-finish check (DOWNLOAD_OK or /snapshots/, not bare "Download complete"). True overall % for multi-shard downloads: ((N-1)+frac)/total instead of hf_transfer's per-shard aggregate. - ETA in the uptime ticker: "downloading: 12m 34s · ETA 1h 23m". - Clear button kills the tmux session too; if the output still shows a live shard line, the pill is hidden + relabels as "reconnect" + revives on click. - Self-heal: on cookbook open AND every bg-monitor cycle (10s, throttled to 8s), scan persisted done/error/crashed downloads and probe their tmux session — if alive, flip status back to running and reattach. - Per-launch zombie probe: clicking Download on a model whose persisted state is done but tmux is still alive revives the existing task and refuses to start a duplicate. - Pre-launch GPU probe: vllm / sglang / diffusers serve check /api/cookbook/gpus first; warns + confirms if no GPU is visible. - Server-side state guard: rejects "done" POSTs for downloads lacking DOWNLOAD_OK / DOWNLOAD_FAILED / /snapshots/ when the last-mentioned shard is N<total — stale tabs can't poison persisted state any more. - Running count includes tasks whose output looks active even if persisted status got stuck. Dir text on the running row, font matched to uptime. Serve panel: - Ctx text input always resets to model max on open (default 20000 when metadata is missing). - Max Seqs default 8 -> 4. KV Cache dtype select 32px tall. - Lightning icon on Launch (same as Action toggle). - Diagnosis card simplified (no fold/copy/dismiss), suggestion font matches body; action buttons get icons on the left (Retry/Copy/Edit/ Install/Kill/Switch/etc.). - Incomplete-download serve warning when model status is downloading / stalled / has_incomplete. - MTP "?" tooltip ("supported on a few model families … up to ~3× faster").
2026-06-30 00:22:10 -04:00 · 2026-06-03 20:25:25 +09:00
parent 3706d756f3
commit 562bc4dedc
12 changed files with 669 additions and 115 deletions
@@ -527,6 +527,9 @@ export async function _hwfitFetch(fresh = false) {
      if (useCase) params.set('use_case', useCase);
      if (quantPref) params.set('quant', quantPref);
      if (targetCtx) params.set('ctx', String(targetCtx));
+      // Fit-only filter — set by the dot in the Fit column header.
+      const _fitOnly = (() => { try { return localStorage.getItem('hwfit_fit_only_v1') === '1'; } catch { return false; } })();
+      if (_fitOnly) params.set('fit_only', '1');
    }
    const endpoint = isImageMode ? `/api/hwfit/image-models?${params}` : `/api/hwfit/models?${params}`;
    const res = await fetch(endpoint);
@@ -888,9 +891,15 @@ export function _hwfitRenderList(el, models) {
      arrow = isReversed ? ' \u25B2' : ' \u25BC';
    }
    const dataAttr = col.key ? ` data-sort="${col.key}"` : '';
-    const label = (col.cls === 'hwfit-fit' && _budget)
-      ? `${col.label} <span style="font-size:0.75em;opacity:0.6;font-weight:normal;">(${_budget})</span>`
-      : col.label;
+    // Fit column gets a small dot to its left that toggles "show only models
+    // that fit" — replaces the old Fits On/Off button next to the toolbar.
+    let label = col.label;
+    if (col.cls === 'hwfit-fit') {
+      const _fitOnly = (() => { try { return localStorage.getItem('hwfit_fit_only_v1') === '1'; } catch { return false; } })();
+      label = `<span class="hwfit-fit-dot${_fitOnly ? ' active' : ''}" title="${_fitOnly ? 'Showing only models that fit. Click to also show too-tight rows.' : 'Click to show only models that fit your hardware.'}" data-fit-dot>●</span>${col.label}`;
+      // (Budget tag removed — the GPU/RAM/N-GPU suffix next to "Fit" was noise;
+      // the toggle row already shows which budget is active.)
+    }
    html += `<span class="hwfit-col ${col.cls}${sortable}${active}"${dataAttr}>${label}${arrow}</span>`;
  }
  html += '</div>';
@@ -910,9 +919,31 @@ export function _hwfitRenderList(el, models) {
    const dlDot = (_cachedModelIds && (_cachedModelIds.has(m.name) || [..._cachedModelIds].some(id => id === m.name?.split('/').pop()))) ? '<span class="hwfit-dl-dot" title="Downloaded">\u25CF</span>' : '';
    html += `<div class="hwfit-row" data-model="${esc(m.name)}">`;
    html += `<span class="hwfit-col hwfit-fit" style="color:${fitColor}">${esc(fitLabel)}</span>`;
-    html += `<span class="hwfit-col hwfit-name">${modelLogo(m.name)}${esc(m.name?.split('/').pop() || m.name)}${moeBadge}${imgBadge}${dlDot}</span>`;
+    // Append quant to the title when it's not already in the repo name. The
+    // suffix strips quant-parts the name already contains — e.g. for
+    // QuantTrio/MiniMax-M2-AWQ + quant=AWQ-4bit we just show "(4bit)", not
+    // "(AWQ-4bit)". DeepSeek-V4-Flash + FP4-MoE-Mixed keeps the full tag
+    // (none of those parts are in the repo id).
+    const _short = m.name?.split('/').pop() || m.name || '';
+    const _quantTag = (m.quant || '').trim();
+    const _lowerShort = _short.toLowerCase();
+    let _quantSuffix = '';
+    if (_quantTag) {
+      const _parts = _quantTag.split(/[-_]/).filter(Boolean);
+      const _remaining = _parts.filter(p => !_lowerShort.includes(p.toLowerCase()));
+      if (_remaining.length && _remaining.length < _parts.length + 1) {  // at least one part is new
+        let _display = _remaining.join('-');
+        if (_display.length > 9) _display = _display.slice(0, 9) + '…';
+        _quantSuffix = ` <span class="hwfit-name-quant" title="${esc(_quantTag)} — full storage format">(${esc(_display)})</span>`;
+      }
+    }
+    html += `<span class="hwfit-col hwfit-name">${modelLogo(m.name)}${esc(_short)}${_quantSuffix}${moeBadge}${imgBadge}${dlDot}</span>`;
    html += `<span class="hwfit-col hwfit-c-params">${esc(pcount)}</span>`;
-    html += `<span class="hwfit-col hwfit-c-quant">${esc(m.quant || '?')}</span>`;
+    // Truncate the Quant cell to 9 chars + ellipsis so long tags like
+    // "FP4-MoE-Mixed" don't push neighboring columns. Full tag stays in title.
+    const _qRaw = m.quant || '?';
+    const _qShort = _qRaw.length > 9 ? _qRaw.slice(0, 9) + '…' : _qRaw;
+    html += `<span class="hwfit-col hwfit-c-quant" title="${esc(_qRaw)}">${esc(_qShort)}</span>`;
    html += `<span class="hwfit-col hwfit-c-vram">${vramLabel}</span>`;
    html += `<span class="hwfit-col hwfit-c-ctx">${m.is_image_gen ? '\u2014' : ctx}</span>`;
    html += `<span class="hwfit-col hwfit-c-speed">${m.is_image_gen ? '\u2014' : tps + ' t/s'}</span>`;
@@ -934,7 +965,26 @@ export function _hwfitRenderList(el, models) {
  });
  // Clickable header columns → sort (click again to toggle direction)
  el.querySelectorAll('.hwfit-header .hwfit-sortable').forEach(col => {
-    col.addEventListener('click', () => {
+    col.addEventListener('click', (e) => {
+      // The little dot inside the Fit header is its own toggle (fit-only
+      // filter), don't let it fall through to a sort click.
+      if (e.target.closest('[data-fit-dot]')) {
+        const on = !e.target.classList.contains('active');
+        try { localStorage.setItem('hwfit_fit_only_v1', on ? '1' : '0'); } catch {}
+        // Un-toggling the fit filter (off → showing too-tight rows again) is
+        // typically because the user wants to see the LARGE models they can't
+        // run yet — re-sort by VRAM descending so the biggest surface first.
+        if (!on) {
+          const sortSel = document.getElementById('hwfit-sort');
+          if (sortSel) {
+            sortSel.value = 'vram';
+            sortSel.dataset.reverse = '0';   // descending (biggest first)
+          }
+        }
+        _hwfitCache = null;
+        _hwfitFetch();
+        return;
+      }
      const sortKey = col.dataset.sort;
      if (!sortKey) return;
      const sel = document.getElementById('hwfit-sort');
@@ -1018,7 +1068,16 @@ export function _expandModelRow(row, modelData) {
  if (modelData.is_image_gen) {
    html += `<div style="font-size:10px;opacity:0.5;margin-top:4px;">${esc((modelData.capabilities || []).join(' \u00B7 ') || '')}${modelData.description ? ' \u2014 ' + esc(modelData.description) : ''}</div>`;
  } else if (_requiresAcceleratorBackend(modelData)) {
-    html += `<div class="hwfit-panel-note">This is a safetensors GPU-serving format. Use vLLM/SGLang with a visible CUDA/ROCm accelerator, or pick a GGUF download for llama.cpp/Ollama.</div>`;
+    // Only show the "needs CUDA/ROCm" note when the host doesn't already have
+    // one. With a visible CUDA/ROCm accelerator the note is noise — the user
+    // can already serve the model and reading the warning on every row makes
+    // the panel feel like everything's broken.
+    const _sys = _hwfitCache?.system || {};
+    const _backend = (_sys.backend || '').toLowerCase();
+    const _hasGpuAccel = !!_sys.has_gpu && (_backend === 'cuda' || _backend === 'rocm');
+    if (!_hasGpuAccel) {
+      html += `<div class="hwfit-panel-note">This is a safetensors GPU-serving format. Use vLLM/SGLang with a visible CUDA/ROCm accelerator, or pick a GGUF download for llama.cpp/Ollama.</div>`;
+    }
  }
  html += `</div>`;

@@ -1243,14 +1302,14 @@ export function _hwfitInit() {
      const targetCtx = _ctxValue();
      try { localStorage.setItem(_CTX_KEY, String(targetCtx)); } catch {}
      // Ctx drag affects sort mode: a specific ctx target (anything < Max)
-      // implies the user is hunting for "what fits at this context length",
-      // so re-rank by fit (lowest first). Dragging back to Max means no
-      // ctx constraint → go back to the default score-based ranking.
+      // implies "what runs at this context length" — sort by VRAM ascending
+      // so the cheapest-fitting models surface first. Dragging back to Max
+      // releases the constraint → go back to the default score ranking.
      const sortSel = document.getElementById('hwfit-sort');
      if (sortSel) {
        if (targetCtx) {
-          sortSel.value = 'fit';
-          sortSel.dataset.reverse = '1';
+          sortSel.value = 'vram';
+          sortSel.dataset.reverse = '1';   // ascending = smallest VRAM first
        } else {
          sortSel.value = 'score';
          sortSel.dataset.reverse = '';