* Cookbook: Engine filter + intelligent hardware-computed serve profiles
Two related Cookbook serving improvements for accurate, hardware-aware model
serving (especially on consumer GPUs that can only run GGUF/llama.cpp).
Engine filter
- New "Engine" dropdown (All / llama.cpp / vLLM / SGLang) beside the quant
picker. Pure client-side view filter over the fetched list via the same
_detectBackend() the serve commands use, so what you filter to is exactly what
would launch. Re-renders from cache (no refetch). Empty-state message + the
instant-cache-paint path account for it too.
Intelligent serve profiles (Quality / Balanced / Speed)
- services/hwfit/profiles.py: compute_serve_profiles() turns detected VRAM +
model size into concrete llama.cpp flags (n_gpu_layers, n_cpu_moe, cache-type,
context). Encodes the by-hand tuning: a too-big MoE offloads experts to CPU
instead of failing; a model that fits stays fully on GPU; quant tracks profile
intent; vision models keep image-encoder headroom. Reuses models.py VRAM math
so filtering and serving agree on what fits. Pure/deterministic (no t/s claims
— partial-offload speed isn't reliably predictable; fit is what's computed).
- /api/hwfit/profiles endpoint returns the profiles + the model's trained
context limit, with loose name matching (strips org/ prefix, -GGUF suffix,
quant tag) so a local GGUF folder name resolves to its catalog entry.
- _buildServeCmd (llama.cpp) now emits --n-cpu-moe / --flash-attn /
--cache-type-k/v when set, with llama-cpp-python fallback equivalents. It
previously only set -ngl/-c, which is why it OOM'd or ran slow.
- Serve panel: profile chips that fill the fields on click, plus CPU-MoE / KV
Cache / Flash Attn fields. Context is clamped to the model's trained limit
(and an absolute 1M sanity ceiling) on type/blur/profile-load and at launch —
fixes a crash where a stale 256k/16M preset + quantized KV cache caused an
amdgpu ErrorDeviceLost.
Tests: tests/test_serve_profiles.py (7) — offload vs full-GPU fit, never exceed
VRAM, context cap, launchable flags, vision headroom, no-GPU empty.
Checks: py_compile + node --check pass; pytest test_serve_profiles + test_hwfit_amd
green; verified live on an RDNA4 box (gfx1200) — Balanced lands ~ncm18 q4 128k,
matching hand-tuning.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Cookbook: make column-header sorting discoverable (incl. Newest)
Sorting in Cookbook is via clickable column headers (pewds' design), but the
headers had no visual cue that they're interactive — so sorting in general, and
the Newest sort on the Model header specifically, was undiscoverable.
- Style sortable headers as interactive: pointer cursor, hover underline, and
the active sort column bolded/highlighted. There was no CSS for
.hwfit-sortable / .hwfit-sort-active at all; this helps every existing sort,
not just Newest.
- The Model column header sorts by release_date (newest first), reusing the
existing header-click sort wiring and the "newest" SORT_KEY.
No new sort control — uses the existing column-header paradigm.
Checks: node --check passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Cookbook serve profiles: keep the on-disk file's quant fixed (don't propose Q6/Q2)
In the Serve tab the model is a specific GGUF file already on disk, so its quant
can't change — but the profiles were suggesting "Quality · Q6_K" / "Speed · Q2_K"
as if you could re-quantize it. That's meaningless when serving a fixed file.
- compute_serve_profiles gains serve_weights_gb / serve_quant. When set (SERVE
mode), the quant is locked to the file's and profiles differ only in the real
serving knobs — n_cpu_moe, KV-cache type, context. _weights_gb / _cpu_moe_for_budget
use the file's actual size instead of a quant-derived estimate. DOWNLOAD mode
(no override) still varies the quant to show download options.
- /api/hwfit/profiles accepts serve_weights_gb & serve_quant.
- The Serve panel parses the file's size (from m.size "20.6 GB") and quant (from
the repo/file name) and passes them, so profiles match what's actually served.
Result for a 20.6 GB Q4_K_M file: all three profiles stay Q4_K_M and differ by
KV/ctx/offload (Quality q8 KV 128k ncm21, Balanced q4 128k ncm17, Speed q4 32k
ncm15) — no nonsensical quant changes.
Tests: test_serve_mode_keeps_fixed_quant. Full serve-profile suite green (9).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Cookbook serve: Vision toggle (auto-find mmproj) + live VRAM/RAM-spillover monitor
Two serve-panel additions:
1. **Vision toggle.** A "Vision" checkbox that serves the model with its
multimodal projector so it can read images. The mmproj path is resolved at
runtime (find mmproj-*.gguf next to the model), so dropping an mmproj file in
the model folder makes the toggle just work; `--mmproj … --image-max-tokens
1024` (native) / `--clip_model_path` (llama-cpp-python) only when on + found.
2. **Live GPU-memory monitor.** A readout that polls /api/cookbook/gpus every 4s
while the panel is open and shows VRAM used/total/%, free, and — crucially on
a discrete card — **RAM spillover** (AMD gtt_used_mb), with a plain-language
health hint: green/healthy, amber/tight, red/"spilled to RAM — slow (raise
CPU MoE or lower context)". Surfaces gtt_used_mb from the gpus endpoint
(previously read for total only and discarded for 'used').
Lets you see at a glance whether a config fits VRAM (fast) or is paging to system
RAM over PCIe (slow) instead of guessing.
Checks: node --check + py_compile pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two small polish items in the Cookbook Serve panel.
Saved-config badge
The little count badge next to the Save button ("3 ▾" etc.) had a
generic "Saved launch configs" tooltip, so the number reads like a
notification dot. Make it spell out what it is and what clicking does:
"3 saved launch configs for <model> — click ▾ to load or delete"
(and "No saved launch configs for <model> yet — click Save to add
one" when empty). Tooltip stays in sync via _updateSavedToggleLabel
so save/delete updates both the count and the hint.
GPU chip on mixed-GPU boxes (#711)
The chip label was `${gpuCount}x ${gpu_name}`, where gpu_name is
just gpus[0].name — so a 4090 + 3060 reads as "2x RTX 4090". The
backend already emits gpu_groups (identical cards grouped, used by
the serve flow to pin CUDA_VISIBLE_DEVICES) and a per-card gpus[]
array, so use them:
- Label renders each homogeneous pool: "1× RTX 4090 + 1× RTX 3060".
Homogeneous setups keep the existing "2× RTX 4090" form.
- Tooltip lists each GPU with its index + VRAM, useful for picking
the right device when launching.
Refs #711.
The Cookbook Scan/Download (hwfit) table gave the Fit column key:'score', so
clicking the Fit header sorted by score instead of by fit. Give the Fit column
its own 'fit' sort key, add a matching option to the #hwfit-sort select, and
rank fit_level (perfect > good > marginal > too_tight > no_fit) in the
client-side sort. Default puts the best fit first; clicking again reverses it.
Score still sorts by score.
Closes#842
Gate Cookbook "Run" on the model being downloaded
The What-Fits tab's quick "Run" button launched a serve task even when
the model was not downloaded. It POSTed directly to /api/model/serve and switched to the Running tab, so vLLM/SGLang would background-pull at launch (and llama.cpp just errors "No GGUF found") while the task showed as "running" without actually serving anything.
The Configure button and the Serve tab already gate on the cached-model
list; quick-Run did not. Mirror that gate: when the model isn't cached,
honor the button's "Download" half by kicking off the download instead of spawning a phantom serve task, and toast the user to Run again once it finishes.
Hoist the HTML-escape lookup table in static/js/ui.js out of the
String.replace callback so it is allocated once instead of on every
matched character. esc() is the canonical escaper aliased across 27
modules and runs on essentially every render, so this removes a lot of
short-lived garbage on the hottest text path. Output is byte-identical
(verified across null/undefined/emoji/attribute edge cases).
Also build the <select> option lists in cookbook-hwfit.js and group.js
by accumulating a string and assigning innerHTML once, instead of
`innerHTML +=` inside a forEach (which makes the browser re-parse the
element's markup on every iteration). Final DOM is unchanged.
Pure micro-optimizations; no behavior change.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>