* fix(hwfit): tolerate non-numeric gpu_count in /api/hwfit/models
The route did `n = int(gpu_count)` with no guard, so a non-numeric query param
like `?gpu_count=abc` raised ValueError and returned HTTP 500. Parse it
defensively (mirroring the gpu_group guard a few lines above): a malformed value
is ignored, exactly like omitting the param, and valid values still apply.
Adds tests/test_hwfit_gpu_count_nonnumeric.py: a non-numeric gpu_count returns a
ranking instead of raising, and a numeric value is still accepted.
* test(hwfit): cover non-numeric manual_gpu_count too
Follow-up to the gpu_count guard: add a regression test for the sibling
manual_gpu_count query param (the hardware simulator in _apply_manual_hardware),
which dev already guards by defaulting to 1 on a non-numeric value. This pins
that behaviour so the endpoint's count parsing is fully covered and cannot
regress to a 500.
Surface a lot of accumulated cookbook + UI work as a single non-agent
commit so the agent rework lands cleanly.
Highlights:
- Ollama as a first-class backend in the Cookbook:
* Download input accepts ollama-style names (name:tag) → backend=ollama
* /api/cookbook/ollama/library (cached scrape of ollama.com + curated
fallback so classic models like qwen2.5 stay reachable)
* "Browse Ollama library" toggle below Download with size chips
* Engine=Ollama in hwfit toolbar merges the Ollama library into the
main scan list as per-tag rows with the same Fit/Param/Quant/VRAM
columns; click → fills Download input
- API Tokens form added to Integrations panel (matching wired
loadTokens()/initTokenForm() that had no HTML)
- Serve panel polish: Advanced fold tightening (-8px nudges on vLLM
checks, Extra args, Spec row), n_cpu_moe + Split Mode controls
pulled up 8px to align with the row's checkboxes, GGUF File dropdown
exposed for Ollama backend, GPU re-render on Edit serve restore,
_forceBackend flag so saved serveState wins over backend detection,
cookbook:servers-changed CustomEvent so panels don't need refresh
- Models page redesign: Add Models row (URL + hidden API key reveal +
Type select + Scan/Ollama/Key/Test/Add icon buttons), Probe All +
Clear-offline buttons in Added Models toolbar, offline-pill removed
(opacity already conveys state), Engine dropdown gains Ollama option
- _ping_endpoint probes /v1/models then base, accepts 4xx as
reachable (vLLM returns 404 on bare /v1, fully working endpoints
were showing offline)
- Diagnosis card: × dismiss + Copy bundle buttons restored on the
serve error feedback card
- Orphan tmux sweep re-enabled behind a 60s rate-limit + background
Thread (off the main event loop) so dead serves get discovered
- cookbook_routes auto-register watchdog: drops the endpoint if the
serve session exits non-zero within the first ~3min
- ollama-rocm sidecar awareness in download wrapper (`docker exec
ollama-rocm ollama pull` when host ollama isn't installed)
- Skill extractor sets initial_status="published" when
auto_approve_skills pref is on (audit demotes later)
- Skill list / model list / cookbook scan misc polish
Backend (services/hwfit + routes):
- rank_models picks visible set by REQUESTED column, not always score —
sorting by Param now shows highest-param models PERIOD (incl. too_tight).
- New fit_only param. Multi-GPU rigs filter GGUF Q*/IQ quants (vLLM/SGLang
cannot serve them); default non-prequantized to BF16 on 2+ GPUs.
- AWQ / GPTQ-8bit get a -1.0 quality penalty (was 0.0, tied with FP8), so
FP8 wins when both fit.
- Version-aware tiebreaker (parse Mn.n / Vn) — MiniMax-M2.7 ranks above
M2.5 on equal composite score; >=100B integers not misread as versions.
- /api/cookbook/hf-latest no longer drops models without an "NB" pattern in
the repo id (MiniMax-M2.7, DeepSeek-V4-Pro etc. were silently filtered).
- Cached-model scan: atexit flushes models JSON even if the script is
killed mid-walk; each scan_dir wrapped in try/except; timeout 60s -> 180s.
- KB granularity for sub-MB sizes (was "0 MB" for 12 KB shells). New
"stalled" status for shells <1 MB with no .incomplete files.
- /api/cookbook/state POST guard: rejects "done" download tasks lacking
DOWNLOAD_OK / DOWNLOAD_FAILED / /snapshots/ when the last-mentioned
shard is N<total — stops stale tabs from poisoning persisted state.
- hf_models.json: add zai-org/GLM-5.1; flip zai-org/GLM-5 quantization
Q4_K_M -> BF16 (it is the native base, not a quant).
Frontend (static/js):
- Scan/Download toolbar: quant defaults to All; ctx slider (8k/16k/32k/
50k/128k/Max) ported from origin/main with sort=fit on drag, sort=score
on Max. GPU toggle commits _activeCount to maxGpu on initial render. Fit
column header tagged with active budget (RAM / GPU / N GPU).
- Foldable Download admin-card: the Download h2 is the chevron trigger;
state persists in localStorage.
- Download card surfaces destination dir (Dir: <path>). Same dir on running
task row, font/color matched to uptime (9px Fira Code muted, opacity .4).
- Serve panel ctx text input always resets to model max on open. Sub-MB
cached models show with red "download stalled" badge.
- Bulk-select Cancel + Delete reset the Select button label on exit.
- Cookbook running: false-finished bug fixed — DOWNLOAD_OK or /snapshots/
required; bare "Download complete" no longer marks the task done after
the first config file. Clear button now sends tmux kill-session too.
True overall % for multi-shard downloads: ((N-1)+frac)/total instead of
hf_transfer per-shard aggregate.
- Diagnosis card simplified: removed fold toggle, copy button, dismiss X.
Suggestion font matches message body (12px).
- HF token field flashes green check + "Saved" on save.
- Cached scan no longer counts stalled rows as downloaded in Scan/Download.
CSS:
- dep Install button width pinned to 76px to match Installed split.
- task-sub row +1px; task-status badge gets margin-right 8px.
- Ctx slider styled like gallery editor sliders (thin pill rail, red thumb).
- Bulk-select cancel button top -3px -> -5px.
The Cookbook's manual hardware simulator ("what if I had this setup") let users
pick a backend, but _apply_manual_hardware only accepted cuda/rocm/cpu_x86/
cpu_arm and silently coerced anything else to cuda. So selecting Apple/Metal
simulated a CUDA box instead — and ranked safetensors-only repos a Mac can't
serve, even though the rest of hwfit (services.hwfit.fit, the serve-command
generation) already supports Metal as GGUF-only via llama.cpp/Ollama.
Add "metal" to the accepted backends (now a named _MANUAL_BACKENDS set, kept a
subset of what fit.py understands) and set unified_memory=True for it — Apple
Silicon shares one memory pool with the GPU — while clearing that flag for the
discrete (cuda/rocm) and CPU backends. _apply_manual_hardware is lifted to
module scope so it is directly unit-testable; both route call sites are
unchanged.
Adds tests/test_hwfit_manual_backend.py, including an end-to-end check that a
simulated Metal box only recommends GGUF-servable models.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
* Cookbook: Engine filter + intelligent hardware-computed serve profiles
Two related Cookbook serving improvements for accurate, hardware-aware model
serving (especially on consumer GPUs that can only run GGUF/llama.cpp).
Engine filter
- New "Engine" dropdown (All / llama.cpp / vLLM / SGLang) beside the quant
picker. Pure client-side view filter over the fetched list via the same
_detectBackend() the serve commands use, so what you filter to is exactly what
would launch. Re-renders from cache (no refetch). Empty-state message + the
instant-cache-paint path account for it too.
Intelligent serve profiles (Quality / Balanced / Speed)
- services/hwfit/profiles.py: compute_serve_profiles() turns detected VRAM +
model size into concrete llama.cpp flags (n_gpu_layers, n_cpu_moe, cache-type,
context). Encodes the by-hand tuning: a too-big MoE offloads experts to CPU
instead of failing; a model that fits stays fully on GPU; quant tracks profile
intent; vision models keep image-encoder headroom. Reuses models.py VRAM math
so filtering and serving agree on what fits. Pure/deterministic (no t/s claims
— partial-offload speed isn't reliably predictable; fit is what's computed).
- /api/hwfit/profiles endpoint returns the profiles + the model's trained
context limit, with loose name matching (strips org/ prefix, -GGUF suffix,
quant tag) so a local GGUF folder name resolves to its catalog entry.
- _buildServeCmd (llama.cpp) now emits --n-cpu-moe / --flash-attn /
--cache-type-k/v when set, with llama-cpp-python fallback equivalents. It
previously only set -ngl/-c, which is why it OOM'd or ran slow.
- Serve panel: profile chips that fill the fields on click, plus CPU-MoE / KV
Cache / Flash Attn fields. Context is clamped to the model's trained limit
(and an absolute 1M sanity ceiling) on type/blur/profile-load and at launch —
fixes a crash where a stale 256k/16M preset + quantized KV cache caused an
amdgpu ErrorDeviceLost.
Tests: tests/test_serve_profiles.py (7) — offload vs full-GPU fit, never exceed
VRAM, context cap, launchable flags, vision headroom, no-GPU empty.
Checks: py_compile + node --check pass; pytest test_serve_profiles + test_hwfit_amd
green; verified live on an RDNA4 box (gfx1200) — Balanced lands ~ncm18 q4 128k,
matching hand-tuning.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Cookbook: make column-header sorting discoverable (incl. Newest)
Sorting in Cookbook is via clickable column headers (pewds' design), but the
headers had no visual cue that they're interactive — so sorting in general, and
the Newest sort on the Model header specifically, was undiscoverable.
- Style sortable headers as interactive: pointer cursor, hover underline, and
the active sort column bolded/highlighted. There was no CSS for
.hwfit-sortable / .hwfit-sort-active at all; this helps every existing sort,
not just Newest.
- The Model column header sorts by release_date (newest first), reusing the
existing header-click sort wiring and the "newest" SORT_KEY.
No new sort control — uses the existing column-header paradigm.
Checks: node --check passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Cookbook serve profiles: keep the on-disk file's quant fixed (don't propose Q6/Q2)
In the Serve tab the model is a specific GGUF file already on disk, so its quant
can't change — but the profiles were suggesting "Quality · Q6_K" / "Speed · Q2_K"
as if you could re-quantize it. That's meaningless when serving a fixed file.
- compute_serve_profiles gains serve_weights_gb / serve_quant. When set (SERVE
mode), the quant is locked to the file's and profiles differ only in the real
serving knobs — n_cpu_moe, KV-cache type, context. _weights_gb / _cpu_moe_for_budget
use the file's actual size instead of a quant-derived estimate. DOWNLOAD mode
(no override) still varies the quant to show download options.
- /api/hwfit/profiles accepts serve_weights_gb & serve_quant.
- The Serve panel parses the file's size (from m.size "20.6 GB") and quant (from
the repo/file name) and passes them, so profiles match what's actually served.
Result for a 20.6 GB Q4_K_M file: all three profiles stay Q4_K_M and differ by
KV/ctx/offload (Quality q8 KV 128k ncm21, Balanced q4 128k ncm17, Speed q4 32k
ncm15) — no nonsensical quant changes.
Tests: test_serve_mode_keeps_fixed_quant. Full serve-profile suite green (9).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* Cookbook serve: Vision toggle (auto-find mmproj) + live VRAM/RAM-spillover monitor
Two serve-panel additions:
1. **Vision toggle.** A "Vision" checkbox that serves the model with its
multimodal projector so it can read images. The mmproj path is resolved at
runtime (find mmproj-*.gguf next to the model), so dropping an mmproj file in
the model folder makes the toggle just work; `--mmproj … --image-max-tokens
1024` (native) / `--clip_model_path` (llama-cpp-python) only when on + found.
2. **Live GPU-memory monitor.** A readout that polls /api/cookbook/gpus every 4s
while the panel is open and shows VRAM used/total/%, free, and — crucially on
a discrete card — **RAM spillover** (AMD gtt_used_mb), with a plain-language
health hint: green/healthy, amber/tight, red/"spilled to RAM — slow (raise
CPU MoE or lower context)". Surfaces gtt_used_mb from the gpus endpoint
(previously read for total only and discarded for 'used').
Lets you see at a glance whether a config fits VRAM (fast) or is paging to system
RAM over PCIe (slow) instead of guessing.
Checks: node --check + py_compile pass.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>