Cookbook backend detection: report Vulkan on AMD hosts without ROCm; gate CUDA build on actual NVIDIA hardware

Fixes three classes of incorrect detection:

(1) An AMD GPU with no ROCm installed (e.g. Strix Halo) was reported as backend=rocm everywhere, so launch commands emitted HIP_VISIBLE_DEVICES (a silent no-op on Vulkan) and the from-source build path failed. Both _probe_amd_sysfs (routes/cookbook_routes) and _detect_amd (services/hwfit/hardware) now probe rocminfo / hipconfig / vulkaninfo at detection time and report vulkan when only Vulkan is present.

(2) The build helper picked the CUDA branch on AMD hosts whenever a stray pip-installed nvcc was on PATH (vLLM wheels carry one without libcudart). Added _odysseus_has_nvidia_hw(), which checks nvidia-smi / /dev/nvidia* / lspci, and gated both the nvcc PATH augmentation and the CUDA elif branch on real hardware.

(3) The build chain is reordered to ROCm/HIP > CUDA > Vulkan > CPU. A Vulkan tier was added between CUDA and CPU as a portable fallback for hosts with a GPU but no native toolchain (the common Strix Halo case).

The same _append_llama_cpp_linux_accel_build_lines also auto-attempts a sudo -n apt/pacman/dnf install of cmake/build-essential/git when they are missing, and surfaces a clear no-passwordless-sudo warning otherwise.
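
The detection and tier-ordering rules described in this message can be sketched roughly as below. This is an illustrative approximation, not the code in this commit: pick_amd_backend, has_nvidia_hw, and pick_build_tier are hypothetical names, shutil.which stands in for the repository's _run(["which", ...]) helper, the lspci check is omitted, and the exact conditions the real build helper uses for each tier are assumed.

```python
import glob
import shutil
from typing import Callable, Optional

# A lookup with the same shape as shutil.which, injectable for testing.
Which = Callable[[str], Optional[str]]


def pick_amd_backend(which: Which = shutil.which) -> str:
    """Label the runtime for a host already known to have an AMD GPU."""
    if which("rocminfo") or which("hipconfig"):
        return "rocm"
    if which("vulkaninfo"):
        return "vulkan"
    # Neither toolchain found: keep the previous "rocm" label,
    # which matches the final else branch in the diff.
    return "rocm"


def has_nvidia_hw(which: Which = shutil.which,
                  device_glob: str = "/dev/nvidia*") -> bool:
    """Look for signs of a real NVIDIA device rather than a bare nvcc.

    Hypothetical stand-in for _odysseus_has_nvidia_hw(); the lspci
    check mentioned in the message is left out of this sketch.
    """
    return bool(which("nvidia-smi")) or bool(glob.glob(device_glob))


def pick_build_tier(has_rocm_toolchain: bool, nvidia_hw: bool,
                    nvcc_on_path: bool, has_vulkan: bool) -> str:
    """Apply the ROCm/HIP > CUDA > Vulkan > CPU ordering.

    The per-tier conditions are assumptions; only the ordering and the
    hardware gate on the CUDA tier come from the commit message.
    """
    if has_rocm_toolchain:
        return "rocm"
    if nvidia_hw and nvcc_on_path:  # a stray nvcc alone is not enough
        return "cuda"
    if has_vulkan:
        return "vulkan"
    return "cpu"
```

Under these assumptions, a Strix Halo host with only vulkaninfo installed and a pip-installed nvcc on PATH would be labelled vulkan and would build the Vulkan tier, since the CUDA tier requires evidence of NVIDIA hardware.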
2026-06-28 07:35:27 -04:00 · 2026-06-19 00:33:07 +00:00
parent b3e186746a
commit 1324e1b0d5
3 changed files with 293 additions and 20 deletions
@@ -282,7 +282,17 @@ def _detect_amd():
            "gpus": cards,
            "gpu_groups": groups,
            "homogeneous": len(groups) <= 1,
-            "backend": "rocm",
+            # Pick the actual runtime label: ROCm/HIP only when its
+            # toolchain is installed, otherwise Vulkan if vulkaninfo is
+            # present (Mesa RADV works fine on RDNA/CDNA when ROCm
+            # packages are absent, e.g. Strix Halo, where ROCm support
+            # is still being backported). Reporting "rocm" on a Vulkan-only
+            # host misleads downstream env-var pinning
+            # (HIP_VISIBLE_DEVICES is a no-op there).
+            "backend": (
+                "rocm" if (_run(["which", "rocminfo"]) or _run(["which", "hipconfig"]))
+                else ("vulkan" if _run(["which", "vulkaninfo"]) else "rocm")
+            ),
            "unified_memory": is_apu,
            # AMD ISA/family so downstream can tell datacenter Instinct (CDNA,
            # where vLLM/SGLang run AWQ/GPTQ reliably) from consumer Radeon