fix: CUDA/GPU detection for vLLM and llama.cpp in Docker (#479)

Two bugs caused GPU inference to silently fall back to CPU inside the Odysseus Docker container even when the GPU was correctly passed through. ## entrypoint.sh — CUDA_HOME detection only covered CUDA 13.x wheels The nvcc glob only searched vidia/cu13, which matches the vidia-nvcc-cu13 pip wheel layout. CUDA 12.x wheels install nvcc to vidia/cuda_nvcc/bin/nvcc (nvidia-cuda-nvcc-cu12) or vidia/cu12 (nvidia-nvcc-cu12) — completely different paths. The glob found nothing, so CUDA_HOME was never set. Worse, VLLM_USE_FLASHINFER_SAMPLER=0 was inside the same if-block, so it was never set either. vLLM then tried to JIT-compile the FlashInfer sampler at startup, failed with 'Could not find nvcc', and crashed — even though the GPU was fully visible to the container. Fix: expand the search to also check nvidia/cu12 and nvidia/cuda_nvcc. Move VLLM_USE_FLASHINFER_SAMPLER=0 to an unconditional export after the loop (it is sampler-only, no impact on the attention path, and the correct setting for any container where CUDA headers may be incomplete). ## cookbook_routes.py — llama.cpp Linux source build silently fell back to CPU The cmake invocation was: cmake -B build -DGGML_CUDA=ON 2>/dev/null || cmake -B build 2>/dev/null suppressed all configure errors. When nvcc is absent (the slim base image has no CUDA toolkit — intentional), cmake fails silently, then the || fallback re-runs without -DGGML_CUDA=ON. A CPU-only binary is produced with no warning. Additionally, a stale CMakeCache.txt from the failed CUDA attempt was reused (no rm -rf build), poisoning the next configure run. The macOS branch already did rm -rf build for exactly this reason; the Linux branch did not. Fix: before cmake, detect pip-installed nvcc across the same three path patterns as entrypoint.sh and expose it via CUDA_HOME/PATH. If nvcc is found, run a clean CUDA build with full error visibility. If not, fall back to a CPU build with an explicit warning telling the user how to get a GPU build (install vLLM via Cookbook -> Dependencies, which brings the CUDA wheels including nvcc, then re-launch). ## .env.example — document Windows COMPOSE_FILE separator Added a comment showing the semicolon separator required on Windows Docker Desktop alongside the existing colon-separator (Linux) example.
2026-06-17 18:25:26 -04:00 · 2026-06-01 15:30:51 +02:00
parent 3c6b084f08
commit 00320972dc
3 changed files with 42 additions and 5 deletions
@@ -1004,9 +1004,33 @@ def setup_cookbook_routes() -> APIRouter:
                runner_lines.append('      && cmake --build build -j"$NPROC" --target llama-server \\')
                runner_lines.append('      && ln -sf ~/llama.cpp/build/bin/llama-server ~/bin/llama-server')
                runner_lines.append('  else')
-                runner_lines.append('    cd ~/llama.cpp && { cmake -B build -DGGML_CUDA=ON 2>/dev/null || cmake -B build; } \\')
-                runner_lines.append('      && cmake --build build -j"$NPROC" --target llama-server \\')
-                runner_lines.append('      && ln -sf ~/llama.cpp/build/bin/llama-server ~/bin/llama-server')
+                # Detect pip-installed nvcc (from vLLM/nvidia CUDA wheels) and put
+                # it on PATH so cmake's CUDA configure can find it.  We check the
+                # same three layouts as entrypoint.sh:
+                #   nvidia/cu13       — nvidia-nvcc-cu13
+                #   nvidia/cu12       — nvidia-nvcc-cu12
+                #   nvidia/cuda_nvcc  — nvidia-cuda-nvcc-cu12 (sub-package style)
+                runner_lines.append('    for _cudir in ~/.local/lib/python*/site-packages/nvidia/cu13 ~/.local/lib/python*/site-packages/nvidia/cu12 ~/.local/lib/python*/site-packages/nvidia/cuda_nvcc; do')
+                runner_lines.append('      [ -x "$_cudir/bin/nvcc" ] && export CUDA_HOME="$_cudir" && export PATH="$_cudir/bin:$PATH" && break')
+                runner_lines.append('    done')
+                # rm -rf build so a prior poisoned CMakeCache.txt (e.g. from a
+                # failed CUDA attempt) doesn't cause the next configure to reuse
+                # stale settings and silently produce a CPU-only binary.
+                runner_lines.append('    cd ~/llama.cpp && rm -rf build')
+                runner_lines.append('    if command -v nvcc &>/dev/null; then')
+                runner_lines.append('      echo "[odysseus] CUDA nvcc found — building llama-server with CUDA (GPU) support..."')
+                runner_lines.append('      cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \\')
+                runner_lines.append('        && cmake --build build -j"$NPROC" --target llama-server \\')
+                runner_lines.append('        && ln -sf ~/llama.cpp/build/bin/llama-server ~/bin/llama-server')
+                runner_lines.append('    else')
+                runner_lines.append('      echo "[odysseus] WARNING: nvcc not found — building llama-server for CPU only."')
+                runner_lines.append('      echo "[odysseus]   GPU inference will not be available for this llama.cpp build."')
+                runner_lines.append('      echo "[odysseus]   To get a GPU build, first install vLLM via Cookbook -> Dependencies"')
+                runner_lines.append('      echo "[odysseus]   (its CUDA wheels include nvcc), then re-launch this serve task."')
+                runner_lines.append('      cmake -B build -DCMAKE_BUILD_TYPE=Release \\')
+                runner_lines.append('        && cmake --build build -j"$NPROC" --target llama-server \\')
+                runner_lines.append('        && ln -sf ~/llama.cpp/build/bin/llama-server ~/bin/llama-server')
+                runner_lines.append('    fi')
                runner_lines.append('  fi')
                runner_lines.append('  # If the native build failed, fall back to the Python bindings.')
                runner_lines.append('  if ! command -v llama-server &>/dev/null && ! python3 -c "import llama_cpp" 2>/dev/null; then')