Add macOS Apple Silicon Cookbook support

* Add Apple Silicon (Metal) GPU detection and unified-memory fit tuning

hardware.py detects Apple Silicon locally and over SSH, reporting backend=metal, the chip name, and a RAM-scaled fraction of unified memory as the usable GPU budget. fit.py gains an M1-M4 memory-bandwidth table for realistic tok/s and drops vLLM-only formats (AWQ/GPTQ/FP8) that can't be served on Metal.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 32ac81dbc6)

* Generate macOS/Metal serve commands and surface the Metal GPU

cookbook_routes.py adds a macOS serve path (Ollama, Metal-aware llama.cpp build using `sysctl hw.ncpu` instead of `nproc`, and a clear error if vLLM is attempted). The frontend defaults Metal serving to llama.cpp and offers llama.cpp/Ollama instead of vLLM/SGLang. The odysseus-cookbook CLI's `gpus` command reports the Metal GPU via sysctl/vm_stat.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 4ba01ce25d)

* Add launchd LaunchAgent for macOS (systemd equivalent)

com.odysseus.ui.plist + install-service-macos.sh run Odysseus at login and restart on crash, the macOS counterpart to odysseus-ui.service. The installer auto-fills paths from the venv, so there's no hand-editing.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 3d4b6b2c7b)

* Document macOS install (brew, Ollama, AirPlay port, launchd)

README + setup.py cover the Homebrew / Apple Silicon path: brew install python@3.11 tmux ollama, Metal serving via Ollama/llama.cpp, the launchd service, and the macOS AirPlay Receiver conflict on ports 7000/5000.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 8dc9a3578a)

* Add downloadable macOS launcher app builder

build-macos-app.sh generates dist/Odysseus.app and a drag-to-Applications dist/Odysseus.dmg. The app starts the local server from this repo's venv and opens the UI in a chrome-less app window (Chromium --app mode, falling back to the default browser). It's a launcher wrapper (it drives the venv rather than bundling Python), so the install path is baked in at build time.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 7927940c38)

* Harden macOS Cookbook support: hide MLX, fix Metal build cache

Builds on the adopted PR #213 macOS/Metal work with two fixes and tests:

- fit.py: always drop MLX-quantized models. Odysseus only generates serve commands for llama.cpp/Ollama (Metal) and vLLM/SGLang (CUDA); MLX needs the mlx_lm runtime and the catalog's MLX repos ship no GGUF alternative, so they were surfaced on Apple Silicon but could never be served.
- cookbook_routes.py (macOS branch only): `rm -rf build` before configure so a poisoned CMakeCache from a prior failed CUDA attempt can't make every later build fail; explicit -DCMAKE_BUILD_TYPE=Release; a clear "brew install cmake" hint if cmake is missing. Linux/CUDA path unchanged.
- tests/test_hwfit_macos.py: MLX hidden on metal, MLX still hidden on CUDA (regression guard), Metal detection on Apple Silicon, and skipped on Linux/Intel (proves non-macOS detection is untouched).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Propagate unified_memory flag and document macOS GPU/Docker caveat

- hardware.py: detect_system now carries the unified_memory flag from GPU detection into the system dict (it was set by _detect_apple_silicon / AMD-APU detection but dropped during result assembly, so the API always reported null). Lets callers distinguish unified from discrete VRAM.
- README: prominent warning that Docker on Apple Silicon can't reach the Metal GPU (runs a Linux VM), so Cookbook must run natively for GPU serving; fix stale text that said Cookbook recommends MLX models (now hidden as unservable).
- test: detect_system propagates unified_memory.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Put Odysseus's venv bin on PATH for cookbook runners

Native (non-Docker) installs run from a virtualenv whose bin holds the `hf` CLI and `python3` the cookbook download/serve tmux scripts shell out to. Those scripts start in a fresh login shell with the venv NOT activated, so on a native macOS install `hf download` failed with "hf: command not found", and the `pip --user` self-heal missed because macOS has no bare `pip` command.

- cookbook_helpers.py: _local_tooling_path_export(), a pure helper returning a PATH export for the running interpreter's bin dir (escaped for double quotes).
- cookbook_routes.py: download + serve runners prepend that dir on local runs (gated off SSH/Windows); swap the `pip` install fallbacks to `python3 -m pip`.
- tests: helper output for normal and spaced paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Document macOS llama.cpp serving prerequisites

Clarify the two serving paths on Apple Silicon: the recommended zero-build route (brew install llama.cpp ships a Metal llama-server Cookbook finds on PATH), and the from-source fallback, which requires cmake + Xcode Command Line Tools. Without those the build is skipped and serving silently degrades to a slow CPU build, so new users now know to install them (or use the prebuilt) up front.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Recommend only GGUF-servable models on Metal

Apple Silicon's only serving engines are llama.cpp and Ollama, both GGUF-only (vLLM/SGLang are CUDA/ROCm and don't run on macOS). The catalog tags raw safetensors repos with a default Q4_K_M quant, so the fit-ranking was recommending ~397/501 models that have no GGUF and fail to serve on Metal with "No GGUF found" (e.g. microsoft/Phi-mini-MoE-instruct). Drop any model without a real GGUF (is_gguf/gguf_sources) on Apple Silicon; this subsumes the previous AWQ/GPTQ/FP8 special-case into one rule. On CUDA these stay visible since vLLM serves safetensors directly. Metal recommendations go 501 -> 104, all actually servable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Remove macOS launchd LaunchAgent (cherry-picked extra)

Drop the launchd service from the PR #213 cherry-picks: the install-service-macos.sh installer, the com.odysseus.ui.plist template, and the README section documenting them. Tangential to the core Cookbook/Metal support and not wanted. The build-macos-app.sh launcher is kept.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Add one-command macOS quick start (start-macos.sh)

Running Odysseus natively on a Mac previously meant ~7 manual terminal steps (brew deps, venv, activate, pip, setup.py, uvicorn with the right port), which is not friendly for a generic macOS user, and the native run is required because Docker on macOS can't reach the Metal GPU.

- start-macos.sh: installs Homebrew deps (python@3.11, tmux, prebuilt Metal llama.cpp), creates the venv, installs requirements, runs setup, and launches on a non-AirPlay port (7860). Idempotent; re-run to start again.
- README: the Apple Silicon section now leads with this one-command quick start and the clickable .app, with engine/port/manual details folded into a collapsible block. Added a pointer at the top of the manual-install section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* macOS quick start: auto-open browser when ready

The "open this URL" line scrolled out of view as uvicorn kept logging after it, so users missed it. Now start-macos.sh waits (in the background) until the server accepts connections, prints a boxed "ready" banner at that point (i.e. after the startup burst, not before), and opens the URL in the default browser automatically. Skippable with ODYSSEUS_NO_OPEN=1 for headless/SSH use.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Don't assume/force a specific Python version on macOS

The README claimed "system Python is 3.9", a machine-specific generalization that's often wrong (macOS ships no recent Python by default; many users already have 3.11+). Make it generic, and make start-macos.sh detect an existing Python 3.11+ and use it, only installing python@3.11 when none is found instead of forcing it on top of the user's Python.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Align start-macos.sh venv path with build-macos-app.sh

start-macos.sh created the environment in .venv/, but build-macos-app.sh and the manual install steps use venv/, so the clickable .app wouldn't reuse the quick-start's environment and would rebuild a second one. Use venv/ everywhere.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* README: state clearly that MLX is unsupported on Apple Silicon

Odysseus has no mlx_lm runtime; it serves GGUF (llama.cpp/Ollama) and CUDA (vLLM/SGLang) only. MLX-only models can't run on a Mac and are hidden from Cookbook; make that explicit in both the quick start and the details.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* start-macos.sh: build the venv with an arm64 Python on Apple Silicon

A clean-room run surfaced this: with a universal2/x86 Python (e.g. the python.org installer under /usr/local), the venv's compiled extensions install as arm64 but get loaded as x86_64 when launched from the .app bundle, so it crashes with "incompatible architecture (have arm64, need x86_64)". The terminal run happened to work only because a universal binary defaults to arm64 there.

On Apple Silicon, look only under /opt/homebrew (arm64-only) for the build Python, and install Homebrew's python@3.11 if none is present, so the venv is arm64-only and launches correctly from both the terminal and the .app. Intel and non-mac paths are unchanged.

Verified end-to-end in a clean clone: .app now boots on Metal with no arch error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Address dev-exp review: macOS setup robustness + doc/UX fixes

From the voltagent dev-exp review of the branch:

- README: fix broken anchor links (the em-dash heading produced a slug the links didn't match); simplify the heading to a stable slug.
- cookbook_routes.py: add /opt/homebrew/bin and /usr/local/bin to the serve PATH so a brew-installed llama-server/ollama is found instead of falling back to a slow source build.
- start-macos.sh: guard against an empty Python path; fail fast with a clear message on port-in-use; ERR trap with a "safe to re-run" message; show pip progress (drop --quiet on the slow requirements install); stop the background browser-opener cleanly on exit/Ctrl+C (no orphaned poller).
- setup.py: bind hint to 127.0.0.1; suppress the manual run-hint when launched by start-macos.sh (ODYSSEUS_SKIP_RUN_HINT) so the URL isn't contradictory.
- build-macos-app.sh: the .app only opens the browser once the server is actually ready (not after the readiness timeout).
- cookbookServe.js: drop "Diffusers" from the Metal backend picker: diffusion_server.py is CUDA-only, so it was an unservable option on macOS.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: yunggilja <yunggilja@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-17 10:15:27 -04:00 · 2026-06-01 15:29:19 +09:30
parent b998c52dd0
commit f1817fd560
13 changed files with 835 additions and 29 deletions
@@ -19,12 +19,22 @@ GPU_BANDWIDTH = {
    "6950 xt": 576, "6900 xt": 512, "6800 xt": 512, "6800": 512, "6700 xt": 384, "6600 xt": 256, "6600": 224,
    "mi300x": 5300, "mi300": 5300, "mi250x": 3277, "mi250": 3277, "mi210": 1638, "mi100": 1229,
    "9070 xt": 624, "9070": 488,
+    # Apple Silicon unified-memory bandwidth (GB/s). Keyed off the chip name
+    # reported by sysctl machdep.cpu.brand_string (e.g. "Apple M4 Max"). Entry
+    # order here doesn't matter: keys are length-sorted below, which guarantees
+    # "m4 max" is tried before the bare "m4".
+    "m1 ultra": 800, "m1 max": 400, "m1 pro": 200, "m1": 68,
+    "m2 ultra": 800, "m2 max": 400, "m2 pro": 200, "m2": 100,
+    "m3 ultra": 800, "m3 max": 300, "m3 pro": 150, "m3": 100,
+    "m4 max": 410, "m4 pro": 273, "m4": 120,
 }

 # Pre-sort keys by length descending for correct substring matching
 _BW_KEYS_SORTED = sorted(GPU_BANDWIDTH.keys(), key=len, reverse=True)

-FALLBACK_K = {"cuda": 220, "rocm": 180, "cpu_x86": 70, "cpu_arm": 90}
+# metal: backstop for Apple Silicon chips not in GPU_BANDWIDTH (e.g. a future
+# M5) — the named chips above take the accurate bandwidth path instead.
+FALLBACK_K = {"cuda": 220, "rocm": 180, "metal": 150, "cpu_x86": 70, "cpu_arm": 90}

 USE_CASE_WEIGHTS = {
    "general":    (0.45, 0.30, 0.15, 0.10),
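
The length-sorted substring match that the comment above relies on can be sketched as follows. This is a minimal illustration, not the code in fit.py: the table is a small subset of the entries in the hunk, and `lookup_bandwidth` is a hypothetical helper name introduced here for the example.

```python
from typing import Optional

# Illustrative subset of the GPU_BANDWIDTH entries added in the hunk (GB/s).
GPU_BANDWIDTH = {
    "m1 ultra": 800, "m1": 68,
    "m4 max": 410, "m4 pro": 273, "m4": 120,
}

# Longest keys first, so "m4 max" is tried before the bare "m4".
_BW_KEYS_SORTED = sorted(GPU_BANDWIDTH.keys(), key=len, reverse=True)


def lookup_bandwidth(gpu_name: str) -> Optional[int]:
    """Hypothetical helper: GB/s for the first (longest) key found in gpu_name."""
    name = gpu_name.lower()
    for key in _BW_KEYS_SORTED:
        if key in name:
            return GPU_BANDWIDTH[key]
    # Unknown chip (e.g. a future M5): the caller would use its fallback path.
    return None


print(lookup_bandwidth("Apple M4 Max"))
print(lookup_bandwidth("Apple M4"))
```

Without the length sort, a dict that happened to list "m4" first would match "Apple M4 Max" on the bare key and report the base chip's bandwidth.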
@@ -411,17 +421,28 @@ def rank_models(system, use_case=None, limit=50, search=None, sort="score", quan
    # If user picked a prequantized format (AWQ/FP8/GPTQ), filter to only those models
    filter_native = quant and any(quant.startswith(p) for p in ("AWQ-", "GPTQ-", "FP8"))

-    # MLX-quantized models only run on Apple Silicon (Metal). Exclude them on
-    # every other backend (CUDA / ROCm / CPU) so Linux/Windows users don't see
-    # unrunnable suggestions.
    system_backend = (system.get("backend") or "").lower()
    apple_silicon = system_backend in ("mps", "metal", "apple")

    for m in models:
        native_q = m.get("quantization", "")

-        # Drop MLX models on non-Apple hardware
-        if not apple_silicon and native_q.startswith("mlx-"):
+        # MLX-quantized models need the MLX runtime (mlx_lm), which Odysseus
+        # doesn't generate serve commands for — only llama.cpp/Ollama (Metal)
+        # and vLLM/SGLang (CUDA). MLX repos ship no GGUF alternative, so they're
+        # unrunnable on every backend we support. Always drop them, on Apple
+        # Silicon too, so the Cookbook never recommends a model it can't serve.
+        if native_q.startswith("mlx-"):
+            continue
+
+        # On Apple Silicon the only serving engines are llama.cpp and Ollama,
+        # both GGUF-only (vLLM/SGLang are CUDA/ROCm and don't run on macOS). So
+        # a model is Metal-servable ONLY if it ships a real GGUF. Drop everything
+        # else — raw safetensors repos (which the catalog still tags with a
+        # default GGUF quant) and vLLM-only AWQ/GPTQ/FP8 builds alike. Without
+        # this the Cookbook recommends models the Mac can't run; on CUDA these
+        # stay visible because vLLM serves safetensors directly.
+        if apple_silicon and not (m.get("is_gguf") or m.get("gguf_sources")):
            continue

        # Format filter: AWQ tab → only AWQ models, FP8 tab → only FP8 models
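
The two filters in this hunk can be read as a single predicate. The sketch below assumes the same catalog fields the diff uses (`quantization`, `is_gguf`, `gguf_sources`); the function name `is_servable` and the sample entries are hypothetical and exist only for illustration.

```python
def is_servable(model: dict, backend: str) -> bool:
    """Hypothetical predicate mirroring the MLX and GGUF filters in rank_models."""
    apple_silicon = (backend or "").lower() in ("mps", "metal", "apple")
    native_q = model.get("quantization") or ""

    # MLX builds need the mlx_lm runtime, which is never served: drop everywhere.
    if native_q.startswith("mlx-"):
        return False

    # On Metal only llama.cpp/Ollama are available, and both need a real GGUF.
    if apple_silicon and not (model.get("is_gguf") or model.get("gguf_sources")):
        return False

    return True


# Made-up catalog entries, for illustration only.
gguf_model = {"quantization": "Q4_K_M", "is_gguf": True}
safetensors_model = {"quantization": "Q4_K_M"}  # default tag, no GGUF shipped
mlx_model = {"quantization": "mlx-4bit"}

print(is_servable(gguf_model, "metal"))
print(is_servable(safetensors_model, "metal"))
print(is_servable(safetensors_model, "cuda"))
print(is_servable(mlx_model, "metal"))
```

The safetensors entry is the interesting case: it carries a GGUF-style quant tag but no GGUF file, so it is kept on CUDA and dropped on Metal.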
@@ -204,6 +204,82 @@ def _detect_amd():
        return None


+def _detect_apple_silicon():
+    """Detect Apple Silicon (M-series) GPUs.
+
+    Macs have no discrete VRAM — the GPU shares the system's unified memory.
+    We report a fraction of total RAM as the usable GPU budget (matching macOS's
+    default Metal working-set limit) so the Cookbook recommends models that
+    actually run on the GPU instead of classifying the machine as CPU-only.
+
+    backend="metal" is what services.hwfit.fit and the serve-command generation
+    key off of (they serve through llama.cpp-Metal / Ollama; MLX is not
+    supported). Works locally (platform.system()=="Darwin") and over SSH
+    (uname -s == Darwin).
+    """
+    # Gate to macOS — locally via platform, remotely via uname.
+    if _remote_host:
+        if "darwin" not in (_run(["uname", "-s"]) or "").lower():
+            return None
+        arch = (_run(["uname", "-m"]) or "").lower()
+    else:
+        if platform.system() != "Darwin":
+            return None
+        arch = platform.machine().lower()
+
+    # Only Apple Silicon (arm64) has a Metal GPU worth serving LLMs on; Intel
+    # Macs fall through to the CPU path.
+    if "arm" not in arch and "aarch64" not in arch:
+        return None
+
+    # Chip name, e.g. "Apple M4 Max" — carries the Pro/Max/Ultra variant that
+    # the fit bandwidth table keys off of.
+    brand = (_run(["sysctl", "-n", "machdep.cpu.brand_string"]) or "Apple Silicon").strip()
+
+    # Total unified memory in bytes.
+    memsize = _run(["sysctl", "-n", "hw.memsize"])
+    try:
+        total_gb = int(memsize) / (1024**3) if memsize else 0.0
+    except ValueError:
+        total_gb = 0.0
+    if total_gb <= 0:
+        return None
+
+    # Usable GPU budget. macOS lets Metal use most of unified memory, but the
+    # default working-set limit scales with RAM: small machines have to keep
+    # more back for the OS + apps. These fractions approximate Apple's
+    # recommendedMaxWorkingSetSize defaults across the lineup. Honour an
+    # explicit override if the user raised it with
+    # `sudo sysctl iogpu.wired_limit_mb=…`.
+    if total_gb <= 16:
+        frac = 0.67
+    elif total_gb <= 64:
+        frac = 0.75
+    else:
+        frac = 0.80
+    vram_gb = round(total_gb * frac, 1)
+    wired = _run(["sysctl", "-n", "iogpu.wired_limit_mb"])
+    try:
+        wired_mb = int(wired) if wired else 0
+        if wired_mb > 0:
+            vram_gb = round(wired_mb / 1024.0, 1)
+    except ValueError:
+        pass
+
+    gpu = {"index": 0, "name": brand, "vram_gb": vram_gb}
+    return {
+        "gpu_name": brand,
+        "gpu_vram_gb": vram_gb,
+        "gpu_count": 1,
+        "gpus": [gpu],
+        "gpu_groups": _group_gpus([gpu]),
+        "homogeneous": True,
+        "backend": "metal",
+        # Unified memory: the "VRAM" above is carved out of system RAM, not a
+        # separate pool — downstream fit logic uses this to avoid double-budgeting.
+        "unified_memory": True,
+    }
+
+
 def _read_file(path):
    """Read a file, locally or via SSH."""
    if _remote_host:
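
The usable-GPU-budget arithmetic in `_detect_apple_silicon` can be isolated as a pure function. This is a sketch using the tiers and override rule shown in the hunk; the name `usable_gpu_budget_gb` and its parameters are hypothetical, and the real function reads both inputs from sysctl instead of taking arguments.

```python
def usable_gpu_budget_gb(total_gb: float, wired_limit_mb: int = 0) -> float:
    """Hypothetical helper: GPU budget in GB from total unified memory.

    wired_limit_mb stands in for the iogpu.wired_limit_mb sysctl value;
    0 means the user has not set an override.
    """
    # An explicit override wins over the RAM-scaled default.
    if wired_limit_mb > 0:
        return round(wired_limit_mb / 1024.0, 1)

    # Smaller machines keep a larger share back for the OS and apps.
    if total_gb <= 16:
        frac = 0.67
    elif total_gb <= 64:
        frac = 0.75
    else:
        frac = 0.80
    return round(total_gb * frac, 1)


print(usable_gpu_budget_gb(16))    # 16 GB machine, default limit
print(usable_gpu_budget_gb(64))    # 64 GB machine, default limit
print(usable_gpu_budget_gb(128))   # 128 GB machine, default limit
print(usable_gpu_budget_gb(32, wired_limit_mb=28672))  # user-raised limit
```

For a 16 GB machine this gives 10.7 GB, for 64 GB it gives 48.0 GB, and for 128 GB it gives 102.4 GB; a 28672 MB override on a 32 GB machine yields 28.0 GB in place of the default 24.0 GB.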
@@ -246,6 +322,15 @@ def _get_ram_gb():
                return (pages * page_size) / (1024**3)
        except Exception:
            pass
+
+    # macOS has no /proc/meminfo — fall back to sysctl (works locally and over
+    # SSH to a remote Mac, where the sysconf path above isn't taken).
+    memsize = _run(["sysctl", "-n", "hw.memsize"])
+    if memsize:
+        try:
+            return int(memsize.strip()) / (1024**3)
+        except ValueError:
+            pass
    return 0.0


@@ -263,6 +348,12 @@ def _get_cpu_name():
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()

+    # macOS has no /proc/cpuinfo — sysctl gives the chip name (e.g. "Apple M4").
+    # Harmlessly returns nothing on Linux, so it's safe to try unconditionally.
+    brand = _run(["sysctl", "-n", "machdep.cpu.brand_string"])
+    if brand and brand.strip():
+        return brand.strip()
+
    if not _remote_host:
        return platform.processor() or "unknown"
    return "unknown"
@@ -270,7 +361,8 @@ def _get_cpu_name():

 def _get_cpu_count():
    if _remote_host:
-        out = _run(["nproc"])
+        # nproc on Linux; hw.ncpu via sysctl on a remote Mac (no nproc there).
+        out = _run(["nproc"]) or _run(["sysctl", "-n", "hw.ncpu"])
        if out:
            try:
                return int(out.strip())
@@ -411,7 +503,7 @@ def detect_system(host="", ssh_port="", platform="", fresh=False):
    cpu_cores = _get_cpu_count()
    cpu_name = _get_cpu_name()

-    gpu_info = _detect_nvidia() or _detect_amd()
+    gpu_info = _detect_apple_silicon() or _detect_nvidia() or _detect_amd()

    if gpu_info:
        result = {
@@ -427,6 +519,9 @@ def detect_system(host="", ssh_port="", platform="", fresh=False):
            "gpu_groups": gpu_info.get("gpu_groups", []),
            "homogeneous": gpu_info.get("homogeneous", True),
            "backend": gpu_info["backend"],
+            # Apple Silicon / AMD APUs share system RAM with the GPU — carry the
+            # flag through so callers can tell unified from discrete VRAM.
+            "unified_memory": gpu_info.get("unified_memory", False),
        }
    else:
        if _remote_host: