Cookbook: auto-set KV cache to fp8 for DeepSeek V3/V4/R1 MoE families

These models OOM on --kv-cache-dtype auto (≈bf16) at any usable
context with current tensor-parallel layouts. _detectModelOptimizations
now seeds opts.kvCacheDtype='fp8' for them, and the serve panel's KV
Cache select picks that up as the default unless the user has a
saved override on this skill.
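
The precedence described above (saved override first, then the detected default, then auto) can be sketched as follows. This is a minimal illustration, not the code from this commit: the regex is copied from the diff, but detectKvCacheDtype, resolveKvCacheDefault and the savedOverride parameter are hypothetical names, and lowercasing the model name is an assumption about how `n` is derived.

```javascript
// Hypothetical sketch: only the regex and the 'fp8' value come from the commit.
const DEEPSEEK_MOE = /\b(v[3-9]|v\d{2,}|r[1-9])\b/;

// Returns 'fp8' for DeepSeek V3+/R1+ style names, otherwise null.
// Assumption: the real code matches against a lowercased model name.
function detectKvCacheDtype(modelName) {
  const n = String(modelName).toLowerCase();
  return n.includes('deepseek') && DEEPSEEK_MOE.test(n) ? 'fp8' : null;
}

// Picks the value the KV Cache select would start with:
// a saved user override wins, then the detected default, then 'auto'.
function resolveKvCacheDefault(modelName, savedOverride) {
  return savedOverride ?? detectKvCacheDtype(modelName) ?? 'auto';
}
```

For example, resolveKvCacheDefault('deepseek-ai/DeepSeek-V3', undefined) yields 'fp8' under these assumptions, while passing a saved override of 'auto' keeps 'auto'. Note that the pattern keys only on the version token, so any DeepSeek name containing a token such as r1 would be matched as well.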
pewdiepie-archdaemon
2026-06-14 08:57:29 +09:00
parent d3944be1be
commit 4074e77d93
2 changed files with 13 additions and 2 deletions
+5 -1
@@ -249,10 +249,14 @@ function _detectModelOptimizations(modelName) {
   }
   // DeepSeek MoE — V3 / V3.1 / V4 (and future Vx), R1 / R2 reasoning.
   // Anything v-{integer} or r-{integer} family from DeepSeek is MoE in
-  // current architectures.
+  // current architectures. These models also require fp8 KV cache to
+  // fit at meaningful context with current tensor-parallel layouts —
+  // the launch crashes otherwise (--kv-cache-dtype auto → bf16 OOMs).
   else if (n.includes('deepseek') && /\b(v[3-9]|v\d{2,}|r[1-9])\b/.test(n)) {
     opts.flags.push('--enable-expert-parallel');
     opts.tips.push('MoE expert parallel for DeepSeek');
+    opts.kvCacheDtype = 'fp8';
+    opts.tips.push('fp8 KV cache required — bf16 OOMs at usable context');
   }
   // Reasoning parser — applies independently of MoE detection. Without this
   // flag, models like MiniMax-M2.x, DeepSeek-R1, Qwen3 reasoning, GLM-4.x,