Description
Describe the bug
When I try to run vLLM with the intel/vllm:0.11.1-xpu image on Kubernetes, I get several warnings indicating that it is no longer able to use MXFP4 quantization.
I'm running this on a node with two Arc B580 GPUs (12 GB of VRAM each). Here is the OS info on the host node:
uname -a
Linux fs5 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
Inside the container:
uname -a
Linux vllm-gptoss-xpu-5f9755f86d-n7m9h 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
python --version
Python 3.12.3
cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
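For isolation, the same image and serve command can also be tried outside Kubernetes. A hypothetical docker run sketch (the image tag and serve flags mirror the deployment YAML in the repro section below; <hf-token> is a placeholder):

# Hypothetical single-node repro outside Kubernetes; untested as written.
docker run --rm -it \
  --device /dev/dri \
  --shm-size 10g \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN=<hf-token> \
  intel/vllm:0.11.1-xpu \
  vllm serve openai/gpt-oss-20b \
    --dtype=bfloat16 --trust-remote-code --enforce-eager \
    --max-model-len 1024 --tensor-parallel-size 2 --port 8000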
Error Logs
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [gpu_model_runner.py:3258] Starting to load model openai/gpt-oss-20b...
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
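The lines above were captured from the pod logs; to isolate the MXFP4 messages, something like this should work (deployment and namespace names as in the manifest below):

kubectl logs -n vllm deploy/vllm-gptoss-xpu | grep -i mxfp4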
Reproduction Instructions
YAML:
# Deployment for fs5 running gpt-oss-20b on Intel Arc B580 GPUs with MXFP4
# Image: intel/vllm:0.11.1-xpu
# NodePort: 30024
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gptoss-xpu
  namespace: vllm
  labels:
    app: vllm-gptoss-xpu
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: vllm-gptoss-xpu
  template:
    metadata:
      labels:
        app: vllm-gptoss-xpu
    spec:
      nodeSelector:
        kubernetes.io/hostname: fs5
        intel.feature.node.kubernetes.io/gpu: "true"
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: vllm-pvc
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "10Gi"
        - name: dri
          hostPath:
            path: /dev/dri
      containers:
        - name: vllm-gptoss-xpu
          image: intel/vllm:0.11.1-xpu
          imagePullPolicy: IfNotPresent
          securityContext:
            privileged: true
          command: ["/bin/bash", "-c"]
          args:
            - |
              export HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
              export VLLM_WORKER_MULTIPROC_METHOD=spawn
              export VLLM_XPU_MEMORY_LIMIT=10GiB
              vllm serve openai/gpt-oss-20b \
                --dtype=bfloat16 \
                --trust-remote-code \
                --enforce-eager \
                --port 8000 \
                --block-size 16 \
                --gpu-memory-utilization 0.85 \
                --no-enable-prefix-caching \
                --disable-log-requests \
                --max-num-batched-tokens 512 \
                --max-model-len 1024 \
                --tensor-parallel-size 2 \
                --cpu-offload-gb 20
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
            - name: HF_HOME
              value: "/root/.cache/huggingface"
            - name: TRANSFORMERS_CACHE
              value: "/root/.cache/huggingface"
          ports:
            - containerPort: 8000
          resources:
            limits:
              cpu: "16"
              memory: 40Gi
            requests:
              cpu: "4"
              memory: 12Gi
          volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
            - name: shm
              mountPath: /dev/shm
            - name: dri
              mountPath: /dev/dri
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 40
          readinessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            periodSeconds: 30
            failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-gptoss-xpu
  namespace: vllm
spec:
  type: NodePort
  selector:
    app: vllm-gptoss-xpu
  ports:
    - port: 8000
      targetPort: 8000
      nodePort: 30024
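For reference, this is roughly how the manifest is deployed and smoke-tested (the file name is arbitrary; host fs5 and NodePort 30024 come from the manifest above):

kubectl apply -f vllm-gptoss-xpu.yaml
kubectl -n vllm rollout status deploy/vllm-gptoss-xpu
curl http://fs5:30024/v1/models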
Affected Subfolder
- classical-ml
- enterprise
- preset
- python
- pytorch
- tensorflow
- test-runner
- workflows
Versions
lscpu
lspci
cat /etc/os-release
docker --version
docker compose version
python --version
pip freeze
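If useful, the container-side items can be collected from the running pod rather than the host (deployment and namespace names as in the manifest above; lscpu, lspci, and the docker commands apply on the host):

# on the host node
lscpu; lspci; cat /etc/os-release; docker --version; docker compose version
# inside the pod
kubectl exec -n vllm deploy/vllm-gptoss-xpu -- python --version
kubectl exec -n vllm deploy/vllm-gptoss-xpu -- pip freeze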