
MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod. #958

@prentrodgers

Description

Describe the bug

When I try to run vLLM with the image intel/vllm:0.11.1-xpu on Kubernetes, I get several warnings indicating that it is no longer able to use MXFP4 quantization.
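For context, a back-of-envelope estimate (my own numbers, not from the report or the vLLM docs) of why this fallback matters on 12 GB cards: MXFP4 stores roughly 4.25 bits per weight (4-bit values plus a shared 8-bit scale per 32-element block), so falling back to unquantized bf16 inflates the weight footprint by almost 4x. A minimal sketch, assuming ~20B parameters for gpt-oss-20b:

```python
# Back-of-envelope weight-memory estimate (illustrative assumptions,
# not measured values from the report).
params = 20e9        # assumption: gpt-oss-20b has roughly 20B parameters
GiB = 1024 ** 3

bf16_bytes  = params * 2          # unquantized fallback: 16 bits per weight
mxfp4_bytes = params * 4.25 / 8   # MXFP4: 4-bit values + 8-bit scale per 32 values

print(f"bf16 : {bf16_bytes / GiB:.1f} GiB")   # ~37 GiB
print(f"mxfp4: {mxfp4_bytes / GiB:.1f} GiB")  # ~10 GiB

# Two 12 GiB B580s give 24 GiB of VRAM total: the MXFP4 weights would fit,
# but the bf16 fallback cannot without heavy CPU offload.
```

This lines up with the `--cpu-offload-gb 20` flag in the deployment below being needed at all once the fallback kicks in.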

I'm running this on a node with two Intel Arc B580 GPUs (12 GB of VRAM each). Here is the OS info on the host node:
uname -a
Linux fs5 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu

Inside the container:
uname -a
Linux vllm-gptoss-xpu-5f9755f86d-n7m9h 6.14.0-37-generic #37~24.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 20 10:25:38 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
python --version
Python 3.12.3
cat /etc/release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS"
PRETTY_NAME="Ubuntu 24.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.3 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"

Error Logs

(Worker_TP0 pid=227) INFO 01-30 18:16:00 [gpu_model_runner.py:3258] Starting to load model openai/gpt-oss-20b...
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP1 pid=228) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [xpu.py:58] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [xpu.py:79] Using Flash Attention backend.
(Worker_TP0 pid=227) INFO 01-30 18:16:00 [mxfp4.py:147] Using ipex marlin backend on XPU
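One thing the excerpt shows is that the fallback is not specific to one rank: both tensor-parallel workers emit the same warnings. A quick sketch that counts the MXFP4 warnings per worker over the pasted lines (parsing logic is my own, operating on the log text above verbatim):

```python
import re
from collections import Counter

# MXFP4 warning lines from the log excerpt above, pasted verbatim.
log = """\
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP1 pid=228) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 linear layer is not implemented - falling back to UnquantizedLinearMethod.
(Worker_TP0 pid=227) WARNING 01-30 18:16:00 [logger.py:133] MXFP4 attention layer is not implemented. Skipping quantization for this layer.
"""

# Count MXFP4 warnings per tensor-parallel rank.
hits = Counter(
    re.match(r"\((Worker_TP\d+)", line).group(1)
    for line in log.splitlines()
    if "MXFP4" in line and "WARNING" in line
)
print(dict(hits))  # → {'Worker_TP1': 2, 'Worker_TP0': 2}
```

Both ranks report the same two warnings, so the problem looks like a backend capability gap rather than a per-device issue.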

Reproduction Instructions

YAML:
# Deployment for fs5 running gpt-oss-20b on Intel Arc B580 GPUs with MXFP4
# Image: intel/vllm:0.11.1-xpu
# NodePort: 30024
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gptoss-xpu
  namespace: vllm
  labels:
    app: vllm-gptoss-xpu
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: vllm-gptoss-xpu
  template:
    metadata:
      labels:
        app: vllm-gptoss-xpu
    spec:
      nodeSelector:
        kubernetes.io/hostname: fs5
        intel.feature.node.kubernetes.io/gpu: "true"
      volumes:
      - name: cache-volume
        persistentVolumeClaim:
          claimName: vllm-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "10Gi"
      - name: dri
        hostPath:
          path: /dev/dri
      containers:
      - name: vllm-gptoss-xpu
        image: intel/vllm:0.11.1-xpu
        imagePullPolicy: IfNotPresent
        securityContext:
          privileged: true
        command: ["/bin/bash", "-c"]
        args:
          - |
            export HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
            export VLLM_WORKER_MULTIPROC_METHOD=spawn
            export VLLM_XPU_MEMORY_LIMIT=10GiB
            vllm serve openai/gpt-oss-20b \
              --dtype=bfloat16 \
              --trust-remote-code \
              --enforce-eager \
              --port 8000 \
              --block-size 16 \
              --gpu-memory-utilization 0.85 \
              --no-enable-prefix-caching \
              --disable-log-requests \
              --max-num-batched-tokens 512 \
              --max-model-len 1024 \
              --tensor-parallel-size 2 \
              --cpu-offload-gb 20
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: HF_HOME
          value: "/root/.cache/huggingface"
        - name: TRANSFORMERS_CACHE
          value: "/root/.cache/huggingface"
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "16"
            memory: 40Gi
          requests:
            cpu: "4"
            memory: 12Gi
        volumeMounts:
        - mountPath: /root/.cache/huggingface
          name: cache-volume
        - name: shm
          mountPath: /dev/shm
        - name: dri
          mountPath: /dev/dri
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 15
          failureThreshold: 40
        readinessProbe:
          httpGet:
            path: /v1/models
            port: 8000
          periodSeconds: 10
          failureThreshold: 3
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          periodSeconds: 30
          failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-gptoss-xpu
  namespace: vllm
spec:
  type: NodePort
  selector:
    app: vllm-gptoss-xpu
  ports:
  - port: 8000
    targetPort: 8000
    nodePort: 30024

Affected Subfolder

  • classical-ml
  • enterprise
  • preset
  • python
  • pytorch
  • tensorflow
  • test-runner
  • workflows

Versions

lscpu
lspci
cat /etc/os-release
docker --version
docker compose version
python --version
pip freeze
