Many transformer encoder-type models can see considerable performance gains by being converted to ONNX. Some model families (BERT, RoBERTa, etc.) can be further quantized to ONNX-FP16 for 2-4X performance gains with no accuracy penalty. This repo contains scripts to convert models, validate their accuracy, and benchmark them.
A collection of pre-converted models, with their accuracy metrics, can be found here: https://huggingface.co/collections/joaopn/onnx-fp16
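One of the pre-converted models can be used directly through optimum and the standard transformers pipeline. A minimal sketch, assuming a CUDA-capable setup (the model ID and device string are illustrative):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "joaopn/roberta-base-go_emotions-onnx-fp16"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The FP16 graph is optimized for GPU inference, so load it with the CUDA execution provider
model = ORTModelForSequenceClassification.from_pretrained(model_id, provider="CUDAExecutionProvider")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device="cuda:0", top_k=None)
print(classifier("this works just like the original model"))
```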
GPU benchmark of the SamLowe/roberta-base-go_emotions model on two datasets of 10k Reddit comments each, one random and one filtered to comments with >200 characters. The full dataset is pre-tokenized and dispatched to a varying number of GPU workers with varying batch sizes. The figures below show the best combination (check results/ for the full data). On the normal dataset, the ONNX-FP16 version (joaopn/roberta-base-go_emotions-onnx-fp16) had a mean label probability difference from the original model of 0.00119459 and a max of 0.02122244, across all labels.
- The ONNX-FP16 model is 3-4X faster than base torch and ~2X faster than torch.compile
- With the top-end cards, the model hits the CPU dispatch rate bottleneck: the H200 is barely faster than the RTX 4090 despite much higher specs, even with multiple parallel workers
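The label probability difference reported above can be measured by running both backends on the same inputs and comparing per-label probabilities. A minimal sketch, assuming the go_emotions multi-label head (the repo's validation script may differ in detail):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from optimum.onnxruntime import ORTModelForSequenceClassification

texts = ["example reddit comment", "another comment"]  # stand-in for the 10k-comment dataset

tokenizer = AutoTokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Original FP32 torch model (CPU is fine for a reference run)
torch_model = AutoModelForSequenceClassification.from_pretrained("SamLowe/roberta-base-go_emotions")
with torch.no_grad():
    p_torch = torch.sigmoid(torch_model(**inputs).logits)  # go_emotions is multi-label -> sigmoid

# ONNX-FP16 model on the GPU (the optimized graph targets the CUDA execution provider)
onnx_model = ORTModelForSequenceClassification.from_pretrained(
    "joaopn/roberta-base-go_emotions-onnx-fp16", provider="CUDAExecutionProvider"
)
gpu_inputs = {k: v.to("cuda") for k, v in inputs.items()}
p_onnx = torch.sigmoid(onnx_model(**gpu_inputs).logits).cpu()

diff = (p_torch - p_onnx).abs()
print(f"mean diff: {diff.mean():.8f}  max diff: {diff.max():.8f}")
```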
ONNX with CUDA requires a working torch installation with CUDA support, as well as transformers, optimum, pandas and tqdm. These can be installed with
pip install transformers optimum[onnxruntime-gpu] pandas tqdm --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
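After installation, you can check that ONNX Runtime actually sees the GPU:

```python
import onnxruntime as ort
# The list should include "CUDAExecutionProvider" for GPU inference to work
print(ort.get_available_providers())
```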
Alternatively, a conda environment named bench, with CUDA 12.8 and all the requirements, can be created with
conda env create -f environment.yml
conda activate bench
The dataset consists of 10k randomly sampled Reddit comments (12/2005-03/2023) from the Pushshift data dumps. It excludes comments with empty, [deleted] or [removed] content. Two options are provided:
- normal: as described above
- filtered: contains only comments with >200 characters
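A minimal sketch of this kind of filtering in pandas; the file path, the body column name, and the sampling seed are assumptions, not the repo's actual preprocessing code:

```python
import pandas as pd

df = pd.read_parquet("comments.parquet")  # placeholder path for the Pushshift-derived dump

# Exclude empty, [deleted] and [removed] comments
df = df[~df["body"].isin(["", "[deleted]", "[removed]"])]

# "normal": random sample of 10k comments
normal = df.sample(n=10_000, random_state=42)

# "filtered": only comments longer than 200 characters, then sample 10k
filtered = df[df["body"].str.len() > 200].sample(n=10_000, random_state=42)
```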
To run the benchmarks, use the run_benchmark.py script:
python run_benchmark.py --model [torch, onnx or onnx-fp16] --device [gpu or cpu]
Arguments:
- model (required): Model backend to use: "torch" for PyTorch, "onnx" for ONNX Runtime, or "onnx-fp16" for the FP16 ONNX model.
- device (required): Device type to use, either "gpu" or "cpu".
- dataset: Dataset variant to use, either "normal" or "filtered" (default: "normal").
- gpu: ID of the GPU to use (default: 0).
- batches: Comma-separated batch sizes to run (default: "1,2,4,8,16,32").
- threads: Number of CPU threads to use (default: 1).
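For example, to benchmark the ONNX-FP16 model on GPU 0 with the filtered dataset and a custom set of batch sizes:

python run_benchmark.py --model onnx-fp16 --device gpu --dataset filtered --batches 16,32,64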
The script will output the number of messages processed per second for each batch size.
To export and optimize a HuggingFace model to ONNX FP16 format, use the export_onnx.py script:
python export_onnx.py <model_id> [OPTIONS]
This script:
- Exports a HuggingFace model to ONNX with FP16 optimization (O4 config)
- Benchmarks it against the original PyTorch model on 10k Reddit comments
- Generates a README with accuracy statistics
- Optionally uploads the optimized model to HuggingFace Hub
Arguments:
- model_id (required): HuggingFace model ID (e.g., "SamLowe/roberta-base-go_emotions")
- --save-dir: Directory to save the optimized model (default: "./{model_name}-onnx-fp16")
- --batch-size: Batch size for benchmarking (default: 1)
- --hf-token: HuggingFace API token for upload
- --no-upload: Skip the upload prompt and don't upload to HuggingFace
- --disable-shape-inference: Disable shape inference during optimization (recommended for very large models)
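The core of the FP16 export can be reproduced with optimum's ORT exporter and optimizer. A minimal sketch of the O4 optimization step, with placeholder model ID and save path (export_onnx.py does more than this, e.g. benchmarking and README generation):

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

model_id = "SamLowe/roberta-base-go_emotions"
save_dir = "./roberta-base-go_emotions-onnx-fp16"

# Export the PyTorch checkpoint to ONNX
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# Apply the O4 optimization level (includes FP16 mixed precision; GPU-only)
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(optimization_config=AutoOptimizationConfig.O4(), save_dir=save_dir)

# Save the tokenizer alongside the optimized model
AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)
```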

