This repository contains the official implementation of the paper "Large Language Models as End-to-end Combinatorial Optimization Solvers" presented at The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).
A framework for training Large Language Models (LLMs) to solve combinatorial optimization problems using supervised fine-tuning (SFT) followed by reinforcement learning (RL).
Title: Large Language Models as End-to-end Combinatorial Optimization Solvers
Authors: Xia Jiang, Yaoxin Wu, Minshuo Li, Zhiguang Cao, Yingqian Zhang
Conference: The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
Paper Link: [arXiv](https://arxiv.org/abs/2509.16865)
The framework supports training and evaluation on the following combinatorial optimization problems:
- TSP (Traveling Salesman Problem)
- CVRP (Capacitated Vehicle Routing Problem)
- OP (Orienteering Problem)
- MVC (Minimum Vertex Cover)
- MIS (Maximum Independent Set)
- PFSP (Permutation Flow Shop Problem)
- JSSP (Job Shop Scheduling Problem)
You can generate your own data with the problem-specific environments under /Envs/ (a rough generation sketch follows the download links below), or use the data from the original paper:
- SFT DATA: https://drive.google.com/drive/folders/1bE1coGUa00gfuMkPXnfvldi1-WHGNnEb?usp=sharing
- RL DATA: https://drive.google.com/drive/folders/1VN9crftdW7DTsMQupbc06u6PzRT-Bwnx?usp=sharing
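As a rough illustration of self-generated data (the instance size and pickle layout here are assumptions for TSP, not the repo's actual format; the real generators live under /Envs/):

```python
import pickle
import numpy as np

rng = np.random.default_rng(seed=0)
# 1,000 random 50-node TSP instances with coordinates in the unit square
# (size and layout are illustrative assumptions).
instances = [rng.random((50, 2)) for _ in range(1000)]

with open("data/tsp/instances.pkl", "wb") as f:
    pickle.dump(instances, f)
```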
Place your training and evaluation data in the following structure (a quick sanity check follows the tree):
```
data/
├── <problem_name>/
│   ├── train/           # Training data
│   ├── eval/            # Evaluation data
│   └── instances.pkl    # Problem instances
```
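A quick check that the pickle is in place and readable (its contents are problem-specific):

```python
import pickle

with open("data/tsp/instances.pkl", "rb") as f:
    instances = pickle.load(f)
print(f"loaded {len(instances)} instances")
```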
The training pipeline consists of three main stages: supervised fine-tuning (SFT), reinforcement learning (RL), and LoRA weight merging.
First, train the model using supervised learning on problem-specific data:
```bash
python main_train.py --problem <problem_name> [options]
```

Key parameters:
- `--problem`: Problem type (`tsp`, `cvrp`, `op`, `mvc`, `mis`, `pfsp`, `jssp`)
- `--model_name`: Base model to fine-tune (default: `unsloth/Qwen2.5-7B`)
- `--max_seq_length`: Maximum sequence length (default: 20000)
- `--per_device_train_batch_size`: Batch size per device (default: 4)
- `--num_train_epochs`: Number of training epochs (default: 1)
- `--learning_rate`: Learning rate (default: 2e-4)
- `--lora_r`: LoRA rank (default: 64)
- `--lora_alpha`: LoRA alpha (default: 64)
Example:
```bash
python main_train.py --problem cvrp --num_train_epochs 1 --per_device_train_batch_size 4
```
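For orientation, here is a minimal sketch of what the SFT stage roughly does, assuming `main_train.py` follows the standard Unsloth + TRL recipe with the defaults listed above (the toy dataset and exact argument names are assumptions and may differ across TRL versions):

```python
from unsloth import FastLanguageModel  # import unsloth first so its patches apply
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Load the base model and attach LoRA adapters (defaults mirror the options above).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B",
    max_seq_length=20000,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(model, r=64, lora_alpha=64)

# Toy one-example dataset; the real instance/solution texts come from data/<problem_name>/train/.
dataset = Dataset.from_dict({"text": ["<instance description> ... <solution>"]})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        max_seq_length=20000,
        output_dir="output",
    ),
)
trainer.train()
```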
After SFT, improve the model using reinforcement learning (GRPO):

```bash
python rl_train.py --problem <problem_name> --model_name <sft_checkpoint_path> [options]
```

Key parameters:
- `--model_name`: Path to SFT checkpoint (e.g., `output_alpha64_r64_cvrp_gamma_train_embed_tok_False_seq20000_b4_ep1/checkpoint-31250`)
- `--num_generations`: Number of generations for GRPO (default: 8)
- `--beta`: KL coefficient (default: 0.05)
- `--learning_rate`: Learning rate (default: 1e-6)
- `--max_prompt_length`: Maximum prompt length (default: 20000)
- `--max_completion_length`: Maximum completion length (default: 1000)
Example:
```bash
python rl_train.py --problem cvrp --model_name output_alpha64_r64_cvrp_gamma_train_embed_tok_False_seq20000_b4_ep1/checkpoint-31250
```
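Under the hood, GRPO training with TRL looks roughly like the sketch below. The reward function here is a hypothetical stand-in for the repo's problem-specific feasibility/cost rewards, and the toy dataset is an assumption:

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt dataset; real prompts come from data/<problem_name>/train/.
dataset = Dataset.from_dict({"prompt": ["<CVRP instance description>"]})

def route_reward(prompts, completions, **kwargs):
    # Hypothetical stand-in: parse each completion, check feasibility,
    # and score by (negative) route cost. Returns one scalar per completion.
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model="output_alpha64_r64_cvrp_gamma_train_embed_tok_False_seq20000_b4_ep1/checkpoint-31250",
    reward_funcs=route_reward,
    args=GRPOConfig(
        num_generations=8,
        beta=0.05,
        learning_rate=1e-6,
        max_prompt_length=20000,
        max_completion_length=1000,
        output_dir="rl_output",
    ),
    train_dataset=dataset,
)
trainer.train()
```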
After training, merge the LoRA weights with the base model:

1. Edit `cmd.sh` to specify your model checkpoint path:
   ```bash
   MODEL_DIR="./path/to/your/checkpoint"
   ```
2. Run the merge script:
   ```bash
   bash cmd.sh
   ```
This creates a saved_models/ directory with the merged model.
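If you prefer to merge manually instead of via `cmd.sh`, a standard PEFT merge looks like the sketch below (an assumption that the checkpoint is a regular PEFT LoRA adapter on the default base model):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("unsloth/Qwen2.5-7B", torch_dtype="auto")
model = PeftModel.from_pretrained(base, "./path/to/your/checkpoint")
merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights

merged.save_pretrained("saved_models")
AutoTokenizer.from_pretrained("./path/to/your/checkpoint").save_pretrained("saved_models")
```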
Evaluate the trained model using two methods:
Vanilla evaluation:

```bash
python eval.py --model_id saved_models --problem <problem_name> --eval_method vanilla --num_samples 100
```

Best-of-n evaluation:

```bash
python eval.py --model_id saved_models --problem <problem_name> --eval_method best_of_n --num_samples 100 --best_of_n 8 --temperature 0.7
```

Evaluation parameters:
- `--model_id`: Path to the merged model (default: `saved_models`)
- `--eval_method`: Evaluation method (`vanilla` or `best_of_n`)
- `--num_samples`: Number of test instances to evaluate
- `--best_of_n`: Number of solutions to generate per instance (for `best_of_n`; see the sketch after this list)
- `--temperature`: Sampling temperature
- `--batch_size`: Batch size for evaluation
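Conceptually, `best_of_n` samples several solutions per instance and keeps the best feasible one. A minimal sketch, where `generate` and `evaluate` are hypothetical placeholders for the model call and the problem-specific checker:

```python
def best_of_n(generate, evaluate, prompt, n=8, temperature=0.7):
    """Sample n solutions and return the feasible one with the lowest cost."""
    best_sol, best_cost = None, float("inf")
    for _ in range(n):
        solution = generate(prompt, temperature=temperature)
        feasible, cost = evaluate(solution)  # problem-specific feasibility/cost check
        if feasible and cost < best_cost:
            best_sol, best_cost = solution, cost
    return best_sol, best_cost
```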
The evaluation provides two metrics (a computation sketch follows the list):
- Feasibility Rate: Percentage of valid solutions
- Optimality Gap: Average gap from optimal/reference solutions
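For reference, these two metrics are conventionally computed as in the sketch below (`eval.py` may differ in details such as the gap's sign convention):

```python
def summarize(results):
    """results: list of (feasible, cost, ref_cost) tuples, one per instance."""
    feasible = [(cost, ref) for ok, cost, ref in results if ok]
    feasibility_rate = 100.0 * len(feasible) / len(results)
    # Relative gap to the optimal/reference cost, averaged over feasible solutions.
    gaps = [100.0 * (cost - ref) / ref for cost, ref in feasible]
    avg_gap = sum(gaps) / len(gaps) if gaps else float("nan")
    return feasibility_rate, avg_gap
```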
Here's a complete example for training on CVRP:
```bash
# 1. Supervised Fine-Tuning
python main_train.py --problem cvrp --num_train_epochs 1

# 2. Reinforcement Learning
python rl_train.py --problem cvrp --model_name output_alpha64_r64_cvrp_gamma_train_embed_tok_False_seq20000_b4_ep1/checkpoint-31250

# 3. Merge Model (edit MODEL_DIR in cmd.sh first)
bash cmd.sh

# 4. Evaluate
python eval.py --model_id saved_models --problem cvrp --eval_method vanilla --num_samples 100
```

We welcome contributions to this project. Please feel free to submit issues and pull requests.
If you find this work useful in your research, please consider citing:
```bibtex
@inproceedings{jiang2025large,
  title={Large Language Models as End-to-end Combinatorial Optimization Solvers},
  author={Xia Jiang and Yaoxin Wu and Minshuo Li and Zhiguang Cao and Yingqian Zhang},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025},
  url={https://arxiv.org/abs/2509.16865}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.