COMPASS is a framework for evaluating policy alignment: given only an organization’s policy (e.g., allow/deny rules), it enables you to benchmark whether an LLM’s responses comply with that policy in structured, enterprise-like scenarios.
This repository provides tools to:
- Define a custom policy for your organization.
- Generate a benchmark of synthetic queries (standard and adversarial) tailored to that policy.
- Evaluate LLMs on how well they adhere to your rules.
conda create -n compass python=3.11
conda activate compass
pip install -r requirements.txt
Set up your API keys in .env (see .env.sample). The exact credentials you need depend on which providers/models you select in scripts/config/*.yaml (for synthesis, evaluation, and judging).
cp .env.sample .env
# Edit .env to add your keys
- OpenAI: OPENAI_API_KEY
- Anthropic: ANTHROPIC_API_KEY
- OpenRouter: OPENROUTER_API_KEY
- Vertex AI (Claude/Gemini): GOOGLE_APPLICATION_CREDENTIALS or VERTEX_API_KEY
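For reference, a filled-in .env might look like the following. The variable names come from the list above; the values are placeholders, and you only need the entries for the providers you actually use.

```bash
# .env (placeholder values - replace with your own credentials)
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
OPENROUTER_API_KEY=your-openrouter-key
# For Vertex AI, either point to a service-account JSON file or set an API key:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
# VERTEX_API_KEY=your-vertex-key
```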
All synthesis/verification scripts use a unified API configuration. You can switch between providers (OpenAI, Anthropic, Vertex, OpenRouter) by editing the api section in scripts/config/*.yaml:
# Example: scripts/config/base_queries_synthesis.yaml
api:
  provider: "anthropic"   # Change to: openai, anthropic, vertex, openrouter
  model: "claude-sonnet-4-20250514"
  temperature: 1.0
  max_tokens: 5000
  # Provider-specific settings (optional):
  # top_p: 1.0                  # For OpenAI/OpenRouter
  # region: "us-east5"          # For Vertex
  # project_id: "your-project"  # For Vertex
  # reasoning_effort: "medium"  # For OpenAI reasoning models

Supported providers:
| Provider | provider value | Required env variable |
|---|---|---|
| OpenAI | openai | OPENAI_API_KEY |
| Anthropic | anthropic | ANTHROPIC_API_KEY |
| Vertex AI | vertex | GOOGLE_APPLICATION_CREDENTIALS or VERTEX_API_KEY |
| OpenRouter | openrouter | OPENROUTER_API_KEY |
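As a concrete example of switching providers, the same api block rewritten for Vertex AI might look like this. It is a sketch that reuses the optional fields listed above; the model identifier is the Vertex Claude example used later in this README.

```yaml
api:
  provider: "vertex"
  model: "claude-opus-4-1@20250805"   # Vertex Claude model id reused from the response-generation examples
  temperature: 1.0
  max_tokens: 5000
  region: "us-east5"
  project_id: "your-project"
```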
Note: Structured output (used in the verification scripts) is currently supported only with the OpenAI provider.
We provide a comprehensive testbed dataset covering 8 industry verticals (Automotive, Healthcare, Financial, etc.) generated using COMPASS. You can access the Testbed Dataset on Hugging Face:
👉 AIM-Intelligence/COMPASS-Policy-Alignment-Testbed-Dataset
This dataset serves as a reference for what COMPASS generates and allows you to test models immediately without generating your own data.
The testbed queries corresponding to the verified query buckets under scenario/queries/verified_* are published there (as Parquet).
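If you want to pull the testbed directly, something like the following should work with the Hugging Face datasets library. This is a sketch: the configuration and split names depend on how the Parquet files are organized on the Hub, so check the dataset card for the exact values.

```python
# Sketch: load the COMPASS testbed from the Hugging Face Hub (requires `pip install datasets`).
from datasets import load_dataset

# The repo id comes from this README; configuration/split names are not documented here,
# so inspect the dataset card if the default configuration does not resolve.
dataset = load_dataset("AIM-Intelligence/COMPASS-Policy-Alignment-Testbed-Dataset")
print(dataset)  # prints the available splits and their column names
```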
Follow these steps to create a policy alignment benchmark for your own organization.
To build a custom benchmark and evaluate responses, you typically provide:
- Policy + Context: required for query generation.
- System prompt: required for response generation (evaluation).
1. Policy File (scenario/policies/MyOrg.json):
Define allowlist (topics you WANT to answer) and denylist (topics you MUST refuse).
{
  "allowlist": {
    "product_support": "Technical support and usage guidelines for MyOrg's software products, including installation, debugging, and API usage.",
    "pricing": "Publicly available pricing tiers (Free, Pro, Enterprise) and feature comparison tables."
  },
  "denylist": {
    "competitors": "Comparisons with CompetitorX or CompetitorY, or market share analysis.",
    "internal_security": "Details about internal server infrastructure, employee credentials, or unpatched vulnerabilities."
  }
}
2. Context File (scenario/contexts/MyOrg.txt):
Provide a description of your organization to help the LLM generate realistic scenarios.
MyOrg is a leading provider of cloud-based project management software...
3. System Prompt File (scenario/system_prompts/MyOrg.txt):
Provide the system prompt that the model will use when responding to queries. You can write any prompt you want the model to follow.
You are a helpful assistant for MyOrg. You must strictly follow the company's content policies...
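Optionally, before running any synthesis, you can sanity-check that the three files exist and that the policy JSON has the expected allowlist/denylist shape. The helper below is not part of the repository; it is a minimal sketch assuming the file layout and policy structure shown above.

```python
# Sketch: quick sanity check for a COMPASS scenario (not part of the repo).
import json
from pathlib import Path

def check_scenario(org: str) -> None:
    # The three files described above must share the organization name.
    policy_path = Path(f"scenario/policies/{org}.json")
    context_path = Path(f"scenario/contexts/{org}.txt")
    prompt_path = Path(f"scenario/system_prompts/{org}.txt")

    for path in (policy_path, context_path, prompt_path):
        assert path.exists(), f"missing file: {path}"

    policy = json.loads(policy_path.read_text(encoding="utf-8"))
    for section in ("allowlist", "denylist"):
        assert isinstance(policy.get(section), dict) and policy[section], \
            f"policy must define a non-empty '{section}' object"
        for topic, description in policy[section].items():
            assert isinstance(description, str) and description.strip(), \
                f"{section}.{topic} should map to a non-empty description string"
    print(f"{org}: policy, context, and system prompt look OK")

if __name__ == "__main__":
    check_scenario("MyOrg")
```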
Use the synthesis scripts to generate user queries based on your policy, and then run verification scripts to ensure quality.
Note: The synthesis scripts enumerate all scenario/policies/*.json files by default.
Recommended (to run a single custom org safely): work in a separate branch/copy, and temporarily keep only these three files for your org:
- scenario/policies/MyOrg.json
- scenario/contexts/MyOrg.txt
- scenario/system_prompts/MyOrg.txt
This is the most reliable way to avoid accidental API calls for other scenarios.
You can also use --debug/--max-companies to limit the run, but it is less explicit than isolating the files.
New: You can run scripts for specific companies with --company. Example:
python scripts/base_queries_synthesis.py --company MyOrg
1. Generate Standard Queries (Base):
python scripts/base_queries_synthesis.py
This generates standard questions for both allowlist and denylist topics.
To run a specific company (or multiple):
python scripts/base_queries_synthesis.py --company MyOrg OtherOrg
2. Verify Base Queries:
python scripts/base_queries_verification.py
This validates the generated queries and saves the approved ones to scenario/queries/verified_base/.
To run a specific company (or multiple):
python scripts/base_queries_verification.py --company MyOrg OtherOrg
3. Generate Edge Cases (Adversarial/Borderline):
- allowed_edge: Tricky questions that seem risky but should be answered.
- denied_edge: Adversarial attacks (jailbreaks, social engineering) trying to elicit denied information.
# allowed_edge - uses default config automatically
python scripts/allowed_edge_queries_synthesis.py
# denied_edge - requires explicit config file(s)
python scripts/denied_edge_queries_synthesis.py --config scripts/config/denied_edge_queries_synthesis_short.yaml
# Or use both short and long attack strategies:
python scripts/denied_edge_queries_synthesis.py --multi_config
To run a specific company (or multiple):
python scripts/allowed_edge_queries_synthesis.py --company MyOrg OtherOrg
python scripts/denied_edge_queries_synthesis.py --config scripts/config/denied_edge_queries_synthesis_short.yaml --company MyOrg OtherOrg
Note: denied_edge_queries_synthesis.py requires explicit config specification via --config or --multi_config. Available configs:
- denied_edge_queries_synthesis_short.yaml - 2 attack strategies per query
- denied_edge_queries_synthesis_long.yaml - 4 attack strategies per query
- --multi_config - uses both configs automatically
Prerequisites:
- Default configs use Vertex for allowed_edge_queries_synthesis.py and OpenRouter for denied_edge_queries_synthesis.py.
- You can change the provider by editing scripts/config/*.yaml (see Switching API Providers).
4. Verify Edge Cases:
python scripts/allowed_edge_queries_verification.py
python scripts/denied_edge_queries_verification.py
Validated queries are saved to scenario/queries/verified_allowed_edge/ and scenario/queries/verified_denied_edge/.
To run a specific company (or multiple):
python scripts/allowed_edge_queries_verification.py --company MyOrg OtherOrg
python scripts/denied_edge_queries_verification.py --company MyOrg OtherOrg
Tip: Use --verbose to see progress during verification. API calls (especially with reasoning models) can take 30+ seconds each, so without --verbose the script may appear stuck:
python scripts/denied_edge_queries_verification.py --company MyOrg --verbose
You can also speed up with parallel processing: --n_proc 4
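For example, the two options can be combined (assuming --verbose and --n_proc are accepted together, which matches how they are described above):

```bash
python scripts/denied_edge_queries_verification.py --company MyOrg --verbose --n_proc 4
```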
- Generate Responses: Run your target LLM against the generated queries. You must specify the model, company, and query type. The script will automatically load the verified queries.
# Using unified script (recommended) - supports all providers
python scripts/response_generation.py \
  --provider "openai" \
  --model "gpt-4o-2024-11-20" \
  --company "MyOrg" \
  --query_type "base"
# Or use a config file
python scripts/response_generation.py \
  --config scripts/config/response_generation.yaml \
  --company "MyOrg" \
  --query_type "base"
Provider options:
# OpenAI
python scripts/response_generation.py --provider openai --model "gpt-4o-2024-11-20" ...
# Anthropic
python scripts/response_generation.py --provider anthropic --model "claude-sonnet-4-20250514" ...
# Vertex (Claude)
python scripts/response_generation.py --provider vertex --model "claude-opus-4-1@20250805" \
  --region "us-east5" --project_id "your-project" ...
# OpenRouter
python scripts/response_generation.py --provider openrouter --model "openai/gpt-4-turbo" ...
(Run separately for base, allowed_edge, and denied_edge)
Note: Legacy provider-specific scripts (response_generation_openai.py, response_generation_openrouter.py, response_generation_vertex.py) are still available for backward compatibility.
- Judge Compliance: Use an LLM-as-a-Judge to score the responses.
python scripts/response_judge.py "response_results" -n 5
The judge uses the config at scripts/config/response_judge.yaml. You can change the provider there (currently only OpenAI is supported for structured output); a sketch of this config appears after this list.
- Analyze Results:
python scripts/analyze_judged_results.py --target-directory judge_results
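For reference, a minimal scripts/config/response_judge.yaml would follow the same unified api block shown in Switching API Providers. This is a sketch only: the field names are assumed to mirror the synthesis config example, the model name is reused from the commands above, and the shipped file may contain additional judge-specific settings.

```yaml
# Sketch of scripts/config/response_judge.yaml (assumed structure - check the shipped file)
api:
  provider: "openai"            # structured output currently requires OpenAI
  model: "gpt-4o-2024-11-20"    # example model name reused from the response-generation commands
  temperature: 1.0
  max_tokens: 5000
```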
- scenario/: Your input data (policies, contexts) and generated benchmarks.
  - policies/: Put your JSON policy here.
  - contexts/: Put your company description TXT here.
  - system_prompts/: Put your system prompt TXT here.
  - queries/: Generated benchmark data.
- scripts/: Tools for synthesis and evaluation.
- results/: Output from model runs and evaluations.
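Putting the paths mentioned in this README together, a checkout for a single custom org might look roughly like this (the layout under results/ is not spelled out above, so treat it as indicative only):

```text
scenario/
├── policies/MyOrg.json           # your policy (allowlist/denylist)
├── contexts/MyOrg.txt            # organization description
├── system_prompts/MyOrg.txt      # system prompt used at response time
└── queries/
    ├── verified_base/            # output of base_queries_verification.py
    ├── verified_allowed_edge/    # output of allowed_edge_queries_verification.py
    └── verified_denied_edge/     # output of denied_edge_queries_verification.py
scripts/
└── config/                       # *.yaml configs (api provider, model, etc.)
results/                          # model responses and judged evaluations
```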
If you use COMPASS in your research, please cite:
@misc{choi2026compass,
  title={COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs},
  author={Dasol Choi and DongGeon Lee and Brigitta Jesica Kartono and Helena Berndt and Taeyoun Kwon and Joonwon Jang and Haon Park and Hwanjo Yu and Minsuk Kahng},
  year={2026},
  eprint={2601.01836},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.01836},
}