COMPASS is a framework for evaluating policy alignment: given only an organization’s policy (e.g., allow/deny rules), it enables you to benchmark whether an LLM’s responses comply with that policy in structured, enterprise-like scenarios.
This repository provides tools to:
- Define a custom policy for your organization.
- Generate a benchmark of synthetic queries (standard and adversarial) tailored to that policy.
- Evaluate LLMs on how well they adhere to your rules.
conda create -n compass python=3.11
conda activate compass
pip install -r requirements.txt
Set up your API keys in .env (see .env.sample). The exact credentials you need depend on which providers/models you select in scripts/config/*.yaml (for synthesis, evaluation, and judging).
cp .env.sample .env
# Edit .env to add your keys
- OpenAI: OPENAI_API_KEY
- Anthropic: ANTHROPIC_API_KEY
- OpenRouter: OPENROUTER_API_KEY
- Vertex AI (Claude/Gemini): GOOGLE_APPLICATION_CREDENTIALS or VERTEX_API_KEY
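For reference, a filled-in .env might look like the following. The variable names come from the list above; the values are placeholders, and you only need the entries for the providers you actually use.

```bash
# .env (placeholder values - replace with your own credentials)
OPENAI_API_KEY=your-openai-key
ANTHROPIC_API_KEY=your-anthropic-key
OPENROUTER_API_KEY=your-openrouter-key
# For Vertex AI, either point to a service-account JSON file or set an API key:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
# VERTEX_API_KEY=your-vertex-key
```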
All synthesis/verification scripts use a unified API configuration. You can switch between providers (OpenAI, Anthropic, Vertex, OpenRouter) by editing the api section in scripts/config/*.yaml:
# Example: scripts/config/base_queries_synthesis.yaml
api:
  provider: "anthropic"   # Change to: openai, anthropic, vertex, openrouter
  model: "claude-sonnet-4-20250514"
  temperature: 1.0
  max_tokens: 5000
  # Provider-specific settings (optional):
  # top_p: 1.0                  # For OpenAI/OpenRouter
  # region: "us-east5"          # For Vertex
  # project_id: "your-project"  # For Vertex
  # reasoning_effort: "medium"  # For OpenAI reasoning models

Supported providers:
| Provider | provider value | Required env variable |
|---|---|---|
| OpenAI | openai | OPENAI_API_KEY |
| Anthropic | anthropic | ANTHROPIC_API_KEY |
| Vertex AI | vertex | GOOGLE_APPLICATION_CREDENTIALS or VERTEX_API_KEY |
| OpenRouter | openrouter | OPENROUTER_API_KEY |
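As a concrete example of switching providers, the same api block rewritten for Vertex AI might look like this. It is a sketch that reuses the optional fields listed above; the model identifier is the Vertex Claude example used later in this README.

```yaml
api:
  provider: "vertex"
  model: "claude-opus-4-1@20250805"   # Vertex Claude model id reused from the response-generation examples
  temperature: 1.0
  max_tokens: 5000
  region: "us-east5"
  project_id: "your-project"
```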
Note: Structured output (used in the verification scripts) is currently supported only with the OpenAI provider.
We provide a comprehensive testbed dataset covering 8 industry verticals (Automotive, Healthcare, Financial, etc.) generated using COMPASS. You can access the Testbed Dataset on Hugging Face:
👉 AIM-Intelligence/COMPASS-Policy-Alignment-Testbed-Dataset
This dataset serves as a reference for what COMPASS generates and allows you to test models immediately without generating your own data.
The testbed queries corresponding to the verified query buckets under scenario/queries/verified_* are published there (as Parquet).
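If you want to pull the testbed directly, something like the following should work with the Hugging Face datasets library. This is a sketch: the configuration and split names depend on how the Parquet files are organized on the Hub, so check the dataset card for the exact values.

```python
# Sketch: load the COMPASS testbed from the Hugging Face Hub (requires `pip install datasets`).
from datasets import load_dataset

# The repo id comes from this README; configuration/split names are not documented here,
# so inspect the dataset card if the default configuration does not resolve.
dataset = load_dataset("AIM-Intelligence/COMPASS-Policy-Alignment-Testbed-Dataset")
print(dataset)  # prints the available splits and their column names
```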
Follow these steps to create a policy alignment benchmark for your own organization.
To build a custom benchmark and evaluate responses, you typically provide:
- Policy + Context: required for query generation.
- System prompt: required for response generation (evaluation).
1. Policy File (scenario/policies/MyOrg.json):
Define allowlist (topics you WANT to answer) and denylist (topics you MUST refuse).
{
  "allowlist": {
    "product_support": "Technical support and usage guidelines for MyOrg's software products, including installation, debugging, and API usage.",
    "pricing": "Publicly available pricing tiers (Free, Pro, Enterprise) and feature comparison tables."
  },
  "denylist": {
    "competitors": "Comparisons with CompetitorX or CompetitorY, or market share analysis.",
    "internal_security": "Details about internal server infrastructure, employee credentials, or unpatched vulnerabilities."
  }
}
2. Context File (scenario/contexts/MyOrg.txt):
Provide a description of your organization to help the LLM generate realistic scenarios.
MyOrg is a leading provider of cloud-based project management software...
3. System Prompt File (scenario/system_prompts/MyOrg.txt):
Provide the system prompt that the model will use when responding to queries. You can write any prompt you want the model to follow.
You are a helpful assistant for MyOrg. You must strictly follow the company's content policies...
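Optionally, before running any synthesis, you can sanity-check that the three files exist and that the policy JSON has the expected allowlist/denylist shape. The helper below is not part of the repository; it is a minimal sketch assuming the file layout and policy structure shown above.

```python
# Sketch: quick sanity check for a COMPASS scenario (not part of the repo).
import json
from pathlib import Path

def check_scenario(org: str) -> None:
    # The three files described above must share the organization name.
    policy_path = Path(f"scenario/policies/{org}.json")
    context_path = Path(f"scenario/contexts/{org}.txt")
    prompt_path = Path(f"scenario/system_prompts/{org}.txt")

    for path in (policy_path, context_path, prompt_path):
        assert path.exists(), f"missing file: {path}"

    policy = json.loads(policy_path.read_text(encoding="utf-8"))
    for section in ("allowlist", "denylist"):
        assert isinstance(policy.get(section), dict) and policy[section], \
            f"policy must define a non-empty '{section}' object"
        for topic, description in policy[section].items():
            assert isinstance(description, str) and description.strip(), \
                f"{section}.{topic} should map to a non-empty description string"
    print(f"{org}: policy, context, and system prompt look OK")

if __name__ == "__main__":
    check_scenario("MyOrg")
```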
Use the synthesis scripts to generate user queries based on your policy, and then run verification scripts to ensure quality.
Note: The synthesis scripts enumerate all scenario/policies/*.json files by default.
Recommended (to run a single custom org safely): work in a separate branch/copy, and temporarily keep only these three files for your org:
- scenario/policies/MyOrg.json
- scenario/contexts/MyOrg.txt
- scenario/system_prompts/MyOrg.txt
This is the most reliable way to avoid accidental API calls for other scenarios.
You can also use --debug/--max-companies to limit the run, but it is less explicit than isolating the files.
New: You can run scripts for specific companies with --company. Example:
python scripts/base_queries_synthesis.py --company MyOrg
1. Generate Standard Queries (Base):
python scripts/base_queries_synthesis.py
This generates standard questions for both allowlist and denylist topics.
To run a specific company (or multiple):
python scripts/base_queries_synthesis.py --company MyOrg OtherOrg
2. Verify Base Queries:
python scripts/base_queries_verification.py
This validates the generated queries and saves the approved ones to scenario/queries/verified_base/.
To run a specific company (or multiple):
python scripts/base_queries_verification.py --company MyOrg OtherOrg
3. Generate Edge Cases (Adversarial/Borderline):
- allowed_edge: Tricky questions that seem risky but should be answered.
- denied_edge: Adversarial attacks (jailbreaks, social engineering) trying to elicit denied information.
# allowed_edge - uses default config automatically
python scripts/allowed_edge_queries_synthesis.py
# denied_edge - requires explicit config file(s)
python scripts/denied_edge_queries_synthesis.py --config scripts/config/denied_edge_queries_synthesis_short.yaml
# Or use both short and long attack strategies:
python scripts/denied_edge_queries_synthesis.py --multi_config
To run a specific company (or multiple):
python scripts/allowed_edge_queries_synthesis.py --company MyOrg OtherOrg
python scripts/denied_edge_queries_synthesis.py --config scripts/config/denied_edge_queries_synthesis_short.yaml --company MyOrg OtherOrg
Note: denied_edge_queries_synthesis.py requires explicit config specification via --config or --multi_config. Available configs:
- denied_edge_queries_synthesis_short.yaml - 2 attack strategies per query
- denied_edge_queries_synthesis_long.yaml - 4 attack strategies per query
- --multi_config - uses both configs automatically
Prerequisites:
- Default configs use Vertex for allowed_edge_queries_synthesis.py and OpenRouter for denied_edge_queries_synthesis.py.
- You can change the provider by editing scripts/config/*.yaml (see Switching API Providers).
4. Verify Edge Cases:
python scripts/allowed_edge_queries_verification.py
python scripts/denied_edge_queries_verification.py
Validated queries are saved to scenario/queries/verified_allowed_edge/ and scenario/queries/verified_denied_edge/.
To run a specific company (or multiple):
python scripts/allowed_edge_queries_verification.py --company MyOrg OtherOrg
python scripts/denied_edge_queries_verification.py --company MyOrg OtherOrg
Tip: Use --verbose to see progress during verification. API calls (especially with reasoning models) can take 30+ seconds each, so without --verbose the script may appear stuck:
python scripts/denied_edge_queries_verification.py --company MyOrg --verbose
You can also speed up with parallel processing: --n_proc 4
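For example, the two options can be combined (assuming --verbose and --n_proc are accepted together, which matches how they are described above):

```bash
python scripts/denied_edge_queries_verification.py --company MyOrg --verbose --n_proc 4
```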
- Generate Responses: Run your target LLM against the generated queries. You must specify the model, company, and query type. The script will automatically load the verified queries.
# Using unified script (recommended) - supports all providers
python scripts/response_generation.py \
  --provider "openai" \
  --model "gpt-4o-2024-11-20" \
  --company "MyOrg" \
  --query_type "base"
# Or use a config file
python scripts/response_generation.py \
  --config scripts/config/response_generation.yaml \
  --company "MyOrg" \
  --query_type "base"
Provider options:
# OpenAI
python scripts/response_generation.py --provider openai --model "gpt-4o-2024-11-20" ...
# Anthropic
python scripts/response_generation.py --provider anthropic --model "claude-sonnet-4-20250514" ...
# Vertex (Claude)
python scripts/response_generation.py --provider vertex --model "claude-opus-4-1@20250805" \
  --region "us-east5" --project_id "your-project" ...
# OpenRouter
python scripts/response_generation.py --provider openrouter --model "openai/gpt-4-turbo" ...
(Run separately for base, allowed_edge, and denied_edge)
Note: Legacy provider-specific scripts (response_generation_openai.py, response_generation_openrouter.py, response_generation_vertex.py) are still available for backward compatibility.
- Judge Compliance: Use an LLM-as-a-Judge to score the responses.
python scripts/response_judge.py "response_results" -n 5
The judge uses the config at scripts/config/response_judge.yaml. You can change the provider there (currently only OpenAI is supported for structured output); a sketch of this config appears after this list.
- Analyze Results:
python scripts/analyze_judged_results.py --target-directory judge_results
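For reference, a minimal scripts/config/response_judge.yaml would follow the same unified api block shown in Switching API Providers. This is a sketch only: the field names are assumed to mirror the synthesis config example, the model name is reused from the commands above, and the shipped file may contain additional judge-specific settings.

```yaml
# Sketch of scripts/config/response_judge.yaml (assumed structure - check the shipped file)
api:
  provider: "openai"            # structured output currently requires OpenAI
  model: "gpt-4o-2024-11-20"    # example model name reused from the response-generation commands
  temperature: 1.0
  max_tokens: 5000
```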
- scenario/: Your input data (policies, contexts) and generated benchmarks.
  - policies/: Put your JSON policy here.
  - contexts/: Put your company description TXT here.
  - system_prompts/: Put your system prompt TXT here.
  - queries/: Generated benchmark data.
- scripts/: Tools for synthesis and evaluation.
- results/: Output from model runs and evaluations.
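Putting the paths mentioned in this README together, a checkout for a single custom org might look roughly like this (the layout under results/ is not spelled out above, so treat it as indicative only):

```text
scenario/
├── policies/MyOrg.json           # your policy (allowlist/denylist)
├── contexts/MyOrg.txt            # organization description
├── system_prompts/MyOrg.txt      # system prompt used at response time
└── queries/
    ├── verified_base/            # output of base_queries_verification.py
    ├── verified_allowed_edge/    # output of allowed_edge_queries_verification.py
    └── verified_denied_edge/     # output of denied_edge_queries_verification.py
scripts/
└── config/                       # *.yaml configs (api provider, model, etc.)
results/                          # model responses and judged evaluations
```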
If you use COMPASS in your research, please cite:
@misc{choi2026compass,
  title={COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs},
  author={Dasol Choi and DongGeon Lee and Brigitta Jesica Kartono and Helena Berndt and Taeyoun Kwon and Joonwon Jang and Haon Park and Hwanjo Yu and Minsuk Kahng},
  year={2026},
  eprint={2601.01836},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2601.01836},
}