This repository contains the dataset and code for the paper:
In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents (accepted to appear in TACL).
- 🚀 Motivation
- 📊 In-N-Out Dataset Overview
- 🛠 Setup & Data Download
- 📦 Dataset Structure
- 📁 Repository Structure & Experiments
- 🧪 Running Experiments
- 📜 Citation
## 🚀 Motivation

Tool agents often struggle to identify the correct API call sequence because parameter dependencies are obscured in large API sets. In-N-Out provides the first expert-annotated API graph dataset built from real-world benchmarks (AppWorld, NESTful (v1)), enabling agents to comprehend complex parameter relationships.
## 📊 In-N-Out Dataset Overview

- Scope: 550 real-world APIs across 24 domains.
- Scale: 34,793 parameter-level edges.
- Granularity: Distinguishes between Strong (compatible & natural), Weak (conditional & natural), and Non-edges.
- Utility: Improves tool retrieval and multi-tool query generation performance by nearly 2x compared to documentation-only approaches.
## 🛠 Setup & Data Download

- Download Data: Get the dataset from Google Drive.
- Setup: Place the contents into the `data/` folder.
- Environment: Set up your `.env` file with `OPENAI_API_KEY` (required for experiments).
- Note: Requires API keys (OpenAI) and GPUs for fine-tuned model inference.
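For reference, a minimal `.env` might look like the fragment below (the key value is a placeholder, not a real credential):

```
OPENAI_API_KEY=your-key-here
```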
## 📦 Dataset Structure

The `data/` directory contains datasets for AppWorld and NESTful, each following the same structure:

- `edges.json`: Expert-annotated parameter-level edges (gold-standard labels).
- `api_documentation/`:
  - `apis.json`: Original API list formatted consistently.
  - `apis_with_prerequisites_gold.json`: APIs with the `prerequisite_apis` field populated based on `edges.json`, listing APIs that can provide required input parameter values.
  - `apis_with_prerequisites_automated.json`: APIs with `prerequisite_apis` populated using `edges_automated.json`.
- `train_data.json`, `valid_data.json`, `test_data.json`: Data splits for fine-tuning graph construction models (generated via `src/graph_construction/data_preprocessor_train_valid_test.py`).
- `full_data.json`: Full dataset for building complete graphs using models fine-tuned on other datasets (generated via `src/graph_construction/data_preprocessor_full.py`).
- `results/`:
  - `full_graph_predicted_by_finetuned_qwen2dot5-32b.json`: Complete graph built using the best-performing fine-tuned Qwen2.5-32B model.
  - `edges_automated.json`: Edges extracted by the automated graph construction.
- `retrieval_test_data.json`: Test data used for the tool retrieval experiments.
- `api_graph_gold.pkl`: NetworkX graph representation of the gold-standard edges.
- `api_graph_automated.pkl`: NetworkX graph representation of the automated edges.
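Since the `.pkl` files are pickled NetworkX graphs, they can be inspected directly with the standard library. A minimal sketch, assuming the dataset has been downloaded into `data/` (the `appworld` sub-directory name below is an assumption; adjust the path to your local layout):

```python
import pickle
from pathlib import Path


def load_api_graph(path):
    """Load a pickled API graph, or return None if the file is absent.

    The shipped .pkl files are NetworkX graphs, so the returned object
    supports the usual NetworkX accessors (nodes, edges, etc.).
    """
    graph_path = Path(path)
    if not graph_path.exists():
        return None
    with graph_path.open("rb") as f:
        return pickle.load(f)


# Hypothetical path; point this at wherever you placed the downloaded data.
graph = load_api_graph("data/appworld/api_graph_gold.pkl")
if graph is not None:
    print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```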
## 📁 Repository Structure & Experiments

The `src/` directory is organized by the three core experiments presented in the paper (Section 4):

- Graph Construction (`src/graph_construction/`, Section 4.1): Utilities for fine-tuning and benchmarking LLMs on predicting parameter-level edges from natural-language documentation.
- Tool Retrieval (`src/tool_retrieval/`, Section 4.2): Experiments on ranking candidate APIs using the API graph to identify prerequisite tools.
- API Subset Selection (`src/api_subset_selection/`, Section 4.3): Tasks for selecting groups of APIs that satisfy specific dependency patterns (Chain, Fork, Collider) for query generation.
## 🧪 Running Experiments

Most users can directly run the experiments using the preprocessed JSON files already included under `data/`.
If you want to regenerate the preprocessed data:

- For fine-tuning (creates `train`/`valid`/`test` splits):

  ```bash
  python src/graph_construction/data_preprocessor_train_valid_test.py
  ```

- For cross-dataset evaluation (creates `full_data.json` for building graphs with models fine-tuned on other datasets):

  ```bash
  python src/graph_construction/data_preprocessor_full.py
  ```

Then, to run the main experiments:
- Fine-tune graph construction models (requires a multi-GPU setup):

  ```bash
  bash run_fine-tune_models.sh
  ```

- Zero-shot evaluation:

  ```bash
  python src/graph_construction/test_zero-shot_models.py
  ```

Note: Modify the experiment configuration (`dataset`, `model`, etc.) in the `__main__` block of the script.
Run the tool retrieval experiments:

```bash
python src/tool_retrieval/run_experiment.py
```

Note: Modify the experiment configuration (`dataset`, `use_graph`, `is_gold_graph`, etc.) in the `__main__` block of the script.
Run the API subset selection experiments:

```bash
python src/api_subset_selection/run_experiment.py
```

Note: Modify the experiment configuration (`dataset`, `pattern_type`, etc.) in the `__main__` block of the script.
## 📜 Citation

```bibtex
@article{lee-etal-2025-innout,
  title     = "In-N-Out: A Parameter-Level {API} Graph Dataset for Tool Agents",
  author    = "Lee, Seungkyu and
               Kim, Nalim and
               Jo, Yohan",
  journal   = "Transactions of the Association for Computational Linguistics",
  year      = "2026",
  address   = "Cambridge, MA",
  publisher = "MIT Press",
  url       = "https://arxiv.org/abs/2509.01560",
}
```