This repository contains the dataset and code for the paper:
In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents (accepted to appear in TACL).
- 🚀 Motivation
- 📊 In-N-Out Dataset Overview
- 🛠 Setup & Data Download
- 📦 Dataset Structure
- 📁 Repository Structure & Experiments
- 🧪 Running Experiments
- 📜 Citation
## 🚀 Motivation

Tool agents often struggle to identify the correct API call sequence because parameter dependencies are obscured in large API sets. In-N-Out provides the first expert-annotated API graph dataset built from real-world benchmarks (AppWorld, NESTful (v1)), enabling agents to comprehend complex parameter relationships.
## 📊 In-N-Out Dataset Overview

- Scope: 550 real-world APIs across 24 domains.
- Scale: 34,793 parameter-level edges.
- Granularity: Distinguishes between Strong (compatible & natural), Weak (conditional & natural), and Non-edges.
- Utility: Improves tool retrieval and multi-tool query generation performance by nearly 2x compared to documentation-only approaches.
## 🛠 Setup & Data Download

- Download Data: Get the dataset from Google Drive.
- Setup: Place the contents into the `data/` folder.
- Environment: Set up your `.env` file with `OPENAI_API_KEY` (required for experiments).
- Note: Requires API keys (OpenAI) and GPUs for fine-tuned model inference.
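For reference, a minimal `.env` might look like the fragment below (the key value is a placeholder, not a real credential):

```
OPENAI_API_KEY=your-key-here
```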
## 📦 Dataset Structure

The `data/` directory contains datasets for AppWorld and NESTful, each following the same structure:

- `edges.json`: Expert-annotated parameter-level edges (gold-standard labels).
- `api_documentation/`:
  - `apis.json`: Original API list formatted consistently.
  - `apis_with_prerequisites_gold.json`: APIs with the `prerequisite_apis` field populated based on `edges.json`, listing APIs that can provide required input parameter values.
  - `apis_with_prerequisites_automated.json`: APIs with `prerequisite_apis` populated using `edges_automated.json`.
- `train_data.json`, `valid_data.json`, `test_data.json`: Data splits for fine-tuning graph construction models (generated via `src/graph_construction/data_preprocessor_train_valid_test.py`).
- `full_data.json`: Full dataset for building complete graphs using models fine-tuned on other datasets (generated via `src/graph_construction/data_preprocessor_full.py`).
- `results/`:
  - `full_graph_predicted_by_finetuned_qwen2dot5-32b.json`: Complete graph built using the best-performing fine-tuned Qwen2.5-32B model.
  - `edges_automated.json`: Edges extracted by the automated graph construction.
- `retrieval_test_data.json`: Test data used for the tool retrieval experiments.
- `api_graph_gold.pkl`: NetworkX graph representation of the gold-standard edges.
- `api_graph_automated.pkl`: NetworkX graph representation of the automated edges.
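Since the `.pkl` files are pickled NetworkX graphs, they can be inspected directly with the standard library. A minimal sketch, assuming the dataset has been downloaded into `data/` (the `appworld` sub-directory name below is an assumption; adjust the path to your local layout):

```python
import pickle
from pathlib import Path


def load_api_graph(path):
    """Load a pickled API graph, or return None if the file is absent.

    The shipped .pkl files are NetworkX graphs, so the returned object
    supports the usual NetworkX accessors (nodes, edges, etc.).
    """
    graph_path = Path(path)
    if not graph_path.exists():
        return None
    with graph_path.open("rb") as f:
        return pickle.load(f)


# Hypothetical path; point this at wherever you placed the downloaded data.
graph = load_api_graph("data/appworld/api_graph_gold.pkl")
if graph is not None:
    print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```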
## 📁 Repository Structure & Experiments

The `src/` directory is organized by the three core experiments presented in the paper (Section 4):

- Graph Construction (`src/graph_construction/`, Section 4.1): Utilities for fine-tuning and benchmarking LLMs on predicting parameter-level edges from natural-language documentation.
- Tool Retrieval (`src/tool_retrieval/`, Section 4.2): Experiments on ranking candidate APIs using the API graph to identify prerequisite tools.
- API Subset Selection (`src/api_subset_selection/`, Section 4.3): Tasks for selecting groups of APIs that satisfy specific dependency patterns (Chain, Fork, Collider) for query generation.
## 🧪 Running Experiments

Most users can directly run the experiments using the preprocessed JSON files already included under `data/`.
If you want to regenerate the preprocessed data:

- For fine-tuning (creates `train`/`valid`/`test` splits):

  ```bash
  python src/graph_construction/data_preprocessor_train_valid_test.py
  ```

- For cross-dataset evaluation (creates `full_data.json` for building graphs with models fine-tuned on other datasets):

  ```bash
  python src/graph_construction/data_preprocessor_full.py
  ```

Then, to run the main experiments:
- Fine-tune graph construction models (requires a multi-GPU setup):

  ```bash
  bash run_fine-tune_models.sh
  ```

- Zero-shot evaluation:

  ```bash
  python src/graph_construction/test_zero-shot_models.py
  ```

Note: Modify the experiment configuration (`dataset`, `model`, etc.) in the `__main__` block of the script.
Run the tool retrieval experiments:

```bash
python src/tool_retrieval/run_experiment.py
```

Note: Modify the experiment configuration (`dataset`, `use_graph`, `is_gold_graph`, etc.) in the `__main__` block of the script.
Run the API subset selection experiments:

```bash
python src/api_subset_selection/run_experiment.py
```

Note: Modify the experiment configuration (`dataset`, `pattern_type`, etc.) in the `__main__` block of the script.
## 📜 Citation

```bibtex
@article{lee-etal-2025-innout,
  title     = "In-N-Out: A Parameter-Level {API} Graph Dataset for Tool Agents",
  author    = "Lee, Seungkyu and
               Kim, Nalim and
               Jo, Yohan",
  journal   = "Transactions of the Association for Computational Linguistics",
  year      = "2026",
  address   = "Cambridge, MA",
  publisher = "MIT Press",
  url       = "https://arxiv.org/abs/2509.01560",
}
```