In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents

This repository contains the dataset and code for the paper:
In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents (accepted to appear in TACL).

📚 Table of Contents

  • 🚀 Motivation
  • 📊 In-N-Out Dataset Overview
  • 🛠 Setup & Data Download
  • 📦 Dataset Structure
  • 📁 Repository Structure & Experiments
  • 🧪 Running Experiments
  • 📜 Citation

🚀 Motivation

Tool agents often struggle to identify the correct API call sequence because parameter dependencies are obscured in large API sets. In-N-Out provides the first expert-annotated API graph dataset built from real-world benchmarks (AppWorld, NESTful v1), enabling agents to comprehend complex parameter relationships.

📊 In-N-Out Dataset Overview

  • Scope: 550 real-world APIs across 24 domains.
  • Scale: 34,793 parameter-level edges.
  • Granularity: Distinguishes between Strong (compatible & natural), Weak (conditional & natural), and Non-edges.
  • Utility: Improves tool retrieval and multi-tool query generation performance by nearly 2x compared to documentation-only approaches.

🛠 Setup & Data Download

  1. Download Data: Get the dataset from Google Drive.
  2. Setup: Place the contents in the data/ folder.
  3. Environment: Create a .env file with your OPENAI_API_KEY (required for experiments); a minimal example follows this list.
    • Note: Requires API keys (OpenAI) and GPUs for fine-tuned model inference.
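
A minimal .env, assuming it lives at the repository root (check the scripts for the exact location they read from):

OPENAI_API_KEY=sk-...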

📦 Dataset Structure

The data/ directory contains datasets for AppWorld and NESTful, each following the same structure:

Core Files

  • edges.json: Expert-annotated parameter-level edges (gold-standard labels); see the loading sketch after this list.
  • api_documentation/:
    • apis.json: Original API list formatted consistently.
    • apis_with_prerequisites_gold.json: APIs with a prerequisite_apis field populated from edges.json, listing the APIs that can supply each required input parameter value.
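
To get oriented, the following sketch loads the gold edges with the standard json module and prints one record; the data/appworld/ path is an assumption about the folder layout, not documented here:

import json

# Path assumes the AppWorld copy of the dataset sits under data/appworld/;
# adjust to match your local layout.
with open("data/appworld/edges.json") as f:
    edges = json.load(f)

# Print one entry to discover the actual schema (handles a list or a dict).
sample = edges[0] if isinstance(edges, list) else next(iter(edges.items()))
print(len(edges), "entries; example:", sample)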

Experiment-Specific Data

1. Graph Construction (graph_construction/)

  • train_data.json, valid_data.json, test_data.json: Data splits for fine-tuning graph construction models (generated via src/graph_construction/data_preprocessor_train_valid_test.py).
  • full_data.json: Full dataset for building complete graphs using models fine-tuned on other datasets (generated via src/graph_construction/data_preprocessor_full.py).
  • results/:
    • full_graph_predicted_by_finetuned_qwen2dot5-32b.json: Complete graph built using the best-performing fine-tuned Qwen2.5-32B model.
    • edges_automated.json: Extracted edges from the automated graph construction.
  • api_documentation/apis_with_prerequisites_automated.json: APIs with prerequisite_apis populated using edges_automated.json.

2. Tool Retrieval (tool_retrieval/)

  • retrieval_test_data.json: Test data used for the tool retrieval experiments.

3. API Subset Selection (api_subset_selection/)

  • api_graph_gold.pkl: NetworkX graph representation of the gold-standard edges; a loading sketch follows this list.
  • api_graph_automated.pkl: NetworkX graph representation of the automated edges.
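
A minimal loading sketch, assuming the AppWorld copy sits under data/appworld/ (the exact path is an assumption):

import pickle

# The .pkl files are pickled NetworkX graphs, so networkx must be installed
# for unpickling to succeed.
with open("data/appworld/api_subset_selection/api_graph_gold.pkl", "rb") as f:
    api_graph = pickle.load(f)

print(api_graph.number_of_nodes(), "nodes,", api_graph.number_of_edges(), "edges")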

📁 Repository Structure & Experiments

The src/ directory is organized by the three core experiments presented in the paper (Section 4):

  1. Graph Construction (src/graph_construction/, Section 4.1): Utilities for fine-tuning and benchmarking LLMs on predicting parameter-level edges from natural language documentation.
  2. Tool Retrieval (src/tool_retrieval/, Section 4.2): Experiments on ranking candidate APIs using the API graph to identify prerequisite tools.
  3. API Subset Selection (src/api_subset_selection/, Section 4.3): Tasks for selecting groups of APIs that satisfy specific dependency patterns (Chain, Fork, Collider) for query generation.
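
To make the three dependency patterns concrete, here is a toy NetworkX sketch with hypothetical nodes A, B, C standing in for APIs (not actual dataset entries):

import networkx as nx

# Chain: A's output parameter feeds B, whose output feeds C.
chain = nx.DiGraph([("A", "B"), ("B", "C")])
# Fork: A's output feeds two independent downstream APIs.
fork = nx.DiGraph([("A", "B"), ("A", "C")])
# Collider: C requires input values produced by both A and B.
collider = nx.DiGraph([("A", "C"), ("B", "C")])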

🧪 Running Experiments

1. Benchmarking API Graph Construction Capabilities (Section 4.1)

Most users can run the experiments directly using the preprocessed JSON files already included under data/.

If you want to regenerate the preprocessed data:

  • For fine-tuning (creates train/valid/test splits):
python src/graph_construction/data_preprocessor_train_valid_test.py
  • For cross-dataset evaluation (creates full_data.json for building graphs with models fine-tuned on other datasets):
python src/graph_construction/data_preprocessor_full.py

Then, to run the main experiments:

  • Fine-tune graph construction models (requires multi-GPU setup):
bash run_fine-tune_models.sh
  • Zero-shot evaluation:
python src/graph_construction/test_zero-shot_models.py

Note: Modify experiment configuration (dataset, model, etc.) in the __main__ block of the script.

2. Tool Retrieval with API Graphs (Section 4.2)

Run tool retrieval experiments:

python src/tool_retrieval/run_experiment.py

Note: Modify experiment configuration (dataset, use_graph, is_gold_graph, etc.) in the __main__ block of the script.

3. Structured API Subset Selection for Multi-Tool Query Generation (Section 4.3)

Run API subset selection experiments:

python src/api_subset_selection/run_experiment.py

Note: Modify experiment configuration (dataset, pattern_type, etc.) in the __main__ block of the script.

📜 Citation

@article{lee-etal-2025-innout,
    title = "In-N-Out: A Parameter-Level {API} Graph Dataset for Tool Agents",
    author = "Lee, Seungkyu  and
      Kim, Nalim  and
      Jo, Yohan",
    journal = "Transactions of the Association for Computational Linguistics",
    year = "2026",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://arxiv.org/abs/2509.01560",
}
