This repository accompanies the paper “Experimental Evaluation of AI-Augmented Cybersecurity Requirements Generation Leveraging LLMs’ Capabilities.” It contains every script, dataset, prompt template and result needed to fully reproduce our empirical study.
This project investigates the practical use of state‑of‑the‑art Large Language Models (LLMs) to transform high‑level, standard‑driven cybersecurity controls into concrete, system‑specific requirements. Using a synthetic yet industrially plausible case study (AI4I4, an IoT‑enabled automotive logistics platform), we benchmark thirteen frontier models (GPT‑4, Llama 3, Mixtral, Qwen, etc.), representing the state of the art as of September 2024, across four prompting pipelines and three temperature regimes.
Key contributions include:
- Annotated benchmark of 54 ISO‑27002 control definitions with placeholder semantics suitable for automatic instantiation.
- LangChain pipelines that decompose the task into applicability filtering, domain‑element search, requirement generation, and JSON formatting (a minimal sketch of this decomposition follows this list).
- Comprehensive evaluation of accuracy (precision, recall, F2), creativity (F2‑synthetic), and consistency (Jaccard overlap across runs).
- Prompt library enumerating >180 templates, showing how subtle changes in instruction design affect hallucination rate and coverage.
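To make the decomposition concrete, here is a minimal, illustrative sketch of how the four stages could be composed with LangChain's LCEL syntax. The prompts, model choice, and variable names are assumptions made for illustration only; the actual templates and chain topologies used in the study live under `data/prompt/` and `src/generate_requirements/templates/chain`.

```python
# Illustrative sketch only (not the repository's exact chains): four LCEL stages
# mirroring the decomposition described above. Prompts and the model are assumptions.
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)  # assumes OPENAI_API_KEY is set

applicability = (
    ChatPromptTemplate.from_template(
        "Is control '{control}' applicable to the following system? Answer YES or NO "
        "with a one-sentence justification.\n\n{system}"
    )
    | llm
    | StrOutputParser()
)

domain_search = (
    ChatPromptTemplate.from_template(
        "List the elements of the system below that control '{control}' applies to.\n\n{system}"
    )
    | llm
    | StrOutputParser()
)

generation = (
    ChatPromptTemplate.from_template(
        "Instantiate control '{control}' as concrete, testable requirements "
        "for these system elements:\n\n{elements}"
    )
    | llm
    | StrOutputParser()
)

formatting = (
    ChatPromptTemplate.from_template(
        "Rewrite the requirements below as a JSON array of objects with "
        "'id' and 'text' fields. Return only JSON.\n\n{requirements}"
    )
    | llm
    | JsonOutputParser()
)

# Example: run the stages in sequence for one control and a short system description.
control = "Access control to source code"
system = "AI4I4 is an IoT-enabled automotive logistics platform with edge gateways and a cloud backend."
if applicability.invoke({"control": control, "system": system}).upper().startswith("YES"):
    elements = domain_search.invoke({"control": control, "system": system})
    requirements = generation.invoke({"control": control, "elements": elements})
    print(formatting.invoke({"requirements": requirements}))
```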
The artefacts and scripts below allow full replication—from raw prompts to final figures—on any infrastructure with access to the referenced models.
```
.
├── data/                               # Experimental inputs
│   ├── ai4i4.md                        # Functional specification of the AI4I4 case study
│   ├── annotated_standard_subset.json  # Annotated subset of ISO‑27002 control definitions
│   └── prompt/                         # Prompt templates organised by task and model
├── src/                                # LangChain pipelines and helper scripts
│   ├── generate_requirements/          # End‑to‑end automation
│   └── graph/                          # Scripts to render result figures
├── results/                            # Raw outputs and aggregated metrics
│   ├── requirements/                   # Requirement lists (human + models)
│   ├── analysis/                       # Coverage, F‑scores, Jaccard, etc.
│   └── graph/                          # Re‑generated figures from the manuscript
├── doc/                                # Execution logs for every configuration
├── LICENSE, LICENSE_DATA.txt
└── README.md                           # This document
```
Given that python3 and pip are installed and correctly configured on your system, and assuming that you have (depending on the model(s) you intend to use):
- A valid Huggingface PRO token.
- Access granted to the intended models on AWS Bedrock.
- A valid OpenAI API key.
- A valid Mistral API key.
You may follow the steps below to set up the environment and run the scripts.
- Clone this repository locally.

  ```bash
  git clone git@github.com:STRAST-UPM/ai_requirements_generation_rr.git
  ```

- Change to the `generate_requirements` directory.

  ```bash
  cd src/generate_requirements
  ```

- Create a Python virtual environment and activate it (recommended).
  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install all required dependencies.

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file with the following content (depending on the models you want to use):

  ```
  HUGGINGFACE_API_TOKEN=<your_token>
  MISTRAL_API_TOKEN=<your_token>
  OPENAI_API_TOKEN=<your_token>
  ```

  > [!TIP]
  > You may find an example of the `.env` file at `.env.example`.
- If you want to use models provided by AWS, configure the AWS CLI with the credentials provided by the AWS administration console.

  ```bash
  aws configure
  ```

To generate cybersecurity requirements for a given system description, you may use the [/src/generate_requirements/main.py](/src/generate_requirements/main.py) script. You may specify the following parameters:
- `-s STANDARDS`: path of the file containing the adapted cybersecurity standards, as a .json file.
- `-d DOMAIN`: path of the file containing the system description, as a .md file.
- `-o OUTPUT`: path of the output folder where the generated cybersecurity requirements (a .json file) and the execution details are written.
- `-c CHAIN`: name of the LangChain chain topology declaration to use (located at [/src/generate_requirements/templates/chain](/src/generate_requirements/templates/chain)).
- `--help`: show the help message for the script.
Example:

```bash
python main.py \
  --standards ../../data/annotated_standard_subset.json \
  --domain ../../data/ai4i4.md \
  --output ../../results/requirements \
  --chain cot_llama
```

> [!IMPORTANT]
> In its default configuration, the requirements generation script uses the `meta.llama3-1-405b-instruct-v1:0` model provided by AWS for serverless inference.
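To verify beforehand that your AWS account can actually reach that model, a quick sanity check with `boto3` can help. The region, prompt, and generation parameters below are assumptions to adapt to your account; the request body follows the Bedrock schema for Meta Llama models, and some regions may require a cross-region inference profile instead of the plain model ID.

```python
# Sanity check: invoke the default Bedrock model once before launching a full run.
# Region and generation parameters are illustrative; the model ID is the one named above.
import json

import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "State one cybersecurity requirement for an IoT logistics platform.",
    "max_gen_len": 128,
    "temperature": 0.2,
})

response = client.invoke_model(
    modelId="meta.llama3-1-405b-instruct-v1:0",
    body=body,
)
print(json.loads(response["body"].read())["generation"])
```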
| Path | Brief description |
|---|---|
| `data/ai4i4.md` | System specification of the pilot use case. |
| `data/annotated_standard_subset.json` | Parameterised ISO‑27002 controls. |
| `data/prompt/**` | 180+ prompt templates, categorised by task and model. |
| `results/analysis/summary.csv` | Precision, recall, F2 and relative F2 for every run. |
| `results/analysis/consistency.csv` | Jaccard indices across successive runs. |
| `doc/*_execution_details.md` | Detailed execution logs per configuration. |
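For readers who want to sanity-check the values in `summary.csv`, the sketch below shows the standard F-beta definition behind the F2 columns. How generated requirements are matched against the human reference, and the exact CSV column names, are not reproduced here.

```python
# Standard F-beta definition; beta = 2 weights recall twice as heavily as precision.
# Matching generated requirements to the reference set is a separate, earlier step.
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


# Example: a run with precision 0.8 and recall 0.6 yields an F2 of ~0.632,
# reflecting the stronger weight that F2 places on recall.
print(round(f_beta(0.8, 0.6), 3))  # 0.632
```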
> [!IMPORTANT]
> Complete dataset datasheets are provided in the `data/README.md` and `results/README.md` files.
- **Determinism**: Because of the inherent stochasticity of LLMs, results may vary across runs. Refer to the consistency metrics in `results/analysis/consistency.csv` to assess stability (a minimal sketch of the overlap computation follows these notes).
- **Data licensing**: ISO‑27002 excerpts are replaced by identifiers to comply with copyright; users must possess the full standard.
- **Model access**: Some models (e.g., GPT‑4, Mistral) require API keys or specific access permissions. Ensure you have the necessary credentials before running the scripts.
- **Environment**: The scripts are tested on Python 3.10+ with the dependencies listed in `requirements.txt`. Ensure your environment matches these specifications to avoid compatibility issues.
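As noted under Determinism, run-to-run stability is reported as Jaccard indices. Here is a minimal sketch of that overlap computation, assuming each run is reduced to a set of requirement identifiers; the repository's actual matching procedure may be more elaborate.

```python
# Jaccard overlap between the requirement sets produced by two runs:
# |A ∩ B| / |A ∪ B|, where 1.0 means identical output across runs.
def jaccard(run_a: set[str], run_b: set[str]) -> float:
    if not run_a and not run_b:
        return 1.0
    return len(run_a & run_b) / len(run_a | run_b)


# Two runs that share two of four distinct requirements overlap by 0.5.
print(jaccard({"R1", "R2", "R3"}, {"R2", "R3", "R4"}))  # 0.5
```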
> [!IMPORTANT]
> Model selection references and rationale are documented in `doc/selection_of_models.md`.
This research is conducted under the principles of responsible AI. The generated requirements are intended for educational and research purposes only. Users must ensure compliance with local laws and ethical guidelines when applying these results in real-world scenarios.
Any use involving production compliance auditing, legal certification, or critical system design should involve human oversight and validation by qualified cybersecurity professionals.
| Version | Date | Highlights |
|---|---|---|
| 1.0 | 2025-07-31 | Initial public release. |
| 2.0 | 2025-12-01 | Second release including additional executions. |
| 2.1 | 2025-12-03 | Terminology fixes. |
This repository uses two licenses:
- Software: Proprietary license — personal, non-commercial research use only; no modification, redistribution, or commercial use permitted (see LICENSE).
- Data: Creative Commons Attribution 4.0 International (CC BY 4.0) (see LICENSE_DATA.txt).
If you use this repository in your research, please cite it as follows:
```bibtex
@misc{llmsec2025iso,
  author  = {Yelmo, Juan Carlos and Martín, Yod-Samuel and Perez-Acuna, Santiago},
  title   = {Experimental Evaluation of AI-Augmented Cybersecurity Requirements Generation Leveraging LLMs’ Capabilities | Reproducible Research Package},
  year    = {2025},
  url     = {https://github.com/STRAST-UPM/ai_requirements_generation_rr},
  doi     = {10.5281/zenodo.15641294},
  version = {2.0},
}
```

Juan Carlos Yelmo García - juancarlos.yelmo@upm.es
Yod Samuel Martín García - ys.martin@upm.es
Santiago Pérez Acuña - santiago.perez.acuna@upm.es
Last updated: 2025-12-03