Artifact Evaluation (AE) is essential for ensuring the transparency and reliability of research, closing the gap between exploratory work and real-world deployment is particularly important in cybersecurity, particularly in IoT and CPSs, where large-scale, heterogeneous, and privacy-sensitive data meet safety-critical actuation. Yet, manual reproducibility checks are time-consuming and do not scale with growing submission volumes. In this work, we demonstrate that Large Language Models (LLMs) can provide powerful support for AE tasks: (i) text-based reproducibility rating, (ii) autonomous sandboxed execution environment preparation, and (iii) assessment of methodological pitfalls. Our reproducibility-assessment toolkit yields an accuracy of over 72% and autonomously sets up execution environments for 28% of runnable cybersecurity artifacts. Our automated pitfall assessment detects seven prevalent pitfalls with high accuracy (F1 > 92%). Hence, the toolkit significantly reduces reviewer effort and, when integrated into established AE processes, could incentivize authors to submit higher-quality and more reproducible artifacts. IoT, CPS, and cybersecurity conferences and workshops may integrate the toolkit into their peer-review processes to support reviewers' decisions on awarding artifact badges, improving the overall sustainability of the process.
@inproceedings{heye2025supporting,
title = {Supporting Artifact Evaluation with LLMs: A Study with Published Security Research Papers},
author = {David Heye and Karl Kindermann and Robin Decker and johannes-lohmoeller and Anastasiia Belova and Sandra Geisler and Klaus Wehrle and Jan Pennekamp},
booktitle = {Proceedings of the 2025 IEEE International Conference on Big Data},
year = {2025},
doi = {10.1109/BigData66926.2025.11401815},
}