Magma Experiment for Code- and Bug-based Coverage Benchmarking
Overview
This dataset contains the raw experimental artifacts for the Magma portion of our study on the concordance (split-half reliability) of coverage-based and bug-based fuzzer benchmarking procedures.
The artifacts are intentionally released at a low level (queues, logs, and bug information) so that others can reproduce, audit, and recompute outcomes under alternative analysis choices.
Note: This dataset is not linked directly from the paper. The paper links to a companion code repository, and that repository links to this dataset.
Experimental Setup (Summary)
- Benchmark suite: Magma v1.2.0
- Benchmarks used: 18 fuzz drivers (benchmarks) across multiple programs
- Fuzzers: 8 fuzzers
- Trials: 20 independent trials per (fuzzer, benchmark) combination (subject to supported combinations)
- Campaign length: 23 hours per trial
- Bug evaluation: Magma canary-based ground truth; we count triggered bugs
- Coverage evaluation: branch coverage (LLVM tooling)
What This Dataset Contains
This dataset includes the raw outputs needed to compute bug-based results directly and to compute coverage-based results via offline replay:
- Execution queues produced during fuzzing campaigns (per fuzzer, benchmark, trial)
- Raw bug information from Magma’s canary-based bug reporting (including logs/outputs per run)
- Per-run metadata required to group results and reproduce the analysis workflow
Important: How Coverage Results Are Obtained
Coverage is not precomputed in this dataset. The execution queues and related artifacts are included, but coverage values must be derived offline using the companion code repository.
To extract coverage:
- Obtain the companion code repository referenced by the paper.
- Configure it to point to this dataset on disk.
- Run the provided pipeline to replay the execution queues and compute branch coverage via LLVM tooling.
If you download only this dataset without the companion code, you can still analyze bug-based outcomes, but you cannot reproduce the paper’s coverage-based results.
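The replay step is performed by the companion code repository; purely as an illustration of the idea (not the companion pipeline), the following sketch shows how the queue inputs of one trial could be turned into per-input replay commands for an LLVM-instrumented target binary. The `queue/` subdirectory name, the target binary path, and the single-file-argument invocation are all assumptions for the example.

```python
from pathlib import Path

def build_replay_commands(trial_dir, target_binary, profile_dir):
    """Build one replay command per queue input in a trial directory.

    Assumptions (for illustration only): queue inputs live under
    <trial>/queue/, and the target reads its input file as its last
    argument. Each run writes an LLVM profile via LLVM_PROFILE_FILE.
    """
    commands = []
    for i, seed in enumerate(sorted(Path(trial_dir).glob("queue/*"))):
        profraw = Path(profile_dir) / f"input_{i:06d}.profraw"
        env = {"LLVM_PROFILE_FILE": str(profraw)}
        argv = [str(target_binary), str(seed)]
        commands.append((argv, env))
    return commands
```

After executing each command (e.g. with `subprocess.run(argv, env=env)`), the resulting `.profraw` files would typically be merged with `llvm-profdata merge` and summarized with `llvm-cov` to obtain branch coverage figures.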
Directory Layout
The dataset is organized hierarchically by fuzzer, benchmark, and trial. The exact naming is intended to match the assumptions of the companion analysis scripts.
Conceptually:
<fuzzer>/<benchmark>/<trial>/...
Each trial directory corresponds to one 23-hour campaign run and contains the artifacts produced by that run (e.g., queues, logs, bug-triggering information, and auxiliary metadata).
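To illustrate the conceptual layout above, a small helper (directory names here are placeholders; the companion scripts define the exact naming) can enumerate trial directories and group them by (fuzzer, benchmark) pair:

```python
from collections import defaultdict
from pathlib import Path

def enumerate_trials(dataset_root):
    """Map each (fuzzer, benchmark) pair to its list of trial directories.

    Assumes the conceptual three-level layout <fuzzer>/<benchmark>/<trial>/
    described above; adapt to the actual naming used by the release.
    """
    trials = defaultdict(list)
    root = Path(dataset_root)
    for fuzzer in sorted(p for p in root.iterdir() if p.is_dir()):
        for benchmark in sorted(p for p in fuzzer.iterdir() if p.is_dir()):
            for trial in sorted(p for p in benchmark.iterdir() if p.is_dir()):
                trials[(fuzzer.name, benchmark.name)].append(trial)
    return dict(trials)
```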
File Count
At release time, this dataset contains exactly 2,860 files.
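As a sanity check, the stated count equals the size of the supported trial matrix under one assumption (ours, not stated by the release): that each trial contributes exactly one file, e.g. one archive per campaign.

```python
fuzzers, benchmarks, trials = 8, 18, 20

# Full matrix of campaigns, minus the unsupported SymCC x exif
# combination (20 trials), which does not exist in the dataset.
expected = fuzzers * benchmarks * trials - trials
print(expected)  # 2860
```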
Known Omissions / Unsupported Combinations
- SymCC × exif is absent because SymCC does not compile on the exif benchmark, so no runs exist for this combination.
Data Completeness Note
A small number of runs (seven) were re-executed after a temporary server failure in order to complete the expected trial matrix; these re-executed runs were excluded from the paper's analysis. The released dataset is intended to be complete for the supported fuzzer–benchmark combinations.
Relationship to Paper and Code
- Paper: defines the research questions, metrics, and statistical analysis.
- Companion code repository: performs coverage extraction (queue replay), aggregation, ranking, and concordance computations.
- This dataset: provides the raw artifacts consumed by that code.
Intended Use
- Reproduction of bug-based outcomes and reanalysis of trial variability
- Recomputation of coverage-based outcomes via replay using the companion code
- Alternative analyses of bug-based and coverage-based outcomes
This dataset is not a pre-aggregated results table; it is raw experimental output.
License
This dataset is released under CC BY 4.0.
Citation
Please cite this dataset via its DOI (doi:10.17617/3.UQGK4A). When referencing the methodology or results, also cite the accompanying paper, "In Bugs We Trust? On Measuring the Randomness of a Fuzzer Benchmarking Outcome".