Building a commit-level dataset of real-world vulnerabilities

Abstract

While CVEs have become a de facto standard for publishing advisories on vulnerabilities, the state of current CVE databases is lackluster. Moreover, CVE advisories alone are insufficient to bridge the gap with the vulnerability artifacts in the impacted program. As a result, the community lacks a public dataset of real-world vulnerabilities that provides this association.

In this paper, we present a method that restores this missing link by analyzing the vulnerabilities of the Android Open Source Project (AOSP), an aggregate of more than 1,800 projects. AOSP is an ideal target for building a representative dataset of vulnerabilities, as it covers the full spectrum that may be encountered in a modern system where a variety of low-level and higher-level components interact. More specifically, our main contribution is a dataset of more than 1,900 vulnerabilities, associating generic metadata (e.g., vulnerability type, impact level) with their respective patches at commit granularity (e.g., fix commit ID, affected files, source code language).

Finally, we also augment this dataset by providing precompiled binaries for a subset of the vulnerabilities. These binaries enable various uses of the data, both for binary-only analysis and at the interface between source and binary. In addition to providing a common baseline benchmark, our dataset release supports the community in data-driven software security research.

Publication
Proceedings of the Twelfth ACM Conference on Data and Application Security and Privacy
Alexis Challande
Security Engineer and Doctor in Cybersecurity