SoReL-20M: A Huge Dataset of 20 Million Malware Samples Released Online

Dec 14, 2020, by Ravie Lakshmanan

Cybersecurity firms Sophos and ReversingLabs on Monday jointly released the first-ever production-scale malware research dataset available to the general public, with the aim of helping researchers build effective defenses and drive industry-wide improvements in security detection and response.

"SoReL-20M" (short for Sophos-ReversingLabs – 20 Million), as it's called, is a dataset containing metadata, labels, and features for 20 million Windows Portable Executable (PE) files, including 10 million disarmed malware samples, with the goal of devising machine-learning approaches for better malware detection capabilities.

"Open knowledge and understanding about cyber threats also leads to more predictive cybersecurity," Sophos AI group said. "Defenders will be able to anticipate what attackers are doing and be better prepared for their next move."

Accompanying the release is a set of PyTorch and LightGBM-based machine learning models pre-trained on this data as baselines.

Other fields, such as natural language and image processing, have benefitted from vast publicly available datasets such as MNIST, ImageNet, CIFAR-10, IMDB Reviews, Sentiment140, and WordNet. Standardized labeled datasets devoted to cybersecurity, by contrast, have proved hard to come by because of the presence of personally identifiable information, sensitive network infrastructure data, and private intellectual property, not to mention the risk of providing malicious software to unknown third parties.

Although EMBER (aka Endgame Malware BEnchmark for Research) was released in 2018 as an open-source benchmark dataset for malware classifiers, its smaller sample size (1.1 million samples) and its function as a single-label dataset (benign/malware) meant it "limit[ed] the range of experimentation that can be performed with it."

SoReL-20M aims to get around these problems with 20 million PE samples: 10 million disarmed malware samples (which cannot be executed), plus extracted features and metadata for an additional 10 million benign samples.

Furthermore, the approach leverages a deep learning-based tagging model trained to generate human-interpretable semantic descriptions specifying important attributes of the samples involved.
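To make the difference between a single benign/malware label and per-sample semantic tags concrete, here is a minimal Python sketch. It is an illustration only: the tag names, the `Sample` class, and the `to_targets` helper are hypothetical and are not taken from the SoReL-20M release or its tagging model.

```python
from dataclasses import dataclass, field

# Hypothetical tag vocabulary, for illustration only.
# It is NOT the actual tag list used by SoReL-20M.
TAGS = ("ransomware", "spyware", "dropper")


@dataclass
class Sample:
    sha256: str                               # placeholder identifier
    is_malware: bool                          # single benign/malware label
    tags: set = field(default_factory=set)    # extra semantic tags


def to_targets(sample):
    """Return (binary_label, multi_hot_tag_vector) for one sample."""
    binary = 1 if sample.is_malware else 0
    # One slot per tag in TAGS: 1 if the sample carries that tag, else 0.
    multi_hot = [1 if tag in sample.tags else 0 for tag in TAGS]
    return binary, multi_hot


malicious = Sample("0" * 64, True, {"ransomware", "dropper"})
benign = Sample("1" * 64, False)

print(to_targets(malicious))
print(to_targets(benign))
```

A single-label dataset only supplies the first element of each pair, while a tagged dataset also supplies the second, which lets a model be trained and evaluated on what kind of malware a sample is rather than only on whether it is malicious.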

The release of SoReL-20M follows similar industry initiatives in recent months, including that of a coalition led by Microsoft, which released the Adversarial ML Threat Matrix in October to help security analysts detect, respond to, and remediate adversarial attacks against machine learning systems.

"The idea of threat intelligence sharing in security isn't new but is more critical than ever given the innovation threat actors have shown over the past several years," ReversingLabs researchers said. "Machine learning and AI have become central to these efforts allowing threat hunters and SOC teams to move beyond signatures and heuristics and become more proactive in detecting new or targeted malware."
