Data Infrastructure Building Blocks for ISI. A Project of the University of Arizona (NSF #ACI-1443019), Drexel University, University of Virginia, University of Texas at Dallas, and University of Utah

Other Data

Intelligence and Security Informatics Data Sets

AZSecure-data.org

The README file that accompanies each data set describes its origin, contents, and size; contents vary. Collections may be small or large. Sizes, if provided, are approximate. To download, click on the name of the data set and choose to open or save the file when prompted by your browser. Restricted sets cannot be downloaded directly and must be requested through the project manager: email ailab@eller.arizona.edu, state which data set you would like to use and the purpose for which it will be used, and provide complete contact information, including name, affiliation, mailing address, email, and telephone number.
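Before extracting a large or sensitive archive, it can help to inspect it first. The following is a minimal Python sketch, not part of the project's own tooling, that lists an archive's members and their total uncompressed size using only the standard library. The function name `summarize_zip` is an illustrative assumption.

```python
import io
import zipfile


def summarize_zip(source):
    """Return (member_names, total_uncompressed_bytes) without extracting.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        infos = archive.infolist()
        names = [info.filename for info in infos]
        total = sum(info.file_size for info in infos)
    return names, total


if __name__ == "__main__":
    # Build a tiny in-memory archive so the sketch runs without a download.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as demo:
        demo.writestr("readme.txt", "origin, contents, and size")
    buffer.seek(0)
    names, total = summarize_zip(buffer)
    print(names, total)
```

For a downloaded file, pass its path instead, for example `summarize_zip("ADFA-IDS.zip")`. Listing members reads only the archive's directory and does not unpack any file.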

Data Sets By Type

Malware

Ether Malware Analysis Dataset - Collected by Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee at the Georgia Institute of Technology, the Ether Malware Analysis Dataset is a collection of over 25,000 malware instances used to test EtherUnpack against packed malware. These malware instances were collected between January and March of 2008 from honeypots, mail filters, proxy monitors, web crawling, file sharing networks, and other sources. CAUTION: THIS DATASET CONTAINS MALWARE. Please review the DIBBs-ISI Malware Handling Protocol.

readme.txt

EtherMalwareDataset.zip (12.9 GB zipped, 15 GB unzipped)

Network Traffic

ADFA-IDS - Collected by Gideon Creech and Jiankun Hu of the Australian Defence Force Academy, ADFA IDS is an intrusion detection system dataset made publicly available in 2013. It is intended to be representative of modern attack structure and methodology and to replace the older KDD and UNM datasets. ADFA IDS includes independent datasets for Linux and Windows environments.

readme.txt

ADFA-IDS.zip (14MB)

ADFA-IDS 2017 - An update of the original ADFA-IDS dataset, released March 27, 2017.

readme-ADFA.txt

How_to_use_AFDA-IDS_DATASETS.pdf

ADFA-IDS_2017.zip (959MB)

Aktaion Example Labeled Data - Collected by Joseph Zadeh and Rod Soto. This collection contains labeled network traffic data in ARFF format. The original purpose was to train ransomware detection in the Aktaion IDS. The data predates August 2016.

readme.txt

aktaion.zip (15.5 KB)
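ARFF is a plain-text format: a header of `@relation` and `@attribute` declarations followed by an `@data` section of comma-separated rows. As a rough illustration, the Python sketch below parses only that basic layout and ignores quoting, sparse rows, and other ARFF features. The attribute names and values in the sample are hypothetical and are not taken from the Aktaion files.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (attribute_names, rows).

    Handles only the basic layout: % comments, @attribute lines with
    unquoted names, and dense comma-separated data rows.
    """
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lowered = line.lower()
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
        elif lowered.startswith("@attribute"):
            # "@attribute <name> <type>": keep the name only
            attributes.append(line.split(None, 2)[1])
        elif lowered.startswith("@data"):
            in_data = True
    return attributes, rows


# Hypothetical sample; the real attribute names are described in the readme.
SAMPLE = """\
% hypothetical sample, not real Aktaion data
@relation traffic
@attribute uri_length numeric
@attribute label {benign,malicious}
@data
12,benign
87,malicious
"""

if __name__ == "__main__":
    names, rows = parse_arff(SAMPLE)
    print(names)
    print(rows)
```

Values are returned as strings; converting numeric attributes is left to the caller, since the declared types are not interpreted here.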

Chris Sanders' Packets 2017 - Collected by Chris Sanders, this collection of 76 PCAPs contains live malware captures. Last updated in 2017.

readme.txt

ChrisSandersPackets.zip (89.9-8 MB)
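A quick sanity check on a capture file is to count its packet records. The sketch below does this in Python with only the standard library, assuming the classic libpcap file layout (a 24-byte global header followed by 16-byte record headers) with microsecond timestamps. Whether every file in this collection uses that layout rather than pcapng is an assumption, and `count_packets` is an illustrative name.

```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic libpcap, microsecond timestamps


def count_packets(data):
    """Count packet records in the bytes of a classic libpcap file."""
    if len(data) < 24:
        raise ValueError("too short to be a pcap file")
    # The magic number reveals the byte order used by the writer.
    if struct.unpack("<I", data[:4])[0] == PCAP_MAGIC:
        endian = "<"
    elif struct.unpack(">I", data[:4])[0] == PCAP_MAGIC:
        endian = ">"
    else:
        raise ValueError("not a classic microsecond pcap file")
    offset = 24  # skip the 24-byte global header
    count = 0
    while offset + 16 <= len(data):
        # Record header: ts_sec, ts_usec, incl_len, orig_len
        _, _, incl_len, _ = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        offset += 16 + incl_len  # jump over the captured bytes
        count += 1
    return count


if __name__ == "__main__":
    # Synthetic capture: one global header plus three 4-byte packets.
    header = struct.pack("<IHHiIII", PCAP_MAGIC, 2, 4, 0, 0, 65535, 1)
    record = struct.pack("<IIII", 0, 0, 4, 4) + b"\x00" * 4
    print(count_packets(header + record * 3))
```

The function only parses bytes and never interprets packet payloads. It reads the whole input into memory, so a streaming reader would be a better fit for very large captures.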

Comprehensive, Multi-Source Cyber Security Events - Collected by Alexander Kent at Los Alamos National Laboratory, this collection is a comprehensive enterprise cyber security data set spanning 58 days. It contains data from authentication, process, DNS, and network flow sources, along with red team attacks, on the Los Alamos National Laboratory's internal corporate computer network. The data comprise five data elements and 1,648,275,307 events in total for 12,425 users, 17,684 computers, and 62,974 processes.

readme.txt

CMSC.zip (10.9GB)

CSDMC 2010 - This data was collected by API monitors during a data mining competition at the International Conference on Neural Information Processing (ICONIP) in Sydney, Australia, in 2010. Knowledge about the malware programs presented in the dataset is outdated and may not contain significant information for detecting today’s malware. It may, however, be useful for historical reference or other purposes.

readme.txt

CSDMC2010.zip (2MB)

CTU-13 - Collected by Sebastian Garcia, Martin Grill, and Honza Stiborek at the Czech Technical University (CTU) in Prague, the CTU-13 dataset consists of a group of 13 different malware captures in a real network environment. The captures include Botnet, Normal, and Background traffic. The Botnet traffic comes from the infected hosts, the Normal traffic comes from the verified normal hosts, and the Background traffic is all remaining traffic. The dataset is labeled on a flow-by-flow basis and was collected from August 10-15, 2011.

ReadMe.txt (more information at: http://mcfp.weebly.com/the-ctu-13-dataset-a-labeled-dataset-with-botnet-normal-and-background-traffic.html)

CTU13.zip (1.8GB)

eMews HTTPS and SSH Collection-1 Dataset - Collected by Brian Ricks and Bhavani Thuraisingham at the University of Texas at Dallas, eMews is a collection of PCAP data captured from an in-lab emulated network, using the CORE network emulator and the eMews framework developed to generate packet traces and manage experimental runs. The captures vary from 1 hour to 10 hours in duration and are captured from an HTTPS and SSH server within the network. Because these captures were performed in a controlled environment, the researchers can guarantee that no malware or any other malicious behavior is present. The network consists of 1,022 nodes, of which 844 incorporate autonomous web crawling activity, and 36 incorporate autonomous SSH interactions.

ReadMe.txt

emews-dataset-1.zip (282 MB)

ISOT Botnet – Collected by Sherif Saad, Issa Traore, Ali A. Ghorbani, Bassam Sayed, David Zhao, Wei Lu, John Felix, and Payman Hakimian of the Information Security and Object Technology (ISOT) Research Lab at the University of Victoria, the ISOT dataset combines several existing datasets. Its malicious traffic comes from the French chapter of the honeynet project and involves the Storm and Waledac botnets. It also includes one dataset each from the Traffic Lab at Ericsson Research in Hungary and the Lawrence Berkeley National Lab (LBNL). The Ericsson Lab dataset contains a large amount of general traffic from a variety of applications, including HTTP web browsing behavior, World of Warcraft gaming packets, and packets from popular bittorrent clients such as Azureus. The LBNL trace data provide additional non-malicious background traffic. Data collection dates from October 2004 to January 2005.

ReadMe.txt

ISOT_Botnet.zip (2.2GB)

Linux Redhat 7.1 System Deployed in a Honeynet Logs - Collected by Anton A. Chuvakin. This dataset consists of system logs from a Linux Redhat 7.1 system deployed in a honeynet. The owner of the data runs a site for public domain, real-world log data in which malicious activities are captured. One interesting aspect of this data is that there is no sanitization or anonymization; the data is provided unmodified (and no modifications are needed or required to use the data for research). Data was collected for 590 continuous days between 2006 and 2007.

ReadMe.txt

HoneynetRedHatLogs.zip (41.2 MB)

Malware Training Sets - Collected by Marco Ramilli. The dataset is composed of 71 labeled malware examples in JSON format. Each example corresponds to an instance of a specific malware and is labeled with the malware name.

ReadMe.txt

MalwareTrainingSets.zip (29.9 MB)

M0DROID Dataset - Collected by Universiti Putra Malaysia. This dataset comes bundled with the M0DROID mobile malware analysis tool, which is designed to detect Android malware using signatures derived from system call requests of individual Android APKs. The dataset itself contains signatures generated from many Android APKs and can be used separately from the detection engine. Collected November 2014.

ReadMe.txt

M0DROID.zip (6.1 MB)

Shadowbrokers EternalBlue/EternalRomance PCAP Dataset - Collected by Eric Conrad. This dataset consists of PCAP data from the EternalBlue and EternalRomance malware. These PCAPs capture the actual exploits in action on target systems that had not yet been patched to defeat the exploits. The EternalBlue PCAP data uses a Windows 7 target machine, whereas the EternalRomance PCAP data uses a Windows Server 2008 R2 target machine. Also included is EternalBlue PCAP data for a patched Windows 7 target machine, showing the failed exploit. This data was collected in April 2017.

ReadMe.txt

ShadowbrokersEternalBlue.zip (1.9 MB)

Standard Dragon NIDS Alert Logs - Collected by Anton A. Chuvakin. This dataset consists of alert logs from the Enterasys Dragon NIDS 4.x intrusion detection system. The owner of the data runs a site for public domain, real-world log data in which malicious activities are captured. One interesting aspect of this data is that there is no sanitization or anonymization; the data is provided unmodified (and no modifications are needed or required to use the data for research). Data was collected for 590 continuous days between 2006 and 2007.

ReadMe.txt

DragonAlertLogs.zip (20.7 MB)

Unified Host and Network Data Set - Collected by Melissa J. M. Turcotte, Alexander D. Kent, and Curtis Hash. This dataset helps to address the current lack of datasets derived from real-world enterprise networks, and to also fulfill the need for a rich dataset that has not been so heavily sanitized as to cripple any cyber-research value. There are two sets which comprise this dataset: one of network flow data mainly originating from internal enterprise routers, and one of Windows host data. The data was collected in 2017 over a period of 90 days from the Los Alamos National Laboratory's enterprise network. Some values were anonymized, but for those values, the anonymization was kept consistent between the two datasets.

ReadMe.txt

UHNDS_2-5 (6.1 GB)

UHNDS_6-10 (8.4 GB)

UHNDS_11-15 (7.8 GB)

UHNDS_16-20 (7.8 GB)

UHNDS_21-25 (7.5 GB)

UHNDS_26-30 (7.4 GB)

UHNDS_31-35 (7.3 GB)

UHNDS_36-40 (7.9 GB)

UHNDS_41-45 (7.6 GB)

UHNDS_46-50 (7.8 GB)

UHNDS_51-55 (8.6 GB)

UHNDS_56-60 (8.8 GB)

UHNDS_61-65 (7.1 GB)

UHNDS_66-70 (8.5 GB)

UHNDS_71-75 (9.7 GB)

UHNDS_76-80 (8.2 GB)

UHNDS_81-85 (8.8 GB)

UHNDS_86-90 (8 GB)

UHNDS_Host_1-29 (11.7 GB)

UHNDS_Host_30-59 (12.7 GB)

UHNDS_Host_60-90 (12.7 GB)
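Because the Unified Host and Network Data Set is split into many parts, it may be worth adding up the listed sizes before starting a full download. The Python sketch below simply sums the approximate figures shown above; the variable and function names are illustrative, and the result is only as accurate as the listed sizes.

```python
# Approximate sizes in GB, copied from the list above.
NETFLOW_PARTS_GB = [6.1, 8.4, 7.8, 7.8, 7.5, 7.4, 7.3, 7.9, 7.6,
                    7.8, 8.6, 8.8, 7.1, 8.5, 9.7, 8.2, 8.8, 8.0]
HOST_PARTS_GB = [11.7, 12.7, 12.7]


def total_gb(parts):
    """Sum part sizes in GB, rounded to one decimal place."""
    return round(sum(parts), 1)


if __name__ == "__main__":
    print("network flow parts:", total_gb(NETFLOW_PARTS_GB), "GB")
    print("host parts:", total_gb(HOST_PARTS_GB), "GB")
    print("all parts:", total_gb(NETFLOW_PARTS_GB + HOST_PARTS_GB), "GB")
```

By these figures the 18 network flow parts come to roughly 143 GB and the three host parts to roughly 37 GB, or about 180 GB for the complete set.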

VERIS Community Database - Collected by Verizon Security Research & Cyber Intelligence Center. The Vocabulary for Event Recording and Incident Sharing (VERIS) is a language for describing security incidents. VERIS and its accompanying dataset (VCDB) aim to provide not only a repository of widespread publicly collected incidents, but a common language for describing these incidents. The overall goal is to cooperatively learn from past experiences for better risk management, and to collect data for all publicly available data breaches. Data collected between 2012 and November 2017.

ReadMe.txt

VCDB.zip (24.4 MB)

Wi-Fi Header Database - This data was collected as part of the development of an intrusion detection system for Wi-Fi networks. The database contains 9,817,671 Wi-Fi traffic headers collected by University of Arizona researchers Pratik Satam and Salim Hariri between 6/3/2016 and 6/11/2016 using a Wi-Fi card in monitor mode and a C-based tool developed for this task.

ReadMe.txt

WiFiHeader.zip (19MB)

News

CHINESE

Al Qaeda News - The textual data are news stories about Al-Qaeda reported in a variety of online sources. Directories named after each site contain the original web pages. The XML files contain the news content parsed from those web pages. The data set size is 426.4 MB zipped and approximately 1.79 GB unzipped. The data are from the paper, "Extracting Action Knowledge in Security Informatics," by Ansheng Ge, Wenji Mao, Daniel Zeng, Qingchao Kong, and Huachi Zhu, presented at the 2012 Intelligence and Security Informatics Conference. RESTRICTED; please request through the project manager: email ailab@eller.arizona.edu, state which data set you would like to use and the purpose for which it will be used, and provide complete contact information, including name, affiliation, mailing address, email, and telephone number.

Web chat

CHINESE

QQ Chat Logs – Collected by Kangzhi Zhao, Yong Zhang, Chunxiao Xing, and Hsinchun Chen, this dataset contains the textual chat logs of Chinese cybercriminals in underground QQ groups. The data was collected manually by joining underground QQ groups for a period of time and downloading the chat logs. The chats were collected between March 20 and April 4, 2016.

ReadMe.pdf

QQ.zip (1.5MB)

Websites

Patriot, Militia, Hate and Linked Websites - Collected by the Artificial Intelligence Lab, Management Information Systems Department, University of Arizona, the Patriot, Militia, Hate and Linked Websites collection presented here contains 74 websites of groups identified by the Southern Poverty Law Center in 2009 as promoting extreme social perspectives. The collection also contains 123 additional websites linked to by the initial set of websites. The full list of websites in this collection is in the ReadMe.txt file. Due to the size of this collection, it has been divided into 20 portions to make downloading easier. Each bundle of websites contains the ReadMe.txt and About.pdf files.

ReadMe.txt

About.pdf

Dark Net Markets

DreamMarket Dark Net Market (2016): DreamMarket is an online platform used by cybercriminals to exchange illegal goods. The dataset was collected by the Artificial Intelligence Lab at the University of Arizona and contains 39,473 product listings from 690 sellers in 2016. Sellers' membership dates range from 12/4/2013 to 12/1/2016. For more information, please refer to the ReadMe file below.

ReadMe.txt

DreamMarket_2016.zip

DreamMarket Dark Net Market (2017): The dataset was collected by the Artificial Intelligence Lab at the University of Arizona and contains 91,463 product listings from 2,092 sellers in 2016. Sellers' membership dates range from 12/4/2013 to 10/4/2017. For more information, please refer to the ReadMe file below.

ReadMe.txt

DreamMarket_2017.zip