Intelligence and Security Informatics Data Sets
Data Infrastructure Building Blocks for ISI. A Project of the University of Arizona (NSF #ACI-1443019), Drexel University,
University of Virginia, University of Texas at Dallas, and University of Utah
The README file that accompanies each data set describes its origin, contents, and size; contents vary. Collections may be small or large, and sizes, where provided, are approximate. To download a data set, click its name and choose to open or save the file when prompted by your browser. Restricted sets cannot be downloaded directly and must be requested through the project manager: email email@example.com, stating which data set you would like to use and the purpose for which it will be used. Please provide complete contact information, including name, affiliation, mailing address, email, and telephone number.
Data Sets By Type
- Ether Malware Analysis Dataset - Collected by Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee at the Georgia Institute of Technology, the Ether Malware Analysis Dataset is a collection of over 25,000 malware instances used to test EtherUnpack against packed malware. These malware instances were collected between January and March of 2008 from honeypots, mail filters, proxy monitors, web crawling, file sharing networks, and other sources. CAUTION: THIS DATASET CONTAINS MALWARE.
EtherMalwareDataset.zip (12.9 GB zipped, 15 GB unzipped)
- ADFA-IDS - Collected by Gideon Creech and Jiankun Hu of the Australian Defence Force Academy, ADFA-IDS is an intrusion detection system dataset made publicly available in 2013. It is intended to represent modern attack structure and methodology and to replace the older KDD and UNM datasets. ADFA-IDS includes independent datasets for Linux and Windows environments.
- Comprehensive, Multi-Source Cyber Security Events - Collected by Alexander Kent at Los Alamos National Laboratory, this collection is a comprehensive enterprise cyber security data set spanning 58 days. It contains authentication, process, DNS, network flow, and red team attack data from the Los Alamos National Laboratory's internal corporate computer network. Across these five data elements, the data set records 1,648,275,307 events in total for 12,425 users, 17,684 computers, and 62,974 processes.
- CTU-13 - Collected by Sebastian Garcia, Martin Grill, and Honza Stiborek at the Czech Technical University (CTU) in Prague, the CTU-13 dataset consists of 13 different malware captures in a real network environment. The captures include Botnet, Normal, and Background traffic. The Botnet traffic comes from the infected hosts, the Normal traffic comes from the verified normal hosts, and the Background traffic is all remaining traffic. The dataset is labeled on a flow-by-flow basis and was collected from August 10-15, 2011.
ReadMe.txt (more information at: http://mcfp.weebly.com/the-ctu-13-dataset-a-labeled-dataset-with-botnet-normal-and-background-traffic.html)
- ISOT Botnet - Collected by Sherif Saad, Issa Traore, Ali A. Ghorbani, Bassam Sayed, David Zhao, Wei Lu, John Felix, and Payman Hakimian of the Information Security and Object Technology (ISOT) Research Lab at the University of Victoria, the ISOT dataset combines several existing datasets. The malicious traffic comes from the French chapter of the honeynet project and involves the Storm and Waledac botnets. The non-malicious traffic comes from one dataset each from the Traffic Lab at Ericsson Research in Hungary and the Lawrence Berkeley National Lab (LBNL). The Ericsson Lab dataset contains a large amount of general traffic from a variety of applications, including HTTP web browsing behavior, World of Warcraft gaming packets, and packets from popular BitTorrent clients such as Azureus. The LBNL trace data provide additional non-malicious background traffic. Data collection dates from October 2004 to January 2005.
- Al Qaeda News - This collection contains the text of news about Al-Qaeda reported in a variety of online sources. Directories named after each source site contain the original web pages. The XML files contain the news content parsed from those web pages. The data set size is 426.4 MB zipped and approximately 1.79 GB unzipped. The data are from the paper, "Extracting Action Knowledge in Security Informatics," by Ansheng Ge, Wenji Mao, Daniel Zeng, Qingchao Kong, and Huachi Zhu, presented at the 2012 Intelligence and Security Informatics Conference. RESTRICTED; please request through the project manager: email firstname.lastname@example.org, stating which data set you would like to use and the purpose for which it will be used. Please provide complete contact information, including name, affiliation, mailing address, email, and telephone number.
- QQ Chat Logs - Collected by Kangzhi Zhao, Yong Zhang, Chunxiao Xing, and Hsingchun Chen, this dataset contains the text chat logs of Chinese cybercriminals in underground QQ groups. The data were collected manually by joining underground QQ groups for a period of time and then downloading the chat logs. The chats were collected between March 20 and April 4, 2016.
- Patriot, Militia, Hate and Linked Websites - Collected by the Artificial Intelligence Lab, Management Information Systems Department, University of Arizona, the Patriot, Militia, Hate and Linked Websites collection contains 74 websites belonging to groups identified by the Southern Poverty Law Center in 2009 as promoting extreme social perspectives. The collection also contains 123 additional websites that the initial set of websites links to. The full list of websites in this collection is in the ReadMe.txt file. Because of its size, the collection has been divided into 20 portions to make downloading easier. Each bundle of websites contains the ReadMe.txt and About.pdf files.