Hackers use forums as message boards, posting in conversation threads that cover hacking tools, tutorials, and malicious source code. To facilitate research in this area of cybersecurity, the Artificial Intelligence Lab at the University of Arizona collected several major forums in three languages from the hacker community ecosystem. The collection covers emerging topics such as social engineering, AI bots, and ransomware, which can support cutting-edge research on the hacker community. It contains useful data about forum posts, including thread title, author, author joining date, post date, and the textual content of the post. The forums in each language category were selected based on criteria such as the number of attachments, the number of posts, and the number of users. The popularity of forums in the hacker community, combined with rich metadata such as post date (included in ReadMe.txt), makes forums a unique data source for conducting cybersecurity research on assets used by hackers, identifying prolific threat actors, and discovering emerging threats.
English is the dominant language in hacker forums. English forums contribute a large portion of the knowledge and tool exchange in the hacker community and are therefore suitable for obtaining large volumes of information about the English-language hacker community, which is not limited to English-speaking countries. The topics include breached data, mobile malware, cryptocurrencies, login dumps, code for AI bots, etc.
- CrackingArena Forum, with 44,927 posts and 11,977 active users, is one of the largest forums existing in 2018. The platform has a section dedicated to hacking tools and tutorials, called "hacking zone". This section contains content-rich threads with high user participation, which makes the forum conducive to cybersecurity research on interaction patterns among cybercriminals. The topics covered range from social engineering and cracking tools and tutorials to exploits, which makes this forum a viable source for pinpointing the characteristics of newly emerged hacker assets. The posts in this forum date from 4/8/2013 to 2/24/2018.
- Suggested analytics: Identification of specific types of assets (e.g., social engineering assets) via text mining; social network analysis to identify interaction patterns
- Suggested techniques: Designing special-purpose classifiers using deep learning, supervised topic modeling, and conditional random fields; descriptive network analytics
- Suggested tools: Scikit-learn, TensorFlow, Keras, Gephi, NetworkX.
- CrackingFire, with 14,511 users, is an English forum with a large concentration of users. In addition to a dedicated section for hacking tools, this forum features a section called "coding zone", which contains source code in a variety of languages, such as C# and VB.Net, for running malicious operations, including compromising online social media accounts. This dataset therefore facilitates cybersecurity research concerned with the analysis of hacker assets, especially source code analysis of these assets. The CrackingFire forum dataset contains 37,572 forum posts dating from 4/7/2011 to 2/21/2018.
- Suggested analytics: Static malware analysis by applying text mining techniques to gain insight into the languages used and the attack vectors of the provided source code assets.
- Suggested techniques: Unsupervised topic modeling, SOM clustering, and other text clustering approaches
- Suggested tools: Scikit-learn, Gensim, Mallet, and Stanford Topic Modeling Toolbox
CrackingFire.zip (29.4 MB)
- ExeTools Forum stands out among hacking forums as one of the oldest: it has been active in exchanging hacker assets since 2002, which makes it possible to study the threat landscape longitudinally. The ratio of posts to users in this forum is very high compared with the other forums, and its members are expected to be more specialized than hackers elsewhere. This forum is suggested for studies that focus on differences in expertise level across hacker communities. This dataset contains 24,663 posts dating from 1/16/2002 to 3/14/2018.
- Suggested analytics: Longitudinal analysis by detecting evolutionary patterns in the creation of hacker assets within the community; semantic shifts in hacker language and jargon
- Suggested techniques: Time series analysis, Recurrent Neural Network language models, Graph Convolutional Neural Networks
- Suggested tools: Keras, TensorFlow, PyTorch, and NetworkX
ExeTools.zip (30.6 MB)
- Garage4Hackers Forum, despite being a medium-sized forum in terms of content and users, is another highly specialized English forum. It features an expert section with materials related to exploitation tools and techniques, botnets, and reverse engineering. This dataset can provide insight into specialized hacking assets and tools. It contains 8,700 forum posts dating from 7/6/2010 to 9/18/2017.
- Suggested analytics: Identification of highly specialized types of hacker assets and their corresponding disseminators
- Suggested techniques: Advanced classification techniques such as deep learning
- Suggested tools: Keras, TensorFlow, and PyTorch
Garage4hackers.zip (14.8 MB)
- Hackhound Forum contains 4,242 forum posts on a variety of hacking topics, collected in 2015. Posts date from October 2012 to September 2015.
hackhound.zip (1.7 MB)
From a cybersecurity research perspective, it is crucial to analyze data sources in other languages to gain global insight into cyber threats across platforms, since geographical regions differ in tools, focus, and malicious intentions. Russian is the second most common language in hacker forums, and compared with English forums, Russian forums focus more on exchanging highly specialized hacking tools.
- Antichat Forum is one of the largest Russian forums, with 233,480 collected posts. This large-scale communication platform became widely known for a data breach that revealed the passwords of several thousand of its users. The topics in this forum cover vulnerabilities, anonymity, and security issues reported by administrators. One unique characteristic of this forum is its dedicated sections for webmasters and system administrators. These sections contain highly specialized discussions that can answer fine-grained research questions about the threat landscape in the hacker community from the viewpoint of experts. Posts date from 3/6/2002 to 3/27/2018.
- Suggested analytics: Identification of non-English cyber threats in the Russian hacker community by applying multilingual classifiers
- Suggested techniques: Advanced classification techniques such as deep learning
- Suggested tools: TensorFlow, PyTorch, and Theano
Antichat.zip (424 MB)
- DamageLab is a Russian forum known for advertising and hosting large-scale cyber attack platforms such as the Zeus and SpyEye botnet command and control networks. As in other Russian forums, highly specialized and sophisticated hacker assets can be found here. The collected data can therefore be used to pinpoint the tactics, techniques, and procedures (TTPs) behind recent hacker assets. The data contains 7,569 forum posts. Posts date from 11/13/2004 to 2/15/2018.
- Suggested analytics: Identifying highly specialized hacker assets being exchanged in the Russian hacker community
- Suggested techniques: Classification techniques such as deep learning, supervised topic modeling and conditional random fields
- Suggested tools: Keras, PyTorch, CRF++, and Gensim
DamageLab.zip (9.27 MB)
- Xakepok Forum is a large Russian forum containing 48,034 forum posts. The forum specializes in cross-site referencing, SQL injection, cryptors (which can be used in ransomware), and keyloggers. The variety of its offerings is one of this forum's distinguishing characteristics. Not only can this dataset be used to analyze emerging hacker assets, but, given the large number of cybercriminals in this forum, it can also be used to analyze emerging threat actors. Posts in this forum date from 4/15/2009 to 10/18/2017.
- Suggested analytics: Automated categorization of different types of hacker assets in the Russian hacker community via clustering
- Suggested techniques: Unsupervised topic modeling, SOM clustering
- Suggested tools: Gensim, Mallet, Stanford Topic Modeling Toolbox, Scikit-learn
Xakepok.zip (81 MB)
- The Webkill Forum dataset contains 133,858 forum posts on hacking and carding topics. Posts date from September 2007 to September 2015.
webkill.zip (130.5 MB)
- Douban Group - A Chinese-language forum. The dataset is a collection of posts from two Douban groups, buybook (http://www.douban.com/group/buybook/) and qiong (http://www.douban.com/group/qiong), stored in two separate directories named after the groups. Within each group directory, each txt file corresponds to one post, and the filename (for example, '1012614-info.txt') carries the post ID. In each post info file, the first line describes the original post and the following lines describe the comments. Fields are separated by '[=]'. For the first line, the fields are topic ID, group ID, user ID, post title, post publishing date, number of comments, and post content. For the following lines, the fields are comment ID, group ID, topic ID, comment user ID, comment publishing date, ID of the quoted (replied-to) comment, and comment content. NOTE: All txt files are encoded in UTF-8. Dataset statistics: 4,992 posts in buybook and 9,985 posts in qiong. This dataset is from the paper "Predicting User Participation in Social Networking Sites" by Qingchao Kong, Wenji Mao, and Daniel Zeng, presented at the 2013 Intelligence and Security Informatics conference.
douban-group-dataset.zip (50 MB)
- Baidu Forums – Collected by the Artificial Intelligence Lab at the University of Arizona, this collection of forum posts from the Baidu Forums was identified using keywords related to credit card fraud. Forum posts date from January 2006 to March 2016 with 5,131 threads and 53,963 replies collected.
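The Douban Group file layout described above ('[=]'-separated fields, first line for the post, later lines for comments, UTF-8 encoding) can be read with a short parser. The following is a minimal sketch, not an official loader: the class and function names are hypothetical, all fields are kept as raw strings because the date and ID formats are not specified here, and it assumes each record occupies exactly one line (content containing line breaks would need extra handling).

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List

SEP = "[=]"        # field separator stated in the dataset description
NUM_FIELDS = 7     # both post lines and comment lines list seven fields


@dataclass
class Comment:
    comment_id: str
    group_id: str
    topic_id: str
    user_id: str
    published: str
    quoted_comment_id: str  # may be empty when nothing is quoted (assumption)
    content: str


@dataclass
class Post:
    topic_id: str
    group_id: str
    user_id: str
    title: str
    published: str
    num_comments: str
    content: str
    comments: List[Comment] = field(default_factory=list)


def parse_post_text(text: str) -> Post:
    """Parse the contents of one '<post id>-info.txt' file."""
    lines = [ln.rstrip("\r") for ln in text.split("\n")]
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        raise ValueError("empty post file")

    # Limit the split so a separator inside the content stays in the last field.
    head = lines[0].split(SEP, NUM_FIELDS - 1)
    if len(head) != NUM_FIELDS:
        raise ValueError("unexpected number of fields in post line")

    comments = []
    for ln in lines[1:]:
        parts = ln.split(SEP, NUM_FIELDS - 1)
        if len(parts) == NUM_FIELDS:  # silently skip malformed lines
            comments.append(Comment(*parts))
    return Post(*head, comments=comments)


def parse_post_file(path) -> Post:
    """Read a post file as UTF-8, as the dataset note specifies."""
    return parse_post_text(Path(path).read_text(encoding="utf-8"))
```

A possible usage, assuming the archive was extracted so that a `buybook` directory exists in the working directory: `posts = [parse_post_file(p) for p in Path("buybook").glob("*-info.txt")]`. The quoted comment ID on each comment could then serve as a reply edge for the kind of interaction analysis suggested for the other forums.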