Data Infrastructure Building Blocks for ISI. A Project of the University of Arizona (NSF #ACI-1443019), Drexel University,
University of Virginia, University of Texas at Dallas, and University of Utah
Forums, By Language
- Douban Group - A Chinese-language forum. The dataset is a collection of posts from two Douban groups, buybook (http://www.douban.com/group/buybook/) and qiong (http://www.douban.com/group/qiong), stored in two separate directories named after the groups. Within each group directory, each txt file corresponds to one post, with the filename (for example, '1012614-info.txt') serving as the post id. In each post info file, the first line describes the original post and the following lines describe the comments. Fields are separated by '[=]'. The fields of the first line are: topic id, group id, user id, post title, post publishing date, number of comments, post content. The fields of each following line are: comment id, group id, topic id, comment user id, comment publishing date, id of the comment being quoted (replied to), comment content. NOTE: All txt files are encoded in UTF-8. Dataset statistics: 4,992 posts in buybook and 9,985 posts in qiong. This data set is from the paper "Predicting User Participation in Social Networking Sites," by Qingchao Kong, Wenji Mao, and Daniel Zeng, presented at the 2013 Intelligence and Security Informatics conference.
douban-group-dataset.zip (50 MB)
- Baidu Forums – Collected by the Artificial Intelligence Lab at the University of Arizona, this collection of forum posts from the Baidu Forums was identified using keywords related to credit card fraud. Forum posts date from January 2006 to March 2016; 5,131 threads and 53,963 replies were collected.
- Hackhound Forum – Collected by the Artificial Intelligence Lab at the University of Arizona, the Hackhound Forum dataset contains 4,242 forum posts on hacking topics. Posts date from October 2012 to September 2015.
- Webkill Forum – Collected by the Artificial Intelligence Lab at the University of Arizona, the Webkill Forum dataset contains 133,858 forum posts on hacking and carding topics. Posts date from September 2007 to September 2015.
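The Douban Group post info files described above (first line for the original post, following lines for comments, fields separated by '[=]') can be read with a few lines of Python. The following is a minimal sketch, not an official loader for the dataset. The names `Post`, `Comment`, `parse_post_text`, and `parse_post_file` are hypothetical. It assumes that each record occupies exactly one line and that any extra '[=]' sequences belong to the final content field. Check the accompanying README before relying on either assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

SEP = "[=]"        # field separator stated in the dataset description
NUM_FIELDS = 7     # both post lines and comment lines list seven fields


@dataclass
class Post:
    topic_id: str
    group_id: str
    user_id: str
    title: str
    published: str
    num_comments: str
    content: str


@dataclass
class Comment:
    comment_id: str
    group_id: str
    topic_id: str
    user_id: str
    published: str
    quoted_comment_id: str  # may be empty when nothing is quoted (assumption)
    content: str


def _split_fields(line: str) -> List[str]:
    # Limit the number of splits so a '[=]' inside the content stays intact
    # (assumption: only the last field can contain the separator).
    parts = line.split(SEP, NUM_FIELDS - 1)
    if len(parts) != NUM_FIELDS:
        raise ValueError(f"expected {NUM_FIELDS} fields, got {len(parts)}")
    return parts


def parse_post_text(text: str) -> Tuple[Post, List[Comment]]:
    # Assumption: one record per line; blank lines are skipped.
    lines = [ln.rstrip("\r") for ln in text.split("\n")]
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        raise ValueError("empty post file")
    post = Post(*_split_fields(lines[0]))
    comments = [Comment(*_split_fields(ln)) for ln in lines[1:]]
    return post, comments


def parse_post_file(path: str) -> Tuple[Post, List[Comment]]:
    # The description states that all txt files are UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        return parse_post_text(f.read())
```

For example, `parse_post_file("buybook/1012614-info.txt")` would return the post and its list of comments, assuming the archive has been extracted into the current directory with that layout. All fields are kept as strings because the description does not specify the date or number formats.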
The README file which accompanies each data set describes its origin, contents, and size. Collections may be small or large. Generally, postings from the forums are organized into threads indicating the topic under discussion. Postings may include metadata such as date, member name, etc. Sizes, if provided, are approximate. To download, click on the name of the forum and choose to open or save the file when prompted by your browser.
Intelligence and Security Informatics Data Sets