Intelligence and Security Informatics Data Sets
Data Infrastructure Building Blocks for ISI. A Project of the University of Arizona (NSF #ACI-1443019), Drexel University,
University of Virginia, University of Texas at Dallas, and University of Utah
The AZSecure-data repository currently provides access to Web forums, Internet phishing websites, Twitter data, and other data. Most files are available to download from the "Get Data" buttons above; other files can be requested through the project manager. To request access to restricted data, send email to firstname.lastname@example.org and state which data set you would like to use and the purpose for which it will be used; also provide complete contact information including name, affiliation and mailing address, email, and telephone number.
The Dark Web forums were collected up through 2012 by the Artificial Intelligence Lab to support its Dark Web project on the study of international Jihadi social media. Dark Web forums are in English, Arabic, French, German, and Russian. GeoWeb general interest forums were also collected for the AI Lab's GeoPolitical Web project on assessing country risk. GeoWeb forums are in English, Arabic, Indonesian, Pashto, and Urdu, and are from Afghanistan, Algeria, Egypt, Indonesia, Iraq, Jordan, Lebanon, Morocco, Pakistan, Saudi Arabia, Somalia, Tunisia, and Yemen. The collection presently includes over 21M postings written by hundreds of thousands of forum members. Other forums have been provided through the generous donations of authors (see the Honor Roll page). Additional information about each forum is provided on the forum page and in the README files accompanying each data set. Forums are generally provided as individually downloadable compressed text files, which may then be opened in any CSV-compatible text processing program. Sizes are approximate. To download, click on the name of the forum and choose to open or save the file when prompted by your browser. ALL FILES PROVIDED "AS IS."
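For readers who would rather load a forum file programmatically than open it in a CSV-compatible program, the following is a minimal sketch in Python. It rests on several assumptions that the README for each data set should confirm: that the download is a zip archive, that the text inside is comma-delimited, and that it is UTF-8 encoded. The function name `read_forum_rows` and its parameters are hypothetical and are not part of the repository.

```python
import csv
import io
import zipfile


def read_forum_rows(archive_path, member_name=None, delimiter=",", encoding="utf-8"):
    """Yield each row of a delimited text file stored inside a zip archive.

    Assumptions (check the data set's README): the archive is a zip file,
    the text is delimited by `delimiter`, and it is encoded as `encoding`.
    """
    with zipfile.ZipFile(archive_path) as archive:
        # If no member is named, fall back to the first file in the archive.
        name = member_name or archive.namelist()[0]
        with archive.open(name) as raw:
            # Decode the byte stream as text; newline="" lets the csv module
            # handle line breaks that occur inside quoted fields.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            yield from csv.reader(text, delimiter=delimiter)
```

A typical call would iterate over `read_forum_rows("forum.zip")`, where `forum.zip` stands in for whichever file was downloaded. If a file uses a different delimiter or encoding, pass `delimiter` or `encoding` accordingly. Because rows are yielded one at a time, large forums need not be loaded into memory all at once.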
The Internet phishing websites were collected by project partners at the University of Virginia and are available as downloadable zip or rar files. They are organized by type of institution, including 334 concocted escrow, bank, transportation, and delivery websites; 210 spoofed financial institutions such as banks, PayPal, eBay, etc.; and 150 concocted pharmacies using black hat SEO. To download, go to the Phishing page and click on the name of the forum; open or save the file when prompted by your browser. ALL FILES PROVIDED "AS IS."
The Anti-Virus Security Software Tweets data was collected by project partners at the University of Virginia and contains over 700,000 tweets from 2008 through early 2013 collected using 14 anti-virus security software company-related seed keywords. This file is available by request. See the Twitter Data page for more information. ALL FILES PROVIDED "AS IS."
Other data contains any other data sets (not forums, phishing, or Twitter data) that are available or accessible through the portal.