There are two datasets for the TREC 2016 Dynamic Domain Track: the Ebola Dataset and the Polar Dataset.
Basic statistics for these datasets are shown in Table 1-1. These datasets contain records in the
form of web pages, scientific data, PDFs, and tweets. All the datasets are formatted using the
Common Crawl Architecture schema from the DARPA MEMEX project and stored as sequences of CBOR
objects. A detailed description of each dataset, along with links to sample code, is given in the
following subsections.
Table 1-1 Dataset Statistics

Dataset | Size of Data on Disk | Number of Records
--------|----------------------|------------------
Ebola   | 12.6 GB              | 682,157
Polar   | 158 GB               | 1,741,530
1.1 Ebola Domain
This data is related to the Ebola outbreak in Africa in 2014-2015. The dataset comprises tweets
relating to the outbreak, web pages from sites hosted in the affected countries, and PDF documents
from websites such as those of the World Health Organization, the Financial Tracking Service, and
the World Bank. These information resources are designed to provide information to citizens and aid
workers on the ground.
- Company/POC for crawl: Juliana Freire and Kien Pham, NYU (juliana dot freire at nyu dot
edu); Peter Landwehr
(peter dot landwehr at giantoak dot com); Lewis McGibbney, JPL (Lewis dot J dot Mcgibbney at
jpl dot nasa dot gov)
- Version: 1.0
- Purpose of data: represents an emerging humanitarian assistance situation.
- Four disjoint parts:
- Ebola-web-01-2015: web pages crawled during January 2015.
- Ebola-web-03-2015: web pages crawled during March 2015. These two parts are
primarily information from NGOs, relief agencies, and news organizations.
- Ebola-pdfs: PDF documents collected from West African government and other sources.
- Ebola-tweets: Tweets that originate from West African regions involved in the Ebola
outbreak.
- Time frame: January - March 2015
- Geographic area: global, primarily West Africa
- Size of data on disk: 12.6 GB
- Format: Gzipped sequence of CBOR records; sequence of tweet pointers
- Number of items: 497,362 web pages, 19,834 PDFs, 164,961 tweets.
- Schema notes: The web and PDF subsets contain only the raw crawled data. The tweets subset
as distributed contains only pointers to tweets, because Twitter does not allow
redistribution of tweets. You will need to use the Twittertools crawler (link here) to fetch
the actual tweets. The output of that crawler includes both the HTML pages from Twitter and
the extracted tweet data in the 'features' block.
1.2 Polar Domain
This dataset is a set of web pages, scientific data, zip files, PDFs, images, and science code
related to the polar sciences, available publicly from the NSF-funded Advanced Cooperative Arctic
Data and Information System (ACADIS), the NASA-funded Antarctic Master Directory (AMD), and the
National Snow and Ice Data Center (NSIDC) Arctic Data Explorer. More information about this dataset
can be found here.
- Company/POC for crawl: Chris Mattmann, JPL (chris dot a dot mattmann at jpl dot nasa dot
gov)
- Version: 1.0
- Purpose of data: represents a domain for open science and scientific data search.
- Description: Web pages, data files, zip archives, PDFs, images, and code.
- Number of items: 1,741,530 records.
- Time frame: September 2014 - May 2015.
- Geographic area: global
- Size of data on disk: 158 GB.
- Format: Gzipped sequence of CBOR records, encrypted with the KBA StreamCorpus key. Crawled
data were put into Common Crawl Format, according to the MEMEX format, using the CommonCrawlDataDumper.
The CommonCrawlDataDumper is an Apache Nutch tool that can dump Nutch segments into Common
Crawl data format, mapping each file crawled by Nutch onto a JSON-based data structure.
CommonCrawlDataDumper dumps out the files and serializes them with CBOR encoding, a data
representation format used in many contexts. Each contributed web crawl has an accompanying
JSON file that lists the total records by mimeType. A program, aggregate.py, aggregates all
of the JSON files (a sketch of that aggregation step follows this list).
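The aggregate.py program itself is not reproduced here. As a minimal sketch of what that
aggregation step might look like, assuming each per-crawl JSON file maps mimeType strings to
record counts (the file layout and key names are assumptions, not confirmed by the track
documentation):

    import glob
    import json
    from collections import Counter

    # Hypothetical sketch: sum per-crawl record counts by mimeType.
    # Assumes each per-crawl JSON file maps a mimeType string to a count.
    totals = Counter()
    for path in glob.glob('*.json'):  # location/naming of the JSON files is assumed
        with open(path) as f:
            counts = json.load(f)
        for mimetype, count in counts.items():
            totals[mimetype] += count

    # Print overall record counts per mimeType, largest first.
    for mimetype, count in totals.most_common():
        print(mimetype, count)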
The documents in each dataset are stored in a structured format called the Common Crawl Architecture
(CCA), developed as part of the DARPA MEMEX project. Records in this schema follow this format:
{ 'key': 'ebola-03cad6ee34e9dc0aeb77e4c5d31aad2aa41f6ad819f23b8504612d6e6de8a18c',
'request': { 'body': None,
'client': { '...': '...' },
'headers': [ [ 'Accept-Language', 'en-US,en' ], [ '...', '...' ] ],
'method': 'GET'
},
'response': { 'body': '<!DOCTYPE html> <html lang=\'en\' class=\'js-disabled\'> <head> ... </html>',
'status': '200',
'headers': [ ['Content-Type', 'text\/html' ], [ '...', '...'] ],
},
'timestamp': 1421064000L,
'url': 'http://www.nature.com/news/ebola-1.15750',
'indices': [
{ 'key': 'crawl', 'value': 'ebola' },
{ 'key': '...', 'value': '...'},
],
'features': [
{ '...': '...' }, { '...': '...' }
],
}
More or fewer fields are present in the different datasets depending on how they were created. Every
record in every dataset has a 'key' field (the "document number"), a 'response.body' with raw content,
the timestamp, and the source URL. Records should also have an 'indices.key=crawl' field indicating
which dataset the document goes with. The 'features' block is meant to contain extracted or derived
data. In the Ebola tweets subset, it holds a structured representation of the tweet, automatically
extracted from the raw Twitter HTML contained in the 'response.body'. In the Illicit-goods dataset,
the features block contains extracted posts and thread/post metadata from the raw HTML thread.
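As an illustration (not part of the official track tooling), here is a minimal sketch of pulling
those common fields out of a decoded record; the function name summarize is hypothetical:

    # Minimal sketch: extract the fields every CCA record is expected to carry
    # from a decoded record (a Python dict shaped like the example above).
    def summarize(record):
        # Find the 'crawl' entry in the 'indices' list, if present.
        crawl = next((idx['value'] for idx in record.get('indices', [])
                      if idx.get('key') == 'crawl'), None)
        return {
            'docno': record['key'],                  # the "document number"
            'url': record['url'],
            'timestamp': record['timestamp'],
            'crawl': crawl,                          # which dataset the record goes with
            'body': record['response']['body'],      # raw crawled content
            'features': record.get('features', []),  # extracted or derived data
        }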
The documents are stored in files that each contain a stream of CBOR records that follow the CCA
format above. CBOR is a variation of JSON that supports binary data and has a more efficient
encoding than text. Here is an example using Python and the cbor library. More example code in
Python and Java can be found at the TREC DD Github site.
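As a minimal sketch along those lines (not the official sample code; the filename below is
hypothetical), reading a gzipped stream of CBOR records might look like this:

    import gzip
    import cbor  # third-party 'cbor' package (pip install cbor)

    def read_cca_records(path):
        # Yield CCA records from a gzipped file containing a
        # back-to-back sequence of CBOR objects.
        with gzip.open(path, 'rb') as f:
            while f.peek(1):        # stop cleanly at end of stream
                yield cbor.load(f)  # decode the next CBOR object

    # Hypothetical filename; actual distribution filenames may differ.
    for record in read_cca_records('ebola-web-01-2015.cbor.gz'):
        print(record['key'], record['url'])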
To get access to the collections, you must be a TREC 2016 participant. Complete the TREC DD
Organizational User Agreement (password protected; use the TREC participant password) and email it
as a scanned PDF or a high-quality digital photo to Angela (dot) Ellis (at) NIST (dot) gov, including
your TREC participant ID. Within a week you should receive access credentials to download the data, as
well as the decryption key to read the encrypted documents.
Local users of the data must complete the Individual User Agreement and return it to the
organizational point of contact, who will maintain those records.