There are two datasets for the TREC 2016 Dynamic Domain Track: the Ebola Dataset and the Polar Dataset.
Basic statistics for these datasets are shown in Table 1-1. These datasets contain records in the
form of web pages, scientific data, PDFs, and tweets. All the datasets are formatted using the
Common Crawl Architecture schema from the DARPA MEMEX project and stored as sequences of CBOR
objects. A detailed description of each dataset, along with links to sample code, is given in the
following subsections.
Table 1-1 Dataset Statistics

Dataset | Size of Data on Disk | Number of Records
--------|----------------------|------------------
Ebola   | 12.6 GB              | 682,157
Polar   | 158 GB               | 1,741,530
1.1 Ebola Domain
This data is related to the Ebola outbreak in Africa in 2014-2015. The dataset comprises tweets
relating to the outbreak, web pages from sites hosted in the affected countries, and PDF documents
from websites such as those of the World Health Organization, the Financial Tracking Service, and
the World Bank. These information resources are designed to provide information to citizens and aid
workers on the ground.
- Company/POC for crawl: Juliana Freire and Kien Pham, NYU (juliana dot freire at nyu dot
edu); Peter Landwehr
(peter dot landwehr at giantoak dot com); Lewis McGibbney, JPL (Lewis dot J dot Mcgibbney at
jpl dot nasa dot gov)
- Version: 1.0
- Purpose of data: represents an emerging humanitarian assistance situation.
- Four disjoint parts:
- Ebola-web-01-2015: web pages crawled during January 2015.
- Ebola-web-03-2015: web pages crawled during March 2015. These two parts are
primarily information from NGOs, relief agencies, and news organizations.
- Ebola-pdfs: PDF documents collected from West African government and other sources.
- Ebola-tweets: Tweets that originate from West African regions involved in the Ebola
outbreak.
- Time frame: January - March 2015
- Geographic area: global, primarily West Africa
- Size of data on disk: 12.6 GB
- Format: Gzipped sequence of CBOR records; sequence of tweet pointers
- Number of items: 497,362 web pages, 19,834 PDFs, 164,961 tweets.
- Schema notes: The web and PDF subsets contain only the raw crawled data. The tweets subset
as distributed contains only pointers to tweets, because Twitter does not allow
redistribution of tweets. You will need to use the Twittertools crawler (link here) to fetch
the actual tweets. The output of that crawler includes both the HTML pages from Twitter and
the extracted tweet data in the 'features' block.
1.2 Polar Domain
This dataset is a set of web pages, scientific data, zip files, PDFs, images, and science code
related to the polar sciences, available publicly from the NSF-funded Advanced Cooperative Arctic
Data and Information System (ACADIS), the NASA-funded Antarctic Master Directory (AMD), and the
National Snow and Ice Data Center (NSIDC) Arctic Data Explorer. More information about this dataset
can be found here.
- Company/POC for crawl: Chris Mattmann, JPL (chris dot a dot mattmann at jpl dot nasa dot
gov)
- Version: 1.0
- Purpose of data: represents a domain for open science and scientific data search.
- Description: Web pages, data files, zip archives, PDFs, images, and code.
- Number of items: 1,741,530 records.
- Time frame: September 2014 - May 2015.
- Geographic area: global
- Size of data on disk: 158 GB.
- Format: Gzipped sequence of CBOR records, encrypted with the KBA StreamCorpus key. Crawled
data were put into Common Crawl Format, according to the MEMEX format, using the CommonCrawlDataDumper.
The CommonCrawlDataDumper is an Apache Nutch tool that can dump Nutch segments into Common
Crawl data format, mapping each file crawled by Nutch onto a JSON-based data structure.
CommonCrawlDataDumper dumps out the files and serializes them with CBOR encoding, a data
representation format used in many contexts. Each contributed web crawl has an accompanying
JSON file that lists the total records by mimeType. A program, aggregate.py, aggregates all
of the JSON files (a sketch of that aggregation step follows this list).
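The aggregate.py program itself is not reproduced here. As a minimal sketch of what that
aggregation step might look like, assuming each per-crawl JSON file maps mimeType strings to
record counts (the file layout and key names are assumptions, not confirmed by the track
documentation):

    import glob
    import json
    from collections import Counter

    # Hypothetical sketch: sum per-crawl record counts by mimeType.
    # Assumes each per-crawl JSON file maps a mimeType string to a count.
    totals = Counter()
    for path in glob.glob('*.json'):  # location/naming of the JSON files is assumed
        with open(path) as f:
            counts = json.load(f)
        for mimetype, count in counts.items():
            totals[mimetype] += count

    # Print overall record counts per mimeType, largest first.
    for mimetype, count in totals.most_common():
        print(mimetype, count)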
The documents in each dataset are stored in a structured format called the Common Crawl Architecture
(CCA), developed as part of the DARPA MEMEX project. Records in this schema follow this format:
{ 'key': 'ebola-03cad6ee34e9dc0aeb77e4c5d31aad2aa41f6ad819f23b8504612d6e6de8a18c',
'request': { 'body': None,
'client': { '...': '...' },
'headers': [ [ 'Accept-Language', 'en-US,en' ], [ '...', '...' ] ],
'method': 'GET'
},
'response': { 'body': '<!DOCTYPE html> <html lang=\'en\' class=\'js-disabled\'> <head> ... </html>',
'status': '200',
'headers': [ ['Content-Type', 'text\/html' ], [ '...', '...'] ],
},
'timestamp': 1421064000L,
'url': 'http://www.nature.com/news/ebola-1.15750',
'indices': [
{ 'key': 'crawl', 'value': 'ebola' },
{ 'key': '...', 'value': '...'},
],
'features': [
{ '...': '...' }, { '...': '...' }
],
}
More or fewer fields are present in the different datasets depending on how they were created. Every
record in every dataset has a 'key' field (the "document number"), a 'response.body' with raw content,
the timestamp, and the source URL. Records should also have an 'indices.key=crawl' field indicating
which dataset the document goes with. The 'features' block is meant to contain extracted or derived
data. In the Ebola tweets subset, it holds a structured representation of the tweet, automatically
extracted from the raw Twitter HTML contained in the 'response.body'. In the Illicit-goods dataset,
the features block contains extracted posts and thread/post metadata from the raw HTML thread.
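As an illustration (not part of the official track tooling), here is a minimal sketch of pulling
those common fields out of a decoded record; the function name summarize is hypothetical:

    # Minimal sketch: extract the fields every CCA record is expected to carry
    # from a decoded record (a Python dict shaped like the example above).
    def summarize(record):
        # Find the 'crawl' entry in the 'indices' list, if present.
        crawl = next((idx['value'] for idx in record.get('indices', [])
                      if idx.get('key') == 'crawl'), None)
        return {
            'docno': record['key'],                  # the "document number"
            'url': record['url'],
            'timestamp': record['timestamp'],
            'crawl': crawl,                          # which dataset the record goes with
            'body': record['response']['body'],      # raw crawled content
            'features': record.get('features', []),  # extracted or derived data
        }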
The documents are stored in files that each contain a stream of CBOR records that follow the CCA
format above. CBOR is a variation of JSON that supports binary data and has a more efficient
encoding than text. Here is an example using Python and the cbor library. More example code in
Python and Java can be found at the TREC DD Github site.
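As a minimal sketch along those lines (not the official sample code; the filename below is
hypothetical), reading a gzipped stream of CBOR records might look like this:

    import gzip
    import cbor  # third-party 'cbor' package (pip install cbor)

    def read_cca_records(path):
        # Yield CCA records from a gzipped file containing a
        # back-to-back sequence of CBOR objects.
        with gzip.open(path, 'rb') as f:
            while f.peek(1):        # stop cleanly at end of stream
                yield cbor.load(f)  # decode the next CBOR object

    # Hypothetical filename; actual distribution filenames may differ.
    for record in read_cca_records('ebola-web-01-2015.cbor.gz'):
        print(record['key'], record['url'])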
To get access to the collections, you must be a TREC 2016 participant. Complete the TREC DD
Organizational User Agreement (password protected; use the TREC participant password) and email it
as a scanned PDF or a high-quality digital photo to Angela (dot) Ellis (at) NIST (dot) gov, including
your TREC participant ID. Within a week you should receive access credentials to download the data, as
well as the decryption key to read the encrypted documents.
Local users of the data must complete the Individual User Agreement and return it to the
organizational point of contact, who will maintain those records.