2022-253 - ClueWeb22 dataset

Description:

Abstract

ClueWeb22 is the newest in the Lemur Project's ClueWeb line of datasets that support research
on information retrieval, natural language processing and related human language technologies.
This new dataset was developed by the Lemur Project with significant assistance and support
from Microsoft Corporation. ClueWeb22 is available for research purposes only.

ClueWeb22 has several novel characteristics compared with earlier ClueWeb datasets.

  • It is much larger (10 billion documents)
  • Documents are of higher quality (selected from a commercial search engine index)
  • Documents are available in several formats (HTML, clean text, screenshots)
  • Document outlink and inlink data is provided in a convenient format
  • Document page analyses are provided that reveal where on a page text was displayed, and what was near it

The complete dataset fills 14 × 18 TB disks, which makes it expensive to distribute and store.
It is therefore distributed in different subsets and formats to support the most common uses.
Some subsets may be downloaded for free. Others are distributed on disk and require payment.
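
To give a rough sense of scale, the figures above can be combined in a short calculation. This is only a back-of-the-envelope sketch: it assumes decimal terabytes (10^12 bytes) and that the disks are completely full, neither of which is stated in the listing, and the resulting per-document average covers all formats and auxiliary data together. The constant and function names are illustrative.

```python
# Back-of-the-envelope sizing from the figures quoted in the abstract.
# Assumptions (not from the listing): 1 TB = 10**12 bytes, disks completely full.

DISK_COUNT = 14                   # number of disks for the complete dataset
DISK_SIZE_TB = 18                 # capacity of each disk, in TB
DOCUMENT_COUNT = 10_000_000_000   # "10 billion documents"
BYTES_PER_TB = 10**12             # decimal terabyte (assumption)


def total_capacity_tb(disk_count: int, disk_size_tb: int) -> int:
    """Upper bound on the dataset size, in TB."""
    return disk_count * disk_size_tb


def average_kb_per_document(total_tb: int, document_count: int) -> float:
    """Average storage per document, in KB (1 KB = 1000 bytes), across all formats."""
    return total_tb * BYTES_PER_TB / document_count / 1000


total_tb = total_capacity_tb(DISK_COUNT, DISK_SIZE_TB)
average_kb = average_kb_per_document(total_tb, DOCUMENT_COUNT)

print(f"Total capacity: {total_tb} TB")
print(f"Average per document: {average_kb:.1f} KB")
```

Under these assumptions the complete dataset occupies at most about 252 TB, or roughly 25 KB per document, which helps explain why smaller subsets are offered.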

 

Other Information

Acquiring the Dataset

Acquiring a ClueWeb22 dataset is a three-step process.

1. Complete an organizational agreement: This agreement entitles your organization to use the
dataset. There is no cost for an organizational agreement. Select one of the organizational
agreements below to start the licensing process.
2. Obtain data: After the license is complete, you will receive an email that describes how to
select the subset(s) that you need and, if necessary, pay a distribution fee.
3. Complete individual agreements: Each person who will use or have access to the dataset must
sign an Individual Agreement. Your organization must retain each person's completed Individual
Agreement for as long as that person has access to the dataset.


Organizational Agreements
The Organization Agreement must be signed by a person with the authority to sign agreements on behalf of your organization.

The Organization Agreement typically applies to a single research group or unit within a larger
legal entity. For example, in a university, the Organization Agreement might apply to a research
group consisting of a few professors and the students and staff doing research with them. In this
case, the organization would be the research group (e.g., the Information Retrieval Laboratory),
and the Corporation/Legal Entity would be the university.
 

  • Commercial organizations: Organizational agreement, Individual agreement
  • Non-profit organizations: Organizational agreement, Individual agreement
  • U.S. federal & state government organizations: Organizational agreement, Individual agreement

 

Category(s):
Software
For Information, Contact:
Raymond Taylor
Manager, Business Development & Licensing
CMU
rtaylor@andrew.cmu.edu
Inventors:
James (Jamie) Callan
Cameron VandenBerg