CALO Enron Email Dataset
The Carnegie Mellon University (CMU) CALO Project dataset is perhaps the most widely used Enron email dataset and is available for download at http://www.cs.cmu.edu/~enron/. It is a derivative of the FERC dataset, has been referenced in many email research studies, and is also used by many commercial E-Discovery organizations. The CMU page describes this dataset as follows:
- This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes).
- It contains data from 150 custodians, mostly senior management of Enron, organized into folders.
- The corpus contains a total of about 0.5M messages.
- This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.
- The email dataset was later purchased by Leslie Kaelbling at MIT, and turned out to have a number of integrity problems. A number of folks at SRI, notably Melinda Gervasio, worked hard to correct these problems, and it is thanks to them (not me) that the dataset is available.
- The dataset here does not include attachments, and some messages have been deleted “as part of a redaction effort due to requests from affected employees”.
- Invalid email addresses were converted to something of the form firstname.lastname@example.org whenever possible (i.e., recipient is specified in some parse-able format like “Doe, John” or “Mary K. Smith”) and to email@example.com when no recipient was specified.
CALO correctly identified 8 duplicate, misspelled custodians in the FERC dataset, resulting in 150 CALO custodians vs. 158 FERC custodians.
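How CALO found the duplicate custodians is not described here. As a rough illustration only, the sketch below flags pairs of custodian folder names that differ by a small misspelling, using the standard library's `difflib`. The function name `likely_duplicates`, the 0.8 threshold, and the sample names are hypothetical and are not taken from either dataset.

```python
from difflib import SequenceMatcher
from itertools import combinations


def likely_duplicates(names, threshold=0.8):
    """Return (name_a, name_b, score) for pairs of names that look alike.

    Illustrative heuristic only; this is NOT the method CALO used.
    The threshold is an assumed value and would need tuning on real data.
    """
    pairs = []
    # Compare every unordered pair of distinct names exactly once.
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs


# Hypothetical custodian folder names, one of them misspelled.
candidates = ["smith-j", "smtih-j", "doe-a"]
for a, b, score in likely_duplicates(candidates):
    print(f"possible duplicate: {a} / {b} (similarity {score})")
```

A heuristic like this only produces candidates. Deciding that two folders really belong to the same person still requires looking at the messages themselves.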
In addition to the above, the CALO dataset includes a number of other modifications:
- Message-ID: New Message-IDs have been created and used in place of the existing Message-IDs.
- Date: Dates have been canonicalized, replacing the raw dates.
- Headers: Some other headers are missing from the emails.
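To see these header differences in practice, a message can be parsed with Python's standard `email` package and its headers listed. The following is a minimal sketch that assumes each message is available as plain RFC 822 style text. The helper name `summarize_headers` and the sample message are made up for illustration and are not taken from the dataset.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime


def summarize_headers(raw_message: str) -> dict:
    """Parse one message and report its Message-ID, Date, and header names."""
    msg = message_from_string(raw_message)
    date_header = msg.get("Date")
    parsed_date = None
    if date_header:
        try:
            parsed_date = parsedate_to_datetime(date_header)
        except (TypeError, ValueError):
            # An unparseable date is reported as None rather than raising.
            parsed_date = None
    return {
        "message_id": msg.get("Message-ID"),
        "date": parsed_date,
        "header_names": sorted(set(msg.keys())),
    }


# Made-up sample message in the general shape of an email file.
SAMPLE = """Message-ID: <12345.example@localhost>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: sender@example.com
To: recipient@example.com
Subject: Example

Body text.
"""

info = summarize_headers(SAMPLE)
print(info["message_id"])
print(info["date"])
print(info["header_names"])
```

Running the same function over the corresponding FERC and CALO copies of a message and comparing `header_names` is one way to check which headers differ between the two versions.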
Removing the attachments makes the dataset much more manageable in size. However, attachments are still useful for certain investigations, and Mark Dredze has created a version of the CALO dataset with attachment information brought over from the FERC dataset.
K. Krasnow Waterman discusses how these changes affect the email in “Knowledge Discovery in Corporate Email: The Compliance Bot Meets Enron” (2006).