海角大神

Enron's gift to students of language

The Texas energy giant鈥檚 record for largest corporate bankruptcy has long since been overtaken, but linguists will be feasting on the Enron e-mail dataset for years.

The E is taken off one of the final remaining Enron Field signs outside the formerly named ballpark in Houston in 2002.

Brett Coomer/AP

August 14, 2014

The dog days of August call up, for the historically minded, anniversaries of some notable disasters: The 鈥済uns of August,鈥 for instance, boomed a century ago this year, as Europe lumbered into World War I.

I鈥檓 thinking, though, of a more recent disaster, with no special peg to August, other than the memory of California鈥檚 鈥渞olling blackouts鈥 in summer 2001.

The collapse of the Texas energy firm Enron was, at the time, the biggest corporate bankruptcy in American history. It led to the restructuring of the accounting industry and to major legal changes for American public companies. 鈥淓nron鈥 even became a musical, winning critical raves in , if not so much in .

Lesotho makes Trump鈥檚 polo shirts. He could destroy their garment industry.

But Enron also left a legacy for those who study language. The 鈥淓nron e-mail dataset鈥 is a gold mine for researchers in computational linguistics.

As Jessica Leber explained in , the Federal Energy Regulatory Commission, on March 26, 2003, dumped online more than 1.6 million e-mails to and from Enron鈥檚 executives from 2000 through 2002.聽

Sharing all this e-mail, unearthed during FERC鈥檚 investigation of Enron, was meant to serve the public interest. In its original form, though, it was way too much information. FERC removed much of the most sensitive stuff.聽

Then scholars acquired the raw data and massaged it into a more usable form. As Ms. Leber notes, 鈥渢he 鈥楨nron e-mail corpus,鈥 as the cleaned-up version is now known, remains the largest public domain database of real e-mails in the world 鈥 by far.鈥

is a scientific term for this kind of 鈥渂ody鈥 of words gathered for study. Originally 517,431 e-mails, by 2004 the Enron corpus had been trimmed down to 200,000.聽

What the sentence in Breonna Taylor鈥檚 death says about police reform under Trump

Computational linguistics is the computerized study of language. A huge preassembled 鈥渃orpus鈥 to work on lets scholars focus on analysis, not data collection.

As Leber wrote, because the Enron corpus 鈥渋s a rich example of how real people in a real organization use e-mail ... it has become the foundation of hundreds of research studies in fields as diverse as machine learning and workplace gender studies.鈥

The researcher who first mentioned the Enron dataset to me had been studying gender and power. A team sought clues in the as to who was lying, and when. 鈥淎pparently liars wanted to dissociate themselves from their words ... and made an attempt to create a story that seemed less complex ... and more concrete.鈥 has studied the flow of gossip up and down within an organization.聽

The Enron corpus is a product of a particular time: after e-mail had come into wide use but before the privacy and security implications of a move like FERC鈥檚 original data dump were understood. Enron鈥檚 bankruptcy record has long since been broken. But the Enron dataset is likely to remain unique.