Text as Data: From Federalist Papers to Yelp


If you want to find an example of big data in your own life, look no further than the nearest bookshelf. A book contains millions of words of text, making up a dataset that can be sliced up and statistically analyzed just like information from biology, physics, or astronomy. For decades, scientists have used methods such as natural language processing and topic modeling on text for purposes such as translation, author attribution, and speech recognition. In his talk at the UChicago Research Computing Center, Matt Taddy explained how these methods are now allowing social scientists to ask new questions.

Taddy, an associate professor of econometrics and statistics at the Booth School of Business, develops methods to extract new meaning from text in economics, politics, social media, and many other topics. Many of these text analyses are based simply on counting words, or what Taddy called “The Bag of Words” -- parsing raw text into individual words, phrases, or even emoticons and hashtags, then treating them as an independently distributed sample where statistics can be applied.

“This is really the state of the art, I’m not selling you some kind of ancient technology here,” Taddy said. “It’s dumb, but it works. The extra rules aren’t worth their complexity.”
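
The bag-of-words idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the talk: the token pattern and the sample review are invented for the example, and the pattern only recognizes plain words, hashtags, and a handful of simple emoticons.

```python
import re
from collections import Counter

# Hypothetical token pattern (an assumption for this sketch):
# simple emoticons such as :-) or ;(, then hashtags, then plain words.
TOKEN_PATTERN = re.compile(r"[:;]-?[()]|#\w+|[a-z']+")

def bag_of_words(text):
    """Return token counts for one document, ignoring word order."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))

# Invented sample review, not taken from the Yelp dataset.
review = "Great tacos, great service :-) #tacotuesday"
counts = bag_of_words(review)
print(counts.most_common(3))
```

Word order is thrown away on purpose: once each document is reduced to a table of counts, ordinary statistical models can be applied to it.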

Early applications of this approach can be traced back to the University of Chicago in the 1960s, when Frederick Mosteller and David Wallace used text analysis to determine whether James Madison or Alexander Hamilton wrote unattributed segments of the Federalist Papers. The method was relatively simple: count how many times each author used various words, count how many times those same words appeared in unattributed papers, and run mathematical models to predict which Founding Father was the more likely author of each document.
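
The count-and-compare logic can be illustrated with a toy example. The sketch below is a simplified stand-in, not Mosteller and Wallace's actual statistical model: it assumes a basic multinomial model with add-one smoothing, and the texts, the marker words, and the author labels "A" and "B" are all invented for illustration.

```python
import math
from collections import Counter

def word_rates(text, vocabulary):
    """Smoothed rate of each marker word in an author's known writing."""
    counts = Counter(w for w in text.lower().split() if w in vocabulary)
    total = sum(counts.values()) + len(vocabulary)  # add-one smoothing
    return {w: (counts[w] + 1) / total for w in vocabulary}

def log_likelihood(text, rates):
    """Log-probability of the marker words in text under one author's rates."""
    return sum(math.log(rates[w]) for w in text.lower().split() if w in rates)

# Invented toy texts, NOT real Federalist excerpts; marker words are illustrative.
markers = {"upon", "whilst", "while"}
author_a = "upon reflection upon the matter while upon the whole"
author_b = "whilst the states whilst the people while whilst"
disputed = "whilst the union stands whilst the people consent"

rates_a = word_rates(author_a, markers)
rates_b = word_rates(author_b, markers)
score_a = log_likelihood(disputed, rates_a)
score_b = log_likelihood(disputed, rates_b)

# The author whose word rates make the disputed text more probable wins.
likely_author = "A" if score_a > score_b else "B"
print(likely_author)
```

Small, common function words make useful markers in this kind of comparison because authors tend to use them habitually, regardless of the subject they are writing about.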

These approaches were further refined by academics and researchers at IBM and Bell Labs, finding their way into technologies we routinely use today, Taddy said. Spam filters, search engines, and Siri all use software based on these methods, now known as natural language processing.

Social scientists were slower to bring these text analysis approaches into their own research. But recent political science studies (such as the work of Justin Grimmer, who visited the CI in late 2013) have applied these methods to the large bodies of text available from sources such as news releases and the Congressional Record. Much of this work has been fueled by a newer method called topic modeling, which automatically groups related words into topics -- such as “budget,” “education,” or “military” for political press releases -- for easier analysis.

The utility of these approaches goes beyond politics, as Taddy demonstrated with his work using the popular business-reviewing site Yelp. Taddy worked with a publicly available dataset comprising 220,000 reviews of 12,000 restaurants by 44,000 users -- a small scoop of the millions of reviews on the website. After processing the text to strip off suffixes and weed out uncommon or excessively common words, Taddy ended up with 14,000 words to analyze across some 420 attributes, such as number of stars or votes the review received, how many reviews the commenter has written, business location or type, and many more. To analyze this volume of data, Taddy used parallel computing, distributed file systems, and big data tools such as Hadoop and MapReduce.
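
The text-cleaning step described above, stripping suffixes and discarding words that are too rare or too common, might look roughly like this. The suffix list, the thresholds, and the sample reviews are assumptions made for this sketch; the article does not say which stemmer or cutoffs were actually used.

```python
from collections import Counter

def crude_stem(word):
    """Strip a few common English suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_vocabulary(documents, min_docs, max_fraction):
    """Keep stems that are neither too rare nor too common across documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each stem at most once per document.
        doc_freq.update({crude_stem(w) for w in doc.lower().split()})
    limit = max_fraction * len(documents)
    return {w for w, n in doc_freq.items() if min_docs <= n <= limit}

# Invented sample reviews, not taken from the Yelp dataset.
reviews = [
    "the tacos tasted amazing",
    "the burger tasted bland",
    "the tacos were overpriced",
    "the soup tasted salty",
]
vocab = build_vocabulary(reviews, min_docs=2, max_fraction=0.75)
print(sorted(vocab))
```

Here "the" is dropped because it appears in every review, and one-off words are dropped because a single occurrence says little about any pattern.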

After the analysis, Taddy could explore the relationship between review attributes and words. In one simple example, the analysis found the expected relationship between the number of stars on a review and the presence of the emoticons :-) or :-( -- more stars raised the odds of the happy face, fewer stars predicted the presence of its sad counterpart. Extra stars also increased the rate of words such as “awesome,” “yummy,” or “delicious,” while lower ratings led to more appearances from “bland,” “overpriced,” and “rude.”
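
A very rough way to see this kind of relationship is to tabulate how often a token appears at each star level. The sketch below is only a descriptive tabulation on invented reviews; it is not the statistical model used in the talk, which the article does not specify.

```python
from collections import Counter

def token_rate_by_stars(reviews, token):
    """Share of all tokens at each star level that equal the given token."""
    hits = Counter()
    totals = Counter()
    for stars, text in reviews:
        tokens = text.lower().split()
        hits[stars] += tokens.count(token)
        totals[stars] += len(tokens)
    # Skip star levels with no tokens to avoid dividing by zero.
    return {s: hits[s] / totals[s] for s in sorted(totals) if totals[s]}

# Invented toy reviews as (stars, text) pairs, not real Yelp data.
toy_reviews = [
    (5, "delicious tacos :-)"),
    (5, "awesome and delicious"),
    (1, "bland and overpriced :-("),
    (1, "rude staff bland food"),
]
rates = token_rate_by_stars(toy_reviews, "delicious")
print(rates)
```

The same function works for emoticons, since whitespace splitting keeps a token such as ":-)" intact.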

While most of these results might be common sense, the analysis also shook some unexpected relationships out of the data. The word “buffets” was somehow associated with both happy and sad faces, prompting Taddy to hypothesize that maybe older reviewers were more likely to use emoticons. Reviews that were tagged more often as “funny” by readers were more likely to contain swear words and the word “hipster.” And independent of the words within, commenters who had written more reviews were more likely to receive more favorable ratings from readers.

Independently, these findings might seem like mere curiosities. But taken together, the results could be used by Yelp to automatically promote reviews that are likely to be popular, instead of waiting for readers to actually rate the review and vote it up or down, Taddy said. Such an algorithm would make Yelp more useful for visitors, and perhaps make the site more successful.

The next frontier in text analysis is to find ways of extracting meaning from these relationships that go beyond speculation about an elderly preference for buffets and emoticons. Taddy has worked with fellow Booth faculty Matthew Gentzkow and Jesse Shapiro to analyze partisan speech in Congress, but deeper meaning such as “partisanship” remains hard to model objectively and statistically. In order for text analysis to realize its full potential in social science research, new methods will need to be developed, he said.

“Right now, it’s useful for descriptive ideas, but if we really want to push these things into bigger questions that people are asking in science...then we have to start thinking about more math, how you project through these models, how you use these models to get slices or summaries of information,” Taddy said.

Written By: