Thursday, 17 October 2013

Enhancing Linguistic Search with the Google Books Ngram Viewer



Our book-scanning effort, now in its eighth year, has put tens of millions of books online. Beyond the obvious benefits of being able to discover books and search through them, the project lets us take a step back and learn what the entire collection tells us about culture and language.

Launched in 2010 by Jon Orwant and Will Brockman, the Google Books Ngram Viewer lets you search for words and phrases over the centuries, in English, Chinese, Russian, French, German, Italian, Hebrew, and Spanish. It’s become popular for both casual explorations into language usage and serious linguistic research, and this summer we decided to provide some new ways to search with it.

With our interns Jason Mann, Lu Yang, and David Zhang, we’ve added three new features. The first is wildcards: by putting an asterisk as a placeholder in your query, you can retrieve the ten most popular replacements. For instance, what noun most often follows “Queen” in English fiction? The answer is “Elizabeth”:


This graph also reveals that the frequency of mentions of the most popular queens has been decreasing steadily over time. (Language expert Ben Zimmer shows some other interesting examples in his Atlantic article.) Right-clicking collapses all of the series into a sum, allowing you to see the overall change.
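
To make the wildcard idea concrete, here is a minimal Python sketch of how a placeholder query could be answered from a table of phrase counts. It is an illustration only, not how the Ngram Viewer is implemented: the `BIGRAM_COUNTS` table, its numbers, and the function names `top_replacements` and `collapse` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy counts of two-word phrases (made-up numbers, not real Ngram data).
BIGRAM_COUNTS = {
    ("Queen", "Elizabeth"): 120,
    ("Queen", "Victoria"): 95,
    ("Queen", "Anne"): 40,
    ("King", "George"): 80,
}

def top_replacements(first_word, counts, k=10):
    """Return the k most frequent words that fill the '*' slot after first_word."""
    fillers = Counter()
    for (w1, w2), n in counts.items():
        if w1 == first_word:
            fillers[w2] += n
    return fillers.most_common(k)

def collapse(series):
    """Sum every replacement's count into one total, like collapsing the series."""
    return sum(n for _, n in series)

top = top_replacements("Queen", BIGRAM_COUNTS)
total = collapse(top)
```

With this toy table, `top` lists the fillers for “Queen *” in descending order of count, and `total` plays the role of the single summed series you get after collapsing.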

Another feature we’ve added is the ability to search for inflections: different grammatical forms of the same word. (Inflections of the verb “eat” include “ate”, “eating”, “eats”, and “eaten”.) Here, we can see that the phrase “changing roles” has recently surged in popularity in English fiction, besting “change roles”, which earlier dethroned “changed roles”:


Curiously, this switching doesn’t happen when we add non-fiction into the mix: “changing roles” is persistently on top, with an odd dip in the late 1980s. As with wildcards, right-clicking collapses and expands the data:
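
An inflection search can be thought of as expanding one query into several, one per grammatical form. The Python sketch below shows that expansion under stated assumptions: the `INFLECTIONS` table and the `expand_inflections` helper are hypothetical and hand-written for this example, whereas a real system would need a proper morphological dictionary.

```python
# Hypothetical hand-written inflection table, for illustration only.
INFLECTIONS = {
    "change": ["change", "changes", "changed", "changing"],
    "eat": ["eat", "eats", "ate", "eating", "eaten"],
}

def expand_inflections(phrase, lemma, table=INFLECTIONS):
    """Return one phrase per inflected form of lemma, leaving other words unchanged."""
    forms = table.get(lemma, [lemma])  # fall back to the word itself if unknown
    words = phrase.split()
    return [
        " ".join(form if w == lemma else w for w in words)
        for form in forms
    ]

variants = expand_inflections("change roles", "change")
```

Each phrase in `variants` could then be looked up separately and plotted as its own series, or summed into a single one.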


Finally, we’ve implemented the most common feature request from our users: the ability to search for multiple capitalization styles simultaneously. Until now, searching for common capitalizations of “Mother Earth” required using a plus sign to combine ngrams (e.g., “Mother Earth + mother Earth + mother earth”), but now the case-insensitive checkbox makes it easier:


As with our other two features, right-clicking toggles whether the variants are shown.
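
Conceptually, a case-insensitive search does the same work as the plus-sign query: it finds every capitalization variant and adds the counts together year by year. The Python sketch below illustrates that; the `YEARLY_COUNTS` table, its numbers, and the `case_insensitive_series` function are hypothetical and are not the Viewer's actual code or data.

```python
from collections import defaultdict

# Hypothetical per-year counts (made-up numbers, not real Ngram data).
YEARLY_COUNTS = {
    "Mother Earth": {1990: 50, 2000: 70},
    "mother Earth": {1990: 5, 2000: 8},
    "mother earth": {1990: 20, 2000: 25},
    "Mother Nature": {1990: 90, 2000: 110},
}

def case_insensitive_series(query, counts):
    """Return (matching variants, per-year sum) for all capitalizations of query."""
    target = query.lower()
    variants = {}
    combined = defaultdict(int)
    for ngram, by_year in counts.items():
        if ngram.lower() == target:
            variants[ngram] = by_year           # kept so variants can be shown individually
            for year, n in by_year.items():
                combined[year] += n             # summed view, like "A + B + C"
    return variants, dict(combined)

variants, combined = case_insensitive_series("mother earth", YEARLY_COUNTS)
```

Keeping both `variants` and `combined` mirrors the toggle described above: one view shows each capitalization separately, the other shows their sum.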

We hope these features help you discover and share interesting trends in language use!
