Tuesday, November 12, 2013

Large Data sources for NLP from Google

Google has made available two large and rich sources for NLP research: the Google Books N-gram corpus and the Google Books Syntactic N-gram corpus. These have been described in the following papers:
  •  Jean-Baptiste Michel*, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. Quantitative Analysis of Culture Using Millions of Digitized Books.  Science. 2011.
  • Goldberg, Yoav, and Jon Orwant. "A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books." *SEM-2013.

These resources have been created from the Google Books corpus, which is an outcome of Google's efforts to scan all the world's books. I will just highlight the important points from these papers in this post.

Google N-gram corpus
This is a traditional n-gram corpus, where frequency counts are provided for 1 to 5 gram strings. However, there are a couple of additional features:
  • One is the temporal aspect of the n-grams, i.e., for each n-gram, frequency counts are given per year of publication. For English, the counts are available from the 16th century onwards.
  • Frequency counts are also available for extended n-grams, where the extension is in terms of POS tags. All the data has been POS-tagged with a tagset of 12 basic tags. This makes possible queries of the following form:
    • the burnt_NOUN car (combination of POS tag and token queries)
    • _DET_ _NOUN_ (queries involving POS tags only)
    There are some restrictions on the 4- and 5-grams available, in order to prevent combinatorial explosion.
  • Information on head-modifier relations is also available, though the relation type is not specified.
You can use the Google N-gram viewer to query this resource interactively. The corpus has been used for studying the evolution of culture over time, and can be used for a variety of such temporal studies in economics, language change, etc.
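As a rough illustration, the raw n-gram data files can be processed line by line. This sketch assumes the tab-separated record layout of the version 2 data files (ngram, year, match count, volume count), with POS annotations like burnt_NOUN embedded in the ngram field itself:

```python
# Sketch of parsing one record of a Google Books Ngram (version 2) data file.
# Assumed layout (tab-separated): ngram, year, match_count, volume_count.

def parse_ngram_line(line):
    """Parse a single tab-separated n-gram record into a dict."""
    ngram, year, match_count, volume_count = line.rstrip("\n").split("\t")
    tokens = ngram.split(" ")
    return {
        "tokens": tokens,
        # Split off a trailing POS annotation (e.g. "burnt_NOUN") when present.
        "pos_tags": [t.rsplit("_", 1)[1] if "_" in t else None for t in tokens],
        "year": int(year),
        "match_count": int(match_count),
        "volume_count": int(volume_count),
    }

record = parse_ngram_line("the burnt_NOUN car\t1968\t17\t12")
print(record["pos_tags"])   # [None, 'NOUN', None]
```

Aggregating such records by year is all that is needed to reproduce the kind of time series the N-gram viewer plots.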

Google Syntactic N-gram corpus
While traditional n-grams contain words that are sequential in the text, a syntactic n-gram is defined as a set of words involved in dependency relationships. Further, an order-n syntactic n-gram is one containing n content words. The Google Books syntactic n-gram corpus contains dependency tree fragments of sizes 1-5, viz. nodes, arcs, biarcs, triarcs and quadarcs. There is a restriction on the types of quadarcs available in the corpus. Each fragment contains the surface forms of the words, their POS tags, the head-modifier relationships and the relative order of the words. It does not contain information about the linear distance between the words in the dependency or the existence of gaps between words in the n-gram. The counts of all the syntactic n-grams are provided. A few noteworthy points:
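To make the fragment sizes concrete, this toy sketch extracts "arcs" (head-modifier word pairs) from a dependency parse; biarcs, triarcs and quadarcs generalize the same idea to connected fragments of 3, 4 and 5 words. The sentence and its parse here are purely illustrative:

```python
# A minimal, illustrative dependency parse.
# Each token: (word, POS, index-of-head); 0 = root, indices are 1-based.
parse = [
    ("the",   "DET",  3),
    ("burnt", "ADJ",  3),
    ("car",   "NOUN", 4),
    ("stood", "VERB", 0),
]

def extract_arcs(tokens):
    """Yield (head_word, modifier_word) pairs from a dependency parse."""
    for word, pos, head in tokens:
        if head != 0:  # skip the root, which has no head
            yield (tokens[head - 1][0], word)

print(list(extract_arcs(parse)))
# [('car', 'the'), ('car', 'burnt'), ('stood', 'car')]
```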
  • As with the Books n-gram corpus, temporal information on the syntactic n-grams is available.
  • Additional information for dependency trees involving conjunctions and prepositions is made available. Here, the dependency tree fragments are extended to provide information about the conjunctions and prepositions, even though they are function words. This information is part of the extended component of the corpus (extended-arcs, extended-biarcs, etc.)
  • verbargs-unlex and nounargs-unlex are unlexicalized versions of the syntactic n-grams, in which only the head word and the top-1000 words in the language are lexicalized.
The syntactic n-gram corpus can be very useful for studying lexical semantics, subcategorization, etc.
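A record of the syntactic n-gram files can be read in much the same way as the plain n-grams. This sketch assumes the layout described in the *SEM-2013 paper: head word, the fragment itself, a total count, and per-year counts, all tab-separated, with each token of the fragment written as word/POS/dep-label/head-index (head-index 0 marking the fragment root); the example line is constructed, not taken from the corpus:

```python
# Sketch of parsing one syntactic-ngram record (e.g. from an "arcs" file).
# Assumed layout: head_word <TAB> fragment <TAB> total_count <TAB> year,count ...

def parse_syntactic_ngram(line):
    """Parse one tab-separated syntactic n-gram record into a dict."""
    fields = line.rstrip("\n").split("\t")
    head_word, fragment, total_count = fields[0], fields[1], int(fields[2])
    tokens = []
    for tok in fragment.split(" "):
        # Each token looks like word/POS/dep-label/head-index.
        word, pos, label, head = tok.rsplit("/", 3)
        tokens.append({"word": word, "pos": pos, "label": label, "head": int(head)})
    counts = {int(y): int(c) for y, c in (f.split(",") for f in fields[3:])}
    return {"head_word": head_word, "tokens": tokens,
            "total_count": total_count, "counts_by_year": counts}

rec = parse_syntactic_ngram(
    "burn\tcar/NOUN/dobj/2 burn/VERB/ROOT/0\t5\t1950,2\t1960,3")
print(rec["tokens"][0]["label"])  # dobj
```

Grouping such records by the head word and its dependency labels gives a quick route to subcategorization-style counts.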
