An Overview of Microsoft Web N-gram Corpus and Applications

4 pages
Kuansan Wang, Christopher Thrasher, Evelyne Viegas, Xiaolong Li, Bo-june (Paul) Hsu
Microsoft Research
One Microsoft Way, Redmond, WA, 98052, USA

Abstract

This document describes the properties and some applications of the Microsoft Web N-gram corpus. The corpus is designed to have the following characteristics. First, in contrast to the static data distribution of previous corpus releases, this N-gram corpus is made publicly available as an XML Web Service so that it can be updated as deemed necessary by the user community to include new words and phrases constantly being added to the Web. Secondly, the corpus makes available various sections of a Web document, specifically, the ...

... larger datasets, culminating in the release of the English Giga-word corpus (Graff and Cieri, 2003) and the 1 Tera-word Google N-gram (Thorsten and Franz, 2006) created from arguably the largest text source available, the World Wide Web.

Recent research, however, suggests that studies on the document body alone may no longer be sufficient in understanding the language usages in our daily lives. A document, for example, is typically associated with multiple text streams. In addition to the document body that contains the bulk of the contents, there are also the title and the filename/URL the authors choose to name the document. On the web, a document is often ...
Proceedings of the NAACL HLT 2010: Demonstration Session, pages 45–48, Los Angeles, California, June 2010. © 2010 Association for Computational Linguistics
