CONFERENCE (INTERNATIONAL) Perplexity on Reduced Corpora
the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014)
June 01, 2014
This paper studies the idea of removing low-frequency words from a corpus, which is a common practice to reduce computational costs, from a theoretical standpoint. Based on the assumption that a corpus follows Zipf’s law, we derive trade-off formulae of the perplexity of k-gram models and topic models with respect to the size of the reduced vocabulary. In addition, we show an approximate behavior of each formula under certain conditions. We verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on real corpora.
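To make the setting concrete, the following is a minimal sketch of the kind of trade-off the abstract describes, restricted to the unigram (1-gram) case. It is not the paper's derivation. It assumes an exact Zipfian word distribution, and it assumes that all removed low-frequency words are merged into a single unknown-word token, which is only one possible convention and may differ from the paper's treatment. The function names (`zipf_probs`, `reduce_vocabulary`, `unigram_perplexity`) and the vocabulary sizes are hypothetical, chosen for illustration.

```python
import math


def zipf_probs(vocab_size, exponent=1.0):
    """Word probabilities proportional to 1 / rank**exponent (Zipf's law)."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def reduce_vocabulary(probs, kept):
    """Keep the `kept` most frequent words; merge the rest into one token.

    Assumption for illustration: the removed tail becomes a single
    unknown-word token carrying the tail's total probability mass.
    `probs` must be sorted from most to least frequent.
    """
    head = list(probs[:kept])
    tail_mass = sum(probs[kept:])
    if tail_mass > 0:
        head.append(tail_mass)
    return head


def unigram_perplexity(probs):
    """Perplexity of a unigram model evaluated on its own distribution.

    This equals exp(entropy), with entropy measured in nats.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)


if __name__ == "__main__":
    full = zipf_probs(10_000)
    for kept in (10_000, 1_000, 100, 10):
        reduced = reduce_vocabulary(full, kept)
        print(kept, round(unigram_perplexity(reduced), 2))
```

Under these assumptions the perplexity shrinks as the vocabulary is reduced, because merging words can only lower the entropy of the distribution. How perplexity should be compared across different vocabulary sizes is exactly the kind of question the paper treats formally, so the numbers printed here should be read as an illustration only.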