Hapax legomena are quite common, as predicted by Zipf's law, which states that the frequency of any word in a corpus is inversely proportional to its rank in the frequency table. For large corpora, about 40% to 60% of the words (counted by type) are hapax legomena, and another 10% to 15% are dis legomena. In the Brown Corpus of American English, for example, about half of the 50,000 distinct words are hapax legomena within that corpus.
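The proportions above are counted over types (distinct words), not over running tokens. A minimal sketch of that count follows, assuming a deliberately naive tokenizer (lowercased runs of letters and apostrophes); the function name `legomena_shares` and the sample sentence are illustrative only, and real corpus work would use a proper tokenizer and a far larger text.

```python
from collections import Counter
import re


def legomena_shares(text):
    """Return (number of types, hapax share, dis share) for a text.

    Shares are fractions of types, i.e. of distinct words.
    """
    # Assumption: a word is a lowercased run of letters or apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_types = len(counts)
    if n_types == 0:
        return 0, 0.0, 0.0
    # Hapax legomena occur exactly once; dis legomena exactly twice.
    hapax = sum(1 for c in counts.values() if c == 1)
    dis = sum(1 for c in counts.values() if c == 2)
    return n_types, hapax / n_types, dis / n_types


sample = "the cat sat on the mat and the dog sat on the log"
n_types, hapax_share, dis_share = legomena_shares(sample)
print(n_types, hapax_share, dis_share)  # 8 0.625 0.25
```

In this toy sentence, 5 of the 8 types occur once and 2 occur twice. A text this short says nothing about the 40% to 60% range, which is reported for large corpora.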
Note that hapax legomenon refers to a word's appearance in a body of text, not to its origin or to how often it is used in speech. It thus differs from a nonce word, which may never be recorded, may find currency and be widely recorded, or may appear several times in the work that coins it.