Audience Dialogue

Word frequencies in English

This page is for people who are interested in the technical side of our list of the 15,000 words that learners of global English most need to know. Those words came from the British National Corpus (BNC for short). A subset of the BNC (86,800 different words) is available at www.wordcount.org.

You don't like the above assumptions? Please feel free to create an improved version of the list - but be warned, it's probably more work than you think! We found it easiest to use Excel rather than the database software we originally planned to use, because a spreadsheet allows more flexible sorting, and because more people have spreadsheet software than database software.

What next?

So far, this word list is a service for people who are learning English, and who are at a reasonably advanced level. The next use of the word list is to help people writing English for a global audience. The tool for this will probably resemble a spelling checker, but instead of checking spelling, it will use the word list to check the readability of a web page. Then it will be possible to test the idea that started this whole process: whether it's true that no more than 1.5% of words on this website are outside the 15,000-word vocabulary.
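The core of such a checker could be quite small. What follows is only a sketch under stated assumptions, not the planned tool: the function name out_of_vocabulary_rate, the simple letters-only tokenizer, and the five-word stand-in vocabulary are all invented for illustration.

```python
import re


def out_of_vocabulary_rate(text, vocabulary):
    """Return the fraction of word tokens in `text` that are not in `vocabulary`."""
    # Assumption: a "word" is a run of letters, optionally with internal
    # apostrophes (e.g. "don't"). Real pages would need HTML stripped first.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    if not tokens:
        return 0.0
    unknown = [token for token in tokens if token not in vocabulary]
    return len(unknown) / len(tokens)


# Hypothetical tiny vocabulary standing in for the 15,000-word list.
vocabulary = {"the", "cat", "sat", "on", "mat"}
rate = out_of_vocabulary_rate("The cat sat on the zygomorphic mat.", vocabulary)
print(f"{rate:.1%} of words are outside the vocabulary")
```

With the real list loaded into a set, a page would pass the test described above if the returned rate were no more than 0.015.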

The Google corpus

Probably the biggest corpus in English is the Google index. In February 2005 it included 4.3 billion web pages. I couldn't find the total number of words mentioned on the Google website but a figure of 234 words per page is probably on the low side. If that figure is correct, the Google index includes one trillion words (did you wonder why I chose 234?). In numbers: that's 1,000,000,000,000 words, or 10,000 times the size of the BNC. In words: that's a lot - though still only a tiny proportion of everything that's been written in English.
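The arithmetic behind those figures can be checked in a few lines. The page count and words-per-page figure are the ones quoted above; the BNC size of about 100 million words is the figure implied by the "10,000 times" comparison.

```python
pages = 4_300_000_000      # web pages in the Google index, February 2005
words_per_page = 234       # assumed average from the text, probably on the low side
bnc_words = 100_000_000    # approximate size of the BNC in words

google_words = pages * words_per_page
ratio = google_words / bnc_words

print(f"{google_words:,} words")          # just over one trillion
print(f"{ratio:,.0f} times the BNC")      # roughly 10,000
```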

Philipp Lenssen (who you might describe as a Googler) compiled a frequency list of 27,693 words in the Google database, around November 2003. You can see the top 50 and download the list at blog.outer-court.com.

Given the way search engines work, this is not a count of word frequencies, but a count of the number of pages on which each word appears. (For example, if "the" occurs 10 times on a page, Google notes 1 page containing "the".) This would tend to produce underestimates for very common words, but that's not a problem here, because they'd still be in the first 15,000.
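The difference between the two kinds of count is easy to see on a toy example. This is a sketch with three invented "pages" and a made-up function name (term_and_page_counts); it is not how Google or the list's compiler actually counted.

```python
from collections import Counter


def term_and_page_counts(pages):
    """Return (word frequency, page frequency) counters for a list of page texts."""
    term_freq = Counter()   # every occurrence of a word counts
    page_freq = Counter()   # a word counts at most once per page
    for text in pages:
        words = text.lower().split()
        term_freq.update(words)
        page_freq.update(set(words))
    return term_freq, page_freq


# Three invented pages, purely for illustration.
pages = ["the cat and the dog and the bird", "home page", "the home"]
tf, pf = term_and_page_counts(pages)

print(tf["the"], pf["the"])    # 4 occurrences, but only 2 pages
print(tf["home"], pf["home"])  # 2 occurrences on 2 pages
```

Here "the" is twice as frequent as "home" by occurrences, yet the two tie on page counts, which is the kind of underestimate for very common words described above.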

Looking at the top 50, you can see immediately that the Web is atypical. For example, the 9th most common word is home. Why? Because so many web pages include a link to their home page. For similar reasons, site, information, contact, search, page, web, and copyright also appear in the top 50.

What Philipp Lenssen doesn't say is how he came up with the list of 27,693 words to look up in Google. Some of the bottom ones look very uncommon to me. I suspect he built his own small corpus to produce that word list. If that's right, the counts will have come from a huge corpus, but it's possible that the word list excludes some fairly common words which just didn't happen to come up in the preliminary corpus. I contacted him, but he didn't remember exactly where he'd got the list of 27,693 words.

It occurred to me that a way to make a word list lacking idiosyncrasies would be to combine this Google list with our BNC extract. Only words occurring in the top 15,000 on both lists would be accepted. That will take some programming, so it won't happen today - but watch this spot.
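The combining step itself might look something like the sketch below. It assumes each list has already been loaded as a mapping from word to count; the function names top_words and common_top_words, the alphabetical tie-breaking, and the four-word sample data are all hypothetical.

```python
def top_words(counts, n):
    """Return the n highest-count words as a set (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word for word, _ in ranked[:n]}


def common_top_words(counts_a, counts_b, n=15000):
    """Return the words that appear in the top n of both frequency lists."""
    return top_words(counts_a, n) & top_words(counts_b, n)


# Invented counts, purely for illustration.
bnc = {"the": 100, "of": 90, "whilst": 20, "home": 10}
google = {"the": 500, "home": 400, "copyright": 350, "of": 300}

shared = common_top_words(bnc, google, n=3)
print(shared)  # only "the" is in the top 3 of both lists
```

Words that are common in only one source, such as "copyright" on the Web, drop out of the result.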