This page is for people who are interested in the technical side of our list of the 15,000 words that learners of global English most need to know. Those words came from the British National Corpus (BNC, for short). A subset of the BNC (86,800 different words) is available at www.wordcount.org.
The BNC is very much the British National Corpus - not the English language corpus. It includes many words that would make little sense to people outside England. Even as a native English speaker, I don't know what some of those supposedly common words are. Maybe they are the names of famous sports people in England in the early 1990s. (Erm? Medau?)
All the words in the BNC are converted to lower-case, so you can't easily separate an abbreviation from a word. (Maybe erm should be ERM, but is the European Exchange Rate Mechanism really an everyday term? Or is it just another spelling of um in transcribed speech?)
In the BNC, every word is listed separately for each part of speech - e.g. jumps occurs twice: a plural noun, and also a verb in the present tense. When we calculated that 15,000 words was the threshold of sustainability, that figure meant 15,000 different word forms, and jumps is just one word form.
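As a rough illustration of the difference between BNC entries and word forms, here is a minimal Python sketch. The row layout and the counts are made up for the example, and adding the frequencies of the merged entries together is my assumption, not something the BNC prescribes.

```python
from collections import defaultdict

# Hypothetical rows: (spelling, part_of_speech, frequency).
# These counts are invented for illustration, not real BNC figures.
rows = [
    ("jumps", "noun", 120),
    ("jumps", "verb", 310),
    ("prizes", "noun", 1000),
]

def collapse_to_word_forms(rows):
    """Merge entries that share a spelling into a single word form."""
    totals = defaultdict(int)
    for spelling, _pos, freq in rows:
        # Assumption: the merged word form gets the sum of its entries.
        totals[spelling.lower()] += freq
    return dict(totals)

word_forms = collapse_to_word_forms(rows)
# Each merged spelling removes one or more duplicate entries.
duplicates_removed = len(rows) - len(word_forms)
```

With these sample rows, jumps ends up as one word form instead of two entries.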
To get 15,000 usable words, we ended up extracting 22,000 from the BNC. Omitted from our list were: 3,211 duplicates (several parts of speech with the same spelling - like jumps), 384 compound words that a learner can easily guess (e.g. dressing-room), 1,335 surnames and place names, 200 "unwords" (e.g. numbers, letters, foreign words), 481 not-so-common words that were far too British, and 89 trade names (many of which were computer companies).
We also dropped some other words that just seemed to come up too often. A corpus is sampled from a selection of published and spoken material, and the selection method can bias the corpus. For example, what criterion could be used to know whether a corpus included the appropriate number of medical publications? I suspect that the BNC included too much medical writing. In the count of 100 million words, pylori occurred 1100 times, but prizes only occurred 1000 times. Are pylori really more common than prizes? Checking on Google, with a much bigger database, showed that prizes outnumbered pylori by a factor of 7 to 1. I strongly suspect that terms such as pylori are almost unknown to most native speakers of English. (I just asked 3 of them, and none knew what pylori meant. They all knew what prizes were!) Therefore 103 medical, computer, chemical, geological, and linguistic terms were excluded from the list of 15,000.
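The pylori comparison can be put into numbers. This is only a sketch in Python using the figures quoted above: the BNC counts and the 7-to-1 Google ratio come from the text, while the variable names and the idea of dividing one ratio by the other are my own way of expressing it.

```python
# Figures quoted in the text (occurrences in the 100-million-word BNC).
bnc_pylori = 1100
bnc_prizes = 1000

# On Google, prizes outnumbered pylori by roughly 7 to 1.
google_prizes_per_pylori = 7

# How common pylori is relative to prizes in each source.
bnc_ratio = bnc_pylori / bnc_prizes          # 1.1
google_ratio = 1 / google_prizes_per_pylori  # about 0.14

# How much more prominent pylori is in the BNC than on Google,
# using prizes as the yardstick.
overrepresentation = bnc_ratio / google_ratio
```

On these figures, pylori is nearly 8 times as prominent in the BNC as on Google, which is the sort of gap that made me suspect a bias in the sample.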
Judging from the commonest surnames of politicians (Gorbachev, Clinton, Thatcher), it's obvious the BNC was compiled in the early 1990s. (For details, see the BNC page "What is the BNC".) So there are no entries for more recent terms, such as texting, but plenty of mentions of now-quaint words like microcomputer.
I figured that because about 20% of the world's native English speakers are in Britain, any BNC word used only in the UK would have to appear about 5 times as often to be counted. The rarest of these 15,000 words occurred 200 times in 100 million in the corpus, so words used only in the UK had to be mentioned 1,000 times to get into our list. Thus Hereford (mentioned 986 times) missed out, but Aberdeen (1044) was included. (The cattle might cancel out.)
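The arithmetic behind that threshold is simple enough to sketch in Python. The 200 and 20% figures are from the text; the function names are just for illustration.

```python
RAREST_COUNT = 200  # occurrences per 100 million words of the rarest listed word
UK_SHARE = 0.20     # rough share of native English speakers who are in Britain

def uk_only_threshold(base_count=RAREST_COUNT, uk_share=UK_SHARE):
    """Scale the cut-off up for words used only in the UK."""
    return round(base_count / uk_share)

def keep_uk_only_word(count):
    """True if a UK-only word is frequent enough to stay in the list."""
    return count >= uk_only_threshold()

hereford_kept = keep_uk_only_word(986)   # below the threshold
aberdeen_kept = keep_uk_only_word(1044)  # above the threshold
```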
Just because there are only 15,000 words listed here, it doesn't follow that you have to learn only 15,000 meanings - because the commonest words in English tend to have lots of meanings. But to counterbalance that, it's usually easy to spot plurals and verb forms - once you've learned the first few thousand irregular ones. For example, take the verb abandon, which also includes abandoned and abandoning. The fourth form, abandons, is not in the first 15,000 words - but if you knew the other forms, you could easily work out its meaning.
To prevent the list from being overwhelmed by surnames and place names, only those mentioned at least 400 times in the BNC's 100 million words were included.
In my view, once you know the first few thousand words, the parts of speech least important for understanding English sentences are nouns. Most of the other parts of speech can appear in any text, but nouns (including names) are more specialized. Depending on what you are reading, you need a different set of nouns. Also, it's usually easy to identify unknown nouns in an English sentence. Therefore we only included nouns with a frequency of at least 235, compared with a cut-off level of 200 for other parts of speech. That meant excluding 900-odd nouns, such as paranoia and thorn, both of which occurred exactly 200 times. It's nice to know them, but as they only occurred once in every half million words in the BNC, you should survive without knowing them.
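The different cut-off levels can be summarized in a short Python sketch. The numbers (400 for surnames and place names, 235 for nouns, 200 for everything else) are the ones given on this page; the category labels and function names are hypothetical.

```python
CORPUS_SIZE = 100_000_000  # words in the BNC

# Minimum number of occurrences needed, by category.
CUTOFFS = {"name": 400, "noun": 235}
DEFAULT_CUTOFF = 200  # all other parts of speech

def is_included(count, category):
    """True if a word with this count passes the cut-off for its category."""
    return count >= CUTOFFS.get(category, DEFAULT_CUTOFF)

def words_per_occurrence(count, corpus_size=CORPUS_SIZE):
    """On average, one occurrence in every this-many words of text."""
    return corpus_size // count

paranoia_included = is_included(200, "noun")  # a noun at exactly 200
rarity_of_200 = words_per_occurrence(200)     # one in half a million
```

A verb or adjective with the same count of 200 would pass, because it falls under the default cut-off.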
Since this is a list of words (not of spellings) only the commonest spelling of each word was included. For example, in UK usage, center is uncommon, so its frequency count was combined with centre. The endings -ise and -ize were more of a problem: with some words, -ise was more common, while with others, -ize was more common. So we spelled them all as -ize, which is more common in international English.
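Merging spellings could look something like the Python sketch below. The mapping table and the counts are invented for the example. I've used an explicit table rather than a blanket rule, because blindly turning every -ise into -ize would mangle words such as rise.

```python
from collections import defaultdict

# Hypothetical table: less common spelling -> spelling kept in the list.
VARIANT_TO_PREFERRED = {
    "center": "centre",
    "organise": "organize",
    "realise": "realize",
}

def merge_spellings(counts, mapping=VARIANT_TO_PREFERRED):
    """Add each variant's count to the count of its preferred spelling."""
    merged = defaultdict(int)
    for word, count in counts.items():
        merged[mapping.get(word, word)] += count
    return dict(merged)

# Made-up counts, for illustration only.
sample = {"centre": 900, "center": 50, "organise": 300, "organize": 200, "rise": 700}
merged = merge_spellings(sample)
```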
You don't like the above assumptions? Please feel free to create an improved edited version - but be warned, it's probably more work than you think! We found it easiest to use Excel, rather than the database software we originally planned to use, because of the more flexible sorting that a spreadsheet allows, as well as the fact that more people have spreadsheet than database software.
So far, this word list is a service for people who are learning English, and who are at a reasonably advanced level. The next use of the word list is to help people writing English for a global audience. It will probably resemble a spelling checker, but instead of checking spelling, it will use the word list to check the readability of a web page. Then it will be possible to check the accuracy of the idea that started this whole process: whether it's true that no more than 1.5% of words on this website are outside the 15,000-word vocabulary.
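No such checker exists yet, but the core of it might be as small as this Python sketch. The tokenizing rule, the function names, and the tiny word list are all assumptions for the example; a real checker would load the full 15,000 words and would need to deal with names, numbers, and markup.

```python
import re

MAX_SHARE = 0.015  # the 1.5% target mentioned above

def out_of_list_share(text, word_list):
    """Fraction of the words in text that are not in word_list."""
    # Assumption: a word is a run of letters, optionally with an apostrophe part.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return 0.0
    unknown = [w for w in words if w not in word_list]
    return len(unknown) / len(words)

def within_target(text, word_list, max_share=MAX_SHARE):
    """True if the text stays within the target share of unlisted words."""
    return out_of_list_share(text, word_list) <= max_share

# Tiny stand-in for the real word list.
tiny_list = {"the", "cat", "sat", "on", "mat"}
share = out_of_list_share("The cat sat on the mat with pylori.", tiny_list)
```

In this toy example, 2 of the 8 words are outside the list, so the text is well over the 1.5% target.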
The Google corpus
Probably the biggest corpus in English is the Google index. In February 2005 it included 4.3 billion web pages. I couldn't find the total number of words stated anywhere on the Google website, but a figure of 234 words per page is probably on the low side. If that figure is correct, the Google index includes about one trillion words (did you wonder why I chose 234?). In numbers: that's 1,000,000,000,000 words, or 10,000 times the size of the BNC. In words: that's a lot - though still only a tiny proportion of everything that's been written in English.
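For anyone who wants to check the estimate, here it is as a few lines of Python. The page count and the BNC size are from the text; 234 words per page is my guess, as explained above.

```python
PAGES = 4_300_000_000    # web pages in the Google index, February 2005
WORDS_PER_PAGE = 234     # assumed average, probably on the low side
BNC_WORDS = 100_000_000  # size of the BNC

google_words = PAGES * WORDS_PER_PAGE      # a little over one trillion
times_bnc = google_words // BNC_WORDS      # roughly 10,000
```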
Philipp Lenssen (who you might describe as a Googler) compiled a frequency list of 27,693 words in the Google database, around November 2003. You can see the top 50 and download the list at blog.outer-court.com.
Given the way search engines work, this is not a count of word frequencies, but a count of the number of pages on which each word appears. (For example, if "the" occurs 10 times on a page, Google notes 1 page containing "the".) This would tend to produce underestimates for very common words, but that's not a problem here, because they'd still be in the first 15,000.
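The difference between the two kinds of count is easy to see in a small Python sketch. The three "pages" are made up, and splitting on spaces is a simplification.

```python
from collections import Counter

# Three made-up web pages.
pages = ["the cat and the dog", "the end", "a dog"]

def word_and_page_counts(pages):
    """Return (occurrences of each word, number of pages containing it)."""
    word_counts = Counter()
    page_counts = Counter()
    for page in pages:
        tokens = page.lower().split()
        word_counts.update(tokens)       # every occurrence counts
        page_counts.update(set(tokens))  # each page counts once per word
    return word_counts, page_counts

word_counts, page_counts = word_and_page_counts(pages)
```

Here "the" occurs 3 times but on only 2 pages, while "dog" scores 2 on both measures, so the page count narrows the gap between very common words and the rest.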
Looking at the top 50, you can see immediately that the Web is atypical. For example, the 9th most common word is home. Why? Because so many web pages include a link to their home page. For similar reasons, site, information, contact, search, page, web, and copyright also appear in the top 50.
What Philipp Lenssen doesn't say is how he came up with the list of 27,693 words to look up in Google. Some of the bottom ones look very uncommon to me. I suspect he built his own small corpus to produce that word list. If that's right, the counts will have come from a huge corpus, but it's possible that the word list excludes some fairly common words which just didn't happen to come up in the preliminary corpus. I contacted him, but he didn't remember exactly where he'd got the list of 27,693 words.
It occurred to me that a way to make a word list lacking idiosyncrasies would be to combine this Google list with our BNC extract. Only words occurring in the top 15,000 on both lists would be accepted. That will take some programming, so it won't happen today - but watch this spot.
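The programming needn't be elaborate. Here is a minimal Python sketch of the idea, assuming each list is already sorted from most to least common; the function name and the sample words are hypothetical, and nothing like this has been run on the real lists yet.

```python
def common_core(bnc_ranked, google_ranked, n=15_000):
    """Words in the top n of both lists, kept in BNC rank order."""
    google_top = set(google_ranked[:n])
    return [word for word in bnc_ranked[:n] if word in google_top]

# Toy example with n=3 instead of 15,000.
bnc_sample = ["the", "of", "erm", "home"]
google_sample = ["the", "home", "of", "site"]
core = common_core(bnc_sample, google_sample, n=3)
```

In the toy example, erm drops out because it isn't in the Google top 3, and home drops out because it isn't in the BNC top 3.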