Audience Dialogue

Stopwords


Each language has a well-established set of word frequencies. In most documents in English, "the" is the commonest word, "of" is often second, and so on. These very common words tell you nothing about the content of a document: all they do is get in the way, and slow down the analysis. Therefore most word-frequency analysis software ignores the commonest words - they are referred to as "stopwords".

Audience Dialogue uses two lists of stopwords, for different kinds of project. Usually we ignore the commonest 20 words, sometimes the commonest 180 (e.g. when considering content rather than expression, as in answers to open-ended questions in a survey). The following short list is based on word counts from the LOB (Lancaster-Oslo/Bergen) Corpus of British English. (The texts in the LOB Corpus date from the 1960s, but there's no reason to expect the very commonest words to have changed since then.) The top 20 words, with their average frequency per 1000 words, are:

Rank Word Per 1000
1 the 70
2 of 36
3 and 29
4 to 26
5 a 23
6 in 21
7 that 11
8 is 10
9 was 10
10 he 10
11 for 9
12 it 9
13 with 7
14 as 7
15 his 7
16 on 7
17 be 6
18 at 5
19 by 5
20 I 5
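
A table like this can be produced from any text by counting each word and scaling the count to a rate per 1000 words. The following is a minimal sketch in Python, not the method used to count the LOB Corpus: the function name per_thousand and the simple rule for splitting text into words (runs of letters and apostrophes, after lowercasing) are assumptions made for illustration.

```python
import re
from collections import Counter

def per_thousand(text, top=20):
    """Return (rank, word, rate per 1000 words) for the commonest words."""
    # Assumed tokenizer: lowercase, then take runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    if total == 0:
        return []
    counts = Counter(words)
    return [
        (rank, word, 1000 * count / total)
        for rank, (word, count) in enumerate(counts.most_common(top), start=1)
    ]

sample = "the cat and the dog and the bird"
for rank, word, rate in per_thousand(sample, top=2):
    print(rank, word, round(rate))
```

A sample this short gives rates far higher than those in the table; stable figures need a large body of text.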

For most purposes you can ignore these stopwords, but for other purposes some of them are important. For example, a study of the use of gender-based terms would need to include words such as "he" and "his" (numbers 10 and 15 in the above table). The fact that "her" and "she" come in at numbers 36 and 38 respectively already tells us something.
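
One way to handle both cases is to treat the stopword list as adjustable, removing from it any words a particular study needs to count. The sketch below assumes the 20-word list from the table; the names STOPWORDS_20, KEEP and count_words are hypothetical, and the tokenizer is a deliberately simple one.

```python
import re
from collections import Counter

# The 20 words from the table, lowercased ("I" becomes "i").
STOPWORDS_20 = frozenset(
    "the of and to a in that is was he for it with as his on be at by i".split()
)

# Words a gender-terms study must still count (an illustrative choice).
KEEP = {"he", "his"}
gender_stopwords = STOPWORDS_20 - KEEP

def count_words(text, stopwords):
    """Count the words in text, skipping any that appear in stopwords."""
    # Assumed tokenizer: lowercase, then take runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)

sample = "He said that his sister was late, and she gave her reasons."
counts = count_words(sample, gender_stopwords)
for term in ("he", "his", "she", "her"):
    print(term, counts[term])
```

With the unmodified list, "he" and "his" would be dropped while "she" and "her" were still counted, which would distort any comparison between them.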


Other collections of stopwords: