When you are checking data files, one of the first steps is to do a frequency distribution of every variable in a survey. When all the checks have been made, and nothing is obviously wrong, the frequency distributions can be repeated, showing how many people in the survey gave each answer to each question.

Frequency distributions

A frequency distribution, also known as marginals (short for marginal totals), shows how many people gave each answer to each question. In statistical language, you could restate this to say that it shows how many cases had each value on each variable.

If you’re well-organized (or lucky) you may find that all the apparent errors have been fixed by the second time you run the frequency distributions.

A typical frequency distribution looks like this, but will vary slightly depending on the software you are using.

NGAYTHANG | Freq  Percent   Cum.
----------+---------------------
    16/10 |   20     4.0%    4.0%
    17/10 |   20     4.0%    7.9%
    18/10 |   32     6.3%   14.3%
    19/10 |   86    17.1%   31.3%
    20/10 |   79    15.7%   47.0%
    21/10 |   86    17.1%   64.1%
    22/10 |   77    15.3%   79.4%
    23/10 |   47     9.3%   88.7%
    24/10 |   17     3.4%   92.1%
    25/10 |   24     4.8%   96.8%
    26/10 |   16     3.2%  100.0%
----------+---------------------
    Total |  504   100.0%

This example is from a survey in Vietnam. Ngaythang means the date of interview. It shows that 20 people (4.0% of the total of 504) were interviewed on 16 October. On the 17th of October, another 20 were interviewed, another 4.0%, making a cumulative percentage of 7.9%, or 40 out of 504. The two lots of 4.0 add to 7.9 instead of 8.0 because of rounding. On the last day, the 26th of October, 16 people were interviewed, which was 3.2% of the total. The cumulative on the last day is, of course, 100% of the total.
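As a sketch of what the statistical software is doing, a frequency distribution can be computed in a few lines of Python. This is only an illustration: the function name and the short sample of answers are made up.

```python
from collections import Counter

def frequency_distribution(values):
    """Count how many cases had each value, with percent and cumulative percent."""
    counts = Counter(values)
    total = len(values)
    cum = 0
    rows = []
    for value in sorted(counts):
        freq = counts[value]
        cum += freq
        rows.append((value, freq, 100 * freq / total, 100 * cum / total))
    return rows

# Hypothetical interview dates, one per respondent
answers = ["16/10"] * 20 + ["17/10"] * 20 + ["18/10"] * 32
for value, freq, pct, cum_pct in frequency_distribution(answers):
    print(f"{value:>6} {freq:4d} {pct:5.1f}% {cum_pct:6.1f}%")
```

The cumulative column is just a running total of the frequency column, expressed as a percentage of all cases.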

Numerical questions

By "numerical questions" I mean those where the answers are real numbers, not just arbitrary codes. At this point, it will help to explain the four types of scale that exist: nominal scales, ordinal scales, interval scales, and ratio scales. Let’s start at the purest end.

Types of scale

A ratio scale is one in which the answers are real numbers, and an answer of zero means what it says. "What age are you?" - "How tall are you?" - "How many children do you have?" These questions are measured on a ratio scale. If respondent A gives an answer which is twice as large as respondent B’s answer, this means that respondent A has twice as much of whatever the question asked - e.g. was twice as old, twice as tall, or had twice as many children.
The next scale down is an interval scale - if there’s a zero point, it’s arbitrary, but the difference between two successive possible answers is the same. For example, the scale of temperature. The Celsius scale says water boils at 100 degrees and freezes at 0. According to the Fahrenheit scale, water boils at 212 degrees and freezes at 32. You can’t say that 20 degrees is twice as hot as 10 degrees, because this statement couldn’t be true on both scales. So the scales are arbitrary, but within each scale, 1 degree has the same meaning. Thus these are interval (meaning equal-interval) scales.

Interval and ratio scales together are referred to as metric scales. You can do statistical manipulations with metric scales that are invalid with lesser types of scale: i.e. ordinal and nominal scales.

An ordinal scale uses numbers, but you can’t say that the difference between successive numbers is the same. For example, "How many points out of 10 would you give the breakfast session on FM99?" Though the question seems reasonable, you can’t prove that the difference between 8 and 9 points is the same as the difference between 9 and 10 points. However, you can say that 9 points is better than 8.

A nominal scale isn’t really a scale at all, but an arbitrary code value. For example, coding men as 1 and women as 2. There’s no meaning here, and the codes could equally well be 3 for women and 8 for men. The only reason for giving a code is to distinguish the different groups.

Most survey questions use nominal scales or ordinal scales. Occasionally, you can use a metric scale. However, most statistical techniques are designed for metric scales, and most statistical software generally assumes that every variable is on a metric scale. Sometimes you can use ordinal scales as if they are metric scales, but you have to do a lot of checking to make sure that the results aren’t spurious.

Metric scales aren’t limited to whole numbers, and have many possible values. If the variable is measured in enough detail, no two respondents may give the same answer. If all respondents had their height measured to the nearest millimetre, the frequency distribution would be pages long, showing each height, and the fact that 1 person was that height. Obviously this is not much use. There are two solutions:

1. Grouping the heights into ranges: recoding. Usually you’d form about 10 groups, each of equal size.

2. Calculating summary statistics, such as the average, standard deviation, and quartiles.
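Both solutions can be sketched in Python. This is only an illustration: the heights are invented, and statistics.quantiles is one of several conventions for choosing cut points that give roughly equal-size groups.

```python
import statistics
from collections import Counter

def recode_into_groups(values, n_groups=10):
    """Solution 1: recode metric values into roughly equal-size ranges."""
    cuts = statistics.quantiles(values, n=n_groups)  # n_groups - 1 cut points
    def group_of(v):
        for i, cut in enumerate(cuts):
            if v <= cut:
                return i
        return len(cuts)
    return [group_of(v) for v in values]

# Hypothetical heights in millimetres, one per respondent
heights_mm = list(range(1500, 1900, 2))

groups = recode_into_groups(heights_mm)
print(Counter(groups))  # ten groups, each with a tenth of the respondents

# Solution 2: summary statistics instead of a page-long distribution
print("mean:", statistics.mean(heights_mm))
print("std dev:", round(statistics.stdev(heights_mm), 1))
print("quartiles:", statistics.quantiles(heights_mm, n=4))
```

Either way, a distribution with hundreds of distinct values is reduced to something a reader can take in at a glance.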

The statistical measures you can use with a variable depend on which type of scale it was measured on: nominal, ordinal, interval, or ratio. The higher you move up the ladder of scales - from nominal up to ratio - the more statistical measures are possible - but the more rigorously the variable must have been recorded.

With nominal variables (when the possible different values have no inherent order, such as gender, marital status, or religion) you can measure:

- The number of respondents in each category
- The percentage of respondents in each category
- The mode (i.e. the category containing the highest number of respondents)

The distribution (or frequency distribution) is a table or graph showing how many respondents were in each category.

With ordinal variables (such as rating scales), you can measure all the above, plus:

- The minimum
- The maximum
- The range (the difference between the minimum and the maximum)
- The median
- The quartiles

You don’t know what the median and quartiles mean? Here’s an example. Imagine measuring the height of 99 people. Persuade all of them to stand in a line, in order of height, with the shortest person at one end and the tallest person at the other. The person in the middle of the line (the 50th person, counting from either end) has the median height - whatever that height is. If you had an even number of people (e.g. 100 instead of 99) nobody is exactly in the middle, so the median is halfway between the height of the 50th and the 51st person.

So the median is a figure which half the people exceed, and the other half fall short of: it divides the sample into two equal groups.

When a sample is divided into four groups, each with a quarter of the people, these are called quartiles. There are three quartile points: the first, second, and third. The second (of course) is the median. The lower quartile is the measure (e.g. height) which three quarters exceed, and one quarter of them fall short of. The upper quartile is the measure which only one quarter exceed.

As well as quartiles, you can determine quintiles (dividing the sample into five equal groups), deciles (10 groups), and percentiles (100 groups). The first quartile is the 25th percentile, the median is the 50th percentile - and so on.

If you are measuring height, you can say that 25% of the sample are shorter than the first quartile, 25% are taller than the third quartile, and 50% are between the two quartiles. Using quartiles is a handy way of describing a sample’s variation on a particular measure.

Here’s a simple example. Let’s say you ask 12 respondents how many people live in their household. These are the answers:

7 9 11 3 8 4 5 8 6 9 14 6

To find the median, sort the answers in order:

3 4 5 6 6 7 ^ 8 8 9 9 11 14

The median is at the ^ symbol, or 7.5: halfway between 7 (the 5th person, when the answers are ordered) and 8 (the 6th person).

To find the lower and upper quartiles, the sample must be divided into four equal groups, e.g.

3 4 5 ` 6 6 7 ^ 8 8 9 " 9 11 14

The lower quartile is 5.5: the average of the figures before and after the ` symbol. The upper quartile is the average of the figures before and after the " symbol, or 9. When the sample isn’t an exact multiple of 4, calculating quartiles is a little more complicated. For example, if there were only 10 people in the sample, the lower quartile would be the answer given by the 2.5th person. As there's no such thing as the 2.5th person, you take the figure halfway between the second and the third person's answers.
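The split-the-halves method described above can be written as a short function. The function name is mine, and statistical packages use slightly different quartile conventions, so their answers may differ by a fraction.

```python
import statistics

def median_and_quartiles(values):
    """Lower quartile, median, upper quartile, by splitting the sorted sample in half."""
    s = sorted(values)
    half = len(s) // 2
    lower = statistics.median(s[:half])    # median of the lower half
    upper = statistics.median(s[-half:])   # median of the upper half
    return lower, statistics.median(s), upper

# The 12 household-size answers from the example above
answers = [7, 9, 11, 3, 8, 4, 5, 8, 6, 9, 14, 6]
print(median_and_quartiles(answers))  # (5.5, 7.5, 9.0)
```

For a sample that is an exact multiple of 4, as here, this reproduces the hand calculation: lower quartile 5.5, median 7.5, upper quartile 9.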

With interval variables (rarely found in surveys) and ratio variables, you can measure all the above, plus:

- The average, or the mean.
- Using the example of people’s height, the average is calculated by adding together everybody’s height, and dividing by the number of people. There’s no need to have them stand in order of height to do this.
- The standard deviation (think of this as the difference of a typical respondent’s score from the average)
- A wide range of other measures is possible with ratio scales - e.g. variance (the square of the standard deviation); these measures are used in more advanced statistics, beyond the scope of this book.

Here’s an example of a set of descriptive statistics for a numeric variable: the number of people in each surveyed household, in the Vietnamese survey mentioned above. This Epi Info printout shows the number of people in each household surveyed (PPHH = people per household).

 PPHH | Freq  Percent   Cum.
------+---------------------
    1 |    2     0.4%    0.4%
    2 |   10     2.0%    2.4%
    3 |   55    11.0%   13.5%
    4 |   94    18.9%   32.3%
    5 |  113    22.7%   55.0%
    6 |  101    20.3%   75.3%
    7 |   60    12.0%   87.3%
    8 |   35     7.0%   94.4%
    9 |   12     2.4%   96.8%
   10 |    8     1.6%   98.4%
   11 |    3     0.6%   99.0%
   12 |    2     0.4%   99.4%
   13 |    1     0.2%   99.6%
   14 |    1     0.2%   99.8%
   21 |    1     0.2%  100.0%
------+---------------------
Total |  498   100.0%

Total   Sum    Mean   Variance  Std Dev  Std Err
  498   2727   5.476  3.980     1.995    0.089

Minimum  25%ile  Median  75%ile  Maximum  Mode
  1.000   4.000   5.000   6.000  21.000   5.000

The "Total" figure shows that 498 respondents answered this question (one per household). The total number of people in those 498 households ("Sum") was 2727, so that's an average ("Mean") of 5.476 people per household. The standard deviation was 1.995. We'll pass over the Variance and Standard Error, but the next line of statistics isn't difficult. The minimum was 1, which means that no household had fewer than 1 inhabitant. The 25th percentile was 4, meaning that 25% of households had 4 people or fewer. The median was 5, meaning that half the households had fewer people than this, and half had more. The 75th percentile was 6, meaning that three quarters of the households had 6 people or fewer. The maximum was 21: yes, one household really did have 21 people. The mode was 5, meaning that more households had 5 people than any other number.
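You can verify the printout's figures yourself by expanding the frequency table back into individual answers. The dictionary below is copied from the Epi Info table above.

```python
import statistics

# Frequency table from the printout: household size -> number of households
pphh = {1: 2, 2: 10, 3: 55, 4: 94, 5: 113, 6: 101, 7: 60, 8: 35,
        9: 12, 10: 8, 11: 3, 12: 2, 13: 1, 14: 1, 21: 1}

# Expand the table back into one answer per household
data = [size for size, freq in pphh.items() for _ in range(freq)]

print("Total:", len(data))                           # 498
print("Sum:", sum(data))                             # 2727
print("Mean:", round(statistics.mean(data), 3))      # 5.476
print("Std Dev:", round(statistics.stdev(data), 3))  # 1.995
print("Median:", statistics.median(data))            # 5.0
print("Mode:", statistics.mode(data))                # 5
```

Every figure matches the printout, which is a useful check that you are reading such tables correctly.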

If a distribution is exactly symmetrical, the average and the median are the same. Most distributions are approximately symmetrical, so the average and the median are usually quite similar. However, with a ratio scale it’s impossible to go below zero, and with some such variables (income is a good example) a few cases can have very high values. These high values push up the average but not the median, so for such variables the average is usually more than the median. For this reason, the average income of a group of people is almost always more than the median income.

For some purposes it makes more sense to use the average, for others the median, and sometimes you need to use both to describe a set of results well.

With almost every survey question, some respondents can’t (or won’t) answer it. When a respondent is asked a question, and doesn’t give a valid answer, the result is called a missing value. Often a special code is given for a missing value, such as 0 or -1. Sometimes the same code (often a blank space) is given both for missing values and for questions that have been skipped.

A question can have more than one missing value. For example, you may want to separately count the people who answer "Don’t know" and those who give no answer at all.

For every question, you need to decide what to do about missing values. Even if you make no conscious decision, you will be making some assumption about the missing values.

The common procedure is to ignore missing values, and count only valid answers to a question. This will mean that the sample size differs between questions.

For example, if 300 respondents are asked their age group, you may find 80 under 25, 107 aged 25 to 44, 101 aged 45 and over, and 12 who didn’t state their age: i.e. missing. If the 12 missing values are excluded, the sample of those who answered is 288, and 80 of the 288 (27.8%) are under 25.

But you could also say that you know 80 of the 300 respondents were under 25; that’s 26.7%.
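The two percentage bases can be compared directly. This sketch uses the age-group figures from the example above.

```python
counts = {"under 25": 80, "25 to 44": 107, "45 and over": 101, "missing": 12}

valid = sum(f for k, f in counts.items() if k != "missing")  # 288
total = sum(counts.values())                                 # 300

# Percentage of those who answered, versus percentage of the whole sample
print(round(100 * counts["under 25"] / valid, 1))  # 27.8
print(round(100 * counts["under 25"] / total, 1))  # 26.7
```

Whichever base you choose, say so clearly in the table, because readers will otherwise assume the base is the whole sample.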

Usually only a few percent of respondents don’t answer a question, and the two percentages aren’t very different. But if a high percentage of respondents fail to answer, changing your assumptions can make a large difference to the results. This often happens when respondents are asked their income. Usually, some don’t want to supply it, and others simply don’t know. It’s not unusual to find 30% of respondents not answering a question on income - and there’s some evidence that people who don’t answer tend to have an above-average income. So ignoring the missing cases may produce an average income figure which is too low.

Imputation

One solution to such a problem is imputation. This means estimating a value for each respondent who does not answer a question, based on the answers that others give. For example, income can be imputed based on the respondent’s sex, age group, and occupation - so if a respondent is a man aged 45 to 64 and a full-time professional worker, he could be imputed the average income of all other people in this category - say $650. For example, you could type in this Epi Info command:

IF INCOME<0 AND SEX=1 AND AGE=3 AND OCCUP=1 THEN INCOME=650

Referring to a numeric code as <0 ("less than zero") is Epi Info’s way of saying that no value was entered.

You’d have to type in such a statement for every combination of sex, age, and occupation, inserting the average income figure for each group. And before you could do this, you’d have to find out how average income varied between groups. Imputation can be quite time-consuming.
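The same imputation can be done for every group at once, rather than one IF statement per group. This is only a sketch, not Epi Info: the field names (sex, age, occup, income) are hypothetical, and an income below zero marks a missing value, following the convention above.

```python
import statistics
from collections import defaultdict

def impute_income(records):
    """Replace missing incomes with the mean income of the same sex/age/occupation group.

    Each record is a dict; income < 0 means "not answered".
    """
    # First pass: collect the valid incomes for each group
    groups = defaultdict(list)
    for r in records:
        if r["income"] >= 0:
            groups[(r["sex"], r["age"], r["occup"])].append(r["income"])
    # Second pass: fill each missing income with its group's average
    for r in records:
        key = (r["sex"], r["age"], r["occup"])
        if r["income"] < 0 and groups[key]:
            r["income"] = statistics.mean(groups[key])
    return records
```

A respondent whose whole group gave no valid answer is left missing, which mirrors the practical problem noted below: imputation can't manufacture information that nobody supplied.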

Of course, an imputed estimate probably won’t be accurate for many respondents. It’s still an assumption - but it’s a better-informed assumption than assuming a respondent’s income is the average of all respondents in the survey.

Imputation doesn’t always work well, because people who fail to answer one question often fail to answer others too. So even with imputation, there are usually still some missing values. But as long as less than about 5% of respondents don’t answer a question, the assumptions you make don’t usually cause the results to vary much.

Sometimes it’s easier to ask a question in a very detailed form - but you may not want to report the answers in such detail. A good example of this is when respondents are asked their age. It’s easy for them to give an exact number of years, and it’s easy to record that answer - but you are probably not very interested in the differences between people aged 48 and those aged 49 (for example). And unless the survey sample is very large, there may only be a few respondents of each age.

It’s easier to deal with broad age groups: sampling fluctuations tend to cancel out, and you can more easily draw conclusions. So individual ages can be recoded into age groups: usually between 3 and 10 of them.

The advantage of gathering detailed figures, then recoding them, is that you can change the recoding if some groups later seem to be too large or too small. Suppose you begin with 3 recoded age groups: under 25, 25 to 44, and 45-plus. After analysing the survey data, you might suspect that people over 60 are giving quite different answers from those aged 45 to 59. With a program such as Epi Info or SPSS, you can simply change the recoding, to add another age group. But if your questionnaire had only asked which of three age groups each respondent was in, you would not be able to gather more detailed information without going back to each respondent.
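Recoding with a list of cut points makes that later change trivial. This is a sketch; the cut points match the age groups in the example above.

```python
def recode_age(age, cuts=(25, 45)):
    """Return the age-group index: 0 = under 25, 1 = 25 to 44, 2 = 45-plus."""
    for i, cut in enumerate(cuts):
        if age < cut:
            return i
    return len(cuts)

print(recode_age(30))                  # 1: the 25-to-44 group
# To split the 45-plus group at 60, just change the cut points:
print(recode_age(63, cuts=(25, 45, 60)))  # 3: the new 60-plus group
```

Because the exact ages are still in the data file, no respondent ever needs to be re-contacted: only the cut points change.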

So a good principle of questionnaire design is to gather information in more detail than you think you will need - because you can always recode it later.

For a question requiring only a single answer, the numbers of respondents giving each answer will add to the total sample, and the percentages will add to 100%.

Multiple-answer questions are more complicated. There are two possible percentages: the percentage of respondents, and the percentage of answers. The percentage of answers adds to 100, but the percentage of respondents usually adds to more than 100. This is because everybody gives at least one answer (even if it is "none of the above") but some people give many answers.
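A small invented example shows why the two percentages differ:

```python
from collections import Counter

# Invented multiple-answer data: each respondent names any number of newspapers
responses = [["A"], ["A", "B"], ["B"], ["A", "B", "C"], ["none"]]

counts = Counter(paper for answer in responses for paper in answer)
n_respondents = len(responses)    # 5
n_answers = sum(counts.values())  # 8

for paper, freq in sorted(counts.items()):
    print(paper,
          f"{100 * freq / n_respondents:.0f}% of respondents,",
          f"{100 * freq / n_answers:.0f}% of answers")
# The respondent percentages add to more than 100%;
# the answer percentages add to exactly 100%.
```

Here 5 respondents gave 8 answers between them, so newspaper A was named by 60% of respondents but accounts for only 37.5% of answers.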

To complicate the matter further, there are two types of multiple-answer question: multiple dichotomies, and multiple categories.

Multiple dichotomies

A multiple-dichotomy question is one which is equivalent to a series of questions with Yes or No answers - for example "Which newspapers have you read in the last week?" Respondents will name some local newspapers, or say "none." In concept, this is the same as reading out a list of newspapers, and getting a Yes or No answer for each. (In practice, the answers won’t be the same, because naming a newspaper will help some respondents remember they read it - but in terms of question format, it’s the same thing.)

Multiple-dichotomy questions usually have their results presented as a set of variables, each with a Yes or No response.

Multiple categories

The other type of multiple-answer question is multiple categories. An example is "Please give me up to 3 reasons why you listen to FM99." Some respondents will give 1 reason, some will give 2, and others will give 3. Some possible reasons might be:

1. I like the announcers.

2. I like the programs.

3. Other people in my household prefer this station, and I listen too.

4. This is the only station I can receive.

Using multiple categories, a respondent might give answers 1, 2, and 4. With multiple dichotomies, the four reasons would be question items, e.g.

"Do you listen to FM99 because you like the announcers?"

...and each item would be answered Yes or No. So the two forms of multiple answer question produce the same information, but what is a question with multiple dichotomies becomes an answer with multiple categories.

SPSS handles both types of multiple-answer question. Epi Info is designed to deal only with multiple dichotomies, but multiple-category questions in Epi Info can easily be converted into multiple-dichotomy form. Most other statistical software, such as SAS, is not designed to handle multiple-answer questions.

There are filter questions and filtered questions: they are not the same. A filter question is one which determines whether skipping will take place after it. For example, respondents might be asked the filter question "Do you ever listen to FM99?" If they say Yes, they might then be asked some questions about the programs on FM99, but if they say No, they will skip these questions.

These questions about the FM99 programs are filtered questions. When a statistical program presents the results of these filtered questions, there will be a lot of missing cases: all those respondents who were filtered out, and were not asked the questions. This should be a fixed number of respondents, because the whole group of them should have skipped all the questions.

As well as this fixed number of missing cases on each filtered question, there may also be a variable number of additional missing cases: people who were asked the question, but did not give a valid answer.

Sometimes you may want to separate these two different sorts of missing cases: those who were not asked a question, and those who were asked but did not answer. The easiest way to do this is to foresee the problem, and allocate a separate code for "did not answer" in every group of filtered questions. This code can still be defined as missing, and you will then have two different missing values: the system-missing value (automatically allocated by your software: usually a blank space) for the skipped cases, and a specially defined missing value for those who didn’t answer the question.

Percentages for filtered questions (at least those questions requiring a single answer) will still add to 100%, but this will not be 100% of the whole sample, only of the filtered group.

Sometimes survey results can be distorted because of a combination of two factors:

- some types of person were over-represented or under-represented in the survey (sometimes deliberately), and
- those types of person gave different patterns of answers to others.

For example, many surveys under-represent people aged around 15 to 20, because they spend a lot of time away from home. Also, people in this age group tend to listen to different radio stations from most others.

Weighting is a way of compensating for these problems. It works by dividing respondents into separate groups (e.g. age groups), calculating a separate result for each group, and combining these results based on each group’s proportion in the total population - which is not necessarily its proportion in the sample.

Therefore, to weight the sample based on a particular variable, you need to know the figures for the total population. Variables such as age group are often available from Census data - usually a few years out of date.

There are several ways of producing weighted results, but with a random sample, the simplest is to calculate case weights: i.e. how many people in the population each member of the sample is representing.

As with missing values, when you use weighting you are always making an assumption - even if you don’t know it. The typical assumption is that people who don’t participate in the survey are the same (on average) as others in their group.

However "their group" can be based on the answers to any question, and weighting with the wrong variable (the wrong group) can make the results less accurate instead of more accurate. For example, even though people aged 15 to 20 often spend more time listening to radio than older people do, the 15-20 year olds who are not at home much (and therefore not included in the survey) often spend less time listening to radio than others in their age group.

So maybe the best variable for weighting is not the question on age group, but a question such as "How many days in the last 7 were you at home at this time?" However, this could not be used - because there will be no population data for the latter question. Therefore you can’t know what groups (if any) are under-represented in your sample.

In general, I suggest that you avoid weighting - unless a particular group has been deliberately under or over-represented in the sample. Consider this situation:

Two radio stations, based in the same town, agree to share the costs of a survey. One station covers only the town, with a population of 20,000. The other station covers a much larger area, with a total population of 100,000: 20,000 in the town, and 80,000 in the outlying rural area. For the survey, 100 people are interviewed: 50 in the town, and 50 in the rural area.

Because the town has 20% of the population but 50% of the interviews, combining all the survey results will produce misleading figures. Weighting could be used to correct the results. Each respondent in the town represents 400 people (20,000 divided by 50), while each respondent in the country represents 1,600 (80,000 divided by 50).

The results can be corrected by giving each town respondent a weight of 1, and each country respondent a weight of 4. In effect, each country respondent’s answers will be counted 4 times, to produce a correct audience estimate for the station covering the larger area.
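The case weights can be checked with a few lines; the populations and sample sizes are taken from the example above.

```python
# Stratum populations and sample sizes from the two-station survey
strata = {"town": {"pop": 20_000, "n": 50},
          "country": {"pop": 80_000, "n": 50}}

# Case weight: how many people in the population each respondent represents
case_weights = {name: s["pop"] / s["n"] for name, s in strata.items()}
print(case_weights)  # {'town': 400.0, 'country': 1600.0}

# The same correction expressed as relative weights: town 1, country 4
base = min(case_weights.values())
relative = {name: w / base for name, w in case_weights.items()}
print(relative)      # {'town': 1.0, 'country': 4.0}
```

Whether you use 400 and 1,600 or 1 and 4 makes no difference to percentages; only projections (discussed below) need the full case weights.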

In this situation, weighting is essential. In most other situations, it’s hardly worth the trouble. If there are problems with the balance of the sample, due to poor response rates among some groups, weighting - which is purely a mathematical procedure - can’t fix these. If the differences between weights for various groups are fairly small, weighting makes almost no difference to the results. But if the differences in weights are large, an unexpected figure from a small group of respondents - a random fluctuation which often occurs - can produce very misleading results.

Projection is a form of weighting, which is used to produce population estimates. In a random survey, you can assume that if (say) 53% of respondents say they listen to FM99 at least once a week, then 53% of the sampled population would also give that answer - within the tolerance of sampling error, of course.

But percentages can be confusing - especially when several sets of percentages, using different bases, appear in the same table. Many people can understand survey results better if percentages are converted into projections - the estimated number of people in the whole population who would give this answer.

If the surveyed population is 100,000, and the sample size was 1,000, then each person in the sample represents 100 in the population. Therefore, using a weight of 100 will convert raw numbers into projections. 53% of 1000 respondents is 530 people in the survey. As each person surveyed stands for 100 in the population, the projected figure is 53,000 - which of course is 53% of the 100,000 population.

That’s simple projection, which involves giving the same weight to all respondents. A more sophisticated form of projection varies the weight, for different groups. For example, most surveys include more women than men, because women are usually easier to find at home, and are more willing to be interviewed. If the population of 100,000 includes 50,000 men and 50,000 women, but the sample includes 450 men and 550 women, you’d calculate separate projection figures for each sex. Each man interviewed represents 111 men (50,000 divided by 450), while each woman interviewed represents 91 women (50,000 divided by 550). Let’s say that of the 530 listeners to FM99, 280 were women and 250 were men: so 50.9% of women and 55.6% of men listened to FM99.

Using these different weightings (because projection is a form of weighting) for each sex, you’d now find a projected total of 53,230 listeners to FM99: 25,480 women and 27,750 men. So after going to all this trouble, the weighting makes only a tiny difference to the original figure of 53,000.
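The projection arithmetic above can be verified directly. The weights are rounded to whole numbers, as in the text, which is why the figures are 111 and 91 rather than the exact quotients.

```python
# Population and sample sizes by sex, from the example above
men_per_respondent = 50_000 / 450     # each man interviewed represents ~111 men
women_per_respondent = 50_000 / 550   # each woman interviewed represents ~91 women

# FM99 listeners in the sample: 250 of the 450 men, 280 of the 550 women
projected_men = 250 * round(men_per_respondent)      # 250 * 111 = 27,750
projected_women = 280 * round(women_per_respondent)  # 280 * 91  = 25,480
print(projected_men + projected_women)               # 53230
```

As the text notes, the sex-weighted total of 53,230 barely differs from the simple projection of 53,000.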

The fictional results I’ve given here are fairly typical of what you’d find in a real survey. This demonstrates that weighting is usually not worth the trouble. However, projection (using the same weight for all respondents) is useful, because it makes the results easier to understand.

Be very careful when making projections from filtered questions. If only half the respondents answered a question, multiplying the percentage answers by the case weight of each respondent will produce projections that are double what they should be. Therefore projections should be based on percentages of the entire sample, not those who answered a particular question.

But beware of making projections for missing data. If you’re applying questionnaire findings to a whole population, some answers don’t make sense. For example, if 49% of your respondents are male, 49% are female, and 2% are not recorded, this doesn’t mean that 2% of the whole population have a sex of "not recorded." This is a property of the survey, not the whole population. And if you were using a quota system to ensure that you interviewed the same number of each sex, it doesn’t necessarily follow that the population also has exactly the same number. Projection only works from random data, not from quotas - unless your quotas were chosen by using census data.