It’s interesting to read through a frequency listing for a survey, and find out how many people gave each answer. This is only the beginning of survey analysis, though. For example, you might find that 34% of the respondents read your newspaper *Bekur* this week. Immediately you want to know more. "What sort of people are they?" you’ll ask. "What other newspapers do they read? How many have TV at home?"

To answer questions like these, you need to analyse two variables together. This is called cross-tabulation - or, more usually, "crosstabs".

The reason for the name is that two variables are tabulated across each other. For example, one variable is from a question asking respondents whether they have read *Bekur* this week. There are three possible answers: yes, no, and can’t remember. The other variable could be the sex of the respondent: male or female; we’ll assume they can all remember which they are. If you surveyed 200 people, the frequency distributions could have been:

Read Bekur this week? | Count
---|---
Yes | 68
No | 127
Can’t remember | 5
Total | 200

Sex | Count
---|---
Male | 98
Female | 102
Total | 200

To find out the balance of men and women who read your newspaper, you need to construct a table laid out like this:

Read Bekur this week? | Total | Men | Women
---|---|---|---
Yes | 68 | x | x
No | 127 | x | x
Can’t remember | 5 | x | x
Total | 200 | 98 | 102

We already know the marginal totals in the table: these were the frequency figures. What we don’t yet know are the 6 figures which are needed to replace the x’s in the table. To get these figures, separate counts must be made for men and women. If you are using Epi Info, the TABLES command will do this. With SPSS, it’s the CROSSTABS command. If the Epi Info variable names were BEKUR and SEX, typing in this command will produce the table you need:

TABLES BEKUR SEX

The table might look like this:

Read Bekur this week? | Total | Men | Women
---|---|---|---
Yes | 68 | 38 | 30
No | 127 | 57 | 70
Can’t remember | 5 | 3 | 2
Total | 200 | 98 | 102

If instead you typed in TABLES SEX BEKUR, the table would look like this:

Sex | Total | Yes | No | Can’t remember
---|---|---|---|---
Men | 98 | 38 | 57 | 3
Women | 102 | 30 | 70 | 2
Total | 200 | 68 | 127 | 5

This is the same information as in the first table, but shown the other way round.
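If you want to see what a command like TABLES does behind the scenes, the counting can be sketched in a few lines of Python (not Epi Info - just an illustration, with respondent records invented to match the example figures):

```python
from collections import Counter

# Hypothetical respondent records, invented to match the example table:
# each record is (answer to "Read Bekur this week?", sex).
respondents = (
    [("Yes", "M")] * 38 + [("Yes", "F")] * 30 +
    [("No", "M")] * 57 + [("No", "F")] * 70 +
    [("Can't remember", "M")] * 3 + [("Can't remember", "F")] * 2
)

# A crosstab is just a count of each combination of the two answers.
cells = Counter(respondents)

for answer in ("Yes", "No", "Can't remember"):
    men, women = cells[(answer, "M")], cells[(answer, "F")]
    print(f"{answer:15} | {men + women:5} | {men:3} | {women:5}")
```

Real statistical software does exactly this kind of counting, then adds the marginal totals and percentages for you.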

The figures in these tables are what we call __raw numbers__: the number of people who gave each answer. It’s not easy to interpret a table of raw numbers. You probably want to know things like "who’s more likely to read our paper - men or women?"

For this reason, crosstabs are usually expressed in percentages. The first table, showing percentages of each sex, looks like this:

Read Bekur this week? | Total | Men | Women
---|---|---|---
Yes | 34% | 39% | 29%
No | 64% | 58% | 69%
Can’t remember | 2% | 3% | 2%
Total | 100% | 100% | 100%
Base | 200 | 98 | 102

The line labelled "Base" shows the raw number that each 100% figure is based on. You can now see that men are more likely than women to read *Bekur*: 39% of men (i.e. 39 of the 98 men) say they’ve read it this week, compared with only 29% of women (30 of the 102 women).

What these figures do *not* mean is that 39% of the readers of your paper are men and 29% are women. If you think about it, that must be wrong, because 39% and 29% don’t add to 100%. The above table is based on __column percentages__: that is, the percentages in each column add to 100. If you want to know the sex breakdown of your readers, the percentages have to be calculated the other way, as __row percentages__:

Read Bekur this week? | Total | Men | Women | Base
---|---|---|---|---
Yes | 100% | 56% | 44% | 68
No | 100% | 45% | 55% | 127
Can’t remember | 100% | 60% | 40% | 5
Total | 100% | 100% | 100% | 200

Notice that the base now appears on each line, not in each column. You can now say that 56% of your readers are men and 44% are women.

Yet another way of expressing the same basic figures is __percentages of grand total__ (don’t call these "total percentages," which is not very clear). This time, it’s not each column or row which adds to 100%, but all the figures in the table together (except, of course, those in the Total row and Total column). This format isn’t so often used, because it doesn’t answer what are probably your main questions: "Are men or women more likely to read *Bekur*?" and "What percentage of our readers are of each sex?"
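All three kinds of percentage - column, row, and grand total - come from the same raw counts. A minimal Python sketch, using the example figures (round() gives whole percentages, as in the tables above):

```python
# Raw counts from the example table: answer -> (men, women).
raw = {"Yes": (38, 30), "No": (57, 70), "Can't remember": (3, 2)}

col_totals = [sum(v[i] for v in raw.values()) for i in (0, 1)]  # [98, 102]
grand = sum(col_totals)                                         # 200

# Column percentages: each column adds to 100.
col_pct = {a: tuple(round(100 * v[i] / col_totals[i]) for i in (0, 1))
           for a, v in raw.items()}

# Row percentages: each row adds to 100.
row_pct = {a: tuple(round(100 * x / sum(v)) for x in v)
           for a, v in raw.items()}

# Grand-total percentages: all six cells together add to 100.
grand_pct = {a: tuple(round(100 * x / grand) for x in v)
             for a, v in raw.items()}

print(col_pct["Yes"], row_pct["Yes"], grand_pct["Yes"])
# (39, 29) (56, 44) (19, 15)
```

The same six raw numbers give quite different percentages depending on which way you divide - which is exactly why a table should always say which kind of percentage it shows.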

The final way of expressing a crosstab is as a projection. If you have done a random survey, and know the population covered, you can calculate how many people in the population each respondent represents. If the population surveyed is 50,000 and you have surveyed 200, every respondent represents 250 people. So you can produce a table of projections by taking your table of raw numbers and multiplying each figure by 250. Here’s how it would look.

Read Bekur this week? | Total | Men | Women
---|---|---|---
Yes | 17.0 | 9.5 | 7.5
No | 31.8 | 14.2 | 17.5
Can’t remember | 1.2 | 0.8 | 0.5
Total | 50.0 | 24.5 | 25.5

(all figures in thousands)

Projections are normally shown in thousands - to avoid tables full of figures ending with 000. So 9.5 means 9,500 men in the population are estimated to have read *Bekur* this week. Tables of projections are often used for showing to advertisers. You can say "If you had advertised in *Bekur* last week, you would have reached 9,500 men."
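The arithmetic of a projection is simple enough to sketch in a few lines of Python, using the example figures from the text:

```python
POPULATION = 50_000   # population covered by the survey
SAMPLE = 200          # number of respondents
weight = POPULATION / SAMPLE   # each respondent represents 250 people

# Raw counts from the example table: answer -> (men, women).
raw = {"Yes": (38, 30), "No": (57, 70), "Can't remember": (3, 2)}

# Multiply each raw figure by the weight, and show the result in thousands.
projected = {a: tuple(round(x * weight / 1000, 1) for x in v)
             for a, v in raw.items()}

print(projected["Yes"])   # (9.5, 7.5) - about 9,500 men and 7,500 women
```

Note that a projection adds no new information: it is the raw table rescaled, so any sampling error in the raw numbers is scaled up along with them.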

Many statistical programs produce crosstabs that show several different sorts of percentages at the same time. This is useful for experienced survey analysts, but confusing for beginners. It’s easy to mis-read a figure. Here’s how the above figures would look in a table showing column percentages and row percentages at the same time:

Read Bekur this week? | Total | Men | Women
---|---|---|---
Yes | 34% | 39% | 29%
 | 100% | 56% | 44%
No | 64% | 58% | 69%
 | 100% | 45% | 55%
Can’t remember | 2% | 3% | 2%
 | 100% | 60% | 40%
Total | 100% | 100% | 100%

Confusing, isn’t it? I recommend (whether you’re producing a table for your own use, or writing a report): only show one type of figure in each cell of a table. If you need both row and column percentages, produce the table twice. That way, you (and your readers) are less likely to misinterpret the figures.

Significance testing

In the above example, there were 200 survey respondents from a population of 50,000 people. How do we know that, if we’d surveyed a different lot of 200 people, we’d have reached the same result? What if it was just a fluke that more men than women read *Bekur,* in this sample of 200?

This can be checked by doing a __statistical significance test__. In this case, the simplest test to use is called the chi-squared test. (Chi - pronounced KY, to rhyme with sky - is the Greek letter χ, so chi-squared is often written χ².) The chi-squared test looks for differences between expected and observed figures in a crosstab.

All statistical software will do chi-squared tests. Programs such as Epi Info print out a lot of information, most of which only statisticians can understand, but the important thing to look for is the p value. This is the probability that differences as large as those in the table would arise by chance alone, if in fact the groups did not differ. If p is less than 0.05, there are fewer than 5 chances in 100 that differences this large would appear in a sample when there was no real difference in the population - so you can be reasonably confident that the difference is real. In the above example, a low p value would mean that men and women are not equally likely to read *Bekur*.
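If you want to see what the software is computing, here is a stdlib-only Python sketch of the chi-squared calculation for the example table. (The closed-form p value used here, exp(-chi2/2), works only because a 3×2 table has 2 degrees of freedom; in general, let your statistics package compute p.)

```python
from math import exp

# Observed counts from the example crosstab
# (rows: Yes / No / Can't remember; columns: men, women).
observed = [[38, 30],
            [57, 70],
            [3, 2]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# For each cell: expected = row total * column total / grand total,
# and the test statistic sums (observed - expected)^2 / expected.
chi2 = 0.0
for row, rt in zip(observed, row_totals):
    for obs, ct in zip(row, col_totals):
        expected = rt * ct / n
        chi2 += (obs - expected) ** 2 / expected

# This 3x2 table has (3-1)*(2-1) = 2 degrees of freedom, and for 2 df
# the chi-squared p value has a closed form: p = exp(-chi2 / 2).
p = exp(-chi2 / 2)
print(f"chi-squared = {chi2:.2f}, p = {p:.2f}")   # chi-squared = 2.39, p = 0.30
```

With these example figures, p comes out at about 0.30 - so this particular sample of 200, on its own, would not be strong evidence of a real difference between men and women.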

If you have read this far, you now know how traditional analysis is done. Think of a survey data file as a big table of codes, with one line for each respondent and one column for each field. The questionnaires are answered "horizontally" (i.e. one line at a time), and analysis proceeds vertically, by considering each column in turn. With cross-tabulation, two columns are compared. It’s possible to have three-way cross-tabulation, comparing answers to three questions at the same time, but this is often confusing to interpret.

It's very interesting to read through completed questionnaires, especially if they contain many open-ended questions. Simply reading a questionnaire can give a lot of insight into the links between the various answers given by a respondent. However, when you read through 100 completed questionnaires, you become overwhelmed by the mass of data: if no two questionnaires have exactly the same answers, what sense is there in trying to classify them into groups? The easiest reaction is to go back to the standard analysis methods, considering one or two questions at a time.

Despite these problems, sitting down and reading through a heap of questionnaires can give you a better intuitive understanding of the survey results than any number of cross-tabulations and correlation matrices. Following this line of thought, I developed a method for measuring the "typicality" of respondents to a survey, then describing the most typical respondents.

With most surveys, when the most typical three or four respondents were determined, and their questionnaires examined in detail, there was a clear pattern of similarities between all of them.

With some other surveys, the most typical few respondents gave totally different sets of answers. In these cases, it seemed that the whole population being surveyed had some natural split into a number of quite different groups or segments. In some cases, this was confirmed when I ran a segmentation analysis (which divides respondents into segments which are as similar as possible in their answers), then found the most typical members of each segment. Within a segment, the most typical respondents were usually very similar to each other in their patterns of answers.

How to calculate typicality

To work out which are the most typical respondents in a survey, you need to calculate a score for each one. The simplest way to do this is:

Method A

1. Produce a frequency distribution for each demographic variable.

2. For each demographic variable, find the __modal__ (most commonly given) answer.

3. Now go through the questionnaires, one at a time. For each respondent, count how many modal answers that person gave.

4. The number of modal answers is that respondent’s typicality score.

Method A, though simple, does not produce consistent results. Giving each respondent one point for each modal answer and no points for each other answer (which is what this method involves) has many problems. If 51% of your sample are women and 49% are men, all the women will get 1 point added to their typicality score and the men will get none. But if you repeated the survey, and this time interviewed 49% women and 51% men, only the men would be counted as typical - because of a tiny change in the sample. For this reason, don’t use Method A. I have described it only because its simplicity allows a clear explanation of typicality.

Another problem arises with age groups - or any other variables with artificial boundaries. Differences related to age are larger among young people than old people: a 5-year-old is much more different from a 15-year-old than a 55-year-old is from a 65-year old. For this reason, age groups in surveys often become gradually larger: e.g. 15-19, 20-29, 30-44, 45 and over. So if there are more people in the 45-plus age group than any other, does it mean that people aged over 45 are the most typical? Of course not, because the groups are arbitrary. Whichever age group your survey has placed a person in has no effect on that person. So if you are using age groups to determine typicality, I suggest making each group with an equal number of years: e.g. 10-year groups of 15-24, 25-34, 35-44, 45-54, 55-64.... The same applies to other questions with a wide range of numerical answers, such as income.

If you want to make a typicality analysis, use Method B, which produces much more consistent results than Method A:

Method B

1. Produce a frequency distribution for each demographic variable.

For variables with many different answers, these answers should first be recoded into no more than about 10 groups, with ranges of equal size (e.g. 10-year ranges of age).

2. For each demographic variable, find the modal answer. This is worth 1 point.

2a. For each other answer, the number of points is the percentage who gave that answer, divided by the percentage who gave the modal answer. (This will always be less than 1 point.)

3. Now go through the questionnaires, one at a time, and calculate a typicality score for each respondent.

For each respondent, for each variable:

For modal answers, add 1 point.

For other answers, add the appropriate fraction of a point, as calculated at 2a.

4. The respondent’s typicality score is the sum of all the points added for him or her.

The main difference with Method B is step 2. For example, if 51% of respondents are women, each woman has 1 added to her typicality score, and each man gets 0.96 (49 divided by 51).

The advantage of Method B is that it gives each variable an equal importance, with a maximum score of 1. Where several proportions are about equal (as in the case of men and women), the larger group gets only a slightly higher number of points.

You may wonder: why not simplify the calculation, and just add 0.51 for women and 0.49 for men? The answer is that it would make variables with close to 50-50 splits more important than variables with a large number of different answers, where the commonest answer might apply to less than 10% of the total sample.

Method B is not easy to calculate manually: it’s much less effort to use a computer program. As measuring typicality in this way is my invention, no special software is available for calculating this. However it's not difficult to do, using any statistical software with a command language. With Epi Info, for example, you’d write commands like this, one for each answer to each question:

IF SEX="F" THEN TYPICALITY=TYPICALITY + 1

IF SEX="M" THEN TYPICALITY=TYPICALITY + 0.96

When the final typicality score is calculated for each respondent, it is written to the data file, as a new field. You can then sort the file in order of typicality, and browse through the answers of the top and bottom scorers.
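Method B is easy to express in a general-purpose language too. Here is a Python sketch that computes the scores for a whole (invented) data file at once, instead of writing one IF line per answer:

```python
from collections import Counter

# Hypothetical respondents, one dict per person (demographic answers only).
respondents = [
    {"sex": "F", "age_group": "25-34"},
    {"sex": "F", "age_group": "35-44"},
    {"sex": "F", "age_group": "25-34"},
    {"sex": "M", "age_group": "25-34"},
    {"sex": "M", "age_group": "45-54"},
]
variables = ["sex", "age_group"]

# Steps 2 and 2a: the points for an answer are its frequency divided by
# the frequency of the modal answer (so the modal answer is worth 1 point).
points = {}
for var in variables:
    freq = Counter(r[var] for r in respondents)
    modal_count = max(freq.values())
    points[var] = {answer: count / modal_count for answer, count in freq.items()}

# Steps 3 and 4: a respondent's score is the sum of points over all variables.
for r in respondents:
    r["typicality"] = sum(points[var][r[var]] for var in variables)
```

With this tiny data set, the three women score 1 point for sex and the two men score 2/3; sorting on the new typicality field then shows the most and least typical respondents, just as described above.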

Variations on Method B

1. I’ve suggested calculating typicality using demographic variables. But you can base a typicality score on any variables (except open-ended ones in which almost all respondents give different answers). For example, if you ask a set of questions about your TV programs, you can calculate typicality scores for that set of questions, and discover the viewing patterns of your most typical audience members.

2. If some variables are more important than others, you could change the maximum of 1 point per variable. Maybe the most important variables should get 2 points for a modal answer.

3. Separate typicality scores can be calculated for subgroups of the sample. For example, you can divide the sample into people who are and are not members or your audience. Then work out two typicality scores for each respondent: how typical he or she is of (a) your audience, and of (b) your non-audience. If you’re interested in increasing your audience, you could find out what types of people you might attract by looking at those who don’t use your service, but have high typicality scores for your audience.

Using typicality scores

When you have calculated a typicality score for every respondent in a survey, the next step is to pick out the most typical and least typical few (3 of each group is usually enough), find their questionnaires, and describe them in detail. Here’s an example, from a survey of radio listeners in Canberra, Australia:

The most typical listener to radio 2XX was a woman aged over 60, retired, who left secondary school aged under 16 but ended up with a degree or diploma.
She listened to 2XX in the breakfast session, and often from 4pm to 6pm. She sometimes listened to 2CA (commercial) and 2CY (classical music and talks). Her main reasons for listening to 2XX were its convenience ("it has what I want to listen to, at a time when I want to hear it"), the music it played, and its news and information programs. She was very satisfied with 2XX, and had listened to the same station several years ago. Program types she most often sought on radio were background information, news, spoken programs "to really listen to," and "pleasant music." She preferred an even mixture of talk and music, and liked the majority of music played on 2XX. Her musical preferences were "some of everything except punk rock". She was unable to name any favourite musicians or recording artists, saying they were too numerous to mention.

If answers to some questionnaires are to be picked out and included in reports, this raises some questions about confidentiality. Even if a respondent's name is not given, enough information may be included on a questionnaire for any readers of the report who know that respondent to be able to identify him or her. This is a serious breach of respondents’ privacy, and you should take steps to prevent such identification. The easiest way to do this is by making categories fairly vague. For example, don't list the respondent's actual age, but express it as an age group in a 10-year range. Similarly, be a little vague about occupation and place of residence, if these are included in the report, and don't include in the report any specific personal information volunteered in open-ended questions.

A test of the stability of typicality scores is whether the people who are calculated to be the most typical (based on one set of variables in the survey) also share answers on a lot of variables which were not included in the typicality score. This usually turns out to be the case.

Though the above methods of calculating typicality totally ignore open-ended answers, it is those verbatim answers which help to give the clearest picture of each respondent as a person. It is usually much harder to visualize the most typical respondents in surveys with few or no open-ended answers.

Reporting on typicality is no substitute for normal survey analysis - but finding and describing the most typical questionnaires in a survey can give the readers of survey reports a much better understanding of the reasons why respondents act and think as they do.

Analysing a survey by hand

Not many surveys are analysed manually these days, even in developing countries, but manual analysis can be quicker when:

- samples are small (no more than a few hundred), and
- questionnaires are short (one sheet of paper), and
- most questions are multiple-response.

Manual counting works better if questionnaires are printed on fairly thick paper (i.e. not very floppy), and answer codes are shown on the questionnaires.

There are two possible ways to hand-count questionnaires:

1. to go through each questionnaire once, noting on a separate piece of paper the answers to every question.

2. to go through each question one at a time, counting questionnaires.

In theory, method 1 should be faster - but in practice, it’s not, because it’s too easy to make a mistake. If you are interrupted while doing this, you are almost certain to lose count, and you’ll have to start again. In practice, method 2 is best: one heap for each possible answer to the question you are currently tabulating. Though this means that you have to re-sort the questionnaires for every question, it also means that any errors are easily discovered - and, with hand-counting, there will always be a few errors.

For example, to hand-count the results of the question "Which sex are you?" you need to make space for 2 heaps (or possibly 3, if the sex of some respondents wasn’t noted). Sort the questionnaires into male and female heaps (and maybe a not-stated heap), then count the number in each heap. As you count each questionnaire, double-check that it’s in the right heap. The total of all the heaps should be the known total number of questionnaires. If it isn’t, count again.

After tabulating the responses to each question (e.g. 81 male, 119 female, 2 not stated), take the next question, and create a new set of heaps.

To make a manual count for a multiple-answer question (such as "Which languages do you understand?") is not so easy, because people who gave two answers should be in two heaps - but they only have one questionnaire. The solution is to make a separate count for each possible answer, with two heaps for each answer: those who gave that answer (e.g. understand English) and those who did not (e.g. do not understand English).
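In software, multiple answers cause no such difficulty: you simply keep one count per possible answer. A Python sketch, with invented language data:

```python
from collections import Counter

# Hypothetical multiple-answer data: the languages each respondent understands.
answers = [
    ["English"],
    ["English", "Swahili"],
    ["Swahili"],
    ["English", "Swahili", "French"],
    ["Swahili"],
]

# One separate count per possible answer: those who gave it, and those who didn't.
understand = Counter(lang for langs in answers for lang in langs)
for lang in ("English", "Swahili", "French"):
    yes = understand[lang]
    print(f"{lang}: {yes} understand, {len(answers) - yes} do not")
```

Note that the counts can add up to more than the number of respondents - which is exactly why a single set of heaps doesn't work on paper.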

If you want to do manual counting of a questionnaire that takes up more than one piece of paper, this is possible, but each questionnaire will have to be folded so that you can always see the answer to the question you’re counting at the time.

Cross-tabulation by hand

For this, you need not just a row of heaps of questionnaires, but a matrix of heaps. If one of the two questions being cross-tabulated has 3 possible answers, and the other has 5, you’ll need to make space for 15 heaps: 5 across, and 3 down, or vice versa. If your table isn’t large enough, you can do this on the floor - as long as there’s no wind!

A plan for analysing a survey

When you’re analysing survey data, there are so many things you can do. Where do you begin? When do you stop? To make the task easier, here’s a suggested plan for analysing a survey.

1. Check the coding and data entry

1.1 Browse through the data file, using the BROWSE command in Epi Info, or scrolling through the spreadsheet in SPSS. Look for strange patterns in the data - omissions that seem wrong, letters where every other case has numbers, cases that are only partly entered. Be suspicious.

1.2 For every variable which should be present in every case (e.g. date of interview), check that it’s not blank.

1.3. Run descriptive statistics for each numeric variable. Check the smallest and the largest values. Do these make sense? If not, find out which cases have these extreme values, and check the questionnaires for these, to make sure that there have been no errors in data entry. If there are errors, fix them, re-run the statistics for that variable, and check it again.

1.4. Run frequencies for each coded variable - look for wild codes. If you find any, go back and check those questionnaires.

1.5 Check the total number of people answering each question which could be skipped. Are these numbers consistent? If not, investigate.

1.6 If any answers have been entered verbatim, check the consistency of these. Are some in upper-case letters and some in lower-case? Are some abbreviated and others spelled out in full? If so, make them consistent.
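Checks like those in steps 1.3 and 1.4 can be automated. A Python sketch, with invented legal codes and ranges (and one deliberately bad case):

```python
# Hypothetical coded data after entry: one dict per case.
cases = [
    {"id": 1, "sex": "F", "age": 34},
    {"id": 2, "sex": "M", "age": 29},
    {"id": 3, "sex": "X", "age": 217},   # deliberately bad, for the checks
]

LEGAL_CODES = {"sex": {"M", "F"}}    # step 1.4: legal codes per coded variable
RANGES = {"age": (15, 99)}           # step 1.3: plausible range per numeric variable

problems = []
for case in cases:
    for var, legal in LEGAL_CODES.items():
        if case[var] not in legal:               # wild code
            problems.append((case["id"], var, case[var]))
    for var, (low, high) in RANGES.items():
        if not low <= case[var] <= high:         # extreme value
            problems.append((case["id"], var, case[var]))

for problem in problems:
    print("check questionnaire:", problem)
```

Each entry in the problem list points back to a questionnaire to pull out and re-check against the data file.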

2. Analyse each variable separately

In this phase, take each variable in turn, and think about what the results might mean, in practical terms.

2.1 Get descriptive statistics (e.g. averages and ranges) for all numeric variables. Are these what you could expect? Compare answers with earlier surveys - if possible. Are averages too high or too low because some respondents have been mistakenly included or excluded?

2.2 Get frequency distributions for all coded variables. If any variable has too many codes to make sense of (more than about 20) recode that variable, grouping similar answers.

2.3 For groups of questions with the same possible answers, summarize the answers for each group of questions on a single piece of paper. Compare results for different items in the set. Mark the highest and lowest figures in each column and each row. (Hint: do this in pencil first: it’s easy to miss a high or low figure.)

2.4 For questions copied from other surveys or a census, compare this survey’s results with the earlier data. Ignore differences that are less than about 5% - these are probably sampling error. Think very hard about differences of more than about 20%.

2.5 Consider: these are the results from the whole sample - but if the sample was divided into several groups (based on the answers to some other question) might the results from each group be quite different? For example, perhaps you might expect men and women to give very different answers.

2.6 Read through the full transcript of each verbatim variable, and consider what other variables might explain the differences in these answers. What sort of people are giving what sort of answer?

3 Analyse several variables at a time

3.1 If there are 100 variables in the survey, and you analyse every possible combination, you could produce 4950 tables of figures. Too many! So for each variable, consider what other variables should be looked at with it. Consider which are the most important questions in the survey. If the survey was done for a particular medium (e.g. a newspaper), compare readership of this newspaper for all demographic variables and for other media.

3.2 Run crosstabs (2-way tables) for each pair of variables you’ve chosen. Do a chi-squared significance test for each table, and look at the p value. If it’s more than .05, there’s no clear evidence of a relationship between the two variables, so don’t bother investigating this pair further. If it’s less than .001, there’s very strong evidence of a relationship. Consider what this might mean in practical terms: convert it into words, and explain it to somebody.

Had enough yet? If you’re new to survey analysis, all this could easily take a week, non-stop. There’s still a lot more you could do, but you need a good knowledge of statistics and computers to do it. As this book’s not intended for experts, I won’t go into any more detail here - but if you’re interested, I suggest you take a course in statistical analysis. Alternatively, you could read some books on statistics - but most people find it extremely difficult to learn statistics from a book. You learn statistics by doing it, not by reading about it, so I shan’t try to explain it here.