Audience Dialogue

Know Your Audience: chapter 5, part A
Data entry and data cleaning

Now you’ve done a survey, and the questionnaires are all filled in - what happens next? This chapter is about processing completed questionnaires: analysing them, and reporting on the results.

Even in developing countries, most surveys are analysed by computer these days, so this chapter focuses mainly on computer analysis. In case you don’t have access to a computer, I’ve also included a section on manual analysis - which for a small survey can be quicker than computer analysis.

Note, this content is now a bit dated, with the advent of online survey processing one example, and we'll try to update key areas so it remains useful and relevant. 

Safeguarding completed questionnaires

Take care of them! This is the point in a survey where the information is most vulnerable. Except for telephone surveys done in a single office, completed questionnaires will be transported from each interviewer to the survey office. If the questionnaires completed by an interviewer are lost, all that person’s work will be wasted, and it will cost a lot of money to repeat those interviews. It could also delay the survey results.

Therefore, use the safest possible methods of getting the completed questionnaires from each interviewer to the survey office. If the postal system is unreliable, and the survey did not extend over a very large area, it’s a good idea for each interviewer to bring their completed work back to the office. For surveys extending over a larger area, the interviewers can deliver the complete questionnaires to their supervisors, and the supervisors can bring the questionnaires to the office.

When there’s a high risk of losing questionnaires, it’s even advisable to copy the questionnaires before returning to the survey office.

Debriefings

The end of a survey is be a good opportunity to ask the interviewers or supervisors about any problems they found with the survey methods or the questionnaire. I’ve found it helpful to hold debriefing meetings, where 5 to 20 interviewers or supervisors bring back their completed questionnaires at the same time, then discuss the survey, and how it could have been improved.

Many survey organizers don’t hold debriefings, partly because they believe the interviewers are uneducated or ill-informed, and have little to offer. But interviewers and supervisors can gain a very detailed knowledge of how respondents react to interviews, and without using this knowledge, survey methods can’t be fully improved. The only situation where debriefings aren’t useful is when you do the same survey over and over again. Even then, it’s helpful to hold debriefings occasionally.

Debriefing meetings can be held using the consensus group method (described in another chapter). The findings are recorded, so that they can be referred to the next time a survey is done.

Storing questionnaires

As questionnaires come back from each interviewer, their arrival should be recorded (e.g. on a wall chart). The questionnaires are then put away (e.g. in cardboard boxes) in order of their serial numbers. All questionnaires should have serial numbers.

With mail surveys, the completed questionnaires will arrive back in any order. Sometimes it’s helpful to give each returned questionnaire a new serial number: the first questionnaire returned is number 1, the second is 2, and so on. It’s also useful to have a date stamp, and to stamp on each questionnaire the date when it was received. This enables you to analyse whether there’s any systematic difference between questionnaires returned early and those returned late, and to make some estimate of what types of people don’t mail back their questionnaires at all.

The danger period for questionnaires runs from the time each questionnaire is completed, to the time when it has been entered on a computer file (for computer analysis), or when the analysis is completed (if analysis is manual). During this period, if the questionnaire is lost, all the effort of that interview will be wasted. So at this time, the questionnaires should be kept in a safe place, such as a locked room.

Checking and editing

Though completed questionnaires should already have been checked by interviewers and supervisors, they need to be checked again before (or during) data entry.

What to check

Every questionnaire needs to be thoroughly checked:

All standard items at the beginning or end of a questionnaire should be filled in. They usually include:

These are not questions asked of the respondent, but information supplied by the interviewer. If the interviewer forgot to include something here, the supervisor should have noticed, and made sure it was added. But sometimes newly trained supervisors don’t notice these omissions. They sooner these problems are found, the more easily they can be corrected.

Recoding frequent "other" answers

It’s annoying to read a survey report and find that a large proportion of the answers to a question were "other". The goal should be to make sure the "other" category is the one with the fewest answers - certainly no more than 5%. Take for example this question:

"Which languages do you understand?"
(Circle all codes that apply)
1 Amharic
2 Oromiya
3 Tigrinya
4 English
5 Other - write in: ......................

If 10% of people gave an "other" answer, the written-in responses will need to be counted. If 4% of people understood Arabic, and 3% understood Agew, two new codes could be created:

6 = Arabic
7 = Agew

For each questionnaire mentioning these languages, the circled 5 should be crossed out (unless a different "other" language was also mentioned), and 6 and/or 7 written in and circled. This should reduce the remaining "other" figure to about 3%. (It doesn’t matter that the code for "other" is no longer the highest number. Code numbers have only arbitrary meaning in this type of list. See below, about nominal variables.)

Unless at least 2% of respondents give a particular "other" answer, it’s usually not worthwhile to create a separate code. Sometimes a number of "other" answers can be grouped, e.g.

8 = Somali languages

But when such a code has been made, there is no way to recode the question except by going back to all the questionnaires with that code. The principle should be not to combine any answers which you might later want to look at separately.

Coding open-ended questions

With some open-ended questions, you expect to find many answers recurring. For example: "What is your occupation?" There will be some occupations which are very common, some less common, and there will probably be a lot of occupations which only one respondent in the sample mentions. With other open-ended questions (such as "What do you like most about listening to FM99?") you may find that no two respondents give the same answer.

For both types of question, the standard coding method is the same: you take a sub-sample of answers to that question - often the first 100 answers to come in. (That may be a lot more than 100 questionnaires, if not everybody is asked the question.)

Each different answer is written on a slip of paper, and these answers are then sorted into groups with similar meanings. Usually, there are 10 to 20 groups. If fewer than 2 people in 100 give a particular answer, it’s not worthwhile having a separate code for that answer - unless it has a very specific and different meaning from all the others.

Having defined these 10 to 20 groups, a code number is then assigned to each. Following the example of what people like about FM99, these codes might be assigned.

01 = like everything about FM99
02 = like nothing about FM99
03 = the announcers in general
04 = the programs in general
05 = the music
06 = news bulletins
07 = talkback
08 = breakfast program
09 = Eugene Shurple
10 = other

A practical problem with such a coding scheme is that, the more codes are defined, the more likely some are to be very similar, and the coders may not be consistent in assigning codes to answers.

When consistency is very important, any codes which are not absolutely clear should be allocated by several coders working together, or by a single supervisor. As new comments are found, which are not covered by the original codes, new codes will need to be added.

There are many ways in which an answer can be given a code - what is most useful depends on any action you might take as a result of the survey. If there are particular types of answer you are looking for, you could create codes for these. For example, if a station broadcasts programs in a particular language, that language should be listed as a code. Even if no respondent understands that language, this in itself is useful information.

For open-ended questions with predefined answers (such as occupations) there may be no need to build a coding frame by looking at the answers. For example, occupation coding is often done using the 10 major groups from the International Standard Classification of Occupations. See the Question Library for a list of these.

That’s one way to code open-ended questions. It works well for questions with a limited number of answers, but for questions on attitudes, opinions, and so on, counting the coded categories lose much of the detail in the answers. Another approach is to use the whole wording of the answers - e.g. by entering the verbatim answers on a computer file. The coding can then be very simple, and summarize the exact wording. I’ve used coding schemes such as...

0 = made no comment
1 = made a comment

or...

1 = positive or favourable comment
2 = negative or unfavourable comment
3 = neutral comment, or mixed positive and negative.

These very broad coding schemes are much quicker to apply, and less dependent on a coder’s opinion. But the broad codes are not very useful, unless you also report the exact wording of comments.

DATA ENTRY

When a questionnaire has been checked and edited, and codes have been written in for the open-ended questions, the questionnaires is ready to be entered into a computer. (If you don’t have a computer, see the section at the end of this chapter on manual analysis of surveys.)

Principles of data entry

Data entry is a computerized way of filling in a large table of codes. Each row or line of the table represents one respondent or questionnaire, and each column represents one field.

A field is a space for entering the answer to a simple question. By a "simple question" I mean one with only one answer from each respondent, such as "Which sex are you?" A question with several possible answers, such as "Which languages do you understand?" would have several fields, either one field for the first language mentioned, another field for the second language, and so on - or else one field for each language listed in the questionnaire.

Instead of typing into the computer the full answer to each question, codes are typed in (e.g. 1 if the respondent is male, 2 if female). In each field, the codes have separate meanings, so in the field for "Which sex are you?" 1 could mean Male, while in the first field for "Which languages do you understand?" 1 might mean English.

Codes are usually numerical, because this is what most statistical software expects, and using only numerical codes makes data entry faster. But with some software it’s also possible to use alphabetic codes (e.g. M for a man and W for a woman). These are easier to remember, and therefore tend to have lower error rates.

Survey data is entered into a computer one questionnaire at a time - often working from left to right across a single row or line on the computer screen. But the data is usually analysed one question at a time - i.e. picking out a single column. Here’s an example of a data table from a very small survey, with only 5 respondents and 2 questions, the second question with two fields. And of course, questionnaires always have ID numbers:

ID Sex 1st Language 2nd Language Age
1 M F - 25
2 W G H 18
3 W I E 43
4 W F -  
5 M E F 70

In the Sex field, M means a man and W means a woman. In the Language fields, let’s say E is English, F is French, G is Greek, H is Hausa, and I is Italian.

By looking down the columns (fields) we can see that this survey of 5 people included 2 men and 3 women.

The most commonly understood language was English: 3 people understood this language. 2 people understood French, and 1 person each understood languages with codes G, H, and I. Dashes in the 2nd Language field mean that respondents 1 and 4 understand only one language. Notice that we have only two language fields here, so if anybody understood a third language, the data file could not record this.

The Age field is obvious, but notice that one woman didn’t give her age. No matter how thoroughly the answers are checked, sometimes a respondent simply refuses to answer a question.

Choosing data entry software

Data entry can be done on a wide variety of computer programs - so how do you choose one? I suggest two main principles:

1. Use software that speeds up data entry and minimizes errors.
2. Have a thorough knowledge of the software you are using.

Several types of software can be used: spreadsheets, databases, and statistical programs.

Using spreadsheets for data entry

Many beginners enter survey results into a spreadsheet program, such as Excel or Lotus 123. This can be done using one row for each respondent, and one column for each field. However, I don’t recommend using a spreadsheet: it’s far too easy to make a mistake, it’s slow, and there’s no built-in checking for valid values. Though it is possible to build in checking, only expert users of spreadsheets would know how to do this.

Using database software for data entry

Databases are more difficult to set up than spreadsheets, but they guard your results better: files are saved automatically, and exact values can be checked. Some common database programs are Filemaker Pro, Access, and Foxpro; there are also many others.

However, these programs are designed mainly for accounting, and don’t handle survey data easily. The most basic types of survey analysis, such as frequency counts, usually can’t be done by database programs - or not without considerable difficulty, unless you’re an expert user of the program.

Using statistical software for data entry

Professional survey organizations mostly use statistical package programs. These are called "package" programs because they do many things, and one of those things is data entry.

The best known statistical program is SPSS (Statistical Package for the Social Science). This is widely used in universities and research companies around the world. It’s not particularly easy to use (specially if you aren’t trained in statistics), and it’s not available in many languages, and it’s very expensive - from about 1,000 to 10,000 US dollars, depending on which parts of it you buy. However there are many people who know how to use it, and if you don’t do your own data analysis, perhaps you can find somebody in a nearby university who will analyse the survey for you, using SPSS.

The basic module of SPSS includes spreadsheet-like data entry, but without much checking built in. There’s also a module specifically designed for data entry, but it costs a lot more, and is said to be difficult to set up properly (I haven't used it).  I found a useful site where you can learn how to use SPSS, it features several handy resources, and has plenty to help you with survey research generally.

Other widely-used statistical packages - similar to SPSS - include SAS, Statistica, Statview, BMDP, The Survey System, Powertab, Perseus Survey Solutions, and various others. Most of these are American, have been around for at least 10 years, and add new features every year or two. This makes them more versatile, but also steadily more difficult to learn.

The statistical program I recommend for developing countries is Epi Info. This was commissioned by the World Health Organization, and is designed for epidemiologists, but is also very useful for audience surveys: epidemics spread from person to person in much the same way as information does, so the epidemiological measures are also suitable for audience research.

Some of the many advantages of Epi Info include:

The main disadvantages of Epi Info are:

I’m referring to Epi Info version 6. A new version, called Epi 2000, has just been released (late 2000) and I haven't yet had time to try it with a real survey. Epi 2000 is more complex than Epi Info 6, and requires a powerful computer, with Windows 95 (or a later version of Windows). Though Epi 2000 seems to be more difficult to use than the earlier version, its output looks easier for non-experts to understand.

What about data-entry software?

Why not use data-entry software for data entry? Though this was possible ten years ago, most of the specific data-entry programs either went off the market, or grew up into database programs. Data entry and data analysis go together so closely that it’s not worthwhile to produce separate programs - using a single program means that you only need to define the data once. And if you find an error when analysing the results, you don't need to go to another program to correct it.

Using word processing software for data entry

I don’t recommend using word processors (or text editors) for coded data entry. There’s no checking at all, and it’s all too easy to get the fields out of alignment.

However, if you want to enter the complete text of comments, other programs make this very difficult. For example, Epi Info has a maximum field length of 78# characters, and many comments are longer than that.

*483 Q15 With a word processor, you can enter a comment of any length, like this, as a single paragraph.

In this example, the *483 at the beginning means "questionnaire no. 483" and the Q15 following that means that this comment was given as an answer to Q15. You can either do it this way, or have a separate file for each question. If you put the question number first, and have a single file, and are using a powerful word processor such as Microsoft Word, you can sort all the paragraphs in a file into question order.

See the chapter on Informal Interviews for more detail on this.

Which type of software to choose for data entry

None of this software is what I’d call user-friendly. Whatever type of software you choose, if you haven’t used it before and you now plan to use it to process your first survey, you’ll be learning two things at once: using the software, and survey analysis. Most people find this very difficult - so I suggest you either

1. find an expert to use the software for you, or

2. learn the software thoroughly before doing the survey, or

3. use software you already know well - even if it’s not particularly suitable for survey analysis - e.g. a spreadsheet or database. (But if you use a spreadsheet or database, you need to know it very well - e.g. able to write macros - and to be certain that they are correct.)

The best type of software to use for survey analysis is software designed for surveys or statistical analysis Though it takes at least a week to learn to use these packages, you’ll find it much easier to analyse your second survey.

Differences between research, computer, and statistical terms

The varying words given to the same concepts often confuse novice researchers. Here are some simple explanations.

Questions, fields, and variables

A field is a space to be filled - like a column in a table. For example, the first field in most survey data files is the questionnaire ID number. (Field, in the computer sense, has absolutely nothing to do with fieldwork.) Each field contains one variable: you can’t have two variables in one field, or vice versa.

A question with one possible answer also occupies a field, and corresponds to a variable. But a question with multiple answers (e.g. "which languages do you understand?") will have a number of fields and variables. If a question has up to 6 possible answers, it will have 6 variables in 6 fields - though for respondents who have given less than 6 answers, some of these variables and fields will be blank

When you describe a questionnaire, you refer to questions. When you’re doing data entry, or planning computer files, "field" is the term to use. But when you’re analysing the data, using a statistical program, a field becomes a variable.

Answers, codes, and values

An answer belongs to a question on a questionnaire, a code occupies a field in a computer file, and a value belongs to a variable. All three are much the same thing.

Questionnaires, respondents, interviews, records, and cases

Again, these are really the same thing, but the name changes with the context. An interviewer will speak of an interview producing a questionnaire, a computer programmer will call the information from one questionnaire a record (when it’s on a computer file), and a statistician will call it a case.

Of course, there are also respondents and interviews: usually there’s one respondent per questionnaire, and one questionnaire per interview. Don’t make the common mistake of calling a questionnaire or interview a "survey": the survey is the whole project.

When shown on a computer screen, one record will usually correspond to one line - though this will often be too wide to see on the screen all at once.

INITIAL CHECKING OF DATA FILES

More checking? Haven’t you just done a whole lot? Yes: this checking is different. After all the questionnaires have been entered into a file, a computer can find problems that humans usually miss. The main principle is that humans should do what humans are best at (e.g. seeing patterns in data) and computers should do what computers are best at (mostly clerical work).

So when all the data is entered, computers can easily check for wild codes, extreme values, slides, and consistency.

Wild codes and extreme values

The first step in analysing a survey is to go through each variable in turn, checking its frequency distribution.

If the question was "Which sex are you, male or female?" and the possible answers are 1 for male and 2 for female, these should be the only values for that variable - except perhaps for a few blanks, if some respondents did not have their sex noted by the interviewer.

If a variable which should be blank, 1, or 2 includes values such as 18, -1, or M, these are wild codes which need to be looked into. What you need to know is the ID number of each questionnaire with an invalid value. Some statistical programs include an option to automatically show the ID numbers of records with wild codes.

You can then go straight to that questionnaire (if they are stored in order of ID number), find out what the answer should have been, and fix the error in the data file. If it’s not an error, you may need to create a new code.

A good data-entry program will prevent a lot of errors. For example, if the code is 15, but the data-entry operator tries to enter the 1 as a lower-case L, the program will make the computer beep, and an error message will appear, such as "only numeric codes allowed here".

A program such as Epi Info will let you specify that a particular field can contain only 1, 3, 9, or blank. It can even check codes based on answers to other questions: e..g. detecting any children born to people who were recorded as male.

What most data entry programs will not do is warn the user when unlikely (but possible) codes occur. For example, if a respondent’s age is shown as 99, this may be true, but it may also be a mistake. Therefore it’s not only wild values that need to be checked. The first frequencies printout from a survey program needs to be looked at very carefully to detect this kind of mistake.

People who are inexperienced with surveys - interviewers as well as data entry staff - usually make some mistakes. It’s very unusual to find a survey with no wild codes and a perfect data file. If 500 questionnaires have been returned, and each questionnaire has 100 variables, that’s 50,000 data items. If only 1% of these are in error, there will be 500 errors to find. This could take several days’ work.

Slides

One of the worst types of error is a slide. This occurs when the data entry operator leaves out a field, or puts in the same value twice. It’s specially common when there’s a long string of similar codes. For example, if a question asks about radio stations listened to, if there are 20 possible stations and each is coded 0 if the respondent doesn’t listen and 1 if he or she does, a sequence of codes to type in might be something like this:

0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 0 1 0

If the data entry operator misses the third 1 in the third group, and types:

0 1 1 0 1 1 0 1 1 0 1 1 0 1 1 1 0 1 0

the sequence will be one code short. If the answer to the next question was 1, the error can’t be automatically detected. It may be many questions later when a wild code occurs - and every value, from the mistake to the wild code - will be wrong.

Most questionnaire writers use code 1 as the first for each question - and some statistical programs automatically make it so. To make it easy to detect slides, I suggest you vary the code ranges of neighbouring questions, specially after long sequences with the same answer codes. If the question immediately after the above group of 20 did not have 0 or 1 as a valid code, any slide would be detected much sooner, probably even by the data entry operator when he or she tried to enter an invalid 1 or 0.

Consistency checking

Another part of preliminary analysis is cross-tabulating pairs of variables to hunt out impossible combinations. For example, if you ask the interviewer to write in the day of week and the date, you can check whether the 10th of May 2000 is always shown as Wednesday. With diary surveys, it’s very important to get this right, as people’s activities can vary greatly on different days of the week. This sort of information can also be checked against the interview logs, if there’s any doubt about the correct answer.