Audience Dialogue

Know Your Audience: chapter 2, part B
Sampling: choosing areas

7. Choosing the sampling unit

Now you need to choose your sampling unit: what will you sample? It seems obvious at first: your sample will be people, because only people can be interviewed.

In fact, it’s not that simple, especially with door-to-door surveys. Most door-to-door surveys begin by sampling dwellings. (A dwelling is the place where the household lives: households are people, dwellings are homes.) Dwellings are easier to find than people: they don’t move around. Even if you make your initial sample from a list of people, such as an electoral roll, you’ll find that some people have moved since the list was compiled. It’s much easier to sample dwellings, and then, as a second stage, interview the people who live in those selected dwellings.

Sometimes it’s more appropriate to sample households than people. For example, a few years ago I organized an Australian survey about media usage. Part of this survey asked about the types of media equipment that were available in households. In each household, the interviewers asked for the person who knew most about technology. This person was then asked questions such as "How many radios in this household can receive FM programs?" The average numbers reported in the survey were then applied to the whole population of Australian households. We were able to make statements such as "there are between 29 and 31 million FM radios in Australia."

When the sampling unit is people, some parts of the population are usually excluded. Usually, children below some minimum age are excluded - because they don’t do the activity the survey deals with (e.g. reading newspapers), and also because interviews with children must be done differently. Normal questionnaires are usually too difficult for them. Depending on the subject of the survey, the minimum age is usually between 10 and 18 - most commonly, 15. Children under 10 seldom listen to radio, or read newspapers, so there’s no problem excluding them if this is the subject of your survey. But children as young as 2 watch TV, so any TV survey that does not involve young viewers will be incomplete. The best solution is usually to survey only people aged 10 or over, acknowledge the lack of data from younger viewers, and to do a separate study among children aged under 10, using observation instead of questionnaires.

Door-to-door surveys usually exclude people who don’t live in private households: visitors in hotels, troops in barracks, homeless people, and so on. These people are usually only a few percent of the population, so excluding them makes very little difference to the survey results. For any proposed door-to-door survey, you should try to find out how many people you will not be able to reach, and whether these people are likely to give different answers from the others.

In the 1980s, an Australian government department did a telephone survey with teenagers, and found a surprisingly low rate of unemployment - because it mainly reached teenagers who were living with their parents, in households rich enough to have telephones. At the time, only 10% of households had no telephones - but these were the poorest households.

8. Selecting samples from lists

If you have a complete list of all people in the population, with addresses, you can use this to draw a sample.

When population lists are available, they are often for specific populations, such as all people who work in a particular organization. Even when a list is supposed to contain the entire population, it usually doesn’t: perhaps because it’s out of date, or because certain types of people are excluded. Here are some lists that usually claim to apply to an entire population. Some of these populations are people, some are households, and some can be used as both.

Electoral rolls

Though an electoral roll is designed as a list of people, it can be used as a list of households. People may come and go, but addresses stay the same.

In most countries, electoral rolls are not very complete. I made a study in South Australia, around 1990, and found that approximately 20% of people were not on the electoral rolls at their current address. Some were not citizens, some had not bothered to enrol, and some (e.g. police and judges) were excluded from published rolls. Also, many people had moved in the several years since the rolls were last updated. And of course, electoral rolls always exclude people below the minimum voting age - though children over 10 can usually answer survey questions well.

In some countries, such as Britain, the situation is much better, because electoral rolls are updated every year, and printed in a very convenient street order (unlike the alphabetical order of surname used in Australia).

Before you use an electoral roll to represent the entire public, I suggest you take a small geographical area - perhaps a few street blocks, or a few hundred dwellings - and check how many people in that area are on the rolls. If the figure is less than 90%, look for a better population list.
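The check described above comes down to a simple proportion. Here is a minimal sketch in Python; the function name and the counts are hypothetical, chosen only to illustrate the 90% rule of thumb.

```python
def roll_coverage(people_found_in_area, people_on_roll):
    """Return the share of people in the test area who appear on the roll."""
    if people_found_in_area <= 0:
        raise ValueError("the test area must contain at least one person")
    return people_on_roll / people_found_in_area


# Hypothetical check: 240 adults counted in a few street blocks, 204 on the roll.
coverage = roll_coverage(240, 204)
usable = coverage >= 0.90  # the 90% threshold suggested above
print(f"coverage = {coverage:.0%}, usable = {usable}")
```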

A good compromise with electoral rolls is to use them with multi-stage sampling: i.e. carry out a cluster survey, and use electoral rolls only to choose the starting point for each cluster. If the sample is stratified, and the number of clusters in each small area is proportional to the population of that area, this helps to ensure that people living in areas where many are not on the electoral roll will still be included in the survey.
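To make the proportional allocation concrete, here is a rough sketch in Python. It assumes you have a census population for each small area; the area names, the populations and the function name are all invented for illustration. The important point is that clusters are shared out by census population, not by the number of names on the roll.

```python
def allocate_clusters(populations, total_clusters):
    """Share clusters among strata in proportion to population.

    Uses the largest-remainder method so the allocations add up exactly.
    """
    total_pop = sum(populations.values())
    exact = {name: total_clusters * pop / total_pop
             for name, pop in populations.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    leftover = total_clusters - sum(allocation.values())
    # Hand the remaining clusters to the strata with the largest fractions.
    by_fraction = sorted(exact, key=lambda name: exact[name] - allocation[name],
                         reverse=True)
    for name in by_fraction[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical census populations for three small areas.
areas = {"North": 12000, "Central": 30000, "South": 8000}
print(allocate_clusters(areas, 40))
```

Within each area, the electoral roll is then used only to pick the starting address for each of that area's clusters.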

Street directories

These are lists of addresses, in street order. Typically, you first look up a locality, and find all the streets in that area, listed in alphabetical order. For each street, addresses are listed in numerical order. Where up-to-date street directories exist, they are an excellent source of addresses for door-to-door surveys. But they are often incomplete, omitting many addresses - particularly where several dwellings share one street address.

Telephone directories

A few years ago in rich countries, telephone directories were an excellent population list with one (and only one) entry for every household. In Australia in the late 1980s, over 90% of households had a telephone, few households were unlisted with "silent numbers", business numbers were clearly separated from residential numbers, few households had more than one number or answering machines, and few people had mobile phones.

But now, it’s all a mess. A telephone directory is no longer a good population list. Many households have more than one phone number, and these are usually the wealthier households. Other households have only mobile phones, which are not usually listed in printed directories. Because mobile phones are carried around, any directory of mobile phone numbers will be a sampling frame of people, not of households.

See the chapter on telephone surveys for full instructions on how to draw samples from telephone directories.

Utility subscribers

The advantage of a telephone directory, in an area where nearly everybody has a phone at home, is that it is a readily available list which includes most dwellings. But other such lists often exist. In areas where nearly every household has electricity from the mains, you can sometimes get access to a list of electrical subscribers. Unlike telephone directories, which are becoming messier by the day - with households having multiple numbers, unlisted numbers, and home-business numbers - utility lists mention each household once and only once.

Other utilities which can be used are those for services which almost all households use - such as local government, water supply, sewerage, rubbish collection, and so on. These lists are usually up to date and accurate. However, they are not published, so the main problem with using these for survey sampling is getting access to them from the authorities that own them.

Other population lists

Many organizations have mailing lists of their subscribers or users. These can easily be used to draw samples for surveys, using systematic sampling. They are usually samples of people.

A lot of these lists are not representative of a population. For example, a radio network I once worked with held a competition with a very valuable prize (an overseas holiday). All competitors had their names and addresses entered into a database, possibly to be surveyed later. Though competitors weren’t meant to put in more than one entry, we found that some addresses had 20 entrants. Judging from the names, it seemed that some people had entered their pets!

Before using any population list, find out more about it. Analyse it closely, and consider:

  1. Is it a sample of people, of households, or what? Is it suitable for your purpose?
  2. How complete is it?
  3. How up to date is it?
  4. How accurate is it?
  5. Does it contain duplication?

A simple way to check a list is to find 20 people who should be on it, and look them up on it. On a perfect list, all of them will be there once (and only once), and all information will be up to date. This is rare!

Cleaning lists

Before using any list to generate a sample, I suggest you bring it into a computer program with which you can view the data in a spreadsheet-like format (rows and columns, with one row for each person, and a column for each piece of information). Then sort it on every field in turn. Examine the first and last few entries in each field. If there are any problems, you are likely to find them at the top or bottom of a column.

Read carefully through the list, searching for duplicates. It’s often easier to see mistakes in print than on a computer screen. If possible, print it out, in several different sequences, and have a different person check each printout. You’ll probably be amazed how many obvious errors you find. You’ll find some people on it several times, maybe with different addresses or slightly different versions of their name.
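If the list is already on a computer, a short script can flag the most obvious duplicates before anyone reads the printouts. The sketch below is only an illustration: the field names ("name", "address") and the sample entries are made up, and it catches only entries that differ in capitals, punctuation or spacing.

```python
import re
from collections import defaultdict


def normalise(text):
    """Lower-case, drop punctuation and collapse spaces, so near-identical entries match."""
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(text.split())


def likely_duplicates(rows):
    """Group rows that share the same normalised name and address."""
    groups = defaultdict(list)
    for row in rows:
        key = (normalise(row["name"]), normalise(row["address"]))
        groups[key].append(row)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical entries from a mailing list.
rows = [
    {"name": "J. Smith", "address": "12 High St."},
    {"name": "j smith", "address": "12 High St"},
    {"name": "A. Nguyen", "address": "3 Mill Road"},
]
for group in likely_duplicates(rows):
    print(group)
```

A script like this will not notice the same person listed at two different addresses, or a misspelt surname, so the human check described above is still needed.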

The biggest problem with an incomplete list is that it’s likely to be biased in some way: in other words, it may not represent a typical cross-section of your audience. If this problem exists, random sampling cannot fix it: all you’ll get is a representative sample of an unrepresentative list.

It can be tempting to use these "found" lists. It can be cheap and easy to do a mail survey with such a list, but if you don’t know how representative it is, the results it produces can be extremely misleading. Never rely on the results from a single survey, unless you know it's random.

Finding and creating maps

If you only have a map, and no information about where the population is spread across the map area, it’s difficult to achieve an accurate sample. But it’s more difficult still when you don’t even have a map. Unfortunately, this is common in developing countries. Even though census data at a district level may be available, if there are no detailed maps it can be difficult to relate the census place-names to areas on the ground.

So it’s vital to find or create a local map. Sometimes these are painted on walls at local government offices, and you can copy this by hand from the wall (or photograph it).

If you can’t find a map, hold a meeting with some well-informed local people and create a hand-drawn map. This does not need to be exactly to scale. It should show all locality names and main roads in the district.

9. How to draw a sample from a population list

The simplest method of sampling from population lists is to use systematic sampling. This means that you divide the list into a number of equal groups, select one random number, and sample the same location in each group. Here’s how.

1. Find the sampling interval

Divide the size of the list by the number of sampling points wanted. (If you are using a stratified sample, as described above, it’s more complicated: you need to use a separate list for each stratum.)

For example, you may have a list of households, taking up 411 pages. Let’s say you’re doing a cluster survey, and you need 40 starting points. Divide 411 by 40. The answer is 10.275. That means dividing the list into 40 groups of 10.275 pages. This can be done, but it is not easy - you’d be counting a lot of lines. So I suggest taking 40 groups of 10 pages, skipping one page after every 4 groups, so that the unused pages are evenly spread through the list.
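The page arithmetic above can be checked with a few lines of Python. This is only a sketch of the layout suggested here (groups of 10 pages, with one page skipped after every 4 groups); the function and parameter names are my own invention.

```python
def page_groups(total_pages, n_groups, pages_per_group, skip_after):
    """Return the first page of each group, skipping one page after every
    `skip_after` groups so unused pages are spread through the list."""
    starts = []
    page = 1
    for group in range(1, n_groups + 1):
        starts.append(page)
        page += pages_per_group
        if group % skip_after == 0:
            page += 1  # leave one page unused
    last_page_used = starts[-1] + pages_per_group - 1
    if last_page_used > total_pages:
        raise ValueError("groups do not fit in the list")
    return starts


starts = page_groups(total_pages=411, n_groups=40, pages_per_group=10, skip_after=4)
print(starts[:6])   # first pages of the first six groups
print(starts[-1])   # first page of the last group
```

With these settings the last group ends on page 409, so only a couple of pages are left over at the very end of the list.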

2. Draw a random number.

Banknotes are a good source of random numbers. The last few digits of a banknote’s serial number are effectively random. Find a banknote, and take down its last 3 digits. I just did this: the serial number was VG 95872658. The last 3 digits are 658. Interpret this as meaning that in each group of 10 pages (or whatever you have divided your list into), you will take the entry that is 658 thousandths (65.8%) of the way through that group.

3. Work out which entries this random number corresponds to

65.8% of 10 pages is 6.58 pages, so the entry is on the 7th page of each group, 0.58 of the way through it. How much is 0.58 of a page? An easy way to do this (if every household takes up the same number of lines) is to use a ruler, and measure the height of the address list on the printed page. If there are two columns, each 235 mm high, that’s 470 mm of addresses. 58% of that is about 273 mm: this means 38 mm down the second column.

So for each group of 10 pages, find the address 38 mm down the second column of the 7th page: that’s your random address. Repeat this 40 times, and there’s your list of random starting points.

To save counting lines, you can make a card to show you which lines to choose. Measure the distance from the foot of the page to the line you need, and cut a piece of card exactly that high. If you hold the bottom of the card level with the bottom of the page with your thumbs, the first line visible above the card will be the one with the selected number.

So that’s systematic sampling: the advantage is that you draw only one random number, and use it over and over again. You should look for two problems:

1. There is a tiny chance that the population list is arranged so that there’s a regular sequence in the entries. Maybe in a list of people, every alternate one is a man and every other one a woman. If you use systematic sampling, you would select either only men or only women. (It’s extremely unlikely that any list would be ordered like this, but it would badly spoil your sample.)

2. It’s easier to round off the sampling interval so each group comes from a whole number of pages. This means that there will be some unused pages. Don’t leave all of these at the end - scatter them throughout the list. If the list is in geographical order, this will ensure you don’t exclude a large area. Later, you may be able to use some addresses on these unused pages, to replace sampled addresses that turn out to no longer exist.
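The first problem is easy to see in a small simulation. The sketch below uses an invented list in which men and women strictly alternate; the function name is hypothetical, and the list is far more regular than anything you are likely to meet in practice.

```python
def systematic_sample(entries, interval, start):
    """Take the entry at position `start` (1-based) in each block of `interval` entries."""
    return entries[start - 1::interval]


# Hypothetical list in which men and women strictly alternate.
people = ["man" if i % 2 == 0 else "woman" for i in range(100)]

even_interval = systematic_sample(people, interval=10, start=3)
odd_interval = systematic_sample(people, interval=9, start=3)

print(set(even_interval))  # only one sex is selected
print(set(odd_interval))   # both sexes are selected
```

In this toy case an odd sampling interval avoids the problem, but that is specific to a pattern that repeats every two entries. The general lesson is to look through the list for any regular sequence before you fix the interval.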

Stratification of lists

If you are designing a stratified sample (dividing the survey area into a number of smaller areas, and taking a separate sample from each smaller area) you should check any population list you want to use, to see if it is already stratified in a suitable form.

As stratification is based on Census data, the population list you use must be divided into Census areas. If it is not already divided in this way (e.g. a telephone directory covering more than the whole survey area) many hours’ work will be needed to draw a properly stratified sample.

10. Choosing the place of interview

People can be interviewed in three main places: at their homes, at their workplaces, and in public places. In most surveys, people are interviewed at home. As almost everybody has one home, home-based sampling provides a better coverage of the population than samples based on workplace (because not everybody has a job) and public places (because some people spend very little time there).

With a probability sample, it's usual to interview people at home, because it's usually the homes that are sampled, rather than the people who live in them.

If you are using a quota sample, people can be interviewed anywhere you find them: at home, at work, or in a street. Though this seems easier, it's not as valid - see section 3 above.

With some types of sample, it's better to find people at some place other than their home. If you are surveying the workforce of an organization, it may be more convenient to interview them at work (as long as they'll tell the truth there). If you are surveying people about shopping - common with market research, but rare with audience research - it can be better to survey them in a shopping area (see section 15 of this chapter). And if you are surveying the audience to some kind of event, the obvious place to interview them is at the event: see the chapter on event surveys for more details of this.

11. Selecting starting points for door-to-door surveys

Door-to-door surveys nearly always use cluster sampling, because it is so much cheaper than choosing individual households at random. When you are using clusters, the sampling is done in at least three stages:

  1. Choose the starting address (at random, from a list)
  2. Choose a random route to take after finding the starting address: a route that gives every household in the cluster an equal chance of being selected for the survey.
  3. Choose one or more persons in each selected household.

When people are surveyed in their homes, usually one person is selected at each address.

With 500 respondents, 500 separate addresses would be used. Unless the survey was confined to a small town, the chances are that these addresses would be widely scattered. This would provide wide geographical coverage, but much time would be wasted going from one dwelling to another. In a large area, more of the interviewers’ time would be taken up by travelling than by doing interviews.

To increase productivity, surveys normally use cluster samples. Instead of selecting 500 individual addresses at random, only 50 might be chosen, and a cluster of 10 neighbouring households surveyed at each point. So there would be 50 clusters each of 10 households.

You can see intuitively that taking only 50 separate parts of the city is not going to be as representative as taking 500, because neighbours tend to be similar in their habits and characteristics. To equal the accuracy of a simple random sample of 500, a cluster sample would need about 750 people. However, it is cheaper to interview 750 people in clusters than 500 individually.

Clustering saves most money when interviews are brief, and travel cost (from home or office to cluster) is high, and few or no extra trips to the cluster need be made, to interview the last few respondents. If your survey is in a large city, and you have few interviewers, and questionnaires are left for respondents to fill in and be collected on a return trip, clustering can save a tremendous amount of money, and clusters can be quite large.

In practical terms, cluster sizes usually range between 3 and 20 households. If clusters are too small, travel costs rise, but (for a fixed sample size) there will be more clusters, and the effective sample size will be larger. If clusters are too large, travel costs will be less, but the effective sample size will also be smaller. Another problem with large clusters is that interviewers can run out of households.

A good compromise in most situations is to have about 10 households per cluster.
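The trade-off between cluster size and effective sample size can be put into rough numbers. A widely used approximation for equal-sized clusters is that the design effect equals 1 + (m - 1) x rho, where m is the cluster size and rho (the intra-cluster correlation) measures how alike neighbours are. The value of rho below is an assumption, not a figure from any survey: it is simply chosen so that clusters of 10 give a design effect of 1.5, matching the 750-versus-500 comparison above. In a real survey, rho differs from question to question.

```python
def design_effect(cluster_size, rho):
    """Approximate design effect for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * rho


def effective_sample_size(n_interviews, cluster_size, rho):
    """Size of the simple random sample that would give similar accuracy."""
    return n_interviews / design_effect(cluster_size, rho)


# Assumed intra-cluster correlation, picked so clusters of 10 give a
# design effect of 1.5 (the 750-versus-500 comparison in the text).
RHO = 0.5 / 9

for size in (3, 10, 20):
    n_eff = effective_sample_size(750, size, RHO)
    print(f"clusters of {size:2d}: 750 interviews ~ {n_eff:.0f} simple random")
```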

Three ways of selecting cluster starting points

(1) Using a local population list

In many countries, local authority offices have a list showing all households. If the authorities co-operate, you can use a random number to select the starting point for each cluster from this list. Other alternatives, as discussed above, include electoral rolls, street directories, and telephone directories.

Use systematic sampling, as explained in more detail above. For example, if you want to select 5 clusters in a village with 600 households:

1. Find a list of all households.

2. Divide this list into 5 equal sections, each with 120 households.

3. Choose a random number between 1 and 120 (e.g. using the last digits of a serial number on a banknote). Say it's 57.

4. In each of the 5 sections of the household list, choose the 57th household.
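The four steps above can be sketched in Python as follows. The function name is hypothetical, and random.randint stands in for the banknote when no random number is supplied.

```python
import random


def cluster_starting_points(n_households, n_clusters, random_start=None):
    """Pick one household per equal section of the list, at the same position in each."""
    section = n_households // n_clusters
    if random_start is None:
        random_start = random.randint(1, section)  # stands in for the banknote digits
    if not 1 <= random_start <= section:
        raise ValueError("random_start must lie between 1 and the section size")
    return [k * section + random_start for k in range(n_clusters)]


# The village example: 600 households, 5 clusters, random number 57.
print(cluster_starting_points(600, 5, random_start=57))
```

For the village example this selects households 57, 177, 297, 417 and 537 on the full list: the 57th household in each section of 120.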

(2) Block listing

If you can’t find a population list, what can you do? The answer: create one. This is called block listing. It is time consuming, and therefore expensive. But when accuracy is important, and labour is cheap, block listing is the ideal choice. It will also be more up to date than any official population list.

To start with, you’ll need a large map, because it will have to show every street or road in the district. If such a map doesn’t already exist, you’ll build it up as you go. It doesn’t need to be exactly to scale. For a large district, it’s best to have a number of partial maps, and assign one interviewer to work on each.

Interviewers are now sent out to walk the whole length of every road in their assigned area. Nobody is interviewed at this stage, but the interviewers note each dwelling, and write a brief description (enough to distinguish it from its neighbours). If a street is not already on the map, it must be marked there. Dwellings that are clearly unoccupied are noted as such.

One interviewer can list several hundred dwellings in a day’s work - but this depends on the distance between dwellings, the difficulty of counting separate dwellings, and the interviewer’s ability to fend off interruptions from curious residents.

When every house on every road is listed, you have created a street directory for the survey area. When the block listing is completed, count the total number of dwellings listed - ignoring unoccupied dwellings. Number each dwelling, from 1 upwards, giving adjacent dwellings adjacent numbers (where possible). You can take a systematic sample from this list to work out the starting points for clusters.
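If the interviewers' notes are typed up, the numbering step might look like the sketch below. The record layout (road, description, occupied) and the sample entries are invented for illustration; the only rules taken from the text are that unoccupied dwellings are left out and that numbering follows the order in which the dwellings were walked, so neighbours get adjacent numbers.

```python
def number_dwellings(listing):
    """Number occupied dwellings from 1 upwards, in the order they were walked."""
    occupied = [d for d in listing if d["occupied"]]
    return {number: d for number, d in enumerate(occupied, start=1)}


# Hypothetical block-listing notes, in walking order.
listing = [
    {"road": "River Rd", "description": "blue gate, two storeys", "occupied": True},
    {"road": "River Rd", "description": "shuttered shop-house", "occupied": False},
    {"road": "River Rd", "description": "tin roof beside well", "occupied": True},
    {"road": "Hill Rd", "description": "corner house, mango tree", "occupied": True},
]

frame = number_dwellings(listing)
for number, dwelling in frame.items():
    print(number, dwelling["road"], "-", dwelling["description"])
```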

(3) Area-based sampling

When no population list is available, and block listing is too expensive, the only other method of finding cluster starting points is to use area-based sampling - which is similar to sampling from maps. This type of sampling is not ideal, because it gives each area of land an equal chance of being surveyed, not each person. Therefore, people who live in thinly populated areas have a greater chance of being included in the survey. As towns are densely populated, people living in towns will have a lower chance of being included.
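A small calculation shows how large this bias can be. The sketch below uses a deliberately simplified model, and every figure in it is hypothetical: the map is split into a town and a farming district of equal land area, one point is chosen at random on the map, and a cluster of 10 households is then drawn evenly from whichever area the point falls in.

```python
def chance_of_selection(area_share, households_in_area, cluster_size):
    """Chance that a given household is interviewed when one map point is
    chosen at random and a cluster is surveyed around it.

    Simplified model: the point falls in an area with probability equal to
    that area's share of the map, and the cluster is then drawn evenly from
    the households in that area.
    """
    return area_share * min(1.0, cluster_size / households_in_area)


# Hypothetical map: a town and a farming district covering equal land areas.
town = chance_of_selection(area_share=0.5, households_in_area=2000, cluster_size=10)
rural = chance_of_selection(area_share=0.5, households_in_area=200, cluster_size=10)
print(f"town household:  {town:.4f}")
print(f"rural household: {rural:.4f}")
print(f"rural households are {rural / town:.0f} times as likely to be chosen")
```

Under these assumptions the chance of selection is inversely proportional to the number of households in the area, which is why areas of different density need to be treated separately.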

There are several solutions to this problem, but the best solution may be different in each cluster.

Separating areas of equal population density

One solution is to survey towns separately from rural areas. In some rural areas - for example, fertile plains - the rural population is distributed quite evenly, with most people living on small farms. As long as the area of each cluster has a consistent population density, area-based sampling (e.g. from a map) will be reasonably accurate.

Aerial photographs

The second solution uses aerial photos. If you can get an aerial photograph of the area where the cluster is, you can count the roofs, number them, and form a sampling frame that way. If the scale is no greater than 1:10,000, and the roofs are clearly visible on the photo, this works quite well. However, aerial photos are sometimes many years out of date.

Professional aerial photography is very expensive, because special aircraft are used. But if you hire a small plane for an hour or so, and take photos from that, this is much cheaper than block listing. Choose a time of day when roofs are most visible - this varies with the roof materials, roof shape, and the weather. In some places, the middle of the day is best; in others, early morning or late afternoon. The best height is about 5,000 feet (1,500 metres): below this, the area of each photo is too small, and above it, individual roofs are too difficult to make out.

If possible, use a high-wing plane, so that the wings don’t get in the way. If you are taking photos through the windows, avoid using an auto-focus camera; these often try to focus on the glass. To be safe, have two photographers (one on each side of the plane) and two different cameras. Though official aerial photos are usually in black-and-white, colour photos are easier to interpret.

When the photos are developed, there will be a lot of overlapping. It’s best to number the photos, and draw lines on each to show where other photos overlap - otherwise it’s too easy to count some roofs twice.

Radial sampling

The third solution, which I call radial sampling, works well in countries where most people live along roads, and there are not a lot of roads. This often applies in south-east Asia.

For example, in 1997 I was training people in Laos in survey methods. Our class, with 12 students, decided to do a survey in the town of Phonhong, about two hours’ drive north of Vientiane (the capital). We had no information about Phonhong except its total population. Our first stop was at the local authority office, where we found the only map of the town: it was painted on a wall. We found that the town was Y-shaped; it had grown around the intersection of three roads.

[Figure: map of Phonhong]

Walking through the town, we found that nearly everybody lived on these three roads. The number of houses on each road was approximately equal. There and then, we decided to divide the town into six strips: three roads, each with two sides. The 12 trainees were divided into 6 teams, with 2 people in each. Three teams started from the central junction and worked outwards. The other three started at the edge of town and worked inwards, from the opposite side of each road - as shown on the above map.

The result was probably a good sample of the Phonhong population. I say "probably" because we had no way of being certain. If this had been a real survey, not just a training exercise, I’d have done a block-listing first, because the town had only about 600 households and we surveyed about 120 of these.

This method, radial sampling, works in any town or district where a number of roads meet in a central place. Here’s a more systematic set of rules for radial sampling.

1. Draw a very rough map of the area for the cluster/s, showing only the roads that meet, and approximate distances.

2. For each cluster, choose 3 random numbers (e.g. the last 3 digits of a banknote serial number).

3. The first random number selects the direction from the centre point. 0 = north, 5 = south, and so on. (This is like dividing a clock face into 10 parts instead of 12.)

4. Use the second random number to select the distance from the centre point. 0 = the centre, and 9 = the outer boundary.

5. If the second random number was 0, the interviewing must work outwards from the centre. If it was 9, the interviewing must go back towards the centre. If it was neither 0 nor 9, look at the 3rd random number. If this is odd, work outwards. If it is even, work inwards.

When most people live along the radial roads, this method will produce a representative sample. The exception is when densely populated areas (e.g. squatter settlements) are in areas between the main roads. Radial sampling will often miss these areas - and block listing or recent aerial photographs would be better.