Audience Dialogue

Glossary of common statistical terms

Hundreds of terms are used in statistics, but many of them are known only to statisticians. The names of the concepts can be intimidating, even unpronounceable - the "Kolmogorov-Smirnov test", for example. This glossary is designed for readers of research reports who perhaps once studied basic statistics (then forgot it as quickly as possible!) - so it's far from complete, but may help as a reminder.

AID
Automatic Interaction Detection: a statistical method for successively splitting a sample into groups of people with different probabilities of using an item. The result takes the form of a tree diagram.

Average
The average of a set of numbers is derived by adding them up, then dividing this result by the number of numbers added. For example, the average of 3, 5, 8, and 12 is 7, because 3 + 5 + 8 + 12 = 28, there were 4 numbers, and 28 divided by 4 is 7. The average is also known as the mean. See also median and mode: other measures of central tendency.
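The arithmetic in this entry can be checked with a few lines of Python. This is only a sketch of the calculation described above; the variable names are arbitrary.

```python
# The four numbers from the Average entry.
numbers = [3, 5, 8, 12]

total = sum(numbers)     # add them up: 28
count = len(numbers)     # how many numbers were added: 4
average = total / count  # 28 divided by 4

print(total, count, average)  # prints: 28 4 7.0
```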

Chi squared
One of the commonest tests of statistical significance. In surveys, used for testing whether different groups of respondents can be safely assumed to have given different sets of answers to a question. Degrees of Freedom make a difference here.

Choice modelling
A multivariate statistical technique which can provide a dollar value for non-marketed goods and services. Similar to conjoint analysis.

Cluster analysis
A multivariate statistical technique often used in segmentation. Respondents are mathematically grouped into clusters, so that people in one cluster are as similar as possible to each other, and as different as possible from people in the other clusters.

Conjoint analysis
A multivariate statistical technique which analyses preferences for various combinations of attributes: e.g. "Would you rather have a can of cold fizzy soft drink, a glass of claret, or a cup of coffee with cream?" Conjoint analysis (derived from considering jointly) would separate out preferences for hot vs cold drinks, alcoholic vs non-alcoholic, colour, and container.
Related to choice modelling. The information can come from either databases (see revealed preferences) or questionnaires (see stated preferences).

Correlation
A type of relationship between the answers to two questions. For example, there is a correlation between people's height and their weight: other things being equal, taller people weigh more than shorter people. A negative correlation occurs when one thing gets smaller as another gets bigger. See also similarity.
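As a rough illustration, the sketch below computes one common measure of correlation, the Pearson correlation coefficient, which runs from -1 (perfect negative correlation) through 0 (none) to +1 (perfect positive correlation). The heights and weights are invented for the example, and the function name pearson_r is an arbitrary choice.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two measures vary together, and how each varies on its own.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / sqrt(variation_x * variation_y)

# Invented data: five people's heights (cm) and weights (kg).
heights_cm = [155, 160, 168, 175, 183]
weights_kg = [52, 58, 63, 74, 80]

r = pearson_r(heights_cm, weights_kg)
print(round(r, 2))  # close to +1: taller people here weigh more
```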

Cross-tab
Short for Cross-tabulation. A table of survey results, with several rows and columns of figures. Here's an example of a cross-tab:

Age group by gender
(% of grand total)
                   Men   Women   Total
Age under 25        21      14      35
Age 25 to 44        20      19      39
Age 45 or above      8      18      26
Total               49      51     100


Interpretation: 21% of all people surveyed were men under 25... and so on. Real cross-tabs are often based on column totals, with each column adding to 100%.
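To show what "based on column totals" means, here is a small sketch that converts the grand-total percentages in the table above into column percentages. The dictionary layout is just one convenient way to hold the table.

```python
# The cross-tab above, as % of grand total.
crosstab = {
    "Age under 25":    {"Men": 21, "Women": 14},
    "Age 25 to 44":    {"Men": 20, "Women": 19},
    "Age 45 or above": {"Men": 8,  "Women": 18},
}

# Column totals: 49 for men, 51 for women.
column_totals = {
    gender: sum(row[gender] for row in crosstab.values())
    for gender in ("Men", "Women")
}

# Re-express each cell as a percentage of its own column.
column_pct = {
    age: {g: 100 * value / column_totals[g] for g, value in row.items()}
    for age, row in crosstab.items()
}

for age, row in column_pct.items():
    print(age, {g: round(pct, 1) for g, pct in row.items()})
```

Read this way, about 43% of the men surveyed were under 25, compared with 21% of all people surveyed being men under 25.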

Degrees of Freedom
The number of ways in which the results in a table can vary, given the column and row totals. The above table has 6 data cells (excluding the totals and the labels): 2 columns and 3 rows of data. If a figure in one row or column is changed, another has to change in the opposite direction to maintain the total. Thus degrees of freedom (df for short) is columns-minus-1 times rows-minus-1, or in this case (2 - 1) x (3 - 1) = 2.
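The sketch below works out the degrees of freedom for the cross-tab above, and then the chi squared statistic that those degrees of freedom are used with. One assumption is needed: a chi squared test works on counts of respondents, not percentages, so the example pretends the survey had exactly 100 respondents, making the percentages equal to counts.

```python
# Data cells of the cross-tab (rows = age groups, columns = men, women).
# Assumption for illustration: 100 respondents, so percentages = counts.
observed = [
    [21, 14],
    [20, 19],
    [8, 18],
]

rows = len(observed)
cols = len(observed[0])
df = (rows - 1) * (cols - 1)  # rows-minus-1 times columns-minus-1

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Chi squared: compare each cell with the count expected if age and
# gender were unrelated (row total x column total / grand total).
chi_squared = 0.0
for i in range(rows):
    for j in range(cols):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_squared += (observed[i][j] - expected) ** 2 / expected

print(df, round(chi_squared, 2))
```

The resulting statistic is then looked up in a chi squared table using the df value. With 2 degrees of freedom, the usual 5% threshold is about 5.99.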

Descriptive statistics
Figures which summarize or describe a data set, without making any inferences or generalizations. All measures listed on this page are descriptive statistics. In contrast, there's inferential statistics, in which inferences are made about the data - such as using a sample to make estimates about a population.

Factor analysis
A multivariate statistical technique which reduces a large number of questions in a topic area to a smaller number of basic factors.

Frequencies
Or frequency distribution: in a survey, a table showing what number (or percentage) of respondents gave each answer to a question. Also called marginals or top-line results. Here's an example of a frequency distribution that matches the above cross-tab.

Age group          Total
Age under 25          35
Age 25 to 44          39
Age 45 or above       26
Total                100
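The link between the two tables can be shown in a few lines: the frequency distribution is simply each row of the cross-tab added up across its columns. This is a sketch using the figures from the cross-tab entry.

```python
# The cross-tab from the Cross-tab entry (% of grand total).
crosstab = {
    "Age under 25":    {"Men": 21, "Women": 14},
    "Age 25 to 44":    {"Men": 20, "Women": 19},
    "Age 45 or above": {"Men": 8,  "Women": 18},
}

# Frequencies (marginals): add each row across the gender columns.
frequencies = {age: sum(by_gender.values()) for age, by_gender in crosstab.items()}

for age, pct in frequencies.items():
    print(age, pct)
print("Total", sum(frequencies.values()))
```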

Imputation
Making an assumption about a missing value. E.g. if a survey respondent hasn't supplied his or her sex, and the person's occupation is Housewife, it's a reasonable imputation that the sex is female.

Incidence
Much the same as penetration, reach, and saturation, but more often used in a medical context and related to individuals, e.g. "the incidence of AIDS in Southern Africa is more than 10% of the population."

Inferential statistics
The next step above descriptive statistics: inferential statistics uses significance tests and other measures to make inferences about the data - such as using a sample to make estimates about a population.

Marginals
Same as frequencies - see above.

Mean
A more technical term for average.

Median
The middle value of a set of numbers, when they are sorted in ascending order. If you line up five people in order of height, the person in the middle has the median height. A median is usually a very similar number to an average, but is less misleading when a few extreme values distort the average. Take the numbers 2, 3, 5, 5, and 20. There are 5 numbers, and the total is 35, so the average is 7 - but only one of the 5 numbers is above 7. The median is 5, and is less distorted by the presence of the top figure of 20.

Mode
The most frequent answer given to a question in a survey. In the above example of five numbers (2, 3, 5, 5, and 20) the mode is 5 - the only figure appearing more than once. Because of the normal distribution, the mode of a set of numbers is usually near the middle of the range.
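The three measures of central tendency can be compared on the same five numbers using Python's standard statistics module. This is a minimal sketch; the variable names are arbitrary.

```python
from statistics import mean, median, mode

# The five numbers used in the Median and Mode entries.
numbers = [2, 3, 5, 5, 20]

average = mean(numbers)      # total of 35 divided by 5 numbers
middle = median(numbers)     # middle value once the numbers are sorted
most_common = mode(numbers)  # the value that appears most often

print(average, middle, most_common)
```

The average is pulled up to 7 by the single large value, while the median and the mode both stay at 5.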

Multidimensional scaling (MDS)
A statistical technique for displaying differences between items, as if they were points on a map, or in a 3-dimensional space. The greater the distance, the more different the items are - in the opinions of people who rated them. This is the statistical technique used for perceptual mapping.

Multivariate statistics
A branch of statistics which measures changes in a number of items simultaneously. Some of the most common multivariate methods are cluster analysis, conjoint analysis, factor analysis, and multidimensional scaling.

Neural network
A type of statistical computer program which classifies large and complex data sets by grouping cases together in a way similar to the human brain. Used in data mining.

Normal distribution
When just about anything about people is measured - their height, for example - most people are close to the average. The further you go from the average, the fewer people have that measurement. This is sometimes called the bell-shaped curve, and a lot of statistical measures - such as standard deviation - are based on the assumption of a normal distribution. Most numeric variables in surveys follow an approximately normal distribution, with most answers near the middle of the range, and few at the extremes.
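To put numbers on "most people are close to the average", the sketch below uses NormalDist from Python's standard statistics module. The average height of 170 cm and standard deviation of 10 cm are invented figures, used only for illustration.

```python
from statistics import NormalDist

# Hypothetical population: average height 170 cm, standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)

# Share of people within 1 and within 2 standard deviations of the average.
within_1_sd = heights.cdf(180) - heights.cdf(160)
within_2_sd = heights.cdf(190) - heights.cdf(150)

print(round(within_1_sd, 3), round(within_2_sd, 3))
```

Under a normal distribution, roughly 68% of people fall within one standard deviation of the average and roughly 95% within two, whatever the average and standard deviation happen to be.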

Penetration
Much the same as incidence, but usually applied to households, whereas incidence usually refers to people: e.g. "the penetration of television is 90% of households" but "the incidence of AIDS is 1% of people." Also called Saturation.

Percentage
A fraction expressed in hundredths. If 50% of people are male, then 50 out of every 100 people are male. But beware of surveys with small sample sizes. If 2 people were surveyed, and 1 was male, that's still 50%. As a rough guide, if the sample is less than about 20, don't quote percentage figures unless you're comparing two groups of people.

Perceptual mapping
A multivariate statistical technique which uses survey data to produce "maps" of the perceived distances between products, attributes, etc. For example, in a study of cars, you'd expect the Rolls-Royce product to be close to the Quality attribute. Much the same as multidimensional scaling.

Projection
Estimating a station's total number of listeners, by multiplying a survey percentage by a raising factor.

Propensity score
A method sometimes used instead of weighting, to partly adjust for an unrepresentative sample. A form of imputation.

Regression
A statistical method which tries to predict a dependent variable (result) by combining a number of independent variables (measures). For example, regression analysis could predict your life expectancy by combining your grandparents' age at death, whether you smoke, whether you have high blood pressure, etc.
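Real regression analysis usually combines several independent variables, but the idea can be sketched with just one. The example below fits a straight line by least squares, the simplest form of regression. The figures are invented, and fit_line is a hypothetical helper written for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares straight line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data: grandparents' average age at death (independent variable)
# and each person's own age at death (dependent variable).
grandparents_age = [62, 68, 71, 75, 80, 84]
own_age = [66, 70, 76, 77, 83, 85]

slope, intercept = fit_line(grandparents_age, own_age)

# Predict the result for someone whose grandparents died at an average of 78.
predicted = intercept + slope * 78
print(round(slope, 2), round(intercept, 1), round(predicted, 1))
```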

Saturation
The same as penetration, but often used in a marketing context.

Significance testing
Statistical methods for determining the probability that a result could be due to chance. For example, if you toss a coin 20 times and 12 of those times are heads, does that mean the coin is biased? A significance test would tell you that with a fair coin there's about a 25% chance of getting at least 12 heads, so this result is weak evidence of bias. (The chance of at least 13 heads is about 13%.)
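The coin-tossing probabilities can be worked out exactly from the binomial distribution. The sketch below does this with Python's standard math.comb; the function name prob_at_least is an arbitrary choice. For a fair coin tossed 20 times, it gives about 25% for at least 12 heads and about 13% for at least 13 heads.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more heads in n tosses, where p is the chance of heads."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

at_least_12 = prob_at_least(12, 20)
at_least_13 = prob_at_least(13, 20)

print(round(at_least_12, 3), round(at_least_13, 3))  # prints: 0.252 0.132
```

A result is conventionally called statistically significant only if its probability under chance falls below some small threshold, often 5%.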

Similarity
A numerical estimate of how alike two people, groups of people, or concepts are. Often used in perceptual mapping. Similar to correlation, but varies only between 0 and 1. A similarity of 0 means the two units had completely different responses; a similarity of 1 means they are exactly the same.
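There are many ways of calculating similarity. One of the simplest, sketched below, is the proportion of questions on which two respondents gave the same answer. This is only one possible measure, chosen here because it behaves as described above (0 for completely different, 1 for identical). The answers are invented.

```python
def similarity(answers_a, answers_b):
    """Proportion of questions answered identically (0 = none, 1 = all)."""
    if len(answers_a) != len(answers_b):
        raise ValueError("both respondents must answer the same questions")
    matches = sum(1 for a, b in zip(answers_a, answers_b) if a == b)
    return matches / len(answers_a)

# Invented answers from two respondents to the same five questions.
respondent_1 = ["yes", "no", "yes", "often", "radio"]
respondent_2 = ["yes", "no", "no", "often", "tv"]

score = similarity(respondent_1, respondent_2)
print(score)  # prints: 0.6
```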

Smoothing
Using statistical techniques to smooth out irregular graphs; usually plotting some measure over a period of time, and producing a smoother graph by replacing each figure with the average of the latest 3 (or more) figures - a moving average.
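A moving average, one common smoothing method, can be sketched in a few lines. Each smoothed figure is the average of the latest 3 figures up to that point. The weekly audience figures are invented.

```python
def moving_average(values, window=3):
    """Average of each run of `window` consecutive figures."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Invented weekly audience figures (thousands of listeners).
weekly = [40, 52, 43, 58, 47, 61, 50]

smoothed = moving_average(weekly, window=3)
print(smoothed)  # prints: [45.0, 51.0, 49.333333333333336, 55.333333333333336, 52.666666666666664]
```

The smoothed series is shorter than the original by window-minus-1 figures, because the first few points don't yet have enough earlier figures to average.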

SPSS
Statistical Package for the Social Sciences. Among the most widely used software for survey analysis. See our statistical software page for more about such software.

Standard deviation
A statistical measure of variation within a sample. Just as the average measures the expected middle position of a group of numbers, the standard deviation is a way of expressing how different the numbers are from the average. The standard deviation is (roughly) the amount by which the average person's score differs from the average of all scores.
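As a sketch of the calculation, the example below uses Python's standard statistics module on the five numbers from the Median entry. It shows both common versions: pstdev treats the numbers as a whole population, while stdev treats them as a sample from a larger population and so gives a slightly larger figure.

```python
from statistics import mean, pstdev, stdev

# The five numbers from the Median entry.
scores = [2, 3, 5, 5, 20]

average = mean(scores)           # 7
population_sd = pstdev(scores)   # divides by the number of scores (5)
sample_sd = stdev(scores)        # divides by one fewer (4)

print(average, round(population_sd, 2), round(sample_sd, 2))  # prints: 7 6.6 7.38
```

The standard deviation is large relative to the average here because one score (20) lies a long way from the rest.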

Tabulation
Comparing combinations of counts of several variables in the form of tables of numbers. See cross-tab for an example.

Top-line results
Same as frequencies. In finance they call this the bottom line, but in statistics it's the top line: it depends whether you give the totals of a set of figures at the top or the bottom!