 # Using Excel for survey analysis (5)Cross-tabulations with pivot tables

A cross-tabulation (or "crosstab" for short) is a table produced when all the answers to one question are compared with all the answers for another question. These work much the same as one-way tables (alias frequencies), but instead of selecting just one column of data (one question), you now need to select at least two columns.

To make sense of these instructions, you probably need to read the previous page in this set. Also, these instructions are for Excel 97; Excel 2000 is slightly different.

1. Select adjacent columns by dragging the mouse across their grey headings (e.g. C and D)

2. Go to the Data menu and click on Pivot Table Report.

3. The Pivot Table Wizard comes up. Two or more boxes appear at the right-hand side, in step 3 of the Pivot Table Wizard: in this case, Q1 and Q2. These labels are copied from row 1 of the columns you've selected.

Drag one box (e.g. Q1) to ROW, and the other (e.g. Q2) to COLUMN. Drag either one again to DATA. In this case, I dragged Q1 to DATA. It looked like this: Click Finish (to put it on a new worksheet), and there's your pivot table:

 Count of Q1 Q2 Q1 n p s y (blank) Grand Total n 92 19 12 8 4 135 y 172 31 12 15 68 298 (blank) Grand Total 264 50 24 23 72 433

If I had dragged Q2 to ROW and Q1 to column (instead of vice versa), the table would have been transposed, with Q1 values across the top, and Q2 down the left-hand side. But, strangely, if I had dragged Q2 instead of Q1 to the DATA section of the table (where it would be shown as COUNT OF Q2) there would be no difference at all in the final figures.

What do these numbers mean? As an example, note the figure of 172 in the above table. This number is in the column headed n (under Q2) and the row headed y under Q1. This means that 172 people answered No to Question 2 and Yes to Question 1.

Such figures usually make more sense as percentages, but pivot tables don't calculate percentages for you. There are three likely percentages you might want to calculate:

• Of all the people who answered Yes to Q1, what percent answered No to Q2? (Answer = 172 / 298 = 57.7%)
• Of all the people who answered No to Q2, what percent answered Yes to Q1? (Answer = 172 / 264 = 65.2%)
• Of all the people surveyed, what percent answered Yes to Q1 and no to Q2? (Answer = 172 / 500 = 34.4%)
The percentage you choose depends on (a) the purpose of the questions, and (b) what you are trying to find out. The easiest way to display percentages is by creating another table next to the pivot table, and calculating percentages based on either the row total, the column total, or the grand total.

If you want to cross-tabulate two columns that aren't next to each other, it would be nice to be able to select first one column, then the other non-adjoining column by clicking the mouse while holding down the Option or Alt key. But that doesn't work: instead, you have to select all the columns in that range. For example, if you want to tabulate Q1 by Q7, select all the columns from Q1 to Q7. This will produce 7 boxes on the right hand side in screen 3 of the Pivot Table Wizard. Just drag Q1 and Q7 to where you want them, and ignore Q2 to Q6.

### Significance testing

When you've created a pivot table from survey data, you'll probably want to know if there's any pattern in the answers, or if any difference between them is just due to chance variation. A statistical significance test will help with this, and the commonest significance test for a two-dimensional table is the chi-squared test. (Chi is a letter of the Greek alphabet, and rhymes with sky.)

Unfortunately, Excel makes it difficult to calculate chi squared, so there's a high chance that your answer will be wrong - unless you know what you're doing so well that you probably don't need to do it! Fortunately, there are several web pages that calculate chi squared for you. All you need to do is feed in the numbers.

For example, Audience Dialogue recently had a data set of 2208 cases. Cases are usually people, but in this example they were coded comments. We divided the comments into 4 categories, based on respondents' answers to one question, and looked at the number of people in each of the 4 groups who mentioned a particulartheme. This produced a 4 x 2 table, as follows. (Only the actual numbers are included in the "4x2", not the labels above and beside them - nor any row or column totals.)

 Answer Group 1 Group 2 Group 3 Group 4 Yes 336 235 120 239 No 637 346 196 199

As an example of how to read the table, 336 people in group 1 answered Yes to the key question, and the other 637 did not answer Yes. Thus 34.5% of this group answered Yes. In the other 3 groups, 40.4%, 38.0%, and 41.1% respectively answered Yes. We wanted to know "Are these four groups really different?" To answer this question, a chi squared test can be used. Instead of trying to do this with Excel, visit this page from
Kris Preacher at the University of Kansas or this one at
Vassar University.

Both of these provide online calculators for chi squared. Preacher's version gives a better explanation of what's happening, but the Hong Kong one saves retyping numbers, and thus potential error. If you're not fully confident about what you're doing, we recommend the Ohio one - for which you can download the pages and use them even when you're not online.

Feeling lazy, we used the Hong Kong page: all we had to do was run the mouse over the 8 totals in the Excel pivot table, select "copy", call up the online calculator, and paste the numbers into that. In a second or two, the answer came back:

Chi squared=51.223 df=3 p<0.0001

The higher the chi squared value, the more significant the difference between proportions - for a given table size (which is what the df is about). The important figure is not the chi squared itself, but p, the probability that the groups are really from the same population. The p figure of <0.0001 means less than one chance in 10,000. That's a very high level of certainty. Most analysts don't consider a result worth bothering about unless the p figure is less than .05. We often use a criterion of .01, to avoid getting excited about too many false positives. When the p figure is less than .01, it shows there's a very low chance that the two groups are really the same. But it doesn't necessarily mean there's a strong relationship between the groups, because it the sample size is large enough, even the tiniest difference can be statistically significant. If you want to look at the size of the difference, it's better to use a measure like Cramer's V, which the Vassar page above calculates at the same time as chi squared. So chi squared measures whether two (or more) sets of figures are really different, while Cramer's V measures how different they are.

 1. Excel for surveys | 2. Data entry | 3. Numeric variables | 4. Nominal variables | • | 6. Frequencies