The χ² test

The χ² test is a method of comparing proportions between two groups, and can be used to test whether the distribution of classes differs between groups. The simplest application is a 2 x 2 contingency table. For example, you may be carrying out a genetic case-control association study in which your hypothesis is that carrying a particular SNP (which may be the causative mutation, or in LD with the causative mutation) increases your risk of developing a disease. In that case you would expect there to be an excess of one of the alleles in the cases compared to the controls.

Calculating χ²

In order to calculate the χ² statistic you need to create a contingency table. For the case-control study described above this is a simple 2 x 2 table with the cases in the top row and the controls in the bottom row. The first column holds the count of allele 1 of the SNP, whilst the second column holds the count of allele 2.

Worked Example

You have a cohort of cases and controls for a particular disease (choose your disease of preference; it really makes no difference, as you will still divide individuals into cases or controls depending on their disease status). You have collected 248 cases with the disease, and 246 controls who are matched on age and sex to help avoid confounding factors. You have decided to screen for mutations in candidate gene X, which is involved in some biological aspect of the disease, and you have identified ten polymorphic loci within the gene: three are in the promoter region, three are within exons but are synonymous, two are intronic, and one is exonic and causes a non-synonymous amino-acid substitution.

For each SNP you should construct the contingency table shown below...

            Allele 1    Allele 2    Total
Cases          332         164       496
Controls       230         262       492
Total          562         426       988
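The totals in the table can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the check that each individual contributes two alleles is an assumption inferred from the row totals being twice the number of cases and controls.

```python
# Observed allele counts from the worked example
# (rows: cases, controls; columns: allele 1, allele 2).
observed = [
    [332, 164],  # cases
    [230, 262],  # controls
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Assumption: each individual carries two alleles, so 248 cases and
# 246 controls should give 496 and 492 alleles respectively.
assert row_totals == [2 * 248, 2 * 246]

print(row_totals)   # [496, 492]
print(col_totals)   # [562, 426]
print(grand_total)  # 988
```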

The above table holds the observed values. You must now calculate the expected value for each cell of the table, which is done by multiplying the row total by the column total for that cell and then dividing by the overall total. Thus the expected value for cases with allele 1 is (496 x 562) / 988 = 282.13765. You could repeat this calculation for each and every cell, but that would be a waste of time: subtracting the expected value for cases with allele 1 from the allele 1 column total gives you the expected value for controls with allele 1 (562 - 282.13765 = 279.86235). In a similar manner, subtracting it from the cases row total gives you the expected value for cases with allele 2 (496 - 282.13765 = 213.86235). But don't take my word for it; you should do these calculations yourself. The formula for calculating the expected value of a cell is given below...

Expected value = (Row total x Column total) / Table Total
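As a minimal sketch, the formula above can be applied to every cell at once. The function name expected_counts is illustrative, not part of any library.

```python
def expected_counts(observed):
    """Return expected counts for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected value = (row total x column total) / table total
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


observed = [[332, 164], [230, 262]]
expected = expected_counts(observed)
for row in expected:
    print([round(value, 5) for value in row])
# [282.13765, 213.86235]
# [279.86235, 212.13765]
```

The expected counts keep the same row and column totals as the observed counts, which is why the subtraction shortcut works for a 2 x 2 table.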

Now that you have calculated your observed and expected values for each cell, you are ready to calculate your χ² statistic, which is given by the formula below...

χ² = Σ (O - E)² / E

To calculate the χ² statistic you should create a table as shown below and fill in the values...

Disease status   Allele   Observed (O)   Expected (E)   (O - E)²/E   Cumulative total
Case               1          332         282.13765      8.8122019       8.8122019
Case               2          164         213.86235     11.625487       20.437689
Control            1          230         279.86235      8.8838457      29.321535
Control            2          262         212.13765     11.720003       41.041538
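The table can be reproduced with a short loop that accumulates each cell's contribution. This is a sketch with illustrative names; small differences in the last decimal places relative to the table come from rounding of the expected values.

```python
observed = [[332, 164], [230, 262]]
labels = [["Case, allele 1", "Case, allele 2"],
          ["Control, allele 1", "Control, allele 2"]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        contribution = (o - e) ** 2 / e
        chi2 += contribution  # running (cumulative) total
        print(f"{labels[i][j]:<18} {o:>4} {e:10.5f} {contribution:9.5f} {chi2:9.5f}")

print(f"chi-squared statistic = {chi2:.4f}")
```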

From the above table you can see that your χ² statistic is 41.0415. This is fine, but you need some way of assessing the statistical significance of this value, i.e. you need to know how frequently you would see differences at least this large by chance alone if there were no true association, which is what the p-value measures. In order to calculate your p-value you need to know how many degrees of freedom there are within your table. Degrees of freedom are simply the number of cells that can vary freely once the row and column totals are fixed; in a 2 x 2 table, if you were to set one of the values, all of the others would automatically be defined by the row and column totals. A simple formula for calculating degrees of freedom is given below...

Degrees of Freedom = (rows - 1) x (columns - 1)

So in this instance there is one degree of freedom. As you will see, you can also construct contingency tables of size n x m, so the degrees of freedom can become much larger.
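The degrees of freedom formula depends only on the shape of the table, as this small sketch shows. The function name is illustrative, and the 2 x 3 table is a shape-only placeholder (for example, cases and controls against three genotypes), not real data.

```python
def degrees_of_freedom(table):
    """Degrees of freedom = (rows - 1) x (columns - 1)."""
    n_rows = len(table)
    n_cols = len(table[0])
    return (n_rows - 1) * (n_cols - 1)


allele_table = [[332, 164], [230, 262]]          # 2 x 2 table from the worked example
placeholder_2x3 = [[0] * 3 for _ in range(2)]    # shape only, counts are not real

print(degrees_of_freedom(allele_table))     # 1
print(degrees_of_freedom(placeholder_2x3))  # 2
```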

Once you have calculated your degrees of freedom you can use Statistical Tables (linked at the bottom of the menu on the right) to look up the p-value associated with your χ² statistic; in this instance you will find that the p-value is < 0.0001.
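If you do not have statistical tables to hand, the p-value for one degree of freedom can be sketched with the Python standard library. This relies on the fact that a χ² variable with one degree of freedom is the square of a standard normal deviate, so it is valid for 2 x 2 tables only; the function name is illustrative.

```python
import math


def chi2_p_value_1df(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    # With 1 df the statistic is the square of a standard normal deviate,
    # so P(X >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))


p_value = chi2_p_value_1df(41.0415)
print(f"p = {p_value:.2e}")
print(p_value < 0.0001)  # True
```

For tables with more than one degree of freedom this shortcut does not apply, and tables or a statistics package would be needed instead.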

Interpreting χ² Statistics

When you perform a χ² test you will be furnished with a χ² statistic, the degrees of freedom (often abbreviated to df), and a p-value (and perhaps some other statistics, such as exact p-values, depending on the options you have used). So how do you go about interpreting this? Well, the p-value itself conveys very little information: it simply tells you how likely you would be to observe differences at least this large by chance alone if there were no association, nothing more, nothing less. You should have decided on your acceptable significance level prior to performing the test, and should now be able to say whether there is a significant association.

There is, however, a little more information you can extract from a contingency table. By tabulating the observed and expected frequencies for each cell you can see which cells contribute the greatest amount to the global χ² statistic. If you refer to the formula for calculating χ² given above, you will see that the greater the difference between the observed and expected values for a given cell, relative to its expected value, the larger the contribution it makes to the χ² statistic. This can be particularly useful in genetics, as it allows you to identify which allele or genotype is in excess or deficiency in your cases or controls. Calculating the odds ratio will also provide information on the risk that certain alleles confer for disease.
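A rough sketch of both ideas for the worked example follows. The per-cell contributions use the formula given earlier; the odds ratio uses the usual cross-product ratio for a 2 x 2 table, which is an assumption here since this page does not define it. All names are illustrative.

```python
observed = {
    ("case", 1): 332, ("case", 2): 164,
    ("control", 1): 230, ("control", 2): 262,
}
groups = ["case", "control"]
alleles = [1, 2]

row_totals = {g: sum(observed[g, a] for a in alleles) for g in groups}
col_totals = {a: sum(observed[g, a] for g in groups) for a in alleles}
grand_total = sum(row_totals.values())

contributions = {}
for (group, allele), o in observed.items():
    e = row_totals[group] * col_totals[allele] / grand_total
    contributions[group, allele] = (o - e) ** 2 / e
    # The sign of (O - E) shows whether the cell is over- or under-represented.
    direction = "excess" if o > e else "deficit"
    print(f"{group:<8} allele {allele}: {direction}, "
          f"contribution {contributions[group, allele]:.4f}")

# Cross-product odds ratio for allele 1 in cases versus controls (assumed definition).
odds_ratio = (observed["case", 1] * observed["control", 2]) / (
    observed["case", 2] * observed["control", 1]
)
print(f"odds ratio for allele 1 = {odds_ratio:.3f}")
```

In this example allele 1 is in excess in the cases and in deficit in the controls, and the two allele 2 cells make the largest contributions because their expected values are smaller.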

Getting computers to do the work for you

To sit down and calculate χ² statistics for each and every SNP that you are testing would take you maybe an hour or so. There is, however, a much easier solution: use a computer program to do this for you. Details of how to use Stata to calculate allelic and genotypic associations can be found here. Under this section you will also find details of how to account for multiple testing.
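To give a flavour of what such a program does, here is a minimal Python sketch (not the Stata procedure referred to above) that runs the allelic test over several SNPs. The function and SNP names are illustrative, and the second table contains invented counts purely to show the loop; it makes no correction for multiple testing.

```python
import math


def allelic_chi2(table):
    """Return (chi-squared statistic, p-value) for a 2 x 2 table of allele counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    # Valid for 1 degree of freedom only (2 x 2 tables).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


snp_tables = {
    "worked_example": [[332, 164], [230, 262]],
    # Invented counts, for illustration only; not from the study above.
    "hypothetical_snp": [[250, 246], [248, 244]],
}

results = {name: allelic_chi2(table) for name, table in snp_tables.items()}
for name, (statistic, p_value) in results.items():
    print(f"{name:<18} chi2 = {statistic:8.4f}  p = {p_value:.3g}")
```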


Last modified: Tue Feb 17 18:11:14 GMT 2004