The procedures for testing hypotheses discussed in Chapters 3, 4, and 5 apply to experiments in which the control and treatment groups contain different subjects (individuals). It is often possible to design experiments in which each experimental subject can be observed before and after one or more treatments. Such experiments are generally more sensitive because they make it possible to measure how the treatment affects each individual. When the control and treatment groups consist of different individuals, the changes due to the treatment may be masked by variability between experimental subjects. This chapter shows how to analyze experiments in which each subject is repeatedly observed under different experimental conditions.
We will begin with the paired t test for experiments in which the subjects are observed before and after receiving a single treatment. Then, we will generalize this test to obtain repeated measures analysis of variance, which permits testing hypotheses about any number of treatments whose effects are measured repeatedly in the same subjects. We will explicitly separate the total variability in the observations into three components: variability between the experimental subjects, variability in each individual subject’s response, and variability due to the treatments. Like all analyses of variance (including t tests), these procedures require that the observations come from normally distributed populations. (Chapter 10 presents methods based on ranks that do not require this assumption.) Finally, we will develop McNemar’s test to analyze data measured on a nominal scale and presented in contingency tables.
Experiments When Subjects Are Observed before and after a Single Treatment: The Paired t Test
In experiments in which it is possible to observe each experimental subject before and after administering a single treatment, we will test a hypothesis about the average change the treatment produces instead of the difference in average responses with and without the treatment. This approach reduces the variability in the observations due to differences between individuals and yields a more sensitive test.
Figure 9-1 illustrates this point. Figure 9-1A shows daily urine production in two samples of 10 different people each; one sample group took a placebo and the other took a drug. Since there is little difference in the mean response relative to the standard deviations, it would be hard to assert that the treatment produced an effect on the basis of these observations. In fact, t computed using the methods of Chapter 4 is only 1.33, which comes nowhere near $t_{.05} = 2.101$, the critical value for $\nu = n_{pla} + n_{drug} - 2 = 10 + 10 - 2 = 18$ degrees of freedom.
Figure 9-1.

(A) Daily urine production in two groups of 10 different people. One group of 10 people received the placebo and the other group of 10 people received the drug. The diuretic does not appear to be effective. (B) Daily urine production in a single group of 10 people before and after taking a drug. The drug appears to be an effective diuretic. The observations are identical to those in panel A; by focusing on changes in each individual’s response rather than the response of all the people taken together, it is possible to detect a difference that was masked by the between subjects variability in panel A.
Now consider Figure 9-1B. It shows urine productions identical to those in Figure 9-1A but for an experiment in which urine production was measured in one sample of 10 individuals before and after administering the drug. A straight line connects the observations for each individual. Figure 9-1B shows that the drug increased urine production in 8 of the 10 people in the sample. This result suggests that the drug is an effective diuretic.
Now, let us develop a statistical procedure to quantify our subjective impression in such experiments. The paired t test can be used to test the hypothesis that there is, on the average, no change in each individual after receiving the treatment under study. Recall that the general definition of the t statistic is

$$t = \frac{\text{parameter estimate} - \text{true value of population parameter}}{\text{standard error of parameter estimate}}$$

The parameter we wish to estimate is the average difference in response δ in each individual due to the treatment. If we let d equal the observed change in each individual that accompanies the treatment, we can use $\bar{d}$, the mean change, to estimate δ. The standard deviation of the observed differences is

$$s_d = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}}$$

so the standard error of the mean difference is

$$s_{\bar{d}} = \frac{s_d}{\sqrt{n}}$$

Therefore,

$$t = \frac{\bar{d} - \delta}{s_{\bar{d}}}$$

To test the hypothesis that there is, on the average, no response to the treatment, set δ = 0 in this equation to obtain

$$t = \frac{\bar{d}}{s_{\bar{d}}}$$
To recapitulate, when analyzing data from an experiment in which it is possible to observe each individual before and after applying a single treatment:
- Compute the change in response that accompanies the treatment in each individual, $d$.
- Compute the mean change $\bar{d}$ and the standard error of the mean change $s_{\bar{d}}$.
- Use these numbers to compute $t = \bar{d}/s_{\bar{d}}$.
- Compare this t with the critical value for $\nu = n - 1$ degrees of freedom, where n is the number of experimental subjects.
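The steps above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `paired_t` and the before/after values are hypothetical (they are not data from this chapter), and the resulting t must still be compared with a tabulated critical value for n − 1 degrees of freedom.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, degrees of freedom) for paired before/after observations."""
    if len(before) != len(after):
        raise ValueError("each subject needs one 'before' and one 'after' value")

    # Step 1: change in response in each individual
    d = [a - b for b, a in zip(before, after)]
    n = len(d)

    # Step 2: mean change and standard error of the mean change
    d_bar = statistics.mean(d)
    s_d = statistics.stdev(d)        # sample standard deviation (divides by n - 1)
    se_d_bar = s_d / math.sqrt(n)

    # Step 3: t statistic for the null hypothesis of no average change
    t = d_bar / se_d_bar

    # Step 4: degrees of freedom used to look up the critical value
    return t, n - 1


# Hypothetical measurements on five subjects (made up for illustration only)
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [12.0, 13.0, 12.0, 12.0, 16.0]

t, df = paired_t(before, after)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The function works only with the within-subject changes, which is why the between-subject variability in the raw values never enters the calculation.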
Note that the number of degrees of freedom, ν, associated with the paired t test is n − 1, fewer than the 2(n − 1) degrees of freedom associated with analyzing these data using an unpaired t test. This loss of degrees of freedom increases the critical value of t that must be exceeded to reject the null hypothesis of no difference. While this situation would seem undesirable, the typical biological variability between individuals means that the loss is virtually always more than compensated for by focusing on differences within subjects, which reduces the variability in the results used to compute t. All other things being equal, paired designs are almost always more powerful for detecting effects in biological data than unpaired designs.
Finally, the paired t test, like all t tests, is predicated on a normally distributed population. In the t test for unpaired observations developed in Chapter 4, responses needed to be normally distributed. In the paired t test, the differences (changes within each subject) associated with the treatment need to be normally distributed.
Smokers are more likely to develop diseases caused by abnormal blood clots (thromboses), including heart attacks and occlusion of peripheral arteries, than nonsmokers. Platelets are small bodies that circulate in the blood and stick together to form blood clots. Since smokers experience more disorders related to undesirable blood clots than nonsmokers, Peter Levine* drew blood samples in 11 people before and after they smoked a single cigarette and measured the extent to which platelets aggregated when exposed to a standard stimulus. This stimulus, adenosine diphosphate, makes platelets release their granular contents, which, in turn, makes them stick together and form a blood clot.
Figure 9-2 shows the results of this experiment, with platelet stickiness quantified as the maximum percentage of all the platelets that aggregated after being exposed to adenosine diphosphate. The pair of observations made in each individual before and after smoking the cigarette is connected by straight lines. The mean percentage aggregations were 43.1% before smoking and 53.5% after smoking, with standard deviations of 15.9% and 18.7%, respectively. Simply looking at these numbers does not suggest that smoking had an effect on platelet aggregation. This approach, however, omits an important fact about the experiment: the platelet aggregations were not measured in two different (independent) groups of people, smokers and nonsmokers, but in a single group of people who were observed both before and after smoking the cigarette.
Figure 9-2.

Maximum percentage platelet aggregation before and after smoking a tobacco cigarette in 11 people. (Adapted with permission of the American Heart Association, Inc. from Fig. 1 of Levine PH. An acute effect of cigarette smoking on platelet function: a possible link between smoking and arterial thrombosis. Circulation. 1973;48:619–623.)
In all but one individual, the maximum platelet aggregation increased after smoking the cigarette, suggesting that smoking facilitates thrombus formation. The means and standard deviations of platelet aggregation before and after smoking for all people taken together did not suggest this pattern because the variability between individuals masked the variability in platelet aggregation that was due to smoking the cigarette. When we took into account the fact that the data consisted of pairs of observations done before and after smoking in each individual, we could focus on the change in response and so remove the variability that was due to the fact that different people have different platelet-aggregation tendencies regardless of whether they smoked a cigarette or not.
The changes in maximum percent platelet aggregation that accompany smoking are (from Fig. 9-2) 2%, 4%, 10%, 12%, 16%, 15%, 4%, 27%, 9%, −1%, and 15%. Therefore, the mean change in percent platelet aggregation with smoking in these 11 people is $\bar{d}$ = 10.3%. The standard deviation of the change is 8.0%, so the standard error of the change is

$$s_{\bar{d}} = \frac{8.0}{\sqrt{11}} = 2.41\%$$

Finally, our test statistic is

$$t = \frac{\bar{d}}{s_{\bar{d}}} = \frac{10.3}{2.41} = 4.27$$
This value exceeds 3.169, the value that defines the most extreme 1% of the t distribution with $\nu = n - 1 = 11 - 1 = 10$ degrees of freedom (from Table 4-1). Therefore, we report that smoking increases platelet aggregation (P < .01).
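As a check on the arithmetic, the calculation can be reproduced from the 11 changes listed above. The sketch below uses only Python's standard library; the variable names are illustrative, and the critical value 3.169 is simply copied from the text rather than computed.

```python
import math
import statistics

# Changes in maximum percent platelet aggregation, as listed in the text
changes = [2, 4, 10, 12, 16, 15, 4, 27, 9, -1, 15]

n = len(changes)
mean_change = statistics.mean(changes)      # mean change
sd_change = statistics.stdev(changes)       # sample SD of the changes
se_change = sd_change / math.sqrt(n)        # standard error of the mean change
t = mean_change / se_change                 # paired t statistic

critical_t = 3.169  # 1% critical value for 10 degrees of freedom, from the text

print(f"mean change = {mean_change:.1f}%")
print(f"SD = {sd_change:.1f}%, SE = {se_change:.2f}%")
print(f"t = {t:.2f}, exceeds critical value: {t > critical_t}")
```

Rounded as in the text, the results agree with the mean change of 10.3%, the standard deviation of 8.0%, and t = 4.27.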
How convincing is this experiment that a constituent specific to tobacco smoke, as opposed to other chemicals common to smoke in general (e.g., carbon monoxide), or even the stress of the experiment produced the observed change? To investigate this question, Levine also had his subjects “smoke” an unlit cigarette and a lettuce leaf cigarette that contained no nicotine. Figure 9-3 shows the results of these experiments, together with the results of smoking a standard cigarette (from Fig. 9-2).
Figure 9-3.

Maximum percentage platelet aggregation in 11 people before and after pretending to smoke (“sham smoking”), before and after smoking a lettuce-leaf cigarette that contained no nicotine, and before and after smoking a tobacco cigarette. These observations, taken together, suggest that it was something in the tobacco smoke, rather than the act of smoking or other general constituents of smoke, that produced the change in platelet aggregation. (Redrawn with permission of the American Heart Association, Inc. from Fig. 1 of Levine PH. An acute effect of cigarette smoking on platelet function: a possible link between smoking and arterial thrombosis. Circulation. 1973;48:619–623.)
When the experimental subjects merely pretended to smoke or smoked a non-nicotine cigarette made of dried lettuce, there was no discernible change in platelet aggregation. This situation contrasts with the increase in platelet aggregation that followed smoking a single tobacco cigarette. This experimental design illustrates an important point:
In a well-designed experiment, the only difference between the treatment group and the control group, both chosen at random from a population of interest, is the treatment.
In this experiment the treatment of interest was tobacco constituents in the smoke, so it was important to compare the results with observations obtained after exposing the subjects to nontobacco smoke. This step helped ensure that the observed changes were due to the tobacco rather than smoking in general. The more carefully the investigator can isolate the treatment effect, the more convincing the conclusions will be.
There are also subtle biases that can cloud the conclusions from an experiment. Most investigators, and their colleagues and technicians, want the experiments to support their hypothesis. In addition, the experimental subjects, when they are people, generally want to be helpful and wish the investigator to be correct, especially if the study is evaluating a new treatment that the experimental subject hopes will provide a cure. These factors can lead the people doing the study to slant judgment calls (often required when collecting the data) toward making the study come out the way everyone wants. For example, the laboratory technicians who measure platelet aggregation might read the control samples on the low side and the smoking samples on the high side without even realizing it. Perhaps some psychological factor among the experimental subjects (analogous to a placebo effect) led their platelet aggregation to increase when they smoked the tobacco cigarette. Levine avoided these difficulties by doing the experiments in a double-blind manner in which the investigator, the experimental subject, and the laboratory technicians who analyzed the blood samples did not know the content of the cigarettes being smoked until after all experiments were complete and specimens analyzed. As discussed in Chapter 2, double-blind studies are the most effective way to eliminate bias due to both the observer and experimental subject.
In single-blind studies one party, usually the investigator, knows which treatment is being administered. This approach controls biases due to the placebo effect but not observer biases. Some studies are also partially blind, in which the participants know something about the treatment but do not have full information. For example, the blood platelet study might be considered partially blind because both the subject and the investigator obviously knew when the subject was only pretending to smoke. It was possible, however, to withhold this information from the laboratory technicians who actually analyzed the blood samples to avoid biases in their measurements of percent platelet aggregation.
The paired t test can be used to test hypotheses when observations are taken before and after administering a single treatment to a group of individuals. To generalize this procedure to experiments in which the same individuals are subjected to a number of treatments, we now develop repeated measures analysis of variance.
To do so, we must first introduce some new nomenclature for analysis of variance. To ease the transition, we begin with the analysis of variance presented in Chapter 3, in which each treatment was applied to different individuals. After reformulating this type of analysis of variance, we will go on to the case of repeated measurements on the same individual.
* Levine PH. An acute effect of cigarette smoking on platelet function: a possible link between smoking and arterial thrombosis. Circulation. 1973;48:619–623.
When we developed analysis of variance in Chapter 3, we assumed that all the samples were drawn from a single population (i.e., that the treatments had no effect), estimated the variability in that population from the variability within the sample groups and between the sample groups, then compared these two estimates to see how compatible they were with the original assumption (the null hypothesis) that all the samples were drawn from a single population. When the two estimates of variability were unlikely to arise if the samples had been drawn from a single population, we rejected the null hypothesis of no effect and concluded that at least one of the samples represented a different population (i.e., that at least one treatment had an effect). We used estimates of the population variance to quantify variability.
In Chapter 8, we used a slightly different method to quantify the variability of observed data points about a regression line. We used the sum of squared deviations about the regression line to quantify variability. The variance and sum of squared deviations, of course, are intimately related. One obtains the variance by dividing the sum of squared deviations by the appropriate number of degrees of freedom. We now will recast analysis of variance using sums of squared deviations to quantify variability. This new nomenclature forms the basis of all forms of analysis of variance, including repeated measures analysis of variance.
In Chapter 3, we considered the following experiment: To determine whether diet affected cardiac output in people living in a small town, we randomly selected four groups of seven people each. People in the control group continued eating normally; people in the second group ate only spaghetti; people in the third group ate only steak; and people in the fourth group ate only fruit and nuts. After 1 month, each person was catheterized and his cardiac output measured. Figure 3-1 showed that diet did not, in fact, affect cardiac output. Figure 3-2 showed the results of the experiment as they would appear to you as an investigator or reader. Table 9-1 presents the same data in tabular form. The four different groups did show some variability in cardiac output. The question is: How consistent is this observed variability with the hypothesis that diet did not have any effect on cardiac output?
Table 9-1. Cardiac output (L/min) by treatment (diet).

| $s$ (subject within group) | Control | Spaghetti | Steak | Fruit and Nuts |
|---|---|---|---|---|
| 1 | 4.6 | 4.6 | 4.3 | 4.3 |
| 2 | 4.7 | 5.0 | 4.4 | 4.4 |
| 3 | 4.7 | 5.2 | 4.9 | 4.5 |
| 4 | 4.9 | 5.2 | 4.9 | 4.9 |
| 5 | 5.1 | 5.5 | 5.1 | 4.9 |
| 6 | 5.3 | 5.5 | 5.3 | 5.0 |
| 7 | 5.4 | 5.6 | 5.6 | 5.6 |
| Treatment (column) means | 4.96 | 5.23 | 4.93 | 4.80 |
| Treatment (column) sums of squares | 0.597 | 0.734 | 1.294 | 1.200 |

Grand mean = 4.98; total sum of squares = 4.507.
* This and the following section, which develops repeated measures analysis of variance (the multitreatment generalization of the paired t test), are more mathematical than the rest of the text. Some readers may wish to skip this section until they encounter an experiment that should be analyzed with repeated measures analysis of variance. Although such experiments are common in the biomedical literature, this test is rarely used. Analyzing them with multiple paired t tests instead leads to the same kinds of multiple t test errors discussed in Chapters 3 and 4 for the unpaired t test.
Tables 9-1 and 9-2 illustrate the notation we will now use to answer this question; it is required for more general forms of analysis of variance. The four different diets are called the treatments and are represented by the columns in the table. We denote the four different treatments with the numbers 1 to 4 (1 = control, 2 = spaghetti, 3 = steak, 4 = fruit and nuts). Seven different people receive each treatment. Each particular experimental subject (or, more precisely, the observation or data point associated with each subject) is represented by $X_{ts}$, where t represents the treatment and s represents a specific subject in that treatment group. For example, $X_{11}$ = 4.6 L/min represents the observed cardiac output for the first subject (s = 1) who received the control diet (t = 1). $X_{35}$ = 5.1 L/min represents the fifth subject (s = 5) who had the steak diet (t = 3).
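Table 9-2. Notation for the observations in Table 9-1.

| $s$ (subject within group) | Treatment 1 | Treatment 2 | Treatment 3 | Treatment 4 |
|---|---|---|---|---|
| 1 | $X_{11}$ | $X_{21}$ | $X_{31}$ | $X_{41}$ |
| 2 | $X_{12}$ | $X_{22}$ | $X_{32}$ | $X_{42}$ |
| 3 | $X_{13}$ | $X_{23}$ | $X_{33}$ | $X_{43}$ |
| 4 | $X_{14}$ | $X_{24}$ | $X_{34}$ | $X_{44}$ |
| 5 | $X_{15}$ | $X_{25}$ | $X_{35}$ | $X_{45}$ |
| 6 | $X_{16}$ | $X_{26}$ | $X_{36}$ | $X_{46}$ |
| 7 | $X_{17}$ | $X_{27}$ | $X_{37}$ | $X_{47}$ |
| Treatment (column) means | $\bar{X}_1$ | $\bar{X}_2$ | $\bar{X}_3$ | $\bar{X}_4$ |
| Treatment (column) sums of squares | $\sum_s (X_{1s} - \bar{X}_1)^2$ | $\sum_s (X_{2s} - \bar{X}_2)^2$ | $\sum_s (X_{3s} - \bar{X}_3)^2$ | $\sum_s (X_{4s} - \bar{X}_4)^2$ |

Grand mean = $\bar{X}$; total sum of squares = $\sum_t \sum_s (X_{ts} - \bar{X})^2$.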
Tables 9-1 and 9-2 also show the mean cardiac outputs for all subjects (in this case, people) receiving each of the four treatments, labeled $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$, and $\bar{X}_4$. For example, $\bar{X}_2$ = 5.23 L/min is the mean cardiac output observed among people who were treated with spaghetti. The tables also show the variability within each of the treatment groups, quantified by the sum of squared deviations about the treatment mean,

Sum of squares for treatment t = sum, over all subjects who received treatment t, of (value of observation for subject − mean response of all individuals who received treatment t)².

The equivalent mathematical statement is

$$SS_t = \sum_s (X_{ts} - \bar{X}_t)^2$$
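The summation symbol, Σ, has been modified to indicate that we sum over all s subjects who received treatment t. We need this more explicit notation because we will be summing up the observations in different ways. For example, the sum of squared deviations from the mean cardiac output for the seven people who ate the control diet (t = 1) is

$$SS_1 = \sum_s (X_{1s} - \bar{X}_1)^2 = (4.6 - 4.96)^2 + (4.7 - 4.96)^2 + (4.7 - 4.96)^2 + (4.9 - 4.96)^2 + (5.1 - 4.96)^2 + (5.3 - 4.96)^2 + (5.4 - 4.96)^2 = 0.597\ (\text{L/min})^2$$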
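The same calculation can be repeated for every column of Table 9-1. The short Python sketch below does so with the standard library; the dictionary `diet` and the helper `sum_of_squares` are illustrative names, not notation from the text.

```python
import statistics

# Cardiac output (L/min) for each treatment group, copied from Table 9-1
diet = {
    "control": [4.6, 4.7, 4.7, 4.9, 5.1, 5.3, 5.4],
    "spaghetti": [4.6, 5.0, 5.2, 5.2, 5.5, 5.5, 5.6],
    "steak": [4.3, 4.4, 4.9, 4.9, 5.1, 5.3, 5.6],
    "fruit and nuts": [4.3, 4.4, 4.5, 4.9, 4.9, 5.0, 5.6],
}


def sum_of_squares(values):
    """Sum of squared deviations of the values about their own mean."""
    mean = statistics.mean(values)
    return sum((x - mean) ** 2 for x in values)


for name, values in diet.items():
    print(f"{name}: mean = {statistics.mean(values):.2f}, "
          f"SS = {sum_of_squares(values):.3f}")
```

Rounded to the precision shown in Table 9-1, the printed means and sums of squares match the two summary rows of the table.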
Recall that the definition of sample variance is

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

where n is the size of the sample. The expression in the numerator is just the sum of squared deviations from the sample mean, so we can write

$$s^2 = \frac{SS}{n - 1}$$

Hence, the variance in treatment group t equals the sum of squares for that treatment divided by the number of individuals who received the treatment (i.e., the sample size) minus 1:

$$s_t^2 = \frac{\sum_s (X_{ts} - \bar{X}_t)^2}{n - 1} = \frac{SS_t}{n - 1}$$
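In Chapter 3, we estimated the population variance from within the groups for our diet experiment with the average of the variances computed from within each of the four treatment groups

$$s_{wit}^2 = \tfrac{1}{4}\left(s_{con}^2 + s_{spa}^2 + s_{st}^2 + s_{f}^2\right)$$

In the notation of Table 9-1, we can rewrite this equation as

$$s_{wit}^2 = \tfrac{1}{4}\left(s_1^2 + s_2^2 + s_3^2 + s_4^2\right)$$

Now, express each of the variances in terms of sums of squares:

$$s_{wit}^2 = \frac{1}{4}\left[\frac{\sum_s (X_{1s} - \bar{X}_1)^2}{n - 1} + \frac{\sum_s (X_{2s} - \bar{X}_2)^2}{n - 1} + \frac{\sum_s (X_{3s} - \bar{X}_3)^2}{n - 1} + \frac{\sum_s (X_{4s} - \bar{X}_4)^2}{n - 1}\right]$$

or

$$s_{wit}^2 = \frac{1}{4}\left[\frac{SS_1}{n - 1} + \frac{SS_2}{n - 1} + \frac{SS_3}{n - 1} + \frac{SS_4}{n - 1}\right]$$

in which n = 7 represents the size of each sample group. Factor n − 1 out of the four expressions for variance computed from within each of the four separate treatment groups, and let m = 4 represent the number of treatments (diets), to obtain

$$s_{wit}^2 = \frac{1}{m}\,\frac{SS_1 + SS_2 + SS_3 + SS_4}{n - 1}$$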
The numerator of this fraction is just the total of the sums of squared deviations of the observations about the means of their respective treatment groups. Call it the within treatments (or within groups) sum of squares $SS_{wit}$, so that $SS_{wit} = SS_1 + SS_2 + SS_3 + SS_4$. Note that the within treatments sum of squares is a measure of variability in the observations that is independent of whether or not the mean responses to the different treatments are the same.

For the data from our diet experiment in Table 9-1

$$SS_{wit} = 0.597 + 0.734 + 1.294 + 1.200 = 3.825\ (\text{L/min})^2$$
Given our definition of $SS_{wit}$ and the equation for $s_{wit}^2$ above, we can write

$$s_{wit}^2 = \frac{1}{m}\,\frac{SS_{wit}}{n - 1} = \frac{SS_{wit}}{m(n - 1)}$$

$s_{wit}^2$ appears in the denominator of the F ratio associated with $\nu_d = m(n - 1)$ degrees of freedom. Using this notation for analysis of variance, degrees of freedom are often denoted by DF rather than ν, so let us replace $m(n - 1)$ with $DF_{wit}$ in the equation for $s_{wit}^2$ to obtain

$$s_{wit}^2 = \frac{SS_{wit}}{DF_{wit}}$$

For the diet experiment, $DF_{wit} = m(n - 1) = 4(7 - 1) = 24$ degrees of freedom.
Finally, recall that in Chapter 2 we defined the variance as the “average” squared deviation from the mean. In this spirit, statisticians call the ratio $SS_{wit}/DF_{wit}$ the within groups mean square and denote it $MS_{wit}$. This notation is clumsy, since $SS_{wit}/DF_{wit}$ is not really a mean in the standard statistical meaning of the word and it obscures the fact that $MS_{wit}$ is the estimate of the variance computed from within the groups (that we have been denoting $s_{wit}^2$). Nevertheless, it is so ubiquitous that we will adopt it. Therefore, we will estimate the variance from within the sample groups with

$$MS_{wit} = \frac{SS_{wit}}{DF_{wit}}$$

We will replace $s_{wit}^2$ in the definition of F with this expression.
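To see that the new notation is only a relabeling, the sketch below computes the within groups sum of squares, degrees of freedom, and mean square for the diet data in Table 9-1 and compares the mean square with the average of the four group variances used in Chapter 3. It uses only Python's standard library, and all variable names are illustrative.

```python
import statistics

# Cardiac output (L/min) for each treatment group, copied from Table 9-1
diet = {
    "control": [4.6, 4.7, 4.7, 4.9, 5.1, 5.3, 5.4],
    "spaghetti": [4.6, 5.0, 5.2, 5.2, 5.5, 5.5, 5.6],
    "steak": [4.3, 4.4, 4.9, 4.9, 5.1, 5.3, 5.6],
    "fruit and nuts": [4.3, 4.4, 4.5, 4.9, 4.9, 5.0, 5.6],
}


def sum_of_squares(values):
    """Sum of squared deviations of the values about their own mean."""
    mean = statistics.mean(values)
    return sum((x - mean) ** 2 for x in values)


m = len(diet)   # number of treatments
n = 7           # subjects per treatment group

ss_wit = sum(sum_of_squares(values) for values in diet.values())
df_wit = m * (n - 1)
ms_wit = ss_wit / df_wit

# Chapter 3 approach: average the variances computed within each group
avg_variance = statistics.mean(
    [statistics.variance(values) for values in diet.values()]
)

print(f"SS_wit = {ss_wit:.3f}, DF_wit = {df_wit}, MS_wit = {ms_wit:.3f}")
print(f"average of group variances = {avg_variance:.3f}")
```

The two estimates agree because the groups are the same size. The unrounded within groups sum of squares differs slightly in the third decimal place from the 3.825 obtained by adding the rounded column values.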
