How to Test for Differences between Groups




How to Test for Differences between Groups: Introduction







Statistical methods are used to summarize data and test hypotheses with those data. Chapter 2 discussed how to use the mean, standard deviation, median, and percentiles to summarize data and how to use the standard error of the mean to estimate the precision with which a sample mean estimates the population mean. Now we turn our attention to how to use data to test scientific hypotheses. The statistical techniques used to perform such tests are called tests of significance; they yield the highly prized P value. We now develop procedures to test the hypothesis that, on the average, different treatments all affect some variable identically. Specifically, we will develop a procedure to test the hypothesis that diet has no effect on the mean cardiac output of people living in a small town. Statisticians call this hypothesis of no effect the null hypothesis.




The resulting test can be generalized to analyze data obtained in experiments involving any number of treatments. In addition, it is the archetype for a whole class of related procedures known as analysis of variance.




The General Approach







To begin our experiment, we randomly select four groups of seven people each from a small town with 200 healthy adult inhabitants. All participants give informed consent. People in the control group continue eating normally; people in the second group eat only spaghetti; people in the third group eat only steak; and people in the fourth group eat only fruit and nuts. After 1 month, each person has a cardiac catheter inserted and his or her cardiac output is measured.




As with most tests of significance, we begin with the hypothesis that all treatments (diets) have the same effect (on cardiac output). Since the study includes a control group (as experiments generally should), this hypothesis is equivalent to the hypothesis that diet has no effect on cardiac output. Figure 3-1 shows the distribution of cardiac outputs for the entire population, with each individual’s cardiac output represented by a circle. The specific individuals who were randomly selected for each diet are indicated by shaded circles, with different shading for different diets. Figure 3-1 shows that the null hypothesis is, in fact, true. Unfortunately, as investigators we cannot observe the entire population and are left with the problem of deciding whether or not to reject the null hypothesis from the limited data shown in Figure 3-2. There are obviously differences between the samples; the question is: Are these differences due to the fact that the different groups of people ate differently or are these differences simply a reflection of the random variation in cardiac output between individuals?





Figure 3-1.



The values of cardiac output associated with all 200 members of the population of a small town. Since diet does not affect cardiac output, the four groups of seven people each selected at random to participate in our experiment (control, spaghetti, steak, and fruit and nuts) simply represent four random samples drawn from a single population.






Figure 3-2.



An investigator cannot observe the entire population but only the four samples selected at random for treatment. This figure shows the same four groups of individuals as in Figure 3-1 with their means and standard deviations as they would appear to the investigator. The question facing the investigator is: Are the observed differences due to the different diets or simply random variation? The figure also shows the collection of sample means together with their standard deviation, which is an estimate of the standard error of the mean.





To use the data in Figure 3-2 to address this question, we proceed under the assumption that the null hypothesis that diet has no effect on cardiac output is correct. Since we assume that it does not matter which diet any particular individual ate, we assume that the four experimental groups of seven people each are four random samples of size 7 drawn from a single population of 200 individuals. Since the samples are drawn at random from a population with some variance, we expect the samples to have different means and standard deviations, but if our null hypothesis that the diet has no effect on cardiac output is true, the observed differences are simply due to random sampling.
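This sampling setup is easy to simulate. The sketch below draws four non-overlapping random samples of 7 from a hypothetical population of 200 (the normal distribution, its mean, and its SD are illustrative assumptions, not the book's data) and shows that the sample means and standard deviations differ by chance alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cardiac outputs (L/min) for 200 healthy adults; the
# mean of 5.0 and SD of 1.0 are invented for illustration.
population = rng.normal(loc=5.0, scale=1.0, size=200)

# Draw four non-overlapping random samples of 7 (one per diet group).
people = rng.permutation(200)
groups = [population[people[i * 7:(i + 1) * 7]] for i in range(4)]

for label, g in zip(["control", "spaghetti", "steak", "fruit/nuts"], groups):
    print(f"{label:>10}: mean = {g.mean():.2f}, SD = {g.std(ddof=1):.2f}")
```

Even with no "treatment" applied, the printed means differ from group to group, which is exactly the random variation the null hypothesis must be tested against.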




Forget about statistics for a moment. What is it about different samples that leads you to believe that they are representative samples drawn from different populations? Figures 3-2, 3-3, and 3-4 show three different possible sets of samples of some variable of interest. Simply looking at these pictures makes most people think that the four samples in Figure 3-2 were all drawn from a single population, while the samples in Figures 3-3 and 3-4 were not. Why? The variability within each sample, quantified with the standard deviation, is approximately the same. In Figure 3-2, the variability in the mean values of the samples is consistent with the variability one observes within the individual samples. In contrast, in Figures 3-3 and 3-4, the variability among sample means is much larger than one would expect from the variability within each sample. Notice that we reach this conclusion whether all (Fig. 3-3) or only one (Fig. 3-4) of the sample means appear to differ from the others.





Figure 3-3.



The four samples shown are identical to those in Figure 3-2 except that the variability in the mean values has been increased substantially. The samples now appear to differ from each other because the variability between the sample means is larger than one would expect from the variability within each sample. Compare the variability in mean values relative to the variability within the sample groups with that seen in Figure 3-2.






Figure 3-4.



When the mean of even one of the samples (sample 2) differs substantially from the other samples, the variability computed from between the means is substantially larger than one would expect from examining the variability within the groups.





Now let us formalize this analysis of variability to analyze our diet experiment. The standard deviation or its square, the variance, is a good measure of variability. We will use the variance to construct a procedure to test the hypothesis that diet does not affect cardiac output.




Chapter 2 showed that two population parameters—the mean and standard deviation (or, equivalently, the variance)—completely describe a normally distributed population. Therefore, we will use our raw data to compute these parameters and then base our analysis on their values rather than on the raw data directly. Since the procedures we will now develop are based on these parameters, they are called parametric statistical methods. Because these methods assume that the population from which the samples were drawn can be completely described by these parameters, they are valid only when the real population approximately follows the normal distribution. Other procedures, called nonparametric statistical methods, are based on frequencies, ranks, or percentiles and do not require this assumption.* Parametric methods generally provide more information about the treatment being studied and are more likely to detect a real treatment effect when the underlying population is normally distributed.




We will estimate the population variance in two different ways: (1) The standard deviation or variance computed from each sample is an estimate of the standard deviation or variance of the entire population. Since each of these estimates of the population variance is computed from within each sample group, the estimates will not be affected by any differences in the mean values of different groups. (2) We will use the values of the means of each sample to determine a second estimate of the population variance. In this case, the differences between the means will obviously affect the resulting estimate of the population variance. If all the samples were, in fact, drawn from the same population (i.e., the diet had no effect), these two different ways to estimate the population variance should yield approximately the same number. When they do, we will conclude that the samples were likely to have been drawn from a single population; otherwise, we will reject this hypothesis and conclude that at least one of the samples was drawn from a different population. In our experiment, rejecting the original hypothesis would lead to the conclusion that diet does alter cardiac output.




* In fact, these methods make no assumption about the specific shape of the distribution of the underlying population; they are also called distribution-free methods. We will study these procedures in Chapters 5, 8, 10, and 11.




Two Different Estimates of the Population Variance







How shall we estimate the population variance from the four sample variances? When the hypothesis that the diet does not affect cardiac output is true, the variances of each sample of seven people, regardless of what they ate, are equally good estimates of the population variance, so we simply average our four estimates of variance within the treatment groups:




Average variance in cardiac output within treatment groups = 1/4 (variance in cardiac output of controls + variance in cardiac output of spaghetti eaters + variance in cardiac output of steak eaters + variance in cardiac output of fruit and nut eaters)




The mathematical equivalent is

$$s_{\text{wit}}^2 = \frac{1}{4}\left(s_{\text{con}}^2 + s_{\text{spa}}^2 + s_{\text{st}}^2 + s_{\text{fn}}^2\right)$$

where $s^2$ represents variance. The variance of each sample is computed with respect to the mean of that sample. Therefore, the population variance estimated from within the groups, the within-groups variance $s_{\text{wit}}^2$, will be the same whether or not diet altered cardiac output.
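As a concrete sketch, the within-groups estimate is just the average of the four sample variances. The cardiac-output values below are invented for illustration, not taken from the book's figures:

```python
import numpy as np

# Hypothetical cardiac outputs (L/min), seven people per diet group.
con   = np.array([4.6, 5.1, 4.8, 5.4, 5.0, 4.7, 5.2])
spa   = np.array([4.9, 5.3, 4.5, 5.1, 4.8, 5.0, 5.5])
stk   = np.array([5.2, 4.7, 5.0, 4.9, 5.3, 4.6, 5.1])
fruit = np.array([4.8, 5.0, 5.2, 4.5, 4.9, 5.4, 4.7])

# Each sample variance is computed about its own sample mean (ddof=1),
# then the four are averaged to estimate the population variance.
s2_wit = np.mean([g.var(ddof=1) for g in (con, spa, stk, fruit)])
print(f"within-groups variance estimate: {s2_wit:.3f}")
```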




Next, we estimate the population variance from the means of the samples. Since we have hypothesized that all four samples were drawn from a single population, the standard deviation of the four sample means will approximate the standard error of the mean. Recall that the standard error of the mean is related to the sample size n (in this case 7) and the population standard deviation σ according to

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$
Therefore, the true population variance σ² is related to the sample size and standard error of the mean according to

$$\sigma^2 = n\,\sigma_{\bar{X}}^2$$
We use this relationship to estimate the population variance from the variability between the sample means using

$$s_{\text{bet}}^2 = n\,s_{\bar{X}}^2$$

where $s_{\text{bet}}^2$ is the estimate of the population variance computed from between the sample means and $s_{\bar{X}}$ is the standard deviation of the means of the four sample groups, the standard error of the mean. This estimate of the population variance, computed from between the group means, is often called the between-groups variance.
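A minimal sketch of the between-groups estimate, using invented group means (n = 7 per group as in the experiment):

```python
import numpy as np

n = 7  # people per diet group
# Hypothetical mean cardiac outputs (L/min) of the four groups;
# these values are illustrative only.
group_means = np.array([4.97, 5.01, 4.97, 4.93])

# The SD of the sample means estimates the standard error of the mean,
# so the population variance is estimated as n times its square.
s2_bet = n * group_means.var(ddof=1)
print(f"between-groups variance estimate: {s2_bet:.4f}")
```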




If the null hypothesis that all four samples were drawn from the same population is true (i.e., that diet does not affect cardiac output), the within-groups variance and between-groups variance are both estimates of the same population variance and so should be about equal. Therefore, we will compute the following ratio, called the F-test statistic:

$$F = \frac{s_{\text{bet}}^2}{s_{\text{wit}}^2}$$

Since both the numerator and the denominator are estimates of the same population variance σ², F should be about σ²/σ² = 1. Since F is about equal to 1 for the four random samples in Figure 3-2, we conclude that the data in Figure 3-2 are not inconsistent with the hypothesis that diet does not affect cardiac output, and we continue to accept that hypothesis.
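Putting the two estimates together, F is simply their ratio. A sketch with invented data for the four diet groups:

```python
import numpy as np

# Hypothetical cardiac outputs (L/min) for the four diet groups;
# values invented for illustration, n = 7 per group.
groups = [
    np.array([4.6, 5.1, 4.8, 5.4, 5.0, 4.7, 5.2]),  # control
    np.array([4.9, 5.3, 4.5, 5.1, 4.8, 5.0, 5.5]),  # spaghetti
    np.array([5.2, 4.7, 5.0, 4.9, 5.3, 4.6, 5.1]),  # steak
    np.array([4.8, 5.0, 5.2, 4.5, 4.9, 5.4, 4.7]),  # fruit and nuts
]
n = len(groups[0])

s2_wit = np.mean([g.var(ddof=1) for g in groups])        # within groups
s2_bet = n * np.var([g.mean() for g in groups], ddof=1)  # between groups
F = s2_bet / s2_wit
print(f"F = {F:.2f}")
```

For equal group sizes this ratio is identical to the F statistic of a one-way analysis of variance.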




Now we have a rule for deciding when to reject the null hypothesis that all the samples were drawn from the same population:




If F is a big number, the variability between the sample means is larger than expected from the variability within the samples, so reject the null hypothesis that all the samples were drawn from the same population.




This quantitative statement formalizes the qualitative logic we used when discussing Figures 3-2, 3-3, and 3-4. The F associated with Figure 3-3 is 68.0, and that associated with Figure 3-4 is 24.5.




What Is a “Big” F?







The exact value of F one computes depends on which individuals were drawn for the random samples. For example, Figure 3-5 shows yet another set of four samples of seven people drawn from the population of 200 people in Figure 3-1. In this example F = 0.5. Suppose we repeated our experiment 200 times on the same population. Each time we would draw four different samples of people and—even if the diet had no effect on cardiac output—get slightly different values for F due to random variation. Figure 3-6A shows the result of this procedure, with the resulting Fs rounded to one decimal place and represented with a circle; the two dark circles represent the values of F computed from the data in Figures 3-2 and 3-5. The exact shape of the distribution of values of F depends on how many samples were drawn, the size of each sample, and the distribution of the population from which the samples were drawn.





Figure 3-5.



Four samples of seven members each drawn from the population shown in Figure 3-1. Note that the variability in sample means is consistent with the variability within each of the samples, F = 0.5.






Figure 3-6.



(A) Values of F computed from 200 experiments involving four samples, each of size 7, drawn from the population in Figure 3-1. (B) We expect F to exceed 3.0 only 5% of the time when all samples were, in fact, drawn from a single population. (C) Results of computing the F ratio for all possible samples drawn from the original population. The 5% of most extreme F values are shown darker than the rest. (D) The F distribution one would obtain when sampling an infinite population. In this case, the cutoff value for considering F to be “big” is that value of F that subtends the upper 5% of the total area under the curve.





As expected, most of the computed Fs are around 1 (i.e., between 0 and 2), but a few are much larger. Thus, even though most experiments will produce relatively small values of F, it is possible that, by sheer bad luck, one could select random samples that are not good representatives of the whole population. The result is an occasional relatively large value for F even though the treatment had no effect. Figure 3-6B shows, however, that such values are unlikely. Only 5% of the 200 experiments (i.e., 10 experiments) produced F values equal to or greater than 3.0. We now have a tentative estimate of what to call a “big” value for F. Since F exceeded 3.0 only 10 out of 200 times when all the samples were drawn from the same population, we might decide that F is big when it exceeds 3.0 and reject the null hypothesis that all the samples were drawn from the same population (i.e., that the treatment had no effect). In deciding to reject the hypothesis of no effect when F is big, we accept the risk of erroneously rejecting this hypothesis 5% of the time because F will be 3.0 or greater about 5% of the time, even when the treatment does not alter mean response.
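This repeated-experiment procedure is straightforward to simulate. The sketch below uses a hypothetical normal population (its mean, SD, and the random seed are arbitrary assumptions), runs 200 experiments under the null hypothesis, and reports the fraction of F values at or above 3.0, which should come out near 5%:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical population of 200 cardiac outputs (L/min).
population = rng.normal(5.0, 1.0, size=200)

def one_experiment():
    # Draw four random samples of 7 and compute F when the null
    # hypothesis is true (no treatment is applied to anyone).
    idx = rng.permutation(200)
    groups = [population[idx[i * 7:(i + 1) * 7]] for i in range(4)]
    s2_wit = np.mean([g.var(ddof=1) for g in groups])
    s2_bet = 7 * np.var([g.mean() for g in groups], ddof=1)
    return s2_bet / s2_wit

Fs = np.array([one_experiment() for _ in range(200)])
print(f"fraction of F values >= 3.0: {np.mean(Fs >= 3.0):.2f}")
```

Because only 200 experiments are drawn, the observed fraction will wobble around 0.05 from run to run, just as the circles in Figure 3-6B are only a sample of all possible experiments.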




When we obtain such a “big” F, we reject the original null hypothesis that all the means are the same and report P < .05. P < .05 means that there is less than a 5% chance of getting a value of F as big or bigger than the computed value if the original hypothesis were true (i.e., diet did not affect cardiac output).




The critical value of F should be selected on the basis of not just 200 experiments but all 10^42 possible experiments. Suppose we did all 10^42 experiments and computed the corresponding F values, then plotted the results as we did in Figure 3-6B. Figure 3-6C shows the results, with a grain of sand representing each observed F value. The darker sand indicates the biggest 5% of the F values. Notice how similar it is to Figure 3-6B. This similarity should not surprise you, since the results in Figure 3-6B are just a random sample of the population in Figure 3-6C. Finally, recall that everything so far has been based on an original population containing only 200 members. In reality, populations are usually much larger, so there can be many more than 10^42 possible values of F. Often, there are essentially an infinite number of possible experiments. In terms of Figure 3-6C, it is as if all the grains of sand melted together to yield the continuous line in Figure 3-6D.




Therefore, areas under the curve are analogous to the fractions of the total number of circles or grains of sand in Figures 3-6B and 3-6C. Since the shaded region in Figure 3-6D represents 5% of the total area under the curve, we can use it to determine that the cutoff point for a "big" F with the number of samples and sample size in this study is 3.01. This and other cutoff values that correspond to P < .05 and P < .01 are listed in Table 3-1.
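For a continuous F distribution like the one in Figure 3-6D, the cutoff can be computed directly rather than read from a table. The degrees of freedom used below (3 between groups, 24 within) follow the standard one-way analysis-of-variance formulas, m − 1 and m(n − 1) for m groups of n, which are an assumption here since this section has not yet derived them:

```python
from scipy import stats

dfn, dfd = 3, 24  # 4 - 1 between-groups df; 4 * (7 - 1) within-groups df

# Value of F that cuts off the upper 5% (and 1%) of the area under
# the F distribution for this design.
print(round(stats.f.ppf(0.95, dfn, dfd), 2))  # 3.01, matching the text
print(round(stats.f.ppf(0.99, dfn, dfd), 2))  # cutoff for P < .01
```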





Table 3-1. Critical Values of F Corresponding to P < .05 (Lightface) and P < .01 (Boldface)