The Special Case of Two Groups: The t Test




Introduction







As we have just seen in Chapter 3, many investigations require comparing only two groups. In addition, as the last example in Chapter 3 illustrated, when there are more than two groups, analysis of variance only allows you to conclude that the data are not consistent with the hypothesis that all the samples were drawn from a single population. It does not help you decide which one or ones are most likely to differ from the others. To answer these questions, we now develop a procedure that is specifically designed to test for differences in two groups: the t test or Student's t test. While we will develop the t test from scratch, we will eventually show that it is just a different way of doing an analysis of variance. In particular, we will see that F = t² when there are two groups.




The t test is the most common statistical procedure in the medical literature; you can expect it to appear in more than half the papers you read in the general medical literature. In addition to being used to compare two group means, it is widely, and incorrectly, applied to compare multiple groups by doing all the pairwise comparisons, for example, comparing more than one intervention with a control condition or comparing the state of a patient at different times following an intervention. As we will see, this incorrect use increases the chances of rejecting the null hypothesis of no effect above the nominal level, say 5%, used to select the cutoff value for a "big" value of the test statistic t. In practical terms, this boils down to increasing the chances of reporting that some therapy had an effect when the evidence does not support this conclusion.




The General Approach







Suppose we wish to test a new drug that may be an effective diuretic. We assemble a group of 10 people and divide them at random into two groups, a control group that receives a placebo and a treatment group that receives the drug; then we measure their urine production for 24 hours. Figure 4-1A shows the resulting data. The average urine production of the group receiving the diuretic is 240 mL higher than that of the group receiving the placebo. Simply looking at Figure 4-1A, however, does not provide very convincing evidence that this difference is due to anything more than random sampling.





Figure 4-1.



(A) Results of a study in which five people were treated with a placebo and five people were treated with a drug thought to increase daily urine production. On the average, the five people who received the drug produced more urine than the placebo group. Are these data convincing evidence that the drug is an effective diuretic? (B) Results of a similar study with 20 people in each treatment group. The means and standard deviations associated with the two groups are similar to the results in panel A. Are these data convincing evidence that the drug is an effective diuretic? If you changed your mind, why did you do it?





Nevertheless, we pursue the problem and give the placebo or drug to another 30 people to obtain the results shown in Figure 4-1B. The mean responses of the two groups of people as well as the standard deviations are almost identical to those observed in the smaller samples shown in Figure 4-1A. Even so, most observers are more confident in claiming that the diuretic increased average urine output from the data in Figure 4-1B than from the data in Figure 4-1A, even though the samples in each case are good representatives of the underlying population. Why?




As the sample size increases, most observers become more confident in their estimates of the population means, so they can begin to discern a difference between the people taking the placebo and the people taking the drug. Recall that the standard error of the mean quantifies the uncertainty of the estimate of the true population mean based on a sample. Furthermore, as the sample size increases, the standard error of the mean decreases according to

$$ \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} $$

where n is the sample size and σ is the standard deviation of the population from which the sample was drawn. As the sample size increases, the uncertainty in the estimate of the difference of the means between the people who received the placebo and the people who received the drug decreases relative to the difference of the means. As a result, we become more confident that the drug actually has an effect. More precisely, we become less confident in the hypothesis that the drug had no effect, in which case the two samples of patients could be considered two samples drawn from a single population.
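To see this shrinkage numerically, here is a minimal simulation sketch (assuming Python with NumPy; the sample sizes and seed are arbitrary) comparing the observed spread of many sample means against σ/√n:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # standard deviation of the population we sample from

for n in (5, 20, 80):
    # Draw 10,000 samples of size n and compute each sample's mean;
    # the standard deviation of these means estimates the standard error
    means = rng.normal(0.0, sigma, size=(10_000, n)).mean(axis=1)
    print(f"n={n:3d}  observed SE={means.std():.3f}  sigma/sqrt(n)={sigma / np.sqrt(n):.3f}")
```

Quadrupling the sample size halves the standard error, which is why the larger study in Figure 4-1B is more convincing than the smaller one in Figure 4-1A.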




To formalize this logic, we will examine the ratio

$$ t = \frac{\text{difference in sample means}}{\text{standard error of the difference of the sample means}} $$
When this ratio is small we will conclude that the data are compatible with the hypothesis that both samples were drawn from a single population. When this ratio is large we will conclude that it is unlikely that the samples were drawn from a single population and assert that the treatment (e.g., the diuretic) produced an effect.




This logic, while differing in emphasis from that used to develop the analysis of variance, is essentially the same. In both cases, we are comparing the relative magnitude of the differences in the sample means with the amount of variability that would be expected from looking within the samples.
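Since the chapter will show that F = t² when there are two groups, a quick numerical check may help make the connection concrete. The sketch below (assuming SciPy is available; the data are made-up, hypothetical urine volumes) runs both tests on the same two samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical 24-hour urine volumes (mL); these numbers are illustrative only
placebo = rng.normal(1200, 200, size=10)
drug = rng.normal(1400, 200, size=10)

t, p_t = stats.ttest_ind(placebo, drug)  # two-sample t test (pooled variance)
F, p_F = stats.f_oneway(placebo, drug)   # one-way analysis of variance

print(t**2, F)   # these agree: F = t^2 when there are two groups
print(p_t, p_F)  # and the two tests give the same P value
```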




To compute the t ratio we need to know two things: the difference of the sample means and the standard error of this difference. Computing the difference of the sample means is easy; we simply subtract. Computing an estimate for the standard error of this difference is a bit more involved. We begin with a slightly more general problem, that of finding the standard deviation of the difference of two numbers drawn at random from the same population.




The Standard Deviation of a Difference or a Sum







Figure 4-2A shows a population with 200 members. The mean is 0, and the standard deviation is 1. Now, suppose we draw two members at random and compute their difference. Figure 4-2B shows this result for the two members indicated by solid circles in Figure 4-2A. Drawing five more pairs (indicated by different symbols in Fig. 4-2A) and computing their differences yields the corresponding shaded points in Figure 4-2B. Note that there seems to be more variability in the differences than in the original samples themselves. Figure 4-2C shows the results of Figure 4-2B, together with the results of drawing another 50 pairs of numbers at random and computing their differences. The standard deviation of the population of differences is about 40% larger than the standard deviation of the population from which the samples were drawn.





Figure 4-2.



If one selects pairs of members of the population in panel A at random and computes their differences, the population of differences, shown in panel B, has a larger variance than the original population. Panel C shows another 100 values of differences of pairs of members selected at random from the population in panel A to make this point again.





In fact, it is possible to demonstrate mathematically that the variance of the difference (or sum) of two variables selected at random equals the sum of the variances of the two populations from which the samples were drawn. In other words, if X is drawn from a population with standard deviation $\sigma_X$ and Y is drawn from a population with standard deviation $\sigma_Y$, the distribution of all possible values of X − Y (or X + Y) will have variance

$$ \sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 $$
This result should seem reasonable to you: when you select pairs of values that are on opposite sides of the population mean and compute their difference (or on the same side and compute their sum), the result will be even farther from the mean than either value alone. Returning to the example in Figure 4-2, we observe that both the first and second numbers were drawn from the same population, whose variance was 1, and so the variance of the difference should be

$$ \sigma_{\text{diff}}^2 = \sigma^2 + \sigma^2 = 1 + 1 = 2 $$

Since the standard deviation is the square root of the variance, the standard deviation of the population of differences will be $\sqrt{2} \approx 1.4$ times the standard deviation of the original population, or about 40% bigger, confirming our earlier subjective impression.*
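A short simulation sketch (Python with NumPy assumed; the number of draws is arbitrary) confirms the √2 factor:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two values drawn from the same population with standard deviation 1
x = rng.normal(0.0, 1.0, size=100_000)
y = rng.normal(0.0, 1.0, size=100_000)

# The differences have standard deviation near sqrt(2), about 1.41,
# roughly 40% larger than that of the original population
print((x - y).std(), np.sqrt(2.0))
```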




When we wish to estimate the variance in the difference or sum of members of two populations based on the observations, we simply replace the population variances $\sigma^2$ in the equation above with the estimates of the variances computed from our samples:

$$ s_{X-Y}^2 = s_X^2 + s_Y^2 $$
The standard error of the mean is just the standard deviation of the population of all possible sample means of samples of size n, and so we can find the standard error of the difference of two means using the equation above. Specifically,

$$ \sigma_{\bar{X}_1 - \bar{X}_2} = \sqrt{\sigma_{\bar{X}_1}^2 + \sigma_{\bar{X}_2}^2} $$
in which case the estimate computed from the two samples is

$$ s_{\bar{X}_1 - \bar{X}_2} = \sqrt{s_{\bar{X}_1}^2 + s_{\bar{X}_2}^2} $$
Now we are ready to construct the t ratio from the definition in the last section.




* The fact that the sum of randomly selected variables has a variance equal to the sum of the variances of the individual numbers explains why the standard error of the mean equals the standard deviation divided by $\sqrt{n}$. Suppose we draw n numbers at random from a population with standard deviation σ. The mean of these numbers will be

$$ \bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} $$

so

$$ n\bar{X} = X_1 + X_2 + \cdots + X_n $$

Since the variance associated with each of the $X_i$ is $\sigma^2$, the variance of $n\bar{X}$ will be

$$ \sigma_{n\bar{X}}^2 = \sigma^2 + \sigma^2 + \cdots + \sigma^2 = n\sigma^2 $$

and the standard deviation will be

$$ \sigma_{n\bar{X}} = \sqrt{n}\,\sigma $$

But we want the standard deviation of $\bar{X} = n\bar{X}/n$, which is therefore

$$ \sigma_{\bar{X}} = \frac{\sigma_{n\bar{X}}}{n} = \frac{\sqrt{n}\,\sigma}{n} = \frac{\sigma}{\sqrt{n}} $$

which is the formula for the standard error of the mean. Note that we made no assumptions about the population from which the sample was drawn. (In particular, we did not assume that it had a normal distribution.)




Use of t to Test Hypotheses About Two Groups







Recall that we decided to examine the ratio

$$ t = \frac{\text{difference in sample means}}{\text{standard error of the difference of the sample means}} $$
We can now use the result of the last section to translate this definition into the equation

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}} = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_{\bar{X}_1}^2 + s_{\bar{X}_2}^2}} $$
Alternatively, we can write t in terms of the sample standard deviations rather than the standard errors of the mean:

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n} + \dfrac{s_2^2}{n}}} $$
in which n is the size of each sample.




If the hypothesis that the two samples were drawn from the same population is true, the variances $s_1^2$ and $s_2^2$ computed from the two samples are both estimates of the same population variance $\sigma^2$. Therefore, we replace the two different estimates of the population variance in the equation above with a single estimate, $s^2$, that is obtained by averaging these two separate estimates:

$$ s^2 = \frac{s_1^2 + s_2^2}{2} $$
This is called the pooled-variance estimate, since it is obtained by pooling the two estimates of the population variance to obtain a single estimate. The t test statistic based on the pooled-variance estimate is

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s^2}{n} + \dfrac{s^2}{n}}} $$
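Translated into code, a minimal sketch of this statistic for two equal-size samples (Python with NumPy; the function name is our own, not from the text):

```python
import numpy as np

def pooled_t(x1, x2):
    """t statistic for two equal-size samples using the pooled-variance estimate."""
    n = len(x1)  # assumes len(x1) == len(x2)
    # Average the two sample variances to get the pooled estimate s^2
    s2 = (np.var(x1, ddof=1) + np.var(x2, ddof=1)) / 2
    return (np.mean(x1) - np.mean(x2)) / np.sqrt(s2 / n + s2 / n)
```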
The specific value of t one obtains from any two samples depends not only on whether or not there actually is a difference in the means of the populations from which the samples were drawn but also on which specific individuals happened to be selected for the samples. Thus, as for F, there will be a range of possible values that t can have, even when both samples are drawn from the same population. Since the means computed from the two samples will generally be close to the mean of the population from which they were drawn, the value of t will tend to be small when the two samples are drawn from the same population. Therefore, we will use the same procedure to test hypotheses with t as we did with F in Chapter 3. Specifically, we will compute t from the data and then reject the assertion that the two samples were drawn from the same population if the resulting value of t is "big."




Let us return to the problem of assessing the value of the diuretic we were discussing earlier. Suppose the entire population of interest contains 200 people. In addition, we will assume that the diuretic had no effect, so that the two groups of people being studied can be considered two samples drawn from a single population. Figure 4-3A shows this population, together with two samples of 10 people each selected at random for study. The people who received the placebo are shown as dark circles, and the people who received the diuretic are shown as lighter circles. The lower part of Figure 4-3A shows the data as they would appear to the investigator, together with the means and standard deviations computed from each of the two samples. Looking at these data certainly does not suggest that the diuretic had any effect. The value of t associated with these samples is −0.2.





Figure 4-3.




A population of 200 individuals and two groups selected at random for study of a drug designed to increase urine production but which is totally ineffective. The people shown as dark circles received the placebo, and those shown as lighter circles received the drug. An investigator would not see the entire population, just the information reflected in the lower part of panel A; nevertheless, the two samples show very little difference, and it is unlikely that one would have concluded that the drug had an effect on urine production. Of course, there is nothing special about the two random samples shown in panel A, and an investigator could just as well have selected the two groups of people in panel B for study. There is more difference between these two groups than between the two shown in panel A, and there is a chance that the investigator would think that this difference is due to the drug's effect on urine production rather than simple random sampling. Panel C shows yet another pair of random samples the investigator might have drawn for the study.





Of course, there is nothing special about these two samples, and we could just as well have selected two different groups of people to study. Figure 4-3B shows another collection of people that could have been selected at random to receive the placebo (dark circles) or diuretic (light circles). Not surprisingly, these two samples differ from each other as well as from the samples selected in Figure 4-3A. Given only the data in the lower part of Figure 4-3B, we might think that the diuretic increases urine production. The t value associated with these data is −2.1. Figure 4-3C shows yet another pair of samples. They differ from each other and from the samples in Figures 4-3A and 4-3B. The samples in Figure 4-3C yield a value of 0 for t.




We could continue this process for quite a long time, since there are more than $10^{27}$ different pairs of samples of 10 people each that we could draw from the population of 200 individuals shown in Figure 4-3A. We can compute a value of t for each of these $10^{27}$ different pairs of samples. Figure 4-4 shows the values of t associated with 200 different pairs of random samples of 10 people each drawn from the original population, including the three specific pairs of samples shown in Figure 4-3. The distribution of possible t values is symmetrical about t = 0 because it does not matter which of the two samples we subtract from the other. As predicted, most of the resulting values of t are close to zero; t is rarely below about −2 or above +2.
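The sampling experiment behind Figure 4-4 is easy to mimic. The sketch below (Python with NumPy; the population is a normally distributed stand-in, not the book's actual 200 values) repeatedly draws two samples of 10 from one population and collects the resulting t values:

```python
import numpy as np

rng = np.random.default_rng(3)
population = rng.normal(0.0, 1.0, size=200)  # stand-in for the 200-member population

t_values = []
for _ in range(10_000):
    draw = rng.choice(population, size=20, replace=False)
    x1, x2 = draw[:10], draw[10:]  # two random samples of 10 people each
    s2 = (np.var(x1, ddof=1) + np.var(x2, ddof=1)) / 2  # pooled variance
    t_values.append((x1.mean() - x2.mean()) / np.sqrt(2 * s2 / 10))

# Roughly 5% of the t values fall beyond -2.1 or +2.1,
# even though every sample came from the same population
print(np.mean(np.abs(np.array(t_values)) > 2.1))
```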





Figure 4-4.



The results of 200 studies like that described in Figure 4-3; the three specific studies from Figure 4-3 are indicated in panel A. Note that most values of the t statistic cluster around 0, but it is possible for some values of t to be quite large, exceeding 1.5 or 2. Panel B shows that there are only 10 chances in 200 of t exceeding 2.1 in magnitude if the two samples were drawn from the same population. If one continues examining all possible pairs of samples drawn from the same population, one obtains the distribution of all possible t values, which becomes the smooth curve in panel C. In this case, one defines the critical value of t by taking the 5% most extreme area under the tails of the distribution and selecting the value of t corresponding to the beginning of this region; it is unlikely that a t statistic this extreme would be observed under the hypothesis that the drug had no effect. Panel D shows that if one requires a more stringent criterion for rejecting the hypothesis of no difference, namely that t be in the most extreme 1% of all possible values, the cutoff value of t is 2.878.





Figure 4-4 allows us to determine what a "big" t is. Figure 4-4B shows that t will be less than −2.1 or greater than +2.1 in 10 out of 200 pairs of samples, or 5% of the time. In other words, there is only a 5% chance of getting a value of t more extreme than −2.1 or +2.1 when the two samples are drawn from the same population. Just as with the F distribution, the number of possible t values rapidly increases beyond $10^{27}$ as the population size grows, and the distribution of possible t values approaches a smooth curve. Figure 4-4C shows the result of this limiting process. We define the cutoff values of t that are large enough to be called "big" on the basis of the total area in the two tails. Figure 4-4C shows that only 5% of the possible values of t will lie beyond −2.1 or +2.1 when the two samples are drawn from a single population. When the data are associated with a value of t beyond this range, it is customary to conclude that the data are inconsistent with the null hypothesis of no difference between the two samples and report that there was a difference between the treatments.




The extreme values of t that lead us to reject the hypothesis of no difference lie in both tails of the distribution. Therefore, the approach we are taking is sometimes called a two-tailed t test. Occasionally, people use a one-tailed t test, and there are indeed cases where this is appropriate. One should be suspicious of such one-tailed tests, however, because the cutoff value for calling t “big” for a given value of P is smaller. In reality, people are almost always looking for a difference between the control and treatment groups so a two-tailed test is appropriate. This book always assumes a two-tailed test.




Note that the data in Figure 4-3B are associated with a t value of −2.1, which we have decided to consider "big." If all we had were the data shown in Figure 4-3B, we would conclude that the observations were inconsistent with the hypothesis that the diuretic had no effect and report that it increased urine production; even though we did the statistical analysis correctly, our conclusion about the drug would be wrong.




Reporting P < .05 means that if the treatment had no effect, there is less than a 5% chance of getting a value of t from the data as far from 0 as, or farther than, the critical value for calling t "big." It does not mean it is impossible to get such a large value of t when the treatment has no effect. We could, of course, be more conservative and say that we will reject the hypothesis of no difference between the populations from which the samples were drawn only if t is in the most extreme 1% of possible values. Figure 4-4D shows that this would require t to be beyond −2.88 or +2.88 in this case, so we would not erroneously conclude that the drug had an effect on urine output in any of the specific examples shown in Figure 4-3. In the long run, however, we will make such errors about 1% of the time. The price of this conservatism is a decreased chance of concluding that there is a difference when one really exists. Chapter 6 discusses this trade-off in more detail.




The critical values of t, like those of F, have been tabulated and depend not only on the level of confidence with which one rejects the hypothesis of no difference—the P value—but also on the sample size. As with the F distribution, this dependence on sample size enters the table as the degrees of freedom, ν, which is equal to 2(n − 1) for this t test, where n is the size of each sample. As the sample size increases, the value of t needed to reject the hypothesis of no difference decreases. In other words, as the sample size increases it becomes possible to detect smaller differences with a given level of confidence. Reflecting on Figure 4-1 should convince you that this is reasonable.
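Rather than reading a table, the critical values can be computed directly; a sketch assuming SciPy is available (the sample sizes shown are arbitrary):

```python
from scipy import stats

# Two-tailed critical values of t for two samples of size n each,
# with nu = 2(n - 1) degrees of freedom
for n in (10, 20, 100):
    nu = 2 * (n - 1)
    t_05 = stats.t.ppf(0.975, nu)  # P = .05, two-tailed
    t_01 = stats.t.ppf(0.995, nu)  # P = .01, two-tailed
    print(nu, round(t_05, 3), round(t_01, 3))

# For n = 10 (nu = 18) this gives 2.101 and 2.878, the cutoffs quoted for
# Figure 4-4; the critical value shrinks as the sample size grows
```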




What If the Two Samples Are Not the Same Size?







It is easy to generalize the t test to handle problems in which there are different numbers of members in the two samples being studied. Recall that t is defined by

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_{\bar{X}_1}^2 + s_{\bar{X}_2}^2}} $$
in which $s_{\bar{X}_1}$ and $s_{\bar{X}_2}$ are the standard errors of the means of the two samples. If the first sample is of size $n_1$ and the second sample contains $n_2$ members,

$$ s_{\bar{X}_1} = \frac{s_1}{\sqrt{n_1}} \qquad \text{and} \qquad s_{\bar{X}_2} = \frac{s_2}{\sqrt{n_2}} $$
in which $s_1$ and $s_2$ are the standard deviations of the two samples. Use these definitions to rewrite the definition of t in terms of the sample standard deviations:

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} $$
When the two samples are different sizes, the pooled estimate of the variance is given by

$$ s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $$
so that

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s^2}{n_1} + \dfrac{s^2}{n_2}}} $$

This is the definition of t for comparing two samples of unequal size. There are $\nu = n_1 + n_2 - 2$ degrees of freedom.




Notice that this result reduces to our earlier results when the two sample sizes are equal, that is, when $n_1 = n_2 = n$.
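A general version of the unequal-size formula in code (a sketch; Python with NumPy, and the function name is ours) also lets you verify that reduction:

```python
import numpy as np

def t_unequal(x1, x2):
    """Pooled-variance t statistic and degrees of freedom for samples of any sizes."""
    n1, n2 = len(x1), len(x2)
    s2 = ((n1 - 1) * np.var(x1, ddof=1) + (n2 - 1) * np.var(x2, ddof=1)) / (n1 + n2 - 2)
    t = (np.mean(x1) - np.mean(x2)) / np.sqrt(s2 / n1 + s2 / n2)
    return t, n1 + n2 - 2

# When n1 = n2 = n, s2 reduces to the simple average of the two sample
# variances and t matches the equal-size formula given earlier.
```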




Cell Phones Revisited







The study by Fejes and colleagues of the relationship between cell phone use and rapid sperm motility that we discussed in Chapter 3 had two observational groups, 61 men who used cell phones less than 15 minutes per day and 61 men who used cell phones more than 60 minutes per day, so we can analyze their data using a t test as well as an analysis of variance. From Figure 3-7, the mean percentage of rapidly motile sperm was 49% for the low-use group and 41% for the high-use group. The standard deviations were 21% and 22%, respectively. Because the sample sizes are equal,*

$$ s^2 = \frac{s_1^2 + s_2^2}{2} = \frac{21^2 + 22^2}{2} = 462.5 $$

and

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s^2}{n} + \dfrac{s^2}{n}}} = \frac{49 - 41}{\sqrt{\dfrac{462.5}{61} + \dfrac{462.5}{61}}} = 2.05 $$

with $\nu = 2(n - 1) = 2(61 - 1) = 120$ degrees of freedom. Table 4-1 shows that the magnitude of t should exceed 1.980 only 5% of the time by chance when the null hypothesis is true, in this case that cell phone exposure does not affect rapid sperm motility (P < .05). Since the magnitude of t associated with the data exceeds 1.980, we reject the null hypothesis and conclude that heavier cell phone use is associated with a lower percentage of rapidly motile sperm.
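As a check on the arithmetic, SciPy can run the same pooled-variance test directly from the summary statistics (a sketch assuming SciPy is available; the means, standard deviations, and group sizes are those quoted above from Figure 3-7):

```python
from scipy import stats

# Summary statistics as read from Figure 3-7 of the text
t, p = stats.ttest_ind_from_stats(mean1=49, std1=21, nobs1=61,
                                  mean2=41, std2=22, nobs2=61)
print(t, p)  # t is about 2.05 with nu = 120; P is just under .05
```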





Table 4-1. Critical Values of t (Two-Tailed)