Chapter 50
Data handling and statistics essentials
Craig D. Newgard and Roger J. Lewis

Introduction

Performing well-designed and properly analyzed clinical and systems research is the essential first step in improving out-of-hospital medical care, and only through such research can we better understand the effectiveness of both current and new practices. Data handling procedures and appropriate use of statistics are critical aspects of high-quality out-of-hospital research. The purpose of this chapter is to present and explore concepts underlying data handling and statistical analysis for clinical research in the out-of-hospital setting.

Classical hypothesis testing

Historically, research data have been analyzed using p values and classical hypothesis testing [1–3], although there has been an increasing emphasis on the use of confidence intervals in the presentation of results. In classical hypothesis testing, two hypotheses or conclusions that might be supported by the data are considered. The first, called the null hypothesis, is the idea that there is no difference between two or more groups with respect to the measured quantity of interest [2]. For example, in a study examining the effect of a new EMS dispatch system on response time (response interval), the null hypothesis might be that there is no difference between the median response times before and after implementation of the new system. The alternative hypothesis is the idea that the groups being compared are different with respect to the quantity of interest and, ideally, the alternative hypothesis defines the size of that difference [2]. For example, the alternative hypothesis might be that the new EMS dispatch system decreases the median response time by 1 minute or more compared to the old dispatch system. The difference between the two groups defined by the alternative hypothesis is called the "treatment effect" or "effect size." The size of the difference implied by the alternative hypothesis, the treatment effect, must be defined prior to data collection [4–7].

Once the null and alternative hypotheses have been defined, the study is conducted and the data are obtained. During analysis of the results, the null hypothesis is "tested" to determine which hypothesis (null or alternative) will be accepted as true. Testing the null hypothesis consists of calculating the probability of obtaining the results observed, or results more inconsistent with the null hypothesis, assuming the null hypothesis is true. This probability is the p value [2]. For example, it may be determined that the median response time was 6 minutes for the new dispatch system and 7 minutes for the old system. Testing is done to determine the probability that the 1-minute decrease in response time is due solely to chance and is not a reflection of improvements in the dispatch system. If the p value is less than some predefined value, denoted α, then the null hypothesis is rejected and the alternative hypothesis is accepted as true. In other words, if results like those obtained would occur with a probability of less than α (where α is usually 5%, or 0.05) when the null hypothesis is true, then the null hypothesis is rejected as false. Using our previous example, the researcher might state that he or she is willing to accept a 5% probability (α) of falsely concluding that there is a difference between dispatch systems when in reality there is no difference. If a 1-minute difference between the dispatch systems were observed, this difference could be due solely to chance, especially if the p value is greater than 0.05.
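As a concrete, hypothetical illustration (Python with SciPy, which are not part of this chapter, and invented response-time data), the sketch below computes a p value for the dispatch example using the Wilcoxon rank sum (Mann-Whitney U) test, a rank-based test discussed later in the chapter that compares the two groups without assuming a particular distribution.

```python
# Hypothetical example only: response times (minutes) under the old and new
# dispatch systems. The Wilcoxon rank sum (Mann-Whitney U) test asks how
# likely a separation at least this large would be if the null hypothesis
# (no difference between systems) were true.
from scipy import stats

old_system = [7.1, 6.8, 7.5, 8.0, 6.9, 7.3, 7.7, 6.5, 8.2, 7.0]  # minutes
new_system = [6.0, 6.4, 5.8, 6.9, 6.2, 5.9, 6.6, 6.1, 7.0, 6.3]  # minutes

stat, p_value = stats.mannwhitneyu(new_system, old_system, alternative="two-sided")
print(f"Mann-Whitney U = {stat:.1f}, p = {p_value:.4f}")
# If p < 0.05 (the conventional alpha), the null hypothesis of no difference
# would be rejected; otherwise the observed difference could plausibly be chance.
```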
However, using a p value of less than 0.05 as a cut-off (i.e. "statistically significant"), the probability of saying there is a difference between the systems when there really is not (called a type I error – see later in the chapter) is less than 1 in 20. Rather than proving the alternative hypothesis that a difference exists between the groups, we eliminate from consideration the null hypothesis that there is no difference.

Although the p value provides evidence for rejecting the null hypothesis, it does not provide information regarding the magnitude of the treatment effect or the precision of the estimated treatment effect [8]. Furthermore, the investigator arbitrarily assigns α, the "level of significance." A level of α equal to 0.05 is most often used and represents nothing more than medical convention. Probabilities above this level (e.g. 0.06) may still suggest an important difference in measured outcome between treatment groups, although not reaching a generally accepted level of proof. P values should be interpreted as only one piece of statistical information and are best interpreted in the context of the study design, the sample size, and the credibility of the hypotheses being tested [8]. Definitions of terms used in classical hypothesis testing are shown in Table 50.1.

Type I error

A type I error occurs when one concludes that a difference has been demonstrated between two groups of patients when, in fact, no such difference really exists [2,3,9]. It is a type of "false positive." Using p values, a type I error occurs if a statistically significant p value is obtained when there is no underlying difference between the groups being compared. The risk of a type I error is equal to the maximum p value considered statistically significant, or α, typically set at a level of 0.05 [2,9].

Type II error and power

A type II error occurs when a difference does exist between the two groups that is as large as, or larger than, that defined by the alternative hypothesis, yet a non-significant p value is obtained [2,4–7]. That is, the researcher states there is no difference between the groups because the p value is greater than α, when in reality there is a difference. A type II error is a type of "false negative." A common cause of a type II error is an inadequate sample size. The "power" of a trial is the chance of detecting a treatment effect of the size defined by the alternative hypothesis, if one truly exists [4–7]. Studies are usually designed to have a power of 0.80, 0.90, or 0.95. Because the power of a trial is the chance of finding a true treatment effect, the value β (1 − power) is the chance of missing a true treatment effect (i.e. the risk of committing a type II error) if a true difference equal to the effect size actually exists [4–6].

Power analysis and sample size determination

The value of α, the power, and the magnitude of the treatment effect sought (defined by the alternative hypothesis) are all used to determine the sample size required for a study [7]. Determining the required sample size for a clinical study is an essential step in the statistical design of a project. An adequate sample size helps ensure that the study will yield reliable information. If the final data suggest that no clinically important treatment effect exists, an adequate sample size is needed to reduce the chance that a type II error has occurred and a clinically important difference has been missed.
If the final result is positive, an adequate sample size is needed to ensure the treatment effect is measured with appropriate precision. In addition to the basic study design and planned analysis method, four parameters influence the power of a study: the sample size, the effect size defined by the alternative hypothesis, the variability of the results from patient to patient, and α. For each method of statistical analysis and type of study design (unpaired samples, case–control studies, etc.) there is a different formula relating sample size to power.

Because the treatment effect size sought by the study is a major determinant of the sample size required, choosing an appropriate effect size is the first step in sample size determination. Optimally, one designs clinical studies to reliably detect the minimum clinically relevant treatment effect (i.e. the smallest treatment effect that would result in a change in clinical practice). Defining the minimum clinically significant treatment effect is a medical and scientific judgment, not a statistical decision. For example, what difference in response times is large enough to warrant changing a dispatch system (e.g. 4 minutes, 2 minutes, 1 minute, or 30 seconds)? The researcher has the responsibility of proposing an effect size that he or she believes would be clinically or operationally relevant. The decision about the effect size to be studied should be based on the researcher's judgment, available resources (sample size feasibility), and relevant information from similar published studies. The smaller the treatment effect sought by a study, the larger the sample size required.

Frequently, available time and resources do not allow a clinical trial large enough to reliably detect the smallest clinically significant treatment effect. In these cases, one may choose to define a larger treatment effect size, with the realization that should the trial result be negative, it will not reliably exclude the possibility that a smaller but clinically important treatment difference exists.

One frequently faces the problem of interpreting data from a negative study in which no power calculation was initially performed. Although tempting, performing a post hoc (i.e. after the study is concluded) power analysis, in which one calculates the effect size that could have been found with the actual sample size and a given power, is invalid and should never be done. The correct approach to analyzing such data is to calculate the 95% confidence interval for the difference in the outcome of interest, based on the final data [10].

Because there is some risk in participating in clinical research, it is unethical to enroll patients in a trial that has an inadequate sample size and therefore is unlikely to yield useful information. Such a study would have little chance of generating meaningful information about the hypothesis and may waste time and resources, as well as expose subjects to risk with a low likelihood of benefit to society. This issue warrants special consideration during the design of phase I and so-called "pilot" studies. Studies involving human subjects (or laboratory animals) should be designed with a large enough sample size so it is highly likely they will yield useful information.

Historically, learning the techniques of sample size determination and power analysis has been difficult because of relatively complex mathematical considerations and numerous formulas.
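To illustrate how α, power, effect size, and variability jointly determine the required sample size, the minimal sketch below (Python with the statsmodels library; the 1-minute effect, 3-minute standard deviation, and other values are invented for illustration rather than taken from any actual study) solves for the per-group sample size for a two-group comparison of means and, conversely, the power achieved with a fixed smaller sample.

```python
# Illustrative only: sample size for a two-group comparison of means,
# assuming a 1-minute difference is the minimum clinically relevant effect
# and a between-patient standard deviation of 3 minutes (both hypothetical).
from statsmodels.stats.power import TTestIndPower

effect_size = 1.0 / 3.0          # standardized effect (difference / SD)
analysis = TTestIndPower()

n_per_group = analysis.solve_power(effect_size=effect_size,
                                   alpha=0.05,      # risk of a type I error
                                   power=0.80,      # 1 - beta
                                   alternative="two-sided")
print(f"About {n_per_group:.0f} patients per group are needed.")

# Conversely, the power achieved with a fixed, smaller sample can be computed:
achieved_power = analysis.solve_power(effect_size=effect_size,
                                      nobs1=50, alpha=0.05,
                                      alternative="two-sided")
print(f"With 50 per group, power is about {achieved_power:.2f}.")
```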
There has been tremendous improvement in the availability, ease of use, and capability of commercially available sample size determination software. These programs now allow the determination of sample size and the resulting power for a wide variety of research designs and analysis methods. Some of the available programs include nQuery Advisor by Statistical Solutions (www.statistical-solutions-software.com/), PASS by NCSS (www.ncss.com), and Power and Precision by Biostat (www.power-analysis.com).

Statistical tests

Before reviewing some common statistical tests [11], it is useful to define the different types of characteristics, or variables, we may wish to compare [12]. The ability to differentiate the types of variables and measurements is critical to selecting appropriate analysis methods. There are two general scales of measurement: numerical and categorical. With a numerical (or quantitative) scale, the size of differences between numbers has meaning. Variables measured on a numerical scale may be continuous, having values measured along a continuum (e.g. age, height, weight, time), or discrete, taking on only specific values (e.g. number of paramedic calls per shift). The mean and median of variables measured using a numeric scale are often used to summarize results or to compare characteristics between groups.

Categorical variables are used for measuring qualitative characteristics. The simplest form of a categorical variable is termed a nominal scale, in which observations fit into discrete categories that have no inherent order (e.g. race, sex, hospital name). When there are only two categories (e.g. male or female), the variable is termed dichotomous or binary. If there are more than two categories (e.g. blood types A, B, AB, and O), the variable is polychotomous (or polytomous). One common, but potentially ill-advised, practice is to categorize continuous variables (e.g. systolic blood pressure ≤90 mmHg or age by decades) for statistical analyses. Although this may simplify the interpretation of the variable, this practice substantially reduces the available information in the data, frequently requires an arbitrary selection of the appropriate cut-point(s), and may reduce study power as well as introduce residual confounding [13].

When a variable is ordinal, there is an inherent order among the different categories. Examples include the Glasgow Coma Scale (GCS) score and the Apgar score. Although an order exists among categories (each possible score is a category) in an ordinal scale, the relationship between categories may differ throughout the scale (i.e. the clinical implication of a difference between a GCS of 4 versus 3 is not the same as that of a difference of 15 versus 14).

Many statistical tests exist for both continuous and categorical data. Depending on the type of data being analyzed, different statistical tests are used to determine the p value. The most common statistical tests and their assumptions are described in Table 50.2. Selecting the appropriate statistical test requires identifying the type of variable to be analyzed and making reasonable assumptions regarding the underlying distribution of the data and the standard deviation or variance of the data in each group [12]. Parametric tests require more assumptions about the data – typically that numeric data follow a normal distribution and that different groups yield data with equal variance. Non-parametric tests do not require these assumptions [14].
Considering the low power of available tests used to detect deviations from the normal distribution, it is prudent to use non-parametric methods of analysis when there is any doubt as to the underlying distribution of the data [14].
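As a brief, hypothetical sketch of how the choice of test follows from the type of variable and the distributional assumptions one is willing to make (Python with SciPy; all data are invented), the block below applies a parametric and a non-parametric test to the same continuous outcome and a chi-square test to a dichotomous outcome.

```python
# Hypothetical data only: scene times (continuous) and survival (dichotomous)
# for two EMS systems, illustrating parametric vs non-parametric test choice.
from scipy import stats

system_a = [14.2, 15.1, 13.8, 16.0, 14.9, 15.5, 13.9, 14.4]  # scene times, minutes
system_b = [16.3, 15.8, 17.1, 16.9, 15.9, 17.4, 16.1, 16.6]

# Parametric: Student's t test assumes normally distributed data with equal variance.
t_stat, p_t = stats.ttest_ind(system_a, system_b)

# Non-parametric: Wilcoxon rank sum (Mann-Whitney U) makes no normality assumption.
u_stat, p_u = stats.mannwhitneyu(system_a, system_b, alternative="two-sided")

# Categorical outcome: a 2 x 2 table of survivors/non-survivors by system,
# analyzed with the chi-square test.
table = [[30, 70],   # system A: survived, died
         [45, 55]]   # system B: survived, died
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

print(f"t test p = {p_t:.3f}; rank sum p = {p_u:.3f}; chi-square p = {p_chi:.3f}")
```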
Table 50.1 Definitions of terms commonly used in classical statistical testing

α: The maximum p value to be considered statistically significant; the risk of committing a type I error
α error: A type I error
Alternative hypothesis: The hypothesis that is considered as an alternative to the null hypothesis; the hypothesis that there is an effect of the studied treatment, of a given size, on the measured variable of interest; the hypothesis that there is a difference between two or more groups of a given size, on the measured variable of interest; sometimes called the test hypothesis
β: The risk of committing a type II error
β error: A type II error
Null hypothesis: The hypothesis that there is no effect of the studied treatment on the measured variable of interest; the hypothesis that two or more groups are the same with respect to the measured variable of interest
Power: The probability of detecting the treatment effect defined by the alternative hypothesis (i.e. obtaining a p value <α), given α and the sample size of the clinical trial; power = 1 − β
p value: The probability of obtaining results similar to those actually obtained, or results more inconsistent with the null hypothesis, assuming the null hypothesis is true
Type I error: Obtaining a statistically significant p value when, in fact, there is no effect of the studied treatment on the measured variable of interest or the groups being compared are not different; a false positive
Type II error: Not obtaining a statistically significant p value when, in fact, there is an effect of the treatment on the measured variable of interest that is as large or larger than the effect the trial was designed to detect, or there is a difference between the groups that is as large or larger than the treatment effect tested; a false negative
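To connect these definitions, the following simulation sketch (Python with NumPy and SciPy; the 1-minute effect, 3-minute standard deviation, and sample size are invented for illustration) estimates the type I error rate when the null hypothesis is true and the power when a true difference of the hypothesized size exists.

```python
# Illustrative simulation: under the null hypothesis (no difference), roughly
# alpha (5%) of trials yield p < 0.05 (type I errors); under a true 1-minute
# difference (hypothetical SD of 3 minutes, 100 patients per group), the
# proportion of trials with p < 0.05 estimates the power (1 - beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, sd, alpha, n_trials = 100, 3.0, 0.05, 5000

def proportion_significant(true_difference):
    significant = 0
    for _ in range(n_trials):
        group_a = rng.normal(7.0, sd, n)                    # e.g. old system
        group_b = rng.normal(7.0 - true_difference, sd, n)  # e.g. new system
        _, p = stats.ttest_ind(group_a, group_b)
        significant += p < alpha
    return significant / n_trials

print(f"Type I error rate (no true difference): {proportion_significant(0.0):.3f}")
print(f"Power (true 1-minute difference):       {proportion_significant(1.0):.3f}")
```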
Table 50.2 Common statistical tests and their assumptions

Parametric tests
Student's t test: Used to test whether the means of a continuous variable from two groups are equal, assuming that the data are normally distributed and that the data from both groups have equal standard deviation or variance. A less common form of the t test can be used to analyze data from matched pairs (e.g. before and after measurements on each patient).
One-way analysis of variance (ANOVA): Used to test the null hypothesis that three or more sets of continuous data have equal means, assuming the data are normally distributed and that the data from all groups have equal standard deviations or variances. The one-way ANOVA may be thought of as a t test for three or more groups.

Non-parametric tests
Wilcoxon rank sum test (Mann-Whitney U test): Used to test whether two sets of continuous data have the same median. These tests are similar in use to the t test but do not assume the data are normally distributed.
Wilcoxon signed rank test: Used to examine data from matched pairs, similar to the matched pairs t test, but when differences within each pair are not normally distributed.
Kruskal-Wallis test: A test analogous to the one-way ANOVA, but no assumption is required regarding normality of the data. The Kruskal-Wallis test may be thought of as a Wilcoxon rank sum test for three or more groups or as a one-way ANOVA for non-normally distributed data.
Chi-square test: Used to test whether the proportions of a categorical variable differ between two or more groups (i.e. whether two categorical variables are associated), assuming adequate expected counts in each cell of the contingency table.
Source: adapted from Lewis RJ. The medical literature: a reader’s guide. In: Rosen P, Barkin R, Danzl DF, et al. (eds) Emergency Medicine: Concepts and Clinical Practice, 4th edn. St Louis: Mosby, 1998; and Lewis RJ. An introduction to the use of interim data analysis in clinical trials. Ann Emerg Med 1993;22:1463–9.
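As a final hypothetical sketch (Python with SciPy; invented data), the block below applies the remaining tests described in Table 50.2: the one-way ANOVA, its non-parametric analogue the Kruskal-Wallis test, and the Wilcoxon signed rank test for matched pairs.

```python
# Hypothetical data only: three groups of continuous data and one set of
# matched before/after pairs, analyzed with tests listed in Table 50.2.
from scipy import stats

group_1 = [5.1, 6.3, 5.8, 6.0, 5.5, 6.2]
group_2 = [6.4, 7.0, 6.8, 7.3, 6.6, 7.1]
group_3 = [5.9, 6.1, 6.5, 6.0, 6.3, 6.4]

# One-way ANOVA: parametric comparison of three or more group means.
f_stat, p_anova = stats.f_oneway(group_1, group_2, group_3)

# Kruskal-Wallis: non-parametric analogue of the one-way ANOVA.
h_stat, p_kw = stats.kruskal(group_1, group_2, group_3)

# Wilcoxon signed rank test: matched pairs (e.g. before/after on each patient).
before = [8.2, 7.9, 8.5, 9.0, 8.8, 8.1, 7.7, 8.4]
after  = [7.5, 7.8, 8.0, 8.6, 8.1, 7.9, 7.4, 8.0]
w_stat, p_wsr = stats.wilcoxon(before, after)

print(f"ANOVA p = {p_anova:.3f}; Kruskal-Wallis p = {p_kw:.3f}; signed rank p = {p_wsr:.3f}")
```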