Clinical Trials




Roger Chou

Richard A. Deyo



Controversies abound in the clinical management of pain, and there are enormous geographic variations in care. Lumbar spine surgery rates have historically varied fivefold among developed countries, with rates in the United States being highest and rates in the United Kingdom being among the lowest1—yet patient outcomes appear to be broadly similar across countries. In smaller geographic areas, variations are also striking. Within the United States, rates of lumbar fusion surgery among Medicare enrollees vary more than 20-fold between regions, from 4.6 per 1,000 enrollees in Idaho Falls, Idaho, to 0.2 per 1,000 in Bangor, Maine.2 Within Washington State, county back surgery rates vary more than sevenfold, even after excluding the smallest counties.3

Another problem in pain management is the successive uptake of a series of fads in treatment. Research has eventually discredited many of these, but they enjoyed widespread use, with substantial costs and side effects, before they were found to be ineffective. Examples include sacroiliac joint fusion for the treatment of low back pain, coccygectomy for coccydynia, bed rest and traction for back pain, and many others.4 This phenomenon is prominent in the field of pain medicine but not unique to it. Examples of abandoned therapies from other areas of medicine include internal mammary artery ligation for treating angina pectoris, gastric freezing for duodenal ulcers, and vitamin E and hormone therapy for prevention of cardiovascular events.5,6,7 Promoting such ineffective treatments drains resources from more useful interventions, produces side effects, and eventually damages professional credibility.

Despite welcome breakthroughs in basic science research on pain, increases in knowledge regarding optimal ergonomics of work tasks, and the development and use of more technologically advanced medical therapies, evidence indicates an increasing prevalence of chronic back pain and disability. In the state of North Carolina, the prevalence of chronic, impairing back pain more than doubled from 3.9% in 1992 to 10.2% in 2006.8 A large and steady rise in use of opioids, surgery, and interventional therapies for low back pain has not been associated with improved health status but appears to be an important factor contributing to increases in health care expenditures associated with back pain.9,10,11 Thus, despite impressive gains in our understanding of the molecular and cellular origins of pain, there is an important gap in translating this knowledge into effective clinical management. One reason may be the widespread reliance on inadequate research designs that lead to conflicting, confusing, or misinterpreted results. Biostatistical and epidemiologic methods make it possible to substantially improve this situation, but many key principles are not widely appreciated.


Uncontrolled Studies Paradigm

Historically, much of pain treatment research consisted simply of uncontrolled studies in which clinicians treated a group of patients and then reported mean pain scores or the proportion who improved. Such studies are often referred to as case series, although the alternative terms before-after study or treatment series may help distinguish them from studies that identify “cases” based on an outcome (such as an adverse event) rather than an exposure (such as a medical intervention) and that assess patients at only one point in time.12 The before-after design remains popular in part because it usually does not require extensive resources, but it is vulnerable to many pitfalls.13

First, many uncontrolled studies are retrospectively reported. After treating a certain number of patients, the clinician looks back at his or her experience and tries to summarize the characteristics, treatments, and outcomes of the patients studied. Unfortunately, in this retrospective approach, there is often incomplete baseline information on patient characteristics. For example, factors such as age, sex, previous surgery, disability compensation, neurologic deficits, psychological comorbidities, and pain duration often have a major influence on the outcomes of back surgery. Yet, in a systematic review of outcome studies on surgery for spinal stenosis, 74 relevant articles were found, but fewer than 10% mentioned all of these patient characteristics.7

Another problem with the retrospective approach is that it can be difficult to identify an “inception cohort” of all patients (or a random sample) who met specified criteria and received the intervention. A systematic review of 72 uncontrolled studies of spinal cord stimulation for chronic low back pain or failed back surgery syndrome found that less than one-quarter clearly described evaluation of a consecutive or representative sample of patients.14 In such studies, it is impossible to know if patients with poorer results were excluded for arbitrary reasons, or how many patients received the treatment but were lost to follow-up. If patients excluded from analysis or lost to follow-up were more likely to experience poor outcomes than those who were followed, this could result in serious overestimates of benefits.

A third problem with uncontrolled studies is that even if the researcher collects data prospectively, there is typically no blinding of patient, therapist, or outcome assessor to the nature of the treatment provided. This allows important unconscious—or conscious—biases to affect assessments. This is particularly important for outcomes related to pain, which by nature are subjective. Most of us would question the reliability of outcomes rated by a surgeon evaluating his or her own patients, and yet, this is the norm in much of the literature.

By definition, uncontrolled studies do not include control groups for comparison. The assumption seems to be that patients with painful conditions, especially chronic pain, will not improve unless effective treatment is given. However, there are many reasons why patients improve in the face of ineffective therapy, some of which are listed in Table 10.1. First, the natural history of many painful conditions is to improve spontaneously. This may be true even for patients with long-standing pain, who sometimes improve for unclear reasons. For acute conditions such as acute low back pain, rapid early improvement is the norm.15 Second are placebo effects, which are not well understood but are consistently underestimated and may be particularly important when assessing pain.7 Several factors may mediate placebo effects, including patient expectations,16 learning and conditioning from previous treatments, reduction of anxiety, and endorphin effects. Placebo effects for pain treatments may be getting larger. In 1996, patients in US clinical trials reported that drugs relieved neuropathic pain 27% more than placebo, but by 2013, the difference had decreased to 9%,17 a trend that appeared to be due to a stronger placebo response in the setting of stable drug effects.









TABLE 10.1 Why Patients May Improve with Ineffective Therapy

  • Natural history of a condition to improve
  • Placebo effects
  • Regression to the mean
  • Nonspecific effects: concern, conviction, enthusiasm, attention


Another poorly appreciated factor is regression to the mean.18 This term was coined by statisticians who observed that in a group of patients who are assembled because of the extreme nature of some clinical condition, there is a tendency for the condition to return to some average level that is less severe over time. Figure 10.1 shows what we often assume to be the course of chronic pain problems, with a steady level of severity that falls after successful intervention. However, the second panel is more likely to represent the true natural history, with good days and bad days, and fluctuations being the norm.19 Patients seek health care when their symptoms are most extreme. We might easily be misled into believing that improved outcomes are due to the intervention, when in fact, random fluctuations are why their symptoms have returned toward a more average level. As Sartwell and Merrell20 pointed out, “the term chronic has a tendency to conjure up ideas of stability and unchangeability … it is changeability and variation, not stability, that is in fact the dominant characteristic of most long-lived conditions.”
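Regression to the mean is easy to demonstrate with a short simulation. The sketch below (illustrative Python, not data from any study cited in this chapter; the pain levels and enrollment threshold are assumptions) gives each simulated patient a stable long-run pain level with day-to-day fluctuation, enrolls only those who present on a day their pain is extreme, and then remeasures them later with no treatment at all:

```python
import random
import statistics

random.seed(0)

# Each patient's chronic pain fluctuates day to day around a stable personal mean.
def pain_score(patient_mean):
    return min(10, max(0, random.gauss(patient_mean, 2)))

# Hypothetical long-run mean pain levels (0-10 scale) for 10,000 patients.
patients = [random.uniform(3, 7) for _ in range(10_000)]

# Patients "seek care" on a day their pain is extreme (>= 8 out of 10).
enrolled = [(m, s) for m in patients if (s := pain_score(m)) >= 8]

at_enrollment = statistics.mean(s for _, s in enrolled)
# One month later, with no treatment at all, pain is just another random draw.
at_followup = statistics.mean(pain_score(m) for m, _ in enrolled)

print(f"mean pain at enrollment: {at_enrollment:.1f}")
print(f"mean pain at untreated follow-up: {at_followup:.1f}")
```

The follow-up mean falls substantially even though nothing was done, purely because enrollment selected each patient's worst days rather than typical ones.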






FIGURE 10.1 Hypothetical course of chronic low back pain (LBP). (Reprinted with permission from Deyo RA. Practice variations, treatment fads, rising disability. Do we need a new clinical research paradigm? Spine 1993;18(15):2153-2162.)








TABLE 10.2 Therapeutic Trial for Patients with Chronic Low Back Pain: Mean Duration of 4 Years, n = 31

Outcome Measure                  Score Improvement (Baseline to 1-Month Follow-Up)   p Value
Overall function (SIP)           32%                                                 .002
Physical function                44%                                                 .001
Pain severity (VAS)              33%                                                 .006
Pain frequency (5-point scale)   20%                                                 .000

SIP, Sickness Impact Profile; VAS, visual analog scale.

Reprinted with permission from Deyo RA. Practice variations, treatment fads, rising disability. Do we need a new clinical research paradigm? Spine 1993;18(15):2153-2162.


A host of other nonspecific effects also can affect assessments of patient improvement. Increased concern, conviction, enthusiasm, and attention from the therapist, the researcher, and the clinical staff may all have positive but nonspecific effects on patient outcomes. Table 10.2 shows a potential consequence of all these factors, using data from a clinical trial of patients with chronic low back pain.19 The 31 patients in Table 10.2 had experienced back pain for an average of 4 years. They received a clinical intervention that resulted in 20% to 44% improvements in pain frequency, severity, and function, all of which were highly statistically significant. However, this seemingly effective treatment for chronic pain was a sham transcutaneous electrical nerve stimulation (TENS) unit, along with hot packs twice a week. This was the control arm of a randomized trial and illustrates the substantial improvements that may occur among those with long-standing pain who receive ineffective treatments.

Finally, an issue that has begun to receive more attention is that uncontrolled studies are highly susceptible to publication bias.21 There is little incentive for clinicians to publicize poor or even average results. Estimates of efficacy from uncontrolled studies that get published will therefore often overrepresent the most positive results.

There is considerable room for improvement in the design and conduct of uncontrolled studies of pain interventions.14,22 However, even when conducted well, the ability of uncontrolled studies to provide reliable information about treatment efficacy will always be limited. Exceptions can occur when the relationship between an intervention and outcomes is obvious, the effects are immediate, and the effects are so dramatic that they cannot be explained by other factors.23 Examples include surgery for appendicitis, eyeglasses for correction of refractive error, and cataract surgery. For nearly all pain conditions, however, there are many plausible alternative explanations for the observed changes in outcomes, and reliable conclusions about treatment efficacy require the use of more rigorous study designs. There is simply too much “noise” to sort out whether outcomes are due to the treatment or to other factors.24


CONTROL GROUPS: AN IMPROVEMENT OVER THE CASE SERIES

Given the variety of factors that may produce improvement with ineffective therapy, it is incumbent on investigators to have a comparison group of subjects with the same likelihood for improvement as a treatment group but who do not receive the active therapy. The goal should be to minimize the potential differences across groups in the effects of the various nonspecific causes for improvement that are listed in Table 10.1. With this goal in mind, the appropriate comparison group is unlikely to be one that receives no care at all. Patients in such a group would not experience placebo effects or the nonspecific effects of clinical concern and enthusiasm. The importance of having an adequate placebo is illustrated by a trial that found acupuncture more effective than no treatment for chronic low back pain but no more effective than sham acupuncture.25 Similarly, using a “waiting list” control group is often suboptimal because these patients experience none of the placebo or nonspecific effects of the intervention group. A preferable control group would be one that receives other credible, appropriate care that does not include the specific treatment under study. This might consist of “usual care” supplemented by a placebo of some sort. The placebo should be difficult to distinguish from the intervention under study so that it is perceived as being as likely to help as the active therapy. This is the reason for providing inactive pills in the control groups of drug trials, but even for nondrug treatments, credible placebos should be provided when possible. Examples include the use of sham TENS units in trials of TENS, the use of sham injections in trials of interventional therapies, the use of subtherapeutic weight in trials of traction, or “misplaced needling” as a control for acupuncture.

In some cases, it may be unethical or impossible to provide a true placebo. Examples include many surgical interventions, psychological therapies, and rehabilitation interventions. In such situations, a reasonable alternative is to provide a control treatment that creates some sense that patients are receiving an additional intervention and attention but is not likely to have a strong effect on outcomes. One example might be a brief educational brochure.26

In addition to choosing an appropriate control intervention, it is also important to make the treatment and control groups as similar to each other as possible in other ways. Confounding is a critical concept that refers to variables associated with both the intervention being evaluated and observed outcomes. A classic example of confounding is the association between alcohol consumption and lung cancer. This association is confounded by smoking, which is associated with alcohol consumption and is also an independent risk factor for lung cancer. Examples of common confounders in pain research include severity of baseline pain or functional deficits, psychological and medical comorbidities, age, and use of other therapies. The consequence of confounding is that the observed treatment effect is a poor estimate of the true effect. The effects of confounding variables can result in either an overestimate or an underestimate of treatment benefits and can sometimes even produce an apparent positive effect when the true effect is negative (or vice versa).

Selection of controls to minimize the potential for confounding is often a challenge. Control groups that are convenient to assemble are also unfortunately frequently associated with important pitfalls. For example, it would be unwise to choose patients who did not have adequate insurance coverage for the treatment being provided as a control group because insurance coverage is related to important sociodemographic characteristics. Patients with the best insurance are typically those with the highest salaries and the most satisfying jobs, are happier with their insurance, and are more likely to practice healthy behaviors. Failure to adjust for socioeconomic status in observational studies could have resulted in the subsequently disproven belief in the positive cardiovascular benefits of hormone replacement therapy.27 Similarly, selecting patients nonadherent with intended therapy as a control group is a flawed strategy. In a large-scale study of cholesterol-lowering therapy, control patients were divided among those who took more than 80% of their placebo tablets and those who took less than 80%.28 Even after adjusting for 40 coronary risk factors, there were enormous differences in mortality between the adherent and nonadherent groups. Patients who were adherent with their placebos had a 5-year mortality of only 16%, whereas those who were not adherent had a 5-year mortality rate of 26% (P < .0001). These findings were probably related to important differences between the groups that were not reflected in their coronary risk factors. These may have included other health habits, behaviors, attitudes toward risk, and occupations. Thus, nonadherent patients are often strikingly different from adherent patients, and we cannot assume that any differences in outcome are related only to treatment effects.

Sometimes, the issues of proper selection of control patients and treatments are intertwined. A study that assigned patients with presumed discogenic low back pain to intradiscal electrothermal therapy (IDET) or rehabilitation therapy based on their insurance coverage for IDET reported an average 4.5-point improvement in pain scores.29 Subsequent randomized trials found either no advantage of IDET or only a 1-point difference between IDET and sham treatment.30,31 In addition to potential socioeconomic differences related to differential insurance coverage, patients who were denied IDET probably had lower expectations about the likely benefits of rehabilitation therapy, particularly because some had previously received this treatment but had not responded.

Confounding by indication is particularly important in studies that assess treatment efficacy. It refers to the strong, natural (and appropriate) tendency for clinicians to selectively use therapies in patients most likely to benefit. A striking example of confounding by indication is a study of new users of nonsteroidal anti-inflammatory drugs that found use of ulcer-healing drugs associated with a 10-fold increase in risk of gastrointestinal bleeding or perforations.32 Obviously, ulcer-healing drugs do not cause ulcers. Rather, the increased risk of gastrointestinal complications in patients deemed appropriate for ulcer-healing drugs dwarfed any protective effect of the drugs.

There are ways to minimize or adjust for the effects of confounding. These include matching patient selection on the variables thought to be the most important potential confounders, restricting enrollment to patients defined by a narrow set of inclusion criteria, and statistically adjusting for known confounders in the analysis.33 Nonetheless, the effects of confounding can be dramatic even when one or more of these strategies are employed. For example, confounding by indication was strong in the study of ulcer-healing drugs, even though the investigators attempted to restrict enrollment to lower risk patients who had neither a previous ulcer nor a previous prescription for an ulcer-healing drug.32
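The distorting power of confounding by indication, and the value of stratifying on the confounder, can be shown with a small simulation (illustrative Python with made-up probabilities, not data from the studies cited here). A treatment with no true effect is preferentially given to sicker patients, who also have worse outcomes, so the crude comparison makes the treatment look harmful:

```python
import random

random.seed(7)

# Simulated confounding by indication: sicker patients are both more
# likely to receive the drug and more likely to suffer the outcome.
patients = []
for _ in range(50_000):
    severe = random.random() < 0.3
    treated = random.random() < (0.8 if severe else 0.2)
    # The treatment itself has NO effect on the outcome in this simulation.
    outcome = random.random() < (0.40 if severe else 0.05)
    patients.append((severe, treated, outcome))

def risk(group):
    return sum(1 for _, _, o in group if o) / len(group)

treated_pts = [p for p in patients if p[1]]
untreated_pts = [p for p in patients if not p[1]]
print(f"crude risk: treated {risk(treated_pts):.2f} vs untreated {risk(untreated_pts):.2f}")

# Stratifying on the confounder (severity) removes the spurious difference.
for severe in (True, False):
    t = [p for p in treated_pts if p[0] == severe]
    u = [p for p in untreated_pts if p[0] == severe]
    print(f"severe={severe}: treated {risk(t):.2f} vs untreated {risk(u):.2f}")
```

The crude comparison shows a severalfold higher risk among the treated, echoing the ulcer-healing drug example, while within each severity stratum the treated and untreated risks are essentially identical.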

Matching also may not be enough to overcome the effects of confounding. Table 10.3 shows how one might assemble two groups of objects that are well matched on five different characteristics and yet literally be comparing apples and oranges.19 Table 10.4 shows real data from a comparison of outcomes of two groups of Medicare patients who underwent low back surgery. They were matched on diagnosis (all had spinal stenosis), gender, age, insurance (all Medicare), and surgical procedure (all had a laminectomy without fusion). Despite being well matched on these five characteristics, the likelihood of reoperation differed almost fourfold between the two groups. Differences of this magnitude might easily be attributed to some dramatic advantage of the treatment used in group A. However, these groups were intentionally assembled in such a way that group A was composed of African American patients who had not had prior surgery and group B was composed of white patients with prior surgery.19 These two characteristics, which might easily have been overlooked, accounted entirely for the difference in reoperation rates. Unfortunately, it usually is not as simple as matching on a few critical and easily measured variables. The cholesterol-lowering placebo study described earlier shows how even matching (or adjusting) for 40 different risk factors may not capture important differences between two groups of patients.28








TABLE 10.3 Why Not Find “Matching” Controls?

Characteristic   Apples     Oranges
Shape            Round      Round
Source           Tree       Tree
Edible?          Yes        Yes
Size             Handheld   Handheld
Weight           ½ lb.      ½ lb.

Reprinted with permission from Deyo RA. Practice variations, treatment fads, rising disability. Do we need a new clinical research paradigm? Spine 1993;18(15):2153-2162.









TABLE 10.4 Two Cohorts of Medicare Patients with Laminectomy for Stenosis (1985)

                      Group A (n = 252)   Group B (n = 141)   Significance
% Women               57%                 55%                 NS
Mean age              71                  72                  NS
% Fusion              0                   0                   NS
4-Year reoperations   4%                  15%                 <.0005

NS, not significant.

Reprinted with permission from Deyo RA. Practice variations, treatment fads, rising disability. Do we need a new clinical research paradigm? Spine 1993;18(15):2153-2162.


If waiting lists, patients with insufficient insurance coverage, nonadherent patients, or even carefully matched patients receiving appropriate placebo treatments make poor control groups, is there a better solution? Fortunately, the concept of random allocation provides an ideal method of establishing a comparison group that is likely to be similar in nearly all respects to an intervention group.


Randomized Allocation of Treatment and Control Groups

The term randomized trial has become familiar among clinicians and yet is often misunderstood. Some assume that a randomized trial is one in which patients are randomly selected from a population of interest. However, just the opposite may be true. Patients may be highly selected from a group of potential candidates based on specific characteristics that make the study treatment safe and likely to succeed. Randomization does not refer to the selection of patients to be studied but rather to the patients’ allocation to the treatment or the control group.

Why is randomization such a desirable way of creating a control group? It is attractive because the problem of confounding is largely eliminated.34 Because it is never possible to completely understand or measure all confounders, residual confounding is always a potential issue in studies that are not randomized.35 With random allocation, we may not even know the important prognostic factors, but they will be equally distributed (given a fair randomization and enough patients) between the treatment and control groups. Effective randomization requires the generation of a truly unpredictable (random) allocation sequence as well as its successful implementation via allocation concealment.36

There is sometimes confusion about what constitutes randomization. Randomization requires a sequence of random numbers, generated by a computer program or taken from a published table. Each successive subject has an equal likelihood of being assigned to each treatment arm, although the order in which they are assigned is unpredictable. Alternating assignment—that is, the first patient is assigned to treatment, the next to placebo, the next to treatment, and so on—is not random because it is predictable. Similarly, assigning patients without conscious bias, or haphazardly, is not the same as random allocation. Using hospital numbers, date of birth, or day of the week is also not randomization. If day of the week is used, a patient could simply come in (or be told to come in) on the day that the desired intervention will be offered. Allocation concealment means that the allocation sequence remains unknown to those enrolling patients until after each patient has been assigned to therapy, thus preserving the actual randomization. A traditional method to help preserve allocation concealment is the use of opaque sealed envelopes containing the treatment assignment. An increasingly common alternative is to have an offsite facility keep the random sequence, so research personnel cannot know the next assignment as a subject is enrolled.37
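As a concrete illustration, here is a minimal Python sketch of one common way such a sequence is generated, permuted-block randomization; the function name, arm labels, and block size are our own choices for the example, not a standard from the chapter. Within each block the arms appear equally often, keeping group sizes balanced, yet the order inside each block is random, so the next assignment remains unpredictable (unlike simple alternation):

```python
import random

def permuted_block_sequence(n_patients, block_size=4, arms=("treatment", "control"), seed=None):
    """Generate a randomized allocation sequence in permuted blocks.

    Each block contains every arm equally often, shuffled in random order,
    so group sizes stay balanced while assignments remain unpredictable."""
    rng = random.Random(seed)
    per_arm = block_size // len(arms)
    sequence = []
    while len(sequence) < n_patients:
        block = list(arms) * per_arm
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_patients]

# The whole sequence is generated once, up front, by someone not enrolling
# patients (mimicking an offsite randomization service), which is what
# allows allocation concealment to be maintained.
sequence = permuted_block_sequence(20, seed=42)
print(sequence[:8])
```

In practice the sequence would be held by the offsite facility or sealed in opaque envelopes, and enrolling staff would learn each assignment only after the patient is irrevocably entered.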








TABLE 10.5 Bias in Studies of Myocardial Infarction

Allocation Method         Prognostic Maldistribution (%)   Difference Found in Case-Fatality (%)
Blinded randomization     14                               9
Unblinded randomization   27                               24
Nonrandomized             58                               58

Data from Chalmers T, Celano P, Sacks HS, et al. Bias in treatment assignment in controlled clinical trials. N Engl J Med 1983;309:1358-1361 and Deyo RA. Practice variations, treatment fads, rising disability. Do we need a new clinical research paradigm? Spine 1993;18:2153-2162.


A dramatic example of the effects of randomization in pain research is a systematic review of TENS therapy for postoperative pain, which found that 15 of 17 randomized trials of efficacy showed no benefit.38 By contrast, 17 of 19 nonrandomized studies showed a substantial positive treatment effect. Some investigators have also quantified the magnitude of bias that occurs when allocation concealment is inadequate. One such study, shown in Table 10.5, compared randomization with adequate allocation concealment, randomization with inadequate allocation concealment, and nonrandom allocation of controls.39 The investigators examined a series of treatments for acute myocardial infarction and, as the table shows, demonstrated that maldistribution of prognostic factors was least with randomization with adequate allocation concealment and greatest with nonrandom allocation. Similarly, the likelihood of finding a substantial improvement in case-fatality rate rose dramatically, from just 9% of trials with randomization and adequate allocation concealment to almost 60% of trials with nonrandom allocation. Other studies suggest that, on average, inadequate allocation concealment inflates results by about 40% compared to studies with adequate allocation concealment.37,40

Why is allocation concealment so important? There are probably several reasons. Failure to conceal allocation makes it easy to subvert the randomization process. If this occurs, confounding by indication can be as much of a problem as in nonrandomized studies.37 Some overt methods that have been used to bypass randomization include adjusting treatment assignments based on posted allocation sequences or ignoring allocation to treatments perceived as less desirable.41 Inadequate allocation concealment can also have more subtle effects. If the investigator has a bias as to which treatment group is more effective—even a subconscious bias—he or she may approach the next subject differently based on knowledge of what the next treatment assignment will be. This may affect the way in which a clinical trial is presented to a patient, the enthusiasm with which consent is sought, or the rigor with which eligibility criteria are applied.









TABLE 10.6 Evaluating Articles about Therapy42

These criteria can be used as a guide for evaluating the validity of study results and treatment efficacy.

  • Were patient assignments randomized, and were patients analyzed in the groups to which they were assigned?
  • Were all patients properly accounted for and attributed?
  • Was follow-up complete?
  • Were all study participants (including study staff and health workers) “blind” to treatment?
  • Were groups similar throughout the trial?
  • Were groups treated equally (other than the intervention being studied)?
  • What was the extent of the treatment’s effect?
  • Are the results applicable to my patient care?
  • Do the benefits of the treatment outweigh the risks and costs?



Other Methods for Reducing Bias in Clinical Trials

Randomization is a powerful method for minimizing the possibility of confounding, but it does not protect against other types of bias, or systematic errors in measurement. The quality of a trial refers to how rigorously it employs measures to protect against bias. Table 10.6 lists criteria that have been proposed for critical readers to evaluate the quality of studies on treatment efficacy.42,43 Lengthier and more detailed sets of criteria for evaluating clinical trials have been developed by the Cochrane Collaboration44 and by the Cochrane Collaboration Back and Neck Group.45 Additional guidance for clinical investigators includes recommendations on how to report the methods and results of randomized trials.46 The list of criteria in Table 10.6 begins with random allocation, which was discussed in detail previously.


BASELINE SIMILARITY OF STUDY GROUPS

Randomization usually provides the best way to produce groups with equivalent prognoses. However, randomization may not always work, and investigators should present a comparison of baseline characteristics of patients in the treatment and control groups. In a properly randomized trial, any observed differences are chance occurrences but may still be sufficiently large to compromise the validity of the study. When this occurs, investigators sometimes adjust for baseline differences using statistical techniques. Such statistical adjustments should be based on how strongly the prognostic factor is thought to be associated with the outcomes and the clinical importance of baseline imbalances, not on the results of statistical tests for significant differences.47 Statistical tests can be misleading, as small differences may be clinically trivial but statistically significant in large trials, and large differences may be clinically important but statistically nonsignificant in small trials.

Even if adjustment is appropriate, it cannot control for differences in unmeasured confounders. It is also important to consider whether baseline imbalances could be due to intentional subversion of randomization.48 Minimization is a method for achieving balanced groups based on key predefined patient factors and the number of patients in each group.49 Although it is not truly random, because patient factors influence treatment assignments, minimization may provide better balance between treatment groups than standard randomization.
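A bare-bones sketch of minimization, in the spirit of the Pocock-Simon approach, is shown below (illustrative Python; the factor names and the simple tie-breaking rule are our own assumptions, and real implementations usually add a random element to every assignment, not just ties):

```python
import random

def minimization_assign(patient, enrolled, factors, arms=("A", "B"), rng=random):
    """Assign the next patient to the arm that minimizes imbalance.

    For each candidate arm, count how many already-enrolled patients in
    that arm share each of the new patient's factor levels; choose the
    arm with the smaller total, breaking ties at random."""
    counts = {arm: 0 for arm in arms}
    for prior in enrolled:
        for f in factors:
            if prior[f] == patient[f]:
                counts[prior["arm"]] += 1
    best = min(counts.values())
    return rng.choice([arm for arm in arms if counts[arm] == best])

# Hypothetical stratification factors for a back pain trial.
factors = ("sex", "prior_surgery")
rng = random.Random(1)
enrolled = []
for _ in range(40):
    patient = {"sex": rng.choice("MF"), "prior_surgery": rng.choice([True, False])}
    patient["arm"] = minimization_assign(patient, enrolled, factors, rng=rng)
    enrolled.append(patient)

for arm in ("A", "B"):
    size = sum(p["arm"] == arm for p in enrolled)
    women = sum(1 for p in enrolled if p["arm"] == arm and p["sex"] == "F")
    print(f"arm {arm}: {size} patients, {women} women")
```

Because each assignment depends on the factor levels of patients already enrolled, the arms stay balanced on the predefined factors even in small trials, at the cost of a partially deterministic sequence.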


BLINDING

The importance of blinding is that it helps to create similar expectations on the part of patients and similar enthusiasm by the therapists. Furthermore, it ensures that the same level of attention and concern is provided to both a treatment and a control group. Blinding is particularly important in studies that assess subjective outcomes such as pain. In one study, lack of blinding inflated estimates of treatment effects by 30% in trials with subjective outcomes such as pain but had no effect on estimates in trials with objective outcomes.50

It is common to talk about double-blind trials, but the term is often used ambiguously.51 Typically, it is meant to imply that the patient is unaware whether he or she is receiving active treatment or a placebo, and the person administering or prescribing the treatment is also unaware. In some cases, it may be impossible to “blind” patients or therapists, as in trials of surgical treatments or some rehabilitation and psychological interventions. There is also a third party that may be blinded—an independent assessor of outcomes. Maintaining such a blinded assessor should generally be feasible, even when it is impossible to blind patients and therapists. Trials should explicitly describe who was blinded rather than use nonspecific jargon such as “single-,” “double-,” or even “triple-blinded.”51

As noted in the discussion of control groups, creativity can sometimes produce credible placebo or alternative treatments that at least help to maintain blinding. In many situations, it would be informative to test the success of blinding at the end of a study. This is not done frequently but is important for certain drugs and other interventions that have side effects or other characteristics that can give the treatment away.52 In a trial of TENS therapy for low back pain, for example, sham treatment does not produce the same sensation as active therapy, so patients could know they are receiving sham rather than active therapy. This would essentially result in an unblinded trial. In fact, one trial of TENS found that patients and physicians were able to guess better than random chance whether individual patients were in the treatment or control group, but the magnitude of blinding failure was sufficiently modest that the results could be interpreted with some confidence.53
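One simple way to quantify such a blinding check is to compare the patients' correct-guess rate to the 50% expected under perfect blinding in a two-arm trial. The sketch below (illustrative Python with made-up guess counts; it uses a normal approximation to the binomial rather than any specific published blinding index) performs that comparison:

```python
from statistics import NormalDist

def blinding_check(n_correct, n_total):
    """Two-sided z-test of whether patients guess their assignment
    better than the 50% expected under perfect blinding (a sketch)."""
    p_hat = n_correct / n_total
    se = (0.25 / n_total) ** 0.5   # standard error of the proportion under p = 0.5
    z = (p_hat - 0.5) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, p_value

# Hypothetical end-of-study guesses: 70 of 120 patients guess their arm correctly.
p_hat, p_value = blinding_check(70, 120)
print(f"correct-guess rate {p_hat:.2f}, p = {p_value:.3f}")
```

A correct-guess rate well above chance, as in the TENS example, signals partial unblinding; how much that compromises the results then depends on the magnitude of the excess.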


Sep 21, 2020 | Posted by in PAIN MEDICINE | Comments Off on Clinical Trials
