Chapter 6
Understanding Bias in Diagnostic Research

Christopher R. Carpenter1 and Jesse M. Pines2,3
1 Department of Emergency Medicine, Washington University School of Medicine, St. Louis, MO, USA
2 US Acute Care Solutions, Canton, OH, USA
3 Department of Emergency Medicine, Drexel University, Philadelphia, PA, USA

Chapter 2 summarizes the components and philosophy of evidence‐based medicine (EBM), including methods to evaluate the quality of an individual study before incorporating it into patient care. Attaining EBM proficiency requires mentored learning and consistent practice, just like any other procedural skill in medicine.1 Awareness of EBM concepts and resources can help clinicians assess the quality and applicability of diagnostic research evidence for individual questions.2 The Standards for Reporting of Diagnostic Accuracy Studies (STARD) criteria provide a systematic approach to conducting and reporting diagnostic research.3 STARD is a 30‐item checklist of essential details that diagnostic researchers should report (Table 6.1). An extension of the original STARD reporting standards for the history and physical examination also exists.4 Although the quality of diagnostic accuracy reporting continues to improve, emergency medicine researchers frequently do not adhere to STARD, which increases the potential for unrecognized bias in study results.5–7

As described in Chapter 2, systematic reviews sit at the top of the evidence pyramid (i.e., they are one of the least biased forms of research when conducted methodically), but systematic review methods for diagnostic tests are relatively new.8 Diagnostic test accuracy systematic reviews also have reporting standards, called the Preferred Reporting Items for Systematic Review and Meta‐analysis of Diagnostic Test Accuracy (PRISMA‐DTA).9 The Quality Assessment Tool for Diagnostic Accuracy Studies (QUADAS‐2) provides an instrument to assess four domains of bias that can skew an individual study's estimates of diagnostic accuracy: patient selection, index test, criterion standard, and flow and timing.10 QUADAS‐2 is the preferred approach to evaluating the quality of evidence in diagnostic meta‐analyses. Although diagnostic accuracy systematic reviews increasingly report the quality of included studies, few extrapolate the implications of the QUADAS‐2 assessment into their conclusions.11 In this chapter, we describe the forms of bias to consider in diagnostic research and provide suggestions for incorporating these data into guidelines.

Types of bias

Diagnostic science is vulnerable to several forms of bias that physicians should recognize while critically appraising original research (Table 6.2).12 For this discussion, the new test being evaluated by researchers is called the "index test," whereas the gold standard by which the presence or absence of the disease is determined is called the "criterion standard." The criterion standard is the most accurate method available to delineate whether a disease or condition is present or absent. In appendicitis, for example, diagnostic tests would be compared with the pathologic finding of appendiceal inflammation, which is the criterion standard for appendicitis.
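Every bias discussed in this chapter distorts one or more cells of the standard 2 × 2 table comparing the index test against the criterion standard. As a reference point for the worked examples that follow, here is a minimal sketch in Python of how sensitivity, specificity, and likelihood ratios derive from those four cells; the counts are illustrative only, not data from any study cited in this chapter:

```python
# Diagnostic accuracy from a 2 x 2 table: index test vs. criterion standard.
# Counts below are illustrative only (not from any cited study).
tp, fn = 90, 10   # criterion standard positive: true positives, false negatives
fp, tn = 20, 80   # criterion standard negative: false positives, true negatives

sensitivity = tp / (tp + fn)                    # P(test+ | disease present) = 0.90
specificity = tn / (tn + fp)                    # P(test- | disease absent)  = 0.80
lr_positive = sensitivity / (1 - specificity)   # 0.90 / 0.20 = 4.5
lr_negative = (1 - sensitivity) / specificity   # 0.10 / 0.80 = 0.125

print(f"Sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
print(f"LR+ {lr_positive:.2f}, LR- {lr_negative:.3f}")
```

Each bias below can be read as a systematic shift in one or more of these four cell counts, and therefore in the derived measures.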
Incorporation bias is likely when the index test is one determinant of the criterion standard. This occurs, for example, when the criterion standard is a review of all pertinent clinical information, including the index test, by a panel of experts. When evaluating the accuracy of B‐type natriuretic peptide (BNP) for decompensated congestive heart failure (CHF), if the criterion standard is the consensus of two cardiologists reviewing the history, physical examination, imaging, and laboratory results that include the BNP value, then incorporation bias occurs: the experimentally observed sensitivity and specificity of BNP will be higher than in actual clinical practice. Another emergency medicine example is a BNP study that appropriately blinded two cardiologists assessing the presence or absence of CHF, yet concluded that "the best clinical predictor of CHF was an increased size on chest roentgenogram."13 Because the cardiologists evaluated the chest X‐ray in addition to the medical history and follow‐up studies, the heart size certainly influenced the decision to label a patient as having CHF or not. The main problem with incorporation bias is overestimation of both sensitivity and specificity.14 To avoid incorporation bias, the index test cannot be a component of, or reviewed to determine, the criterion standard.

Partial verification bias occurs when patients with abnormal index test results are more likely to receive the subsequent criterion standard evaluation and only those with criterion standard testing are included in the study. Accurate quantification of sensitivity and specificity requires that all patients in whom the index test would be obtained in actual practice receive the criterion standard evaluation, regardless of the index test result. Partial verification bias is particularly problematic in studies of the accuracy of signs and symptoms. For example, one study quantified the sensitivity of right lower quadrant (RLQ) pain for pediatric appendicitis as 96%, with a very low specificity of 5%.15 As discussed in Chapter 3, this equates to positive and negative likelihood ratios of approximately 1.0, implying that RLQ pain is useless in the diagnosis of pediatric appendicitis because it is so nonspecific. This study used histological findings as the criterion standard for appendicitis, yet only children with RLQ pain underwent surgery. This introduces partial verification bias: not all children had the same criterion standard assessment, because children ruled out on clinical grounds were assumed to be true negatives. Another example illustrating partial verification bias via a modified 2 × 2 table (Figure 6.1) uses results from a headache study. In this decades‐old study, computed tomography (CT) was more likely to be obtained in patients with a headache, so studying headache as a predictor of intracranial hemorrhage demonstrated higher sensitivity and lower specificity than if headache had played no role in obtaining the criterion standard CT.16,17
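The arithmetic behind this distortion can be made concrete with a small sketch. The cohort size, prevalence, test accuracy, and verification rate below are hypothetical assumptions chosen for illustration, not data from the appendicitis or headache studies cited above:

```python
# Partial verification bias: a deterministic sketch with hypothetical numbers.
n, prevalence = 1000, 0.20
true_sens, true_spec = 0.80, 0.70
verify_if_negative = 0.20   # only 20% of index-test-negative patients are verified

diseased = n * prevalence                                   # 200
healthy = n - diseased                                      # 800
tp, fn = diseased * true_sens, diseased * (1 - true_sens)   # 160, 40
tn, fp = healthy * true_spec, healthy * (1 - true_spec)     # 560, 240

# All index-test positives (tp, fp) receive the criterion standard; only a
# fraction of negatives (fn, tn) do, and unverified patients are excluded.
obs_sens = tp / (tp + fn * verify_if_negative)                       # 160/168 ~ 0.95
obs_spec = tn * verify_if_negative / (tn * verify_if_negative + fp)  # 112/352 ~ 0.32

print(f"True sensitivity {true_sens:.2f} -> observed {obs_sens:.2f}")  # falsely raised
print(f"True specificity {true_spec:.2f} -> observed {obs_spec:.2f}")  # falsely lowered
```

Excluding unverified test‐negative patients removes false negatives and true negatives alike, which is exactly why observed sensitivity drifts upward while specificity drifts downward, as Table 6.2 summarizes.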
Differential verification bias occurs when researchers use different criterion standards to define the presence or absence of disease, and the different standards can give different results. In the pediatric appendicitis example just described, rather than excluding patients who did not undergo surgery, another approach would be to follow nonoperative patients longitudinally to ensure that they do not develop appendicitis (and are therefore true negatives rather than early‐stage or initially mild appendicitis). However, two different criterion standards for detecting the presence or absence of appendicitis then exist – one surgical and the other clinical – creating differential verification bias. Appendicitis sometimes resolves spontaneously, and nonoperative management is increasingly common.18 If a child with mild appendicitis reports RLQ pain and goes to the operating room, the inflamed appendix will be noted on histology and the presence of RLQ pain will be recorded as a true positive. If the same child does not have RLQ pain and the clinical follow‐up criterion standard is applied, with spontaneous resolution of the appendicitis, the absence of RLQ pain will be recorded as a true negative. Sensitivity and specificity will be artificially elevated compared with a study in which every patient had the same criterion standard applied regardless of the index test result. Another example involves assessment of the inability to fully extend the elbow after blunt trauma as a predictor of fracture. In this study, every patient with an abnormal elbow extension test had X‐rays obtained, but only 19% of those with normal elbow extension had X‐rays, with the remainder evaluated by clinical follow‐up.19
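A similar sketch shows how this double gold standard inflates accuracy for a disease that can resolve spontaneously, such as mild appendicitis. All parameters here are hypothetical assumptions for illustration, not estimates from the cited studies:

```python
# Differential verification (double gold standard) bias for a disease that can
# resolve spontaneously. All numbers are hypothetical assumptions.
diseased, healthy = 100, 900
true_sens, true_spec = 0.70, 0.90
resolve_rate = 0.60   # fraction of missed (test-negative) disease that resolves

tp, fn = diseased * true_sens, diseased * (1 - true_sens)   # 70, 30
tn, fp = healthy * true_spec, healthy * (1 - true_spec)     # 810, 90

# Test-positives get the immediate (surgical) standard: tp and fp are unchanged.
# Test-negatives get clinical follow-up: missed disease that resolves is
# misclassified as "no disease," moving patients from fn into tn.
resolved = fn * resolve_rate                                # 18
obs_sens = tp / (tp + fn - resolved)                        # 70/82 ~ 0.85
obs_spec = (tn + resolved) / (tn + resolved + fp)           # 828/918 ~ 0.90

print(f"Sensitivity: true {true_sens:.2f} -> observed {obs_sens:.2f}")  # raised
print(f"Specificity: true {true_spec:.2f} -> observed {obs_spec:.2f}")  # raised (slightly)
```

The direction reverses for disease that only becomes detectable during follow‐up, which is why Table 6.2 lists both possibilities for this bias.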
Table 6.1 STARD 2015 criteria

| Section | Checklist item |
|---|---|
| Title | Identify the manuscript as a study of diagnostic accuracy reporting at least one measure of accuracy |
| Abstract | Structured summary of design, methods, results, and conclusions |
| Introduction | Scientific/clinical background, including the intended use of the index test |
| Introduction | Explicit statement of study objectives and hypothesis |
| Methods: Design | Whether data collection was planned before the index test and criterion standard were performed (prospective) or after (retrospective) |
| Methods: Participants | Eligibility criteria for participants |
| Methods: Participants | Basis upon which participants were identified (symptoms, previous test results, registry) |
| Methods: Participants | Where and when potentially eligible participants were identified |
| Methods: Participants | Whether participants formed a consecutive, random, or convenience sample |
| Methods: Test methods | Index test described in sufficient detail to allow replication |
| Methods: Test methods | Criterion standard described in sufficient detail to allow replication |
| Methods: Test methods | Rationale for selecting the criterion standard (if alternatives exist) |
| Methods: Test methods | Definition of and rationale for index test positivity cut‐offs, distinguishing prespecified from exploratory |
| Methods: Test methods | Definition of and rationale for criterion standard test positivity cut‐offs or result categories |
| Methods: Test methods | Whether clinical information and criterion standard results were available to the performers/readers of the index test |
| Methods: Test methods | Whether clinical information and index test results were available to the assessors of the criterion standard |
| Methods: Analysis | Methods for estimating or comparing measures of diagnostic accuracy |
| Methods: Analysis | How indeterminate index test or criterion standard results were handled |
| Methods: Analysis | How missing data on the index test or criterion standard were handled |
| Methods: Analysis | Any analyses of variability in diagnostic accuracy, distinguishing prespecified from exploratory |
| Methods: Analysis | Intended sample size and how it was determined |
| Results: Participants | Flow of participants, using a diagram |
| Results: Participants | Baseline demographic and clinical characteristics of participants |
| Results: Participants | Distribution of severity of disease in those with the target condition |
| Results: Participants | Distribution of alternative diagnoses in those without the target condition |
| Results: Participants | Time interval and any clinical interventions between the index test and criterion standard |
| Results: Test results | Cross tabulation of the index test results by the results of the criterion standard |
| Results: Test results | Estimates of diagnostic accuracy and their precision (such as 95% confidence intervals) |
| Results: Test results | Any adverse events from performing the index test or the criterion standard |
| Discussion | Study limitations, including sources of potential bias, statistical uncertainty, and generalizability |
| Discussion | Implications for practice, including the intended use and clinical role of the index test |
| Other information | Registration number and name of registry |
| Other information | Where the full study protocol can be accessed |
| Other information | Sources of funding and other support, with role of funders |
Table 6.2 Description of bias in diagnostic research

| Threats to diagnostic accuracy | Alternative names or subtypes | Description | Sensitivity is falsely | Specificity is falsely |
|---|---|---|---|---|
| Incorporation bias | Review bias | Classification of disease status partly depends on results of the index test, including failure to blind the outcome assessor to the index test | Raised | Raised |
| Partial verification bias | Verification bias, work‐up bias, referral bias | Positive index test cases are more likely to have criterion standard testing, and only patients with criterion standard testing are included | Raised | Lowered |
| Differential verification bias | Double gold standard bias | Positive index test cases are more likely to receive an immediate invasive criterion standard, while patients with a negative index test are more likely to receive clinical follow‐up for "disease," with bias when the criterion standards give different answers | Raised for disease that can resolve spontaneously; lowered for disease that only becomes detectable during follow‐up | Raised for disease that can resolve spontaneously; lowered for disease that only becomes detectable during follow‐up |
| Imperfect gold standard bias | Copper standard bias | The criterion standard determining the patient's disease status misclassifies some patients | Raised if errors on the index test and copper standard are correlated; lowered if errors on the index test and the gold standard are independent | Raised if errors on the index test and copper standard are correlated; lowered if errors on the index test and the gold standard are independent |
| Spectrum bias, disease severity | Case‐mix or subgroup bias | Spectrum of disease and nondisease differs from clinical practice. Sensitivity depends on the spectrum of disease, whereas specificity depends on the spectrum of nondisease that mimics the disease of interest | Raised when disease is skewed toward higher severity than observed in clinical practice | Raised when nondisease is skewed toward healthier patients than observed in clinical practice |
| Spectrum bias, exclusion of ambiguous tests | — | Patients with ambiguous or intermediate test results are excluded | Raised when excluded patients with disease were more likely than those included to have been false negatives | Raised when excluded patients without disease were more likely to have been false positives |

Source: Data from [12].
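The prose examples above illustrate the first three rows of this table; the imperfect ("copper") gold standard row can also be worked through numerically. The following is a minimal sketch, assuming a hypothetical index test and an imperfect criterion standard whose errors are independent of each other; as the table predicts, both apparent sensitivity and specificity are falsely lowered:

```python
# Imperfect gold standard bias with independent errors: the index test is
# penalized whenever the flawed criterion standard misclassifies a patient.
# All parameters are hypothetical assumptions for illustration.
p = 0.30                 # disease prevalence
se, sp = 0.90, 0.90      # true sensitivity/specificity of the index test
rs, rp = 0.90, 0.90      # sensitivity/specificity of the imperfect criterion standard

# Joint probabilities, assuming index test and criterion standard err independently.
ref_pos = p * rs + (1 - p) * (1 - rp)                    # P(reference positive) = 0.34
both_pos = p * se * rs + (1 - p) * (1 - sp) * (1 - rp)   # P(both positive)      = 0.250
ref_neg = p * (1 - rs) + (1 - p) * rp                    # P(reference negative) = 0.66
both_neg = p * (1 - se) * (1 - rs) + (1 - p) * sp * rp   # P(both negative)      = 0.570

apparent_sens = both_pos / ref_pos   # ~0.74, falsely lowered from 0.90
apparent_spec = both_neg / ref_neg   # ~0.86, falsely lowered from 0.90
print(f"Apparent sensitivity {apparent_sens:.2f}, apparent specificity {apparent_spec:.2f}")
```

If instead the index test and the flawed standard tended to err on the same patients (correlated errors), their agreement would be inflated and both measures would be falsely raised, the other possibility listed in the table.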