Big Data, Advances in Computational Sciences, and Oncology Care

As information technology has matured, the embrace of data-driven analytics by organizations and industries to guide services has profoundly changed the human experience. Just-in-time inventory, predictive pricing models, and simple applications such as web searches have generated immense gains in productivity, efficiency, and reliability. The past 40 years have also seen a revolution in the use of data in health care. Within medicine there has been an explosion of registries, such as the National Anesthesia Clinical Outcomes Registry (NACOR) from the Anesthesia Quality Institute, which tracks clinical events and provides practice benchmarks and quality reporting for its members. Requiring a mix of manual data entry and automated data extraction, these efforts attempt to build a data storehouse to help identify trends and understand practice. Within the wider scope of medicine, efforts are underway to incorporate information from many diverse sources, such as electronic health records (EHRs), genomic testing, claims, and public data sets. These advances will profoundly sculpt the face of health care from patient experience to patient outcomes. The impact of these forces should not be underestimated.

Little Data to Big Data

Database technology is ubiquitous in our modern existence. Its roots trace back to the 1970s when computer systems were fairly novel. Commonly, the mainframe vendor engineered a closed software platform and was the sole source for the end-user applications. The hardware, operating system, and software were only available from one source. Since there were no competitors at the software layer, standards were internal and ad hoc. This issue created hardware vendor silos inhibiting third party developers. With no competition general product advancement was slow and development costly. Each application was a unique development effort and rarely leveraged work from other products. Due to this vertical integration, it was common for application designers to create and recreate ways to store, access, and aggregate data.

To overcome this on the data side, researchers at the IBM Corporation began to investigate how a layer of abstraction could be constructed between the application and the data. With this boundary in place, application programmers were no longer required to code their own data handling routines or other tools to manipulate the data store. This break between data management and application programming was foundational, and it required a standard interface to bridge the divide. In 1970, Dr. Edgar F. Codd of IBM’s San Jose Research Laboratory published the relational model of data, and in 1974 his IBM colleagues Donald Chamberlin and Raymond Boyce described a standardized data query language called SEQUEL. SEQUEL became the bridge and interface between data and application. Shortened to SQL over time, the language provided standard tools for programmers to manipulate data without regard to how it actually was stored or represented at the file level. In creating this split, the database management system (DBMS) became a separate service to the programmer, much like the screen display or the network. The SQL language was initially proprietary, but in 1979, a company named Relational Software created its own competitive and compatible database management software, having reverse engineered SQL. Relational Software would later be known by the corporate name of Oracle.

The database that Codd described in his publications and that IBM implemented in software followed what is known as the relational model. The main competing structure was called hierarchical and could be thought of as a tree with branches or nodes for different items. Relational databases organize data into tables. Tables can be standalone, like a spreadsheet, or be linked together by keys. The use of keys traces a map of data, linking and defining how the data elements relate to each other. A SQL query leverages these relationships to pull data that conforms to both the constraints of the data model and the constraints of the SQL language. Given the standardization of database technology and the explosion in data storage, relational databases became part of the fabric of computing. They enforced discipline on the data model through the syntax of the SQL language and ensured that, when a database engine (relational database management system; RDBMS) was used, data could not be inadvertently corrupted through SQL-based manipulation. Designed in the 1970s, these structures worked well and delivered robust and reliable performance. As computing, storage, and networking continued to evolve, however, the flood of data requiring processing and analysis expanded at an exponential rate. SQL was born in the era of mainframe primacy. In the world of the web, the Internet of Things (IoT), and the digitization of whole new categories of daily artifacts, SQL and the relational database began to show their 50-year-old roots when challenged by the rise of the server.
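The idea of tables linked by keys can be made concrete with a small sketch. The example below uses the SQLite engine bundled with Python; the table names, column names, and sample rows are hypothetical and chosen only for illustration. It shows a query that follows a key from one table to another, and it shows the engine rejecting a row whose key points at a nonexistent patient.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two tables linked by a key."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE patient ("
        " patient_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE encounter ("
        " encounter_id INTEGER PRIMARY KEY,"
        " patient_id INTEGER NOT NULL REFERENCES patient(patient_id),"
        " diagnosis TEXT NOT NULL)"
    )
    # Hypothetical sample rows.
    conn.executemany(
        "INSERT INTO patient VALUES (?, ?)",
        [(1, "Patient A"), (2, "Patient B")],
    )
    conn.executemany(
        "INSERT INTO encounter VALUES (?, ?, ?)",
        [(10, 1, "sepsis"), (11, 1, "pneumonia"), (12, 2, "fracture")],
    )
    conn.commit()
    return conn


def diagnoses_for(conn, name):
    """Follow the patient_id key from encounter to patient with a JOIN."""
    rows = conn.execute(
        "SELECT e.diagnosis FROM encounter AS e"
        " JOIN patient AS p ON p.patient_id = e.patient_id"
        " WHERE p.name = ? ORDER BY e.encounter_id",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


def try_orphan_insert(conn):
    """Attempt to add an encounter for a patient that does not exist."""
    try:
        conn.execute("INSERT INTO encounter VALUES (13, 99, 'unknown')")
    except sqlite3.IntegrityError:
        return False  # the engine refused the inconsistent row
    return True
```

Calling diagnoses_for(build_demo_db(), "Patient A") returns the two diagnoses recorded for that patient, while try_orphan_insert reports that the engine blocked the inconsistent insert. The application code never touches the storage format; it only states what it wants in SQL.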

NoSQL Technology

SQL-based technologies were hitting the performance wall. As the size of the database increased, it required more processing horsepower. While processors were accelerating in capacity, the rate of data generation outstripped their ability to scale. In the 1990s the World Wide Web (WWW) was beginning to grow by leaps and bounds. Architects in this environment scaled their web server performance using multiple compute cores spread across multiple servers to expand their ability to handle the increasing load. They were able to scale horizontally. The RDBMS vendors struggled to use this type of architecture. Mainframes where SQL was born were single processor devices. Due to the way the RDBMS interacted with data, it was technically difficult to make the application work when using multiple servers. It was constrained to scale only vertically, i.e., a single compute core that had to run faster to deliver improved performance. This pressure to create a more scalable and flexible model continued to build.

Large web companies such as Facebook, Google, and LinkedIn tasked internal research and development resources with exploring alternative paths. The goal was to create a flexible framework that could handle a diverse set of data types (video, audio, text, discrete, and binary) using racks of inexpensive hardware. Unlike the RDBMS/SQL solution with its standard interface, defined language, and deterministic performance, the new framework, now known as NoSQL, was extremely heterogeneous and spawned a number of database management systems built to optimize specific classes of use cases. Over the past decade, work has gone into classifying data storage and retrieval challenges. As a result, a specific task can typically be addressed using one of four categories of NoSQL topologies: key-value, column-oriented, document-oriented, and graph databases. Underlying these tools is the ability to replicate data across multiple servers, harness the power of numerous processors, and scale horizontally. These database management systems underpin the infrastructure of Facebook, Twitter, Verizon, and other large data consumers.
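Horizontal scaling with replication can be illustrated with a toy key-value store. The sketch below is not how any particular NoSQL product works; the class name, the hashing scheme, and the "next node in the ring" replica placement are all assumptions made for illustration. Each Python dictionary stands in for one inexpensive server.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store that spreads keys across several nodes."""

    def __init__(self, num_nodes, replicas=2):
        if not 1 <= replicas <= num_nodes:
            raise ValueError("replicas must be between 1 and num_nodes")
        # Each dict stands in for one server in the rack.
        self.nodes = [dict() for _ in range(num_nodes)]
        self.replicas = replicas

    def owners(self, key):
        """Return the node indexes that hold copies of this key."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        first = int.from_bytes(digest[:8], "big") % len(self.nodes)
        # Assumed placement rule: the hashed node plus its successors.
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        """Write the value to every replica that owns the key."""
        for index in self.owners(key):
            self.nodes[index][key] = value

    def get(self, key, failed=()):
        """Read from the first owning node that is not marked as failed."""
        for index in self.owners(key):
            if index not in failed:
                return self.nodes[index][key]
        raise RuntimeError("all replicas for this key are unavailable")
```

Adding capacity in this model means adding nodes rather than buying a faster processor, and a read still succeeds when one of the owning nodes is marked as failed. Real systems add rebalancing, consistency protocols, and failure detection that this sketch omits.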

In health care, the data environment is still evolving. As the installed base of EHRs continues to climb, volumes of granular data about and surrounding the management of each encounter are recorded in digital form. Dwarfing this is the volume of imaging data captured. Penetration of digital radiology workflows is substantial, and the images captured result in petabytes of data just within institutions. By 2013, more than 90% of US hospitals had installed digital imaging, and adoption of 3D image reconstruction in hospitals now exceeds 40%. Technology is now transforming pathology workflow, with digital whole slide image capture becoming more widely adopted. This is a massive data management exercise, as images contain terabytes each. Additionally, the slide images required to support a diagnosis would include scanning of the entire specimen at differing levels of magnification. For liquid slides, such as bone marrow or blood smears, the volume increases further because images must also be captured at different focal planes.

Within the operating room (OR) and critical care environments, streaming data is ubiquitous. The current generation of anesthesia machines, infusion pumps, noninvasive monitors, and ventilators emit a continuous river of information on a second-by-second basis. Physiologic and machine data from a large hospital can comprise hundreds of kilobits of discrete data per second, 24 h a day. Buried in this data are clues warning of patient deterioration, sepsis, and intraoperative events. The tools necessary to find these signals must function on data that is streaming rather than at rest. While NoSQL technology can be used to store data, its value is manifest when it is also used to process data. In the example above, streaming data from different patients flow into a data processing engine and are segregated by patient. These individual patient data streams can be processed by algorithms colocated on the same server as the data. This combination of processing and storage on the same platform sets NoSQL apart from a performance standpoint. These algorithm/storage pipelines can continuously process data looking for predetermined signals. With a traditional RDBMS, by comparison, applications outside the database would be continuously making SQL calls for data to feed the analytics. This analytic SQL traffic would compete against the SQL traffic driven by data ingestion. Leveraging a NoSQL data structure that incorporates pipelines and the ability to process data locally enables analysis with low-cost hardware and high throughput.
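The pattern of segregating an interleaved stream by patient and scanning each stream for a predetermined signal can be sketched in a few lines. The class name, the window length, and the heart rate limit below are hypothetical and are not clinical guidance; the point is only that state is kept per patient and each reading is evaluated as it arrives rather than queried later from storage.

```python
from collections import defaultdict, deque


class PatientStreamMonitor:
    """Toy pipeline: split readings by patient and watch a rolling mean."""

    def __init__(self, window=5, hr_limit=120.0):
        self.window = window
        self.hr_limit = hr_limit  # hypothetical threshold, illustration only
        # One fixed-length buffer of recent readings per patient.
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def ingest(self, patient_id, heart_rate):
        """Process one reading; return an alert string or None."""
        recent = self._recent[patient_id]
        recent.append(heart_rate)
        if len(recent) < self.window:
            return None  # not enough data yet for this patient
        mean = sum(recent) / self.window
        if mean > self.hr_limit:
            return f"ALERT {patient_id}: mean HR {mean:.1f}"
        return None


# Hypothetical interleaved feed from two bedside monitors.
monitor = PatientStreamMonitor(window=3, hr_limit=120.0)
feed = [("A", 125), ("B", 80), ("A", 130), ("B", 82), ("A", 135), ("B", 81)]
alerts = [monitor.ingest(patient, hr) for patient, hr in feed]
```

In this feed only patient A accumulates a window whose mean exceeds the limit, so a single alert is produced on A's third reading. A production engine would distribute the per-patient buffers across servers, but the per-stream logic would look similar.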

Medicine is in an age of information. Whether the application requires a SQL-type database or a NoSQL data processing engine to handle the ingestion and analysis of terabytes, the tools and technology to cope with this torrent of data are well developed and readily available.

Computational Advances

The first operational electronic computing machines were built during World War II to assist in code breaking of enemy message traffic. These were purpose-built machines and used the technology of the day: telephone relays, vacuum tubes, and paper tape. With the commercial advent of the transistor in the mid-1950s, an enormous change took place as vacuum tubes were replaced with transistors. Now, the function of a tube, switching from on to off to represent a logic one or zero, went from the size of a baseball to that of a flea. By the early 1960s, designers were able to put more than one transistor in a package, and by connecting multiple transistors together on a single piece of silicon they could create “chips” used as logic building blocks.

During this era, Gordon Moore was director of Research and Development at Fairchild Semiconductor, and through his observation of the technological trends within the industry, the underlying physical science, and the economies of scale, he formulated what has come to be known as Moore’s Law. Dr. Moore stated, “The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term, this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.” In this statement from 1965, he predicted a log-linear relationship that would hold true for over 50 years (see Fig. 61.1). Gordon Moore went on to cofound Intel Corporation.
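The arithmetic behind a "log-linear" relationship is worth making explicit. A quantity that doubles at a fixed interval grows exponentially, so its logarithm rises by a constant amount per interval and plots as a straight line. The short sketch below works through this; the function names and the starting count of 64 components are hypothetical.

```python
import math


def projected_growth(years, doubling_period_years=1.0):
    """Growth factor after a number of years of steady doubling."""
    return 2.0 ** (years / doubling_period_years)


def log2_series(start_count, years, doubling_period_years=1.0):
    """Base-2 logarithm of the projected count for each year."""
    return [
        math.log2(start_count * projected_growth(year, doubling_period_years))
        for year in range(years + 1)
    ]


# Doubling every year for 10 years, as in the 1965 statement.
ten_year_factor = projected_growth(10)

# Hypothetical starting point of 64 components.
series = log2_series(64, 10)
steps = [later - earlier for earlier, later in zip(series, series[1:])]
```

Doubling annually for a decade gives a factor of 2 to the 10th power, or 1024, and every entry in steps is the same, which is what a straight line on a logarithmic axis means. Passing doubling_period_years=2.0 instead models a slower two-year doubling and yields a factor of 32 over the same decade.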

The relentless advance of semiconductor technology has paved the way for smaller devices and increased computing capacity. Paired with this is the bountiful harvest of connected technologies benefiting from an increased understanding of materials science and physics. Storage technology, networking bandwidth, and video display all have similar cost-performance curves with incredible capabilities contained within inexpensive packages. Modeled after Moore’s law and driven by some of the same economic pressures, Edholm’s law (data rates on wired and wireless networks; see Fig. 61.2) predicts the log-linear relationship of performance in these industries.

Cloud Computing

In the mid-2000s as predicted by Edholm’s law, network bandwidth had increased to the point where connectivity was not usually a limiting factor in regard to application functionality. Previously, on-premise compute and storage using specialized data centers had been seen as the only choice to host business class applications. With many applications moving to the WWW, mobile platforms becoming more robust and the continually increasing server compute power, application owners were looking to outsource management of hardware.

Cloud computing refers to the use of shared resources (storage, compute, and processing) that are often located in IT infrastructure owned by a third party. Cloud has become widely adopted due to ease of use and low upfront capital costs. In July 2002 Amazon Corporation launched Amazon Web Services (AWS). Having used the platform until that point as an internal resource, Amazon began to open it to customers to run their applications. From the end-user perspective, items such as maintenance, upkeep, and security become contractual terms handled by the cloud provider, not tasks staffed by the customer. Eliminating this overhead enabled organizations to take an incremental approach to the technology, with the ability to increase size or capability rapidly, needing only a contract amendment. Individual researchers using these services for analytics or machine learning applications are charged only for the resources they use, when they use them, lowering the bar to access a world-class technology platform. Today Amazon (AWS), Microsoft (Azure), Google (Google Cloud Platform), VMware, and IBM (IBM Cloud) are dominant forces within the industry.

Analytics and Visualization

Analytics is defined as “the discovery, interpretation, and communication of meaningful patterns in data. It also entails applying data patterns towards effective decision-making. In other words, analytics can be understood as the connective tissue between data and effective decision-making within an organization.” Society and industry have been transformed through the use of data. Early analytics tended to be static charts or tables, and the presentation of that data was left for the end user to process and understand the implications. Over time the discipline of data visualization began to mature. This field understood that presentation of data was half the job of a good visualization. The more important half was to guide the user to the correct interpretation of the data presented. Quality visualizations make understanding the implications of the data intuitive.

The price to pay for poor visualizations can be high. On the morning of January 28, 1986, the space shuttle Challenger was launched. Seventy-three seconds into the flight, the external fuel tank ruptured and the crew were lost. In the subsequent root cause analysis, it was discovered that rubber O-rings sealing the joints between sections of the solid rocket boosters had failed. This came as a shock to the engineers, who had been following the health of the O-rings closely on each mission. Analytics used to understand previous nonfatal O-ring failures provided no real insight into a sporadic event. The failures had been plotted many ways, including versus ambient temperature, because the flight engineers understood that O-rings became less flexible as they cooled. In the visualization from the congressional report, O-ring failure is plotted by location and flight in sequential order with temperature noted individually. In constructing the analytic this way, the engineers made it difficult to extrapolate O-ring performance to a launch temperature (32°F ambient) more than 20°F cooler than that of any previous flight. When the elements are visualized appropriately, the dramatic effect of temperature on O-ring performance becomes intuitive (Fig. 61.3). Failure to grasp this insight by highly motivated, intelligent, and data-rich engineers at NASA cost seven lives and the loss of over a billion dollars’ worth of hardware.

Visualization of medical data can be similarly high risk. As more clinical data moves onto electronic platforms such as EHRs, how that data is presented can influence how it is interpreted and acted upon. This comes not only from the representation of clinical data but the usability of the system interface. In the early stages of the nationwide push to install EHRs, focus rested squarely on end-user adoption. Moving from a paper and verbal-based system of patient management to one with a keyboard, mouse, and screen was difficult. The Office of the National Coordinator for Health Information Technology estimates that in 2017, 96% of all nonfederal acute care hospitals possessed certified health IT, nearly 9 in 10 (86%) of office-based physicians had adopted any EHR, and nearly 4 in 5 (80%) had adopted a certified EHR. Humans become effective and efficient at performing tasks when their mental model, their internal representation of the problem space, closely matches and predicts reality. Building a good mental model requires time and repetition.

With EHRs now close to a decade in wide use, the next generation of providers has little experience with the paper chart. How the health provider interacts with the application and the way data is presented shape patient interactions. The user interface not only displays information but guides clinical interactions and potentially the interpretation of the data on the screen.

Augmented Decision Support

When weighing the potential return on dollars invested in health information technology, one of the most commonly cited benefits was the promise that digital platforms would reduce waste and improve the quality of care delivered. On the current EHR platforms, the realization of these promises is mixed. The feature most commonly cited when discussing the improvement of quality is clinical decision support (CDS). While CDS has been widely adopted, it is seen as falling short of its full potential in clinical practice. Because it is a rigid rule-based framework operating in an extremely fluid environment, many alerts are redundant, not germane, or simplistic. The highest impact has been around positive patient ID and in-patient medication/transfusion administration. To work in the current complex clinical environment, decision support must incorporate deeper knowledge of the process of care, goals of therapy, and the state of the patient. This requires moving from a rule-based framework to one with a more overarching “understanding” of the patient.

Machine Learning and Artificial Intelligence

Since the advent of electronic computing machines, there have been ongoing efforts to use them to emulate or augment the process of human decision-making. Early attempts began in the 1950s with chess-playing programs. Scientists’ initial tendencies were to imagine the world and our interactions with it as governed by a set of rules and associated supporting logic, ironically much like a chess board and the current EHR decision support framework. The assumption was that if all the rules and logic could be coded and captured, the computer would begin to mimic human thoughts. Progress was slow, and early predictions of rapid success fell away as the problem space expanded and the limitations of computational power were reached. Over time, the field matured, and it became obvious that some problems, such as image recognition, were not tractable in a rule-based framework. During the 1980s and 1990s, the science progressed, and the term artificial intelligence (AI) became more generic. A spectrum of approaches began to emerge, including machine learning, deep learning, and classic rule-based AI. Over the next 20 years, machine-learning techniques took advantage of increasing computational power to create applications capable of addressing real-world problems.

Machine learning and deep learning have become popular topics. Deep learning is a machine learning method based on learning data representations rather than task-specific algorithms. Technologies supporting this sort of machine learning effort tend to be neural network-based and produce applications that are found in everyday life, such as speech recognition, natural language processing, and image processing. In this work, all these terms will be used interchangeably.

The impact of machine learning and deep learning is manifest today across society. In medicine, the most visible and common interaction that providers have with this type of technology is machine learning driven speech to text. Companies such as Nuance Communications and M*Modal have created the infrastructure to ingest dictation at the point of care, perform voice to text translation, and display the spoken word within the EHR in near real time. Physicians and other health care providers had relied for decades on humans to transcribe recorded speech. The ability for health care organizations to deploy this technology without enormous training costs and end-user adoption problems has only occurred recently. This technology enables physicians to continue to work as they are accustomed while deriving benefit of rapid turnaround and low training cost.

But voice to text only impacts the user interface. The greater utility of the dictation exists when the concepts embedded in the text can be codified and shared as independent discrete concepts. Talking about an episode of sepsis is useful to others only after the document is opened and read. Being able to extract the concept of sepsis using machine learning enables this discrete information to become actionable in many places without the need to search for, find, read, and understand the base document. Automated concept extraction is a form of machine learning known as natural language processing (NLP).
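The difference between free text and a discrete, shareable concept can be shown with a deliberately simple sketch. The example below is dictionary-based term matching, not machine learning, and it stands in for the far more capable extraction that production systems perform. The concept names, synonym lists, and sample note are hypothetical.

```python
import re

# Hypothetical mapping from a coded concept to phrases that express it.
CONCEPT_TERMS = {
    "sepsis": ["sepsis", "septic shock", "septicemia"],
    "pneumonia": ["pneumonia"],
}


def extract_concepts(note):
    """Return the coded concepts whose terms appear in the note text."""
    text = note.lower()
    found = []
    for concept, terms in CONCEPT_TERMS.items():
        # Word boundaries avoid matching a term inside a longer word.
        if any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms):
            found.append(concept)
    return found


sample_note = "Patient admitted with septic shock; started on broad-spectrum antibiotics."
concepts = extract_concepts(sample_note)
```

Here the note never uses the word "sepsis", yet the concept is emitted as a discrete value that other systems could act on without opening the document. A matcher this simple would also flag "no evidence of pneumonia" as pneumonia, which is one reason negation, context, and ambiguity are handled with learned models in practice.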

NLP uses deep learning techniques to extract data and concepts from the written or spoken language. The utility of this capability is manifest within health care. While widespread EHR implementation has reduced the number of paper forms and other documents within the care ecosystem, there exists a legacy of scanned as well as physical documents that are poorly indexed and essentially inaccessible.

In our mobile society, records such as these are transferred from provider to provider commonly as physical documents that become invisible within the scanned/outside documents tab of the destination EHR. Using NLP, these documents could be mined for concepts and organized for search. The impact of this on clinical care would be substantial. The most expensive disease categories, those encompassing chronic conditions, represent an overwhelming percentage of cost. The ability to understand the longitudinal course of a disease and therapies that have succeeded as well as failed could have an outsized impact.

NLP is an area of intense innovation and research. The machine-based understanding of language and its meaning is extraordinarily complex. In medicine, accuracy and reliability are critically important. Current technology continues to evolve and within constrained universes accuracy is relatively good. The commercial product Nuance One is a relevant specific example. The Nuance One offering contains a component called Dragon Medical Advisor. Nuance One is a speech to text application with the capability for real-time medical concept extraction. Using the extracted concepts, Dragon Medical Advisor begins to create a profile of the patient. A knowledge engine built specifically for diagnosis and billing criteria then evaluates the concepts presented with facts from the specific text and the applicable coding criteria. It notifies the physician on the EHR screen of areas where additional language or finding specificity is needed. Within the context of billing codes and documentation completeness, the system is extremely accurate, unobtrusive, and easy to use. It is essentially a just-in-time learning system focused around coding and documentation. The promise of machine learning to help find, correlate, and display relevant information within the patient-physician interaction is appealing.

Neural Networks and Deep Learning

Within this family of techniques, neural networks are often used for complex analysis of nontextual artifacts such as images. Applied to image data, neural networks use layers of interconnected nodes that focus on and analyze small sections of an image. Working in parallel, they combine to judge whether the image contains the target. This parallel nature enables the neural network to be spread across multiple servers, increasing the speed of the application. Traditional programming languages are used to create the neural network environment. Once constructed, these networks are trained with large sets of data. This is different from how we normally think about software applications. Training occurs when multiple samples, such as images containing the desired feature (true positive), are ingested by the network. This process causes the nodes to adjust their connections and create feedback loops. When presented, for example, with multiple images of chest x-rays known to be positive for tuberculosis (TB), the network begins to distinguish those features that are indicative of TB, similar to the method of training for residents in radiology. Showing the network a slightly different set of training images results in a neural network with different properties.
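The notion that training adjusts connections rather than relying on hand-written rules can be seen in the smallest possible case, a single artificial node trained with the classic perceptron update. This is a sketch of the principle only; real image networks have many layers and millions of weights. The two binary input features and their labels are hypothetical stand-ins for findings detected in an image.

```python
def predict(weights, bias, features):
    """Return 1 if the weighted sum of the inputs exceeds zero, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train_perceptron(samples, labels, epochs=10, learning_rate=1.0):
    """Adjust weights whenever the node misclassifies a training sample."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(samples, labels):
            error = label - predict(weights, bias, features)
            if error != 0:
                # Strengthen or weaken each connection in the direction
                # that would have reduced the error on this sample.
                weights = [
                    w + learning_rate * error * x
                    for w, x in zip(weights, features)
                ]
                bias += learning_rate * error
    return weights, bias


# Hypothetical training set: label is 1 when either finding is present.
SAMPLES = [(0, 0), (0, 1), (1, 0), (1, 1)]
LABELS = [0, 1, 1, 1]

weights, bias = train_perceptron(SAMPLES, LABELS)
predictions = [predict(weights, bias, sample) for sample in SAMPLES]
```

Nothing in the code states the rule "positive if either finding is present"; the node arrives at weights that reproduce the labels purely from the examples it was shown. Changing the training samples changes the learned weights, which mirrors the observation that different training images yield a network with different properties.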

The training process causes the features to be recognized implicitly within the network, rather than explicitly programmed by a developer. In the example for TB, the application had a sensitivity (accurate identification of true positives) of 97% and a specificity (accurate identification of true negatives) of 100%. The impact of this technology on health care costs is evident. A deep learning application for automated image analysis has the potential to exceed the accuracy and reliability of human radiologists. While this is perceived as a controversial point and is currently dismissed, automated image analysis has several compelling economic and quality advantages. Given the low impact of imaging procedures on the patient, imaging is a frequently used modality. Patients with chronic disease states accumulate numerous studies over time. While radiologists use previous images to compare and discern disease progression or discover new findings, there exists a functional limitation as to how far back and how many different modalities can be examined.
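Sensitivity and specificity, as quoted for the TB example, are simple ratios taken from a table of true and false results. The sketch below computes both; the counts are hypothetical, chosen only so that the ratios match the percentages quoted in the text, and are not the counts from the underlying study.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    Sensitivity is the share of diseased cases the test flags.
    Specificity is the share of healthy cases the test clears.
    """
    diseased = true_pos + false_neg
    healthy = true_neg + false_pos
    if diseased == 0 or healthy == 0:
        raise ValueError("need at least one diseased and one healthy case")
    return true_pos / diseased, true_neg / healthy


# Hypothetical counts: 100 positive films and 100 negative films.
sens, spec = sensitivity_specificity(
    true_pos=97, false_neg=3, true_neg=100, false_pos=0
)
```

With these counts the function returns 0.97 and 1.0: three of the hundred positive films were missed, and no negative film was flagged in error. Both figures depend on the mix of cases in the test set, so they describe performance on that set rather than a guarantee for every population.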

With a deep learning-based approach, the review of prior images is constrained only by the available computing resources. First, deep learning can ingest all the images within the Picture Archiving and Communication System (PACS) that are relevant to the analysis. Having the unlimited ability to ingest the longitudinal course of the disease could potentially bring insights into the subclinical evolution of the disease process. Second, improvements to the application can occur continuously, and the most effective models can be moved into operation instantly, benefiting all subsequent studies. Radiologists, by contrast, accumulate knowledge and insight with experience over time. The resources for training and enabling a provider to become proficient are substantial and limited. This creates a scaling problem, leading to multiple tiers of access to this resource gated by location, practice environment, and compensation, and resulting in variability of cost, quality, and availability.

Deep learning has the potential to feed forward information into the ordering cycle. Many organizations use a radiologist to protocol upcoming studies. This requires the radiologist to review the referring physician’s order and the chart to ensure that the study ordered is actually going to help answer the clinical question. While ordering a CT scan of the head seems like the appropriate test for a patient with a metastatic lesion to the brain, the protocoling radiologist would correct that to CT of the head with and without contrast to enable evaluation of the peritumoral edema visible due to contrast enhancement. Also missing in this workflow is a comprehensive review of prior images. Coupled with natural language processing, this deep learning augmented activity could reduce inappropriate ordering and radiologist time and improve diagnostic efficiency. This is similarly applicable to other highly visually oriented specialties, including pathology, ophthalmology, and dermatology.

The common conception is that deep learning applications must be associated with a keyboard, screen, and many racks of equipment behind the scenes. The level of semiconductor integration available today enables powerful applications to be constructed to fit in the palm of one’s hand. Combining embedded deep learning models, a commercially available image sensor, and the computing power found in a cell phone, DermaSensor has created a small handheld package that will image skin lesions (freckles, moles, etc.) and render a score of potential for malignancy. This type of technology could enable earlier detection of skin cancers for more people at a substantially lower cost. Current skin screenings are usually performed by dermatologists at a greater cost. With devices such as these, lower-skilled professionals would be able to deliver comparable results. This directly affects the economics of medicine through its impact on labor costs and potential curability of skin malignancies.

Deep learning, machine learning, and the other technologies discussed will have a profound impact on the cost, quality, and availability of healthcare. The use cases cited above exist and are in clinical practice. This sort of technology has the potential to cause extreme disruption in the structure of medicine. In most other industries, technology has already caused displacement and upheaval. The travel industry is a clear example, with the role of the travel agent dramatically shifting. Uber and Lyft have completely disrupted the cab industry, resulting in wage compression, bankruptcy, and suicides. Technology and innovation are a double-edged sword, and their impact in medicine is not to be underestimated.

Genomic Data

Following the discovery of the structure of DNA by Watson and Crick in 1953, scientific attention turned to methods to understand and sequence the molecule. In 1977 Fred Sanger and his collaborators published the chain-termination method, an early and widely adopted process for sequencing segments of DNA. These efforts gave rise to a diversity of approaches and technologies. Slowly the science matured, and larger sequences of base pairs could be decoded more rapidly. By the late 1980s, the industry was beginning to create commercial products and viable companies. Government funding supported much of this work, and it was clear that sequencing the complete human genome was a realistic goal. Launched in 1990 and coordinated by the National Institutes of Health, the Human Genome Project was singularly focused on accomplishing this task. It was successfully concluded in 2003 at a cost of $3 billion.

With the realization of the potential for this technology, genomic sequencing became a multibillion-dollar industry. Current sequencing using the “Ion Torrent” methodology uses specially formed micro wells on a semiconductor chip. Since this methodology leverages the same forces as Moore’s law, similar cost-performance curves can be expected. In 2003 Robert Carlson formalized this with an eponymous law calling out the log-linear relationship emerging in sequencing cost over time. In practice this is borne out, with sequencing costs actually declining faster than predicted over some time periods (Fig. 61.4). Today viral and bacterial sequencing is routine. For patients with HIV, the therapeutic regimen is driven by recurring sequencing of their virus. Understanding what mutations the virus has undergone guides the choice of drugs. Likewise, genome sequencing for cancer is expanding, and for breast disease the use of specific genetic markers such as HER2 (Human Epidermal Growth Factor Receptor 2) and BRCA1 (BReast CAncer type 1 susceptibility protein) is best practice. The revolution that enabled these advances drew on biochemistry, physics, and information technology.