CHAPTER 5

Informatics and Artificial Intelligence in Congenital Heart Disease

Eric L. Vu1, Craig G. Rusin2, and Kenneth M. Brady3

1 Department of Anesthesiology, Cardiac Anesthesiology, Lurie Children's Hospital of Chicago, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
2 Department of Computational and Applied Mathematics, Predictive Analytics Lab, Texas Children's Hospital, Baylor College of Medicine, Rice University, Houston, TX, USA
3 Division of Cardiac Anesthesia, Regenstein Cardiac Care Unit, Gracias Family Professor in Cardiac Critical Care, Department of Anesthesiology and Pediatrics, Lurie Children's Hospital of Chicago, Northwestern University Feinberg School of Medicine, Chicago, IL, USA

Introduction

The amount of patient data that can be leveraged for decision-making in critical care environments, such as operating rooms (ORs) and intensive care units (ICUs), continues to grow at an ever-increasing rate. While working in a data-rich environment has clear clinical benefits, the geometric growth of data generated per patient makes it increasingly difficult for a single individual to manually review and assimilate, with high fidelity, all the data collected from a given patient. This is often referred to as the "data overload" problem. The field of informatics provides a structured framework for overcoming this challenge. Applying the tools and technologies of informatics to the data, practices, and use cases inherent to anesthesia and critical care medicine yields powerful ways to enhance physician decision-making and improve patient care and safety. With the development of these tools, improvements in healthcare access, efficiency, quality, and outcomes are anticipated. This chapter will provide an overview of the history of medical informatics and emerging technologies. The basics of data, data types, and common techniques in data mining will then be briefly examined, with a focus on their relative strengths and limitations. Finally, clinical examples of advanced analytics for congenital heart disease (CHD) and pediatric critical care medicine will be presented.

History of medical informatics

Biomedical informatics is a multidisciplinary field that studies the effective uses of biomedical data, information, and knowledge for scientific inquiry, problem-solving, and decision-making to improve the medical care and health of patients [1]. The term "medical informatics" first appeared in the 1960s, and the field traces its origins to the first computer, developed in 1945. The world's first general-purpose electronic digital computer was developed at the University of Pennsylvania Moore School of Electrical Engineering. This computer, the Electronic Numerical Integrator and Computer (ENIAC), was used by the United States Army Ballistic Research Laboratory to study thermonuclear weapons and calculate artillery firing tables [2]. ENIAC also played a role in the early application of Monte Carlo methods, a set of computational algorithms that rely on repeated random sampling to obtain numerical results [3]. These methods, together with growing computational power, paved the way for applications in simulation, optimization, predictive modeling, probabilistic analysis, and applied statistics. Machine learning was first described in 1959 by Arthur Samuel in reference to a computer program that iteratively improved each time it played a game of checkers [4]. What began with programs and algorithms for simple pattern classification has now grown into a robust field encompassing symbolic and statistical methods for analyzing data.
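As a brief aside to make the Monte Carlo idea mentioned above concrete, the short sketch below estimates the value of pi by repeated random sampling. The example is a generic illustration, not drawn from the chapter, and uses only the Python standard library.

```python
import random

def estimate_pi(n_samples: int = 100_000) -> float:
    """Estimate pi by sampling random points in the unit square and
    counting the fraction that fall inside the quarter circle of radius 1."""
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter-circle covers pi/4 of the unit square, so the hit
    # fraction approximates pi/4.
    return 4.0 * inside / n_samples

print(estimate_pi())  # approaches 3.14159... as n_samples grows
```

The same principle, drawing repeated random samples to approximate a quantity that is difficult to compute directly, underlies the simulation and probabilistic analyses mentioned above.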
Over the next few decades, computer technology crossed over into healthcare. Computers began appearing in the 1970s at Veterans Affairs Hospitals and Massachusetts General Hospital [5, 6]. Meanwhile, databases and database management evolved from primitive file processing to database management systems during the 1970s and 1980s, including hierarchical and relational database systems and query languages such as structured query language (SQL) [7]. In the 1980s, improvements in computer hardware and processing speed expanded the capacity to store and process data. With better management of complex data, further integration of heterogeneous data sources, the development of advanced data models and queries, and sophisticated database management systems, the landscape of data storage and analysis evolved. By the late 1980s, the business data warehouse had been developed by International Business Machines (IBM) researchers Barry Devlin and Paul Murphy [8]. The business data warehouse is a repository of multiple heterogeneous data sources, organized under a schema to facilitate business intelligence and decision-making. The applicability of data warehouses to medicine not only allowed new techniques and fields to emerge but also created new problems to solve.

Meanwhile, a United States Defense Advanced Research Projects Agency (DARPA)-funded project in the late 1960s led to the first computer networks and, ultimately, the Internet. This system of interconnected computer networks grew into the World Wide Web in the late 1980s and early 1990s [9]. Information retrieval, data mining, and web-based databases (XML, or extensible markup language, databases) soon became possible. Over the next few decades, the application of database and data warehouse technologies continued to expand from business to medical use. With the adoption of the Health Information Technology for Economic and Clinical Health (HITECH) Act in 2009, there was a large movement toward the meaningful use of health information technology and the standardization of electronic health records in an effort to improve medical care [10].

The patient data generated on a day-to-day basis in healthcare has led to major advancements in the field of medical informatics, particularly in the perioperative setting and the ICU [11, 12]. Modern physiologic monitors can provide more than 20 continuous values describing a patient's status at frequencies >200 Hz [13]. Similarly, the adoption of electronic health records, databases, and data warehouses has shaped medical informatics. The thoughtful design of these databases and of communication standards is essential to ensure interoperability. Communication standards such as Health Level-7 (messaging), Digital Imaging and Communications in Medicine (imaging), and Institute of Electrical and Electronics Engineers/International Organization for Standardization (IEEE/ISO) 11073 (device interoperability) have played a pivotal role in medical informatics [14].

Principles, terminology, and technologies

Today, medical informatics is a field that intersects with clinical sciences, medicine, mathematics, computer programming, bioengineering, decision analysis, probability, epidemiology, and statistics. This section provides an overview of medical informatics by describing basic principles, terminologies, and technologies. Its goal is to provide a framework for conceptualizing the use of data science in healthcare.
Friedman's fundamental theorem of informatics states that "a person working in partnership with an information resource is 'better' than that same person unassisted" (Figure 5.1) [15]. The theorem describes how information technology can augment human reasoning: information resources may provide information, knowledge, or advice to support an individual faced with a medical decision. Friedman's fundamental theorem assumes that the resource offers something of benefit that the individual does not already know, so the interaction between the individual and the information resource is critical. For a technology to be truly useful, data must be turned into information that the clinician uses and integrates with their own knowledge; through the repeated application of such knowledge, wisdom forms. This concept of creating knowledge and wisdom from data is adapted from Ackoff's data, information, knowledge, and wisdom hierarchy (Figure 5.2) [16-18].

How to work with big data

Working with data is both an art and a science. Each data science project is unique in its goals and will therefore be unique in its design and execution. However, many best practices and frameworks have been adopted by the data science community to ensure that high-quality, reliable results are achieved. To understand how to work with data, we must first understand how data is represented, measured, collected, organized, and accessed.

How is data represented

All data, from music to images to patient medical records, is ultimately stored as ones and zeros within a computer. These are referred to as bits. Bits are stored in the computer's volatile memory (random-access memory, RAM) or in nonvolatile memory (i.e. files on a hard drive, which are preserved across reboots of the computer). Typically, sets of 8 bits are grouped together; this grouping is called a byte. Data size is often measured in units of bytes, with increasingly large prefixes (Table 5.1). There are various systems for encoding data; two examples are binary, which consists of two digits (0 and 1), and base-10, which consists of 10 digits (0 through 9). To encode useful information, there are many ways to interpret patterns of bits, and different encoding types have different advantages and disadvantages. The most common data types for data science include the following.

Integers - This data type represents whole numbers (1, 2, 3, etc.). Integers can be signed (able to hold negative as well as positive values) or unsigned (non-negative only). The range of values an integer can hold depends on the number of bits used to express it. For example, an unsigned integer that uses 16 bits (or 2 bytes) can represent a number between 0 and 65 535 (2^16 − 1), while a signed 8-bit integer can represent a number between −128 and 127. Typical sizes for integers are 8, 16, 32, and 64 bits. The disadvantage of integers is that they cannot express a number that contains a decimal or fractional value.

Floating-point numbers - This data type represents positive and negative decimal numbers. The precision of a floating-point number depends on the number of bits used to represent it. Floating-point numbers are usually written as a decimal with an exponent (i.e. 2.3456 × 10^5). A 32-bit floating-point number has roughly seven decimal digits of precision, while a 64-bit float (usually called a double) has roughly 15-16 digits of precision.

Character arrays, strings, and text - This data type represents text.
Usually, 1 or 2 bytes of data are used to represent an individual character, and character encoding maps are used to associate a specific byte value with a letter or pictograph (e.g. a character byte with a value of 65 represents the letter "A"). The most common encoding maps are the American Standard Code for Information Interchange (ASCII) and a universal character encoding standard called Unicode. Traditionally, a string is terminated with a NULL character (i.e. a byte value of 0).

More complicated data representations are constructed hierarchically from these basic types. For example, a grayscale image with a resolution of 600 × 800 (such as that produced by an X-ray) can be represented using an array of 480 000 integers (600 × 800 = 480 000), where each integer represents the intensity of the signal at a specific location in the image. An RGB color image uses a set of three integers to represent the color of each pixel, or location, in the image. While these data structures may be stored in complex ways (because they can become very large), all complex data structures can be broken down into these fundamental data types for subsequent analysis.

Data types within a hospital

There are four main classes of data within a hospital: text-based records in the electronic medical record (EMR)/electronic health record (EHR), imaging data held in the Picture Archiving and Communication System (PACS), physiologic time-series data generated by bedside patient monitoring equipment, and genetic ("-omics") data from patients. The most common data used in medical data science projects are the text-based records from the EMR. Within the EMR, there is a further dichotomy: structured vs. unstructured data. Structured data is data that has been captured in such a way that an unambiguous result can be ascertained from the user input. Conversely, unstructured data is data captured without such a rigorous process in place. For example, entering a patient's weight into the EMR using the "weight" field or ordering a medication from a drop-down "medication list" creates structured data, because a specific field on the form was designed to capture "weight" or "medication name." It is not the nature of the data itself that makes it structured, but rather the way the data is captured. The limitation of a structured data capture system is that it can capture only the data it was designed to capture and nothing else. The benefit is that it facilitates data analyses and the data analysis pipeline (discussed later in this chapter).

Unstructured data is exemplified by the free-text portion of a clinical progress note. The free-text nature of unstructured data gives the user the flexibility to add whatever information is important; the downside is that unstructured data is significantly more difficult to query than structured data. Imagine the task of obtaining a list of the medication doses given to a set of patients. If these data were captured in a structured way, a spreadsheet could be generated with the medication name, corresponding dose, time, and patient (how this is done is addressed later in this chapter). With unstructured data, obtaining the same information would require a manual chart review of all of the free text in the medical record. This process is time-consuming and may introduce errors that could invalidate any subsequent data analysis; the brief sketch below illustrates the difference.
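The difference between querying structured and unstructured data can be illustrated with a short Python sketch. The records, field names, and note text below are entirely invented for illustration and are not drawn from any real EMR.

```python
import re

# Structured capture: each medication administration is a record with typed fields.
structured_doses = [
    {"patient_id": 101, "medication": "furosemide", "dose_mg": 2.0, "time": "2023-01-05 08:00"},
    {"patient_id": 102, "medication": "milrinone",  "dose_mg": 0.5, "time": "2023-01-05 08:15"},
    {"patient_id": 101, "medication": "furosemide", "dose_mg": 2.0, "time": "2023-01-05 20:00"},
]

# Querying structured data is a simple filter over named fields.
furosemide_doses = [r for r in structured_doses if r["medication"] == "furosemide"]
print(furosemide_doses)

# Unstructured capture: the same information buried in free-text notes.
notes = [
    "Pt remains on diuretics; gave furosemide 2 mg IV this morning, tolerated well.",
    "Started milrinone 0.5 mcg/kg/min for low cardiac output.",
]

# Recovering doses now requires fragile text parsing that depends on phrasing.
pattern = re.compile(r"furosemide\s+(\d+(?:\.\d+)?)\s*mg", re.IGNORECASE)
for note in notes:
    match = pattern.search(note)
    if match:
        print("Parsed furosemide dose from free text:", float(match.group(1)), "mg")
```

The structured query is a one-line filter over named fields, whereas the free-text approach depends on the exact wording of each note and will miss doses phrased differently, which is why structured capture is preferred when downstream analysis is anticipated.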
Often, the most flexible systems have a mix of both structured and unstructured data, with significant forethought given to the design of the structured fields to be captured and to how the unstructured data will be used in analysis. It is important to note that not all data generated in the hospital is captured for retrospective review and analysis. Just because a piece of data is generated does not mean that it is stored; oftentimes data is held only in temporary storage and purged later. A summary of the major data types and sources in an ICU setting is presented in Figure 5.3. The lexicon of medical informatics and data science broadly encompasses terms related to the various types of data, data analytics, and the components of machine learning models (Table 5.2).

Databases, data organization, and querying data

Structured data can be organized and recorded in databases to enable efficient storage and retrieval of information. While there are many types of databases, the two most common are relational databases and key-value storage systems (also called NoSQL databases or column stores). Relational databases allow a user to define relationships between sets of data so that data integrity and fidelity are rigorously and automatically enforced, while key-value databases provide ways of storing and retrieving large amounts of data without the complexity and overhead associated with those guarantees. For the purposes of this discussion, we will focus on relational databases, as these are the most common databases in the healthcare setting.

The structure of a relational database is called a schema, which is the organization, or blueprint, of the database. The schema describes the database as a set of one or more tables. A table is similar to a spreadsheet: each table has its own set of rows and columns. The columns of a table (also called fields) represent specific aspects of a set of data (i.e. "first name," "last name," and "date of birth"). Each field also has a type (such as integer, float, or text) so that the database knows how to interpret the data within it. The rows of the table (also called records) represent the specific data points that have been recorded (i.e. "John," "Doe," "1-1-1970"). Each table must contain a column that holds a unique identifier for each row, called the "primary key" field of the table; this is the equivalent of the row number in a spreadsheet. For most tables, this field is called the "record number" and is represented as a 64-bit integer, which allows for the unique identification of roughly 18 quintillion rows (more precisely, 2^64 − 1 records).

The complexity of databases becomes evident when one considers the hundreds or even thousands of tables that need to be managed, linked, and organized. For example, a database associated with an EMR system may have more than 20 000 unique tables, and each table may have hundreds of fields. How is a subset of data located and retrieved in the haystack of tables defined in the database schema? The answer is through relationships. Generally, all tables in a database schema are related to one another in some way, either directly or indirectly.
A relationship between two tables is created by embedding the primary key of one table in a field of another table. For example, imagine building a database to describe how beds are organized within a hospital ICU. Conceptually, hospital beds are organized into individual units, and each unit has one or more beds. The database schema can be represented using two tables: a "Units" table and a "Beds" table. Each table has its own primary key and other fields related to the specifics of the table; the "Beds" table might have a field called "bed name," while the "Units" table might have a field called "unit name." To create the relationship between units and beds, a field called "unit ID" can be added to the "Beds" table to hold the record number of the record in the "Units" table corresponding to the unit in which the bed is located. Therefore, to look up which unit a given bed is located in, the bed's record is retrieved from the "Beds" table, the value of its "unit ID" field is read, and the record in the "Units" table whose primary key matches that value is then retrieved. Executing this process manually for dozens or hundreds of relationships across a large number of tables would be very tedious, but computers can be programmed to execute such searches quickly and efficiently. Computer scientists have created a structured language for querying database tables in order to efficiently search and extract data from databases. This language is called SQL (structured query language). SQL is a standard language that can be used to query data across a wide variety of databases, such as MSSQL (Microsoft SQL Server), Oracle, and PostgreSQL, among others. There are many textbooks and online tutorials available for learning how to write SQL queries; a brief sketch of the unit-bed lookup appears below.

Data analysis and processing pipelines

Data processing pipelines are the processes by which heterogeneous data is collected and transformed into a standardized format for subsequent analysis (Figure 5.4). The data processing pipeline allows for the cleaning and integration of data; a minimal sketch of such a pipeline also appears at the end of this section. To ensure meaningful results, the process of creating knowledge from data is often iterative, requiring feedback from the data preprocessing and analytical processing steps. In addition, a strong understanding of the dataset, grounded in domain expertise, is imperative for meaningful inferences and insights.
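To make the "Units"/"Beds" lookup concrete, the sketch below builds the two tables and resolves the relationship with a SQL join, using Python's built-in sqlite3 module. The table names follow the example above, but the specific field names and sample rows are invented for illustration.

```python
import sqlite3

# An in-memory SQLite database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Each table has a primary key; "unit_id" in Beds references the Units table.
conn.execute("CREATE TABLE Units (unit_id INTEGER PRIMARY KEY, unit_name TEXT)")
conn.execute(
    "CREATE TABLE Beds (bed_id INTEGER PRIMARY KEY, bed_name TEXT, "
    "unit_id INTEGER REFERENCES Units(unit_id))"
)

conn.executemany("INSERT INTO Units VALUES (?, ?)", [(1, "CICU"), (2, "PICU")])
conn.executemany(
    "INSERT INTO Beds VALUES (?, ?, ?)",
    [(1, "Bed 01", 1), (2, "Bed 02", 1), (3, "Bed 03", 2)],
)

# The join resolves the relationship: which unit is "Bed 02" located in?
row = conn.execute(
    "SELECT Units.unit_name FROM Beds "
    "JOIN Units ON Beds.unit_id = Units.unit_id "
    "WHERE Beds.bed_name = ?",
    ("Bed 02",),
).fetchone()
print(row[0])  # -> CICU
```

The SELECT ... JOIN statement itself would run essentially unchanged against other relational databases such as PostgreSQL or Microsoft SQL Server, which is the portability that makes SQL a standard query language.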
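As a minimal sketch of the cleaning and integration steps of a data processing pipeline, the example below standardizes and merges two small, entirely synthetic sources of vital-sign data using the pandas library (assumed to be installed). The column names, units, and values are assumptions made for the illustration, not part of any real pipeline described in this chapter.

```python
import pandas as pd

# Source 1: a monitor export with terse column names and temperatures in Fahrenheit.
monitor = pd.DataFrame({
    "pt": [101, 102],
    "ts": ["2023-01-05 08:00", "2023-01-05 08:00"],
    "temp_f": [98.6, 100.4],
})

# Source 2: an EMR flowsheet with heart rates and different naming conventions.
flowsheet = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "charted_time": ["2023-01-05 08:00", "2023-01-05 08:00", "2023-01-05 08:00"],
    "heart_rate": [82, 145, None],
})

# Cleaning: standardize column names, types, and units (Fahrenheit -> Celsius).
monitor = monitor.rename(columns={"pt": "patient_id", "ts": "charted_time"})
monitor["charted_time"] = pd.to_datetime(monitor["charted_time"])
monitor["temp_c"] = (monitor["temp_f"] - 32) * 5 / 9
monitor = monitor.drop(columns=["temp_f"])

flowsheet["charted_time"] = pd.to_datetime(flowsheet["charted_time"])
flowsheet = flowsheet.dropna(subset=["heart_rate"])  # discard incomplete records

# Integration: merge the sources into one standardized table for analysis.
combined = monitor.merge(flowsheet, on=["patient_id", "charted_time"], how="inner")
print(combined)
```

In a real pipeline, these steps would be applied to data extracted from the EMR, monitors, or other sources and would typically be revisited iteratively as problems are discovered during preprocessing and analysis.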
Table 5.1 Units of data. Source: Gavin [19] and McKenna [20].

| Units | Binary system | Base-10 system | Example |
|---|---|---|---|
| 1 Bit (b) | Binary digit | Binary digit | '1' or '0' |
| 1 Byte (B) | 8 Bits | 8 Bits | 1 typed character (i.e. 'A') |
| 1 Kilobyte (kB) | 1024 Bytes | 1000 Bytes | 1 paragraph |
| 1 Megabyte (MB) | 1024 Kilobytes | 1000 Kilobytes | 400-page book |
| 1 Gigabyte (GB) | 1024 Megabytes | 1000 Megabytes | 30 minutes of video (1 DVD movie is 4-8 GB) |
| 1 Terabyte (TB) | 1024 Gigabytes | 1000 Gigabytes | 500 hours of movies |
| 1 Petabyte (PB) | 1024 Terabytes | 1000 Terabytes | 500 billion pages of typed text or 2000 years of MP3-encoded music |
| 1 Exabyte (EB) | 1024 Petabytes | 1000 Petabytes | 11 million 4k resolution videos |
Table 5.2 Definitions and terminologies in medical informatics and data science. Source: Sanchez-Pinto et al. [21]. Reproduced with permission of Elsevier.

Big data - Data generated in high volume, variety, and velocity, resulting in datasets that are too large for traditional data-processing systems.

Data science - A set of fundamental principles that support and guide the principled extraction of information and knowledge from data.

Data mining - Extraction of knowledge from data via machine learning algorithms that incorporate data science principles.

Domain expertise - The understanding of real-world problems in a given domain (e.g. critical care medicine) that helps frame and contextualize the application of data science to solve problems.

Machine learning - The field of study that focuses on how computers learn from data and on the development of algorithms to make learning possible.

Features - The data elements (independent variables) used to train a model. Features may be raw data, transformations of raw data, or complex transformations of data (such as those performed by neural networks).

Outcomes - The data elements (dependent variables) that represent the target for training in a supervised learning model. Outcomes may be categorical (e.g. yes/no) or continuous (e.g. length of hospital stay). Binary outcomes are typically represented with Boolean logic (e.g. true/false) or fuzzy logic (e.g. a range of probabilities).

Supervised learning - Algorithms used to uncover relationships between a set of features and one or more known outcomes.

Unsupervised learning - Algorithms used to uncover patterns or groupings in data without targeting a specific outcome.

Model training - The process by which machine learning algorithms develop a model of the data by learning the relationships between features. In supervised learning, the relationship between a set of features and one or more known outcomes is used for training. This is also referred to as model derivation or data fitting.

Model validation - The process of measuring how well a model fits new, independent data, for example, evaluating the performance of a supervised model at predicting an outcome in new data. This is also referred to as model testing.

Predictive model - A model trained to predict the likelihood of a condition, event, or response. The US Food and Drug Administration specifically considers predictive strategies as those geared toward identifying groups of patients more likely to respond to an intervention.

Prognostic model - A model trained to predict the likelihood of a condition-related endpoint or outcome, such as mortality. In general, the goal is to estimate a prognosis given a set of baseline features, regardless of what ultimately leads to the outcome.

Overfitting - A phenomenon in which an algorithm learns from idiosyncrasies ("noise") in the training data. Because the noise in the training dataset does not represent generalizable truth about the relationships between features and outcomes, overfitting typically leads to poor model performance in an independent validation dataset.

Digitization - The conversion of analog data (e.g. paper documents, printed images) into a digital format (e.g. bits).

Digitalization - The wide adoption of digital technologies by an organization to leverage digitized data with the goal of improving operations and performance. The adoption of electronic health records, picture archiving systems, and pharmacy management systems are examples of digitalization in healthcare.

Data curation - The process of integrating data from different sources into a structured dataset. It typically involves authenticating data to ensure quality and may involve annotating data to facilitate its use in analysis.

Structured data - Data (usually discrete or numeric) that are easy to search, summarize, sort, and quantify. Examples include vital signs and laboratory test results (e.g. complete blood count, complete metabolic panel).

Unstructured data - Data that do not conform to a prespecified structure and are usually more difficult to search, sort, and quantify. Examples include clinician notes, written narratives, pathology slides, radiology images, video, and audio.
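Several of the terms in Table 5.2 (features, outcomes, model training, validation, and overfitting) can be made concrete with a minimal supervised-learning sketch. The example below assumes the scikit-learn and NumPy libraries and uses a purely synthetic dataset; it illustrates the workflow, not a clinically meaningful model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Features (independent variables): two synthetic physiologic measurements.
X = rng.normal(size=(500, 2))
# Outcome (dependent variable): a binary label loosely related to the features.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Model training uses one portion of the data; validation uses held-out data.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Evaluate on the independent validation set; a large gap between training and
# validation performance would suggest overfitting.
train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"Training AUC: {train_auc:.2f}  Validation AUC: {val_auc:.2f}")
```

The split between training and held-out validation data is what allows model performance to be reported on new, independent data rather than on the data used for fitting.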