Annals of Clinical Epidemiology
Online ISSN: 2434-4338
SEMINAR
Introduction to supervised machine learning in clinical epidemiology
Sachiko Ono, Tadahiro Goto

2022 Volume 4 Issue 3 Pages 63-71

ABSTRACT

Machine learning refers to a series of processes in which a computer finds rules from a vast amount of data. With recent advances in computer technology and the availability of a wide variety of health data, machine learning has rapidly developed and been applied in medical research. Currently, there are three types of machine learning: supervised, unsupervised, and reinforcement learning. In medical research, supervised learning is commonly used for diagnosis and prognosis, unsupervised learning for phenotyping a disease, and reinforcement learning for maximizing favorable results, such as minimizing patients' total waiting time in the emergency department. The present article focuses on the concept and application of supervised learning, the most commonly used machine learning approach in medicine, and provides a brief explanation of four algorithms widely used for prediction (random forests, gradient-boosted decision tree, support vector machine, and neural network). Among these algorithms, the neural network has been further developed into deep learning algorithms to solve more complex tasks. Along with simple classification problems, deep learning is commonly used to process medical imaging, such as retinal fundus photographs for diabetic retinopathy diagnosis. Although machine learning can bring new insights into medicine by processing a vast amount of data that is often beyond human capacity, algorithms can also fail when domain knowledge is neglected. The combination of algorithms and human cognitive ability is a key to the successful application of machine learning in medicine.

1.  INTRODUCTION

The availability of various health care data, including electronic health records, registries, claims data, and digital imaging, has proliferated in the past few decades. These data are linked, integrated, and utilized for medical research [1]. Recent advances in computer technology and statistics have enabled the implementation of complex and computationally expensive algorithms on large-scale data. Combining these data with such technological advancements, researchers have attempted to develop algorithms, termed machine learning, that imitate and even surpass human cognitive abilities in complex tasks in the medical field.

Machine learning refers to an algorithm in which a computer recognizes patterns and relationships among variables based on given data. Each algorithm develops a model to output an answer to a specific problem. Researchers have tried to develop machine learning algorithms that substitute for medical experts, or that find unexpected rules beyond human comprehension. These attempts are prompted by the fact that advances in medical science have produced multiple treatment options and increasingly subdivided diagnoses for a given disease, resulting in an ever more complex decision-making process in practice and a shortage of experts in such subcategories. In this context, machine learning has been increasingly utilized in medical research, leveraging abundant medical data. The number of articles in PubMed tagged with "machine learning" [Mesh] has increased dramatically since the introduction of the MeSH term, as shown in Fig. 1.

Fig. 1 Search results of “Machine Learning” [Mesh] by year in PubMed

Compared with conventional regression models and prediction methods, machine learning can handle more variables and build complex models that consider interactions among variables and nonlinear relationships between variables and outcomes. Currently, there are three types of machine learning: supervised, unsupervised, and reinforcement learning. The present article focuses on supervised learning, a type of machine learning commonly used for prediction and diagnosis problems, introducing the concept, four commonly used algorithms, and their applications in medical research.

2.  CONCEPT OF MACHINE LEARNING

Machine learning refers to a series of processes in which a computer finds rules from a vast amount of data. In machine learning, the computer develops a model that represents what it has learned from the data (i.e., the relationships among variables), and applies the model to unknown data to make predictions and classifications. The development of machine learning has been driven by the development of databases in recent years. The digitization of various documents and automatic recording by electronic sensors have enabled the constant collection of vast amounts of data. Such data include medical claims, electronic health records, laboratory data, and medical images from various medical fields [1].

In a conventional model such as the logistic regression model, a human determines the variables (also referred to as predictors in a prediction model) and develops a model based on domain knowledge to predict outcomes. (To be precise, logistic regression analysis can also be categorized as machine learning because of its iterative process for maximum likelihood estimation, but the term "machine learning" is rarely used for logistic regression analysis in medical articles.) Developing a conventional model is far more difficult when there is an enormous number of variables; nonetheless, leveraging such data can bring new insights into a given topic. Machine learning replaces most of the model-building work with computer algorithms to process vast amounts of data beyond human capacity. Algorithms developed by machine learning methods are particularly powerful when the problem is complex: when the number of variables is huge, when the variables have complex interactions or effect modifiers, and when the associations between the outcome and the variables are nonlinear. Indeed, several studies have shown that machine learning outperforms existing predictive models for complex classification problems such as predicting prognosis and diagnosis. For example, Tokodi et al. [2] developed a machine learning-based risk stratification model to estimate 1- to 5-year mortality risk for patients undergoing cardiac resynchronization therapy, and the model had a much higher predictive ability than all the pre-existing scoring systems (Seattle Heart Failure Model, VALID-CRT, EAARN, ScREEN, and CRT-score).

2.1  Supervised Learning, Unsupervised Learning, and Reinforcement Learning

Conventionally, there are three types of machine learning: supervised, unsupervised, and reinforcement learning. Supervised learning is a type of machine learning in which machines learn from "labeled" training data and then predict the outcome, specifically diagnosis and prognosis in the medical field. "Labeled" means that the training data are tagged with the correct answer (i.e., the outcome). Unsupervised learning, on the other hand, classifies data with similar characteristics or patterns into groups based on unlabeled data. For example, Bleecker et al. [3] divided patients with asthma into six clinical phenotypes using an unsupervised learning method, mainly for future investigation of pathology and treatment response. In reinforcement learning, algorithms learn from trial and error (i.e., rewarding desirable results and punishing unwanted ones) to maximize favorable results in cases where there is no given right answer. Lee et al. [4], for example, successfully minimized patients' waiting time in the emergency department using reinforcement learning. Among these three machine learning methods, supervised learning is the most frequently utilized in medical research.

2.2  Supervised Learning Algorithms

Although there is a myriad of algorithms in supervised learning, the ones frequently used in medical papers are: (i) random forests, (ii) gradient-boosted decision tree (GBDT), (iii) support vector machines (SVM), and (iv) neural networks [5]. These algorithms can be implemented using statistical software, such as the caret package in R [6, 7] and the scikit-learn library in Python [8], which effectively help develop machine learning models.

Decision tree and random forest

To explain random forests, we will first explain decision trees. A decision tree is an algorithm that determines the final classification by creating branches at each step (node) based on given rules. Fig. 2 shows an example of a decision tree for deciding whether the rent of a property is 1,000 USD or more. Each rectangle represents a node, of which the first is called the "root node" and the last the "terminal node". When splitting a node, the cluttering (impurity) of the data in the node is expressed as a numerical value called entropy. A node is split so as to organize its content; that is, the reduction in entropy from the node before the split to the nodes after it (the information gain) is maximized [9, 10] (Fig. 3). Along with entropy, the Gini impurity is also commonly used to measure the impurity of a split condition [11].

Fig. 2 An example of a decision tree
Fig. 3 Entropy and information gain
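
As a concrete illustration of these quantities, the short Python sketch below computes the entropy of a node and the information gain of a candidate split. The labels are hypothetical and loosely follow the rent example in Fig. 2.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a vector of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child

# Toy labels: 1 = rent is 1,000 USD or more, 0 = rent is below 1,000 USD
parent = np.array([1, 1, 1, 0, 0, 0, 0, 1])
left = np.array([1, 1, 1, 1])    # e.g., properties inside the city center
right = np.array([0, 0, 0, 0])   # e.g., properties outside the city center
print(information_gain(parent, left, right))  # 1.0: a perfectly pure split
```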

As the tree grows downward, the data can be subdivided according to their characteristics. While complex data can be finely classified, the inherent noise of the data is also captured and classified as a feature. This may cause overfitting: a model that has learned the noise of one particular dataset will not apply well to another new dataset [12]. This is where random forests come in. Random forests are a type of ensemble learning (a general term for algorithms that combine multiple models) that seeks better predictive performance for new data [13]. Random forests create several decision trees and predict by majority vote to prevent overfitting due to data-specific noise (Fig. 4).

Fig. 4 Decision tree and random forests

In random forests, multiple decision trees are created in parallel using randomly selected data for each. Similarly, the variables used to split the decision tree are also chosen at random (Fig. 5). (The variables used in the splitting process are sometimes referred to as “features”.) The name “random forests” is derived from the fact that multiple decision trees, that is, “forests”, are created using “random” data and variables. Random forests can handle complex problems that cannot be represented by linear models at a relatively high speed.

Fig. 5 Creating random forests
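
The following is a minimal sketch of fitting a random forest with scikit-learn; the dataset is a public example and the hyperparameter values (number of trees, features per split) are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# n_estimators: number of trees grown in parallel (each on a bootstrap resample
# of the data); max_features: variables chosen at random for each split
model = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```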

Gradient-boosted decision tree (GBDT)

GBDT is another type of ensemble learning built from multiple decision trees. While random forests use multiple decision trees created in parallel, GBDT uses multiple decision trees created in sequence [14]. In this algorithm, the first decision tree produces an initial prediction. The second tree predicts the residuals, or errors, of the first tree. Then, each subsequent tree predicts the residuals of the preceding model, and this process is repeated until the residuals converge to zero or the number of iterations reaches a prespecified number (i.e., the number of decision trees). After the iterations are done, all trees are combined to make the final prediction (Fig. 6). GBDT often outperforms other algorithms in terms of accuracy; however, it may overfit when the number of trees is too large [14].

Fig. 6 Creating gradient-boosted decision tree
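
The residual-fitting loop described above can be sketched directly with shallow regression trees. This is a toy illustration under squared error with made-up data and an assumed learning rate; real analyses would typically use a ready-made implementation such as scikit-learn's GradientBoostingRegressor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# Each shallow tree fits the residuals (y minus the current prediction),
# and its contribution is scaled down by a learning rate.
learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # initial prediction
trees = []
for _ in range(100):                     # prespecified number of trees
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(np.mean((y - prediction) ** 2))    # training error shrinks as trees are added
```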

Support vector machine (SVM)

SVM classifies data according to a boundary created by an algorithm based on the values of given variables [15–18]. The model developed by the SVM provides a prediction as to which side of the boundary new data will fall on. A clinical application is, for example, a model that predicts death or survival within 30 days based on multiple laboratory test results in a certain disease. Fig. 7 illustrates the use of SVM to draw a boundary line that classifies ● and × based on the values of variables X1 and X2 (e.g., laboratory test results). When creating the boundary, the distance from the boundary to the nearest data points on each side (the margin) is maximized.

Fig. 7 Creating a boundary line using a support vector machine

● and × represent data points of two different groups plotted according to the values of variables X1 and X2 (e.g., laboratory test results). The line represents the boundary that maximizes the distance (margin) to the nearest data points.

When data points cannot be separated by a linear boundary, as shown in Fig. 8, a method called the kernel trick can be used to map the data to a higher-dimensional space called the feature space [15, 17, 19] (Fig. 9). We can then separate the data points linearly. When the term SVM is used in the medical literature, it usually refers to SVM with the kernel trick (kernel SVM). Kernel SVM can deal with numerous variables, and it is easy to obtain good results even with small datasets. On the other hand, it can be computationally expensive to process a large amount of data, as the kernel trick generally increases the dimensionality [15, 17].

Fig. 8 Cases where support vector machine cannot create a linear boundary

● and × represent data points of two different groups plotted according to the values of variables X1 and X2. A straight line cannot separate these data points.

Fig. 9 Boundary creation by kernel support vector machine

● and × represent data points of two different groups plotted according to the values of variables X1 and X2. The data are mapped into the feature space for linear separation.
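
As a brief sketch of the situation in Figs. 8 and 9, the following compares a linear SVM with a kernel SVM on synthetic concentric data; the dataset and kernel choice (radial basis function) are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two classes that cannot be separated by a straight line (cf. Fig. 8)
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)   # kernel trick: implicit mapping to a feature space

print(linear.score(X, y))  # poor: no linear boundary exists in the original space
print(rbf.score(X, y))     # near perfect after the implicit mapping (cf. Fig. 9)
```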

Perceptron and neural network

The perceptron is an algorithm that attempts to reproduce human-like cognitive abilities by imitating human neurons [20–22]. As shown in Fig. 10, a perceptron multiplies each input by a weight wi, passes the weighted inputs to the next stage (node or neuron), and feeds the resulting value into a function called an activation function, which outputs a predictive value when the sum of the given values exceeds a certain threshold. The weights are updated through the learning process: the output value is compared with the actual outcome (i.e., the right answer), and the weights are adjusted until the error becomes minimal.

Fig. 10 Perceptron

Xn represents variables (predictors) and Wi represents weights.

A neural network is a multi-layered combination of perceptrons, namely, an input layer, multiple hidden layers, and an output layer [21, 22] (Fig. 11). The number of hidden layers is a hyperparameter, a value that must be prespecified by the researcher before running the algorithm. By combining multiple layers, we can model more complex relationships than with a simple perceptron. In neural networks, the most commonly used weight-updating method is backpropagation, in which the total error obtained from the current output is passed backwards and distributed to the preceding nodes in the hidden layers and then to those in the input layer. The weights are adjusted by repeating this process to minimize the total error.

Fig. 11 Neural network
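
A multi-layered network like the one above can be fitted, as a minimal sketch, with scikit-learn's MLPClassifier; the dataset, layer sizes, and iteration limit are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Neural networks are sensitive to variable scales, so standardize first
scaler = StandardScaler().fit(X_train)

# hidden_layer_sizes is a hyperparameter: here, two hidden layers of 32 and 16 nodes
mlp = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)   # weights updated via backpropagation
print(mlp.score(scaler.transform(X_test), y_test))
```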

Deep learning typically refers to an advanced type of neural network with multiple layers organized in a deeply nested network architecture [23–25]. By using advanced operations, such as convolution for digital images, and multiple activation functions in one node, deep learning achieves much better performance than a simple neural network. Deep learning is widely used in almost all fields that embrace machine learning technology (e.g., medical imaging, natural language processing, speech and audio processing, and drug discovery). One of the deep learning methods that has been particularly successful in medicine is the convolutional neural network for processing medical imaging [24]. A convolutional neural network makes a prediction based on input arrays of pixel intensities in the three color channels. Diabetic retinopathy, for example, was identified with a sensitivity of 90.3% and a specificity of 98.1% using a convolutional neural network model developed from retinal fundus photographs [26]. Other medical fields in which image classification has demonstrated promising performance include dermatology [27, 28], radiology [29, 30], pathology [31, 32], and cardiology [33–35].
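
To make the idea of convolution concrete, the following toy sketch defines a small convolutional network in PyTorch (assuming that library is available); the architecture, image size, and two-class output are arbitrary assumptions, far smaller than the models actually used for, e.g., retinal fundus photographs.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A toy convolutional network for 3-channel images of size 64 x 64."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 16 * 16, 2)  # two classes, e.g., disease vs. no disease

    def forward(self, x):              # x: (batch, 3, 64, 64)
        x = self.features(x)           # convolutions extract local image patterns
        return self.classifier(x.flatten(1))

model = TinyCNN()
dummy = torch.randn(4, 3, 64, 64)      # random stand-in for real images
print(model(dummy).shape)              # torch.Size([4, 2])
```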

2.3  Developing a Prediction Model Using Machine Learning

Although the algorithms of machine learning are much more complex than conventional methods, the sequence of steps for creating a prediction model is similar. The steps are to: (i) determine the research question, (ii) obtain data, (iii) preprocess the data and split them into training data and test data, (iv) apply the algorithm to the training data to develop a model, and (v) evaluate the performance of the model on the test data. The steps from step three onward are described in another article in this journal [36]. Data preprocessing in step three includes imputation of missing values, creation of dummy variables, and normalization/standardization of the data. Applying the algorithm in step four is the main part of machine learning, and an appropriate algorithm should be selected from the various machine learning methods for each problem setting. To ensure accuracy of the prediction, one strategy is to try multiple machine learning methods and select the one with the best performance, or to combine multiple methods to utilize their complementary advantages.
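
As a minimal sketch of steps (iii) and (iv), the following assumes a small hypothetical clinical dataset and uses scikit-learn; note that the imputation, dummy-variable creation, and standardization are learned from the training data only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical clinical dataset with missing values and a categorical variable
df = pd.DataFrame({
    "age": [72, 65, None, 58, 80, 49, 61, None],
    "sbp": [138, None, 121, 155, 160, 118, None, 142],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "M"],
    "died_30d": [0, 0, 1, 1, 1, 0, 0, 1],
})
X, y = df.drop(columns="died_30d"), df["died_30d"]

# Step (iii): split the data, then impute, encode, and standardize inside a
# pipeline so the preprocessing is fitted on the training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "sbp"]),
    ("cat", OneHotEncoder(), ["sex"]),
])

# Step (iv): apply an algorithm (here, logistic regression) to the training data
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)
```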

Before applying the algorithm to the data, researchers have to specify hyperparameters. Hyperparameters are values that are external to the model and cannot be learned from the data. Examples of hyperparameters are the number of decision trees to construct or the number of features to select in a random forest. These hyperparameters affect the accuracy, complexity, and efficiency of the model. Because model performance varies greatly depending on the values of the hyperparameters, tuning is required to find good values. A method called grid search is often used for hyperparameter tuning, changing the values of the hyperparameters little by little, either manually or automatically. When evaluating hyperparameters, cross-validation is commonly used to improve the fit to unknown data by using "validation" data split off from the training data (Fig. 12). The validation data here are for hyperparameter tuning; they differ from the data used for internal validation of the prediction model described in another article in this journal [36]. In the machine learning context, the dataset used for internal validation is often called test data, whereas it is called validation data in an epidemiological context.

Fig. 12 Cross-validation for hyperparameter tuning
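
A sketch of grid search with cross-validation, assuming scikit-learn's GridSearchCV: the candidate hyperparameter values below are arbitrary examples. Each combination is scored by 5-fold cross-validation on the training data (cf. Fig. 12), and the held-out test data are used only once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Every combination of candidate hyperparameter values is evaluated by
# 5-fold cross-validation on the training data
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_features": ["sqrt", 0.5]},
    cv=5, scoring="roc_auc",
)
grid.fit(X_train, y_train)
print(grid.best_params_)

# The test data are kept aside and used only once, for the final evaluation
print(grid.score(X_test, y_test))
```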

In step five, the model developed in step four is applied to the test data to evaluate its performance. As described in the article about the clinical prediction model [36], accuracy, sensitivity, specificity, and the area under the curve are commonly used performance measures for classification problems. For regression problems that predict a numerical value, the root-mean-square error (RMSE) is used to evaluate the deviation of the predicted values from the observed (correct) values of the given dataset. If the prediction performance is unacceptably low, return to step two and reconsider each step.
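
These measures can be computed as in the following sketch, using scikit-learn and NumPy with made-up toy values for the observed and predicted outcomes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, mean_squared_error, roc_auc_score

# Classification: predicted probabilities vs. observed outcomes on test data
y_true = np.array([0, 0, 1, 1, 1, 0])
y_prob = np.array([0.2, 0.4, 0.8, 0.9, 0.3, 0.1])
y_pred = (y_prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("sensitivity:", tp / (tp + fn))
print("specificity:", tn / (tn + fp))
print("AUC:", roc_auc_score(y_true, y_prob))

# Regression: RMSE, the square root of the mean squared deviation
obs = np.array([3.1, 4.5, 2.2])
pred = np.array([2.9, 4.9, 2.0])
print("RMSE:", np.sqrt(mean_squared_error(obs, pred)))
```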

3.  SUPERVISED MACHINE LEARNING APPLICATIONS

3.1  <Example 1> Triage Systems Developed by Multiple Machine Learning Methods

Where medical resources are limited, differentiation and prioritization of critically ill children are important. Goto et al. [37] examined how well an objective triage system developed by machine learning could predict clinical outcomes of children presenting to the emergency department (ED) compared with a conventional triage method based on a medical professional's assessment. The authors predicted in-hospital death and/or ICU admission using the least absolute shrinkage and selection operator (lasso), random forest, GBDT, and deep neural network. Variables used for prediction were age, sex, mode of arrival (walk-in vs ambulance), vital signs (temperature, pulse rate, systolic and diastolic blood pressure, respiratory rate, and oxygen saturation), visit reasons, patient's residence (home vs other [e.g., long-term care facility]), ED visits in the preceding 72 hours, and comorbidities. All the machine learning-based triage systems performed better than the conventional triage system, with fewer undertriaged children, although the differences were not statistically significant. The authors concluded that machine learning-based triage systems may support clinicians in making triage decisions efficiently, thus improving optimal resource allocation.

3.2  <Example 2> Prognostic Prediction for COVID-19 Using a Combination of Machine Learning Methods

Health care systems worldwide have been overwhelmed by the soaring number of COVID-19 patients. For early intervention and optimal resource allocation, an accurate prediction model for COVID-19 is needed. In contrast to Goto et al. [37] in Example 1, who evaluated multiple models separately, Gao et al. [38] integrated four different machine learning models into one ensemble model to predict the mortality risk of patients admitted for COVID-19. To develop the ensemble model, the authors first selected 14 important variables (consciousness, male sex, age, sputum, blood urea nitrogen, respiratory rate, D-dimer, number of comorbidities, platelet count, fever, albumin, SpO2, lymphocyte count, and chronic kidney disease) out of the original 53 variables using lasso, another machine learning method often used to eliminate unimportant variables. With the 14 variables, the authors developed 6 machine learning prediction models (logistic regression, SVM, GBDT, neural network, k-nearest neighbor, and random forests), then integrated the top 4 predictive models (logistic regression, SVM, GBDT, and neural network) into one. The ensemble model achieved areas under the curve of 0.96 and 0.92 for predicting COVID-19 mortality in two external cohorts. The authors concluded that the model enables efficient and accurate risk stratification of COVID-19 patients on admission.

3.3  <Example 3> Classification of HIV Rapid Test Using Deep Learning

A rapid diagnostic test is a convenient and affordable option to screen for HIV in low- and middle-income countries. However, tests with weak or faint lines lead to inconsistent visual interpretation among field workers with different training levels; the accuracy of interpretation has varied between 80% and 97%. Turbé et al. [39] developed a machine learning model to determine whether the results were positive or negative from photos of rapid diagnostic tests. A total of 11,374 images taken with tablets were labeled by three rapid diagnostic test experts and then used as a training dataset. The authors developed 4 models using the dataset: one SVM and three different convolutional neural network models. The convolutional neural network model with the best sensitivity and specificity was selected. As a pilot test, the performance of the model was compared with that of 5 end-users with varying levels of training (2 nurses and 3 newly trained community health workers). In the visual interpretation of rapid diagnostic testing, the end-users' agreement levels ranged from 61% to 100%. The machine learning model demonstrated better performance than the end-users on the following 4 indicators (end-users vs. model): sensitivity (95.6% vs. 97.8%), specificity (89% vs. 100%), positive predictive value (88.7% vs. 100%), and negative predictive value (95.7% vs. 98%). The authors concluded that rapid diagnostic test images captured by a mobile device could standardize the interpretation of test results, reduce interpretation errors, and provide a platform for workforce training.

4.  CHALLENGES IN MACHINE LEARNING

Machine learning is not a one-size-fits-all solution. Although the term "machine learning" gives the impression that everything is done automatically, it comes with challenges. As with conventional methods, machine learning requires a good research question, sufficient sample size and variables, appropriate data sampling, and algorithm selection suited to each problem setting. When these processes are done heuristically by experts with domain knowledge, the model can achieve good performance. Automated machine learning without a human in the loop, especially in the medical field, risks producing model artifacts because medical data often contain uncertainty, noise, and missing values [40].

An example is the failure of Google's influenza forecasting algorithm, Google Flu Trends. In 2008, Google developed an algorithm to quickly detect influenza trends from a combination of Google search term data and actual survey data [41]. For the first few years, it predicted the number of influenza cases, or influenza trends, surprisingly well, two weeks earlier and more accurately than the Centers for Disease Control and Prevention [42, 43]. Earlier prediction means being able to take action sooner, and it may even help prevent future influenza pandemics. The introduction of this new technology raised expectations for improving public health. However, several years later, the number of influenza cases predicted by Google's forecasting algorithm deviated significantly from the actual number of cases [44, 45].

Although Google did not provide a clear explanation for the suboptimal estimates, researchers speculated that the algorithm might have overlearned irrelevant search terms. Later, several studies analyzed how Google Flu Trends could have avoided erroneous forecasting, for example by manually adding another data source or by updating the algorithm constantly [44, 45]. As this example shows, machine learning sometimes produces unintended and erroneous results. If public health policy were driven by such erroneous algorithms, people's lives would be at stake. Regular human checks on these algorithms are therefore essential.

The application of machine learning to medicine raises another concern: algorithmic predictions could come to define the "right" answer in the real world. For example, some physicians might blindly follow the prediction and admit patients who are classified as "inpatients" by the algorithm. While this behavior would further improve the apparent accuracy of the algorithm, it would obscure the true performance of the predictive algorithm. Furthermore, the physician, who is supposed to be the "teacher" in "supervised learning", would lose credibility and authority if completely dependent on the "machine". These concerns have been expressed by several experts [40, 46], and there is ongoing discussion on how to incorporate machine learning into clinical practice.

5.  CONCLUSION

Machine learning has developed rapidly in the last decade with the improvement of computer performance and the advancement of statistics. This approach has massive potential for new insights into medicine given large numbers of variables, complex interactions, and nonlinear relationships between variables and outcomes. However, machine learning application in the medical field has only just begun. Owing to the complexity of medical domains, machine learning cannot fully substitute for human ability, at least for now. The combination of algorithms and human cognitive ability may be a key to the successful application of machine learning in medicine.

CONFLICTS OF INTEREST

The authors declare no conflicts of interest in relation to the work presented in the manuscript.

ACKNOWLEDGMENT

We would like to thank Masao Iwagami of the Department of Health Services Research, University of Tsukuba, for his critical reading of the manuscript and valuable feedback.

REFERENCES
 
© 2022 Society for Clinical Epidemiology

This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license.
https://creativecommons.org/licenses/by-nc-nd/4.0/