NEW APPROACHES TO MISSING BIOMEDICAL DATA RECOVERY FOR MACHINE LEARNING

Missing data is a common problem in medical data sets, especially large ones. The issue matters because it can influence the analysis and further use of the data, e.g., for machine learning. There are various methods for recovering missing data. One is to remove observations with missing values, but this is of limited use when little data is available. Another commonly used approach is Last Observation Carried Forward (LOCF). Most such methods are not universal, however, and may need to be adapted to the data set at hand. This article addresses the problem for multimodal time series of biomedical data from patients with sepsis. It describes and compares three approaches tailored to a sepsis data set; the recovered data are then used to build a sepsis prediction system based on clinical data routinely recorded in an intensive care unit.


Introduction
In research, missing data are frequently inevitable, but their effect on the findings is rarely explored. According to recent systematic reviews [1,2], many healthcare data sets are incomplete and require cleaning and imputation of missing values to improve the effectiveness and accuracy of data analysis. Along with an overview of the field, these articles review machine learning methods for imputing missing data.
The nature of the "missingness" in this context (random vs. nonrandom) is important to note. However, from the observed data alone it is sometimes impossible to distinguish data missing at random from data missing not at random. Because the data set used in this study does not document why values are missing, the missing data are assumed to be missing at random (e.g., the patient was undergoing a medical procedure that required the sensors used for data collection to be removed, human error, equipment malfunction, etc.). Although data recovery may be feasible regardless of the kind of missingness, the bias introduced by the recovery technique is thought to be smaller when data are missing at random [3].
When missing values occur in data, there are several ways to handle them. Most of these techniques are suited to static data, and fewer tools are useful for sequential data such as time series. Complete case analysis (CCA) [4] is the simplest method: observations with missing values are deleted. In many situations, especially when the amount of data is limited, this technique is not practical. The problem is particularly acute when the data are used for machine learning on imbalanced sets, where every observation in the minority class matters.
Data recovery techniques may broadly be classified into:
1. Single imputation: a missing data point is replaced with a single value, often using the Last Observation Carried Forward (LOCF), Baseline Observation Carried Forward (BOCF), or Next Observation Carried Backward (NOCB) procedures, or a value derived from other data (e.g., mean value imputation, regression imputation, etc.).
2. Multiple imputation: several plausible imputed data sets are created and the findings from each are properly combined. Several statistical software packages are available for this (e.g., Amelia, FURIA, MICE in the R programming language, etc.).
The Guideline on Missing Data in Confirmatory Clinical Trials [5] states that "if missing values are handled by simply excluding any patients with missing values from the analysis, this will result in a reduction in the number of cases available for analysis and consequently typically result in a reduction of the statistical power." Clearly, the likelihood of a loss of power increases with the amount of missing data. Thus, it is imperative to make every effort to reduce the quantity of missing data. Unfortunately, no single systematic technique for handling missing data works in every circumstance. As a result, when planning a trial, it is crucial to consider how to reduce the quantity of missing data and how missing data will be treated in the analysis.
Three data recovery techniques are presented in the current work, each of which involves a number of steps and components: (a) conventional techniques (such as LOCF and NOCB); (b) a less popular Kalman filtering-based approach; and (c) regression imputation, which substitutes predictions from a regression of the missing variables on the observed variables. One of the principles underlying the last technique is the view of the human body as a complex system in which the many characteristics that define its functioning (in health or sickness) are interrelated, so that missing data may be estimated from existing data using the estimated correlations. The research contribution of this paper lies in presenting several approaches to missing value recovery in relation to the intended future use of the recovered data.
As part of a larger research project aimed at developing a machine learning-based software application for sepsis prediction, the data recovered by the different methods are ultimately used to build a machine learning system.

Data
The "Early Prediction of Sepsis from Clinical Data: the PhysioNet/Computing in Cardiology Challenge 2019" [6] public database provided the information utilized in this study. The public portion of the data was sourced from two different US hospitals: Emory University Hospital (set A) and Beth Israel Deaconess Medical Center (set B). With the necessary Institutional Review Boards' consent, these data were gathered over the previous ten years, de-identified, and classified using Sepsis-3 clinical criteria. They include 8 vital sign variables, 26 laboratory variables, and 6 demographic factors. They are made up of hourly summaries of vital signs, lab results, and static patient descriptions for 40,336 patients. Each patient's characteristics were distilled into hourly bins (e.g., multiple heart rate measurements in an hourly time window were summarized as the median heart rate measurement).
More than 80% of the values in set B are missing. Set A was therefore selected for further study because it has fewer missing data (79.4%) and a higher prevalence of sepsis (8.80% vs. 5.71% in set B). There are 1790 septic patients among the 20336 patients in this set, and 502 of the corresponding subsets have all values missing for at least one of the 6 physiological parameters of interest. Following the application of the initial selection criteria (such as the presence of at least 7 hourly observations prior to the diagnosis of sepsis, the lack of artifacts, etc.), there are 211 subsets that have missing data but may be recoverable. Table 1 shows an original sepsis file with one parameter (DBP) having all values missing (NA) and other physiological parameters having some values missing. It contains observations on the seven parameters of interest chosen for further study (heart rate, peripheral blood oxygen saturation, temperature, systolic blood pressure, diastolic blood pressure, respiratory rate, and the patient's age) together with the label of each observation.
Note. HR: heart rate; O2Sat: peripheral blood oxygen saturation; Temp: temperature; SBP: systolic blood pressure; DBP: diastolic blood pressure; Resp: respiratory rate; Sepsis label: the label of the observation (0 for non-sepsis observations and 1 for sepsis); NA: not available (missing value).
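The two quantities discussed above (the overall share of missing values and the columns that are missing entirely) can be computed per file along the following lines. This is a minimal Python sketch for illustration only (the study itself used R); the function names and the toy values are hypothetical and are not taken from the challenge data.

```python
from typing import Dict, List, Optional

Table = Dict[str, List[Optional[float]]]

def missing_fraction(table: Table) -> float:
    """Share of cells in the file that are missing (None stands in for NA)."""
    cells = [v for column in table.values() for v in column]
    return sum(v is None for v in cells) / len(cells)

def all_missing_columns(table: Table) -> List[str]:
    """Names of parameters for which every hourly value is missing."""
    return [name for name, column in table.items()
            if all(v is None for v in column)]

# Toy file with four hourly rows (made-up values).
patient = {
    "HR":   [88.0, None, 91.0, 90.0],
    "Temp": [None, 37.2, None, None],
    "DBP":  [None, None, None, None],
}
print(missing_fraction(patient))     # 8 of the 12 cells are missing
print(all_missing_columns(patient))  # ['DBP']
```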

Missing Values Recovery Methods
Data recovery in this research is based on several simple, common algorithms and their combinations, as well as on less common techniques tailored to the available data. The first method is commonly used for similar data, while the other two were designed specifically for the data at hand.

LOCF and NOCB
The last observation carried forward (LOCF) [7] is a missing value recovery method that uses the last measured value (per column) to fill in the missing value(s) that follow it. The next observation carried backward (NOCB) [8] is a "reversed version" of LOCF, in which the missing values are filled in backward from the next measured value. These approaches can be applied, for example, to the columns "HR", "O2Sat", "Temp", "SBP" and "Resp" in Table 1. They cannot be used for columns in which all values are missing (e.g., the "DBP" column in Table 1).
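The two procedures can be sketched in a few lines. The Python below is illustrative only (the study used R packages); `None` stands in for NA, and the function names and sample values are hypothetical.

```python
from typing import List, Optional

Series = List[Optional[float]]

def locf(values: Series) -> Series:
    """Last Observation Carried Forward: fill each gap with the last seen value."""
    filled: Series = []
    last: Optional[float] = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until the first observed value
    return filled

def nocb(values: Series) -> Series:
    """Next Observation Carried Backward: LOCF applied to the reversed series."""
    return locf(values[::-1])[::-1]

def locf_then_nocb(values: Series) -> Series:
    """LOCF first, then NOCB to fill a leading gap that LOCF cannot reach."""
    return nocb(locf(values))

hr = [None, 80.0, None, None, 84.0, None]
print(locf(hr))            # [None, 80.0, 80.0, 80.0, 84.0, 84.0]
print(nocb(hr))            # [80.0, 80.0, 84.0, 84.0, 84.0, None]
print(locf_then_nocb(hr))  # [80.0, 80.0, 80.0, 80.0, 84.0, 84.0]
print(locf_then_nocb([None, None]))  # [None, None]
```

The last call shows why neither procedure helps with an all-missing column: there is no observed value to carry in either direction.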

A custom technique
This approach was designed to take care of both cases: for columns with some values missing as well as for columns with all missing values.
When each column contains at least one value, the method comprises the following steps: (a) each column is checked for missing data (NA); (b) missing values at the beginning and end of a column are "recovered" by copying the value in the closest non-missing cell of the same column; (c) the remaining gaps are interpolated, so that the values between two present values follow the trend between them (increase or decrease). For interpolation, the "na.approx()" function of the "zoo" package in R [9] is used (see, for instance, the "Temp" column in Table 1).
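Steps (a) to (c) can be illustrated as follows. This is a Python sketch, not the authors' R code: it assumes evenly spaced hourly samples and plain linear interpolation between neighbouring observed values, and the function name and temperatures are hypothetical.

```python
from typing import List, Optional

def interpolate_column(values: List[Optional[float]]) -> List[float]:
    """Fill edge gaps with the nearest observed value, then interpolate linearly."""
    known = [i for i, v in enumerate(values) if v is not None]  # step (a)
    if not known:
        raise ValueError("column has no observed values; use another method")
    out = list(values)
    first, last = known[0], known[-1]
    # Step (b): leading and trailing gaps copy the closest observed value.
    for i in range(first):
        out[i] = values[first]
    for i in range(last + 1, len(values)):
        out[i] = values[last]
    # Step (c): interior gaps follow the straight line between their neighbours.
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = values[left] + step * (i - left)
    return out

temp = [None, 36.5, None, None, 38.0, None]
print(interpolate_column(temp))  # [36.5, 36.5, 37.0, 37.5, 38.0, 38.0]
```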
When all values in a column are missing, the recovery procedure is as follows: (a) the present values are extracted from the complete cases set for each parameter/column separately, by class (e.g., septic patients from set A); (b) the number of non-missing values (n) is determined, and their mean (mean) and standard deviation (sd) are calculated; (c) using the "rnorm()" function in R with "n", "mean" and "sd" as arguments, a Gaussian distribution is generated for each of the 6 parameters; (d) the missing values in the original data are replaced with values drawn from the generated distributions.
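A rough Python analogue of this procedure is shown below (the study used R's rnorm()). How values were picked from the generated distribution is not specified in the text, so uniform random draws are an assumption here; the function names, seeds, and reference values are hypothetical.

```python
import random
import statistics
from typing import List

def gaussian_pool(observed: List[float], seed: int = 0) -> List[float]:
    """Generate n Gaussian values from the mean and sd of the observed values."""
    n = len(observed)
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)  # sample standard deviation (n - 1 denominator)
    rng = random.Random(seed)
    return [rng.gauss(mean, sd) for _ in range(n)]

def fill_all_missing(n_missing: int, pool: List[float], seed: int = 1) -> List[float]:
    """Replace an all-missing column by random draws from the pool (assumed scheme)."""
    rng = random.Random(seed)
    return [rng.choice(pool) for _ in range(n_missing)]

# Made-up DBP values standing in for the complete cases of one class.
reference_dbp = [60.0, 62.0, 58.0, 65.0, 61.0]
pool = gaussian_pool(reference_dbp)
recovered_dbp = fill_all_missing(8, pool)
print(len(recovered_dbp))  # 8
```

Because the draws ignore the time order of the observations, the recovered column reproduces the distribution of the reference values but not any temporal trend.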

A technique with the use of Kalman filtering and machine learning algorithms
This approach uses Kalman filtering for columns with at least three non-missing values and Generalized Linear Models for data recovery in case of all missing values in a column.
The Kalman filter, which first appeared in [10], is a discrete filtering model in which both the dynamic and the measurement models are linear Gaussian. The following outlines the basics of this approach, with elements of optimal control and dynamic linear models, as in [11]. The Kalman filtering technique estimates the states of linear dynamical systems in state-space form and provides estimates of unknown variables given the observations over time. According to this model, the state evolves from time $k-1$ to time $k$ as follows:

$$x_k = F x_{k-1} + B u_{k-1} + w_{k-1} \quad (1)$$

where $F$ is the state transition matrix applied to the previous state vector $x_{k-1}$, $B$ is the control-input matrix applied to the control vector $u_{k-1}$, and $w_{k-1}$ is the process noise vector that is assumed to be zero-mean Gaussian with the covariance $Q$, i.e., $w_{k-1} \sim \mathcal{N}(0, Q)$.
The process model is paired with the measurement model, which describes the link between the state and the measurement at the current time step $k$:

$$z_k = H x_k + v_k \quad (2)$$

where $z_k$ is the measurement vector, $H$ is the measurement matrix, and $v_k$ is the measurement noise vector that is assumed to be zero-mean Gaussian with the covariance $R$, i.e., $v_k \sim \mathcal{N}(0, R)$.
Given the initial estimate of $x_0$, the sequence of measurements $z_1, z_2, \dots, z_k$, and the details of the system defined by $F$, $B$, $H$, $Q$, and $R$, the Kalman filter's job is to offer an estimate of $x_k$ at time $k$. Typically, $Q$ and $R$ are tuning parameters that the user may change to get the desired performance.

Prediction (step 1):
Predicting state estimate
$$\hat{x}_k^{-} = F \hat{x}_{k-1}^{+} + B u_{k-1} \quad (3)$$
Predicting error covariance
$$P_k^{-} = F P_{k-1}^{+} F^{T} + Q \quad (4)$$

Update (step 2):
Estimating measurement residual
$$\tilde{y}_k = z_k - H \hat{x}_k^{-} \quad (5)$$
Computing Kalman gain
$$K_k = P_k^{-} H^{T} \left( R + H P_k^{-} H^{T} \right)^{-1} \quad (6)$$
Updating state estimate
$$\hat{x}_k^{+} = \hat{x}_k^{-} + K_k \tilde{y}_k \quad (7)$$
Updating error covariance
$$P_k^{+} = \left( I - K_k H \right) P_k^{-} \quad (8)$$

where the "hat" operator denotes an estimate of a variable. Thus, $\hat{x}$ is an estimate of $x$. The superscripts "-" and "+" denote predicted (prior) and updated (posterior) estimates, respectively. The software implementation used for data recovery in this research is the "imputeTS" package in R [12], which represents an extended version of the algorithm described above (i.e., Eq. 1-8) with Kalman smoothing on structural time series models.
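To make the prediction and update steps concrete, the sketch below runs a scalar Kalman filter over a series with gaps: when a measurement is missing, only the prediction step is applied and the predicted state fills the gap. This is an illustrative Python simplification, not the imputeTS implementation (which additionally uses Kalman smoothing on structural time series models). The random-walk model (F = 1, H = 1, no control input), the noise variances q and r, and the initialisation are all assumptions.

```python
from typing import List, Optional

def kalman_fill(measurements: List[Optional[float]],
                q: float = 0.1, r: float = 1.0) -> List[float]:
    """Fill gaps with a scalar Kalman filter (random-walk state, assumed q and r)."""
    observed = [z for z in measurements if z is not None]
    if not observed:
        raise ValueError("need at least one observed value")
    f, h = 1.0, 1.0          # state transition and measurement "matrices"
    x_post = observed[0]     # initial state estimate (assumption)
    p_post = r               # initial error covariance (assumption)
    filled: List[float] = []
    for z in measurements:
        # Prediction step: propagate the state and its error covariance.
        x_prior = f * x_post
        p_prior = f * p_post * f + q
        if z is None:
            # No measurement: the prior becomes the posterior and fills the gap.
            x_post, p_post = x_prior, p_prior
            filled.append(x_post)
        else:
            # Update step: residual, Kalman gain, posterior state and covariance.
            residual = z - h * x_prior
            gain = p_prior * h / (r + h * p_prior * h)
            x_post = x_prior + gain * residual
            p_post = (1.0 - gain * h) * p_prior
            filled.append(z)  # observed values are kept unchanged
    return filled

sbp = [120.0, 124.0, None, None, 118.0]
print(kalman_fill(sbp))
```

A pure forward filter like this only uses past observations to fill a gap; a smoother also uses the values that follow the gap, which is one reason the smoothed variant is preferable for offline imputation.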
The algorithm used for data recovery in columns with all values missing is a Generalized Linear Model (GLM) [13] available on the H2O platform [14].
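The Gaussian family of the GLM models the dependence between the response variable $y$ and the covariates vector $x$ as a linear function:

$$y = x^{T}\beta + \beta_0 + \epsilon$$

where $\beta$ is the parameter vector, $\beta_0$ denotes the intercept, and $\epsilon$ is a Gaussian random variable representing the noise in the model, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The model is estimated by maximizing the log-likelihood over the parameter vector for the observed data. The GLM [13] employed in this study fits the model by solving the following likelihood optimization with parameter regularization:

$$\max_{\beta, \beta_0} \; \sum_{i=1}^{N} \log f(y_i; \beta, \beta_0) - \lambda \left( \alpha \lVert \beta \rVert_1 + \tfrac{1}{2}(1 - \alpha) \lVert \beta \rVert_2^2 \right)$$

The regularization penalty is the weighted sum of the ℓ1 (lasso, least absolute shrinkage and selection operator) and ℓ2 (ridge regression) norms of the coefficients vector:

$$\lambda P_{\alpha}(\beta) = \lambda \left( \alpha \lVert \beta \rVert_1 + \tfrac{1}{2}(1 - \alpha) \lVert \beta \rVert_2^2 \right)$$

where $\alpha \in [0, 1]$ is the elastic net parameter, $\lambda \geq 0$ is a tuning parameter, and there is no penalty for the intercept. The optimization task for $N$ observations can be described as follows:

$$\min_{\beta, \beta_0} \; \frac{1}{2N} \sum_{i=1}^{N} \left( x_i^{T}\beta + \beta_0 - y_i \right)^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + \tfrac{1}{2}(1 - \alpha) \lVert \beta \rVert_2^2 \right)$$

The Gradient Boosting Machine (GBM) [15], offered by the same H2O platform, turned out to be the top-performing algorithm at the final machine learning (ML) stage.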
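The elastic-net-regularized Gaussian GLM described above can be fitted, for small problems, with cyclic coordinate descent. The pure-Python sketch below is for illustration only: it is not H2O's solver, it does not standardize the predictors, and the function names, iteration count, and toy data are assumptions. It minimizes the squared error averaged over the observations (with a factor 1/2) plus the penalty lam * (alpha * L1 + (1 - alpha) / 2 * squared L2), leaving the intercept unpenalized.

```python
from typing import List, Tuple

def soft_threshold(rho: float, t: float) -> float:
    """Shrink rho toward zero by t (the proximal step for the L1 penalty)."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0

def fit_elastic_net(X: List[List[float]], y: List[float], lam: float = 0.0,
                    alpha: float = 0.5, n_iter: int = 500) -> Tuple[float, List[float]]:
    """Gaussian GLM with elastic net penalty, fitted by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    beta0 = sum(y) / n
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # mean of x_j times the residual that excludes feature j
            z = 0.0    # mean of x_j squared
            for i in range(n):
                partial = beta0 + sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            denom = z + lam * (1.0 - alpha)
            beta[j] = soft_threshold(rho, lam * alpha) / denom if denom > 0 else 0.0
        # The intercept is not penalized: set it to the mean remaining residual.
        beta0 = sum(y[i] - sum(X[i][k] * beta[k] for k in range(p))
                    for i in range(n)) / n
    return beta0, beta

# Toy data generated from y = 1 + 2*x1 - 3*x2 (made-up, noise-free).
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 2.0]]
y = [1.0 + 2.0 * a - 3.0 * b for a, b in X]
print(fit_elastic_net(X, y, lam=0.0))               # close to (1.0, [2.0, -3.0])
print(fit_elastic_net(X, y, lam=100.0, alpha=1.0))  # heavy lasso: zero coefficients
```

With lam = 0 the fit reduces to ordinary least squares; increasing lam with alpha = 1 drives coefficients to exactly zero, while alpha = 0 only shrinks them.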
R [16] is the programming language employed in the current study, and a number of packages from the same environment are used for a variety of tasks throughout the study. Plotting and interacting with the H2O ML platform are done in the same language and environment.

Data Processing and Results
The data processing cycle in this study comprises several steps, including missing value recovery, and aims to produce data sets suitable for machine learning.
Machine learning algorithms are used at different stages of this research for two distinct purposes: (a) missing value recovery (i.e., GLMs), and (b) building the final sepsis prediction model (several algorithms were tried at this stage, as described in the following sections). The main processing steps are described below by means of a Consolidated Standards of Reporting Trials (CONSORT)-like diagram.

Preprocessing Stage
At this stage, files containing artifacts caused by human error, equipment malfunction or failure, etc. were excluded. Because of the study design and the goal of building a sepsis prediction system with a prediction horizon of at least four hours (with three hourly observations needed to obtain the first prediction), sepsis files with fewer than seven observations were also excluded. A similar approach was used for non-sepsis subsets, keeping only the files with seven or more consecutive observations without missing values in the parameters of interest. This is illustrated in Figure 1.
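The two length-based selection rules can be expressed as simple predicates. The Python sketch below is illustrative only; artifact detection is not covered, `None` stands in for NA, and the function names and toy rows are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[float]]
MIN_OBSERVATIONS = 7  # threshold stated in the text

def longest_complete_run(rows: List[Row]) -> int:
    """Length of the longest run of consecutive rows without missing values."""
    best = current = 0
    for row in rows:
        if all(v is not None for v in row):
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best

def keep_sepsis_file(rows: List[Row]) -> bool:
    """Sepsis files need at least seven hourly observations (gaps are allowed)."""
    return len(rows) >= MIN_OBSERVATIONS

def keep_non_sepsis_file(rows: List[Row]) -> bool:
    """Non-sepsis files need seven or more consecutive complete observations."""
    return longest_complete_run(rows) >= MIN_OBSERVATIONS

complete = [80.0, 97.0]
gap = [None, 97.0]
print(keep_non_sepsis_file([complete] * 7))                     # True
print(keep_non_sepsis_file([complete] * 6 + [gap] + [complete]))  # False
print(keep_sepsis_file([gap] * 7))                              # True
```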
Initially, 502 sepsis subsets lack all values for at least one of the parameters of interest (heart rate, peripheral blood oxygen saturation, temperature, systolic blood pressure, diastolic blood pressure, and respiratory rate). Some of these files were reconstructed at the next stage, either with the custom approach described above or with the ML algorithms used for data recovery as part of the third method based on Kalman filtering.

Applying LOCF/NOCB
The first data recovery method applies LOCF (Last Observation Carried Forward) using the "imputeTS" package in R, followed by NOCB from the same package as a second step. This is appropriate for columns in which only some of the values are missing; it cannot recover columns in which all values are missing.

Data recovery using the custom approach
Since this approach can deal with columns whose values are partially or entirely missing, it yields a larger final set (30635 samples) for building the prediction system.

Kalman Filtering coupled with ML for data recovery
This approach also handles columns with partially missing values, to which Kalman filtering can be applied provided the column contains at least three non-missing values.
To handle the all-missing scenario, the correlations among the six parameters listed above and the patient's age were examined. Age has no missing data and exhibits a moderate correlation with some of these parameters.
For each of the six parameters, the three most correlated parameters were chosen based on the correlation coefficients (e.g., for temperature the most correlated parameters are heart rate, systolic blood pressure, and age).
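The selection of predictors can be sketched as follows: compute the Pearson correlation between the target parameter and every other parameter, then keep the three with the largest coefficients. This Python sketch is illustrative only; ranking by absolute value and using pairwise-complete observations are assumptions, and the function names and values are made up.

```python
import math
from typing import Dict, List, Optional

Column = List[Optional[float]]

def pearson(x: Column, y: Column) -> float:
    """Pearson correlation over the rows where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

def top_correlated(data: Dict[str, Column], target: str, k: int = 3) -> List[str]:
    """Names of the k parameters most correlated (in absolute value) with target."""
    scores = {name: abs(pearson(data[target], column))
              for name, column in data.items() if name != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up values for five hourly observations.
data = {
    "Temp":  [36.5, 37.0, 37.5, 38.0, 38.5],
    "HR":    [70.0, 75.0, 82.0, 88.0, 95.0],
    "SBP":   [130.0, 122.0, 118.0, 110.0, 100.0],
    "Age":   [40.0, 40.0, 41.0, 41.0, 42.0],
    "Resp":  [18.0, 22.0, 17.0, 21.0, 19.0],
    "O2Sat": [98.0, 97.0, 98.0, 97.0, 98.0],
}
print(top_correlated(data, "Temp"))  # ['HR', 'SBP', 'Age']
```

The selected parameters would then serve as the covariates of the regression model that predicts the all-missing column.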
The correlation coefficients for the seven parameters are shown in Table 2. The highest correlation coefficients are indicated in bold.
Note. HR: heart rate; O2Sat: peripheral blood oxygen saturation; Temp: temperature; SBP: systolic blood pressure; DBP: diastolic blood pressure; Resp: respiratory rate; Sepsis label: the label of the observation (0 for non-sepsis observations and 1 for sepsis).
A series of GLMs were trained using this correlation data, and the best-performing models were chosen for further study. The main characteristics of these models are displayed in Table 3.
Resp: 20.3789; 0.9001 (HR); -1.0138 (O2Sat); 0.3220 (Temp)
Note. HR: heart rate; O2Sat: peripheral blood oxygen saturation; Temp: temperature; SBP: systolic blood pressure; DBP: diastolic blood pressure; Resp: respiratory rate.
Together with Kalman filtering, these models are incorporated into the data recovery pipeline that reconstructs the sepsis files with missing values. A file recovered with this method is shown in Table 4. Bold values indicate recovered values. The recovered subset corresponds to the original one shown in Table 1 above.
Note. HR: heart rate; O2Sat: peripheral blood oxygen saturation; Temp: temperature; SBP: systolic blood pressure; DBP: diastolic blood pressure; Resp: respiratory rate; Sepsis label: the label of the observation (0 for non-sepsis observations and 1 for sepsis).

Preparing Data Sets for Machine Learning
After recovery, the data are divided into a training set (85-90% of the final samples) and a test set (10-15%). A sliding window is applied to each file (or subset) to aggregate the observations into chunks of three consecutive values. Finally, the algorithmic complexity of each of the two 3x3 matrices is calculated using the Block Decomposition Method, along with the differences between each parameter's values in three successive hourly samples [17]. The resulting vector of length 14, produced for each sample, is the format in which the data are provided to the ML algorithm. The final data sets are larger than the number of initially selected files/subsets, since each file comprises at least seven hourly observations of each of the six parameters of interest, to which the sliding window is applied. This procedure was applied to each of the final sets recovered with the approaches described earlier. The sizes of the final sets and the train/test splits are presented in Figure 1.
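The windowing step can be sketched as below. This Python illustration covers only the sliding window and the successive differences; the Block Decomposition Method complexity values are omitted. Reading the per-sample vector as two differences for each of the six parameters (12 values) plus two complexity values is an interpretation of the text, not something it states explicitly, and the function names and toy data are hypothetical.

```python
from typing import List

Row = List[float]  # one hourly observation of the six parameters

def sliding_windows(rows: List[Row], width: int = 3) -> List[List[Row]]:
    """All runs of `width` consecutive hourly rows, advancing one hour at a time."""
    return [rows[i:i + width] for i in range(len(rows) - width + 1)]

def difference_features(window: List[Row]) -> List[float]:
    """For each parameter, the differences between successive values in the window."""
    features: List[float] = []
    for p in range(len(window[0])):
        column = [row[p] for row in window]
        features.extend(b - a for a, b in zip(column, column[1:]))
    return features

# Toy file: seven hourly rows of six parameters, each rising by 1 per hour.
rows = [[float(hour + p) for p in range(6)] for hour in range(7)]
windows = sliding_windows(rows)
print(len(windows))                          # 5 windows from 7 hourly rows
print(len(difference_features(windows[0])))  # 12 difference features
```

This also shows why the final data sets contain more samples than files: a file with seven hourly rows already yields five windows.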

Machine Learning Stage
A series of machine learning models were trained on the H2O platform with 10-fold cross-validation, using the ML training sets produced with each of the three methods. As the figure above shows, the best-performing model (AUC of 0.953) is model (c), i.e., the model based on Kalman filtering for partially missing values coupled with GLM-based recovery of columns with all values missing. Table 5 summarizes further details of the top-performing GBM.
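For readers unfamiliar with the evaluation setup, the sketch below shows how an AUC can be computed from labels and predicted scores, and how samples can be assigned to ten folds. It is a plain Python illustration, not the H2O implementation; the function names, the simple interleaved fold assignment, and the example scores are assumptions.

```python
from typing import List

def roc_auc(labels: List[int], scores: List[float]) -> float:
    """AUC as the probability that a septic sample outscores a non-septic one."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

def k_fold_indices(n_samples: int, k: int = 10) -> List[List[int]]:
    """Assign sample indices to k folds in an interleaved fashion."""
    return [list(range(fold, n_samples, k)) for fold in range(k)]

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))    # 0.75
print(len(k_fold_indices(25)))    # 10
```

In each cross-validation round, one fold is held out for validation and the model is trained on the remaining nine.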

Discussion
With thoughtful planning, the amount of missing data can be reduced to some extent. This is significant because incomplete data may introduce bias into data analysis. This work addresses the difficulty of recovering missing data during model construction. As far as we are aware, this is the first time a data recovery strategy combining Kalman filtering and ML has been used for such or comparable data sets. One of the key components of this method is the Kalman filter, used here in a less common way (in contrast to its more traditional applications [11], e.g., car cruise control, autopilot, dynamic positioning, target tracking, etc.). The second main component is a set of six GLMs used to impute the values in columns with all data missing; it is described in more detail in [18]. When used for missing value recovery, this approach yields the highest performance of the sepsis prediction model (AUC 0.953), and the difference is statistically significant when compared with the other two approaches (p-values of 7.4137e-05 (b) and 4.3727e-04 (c) respectively). Figure 3 presents more details of the statistical comparison of the three methods described in the current work.
The suggested approach has some drawbacks. First, it has not been evaluated on a variety of data sets, such as data sets with categorical variables or data from diseases other than sepsis. Additionally, it does not account for data in which the features are not associated with one another, nor for the type of variable distribution (normal, log-normal, logarithmic, etc.). Thus, our technique may be considered for similar data sets with continuous features and outcomes and a comparable correlation between features. Future research may evaluate the method's robustness across various data sets.
Applying the methods outlined in this study to sufficiently large and comparable data sets is one direction for future research. When comparing strategies for missing data recovery, the performance of classification ML models in differentiating between septic and non-septic cases might serve as a measure of a method's adequacy and dependability (e.g., comparing the results of the full set analysis to those of the complete case analysis).

Conclusions
Missing values in research data can be a problem, especially in large data sets with multiple observations and features. There are a number of methods for dealing with such issues, but a universally accepted approach is lacking, particularly when the data are used for machine learning purposes.
This paper tackles the missing value problem in a medical data set coming from patients in the intensive care unit. After recovery through three distinct approaches, these data are used to build a sepsis prediction system.
The authors' missing value imputation strategy, which is based on ML and Kalman filtering, offers the best classification performance (AUC 0.953) for the specific data utilized in this study.
Because the missing value imputation method may affect how well an ML model discriminates, it is worthwhile to test several approaches before settling on the most appropriate one for the given data set. In some circumstances, as with the septic patients in the current study, this can lead to a better diagnosis and course of therapy, potentially saving the patient's life.

Conflicts of Interest:
The authors declare no conflict of interest.