Informatica 42 (2018) 33–42

Continuous Blood Pressure Estimation from PPG Signal

Gašper Slapničar and Mitja Luštrek
Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana
E-mail: gasper.slapnicar@ijs.si, mitja.lustrek@ijs.si

Matej Marinko
Faculty of Mathematics and Physics, Jadranska cesta 19, 1000 Ljubljana
E-mail: matejmarinko123@gmail.com

Keywords: photoplethysmography, blood pressure estimation, regression analysis, m-health

Received: November 11, 2017

Given the importance of blood pressure (BP) as a direct indicator of hypertension, regular monitoring is encouraged for healthy people and mandatory for patients at risk from cardiovascular diseases. We propose a system in which the photoplethysmogram (PPG) is used to continuously estimate BP. A PPG sensor can be easily embedded in a modern wearable device, which can be used in such an approach. The PPG signal is first preprocessed in order to remove major noise and movement artefacts present in the signal. A set of features describing the PPG signal on a per-cycle basis is then computed to be used in regression models. The predictive performance of the models is improved by first using the RReliefF algorithm to select a subset of relevant features. Afterwards, personalization of the models is considered to further improve the performance. The approach is validated using two distinct datasets, one from a hospital environment and the other collected during everyday activities. Using the MIMIC hospital dataset, the best achieved mean absolute errors (MAE) in a leave-one-subject-out (LOSO) experiment were 4.47 mmHg for systolic and 2.02 mmHg for diastolic BP, at maximum personalization. For the everyday-life dataset, the lowest errors in the same LOSO experiment were 8.57 mmHg for systolic and 4.42 mmHg for diastolic BP, again using maximum personalization. The best performing algorithm was an ensemble of regression trees.

Povzetek: Blood pressure is a direct indicator of hypertension. We developed a system that estimates blood pressure from the photoplethysmogram (PPG), such as is already embedded in most modern sensor wristbands. The PPG signal is first pre-processed and segmented into cycles. Pre-processing removes most of the noise, which frequently arises from movement. From the cleaned signal we then computed a set of features, which we used in regression models. The system was improved by using the RReliefF algorithm to select relevant features and by using part of each person's data to train personalized predictive models. The system was evaluated on two datasets, one from a clinical environment and the other collected during individuals' routine daily activities. In the experiment, the model was each time trained on all persons except one and then tested on the left-out person. Using the clinical dataset, the lowest mean absolute errors (MAE) achieved in this experiment were 4.47 mmHg for systolic and 2.02 mmHg for diastolic blood pressure, at the highest degree of personalization. For the dataset collected during daily activities, the lowest errors were 8.57 mmHg for systolic and 4.42 mmHg for diastolic blood pressure, again at the highest degree of personalization. An ensemble of regression trees performed best.

1 Introduction

The World Health Organization (WHO) listed cardiovascular diseases as the most common cause of death in 2015, responsible for almost 15 million deaths combined [1].
Hypertension is one of the most common precursors of such diseases and can be easily detected with regular blood pressure (BP) monitoring. Monitoring is especially critical for patients already suffering from hypertension or related cardiovascular diseases, as it can indicate potential vital threats to their health.

While regular BP monitoring is important, it is also troublesome, as devices using inflatable cuffs are still considered the "gold standard". The cuff placement is critical, as the sensor must be located directly above the main artery in the upper arm area, at approximately heart height [4]. These requirements impose relatively strict movement restrictions on the subject and require substantial time commitment, thus causing low subject adherence to regular monitoring. Furthermore, when done by the subject him/herself in a home environment, this process can cause stress, which in turn influences the BP values, making the measurements less reliable. This problem is usually not alleviated by having medical personnel perform the measurement, as this can again cause anxiety in the subject, commonly known as the "white coat syndrome".

Our work focuses on photoplethysmogram (PPG) analysis and the development of a robust non-obtrusive method for continuous BP estimation. It will be implemented and used in an m-health system based on a wristband with an embedded PPG sensor. This will allow the user to wear the device without any interference or limits imposed upon their daily routine, allowing for truly continuous measuring without stressing the user and thus potentially influencing the BP values.

The rest of the paper is organized as follows. Section 2 gives a brief overview of the related work. Section 3 explains the methodology we have used, focusing on signal pre-processing and machine learning features. Section 4 elaborates on the experimental setup and results, and Section 5 concludes with a summary and plans for future work.

2 Related work

Photoplethysmography is a relatively simple technique based on inexpensive technology, which is becoming increasingly popular in wearable devices for heart rate estimation. It is based on the illumination of the skin and measurement of changes in its light absorption [5]. In its basic form it only requires a light source to illuminate the skin (typically a light-emitting diode, LED) and a photodetector (photodiode) to measure the amount of light either transmitted through, or reflected from, the skin. Thus PPG can be measured in either transmission or reflectance mode. Both modes of operation are shown in Figure 1.

Figure 1: Transmission and reflectance mode in which the PPG signal can be obtained. LED is the light source while PD is the photodetector [6].

With each cardiac cycle, the heart pumps blood towards the periphery of the body. This produces a periodic change in the amount of light that is either absorbed or reflected from the skin to the photodetector, as the tissue changes its tone based on the amount of blood in it.

Exploring the recent applications of PPG, we can see that it is becoming more widely used in BP estimation. One of two common approaches is typically used:

1. BP estimation using two sensors (PPG + electrocardiogram (ECG))
2. BP estimation using the PPG sensor only
The first approach requires the use of two sensors, typically an ECG and a PPG sensor, in order to measure the time it takes for a single heart pulse to travel from the heart to a peripheral point in the body. This time is commonly known as pulse transit time (PTT) or pulse arrival time (PAT), and its correlation with BP changes is well established.

The more recent approach is focused on the PPG signal only; however, the relationship between the PPG and BP is only postulated and not as well established as the relationship between the PTT and BP. This approach is, however, notably less obtrusive, especially since PPG sensors have recently become very common in most modern wristbands.

BP is commonly measured in millimeters of mercury (mmHg), which is a manometric unit routinely used in medicine and many other scientific fields. A mercury manometer is a curved tube containing mercury, which is closed at one end while pressure is applied on the other end. 1 mmHg of pressure means that the pressure is large enough to increase the height of the mercury in the tube by 1 mm. To put the values discussed in this paper into perspective, the normal healthy adult BP is considered to be around 120 mmHg (16 kPa) for systolic and 80 mmHg (11 kPa) for diastolic BP [2].

One of the earliest PPG-only attempts was conducted by Teng et al. in 2003 [3]. The relationship between the arterial BP (ABP) and certain features of the PPG signals was analyzed. Data were obtained from 15 young healthy subjects in a highly controlled laboratory environment, ensuring constant temperature, no movement and silence. The mean differences between the linear regression estimations and the measured BP were 0.21 mmHg for systolic (SBP) and 0.02 mmHg for diastolic BP (DBP). The corresponding standard deviations were 7.32 mmHg for SBP and 4.39 mmHg for DBP. Using mean errors instead of mean absolute errors as the evaluation metric is questionable, since positive and negative errors cancel out: the mean error can be extremely low even when the individual predictions lie far above and below the actually observed BP values, so it does not reflect the true performance of the derived model.

A paper was published in 2013 in which the authors used data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC) waveform database [7, 8] to extract 21 time-domain features and use them as an input vector for artificial neural networks (ANNs) [9]. The results are not quite as good as with the linear regression model described earlier; however, the data was obtained from a higher number and variety of patients in a less controlled environment. Mean absolute errors of less than 5 mmHg for both SBP and DBP were reported. While the environment was less controlled compared to the previous work, the patients were still within a hospital setting and hospital equipment was used for data collection. Furthermore, only an undisclosed subset of all the available data from MIMIC was used.

Another study was conducted in 2013 in which the authors used a smartphone camera to capture the PPG signal, using the camera flash as the light source and the phone camera as the photodetector [10]. PPG features were again extracted and fed to a neural network, which estimated SBP and DBP. All the data processing and BP evaluation was done in a cloud in order to reduce the computational burden on the device.
It is not clear how many subjects participated in the experiment; however, the reported maximum error did not exceed 12 mmHg. The error metric is not explained in detail, but based on the given results table we can presume that MAE was used. Such a method requires some user effort, as the user must place and hold a finger over the camera and LED light, which prevents any other activities during this time.

It is clear that the PPG-only approach has potential; however, a robust unobtrusive method that works well in the general case is yet to be developed.

3 Methodology

The proposed system consists of two main modules, namely the signal pre-processing and the machine learning module. The former is responsible for cleaning the PPG signal of most noise and then segmenting it into cycles, where one PPG cycle corresponds to a single heart beat. The latter extracts features describing the PPG signal on a per-cycle basis, selects a subset of relevant features using the RReliefF algorithm [12], and finally feeds the subset into regression algorithms, which build the prediction models.

3.1 Signal pre-processing

PPG sensors must be very sensitive in order to detect tiny variations in the light absorption of the tissue. This also makes them highly susceptible to movement artefacts. This problem is especially obvious when dealing with PPG collected via a wristband, as the contact between the sensor and the skin can be compromised during arm movements. This is partially alleviated by using green light, which is less prone to artefacts; however, major artefacts often remain in the signal. Subsequently, substantial effort is directed towards PPG pre-processing.

3.1.1 Cleaning based on established medical criteria

In the first phase, both the BP and PPG signal are roughly cleaned based on established medical criteria [13]. A 5-second sliding window is used to detect segments with extreme BP values or extreme changes of the BP in a short time period. Thresholds for extreme values and changes are selected based on established medical criteria in related work [13] and are given in Table 1. Some thresholds were slightly modified, since the criteria given in the referenced paper seem too strict for some subjects encountered in our datasets. We have thus loosened the criteria in accordance with empirical observations in our datasets (e.g., the original criteria exclude all data with SBP > 180 mmHg, while we observed some valid segments with SBP over 180 mmHg).

Criterion                     Threshold
SBP                           > 250 or < 80
DBP                           > 150 or < 40
SBP − DBP                     < 20
ΔSBP or ΔDBP in 5 sec         > 50

Table 1: Established medical criteria and thresholds (in mmHg) for rough signal cleaning. Δ signifies a change in BP value. 5-second segments meeting any of these criteria are removed from the signal.

After the cleaning of the clinical dataset, 85% of the data is kept on average, while 15% is discarded. This is very subject dependent: for some subjects nearly all the data is removed (e.g., a sensor anomaly which shows 0 ABP almost all the time), while for the majority of subjects most of the data is kept. For the everyday-life dataset, which contains a lot more noise, only 40% of the data is kept, while 60% is discarded. This is the result of some subjects having long noisy segments of the PPG signal. It should be noted that these percentages also depend on the parameters controlling the trade-off between the required quality and the amount of signal kept, which are discussed in Section 3.1.3.
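As an illustration, a minimal sketch of this first cleaning phase is given below. It assumes the BP signal has already been reduced to one SBP/DBP value per 5-second window; the function name and data layout are ours, and only the thresholds come from Table 1.

```python
import numpy as np

def keep_mask_medical(sbp, dbp):
    """Flag 5-second windows violating the Table 1 criteria.
    `sbp`/`dbp` hold one representative value per window;
    returns a boolean mask of windows to keep."""
    sbp = np.asarray(sbp, dtype=float)
    dbp = np.asarray(dbp, dtype=float)
    bad = (sbp > 250) | (sbp < 80)          # extreme systolic values
    bad |= (dbp > 150) | (dbp < 40)         # extreme diastolic values
    bad |= (sbp - dbp) < 20                 # implausible pulse pressure
    # extreme change between consecutive windows (> 50 mmHg in 5 s)
    bad |= np.abs(np.diff(sbp, prepend=sbp[0])) > 50
    bad |= np.abs(np.diff(dbp, prepend=dbp[0])) > 50
    return ~bad

# usage: keep = keep_mask_medical(sbp_per_window, dbp_per_window)
# ppg_windows = [w for w, k in zip(ppg_per_window, keep) if k]
```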
3.1.2 Peak and cycle detection

In order to do further cleaning and subsequent feature extraction, PPG cycle detection is mandatory. This is not trivial, as substantial noise in the PPG signal poses a significant problem, as mentioned earlier.

This problem was tackled in several steps. First, a filtering transformation, which enhances the systolic upslopes of the pulses in the PPG signal, is used. It is designed to use the derivative of the PPG signal at lower frequencies, in order to detect the abrupt upslopes of the systolic pulse compared to the diastolic or dicrotic pulse in the PPG signal. This is based on a low-pass differentiator (LPD) filter, which removes high-frequency components and performs differentiation. Once the steepest points in the PPG signal are located, the following peak is chosen as the PPG systolic peak. Afterwards, a time-varying threshold for peak detection is applied, which ensures that potential double peaks or diastolic peaks close to the systolic ones are not chosen. The procedure is explained in detail in a paper by Lázaro et al. [14].

After the peaks are detected, finding the cycle start and end points is simpler, as the dominant valleys between the detected peaks must be found. An example of detected peaks and cycle locations using the described method is shown in Figure 2.

Figure 2: The upper subplot shows a PPG segment. The lower subplot shows the LPD filtering transformation of the same PPG segment. Peaks of the transformation in the lower subplot correspond to the steepest systolic upslopes of the PPG in the upper subplot, and are denoted as $n_A$. Actual detected PPG peaks are denoted as $n^*_A$.

3.1.3 Cleaning based on ideal templates

After cycles are successfully detected, the second cleaning phase begins. A 30-second sliding window is used.

First, the most likely length of a cycle L in the current window is determined using autocorrelation analysis. A copy of the PPG signal in the current window is taken and shifted sample by sample up to a certain length that contains at least two heart beats. When the copy is shifted by the number of samples corresponding to exactly one cycle, the autocorrelation reaches its first peak, and this number of samples is chosen as L.

Presuming that the majority of cycles within a 30-second window are not morphologically altered, we can create an "ideal cycle template" for this window. Such a template is created by always taking the next L samples from each cycle starting point and then computing the mean cycle. Each individual cycle is then compared to the computed template and its quality is evaluated using three signal quality indices (SQIs), which are defined as follows [15]:

1. SQI1: The first L samples of each cycle are taken and then each cycle is directly compared to the template using a correlation coefficient.
2. SQI2: Each cycle is interpolated to length L and then the correlation coefficient with the template is computed.
3. SQI3: The distance between each cycle and the template is computed using dynamic time warping (DTW).

Finally, the thresholds for each SQI are empirically determined. Each cycle's SQIs are evaluated and if they reach the required quality thresholds, that cycle is kept, otherwise it is removed. If more than half the cycles in the current 30-second window are under the thresholds, the whole window is discarded as too noisy. An example of this cleaning is shown in Figure 3.

Figure 3: Example of the cleaning algorithm in the second phase of the signal pre-processing. Comparing the top (uncleaned) and bottom (cleaned) PPG signal, we see that the obvious artefact segments are removed.
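A sketch of the core of this phase is shown below, assuming cycle start indices already come from the peak detector described in Section 3.1.2. Only SQI1 is implemented; SQI2 (interpolation) and SQI3 (DTW) are omitted, and the 0.8 threshold is an illustrative placeholder rather than the empirically tuned value.

```python
import numpy as np
from scipy.signal import correlate, find_peaks

def cycle_length(ppg, fs):
    """Estimate the dominant cycle length L (in samples) of a
    30-second window as the first peak of its autocorrelation,
    searched within a plausible heart-rate range (40-180 bpm)."""
    x = ppg - ppg.mean()
    ac = correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    lo, hi = int(fs * 60 / 180), int(fs * 60 / 40)
    peaks, _ = find_peaks(ac[lo:hi])
    return lo + peaks[0] if len(peaks) else None

def sqi1_keep(ppg, cycle_starts, L, thr=0.8):
    """SQI1: correlate the first L samples of every cycle with the
    mean 'ideal' template; keep cycles whose correlation >= thr."""
    cycles = np.array([ppg[s:s + L] for s in cycle_starts
                       if s + L <= len(ppg)])
    template = cycles.mean(axis=0)          # ideal cycle template
    corr = np.array([np.corrcoef(c, template)[0, 1] for c in cycles])
    return corr >= thr                      # boolean keep-mask
```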
Once the PPG signal is cleaned and only high-quality cycles with minimal morphological anomalies remain, features can be extracted from each cycle.

3.2 Machine learning

In order to derive the relationship between the PPG and BP, features describing the PPG signal were computed, and then a relevant subset of these features was selected to be used in the regression algorithms.

3.2.1 Features

In accordance with the related work [3, 9, 10], several time-domain features were computed from the PPG signal, and the set of features was further expanded with some from the frequency [13] and complexity-analysis domains. Most features focus on describing the morphology of a given PPG cycle, as shown in Figure 4.

Figure 4: Time- and area-based features that describe the morphology of the PPG signal on a per-cycle basis. The features are listed and elaborated in Table 2.

Feature      Description
Tc           Cycle duration
Ts           Time from start of cycle to systolic peak
Td           Time from systolic peak to end of cycle
Tnt          Time from systolic peak to diastolic rise
Ttn          Time from diastolic rise to end of cycle
S1           Area under the curve (AUC) from start of cycle to max upslope point
S2           AUC from max upslope point to systolic peak
S3           AUC from systolic peak to diastolic rise
S4           AUC from diastolic rise to end of cycle
AUC syst     S1 + S2
AAC syst     Area above the curve (AAC) from start of cycle to systolic peak
AUC diast    S3 + S4
AAC diast    AAC from systolic peak to end of cycle

Table 2: Elaborations of some of the used features shown in Figure 4.

In addition to the features focusing on the PPG cycle morphology, which were highlighted thus far, the following features were computed and considered:

1. AI, the augmentation index: a measure of wave reflection on arteries,
$$AI = \frac{\text{diastolic rise amplitude}}{\text{systolic peak amplitude}}$$

2. LASI, the large artery stiffness index: an indicator of arterial stiffness, which is denoted as Tnt in Figure 4 and Table 2.

3. Complexity analysis: signal complexity and mobility are computed for the 30-second PPG segment containing the current cycle (a code sketch follows this list). Mobility represents an estimate of the mean frequency and is proportional to the standard deviation of the power spectrum. Complexity gives an estimate of change in frequency by comparing the signal's similarity to a pure sine wave. They are given by Najarian and Splinter [11] as follows (presuming a zero-mean signal):
$$S_0 = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N}}, \quad S_1 = \sqrt{\frac{\sum_{j=2}^{N-1} d_j^2}{N-1}}, \quad S_2 = \sqrt{\frac{\sum_{k=3}^{N-2} g_k^2}{N-2}},$$
where x is the PPG signal, d is the first-order derivative of x and g is the second-order derivative of x. Then
$$\text{Mobility} = \sqrt{\frac{\operatorname{var}(d)}{\operatorname{var}(x)}} = \frac{S_1}{S_0}, \quad \text{Complexity} = \frac{\text{mobility}(d)}{\text{mobility}(x)} = \sqrt{\frac{S_2^2}{S_1^2} - \frac{S_1^2}{S_0^2}}.$$

4. FFT features: amplitudes and phases of the frequency-domain representation of the 30-second PPG segment containing the current cycle. The length of the window was chosen such that it contains enough cycles (expected 1 cycle per second) for the frequencies in the segment to be reliably determined.
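A short sketch of the complexity-analysis features referenced in item 3 above, assuming a zero-mean 30-second segment as in the formulas; the denominators are simplified to plain means, which differs negligibly from N, N−1 and N−2 for segments of this length.

```python
import numpy as np

def mobility_complexity(x):
    """Hjorth-style signal mobility and complexity of a PPG
    segment, following the S0/S1/S2 definitions above."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                 # enforce the zero-mean assumption
    d = np.diff(x)                   # first-order derivative
    g = np.diff(d)                   # second-order derivative
    s0 = np.sqrt(np.mean(x ** 2))
    s1 = np.sqrt(np.mean(d ** 2))
    s2 = np.sqrt(np.mean(g ** 2))
    mobility = s1 / s0
    complexity = np.sqrt(s2**2 / s1**2 - s1**2 / s0**2)
    return mobility, complexity
```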
Considering all the time- and frequency-domain features along with the complexity-analysis features, and the amount of instances (cycles) available, we are often dealing with a very large matrix of training data. The number of rows (instances) is on the order of $10^5$ and the number of columns (features) is on the order of $10^2$, thus dimensionality reduction through selection of a subset of relevant features is feasible, but not mandatory. More importantly, feature selection allows us to determine which features are useful for the learning process and which are irrelevant, allowing us to obtain a smaller subset containing only the relevant features.

3.2.2 Feature selection

The RReliefF algorithm was chosen for feature selection. It is a modification of the ReliefF algorithm, suitable for regression problems with continuous target variables. The algorithm was applied to a randomly chosen subset of 10% of all data. This was repeated 10 times. All the features with non-zero relevance, as chosen by the algorithm, were considered in each iteration and their importance was saved. Looking at the final scores of the algorithm across all the iterations, we notice that quite a few features are considered irrelevant, while the same features are commonly chosen as important for both SBP and DBP, as shown in Figure 5. Noting the fact that the same features were selected in each of the 10 iterations, we can assume that the relevant features are not dependent on the selected subset of the available data.

Figure 5: The output of the RReliefF algorithm, which shows the feature importance for each of the considered features.

As mentioned, all the features with non-zero importance were taken, as more than half were discarded as irrelevant by the RReliefF algorithm. Among the non-zero importance features, some features from each of the groups mentioned earlier (temporal, frequency and complexity analysis) are present. Most area-based features were marked as irrelevant, while certain times (Tc, Ts and Td), both complexity-analysis features (signal complexity and signal mobility) and some frequency-domain features (amplitudes and phases at low frequencies) were marked as important. These non-zero importance features were then used in the regression algorithms.

The relevant features were determined using the larger and more varied dataset from the MIMIC database. The same subset of features was also used with the smaller everyday-life dataset. Both datasets are described in more detail in the following section.

Since the feature selection procedure only slightly improved the results, we have not considered experiments with other or additional feature selection methods.
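To make the selection step concrete, here is a simplified RReliefF sketch based on the weight update from Robnik-Šikonja and Kononenko [12]. It uses uniform neighbour weights instead of distance-decayed ones and Manhattan distance on min-max-scaled features; the paper itself relied on an existing implementation, so this is only an approximation of the algorithm's core idea.

```python
import numpy as np

def rrelieff_weights(X, y, n_iter=200, k=10, seed=0):
    """Simplified RReliefF: estimate per-feature relevance for a
    regression target. Positive weight = relevant feature."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # min-max scale features and target so differences fall in [0, 1]
    Xn = (X - X.min(0)) / (np.ptp(X, 0) + 1e-12)
    yn = (y - y.min()) / (np.ptp(y) + 1e-12)
    n_dc = 0.0                       # P(different prediction)
    n_da = np.zeros(p)               # P(different attribute)
    n_dcda = np.zeros(p)             # P(both different)
    for _ in range(n_iter):
        i = rng.integers(n)
        dist = np.abs(Xn - Xn[i]).sum(axis=1)
        nbrs = np.argsort(dist)[1:k + 1]    # k nearest, self excluded
        for j in nbrs:
            diff_y = abs(yn[i] - yn[j])
            diff_a = np.abs(Xn[i] - Xn[j])
            n_dc += diff_y / k
            n_da += diff_a / k
            n_dcda += diff_y * diff_a / k
    return n_dcda / n_dc - (n_da - n_dcda) / (n_iter - n_dc)

# usage: w = rrelieff_weights(X, y_sbp); selected = np.where(w > 0)[0]
```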
The ABP is measu- red by inserting a cathether in an artery, making it highly invasive, however, it offers the most precise BP monitoring. The second dataset was collected at Jožef Stefan Insti- tute (JSI) using the Empatica E4 wristband for the PPG and a digital cuff-based Omron BP monitoring device for the ground truth BP, as is common in such experimental settings in related work. This device is reported to be cli- nically validated according to the British Hypertension So- ciety and the Association for the Advancement of Medi- cal Instrumentation (AAMI) protocols [17], which means that the mean errors do not exceede 5 mmHg. The col- lection procedure at JSI was conducted in accordance with the standardized clinical protocol. The correct placement of the cuff on the upper arm area with the sensor above the main artery was ensured. The measurements were done in an upright sitting position, making sure the cuff was located at approximately heart height. The recommended protocol was followed as best as possible, however, in an ideal si- tuation the ground truth BP should be measured as ABP within an artery. Due to the invasive nature of ABP measu- rement, this is not feasible in an everyday-life situation, so the digital cuff-based monitor was used as a good replace- ment. An upper-arm cuff-based monitor was chosen over a wrist-based one, as the latter is less accurate and extremely sensitive to body position. In the first completed phase of the data collection, 8 he- althy subjects were considered, 5 male and 3 female. Each subject wore the wristband PPG measuring device for se- veral hours during their everyday activities. They measu- red their BP every 30 minutes or more often. Finally, only parts of the PPG signal 3 minutes before and after each BP measurement point were taken into consideration, as the measured BP value is only relevant for a short time. Ideally, the BP would be measured more often, however, this would place further stress on the subjects and was not possible during their everyday routine. Furthermore, addi- tional physiological variations (e.g., breathing rate) could be obtained from the PPG and used for the BP estimation, however, this was not yet considered but might be a subject of future work. Continuous Blood Pressure Estimation from. . . Informatica 42 (2018) 33–42 39 Figure 5: The output of RReliefF algorithm, which shows the feature importance for each of the considered features. 4.2 Experimental setup Two experimental setups were considered, 5-fold cross va- lidation and LOSO. The purpose of the first was to establish initial observations about the selected features and perfor- mance of different regression algorithms. The second ex- periment was conducted to evaluate the generalization per- formance of the algorithms and subsequently determine po- tential requirement for personalization. 4.2.1 K-fold cross validation The MIMIC dataset consisted of roughly 200 000 instances post filtering, which correspond to 41 patients. The instan- ces were obtained by uniformly taking 20 3-minute seg- ments from the whole recording for a given patient. Each instance (cycle) in a given 3-minute segment was assigned the mean SBP and mean DBP of this segment. This simu- lates the patients measuring their BP periodically, but not more than once in 3 minutes. K-fold cross validation (k = 5) was conducted, where in- stances were first shuffled randomly and then all the data was split into nearly equal folds. 
K-fold cross validation (k = 5) was conducted: the instances were first shuffled randomly and all the data was split into nearly equal folds; then k − 1 folds were taken for learning and the remaining fold was used for testing. This was repeated k times. The random shuffling of instances means that instances belonging to a given subject might appear in both the training and testing sets. This was taken into account (a sort of implicit personalization), as this experiment was merely a starting point to determine the initial performance of the algorithms and was later complemented by a leave-one-subject-out experiment.

Several regression algorithms were compared using the full set of features. The algorithm that performed best using all the features was additionally evaluated using only the subset of best features as selected by the RReliefF algorithm. The predictive performance of these options in 5-fold cross validation is discussed in detail in the Results section.

4.2.2 Leave-one-subject-out

Due to the increased computational complexity of a leave-one-subject-out experiment compared to k-fold cross validation, the data was additionally sub-sampled by taking 500 uniformly selected cycles from each patient's data.

During the initial attempt, a regression model was trained in each iteration on all the subjects except the one left out. It was ensured that no instances from the testing subject appeared in the training set. This yielded poor results. Notable improvements can be made by using a small amount of each patient's data for training, most likely because each patient has a subtly unique cardiovascular dynamic and relation between PPG and BP. This was additionally confirmed by cycle morphology analysis, which established that similar cycle shapes do not necessarily signify similar BP values. Due to the mentioned factors, personalization of the trained models was considered in an attempt to improve their predictive performance.

In the second attempt, the regression models were again trained using all the subjects except the one left out. This time, however, the models were further personalized by using some instances from the left-out subject. The instances of the left-out subject were grouped by their BP values. These groups were then sorted from lowest to highest BP. Afterwards, every n-th group (n = 2, 3, 4, 5, 6) of instances was taken from the testing data and used in training in order to personalize the model to the current patient (a sketch is given below). This ensures personalization with different BP values, as taking just a single group of instances gives little information, since the BP is constant within a group. Given the fact that the MIMIC data consists of roughly five times as many patients as the everyday-life data, the personalization data for it was multiplied 5 times, making it noticeable within the large amount of training data from the remaining patients.

During both attempts, several regression algorithms were once again considered, as given in Tables 3 and 4. The MAE was used as the evaluation metric. All models were compared with a dummy regressor, which always predicted the mean BP value of the same combination of general and personalization data as the other models used for training. Finally, the regressor with the lowest MAE was chosen as best.
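The following sketch illustrates the personalization split described above, assuming the left-out subject's instances are grouped by their (shared) BP label; n = 2 corresponds to the maximum, 50% personalization. Function and variable names are ours.

```python
import numpy as np

def personalization_split(X_left_out, y_left_out, n=2):
    """Sort the left-out subject's BP-value groups and move every
    n-th group into the training data; the rest stays for testing."""
    groups = np.unique(y_left_out)          # one group per BP value, sorted
    to_train = np.isin(y_left_out, groups[::n])
    return (X_left_out[to_train], y_left_out[to_train],    # personalization data
            X_left_out[~to_train], y_left_out[~to_train])  # test data

# usage (MIMIC): replicate the personalization data 5x before
# appending it to the general training set, as described above.
# Xp, yp, Xt, yt = personalization_split(Xs, ys, n=2)
# X_train = np.vstack([X_general] + [Xp] * 5)
# y_train = np.concatenate([y_general] + [yp] * 5)
```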
For successful personalization, the user should measure their PPG continuously and also make a few periodic measurements of their BP using a reliable commercial device. This allows the model to personalize to the user, learning from a small sample of their labeled data, thus improving its predictive performance.

4.3 Results

Using the personalization approach, notable improvements have been made over the dummy regressor in both experiments. The results are discussed in detail in the following sections.

4.3.1 K-fold cross validation results

MAE with the corresponding standard deviations in the 5-fold cross validation experiment for the MIMIC data are given in Table 3, while the results for the everyday-life data are given in Table 4.

Algorithm                   MAE SBP [mmHg]    MAE DBP [mmHg]
Dummy (predicts mean)       19.70 ± 16.07     8.73 ± 6.77
Linear regression           18.47 ± 15.91     8.14 ± 7.98
Ensemble (all feat.)         5.83 ± 7.74      2.92 ± 4.09
Ensemble (relevant feat.)    4.90 ± 6.59      2.21 ± 3.70

Table 3: MAE of different algorithms for SBP and DBP estimation in 5-fold cross validation using the MIMIC hospital dataset.

Algorithm                   MAE SBP [mmHg]    MAE DBP [mmHg]
Dummy (predicts mean)       11.46 ± 7.51      5.01 ± 3.99
Linear regression           11.21 ± 8.00      5.01 ± 8.00
Ensemble (all feat.)         9.12 ± 7.90      4.38 ± 3.74
Ensemble (relevant feat.)    7.87 ± 7.47      3.84 ± 3.63

Table 4: MAE of different algorithms for SBP and DBP estimation in 5-fold cross validation using the JSI-collected everyday-life dataset.

An ensemble of shallow regression trees has shown the best predictive performance in the 5-fold cross validation for both SBP and DBP using both datasets. We also notice a slightly better performance when only the relevant features, as given by RReliefF, are used in comparison to the default feature set.

As the ensemble of regression trees has shown the best performance, its hyperparameters were optimized using Bayesian optimization. All the available hyperparameters were optimized using the MATLAB built-in Bayesian optimization workflow [16]. It optimizes both the hyperparameters of the ensemble and the hyperparameters of the weak learners, which are chosen to be shallow regression trees. The optimization is run for 30 iterations, trying to minimize the objective cross-validation loss function. Bootstrap aggregation was chosen as superior to the gradient boosting strategy, and the optimal number of weak learners was determined to be 77. The maximum number of splits in the weak learners was determined to be 1, meaning that the regression trees are in fact regression stumps.
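For illustration, the winning configuration (bootstrap aggregation of 77 depth-1 regression trees) can be reproduced roughly as follows; the paper used MATLAB's ensemble implementation, so this scikit-learn version is only an approximate equivalent, not the authors' exact model.

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Bagging of 77 regression stumps (max_depth=1), mirroring the
# hyperparameters found by Bayesian optimization above.
# Note: the `estimator` argument requires scikit-learn >= 1.2
# (older versions call it `base_estimator`).
model = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=1),
    n_estimators=77,
    random_state=0,
)
# usage: model.fit(X_train, y_train); y_pred = model.predict(X_test)
```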
4.3.2 Leave-one-subject-out results

The lowest error using the MIMIC data was again achieved using the hyperparameter-tuned ensemble of regression trees with the RReliefF-selected subset of features. The highest amount of personalization (50%) gave the best results. 50% personalization corresponds to 10 BP measurements conducted by the subject, given the fact that 20 segments with 20 different BP values were taken. Obtaining 10 BP measurements by the subject, in order to personalize the model, seems like a reasonable requirement.

Figure 6: MAE for SBP and DBP for the MIMIC dataset, at different amounts of personalization.

The JSI-collected everyday-life data has proven to be more problematic, as there were only a few different BP values recorded in the first phase of data collection. Furthermore, due to the high amount of movement artefacts, a lot of data was removed by the cleaning algorithm, leaving a very small amount of usable data with very low variation in BP. This enhanced the performance of the dummy regressor, which achieved a much lower MAE compared to the MIMIC dataset; however, improvements were again achieved by using personalization, as shown in Figure 7.

Figure 7: MAE for SBP and DBP for the everyday-life dataset, at different amounts of personalization.

5 Conclusion

We have developed a system for BP estimation using only the PPG signal, and have evaluated its performance on two distinct datasets using two experimental setups.

The first module of the system deals with signal pre-processing, removing most movement artefacts and anomalies from the PPG signal. It then detects PPG cycles corresponding to heart beats and feeds them to the second module, which computes a number of features describing each cycle. This is followed by feature subset selection using the RReliefF algorithm and finally the features are fed into several regression algorithms. Predictive models were created and evaluated on the hospital MIMIC dataset as well as the everyday-life dataset collected at JSI. The lowest MAE achieved for the MIMIC hospital dataset in 5-fold cross validation were 4.90 ± 6.59 mmHg for SBP and 2.21 ± 3.70 mmHg for DBP. The best performing algorithm was an ensemble of shallow regression trees, whose hyperparameters were optimized using Bayesian optimization. Finally, the same models were evaluated on the same dataset using leave-one-subject-out validation, achieving the lowest MAE of 4.47 ± 5.85 mmHg for SBP and 2.02 ± 2.94 mmHg for DBP, again using the same hyperparameter-tuned ensemble and the subset of features selected by the RReliefF algorithm. These results were achieved using the maximum, 50% personalization. Similar trends can be observed for the everyday-life JSI-collected dataset. The lowest MAE in 5-fold cross validation were 7.87 ± 7.47 mmHg for SBP and 3.84 ± 3.63 mmHg for DBP. The ensemble of shallow regression trees with optimized parameters prevailed again. In LOSO validation, the lowest MAE of 8.57 ± 7.93 mmHg for SBP and 4.42 ± 3.61 mmHg for DBP were achieved.

5.1 Interpretation of results

Comparing the results of the 5-fold cross validation to those of the LOSO evaluation, we first notice that the best performing algorithm is the same. In each fold of the 5-fold cross validation, 80% of the randomly shuffled instances were taken for training, which translates to 80% personalization for each subject. This is the reason behind the lower MAE in the 5-fold cross validation; however, similar MAE was also achieved with higher amounts of personalization in the LOSO experiment. The developed system shows promising results and could be used by both regular people and hypertensive patients during their everyday routine, by wearing an unobtrusive wristband. It could inform them of their current medical condition regarding BP. Further testing with more field-collected data is required to more accurately determine its performance; however, it already achieves low MAE when personalization is considered.

5.2 Future work

We plan to expand our data collection experiment at JSI, which will give us more data and more variety within the collected BP data.
Once enough data is collected, we plan to upgrade the machine learning part of our pipeline using deep-learning algorithms. These are well suited to problems dealing with signal analysis and have represented the state of the art in signal processing in recent years, making them a suitable candidate for our domain.

Acknowledgement

The HeartMan project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 689660. Project partners are Jožef Stefan Institute, Sapienza University, Ghent University, National Research Council, ATOS Spain SA, SenLab, KU Leuven, MEGA Electronics Ltd and European Heart Network.

References

[1] The World Health Organization. "The top 10 causes of death", 2015.

[2] Mayo Foundation for Medical Education and Research (MFMER). "Blood pressure chart: What your reading means". Accessed online: 2nd March, 2018.

[3] Teng et al. "Continuous and noninvasive estimation of arterial blood pressure using a photoplethysmographic approach", Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE Cat. No.03CH37439), 2003.

[4] Frese et al. "Blood Pressure Measurement Guidelines for Physical Therapists", Cardiopulmonary Physical Therapy Journal, 2011.

[5] Shelley et al. "Pulse Oximeter Waveform: Photoelectric Plethysmography", Clinical Monitoring: Practical Applications for Anesthesia and Critical Care, 2001.

[6] Tamura et al. "Wearable Photoplethysmographic Sensors – Past and Present", Electronics, 2014.

[7] Goldberger et al. "PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals", Circulation, 2000.

[8] Moody et al. "A Database to Support Development and Evaluation of Intelligent Intensive Care Monitoring", Computers in Cardiology, 1996.

[9] Lamonaca et al. "A neural network-based method for continuous blood pressure estimation from a PPG signal", IEEE International Congress I2MTC, 2013.

[10] Lamonaca et al. "Application of the Artificial Neural Network for blood pressure evaluation with smartphones", 2013 IEEE 7th International Conference on Intelligent Data Acquisition and Advanced Computing Systems (IDAACS), 2013.

[11] Najarian and Splinter. "Biomedical Signal and Image Processing, 2nd Edition", CRC Press, 2012.

[12] Robnik-Šikonja and Kononenko. "Theoretical and Empirical Analysis of ReliefF and RReliefF", Machine Learning, 2003.

[13] Xing et al. "Optical Blood Pressure Estimation with Photoplethysmography and FFT-Based Neural Networks", Biomedical Optics Express, 2016.

[14] Lázaro et al. "Pulse Rate Variability Analysis for Discrimination of Sleep-Apnea-Related Decreases in the Amplitude Fluctuations of Pulse Photoplethysmographic Signal in Children", IEEE Journal of Biomedical and Health Informatics, 2014.

[15] Li et al. "Dynamic time warping and machine learning for signal quality assessment of pulsatile signals", Physiological Measurement, 2012.

[16] The MathWorks Inc., Natick, Massachusetts, United States. "MATLAB 2017a Optimization Toolbox", 2017.

[17] Coleman et al. "Validation of the Omron M7 (HEM-780-E) oscillometric blood pressure monitoring device according to the British Hypertension Society protocol", Blood Pressure Monitoring, 2008.