https://doi.org/10.31449/inf.v45i4.3747 Informatica 45 (2021) 643–652

Impact of Gaussian Noise for Optimized Support Vector Machine Algorithm Applied to Medicare Payment on Raspberry Pi

Shrirang Ambaji Kulkarni, Varadraj Gurupur, Christian King and Andriy Koval
School of Global Health Management and Informatics, University of Central Florida, 32816, Orlando, Florida, USA
E-mail: sakulkarni@ucf.edu, varadraj.gurupur@ucf.edu, christian.king@ucf.edu, Andriy.v.koval@gmail.com

Keywords: medicare analysis, internet of things, data, statistical feature optimization techniques, support vector machine, pipelined models

Received: September 11, 2021

A relatively large dataset coupled with an efficient but computationally slow machine learning algorithm poses a great challenge for the Internet of Things (IoT). Deep Learning Artificial Neural Networks (DLANNs), by contrast, are known for good accuracy but are by nature computationally intensive. The purpose of this article is therefore to apply a pipelined Support Vector Machine (SVM) learning algorithm for benchmarking public health data on an IoT device. SVM is a well-performing machine learning algorithm, but it is constrained by long training times and its performance is susceptible to noise. A software pipelined architecture was applied to SVM to minimize its computational time on a resource-constrained device such as the Raspberry Pi, and it was tested on a Medicare dataset with added Gaussian noise to assess the impact of noise. The classification results obtained for Total Medicare Standardized Payment Amount indicate that the proposed pipelined SVM model outperformed the DLANN model by 79.74% in terms of computational time. The performance of SVM in terms of area under the curve (AUC) was also better than that of the other models, outscoring Logistic Regression by 7.2% and the DLANN model by 22.65%.

Povzetek: The impact of Gaussian noise on the SVM method applied to Medicare payments is analysed.

1 Introduction

Allhoff and Henschke [1] indicate that the Internet of Things (IoT) will become one of the great technologies that revolutionize information capabilities and will have a tremendous impact on society at large. It should be noted that IoT devices have limitations in processing power, memory and secondary storage compared to laptops, workstations and servers. Haller et al. [2][3] define IoT as "a world where physical objects are seamlessly integrated into the information network, and where the physical objects can become active participants in business process." Gokhale et al. [4], on the other hand, define IoT simply as a "network of physical objects"; generally speaking, devices, vehicles, buildings and other forms of hardware together with their embedded software can be conceived as physical objects. IoT is of special importance to the world of healthcare, where organizations across the healthcare ecosystem are working to reduce costs and improve productivity. IoT is especially useful in decision support, transmitting information, and device control, much of which pertains to the field of healthcare informatics.
Healthcare informatics is defined by Wan and Gurupur [5] as "a transdisciplinary study of the data flow and processing into more abstract forms such as information, knowledge, and wisdom along with the associated systems needed to synthesize or develop decision support systems for the purpose of helping the healthcare management processes achieve better outcomes in healthcare delivery." The processes involved in synthesizing and developing decision support systems from knowledge and information require innovative computational solutions and bolster the need to advance data science, especially machine learning.

Machine learning can be performed effectively only in a suitable computational environment. Edge computing, or fog computing, is becoming popular day by day as advanced biomedical devices collect patient medical data, thereby further improving processes associated with healthcare delivery. The advantages in terms of reduced latency between users, edge infrastructure and the cloud are evident, as described by Shukla et al. [6]. The central storage and sophisticated processing provided by cloud facilities may at times suffer from network latency issues for real-time applications and may act as a single point of failure. In this context, Machine Learning (ML) algorithms are being applied in a plethora of applications.

In the work delineated in this article the investigators explore the Raspberry Pi as an edge computing device for benchmarking a popular ML algorithm, the Support Vector Machine (SVM). The SVM is defined by Noble [7] as "a computer algorithm that learns by example to assign labels to objects." As explained by Noble [7], SVM is a key algorithm that can be used effectively to identify patterns and to train and label data for the purpose of classification. Here the classifier's performance is measured using the concept of Area Under the Curve (AUC), as explained by Bradley [8]; this attribute is a key desired characteristic for analysing healthcare data. In the recent past many investigators have used the combination of Raspberry Pi and SVM to identify noise and patterns.

For experimentation and demonstration the investigators used healthcare data with 40,662 rows and 28 variables, the logistic regression algorithm as a computational-time baseline, and a Deep Learning Artificial Neural Network (DLANN) for benchmarking the accuracy of the classification of Total Medicare Standardized Payment Amount. The reason for choosing SVM is its ability to produce results at a high level of accuracy; however, SVM tends to be constrained by high computational time and memory complexity for larger training data [9]. This problem is compounded by the constrained computational resources of a Raspberry Pi and the presence of noisy data. The solution explored is an application of the pipeline architecture to SVM, with its performance measured against the benchmarks set by logistic regression and a deep learning neural network on the same dataset.
The specific research objectives of the analysis are as follows:
• To analyse the performance of SVM against benchmarks such as a deep learning neural network and logistic regression, in terms of accuracy and computational time, on an optimized, selected-variable dataset in the resource-constrained environment of the Raspberry Pi; and
• To implement a pipelined architecture model for SVM with feature selection and ascertain the consistency of its performance, in terms of metrics and robustness, by evaluating it on a dataset corrupted with Gaussian noise.

The presentation of a pipelined architecture is intended to contribute to the science of applying SVM to Medicare and Medicaid type datasets. Here the investigators are mindful of the fact that datasets of different sizes and complexities require different machine learning approaches for analysis. More importantly, the key targeted contribution of the experimentation explained in this article is a computational method that can be used effectively in analysing healthcare data.

2 Related work

SVM suffers from long training times [9][10] and memory complexity issues. These problems are compounded for large datasets and for noisy data, where SVM is at a disadvantage in terms of performance. Huang et al. [11] applied SVM with Genetic Algorithms (SVM-GA) to credit scoring. One of the drawbacks they observed was that SVM-GA took a long time to train, and they proposed that SVM-GA was better suited to parallel architectures. Yazici et al. [12] observed the performance of machine learning algorithms on the Raspberry Pi as part of their study of the edge computing paradigm; some of their results showed that SVM was slightly faster in inference and also efficient in power consumption. The above works motivated us to reduce SVM's computational time by integrating it with a pipelined architecture model for moderately large datasets in a resource-constrained environment like the Raspberry Pi.

Nguyen and Torre [13] discussed how feature selection aids Support Vector Machines towards generalization and computational efficiency. The authors proposed a convex energy-based framework for joint feature and parameter selection; experiments on seven different datasets showed that feature selection retained the desired performance. Sanz et al. [14] noted that predictor models built from the most relevant variables are an important criterion in biomedical research. They proposed an extension of Recursive Feature Elimination (RFE) based on non-linear SVM kernels, and the proposed methods performed better than classical RFE when applied to three different datasets.

Logistic regression, a supervised learning model, is one of the popular models for classifying medical healthcare data. Logistic regression usually works well on large sample sizes, hence the motivation to apply it to our 2014 Medicare Provider Utilization and Payment Data [15]. Zardo and Collie [16] successfully used logistic regression to identify critical predictor variables in public health policy research in Australia. Sheets et al. [17] demonstrated the use of logistic regression to identify attributes associated with high utilization of Medicare payments, which creates a burden on US taxpayer dollars.
Their research focused on chronic patients in managed care, proactively identifying high-risk patients to reduce the cost of healthcare. The present study therefore extends logistic regression to the resource-constrained environment of the Raspberry Pi.

Deep Learning Artificial Neural Networks (DLANNs) are more specialized forms of artificial neural networks; they can learn on their own and handle huge datasets to provide superior classification accuracy, but they also need substantial computational resources. Sakr et al. [18] applied Convolutional Neural Networks (CNNs) and SVM to automate waste sorting on a Raspberry Pi 3; SVM achieved higher classification accuracy, outscoring the CNN by 11.8%. Ravi et al. [19] studied the impact of deep learning algorithms on health informatics and summarized that most deep learning algorithms had been applied to balanced or synthetic datasets, and that deep learning algorithms required large amounts of training data. Thus, using logistic regression and deep learning as baselines, the investigators benchmark the classification accuracy and related performance of a support vector machine in a pipelined architecture on a resource-constrained device, the Raspberry Pi, which holds a lot of promise for edge computing. This analysis was carried out on a dataset of 40,662 records [15].

Gangsar and Tiwari [20] studied the impact of noise on fault diagnosis of electric machines. They found that on the unperturbed original signal SVM predicted with high accuracy at all speeds; however, when white Gaussian noise was applied to the raw signal, the overall prediction accuracy fell by 10%. They considered 2% external noise in their study. Pei et al. [21] studied images corrupted with white Gaussian noise and its effect on the performance of convolutional neural networks (CNNs); as the percentage of added noise increased, accuracy decreased. Wu and Zhu [22] analysed real-world data in terms of the noise-handling features of data mining algorithms and argued that error-aware data mining algorithms improve data mining results. Last but not least, Zualkernan et al. [23] considered the application of remote cameras for monitoring animals; they proposed an IoT-based system in which images captured by a camera are processed at the edge on a Raspberry Pi and the accuracy results are moved to a cloud database system. To summarize, the application of SVM and related data science methods has immense potential that needs to be explored further, and the experimentation presented in this article is a step in that direction.

3 Method and experiments

The block architecture of the experimental setup is illustrated in Figure 1. The experiments were executed once the platform was laid; this included implementing the pipelined model for SVM, installing TensorFlow for the deep learning algorithms, and setting up a computational-time measurement model in the resource-constrained environment of the Raspberry Pi.

Figure 1: Block architecture of the experimental setup.

3.1 Statistical optimization and performance

The dataset used for experimentation is medical healthcare data that contains records for physical therapy patients and the amounts paid to the physical therapists in each case [15]. It becomes imperative to consider feature selection techniques for dataset pruning as an optimization technique in a resource-constrained environment.
The hardware platform used for the experiment was a Raspberry Pi 3 Model B with a quad-core 1.2 GHz 64-bit ARM Cortex-A53 CPU and 1 GB of RAM. From a software perspective, a Python program was written using NumPy, pandas and scikit-learn [24], along with Keras and TensorFlow, to apply logistic regression, SVM and DLANN to all variables and model the task as a supervised classification problem. This software platform was also used to compute metrics such as K-fold cross-validation, the confusion matrix and the Area Under the Curve (AUC).

The reasons for applying statistical techniques to the dataset are as follows:
a) To optimize the data features so that the machine learning algorithms can classify with a smaller number of variables.
b) To identify outliers and remove them from the dataset so that we have a statistically more normalized dataset.

Feature selection is an important step in the application of machine learning, as it can improve model performance in terms of computational execution speed, while the presence of irrelevant features may negatively affect it. This creates the need for parsimonious models, whose advantages include less overfitting, more accurate results and reduced computation time. Therefore, feature selection was the first step in the process. It was implemented using the Python scikit-learn library [24], which provides a class called SelectKBest, to which the investigators supplied the f_classif score function. SelectKBest retains the K highest-scoring features of the input dataset X, excluding the target variable; in our case the value of K was 10. Using this process the investigators listed the features with the top 10 F-scores in Table 1.

Table 1: Feature selection based on F-score.

Feature variable name | F_Score
Number of Services | 22369.69
Total Medicare Standardized Payment Amount | 22184.17
Total Medicare Allowed Amount | 22119.67
Total Submitted Charge Amount | 19193.84
proxy for # of new patients | 19177.12
Number of Medicare Beneficiaries | 18581.63
Average Medicare Standardized Amount per Beneficiary | 7535.67
Number of HCPCS | 6275.17
Physical therapy services that involve Physical Agents | 1998.79
Physical therapy services that involve Therapeutic Practice | 1998.79

This was followed by the statistical detection of outliers [25]. As defined by Zhao et al. [26], an "outlier is considered as a data point which is far from other observations." The investigators believe that the presence of outliers may have an impact on the final results of machine learning models. With this in mind, the interquartile range (IQR) was applied to detect outliers. Technically, as applied in [27], the IQR is measured as the difference between the third quartile and the first quartile, i.e. IQR = Q3 − Q1. After applying this operation the investigators removed 6,579 entries from the dataset.

The skewness of the dataset was then measured. Skewness, as indicated by [28], characterizes the departure of the values from a normal distribution. Finding outliers and removing them from the dataset is one way of handling skewness, a process outlined in [29]. Thus, we measured the skewness of the selected features before and after removing outliers from our dataset (Table 2); it can be observed in Table 2 that after removing outliers the skewness of the selected features is reduced.

Table 2: Measuring skewness with and without outliers.

Feature variable name | Skewness with outliers | Skewness without outliers
Number of HCPCS | 0.59 | 0.26
Number of Medicare Beneficiaries | 2.70 | 0.98
Average Medicare Standardized Amount per Beneficiary | 2.05 | 0.66
Physical therapy services that involve Physical Agents | 1.53 | 1.17
Physical therapy services that involve Therapeutic Practice | -1.53 | -1.17
proxy for # of new patients | 2.87 | 0.78
Number of Services | 3.96 | 1.06
Total Submitted Charge Amount | 3.97 | 1.07
Total Medicare Allowed Amount | 4.15 | 1.01
Total Medicare Standardized Payment Amount | 4.55 | 1.05

The analysis of binary classification on the selected variables with logistic regression, SVM and DLANN is illustrated in Figure 2.
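To make these steps concrete, the following minimal Python sketch shows how the feature selection and IQR-based outlier pruning described above could be implemented with pandas and scikit-learn. The input file name follows Algorithm 3 (newmeddata.csv), while the name of the binary target column is an assumption made purely for illustration:

import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Load the Medicare dataset; the target column name is assumed here.
data = pd.read_csv("newmeddata.csv")
X, y = data.drop(columns=["target"]), data["target"]

# Retain the 10 features with the highest ANOVA F-scores (cf. Table 1).
selector = SelectKBest(f_classif, k=10).fit(X, y)
X = X[X.columns[selector.get_support()]]

# Drop rows falling outside 1.5 * IQR of the first/third quartiles.
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr = q3 - q1
mask = ~((X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)).any(axis=1)
X, y = X[mask], y[mask]

# Skewness of the retained features after pruning (cf. Table 2).
print(X.skew())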
The metrics applied were a K-fold cross-validation test, confusion matrix metrics and the Area Under the Curve (AUC). Cross-validation is used to gauge the effectiveness of a model: it involves testing on a sample of the dataset while training the model on the remaining part [30]. The value of k determines the number of groups into which the data are split; in our case we set k to 10, hence the name 10-fold cross-validation. Additionally, the investigators used a confusion matrix, also termed an error matrix, to analyse the performance of a machine learning algorithm in matrix form [31], as shown in Table 3. In the confusion matrix, TP stands for true positive, TN for true negative, FN for false negative and FP for false positive. The notation used below is: S_TP denotes the samples that are true positives, S_TN the samples that are true negatives, S_FP the samples that are false positives and S_FN the samples that are false negatives.

Table 3: Layout of the confusion matrix.

                   | Actual positive | Actual negative
Predicted positive | TP              | FP
Predicted negative | FN              | TN

The overall procedure is summarized in Algorithm 1.

Algorithm 1.
Input: Medicare data from a CSV file
Output: Accuracy score
1. Select the features using the F-score
   # SelectKBest() is a function under feature_selection in the sklearn
   # library; f_classif uses the ANOVA F-value for classification purposes
   selec_features ← SelectKBest(f_classif, k=10)
2. Remove the outliers using the Z-score
   # zscore is a function available in the SciPy package under the stats module
   z ← np.abs(stats.zscore(data))
3. Compute the skewness to assess the distribution of values
   # the pandas library is used to measure unbiased skewness
   skw ← data.skew()
4. Remove the outliers by discarding anything outside the lower and upper bounds
   IQR ← Q3 − Q1
   l_bound ← Q1 − (IQR × 1.5)
   u_bound ← Q3 + (IQR × 1.5)
5. Assign X to the feature columns and y to the target
6. Split X and y into training and testing sets in the ratio 80:20
7. Train the models (logistic regression, SVM and the DLANN model)
8. Predict the target with the trained models
9. Compute the K-fold accuracy for the models
   # KFold from the sklearn library splits the data into 10 folds, where
   # 9 folds are used for training and 1 fold for validation in an
   # iterative manner; random_state=7 seeds the random number generator
   kfold ← KFold(n_splits=10, random_state=7)
10. Compute the confusion matrix metrics and ROC for the models
11. Plot the area under the receiver operating characteristic curve using
    the metrics module of the sklearn library
    auc_score ← metrics.roc_auc_score(y_test, y_pred_prob)
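As a concrete illustration of steps 6–11 of Algorithm 1, the sketch below evaluates one of the three models (logistic regression; the DLANN branch is omitted because its layer architecture is not specified in the text). It assumes the pruned feature matrix X and binary target y from the previous step, and it passes shuffle=True to KFold, which recent scikit-learn versions require whenever a random_state is supplied:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score

# Step 6: 80:20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

# Steps 7-8: train a model and predict the target.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 9: 10-fold cross-validated accuracy.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X_train, y_train, cv=kfold)
print("K-fold accuracy: %.2f%%" % (100 * scores.mean()))

# Steps 10-11: confusion matrix counts (Table 3) and AUC.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
auc_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("TP=%d TN=%d FP=%d FN=%d AUC=%.3f" % (tp, tn, fp, fn, auc_score))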
The accuracy of a classification model [32] is determined in the present study from the counts of the confusion matrix and is given in Equation 1:

$\mathit{Accuracy}_{model} = \dfrac{S_{TP} + S_{TN}}{S_{Total}}$   (1)

where $\mathit{Accuracy}_{model}$ is the classification accuracy and $S_{Total}$ is the total number of samples. A high accuracy such as 99% is good, but its interpretation also depends on the dataset. The precision of a classification model gives the percentage of correct results among all returned positive results and is given in Equation 2:

$\mathit{Precision}_{model} = \dfrac{S_{TP}}{S_{TP} + S_{FP}}$   (2)

where $\mathit{Precision}_{model}$ is the precision of a machine learning model for a classification problem. Recall is the capacity of the model to find the data points of interest and is given in Equation 3:

$\mathit{Recall}_{model} = \dfrac{S_{TP}}{S_{TP} + S_{FN}}$   (3)

where $\mathit{Recall}_{model}$ measures the correct classification of positive samples by the machine learning model for the given binary classification problem.

One of the limitations of accuracy is its dependence on the test sample size, which in our experiments was 20%. For a binary classifier such as ours, the Area Under the Curve (AUC) gives a more general measure, since it evaluates the binary classifier against random guessing across thresholds; AUC is therefore a better perceived measure than accuracy, which is tightly coupled to a single threshold. When accuracy cannot clearly distinguish between machine learning models, AUC can act as an alternative deciding parameter [33].

The experimentation conducted provided K-fold validation scores of 94.10% and 99.97% for logistic regression and SVM respectively; thus the K-fold accuracy of SVM is superior to that of logistic regression by 12.15%. We now consider the confusion matrix metrics for the selected-feature dataset, as illustrated in Figure 2. It is further observed in Figure 2 that SVM was the top performer and marginally outscored DLANN, an interesting observation which needs to be analysed further. Figure 3 shows the AUC for logistic regression, SVM and DLANN: SVM has the highest AUC of 1.0, followed by DLANN with an AUC of 0.99, while the AUC of logistic regression is the lowest at 0.98.

Figure 2: Confusion matrix metrics for logistic regression, SVM and DLANN for the selected-feature dataset.

Figure 3: AUC for logistic regression, SVM and DLANN for the selected-feature dataset.

3.2 Computational time analysis

Based on the observations made from the binary classifier models, it becomes imperative that, apart from scoring high on accuracy and other associated metrics, computational efficiency in a resource-constrained IoT environment is a necessary attribute of a low-cost data analysis system. Therefore, the investigators compared the computational time of each model used for the analysis. The hardware platform used for this aspect of the analysis was a Raspberry Pi with a quad-core 1.2 GHz Broadcom BCM2837 64-bit CPU and 1 GB of RAM. The results of this analysis are illustrated in Table 4.

Table 4: Computational time of the machine learning and deep learning models.

Binary classifier model | Computational time in seconds (Raspberry Pi B)
Logistic Regression – Selected Dataset | 37.81
SVM – Selected Dataset | 3949.02
DLANN – Selected Dataset | 1918.59

As mentioned before, the application of feature selection and removal of outliers reduced the dataset size from 8.7 MB to 2.9 MB; for a dataset with selected variables, the computational time is therefore naturally lower, which is of significance for resource-constrained IoT environments such as the Raspberry Pi. It is observed that, on the dataset with selected variables, logistic regression outperforms SVM by 99.04% and DLANN by 98.02%, which clearly indicates that logistic regression is the most computationally efficient of the three. At the same time, SVM outperformed DLANN and logistic regression in terms of AUC, confusion matrix metrics and the K-fold validation tests. This motivated the investigators to pursue further analysis and build a model in which SVM provides robust performance while also being computationally time-efficient.
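The figures in Table 4 are wall-clock training times; a minimal sketch of how such measurements can be taken on the device (reusing the models and training split from the previous sketch) is:

import time

def timed_fit(model, X_train, y_train):
    # Return the wall-clock training time in seconds,
    # as it would be measured on the Raspberry Pi itself.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    return time.perf_counter() - start

# Example: print(timed_fit(LogisticRegression(max_iter=1000), X_train, y_train))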
3.3 Pipelined support vector machine architecture and Gaussian noise

A pipeline allows us to fit a model by combining a number of transformations with a final predictor that is executed once. The software pipeline architecture as provided by scikit-learn [24] is illustrated in Figure 4. In Python, the Pipeline class [34] allows the collation of multiple processing steps into a single estimator; therefore we can fit the pipeline to the whole training data and transform the test data without performing each step individually. Linear Support Vector Classification (LinearSVC) uses a linear kernel, is fast and scales well. These components were fed to the pipeline to reduce the computational time required by SVM on the Raspberry Pi. The procedure implemented in our pipelined SVM model is illustrated in Algorithm 2.

Figure 4: Pipeline architecture for SVM on the Raspberry Pi (a StandardScaler is fitted to, and transforms, the training dataset; the testing dataset is transformed; LinearSVC is the learning algorithm producing the prediction accuracy).

Algorithm 2.
Output: Pipelined architecture of SVM
1. pipe_lrSVC ← Pipeline([('scaler', StandardScaler()), ('clf', LinearSVC())])   # build the pipeline
2. pipe_lrSVC.fit(X_train, y_train)   # fit
3. y_pred ← pipe_lrSVC.predict(X_test)   # predict

Gaussian noise was then added to the dataset to benchmark the performance of SVM against logistic regression and the deep learning artificial neural network. The presence of additive Gaussian noise [35][36] is known to have an impact on the distribution of the data. To check the robustness of the different classifier models, a common data corruption technique, Gaussian noise, was applied; similar analyses were conducted in [37] to benchmark neural network robustness. In our work the noise signal was set with mean μ = 0 and standard deviation σ = 0.1, and it was simulated with the NumPy random normal function, which generates values from the Gaussian distribution. The additive noise is generalized [38] in Equation 4:

$M_{R_{no},F_{no}} = O_{R_{no},F_{no}} + \epsilon_{R_{no},F_{no}}$   (4)

where $M_{R_{no},F_{no}}$ is the modified data point in row $R_{no}$ and feature column $F_{no}$, $O_{R_{no},F_{no}}$ is the original data point, and $\epsilon_{R_{no},F_{no}}$ is random noise drawn approximately from the distribution $N(\mu, \sigma^2)$, where μ is the mean and $\sigma^2$ the variance. The Gaussian noise implementation is given in Algorithm 3.

Algorithm 3.
Input: newmeddata.csv   # the original dataset
Output: noisy_data.csv   # the noisy dataset
1. σ ← 0.1   # standard deviation
2. μ ← 0   # mean
3. noise ← σ * random(size(actual_data)) + μ   # Gaussian noise samples
4. noisy_data ← actual_data + noise   # data with added Gaussian noise
5. target_variable ← int(actual_target_variable + noise)
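A compact Python rendering of Algorithms 2 and 3 might look as follows; it is a sketch that reuses the train/test split from earlier, and the random seed is an assumption added only for reproducibility:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Algorithm 2: scaling plus the linear-kernel SVM as a single estimator.
pipe_lrSVC = Pipeline([('scaler', StandardScaler()),
                       ('clf', LinearSVC())])
pipe_lrSVC.fit(X_train, y_train)
y_pred = pipe_lrSVC.predict(X_test)

# Algorithm 3 / Equation 4: additive Gaussian noise with mu = 0, sigma = 0.1.
rng = np.random.default_rng(7)  # seed assumed for reproducibility
noise = rng.normal(loc=0.0, scale=0.1, size=X_train.shape)
X_train_noisy = X_train + noise  # corrupted copy used in the robustness tests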
The K-fold validation tests on the noisy data yielded 78.88% for logistic regression and 78.58% for SVM, indicating similar performance of the two models in the presence of Gaussian noise. The performance of logistic regression dropped by 15.22 percentage points and that of SVM by 21.39 percentage points, which indicates that in the presence of noise logistic regression still performed at an acceptable level. We continued our experiments with the confusion matrix metrics, as illustrated in Figure 5. From Figure 5 it is observed that the accuracy of SVM, at 58.44%, is the lowest in the presence of Gaussian noise. The precision of logistic regression and DLANN remained reasonable, with similar values of 49.79% and 50.79% respectively. However, precision was one of the worst affected metrics: compared with the models run on the noise-free selected-feature dataset, precision dropped by 46.8% for logistic regression, 66.53% for SVM and 49.11% for DLANN. The low precision of SVM basically indicates a large number of false positives. On the contrary, a high recall of 99.16% indicates that SVM was very sensitive and could successfully identify true positive observations.

Figure 5: Confusion matrix metrics for logistic regression, SVM and DLANN for the selected-feature dataset with Gaussian noise.

The analysis was continued for the AUC metric. From Figure 6 the investigators observe that the performance of SVM is better than that of the other models: it outscores logistic regression by 7.2% and the DLANN model by 22.65%.

Figure 6: AUC for logistic regression, SVM and DLANN for the selected-feature dataset with Gaussian noise.

3.3.1 Computational time analysis

As indicated in the introduction, the investigators also performed a computational time analysis of the different methods on the noisy dataset; it is illustrated in Table 5.

Table 5: Computational time of the machine learning and deep learning models with Gaussian noise.

Binary classifier model | Computational time in seconds (Raspberry Pi B)
Logistic Regression – Selected Dataset with Gaussian Noise | 23.57
SVM – Selected Dataset with Gaussian Noise | 382.34
DLANN – Selected Dataset with Gaussian Noise | 1887.36

Here we observe that logistic regression was again the most computationally efficient in terms of execution time. However, with a pipeline, SVM outperformed its nearest competitor, DLANN, by 79.74%, while remaining inferior to logistic regression by 93.83%; SVM therefore clearly improved its performance in terms of computational execution time. Additionally, it was observed that in the presence of Gaussian noise the accuracy of most of the models dropped, DLANN emerged as a slight winner with a little more consistency, and SVM exhibited low precision and high recall, thereby exhibiting its fitness for the dataset under consideration. The proposed pipelined model of SVM also achieved better performance in terms of computational time than its nearest competitor, the DLANN model.
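For clarity, the two relative differences quoted above follow directly from the training times in Table 5:

\[
\frac{1887.36 - 382.34}{1887.36} \approx 0.7974 \;(79.74\%), \qquad
\frac{382.34 - 23.57}{382.34} \approx 0.9384,
\]

the latter matching the reported 93.83% to within rounding.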
4 Discussion

In the present work the investigators implemented a pipelined SVM model and tested it against the known benchmarks of logistic regression and a deep learning neural network, optimizing performance in terms of computational time and accuracy metrics for the resource-constrained environment of the Raspberry Pi. The investigators used the statistical technique of the F-score for feature selection and shortlisted the top 10 features, and further processed outliers by applying the interquartile range, which helped to balance the skewness of the data. The modified dataset, with reduced storage requirements, was then used on the Raspberry Pi to benchmark machine learning models such as logistic regression, SVM and DLANN on a binary classification task. The K-fold accuracy of SVM was superior to that of logistic regression by 12.15%. Confusion matrix metrics were further applied to test the machine learning models, and SVM achieved better performance and at times was on par with the deep learning neural network.

The uniqueness of the present work is that it addressed the training time of SVM, which is usually large. Reducing training time was of paramount importance because the platform on which SVM was to be implemented was the Raspberry Pi. This was achieved by implementing SVM with a pipelined architecture; SVM thereby achieved better computational time than its nearest competitor, the DLANN model, by 79.74%.

SVM is prone to noise, so the optimized, pipelined SVM architecture was benchmarked against deep learning in the presence of Gaussian noise. The accuracy of most of the models dropped; DLANN emerged as a slight winner with a little more consistency, while SVM exhibited low precision and high recall, thereby exhibiting its fitness for the dataset under consideration. The better accuracy of DLANN with selected features under noise may be attributed to the noise acting as a regularization factor, thus boosting the performance of DLANN. This clearly provides a pathway for future work on extending pipeline architectures to deep learning algorithms [39][40][41], which are effective but slow, and which are envisioned for the resource-constrained environments of IoT. Table 6 compares the present study with related research.

Table 6: Comparison of research projects and analysis methods.

Research study | Analysis techniques | Results
This project | Pipelined Support Vector Machine, Logistic Regression and Deep Learning Artificial Neural Network in a Raspberry Pi environment | The pipelined SVM achieved better computational time than its nearest competitor, the DLANN model, by 79.74%.
Sheets et al., 2017 [17] | Combination of contrast mining and logistic regression | Electronic Health Record (EHR) contrast mining with logistic regression predicted the 5% of patients contributing to 50% of healthcare expenses.
Nalepa & Kawulok, 2019 [9] | Trained Support Vector Machines for large datasets with different kernels | SVM has been successful in solving a variety of pattern recognition tasks; its main drawbacks are the huge time and memory related complexities.
Sakr et al., 2016 [18] | Deep Learning Convolutional Neural Network (CNN) and Support Vector Machine | The SVM model achieved a high classification accuracy of 94.8% while the CNN achieved 83%.
Pei et al., 2021 [21] | Deep Learning Convolutional Neural Network (CNN) and white Gaussian noise | The classification performance of a deep learning CNN drops significantly when noise is added.

4.1 Limitations of the present work

The analysis considered a single medical dataset; in future, the capabilities of the models could be generalized over a range of datasets. With parallel environments for the machine learning models, and with IoT clusters based on graphics processing units (GPUs) for remote computing, the models could be made much more computationally feasible. Techniques like PCA for feature selection, and their interaction with deep learning algorithms, were also not explored in the present work.

5 Conclusion

Overall, the investigators conclude that SVM exhibited robustness, with relatively good performance across all computational setups of optimized and corrupted datasets in the resource-constrained environments of IoT. The impact of additive noise had distressing effects on most models and may be a concern in environments where devices collect data from sensors. As stated, the analysis was conducted on a single dataset, thereby limiting the validation of the conclusions derived. The feature selection reduced the dataset size by 67% with only a minor loss in the accuracy of classifier models such as logistic regression, SVM and DLANN. Therefore, we can safely suggest that SVM had a relatively stable performance across all scenarios and at times was better than the DLANN model. Additionally, we suggest that pipelined architectures and the automation of machine learning models have a good impact on resource-constrained environments like the Raspberry Pi: the pipelined SVM model outscored the DLANN model by 79.74% in computational time on the feature-selected dataset with added Gaussian noise.
Thereby, the investigators conclude that SVM is the model of choice for analysing similar datasets. The core contributions of this work were: i) implementing a pipelined Support Vector Machine model benchmarked against logistic regression and a deep learning neural network for computational time efficiency and accuracy metrics on a relatively large dataset, and ii) a brief computational time analysis of these methods for SVM using the Raspberry Pi. In future, the investigators would like to explore machine learning and deep learning models that can detect noise and outliers and automatically improve their learning abilities within complex pipelined models, in the constrained environment of an IoT device enabled by a Graphics Processing Unit (GPU).

Acknowledgments

The authors would like to thank the School of Global Health Management and Informatics for permission to use the University of Central Florida (UCF) Decision Support Systems and Informatics Laboratory facilities to conduct the research work and related documentation.

References

[1] Allhoff F. & Henschke A. (2018). The Internet of Things: Foundational ethical issues, Internet of Things, pp. 55–66. https://doi.org/10.1016/j.iot.2018.08.005
[2] Haller S., Karnouskos S. & Schroth C. (2009). The Internet of Things in an Enterprise Context, in Future Internet – FIS 2008, Lecture Notes in Computer Science, vol. 5468, pp. 14–28. https://doi.org/10.1007/978-3-642-00985-3_2
[3] Zhang Z-K., Cho M., Wang C-W., Hsu C-W., Chen C-K. & Shieh S. (2014). IoT Security: Ongoing Challenges and Research Opportunities, Proceedings of the 2014 IEEE 7th International Conference on Service-Oriented Computing and Applications, pp. 230–234. https://doi.org/10.1109/SOCA.2014.58
[4] Gokhale P., Bhat O. & Bhat S. (2018). Introduction to IoT, International Advanced Research Journal in Science, Engineering and Technology, vol. 5(1), pp. 41–44. https://doi.org/10.17148/iarjset.2018.517
[5] Wan T.T.H. & Gurupur V. (2020). Understanding the Difference between Healthcare Informatics and Healthcare Data Analytics in the Present State of Health Care Management, Health Services Research & Managerial Epidemiology, vol. 7, pp. 1–3.
http://dx.doi.org/10.1177/2333392820952668
[6] Shukla S., Hassan M.F., Khan M.K., Jung L.T. & Awang A. (2019). An analytical model to minimize the latency in healthcare internet-of-things in fog computing environment, PLoS ONE, pp. 1–31. http://dx.doi.org/10.1371/journal.pone.0224934
[7] Noble W.S. (2006). What is a support vector machine? Nature Biotechnology, vol. 24, pp. 1565–1567. https://doi.org/10.1038/nbt1206-1565
[8] Bradley A.P. (1997). The Use of the Area Under the ROC Curve in the Evaluation of Machine Learning Algorithms, Pattern Recognition, vol. 30(7), pp. 1145–1159. https://doi.org/10.1016/S0031-3203(96)00142-2
[9] Nalepa J. & Kawulok M. (2019). Selecting training sets for support vector machines: a review, Artificial Intelligence Review, vol. 52, pp. 857–900. https://doi.org/10.1007/s10462-017-9611-1
[10] Papadonikolakis M., Bouganis C. & Constantinides G. (2009). Performance comparison of GPU and FPGA architectures for the SVM training problem, 2009 International Conference on Field-Programmable Technology, pp. 388–391. https://doi.org/10.1109/FPT.2009.5377653
[11] Huang C-L., Chen M-C. & Wang C-J. (2007). Credit scoring with a data mining approach based on support vector machines, Expert Systems with Applications, vol. 33, pp. 847–856. https://doi.org/10.1016/j.eswa.2006.07.007
[12] Yazici M.T., Basurra S. & Gaber M.M. (2018). Edge Machine Learning: Enabling Smart Internet of Things Applications, Big Data and Cognitive Computing, vol. 2:26, pp. 1–17. https://doi.org/10.3390/bdcc2030026
[13] Nguyen M.H. & Torre F. de la (2010). Optimal feature selection for support vector machines, Pattern Recognition, vol. 43, pp. 584–591. https://doi.org/10.1016/j.patcog.2009.09.003
[14] Sanz H., Valim C., Vegas E., Oller J.M. & Reverter F. (2018). SVM-RFE: selection and visualization of the most relevant features through non-linear kernels, BMC Bioinformatics, vol. 19:432, pp. 1–18. https://doi.org/10.1186/s12859-018-2451-4
[15] Gurupur V.P., Kulkarni S.A., Liu X., Desai U. & Nasir A. (2018). Analysing the power of deep learning techniques over the traditional methods using medicare utilisation and provider data, Journal of Experimental & Theoretical Artificial Intelligence, pp. 99–115. https://doi.org/10.1080/0952813X.2018.1518999
[16] Zardo P. & Collie A. (2014). Predicting research use in a public health policy environment: results of a logistic regression analysis, Implementation Science, vol. 9, pp. 1–10. https://doi.org/10.1186/s13012-014-0142-8
[17] Sheets L., Petroski G.F., Zhuang Y., Phinney M.A., Ge B., Parker J.C. & Shyu C-R. (2017). Combining Contrast Mining with Logistic Regression to Predict Healthcare Utilization in a Managed Care Population, Applied Clinical Informatics, vol. 8:2, pp. 430–446. https://doi.org/10.4338/aci-2016-05-ra-0078
[18] Sakr G.E., Mokbel M., Darwich A., Khneisser M.N. & Hadi A. (2016). Comparing deep learning and support vector machines for autonomous waste sorting, 2016 IEEE International Multidisciplinary Conference on Engineering Technology (IMCET), pp. 207–212. https://doi.org/10.1109/IMCET.2016.7777453
[19] Ravi D., Wong C., Deligianni F., Berthelot M., Andreu-Perez J., Lo B. & Yang G-Z. (2017). Deep Learning for Health Informatics, IEEE Journal of Biomedical and Health Informatics, vol. 21(1), pp. 4–21. https://doi.org/10.1109/jbhi.2016.2636665
[20] Gangsar P. & Tiwari R. (2018). Effect of noise on support vector machine based fault diagnosis of IM using vibration and current signatures, MATEC Web of Conferences, vol. 211.
http://dx.doi.org/10.1051/matecconf/201821103009
[21] Pei Y., Huang Y., Zou Q., Zhang X. & Wang S. (2021). Effects of Image Degradation and Degradation Removal to CNN-Based Image Classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43:4, pp. 1239–1253. https://doi.org/10.1109/TPAMI.2019.2950923
[22] Wu X. & Zhu X. (2008). Mining with Noise Knowledge: Error-Aware Data Mining, IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, vol. 38(4), pp. 15–19. https://doi.org/10.1109/CIS.2007.7
[23] Zualkernan I.A., Dhou S., Judas J., Sajun A.R., Gomez B.R., Hussain L.A. & Sakhnini D. (2020). Towards an IoT-based Deep Learning Architecture for Camera Trap Image Classification, 2020 IEEE Global Conference on Artificial Intelligence and Internet of Things (GCAIoT), pp. 1–6. https://doi.org/10.1109/GCAIoT51063.2020.9345858
[24] Scikit-learn: Machine Learning in Python. [Online]. Available: https://scikit-learn.org/stable/
[25] Tukey J. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.
[26] Zhao Q., Zhou G., Zhang L., Cichocki A. & Amari S. (2016). Bayesian Robust Tensor Factorization for Incomplete Multiway Data, IEEE Transactions on Neural Networks and Learning Systems, vol. 27(4), pp. 736–748. http://dx.doi.org/10.1109/TNNLS.2015.2423694
[27] Khan Z., Naeem M., Khalil U., Khan D.M., Aldahmani S. & Hamraz M. (2019). Feature Selection for Binary Classification Within Functional Genomics Experiments via Interquartile Range and Clustering, IEEE Access, vol. 7, pp. 78159–78169. https://doi.org/10.1109/ACCESS.2019.2922432
[28] Yusoff S.B. & Wah Y.B. (2012). Comparison of conventional measures of skewness and kurtosis for small sample size, 2012 International Conference on Statistics in Science, Business and Engineering (ICSSBE), pp. 1–6. https://doi.org/10.1109/ICSSBE.2012.6396619
[29] Heymann S., Latapy M. & Magnien C. (2012). Outskewer: Using Skewness to Spot Outliers in Samples and Time Series, 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 527–534. https://doi.org/10.1109/ASONAM.2012.91
[30] Xu L., Hu O., Guo Y., Zhang M., Lu D., Cai C.B., Xie S., Goodarzi M., Fu H.Y. & She Y.B. (2018). Representative splitting cross validation, Chemometrics and Intelligent Laboratory Systems, vol. 183, pp. 29–35. https://doi.org/10.1016/j.chemolab.2018.10.008
[31] Tharwat A. (2018). Classification assessment methods, Applied Computing and Informatics, pp. 1–13. https://doi.org/10.1016/j.aci.2018.08.003
[32] Fatourechi M., Ward R.K., Mason S.G., Huggins J., Schlögl A. & Birch G.E. (2008). Comparison of Evaluation Metrics in Classification Applications with Imbalanced Datasets, Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications, pp. 777–782. https://doi.org/10.1109/ICMLA.2008.34
[33] Huang J. & Ling C. (2005). Using AUC and Accuracy in Evaluating Learning Algorithms, IEEE Transactions on Knowledge & Data Engineering, vol. 17(3), pp. 299–310. https://doi.org/10.1109/TKDE.2005.50
[34] Pipelines and composite estimators. https://scikit-learn.org/stable/modules/compose.html
[35] Nadarajah S. & Kotz S. (2007). On the Generation of Gaussian Noise, IEEE Transactions on Signal Processing, vol. 55(3), pp. 1172–1172. http://dx.doi.org/10.1109/TSP.2006.888061
[36] Zhuang L. & Ng M.K. (2020).
Hyperspectral Mixed Noise Removal By ℓ1-Norm-Based Subspace Representation, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 13, pp. 1143–1157. https://doi.org/10.1109/JSTARS.2020.2979801
[37] Hendrycks D. & Dietterich T.G. (2018). Benchmarking Neural Network Robustness to Common Corruptions and Surface Variations, arXiv: Learning, pp. 1–13. https://arxiv.org/abs/1807.01697v5
[38] Domingo-Ferrer J., Sebé F. & Castellà-Roca J. (2004). On the Security of Noise Addition for Privacy in Statistical Databases, International Workshop on Privacy in Statistical Databases, pp. 149–161. http://dx.doi.org/10.1007/978-3-540-25955-8_12
[39] Yao S., Zhao Y., Zhang A., Hu S., Shao H., Zhang C., Su L. & Abdelzaher T. (2018). Deep Learning for the Internet of Things, Computer, vol. 51:5, pp. 32–41. https://doi.org/10.1109/MC.2018.2381131
[40] Ma X., Yao T., Hu M., Dong Y., Liu W., Wang F. & Liu J. (2019). A Survey on Deep Learning Empowered IoT Applications, IEEE Access, vol. 7, pp. 181721–181732. https://doi.org/10.1109/ACCESS.2019.2958962
[41] Ahmed I., Din S., Jeon G. & Piccialli F. (2020). Exploring Deep Learning Models for Overhead View Multiple Object Detection, IEEE Internet of Things Journal, vol. 7:7, pp. 5737–5744. http://dx.doi.org/10.1109/JIOT.2019.2951365