https://doi.org/10.31449/inf.v46i4.3476 Informatica 46 (2022) 523–530

OMCOKE: A Machine Learning Outlier-based Overlapping Clustering Technique for Multi-Label Data Analysis

Said Baadel 1, Fadi Thabtah 2, Joan Lu 3, Saida Harguem 4
1 Mt. Royal University, Calgary, Canada, 2 ASDTests, Auckland, New Zealand, 3 University of Huddersfield, Huddersfield, UK, 4 Canadian University Dubai, Dubai, UAE
Email: sbaadel@mtroyal.ca, fadi@asdtests.com, j.lu@hud.ac.uk, saida.harguem@cud.ac.ae

Keywords: K-means; Machine Learning; OMCOKE; Overlapping Clustering; Unsupervised Learning

Received: March 3, 2021

Clustering is one of the more challenging machine learning techniques due to its unsupervised learning nature. While many clustering algorithms constrain objects to single clusters, overlapping K-means partitioning methods relax this constraint and allow objects to belong to more than one cluster in order to better fit hidden structures in the data. However, when datasets contain outliers, these can significantly influence the mean distance of the data objects to their respective clusters, which is a drawback. Most researchers address this problem by simply removing the outliers. This can be problematic, especially in applications such as fraud detection or cybersecurity risk analysis. In this study, an alternative solution is proposed that captures outliers and stores them on the fly within a new cluster instead of discarding them. The new algorithm is named Outlier-based Multi-Cluster Overlapping K-Means Extension (OMCOKE). Empirical results on real-life multi-label datasets were derived to compare OMCOKE's performance with other common overlapping clustering techniques. The results show that OMCOKE produced a better precision rate than the considered clustering algorithms. This method can benefit various stakeholders, as these outliers have real-life relevance in cybersecurity, fraud detection, and website anti-phishing.

Povzetek: In this study, an alternative solution (OMCOKE) is proposed that captures outliers and stores them on the fly in a new cluster instead of removing them.

1 Introduction

Clustering is an unsupervised learning process that involves grouping a set of data objects into subsets, each of which has its own label based on a predefined similarity metric [2] [5]. Prior to learning, each resulting subset will contain data objects that usually exhibit similar traits but are dissimilar from data objects in the other subsets [25]. In clustering, some structural characteristics are not known a priori unless some sort of domain knowledge is available in advance (i.e., there are no labels attached to the data patterns as in supervised classification), which makes clustering a difficult problem due to this unsupervised nature [23] [32]. Various clustering techniques such as probabilistic, distance-based, and grid-based methods have been explored in machine learning, with distance-based methods proving to be popular [1] [24]. Undoubtedly, K-means [30], together with its generic extensions and adaptations, is one of the most widely used distance-based partition-clustering algorithms [23] [26] [28]. There are many reasons for this: it is easy to implement, its versatility allows any part to be easily modified, and it is guaranteed to converge at a quadratic rate [16].
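As background for the discussion that follows, the sketch below illustrates a single assignment/update step of standard K-means together with the SSE objective formalized in Equation 1 below. It is an illustrative sketch only: NumPy is assumed, and the function and variable names (e.g. kmeans_step) are not part of any of the algorithms discussed in this paper.

import numpy as np

def kmeans_step(X, centroids):
    """One assignment + update step of standard (non-overlapping) K-means.

    X:         (n, d) array of data objects
    centroids: (K, d) array of current cluster centroids
    Returns the updated centroids, the hard assignments, and the SSE.
    """
    # Euclidean distance of every object to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)  # each object goes to its closest centroid
    # Sum of squared distances of objects to their assigned centroids (the SSE objective)
    sse = np.sum(dists[np.arange(len(X)), labels] ** 2)
    # Recompute each centroid as the mean of its members; keep empty clusters unchanged
    new_centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k]
                              for k in range(len(centroids))])
    return new_centroids, labels, sse

A single object far from every centroid inflates both the SSE and the mean of whichever cluster absorbs it, which is exactly the sensitivity discussed next.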
Given these strengths, the K-means algorithm has been primarily utilized to deal with non-overlapping clustering problems that limit each data object to a single cluster. However, one of the main challenges of K-means and its successors is sensitivity to exceptional data (outliers). K-means often derives clusters by minimizing the Sum of Squared Error (SSE) (Equation 1), computed from the Euclidean distance between the data objects and the clusters' computed centroids.

SSE = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - c_k\|^2     (1)

where C_k is the kth cluster, x_i is a point in C_k, and c_k is the mean of the kth cluster. When the input dataset contains a few outliers, these may significantly influence the mean distance of the data objects to their respective clusters (an outlier skews the mean and variance), and thus K-means-based approaches tend to discard outliers [6] [15] [34]. Existing algorithms that extend K-means and allow objects to overlap include Kernel Overlapping K-means (KOKM) [9] [11], Overlapping K-means (OKM) [17] [18], Parameterized R-OKM [10], and the Multi-Cluster Overlapping K-Means Extension (MCOKE) [3] [4]. Detecting outliers is advantageous for decision makers, as they may correspond to fraudulent activities, for example cybersecurity attacks or fraudulent insurance claims. It is therefore more useful to store these outliers in a separate cluster for potential use (as opposed to discarding them), since they represent exceptional patterns.

This research addresses the above issue by detecting outliers during the clustering process and then storing them, which makes it different from the other overlapping clustering algorithms. Since the user identifies the number of clusters k a priori when running clustering algorithms, our algorithm adjusts to this and accommodates the outliers by adding a new cluster, called the outlier cluster (k+1), during the learning process. The proposed algorithm, called Outlier-based Multi-Cluster Overlapping K-Means Extension (OMCOKE), keeps the outlier cluster for later analysis by decision makers. This approach adds considerable value to the learning process, as we save these data objects in order to investigate and understand their characteristics. These data objects could be the result of an imbalanced dataset with high cardinality (i.e., natural overlaps), and perhaps the number of clusters k, which is defined a priori, can be revised to accommodate the data and allow the algorithm to better fit the clusters. Outliers could also indicate suspicious data objects with malicious intent. Therefore, an outlier cluster that can be investigated has profound real-life implications, for example in e-banking, website phishing, cybersecurity, or medical screening. In cybersecurity, for instance, historical data can reveal statistically acceptable trends through the data patterns and how they are clustered together. Any outlier objects outside the regular clustered trends will automatically raise red flags, which can be used in data analytics to alert the user to a potential security threat or an intrusion attempt. Experimental results using real datasets indicate that OMCOKE is able to detect outliers and to produce clusters with higher precision and accuracy when compared to existing algorithms such as OKM, KOKM, and R-OKM, among others.

The rest of the paper is structured as follows: Section 2 reviews the literature concerning overlapping clustering.
Section 3 discusses OMCOKE and the datasets used in the empirical experiments. Section 4 provides the results and analysis with a comparison of different ML clustering techniques. Lastly, we provide conclusions and further research directions in Section 5.

2 Literature review

Outliers are data objects or points that do not conform to the normal behaviour or model of the dataset and are hence deemed inconsistent or grossly different [12]. Such data can be erroneous, but could also be classified as suspicious data arising from fraudulent activity, which makes it useful for fraud detection, intrusion detection, marketing, website phishing detection, etc. Outlier detection is considered a task in itself; research in the data mining domain has focused on efficient and optimal ways to detect distance-based outliers. Outlier detection surveys such as those by Chandola et al. [15], Bay and Schwabacher [7], and Kadam and Pund [27] discuss several approaches used to tackle anomalies and noisy data. In [8] and [31] the authors provide methods that efficiently mine outliers in large datasets. Other recent studies have devised clustering-analysis methods that prune or screen out outliers from the dataset, such as Liu et al. [29], Barai and Dey [6], Gan and Ng [21], Danganan et al. [19], and Chagas et al. [14]. For example, Yu et al. [35] proposed an outlier detection method that identifies and eliminates outliers in the dataset, forming an outlier-eliminated dataset (OED). The authors then applied the K-means algorithm to the OED, thereby improving the accuracy of the clustering. Similarly, Barai and Dey [6] divide their algorithm into two steps. The first step calculates the threshold used to detect outliers by taking the average of the maximum and minimum pairwise distances over all data. Each data point is then compared to this threshold, and those whose distance value is greater than the threshold are deemed outliers and tossed out of the dataset. The second step then runs the K-means algorithm without the outliers, thus improving the clustering process. Liu et al. [29] also propose a two-phased approach in their clustering with outlier removal (COR) algorithm. In the first phase, their method runs the K-means algorithm to generate basic partitions and discover outliers; the outliers are identified as objects with large distances to their nearest centroid. The second phase removes the identified outlier objects, and the remainder are partitioned into k clusters. Similarly, Danganan et al. [19] proposed a modification of MCOKE [3] that incorporates the median absolute deviation (MAD) to measure potential outliers in the dataset. The authors proposed a three-phased approach in which the objects are ranked in ascending order and the distance of each object is compared against the MAD multiplied by a constant determined by the user to obtain a decision value. If the distance of an object is greater than the decision value, the object is deemed an outlier and is pruned from the dataset. While many studies focus on pruning and discarding outliers to improve the classification process, we rarely find algorithms that detect outliers while simultaneously performing clustering [19]. The K-means with outlier removal (KMOR) algorithm is similar to the standard K-means algorithm but introduces an outlier cluster (k+1) that accounts for objects that do not fit into the k defined clusters.
The algorithm identifies outliers as objects whose distance is above a calculated threshold, defined as the average distance multiplied by a parameter greater than or equal to 0; the average distance is calculated during the clustering phase. The KMOR algorithm requires three parameters: the number of clusters k, the maximum number of outliers n0 (to control the number of objects assigned as outliers), and, finally, a third parameter to separate outliers from non-outliers. Two additional parameters are used to help terminate the algorithm.

All the studies mentioned above utilize the K-means partition algorithm, which ultimately constrains objects to single clusters. Overlapping partitioning clustering methods tend to relax or remove these constraints, allowing overlaps between clusters; this better fits any hidden structures in the data and assigns data objects to one or more clusters, building a non-disjoint partition of the data [4] [5]. The present study focuses on overlapping partitioning methods, which have several real-life applications such as dynamic system identification, document categorization (a document belonging to different clusters), data compression, bioinformatics, image recognition, model construction, etc. [1] [21]. Extensions of K-means that allow overlaps include Kernel Overlapping K-means (KOKM) [9] [11], Overlapping K-means (OKM) [17] [18], Parameterized R-OKM [10], and the Multi-Cluster Overlapping K-Means Extension (MCOKE) [3] [4].

The OKM algorithm is an extension of K-means that allows overlaps by using a heuristic that discovers a combinatorial set of possible assignments of the data points. For each observation, the heuristic sorts the clusters from the closest to the farthest; it then assigns the object to those centroids in the defined order while minimizing the distance between the centroid and the observed object. The KOKM algorithm is a variant of OKM that utilizes kernel methods for overlapping clustering. The authors use two variants in their method: the first is a kernelization of the Euclidean metric, similar to the one used in OKM, that calculates the distances between the objects and the clusters in a high-dimensional mapping space; the second variant performs all the clustering steps in the space where the data is implicitly mapped. The Parameterized R-OKM algorithm is another variant of OKM that lets users regulate the overlaps via a parameter. As the value of the parameter increases, the algorithm builds clusters with reduced overlaps, and vice versa as the value approaches zero. The Parameterized R-OKM algorithm reduces to OKM when this parameter is set to exactly zero.

Unlike other algorithms that prune the outliers and discard them, the proposed algorithm saves them in a newly created outlier cluster during the iteration process. The present study follows the same idea as the KMOR algorithm and introduces an outlier cluster k+1 that stores the anomalies or outlier objects separately from the normal instances. As noted above, the KMOR algorithm requires users to define the maximum number of outliers, as well as a parameter to separate outliers from non-outliers. This is impractical in real-life unsupervised scenarios where no prior knowledge of the data is given. Their method also requires additional parameters to help terminate the algorithm, which is not easy for novice users to determine.
However, in this study we do not require users to enter parameters to terminate the algorithm or to identify the maximum number of outliers in the dataset; this makes the method more practical in machine learning. None of the overlapping K-means algorithms above has the capability to detect outliers and store them for additional scrutiny. Thus, we provide additional value to the literature by introducing this new overlapping clustering method.

This study considers the key classification evaluation measures of Precision and F-measure. We evaluate and compare the results to highlight the significance of excluding the outliers in the dataset when clustering and how that improves the precision of the algorithm. The following section discusses the proposed clustering algorithm and the datasets used for evaluation.

3 The proposed OMCOKE algorithm and experimental dataset

The proposed method is an enhancement of the MCOKE algorithm [3], which allows objects to overlap and belong to more than one cluster based on their distance compared to the maxdist variable. Maxdist is the largest distance of any object to the centroid of the cluster it was assigned to during the partitioning phase. That distance is used as an outer radius of similarity threshold and as the benchmark for allowing objects to belong to other clusters to which they were not initially assigned, allowing them to overlap. However, K-means, being a greedy algorithm, guarantees that all objects are assigned to a cluster, including any outliers; hence the maxdist radius benchmark can easily be influenced by outliers. The present study introduces another variable, averdist, that calculates the average distance between the objects and the centroid for each cluster. Averdist acts as a new threshold for the inner radius between the object and the centroid.

averdist_k = \frac{1}{n_k} \sum_{x_i \in C_k} \|x_i - c_k\|^2,  k = 1, 2, ..., K     (2)

where C_k is the kth cluster, x_i is a point in C_k, c_k is its centroid, and n_k is the number of objects in C_k. It is assumed that most objects being clustered will fall close to the inner radius threshold (i.e., close to their cluster centroid), which is based on the average distance of all objects belonging to the cluster centroid. Anomalies or outliers therefore tend to be further away from their closest cluster centroid. Objects whose distance is greater than the inner radius but less than or equal to the outer radius (maxdist) are subject to further scrutiny and are flagged to ensure they are not outliers on the border of the clusters. The maxdistThreshold therefore defines the fraction of the outer boundary radius to be considered; for example, a value of 0.98 means that objects inside 98% of the outer boundary are not considered anomalies. This logic is based on the following assumptions:

a) Anomalies tend to be in sparse clusters, whereas normal instances usually belong to dense clusters.
b) Anomalies tend to be far from the closest cluster centroid, whereas normal instances tend to be near their closest cluster centroid.

In cases where some knowledge of the data is available beforehand, this value can also be adjusted by the user prior to running the algorithm. This modification logic is summarized in the pseudocode provided below.

Outlier Detection Pseudocode
1. For each xi ∈ Ck
2. Do
3.   If (dist(xi, centroid Ck) ≤ averdist)
4.     Cluster ← xi
5.   Else
6.     If (dist(xi, centroid Ck) ≥ maxdist * maxdistThreshold)
7.       Outlier_Cluster ← xi
8.     Else
9.       Cluster ← xi
10.    End if
11. End if
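For readers who prefer executable code, the sketch below mirrors the per-object logic of the pseudocode under a few assumptions: averdist is a per-cluster array computed as in Equation 2, maxdist is the largest assignment distance observed during partitioning, and NumPy is used for the distance computation. The function name and signature are illustrative and not part of OMCOKE's published implementation.

import numpy as np

def flag_outliers(X, labels, centroids, averdist, maxdist, maxdist_threshold=0.99):
    """Assign each object to its cluster or to the extra outlier cluster (k+1).

    X:          (n, d) data objects; labels[i] is the cluster index of X[i]
    centroids:  (K, d) cluster centroids
    averdist:   (K,) inner-radius threshold per cluster (Equation 2)
    maxdist:    scalar outer-radius threshold (largest assignment distance)
    Returns a copy of labels in which detected outliers receive the extra index K.
    """
    K = len(centroids)
    out = labels.copy()
    for i, x in enumerate(X):
        k = labels[i]
        d = np.linalg.norm(x - centroids[k])
        if d <= averdist[k]:
            continue                          # close to the centroid: keep its cluster
        if d >= maxdist * maxdist_threshold:
            out[i] = K                        # beyond the cut-off: outlier cluster k+1
        # otherwise the object is flagged for scrutiny but kept in its cluster
    return out

Only when at least one object receives the extra index K does the number of output clusters grow to k+1, mirroring the on-the-fly behaviour explained next.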
In Step 6 of the pseudocode, maxdist is multiplied by the maxdistThreshold, which expresses a percentage of the overall maximum distance of any object belonging to a cluster. This acts as the cut-off point, and any object whose distance is greater than this upper percentile is deemed an outlier. Upon identification of at least one outlier, the number of clusters k entered by the user prior to running the method is incremented by 1 on the fly, and the outlier object is assigned to this newly created cluster. All other identified outliers, a subset S of the initial population, are likewise assigned to this newly created cluster. Once an outlier is detected, the algorithm adds the (k+1)th cluster to the output vector, with the outlier cluster indexes listed as part of the output. This allows for further investigation of those data points as opposed to discarding them, as is usual. When no outliers are detected, the algorithm simply clusters with overlaps without incrementing the number of clusters k.

3.1 Experimental dataset

Different datasets from the Mulan: A Java Library for Multi-Label Learning repository [34] are used to evaluate the proposed algorithm's performance. The repository hosts more than 25 datasets in domains including text, audio, video, music, images, and biology, to mention only a few. Items of multi-label datasets can be members of multiple groups, which is true of real-world problems and, as a result, ideal for the study of overlapping clustering. In our empirical experiment, three datasets from different domains were used; their specifications and descriptive statistics are displayed in Table 3 and Table 4, respectively.

Data Set    Instances    # of Labels    Attributes    Cardinality
Emotions    593          6              72            1.869
Yeast       2417         14             103           4.237
Scene       2407         6              294           1.074

Table 3: Statistics of the benchmarks used

Data Set    Min      Max      Mean     StdDev
Emotions    0.01     0.195    0.069    0.031
Yeast       0.371    0.52     0.001    0.097
Scene       0.0      1.0      0.659    0.214

Table 4: Descriptive statistics of the benchmarks used

3.2 Description of the Overlapping Datasets

This study conducted experiments on real-life overlapping datasets to measure the effectiveness of the methods used to identify such overlapping groups. The three datasets are widely diverse, making them a suitable combination of benchmarks. For example, their sizes vary from 593 (Emotions) to 2417 (Yeast) instances, their dimensionality (attributes) from 72 (Emotions) to 294 (Scene), and their cardinality (i.e., overlap rate) from 1.074 (Scene) to 4.237 (Yeast). Their application domains also vary considerably, i.e., music, biology, and images.
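Because cardinality (the natural overlap rate reported in Table 3) drives much of the later analysis, the minimal sketch below shows how it can be computed from a binary label-indicator matrix; the toy matrix and variable names are illustrative assumptions, not part of the Mulan tooling.

import numpy as np

# Y is an (instances x labels) binary indicator matrix of a multi-label dataset,
# e.g. 593 x 6 for Emotions. Cardinality is the average number of labels per instance.
Y = np.array([[1, 0, 1, 0, 0, 0],   # toy rows for illustration only
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 1, 0, 0]])
cardinality = Y.sum(axis=1).mean()  # 1.869 for Emotions, 4.237 for Yeast, 1.074 for Scene
print(cardinality)                  # 2.0 for this toy matrix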
The following is a brief description of the three datasets (Emotions, Yeast, and Scene).

3.2.1 Emotion dataset

Emotion detection in music is based on analyzing music signals. Music can be classified into several categories at the same time, since the categories are not usually disjoint, i.e., a piece can make you feel both "sad" and "angry". The dataset contains sound clips described by 72 attributes, which were annotated by three male music experts into six emotional clusters. Only the songs whose label all three experts unanimously agreed on were kept, resulting in a total of 593 songs being selected for the dataset.

3.2.2 Yeast dataset

The Yeast dataset is classified into 14 gene groups or classes. A gene can belong to several different classes at the same time, thus making this a multi-label dataset. For example, the gene YAL014W may belong to the following four groups: {Cell Growth, Cell Division}, {Cellular Organization}, {Cellular Communication, Signal Transduction}, and {Transposable Elements, Viral and Plasmid Proteins}.

3.2.3 Scene dataset

The dataset contains 2407 natural scene images, classified into six categories. The images can belong to different categories at the same time since the categories are not usually disjoint, i.e., the images are multi-labelled and can belong to more than one category, such as field + mountain or fall foliage + mountain.

4 Experimental results

4.1 Experimental settings

To exhibit the performance of our algorithm with respect to different measures when contrasted with a wide range of ML methods, the current study selected clustering algorithms using the following criteria:

a) Algorithms that utilize the partitioning method and extend the K-means algorithm
b) Algorithms that use the Euclidean distance to calculate the similarities between the sets of observations
c) Algorithms that work on numeric attributes only
d) Known algorithms that have been evaluated by previous researchers in ML.

All experiments were run on an Intel Core i7 computer with a 3.4 GHz processor and 8.0 GB RAM running a 64-bit Windows 10 operating system. We used the pair-based Precision-Recall measure, which is calculated over pairs of observations and computed as follows:

Precision = \frac{TP}{TP + FP}     (3)

Recall = \frac{TP}{TP + FN}     (4)

F\text{-}measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}     (5)

where TP is a true positive decision, FP is a false positive decision (two dissimilar objects assigned to the same cluster), and FN is a false negative (two similar objects assigned to different clusters).
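To make the pair-based evaluation concrete, the sketch below counts object pairs that are co-clustered in the predicted and/or true groupings and derives Equations 3-5 from those counts. It is a simplified illustration that assumes each object's assignments are given as a Python set of cluster (or class) indexes; it is not the exact evaluation code used in the experiments.

from itertools import combinations

def pairwise_prf(pred, true):
    """Pair-based Precision, Recall and F-measure over co-clustered object pairs.

    pred[i] and true[i] are sets of cluster/class labels for object i, so
    overlapping (multi-label) assignments are handled naturally.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        together_pred = bool(pred[i] & pred[j])   # pair shares a predicted cluster
        together_true = bool(true[i] & true[j])   # pair shares a true class
        if together_pred and together_true:
            tp += 1
        elif together_pred:
            fp += 1                               # dissimilar objects clustered together
        elif together_true:
            fn += 1                               # similar objects split apart
    precision = tp / (tp + fp) if tp + fp else 0.0   # Equation (3)
    recall = tp / (tp + fn) if tp + fn else 0.0      # Equation (4)
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)     # Equation (5)
    return precision, recall, f_measure

Because an object may appear in several clusters, a pair counts as co-clustered whenever the two objects share at least one cluster, which is how overlapping assignments enter the pair counts.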
4.2 Empirical results and analysis

For fair comparisons, datasets with different sizes and from different domains were chosen, and the results are compared to well-known algorithms that have been evaluated by previous researchers. Through this experimental study, we evaluated and compared the performance of OMCOKE with three existing methods, namely Kernel Overlapping K-means (KOKM), Overlapping K-means (OKM), and Parameterized R-OKM (P-ROKM), as shown in Table 5 below. For each experiment, we set the parameters for KOKM, OKM, and P-ROKM as follows:

• Maximum iterations = 10
• Number of clusters = 3
• Number of labels = Emotions (6), Yeast (14), and Scene (6)
• Minimal improvement = 0.01
• Alpha = 1 and 0.1 for the P-ROKM algorithm.

In addition to the number of iterations and clusters set as above, the following parameters were also set in OMCOKE:

• maxdistThreshold = 0.99
• useMeasures = True

Overlapping methods will have an overlap rate greater than 1, since objects belong to more than one cluster. The size of the overlaps affects the value of Precision, i.e., Precision will be lower because observations are assigned to more than one cluster.

Method            Emotion P.   Emotion F.   Yeast P.   Yeast F.   Scene P.   Scene F.
KOKM II           0.471        0.641        0.785      0.878      0.193      0.324
OKM               0.467        0.586        0.234      0.376      0.234      0.376
P-ROKM (α=1)      0.474        0.524        0.919      0.565      0.379      0.506
P-ROKM (α=0.1)    0.468        0.578        0.802      0.654      0.288      0.439
OMCOKE            0.565        0.419        0.972      0.496      0.706      0.453

Table 5: Comparison of performance

Fig. 3: Precision Accuracy of the Benchmark Datasets

The pair-based Precision-Recall measure used in the empirical results is calculated over pairs of observations. This allows clusters to be evaluated independently and their partitions to be compared across different numbers of clusters in the dataset. It measures whether a predicted pair is correctly assigned to the same cluster as indicated by the true classes in the datasets. However, the Recall measure uses a binary function to compute the relationship between pairs of observations, without considering that those pairs of observations could also feature in multiple clusters in the overlap. This results in a biased Recall measure, especially when the cardinality of the dataset is large. Thus, we chose not to use Recall as a measure of OMCOKE in our experiment.

It is evident from the above empirical results that the OMCOKE algorithm has a high precision rate and outperforms all the other overlapping algorithms in the study, as shown in Figure 3 above. This can be attributed to the algorithm's ability to separate outliers from the rest of the data objects when assigning them to clusters. For the Emotion, Yeast, and Scene datasets, OMCOKE's precision was 0.565, 0.972, and 0.706, followed by P-ROKM (α=1) at 0.474, 0.919, and 0.379, respectively.

High values of the F-measure are generally induced by high values of Recall, as opposed to non-overlapping algorithms whose high F-measure values generally result from Precision. When compared to the other algorithms, OMCOKE performs relatively well in the F-measure, as shown in Figure 4 below, scoring second behind P-ROKM (α=1) in the Scene dataset; the P-ROKM method with an alpha value of 1 yielded an overlap of exactly 1, and the dataset had a cardinality of 1.074. The F-measure values are higher for clustering methods whose overlap rates are closer to the actual cardinality of the dataset. The cardinality shown in Table 3 is the natural overlap in the dataset, i.e., the average number of categories each observation can belong to. The analysis shows that the F-measure and Precision are significantly affected by the overlap rate in the actual dataset. Algorithms whose partitions have smaller overlaps fared well in their F-measure, meaning that they produced non-disjoint partitions that fit the data better compared to others.

Fig. 4: F-Measure of the Benchmark Datasets

OMCOKE performed reasonably well in the Scene and Emotion datasets, since the cardinalities of these datasets are low (1.074 and 1.869, respectively), nearing 1, but did poorly in the Yeast dataset, which had an overlap of over 4. Our algorithm detected several outliers in the datasets; these are listed in Table 6 below.

Dataset     Number of Outliers    Identified Outlier Instances
Emotion     1                     27
Scene       2                     304; 1502
Yeast       1                     1819

Table 6: Outliers detected in the three datasets

As indicated, an input dataset containing a few outliers significantly influences the mean distance of the data objects to their respective clusters (the outlier skews the mean and variance). This explains why OMCOKE outperformed the other methods in all datasets in terms of the Precision rate.
This also shows that, by separating the outliers from the rest of the data, OMCOKE was able to build a model closer to the actual overlaps in each of the datasets than the other methods, which is why its precision is higher than the rest.

5 Conclusions and future work

In this paper, several K-means variants of overlapping clustering methods were discussed. The proposed algorithm, with the capability of detecting outliers and treating them as a separate cluster, was evaluated and compared with three existing overlapping clustering methods, namely Kernel Overlapping K-means (KOKM), Overlapping K-means (OKM), and Parameterized R-OKM. We used real-life multi-label datasets for our experiments. The empirical results showed that the F-measure and Precision were significantly affected by the overlap rate in the actual dataset. OMCOKE did well in the Scene dataset, since the cardinality of that dataset is very low, and did poorly in the Yeast dataset, which had a significantly high overlap rate of over 4. However, when it came to Precision, OMCOKE outperformed the other overlapping algorithms in all datasets, indicating that our method had a better detection rate of clusters and assigned observations with better precision after it segregated the outliers in the dataset. The proposed algorithm detects and stores outliers during the clustering process, making it different from the other overlapping clustering algorithms and thus adding value in this domain. As opposed to discarding anomalies and outliers, our method can provide tremendous benefit to cybersecurity experts, medical practitioners, IT administrators, data mining researchers, and other stakeholders, as these outliers have real-life relevance, for example fraudulent activities in the case of cybersecurity, fraudulent insurance claims in the banking domain, or raising flags in the medical field, especially in the screening process.

In future work, we plan to extend the method to increment the number of clusters by more than 1, to cater for other dispersed objects that may not necessarily be deemed anomalies but could form dispersed clusters with common characteristics that are somehow dissimilar from the rest of the data objects. These newly created clusters can then be fused and merged based on their similarity weights to minimize the number of clusters produced in large datasets.

References

[1] Aggarwal, C., & Reddy, C. K. (2014). Data clustering: Algorithms and applications. CRC Press.
[2] Arabie, L. J., Hubert, G., & DeSoete, P. (1999). Clustering and classification. World Scientific.
[3] Baadel, S., Thabtah, F., & Lu, J. (2015). MCOKE: Multi-Cluster Overlapping K-Means Extension Algorithm. International Journal of Computer, Control, Quantum and Information Engineering, 9(2), pp. 374-377.
[4] Baadel, S., Thabtah, F., & Lu, J. (2016). Overlapping clustering: A review. IEEE SAI Computing Conference, London, UK, pp. 233-237. https://doi.org/10.1109/sai.2016.7555988
[5] Baadel, S. (2021). Big Data Analytics: A Tutorial of Some Clustering Techniques. International Journal of Management and Data Analytics, 1(2), pp. 38-46.
[6] Barai, A., & Dey, L. (2017).
Outlier detection and removal algorithm in K-means and hierarchical clustering. World Journal of Computer Application and Technology, 5(2), 24-29. https://doi.org/10.13189/wjcat.2017.050202
[7] Bay, S., & Schwabacher, M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In KDD. https://doi.org/10.1145/956750.956758
[8] Beltran, B., Vilarino, D., Martinez-Trinidad, J., & Carrasco-Ochoa, J. A. (2020). K-means based method for overlapping document clustering. Journal of Intelligent and Fuzzy Systems, 39(2), pp. 2127-2135. https://doi.org/10.3233/jifs-179878
[9] BenN'Cir, C., & Essoussi, N. (2012). Overlapping patterns recognition with linear and non-linear separations using positive definite kernels. International Journal of Computer Applications (IJCA), pp. 1-8. https://doi.org/10.5120/8916-2981
[10] BenN'Cir, C., Cleuziou, G., & Essoussi, N. (2013). Identification of non-disjoint clusters with small and parameterizable overlaps. In IEEE International Conference on Computer Applications Technology (ICCAT), pp. 1-6. https://doi.org/10.1109/iccat.2013.6522010
[11] BenN'Cir, C., Essoussi, N., & Bertrand, P. (2010). Kernel overlapping k-means for clustering in feature space. In International Conference on Knowledge Discovery and Information Retrieval (KDIR), pp. 250-256. https://doi.org/10.5220/0003095102500256
[12] Berkhin, P. (2006). A survey of clustering data mining techniques. In: Kogan, J., Nicholas, C., & Teboulle, M. (eds), Grouping Multidimensional Data. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-28349-8_2
[13] Boundaillier, E., & Hebrail, G. (1988). Interactive interpretation of hierarchical clustering. Intelligent Data Analysis.
[14] Chagas, G. O., Lorena, A., & Dos Santos, R. (2019). A hybrid heuristic for the overlapping clustering problem. Applied Soft Computing, 81(105482), 1-48. https://doi.org/10.1016/j.asoc.2019.105482
[15] Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 1-72. https://doi.org/10.1145/1541880.1541882
[16] Celebi, M., Kingravi, H., & Vela, P. (2013). A comparative study of efficient initialization methods for the K-means clustering algorithm. Expert Systems with Applications, 40(1), 200-210. https://doi.org/10.1016/j.eswa.2012.07.021
[17] Cleuziou, G. (2008). An extended version of the k-means method for overlapping clustering. In International Conference on Pattern Recognition (ICPR), pp. 1-4. https://doi.org/10.1109/icpr.2008.4761079
[18] Cleuziou, G. (2009). Two variants of the OKM for overlapping clustering. Advances in Knowledge Discovery and Management, pp. 149-166. https://doi.org/10.1007/978-3-642-00580-0_9
[19] Danganan, A., Sison, A., & Medina, R. (2019). OCA: Overlapping Clustering Application unsupervised approach for data analysis. Indonesian Journal of Electrical Engineering and Computer Science, 14(3), pp. 1473-1478. https://doi.org/10.11591/ijeecs.v14.i3.pp1471-1478
[20] Elisseeff, A., & Weston, J. (2001). A kernel method for multi-labelled classification. In T. G. Dietterich, S. Becker, and Z. Ghahramani (eds), Advances in Neural Information Processing Systems.
[21] Gan, G., & Ng, M. K. (2017). K-means clustering with outlier removal. Pattern Recognition Letters, 90, 8-14. https://doi.org/10.1016/j.patrec.2017.03.008
[22] Höppner, F., Klawonn, F., Kruse, R., & Runkler, T. (1999). Fuzzy cluster analysis: Methods for classification, data analysis and image recognition. Wiley.
[23] Hrushka, E. R., Campello, R., Freitas, A., & Carvalho, A. (2009). A survey of evolutionary algorithms for clustering. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 39(2), 133-155. https://doi.org/10.1109/tsmcc.2008.2007252
[24] Jain, A. K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8), 651-666. https://doi.org/10.1016/j.patrec.2009.09.011
[25] Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.
[26] Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264-323. https://doi.org/10.1145/331499.331504
[27] Kadam, N. V., & Pund, M. A. (2013). Joint approach for outlier detection. International Journal of Computer Science Application, 6(2), 445-448.
[28] Lam, D., & Wunsch, D. (2014). Clustering. Academic Press Library in Signal Processing, Signal Processing Theory and Machine Learning, (1). https://doi.org/10.1016/b978-0-12-396502-8.00020-6
[29] Liu, H., Li, J., Wu, Y., & Fu, Y. (2018). Clustering with outlier removal. Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). ACM, New York, NY, USA.
[30] MacQueen, J. B. (1967). Some methods of classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
[31] Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In SIGMOD. https://doi.org/10.1145/335191.335437
[32] Saxena, A., Prasad, M., … Gupta, A. (2017). A review of clustering techniques and developments. International Journal of Neurocomputing, 267, pp. 664-681. https://doi.org/10.1016/j.neucom.2017.06.053
[33] Trohidis, K., Tsoumakas, G., Kalliris, G., & Vlahavas, I. (2008). Multilabel classification of music into emotions. Proceedings of the 2008 International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA. https://doi.org/10.1186/1687-4722-2011-426793
[34] Tsoumakas, G., Katakis, I., & Vlahavas, I. (2010). Mining multi-label data. In O. Maimon & L. Rokach (eds), Data Mining and Knowledge Discovery Handbook, Springer, 2nd ed. https://doi.org/10.1007/978-0-387-09823-4_34
[35] Yu, Q., Luo, Y., Chen, C., & Ding, X. (2016). Outlier-eliminated k-means clustering algorithm based on differential privacy preservation. Applied Intelligence, 45(4), 1179-1191. https://doi.org/10.1007/s10489-016-0813-z
[36] Zhang, J. S., & Leung, Y. (2003). Robust clustering by pruning outliers. IEEE Trans. on Systems, Man, and Cybernetics – Part B, 33(6), 983-999. https://doi.org/10.1109/tsmcb.2003.816993