https://doi.org/10.31449/inf.v47i8.4872 Informatica 47 (2023) 67–78 67

Hybrid Variable-Length Spider Monkey Optimization with Good-Point Set Initialization for Data Clustering

Athraa Qays Obaid, Maytham Alabbas*
Department of Computer Science, College of Computer Science and Information Technology, University of Basrah, Basrah, Iraq
E-mail: itpg.athraa.qays@uobasrah.edu.iq, ma@uobasrah.edu.iq
*Corresponding author

Keywords: data clustering, spider monkey optimization, k-means clustering, variable-length spider monkey optimization, good-point set

Received: May 21, 2023

Data clustering refers to the process of grouping similar data points based on patterns or characteristics. It finds applications in image analysis, pattern recognition, and data mining. The k-means algorithm is commonly used for this purpose, but it has two main limitations. Firstly, it requires the user to specify the number of clusters explicitly. Secondly, it is highly sensitive to the initial selection of cluster centroids. To overcome these limitations, this study presents a novel approach that uses a variable-length spider monkey optimization algorithm (VLSMO) with a hybrid measure to determine the optimal number of clusters and the initial centroids. Experimental results obtained on real-life datasets demonstrate that VLSMO outperforms the standard k-means algorithm and other techniques in terms of accuracy and clustering capacity.

Povzetek: The paper describes a new data clustering method based on the VLSMO optimization algorithm, which removes the limitations of earlier algorithms and improves clustering accuracy and performance.

1 Introduction

Data clustering is one of the most important data mining approaches. It partitions data instances into smaller groups, where each group comprises objects that are similar to each other but distinct from those in other clusters. Clusters are defined by a center point and a proximity metric that measures the similarity or dissimilarity of the candidate data points. The primary objective of clustering analysis is to form clusters in which similar points are grouped as densely as possible while different clusters are kept as far apart as possible. Because of the large volume of data involved, clustering cannot be performed manually; specialized computing techniques are used instead. Clustering differs from classification in that most of the data is unlabeled, so the grouping is performed implicitly; it is therefore considered unsupervised learning.

The practice of clustering is prevalent across industries, as it allows related objects to be grouped. For example, a clustering approach can be used in marketing to identify consumers with similar purchasing habits. In educational settings, it can help analyze students' academic achievement by grouping those with similar study habits. Clustering can also be applied in diverse settings, including image segmentation, outlier detection, tumor detection, and fraud detection. Furthermore, clustering is a powerful method for uncovering hidden patterns within a dataset. Despite its wide range of applications, clustering poses a number of challenges [1]. Many clustering algorithms have been developed, including k-means, density-based spatial clustering of applications with noise (DBSCAN), expectation maximization (EM), and hierarchical agglomerative clustering (HAC). The k-means algorithm is widely recognized as the most commonly employed clustering algorithm.
The k-means algorithm is popular in scientific research and industrial applications because it is simple, converges quickly, and scales well. Nevertheless, because k-means distributes its starting points randomly during center initialization, it frequently converges to locally optimal clusterings, and this instability can lead to inaccurate groupings. The algorithm has two main limitations: the number of clusters must be specified in advance, and the outcome is sensitive to the initial center points. To overcome these limitations, a globally optimized approach must be adopted [2].

Several techniques have been proposed in the literature to address these limitations, including the elbow method, the gap statistic, and the canopy method, which can be used to determine an appropriate number of clusters (k). Furthermore, various algorithms can be used to identify the initial centroids, including the Forgy method, the random partition method, and the k-means++ algorithm. As far as the authors are aware, only a limited number of techniques determine both parameters simultaneously, typically by using optimization techniques such as genetic algorithms (GA), artificial bee colony (ABC), and particle swarm optimization (PSO). The current work aims to progress in this area by presenting a modified version of the spider monkey optimization algorithm (SMO) with variable length to determine the number of clusters and their initial centroids.

The remainder of this paper is organized as follows: An overview of related work is presented in Section 2. Section 3 provides a theoretical background on the topic. Section 4 describes the current work. A summary of the results is presented in Section 5 and discussed in Section 6. Finally, the paper is concluded in Section 7.

2 Literature review

Numerous techniques have been presented to determine suitable parameters for the k-means algorithm. Several of these techniques focus on estimating the number of clusters (k), others optimize the initial centroids, and some address both aspects simultaneously.

Multiple methods are available for estimating the number of clusters, and selecting a suitable one depends on the dataset and goals. One of the most common methods is the elbow method. It works by plotting the sum of squared distances between each data point and its nearest centroid against different values of k; the point where the curve starts to level off is considered the optimal value of k [3]. An alternative approach is the silhouette method, which calculates the silhouette coefficient for each point in order to assess how well it fits within its assigned cluster relative to the other clusters; the optimal number of clusters is the k value that maximizes the average silhouette coefficient. The gap statistic provides another robust method: the within-cluster sum of squares of the observed data is compared to that of randomly generated reference data, and the k value that maximizes the gap statistic is selected. Similarly, the canopy method uses hierarchical clustering to group data points and determines the optimal number of clusters by identifying the level in the hierarchy at which clusters are most distinctive. These are just a few of the many methods that can be used to estimate the number of clusters.
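For illustration, the first two heuristics above can be sketched with scikit-learn as follows. This is a minimal sketch; the Iris data and the candidate range of k are illustrative assumptions and not the experimental settings used later in this paper.

```python
# Sketch of the elbow and silhouette heuristics for choosing k (assumes scikit-learn).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = load_iris().data
candidate_k = range(2, 11)

inertias, silhouettes = [], []
for k in candidate_k:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)                         # elbow: plot this curve and look for the knee
    silhouettes.append(silhouette_score(X, km.labels_))  # silhouette: average coefficient for this k

# Silhouette criterion: pick the k with the highest average silhouette coefficient.
best_k = list(candidate_k)[int(np.argmax(silhouettes))]
print("inertia curve:", [round(v, 1) for v in inertias])
print("best k by silhouette:", best_k)
```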
On the other hand, choosing suitable initial centroids is also crucial, as they serve as the starting points for the k-means clustering algorithm and greatly influence the final clustering outcome. The most common method of selecting initial centroids is to choose them randomly from the dataset. Alternatively, more advanced algorithms such as k-means++ can be used to refine the selection process: k-means++ aims to select centroids that are distant from each other and representative of the data distribution, which helps to avoid the local optima that can occur when the initial centroids are not well chosen. The authors in [4] propose a hybrid algorithm, ACO-K-means, that combines the ant colony optimization (ACO) algorithm with the k-means algorithm to find good initial centroids, which can improve the clustering results; ACO-K-means outperforms the k-means algorithm on the tested datasets. A hybrid clustering algorithm named MABCKM is presented in [5], which combines modified ABC and k-means algorithms and addresses the dependency of k-means on the initial cluster centers. A cooperative algorithm is presented in [6] that combines PSO with k-means to provide both global and local search capability; experimental results indicate that the proposed algorithm achieves satisfactory efficiency and robustness when clustering data. In [7], three k-means initialization strategies are compared using the UCI ML handwritten digits dataset: random, k-means++, and PCA-based k-means. The findings indicate that the PCA-based initialization performs better than the other two approaches in terms of accuracy and running time. In [8], the authors show that combining the glowworm swarm optimization (GSO) algorithm, the k-means algorithm, and the good-point set can improve the clustering effect and its stability under unsupervised learning conditions.

Further, some studies have investigated using meta-heuristic algorithms to optimize the k-means parameters. These algorithms search the parameter space to find the optimal combination of k and initial centroids. For instance, an algorithm based on variable-length chromosomes is proposed in [9] for solving the k-means clustering problem; this approach is designed to determine the most appropriate number of cluster centers automatically. In [10], the authors propose KMBA, a method for optimizing the k-means clustering algorithm using the bat algorithm. In KMBA, each bat performs two essential tasks: (1) identifying the ideal number of clusters utilizing a discrete PSO approach based on CPSO, and (2) determining the optimal cluster centers using the k-means algorithm. It is easy to implement and effective in a variety of problems. In [11], the K-FA approach combines the firefly algorithm (FA) with k-means clustering: FA determines the centroids of the specified number of clusters, and k-means clustering further refines them. The K-FA algorithm clusters data accurately and effectively. In [2], the authors present ABCVL (artificial bee colony with variable-length individuals) for optimizing the k-means clustering algorithm; ABCVL determines both the value of k and the initial centroids. ABCVL has been evaluated on various datasets, demonstrating that it clusters more accurately than the standard k-means algorithm; it is also simple to implement and effective across a wide range of problem domains.
Finally, the literature indicates that there is no one-size-fits-all approach to estimating the k-means parameters. It is important to consider both the data characteristics and the problem at hand before choosing a method. Therefore, it is essential to experiment with different techniques and compare their performance to select the most appropriate one for a particular application. Table 1 summarizes a selection of relevant studies in the field.

Table 1: Relevant studies in the field (per-dataset results are given as best / mean / SD).
Ref | Year | Technique | Fitness | Dataset: best / mean / SD | Runs
[3] | 2019 | Analysis of four k-value selection algorithms (elbow method, gap statistic, silhouette coefficient, canopy) | Silhouette coefficient | Iris: 0.765 / --- / --- | ---
[4] | 2005 | Hybrid ant colony optimization with k-means (ACO-K-means) | Distance between cluster centers plus the sum of color and physical distances between each pixel and its assigned cluster center | Images: --- / --- / --- | ---
[5] | 2011 | Hybrid modified artificial bee colony and k-means (MABCKM) | Euclidean distance | Iris: 78.8514 / 78.8516 / 0.00004; Glass: 336.0840 / 336.6181 / 0.1604; Lung cancer: 535.6552 / 570.8732 / 10.5690; Soybean (small): 208.1545 / 227.6273 / 10.2121; Wine: 2.3707e6 / 2.3708e6 / 29.9780; Vowel: 3.06907 / 3.06907 / 10.1211 | 20
[6] | 2012 | Cooperative algorithm based on PSO and k-means | Sum of intra-cluster distances (SICD) | Iris: 96.73000 / 96.91000 / 0.170000; Pima: 47561.35 / 47580.43 / 59.97000; Wine: 16292.68 / 16293.75 / 0.880000; WDBC: 211.0400 / 214.8300 / 4.090000; Glass: 149473.1 / 149475.0 / 1.370000; Sonar: 234.6300 / 234.7800 / 0.160000 | 120
[7] | 2018 | Evaluation of three strategies for initializing the centroids | Silhouette coefficient | Digits: 0.15 / --- / --- | ---
[8] | 2018 | Glowworm swarm optimization, k-means, and good-point set (GSOK-GP) | Density-based | Iris: 97.32 / 89.33 / 0; Glass: 225.08 / 53.50 / 0.0090 | 20
[9] | 2019 | Genetic algorithm with variable-length strings | Davies-Bouldin index (DBI) | Random: 0.2126 / --- / --- | 20
[10] | 2012 | K-means and bat algorithm (KMBA) | Accuracy | Iris: 95% / --- / --- | ---
[11] | 2012 | Hybrid k-means and firefly algorithm (K-FA) | Intra-cluster distance | Iris: 96.13 / 103.87 / 2.45; WDBC: 149450.3 / 149590.2 / 197.31; Sonar: 229.35 / 231.36 / 2.94; Glass: 210.51 / 221.87 / 10.5; Wine: 16284.01 / 16327.53 / 10.10 | 30
[2] | 2022 | Improved artificial bee colony algorithm with variable-length individuals (ABCVL) | Hybrid-Scale2 | Mall Customers: 0.521397 / 0.520917 / 0.002586; Digits: 1.336862 / 1.345410 / 0.008708; Breast Cancer: 0.423736 / 0.423736 / 1.110×10^-16 | 30

3 Background

3.1 Spider monkey optimization (SMO) algorithm

SMO is a collaborative, iterative, trial-and-error process, much like other population-based algorithms. Six phases comprise the SMO process: Local Leader, Global Leader, Local Leader Learning, Global Leader Learning, Local Leader Decision, and Global Leader Decision. The Gbest-guided ABC [12] [13] and a modified version of ABC [14] [15] inspired the position-updating procedure in the Global Leader phase. Each SMO implementation step is explained in detail below [16].

3.1.1 Initialization of the population

Initially, SMO creates an initial population of N spider monkeys (SMs) with a uniform distribution. Each SM in the population is represented by a D-dimensional vector SM_i (i = 1, 2, ..., N; j = 1, ..., D), where D is the dimensionality of the optimization problem. Each SM represents a possible solution to the problem. Every SM_i is initialized as follows:

SM_ij = SM_minj + U(0,1) × (SM_maxj − SM_minj), (1)

where U(0,1) is a uniformly distributed random number in the range [0,1], and SM_minj and SM_maxj are the bounds of SM_i in the j-th direction.

3.1.2 Local leader phase (LLP)

Each SM adjusts its current position during LLP based on information from the experience of the local leader and the local group members. The fitness value of the newly acquired position is computed; if the new position has a higher fitness value than the old one, the SM replaces the old position with the new one. In LLP, the position update equation for the i-th SM, a member of the k-th local group, is:

SM_newij = SM_ij + U(0,1) × (LL_kj − SM_ij) + U(−1,1) × (SM_rj − SM_ij), (2)

where LL_kj stands for the j-th dimension of the k-th local group leader position, and SM_ij is the j-th dimension of the i-th SM. SM_rj is the j-th dimension of the r-th SM, chosen randomly within the k-th group such that r ≠ i, and U(0,1) and U(−1,1) are uniformly distributed random numbers in the ranges [0,1] and [−1,1], respectively. Algorithm 1 illustrates the method of updating LLP positions [16].

Algorithm 1: LLP
MG: the swarm's maximum number of groups
pr: the perturbation rate ∈ [0.1, 0.9], determining how much the current location is perturbed
for each k ∈ {1, ..., MG} do
  for each member SM_i ∈ k-th group do
    for each j ∈ {1, ..., D} do
      if U(0,1) ≥ pr then
        Update SM_newij using Eq. (2).
      else
        SM_newij = SM_ij
      end if
    end for
  end for
end for

3.1.3 Global leader phase (GLP)

Once LLP has been completed, GLP begins. Each SM updates its position during the GLP based on the experience of the global leader and the local group members. Here, the position is updated by applying Eq. (3):

SM_newij = SM_ij + U(0,1) × (GL_j − SM_ij) + U(−1,1) × (SM_rj − SM_ij), (3)

where j ∈ {1, 2, ..., D} is a randomly selected index and GL_j is the j-th dimension of the global leader position.
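The initialization and the two position-update rules in Eqs. (1)-(3) can be sketched as follows for fixed-length, real-valued positions. This is a minimal sketch; the bounds, the choice of the random partner r, and the per-dimension handling are illustrative assumptions rather than the exact implementation used in this paper.

```python
# Sketch of Eq. (1) initialization and the LLP/GLP updates of Eqs. (2)-(3).
import numpy as np

rng = np.random.default_rng(0)

def init_population(n, lo, hi):
    # Eq. (1): SM_ij = SM_minj + U(0,1) * (SM_maxj - SM_minj)
    return lo + rng.random((n, lo.size)) * (hi - lo)

def llp_update(sm_i, local_leader, sm_r, pr=0.5):
    # Eq. (2), applied dimension-wise with perturbation rate pr (Algorithm 1).
    new = sm_i.copy()
    for j in range(sm_i.size):
        if rng.random() >= pr:
            new[j] = (sm_i[j]
                      + rng.random() * (local_leader[j] - sm_i[j])
                      + rng.uniform(-1, 1) * (sm_r[j] - sm_i[j]))
    return new

def glp_update(sm_i, global_leader, sm_r):
    # Eq. (3): only one randomly chosen dimension j is modified.
    new = sm_i.copy()
    j = int(rng.integers(sm_i.size))
    new[j] = (sm_i[j]
              + rng.random() * (global_leader[j] - sm_i[j])
              + rng.uniform(-1, 1) * (sm_r[j] - sm_i[j]))
    return new

pop = init_population(20, lo=np.zeros(4), hi=np.ones(4))
```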
In this phase, SM positions are updated based on probabilities derived from their fitness, so a better candidate has a higher chance of improving. The probability prob_i may be determined using Eq. (4), or some other fitness-based expression:

prob_i = 0.9 × (fitness_i / max_fitness) + 0.1, (4)

where fitness_i denotes the i-th SM's fitness value and max_fitness denotes the group's maximum fitness. Additionally, the fitness of each SM's newly produced position is computed and compared with the previous one, and the better position is accepted. Algorithm 2 illustrates the process of updating GLP positions [16].

Algorithm 2: GLP
for k = 1 to MG do
  count = 1; GS = k-th group size;
  while count < GS do
    for i = 1 to GS do
      if U(0,1) < prob_i then // see Eq. (4)
        count = count + 1.
        Randomly select j ∈ {1, ..., D}.
        Randomly select SM_r from the k-th group s.t. r ≠ i.
        Update SM_newij using Eq. (3).
      end if
    end for
    if i is equal to GS then i = 1; end if
  end while
end for

3.1.4 Global leader learning (GLL) phase

During this phase, the position of the global leader is updated by applying greedy selection over the population, i.e., the SM position with the best fitness in the population is selected as the updated position. If the global leader position is not updated, the GlobalLimitCount is incremented by one.

3.1.5 Local leader learning (LLL) phase

In this phase, the local leader's position is updated using greedy selection within the group; the local leader is chosen based on the SMs' fitness. The updated position of the local leader is then compared to its previous position, and if the local leader has not changed, the LocalLimitCount is incremented by one.

3.1.6 Local leader decision (LLD) phase

If the position of any local leader is not updated within the Local Leader Limit number of trials, all members of that group update their positions either by random initialization or by using the combined knowledge of the global leader and the local leader through Eq. (5), depending on pr:

SM_newij = SM_ij + U(0,1) × (GL_j − SM_ij) + U(0,1) × (SM_ij − LL_kj). (5)

According to Eq. (5), the updated dimension of this SM is attracted toward the global leader and repelled from the local leader. Algorithm 3 displays the process of the LLD phase, where LocalLimitCount_k refers to the trial counter for the local best solution of the k-th group.

Algorithm 3: LLD Phase
for k ∈ {1, ..., MG} do
  if LocalLimitCount_k > Local Leader Limit then
    LocalLimitCount_k = 0.
    GS = k-th group size;
    for i ∈ {1, ..., GS} do
      for each j ∈ {1, ..., D} do
        if U(0,1) ≥ pr then
          SM_newij = SM_minj + U(0,1) × (SM_maxj − SM_minj)
        else
          Update SM_newij using Eq. (5).
        end if
      end for
    end for
  end if
end for

3.1.7 Global leader decision (GLD) phase

If the position of the global leader has not changed for a predefined number of iterations, the global leader divides the population into smaller groups: first into two groups, then three, and so on until the maximum number of groups (MG) has been reached. Whenever a new group is established during the GLD phase, the LLL procedure is initiated to elect its local leader. If the global leader's position remains unchanged regardless of how many groups are established, all groups are united into a single group. Hence, this mechanism follows the fission-fusion structure of spider monkey societies. Algorithm 4 illustrates how the GLD phase works.

Algorithm 4: GLD Phase
if GlobalLimitCount > Global Leader Limit then
  GlobalLimitCount = 0
  if Number of groups < MG then
    Divide the population into groups.
  else
    Combine all the groups to make a single group.
  end if
  Update Local Leaders' position.
end if

The SMO algorithm can be summarized in Algorithm 5 [16].

Algorithm 5: SMO Algorithm
1. Initialize the population of spider monkeys.
2. Evaluate the fitness of each spider monkey.
3. Identify the best spider monkey based on fitness.
4. Perform exploration and exploitation by updating the positions of spider monkeys.
5. Evaluate the fitness of the updated spider monkeys.
6. If the termination condition is met, stop; otherwise, go to step 3.
7. Output the best solution found.
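Putting the phases together, the overall flow of Algorithm 5 can be sketched as the skeleton below. The phase callables, the toy fitness, and the fixed iteration budget are placeholders for illustration, not the clustering objective or stopping rule used later in this paper.

```python
# Minimal skeleton of the SMO main loop in Algorithm 5 (phases passed as callables).
import numpy as np

rng = np.random.default_rng(0)

def smo(fitness, pop, phases, max_iter=100):
    fit = np.array([fitness(p) for p in pop])        # step 2: evaluate the population
    best = pop[int(np.argmax(fit))].copy()           # step 3: current global leader
    for _ in range(max_iter):                        # steps 4-6: iterate until termination
        for phase in phases:                         # LLP, GLP, GLL, LLL, LLD, GLD in order
            pop, fit = phase(pop, fit)
        if fit.max() > fitness(best):                # greedy update of the best solution
            best = pop[int(np.argmax(fit))].copy()
    return best                                      # step 7: output the best solution found

# Example call with a do-nothing phase and a toy fitness, just to show the interface.
identity_phase = lambda pop, fit: (pop, fit)
toy_fitness = lambda x: -float(np.sum(x ** 2))
leader = smo(toy_fitness, rng.random((20, 4)), [identity_phase])
```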
3.2 K-means clustering algorithm

The k-means algorithm and its extensions have a long history, yet applying them well remains challenging, because the algorithm depends on its initialization and requires the number of clusters to be specified in advance [17]. The algorithm partitions the dataset into k distinct clusters by iteratively converging to a local minimum. The initial k cluster centers are selected at random from the dataset, where k is predetermined by the user. In each iteration, every point in the dataset is assigned to the nearest cluster center. Once all data points have been assigned, the new centroid of each cluster is computed as the mean of all points in that cluster. This process is repeated until the cluster centroids no longer change or the maximum number of iterations has been reached [2] [18].

3.3 Basic theory of the good-point set

Good-point sets are defined and structured in the following manner [8] [19]:

1) Let G_s be the unit cube in s-dimensional Euclidean space, i.e.,
x ∈ G_s, x = (x_1, x_2, ..., x_s), (6)
where 0 ≤ x_i ≤ 1, i = 1, 2, ..., s.

2) Let P_n(k) be a set of n points in G_s, expressed as
P_n(k) = {(x_1^(n)(k), x_2^(n)(k), ..., x_s^(n)(k)), k = 1, 2, ..., n}, (7)
where 0 ≤ x_i^(n)(k) ≤ 1, i = 1, 2, ..., s.

3) Let r = (r_1, r_2, ..., r_s) be a given point in G_s, and let N_n(r) = N_n(r_1, r_2, ..., r_s) be the number of points of P_n(k) satisfying the inequality
0 ≤ x_i^(n)(k) ≤ r_i, i = 1, 2, ..., s. (8)
The deviation of the point set P_n(k) is then defined as
φ(n) = sup_{r ∈ G_s} | N_n(r)/n − |r| |, where |r| = r_1 r_2 ··· r_s.

4) If the deviation φ(n) of P_n(k) = {(x_1^(n)(k), x_2^(n)(k), ..., x_s^(n)(k)), k = 1, 2, ..., n} satisfies
φ(n) = C(r, ε) n^(−1+ε),
where C(r, ε) is a constant depending only on r and ε (ε > 0), then P_n(k) is called a good-point set and r is called a good point.

The applicable theorems show that the order of the deviation φ(n) in approximate integration depends solely on n and is independent of the spatial dimension of the sample. Computation in high-dimensional spaces may thus be supported more effectively by the good-point set. Moreover, the deviation φ(n) of n points P_n = X_1, X_2, ..., X_n obtained from a good-point set is much smaller than that of n points produced by a random technique when the distribution of the point set is unknown. Based on this characteristic, the good-point set offers a better initial distribution strategy for the swarm in a swarm intelligence algorithm.
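A good-point set in the unit cube can be generated, for example, from the fractional parts of k times the square roots of the first s primes. This is a minimal sketch: the paper does not state which good point r it adopts, so this particular construction is an illustrative assumption.

```python
# Sketch of a good-point set P_n(k) in the unit cube G_s (Section 3.3).
# The good point r = (frac(sqrt(p_1)), ..., frac(sqrt(p_s))), built from the first
# s primes, is one classical choice; treat it as an illustrative assumption.
import numpy as np

_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

def good_point_set(n, s):
    r = np.array([np.sqrt(p) % 1.0 for p in _PRIMES[:s]])  # the good point r in G_s
    k = np.arange(1, n + 1).reshape(-1, 1)
    return (k * r) % 1.0                                    # P_n(k) = frac(k * r): n points in [0,1]^s

# Map the unit-cube points onto actual search bounds (e.g., per-feature ranges).
points = good_point_set(n=20, s=4)
lo, hi = np.zeros(4), np.ones(4)
initial_positions = lo + points * (hi - lo)
```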
4 The current work

Swarm intelligence (SI) algorithms are significantly limited when dealing with fixed-length individuals. This paper proposes an enhanced version of the SMO algorithm, the variable-length SMO algorithm (VLSMO), that handles individuals of variable length. To check the effectiveness of VLSMO, we use it to tackle the challenges of the k-means algorithm by determining the number of clusters (k) and the initial cluster centers. The current work is described below.

4.1 Individual representation

Two representation techniques are used to represent individuals (SMs) in the population: indirect and direct. Suppose the dataset contains 100 data points numbered from 1 to 100, each data point has four attributes, and the number of clusters is 3. The SM described below illustrates both techniques using the following three data points.

Data point index | Attributes (4 features)
2 | 4.7 3.2 1.3 0.2
40 | 5.0 3.5 1.3 0.3
74 | 6.4 2.9 4.3 1.3

▪ Indirect technique: every SM position is represented by the indices of data points in the dataset, where each index corresponds to a sample. For example, the SM is represented according to this technique as follows.

Cluster 1 | Cluster 2 | Cluster 3
2 | 40 | 74

This individual implies that the number of clusters is 3: the first centroid corresponds to the data point with index 2, the second centroid to the data point with index 40, and the third centroid to the data point with index 74.

▪ Direct technique: every SM position is represented by the attribute values of data points in the dataset, where each position corresponds to a sample. For example, the SM is represented according to this technique as follows.

4.7 3.2 1.3 0.2 | Cluster 1
5.0 3.5 1.3 0.3 | Cluster 2
6.4 2.9 4.3 1.3 | Cluster 3

This individual implies that the number of clusters is 3. The first centroid corresponds to the data point with attributes (4.7, 3.2, 1.3, 0.2) instead of its index 2, the second centroid to the data point with attributes (5.0, 3.5, 1.3, 0.3) instead of its index 40, and the third centroid to the data point with attributes (6.4, 2.9, 4.3, 1.3) instead of its index 74.

4.2 Initial population based on the good-point set

To optimize SMO, it is effective to replace the random initial distribution with a more uniform coverage of the search space. To this end, the good-point set method is applied to enhance the initial distribution so that the solution space is covered as evenly as possible, following the theory discussed in Section 3.3. The population in this context has the following attributes: it comprises individuals {ind_1, ind_2, ind_3, ..., ind_N}, and the length of each individual corresponds to its number of clusters K_i, where nC_min ≤ K_i ≤ nC_max for 1 ≤ i ≤ N. Here, nC_min and nC_max denote the minimum and maximum number of clusters, respectively. Each individual represents a set of initial cluster centroids (C): ind_i = {C_1, C_2, C_3, ..., C_Ki}.
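Under the direct representation, an initial population of variable-length individuals can be sketched as below: each individual is a set of K_i candidate centroids, with K_i drawn from [nC_min, nC_max] and the centroid coordinates taken from a good-point set scaled to the feature ranges. How the lengths are assigned and how the good points are partitioned among individuals are illustrative assumptions of this sketch.

```python
# Sketch of the variable-length, direct-representation initial population
# (Sections 4.1-4.2). Good points supply the centroid coordinates; the way the
# rows are partitioned among individuals is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
_PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]

def good_point_set(n, s):
    r = np.array([np.sqrt(p) % 1.0 for p in _PRIMES[:s]])
    return (np.arange(1, n + 1).reshape(-1, 1) * r) % 1.0

def init_population(X, n_individuals, nc_min=2, nc_max=12):
    lo, hi = X.min(axis=0), X.max(axis=0)
    unit = good_point_set(n_individuals * nc_max, X.shape[1])
    population, cursor = [], 0
    for _ in range(n_individuals):
        k = int(rng.integers(nc_min, nc_max + 1))        # individual length K_i
        centroids = lo + unit[cursor:cursor + k] * (hi - lo)
        cursor += k
        population.append(centroids)                     # direct representation: K_i x d centroids
    return population

X = rng.random((150, 4))                                 # stand-in for a 4-feature dataset
pop = init_population(X, n_individuals=20)
print([len(ind) for ind in pop])                         # variable number of clusters per individual
```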
4.3 Objective function

We evaluated the clustering quality by experimenting with eight objective functions based on key metrics. These metrics encompass internal measures (1-2) and external measures (3-6), as well as inertia (7) and a hybrid measure we propose (8). They are explained below [1].

1. Davies-Bouldin index (DBI): it measures the average similarity between each cluster and its most similar one. The lower the DBI value, the better the clustering; zero indicates optimal clustering.
2. Silhouette coefficient: it measures how similar a data point is to its own cluster compared to the other clusters. The score ranges from -1 to 1, where a higher value indicates better clustering results.
3. Homogeneity: this measure assesses the degree to which all data points within a cluster belong to the same class. A high homogeneity score indicates well-separated clusters.
4. Completeness: it measures how effectively all data points belonging to a single class are placed together. A high completeness score indicates optimal clustering.
5. V-measure: it assesses homogeneity and completeness at the same time, measuring how well the clustering solution matches the known class labels. Higher V-measure scores indicate better clustering, with 1 being perfect clustering.
6. Adjusted Rand index (ARI): it measures the similarity between the predicted and actual clustering while correcting for chance. The score ranges from -1 to 1, where a higher value indicates better clustering results.
7. Inertia: it measures the sum of squared distances between each data point and the center of the cluster to which it belongs. The lower the score, the better the result [2].
8. Hybrid score (HS): this proposed measure integrates the silhouette and ARI measures, assigning a weight to each measure to determine its relative importance using Eq. (9):

HS = a × Silhouette + b × ARI, (9)

where a and b are real numbers within the range [0,1] and a + b = 1.

4.4 Modified search equation

The SMO algorithm was modified to accommodate individuals of varying lengths. To this end, the LLP equation has been revised into two scenarios, referred to as MLLPq, according to the possibly different lengths of the two individuals involved. Assume two individuals (Ind_1 and Ind_2) of lengths L_1 and L_2. Each scenario is described below.

• MLLPq_1: in the first scenario, L_1 is less than or equal to L_2. Equation (2) is applied to the first L_1 positions shared by the two individuals (Ind_1 and Ind_2), resulting in a new individual (Ind_3) with length L_1. A sketch of an MLLPq_1 example is shown in Figure 1(a).

• MLLPq_2: the second scenario involves L_1 being greater than L_2. First, Eq. (2) is applied to the first L_2 positions shared by the two individuals (Ind_1 and Ind_2), producing the first L_2 positions of the new individual (Ind_3). Then, a random integer P in the range (L_2, L_1] is selected. Finally, the positions from L_2+1 to P of Ind_1 are copied to the same positions in Ind_3. In this scenario, the length of Ind_3 ranges from L_2 to L_1. A sketch of an MLLPq_2 example is shown in Figure 1(b).

Figure 1 illustrates how a new individual Ind_3 is created by combining two individuals, Ind_1 and Ind_2, with lengths L_1 and L_2, respectively.

Figure 1: The schematic diagram of (a) MLLPq_1 and (b) MLLPq_2.
(a) Ind_1 = X_1 ... X_6, Ind_2 = Y_1 ... Y_9 → Ind_3 = Z_1 ... Z_6.
(b) Ind_1 = X_1 ... X_7, Ind_2 = Y_1 Y_2 Y_3; if the random value P = 4, Ind_3 = Z_1 Z_2 Z_3 X_4; if P = 7, Ind_3 = Z_1 Z_2 Z_3 X_4 X_5 X_6 X_7.
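The hybrid score of Eq. (9) and the variable-length combination MLLPq_1/MLLPq_2 can be sketched as below. The scikit-learn metrics implement the silhouette and ARI terms; which of SM_i, the local leader, and the random partner plays the roles of Ind_1 and Ind_2, and the truncation of the leader to the shared length, are illustrative assumptions of this sketch.

```python
# Sketch of the HS objective (Eq. 9) and the MLLPq_1 / MLLPq_2 combination (Section 4.4).
import numpy as np
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(0)

def hybrid_score(X, labels, true_labels, a=0.5, b=0.5):
    # Eq. (9): HS = a * Silhouette + b * ARI, with a + b = 1.
    return a * silhouette_score(X, labels) + b * adjusted_rand_score(true_labels, labels)

def eq2_block(ind1, ind2, leader):
    # Eq. (2) applied element-wise to the first len(ind1) shared centroid rows;
    # the leader is truncated to that length (an assumption of this sketch).
    ll = leader[: len(ind1)]
    return (ind1
            + rng.random(ind1.shape) * (ll - ind1)
            + rng.uniform(-1, 1, ind1.shape) * (ind2[: len(ind1)] - ind1))

def mllp(ind1, ind2, leader):
    l1, l2 = len(ind1), len(ind2)
    if l1 <= l2:                                   # MLLPq_1: the child keeps length L1
        return eq2_block(ind1, ind2, leader)
    child = eq2_block(ind1[:l2], ind2, leader)     # MLLPq_2: blend the shared L2 rows ...
    p = int(rng.integers(l2 + 1, l1 + 1))          # ... pick a random P in (L2, L1] ...
    return np.vstack([child, ind1[l2:p]])          # ... and copy rows L2+1..P from Ind1
```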
5 Results

5.1 Tested datasets

The current work evaluates five datasets from the UCI repository [20]; the properties of these datasets are listed in Table 2.

Table 2: Characteristics of the data sets considered.
Name | No. of classes | No. of attributes | Size of dataset
Seed | 3 | 7 | 210 (70, 70, 70)
Bupa | 2 | 6 | 345 (145, 200)
Iris | 3 | 4 | 150 (50, 50, 50)
Heart | 2 | 13 | 270 (120, 150)
Haberman | 2 | 3 | 306 (225, 81)

5.2 Parameters setting

The current work employs the following parameter settings for all experiments:
• The swarm size: N = 20
• The minimum number of clusters: nC_min = 2
• The maximum number of clusters: nC_max = 12
• The GlobalLeaderLimit = [N/2, 2 × N]
• The LocalLeaderLimit = D × N
• The maximum number of iterations = 100
• a and b for the HS function = 0.5

5.3 Results

We conducted a series of experiments to assess the effectiveness of the VLSMO method from different perspectives, referred to as Experiment I, Experiment II, and Experiment III. In Experiment I, our objective was to evaluate the impact of the objective function on the overall effectiveness of the VLSMO method: we examined how variations in the objective function affected its performance and optimization capabilities. In Experiment II, we analyzed the efficiency of the VLSMO method when handling individuals of varying lengths, aiming to identify any correlation between individual length and efficiency. In Experiment III, we assessed the efficiency of the VLSMO method by comparing it with other techniques previously discussed in [23]; this experiment provided valuable insights into how the VLSMO method compares to existing techniques, allowing us to better understand its strengths and weaknesses. Through these experiments, we obtained a comprehensive understanding of the VLSMO method and its performance across different dimensions.

5.3.1 Experiment I

In this experiment, we evaluated the effectiveness of the VLSMO method for clustering problems on the five datasets described in Section 5.1, using the eight measures defined in Section 4.3 as objective functions. Tables 3-13 summarize the results of this experiment, including the best value, mean, optimal k, and mean of k, with the best outcomes highlighted in bold typeface.

Table 3: Experiment I on the Seed dataset - indirect technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.667283 | 0.667283 | 2 | 2.0
Silhouette | 0.474919 | 0.485367 | 2 | 2.2
Homogeneity | 0.133885 | 0.133702 | 9 | 10.0
Completeness | 0.256622 | 0.256622 | 2 | 2.0
V-measure | 0.282329 | 0.282353 | 3 | 3.0
Inertia | 161.2699 | 162.6081 | 12 | 12.0
ARI | 0.27389 | 0.277021 | 3 | 3.1
HS | 0.400861 | 0.400861 | 3 | 3.0

Table 4: Experiment I on the Bupa dataset - indirect technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.767878 | 0.767878 | 2 | 2.0
Silhouette | 0.365595 | 0.384366 | 2 | 2.3
Homogeneity | 0.868698 | 0.878894 | 11 | 10.6
Completeness | 0.961773 | 0.966154 | 10 | 9.6
V-measure | 0.938959 | 0.943134 | 11 | 11.1
Inertia | 113115.8 | 115453.6 | 12 | 11.9
ARI | 0.961553 | 0.968388 | 6 | 8.3
HS | 0.685475 | 0.68636 | 2 | 2.1
Table 5: Experiment I on the Iris dataset - indirect technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.404293 | 0.455829 | 2 | 2.2
Silhouette | 0.318954 | 0.344599 | 2 | 2.2
Homogeneity | 0.022699 | 0.031922 | 12 | 10.9
Completeness | 0.116486 | 0.146221 | 2 | 2.3
V-measure | 0.219169 | 0.219169 | 3 | 3.0
Inertia | 22.39424 | 22.66280 | 12 | 11.9
ARI | 0.213836 | 0.224934 | 3 | 3.1
HS | 0.336311 | 0.34281 | 3 | 3.3

Table 6: Experiment I on the Heart dataset - indirect technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.713265 | 0.727587 | 3 | 3.3
Silhouette | 0.614592 | 0.618281 | 2 | 2.3
Homogeneity | 0.783788 | 0.791029 | 11 | 11.1
Completeness | 0.927299 | 0.932774 | 6 | 6.7
V-measure | 0.89814 | 0.902304 | 8 | 9.4
Inertia | 154717.8 | 164091.2 | 12 | 11.3
ARI | 0.904502 | 0.909118 | 3 | 3.8
HS | 0.793613 | 0.794022 | 2 | 2.2

Table 7: Experiment I on the Haberman dataset - indirect technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.80366 | 0.800347 | 3 | 4.2
Silhouette | 0.566715 | 0.567513 | 3 | 3.0
Homogeneity | 0.834284 | 0.845633 | 11 | 10.4
Completeness | 0.904877 | 0.910012 | 2 | 2.1
V-measure | 0.914638 | 0.916857 | 2 | 2.1
Inertia | 6472.869 | 6552.305 | 12 | 11.9
ARI | 0.811333 | 0.82498 | 2 | 2.2
HS | 0.734413 | 0.739459 | 2 | 2.5

Table 8: Experiment I on the Seed dataset - direct technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.667282 | 0.667282 | 2 | 2.0
Silhouette | 0.474919 | 0.490591 | 2 | 2.3
Homogeneity | 0.135167 | 0.147534 | 11 | 11.1
Completeness | 0.256622 | 0.271856 | 2 | 2.4
V-measure | 0.282328 | 0.282376 | 3 | 3.0
Inertia | 161.591242 | 166.966275 | 12 | 11.7
ARI | 0.2738904 | 0.277063 | 3 | 3.1
HS | 0.400861 | 0.400861 | 3 | 3.0

Table 9: Experiment I on the Bupa dataset - direct technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.767878 | 0.739907 | 2 | 3.8
Silhouette | 0.365594 | 0.369024 | 2 | 2.2
Homogeneity | 0.893468 | 0.901519 | 11 | 11.1
Completeness | 0.968224 | 0.968492 | 9 | 10.2
V-measure | 0.950264 | 0.953214 | 11 | 10.1
Inertia | 113166.23 | 114804.18 | 12 | 11.8
ARI | 0.969933 | 0.972824 | 6 | 7.1
HS | 0.685474 | 0.687244 | 2 | 2.2

Table 10: Experiment I on the Iris dataset - direct technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.404292 | 0.404292 | 2 | 2.0
Silhouette | 0.318953 | 0.331776 | 2 | 2.1
Homogeneity | 0.032809 | 0.041085 | 8 | 9.0
Completeness | 0.116485 | 0.155215 | 2 | 2.4
V-measure | 0.219168 | 0.2202207 | 3 | 3.2
Inertia | 22.507321 | 23.250523 | 12 | 11.7
ARI | 0.1984501 | 0.21826009 | 3 | 3.1
HS | 0.3363106 | 0.342809 | 3 | 3.0

Table 11: Experiment I on the Heart dataset - direct technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.713264 | 0.738831 | 3 | 3.2
Silhouette | 0.614592 | 0.615822 | 2 | 2.1
Homogeneity | 0.798409 | 0.800753 | 12 | 11.5
Completeness | 0.929894 | 0.9351805 | 6 | 6.8
V-measure | 0.900645 | 0.905123 | 10 | 9.7
Inertia | 155568.13 | 163630.64 | 12 | 11.4
ARI | 0.895478 | 0.9088609 | 3 | 3.7
HS | 0.793613 | 0.793817 | 2 | 2.1

Table 12: Experiment I on the Haberman dataset - direct technique.
Measure | Best | Mean | Optimal (K) | Mean (K)
DBI | 0.727124 | 0.740409 | 7 | 6.6
Silhouette | 0.566714 | 0.567868 | 3 | 3.1
Homogeneity | 0.835039 | 0.846492 | 12 | 10.4
Completeness | 0.893457 | 0.905006 | 2 | 2.3
V-measure | 0.914637 | 0.919636 | 2 | 2.0
Inertia | 6466.24 | 6521.739 | 12 | 12.0
ARI | 0.799551 | 0.822034 | 2 | 2.4
HS | 0.727788 | 0.734275 | 2 | 2.1

Table 13: Accuracy of the tested measures on the five datasets - indirect and direct techniques.
Measure | Accuracy
DBI | 10%
Silhouette | 20%
Homogeneity | 0%
Completeness | 10%
V-measure | 30%
Inertia | 0%
ARI | 30%
HS | 100%

5.3.2 Experiment II

In this experiment, we evaluated the effectiveness of the VLSMO method for optimizing the k-means parameters across the five clustering datasets and contrasted the outcomes against those of the standard k-means algorithm.
Tables 14-15 summarize the outcomes of this experiment, displaying the best value, mean, and standard deviation (SD) obtained by each algorithm over 30 runs, with the best results highlighted in bold typeface. Furthermore, Figures 2 and 3 depict the evolution curves of the VLSMO algorithm across the five clustering datasets.

Table 14: The results of Experiment II - indirect technique (basic k-means vs. VLSMO).
Dataset | k-means best | k-means mean | k-means SD | VLSMO best | VLSMO mean | VLSMO SD
Seed | 0.4052703 | 0.4052703 | 1.110223 | 0.400861 | 0.400861 | 2.775557
Bupa | 0.687704 | 0.687704 | 0.0 | 0.685475 | 0.690008 | 0.014974
Iris | 0.358471 | 0.358471 | 1.665334 | 0.336311 | 0.34281 | 0.002166
Heart | 0.797463 | 0.797463 | 2.220446 | 0.793613 | 0.793886 | 0.000694
Haberman | 0.800257 | 0.800257 | 1.110223 | 0.731062 | 0.736556 | 0.005237

Table 15: The results of Experiment II - direct technique (basic k-means vs. VLSMO).
Dataset | k-means best | k-means mean | k-means SD | VLSMO best | VLSMO mean | VLSMO SD
Seed | 0.405270 | 0.405270 | 1.11022 | 0.400861 | 0.401992 | 0.006093
Bupa | 0.687704 | 0.687704 | 0.0 | 0.685474 | 0.688840 | 0.013460
Iris | 0.358471 | 0.358471 | 1.66533 | 0.336310 | 0.346141 | 0.010807
Heart | 0.797463 | 0.797463 | 2.22044 | 0.793613 | 0.793954 | 0.000761
Haberman | 0.800257 | 0.800257 | 1.11022 | 0.727788 | 0.735944 | 0.006259

Figure 2: The curve of the VLSMO with the hybrid score on the five tested datasets, indirect representation.

Figure 3: The curve of the VLSMO with the hybrid score on the five tested datasets, direct representation.

5.3.3 Experiment III

In this experiment, we compared the effectiveness of the VLSMO method against k-means, PSO, Hybrid, k-means+GA, and a hybrid sequential clustering algorithm. The results, summarized in Table 16, present the quantization error (the average distance between data points and their cluster centroids) for the artificial problem I, artificial problem II, Wine, and Iris datasets. The best results are highlighted in bold, indicating superior performance.

Table 16: Comparison of VLSMO with k-means, PSO, Hybrid, k-means+GA, and a hybrid sequential clustering (HSC) algorithm regarding quantization error.
Algorithm | Artificial problem I | Artificial problem II | Wine | Iris
K-Means | 0.984 ±0.032 | 0.264 ±0.001 | 1.139 ±0.125 | 1.139 ±0.125
PSO | 0.769 ±0.031 | 0.252 ±0.001 | 1.493 ±0.095 | 0.774 ±0.094
Hybrid | 0.768 ±0.048 | 0.250 ±0.001 | 1.078 ±0.085 | 0.633 ±0.143
KMeans+GA | 0.772 ±0.05 | 0.260 ±0.001 | 1.384 ±0.099 | 0.982 ±0.128
HSC | 0.764 ±0.031 | 0.250 ±0.001 | 1.072 ±0.084 | 0.628 ±0.092
VLSMO | 0.761 ±1.9135e-15 | 0.250 ±4.432e-15 | 1.074 ±1.3688e-16 | 0.624 ±1.2243e-15

6 Discussion

The results of Experiment I (Tables 3-13) reveal that the proposed hybrid measure, HS, outperforms all other measures. This finding suggests that combining the silhouette and ARI measures into a hybrid measure can address the limitations of each individual measure, resulting in a more robust and comprehensive evaluation of clustering performance. The HS measure considers both the similarity of objects within clusters and the agreement between actual and predicted cluster labels, making it particularly useful for assessing clustering algorithms on complex and heterogeneous datasets. As a result, the HS measure is used in Experiment II.

In Experiment II, the results indicate that the VLSMO algorithm achieves superior performance compared to traditional techniques for setting the parameters of k-means. The VLSMO algorithm surpasses the basic k-means algorithm in terms of best value, mean, and standard deviation, as well as in the speed with which it identifies the optimal value on the tested clustering datasets.
Figures 2-3 demonstrate that the direct technique converges faster with VLSMO than the indirect technique.

Moving on to Experiment III, the outcomes consistently demonstrate that the VLSMO algorithm achieves the lowest average quantization error across all tested datasets, except for the Wine dataset, where the hybrid sequential clustering algorithm performs better. While other algorithms may yield better results in specific cases, their overall performance lacks consistency; only the proposed VLSMO algorithm consistently generates the best solutions across the board. These findings highlight the efficacy of the VLSMO algorithm in various experimental scenarios, supporting its suitability for optimizing clustering outcomes. The consistently superior performance of the VLSMO algorithm in terms of quantization error underscores its potential for effectively addressing complex clustering challenges.

7 Conclusion

In this paper, we proposed a new version of the SMO algorithm called VLSMO, which enhances its flexibility by removing the constraint of fixed-length individuals. Unlike the basic SMO algorithm, VLSMO can handle both fixed- and variable-length individuals, enabling it to tackle problems that were previously intractable. This feature makes the VLSMO algorithm suitable for a variety of problems that the SMO algorithm alone cannot address. To evaluate the effectiveness of the VLSMO algorithm, we applied it to significant clustering problems involving the identification of the optimal hyperparameters of the k-means algorithm, specifically the optimal value of k (the number of clusters) and the initial centroids. While existing techniques such as the elbow method, gap statistic, and canopy method are used to determine an appropriate number of clusters, and methods such as the Forgy method, the random partition method, and the k-means++ algorithm are employed to find initial centroids, the test results show that the VLSMO algorithm identifies both the optimal value of k and the initial centroids efficiently and accurately.

We aim to extend the application of the VLSMO algorithm to a wide range of problems that involve variable-length individuals, such as scheduling time-based sensor networks [21], optimizing road traffic coordination in a multipath scenario [22], and other similar issues.

References

[1] I. Aljarah, H. Faris, and S. Mirjalili, Evolutionary data clustering: Algorithms and applications. Springer, 2021, https://doi.org/10.1007/978-981-33-4191-3.
[2] S. F. Raheem and M. Alabbas, "Optimal k-means clustering using artificial bee colony algorithm with variable food sources length," International Journal of Electrical & Computer Engineering (2088-8708), vol. 12, no. 5, 2022, https://doi.org/10.11591/ijece.v12i5.pp5435-5443.
[3] C. Yuan and H. Yang, "Research on K-value selection method of K-means clustering algorithm," J, vol. 2, no. 2, pp. 226-235, 2019, https://doi.org/10.3390/j2020016.
[4] S. Saatchi and C. C. Hung, "Hybridization of the ant colony optimization with the k-means algorithm for clustering," in Image Analysis: 14th Scandinavian Conference, SCIA 2005, Joensuu, Finland, June 19-22, 2005, Proceedings 14, 2005: Springer, pp. 511-520, https://doi.org/10.1007/11499145_52.
[5] A. Kumar, D. Kumar, and S. Jarial, "A novel hybrid K-means and artificial bee colony algorithm approach for data clustering," Decision Science Letters, vol. 7, no. 1, pp. 65-76, 2018, https://doi.org/10.5267/j.dsl.2017.4.003.
[6] M. Neshat, S. F. Yazdi, D. Yazdani, and M. Sargolzaei, "A new cooperative algorithm based on PSO and k-means for data clustering," Journal of Computer Science, vol. 8, no. 2, p. 188, 2012, https://doi.org/10.3844/jcssp.2012.188.194.
[7] B. Li, "An experiment of k-means initialization strategies on handwritten digits dataset," Intelligent Information Management, vol. 10, no. 2, pp. 43-48, 2018, https://doi.org/10.4236/iim.2018.102003.
[8] Y. Li, Z. Ni, F. Jin, J. Li, and F. Li, "Research on clustering method of improved glowworm algorithm based on good-point set," Mathematical Problems in Engineering, vol. 2018, 2018, https://doi.org/10.1155/2018/8724084.
[9] Z. Bin, G. Zhichun, and H. Qiangqiang, "A genetic clustering method based on variable length string," in 2019 2nd International Conference on Safety Produce Informatization (IICSPI), 2019: IEEE, pp. 460-464, https://doi.org/10.1109/iicspi48186.2019.9095977.
[10] G. Komarasamy and A. Wahi, "An optimized K-means clustering technique using bat algorithm," European Journal of Scientific Research, vol. 84, no. 2, pp. 263-273, 2012.
[11] T. Hassanzadeh and M. R. Meybodi, "A new hybrid approach for data clustering using firefly algorithm and K-means," in The 16th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP 2012), 2012: IEEE, pp. 007-011, https://doi.org/10.1109/aisp.2012.6313708.
[12] G. Zhu and S. Kwong, "Gbest-guided artificial bee colony algorithm for numerical function optimization," Applied Mathematics and Computation, vol. 217, no. 7, pp. 3166-3173, 2010, https://doi.org/10.1016/j.amc.2010.08.049.
[13] S. F. Raheem and M. Alabbas, "Fuzzy logic-based self-adaptive artificial bee colony algorithm," in AIP Conference Proceedings, 2023, vol. 2591, no. 1: AIP Publishing, https://doi.org/10.1063/5.0119873.
[14] D. Karaboga and B. Akay, "A modified artificial bee colony (ABC) algorithm for constrained optimization problems," Applied Soft Computing, vol. 11, no. 3, pp. 3021-3031, 2011, https://doi.org/10.1016/j.asoc.2010.12.001.
[15] M. Alabbas and A. Abdulkareem, "Hybrid artificial bee colony algorithm with multi-using of simulated annealing algorithm and its application in attacking of stream cipher systems," Journal of Theoretical and Applied Information Technology, vol. 97, pp. 23-33, 2019.
[16] J. C. Bansal, H. Sharma, S. S. Jadon, and M. Clerc, "Spider monkey optimization algorithm for numerical optimization," Memetic Computing, vol. 6, pp. 31-47, 2014, https://doi.org/10.1007/s12293-013-0128-0.
[17] K. P. Sinaga and M.-S. Yang, "Unsupervised K-means clustering algorithm," IEEE Access, vol. 8, pp. 80716-80727, 2020, https://doi.org/10.1109/access.2020.2988796.
[18] G. S. Ohannesian and E. J. Harfash, "Epileptic seizures detection from EEG recordings based on a hybrid system of Gaussian mixture model and random forest classifier," Informatica, vol. 46, no. 6, 2022, https://doi.org/10.31449/inf.v46i6.4203.
[19] S. F. Raheem and M. Alabbas, "Dynamic artificial bee colony algorithm with hybrid initialization method," Informatica, vol. 45, no. 6, 2021, https://doi.org/10.31449/inf.v45i6.3652.
[20] C. Blake and C. Merz, "UCI repository of machine learning databases," 1998, http://archive.ics.uci.edu/ml/index.php.
[21] V.-P. Ha, T.-K. Dao, N.-Y. Pham, and M.-H. Le, "A variable-length chromosome genetic algorithm for time-based sensor network schedule optimization," Sensors, vol. 21, no.
12, p. 3990, 2021, https://doi.org/10.3390/s21123990. [22] L. Cruz-Piris, I. Marsa-Maestre, and M. A. Lopez- Carmona, "A variable-length chromosome genetic algorithm to solve a road traffic coordination multipath problem," IEEE Access, vol. 7, pp. 111968-111981, 2019, https://doi.org/10.1109/access.2019.2935041. [23] S. Rana, S. Jasola, and R. Kumar, "A hybrid sequential approach for data clustering using K- Means and particle swarm optimization algorithm," International Journal of Engineering, Science and Technology, vol. 2, no. 6, pp. 167-176, 2010, https://doi.org/10.4314/ijest.v2i6.63708.