https://doi.org/10.31449/inf.v46i4.3565 Informatica 46 (2022) 507–522

Parametrized MTree Clusterer for Weka

Marian Cristian Mihăescu, Marius Andrei Ciurez
Department of Computer Science and Information Technologies, Faculty of Automatics, Computers and Electronics, University of Craiova, Craiova, Romania
E-mail: cristian.mihaescu@edu.ucv.ro, mariusandrei.ciurez@gmail.com

Keywords: clustering, MTrees, metric spaces

Received: May 25, 2021

In the area of clustering, proposing or improving new algorithms represents a challenging task, due to an already existing, well-established list of algorithms and to the various implementations that allow rapid evaluation against tasks on publicly available datasets. In this work, we present an improved version of the MTree clustering algorithm implemented within the Weka workbench. The algorithmic approach starts from classical metric spaces and integrates parametrised business logic for finding the optimal number of clusters, choosing the division policy and other characteristics. The result is a versatile data structure that may be used in clustering to find the optimal number of clusters, but mainly for loading data sets that already have a known structure. Experimental results show that the MTree finds the proper structure in two clustering tasks, while other algorithms fail in various ways. A discussion of further improvements and of experiments on real data sets and functions is included.

Povzetek: A new MTree clustering method is described.

1 Introduction

As an unsupervised learning technique, clustering continuously receives attention within the machine learning area, due to the many available algorithms and the wide range of application domains for which practical solutions are continually being designed and implemented.

The general problem of building clusters of objects is narrowed down in [41] by clearly stating that the context of the problem is the first thing that needs to be taken into consideration. Therefore, building a general-purpose clustering algorithm that may work with any objects and solve any task is not a realistic option. Thus, when building a clustering data analysis workflow, the objects, a proper task description, and the particular distance and evaluation metrics or algorithmic approaches all need to be carefully taken care of.

In [1], the authors recently raised the problem of the clusterability of a dataset. Having an available dataset does not necessarily imply that we may meaningfully run a clustering process to solve a particular task. Thus, defining clusterability becomes a critical issue. Checking whether a dataset is clusterable is an inherently tricky problem, especially when dealing with actual data and solving a particular practical task.

The wide range of approaches used in the available clustering algorithm implementations has found various application domains, providing efficient solutions to many practical problems. Among the many application domains, we mention medical image processing (i.e., pattern recognition and image segmentation) [39], [24], general and natural language processing knowledge discovery [20], [33], [34], robot navigation [14], [22], and many other contexts.
In the area of unsupervised learning, there are several general classes of clustering algorithms (i.e., flat, hierarchical and density-based) that all share two common problems: finding the optimal number of clusters, and quickly and efficiently finding the correct clusters while taking into consideration distance measures appropriate for the objects (i.e., pixels, points, persons, books, etc.) that are being grouped. Further, once the clusters have been correctly determined, they may later be used to query for nearest neighbours or to run specific range queries. When tree data structures are used for managing the clusters, searching and traversal may be accomplished efficiently.

The objective of this work is to present an improved version of the metric trees (MTree) algorithm, first proposed by [7] and later refined by [40] and [43]. The first proposal of using MTrees in the context of clustering was made in [31] and later implemented as a Weka [16] package in [30]. This paper presents an improved version of the works from [30] and [31], which has been tested in a comparative benchmark against the k-MS morphological reconstruction clustering algorithm [37], along with classical algorithms (i.e., SimpleKMeans, Cobweb, FarthestFirst and Canopy), in [9].

This paper presents a further improved, parametrised version of the MTree algorithm in terms of managed items, used distance, the method for finding and setting K (i.e., the number of clusters), division policy and validation metrics. The critical improvement of the new parametrised version of the MTree clusterer is that it is now suitable for addressing a broader range of problems. The parametrised version is not meant to solve any clustering problem by dynamically choosing suitable parameters. Instead, a wide range of clustering problems may be addressed by the newly proposed MTree clusterer by properly setting its parameters, depending on the available data objects and the tackled task.

In general, there are two types of clustering problems. The first regards finding patterns in a dataset for which we do not know whether a specific structure or a certain number of clusters exists. The second type of task regards correctly building a data clustering model that may later be queried many times for specific information about the managed data. In this scenario, the cluster model is constructed only once, such that it may be regarded as a preprocessing step. Later, few insertions and updates occur at runtime, while most calls are range queries or kNN queries that need to be answered correctly and efficiently. Range queries determine the items whose distance from a specific query item is smaller than a particular value. The proposed approach is suited for both tasks, but it is more appropriate for the second one.

The paper is organised as follows. In Section 2, we perform a literature review with regard to similar libraries and to other application domains in which metric trees are used for clustering, such as time-series analysis of cytometry data, recommender systems, spatial data clustering, automatically computing the number of clusters for colour or greyscale image segmentation, and clustering quality metrics. Section 3 describes the proposed approach, with a detailed presentation of how each parameter may be set and how it influences the business logic of the MTree.
Section 4 presents the experimental results on two publicly available data sets, with runs under several parameter settings. Finally, Section 5 contains the conclusions of this work, summarises the main contributions and discusses potential improvements and applications.

2 Related work, limitations and approaches

Data clustering (also known as unsupervised learning) represents a subarea of machine learning, alongside supervised learning, reinforcement learning and deep learning. A particular sub-domain is represented by image clustering and segmentation [13], which presents the use of subtractive clustering along with the classical K-means algorithm to preprocess the data for optimal centroid initialisation. The experimental results were obtained on medical images representing blood cells infected with malaria. The reported approach brings better segmentation results than plain k-means in terms of the root mean square error (RMSE) and peak signal-to-noise ratio (PSNR) metrics.

Another approach for image clustering was proposed by Chang et al. in [5]. They propose a Deep Adaptive Clustering (DAC) approach that reduces clustering to a classification problem in which similarity is determined by cosine distance and the learned labelled features tend to be one-hot vectors, obtaining good results on popular and accessible datasets like MNIST, CIFAR-10 and STL-10.

One of the critical practical usages of clustering algorithms is image segmentation. Including spatial information, and handling outlier points at a later stage of the clustering process, has been proposed in [42]. From this perspective, outliers are data points with almost equal distance to their adjacent clusters, and they should therefore be taken into consideration later. This approach raises two issues. The first regards the fact that the order in which data points are given to the algorithm has significant importance: if a clustering algorithm is highly sensitive to the order in which data points are provided, a custom preprocessing step may be necessary. The second issue regards the very nature of the data set from the clusterability perspective. The critical verification that is highly recommended is to always check for clusterability before starting the clustering analysis.

Another application of clustering is greyscale image segmentation [44]. Compared with the clustering of colour images, or of images that incorporate spatial information, the goal there is to substantially decrease the time complexity of the algorithm by using the affinity propagation (AP) clustering algorithm. As always, when real-life greyscale images are clustered, correctly determining the number of clusters or segments is a critical problem.

More elaborate applications regard the indexing and retrieval of similar images from an image database (CBIR, Content-Based Image Retrieval), a challenging task that has been addressed in [29] and [27]. The first approach uses colour and texture as features and employs K-means and hierarchical clustering for finding the most similar images. The second approach uses colour, texture and shape as features and K-means as business logic for determining four classes of images: dinosaurs, flowers, buses and elephants. The obtained experimental results are promising in terms of excellent precision and recall values.

Clustering has also been used successfully in recommender systems based on collaborative filtering [11].
A novel K-medoids clustering recommendation algorithm has been proposed, which introduced an improved Kullback-Leibler divergence for computing item similarity. The final task was to improve the effectiveness of the developed recommendation system.

Lately, in the application domain of immunology, clustering algorithms have been applied to time-series cytometry data: ChronoClust, a new density-based clustering algorithm [26]. The task was to characterise the immune response to disease by tracking its temporal evolution.

Very recent works [1] put a high emphasis on defining clusterability and on checking whether a dataset is clusterable as a preprocessing step, before any other data analysis is performed. Therefore, before applying an algorithm to a task that requires clustering, a sanity check may be required: we should verify the clusterability of the dataset. In other words, clustering may not work on datasets that do not exhibit any internal structure, irrespective of the particular algorithm that may be employed. In our case, clusterability becomes a prerequisite that the dataset needs to meet before being loaded into the MTree structure. The main scenario of the newly proposed clustering algorithm is working with data that is known to be clusterable, i.e., for which we know there is a well-defined structure with a known number of clusters. A real-world dataset has a well-defined structure when there exists an assignment of items into clusters that is validated by a domain specialist. In the case of synthetic datasets, the function that creates the instances is designed in such a way that the clusters are well defined and represent the gold labelling for any clustering algorithm.

If the number of clusters is not known, the results highly depend on the particular practical context of the clustering problem. The context is defined by the problem (clustered objects and clustering task) and by the parameters of the MTree: object type, distance function, splitting policy and validation metrics. If the dataset is not clusterable, we argue that the MTree algorithm, as well as any other clustering algorithm, will exhibit undefined behaviour.

A more complex context occurs when the image source is unknown or when the ground truth for the training dataset is also unknown [4], [10]. In this particular situation, finding the optimal K represents an inherently difficult task. From an experimental algorithmic perspective, using proper distance metrics and a proper loss function in this optimisation problem represents one of the key ingredients for successful results. Such approaches propose elaborate solutions such as hierarchical clustering or clustering ensembles based on graph partitioning methods: the Cluster-based Similarity Partitioning Algorithm (CSPA), the HyperGraph Partitioning Algorithm (HGPA) and the Meta-Clustering Algorithm (MCLA).

One of the most well-known issues in unsupervised learning consists in determining the actual number of clusters in a dataset. Unfortunately, scenarios in which the value of K is known occur only in a few practical systems. In general, an application that performs an image processing task does not have any information regarding the actual number of clusters in the target image. This situation may occur when dealing with data streams [25] or with very high-dimensional datasets [17].
In general, the most suitable approaches try to determine K automatically, based for example on dynamic clustering [17] or on joint tracking and segmentation [32].

A general-purpose algorithm for finding the optimal number of clusters has been proposed by [6] and implemented in the R package NbClust. The main idea of this approach is that it may use up to 30 indices for voting on the number of clusters. The package implements a function that runs a clustering algorithm (i.e., k-means or hierarchical clustering) using various distance measures and aggregation methods. The main limitation of the approach is its generality: practical usage on a particular dataset needs to be parametrised with the appropriate clustering algorithm, subset of indices, distance measure and aggregation method.

One particular usage of clustering regards automatically computing the number of clusters for colour image segmentation [21]. This approach uses fuzzy c-means algorithms for extracting the chromaticity features of colours and trains a neural network with the obtained chromatic data. The trained model may be further used on new colour images to predict the number of clusters.

Among the many clustering libraries, we mention LEAC [36]. It is an open-source library with the source code publicly available in a GitHub repository. Thus, once the experiments are also performed on publicly available datasets, the results may be reproduced and also used in other setups for further improvements. The inclusion of 23 state-of-the-art evolutionary algorithms for partitional clustering within a library that allows easy and fast development and integration of new clustering algorithms represents the kind of solution we also target when improving the initial version of the MTree clustering algorithm within the Weka package.

Another usage of metric trees has been reported recently in [12]. The task is to quickly and efficiently scale up the problem of shadow rendering for 70 million objects (i.e., triangles) in real time. The proposed metric tree uses binary space partitioning (BSP) and ternary object partitioning (TOP) as splitting policies for grouping triangles into clusters as precomputed bounding capsules (line-swept spheres).

Finally, the whole clustering process needs validation, and many quality metrics may accomplish this for a wide range of algorithms [18]. Depending on the structure of the dataset, various clustering quality frameworks [23], [38] have been proposed. The key issue that always arises regards choosing the proper similarity and quality metrics [38].

3 Proposed approach

The MTree parametrised clusterer was first proposed in [31], in an attempt to define a new clustering algorithm whose main business logic relies on the metric trees initially presented in [7] and later in [40] and [43]. The initial C++ implementation was designed to cluster students, each defined by three grades obtained during one semester. The main shortcoming of the proposed structure was that it was designed only for managing student objects and had several hard-coded parameters needed for building the tree. The most important one is nrKeys, which represents the maximum number of students contained in a node (i.e., a cluster). This approach has a critical limitation in that the splitNode() method was called only when a node was full.
Other limitations regard the lack of a parametrised division policy, of configurable distance metrics, and of other features needed for flexibly running a clustering process.

Later, the initially proposed MTree clustering algorithm was contributed as an official Weka package [30]. This new Java-based implementation was meant to be functionally available within the Weka workbench like any other clusterer, such that it may be used in various practical situations. The MTree clusterer from Weka was used in [37] in a comparative analysis on publicly available images. The results of MTree were inferior, and the limitations were subsequently addressed in [9]. As critical improvements, the MTree version from [9] uses off-line dataset preprocessing for finding the optimal number of clusters and adjusts the business logic of the clusterer in terms of division policy and of the distance metric between instances.

The currently proposed version of the MTree clusterer represents a flexible, parametrised version of the former one in terms of division policy, used distance, and the method for finding and setting the number of clusters.

3.1 Definitions and context

A metric space $M = (D, d)$ on a data domain $D$ with the distance function $d: D \times D \to \mathbb{R}$ postulates:

$$\text{Non-negativity:} \quad \forall x, y \in D,\; d(x, y) \ge 0 \quad (1a)$$
$$\text{Symmetry:} \quad \forall x, y \in D,\; d(x, y) = d(y, x) \quad (1b)$$
$$\text{Identity:} \quad \forall x, y \in D,\; x = y \Leftrightarrow d(x, y) = 0 \quad (1c)$$
$$\text{Triangle inequality:} \quad \forall x, y, z \in D,\; d(x, z) \le d(x, y) + d(y, z) \quad (1d)$$

The conditions specified above are satisfied by most discrete or continuous distance (or similarity) metrics (or measures): Euclidean, Minkowski, Manhattan, quadratic form distance (i.e., colour histograms, weighted Euclidean distance), edit distance, Jaccard's coefficient or Hausdorff distance.

Building clusters of objects in metric spaces relies on the partitioning method and on the shape of the decision boundary that lies between two adjacent clusters. The partitioning method regards how a set of objects is split into two or more clusters, taking into account a specific optimisation criterion such as the sum of squared errors (SSE) in the case of the simple k-means algorithm. Regarding the decision boundary, the two most common options are ball partitioning, as in metric spaces, and hyperplane partitioning.

The current implementation of the MTree algorithm represents a two-level ball decomposition generalisation of the approach from [40]. The limitations of [40] regard choosing an arbitrary object as the pivot, using only binary splits around the median object (which implies previously sorting the objects), and the multilayered structure due to recursive construction. From the practical perspective of running a clustering process, these are substantial limitations, because ordering may not always be possible, binary splitting may not be useful when the dataset consists of many clusters, and the notion of a cluster becomes unclear in a multilayered approach.

Definition 1. The newly designed MTree data structure is a two-level, perfectly balanced multiway tree that indexes a set of objects into its leaves, which reside only on the second level. After building the tree, the root contains the set of centroids and their corresponding covered radii. On the second level, the leaves represent the clusters, containing objects whose distance is smaller than or equal to the radius assigned to the corresponding centroid within the root.
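Definition 1, combined with the triangle inequality (1d), is what makes queries over the structure efficient: an entire leaf can be discarded when the query ball cannot intersect the leaf's ball. The following minimal Java sketch shows that pruning rule for a range query; the helper names and the flat representation of the root (parallel centroid/radius/leaf collections) are illustrative only, not the package's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import weka.core.DistanceFunction;
import weka.core.Instance;

// Sketch of a range query over the two-level structure of Definition 1.
// By the triangle inequality (1d), every object o in leaf i satisfies
// d(q, o) >= d(q, centroid_i) - radius_i, so a leaf whose lower bound
// already exceeds eps cannot contain an answer and is never scanned.
public class RangeQuerySketch {
    public static List<Instance> rangeQuery(Instance query, double eps,
                                            List<Instance> centroids,
                                            double[] radii,
                                            List<List<Instance>> leaves,
                                            DistanceFunction d) {
        List<Instance> result = new ArrayList<>();
        for (int i = 0; i < centroids.size(); i++) {
            double toCentroid = d.distance(query, centroids.get(i));
            if (toCentroid - radii[i] > eps) {
                continue; // query ball cannot intersect this leaf's ball
            }
            for (Instance o : leaves.get(i)) { // scan only surviving leaves
                if (d.distance(query, o) <= eps) {
                    result.add(o);
                }
            }
        }
        return result;
    }
}
```

Note that the pruning argument relies only on the metric axioms, which is why the structure works for any distance satisfying (1a)-(1d), not just the Euclidean one.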
The key features of the parametrised MTree implementation are:

1. The possibility to process various object data types provided as input (i.e., image, document, etc.), represented as multidimensional vectors.

2. The possibility to set a particular distance function between objects, by direct usage of the distance functions already available in Weka or by using a newly defined custom function.

3. The possibility to set a particular division policy that will be used internally as needed. Practically, the logic of the division policy is managed by the clustering algorithm. This feature is accompanied by a parameter that controls the number of clusters into which a full leaf may be split. Available options are binary object partitioning (BOP), ternary object partitioning (TOP) or multi-object partitioning (MOP).

4. The possibility to set a specific number of clusters into which the entire dataset will be partitioned, if the number of clusters is known. If the number of clusters is not known, the MTree will find the optimal number of clusters considering the other parameters that have been set.

5. The possibility to compute, on request, various clustering quality metrics that give a general idea of the quality of the clustering process. This feature is critical for benchmarking the MTree clustering results against the results produced by other clustering algorithms.

The MTree uses a single node data structure for both the root and the leaves. In the root node, the instances are represented by centroids, and each element of the radixes vector represents the covered radius of the corresponding centroid. In the same way, each element of the routes vector is the address of the corresponding leaf node. The correspondence between the centroids from the root node, the covered radii and the leaves is accomplished by vector position. For example, the first element of the root's instances vector represents the first centroid, with a radius defined by the first element of the radixes vector; the first element of the routes vector is the address of the leaf node that contains the objects whose distance to the first centroid is smaller than or equal to the covered radius. In the case of leaf nodes, the instances represent the data objects themselves; the isLeaf flag is set by the business logic to TRUE, and the radixes and routes vectors are set to null. Table 1 summarises the node structure.

Table 1: The structure of the MTree node

Field name   Description
nrKeys       The number of objects actually stored in the node.
isLeaf       This flag represents the node type: root or leaf.
radixes      The vector of distances covered by each centroid.
routes       The vector of node addresses to leaf nodes.
instances    The objects from the node (parent or leaf).
parent       The address of the parent node.
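Read literally, Table 1 corresponds to a Java class along the following lines. This is a sketch of the layout only: the field types are assumptions inferred from the descriptions above, not the package's actual declarations.

```java
import java.util.ArrayList;
import weka.core.Instance;

// Sketch of the node layout from Table 1. Field types are assumptions
// inferred from the text, not the Weka package's actual declarations.
public class Node {
    int nrKeys;                     // number of objects stored in this node
    boolean isLeaf;                 // node type: leaf (true) or root (false)
    ArrayList<Double> radixes;      // covered radius per centroid (root only)
    ArrayList<Node> routes;         // addresses of the corresponding leaves (root only)
    ArrayList<Instance> instances;  // centroids (root) or data objects (leaf)
    Node parent;                    // parent address; null for the root
}
```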
Before starting any computation needed for inserting a new object into the MTree, the algorithm checks whether we are in a leaf and whether the leaf has objects in it. If either of these conditions does not hold, insertion may not take place, because insertion may be performed only in a leaf, and splitting may be taken into consideration only if the leaf has objects in it. Further, the leaf is evaluated for splitting, parametrised by evaluatorOfK, a function that determines the optimal number of clusters. This helper function takes the objects from treeNode as its parameter and outputs splitEval as an evaluation of the splitting. If splitEval is valid, the leaf node is split into the optimal number of clusters by using a parametrised divisionPolicy. When no splitting is necessary, the insertNonFull function is called to append the new object to the leaf.

The two main ingredients of the mTreeInsert function are the evaluation of the optimal number of clusters of a leaf and the division policy used for splitting. The current implementation uses a function called voteK that computes the optimal number of clusters in a similar way to the NbClust package [6] in R; an illustrative sketch is given at the end of this subsection. The main difference is that NbClust uses 30 indices, while voteK currently integrates only 8: the Davies-Bouldin, Dunn, Xie-Beni, Banfield-Raftery, McClain-Rao, Ray-Turi, Calinski-Harabasz and PBM indices. The architectural design of the package allows the easy call of any method that, given an input dataset, can compute the optimal number of clusters.

The nrKeys field represents the number of items (i.e., centroids or data objects) contained in a node (i.e., root or leaf). For the root node, the items are centroids; for a leaf node, the items are the sample points. Depending on the implementation, the centroids may be items from the dataset or computed instances. The isLeaf field of the data structure clarifies what the items of a node represent: centroids or instances. As far as our MTree is concerned, we pay extra attention to the nodes, because whether they are leaves (i.e., terminal nodes containing instances) or internal nodes (i.e., containing only centroids) is critical information for the algorithm design. The instances vector contains the centroids or the data objects, depending on whether the node is the root or a leaf. If the node is the root, the radixes vector contains the distance covered by each centroid, while the routes vector contains the leaves' addresses. On the other hand, if the node is a leaf, then the parent field contains the root address, while the radixes and routes vectors are empty.

The second key ingredient of the mTreeInsert function is the division policy. The default option for this parameter is the simple k-means algorithm, but any other strategy may be used as needed. Besides calling particular clustering algorithms as a multi-object partitioning (MOP) strategy, other options are binary space partitioning (BSP) or ternary object partitioning (TOP), selected by simply setting the proper parameter values.

If there is no need for node splitting, insertion reduces to appending the new object to that leaf, which is accomplished by calling the insertNonFull function. Intuitively, when a leaf is not full, we insert the new object; otherwise, we split the leaf. The most crucial difference from former versions and from other approaches is that splitting is not triggered when the number of stored objects reaches a specific value, but when the current objects exhibit the property that they are properly clustered, and therefore a split is compulsory. Algorithm 1 summarises the business logic for inserting a new object into an existing MTree leaf node.
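Before turning to the split procedure, here is what the voteK scheme described above might look like in code. The voteK name and the idea of voting across validity indices come from the paper; the code itself is an illustration of the voting step, not the package source, and the index implementations are left abstract.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;
import weka.core.Instances;

// Illustrative sketch of a voteK-style scheme: every cluster-validity
// index proposes a number of clusters for the leaf's objects, and the
// value gathering the most votes wins. The package integrates 8 indices.
public class VoteK {
    public static int voteK(Instances leafObjects,
                            Iterable<ToIntFunction<Instances>> indices) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (ToIntFunction<Instances> index : indices) {
            int k = index.applyAsInt(leafObjects); // K suggested by this index
            votes.merge(k, 1, Integer::sum);       // count one vote for k
        }
        // Return the K with the most votes; keep one cluster if no index voted.
        return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(1);
    }
}
```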
The second critical procedure is the one that performs the split of a leaf node, as specified in the insertion algorithm. Its main ingredient is the clusteringEvaluator setup, which allows breaking the leaf into clusters according to a specific division policy and a predetermined optimal number of clusters. The newly obtained clusters are added into a splitedClusters vector and returned as output. The main improvement in the split procedure is avoiding splitting a leaf into a hard-coded number of clusters, by using a pre-determined optimal number of clusters and by parametrising the division policy that runs the effective split.

Algorithm 2 summarises the business logic for splitting. The input consists of a tree node (i.e., a cluster) and an evaluator, and the returned value consists of a vector of clusters, called splitedClusters. The evaluator has the task of deciding whether the node should remain a single cluster or be split into two or more clusters. If the evaluator identifies more than one cluster in the node, the node will split and the tree will change its structure.

Algorithm 1 mTreeInsert(MTree treeNode, Instance newObject)
1:  if treeNode is a leaf and has objects then
2:    clusteringEvaluator = getOptimalNrOfClusters(treeNode, evaluatorOfK)
3:  end if
4:  if clusteringEvaluator is valid then
5:    nrOfClusters = clusteringEvaluator.getNumberOfClusters()
6:    if nrOfClusters > 1 then
7:      splitNode(treeNode, clusteringEvaluator)
8:    else
9:      insertNonFull(treeNode, newObject)
10:   end if
11: end if

Algorithm 2 mTreeSplitNode(MTree treeNode, ClusterEvaluation clusteringEvaluator)
1:  clusters = getClusters(treeNode, clusteringEvaluator)
2:  if size of clusters > 0 then
3:    for each cluster in clusters do
4:      centroidInstance = chooseCenter(cluster)
5:      splitedCluster.addCentroid(centroidInstance.getCentroid())
6:      splitedCluster.setRadix(centroidInstance.getRadix())
7:      splitedClusters.add(splitedCluster)
8:    end for
9:  end if
10: return splitedClusters

The bounding volumes of the MTree leaves are circles if the objects are 2-dimensional, or spheres if the objects are 3-dimensional and the Euclidean distance is used. For n-dimensional spaces, the bounding volumes are hyperspheres, the generalisation of a set of points equally distanced from a given point. This generalisation is valid only for the Euclidean distance; for other distances, the bounding volume is the analogue of a sphere in that particular vector space.

3.2 MTree algorithms description

The MTree implementation is aimed at usage by client code in practical scenarios. Although the primary usage scenario addresses the situation in which we know the actual number of clusters, the clusterer may also be used when the data analyst specifies a particular number of clusters depending on the tackled task, or when the number of clusters is not known or does not exist.

In any of the situations mentioned above, the parameters that need to be set are the input data-set, the number of clusters, the distance metric, the evaluator of the number of clusters and the division policy algorithm. The main practical goal of the current version of the MTree clusterer is to verify that it correctly loads the data given specific parameters and creates a data model that validates the ground-truth model of two publicly available data sets. Finally, the clustering validation metrics are verified against other clustering algorithms implemented in the Weka workbench.
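Between splits, the insertNonFull path of Algorithm 1 reduces to a routing step: compute the distance from the new object to every root centroid and append the object to the leaf of the closest one (this is also the step counted in the complexity analysis of Section 3.3). A minimal sketch follows; the names and the radius-maintenance comment are illustrative, not the package's actual code.

```java
import java.util.List;
import weka.core.DistanceFunction;
import weka.core.Instance;

// Sketch of the routing step behind insertNonFull: the new object is
// appended to the leaf whose root centroid is closest. Enlarging the
// covered radius afterwards is one possible policy, stated here only
// as an assumption.
public class RoutingSketch {
    public static int chooseLeaf(Instance newObject, List<Instance> centroids,
                                 DistanceFunction d) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double dist = d.distance(newObject, centroids.get(i));
            if (dist < bestDist) { // remember the closest centroid so far
                bestDist = dist;
                best = i;
            }
        }
        // Caller appends newObject to the leaf at index 'best' and may
        // enlarge radii[best] to max(radii[best], bestDist).
        return best;
    }
}
```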
One key aspect of correctly building the MTree from data regards the order in which the data objects are provided as input. As a general rule, any clustering algorithm is highly sensitive to the order in which data objects are considered in the clustering process. The seed mechanism is the general solution for clustering algorithms and is also used as a standard option in the MTree. In our case, the seed mechanism represents a random shuffle of the order in which the instances are added to the MTree.

The pseudocode of the mTreeInsert function presents the structure and logic of the operations performed when inserting a new instance into the tree. The function's parameters are the address of the tree and the item that should be inserted. Since insertion takes place only in leaf nodes, the algorithm first checks that we are in a leaf node and then determines the optimal number of clusters for that leaf. If more than one cluster is found in the leaf, the node is split and the instance is placed into the appropriate cluster.

The pseudocode of the mTreeSplitNode function actually performs the split of a leaf. Along with the treeNode, an evaluator is given as a parameter, which gathers the parameters needed to determine the clusters. Once the clusters are determined, the centroids and the corresponding radixes are placed in the root, the addresses of the new clusters are placed in the routes vector, and the instances themselves are placed in the appropriate clusters.

A particular situation occurs when we do not know the correct number of clusters. In this case, the adequate solution is to preprocess the data-set to check clusterability and to determine the valid number of clusters, if such a number exists. The situation in which the data-set does not have a clear, well-defined structure may be interpreted in two ways: either the data-set has several possible options for the actual number of clusters, or we do not know the correct number of clusters. In the first situation, a domain expert should consider the specific data analysis task and choose the number of clusters that fits the practical problem. More extensive work needs to be done as a preprocessing step in the latter situation.

Figure 1: Nodes structure

As a general rule, when the data-set has no structure, the first preprocessing step should define the number of clusters as a task-dependent value. Then, the centroids should be defined as representatives of each cluster considered by the domain expert. Finally, the objects from the data-set are added to the pertaining cluster only if a maximal threshold distance from the centroid is not exceeded. Objects that have almost equal distances to several centroids are considered outliers and are not added to any cluster. In this way, the data analyst may obtain a clusterable data set with a specific number of clusters. Having a clusterable data-set is a prerequisite for the MTree clustering algorithm, as for any other algorithm. Therefore, if the data-set does not meet clusterability, building an MTree clusterer, or any other clusterer, will exhibit undefined behaviour.

Finding the correct number of clusters by using the MTree may be performed by running it with various parameter settings, in terms of the number of clusters and the division policy, in an attempt to obtain a clusterer whose validation parameters show that the correct patterns have been discovered. In this use case, the MTree data structure is built for finding whatever clusters are to be found.
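Since the seed is realised as a random shuffle of the insertion order, the mechanism can be made concrete with a few lines of Java. The sketch below is illustrative rather than the package source; only the Weka classes and methods used are real.

```java
import java.util.Random;
import weka.core.Instances;

// Sketch of the seed mechanism described above: the seed fixes the
// random shuffle of the order in which instances enter the MTree, so
// a given seed always reproduces the same insertion order.
public class SeededOrder {
    public static Instances shuffledCopy(Instances data, int seed) {
        Instances copy = new Instances(data); // do not mutate the caller's data
        copy.randomize(new Random(seed));     // reproducible insertion order
        return copy;
    }
}
```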
3.3 Complexity analysis

The complexity of building the MTree from data depends on the number of objects, the number of clusters, the method for finding the optimal number of clusters and the number of distance computations. The number of clusters in the data-set represents the number of splits that need to be performed while building the tree. The most critical operation is finding the optimal number of clusters, which is called after each object insertion. The number of distance computations is related to the number of clusters, since the distances from the newly inserted object to all the centroids from the root need to be computed to determine the suitable leaf where the insertion should occur. The most time-consuming function is getOptimalNrOfClusters, which is called whenever a new object is to be inserted. We have observed that, for a reasonably small number of clusters and a large number of objects, the method getOptimalNrOfClusters triggers a split fewer times than the number of clusters. For example, once the number of leaves of the MTree has reached the true number of clusters, looking for the optimal number of clusters becomes useless: further insertions are performed in constant time, just by determining the proper leaf where the new object needs to be inserted. As stated in [15], the performance of building an MTree with $n$ objects is analogous to that of k-d trees, that is, $O(n \log n)$ in the worst-case scenario. Depending on the split method, the time may increase to $O(n \log^2 n)$, or to $O(kn \log n)$ for $k$ dimensions. Still, the currently proposed method is highly sensitive to the order in which objects are inserted, to the seed selection and to the particular parameter setup, as are all other clustering algorithms.

The critical property of the MTree is that, after correctly building the clusters, the operations of inserting, removing and querying may be performed in $O(\log n)$ time. These aspects are not tested by the current work and need to be investigated experimentally in practical clustering tasks.

4 Experimental results

4.1 Data-sets description

Experiments have been performed on two synthetic, publicly available data sets from the clustering basic benchmark [19]: Unbalance [35] and Dim2 [28]. Unbalance is a synthetic 2-d data-set with 6500 points and 8 Gaussian clusters for which the ground-truth centroids and partitions are known. Dim2 is also a synthetic 2-d data-set, with 1351 points and 9 Gaussian clusters. Figures 4 and 5 present plots of the raw input data. We have chosen two synthetic datasets for which the ground-truth centroids and partitions are known because they are suitable for comparing the clustering results obtained by the MTree algorithm against those of the other algorithms implemented in the Weka workbench.

Figure 2: Sample MTree
Figure 3: Sample MTree split node
Figure 4: Unbalance dataset
Figure 5: Dim2 dataset

Finally, we have tested the MTree on the classical Wine and Iris datasets from the UCI Machine Learning Repository [3]. These real-world datasets were chosen because they may be successfully used in clustering tasks: they are labelled, and classical unsupervised training may correctly determine the classes.
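For readers without the benchmark files at hand, data with the same flavour (2-d Gaussian blobs with known ground truth) can be synthesised directly as Weka Instances. The sketch below uses only real Weka classes; the centres, counts and spread are arbitrary illustration values, not the benchmark's actual parameters.

```java
import java.util.ArrayList;
import java.util.Random;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

// Sketch: build a 2-d synthetic dataset of Gaussian clusters with known
// ground truth, in the spirit of the Unbalance/Dim2 benchmarks.
public class GaussianBlobs {
    public static Instances generate(double[][] centers, int pointsPerCluster,
                                     double stdDev, long seed) {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("x"));
        atts.add(new Attribute("y"));
        Instances data =
            new Instances("blobs", atts, centers.length * pointsPerCluster);
        Random rnd = new Random(seed);
        for (double[] c : centers) {
            for (int i = 0; i < pointsPerCluster; i++) {
                double[] vals = { c[0] + stdDev * rnd.nextGaussian(),
                                  c[1] + stdDev * rnd.nextGaussian() };
                data.add(new DenseInstance(1.0, vals)); // weight 1.0
            }
        }
        return data;
    }
}
```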
4.2 MTree package description

The MTree is implemented in Java and is aimed to be executed within the Weka 3.8 workbench by using the Package manager tool. The MTree package is based on three classes, Node, MTreeBean and MTree, along with three other helper classes. Figure 6 shows the software architecture of the MTree package as a UML class diagram.

The class Node represents a blueprint for a cluster of instances, and the MTreeBean class contains the root of the MTree. The most important class is MTree, which extends RandomizableClusterer, an abstract class whose direct subclasses are Canopy, Cobweb, FarthestFirst and SimpleKMeans. Further, by implementing NumberOfClustersRequestable and other interfaces, the MTree gains the possibility to be parametrised in a similar manner to the other clustering algorithms in Weka.

The main goal was to obtain a clusterer that may be parameterised in the same way as the already existing clusterers, based on the interfaces already defined in Weka, while also offering the possibility of defining new interfaces specific to the parameters needed by the MTree algorithm. In this way, the newly obtained MTree can easily be parametrised through the command-line interface or through the Weka GUI.

Figure 6: UML class diagram for MTree package

4.3 Sample MTree usage

The MTree can be used in three ways. The current implementation provides the flexibility to run full experimental runs and to benchmark the performance of an MTree parameter configuration against already existing Weka clustering algorithms.

– Basic mode, through the command line. This mode allows executing the MTree on any machine that has JVM 1.8 and weka.jar version 3.8.3. Figure 7 presents a sample command-line execution of the MTree algorithm. This approach is commonly used when a batch execution is needed only once for building the clusters, with the obtained model (i.e., the distribution of objects into clusters) serialised to persistent storage (i.e., csv, xml, etc.) for later usage. The available options when running the MTree from the command-line interface are the following: N represents the number of clusters; if we want the MTree to decide the number of clusters by itself, this option must be set to -1. The init option may be used for setting the initialisation method. The I option sets the maximum number of iterations, O preserves the order of instances, S sets the number of seeds, d sets the distance metric, findN sets the method for finding the optimal number of clusters, and splitPolicy sets the method used as the splitting policy. The current implementation may use as splitting policy the Canopy, SimpleKMeans, Cobweb, FarthestFirst or HierarchicalClusterer clusterers from Weka.

– Using the Weka GUI. This mode is the most user-friendly, as the MTree may be used from the Weka GUI like any other clustering algorithm. The MTree package needs to be installed before usage. Installing the MTree package in Weka is a straightforward procedure, as the MTree.zip archive is publicly available on SourceForge [30] and the link to the package is available in the list of official packages within the Weka package manager tool.

– Programmatic way. The most versatile usage of the MTree is the programmatic one. This approach allows setting up the parameters at runtime, as well as having a ready-to-use, in-memory MTree object available for querying. It also allows the usage of the clusterer as business logic on the server side of complex applications in which client code performs various queries. Sample code for building the MTree from data is publicly available in [8]; a hedged sketch of this mode is shown after this list.
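As an illustration of the programmatic mode, the sketch below follows the standard Weka clusterer API. The MTree class name and the two setters are taken from the architecture described in Section 4.2 (RandomizableClusterer provides setSeed, NumberOfClustersRequestable provides setNumClusters); the input file name is hypothetical, and the sample code in [8] remains the authoritative reference.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of programmatic usage through the standard Weka clusterer API.
// 'MTree' and its setters follow the interfaces named in Section 4.2;
// consult the package source [8] for the exact option names.
public class MTreeDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name: any ARFF/CSV with numeric attributes works.
        Instances data = DataSource.read("unbalance.arff");

        MTree mtree = new MTree();
        mtree.setNumClusters(-1);   // -1: let the MTree find K by itself
        mtree.setSeed(100);         // seed for the insertion-order shuffle
        mtree.buildClusterer(data); // builds the two-level tree

        // Query the model: cluster membership of the first instance.
        int cluster = mtree.clusterInstance(data.instance(0));
        System.out.println("instance 0 -> cluster " + cluster);
    }
}
```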
Figure 7: Command line execution of MTree
Figure 8: Execution of MTree in Weka GUI

4.4 Sample runs on the Unbalance and Dim2 synthetic data-sets

The newly released parametrised MTree implementation has been tested on two publicly available synthetic data sets: Unbalance and Dim2. The client code that calls the MTree package is publicly available at [8]; the experimental results presented below were obtained by running the MTree implementation programmatically.

Figures 9 and 10 present the experimental results of running the MTree clusterer along with five other clusterers implemented in the Weka workbench: Canopy, EM (expectation-maximisation), FF (Farthest First), HC (Hierarchical Clustering) and SKM (Simple KMeans).

As Figure 9 clearly shows, the MTree correctly determines the clusters by using 100 seeds and Canopy as the split policy. As a general rule, the clustering result exhibits undefined behaviour with regard to the number of seeds, such that correct results may sometimes be obtained with only 10 seeds and sometimes with 1000 seeds. A summary of the experimental results of the other five algorithms follows:

– The Canopy algorithm fails to determine the correct number of clusters, and the found distribution into clusters is wrong.

– The EM algorithm fails to determine the correct number of clusters, although it is used for this task in many situations, as it does not require the value of K. The obtained clusters are reasonably good, with two exceptions: cluster 0 puts together three real clusters and cluster 1 puts together two real clusters.

– The FF algorithm correctly determines the number of clusters but fails to determine two of them correctly: cluster 0 puts together two real clusters, and cluster 2 is composed of objects belonging to two distinct clusters.

– The HC algorithm correctly solves the task.

– The SKM algorithm correctly solves the task after fine-tuning the parameters: 100 seeds, usage of kmeans++ [2] for seed optimisation and a maximum of 10,000 iterations.

As Figure 10 clearly shows, the MTree correctly determines the clusters on a more difficult task by using only 10 seeds. The other investigated algorithms provide the following results:

– The Canopy algorithm fails to determine the correct number of clusters, and the found distribution into clusters is wrong, as it puts objects from two clusters together in one cluster.

– The EM algorithm correctly determines the clusters.

– The FF algorithm fails to determine the correct number of clusters and misses one of them: the result shows that objects from one real cluster are split between two clusters.

– HC fails to determine the correct number of clusters and reports one found cluster that is a join of two real clusters.

– The SKM algorithm correctly solves the task after fine-tuning the parameters: 10 seeds, usage of kmeans++ [2] for seed optimisation and a maximum of 10,000 iterations.
Figure 9: Clustering results on the Unbalance dataset (panels (a)-(f): MTree, Canopy, EM, FF, HC and SKM results)

Figure 10: Clustering results on the Dim2 dataset (panels (a)-(f): MTree, Canopy, EM, FF, HC and SKM results)

Table 2: Running time statistics (measured in seconds)

Algorithm   Unbalance dataset        Dim2 dataset
MTree       0.3 [per seed]           0.06 [per seed]
Canopy      0.01                     0.1
EM          8.21 [per tuned seed]    0.55 [per tuned seed]
FF          0.01 [per seed]          0.01 [per seed]
HC          605.42                   4.11
SKM         0.06 [per seed]          0.01 [per seed]

Table 3: Performance results on real-world datasets

Algorithm      Accuracy Wine   Accuracy Iris
MTree+Canopy   0.94382         0.89261
MTree+cKMs     0.92134         0.89261
MTree+FF       0.93258         0.88590
MTree+HC       0.89887         0.89261
KMeans         0.93258         0.88590

Experimental results show that the first K instances have the most significant impact on the final result, where K is the number of clusters. Thus, the current implementation uses kmeans++ [2] seed optimisation, so that the first K instances added to the MTree are rather spread out from one another.

Table 2 summarises the running time statistics for all six algorithms on both data-sets. In the case of the EM algorithm, each tuned seed has been obtained through more iterations and more k-means runs, a fact that explains the more significant running time. HC and Canopy do not have seeds, and Canopy's poor results on both data-sets were obtained regardless of the configuration parameters. The number of seeds for the algorithms that correctly solved the Dim2 data-set has been set to 10.

Finally, the SSE (sum of squared errors), as well as the assignment of objects, is correctly computed by the MTree, as it evaluates to the same values as for SKM: 4.2471 and 0.3367 for the Unbalance and Dim2 datasets, respectively. Therefore, the clusters produced by the MTree are valid and represent the real ones from the datasets, and the SSE represents a good optimisation metric for these datasets.
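For completeness, the SSE referenced throughout this section is the standard k-means objective (the formula below is the textbook definition; the paper itself only names the metric):

$$\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

where $C_j$ is the set of objects assigned to cluster $j$ and $\mu_j$ is its centroid. Lower values indicate more compact clusters, which is why matching SSE values between the MTree and SKM support the claim that the two produce the same partition here.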
4.5 Sample runs on the Iris and Wine real-world data-sets

The Wine data was normalised, and the MTree was then used on all 13 features. The algorithm was run with 100 seeds, and the targeted best result was the one with the best (smallest) SSE. As can be seen from Table 3, a smaller SSE does not always provide the best accuracy: an SSE value of 66 against 68 yields an accuracy of 89 against 94. This suggests that a different cluster quality metric may be able to improve the performance of the proposed algorithm. KMeans run with 100 seeds obtains 93% accuracy, i.e., 12 wrong predictions.

The Iris data was normalised, and the MTree was used on all 4 features. As on the previous data-set, the algorithm was run with 100 seeds and the best SSE was targeted. It is interesting to notice that different SSE values provide the same accuracy; it seems that the algorithm converged, with 16 wrong predictions being its best result. MTree+Cobweb is not able to cluster this data. On the same data, KMeans run with 100 seeds obtains 88% accuracy, i.e., 17 wrong predictions.

Table 3 presents the accuracy results of the MTree parametrised with various splitting algorithms (i.e., Canopy, KMeans, Farthest First and Hierarchical clustering) against the baseline KMeans algorithm. The experimental results show that the MTree clustering algorithm finds groups at least as good as those of the simple KMeans algorithm.

5 Conclusions and future work

This paper presents an improved, parametrised MTree clusterer for the Weka workbench. The experimental results show that the MTree correctly solves two synthetic datasets for which the correct structure (i.e., number of clusters, centroids and distribution) is known. Moreover, five other clustering algorithms implemented in Weka are outperformed in various situations, due to the fact that they either do not solve the clustering task correctly or need intensive tuning in order to solve it. Still, the current approach is used only as a sanity check of the clustering capabilities of the MTree implementation, rather than for solving a particular clustering task on a real dataset.

The implementation is Java-based and is available as an open-source Weka package. The experimental results are correct and promising, such that further development under Weka offers the possibility of properly benchmarking further clustering tasks that may be taken into consideration.

The main contributions are summarised as follows:

1. An updated, parametrised version of the MTree package is presented. The parametrisation mainly regards the used distance metric, the method for finding and setting the number of clusters, and the division policy. The data structure can load various object types, after proper processing, and provides validation insights.

2. The proposed software architecture of the MTree enables parametrisation through the easy integration of other internal algorithmic strategies that perform key tasks within the business logic.

3. The implementation of the MTree is available as an open-source package in the Weka workbench. This gives the opportunity for further usage and for benchmarking against other clustering algorithms.

4. The experiments demonstrate that the proposed approach outperforms current clustering algorithms on two datasets.

Future work may take into consideration extending the voteK algorithm into a Java implementation of the already existing R package NbClust. Extending voteK should take into consideration the available clustering quality indices and the parametrisation capabilities. In terms of internal business logic, the MTree may try different approaches regarding the order in which objects are processed when building the tree. One option is to cluster the outlier objects later in the process.

As the most expensive operations are finding the optimal number of clusters and splitting, one option is to try to predict how the insertion of an object will impact the tree in terms of triggering a split. Checking for the optimal number of clusters should be performed only when an insert is highly likely to trigger a split, as most inserts do not require a split, especially when the dataset has a well-defined structure.

The MTree currently implements range queries and kNN queries. These implementations should be further tested in practical, real-data scenarios. Other tasks for which the MTree may also be used are outlier detection and finding the correct number of clusters in a dataset. Finally, the MTree algorithm may be further tested for finding patterns in data in situations where the internal structure is not known.
Acknowledgement

This work was partially supported by the grant 135C/2021 "Development of software applications that integrate machine learning algorithms", financed by the University of Craiova.

References

[1] Andreas Adolfsson, Margareta Ackerman, and Naomi C. Brownstein. "To cluster, or not to cluster: An analysis of clusterability methods". In: Pattern Recognition 88 (2019), pp. 13–26. DOI: 10.1016/j.patcog.2018.10.026.

[2] David Arthur and Sergei Vassilvitskii. "k-means++: The advantages of careful seeding". In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.

[3] Catherine L. Blake and Christopher J. Merz. UCI repository of machine learning databases. 1998.

[4] Roberto Caldelli et al. "Fast image clustering of unknown source images". In: Jan. 2011, pp. 1–5. DOI: 10.1109/WIFS.2010.5711454.

[5] Jianlong Chang et al. "Deep adaptive image clustering". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5879–5887. DOI: 10.1109/iccv.2017.626.

[6] Malika Charrad et al. "NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set". In: Journal of Statistical Software 61 (Oct. 2014), pp. 1–36. DOI: 10.18637/jss.v061.i06.

[7] Paolo Ciaccia et al. "Indexing metric spaces with M-tree". In: SEBD. Vol. 97. 1997, pp. 67–86.

[8] Marius Andrei Ciurez. MTree client code. https://github.com/kyko007/Cordoba/tree/master/MTree. 2019.

[9] Marius Andrei Ciurez and Marian Cristian Mihaescu. "Improved Architectural Redesign of MTree Clusterer in the Context of Image Segmentation". In: International Conference on Intelligent Data Engineering and Automated Learning. Springer, 2018, pp. 99–106. DOI: 10.29007/sm6x.

[10] Abhisek Dash et al. "Image Clustering without Ground Truth". In: CoRR (Oct. 2016). DOI: 10.48550/arXiv.1610.07758.

[11] Jiangzhou Deng, Junpeng Guo, and Yong Wang. "A Novel K-medoids clustering recommendation algorithm based on probability distribution for collaborative filtering". In: Knowledge-Based Systems (Mar. 2019). DOI: 10.1016/j.knosys.2019.03.009.

[12] F. Deves et al. "Scalable Real-Time Shadows using Clustering and Metric Trees". In: Eurographics Symposium on Rendering. Karlsruhe, Germany, July 2018, pp. 1–12. DOI: 10.2312/sre20181175. URL: https://hal.archives-ouvertes.fr/hal-02089095.

[13] Nameirakpam Dhanachandra, Khumanthem Manglem, and Yambem Jina Chanu. "Image segmentation using K-means clustering algorithm and subtractive clustering algorithm". In: Procedia Computer Science 54 (2015), pp. 764–771. DOI: 10.1016/j.procs.2015.06.090.

[14] Gianni A. Di Caro, Frederick Ducatelle, and L. Gambardella. "A fully distributed communication-based approach for spatial clustering in robotic swarms". In: Proceedings of the 2nd Autonomous Robots and Multirobot Systems Workshop (ARMS), affiliated with the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (Valencia, Spain, June 5). Citeseer, 2012, pp. 153–171.

[15] Herbert Edelsbrunner. Algorithms in combinatorial geometry. Vol. 10. Springer Science & Business Media, 2012. DOI: 10.1007/978-3-642-61568-9.
[16] Eibe Frank, Mark A. Hall, and Ian H. Witten. "The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques". In: Morgan Kaufmann (2016). DOI: 10.1016/B978-0-12-804291-5.00024-6.

[17] Ahmed Ali Abdalla Esmin, Rodrigo A. Coelho, and Stan Matwin. "A review on particle swarm optimization algorithm and its variants to clustering high-dimensional data". In: Artificial Intelligence Review 44.1 (2015), pp. 23–45. DOI: 10.1007/s10462-013-9400-4.

[18] Adil Fahad et al. "A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis". In: IEEE Transactions on Emerging Topics in Computing 2.3 (2014), pp. 267–279. DOI: 10.1109/tetc.2014.2330519.

[19] Pasi Fränti and Sami Sieranoja. "K-means properties on six clustering benchmark datasets". In: Applied Intelligence 48.12 (2018), pp. 4743–4759. DOI: 10.1007/s10489-018-1238-7.

[20] Guojun Gan, Chaoqun Ma, and Jianhong Wu. Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics, 2007. DOI: 10.1137/1.9780898718348.

[21] Farid Garcia-Lamont et al. "Automatic computing of number of clusters for color image segmentation employing fuzzy c-means by extracting chromaticity features of colors". In: Pattern Analysis and Applications 23 (2020), pp. 59–84. DOI: 10.1007/s10044-018-0729-9.

[22] Melvin Gauci et al. "Clustering objects with robots that do not compute". In: Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 421–428.

[23] Ángel Castellanos Gonzáles, Juan Manuel Cigarrán, and Ana García-Serrano. "Formal concept analysis for topic detection: A clustering quality experimental analysis". In: Information Systems 66 (2017), pp. 24–42. ISSN: 0306-4379. DOI: 10.1016/j.is.2017.01.008.

[24] Suchita Goswami and Lalit Kumar P. Bhaiya. "Brain tumour detection using unsupervised learning based neural network". In: 2013 International Conference on Communication Systems and Network Technologies. IEEE, 2013, pp. 573–577. DOI: 10.1109/csnt.2013.123.

[25] Sudipto Guha and Nina Mishra. "Clustering Data Streams". In: Data Stream Management - Processing High-Speed Data Streams. Ed. by Minos N. Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Springer, 2016, pp. 169–187. DOI: 10.1049/iet-smt.2018.5389.

[26] Givanna H. Putri et al. "ChronoClust: Density-based clustering and cluster tracking in high-dimensional time-series data". In: Knowledge-Based Systems 174 (Feb. 2019). DOI: 10.1016/j.knosys.2019.02.018.

[27] K. Anil Jain and Aditya Vailaya. "Image retrieval using color and shape". In: Pattern Recognition 29.8 (1996), pp. 1233–1244. DOI: 10.1016/0031-3203(95)00160-3.

[28] Ismo Kärkkäinen and Pasi Fränti. "Gradual model generator for single-pass clustering". In: Pattern Recognition 40.3 (2007), pp. 784–795. DOI: 10.1016/j.patcog.2006.06.023.

[29] Manish Maheshwari, Sanjay Silakari, and Mahesh Motwani. "Image clustering using color and texture". In: Computational Intelligence, Communication Systems and Networks. IEEE, 2009, pp. 403–408. DOI: 10.1109/CICSYN.2009.69.

[30] Marian Cristian Mihaescu. MTree Clusterer. Accessed: 2019-05-30. URL: http://weka.sourceforge.net/packageMetaData/MTreeClusterer/index.html.
[31] Marian Cristian Mihaescu and Dumitru Dan Burdescu. "Using M-tree data structure as unsupervised classification method". In: Informatica 36.2 (2012).

[32] Anton Milan et al. "Joint tracking and segmentation of multiple targets". In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. 2015, pp. 5397–5406. DOI: 10.1109/cvpr.2015.7299178.

[33] Jose A. Miñarro-Giménez, Markus Kreuzthaler, and Stefan Schulz. "Knowledge Extraction from MEDLINE by Combining Clustering with Natural Language Processing". In: AMIA Annual Symposium Proceedings. Vol. 2015. American Medical Informatics Association, 2015, p. 915.

[34] Traian Rebedea and Ştefan Trăuşan-Matu. "Autonomous News Clustering and Classification for an Intelligent Web Portal". In: Foundations of Intelligent Systems. Springer Berlin Heidelberg, 2008, pp. 477–486. DOI: 10.1007/978-3-540-68123-6_52.

[35] Mohammad Rezaei and Pasi Fränti. "Set matching measures for external cluster validity". In: IEEE Transactions on Knowledge and Data Engineering 28.8 (2016), pp. 2173–2186. DOI: 10.1109/TKDE.2016.2551240.

[36] Hermes Robles et al. "LEAC: An efficient library for clustering with evolutionary algorithms". In: Knowledge-Based Systems (May 2019). DOI: 10.1016/j.knosys.2019.05.008.

[37] Érick Oliveira Rodrigues et al. "K-MS: a novel clustering algorithm based on morphological reconstruction". In: Pattern Recognition 66 (2017), pp. 392–403. DOI: 10.1016/j.patcog.2016.12.027.

[38] Tiago Rodrigues Lopes dos Santos and Luis E. Zárate. "Categorical data clustering: What similarity measure to recommend?" In: Expert Systems with Applications 42.3 (2015), pp. 1247–1260. DOI: 10.1016/j.eswa.2014.09.012.

[39] Lincoln F. Silva et al. "Hybrid analysis for indicating patients with breast cancer using temperature time series". In: Computer Methods and Programs in Biomedicine 130 (2016), pp. 142–153. DOI: 10.1016/j.cmpb.2016.03.002.

[40] Jeffrey K. Uhlmann. "Satisfying general proximity/similarity queries with metric trees". In: Information Processing Letters 40.4 (1991), pp. 175–179. DOI: 10.1016/0020-0190(91)90074-R.

[41] Ulrike von Luxburg, Robert C. Williamson, and Isabelle Guyon. "Clustering: Science or art?" In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning. 2012, pp. 65–79.

[42] Zhimin Wang et al. "Adaptive spatial information-theoretic clustering for image segmentation". In: Pattern Recognition (Sept. 2009), pp. 2029–2044. DOI: 10.1016/j.patcog.2009.01.023.

[43] Pavel Zezula et al. Similarity search: the metric space approach. Vol. 32. Springer Science & Business Media, 2006. DOI: 10.1007/0-387-29151-2.

[44] Shibing Zhou and Zhenyuan Xu. "Automatic grayscale image segmentation based on Affinity Propagation clustering". In: Pattern Analysis and Applications (Feb. 2019). DOI: 10.1007/s10044-019-00785-4.