https://doi.org/10.31449/inf.v46i4.3565 Informatica 46 (2022) 507–522

Parametrized MTree Clusterer for Weka

Marian Cristian Mihăescu, Marius Andrei Ciurez
Department of Computer Science and Information Technologies, Faculty of Automatics, Computers and Electronics, University of Craiova, Craiova, Romania
E-mail: cristian.mihaescu@edu.ucv.ro, mariusandrei.ciurez@gmail.com

Keywords: clustering, MTrees, metric spaces

Received: May 25, 2021

In the area of clustering, proposing or improving new algorithms represents a challenging task, due to an already existing, well-established list of algorithms and to the various implementations that allow rapid evaluation against tasks on publicly available datasets. In this work, we present an improved version of the MTree clustering algorithm implemented within the Weka workbench. The algorithmic approach starts from classical metric spaces and integrates parametrised business logic for finding the optimal number of clusters, choosing the division policy and other characteristics. The result is a versatile data structure that may be used in clustering to find the optimal number of clusters, but mainly for loading data sets that already have a known structure. Experimental results show that the MTree finds the proper structure in two clustering tasks, while other algorithms fail in various ways. A discussion of further improvements and of experiments on real data sets and functions is included.

Povzetek: A new MTree clustering method is described.

1 Introduction

As an unsupervised learning technique, clustering continuously receives attention within the machine learning area, due to the many available algorithms and the wide range of application domains for which practical solutions are continually being designed and implemented.

The general problem of building clusters of objects is narrowed down in [41] by clearly stating that the context of the problem is the first thing that needs to be taken into consideration. Therefore, building a general-purpose clustering algorithm that may work with any objects and solve any task is not a realistic option. Thus, when building a clustering data analysis workflow, the objects, a proper task description, and the particular distance and evaluation metrics or algorithmic approaches all need to be carefully taken care of.

In [1], the authors recently raised the problem of the clusterability of a dataset. Having an available dataset does not necessarily imply that we may meaningfully run a clustering process to solve a particular task. Thus, defining clusterability becomes a critical issue. Checking whether a dataset is clusterable is an inherently tricky problem, especially when dealing with actual data and solving a particular practical task.

The wide range of approaches used in the available clustering algorithm implementations has found various application domains, providing efficient solutions to many practical problems. Among the many application domains, we mention medical image processing (i.e., pattern recognition and image segmentation) [39], [24], general and natural language processing knowledge discovery [20], [33], [34], robot navigation [14], [22], and many other contexts.
In the area of unsupervised learning, there are several general classes of clustering algorithms (i.e., flat, hierarchical and density-based) that all share two common problems: finding the optimal number of clusters, and quickly and efficiently finding the correct clusters while taking into consideration distance measures appropriate for the objects (i.e., pixels, points, persons, books, etc.) that are being grouped. Further, once the clusters have been correctly determined, they may later be used to query for nearest neighbours or to run specific range queries. When tree data structures are used for managing the clusters, searching and traversal may be accomplished efficiently.

The objective of this work is to present an improved version of the metric trees (MTree) algorithm, first proposed by [7] and later refined by [40] and [43]. The first proposal of using MTrees in the context of clustering was made in [31] and later implemented as a Weka [16] package in [30]. This paper presents an improved version of the works from [30] and [31], which has been tested in a comparative benchmark against the k-MS morphological reconstruction clustering algorithm [37], along with classical algorithms (i.e., SimpleKMeans, Cobweb, FarthestFirst and Canopy), in [9].

This paper presents a further improved, parametrised version of the MTree algorithm in terms of managed items, used distance, the method for finding and setting K (i.e., the number of clusters), division policy and validation metrics. The critical improvement of the new parametrised version of the MTree clusterer is that it is now suitable for addressing a broader range of problems. The parametrised version is not meant to solve any clustering problem by dynamically choosing suitable parameters. Instead, a wide range of clustering problems may be addressed by the newly proposed MTree clusterer by properly setting its parameters, depending on the available data objects and the tackled task.

In general, there are two types of clustering problems. The first regards finding patterns in a dataset for which we do not know whether a specific structure or a certain number of clusters exists. The second type of task regards correctly building a data clustering model that may later be queried many times for specific information about the managed data. In this scenario, the cluster model is constructed only once, such that it may be regarded as a preprocessing step. Later, few insertions and updates occur at runtime, while most calls are range queries or kNN queries that need to be answered correctly and efficiently. Range queries determine the items whose distance from a specific query item is smaller than a particular value. The proposed approach is suited for both tasks, but it is more appropriate for the second one.

The paper is organised as follows. In Section 2, we perform a literature review with regard to similar libraries and to other application domains in which metric trees are used for clustering, such as time-series analysis of cytometry data, recommender systems, spatial data clustering, automatically computing the number of clusters for colour or greyscale image segmentation, and clustering quality metrics. Section 3 describes the proposed approach, with a detailed presentation of how each parameter may be set and how it influences the business logic of the MTree.
Section 4 presents the experimental results on two publicly available data sets, with runs under several parameter settings. Finally, Section 5 contains the conclusions of this work, summarises the main contributions and discusses potential improvements and applications.

2 Related work, limitations and approaches

Data clustering (also known as unsupervised learning) represents a subarea of machine learning, alongside supervised learning, reinforcement learning and deep learning. A particular sub-domain is represented by image clustering and segmentation [13], which presents the use of subtractive clustering along with the classical K-means algorithm to preprocess the data for optimal centroid initialisation. The experimental results were obtained on medical images representing blood cells infected with malaria. The reported approach brings better segmentation results than plain k-means in terms of the root mean square error (RMSE) and peak signal-to-noise ratio (PSNR) metrics.

Another approach for image clustering was proposed by Chang et al. in [5]. They propose a Deep Adaptive Clustering (DAC) approach that reduces clustering to a classification problem in which similarity is determined by cosine distance and the learned labelled features tend to be one-hot vectors, obtaining good results on popular and accessible datasets like MNIST, CIFAR-10 and STL-10.

One of the critical practical usages of clustering algorithms is image segmentation. Including spatial information, and handling outlier points at a later stage of the clustering process, has been proposed in [42]. From this perspective, outliers are data points with almost equal distance to their adjacent clusters, and they should therefore be taken into consideration later. This approach raises two issues. The first regards the fact that the order in which data points are given to the algorithm has significant importance: if a clustering algorithm is highly sensitive to the order in which data points are provided, a custom preprocessing step may be necessary. The second issue regards the very nature of the data set from the clusterability perspective. The critical verification that is highly recommended is to always check for clusterability before starting the clustering analysis.

Another application of clustering is greyscale image segmentation [44]. Compared with the clustering of colour images, or of images that incorporate spatial information, the goal there is to substantially decrease the time complexity of the algorithm by using the affinity propagation (AP) clustering algorithm. As always, when real-life greyscale images are clustered, correctly determining the number of clusters or segments is a critical problem.

More elaborate applications regard the indexing and retrieval of similar images from an image database (CBIR, Content-Based Image Retrieval), a challenging task that has been addressed in [29] and [27]. The first approach uses colour and texture as features and employs K-means and hierarchical clustering for finding the most similar images. The second approach uses colour, texture and shape as features and K-means as business logic for determining four classes of images: dinosaurs, flowers, buses and elephants. The obtained experimental results are promising in terms of excellent precision and recall values.

Clustering has also been used successfully in recommender systems based on collaborative filtering [11].
A novel K-medoids clustering recommendation algorithm has been proposed, which introduced an improved Kullback-Leibler divergence for computing item similarity. The final task was to improve the effectiveness of the developed recommendation system.

Lately, in the application domain of immunology, clustering algorithms have been applied to time-series cytometry data: ChronoClust, a new density-based clustering algorithm [26]. The task was to characterise the immune response to disease by tracking its temporal evolution.

Very recent works [1] put a high emphasis on defining clusterability and on checking whether a dataset is clusterable as a preprocessing step, before any other data analysis is performed. Therefore, before applying an algorithm to a task that requires clustering, a sanity check may be required: we should verify the clusterability of the dataset. In other words, clustering may not work on datasets that do not exhibit any internal structure, irrespective of the particular algorithm that may be employed. In our case, clusterability becomes a prerequisite that the dataset needs to meet before being loaded into the MTree structure. The main scenario of the newly proposed clustering algorithm is working with data that is known to be clusterable, i.e., for which we know there is a well-defined structure with a known number of clusters. A real-world dataset has a well-defined structure when there exists an assignment of items into clusters that is validated by a domain specialist. In the case of synthetic datasets, the function that creates the instances is designed in such a way that the clusters are well defined and represent the gold labelling for any clustering algorithm.

If the number of clusters is not known, the results highly depend on the particular practical context of the clustering problem. The context is defined by the problem (clustered objects and clustering task) and by the parameters of the MTree: object type, distance function, splitting policy and validation metrics. If the dataset is not clusterable, we argue that the MTree algorithm, as well as any other clustering algorithm, will exhibit undefined behaviour.

A more complex context occurs when the image source is unknown or when the ground truth for the training dataset is also unknown [4], [10]. In this particular situation, finding the optimal K represents an inherently difficult task. From an experimental algorithmic perspective, using proper distance metrics and a proper loss function in this optimisation problem represents one of the key ingredients for successful results. Such approaches propose elaborate solutions such as hierarchical clustering or clustering ensembles based on graph partitioning methods: the Cluster-based Similarity Partitioning Algorithm (CSPA), the HyperGraph Partitioning Algorithm (HGPA) and the Meta-Clustering Algorithm (MCLA).

One of the most well-known issues in unsupervised learning consists in determining the actual number of clusters in a dataset. Unfortunately, scenarios in which the value of K is known occur only in a few practical systems. In general, an application that performs an image processing task does not have any information regarding the actual number of clusters in the target image. This situation may occur when dealing with data streams [25] or with very high-dimensional datasets [17].
In general, the most suitable approaches try to determine K automatically, based for example on dynamic clustering [17] or on joint tracking and segmentation [32].

A general-purpose algorithm for finding the optimal number of clusters has been proposed by [6] and implemented in the R package NbClust. The main idea of this approach is that it may use up to 30 indices for voting on the number of clusters. The package implements a function that runs a clustering algorithm (i.e., k-means or hierarchical clustering) using various distance measures and aggregation methods. The main limitation of the approach is its generality: practical usage on a particular dataset needs to be parametrised with the appropriate clustering algorithm, subset of indices, distance measure and aggregation method.

One particular usage of clustering regards automatically computing the number of clusters for colour image segmentation [21]. This approach uses fuzzy c-means algorithms for extracting the chromaticity features of colours and trains a neural network with the obtained chromatic data. The trained model may be further used on new colour images to predict the number of clusters.

Among the many clustering libraries, we mention LEAC [36]. It is an open-source library with the source code publicly available in a GitHub repository. Thus, once the experiments are also performed on publicly available datasets, the results may be reproduced and also used in other setups for further improvements. The inclusion of 23 state-of-the-art evolutionary algorithms for partitional clustering within a library that allows easy and fast development and integration of new clustering algorithms represents the kind of solution we also target when improving the initial version of the MTree clustering algorithm within the Weka package.

Another usage of metric trees has been reported recently in [12]. The task is to quickly and efficiently scale up the problem of shadow rendering for 70 million objects (i.e., triangles) in real time. The proposed metric tree uses binary space partitioning (BSP) and ternary object partitioning (TOP) as splitting policies for grouping triangles into clusters as precomputed bounding capsules (line-swept spheres).

Finally, the whole clustering process needs validation, and many quality metrics may accomplish this for a wide range of algorithms [18]. Depending on the structure of the dataset, various clustering quality frameworks [23], [38] have been proposed. The key issue that always arises regards choosing the proper similarity and quality metrics [38].

3 Proposed approach

The MTree parametrised clusterer was first proposed in [31], in an attempt to define a new clustering algorithm whose main business logic relies on the metric trees initially presented in [7] and later in [40] and [43]. The initial C++ implementation was designed to cluster students, each defined by three grades obtained during one semester. The main shortcoming of the proposed structure was that it was designed only for managing student objects and had several hard-coded parameters needed for building the tree. The most important one is nrKeys, which represents the maximum number of students contained in a node (i.e., a cluster). This approach has a critical limitation in that the splitNode() method was called only when a node was full.
Other limitations regard the lack of a parametrised division policy, of configurable distance metrics, and of other features needed for flexibly running a clustering process.

Later, the initially proposed MTree clustering algorithm was contributed as an official Weka package [30]. This new Java-based implementation was meant to be functionally available within the Weka workbench like any other clusterer, such that it may be used in various practical situations. The MTree clusterer from Weka was used in [37] in a comparative analysis on publicly available images. The results of MTree were inferior, and the limitations were subsequently addressed in [9]. As critical improvements, the MTree version from [9] uses off-line dataset preprocessing for finding the optimal number of clusters and adjusts the business logic of the clusterer in terms of division policy and of the distance metric between instances.

The currently proposed version of the MTree clusterer represents a flexible, parametrised version of the former one in terms of division policy, used distance, and the method for finding and setting the number of clusters.

3.1 Definitions and context

A metric space $M = (D, d)$ on a data domain $D$ with the distance function $d: D \times D \to \mathbb{R}$ postulates:

$$\text{Non-negativity:} \quad \forall x, y \in D,\; d(x, y) \ge 0 \quad (1a)$$
$$\text{Symmetry:} \quad \forall x, y \in D,\; d(x, y) = d(y, x) \quad (1b)$$
$$\text{Identity:} \quad \forall x, y \in D,\; x = y \Leftrightarrow d(x, y) = 0 \quad (1c)$$
$$\text{Triangle inequality:} \quad \forall x, y, z \in D,\; d(x, z) \le d(x, y) + d(y, z) \quad (1d)$$

The conditions specified above are satisfied by most discrete or continuous distance (or similarity) metrics (or measures): Euclidean, Minkowski, Manhattan, quadratic form distance (i.e., colour histograms, weighted Euclidean distance), edit distance, Jaccard's coefficient or Hausdorff distance.

Building clusters of objects in metric spaces relies on the partitioning method and on the shape of the decision boundary that lies between two adjacent clusters. The partitioning method regards how a set of objects is split into two or more clusters, taking into account a specific optimisation criterion such as the sum of squared errors (SSE) in the case of the simple k-means algorithm. Regarding the decision boundary, the two most common options are ball partitioning, as in metric spaces, and hyperplane partitioning.

The current implementation of the MTree algorithm represents a two-level ball decomposition generalisation of the approach from [40]. The limitations of [40] regard choosing an arbitrary object as the pivot, using only binary splits around the median object (which implies previously sorting the objects), and the multilayered structure due to recursive construction. From the practical perspective of running a clustering process, these are substantial limitations, because ordering may not always be possible, binary splitting may not be useful when the dataset consists of many clusters, and the notion of a cluster becomes unclear in a multilayered approach.

Definition 1. The newly designed MTree data structure is a two-level, perfectly balanced multiway tree that indexes a set of objects into its leaves, which reside only on the second level. After building the tree, the root contains the set of centroids and their corresponding covered radii. On the second level, the leaves represent the clusters, containing objects whose distance is smaller than or equal to the radius assigned to the corresponding centroid within the root.
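Definition 1, combined with the triangle inequality (1d), is what makes queries over the structure efficient: an entire leaf can be discarded when the query ball cannot intersect the leaf's ball. The following minimal Java sketch shows that pruning rule for a range query; the helper names and the flat representation of the root (parallel centroid/radius/leaf collections) are illustrative only, not the package's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import weka.core.DistanceFunction;
import weka.core.Instance;

// Sketch of a range query over the two-level structure of Definition 1.
// By the triangle inequality (1d), every object o in leaf i satisfies
// d(q, o) >= d(q, centroid_i) - radius_i, so a leaf whose lower bound
// already exceeds eps cannot contain an answer and is never scanned.
public class RangeQuerySketch {
    public static List<Instance> rangeQuery(Instance query, double eps,
                                            List<Instance> centroids,
                                            double[] radii,
                                            List<List<Instance>> leaves,
                                            DistanceFunction d) {
        List<Instance> result = new ArrayList<>();
        for (int i = 0; i < centroids.size(); i++) {
            double toCentroid = d.distance(query, centroids.get(i));
            if (toCentroid - radii[i] > eps) {
                continue; // query ball cannot intersect this leaf's ball
            }
            for (Instance o : leaves.get(i)) { // scan only surviving leaves
                if (d.distance(query, o) <= eps) {
                    result.add(o);
                }
            }
        }
        return result;
    }
}
```

Note that the pruning argument relies only on the metric axioms, which is why the structure works for any distance satisfying (1a)-(1d), not just the Euclidean one.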
The key features of the parametrised MTree implementation are:

1. The possibility to process various object data types provided as input (i.e., image, document, etc.), represented as multidimensional vectors.

2. The possibility to set a particular distance function between objects, by direct usage of the distance functions already available in Weka or by using a newly defined custom function.

3. The possibility to set a particular division policy that will be used internally as needed. Practically, the logic of the division policy is managed by the clustering algorithm. This feature is accompanied by a parameter that controls the number of clusters into which a full leaf may be split. Available options are binary object partitioning (BOP), ternary object partitioning (TOP) or multi-object partitioning (MOP).

4. The possibility to set a specific number of clusters into which the entire dataset will be partitioned, if the number of clusters is known. If the number of clusters is not known, the MTree will find the optimal number of clusters considering the other parameters that have been set.

5. The possibility to compute, on request, various clustering quality metrics that give a general idea of the quality of the clustering process. This feature is critical for benchmarking the MTree clustering results against the results produced by other clustering algorithms.

The MTree uses a single node data structure for both the root and the leaves. In the root node, the instances are represented by centroids, and each element of the radixes vector represents the covered radius of the corresponding centroid. In the same way, each element of the routes vector is the address of the corresponding leaf node. The correspondence between the centroids from the root node, the covered radii and the leaves is accomplished by vector position. For example, the first element of the root's instances vector represents the first centroid, with a radius defined by the first element of the radixes vector; the first element of the routes vector is the address of the leaf node that contains the objects whose distance to the first centroid is smaller than or equal to the covered radius. In the case of leaf nodes, the instances represent the data objects themselves; the isLeaf flag is set by the business logic to TRUE, and the radixes and routes vectors are set to null. Table 1 summarises the node structure.

Table 1: The structure of the MTree node

Field name   Description
nrKeys       The number of objects actually stored in the node.
isLeaf       This flag represents the node type: root or leaf.
radixes      The vector of distances covered by each centroid.
routes       The vector of node addresses to leaf nodes.
instances    The objects from the node (parent or leaf).
parent       The address of the parent node.
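Read literally, Table 1 corresponds to a Java class along the following lines. This is a sketch of the layout only: the field types are assumptions inferred from the descriptions above, not the package's actual declarations.

```java
import java.util.ArrayList;
import weka.core.Instance;

// Sketch of the node layout from Table 1. Field types are assumptions
// inferred from the text, not the Weka package's actual declarations.
public class Node {
    int nrKeys;                     // number of objects stored in this node
    boolean isLeaf;                 // node type: leaf (true) or root (false)
    ArrayList<Double> radixes;      // covered radius per centroid (root only)
    ArrayList<Node> routes;         // addresses of the corresponding leaves (root only)
    ArrayList<Instance> instances;  // centroids (root) or data objects (leaf)
    Node parent;                    // parent address; null for the root
}
```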
Before starting any computation needed for inserting a new object into the MTree, the algorithm checks whether we are in a leaf and whether the leaf has objects in it. If either of these conditions does not hold, insertion may not take place, because insertion may be performed only in a leaf, and splitting may be taken into consideration only if the leaf has objects in it. Further, the leaf is evaluated for splitting, parametrised by evaluatorOfK, a function that determines the optimal number of clusters. This helper function takes the objects from treeNode as its parameter and outputs splitEval as an evaluation of the splitting. If splitEval is valid, the leaf node is split into the optimal number of clusters by using a parametrised divisionPolicy. When no splitting is necessary, the insertNonFull function is called to append the new object to the leaf.

The two main ingredients of the mTreeInsert function are the evaluation of the optimal number of clusters of a leaf and the division policy used for splitting. The current implementation uses a function called voteK that computes the optimal number of clusters in a similar way to the NbClust package [6] in R; an illustrative sketch is given at the end of this subsection. The main difference is that NbClust uses 30 indices, while voteK currently integrates only 8: the Davies-Bouldin, Dunn, Xie-Beni, Banfield-Raftery, McClain-Rao, Ray-Turi, Calinski-Harabasz and PBM indices. The architectural design of the package allows the easy call of any method that, given an input dataset, can compute the optimal number of clusters.

The nrKeys field represents the number of items (i.e., centroids or data objects) contained in a node (i.e., root or leaf). For the root node, the items are centroids; for a leaf node, the items are the sample points. Depending on the implementation, the centroids may be items from the dataset or computed instances. The isLeaf field of the data structure clarifies what the items of a node represent: centroids or instances. As far as our MTree is concerned, we pay extra attention to the nodes, because whether they are leaves (i.e., terminal nodes containing instances) or internal nodes (i.e., containing only centroids) is critical information for the algorithm design. The instances vector contains the centroids or the data objects, depending on whether the node is the root or a leaf. If the node is the root, the radixes vector contains the distance covered by each centroid, while the routes vector contains the leaves' addresses. On the other hand, if the node is a leaf, then the parent field contains the root address, while the radixes and routes vectors are empty.

The second key ingredient of the mTreeInsert function is the division policy. The default option for this parameter is the simple k-means algorithm, but any other strategy may be used as needed. Besides calling particular clustering algorithms as a multi-object partitioning (MOP) strategy, other options are binary space partitioning (BSP) or ternary object partitioning (TOP), selected by simply setting the proper parameter values.

If there is no need for node splitting, insertion reduces to appending the new object to that leaf, which is accomplished by calling the insertNonFull function. Intuitively, when a leaf is not full, we insert the new object; otherwise, we split the leaf. The most crucial difference from former versions and from other approaches is that splitting is not triggered when the number of stored objects reaches a specific value, but when the current objects exhibit the property that they are properly clustered, and therefore a split is compulsory. Algorithm 1 summarises the business logic for inserting a new object into an existing MTree leaf node.
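Before turning to the split procedure, here is what the voteK scheme described above might look like in code. The voteK name and the idea of voting across validity indices come from the paper; the code itself is an illustration of the voting step, not the package source, and the index implementations are left abstract.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;
import weka.core.Instances;

// Illustrative sketch of a voteK-style scheme: every cluster-validity
// index proposes a number of clusters for the leaf's objects, and the
// value gathering the most votes wins. The package integrates 8 indices.
public class VoteK {
    public static int voteK(Instances leafObjects,
                            Iterable<ToIntFunction<Instances>> indices) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (ToIntFunction<Instances> index : indices) {
            int k = index.applyAsInt(leafObjects); // K suggested by this index
            votes.merge(k, 1, Integer::sum);       // count one vote for k
        }
        // Return the K with the most votes; keep one cluster if no index voted.
        return votes.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey)
                    .orElse(1);
    }
}
```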
The second critical procedure is the one that performs the split of a leaf node, as specified in the insertion algorithm. Its main ingredient is the clusteringEvaluator setup, which allows breaking the leaf into clusters according to a specific division policy and a predetermined optimal number of clusters. The newly obtained clusters are added into a splitedClusters vector and returned as output. The main improvement in the split procedure is avoiding splitting a leaf into a hard-coded number of clusters, by using a pre-determined optimal number of clusters and by parametrising the division policy that runs the effective split.

Algorithm 2 summarises the business logic for splitting. The input consists of a tree node (i.e., a cluster) and an evaluator, and the returned value consists of a vector of clusters, called splitedClusters. The evaluator has the task of deciding whether the node should remain a single cluster or be split into two or more clusters. If the evaluator identifies more than one cluster in the node, the node will split and the tree will change its structure.

Algorithm 1 mTreeInsert(MTree treeNode, Instance newObject)
1:  if treeNode is a leaf and has objects then
2:    clusteringEvaluator = getOptimalNrOfClusters(treeNode, evaluatorOfK)
3:  end if
4:  if clusteringEvaluator is valid then
5:    nrOfClusters = clusteringEvaluator.getNumberOfClusters()
6:    if nrOfClusters > 1 then
7:      splitNode(treeNode, clusteringEvaluator)
8:    else
9:      insertNonFull(treeNode, newObject)
10:   end if
11: end if

Algorithm 2 mTreeSplitNode(MTree treeNode, ClusterEvaluation clusteringEvaluator)
1:  clusters = getClusters(treeNode, clusteringEvaluator)
2:  if size of clusters > 0 then
3:    for each cluster in clusters do
4:      centroidInstance = chooseCenter(cluster)
5:      splitedCluster.addCentroid(centroidInstance.getCentroid())
6:      splitedCluster.setRadix(centroidInstance.getRadix())
7:      splitedClusters.add(splitedCluster)
8:    end for
9:  end if
10: return splitedClusters

The bounding volumes of the MTree leaves are circles if the objects are 2-dimensional, or spheres if the objects are 3-dimensional and the Euclidean distance is used. For n-dimensional spaces, the bounding volumes are hyperspheres, the generalisation of a set of points equally distanced from a given point. This generalisation is valid only for the Euclidean distance; for other distances, the bounding volume is the analogue of a sphere in that particular vector space.

3.2 MTree algorithms description

The MTree implementation is aimed at usage by client code in practical scenarios. Although the primary usage scenario addresses the situation in which we know the actual number of clusters, the clusterer may also be used when the data analyst specifies a particular number of clusters depending on the tackled task, or when the number of clusters is not known or does not exist.

In any of the situations mentioned above, the parameters that need to be set are the input data-set, the number of clusters, the distance metric, the evaluator of the number of clusters and the division policy algorithm. The main practical goal of the current version of the MTree clusterer is to verify that it correctly loads the data given specific parameters and creates a data model that validates the ground-truth model of two publicly available data sets. Finally, the clustering validation metrics are verified against other clustering algorithms implemented in the Weka workbench.
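Between splits, the insertNonFull path of Algorithm 1 reduces to a routing step: compute the distance from the new object to every root centroid and append the object to the leaf of the closest one (this is also the step counted in the complexity analysis of Section 3.3). A minimal sketch follows; the names and the radius-maintenance comment are illustrative, not the package's actual code.

```java
import java.util.List;
import weka.core.DistanceFunction;
import weka.core.Instance;

// Sketch of the routing step behind insertNonFull: the new object is
// appended to the leaf whose root centroid is closest. Enlarging the
// covered radius afterwards is one possible policy, stated here only
// as an assumption.
public class RoutingSketch {
    public static int chooseLeaf(Instance newObject, List<Instance> centroids,
                                 DistanceFunction d) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.size(); i++) {
            double dist = d.distance(newObject, centroids.get(i));
            if (dist < bestDist) { // remember the closest centroid so far
                bestDist = dist;
                best = i;
            }
        }
        // Caller appends newObject to the leaf at index 'best' and may
        // enlarge radii[best] to max(radii[best], bestDist).
        return best;
    }
}
```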
One key aspect of correctly building the MTree from data regards the order in which the data objects are provided as input. As a general rule, any clustering algorithm is highly sensitive to the order in which data objects are considered in the clustering process. The seed mechanism is the general solution for clustering algorithms and is also used as a standard option in the MTree. In our case, the seed mechanism represents a random shuffle of the order in which the instances are added to the MTree.

The pseudocode of the mTreeInsert function presents the structure and logic of the operations performed when inserting a new instance into the tree. The function's parameters are the address of the tree and the item that should be inserted. Since insertion takes place only in leaf nodes, the algorithm first checks that we are in a leaf node and then determines the optimal number of clusters for that leaf. If more than one cluster is found in the leaf, the node is split and the instance is placed into the appropriate cluster.

The pseudocode of the mTreeSplitNode function actually performs the split of a leaf. Along with the treeNode, an evaluator is given as a parameter, which gathers the parameters needed to determine the clusters. Once the clusters are determined, the centroids and the corresponding radixes are placed in the root, the addresses of the new clusters are placed in the routes vector, and the instances themselves are placed in the appropriate clusters.

A particular situation occurs when we do not know the correct number of clusters. In this case, the adequate solution is to preprocess the data-set to check clusterability and to determine the valid number of clusters, if such a number exists. The situation in which the data-set does not have a clear, well-defined structure may be interpreted in two ways: either the data-set has several possible options for the actual number of clusters, or we do not know the correct number of clusters. In the first situation, a domain expert should consider the specific data analysis task and choose the number of clusters that fits the practical problem. More extensive work needs to be done as a preprocessing step in the latter situation.

Figure 1: Nodes structure

As a general rule, when the data-set has no structure, the first preprocessing step should define the number of clusters as a task-dependent value. Then, the centroids should be defined as representatives of each cluster considered by the domain expert. Finally, the objects from the data-set are added to the pertaining cluster only if a maximal threshold distance from the centroid is not exceeded. Objects that have almost equal distances to several centroids are considered outliers and are not added to any cluster. In this way, the data analyst may obtain a clusterable data set with a specific number of clusters. Having a clusterable data-set is a prerequisite for the MTree clustering algorithm, as for any other algorithm. Therefore, if the data-set does not meet clusterability, building an MTree clusterer, or any other clusterer, will exhibit undefined behaviour.

Finding the correct number of clusters by using the MTree may be performed by running it with various parameter settings, in terms of the number of clusters and the division policy, in an attempt to obtain a clusterer whose validation parameters show that the correct patterns have been discovered. In this use case, the MTree data structure is built for finding whatever clusters are to be found.
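Since the seed is realised as a random shuffle of the insertion order, the mechanism can be made concrete with a few lines of Java. The sketch below is illustrative rather than the package source; only the Weka classes and methods used are real.

```java
import java.util.Random;
import weka.core.Instances;

// Sketch of the seed mechanism described above: the seed fixes the
// random shuffle of the order in which instances enter the MTree, so
// a given seed always reproduces the same insertion order.
public class SeededOrder {
    public static Instances shuffledCopy(Instances data, int seed) {
        Instances copy = new Instances(data); // do not mutate the caller's data
        copy.randomize(new Random(seed));     // reproducible insertion order
        return copy;
    }
}
```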
3.3 Complexity analysis

The complexity of building the MTree from data depends on the number of objects, the number of clusters, the method for finding the optimal number of clusters and the number of distance computations. The number of clusters in the data-set represents the number of splits that need to be performed while building the tree. The most critical operation is finding the optimal number of clusters, which is called after each object insertion. The number of distance computations is related to the number of clusters, since the distances from the newly inserted object to all the centroids from the root need to be computed to determine the suitable leaf where the insertion should occur. The most time-consuming function is getOptimalNrOfClusters, which is called whenever a new object is to be inserted. We have observed that, for a reasonably small number of clusters and a large number of objects, the method getOptimalNrOfClusters triggers a split fewer times than the number of clusters. For example, once the number of leaves of the MTree has reached the true number of clusters, looking for the optimal number of clusters becomes useless: further insertions are performed in constant time, just by determining the proper leaf where the new object needs to be inserted. As stated in [15], the performance of building an MTree with $n$ objects is analogous to that of k-d trees, that is, $O(n \log n)$ in the worst-case scenario. Depending on the split method, the time may increase to $O(n \log^2 n)$, or to $O(kn \log n)$ for $k$ dimensions. Still, the currently proposed method is highly sensitive to the order in which objects are inserted, to the seed selection and to the particular parameter setup, as are all other clustering algorithms.

The critical property of the MTree is that, after correctly building the clusters, the operations of inserting, removing and querying may be performed in $O(\log n)$ time. These aspects are not tested by the current work and need to be investigated experimentally in practical clustering tasks.

4 Experimental results

4.1 Data-sets description

Experiments have been performed on two synthetic, publicly available data sets from the clustering basic benchmark [19]: Unbalance [35] and Dim2 [28]. Unbalance is a synthetic 2-d data-set with 6500 points and 8 Gaussian clusters for which the ground-truth centroids and partitions are known. Dim2 is also a synthetic 2-d data-set, with 1351 points and 9 Gaussian clusters. Figures 4 and 5 present plots of the raw input data. We have chosen two synthetic datasets for which the ground-truth centroids and partitions are known because they are suitable for comparing the clustering results obtained by the MTree algorithm against those of the other algorithms implemented in the Weka workbench.

Figure 2: Sample MTree
Figure 3: Sample MTree split node
Figure 4: Unbalance dataset
Figure 5: Dim2 dataset

Finally, we have tested the MTree on the classical Wine and Iris datasets from the UCI Machine Learning Repository [3]. These real-world datasets were chosen because they may be successfully used in clustering tasks: they are labelled, and classical unsupervised training may correctly determine the classes.
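For readers without the benchmark files at hand, data with the same flavour (2-d Gaussian blobs with known ground truth) can be synthesised directly as Weka Instances. The sketch below uses only real Weka classes; the centres, counts and spread are arbitrary illustration values, not the benchmark's actual parameters.

```java
import java.util.ArrayList;
import java.util.Random;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

// Sketch: build a 2-d synthetic dataset of Gaussian clusters with known
// ground truth, in the spirit of the Unbalance/Dim2 benchmarks.
public class GaussianBlobs {
    public static Instances generate(double[][] centers, int pointsPerCluster,
                                     double stdDev, long seed) {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("x"));
        atts.add(new Attribute("y"));
        Instances data =
            new Instances("blobs", atts, centers.length * pointsPerCluster);
        Random rnd = new Random(seed);
        for (double[] c : centers) {
            for (int i = 0; i < pointsPerCluster; i++) {
                double[] vals = { c[0] + stdDev * rnd.nextGaussian(),
                                  c[1] + stdDev * rnd.nextGaussian() };
                data.add(new DenseInstance(1.0, vals)); // weight 1.0
            }
        }
        return data;
    }
}
```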
4.2 MTree package description

The MTree is implemented in Java and is aimed to be executed within the Weka 3.8 workbench by using the Package manager tool. The MTree package is based on three classes, Node, MTreeBean and MTree, along with three other helper classes. Figure 6 shows the software architecture of the MTree package as a UML class diagram.

The class Node represents a blueprint for a cluster of instances, and the MTreeBean class contains the root of the MTree. The most important class is MTree, which extends RandomizableClusterer, an abstract class whose direct subclasses are Canopy, Cobweb, FarthestFirst and SimpleKMeans. Further, by implementing NumberOfClustersRequestable and other interfaces, the MTree gains the possibility to be parametrised in a similar manner to the other clustering algorithms in Weka.

The main goal was to obtain a clusterer that may be parameterised in the same way as the already existing clusterers, based on the interfaces already defined in Weka, while also offering the possibility of defining new interfaces specific to the parameters needed by the MTree algorithm. In this way, the newly obtained MTree can easily be parametrised through the command-line interface or through the Weka GUI.

Figure 6: UML class diagram for MTree package

4.3 Sample MTree usage

The MTree can be used in three ways. The current implementation provides the flexibility to run full experimental runs and to benchmark the performance of an MTree parameter configuration against already existing Weka clustering algorithms.

– Basic mode, through the command line. This mode allows executing the MTree on any machine that has JVM 1.8 and weka.jar version 3.8.3. Figure 7 presents a sample command-line execution of the MTree algorithm. This approach is commonly used when a batch execution is needed only once for building the clusters, with the obtained model (i.e., the distribution of objects into clusters) serialised to persistent storage (i.e., csv, xml, etc.) for later usage. The available options when running the MTree from the command-line interface are the following: N represents the number of clusters; if we want the MTree to decide the number of clusters by itself, this option must be set to -1. The init option may be used for setting the initialisation method. The I option sets the maximum number of iterations, O preserves the order of instances, S sets the number of seeds, d sets the distance metric, findN sets the method for finding the optimal number of clusters, and splitPolicy sets the method used as the splitting policy. The current implementation may use as splitting policy the Canopy, SimpleKMeans, Cobweb, FarthestFirst or HierarchicalClusterer clusterers from Weka.

– Using the Weka GUI. This mode is the most user-friendly, as the MTree may be used from the Weka GUI like any other clustering algorithm. The MTree package needs to be installed before usage. Installing the MTree package in Weka is a straightforward procedure, as the MTree.zip archive is publicly available on SourceForge [30] and the link to the package is available in the list of official packages within the Weka package manager tool.

– Programmatic way. The most versatile usage of the MTree is the programmatic one. This approach allows setting up the parameters at runtime, as well as having a ready-to-use, in-memory MTree object available for querying. It also allows the usage of the clusterer as business logic on the server side of complex applications in which client code performs various queries. Sample code for building the MTree from data is publicly available in [8]; a hedged sketch of this mode is shown after this list.
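As an illustration of the programmatic mode, the sketch below follows the standard Weka clusterer API. The MTree class name and the two setters are taken from the architecture described in Section 4.2 (RandomizableClusterer provides setSeed, NumberOfClustersRequestable provides setNumClusters); the input file name is hypothetical, and the sample code in [8] remains the authoritative reference.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of programmatic usage through the standard Weka clusterer API.
// 'MTree' and its setters follow the interfaces named in Section 4.2;
// consult the package source [8] for the exact option names.
public class MTreeDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name: any ARFF/CSV with numeric attributes works.
        Instances data = DataSource.read("unbalance.arff");

        MTree mtree = new MTree();
        mtree.setNumClusters(-1);   // -1: let the MTree find K by itself
        mtree.setSeed(100);         // seed for the insertion-order shuffle
        mtree.buildClusterer(data); // builds the two-level tree

        // Query the model: cluster membership of the first instance.
        int cluster = mtree.clusterInstance(data.instance(0));
        System.out.println("instance 0 -> cluster " + cluster);
    }
}
```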
Figure 7: Command line execution of MTree
Figure 8: Execution of MTree in Weka GUI

4.4 Sample runs on the Unbalance and Dim2 synthetic data-sets

The newly released parametrised MTree implementation has been tested on two publicly available synthetic data sets: Unbalance and Dim2. The client code that calls the MTree package is publicly available at [8]; the experimental results presented below were obtained by running the MTree implementation programmatically.

Figures 9 and 10 present the experimental results of running the MTree clusterer along with five other clusterers implemented in the Weka workbench: Canopy, EM (expectation-maximisation), FF (Farthest First), HC (Hierarchical Clustering) and SKM (Simple KMeans).

As Figure 9 clearly shows, the MTree correctly determines the clusters by using 100 seeds and Canopy as the split policy. As a general rule, the clustering result exhibits undefined behaviour with regard to the number of seeds, such that correct results may sometimes be obtained with only 10 seeds and sometimes with 1000 seeds. A summary of the experimental results of the other five algorithms follows:

– The Canopy algorithm fails to determine the correct number of clusters, and the found distribution into clusters is wrong.

– The EM algorithm fails to determine the correct number of clusters, although it is used for this task in many situations, as it does not require the value of K. The obtained clusters are reasonably good, with two exceptions: cluster 0 puts together three real clusters and cluster 1 puts together two real clusters.

– The FF algorithm correctly determines the number of clusters but fails to determine two of them correctly: cluster 0 puts together two real clusters, and cluster 2 is composed of objects belonging to two distinct clusters.

– The HC algorithm correctly solves the task.

– The SKM algorithm correctly solves the task after fine-tuning the parameters: 100 seeds, usage of kmeans++ [2] for seed optimisation and a maximum of 10,000 iterations.

As Figure 10 clearly shows, the MTree correctly determines the clusters on a more difficult task by using only 10 seeds. The other investigated algorithms provide the following results:

– The Canopy algorithm fails to determine the correct number of clusters, and the found distribution into clusters is wrong, as it puts objects from two clusters together in one cluster.

– The EM algorithm correctly determines the clusters.

– The FF algorithm fails to determine the correct number of clusters and misses one of them: the result shows that objects from one real cluster are split between two clusters.

– HC fails to determine the correct number of clusters and reports one found cluster that is a join of two real clusters.

– The SKM algorithm correctly solves the task after fine-tuning the parameters: 10 seeds, usage of kmeans++ [2] for seed optimisation and a maximum of 10,000 iterations.
Figure 9: Clustering results on the Unbalance dataset (panels (a)-(f): MTree, Canopy, EM, FF, HC and SKM results)

Figure 10: Clustering results on the Dim2 dataset (panels (a)-(f): MTree, Canopy, EM, FF, HC and SKM results)

Table 2: Running time statistics (measured in seconds)

Algorithm   Unbalance dataset        Dim2 dataset
MTree       0.3 [per seed]           0.06 [per seed]
Canopy      0.01                     0.1
EM          8.21 [per tuned seed]    0.55 [per tuned seed]
FF          0.01 [per seed]          0.01 [per seed]
HC          605.42                   4.11
SKM         0.06 [per seed]          0.01 [per seed]

Table 3: Performance results on real-world datasets

Algorithm      Accuracy Wine   Accuracy Iris
MTree+Canopy   0.94382         0.89261
MTree+cKMs     0.92134         0.89261
MTree+FF       0.93258         0.88590
MTree+HC       0.89887         0.89261
KMeans         0.93258         0.88590

Experimental results show that the first K instances have the most significant impact on the final result, where K is the number of clusters. Thus, the current implementation uses kmeans++ [2] seed optimisation, so that the first K instances added to the MTree are rather spread out from one another.

Table 2 summarises the running time statistics for all six algorithms on both data-sets. In the case of the EM algorithm, each tuned seed has been obtained through more iterations and more k-means runs, a fact that explains the more significant running time. HC and Canopy do not have seeds, and Canopy's poor results on both data-sets were obtained regardless of the configuration parameters. The number of seeds for the algorithms that correctly solved the Dim2 data-set has been set to 10.

Finally, the SSE (sum of squared errors), as well as the assignment of objects, is correctly computed by the MTree, as it evaluates to the same values as for SKM: 4.2471 and 0.3367 for the Unbalance and Dim2 datasets, respectively. Therefore, the clusters produced by the MTree are valid and represent the real ones from the datasets, and the SSE represents a good optimisation metric for these datasets.
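For completeness, the SSE referenced throughout this section is the standard k-means objective (the formula below is the textbook definition; the paper itself only names the metric):

$$\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$$

where $C_j$ is the set of objects assigned to cluster $j$ and $\mu_j$ is its centroid. Lower values indicate more compact clusters, which is why matching SSE values between the MTree and SKM support the claim that the two produce the same partition here.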
4.5 Sample runs on the Iris and Wine real-world data-sets

The Wine data was normalised, and the MTree was then used on all 13 features. The algorithm was run with 100 seeds, and the targeted best result was the one with the best (smallest) SSE. As can be seen from Table 3, a smaller SSE does not always provide the best accuracy: an SSE value of 66 against 68 yields an accuracy of 89 against 94. This suggests that a different cluster quality metric may be able to improve the performance of the proposed algorithm. KMeans run with 100 seeds obtains 93% accuracy, i.e., 12 wrong predictions.

The Iris data was normalised, and the MTree was used on all 4 features. As on the previous data-set, the algorithm was run with 100 seeds and the best SSE was targeted. It is interesting to notice that different SSE values provide the same accuracy; it seems that the algorithm converged, with 16 wrong predictions being its best result. MTree+Cobweb is not able to cluster this data. On the same data, KMeans run with 100 seeds obtains 88% accuracy, i.e., 17 wrong predictions.

Table 3 presents the accuracy results of the MTree parametrised with various splitting algorithms (i.e., Canopy, KMeans, Farthest First and Hierarchical clustering) against the baseline KMeans algorithm. The experimental results show that the MTree clustering algorithm finds groups at least as good as those of the simple KMeans algorithm.

5 Conclusions and future work

This paper presents an improved, parametrised MTree clusterer for the Weka workbench. The experimental results show that the MTree correctly solves two synthetic datasets for which the correct structure (i.e., number of clusters, centroids and distribution) is known. Moreover, five other clustering algorithms implemented in Weka are outperformed in various situations, due to the fact that they either do not solve the clustering task correctly or need intensive tuning in order to solve it. Still, the current approach is used only as a sanity check of the clustering capabilities of the MTree implementation, rather than for solving a particular clustering task on a real dataset.

The implementation is Java-based and is available as an open-source Weka package. The experimental results are correct and promising, such that further development under Weka offers the possibility of properly benchmarking further clustering tasks that may be taken into consideration.

The main contributions are summarised as follows:

1. An updated, parametrised version of the MTree package is presented. The parametrisation mainly regards the used distance metric, the method for finding and setting the number of clusters, and the division policy. The data structure can load various object types, after proper processing, and provides validation insights.

2. The proposed software architecture of the MTree enables parametrisation through the easy integration of other internal algorithmic strategies that perform key tasks within the business logic.

3. The implementation of the MTree is available as an open-source package in the Weka workbench. This gives the opportunity for further usage and for benchmarking against other clustering algorithms.

4. The experiments demonstrate that the proposed approach outperforms current clustering algorithms on two datasets.

Future work may take into consideration extending the voteK algorithm into a Java implementation of the already existing R package NbClust. Extending voteK should take into consideration the available clustering quality indices and the parametrisation capabilities. In terms of internal business logic, the MTree may try different approaches regarding the order in which objects are processed when building the tree. One option is to cluster the outlier objects later in the process.

As the most expensive operations are finding the optimal number of clusters and splitting, one option is to try to predict how the insertion of an object will impact the tree in terms of triggering a split. Checking for the optimal number of clusters should be performed only when an insert is highly likely to trigger a split, as most inserts do not require a split, especially when the dataset has a well-defined structure.

The MTree currently implements range queries and kNN queries. These implementations should be further tested in practical, real-data scenarios. Other tasks for which the MTree may also be used are outlier detection and finding the correct number of clusters in a dataset. Finally, the MTree algorithm may be further tested for finding patterns in data in situations where the internal structure is not known.
Acknowledgement

This work was partially supported by the grant 135C/2021 "Development of software applications that integrate machine learning algorithms", financed by the University of Craiova.

References

[1] Andreas Adolfsson, Margareta Ackerman, and Naomi C. Brownstein. "To cluster, or not to cluster: An analysis of clusterability methods". In: Pattern Recognition 88 (2019), pp. 13–26. DOI: 10.1016/j.patcog.2018.10.026.

[2] David Arthur and Sergei Vassilvitskii. "k-means++: The advantages of careful seeding". In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035.

[3] Catherine L. Blake and Christopher J. Merz. UCI repository of machine learning databases. 1998.

[4] Roberto Caldelli et al. "Fast image clustering of unknown source images". In: Jan. 2011, pp. 1–5. DOI: 10.1109/WIFS.2010.5711454.

[5] Jianlong Chang et al. "Deep adaptive image clustering". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5879–5887. DOI: 10.1109/iccv.2017.626.

[6] Malika Charrad et al. "NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set". In: Journal of Statistical Software 61 (Oct. 2014), pp. 1–36. DOI: 10.18637/jss.v061.i06.

[7] Paolo Ciaccia et al. "Indexing metric spaces with M-tree". In: SEBD. Vol. 97. 1997, pp. 67–86.

[8] Marius Andrei Ciurez. MTree client code. https://github.com/kyko007/Cordoba/tree/master/MTree. 2019.

[9] Marius Andrei Ciurez and Marian Cristian Mihaescu. "Improved Architectural Redesign of MTree Clusterer in the Context of Image Segmentation". In: International Conference on Intelligent Data Engineering and Automated Learning. Springer, 2018, pp. 99–106. DOI: 10.29007/sm6x.

[10] Abhisek Dash et al. "Image Clustering without Ground Truth". In: CoRR (Oct. 2016). DOI: 10.48550/arXiv.1610.07758.

[11] Jiangzhou Deng, Junpeng Guo, and Yong Wang. "A Novel K-medoids clustering recommendation algorithm based on probability distribution for collaborative filtering". In: Knowledge-Based Systems (Mar. 2019). DOI: 10.1016/j.knosys.2019.03.009.

[12] F. Deves et al. "Scalable Real-Time Shadows using Clustering and Metric Trees". In: Eurographics Symposium on Rendering. Karlsruhe, Germany, July 2018, pp. 1–12. DOI: 10.2312/sre20181175. URL: https://hal.archives-ouvertes.fr/hal-02089095.

[13] Nameirakpam Dhanachandra, Khumanthem Manglem, and Yambem Jina Chanu. "Image segmentation using K-means clustering algorithm and subtractive clustering algorithm". In: Procedia Computer Science 54 (2015), pp. 764–771. DOI: 10.1016/j.procs.2015.06.090.

[14] Gianni A. Di Caro, Frederick Ducatelle, and L. Gambardella. "A fully distributed communication-based approach for spatial clustering in robotic swarms". In: Proceedings of the 2nd Autonomous Robots and Multirobot Systems Workshop (ARMS), affiliated with the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS) (Valencia, Spain, June 5). Citeseer, 2012, pp. 153–171.

[15] Herbert Edelsbrunner. Algorithms in combinatorial geometry. Vol. 10. Springer Science & Business Media, 2012. DOI: 10.1007/978-3-642-61568-9.
[16] Eibe Frank, Mark A. Hall, and Ian H. Witten. "The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques". In: Morgan Kaufmann (2016). DOI: 10.1016/B978-0-12-804291-5.00024-6.

[17] Ahmed Ali Abdalla Esmin, Rodrigo A. Coelho, and Stan Matwin. "A review on particle swarm optimization algorithm and its variants to clustering high-dimensional data". In: Artificial Intelligence Review 44.1 (2015), pp. 23–45. DOI: 10.1007/s10462-013-9400-4.

[18] Adil Fahad et al. "A Survey of Clustering Algorithms for Big Data: Taxonomy and Empirical Analysis". In: IEEE Transactions on Emerging Topics in Computing 2.3 (2014), pp. 267–279. DOI: 10.1109/tetc.2014.2330519.

[19] Pasi Fränti and Sami Sieranoja. "K-means properties on six clustering benchmark datasets". In: Applied Intelligence 48.12 (2018), pp. 4743–4759. DOI: 10.1007/s10489-018-1238-7.

[20] Guojun Gan, Chaoqun Ma, and Jianhong Wu. Data Clustering: Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics, 2007. DOI: 10.1137/1.9780898718348.

[21] Farid Garcia-Lamont et al. "Automatic computing of number of clusters for color image segmentation employing fuzzy c-means by extracting chromaticity features of colors". In: Pattern Analysis and Applications 23 (2020), pp. 59–84. DOI: 10.1007/s10044-018-0729-9.

[22] Melvin Gauci et al. "Clustering objects with robots that do not compute". In: Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems. International Foundation for Autonomous Agents and Multiagent Systems, 2014, pp. 421–428.

[23] Ángel Castellanos Gonzáles, Juan Manuel Cigarrán, and Ana García-Serrano. "Formal concept analysis for topic detection: A clustering quality experimental analysis". In: Information Systems 66 (2017), pp. 24–42. ISSN: 0306-4379. DOI: 10.1016/j.is.2017.01.008.

[24] Suchita Goswami and Lalit Kumar P. Bhaiya. "Brain tumour detection using unsupervised learning based neural network". In: 2013 International Conference on Communication Systems and Network Technologies. IEEE, 2013, pp. 573–577. DOI: 10.1109/csnt.2013.123.

[25] Sudipto Guha and Nina Mishra. "Clustering Data Streams". In: Data Stream Management - Processing High-Speed Data Streams. Ed. by Minos N. Garofalakis, Johannes Gehrke, and Rajeev Rastogi. Springer, 2016, pp. 169–187. DOI: 10.1049/iet-smt.2018.5389.

[26] Givanna H. Putri et al. "ChronoClust: Density-based clustering and cluster tracking in high-dimensional time-series data". In: Knowledge-Based Systems 174 (Feb. 2019). DOI: 10.1016/j.knosys.2019.02.018.

[27] K. Anil Jain and Aditya Vailaya. "Image retrieval using color and shape". In: Pattern Recognition 29.8 (1996), pp. 1233–1244. DOI: 10.1016/0031-3203(95)00160-3.

[28] Ismo Kärkkäinen and Pasi Fränti. "Gradual model generator for single-pass clustering". In: Pattern Recognition 40.3 (2007), pp. 784–795. DOI: 10.1016/j.patcog.2006.06.023.

[29] Manish Maheshwari, Sanjay Silakari, and Mahesh Motwani. "Image clustering using color and texture". In: Computational Intelligence, Communication Systems and Networks. IEEE, 2009, pp. 403–408. DOI: 10.1109/CICSYN.2009.69.

[30] Marian Cristian Mihaescu. MTree Clusterer. Accessed: 2019-05-30. URL: http://weka.sourceforge.net/packageMetaData/MTreeClusterer/index.html.
[31] Marian Cristian Mihaescu and Dumitru Dan Burdescu. "Using M-tree data structure as unsupervised classification method". In: Informatica 36.2 (2012).

[32] Anton Milan et al. "Joint tracking and segmentation of multiple targets". In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. 2015, pp. 5397–5406. DOI: 10.1109/cvpr.2015.7299178.

[33] Jose A. Miñarro-Giménez, Markus Kreuzthaler, and Stefan Schulz. "Knowledge Extraction from MEDLINE by Combining Clustering with Natural Language Processing". In: AMIA Annual Symposium Proceedings. Vol. 2015. American Medical Informatics Association, 2015, p. 915.

[34] Traian Rebedea and Ştefan Trăuşan-Matu. "Autonomous News Clustering and Classification for an Intelligent Web Portal". In: Foundations of Intelligent Systems. Springer Berlin Heidelberg, 2008, pp. 477–486. DOI: 10.1007/978-3-540-68123-6_52.

[35] Mohammad Rezaei and Pasi Fränti. "Set matching measures for external cluster validity". In: IEEE Transactions on Knowledge and Data Engineering 28.8 (2016), pp. 2173–2186. DOI: 10.1109/TKDE.2016.2551240.

[36] Hermes Robles et al. "LEAC: An efficient library for clustering with evolutionary algorithms". In: Knowledge-Based Systems (May 2019). DOI: 10.1016/j.knosys.2019.05.008.

[37] Érick Oliveira Rodrigues et al. "K-MS: a novel clustering algorithm based on morphological reconstruction". In: Pattern Recognition 66 (2017), pp. 392–403. DOI: 10.1016/j.patcog.2016.12.027.

[38] Tiago Rodrigues Lopes dos Santos and Luis E. Zárate. "Categorical data clustering: What similarity measure to recommend?" In: Expert Systems with Applications 42.3 (2015), pp. 1247–1260. DOI: 10.1016/j.eswa.2014.09.012.

[39] Lincoln F. Silva et al. "Hybrid analysis for indicating patients with breast cancer using temperature time series". In: Computer Methods and Programs in Biomedicine 130 (2016), pp. 142–153. DOI: 10.1016/j.cmpb.2016.03.002.

[40] Jeffrey K. Uhlmann. "Satisfying general proximity/similarity queries with metric trees". In: Information Processing Letters 40.4 (1991), pp. 175–179. DOI: 10.1016/0020-0190(91)90074-R.

[41] Ulrike von Luxburg, Robert C. Williamson, and Isabelle Guyon. "Clustering: Science or art?" In: Proceedings of ICML Workshop on Unsupervised and Transfer Learning. 2012, pp. 65–79.

[42] Zhimin Wang et al. "Adaptive spatial information-theoretic clustering for image segmentation". In: Pattern Recognition (Sept. 2009), pp. 2029–2044. DOI: 10.1016/j.patcog.2009.01.023.

[43] Pavel Zezula et al. Similarity search: the metric space approach. Vol. 32. Springer Science & Business Media, 2006. DOI: 10.1007/0-387-29151-2.

[44] Shibing Zhou and Zhenyuan Xu. "Automatic grayscale image segmentation based on Affinity Propagation clustering". In: Pattern Analysis and Applications (Feb. 2019). DOI: 10.1007/s10044-019-00785-4.