https://doi.org/10.31449/inf.v47i6.4445
Informatica 47 (2023) 203–218

Big Data Clustering Techniques Challenges and Perspectives: Review

Fouad H. Awad, Murtadha M. Hamad
College of Computer Science and Information Technology, University of Anbar, Anbar, Iraq
E-mail: fouad.hammadi@uoanbar.edu.iq, co.mortadha61@uoanbar.edu.iq

Keywords: Big data, clustering, data mining, machine learning

Received: October 12, 2022

Clustering in big data is considered a critical data mining and analysis technique. There are issues with adapting clustering algorithms to large amounts of data, as well as new challenges brought by big data itself. As the size of big data reaches petabytes and clustering methods have high processing costs, the challenge is how to handle this issue and utilize clustering techniques for big data efficiently. This study investigates the recent advancement of clustering platforms and techniques for handling big data issues, from the early suggested techniques to today's novel solutions. The methodology and specific issues for building an effective clustering mechanism are presented and evaluated, followed by a discussion of the choices for enhancing clustering algorithms. A brief literature review of recent advancements in clustering techniques is presented to address each solution's main characteristics and drawbacks.

Povzetek: The article presents a review of clustering techniques for big data.

1 Introduction

After a period of addressing challenges associated with processing big data, the emphasis has shifted towards making sense of this vast volume of information. According to experts and scholars, big data represents one of the most pressing concerns in the field of computer science today. A notable example of the scale of big data can be seen in YouTube, which boasts one billion daily active users who collectively upload 100 hours of content every hour. Additionally, its copyrighted-content service must evaluate over 400 years of video content on a daily basis [31]. A billion users on Facebook and Twitter are creating terabytes of data every minute. It is critical to employ advanced knowledge-discovery tools to deal with this flood of data. Data mining techniques [68][54] are excellent information-seeking tools for this purpose. One of them is clustering, which is described as "a strategy for dividing data into groups in such a way that items in one group have more similarity than objects in other groups" [31]. Data clustering is a well-known strategy in many parts of computer science and related disciplines. Although data mining is the most widely used context for clustering, it is also commonly used in other subject areas such as biostatistics, energy studies, deep learning [37], networking, and pattern classification [1][5], resulting in a large body of research [41]. Researchers have studied clustering algorithms since their inception to regulate their complexity and processing cost and, consequently, improve scalability and performance. Big data's rise in popularity in recent years has introduced new challenges to this field, prompting further research into improved clustering methods. It is critical to determine the size of the data before focusing on clustering it. Bezdek and Hathaway established a data size classification to address this challenge [50].
There are five important challenges of big data, known as the 5 Vs [24]:

– Volume: the first challenge, exemplified by streaming unstructured data in the form of social media, which raises the questions of how to assess the relevant data within high data volumes and how to provide helpful information by evaluating that data.

– Velocity: data is pouring in at breakneck speed, and it must be processed in a fair amount of time. One of the issues of big data is responding quickly to data velocity.

– Variety: another problematic challenge is merging, managing, and evaluating data from several sources with varying standards, such as social media posts, audio, unstructured data, and video.

– Veracity: this refers to the data's quality; it reflects the data's biases, noise, and abnormalities.

– Value: this points to the precious knowledge revealed from the data.

Figure 1: Big data challenges

Due to their massive complexity and processing cost, traditional clustering approaches cannot handle this enormous volume of data. The standard K-means problem, for example, is NP-hard even when the number of clusters is k = 2. As a result, scalability is the most challenging aspect of clustering massive data [35]. Scaling and speeding up clustering algorithms is the crucial goal, while sacrificing as little quality as possible. Although researchers in this field have always sought to improve the scalability and performance of clustering algorithms, big data concerns highlight these flaws and demand additional attention and research. According to a review of the literature on clustering approaches, these techniques have progressed through five stages, as shown in Figure 1 [76].

Many review papers have been written about clustering techniques in big data, owing to the importance of such studies for understanding clustering and starting to work with it. However, compared with the recent review papers in big data clustering [60][33][75][23], this study has several key differences that are not present in the previous reviews and that represent its novelty:

– This review focuses only on the recent clustering techniques and algorithms used in modern big data processing frameworks, after reviewing more than 200 papers.

– It addresses the main issues of clustering techniques related to big data.

– It presents the main characteristics and differences between the most used big data platforms, Apache Spark and Hadoop MapReduce, which may guide researchers in selecting a platform based on the issues and requirements of their big data system.

– It filters the most advanced clustering solutions in the big data field, addressing each solution's main advantages and disadvantages.
In the rest of this paper, the drawbacks and benefits of the algorithms in each stage are studied in the sequence in which they occur in the diagram. Finally, based on current and unique approaches and future work, an additional stage is identified that might be the next stage for big data clustering algorithms. Techniques for enabling clustering mechanisms to operate on larger datasets by increasing performance and scalability are divided into two categories:

– Techniques for clustering on a single machine

– Techniques for clustering on multiple machines

Single-machine clustering algorithms run on a single system and can only use that machine's resources, whereas multiple-machine clustering techniques have access to the resources of, and run on, several machines. Further information about these techniques is given in the sections below [30].

2 Survey methodology

We examined the most relevant research publications published between 2015 and 2022, mainly from 2021 and 2022, with a few from 2019. Papers from reputable publishers such as IEEE, Elsevier, MDPI, Nature, ACM, and Springer were the primary emphasis, and some papers were chosen from prestigious conferences. We looked at over 200 papers on various big data clustering topics. Seventy-five papers from 2022, 60 papers from 2021, and 30 papers from 2019 are included in this collection, which means this study focused on recent articles in the field of big data clustering. The publications were evaluated and reviewed to (1) define and list big data methods and clustering algorithms, (2) describe and explain big data structures, and (3) present clustering issues and alternative solutions. ("Big Data"), ("Clustering Techniques"), ("Clustering Platform"), and ("single-machine Clustering" AND "multi-machine Clustering") are the most often utilized keywords for this review paper's search criteria.

3 Big data

"Big Data" describes a set of algorithms, methods, and technologies designed to handle large amounts of data. Greater storage capacity, enhanced data storage, improved processing capabilities of high-performance computers, and access to extensive data have all supported the growth of the big data processing industry. Today's hardware and software can manage, alter, and analyze massive amounts of data in novel ways. The proliferation of the Internet, social media, smartphones, and apps has resulted in a data tsunami. Collecting and analyzing enormous volumes of data can reveal patterns, trends, and correlations relating to human actions and interactions. Big data has been used to analyze customer behavior, target marketing efforts, improve tasks, improve efficiency, and minimize risk. According to IDC, a global source of industry insight and information technology consulting services, the global big data analytics market is expected to expand rapidly in the coming years [7]. Organizations need help figuring out how to make the most of this plethora of data. There are three types of data: identity, intelligence, and people. Because of the size of big data, typical processing techniques are frequently insufficient [46]. Big data is listed as a future technology in Gartner's 2014 Hype Cycle.
Because data is continually created and we are inundated with it, what is considered 'big' today may not be 'big' tomorrow. As a result, traditional data processing approaches that cannot scale to large data will eventually become outdated. Using the various big data processing frameworks, each client can handle big data and strive to extract value from it. Hadoop and Spark are two of the most prominent open-source big data processing frameworks, while others are more specialized in their use yet have nonetheless managed to gain reputations and good market share. In general, there are two types of frameworks, proprietary and open-source, both widely utilized in the big data field.

4 Big data clustering

There are two types of clustering techniques in big data, as shown in Figure 2: single-machine and multiple-machine clustering techniques. Multiple-machine clustering solutions have lately been gaining attention due to their high scalability and fast response times.

Figure 2: Big data clustering techniques

4.1 Single machine clustering techniques

These approaches are applied on a single machine and only utilize that machine's resources. For single-machine techniques, data mining-based and dimension-reduction techniques are two common tactics [66].

1. Data mining-based clustering: unsupervised classification tools are used in these strategies. Existing techniques cannot handle large volumes of detailed data, since they use significant time and resources. Evolutionary-based, density-based, hierarchical-based, model-based, grid-based, and partitioning-based procedures are all examples of data mining approaches.

2. Partitioning-based clustering: this technique divides a big data set into k clusters (predefined by the user) by utilizing a distance-based similarity measurement, with each group representing a cluster. The most common algorithms that use this strategy are K-Means [10], K-Medoids [64], K-Modes [71], PAM [64], CLARA [27], and CLARANS [49].

3. Hierarchical-based clustering: this type of clustering technique employs distance measurements as a similarity criterion, such as Manhattan distance, Euclidean distance, and maximum distance, to build clusters in a hierarchical way (a tree). The two subcategories of hierarchical-based clustering are divisive hierarchical and agglomerative hierarchical. Agglomerative hierarchical approaches begin by considering each individual point in the data as a cluster, then continually fuse the most similar clusters until only one remains at the end. Two algorithms, BIRCH [54] and CHAMELEON, take advantage of this concept [16]. A divisive hierarchical clustering approach treats the entire data set as a single large cluster, then splits the most relevant clusters at each stage until a user-defined number of clusters is reached. This method's algorithms include PDDP [72] and DHCC [59].

4. Density-based clustering: data items are organized into clusters based on density areas, connectedness, and boundaries in this method. A cluster with a high density of individuals can expand in any direction. This method can detect clusters of any shape and is more resistant to noise and outliers, since the data is single-scanned. Typical examples of this approach are DBSCAN [45], DENCLUE [34], OPTICS [22], and DBCLASD [17].
5. Grid-based clustering: it constructs the clusters in three steps:

– Initially, the method divides the entire area into k small squares, where k is set by the user and is usually considerably less than the database's size.

– Next, it deletes the cells with a low density of data objects.

– At the end, it merges neighboring cells with high densities to create clusters.

In terms of speed, the grid-based approach provides a major advantage. On the other hand, the quality of a cluster is proportional to the number of cells provided by the user at the start. The most well-known algorithms in this family are STING [20], CLIQUE [38], OPTIGrid [26], and WaveCluster [17].

6. Model-based clustering: these approaches provide an average approximation of model parameters by allowing a preset mathematical model to fit the data to the best of its ability. To determine the classification's uncertainty, this model employs probability distributions. The number of clusters can be found automatically with this method, which is also more outlier- and noise-resistant. With an increasing number of model parameters, these methods become more complicated. EM [11], COBWEB [51], and MCLUST [74] are some examples of this method.

7. Dimension reduction techniques: the complexity and speed of a clustering technique are directly proportional to the dimensionality of the data. The number of variables and the number of examples or instances determine the size of a dataset. Scholars attempt to minimize the large dimensionality of data by applying two models, feature extraction and feature selection, to make algorithms rapid and time-effective.

– Feature selection: this entails picking a limited number of important variables (features) from a larger pool of possibilities. The number of chosen features in the subset can be controlled by one of the subset creation approaches in the feature selection algorithm or by the user as an input parameter. Sequential Forward Selection (SFS) [39], Correlation-based Feature Selection (CFS) [73], and the Markov Blanket Filter (MBF) [67] are three popular feature selection strategies.

– Feature extraction's purpose is to integrate the initial set of features via operational mapping to generate new feature subsets or variables. To put it another way, the initial set of characteristics is reduced to a new subset. Principal Component Analysis (PCA) [43], Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA) [19][70] are examples of this technique.

Figure 3 sums up the single-machine clustering techniques for big data, and a minimal code sketch of some of these techniques follows.

Figure 3: Single machine clustering techniques
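As a rough illustration of the families above, the sketch below runs a partitioning-based method (K-Means), a density-based method (DBSCAN), and a dimension-reduction step (PCA) on the same toy data. It assumes scikit-learn and NumPy are installed; the data set and every parameter value (k, eps, min_samples, the number of components) are illustrative choices, not recommendations from the surveyed papers.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

# Toy data: three Gaussian blobs in five dimensions (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 5)) for c in (0.0, 4.0, 8.0)])

# Partitioning-based: the user fixes k in advance.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Density-based: the number of clusters emerges from density; -1 marks noise.
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)

# Dimension reduction: project to two components, then cluster the projection.
X2 = PCA(n_components=2).fit_transform(X)
km2_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X2)

print(np.bincount(km_labels), np.unique(db_labels), X2.shape)
```

On well-separated blobs like these, all three routes recover the same three groups; the point is only to show how little code each single-machine family needs at small scale, and why such in-memory approaches do not survive contact with petabyte-sized inputs.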
4.2 Multiple machine clustering techniques

With the current data explosion, massive amounts of data (measured in petabytes) cannot be handled by single-machine clustering algorithms in adequate time. Indeed, with the advancement of computer specifications and computational sharing technologies, researchers hypothesized that such algorithms could be executed on multiple machines by pooling their resources. These methods entail breaking down large amounts of data into smaller chunks, which are handled on various machines whose combined resources are used to solve them. Multi-machine algorithms offer a faster processing speed and greater scalability than single-machine algorithms, but they face a significant challenge in high data traffic costs.

1. Parallel clustering: in data mining algorithms, three techniques of parallelism are utilized in the literature, as follows [55]:

– Independent parallelism: each CPU uses the entire amount of data available, and there is no intercommunication amongst processors at this level.

– Task parallelism: each processor executes a different algorithm (task).

– SPMD parallelism (Single Program, Multiple Data): the same algorithm is run on multiple processors with separate data sets.

Parallel techniques increase clustering's scalability and performance, but they also give programmers a new responsibility: dealing with data distribution complexity. As illustrated in Figure 4, parallel clustering operations are frequently carried out in four phases (a minimal Python sketch of this flow is given below). First, the complete data set is partitioned and distributed among workstations. Each computer subsequently processes its partition of the data to form local clusters. The next stage is to integrate the previously computed local clusters into global clusters. Finally, the final clusters are disseminated to each computer to improve and modify the local clusters; this last step can be skipped.

Figure 4: Parallel clustering techniques flowchart

Several data mining clustering algorithms run on parallel platforms. Kantabutra and colleagues [62] created a new solution based on a Network Of Workstations (NOWs), a parallel K-Means in which the researchers employ a message-passing mechanism for resource communication. Another approach, parallel CLARANS, was implemented in [36]. It uses a master-slave mechanism based on PVM (Parallel Virtual Machine) to group single or distributed computers, and the message-forwarding model establishes communication between CPUs. Among hierarchical approaches, [23] introduces PBIRCH, a parallel version of the BIRCH method. The SPMD (Single Program, Multiple Data) technique is used to achieve the parallel paradigm in this program, and the message-forwarding mechanism handles communication between processors. Using a work-pool analogy, a parallel variant of the CHAMELEON method was described in [16]. This approach is divided into three stages: the first uses a concurrent K-Nearest Neighbor approach to reduce the temporal complexity; in the second, the previously obtained sparse graph is partitioned using multilevel graph partitioning techniques to form smaller clusters; ultimately, clusters are combined into more prominent clusters depending on their interconnectivity and proximity.

One of the most used parallel clustering techniques is density-based distributed clustering (DBDC) [14], in which the entire data set is partitioned and distributed amongst sites (machines). Each site then creates a local model using the DBSCAN technique to establish local clusters and identify representative data objects (i.e., cluster IDs). All local models are sent to a central location, known as the server, to generate a global model. Finally, clients get the global model back on their local workstations so that local clusters may be updated and global clusters can be built.
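To make the four phases concrete, here is a minimal single-host sketch in which Python processes stand in for the machines of Figure 4. It performs the assignment-and-update steps of k-means in SPMD style (the same code runs on each partition); the function and variable names are illustrative assumptions, and a real deployment would use message passing across actual nodes rather than a local process pool.

```python
import numpy as np
from multiprocessing import Pool

def local_step(args):
    """Phase 2: one 'machine' assigns its partition to the nearest centers
    and returns per-cluster sums and counts (its local model)."""
    chunk, centers = args
    dists = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k = len(centers)
    sums = np.array([chunk[labels == j].sum(axis=0) for j in range(k)])
    return sums, np.bincount(labels, minlength=k)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.normal(size=(10_000, 2))                    # the full data set
    centers = data[rng.choice(len(data), 3, replace=False)]
    parts = np.array_split(data, 4)                        # phase 1: partition
    with Pool(4) as pool:                                  # 4 workers = 4 'machines'
        for _ in range(10):
            local = pool.map(local_step, [(p, centers) for p in parts])
            sums = sum(s for s, _ in local)                # phase 3: merge local models
            counts = sum(c for _, c in local)
            centers = sums / np.maximum(counts, 1)[:, None]  # phase 4: broadcast back
    print(centers)
```

Note that the merge step only exchanges k coordinate sums and counts per machine; keeping this exchanged state small is exactly what makes the communication costs of the parallel schemes above manageable.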
Researchers have recently become interested in using GPUs rather than CPUs due to their high computational density per memory access. Because GPUs are designed for parallel processing, they are faster and more powerful than CPUs for such workloads. The parallel GPU-accelerated variant of DBSCAN is G-DBSCAN [53]. There are two parallelized phases in this approach: the first entails creating a graph whose edges are produced by adhering to a predetermined distance threshold; the second locates the clusters by traversing the previously generated graph using the Breadth-First Search (BFS) approach.

2. MapReduce-based clustering: in 2004, Google introduced MapReduce, a programming model and system for working with massive data sets. This strategy uses the parallel paradigm to break the principal work into smaller jobs distributed across numerous processing nodes. The architecture of MapReduce is shown in Figure 5. It comprises two functions with user-based scheduling: map and reduce. In the first step, each mapping machine analyzes its share of the scattered data; at the end of this phase, a set of intermediate key/value pairs is formed. The reducer machines then act as a shuffling and combining mechanism, blending the intermediate values associated with the same intermediate keys [29]. The real benefit of this technique is that it hides the complexities of parallel execution, allowing the user to concentrate solely on data classification and processing strategies. Furthermore, MapReduce can save programmers time by automatically managing networking issues like load balancing, data dissemination, and fault tolerance, allowing for massive parallelism and more simplicity while expanding the capability of parallel systems. A toy imitation of this map/shuffle/reduce flow is sketched below.

Figure 5: MapReduce framework

Various clustering methods have been implemented on a MapReduce platform [3], including k-means with MapReduce (PK-Means) and MR-DBSCAN, a MapReduce-based implementation of DBSCAN.
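The following toy imitation of the map/shuffle/reduce flow (pure Python, no Hadoop involved; all names are illustrative assumptions) expresses one k-means assignment pass as a map function and a reduce function, with an in-memory dictionary playing the framework's shuffle role.

```python
from collections import defaultdict

centers = [(0.0, 0.0), (5.0, 5.0)]                       # current centers
points = [(0.5, 0.2), (4.8, 5.1), (0.1, -0.3), (5.2, 4.9)]

def map_point(p):
    """Map: emit (nearest-center-id, (point, 1)) for one input record."""
    dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
    return min(range(len(centers)), key=dists.__getitem__), (p, 1)

def reduce_center(cid, values):
    """Reduce: average all points that were assigned the same center id."""
    n = sum(count for _, count in values)
    dim = len(values[0][0])
    return cid, tuple(sum(p[i] for p, _ in values) / n for i in range(dim))

# Shuffle: group intermediate pairs by key (the framework does this for you).
groups = defaultdict(list)
for key, value in map(map_point, points):
    groups[key].append(value)

print(dict(reduce_center(k, v) for k, v in groups.items()))
```

Iterating this pass until the centers stabilize is essentially what a MapReduce k-means such as PK-Means does, with the shuffle, distribution, and fault tolerance handled by the framework instead of a dictionary.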
5 Big data clustering platforms

Big data platforms are divided into three categories: batch processing, interactive analytics, and real-time processing. The most common big data platforms are batch processing platforms, which perform numerical computations; streaming suits applications that require low latency; real-time processing typically necessitates high-performance data processing; and interactive analytics allows users to remotely access datasets and perform a variety of tasks as required. The most extensively used batch-oriented platform is Apache Hadoop, a free and open-source MapReduce implementation. MapReduce was created by Google and is used to solve challenges such as large-scale machine learning and clustering [52]. The map and reduce primitives were inspired by those of the functional language Lisp [57]. Hadoop breaks down a problem into the smallest units that may be recursively handled. The small pieces are subsequently spread among system nodes for execution, with a primary node and worker nodes coordinating the operation [42]. MapReduce, HDFS, and YARN are the software components that make up Hadoop, and abstractions like Hive, HBase, Pig, and Spark are part of Hadoop's ecosystem. These Hadoop modules and abstractions cover the complete big data value chain, including data collection, storage, processing, evaluation, and administration. Hadoop can be deployed in huge-scale companies due to its low cost, scalability, fault tolerance, and flexibility.

Stream processing systems, often known as real-time or semi-real-time systems, are the next type of big data platform. These systems are required when data streams are continuous and speedy processing is needed. Users of real-time systems require quick access to and analysis of data held in warehouses, and stream systems necessitate continuous data processing without the need to preserve it. Weather and transportation systems are two examples of such uses. The number of sensors in networked items was predicted to climb to fifty billion by 2020, with these objects being linked to smart gadgets and smartphones [21]. The data from these devices might be used in both traditional and creative ways to improve the well-being of individuals, processes, the environment, systems, and organizations; such devices would need stream processing. Storm and SAP HANA are two examples of stream processing systems.

Interactive analytics is the final big data platform category. Academics from the University of California, Berkeley, created Spark, a next-generation prominent data processing paradigm. It is a Hadoop alternative designed to alleviate disk I/O constraints and boost the performance of older systems. Spark's capacity to do in-memory computations is one of its most notable features: it allows data to be kept in memory for iterative operations, bypassing Hadoop's disk overhead barrier. Spark is a Java-, Scala-, and Python-based large-scale data processing engine that has been shown to be up to 100 times faster than Hadoop MapReduce when data fits in memory, and up to 10 times faster when it does not. It can read data from HDFS and execute on the Hadoop YARN resource manager; as a result, it may operate on a wide range of platforms. Table 1 sums up the comparison of the big data platforms, and a minimal PySpark example follows it. Each platform has its advantages over the other; therefore, selecting the best platform depends on the big data characteristics and requirements [18][13][61][25].

| Property | Apache Spark | Hadoop MapReduce |
|---|---|---|
| Performance | Lightning-fast cluster computing (up to 100 times faster) | Slower than Spark |
| Read/write cycle | Low | High |
| Usability | Easy | Requires code for every operation |
| Real-time analysis | Supported | Not supported |
| Latency | Low | High |
| Fault tolerance | Supported | Supported |
| Security | Low | High |
| Cost | High (requires a lot of RAM) | Low |
| Development language | Scala | Java |
| License | Free | Free |
| SQL support | Spark SQL | Hive |
| Scalability | High | High |
| Machine learning | MLlib | Apache Mahout |
| Data caching | Supported | Not supported |
| Hardware requirements | High-level hardware | Commodity hardware |

Table 1: Big data clustering platforms
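As a concrete taste of the Spark column of Table 1, the sketch below clusters a small DataFrame with MLlib's K-Means. It assumes PySpark is installed and a local session is acceptable; the in-memory data and column names are illustrative, and a real job would read its input from HDFS or another distributed store.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

# Illustrative in-memory data; a real job would do something like
# spark.read.parquet("hdfs:///data/points.parquet") instead.
df = spark.createDataFrame(
    [(0.1, 0.2), (0.3, 0.1), (9.8, 9.9), (10.1, 9.7)], ["x", "y"])

# MLlib estimators expect a single vector column of features.
assembled = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=1).fit(assembled)   # iterative fitting over cached data
model.transform(assembled).select("x", "y", "prediction").show()

spark.stop()
```

Because the iterative refitting happens against cached in-memory data, this is the workload pattern where Spark's advantage over disk-bound MapReduce in Table 1 is most visible.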
6 Clustering issues and challenges

The amount of information on the web is increasing at breakneck speed. Broadly, data can be categorized into three main categories. Structured data consists entirely of basic data types, such as integers, arrays of integers or characters, and characters, and is used in the SQL model. Unstructured data consists of unstructured data types such as free-form strings. Semi-structured data combines the two previous data types and is generally represented as XML. Most of the data created is unstructured, and standard database management solutions cannot handle it. The 3Vs defining big data are:

1. Volume: the amount of data that must be processed continuously increases. This demonstrates how the increased use of contemporary technology (smartphones, social networks, networked equipment, and so forth) causes us to generate more and more data in our personal and professional relationships; corporations are coping with a data explosion. This volume keeps growing at a rapid pace: the quantity of data stored worldwide is expected to increase every four years, and more data has been amassed since 2010 than in all of prior history.

2. Velocity: the rate at which data is produced, gathered, and exchanged is referred to as big data velocity. These data are being produced and developed at breakneck speed. As a result, real-time data gathering, analysis, and utilization should become increasingly common; it may even be possible to forgo storing data and instead study the stream directly to obtain the desired summaries.

3. Variety: the final "V" indicates that the data is not always structured and may be diverse. Indeed, it can include information from websites, posts, messages, social media exchanges (Facebook, YouTube, WhatsApp), photos, videos, audio, logs, spatial data, biometrics, and other sources. They come from various places: the web, text mining, image mining [2], and so on. To develop actionable findings, we must mix different sources.

The complexity of big data explains why typical data warehousing infrastructure is difficult to use. Heterogeneous data (climate, transportation, geography, and automobile traffic) must be linked to obtain relevant data and enhance the exploitation of this large and scattered quantity of data across different sectors; this is, indeed, big data's ultimate challenge. According to the HACE theory (Heterogeneous, Autonomous, Complexity, Evolving), big data's most significant characteristics are [56]:

1. Heterogeneous data represent the variety of data sources, including Twitter, Facebook, Instagram, and social messaging applications, in a complicated and heterogeneous way, necessitating various methodologies and solutions.

2. One of big data's most distinguishing features is that it is self-contained and depends on self-contained sources. These sources comprise dispersed and decentralized controls, allowing each data source to function independently of any centralized control. Each web server on the World Wide Web (WWW) can create information and function successfully without the assistance of other servers. On the other side, the intricacy of big data shows it is incredibly fragile and would swiftly fail if it relied on any centralized control unit.
Another advantage of autonomous servers is that they allow some big data applications, such as Google and social networks such as Instagram, to give clients instant replies and services.

3. Data is acquired in several ways, including multi-table, multi-source, multi-view, sequential, and distributed treatment or massively parallel processing (MapReduce), which adds to big data's complexity. As data complexity increases with increasing volume, traditional treatment methods, such as relational database management tools, are no longer sufficient to keep up with the requirements of capturing, storing, backing up, and further analyzing the data.

4. The evolution of complex data, which is constantly changing, is also an essential element. Big data is evolving at a breakneck pace. When a client leaves a comment on a page of a social site, over time these comments must be retrievable for the algorithm to function with accurate data.

To handle the expanding demands for data, capacity has to be enhanced through effective practices and methodologies. To leverage the data's value without engaging extra workers, big data needs innovative solutions to boost capacity and effective treatment. Traditional data mining techniques have been unable to address critical data processing needs due to the exponential expansion of data. So, to use this massive data, effective processing and practical computing techniques are required for this complex, massive, heterogeneous, and dynamic data.

7 Recent advancements

There have been many research advancements in clustering techniques, driven by the need to overcome the issues of clustering over big data. The most recent solutions are reviewed below.

Mehdi Assefi, et al. (2017) [8]. In this contribution, they looked at the open-source framework Apache Spark MLlib 2.0, a scalable and platform-agnostic machine learning library, from a computational standpoint. To analyze the platform's qualitative and quantitative qualities, they conducted various real-world machine learning experiments [4]. They also discuss recent trends in machine learning research for big data and offer suggestions for future studies.

Gunasekaran Manogaran, et al. (2017) [48] used a Gaussian Mixture (GM) clustering and Bayesian hidden Markov model (HMM) technique to model DNA copy number variation across the genome. The suggested Bayesian HMM with GM clustering technique is compared to other approaches such as the pruned exact linear time method, the binary classification method, and the segment neighborhood method [15]. Experimental data support the efficiency of the suggested change detection approach.

Gourav Bathla et al. (2018) [12]. In this paper, the K-prototype algorithm is implemented using MapReduce. Experiments have shown that using K-prototype with MapReduce improves performance on several nodes compared with a single node. For the comparison, CPU running time and performance are employed as assessment measures. This work proposes an intelligent splitter that divides extensive mixed data into numerical and categorical data.
Compared to existing methods, the proposed approach performs better when dealing with enormous amounts of data.

Ahmed Z. Skaik (2018) [65] presented a new approach that overcomes the shortcomings of both algorithms, improves big data clustering, and avoids being trapped in a locally optimal solution by leveraging a robust optimization algorithm, Particle Swarm Optimization (PSO), while reducing time and resource consumption by leveraging the robust distributed framework Apache Spark. Research findings show that the algorithm can significantly lower clustering costs and create superior clustering outputs more accurately and fruitfully than solo K-Means, IWC, and PSO methods.

Behrooz Hosseini, et al. (2018) [32] proposed a solution built and tested using the Apache Spark framework with a range of datasets. Each phase of the proposed technique is independent for all points, and there are no serial bottlenecks in the clustering operation. Data points become members of the cluster with the closest center, filtering out outliers and improving the robustness of the recommended technique. The proposed technique has been tested and compared to other recently published studies; its cluster validity indices reveal its advantage in accuracy and noise resilience compared to previous research. Compared to other techniques, the suggested method outperforms them in terms of scalability, performance, and computing time.

Mo Hai, et al. (2018) [28]. The performance of the main big data processing platforms, Hadoop and Spark, is examined using parallel clustering: parallel K-means with and without fuzziness. Experiments are conducted on a wide range of text and numerical collections and clusters of varying sizes. The findings reveal that: (1) with 6 GB of RAM in each node, they achieve a 60% performance gain over Hadoop and a 32% performance improvement over Spark for the same data set; (2) 6 GB of RAM should be selected over 3 GB of RAM for high clustering performance.

Aditya Sarma, et al. (2019) [63] offer DBSCAN-D, a highly scalable distributed implementation of DBSCAN that takes advantage of commodity cluster hardware. Experimental results show that the suggested algorithms perform significantly better than the respective state-of-the-art techniques for various typical datasets. DBSCAN-D is an exact parallel DBSCAN solution that can quickly analyze enormous volumes of data (1 billion data points in 41 minutes on a 32-node cluster) while producing the same clustering as regular DBSCAN.

Omkaresh Kulkarni, et al. (2020) [44]. This study focuses on lowering the computational complexity of the clustering approach. Building on existing clustering algorithms, the MapReduce framework (MRF) is used to process huge data from distributed sources. One function of the MRF is for mapping, while the other is for reducing: the ideal centroids are computed in the mapper phase using the proposed approach and then improved in the reducer phase. In the experiments, the Skin data set and the geolocation data set from the UCI machine learning repository were employed, and accuracy and the DB Index were used for the evaluation. The proposed method achieved a maximum accuracy of 90.6012% with a minimal DB Index of 5.33.

Hoill Jung, et al. (2020)
[40] presented a method for establishing trustworthy user modeling by integrating standard static model information with information acquired from social networks, where different weights are applied based on the users' associations. PrefixSpan is employed in a life-care forecasting model that uses social relations to supplement a weak area in the candidate pattern, which frequently takes a long time to scan. They compared the common cluster technique with a social mining-based approach for performance measurement. As a result, the suggested technique in the mining-based healthcare platform outperformed the existing model-based cluster method.

K. Rajendra Prasad, et al. (2021) [56]. The suggested technique in this study solves the big clustering problem by deriving sampling-based crisp partitions. The crisp partitions reliably predict the cluster labels of data items. This study uses large real-world and synthetic datasets to demonstrate the suggested work's performance efficiency.

Mustafa Jahangoshai Rezaee et al. (2021) [58] suggested the game-based k-means (GBK-means) algorithm as a novel approach. To demonstrate the superiority and efficiency of GBK-means over conventional clustering algorithms, namely k-means and fuzzy k-means, they use the following synthetic and real-world data sets: (1) a series of two-dimensional synthetic data sets; and (2) ten benchmark data sets that are widely used in different clustering studies. GBK-means can cluster data more accurately than classical algorithms based on eight evaluation metrics, including the F-measure, the Dunn index (DI), the Rand index (RI), the Jaccard index (JI), normalized mutual information (NMI), normalized variation of information (NVI), a measure of concordance, and error rate (ER).

Chen Zhen (2021) [77]. This research proposes an English teaching competence evaluation system based on big data fuzzy K-means clustering and information fusion. The results demonstrate that adopting this technique to evaluate English teaching ability enhances learning-ability accuracy and resource efficiency.

C. Wu, et al. (2021) [69] set up random sampling before parallelizing the distance computing technique, ensuring data item independence and parallel cluster analysis. After the MapReduce parallel processing, they use numerous nodes to compute distances, which enhances the algorithm's speed. Finally, the grouping of data objects is parallelized. The findings indicate that the system can provide timely, consistent, and convergent services.

Fouad H. et al. (2022) [9] proposed a solution to run the clustering technique on the single-instruction machine processor found in mobile devices. Big data clustering across the network may be efficiently handled using k-means clustering in a distributed method based on mobile machine learning. The results revealed that using the neural engine processor of a mobile smartphone or tablet can increase the speed of the clustering algorithm by up to two times compared to traditional laptop/desktop processors. Furthermore, compared to parallel and distributed k-means, the number of iterations necessary to create k clusters was reduced by up to two-fold.

Lin Ma et al.
(2022) [47] start by using a fast mean-shift technique with configurable radius and active subsets to find the centers, which significantly cuts down on processing time. The second stage employs the shape of the probability density function of the distribution of distances between a selected point and the other points in the data set; this is done to determine the critical and cluster radiuses of the fast mean-shift method. The new algorithm has four distinct benefits: it reduces processing complexity, overcomes dimensionality difficulties, can handle a wide range of spherical data sets, and is noise- and outlier-tolerant. They found that the proposed technique works effectively after testing it on several synthetic and real-world data sets.

The results of this literature review are summarized in Tables 2 and 3.

| References | Clustering technique | Goal | Results |
|---|---|---|---|
| Mehdi Assefi, et al. (2017) [8] | MLlib 2.0 on Apache Spark | Evaluating the platform's qualitative and quantitative attributes | Addressing the recent solutions in big data machine learning |
| Gunasekaran Manogaran, et al. (2017) [48] | Bayesian hidden Markov model (HMM) with Gaussian Mixture (GM) clustering | Modeling the DNA copy number change across the genome | Experimental results demonstrate the effectiveness of the proposed change detection algorithm |
| Gourav Bathla, et al. (2018) [12] | K-Prototype with MapReduce | Splits mixed big data into numerical and categorical data | Proving that the proposed algorithm works better for large-scale data |
| Ahmed Z. Skaik (2018) [65] | IWC and PSO | Introduce a new approach that conquers the drawbacks of both algorithms | Reduces the clustering cost and produces superior clustering outputs |
| Behrooz Hosseini, et al. (2018) [32] | Apache Spark framework and Locality-Sensitive Hashing | Used in gene expression clustering as a sample application | Shows the proposed method's superior scalability, high performance, and low computation cost |
| Mo Hai, et al. (2018) [28] | Parallel clustering technique (K-means and fuzzy K-means) | To obtain a high clustering performance | 60% performance improvement compared with Hadoop and about 32% compared with Spark |
| Aditya Sarma, et al. (2019) [63] | DBSCAN | To exploit a commodity cluster infrastructure | Massive data processed efficiently (1 billion data points in 41 minutes on a 32-node cluster) |
| Omkaresh Kulkarni, et al. (2020) [44] | Fractional sparse fuzzy C-means and PSO | Relieving the computational complexity of the clustering method | Acquired a maximum accuracy of 90.6012% and a minimum DB Index of 5.33 |
| Hoill Jung et al. (2020) [40] | Conventional static model | To create reliable user modeling and apply different weights depending on user relations | Mining-based healthcare platform had better performance than the conventional model-based cluster method [6] |

Table 2: Big data clustering recent advancements
| References | Clustering technique | Goal | Results |
|---|---|---|---|
| K. Rajendra Prasad et al. (2021) [56] | BigVAT with the derivation of sampling-based crisp partitions | The crisp partitions accurately predict the cluster labels of data objects | Addresses the clustering problem of BigVAT |
| Mustafa Jahangoshai Rezaee et al. (2021) [58] | Game-based K-means (GBK-means) | Improving K-means performance | GBK-means can cluster data more accurately than classical algorithms based on eight evaluation metrics |
| Chen Zhen (2021) [77] | Fuzzy K-means clustering technique | An English teaching ability evaluation | Improves the accuracy and the efficiency of learning techniques |
| C. Wu et al. (2021) [69] | K-means clustering based on distributed computing | Improving K-means clustering performance | Improves the performance and stability of the parallelized K-means clustering |
| Fouad H. et al. (2022) [9] | Neural-engine-based k-means clustering | Performance improvement | Improves the performance of clustering by up to 200% |
| Lin Ma et al. (2022) [47] | Radar scanning strategy clustering | Reducing computational power and overcoming the issues of the high dimensionality of the radar scanning technique | Improves the accuracy, efficiency, and performance of the radar scanning technique |

Table 3: Big data clustering recent advancements (continued)

8 Conclusion

Big data has become an essential part of research and engineering in today's world. Organizations are increasingly relying on these massive data sets to gain new insights and explore new possibilities. However, traditional machine learning approaches are often insufficient for dealing with the massive volume, velocity, variety, veracity, and value of big data, commonly referred to as the 5 Vs of big data. Consequently, machine learning algorithms need to be redesigned to meet these challenges. The integration of machine learning and big data has paved the way for a promising future in the data-driven industry. Big data processing and machine learning algorithms work hand in hand to reveal new information, increase productivity, and generate new and unexpected insights. Clustering techniques such as K-means, fuzzy C-means, and K-modes are widely used in numerous frameworks for big data clustering. Python has emerged as a valuable tool for executing big data clustering techniques. Researchers rely on Python for its simplicity, flexibility, and scalability, making it an excellent option for big data processing. However, alternative frameworks for big data techniques are also necessary, given the ever-increasing complexity of big data sets. Despite the growing importance of big data clustering, there is a limited number of essential publications on the topic. One of the main reasons for this is the difficulty of obtaining research data suitable for implementing big data techniques. Nonetheless, previous studies have made significant contributions to evaluating the performance, efficiency, and accuracy of big data clustering and to reducing noise and processing time. In conclusion, big data clustering is a critical aspect of the data-driven industry. As the amount of data continues to grow, it is essential to continually refine machine learning algorithms and explore alternative frameworks to meet the challenges posed by big data. The use of Python and other tools will be crucial in developing efficient and effective big data clustering techniques.

References

[1] Omran Alshamma, Fouad H. Awad, Laith Alzubaidi, Mohammed A. Fadhel, Zinah Mohsin Arkah, and Laith Farhan. Employment of multi-classifier and multi-domain features for PCG recognition. In 2019 12th International Conference on Developments in eSystems Engineering (DeSE), pages 321–325. IEEE, 2019.
doi:10.1109/DeSE.2019.00066.

[2] Laith Alzubaidi, Muthana Al-Amidie, Ahmed Al-Asadi, Amjad J. Humaidi, Omran Al-Shamma, Mohammed A. Fadhel, Jinglan Zhang, J. Santamaría, and Ye Duan. Novel transfer learning approach for medical imaging with limited labeled data. Cancers, 13(7).

[3] Laith Alzubaidi, Mohammed A. Fadhel, Omran Al-Shamma, Jinglan Zhang, and Ye Duan. Deep learning models for classification of red blood cells in microscopy images to aid in sickle cell anemia diagnosis. Electronics, 9(3):427, 2020. doi:10.3390/electronics9030427.

[4] Laith Alzubaidi, Mohammed A. Fadhel, Omran Al-Shamma, Jinglan Zhang, J. Santamaría, and Ye Duan. Robust application of new deep learning tools: an experimental study in medical imaging. Multimedia Tools and Applications, pages 1–29, 2021. doi:10.1007/s11042-021-10942-9.

[5] Laith Alzubaidi, Reem Ibrahim Hasan, Fouad H. Awad, Mohammed A. Fadhel, Omran Alshamma, and Jinglan Zhang. Multi-class breast cancer classification by a novel two-branch deep convolutional neural network architecture. In 2019 12th International Conference on Developments in eSystems Engineering (DeSE), pages 268–273. IEEE, 2019. doi:10.1109/DeSE.2019.00057.

[6] Laith Alzubaidi, Jinglan Zhang, Amjad J. Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, J. Santamaría, Mohammed A. Fadhel, Muthana Al-Amidie, and Laith Farhan. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data, 8(1):1–74, 2021. doi:10.1186/s40537-021-00444-8.

[7] Hiba Asri, Hajar Mousannif, Hassan Al Moatassime, and Thomas Noel. Big data in healthcare: challenges and opportunities. pages 1–7, 2015. doi:10.1109/CloudTech.2015.7337020.

[8] Mehdi Assefi, Ehsun Behravesh, Guangchi Liu, and Ahmad P. Tafti. Big data machine learning using Apache Spark MLlib. In 2017 IEEE International Conference on Big Data (Big Data), pages 3492–3498. IEEE, 2017. doi:10.1109/BigData.2017.8258338.

[9] Fouad H. Awad and Murtadha M. Hamad. Improved k-means clustering algorithm for big data based on distributed smartphone neural engine processor. Electronics, 11(6):883, 2022. doi:10.3390/electronics11060883.

[10] Fouad H. Awad, Murtadha M. Hamad, and Laith Alzubaidi. Robust classification and detection of big medical data using advanced parallel k-means clustering, YOLOv4, and logistic regression. Life, 13(3):691, 2023.

[11] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: from population to sample-based analysis. The Annals of Statistics, 45(1):77–120. doi:10.1214/16-AOS1435.

[12] Gourav Bathla, Himanshu Aggarwal, and Rinkle Rani. A novel approach for clustering big data based on MapReduce. International Journal of Electrical & Computer Engineering (2088-8708), 8(3), 2018. doi:10.11591/ijece.v8i3.pp1711-1719.

[13] Yassine Benlachmi and Moulay Lahcen Hasnaoui. Big data and Spark: comparison with Hadoop. In 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), pages 811–817. IEEE, 2020. doi:10.1109/WorldS450073.2020.9210353.

[14] Panthadeep Bhattacharjee and Pinaki Mitra. A survey of density based clustering algorithms. Frontiers of Computer Science, 15(1):1–27, 2021. doi:10.1007/s11704-019-9059-3.

[15] Salima Bourougaa-Tria, Farid Mokhati, HoussemEddine Tria, and Okba Bouziane.
Spubbin: smart public bin based on deep learning waste classification: an IoT system for smart environment in Algeria. Informatica, 46(8), 2022. doi:10.31449/inf.v46i7.4331.

[16] X. Cao, T. Su, P. Wang, G. Wang, Z. Lv, and X. Li. An optimized chameleon algorithm based on local features. In Proceedings of the 2018 10th International Conference on Machine Learning and Computing, pages 184–192. ACM. doi:10.1145/3195106.3195118.

[17] Matthias Carnein and Heike Trautmann. Optimizing data stream representation: an extensive survey on stream clustering algorithms. Business & Information Systems Engineering, 61(3):277–297, 2019. doi:10.1007/s12599-019-00576-5.

[18] Eduardo P. S. Castro, Thiago D. Maia, Marluce R. Pereira, Ahmed A. A. Esmin, and Denilson A. Pereira. Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark. The Knowledge Engineering Review, 33, 2018. doi:10.1017/S0269888918000127.

[19] D. Chu, L.-Z. Liao, M. K.-P. Ng, and X. Wang. Incremental linear discriminant analysis: a fast algorithm and comparisons. IEEE Transactions on Neural Networks and Learning Systems, 26(11):2716–2735. doi:10.1109/TPAMI.2005.244.

[20] Nguyen Duy Dat, Vo Ngoc Phu, Vo Thi Ngoc Tran, Vo Thi Ngoc Chau, and Tuan A. Nguyen. STING algorithm used English sentiment classification in a parallel environment. International Journal of Pattern Recognition and Artificial Intelligence, 31(07):1750021, 2017. doi:10.1142/S0218001417500215.

[21] Marcos Dias de Assuncao, Alexandre da Silva Veith, and Rajkumar Buyya. Distributed data stream processing and edge computing: a survey on resource elasticity and future directions. Journal of Network and Computer Applications, 103:1–17, 2018. doi:10.1016/j.jnca.2017.12.001.

[22] Z. Deng, Y. Hu, M. Zhu, X. Huang, and B. Du. A scalable and fast OPTICS for clustering trajectory big data. Cluster Computing, 18(2):549–562. doi:10.1007/s10586-014-0413-9.

[23] Kheyreddine Djouzi and Kadda Beghdad-Bey. A review of clustering algorithms for big data. In 2019 International Conference on Networking and Advanced Systems (ICNAS), pages 1–6. IEEE, 2019. doi:10.1109/ICNAS.2019.8807822.

[24] Jason Ernst, Gerard J. Nau, and Ziv Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21(suppl 1):i159–i168, 2005. doi:10.1093/bioinformatics/bti1022.

[25] Mithu Mary George and P. S. Rasmi. Performance comparison of Apache Hadoop and Apache Spark for COVID-19 data sets. In 2022 4th International Conference on Smart Systems and Inventive Technology (ICSSIT), pages 1659–1665. IEEE, 2022. doi:10.1109/icssit53264.2022.9716232.

[26] Attri Ghosal, Arunima Nandy, Amit Kumar Das, Saptarsi Goswami, and Mrityunjoy Panday. A short review on different clustering techniques and their applications. Emerging Technology in Modelling and Graphics, pages 69–83, 2020. doi:10.1007/978-981-13-7403-6_9.

[27] T. Gupta and S. P. Panda. A comparison of k-means clustering algorithm and CLARA clustering algorithm on Iris dataset. International Journal of Engineering & Technology, 7(4):4766–4768. doi:10.14419/ijet.v7i4.21472.

[28] Mo Hai, Yuejing Zhang, and Haifeng Li. A performance comparison of big data processing platform based on parallel clustering algorithms. Procedia Computer Science, 139:127–135, 2018. doi:10.1016/j.procs.2018.10.228.

[29] Murtadha M. Hamad. A comparative study of indexing techniques effect in big data system storage optimization.
In 2020 2nd Al-Noor International Conference for Science and Technology (NICST), pages 18–21. IEEE, 2020. doi:10.1109/NICST50904.2020.9280309.

[30] Richard J. Hathaway and James C. Bezdek. Extending fuzzy and probabilistic clustering to very large data sets. Computational Statistics and Data Analysis, 51(1):215–234, 2006. doi:10.1016/j.csda.2006.02.008.

[31] Timothy C. Havens, James C. Bezdek, and Marimuthu Palaniswami. Scalable single linkage hierarchical clustering for big data. In 2013 IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, pages 396–401. IEEE, 2013. doi:10.1109/issnip.2013.6529823.

[32] Behrooz Hosseini and Kourosh Kiani. A robust distributed big data clustering-based on adaptive density partitioning using Apache Spark, 2018. doi:10.3390/sym10080342.

[33] Hassan Ibrahim Hayatu, Abdullahi Mohammed, and Ahmad Barroon Isma'eel. Big data clustering techniques: recent advances and survey. Machine Learning and Data Mining for Emerging Trend in Cyber Dynamics, pages 57–79, 2021. doi:10.1007/978-3-030-66288-2_3.

[34] A. Idrissi, H. Rehioui, A. Laghrissi, and S. Retal. An improvement of DENCLUE algorithm for the data clustering. In 2015 5th International Conference on Information & Communication Technology and Accessibility (ICTA), pages 1–6. IEEE. doi:10.1109/ICTA.2015.7426936.

[35] Félix Iglesias and Wolfgang Kastner. Analysis of similarity measures in times series clustering for the discovery of building energy patterns. Energies, 6(2):579–597, 2013. doi:10.3390/en6020579.

[36] K. Indira, S. Karthiga, C. V. Nisha Angeline, and C. Santhiya. Parallel CLARANS algorithm for recommendation system in multi-cloud environment. In Computer Networks and Inventive Communication Technologies, pages 461–472. Springer, 2021. doi:10.1007/978-981-15-9647-6_36.

[37] Manaswini Jena, Debahuti Mishra, Smita Prava Mishra, Pradeep Kumar Mallick, and Sachin Kumar. Exploring the parametric impact on a deep learning model and proposal of a 2-branch CNN for diabetic retinopathy classification with case study in IoT-blockchain based smart healthcare system. Informatica, 46(2), 2022. doi:10.31449/inf.v46i2.3906.

[38] Yan Jin, Bowen Xiong, Kun He, Yangming Zhou, and Yi Zhou. On fast enumeration of maximal cliques in large graphs. Expert Systems with Applications, 187:115915, 2022. doi:10.1016/j.eswa.2021.115915.

[39] A. Jovic, K. Brkic, and N. Bogunovic. A review of feature selection methods with applications. In 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pages 1200–1205. IEEE. doi:10.1109/MIPRO.2015.7160458.

[40] Hoill Jung. Social mining-based clustering process for big-data integration. Journal of Ambient Intelligence and Humanized Computing. doi:10.1007/s12652-020-02042-7.

[41] Harihar Kalia, Satchidananda Dehuri, and Ashish Ghosh. A survey on fuzzy association rule mining. International Journal of Data Warehousing and Mining (IJDWM), 9(1):1–27, 2013. doi:10.4018/jdwm.2013010101.

[42] Taiwo Kolajo, Olawande Daramola, and Ayodele Adebiyi. Big data stream analysis: a systematic literature review. Journal of Big Data, 6(1):1–30, 2019. doi:10.1186/s40537-019-0210-7.

[43] X. Kong, C. Hu, and Z. Duan. Generalized principal component analysis.
In Principal Component Analysis Networks and Algorithms, pages 185–233. Springer. doi:10.1007/978-981-10-2915-8_7.
[44] Omkaresh Kulkarni. Mapreduce framework based big data clustering using fractional integrated sparse fuzzy c means algorithm. IET Image Processing, 14(12):2719–2727. doi:10.1049/iet-ipr.2019.0899.
[45] K.M. Kumar and A.R.M. Reddy. A fast dbscan clustering algorithm by accelerating neighbor searching using groups method. Pattern Recognition, 58:39–48. doi:10.1016/j.patcog.2016.03.008.
[46] In Lee. Big data: Dimensions, evolution, impacts, and challenges. Business Horizons, 60(3):293–303, 2017. doi:10.1016/j.bushor.2017.01.004.
[47] Lin Ma, Yi Zhang, Víctor Leiva, Shuangzhe Liu, and Tiefeng Ma. A new clustering algorithm based on a radar scanning strategy with applications to machine learning data. Expert Systems with Applications, 191:116143, 2022. doi:10.1016/j.eswa.2021.116143.
[48] Gunasekaran Manogaran, V Vijayakumar, R Varatharajan, Priyan Malarvizhi Kumar, Revathi Sundarasekar, and Ching-Hsien Hsu. Machine learning based big data processing framework for cancer diagnosis using hidden markov model and gm clustering. Wireless Personal Communications, 102(3):2099–2116, 2018. doi:10.1007/s11277-017-5044-z.
[49] L. Matioli, S. Santos, M. Kleina, and E. Leite. A new algorithm for clustering based on kernel density estimation. Journal of Applied Statistics, 45(2):347–366. doi:10.1080/02664763.2016.1277191.
[50] Francois G Meyer and Jatuporn Chinrungrueng. Spatiotemporal clustering of fmri time series in the spectral domain. Medical Image Analysis, 9(1):51–68, 2005. doi:10.1016/j.media.2004.07.002.
[51] N. Mulani, A. Pawar, P. Mulay, and A. Dani. Variant of cobweb clustering for privacy preservation in cloud db querying. Procedia Computer Science, 50:363–368. doi:10.1016/j.procs.2015.04.034.
[52] Ruba Obiedat and Sara Amjad Toubasi. A combined approach for predicting employees' productivity based on ensemble machine learning methods. Informatica, 46(5), 2022. doi:10.31449/inf.v46i5.3839.
[53] Madhav Poudel and Michael Gowanlock. Cuda-dclust+: Revisiting early gpu-accelerated dbscan clustering designs. In 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 354–363. IEEE, 2021. doi:10.1109/HiPC53243.2021.00049.
[54] Vishnu Priya and A Vadivel. User behaviour pattern mining from weblog. International Journal of Data Warehousing and Mining (IJDWM), 8(2):1–22, 2012. doi:10.4018/jdwm.2012040101.
[55] T Ragunthar, P Ashok, N Gopinath, and M Subashini. A strong reinforcement parallel implementation of k-means algorithm using message passing interface. Materials Today: Proceedings, 46:3799–3802, 2021. doi:10.1016/j.matpr.2021.02.032.
[56] K Rajendra Prasad, Moulana Mohammed, L V Narasimha Prasad, and Dinesh Kumar Anguraj. An efficient sampling-based visualization technique for big data clustering with crisp partitions. Distributed and Parallel Databases, 39(3):813–832, 2021. doi:10.1007/s10619-021-07324-3.
[57] Sreekanth Rallapalli, RRb Gondkar, and Uma Pavan Kumar Ketavarapu. Impact of processing and analyzing healthcare big data on cloud computing environment by implementing hadoop cluster. Procedia Computer Science, 85:16–22, 2016. doi:10.1016/j.procs.2016.05.171.
[58] Mustafa Jahangoshai Rezaee, Milad Eshkevari, Morteza Saberi, and Omar Hussain.
Gbk-means clustering algorithm: An improvement to the k-means algorithm based on the bargaining game. Knowledge-Based Systems, 213:106672, 2021. doi:10.1016/j.knosys.2020.106672.
[59] Maurice Roux. A comparative study of divisive and agglomerative hierarchical clustering algorithms. Journal of Classification, 35(2):345–366, 2018. doi:10.48550/arXiv.1506.08977.
[60] Mozamel M Saeed, Zaher Al Aghbari, and Mohammed Alsharidah. Big data clustering techniques based on spark: a literature review. PeerJ Computer Science, 6:e321, 2020. doi:10.7717/peerj-cs.321.
[61] Yassir Samadi, Mostapha Zbakh, and Claude Tadonki. Performance comparison between hadoop and spark frameworks using hibench benchmarks. Concurrency and Computation: Practice and Experience, 30(12):e4367, 2018. doi:10.1002/cpe.4367.
[62] Tanvir Habib Sardar and Zahid Ansari. An analysis of mapreduce efficiency in document clustering using parallel k-means algorithm. Future Computing and Informatics Journal, 3(2):200–209, 2018. doi:10.1016/j.fcij.2018.03.003.
[63] Aditya Sarma, Poonam Goyal, Sonal Kumari, Anand Wani, Jagat Sesh Challa, Saiyedul Islam, and Navneet Goyal. µdbscan: an exact scalable dbscan algorithm for big data exploiting spatial locality. In 2019 IEEE International Conference on Cluster Computing (CLUSTER), pages 1–11. IEEE, 2019. doi:10.1109/cluster.2019.8891020.
[64] Erich Schubert and Peter J Rousseeuw. Faster k-medoids clustering: improving the pam, clara, and clarans algorithms. In International Conference on Similarity Search and Applications, pages 171–187. Springer, 2019. doi:10.1007/978-3-030-32047-8_16.
[65] Ahmed Z. Skaik. Clustering big data based on iwc-pso and mapreduce. Master's thesis in computer engineering. doi:10.11591/ijece.v8i3.pp1711-1719.
[66] Jing Wang, Roobaea Alroobaea, Abdullah M Baqasah, Anas Althobaiti, and Lavish Kansal. Study on library management system based on data mining and clustering algorithm. Informatica, 46(9), 2023. doi:10.31449/inf.v46i9.3858.
[67] Y. Wang, J. Wang, H. Liao, and H. Chen. An efficient semisupervised representatives feature selection algorithm based on information theory. Pattern Recognition, 61:511–523. doi:10.1016/j.patcog.2016.08.011.
[68] Philicity K Williams, Caio V Soares, and Juan E Gilbert. A clustering rule based approach for classification problems. International Journal of Data Warehousing and Mining (IJDWM), 8(1):1–23, 2012. doi:10.4018/jdwm.2012010101.
[69] Chunqiong Wu, Bingwen Yan, Rongrui Yu, Baoqin Yu, Xiukao Zhou, Yanliang Yu, and Na Chen. K-means clustering algorithm and its simulation based on distributed computing platform. Complexity, 2021, 2021. doi:10.1155/2021/9446653.
[70] T. Wu, S.A.N. Sarmadi, V. Venkatasubramanian, A. Pothen, and A. Kalyanaraman. Fast svd computations for synchrophasor algorithms. IEEE Transactions on Power Systems, 31(2):1651–1652. doi:10.1109/TPWRS.2015.2412679.
[71] Fanyi Xie. Semiconductor scheduling problem based on k-mode clustering algorithm. In International Conference on Frontier Computing, pages 867–873, 2020.
[72] T. Xiong, S. Wang, A. Mayers, and E. Monga. Dhcc: Divisive hierarchical clustering of categorical data. Data Mining and Knowledge Discovery, 24(1):103–135. doi:10.1007/s10618-011-0221-2.
[73] Q. Zhang, C. Zhu, L.T. Yang, Z. Chen, L. Zhao, and P. Li.
An incremental cfs algorithm for clustering large data in industrial internet of things. IEEE Transactions on Industrial Informatics, 13(3):1193–1201. doi:10.1109/TII.2017.2684807.
[74] Wanli Zhang and Yanming Di. Model-based clustering with measurement or estimation errors. Genes, 11(2):185, 2020. doi:10.3390/genes11020185.
[75] Yonglai Zhang and Yaojian Zhou. Review of clustering algorithms. Journal of Computer Applications, 39(7):1869, 2019. doi:10.11772/j.issn.1001-9081.2019010174.
[76] Ying Zhao and George Karypis. Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learning, 55(3):311–331, 2004. doi:10.1023/B:MACH.0000027785.44527.d6.
[77] Chen Zhen. Using big data fuzzy k-means clustering and information fusion algorithm in english teaching ability evaluation. Complexity, 2021, 2021. doi:10.1155/2021/5554444.