https://doi.org/10.31449/inf.v43i4.2577 Informatica 43 (2019) 425–459 425
Predictive Analytics on Big Data - an Overview
Gayathri Nagarajan and Dhinesh Babu L.D
School of Information Technology and Engineering, Vellore Institute Of Technology, Vellore, India
gayunagarajan1083@gmail.com
E-mail: lddhineshbabu@gmail.com
Overview paper
Keywords: predictive analytics, big data, machine learning
Received: November 5, 2018
Big data generated in different domains and industries are voluminous and the velocity at which they are
generated is pretty high. While research works carried out continuously to handle big data is at one end,
processing it to develop the business insights is a hot topic to work on the other end. Though there are
lot of technologies and tools developed to handle big data and extract insights from them, there are lot of
challenges and open issues that are yet to be addressed. This paper presents an overview on predictive
analytics with big data. The overview throws light on the core predictive models, challenges of these
models on big data, research gaps in several domain sectors and using different techniques. This paper
categorizes the major technical challenges of predictive analytics on big data under six headings. The
paper concludes with the identiﬁcation of open issues and future directions to work on for researchers in
this ﬁeld.
Povzetek: Pregledni ˇ clanek opisuje prediktivno analitiko na velikih podatkih.
1 Introduction
Research focus on predictive analytics for big data has
gained signiﬁcance because of its scope in various do-
mains and industries. It has stepped into every ﬁeld includ-
ing health care, telecommunication, education, marketing,
business, etc. Predictive analytics is a common diction that
often means predicting the outcome of a particular event.
The main idea behind prediction is to take certain input
data, apply statistical techniques and predict the outcome
of an event. The terminology ‘predictive analytics’ is syn-
onymous with other terminologies like ‘machine learning’,
‘data mining’, ‘business intelligence’ and recently the other
terminology which is in common use today ‘data science’.
Though they seem to be synonymous there is a narrow line
that distinguishes their context of use.
The technique of business understanding, data under-
standing, data integration, data preparation, building a
model to extract hidden insights, evaluating the model and
ﬁnally deploying the model is called ‘Data mining’. The
model may be predictive or may not be. [1]. In some
cases it may be descriptive whereas ‘predictive analytics’ in
most cases mean to predict the value of certain output vari-
able from input variables. ‘Machine learning’ is basically
a technique whereas ‘predictive analytics’ is an application
of machine learning. ‘Machine learning’ is used to discover
hidden patterns in data by using some of their techniques
like classiﬁcation, association or clustering in training the
machine. ‘Machine learning’ is one disciplinary of ‘data
mining’ which is multidisciplinary that includes other dis-
ciplines like statistics, BI tools, etc. ‘Data science’ can be
considered as an application of statistical methods to busi-
ness problems. Predictive analytics is more narrowly fo-
cused than data science. Data science uses data program-
ming whereas predictive analytics uses modeling. Predic-
tive analytics in most of the cases is probabilistic in nature
whereas data science involves exploration of data. Data
scientists require both domain knowledge and the knowl-
edge in technology. Business intelligence provides stan-
dard business reports, ad hoc reports on past data based
on OLAP and looks at the static part of the data. Predic-
tive analytics requires statistical analysis, forecasting, and
causal analysis, text mining and related techniques to meet
the need of forward looking business [1].
In predictive analytics, data is collected from different
input sources. A model is developed based on statistics.
The model is used to predict the outcome after proper vali-
dation. With the advent of big data, predictive analytics on
big data has become a signiﬁcant area of interest. Though
there are lot of tools and techniques available to handle pre-
dictive analytics on big data, there are yet challenges open
for the researchers to work upon. Our paper aims to present
an overview on predictive analytics in big data to aid the re-
searchers understand the contemporary works done in this
area thereby providing them research directions for future
work. We focused on including the research works car-
ried in different industries and using different techniques so
that the researchers can focus more on their speciﬁc area of
interest after a complete understanding of the works done
in different ﬁelds and using different techniques. The mo-
426 Informatica 43 (2019) 425–459 G. Nagarajan et al.
tivation behind this work is the fact that many papers in
this ﬁeld are more focused on a particular domain or tech-
nique but there is a lack of papers that presents a broader
overview of predictive analytics in big data to help the bud-
ding researchers identify research problems. Hence we fo-
cussed on a comprehensive overview on predictive analyt-
ics.
This paper is organized into 7 sections. Core predictive
models with their strengths, weaknesses along with few so-
lutions are discussed in Section 2, the challenges of core
predictive models on big data is discussed in Section 3,
scope of predictive analytics on big data generated across
different domain sectors along with few research gaps is
discussed in Section 4 and the comprehensive challenges
for predictive analytics on big data and the techniques used
to overcome them is discussed in Section 5, the future di-
rections for research are summarized in Section 6 and Sec-
tion 7 winds up with conclusion.
2 Core predictive models
The major processes of predictive analytics include de-
scriptive analysis on data that constitutes around 50% of
the work, data preparation(like detecting outliers) that con-
stitutes around 40% of the work, data modeling that con-
stitutes around 4% of the work and evaluating the model
that constitutes around 6% of the work [98]. Only a frac-
tion of raw data is considered for building the model which
is assessed and tested [7]. The phases involved in building
predictive models is shown in ﬁgure 1. Initially predictive
analytics was carried out using many mathematical statis-
tical approaches. Later data mining, machine learning be-
gan its era in predictive analytics since they proved to be
effective. This section discusses few core predictive mod-
els to make the reader understand the concept of predic-
tive analytics. Different models are used for different types
of predictive tasks such as estimating an unknown value,
classiﬁcation(supervised learning to predict the class label
of the instance), clustering(unsupervised learning to group
similar patterns together) etc. The section is branched into
three subsections - the predictive models based on math-
ematical(statistical) approaches, the models based on data
mining approaches and the models based on machine learn-
ing approaches respectively. Yet, there is a very narrow
line of separation among the subsections and they overlap
in certain predictive tasks. Figure 2 shows the classiﬁcation
of core predictive models.
2.1 Predictive models based on mathematics
Mathematical techniques especially statistics is used for
predictive tasks. Despite, data mining algorithms and ma-
chine learning algorithms also use math as their base for
predictive tasks. Major core predictive models based on
mathematics include Extrapolation, Regression, Bayesian
statistics that are described in detail.
2.1.1 Extrapolation
Extrapolation is a method of extending the value of a vari-
able beyond the original observation range. For example,
the details of the road condition are known to a driver un-
til a certain point and he is expected to predict the road
condition beyond that point. A tangent line is drawn by re-
lating the input variable to the output variable. The line is
extended further to predict the output for different values
of input. The line determines whether the extrapolation is
linear, quadratic or cubic etc.
Strength and weakness :
Extrapolation suits well for such tasks where the target
variable is in close relationship with the predictor variables.
The results of extrapolation are also accurate in certain ex-
periments where the relationships among the variables are
simple [101].
The major problem with extrapolation is the interpreta-
tion of results. There are many studies where the study pop-
ulation differs widely from the target population.[100] is an
example of such problem where the extrapolation of the ex-
perimental results on sheep cannot be justiﬁed for other tar-
get population. In such studies, the claims of the study re-
sults cannot be applied or justiﬁed to the target population
[99]. Few solutions proposed to solve the interpretation
problem of extrapolation is simple induction, randomized
trails and expertise, Mechanistic reason etc. Population
modeling is also proposed to solve the extrapolation prob-
lem [102]. But these solutions can help only to a certain
extent. Secondly, it is hard to model the past with extrapo-
lation. Sometimes several extrapolation methods are com-
bined to model. Moreover, extrapolation cannot be used to
model the tasks with non linear patterns [103].
2.1.2 Regression
Regression models are used for supervised learning prob-
lems in predictive analytics. It works by establishing a
mathematical relation of the input variables with the output
variable. There are different types of regression models like
linear regression model, multi variate regression model, lo-
gistic regression model, time series regression model, sur-
vival analysis, etc. depending on the nature of the relation-
ship discovered among the variables. Though the term is
synonymous with extrapolation, there is a difference. Re-
gression explains the variations in the dependent attribute
with respect to the variations in the predictor attributes.
Also, regression doesn‘t use the value of the input variables
outside the range to establish the relationship with the de-
pendent variable as in the case of extrapolation. There are
different variants of regression depending on the nature of
the variables as shown in table 1.
Strength and weakness:
Linear regression can suit well on tasks where the vari-
ables exhibit linear relationship [104]. For predictive tasks
where associated probability is also important apart from
predicting the value of a variable such as in [110], logis-
tic regression is preferred. For predictive tasks where a
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 427
Figure 1: Phases involved in building a core predictive model
Figure 2: Classiﬁcation of core predictive models
non parametric and non linear methodology has to be used,
MARS regression can be considered as it does not assume
any functional relationship between the dependent and in-
dependent variables [114],[117]. The biggest strength of
MARS is that it is simple to understand and easy to in-
terpret. Despite, it does not require any data preparation
[116]. Moreover it is suitable for problems with higher
input dimensions due to its ’divide and conquer’ strategy
of working principle [117]. To model complex processes,
in which no theoretical models exist, LOESS regression is
preferred as it does not require a function to ﬁt a model to
all data in the sample [121]. The major strength of LOESS
regression is its ﬂexibility. The ﬂexibility is controlled by a
smoothing parameter [119]. It is usually set between 0.25
and 0.5 [120]. For predictive tasks, where the number of
predictors is more, regularization methods such as ridge re-
gression is preferred as it avoids overﬁtting [122]. It also
handles multicollinearity effectively [126].
The major weakness of linear regression is that it consid-
ers the mean of dependent variable and hence is sensitive
to outliers. It is also more suited only to applications in
which the predictor variables are independent [104]. For
example, [105] states the reasons for moving to machine
learning predictive models from simple regression models
for predictive tasks in medicine. Though the regression
models are simple and robust, they are limited to a small
number of predictors that operate within a range. Outlier
detection algorithms are proposed for linear time series re-
gression models [106],[107]. Another problem with lin-
ear regression models is that it does not suit for predictive
tasks when the predictor variables are collinear. [109] is an
example that shows the adverse effect in interpretation of
results when regression is performed without considering
the multicollinearity problem. Techniques such as princi-
pal component analysis or stepwise regression can help to
remove highly correlated predictors from the model [108].
The major weakness with logistic regression is that it is af-
fected by omitted variables. Solutions such as replacing
the latent continuous variable with an observed continuous
one is proposed. Moreover,the odds ratio obtained from
logistic regression cannot be interpreted easily as the het-
erogeneity of the model is not accounted [112]. In clinical
predictive tasks, the odds ratios are estimated as risk ratios
which is actually an overestimation. Alternatives such as
Mantel–Haenszel risk ratio method, log–binomial regres-
sion, Poisson regression are used to give correct risk ratios
[113]. Interpretation of results can be achieved to a certain
extent by observing the probability changes [111]. Logistic
regression is also not found to perform well with class im-
balance problems and in such cases algorithmic approaches
such as random forests is used [112]. The major weak-
ness of MARS regression is overﬁtting and methods such
as generalized cross validation is used [115],[118]. Pruning
technique is also used to avoid overﬁtting problem to some
extent by reducing the number of its basic functions thereby
limiting the complexity of the model [117]. LOESS regres-
sion is less sensitive to outliers yet they are also overcome
by extreme outliers. Another disadvantage with LOESS re-
gression is that it requires densely sampled data to produce
good results [121]. The major problem with ridge regres-
sion is the parameter settings. The hyperparameter has to
be set [124]. Few methods are proposed in [123], [128]
for parameter setting. Ridge regression also suffers from
interpretability [124]. This happens because the unimpor-
tant predictors still exists in the model with the coefﬁcients
close to zero but not exactly zero [125]. LASSO regres-
sion is preferred in such predictive tasks to avoid the inter-
428 Informatica 43 (2019) 425–459 G. Nagarajan et al.
S.no Type of regres-
sion
Explanation
1 Linear regression The linear regression is given as
Y =aX +b (1)
where Y represents the predictor attribute, X represents the independent at-
tribute, a is the slope and b, the intercept. The best ﬁt line is obtained by mini-
mizing the square of variations of each observed and the actual values with the
line.
2 Logistic regres-
sion
Logistic regression is used for problems that are binary in nature and hence is
mainly used for classiﬁcation. This method aids to determine the likelihood of
the occurence of a happening. It is denoted by
logit(a) =ln(
a
1  a
) (2)
where a is the likelihood of the occurence of the event.
3 MARS Multivariate adaptive regression splines represented as MARS uses stepwise
regression for model building. It is a non parametric method. The non lin-
earities among the variables are identiﬁed and modelled with hinge functions.
These functions create a hinge in the best ﬁt line automatically according to the
non linear relationship among the variables. The MARS equation is given as
D =  0
+
N
X
n=1
  n
h
n
(I) (3)
where D represents dependent attribute, I represents independent attribute,  0
represents intercept variable,  n
represents slope of the hinge function h
n
(I).
4 LOESS regres-
sion
LOESS regression is a non parametric regression model. This method helps
to ﬁt regression line on subset of data rather than the data as a whole. It in-
corporates the concepts of regression model along with the nearest neighbor
concepts.
5 Ridge regression Ridge regression is another method commonly applied when the dataset ex-
periences multicollinearity. They reduce the standard errors. A shrinkage pa-
rameter   is added to the least squares term to minimize the variance.Ridge
regression estimator is given as
  ridge
= (I
T
I + I
p
)
  1
I
T
D (4)
where I represents independent attribute matrix, D represents predicted at-
tribute matrix, Ip represents identity matrix and  represents shrinkage param-
eter.
Table 1: Different types of regression
pretability problem of ridge regression as it sets the coefﬁ-
cients of unimportant parameters to zero [127].
2.1.3 Bayesian statistics
Bayesian statistics predicts the likelihood of events occur-
ring in the future. It works in the same way like normal
probability but uses the input from experimentation and re-
search to adjust the beliefs. For example, the probability
of a six occurring in a die thrown six times is 1/6. But
Bayesian statistics starts with the initial value of 16.6%
and then adjusts the belief based on the experimentation.
If the die is showing 6 more than the expected number of
times during experimentation, the value of this belief is in-
creased accordingly. Hence the likelihood of 6 turning in a
die thrown 6 times will increase or decrease depending on
the outcomes in the experimentation.
Strength and weakness:
Bayesian approaches yield accurate results in many pre-
dictive tasks as they consider both experimental data and
theoretical predictions [129]. Bayesian approaches are best
suited for tasks where there are uncertainties in models and
parameters. They also ﬁnd their use in predictive tasks
where probabilities questions have to be answered such as
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 429
in stock market analysis [130].
The major weakness of bayesian approaches lies in de-
termining the prior distributions. Bayesian approaches are
also computationally expensive [130]. Use of frequencies
instead of probabilities can help in improving bayesian rea-
soning [131].
2.2 Predictive models based on data mining
The process of extracting hidden patterns from the given in-
put is called data mining. It is basically mining knowledge
from the data. Three major approaches for data mining in-
clude Classiﬁcation, Clustering and Association. Machine
learning algorithms are widely used to execute the task.
2.2.1 Classiﬁcation
The method of determining the class to which a particular
data given as input belongs to is called classiﬁcation. It is
a supervised machine learning technique with labelled in-
put data. Classiﬁcation can be used in predictive analytics.
There are lots of algorithms under classiﬁcation technique
but few basic algorithms are described in detail in this sub-
section.
1. Naive Bayes: Naive Bayes is a statistical modeling ap-
proach with two assumptions —all the attributes are
equally important, all the attributes are statistically in-
dependent. It is a probabilistic approach which works
on the following Bayes theorem.
P [M=N] =
P [N=M]P [M]
P [N]
(5)
Strength and weakness:
The main strength of naive bayes is its simplicity. In-
spite of the fact that its accuracy is less, it is found
to perform better due to its simplicity in tasks such as
document classiﬁcation where merely classiﬁcation is
important. Naive bayes is computationally efﬁcient
since the contribution of each variable is equal and
independent [134]. Moreeover only few parameters
need to be calculated in naive bayes due to its condi-
tional independence assumption. Hence it suits well
for tasks where the training data is less [135].
The major weakness of naive bayes approach is its
conditional independence assumption that often does
not hold for real world applications [132]. This weak-
ness is overcome to a certain extent by weighting at-
tributes. Naive bayes with attribute weighting is found
to perform better than random forest and logistic re-
gression in [136],[137]. Accuracy and speed of classi-
ﬁcation is also less when compared with other classiﬁ-
cation approaches. Effective negation and feature se-
lection techniques such as mutual information in com-
bination with naive bayes is found to improve the ac-
curacy and speed to a certain extent[133]. An another
problem with naive bayes is that though they are good
classiﬁers, they are not good estimators as discussed
earlier. Hence it does not perform well in the tasks
where probability value is important [138] and certain
improvements to naive bayes is proposed to improve
the probability estimation [139]. Moreover when the
test data differs widely from the training data naive
bayes fails to perform well unless smoothing tech-
niques such as laplace estimation is used [138].
2. Decision trees: Decision tree is constructing a tree
based structure for classiﬁcation. Each node involves
testing a particular attribute and the leaves are as-
signed classiﬁcation labels based on the values at each
node. Decision trees use divide and conquer approach
and the attributes can also be selected with heuristic
knowledge for the nodes of decision trees though few
measurements like entropy and information gain are
used in selecting the attributes. Decision trees can be
converted to rules also. Many variations of decision
trees have evolved and one of them is random forest
which is commonly used bagging method in recent re-
search problems. The leaf nodes of the decision trees
are called decision nodes. Entropy is the amount of
randomness in the data. Information gain is the in-
formation obtained that helps for accurate prediction.
Entropy is given by
E(X) =  n
X
i=1
(P
i
logP
i
) (6)
where P
i
is the probability of occurence of value i
when there are n different values.
Information gain is a purity measure given by
IG(X;a) =E(X)  E(XjA) (7)
The value represents the information gained by split-
ting the tree with that particular attribute. The attribute
with less entropy or more information gain at every
stage is considered the best split and the process is
repeated until the tree converges. There are many de-
cision tree algorithms but few variants are shown in
table 2.
Strength and weakness:
The major advantage of decision tree over other classi-
ﬁers is its interpretability. The tree like structure helps
users to extract the knowledge easily [140]. Indeed,
decision trees does not require the data to be normally
distributed. The data can be continuous, discrete or
a combination of both. Hence there is no need for
much data preparation in decision tree model [142].
Moreover decision trees require only very few train-
ing iterations [143]. The random forest, an ensemble
technique of decision tree is found to yield more ac-
curate results. An another advantage of random forest
is that it is non parametric in nature and helps in de-
termining variable importance [141].
430 Informatica 43 (2019) 425–459 G. Nagarajan et al.
S.no Type of decision tree Description
1 ID3 Iterative dichotomiser is the basic non incremental algorithm used for
construction of decision tree. It uses information gain measure to select
the best split at each node. But the major disadvantage of ID3 is that it
may overﬁt the training data and may not result in optimal solution.
2 C4.5 C4.5 is an improved version of ID3 algorithm. It solves the overﬁtting
problem of ID3 by pruning the tree. C4.5 handles both continuous and
discrete attributes and is capable of handling missing values too.
3 C5.0 C5.0 is an improved version of C4.5 in terms of performance and mem-
ory efﬁciency. It supports boosting technique to improve accuracy. C5.0
constructs smaller decision trees by removing unhelpful attributes with-
out compromising on accuracy.
4 Decision stumps It is a single level decision tree and ﬁnds its use along with machine
learning techniques like bagging and boosting as weak learners.
5 CHAID Chi square automatic interaction detector is used to ﬁnd the relation-
ship between categorical dependent variable and categorical indepen-
dent variables. It uses chi square test of independence to test the inde-
pendency between two categorical variables. CHAID can be extended
for continuous variables too.
Table 2: Different types of decision trees
The major problem with decision trees is overﬁtting
or underﬁtting [144]. Techniques such as pruning
[146],[147] or feature selection methods are required
to avoid overﬁtting problem and also to reduce the
computational complexity of the model [147]. Deci-
sion trees does not suit well for imbalanced datasets
though ensemble techniques can help [145]. More-
over, though random forest yields more accurate re-
sults, they are black box classiﬁers as the split rules
are unknown [141].
3. KNN: Another classiﬁcation technique that is of wide
use is KNN yet with its own challenges. This tech-
nique works on the idea that the input data to be clas-
siﬁed depends on the class of its neighbors. The value
of ‘K’ determines the effectiveness of the algorithm.
‘K’ represents the number of neighbors to be consid-
ered. The input data is assigned to the class to which
most of its neighbors belong to. A distance metric
from the input data to the ‘K ’ neighbors is calculated.
Euclidean distance is usually deployed to calculate the
distance. Other distance metrics like mahanalobis and
manhattan distance measures can also be used instead
of euclidean. The accuracy of the algorithm lies in the
choice of K. Lower value of K might result in over-
ﬁtting and higher value for K might result in a more
generalized model difﬁcult to predict.
Strength and weakness:
The biggest advantage of KNN is its simplicity [150].
It does not require any prior knowledge about the dis-
tribution of data [149]. This non parametric nature of
KNN makes it effective for real world problems [151].
The major issue with KNN is the choice of parameter
k and distance metric [150],[152],[153] and few works
are proposed to determine the value of k [157],[158]
etc. Computational complexity is another issue with
KNN. Techniques such as clustering are used along
with KNN [148] to reduce the computational com-
plexity. KNN is also affected by irrelevant features
[150] and is sensitive to outliers [152],[153]. The
outlier problem can be avoided to a certain extent by
choosing a reasonable value for k rather than a small
k [154]. Few methods or improvements such as lo-
cal mean based KNN [155] and distance weight KNN
[156] are proposed to overcome the negative effect of
outliers in KNN.
4. Support vector machines: This classiﬁcation tech-
nique yields better accuracy in classiﬁcation prob-
lems. SVM works on the basis of hyperplane. The
idea behind this technique is to ﬁnd the best hyper-
plane that can classify the two classes more accurately.
This technique is best suited for both linear and non-
linear separation problems. Non-linear problems can
be handled using kernel functions that does data trans-
formations to ﬁnd the best hyperplane classifying the
data more accurately. This algorithm works under the
concept of margin. The distance between the hyper-
plane and the closest object in each class is calculated.
The hyperplane with maximum margin with the class
objects is the best classiﬁer since it can predict the
class more accurately. Each input data is assigned a
point in n-dimensional space.
Strength and weakness:
The major strength of SVM is its robustness. It mod-
els non linear relationships very effectively and hence
is proved to yield better results in terms of accuracy
especially in non-linear problems [161][165]. SVM
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 431
is also known for its generalization capability and the
generalization error is less in SVM [163] [164]. This
advantage of SVM helps it to model complex prob-
lems even when the training data is less [166]. More-
over, there is no need for feature extraction process
in SVM as the kernel function can be directly applied
on the data [164]. SVM also avoids overﬁtting [165],
[166].
The major weakness of SVM lies in its parameter set-
tings. Proper setting of kernel parameters determine
the accuracy of SVM. Certain optimization techniques
such as PSO [159] and GA [160] are used to optimize
the parameters of SVM. Methods such as double lin-
ear and grid search are also used to determine the pa-
rameter values of SVM [162].
2.2.2 Clustering
The technique of identifying similar patterns in the input
data and grouping the input data with similar patterns to-
gether is called clustering. Clustering helps in predictive
analytics. An example of clustering algorithm includes
segmenting customers based on their buying behavior pat-
tern thereby helping to predict the insights in the business
and improve the sales accordingly. Clustering the scan im-
ages in health care helps to predict whether the person is af-
fected by a speciﬁc disease or not. Though there are many
clustering algorithms, the three basic algorithms include k-
means, hierarchical and density clustering and a summary
of the same is provided in the following subsection. A re-
view of clustering with its scope, motivation, techniques
and applications are explained in [2].
1. K-means clustering: K-means clustering chooses K
random centroids and measures the distance of each
input data point with the centroids. The most com-
monly used distance measurement metric is Euclidean
distance. The input data points within the speciﬁc dis-
tance from the centroid are grouped together as a clus-
ter and hence arrived at few clusters. The average of
the distance of all the points from the centroid inside
a cluster is calculated and the centroid is recalculated
accordingly. The input data points belonging to the
cluster changes again. This process continues until
the centroids are ﬁxed. Another variation of partition
clustering is K-mediods where the centroid itself is an
input data point. K-median and K-mode algorithms
are also partition based clustering algorithms that uses
median and mode instead of mean. There are sev-
eral metrics to measure the performance of clustering.
One among them is the distance metric. Single link-
age is the nearest neighbor clustering where the mini-
mum distance between the data points of two different
clusters is calculated. Complete linkage is the farthest
clustering where the maximum distance between the
data points of two different clusters is calculated. Av-
erage linkage is also used in some scenarios.
Strength and weakness:
The major strength of k means clustering is it‘s sim-
plicity and speed [168]. It can also work on datasets
containing a variety of probability distribution [175].
The major drawback with k means clustering is it‘s
sensitivity to the initialization of cluster centers [167].
Hence determining the initial cluster centers is a ma-
jor challenge though many methods based on statis-
tics, information theory and goodness of ﬁt are pro-
posed [168]. Determining the number of centers is
also a challenge and is addressed in few works [173].
An another drawback with k means clustering is it‘s
computational complexity. As the distance of each
data point has to be calculated with each cluster cen-
ter for every iteration, the computational time is high.
Solutions such as data structure that stores informa-
tion at each iteration to avoid repeated computations
[169] and Graphical processing units (GPUs) that par-
allelize the algorithm are proposed to reduce the com-
putational complexity of k means clustering [172].
Moreover k means clustering is also sensitive to out-
liers and can end up in local optima. Few alternatives
include fuzzy c means clustering and other soft clus-
tering techniques that are proved to work well with
noisy data [170]. k means clustering is also combined
with optimization techniques such as PSO and ACO to
avoid local optima and to arrive at better cluster par-
titions [171]. Few works are carried out to identify
the better cluster partitions with minimum intracluster
distances and maximum intercluster distances. Opti-
mization function is derived that minimizes the intr-
acluster distances and maximizes the intercluster dis-
tances. This function is optimized using optimization
algorithms such as GA, PSO, WOA, ACO etc in few
clustering works. Few other works include [269] that
uses a set of objective functions and updates the al-
gorithm accordingly to improve the intracluster com-
pactness and intercluster separation, [270] that uses
bisected clustering algorithm to measure the intraclus-
ter and intercluster similarity metrics etc. [174] pro-
poses a method to overcome the drawback of noisy
features in k means clustering.
2. Hierarchical based clustering: Hierarchical cluster-
ing works either in a divisive way (top-down) or ag-
glomerative way (bottom-up). In the divisive cluster-
ing, large cluster is broken down into smaller pieces.
In the agglomerative clustering, each observation is
started as its own cluster and pair of clusters is merged
together as they move up in the hierarchy. A den-
dogram is a pictorial representation for hierarchical
based clustering. The height of the dendogram repre-
sents the distance between the clusters. Agglometric
and divisive clustering algorithms are called AGNES
and DIANA respectively. In DIANA clustering tech-
nique, all the input data points are considered as a sin-
gle cluster and every iteration divides the cluster based
432 Informatica 43 (2019) 425–459 G. Nagarajan et al.
on heterogeneity. More heterogeneous data points
breaks down into another cluster. In AGNES clus-
tering technique, each input point is considered as a
single cluster and homogeneous points are clustered
together as a single cluster at each iteration.
Strength and weakness:
The major strength of hierarchical clustering includes
it‘s scalability and capacity to work with datasets of
arbitrary shapes [177]. It also determines the hier-
archical relationships well. Moreover the number of
clusters need not be speciﬁed in advance [178].
The major drawback with hierarchical clustering is its
computational complexity [176],[177],[178] and few
other methods are proposed to improve the efﬁciency
of the same [176]. Parallel techniques are also used to
improve the computational efﬁciency of hierarchical
clustering [179].
3. Density based clustering: Density based clustering
works by deﬁning a cluster as the maximal set of
density connected points. Three types of points are
chosen core, border and outlier. Core is the part of
the cluster that contains dense neighborhood. Bor-
der doesn‘t have many points but can be reached by
the cluster. Outlier can‘t be reached by the cluster.
Density based clustering picks up a random point and
checks if it is the core point. If not, the point is marked
as an outlier. Else, all the directly reachable nodes
from the speciﬁc point are assigned to the cluster. It
keeps ﬁnding the neighbors until it is unable to. There
are certain kinds of problems where density based
clustering provide accurate results than k-means clus-
tering. Outlier detection is accurate in density based
clustering.
Strength and weakness:
The major strength of density based clustering is that
it can discover clusters of arbitrary shapes [184],[177].
It is also robust to outliers [184]. There are several
density based algorithms such as DBSCAN, OPTICS,
Mean-Shift etc [177].
The major drawback with density based clustering is
the setting of parameters. Parameters such as neigh-
bourhood size, radius etc. have to be set in density
based clustering [182], [177]. Few algorithms are pro-
posed to determine the parameters in density based
clustering [183]. Moreover the density of the starting
objects affect the behavior of the algorithm. The al-
gorithm also ﬁnds its difﬁculty in identifying the adja-
cent clusters of different densities [182] ,[177]. Tech-
niques such as space stratiﬁcation is proposed to solve
this problem [182]. An another drawback with den-
sity based clustering is its efﬁciency. Parallelization of
the algorithm reduces the computational complexity
to a certain extent. Techniques such as GPUs [180],
mapreduce [181] are used to improve the scalability
and efﬁciency of the algorithm.
2.2.3 Association
The method of identifying the relationship amid the items
and deriving the rules based on the relationship is called as-
sociation. Though association mining is not of much use in
prediction, there are few scenarios where association rule
is used. The rule has antecedent and consequent. Associa-
tion rule mining is used mainly in business and marketing
[185]. There are different algorithms used in association
rule mining. Few include Apriori, Predictive Apriori, Ter-
tius etc [186]. Optimization techniques such as PSO are
also used with association rule mining to improve its efﬁ-
ciency [187]. [188] presents a survey on association rule
mining algorithms.
2.3 Predictive models based on machine
learning
Machine learning approaches are used for predictive tasks.
It is the process of training the machine with a training in-
put set, building a model and the evaluating it with the test
data. The machine learns continuously from the errors until
the model gets stabilized. Supervised learning works with
labeled input data whereas unsupervised learning works
with unlabeled input data. Machine learning uses soft com-
puting techniques like neural networks for training.
2.3.1 Neural networks
Neural network is a commonly used soft computing tech-
nique for predictive analytics. Neural networks are used to
classify complex patterns that are difﬁcult to classify us-
ing SVMs or other techniques. There are different types
of neural networks that can be trained using supervised,
unsupervised and reinforcement learning. There are also
different learning algorithms for training neural networks.
Neural networks machine learning algorithm can be used
to train a network with a group of training data and then
test it with a group of test data thereby measuring the accu-
racy of prediction. Learning continues until the network
becomes stable and able to classify the data accurately.
Cross validation is one among the widely used technique
for evaluating the model. Backpropagation algorithm is
the most commonly used training algorithm in neural net-
works. Weights are assigned at each layer input, hidden and
the output. Weightage is given to each attribute based on
the impact of it in predicting the output variable. Different
types of functions like sigmoidal function and sign func-
tion are used to compute the output variable. These func-
tions are called threshold functions and the output variable
is predicted based upon these functions. Threshold func-
tions are also called activation function or transfer func-
tion. The choice on number of input nodes, hidden layers,
weightage, threshold functions, algorithm for learning are
all based on the application and data for which predictive
analytics has to be applied. Now, deep learning techniques
are used to improve accuracy.
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 433
Strength and weakness:
The major strength of Artiﬁcial Neural Networks(ANN)
lies in it‘s ability to work with large amounts of data
and yield good results. They have good generalization
and learning ability and are universal approximators [191].
ANN has good fault tolerant, self learning, adaptation and
organization ability [192]. An another advantage with
ANN is that they are good for high dimensional datasets as
each variable do not have major impact on the class vari-
able but as a group they are good at classiﬁcation. More-
over a complex ANN relives user from determining the
interactional and functional forms in prior and is able to
smooth any polynomial function [193]. There are different
types of neural networks such as as feedforward network,
radial basis function network(RBFN), auto encoder, Boltz-
mann machine, extreme learning machines, deep belief net-
work, deep convolutional network etc [189] each with its
own strengths and weaknesses. For example, RBFN are
easy to design, have good generalization ability and are tol-
erant to noise. These networks ﬁnd their use in designing
ﬂexible structures and systems [190].
The major weakness with ANN is that they can‘t be ap-
plied blindly to all kinds of data. Hence they are used by
combining with other models as hybrid prediction models
in most of the prediction problems. For example, in time
series problems, both linear and non linear relationships ex-
ist and ANN is combined with ARIMA modelling in such
problems [190]. An another disadvantage lies in the fact
that there are no proper rules to determine the number of
hidden nodes in neural networks. Moreover they can also
easily end up in local optima and are tend to overﬁt [193].
Optimization algorithms such as GA [194], Gravitational
search algorithm with PSO [196] are used to avoid the lo-
cal optima problem in ANN. Algorithms such as Fruitﬂy
algorithm also ﬁnd their use in determining the parameters
for ANN [195]. Overﬁtting problems is addressed by tech-
niques such as dropout mechanisms [197], bayesian regu-
larization [198] etc.
2.3.2 Deep learning
Deep learning is the most commonly used technique in
use today for classiﬁcation, clustering, prediction and other
purposes. While learning in machine learning proceeds in a
broader way, deep learning works in a narrow way. It works
by breaking down the complex patterns into simple smaller
patterns. Learning happens in parallel in the smaller pat-
terns and ﬁnally the sub solutions are combined together
to generate the ﬁnal solution. This improves the accuracy
of the network. Deep nets also help in avoiding the van-
ishing gradient problem. Most of the deep learning prob-
lems use Rectiﬁed Linear units function(ReLUs) instead of
sigmoidal and tanh activation functions that causes vanish-
ing gradient problem. The use of ReLUs help overcome
the vanishing gradient problem by avoiding zero value for
the derivative and maintaining a constant value instead
[271]. Moreover the use of deep learning networks such
as Long Short Term Memory Networks(LSTM) avoid van-
ishing gradient problem by maintaining a constant value for
the recursive derivative using cell states [272]. Deep nets
use GPUs that help them get trained faster. When the input
pattern is quite complex, normal neural networks might not
be effective because the number of layers required might
grow exponentially. Deep nets work effectively in solving
such complex pattern by breaking them down into simpler
patterns and reconstructing the complex solution from the
simpler solutions. GPUs are known to be a better alter-
native to CPUs for big data analytics owing to it‘s lower
energy consumption and parallel processing. Hence GPUs
are found to be scalable in deep learning as the training
and learning of the deep nets are made faster with paral-
lel processing of many computational operations such as
matrix multiplications [273]. There are different kinds of
deep nets used for different purposes [5]. Table 3 shows the
different types of deep nets and their usage.
Strength and weakness:
An overview of deep learning in neural networks has
been discussed in [199]. The major strength of deep learn-
ing is it‘s ability to model non linear relationships well.
Deep learning also suits well for massive data and has
better representation ability than normal networks [200].
Moreover deep learning does automatic feature extraction
and selection [201].
The major weakness of deep learning is that it is a black
box technique. There is no theoretical understanding be-
hind the model. Certain techniques such as information
plane visualization are proposed to understand DNN by us-
ing the mutual information exchanged between layers in
DNN [202]. Moreover deep learning works well only with
massive data and their architectures are more specialized to
a particular domain. They also consume high power [203].
2.3.3 Fuzzy rule based prediction
Fuzzy logic is a concept of soft computing technique more
suited for prediction problems with uncertainty and impre-
cision. Fuzzy sets have membership functions associated
with each input data set. The membership value of a par-
ticular input data represents the level of belonging of the
particular input data to the particular set. Rules are de-
rived and learning is based on the rules. Finally the rule
based approach is used to classify or predict the output vari-
able. Fuzzy systems are widely used for prediction pur-
poses. Fuzzy systems can be used as stand-alone or can
also be combined with other machine learning algorithms
for predictive tasks. The simple fuzzy based classiﬁer is
If-THEN classiﬁer and it can be made more meaningful
with the use of linguistic labels [61],[204]. Fuzzy systems
are also combined with neural networks as neuro-fuzzy
classiﬁer [261],[209] and is used for prediction purposes.
Fuzzy systems are also combined with KNN for prediction
purposes. Fuzzy c means clustering is found to perform
well than hard clustering especially in applications such as
bioinformatics where genes are associated with many clus-
434 Informatica 43 (2019) 425–459 G. Nagarajan et al.
S.no Type of deep
net
Description Usage
1 Restricted
Boltzmann
machine
Two layered network with visible and hid-
den layer. Layers not connected among
themselves. In the forward pass, RBM
takes the input and encodes as numbers.
The backward pass does the reverse. Data
need not be labelled.
Recognize inherent patterns. Works well with real
time data like photos, videos, voice etc. Used for
clustering.
2 Deep belief
nets
Stack of RBMs arranged together. Output
of hidden layer of the previous networks
like RBN is given as input to visible layer
of next RBN.
Used for recognizing complex patterns. Used
more commonly in facial recognition.
3 Convolution
nets
Made up of three layers, convolution,
RELU and pooling each having its own
function.
Used to identify the internal patterns within an im-
age.
4 Recurrent net-
works
A network with built in feedback loop.
Uses techniques like Gating to overcome
vanishing gradient problem.
Used when the patterns in the input data changes
over time. For image captioning, document clas-
siﬁcation, classify videos frame by frame, natu-
ral language processing etc. LSTM is a recurrent
network architecture that is used for deep learn-
ing. The application of LSTM includes time series
data predictions, classiﬁcation, processing, hand
writing recognition, speech recognition etc. It is
known for reducing the exploding and vanishing
gradient problems.
5 Autoencoders Encode the input data and reconstruct it to
back. Works with unlabeled data.
Finds its use in dimensionality reduction. Used
for text analytics.
6 Recursive
Neural Tensor
nodes
Works with the concept of roots and leaves.
Data moves in the network in a recursive
way.
Used to discover hierarchical structure of a set of
data. Used in sentimental analysis. Used in natu-
ral language processing.
7 Generative
adversial
networks
A network that can produce or generate
new data with the same statistics as the
training data[274].
Used in fashion designing, improving satellite im-
ages, etc.
Table 3: Different types of deepnets and their usage
ters [170], [270].
Strength and weakness:
The major strength of fuzzy systems is it‘s interpretabil-
ity. The fuzzy models are easy to interpret if designed care-
fully [205] especially with the use of linguistic labels [206].
Fuzzy rules also help to model the real world processes
easily [207]. Fuzzy systems are known well for handling
uncertainty [208].
The major weakness of fuzzy systems include it‘s poor
generalization capability as it is rule based. Fuzzy systems
are not robust as any change should be incorporated into
the rule base. To overcome this disadvantage, fuzzy sys-
tems are often combined with ANN and hybrid systems
are developed for prediction [208]. An another disadvan-
tage of using fuzzy systems is that the knowledge about
the problem should be known in advance. The use of hy-
brid systems can help overcome this disadvantage as the
knowledge is extracted from neural networks in such sys-
tems [209]. Approaches such as genetic programming is
also used to generate rules for fuzzy systems [210].
2.3.4 Ensemble algorithms
Ensemble methods are combination of more than one tech-
nique to achieve more accuracy in prediction than achieved
by an individual model. Few ensemble techniques are
shown in table 4. Each ensemble technique has its own
strength and weakness. For example, bagging is stable
against noise but needs comparable classiﬁers whereas
boosting is unstable against noise but its classiﬁcation per-
formance is better than bagging [211]. Also, bagging is
found to perform better than boosting for class imbalance
problems especially in noisy environment [212]. Stacking
has it‘s own weakness with respect to computational time.
It is computationally expensive [213]. Another problem
with stacking lies in the selection of base level classiﬁers as
techniques such as exhaustive search consumes more time
when search space is large. Yet, unlike bagging and boost-
ing that uses the same algorithm, stacking uses a different
algorithm and hence heterogeneous in nature [215]. The
choice of the ensemble technique depends largely on the
problem at hand. Bagging is good to deal with problems
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 435
S.no Ensemble Description
1 Bagging Bagging or bootstrap aggregation is the method of decreasing the variance without
any change in the bias. It is mainly used for regression and classiﬁcation techniques.
Each model is built separately and the net output is derived by bringing together the
results from the individual models by joining, aggregation and other methods.
2 Boosting Boosting is a parallel ensemble method to reduce bias without any changes in vari-
ance. Boosting converts weak learning algorithms to strong learning algorithms using
certain methods like weighed average. There are many variations of boosting algo-
rithms like adaboost, gradient boosting etc. The misclassiﬁed instances are assigned
more weight in the successive iterations.
3 Stacking Stacking is the technique in which the output of the previous level classiﬁer is used
as training data for the next classiﬁer to approximate the target function. It minimizes
variance and methods like logistic regression is used to combine the individual models.
Table 4: Ensemble techniques
where a single model is likely to overﬁt whereas boosting
is good for problems where a model yields poor perfor-
mance. Moreover bagging can be done in parallel as each
model is independent whereas every model in boosting de-
pends on the previous model [214]. There are different
kinds of boosting techniques, the major include adaboost
and gradient boost. Adaboosting improves performance by
assigning high weight to the wrongly classiﬁed data points
whereas gradient boosting improves performance by using
gradients in the loss function [216]. Indeed, gradient boost-
ing converges in a limit whereas adaboost is computation-
ally efﬁcient than gradient boosting [217].
3 Challenges of core predictive
models on big data
‘Big data’ represents data sets that are in petabytes,
zettabytes and Exabyte. The sources of big data include
satellites that generate enormous information every second
from space, mobile communications generating volumi-
nous data, social media like Facebook, Twitter with blogs,
posts etc. Traditional relational databases, data warehouses
and many visualization tools and analytical tools are de-
veloped for structured data. Because of the heterogeneous
nature of big data and enormous amount of data generated
including real time data, there is a need to enrich traditional
analytical methods to support the analytical functionalities
for big data. Alternatively, new tools and techniques are
developed to work on big data in combination with the tra-
ditional analytical techniques. Big data development in-
cludes the development in all the areas of handling big data
including data storage, pre-processing, data visualization,
data analytics, online transaction processing, online analyt-
ical processing, online real time processing, use of business
intelligence tools for predicting insights from big data etc
[3]. The main characteristics of big data include volume,
velocity, variety, veracity and value.[4]. A single machine
cannot store big data because of its volume. The basic con-
cept behind big data storage is to have many nodes (com-
puters) and store the chunks of big data in them. The nodes
are arranged in racks and communicate with each other
and the centralized node that controls them. Clouds are
also used for big data storage but it has its own challenge
of privacy and security. Getting into the depth of storage
technology is outside the purview of this article and hence
we are leaving the discussion about big data storage at this
stage. As our paper mainly aims to discuss the overview
of predictive analytics with big data, this section addresses
the challenges encountered by the core predictive models
discussed in previous section on big data.
Extrapolation
Extrapolation will be precise only when the knowledge
about the underlying covariate information [220] and the
actual system is clear [219],[221] which is difﬁcult to de-
termine in big datasets. With big data such as spatial data,
existing extrapolation approaches fail due to it‘s time and
space constraints. Hence new technological innovative ap-
proaches are required to model such big datasets and under-
stand them [219]. Extrapolation with kernel methods like
gaussian are proved to be good due to their ﬂexibility in
choosing the kernel function. Yet, when it comes to devel-
opment of gaussian models for multidimensional big data,
it suffers from computational constraints. Techniques such
as recovering out of class kernels are used to overcome the
computational constraint to a certain extent [218]. An an-
other problem with the machine learning models including
deep learning is that they merely ﬁt the data and may per-
form well for training dataset and even testing dataset but
fails in extrapolation [221],[222]. This happens as they do
not have proper structural explanations for the correlations
they identify [222]. [220],[221] recommends the construc-
tion of hybrid systems comprising the science based model
or physical models along with the predictive models to im-
prove the accuracy of extrapolation. But the problem in
developing hybrid systems lies in the fact that it requires
domain knowledge.
Regression
Regression is easy and can be understood well when
the data is small and can be loaded into memory com-
436 Informatica 43 (2019) 425–459 G. Nagarajan et al.
pletely. But big datasets can not be loaded into memory
completely. Few parallel techniques and solutions are pro-
posed for regression yet they end up in local optima or
in accessing the data again and again for updates. The
other problem in using parallel techniques is the compu-
tational resources incurred [223]. The area in the improve-
ment of computational resources is still lacking when com-
pared with the amount of big data generated [224]. Regres-
sion approaches such as kringing is computationally com-
plex especially with big data. Sampling techniques such as
leveraging [224] and subdata selection[226] are proposed
to reduce the computational complexity. But as discussed
in the earlier sections, the inferences on the samples cannot
be justiﬁed completely for the whole population as such.
Regression is also performed locally by dividing the big
dataset into few smaller datasets and then combining the
submodels to construct the ﬁnal model [225]. The chal-
lenges with these solutions lies in the choice of appropriate
method for division, aggregation etc.
Decision trees
Big data streams are more prone to noise and decision
trees are more sensitive to noisy data [227]. The time taken
to build the decision tree is computationally expensive with
big data [228]. Preprocessing and sampling the big data in
full batches before the construction of decision tree adds
to the computational cost [227]. External storage is re-
quired to construct decision tree for big data as the com-
plete dataset cannot be loaded into memory. Hence tra-
dition decision tree design does not suit for the big data.
Solutions such as incrementally optimized decision tree al-
gorithm [227] is proposed where decision tree is built in-
crementally. Parallel techniques [229] are proposed in big
data platforms such as spark where the decision tree algo-
rithm is executed in parallel. Decision tree algorithm is also
converted into mapreduce procedures in [228] to reduce the
computational time. The computational time of gradient
boosted trees is decreased in [230] by eliminating few in-
stances in calculation of information gain and bundling cer-
tain features together. Yet, these solutions come at the cost
of choosing the right technique to break the algorithm for
parallel execution, bundling the features etc.
K Nearest Neighbor
The major problem of KNN with big data is it‘s com-
putational time as the distance has to be calculated among
each instances [231]. This in turn incurs memory require-
ment for storage [232]. k means clustering is used to cluster
the big dataset and KNN algorithm is used on each subset
to reduce the computational time [231]. But this solution
comes with the general limitations of k means clustering.
Memory requirement is handled to a certain extent by big
data platforms such as spark so that in time memory com-
putation is used effectively [232]. Map reduce approaches
[233] are also used to reduce the computational time. Par-
allelization of KNN algorithm is also proposed [234]. Yet,
all these big data platform solutions come with their own
concerns on the nature of partitioning as the accuracy can
not be compromised for efﬁciency [235].
Naive Bayes
Naive Bayes requires the probability to be calculated
for all the attributes. With big datasets, the number of at-
tributes is more and hence the time complexity to calculate
the probability for all the attributes is high [236]. Another
problem with naive bayes is the underﬂow and overﬁtting
problems [237]. The underﬂow problem is usually han-
dled by calculating the sum of log of probabilities rather
than multiplying the probabilities whereas overﬁtting prob-
lem is handled using techniques like laplace estimate, M-
estimate etc. But with high dimensional big datasets like
genomic datasets, these solutions are not efﬁcient [237].
Naive bayes deals only with discrete data. Hence dis-
cretization methods are used before applying naive bayes
algorithm. In case of big data, existing traditional dis-
cretization methods are not efﬁcient and may lead to loss
of information [238]. Parallel implementations are pro-
posed for naive bayes algorithms yet they come at the cost
of hardware requirements [236]. [237] proposes a solu-
tion to solve the underﬂow and overﬁtting problems in big
data. The method uses a robust function that works based
on average of condition probabilities of all attributes and
calculation of dependency of the attributes on the class at-
tribute. Parallel versions of existing discretization methods
are also proposed to address the challenge of big data [239].
Yet, more research is required in these open issues.
Support vector machines
SVM is known for it‘s accurate results yet the compu-
tational complexity of SVM is quite high on big datasets
[240],[241]. Inspite of this computational complexity,
SVM uses certain optimization techniques like grid search
method for parameter tuning. These optimization tech-
niques are not suited for big datasets [244]. Though certain
parameter optimization techniques such as stepwise opti-
mization is proposed in [244] for big datasets, more re-
search is needed in this area. Solutions such as implement-
ing SVM on a quantum computer [240] to reduce it‘s time
complexity is proposed. Again, they come at the cost of
hardware. Parallel implementation of SVM using mapre-
duce technique is proposed [241] yet they may end up in
local support vectors that may be far away from the global
support vectors. [242] proposes a distributed version of
SVM where global support vectors are achieved by retain-
ing the ﬁrst and second order statistics of the big dataset.
Though there are many parallel versions of SVM, only a
very few parallel tools are available in open source for par-
allel SVM. Moreover, these tools also require proper tuning
[243].
K means clustering
The major problem of k means clustering with big data
is it‘s computational complexity as the distance calculation
and convergence rate incurs more time with increased num-
ber of observations and features. But it can be easily paral-
lelized using big data platforms [245]. Though paralleliza-
tion is easy with k means clustering using techniques like
mapreduce, the I/O and communication cost increases due
to repeated reading added with the iteration dependence
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 437
property of k means [246]. Methods like subspace clus-
tering and sampling are used to reduce the iteration depen-
dency property of k means [246]. Yet, the choice of correct
sampling method and partitioning technique in case of sub-
space clustering adds to the big data challenges. Indeed,
the size of the sample data is more than half the original
data in most of the methods and hence the computational
complexity still persists [248]. Optimization of initial clus-
ters using techniques like choosing the data points in high
density space [247] are proposed. Though they can avoid
outliers, they still suffer from the same computational com-
plexity owing to the distance calculation. Dimensionality
reduction techniques are also proposed but they come with
their own drawback that the cluster in projected space may
not comply with the clusters in actual space [248]. Hybrid
methods are proposed by combining projection techniques
with sampling and few other techniques like visual assess-
ment of cluster tendency. But the research on hybrid tech-
niques are still in it‘s initial state [248].
Hierarchical clustering
Hierarchical clustering also suffers from the drawback
of computational complexity and in fact it incurs more time
than k means clustering when the size of the dataset is large
[252]. Techniques such as building clusters using centroids
[250] and usage of cluster seeding [252] are proposed to re-
duce the computational complexity of hierarchical cluster-
ing. Partitioning the sequence space into subspaces using
partition tree is also proposed [251] and the clusters are re-
ﬁned in the subspaces. Fast methods to compute the closest
pair is also proposed to reduce the computational cost. Yet,
these methods are very speciﬁc to the particular problem.
Moreover, the partitioning techniques and the cluster seed-
ing techniques should be chosen wisely. Visual assessment
of tendency are also used to return single linkage partitions
on big datasets [249] yet the study of tendency curves have
to be clear.
Density based clustering
Density based algorithms are better compared to parti-
tioning algorithms on big data and data streams because
it can handle datasets of arbitrary shapes. It is also not
required to specify the number of clusters and it can han-
dle noise effectively. But, with the high speed of evolving
data streams and high dimensional big data, density based
clustering is ﬁnding many challenges. Though few meth-
ods are found to perform better, they still suffer from open
challenges such as too many parameters to set, memory
constraints, handling different kinds of data such as cate-
gorical , continuous etc [254]. Big data platforms such as
hadoop is used for parallelization in density based cluster-
ing. Yet, there is a need to choose the shufﬂing mechanism,
partitioning technique and work load balancing efﬁciently
[253]. Moreover, density based clustering algorithms such
as OPTICS cannot be parallelized as such and either im-
provements or new algorithms have to be proposed to han-
dle large datasets [255]. Few enhancements are carried out
in OPTICS and other density based clustering algorithms to
support parallelism. Yet, they are very speciﬁc to the prob-
lems they address. For example, [256] uses spatio temporal
distance and temporal indexing for parallelization which is
more speciﬁc to the spatio temporal data and [257] pro-
poses a method that is speciﬁc to the electricity data.
Neural networks
The major challenge that neural network faces with big
data is the time taken for training phase as large data sets
require more training iterations or epochs [260],[261]. As
a result, the computing power required becomes high [258]
and in turn the energy consumption[260]. Though tech-
niques like mapreduce on hadoop platforms are used [259],
the mapreduce design has to be an optimized and efﬁcient
one. Hardware solutions such as GPU and memrister are
proposed [258], yet they suffer from the major drawback,
the cost factor. Optimization algorithms are proposed [259]
to optimize the parameters of neural networks. Yet, there is
a common perspective that using optimization algorithms
increases the computational time due to it‘s convergence
property though proper optimization decreases the training
time of neural networks inspite of improving the accuracy.
Very few researches are carried out in this area for big data
with neural networks. The other problems with neural net-
works on big data include the increase in number of param-
eters, lack of proper theoretical guideline in the structure of
neural networks, insufﬁcient knowledge as only abstraction
is extracted, inherent problem in learning etc[261].
Fuzzy systems
Fuzzy systems are of great use in big data due to it‘s
ability to handle uncertainty. To cope up with the big data
requirements, fuzzy systems are designed that distributes
the computing operations using mapreduce technique [61].
But the major problem with mapreduce is the overhead
taken to reload the job everytime during it‘s execution.
Moreover when there are more number of maps, the imbal-
ance problem has to be dealt with carefully. Spark which
has in memory operations and resilent distributed databases
is more efﬁcient than mapreduce but unfortunately there
are no big works that integrate fuzzy systems with spark
[263]. Fuzzy systems are also found to be more scalable as
they represent the data as information granules [262]. Yet,
good granular techniques in combination with fuzzy classi-
ﬁcation systems exclusively for big data is required [262].
The fuzzy techniques designed for big data should be tested
for real world problems and more general fuzzy techniques
need to be developed rather than the techniques designed
to address speciﬁc problems [262].
Deep learning
Deep learning is used for big data due to it‘s accuracy
and automatic learning of hierarchical data representations
[264]. Yet, the major problem with deep learning is the re-
quirement of high performance systems. There are other
areas that need to be explored in deep learning for big data
problems. These include transfer learning with deep ar-
chitectures, deep online learning that is still in the initial
stage of research [264], incremental learning with deep ar-
chitectures [265], working with temporal data [266] etc.
Indeed, constant memory consumption due to the fact that
438 Informatica 43 (2019) 425–459 G. Nagarajan et al.
deep learning is usually performed on very big data that in-
volves millions of parameters and many CPUs is another
problem. Hence deep learning requires the use of exces-
sive computational resources. Moreover, deep learning also
faces the challenge of determining the number of optimal
parameters, learning good semantic data representations as
they are known for representing only the abstract detail etc
[265]. Lot of research works is required to address the in-
terpretablity problem[266].
Ensemble algorithms
The major problem with ensemble algorithms for big
data is it‘s computational time [268]. As ensemble tech-
niques require the use of different classiﬁers, the computa-
tional time they require is generally high and this increases
when the data is big. Ensemble techniques are known
for their diversity as they use different kinds of classiﬁers
and aggregate the results. Though many ensemble tech-
niques are developed for static data, there are no big re-
search works carried out in studying the diversity of ensem-
ble techniques on online streaming data. Since the differ-
ent classiﬁers used on streaming data already differs in the
data they use, a proper study of the advantages of ensemble
techniques on streaming data is required. Proper pruning
techniques is also an area to be explored [267]. There are
few works where ensemble techniques for big data is par-
allelized with mapreduce [268], yet they are not tried on
platforms such as spark that are proved to be more efﬁcient
than mapreduce.
4 Predictive analytics on big data
across different domains
4.1 Healthcare
Big data is generated by lot of industries and health care
is one among them. Huge amount of big data is gener-
ated from wearable devices in patients, emergency care
units, etc. Structured data such as electronic patient record,
health record, unstructured and semi structured data such as
scanned images, clinical notes, prescriptions, signals from
emergency care units, health data from social media are few
examples of big data generated from health care domain.
Predictive analytics on health big data helps in predicting
the spread of diseases [8], [9], predicting chances of read-
mission in hospitals [10], predicting the diseases at an early
stage, [11], [12],clinical decision support system to identify
the right treatment for the affected patients, hospital man-
agement etc [13]. A detailed overview about the use of pre-
dictive analytics in health informatics is presented in [14].
The paper discusses about the applications of big data pre-
dictive analytics in health informatics, techniques for the
same, opportunities and challenges.
Research gaps
Apart from storage, processing and aggregating differ-
ent types of data in health care, identifying the dependen-
cies among different data types is still an open challenge
that requires optimal solution. Another challenge is with
data compression methods. Though various methods are
available for big data compression like FPGA, lossy image
compression, they might not suit well for medical big data
since medical images shouldn‘t lose any information [44].
An another area of improvement is in predicting the spread
of diseases earlier. Though there are certain works carried
out in this area, few important features were not taken into
consideration for prediction. For example, though environ-
mental attributes are used as input for predicting the spread
of disease, certain inputs related to biological and socio-
behavioral is excluded in the proposed approach in [8].
Proper dimensionality reduction techniques and feature se-
lection are not considered in few works. Merely knowledge
of domain experts are used for feature selection in health
big data [10]. Clinical decision support systems is another
challenge to work with. Though clinical decision support
systems are developed, the success rate is very less. The
decision support system should be developed considering
both the patient’s and physician’s perspectives as patients’
acceptance is very important for these systems. It can be
of greater importance for emergency care units. Few chal-
lenges to work on include privacy issues, proper training
for clinicians, quality of data etc. Such systems help in tak-
ing precautionary actions like identifying low risk patients,
work on hospital readmission rates etc [60]. Radiation on-
cology is another area open to researchers. Building an
integrative model for radiation oncology to be used as deci-
sion support system will be of great help for clinicians [10].
More concentration on genomic analysis is required since
the present applications of clinical prediction uses genomic
data. Research on functional path analysis of genomic data
has to be concentrated upon [44]. Research works on han-
dling noise, complexity, heterogeneity, real time, privacy in
clinical big data is the need of the hour [14].
4.2 Education
Education ﬁeld is another domain generating lot of big data
using sensors, mobile devices for applications like learning
management system, online assignments, exams, marks,
attendance etc. Social media is also widely used by stu-
dents and instructors as forums [15]. Predictive analyt-
ics in data generated from these devices and institutions
help to predict the effectiveness of a course [15], learning
method [16],[17], student’s performance [18], [20], institu-
tion’s performance [21] etc. Such predictions also help to
personalize instructions by customizing the learning expe-
rience to each student’s skills and abilities. [19] discusses
about big data opportunities and challenges in education
ﬁeld. [17] and [21] also discusses about the scope of pre-
dictive analytics in educational big data.
Research gaps
Firstly, there are very few works carried out for predic-
tive analytics in education ﬁeld. There are very few papers
in this ﬁeld and most of them are review and survey pa-
pers. Hence researchers can focus on this ﬁeld. The major
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 439
challenge involved in this ﬁeld is social and ethical chal-
lenges. Since the student‘s and institution‘s individual data
are used for prediction purposes, many students and insti-
tutions might not want their data to be exposed [68]. In
the recent days, many institutions allow students to bring
their own devices for language learning. Hence massive
amounts of data is generated and scalability is an important
area to concentrate in future [19]. Another area to work on
is in integrating data from different sources. Student data
are available in multiple sources like social media, schools,
district ofﬁces, universities etc. Few are structured and few
unstructured. A more focus on this area can help predic-
tion.
4.3 Supply chain management
Predictive analytics helps in supply chain management.
Accenture is a company that has implemented big data so-
lutions for prediction in supply chain management [22].
Prediction in supply chain management helps to improve
customer relationships by regular interactions with them
thereby helping in understanding their satisfaction level,
product recommendations [22], predicting supplier rela-
tionships [23], reduce the customer waiting times [24], im-
prove the production based on demand [25], [26], manage
the inventory effectively [27] and reduce risks in the pro-
cess of supply chain [28].[29] discusses about the advan-
tages of predictive big data analytics in this domain. Prod-
uct development is another area where big data solutions
can help the process.
Research gaps
Though there is more scope for prediction in supply
chain management, very few industries have implemented
it. Probably because of the hesitation to invest and the lack
of skillset. Models which can reduce cost can be proposed
for supply chain management processes. Hiring data scien-
tists with domain knowledge helps the industries to move
towards big data solutions for efﬁcient supply chain man-
agement [22]. Though some of the companies use analytics
in supply chain management, most of them are ad-hoc and
situation speciﬁc. Predictive analysis on other areas like
improvement in demand driven operations, better customer
supplier relationships, optimization of inventory etc can be
more concentrated upon [22], [27]. Generalized models for
prediction in supply chain industry can be more focussed
on. Though [23] proposed a model based on deduction
graph, it is not tested on variety of product designs. Pri-
vacy of data is not considered. The approach also uses lot
of mathematical techniques. Hence approaches using sim-
ple techniques can be developed. More solutions for sup-
ply chain management considering both strategy and oper-
ations has to be focussed upon [26].
4.4 Product development and marketing
industry
[30] presents a white paper about the scope of predictive
analytics for product development process. Marketing is
a part of almost all the sectors and prediction in market-
ing has gained more importance because of its direct im-
pact in business and income. Predictions using big data so-
lutions for marketing helps to acquire customers, develop
customers and retain customers. Prediction in product de-
velopment and marketing industry helps to validate the de-
sign of the product, predict the demand and supply thereby
increasing the sales and improving the customer experi-
ence.
Research gaps
There is no single technology available to address all the
big data requirements. Big data solutions have to be in-
tegrated with other approaches and techniques to support
predictive analytics. Researchers can concentrate to work
on a single technology that can address all the require-
ments [30]. Also, different processes have to use different
techniques and approaches speciﬁcally designed for them.
For example,semiconductor manufacturing process should
consider the spatial, temporal and hierarchical properties in
manufacturing process as the existing algorithms doesn‘t
suit well for them. Speciﬁc solutions can be proposed for
different product development industries [55]. More work
on implementing the machine learning algorithms in differ-
ent areas of marketing and integrating them together can be
a scope of work for researchers [45].
4.5 Transportation
‘Smart city’ is a diction that is of common talk in today‘s
world. A smart city uses the information collected from
sensors operating over the cities to help in the administra-
tion of the cities. Many research works are carried out in
this area. Intelligent transportation systems is required for
building smart city. Sensors generate lot of information
that require big data solutions for processing and predic-
tion. Predictive analytics using big data solutions for trans-
portation has lot of applications like predicting the trafﬁc
and controlling it efﬁciently [31], [32], [33],predicting the
travel demand and making effective use of the infrastruc-
ture thereby reducing the waiting time of the passengers
[34], [35], automatic control of trafﬁc signals [36] and pre-
dicting the transport mode of a person [37].
Research gaps
With the advent of many devices, lot of information
is generated in the ﬁeld of transportation from various
sources. New tools to integrate data from sensors and latest
devices to traditional data sources is required. Researchers
can focus on developing such tools [34]. The capability of
the real time trafﬁc data collection service should be im-
proved since video and image data are all collected [36].
Proper methods to handle correlations among data and un-
certainty in data is also required in this ﬁeld since the data
440 Informatica 43 (2019) 425–459 G. Nagarajan et al.
generated is temporal and spatial in nature for transporta-
tion. Readings from sensors are also uncertain [34]. An-
other area to concentrate is on using other deep learning
approaches to predict trafﬁc ﬂow for better performance.
The prediction layer used in [33] is just logistic regression.
More powerful predictors can improve performance. Data
fusion for social transportation data is still in a preliminary
stage. GPS from taxi driver only give information about
origin and destination but not on travel demand. Mobile
data are used for travel demand but can‘t estimate travel
time on roads. Hence a proper fusion approach has to
be used to integrate data from several sources.Web based
agent technology for transportation management and con-
trol is a research direction [35]. Software robots to monitor
the state of drivers, check the condition of cars, evaluate
safety environment is a research in progress [35]. There
is more scope in predicting the transport mode of a per-
son. [37] used only sensor information for classiﬁcation
whereas future work can concentrate on integrating infor-
mation from cloud too. Advanced techniques to remove
noise and outliers can be worked upon.
4.6 Other domain areas
Agriculture can be beneﬁted using predictive analytics.
Now-a-days sensors and UA Vs ﬁnds its use in agriculture.
Sensors are used to ﬁnd the effectiveness of certain type of
seed and fertilizer in different places of the farmland. Big
data solutions are used to store and analyze this informa-
tion to improve the operations in agriculture. Additional
information like predicting the effect of certain diseases on
crops helps to take precautionary measures. Farmers do
predictive analytics in agricultural big data to lower costs
and increase yields. The use of different fertilizers, pesti-
cides is used to predict the environmental effects [38], [39].
Big data prediction ﬁnds its application in banking. Analy-
sis in browsing data helps to acquire customers. Defaulters
are predicted by mining Facebook data. Banks use sen-
timent analysis to analyze the customer needs and prefer-
ences and motivate them to buy more products thereby re-
ducing the customer churn. Facebook interactions, tweets,
customer bank visits, logs, website interactions are used as
sources for sentiment analysis. A 360 degree view of cus-
tomer interactions is analyzed to prevent churning of cus-
tomers. Certain features like account balance, time since
last transaction all helps banks to frame rules and identify
the customers who are about to churn. Big data prediction
helps to identify hidden customer patterns. Large sample
of outliers are analyzed to predict fraud detection in banks
[40]. There are also few works carried out for prediction
in library big data. Libraries work with online journals and
resources which are again voluminous. Predictive analyt-
ics is used in library big data for useful extractions related
to learning analytics, research performance analytics etc.
The user search behavior, log behavior are all analyzed to
extract useful information. An other industry where pre-
dictive analytics ﬁnds its importance is telecommunication
industry. Lot of applications today like Whatsapp uses dif-
ferent kinds of data that include structured, unstructured
and semi structured. Predictive analytics in telecommuni-
cation industry focusses mainly on customer satisfaction
[41]. Predictive analytics in big data helps business too.
Big data analytics platforms of different providers help in
personalization. The extent to which they support person-
alization differs in different platforms. Many big data plat-
forms like KNIME, IBM Watson analytics are all ﬁnding
its use in personalization [42]. Prediction in big data is
used extensively in robotics ﬁeld also. Robots communi-
cate with many other systems. Sharing of information be-
tween robots and smart environments, comparing the in-
formation the robot has with other systems improves the
robot intelligence [43]. Movie industry uses big data solu-
tions for predicting the success using social media. Enor-
mous data gets accumulated in social media like Wikipedia,
Facebook and Twitter. Self-aware machines are ﬁnding its
way in industries with the help of big data and cloud com-
puting techniques. These machines are capable of moni-
toring their health condition on their own and take smart
decisions on their maintenance.
5 Comprehensive challenges
Big data has its own challenges in terms of storage, pro-
cessing, analytics etc. We restrict this paper to address the
overall challenges involved in predictive analytics on big
data and to throw light on few techniques used and state of
the art work done in handling these challenges. The over-
all challenges are categorized under six headings shown in
ﬁgure 3.
Figure 3: Comprehensive challenges of Predictive analyt-
ics on big data - taxonomy
5.1 Real-time data
Handling real time data is one among the major challenges
in predictive analytics on big data. Few predictions such
as predicting the early outbreak of the disease to take care
of public health [9], real time recommendation system for
marketing requires real time data from social sites to be
collected.
Firstly, Latency is one of the main parameter to be taken
care of when working with such data. Secondly, techniques
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 441
to handle interactive queries is important during predictions
with these data. Thirdly, predictive algorithms should be
integrated with solutions handling real time data for effec-
tive prediction.
Some big data solutions and machine learning algo-
rithms are used in handling the above challenges with real
time data. [44] states that spark and storm helps in collect-
ing the data without latency. Hadoop platform are used to
carry out Extract, transform, load(ETL) operations. Hbase,
HiveQL are used to work on interactive queries. Open
source software applications like Cassandra, MongoDB are
used to achieve scalability and performance. Apache Ma-
hout Machine learning library is used to run predictive al-
gorithms on top of Hadoop [30]. Yet these techniques
are generalized and their performance differs depending
on the nature of the data. [72] proposes a task level adap-
tive mapreduce framework for real time streaming data in
health care applications. The scaling capability is designed
at task level that improves scalability in real time streaming
data from sensors. It is proved to cope up with the velocity
of the streaming data. But this approach was tested only on
health care applications.
Few traditional data mining algorithms are also proposed
for real time data. [44] states that algorithms such as Naive
bayes can be used for sentiment analytics to extract the
words from the twitter or other sites. Logistic regression,
nearest neighbours are used for customer segmentation and
to predict the probability that the customer will click the
advertisement. [73] uses naive bayes to extract information
from tweets texts. [4] proposes the use of Extreme learn-
ing machine to enhance the speed for processing real time
data. [30] uses random forests and bayesian techniques to
predict the crash and congestion in real time trafﬁc moni-
toring. Yet, when the velocity at which the real time data
enters increases, the performance of these traditional data
mining algorithms deteriorate.
Mobile computing techniques and cloud infrastructure
are used with big data platforms and data mining algo-
rithms to handle real time data. [46] proposes a monitor-
ing platform Context-Aware platform using integrated mo-
bile services(CAPIM) to make the life of smart city eas-
ier. CAPIM collects the general trafﬁc information, stores
in the mobile device and uploads when the wi-ﬁ is avail-
able. Drivers are provided feedback about the trafﬁc in-
formation that helps to take decisions. More visualization
output is presented to the users using google maps and the
services in turn send data about his locality on his social ac-
counts like twitter etc. [31] proposes the use of techniques
like Hadoop, Hadoop Distributed File System(HDFS) and
HBase to store the trafﬁc related information. This pa-
per proposes Real time trafﬁc information(RTTI) system
for collecting and integrating information from various
sources. Massive trafﬁc data sets are utilized transparently
with the aid of cloud infrastructure. Cloud and big data is
integrated in trafﬁc ﬂow algorithms. The usage of cloud for
massive storage and the use of mapreduce techniques and
Hadoop HDFS improves the performance of data mining
algorithms in predictions with real time trafﬁc ﬂow infor-
mation. Cloud services like Watson analytics is also used
to analyze real time data from social media. Yet there are
not many platforms that integrates cloud services, big data
solutions and mobile computing. There is a need for data
fusion approaches.
Few other related works for real time data include [4]
that uses Representative streaming processing systems for
processing real time streaming data and [46] that uses Very
Fast Decision Tree (VFDT) algorithm and IBM Infosphere
streams to analyze the real time streaming population data
to predict the spread of cardio respiratory spells. [48] also
proposes a multi-dimensional fusion technique that works
on Hadoop platform with both real time and ofﬂine data.
This model suits well for satellite applications where real
time data is captured and forwarded to the ground sta-
tion for reﬁning and prediction. Energy efﬁcient memrister
based neural network is used for big data analytics. GPUs
are used instead of CPUs to improve the speed up. Recur-
rent neural networks are used since they prove to be effec-
tive for non- linear sequential processing tasks like speech
recognition, natural language processing, video indexing
etc. [69] uses convolutional neural networks with GPU ca-
pabilities to detect real time objects. [70] combines real
time measurements from real time databases with static
data and uses simple extrapolation technique for predic-
tion in substation automation. [71] proposes a compression
technique for real time analog signals. It can handle the big
data in real time instruments and optical communications.
5.2 Data integration
This section discusses about the challenges involved in in-
tegrating data of different types for prediction. Heteroge-
nous data is one of the main characteristics of big data.
Few predictive works require data from different sources
to be integrated with the existing data. For example, pre-
dictions in educational sector collects data from multiple
sources like social media, schools, district ofﬁces, univer-
sities etc. Few are structured and few are unstructured.
[20] proposes a model that integrates information from so-
cial media and predicts student involvement and success.
Since social media is used by students to share their ideas,
feedbacks about courses, sentiment analysis can be used to
solve problems by analyzing the most common feedbacks,
ideas etc [16]. Healthcare sector also requires integration
of information from different sources. [49] uses unstruc-
tured data from Google Flu trends to predict the spread of
inﬂuenza by predicting the region that are most likely to
be affected. Past data is combined with real time data for
prediction. [11] proposes a predictive model to predict Sys-
temic Lupus Erythematosus, a disease that affects multiple
organs by integrating structured data like electronic health
records, unstructured and semi structured data like imaging
and scan tests(MRI, CT, Ultrasound scan, X-ray), complete
blood count, urinalysis. [8] develops a spatial data model to
predict inﬂuenza epidemic in Vellore, India. Large repos-
442 Informatica 43 (2019) 425–459 G. Nagarajan et al.
itories of data are collected. The developed spatial model
is dependent on geographically weighted regression tech-
nique. It involves bringing together several data sources
like surveillance systems, sentinel data etc to predict the
spread of epidemics. Movie industry uses big data solu-
tions for predicting the success using social media. Enor-
mous data gets accumulated in social media like Wikipedia,
Facebook and Twitter. [50] proposed a predictive model for
predicting the movie box ofﬁce success in Wikipedia. The
major issue with heterogeneous data is that since they may
not exactly be the same, there are possible chances that the
machine learning results may be affected.
Big data platforms with simulation help to bring together
data from different sources and predict the hidden pat-
terns in them. [10] develops a predictive model using Ma-
hout’s machine library that works on top of Hadoop with
Mapreduce technology to ﬁnd out the chances of readmis-
sion after discharge of persons with heart failure. Data is
collected from various sources and integrated using ma-
hout. Mahout machine library is used for prediction that
runs mapreduce jobs on Hadoop. The model uses HiveQL
for distributed query. Data extraction and integration is
done using Hadoop, Hive and Cassandra. Random for-
est and logistic regression is used in predicting the read-
missions. Big data solutions gave good results in terms
of time efﬁciency and scalability. [23] proposes a model
for prediction in supply chain management process that
uses deduction graph model to visually link competent sets
from many data sources both structured and unstructured.
Customer preferences and new product ideas are predicted
through social media using recent shopping history. Cus-
tomer response time is improved. Big data solutions such
as Apache Mahout is used for machine learning, tableau is
used for visualization, Ionosphere for data mining etc. In-
frastructure proposed by combining deduction graph model
with data mining, proves to provide better results in sup-
ply chain management with respect to usability, feasibil-
ity etc. [51] proposes the use Mapreduce technology in
distributed data management and scheduling for heteroge-
nous environment. The paper proposes a system for so-
cial transportation. Statistical approaches, data mining al-
gorithms and visualization analytics are used according to
the type of data. It is implemented in hadoop platform but
used with other frameworks also like cascading, sailﬁsh,
Disco, Skynet, Filemap, Themis etc. High level languages
like PigLatin, HiveQL are used for the technology. Hence,
very few tools and techniques are available to integrate data
from different sources and the advent of devices like sen-
sors and RFIDs kindles the requirement for new tools and
techniques.
Oncology can help integration easier in healthcare sec-
tor. Radiation oncology ontology is a key component used
in data collecting system for better interpretability. Radi-
ation oncology requires aggregation of input from many
origins like scans, images and EHR of patients. [52] car-
ried out a survey that explains about the state of art and
future prospects of using machine learning algorithms and
big data in radiation oncology. [9] states that building an
integrative model for radiation oncology to be used as deci-
sion support system will be of great help for clinicians. But
the research works in oncology is in its very initial stage.
Data warehousing is another concept in which data are
integrated from heterogeneous sources. Few examples in-
clude [275] that proposes a dimensional warehouse for in-
tegrating data from clinical trials, [276] that proposes an
architecture for a data warehouse model to integrate health
data from different sources etc. The major drawback with
data warehousing with big data is that technologies like
hadoop, mapreduce etc. has to be integrated with the data
warehouse to support the big data requirements. Moreover
the ETL operations have to be designed in the architecture
to suit big data requirements. There is also no standard
architectural framework to design data warehouse for big
data and hence the existing architectures designed to suit
particular problems lacks ﬂexibility [277]. Concepts such
as data lakes are also of use as they are non relational ap-
proaches to integrate different types of data from heteroge-
neous resources. They postpone data mapping until query
time. But, the query and reporting capabilities of data lakes
are still emerging. They are not as powerful as SQL on re-
lational databases [275]. Few research works such as [278]
are proposed that manages the metadata effectively in data
lakes to address the big data requirements.
Few related works on data integration for big data in-
cludes [74] for their work on a new resource ’Diseases’
that aims to ﬁnd the associations between genes and dis-
eases by integrating automated text mining with existing
datasets, [75] for their data integration method ’Optique’
based on ontology that integrates streaming and static data
as an abstraction layer, [76] for their work on API centric
linked data integration that discusses about the use of API
to extend the classical linked data application architecture
to facilitate data integration, [78] for their work to propose
a new approach for integration based on semantics. This
approach converts the data sources in different formats to
nested relational models initially and then imports only a
subset of large datasets to build the model thereby coping
up with the data size problem. A review of data integra-
tion in life sciences along with its challenges are discussed
in [77]. [79] proposes the use of fused lasso approach in
regression coefﬁcients to learn heterogeneity when com-
bining data from heterogeneous sources.
5.3 Data complexity
Large complex datasets such as genomic datasets have to be
dealt in few predictive tasks. Such complex datasets make
the predictive task difﬁcult. Firstly, storage and processing
of such complex data becomes a problem. Secondly, under-
standing such complex data to build predictive models need
to be taken care of. Thirdly, either enhancements in exist-
ing data mining algorithms or new data mining algorithms
have to be proposed to handle such complex datasets.
[44] states that storage techniques like HDFS, apache
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 443
Hadoop helps in storing and processing the big data effec-
tively. Using distributed platform like mapreduce prevents
the over utilization of the resources. A detailed survey on
map reduce technology is discussed in [6]. Cloud platforms
also help in handling complex data. Cloud platforms in
health care like “PCS-on-demand”are found to be effective
in storing and sharing of healthcare information with its
cloud infrastructure. Mapreduce is used to parallelize the
processing. Mahout library is used for machine learning
algorithm to process the images, signals and genomic data.
MongoDB is used for storage because of its high avail-
ability, performance and easy scalability [45]. [25] pro-
poses the use of Microsoft Azure, a cloud platform for data
storage in inventory management for supply chain process.
Data sources for inventory management are internal like
RFID, sensors etc. They generate lot of data and big data
solutions help in processing them efﬁciently. NoSQL is
used for data access. Batch analytics is done using Apache
Hadoop. [32] proposes a model that sends driving guidance
to vehicles with cloud computing technique incorporated to
big data. Big data solutions [53] are used for spatial data
analytics and Cloud solutions in big data platforms are used
for predictive analytics in tactical big data [54].
Visualization analytics and clustering helps in under-
standing complex datasets. [12] develops a predictive
model for diabetic data analysis in big data. Association
rule mining is used to ﬁnd association between the labo-
ratory results and diabetes type of the patient. Clustering
of similar patterns and classiﬁcation of health risk value
by patients health condition is done and predeﬁned deduc-
tive rules are derived to predict the diabetes. The predic-
tive model uses Hadoop/mapreduce environment for paral-
lel processing. Visualization also helps in predicting hid-
den insights from the data. [13] proposes a model for
hospital management. The temporal information helps to
understand the clinical activities. Proper visualization and
clustering of this temporal data helps to understand the ab-
normalities. [55] proposes an optimization framework for
wafer quality prediction in semiconductor manufacturing
process that uses clustering to identify similar behavior pat-
tern over time for chambers. [56] presents a survey on clus-
tering time series data. Abnormality can also be discovered
thereby helping quality control and fault diagnostics. But
since time series clustering is mostly for unstructured data,
a co-clustering pattern is formulated for this problem with
constraints to match the tools and the chambers. Visual-
izing the data effectively helps in prediction. Aggregation
and multi-dimensional analysis is also used in big data to
extract knowledge from them. AsterixDB, DGFIndex are
used that helps in aggregation and multi-dimensional anal-
ysis for big data [57]. Yet, new frameworks for visualiza-
tion techniques and multi-dimensional analysis need to be
developed exclusively for big data.
Few data mining algorithms are used in complex datasets
such as genomic data and signals. [47] uses Nearest Cen-
troid Based Classiﬁer (NCBC) to predict clinical outcome
related to cancer using gene expression data.[58] uses Mul-
tipartiate graph for prediction in genomics big data. Deep
learning also helps in handling complex data. Lot of re-
search works are carried out in clinical image processing
with deep learning. [33] uses deep learning for predicting
trafﬁc ﬂow. A stacked autoencoder model trained greedily
learns trafﬁc ﬂow features. It uses spatial and temporal cor-
relations.The medical data including cardiac MRI involves
signal processing. There are many statistical learning tools
for signal processing. [59] uses signal processing tech-
niques such as kernel based interpolators and timely ma-
trix decompositions for big data. Yet, more research works
are required for signal processing with big data and on ge-
nomic data analysis.
Few related works to handle complex data include [80]
for their work on anytime density based clustering for com-
plex data to improve scalability, [81] for their work on
interactive data visualization to understand complex data,
[83] for their work to propose a Pairwise weighted ensem-
ble clustering algorithm to cluster complex data for bet-
ter understanding, [82] for their work to address scalability
problem in complex data by proposing two suboptimal al-
gorithms to address casting complex problem of L1 Norm
principal component analysis, [84] for their work to de-
velop complexheatmap package that helps in visualizing
and revealing the patterns in high dimensional genomic
data.
5.4 Data quality
Low quality data is another challenge in predictive analyt-
ics. The source data in some applications like from emer-
gency care, sensors may be of poor quality. Predictive
analytics techniques on such low quality data need some
sophisticated techniques to be applied on the existing al-
gorithms. Low quality data may also be due to the fact
that some predictive works fail to consider important at-
tributes required for prediction. For example, [8] predicts
the spread of the disease using environmental attributes
but fail to consider the biological and socio-behavioral at-
tributes. Data may also be incomplete and inconsistent in
certain cases.
Generally techniques like Mathematical or logical re-
gression might work for low quality data [60]. [4] states
that advanced deep learning methods, statistical learning
theory of sparse matrix are used to overcome the challenge
of incomplete and inconsistent data. Techniques like Wat-
son analytics are used to overcome the challenge of low
value density data. More works on identifying proper cor-
relations among inconsistent data is required.
Proper dimensionality reduction techniques and feature
selection also help to improve the data quality. [9] states
that merely knowledge of domain experts are used for fea-
ture selection in most of the predictive works.
The nature of data differs depending on the application.
For example,semiconductor manufacturing process should
consider the spatial, temporal and hierarchical properties in
manufacturing process as the existing algorithms doesn‘t
444 Informatica 43 (2019) 425–459 G. Nagarajan et al.
suit well for them. Speciﬁc solutions can be proposed for
different domains depending on the nature of the data [54].
Few related works on predictive analytics with low qual-
ity data include [85] that proposes an extension to likeli-
hood method to handle low quality data, [86] to propose a
method that uses data mining tasks such as clustering to ex-
tract patterns from noisy data in market segmentation, [87]
to propose a new algorithm based on C4.5 decision tree
that uses imprecise probabilities in classifying noisy data,
[88] that proposes a new algorithm for extreme machine
learning to work efﬁciently in the presence of outliers. [89]
proposes a hybrid feature selection scheme to reduce the
performance deterioration caused by outliers in predictive
analytics.
5.5 Computing speed
Computing speed is one of the important challenges to be
handled during predictions on big data. Most of the wear-
able devices consume more power and the algorithms used
on them are computationally intensive. Computing speed
of predictive algorithms on big data also increases due to
its volume.
Parallel computing techniques help to overcome the
challenge with respect to volume and computing speed. [4]
proposes the use of alternating direction method of mul-
tipliers to overcome the challenge with respect to volume
since it acts as a platform for distributed frameworks with
parallel computing. Mapreduce is used to work parallel on
the chunks of the big data. [63] proposes a data mining al-
gorithm K Nearest neighbor based on Spark(KNN-IS) for
classiﬁcation in big data. The algorithm uses Mapreduce
technology for parallel processing of the training data set.
Though Hadoop works well with mapreduce, it has its own
limitations like latency which is overcome by spark’s in-
memory computations. Map reduce is used on spark for
KNN to yield better results in terms of time and accuracy.
Resilient distributed databases are used on spark platform.
Sometimes medium quality predictions with low latency
perform better than high quality predictions with more la-
tency. [90] proposes parallel random forest algorithm in
spark cloud computing platform to improve the computa-
tional efﬁciency of big data analytics. A parallel version of
deep neural network training is proposed in [91].
Feature selection techniques like Representation learn-
ing, deep learning and dimensionality reduction are also
used to reduce the computing speed since the unnecessary
features are eradicated. Yet, the computational complex-
ity of certain feature selection techniques like wrapper ap-
proaches is high and researchers are working on it. [92]
proposes a hierarchical attribute reduction algorithm using
mapreduce in which attribute reduction process is executed
in parallel. [93] proposes fast minimum redundancy max-
imum relevance algorithm for feature selection in high di-
mensional data.
Fuzzy techniques aid in reducing the processing time.
Researchers worked towards reducing the time for pro-
cessing using fuzzy rules on data with lot of input fea-
tures. An algorithm named Chi-Fuzzy rule based clas-
siﬁcation systems(Chi-FRBCS) is proposed. It works on
Mapreduce framework and uses linguistic fuzzy rules. Two
versions of Chi-FRBCA algorithms are proposed - Chi-
FRBCS BigData-Max and Chi-FRBCS BigData-Ave. Ex-
periments are conducted on six different big data problems
set and Chi-FRBCS is proved to be effective in terms of
processing time and accuracy [61]. [62] proposed a big
data algorithm called FMM based on fuzzy rules for sen-
timent analysis in social media. A parallelized algorithm
FMM with mapreduce is also used and that proves to be
effective in terms of accuracy compared to the other tech-
niques. The algorithm is made to work on twitter data and
is observed that the execution time is much lesser for big
data.
[64] states that the hardware solution is effective in terms
of energy savings, power efﬁciency. Scientiﬁc applications
use multi-dimensional data sets. Processing has to be faster
compared with other applications because of the velocity
at which it arrives. For example, predicting climate change
requires fast processing. [65] proposes the use of an I/O
in-memory server for scientiﬁc big data analytics applica-
tions. [37] proposes SVM polynomial degree 3 kernel to
reduce the computational complexity and memory require-
ment during classiﬁcation. This model detects the transport
mode of a person whether he or she is walking, jogging or
going in bike. Hardware changes are also done by using
virtual gyroscope, accelerometer and few other hardware
devices to ensure that low power consuming devices are
used. The memory consumed by the algorithm is less.
5.6 Class imbalance
Class imbalance is another problem in certain predictive
works. Techniques like oversampling and undersampling
are used in class imbalance problems.
Some machine learning algorithms are effective in solv-
ing class imbalance problems. [66] proposes an algorithm
ROSEFW-RF for contact map prediction. It is a classiﬁca-
tion task related to protein structure where there are very
few positive samples available. This algorithm is based
on key-value pairs and uses mapreduce approach for dis-
tributed processing. Predictive model is constructed using
random forest. The class distribution is balanced through
random oversampling. Irrelevant features are removed
through feature weighing. Oversampling is found to be
more robust than undersampling and cost sensitive learning
when number of maps is increased in mapreduce technique.
The test data is classiﬁed. Experiments are conducted in
bioinformatics data and ROSEFW-RF algorithm is the win-
ner algorithm for imbalance big data classiﬁcation prob-
lem. [67] proposes Random forest with Mapreduce for
prediction on imbalanced big data. Five different ver-
sions of random forest algorithms are used in imbalanced
big data classiﬁcation with Mapreduce approach. - RF
- BigData, RF-BigDataCS, ROS+RF-BigData, RUS+RF-
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 445
BigData, SMOTE+RF-BigData. Random forest techniques
is found to work well for imbalanced big data classiﬁcation.
Few other related works on class imbalance problem in-
clude [94] that proposes an ensemble method to handle
both online learning and imbalance learning using over-
sampling and undersampling bagging techniques, [95] that
proposes a diversity based oversampling approach to cre-
ate new instances for minority class, [96] that proposes a
new ensemble method ’hardensemble’ to handle class im-
balance, [97] that uses extreme learning machine to handle
class imbalance problem both at feature and algorithmic
levels.
Class imbalance problem is another area that require re-
searcher’s attention. A mixed strategy of oversampling and
undersampling can be tried to boost performance [66].
5.7 Other challenges
Few other challenges include privacy issues, lack of proper
training to the domain experts, social and ethical challenges
etc. For example, the decision support system in health-
care should be developed considering both the patient’s and
physician’s perspectives as patients’ acceptance is very im-
portant for these systems. It can be of greater importance
for emergency care units. Privacy issues, proper training
for clinicians, quality of data are all few issues to be con-
sidered in this case [60]. In education sector, since the
student‘s and institution‘s individual data are used for pre-
diction purposes, many students and institutions might not
want their data to be exposed [68]. Hiring data scientists
with domain knowledge helps the industries to move to-
wards big data solutions for efﬁcient supply chain manage-
ment [22].
6 Potential research directions
From the above discussions on predictive analytics with big
data, it has been observed that this ﬁeld is readily open for
researchers. The potential research directions are summa-
rized.
6.1 Data management
– Since real time data includes the collection of lot of
video and image data in the present scenario, there is
a need to improve the real time data collection service.
Researchers can work on the same.
– Developing framework fusion approaches to integrate
big data solutions, cloud computing techniques and
mobile computing techniques is another area for re-
search as there is no single technology available to ad-
dress all the big data requirements. Researchers can
concentrate to work on a single technology that can
address all the requirements to support predictive an-
alytics.
– With the advent of many devices, lot of information is
generated in different domains from various sources.
New tools to integrate data from latest devices to ex-
isting data sources is required. Researchers can focus
on developing such tools.
– Data partitioning methods, indexing and multidimen-
sional analysis on big data are few other topics for re-
searchers.
– Scalability is an important area to concentrate in fu-
ture as massive amounts of data is generated.
6.2 Algorithms and solutions
– Enhancements in traditional data mining algorithms to
handle and analyze real time data or developing new
algorithms for the same is another promising research
area available for researchers.
– Identifying the dependencies and semantic features
among heterogeneous data types is still an open chal-
lenge requiring favorable solutions due to the biased
view of data distribution.
– New visualization techniques and frameworks can be
developed for effective interpretation of complex data.
– Genomic data analytics is in its initial stage. Works
such as functional path analysis on genomic data is an
area open for researchers.
– Data compression methods is also a challenge. Big
data compression methods like FPGA, lossy image
compression, might not suit well for certain types of
big data like medical images as they shouldn‘t lose
any information. Hence there is a need for new data
compression methods exclusively for speciﬁc big data
types.
– Oncology, semantic web are all areas to be concen-
trated in machine learning for big data.
– Establishing correlations among uncertain data, tem-
poral and spatial data is another area to work on for
researchers.
– Overﬁtting still remains as an open issue and re-
searchers can focus on developing better solutions
without much compromise on accuracy and cost.
– Using ensemble techniques in mapreduce platform is
another area that can be concentrated on to improve
accuracy
– Speciﬁc solutions can be proposed for different do-
mains depending on the nature of the data.
– Class imbalance problem is another area that require
researcher’s attention. A mixed strategy of oversam-
pling and undersampling can be tried to boost perfor-
mance
446 Informatica 43 (2019) 425–459 G. Nagarajan et al.
7 Conclusion
The paper provides an overview of core predictive models
and their challenges on big data. We discussed the scope of
predictive analytics on big data generated across different
domains and few research gaps are identiﬁed. Though the
mathematical approaches may not suit well for big data, we
found that the data mining approaches and machine learn-
ing techniques used for prediction have their base from the
mathematical approaches. The choice of predictive model
depends on the nature of application and data in hand. Fi-
nally we presented comprehensive challenges of predictive
analytics on big data and state of the art techniques used
to address the challenges. Based on our discussion, we
also presented a separate section on future directions for
research.
References
[1] Chiang, Roger H.L and Goes, Paulo and Stohr, Edward
A (2012) Business intelligence and analytics education,
and program development: A unique opportunity for
the information systems discipline, ACM Transactions
on Management Information Systems (TMIS),https:
//doi.org/10.1145/2361256.2361257.
[2] Jain, A. K. and Murty, M. N. and Flynn, P. J
(1999) Data Clustering: A Review, ACM Comput.
Surv., 264–323, https://doi.org/10.1145/
331499.331504.
[3] Benjamins, Richard.V (2014) Big Data: from Hype
to Reality?, Proceedings of the 4th International Con-
ference on Web Intelligence, Mining and Semantics
(WIMS14), ACM, https://doi.org/10.1145/
2611040.2611042.
[4] Qiu, Junfei and Wu, Qihui and Xu, Yuhua and Feng,
Shuo (2016) A survey of machine learning for big
data processing, EURASIP Journal on Advances in
Signal Processing, https://doi.org/10.1186/
s13634-016-0382-7.
[5] Chen, Ju-Chin and Liu, Chao-Feng (2015) Visual-
based Deep Learning for Clothing from Large Database,
Proceedings of the ASE BigData and SocialInformatics,
ACM.
[6] Sakr, Sherif and Liu, Anna and Fayoumi, Ayman
G (2013) The Family of Mapreduce and Large-scale
Data Processing Systems, ACM Comput. Surv., ACM,
https://doi.org/10.1201/b17112-3.
[7] Hung, San-Chuan and Kuo, Tsung-Ting and Lin,
Shou-De (2015) Novel Topic Diffusion Prediction Us-
ing Latent Semantic and User Behavior,Proceedings of
the ASE BigData and SocialInformatics, ACM.
[8] D. Lopez and M. Gunasekaran and B. S. Murugan and
H. Kaur and K. M. Abbas (2014) Spatial big data ana-
lytics of inﬂuenza epidemic in Vellore, India,IEEE In-
ternational Conference on Big Data, https://doi.
org/10.1109/BigData.2014.7004422.
[9] Curran, Martina and Howley, Enda and Duggan, Jim
(2016) An Analytics Framework to Support Surge Ca-
pacity Planning for Emerging Epidemics,Proceedings
of the 6th International Conference on Digital Health
Conference, ACM, https://doi.org/10.1145/
2896338.2896354.
[10] Zolfaghar, Kiyana and Meadem, Naren and Tere-
desai, Ankur and Roy, Senjuti Basu and Chin, Si-
Chi and Muckian, Brian (2013) Big data solutions
for predicting risk-of-readmission for congestive heart
failure patients,IEEE International Conference on Big
Data, https://doi.org/10.1109/BigData.
2013.6691760.
[11] S. Gomathi and V . Narayani (2015) Implement-
ing Big Data analytics to predict Systemic Lupus
Erythematosus,International Conference on Innova-
tions in Information, Embedded and Communication
Systems (ICIIECS),https://doi.org/10.1109/
ICIIECS.2015.7192893.
[12] Eswari, T and Sampath, P and Lavanya, S and
others (2015) Predictive Methodology for Diabetic
Data Analysis in Big Data,Procedia Computer Sci-
ence, pp:203–208, https://doi.org/10.1016/
j.procs.2015.04.069
[13] Tsumoto, Shusaku and Hirano, Shoji (2015) Analyt-
ics for Hospital Management,Proceedings of the ASE
BigData and SocialInformatics, ACM.
[14] Fang, Ruogu and Pouyanfar, Samira and Yang, Yimin
and Chen, Shu-Ching and Iyengar, S.S (2016) Compu-
tational health informatics in the big data age: a sur-
vey,ACM Computing Surveys (CSUR), pp.12, https:
//doi.org/10.1145/2932707.
[15] The center for Digital education (2015) : Big data in
education report. Technical Report, https://blog.
stcloudstate.edu.
[16] Sin, Katrina and Muthu, Loganathan (2015) Appli-
cation of big data in education data mining and learn-
ing analytics?A literature review,ICTACT Journal on
Soft Computing, pp.1–035,https://doi.org/10.
21917/ijsc.2015.0145.
[17] Oracle (2015) Big data education. Technical Report,
www.oracle.com.
[18] Jo, Il-Hyun and Kim, Dongho and Yoon, Meehyun
(2014) Analyzing the Log Patterns of Adult Learners
in LMS Using Learning Analytics,Proceedings of the
Fourth International Conference on Learning Analytics
And Knowledge, ACM, pp.183–187, https://doi.
org/10.1145/2567574.2567616.
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 447
[19] Wang, Yinying (2016) Big opportunities and
big concerns of big data in education,TechTrends,
Springer, pp.1–4 , https://doi.org/10.1007/
s11528-016-0072-1.
[20] Niemi, Gitin and David, Elena (2012) Using Big Data
to Predict Student Dropouts: Technology Affordances
for Research,International Association for Development
of the Information Society.
[21] Kellen, V and Consortium, Cutter and Recktenwald,
A and Burr, S (2013) Applying big data in higher edu-
cation: A case study,Data Insight and Social BI.
[22] Accenture (2014) Accenture-Global-Operations-
Megatrends-Study-Big-Data-Analytics. Technical
Report,https://acnprod.accenture.com.
[23] Tan, Kim Hua and Zhan, YuanZhu and Ji, Guo-
jun and Ye, Fei and Chang, Chingter (2015) Har-
vesting big data to enhance supply chain innova-
tion capabilities: An analytic infrastructure based on
deduction graph,International Journal of Production
Economics, pp:223–233, https://doi.org/10.
1016/j.ijpe.2014.12.034.
[24] Rozados,Ivan Varela and Tjahono, Benny (2014) Big
Data Analytics in Supply Chain Management: Trends
and Related Research,6th International Conference on
Operations and Supply Chain Management.
[25] J.Leveling and M. Edelbrock and B. Otto (2014) Big
data analytics for supply chain management,IEEE Inter-
national Conference on Industrial Engineering and En-
gineering Management, pp:918-922, https://doi.
org/10.1109/IEEM.2014.7058772.
[26] Wang, Gang and Gunasekaran, Angappa and Ngai,
Eric WT and Papadopoulos, Thanos (2016) Big
data analytics in logistics and supply chain manage-
ment: Certain investigations for research and ap-
plications,International Journal of Production Eco-
nomics, Elsevier, pp:98–110, https://doi.org/
10.1016/j.ijpe.2016.03.014.
[27] Waller, Matthew A and Fawcett, Stanley E (2013)
Data science, predictive analytics, and big data: a rev-
olution that will transform supply chain design and
management,Journal of Business Logistics, pp:77–84,
https://doi.org/10.1111/jbl.12010.
[28] Wilschut, Tim and Adan, Ivo JBF and Stokker-
mans, Joep (2014) Big data in daily manufacturing
operations,Proceedings of the Winter Simulation Con-
ference, pp:2364–2375, https://doi.org/10.
1109/WSC.2014.7020080.
[29] He, Miao and Ji, Hao and Wang, Qinhua and Ren,
Changrui and Lougee, Robin (2014) Big data fueled
process management of supply risks: sensing, predic-
tion, evaluation and mitigation,Proceedings of the Win-
ter Simulation Conference, pp:1005–1013, https://
doi.org/10.1109/WSC.2014.7019960.
[30] Intel (2013) Predictive analytics and interactive
queries on big data.Technical Report, https://
software.intel.com.
[31] Shi, Qi and Abdel-Aty, Mohamed (2015) Big
data applications in real-time trafﬁc operation and
safety monitoring and improvement on urban ex-
pressways,Transportation Research Part C: Emerging
Technologies, pp:380–394,https://doi.org/10.
1016/j.trc.2015.02.022.
[32] Yu, Jianjun and Jiang, Fuchun and Zhu, Tongyu
(2013) RTIC-C: a big data system for massive trafﬁc
information mining,International Conference on Cloud
Computing and Big Data (CloudCom-Asia), IEEE,
pp:395–402.
[33] Y . Lv and Y . Duan and W. Kang and Z. Li and F. Y .
Wang (2015) Trafﬁc Flow Prediction With Big Data: A
Deep Learning Approach,IEEE Transactions on Intelli-
gent Transportation Systems, pp:865-873.
[34] Toole, Jameson L and Colak, Serdar and Sturt,
Bradley and Alexander, Lauren P and Evsukoff, Alexan-
dre and Gonzalez, Marta C (2015) The path most
traveled: Travel demand estimation using big data
resources,Transportation Research Part C: Emerging
Technologies, pp:162–177,https://doi.org/10.
1016/j.trc.2015.04.022.
[35] Yuan, Nicholas Jing and Zheng, Yu and Zhang,
Liuhang and Xie, Xing (2013) T-ﬁnder: A rec-
ommender system for ﬁnding passengers and vacant
taxis,IEEE Transactions on Knowledge and Data En-
gineering, pp:2390–2403, https://doi.org/10.
1109/TKDE.2012.153.
[36] Wang, Chao and Li, Xi and Zhou, Xuehai and
Wang, Aili and Nedjah, Nadia (2016) Soft computing in
big data intelligent transportation systems,Applied Soft
Computing, Elsevier, pp:1099–1108, https://doi.
org/10.1016/j.asoc.2015.06.006.
[37] Yu, Meng-Chieh and Yu, Tong and Wang, Shao-Chen
and Lin, Chih-Jen and Chang, Edward Y (2014) Big
data small footprint: the design of a low-power classi-
ﬁer for detecting transportation modes,Proceedings of
the VLDB Endowment, VLDB Endowment, pp:1429–
1440, https://doi.org/10.14778/2733004.
2733015.
[38] Stubbs, Megan (2016) Big Data in U.S Agricul-
ture.Technical Report.
[39] Sangwani, G (2016) The Big Future of Big
Data,Business Insider, India.Technical Report.
[40] Martens (2016) Financial forum.Technical Report,
https://www.financialforum.be.
[41] Malaka, Iman and Brown, Irwin (2015) Challenges
to the Organisational Adoption of Big Data Analyt-
ics: A Case Study in the South African Telecom-
448 Informatica 43 (2019) 425–459 G. Nagarajan et al.
munications Industry,Proceedings of the 2015 An-
nual Research Conference on South African Insti-
tute of Computer Scientists and Information Technolo-
gists, ACM, pp:27:1–27:9, https://doi.org/10.
1145/2815782.2815793.
[42] Lopes, claudio and Cabral, Bruno and Bernardino,
Jorge (2016) Personalization Using Big Data Analytics
Platforms,Proceedings of the Ninth International Con-
ference on Computer Science & Software Engineering,
ACM, pp:131–132.
[43] Felzmann, Heike and Beyan, Timur and Ryan,
Mark and Beyan, Oya (2016) Implementing an Eth-
ical Approach to Big Data Analytics in Assistive
Robotics for Elderly with Dementia,SIGCAS Comput.
Soc., ACM, pp:280–286, https://doi.org/10.
1145/2874239.2874279.
[44] Belle, Ashwin and Thiagarajan, Raghuram and
Soroushmehr, S.M.Reza and Navidi, Fatemeh and
A.Beard, Daniel and Najarian, Kayvan (2015) Big data
analytics in healthcare,BioMed research international,
https://doi.org/10.1155/2015/370194.
[45] EVRY (2014) Big data - white paper.Technical Re-
port,www.evry.com.
[46] Dobre, C and Xhafa, F (2014) Intelligent services
for big data science,Future Generation Computer Sys-
tems, pp:267–281, https://doi.org/10.1016/
j.future.2013.07.014.
[47] Herland, Matthew and Khoshgoftaar, Taghi M.
and Wald, Randal (2014) A review of data min-
ing using big data in health informatics,Journal Of
Big Data, pp:1–35,https://doi.org/10.1186/
2196-1115-1-2.
[48] Ahmad, Aswais and Paul, Anand and Rathore,
Mazhar and Chang, Hangbae (2016) An efﬁcient mul-
tidimensional big data fusion approach in machine-to-
machine communication,ACM Transactions on Embed-
ded Computing Systems (TECS).
[49] Davidson, Michael W and Haim, Dotan A. and Radin,
Jennifer M (2015) Using networks to combine big data
and traditional surveillance to improve inﬂuenza pre-
dictions,Scientiﬁc reports, https://doi.org/10.
1038/srep08154.
[50] Mestyan, Marton and Yasseri, Taha and Kertesz,
Janos (2013) Early prediction of movie box ofﬁce
success based on Wikipedia activity big data,PloS one,
https://doi.org/10.1371/journal.pone.
0071226.
[51] Zheng, Xinhu and Chen, Wei and Wang, Pu and
Shen, Dayong and Chen, Songhang and Wang, Xiao
and Zhang, Qingpeng and Yang, Liuqing (2016) Big
data for social transportation, IEEE Transactions on In-
telligent Transportation Systems, pp:620–630,https:
//doi.org/10.1109/TITS.2015.2480157.
[52] Bibault,Jean Emmanuel and Giraud, Philippe and
Burgun, Anita (2016) Big Data and machine learn-
ing in radiation oncology: State of the art and fu-
ture prospects,Cancer letters, https://doi.org/
10.1016/j.canlet.2016.05.033.
[53] Chen, Xin and V o, Haong and Aji, Ablimit and Wang,
Fusheng (2014) High performance integrated spatial big
data analytics,Proceedings of the 3rd ACM SIGSPATIAL
International Workshop on Analytics for Big Geospatial
Data, ACM, pp. 11–14.
[54] Savas, Onur and Sagduyu, Yalin and Deng, Julia and
Li, Jason (2014) Tactical Big Data Analytics: Chal-
lenges, Use Cases, and Solutions,SIGMETRICS Per-
form. Eval. Rev., ACM, pp.86–89, https://doi.
org/10.1145/2627534.2627561.
[55] Zhu, Yada and Xiong, Jinjun (2015) Modern Big
Data Analytics for Old-fashioned Semiconductor Indus-
try Applications,Proceedings of the IEEE/ACM Interna-
tional Conference on Computer-Aided Design, pp.776–
780.
[56] Liao, T Warren (2005) Clustering of time series data
survey,Pattern recognition, pp.1857–1874, https://
doi.org/10.1016/j.patcog.2005.01.025.
[57] Cuzzocrea, Alfredo (2015) Aggregation and multi-
dimensional analysis of big data for large-scale scien-
tiﬁc applications: models, issues, analytics, and be-
yond,Proceedings of the 27th International Confer-
ence on Scientiﬁc and Statistical Database Manage-
ment, ACM, pp. 23,https://doi.org/10.1145/
2791347.2791377.
[58] Phillips, Charles A. and Wang, Kai and Bubier,
Jason and Baker, Erich J. and Chesler, Elissa J.
and Langston, Michael A. (2015) Scalable Multipar-
tite Subgraph Enumeration for Integrative Analysis
of Heterogeneous Experimental Functional Genomics
Data,Proceedings of the 6th ACM Conference on Bioin-
formatics, Computational Biology and Health Infor-
matics, ACM, pp.626–633,https://doi.org/10.
1145/2808719.2812595.
[59] Slavakis, Konstantinos and Giannakis, Georgios B
and Mateos, Gonzalo (2014) Modeling and optimiza-
tion for big data analytics:(statistical) learning tools for
our era of data deluge,IEEE Signal Processing Mag-
azine, pp.18–31, https://doi.org/10.1109/
MSP.2014.2327238.
[60] Alexander T. Janke and Daniel L. Overbeek and
Keith E. Kocher and Phillip D. Levy (2016) Explor-
ing the Potential of Predictive Analytics and Big Data
in Emergency Care,Annals of Emergency Medicine,
pp.227 - 236, https://doi.org/10.1016/j.
annemergmed.2015.06.024.
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 449
[61] Victoria Lopez and Sara del Rio and Jose Manuel
Benitez and Francisco Herrera (2015) Cost-sensitive lin-
guistic fuzzy rule based classiﬁcation systems under the
MapReduce framework for imbalanced big data,Fuzzy
Sets and Systems, pp.5 - 38, https://doi.org/
10.1016/j.fss.2014.01.015.
[62] Bing, Li and Chan, Keith C.C (2014) A fuzzy logic
approach for opinion mining on large scale twitter
data,Proceedings of the 2014 IEEE/ACM 7th Interna-
tional Conference on Utility and Cloud Computing , pp.
652–657.
[63] Jesus Maillo and Sergio Ramirez and Isaac Triguero
and Francisco Herrera (2016) kNN-IS: An Iterative
Spark-based design of the k-Nearest Neighbors classi-
ﬁer for big data,Knowledge-Based Systems.
[64] Wang, Yu and Li, Boxun and Luo, Rong and Chen,
Yiran and Xu, Ningyi and Yang, Huazhong (2014)
Energy efﬁcient neural networks for big data analyt-
ics,Design, Automation & Test in Europe Conference &
Exhibition (DATE), pp:1–2.
[65] Elia, Donatello and Fiore, Sandro and D’Anca,
Alessandro and Palazzo, Cosimo and Foster, Ian and
Williams, Dean N (2016) An in-memory based frame-
work for scientiﬁc data analytics,Proceedings of the
ACM International Conference on Computing Frontiers,
ACM, pp. 424–429.
[66] Triguero, Isaac and del Rio, Sara and Lopez, Vic-
toria and Bacardit, Jaume and Benitez, Jose M and
Herrera, Francisco (2015) ROSEFW-RF: the winner
algorithm for the ECBDL14 big data competition:
an extremely imbalanced big data bioinformat-
ics problem,Knowledge-Based Systems, pp.69–79,
https://doi.org/10.1016/j.knosys.
2015.05.027.
[67] Sara del Rio and Victoria Lopez and Jose Manuel
Benitez and Francisco Herrera (2014) On the use of
MapReduce for imbalanced big data using Random
Forest,Information Sciences, pp.112 - 137, https://
doi.org/10.1016/j.ins.2014.03.043.
[68] Johnson, Jeffrey Alan (2014) The ethics of big data
in higher education,International Review of Information
Ethics, pp.4–9.
[69] Ren, Shaoqing and He, Kaiming and Girshick, Ross
and Sun, Jian (2015) Faster R-CNN: Towards Real-Time
Object Detection with Region Proposal Networks, Ad-
vances in Neural Information Processing Systems 28,
pp.91–99.
[70] S. S. Biswas, A. K. Srivastava and D. Whitehead
(2015) A Real-Time Data-Driven Algorithm for Health
Diagnosis and Prognosis of a Circuit Breaker Trip As-
sembly, IEEE Transactions on Industrial Electronics,
vol. 62, no. 6, pp. 3822-3831, https://doi.org/
10.1109/TIE.2014.2362498.
[71] Mohammad H. Asghari and Bahram Jalali (2014) Ex-
perimental demonstration of optical real-time data com-
pression, Applied Physics Letters.
[72] Fan Zhang, Junwei Cao, Samee U. Khan, Keqin Li,
Kai Hwang (2015) A task-level adaptive MapReduce
framework for real-time streaming data in healthcare ap-
plications, Future Generation Computer Systems, Vol-
umes 43-44, pp. 149-160, https://doi.org/10.
1016/j.future.2014.06.009.
[73] Yiming Gu, Zhen (Sean) Qian, Feng Chen (2016)
From Twitter to detector: Real-time trafﬁc incident
detection using social media data,Transportation Re-
search Part C: Emerging Technologies, Volume 67, pp.
321-342, https://doi.org/10.1016/j.trc.
2016.02.011.
[74] Sune Pletscher-Frankild, Albert Palleja, Kalliopi
Tsafou, Janos X. Binder, Lars Juhl Jensen (2015)
DISEASES: Text mining and data integration of
disease?gene associations,Methods, Volume 74, pp.
83-89, https://doi.org/10.1016/j.ymeth.
2014.11.020.
[75] Evgeny Kharlamov, Sebastian Brandt, Ernesto
Jimenez, Ruiz, Yannis Kotidis, Steffen Lamparter, The-
oﬁlos Mailis, Christian Neuenstadt, Ozgur L. Oz-
cep, Christoph Pinkel, Christoforos Svingos, Dmitriy
Zheleznyakov, Ian Horrocks, Yannis E. Ioannidis
and Ralf Moller (2016) Ontology-Based Integration
of Streaming and Static Relational Data with Op-
tique,Proc. of International Conference on Manage-
ment Data (SIGMOD), pp.2109–2112, https://
doi.org/10.1145/2882903.2899385.
[76] Paul Groth, Antonis Loizou, Alasdair J.G. Gray, Ca-
role Goble, Lee Harland, Steve Pettifer (2014) API-
centric Linked Data integration: The Open PHACTS
Discovery Platform case study,Web Semantics: Science,
Services and Agents on the World Wide Web, Volume 29,
pp. 12-18.
[77] Gomez-Cabrero, David and Abugessaisa, Imad and
Maier, Dieter and Teschendorff, Andrew and Merken-
schlager, Matthias and Gisel, Andreas and Ballestar,
Esteban and Bongcam-Rudloff, Erik and Conesa, Ana
and Tegner, Jesper (2014) Data integration in the
era of omics: current and future challenges, BMC
Systems Biology, https://doi.org/10.1186/
1752-0509-8-S2-I1.
[78] Craig A. Knoblock, Pedro Szekely (2015) Exploiting
Semantics for Big Data Integration, Association for the
Advancement of Artiﬁcial Intelligence.
[79] Lu Tang, Peter X.K. Song (2016) Fused Lasso Ap-
proach in Regression Coefﬁcients Clustering Learning
Parameter Heterogeneity in Data Integration, Journal of
Machine Learning Research.
450 Informatica 43 (2019) 425–459 G. Nagarajan et al.
[80] Son T. Mai, Xiao He, Jing Feng, Claudia Plant, Chris-
tian Bohm (2016) Anytime density-based clustering of
complex data, Knowledge and Information Systems, pp
319-355.
[81] Diane J. Janvrin, Robyn L. Raschke, William N. Dilla
(2014) Making sense of complex data using interac-
tive data visualization,Journal of Accounting Education,
Volume 32, Issue 4, pp 31-48, https://doi.org/
10.1016/j.jaccedu.2014.09.003.
[82] N. Tsagkarakis, P. P. Markopoulos, G. Sklivanitis
and D. A. Pados (2018) L1-Norm Principal-Component
Analysis of Complex Data, IEEE Transactions on Sig-
nal Processing, vol. 66, no. 12, 15 June15, pp. 3256-
3267, https://doi.org/10.1109/TSP.2018.
2821641.
[83] Vladimir Berikov (2014) Weighted ensemble of algo-
rithms for complex data clustering,Pattern Recognition
Letters, Volume 38, pp 99-106,https://doi.org/
10.1016/j.patrec.2013.11.012.
[84] Zuguang Gu Roland Eils Matthias Schlesner (2016)
Complex heatmaps reveal patterns and correlations in
multidimensional genomic data,Bioinformatics, Volume
32, Issue 18, pp 2847-2849, https://doi.org/
10.1093/bioinformatics/btw313.
[85] Thierry Denoeux (2014) Likelihood-based belief
function: justiﬁcation and some extensions to low-
quality data ,International Journal of Approximate
Reasoning, pp.1535-1547, https://doi.org/10.
1016/j.ijar.2013.06.007.
[86] Paul W. Murray, Bruno Agard, Marco A. Bara-
jas (2017) Market segmentation through data min-
ing: A method to extract behaviors from a noisy
data set,Computers and Industrial Engineering, Volume
109, pp 233-252, https://doi.org/10.1016/
j.cie.2017.04.017.
[87] Carlos J. Mantas, Joaquin Abellan : Credal-C4.5
(2014) Decision tree based on imprecise probabilities
to classify noisy data,Expert Systems with Applications,
Volume 41, Issue 10, pp.4625-4637, https://doi.
org/10.1016/j.eswa.2014.01.017.
[88] B. Frenay and M. Verleysen (2016) Reinforced Ex-
treme Learning Machines for Fast Robust Regression in
the Presence of Outliers, IEEE Transactions on Cyber-
netics, vol. 46, no. 12, pp. 3351-3363,https://doi.
org/10.1109/TCYB.2015.2504404.
[89] M. Kang, M. R. Islam, J. Kim, J. Kim and M.
Pecht (2016) A Hybrid Feature Selection Scheme for
Reducing Diagnostic Performance Deterioration Caused
by Outliers in Data-Driven Diagnostics,IEEE Transac-
tions on Industrial Electronics, vol. 63, no. 5, pp.3299-
3310, https://doi.org/10.1109/TIE.2016.
2527623.
[90] J. Chen et al (2017) A Parallel Random Forest Algo-
rithm for Big Data in a Spark Cloud Computing Envi-
ronment,IEEE Transactions on Parallel and Distributed
Systems, vol. 28, no. 4, pp. 919-933, https://doi.
org/10.1109/TPDS.2016.2603511.
[91] I. Chung et al (2017) Parallel Deep Neural Network
Training for Big Data on Blue Gene/Q, IEEE Transac-
tions on Parallel and Distributed Systems, vol. 28, no.
6, pp. 1703-1714, https://doi.org/10.1109/
TPDS.2016.2626289.
[92] Jin Qian, Ping Lv, Xiaodong Yue, Caihui Liu,
Zhengjun Jing (2015) Hierarchical attribute reduction
algorithms for big data using MapReduce,Knowledge-
Based Systems, Volume 73, pp. 18-31,https://doi.
org/10.1016/j.knosys.2014.09.001.
[93] Sergio Ramirez,Gallego,Iago Lastra, David Mar-
tinez,Rego, Veronica Bolon,Canedo, Jose Manuel
Benitez,Francisco Herrera, Amparo Alonso,Betanzos
(2016) Fast,mRMR: Fast Minimum Redundancy Max-
imum Relevance Algorithm for High Dimensional
Big Data,International Journal of Intelligent Systems,
https://doi.org/10.1002/int.21833.
[94] S. Wang, L. L. Minku and X. Yao (2015) Resampling-
Based Ensemble Methods for Online Class Imbalance
Learning, IEEE Transactions on Knowledge and Data
Engineering, vol. 27, no. 5, pp. 1356-1368,https://
doi.org/10.1109/TKDE.2014.2345380.
[95] K. E. Bennin, J. Keung, P. Phannachitta, A. Mon-
den and S. Mensah (2018) MAHAKIL- Diversity Based
Oversampling Approach to Alleviate the Class Im-
balance Issue in Software Defect Prediction, IEEE
Transactions on Software Engineering, vol. 44, no. 6,
pp. 534-550, https://doi.org/10.1109/TSE.
2017.2731766.
[96] Loris Nanni, Carlo Fantozzi, Nicola Lazzarini
(2015) Coupling different methods for overcoming
the class imbalance problem,Neurocomputing, Volume
158, pp. 48-61, https://doi.org/10.1016/j.
neucom.2015.01.068.
[97] Zhen Liu, Deyu Tang, Yongming Cai, Ruoyu Wang,
Fuhua Chen (2017) A hybrid method based on en-
semble WELM for handling multi class imbalance
in cancer microarray data, Neurocomputing, Volume
266, pp. 641-650, https://doi.org/10.1016/
j.neucom.2017.05.066.
[98] Tavish Srivastava (2015) Perfect way to
build a Predictive Model, Analytics Vidhya,
https://www.analyticsvidhya.com.
[99] Jeremy Howick, Paul Glasziou, Jeffrey K. Aron-
son (2013) Problems with using mechanisms to solve
the problem of extrapolation, Theoretical Medicine and
Bioethics, Volume 34, Issue 4, pp 275–291, https:
//doi.org/10.1007/s11017-013-9266-0.
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 451
[100] Manuel Martin-Flores, Monique D. Parr, Luis Cam-
poy, Robin D. Gleed (2012) The sensitivity of sheep to
vecuronium: an example of the limitations of extrapo-
lation, Canadian Journal of Anesthesia, Volume 59, Is-
sue 7, pp 722–723,https://doi.org/10.1007/
s12630-012-9707-7.
[101] James R. Miller, Monica G. Turner, Erica A. H.
Smithwick, C. Lisa Dent, Emily H. Stanley (2004)
Spatial Extrapolation: The Science of Predicting
Ecological Patterns and Processes, BioScience, Vol-
ume 54, Issue 4,Pages 310–320, https://doi.
org/10.1641/0006-3568(2004)054[0310:
SETSOP]2.0.CO;2.
[102] Forbes, V . E., Calow, P. and Sibly, R. M. (2009)
The extrapolation problem and how population model-
ing can help, Environmental Toxicology and Chemistry,
pp: 1987-1994, https://doi.org/10.1897/
08-029.1.
[103] Stack Exchange (2016) What is wrong with
Extrapolation, StackExchange, https://stats.
stackexchange.com/questions.
[104] Peter Flom (2018) The disadvantages of linear
regression, Sciencing, https://sciencing.com/
disadvantages-linear-regression-8562780.
html.
[105] Benjamin A. Goldstein, Ann Marie Navar, Rickey
E. Carter (2017) Moving beyond regression techniques
in cardiovascular risk prediction: applying machine
learning to address analytic challenges, European Heart
Journal, Volume 38, Issue 23, Pages 1805–1814.
[106] Harvey J Motulsky and Ronald E Brown (2006) De-
tecting outliers when ﬁtting data with nonlinear regres-
sion – a new method based on robust nonlinear regres-
sion and the false discovery rate, BMC Bioinformatics.
[107] Johansen, S., and Nielsen, B (2016) Asymptotic
Theory of Outlier Detection Algorithms for Linear
Time Series Regression Models. Scand J Statist, 43:
321– 348, https://doi.org/10.1111/sjos.
12174.
[108] Minitab Blog Editor (2013) Enough Is Enough!
Handling Multicollinearity in Regression Analysis,
The Minitab Blog, https://blog.minitab.
com/blog/understanding-statistics/
handling-multicollinearity-in\
-regression-analysis.
[109] Vatcheva KP, Lee M, McCormick JB, Rahbar MH
(2016) Multicollinearity in Regression Analyses Con-
ducted in Epidemiologic Studies. Epidemiology (Sunny-
vale),6(2):227.
[110] Jake Lever, Martin Krzywinski and Naomi Alt-
man (2016) Logistic regression, Nature Methods, pages
541–542, https://doi.org/10.1038/nmeth.
3904.
[111] Carina Mood (2010) Logistic Regression: Why We
Cannot Do What We Think We Can Do, and What We
Can Do About It, European Sociological Review, V ol-
ume 26, Issue 1, Pages 67–82, https://doi.org/
10.1093/esr/jcp006.
[112] Muchlinski, D., Siroky, D., He, J., and Kocher, M.
(2016) Comparing Random Forest with Logistic Re-
gression for Predicting Class-Imbalanced Civil War On-
set Data. Political Analysis, 24(1), 87-103, https:
//doi.org/10.1093/pan/mpv024.
[113] Mirjam J. Knol, Saskia Le Cessie, Ale Algra, Jan
P. Vandenbroucke, Rolf H.H. Groenwold (2012) Over-
estimation of risk ratios by odds ratios in trials and co-
hort studies: alternatives to logistic regression, CMAJ,
184 (8) 895-899, https://doi.org/10.1503/
cmaj.101715.
[114] Pijush Samui (2013) Multivariate Adaptive Regres-
sion Spline (Mars) for Prediction of Elastic Modulus of
Jointed Rock Mass, Geotechnical and Geological En-
gineering, V olume 31, Issue 1, pp 249–253, https:
//doi.org/10.1007/s10706-012-9584-4.
[115] Pijush Samui, Sarat Das, Dookie Kim (2011) Up-
lift capacity of suction caisson in clay using multivariate
adaptive regression spline,Ocean Engineering, Volume
38, Issues 17–18, Pages 2123-2127, https://doi.
org/10.1016/j.oceaneng.2011.09.036.
[116] Kuhn Max; Johnson Kjell (2013) MARS regression,
Applied Predictive Modeling, New York, NY: Springer
New York.
[117] Statsoft (2019) Multivariate Adaptive Re-
gression Splines (MARSplines), Statsoft.com,
http://www.statsoft.com/Textbook/Multivariate-
Adaptive-Regression-Splines.
[118] Pijush Samui,Pradeep Kurup (2012) Multivariate
Adaptive Regression Spline and Least Square Sup-
port Vector Machine for Prediction of Undrained Shear
Strength of Clay, International Journal of Applied Meta-
heuristic Computing, 3(2),https://doi.org/10.
4018/jamc.2012040103.
[119] Ariadna Montiel, Ruth Lazkoz, Irene Sendra, Celia
Escamilla-Rivera, and Vincenzo Salzano (2014) Non-
parametric reconstruction of the cosmic expansion with
local regression smoothing and simulation extrapola-
tion, Physical Review D, 89, 043007,https://doi.
org/10.1103/PhysRevD.89.043007.
[120] Felix Biscarri, Inigo Monedero, Antonio Garcia,
Juan Ignacio Guerrero, Carlos Leon (2017) Electricity
clustering framework for automatic classiﬁcation of cus-
tomer loads, Expert Systems with Applications, Volume
86, Pages 54-63,https://doi.org/10.1016/j.
eswa.2017.05.049.
452 Informatica 43 (2019) 425–459 G. Nagarajan et al.
[121] NIST/SEMATECH (2012) e-Handbook of Statisti-
cal Methods, NIST/SEMATECH,http://www.itl.
nist.gov/div898/handbook.
[122] Kyoosik Kim (2019) Ridge Regression
for Better Usage, Towards data science,
https://towardsdatascience.com/
ridge-regression-for-better-usage.
[123] C.B. Garcia, J. Garcia, M.M. Lopez Martín and
R. Salmeron (2015) Collinearity: revisiting the vari-
ance inﬂation factor in ridge regression, Journal of
Applied Statistics, Volume 42 - Issue 3, Pages 648-
661, https://doi.org/10.1080/02664763.
2014.980789.
[124] NCSS Statistical Software: Ridge regression, NCSS,
https://ncss-wpengine.netdna-ssl.com.
[125] Patrick Breheny: Penalized regression methods,
University of Kentucky, https://web.as.uky.
edu/statistics/users/pbreheny.
[126] B M Golam Kibria, Shipra Banik (2016) Some
Ridge Regression Estimators and Their Performances,
Journal of Modern Applied Statistical Methods, Volume
15, Issue 1 Article 12.
[127] Prashant Gupta (2017) Regularization in
Machine Learning, Towards data science,
https://towardsdatascience.com/
regularization-in-machine-learning.
[128] Gene H. Golub, Michael Heath and Grace Wahba
(2012) Generalized Cross-Validation as a Method
for Choosing a Good Ridge Parameter, Techno-
metrics, 21:2, 215-223, https://doi.org/10.
1080/00401706.1979.10489751.
[129] Charles K. Fisher, Austin Huang, and Collin M.
Stultz (2010) Modeling Intrinsically Disordered Pro-
teins with Bayesian Statistics, Journal of the Ameri-
can Chemical Society, 132 (42), 14919-14927,https:
//doi.org/10.1021/ja105832g.
[130] Andre E. Punt and Ray Hilborn (2001) Strengths and
weaknesses of Bayesian Approach, Computerized Infor-
mation Series, Food and Organization of the United Na-
tions.
[131] Hoffrage Ulrich, Krauss Stefan, Martignon Laura,
Gigerenzer Gerd (2015) Natural frequencies im-
prove Bayesian reasoning in simple and complex in-
ference tasks, Frontiers in Psychology, Volume 6,
pp.1473, https://doi.org/10.3389/fpsyg.
2015.01473.
[132] Tina R. Patil, Mrs. S. S. Sherekar, Sant Gadgebaba
(2013) Performance Analysis of Naive Bayes and J48
Classiﬁcation Algorithm for Data Classiﬁcation, Inter-
national Journal Of Computer Science And Applications
Vol. 6, No.2.
[133] Narayanan V ., Arora I., Bhatia A. (2013) Fast
and Accurate Sentiment Classiﬁcation Using an
Enhanced Naive Bayes Model, Intelligent Data
Engineering and Automated Learning – IDEAL
2013. IDEAL, https://doi.org/10.1007/
978-3-642-41278-3_24.
[134] S.L. Ting, W.H. Ip, Albert H.C. Tsang (2011) Is
Naïve Bayes a Good Classiﬁer for Document Classiﬁ-
cation?, International Journal of Software Engineering
and Its Applications Vol. 5, No. 3.
[135] J. Zhang, C. Chen, Y . Xiang, W. Zhou and Y . Xi-
ang (2013) Internet Trafﬁc Classiﬁcation by Aggregat-
ing Correlated Naive Bayes Predictions, IEEE Transac-
tions on Information Forensics and Security, vol. 8, no.
1, pp. 5-15, https://doi.org/10.1109/TIFS.
2012.2223675.
[136] Nayyar A. Zaidi, Jesus Cerquides, Mark J. Carman,
Geoffrey I.Webb (2013) Alleviating Naive Bayes At-
tribute Independence Assumption by AttributeWeight-
ing, Journal of Machine Learning Research 14 , 1947-
1988.
[137] Liangxiao Jiang, Chaoqun Li, Shasha Wang,
Lungan Zhang (2016) Deep feature weighting for
naive Bayes and its application to text classiﬁca-
tion,Engineering Applications of Artiﬁcial Intelligence,
Volume 52, Pages 26-39, https://doi.org/10.
1016/j.engappai.2016.02.002.
[138] Victor Roman (2019) Naive Bayes Al-
gorithm: Intuition and Implementation in
a Spam Detector, Towards data science,
https://towardsdatascience.com/
naive-bayes-intuition-and-implementation.
[139] Liangxiao Jiang, Zhihua Cai, Dianhong Wang,
Harry Zhang (2012) Improving Tree augmented Naive
Bayes for class probability estimation,Knowledge-
Based Systems, Volume 26, Pages 239-245,https://
doi.org/10.1016/j.knosys.2011.08.010.
[140] Zhun Yu, Fariborz Haghighat, Benjamin C.M.
Fung, Hiroshi Yoshino (2010) A decision tree method
for building energy demand modeling, Energy and
Buildings, Volume 42, Issue 10, Pages 1637-1646,
https://doi.org/10.1016/j.enbuild.
2010.04.006.
[141] V .F. Rodriguez-Galiano, B. Ghimire, J. Rogan,
M. Chica-Olmo, J.P. Rigol-Sanchez (2012) An assess-
ment of the effectiveness of a random forest clas-
siﬁer for land-cover classiﬁcation, ISPRS Journal of
Photogrammetry and Remote Sensing, Volume 67,
Pages 93-104, https://doi.org/10.1016/j.
isprsjprs.2011.11.002.
[142] Mahyat Shafapour Tehrany, Biswajeet Pradhan,
Mustafa Neamah Jebur (2013) Spatial prediction of
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 453
ﬂood susceptible areas using rule based decision tree
(DT) and a novel ensemble bivariate and multivariate
statistical models in GIS, Journal of Hydrology, Volume
504, Pages 69-79, https://doi.org/10.1016/
j.jhydrol.2013.09.034.
[143] Jung Hwan Cho, Pradeep U. Kurup (2011) De-
cision tree approach for classiﬁcation and dimension-
ality reduction of electronic nose data, Sensors and
Actuators B: Chemical, Volume 160, Issue 1, Pages
542-548, https://doi.org/10.1016/j.snb.
2011.08.027.
[144] Song YY , Lu Y (2015) Decision tree methods: appli-
cations for classiﬁcation and prediction. Shanghai Arch
Psychiatry. 27(2), pp. 130 – 135.
[145] Bartosz Krawczyk, Michał Wo´ zniak, Gerald Schae-
fer (2014) Cost-sensitive decision tree ensembles for
effective imbalanced classiﬁcation, Applied Soft Com-
puting, Volume 14, Part C, Pages 554-562, https:
//doi.org/10.1016/j.asoc.2013.08.014.
[146] Tao Wang, Zhenxing Qin, Zhi Jin, Shichao Zhang
(2010) Handling over-ﬁtting in test cost-sensitive deci-
sion tree learning by feature selection, smoothing and
pruning, Journal of Systems and Software, Volume 83,
Issue 7, Pages 1137-1147, https://doi.org/10.
1016/j.jss.2010.01.002.
[147] Rutvija Pandya, Jayati Pandya (2015) C5.0 Algo-
rithm to Improved Decision Tree with Feature Selec-
tion and Reduced Error Pruning, International Jour-
nal of Computer Applications Volume 117 – No. 16,
https://doi.org/10.5120/20639-3318.
[148] Shengyi Jiang, Guansong Pang, Meiling Wu, Limin
Kuang (2012) An improved K-nearest-neighbor algo-
rithm for text categorization, Expert Systems with Appli-
cations, Volume 39, Issue 1, Pages 1503-1509,https:
//doi.org/10.1016/j.eswa.2011.08.040.
[149] Sadegh Bafandeh Imandoust And Mohammad
Bolandraftar (2013) Application of K-Nearest Neighbor
(KNN) Approach for Predicting Economic Events: The-
oretical Background, S B Imandoust et al. Int. Journal of
Engineering Research and Applications, Vol. 3, Issue 5,
pp.605-610.
[150] X. Liang, X. Gou and Y . Liu (2012) Fingerprint-
based location positoning using improved KNN, 2012
3rd IEEE International Conference on Network Infras-
tructure and Digital Content, Beijing, pp. 57-61.
[151] A. Thommandram, J. M. Eklund and C. McGregor
(2013) Detection of apnoea from respiratory time se-
ries data using clinically recognizable features and kNN
classiﬁcation, 2013 35th Annual International Confer-
ence of the IEEE Engineering in Medicine and Biol-
ogy Society (EMBC), Osaka, pp. 5013-5016, https:
//doi.org/10.1109/EMBC.2013.6610674.
[152] Carmelo Cassisi, Alfredo Ferro, Rosalba Giugno,
Giuseppe Pigola, Alfredo Pulvirenti (2013) Enhancing
density-based clustering: Parameter reduction and out-
lier detection, Information Systems, Volume 38, Issue 3,
Pages 317-330, https://doi.org/10.1016/j.
is.2012.09.001.
[153] O. Kursun (2010) Spectral Clustering with Reverse
Soft K-Nearest Neighbor Density Estimation, The 2010
International Joint Conference on Neural Networks
(IJCNN), Barcelona, pp. 1-8.
[154] aporras (2018) 10 Reasons for loving
Nearest Neighbors algorithm, QuantDare,
https://quantdare.com/10-reasons\
-for-nearest-neighbors-algorithm.
[155] Y . Mitani, Y . Hamamoto (2006) A local mean-based
nonparametric classiﬁer, Pattern Recognition Letters,
Volume 27, Issue 10, Pages 1151-1159, https://
doi.org/10.1016/j.patrec.2005.12.016.
[156] Gustavo E.A.P.A. Batista, Diego Furtado Silva
(2009) How k-Nearest Neighbor Parameters Affect its
Performance?, Argentine symposium on artiﬁcial intelli-
gence.
[157] Shichao Zhang, Debo Cheng, Zhenyun Deng,
Ming Zong, Xuelian Deng (2018) A novel kNN al-
gorithm with data-driven k parameter computation,
Pattern Recognition Letters, Volume 109, Pages 44-
54, https://doi.org/10.1016/j.patrec.
2017.09.036.
[158] Zhang, Shichao and Li, Xuelong and Zong, Ming
and Zhu, Xiaofeng and Cheng, Debo (2017) Learning
K for kNN Classiﬁcation, ACM Trans. Intell. Syst. Tech-
nol., Volume 8, pp. 43:1–43:19,https://doi.org/
10.1145/2990508.
[159] Abdulhamit Subasi (2013) Classiﬁcation of EMG
signals using PSO optimized SVM for diagnosis
of neuromuscular disorders, Computers in Biology
and Medicine, Volume 43, Issue 5, Pages 576-586,
https://doi.org/10.1016/j.compbiomed.
2013.01.020.
[160] Fangjun Kuang, Weihong Xu, Siyang Zhang (2014)
A novel hybrid KPCA and SVM with GA model for in-
trusion detection,Applied Soft Computing, Volume 18,
Pages 178-184, https://doi.org/10.1016/j.
asoc.2014.01.028.
[161] Roman M. Balabin, Ekaterina I. Lomakina
(2011) Support vector machine regression (SVR/LS-
SVM)—an alternative to neural networks (ANN) for
analytical chemistry? Comparison of nonlinear methods
on near infrared (NIR) spectroscopy data , Analyst,
issue:8.
[162] S. Han, Cao Qubo and Han Meng (2012) Parameter
selection in SVM with RBF kernel function, World Au-
tomation Congress 2012, Puerto Vallarta, Mexico, pp.
1-4.
454 Informatica 43 (2019) 425–459 G. Nagarajan et al.
[163] Weijun li, Zhenyu Liu (2011) A method of SVM
with Normalization in Intrusion Detection, Procedia En-
vironmental Sciences, Volume 11, Part A, Pages 256-
262, https://doi.org/10.1016/j.proenv.
2011.12.040.
[164] Adel Bolbol, Tao Cheng, Ioannis Tsapakis,
James Haworth (2012) Inferring hybrid transporta-
tion modes from sparse GPS data using a mov-
ing window SVM classiﬁcation, Computers, Envi-
ronment and Urban Systems, Volume 36, Issue 6,
Pages 526-537, https://doi.org/10.1016/j.
compenvurbsys.2012.06.001.
[165] A. Bhardwaj, A. Gupta, P. Jain, A. Rani and J. Ya-
dav (2015) Classiﬁcation of human emotions from EEG
signals using SVM and LDA Classiﬁers, 2015 2nd In-
ternational Conference on Signal Processing and Inte-
grated Networks (SPIN), Noida, pp. 180-185, https:
//doi.org/10.1109/SPIN.2015.7095376.
[166] De Giorgi, M.G., Campilongo, S., Ficarella, A.,
Congedo, P.M (2014) Comparison Between Wind
Power Prediction Models Based on Wavelet Decom-
position with Least-Squares Support Vector Machine
(LS-SVM) and Artiﬁcial Neural Network (ANN) Ener-
gies, 7, 5251-5272,https://doi.org/10.3390/
en7085251.
[167] M. Emre Celebi, Hassan A. Kingravi, Patricio A.
Vela (2013) A comparative study of efﬁcient initializa-
tion methods for the k-means clustering algorithm, Ex-
pert Systems with Applications, Volume 40, Issue 1 ,
Pages 200-210, https://doi.org/10.1016/j.
eswa.2012.07.021.
[168] Trupti M. Kodinariya, Prashant R. Makwana (2013)
Review on determining number of Cluster in K-Means
Clustering, International Journal of Advance Research
in Computer Science and Management Studies, Volume
1, Issue 6.
[169] S. Na, L. Xumin and G. Yong (2010) Research on
k-means Clustering Algorithm: An Improved k-means
Clustering Algorithm, Third International Symposium
on Intelligent Information Technology and Security In-
formatics, Jinggangshan, pp. 63-67.
[170] Soumi Ghosh, Sanjay Kumar Dubey (2013) Com-
parative analysis of k-means and fuzzy c-means algo-
rithms, International Journal of Advanced Computer
Science and Applications, Vol. 4, No.4, https://
doi.org/10.14569/IJACSA.2013.040406.
[171] Taher Niknam, Babak Amiri (2010) An efﬁcient hy-
brid approach based on PSO, ACO and k-means for clus-
ter analysis,Applied Soft Computing, Volume 10, Issue
1 , pp. 183-197, https://doi.org/10.1016/j.
asoc.2009.07.001.
[172] You Li, Kaiyong Zhao, Xiaowen Chu, Jim-
ing Liu (2013) Speeding up k-Means algorithm by
GPUs,Journal of Computer and System Sciences, Vol-
ume 79, Issue 2, pp. 216-229, https://doi.org/
10.1016/j.jcss.2012.05.004.
[173] Steinley, D., and Brusco, M. J (2011) Choosing the
number of clusters in K-means clustering, Psychological
Methods, 16(3), 285-297, https://doi.org/10.
1037/a0023346.
[174] Renato Cordeiro de Amorim, Boris Mirkin (2012)
Minkowski metric, feature weighting and anoma-
lous cluster initializing in K-Means clustering, Pat-
tern Recognition, Volume 45, Issue 3 , pp. 1061-
1075, https://doi.org/10.1016/j.patcog.
2011.08.012.
[175] Carlos Ordonez, Edward Omiecinski (2004) Efﬁ-
cient Disk-Based K-Means Clustering for Relational
Databases, IEEE Transactions on Knowledge and Data
Engineering, vol. 16, pp. 909-921, https://doi.
org/10.1109/TKDE.2004.25.
[176] Athman Bouguettaya, Qi Yu, Xumin Liu, Xiangmin
Zhou, Andy Song (2015) Efﬁcient agglomerative hierar-
chical clustering, Expert Systems with Applications, Vol-
ume 42, Issue 5, pp 2785-2797,https://doi.org/
10.1016/j.eswa.2014.09.054.
[177] Dongkuan Xu, Yingjie Tian (2015) A Comprehen-
sive Survey of Clustering Algorithms, Annals of Data
Science, Volume 2, Issue 2, pp 165–193, https://
doi.org/10.1007/s40745-015-0040-1.
[178] Shaﬁq Alam, Gillian Dobbie, Yun Sing Koh, Patricia
Riddle, Saeed Ur Rehman (2014) Research on particle
swarm optimization based clustering: A systematic re-
view of literature and techniques, Swarm and Evolution-
ary Computation, Volume 17, Pages 1-13, https://
doi.org/10.1016/j.swevo.2014.02.001.
[179] T. Nguyen and C. Kwoh (2015) Efﬁcient agglomera-
tive hierarchical clustering for biological sequence anal-
ysis, TENCON 2015 - 2015 IEEE Region 10 Conference,
Macao, pp. 1-3.
[180] Guilherme Andrade, Gabriel Ramos, Daniel
Madeira, Rafael Sachetto, Renato Ferreira, Leonardo
Rocha (2013) G-DBSCAN: A GPU Accelerated Algo-
rithm for Density-based Clustering, Procedia Computer
Science, Volume 18, Pages 369-378, https://doi.
org/10.1016/j.procs.2013.05.200.
[181] Younghoon Kim, Kyuseok Shim, Min-Soeng Kim,
June Sup Lee (2014) DBCURE-MR: An efﬁcient
density-based clustering algorithm for large data using
MapReduce, Information Systems, Volume 42, Pages 15-
35, https://doi.org/10.1016/j.is.2013.
11.002.
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 455
[182] Carmelo Cassisi, Alfredo Ferro, Rosalba Giugno,
Giuseppe Pigola, Alfredo Pulvirenti (2013) Enhancing
density-based clustering: Parameter reduction and out-
lier detection, Information Systems, Volume 38, Issue 3,
Pages 317-330, https://doi.org/10.1016/j.
is.2012.09.001.
[183] Sunita Jahirabadkar, Parag Kulkarni (2014) Algo-
rithm to determine  -distance parameter in density based
clustering, Expert Systems with Applications, Volume
41, Issue 6, Pages 2939-2946, https://doi.org/
10.1016/j.eswa.2013.10.025.
[184] A Amini, TY Wah (2011) Density micro-clustering
algorithms on data streams: A review, Proceedings
of the International Multiconference of Engineers and
Computer Scientists.
[185] Dongwon Lee, Sung-Hyuk Park, Songchun Moon
(2013) Utility-based association rule mining: A market-
ing solution for cross-selling, Expert Systems with Appli-
cations, Volume 40, Issue 7, Pages 2715-2725,https:
//doi.org/10.1016/j.eswa.2012.11.021.
[186] Jesmin Nahar, Tasadduq Imam, Kevin S. Tickle, Yi-
Ping Phoebe Chen (2013) Association rule mining to de-
tect factors which contribute to heart disease in males
and females, Expert Systems with Applications, Volume
40, Issue 4, Pages 1086-1093, https://doi.org/
10.1016/j.eswa.2012.08.028.
[187] R.J. Kuo, C.M. Chao, Y .T. Chiu (2011) Applica-
tion of particle swarm optimization to association rule
mining, Applied Soft Computing, Volume 11, Issue 1,
Pages 326-336, https://doi.org/10.1016/j.
asoc.2009.11.023.
[188] Zhang M., He C. (2010) Survey on Associa-
tion Rules Mining Algorithms , Advancing Comput-
ing, Communication, Control and Management. Lec-
ture Notes in Electrical Engineering, vol 56. Springer,
Berlin, Heidelberg., pp 111-118, https://doi.
org/10.1007/978-3-642-05173-9_15.
[189] Andrew Tch: The mostly complete chart of
Neural Networks, explained, Towards data science,
https://towardsdatascience.com/the-mostly-complete-
chart-of-neural-networks-explained-3fb6f2367464.
[190] H. Yu, T. Xie, S. Paszczynski and B. M. Wilam-
owski (2011) Advantages of Radial Basis Function Net-
works for Dynamic System Design, IEEE Transactions
on Industrial Electronics, vol. 58, no. 12, pp. 5438-
5450, https://doi.org/10.1109/TIE.2011.
2164773.
[191] Mehdi Khashei, Mehdi Bijari (2011) A novel hy-
bridization of artiﬁcial neural networks and ARIMA
models for time series forecasting, Applied Soft Com-
puting, Volume 11, Issue 2, Pages 2664-2675, https:
//doi.org/10.1016/j.asoc.2010.10.015.
[192] Li-Hua Feng, Jia Lu (2010) The practical research
on ﬂood forecasting based on artiﬁcial neural networks,
Expert Systems with Applications, Volume 37, Issue 4,
Pages 2974-2977, https://doi.org/10.1016/
j.eswa.2009.09.037.
[193] Daniel Westreich, Justin Lessler, Michele Jonsson
Funk (2010) Propensity score estimation: neural net-
works, support vector machines, decision trees (CART),
and meta-classiﬁers as alternatives to logistic regression,
Journal of Clinical Epidemiology, Volume 63, Issue 8,
Pages 826-833, https://doi.org/10.1016/j.
jclinepi.2009.11.020.
[194] Ding, S., Su, C. and Yu (2011) An optimizing
BP neural network algorithm based on genetic algo-
rithmArtiﬁcial Intelligence Review , Volume 36, Is-
sue 2, pp 153–162,https://doi.org/10.1007/
s10462-011-9208-z.
[195] Hong-ze Li, Sen Guo, Chun-jie Li, Jing-qi Sun
(2013) A hybrid annual power load forecasting model
based on generalized regression neural network with
fruit ﬂy optimization algorithm, Knowledge-Based Sys-
tems, Volume 37, Pages 378-387, https://doi.
org/10.1016/j.knosys.2012.08.015.
[196] SeyedAli Mirjalili, Siti Zaiton Mohd Hashim, Hos-
sein Moradian Sardroudi (2012) Training feedforward
neural networks using hybrid particle swarm opti-
mization and gravitational search algorithm, Applied
Mathematics and Computation, Volume 218, Issue
22, Pages 11125-11137, https://doi.org/10.
1016/j.amc.2012.04.069.
[197] N Srivastava, G Hinton, A Krizhevsky, I Sutskever,
R Salakhutdinov (2014) Dropout: a simple way to pre-
vent neural networks from overﬁtting, Journal of ma-
chine learning research.
[198] Jonathan L. Ticknor (2013) A Bayesian regularized
artiﬁcial neural network for stock market forecasting,
Expert Systems with Applications, Volume 40, Issue 14,
Pages 5501-5506, https://doi.org/10.1016/
j.eswa.2013.04.013.
[199] Jurgen Schmidhuber (2015) Deep learning in neu-
ral networks: An overview, Neural Networks, Volume
61, Pages 85-117, https://doi.org/10.1016/
j.neunet.2014.09.003.
[200] Chao Shang, Fan Yang, Dexian Huang, Wenxiang
Lyu (2014) Data-driven soft sensor development based
on deep learning technique, Journal of Process Control,
Volume 24, Issue 3, Pages 223-233, https://doi.
org/10.1016/j.jprocont.2014.01.012.
[201] Jie Zhi Cheng, Dong Ni, Yi-Hong Chou, Jing
Qin, Chui-Mei Tiu, Yeun-Chung Chang, Chiun-Sheng
Huang, Dinggang Shen and Chung-Ming Chen (2016)
456 Informatica 43 (2019) 425–459 G. Nagarajan et al.
Computer-Aided Diagnosis with Deep Learning Ar-
chitecture: Applications to Breast Lesions in US Im-
ages and Pulmonary Nodules in CT Scans, Scientiﬁc
Reports volume 6, https://doi.org/10.1038/
srep24454.
[202] Ravid Shwartz-Ziv, Naftali Tishby (2017) Opening
the Black Box of Deep Neural Networks via Infor-
mation, Machine learning, https://arxiv.org/
abs/1703.00810.
[203] Tanu Arya (2018) Drawbacks of Deep Learn-
ing, Stanford Management Science and Engineering,
https://mse238blog.stanford.edu.
[204] P.K. Anooj (2012) Clinical decision support sys-
tem: Risk level prediction of heart disease using
weighted fuzzy rules, Journal of King Saud University
- Computer and Information Sciences, Volume 24, Issue
1, Pages 27-40, https://doi.org/10.1016/j.
jksuci.2011.09.002.
[205] Jose M. Alonso, Luis Magdalena (2011) Special
issue on interpretable fuzzy systems, Information Sci-
ences, Volume 181, Issue 20, Pages 4331-4339,https:
//doi.org/10.1016/j.ins.2011.07.001.
[206] Salma Elhag, Alberto Fernández, Abdullah
Bawakid, Saleh Alshomrani, Francisco Herrera (2015)
On the combination of genetic fuzzy systems and
pairwise learning for improving detection rates on
Intrusion Detection Systems,Expert Systems with Appli-
cations, Volume 42, Issue 1, Pages 193-202, https:
//doi.org/10.1016/j.eswa.2014.08.002.
[207] V . P. G. Jimenez, Y . Jabrane, A. G. Armada, B.
Ait Es Said and A. Ait Ouahman (2011) High Power
Ampliﬁer Pre-Distorter Based on Neural-Fuzzy Systems
for OFDM Signals, IEEE Transactions on Broadcasting,
vol. 57, no. 1, pp. 149-158,https://doi.org/10.
1109/TBC.2010.2088331.
[208] TE Alhanafy, F Zaghlool, A Saad, ED Moustafa
(2010) Neuro-Fuzzy modeling scheme for the prediction
of air pollution, Journal of American Science, 6(12).
[209] Ilija Svalina, Vjekoslav Galzina, Roberto Lujic,
Goran Simunovic (2013) An adaptive network-based
fuzzy inference system (ANFIS) for the forecasting: The
case of close price indices,Expert Systems with Applica-
tions, Volume 40, Issue 15, Pages 6055-6063, https:
//doi.org/10.1016/j.eswa.2013.05.029.
[210] B. Dennis, S. Muthukrishnan (2014) AGFS:
Adaptive Genetic Fuzzy System for medical data
classiﬁcation, Applied Soft Computing, Volume 25,
Pages 242-252, https://doi.org/10.1016/j.
asoc.2014.09.032.
[211] P Amudha, S Karthik, S Sivakumari (2013) Classi-
ﬁcation techniques for intrusion detection an overview
, International Journal of Computer applications, Vol-
ume 76, No.16, https://doi.org/10.5120/
13334-0928.
[212] T. M. Khoshgoftaar, J. Van Hulse and A. Napoli-
tano (2011) Comparing Boosting and Bagging Tech-
niques With Noisy and Imbalanced Data, IEEE Trans-
actions on Systems, Man, and Cybernetics - Part A: Sys-
tems and Humans, vol. 41, no. 3, pp. 552-568,https:
//doi.org/10.1109/TSMCA.2010.2084081.
[213] Syarif I., Zaluska E., Prugel-Bennett A., Wills G
(2012) Application of Bagging, Boosting and Stacking
to Intrusion Detection, International workshop on Ma-
chine Learning and Data Mining in Pattern Recogni-
tion. MLDM . Lecture Notes in Computer Science, vol
7376. Springer, Berlin, Heidelberg, https://doi.
org/10.1007/978-3-642-31537-4_46.
[214] GeeksforGreeks: Comparison b/w
Bagging and Boosting in Data Mining,
https://www.geeksforgeeks.org/comparison-b-w-
bagging-and-boosting-data-mining.
[215] M. Paz Sesmero, Agapito I. Ledezma, Araceli San-
chis (2015) Generating ensembles of heterogeneous
classiﬁers using Stacked Generalization, Wires data
mining and knowledge discovery.
[216] Harshdeep Singh (2018) Under-
standing Gradient Boosting Machines,
https://towardsdatascience.com/understanding-
gradient-boosting-machines-9be756fe76ab.
[217] Ferreira A.J., Figueiredo M.A.T. : Boosting Al-
gorithms: A Review of Methods, Theory, and Ap-
plications, Ensemble Machine Learning. Springer,
Boston, MA, pp 35-85, https://doi.org/10.
1007/978-1-4419-9326-7_2.
[218] Wilson, Andrew G and Gilboa, Elad and Neho-
rai, Arye and Cunningham, John P (2014) Fast Ker-
nel Learning for Multidimensional Pattern Extrapola-
tion, Advances in Neural Information Processing Sys-
tems 27, pp.3626- 3634.
[219] Wang S () CyberGIS and spatial data science, Geo-
Journal, Volume 81, Issue 6, pp.965–968, https://
doi.org/10.1007/s10708-016-9740-0.
[220] William Q. Meeker and Yili Hong (2014) Reliability
Meets Big Data: Opportunities and Challenges, Qual-
ity Engineering, 26:1, 102-116,https://doi.org/
10.1080/08982112.2014.846119.
[221] Markus Reichstein, Gustau Camps-Valls, Bjorn
Stevens, Martin Jung, Joachim Denzler, Nuno Car-
valhais and Prabhat (2019) Deep learning and pro-
cess understanding for data-driven Earth system science,
Nature, volume 566, pages195–204, https://doi.
org/10.1038/s41586-019-0912-1.
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 457
[222] Peter V . Coveney, Edward R. Dougherty and Roger
R. Highﬁeld (2016) Big data need big theory too, Philo-
sophical transactions of the Royal Society A Mathemat-
ical, Physical and Engineering sciences.
[223] Xiang Liu, Ziyang Tang, Huyunting Huang, Tonglin
Zhang and Baijian Yang (2019) Multiple Learning for
Regression in big data, CoRR.
[224] Ping Ma, Xiaoxiao Sun (2014) Leveraging for big
data regression, Wires Computational Statistics.
[225] S Jun, SJ Lee, JB Ryu (2015) A divided regression
analysis for big data, International Journal of software
engineering and its applications.
[226] HaiYing Wang, Min Yang and John Stufken (2019)
Information-Based Optimal Subdata Selection for Big
Data Linear Regression, Journal of the American Sta-
tistical Association, volume 114, number 525, pp. 393-
405, https://doi.org/10.1080/01621459.
2017.1408468.
[227] Yang, Hang and Fong, Simon (2012) Incrementally
Optimized Decision Tree for Noisy Big Data, Proceed-
ings of the 1st International Workshop on Big Data,
Streams and Heterogeneous Source Mining: Algorithms,
Systems, Programming Models and Applications,pp.36-
44.
[228] W Dai, W Ji (2014) A mapreduce implementa-
tion of C4. 5 decision tree algorithm, International
journal of database theory and application, Vol 7,
No. 1, pp.49-60, https://doi.org/10.14257/
ijdta.2014.7.1.05.
[229] J. Chen et al. (2017) A Parallel Random Forest Al-
gorithm for Big Data in a Spark Cloud Computing Envi-
ronment, IEEE Transactions on Parallel and Distributed
Systems, vol. 28, no. 4, pp. 919-933, https://doi.
org/10.1109/TPDS.2016.2603511.
[230] Ke, Guolin and Meng, Qi and Finley, Thomas and
Wang, Taifeng and Chen, Wei and Ma, Weidong and Ye,
Qiwei and Liu, Tie-Yan (2017) LightGBM: A Highly
Efﬁcient Gradient Boosting Decision Tree, Advances in
Neural Information Processing Systems 30, pp. 3146-
3154.
[231] Zhenyun Deng, Xiaoshu Zhu, Debo Cheng, Ming
Zong, Shichao Zhang (2016) Efﬁcient kNN classiﬁ-
cation algorithm for big data, Neurocomputing, Vol-
ume 195, Pages 143-148, https://doi.org/10.
1016/j.neucom.2015.08.112.
[232] Jesus Maillo, Sergio Ramírez, Isaac Triguero, Fran-
cisco Herrera (2017) kNN-IS: An Iterative Spark-based
design of the k-Nearest Neighbors classiﬁer for big
data,Knowledge-Based Systems, Volume 117, Pages
3-15, https://doi.org/10.1016/j.knosys.
2016.06.012.
[233] J. Maillo, I. Triguero and F. Herrera (2015)
A MapReduce-Based k-Nearest Neighbor Ap-
proach for Big Data Classiﬁcation, IEEE Trust-
com/BigDataSE/ISPA, Helsinki, pp. 167-172.
[234] X Yan, Z Wang, D Zeng, C Hu and H Yao (2014)
Design and analysis of parallel MapReduce based KNN-
join algorithm for big data classiﬁcation, TELKOM-
NIKA Indonesian Journal of ELectrical Engineering,
Vol 12, No 11 , pp.7927-7934, https://doi.org/
10.11591/telkomnika.v12i11.6357.
[235] G. Song, J. Rochas, L. E. Beze, F. Huet and F.
Magoules (2016) K Nearest Neighbour Joins for Big
Data on MapReduce: A Theoretical and Experimental
Analysis, IEEE Transactions on Knowledge and Data
Engineering, vol. 28, no. 9, pp. 2376-2392, https:
//doi.org/10.1109/TKDE.2016.2562627.
[236] Katkar, V . D., and Kulkarni, S. V . (2013) A novel
parallel implementation of Naive Bayesian classiﬁer for
Big Data, International Conference on Green Com-
puting, Communication and Conservation of Energy
(ICGCE)
[237] B. Chandra, Manish Gupta (2011) Robust approach
for estimating probabilities in Naïve–Bayes Classiﬁer
for gene expression data, Expert Systems with Appli-
cations,Volume 38, Issue 3,Pages 1293-1298, https:
//doi.org/10.1016/j.eswa.2010.06.076.
[238] Zhou, L., Pan, S., Wang, J., and Vasilakos, A. V .
(2017) Machine learning on big data: Opportunities
and challenges, Neurocomputing, 237, pp.350–361,
https://doi.org/10.1016/j.neucom.
2017.01.026.
[239] Sergio Ramirez Gallego ,Salvador Garcia, Hector
Mourino-Talin , David Martinez-Rego , Veronica Bolon-
Canedo ,Amparo Alonso-Betanzos ,Jose Manuel Ben-
itez ,Francisco Herrera (2015) Data discretization: tax-
onomy and big data challenge, Advanced Review –
Wires Data mining and knowledge discovery, https:
//doi.org/10.1002/widm.1173.
[240] Rebentrost, Patrick and Mohseni, Masoud and
Lloyd, Seth (2014) Quantum Support Vector Machine
for Big Data Classiﬁcation, Phys. Rev. Lett. – Volume
113, Issue 13, pp. 130503,https://doi.org/10.
1103/PhysRevLett.113.130503.
[241] Anushree Priyadarshini, SonaliAgarwal (2015) A
Map Reduce based Support Vector Machine for Big
Data Classiﬁcation, International Journal of Database
Theory and Application , Vol.8 No.5 ,pp.77-98,https:
//doi.org/10.14257/ijdta.2015.8.5.07 .
[242] D. Singh, D. Roy and C. K. Mohan : DiP-SVM
(2017) Distribution Preserving Kernel Support Vector
Machine for Big Data, IEEE Transactions on Big Data,
vol. 3, no. 1, pp. 79-90, https://doi.org/10.
1109/TBDATA.2016.2646700.
458 Informatica 43 (2019) 425–459 G. Nagarajan et al.
[243] Cavallaro, G., Riedel, M., Richerzhagen, M.,
Benediktsson, J. A., and Plaza, A (2015) On Under-
standing Big Data Impacts in Remotely Sensed Image
Classiﬁcation Using Support Vector Machine Methods,
IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 8(10), 4634–4646,
https://doi.org/10.1109/JSTARS.2015.
2458855.
[244] Y . Liu and J. Du (2015) Parameter Optimization
of the SVM for Big Data, 8th International Sympo-
sium on Computational Intelligence and Design (IS-
CID), Hangzhou, pp. 341-344.
[245] Xiao Cai, Feiping Nie, Heng Huang (2013) Multi-
View K-Means Clustering on Big Data, Twenty-Third
International Joint Conference on Artiﬁcial Intelligence,
Web and Knowledge-Based Information Systems.
[246] Cui, X., Zhu, P., Yang, X., Li, K., and
Ji, C. (2014) Optimized big data K-means cluster-
ing using MapReduce, The Journal of Supercomput-
ing, 70(3), pp.1249–1259, https://doi.org/10.
1007/s11227-014-1225-7.
[247] N. Akthar, M. V . Ahamad and S. Khan (2015) Clus-
tering on Big Data Using Hadoop MapReduce, Inter-
national Conference on Computational Intelligence and
Communication Networks (CICN), Jabalpur, pp. 789-
795.
[248] Rathore, Punit and Kumar, Dheeraj and C. Bezdek,
James and Rajasegarar, Sutharshan and Palaniswami,
Marimuthu (2018) A Rapid Hybrid Clustering Algo-
rithm for Large V olumes of High Dimensional Data ,
IEEE Transactions on Knowledge and Data Engineer-
ing PP. 10.1109.
[249] T. C. Havens, J. C. Bezdek and M. Palaniswami
(2013) Scalable single linkage hierarchical clustering for
big data, IEEE Eighth International Conference on In-
telligent Sensors, Sensor Networks and Information Pro-
cessing, Melbourne, VIC, pp. 396-401.
[250] Athman Bouguettaya, Qi Yu, Xumin Liu, Xiang-
min Zhou, Andy Song (2015) Efﬁcient agglomerative
hierarchical clustering, Expert Systems with Applica-
tions,Volume 42, Issue 5,Pages 2785-2797, https://
doi.org/10.1016/j.eswa.2014.09.054.
[251] Yunpeng Cai, Yijun Sun, ESPRIT-Tree (2011) Hi-
erarchical clustering analysis of millions of 16S rRNA
pyrosequences in quasilinear computational time, Nu-
cleic Acids Research, Volume 39, Issue 14, Page e95,
https://doi.org/10.1093/nar/gkr349.
[252] Embrechts, M and Gatti, Christopher and Linton,
Jonathan and Roysam, Badrinath (2013) Hierarchical
Clustering for Large Data Sets , Advances in Intelli-
gent Signal Processing and Data Mining: Theory and
Applications, pp.197-233, https://doi.org/10.
1007/978-3-642-28696-4_8.
[253] Yanwei Yu, Jindong Zhao,Xiaodong Wang, Qin
Wang , Yonggang Zhang (2015) Cludoop: An Efﬁcient
Distributed Density-Based Clustering for Big Data Us-
ing Hadoop, International Journal of Distributed Sensor
Networks.
[254] Amini, A., Wah, T.Y . and Saboohi H. J (2014) On
Density-Based Data Streams Clustering Algorithms: A
Survey, Journal of Computer Science and Technology
- Volume 29, Issue 1, pp 116–141, https://doi.
org/10.1007/s11390-014-1416-y.
[255] Younghoon Kim, Kyuseok Shim, Min-Soeng Kim,
June Sup Lee (2014) DBCURE-MR: An efﬁcient
density-based clustering algorithm for large data using
MapReduce, Information Systems, Volume 42,Pages 15-
35, https://doi.org/10.1016/j.is.2013.
11.002.
[256] Deng, Z., Hu, Y ., Zhu, M. et al (2015) A scalable
and fast OPTICS for clustering trajectory big data, Clus-
ter Computing 18: 549, https://doi.org/10.
1007/s10586-014-0413-9.
[257] Imran Khan, Joshua Z. Huang, Kamen Ivanov
(2016) Incremental density-based ensemble clustering
over evolving data streams,Neurocomputing, Volume
191,Pages 34-43, https://doi.org/10.1016/
j.neucom.2016.01.009.
[258] Wang, Yu and Li, Boxun and Luo, Rong and Chen,
Yiran and Xu, Ningyi and Yang, Huazhong (2014) En-
ergy efﬁcient neural networks for big data analytics,
Proceedings of the Conference on Design, Automation
and Test in Europe.
[259] Cao J, Cui H, Shi H, Jiao L : Big Data (2016)
A Parallel Particle Swarm Optimization Back Propa-
gation Neural Network Algorithm Based on MapRe-
duce,PLoS ONE 11(6): e0157551, https://doi.
org/10.1371/journal.pone.0157551.
[260] Chiroma, Haruna, Ali Abdullahi, Usman, Abdul-
hamid, Shaﬁ i, Abdulsalam AlArood, Ala and Gabralla,
Lubna and Rana, Nadim and Shuib, Liyana and Hashem,
Ibrahim and Dada, Emmanuel and Abubakar, Adamu
and Zeki, Akram and Herawan, Tutut (2018) Progress
on Artiﬁcial Neural Networks for Big Data Analytics:
A Survey, IEEE Access. PP . 1-1.
[261] Zhang, Y Guo, Q Wang, J : Big data
analysis using neural networks 49. 9-18,
10.15961/j.jsuese.2017.01.002.
[262] Hai Wang, Zeshui Xu, Witold Pedrycz (2017)
An overview on the roles of fuzzy set techniques
in big data processing: Trends, challenges and op-
portunities, Knowledge-Based Systems, Volume 118,
Pages 15-30, https://doi.org/10.1016/j.
knosys.2016.11.008.
Predictive Analytics on Big Data - an Overview Informatica 43 (2019) 425–459 459
[263] Alberto Fernandez and Cristobal Jose Carmona
and Maria Jose del Jesus and Francisco Herrera (2016)
A View on Fuzzy Systems for Big Data: Progress
and Opportunities, International Journal of Compu-
tational Intelligence Systems, Volume 9, pp.69-80,
https://doi.org/10.1080/18756891.
2016.1180820.
[264] X. Chen and X. Lin (2014) Big Data Deep Learn-
ing: Challenges and Perspectives, IEEE Access, vol.
2, pp. 514-525, https://doi.org/10.1109/
ACCESS.2014.2325029.
[265] Maryam M Najafabadi, Flavio Villanustre, Taghi
M Khoshgoftaar, Naeem Seliya, Randall Wald and
Edin Muharemagic (2015) Deep learning applica-
tions and challenges in big data analytics, Journal
of Big Data, 2:1, https://doi.org/10.1186/
s40537-014-0007-7.
[266] Riccardo Miotto, Fei Wang, Shuang Wang, Xiao-
qian Jiang, Joel T Dudley (2018) Deep learning for
healthcare: review, opportunities and challenges, Brief-
ings in Bioinformatics, Volume 19, Issue 6, Pages
1236–1246, https://doi.org/10.1093/bib/
bbx044.
[267] Bartosz Krawczyk, Leandro L. Minku, Joao Gama,
Jerzy Stefanowski, Michal Wozniak (2017) Ensemble
learning for data stream analysis: A survey, Infor-
mation Fusion,Volume 37,Pages 132-156, https://
doi.org/10.1016/j.inffus.2017.02.004.
[268] Shan Huang, Botao Wang, Junhao Qiu, Jitao Yao,
Guoren Wang, Ge Yu (2016) Parallel ensemble of online
sequential extreme learning machine based on MapRe-
duce, Neurocomputing,Volume 174, Part A,Pages 352-
367, https://doi.org/10.1016/j.neucom.
2015.04.105.
[269] Huang X., Ye Y ., Zhang H. (2016) Extending
Kmeans-Type Algorithms by Integrating Intra-
cluster Compactness and Inter-cluster Separation,
Unsupervised Learning Algorithms. Springer,
pp 343-384, https://doi.org/10.1007/
978-3-319-24211-8_13.
[270] N. B. Nikhare and P. S. Prasad (2018) A review on
inter-cluster and intra-cluster similarity using bisected
fuzzy C-mean technique via outward statistical testing,
2nd International Conference on Inventive Systems and
Control (ICISC), Coimbatore, pp. 215-217.
[271] Frederic Godin, Jonas Degrave, Joni Dambre, Wes-
ley De Neve (2018) Dual Rectiﬁed Linear Units (DRe-
LUs): A replacement for tanh activation functions in
Quasi-Recurrent Neural Networks, Pattern Recognition
Letters, Volume 116, Pages 8-14, https://doi.
org/10.1016/j.patrec.2018.09.006.
[272] Yu Wang (2017) A new concept using LSTM
Neural Networks for dynamic system identiﬁcation,
American Control Conference (ACC), Seattle, WA, pp.
5324-5329, https://doi.org/10.23919/ACC.
2017.7963782.
[273] Igor M. Coelho, Vitor N. Coelho, Eduardo J. da
S. Luz, Luiz S. Ochi, Frederico G. Guimaraes, Eyder
Rios (2017) A GPU deep learning metaheuristic based
model for time series forecasting, Applied Energy, V ol-
ume 201, Pages 412-418, https://doi.org/10.
1016/j.apenergy.2017.01.003.
[274] Goodfellow, Ian; Pouget-Abadie, Jean; Mirza,
Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sher-
jil; Courville, Aaron; Bengio, Yoshua (2014) Generative
Adversarial Networks (PDF). Proceedings of the Inter-
national Conference on Neural Information Processing
Systems (NIPS). pp. 2672–2680.
[275] Michael A Farnum, Lalit Mohanty, Mathangi
Ashok, Paul Konstant, Joseph Ciervo, Victor S Lobanov,
Dimitris K Agraﬁotis (2019) A dimensional warehouse
for integrating operational data from clinical trials,
Database, Volume 2019, https://doi.org/10.
1093/database/baz039.
[276] Khan S.I., Hoque A.S.M.L (2015) Towards Devel-
opment of National Health Data Warehouse for Knowl-
edge Discovery, Part of Advances in Intelligent Systems
and Computing, vol 385. Springer, Cham.
[277] SO Salinas, ACN Lemus (2019) Data warehouse
and big data integration, International Journal of Com-
puter Science and Information Technology, Vol 9, No.2.
[278] Hai, Rihan and Geisler, Sandra and Quix, Christoph
(2016) Constance: An Intelligent Data Lake System,
Proceedings of the 2016 International Conference on
Management of Data.
460 Informatica 43 (2019) 425–459 G. Nagarajan et al.