https://doi.org/10.31449/inf.v47i9.3844 Informatica 47 (2023) 17–34 17 A Consolidated Tree Structure Combining Multiple Regression Trees with Varying Depths, Resulting in an Efficient Ensemble Model Elmira Ashoor Mahani * , Koorush Ziarati Comp. Sci. & Eng. & I.T. Dept., Shiraz University, Mollasadra Street, Shiraz, Iran E-mail: elmira.ashoor@cse.shirazu.ac.ir, ziarati@shirazu.ac.ir *Corresponding Author Keywords regression tree, stopping condition, ensemble, efficient Received: November 28, 2021 Regression is a commonly used technique to predict a continuous target value based on a set of input features. Decision trees are hierarchical models that offer high interpretability, fast and precise reasoning, and are also used for regression tasks. However, determining the optimal stopping conditions for decision trees is a complex problem that has attracted significant research interest. Ensemble based modeling is an effective approach for adjusting hyper-parameters, where base models with varying parameter values are combined instead of searching for the best value. Random forests are a classic example of an ensemble model that combines decision trees generated from different perspectives. This paper proposes a novel approach that generates base trees using the same tree-generation procedure, but with different stopping conditions. Unlike random forests, this model can be efficiently integrated into a single tree structure. Additionally, the paper proposes some aggregation methods based on weighting the base models. Experimental results on standard datasets demonstrate that the proposed method outperforms well-known stopping conditions. Povzetek: Razvita je nova metoda kombiniranja regresijskih dreves, ki dosega boljše rezultate v primerjavi z znanimi metodami v regresijskih nalogah. 1 Introduction Decision trees are commonly used tools for pattern recognition, especially in supervised tasks like classification and regression. Their hierarchical structure allows for simple, interpretable, fast, and accurate decision-making. Decision trees have been extensively researched in the literature, with many applications focusing on classification tasks such as intrusion detection, privacy preservation, power systems, bank marketing, health care and disease diagnosis, and agriculture [1-7]. Additionally, researchers have investigated regression models based on decision trees, which are typically referred to as regression trees, in fields like psychology, education, urban planning, environmental management, genetics, communication, and economics [8-15]. In Table 1, the regression methods are compared. The popularity of decision trees can be attributed to their ability to provide a simple and interpretable representation of the data, which is particularly important in fields where understanding and analyzing the model's output is crucial. Furthermore, decision trees are fast and accurate, making them attractive for real-time applications. However, their performance can be affected by the quality of the data used to train the model, and they may not always be the best choice for complex datasets. Despite these limitations, decision trees remain a valuable tool for pattern recognition and continue to be an active area of research. 1.1 Decision tree A decision tree is a flowchart-like structure with a tree hierarchy, where each internal node includes a test on a feature, each branch represents the output of the test, and each leaf node represents a consequence. 
To make a decision on a given instance using a decision tree, the instance must traverse a complete path from the root to a leaf in the tree. At the leaf, a value or a model is provided to predict the target value of the instance, such as a specific classifier or regression model. Decision trees can easily be converted into a set of rules, where each path from the root to a leaf is a rule, making them simple and completely interpretable. Compared to other rule-based systems, decision trees are efficient in decision-making due to their hierarchical structure, and not all rules need to be checked. Additionally, decision trees can simultaneously divide the feature space into small or large regions, providing specific and general rules, respectively. This ability makes decision trees one of the most accurate prediction models. In summary, decision trees are simple, interpretable, fast, and accurate models that can efficiently divide the feature space and convert the tree structure into a set of rules. 18 Informatica 47 (2023) 17–34 E.A. Mahani et al. The time complexity of decision tree induction has been shown to increase exponentially with the height of the tree. To overcome this limitation, heuristic methods have been proposed in the literature to produce shallow, small, and/or accurate trees [13,14]. These methods aim to optimize a predefined objective function on the training set by partitioning the feature space into two or more sub- regions based on a specific feature. The resulting tree includes internal (decision) nodes and leaves, where each node is associated with a region that may be partitioned based on a feature or labeled as a leaf node. In a classic decision tree, each sub-region, considering associated instances, is recursively partitioned into smaller ones until a stopping condition is satisfied. The partitioning process forms a tree that provides a hierarchical representation of the decision-making process. At each internal node, a test is performed on a feature (called pivot feature) to partition the feature space into sub-regions. The objective is to maximize the homogeneity of the instances within each sub-region with respect to the target variable. The partitioning process continues until a stopping condition is met, typically when a sub-region contains a small number of instances or when further partitioning does not improve the accuracy of the model. In regression models, the best feature is the one that can reduce the error metric of regression (e.g., mean square error) on the training instances more than others. In a regression tree, if a region cannot be further partitioned and some stopping conditions are met, the partitioning process is stopped. The instances associated with that region are then used for reasoning. A single value may be assigned as the prediction value for all instances in the leaf node, or a regression model trained on the associated instances may be used for future prediction. The assigned value or model represents the prediction of the decision tree for instances that belong to that region in the feature space. After the tree construction phase, the resulting tree may have many leaves, particularly in the presence of noisy data and outliers. This can lead to overfitting, where the tree is too complex and fits the training data too closely, even capturing undesired noises. To address this issue and remove the least reliable branches, statistical measures are typically used to prune the tree. 
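The construction loop described above — greedy selection of a pivot feature and splitting point, recursive partitioning, a stopping test, and a value assigned to each leaf — can be summarized in a short sketch before turning to pruning. This is a generic illustration only, not the specific procedure evaluated later in this paper; the binary axis-aligned splits, the SSE-reduction split criterion, and the two stopping thresholds (max_depth, min_leaf) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Node:
    prediction: float                  # leaf value: mean target of the instances reaching this node
    feature: Optional[int] = None      # pivot feature (None for a leaf)
    threshold: Optional[float] = None  # splitting point
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def sse(y):
    # sum of squared deviations from the mean (the node "error" used below)
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def build(X, y, depth=0, max_depth=5, min_leaf=5):
    node = Node(prediction=float(y.mean()))
    if depth >= max_depth or len(y) < 2 * min_leaf:         # stopping conditions
        return node
    best = None
    for f in range(X.shape[1]):                             # greedy search over pivot features
        for t in np.unique(X[:, f])[:-1]:                   # ... and candidate splitting points
            mask = X[:, f] <= t
            if mask.sum() < min_leaf or (~mask).sum() < min_leaf:
                continue
            gain = sse(y) - (sse(y[mask]) + sse(y[~mask]))  # decrease in SSE caused by the split
            if best is None or gain > best[0]:
                best = (gain, f, t, mask)
    if best is None or best[0] <= 0:                        # no useful split: keep the node as a leaf
        return node
    _, node.feature, node.threshold, mask = best
    node.left = build(X[mask], y[mask], depth + 1, max_depth, min_leaf)
    node.right = build(X[~mask], y[~mask], depth + 1, max_depth, min_leaf)
    return node

def predict(node, x):
    # root-to-leaf traversal: test the pivot feature at each internal node
    while node.feature is not None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction
```

The two thresholds max_depth and min_leaf in this sketch are exactly the kind of stopping-condition hyperparameters whose adjustment is discussed in Section 1.3 and addressed by the method proposed in Section 3.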
Pruning the decision tree results in a smaller, less complex, and easier-to-understand tree, which is usually faster and more accurate in classifying test data. Pruning can be performed either pre-pruning or post-pruning, depending on the stopping conditions used in the tree construction phase. Pre-pruning involves setting stopping conditions during tree construction to avoid overfitting, while post-pruning involves removing branches after construction based on statistical measures. Proper stopping conditions can lead to an efficient pre-pruning approach that is more accurate and faster than post- pruning [15]. In the testing phase, a query instance is fed into the decision tree starting from the root node and following a path through the tree until a leaf node is reached. At each node, the value of the query instance for the associated pivot feature is used to determine which child node to traverse to in the next layer. This process continues until the query instance reaches a leaf node. In the leaf node, the assigned value or model is used to predict the corresponding target value of the query instance. 1.2 Random forest Although decision trees have a simple structure and provide acceptable performance, finding the optimal tree that optimizes the desired objective function can be a challenging task. The construction process is typically done using a greedy method, which can result in a suboptimal tree. As a result, small changes in the parameters, distribution of training instances, or objective function can significantly alter the final tree. To address this issue, ensemble methods have been proposed to generate multiple trees from different perspectives to improve the stability and accuracy of the model [16]. Ensemble methods use multiple base models that can differ in type, model, hyperparameters, construction method, and instance set used for training. In the case of using a set of decision trees as the base models, the resulting ensemble model is called a Random Forest (RF). There are various types of RFs that use different approaches to generate the individual decision trees, such as Adaboost, boost strap or bagging [17-20]. Decision trees themselves have the ability to divide the feature space into subregions and their associated instances, which in turn allows a dedicated subtree to be assigned to each region to recursively classify the associated training instances. Therefore, decision trees employ a specific instance selection to some extent. In RF, the base trees typically differ in the random features chosen in each step to partition the feature space. This is known as feature bagging [21]. Additionally, the random selection of splitting points in each node is another characteristic of classic RF algorithm [22]. These randomization techniques help to reduce overfitting and increase the diversity of the base models, which in turn improves the accuracy and stability of the ensemble model. In ensemble methods such as Random Forests (RFs), a majority vote (in classification) or averaging (in regression) is used to combine the predictions of all the trees on a query instance to make the final prediction. This requires storing all the trees. Also, the output of all the trees should be determined for final prediction as a time- consuming task during the testing phase. Moreover, a large number of decision trees can degrade the interpretability of the model. 
Some rules may be generated whose antecedents are satisfied, but their consequences differ significantly from the final outcome of the ensemble model due to the aggregation of the consequences of all activated rules. Despite these limitations, RFs have been shown to significantly improve the performance of decision trees, and they remain popular in many applications. In this paper, trees ensembled with different stopping conditions is considered, which can lead to a more efficient and interpretable model. A Consolidated Tree Structure Combining Multiple Regression… Informatica 47 (2023) 17–34 19 1.3 Stopping conditions During the construction of a decision tree, the expansion of a node may be stopped for various reasons, such as decreasing memory complexity or preventing overfitting on the training data. To prevent overfitting and control the complexity of the decision tree, a set of hyperparameters is used to define and evaluate the stopping conditions. The stopping condition is checked at each node to decide whether to expand the node further or stop partitioning. Some stopping conditions' hyperparameters are reported in Table 2. The ones investigated in this paper are: - Maximum Depth of the tree (MxD): This parameter limits the maximum allowed depth of the decision tree. If the depth of a node exceeds this value, the node expansion is stopped. - Minimum number of Instances per leaf (MnI): This parameter sets the minimum number of training instances required to split a node. If a node has fewer instances than this threshold, the node expansion is stopped. - Maximum number of Leaves (MxL): This parameter limits the maximum allowed number of leaf nodes in the decision tree. If the number of leaves exceeds this value, the node expansion is stopped. - Maximum Error per leaf (MxE): This parameter sets the maximum error can be handled in a leaf. If the error exceeds this threshold, the leaf should be expanded. -Minimum (Relative) Promotion (MnP/MnRP): Each expansion should be along with a minimum decrease in the total error to be done. The value of decrease in error may be relatively computed respect to the current error. These hyperparameters can be tuned to optimize the performance of the decision tree and prevent overfitting. The merits and drawbacks of them are discussed later in this paper. The optimal set of hyperparameters depends on the characteristics of the training data and the desired performance of the model. On the other hand, adjusting the hyperparameters of a decision tree is a time- consuming and challenging task, and there is no rule of thumb to determine the optimal values for any given dataset. Cross-validation on the training data is a common technique for hyperparameter tuning in machine learning models. However, cross-validation can be expensive, and separating a set of validation data from the training set can result in variations in the constructed decision trees such that the best hyperparameters may not be suitable with sufficient confidence. In addition, cross-validation is a task of evaluating a value for a hyper parameter. A search strategy is required to extract candidates and find the optimal solution. The strategies will be compared later. Moreover, different regions of the feature space may require different attention to extract the decision boundaries. However, the hyperparameters of the stopping conditions affect all the nodes in a single tree. 
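For reference only (this is not part of the proposed method), the sketch below shows how such a grid search over stopping-condition hyperparameters is typically set up with scikit-learn; the parameter names max_depth, min_samples_leaf, max_leaf_nodes, and min_impurity_decrease roughly correspond to MxD, MnI, MxL, and MnP, and the candidate grid itself is an arbitrary assumption.

```python
# A minimal sketch of grid-search cross-validation over stopping hyperparameters,
# assuming scikit-learn's DecisionTreeRegressor and an arbitrary candidate grid.
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=0)

param_grid = {
    "max_depth": [2, 4, 6, 8, None],        # ~ MxD
    "min_samples_leaf": [1, 5, 10, 20],     # ~ MnI
    "max_leaf_nodes": [None, 16, 64],       # ~ MxL
    "min_impurity_decrease": [0.0, 1e-3],   # roughly analogous to MnP
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```

Even this small grid requires fitting 5 × 4 × 3 × 2 = 120 combinations over 5 folds, i.e., 600 trees, which illustrates the cost of cross-validation noted above; and whichever values are selected apply uniformly to every node of the resulting tree.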
Therefore, in some cases where more specificity is required, the expansion may be stopped, while in another branch of the tree, generalization may be sacrificed for extra partitioning. Additionally, the importance of the stopping conditions may vary from case to case. 1.4 Motivations and innovations In this paper, a novel approach is proposed to ensemble learning using decision trees with different hyperparameters, specifically focusing on the depth of the trees. This approach simplifies the process of designing these hyperparameters by generating many trees with different stopping conditions, which are then aggregated to produce the final result. In this approach, the hyperparameters are integrated into a single, simple solution by constructing trees with different depths in each branch. Finally, the resulting trees are merged into a single tree, called as the Mother Tree. This approach preserves both the efficiency and interpretability of the model, since it occupies memory only as large as a single tree and allows the results of all the virtual trees to be computed for a query instance by a single pass on the associated path from the root to a leaf of the Mother Tree. The proposed approach offers a simple and effective solution for stopping condition adjustment based on ensemble learning that overcomes the limitations and challenges of traditional decision trees and random forests, and provides an interpretable model. Weighted model aggregation has been shown to improve the performance of ensemble methods, but it can also be computationally expensive, especially when dealing with a large number of trees. Heuristic approaches, such as Adaboost are commonly used to address this issue [18]. In this paper, some simple weighting approaches are proposed, which can be applied to the Mother Tree structure. The proposed approaches assign weights to the virtual trees in voting phase, each one with its own characteristics. The remainder of this paper is organized as follows. In Section 2, a review of the related work is given on ensemble learning and model aggregation. In Section 3, the proposed model is presented. In Section 4, the experimental results are reported followed by discussions. Finally, the paper is concluded in Section 5 2 Related work A regression tree is a variant of the classification decision tree that predicts a real value instead of a class label. In a regression tree, a constant value or a regression model is created at each leaf node after constructing the tree. The node error is usually calculated as the mean square of all differences between the desired and predicted values for all validation instances in that node. The first known regression tree is the AID method [23] . AID declares the mean of target values of all instances reached at each leaf node as its regression value. Hence the mean- square error is equal to the variance of associated target values as the impurity measure of each node. This method generates a piecewise constant model that has good interpretability but may have lower accuracy than models with more smoothness. M5' is a regression tree that uses an efficient strategy to create a piecewise linear model [24]. The first output of this strategy is a piecewise constant tree, and then a proper linear regression is found and replaced constant value for 20 Informatica 47 (2023) 17–34 E.A. Mahani et al. each leaf. Because the tree structure is piecewise constant, the M5' final tree is larger than other piecewise linear trees. Fidalgo-Merino et al. 
introduced an incremental algorithm to generate a tree [25]. However, most recent research has applied Classification and Regression Trees (CART) as the base regression model more than other models in the literature [26]. Regression trees are typically constructed using heuristics, which do not guarantee the optimality of the output. Kordos et al. introduced an evolutionary approach in which a set of regression trees evolves to achieve the best tree [27]. Two important evolutionary operators, mutation and crossover, were implemented by substituting selected attributes along branches and exchanging subtrees between different trees. However, evolutionary search methods can be time-consuming, and defining a fitness function that can accurately describe the optimal tree may not be straightforward. With such ambiguity, finding the tree with the optimum objective function may not be necessary. Due to the uncertainties surrounding the proper objective function, limitations, and structure of regression trees, researchers have turned to ensemble different views of tree construction. RSSCARD is a method proposed in to predict spatial landslides [28]. They hybridized Random Subspaces (RSS) and CART to achieve their goal. Choubin and et al. ensemble, for the first time, CART and Multivariate Discriminant Analysis (MDA) to analyze flood susceptibility and obtained acceptable results [29]. Random Forests (RF) are also an ensemble of decision trees that combine the merits of different structures or parameters. Decision trees are faster to train than RFs, which is why they are still desirable. However, the accuracy of Random Forests (RF) is greater [30]. Parameter adjustment in tree construction is also a challenge in this research field. One of the most important sets of parameters are the ones associated with the stopping conditions, such as maximum depth, maximum leaves, and maximum error. Bing Zhang et al. used CART to predict blood pressure, and they used cross-validation to avoid overfitting, choose the best parameters, and develop a better general model [31]. Ensemble methods, including RFs, can also be used to aggregate trees with different parameters. In this case, time-consuming cross- validation is not used, and the randomness of the model injected from cross-validation is removed. However, to our knowledge, this ensemble approach has not yet been used in the literature as a solution for the problem of stopping conditions, especially for regression models. In this paper, not only is this ensemble approach presented, but also it has been efficiently integrated into the structure of a single tree called the Mother Tree. Finally, instead of generating many trees, a reasoning strategy has been proposed based on the generated Mother Tree with various aggregation methods. 3 Proposed method The main contributions of this paper, as mentioned previously, are as follows: 1. All stopping conditions and parameter combinations are integrated into one parameter, Maximum Depth of tree (MxD), as the only difference among the candidate trees. It has been proven that MxD is a sufficient parameter as the stopping condition, here. 2. All constructed trees are merged into a single tree with the time-complexity of generating just one tree. This approach saves computational resources and improves the interpretability of the model. 3. Some weighted average methods are proposed on this integrated tree to ensemble all candidate trees. 
These methods provide more robust and accurate prediction by combining the outputs of multiple trees. Each aggregation method has its own attributes and behavior on datasets. Each one of these contributions is addressed in detail in the following subsections. 3.1 Maximum depth of tree Stopping conditions' (SC) parameters do not affect the selection of the pivot feature of each node or the best set of splitting points in the nodes. However, these parameters can determine whether a node should be expanded or not. As a result, some leaves may be generated based on SCs that are internal nodes in some other trees (due to later stopping) and may not be generated in other trees (due to early stopping). The training set reached at an internal node or leaf is deterministic and unique in all possible trees that have that node. If the parameter of the stopping condition is changed, it can alter the tree's depth of some branches. The mostly expanded possible tree is called, here, the Mother Tree (MT). However, the MT may also follow some hard hyper restrictions defined by experts which prevent more expansion, such as the Maximum Depth of Tree. For example, consider an MT constructed as shown in Figure 1, where the Sum Square Error (SSE) of the target values of belonging instances in each node is also reported in the nodes. Figure 1: An example of Mother Tree expanded to reach zero sum square error in leaves. If Maximum Error (MxE) in leaves is considered as the stopping condition, each node with SSE greater than ME will be expanded, whereas nodes with SSE less than or equal to MxE will become leaves. Decreasing this upper-bound may lead to a deeper tree by expanding some nodes until reaching the MT shown in Figure 1. Other intermediate possible trees associated with different values of MxE are depicted in Figure 2. In this example, MT is completely generated for MxE < 4. A Consolidated Tree Structure Combining Multiple Regression… Informatica 47 (2023) 17–34 21 Figure 2: Intermediate tree structures depicted in Fig. 1: (a) MxE ≥ 16, (b) MxE ∈ [8,16), (c) MxE ∈ [4,8). It is evident that only full rooted subtrees (tree structures) can be generated because when an internal node is expanded, all of its children are generated. Cross- validation is a common method for adjusting hyperparameters to the best possible value in a candidate set. However, there is often no guarantee that two different candidate values will generate different tree structures. Additionally, a value that generates a specific tree structure may not be included in the candidate set. For example, cross-validation on the candidate values of MxE {18, 13, 8, 3} can only generate two out of the three tree structures in Figure 2 and the Mother Tree in Figure 1. In other words, MxE = 13 or 8 generates similar structures, and the one shown in Figure 2-c is missed. However, a complete set of non-redundant candidates can be generated in this case by sorting all positive SSEs in the nodes of the MT. The main problem, however, remains. Given a predefined parametric stopping condition, some node expansions are inevitable to achieve other specific expansions. For example, by decreasing MxE of the tree, it cannot be controlled to have expansion in specific nodes only. Hence, generating all tree structures is not guaranteed. For instance, the tree structure depicted in Figure 3 cannot be generated with this stopping condition because if a node with SSE = 4 is expanded, the node with SSE = 8 should also be expanded. 
However, these nodes cover disjoint regions of the feature space with their unique characteristics and requirements. Hybrid stopping conditions may generate more tree structures. There are many parameters associated with stopping conditions, as mentioned in the introduction, and each one can have numerous values if the parameter is discrete (e.g., MxD, MnI, and MxL) or a range of values if the parameter is continuous (e.g., MxE and MnP). Selecting an appropriate set of parameters and candidate value set for each one to construct all possible structures is likely impractical. Thus, conducting a complete search on tree structures is a challenging task, and the optimal structure may be missed. Any parameter of the stopping condition only affects the lengths of the paths from the root to the leaves. In other words, these parameters are not used in reasoning and only determine the expandability of each node during the training phase. Once the nodes that should not be expanded (leaves) are determined, the tree structure can be uniquely constructed, and the target value for each query instance can be predicted. The tree structures are finite, countable, and much less than all combinations of the parameter values. As the first contribution of this paper, all tree structures are investigated regardless of any stopping condition to address the above concerns. However, the number of tree structures also exponentially increases with respect to the Maximum Depth of the tree. Assume that the Mother Tree is a full binary tree with depth 𝐷 at all leaves, called here a 𝐷 -tree. For D-trees with depth D = 1 and 2, there are two and five tree structures, respectively. The five tree structures of the MT in Figure 1 are itself and four other structures presented in Figures 2 and 3. The number of tree structures can be very large for deeper trees. For example, a 4-tree, 5-tree, and 6-tree have 677, 458,330, and approximately 210 × 10^9structures, respectively. Such a large number of structures can make constructing all possible structures computationally infeasible. To address this problem, two techniques have been employed. Firstly, it is proved that a small set of tree structures defined with just one stopping condition, the Maximum Depth of tree, is sufficient to include an optimal structure for each query instance. Secondly, these structures are integrated to achieve a single structure without losing performance. These techniques significantly reduce the number of structures and make it practical to construct an optimal tree structure. 3.1.1 Minimum covering set Determining the best tree structure with minimum total error on a set of validation data can be challenging. Regardless of the definition of the error function, the optimal tree structure is not necessarily the best for all query instances. Finding the tree structure with the minimum error for each query instance 𝑞 in the validation set can help to use the strengths and overcome the weaknesses of different tree structures. This information can be used to select the best tree structure for a particular query instance or to develop an ensemble of tree structures that performs well on a diverse set of query instances. Since a query instance 𝑞 deterministically moves from the root of the MT down to reach the leaf, its path in its best structure is also determined to reach the associated leaf 𝐿 . 
In other words, one tree structure in the set of d-tree structures (d = 0, …, D) includes L as the leaf on the path associated with q and performs as well as the best structure for q. Therefore, a set of (D + 1) d-structures (d = 0, 1, …, D) can work as well as the exponential number of all tree structures. This approach significantly reduces the number of tree structures that need to be constructed and evaluated, making it computationally feasible to find an optimal tree structure.
Figure 3: A tree structure of the MT in Figure 1 which is not feasible with MxE alone as the stopping condition.
3.2 Merging the trees Assume a Mother Tree (MT) has been constructed, as shown in Figure 1. For each query instance q, the corresponding leaf in its optimal tree is denoted by N*(q). The path from the root towards the deepest corresponding node (leaf) of the MT certainly passes through N*(q). If a prediction value or model is computed for all internal nodes of the MT on this path, as well as for the leaves, the predicted target value of the query can be computed by every d-structure. Therefore, constructing just one tree structure (the Mother Tree) is sufficient, provided that all nodes (internal nodes and leaves) are assigned a prediction value or model in the training phase. The only remaining problem is determining the optimal depth for a given query instance. In this setting, two paths that separate at a node can have their own stopping conditions, which are discovered dynamically in the test phase. Even two query instances that share a common path may have their own stopping conditions on the tree. As a postponed decision, all prediction values of a query instance at all depths of the MT can be gathered, and one of these candidate values can be assigned as the final prediction (e.g., the most frequent or the median). This approach reduces the computational cost from constructing multiple tree structures to generating a single tree, but with a decision value or model in every node. In the training phase, it is assumed that the stopping condition may hold at any node (even the root). Hence, each node is temporarily treated as a leaf, and a label or model is assigned to it. In the test phase, each query instance q is fed to the MT to find its complete path. This path is a sequence of D + 1 decision nodes n_0(q), n_1(q), …, n_D(q). Based on this, D + 1 different predictions are computed for q, representing the values or the outputs of the models assigned to these decision nodes. It then remains to reason about the final value (i.e., to adaptively select the leaf or, equivalently, the stopping condition). For example, the most frequent or the median of the predicted values can be assigned. This adaptive approach can improve the performance of the MT and make it more robust to different query instances. 3.3 Ensemble of tree structures As mentioned earlier, the best tree structure for a query instance q exists; however, it may not be identifiable during the test phase. Moreover, even its prediction of the target value may not be sufficiently accurate; in that case, none of the candidate values on the path is sufficiently accurate. To address this issue, this paper, similar to random forests, uses the average of the outputs of the regression trees for reasoning during the testing phase. As the third contribution, this paper employs an ensemble of d-structures for reasoning, instead of selecting just one tree structure.
The ensemble of (D + 1) d-structures is used for the decision here. During the testing phase, each query instance q is fed to all (D + 1) d-structures in the ensemble, and D + 1 different predictions are computed for q, as explained earlier. Then, the average of these predictions is returned as the final prediction for q. Alternatively, a weighted approach can be used, as shown in (1):

y(q) = \frac{\sum_{d} w_d \, y_d(q)}{\sum_{d} w_d}     (1)

where y_d(q) is the value predicted for query instance q by the d-th d-structure, w_d is the weight assigned to the d-th d-structure, and \sum_{d} w_d is the sum of the weights over all d-structures, which normalizes the average. Each node n_d contributes to the overall reasoning in proportion to its positive weight w_d. If all d-structures are to have an equal effect on the total prediction, every w_d is set to one.
3.4 Weighting the nodes
In this paper, suitable greedy weighting methods are sought for the nodes of a path to determine the final decision. As a classic approach, a statistical representative of the outputs along the path can be taken as the final result, as follows:
1. If the goal is to minimize the maximum difference between each output and the representative, the Middle point, i.e., the mean of the minimum and maximum values, is returned as the final decision. This output is robust against small changes in all outputs except the minimum and maximum; however, it is highly sensitive to outlier outputs.
2. If the goal is to minimize the sum of absolute differences between the outputs and the representative, the final result is the Median of the outputs on the path. In contrast to the Middle, the Median is more robust against outliers: increasing the maximum or decreasing the minimum does not change its value.
3. Minimizing the sum of squared differences leads to the Mean of the outputs. It is sensitive to all values, but each counts only as one vote among the others.
All of the above approaches can be viewed as versions of a weighted average, as shown in equations (2)-(4):

Middle = \frac{1}{2} v_0 + \frac{1}{2} v_D     (2)
Median = \frac{1}{2} v_{\lfloor D/2 \rfloor} + \frac{1}{2} v_{\lceil D/2 \rceil}     (3)
Mean = \frac{1}{D+1} \sum_{d=0}^{D} v_d     (4)

where the values v_d (d = 0, 1, …, D) are the sorted outputs y_d(q) of the path. However, the weights can also be assigned by other strategies, depending on the goal. Two simple, opposite strategies are given here.
Increasing weighting: In this strategy, it is assumed that deeper nodes (more specific rules) are more trustworthy because they look at the training data at a higher resolution, so deeper leaves receive greater weights. With a constant growth rate α, the weights α, 2α, 3α, …, (D + 1)α are assigned to w_0, w_1, w_2, …, w_D, respectively.
Decreasing weighting: From this point of view, the deepest node (the leaf) is not necessarily the best for reasoning, especially if the Mother Tree is permitted to expand as much as possible (which may result in overfitting to the training data). In addition, the training instances that belong to a leaf not only affect the output of the leaf but also play a role in determining the outputs of all internal nodes on the path. Therefore, the accumulated effect of the training instances belonging to the leaf on the final result is larger than that of the other instances. This is why, to balance the effects, the weights of deeper nodes should be smaller.
In this paper, the reverse of the previous approach is used: α, 2α, …, (D + 1)α are assigned to w_D, w_{D−1}, …, w_0, respectively. In increasing weighting, each weight w_d should be normalized as in equation (5):

w_d^N = \frac{2 w_d}{\alpha (D+1)(D+2)} = \frac{2(d+1)\alpha}{\alpha (D+1)(D+2)} = \frac{2(d+1)}{(D+1)(D+2)}     (5)

where w_d^N is the normalized value of w_d. Hence, α has no effect on the final result, and the weights can equivalently be taken as 1, 2, …, D + 1.
3.5 Discussions
In this subsection, some properties of the proposed model are discussed from different viewpoints. The proposed ensemble approach has several desirable properties that make it a promising method for classification and regression tasks. It can be considered an ensemble model of (D + 1) d-structures with weights w_0, w_1, …, w_D, respectively, where each d-structure corresponds to a tree with a constant depth d in all branches. In addition, the proposed model is flexible and can be adapted to different types of datasets and tasks. No new parameter is introduced by this paper for implementing the proposed ensemble model; the only parameters are those related to the construction of the MT, such as the number of branches of an internal node for continuous pivot features or the maximum depth to which the MT may be expanded. These parameters can be adjusted based on the specific characteristics of the dataset and the computational resources available.
3.5.1 Extra expansion
One of the challenges in using the proposed ensemble approach is determining the value of the maximum depth D of the MT. The value of D can be chosen based on the characteristics of the dataset and the computational resources available. It is worth noting that the maximum depth D is equivalent to the Maximum Depth of the tree (MxD) used in a regression tree with a single stopping condition: in both cases, the tree is expanded up to a certain depth. However, the main difference between a regression tree with a single MxD-based stopping condition and the proposed method lies in the reasoning. In the former, the target value is predicted using only the value or model assigned to the leaf node reached by the query instance. In the proposed method, a weighted average of the predictions in that leaf node and its ancestors is used as the final result, so the contribution of each node to the final prediction is taken into account. In increasing weighting, specific rules have more effect on the output, giving higher precision on the training data; in decreasing weighting, general rules have more effect, giving better generalization on test data. Expanding the nodes more than necessary can cause overfitting on the training set. The weighting aggregates general (short) and specific (long) rules to make a good decision. Specifically, if the MT overfits the training data through deep leaves, this can be compensated to some extent by the shallower nodes in the test phase: shallow nodes capture the general patterns and rules in the data, while deeper nodes capture more specific and detailed patterns. The proposed ensemble approach aggregates the predictions of all nodes, taking their weights into account, to make the final prediction for a query instance, so that the contribution of each node is proportional to its importance. In decreasing weighting, overfitting is addressed by ensuring that the weights of deeper nodes are smaller than those of shallower nodes.
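The aggregation rules of Sections 3.3-3.4 can be sketched as follows. The sketch assumes that the D + 1 path predictions y_0(q), …, y_D(q), collected along the query's root-to-leaf path in the Mother Tree, are already available as a list; only the combination step of equations (1)-(5) is shown.

```python
import numpy as np

def aggregate(path_preds, strategy="decreasing"):
    """Combine the D+1 predictions collected along one root-to-leaf path."""
    v = np.asarray(path_preds, dtype=float)    # v[d] = output of the d-structure for this query
    D = len(v) - 1
    if strategy == "middle":                   # Eq. (2): midpoint of the extreme outputs
        return 0.5 * v.min() + 0.5 * v.max()
    if strategy == "median":                   # Eq. (3): median of the outputs
        return float(np.median(v))
    if strategy == "mean":                     # Eq. (4): plain average
        return float(v.mean())
    if strategy == "increasing":               # weights 1, 2, ..., D+1 (deeper nodes trusted more)
        w = np.arange(1, D + 2)
    elif strategy == "decreasing":             # reversed weights (shallower nodes trusted more)
        w = np.arange(D + 1, 0, -1)
    else:
        raise ValueError(strategy)
    return float(np.dot(w, v) / w.sum())       # weighted average of Eq. (1), normalized as in Eq. (5)

# e.g. predictions gathered at depths 0..4 for one query instance:
print(aggregate([3.1, 2.8, 2.5, 2.2, 4.0], strategy="decreasing"))
```

Since the normalization divides by the sum of the weights, the growth rate α cancels, matching equation (5).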
3.5.2 Single representative tree
One concern with the proposed ensemble approach is the time complexity of tree construction and of reasoning during the training and test phases, respectively. However, the tree construction process in this approach is similar to that of a simple regression tree, including feature selection, finding splitting points, node expansion, and assigning a value or model to the leaves. In contrast to approaches that use a regression model only in the leaves (e.g., M5), the proposed method requires learning a model for all nodes, both leaves and internal nodes: each internal node at depth d is treated as a leaf of a d-structure, and a model is learned on the instances that reach that node. All these models, however, are tuned in the offline training phase rather than during online testing. A further concern is the time complexity of reasoning during the online testing phase: each query instance must be fed to all the models of the nodes met on its path to compute their outputs, which can be time-consuming. To address this, the computation can be transferred to the training phase. Specifically, let Q be the set of instances (not limited to the training set) that reach the leaf node n_D(Q). Each instance in Q deterministically meets all and only the ancestors of n_D(Q) on the path from the root to the leaf node (n_0(Q), …, n_D(Q), where n_d(Q) is the node at depth d met by Q). For all instances in Q, the same weights w_0, …, w_D and models y_0(q), …, y_D(q) are used to generate the output. However, y_d(q_1) need not equal y_d(q_2) for two different vectors q_1 and q_2 in Q, unless a constant value is assigned to each node. Finally, instead of keeping D + 1 different models on the path, a single model can be generated in each leaf by integrating the models of the path using equation (1); even the weights can be determined adaptively. By using the same weights and models for all instances in Q, the time complexity of reasoning during online testing may be reduced significantly. In some cases, such as the AID and M5 models, the outputs of all nodes are constant or can be represented by a polynomial regression model of degree M. In these cases, the final output is also constant or a polynomial of the same degree. Therefore, instead of averaging the nodes' outputs along the path during reasoning, a constant output or a polynomial regression model can be assigned to the leaves only, making the time complexity of reasoning equal to that of a single regression tree. However, such a value or model is not computed using only the target values of the instances belonging to that leaf; other instances also contribute, according to their roles in determining the values of the nodes on the path and the associated weights. Hence, the problem of adjusting the stopping condition turns into a problem of value/model assignment to the leaves. In other words, the main difference between the proposed method and the base regression tree lies in the value/model used in the leaves for reasoning, with similar time complexity. The same computation can be applied to other regression models, but the weighted average of base models is not necessarily a model of the same type; in that case, the time complexity may not be reduced.
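As a sketch of the folding step described above, and under the assumption that every node stores a constant output (as in an AID-style tree), the weighted path average of equation (1) can be precomputed once per leaf during training. Here node.mean, node.children, and node.value are assumed attributes of a Mother-Tree node, and decreasing weighting is used as an example.

```python
def fold_path_into_leaves(node, ancestor_means=()):
    """Store in each Mother-Tree leaf the weighted average of the constant
    outputs of all nodes on its root-to-leaf path (decreasing weighting)."""
    means = ancestor_means + (node.mean,)            # constant outputs along the path so far
    if not node.children:                            # a leaf at depth D = len(means) - 1
        D = len(means) - 1
        w = [D + 1 - d for d in range(D + 1)]        # decreasing weights D+1, D, ..., 1
        node.value = sum(wi * m for wi, m in zip(w, means)) / sum(w)
    else:
        for child in node.children:
            fold_path_into_leaves(child, means)
```

After this pass, reasoning for a query instance is identical to that of an ordinary regression tree: the instance is routed to its leaf and node.value is returned directly, which is the sense in which the only difference from the base tree is the value stored in the leaves.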
As the final quality evaluation, the proposed method is compared with general strategies for hyperparameter setting in Table 3. As mentioned, Evolutionary strategies can also be used for tree construction including hyperparameters [32]. But evolutionary models are highly time-consuming such that evaluating each solution needs a tree construction and evaluation. The most usual strategy is grid search where a limited set of values for a hyperparameter (often single parameter) is considered and the best one based on cross-validation is selected. In order to be sure, these values address different and all possible trees, entire path of the parameter may be used in the literature [33]. The entire path of stopping condition parameters can easily be extracted by traversing the Mother Tree; although, based on our knowledge, no research yet has proposed it. Ensemble of models with different parameters is also a good approach which is not indeed a search strategy [34]. However, it uses various models with different parameters (instead of selecting just one value) and aggregates their outputs. One of the contributions of this paper is proposing an ensemble model for defining the stopping condition. The main merit of ensemble models is their capacity to generate outputs by aggregation which are not even seen in each base model. The main drawback of ensemble models is their storage need and inefficiency during reasoning because the query instance should be fed to all base models stored in the memory. But the proposed method can integrate all the models in one structure to leverage merits of ensemble models along with low complexity in time and memory. One of the most important merits of the proposed model is that, for a query instance can compute the final output with any stopping condition parameter without need to reconstruct the tree using the training set. 4 Experimental results In this section, the proposed ensemble approach is compared with single stopping conditions in decision trees. Other regression methods are not considered because, it is assumed that decision tree in a specific application is preferred by an expert as a regression model due to their accuracy, speed in reasoning, or interpretability. Random forests may outperform single trees by sacrificing speed of reasoning and interpretability. However, the proposed method, despite being based on an ensemble of trees, can be merged and described by a single tree. As mentioned before, even the resulted tree can be converted to a single tree with just labels or models in the leaves. In the conducted experiments and comparisons, six datasets presented in Table 4, were used. Two of these datasets, "Stock portfolio performance" and "Concrete Compressive Strength," have two target values for prediction. As a result, eight datasets in the experimental results were used. The proposed ensemble approach was compared with single decision trees using different stopping conditions, such as Maximum Depth of the tree (MxD), Maximum Error per leaf (MxE), and Minimum Relative Promotion in error (MnRP). The experimental results show that the proposed ensemble approach outperforms single stopping conditions. In this paper, two types of regression trees were implemented based on AID and M5. Each tree construction has two phases: in the first phase, the tree is constructed by selecting the best features and splitting points, while in the second phase, a model is assigned to the leaves. 
Based on AID, a constant value (the mean of the target values of the instances belonging to the leaf) is assigned to each leaf. In M5, although the tree is constructed similarly, a linear regressor is ultimately assigned to each leaf. In other words, the tree is constructed using constant, fast regressors, and finally a linear regressor is fitted on the instances of each leaf. The use of constant regressors in AID-based trees and linear regressors in M5-based trees allows the target variable to be modeled flexibly, depending on the nature of the dataset. In this paper, each feature is partitioned into two branches at an internal node; however, the same feature may be used again as the pivot feature in descendants, indirectly supporting partitioning into more than two branches. The proposed trees were tested on all datasets listed in Table 4, with Maximum Depth of tree (MxD) as the only stopping condition. In the first experiment, the Mean Square Error (MSE) of prediction by both AID and M5 was computed versus MxD, using five-fold, five-times cross-validation. Figure 4 reports, for each MxD, the number of datasets on which AID or M5 performed better than the other. For example, for MxD of at most two, M5 defeated AID on all datasets. However, as the depth increased up to 20, the number of datasets on which AID achieved better prediction also increased. The main reason is that increasing the depth of the tree also increases its complexity; in such situations, simpler models in the leaves generalize better to unseen data. The best model for each dataset can often be detected easily by cross-validation. Therefore, to be fair, the best model was used for each dataset (i.e., M5 was used on "Housing" and "Concrete-Slump", while AID was applied to the others).
Figure 4: The number of wins of AID [23] and M5 [24] in competition on the datasets addressed in Table 4 vs. MxD.
Table 5 reports the normalized MSE of the predictions on the datasets for different values of MxD. The MSE is normalized by the mean square of the target value under prediction, to diminish the effect of the scale of the target values on the error. The last two rows of the table show the minimum normalized MSE and the associated depth for each dataset. While using the maximum depth as a stopping condition has some merits, it also has drawbacks. On the one hand, this parameter is simple and has a practically finite set of values (positive integers below an acceptable bound), which makes it easy to implement and tune. On the other hand, it acts identically in all branches, whereas two nodes at similar depths may be in completely different situations with respect to the number of associated instances, the variance of the target values, and the separability of the instances in the next expansion. As a consequence, finding the proper value of MxD for a dataset is a challenge, and a single value may not work properly in all parts of the tree. In the next experiment, the best regression tree for each dataset (AID/M5) was constructed with another stopping condition, under which a node is not expanded unless the Sum Square Error (SSE) of its associated instances is greater than a predefined threshold. This threshold is referred to as Maximum Error (MxE) in this paper. The results of this experiment are reported in Table 6, as normalized MSE versus 10 different values of MxE with various granularities.
The last two rows of the table present the minimum error and the corresponding value of MxE for each dataset. Even for this stopping condition, as shown in Table 6, different thresholds lead to the best performance on different datasets. For example, the two labels of "Stock Portfolio" (Stock Annual & Excess) achieve the best prediction with MxE = 0.01, while for "Slump-Slump" the best performance is obtained with MxE = 100. This stopping condition also suffers from some drawbacks, including its scale sensitivity. If the training set is duplicated, no modification of the decision tree is desired; however, the number of instances in each node is doubled, and therefore stopping conditions based on MxE or Minimum Instances per leaf (MnI) may only be satisfied in deeper nodes. To address this issue, the Mean Squared Error (MSE) of each node may be checked as the stopping criterion instead of the SSE. Another scaling problem is related to the scale of the target value. If all target values are scaled by α, a similar tree is desired for reasoning, only with scaled values at the leaves. However, the Sum Squared Error and the MSE of each node are scaled by α², so defining a threshold depends on the scale of the target values. The third stopping condition investigated in this paper is based on Relative Promotion (RP), defined in equation (6):

RP(n) = 1 - \frac{\sum_{b=1}^{B} SSE(n_b)}{SSE(n)}     (6)

Equation (6) defines RP as the difference between the sum of squared errors (SSE) of the parent node and the sum of the SSEs of its children, divided by the SSE of the parent node. Here, SSE(n) is the Sum Square Error of the instances associated with node n, and node n is considered to be expanded into B children n_1, …, n_B. In other words, this metric measures how much the expansion decreases the sum square error of prediction on the associated instances, relative to the initial SSE of the parent node. The proposed approach uses RP as a stopping condition, such that a node is not expanded if its RP is less than a predefined threshold. This threshold is referred to as Minimum Relative Promotion (MnRP) in this paper. Unlike other stopping conditions, such as MxD and MxE, RP is stable with respect to both the training set size and the scaling of the target values. Table 7 shows the normalized MSE of experiments using this stopping condition with different values of MnRP for tree construction on the datasets. Although the normalized errors oscillate less than with the previous metrics, and there is a value of MnRP (i.e., 0.2) that is suitable for most of the datasets, the final results are not sufficiently good. Table 8 compares the best results (minimum normalized MSE) of the stopping conditions in Tables 5-7, which provides insight into the relative performance of each stopping condition on different datasets. Table 8 ranks the stopping conditions by their achieved results on each dataset: each stopping condition is assigned a rank equal to the number of other stopping conditions with lower error plus one. For example, on "Yacht Dynamics," both MxD and MxE achieved the best rank, and MnRP gets the third rank. The last column of the table shows the average rank of each stopping condition across all datasets.
Based on the results, MxD was the best stopping condition on all datasets except "Stock Excess," where it achieved the second rank. MxE achieved the best error on four datasets, but it was better than MxD only on "Stock Excess"; on the other datasets it was the worst stopping condition. MnRP was never the best stopping condition on any dataset, but on average it achieved a better rank than MxE. Owing to the stability of the results based on MnRP (presented in Table 7), it can be an acceptable stopping condition when the user does not know the best value of the parameter. When the best parameter value can be found, MnRP is better than MxE on average but cannot compete with MxD. Based on these experiments, MxD can be a good stopping metric if there is enough knowledge to find its best value. However, the proposed model claims that an ensemble of trees with different depths may achieve better results on average without requiring any such knowledge. To evaluate this claim, the proposed model is compared with a set of trees with depths ranging from 0 to 7 in Table 9. The proposed model combines all eight of these trees in a single tree with MxD = 7 and five aggregation strategies: Middle, Median, Mean, Increasing weighting, and Decreasing weighting. The methods are compared based on the resulting normalized MSE. In addition to the normalized MSE, the rank of each method on each dataset is computed and shown in parentheses. The last columns of the table show the average rank and error of each method across all datasets. The results indicate that the ensemble methods significantly outperform the fixed-depth trees. The worst average error among the proposed models (0.1349, for the Middle representative) is still better than the best one among the single-depth models (0.1352, for depth 2). The best average error is 0.1049, for Decreasing Weighting. Considering the ranks, only on "Housing" and "Yacht" is the best method a single-depth model, with depths five and six, respectively. On the other datasets, the best rank of a single-depth model is 4th. For depths 0-4, no model achieves a rank better than 5th on any dataset. For any fixed depth, there is at least one dataset on which the model is ranked 10th or worse, whereas the Decreasing Weighting model is ranked 1st on four datasets and at worst 8th on the others. The Mean model is ranked 2nd or 3rd on all datasets except "Yacht"; this is why the best average rank is achieved by Mean and the second best by Decreasing Weighting. In Fig. 5, the average rank of the fixed-depth models on each dataset is compared with that of the proposed ensemble models. As depicted, the ensemble models significantly outperform the fixed-depth ones on all datasets except "Yacht".
Figure 5: Comparison of the average ranks of fixed-depth models with those of the ensemble models on the datasets.
Figure 6: Differences between the normalized MSE of the 1st-ranked method and each of the "Mean" and "Decreasing Weighting" methods on the datasets.
For a closer comparison between the two rival proposed methods, "Mean" and "Decreasing Weighting," Fig. 6 presents the difference between the normalized MSE of each method and that of the best method on each dataset. Because the "Mean" method is not the best on any dataset, this difference is always non-zero for "Mean"; although "Mean" is ranked 2nd on "Challenger," the difference there is significant, about 0.08.
However, the "Decreasing Weighting" method is not only the best on four datasets, but its difference from the best method also never exceeds 0.02.
5 Conclusion
In this paper, a new approach is proposed for ensembling regression trees that differ in their maximum depth. The trees are integrated into a single tree using several weighting methods, and all the outputs are merged into a single crisp tree. Based on the experimental results, the proposed methods can achieve better results than even the tree with the best maximum depth on most of the datasets, indicating the importance of considering an ensemble of trees with different depths rather than a fixed stopping condition for all datasets. Moreover, the proposed method achieves a significantly better average rank than trees constructed up to a fixed maximum depth, with the time complexity of a single crisp tree. Future research directions include investigating other reasoning methods, weighting methods, and stopping conditions, and making the final prediction function continuous. These extensions can further improve the performance and interpretability of regression trees in various applications.
Acknowledgment
The authors would like to thank Dr. M. Taheri for his help in revising this paper.
References
[1] Charbuty, B., & Abdulazeez, A. (2021). Classification based on decision tree algorithm for machine learning. Journal of Applied Science and Technology Trends, 2(01), 20-28. https://doi.org/10.38094/jastt20165
[2] Nancy, P., Muthurajkumar, S., Ganapathy, S., Kumar, S. S., Selvi, M., & Arputharaj, K. (2020). Intrusion detection using dynamic feature selection and fuzzy temporal decision tree classification for wireless sensor networks. IET Communications, 14(5), 888-895. https://doi.org/10.1049/iet-com.2019.0172
[3] Wang, C., Wang, A., Xu, J., Wang, Q., & Zhou, F. (2020). Outsourced privacy-preserving decision tree classification service over encrypted data. Journal of Information Security and Applications, 53, 102517. https://doi.org/10.1016/j.jisa.2020.102517
[4] Vanfretti, L., & Arava, V. N. (2020). Decision tree-based classification of multiple operating conditions for power system voltage stability assessment. International Journal of Electrical Power & Energy Systems, 123, 106251. https://doi.org/10.1016/j.ijepes.2020.106251
[5] Yang, S. B., & Chen, T. L. (2020). Uncertain decision tree for bank marketing classification. Journal of Computational and Applied Mathematics, 371, 112710. https://doi.org/10.1016/j.cam.2020.112710
[6] Sahoo, S., Subudhi, A., Dash, M., & Sabut, S. (2020). Automatic classification of cardiac arrhythmias based on hybrid features and decision tree algorithm. International Journal of Automation and Computing, 17(4), 551-561. https://doi.org/10.1007/s11633-019-1219-2
[7] Rajesh, B., Vardhan, M. V. S., & Sujihelen, L. (2020, June). Leaf Disease Detection and Classification by Decision Tree. In 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI) (pp. 705-708). IEEE. https://doi.org/10.1109/icoei48184.2020.9142988
[8] Gomes, C. M. A., Amantes, A., & Jelihovschi, E. G. (2020). Applying the regression tree method to predict students' science achievement. Trends in Psychology, 28(1), 99-117. https://doi.org/10.9788/s43076-019-00002-5
[9] Hu, Y., Dai, Z., & Guldmann, J. M.
Acknowledgment

The authors would like to thank Dr. M. Taheri for his help in revising this paper.

References

[1] Charbuty, B., & Abdulazeez, A. (2021). Classification based on decision tree algorithm for machine learning. Journal of Applied Science and Technology Trends, 2(01), 20-28. https://doi.org/10.38094/jastt20165
[2] Nancy, P., Muthurajkumar, S., Ganapathy, S., Kumar, S. S., Selvi, M., & Arputharaj, K. (2020). Intrusion detection using dynamic feature selection and fuzzy temporal decision tree classification for wireless sensor networks. IET Communications, 14(5), 888-895. https://doi.org/10.1049/iet-com.2019.0172
[3] Wang, C., Wang, A., Xu, J., Wang, Q., & Zhou, F. (2020). Outsourced privacy-preserving decision tree classification service over encrypted data. Journal of Information Security and Applications, 53, 102517. https://doi.org/10.1016/j.jisa.2020.102517
[4] Vanfretti, L., & Arava, V. N. (2020). Decision tree-based classification of multiple operating conditions for power system voltage stability assessment. International Journal of Electrical Power & Energy Systems, 123, 106251. https://doi.org/10.1016/j.ijepes.2020.106251
[5] Yang, S. B., & Chen, T. L. (2020). Uncertain decision tree for bank marketing classification. Journal of Computational and Applied Mathematics, 371, 112710. https://doi.org/10.1016/j.cam.2020.112710
[6] Sahoo, S., Subudhi, A., Dash, M., & Sabut, S. (2020). Automatic classification of cardiac arrhythmias based on hybrid features and decision tree algorithm. International Journal of Automation and Computing, 17(4), 551-561. https://doi.org/10.1007/s11633-019-1219-2
[7] Rajesh, B., Vardhan, M. V. S., & Sujihelen, L. (2020, June). Leaf disease detection and classification by decision tree. In 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI) (pp. 705-708). IEEE. https://doi.org/10.1109/icoei48184.2020.9142988
[8] Gomes, C. M. A., Amantes, A., & Jelihovschi, E. G. (2020). Applying the regression tree method to predict students' science achievement. Trends in Psychology, 28(1), 99-117. https://doi.org/10.9788/s43076-019-00002-5
[9] Hu, Y., Dai, Z., & Guldmann, J. M. (2020). Modeling the impact of 2D/3D urban indicators on the urban heat island over different seasons: A boosted regression tree approach. Journal of Environmental Management, 266, 110424. https://doi.org/10.1016/j.jenvman.2020.110424
[10] Yang, Q., Williamson, A. M., Hasted, A., & Hort, J. (2020). Exploring the relationships between taste phenotypes, genotypes, ethnicity, gender and taste perception using Chi-square and regression tree analysis. Food Quality and Preference, 83, 103928. https://doi.org/10.1016/j.foodqual.2020.103928
[11] Alamgir, M. S. M., Sultana, M. N., & Chang, K. (2020). Link adaptation on an underwater communications network using machine learning algorithms: Boosted regression tree approach. IEEE Access, 8, 73957-73971. https://doi.org/10.1109/access.2020.2981973
[12] Shabani, S., Pourghasemi, H. R., & Blaschke, T. (2020). Forest stand susceptibility mapping during harvesting using logistic regression and boosted regression tree machine learning models. Global Ecology and Conservation, 22, e00974. https://doi.org/10.1016/j.gecco.2020.e00974
[13] Avellaneda, F. (2020, April). Efficient inference of optimal decision trees. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 3195-3202). https://doi.org/10.1609/aaai.v34i04.5717
[14] Salman Saeed, M., Mustafa, M. W., Sheikh, U. U., Jumani, T. A., Khan, I., Atawneh, S., & Hamadneh, N. N. (2020). An efficient boosted C5.0 decision-tree-based classification approach for detecting non-technical losses in power utilities. Energies, 13(12), 3242. https://doi.org/10.3390/en13123242
[15] Lotfi, S., Ghasemzadeh, M., Mohsenzadeh, M., & Mirzarezaee, M. (2021). The construction of scalable decision tree based on fast splitting and J-max pre-pruning on large datasets. International Journal of Engineering, 34(8). https://doi.org/10.5829/ije.2021.34.08b.01
[16] Panhalkar, A. R., & Doye, D. D. (2021). A novel approach to build accurate and diverse decision tree forest. Evolutionary Intelligence, 1-15. https://doi.org/10.1007/s12065-020-00519-0
[17] Muharam, F. M., Nurulhuda, K., Zulkafli, Z., Tarmizi, M. A., Abdullah, A. N. H., Che Hashim, M. F., ... & Ismail, M. R. (2021). UAV- and Random-Forest-AdaBoost (RFA)-based estimation of rice plant traits. Agronomy, 11(5), 915. https://doi.org/10.3390/agronomy11050915
[18] Jadhav, D. A. (2021). An enhanced and secured predictive model of Ada-Boost and Random-Forest techniques in HCV detections. Materials Today: Proceedings. https://doi.org/10.1016/j.matpr.2021.05.071
[19] Wang, Q., Zhou, Y., Ding, W., Zhang, Z., Muhammad, K., & Cao, Z. (2020). Random forest with self-paced bootstrap learning in lung cancer prognosis. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(1s), 1-12. https://doi.org/10.1145/3345314
[20] Biau, G., & Scornet, E. (2016). A random forest guided tour. Test, 25(2), 197-227. https://doi.org/10.1007/s11749-016-0481-7
[21] Abellan, J., Mantas, C. J., Castellano, J. G., & Moral-Garcia, S. (2018). Increasing diversity in random forest learning algorithm via imprecise probabilities. Expert Systems with Applications, 97, 228-243. https://doi.org/10.1016/j.eswa.2017.12.029
[22] Hornung, R. (2020). Diversity forests: Using split sampling to allow for complex split procedures in random forest. https://doi.org/10.1007/s42979-021-00920-1
[23] Morgan, J. N., & Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58(302), 415-434. https://doi.org/10.1080/01621459.1963.10500855
[24] Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2005). Data Mining: Practical Machine Learning Tools and Techniques. https://doi.org/10.1016/b978-0-12-374856-0.00007-9
[25] Fidalgo-Merino, R., & Nunez, M. (2011). Self-adaptive induction of regression trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1659-1672. https://doi.org/10.1109/tpami.2011.19
[26] Breiman, L. (2017). Classification and Regression Trees. Routledge.
Dua, D., & Graff, C. (2017). UCI Machine Learning Repository.
Grisoni, F., Consonni, V., Vighi, M., Villa, S., & Todeschini, R. (2016). Investigating the mechanisms of bioconcentration through QSAR classification trees. Environment International, 88, 198-205. https://doi.org/10.1016/j.envint.2015.12.024
[27] Kordos, M., Piotrowski, J., Bialka, S., Blachnik, M., Golak, S., & Wieczorek, T. (2012, March). Evolutionary optimized forest of regression trees: Application in metallurgy. In International Conference on Hybrid Artificial Intelligence Systems (pp. 409-420). Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28942-2_37
[28] Pham, B. T., Prakash, I., & Bui, D. T. (2018). Spatial prediction of landslides using a hybrid machine learning approach based on random subspace and classification and regression trees. Geomorphology, 303, 256-270. https://doi.org/10.1016/j.geomorph.2017.12.008
[29] Choubin, B., Moradi, E., Golshan, M., Adamowski, J., Sajedi-Hosseini, F., & Mosavi, A. (2019). An ensemble prediction of flood susceptibility using multivariate discriminant analysis, classification and regression trees, and support vector machines. Science of the Total Environment, 651, 2087-2096. https://doi.org/10.1016/j.scitotenv.2018.10.064
[30] Ahmad, M. W., Reynolds, J., & Rezgui, Y. (2018). Predictive modelling for solar thermal energy systems: A comparison of support vector regression, random forest, extra trees and regression trees. Journal of Cleaner Production, 203, 810-821. https://doi.org/10.1016/j.jclepro.2018.08.207
[31] Zhang, B., Wei, Z., Ren, J., Cheng, Y., & Zheng, Z. (2018). An empirical study on predicting blood pressure using classification and regression trees. IEEE Access, 6, 21758-21768. https://doi.org/10.1109/access.2017.2787980
[32] Abdelkader, E. M., Al-Sakkaf, A., Alfalah, G., & Elshaboury, N. (2022). Hybrid differential evolution-based regression tree model for predicting downstream dam hazard potential. Sustainability, 14(5), 3013. https://doi.org/10.3390/su14053013
[33] Bak, K. Y. (2023). The regularization paths of total variation-penalized regression splines. Communications in Statistics - Simulation and Computation, 1-12. https://doi.org/10.1080/03610918.2023.2170410
[34] Zhang, Y. L., Shi, Q., Li, M., Yang, X., Li, L., & Zhou, J. (2022). A classification based ensemble pruning framework with multi-metric consideration. In Intelligent Systems and Applications: Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 1 (pp. 650-667). Springer International Publishing. https://doi.org/10.1007/978-3-030-82193-7_44
Table 1: Comparing categories of the regression methods

Regression Methods | Interpretability | Storage | Training time | Reasoning time | Performance | Description
Deep | Low | Complex | High | Good | High | Needs large training data
Neural Network | Low | Good | Good | Low | High | Local optima, high capacity of learning
Model Based | Low | Good | Good | Low | Good | Global information
Prototype Based | Good | Complex | Low | High | High | Local information, storing all prototypes
Rule Based | High | Low | Good | Good | Good | Needs to evaluate by all rules
Random Forest | Good | Good | Good | Good | High | Needs to evaluate by all trees
Decision Tree | High | Low | Low | Low | Good | Interpretable, compact, simple, fast, accurate

Table 2: Stopping conditions' hyperparameters of regression trees

Stopping Condition | Merits | Drawbacks
Max. Depth | Simple, integer, limited values | Global effect on all nodes, independent of error
Max. Nodes | Simple, integer, limited values, using storage completely | Needs priority metric, complex training, independent of error
Min. Instances | Simple, integer, limited values | Sensitive to the size of dataset, independent of error
Max. Sum Square Error | Considers local errors | Floating point, unlimited range, sensitive to the size of dataset and the scale of target values
Max. Mean Square Error | Considers local errors, insensitive to the size of dataset | Floating point, unlimited range, sensitive to the scale of target values, ignores expandability
Min. Error Promotion | Considers local errors, considers expandability | Floating point, unlimited range, sensitive to the scale of target values
Min. Relative Error Promotion | Considers local errors, considers expandability, insensitive to the size of dataset and scale of target values | Floating point, unlimited range
Information theoretical | Good for classification | Not proper for regression and continuous spaces
Hybrid conditions | More control on tree construction | Complex, multiple parameters, static

Table 3: Search strategies to find hyperparameters of a model

Search strategy | Merits | Drawbacks
Evolutionary computation (GA, PSO, ...) | Capable of searching complex spaces, including parameters of hybrid conditions | Inefficient: needs to measure the fitness of each solution by tree construction; static reasoning
Grid search | Simple | Just one parameter; limited number of candidates (redundancy and missing values); static reasoning
Entire path | All possible unique values of a parameter | Just one parameter; static reasoning
Ensemble of models | Generates various outputs | Each parameter set needs a tree; inefficient reasoning
Proposed model | Generates various outputs; single tree; efficient reasoning; tests new conditions without retraining | (none listed)

Table 4: Datasets used in the experiments, taken from the regression category of the UCI Machine Learning Repository (Asuncion, 2007)

Dataset | Attribute type | # of attributes | # of rows | Target label(s)
Housing | Real | 13 | 506 | Housing
Challenger USA Space Shuttle O-Ring | Integer | 3 | 23 | Challenger
Concrete Slump Test | Real | 10 | 103 | Concrete Slump
Yacht Hydrodynamics | Real | 7 | 308 | Yacht Dynamics
Stock Portfolio Performance | Real | 12 | 63 | Stock Annual, Stock Excess
Concrete Compressive Strength | Real | 9 | 1030 | Slump_Slump, Slump_Flow
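Tables 2 and 3 above summarize the stopping conditions and hyperparameter search strategies, and Tables 5-7 below report how three of those conditions behave on the benchmark datasets. As a rough illustration of how such conditions are typically checked during recursive partitioning, the sketch below tests a maximum depth (MxD), a maximum node MSE (MxE), and a minimum relative error promotion (MnRP) before a node is split; the threshold values and helper names are illustrative assumptions, not the paper's implementation.

```python
# Illustrative check of the stopping conditions compared in Tables 2 and 5-7.
# Thresholds and the node representation are assumptions made for this sketch.
import numpy as np

def mse(y):
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def should_stop(y_node, depth, best_split_children,
                max_depth=5, max_mse=0.01, min_rel_promotion=0.1):
    """Return True if the node must become a leaf.

    y_node              -- target values reaching this node
    best_split_children -- (y_left, y_right) under the best candidate split
    """
    if depth >= max_depth:                     # MxD: fixed maximum depth
        return True
    node_err = mse(y_node)
    if node_err <= max_mse:                    # MxE: node already accurate enough
        return True
    y_l, y_r = best_split_children
    child_err = (len(y_l) * mse(y_l) + len(y_r) * mse(y_r)) / len(y_node)
    # MnRP: stop when the best split reduces the error by less than a
    # fraction of the current node error (relative promotion).
    if node_err > 0 and (node_err - child_err) / node_err < min_rel_promotion:
        return True
    return False
```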
Table 5: Normalized MSE of prediction on the datasets vs. maximum depth (MxD)

MxD | Housing | Challenger | Concrete Slump | Yacht Dynamics | Stock Annual | Stock Excess | Slump_Slump | Slump_Flow
0 | 0.046 | 0.844 | 0.072 | 0.682 | 0.051 | 0.054 | 0.192 | 0.113
1 | 0.064 | 0.611 | 0.039 | 0.152 | 0.045 | 0.043 | 0.156 | 0.082
2 | 0.059 | 0.654 | 0.037 | 0.035 | 0.031 | 0.034 | 0.152 | 0.081
3 | 0.044 | 0.751 | 0.03 | 0.009 | 0.03 | 0.033 | 0.189 | 0.09
4 | 0.037 | 0.875 | 0.03 | 0.006 | 0.028 | 0.032 | 0.204 | 0.101
5 | 0.032 | 0.92 | 0.029 | 0.006 | 0.027 | 0.032 | 0.215 | 0.104
6 | 0.038 | 0.92 | 0.029 | 0.005 | 0.026 | 0.031 | 0.22 | 0.106
7 | 0.034 | 0.92 | 0.027 | 0.006 | 0.026 | 0.03 | 0.22 | 0.112
8 | 0.033 | 0.92 | 0.027 | 0.006 | 0.026 | 0.031 | 0.222 | 0.114
9 | 0.034 | 0.92 | 0.027 | 0.006 | 0.026 | 0.031 | 0.225 | 0.114
10 | 0.034 | 0.92 | 0.027 | 0.006 | 0.026 | 0.031 | 0.225 | 0.115
Min | 0.032 | 0.611 | 0.027 | 0.005 | 0.026 | 0.03 | 0.152 | 0.081
Best depth | 5 | 1 | >6 | 6 | >5 | 7 | 2 | 2

Table 6: Normalized MSE vs. 10 values of maximum mean square error (MxE)

MxE | Housing | Challenger | Concrete Slump | Yacht Dynamics | Stock Annual | Stock Excess | Slump_Slump | Slump_Flow
0 | 0.034 | 0.92 | 0.027 | 0.006 | 0.026 | 0.031 | 0.225 | 0.115
0.01 | 0.034 | 0.92 | 0.027 | 0.006 | 0.026 | 0.029 | 0.225 | 0.115
0.05 | 0.034 | 0.92 | 0.027 | 0.006 | 0.028 | 0.033 | 0.225 | 0.115
0.1 | 0.034 | 0.92 | 0.027 | 0.006 | 0.03 | 0.032 | 0.225 | 0.115
0.5 | 0.034 | 0.92 | 0.027 | 0.006 | 0.045 | 0.043 | 0.226 | 0.115
1 | 0.034 | 0.775 | 0.027 | 0.006 | 0.051 | 0.043 | 0.226 | 0.114
5 | 0.034 | 0.834 | 0.027 | 0.005 | 0.051 | 0.054 | 0.224 | 0.114
10 | 0.035 | 0.844 | 0.027 | 0.005 | 0.051 | 0.054 | 0.226 | 0.114
50 | 0.045 | 0.844 | 0.028 | 0.007 | 0.051 | 0.054 | 0.217 | 0.117
100 | 0.044 | 0.844 | 0.028 | 0.008 | 0.051 | 0.054 | 0.214 | 0.115
Min | 0.034 | 0.775 | 0.027 | 0.005 | 0.026 | 0.029 | 0.214 | 0.114
Best MxE | [0, 5] | 1 | [0, 10] | [5, 10] | [0, 0.01] | 0.01 | 100 | [1, 10]

Table 7: Normalized MSE vs. values of minimum relative error promotion (MnRP)

MnRP | Housing | Challenger | Concrete Slump | Yacht Dynamics | Stock Annual | Stock Excess | Slump_Slump | Slump_Flow
0 | 0.034 | 0.92 | 0.027 | 0.006 | 0.026 | 0.031 | 0.225 | 0.115
0.05 | 0.033 | 0.92 | 0.027 | 0.006 | 0.026 | 0.031 | 0.225 | 0.115
0.06 | 0.033 | 0.918 | 0.027 | 0.006 | 0.026 | 0.031 | 0.225 | 0.115
0.07 | 0.033 | 0.912 | 0.027 | 0.006 | 0.026 | 0.031 | 0.226 | 0.115
0.08 | 0.033 | 0.912 | 0.027 | 0.006 | 0.026 | 0.031 | 0.226 | 0.115
0.09 | 0.033 | 0.911 | 0.027 | 0.006 | 0.026 | 0.031 | 0.225 | 0.115
0.1 | 0.033 | 0.911 | 0.027 | 0.006 | 0.026 | 0.031 | 0.217 | 0.114
0.15 | 0.033 | 0.716 | 0.027 | 0.006 | 0.026 | 0.03 | 0.196 | 0.104
0.2 | 0.033 | 0.617 | 0.038 | 0.006 | 0.026 | 0.03 | 0.183 | 0.096
0.3 | 0.04 | 0.617 | 0.072 | 0.006 | 0.039 | 0.048 | 0.189 | 0.092
Min | 0.033 | 0.617 | 0.027 | 0.006 | 0.026 | 0.03 | 0.183 | 0.092
Best MnRP | [0, 0.2] | [0.2, 0.3] | [0, 0.15] | [0, 0.3] | [0, 0.2] | [0.15, 0.2] | 0.2 | 0.3

Table 8: Comparing MxD, MxE and MnRP based on the best normalized MSE on the datasets (rank in parentheses)

Stop. metric | Housing | Challenger | Concrete Slump | Yacht Dynamics | Stock Annual | Stock Excess | Slump_Slump | Slump_Flow | Avg. Rank
MxD | 0.032 (1) | 0.611 (1) | 0.027 (1) | 0.005 (1) | 0.026 (1) | 0.03 (2) | 0.152 (1) | 0.081 (1) | 1.125
MxE | 0.034 (3) | 0.775 (3) | 0.027 (1) | 0.005 (1) | 0.026 (1) | 0.029 (1) | 0.214 (3) | 0.114 (3) | 2
MnRP | 0.033 (2) | 0.617 (2) | 0.027 (1) | 0.006 (3) | 0.026 (1) | 0.03 (2) | 0.183 (2) | 0.092 (2) | 1.875
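The "Avg. Rank" columns of Tables 8 and 9 rank the methods on each dataset by normalized MSE (1 = best, with ties sharing the lowest rank, as in Table 8) and then average the ranks over the eight targets. The short sketch below reproduces that computation on the first three dataset columns of Table 8; because only three columns are used, the printed averages are not the full row averages reported in the table.

```python
# Sketch of the "Avg. Rank" computation used in Tables 8 and 9.
# Input values are the Housing, Challenger and Concrete Slump columns of
# Table 8; rows correspond to MxD, MxE and MnRP.
import numpy as np
from scipy.stats import rankdata

errors = np.array([
    [0.032, 0.611, 0.027],   # MxD
    [0.034, 0.775, 0.027],   # MxE
    [0.033, 0.617, 0.027],   # MnRP
])

# Rank each dataset column (1 = lowest error, ties share the minimum rank),
# then average ranks per method across the columns.
ranks = np.vstack([rankdata(errors[:, j], method="min")
                   for j in range(errors.shape[1])]).T
avg_rank = ranks.mean(axis=1)
print(avg_rank)              # -> approximately [1.0, 2.33, 1.67]
```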
Table 9: Comparing the normalized MSE of the proposed models and regression trees with a fixed MxD (rank in parentheses)

MxD / Method | Housing | Challenger | Concrete Slump | Yacht Dynamics | Stock Annual | Stock Excess | Slump_Slump | Slump_Flow | Avg. Rank | Avg. Err.
0 | 0.0462 (11) | 0.8443 (9) | 0.0723 (13) | 0.6824 (13) | 0.0508 (13) | 0.054 (13) | 0.1922 (9) | 0.1133 (13) | 12 | 0.257
1 | 0.0637 (13) | 0.6111 (5) | 0.0391 (12) | 0.152 (12) | 0.0451 (12) | 0.0429 (12) | 0.1558 (6) | 0.0817 (7) | 9.9 | 0.149
2 | 0.0592 (12) | 0.6538 (7) | 0.0366 (11) | 0.0347 (10) | 0.0305 (11) | 0.0335 (11) | 0.1523 (4) | 0.0811 (5) | 8.9 | 0.135
3 | 0.0441 (10) | 0.7508 (8) | 0.0298 (9) | 0.0085 (7) | 0.0295 (10) | 0.0335 (10) | 0.1886 (8) | 0.0899 (8) | 8.8 | 0.147
4 | 0.037 (7) | 0.8753 (10) | 0.03 (10) | 0.0065 (5) | 0.0281 (9) | 0.0318 (8) | 0.2041 (10) | 0.1006 (9) | 8.5 | 0.164
5 | 0.0322 (1) | 0.92 (11) | 0.0293 (7) | 0.0056 (3) | 0.0271 (8) | 0.0324 (9) | 0.2147 (11) | 0.1036 (10) | 7.5 | 0.171
6 | 0.0378 (8) | 0.92 (12) | 0.0289 (6) | 0.0054 (1) | 0.0262 (7) | 0.0306 (7) | 0.2197 (13) | 0.1058 (11) | 8.1 | 0.172
7 | 0.0338 (2) | 0.92 (13) | 0.0267 (4) | 0.0055 (2) | 0.0261 (5) | 0.0305 (6) | 0.2196 (12) | 0.1124 (12) | 7 | 0.172
Middle | 0.0359 (5) | 0.5874 (4) | 0.0288 (5) | 0.1506 (11) | 0.0254 (4) | 0.0271 (4) | 0.1481 (2) | 0.0762 (1) | 4.5 | 0.135
Median | 0.0378 (9) | 0.5853 (3) | 0.0248 (3) | 0.0061 (4) | 0.0261 (6) | 0.0293 (5) | 0.1538 (5) | 0.081 (4) | 4.9 | 0.118
Mean | 0.0345 (3) | 0.5723 (2) | 0.0245 (2) | 0.0288 (9) | 0.0243 (3) | 0.0259 (2) | 0.1491 (3) | 0.0768 (2) | 3.3 | 0.117
Inc. Weight | 0.0347 (4) | 0.6164 (6) | 0.0224 (1) | 0.0078 (6) | 0.0236 (2) | 0.0266 (3) | 0.1663 (7) | 0.0813 (6) | 4.4 | 0.122
Dec. Weight | 0.0369 (6) | 0.4908 (1) | 0.0298 (8) | 0.0237 (8) | 0.017 (1) | 0.0165 (1) | 0.1456 (1) | 0.0786 (3) | 3.6 | 0.105