https://doi.org/10.31449/inf.v46i2.3027    Informatica 46 (2022) 243–258

Formal Approach to Data Accuracy Evaluation

Belkacem Athamena (1) and Zina Houhamdi (2)
E-mail: athamena@gmail.com, belkacem.athamena@aau.ac.ae, z_houhamdi@yahoo.fr, zina.houhamdi@aau.ac.ae
(1) Business Administration Department, College of Business, Al Ain University, United Arab Emirates
(2) Cybersecurity Department, College of Engineering, Al Ain University, United Arab Emirates

Keywords: data integration systems, data quality, data accuracy

Received: December 20, 2019

Data quality is usually defined by multiple attributes that characterize either the output data (such as completeness, freshness, and accuracy) or the methods exploiting these data (such as dependability, performance, and protection). Among the suggested quality attributes, we discuss one of the principal categories: data accuracy. Scientific experiments, decision-making, and data retrieval are examples of situations that require a formal approach to data accuracy evaluation. The evaluation approach should be adaptable to distinct understandings of data accuracy and distinct end-user expectations. This study investigates data accuracy and defines the dimensions and metrics that affect its evaluation. The investigation of data accuracy raises problems in user expectation specification and in database quality models. This work describes our proposed approach for data accuracy evaluation by defining an evaluation algorithm that considers the distribution of inaccuracies in database relations. The approach decomposes the query output according to data accuracy, labels every part with its accuracy value, and addresses the possibility of enforcing data accuracy by using these values. The main contributions of this study are an explicit evaluation of the quality attributes of data accuracy, a formal approach to data accuracy evaluation, and a set of improvement actions that reinforce data accuracy.

Povzetek: A formal method for verifying the accuracy of data is described.

1 Introduction

Data quality has increasingly become an essential characteristic required by users, particularly for data integration systems (DISs), which combine data residing in multiple databases and provide the user with a unified view of these data as answers to their queries [4]. Because of the growth of retrieved data, users are becoming increasingly concerned about data quality. At the same time, the number of quality attributes and their relationships is huge. Accordingly, data quality evaluation is a complex problem involving multiple variables. In the DIS context, data quality evaluation is exceptionally complicated because it combines data derived from different databases with possibly distinct qualities. Because of the large number and high heterogeneity of these independent databases, it is crucial to determine their quality precisely and to consider it in the design phase of a DIS.

System quality improvement amounts to optimizing a problem with multiple variables, which may be very complex, especially in an indefinite context [3, 26]. Consequently, it is arduous to consider all quality attributes at the same time. To investigate data quality thoroughly, each quality attribute must be investigated independently, together with the contextual factors that affect it; dependencies can be investigated later. Among the proposed quality attributes, we opt for a main category: data accuracy.
Currently, many systems treat reliable measurement of data accuracy as a crucial and decisive requirement. Such systems are numerous and are used in diverse fields, such as customer relationship management, web-service integration, scientific experiments, decision-making, and data retrieval.

This study discusses data accuracy analysis in a DIS and considers the relational scenario. Explicitly, it treats user queries composed of projection, selection, and join (PSJ) operators over a collection of database relations. It addresses the problem of evaluating the accuracy of the data delivered to the user in response to a query and decides whether the user's expectations on data accuracy can be achieved.

Our approach consists of fragmenting the query output into sets of tuples (called areas) that have uniform accuracy values and marking each area with its accuracy value. Consequently, the user can do the following:

– extract only the most precise data by selecting the areas with high accuracy values,
– exclude data that do not satisfy the accuracy threshold by ignoring the areas with low accuracy values, or
– classify data by sorting the areas based on their accuracy values.

Moreover, if we wish to display additional data (because the most accurate area is incomplete), we can show the next area, and so on. This represents an added value to the delivered data.

This paper illustrates our proposed accuracy evaluation algorithm. We present the values of semantic accuracy using the Boolean metric, where each cell of every area of any relation is assigned the value accurate or inaccurate. The accuracy of a whole area is then calculated as an aggregation of the cells' accuracy values: the number of accurate cells divided by the total number of cells in the area. The query results are sorted by the areas' accuracy values. Areas whose accuracy value exceeds the user threshold are considered high-accuracy areas; areas whose accuracy value is below the threshold are considered low-accuracy areas.

The fragmentation of the query output and the evaluation of data accuracy require an in-depth analysis of the inaccuracy distribution in the database relations and of their combination to generate the query output [24]. For this purpose, we split the database relations into areas possessing uniform accuracies. As clarified in [15], databases are usually of heterogeneous quality; consequently, assigning a single accuracy value to an entire relation yields an imprecise accuracy calculation for particular data [14]. An area is described as a view (selection and projection) over the relations of the database; that is, a fragment aggregates a group of relations specified by the predicates characterizing the fragment [8, 9, 25].

This paper presents a formal approach to evaluating data accuracy that considers the portions of the database relations and processes them to generate an output for the user query. It focuses on pre-evaluation; that is, the data accuracy is estimated before the execution of the user query. The outputs of our evaluation approach are useful for comparing different query plans in order to select the plan with the maximal accuracy value. Furthermore, these outputs are beneficial during the design phase (to determine which databases will be included in the DIS) and the monitoring phase (to calculate the query accuracy). Finally, this study discusses the issue of accuracy improvement: it proposes using the output fragments to choose the areas with high accuracy values, i.e., we use the portions of a relation with the highest accuracy values instead of the whole relation. This distinguishes our approach from existing approaches in the literature.
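To make the Boolean-metric aggregation and thresholding just described concrete, here is a minimal Python sketch (the paper prescribes no implementation; the function and variable names are ours). An area is modeled as a matrix of per-cell accuracy flags:

```python
from typing import List, Tuple

def area_accuracy(cells: List[List[bool]]) -> float:
    """Boolean metric: number of accurate cells divided by total cells."""
    total = sum(len(row) for row in cells)
    accurate = sum(cell for row in cells for cell in row)
    return accurate / total if total else 0.0

def rank_areas(areas: List[List[List[bool]]],
               threshold: float) -> Tuple[list, list]:
    """Label every area with its accuracy, sort in descending order,
    and split into high-accuracy (>= threshold) and low-accuracy areas."""
    labeled = sorted(((area_accuracy(a), a) for a in areas),
                     key=lambda pair: pair[0], reverse=True)
    high = [(acc, a) for acc, a in labeled if acc >= threshold]
    low = [(acc, a) for acc, a in labeled if acc < threshold]
    return high, low
```

The sorted, labeled output directly supports the three uses listed above: extracting the most accurate areas, excluding the low-accuracy ones, and classifying the rest.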
2 Background

Data accuracy denotes a set of quality attributes. This section describes the three accuracy attributes listed in the literature [23, 27]:

– Syntactic accuracy represents the degree to which the data are free of syntactic mistakes such as format inconsistencies and spelling errors. It expresses the gap between the descriptions of the data in the DIS and the conventional descriptions of these data (syntactic gap).
– Semantic accuracy defines how adequately the data describe the environment state. It expresses the gap between the data described in the DIS and the real-world data (semantic gap).
– Precision describes the level of data detail. It expresses the gap between the detail level of the DIS data and the planned detail level.

Regarding accuracy measurement, three metrics are mentioned in the literature [12]:

– The Boolean metric uses a Boolean value to indicate whether a data item is accurate (1, or True) or inaccurate (0, or False).
– The degree metric uses a dimension to capture how accurate the data are; it usually belongs to the [0, 1] interval.
– Value-deviation captures the gap between a data item in the system and the original one; it is usually normalized to the [0, 1] interval.

This section also reviews the main concepts used in this study. First, it discusses the current approaches for evaluating data accuracy from which our proposed model is adapted, particularly the fragmentation technique, and recalls the characteristics of relational database partitioning. It then reviews the concept of query rewriting and explains the bucket algorithm. Finally, it comments on selectivity estimation techniques.

Our accuracy evaluation approach builds on two techniques: a prior evaluation approach, which presumes a homogeneous distribution of errors [16], and a posterior evaluation, which uses accuracy homogeneity for partitioning the database relations [19].

Prior Evaluation Method: Naumann et al. [16] proposed a method that propagates quality factors (including data accuracy) through the query operators. The method estimates the query output accuracy based on the accuracy of the database. The database relation accuracy is the rate of syntactic accuracy (rate of accurate cells), and the query is a PSJ.

The method assumes that inaccuracies are homogeneously spread over the database relations; thus, regardless of the selected tuples or projected attributes, the database accuracy is preserved. For the join operation, the accuracy of the joined data is computed by multiplying the accuracies of the input relations [20].

The disadvantage of this method is the strong assumption of a homogeneous error distribution, which is usually inapplicable to real-world data. In general, query operations do not maintain accuracies, so the method does not obtain an exact calculation of the accuracy. This deficiency is due to the absence of knowledge about the inaccuracy distribution (where the errors are concentrated).
Supplementary knowledge characterizing the relation instances is mandatory to obtain more accurate outputs.

Posterior Evaluation Method: This method uses a partitioning algorithm [15, 19] to fragment the database relations into areas with highly uniform accuracy. An area is described as a view involving projection and selection operations. The accuracy is calculated on a sample of each relation by computing the cell accuracy of that sample; the area accuracy is the rate of semantic accuracy (rate of correct cells). The accuracy values are exploited by an automated fragmentation algorithm that evaluates a set of criteria, and the same fragmentation is then performed over the complete database relation.

Usually, the query is a PSJ, and relational algebra is employed to operate on the partitions, i.e., each operator takes as input a set of relations with their respective partitions. Its output is a single relation with its partitions: the partition of a projection (selection) is determined as the intersection of the operation output and the partition of the input relation (intersection of the selection conditions with the projected attributes). In this case, the data accuracy is preserved owing to accuracy uniformity. Finally, when the query output is ready, its accuracy value is calculated as the weighted sum of the area accuracies (the weight is the number of cells in the area). Note that a single accuracy value is computed for the entire output.

Partition Algorithm: We concisely describe Rakov's algorithm for sampling a relation [19]; Figure 1 shows the related pseudo-code. It is an iterative procedure using a classification tree. It takes a relation as input, splits it into two blocks (vertical or horizontal splitting, but never both), tries to identify the split with the highest uniformity, and then iterates the procedure on each block. The partitioning of a block terminates when further splitting provides a negligible improvement in uniformity. A threshold x denotes a fairly uniform distribution of accuracies in the block and serves as the stopping condition. Uniformity is estimated using Gini indices [6]. The Gini index GI(P) is given by equation (1):

GI(P) = 2r(1 − r)    (1)

where r represents the rate of accurate cells in partition P. Equation (2) defines the halting constraint, which estimates the reduction of the partition's Gini index:

ΔGI = GI(P) − α_1 GI(P_1) − α_2 GI(P_2)    (2)

where {P_1, P_2} are partitions of P and α_i = |P_i|/|P|, i = 1, 2.

Because considering all possible relation partitions is excessively costly, Rakov suggests heuristics to decrease the number of treated partitions. In addition, we distinguish two types of attributes: categorical and ordered [1]. For horizontal partitioning, if the attribute is ordered and possesses K different values (A_1 ≤ ··· ≤ A_K), the candidate partitions are identified by the (K − 1) binary conditions t ≤ A_i. Otherwise, for a categorical attribute with K different values, these values are sorted by the number of incorrect cells and then treated as ordered attributes. For vertical partitioning, all partitions are considered when the number of attributes is small; otherwise, the same method as for categorical attributes is applied.

Figure 1: Relation Partitioning Algorithm.
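The split-acceptance rule of equations (1) and (2) can be sketched as follows (a minimal sketch with illustrative names; the enumeration of candidate horizontal and vertical splits, which Rakov prunes with the heuristics above, is elided):

```python
def gini(accurate: int, total: int) -> float:
    """Equation (1): GI(P) = 2r(1 - r), r = rate of accurate cells in P."""
    r = accurate / total
    return 2.0 * r * (1.0 - r)

def gini_reduction(parent, block1, block2) -> float:
    """Equation (2): Delta GI = GI(P) - a1*GI(P1) - a2*GI(P2), ai = |Pi|/|P|.

    Each argument is an (accurate_cells, total_cells) pair.
    """
    (pa, pt), (b1a, b1t), (b2a, b2t) = parent, block1, block2
    return gini(pa, pt) - (b1t / pt) * gini(b1a, b1t) - (b2t / pt) * gini(b2a, b2t)

def accept_split(parent, block1, block2, x: float) -> bool:
    """Split only while the reduction reaches the uniformity threshold x."""
    return gini_reduction(parent, block1, block2) >= x
```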
In the following, we discuss the characteristics of well-constructed partitions.

Partition correctness: There are three correctness constraints that a partition must satisfy to guarantee database coherence [17]:

– Disjunction: The horizontal decomposition of a relation R into partitions R_1, ..., R_n verifies that for every cell d_i, d_i ∈ R_j implies d_i ∉ R_k for k ≠ j (i.e., a cell in partition R_j does not appear in any other partition R_k). This property ensures that the partitions are disjoint: there is no intersection between the horizontal partitions. For vertical decomposition, the disjunction is limited to the non-primary-key attributes of the relation because the primary-key attributes are duplicated in all partitions.
– Restoration: When a relation R is decomposed into partitions R_1, ..., R_n, there is always an operator Ω such that R = Ω(R_1, ..., R_n). This rule assures the preservation of the data restrictions defined as dependencies. Normally, the union operator is used for horizontal partitioning, whereas the join operator is used for vertical partitioning.
– Completeness: When a relation R is decomposed into partitions R_1, ..., R_n, every item belonging to R also belongs to one or more partitions R_i. This characteristic ensures data preservation, i.e., there is no data loss (all data in the global relation are projected into partitions). For horizontal partitioning, items denote tuples; for vertical partitioning, they denote attributes.

Query rewriting is the reformulation of a user query (defined over the global relations) into an analogous expression, known as a rewriting, that refers exclusively to the database framework [11]. In the local-as-view (LAV) strategy, the global relations are defined without referring to the databases, and a mapping between them is then made by expressing each database relation as a view over the global relations [21]. Thus, query rewriting is comparable to answering a user query using views.

Definition: For a query Q over the relations R_i defining the global schema and views V_i referring to database relations over the R_i, the query Q_r is called a rewriting of Q if:

– Q_r ⊆ Q, and
– Q_r refers exclusively to the views.

In general, the query is a PSJ and is defined using a datalog-like language. Thus, a query Q is expressed by equation (3):

Q(X) = R_1(Z_1) ∧ ··· ∧ R_n(Z_n) ∧ C_Q    (3)

where

– R_i is a relation and Z_i = {A_i} is a set of attributes of R_i,
– C_Q is a conjunction of predicates u θ v with θ ∈ {=, <, >, ≤, ≥} and u, v ∈ ⋃_{1≤i≤n} Z_i, and
– X ⊆ ⋃_{1≤i≤n} Z_i is the set of attributes projected by Q.

Numerous query rewriting algorithms can be found in the literature, and [7] presents a review of them. The most popular is the bucket algorithm [13], which we describe and use in this study. The bucket algorithm computes all possible rewritings that are contained in (but not necessarily equivalent to) the initial query [5]. It prunes the space of candidate rewritings in two phases (see the sketch after this list):

– For each subgoal t ∈ Q: construct a container, called a bucket, that includes the database relations contributing to t, i.e., the database relations containing tuples of t.
– Construct the candidate rewritings (obtained by joining one database relation from each bucket) and preserve only the rewritings contained in Q.
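A schematic sketch of these two phases follows (our names; the `covers` relevance test and the final containment check are only stubs here, and they are where the real work of the algorithm lies):

```python
from itertools import product
from typing import Iterable, List

def build_buckets(query_subgoals: list, views: list) -> List[list]:
    """Phase 1: one bucket per query subgoal, holding every view
    that can contribute tuples to that subgoal."""
    # `covers` is a hypothetical relevance test between a view and a subgoal.
    return [[v for v in views if v.covers(subgoal)]
            for subgoal in query_subgoals]

def candidate_rewritings(buckets: List[list]) -> Iterable[tuple]:
    """Phase 2: combine one view from each bucket; a real implementation
    then keeps only the combinations contained in the original query."""
    return product(*buckets)
```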
The first phase has polynomial complexity with respect to the number of databases, and the second phase reduces the complexity considerably by diminishing the number of possibilities. Although containment checking can be costly in general, its instance here is bounded by the query size (usually small) and becomes hard only when the query has many occurrences of the same schema; accordingly, containment complexity is not a challenge in real applications [5].

Selectivity Evaluation: Selectivity evaluation techniques are widely practiced in query optimization: statistics on the data saved in the database provide estimations to the optimizer [22]. The most popular statistics are histograms, which are adopted by several commercial database systems. This section shows their current application to evaluating the selectivity of a complex query.

The histogram of an attribute A is defined as a list of buckets, where bucket b_i defines r_i, a subrange of A's domain, and possesses two attributes:

– the periodicity p_i, which represents the number of tuples x verifying x.A ∈ r_i, and
– the discrete value dv_i, which represents the number of different values of x.A over all tuples x with x.A ∈ r_i.

The query selectivity is calculated by dividing the query cardinality by the relation cardinality. To evaluate the query cardinality, we sum the periodicity of all buckets included (totally or partially) in the predicate. If the query has several such predicates, the selectivity is calculated as the product of the individual selectivities.

For an arbitrary PSJ query, there is a new challenge: cardinality evaluation needs statistical information to be propagated over the predicates, i.e., we must create histograms for intermediate outputs using the histograms of the database relations. The propagation of intermediate statistics across multiple query operators can considerably decrease the accuracy. A possible solution is to precalculate statistics for a subset of exceptional query results and to use them in the selectivity calculation of intermediate outputs.
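To illustrate how such bucket statistics yield selectivities, here is a sketch for a range predicate A ≤ v, under the standard (approximate) assumption that values are uniformly distributed inside each bucket; the structure and names are ours, not a specific system's API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bucket:
    low: float      # inclusive lower bound of the subrange r_i
    high: float     # exclusive upper bound of the subrange r_i
    frequency: int  # periodicity p_i: tuples x with x.A in r_i
    distinct: int   # dv_i: distinct values of x.A in r_i

def selectivity_leq(histogram: List[Bucket], value: float,
                    cardinality: int) -> float:
    """Estimate the selectivity of the predicate A <= value."""
    matched = 0.0
    for b in histogram:
        if b.high <= value:
            matched += b.frequency            # bucket fully included
        elif b.low <= value < b.high:
            # bucket partially included: uniform-distribution fraction
            matched += b.frequency * (value - b.low) / (b.high - b.low)
    return matched / cardinality
```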
The remainder of this paper describes the proposed model for the evaluation of data accuracy; the algorithms and methods reviewed above are modified and adopted for data accuracy estimation.

3 Formal model

This section formalizes the proposed approach for accuracy assessment. The approach considers a DIS in the relational context: the system combines data from relational databases and allows users to execute queries over a global schema. Each query is reformulated according to the database relations to obtain a set of rewritings that select the data answering the query.

This section addresses the accuracy of the data delivered to a user as the answer to the user's query. We arrange the data in areas with uniform accuracy to apprise the user of the inaccuracy distribution. To achieve this goal, the database relations are decomposed into partitions (called areas) possessing uniform accuracy. Note that areas are views (virtual relations) determined by the distribution rules (projection attributes and selection conditions). The user query is reformulated over the areas (rather than over the global relations). The proposed approach uses an a priori assessment method: before performing the query rewriting, we assess the data accuracy, depending only on the operations (PSJ) that define the rewritings and on the area accuracies.

We can state the problem as follows:

– Input:
  – the user query, and
  – a collection of database relations decomposed into areas of uniform accuracy.
– Output:
  – a set of rewritings answering the user query,
  – the accuracy value of each rewriting, and
  – the accuracy value of the query answer.

Our proposed scenario for accuracy estimation contains three phases:

– Database relation fragmentation based on accuracy uniformity: this phase estimates the accuracy of a portion of each database relation and uses the estimates for relation fragmentation. Fragmentation is performed at an initial stage (at DIS development time, or periodically) and is isolated from the query appraisal stage.
– Query rewriting with respect to fragments: every query is reformulated using the areas of the database relations. This phase generates a set of rewritings over the areas; the output of the query is the union of the created rewritings.
– Data accuracy estimation of the query outputs: this phase estimates the accuracy of the data generated by the rewritings, using the area accuracies, and aggregates them to calculate an accuracy value for the whole user query output.

To reinforce the data accuracy, a simple adjustment consists of rejecting rewritings or areas with small accuracy values. The rejection can be done at rewriting generation time (an early stage); thus, our approach can be considered selective. The next subsections provide the details of each phase.

3.1 Database relation fragmentation based on accuracy uniformity

Using the partitioning approach suggested by Rakov [19], the data accuracy is used to fragment each database relation. The purpose of fragmentation is to manipulate pieces of a database relation that are uniformly accurate; in other words, if we tried to fragment such a piece again, its accuracy value would stay approximately unchanged [2]. This subsection defines the fragmentation process and discusses how to extract appropriate fragments.

To simplify partition usage, the approach decomposes the relation into areas using a horizontal partition (selection predicates) and then decomposes each area into subareas using a vertical partition (subsets of attributes). Fragments should verify the three correctness constraints discussed previously: each tuple exists in a unique area, key attributes appear in all subareas, and non-key attributes appear in a single subarea. To manipulate all attributes (key and non-key) in the same way (projection of each attribute into a unique subarea), the key attributes are duplicated; consequently, a new (composite) key for the subarea is generated. This approach is called key expansion.

Key expansion definition: If R(A_1, ..., A_m, ..., A_n) is a relation, where m ≤ n and (A_1, ..., A_m) form the key of R, then R̄ denotes the key expansion of R generated by duplicating the key attributes: R̄(K_1, ..., K_m, A_1, ..., A_m, ..., A_n). The duplicated attributes (K_1, ..., K_m) are the expanded key of R̄.
In the vertical fragmentation, we project the expanded key into all subareas, and we project every attribute of the initial relation into a unique subarea. The fragmentation process (areas and subareas) can be formalized as follows.

Horizontal fragmentation: A horizontal fragmentation of a relation R, represented by H_{P1,...,Pm}(R) (see equation (4)), defines a set of subrelations {R_1, ..., R_m}, named areas, created by applying the predicates P_1, ..., P_m to R̄, where P̄ = {P_1, ..., P_m} is a set of conjunctive predicates over R that is disjoint and complete (each tuple of R verifies a unique P_i):

H_{P1,...,Pm}(R) = R_1, ..., R_m = σ_{P1}(R̄), ..., σ_{Pm}(R̄)    (4)

An area is denoted by <Name, Predicate, N, KeyAccuracy>, where:

– Name distinguishes the area within the relation,
– Predicate is the conjunction that determines the area,
– N represents the number of tuples in the area (those verifying the Predicate), and
– KeyAccuracy is the accuracy value of the key attributes of R.

Vertical fragmentation: For n attribute subsets S_1, ..., S_n of a relation R(S_1, ..., S_n) with ⋂_i S_i = ∅, the vertical fragmentation of an area R_i of R, expressed as V_{S1,...,Sn}(R_i) (see equation (5)), defines a list of views {V_i1, ..., V_in}, named subareas, generated by projecting the attribute subsets onto R_i. The subareas are disjoint (each attribute of R belongs to one subarea), but the expanded key K of R̄ belongs to all subareas:

V_{S1,...,Sn}(R_i) = V_i1, ..., V_in = π_{K,S1}(R_i), ..., π_{K,Sn}(R_i)    (5)

A subarea is defined as a three-tuple <Name, Attributes, Accuracy>, where Name distinguishes the subarea within the relation, Attributes is the list of subarea attributes, and Accuracy denotes the calculated subarea accuracy.

Again, subareas and areas are views determined by the fragmentation rules (projection attributes and selection conditions); in other words, the database relations are not really partitioned and saved as independent partitions. The database relations are preserved unchanged in the databases, and the description models of the subareas and areas are kept in the DIS.

After the horizontal and vertical fragmentation of a relation, every relation cell appears in a single subarea belonging to a single area. To fragment the database relations, Rakov's algorithm described earlier can be applied. However, that algorithm alternates horizontal and vertical partitioning arbitrarily; accordingly, the resulting fragments can differ from areas and their subareas. As a possible solution, we propose reorganizing the fragments into horizontal and vertical fragments.

For any hybrid (horizontal and vertical) fragmentation of a relation R that satisfies the correctness criteria, it is always possible to obtain different fragments of R by additional fragmentation of some of the areas, i.e., to obtain subareas {S_11, ..., S_1m1, ..., S_n1, ..., S_nmn}, where the subset {S_i1, ..., S_imi} represents the vertical fragmentation of some area A_i (1 ≤ i ≤ n), and the set {A_1, ..., A_n} represents the horizontal fragmentation of R. To this end, we follow a four-step process:

– Find the selection predicates P = {p_1, ..., p_r} defining the partitions T_1, ..., T_k, with r ≤ k (because multiple partitions may possess identical selection predicates).
– Search for two non-disjoint predicates p_i and p_j (i.e., p_i ∩ p_j ≠ ∅) and replace p_i and p_j by p_i − p_j, p_j − p_i, and p_i ∩ p_j. This step terminates because the predicates are unions of inequalities on R's attributes. The output is a set of pairwise disjoint predicates P′ = {p′_1, ..., p′_n}.
– Determine the area set A = {A_1, ..., A_n}: for each predicate p′_i ∈ P′ there is an area A_i ∈ A. A corresponds to the horizontal fragmentation of R; in other words, A verifies the completeness constraint, which implies that no predicate subexpression is lost.
– Intersect every area A_i ∈ A with the partitions T_1, ..., T_k to obtain a new set of subareas {S_i1, ..., S_imi}. This new set corresponds to the vertical fragmentation of A_i; in other words, it verifies the completeness criteria.

Consequently, each hybrid fragmentation of a relation that satisfies the correctness constraints can be reorganized into areas and subareas; in particular, the fragments generated by Rakov's algorithm. Figure 2 shows an example of such a reorganization.

Figure 2: Fragments Reorganization. (a) Before reorganization: partitions P1–P6. (b) After reorganization: Area 1 = {S11, S12, S13}, Area 2 = {S21, S22}, Area 3 = {S31, S32}, Area 4 = {S41}.

To fragment the database relations, we suggest applying Rakov's algorithm. Nevertheless, the fragmentation can also be done manually by exploiting knowledge about the database relations collected from the user, the DIS expert, or the IT administrator.

Rakov's algorithm calculates the area accuracy as the percentage of accurate cells. Equation (6) gives the Gini index:

G(V) = 2k(1 − k)    (6)

where k is the accuracy value. The indices are used to estimate the accuracy uniformity and, thereafter, to select the fragmentation with the highest uniformity (i.e., the fragmentation function is parameterized to compute the area accuracy). Each fragment is reorganized as illustrated earlier. The area and subarea schemas are then deduced from the related fragments using the following steps:

1. Name the areas sequentially, and then their associated subareas.
2. Define the area predicates and the lists of attributes describing the subareas from the fragment.
3. Estimate the subarea accuracy by calculating the average cell accuracy (computed by Rakov's algorithm).
4. Estimate the key accuracy by averaging the accuracies of the attributes composing the key (the accuracy of an attribute equals its subarea accuracy owing to accuracy uniformity). Note that the key accuracy equals the subarea accuracy if all key attributes are projected into a single subarea.
5. Finally, as Rakov's algorithm fragments a sample of the database relation, the number of tuples in the sample is used to deduce the number of tuples in an area, i.e., number of tuples in the sample × (relation size / sample size).

The fragmentation is executed once, at DIS development time or periodically; it is not tied to the accuracy estimation of queries.
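Gathering the notions of this subsection, the area and subarea descriptors and step 5's sample-to-relation scaling might be represented as follows (a sketch whose field names mirror the four-tuple and three-tuple schemas defined above; it is not the paper's implementation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Subarea:
    name: str              # distinguishes the subarea within the relation
    attributes: List[str]  # attribute subset, plus the expanded key
    accuracy: float        # average cell accuracy measured on the sample

@dataclass
class Area:
    name: str
    predicate: str         # conjunctive selection condition defining the area
    n_tuples: int          # estimated number of tuples in the full relation
    key_accuracy: float    # average accuracy of the key attributes
    subareas: List[Subarea]

def scale_tuple_count(tuples_in_sample: int, relation_size: int,
                      sample_size: int) -> int:
    """Step 5: extrapolate the tuple count from the sample to the relation."""
    return round(tuples_in_sample * relation_size / sample_size)
```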
The next subsection describes the query reformulation and its impact on accuracy estimation.

3.2 User query rewriting

Our approach reformulates the user query according to the areas of the database relations. We also consider the option of reformulating the user query with respect to subareas and clarify the reason for discarding this alternative.

To describe a user query over the areas that constitute the database relations, our approach uses the classic rewriting algorithm, i.e., the bucket algorithm. Hence, the LAV technique is applied to define the areas as views over the global schema, by replacing each database relation with its description over the global schema, i.e., by unfolding the views over the database relations. Note that this operation is unrelated to the user query because it is performed right after the fragmentation of the database relations.

The query rewriting applies the bucket algorithm, which creates buckets, compares the predicates of the query and of the areas, and then determines the possible rewritings and checks query containment. The number of rewritings increases polynomially with the number of areas. Note that the hardness of the rewriting algorithm is due not to the number of relations but to the query size (which is usually small), and arises only if the query has more than one occurrence of the same relation [13].

We now discuss the option of rewriting a query over subareas and explain why this option is rejected. Areas contain whole tuples of the database relations; consequently, rewriting a query over areas joins whole tuples (exactly as in query rewriting over the database relations themselves). A subarea, however, decomposes tuples, as it projects only a part of the relation's attributes. Some versions of the bucket algorithm can assemble all required attributes by joining multiple relations of a bucket; consequently, query rewriting over subareas is not impossible. On the other hand, assembling a small number of attributes from a large number of relations dangerously increases the risk of introducing semantic inaccuracies (the generation of tuples without meaning in the real world). By way of illustration, the output could combine a student's name with the phone number of a different student who has the same identifier in a distinct database. This problem is intrinsic to DISs, but it grows remarkably as the number of joins increases (particularly when tuples are split); rewriting algorithms avoid splitting tuples specifically to reduce this risk. Accordingly, our approach rewrites the query over areas and uses the subarea structure only for computing the data accuracy. This also explains why the proposed approach fragments relations horizontally and then vertically, rather than managing arbitrary hybrid fragmentations as in Rakov's approach. The following subsection describes the estimation of data accuracy for each rewriting.

3.3 Data accuracy estimation

To compute the area accuracies and aggregate them into an accuracy value for the entire user query output, the proposed approach proceeds in three steps:

– identification of the areas and subareas constituting the rewriting,
– estimation of the rewriting accuracy and its key accuracy, and
– calculation of the number of tuples in each area to perform the aggregation.

Identification of areas and subareas: An area contains at least one subarea, and a rewriting is expressed as the join of a set of areas. The join output is a unique area, which consists of the union of all subareas possessing part of the projected attributes and verifies the selection predicates of all input areas. The following example shows three areas, P1, P2, and P3, decomposed into the subareas (P11, P12), (P21, P22), and (P31, P32, P33), respectively (see Figure 3).
The rewriting QR is defined as one area joining all the subareas that include part of the projected attributes. Note that P12 is not included in QR because it does not contain any of the attributes projected by QR.

Figure 3: Rewriting Joins Multiple Areas (QR joins the subareas P11, P21, P22, P31, P32, and P33).

Because all tuples in the output of the rewriting fulfill all area predicates, a single area is built for the rewriting by conjoining the area predicates and the rewriting predicates. In addition, because the bucket algorithm generates only relevant rewritings, the predicates cannot be contradictory. Note that we list only the most constraining predicates (e.g., "age ≥ 18" is less constraining than "age = 20").

The rewriting subareas are defined as the aggregation of all subareas belonging to the input areas, considering only the attributes common to the subarea and the rewriting. If the intersection between the rewriting and a subarea is empty, the subarea is discarded. The resulting rewriting subareas represent the vertical fragmentation of the rewriting area. The completeness constraint is satisfied because every rewriting attribute appears in some subarea of the input areas. However, for the disjunction constraint, the natural join raises a problem: a join attribute exists in two different input subareas but appears only once in the rewriting. To solve this problem, a new subarea is created, and its accuracy value is computed as the average of both subarea accuracies (as discussed below).

Estimation of the rewriting accuracy: If the fragments are adequately determined, projections of predicates and attributes logically preserve the accuracy of the subareas (owing to accuracy uniformity). For the join operation, however, each tuple of an output subarea is composed of two input tuples; consequently, the produced subarea accuracy depends on both input areas. In particular, the accuracy propagation differs according to the accuracy factor. Recall that under semantic correctness, if the key of a tuple is inaccurate (does not reflect the real-world object), the entire tuple is considered incorrect; semantic correctness evaluates the correspondence of the key attribute to the real-world entity. Syntactic correctness, in contrast, evaluates the cell accuracy without regard to the key attributes. Thus, when joining areas, the cell accuracy is computed differently for the two factors:

– Semantic correctness: the subarea accuracy is computed as the product of the input subarea accuracy and the key accuracies of the remaining areas.
– Syntactic correctness: the subarea accuracy equals the input subarea accuracy.

For the subareas that include join attributes (which were separated in the prior step), we calculate the accuracy for each input subarea separately and then take the average. Subareas with identical accuracy values are merged into a single subarea. Finally, the key accuracy of the rewriting is calculated by multiplying the key accuracies of all input areas, and the rewriting accuracy is calculated as the average of the cell accuracies (the weighted average of the subarea accuracies, where each weight is the number of attributes in the subarea).
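A compact sketch of these propagation rules (illustrative names; the semantic/syntactic distinction follows the two bullets above, and the join-attribute rule follows the averaging just described):

```python
def joined_subarea_accuracy(input_subarea_accuracy: float,
                            other_key_accuracies: list,
                            semantic: bool) -> float:
    """Propagate a subarea's accuracy through a join.

    Semantic correctness: multiply by the key accuracy of every other
    joined area, because an inaccurate key invalidates the whole tuple.
    Syntactic correctness: the cell accuracy is unaffected by the join.
    """
    if not semantic:
        return input_subarea_accuracy
    accuracy = input_subarea_accuracy
    for key_accuracy in other_key_accuracies:
        accuracy *= key_accuracy
    return accuracy

def join_attribute_accuracy(accuracy_left: float,
                            accuracy_right: float) -> float:
    """A join attribute present in two input subareas: average both sides."""
    return (accuracy_left + accuracy_right) / 2.0
```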
Selectivity estimation: Because the query result is the union of multiple rewritings, its accuracy is estimated as the weighted aggregation of the rewriting accuracies, where each weight equals the rewriting size (number of tuples in the rewriting). Thus, the number of tuples in each rewriting must be estimated.

To this end, we suggest estimating the rewriting selectivity and deducing the number of tuples from it. In general, a join corresponds to a comparison between two keys (primary and foreign), and histograms are then used for the calculation. Note that query optimization algorithms, or statistical information about prior executions of the same or similar queries, can be applied to estimate the selectivity; experts can also estimate it. Our approach does not depend on the estimation algorithm, but the resulting accuracy value certainly depends on it. The selectivity is defined as follows.

Given a query rewriting Q_R over areas {R_1, ..., R_k}, the selectivity of Q_R, denoted S(Q_R), is the fraction of tuples in the Cartesian product of all the areas that satisfy the rewriting predicates. If Q_R has no selection (or join) predicate, the whole set of tuples is retained, and thus S(Q_R) = 1. The number of tuples of a rewriting is then calculated as:

number of tuples in the Cartesian product of the input areas × rewriting selectivity

3.4 Quality graph

After generating the query rewritings, we build a quality graph to calculate the user query as the union of all rewritings [10, 18]; there may be multiple rewritings or a single one. The quality graph shown in Figure 4 is created according to the following definitions:

– Source nodes represent the database relations.
– The Target node represents the user query.
– Activity nodes represent the areas of the database relations, denoted Area; edges connect each area to its source.
– Activity nodes represent the rewritings, denoted Rewriting; edges connect each rewriting to the areas used by the rewriting.
– One activity node represents the union of the rewritings (possibly a single rewriting), denoted the Union node. This node is the successor of all rewriting nodes and the predecessor of the Target node.

Figure 4: Quality Graph (Sources → Areas → Rewritings → Union → Target).
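One possible in-memory representation of this graph, wired in the source-to-target direction of Figure 4 (only the node kinds and topology come from the definitions above; everything else, including the direction of the successor edges, is our assumption):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    kind: str  # "source", "area", "rewriting", "union", or "target"
    name: str
    succ: List["Node"] = field(default_factory=list)

def build_quality_graph(source_areas: Dict[str, List[Node]],
                        rewriting_areas: Dict[str, List[Node]]):
    """Wire sources -> areas -> rewritings -> union -> target.

    Returns the list of source nodes and the target node.
    """
    target = Node("target", "Q")
    union = Node("union", "Union", [target])
    for rw_name, areas in rewriting_areas.items():
        rewriting = Node("rewriting", rw_name, [union])
        for area in areas:
            area.succ.append(rewriting)  # the rewriting consumes this area
    sources = [Node("source", name, list(areas))
               for name, areas in source_areas.items()]
    return sources, target
```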
3.5 Accuracy estimation algorithm

We propose an algorithm for data accuracy estimation; its pseudo-code is shown in Figure 5. It implements the approach described in Section 3.3 and consists of three phases:

1. For each Rewriting node, the algorithm generates an area, defines its subareas, and computes the characteristic values (cardinality, predicates, subarea accuracies, and key accuracy).
2. The algorithm assembles all rewritings to determine the Union node.
3. The algorithm performs the accuracy value aggregation for all data edges.

In the first phase, the algorithm executes two loops over the area nodes that are inputs to each rewriting node. During the first loop, the key accuracy feature is calculated by multiplying the key accuracies of all input areas; the predicate feature is determined by merging the predicates of the input areas with the rewriting predicate and eliminating redundant constraints; finally, the cardinality feature is computed as the product of the rewriting selectivity and the input area cardinalities. During the second loop, to compute the subarea feature, we insert the subareas of all input areas. This must be performed in a separate loop because the subarea accuracy takes as input the key accuracy computed in the first loop. The algorithm then adds the subareas to the rewriting node and intersects the subarea attributes with the rewriting attributes. The natural join attributes are stored in new subareas, whose accuracy is calculated using equation (7):

Accuracy = (Σ_{i=1}^{n} subarea_i) / n    (7)

where n is the number of subareas. At the end, the subareas with identical accuracy values are merged. U_areas is a list containing the generated area of each rewriting; the second phase therefore consists of setting U_areas as the value of the Union node.

Finally, for each source or activity node, the accuracy aggregation is calculated as the weighted sum of the subarea accuracies, where each weight is the number of subarea attributes multiplied by the area cardinality. The resulting value is associated with all edges outgoing from the node.

Figure 5: Accuracy Propagation Algorithm.

4 Case study

This example demonstrates the proposed approach for data accuracy estimation. The Boolean metric is used to illustrate the evaluation of semantic correctness; nevertheless, alternative accuracy metrics and factors could be used.

Assume that the DIS global schema contains two relations that hold data concerning students and their marks, respectively:

– Student(ID, name, level, exam, phone, address, city)
– Mark(ID, mark, year)

The attributes of the relation Student are ID (the student identifier), name, level (first level, defined by an initial interview; its value can be "low", "medium", or "high"), exam (initial exam result; its value is in [0, 1]), phone, address, and city. The attributes of the relation Mark are ID (the student identifier), mark (in [0, 20], where 20 is the highest mark), and year. The relation keys are ID and (ID, year), respectively.

Suppose that two databases provide data concerning students and their marks, as presented in Table 1 and Table 2:

– S(ID, name, level, exam, phone, address) // students residing at Al-Ain (UAE)
– M(ID, mark, year) // the student marks

Table 1: S Relation.
ID  | Name   | Level  | Exam   | Phone     | Address
120 | Rani   | High   | 1      | 6001104   | 12
123 | Nada   | Medium | 0.465  | 99628734  | Chiab al Alashekhar
141 | Areej  | Low    | 0.987  |           | Al-Ain
154 | Hanan  | Low    | 0.1234 | 9023365   | Palmier 13
155 | Zineb  | Medium | 0.61   | 3364244   | 502 logts 13 n4
157 | Deena  | Low    | 0.2    | 7091232   | Annaba 69
300 | Assala | High   | 0.97   | 4112533   |
301 | Alae   | High   | 0.92   | 5437898   |
302 | Raid   | High   | 0.78   |           | Abu Dhabi
303 | Med    | Low    | 0.2    | 3248673   | Algeria 1280/12
304 | Sarah  | Medium | 0.67   | 231987253 |

Table 2: M Relation.
ID  | Mark | Year
120 | 17   | 2015
141 | 10   | 2015
141 | 18   | 1014
154 | 9    | 2014
155 | 13   | 2014
155 | 4    | 2015
157 | 5    | 2015
300 | 10   | 2014
300 | 19   | 2015
301 | 17   | 2014
301 | 12   | 2014
302 | 18   | 2015
302 | 6    | 2014
303 | 9    | 2015
304 | 11   | 2014
304 | 13   | 2015

The colored cells in the original tables represent the inaccuracies. The aggregated accuracy values for the S and M relations (calculated as the average of the cell accuracies) are 0.6 (40/66) and 0.77 (37/48), respectively. The key expansions are S̄(K_ID, ID, name, level, exam, phone, address) and M̄(K_ID, K_year, ID, mark, year), where {K_ID} and {K_ID, K_year} are the respective expanded keys.
Now S and M are fragmented to determine their areas and subareas, and we then calculate the number of tuples and the accuracies. An acceptable fragmentation of S is:

– Area S1; [ID < 300]; 6 tuples; key accuracy = 0.50
  – Subarea S11; {ID, name, level, exam}; accuracy = 0.50
  – Subarea S12; {address, phone}; accuracy = 0.25
– Area S2; [ID ≥ 300]; 5 tuples; key accuracy = 1.00
  – Subarea S21; {ID}; accuracy = 1.00
  – Subarea S22; {name, level, exam, phone, address}; accuracy = 0.80

Similarly, a possible fragmentation of the M relation is:

– Area M1; [year < 2014]; 1 tuple; key accuracy = 0.00
  – Subarea M11; {ID, mark, year}; accuracy = 0.00 (0/3)
– Area M2; [year ≥ 2014 ∧ ID < 300]; 6 tuples; key accuracy = 0.50
  – Subarea M21; {ID, mark, year}; accuracy = 0.50 (9/18)
– Area M3; [year ≥ 2014 ∧ ID ≥ 300]; 8 tuples; key accuracy = 0.93
  – Subarea M31; {ID, mark, year}; accuracy = 0.93 (25/27)

After the horizontal and vertical fragmentation of the relation, each cell belongs to a specific subarea included in a unique area; in the original illustration, the fragments are represented by different colors describing the different subareas. Table 3 presents the fragments of the Student relation. The expanded key is not colored because it belongs to all subareas and can be left out of the graphical illustration.

Table 3: Student Relation Fragmentation.
KID | ID  | Name   | Level  | Exam   | Phone     | Address
120 | 120 | Rani   | High   | 1      | 6001104   | Zakhir 12
123 | 123 | Nada   | Medium | 0.465  | 99628734  | Chiab al Alashekhar
141 | 141 | Areej  | Low    | 0.987  |           | Al-Ain
154 | 154 | Hanan  | Low    | 0.1234 | 9023365   | Palmier 13
155 | 155 | Zineb  | Medium | 0.61   | 3364244   | 502 logts 13 n4
157 | 157 | Deena  | Low    | 0.2    | 7091232   | Annaba 69
300 | 300 | Assala | High   | 0.97   | 4112533   |
301 | 301 | Alae   | High   | 0.92   | 5437898   |
302 | 302 | Raid   | High   | 0.78   |           | Abu Dhabi
303 | 303 | Med    | Low    | 0.2    | 3248673   | Algeria 1280/12
304 | 304 | Sarah  | Medium | 0.67   | 231987253 |

The respective schemas of S and M are:

– S(ID, name, level, exam, phone, address) ← Student(ID, name, level, exam, phone, address, city) ∧ city = "Al-Ain"
– M(ID, mark, year) ← Mark(ID, mark, year)
The areas of S and M are expressed in datalog-like notation:

– S1(ID, name, level, exam, phone, address) ← S(ID, name, level, exam, phone, address) ∧ ID < 300
– S2(ID, name, level, exam, phone, address) ← S(ID, name, level, exam, phone, address) ∧ ID ≥ 300
– M1(ID, mark, year) ← M(ID, mark, year) ∧ year < 2014
– M2(ID, mark, year) ← M(ID, mark, year) ∧ year ≥ 2014 ∧ ID < 300
– M3(ID, mark, year) ← M(ID, mark, year) ∧ year ≥ 2014 ∧ ID ≥ 300

Substituting S and M by their expressions yields the area definitions in terms of the global schema:

– S1(ID, name, level, exam, phone, address) ← Student(ID, name, level, exam, phone, address, city) ∧ city = "Al-Ain" ∧ ID < 300
– S2(ID, name, level, exam, phone, address) ← Student(ID, name, level, exam, phone, address, city) ∧ city = "Al-Ain" ∧ ID ≥ 300
– M1(ID, mark, year) ← Mark(ID, mark, year) ∧ year < 2014
– M2(ID, mark, year) ← Mark(ID, mark, year) ∧ year ≥ 2014 ∧ ID < 300
– M3(ID, mark, year) ← Mark(ID, mark, year) ∧ year ≥ 2014 ∧ ID ≥ 300

Suppose the user query Q, in datalog-like notation:

– Q(ID, name, year, mark) ← S(ID, name, level, exam, phone, address, city) ∧ M(ID, mark, year) ∧ year = 2015

The output of query Q (extracted from the S and M relations) is given in Table 4.

Table 4: Query Result.
ID  | Name   | Year | Mark
120 | Rani   | 2015 | 17
141 | Areej  | 2015 | 10
155 | Zineb  | 2015 | 4
157 | Deena  | 2015 | 5
300 | Assala | 2015 | 19
302 | Raid   | 2015 | 18
303 | Med    | 2015 | 9
304 | Sarah  | 2015 | 13

The query rewriting using S1, S2, M1, M2, and M3 creates the following buckets: Buck(S) = {S1, S2} and Buck(M) = {M2, M3}. The area M1 is omitted from Buck(M) because it does not satisfy the predicate of Q ("year = 2015" in Q versus "year < 2014" in M1). The query rewritings are then produced by considering one area from each bucket:

– QR1(ID, year, name, mark) ← S1(ID, name, level, exam, phone, address) ∧ M2(ID, mark, year) ∧ year = 2015
– QR2(ID, year, name, mark) ← S2(ID, name, level, exam, phone, address) ∧ M3(ID, mark, year) ∧ year = 2015

Because the predicates of areas S1 and M3 are contradictory ("ID < 300" versus "ID ≥ 300"), these areas are not joined; the same applies to areas S2 and M2. Consequently, Q ⊇ QR1 ∪ QR2.

Let us focus on the rewriting QR1. A unique area is created for each of QR1 and QR2. The rewriting QR1 combines the areas S1 (subareas S11 and S12) and M2 (subarea M21). Intersecting the attributes of subareas S11 and M21 with the rewriting attributes yields the two subareas QR11 and QR12; the join attribute is isolated in subarea QR13. In this case, no attribute of subarea S12 is projected. The subareas of QR2 are defined similarly.
The final result is:

– Area QR1 {ID < 300 ∧ year = 2015}; ? tuples; key accuracy = ?; inputs: S1, M2
  – Subarea QR11 {name}; accuracy = ?; input: S11
  – Subarea QR12 {year, mark}; accuracy = ?; input: M21
  – Subarea QR13 {ID}; accuracy = ?; inputs: S11 and M21
– Area QR2 {ID ≥ 300 ∧ year = 2015}; ? tuples; key accuracy = ?; inputs: S2, M3
  – Subarea QR21 {name}; accuracy = ?; input: S22
  – Subarea QR22 {year, mark}; accuracy = ?; input: M31
  – Subarea QR23 {ID}; accuracy = ?; inputs: S21, M31

After the estimation of the semantic accuracy of the subareas, we obtain:

– Area QR1 {ID < 300 ∧ year = 2015}; ? tuples; key accuracy = 0.25 (0.50 × 0.50); inputs: S1, M2
  – Subarea QR11 {name}; accuracy = 0.25 (0.50 × 0.50); input: S11
  – Subarea QR12 {year, mark}; accuracy = 0.25 (0.50 × 0.50); input: M21
  – Subarea QR13 {ID}; accuracy = 0.25 (average of 0.50 × 0.50 and 0.50 × 0.50); inputs: S11, M21
– Area QR2 {ID ≥ 300 ∧ year = 2015}; ? tuples; key accuracy = 0.93 (1.00 × 0.93); inputs: S2, M3
  – Subarea QR21 {name}; accuracy = 0.74 (0.80 × 0.93); input: S22
  – Subarea QR22 {year, mark}; accuracy = 0.93 (0.93 × 1.00); input: M31
  – Subarea QR23 {ID}; accuracy = 0.84 (average of 0.80 × 0.93 and 0.93 × 1.00); inputs: S21, M31

Because the subareas QR11, QR12, and QR13 possess the same accuracy, they are merged into subarea QR11:

– Area QR1 {ID < 300 ∧ year = 2015}; ? tuples; key accuracy = 0.25; inputs: S1, M2
  – Subarea QR11 {ID, name, year, mark}; accuracy = 0.25
– Area QR2 {ID ≥ 300 ∧ year = 2015}; ? tuples; key accuracy = 0.93; inputs: S2, M3
  – Subarea QR21 {name}; accuracy = 0.74; input: S22
  – Subarea QR22 {year, mark}; accuracy = 0.93; input: M31
  – Subarea QR23 {ID}; accuracy = 0.84; inputs: S21 and M31

Remember that the rewriting accuracy is the average of the cell accuracies (accuracy aggregation). In this manner, the QR1 accuracy is 0.25 (the value of its unique subarea), and the QR2 accuracy is 0.86 ((0.74 × 1 + 0.93 × 2 + 0.84 × 1)/4). The selectivities of QR1 and QR2 are 1/9 (4/(6 × 6)) and 1/10 (4/(5 × 8)), respectively. Subsequently, the partition metadata are:

– Area QR1 {ID < 300 ∧ year = 2015}; 1/9 × (6 × 6) = 4 tuples; key accuracy = 0.25; inputs: S1, M2
  – Subarea QR11 {ID, name, year, mark}; accuracy = 0.25
– Area QR2 {ID ≥ 300 ∧ year = 2015}; 4 (= 5 × 8 × 1/10) tuples; key accuracy = 0.93; inputs: S2, M3
  – Subarea QR21 {name}; accuracy = 0.74
  – Subarea QR22 {year, mark}; accuracy = 0.93
  – Subarea QR23 {ID}; accuracy = 0.84

The accuracy value of the global user query Q is estimated as the weighted sum of the rewriting accuracies (weight = number of tuples); thus, we obtain (4 × 0.25 + 4 × 0.93)/(4 + 4) = 0.59. Figure 6 shows the quality graph for the user query Q, defined by merging its rewritings QR1 and QR2.

Figure 6: Quality Graph (Student → S1, S2; Mark → M2, M3; {S1, M2} → QR1; {S2, M3} → QR2; QR1, QR2 → Union → Q = Target).

In the following section, we discuss data accuracy improvement by presenting a possible approach that discards the areas or subareas that lower the predicted accuracy.

5 Accuracy improvement

If the user expectations on data accuracy cannot be achieved, some fundamental amendment actions can be taken to enforce the data accuracy. Because the proposed estimation approach arranges the query output in subareas possessing uniform accuracy, there is no guarantee that all cells of the query output satisfy the target expectations; however, the delivered data, in general, satisfy the user target.
We propose a simple improvement action consisting of discarding the pieces of the query output with small accuracy values. We present different ways, and different times, of executing such improvement actions. In particular, we distinguish three accuracy levels:

– Cell level: the accuracy of a set of cells must reach a specific threshold.
– Tuple level: the accuracy of a set of tuples must reach a specific threshold.
– Output level: the accuracy of the whole output must reach a specific threshold.

The cell level is the most constraining. It reflects the fact that the user accepts only tuples containing accurate values. Moreover, the accuracy restrictions may concern specified attributes; for example, the user may require accurate phone numbers and disregard the accuracy of the other attributes. For this type of restriction, we suggest filtering out the subareas with small accuracy values, which is possible because the query output is partitioned into subareas with uniform accuracy. If the restriction concerns only specific attributes, the filtering concerns only the subareas involving those attributes. Such an action can be performed in the initial evaluation phase, precisely during the insertion of areas into the buckets. We call this improvement action selective rewriting.

The tuple level accepts the existence of a group of attributes with low accuracy, on the condition that the aggregate accuracy of the tuple is sufficiently high. Some users may tolerate acquiring tuples containing inaccuracies in certain attributes as long as the remaining cells are correct (e.g., when different attributes hold alternative ways to contact clients, such as phone, address, and email). Therefore, mistakes in some attributes are tolerable if the remaining attributes have high accuracy values. In this case, we suggest area filtering rather than subarea filtering; in other words, the accuracy value is the aggregation of the subarea accuracies, which is then compared with the user expectations. Note that area filtering must be performed after the rewriting computation (contrary to subarea filtering), because the aggregation spans several subareas belonging to different database relations.

Finally, the output level considers neither the tuple nor the attribute accuracy; it only requires that the final output reach a specific accuracy level. Query outputs containing a mixture of tuples with high accuracy and tuples with very low accuracy are tolerable. This reflects users requesting all possible output data without ignoring accuracy altogether (e.g., a user sending a publicity message to clients may accept up to 6% undelivered messages caused by mistakes in the contact attributes). To satisfy this constraint type, we must generate all possible data without considering their accuracy. To this end, we suggest sorting the areas by accuracy, aggregating the accuracy values gradually, and stopping when the correctness constraints become unsatisfied. Additionally, the data delivery can be incremental, allowing users to stop when they have obtained the necessary data or when the data contain too many mistakes. This approach is executed only after aggregating the accuracy of all rewritings.
Therefore, three improvement actions are proposed:

– Selective rewriting is the generation of query rewritings that achieve a specific quality level, particularly "every subarea accuracy should be greater than the threshold." In this case, the rewriting algorithm can be modified to exclude from the buckets the areas related to subareas having low accuracy. Because the subarea accuracy was precomputed during the fragmentation process, this strategy can be implemented easily.
– Rewriting filtering can be implemented directly during the aggregation of the rewriting accuracy.
– Incremental data delivery: after the aggregation of the rewriting accuracies, we suggest sorting the rewritings in descending order of accuracy.

The algorithm is shown in Figure 7. Its inputs are the expected accuracy value and the list of areas in the rewritings; its output is a sorted list of areas representing the data that satisfy the user expectations. Note that the more restrictive the constraints, the smaller the output. It is worth mentioning that accuracy and completeness expectations should be balanced in order to deliver useful and valuable tuples to the user without filtering out too much data.

Figure 7: Incremental Data Delivery (the areas are sorted by accuracy and appended one by one while the cardinality-weighted aggregate accuracy remains above the threshold).

Figure 8 shows an example that illustrates the proposed improvement actions. Assume that QR1, QR2, QR3, and QR4 represent four rewritings. The number inside each subarea indicates the accuracy value of that subarea, and the aggregated accuracy value of each area is written in front of QRi.

Figure 8: Rewritings Filtering.
– QR1: area accuracy 0.5; 20 tuples; subarea accuracies S11 = 0.35, S12 = 0.49, S13 = 0.51, S14 = 0.59.
– QR2: area accuracy 0.6; 10 tuples; subarea accuracies S21 = 0.47, S22 = 0.64.
– QR3: area accuracy 0.9; 20 tuples; subarea accuracies S31 = 0.8, S32 = 0.97, S33 = 0.93.
– QR4: area accuracy 0.8; 10 tuples; subarea accuracies S41 = 0.4, S42 = 0.92, S43 = 0.82, S44 = 0.85, S45 = 0.93, S46 = 0.85.

Assume also that C1, C2, and C3 describe three constraints as follows:

– C1: every subarea accuracy value must be greater than or equal to 0.8.
– C2: every area accuracy value must be greater than or equal to 0.8.
– C3: the query accuracy value must be greater than 0.7.

Only QR3 satisfies constraint C1; thus, the algorithm will generate only QR3. In particular, QR4 is discarded because one of its subareas, S41, does not satisfy the constraint. However, both QR3 and QR4 satisfy constraint C2: although subarea S41 possesses an accuracy value lower than 0.8, the accuracy of the whole area is acceptable. To check constraint C3, we sort the QRi by accuracy, obtaining the order QR3 → QR4 → QR2 → QR1, and compute the aggregate accuracy gradually:

– {QR3}: accuracy is 0.9.
– {QR3, QR4}: accuracy is (0.9 × 20 + 0.8 × 10)/(20 + 10) = 0.86.
– {QR3, QR4, QR2}: accuracy is (0.9 × 20 + 0.8 × 10 + 0.6 × 10)/(20 + 10 + 10) = 0.8.
– {QR3, QR4, QR2, QR1}: accuracy is (0.9 × 20 + 0.8 × 10 + 0.6 × 10 + 0.5 × 20)/(20 + 10 + 10 + 20) = 0.7.

Because constraint C3 is no longer satisfied at the last step, QR1 is discarded.
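Under the assumption that each area carries an accuracy value and a cardinality, the incremental-delivery loop of Figure 7 can be sketched in a few lines of Python (our own rendering, not the paper's implementation). Applied to the Figure 8 rewritings with threshold 0.7, it reproduces the selection above:

```python
# Sketch of the incremental-delivery loop of Figure 7 (structures hypothetical):
# sort the areas by accuracy, then append them one by one while the
# cardinality-weighted aggregate accuracy stays strictly above the threshold,
# as constraint C3 requires.

def incremental_delivery(areas, threshold):
    """areas: list of (name, accuracy, cardinality); returns the kept prefix."""
    kept, weighted_sum, card_sum = [], 0.0, 0
    for name, acc, card in sorted(areas, key=lambda a: a[1], reverse=True):
        if (weighted_sum + acc * card) / (card_sum + card) <= threshold:
            break  # adding this area would violate the constraint; stop here
        kept.append(name)
        weighted_sum += acc * card
        card_sum += card
    return kept

# Figure 8 example: (rewriting, area accuracy, tuple count); threshold = 0.7.
rewritings = [("QR1", 0.5, 20), ("QR2", 0.6, 10),
              ("QR3", 0.9, 20), ("QR4", 0.8, 10)]
print(incremental_delivery(rewritings, 0.7))  # ['QR3', 'QR4', 'QR2']
```

Because the areas are delivered in decreasing order of accuracy, the loop can also stream its results, letting the user stop early once enough data have arrived.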
6 Conclusion

This study investigated data accuracy estimation and proposed some improvement actions. We suggested fragmenting the database relations according to data accuracy by applying Rakov's algorithm [19] and projecting these fragments onto the query outputs to better capture the inaccuracy distribution. We rewrote the user query according to the fragments and then aggregated the accuracy values of the rewritings. We defined the query output as the union of the output tuples of the rewritings. A basic algorithm to estimate the data accuracy was presented. In contrast to existing approaches, our algorithm explicitly identifies the areas with low accuracy, so that data that do not satisfy the user expectations can be discarded.

Furthermore, we suggested three primary improvement actions that filter low-accuracy data to meet different types of user accuracy expectations. This approach is applicable in multiple stages of DIS development (design, production, or maintenance) to inform users about data accuracy, compare databases, check expectation satisfaction, or analyze enforcement actions for improving data accuracy.

Our objective in the near future is to develop a prototype implementing the proposed accuracy evaluation algorithm and to illustrate its usage in real-world applications. For this purpose, the prototype should display and edit the different components of the framework (such as databases, quality graphs, characteristics, and accuracy values) in addition to executing the quality evaluation algorithms. Furthermore, the prototype should evaluate the data accuracy in different applications for validation purposes by running several tests in order to assess the approach's performance and limitations. Finally, we intend to improve some features of the quality evaluation tool, perform additional performance tests, and analyze further quality factors and their inter-relationships.

References

[1] Abdalla, H., and Artoli, A. M. Towards an efficient data fragmentation, allocation, and clustering approach in a distributed environment. Information, 10(3), 112, 2019. DOI: 10.3390/info10030112
[2] Agrawal, S., Narasayya, V., and Yang, B. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 359–370, 2004. DOI: 10.1145/1007568.1007609
[3] Alsaif, S. A., and Hidri, A. Impact of data balancing during training for best predictions. Informatica, 45(2), 2021. DOI: 10.31449/inf.v45i2.3479
[4] Azeroual, O., Ershadi, M. J., Azizi, A., Banihashemi, M., and Abadi, R. E. Data quality strategy selection in CRIS: Using a hybrid method of SWOT and BWM. Informatica, 45(1), 2021. DOI: 10.31449/inf.v45i1.2995
[5] Bai, Q., Hong, J., and McTear, M. F. Some modifications of bucket-based algorithms for query rewriting using views. In International Conference on Advances in Information Systems, Springer, Berlin, Heidelberg, 57–67, 2004. URL: https://rb.gy/leqpjc
[6] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth, 368, 1984. DOI: 10.1201/9781315139470-8
[7] Calvanese, D., Lembo, D., and Lenzerini, M. Survey on methods for query rewriting and query answering using views. Integrazione, Warehousing e Mining di sorgenti eterogenee, 25, 2001. URL: https://rb.gy/05cken
[8] Ezéchiel, K. K., Kant, S., and Agarwal, R. A systematic review on distributed databases systems and their techniques.
Journal of Theoretical and Applied Information Technology, 96(1), 236–266, 2019. URL: https://rb.gy/gs00qk
[9] Fuaad, H. A., Ibrahim, A. A., Majed, A., and Asem, A. A survey on distributed database fragmentation allocation and replication algorithms. Current Journal of Applied Science and Technology, 27(2), 1–12, 2018. DOI: 10.9734/CJAST/2018/37079
[10] Gao, J., Li, X., Xu, Y. E., Sisman, B., Dong, X. L., and Yang, J. Efficient knowledge graph accuracy evaluation. arXiv preprint arXiv:1907.09657, 2019. DOI: 10.48550/arXiv.1907.09657
[11] Halevy, A. Y. Answering queries using views: A survey. The VLDB Journal, 10(4), 270–294, 2002. DOI: 10.1007/s007780100054
[12] Hill, G. A framework for valuing the quality of customer information. Doctoral dissertation, Melbourne University, 1–194, 2009. URL: http://hdl.handle.net/11343/35567
[13] Levy, A., Rajaraman, A., and Ordille, J. Querying heterogeneous information sources using source descriptions. In Proceedings of the 22nd International Conference on Very Large Data Bases, 1, 1–26, 1996.
[14] Ma, H., Noack, R., Schewe, K. D., and Thalheim, B. Using meta-structures in database design. Informatica, 34(3), 387–403, 2010. URL: https://rb.gy/cbonrz
[15] Motro, A., and Rakov, I. Estimating the quality of databases. In International Conference on Flexible Query Answering Systems, Springer, Berlin, Heidelberg, 298–307, 1998. URL: https://rb.gy/iriobk
[16] Naumann, F., Leser, U., and Freytag, J. C. Quality-driven integration of heterogeneous information systems. In Proceedings of the International Conference on Very Large Data Bases, 447–458, 2005. URL: https://rb.gy/bgzhyn
[17] Özsu, M. T., and Valduriez, P. Principles of Distributed Database Systems. Springer, New York, 657–722, 2011. DOI: 10.1007/978-1-4419-8834-8
[18] Peng, P., Zou, L., Chen, L., and Zhao, D. Adaptive distributed RDF graph fragmentation and allocation based on query workload. IEEE Transactions on Knowledge and Data Engineering, 31(4), 670–685, 2018. DOI: 10.1109/TKDE.2018.2841389
[19] Rakov, I. Data quality and its use for reconciling inconsistencies in multidatabase environment. PhD dissertation, George Mason University, 1998.
[20] Raman, V., Swart, G., Qiao, L., Reiss, F., Dialani, V., Kossmann, D., Narang, I., and Sidle, R. Constant-time query processing. In 2008 IEEE 24th International Conference on Data Engineering, 60–69, 2008. DOI: 10.1109/ICDE.2008.4497414
[21] Riezler, S., and Liu, Y. Query rewriting using monolingual statistical machine translation. Computational Linguistics, 36(3), 569–582, 2010. DOI: 10.1162/coli_a_00010
[22] Silberschatz, A., Korth, H. F., and Sudarshan, S. Database System Concepts. McGraw-Hill, New York, 5, 2002. URL: https://rb.gy/ttrvjt
[23] Stvilia, B. A workbench for information quality evaluation. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, 469, 2008. DOI: 10.1145/1378889.1379014
[24] Tarun, S., Batth, R. S., and Kaur, S. A review on fragmentation, allocation and replication in distributed database systems. In International Conference on Computational Intelligence and Knowledge Economy (ICCIKE), 538–544, 2019. DOI: 10.1109/ICCIKE47802.2019.9004233
[25] Várkonyi, G. G., and Gradišek, A. Data protection impact assessment case study for a research project using artificial intelligence on patient data. Informatica, 44(4), 2020. DOI: 10.31449/inf.v44i4.3253
[26] Wang, R.
Y., and Strong, D. M. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5–33, 1996. DOI: 10.1080/07421222.1996.11518099
[27] Zozus, M. N., Pieper, C., Johnson, C. M., Johnson, T. R., Franklin, A., Smith, J., and Zhang, J. Factors affecting accuracy of data abstracted from medical records. PLoS ONE, 10(10), e0138649, 2015. DOI: 10.1371/journal.pone.0138649