https://doi.org/10.31449/inf.v46i2.3027    Informatica 46 (2022) 243–258

Formal Approach to Data Accuracy Evaluation

Belkacem Athamena (1) and Zina Houhamdi (2)
E-mail: athamena@gmail.com, belkacem.athamena@aau.ac.ae, z_houhamdi@yahoo.fr, zina.houhamdi@aau.ac.ae
(1) Business Administration Department, College of Business, Al Ain University, United Arab Emirates
(2) Cybersecurity Department, College of Engineering, Al Ain University, United Arab Emirates

Keywords: data integration systems, data quality, data accuracy

Received: December 20, 2019

Data quality is usually defined by multiple attributes that characterize either the output data (such as completeness, freshness, and accuracy) or the methods exploiting these data (such as dependability, performance, and protection). Among the suggested quality attributes, we discuss one of the principal categories: data accuracy. Scientific experiments, decision-making, and data retrieval are examples of situations that require a formal approach to data accuracy evaluation. The evaluation approach should be adaptable to distinct understandings of data accuracy and distinct end-user expectations. This study investigates data accuracy and defines the dimensions and metrics that affect its evaluation. The investigation of data accuracy raises problems in user expectation specification and in database quality models. This work describes our proposed approach for data accuracy evaluation by defining an evaluation algorithm that considers the distribution of inaccuracies in database relations. The approach decomposes the query output according to data accuracy, labels every part with its accuracy value, and addresses the possibility of enforcing data accuracy by using these values. The main contributions of this study are an explicit evaluation of the quality attributes of data accuracy, a formal approach to data accuracy evaluation, and a set of improvement actions that reinforce data accuracy.

Povzetek: A formal method for verifying the accuracy of data is described.

1 Introduction

Data quality has increasingly become an essential characteristic required by users, particularly for data integration systems (DISs), which combine data residing in multiple databases and provide the user with a unified view of these data as answers to their queries [4]. Because of the growth of retrieved data, users are becoming increasingly concerned about data quality. At the same time, the number of quality attributes and their relationships is huge. Accordingly, data quality evaluation is a complex problem involving multiple variables. In the DIS context, data quality evaluation is exceptionally complicated because it combines data derived from different databases with possibly distinct qualities. Because of the large number and high heterogeneity of these independent databases, it is crucial to determine their quality precisely and to consider it in the design phase of a DIS.

System quality improvement amounts to optimizing a problem with multiple variables, which may be very complex, especially in an indefinite context [3, 26]. Consequently, it is arduous to consider all quality attributes at the same time. To investigate data quality thoroughly, each quality attribute must be investigated independently, together with the contextual factors that affect it; dependencies can be investigated later. Among the proposed quality attributes, we opt for a main category: data accuracy.
Currently, many systems treat reliable measurement of data accuracy as a crucial and decisive requirement. Such systems are numerous and are used in diverse fields, such as customer relationship management, web-service integration, scientific experiments, decision-making, and data retrieval.

This study discusses data accuracy analysis in a DIS and considers the relational scenario. Explicitly, it treats user queries composed of projection, selection, and join (PSJ) operators over a collection of database relations. It addresses the problem of evaluating the accuracy of the data delivered to the user in response to a query and decides whether the user's expectations on data accuracy can be achieved.

Our approach consists of fragmenting the query output into sets of tuples (called areas) that have uniform accuracy values and marking each area with its accuracy value. Consequently, the user can do the following:

– extract only the most precise data by selecting the areas with high accuracy values,
– exclude data that do not satisfy the accuracy threshold by ignoring the areas with low accuracy values, or
– classify data by sorting the areas based on their accuracy values.

Moreover, if we wish to display additional data (because the most accurate area is incomplete), we can show the next area, and so on. This represents an added value to the delivered data.

This paper illustrates our proposed accuracy evaluation algorithm. We present the values of semantic accuracy using the Boolean metric, where each cell of every area of any relation is assigned the value accurate or inaccurate. The accuracy of a whole area is then calculated as an aggregation of the cells' accuracy values: the number of accurate cells divided by the total number of cells in the area. The query results are sorted by the areas' accuracy values. Areas whose accuracy value exceeds the user threshold are considered high-accuracy areas; areas whose accuracy value is below the threshold are considered low-accuracy areas.

The fragmentation of the query output and the evaluation of data accuracy require an in-depth analysis of the inaccuracy distribution in the database relations and of their combination to generate the query output [24]. For this purpose, we split the database relations into areas possessing uniform accuracies. As clarified in [15], databases are usually of heterogeneous quality; consequently, assigning a single accuracy value to an entire relation yields an imprecise accuracy calculation for particular data [14]. An area is described as a view (selection and projection) over the relations of the database; that is, a fragment aggregates a group of relations specified by the predicates characterizing the fragment [8, 9, 25].

This paper presents a formal approach to evaluating data accuracy that considers the portions of the database relations and processes them to generate an output for the user query. It focuses on pre-evaluation; that is, the data accuracy is estimated before the execution of the user query. The outputs of our evaluation approach are useful for comparing different query plans in order to select the plan with the maximal accuracy value. Furthermore, these outputs are beneficial during the design phase (to determine which databases will be included in the DIS) and the monitoring phase (to calculate the query accuracy). Finally, this study discusses the issue of accuracy improvement: it proposes using the output fragments to choose the areas with high accuracy values, i.e., we use the portions of a relation with the highest accuracy values instead of the whole relation. This distinguishes our approach from existing approaches in the literature.
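To make the Boolean-metric aggregation and thresholding just described concrete, here is a minimal Python sketch (the paper prescribes no implementation; the function and variable names are ours). An area is modeled as a matrix of per-cell accuracy flags:

```python
from typing import List, Tuple

def area_accuracy(cells: List[List[bool]]) -> float:
    """Boolean metric: number of accurate cells divided by total cells."""
    total = sum(len(row) for row in cells)
    accurate = sum(cell for row in cells for cell in row)
    return accurate / total if total else 0.0

def rank_areas(areas: List[List[List[bool]]],
               threshold: float) -> Tuple[list, list]:
    """Label every area with its accuracy, sort in descending order,
    and split into high-accuracy (>= threshold) and low-accuracy areas."""
    labeled = sorted(((area_accuracy(a), a) for a in areas),
                     key=lambda pair: pair[0], reverse=True)
    high = [(acc, a) for acc, a in labeled if acc >= threshold]
    low = [(acc, a) for acc, a in labeled if acc < threshold]
    return high, low
```

The sorted, labeled output directly supports the three uses listed above: extracting the most accurate areas, excluding the low-accuracy ones, and classifying the rest.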
2 Background

Data accuracy denotes a set of quality attributes. This section describes the three accuracy attributes listed in the literature [23, 27]:

– Syntactic accuracy represents the degree to which the data are free of syntactic mistakes such as format inconsistencies and spelling errors. It expresses the gap between the descriptions of the data in the DIS and the conventional descriptions of these data (syntactic gap).
– Semantic accuracy defines how adequately the data describe the environment state. It expresses the gap between the data described in the DIS and the real-world data (semantic gap).
– Precision describes the level of data detail. It expresses the gap between the detail level of the DIS data and the planned detail level.

Regarding accuracy measurement, three metrics are mentioned in the literature [12]:

– The Boolean metric uses a Boolean value to indicate whether a data item is accurate (1, or True) or inaccurate (0, or False).
– The degree metric uses a dimension to capture how accurate the data are; it usually belongs to the [0, 1] interval.
– Value-deviation captures the gap between a data item in the system and the original one; it is usually normalized to the [0, 1] interval.

This section also reviews the main concepts used in this study. First, it discusses the current approaches for evaluating data accuracy from which our proposed model is adapted, particularly the fragmentation technique, and recalls the characteristics of relational database partitioning. It then reviews the concept of query rewriting and explains the bucket algorithm. Finally, it comments on selectivity estimation techniques.

Our accuracy evaluation approach builds on two techniques: a prior evaluation approach, which presumes a homogeneous distribution of errors [16], and a posterior evaluation, which uses accuracy homogeneity for partitioning the database relations [19].

Prior Evaluation Method: Naumann et al. [16] proposed a method that propagates quality factors (including data accuracy) through the query operators. The method estimates the query output accuracy based on the accuracy of the database. The database relation accuracy is the rate of syntactic accuracy (rate of accurate cells), and the query is a PSJ.

The method assumes that inaccuracies are homogeneously spread over the database relations; thus, regardless of the selected tuples or projected attributes, the database accuracy is preserved. For the join operation, the accuracy of the joined data is computed by multiplying the accuracies of the input relations [20].

The disadvantage of this method is the strong assumption of a homogeneous error distribution, which is usually inapplicable to real-world data. In general, query operations do not maintain accuracies, so the method does not obtain an exact calculation of the accuracy. This deficiency is due to the absence of knowledge about the inaccuracy distribution (where the errors are concentrated).
Supplementary knowledge characterizing the relation instances is mandatory to obtain more accurate outputs.

Posterior Evaluation Method: This method uses a partitioning algorithm [15, 19] to fragment the database relations into areas with highly uniform accuracy. An area is described as a view involving projection and selection operations. The accuracy is calculated on a sample of each relation by computing the cell accuracy of that sample; the area accuracy is the rate of semantic accuracy (rate of correct cells). The accuracy values are exploited by an automated fragmentation algorithm that evaluates a set of criteria, and the same fragmentation is then performed over the complete database relation.

Usually, the query is a PSJ, and relational algebra is employed to operate on the partitions, i.e., each operator takes as input a set of relations with their respective partitions. Its output is a single relation with its partitions: the partition of a projection (selection) is determined as the intersection of the operation output and the partition of the input relation (intersection of the selection conditions with the projected attributes). In this case, the data accuracy is preserved owing to accuracy uniformity. Finally, when the query output is ready, its accuracy value is calculated as the weighted sum of the area accuracies (the weight is the number of cells in the area). Note that a single accuracy value is computed for the entire output.

Partition Algorithm: We concisely describe Rakov's algorithm for sampling a relation [19]; Figure 1 shows the related pseudo-code. It is an iterative procedure using a classification tree. It takes a relation as input, splits it into two blocks (vertical or horizontal splitting, but never both), tries to identify the split with the highest uniformity, and then iterates the procedure on each block. The partitioning of a block terminates when further splitting provides a negligible improvement in uniformity. A threshold x denotes a fairly uniform distribution of accuracies in the block and serves as the stopping condition. Uniformity is estimated using Gini indices [6]. The Gini index GI(P) is given by equation (1):

GI(P) = 2r(1 − r)    (1)

where r represents the rate of accurate cells in partition P. Equation (2) defines the halting constraint, which estimates the reduction of the partition's Gini index:

ΔGI = GI(P) − α_1 GI(P_1) − α_2 GI(P_2)    (2)

where {P_1, P_2} are partitions of P and α_i = |P_i|/|P|, i = 1, 2.

Because considering all possible relation partitions is excessively costly, Rakov suggests heuristics to decrease the number of treated partitions. In addition, we distinguish two types of attributes: categorical and ordered [1]. For horizontal partitioning, if the attribute is ordered and possesses K different values (A_1 ≤ ··· ≤ A_K), the candidate partitions are identified by the (K − 1) binary conditions t ≤ A_i. Otherwise, for a categorical attribute with K different values, these values are sorted by the number of incorrect cells and then treated as ordered attributes. For vertical partitioning, all partitions are considered when the number of attributes is small; otherwise, the same method as for categorical attributes is applied.

Figure 1: Relation Partitioning Algorithm.
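The split-acceptance rule of equations (1) and (2) can be sketched as follows (a minimal sketch with illustrative names; the enumeration of candidate horizontal and vertical splits, which Rakov prunes with the heuristics above, is elided):

```python
def gini(accurate: int, total: int) -> float:
    """Equation (1): GI(P) = 2r(1 - r), r = rate of accurate cells in P."""
    r = accurate / total
    return 2.0 * r * (1.0 - r)

def gini_reduction(parent, block1, block2) -> float:
    """Equation (2): Delta GI = GI(P) - a1*GI(P1) - a2*GI(P2), ai = |Pi|/|P|.

    Each argument is an (accurate_cells, total_cells) pair.
    """
    (pa, pt), (b1a, b1t), (b2a, b2t) = parent, block1, block2
    return gini(pa, pt) - (b1t / pt) * gini(b1a, b1t) - (b2t / pt) * gini(b2a, b2t)

def accept_split(parent, block1, block2, x: float) -> bool:
    """Split only while the reduction reaches the uniformity threshold x."""
    return gini_reduction(parent, block1, block2) >= x
```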
In the following, we discuss the characteristics of well-constructed partitions.

Partition correctness: There are three correctness constraints that a partition must satisfy to guarantee database coherence [17]:

– Disjunction: The horizontal decomposition of a relation R into partitions R_1, ..., R_n verifies that for every cell d_i, d_i ∈ R_j implies d_i ∉ R_k for k ≠ j (i.e., a cell in partition R_j does not appear in any other partition R_k). This property ensures that the partitions are disjoint: there is no intersection between the horizontal partitions. For vertical decomposition, the disjunction is limited to the non-primary-key attributes of the relation because the primary-key attributes are duplicated in all partitions.
– Restoration: When a relation R is decomposed into partitions R_1, ..., R_n, there is always an operator Ω such that R = Ω(R_1, ..., R_n). This rule assures the preservation of the data restrictions defined as dependencies. Normally, the union operator is used for horizontal partitioning, whereas the join operator is used for vertical partitioning.
– Completeness: When a relation R is decomposed into partitions R_1, ..., R_n, every item belonging to R also belongs to one or more partitions R_i. This characteristic ensures data preservation, i.e., there is no data loss (all data in the global relation are projected into partitions). For horizontal partitioning, items denote tuples; for vertical partitioning, they denote attributes.

Query rewriting is the reformulation of a user query (defined over the global relations) into an analogous expression, known as a rewriting, that refers exclusively to the database framework [11]. In the local-as-view (LAV) strategy, the global relations are defined without referring to the databases, and a mapping between them is then made by expressing each database relation as a view over the global relations [21]. Thus, query rewriting is comparable to answering a user query using views.

Definition: For a query Q over the relations R_i defining the global schema and views V_i referring to database relations over the R_i, the query Q_r is called a rewriting of Q if:

– Q_r ⊆ Q, and
– Q_r refers exclusively to the views.

In general, the query is a PSJ and is defined using a datalog-like language. Thus, a query Q is expressed by equation (3):

Q(X) = R_1(Z_1) ∧ ··· ∧ R_n(Z_n) ∧ C_Q    (3)

where

– R_i is a relation and Z_i = {A_i} is a set of attributes of R_i,
– C_Q is a conjunction of predicates u θ v with θ ∈ {=, <, >, ≤, ≥} and u, v ∈ ⋃_{1≤i≤n} Z_i, and
– X ⊆ ⋃_{1≤i≤n} Z_i is the set of attributes projected by Q.

Numerous query rewriting algorithms can be found in the literature, and [7] presents a review of them. The most popular is the bucket algorithm [13], which we describe and use in this study. The bucket algorithm computes all possible rewritings that are contained in (but not necessarily equivalent to) the initial query [5]. It prunes the space of candidate rewritings in two phases (see the sketch after this list):

– For each subgoal t ∈ Q: construct a container, called a bucket, that includes the database relations contributing to t, i.e., the database relations containing tuples of t.
– Construct the candidate rewritings (obtained by joining one database relation from each bucket) and preserve only the rewritings contained in Q.
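A schematic sketch of these two phases follows (our names; the `covers` relevance test and the final containment check are only stubs here, and they are where the real work of the algorithm lies):

```python
from itertools import product
from typing import Iterable, List

def build_buckets(query_subgoals: list, views: list) -> List[list]:
    """Phase 1: one bucket per query subgoal, holding every view
    that can contribute tuples to that subgoal."""
    # `covers` is a hypothetical relevance test between a view and a subgoal.
    return [[v for v in views if v.covers(subgoal)]
            for subgoal in query_subgoals]

def candidate_rewritings(buckets: List[list]) -> Iterable[tuple]:
    """Phase 2: combine one view from each bucket; a real implementation
    then keeps only the combinations contained in the original query."""
    return product(*buckets)
```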
The first phase has polynomial complexity with respect to the number of databases, and the second phase reduces the complexity considerably by diminishing the number of possibilities. Although containment checking can be costly in general, its instance here is bounded by the query size (usually small) and becomes hard only when the query has many occurrences of the same schema; accordingly, containment complexity is not a challenge in real applications [5].

Selectivity Evaluation: Selectivity evaluation techniques are widely practiced in query optimization: statistics on the data saved in the database provide estimations to the optimizer [22]. The most popular statistics are histograms, which are adopted by several commercial database systems. This section shows their current application to evaluating the selectivity of a complex query.

The histogram of an attribute A is defined as a list of buckets, where bucket b_i defines r_i, a subrange of A's domain, and possesses two attributes:

– the periodicity p_i, which represents the number of tuples x verifying x.A ∈ r_i, and
– the discrete value dv_i, which represents the number of different values of x.A over all tuples x with x.A ∈ r_i.

The query selectivity is calculated by dividing the query cardinality by the relation cardinality. To evaluate the query cardinality, we sum the periodicity of all buckets included (totally or partially) in the predicate. If the query has several such predicates, the selectivity is calculated as the product of the individual selectivities.

For an arbitrary PSJ query, there is a new challenge: cardinality evaluation needs statistical information to be propagated over the predicates, i.e., we must create histograms for intermediate outputs using the histograms of the database relations. The propagation of intermediate statistics across multiple query operators can considerably decrease the accuracy. A possible solution is to precalculate statistics for a subset of exceptional query results and to use them in the selectivity calculation of intermediate outputs.
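To illustrate how such bucket statistics yield selectivities, here is a sketch for a range predicate A ≤ v, under the standard (approximate) assumption that values are uniformly distributed inside each bucket; the structure and names are ours, not a specific system's API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bucket:
    low: float      # inclusive lower bound of the subrange r_i
    high: float     # exclusive upper bound of the subrange r_i
    frequency: int  # periodicity p_i: tuples x with x.A in r_i
    distinct: int   # dv_i: distinct values of x.A in r_i

def selectivity_leq(histogram: List[Bucket], value: float,
                    cardinality: int) -> float:
    """Estimate the selectivity of the predicate A <= value."""
    matched = 0.0
    for b in histogram:
        if b.high <= value:
            matched += b.frequency            # bucket fully included
        elif b.low <= value < b.high:
            # bucket partially included: uniform-distribution fraction
            matched += b.frequency * (value - b.low) / (b.high - b.low)
    return matched / cardinality
```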
The remainder of this paper describes the proposed model for the evaluation of data accuracy; the algorithms and methods reviewed above are modified and adopted for data accuracy estimation.

3 Formal model

This section formalizes the proposed approach for accuracy assessment. The approach considers a DIS in the relational context: the system combines data from relational databases and allows users to execute queries over a global schema. Each query is reformulated according to the database relations to obtain a set of rewritings that select the data answering the query.

This section addresses the accuracy of the data delivered to a user as the answer to the user's query. We arrange the data in areas with uniform accuracy to apprise the user of the inaccuracy distribution. To achieve this goal, the database relations are decomposed into partitions (called areas) possessing uniform accuracy. Note that areas are views (virtual relations) determined by the distribution rules (projection attributes and selection conditions). The user query is reformulated over the areas (rather than over the global relations). The proposed approach uses an a priori assessment method: before performing the query rewriting, we assess the data accuracy, depending only on the operations (PSJ) that define the rewritings and on the area accuracies.

We can state the problem as follows:

– Input:
  – the user query, and
  – a collection of database relations decomposed into areas of uniform accuracy.
– Output:
  – a set of rewritings answering the user query,
  – the accuracy value of each rewriting, and
  – the accuracy value of the query answer.

Our proposed scenario for accuracy estimation contains three phases:

– Database relation fragmentation based on accuracy uniformity: this phase estimates the accuracy of a portion of each database relation and uses the estimates for relation fragmentation. Fragmentation is performed at an initial stage (at DIS development time, or periodically) and is isolated from the query appraisal stage.
– Query rewriting with respect to fragments: every query is reformulated using the areas of the database relations. This phase generates a set of rewritings over the areas; the output of the query is the union of the created rewritings.
– Data accuracy estimation of the query outputs: this phase estimates the accuracy of the data generated by the rewritings, using the area accuracies, and aggregates them to calculate an accuracy value for the whole user query output.

To reinforce the data accuracy, a simple adjustment consists of rejecting rewritings or areas with small accuracy values. The rejection can be done at rewriting generation time (an early stage); thus, our approach can be considered selective. The next subsections provide the details of each phase.

3.1 Database relation fragmentation based on accuracy uniformity

Using the partitioning approach suggested by Rakov [19], the data accuracy is used to fragment each database relation. The purpose of fragmentation is to manipulate pieces of a database relation that are uniformly accurate; in other words, if we tried to fragment such a piece again, its accuracy value would stay approximately unchanged [2]. This subsection defines the fragmentation process and discusses how to extract appropriate fragments.

To simplify partition usage, the approach decomposes the relation into areas using a horizontal partition (selection predicates) and then decomposes each area into subareas using a vertical partition (subsets of attributes). Fragments should verify the three correctness constraints discussed previously: each tuple exists in a unique area, key attributes appear in all subareas, and non-key attributes appear in a single subarea. To manipulate all attributes (key and non-key) in the same way (projection of each attribute into a unique subarea), the key attributes are duplicated; consequently, a new (composite) key for the subarea is generated. This approach is called key expansion.

Key expansion definition: If R(A_1, ..., A_m, ..., A_n) is a relation, where m ≤ n and (A_1, ..., A_m) form the key of R, then R̄ denotes the key expansion of R generated by duplicating the key attributes: R̄(K_1, ..., K_m, A_1, ..., A_m, ..., A_n). The duplicated attributes (K_1, ..., K_m) are the expanded key of R̄.
In the vertical fragmentation, we project the expanded key into all subareas, and we project every attribute of the initial relation into a unique subarea. The fragmentation process (areas and subareas) can be formalized as follows.

Horizontal fragmentation: A horizontal fragmentation of a relation R, represented by H_{P1,...,Pm}(R) (see equation (4)), defines a set of subrelations {R_1, ..., R_m}, named areas, created by applying the predicates P_1, ..., P_m to R̄, where P̄ = {P_1, ..., P_m} is a set of conjunctive predicates over R that is disjoint and complete (each tuple of R verifies a unique P_i):

H_{P1,...,Pm}(R) = R_1, ..., R_m = σ_{P1}(R̄), ..., σ_{Pm}(R̄)    (4)

An area is denoted by <Name, Predicate, N, KeyAccuracy>, where:

– Name distinguishes the area within the relation,
– Predicate is the conjunction that determines the area,
– N represents the number of tuples in the area (those verifying the Predicate), and
– KeyAccuracy is the accuracy value of the key attributes of R.

Vertical fragmentation: For n attribute subsets S_1, ..., S_n of a relation R(S_1, ..., S_n) with ⋂_i S_i = ∅, the vertical fragmentation of an area R_i of R, expressed as V_{S1,...,Sn}(R_i) (see equation (5)), defines a list of views {V_i1, ..., V_in}, named subareas, generated by projecting the attribute subsets onto R_i. The subareas are disjoint (each attribute of R belongs to one subarea), but the expanded key K of R̄ belongs to all subareas:

V_{S1,...,Sn}(R_i) = V_i1, ..., V_in = π_{K,S1}(R_i), ..., π_{K,Sn}(R_i)    (5)

A subarea is defined as a three-tuple <Name, Attributes, Accuracy>, where Name distinguishes the subarea within the relation, Attributes is the list of subarea attributes, and Accuracy denotes the calculated subarea accuracy.

Again, subareas and areas are views determined by the fragmentation rules (projection attributes and selection conditions); in other words, the database relations are not really partitioned and saved as independent partitions. The database relations are preserved unchanged in the databases, and the description models of the subareas and areas are kept in the DIS.

After the horizontal and vertical fragmentation of a relation, every relation cell appears in a single subarea belonging to a single area. To fragment the database relations, Rakov's algorithm described earlier can be applied. However, that algorithm alternates horizontal and vertical partitioning arbitrarily; accordingly, the resulting fragments can differ from areas and their subareas. As a possible solution, we propose reorganizing the fragments into horizontal and vertical fragments.

For any hybrid (horizontal and vertical) fragmentation of a relation R that satisfies the correctness criteria, it is always possible to obtain different fragments of R by additional fragmentation of some of the areas, i.e., to obtain subareas {S_11, ..., S_1m1, ..., S_n1, ..., S_nmn}, where the subset {S_i1, ..., S_imi} represents the vertical fragmentation of some area A_i (1 ≤ i ≤ n), and the set {A_1, ..., A_n} represents the horizontal fragmentation of R. To this end, we follow a four-step process:

– Find the selection predicates P = {p_1, ..., p_r} defining the partitions T_1, ..., T_k, with r ≤ k (because multiple partitions may possess identical selection predicates).
– Search for two non-disjoint predicates p_i and p_j (i.e., p_i ∩ p_j ≠ ∅) and replace p_i and p_j by p_i − p_j, p_j − p_i, and p_i ∩ p_j. This step terminates because the predicates are unions of inequalities on R's attributes. The output is a set of pairwise disjoint predicates P′ = {p′_1, ..., p′_n}.
– Determine the area set A = {A_1, ..., A_n}: for each predicate p′_i ∈ P′ there is an area A_i ∈ A. A corresponds to the horizontal fragmentation of R; in other words, A verifies the completeness constraint, which implies that no predicate subexpression is lost.
– Intersect every area A_i ∈ A with the partitions T_1, ..., T_k to obtain a new set of subareas {S_i1, ..., S_imi}. This new set corresponds to the vertical fragmentation of A_i; in other words, it verifies the completeness criteria.

Consequently, each hybrid fragmentation of a relation that satisfies the correctness constraints can be reorganized into areas and subareas; in particular, the fragments generated by Rakov's algorithm. Figure 2 shows an example of such a reorganization.

Figure 2: Fragments Reorganization. (a) Before reorganization: partitions P1–P6. (b) After reorganization: Area 1 = {S11, S12, S13}, Area 2 = {S21, S22}, Area 3 = {S31, S32}, Area 4 = {S41}.

To fragment the database relations, we suggest applying Rakov's algorithm. Nevertheless, the fragmentation can also be done manually by exploiting knowledge about the database relations collected from the user, the DIS expert, or the IT administrator.

Rakov's algorithm calculates the area accuracy as the percentage of accurate cells. Equation (6) gives the Gini index:

G(V) = 2k(1 − k)    (6)

where k is the accuracy value. The indices are used to estimate the accuracy uniformity and, thereafter, to select the fragmentation with the highest uniformity (i.e., the fragmentation function is parameterized to compute the area accuracy). Each fragment is reorganized as illustrated earlier. The area and subarea schemas are then deduced from the related fragments using the following steps:

1. Name the areas sequentially, and then their associated subareas.
2. Define the area predicates and the lists of attributes describing the subareas from the fragment.
3. Estimate the subarea accuracy by calculating the average cell accuracy (computed by Rakov's algorithm).
4. Estimate the key accuracy by averaging the accuracies of the attributes composing the key (the accuracy of an attribute equals its subarea accuracy owing to accuracy uniformity). Note that the key accuracy equals the subarea accuracy if all key attributes are projected into a single subarea.
5. Finally, as Rakov's algorithm fragments a sample of the database relation, the number of tuples in the sample is used to deduce the number of tuples in an area, i.e., number of tuples in the sample × (relation size / sample size).

The fragmentation is executed once, at DIS development time or periodically; it is not tied to the accuracy estimation of queries.
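Gathering the notions of this subsection, the area and subarea descriptors and step 5's sample-to-relation scaling might be represented as follows (a sketch whose field names mirror the four-tuple and three-tuple schemas defined above; it is not the paper's implementation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Subarea:
    name: str              # distinguishes the subarea within the relation
    attributes: List[str]  # attribute subset, plus the expanded key
    accuracy: float        # average cell accuracy measured on the sample

@dataclass
class Area:
    name: str
    predicate: str         # conjunctive selection condition defining the area
    n_tuples: int          # estimated number of tuples in the full relation
    key_accuracy: float    # average accuracy of the key attributes
    subareas: List[Subarea]

def scale_tuple_count(tuples_in_sample: int, relation_size: int,
                      sample_size: int) -> int:
    """Step 5: extrapolate the tuple count from the sample to the relation."""
    return round(tuples_in_sample * relation_size / sample_size)
```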
The next subsection describes the query reformulation and its impact on accuracy estimation.

3.2 User query rewriting

Our approach reformulates the user query according to the areas of the database relations. We also consider the option of reformulating the user query with respect to subareas and clarify the reason for discarding this alternative.

To describe a user query over the areas that constitute the database relations, our approach uses the classic rewriting algorithm, i.e., the bucket algorithm. Hence, the LAV technique is applied to define the areas as views over the global schema, by replacing each database relation with its description over the global schema, i.e., by unfolding the views over the database relations. Note that this operation is unrelated to the user query because it is performed right after the fragmentation of the database relations.

The query rewriting applies the bucket algorithm, which creates buckets, compares the predicates of the query and of the areas, and then determines the possible rewritings and checks query containment. The number of rewritings increases polynomially with the number of areas. Note that the hardness of the rewriting algorithm is due not to the number of relations but to the query size (which is usually small), and arises only if the query has more than one occurrence of the same relation [13].

We now discuss the option of rewriting a query over subareas and explain why this option is rejected. Areas contain whole tuples of the database relations; consequently, rewriting a query over areas joins whole tuples (exactly as in query rewriting over the database relations themselves). A subarea, however, decomposes tuples, as it projects only a part of the relation's attributes. Some versions of the bucket algorithm can assemble all required attributes by joining multiple relations of a bucket; consequently, query rewriting over subareas is not impossible. On the other hand, assembling a small number of attributes from a large number of relations dangerously increases the risk of introducing semantic inaccuracies (the generation of tuples without meaning in the real world). By way of illustration, the output could combine a student's name with the phone number of a different student who has the same identifier in a distinct database. This problem is intrinsic to DISs, but it grows remarkably as the number of joins increases (particularly when tuples are split); rewriting algorithms avoid splitting tuples specifically to reduce this risk. Accordingly, our approach rewrites the query over areas and uses the subarea structure only for computing the data accuracy. This also explains why the proposed approach fragments relations horizontally and then vertically, rather than managing arbitrary hybrid fragmentations as in Rakov's approach. The following subsection describes the estimation of data accuracy for each rewriting.

3.3 Data accuracy estimation

To compute the area accuracies and aggregate them into an accuracy value for the entire user query output, the proposed approach proceeds in three steps:

– identification of the areas and subareas constituting the rewriting,
– estimation of the rewriting accuracy and its key accuracy, and
– calculation of the number of tuples in each area to perform the aggregation.

Identification of areas and subareas: An area contains at least one subarea, and a rewriting is expressed as the join of a set of areas. The join output is a unique area, which consists of the union of all subareas possessing part of the projected attributes and verifies the selection predicates of all input areas. The following example shows three areas, P1, P2, and P3, decomposed into the subareas (P11, P12), (P21, P22), and (P31, P32, P33), respectively (see Figure 3).
The rewriting QR is defined as one area joining all the subareas that include part of the projected attributes. Note that P12 is not included in QR because it does not contain any of the attributes projected by QR.

Figure 3: Rewriting Joins Multiple Areas (QR joins the subareas P11, P21, P22, P31, P32, and P33).

Because all tuples in the output of the rewriting fulfill all area predicates, a single area is built for the rewriting by conjoining the area predicates and the rewriting predicates. In addition, because the bucket algorithm generates only relevant rewritings, the predicates cannot be contradictory. Note that we list only the most constraining predicates (e.g., "age ≥ 18" is less constraining than "age = 20").

The rewriting subareas are defined as the aggregation of all subareas belonging to the input areas, considering only the attributes common to the subarea and the rewriting. If the intersection between the rewriting and a subarea is empty, the subarea is discarded. The resulting rewriting subareas represent the vertical fragmentation of the rewriting area. The completeness constraint is satisfied because every rewriting attribute appears in some subarea of the input areas. However, for the disjunction constraint, the natural join raises a problem: a join attribute exists in two different input subareas but appears only once in the rewriting. To solve this problem, a new subarea is created, and its accuracy value is computed as the average of both subarea accuracies (as discussed below).

Estimation of the rewriting accuracy: If the fragments are adequately determined, projections of predicates and attributes logically preserve the accuracy of the subareas (owing to accuracy uniformity). For the join operation, however, each tuple of an output subarea is composed of two input tuples; consequently, the produced subarea accuracy depends on both input areas. In particular, the accuracy propagation differs according to the accuracy factor. Recall that under semantic correctness, if the key of a tuple is inaccurate (does not reflect the real-world object), the entire tuple is considered incorrect; semantic correctness evaluates the correspondence of the key attribute to the real-world entity. Syntactic correctness, in contrast, evaluates the cell accuracy without regard to the key attributes. Thus, when joining areas, the cell accuracy is computed differently for the two factors:

– Semantic correctness: the subarea accuracy is computed as the product of the input subarea accuracy and the key accuracies of the remaining areas.
– Syntactic correctness: the subarea accuracy equals the input subarea accuracy.

For the subareas that include join attributes (which were separated in the prior step), we calculate the accuracy for each input subarea separately and then take the average. Subareas with identical accuracy values are merged into a single subarea. Finally, the key accuracy of the rewriting is calculated by multiplying the key accuracies of all input areas, and the rewriting accuracy is calculated as the average of the cell accuracies (the weighted average of the subarea accuracies, where each weight is the number of attributes in the subarea).
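A compact sketch of these propagation rules (illustrative names; the semantic/syntactic distinction follows the two bullets above, and the join-attribute rule follows the averaging just described):

```python
def joined_subarea_accuracy(input_subarea_accuracy: float,
                            other_key_accuracies: list,
                            semantic: bool) -> float:
    """Propagate a subarea's accuracy through a join.

    Semantic correctness: multiply by the key accuracy of every other
    joined area, because an inaccurate key invalidates the whole tuple.
    Syntactic correctness: the cell accuracy is unaffected by the join.
    """
    if not semantic:
        return input_subarea_accuracy
    accuracy = input_subarea_accuracy
    for key_accuracy in other_key_accuracies:
        accuracy *= key_accuracy
    return accuracy

def join_attribute_accuracy(accuracy_left: float,
                            accuracy_right: float) -> float:
    """A join attribute present in two input subareas: average both sides."""
    return (accuracy_left + accuracy_right) / 2.0
```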
Selectivity estimation: Because the query result is the union of multiple rewritings, its accuracy is estimated as the weighted aggregation of the rewriting accuracies, where each weight equals the rewriting size (number of tuples in the rewriting). Thus, the number of tuples in each rewriting must be estimated.

To this end, we suggest estimating the rewriting selectivity and deducing the number of tuples from it. In general, a join corresponds to a comparison between two keys (primary and foreign), and histograms are then used for the calculation. Note that query optimization algorithms, or statistical information about prior executions of the same or similar queries, can be applied to estimate the selectivity; experts can also estimate it. Our approach does not depend on the estimation algorithm, but the resulting accuracy value certainly depends on it. The selectivity is defined as follows.

Given a query rewriting Q_R over areas {R_1, ..., R_k}, the selectivity of Q_R, denoted S(Q_R), is the fraction of tuples in the Cartesian product of all the areas that satisfy the rewriting predicates. If Q_R has no selection (or join) predicate, the whole set of tuples is retained, and thus S(Q_R) = 1. The number of tuples of a rewriting is then calculated as:

number of tuples in the Cartesian product of the input areas × rewriting selectivity

3.4 Quality graph

After generating the query rewritings, we build a quality graph to calculate the user query as the union of all rewritings [10, 18]; there may be multiple rewritings or a single one. The quality graph shown in Figure 4 is created according to the following definitions:

– Source nodes represent the database relations.
– The Target node represents the user query.
– Activity nodes represent the areas of the database relations, denoted Area; edges connect each area to its source.
– Activity nodes represent the rewritings, denoted Rewriting; edges connect each rewriting to the areas used by the rewriting.
– One activity node represents the union of the rewritings (possibly a single rewriting), denoted the Union node. This node is the successor of all rewriting nodes and the predecessor of the Target node.

Figure 4: Quality Graph (Sources → Areas → Rewritings → Union → Target).
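One possible in-memory representation of this graph, wired in the source-to-target direction of Figure 4 (only the node kinds and topology come from the definitions above; everything else, including the direction of the successor edges, is our assumption):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    kind: str  # "source", "area", "rewriting", "union", or "target"
    name: str
    succ: List["Node"] = field(default_factory=list)

def build_quality_graph(source_areas: Dict[str, List[Node]],
                        rewriting_areas: Dict[str, List[Node]]):
    """Wire sources -> areas -> rewritings -> union -> target.

    Returns the list of source nodes and the target node.
    """
    target = Node("target", "Q")
    union = Node("union", "Union", [target])
    for rw_name, areas in rewriting_areas.items():
        rewriting = Node("rewriting", rw_name, [union])
        for area in areas:
            area.succ.append(rewriting)  # the rewriting consumes this area
    sources = [Node("source", name, list(areas))
               for name, areas in source_areas.items()]
    return sources, target
```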
3.5 Accuracy estimation algorithm

We propose an algorithm for data accuracy estimation; its pseudo-code is shown in Figure 5. It implements the approach described in Section 3.3 and consists of three phases:

1. For each Rewriting node, the algorithm generates an area, defines its subareas, and computes the characteristic values (cardinality, predicates, subarea accuracies, and key accuracy).
2. The algorithm assembles all rewritings to determine the Union node.
3. The algorithm performs the accuracy value aggregation for all data edges.

In the first phase, the algorithm executes two loops over the area nodes that are inputs to each rewriting node. During the first loop, the key accuracy feature is calculated by multiplying the key accuracies of all input areas; the predicate feature is determined by merging the predicates of the input areas with the rewriting predicate and eliminating redundant constraints; finally, the cardinality feature is computed as the product of the rewriting selectivity and the input area cardinalities. During the second loop, to compute the subarea feature, we insert the subareas of all input areas. This must be performed in a separate loop because the subarea accuracy takes as input the key accuracy computed in the first loop. The algorithm then adds the subareas to the rewriting node and intersects the subarea attributes with the rewriting attributes. The natural join attributes are stored in new subareas, whose accuracy is calculated using equation (7):

Accuracy = (Σ_{i=1}^{n} subarea_i) / n    (7)

where n is the number of subareas. At the end, the subareas with identical accuracy values are merged. U_areas is a list containing the generated area of each rewriting; the second phase therefore consists of setting U_areas as the value of the Union node.

Finally, for each source or activity node, the accuracy aggregation is calculated as the weighted sum of the subarea accuracies, where each weight is the number of subarea attributes multiplied by the area cardinality. The resulting value is associated with all edges outgoing from the node.

Figure 5: Accuracy Propagation Algorithm.

4 Case study

This example demonstrates the proposed approach for data accuracy estimation. The Boolean metric is used to illustrate the evaluation of semantic correctness; nevertheless, alternative accuracy metrics and factors could be used.

Assume that the DIS global schema contains two relations that hold data concerning students and their marks, respectively:

– Student(ID, name, level, exam, phone, address, city)
– Mark(ID, mark, year)

The attributes of the relation Student are ID (the student identifier), name, level (first level, defined by an initial interview; its value can be "low", "medium", or "high"), exam (initial exam result; its value is in [0, 1]), phone, address, and city. The attributes of the relation Mark are ID (the student identifier), mark (in [0, 20], where 20 is the highest mark), and year. The relation keys are ID and (ID, year), respectively.

Suppose that two databases provide data concerning students and their marks, as presented in Table 1 and Table 2:

– S(ID, name, level, exam, phone, address) // students residing at Al-Ain (UAE)
– M(ID, mark, year) // the student marks

Table 1: S Relation.
ID  | Name   | Level  | Exam   | Phone     | Address
120 | Rani   | High   | 1      | 6001104   | 12
123 | Nada   | Medium | 0.465  | 99628734  | Chiab al Alashekhar
141 | Areej  | Low    | 0.987  |           | Al-Ain
154 | Hanan  | Low    | 0.1234 | 9023365   | Palmier 13
155 | Zineb  | Medium | 0.61   | 3364244   | 502 logts 13 n4
157 | Deena  | Low    | 0.2    | 7091232   | Annaba 69
300 | Assala | High   | 0.97   | 4112533   |
301 | Alae   | High   | 0.92   | 5437898   |
302 | Raid   | High   | 0.78   |           | Abu Dhabi
303 | Med    | Low    | 0.2    | 3248673   | Algeria 1280/12
304 | Sarah  | Medium | 0.67   | 231987253 |

Table 2: M Relation.
ID  | Mark | Year
120 | 17   | 2015
141 | 10   | 2015
141 | 18   | 1014
154 | 9    | 2014
155 | 13   | 2014
155 | 4    | 2015
157 | 5    | 2015
300 | 10   | 2014
300 | 19   | 2015
301 | 17   | 2014
301 | 12   | 2014
302 | 18   | 2015
302 | 6    | 2014
303 | 9    | 2015
304 | 11   | 2014
304 | 13   | 2015

The colored cells in the original tables represent the inaccuracies. The aggregated accuracy values for the S and M relations (calculated as the average of the cell accuracies) are 0.6 (40/66) and 0.77 (37/48), respectively. The key expansions are S̄(K_ID, ID, name, level, exam, phone, address) and M̄(K_ID, K_year, ID, mark, year), where {K_ID} and {K_ID, K_year} are the respective expanded keys.
Now S and M are fragmented to determine their areas and subareas, and we then calculate the number of tuples and the accuracies. An acceptable fragmentation of S is:

– Area S1; [ID < 300]; 6 tuples; key accuracy = 0.50
  – Subarea S11; {ID, name, level, exam}; accuracy = 0.50
  – Subarea S12; {address, phone}; accuracy = 0.25
– Area S2; [ID ≥ 300]; 5 tuples; key accuracy = 1.00
  – Subarea S21; {ID}; accuracy = 1.00
  – Subarea S22; {name, level, exam, phone, address}; accuracy = 0.80

Similarly, a possible fragmentation of the M relation is:

– Area M1; [year < 2014]; 1 tuple; key accuracy = 0.00
  – Subarea M11; {ID, mark, year}; accuracy = 0.00 (0/3)
– Area M2; [year ≥ 2014 ∧ ID < 300]; 6 tuples; key accuracy = 0.50
  – Subarea M21; {ID, mark, year}; accuracy = 0.50 (9/18)
– Area M3; [year ≥ 2014 ∧ ID ≥ 300]; 8 tuples; key accuracy = 0.93
  – Subarea M31; {ID, mark, year}; accuracy = 0.93 (25/27)

After the horizontal and vertical fragmentation of the relation, each cell belongs to a specific subarea included in a unique area; in the original illustration, the fragments are represented by different colors describing the different subareas. Table 3 presents the fragments of the Student relation. The expanded key is not colored because it belongs to all subareas and can be left out of the graphical illustration.

Table 3: Student Relation Fragmentation.
KID | ID  | Name   | Level  | Exam   | Phone     | Address
120 | 120 | Rani   | High   | 1      | 6001104   | Zakhir 12
123 | 123 | Nada   | Medium | 0.465  | 99628734  | Chiab al Alashekhar
141 | 141 | Areej  | Low    | 0.987  |           | Al-Ain
154 | 154 | Hanan  | Low    | 0.1234 | 9023365   | Palmier 13
155 | 155 | Zineb  | Medium | 0.61   | 3364244   | 502 logts 13 n4
157 | 157 | Deena  | Low    | 0.2    | 7091232   | Annaba 69
300 | 300 | Assala | High   | 0.97   | 4112533   |
301 | 301 | Alae   | High   | 0.92   | 5437898   |
302 | 302 | Raid   | High   | 0.78   |           | Abu Dhabi
303 | 303 | Med    | Low    | 0.2    | 3248673   | Algeria 1280/12
304 | 304 | Sarah  | Medium | 0.67   | 231987253 |

The respective schemas of S and M are:

– S(ID, name, level, exam, phone, address) ← Student(ID, name, level, exam, phone, address, city) ∧ city = "Al-Ain"
– M(ID, mark, year) ← Mark(ID, mark, year)
The areas of S and M are expressed in datalog-like notation:

– S1(ID, name, level, exam, phone, address) ← S(ID, name, level, exam, phone, address) ∧ ID < 300
– S2(ID, name, level, exam, phone, address) ← S(ID, name, level, exam, phone, address) ∧ ID ≥ 300
– M1(ID, mark, year) ← M(ID, mark, year) ∧ year < 2014
– M2(ID, mark, year) ← M(ID, mark, year) ∧ year ≥ 2014 ∧ ID < 300
– M3(ID, mark, year) ← M(ID, mark, year) ∧ year ≥ 2014 ∧ ID ≥ 300

Substituting S and M by their expressions yields the area definitions in terms of the global schema:

– S1(ID, name, level, exam, phone, address) ← Student(ID, name, level, exam, phone, address, city) ∧ city = "Al-Ain" ∧ ID < 300
– S2(ID, name, level, exam, phone, address) ← Student(ID, name, level, exam, phone, address, city) ∧ city = "Al-Ain" ∧ ID ≥ 300
– M1(ID, mark, year) ← Mark(ID, mark, year) ∧ year < 2014
– M2(ID, mark, year) ← Mark(ID, mark, year) ∧ year ≥ 2014 ∧ ID < 300
– M3(ID, mark, year) ← Mark(ID, mark, year) ∧ year ≥ 2014 ∧ ID ≥ 300

Suppose the user query Q, in datalog-like notation:

– Q(ID, name, year, mark) ← S(ID, name, level, exam, phone, address, city) ∧ M(ID, mark, year) ∧ year = 2015

The output of query Q (extracted from the S and M relations) is given in Table 4.

Table 4: Query Result.
ID  | Name   | Year | Mark
120 | Rani   | 2015 | 17
141 | Areej  | 2015 | 10
155 | Zineb  | 2015 | 4
157 | Deena  | 2015 | 5
300 | Assala | 2015 | 19
302 | Raid   | 2015 | 18
303 | Med    | 2015 | 9
304 | Sarah  | 2015 | 13

The query rewriting using S1, S2, M1, M2, and M3 creates the following buckets: Buck(S) = {S1, S2} and Buck(M) = {M2, M3}. The area M1 is omitted from Buck(M) because it does not satisfy the predicate of Q ("year = 2015" in Q versus "year < 2014" in M1). The query rewritings are then produced by considering one area from each bucket:

– QR1(ID, year, name, mark) ← S1(ID, name, level, exam, phone, address) ∧ M2(ID, mark, year) ∧ year = 2015
– QR2(ID, year, name, mark) ← S2(ID, name, level, exam, phone, address) ∧ M3(ID, mark, year) ∧ year = 2015

Because the predicates of areas S1 and M3 are contradictory ("ID < 300" versus "ID ≥ 300"), these areas are not joined; the same applies to areas S2 and M2. Consequently, Q ⊇ QR1 ∪ QR2.

Let us focus on the rewriting QR1. A unique area is created for each of QR1 and QR2. The rewriting QR1 combines the areas S1 (subareas S11 and S12) and M2 (subarea M21). Intersecting the attributes of subareas S11 and M21 with the rewriting attributes yields the two subareas QR11 and QR12; the join attribute is isolated in subarea QR13. In this case, no attribute of subarea S12 is projected. The subareas of QR2 are defined similarly.
The final result is:

– Area QR1 {ID < 300 ∧ year = 2015}; ? tuples; key accuracy = ?; inputs: S1, M2
  – Subarea QR11 {name}; accuracy = ?; input: S11
  – Subarea QR12 {year, mark}; accuracy = ?; input: M21
  – Subarea QR13 {ID}; accuracy = ?; inputs: S11 and M21
– Area QR2 {ID ≥ 300 ∧ year = 2015}; ? tuples; key accuracy = ?; inputs: S2, M3
  – Subarea QR21 {name}; accuracy = ?; input: S22
  – Subarea QR22 {year, mark}; accuracy = ?; input: M31
  – Subarea QR23 {ID}; accuracy = ?; inputs: S21, M31

After the estimation of the semantic accuracy of the subareas, we obtain:

– Area QR1 {ID < 300 ∧ year = 2015}; ? tuples; key accuracy = 0.25 (0.50 × 0.50); inputs: S1, M2
  – Subarea QR11 {name}; accuracy = 0.25 (0.50 × 0.50); input: S11
  – Subarea QR12 {year, mark}; accuracy = 0.25 (0.50 × 0.50); input: M21
  – Subarea QR13 {ID}; accuracy = 0.25 (average of 0.50 × 0.50 and 0.50 × 0.50); inputs: S11, M21
– Area QR2 {ID ≥ 300 ∧ year = 2015}; ? tuples; key accuracy = 0.93 (1.00 × 0.93); inputs: S2, M3
  – Subarea QR21 {name}; accuracy = 0.74 (0.80 × 0.93); input: S22
  – Subarea QR22 {year, mark}; accuracy = 0.93 (0.93 × 1.00); input: M31
  – Subarea QR23 {ID}; accuracy = 0.84 (average of 0.80 × 0.93 and 0.93 × 1.00); inputs: S21, M31

Because the subareas QR11, QR12, and QR13 possess the same accuracy, they are merged into subarea QR11:

– Area QR1 {ID < 300 ∧ year = 2015}; ? tuples; key accuracy = 0.25; inputs: S1, M2
  – Subarea QR11 {ID, name, year, mark}; accuracy = 0.25
– Area QR2 {ID ≥ 300 ∧ year = 2015}; ? tuples; key accuracy = 0.93; inputs: S2, M3
  – Subarea QR21 {name}; accuracy = 0.74; input: S22
  – Subarea QR22 {year, mark}; accuracy = 0.93; input: M31
  – Subarea QR23 {ID}; accuracy = 0.84; inputs: S21 and M31

Remember that the rewriting accuracy is the average of the cell accuracies (accuracy aggregation). In this manner, the QR1 accuracy is 0.25 (the value of its unique subarea), and the QR2 accuracy is 0.86 ((0.74 × 1 + 0.93 × 2 + 0.84 × 1)/4). The selectivities of QR1 and QR2 are 1/9 (4/(6 × 6)) and 1/10 (4/(5 × 8)), respectively. Subsequently, the partition metadata are:

– Area QR1 {ID < 300 ∧ year = 2015}; 1/9 × (6 × 6) = 4 tuples; key accuracy = 0.25; inputs: S1, M2
  – Subarea QR11 {ID, name, year, mark}; accuracy = 0.25
– Area QR2 {ID ≥ 300 ∧ year = 2015}; 4 (= 5 × 8 × 1/10) tuples; key accuracy = 0.93; inputs: S2, M3
  – Subarea QR21 {name}; accuracy = 0.74
  – Subarea QR22 {year, mark}; accuracy = 0.93
  – Subarea QR23 {ID}; accuracy = 0.84

The accuracy value of the global user query Q is estimated as the weighted sum of the rewriting accuracies (weight = number of tuples); thus, we obtain (4 × 0.25 + 4 × 0.93)/(4 + 4) = 0.59. Figure 6 shows the quality graph for the user query Q, defined by merging its rewritings QR1 and QR2.

Figure 6: Quality Graph (Student → S1, S2; Mark → M2, M3; {S1, M2} → QR1; {S2, M3} → QR2; QR1, QR2 → Union → Q = Target).

In the following section, we discuss data accuracy improvement by presenting a possible approach that discards the areas or subareas that lower the predicted accuracy.

5 Accuracy improvement

If the user expectations on data accuracy cannot be achieved, some fundamental amendment actions can be taken to enforce the data accuracy. Because the proposed estimation approach arranges the query output in subareas possessing uniform accuracy, there is no guarantee that all cells of the query output satisfy the target expectations; however, the delivered data, in general, satisfy the user target.
We propose a simple improvement action consisting of discarding the pieces of the query output with small accuracy values. We present different ways, and different times, of executing such improvement actions. In particular, we distinguish three accuracy levels:

– Cell level: the accuracy of a set of cells must reach a specific threshold.
– Tuple level: the accuracy of a set of tuples must reach a specific threshold.
– Output level: the accuracy of the whole output must reach a specific threshold.

The cell level is the most constraining. It reflects the fact that the user accepts only tuples containing accurate values. Moreover, the accuracy restrictions may concern specified attributes; for example, the user may require accurate phone numbers and disregard the accuracy of the other attributes. For this type of restriction, we suggest filtering out the subareas with small accuracy values, which is possible because the query output is partitioned into subareas with uniform accuracy. If the restriction concerns only specific attributes, the filtering concerns only the subareas involving those attributes. Such an action can be performed in the initial evaluation phase, precisely during the insertion of areas into the buckets. We call this improvement action selective rewriting.

The tuple level accepts the existence of a group of attributes with low accuracy, on the condition that the aggregate accuracy of the tuple is sufficiently high. Some users may tolerate acquiring tuples containing inaccuracies in certain attributes as long as the remaining cells are correct (e.g., when different attributes hold alternative ways to contact clients, such as phone, address, and email). Therefore, mistakes in some attributes are tolerable if the remaining attributes have high accuracy values. In this case, we suggest area filtering rather than subarea filtering; in other words, the accuracy value is the aggregation of the subarea accuracies, which is then compared with the user expectations. Note that area filtering must be performed after the rewriting computation (contrary to subarea filtering), because the aggregation spans several subareas belonging to different database relations.

Finally, the output level considers neither the tuple nor the attribute accuracy; it only requires that the final output reach a specific accuracy level. Query outputs containing a mixture of tuples with high accuracy and tuples with very low accuracy are tolerable. This reflects users requesting all possible output data without ignoring accuracy altogether (e.g., a user sending a publicity message to clients may accept up to 6% undelivered messages caused by mistakes in the contact attributes). To satisfy this constraint type, we must generate all possible data without considering their accuracy. To this end, we suggest sorting the areas by accuracy, aggregating the accuracy values gradually, and stopping when the correctness constraints become unsatisfied. Additionally, the data delivery can be incremental, allowing users to stop when they have obtained the necessary data or when the data contain too many mistakes. This approach is executed only after aggregating the accuracy of all rewritings.
Therefore, three improvement actions are proposed:

– Selective rewriting is the generation of query rewritings that achieve a specific quality level, particularly "every subarea accuracy should be greater than the threshold." In this case, the rewriting algorithm can be modified to exclude from the buckets the areas related to subareas having low accuracy. Because the subarea accuracy was precomputed during the fragmentation process, this strategy can be implemented easily.
– Rewriting filtering can be implemented directly during the aggregation of the rewriting accuracy.
– Incremental data delivery: after the aggregation of the rewriting accuracies, we suggest sorting the rewritings in descending order of accuracy.

The algorithm is shown in Figure 7. Its inputs are the expected accuracy value and the list of areas in the rewritings; its output is a sorted list of areas representing the data that satisfy the user expectations. Note that the more restrictive the constraints, the smaller the output. It is worth mentioning that accuracy and completeness expectations should be balanced in order to deliver useful and valuable tuples to the user without filtering out too much data.

Figure 7: Incremental Data Delivery (the areas are sorted by accuracy and appended one by one while the cardinality-weighted aggregate accuracy remains above the threshold).

Figure 8 shows an example that illustrates the proposed improvement actions. Assume that QR1, QR2, QR3, and QR4 represent four rewritings. The number inside each subarea indicates the accuracy value of that subarea, and the aggregated accuracy value of each area is written in front of QRi.

Figure 8: Rewritings Filtering.
– QR1: area accuracy 0.5; 20 tuples; subarea accuracies S11 = 0.35, S12 = 0.49, S13 = 0.51, S14 = 0.59.
– QR2: area accuracy 0.6; 10 tuples; subarea accuracies S21 = 0.47, S22 = 0.64.
– QR3: area accuracy 0.9; 20 tuples; subarea accuracies S31 = 0.8, S32 = 0.97, S33 = 0.93.
– QR4: area accuracy 0.8; 10 tuples; subarea accuracies S41 = 0.4, S42 = 0.92, S43 = 0.82, S44 = 0.85, S45 = 0.93, S46 = 0.85.

Assume also that C1, C2, and C3 describe three constraints as follows:

– C1: every subarea accuracy value must be greater than or equal to 0.8.
– C2: every area accuracy value must be greater than or equal to 0.8.
– C3: the query accuracy value must be greater than 0.7.

Only QR3 satisfies constraint C1; thus, the algorithm will generate only QR3. In particular, QR4 is discarded because one of its subareas, S41, does not satisfy the constraint. However, both QR3 and QR4 satisfy constraint C2: although subarea S41 possesses an accuracy value lower than 0.8, the accuracy of the whole area is acceptable. To check constraint C3, we sort the QRi by accuracy, obtaining the order QR3 → QR4 → QR2 → QR1, and compute the aggregate accuracy gradually:

– {QR3}: accuracy is 0.9.
– {QR3, QR4}: accuracy is (0.9 × 20 + 0.8 × 10)/(20 + 10) = 0.86.
– {QR3, QR4, QR2}: accuracy is (0.9 × 20 + 0.8 × 10 + 0.6 × 10)/(20 + 10 + 10) = 0.8.
– {QR3, QR4, QR2, QR1}: accuracy is (0.9 × 20 + 0.8 × 10 + 0.6 × 10 + 0.5 × 20)/(20 + 10 + 10 + 20) = 0.7.

Because constraint C3 is no longer satisfied at the last step, QR1 is discarded.
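Under the assumption that each area carries an accuracy value and a cardinality, the incremental-delivery loop of Figure 7 can be sketched in a few lines of Python (our own rendering, not the paper's implementation). Applied to the Figure 8 rewritings with threshold 0.7, it reproduces the selection above:

```python
# Sketch of the incremental-delivery loop of Figure 7 (structures hypothetical):
# sort the areas by accuracy, then append them one by one while the
# cardinality-weighted aggregate accuracy stays strictly above the threshold,
# as constraint C3 requires.

def incremental_delivery(areas, threshold):
    """areas: list of (name, accuracy, cardinality); returns the kept prefix."""
    kept, weighted_sum, card_sum = [], 0.0, 0
    for name, acc, card in sorted(areas, key=lambda a: a[1], reverse=True):
        if (weighted_sum + acc * card) / (card_sum + card) <= threshold:
            break  # adding this area would violate the constraint; stop here
        kept.append(name)
        weighted_sum += acc * card
        card_sum += card
    return kept

# Figure 8 example: (rewriting, area accuracy, tuple count); threshold = 0.7.
rewritings = [("QR1", 0.5, 20), ("QR2", 0.6, 10),
              ("QR3", 0.9, 20), ("QR4", 0.8, 10)]
print(incremental_delivery(rewritings, 0.7))  # ['QR3', 'QR4', 'QR2']
```

Because the areas are delivered in decreasing order of accuracy, the loop can also stream its results, letting the user stop early once enough data have arrived.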
6 Conclusion

This study investigated data accuracy estimation and proposed some improvement actions. We suggested fragmenting the database relations according to data accuracy by applying Rakov's algorithm [19] and projecting these fragments onto the query outputs to better capture the inaccuracy distribution. We rewrote the user query according to the fragments and then aggregated the accuracy values of the rewritings. We defined the query output as the union of the output tuples of the rewritings. A basic algorithm to estimate the data accuracy was presented. In contrast to existing approaches, our algorithm explicitly identifies the areas with low accuracy, so that data that do not satisfy the user expectations can be discarded.

Furthermore, we suggested three primary improvement actions that filter low-accuracy data to meet different types of user accuracy expectations. This approach is applicable in multiple stages of DIS development (design, production, or maintenance) to inform users about data accuracy, compare databases, check expectation satisfaction, or analyze enforcement actions for improving data accuracy.

Our objective in the near future is to develop a prototype implementing the proposed accuracy evaluation algorithm and to illustrate its usage in real-world applications. For this purpose, the prototype should display and edit the different components of the framework (such as databases, quality graphs, characteristics, and accuracy values) in addition to executing the quality evaluation algorithms. Furthermore, the prototype should evaluate the data accuracy in different applications for validation purposes by running several tests in order to assess the approach's performance and limitations. Finally, we intend to improve some features of the quality evaluation tool, perform additional performance tests, and analyze further quality factors and their inter-relationships.

References

[1] Abdalla, H., and Artoli, A. M. Towards an efficient data fragmentation, allocation, and clustering approach in a distributed environment. Information, 10(3), 112, 2019. DOI: 10.3390/info10030112
[2] Agrawal, S., Narasayya, V., and Yang, B. Integrating vertical and horizontal partitioning into automated physical database design. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 359–370, 2004. DOI: 10.1145/1007568.1007609
[3] Alsaif, S. A., and Hidri, A. Impact of data balancing during training for best predictions. Informatica, 45(2), 2021. DOI: 10.31449/inf.v45i2.3479
[4] Azeroual, O., Ershadi, M. J., Azizi, A., Banihashemi, M., and Abadi, R. E. Data quality strategy selection in CRIS: Using a hybrid method of SWOT and BWM. Informatica, 45(1), 2021. DOI: 10.31449/inf.v45i1.2995
[5] Bai, Q., Hong, J., and McTear, M. F. Some modifications of bucket-based algorithms for query rewriting using views. In International Conference on Advances in Information Systems, Springer, Berlin, Heidelberg, 57–67, 2004. URL: https://rb.gy/leqpjc
[6] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification and Regression Trees. Wadsworth, 368, 1984. DOI: 10.1201/9781315139470-8
[7] Calvanese, D., Lembo, D., and Lenzerini, M. Survey on methods for query rewriting and query answering using views. Integrazione, Warehousing e Mining di sorgenti eterogenee, 25, 2001. URL: https://rb.gy/05cken
[8] Ezéchiel, K. K., Kant, S., and Agarwal, R. A systematic review on distributed databases systems and their techniques.
Journal of Theoretical and Applied Information Technology, 96(1), 236–266, 2019. URL: https://rb.gy/gs00qk
[9] Fuaad, H. A., Ibrahim, A. A., Majed, A., and Asem, A. A survey on distributed database fragmentation allocation and replication algorithms. Current Journal of Applied Science and Technology, 27(2), 1–12, 2018. DOI: 10.9734/CJAST/2018/37079
[10] Gao, J., Li, X., Xu, Y. E., Sisman, B., Dong, X. L., and Yang, J. Efficient knowledge graph accuracy evaluation. arXiv preprint arXiv:1907.09657, 2019. DOI: 10.48550/arXiv.1907.09657
[11] Halevy, A. Y. Answering queries using views: A survey. The VLDB Journal, 10(4), 270–294, 2002. DOI: 10.1007/s007780100054
[12] Hill, G. A framework for valuing the quality of customer information. Doctoral dissertation, Melbourne University, 1–194, 2009. URL: http://hdl.handle.net/11343/35567
[13] Levy, A., Rajaraman, A., and Ordille, J. Querying heterogeneous information sources using source descriptions. In Proceedings of the 22nd International Conference on Very Large Data Bases, 1, 1–26, 1996.
[14] Ma, H., Noack, R., Schewe, K. D., and Thalheim, B. Using meta-structures in database design. Informatica, 34(3), 387–403, 2010. URL: https://rb.gy/cbonrz
[15] Motro, A., and Rakov, I. Estimating the quality of databases. In International Conference on Flexible Query Answering Systems, Springer, Berlin, Heidelberg, 298–307, 1998. URL: https://rb.gy/iriobk
[16] Naumann, F., Leser, U., and Freytag, J. C. Quality-driven integration of heterogeneous information systems. In Proceedings of the International Conference on Very Large Data Bases, 447–458, 2005. URL: https://rb.gy/bgzhyn
[17] Özsu, M. T., and Valduriez, P. Principles of Distributed Database Systems. Springer, New York, 657–722, 2011. DOI: 10.1007/978-1-4419-8834-8
[18] Peng, P., Zou, L., Chen, L., and Zhao, D. Adaptive distributed RDF graph fragmentation and allocation based on query workload. IEEE Transactions on Knowledge and Data Engineering, 31(4), 670–685, 2018. DOI: 10.1109/TKDE.2018.2841389
[19] Rakov, I. Data quality and its use for reconciling inconsistencies in multidatabase environment. PhD dissertation, George Mason University, 1998.
[20] Raman, V., Swart, G., Qiao, L., Reiss, F., Dialani, V., Kossmann, D., Narang, I., and Sidle, R. Constant-time query processing. In 2008 IEEE 24th International Conference on Data Engineering, 60–69, 2008. DOI: 10.1109/ICDE.2008.4497414
[21] Riezler, S., and Liu, Y. Query rewriting using monolingual statistical machine translation. Computational Linguistics, 36(3), 569–582, 2010. DOI: 10.1162/coli_a_00010
[22] Silberschatz, A., Korth, H. F., and Sudarshan, S. Database System Concepts. McGraw-Hill, New York, 5, 2002. URL: https://rb.gy/ttrvjt
[23] Stvilia, B. A workbench for information quality evaluation. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries, 469, 2008. DOI: 10.1145/1378889.1379014
[24] Tarun, S., Batth, R. S., and Kaur, S. A review on fragmentation, allocation and replication in distributed database systems. In International Conference on Computational Intelligence and Knowledge Economy (ICCIKE), 538–544, 2019. DOI: 10.1109/ICCIKE47802.2019.9004233
[25] Várkonyi, G. G., and Gradišek, A. Data protection impact assessment case study for a research project using artificial intelligence on patient data. Informatica, 44(4), 2020. DOI: 10.31449/inf.v44i4.3253
[26] Wang, R.
Y., and Strong, D. M. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, 12(4), 5–33, 1996. DOI: 10.1080/07421222.1996.11518099
[27] Zozus, M. N., Pieper, C., Johnson, C. M., Johnson, T. R., Franklin, A., Smith, J., and Zhang, J. Factors affecting accuracy of data abstracted from medical records. PLoS ONE, 10(10), e0138649, 2015. DOI: 10.1371/journal.pone.0138649