https://doi.org/10.31449/inf.v42i3.1559

Using Semantic Perimeters with Ontologies to Evaluate the Semantic Similarity of Scientific Papers

Samia Iltache
Department of Computer Science, Mouloud Mammeri University, Tizi Ouzou, Algeria
E-mail: siltache@gmail.com

Catherine Comparot
IRIT, Université de Toulouse, CNRS, INPT, UPS, UT1, UT2J, France
E-mail: Catherine.Comparot@irit.fr

Malik Si Mohammed
Department of Computer Science, Mouloud Mammeri University, Tizi Ouzou, Algeria
E-mail: m_si_mohammed@esi.dz

Pierre-Jean Charrel
IRIT, Université de Toulouse, CNRS, INPT, UPS, UT1, UT2J, France
E-mail: Charrel@univ-tlse2.fr

Keywords: domain ontologies, semantic annotation, classification, conceptual graph, semantic perimeter, text similarity

Received: March 17, 2017

The work presented in this paper deals with the use of ontologies to compare scientific texts. It focuses on scientific papers, specifically their abstracts: short texts that are relatively well structured and normally provide enough knowledge for a community of readers to assess the content of the associated papers. The problem is therefore to determine how to assess the semantic proximity/similarity of two papers by examining their respective abstracts. Since a domain ontology provides a useful way to represent knowledge about a given domain, this work considers ontologies relative to scientific domains. Our process begins by determining the relevant domain for an abstract through an automatic classification that associates the abstract with its relevant scientific domain, chosen from several candidate domains. The content of an abstract is represented in the form of a conceptual graph, which is enriched to construct its semantic perimeter. As presented below, this notion of semantic perimeter allows us to assess the similarity between texts by matching their graphs. Among the many possible application fields of our approach, detecting plagiarism is the main one addressed in this paper.

Povzetek (translated from Slovenian): This paper discusses the use of ontologies to compare scientific texts. The main application is plagiarism detection.

1 Introduction

Assessing query-text or text-text similarity is a concern of several research domains, such as information retrieval and automatic document classification. In many works, a document is represented by a vector of words. The very large size of these vectors reduces the effectiveness of such approaches and often requires reducing the number of dimensions used to represent the document vectors. Some approaches rely on a learning corpus to compute the similarity between texts, as is done in the field of document classification. However, a large text corpus may not always be available, and the result of the classification depends on, and varies with, the chosen learning corpus. In these approaches, similarity is based on the morphological comparison of the terms composing the query and the documents. The polysemy and synonymy inherent in certain terms of the language, as well as the links between terms, are ignored, which generates erroneous matching. In this paper, an approach to assess the similarity between texts is presented, focusing on the similarity of scientific abstracts. This approach is based on a semantic classification of documents using domain ontologies, which provides a more stable basis than a learning corpus.
A document is no longer represented by a set of mutually independent features, but by a conceptual graph extracted from the ontology to which the document is attached. The similarity between two documents is evaluated by comparing their respective graphs. One of our propositions is to refine this process of semantic comparison through a generic structuring of the abstract of a scientific paper into distinct parts whose descriptive roles are different. The global similarity of two abstracts will indeed differ according to whether one compares, for example, the contribution or the context of the paper, both evoked in the abstract. The proposed process constitutes a solution to many problems requiring semantic comparison, as is the case, for example, in semantic information retrieval. Finally, the relevance of our approach is examined by using it to highlight risks of plagiarism (expressing identical ideas using different terms), or even self-plagiarism (identical results published more than once by their authors, voluntarily using different terms).

In addition to an original process that compares the abstracts of scientific papers based on domain ontologies and combines a classification step with a semantic comparison of conceptual graphs, one of our main contributions is the introduction of the concept of semantic perimeter, obtained by an ontology-based enrichment process. The semantic perimeter plays an important role in semantic comparison, as shown by our results. Our approach also introduces a structuring of scientific abstracts into three distinctive parts, generally respected by authors, namely context, contribution and application domain. Altogether, this constitutes a complete process for semantic text comparison, starting from domain ontologies and leading to text similarity.

Section 2 of this paper covers work related to our problem. Section 3 describes the different steps of our text classification and comparison process and explains how to apply this process to scientific abstracts. Finally, Section 4 presents the experimental results of our process, followed by a conclusion on the interest of such an approach and on its applicability to several domains, such as building a document collection on a given knowledge domain by gathering relevant papers, which is more powerful than a mere keyword-based approach, or detecting plagiarism, which is our main purpose here.

2 Related work

2.1 Word similarity

Similarity measures are necessary for various applications in natural language processing, such as word sense disambiguation [1] and automatic thesaurus extraction [2]. They are also used in Web-related tasks such as automatic annotation of Web pages [3]. Two classes of approaches dealing with word similarity can be distinguished. Distributional approaches [4] consider a word through its contexts of appearance. Words are represented by a vector of the words that co-occur with them. Latent Semantic Indexing [5] is a vectorial approach that exploits co-occurrences between words. It reduces the word space by grouping co-occurring words in the same dimensions using Singular Value Decomposition. The textual content of Wikipedia [6][7] and neural networks [8][9] are also used in distributional word similarity to define the context of a word. In the second category, the similarity of two words is based on the similarity of their closest senses. For this purpose, a lexical resource such as WordNet or MeSH is used. The nodes in these resources represent word senses, and measures have been defined to calculate the degree of proximity (distance) between two nodes, as in the sketch below.
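As an illustration of sense-based word similarity, the following minimal sketch uses NLTK's WordNet interface to compare two words through their closest senses. It shows the general idea only, not the exact measures of the works cited below; the choice of NLTK and of `path_similarity` is an assumption of this example.

```python
# A minimal sketch of sense-based word similarity using NLTK's WordNet
# interface (requires: pip install nltk, then nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def word_similarity(w1: str, w2: str) -> float:
    """Similarity of two words = similarity of their closest senses."""
    best = 0.0
    for s1 in wn.synsets(w1, pos=wn.NOUN):
        for s2 in wn.synsets(w2, pos=wn.NOUN):
            # path_similarity is an edge-counting measure over the is-a
            # hierarchy, in [0, 1]; 1 means identical senses.
            sim = s1.path_similarity(s2) or 0.0
            best = max(best, sim)
    return best

print(word_similarity("shoulder", "hand"))    # related anatomical nouns
print(word_similarity("shoulder", "treaty"))  # unrelated nouns, lower score
```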
Several approaches can be identified for calculating such distances. Some are based only on the hierarchical structure of the resource [10][11][12][13]: the measure proposed in [11] is based on edge counting, and the measure proposed in [12] is based on the notion of least common super-concept, that is, the common parent of two nodes that is furthest from the root. In [13], the proposed measure takes into account the minimum distance between two nodes and their most specific common parent (cp), as well as the distance between cp and the root. Other approaches include information beyond the hierarchical structure, such as statistics on nodes or the information content of nodes. To represent information content, probabilities based on word occurrences in a given corpus are associated with each concept in the taxonomy [14][15]. Resources such as Wikipedia [16][17] and Wiktionary [18] are also used in measuring word similarity.

2.2 Text similarity

The purpose of calculating text similarity is to identify documents with similar or different content. The approaches dealing with textual similarity can be classified into three categories: approaches based on a vector representation of document content, approaches applying text alignment, and approaches based on a graph representation of documents and queries. Some approaches in each category are cited below.

2.2.1 Vector similarity

A text (document or query) is projected into a vector space where each dimension corresponds to an indexing term. Each element of a vector is a weight associated with an indexing term. This weight represents the importance of the term and is calculated on the basis of TF-IDF [19] or one of its variants. Vector similarity is computed using several metrics, such as the cosine measure, which measures the cosine of the angle formed by the vectors corresponding to the texts. Two texts are similar if their vectors are close in the vector space in which they are represented.

- Document retrieval. The vector model was proposed by Salton in the SMART system [20]. To retrieve the documents that best meet a user need, a document and a query are both represented by a vector. The relevance of a document to a query is measured by a similarity based on the distance between their respective vectors. Adaptations of the basic model have been proposed for processing structured documents [21][22]. The Extended Vector Space Model is one of the first adaptations of the vector model, proposed by Fox [22]. A document is represented by an extended vector containing different information classes, referred to as objective identifiers (denoted c-types), such as author, title and bibliographic references. The similarity between a document d and a query q is computed by a measure that is a linear combination of the different sub-vector similarities. Conventional information retrieval considers documents only through their textual content. The evolution of document content towards a structured representation, and more precisely towards the XML format, raises new issues. In [23], the author presents an approach for searching XML documents through XML fragments.
A fragment is a text delimited by a structure. Queries are transformed into XML fragments and, for each document, a profile is created. This profile is represented by a vector composed of pairs (t, c), where c is the context in which the term t appears. The context corresponds to the element together with its path. An entry in the index is no longer a term but a pair (t, c). Another adaptation of the vector model, described in [24] and based on the cosine computation, makes it possible to compute the similarity between a node n, belonging to a tree representing a document, and a query q. In [25], the corpus is represented by a labeled tree where each sub-tree is considered a logical document. The authors introduce the notion of structural term (s-term), which is a labeled tree. An s-term may be an element, an attribute, or a term. The similarity between a query and a document is computed by the scalar product of their vectors. The weights of the terms are computed during the retrieval phase, since the notion of logical tree is defined according to the structure of the query.

- Document classification. Automatic text classification makes it possible to group documents dealing with similar themes into the same class. Supervised classification approaches assign documents to predefined classes [26][27][28], while unsupervised approaches automatically define classes, referred to as clusters [29]. In supervised classification, classifiers use two document collections: a collection of training documents used to determine the characteristics of each category, and a collection of new documents to be classified automatically. The classification of a new document depends on the features selected for each category. There are various supervised machine learning classification techniques; in [30], the author provides a comparison of their features. The method based on the K Nearest Neighbors (KNN) [28][31] assumes that if the vector representations of two documents are close in vector space, the documents have a strong probability of belonging to the same category. A new document d is compared with the documents of the training set. The category assigned to d depends on the categories of its K nearest neighboring documents: either the class most frequently assigned to the K nearest neighbors is chosen, or a weight is assigned to the different classes of the K nearest neighbors according to their ranking, and the class with the highest weight is retained. With Support Vector Machines (SVM), documents are represented in a vector space by the indexing terms that compose them. Using a training phase, this method defines a separating surface, called a hyperplane, between the documents belonging to two classes; the hyperplane maximizes the distance to the nearest documents and minimizes categorization errors [32]. A category c is assigned to a new document d as a function of the position of d relative to the separating surface. Some classifiers create a "prototype" class from the training collection [26]. This class is represented by the mean vector of all the document vectors in the collection. Only some features are retained, which constitutes a loss of information. Some approaches replace the training collection with data extracted from "world knowledge" such as the Open Directory Project (ODP) [33]. A minimal sketch of this vector-space machinery is given below.
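The following self-contained sketch illustrates the vector-space machinery described above: TF-IDF weighting, cosine similarity, and KNN majority voting. The toy corpus and labels are illustrative assumptions, not data from the paper.

```python
# A minimal sketch of TF-IDF weighting, cosine similarity and KNN voting.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} vector per tokenized document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(new_vec, train_vecs, labels, k=3):
    """Majority vote among the k training documents closest to new_vec."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(new_vec, train_vecs[i]),
                    reverse=True)[:k]
    return Counter(labels[i] for i in ranked).most_common(1)[0][0]

train = [["ontology", "enrichment", "semantic", "web"],
         ["sequential", "pattern", "mining", "data"],
         ["ontology", "semantic", "annotation", "web"]]
labels = ["semantic_web", "data_mining", "semantic_web"]
vecs = tfidf_vectors(train + [["semantic", "web", "ontology"]])
print(knn_classify(vecs[-1], vecs[:3], labels, k=3))  # -> "semantic_web"
```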
Other approaches exploit thesauri or domain ontologies with conventional classifiers (SVM, Naive Bayes, K-means, etc.) and represent a document by a vector whose features are concepts, or a set of terms and concepts [29][34][35]. As reported in [36], approaches using the vector representation of documents have several limitations. Their performance decreases as soon as they are applied to relatively long texts. With the weighting formulas used, words appearing only once in the document or, on the contrary, words that are often repeated are ignored, although they are meaningful with respect to the content of the document. The vector representation as defined does not highlight the relationships between the words of a document, thus generating erroneous matching. Moreover, a document is represented by a vector whose size is equal to the number of features retained to represent the various categories, in the case of classification, or to the number of terms used to represent the corpus, in the case of information retrieval. In [37], the authors studied the impact of the number of dimensions on the "nearest neighbor" problem. Their analysis revealed that when this number increases, the distance to the nearest data point approaches the distance to the farthest data point.

2.2.2 Sentence alignment

Approaches dealing with sentence alignment fall into three categories: syntactic approaches based on morphological word comparison, semantic approaches using sentence structure, and approaches that combine syntax and semantics. Gunasinghe [38] proposes a hybrid algorithm that combines syntactic and semantic similarity and uses a vector representation of sentences based on WordNet. This algorithm takes into account two types of relationships in sentence pairs: relationships between verbs and relationships between nouns. Liu [39] proposes an approach to evaluate the semantic similarity between two sentences, using a regression model (Support Vector Regression) combined with features defined using WordNet, corpora, alignment and other features covering various aspects of sentences. Other approaches perform text alignment by comparing all the words while preserving their order in sentences. However, these algorithms are rather slow, and they do not dissociate the terms describing the theme of the document from those used to build sentences. In [40], the authors use a text alignment algorithm [41] to align a text with the set of documents in a corpus. This algorithm uses a matrix in which the deletion or insertion of a word is scored -1, a mismatch 0, and a match is scored with the IDF weight of the word. The authors use a full-text alignment in which the highest score in any cell of the alignment matrix represents the similarity score of the two texts; a minimal sketch of this kind of alignment scoring is given at the end of this subsection. In [42], the authors introduce a new type of sentence similarity, called Structural Similarity, for informal, social-network-styled sentences. Their approach eliminates syntactic and grammatical features and performs a disambiguation process without syntactic parsing or POS tagging. They focus on sentence structures to discover purpose- or emotion-level similarities between sentences.
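The following sketch illustrates the alignment scoring described for [40][41]: a dynamic-programming matrix where an insertion or deletion scores -1, a mismatch 0, and a match scores the word's IDF weight. The IDF values below are illustrative assumptions, and the local-alignment formulation (floored at zero) is one plausible reading of that scheme, not the exact algorithm of those papers.

```python
# A minimal sketch of IDF-weighted text alignment scoring.

def alignment_score(words_a, words_b, idf):
    """Highest cell value in the alignment matrix of two word lists."""
    rows, cols = len(words_a) + 1, len(words_b) + 1
    m = [[0.0] * cols for _ in range(rows)]
    best = 0.0
    for i in range(1, rows):
        for j in range(1, cols):
            match = idf.get(words_a[i - 1], 0.0) \
                if words_a[i - 1] == words_b[j - 1] else 0.0
            m[i][j] = max(0.0,
                          m[i - 1][j - 1] + match,  # match (IDF) / mismatch (0)
                          m[i - 1][j] - 1.0,        # deletion
                          m[i][j - 1] - 1.0)        # insertion
            best = max(best, m[i][j])
    return best

idf = {"ontology": 2.3, "enrichment": 2.0, "the": 0.1}
a = "we enrich the ontology".split()
b = "the ontology enrichment process".split()
print(alignment_score(a, b, idf))  # rewards the aligned "the ontology" span
```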
2.2.3 Graph similarity

Graph similarity assessment is used, in particular, in the field of information retrieval, where the document and the query are both represented by a conceptual graph constructed from a domain ontology or a thesaurus. In the domain of semantic information retrieval, Dudognon [43] represents documents by a set of "annotations". Each annotation consists of several conceptual graphs. The similarity between two graphs is defined as the weighted average of the similarities between the concepts that compose these graphs, and the similarity between two "annotations" is computed as the mean of the similarities of their conceptual graphs. Baziz [44] suggests constructing a graph for each document and for each query using concepts extracted from WordNet. Mapping the graph of a document to that of the query leads the author to represent the two graphs with respect to the same reference graph, made up of nodes belonging to the document and to the query. Each graph is then expanded by adding nodes of the reference graph. The weights of the nodes added to the query are zero, whereas in the sub-tree of the document where a node is added, the weight of a level-s node is updated recursively by multiplying the weight of the level-(s+1) node (the level-s node subsumes the level-(s+1) node) by a factor that depends on the hierarchy level. The two representations are then compared using fuzzy operators, and a relevance value is computed. This value expresses the extent to which the document covers the subject expressed in the query. Shenoy [45] represents a document by a "sub-ontology" constructed using the demo version of the ONTO GEN Ontology Learner, which is part of the TAO Project. Two documents are compared by aligning their "sub-ontologies" on the basis of the number of concepts, properties and relationships contained in each document. In [46], the authors propose a unified framework for graph-based text similarity measurement using Wikipedia as background knowledge. They call each Wikipedia article a Wikipedia concept. For each document, the authors extract representative keywords or phrases and map them into Wikipedia concepts. These concepts constitute the nodes at the bottom of a bipartite graph. There is an edge between a document node and a concept node if the concept appears in that document. The weight of the edge is determined by the frequency of the concept's occurrences in the document. The similarity of two documents is determined by the similarity of the concepts they contain. The authors of [18] present a unified graph-based approach for measuring semantic similarity between linguistic items at multiple levels: senses, words, and sentences. They construct different semantic networks, one of which is based on WordNet. The nodes in the WordNet semantic network represent individual concepts, while edges denote manually crafted concept-to-concept relations. This graph is enriched by connecting a sense with all the other senses that appear in its disambiguated gloss. Measuring the semantic similarity of a pair of linguistic items consists of an alignment-based disambiguation and a random walk on a semantic network. In [47], the authors propose a graph-based text representation capable of capturing term order, term frequency, term co-occurrence and term context in documents. A document is represented by a graph: a node represents a concept (a single word or a phrase), and an edge is constructed based on proximity and co-occurrence relationships between concepts. In addition, the associations among concepts are represented through their contexts: the nodes within a window (e.g., paragraph, sentence) are linked by weighted bidirectional edges. The approach described in [48] presents a graph-based method to select related keywords for short text enrichment.
This method exploits topics as background knowledge. The authors extract topics and re-rank the keyword distribution under each topic according to an improved TF-IDF-like score. A topic-keyword graph is then constructed in preparation for link analysis. In [49], the authors create a semantic representation of a collection of text documents and propose an algorithm to connect them into a graph. Each node in the graph corresponds to a document and contains a subset of the document's words. The authors define feature and document similarity measures based on the distance between the features in the graph.

2.3 Detecting plagiarism

Plagiarism consists in copying the work of an author and presenting it as one's own original work. Plagiarism detection systems usually take the original document and the suspicious document as inputs. They focus on the following points: an exact copy of the text (copy/paste), insertion or deletion of words, substitution of words (use of synonyms), and reformulation and modification of sentence structure. In the n-gram approach, a text is characterized by sequences of n consecutive characters [50][51][52]. Based on statistical measures, each document can be described with so-called fingerprints, where n-grams are hashed and then selected to be fingerprints [53]. An overlap of two fingerprints extracted from the suspicious and source documents indicates a possibly plagiarized text passage; a minimal sketch of this fingerprinting idea is given below. Statistical methods [54] do not require an understanding of the meaning of the documents. The common approach is to construct the document vector from values describing the document, such as term frequencies. Comparing the source document with the suspicious document then amounts to calculating their degree of similarity on the basis of different measures (BM25, language model, etc.). Vani [55] segments the source document and the suspicious document into sentences. Each sentence is then represented by a vector of the weighted terms that compose it. Each sentence of the source document is compared with all the sentences of the suspicious document, and the similarity between two vectors is computed using several metrics individually (cosine, Dice, etc.). Vani studies the effect of combining these various metrics on plagiarism detection, and also explores the impact of POS tagging on the calculation of sentence similarity. The sentences labeled by a syntactic parser are compared by matching the terms belonging to the same class (nouns with nouns, verbs with verbs, adjectives with adjectives, and adverbs with adverbs). Other approaches, based on sentence alignment, compute the overlap percentage of words or sentences between the source document and the suspicious document. These methods do not permit the detection of cases of plagiarism where synonymy is used to replace words in the reformulation of sentences. The representation of a document by a graph is also used in plagiarism detection. In [45], the alignment of "sub-ontologies" is based on the number of concepts, properties and relations corresponding to the original document and the suspicious document. The alignment is expressed as a fraction of the whole. If this fraction is above a given threshold, the system concludes that the two documents are similar in meaning.
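The following sketch illustrates the character n-gram fingerprinting idea described above ([50]-[53]): hash the n-grams of each document, keep a subset of the hashes as the fingerprint, and use fingerprint overlap as a plagiarism signal. The selection rule used here (keep hashes equal to 0 mod p) is one simple illustrative choice, not the exact selection method of those papers.

```python
# A minimal sketch of n-gram fingerprinting for plagiarism detection.
import hashlib

def fingerprint(text: str, n: int = 8, p: int = 4) -> set:
    grams = (text[i:i + n] for i in range(len(text) - n + 1))
    hashes = (int(hashlib.md5(g.encode()).hexdigest(), 16) for g in grams)
    return {h for h in hashes if h % p == 0}  # keep ~1/p of all n-gram hashes

def overlap(doc_a: str, doc_b: str) -> float:
    fa, fb = fingerprint(doc_a), fingerprint(doc_b)
    return len(fa & fb) / max(1, min(len(fa), len(fb)))

source = "ontology enrichment based on sequential pattern mining"
suspect = "an approach to ontology enrichment based on sequential patterns"
print(overlap(source, suspect))  # high overlap suggests copied passages
```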
Osman [56] describes a plagiarism detection approach that represents the documents (original and suspicious) by a graph deduced from WordNet. This approach is useful in detecting forms of plagiarism where synonymy is used to reformulate sentences. The document is divided into sentences, and each node of the graph constructed for the document represents the terms of a sentence. The terms of the sentences are projected onto WordNet to extract the concepts corresponding to them. Each relationship between two nodes is represented by the overlap between the concepts of the two nodes. These concepts help in detecting suspicious parts of a document.

An important characteristic of our approach lies in the fact that it is not necessary to have a reference document a priori, since any document can be compared with a corpus dealing with the same knowledge domain, as identified in the first step of the process proposed here.

3 Proposed approach

The representation of a document by a semantic graph is used in different domains, such as information retrieval [43][44], plagiarism detection [45][56] and document summarization [57]. However, these graphs differ in the way they are constructed. The purpose of our approach is to assess the semantic similarity between textual documents. Unlike conventional approaches, a document is not represented by a vector. Our approach builds a conceptual representation of a text in the form of a semantic graph in which the nodes and arcs correspond, respectively, to concepts and to relationships between concepts extracted from the chosen domain ontology. The similarity between two texts is evaluated in two steps. The first step performs a semantic classification of documents based on domain ontologies. The classification makes it possible to deduce an overall similarity defined by the context in which the content of the document is used. The second step compares and evaluates the similarity of two texts related to the same domain ontology by comparing their constructed and enriched graphs, as explained in the following sections.

3.1 Classification of documents

The process is based on a semantic classification of texts using domain ontologies [58]. Figure 1 summarizes the classification process. The classification groups documents according to the knowledge domain covered by their content. This grouping identifies an overall similarity and involves several steps.

- Projection, extraction of terms and candidate concepts. The "projection" of a document onto different ontologies helps to associate meaning with the terms of the document with respect to concepts belonging to these ontologies, and to select the candidate concepts. The notion of concept gives a meaning to a term relative to the domain in which this concept is defined. The whole document is divided into sentences. Each sentence is read from left to right, starting from the first word. The words of each sentence are projected, before pruning stop words, onto different domain ontologies to extract the longest phrases (groups of adjacent words in a sentence, called "terms") that denote concepts. This choice is motivated by two observations: 1) concepts are often represented by labels consisting of several words (an example of mono- and multi-word concepts is given in Table 1); 2) long terms are less ambiguous and better capture the meaning conveyed by the sentence. Several concepts belonging to the same domain ontology may be candidates for a given term. A minimal sketch of this greedy longest-match projection is given below.
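The following sketch illustrates the projection step described above: scanning a sentence left to right and greedily extracting the longest multi-word term that matches a concept label. The label set below is an illustrative assumption, not an ontology from the paper.

```python
# A minimal sketch of greedy longest-match term extraction.

def extract_terms(sentence: str, concept_labels: set, max_len: int = 8):
    """Extract the longest word groups that denote concepts."""
    words = sentence.lower().split()
    terms, i = [], 0
    while i < len(words):
        match = None
        # Try the longest window first, then shrink it.
        for j in range(min(len(words), i + max_len), i, -1):
            candidate = "_".join(words[i:j])
            if candidate in concept_labels:
                match = (candidate, j)
                break
        if match:
            terms.append(match[0])
            i = match[1]          # skip past the matched multi-word term
        else:
            i += 1                # no concept starts here, advance one word
    return terms

labels = {"secretary_of_state_for_the_home_department",
          "secretary_of_state", "secretary", "state", "torture"}
s = "The Secretary of State for the Home Department condemned torture"
print(extract_terms(s, labels))
# -> ['secretary_of_state_for_the_home_department', 'torture']
```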
The following example shows to what extent it is important to bring out the longest terms and the longest concept. For the sentence "The Secretary of State for the Home Department had clearly indicated that evidence obtained by torture was inadmissible in any legal proceedings," the synsets in Table 1 are extracted from WordNet. As Table 1 shows, several synsets in WordNet, composed of one or more words, correspond to the words "Secretary of State for the Home Department" in the sentence.

Figure 1: Classification of a document.

Words in the sentence: Secretary of State for the Home Department

Synset label in WordNet                       Synset numbers in WordNet
secretary_of_state_for_the_home_department    09526473
secretary_of_state                            09883412, 09455599, 00569400
secretary                                     09880743, 09880504, 09836400, 04007053
state                                         07682724, 08125703, 07673557, 00024568, 07646257, 08023668, 13192180, 13656873
home                                          08037383, 03141215, 07973910, 13687178, 03398332, 07974113, 07587703, 03399133, 08060597
department                                    07623945, 08027411, 05514261

Table 1: Extraction of terms and synsets.

The longest term, "secretary of state for the home department", is extracted from the sentence. It corresponds to the synset secretary_of_state_for_the_home_department (09526473), which represents the correct sense in the sentence.

- Local disambiguation. In the projection step, for each ontology, all the candidate concepts for a given term are extracted. The local disambiguation process then selects, for a term t, the most appropriate concept among the several candidates belonging to the same ontology. To do this, the context in which the term t occurs in the document is taken into consideration. The appropriate concept for t is chosen by taking into account both the proximity of the term t to its neighboring terms (i.e., the terms occurring in its context) and the semantic distance, in the ontology considered, between the concepts associated with t and the concepts corresponding to the neighboring terms. The meaning of a term t in a document is determined by its nearest unambiguous neighboring terms: t is disambiguated by its nearest neighbor on the left or by its nearest neighbor on the right. When the left and right neighbors both exist, they are both taken into consideration. The disambiguation process operates at three levels, starting at the sentence level. For each sentence, the ambiguous terms are disambiguated considering their left and right neighbors in the sentence. Any disambiguated term helps to move the disambiguation of the following terms forward. This process is repeated if ambiguous terms remain, considering the paragraph level in a second step and, finally, if necessary, the document level. The local disambiguation process at the sentence level, summarized by the algorithm in Figure 2, considers the unambiguous neighboring terms surrounding t that have associated concepts in the ontology considered: it retrieves the concepts Cnl and Cnr corresponding, respectively, to nl, the nearest neighbor on the left of t, and nr, the nearest neighbor on the right of t. The appropriate concept for the term t among the candidate concepts is the concept semantically nearest to Cnl or Cnr. This amounts to browsing the ontology and calculating the minimum distance between each concept associated with t and the concepts Cnl and Cnr. Several existing metrics in the literature can be used to calculate this minimum distance; a minimal sketch follows.
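The following sketch illustrates the sentence-level local disambiguation step summarized in Figure 2: the concept retained for an ambiguous term is the candidate closest, in the ontology, to the concepts of its nearest unambiguous neighbors. The distance function is a placeholder for any taxonomy metric (e.g., the Rita measure used later in the paper); the toy is-a hierarchy and the two invented senses of "shoulder" are illustrative assumptions.

```python
# A minimal sketch of local disambiguation by nearest unambiguous neighbors.
parents = {"shoulder_body": "torso", "shoulder_road": "road",
           "hand": "arm", "arm": "torso", "torso": "body",
           "spinal_column": "torso", "road": "artifact",
           "body": "entity", "artifact": "entity"}

def path_to_root(c):
    path = [c]
    while c in parents:
        c = parents[c]
        path.append(c)
    return path

def dist(c1, c2):
    """Edge-counting distance through the lowest common ancestor."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    common = next(c for c in p1 if c in p2)  # lowest shared ancestor
    return p1.index(common) + p2.index(common)

def disambiguate(candidates, neighbor_concepts):
    """Pick the candidate concept nearest to any neighbor concept."""
    return min(candidates,
               key=lambda c: min(dist(c, n) for n in neighbor_concepts))

# "shoulder" is ambiguous; its unambiguous neighbor is "spinal_column".
print(disambiguate({"shoulder_body", "shoulder_road"}, {"spinal_column"}))
# -> "shoulder_body"
```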
An example of local disambiguation in the anatomy domain of WordNet is given in Figure 3.

Figure 2: Local disambiguation at the sentence level.

Figure 3: Disambiguation of shoulder and hand.

Table 2 shows the terms and their senses (synsets) in the anatomy domain of WordNet. The calculated distances help in choosing the most appropriate synset for each ambiguous term. The term shoulder in the sentence is ambiguous. To disambiguate it, spinal column, its nearest unambiguous neighboring term on the left, is considered. The synset retained is 05231159. The term hand in the sentence is also ambiguous. Its disambiguation uses shoulder and skeleton, its two nearest unambiguous neighboring terms on the left and on the right. The synset retained is 05246212.

Terms extracted        Synset label (anatomy)   Synset numbers        Distances between synsets
bone                   bone                     04966339              -
spinal column          spinal_column            05268544              -
shoulder (ambiguous)   shoulder                 05231159, 05231380    Dist(05268544, 05231159) = 0.42857143; Dist(05268544, 05231380) = 0.5
hand (ambiguous)       hand                     05246212, 02352577    Dist(05246212, 05231159) = 0.42857143; Dist(02352577, 05231159) = 0.6363636; Dist(05246212, 05265883) = 0.42857143; Dist(02352577, 05265883) = 0.6363636
skeleton               skeleton                 05265883              -

Table 2: Disambiguation of ambiguous terms.

At the end of the preceding steps, a document d is represented by several sets of concepts extracted from the domain ontologies $\theta_i$ onto which it was projected. These sets are given by (1):

$d_{\theta_1} = \{c_{11}, c_{21}, \ldots, c_{n1}\},\ \ldots,\ d_{\theta_i} = \{c_{1i}, c_{2i}, \ldots, c_{ni}\}$    (1)

- Global disambiguation. The classifier must be able to conclude on the relevance of a document relative to a given context and to choose, among the different ontological representations, the one that best corresponds to this context. A score is calculated for each document; the highest score determines the candidate ontology selected to represent the document d. The different terms of a document, taken together with the contextual relations linking them, make it possible to conduct a semantic evaluation of the textual content. A matrix, defined by (2), is associated with each ontology and each document:

$M_d^{\theta_i} = \begin{pmatrix} lc_{c_1 c_1} & \cdots & lc_{c_1 c_n} \\ \vdots & \ddots & \vdots \\ lc_{c_n c_1} & \cdots & lc_{c_n c_n} \end{pmatrix}$    (2)

The rows and columns of this matrix represent all the concepts extracted from the ontology $\theta_i$ for the document d. $C_i$ is the concept selected for the term $t_i$ after projection of the document d on $\theta_i$, and $lc_{c_i c_j}$ represents the weight of the link between the concept $C_i$ and the concept $C_j$ (i ≠ j). The matrix is initialized to zero. If a term $t_i$ and a term $t_j$ appear together within the same paragraph of the document d, and the concepts $C_i$ and $C_j$ correspond, respectively, to the terms $t_i$ and $t_j$, then the weight $lc_{c_i c_j} = 1$; this weight is updated whenever $t_i$ and $t_j$ appear together in the same paragraph, over all paragraphs of d. The weight $lc_{c_i c_i}$ corresponds to the appearance of the term $t_i$ in the document and is equal to 1. The importance of the concept $C_i$ in document d is determined by its total weight in d relative to the ontology $\theta_i$; this weight is given by the row associated with $C_i$ in the matrix. The score of each ontology, obtained as the sum of the weights of all the concepts extracted from this ontology for the document d, measures the extent to which the ontology represents this document. The ontology that obtains the highest score is selected to represent the document d. A minimal sketch of this scoring step is given below.
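The following sketch illustrates the global disambiguation score: for each candidate ontology, a concept co-occurrence matrix is filled paragraph by paragraph, and the ontology with the highest total weight is retained. Representing a projection as a per-paragraph list of concepts, and doubling the off-diagonal sum to account for the symmetric matrix, are illustrative assumptions of this sketch.

```python
# A minimal sketch of per-ontology scoring via a co-occurrence matrix.
from itertools import combinations
from collections import defaultdict

def ontology_score(paragraph_concepts):
    """Sum of all link weights lc(ci, cj) plus the diagonal lc(ci, ci)."""
    lc = defaultdict(float)
    seen = set()
    for concepts in paragraph_concepts:
        seen.update(concepts)
        for ci, cj in combinations(sorted(set(concepts)), 2):
            lc[(ci, cj)] += 1.0   # co-occurrence within one paragraph
    # Off-diagonal weights count twice (symmetric matrix); each concept
    # appearing in the document contributes 1 on the diagonal.
    return 2 * sum(lc.values()) + len(seen)

projections = {
    "computer_science": [["ontology", "data_mining"], ["ontology", "web"]],
    "medicine":         [["web"]],
}
best = max(projections, key=lambda o: ontology_score(projections[o]))
print(best)  # -> "computer_science"
```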
For documents belonging to the same knowledge domain, a "local" semantic similarity is then computed. The process compares their content using their semantic perimeter (a notion introduced and defined later in the paper), constructed on the basis of their conceptual graph extracted from the ontology to which they are attached.

3.2 Text similarity and semantic perimeter

An author describes the subject of his document by evoking one or more different notions, possibly addressing several sub-notions. These notions and/or sub-notions can be described in a general or a precise way, according to the level of detail to be highlighted. An ontology has a structure that defines the meaning of the pieces of information representing a given knowledge domain and the way in which these pieces of information are related to each other. This structure is composed of several branches representing different hierarchies. Each hierarchy has branches that separate data with common characteristics from data with different characteristics. The tree of Figure 4, inspired by the geometric figures ontology proposed in [59], shows two branches, Br1 (figure) and Br2 (angle), representing two different data. Branch Br2 has two sub-branches, 2.1 and 2.2, corresponding respectively to a right angle and an acute angle. Right angle and acute angle are two concepts with different characteristics but also common characteristics, defined by their common parent, angle.

Figure 4: Extract from the geometric figures ontology.

3.2.1 Objective of the approach

Consider two texts Txt1 and Txt2, previously classified in the same knowledge domain represented by a domain ontology, whose similarity Sim(Txt1, Txt2) needs to be assessed. Our semantic similarity process is based on the following assumptions:

1. Each branch/sub-branch of the ontology is associated with a notion/sub-notion described in a document.
2. Concepts linked by "is-a" relations form a branch.
3. A branch can have several sub-branches.
4. Two branches whose only common parent is the root of the ontology represent two different notions.
5. Two sub-branches having a common parent represent two different sub-notions sharing common characteristics defined by their common parent.
6. The weight of an initial concept is equal to 1.
7. The weight of an added concept, representing implicit information, is less than 1.
8. The similarity between two texts varies between 0 and 1.

Our approach is based on the identification of the branches to which the concepts of the documents belong, and on the enrichment of the conceptual graphs of these documents. Associating a notion with a branch helps in identifying different and identical notions. It can be said, for example, that the notion "angle" is different from the notion "figure", and that the notion "triangle" is different from the notion "quadrilateral", because they belong to different branches or sub-branches. The concepts quadrilateral, parallelogram, diamond and square belong to the same sub-branch, describing the same notion.
Each of them brings a degree of precision, knowing that this precision gets higher as one goes down the hierarchy. Graph enrichment highlights notions common to two documents without these being explicitly cited in their content, and makes it possible to deduce similarities between notions by examining the branches to which their corresponding concepts belong.

3.2.2 Graph enrichment

To describe a given subject, authors can choose different words and different levels of description, depending on the importance each of them wishes to give to a notion addressed in the text. Thus, by adding concepts, graph enrichment makes it possible to deduce implicit information that can be shared by two texts. Like Baziz [44], our process enriches the text graphs by adding concepts. The applied enrichment differs from that of Baziz in the choice of the concepts to be added and in the weight assigned to these concepts. In our case, the weight assigned to a concept captures the implicit or explicit presence of this concept. A graph is enriched by constructing the semantic perimeter of its corresponding text and by comparing it with another graph.

3.2.2.1 Constructing the semantic perimeter of a text

Definition 1: The semantic perimeter of a text is a graph whose nodes are the initial concepts and the link concepts. Initial concepts are extracted from the domain ontology to which the document is attached. These concepts represent the information explicitly described in its content. With these concepts, a conceptual graph is constructed; it is then enriched with link concepts representing the implicit information of the text, deduced from the initial concepts by browsing the "is-a" relationships and the transversal relationships defined in the domain ontology. The semantic perimeter thus constructed for each document makes it possible to evaluate the semantic similarity of documents even when they describe the same ideas with different terms.

- Constructing the graph of initial concepts. During the classification process, a text is projected onto a set of domain ontologies. At the end of this step, the text is represented by a conceptual graph whose nodes constitute the initial concepts. These concepts correspond to the terms explicitly cited in the document.

- Constructing the semantic perimeter. The link concepts extracted from the ontology, lying on the shortest path linking two initial concepts Ci and Cj through is-a relations or transversal relations, are added to the graph of the document. Link concepts are filtered in order to retain only the concepts that make sense in relation to the knowledge domain represented by the ontology. Indeed, some concepts represented in an ontology are used to build the structure of the ontology but have no meaning for the domain in question. For example, host and hard_disk are two synsets extracted for a document classified in the computer_science domain. Figure 5 shows the synsets linking them in WordNet. The link synsets are: {computer 02971359, machine 03561924, device 03068033, memory_device 03604997, magnetic_disk 03568359}. The synsets machine 03561924 and device 03068033 are not retained, since they belong, respectively, to the buildings domain and to the factotum domain. A minimal sketch of this construction is given below.

Figure 5: Link synsets linking host to hard_disk.
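The following sketch illustrates semantic perimeter construction: starting from the initial concepts, the link concepts found on the shortest paths between pairs of initial concepts are added, keeping only those annotated with the relevant domain. The tiny graph and the domain annotations are illustrative assumptions in the spirit of the host/hard_disk example above (requires: pip install networkx).

```python
# A minimal sketch of semantic perimeter construction with a domain filter.
import networkx as nx

onto = nx.Graph()  # undirected view of is-a + transversal relations
onto.add_edges_from([("host", "computer"), ("computer", "machine"),
                     ("machine", "device"), ("device", "memory_device"),
                     ("memory_device", "magnetic_disk"),
                     ("magnetic_disk", "hard_disk")])
domain_of = {"host": "computer_science", "computer": "computer_science",
             "machine": "buildings", "device": "factotum",
             "memory_device": "computer_science",
             "magnetic_disk": "computer_science",
             "hard_disk": "computer_science"}

def semantic_perimeter(initial, domain):
    nodes = set(initial)
    for a in initial:
        for b in initial:
            if a < b:
                # Intermediate nodes on the shortest path are link concepts.
                for link in nx.shortest_path(onto, a, b)[1:-1]:
                    if domain_of.get(link) == domain:  # domain filter
                        nodes.add(link)
    return nodes

print(semantic_perimeter({"host", "hard_disk"}, "computer_science"))
# -> initial concepts plus computer, memory_device and magnetic_disk
# (machine and device are filtered out, as in the example above)
```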
3.2.2.2 Comparing graphs

Two texts Txt1 and Txt2 are compared through their semantic perimeters G1 and G2. A mutual enrichment of these two graphs is achieved by comparing the concepts belonging to G1 with the concepts belonging to G2. Each graph enriches the other, and concepts are added to G1 and/or to G2. This is done by browsing the graphs from the leaf nodes to the root, as follows (a minimal sketch of these rules is given after the examples below):

• If the graph G1 (resp. G2) contains a concept C1 and the graph G2 (resp. G1) contains a concept C2 such that C2 is an ancestor of C1, then the concept C2 is added to the graph G1 (resp. G2).
• The graphs are also enriched by adding the common parents of concepts belonging to graphs G1 and G2. This enrichment is done in two steps:
  ▪ by considering the concepts belonging only to the graph G1 (resp. to the graph G2);
  ▪ by considering the concepts belonging to both graphs G1 and G2.

By adding common parent concepts, graph enrichment helps in determining the branches and sub-branches common to G1 and G2, and thus in deducing an implicit similarity between Txt1 and Txt2. As an illustration, in the geometric figures domain represented by Figure 4, three texts (T1, T2 and T3) are considered, with the following content:

T1: A square is a regular polygon with four sides. It has four right angles and its sides have the same measure.
T2: A diamond is a parallelogram. Some diamonds have four equal angles.
T3: A triangle has three sides. If it has a right angle, it is a right triangle.

- Let us compare T1 and T2. The semantic perimeters of T1 and T2 and the comparison of their respective graphs G1 and G2 are given in Figure 6. The projection of the texts T1 and T2 on the ontology represented by Figure 4 yields the initial concepts used to construct the graphs G1 and G2. G1 is represented by the concepts (square, polygon, right angle) and G2 by the concepts (diamond, parallelogram, angle). At this stage, the graphs have no common concept.

Figure 6: Comparison and enrichment of graphs corresponding to T1 and T2.

The enrichment of these two graphs makes it possible to add concepts semantically linked to the initial concepts and to bring out concepts common to the two texts that are not explicitly cited in their contents. The common concepts are diamond, parallelogram, quadrilateral, polygon and angle.

- Let us compare T2 and T3. The semantic perimeters of T2 and T3 and the comparison of their respective graphs G2 and G3 are given in Figure 7. The projection of the texts T2 and T3 on the ontology represented by Figure 4 yields the initial concepts used to construct the graphs G2 and G3. G2 is represented by the concepts (diamond, parallelogram, angle) and G3 by the concepts (triangle, right triangle, right angle). The enrichment of the two graphs brings out the common concepts (angle and polygon).

Figure 7: Comparison and enrichment of graphs corresponding to T2 and T3.
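The following sketch is a coarse rendering of the mutual enrichment rules above, over the geometric-figures taxonomy of Figure 4 (reconstructed here as a simple parent map, an illustrative assumption). It adds to each graph the ancestors it shares with the other graph, either directly or through common parents; a full implementation would also handle within-graph common parents and the concept weights of Section 3.2.3.

```python
# A minimal sketch of mutual graph enrichment via shared ancestors.
parent = {"square": "parallelogram", "diamond": "parallelogram",
          "parallelogram": "quadrilateral", "quadrilateral": "polygon",
          "triangle": "polygon", "right_triangle": "triangle",
          "polygon": "figure", "right_angle": "angle", "acute_angle": "angle"}

def ancestors(c):
    out = []
    while c in parent:
        c = parent[c]
        out.append(c)
    return out

def mutual_enrichment(g1, g2):
    """Add to each graph the ancestors it shares with the other graph."""
    for a, b in ((g1, g2), (g2, g1)):
        for c in list(a):
            for anc in ancestors(c):
                # Rule 1: an ancestor of c present in the other graph.
                # Rule 2: a common parent of concepts of both graphs.
                if anc in b or any(anc in ancestors(d) for d in b):
                    a.add(anc)
    return g1, g2

g1 = {"square", "polygon", "right_angle"}
g2 = {"diamond", "parallelogram", "angle"}
print(mutual_enrichment(g1, g2))
# common concepts now include parallelogram, quadrilateral, polygon, angle
```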
3.2.3 Calculating the similarity of two texts

Definition 2: Textual similarity is defined by the set of common notions and sub-notions addressed by two texts. It is a function of the concepts corresponding to these texts, of their weights, and of the branches to which these concepts belong. The similarity of two texts is given by the similarity of their respective graphs, according to equation (3):

$Sim(Txt_1, Txt_2) = Sim(G_{Txt_1}, G_{Txt_2})$    (3)

3.2.3.1 Weight of the concepts

The weight attributed to an initial concept is equal to 1. This weight denotes the explicit presence of the concept in the document. Concepts belonging to the same branch do not have the same semantic weight: concepts at the top of the hierarchy have a more general meaning than concepts at the bottom, which carry a more precise meaning; the lower a concept is in the hierarchy, the more precise its meaning. Thus, a concept added to a graph during the enrichment process is assigned a weight whose value is less than 1. This weight represents the value of a piece of implicit information and is computed from a parameter g, which expresses the degree of generalization of a parent concept with respect to its child concept. Like Fuhr [60] and Baziz [44], who reduce the weights of the nodes of a tree representing a document according to their position with respect to the most specific nodes, by multiplying them by a factor between 0 and 1, our process computes the weight of an added concept using the parameter g, whose value is between 0 and 0.1, according to equation (4):

$P(C_j) = 1 - g \times length(C_i, C_j)$    (4)

where $C_j$ is the added concept, $C_i$ is the initial concept, belonging to G1 and/or G2, that is lowest in the branch to which $C_j$ is added, and $length(C_i, C_j)$ is the number of arcs linking $C_j$ to $C_i$ in the branch.

3.2.3.2 Semantic similarity of two graphs G1 and G2

A factor indicating the percentage of notions shared by the two texts is introduced. Its value is the number of common branches relative to the total number of branches belonging to the two graphs. The similarity between two graphs G1 and G2 is computed using equation (5):

$Sim(G_1, G_2) = \frac{nbBc(G_1, G_2)}{nbB(G_1, G_2)} \times \frac{\sum_{B_c} \sum_{C_{com} \in B_c} P(C_{com})}{\sum_{B} \sum_{C \in B} P(C)}$    (5)

where B denotes any branch belonging to the graphs, Bc a branch common to both graphs, C a concept belonging to G1 or G2, and Ccom a concept common to both graphs. nbBc(G1, G2) and nbB(G1, G2) represent, respectively, the number of common branches and the total number of branches belonging to the two graphs. A minimal sketch of equations (4) and (5) is given below.
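The following sketch implements equations (4) and (5) under one reading that reproduces the worked example of Section 3.2.3.3: a common concept is counted once at its smaller weight, and each concept of the union once at its larger weight. The branch labels and weights mirror the T1/T2 example of Table 3; the data structures are illustrative assumptions.

```python
# A minimal sketch of equations (4) and (5).

def added_weight(g: float, length: int) -> float:
    """Equation (4): P(Cj) = 1 - g * length(Ci, Cj)."""
    return 1.0 - g * length

def graph_similarity(g1, g2, branches):
    """Equation (5). g1, g2: {concept: weight}; branches: {concept: branch}."""
    common = set(g1) & set(g2)
    b1 = {branches[c] for c in g1}
    b2 = {branches[c] for c in g2}
    branch_ratio = len(b1 & b2) / len(b1 | b2)          # nbBc / nbB
    num = sum(min(g1[c], g2[c]) for c in common)        # P(Ccom)
    den = sum(max(g1.get(c, 0), g2.get(c, 0)) for c in set(g1) | set(g2))
    return branch_ratio * num / den

branches = {"square": "1.2", "diamond": "1.2", "parallelogram": "1.2",
            "quadrilateral": "1.2", "polygon": "1",
            "angle": "2", "right_angle": "2.2"}
g = 0.05
g1 = {"square": 1, "diamond": added_weight(g, 1),
      "parallelogram": added_weight(g, 2),
      "quadrilateral": added_weight(g, 3), "polygon": 1,
      "angle": added_weight(g, 1), "right_angle": 1}
g2 = {"diamond": 1, "parallelogram": 1, "quadrilateral": 0.85,
      "polygon": 0.80, "angle": 1}
print(round(graph_similarity(g1, g2, branches), 2))  # -> 0.49, as in 3.2.3.3
```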
3.2.3.3 Example

Let us take again the examples shown in Figures 6 and 7 and summarize the various results in Tables 3 and 4. For the parameter g, the value 0.05 is used. Applying equation (5) to the enriched graphs gives:

$Sim(T_1, T_2) = \frac{3}{4} \times \frac{0.80 + 0.85 + 0.90 + 0.95 + 0.95}{1 + 1 + 1 + 0.85 + 1 + 1 + 1} = 0.49$

$Sim(T_2, T_3) = \frac{2}{6} \times \frac{0.85 + 0.95}{1 + 1 + 0.85 + 1 + 1 + 1 + 1} = 0.09$

Initially, G1 and G2 had no concept in common and, therefore, a priori no similarity. The same applies to graphs G2 and G3. The enrichment of the graphs brought out a similarity between T1 and T2, as well as between T2 and T3, that is not explicitly described in their content. The results also show that text T2 is semantically closer to T1 than to T3.

Texts   Concepts        Type       Weight
T1      square          initial    1
        diamond         link       0.95
        parallelogram   link       0.90
        quadrilateral   link       0.85
        polygon         initial    1
        angle           ancestor   0.95
        right angle     initial    1
T2      diamond         initial    1
        parallelogram   initial    1
        quadrilateral   ancestor   0.85
        polygon         ancestor   0.80
        angle           initial    1
Common branches: 1, 1.2, 2
All branches: 1, 1.2, 2, 2.2

Table 3: Concepts of T1 and T2 after enriching their respective graphs.
Texts   Concepts        Type            Weight
T2      diamond         initial         1
        parallelogram   initial         1
        polygon         common parent   0.85
        angle           initial         1
T3      right angle     initial         1
        right triangle  initial         1
        triangle        initial         1
        polygon         common parent   0.85
        angle           ancestor        0.95
Common branches: 1, 2
All branches: 1, 1.1, 1.2, 1.1.3, 2, 2.2

Table 4: Concepts of T2 and T3 after enriching their respective graphs.

3.3 Similarity of scientific abstracts

Refining the process of semantic comparison of two texts (defined in Section 3.2) is performed through a generic structuring of the abstract of a scientific paper into distinct parts whose descriptive roles are different. Several works have addressed the annotation of the discursive structure of scientific papers, e.g., text zoning [61][62]. Their objective is to better characterize the content of papers by defining several classes (objective, method, results, conclusion, etc.), knowing that the existence of these classes depends on the corpus studied. Categorization is performed at the sentence level: for each sentence of an abstract, these authors associate a class chosen from the defined classes.

Our work decomposes scientific abstracts into zones for the purpose of detecting plagiarism. Following the structure generally reproduced by the authors of scientific papers, the content of a scientific abstract is divided into three distinct parts, referred to as zones, that define the context, the contribution and the application domain. This decomposition is generally reflected in most scientific papers, which aim, in principle, at making a scientific contribution in a given domain. The decomposition aims at extracting the notions relating to each zone, and thus permits a comparison between zones of the same type. The process can then evaluate, in a progressive approach, whether two abstracts deal with the same context, whether their contributions are similar, and whether they apply their approach to the same application domain, the risk of plagiarism evidently increasing with each conclusive comparison.

Categorization at the sentence level poses a problem when information from one class is cited in another class. In analyzing several abstracts, we found that there is no strict uniformity in the writing of abstracts: the sentences belonging to a given zone do not contain only the terms describing this zone, but may also contain terms representing another zone. For example, a sentence assigned to the application domain zone may contain terms defining an algorithm or a method (terms that instead define the contribution zone). This overlapping of several zones in the same sentence then generates labeling errors. To illustrate categorization at the sentence level, each sentence of Abstract2, provided in Section 3.3.1, is associated with one of the three selected zones:

"Recently, new approaches have integrated the use of data mining techniques in the ontology enrichment process. Indeed, the two fields, data mining and ontological meta-data, are extremely linked: on the one hand, data mining techniques help in the construction of the semantic Web, and on the other hand, the semantic Web assists in the extraction of new knowledge. Thus, many works use ontologies as a guide for the extraction of rules or patterns, allowing one to discriminate the data by their semantic value and thus to extract more relevant knowledge. It turns out, however, that few works aimed at updating the ontology are concerned with data mining techniques. In this paper, we present an approach to support the ontologies management of websites based on the use of Web Usage Mining techniques. The presented approach has been tested and evaluated on a website ontology, which we have constructed and then enriched based on the sequential patterns extracted from the log."

The following inconsistencies are noted:
- The term sequential pattern is assigned to the application domain zone, whereas it represents the algorithm and method used by the author and, therefore, defines the contribution.
- The term data mining technique is assigned to the context zone, whereas it represents the contribution.
- The term ontologies management is assigned to the contribution zone, whereas it defines the context.

To evaluate the semantic similarity of the two abstracts given in Section 3.3.1, their content was first divided as illustrated above. For each abstract, three graphs are constructed and enriched (one graph per selected zone). For each zone, a similarity value is calculated. The similarity values obtained are very low. This is explained by terms being assigned to one zone while they semantically define another zone, a consequence of the decomposition based on sentence-level categorization and of the overlapping of zones. To overcome this problem, the terms are assigned to each zone of an abstract according to the overall meaning conveyed by its content. From the global meaning of an abstract, the meaning and the role of its terms are deduced. A term can describe the context of the paper (document categorization, document clustering, image categorization, ontologies enrichment, information retrieval, etc.), the contribution (the methods and algorithms, as well as the notions used to describe them), or the application domain (classification applied to a given corpus, data mining applied to textual documents, data mining applied to the Web, data mining applied to images, etc.). In addition, the terms contained in the title and in the keywords are used, as they often carry information that is not cited in the abstract. The role of each term is defined according to the knowledge domain in which it is used. A similar semantic annotation of concepts was notably achieved in WordNet Domains [63]: different subject fields are defined, such as medicine, computer science and architecture, and each synset of WordNet [64] is annotated with one or more subject fields in which this synset has a meaning. On the basis of the principle that a term describes one of the three zones selected to characterize a scientific abstract, each concept is annotated, in the ontology associated with this abstract, with one of the three zones (context, contribution, application domain). The extraction of the concepts corresponding to each zone is performed by projecting the terms composing the content of an abstract onto the ontology. Comparing two abstracts then amounts to comparing the zones playing the same role: three partial similarities are calculated on the basis of the concepts belonging to the same zone, so two abstracts are compared at three levels. A global similarity of two scientific abstracts A1 and A2 is obtained by combining the three partial similarities according to equation (6). The global similarity makes it possible to rank abstracts in descending order of similarity, as illustrated in Tables 10, 11 and 12.

$Sim(A_1, A_2) = \alpha \, sim_{context}(A_1, A_2) + \beta \, sim_{contribution}(A_1, A_2) + \gamma \, sim_{application\ domain}(A_1, A_2)$    (6)

α, β and γ are parameters whose values are between 0 and 1.
They define the importance attributed to the context, the contribution and the application domain, with α + β + γ = 1. The documents processed are not necessarily suspicious, since it is possible to use this approach to compare a document under review, for example, with an entire corpus, without any prior assumption regarding its respect of scientific ethics. A similarity threshold, determined experimentally and depending on the ontology and the collection of abstracts used, determines whether a risk of plagiarism exists. Abstracts with a high similarity then require a full review of the entire documents. A minimal sketch of this zone-weighted combination is given below.
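The following sketch implements equation (6) and the threshold test described above. The weights reuse the values reported in the example of Section 3.3.2 (α = 0.35, β = 0.63, γ = 0.02); the threshold value itself is an illustrative assumption, since the paper determines it experimentally for each ontology and collection.

```python
# A minimal sketch of equation (6): zone-weighted global similarity.

def global_similarity(sim_context, sim_contribution, sim_application,
                      alpha=0.35, beta=0.63, gamma=0.02):
    assert abs(alpha + beta + gamma - 1.0) < 1e-9   # alpha + beta + gamma = 1
    return (alpha * sim_context + beta * sim_contribution
            + gamma * sim_application)

THRESHOLD = 0.6  # illustrative; to be tuned experimentally

sim = global_similarity(0.98, 0.59, 0.10)  # values of the example in 3.3.2
print(round(sim, 2))                       # -> 0.72
print("risk of plagiarism" if sim >= THRESHOLD else "no flag")
```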
Figure 9: Enriched graph of Abstract1.

3.3.2.2 Enrichment of the graphs corresponding to the two abstracts

The initial concepts are used to enrich the graphs of the two abstracts by constructing their semantic perimeters, prior to comparing their graphs. The enriched graphs of Abstract1 and Abstract2 are represented in Figure 9 and Figure 10. The distribution by zone of the initial concepts and of the concepts added by enrichment is given in Table 5.

Figure 10: Enriched graph of Abstract2.

Table 5: Distribution by zone of the concepts of Abstract1 and Abstract2.

Zones               Concepts of Abstract1     Concept type   Concepts of Abstract2     Concept type
context             Ontology_management       added          Ontology_management       initial
                    Ontology_enrichment       initial        Ontology_enrichment       initial
                    Ontology                  initial        Ontology                  initial
contribution        Data_mining               initial        Data_mining               initial
                    Technique                 added          Technique                 added
                    Data_mining_technique     initial        Data_mining_technique     initial
                    Sequential_pattern        initial        Sequential_pattern        initial
                                                             Web_usage_mining          initial
application domain  Informational_resource    initial        Informational_resource    added
                    Textual_document          initial        log                       initial
                    Domain                    added          Domain                    added
                    Water_domain              initial        Website                   initial

3.3.2.3 Calculating the similarity between Abstract1 and Abstract2

Table 6 provides the values of the global similarity and of the partial similarities (values obtained with α = 0.35, β = 0.63, γ = 0.02, g = 0.05).

Table 6: Similarities between Abstract1 and Abstract2.

Sim_context(Abstract1, Abstract2)              0.98
Sim_contribution(Abstract1, Abstract2)         0.59
Sim_applicationdomain(Abstract1, Abstract2)    0.10
Sim(Abstract1, Abstract2)                      0.72

3.3.2.4 Result

The results obtained indicate that these two abstracts deal with the same context (sim_context = 0.98) with similar approaches; the similarity obtained for the contribution is high (sim_contribution = 0.59). The two abstracts differ at the application domain level, since the similarity value obtained for this zone is very low (sim_applicationdomain = 0.10). The global similarity obtained is high. This value indicates that the papers associated with these two abstracts should be the subject of a more in-depth analysis that could possibly reveal a case of plagiarism.

4 Experiments

Our approach is evaluated at two levels. The first evaluation concerns our semantic classification process based on domain ontologies (CBO) and the second concerns the textual similarity calculation process for scientific abstracts.

4.1 Semantic classification process

4.1.1 The data

The implementation of our semantic classification process was performed using WordNet and WordNet Domains together. Several knowledge domains defined in WordNet Domains were assimilated to domain ontologies. The RiTa similarity measure [13] was used to measure the semantic distance between two synsets in WordNet. The terms within sentences were annotated with their type (noun, verb, adverb and adjective) by the Stanford Part-Of-Speech Tagger (POS Tagger) [65].

To evaluate conventional classifiers with our corpus, a pre-processing was performed on the documents. The nouns, verbs and adjectives used in each document were retained, the corresponding lemmas were extracted and their Tf-Idf weights were then calculated. These lemmas constitute the vector representation of the documents.
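For the baseline preprocessing just described, a minimal sketch using scikit-learn is given below, assuming the documents have already been reduced to the lemmas of their nouns, verbs and adjectives; the original evaluation used WordNet, the Stanford POS tagger and Weka, so this only approximates the vectorization step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed input: documents already reduced to the lemmas of their
# nouns, verbs and adjectives (lemmatization not shown here).
docs = [
    "ontology enrichment base sequential pattern textual document",
    "web usage mining ontology enrichment website log",
]

vectorizer = TfidfVectorizer()       # Tf-Idf weighting of the lemmas
X = vectorizer.fit_transform(docs)   # one weighted vector per document

print(vectorizer.get_feature_names_out())
print(X.toarray())                   # the vector representation of the documents
```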
For the conventional classifiers, three algorithms from Weka [66] were used: SVM, Naive Bayes and the C4.5 decision tree. Our evaluation covers 10 domains defined in WordNet Domains and a corpus consisting of 976 abstracts of scientific papers. Some abstracts of the medicine domain were extracted from the MuchMore corpus, a parallel corpus of English-German scientific medical abstracts obtained from the Springer Link web site. All the other abstracts of our corpus were extracted from several scientific journals specialized in the retained domains, by browsing their web sites. Table 7 gives the distribution of the abstracts over the selected domains.

Table 7: Distribution of abstracts by domain.

Domains            Number of abstracts
Music              106
Law                 83
Computer_science   101
Politics            76
Physics            101
Chemistry           83
Economy            104
Buildings          104
Medicine           117
Mathematics        101
Total              976

4.1.2 Results and discussion

Measures traditionally used in categorization are considered in this work: precision, recall, F-measure and baseline accuracy. The recall (Rc) of a class is the number of documents correctly classified in that class divided by the total number of documents belonging to that class. The precision (Pr) of a class is the number of documents correctly classified in that class divided by the number of documents assigned to that class. The F-measure (F) combines precision and recall as their harmonic mean. Baseline accuracy (Acc) gives the percentage of documents correctly classified relative to the total number of documents in the corpus. The results of our process were compared with those of the conventional classifiers; they are summarized in Table 8.

Table 8: Comparison of the results of the various classifiers.

                  CBO                    Naive Bayes            SVM                    Tree C4.5
Classes           Pr     Rc     F        Pr     Rc     F        Pr     Rc     F        Pr     Rc     F
Music             0.962  0.943  0.952    0.835  0.906  0.869    0.963  0.981  0.972    0.913  0.887  0.900
Law               0.952  0.964  0.958    0.777  0.880  0.825    0.947  0.867  0.906    0.766  0.711  0.737
Computer_science  0.970  0.950  0.960    0.845  0.861  0.853    0.872  0.941  0.905    0.474  0.644  0.546
Politics          0.949  0.974  0.961    0.788  0.829  0.808    0.944  0.882  0.912    0.754  0.645  0.695
Physics           0.960  0.960  0.960    0.833  0.842  0.837    0.887  0.931  0.908    0.513  0.386  0.441
Chemistry         0.940  0.952  0.946    0.947  0.867  0.906    0.986  0.880  0.930    0.848  0.807  0.827
Economy           0.980  0.962  0.971    0.820  0.788  0.804    0.855  0.904  0.879    0.541  0.442  0.487
Buildings         0.980  0.962  0.971    0.950  0.913  0.931    0.925  0.952  0.938    0.757  0.750  0.754
Medicine          1.000  0.983  0.991    0.982  0.940  0.961    0.991  0.991  0.991    0.894  0.863  0.878
Mathematics       0.925  0.980  0.952    0.904  0.842  0.872    0.898  0.871  0.884    0.493  0.673  0.569
Average           0.964  0.963  0.963    0.872  0.869  0.870    0.926  0.924  0.924    0.694  0.682  0.683
Accuracy          0.963                  0.869                  0.924                  0.682

To calculate these values for SVM, Naive Bayes and C4.5, cross-validation was performed and the results obtained with the best parameters were retained. Table 8 shows that, for our process, the values of recall and precision are close to each other and close to 1, an indicator of the good performance of our classifier. Considering the averages of precision, recall and F-measure, our process obtains better results than the three conventional classifiers considered. The best percentage of documents correctly classified relative to all the documents in the corpus is also obtained by our semantic classification process.
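These measures, and the paired significance test applied to them in the next paragraphs, can be sketched as follows, assuming SciPy is available; the exact p-value may differ slightly from Table 9 depending on how zero differences and tied ranks are handled.

```python
from scipy.stats import wilcoxon

def per_class_metrics(tp, fp, fn):
    """Precision, recall and F-measure for one class, from the counts of
    true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# Invented counts for one class: 96 of its 100 documents correctly
# classified, and 4 documents of other classes wrongly assigned to it.
print(per_class_metrics(tp=96, fp=4, fn=4))

# Paired two-sided Wilcoxon signed-rank test on the per-class
# F-measures of Table 8 (CBO vs. SVM).
f_cbo = [0.952, 0.958, 0.960, 0.961, 0.960, 0.946, 0.971, 0.971, 0.991, 0.952]
f_svm = [0.972, 0.906, 0.905, 0.912, 0.908, 0.930, 0.879, 0.938, 0.991, 0.884]
print(wilcoxon(f_cbo, f_svm).pvalue)  # small p-value: significant difference
```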
A Wilcoxon signed-rank test was used to study the statistical significance of the improvement brought about by our process. The p-value between our system and each of the three conventional classifiers was calculated, based on the F-measure values obtained for CBO, SVM, Naive Bayes and C4.5. The improvement is considered statistically significant if p-value < 0.05 and very significant if p-value < 0.01. The results of the test are summarized in Table 9.

Table 9: Wilcoxon test results.

                      CBO - SVM     CBO - Naive Bayes   CBO - Tree C4.5
P-value (F-measure)   0.00885858    0.00294464          0.000976562

The p-values obtained with the Wilcoxon test are all less than 0.01, and therefore very significant. This allows us to conclude that our system significantly improves the classification of documents compared to conventional classifiers at the threshold α = 0.01. The three conventional classifiers have in common the representation of documents by words independent of each other, as well as a morphological comparison of the words belonging to the documents. Their comparison is performed at the word level, whereas in our process the comparison is performed at the level of the overall context of the document: a document is represented by the domain described in its content, and this domain is deduced from the words of the document taken together, considering their relationships in the context in which they appear. In addition, our process is built from domain ontologies, which is a more stable base than a training collection; indeed, a modification in the choice of the documents constituting the training collection changes the results of conventional classifiers.

4.2 Semantic similarity process for scientific abstracts

4.2.1 The data

Our implementation was extended by adding processes to build the semantic perimeters, to divide scientific abstracts into three zones and to compare graphs. To evaluate our approach to the semantic similarity of scientific abstracts, we constructed an ontology representing the domain of automatic classification of documents. To build our corpus, a set of scientific abstracts related to this domain was extracted from the web. In our different tests, the abstract, the title of the paper and the keywords were taken into account. Each abstract was compared with all the other abstracts in the corpus, i.e. the abstracts were compared in pairs; for example, comparing twenty abstracts requires 20 × 19 / 2 = 190 comparisons. The construction of the initial graph and of the semantic perimeter of each abstract, and the comparison of the graphs, are done according to the process defined in the previous sections.

Each concept of our ontology was annotated with one of the three zones characterizing the content of scientific abstracts: context, contribution and application domain. This annotation is performed according to the role that each concept plays in the chosen domain. For example, the clustering, classification and document concepts are annotated with the context zone; the concepts representing the different algorithms and methods used by the authors, as well as all the concepts describing these methods, are annotated with the contribution zone; and the concepts representing the type of document (text, Web) and the corpus used are annotated with the application domain zone.
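The pairwise comparison described above can be sketched as follows; `compare_zones` is a hypothetical stand-in for the graph construction, enrichment and zone-by-zone matching described in section 3, not an actual function of the implementation.

```python
from itertools import combinations

def rank_pairs(corpus, compare_zones, alpha=0.35, beta=0.63, gamma=0.02):
    """Compare every pair of abstracts once (n(n-1)/2 pairs, i.e. 190
    comparisons for twenty abstracts) and rank the pairs in descending
    order of their global similarity, as in Tables 10 to 12.

    compare_zones(a, b) is assumed to return the three partial
    similarities (context, contribution, application domain) obtained
    by matching the enriched graphs of the two abstracts zone by zone.
    """
    results = []
    for a, b in combinations(corpus, 2):
        ctx, contrib, appdom = compare_zones(a, b)
        glob = alpha * ctx + beta * contrib + gamma * appdom  # equation (6)
        results.append((a, b, ctx, contrib, appdom, glob))
    return sorted(results, key=lambda r: r[-1], reverse=True)
```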
Our approach was compared to two existing approaches. The first is based on a vector representation of the content of the text (Bag-of-words). The term extraction process is similar to the one described in section 4.1.1: an abstract vector contains the lemmas corresponding to the nouns, verbs and adjectives extracted from the text, represented by their Tf-Idf weights. The similarity of two abstracts is calculated as the cosine of the angle between their respective vectors.

The second approach (n-grams) is based on the representation of an abstract by a set of character n-grams: the text is divided into substrings of n consecutive characters. Several values of n were tested (n = 2, 4 and 8) and, for each, the similarity between two abstracts was calculated using equations (7) [51][52] and (8) [53]. For any pair of abstracts x and y, the similarity Sim(x, y) is computed as below:

Sim(x, y) = (1 / |Dn(x) ∪ Dn(y)|) × Σ_{w ∈ Dn(x) ∪ Dn(y)} (f_x(w) − f_y(w))² / (f_x(w) + f_y(w))²    (7)

Sim(x, y) = |Dn(x) ∩ Dn(y)| / |Dn(x) ∪ Dn(y)|    (8)

where w denotes an arbitrary n-gram, f_x(w) denotes the relative frequency with which w appears in the abstract x, Dn(x) represents the so-called n-gram dictionary of x, and | | is the number of n-grams. The best results were obtained with n = 8 and equation (8), for which the fewest erroneous matchings were noted.
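Under our reading of the two reconstructed equations, the n-gram baseline can be sketched as follows; note that (7), as reconstructed, is a normalized squared difference of relative frequencies, so lower values indicate closer texts, while (8) grows with the proportion of shared distinct n-grams. The texts are assumed to be longer than n characters.

```python
from collections import Counter

def ngram_profile(text, n=8):
    """Relative frequencies of the character n-grams of a text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

def sim_eq7(x, y, n=8):
    """Equation (7): mean squared relative-frequency difference over the
    union of the two n-gram dictionaries (lower = more alike)."""
    fx, fy = ngram_profile(x, n), ngram_profile(y, n)
    union = set(fx) | set(fy)
    return sum(((fx.get(w, 0.0) - fy.get(w, 0.0)) /
                (fx.get(w, 0.0) + fy.get(w, 0.0))) ** 2
               for w in union) / len(union)

def sim_eq8(x, y, n=8):
    """Equation (8): shared distinct n-grams over total distinct n-grams."""
    dx = {x[i:i + n] for i in range(len(x) - n + 1)}
    dy = {y[i:i + n] for i in range(len(y) - n + 1)}
    return len(dx & dy) / len(dx | dy)

a = "ontology enrichment based on sequential patterns"
b = "web usage mining for ontology enrichment"
print(sim_eq7(a, b), sim_eq8(a, b))
```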
Table 10: Similarities between A1 and the other abstracts.

Text1          Text2               Context  Contribution  Application domain  Global
A1.clustering  A3.clustering       1.000    0.401         1.000               0.622
A1.clustering  A10.clustering      1.000    0.295         0.157               0.539
A1.clustering  A2.clustering       0.982    0.306         0.065               0.538
A1.clustering  A9.clustering       1.000    0.227         0.153               0.496
A1.clustering  A16.clustering      1.000    0.169         0.065               0.458
A1.clustering  A15.clustering      1.000    0.103         0.237               0.419
A1.clustering  A17.clustering      1.000    0.092         0.345               0.415
A1.clustering  A5.clustering       1.000    0.095         0.237               0.414
A1.clustering  A18.clustering      1.000    0.022         0.353               0.371
A1.clustering  A19.classif-clust   0.558    0.016         0.065               0.207
A1.clustering  A14.classification  0.244    0.125         0.431               0.172
A1.clustering  A6.classification   0.240    0.074         0.016               0.131
A1.clustering  A8.classification   0.225    0.060         0.541               0.127
A1.clustering  A7.classification   0.225    0.060         0.065               0.118
A1.clustering  A4.classification   0.244    0.036         0.065               0.109
A1.clustering  A11.classification  0.237    0.034         0.108               0.107
A1.clustering  A13.classification  0.230    0.014         0.065               0.090
A1.clustering  A12.classification  0.231    0.007         0.125               0.088
A1.clustering  A20.classification  0.237    0.005         0.031               0.087

Table 11: Similarities between A12 and the other abstracts.

Text1               Text2               Context  Contribution  Application domain  Global
A12.classification  A13.classification  1.000    0.015         0.000               0.360
A12.classification  A4.classification   0.966    0.032         0.000               0.358
A12.classification  A20.classification  0.964    0.012         0.483               0.355
A12.classification  A6.classification   0.965    0.015         0.193               0.351
A12.classification  A14.classification  0.966    0.012         0.066               0.347
A12.classification  A11.classification  0.964    0.007         0.023               0.342
A12.classification  A8.classification   0.900    0.005         0.185               0.322
A12.classification  A7.classification   0.900    0.005         0.000               0.318
A12.classification  A19.classif-clust   0.541    0.107         0.000               0.257
A12.classification  A5.clustering       0.234    0.027         0.329               0.105
A12.classification  A3.clustering       0.234    0.032         0.125               0.105
A12.classification  A18.clustering      0.234    0.026         0.125               0.101
A12.classification  A17.clustering      0.234    0.024         0.123               0.100
A12.classification  A9.clustering       0.227    0.019         0.189               0.095
A12.classification  A15.clustering      0.231    0.004         0.329               0.090
A12.classification  A1.clustering       0.231    0.007         0.125               0.088
A12.classification  A2.clustering       0.233    0.005         0.000               0.085
A12.classification  A16.clustering      0.231    0.006         0.000               0.085
A12.classification  A10.clustering      0.227    0.004         0.032               0.083

4.2.2 Results and discussion

The parameter values (α, β and γ) depend on the ontology and on the corpus used, and several values were tested. Since this study aims to look for matches that primarily indicate documents dealing with the same context and similar contributions, more importance is attributed to the context and contribution zones. The following values were retained: α = 0.35, β = 0.63, γ = 0.02, g = 0.05. These values led to the abstracts being grouped based on their context.

Tables 10 and 11 provide the results obtained when comparing, respectively, the abstracts A1 and A12 with the other abstracts. These tables provide the three partial similarities computed for each pair of abstracts as well as their global similarity. The results, ranked in descending order of global similarity, show a grouping of the abstracts by context. Abstract A1 deals with the document clustering context, and the abstracts that have the highest similarity with A1 correspond to this context. Abstract A12 deals with the document classification context, and the abstracts that have the highest similarity with A12 also correspond to this context.

Table 10 compares A1 with the other abstracts at three levels: context, contribution and application domain. The values obtained when comparing A1 with A3 indicate that these two abstracts deal with the same context (sim_context = 1), present similar contributions (sim_contribution = 0.401) and apply their approach to the same domain (sim_applicationdomain = 1). The value of their global similarity is high. These values lead us to retain these two abstracts as suspicious documents, requiring further reading and analysis of their entire contents. Table 11 compares A12 with the other abstracts at three levels. For the last ten rows of Table 11, very low partial and global similarities were obtained.
The first eight rows of Table 11 show that the corresponding abstracts deal with the same context as abstract A12 (sim_context >= 0.900) but use different approaches (sim_contribution <= 0.032). Their global similarity is low (<= 0.360). This enables us to conclude that abstract A12 does not present any risk of plagiarism with respect to the other abstracts. The goal of our approach is to find suspicious documents, that is, documents with high similarities. To find these documents, a threshold on the calculated similarity values is determined by experimentation.

To compare the results obtained with our approach to those of the Bag-of-words and n-grams approaches, the similarities between the different abstracts of our corpus were calculated with these two approaches, and the abstracts were then ranked in descending order of their similarity. For both approaches, several erroneous matchings were found. Table 12 gives an example: it compares the similarities between A4 and the other abstracts obtained by our approach, by Bag-of-words and by n-grams. A4 deals with the classification context; with the Bag-of-words and n-grams approaches, most of the abstracts found semantically closest to A4 deal with the clustering context. For the Bag-of-words approach, abstracts belonging to the clustering context (A10, A3, A2, A5, A15, A1) obtain a better similarity score than abstracts (A11, A8, A12, A20, A7, A14) that deal with the same context as A4. It is the same for the n-grams approach: abstracts belonging to the clustering context (A18, A3, A10, A1) obtain a better similarity score than abstracts (A7, A13, A14, A20) that deal with the same context as A4.

Table 12: Similarities between A4 and the other abstracts using our approach, Bag-of-words and n-grams.

Text1              Our approach                      Bag-of-words                      N-grams
A4.classification  A6.classification    0.417272    A06.classification   0.125685    A11.classification   0.042080
A4.classification  A11.classification   0.401363    A10.clustering       0.108323    A18.clustering       0.038287
A4.classification  A13.classification   0.373563    A13.classification   0.097182    A03.clustering       0.036313
A4.classification  A12.classification   0.358287    A19.classif-clust    0.095763    A06.classification   0.035757
A4.classification  A14.classification   0.358132    A03.clustering       0.092988    A10.clustering       0.035634
A4.classification  A7.classification    0.353878    A02.clustering       0.092751    A08.classification   0.035602
A4.classification  A20.classification   0.353120    A05.clustering       0.089178    A12.classification   0.034261
A4.classification  A8.classification    0.330633    A15.clustering       0.073636    A01.clustering       0.033475
A4.classification  A19.classif-clust    0.257688    A01.clustering       0.066826    A19.classif-clust    0.033400
A4.classification  A5.clustering        0.191517    A11.classification   0.061259    A07.classification   0.033071
A4.classification  A3.clustering        0.180843    A08.classification   0.045829    A17.clustering       0.032417
A4.classification  A9.clustering        0.176679    A18.clustering       0.043951    A09.clustering       0.029097
A4.classification  A2.clustering        0.175801    A12.classification   0.042752    A15.clustering       0.026786
A4.classification  A15.clustering       0.147094    A16.clustering       0.041947    A05.clustering       0.025901
A4.classification  A10.clustering       0.135412    A20.classification   0.033817    A13.classification   0.025269
A4.classification  A18.clustering       0.129238    A07.classification   0.031982    A14.classification   0.023015
A4.classification  A17.clustering       0.119075    A17.clustering       0.028876    A02.clustering       0.020426
A4.classification  A16.clustering       0.114507    A14.classification   0.026670    A16.clustering       0.018511
A4.classification  A1.clustering        0.109055    A09.clustering       0.023351    A20.classification   0.015968
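The decision step itself is then a simple filter; the threshold value below is hypothetical, since the paper determines it by experimentation for each ontology and corpus.

```python
def flag_suspicious(ranked_pairs, threshold):
    """Keep the pairs whose global similarity (last field of each tuple)
    reaches the experimentally determined threshold; the corresponding
    papers are the candidates for a full review."""
    return [pair for pair in ranked_pairs if pair[-1] >= threshold]

# With some of the global similarities of Tables 10 and 12 and a
# hypothetical threshold of 0.6, only the pair (A1, A3) is flagged.
pairs = [("A1", "A3", 0.622), ("A1", "A10", 0.539), ("A4", "A6", 0.417)]
print(flag_suspicious(pairs, threshold=0.6))  # [('A1', 'A3', 0.622)]
```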
For all the comparisons made between the abstracts in the corpus, our approach is able to correctly rank the abstracts by context, as shown in Tables 10, 11, 12 and 13. Clustering and classification are two different contexts, for which the methods and algorithms used are different. For that reason, the similarity between two abstracts belonging to these two contexts must be low (low context similarity and low contribution similarity), and the risk of plagiarism is therefore very low, or even non-existent.

To determine which approach performs the correct matching between the abstracts of our corpus, the precision P5 and the R-precision were computed for each approach and for each abstract. An abstract Ab1 is assumed relevant to an abstract Ab2 if Ab1 deals with the same context as Ab2. The precision Px at point x (x = 5, R) is the ratio of relevant abstracts among the first x returned ones; R, in the R-precision, is the number of abstracts relevant to a given abstract in the corpus. Table 13 summarizes the different values. Our process obtains better results than the Bag-of-words and n-grams approaches: it correctly matches the abstracts dealing with the same context and is therefore more precise than the other approaches.

The Wilcoxon signed-rank test was again used to study the statistical significance of the improvement brought about by our process. The p-value between our system and each of the two other approaches was calculated. The results of the Wilcoxon test are summarized in Table 14. The p-values obtained are all less than 0.01, and thus very significant. This leads us to conclude that our system matches abstracts by context more correctly than the Bag-of-words and n-grams approaches.

Table 14: Wilcoxon test results.

                         Our approach / Bag-of-words   Our approach / n-grams
P-value at P5            0.000213431                   0.0089409
P-value at R-precision   0.0000638361                  0.000219794

Other results are summarized in Table 15.

- The content of abstracts A1, A2, A3 and A10 indicates a great similarity between the pairs (A1-A3) and (A2-A10). These two pairs of abstracts deal with the same context, use the same algorithms and use ontologies to solve a priori similar problems. As shown in Table 15, our approach makes it possible to select both pairs as suspicious, while the Bag-of-words and n-grams approaches select only the pair (A1-A3). A1 and A3 use almost the same words in their content.
As for the abstracts A2 and A10, their content is described with different words and different sentences, but both are interested in ontology-based feature selection and use the same clustering algorithm. Our approach is able to capture the meaning of the abstracts and therefore retains these two abstracts for a complete examination of their corresponding papers.

Table 13: Precision values for Bag-of-words, n-grams and our approach.

            P5                                       R-precision
Abstracts   Bag-of-words  N-grams  Our approach     Bag-of-words  N-grams  Our approach
A1          1.000         1.000    1.000            0.800         1.000    1.000
A2          0.800         1.000    1.000            0.800         1.000    1.000
A3          0.800         1.000    1.000            0.800         0.900    1.000
A4          0.600         0.400    1.000            0.333         0.556    1.000
A5          0.800         0.600    1.000            0.900         0.800    1.000
A6          1.000         1.000    1.000            0.667         0.778    1.000
A7          0.800         0.800    1.000            0.778         0.667    1.000
A8          0.800         0.800    1.000            0.778         0.556    1.000
A9          0.800         1.000    1.000            0.900         0.900    1.000
A10         0.800         1.000    1.000            0.800         0.900    1.000
A11         1.000         1.000    1.000            0.778         0.889    1.000
A12         0.800         0.800    1.000            0.778         0.667    1.000
A13         0.800         0.800    1.000            0.667         0.667    1.000
A14         1.000         1.000    1.000            0.778         0.667    1.000
A15         0.800         1.000    1.000            0.800         1.000    1.000
A16         1.000         1.000    1.000            0.800         0.900    1.000
A17         0.600         1.000    1.000            0.700         0.900    1.000
A18         0.800         1.000    1.000            0.600         0.800    1.000
A19         1.000         1.000    1.000            1.000         1.000    1.000
A20         0.800         0.800    1.000            0.778         0.667    1.000
Average     0.840         0.900    1.000            0.762         0.811    1.000

- The Bag-of-words approach indicates a match between abstracts A15 and A16: these two abstracts obtain a high similarity whereas their authors use different methods in their contributions. Our approach has the advantage of comparing abstracts at three levels; here, the contribution similarity between A15 and A16 is very low, which means that the methods used by the authors to solve their problems are different. This makes it possible to conclude that, even if these two abstracts present similar contexts, the risk of plagiarism is low.

Table 15: Comparison between Bag-of-words, n-grams and our approach.

Text1           Text2            Our approach                                             Bag-of-words  N-grams
                                 Context   Contribution  Application domain  Global
A1.clustering   A3.clustering    1.000000  0.400673      1.000000            0.622424     0.724688      0.352187
A2.clustering   A10.clustering   0.982456  0.486622      0.112994            0.652692     0.198869      0.050761
A15.clustering  A16.clustering   1.000000  0.188889      0.000000            0.469000     0.470623      0.108580

Our approach assesses the similarity of texts in two steps. The documents are first assigned to the domain ontology that best describes their content; this overall assignment is achieved by a semantic classification process which emphasizes the overall context of the document, deduced from its terms taken together, unlike conventional classifiers that consider words independently of each other. For documents attached to the same ontology, a "local" similarity is then calculated, based on the graphs corresponding to the texts. The enrichment of the graphs through the construction of the semantic perimeter of the texts, and the comparison of their graphs, make it possible to deduce a similarity not explicitly stated in the texts. The similarity calculation for scientific abstracts is refined by dividing their contents into three zones and calculating partial similarity values, which helps to bring out the notions common to both texts. A grouping by context and a ranking in descending order of the global similarity value are achieved by combining the three partial similarities. The objective of our approach is to find suspicious documents; it has the advantage of comparing the content of the documents at three levels, and the examination of the similarity obtained for each zone makes it possible to conclude on the existence of a risk of plagiarism.

5 Conclusion

The approach proposed in this paper is meant to assess text similarity.
This similarity is based on an overall similarity calculation obtained by a classification process. Our classification process is based on domain ontologies and takes into account the relationships between terms relative to their context of appearance in the document. The evaluation of our process showed better results than those of conventional classifiers. The construction of the semantic perimeter and the comparison of the graphs of the texts, based on the domain ontology to which they are attached, make it possible to enrich the graphs and to deduce implicit information. Our approach thus presents the advantage of taking into account the synonymy and polysemy present in a language and of deducing a similarity between two texts that is not explicitly stated in their content.

Assessing the similarity between scientific texts represented by their abstracts is our main interest. In the process of semantic comparison, three distinct parts were defined to structure the abstracts of scientific texts (context, contribution and application domain) and three partial similarities were calculated, so that the comparison of two scientific abstracts is performed at three levels. The global similarity value of two abstracts, calculated by combining the partial similarities, makes it possible to rank the abstracts in descending order of their global similarity, and a threshold applied to the calculated similarities is useful in finding suspicious documents and highlighting a risk of plagiarism. Tests were performed on a set of scientific abstracts. The enrichment of the graphs makes it possible to bring out common notions not explicitly cited. Moreover, dividing the contents of abstracts into three distinct zones helps in extracting the notions relative to the context, the contribution and the application domain, and thus in comparing zones of the same type: an evaluation can be made to determine whether two abstracts deal with the same context, whether their contributions are similar and whether they apply their approach to the same application domain.

The quality of our process depends on domain ontologies, which must cover the entire vocabulary of the knowledge domain represented for the process to be effective. This may constitute a limitation of this work, since the process used does not support the building of domain ontologies; it is, therefore, assumed that they are available. Even if this can be assumed for scientific texts or abstracts structured as shown in this work, the process obviously needs to be refined before it can be used to compare general texts. Indeed, one of the ways of improving our approach is to generalize the concept of semantic perimeter so as to consider any text rather than just scientific abstracts.

6 References

[1] P. Resnik (1999). Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, Vol. 11, Issue 1, pp. 95-130. https://doi.org/10.1613/jair.514
[2] J. Curran (2002).
Ensemble methods for automatic thesaurus extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Philadelphia, Vol. 10, pp. 222-229.
[3] P. Cimiano, S. Handschuh and S. Staab (2004). Towards the self-annotating web. In Proceedings of the 13th International Conference on World Wide Web, New York, USA, pp. 462-471.
[4] Z. S. Harris (1954). Distributional structure. Word, Vol. 10, Issue 2-3, pp. 146–162. https://doi.org/10.1080/00437956.1954.11659520
[5] S. Deerwester, S. Dumais, G. Furnas, T. Landauer and R. Harshman (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, Vol. 41, Issue 6, pp. 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[6] E. Gabrilovich and S. Markovitch (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, pp. 1606–1611.
[7] M. Yazdani and A. Popescu-Belis (2013). Computing text semantic relatedness using the contents and links of a hypertext encyclopedia: extended abstract. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, Beijing, China, pp. 3185–3189.
[8] J. Turian, L. Ratinov and Y. Bengio (2010). Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 384–394.
[9] M. Baroni, G. Dinu and G. Kruszewski (2014). Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Volume 1, Baltimore, Maryland, pp. 238–247.
[10] C. Leacock, G. A. Miller and M. Chodorow (1998). Using corpus statistics and WordNet relations for sense identification. Computational Linguistics, Vol. 24, Issue 1, pp. 147-165.
[11] R. Rada, H. Mili, E. Bicknell and M. Blettner (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, Vol. 19, Issue 1, pp. 17-30. https://doi.org/10.1109/21.24528
[12] Z. Wu and M. Palmer (1994). Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, pp. 133-138. https://doi.org/10.3115/981732.981751
[13] D. C. Howe (2009). RiTa: creativity support for computational literature. In Proceedings of the Seventh ACM Conference on Creativity and Cognition (C&C '09), Berkeley, California, USA, pp. 205-210.
[14] D. Lin (1998). An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, pp. 296-304.
[15] P. Resnik (1995). Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, Vol. 1, Montreal, Quebec, Canada, pp. 448-453.
[16] S. P. Ponzetto and M. Strube (2007). Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research, Vol. 30, Issue 1, pp. 181–212. https://doi.org/10.1613/jair.2308
[17] D. Milne and I. H. Witten (2008). Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, California, USA, pp. 509–518.
[18] M. T. Pilehvar and R. Navigli (2015).
From senses to texts: An all-in-one graph-based approach for measuring semantic similarity. Artificial Intelligence, Vol. 228, pp. 95–128. https://doi.org/10.1016/j.artint.2015.07.005
[19] G. Salton and M. J. McGill (1983). Introduction to Modern Information Retrieval. McGraw-Hill Computer Science Series.
[20] G. Salton (1971). The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice-Hall.
[21] C. J. Crouch, S. Apte and H. Bapat (2002). Using the extended vector model for XML retrieval. In Proceedings of the First Workshop of the Initiative for the Evaluation of XML Retrieval (INEX), Schloss Dagstuhl, pp. 95-98.
[22] E. A. Fox (1983). Extending the Boolean and Vector Space Models of information retrieval with p-norm queries and multiple concept types. PhD thesis, Department of Computer Science, Cornell University.
[23] D. Carmel, Y. Maarek, M. Mandelbrod, Y. Mass and A. Soffer (2003). Searching XML documents via XML fragments. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 151–158.
[24] M. Fuller, E. Mackie, R. Sacks-Davis and R. Wilkinson (1993). Structural answers for a large structured document collection. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Pittsburgh, pp. 204–213.
[25] T. Schlieder and H. Meuss (2002). Querying and ranking XML documents. Journal of the American Society for Information Science and Technology, Vol. 53, Issue 6, pp. 489–503. https://doi.org/10.1002/asi.10060
[26] T. Joachims (1997). A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, Tennessee, pp. 143-151.
[27] S. Jaillet, A. Laurent and M. Teisseire (2006). Sequential patterns for text categorization. Journal of Intelligent Data Analysis, IOS Press, Vol. 10, Issue 3, pp. 199–214.
[28] P. Soucy and G. W. Mineau (2001). A Simple k-NN Algorithm For Text Categorization. In Proceedings of the IEEE International Conference on Data Mining, San Jose, USA, pp. 647–648.
[29] A. Hotho, A. Maedche and S. Staab (2002). Ontology-based Text Document Clustering. KI, Vol. 16, Issue 4, pp. 48-54.
[30] S. B. Kotsiantis (2007). Supervised Machine Learning: A Review of Classification Techniques. Informatica, Vol. 31, Issue 3, pp. 249-268.
[31] Y. Yang and X. Liu (1999). A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, pp. 42–49.
[32] T. Joachims (1998). Text categorization with support vector machines: learning with many relevant features. In Proceedings of ECML-98, 10th European Conference on Machine Learning, Chemnitz, Germany, pp. 137–142.
[33] E. Gabrilovich and S. Markovitch (2005). Feature Generation for Text Categorization Using World Knowledge. In Proceedings of IJCAI 2005: the Nineteenth International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, UK, pp. 1048-1053.
[34] A. Hotho, S. Staab and G. Stumme (2003). Ontologies Improve Text Document Clustering. In Proceedings of ICDM: 3rd IEEE International Conference on Data Mining, Melbourne, FL, USA, pp. 541-544.
[35] H. H. Tar and T. T. Soe Nyunt (2011). Ontology-Based Concept Weighting for Text Documents.
In Proceedings of the International Conference on Information Communication and Management, IPCSIT Vol. 16, IACSIT Press, Singapore.
[36] B. Pincemin (2000). Similarités texte–texte : expérience d'une application de diffusion ciblée et propositions. In Matemáticas y Tratamiento de Corpus, Actes du 2ème séminaire de l'Ecole interlatine de linguistique appliquée, San Millán de la Cogolla, Logroño, Espagne. Logroño: Fundación San Millán de la Cogolla, 2002, pp. 35-52.
[37] K. Beyer, J. Goldstein, R. Ramakrishnan and U. Shaft (1999). When is 'nearest neighbor' meaningful? In Proceedings of ICDT, International Conference on Database Theory, pp. 217-235. https://doi.org/10.1007/3-540-49257-7_15
[38] U. L. D. N. Gunasinghe, W. A. M. De Silva, N. H. N. D. de Silva, A. S. Perera, W. A. D. Sashika and W. D. T. P. Premasiri (2014). Sentence similarity measuring by vector space model. In Proceedings of the 14th International Conference on Advances in ICT for Emerging Regions (ICTer), Colombo, Sri Lanka, pp. 185-189.
[39] Y. Liu, C. Sun, L. Lin, Y. Zhao and X. Wang (2015). Computing Semantic Text Similarity Using Rich Features. In Proceedings of PACLIC: 29th Pacific Asia Conference on Language, Information and Computation, Shanghai, China, pp. 44–52.
[40] J. Lewis, S. Ossowski, J. Hicks, M. Errami and H. R. Garner (2006). Text similarity: an alternative way to search MEDLINE. Bioinformatics, Vol. 22, Issue 18, pp. 2298–2304. https://doi.org/10.1093/bioinformatics/btl388
[41] E. Yamamoto, M. Kishida, Y. Takenami, Y. Takeda and K. Umemura (2003). Dynamic programming matching for large scale information retrieval. In Proceedings of the Sixth International Workshop on Information Retrieval with Asian Languages, Vol. 11, Sapporo, Japan, pp. 100–108. https://doi.org/10.3115/1118935.1118948
[42] W. Ma and T. Suel (2016). Structural Sentence Similarity Estimation for Short Texts. In Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, Florida, pp. 232–237.
[43] D. Dudognon, G. Hubert and B. Ralalason (2010). Proxigénéa : une mesure de similarité conceptuelle. In Proceedings of the Colloque Veille Stratégique Scientifique et Technologique (VSST 2010).
[44] M. Baziz, M. Boughanem, H. Prade and G. Pasi (2005). A Fuzzy Set Approach to Concept-based Information Retrieval. In Proceedings of the 4th Conference of the European Society for Fuzzy Logic and Technology and the Eleventh Rencontres Francophones sur la Logique Floue et ses Applications (Eusflat-LFA 2005 joint conference), Barcelona, Spain, pp. 1287–1292.
[45] K. M. Shenoy, K. C. Shet and U. D. Acharya (2012). Semantic plagiarism detection system using ontology mapping. Advanced Computing: An International Journal (ACIJ), Vol. 3, Issue 3, pp. 59–62.
[46] L. Zhang, C. Li, J. Liu and H. Wang (2011). Graph-Based Text Similarity Measurement by Exploiting Wikipedia as Background Knowledge. International Journal of Computer, Electrical, Automation, Control and Information Engineering, Vol. 5, Issue 11, pp. 1328–1333.
[47] W. Jin and R. K. Srihari (2007). Graph-based Text Representation and Knowledge Discovery. In Proceedings of the 2007 ACM Symposium on Applied Computing, Seoul, Korea, pp. 807-811. https://doi.org/10.1145/1244002.1244182
[48] P. Wang, H. Zhang, B. Xu, C. Liu and H. Hao (2014). Short Text Feature Enrichment Using Link Analysis on Topic-Keyword Graph. In Proceedings of Natural Language Processing and Chinese Computing, Springer, pp. 79–90.
[49] J. Leskovec and J. Shawe-Taylor (2005). Semantic text features from small world graphs.
Workshop on Subspace, Latent Structure and Feature Selection Techniques: Statistical and Optimization Perspectives, Bohinj.
[50] S. Brin, J. Davis and H. Garcia-Molina (1995). Copy detection mechanisms for digital documents. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, pp. 398–409. https://doi.org/10.1145/223784.223855
[51] C. Basile, D. Benedetto, E. Caglioti and M. D. Esposti (2008). An example of mathematical authorship attribution. Journal of Mathematical Physics, Vol. 49, Issue 12, pp. 125211-1–125211-20. https://doi.org/10.1063/1.2996507
[52] C. Basile, D. Benedetto, E. Caglioti, G. Cristadoro and M. D. Esposti (2009). A plagiarism detection procedure in three steps: selection, matches and squares. 3rd Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse, PAN 2009.
[53] B. Stein and S. M. zu Eissen (2005). Near Similarity Search and Plagiarism Analysis. In Proceedings of the 29th Annual Conference of the GfKl, Springer, pp. 430-437.
[54] R. Lukashenko, V. Graudina and J. Grundspenkis (2007). Computer-Based Plagiarism Detection Methods and Tools: An Overview. In Proceedings of the 2007 International Conference on Computer Systems and Technologies - CompSysTech'07, Bulgaria, article N° 40. https://doi.org/10.1145/1330598.1330642
[55] K. Vani and D. Gupta (2015). Investigating the Impact of Combined Similarity Metrics and POS Tagging in Extrinsic Text Plagiarism Detection System. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics (ICACCI), Kochi, India, pp. 1578-1584.
[56] A. H. Osman, N. Salim, M. S. Binwahlan, H. Hentably and A. M. Ali (2011). Conceptual similarity and graph-based method for plagiarism detection. Journal of Theoretical and Applied Information Technology, Vol. 32, Issue 2, pp. 135-145.
[57] D. Rusu, B. Fortuna, M. Grobelnik and D. Mladenić (2009). Semantic Graphs Derived from Triplets with Application in Document Summarization. Informatica, Vol. 33, Issue 3, pp. 357–362.
[58] S. Iltache, C. Comparot, M. Si Mohammed and P. J. Charrel (2016). Using domain ontologies for classification and semantic interpretation of documents. In Proceedings of ALLDATA 2016: 2nd International Conference on Big Data, Small Data, Linked Data and Open Data, pp. 76-81.
[59] R. Bendaoud (2009). Analyses formelle et relationnelle de concepts pour la construction d'ontologies de domaines à partir de ressources textuelles hétérogènes. PhD thesis, Henri Poincaré University, Nancy 1.
[60] N. Fuhr and K. Grossjohann (2001). XIRQL: a query language for information retrieval in XML documents. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, Louisiana, USA, pp. 172-180.
[61] E. Omodei, Y. Guo, J. P. Cointet and T. Poibeau (2014). Analyse discursive automatique du corpus ACL Anthology. In Actes de la 21ème conférence Traitement Automatique des Langues Naturelles, Marseille.
[62] Y. Guo, A. Korhonen and T. Poibeau (2011). A Weakly-supervised Approach to Argumentative Zoning of Scientific Documents. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, UK, pp. 273–283.
[63] B. Magnini and G. Cavaglià (2000). Integrating Subject Field Codes into WordNet.
In Proceedings of LREC-2000, Second International Conference on Language Resources and Evaluation, Athens, Greece, pp. 1413-1418.
[64] C. Fellbaum (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
[65] K. Toutanova, D. Klein, C. Manning and Y. Singer (2003). Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL, pp. 252-259.
[66] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann and I. H. Witten (2009). The WEKA Data Mining Software: An Update. SIGKDD Explorations, Vol. 11, Issue 1, pp. 10-18. https://doi.org/10.1145/1656274.1656278