Slovenski jezik – Slovene Linguistic Studies 9 (2013): 163–183
Nina Ledinek and Andrej Perdih
Inštitut za slovenski jezik Frana Ramovša ZRC SAZU, Ljubljana
Creating XML Schemas for Lexicographical Projects:
the Case of the Dictionary of the Slovene Literary
Language of the 16th Century
Članek obravnava strategije oblikovanja XML-shem za slovarske projekte na primeru kom-
pleksne strukture Slovarja slovenskega knjižnega jezika 16. stoletja in pojasnjuje, kako različ-
ne možnosti izdelave XML-sheme vplivajo na leksikografsko delo in kasnejšo uporabo slovar-
ja. Prikazani so različni splošni vidiki izdelave shem za slovarske projekte, ki jih je smiselno
upoštevati: vsebinski, praktični in tehnični vidik. Na tej osnovi je utemeljena odločitev za način
formalnega opisa strukture v danem računalniškem formatu.
The article discusses the strategies applied for creating XML Schemas in dictionary projects,
based on experience gained during creation of the schema of the Dictionary of the Slovene
Literary Language of the 16th Century, and it explains how the different possibilities of the
construction of XML Schemas influence the lexicographical work and the subsequent use of
the dictionary. Three general aspects to be considered in designing schemas for dictionary
projects are described in the article: the content aspect, the practical aspect, and the technical
aspect. On this basis, the decision on the formal description method of the structure in a given
schema definition language is made and justified.
1 Introduction
In a dictionary database written in the currently widely adopted XML format, the
structure of the lexicographical data is described by a schema. Regardless of the
schema definition language (DTD, XML Schema, RELAX NG) supported by the
lexicographical software, it is essential to create a schema that, firstly, corresponds
to the lexicographical content requirements, secondly, adequately structures the data
set, and, thirdly, enables the employment of the program’s solutions, which facilitate
and accelerate the work done by lexicographers. If the construction of the schema
leans towards maximizing the simplicity, transparency, and intuitive quality of the
structure, while always showing consistency with the conceptual requirements, lexi-
cographers can more easily focus on its contents and lose less time with technical is-
sues. Moreover, dictionary users will subsequently find advanced searches to be more
straightforward and easier to use. It is also important to take into consideration the
subsequent use of data in other reference works as well as for language technology
164 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
needs or data conversion. The different schema implementations of the chosen dic-
tionary structure can impact any one of these factors. For these reasons it is especially
important to devote a considerable amount of time to the construction of the schema
and the refinement of the concept in the case of dictionaries containing large amount
of different types of data. Last but not least, the creation of schemas usually reveals
gaps in the adopted dictionary concept as regards formal requirements, and may also
disclose content uncertainties of the designed dictionary concept.
The COBUILD project was one of the first lexicographical projects, dating back
to the early eighties of the last century, in which the data was tagged in a similar way
as today, and in which the lexicographical material was prepared with the help of
specialized electronic lexicographical tools. The lexicographical database was struc-
tured and written in a standard textual format ASCII (cf. Clear 1987: 51, Krek 2004:
4). The fact that standards for encoding lexicographical information have lately been
established shows that modern lexicography was significantly influenced particu-
larly by the fact that lexicographers usually do not perceive a dictionary (initially)
as a book anymore, but rather as an extendable machine-readable database structure,
which can be used for different purposes and where the data is organized in a hier-
archical order, marked (according to the standard) and interconnected, thus enabling
the interchangeability of data, different content views, and a much faster and more
accurate lexicographical data search (cf. Abel 2012, Smrž 2001).
In the core part of this article we look at three aspects that should be considered
when creating a schema: the content aspect, the practical aspect, and the technical
aspect. In section 4 we demonstrate two of the possible solutions for designing XML
Schemas of the morphological header in the Dictionary of the Slovene Literary Lan-
guage of the 16th Century while considering these three aspects. Furthermore, we
consider and evaluate the two solutions based on the advantages and disadvantages
of their use, and, finally, we justify our selection of the most suitable structure.
2 Using XML Format in Lexicography
Most dictionaries are known for their relatively complex structure (Abel 2012: 84),
due to the fact that they include a large amount of various linguistic data. Different
functions of the individual segments of dictionary entries are usually pointed out via
different text style, the use of symbols and punctuation, as well as clearly defined
subdivisions of dictionary entries. Even dictionaries with relatively simple micro-
structure usually contain several types of data, e.g., lemma, part of speech, and ir-
regular forms, whereas dictionaries with relatively complex microstructure usually
include additional information such as inflections, pronunciation, lexeme usage, col-
locations, synonyms, phraseology, etymology, frequency of usage, etc. Due to both
an enormous amount of diverse data and the need for its transparency, it is essential
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 165
that the different language data types in hierarchically structured machine-readable
dictionary databases are clearly marked with unique and meaningful tags.
The standard mark-up language suitable for marking dictionaries and other lan-
guage data, which has established itself in the past years is XML (eXtensible Markup
Language). The tree structure of the XML file is suitable for encoding hierarchically
structured data, thus allowing the user to apply advanced searches of the database
as well as a wide variety of complex types of data processing. Unlike HTML, meta-
tags in XML format are not specified in advance, but can be set by the user almost
arbitrarily. By default, Unicode character encoding is used, which, together with the
appropriate Unicode font, provides ample opportunity for the use of a vast range of
characters. XML files are used by a number of lexicographical programs. At the Fran
Ramovš Institute of Slovene Language SRC SASA iLEX is being used.1 Among the
well-known ones are also ABBYY Lingvo Content,2 IDM DPS3 and TshwaneLex.4
Termania5 has been developed in Slovenia. XML files are plain text files. Their great-
est virtue is that they are transferrable between different programs, operating systems,
and devices, which is essential in the long term, because the dictionary written in
XML format can be used in any program that is able to read plain text files.
XML format is especially suitable for structuring data types and their long term
storage by way of allowing the data to be portable and interchangeable. However, it
is not primarily intended for the display of data. For a comprehensive use of XML
files, various XML languages were developed inside the XML family. XSLT lan-
guage is meant for converting XML documents into other formats, which can include
changing the XML document or converting it into formats suited for screen display
or for print. HTML and XHTML are both formats used for screen display (the most
common example is web sites), whereas XSL:FO is suitable for converting data into
PDF, i.e., print. These kinds of conversions are almost indispensable for the user, be-
cause, as mentioned before, XML is a markup language and not easy to read, whereas
people want to see a formatted text on screen or on paper. XQuery is used for data
search in XML files; XPath is used for navigation, while the structure of XML files
is defined by different schema languages: DTD (.dtd), XML SCHEMA (.xsd), and
RELAX NG (.rng). Schematron represents a different type of schema, which verifies
if the content of an element is allowed, depending on the content of another element
(cf. Berglund 2006, Birbeck et al. 2010, Bray et al. 2006, Hunter 2007).
1 http://www.emp.dk/
2 http://www.abbyy.com/lingvo_content/
3 http://www.idm.fr/products/dictionary_writing_system_dps/27/
4 http://tshwanedje.com/tshwanelex/
5 http://www.termania.net/
166 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
3 The Dictionary’s Structure and the Schema for XML Files6
The creation of a dictionary concept or the appropriate structure of dictionary entries
is one of the most challenging steps before commencing with editorial work. It is cru-
cial to design a dictionary structure, which defines elements of the dictionary entries
and the allowable sequences in which they can occur and as such serve as a basis for
any type of dictionary entries, despite the obvious fact that not all of their available
structure components will appear in every dictionary entry. The task of creating such
a structure is usually especially demanding for certain types of specialized dictionar-
ies (as is the case with the dictionary presented in this article), in particular for dic-
tionaries that aspire to display a variety of different data.
The schema describes the formal structure of the dictionary database in XML
format. It determines which elements are allowed in the dictionary database, the hi-
erarchical relations between them and their order, the possibilities of their different
combinations or exclusions, as well as the number of element occurrences when we
wish to specify more than one consecutive identical element (Hunter et al. 2007). The
schema also provides the formal content requirements for elements: whether any kind
of content is allowed, or, if there is a restriction to the list of possible choices (drop-
down menu), or, if there is a length restriction regarding the content, if the content is
restricted to digits, if elements can have attributes, etc. A schema is therefore a kind
of transformation of a dictionary concept into computer language. As regards the dic-
tionary content, though, it helps only with certain technical requirements and can in
no case prevent content inadequacy or inconsistency with the conceptual guidelines
or the linguistic reality.
There are several standard schema definition languages to describe XML docu-
ments; the two most established and widespread today are DTD and XML Sche-
ma.7 Even though DTD is slightly less popular than XML Schema, which enables
a somewhat more flexible and detailed description, using one or the other schema
definition language in practice usually depends on the lexicographical program,
taking into account that some programs do not support different schema definition
languages.8
For various reasons, such as interchange, longevity, simpler processing, glob-
al standards have been established in regard to language data format for encoding
6 The lexicographical material described below is the analyzed material being edited by
the research group composed of Kozma Ahačič, Metod Čepar, Alenka Jelovšek, Andreja Le-
gan Ravnikar, Majda Merše, Jožica Narat and France Novak. The XML Schema was prepared
in close cooperation with Kozma Ahačič.
7 XML Schema (.xsd file) and schema are not synonymous, because XML Schema is one of the
schema definition languages just like DTD, which describes the structure of the XML file content
(Hunter et al. 2007: 145). For more information on XML Schema see Thompson et al. 2004.
8 IDM DPS and Tshwanelex support DTD, iLEX supports RELAX NG and XML Sche-
ma, Termania supports XML Schema.
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 167
dictionary or lexicon databases. Here we will mention Lexical Markup Framework
(LMF),9 Lexical Interchange Format Standard (LIFT),10 TEI Guidelines.11 This list,
however, is not exhaustive. Despite the fact that the data could be written in at least
one of the standard formats, we have not yet decided to take this step. The decision
not to follow any of these standards was based on the judgment that it would be less
sensible to consider this kind of format, due to the extreme complexity of the dic-
tionary’s microstructure, which would cause great difficulty in following a standard
format. Despite its flexibility it would be difficult to follow the standard encoding. A
question to take into consideration is also whether the dictionary encoding standards
will be altered by the time the dictionary is finished. In making our decision we con-
sidered the fact that the dictionary database is primarily intended for specialized users
for language research, whereas to a lesser extent it will also be useful for 16th century
Slovene texts processing. Also, the dictionary concept was designed in a time when
these standards were not yet ubiquitous or established and therefore adjusting the
structure to any of the standards is more time consuming.
Besides the fact that the software for lexicographical work enables editors to edit
the language data and visually represent it, it is one of the basic tasks of the software
to ensure consistency between the dictionary entries and the schema and to point out
the irregularities in the structure as well as the formal content of the elements. It is
therefore a lexicographical tool, which helps to ensure the consistency of individual
dictionary entries and the entire structure of the dictionary. This is especially crucial
for larger projects, in which numerous lexicographers are involved. Even though the
(base) schema is created before the editing process takes place, it is customary that
the schema is partially modified and elaborated during the preparation of the first
(test) dictionary entries. Lexicographers and computer experts must be attentive to
any possible inadequacies in the dictionary structure and consequently the schema,
for in doing so they can eliminate any confusion, point out the technical shortcom-
ings, or coordinate any different existing views on the content of the emerging dic-
tionary.
In the process of preparing XML Schemas for dictionaries at the Fran Ramovš
Institute for Slovene Language SRC SASA it has become clear that it is reasonable
to consider several aspects, which can affect the creation of XML Schemas. These
aspects are:
•• the content aspect,
•• the practical aspect,
•• the technical aspect.
9 http://www.lexicalmarkupframework.org/
10 http://code.google.com/p/lift-standard/
11 http://www.tei-c.org/release/doc/tei-p5-doc/en/html/DI.html
168 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
3.1 The Content Aspect
The content aspect is in fact a conceptual requirement for the dictionary structure,
meaning that it should be projected as perceived by lexicographers and as directly
as possible onto an intuitively understandable hierarchical data structure, which is
defined by a schema. In practice, the aspect is most clearly articulated when deter-
mining and designating the dictionary parts or the schema elements and their allowed
content and when defining the hierarchical relations between elements. The content
aspect is the critical aspect to be considered when it comes to the integration of multi-
ple elements into the parent element. This is true if it corresponds to the lexicographi-
cal view of the structure, if it is useful for structuring the data, or for transparency
purposes. To put it simply, the content aspect is most evident when making the deci-
sion whether, for instance, the element header should contain all the sub elements the
lexicographer sees in the header but not the ones he sees in the section where word
senses are described.
An example of a simple XML document is the dictionary entry almužna shown
below, whose content is taken from the analyzed material of the Dictionary of the
Slovene Literary Language of the 16th Century:12
almužna
almožen
rod. ed.
Figure 1
almužna
almožen
Gen. sg.
Figure 1: English translation
Five different elements are used in this case, where the type of their content is already
clear from their description. Each element starts with a start tag, e.g.,
(lemma), and ends with an end tag, e.g., ; in between the tags we have
the text or other hierarchically subordinate elements.
12 The right hand indent is used to indicate child elements.
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 169
3.2 The Practical Aspect
In a dictionary with a complex microstructure, the number of different elements in
a schema can quickly reach a three digit number. Certain elements can also occur in
several different places at the same time, which makes sense in terms of content but can
differ just enough in the relations between the elements and their order to overburden
the lexicographer's memory. It is therefore reasonable to consider the practical aspect
of the lexicographical schema since it must not only be logical in content but also man-
ageable, and it must coincide with the lexicographical process as naturally as possible.
The question is whether lexicographers can sufficiently manage the complex structure
without it being an obstacle in their work. It is equally important, of course, that the data
is arranged in a way so that the end users of the dictionary will be able to understand it.
The hierarchically deeper structure has an advantage over the flat or less hierar-
chically developed structure because more data combined in the parent element can
be considered, thus enabling an easier search and a more detailed subsequent data
processing. It is also true in this kind of structure that the data can, to a great extent, be
structured according to the lexicographer’s understanding of the dictionary structure.
The drawback of an excessively branched structure is, however, that the orientation
within the dictionary entry is more difficult (consequently one can extend the assem-
bly time for entries). For the most part, the lexicographer focuses his attention on the
hierarch’s accuracy instead of its content; therefore, a highly complex structure can
become more demanding for actual work.
3.3 The Technical Aspect
When considering how to enhance both the usability of a (completed) dictionary and
the optimal amount of work put into its elaboration, the structural relationships be-
tween the elements of the schema are not of sole importance. There are other options
we want to make possible. A particularly important criterion is the references to vari-
ous elements in the dictionary’s microstructure, (for example if we want to make the
electronic version clickable, or if we want the same data to be displayed in more than
one location, if relevant), using mixed content, etc. It is also worthwhile to give some
thought to the publishing of the dictionary. If the dictionary is available in electronic
form only, it is not worthwhile to consider some of the adjustments a book edition
might require. From the technical aspect a factor which might impact the structure of
the schema is the lexicographical program itself – according to our experience, some
adjustments in schema might be welcomed, for example, because the user interface
influences the decision about whether it would be better to combine repetitive ele-
ments into one parent element or not. This is relevant because at one point we want to
shrink the content on the screen or hide certain hierarchical elements to have a better
overview of the other content we want to focus our attention on at that particular time.
Programs tackle transparency issues of the interface in different ways; therefore, it is
indispensable to consider the technical aspect.
170 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
The hierarchical organization of elements is of key importance when searching for
data, rearranging, converting, or using it in any other way, while the occurrence of
the same element in different places can, due to various combinations of the oc-
currences, make the search for relevant data very complicated. At the same time
the risk of producing inadequate search results increases. The search for existing
content is an important aspect in lexicographical work, giving it insight into the dic-
tionary’s already verified solutions. It also helps to maintain the consistency of the
dictionary database and is a useful supplement of the dictionary Style Guide. Once
the dictionary is complete, the dictionary database can also be used for linguistic or
metalexicographic research and as a starting point for the creation of other linguis-
tic resources, which is why the method of data organization can be very important
in this aspect as well.
In terms of further use of dictionary databases, the use of elements with mixed
content is equally important. This kind of element can contain text and child elements
along with their own proper content. One of the reasons for the use of mixed content
is the entry of a differently formatted text, for example, superscript and subscript
numbers (m3, CO2). In regard to content, the use of mixed elements is more important
for elements in the structure where their sub elements and text are “mixed” together
due to the very content this kind of element contains. A typical example is the ety-
mological section, where the reconstructed forms, the meanings of the words, foreign
language words and their meanings, tags for languages they originate from, and the
“normal” explanatory text between them occur in a rather unpredictable sequence.
From a technical point of view, caution is generally advised when using mixed con-
tent because it can lead to the subsequent processing of this data being significantly
more difficult than otherwise.
4 Creating the Structure of the Morphological Header in XML Schema
Based on the three aspects presented, we will try to illustrate two different approach-
es applied in the construction of the morphological header of the Dictionary of the
Slovene Literary Language of the 16th Century. The dictionary is one of the major
lexicographical projects in Slovenia. The preparations for the lexicographical work
(including collecting the material) began in 1973, and in 2001 a test fascicle of the
dictionary (Merše et al. 2001) was published. In 2011 a publication of Vocabulary
of Slovene Literary Language of the 16th Century (Ahačič et al. 2011) was released,
which is a compilation of 16th Century Slovene words with basic morphological in-
formation. This vocabulary will be lexicographically treated in the emerging diction-
ary. The written language of the time however was not standardized, which renders
visible the dialectical elements used by the authors, especially regarding the phono-
logical and morphological characteristics. Furthermore, the issue of non-standardized
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 171
orthography in the texts causes variation in the recording of some phonemes as well
as in the use of capitalization.13 The dictionary schema must be structured in a way to
enable recording of every one of the different variants or at least to draw attention to
them, at several structural levels.
The morphological header provides information about word forms of the lemma
that appear in the texts, for example, the singular, dual, and plural forms for nouns in
different cases, and different forms for verbs including number, person, tense, mood,
etc. The writers of these entries also add references for each word about the author
and the text where the word is documented, as well as notes about any oddities that
appear in the word forms. Since the dictionary is intended mostly for specialized
linguists, the only word forms that appear in the dictionary are ones from which the
paradigmatic model the word belongs to can be established.14 This is due to the large
number of different existent word forms. For example, in the case of nouns, the first
manifested form in the singular, dual and plural is indicated, whereas, the inflected
forms for each declension are indicated only if they deviate from the expected para-
digm, as presented in the front matter of the dictionary. The same is true for other
parts of speech. From the lexicographical point of view it is therefore quite evident that
we wish to produce a structure, which will enable the entry of hierarchically classified
information about the morphological categories and the actual forms of headwords. In
the case of adjectives, we have six levels of information in morphological header: (1)
adjective > (2) indefinite form > (3) masculine > (4) singular > (5) dative > (6) entry of
documented form. The same is true for other parts of speech that take on inflections. If
our objectives from the lexicographical point of view are perfectly clear, there exist at
least two options on how to create a schema, which has to accept all possible forms of
paradigms for relevant parts of speech, along with the information about the document-
ed forms. We expect more than 500 combinations for the morphological categories. At
the same time, we aspire to have the simplest structure possible.
The number of expected theoretically possible word forms is undoubtedly sig-
nificantly greater than the actual number of the documented words in the historical
texts for individual lemmas; therefore, we can assume that in the structure of the
dictionary entry only some of the spaces in the morphological header for individual
lemmas will be complete. Even though it is useful to be aware of this fact, it must not
13 Digitalization of 16th century texts has not been as successful as of texts printed after
1850 and is therefore more time-consuming and costly (Erjavec et al. 2011: 121). For these
reasons, no electronic corpus of 16th century Slovene texts has been compiled yet and a lot of
manual work is required for the compilation of the dictionary. The situation was more success-
ful in the recent IMPACT project which resulted in (mainly) 18th and 19th century language
resources (http://nl.ijs.si/imp/). A small sample of 16th century texts is also included.
14 This solution differs from the method adopted by the test fascicle dictionary (Merše et
al. 2001), where all manifested forms are recorded, because it would significantly slow down
the two-decades-long dictionary production.
172 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
affect the process of creating the schema. We have applied two different approaches
for creating a morphological header schema and consequently devised two: a general
schema, and an explicit schema.
4.1 The General Schema
The first option is to create a simple and general schema, which provides an ad-
equate number of hierarchically classified general data containers, i.e., elements.
This means that we only need a small number of hierarchically classified elements
of the schema because the distinctive value, i.e., relevant morphological category,
is given in the form of text (selection from the dropdown list of options). The
advantage of this kind of schema is its universality for all parts of speech and its
extreme simplicity. The number of levels in the morphological header from top to
bottom is 6. In addition there is a limited number of places for data entry – all the
information for one form can be entered with a maximum of six (less in some parts,
of course) shifts in level, depending on the part of speech. There are 5 different
elements for data entry, while the number of elements inside the morphological
header, which contain only child elements, is 6. It should be noted, however, that
the information for the part of speech and its category is given in the normal header;
therefore, it can be omitted here.
The drawback in this type of schema is, however, that the exact location where
specific information can be found is not explicitly clear, which makes such schemas
non-intuitive for new participants in the dictionary making process, as well as for
subsequent dictionary searches. In addition, when making a dictionary the data of
this type of structure is harder to maintain. Furthermore, the potential treatment of
similar information should rely on information outside the morphological header;
otherwise, it is not successful. General schema’s biggest problems arise from the
fact that it is necessary to know exactly at what level an indication is. For example,
the grammatical category of number, since it may vary between different parts of
speech. This can be very challenging to memorize for lexicographers and advanced
dictionary users.
The following example shows a truncated record of the morphological header
for the adjective bel (English white) in XML form, corresponding to the schema in
question. The structure is, for transparency purposes, slightly shortened.15
15 Element i has been added, which signifies that its content in the historical texts may be
written with different letters, while in the dictionary only one letter is recorded as a generali-
zation. This is done due to the fact that providing all possible variants of this type would result
in an overload of spelling variants in the morphological section of the entry and consequently
would draw user's attention to spelling instead of morphology. The information on the possible
realizations of the letter in the element i will be given in the introduction or in the form of hints.
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 173
nedoločna oblika
m
ednina
imenovalnik
b | é/ej/e/ee/ee | l
rodilnik
b | é/e/ej | liga
dvojina
imenovalnik
bela
[...]
množina
imenovalnik
b | e/é/ej | li
[...]
določna oblika
[...]
Figure 2
174 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
English translation of terms in Figure 2:
določna oblika definite form
dvojina dual
ednina singular
imenovalnik nominative
m masculine
množina plural
nedoločna oblika indefinite form
oblikoslovno_zaglavje morphological header
OZ + number morphological header level
OZ_oblika morphological header – form
pregibno_skupina declinable – group
rodilnik genitive
seznam list
In this first approach at creating a schema a choice between two options has to be
made on the first level: (1) pregibno_skupina (declinable_group) and (2) nepregib
no_skupina (indeclinable_group).
•• At (2) nepregibno_skupina (indeclinable_group) on the second level in the in-
flected part we have to choose OZ_oblika (morphological header_form), which
is the end element for the entry of the treated form, in order to be consistent with
the location of the information.
•• At (1) pregibno_skupina (declinable_group) we reach the only option on the
second level, OZ_1. Within OZ_1 on the third level, we have seznam1 (list1),
which contains a list of all possible information on this level. We can then choose
OZ_2, OZ_3 or OZ_4.
•• If we choose OZ_4 on the third level, we choose seznam4 (list4) and OZ_obli
ka (morphological_header_form) on the fourth level, where we enter the actual
form of the word.
•• If we choose OZ_3 on the third level, we then have seznam3 (list3) on the fo-
urth level, where we choose the information from the menu, and OZ_4, which
on the fifth level contains seznam4 (list4), and OZ_oblika (morphological_hea
der_form), where we enter the actual form of the word.
•• If we choose OZ_2 on the third level, we then have seznam2 (list2) on the fourth
level, where we choose the information from the menu. We can then choose
OZ_3 or OZ_4, where both have the same structure as mentioned above.
•• With all these choices multiple repetitions in sequential order are possible for
OZ_2, OZ_3 or OZ_4, given that the hierarchically superior data is applicable
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 175
to all subordinate ones, for example, for OZ_4, which contains the nominative
form, followed by another OZ_4, which contains the genitive form.
The editorial process is therefore the following: after our decision on the declinabil-
ity of the word in the declinable section, we first make a decision on level OZ_1 and
its seznam1 (list1). Depending on the number of hierarchical data, we then continue
towards OZ_4 (we only use the intermediate OZ_2 and OZ_3, if we have to enter this
much data) where we find seznam4 and the end element OZ_oblika (morphological_
header_form), where we enter the actual form of the headword. With the elements
that have the same name, but different information in seznam1 (list1), seznam2, sez
nam3 and seznam4 we can enter information regardless of the part of speech. The
schema is therefore simple; the only valuable thing to know is how many steps it
takes to enter the information about the form – which differs for different parts of
speech.
4.2 The Explicit Schema
The second option is to create a structure, where the data is more strongly embedded
in the hierarchy of the structure, which then simplifies complex searches and data
processing. It also leaves no doubts for lexicographers in the data entry process. Con-
trarily to the first option, the disadvantage of this type of structure is its complexity
because the number of different elements in the schema is greatly increased, which
can consequently overburden the dictionary writer at first contact with the schema
– the number of different places for data entry is extremely high, 552 in 30 ele-
ments with different names on the lowest hierarchical level, where the details of the
manifested form are in fact entered. The difference between the numbers means that,
for example, the element imenovalnik (nominative), where the nominative form is
entered, can occur in various places in the structure, depending on the part of speech,
number, gender, etc., of the treated form. Other information, which does not consti-
tute the actual form of the word, is given with the name of the element, which con-
tains the given form or is its parent element (in the first schema this data is selected
from the lists). The hierarchical depth of this type of schema is five or less, which is
similar to the first option (six or less).16
The following example demonstrates the truncated record of the morphologi-
cal header for the adjective bel (English white) in XML form, corresponding to the
schema in question. The structure is, for transparency purposes, slightly shortened.17
16 Some elements in the morphological categories can, for practical reasons, be named
slightly differently than what is common practice in modern linguistic descriptions, given that
older word forms in old language are, from our point of view, probably unexpected due to the
fact they were not codified and due also to their transiency.
17 Cf. footnote 14.
176 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
b | é/ej/e/ee/ee | l
b | é/e/ej | liga
[...]
bela
[...]
b | e/é/ej | li
[...]
[...]
Figure 3
English translation of terms in Figure 3:
določna_obl definite form
dvojina dual
ednina singular
imenovalnik nominative
m masculine
množina plural
nedoločna_obl indefinite form
oblikoslovno_zaglavje morphological header
pridevniško adjectival
rodilnik genitive
In this explicit approach of making the schema the decision for one of the eight op-
tions takes place on the first level: (1) nepregibno (indeclinable), (2) samostalniško
(nounal), (3) pridevniško (adjectival), (4) števnik (numeral), (5) glagol (verb), (6)
posamostaljeno (nominalized), (7) izpridevniški_prisl (deadjectival_adverb), (8)
pregibni_povedkovnik (declinable_predicative).
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 177
•• At (1) nepregibno (indeclinable) this is already the end element, where we enter
the given form.
•• At (2) samostalniško (nounal) we can choose nesklonljivo (indeclinable) on the
second level. We can also choose one or all possible numbers (ednina, dvoji
na, množina) (singular, dual, plural), for which we choose the declension on
the third level, where we enter the given form (imenovalnik, rodilnik, dajalnik,
tožilnik, mestnik, orodnik) (nominative, genitive, dativ, accusative, locative, in
strumental).
•• At (3) pridevniško (adjectival) we have the option samo_ena_oblika (just_one_
form) or nedoločna_obl (indefinite_form) and določna_obl (definite_form) on
the second level, depending if there is a distinction between the definite or indefi-
nite form of the adjective, or if there is no distinction. On the same level it is also
possible to have elements primernik (positive_form), presežnik (comparative_
form) and nesklonljivo (indeclinable). With all of them (except the last one) we
have to choose the grammatical gender (m, ž, s) (masculine, feminine, neutral) on
the third level, the number for each gender on the fourth level (ednina, dvojina,
množina) (singular, dual, plural), while the fifth level has a set of declensions,
where we enter the given form.
•• At (4) števnik (numeral) on the second level we choose one of the gender possi-
bilities or the element m_ž_s (m_f_n), if the numeral does not indicate the gender.
For each gender we then choose the number on the third level, and the declen-
sion on the fourth level. With m_ž_s (m_n_f) we can choose the element osnov
na_oblika_števnika (basic_form_of number) on the third level, which is then
used instead of the nominative and/or accusative and/or the undeclined form,
označitev_osnovne_oblike_števnika (tag_of_basic_form_of_numeral), where
we indicate the information about whether it is used as a noun, adjective, or its
use is irrelevant, and tudi_kot_nesklonljiv (also_as_indeclinable), and also other
declensions – therein we enter the given form of the numeral.
•• At (5) glagol (verb) on the second level we choose between: nedoločnik, name
nilnik (infinitive, supine) (they are both end elements), sedanjik, velelnik and
del_na_l (indicative, imperative, and participle_ending_with_l). On the third
level with the present indicative we choose between the grammatical number,
which is, due to its being different from the number in the case of declined
words marked as ednina_glag, dvojina_glag, množina_glag (singular_verb,
dual_verb, plural_verb), while each of them has the person category (oseba_1,
oseba_2, oseba_3) (person_1, person_2, person_3) on the fourth level. On the
third level the case is similar for the imperative form regarding the number
category (ednina_velelnik, dvojina_velelnik, množina_velelnik) (singular_im
perative, dual_imperative, plural_imperative). For dual and plural we must
choose the elements oseba_1 and oseba_2 (person_1, person_2). There is no
such division with the singular, where only the 2nd person is possible. Participle
178 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
ending with l (del_na_l) can be of different gender (m_del, ž_del, s_del) (m_
participle, f_participle, n_participle), where each of them is different in num-
ber: ednina_del, dvojina_del and množina_del (singular_participle, dual_par
ticiple, plural_participle).
•• At (6) posamostaljeno (nominalized) we can chose the gender on the second
level (the gender is for the positive form), within the gender options on the third
level we can choose between the number options, and within them we can chose
between cases on the fourth level. On the second level we have the option of
choosing between the comparative and the superlative, which contain the same
categories as the positive, therefore gender on the third level, number on the fourth
level, and declension on the fifth level.
•• (7) izpridevniški_prisl (izpridevniški prislov) (deadjectival adverb) on the se-
cond and last level differentiates between the positive form, the comparative
form, and the superlative form of the adverb deriving from an adjective (osnov
nik_izpr_prisl, primrk_izpr_prisl, presež_izpr_prisl) (positive_adjectival_ad
verb, comparative_adjectival_adverb, superlative_adjectival_adverb).
•• (8) pregibni_povedkovnik (declinable_ predicative) can take on one of the gen-
ders (m_povedkovnik, ž_povedkovnik, s_povedkovnik) (m_predicative, f_predi
cative, n_predicative), these in turn take a different number on the second le-
vel (ed_povedkovnik, dv_povedkovnik, mn_povedkovnik) (singular_predicative,
dual_predicative, plural_predicative), on the second level we can also choose
either the comparative or the superlative form (primrk_izpr_prisl, presež_izpr_
prisl) (comparative_deadjectival_adverb, superlative_deadjectival_adverb).
In the case where the elements have different names but where the approach is equal-
ly explicit and organized, we can enter information on the forms of headwords, de-
pending on which part of speech they belong to.
4.3 Choice of Approach
The complexity of the explicit schema is, similar to the simplicity of the general
schema, the outcome of a series of hierarchical decisions made by the lexicographer.
If in the example above we perceive six levels of data: (1) adjective > (2) indefinite
form > (3) masculine > (4) singular > (5) nominative > (6) entry of given form, then
the schema we are creating follows this string of decisions.
Fundamentally, the difference between the two approaches we presented is, for
the most part, of a technical nature. Either we want to enter the actual information
only as text and, therefore, the elements are designated in a very general manner,
because in this case they do not mean anything by themselves except a hierarchical
level making it thus essential to have comprehensive lists of possible topics sepa-
rately for each level (in our case the serial numbers of levels: OZ_1, OZ_2 etc. and a
list of allowed topics: seznam1 (list1), seznam2 etc.). Or we select the elements for
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 179
which we already determine not only the hierarchical relation, but also the meaning
of the level in an explicit way. A scrutiny of the combinations in the list of options is
necessary when choosing the first option since XML Schema itself does not enable it.
One option would be to use the Schematron standard, which points out the incorrect
combinations (e.g. adjective > masculine > 3rd person).
The explicit schema, which requires unambiguously and explicitly tagged infor-
mation, does not leave lexicographers any doubts. Contrary to the general schema,
the drawback of this kind of structure is its complexity; the number of different ele-
ments in the structure greatly increases, which makes the schema seem at first sight
very complex and branched. This is also the reason why the schema in this article
could not be graphically depicted in a transparent way because the size of book for-
mat is limited. In spite of these drawbacks, it is a fact18 that for a lexicographer this
type of structure is actually more logical and easier to understand due to the explicit-
ness of the information. A tree structure becomes even more easily manipulated with
the right program for lexicographical work, which partially guides the lexicographer
when making data entries. A more simple search and data processing of the dictionary
database also speaks in favor of the explicit schema, which is important for review-
ing already finished work at the time of editing as well as for advanced users of the
electronic version. It is also true for linguistic research, where the access to various
data in the microstructure is of great importance.
If we initially insisted that both variants of the schema must correspond to the
content aspect and did not encounter any obstacles that would deter us from this con-
viction, we can therefore conclude that, from the point of view of the practical aspect,
the suitability of either of the schemas cannot be argued, but, because the making of
a dictionary expands over a long period of time, we tend to favor the explicit schema.
The technical aspect certainly confirms that it is the best solution.
In our estimation the second option, i.e., the explicit schema of the morphologi-
cal header, is better for lexicographical work and subsequent usage of the dictionary
database because it enables a clear classification of linguistic data into the dictionary
database and facilitates search for lexicographical data at a later stage. However, we
are aware all along that this solution has certain disadvantages, which prevents it from
being ideal.
5 Conclusion
Modern lexicography is inconceivable without the appropriate technical support
and the application of software tools. XML, which allows hierarchical structuring
of data, has recently been established as the standard format for encoding dictionary
18 Both schema variants (as well as variants of other parts of the schema structure) were tested
among the lexicographic team and it was shown clearly that explicit schema gave better results.
180 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
databases and many other language sources. For dictionaries that contain mainly a
great number of various data types, the unambiguous identification of data is crucial.
The appropriate schema, which exhibits the dictionary structure, provides lexicogra-
phers with a clear picture of the dictionary’s overall structure, whereas the method of
its production affects the difficulty of both the construction and use of the dictionary.
Therefore, when designing the schema to be a projection of the dictionary concept,
we must firstly consider three aspects, the content aspect, the practical aspect, and
the technical aspect, as well as the possibility to use one of the standard formats for
encoding lexicographical data. In the Dictionary of the Slovene Literary Language
of the 16th Century it is even more crucial to carefully consider the structure of the
schema given the fact that the dictionary has a very complex microstructure due to
the nature of the material and that fact that it will evolve during a long period of time.
References
Abel, Andrea. 2012. Dictionary Writing Systems and Beyond. In Electronic Lexicography, ed. by
Sylviane Granger, and Magali Paquot, 83–106. Oxford: Oxford University Press.
Ahačič, Kozma, Andreja Legan Ravnikar, Majda Merše, Jožica Narat, and France Novak. 2011.
Besedje slovenskega knjižnega jezika 16. stoletja. Ljubljana: Založba ZRC, ZRC SAZU.
Berglund, Anders. 2006. Extensible Stylesheet Language (XSL) Version 1.1. .19
Birbeck, Mark, Markus Gylling, Shane McCarron, and Steven Pemberton. 2010. XHTML™ 2.0.
.
Bray, Tim, Jean Paouli, C. M. Sperberg-McQueen, Eve Maler, François Yergeau, and John Cow-
an. 2006. Extensible Markup Language (XML) 1.1 (Second Edition). .
Clear, Jeremy. 1987. Computing: Overview of the Role of Computing in Cobuild. In Looking Up:
An Account of the COBUILD Project in Lexical Computing, ed. by John M. Sinclair, 41–61.
London, Glasgow: Collins ELT.
Erjavec, Tomaž, Ines Jerele, and Maša Kodrič. 2011. Izdelava korpusa starejših slovenskih
besedil v okviru projekta IMPACT. In Meddisciplinarnost v slovenistiki. Obdobja 30, ed. by
Simona Kranjc, 41–47. Ljubljana: Znanstvena založba Filozofske fakultete.
Hunter, David, Jeff Rafter, Joe Fawcett, Eric van der Vlist, Danny Ayers, Jon Duckett, Andrew
Watt, and Linda McKinnon. 2007. Beginning XML. Indianapolis: Wiley Publishing.
Krek, Simon. 2004. Slovarji serije COBUILD in formalizacija definicijskega jezika. Jezik in
slovstvo 49/2: 3–16.
Merše, Majda, France Novak, and Francka Premk. 2001. Slovar jezika slovenskih protestantskih
piscev 16. stoletja : poskusni snopič. Ljubljana: Založba ZRC, ZRC SAZU.
Smrž, Pavel, 2001. Slovníková data ve formátu XML. In Slovenčina a čeština v počítačovom
spracovaní. Zborník referátov zo seminára. Bratislava 26. – 27. októbra 2001, ed. by Alek-
sandra Jarošová, 175–187. Bratislava: Veda.
19 All referenced websites in the article were accessible on 9 June 2012.
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 181
Thompson, Henry S., David Beech, Murray Maloney, and Noah Mendelsohn. 2004. XML
Schema Part 1: Structures: W3C Recommendation 28 October 2004. .
ABBYY Lingvo Content .
iLEX .
IDM DPS .
TshwaneLex .
Termania .
Prispelo januarja 2013, sprejeto marca 2013
Received January 2013, accepted March 2013
Oblikovanje XML-sheme za leksikografske projekte na primeru
Slovarja slovenskega knjižnega jezika 16. stoletja
Prispevek prikazuje in pojasnjuje, kako je bila oblikovana struktura oblikoslovnega
zaglavja Slovarja slovenskega knjižnega jezika 16. stoletja, zapisana v računalniškem
formatu XML-shema, skladno s konceptualnimi zahtevami sestavljavcev slovarja.
Izdelani in na podlagi različnih meril analizirani sta bili dve v osnovi bistveno različni
shemi oblikoslovnega zaglavja, tj. posplošena in eksplicitna shema.
Da si sodobne leksikografije ni mogoče predstavljati brez ustrezne računalniške
podpore in uporabe programskih orodij, je danes samoumevno. Slovarska baza za
Slovar slovenskega knjižnega jezika 16. stoletja bo nastajala v standardnem formatu
XML, ki omogoča večnivojsko hierarhično strukturirane podatkovne baze (drevesna
struktura), povezovanje s sklici, zahtevno iskanje in različne obdelave podatkov. Ker
so datoteke XML v resnici navadne besedilne datoteke, so prenosljive med različni-
mi programi in operacijskimi sistemi, kar je v času stalnega napredka informacijske
tehnologije velikega pomena pri dolgoročnem delu, saj lahko slovarsko podatkovno
bazo ne glede na uporabljeni program za izdelavo kasneje uporabimo v katerem koli
programu, ki zna brati navadne besedilne datoteke.
Slovarska struktura Slovarja slovenskega knjižnega jezika 16. stoletja je izdelana
v formatu XML-shema. Ta med drugim določa, kateri elementi so v slovarski bazi
dovoljeni, kakšna so hierarhična razmerja med njimi in kakšen je njihov vrstni red,
kakšne so možnosti njihovega kombiniranja oziroma izključevanja in kolikokrat se
določen element lahko ponovi, kadar želimo navesti več zaporednih enakih elemen-
tov. Ker leksikografska programska orodja izkoriščajo XML-shemo za pomoč leksi-
kografom pri izdelavi pravilne strukture geselskih člankov, je bilo pri izdelavi sheme
za oblikoslovno zaglavje poleg vsebinskega vidika smiselno upoštevati tudi praktični
182 Slovenski jezik – Slovene Linguistic Studies 9 (2013)
in tehnični vidik. Od njiju je namreč odvisna hitrost izdelave slovarja, pa tudi (ne)za-
pletenost kasnejše uporabe slovarskih podatkov za jezikoslovne raziskave in uporabo
v drugih jezikovnih opisih.
Ker smo izhajali iz tega, da morata obe predlagani shemi ustrezati leksikograf-
skemu vidiku, sta bila za izbiro med dvema različnima shemama odločilna praktični
in tehnični vidik. Na podlagi analize prednosti uporabe ene ali druge različice she-
me je bila zaradi jasne strukturiranosti podatkov in enostavnejše izdelave slovarja
izbrana eksplicitna shema, ki daje slovarski podatkovni bazi večjo vrednost tudi za
kasnejšo uporabo slovarskih podatkov.
Creating XML Schemas for Lexicographical Projects in the Case of
the Dictionary of the Slovene Literary Language of the 16th Century
This paper depicts and explains how the structure of the morphological header of
the Dictionary of the Slovene Literary Language of the 16th Century was written in
electronic XML Schema definition language, in accordance with the lexicographers’
conceptual requirements. Two substantially different schemas, i.e., a general and an
explicit schema of a morphological header, were designed and analyzed on the basis
of different criteria.
It is indeed inconceivable to imagine modern lexicography today without the
appropriate technical support and software tools. The Dictionary of the Slovene
Literary Language of the 16th Century database will be created in standard XML
format, thus enabling a hierarchically structured database (tree structure), referenc-
es, advanced search, and a variety of data processing. XML files are in fact nothing
more than plain text files, transferable between different programs and operating
systems. In a time of continual progress and change in information technology this
is extremely significant for long-term work considering that the dictionary data-
base, regardless of the program used, can later be applied to any program that is
able to read plain text files.
The structure of the Dictionary of the Slovene Literary Language of the 16th
Century is designed in XML Schema definition language, which inter alia defines the
elements allowed in the dictionary database, the hierarchical relations between them
and their order, the possibilities of their combination or exclusion, and the number of
times a certain element can occur, when we wish to specify more than one consecu-
tive element. Since the lexicographical software tools exploit XML Schema in order
to support lexicographers in creating the correct structure of the dictionary entries, it
was crucial to take into consideration the practical and technical aspects in addition to
the content aspect for the process of making the schema for the morphological header.
These two additional aspects contribute to the speed of the dictionary production and
N. Ledinek and A. Perdih, Creating XML Schemas for Lexicographical Projects 183
the complicatedness, or lack thereof, of subsequent use of dictionary data for linguis-
tic research purposes, as well as its use in other language descriptions.
We proceeded from the fact that both presented schemas meet the requirements
of the content aspect, which left the practical and technical aspects to be the decisive
elements in our choice of schema. Based on the analysis of the advantages when us-
ing one or the other version of the schema, it became clear that, due to the transparent
structure of data and easier construction process of the dictionary, the explicit schema
was our preferred choice; moreover, it gives the dictionary even greater value for
subsequent use of lexicographical data.