PRISPEVKI ZA NOVEJŠO ZGODOVINO
59 1 (2019)
UDC
94(497.4)"18/19"
UDK
ISSN 0353-0329
Nina Ditmajer, Matija Ogrin, Tomaž Erjavec
Encoding Textual Variants of the Early Modern Slovenian Poetic Texts in TEI
Isolde van Dorst
You, Thou and Thee: A Statistical Analysis of Shakespeare’s Use
of Pronominal Address Terms
Darja Fišer, Monika Kalin Golob
Corporate Communication on Twitter in Slovenia: A Corpus Analysis
Darja Fišer, Nikola Ljubešić, Tomaž Erjavec
Parlameter – a Corpus of Contemporary Slovene Parliamentary Proceedings
Polona Gantar, Špela Arhar Holdt, Jaka Čibej, Taja Kuzman
Structural and Semantic Classification of Verbal Multi-Word
Expressions in Slovene
Aniko Kovač, Maja Marković
A Mixed-principle Rule-based Approach to the Automatic
Syllabification of Serbian
Milan M. van Lange, Ralf D. Futselaar
Debating Evil: Using Word Embeddings to Analyse Parliamentary Debates
on War Criminals in the Netherlands
Andrej Pančur
Sustainability of Digital Editions: Static Websites of the History
of Slovenia – SIstory Portal
Ajda Pretnar, Dan Podjed
Data Mining Workspace Sensors: A New Approach to Anthropology
Tadej Škvorc, Simon Krek, Senja Pollak, Špela Arhar Holdt,
Marko Robnik-Šikonja
Predicting Slovene Text Complexity Using Readability Measures
INŠTITUT ZA NOVEJŠO ZGODOVINO
PRISPEVKI
ZA NOVEJŠO
ZGODOVINO
Volume LIX Ljubljana 2019 Number 1
DIGITAL
HUMANITIES
AND LANGUAGE
TECHNOLOGIES
Prispevki za novejšo zgodovino
Contributions to the Contemporary History
Contributions a l’histoire contemporaine
Beiträge zur Zeitgeschichte
UDC
94(497.4)"18/19"
UDK
ISSN 0353-0329
Uredniški odbor/Editorial board: dr. Jure Gašparič (glavni urednik/editor-in-chief),
dr. Zdenko Čepič, dr. Filip Čuček, dr. Damijan Guštin, dr. Ľuboš Kačirek,
dr. Martin Moll, dr. Andrej Pančur, dr. Zdenko Radelić, dr. Andreas Schulz,
dr. Mojca Šorn, dr. Marko Zajc
Prevodi/Translations: Studio S.U.R.
Bibliografska obdelava/Bibliographic data processing: Igor Zemljič
Izdajatelj/Published by: Inštitut za novejšo zgodovino/Institute of Contemporary
History, Kongresni trg 1, SI-1000 Ljubljana, tel. (386) 01 200 31 20,
fax (386) 01 200 31 60, e-mail: jure.gasparic@inz.si
Sofinancer/Financially supported by: Javna agencija za raziskovalno dejavnost
Republike Slovenije/ Slovenian Research Agency
Računalniški prelom/Typesetting: Barbara Bogataj Kokalj
Tisk/Printed by: Medium d.o.o.
Cena/Price: 15,00 EUR
Zamenjave/Exchange: Inštitut za novejšo zgodovino/Institute of Contemporary
History, Kongresni trg 1, SI-1000 Ljubljana
Prispevki za novejšo zgodovino so indeksirani v/are indexed in: Scopus, ERIH Plus,
Historical Abstract, ABC-CLIO, PubMed, CEEOL, Ulrich’s Periodicals Directory,
EBSCOhost
Media register entry number: 720
Za znanstveno korektnost člankov odgovarjajo avtorji/ The publisher assumes no
responsibility for statements made by authors
Cover photograph: the Enigma machine with which the Germans encrypted military messages
during the Second World War, held by the MNZS.
Table of Contents
Editorial
Digital Humanities and Language Technologies
(Darja Fišer, Andrej Pančur and Tomaž Erjavec)
Articles
Nina Ditmajer, Matija Ogrin, Tomaž Erjavec, Encoding Textual Variants
of the Early Modern Slovenian Poetic Texts in TEI
UDC: 004.934:821.163.6-1"16/18"
Isolde van Dorst, You, Thou and Thee: A Statistical Analysis
of Shakespeare’s Use of Pronominal Address Terms
UDC: 004.934:821.111SHAK(083.41)
Darja Fišer, Monika Kalin Golob, Corporate Communication
on Twitter in Slovenia: A Corpus Analysis
UDC: 003.295:659.4+004.738.5(497.4)"201"
Darja Fišer, Nikola Ljubešić, Tomaž Erjavec, Parlameter –
a Corpus of Contemporary Slovene Parliamentary Proceedings
UDC: 003.295:342.537.6(497.4)"2014/2018"
Polona Gantar, Špela Arhar Holdt, Jaka Čibej, Taja Kuzman, Structural
and Semantic Classification of Verbal Multi-Word Expressions in Slovene
UDC: 003.295:821.163.6'367.625
Aniko Kovač, Maja Marković, A Mixed-principle Rule-based Approach
to the Automatic Syllabification of Serbian
UDC: 004.934:821.163.41
Milan M. van Lange, Ralf D. Futselaar, Debating Evil:
Using Word Embeddings to Analyse Parliamentary Debates
on War Criminals in the Netherlands
UDC: 003.295:342.537.6:355.012(492)"1940/1945"
Andrej Pančur, Sustainability of Digital Editions:
Static Websites of the History of Slovenia – SIstory Portal
UDC: 004.774-026.11
Ajda Pretnar, Dan Podjed, Data Mining Workspace Sensors:
A New Approach to Anthropology
UDC: 003.295:572+316.7
Tadej Škvorc, Simon Krek, Senja Pollak, Špela Arhar Holdt,
Marko Robnik-Šikonja, Predicting Slovene Text Complexity Using
Readability Measures
UDC: 003.295:821.163.6
Reviews and Reports
Jakob Lenardič, Language Technologies and Digital Humanities 2018,
20–21 September 2018, Faculty of Electrical Engineering, Ljubljana
Editorial Notice
Contributions to Contemporary History is one of the central Slovenian scientific
historiographic journals, dedicated to publishing articles from the field of
contemporary history (the 19th and 20th century).
It has been published regularly since 1960 by the Institute of Contemporary
History, and until 1986 it was entitled Contributions to the History of the Workers’
Movement.
The journal is published three times per year in Slovenian and in the following
foreign languages: English, German, Serbian, Croatian, Bosnian, Italian, Slovak and
Czech. The articles are all published with abstracts in English and Slovenian as well
as summaries in English.
The archive of past volumes is available at the History of Slovenia - SIstory web
portal.
Further information and guidelines for the authors are available at
http://ojs.inz.si/index.php/pnz/index.
Digital Humanities and
Language Technologies
The current special issue of the journal Contributions to Contemporary History brings
papers that seem to break with the established editorial tradition. The journal has
been issued regularly since 1960 by the Institute of Contemporary History, which was
called the Institute for the History of the Labour Movement until 1986. The journal was
renamed at the same time as the institute, and it has since become one of the major
Slovenian scientific journals in the field of history, publishing papers on the
contemporary history (19th and 20th century) of Central and Southeastern Europe. With the
establishment of the infrastructure programme Research Infrastructure of Slovenian
Historiography, the Institute entered the field of digital history, and since 2008 it has
contributed to the establishment of the European Digital Research Infrastructure for the
Arts and Humanities (DARIAH). With this, the Institute of Contemporary History
has started to develop into one of the major digital humanities hubs in Slovenia. The
current special issue is one of the results of this new research direction of its publisher
and reflects the distinctly interdisciplinary and heterogeneous profile of digital humanities.
With this special issue we are celebrating the 20th anniversary of the first Language
Technologies conference, which took place in 1998 in Cankarjev dom, Ljubljana, and was
organized by Tomaž Erjavec, Vojko Gorjanc, Jerneja Žganec Gros and Anica Rant.
The topics of the first conference were the development and application of language
technologies for Slovene and directions for the future. The conference has since been
held biennially and has recently expanded its focus to digital humanities. As the
intersection of digital technologies and the humanities, digital humanities is a very active
research field in which digital technologies are used in the study of language, society
and culture, while humanities research in turn paves the way for the development of new
digital technologies. Digital humanities is a highly interdisciplinary and collaborative
field that transforms traditional practices in the humanities, acts as a catalyst for
new analytical techniques and methods, and promotes discussion between the
different stakeholders in the field. This initiative aims to promote integration of the
disciplines and at the same time act as an important hub for fellow researchers in the region.
We invited the authors of the 11 best-reviewed regular papers and the best student paper
presented at the Language Technologies and Digital Humanities conference, which
took place on 20–21 September 2018 in Ljubljana and was organized by the Slovenian
Language Technologies Society, the Centre for Language Resources and Technologies at the
University of Ljubljana, the Faculty of Electrical Engineering of the University of Ljubljana
and the research infrastructures CLARIN.SI and DARIAH-SI. The authors of 10 regular papers
and the student paper, from the fields of language technologies, digital linguistics and
digital humanities, accepted the invitation and prepared extended papers relevant for an
international audience, which then underwent another round of reviewing by
international reviewers.
The editors of the special issue would like to thank the authors and the reviewers
for their dedicated work as well as for believing in the challenge and being willing to
engage in an interdisciplinary dialogue which requires all the parties involved to step
out of their comfort zone but also brings knowledge transfer and rewarding results.
Darja Fišer, Andrej Pančur and Tomaž Erjavec
Ljubljana, May 16th 2019
Articles
Nina Ditmajer,* Matija Ogrin,** Tomaž Erjavec***
Encoding Textual Variants
of the Early Modern Slovenian
Poetic Texts in TEI
IZVLEČEK
ZAPIS VARIANTNOSTI STAREJŠIH SLOVENSKIH
PESNIŠKIH BESEDIL V TEI
V prispevku obravnavamo problematiko zapisa verza in variantnih mest v znanstve-
nokritični izdaji Foglarjevega rokopisa, štajerske baročne pesmarice iz sredine 18. stoletja.
Najprej prikažemo diplomatični zapis verza v izbranih problematičnih primerih. V nada-
ljevanju predstavimo metodo, uporabljeno za izdelavo kritičnega aparata variantnih mest.
Temeljno besedilo, tj. Foglarjev rokopis, je primerjano z verzijami v osmih drugih rokopisih
in tiskih iz 18. in začetka 19. stoletja. Variantna mesta so označena z elementi XML po
Smernicah TEI (TEI Guidelines) kot enote kritičnega aparata. Prikazujemo nekaj pri-
merov detajliranega označevanja rime, stopice, zamenjav verzov ter variantnih razlik na
pravopisni, glasoslovni in leksikalni ravnini jezika. Na koncu orišemo več možnosti spletnega
prikaza elektronskega diplomatičnega besedila. Pokazala se je potreba po prilagodljivosti
teh orodij slovenskemu literarnemu izročilu.
Ključne besede: slovensko slovstvo, Foglarjev rokopis, znanstvenokritična izdaja, kri-
tični aparat, variantnost besedila, TEI
* Research Centre of the Slovenian Academy of Sciences and Arts, Novi trg 2, SI-1000 Ljubljana, nina.ditmajer@zrc-sazu.si
** Research Centre of the Slovenian Academy of Sciences and Arts, Novi trg 2, SI-1000 Ljubljana, matija.ogrin@zrc-sazu.si
*** Department of Knowledge Technologies, Jožef Stefan Institute, Jamova Cesta 39, SI-1000 Ljubljana, tomaz.erjavec@ijs.si
1.01 UDC: 004.934:821.163.6-1"16/18"
ABSTRACT
The paper deals with the problem of encoding verses and textual variants in the
critical edition of Foglar’s Manuscript, a Styrian Baroque hymn book from the mid-eighteenth
century. We first show the diplomatic transcript of the verse in selected problematic cases,
after which we present the method applied to produce a critical apparatus of
textual variants. The base text, i.e. Foglar’s Manuscript, is compared with versions in eight
other manuscripts and prints from the eighteenth and early nineteenth centuries. Variants
are encoded with XML elements according to the TEI Guidelines as units of the critical
apparatus. We highlight some examples of the detailed encoding of rhymes, feet, verse
transpositions, and textual variants on the orthographic, phonological and lexical levels
of the language. To conclude, we present a number of possibilities for the online display
of the electronic diplomatic transcript. The need for these tools to be adaptable to the
Slovenian literary tradition became evident.
Keywords: Slovenian literature, Foglar’s Manuscript, critical edition, critical apparatus,
textual variance, TEI
Introduction
The texts that have been passed down to us over time via manuscript culture were
transcribed from witness to witness over a long period of time. In this kind of textual
transmission (Textüberlieferung), many textual variations appear in the text, which are
called (variant) readings (Lesarten) or variants (Überlieferungsvarianten). Variant read-
ings can be merely scribal mistakes or “errors”, but even these can range from using
the wrong letter to the omission of an entire line. Variants, however, can also be the
scribe’s intentional modifications of the text, including anything from orthographic
differences and various word forms to major interventions in the text, such as addi-
tions, omissions, word order changes, transpositions of whole paragraphs or stanzas,
etc. Textual variance also occurs in printed texts in general, that is, in the culture of
the printed book: as soon as the same text is published again, variant readings start to
appear, albeit not quite as extensively as in the handwritten tradition. Since very few
medieval manuscript texts are preserved in the Slovenian language, the problem of
textual variation in Slovenian only appears in the early modern age, especially in the
Baroque era. Among the most common examples of the Slovenian transcription tradi-
tion are those of the Baroque texts of the eighteenth and nineteenth centuries. Among
prose texts, for example, the Črnovrški Manuscript,1 the manuscripts on the Antikrist2
and the Poljane Manuscript3 are mentioned in the present paper, while handwritten
hymn books were particularly popular among the common people. These hymn books
were preserved through the textual transmission in all of the regional varieties of the
Slovenian standard language4 existing in the Slovenian ethnic territories until the
unification of the Slovenian standard language in the mid-nineteenth century. They were
either copied by scribes from earlier printed or handwritten hymn books, flyers for
special occasions (e.g., pilgrimage, church consecration), lectionaries, catechisms and
prayer books, or were written from memory or dictation.
1 The text is treated in the Register of Baroque and Enlightenment Slovenian Manuscripts (NRSS Ms 124).
2 Cf. the Register of Baroque and Enlightenment Slovenian Manuscripts (NRSS Ms 15, Ms 17, Ms 24, Ms 71).
3 Cf. the Register of Baroque and Enlightenment Slovenian Manuscripts (NRSS Ms 23, Ms 28).
It is precisely by supplying scholarly evidence and an explanation of its textual
tradition that the critical edition should provide us with the most authentic and com-
plete version of a literary work’s text: “When a text is transmitted through more than
one witness, a critical edition will generally take a strong interest in recording the variant
readings of some or all of those manuscripts or editions” (Burghart 2017).
Therefore, in addition to the original text, the critical edition should also hand
down a textual tradition of witnesses, which exists in the form of transcripts, frag-
ments, drafts, proof sheets, etc. in order to clarify the process of the text’s transfor-
mation and genesis: “The apparatus is a set of notes designed to foster in the reader an
awareness of the historical and editorial processes that resulted in the text he or she is reading
and to give the reader what he or she needs to evaluate the editor’s decisions” (Damon 2017,
202). In principle, digital editions offer more possibilities than printed versions to
present the text in its various formats, as they allow for the juxtaposition of different
forms of text (for example, a digital facsimile and a diplomatic transcript) in a selected
size category and in precisely selected places, at the level of the paragraph, the stanza
or the verse (Ogrin 2005, 9–10).
In the present paper, taking as an example the diplomatic transcript of a selected
hymnal manuscript, we present the question of encoding the variant readings of the
text as reflected in its handwritten and printed versions according to the TEI Guidelines
from 2019. These can be used to produce a variety of digital texts, from simple reading
editions to scholarly critical editions, dictionaries and language corpora. The digital
markup means that the structural elements of the text (e.g., verses, stanzas, notes) are
encoded with TEI-defined tags that the computer can then recognise. The TEI recom-
mendations consist of descriptions of the tags rendered in the XML markup language,
which can be defined as an open encoding standard focused not on the display but on
the structure and internal relations of the data. We can use these tags to mark in the
electronic encoding the desired structure and other characteristics of the text (Ogrin
and Erjavec 2009; Ogrin 2005, 14; Hockey 2000, 24). In this way, we have, since 2004,
prepared nine editions of the eZISS library – Digital Scholarly Editions of Slovenian
Literature (Ogrin and Erjavec 2009).
In the following paragraphs, we present Foglar’s Manuscript, the selected base text,
in a diplomatic transcript, along with its variant readings in the preserved versions of
the hymns in other manuscripts and prints. The diplomatic transcript is important not
only for locating the original version of the text, but also for comparing versions on all
levels of the language. By using suitable web tools, we can also study the stanza forms,
verse and metre. In addition to a presentation of selected tools, we were interested in
the different kinds of display of the digital diplomatic text in the HTML layout.
4 The Eastern Slovenian standard language with its Prekmurje and Eastern Styrian varieties and the Central Slovenian
standard language with its Carniolan and Carinthian varieties.
The Text Corpus
Foglar’s hymn book (1757–1762) is a Slovenian Baroque manuscript containing
twenty-four hymns. It originates in the area of the then Austrian province of Styria in
the parish of Kamnica near Maribor. The manuscript is named after Lovrenc Foglar,
one of its authors (cf. Ditmajer 2017), and contains the following hymn texts: the
oldest Slovenian hymns celebrating the pilgrimage to Mariazell in Upper Styria; four
hymns dedicated to saints; a festive hymn dedicated to the Holy Trinity; two hymns
with eschatological content; one worshipping Jesus’ name; one of repentance for the
fasting period; and another praising the love of God. During the examination of pre-
served Slovenian religious hymns known to date, as well as other witnesses containing
hymn texts, a number of hymns were discovered that could have served as a base text
for Foglar’s Manuscript, or vice versa.
To date, we have included eight variant texts in the critical edition:
– the hymnal manuscript Pesmarica from Gorje (1761–1792, NRSS Ms 113),
– Paglovec’s hymnal manuscript Cantilenae variae partim antiquae partim (1733–1759,
NUK R 0 75843),
– Lavrenčič’s printed Misijonske pesme inu molitve (1757, NUK GS 0 10212),
– Krebs’s hymnal manuscript (1750–1800, NRSS Ms 022),
– the hymnal manuscript Cerkvene pesmi in molitve (ca. 1778, NRSS Ms 052),
– Maurer’s hymnal manuscript (1754, NUL Ms 1485),
– Parhamer’s printed catechism entitled Obchinzka knisicza zpitavanya teh pet
glavnih stukov maloga katekizmussa (1764, UKM R 20675), and
– Manuskript iz Podmelca (1802–1810, Archives of the ZRC SAZU Institute of
Ethnomusicology, Kokošar’s Series, Ms. II., Sg. Ms. Ko. 101/125).
The selected variant hymns were mostly produced in the eighteenth century in
the regions of Styria, Carinthia, Carniola and Gorizia. Eleven of the hymns exist in a
single version (for example, Pesem od svete trojce, Pesem od božje lubezni, Pesem od svete
Notburge), and only one exists in two versions (Pesem od Marije Magdalene). All of the
manuscripts and prints mentioned are listed in the listWit (witness list) element
added to the preface to the critical edition, with entries such as the following:
Pesmarica iz Gorij, 1761–1792, Ms 113
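A witness list of this kind can be read programmatically. The following is a minimal sketch in Python, assuming a listWit with one witness element per source; the sigla "M" and "POD" are named in this paper, whereas "GOR" and the helper witness_index are hypothetical names introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix that ElementTree uses for xml:* attributes.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# "M" and "POD" are sigla named in the paper; "GOR" is a made-up placeholder.
LIST_WIT = """
<listWit>
  <witness xml:id="GOR">Pesmarica iz Gorij, 1761–1792, Ms 113</witness>
  <witness xml:id="M">Maurer's hymnal manuscript, 1754</witness>
  <witness xml:id="POD">Manuskript iz Podmelca, 1802–1810</witness>
</listWit>
"""


def witness_index(xml_text):
    """Map each witness sigil (xml:id) to its textual description."""
    root = ET.fromstring(xml_text)
    return {w.get(XML_NS + "id"): (w.text or "").strip()
            for w in root.iter("witness")}


witnesses = witness_index(LIST_WIT)
print(witnesses["POD"])
```

Such an index is what allows a reading's @wit pointer to be resolved to a human-readable source name later on.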
The Diplomatic Transcript of the Base Text and
Its Variations
In early Slovenian hymn books, one graphic line does not always correspond to
a single metric verse. Frequently, due to a lack of paper space, scribes would write the
next word or phrase on a second graphic line. In the diplomatic transcript, we used
the TEI element label to number stanzas; verse lines, each encoded with an l (line)
element, are embedded in an lg (line group) element following the label; the refrain is
nested in the parent stanza (i.e., lg) with an assigned @type attribute; and a break within
the verse line is simply marked with an lb (line break) element, as shown in the encoding
example of the first stanza of Pesmi od Svete trojce:
Sve Ti tro ÿ zi zhem moi Le ben da ti
Sam ſe be jioi kenimo offri Spra ffti
Tiſto zhem zha ſti ti
Hvalo niei ſtu ri ti
Sahva le na do vei ko ma
Bo di sve ta troÿ za
Figure 1: The original variant of the first stanza of Pesmi od Svete trojce
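The structure described above can be sketched as follows. This is an illustrative reconstruction rather than the edition's actual markup: the position of the lb element, the stanza label "1." and the choice of the last two lines as the refrain are assumptions, and verse_lines is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the lb position, the label text and the refrain boundary
# are illustrative guesses, not copied from the edition.
STANZA = """
<lg type="stanza">
  <label>1.</label>
  <l>Sve Ti tro ÿ zi zhem moi<lb/> Le ben da ti</l>
  <l>Sam ſe be jioi kenimo offri Spra ffti</l>
  <l>Tiſto zhem zha ſti ti</l>
  <l>Hvalo niei ſtu ri ti</l>
  <lg type="refrain">
    <l>Sahva le na do vei ko ma</l>
    <l>Bo di sve ta troÿ za</l>
  </lg>
</lg>
"""


def verse_lines(stanza):
    """Return the text of every metrical line, ignoring graphic line breaks."""
    return [" ".join("".join(line.itertext()).split())
            for line in stanza.iter("l")]


stanza = ET.fromstring(STANZA)
lines = verse_lines(stanza)
graphic_breaks = len(list(stanza.iter("lb")))
print(len(lines), "verse lines,", graphic_breaks, "graphic line break(s) inside a verse")
```

Because the lb element carries no text, the metrical line can be recovered intact even when the scribe split it across two graphic lines.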
Difficulties are caused above all by hymn texts in which the author has disregarded
the verse line, rendering the hymn in prose form. In view of this, hymns with a second
verse line continuing in the same graphic line where the first verse line begins were
encoded with the ab (anonymous block) element, while the @type attribute was
used to mark the stanza, with line breaks indicated as shown in the following example:
Vsak Brat inu Sestra Serze Posdigni, Iesusa Mario Josepha hvali:
Klizi Jesus Maria mojo Serze moj glas, ô Jo-seph moj varih sdajna
Posledni zhass.
Figure 2: The original variant of the first stanza of Pesmi o svetem Jožefu
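For hymns written as prose, the same idea can be sketched with an ab element. The snippet below is a minimal illustration, assuming that the three graphic lines of the stanza are separated by lb elements; graphic_lines is a hypothetical helper, and the exact markup of the edition may differ.

```python
import xml.etree.ElementTree as ET

# Assumed markup: one ab per stanza, with lb marking each new graphic line.
AB_STANZA = """
<ab type="stanza">Vsak Brat inu Sestra Serze Posdigni, Iesusa Mario Josepha hvali:<lb/>
Klizi Jesus Maria mojo Serze moj glas, ô Jo-seph moj varih sdajna<lb/>
Posledni zhass.</ab>
"""


def graphic_lines(block):
    """Split an ab block into its graphic lines at each lb element."""
    segments = [block.text or ""] + [lb.tail or "" for lb in block.iter("lb")]
    return [" ".join(segment.split()) for segment in segments]


block = ET.fromstring(AB_STANZA)
for number, line in enumerate(graphic_lines(block), start=1):
    print(number, line)
```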
In addition to verse lines, stanzas and refrains, rhyme and foot can be specifically
encoded in a machine-readable format. However, this markup has not yet been taken
into account in our scholarly edition. The rhyme patterns can be documented with
the @rhyme attribute, while the @label attribute is used to specify which part of
a rhyme scheme a given set of rhyming words represents. The value of this attribute is
usually one of the letters of the rhyme pattern.
Sve Ti tro ÿ zi zhem moi Le ben da ti
Sam ſe be jioi kenimo offri Spra ffti
Tiſto zhem zha ſti ti
Hvalo niei ſtu ri ti
Sahva le na do vei ko ma
Bo di sve ta troÿ za
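A rhyme encoding of this stanza might look like the sketch below. The scheme "aaaabb" and the extent of each rhyme element are assumptions made for illustration only; the check simply confirms that the labels on the individual rhyming words agree with the scheme declared on the stanza.

```python
import xml.etree.ElementTree as ET

# The scheme "aaaabb" and the tagged rhyme words are illustrative assumptions.
STANZA = """
<lg type="stanza" rhyme="aaaabb">
  <l>Sve Ti tro ÿ zi zhem moi Le ben da <rhyme label="a">ti</rhyme></l>
  <l>Sam ſe be jioi kenimo offri Spra <rhyme label="a">ffti</rhyme></l>
  <l>Tiſto zhem zha ſti <rhyme label="a">ti</rhyme></l>
  <l>Hvalo niei ſtu ri <rhyme label="a">ti</rhyme></l>
  <l>Sahva le na do vei ko <rhyme label="b">ma</rhyme></l>
  <l>Bo di sve ta troÿ <rhyme label="b">za</rhyme></l>
</lg>
"""


def labelled_scheme(stanza):
    """Concatenate the rhyme labels of all lines, in document order."""
    return "".join(line.find("rhyme").get("label") for line in stanza.iter("l"))


stanza = ET.fromstring(STANZA)
scheme = labelled_scheme(stanza)
print(scheme, scheme == stanza.get("rhyme"))
```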
In the second example, the @met attribute indicates the metrical structure, where
the symbol | marks the foot boundaries. If some lines deviate from the metrical scheme
documented in the @met attribute, the deviation is documented with the @real
attribute:
Po ſluſhai kai ti jaſ povem
Kai ti ozhem osnani ti
Nesna nu le tu do vſih mou
No tt burgo zhem zha ſti ti
No tt Burga je Tÿ Rolarza
S nto lar ſke Do li ne
Poſhtenih pur garskih ludi
Prav ſrezhne korenine
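The interplay of @met and @real can be sketched as follows. The metrical patterns (with "-" for an unstressed and "+" for a stressed syllable) are illustrative assumptions and not the scansion used in the edition; only two lines of the stanza are shown, and effective_metre is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Patterns are illustrative: "-" unstressed, "+" stressed, "|" foot boundary.
STANZA = """
<lg type="stanza" met="-+|-+|-+|-+">
  <l>Po ſluſhai kai ti jaſ povem</l>
  <l real="-+|-+|-+|-">Prav ſrezhne korenine</l>
</lg>
"""


def effective_metre(stanza):
    """Pair each line with @real if present, else the stanza-level @met."""
    default = stanza.get("met")
    return [((line.text or "").strip(), line.get("real", default))
            for line in stanza.iter("l")]


stanza = ET.fromstring(STANZA)
for text, pattern in effective_metre(stanza):
    print(pattern.ljust(12), len(pattern.split("|")), "segments:", text)
```

A line without @real inherits the stanza pattern, so only the deviating lines need explicit markup.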
For a scholarly critical edition of a manuscript, especially one from an early period,
it is essential to look for textual variants, as they facilitate the detection of errors in the
overall text and aid the search for the base text. In the described critical edition, all of
the preserved textual transmissions (traditions) are displayed and organised so as to
be subordinate to the base text, that is, Foglar’s text. Our first attempt at encoding tex-
tual variance in poetic texts was the preparation of the digital critical edition of Anton
Martin Slomšek’s poems, which was devised in the period 2006–2011 and is still in
progress. The diplomatic transcript of Foglar’s hymn book was treated with the same
apparatus criticus, applying the same parallel segmentation method5 and displaying the
variant readings using the app element. The latter contains the base text (the lemma),
and one or more variant readings encoded with the rdg (reading) element, each with
a reference to the appropriate version via the @wit (witness) attribute:
MAri ia Magda lena
An Bart Magdalena
Enkrat Madalena
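A unit of the critical apparatus in the parallel segmentation method can be sketched as below. The witness pointers "#W1" and "#W2" are placeholders, since the assignment of the two readings to particular witnesses is not stated here; readings_by_witness is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# "#W1" and "#W2" are placeholder sigla, not the edition's actual witnesses.
UNIT = """
<app>
  <lem>MAri ia Magda lena</lem>
  <rdg wit="#W1">An Bart Magdalena</rdg>
  <rdg wit="#W2">Enkrat Madalena</rdg>
</app>
"""


def readings_by_witness(app, base="base"):
    """Return the lemma under `base` and each reading under its witness sigil."""
    readings = {base: (app.findtext("lem") or "").strip()}
    for rdg in app.findall("rdg"):
        # @wit may hold several whitespace-separated pointers.
        for pointer in rdg.get("wit", "").split():
            readings[pointer.lstrip("#")] = (rdg.text or "").strip()
    return readings


unit = ET.fromstring(UNIT)
readings = readings_by_witness(unit)
print(readings)
```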
The @wit attribute value refers to the identifier of the description of manuscripts
and prints with the aforementioned versions of hymnal texts, such as the value “M” for
Maurer’s hymnal manuscript, or “POD” denoting the Manuskript iz Podmelca, as shown in the
list of sources in the preceding section. The critical edition includes 988 units of the
critical apparatus (app), which contain 988 lem elements and 1072 rdg elements.
Only pure textual variants were included as units of the critical apparatus, excluding
the identification of the verse-stanza structure of the variant text.
5 For a detailed description of the method, see section “12.2 Linking the Apparatus to the Text” of the TEI Guidelines,
12 Critical Apparatus - The TEI Guidelines, https://www.tei-c.org/release/doc/tei-p5-doc/en/html/TC.html.
Particularly problematic are hymns whose entire stanzas, or simply the verses of a
single stanza, are switched, such as in Pesem od vernih duš. Such switches can be more
explicitly marked using the @xml:id (identifier) and @corresp (corresponds)
attributes:
Dol vo gen ſo sako pa ne
Vshgala ga je ta praviza
Vuishgalagaje praviza
Uſse U' tem ogniu sede sakopane
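One way to make such a switch explicit is sketched below: each base verse carries an @xml:id, and each variant verse points to its counterpart with @corresp. The identifiers "v1" and "v2", the witness pointer "#W1" and the pairing of the verses are assumptions for illustration, as is the helper is_transposed.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Identifiers, witness pointer and verse pairing are illustrative assumptions.
UNIT = """
<app>
  <lem>
    <l xml:id="v1">Dol vo gen ſo sako pa ne</l>
    <l xml:id="v2">Vshgala ga je ta praviza</l>
  </lem>
  <rdg wit="#W1">
    <l corresp="#v2">Vuishgalagaje praviza</l>
    <l corresp="#v1">Uſse U' tem ogniu sede sakopane</l>
  </rdg>
</app>
"""


def aligned_order(app):
    """Return the verse ids in base order and in the variant's order."""
    base_ids = [line.get(XML_ID) for line in app.find("lem").iter("l")]
    variant_ids = [line.get("corresp", "").lstrip("#")
                   for line in app.find("rdg").iter("l")]
    return base_ids, variant_ids


def is_transposed(app):
    """True if the variant has the same verses as the base, in another order."""
    base_ids, variant_ids = aligned_order(app)
    return sorted(base_ids) == sorted(variant_ids) and base_ids != variant_ids


unit = ET.fromstring(UNIT)
print(aligned_order(unit), is_transposed(unit))
```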
In textual criticism, we distinguish two major groups of variant readings: sub-
stantive and accidental (Greg 1950). The latter include those changes that do not
significantly affect the meaning, such as orthographic variants, although in some cases
even these cause meaning-related dilemmas. The Baroque text of Foglar’s Manuscript
is substantively marked by the non-standard use of spelling and the regional phonetic
variation in various branches of the textual transmission. The scope of the critical appa-
ratus and the degree of its granularity have been the subject of discussion in philology
since the beginning of critical edition production, especially regarding the distinction
between the level of purely orthographic differences, or so-called accidentals, and the
level of more meaning-related differences, or so-called substantives, which go back to
Greg’s theory of copy-text and beyond into the history of philology.6
6 For a comprehensive historical outline of the views that have been formed in textual criticism with regard to this
question, see Sahle (2013, 172–73).
In order to provide a better visual representation of the various types of modification
when applying tools for the display and analysis of texts, we need to classify these
modifications more precisely and introduce more units of the critical apparatus within
one verse line. In the eighteenth century, Slovenian textbooks on spelling and grammar,
and Slovenian books in general, were lacking; school instruction was carried out in a
foreign language (only elementary instruction was conducted in Slovenian); and the
education of copyists varied. As a result, the use of graphic characters for certain
sounds varied significantly (marked in the critical edition with the @type attribute value):
Bres Breſs
Madesha Madeſha
spozheta ſpozheta
Until the mid-nineteenth century, the Slovenian ethnic territories were character-
ised by the coexistence of regional varieties of the Slovenian standard language. We
therefore encounter many phonological and morphological variant readings in this
critical edition, which, like spelling variants, do not affect the meaning of a particular
word.
vun ven
Lexical substitutions are of more importance, but in the manuscript texts included
in the critical edition it is generally a case of synonyms:
dela fairont
Della pust
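The three kinds of variant illustrated above can be tallied once each reading carries a @type value. The sketch below assumes the value names "orthographic", "phonological" and "lexical" and places @type on the rdg element; both choices are illustrative, and the sample wrapper element exists only to make the snippet a single well-formed document.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# The <sample> wrapper and the @type value names are illustrative assumptions.
APPARATUS = """
<sample>
  <app><lem>Bres</lem><rdg type="orthographic">Breſs</rdg></app>
  <app><lem>Madesha</lem><rdg type="orthographic">Madeſha</rdg></app>
  <app><lem>spozheta</lem><rdg type="orthographic">ſpozheta</rdg></app>
  <app><lem>vun</lem><rdg type="phonological">ven</rdg></app>
  <app><lem>dela fairont</lem><rdg type="lexical">Della pust</rdg></app>
</sample>
"""


def variants_by_type(xml_text):
    """Count readings per @type value; untyped readings count as 'unclassified'."""
    root = ET.fromstring(xml_text)
    return dict(Counter(rdg.get("type", "unclassified")
                        for rdg in root.iter("rdg")))


counts = variants_by_type(APPARATUS)
print(counts)
```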
Tools for Text Analysis and Display
The XML-TEI encoding of textual variation shown above conveys the logical
and semantic structure of the variant readings in the hymns, on the basis of which
the editor of the critical edition is able to formulate his or her textological and philo-
logical analysis of the textual tradition of a given hymn in a machine readable format.
However, this format is not intended for the reading public of the digital edition, that
is, for actual reading from the screen. For this purpose, it has to be converted into a
reader-friendly display format, such as HTML, where the meaning structure of the text
is converted into the appropriate graphic design of the text.
To show textual variance in the textual transmission of Foglar’s Manuscript, we
used (or tested) three tools that have very different sets of functionalities for convert-
ing XML-TEI elements to the HTML format of display, and that are derived from very
different concepts of the graphic representation of textual variants. Apart from these,
Versioning Machine (VM)7 is the tool that probably has the longest history. Although it
boasts plentiful functionalities, we did not opt for it in this case because we would have
had to extensively adapt the XML format in order for the VM to display it well. The
tools were evaluated according to how the relevant files, prepared in strict agreement
with the TEI Guidelines, were converted without special adjustments.
XSLT Conversion
During the preparation of the digital scholarly edition of Foglar’s hymn book,
XSLT conversion was predominantly used, having been developed as a working tool
for the emerging critical edition of the poems by A. M. Slomšek. A web-based tool8
supporting this conversion enables the conversion of documents from Word (.docx)
into TEI and/or the conversion of TEI documents into HTML. For each conversion,
a folder is created that is accessible online and contains both the source file and its
converted TEI encoding, as well as the HTML file generated from it. The conversion
works so that the general conversion of the TEI encoding (provided and continu-
ously developed by the TEI Consortium) into its HTML version is enriched with
local changes that the user can activate by selecting the appropriate profile. For our
purposes, we developed a ZRC profile that extends the general conversion by placing
each unit of the apparatus in braces {}, inside which the lemma is listed first, followed
by the variant readings, separated by a vertical bar |. The name of the version referred
to by rdg/@wit is displayed when the user hovers the mouse pointer over the reading.
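The inline display described above can be illustrated with a minimal Python sketch. It is not the XSLT of the ZRC profile, only an approximation of its visible result under stated assumptions: the sample verse, the sigla #Fog and #Pag, and the function names are invented for illustration, and the hover text is produced with a plain HTML title attribute.

```python
import html
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample line and sigla, for illustration only.
SAMPLE = (
    '<l xmlns="http://www.tei-c.org/ns/1.0">O '
    '<app><lem wit="#Fog">sveta</lem><rdg wit="#Pag">sveita</rdg></app>'
    ' Notburga</l>'
)
WITNESS_NAMES = {"#Fog": "Foglar", "#Pag": "Paglovec"}


def render_app(app):
    """Render one apparatus unit as {lemma | reading | ...}."""
    lem = app.find(TEI + "lem")
    items = [html.escape("".join(lem.itertext()))] if lem is not None else []
    for rdg in app.findall(TEI + "rdg"):
        sigla = rdg.get("wit", "").split()
        title = ", ".join(WITNESS_NAMES.get(s, s) for s in sigla)
        text = html.escape("".join(rdg.itertext()))
        # The title attribute gives the browser's default hover tooltip.
        items.append(f'<span title="{html.escape(title)}">{text}</span>')
    return "{" + " | ".join(items) + "}"


def render_line(elem):
    """Return HTML for a line, replacing each <app> by its inline form."""
    parts = [html.escape(elem.text or "")]
    for child in elem:
        if child.tag == TEI + "app":
            parts.append(render_app(child))
        else:
            parts.append(render_line(child))
        parts.append(html.escape(child.tail or ""))
    return "".join(parts)


print(render_line(ET.fromstring(SAMPLE)))
```

For the invented line, the sketch prints the base text with a single braced unit in which the reading carries the witness name as hover text.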
The aforementioned issue of granularity of the critical apparatus, i.e., how detailed
the information about individual variant readings should be (based either on words or
larger sections), is clearly shown in Figures 3 and 4. First, Figure 3 shows the solution
7 Cf. Versioning Machine 5.0, http://v-machine.org/.
8 DOCX to TEI to HTML conversion, http://nl.ijs.si/tei/convert/.
where the lem element contains the entire verse of Foglar’s text, followed by the
rdg element containing the whole verse from the manuscript by Mihail Paglovec.
In this case, the critical apparatus unit contains and defines the entire verse line as a
variant reading. Figure 4, on the other hand, shows the same verse lines as Figure 3,
but encoded so that each word is represented by its own unit: each lem element
containing a single word from Foglar’s text has a corresponding rdg element containing
a single word from Paglovec’s text. In this way, all of the orthographic and substantive
variants are likely to be shown more clearly, with the exception of the spaces between
the syllables, which, although not so important for the analysis, do make reading
somewhat more difficult.
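The two levels of granularity can be contrasted in a short Python sketch. It is only an illustration under assumptions: the verse text and the sigla are invented, the words are aligned naively by position, and only the words that differ are wrapped in an apparatus unit, whereas a real collation would need a proper alignment step.

```python
from xml.sax.saxutils import escape


def app(lem, rdg, lem_wit="#Fog", rdg_wit="#Pag"):
    """Build one apparatus unit; the sigla are hypothetical."""
    return (f'<app><lem wit="{lem_wit}">{escape(lem)}</lem>'
            f'<rdg wit="{rdg_wit}">{escape(rdg)}</rdg></app>')


def encode_verse_level(base, variant):
    """One unit for the whole verse line (coarse granularity)."""
    return f"<l>{app(base, variant)}</l>"


def encode_word_level(base, variant):
    """One unit per differing word (fine granularity).

    Falls back to the verse-level encoding when the two lines
    do not have the same number of words.
    """
    base_words, variant_words = base.split(), variant.split()
    if len(base_words) != len(variant_words):
        return encode_verse_level(base, variant)
    units = [escape(b) if b == v else app(b, v)
             for b, v in zip(base_words, variant_words)]
    return "<l>" + " ".join(units) + "</l>"


# Invented verse line, for illustration only.
print(encode_verse_level("O sveta Notburga", "O sveita Notburga"))
print(encode_word_level("O sveta Notburga", "O sveita Notburga"))
```

The verse-level output repeats the whole line in both lem and rdg, while the word-level output isolates the single differing word, which is the trade-off discussed above.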
Figure 3: A synoptic presentation of the base text by Foglar and of Paglovec’s variant in
HTML format (Pesem od svete Notburge)
Figure 4: A synoptic presentation of the base text by Foglar and of Paglovec’s variant in
HTML format (Pesem od svete Notburge)
This tool, whose generic conversion according to the TEI Guidelines has been
upgraded with a synoptic display of the critical apparatus in a main text line, is
intended for a simple but philologically accurate presentation of textual variance in a
digital scholarly edition. Its use is conditioned by the consistent adoption of the paral-
lel segmentation method in TEI. Although not providing the reader with the greatest
flexibility of display (for example, the ability to hide or display a specific version of the
text), it is a valuable tool because it is available as an online service9 and can easily be
9 The conversion service operates at the address http://nl.ijs.si/tei/convert/, by selecting the conversion profile
ZRC.
installed on any computer, enabling it to be run at any time during the editorial pro-
cess. It is ideal for displaying texts in which only two or three, perhaps four, versions are
compared in each unit of the apparatus, which seems to be entirely appropriate for the
actual range of textual variance established in the earlier Slovenian literary tradition.
TEI CAT
The TEI Critical Apparatus Toolbox (TEI CAT) is a web service10 developed by
a group led by Marjorie Burghart. It is explicitly intended for critical editors prepar-
ing digital scholarly editions with the parallel segmentation method under the TEI
Guidelines. It therefore serves as a work aid enabling editors to check and visualise
meaning components in the course of the preparation of their scholarly editions. Many
functionalities are provided for this purpose, including those for checking errors and
inconsistencies that emerge in the encoding process (Burghart 2016). We will focus
on the functionalities that are the most relevant to our textual analyses.
The user sends an XML file to the online service to verify the correctness of the
tagging. If the results are positive, the main text or the so-called critical text of the edi-
tion will be displayed for viewing. Beside each unit of the critical apparatus, an arrow
appears on screen, which can be clicked to open a window with the content of the unit
in a classic form based on the use of the right square bracket: everything to the left
of the square bracket represents the lemma, while to the right is the variant reading
marked with the abbreviation (siglum) of the version.
In addition, we are free to select a number of controls, such as whether the system
should display page breaks or colour the units of the apparatus that do not contain all
of the versions, or, conversely, whether it should colour only those units of the appa-
ratus that contain a specific version, etc.
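The kind of check behind the second control, highlighting units that do not contain all of the versions, can be sketched in Python as follows. This is not TEI CAT's own code; the sample document, the sigla and the function name are assumptions made for illustration, and the listWit is placed in back as in our edition.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented mini-document with three hypothetical witnesses.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
  <body><l>
    <app><lem wit="#Fog">sveta</lem><rdg wit="#Pag #X">sveita</rdg></app>
    <app><lem wit="#Fog">Notburga</lem><rdg wit="#Pag">Nothburga</rdg></app>
  </l></body>
  <back><listWit>
    <witness xml:id="Fog">Foglar</witness>
    <witness xml:id="Pag">Paglovec</witness>
    <witness xml:id="X">a third, hypothetical witness</witness>
  </listWit></back>
</text></TEI>"""


def incomplete_units(root):
    """List (lemma, missing sigla) for units that lack some witness."""
    all_sigla = {"#" + w.get(XML_ID)
                 for w in root.iter(TEI + "witness") if w.get(XML_ID)}
    report = []
    for app in root.iter(TEI + "app"):
        present = set()
        for reading in app:
            if reading.tag in (TEI + "lem", TEI + "rdg"):
                present.update(reading.get("wit", "").split())
        missing = sorted(all_sigla - present)
        if missing:
            lem = app.find(TEI + "lem")
            label = "".join(lem.itertext()) if lem is not None else ""
            report.append((label, missing))
    return report


print(incomplete_units(ET.fromstring(SAMPLE)))
```

In the invented sample, only the second unit is reported, because the hypothetical witness X is not mentioned in it.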
The most important functionality offered by TEI CAT is a parallel view of all of
the versions generated by the tool from the units of the critical apparatus. Regardless
of the fact that, according to the TEI Guidelines, the recommended place for the list of
versions listWit is in the so-called teiHeader metadata element, the CAT system will
locate the listWit anywhere in the TEI document (in our case, it is placed in back),
logically sorting its information with respect to the abbreviations. The user can then
choose to view all of the versions in a parallel display, or, by ticking only the selected
abbreviations, have individual versions displayed in parallel for comparison:
10 The consortium developing the tool includes CNRS and the University of Lyon, cf. TEI Critical Apparatus Toolbox,
http://teicat.huma-num.fr/index.php.
Figure 5: TEI CAT enables the critical editor to view a parallel display of the main text and
the selected versions.
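How a parallel view can be derived from units encoded with the parallel segmentation method may be sketched in Python. The sketch rebuilds the running text of one version by choosing, in every unit, the reading attributed to that version. It is an assumption-laden illustration rather than TEI CAT's implementation: the sample line, the sigla and the function name are invented.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample line and sigla, for illustration only.
SAMPLE = (
    '<l xmlns="http://www.tei-c.org/ns/1.0">O '
    '<app><lem wit="#Fog">sveta</lem><rdg wit="#Pag">sveita</rdg></app>'
    ' Notburga</l>'
)


def witness_text(elem, siglum):
    """Rebuild the text of one witness from parallel-segmentation markup."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "app":
            # Take the first lem/rdg whose wit attribute lists the siglum.
            for reading in child:
                if siglum in reading.get("wit", "").split():
                    parts.append(witness_text(reading, siglum))
                    break
        else:
            parts.append(witness_text(child, siglum))
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(SAMPLE)
for siglum in ("#Fog", "#Pag"):
    print(f"{siglum:<6}{witness_text(line, siglum)}")
```

Printing the reconstructed lines one below the other, or in adjacent columns, gives the simplest form of a parallel display.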
The disadvantage of the parallel display in the TEI CAT tool is that, in longer texts,
the columns are aligned only at the beginning of the file; further on, the alignment
can be lost, so that the reader loses the point of reference for comparison. The tool
cannot (yet) be downloaded to the user’s computer and run locally. It is in fact not
primarily intended for preparing an edition as a publication for the general readership,
but rather serves to allow verifications in the course of the editorial process. However,
in addition to its being very practical for displaying the apparatus and several other
functionalities, its greatest advantage is the basic statistical analysis that it produces of
the document, not just of the TEI tags used, but also of the texts themselves: it gener-
ates a simple but informative frequency list of the words occurring in the edition, with
any spelling variant being considered as a new word form, of course.
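A frequency list of this kind is simple to approximate. The following Python sketch assumes plain text as input, lower-cases it and counts every distinct spelling as a separate word form; the sample text and the function name are invented, and the tokenisation is deliberately naive.

```python
import re
from collections import Counter


def frequency_list(text):
    """Count word forms; each spelling variant counts as its own form."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens).most_common()


# Invented text containing two spellings of the same word.
sample = "O sveta Notburga, o sveita Notburga"
for form, count in frequency_list(sample):
    print(f"{count:>3}  {form}")
```

Because no normalisation is applied, sveta and sveita appear as two separate entries, which mirrors the behaviour described above.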
EVT
Open source EVT – Edition Visualization Technology – is designed to produce
and publish digital scholarly editions in TEI. As with TEI CAT, the encoding of the
critical apparatus with the parallel segmentation method is required.11 A group led by
Roberto Rosselli del Turco conceived EVT with the explicit aim of bridging the gap
between the TEI Guidelines as a first-rate standard for the production of complex phil-
ological works, such as critical editions, and the problems that philologists face when
they want their editions encoded in TEI visualised and published online (Rosselli del
Turco 2014). Whether locally or online, EVT is opened and used as a web page in
11 The EVT tool is freely available for download to a personal computer and is easy to install.
the selected browser. The tool is designed as a dynamic environment, with JavaScript
being used to upgrade HTML options. It offers a range of options for displaying criti-
cal texts and their variants, including a parallel version and various details about the
particular units of the apparatus, which can be freely selected by switching between
and generating various displays in real time (see Figure 6). Among the options that
would be welcome for the type of editions contained in the eZISS library are support
for the dynamic display of digital facsimiles, support for the designated entities and
their lists, such as place and personal names, etc. (clearly, these must be appropriately
encoded in TEI), and a high level of adaptability to specific project needs.
The conception of the EVT tool is determined by the common conceptual world
of Western European philology, whereby the critical editor normally chooses to present
the text of one selected manuscript accompanied by a smaller or larger number of
versions of the same text presented in the form of a critical apparatus. This concept
is based on a rich textual tradition composed of thousands of medieval manuscripts
both in Latin and in various vernaculars. For example, the digital edition of Chaucer’s
Canterbury Tales prepared by Peter Robinson is based on a transcription of these sto-
ries in around 80 preserved manuscripts and incunabula. The Slovenian textual tradi-
tion is much less extensive: texts have been preserved in several versions only since
the early modern era, while it is only from the eighteenth century onwards that the
Slovenian literary tradition offers a significant increase in textual variance. Another
large area extremely rich in variation is Slovenian folk poetry, which is not discussed
here; nonetheless, EVT might be an ideal tool for studying the exceptional variation
of Slovenian folk poetry.
For the Slovenian manuscript culture to which Foglar’s Manuscript belongs, it is
very often the case that only a single manuscript has survived of several witnesses of
the text. In such situations, the rich textual tradition has only been passed down to us
as one surviving manuscript, the so-called codex unicus. This becomes the sole object
of a critical edition, which requires a meticulous and detailed presentation, in particu-
lar by distinguishing between its diplomatic and critical transcript, which is typical of
a philology such as Slovenian philology. In the light of the above, the design of a
high-quality, complex tool such as EVT should be appropriately adjusted to optimise the
display of a parallel representation of a diplomatic and critical transcript of the same
text (in some cases, it will involve critical apparatus, but unless at least two versions of
the text have been preserved, the apparatus cannot be compiled).
Figure 6: The EVT tool enables a number of dynamic ways to display the digital scholarly
edition, e.g., by showing the main text on the left and the selected version of it on the
right.
From this perspective, Foglar’s hymn book is a particularly demanding example.
On the one hand, with eight previously recorded versions of textual transmission or
tradition, it requires a classical Western European type of scholarly edition; on the
other hand, a Slovenian philological type of scholarly edition is determined by the
contrasting method involving the diplomatic and the critical transcript of the main
text. In the future, this need should also be met by adjustments made to its reading
display solutions.
Conclusion
The article presents the method adopted to compile a critical apparatus of variant
readings in the digital scholarly edition of Foglar’s Manuscript, a Slovenian Baroque
hymn book from the mid-eighteenth century. The editor compared Foglar’s text with
its versions in eight other manuscripts and old prints. The variant readings identi-
fied in the collation process were encoded with XML elements according to the TEI
Guidelines as units of the critical apparatus. The variation of older poetic
texts raises the problem that different tools offer different functionalities, but no tool
satisfies the needs of all researchers. This opens up (not entirely new) horizons, where
the value of the canonical record of our edition in TEI is further increased, as it can be
processed with various, ever-evolving tools and according to various needs of presentation
and research. The first question that arose was therefore how to label a maximum
number of analytical findings about the variants using the TEI markup: how to indi-
cate whether the differences are on the level of spelling, vocabulary, lexis, semantics,
etc. The second question was how to best display variants of such diversity in the
HTML format designed for reading from the screen. Taking into account the require-
ments of this critical edition, we tested and evaluated three tools for visualising the
critical apparatus. In addition to technology-related differences and the diverse func-
tionalities of these tools, their dependence on individual philological and manuscript
traditions has also been shown. As well as the critical apparatus of variant readings,
the Slovenian handwritten tradition requires support for the parallel presentation of
a diplomatic transcript (with the apparatus) and a critical transcript intended for the
wider reading public due to the significant orthographic differences between early and
modern Slovenian. In further work, we will continue our efforts to bring the
Slovene text tradition ever closer to an ideal method of displaying and publishing texts.
Sources and Literature
• Burghart, Marjorie. 2016. “The TEI Critical Apparatus Toolbox: Empowering Textual Scholars
through Display, Control, and Comparison Features.” Journal of the Text Encoding Initiative 10
(2016). https://journals.openedition.org/jtei/1520#article-1520.
• Burghart, Marjorie. 2017. “Textual Variants.” In Digital Editing of Medieval Texts: A Textbook.
Edited by Marjorie Burghart.
• “Online course: Digital Scholarly Editions: Manuscripts, Texts, and TEI Encoding - Digital Editing
of Medieval Manuscripts.” Digital Editing of Medieval Manuscripts.
https://www.digitalmanuscripts.eu/digital-editing-of-medieval-texts-a-textbook/.
• Cankar, Izidor. 2007. S poti. Elektronska znanstvenokritična izdaja. Edited by Matija Ogrin, Luka
Vidmar and Tomaž Erjavec. Elektronske znanstvenokritične izdaje slovenskega slovstva [Scholarly
Digital Editions of Slovenian Literature], ZRC SAZU, IJS. http://nl.ijs.si/e-zrc/izidor/.
• Damon, Cynthia. 2016. “Beyond Variants: Some Digital Desiderata for the Critical Apparatus
of Ancient Greek and Latin Texts.” In Digital Scholarly Editing: Theories and Practices, edited by
Matthew James Driscoll and Elena Pierazzo, 201–18. Cambridge: Open Book Publishers.
• Ditmajer, Nina. 2017. “Romarske pesmi v Foglarjevi pesmarici (1757–1762).” In Rokopisi
slovenskega slovstva od srednjega veka do moderne, edited by Aleksander Bjelčevič, Matija Ogrin and
Urška Perenič, 75–82. Ljubljana: Znanstvena založba Filozofske fakultete. http://centerslo.si/
wp-content/uploads/2017/10/Obdobja-36_Ditmajer.pdf.
• Ditmajer, Nina, and Matija Ogrin. 2018. “Foglarjeva pesmarica [Foglar’s Hymn Book]. Ms 123.”
In Register slovenskih rokopisov 17. in 18. stoletja [Register of Baroque and Enlightenment Slovenian
Manuscripts]. http://ezb.ijs.si/nrss/.
• Greg, W. W. 1950. “The Rationale of Copy-Text.” Studies in Bibliography 3: 19–36.
• Hockey, Susan. 2000. Electronic Texts in the Humanities. Oxford: Oxford University Press.
• Ogrin, Matija. 2005. “Uvod. O znanstvenih izdajah in digitalni humanistiki.” In Znanstvene izdaje
in elektronski medij, edited by Matija Ogrin, 7–21. Ljubljana: Založba ZRC, ZRC SAZU.
• Ogrin, Matija, and Tomaž Erjavec. 2009. “Ekdotika in tehnologija: elektronske znanstvenokritične
izdaje slovenskega slovstva.” Jezik in slovstvo 54, No. 6 (2009): 57–72.
• Ogrin, Matija, and Tomaž Erjavec. 2009. “Elektronske znanstvenokritične izdaje slovenskega
slovstva eZISS: metode zapisa in izdaje.” Infrastruktura slovenščine in slovenistike, Simpozij
Obdobja 28, edited by Marko Stabej, 123–28. Ljubljana: Znanstvena založba Filozofske fakultete.
http://www.centerslo.net/files/file/simpozij/simp28/Erjavec_Ogrin.pdf.
• Ogrin, Matija, and Andrejka Žejn. 2016. “Strojno podprta kolacija slovenskih rokopisnih besedil:
variantna mesta v luči računalniških algoritmov in vizualizacij.” Zbornik konference Jezikovne
tehnologije in digitalna humanistika, edited by Tomaž Erjavec and Darja Fišer, 125–32. Ljubljana:
Znanstvena založba Filozofske fakultete, Jožef Stefan Institute.
• “P5 Guidelines – TEI: Text Encoding Initiative.” TEI Consortium. http://www.tei-c.org/
Guidelines/P5/.
• Rosselli Del Turco, Roberto, Giancarlo Buomprisco, Chiara Di Pietro, Julia Kenny, Raffaele
Masotti, and Jacopo Pugliese. 2014. “Edition Visualization Technology: A Simple Tool to Visualize
TEI-based Digital Editions.” Journal of the Text Encoding Initiative 8.
https://journals.openedition.org/jtei/1077.
• Sahle, Patrick. 2013. Digitale Editionsformen. Zum Umgang mit der Überlieferung unter den
Bedingungen des Medienwandels. Teil 1: Das typografische Erbe. Norderstedt: BoD.
• TEI Consortium. 2018. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Version
3.3.0. [31 Jan. 2018].
Nina Ditmajer, Matija Ogrin, Tomaž Erjavec
ENCODING TEXTUAL VARIANTS OF THE EARLY MODERN
SLOVENIAN POETIC TEXTS IN TEI
SUMMARY
In the process of textual transmission (Textüberlieferung), many textual vari-
ations appear in the text, which are called (variant) readings (Lesarten) or variants
(Überlieferungsvarianten). The problem of textual variation in Slovenian literary his-
tory, which is particularly evident in numerous handwritten and printed hymn books,
only appears in the early modern age, especially in the Baroque era. Hymnal texts were
transmitted among the people both through oral and written traditions. In the pre-
sent paper, taking as an example the diplomatic transcript of Foglar’s hymn book, we
present the question of encoding the variant readings of this hymnal text as reflected
in its handwritten and printed versions according to the TEI Guidelines from 2019.
The TEI recommendations consist of descriptions of the tags rendered in the cur-
rently most widely used XML markup language. We present Foglar’s Manuscript, the
selected base text, whose diplomatic transcript contains a critical apparatus of its vari-
ant readings located in the other eight preserved hymn books originating in the four
historical Slovenian regions. We first highlight examples of the diplomatic transcript
of verse lines, differentiating between the graphic and the verse line. Various elements
and attributes can be added for the machine analysis of the text, such as an analysis of
rhyme and feet. We then present ways of encoding variant readings, using the parallel
segmentation method and focusing on verse line switches within stanzas and on
substantive and accidental variants. Considering the fact that Slovenian literary texts
were significantly marked by the regional varieties of the standard language prior to its
unification in the mid-nineteenth century, including by an orthographic heterogeneity,
we decided to introduce a number of units of the critical apparatus within a verse line
and assign each variant reading an @type attribute value. In the final section, we
present three tools for text analysis and display: our own XSLT conversion tool, the
TEI Critical Apparatus Toolbox and the open source Edition Visualization Technology
tool. For the critical edition in question, XSLT conversion, which generates a static
web site with a visually separate display of the variant readings in a line, turned out to
be reasonably appropriate. The TEI CAT tool provides a very useful parallel display
of the variants, but is not intended for final publication.
Generally distinguished by powerful functionalities, the EVT tool should be
slightly adjusted for the Slovenian textual tradition, in which the diplomatic and criti-
cal transcripts of the same text play the major role. Future technological solutions
for digital scholarly editions will have to take into account, in particular, the diverse,
complex differences in the structure of both transcripts: the diplomatic transcript, for
example, with its specific problems is encoded and shown as a paragraph in which
several interventions have taken place; the critical transcript, on the other hand, can
display the same text in linguistically regularised forms, as a stanza of rhymed verse
with a marked metric structure, etc. The parallel representation of the digital facsimile
and two methodologically completely different transcriptions (and possibly even a
classical critical apparatus) potentially represents a significant technological problem;
however, only such an ecdotic (text-critical) conception of the scholarly critical edi-
tion can reveal all of the semantic wealth of early modern Slovenian texts.
Nina Ditmajer, Matija Ogrin, Tomaž Erjavec
ZAPIS VARIANTNOSTI STAREJŠIH SLOVENSKIH
PESNIŠKIH BESEDIL V TEI
POVZETEK
V procesu rokopisne preoddaje (Textüberlieferung, Textual transmission) nastajajo
v besedilu številne razlike, ki jih imenujemo variante (Lesarten, readings) ali variantna
mesta (Überlieferungsvarianten, variants). V slovenski literarni zgodovini se problem
variantnosti pojavi še posebej v dobi baroka, ta pa je najbolj vidna v številnih rokopi-
snih in tiskanih pesmaricah, ki so se med ljudstvom širile tako pisno kot ustno.
V prispevku na primeru diplomatičnega prepisa Foglarjeve pesmarice prikazu-
jemo problematiko zapisa variantnih mest istega besedila v preostalih rokopisnih in
tiskanih verzijah po Smernicah TEI (TEI Consortium 2019). Priporočila TEI sesta-
vljajo opisne razlage teh oznak, ki so izražene v trenutno najbolj razširjenem računalni-
škem označevalnem jeziku (markup language) XML. Foglarjev rokopis je v naši izdaji
prepoznan kot temeljno besedilo (base text), ki smo mu v diplomatičnem prepisu
dodali kritični aparat variantnih mest, najdenih v osmih drugih pesmaricah iz štirih
slovenskih historičnih pokrajin. Najprej prikazujemo primere diplomatičnega zapisa
verza z razlikovanjem med grafično in verzno vrstico. Za strojno analizo besedila lahko
zapisu dodajamo različne oznake in atribute, npr. za analizo rime in stopice. Nato z
uporabo metode vzporednega segmentiranja variantnih mest (parallel segmentation
method) prikazujemo primer zapisa variantnih mest. Še posebej se osredotočamo na
označevanje zamenjav verzov v kitici ter substancialnih in akcidentalnih variantnih
mest. Ker so slovenska besedila pred poenotenjem slovenskega knjižnega jezika pre-
cej pokrajinsko obarvana in izkazujejo tudi neenoten pravopis, smo poskusili znotraj
enega verza uvesti več enot kritičnega aparata in variante označiti z vrednostjo atributa
@type. Na koncu smo predstavili in preizkusili tri orodja za prikaz in analizo besedil:
našo lastno pretvorbo XSLT, orodje TEI Critical Apparatus Toolbox in odprtokodno
orodje Edition Visualization Technology. Kot razmeroma primerna se je za našo izdajo
izkazala pretvorba XSLT, ki izdela statično spletno stran z vizualno ločenim izpisom
variantnih mest v vrstici. Orodje TEI CAT omogoča zelo uporaben vzporedni prikaz
variantnih mest, vendar ni namenjeno končnemu publiciranju. Orodje EVT bi bilo
potrebno ob že razvitih zmogljivih funkcionalnostih nekoliko prilagoditi za slovensko
besedilno izročilo, kjer imata največjo vlogo diplomatični in kritični prepis istega bese-
dila. Bodoče tehnološke rešitve elektronskih znanstvenokritičnih izdaj bodo morale
upoštevati zlasti raznolike, kompleksne razlike v strukturi obeh prepisov: diplomatični
prepis je denimo s svojimi specifičnimi problemi označen in prikazan kot odstavek,
v katerega je posegalo več rok ipd.; kritični prepis pa lahko prikazuje isto besedilo v
jezikoslovno regulariziranih oblikah, kot kitico rimanih verzov z označeno metrično
strukturo itn. Vzporedni prikaz digitalnega faksimila in dveh metodološko povsem
različnih prepisov (in eventualno še klasičnega kritičnega aparata) potencialno pred-
stavlja nemajhne tehnološke probleme; vendar šele takšna ekdotična (tekstnokritična)
zasnova edicije razpre vse semantično bogastvo starejših slovenskih besedil.
29I. Dorst: You, Thou and Thee: A Statistical Analysis of Shakespeare’s Use of…
1.01 UDC: 004.934:821.111SHAK(083.41)
Isolde van Dorst*
You, Thou and Thee: A Statistical
Analysis of Shakespeare’s Use of
Pronominal Address Terms
IZVLEČEK
YOU, THOU IN THEE: STATISTIČNA ANALIZA UPORABE IZRAZOV
ZAIMKOVNEGA NASLAVLJANJA PRI SHAKESPEARU
Študija se ukvarja z oblikovanjem napovednega modela, namenjenega ugotavljanju,
katere jezikovne in nejezikovne značilnosti vplivajo na izbiro zaimkov v Shakespearovih
igrah. V angleščini, ki se je uporabljala v Shakespearovem obdobju, je razlikovanje med
YOU in THOU, ki je danes arhaično, še obstajalo. Običajno se navaja, da sta ga določala
relativni družbeni status ter osebna bližina govorca in naslovljenca. Vendar pa je treba še
ugotoviti, ali bo statistično strojno učenje potrdilo to tradicionalno razlago. Proučuje se 23
značilnosti, izbranih z različnih jezikoslovnih področij, kot so pragmatika, sociolingvistika
in analiza pogovora. Trije uporabljeni algoritmi – naivni Bayesov klasifikator, odločitveno
drevo in metoda podpornih vektorjev – so izbrani kot ilustrativni nabor možnih modelov
zaradi njihovih kontrastnih predpostavk in učne pristranskosti. Opravita se dve napovedi,
prva o binarnem (you/thou) razlikovanju in druga o trinarnem (you/thou/thee) razli-
kovanju. Od vseh treh algoritmov daje najboljše rezultate metoda podpornih vektorjev. Po
ugotovitvah so značilnosti, ki najbolje napovejo izbiro zaimka, besede iz neposrednega jezi-
kovnega konteksta. Izkazalo se je, da na napoved zaimka vpliva tudi več drugih značilnosti,
vključno z imenom govorca in naslovljenca, razliko v statusu ter pozitivnim ali negativnim
mnenjem.
Ključne besede: izrazi zaimkovnega naslavljanja, Shakespeare, korpusno jezikoslovje,
digitalna humanistika, statistično modeliranje
* Lancaster University, i.vandorst@lancaster.ac.uk
ABSTRACT
This study creates a prediction model to identify which linguistic and extra-linguistic fea-
tures influence pronoun choices in the plays of Shakespeare. In the English of Shakespeare’s
time, the now-archaic distinction between you and thou persisted, and is usually reported as
being determined by relative social status and personal closeness of speaker and addressee.
However, it remains to be determined whether statistical machine learning will support this
traditional explanation. 23 features are investigated, having been selected from multiple
linguistic areas, such as pragmatics, sociolinguistics and conversation analysis. The three
algorithms used, Naive Bayes, decision tree and support vector machine, are selected as
illustrative of a range of possible models in light of their contrasting assumptions and lear-
ning biases. Two predictions are performed, firstly on a binary (you/thou) distinction and
then on a trinary (you/thou/thee) distinction. Of the three algorithms, the support vector
machine models score best. The features identified as the best predictors of pronoun choice
are the words in the direct linguistic context. Several other features are also shown to influ-
ence the pronoun prediction, including the names of the speaker and addressee, the status
differential, and positive and negative sentiment.
Keywords: pronominal address terms, Shakespeare, corpus linguistics, digital humani-
ties, statistical modelling
Introduction
For several decades much research has been undertaken on the use of you, thou and
thee in Shakespeare’s works. However, the results so far have yet to arrive at an exact
and conclusive answer regarding how these pronouns were used.
This study combines the strengths of multiple research fields in an effort to deter-
mine via hitherto unused methods which linguistic and extra-linguistic features influ-
ence the choice of second person singular pronoun (you versus thou or thee) in the
plays of William Shakespeare. Prior findings in literary and linguistic studies are uti-
lised to find which features could be relevant in this choice, and tools and applications
created for corpus linguistics and computer science are exploited to analyse the data in
a more exact way than has so far been accomplished. Through these techniques, I hope
to identify which features can contribute to a more accurate prediction of pronoun
choice, in a model to mimic the pronoun use of Shakespeare.
It is worth observing at this point that it has not yet been determined whether it
is even possible to predict the pronoun based on linguistic features. Part of the aim
of this paper is to make a determination on this point. In other words, is it possible to
create a computational model that can predict which pronoun will be used based on a
set of linguistic and extra-linguistic features taken from the text itself and selected on
the basis of knowledge that we have of English in the late 1500s and early 1600s? To
accomplish this, all occurrences of you, thou and thee are extracted from Shakespeare’s
plays, and every instance is manually coded for 23 linguistic and extra-linguistic fea-
tures, creating data which will serve to ascertain the answer to this primary question. A
second question to be addressed is whether some features perform better as predictors
of the pronoun choice than others. Thirdly, the issue of whether the use of different
algorithms affects the prediction outcomes will be considered.
Throughout this paper, italicised you, thou and thee refer to specific pronoun forms.
Whereas you, in Early Modern English as in contemporary English, does
not exhibit any formal variation for pronoun case, thou is strictly a nominative form
with thee as its accusative/dative form. Thou and thee are therefore related inflectional
forms of a single pronoun lemma; you exists in variation with both. Small capitals are
used to indicate the pronoun lemmas, thus: YOU and THOU, where THOU includes both
thou and thee. Whenever discussing pronouns in this paper, I am strictly referring to
the singular second-person pronouns you, thou and thee that are examined in this study.
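The extraction step and the lemma convention described above can be sketched in Python. This is a minimal illustration, not the procedure actually used in the study: the function name, the token pattern, the size of the context window and the sample sentence are assumptions, and the sketch cannot tell singular you from plural you, which still requires manual coding.

```python
import re

# Map each pronoun form to its lemma: THOU covers both thou and thee.
LEMMA = {"you": "YOU", "thou": "THOU", "thee": "THOU"}
TOKEN = re.compile(r"[A-Za-z']+")


def extract_pronouns(text, window=2):
    """Return every you/thou/thee with its lemma and nearby words."""
    tokens = TOKEN.findall(text)
    hits = []
    for i, token in enumerate(tokens):
        form = token.lower()
        if form in LEMMA:
            hits.append({
                "form": form,
                "lemma": LEMMA[form],
                "left": tokens[max(0, i - window):i],
                "right": tokens[i + 1:i + 1 + window],
            })
    return hits


# Invented sentence in an Early Modern style, for illustration only.
sample = "I pray thee tell me, dost thou love me? You know I do."
for hit in extract_pronouns(sample):
    print(hit["lemma"], hit["form"], hit["left"], hit["right"])
```

Each hit could then serve as one row of the dataset, to which the manually coded features are added.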
Background
Digital Humanities
Over the past few years, computational research has branched out into other
research fields that are not necessarily closely connected to computer science. Digital
Humanities (DH) is an umbrella term for all research that is computational but
approaches the datasets investigated within, and/or addresses questions or problems
that are of importance to, the disciplines of the humanities.
The popularity of Digital Humanities, a cross-domain field of study, is attributable
to the fact that it does not diminish the differences between fields but rather opera-
tionalises this difference to solve difficulties that could not be dealt with within a single
discipline. The role of computational methods in the humanities can be considered
as that of a supporting character; in any DH computer modelling research, it should
be kept in mind that the interpretation is as important as the suitability of a computational
model and its outcomes.
Early Modern English and you/thou
In Early Modern English (EModE), two different second person singular pro-
nouns were used, namely the formally singular thou and the formally plural (but
pragmatically also respectful-singular) you, with only the latter surviving the EModE
period (Taavitsainen and Jucker 2003). The difference between the uses of these two
pronouns is evident from multiple literary studies that have addressed Shakespeare’s
work, work of his contemporaries, and other documents from this era, such as Walker
(2003) and Busse (2002). These studies suggest that unwritten social rules governed
the use of these pronouns, and that abiding by these rules was necessary in order to speak
according to society’s standards. The use of the two different pronouns acted as a
sign of relative status: you would be used towards superiors and thou towards inferiors. The
choice of pronoun can thus also operate as a subtle means of showing respect or dis-
respect; using the pronouns in this way would have been natural and easy to English
native speakers of the period.
Shakespeare lived during the Early Modern English period, and thus used both
you and thou in his writing. His work was written less than 100 years before thou and
thee disappeared from the standard language (surviving in dialects and archaicised
registers, such as pious addresses to the divinity). Thus we may straightforwardly
posit that the disappearance of thou was likely already in progress around his time.
Though obviously heightened in its use of emotional and dramatic language and style
to suit the genre of the play script, the language of Shakespeare – including
the usage of the two second-person pronouns – can be assumed to be a reasonably
good representation of the language used generally in social interaction and conversation
at that time (Calvo 1992).
Prior Studies on you/thou
Most studies of Shakespeare’s use of you and thou so far have been literary and
non-numeric (Brown and Gilman 1960; Quirk 1974; Calvo 1992); the relatively
few to have used data-based or quantitative techniques did not implement any method
beyond directly comparing raw frequency counts (Busse 2003; Mazzon 2003; Stein
2003). Moreover, these studies did not look at all the extant Shakespeare plays, but
instead chose a few plays to focus on. Nonetheless, these studies have demonstrated
some patterns in the use of you and thou and thus provide a workable foundation
for a more in-depth study of the usage of those two pronouns.
These prior studies converge on the overall conclusion that the pronouns you and
thou appear to be used to support the explicit expression of respect, social status, and
familiarity. Quirk (1974) and Mazzon (2003) characterise the role of the pronoun as
a linguistic marker, whose usage can be seen as either marked or unmarked. In other
words, the use of a particular pronoun can be seen as marked when it is used unexpectedly,
for example when you is expected based on social status, but thou is used
instead. Thus, in contrast to earlier studies (Brown and Gilman 1960), they do not perceive
you and thou to be in direct contrast, but rather to have a more variable interpretation
than was assumed until then, based on the context in which they occur. Calvo (1992) and Stein
(2003) expand on this by concluding that the markedness of the pronoun depends on
the context and the situation, while the pronoun choice also depends on stable
factors such as the social statuses of, and the level of familiarity between, the characters
in Shakespeare’s plays (the speakers and addressees in this study), rather than on the
latter factors alone (Brown and Gilman 1960). The emotive effect of the utterances within
which the you/thou distinction is utilised is of importance as well; feelings such as
anger and love for another character may find expression through pronoun choice.
This is connected to the notion of respect, as, in an angry remark, marked pronouns
can be used to disrespect the addressee based on their social status (Stein 2003).
As Stein (2003) and Busse (2006) already stressed in their studies, a study of
you and thou in Shakespeare cannot and should not be limited to a single research
discipline. Rather, what is needed is a combination of literature, sociolinguistics,
pragmatics and conversation analysis, which are all useful in capturing the complexity
of pronominal address and the social constraints that may have underpinned the
choice of one honorific pronoun-form over the other.
Methodology
As has already been mentioned, this is a strictly empirical study which attempts
to verify the findings of earlier research through a computational approach. The use
of a computational, statistical method is motivated by the goal of creating a more
objective representation of Shakespeare’s use of you and thou in his plays than has
been accomplished so far, since it does not require analysis of meaning-in-context by
a human being, but rather proceeds directly from quantitative measurements.
Hypotheses
Three hypotheses were formulated on the basis of the literature:
1. No single model will be able to predict the pronominal address term solely based
on linguistic and extra-linguistic features.
This, being a null-hypothesis, is exactly what this study aims to falsify by developing
such a model. It is not likely that a single model will be able to predict Shakespeare’s
original choice of you or thou based on linguistic and extra-linguistic features,
because this choice is dependent on so many factors. However, a computational model that
combines insights from literature, sociolinguistics, pragmatics and conversation analysis
will be able to successfully predict the pronoun choice, as it
includes all the factors that might influence the choice of either you or thou.
2. The features of social status, age and sentiment will be better predictors of the
pronoun choice than other features.
A hierarchy will be established according to how well the linguistic and extra-linguistic
features predict the pronoun choice in the best performing model. It may be
inferred from the literature that social status, age and sentiment are highly likely to be
at the top of this hierarchy, among the most influential features; these three features
have shown up most reliably in prior research.
3. The best performing algorithm will combine features both dependently and
independently.
The different learning biases and assumptions of the three algorithms applied in
this study will reveal how the features interact with one another. The first algorithm,
Naive Bayes, assumes all features are independent of one another, while the decision
tree algorithm assumes that the features are all dependent on each other. Lastly,
the support vector machine works with both dependent and independent features. I
expect the set of features that will be included in the final model to be a combination of
both dependent and independent features, and therefore the support vector machine
algorithm to perform best. The three algorithms will be discussed in more detail later,
in the section Classification Based on Three Algorithms.
Data
The data for this study comes from the Encyclopaedia of Shakespeare’s Language
project¹, which is a research project at Lancaster University (UK). The project corpus
consists of 38 of Shakespeare’s plays, including all 36 plays from the First Folio
with the addition of The Two Noble Kinsmen and Pericles: Prince of Tyre. A broadly
annotated version of the full Shakespeare corpus can be found online². Some of the
annotation and all of the abbreviations used for the titles of the plays follow The Arden
Shakespeare.
Linguistic and Extra-linguistic Features
The Encyclopaedia of Shakespeare’s Language corpus is richly annotated.
However, some additional annotation was necessary to perform a full analysis of what
extra-linguistic features could be predictors of the pronominal address term. The full
set of features used in this study can be found in Table 1. The added features are briefly
described here.
¹ More information on this project, which is funded by the Arts and Humanities Research Council (AH/N002415/1), can be found on http://wp.lancs.ac.uk/shakespearelang/.
² CQPweb Main Page, http://cqpweb.lancs.ac.uk.
As a referent (such as a second person singular pronoun) is dependent on context,
the adjacent part of the utterance is used as a feature to test the effect of co-text. Six
co-textual words are included, i.e. a 7-gram altogether. “LW” labels the words occurring
on the left of the pronoun, and “RW” the words on the right of the pronoun. Each
of these words is numbered based on its distance from the pronoun, e.g. LW3 is
the third word on the left of the pronoun. In corpus linguistics, collocations are often
examined within a three-word-window, meaning there are three words on either side
of the word of interest. While I am not necessarily looking at specific collocations of
you and thou, the LW/RW features will look at similarities and differences in
co-textual words to see if they can predict the pronoun choice.
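To make the LW/RW layout concrete, the following is a minimal Python sketch of how three words on either side of a pronoun could be pulled out of an utterance. The function name `context_windows`, the simple regular-expression tokeniser and the use of `None` for positions that run past the edge of the utterance are assumptions made for illustration; they are not the procedure that was used to build the corpus features.

```python
import re

PRONOUNS = {"you", "thou", "thee"}


def context_windows(utterance, width=3):
    """Return one row per pronoun with its LW1..LWn and RW1..RWn neighbours."""
    # Assumption: lower-cased word tokens, punctuation dropped.
    tokens = re.findall(r"[a-z']+", utterance.lower())
    rows = []
    for i, token in enumerate(tokens):
        if token not in PRONOUNS:
            continue
        row = {"pronoun": token}
        for distance in range(1, width + 1):
            left, right = i - distance, i + distance
            # None marks a slot that falls outside the utterance.
            row[f"LW{distance}"] = tokens[left] if left >= 0 else None
            row[f"RW{distance}"] = tokens[right] if right < len(tokens) else None
        rows.append(row)
    return rows


for row in context_windows("I do beseech thee, hear me speak"):
    print(row)
```

For this utterance the single row has LW1 = beseech, LW2 = do, LW3 = i on the left of thee, and RW1 = hear, RW2 = me, RW3 = speak on the right, which together with the pronoun form the 7-gram.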
Another feature noted as critical in prior studies is sentiment, that is, the use of
the pronoun to convey positivity or negativity. Sentiment was annotated on
the 7-gram described above, using SentiStrength, a lexicon-based sentiment analysis
program that assigns each phrase one score for positivity and one for negativity (Thelwall et al.
2010). Since SentiStrength was developed to work with online comments rather than
complete sentences as in formal written English, it works well with n-grams too. The
scores for positivity and negativity are kept as separate variables.
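The idea of keeping two separate scores can be illustrated with a toy lexicon-based scorer. This is not SentiStrength itself: the lexicon `TOY_LEXICON` and its weights are invented for the example, and the rule of taking the strongest word on each side is a simplification. Only the score ranges (1 to 5 for positivity, -1 to -5 for negativity) follow the scale described by Thelwall et al. (2010).

```python
# Hypothetical mini-lexicon; the real SentiStrength lexicon is far larger.
TOY_LEXICON = {
    "love": 3, "sweet": 2, "gentle": 2,
    "villain": -4, "hate": -4, "foul": -3,
}


def dual_sentiment(words):
    """Return (positive, negative) scores for a list of words.

    Neutral text scores (1, -1); each side takes its strongest word.
    """
    positive, negative = 1, -1
    for word in words:
        strength = TOY_LEXICON.get(word.lower(), 0)
        if strength > 0:
            positive = max(positive, strength)
        elif strength < 0:
            negative = min(negative, strength)
    return positive, negative


print(dual_sentiment("I love thee sweet villain".split()))
```

Because the two values are returned separately, a 7-gram that mixes affection and insult keeps both signals instead of cancelling them out, which is why Pos_Sent and Neg_Sent can enter a model as two distinct features.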
The corpus already included metadata on the speakers; however, I wanted to
include age as well. The age of a character is often not given except for when it is
an important attribute of that character, making this difficult to annotate. Therefore,
Quennell and Johnson’s (2002) character descriptions were used. The characters were
sorted into a trinary classification, with ‘adult’ as the default category. Any deviations
towards ‘younger’ or ‘older’ were based on textual references or the character’s name,
such as for ‘Old Man’ in King Lear. Older characters were occasionally classified as
such based on the fact they had adult children with prominent roles in the plays.
Table 1: List of all features used in this study
Feature                  Acronym        Annotation
Genre                    Genre          Pre-annotated
Play name                Play           Pre-annotated
Play, act, scene         Scene          Pre-annotated
Speaker ID               S_ID           Pre-annotated
Speaker gender           S_Gender       Pre-annotated
Speaker status           S_Status       Pre-annotated
Production date          Prod_Date      Pre-annotated
N-gram                   LW1-3, RW1-3   Automatic
Positive sentiment       Pos_Sent       Automatic
Negative sentiment       Neg_Sent       Automatic
Speaker age              S_Age          Manual
Location                 Location       Manual
Addressee ID             A_ID           Automatic
Addressee gender         A_Gender       Pre-annotated
Addressee status         A_Status       Pre-annotated
Addressee age            A_Age          Manual
Status differential      Stat_Diff      Automatic
No. of people addressed  A_Number       Pre-annotated
A more global feature is the location where the scene is set. This was difficult to
annotate, due to the often unreliable stage directions. Instead of a nominal description
for each scene location, I used a binary annotation of ‘public’ and ‘private’. The text
itself was examined to determine the location based on what characters said about their
location, but in addition Bate and Rasmussen’s (2007) annotation and Greenblatt,
Cohen, Howard and Maus’ (1997) annotations were consulted. The use of these three
resources enabled the binary manual annotation of location for every scene.
Besides the information about the speaker and the scene, information regarding
the addressee is essential when analysing character interaction from a conversation
analysis perspective. As a manual annotation for addressee would be incredibly time-consuming,
I instead used an automatic method which identifies the previous speaker
as the addressee of any given utterance. This is in line with the last-as-next bias used
in conversation analysis (Mazeland 2003). This means that, even in larger group conversations,
it is often expected that the last speaker before the current speaker will
also be the next speaker, thus making it likely that the current speaker is addressing
the last speaker. If the utterances were interrupted by the start of a new scene or other
stage directions (e.g. someone walking into the scene), the annotated addressee would
be the next speaker rather than the previous speaker for the first utterance after the
interruption.
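The addressee heuristic can be sketched in a few lines of Python. The representation below is an assumption made for illustration: a scene is a flat list of speaker names in order, with a hypothetical `BREAK` marker standing for a scene change or an interrupting stage direction. The actual annotation script is not reproduced here.

```python
BREAK = None  # hypothetical marker: new scene or interrupting stage direction


def assign_addressees(speakers):
    """Pair each utterance's speaker with a guessed addressee.

    Default: the previous speaker (last-as-next bias).
    First utterance after a break: the next speaker instead.
    """
    pairs = []
    for i, speaker in enumerate(speakers):
        if speaker is BREAK:
            continue
        previous = speakers[i - 1] if i > 0 else BREAK
        if previous is not BREAK:
            addressee = previous
        else:
            # Nothing usable before this turn, so look ahead one turn.
            addressee = speakers[i + 1] if i + 1 < len(speakers) else BREAK
        pairs.append((speaker, addressee))
    return pairs


turns = ["Lear", "Kent", "Lear", BREAK, "Fool", "Lear"]
for speaker, addressee in assign_addressees(turns):
    print(f"{speaker} -> {addressee}")
```

In this invented exchange, Lear's opening turn and the Fool's first turn after the break both look forward to the next speaker, while every other turn is assigned the previous speaker. An utterance with no neighbour on either side keeps `None` as its addressee.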
Using the data for the social status of the speaker and the addressee, I also created
a status differential. As the status category labels are numeric and ordered, this
can be done by taking the difference between the two. For example, a king (status
= 0) and a servant (status = 6) are distant in status, and thus will have a high status
differential (here: 6). Between a king and a prince (status = 1), the difference is a lot
smaller (here: 1). This absolute feature was automatically generated from the already
annotated features.
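As a worked example, the status differential for the two pairs mentioned above can be computed as follows. The sketch reads "absolute" as the unsigned difference between the two status codes, which is an interpretation rather than something spelled out in the text; the function name is likewise hypothetical. The status codes for king, prince and servant are the ones given above.

```python
STATUS = {"king": 0, "prince": 1, "servant": 6}  # codes quoted in the text


def status_differential(speaker_status, addressee_status):
    """Unsigned distance between two ordered status codes (assumed reading)."""
    return abs(speaker_status - addressee_status)


print(status_differential(STATUS["king"], STATUS["servant"]))  # 6
print(status_differential(STATUS["king"], STATUS["prince"]))   # 1
```

With an unsigned difference the feature records how far apart two characters are, but not who outranks whom; that direction remains available to a model through S_Status and A_Status.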
A feature that had to be excluded is familiarity between characters (social distance).
This data was not already available, and it was beyond the scope of this study
to annotate this for all relevant character pairs. The literature has shown this to be a
relevant feature. However, through the use of sentiment analysis, I have attempted to
cover the complimentary and insulting aspects that could arise from high familiarity,
and any lack thereof arising from low familiarity. Obviously, this does not cover all
aspects of familiarity, but it means that this feature is not totally neglected.
Classification Based on Three Algorithms
Three different algorithms are used for the classification task, namely Naive Bayes,
decision trees and support vector machines. Whereas it would be ideal to achieve a
high precision and recall score, the main goal of this research is to see whether it is
even possible to predict the second person singular pronoun choice through a computational
application at all. If this is indeed the case, what features contribute to this
prediction? It is thus more important to verify which features influence the choice and
to what extent they do so.
The reason for using three algorithms, and in particular these three, is their differences
in learning biases and assumptions. Naive Bayes assumes all features are
independent of one another, whereas the decision tree attempts to create a dependent,
hierarchical structure in the features. The support vector machine (SVM) is more complex
and is able to combine both dependent and independent features. The addition of the
latter algorithm will be particularly useful if the difference between the two simpler
algorithms’ models is small.
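Stated a little more precisely, the independence that Naive Bayes assumes is conditional on the class. Writing $y$ for the pronoun and $x_1, \dots, x_n$ for the feature values of one instance (symbols introduced here for illustration, not taken from the study), the classifier chooses:

```latex
\hat{y} \;=\; \arg\max_{y \,\in\, \{\textit{you},\ \textit{thou}\}} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

Each feature therefore contributes its own factor regardless of the values of the others. A decision tree, by contrast, tests features in sequence, so the effect of one feature can depend on the outcome of the tests above it.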
As well as applying three algorithms, I will also look at the difference between
keeping thou and thee separate and combining them into the one category thou. For
this, I will run both a binary (you and thou) and a trinary (you, thou and thee) classification,
to see whether this affects the scores or changes which features are included
in the best models.
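The only difference between the two set-ups is the class label. A minimal sketch, assuming the trinary labels are stored as lower-case strings (the function name `to_binary` is hypothetical):

```python
def to_binary(label):
    """Collapse the trinary label set into the binary one (thee counts as thou)."""
    return "thou" if label in ("thou", "thee") else label


trinary_labels = ["you", "thou", "thee", "you"]
binary_labels = [to_binary(label) for label in trinary_labels]
print(binary_labels)  # ['you', 'thou', 'thou', 'you']
```

The feature values of every instance stay the same in both runs, so any change in scores or in the selected features can be attributed to the relabelling alone.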
Overview of Implementation
I ran the three algorithms using the Waikato Environment for Knowledge Analysis
(Weka³) software⁴ with the default settings. The algorithms were run using 10-fold
cross-validation, so that the reported scores are based on training and testing over all folds
combined.
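For readers unfamiliar with the procedure, the sketch below shows the general shape of k-fold cross-validation: the instances are divided into k folds, and each fold serves once as the test set while the remaining folds are used for training. The helper `k_fold_indices` is hypothetical and deliberately simple; Weka's own fold assignment may differ, for instance in shuffling or stratification.

```python
def k_fold_indices(n_instances, k=10):
    """Yield (train, test) index lists so that each instance is tested once."""
    # Assumption: round-robin assignment of instances to folds, no shuffling.
    folds = [list(range(start, n_instances, k)) for start in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [index
                 for number, fold in enumerate(folds) if number != held_out
                 for index in fold]
        yield train, test


for train, test in k_fold_indices(22932, k=10):
    print(len(train), len(test))
```

Scores pooled over the ten test folds then cover every instance exactly once, which is what allows a single precision, recall and F-measure to be reported per model.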
The number of relevant instances of you/thou/thee extracted from the dataset is
22,932, which makes up 99.5% of the total number of such pronouns in the dataset.
The pronouns were extracted using a Python script with simple heuristics. About 0.5%
were missed due to noise in the dataset. The number of instances of you/thou/thee that
were extracted from each play ranges from 363 (in Macbeth) to 811 (in Coriolanus).
³ Weka 3 - Data Mining with Open Source Machine Learning Software in Java, http://www.cs.waikato.ac.nz/ml/weka/.
⁴ In Weka, Naive Bayes is identified as NaiveBayesMultinomial, decision tree as J48, and support vector machine as SMO.
I attempted to improve or maintain the scores while making the model simpler
by excluding features, that is, through feature ablation. When there were conflicting
changes in the scores, the scores of precision and F-measure were prioritised. I hoped
to identify which features truly help predict the pronoun by building the simplest but
best performing model. The baseline that the models were compared to is derived
from the distribution of the pronouns in the dataset, that is, 62.6% you and 37.4%
thou.
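The baseline scores reported in the results tables follow directly from this distribution if the baseline is read as a classifier that always answers you. The sketch below works this out under that assumption; the share 0.626417 is the baseline accuracy given in the tables, and the function names and the notional sample size are illustrative choices.

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure


def majority_baseline(share_you, n=1_000_000):
    """Scores for a classifier that always answers 'you'."""
    n_you = round(share_you * n)   # instances whose true label is 'you'
    n_other = n - n_you            # instances with any other true label
    you = prf(tp=n_you, fp=n_other, fn=0)
    other = prf(tp=0, fp=0, fn=n_other)
    # Weighted average: each class counts in proportion to its true frequency.
    weighted = tuple((n_you * a + n_other * b) / n for a, b in zip(you, other))
    return you, other, weighted


you, other, weighted = majority_baseline(0.626417)
print([round(v, 3) for v in you])       # [0.626, 1.0, 0.77]
print([round(v, 3) for v in weighted])  # [0.392, 0.626, 0.483]
```

The weighted precision of 0.392 is simply 0.626 squared: the you class has precision 0.626 and carries a weight of 0.626, while every other class contributes zero.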
I first took out groups of features that are related, rather than one feature at a
time. Among the 23 features, I created six different groups. The first group related to
the wider linguistic and social context (play, production date, genre, scene, location),
while the second group was the closer linguistic co-text (n-gram). Information on
the speaker (name, status, gender, age) and the addressee (name, status, gender, age,
number of people) were groups 3 and 4. I kept status differential on its own, because
it relates to multiple groups. Finally, the last group was sentiment (positive and negative).
After the group ablation, I went back over the features to see if individual feature
exclusions would improve the model further. This ensured the simplest and best model
for each algorithm. The scores and the features included in each model are given in
Tables 2, 3 and 4.
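The group-wise ablation can be sketched as a simple loop. The six groups below follow the description above and the acronyms of Table 1. Everything else is an assumption for illustration: `ablate_groups` is a hypothetical helper, and `toy_evaluate` is an invented stand-in for running a classifier with cross-validation and returning its score.

```python
FEATURE_GROUPS = {
    "context": ["Play", "Prod_Date", "Genre", "Scene", "Location"],
    "co_text": ["LW1", "LW2", "LW3", "RW1", "RW2", "RW3"],
    "speaker": ["S_ID", "S_Status", "S_Gender", "S_Age"],
    "addressee": ["A_ID", "A_Status", "A_Gender", "A_Age", "A_Number"],
    "status_differential": ["Stat_Diff"],
    "sentiment": ["Pos_Sent", "Neg_Sent"],
}


def ablate_groups(groups, evaluate):
    """Drop each group in turn; keep it out if the score does not get worse."""
    kept = [feature for features in groups.values() for feature in features]
    best = evaluate(kept)
    for features in groups.values():
        candidate = [feature for feature in kept if feature not in features]
        if not candidate:
            continue
        score = evaluate(candidate)
        if score >= best:  # simpler model, equal or better score
            kept, best = candidate, score
    return kept, best


def toy_evaluate(features):
    """Invented scorer: rewards five 'useful' features, penalises model size."""
    useful = {"LW1", "RW1", "S_ID", "A_ID", "Stat_Diff"}
    return len(useful & set(features)) - 0.01 * len(features)


kept, score = ablate_groups(FEATURE_GROUPS, toy_evaluate)
print(len(kept), sorted(kept))
```

A second pass of the same kind over individual features, rather than groups, would correspond to the per-feature exclusions described above.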
Results
Trinary Classification Scores
Table 2 shows the results of the trinary classification. As can be seen, each model
performed considerably better than the baseline model on all weighted-average scores. The
best model is the support vector machine model, with a weighted-average F-measure of 0.854.
Table 2: Scores for precision, recall, F-measure and accuracy for trinary pronoun prediction
Algorithm               Class          Precision  Recall  F-measure  Accuracy
Baseline                Weighted Avg.  0.392      0.626   0.483      62.6417%
                        you            0.626      1.000   0.770
                        thou           0.000      0.000   0.000
                        thee           0.000      0.000   0.000
Naive Bayes             Weighted Avg.  0.826      0.826   0.826      82.64%
                        you            0.880      0.885   0.882
                        thou           0.865      0.850   0.857
                        thee           0.509      0.510   0.510
Decision Tree           Weighted Avg.  0.732      0.752   0.712      75.2093%
                        you            0.738      0.960   0.835
                        thou           0.896      0.574   0.700
                        thee           0.408      0.097   0.157
Support Vector Machine  Weighted Avg.  0.854      0.857   0.854      85.675%
                        you            0.871      0.927   0.898
                        thou           0.919      0.836   0.876
                        thee           0.659      0.566   0.609
Binary Classification Scores
Table 3 shows the results of the best models for the binary classification. The
best model is again the support vector machine model, with a weighted-average F-measure of 0.872.
This is also the best scoring model out of all models presented in this paper.
Table 3: Scores for precision, recall, F-measure and accuracy for binary pronoun prediction
Algorithm               Class          Precision  Recall  F-measure  Accuracy
Baseline                Weighted Avg.  0.392      0.626   0.483      62.6417%
                        you            0.626      1.000   0.770
                        thou           0.000      0.000   0.000
Naive Bayes             Weighted Avg.  0.868      0.868   0.867      86.8306%
                        you            0.876      0.920   0.897
                        thou           0.853      0.782   0.816
Decision Tree           Weighted Avg.  0.818      0.818   0.818      81.8376%
                        you            0.849      0.863   0.856
                        thou           0.764      0.744   0.754
Support Vector Machine  Weighted Avg.  0.872      0.873   0.872      87.2798%
                        you            0.886      0.914   0.900
                        thou           0.848      0.803   0.825
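The weighted averages in Table 3 can be checked against the per-class rows. The sketch below assumes that each class is weighted by its share of the data (62.6% you, 37.4% thou, as given for the baseline) and recomputes the weighted F-measure for the three binary models; the variable and function names are illustrative.

```python
CLASS_SHARE = {"you": 0.626, "thou": 0.374}

# Per-class F-measures and the reported weighted average, copied from Table 3.
BINARY_F = {
    "Naive Bayes": ({"you": 0.897, "thou": 0.816}, 0.867),
    "Decision Tree": ({"you": 0.856, "thou": 0.754}, 0.818),
    "Support Vector Machine": ({"you": 0.900, "thou": 0.825}, 0.872),
}


def weighted_average(per_class, share=CLASS_SHARE):
    """Average the per-class scores, weighting each class by its data share."""
    return sum(share[label] * per_class[label] for label in share)


for name, (per_class, reported) in BINARY_F.items():
    computed = weighted_average(per_class)
    print(f"{name}: computed {computed:.3f}, reported {reported:.3f}")
```

Under this assumption the recomputed values agree with the reported ones to within rounding, which also shows why the thou row matters less to the overall score than the you row: it carries little more than a third of the weight.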
Feature Comparison of the Models
Overall, the final models contain similar sets of features. The exact compositions
are given in Table 4. What is surprising is that the binary classification model for the
decision tree is very different from the other models: it does not contain any of the
words from the n-gram as a predictor, whereas all the others do.
Table 4: Features included in the best model of each algorithm
Algorithm               Type     Features included
Naive Bayes             Trinary  LW1, LW2, RW1, RW2, S_ID
                        Binary   LW1, LW2, LW3, RW1, RW2, RW3, A_ID
Decision Tree           Trinary  LW1, LW2, RW1, RW2, S_ID, Stat_Diff, Neg_Sent
                        Binary   Scene, S_ID, S_Gender, A_ID, A_Status, A_Age, Stat_Diff, Pos_Sent
Support Vector Machine  Trinary  LW1, RW1, S_ID, S_Age, A_ID, A_Age, A_Number, Stat_Diff, Pos_Sent, Neg_Sent
                        Binary   LW1, RW1, S_ID, S_Age, A_ID, A_Age, A_Number, Stat_Diff, Pos_Sent, Neg_Sent
Discussion
This study has given some new insights into the analysis of pronominal address
terms. Looking at the second person singular pronoun choice as a binary and a trinary
classification problem resulted in slightly different outcomes. Even though the highest
scores were achieved in the binary classification, one might still wonder whether this is
the best method for addressing the second person singular pronoun choice. Looking
back at prior studies on pronoun interpretation and comparing them to the features
used in this study, we can conclude that thee and thou are equal in their opposition to
you, with the main difference being their grammatical role. From the model comparison,
we have seen that the co-text is most important when predicting the pronoun.
This is evidence of the purely grammatical difference between thou and thee and their
overall similarity in other aspects. Therefore, both linguistically and computationally,
it makes more sense to perform a binary classification.
Differences between the algorithms were observed, but all three algorithms easily
outperformed the baseline. The support vector machine models performed best, but
the scores for the Naive Bayes models were quite similar to those for the SVM models.
A choice between these approaches could be based solely on the scores for accuracy,
precision, recall and F-measure, or also by taking into account the complexity, which is
significantly higher for the support vector machine models. The more nuanced models
that the support vector machine creates, which include more features than the models
of the other algorithms, may suggest that the extra complexity of SVM models is
indeed beneficial.
The best predicting features were the LW and RW features, which supports the
importance of the direct linguistic co-text. In particular, RW1 appeared as the most
important feature in predicting the second person singular pronominal address term.
Other important features were the speaker’s name, addressee’s name, status differential,
positive sentiment and negative sentiment, with additional support from the
scene, speaker’s gender, addressee’s status, addressee’s age, speaker’s age, and number of people
addressed. Only six features were not included in any of the models: genre, play,
production date, location, speaker’s status and addressee’s gender.
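The list of unused features can be derived mechanically from Tables 1 and 4 with a set difference. The sketch below copies the feature acronyms from those two tables; the variable names are illustrative.

```python
ALL_FEATURES = {
    "Genre", "Play", "Scene", "S_ID", "S_Gender", "S_Status", "Prod_Date",
    "LW1", "LW2", "LW3", "RW1", "RW2", "RW3", "Pos_Sent", "Neg_Sent",
    "S_Age", "Location", "A_ID", "A_Gender", "A_Status", "A_Age",
    "Stat_Diff", "A_Number",
}

SVM_FEATURES = {"LW1", "RW1", "S_ID", "S_Age", "A_ID", "A_Age", "A_Number",
                "Stat_Diff", "Pos_Sent", "Neg_Sent"}

BEST_MODELS = {
    ("Naive Bayes", "trinary"): {"LW1", "LW2", "RW1", "RW2", "S_ID"},
    ("Naive Bayes", "binary"): {"LW1", "LW2", "LW3", "RW1", "RW2", "RW3", "A_ID"},
    ("Decision Tree", "trinary"): {"LW1", "LW2", "RW1", "RW2", "S_ID",
                                   "Stat_Diff", "Neg_Sent"},
    ("Decision Tree", "binary"): {"Scene", "S_ID", "S_Gender", "A_ID", "A_Status",
                                  "A_Age", "Stat_Diff", "Pos_Sent"},
    ("Support Vector Machine", "trinary"): SVM_FEATURES,
    ("Support Vector Machine", "binary"): SVM_FEATURES,
}

used = set().union(*BEST_MODELS.values())
never_used = ALL_FEATURES - used
print(sorted(never_used))
```

Of the 23 features, 17 appear in at least one model, and the six that remain are Genre, Play, Prod_Date, Location, S_Status and A_Gender, matching the list given above.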
I am, therefore, now able to falsify the null-hypothesis that it is not possible to
build a reliable prediction model based on linguistic and extra-linguistic features.
All six models demonstrate that linguistic and extra-linguistic features substantially
improve the prediction of the pronominal address term, as all six outperform the
baseline.
The second hypothesis, about which features would be good predictors, was partially
correct in predicting that social status, age and sentiment would be included in
the best models. However, none of these features was the main predictor of pronoun
choice; that was the immediate co-text.
With regard to the final hypothesis, it has been revealed that the features are indeed
both dependent on and independent of each other. However, since the Naive Bayes
models perform almost identically to the support vector machine models, we can say
that the features are, for the most part, independent of one another.
Conclusions
The primary finding of this study is that it is indeed possible to build a prediction
model for the use of you versus thou with a singular referent in the plays of
Shakespeare that is based on linguistic and extra-linguistic features. In particular,
the direct linguistic co-text of the second person singular pronoun is important.
Other important features include the speaker’s and addressee’s names, status
differential and both positive and negative sentiment. All in all, this suggests that the
pronoun choice is influenced by several linguistic and extra-linguistic features.
The best scoring model was the binary support vector machine model, with 87.3%
accuracy.
For future research, I would recommend an exploration of other algorithms and
features that were left out of this study, such as morphology, word embeddings and
POS-tags. This will help us gain more information about the linguistic co-text directly
surrounding the second person singular pronoun, which will likely give more insight
into why this direct co-text is so important in deciding the choice of you or thou.
Moreover, including familiarity between characters (social distance) as a feature would
be beneficial, as this has been noted multiple times in prior research as an influential
factor, but was beyond the scope of this study.
Although this study has not yet provided a comprehensive set of all the linguistic
and extra-linguistic features that influence the second person singular pronoun choice
in Shakespeare’s plays, it has definitely provided a more objective and extensive analysis
of the matter that furthers the research into you and thou.
Acknowledgements
The research presented in this article was conducted in collaboration with the
Encyclopaedia of Shakespeare’s Language project at Lancaster University. This project
is funded by the UK’s Arts and Humanities Research Council (AHRC), grant
reference AH/N002415/1. The Shakespeare corpus will be made publicly available
in Summer 2019, first via the CQPweb interface and then through download at a later
stage. Many thanks to Jonathan Culpeper and the rest of the team for their advice and
support throughout the study.
References
Literature:
• Bate, Jonathan, and Eric Rasmussen, eds. 2007. William Shakespeare: Complete Works. London:
The Royal Shakespeare Company.
• Brown, Roger W., and Albert Gilman. 1960. “The Pronouns of Power and Solidarity.” In Style in
Language, edited by Thomas A. Sebeok, 253–76. Cambridge: MIT Press.
• Busse, Beatrix. 2006. Vocative Constructions in the Language of Shakespeare. Amsterdam: John
Benjamins.
• Busse, Ulrich. 2003. “The Co-occurrence of Nominal and Pronominal Address Forms in the
Shakespeare Corpus: Who Says Thou or You to Whom?” In Diachronic Perspectives on Address
Term Systems, edited by Irma Taavitsainen and Andreas H. Jucker, 193–221. Amsterdam: John
Benjamins.
• Busse, Ulrich. 2002. The Function of Linguistic Variation in the Shakespeare Corpus: A Corpus-based
Study of the Morpho-syntactic Variability of the Address Pronouns and Their Socio-historical and
Pragmatic Implications. Amsterdam: John Benjamins.
• Calvo, Clara. 1992. “Pronouns of Address and Social Negotiation in As You Like It.” In Language
and Literature, Vol. 1(1), 5–27. London: Longman Group UK Ltd.
• Greenblatt, Stephen, Walter Cohen, Jean E. Howard, and Katherine E. Maus. 1997. The Norton
Shakespeare: Based on the Oxford Edition. New York: W.W. Norton & Company, Inc.
• Mazeland, Harrie. 2003. Inleiding in de conversatieanalyse. Bussum: Coutinho bv.
• Mazzon, Gabriella. 2003. “Pronouns and Nominal Address in Shakespearean English: A Socio-affective
Marking System in Transition.” In Diachronic Perspectives on Address Term Systems, edited
by Irma Taavitsainen and Andreas H. Jucker, 223–49. Amsterdam: John Benjamins.
• Quennell, Peter, and Hamish Johnson. 2002. Who’s Who in Shakespeare. London: Routledge.
• Quirk, Randolph. 1974. “Shakespeare and the English Language.” In The Linguist and the English
Language, edited by Randolph Quirk, 46–64. London: Edward Arnold.
• Stein, Dieter. 2003. “Pronominal Usage in Shakespeare: Between Sociolinguistics and Conversation
Analysis.” In Diachronic Perspectives on Address Term Systems, edited by Irma Taavitsainen and
Andreas H. Jucker, 251–307. Amsterdam: John Benjamins.
• Taavitsainen, Irma, and Andreas H. Jucker. 2003. “Introduction.” In Diachronic Perspectives on
Address Term Systems, edited by Irma Taavitsainen and Andreas H. Jucker, 1–25. Amsterdam: John
Benjamins.
• Thelwall, Mike, Kevan Buckley, Georgios Paltoglou, Di Cai, and Arvid Kappas. 2010. “Sentiment
Strength Detection in Short Informal Text.” Journal of the American Society for Information Science
and Technology, 61(12): 2544–58. https://doi.org/10.1002/asi.21416.
• Walker, Terry. 2003. “You and Thou in Early Modern English Dialogues: Patterns of Usage.” In
Diachronic Perspectives on Address Term Systems, edited by Irma Taavitsainen and Andreas H.
Jucker, 309–42. Amsterdam: John Benjamins.
Isolde van Dorst
YOU, THOU AND THEE: A STATISTICAL ANALYSIS
OF SHAKESPEARE’S USE OF PRONOMINAL
ADDRESS TERMS
SUMMARY
Much research has been undertaken on the use of you, thou and thee in Shakespeare’s
works. However, the results so far have yet to arrive at an exact and conclusive answer
regarding how these pronouns were used. This study combines the strengths of multiple
research fields in an effort to determine via hitherto unused computational methods
which linguistic and extra-linguistic features influence the second person singular
pronoun choices in the plays of Shakespeare. In the English of Shakespeare’s time, the
now-archaic distinction between you and thou persisted, and is usually reported
as being determined by relative social status and personal closeness of speaker and
addressee. However, even between studies with similar outcomes, the results vary massively
on the degree of influence and by the inclusion or exclusion of a wide range of
other potential influencing factors. Therefore, it remains to be determined whether
statistical machine learning will support this traditional explanation.
In this study, 23 linguistic and extra-linguistic features are investigated, having
been selected from multiple linguistic areas, such as pragmatics, sociolinguistics and
conversation analysis. The three algorithms used, Naive Bayes, decision tree and support
vector machine, are selected as illustrative of a range of possible models in light
of their contrasting assumptions and learning biases. Two predictions are performed,
firstly on a binary (you/thou) distinction and then on a trinary (you/thou/thee)
distinction, giving six final models to compare. This is a strictly empirical study, which
attempts to verify the findings of earlier research through a computational approach.
Its aim and main focus is to try and find a pattern or model that best explains the use of
second person singular pronominal address terms in Shakespeare, rather than simply
achieve the best performing model.
The primary finding of this study is that it is indeed possible to build a prediction
model for the use of singular second person pronouns in the plays of Shakespeare
based on linguistic and extra-linguistic features. In particular, the direct linguistic
context of the pronoun is the most important feature in all of the models except
one. Several other features also influence the pronoun prediction, including the
names of the speaker and addressee, the status differential, and positive and negative
sentiment. Additionally, all three algorithms easily outperformed the baseline. Out
of the three algorithms, the support vector machine models score best. However, the
Naive Bayes models perform almost equally well. This reveals that the features are,
for the most part, independent of one another. When comparing the binary and trinary
classification outcomes, the binary models scored better than the trinary ones.
Looking back at prior studies on pronoun interpretation and comparing them to the
features used in this study, we can conclude that thee and thou are equal in their opposition
to you, with the main difference being their grammatical role. Therefore, both
linguistically and computationally, it makes most sense to use the binary classification.
Isolde van Dorst
YOU, THOU AND THEE: A STATISTICAL ANALYSIS OF SHAKESPEARE'S USE
OF PRONOMINAL ADDRESS TERMS
SUMMARY
Much research has been carried out on the use of the pronouns you, thou and thee in
Shakespeare's works. However, the results so far have not given a precise and conclusive
answer as to how these pronouns were used. This study combines the strengths of different
research fields in order to determine, using computational methods not applied before,
which linguistic and extra-linguistic features influence the choice of the second person
singular pronoun in Shakespeare's plays. In the English of Shakespeare's time, the
distinction between YOU and THOU, which is archaic today, still existed. It is usually
claimed that it was determined by the relative social status and personal closeness of
speaker and addressee. However, even studies with similar results differ greatly in the
degree of influence they report and in whether or not they consider numerous other
possible influencing factors. It therefore remains to be seen whether statistical machine
learning will confirm this traditional explanation.
This study examines 23 linguistic and extra-linguistic features selected from different
linguistic fields, such as pragmatics, sociolinguistics and conversation analysis.
The three algorithms used (Naive Bayes, decision tree and support vector machine) are
chosen as an illustrative set of possible models because of their contrasting assumptions
and learning biases. Two predictions are made, the first on a binary (you/thou)
distinction and the second on a trinary (you/thou/thee) distinction, which yields six
final models that can be compared. The study is strictly empirical, and its aim is to
test the findings of prior research with a computational approach. Its main focus is to
find a pattern or model that best explains the use of second person singular pronominal
address terms in Shakespeare, rather than simply to build the best-performing model.
The primary finding of this study is that it is indeed possible to build a prediction
model for the use of singular second person pronouns in the plays of Shakespeare based on
linguistic and extra-linguistic features. In addition, the direct linguistic context of
the pronoun is the most important feature in all of the models except one. Several other
features also influence the pronoun prediction, including the names of the speaker and
addressee, the status differential, and positive or negative sentiment. All three
algorithms also easily outperformed the baseline. Of the three algorithms, the support
vector machine models score best. However, the Naive Bayes models perform almost equally
well. This suggests that the features are, for the most part, independent of one another.
A comparison of the binary and trinary classification showed that the binary models
score better than the trinary ones. Comparing prior studies on pronoun interpretation
with the features used in this study, we can conclude that thee and thou are equal in
their opposition to you, with the main difference being their grammatical role.
Therefore, both linguistically and computationally, it makes most sense to use the
binary classification.
1.01 UDC: 003.295:659.4+004.738.5(497.4)"201"
Darja Fišer,* Monika Kalin Golob**
Corporate Communication
on Twitter in Slovenia:
A Corpus Analysis
ABSTRACT
SLOVENE CORPORATE COMMUNICATION ON THE TWITTER SOCIAL
NETWORK: A CORPUS ANALYSIS
In this paper we present a corpus analysis of corporate communication on the Twitter
social network, carried out on the Janes-Tweet corpus by combining textual data and
metadata. We analysed the characteristics of Slovene corporate accounts and the
dynamics of their posts, as well as the use of new-media elements and the language
used in corporate posts. Finally, we examined the keywords in corporate posts.
The analyses showed that, in comparison to private accounts, corporate tweets are
strongly dominated by standard linguistic features of formal communication, while the
otherwise rarer informal and non-standard choices are used deliberately with regard to
the addressee and the purpose of the message. The paper is also valuable because it
demonstrates the potential of corpus approaches in communication studies, media studies
and other related social science disciplines that study language use.
Keywords: corporate communication, social media, Twitter, corpus analysis
* Department of Translation, Faculty of Arts, University of Ljubljana, Aškerčeva 2, SI-1000 Ljubljana, Department
of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, darja.fiser@ff.uni-lj.si
** Chair of Journalism, Faculty of Social Sciences, University of Ljubljana, Kardeljeva ploščad 5, SI-1000 Ljubljana,
monika.kalin-golob@fdv.uni-lj.si
ABSTRACT
The paper presents a corpus analysis of corporate communication on Twitter, which was
performed with a combination of metadata and textual data on the Janes-Tweet corpus.
We compare the amount, posting dynamics and use of social-media specific communication
elements by Slovene corporate and private users. Next, we analyse the language of corporate
users. Our analysis shows that, in comparison to private accounts, corporate tweets
predominantly use formal communication and standard language characteristics, with only
rare use of informal and non-standard choices. When these do occur, however, they are chosen
deliberately to address a specific target audience and meet the desired communicative goals.
A major contribution of the paper is also a showcase of corpus-based approaches in com-
munication studies, media studies and other related disciplines in social sciences which study
language use.
Keywords: corporate communication, social media, Twitter, corpus analysis
Introduction
In the past decade, social media have evolved into a powerful tool, attracting mil-
lions of users every day (boyd and Ellison 2007). Jansen et al. (2010) have shown
that around 20 percent of all published tweets mentioned or expressed their opinion
about an organization, brand, product or service. What is more, Wu et al. (2011) show
that this new form of electronic word-of-mouth is approximately 20 times more effective
than marketing events and 30 times more effective than media appearances. It is
therefore unsurprising to see such rapid growth of online social media marketing
(Griffiths and McLean 2014), through which companies address a wide range of
goals, such as increased traffic and brand awareness, improved search engine rankings
or increased sales (Thoring 2011). In addition, social media can also be used for cus-
tomer service and market research (Weber 2009).
With the growing commercial relevance of social media, researchers have begun
to study the nature and influence of corporate communication on social media.
Researchers who investigate the patterns of how information spreads through the
Twitter network showed that tweets which contain URLs tend to spread faster (Park
et al. 2012) and that tweets containing words which indicate either positive or nega-
tive sentiment tend to receive more retweets than neutral posts (Stieglitz and Dang-
Xuan 2012). Stelzner (2010) and Heaps (2009) showed that marketers use social
media mainly for generating exposure for their business and increasing traffic to their
corporate websites, rather than for selling products and services. Evidence has also
been found that social media have a positive effect on increasing relational outcomes,
such as online reputation and relationship strength (Clark and Melancon 2013; Li
et al. 2013; Miller and Tucker 2013). It is therefore surprising that while the new
platform of engagement with customers has shifted the company–customer discourse,
Mangold and Faulds (2009) show that communication is still predominantly scripted,
promotion-centric and lacks real interaction with the customers.
In this paper we present the results of the first large-scale analysis of corporate
communication on Twitter in Slovenia. We look into the production, dynamics and
language in the tweets of Slovene corporate users in order to identify the characteris-
tics of such communication in contrast to the communication of private Twitter users.
In our study, we use the term corporate account for all private companies, public insti-
tutions, the media and interest associations who do not post as individuals for leisure
purposes. The analysis was performed on the corpus Janes-Tweet (Erjavec et al. 2018)
by combining the available user and text metadata with the content of the tweets,
which enabled a more accurate contextualization, parametrization, comparison and
generalizations of language use in a specific communicative context.
The rest of the paper is structured as follows: in Section 2 we present related work
relevant to our study, in Section 3 we present the results of the corpus analysis, and
Section 4 concludes the paper and outlines future work.
Related Work
In communication studies, three main strands of research into corporate social
media communication practices can be identified. The first group focuses on investigating
posting behaviour, the second looks into content analysis, and the third comprises
perception studies. In terms of research focus, investigators are mostly interested
in corporate communication styles, reputation management and corporate social
responsibility.
Quantitative differences in communication dynamics, style and content of Slovene
private and corporate Twitter users have been identified by Ljubešić and Fišer (2016)
and have been attributed to the different communication functions of private and cor-
porate social media users. While corporate users mostly tweet during the work week
in the morning, private users are more active during weekends and in the evening.
Corporate tweets have distinctly positive sentiment, while private tweets are predomi-
nantly neutral. Tweets posted by corporate users are retweeted much more often while
private tweets are more frequently favourited.
By analyzing tweet frequency, following behavior, hyperlinks, hashtags, mentions
and retweets, several studies have shown that one-way communication is still the most
common communication strategy used by organizations on Twitter (Waters and Jamal
2011; Xifra and Grau 2010) and that the style and genre in tweets by PR professionals
is the same as in other PR text types, treating social media as yet another channel for
reaching a different consumer segment, without adapting their language accordingly
(Kalin Golob et al. 2018). However, as shown by Kwon and Sung (2011), the growing
frequency of imperative verb phrases, such as “follow the brand,” “come by the booth,”
“join us at the event,” or “sign up” for a planned occasion, suggests that corporations
increasingly use Twitter as a tool to initiate and maintain relationships with consum-
ers. Risius and Beck (2015) empirically identified social media activities in terms of
social media management strategies (using social media management tools or the web-
frontend client), account types (broadcasting or receiving information), and commu-
nicative approaches (conversational or disseminative). They found positive effects of
social media management tools, broadcasting accounts, and conversational commu-
nication on public perception. Company account characteristics that have been found
to influence public perception are verification, friends, and status.
Gomez and Chalmeta (2013) used content analysis to look into corporate
social responsibility (CSR) on social media and have identified presentation, con-
tent, and interactivity as the key resources for CSR communication on social media.
Presentation refers to the different tools and basic information that supports the com-
pany’s CSR presence on social media. Content includes messages related to CSR and
other topics that reinforce the communication of CSR practices. Interactivity refers
to the type of CSR communication and the frequency of CSR messages and feedback.
Li et al. (2013) used social identity theory to identify design factors that deter-
mine the social context of a corporate Twitter channel and users’ social identifica-
tion with the community. They confirm that user engagement and informedness in a
corporate Twitter channel have a positive effect on corporate reputation and that the
credibility of the corporate Twitter channel has a positive effect on user informedness
about the corporation. An interesting finding is that deeper relationships among users
of a corporate Twitter channel result in higher user engagement and informedness
when the level of corporate involvement with the channel is high and the channel has
a specific purpose but that the opposite is true when the channel has a generic purpose.
In the related work, post harvesting is typically tailor-made and small-scale, either
focused on a few carefully selected corporate social media accounts (e.g. 3 companies),
or limited to a carefully designed time span (e.g. 1 month). Coding of the observed
phenomena is manual. The research framework is quantitative but done on a relatively
small scale, and experimental in that research hypotheses are confirmed or rejected
with statistical tests. Our work differs from this research framework in that we use an
existing large corpus of tweets and are interested in the characteristics of all the avail-
able corporate accounts in it. While coding of certain phenomena (e.g. account type,
user gender) was manual, it was performed prior to this study by coders unrelated to
this study, so could not be fully controlled. Coding of many other phenomena (e.g.
language, sentiment and standardness level of tweets) was automatic and therefore
contains a certain degree of noise. Our approach is not only quantitative but also
large-scale, taking into account several thousand users and several million of their
tweets, and is descriptive in nature. What is more, unlike most related work, which
mostly observes the metadata (e.g. tweet frequency, following behavior, retweets) or
content of the messages (e.g. hyperlinks, hashtags, mentions, sentiment), we also per-
form an analysis of the language used in the messages, which is still underresearched
in communication studies. A better understanding of the language practices used by
public companies and institutions for presentation, persuasion and reputation man-
agement on social media will contribute towards a comprehensive understanding of
contemporary, technology-enhanced corporate public relations and marketing strategies
and practices. Finally, while most researchers focus almost exclusively on English,
our study is performed on Slovene, which can serve as a showcase for other languages
with a smaller number of speakers (and therefore a smaller market for corporate
accounts to serve).
Corpus Analysis of Corporate Communication on Twitter
The analysis has been performed on the Janes-Tweet corpus (Erjavec et al. 2018)
consisting of 11.3 million Slovene tweets or 160 million tokens published by more
than 10,200 users. Depending on their communication purpose, users in the corpus
are manually divided into two groups: private and corporate. Corporate accounts
comprise all private companies, public institutions, the media and interest associations;
individuals who post for leisure purposes are treated as private
accounts. In order to establish the characteristics of corporate communication on
Twitter and differentiate them from the common practices typical of this medium in
general, we perform a contrastive analysis of these two types of accounts.
Our study consists of three parts, each of which addresses a major segment of com-
munication styles on Twitter, ranging from the analysis of communication dynamics
and metadata to the content and language analysis, observed from the perspective of
the two types of accounts. First, we analyzed the production and posting dynamics
of these two user groups. Next, we analyzed the use of social media-specific commu-
nication elements, such as hashtags, emojis and emoticons. Finally, we analyzed the
language and keywords used in corporate tweets. All the analyses were performed in
the SketchEngine corpus-analysis suite¹ (Kilgarriff et al. 2014).
The research questions we address with each part of our study are: 1) Does cor-
porate communication on Twitter by Slovene users have a distinct corporate profile in
terms of posting dynamics and volume? 2) Have Slovene corporate users adopted the
new media communication style, and are they using the features offered by the new media
to maximize their reach and relationship strength? 3) Can we identify the Slovene
corporate tweeting code?
¹ The corpus is publicly available for download as well as for on-line querying through the CLARIN.SI research
infrastructure.
Account Analysis
Table 1: Share of corporate and private users and their production in the Janes-Tweet corpus.
Users       No. of users (%)    No. of tokens (%)        No. of tweets (%)
Corporate   2612 (25.57%)       30,003,182 (18.70%)      2,112,910 (18.64%)
Private     7627 (74.44%)       130,401,083 (81.30%)     9,223,736 (81.36%)
Total       10,248 (100.00%)    160,404,265 (100.00%)    11,336,646 (100.00%)
Share of users. The ratio of private to corporate users in the corpus is 3:1.
As can be seen in Table 1, less than a fifth of all the tweets in the corpus have been
posted by corporate users. This means that in Slovenia, Twitter is mainly used for
private communication.
Table 2: Distribution of tweets by corporate and private users based on gender in the Janes-Tweet corpus.
Gender     Corporate: No. of tweets   Corporate: %   Private: No. of tweets   Private: %
Unknown    1,730,258                  81.89%         134,048                  1.45%
Male       271,729                    12.86%         6,136,470                66.53%
Female     110,923                    5.25%          2,953,218                32.02%
Total      2,112,910                  100.00%        9,223,736                100.00%
Users’ gender. As shown in Table 2, gender could not be determined for the
majority of corporate users (82%) based on user name, user profile data and verb
form usage in their tweets, which is rare in the case of private users (1.5%). This is
unsurprising because corporate users tweet on behalf of their company or organiza-
tion, adapting their style of writing accordingly, e.g. the use of first person plural verb
forms, which do not distinguish the gender of the writer.
Posting Analysis
Post quantity. There are only 29 (1%) corporate users who are very active on
social media and have posted over 10,000 tweets, and 422 (16%) medium-active ones
with 1,000 – 10,000 tweets. The majority of corporate users (1,640 or 62.79%) fall
into the category of low-activity accounts with 100 – 1,000 tweets. The lowest-activ-
ity group includes 521 users (19.95%) who have posted fewer than 100 tweets. In
comparison to private users, the biggest difference is in groups 2 and 3. The share of
private users with 1,000 – 10,000 tweets is about 8 percentage points higher, and the
share of private accounts with only 100 – 1,000 tweets about 10 percentage points lower.
In the years included in the Janes-Tweet corpus,
the volume of content generated by the corporate users is stable but is decreasing
slightly among the private users (see Figure 1). Occasional sharp drops in the number
of posts, which are simultaneous for both user groups, were caused by the technical
issues during data collection and are not related to the seasonal fluctuations or other
content-related phenomena.
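The grouping of accounts by activity can be sketched in a few lines of Python. This is only a minimal illustration: the function names and the sample tweet counts are hypothetical, and the treatment of the boundary values (for example, whether an account with exactly 1,000 tweets counts as medium-active) is an assumption, since the text does not specify it.

```python
from collections import Counter


def activity_group(n_tweets: int) -> str:
    """Map an account's tweet count to one of the four activity groups.

    Boundary handling (>= 1,000 and >= 100) is an assumption.
    """
    if n_tweets > 10_000:
        return "very active (>10,000)"
    if n_tweets >= 1_000:
        return "medium (1,000-10,000)"
    if n_tweets >= 100:
        return "low (100-1,000)"
    return "lowest (<100)"


def group_shares(tweet_counts):
    """Return {group: (number of accounts, share of accounts in %)}."""
    groups = Counter(activity_group(n) for n in tweet_counts)
    total = sum(groups.values())
    return {g: (c, round(100 * c / total, 2)) for g, c in groups.items()}


# Hypothetical per-account tweet counts, for illustration only.
sample = [12_500, 4_300, 950, 720, 310, 150, 99, 12]
shares = group_shares(sample)
for group, (count, pct) in shares.items():
    print(f"{group}: {count} accounts ({pct}%)")
```

Applied separately to the corporate and the private accounts, the same function would yield the two columns of shares that the activity comparison relies on.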
Table 3: Activity of corporate and private users in the Janes-Tweet corpus.
Activity group                      Corporate accounts   Corporate %   Private accounts   Private %
No. of all accounts                 2612                               7627
> 10,000 tweets                     29                   1.11%         129                1.69%
Between 10,000 and 1,000 tweets     422                  16.16%        1867               24.48%
Between 1,000 and 100 tweets        1640                 62.79%        4055               53.17%
< 100 tweets                        521                  19.95%        1576               20.66%
Figure 1: Posting dynamics of corporate and private users in the Janes-Tweet corpus
according to the number of posted tweets between June 2013 and June 2017.
Post length. Figure 2 shows that the length of corporate tweets is more homogeneous
than the length of private tweets. The biggest share of corporate tweets are 7 to
11 words long (4 to 7 words in case of private users). The share of corporate tweets
which do not contain any word (only emojis, hashtags, hyperlinks or multimedia ele-
ments) is only 0.1%. Such tweets are six times more frequently produced by private
users, which is not surprising as these symbols are typically used in bidirectional com-
munication, which is rare in corporate PR tweets.
Figure 2: Tweet length of corporate and private users in the Janes-Tweet corpus.
Analysis of Interactive Elements
Likes. As can be seen from Table 4, nearly 80% of corporate tweets do not receive
any likes, 12% have one like and only 9% have 2 or more likes. Private tweets receive
significantly different attention: a third of all private tweets are liked at least once
and a larger share of them (0.7%, against 0.4% for corporate tweets) receive over 10 likes. This is another strong sign
that bidirectional communication is less typical of corporate users and that corporate
tweets are just one of the channels of the same type of (one-directional) communica-
tion disseminated through different genres.
Table 4: Share of liked and retweeted tweets of corporate and private users in the Janes-Tweet corpus.
No. of likes      Corporate: No. of tweets   Corporate: %   Private: No. of tweets   Private: %
0                 1,663,755                  78.74%         6,109,048                66.23%
1                 265,385                    12.56%         1,890,549                20.50%
2–10              175,788                    8.32%          1,160,057                12.58%
>10               7,982                      0.38%          64,082                   0.69%
Total             2,112,910                  100.00%        9,223,736                100.00%

No. of retweets   Corporate: No. of tweets   Corporate: %   Private: No. of tweets   Private: %
0                 1,754,988                  83.06%         8,414,713                91.23%
1                 219,698                    10.40%         490,346                  5.32%
2–10              134,184                    6.35%          300,319                  3.26%
>10               4,040                      0.19%          18,358                   0.19%
Total             2,112,910                  100.00%        9,223,736                100.00%
Figures 3 and 4: The most liked (left) and the most retweeted (right) tweet posted by
corporate users in the Janes-Tweet corpus.
Table 5: Use of hashtags, emoji, hyperlinks and mentions by corporate and private users in the Janes-Tweet corpus.
Element      Users       Abs. freq.    Per million   Per tweet
Hashtags     Corporate   922,504       30,746.9      0.44
Hashtags     Private     2,241,693     17,190.8      0.24
Emoji        Corporate   1,285,696     42,852.0      0.61
Emoji        Private     12,061,885    92,498.3      1.31
Hyperlinks   Corporate   1,989,643     66,314.4      0.94
Hyperlinks   Private     2,583,651     19,813.1      0.28
Mentions     Corporate   659,211       21,971.4      0.31
Mentions     Private     9,216,857     57,460.2      1.00
Retweets. Retweeting results show a different picture where a much greater share
of corporate tweets have at least one retweet (17%) in comparison to private tweets
(9%), suggesting a higher informative value of corporate tweets for a wider audience.
Interestingly, when considering very frequently retweeted posts, no difference
between the two account types has been observed.
Use of hashtags. Relatively speaking, corporate accounts use hashtags almost
twice as often as private accounts. On average, almost every second corporate tweet
contains a hashtag, which holds for only every fourth private tweet. As presented in
Table 6, sport is the predominant topic of the 10 most frequent hashtags used by
corporate users, which is very similar to private users. Interestingly, half of the 10 most
frequently used hashtags are shared (sport, news, Ljubljana). Among the 10 corporate
users with the highest relative frequency of hashtag use we can find less formal maga-
zines and companies. Therefore, for a more detailed analysis of corporate communi-
cation it would be interesting to further divide corporate users into different groups:
media (journals and magazines), companies, state institutions and non-governmental
organizations. We plan to include this in our future studies.
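The relative figures quoted for hashtags follow from the absolute counts by simple normalization. A minimal sketch in Python, assuming the token and tweet totals from Table 1 and the hashtag counts from Table 5; the function and variable names are hypothetical and serve illustration only.

```python
def per_million(abs_freq: int, n_tokens: int) -> float:
    """Relative frequency: occurrences per one million tokens."""
    return abs_freq / n_tokens * 1_000_000


def per_tweet(abs_freq: int, n_tweets: int) -> float:
    """Average number of occurrences per tweet."""
    return abs_freq / n_tweets


# Token and tweet totals from Table 1, hashtag counts from Table 5.
groups = {
    "corporate": {"tokens": 30_003_182, "tweets": 2_112_910, "hashtags": 922_504},
    "private": {"tokens": 130_401_083, "tweets": 9_223_736, "hashtags": 2_241_693},
}

normalized = {
    name: {
        "per_million": per_million(g["hashtags"], g["tokens"]),
        "per_tweet": per_tweet(g["hashtags"], g["tweets"]),
    }
    for name, g in groups.items()
}

for name, values in normalized.items():
    print(f"{name}: {values['per_million']:.1f} per million, "
          f"{values['per_tweet']:.2f} per tweet")

# Corporate-to-private ratio of the per-million frequencies.
ratio = normalized["corporate"]["per_million"] / normalized["private"]["per_million"]
print(f"ratio: {ratio:.2f}")
```

Normalizing per million tokens makes the two user groups comparable despite their very different sizes, while the per-tweet figure is easier to read as "a hashtag in roughly every second tweet".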
Use of emoticons and emojis.² The pattern for emoticons and emojis is the opposite of
that for hashtags: relatively speaking, they are more than twice as common in posts by
private users, who use 1.3 emojis or emoticons per tweet on average, whereas they occur
in only about every second corporate tweet, which indicates a greater degree of formality in
corporate communication on Twitter. Among the 10 corporate accounts with the highest relative
frequency of emojis and emoticons, we mainly identified resellers of fashion items.
As presented in Table 7, all of the most frequently used emojis or emoticons are
positive, which again indicates a positive tone in PR communication. However, it is
interesting that only 2 emojis appear on the top 10 list for corporate users, while the
rest are emoticons. This could be a sign of more conservative communication strategies
used by corporate users, given that emojis are a much more recent phenomenon,
but it could also be a consequence of corporate users more frequently tweeting from
their computers rather than from smartphones, which better support the use of emojis.
Table 6: Ten most frequent hashtags in corporate and private tweets.
Corporate users Private users
Hashtag Frequency Hashtag Frequency
#plts 18,03 #plts 26,370
#slonews 18,247 #slonews 18,270
#PLTS 9,620 #junaki 18,167
#Ljubljana 5,724 #slochi 13,195
#izvršba 5,167 #PLTS 10,943
#NKDomzale 4,437 #Slovenia 10,780
#olimpija 4,176 #Ljubljana 10,141
#rokomet 4,143 #radiobattleSI 9,184
#junaki 3,941 #ligaprvakov 9,091
#skupajdovrha 3,864 #sp14si 8,351
² Emoticons (e.g. ;)) are combinations of standard typographical characters used for expressing emotions. Emojis are
pictograms which cover emotions as well as a broad range of other topics; their usage and interpretation
depend on the individual.
Table 7: Ten most frequent emoticons and emojis in corporate and private tweets and the
ten corporate accounts with the highest relative frequency of emoticons and emojis.
Emoticon/emoji   Frequency   User            Frequency   Rel. freq.*
:)               114,602     RecycleMan      530         12,711.5
;)               55,763      JennParisBags   188         11,522.1
:D               17,715      EtiVelikonja    160         10,409.8
<3               13,688      ApartmaNet      184         10,104.9
:-)              9,672       TRENDtrgovina   436         10,049.3
;-)              4,926       Pawla40         228         9,720.0
:))              4,680       iPlacesi        125         8,860.0
[emoji]          3,679       bozicluka       92          8,290.2
:P               3,558       matejgaber22    99          7,222.6
[emoji]          3,436       Modniovitki     424         7,010.9
* Relative frequency is the average frequency of the phenomenon in one million tokens.
Table 8: Ten most frequently mentioned accounts in the tweets posted by corporate and
by private users.
Corporate users Private users
Mention Frequency Mention Frequency
@YouTube 8,325 @petrasovdat 91,328
@Nova24TV 6,903 @YouTube 71,859
@Val202 3,992 @MarkoSket 57,333
@rtvslo 3,866 @JJansaSDS 53,482
@kzssi 3,736 @lucijausaj 51,391
@unionolimpija 3,616 @leaathenatabako 44,453
@JJansaSDS 3,464 @petrajansa 44,102
@radioPrvi 3,128 @savicdomen 43,394
@vladaRS 2,764 @darkob 42,363
@nkmaribor 2,758 @zzTurk 40,534
Use of hyperlinks. Great differences between private and corporate users can be
observed in their use of hyperlinks in tweets. Relatively speaking, corporate tweets
contain more than three times the number of hyperlinks in comparison to private
tweets. On average, corporate users add a hyperlink to nearly each tweet they post,
while private users include it only in every fourth tweet. This corresponds to the find-
ings of our preliminary analysis that tweets are often only compressed press releases
leading to a complete message in the form of a hyperlink.
Mentions of other users. Big differences between private and corporate users are
observed in the rate and type of mentions of other user accounts. Relatively speaking,
mentions are more than twice as frequent in private tweets as they are in corporate
tweets. On average, private users mention other users in every tweet, whereas corpo-
rate users use this option only in every third message. This is not surprising because the
main objective of PR tweets is self-presentation, which is why referencing others is less
needed. Among the 10 most frequently mentioned accounts in corporate tweets are
mainly media, political institutions/parties/individual politicians and sport organiza-
tions, while in private tweets we find social media influencers, two journalists and a
politician. Both lists have only two mentions in common, i.e. YouTube and Janez Janša,
one of the oldest and best known Slovenian politicians.
Language Analysis
Language of tweets. Corporate users almost exclusively post messages in Slovene
(93%), which is considerably different from private users whose share of tweets in
a foreign language is twice as large. Among the foreign languages used in tweets of
corporate users, English prevails (5%). This corresponds to our preliminary findings
that the main goal of Slovene corporate Twitter users is to address their Slovene audi-
ence through formal communication for business or informative purposes. The only
exceptions are the accounts of Slovene embassies around the world, which often post in
their local language (e.g. in French), as well as the accounts of the Ministry of Foreign
Affairs, the president and the prime minister who occasionally use English tweets to
inform the international community about major events (e.g. arbitration).
Table 9: Language use in the tweets posted by corporate and private users.
Corporate Private
Language No. of tweets % No. of tweets %
Slovene 1,973,677 93.41% 8,074,681 87.54%
English 104,955 4.97% 983,141 10.66%
Bosnian/Croatian/Serbian 16,058 0.76% 57,017 0.62%
Other 18,220 0.86% 108,897 1.18%
Total 2,112,910 100.00% 9,223,736 100.00%
Sentiment of tweets. Every tweet in the corpus is annotated with a sentiment label
(see Erjavec et al. 2018). Half of all corporate tweets have positive sentiment, a third
has neutral sentiment and 17% of the tweets have negative sentiment. This greatly
differs from private tweets, half of which are neutral, 27% negative and only a quarter
positive. This is another indication of the PR nature of corporate tweets which try to
convey a positive corporate image, attract customers, sell products, etc.
Table 10: Sentiment of tweets posted by corporate and private users.
Corporate Private
sentiment No. of tweets % No. of tweets %
positive 1,024,238 48.48% 2,320,841 25.16%
neutral 729,811 34.54% 4,411,516 47.83%
negative 358,861 16.98% 2,491,379 27.01%
total 2,112,910 100.00% 9,223,736 100.00%
Table 11: Language standardness level in the tweets posted by corporate and private users.
Standardness   Corporate: No. of tweets   Corporate: %   Private: No. of tweets   Private: %
L1             1,688,244                  79.90%         4,515,310                48.95%
L2             353,397                    16.73%         3,489,743                37.83%
L3             71,269                     3.37%          1,218,683                13.21%
Total          2,112,910                  100.00%        9,223,736                100.00%
Table 12: Comparison of the language used in corporate and private tweets according to
part of speech.
Part of speech Corporate (per million) Private (per million) Ratio**
Proper nouns 66,738.4 33,507.8 1.99
Numerals 30,564.9 16,109.7 1.90
Conjunctions 54,381.1 33,302.1 1.63
Prepositions 86,947.2 54,549.6 1.59
Adjectives 76,889.9 48,254.8 1.59
Common nouns 186,446.6 127,056.0 1.47
Abbreviations 3,826.0 3,458.9 1.11
Punctuation 143,234.6 158,188.2 0.91
Main verbs 62,631.9 75,795.7 0.83
Auxiliary verbs 36,974.7 52,968.0 0.70
Adverbs 38,192.1 55,483.1 0.69
Pronouns 39,118.2 62,678.8 0.62
Particles 19,816.6 35,540.7 0.56
Interjections 1,740.9 6,194.5 0.28
** Ratio between the frequency in corporate and in private tweets.
Language standardness. Tweets by corporate users mainly contain standard
Slovene (80%) and highly nonstandard content is only rarely present (3%). Almost
the opposite is true of private users. Less than half of their tweets are written in standard
Slovene and the share of tweets containing highly nonstandard Slovene is almost
four times greater in comparison to corporate users. Some exceptions can be
found among the accounts of public personalities (e.g. stand-up comics, radio pre-
senters, musicians) who often purposefully tweet in nonstandard Slovene because
informal communication is a major part of their corporate image.
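The shares discussed in this paragraph can be recomputed from the raw counts in Table 11. The following is only a minimal sketch in Python: the counts are copied from the table, while the names `COUNTS` and `shares` are illustrative and not part of the original analysis.

```python
# Tweet counts per standardness level, copied from Table 11.
COUNTS = {
    "corporate": {"L1": 1_688_244, "L2": 353_397, "L3": 71_269},
    "private": {"L1": 4_515_310, "L2": 3_489_743, "L3": 1_218_683},
}


def shares(counts):
    """Return each level's share of the group's tweets, in percent."""
    total = sum(counts.values())
    return {level: 100 * n / total for level, n in counts.items()}


corporate = shares(COUNTS["corporate"])
private = shares(COUNTS["private"])

# How much larger the share of highly nonstandard (L3) tweets is
# among private users than among corporate users.
l3_ratio = private["L3"] / corporate["L3"]

print(f"L1 share: corporate {corporate['L1']:.2f}%, private {private['L1']:.2f}%")
print(f"L3 share ratio (private / corporate): {l3_ratio:.2f}")
```

Comparing shares rather than raw counts matters here because the private subcorpus is more than four times the size of the corporate one.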
Orthography. Great differences are detected regarding the use of abbreviations:
corporate tweets mainly contain standard abbreviations of academic or other titles
(dr., mag., d. o. o.) and common abbreviations (št., oz., min.), while in private tweets
we find nonstandard abbreviations (tw), often without a full stop (slo, lj, min). Some
differences can also be observed in the use of punctuation. In corporate accounts, a
wider range of classic punctuation marks is used, in line with the orthographic norm.
Tweets by private users are characterized by frequent repetitions of the same
punctuation mark to give the message an emotional charge. Social-media specific
symbols (#, @, *) are also much more frequent in private tweets.
Parts of speech. The analysis of the parts of speech in the language of corporate
tweets offers an insight into the communication purposes of corporate accounts.
Relatively speaking, there are almost twice as many proper nouns and numerals in
corporate tweets as in private ones. Conjunctions, prepositions, adjectives and common
nouns are also much more frequent. As shown in Table 12, interjections are considerably
more frequent in private accounts (3.5 times more). The same is true for particles
(almost 2 times more), pronouns and adverbs. On the one hand this confirms the
greater formality of corporate users and reflects the more direct and personal approach
of private users. On the other hand it also reflects different communicative functions
of Twitter: informative for corporate and conversational for private accounts.
Furthermore, the informative function, and to some extent the influencing function, are
also confirmed by the detailed analysis of individual parts of speech presented below.
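The Ratio column of Table 12 divides the corporate frequency by the private frequency, both normalised per million. A minimal Python sketch of that arithmetic follows; the frequencies are copied from Table 12, while the helper names and the per-million normalisation function are assumptions added for illustration (the raw counts behind the table are not given in the text).

```python
# Per-million frequencies (corporate, private), copied from Table 12.
PER_MILLION = {
    "Proper nouns": (66_738.4, 33_507.8),
    "Numerals": (30_564.9, 16_109.7),
    "Particles": (19_816.6, 35_540.7),
    "Interjections": (1_740.9, 6_194.5),
}


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count / corpus_size * 1_000_000


def ratio(corporate, private):
    """Corporate-to-private frequency ratio, as in the Ratio column."""
    return corporate / private


for pos, (corp, priv) in PER_MILLION.items():
    r = ratio(corp, priv)
    # Values above 1 mark categories typical of corporate tweets,
    # values below 1 mark categories typical of private tweets.
    print(f"{pos}: {r:.2f} (private/corporate: {1 / r:.2f})")
```

Reading the ratio in both directions is what supports statements such as interjections being about 3.5 times more frequent in private tweets.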
The noun. Common nouns are 1.5 times more common in corporate tweets
than in private ones, but the matching rate of the 20 most frequently used common
nouns is surprisingly high (70%): dan/day, leto/year, tekma/race, ura/hour, mesto/place,
teden/week, čas/time, hvala/thank you, svet/world, delo/work, človek/human, konec/end,
otrok/child, država/country. Among the 20 most frequent common nouns, the following
are specific to corporate tweets: video/video, foto/photo, zmaga/victory, novica/news,
cena/price, sezona/season. Proper nouns are twice as common in corporate tweets as in
private ones, and the matching rate of the 20 most frequent proper nouns is 40%:
Slovenija/Slovenia, Ljubljana, Maribor, EU, Slovenc/Slovene, Evropa/Europe, ZDA/USA,
Cerar, Janša. Among the 20 most frequent proper nouns, the following are specific to
corporate tweets: Olimpija, Koper, Peter, Gorica, Janez, Domžale, Luka, Tina, Marko.
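The matching rate used in this section can be read as the share of items that the two top-N frequency lists have in common. The sketch below assumes that reading (the text does not define the measure formally) and uses short, made-up token lists; `top_n` and `matching_rate` are hypothetical helper names.

```python
from collections import Counter


def top_n(lemmas, n):
    """Return the n most frequent lemmas, most frequent first."""
    return [lemma for lemma, _ in Counter(lemmas).most_common(n)]


def matching_rate(top_a, top_b):
    """Share of the items in top_a that also occur in top_b."""
    shared = set(top_a) & set(top_b)
    return len(shared) / len(top_a)


# Toy lemma streams, invented for illustration only.
corporate_lemmas = ["dan"] * 5 + ["leto"] * 4 + ["video"] * 3 + ["cena"] * 2
private_lemmas = ["dan"] * 6 + ["čas"] * 5 + ["leto"] * 4 + ["človek"] * 3

corporate_top = top_n(corporate_lemmas, 4)
private_top = top_n(private_lemmas, 4)

rate = matching_rate(corporate_top, private_top)
specific_to_corporate = [w for w in corporate_top if w not in private_top]
print(rate, specific_to_corporate)
```

With N = 20, a rate of 70% corresponds to 14 shared items, which matches the number of common nouns listed above.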
In corporate tweets a higher level of formality of expression has been detected, as
both first and last names are given (private tweets mention only the last name).
Furthermore, we can observe a greater diversity of places and company names. An
analysis of nominal pronouns returned predictable results: corporate tweets contain plural
pronouns (nam/to us, nas/us, vam/to you), while in private tweets we find singular
forms of pronouns (jaz/I, me/of me, ti/to you, te/you). The reason for grammatical
plurality lies in the fact that authors of corporate tweets communicate formally on
behalf of their institution or company and use formal forms of address.
The verb. The use of main verbs is more common in private tweets. The matching
rate of the 20 most frequent verbs in private and corporate tweets is 60% (imeti/have,
iti/go, morati/must, vedeti/know, videti/see, priti/come, dobiti/get, začeti/begin,
čakati/wait, dati/give, praviti/say, delati/work), but the two groups differ in their
motivation for communication: corporate accounts mainly report on events and publish
statements, while private accounts describe personal activities and give opinions.
Among the 20 most frequent verbs, the following main verbs are specific to corporate
tweets: želeti/wish, preveriti/check, najti/find, iskati/search, prebrati/read, gledati/watch,
moči/able, hoteti/want, narediti/do.
The adjective. Adjectives are 1.5 times more frequently used in corporate than
in private tweets and the matching rate of the 20 most used adjectives is 50%: nov/new,
dober/good, slovenski/Slovenian, velik/big, lep/beautiful, zadnji/last, mlad/young,
star/old, pravi/real, super/super. Among the 20 most frequent adjectives the following
are specific to corporate tweets: vabljen/invited, današnji/today’s, evropski/European,
javen/public, spleten/web-based, svetoven/world-wide, odličen/excellent, državen/national,
visok/high, domač/domestic. Positive adjectives are characteristic of corporate
tweets (nov/new, dober/good, velik/big, lep/beautiful), and the adjectives specific to
corporate tweets are also more formal than those specific to private tweets
(vabljen/invited, odličen/excellent, visok/high vs. hud/badass, mali/little, sam/alone).
Adjectival as well as nominal pronouns are used in the first person plural form in
corporate tweets (naše/our-Female, naši/our-Male) when the goal is identification with
the company or the institution and integration into the communicative circle that
connects the author of the message on behalf of the institution with the recipient
(Korošec 1998).
The particle. The difference between formality and informality can also be
observed through particles, which overlap in 80% of the cases. However, among the
particles that are present only in tweets of one user group, our analysis showed that
formal particles are distinctive for corporate tweets (morda/maybe, predvsem/above all,
sicer/though, skoraj/nearly) and nonstandard and informal particles for private tweets
(tud < tudi/also, ze < že/already, itak/of course, pač/well).
The interjection. As already mentioned, the analysis of this part of speech showed
the most notable differences. The matching rate of the 20 most common interjections
in corporate and private tweets is 55%: bravo, hm, haha, uf, o, ej, ah, ha, aha, aja, oh.
Among the most frequent interjections that are distinctive for one of the user groups
are the following ones: živjo, zdravo, hej, hehe, gooool, opa, ups, na, ojoj. Interjections in
corporate tweets are fewer in number as well as more formal and salutatory (zdravo,
ups), while private tweets often contain interjections in foreign languages (btw, lol) and
swear words.
Keyword Analysis
This section highlights the results of the keyword analysis performed on corporate
tweets. In this paper, the keywords are understood as the words which are unexpect-
edly more frequent in the tweets of corporate users compared to the entire Janes-
Tweet corpus as reference.
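This section does not spell out how the keyness index is calculated. One common definition, assumed here purely for illustration, is the "simple maths" score used in Sketch Engine (Kilgarriff et al. 2014): the smoothed ratio of a word's per-million frequency in the focus corpus to its per-million frequency in the reference corpus. The counts in the example are invented.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million tokens."""
    return count / corpus_size * 1_000_000


def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Simple-maths keyness: (fpm_focus + n) / (fpm_ref + n).

    The smoothing constant n avoids division by zero for words absent
    from the reference corpus; larger values favour more frequent words.
    """
    fpm_focus = per_million(focus_count, focus_size)
    fpm_ref = per_million(ref_count, ref_size)
    return (fpm_focus + smoothing) / (fpm_ref + smoothing)


# Hypothetical counts: a lemma occurring 300 times in a 1M-token focus
# subcorpus and 100 times in a 2M-token reference corpus.
score = keyness(300, 1_000_000, 100, 2_000_000)
print(round(score, 2))
```

A score above 1 means the word is relatively more frequent in the focus subcorpus than in the reference; a word with identical relative frequency in both scores exactly 1.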
Table 13: List of 20 most key lemmas in corporate tweets according to sentiment.
| Negative | Keyness index | Positive | Keyness index | Neutral | Keyness index |
|---|---|---|---|---|---|
| oviran | 22.2 | čestitka | 3.5 | novice.si | 10.1 |
| trčenje | 19.1 | vabljen | 3.5 | zemljišče | 8.7 |
| trčiti | 18.0 | bravo | 3.4 | pivniški | 8.3 |
| priključek | 15.4 | album | 3.4 | ebel | 8.3 |
| evakuirati | 15.3 | beautiful | 3.4 | katarinin | 8.1 |
| ranjen | 15.1 | hvala | 3.4 | petv | 8.0 |
| poškodovan | 15.0 | posted | 3.4 | šloganje | 7.9 |
| razcep | 14.9 | photos | 3.4 | solaten | 7.8 |
| novicejutro.si | 14.9 | odličen | 3.3 | ugnati | 7.8 |
| osumljen | 14.6 | polepšati | 3.3 | pripravljalen | 7.7 |
| nesreča | 14.5 | odlično | 3.3 | koel | 7.6 |
| aretirati | 14.3 | prijeten | 3.3 | novinec | 7.6 |
| avtocesta | 14.1 | super | 3.3 | napovednik | 7.4 |
| neurje | 14.1 | čudovit | 3.3 | zoofa | 7.3 |
| strmoglaviti | 13.9 | čestitati | 3.3 | prerokovanje | 7.3 |
| osumljenec | 13.1 | srečno | 3.3 | poiesis | 7.2 |
| magnituda | 13.1 | facebook | 3.3 | apod | 7.1 |
| prometen | 12.8 | welcome | 3.3 | wt | 7.1 |
| ubit | 12.8 | summer | 3.3 | sklepen | 6.9 |
Sentiment. As shown in Table 13, the highest keyness index is attributed to lexis
from corporate tweets with negative sentiment. Among those, all 20 top-ranking key
lemmas are part of media tweets that reference reports on crime and accidents
(e.g., trčenje/collision, evakuirati/evacuate, ranjen/injured, nesreča/accident). The 20
top-ranking keywords with positive sentiment correspond to the definitions of positive
PR communication (e.g., čestitka/congratulations, vabljen/invited, bravo/bravo,
čudovit/wonderful, polepšati/brighten (sb’s day)). Adjectives and adverbs with highly
positive meaning are also ranked high (e.g., lep/beautiful, odličen, odlično/fantastic,
prijeten/nice, super/super). Furthermore, the 20 top-ranking keywords with neutral
sentiment are part of the tweets containing media reports (e.g., novice.si/news.si,
zemljišče/property, napovednik/preview, sklepen/final) and denote events (e.g.,
pivniški/beer, ebel/ebel, šloganje/card-reading, prerokovanje/fortune-telling) or names
(katarinin, ebel, zoofa, apod). This list suggests that for a more fine-grained analysis of
corporate communication on Twitter it could be useful to separate the tweets generated
by media from those created by companies or institutions.
Table 14: Comparison of key word forms in corporate tweets, written in standard and non-standard language.
| Standard tweets | Keyness index | Non-standard tweets | Keyness index |
|---|---|---|---|
| Izkl | 6.4 | Posetite | 562.3 |
| Novice.SI | 6.4 | potrazi | 557.6 |
| dražba | 6.0 | sjajan | 553.5 |
| [hyperlink] | 5.9 | Jeste | 455.0 |
| SiOL | 5.8 | tim | 308.5 |
| Petv | 5.8 | [hyperlink] | 307.2 |
| APOD | 5.8 | [hyperlink] | 186.6 |
| Moia | 5.7 | li | 166.4 |
| spletnem | 5.7 | koketo | 145.9 |
| Zurnal24 | 5.7 | trombeto | 143.3 |
| ugodne | 5.7 | [hyperlink] | 130.0 |
| astronomska | 5.7 | belooranžnega | 129.5 |
| SMUČANJE | 5.6 | deejaytime | 111.2 |
| KOŠARKA | 5.6 | Živjo | 111.0 |
| oviran | 5.6 | Skupne | 109.6 |
| [hyperlink] | 5.6 | pritisne | 92.8 |
| ALPSKO | 5.6 | oglasiš | 66.2 |
| HOKEJ | 5.6 | [hyperlink] | 65.9 |
| zamudite | 5.6 | cheers | 60.3 |
| Preverite | 5.5 | hajskul | 56.5 |
| Nogometaši | 5.5 | [hyperlink] | 49.6 |
| TENIS | 5.5 | gnargnar | 49.6 |
| ciganskih | 5.4 | sporočimo | 47.0 |
| NOGOMET | 5.4 | najbrš | 46.8 |
| ROKOMET | 5.4 | pridte | 45.3 |
| [hyperlink] | 5.4 | javimo | 41.9 |
| Astrolife.si | 5.4 | Poslali | 41.5 |
| Izbrane | 5.4 | dm | 41.2 |
| Slovenske | 5.4 | javiš | 41.2 |
| SMUČARSKI | 5.4 | unc | 41.0 |
Standardness. A comparison of the 30 top-ranking key word forms (see Table
14) in corporate tweets written in standard and nonstandard Slovene shows that users
write in standard Slovene when posting notifications and ads (e.g., dražba/auction,
ugodne/good, zamudite/miss, preverite/check). Tweets written in nonstandard Slovene
have a similar communication purpose, but numerous elements in foreign languages
and nonstandard spelling of Slovene words indicate that authors of such messages
want to establish a closer connection with their target audience and make their offer
more appealing to them (e.g., deejaytime – phoneticized spelling of DJ time, hajskul –
phoneticized spelling of high school, najbrš – nonstandard for I guess, pridte – nonstandard
for come, dm – abbreviation for direct message, javiš – nonstandard for answer).
Table 15: Comparison of key word forms in corporate tweets written by male and female users.
| Female | Keyness index | Male | Keyness index |
|---|---|---|---|
| foodwalks | 7.7 | Moia | 41.7 |
| Posodobljen | 7.0 | dražba | 39.9 |
| Patsy | 6.1 | APOD | 37.2 |
| KOEL | 5.9 | astronomska | 36.4 |
| [hyperlink] | 5.9 | premičnin | 35.4 |
| info@patsy.si | 5.5 | UGANKA | 33.9 |
| [hyperlink] | 5.5 | [hyperlink] | 30.7 |
| foodwalk | 5.5 | Izhodišče | 30.3 |
| Lylo | 5.3 | FOTOGRAFIJE | 30.0 |
| ORTO | 5.1 | GLASBA | 29.6 |
| UriKuri | 4.6 | Dopolni | 29.5 |
| yummy | 4.6 | UE | 29.1 |
| Ordered | 4.4 | javna | 27.5 |
| Shellac | 4.4 | sedežna | 27.2 |
| Cosmo | 4.2 | GCC | 26.5 |
| LPG | 3.8 | PRIPOROČAMO | 26.4 |
| Starševski | 3.7 | Espargaro | 26.4 |
| e-trgovine | 3.5 | [hyperlink] | 26.3 |
| [hyperlink] | 3.5 | zemljišča | 26.0 |
| Elle | 3.3 | [hyperlink] | 25.3 |
| info@tjasaseme.si | 3.3 | Pomurskem | 24.8 |
| boxa | 3.2 | ENERGIJE | 24.5 |
| derivatov | 3.2 | Žurnal24 | 24.4 |
| IBU | 3.1 | LITERATURA | 24.3 |
| Onaplus | 3.1 | gozda | 24.2 |
| Aquafresh | 3.0 | [hyperlink] | 23.5 |
| naftnih | 3.0 | PRS | 23.1 |
| Watercolour | 3.0 | Ekipa24 | 22.8 |
| [hyperlink] | 3.0 | [hyperlink] | 22.3 |
Gender. While a comparison of the key word forms from female and male corporate
accounts in Table 15 does not offer any insights into possible linguistic differences
between them, it does give us information about differences in topics and style, that
is, in the language choices made when addressing female or male target audiences.
Female accounts include names of magazines, URLs and proper names related to
fashion, shopping, food and parenting, while in male accounts these elements are
related to real estate, sport and music.
Conclusions
Social media have revolutionized corporate communications by allowing com-
panies to communicate directly and instantly with their stakeholders, marking a shift
from the traditional one-way output of corporate communications, to an expanded
dialogue between company and consumer (Matthews 2010). This paper presents the
results of the first comprehensive, large-scale and corpus-driven analysis of the char-
acteristics of corporate communication on Twitter in Slovenia that could serve as a
starting-point of further, data-driven and linguistically enhanced investigations of the
importance of social media for fostering corporate communication. In the study, we
combined the analysis of the available metadata, Tweet content and corpus annota-
tions to study three key aspects of the communication of Slovene corporate Twitter
users: (1) the participation, posting dynamics and posting volume, (2) the utilization
of new media elements, and (3) the language choices observed through several levels
of linguistic description.
Based on the Janes-Tweet corpus, Twitter appears to be mainly used for private
communication in Slovenia. The majority of corporate accounts belong to the low-
activity category but the volume of content generated by the corporate users is stable.
Corporate tweets are more homogeneous length-wise and are predominantly longer
than those of private users.
The analysis of the usage of the new media elements suggests that corporate tweets
fall short of a truly dialogic approach, as most Slovene companies and institutions use
Twitter as yet another channel for unidirectional communication of regular (shortened)
PR messages, while the prevalent communication function remains informative and
positively presentational. This can be seen from the much less frequent usage of
emoticons and all other interactive elements typical of private accounts. Private
accounts, by contrast, display a distinct conversational communication function, which
is evident in their frequent usage of non-standard particles, interjections, punctuation
and language, and in a large number of favourites.
A very strong feature of corporate communication is the almost exclusive usage
of Slovene, which is undoubtedly strategic, with a clear focus on the Slovene market.
While standard language and formal elements do prevail in corporate tweets
of Slovene companies and institutions, the infrequent occurrences of informal and
non-standard elements seem to be used deliberately and tailored to the specific target
audience, which points towards a growing awareness of the need to adapt the style to
the content that is communicated (level of formality, linguistic standardness,
discursiveness), the target audience (general public – neutral style vs. specific public –
variations between neutral and colloquial style) and the organization profile (public
institution – neutral style, standard language; companies – visible, colloquial,
non-standard features).
Both sentiment- and part-of-speech-based keyword analyses show an interest-
ing landscape of corporate tweets. The usage of evaluative adjectives is prominent
throughout this subcorpus, among which superlatives stand out in particular. The
negative keywords originate from the coverage of accidents and crimes by the media,
and the positive fully correspond with the definition of promotional elements. These
results indicate an important difference between the negative reporting-style tweets
by the news outlets, and the positive promotional style of companies, public institu-
tions and non-governmental institutions, suggesting the need for a more fine-grained
categorization of corporate accounts, which will be refined in our future work. We
also plan to focus on analyzing the reception of corporate tweets which contain non-
standard language and interactive elements which are more typical of private com-
munication on social media.
An important original contribution of this study is its demonstration of the meth-
odological potential of corpus approaches in communication studies, media studies and
related disciplines in social sciences which are based on language data, which is not yet
utilized in the Slovene context. Apart from theoretical relevance, the results of this analy-
sis therefore also have practical implications for PR practitioners and organizations in
that they reinforce the importance of properly trained PR practitioners who use social
media in a dialogic, two-way symmetrical model, understand their role as boundary span-
ners and the need to seek opportunities to engage in and stimulate dialogue with stake-
holders. The results of our study also clearly illustrate to the PR practitioners that social
media should not be treated as just another means through which to disseminate the
same advertisements and publicity pieces that stakeholders are already receiving through
other traditional media channels. According to Matthews (2010), social media offers an
opportunity for direct and instant corporate communication as well as an opportunity
to get back to the ideal basics of public relations – building and maintaining relationships
– and to change some of the negative stereotypes typically associated with the industry.
Acknowledgments
The work described in this paper was funded by the Slovenian Research Agency
within the national basic research project “Resources, methods, and tools for the
understanding, identification, and classification of various forms of socially
unacceptable discourse in the information society” (J7-8280, 2017–2019) and the
Slovenian-Flemish bilateral basic research project “Linguistic landscape of hate speech
on social media” (N06-0099, 2019–2023).
Sources and Literature
• boyd, danah m., and Nicole B. Ellison. 2007. “Social Network Sites: Definition, History, and
Scholarship.” Journal of Computer‐Mediated Communication 13 (1): 210–30.
doi:10.1111/j.1083-6101.2007.00393.x.
• Clark, Melissa, and Joanna Melancon. 2013. “The Influence of Social Media Investment on
Relational Outcomes: A Relationship Marketing Perspective.” International Journal of Marketing
Studies 5 (4): 132–42. doi:10.5539/ijms.v5n4p132.
• Erjavec, Tomaž, Nikola Ljubešić, and Darja Fišer. 2018. “Korpus slovenskih spletnih uporabniških
vsebin Janes.” In Viri, orodja in metode za analizo spletne slovenščine, edited by Darja Fišer, 16–43.
Ljubljana: Znanstvena založba Filozofske fakultete.
• Gomez, Lina M., and Ricardo Chalmeta. 2013. “The Importance of Corporate Social
Responsibility Communication in the Age of Social Media.” In 16th International Public Relations
Research Conference, 1–16. Amsterdam: Elsevier.
• Griffiths, Marie, and Rachel McLean. 2014. “Unleashing Corporate Communications: Social
Media and Conversations With Customers.” In UKAIS International Conference Proceedings 2014,
1–51. https://aisel.aisnet.org/ukais2014/51.
• Heaps, Darrel. 2009. “Twitter: Analysis of Corporate Reporting Using Social Media.” Corporate
Governance Advisor 17 (6): 18–22.
• Jansen, Bernard J., Mimi Zhang, Kate Sobel, and Abdur Chowdury. 2010. “Twitter power: Tweets
as electronic word of mouth.” Journal of the American Society for Information Science and Technology
60 (11): 2169–88. doi:10.1002/asi.21149.
• Kalin Golob, Monika, Nada Serajnik Sraka, and Dejan Verčič. 2018. Pisanje za odnose z javnostmi:
temeljni žanri. Ljubljana: Založba FDV.
• Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel
Rychlý, and Vít Suchomel. 2014. “The Sketch Engine: Ten Years On.” Lexicography 1 (1): 7–36.
doi:10.1007/s40607-014-0009-9.
• Korošec, Tomo. 1998. Stilistika slovenskega poročevalstva. Ljubljana: ČZD Kmečki glas.
• Kwon, Eun Sook, and Yongjun Sung. 2011. “Follow Me! Global Marketers’ Twitter Use.” Journal of
Interactive Advertising 12 (1): 4–16. doi:10.1080/15252019.2011.10722187.
• Li, Ting, Guido Berens, and Maikel de Maertelaere. 2013. “Corporate Twitter Channels: The
Impact of Engagement and Informedness on Corporate Reputation.” International Journal of
Electronic Commerce 18 (2): 97–126. doi:10.2753/JEC1086-4415180204.
• Ljubešić, Nikola, and Darja Fišer. 2016. “Slovene Twitter Analytics.” In Proceedings of the 4th
Conference on CMC and Social Media Corpora for the Humanities, edited by Darja Fišer and Michael
Beißwenger, 39–43. Ljubljana: Znanstvena založba Filozofske fakultete Univerze v Ljubljani.
• Mangold, W. Glynn, and David J. Faulds. 2009. “Social Media: The New Hybrid Element of the
Promotion Mix.” Business Horizons 52 (4): 357–65. doi:10.1016/j.bushor.2009.03.002.
• Matthews, Laura. 2010. “Social Media and the Evolution of Corporate Communications.” The Elon
Journal of Undergraduate Research in Communications 1 (1): 17–23.
• Miller, Amalia R., and Catherine Tucker. 2013. “Active Social Media Management: the Case of
Health Care.” Information Systems Research 24 (1): 52–70. doi:10.1287/isre.1120.0466.
• Park, Jaram, Meeyoung Cha, Hoh Kim, and Jaeseung Jeong. 2012. “Managing Bad News in Social
Media: A Case Study on Domino’s Pizza Crisis.” In Proceedings of the Sixth International AAAI
Conference on Weblogs and Social Media Relations Review, 409–11.
• Risius, Marten, and Roman Beck. 2015. “Effectiveness of Corporate Social Media Activities in
Increasing Relational Outcomes.” Information & Management 52 (7): 824–39.
doi:10.1016/j.im.2015.06.004.
• Stelzner, Michael A. 2010. “Social Media Marketing Industry Report: How Marketers are Using
Social Media to Grow Their Businesses.” Accessed February 15, 2019.
http://www.socialmediaexaminer.com/social-media-marketing-industry-report-2010/.
• Stieglitz, Stefan, and Linh Dang-Xuan. 2012. “Impact and Diffusion of Sentiment in Public
Communication on Facebook.” In ECIS 2012 Proceedings. Accessed February 15, 2019.
https://aisel.aisnet.org/ecis2012.
• Thoring, Anne. 2011. “Corporate Tweeting: Analysing the Use of Twitter as a Marketing Tool by
UK Trade Publishers.” Publishing Research Quarterly 27 (2): 141–58.
doi:10.1007/s12109-011-9214-7.
• Waters, Richard D., and Jia Y. Jamal. 2011. “Tweet, Tweet, Tweet: A Content Analysis of
Nonprofit Organizations’ Twitter updates.” Public Relations Review 37 (3): 321–24.
doi:10.1016/j.pubrev.2011.03.002.
• Weber, Larry. 2009. Marketing on the Social Web: How Digital Customer Communities Build Your
Business. Hoboken, New Jersey: Wiley.
• Wu, Shaomei, Jake M. Hofman, Winter A. Mason, and Duncan J. Watts. 2011. “Who Says What
to Whom on Twitter.” In Proceedings of the WWW’11 Conference, 705–14. New York: ACM.
doi:10.1145/1963405.1963504.
• Xifra, Jordi, and Francesc Grau. 2010. “Nanoblogging PR: The Discourse on Public Relations in
Twitter.” Public Relations Review 36 (2): 171–74. doi:10.1016/j.pubrev.2010.02.005.
Darja Fišer, Monika Kalin Golob
CORPORATE COMMUNICATION ON TWITTER IN
SLOVENIA: A CORPUS ANALYSIS
SUMMARY
In the past decade, social media have transformed corporate communications by
enabling direct and instant communication with the stakeholders. In communication
studies, three main strands of research into corporate communication practices on
social media can be identified: posting behaviour, content analysis and perception
studies. Investigators are mostly interested in corporate communication styles, reputa-
tion management and corporate social responsibility. A better understanding of the
language practices used by public companies and institutions for presentation, persua-
sion and reputation management on social media is still lacking.
This paper addresses this gap with the first comprehensive, large-scale and cor-
pus-driven analysis of the characteristics of corporate communication on Twitter in
Slovenia. In the study, we combined the analysis of the available metadata, Tweet con-
tent and corpus annotations in the Janes-Tweet corpus to study three key aspects of the
communication of Slovene corporate Twitter users: (1) their participation, posting
dynamics and posting volume, (2) the use of social-media specific communication
elements, and (3) the language choices observed through several levels of linguistic
description.
Our analysis shows that, in comparison to private accounts, corporate tweets
predominantly use formal communication and standard language characteristics, with
infrequent use of informal and non-standard choices. Where these do occur, however,
they are chosen deliberately to address a specific target audience and meet the desired
communicative goals. The analysis of the utilisation of the new media elements by
corporate users clearly shows that their tweets fall short of a truly dialogic approach
and that most Slovene companies and institutions use Twitter as yet another channel
for unidirectional communication of regular (shortened) PR messages in which the
prevalent communication function remains informative and positively presentational.
A keyword analysis reveals an important difference between the negative reporting-
style tweets by the news outlets, and the positive promotional style of companies,
public institutions and non-governmental institutions, suggesting the need for a more
fine-grained categorization of corporate accounts, which will be refined in our future
work.
Another major contribution of the paper is its demonstration of the methodo-
logical potential of corpus approaches in communication studies, media studies and
related disciplines in social sciences that are based on language data, which is not
yet utilized in the Slovene context. Apart from theoretical relevance, the results of
this analysis therefore also have practical implications for the PR community which
highlight the importance of properly trained PR practitioners who use social media
in a dialogic, symmetrical model, understand their role as boundary spanners and the
need to seek opportunities to engage in and stimulate dialogue with their stakeholders.
Darja Fišer, Monika Kalin Golob
SLOVENE CORPORATE COMMUNICATION ON THE SOCIAL
NETWORK TWITTER: A CORPUS ANALYSIS
POVZETEK (SUMMARY)
In the past decade, social media have also strongly influenced corporate
communication by enabling direct and immediate contact with stakeholders. In
communication studies, corporate communication practices on social media are
investigated by observing the behaviour of corporate users, through content analysis
and through perception studies. Communication scholars are mainly interested in the
styles of business communication, reputation management and corporate social
responsibility, while there is still a lack of linguistically oriented research that would
enable a better understanding of the language practices which companies and
institutions use to present their products, influence consumers and respond in critical
situations.
This paper addresses that gap by presenting the first comprehensive analysis, based
on a large corpus, of corporate communication among Slovene users of the social
network Twitter. We carried it out by combining the textual data, metadata and corpus
annotations available in the Janes-Tviti corpus, and focused on three aspects of the
corporate communication of Slovene users: (1) their presence, activity, posting dynamics
and posting volume, (2) their use of new-media communication elements, and (3) their
language choices, observed at various levels of linguistic description.
The analyses showed that, in comparison with private accounts, standard language
features of formal communication strongly prevail in corporate tweets, while the rarer
informal and nonstandard choices are used deliberately with regard to the addressee
of the message and the purpose of the communication. The analysis of the use of
new-media elements clearly shows that the communication of Slovene corporate users
on Twitter does not follow a dialogic approach and that most Slovene companies and
institutions treat Twitter as an additional channel for the one-way communication of
classic (shortened) press releases, whose communicative function remains
predominantly informative and positively presentational. The keyword analysis reveals
an important difference between the negative reporting style of media accounts and
the positive promotional style of companies, public institutions and non-governmental
organizations, which points to the need for a more precise categorization of corporate
accounts in the corpus, which we plan for future research.
The paper is also valuable because it demonstrates the potential of corpus
approaches in communication studies, media studies and other related social science
disciplines that are based on language data, which is not yet established practice in
Slovenia. In addition to their theoretical relevance, the results of the analysis also have
practical value for the communication profession, as they highlight the importance of
properly trained public relations professionals who master the dialogic, symmetrical
model of social media, understand their mediating role between stakeholders and the
company they represent, and proactively seek opportunities to establish genuine
contacts with stakeholders and encourage dialogue with them.
1.01 UDC: 003.295: 342.537.6(497.4)”2014/2018”
Darja Fišer,* Nikola Ljubešić,** Tomaž Erjavec***
Parlameter – a Corpus of
Contemporary Slovene
Parliamentary Proceedings
IZVLEČEK (ABSTRACT)
PARLAMETER – A CORPUS OF THE DEBATES OF THE SLOVENE
NATIONAL ASSEMBLY
In this paper we present Parlameter, a corpus of contemporary parliamentary debates
which contains the debates of the 7th mandate of the Slovene National Assembly
(2014–2018). The Parlameter corpus contains rich speaker metadata (gender, age,
education, party affiliation) and is linguistically annotated (lemmatization, tagging), which
enables a wide range of research in digital humanities and social sciences. We
demonstrate the potential of corpus-analytic techniques for investigating political debates.
The corpus architecture is designed to allow the corpus to be extended to other time
periods and to include materials from other parliaments, starting with the Croatian and
Bosnian ones.
Keywords: parliamentary debates, corpus construction, language technologies, corpus
analysis
ABSTRACT
The paper presents the Parlameter corpus of contemporary Slovene parliamentary
proceedings, which covers the VIIth mandate of the Slovene Parliament (2014–2018). The
Parlameter corpus offers rich speaker metadata (gender, age, education, party affiliation)
and is linguistically annotated (lemmatization, tagging), which boosts research in several
digital humanities and social sciences disciplines. We demonstrate the potential of the corpus
analysis techniques for investigating political debates. The corpus architecture allows for
regular extensions of the corpus with additional Slovene data, as well as data from other
parliaments, starting with Croatian and Bosnian.
Keywords: parliamentary proceedings, corpus construction, language technology, corpus
analysis
* Department of Translation, Faculty of Arts, University of Ljubljana, Aškerčeva cesta 2, SI-1000 Ljubljana,
Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana,
darja.fiser@ff.uni-lj.si
** Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, nikola.ljubesic@ijs.si
*** Department of Knowledge Technologies, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana,
tomaz.erjavec@ijs.si
Introduction
Parliamentary discourse is motivated by a wide range of communicative goals,
from position-claiming, persuasion and negotiation to agenda-setting and opinion-
building along ideological or party lines. It is characterized by role-based commit-
ments and confrontation and the awareness of a multi-layered audience (Ilie 2017).
The unique content, structure and language of records of parliamentary debates are
all factors that make them an important object of study in a wide range of disciplines
in digital humanities and social sciences, such as political science (van Dijk 2010),
sociology (Cheng 2015), history (Pančur and Šorn 2016), discourse analysis (Hirst
et al. 2014), sociolinguistics (Rheault et al. 2016), and multilinguality (Bayley 2014).
Despite the fact that parliamentary discourse has become an increasingly impor-
tant research topic in various fields of digital humanities and social sciences in the
past 50 years (Chester and Bowring 1962; Franklin and Norton 1993), it has only
recently started to acquire a truly interdisciplinary scope (Bayley 2014). Recent devel-
opments enable cross-fertilization of linguistic studies with other disciplines and in-
depth exploration of institutional uses of language, interpersonal behaviour patterns,
interplay between language-shaped facts, and reality-prompted language ritualization
and change (Ihalainen et al. 2016).
With an increasingly decisive role of parliaments and their rapidly changing relations
with the public, mass media, executive branch and international organizations, further
empirical research and development of integrative analytical tools are necessary in order
to achieve a better understanding of parliamentary discourse as well as its wider societal
impact, in particular with studies that represent diverse parts of society (women, minori-
ties, marginalized groups) and cross-cultural studies (Hughes et al. 2013).
Parliamentary Corpora
The most distinguishing characteristic of records of parliamentary debates is that
they are essentially transcriptions of spoken language produced in controlled and reg-
ulated circumstances. For this reason, they are rich in invaluable (sociodemographic)
72 Prispevki za novejšo zgodovino LIX - 1/2019
meta-data. They are also easily available under various Freedom of Information Acts
set in place to enable informed participation by the public and to improve effec-
tive functioning of democratic systems, making the datasets even more valuable for
researchers with heterogeneous backgrounds.
This has motivated a number of national as well as international initiatives (for an
overview, see Fišer and Lenardič 2018) to compile, process and analyse parliamentary
corpora. They are available for most countries within the CLARIN ERIC research
infrastructure for language resources and technology, with the UK’s Hansard Corpus
being the largest (1.6 billion tokens) and spanning the longest time period (1803–
2005) while corpora from other countries are significantly smaller (most comprise
between 10 and 100 million tokens) and cover significantly shorter periods (mostly
from the 1970s onwards).
The Slovene parliamentary corpus SlovParl 2.0 (Pančur 2016) contains minutes of
the Assembly of the Republic of Slovenia for the legislative period 1990–1992 when
Slovenia became an independent country. The corpus comprises over 200 sessions,
almost 60,000 speeches and 11 million words. It contains extensive meta-data about
the speakers, a typology of sessions and structural and editorial annotations and is uni-
formly encoded to the Text Encoding Initiative (TEI) Guidelines, a de-facto standard
for encoding and annotating textual data in Digital Humanities. It is available under
the CC-BY licence in the CLARIN.SI repository of language resources and via the
CLARIN.SI concordancers (Pančur et al. 2017). SlovParl is thus an exemplary corpus
but contains material from a quite limited, and not very recent time period. This makes
the corpus of limited use for the rich body of research on recent parliamentary activities.
Contemporary Slovenian parliamentary debates are monitored by the analytical
tool Parlameter1 which makes use of linguistic as well as non-linguistic data, such as
MPs’ attendance and voting results. While this is a very useful tool for journalists and
citizen scientists and gives valuable insight into contemporary parliamentary data, the
data can only be explored through the functionality of the tool and as such cannot be
freely manipulated by scholars according to their specific research needs.
The goal of the research presented in this paper was to convert the Parlameter data-
base into a freely and openly available linguistically annotated corpus enriched with
session and speaker metadata, and to showcase the analyses that can be performed
on such corpora via open-source tools for corpus analysis. Section 3 gives the basic
information on the corpus structure and size, Section 4 presents the analysis of the
corpus according to the text and speaker metadata by utilizing some of the best-known
corpus analysis techniques, and Section 5 gives some conclusions and directions for
further research.
While the focus of the paper is the parliamentary language material which we
process with natural language processing and analyse with standard methods from
corpus linguistics, the aim of the analysis is to inform media and political studies by
transferring the presented methodology into these areas.
1 Parlameter, https://parlameter.si.
Corpus Compilation
The data dump from the Parlameter tool consisted of the minutes of the National
Assembly of the Republic of Slovenia from its VIIth mandate, spanning sessions that
started between 2014-08-01 and 2018-05-24 (the complete mandate lasted until
2018-06-22). It was received from the Parlameter API (application programming interface)
as a series of JSON files, which were first reorganised into a file containing speaker
metadata and a file with the transcriptions of the minutes with speaker identifiers.
The speaker metadata contains information about the speaker name and surname,
and (for some speakers) their sex, date of birth, education, and party affiliation. The
complete speaker metadata is available for the members of the parliament and of the
government, but not for, e.g., visiting field experts, representatives of governmental
agencies, non-governmental organizations or civil initiatives. This is why the analyses
in Section 4 are performed based on the instances for which the metadata is available
in the corpus.
The transcriptions contain the ID of the session, name of the session (e.g. “4. izredna
seja” - 4th extraordinary session), the date when the session started, and its speeches, each
one with the ID of the speaker and a number of segments, roughly corresponding to para-
graphs. As discussed below, the transcriptions also contain comments by the transcribers.
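The two reorganised files described above can be pictured as a small data model. The following is a minimal sketch only: the class and field names are illustrative assumptions, since the actual JSON keys of the Parlameter dump are not given here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Speaker:
    # Field names are hypothetical; optional fields are missing for some speakers.
    speaker_id: str
    name: str
    sex: Optional[str] = None
    date_of_birth: Optional[str] = None
    education: Optional[str] = None
    party: Optional[str] = None


@dataclass
class Speech:
    speaker_id: str
    segments: List[str] = field(default_factory=list)  # roughly paragraphs


@dataclass
class Session:
    session_id: str
    name: str          # e.g. "4. izredna seja"
    start_date: str    # date when the session started
    speeches: List[Speech] = field(default_factory=list)


def speakers_with_full_metadata(speakers):
    """Return only the speakers for whom every optional field is filled in."""
    return [s for s in speakers
            if None not in (s.sex, s.date_of_birth, s.education, s.party)]
```

A filter of this kind mirrors the restriction of the later analyses to the instances for which metadata is available.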
Normalisation of Speaker Data
The speaker data was normalised by removing extraneous spaces and removing
honorifics (sometimes the name was preceded by, e.g., “Gospod” – Mr.). Furthermore,
in Slovene it is relatively easy to infer the sex from the given name, so we also added
sex information to the speakers missing it.
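The name normalisation step can be sketched as follows. This is an illustration under assumptions, not the procedure actually used: only "Gospod" is attested above, "gospa" (Mrs.) is an assumed addition, and the inference of sex from the given name is not covered.

```python
import re

# "gospod" is mentioned in the text; "gospa" is an assumed counterpart.
HONORIFICS = ("gospod", "gospa")


def normalise_name(raw: str) -> str:
    """Collapse extraneous whitespace and drop leading honorifics."""
    name = re.sub(r"\s+", " ", raw).strip()
    parts = name.split(" ")
    while parts and parts[0].lower().rstrip(".") in HONORIFICS:
        parts = parts[1:]
    return " ".join(parts)
```

For example, `normalise_name("  Gospod   Janez  Novak ")` yields the name without the honorific and with single spaces (the name itself is a made-up example).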
Normalisation of Transcriptions
The JSON dump also contained empty speeches, as well as a significant number
of duplicated speeches. These were removed, as well as extraneous spaces in the text
of the transcriptions.
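A minimal sketch of this clean-up is given below. What counts as a duplicate is not specified above, so the key used here (same speaker and same whitespace-normalised text) is an assumption.

```python
def drop_empty_and_duplicates(speeches):
    """Filter (speaker_id, text) pairs, keeping the original order.

    Empty speeches are dropped, extraneous whitespace is collapsed, and a
    speech is treated as a duplicate if the same speaker already produced
    exactly the same text (an assumed definition).
    """
    seen = set()
    kept = []
    for speaker_id, text in speeches:
        cleaned = " ".join(text.split())
        if not cleaned:
            continue
        key = (speaker_id, cleaned)
        if key in seen:
            continue
        seen.add(key)
        kept.append((speaker_id, cleaned))
    return kept
```

With such a key, legitimately repeated short speeches by the same speaker would also be dropped, so a real pipeline would probably add the session or the position in the minutes to the key.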
Apart from the speeches, the minutes also contained 65,965 comments
on verbal and non-verbal behaviour of the speaker or the members of parliament.
There are two types of such remarks. The first are written between slashes and
are mostly comments on audible incidents, e.g., /nerazumljivo/ (incomprehensible),
/oglašanje iz dvorane/ (comments from the hall), /znak za konec razprave/ (sign for the
end of the discussion). The second type is written in parentheses and
mainly denotes voting results, e.g., (nihče) – nobody, (10 članov) – 10 members, (proti
44) – 44 against. Both types of comments have been removed from the transcriptions
for the current version of the corpus, as they are not part of the transcription proper
and would significantly complicate further processing. Furthermore, the content of
the comments is not uniform, with the same information written in various ways (e.g.
/smeh/ – laughter, /smeh iz dvorane/ – laughter from the hall, /smeh v dvorani/ – laughter
in the hall), meaning that the values would have to be unified before being converted
to appropriate corpus elements.
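The removal of both comment types can be approximated with two regular expressions. This is a sketch under assumptions: the patterns are illustrative and were not taken from the actual processing scripts.

```python
import re

# A slash comment starts at the beginning of the text or after whitespace.
SLASH_COMMENT = re.compile(r"(?<!\S)/[^/\n]+/")
# A bracketed comment is any non-nested parenthesised span.
PAREN_COMMENT = re.compile(r"\([^()\n]*\)")


def strip_comments(text: str) -> str:
    """Remove /.../ and (...) transcriber comments and tidy the whitespace."""
    text = SLASH_COMMENT.sub(" ", text)
    text = PAREN_COMMENT.sub(" ", text)
    return " ".join(text.split())
```

Note that the second pattern also removes ordinary parenthetical remarks made by the speaker, so it is a deliberately crude approximation.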
Linguistic Annotation
In the second stage, the text of the transcriptions was automatically annotated with
linguistic information. In particular, the text was tokenised, i.e. split into words, punc-
tuation marks and spaces, and segmented into sentences, which was performed by the
ReLDI tokeniser (Ljubešić et al. 2016). Second, the words were part-of-speech tagged
and lemmatised, i.e. each word was assigned its context-dependent morphosyntactic
description and non-marked form, e.g., the words in “V naši sredini” – In our midst
are assigned the MSDs “Sl Ps1fslp Ncfsl” meaning preposition in the locative case; the
possessive pronoun in the first person feminine singular locative with a plural owner
number; and the feminine common noun in the singular locative, while the lemmas are
“v naš sredina”. The tagging and lemmatisation were performed with the ReLDI tagger
(Ljubešić and Erjavec 2016) using its model for Slovene. Finally, the transcriptions
were also tagged for named entities, i.e., names identified in the corpus were marked
and categorised into five classes, those for persons, locations, organisations, for adjec-
tives derived from a person’s name (e.g. “Cerarjev” – Cerar’s), and a miscellaneous cat-
egory. The named entity annotation was performed with Janes-NER (Fišer et al. 2018).
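The annotation of the example phrase can be represented as one record per token. The sketch below uses the forms, lemmas and MSDs quoted above; the `Token` type and the tab-separated, one-token-per-line layout (similar to the vertical format used by concordancers) are illustrative assumptions about column order, not the corpus's documented layout.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str
    lemma: str
    msd: str


# The example "V naši sredini" with the lemmas and MSDs given in the text.
tokens: List[Token] = [
    Token("V", "v", "Sl"),
    Token("naši", "naš", "Ps1fslp"),
    Token("sredini", "sredina", "Ncfsl"),
]


def to_vertical(toks: List[Token]) -> str:
    """Serialise tokens as one line per token with tab-separated attributes."""
    return "\n".join("\t".join(t) for t in toks)
```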
Corpus Encoding
The corpus is encoded in XML, according to the Text Encoding Initiative
Guidelines (TEI Consortium 2017). The complete corpus is stored as one TEI docu-
ment, which contains its TEI header with the metadata for the corpus, and its text
body, containing the transcriptions, one division for each starting date of the sessions;
each division is stored as a separate file, giving one root file for the corpus and 525 files
for the divisions.
The TEI header contains extensive metadata for the corpus as a whole, e.g., its
authors and funders, the source description, the list and numbers of elements used in
the corpus, as well as the list of speakers and their metadata. Most metadata is given
both in Slovene and English.
As illustrated in Figure 1, the TEI text body date divisions contain a division for
each session, and then the utterances for each speaker, each one containing one or
more segments, which then contain the annotated transcription.
Figure 1: The TEI encoding of the corpus.
[Figure: TEI excerpt for the date division 26.08.2014 (Mandat VII), session “2. redna seja”, with one utterance: “Lepo pozdravljeni. Pričenjamo 2. sejo Kolegija predsednika Državnega zbora.”]
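The nesting described above (date division, session division, utterance, segment, annotated tokens) can be traversed with a standard XML parser. The following is a sketch only: the sample fragment is hand-written, the element names follow general TEI practice, and the attribute values (speaker identifier, lemmas) are hypothetical rather than copied from the corpus.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Hand-written fragment; attributes and values are illustrative assumptions.
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0" type="date">
  <head>26.08.2014</head>
  <div type="session">
    <head>2. redna seja</head>
    <u who="#speaker1">
      <seg>
        <s><w lemma="lepo">Lepo</w> <w lemma="pozdravljen">pozdravljeni</w><pc>.</pc></s>
      </seg>
    </u>
  </div>
</div>"""


def utterances(date_div):
    """Yield (session title, speaker reference, word forms) per utterance."""
    for session in date_div.findall("tei:div", NS):
        title = session.findtext("tei:head", default="", namespaces=NS)
        for u in session.findall("tei:u", NS):
            words = [w.text for w in u.iter("{%s}w" % TEI)]
            yield title, u.get("who"), words


root = ET.fromstring(SAMPLE)
result = list(utterances(root))
```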
Corpus Size
Some basic statistics regarding the corpus are given in Table 1. In total, the
Parlameter corpus contains 371 sessions (as distinguished by their title) which spanned
over 525 days, i.e., 1.4 days per session on average. If we count distinct sessions that
started on a given day, the corpus contains 1,338 such sessions. The VIIth mandate of
the parliament heard 1,981 speakers who gave 133,287 speeches which contain almost
35 million words, i.e., 67 speeches per speaker and 260 words per speech on average.
Due to a number of factors, such as different roles of the speakers in the parliament, the
distribution is, of course, far from uniform, e.g., there is one speaker that gave 14,616
speeches, while 711 speakers gave only one speech.
Table 1: Basic statistics of the Parlameter corpus.
Tokens 40,987,516
Words 34,882,499
Sentences 1,833,147
Utterances 133,287
Speakers 1,981
Sessions on date 1,338
Dates 525
Sessions 371
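The averages quoted in the text follow directly from the counts in Table 1. As a small worked check (the dictionary keys and function name are our own):

```python
# Counts copied from Table 1.
stats = {
    "tokens": 40_987_516,
    "words": 34_882_499,
    "sentences": 1_833_147,
    "utterances": 133_287,
    "speakers": 1_981,
    "sessions_on_date": 1_338,
    "dates": 525,
    "sessions": 371,
}


def averages(s):
    """Derive the per-session, per-speaker and per-speech averages."""
    return {
        "days_per_session": s["dates"] / s["sessions"],
        "speeches_per_speaker": s["utterances"] / s["speakers"],
        "words_per_speech": s["words"] / s["utterances"],
    }


avg = averages(stats)
```

The results come out at roughly 1.4 days per session, 67 speeches per speaker and a little over 260 words per speech.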
Availability of the Corpus
The Parlameter corpus is available through CLARIN.SI. CLARIN is the European
research infrastructure for language resources and technologies, which makes digi-
tal language resources available to scholars, researchers, students and citizen-scien-
tists from all disciplines, especially in the humanities and social sciences, through
single sign-on access. CLARIN offers long-term solutions and technology services
for deploying, connecting, analysing and sustaining digital language data and tools.
CLARIN is organised as a network of national centres, with CLARIN.SI covering
Slovenia. CLARIN.SI2 offers, inter alia, two concordancers for on-line corpus explo-
ration, and a repository of language resources and tools, intended for their long-term
archiving together with support for different types of licences and an unambiguous
way for others to cite these resources, using Handle persistent identifiers. The land-
ing page of each resource also gives a cross-reference to the concordancers for the
particular corpus, and vice-versa. The repository also exposes its metadata, which is
being harvested by a number of other services.
The Parlameter corpus is available through both CLARIN.SI concordancers, as
well as for download from its repository, both as a TEI document and in the simpler
vertical file format, under the liberal Creative Commons – Attribution-ShareAlike
(CC BY-SA 4.0) licence (Dobranić et al. 2019). In this way we hope to raise interest
among other researchers to explore the corpus and make use of it in their research.
2 CLARIN Slovenia, http://www.clarin.si/info/about/.
Corpus Analysis
By using the CLARIN.SI NoSketch Engine concordancer,3 we demonstrate the
potential of the basic corpus analysis techniques (Gorjanc and Fišer 2013) for political
science, history and other related humanities and social sciences disciplines that base their
research on large volumes of language data. Concordances are lists of all examples of the
search word or phrase from a corpus which are shown in the context they were used in
and are equipped with the available metadata. Wordlists are comprehensive summari-
zations of the language inventory in the corpus, organized by frequency or alphabeti-
cally. Collocations are partly or fully fixed multi-word expressions which have become
established through usage. Keywords are words which appear in the focus corpus more
frequently than they would in the general language. Combined with the available text
and speaker metadata, such as date, speaker gender or political affiliation, they provide
a powerful analytical tool for discovering the commonalities and specificities of the
linguistic footprint and trends by different types of speakers in the parliament as will
be shown in the rest of this section.
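To make two of these notions concrete, the sketch below implements a toy concordance (keyword in context) and a frequency wordlist over a pre-tokenised text. It is purely illustrative and is not how NoSketch Engine is implemented; the function names and the window parameter are our own.

```python
from collections import Counter


def kwic(tokens, keyword, window=3):
    """Return (left context, hit, right context) for every match of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def wordlist(tokens):
    """Return (item, frequency) pairs sorted by descending frequency."""
    return Counter(t.lower() for t in tokens).most_common()


sample = "Lepo pozdravljeni . Pričenjamo 2 . sejo Kolegija predsednika Državnega zbora .".split()
```

Calling `kwic(sample, "sejo", window=2)` returns the single hit with two tokens of context on each side.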
Production Volume and Vocabulary Size
As already presented in Table 1, the corpus contains nearly 41 million tokens or 35
million words. NoSketch Engine also reports the lexicon size of the corpus, as given in
Table 2, which shows that the corpus contains approximately 263,000 different word
forms (i.e., inflected words, e.g., Slovenije), over 104,000 different lemmas (i.e., base
forms of words, e.g., Slovenija), and 1,080 different morphosyntactic tags (e.g., Verb
main present second plural). However, it should be noted that both the lemmas and the tags
are automatically assigned, so they also contain some annotation errors: the accuracy
of morphosyntactic tags is around 94%, and the accuracy of lemmas is above 99%.
Table 2: Lexicon sizes of the Parlameter corpus.
Unique words 263,007
Unique lemmas 104,247
Unique tags 1,080
While the corpus contains parliamentary debates from the period 2014–2018,
62% of the material was recorded in 2015 and 2016. Given the parliamentary term,
which lasted from 1 August 2014 to 14 April 2018, it is interesting to observe that the
share of material produced in 2017 was 8 percentage points smaller than the year before,
since the last year of the term would be expected to be the busiest, in order to wrap up
the work plan and set the ground for a new election cycle.
3 NoSketch Engine @ CLARIN.SI, https://www.clarin.si/noske/.
Table 3: Distribution of text quantity by year in Parlameter.
Year No. of tokens % of tokens Rel. freq.
2014 3,759,110 9% 91,714
2015 12,441,754 30% 303,550
2016 13,270,257 32% 323,763
2017 9,944,401 24% 242,620
2018 1,571,994 4% 38,353
Total 40,987,516 100% 1,000,000
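The percentage and relative-frequency columns of Table 3 can be recomputed from the token counts. The short sketch below (variable and function names are our own) also shows that the drop from 2016 to 2017 is about a quarter in relative terms, while the difference between the two yearly shares is smaller.

```python
# Token counts per year, copied from Table 3.
tokens_by_year = {
    2014: 3_759_110,
    2015: 12_441_754,
    2016: 13_270_257,
    2017: 9_944_401,
    2018: 1_571_994,
}
total = sum(tokens_by_year.values())


def per_million(count, corpus_size):
    """Relative frequency normalised to one million tokens."""
    return count / corpus_size * 1_000_000


shares = {year: count / total for year, count in tokens_by_year.items()}
rel_freq = {year: per_million(count, total) for year, count in tokens_by_year.items()}

# Relative decrease in production from 2016 to 2017.
relative_drop = 1 - tokens_by_year[2017] / tokens_by_year[2016]
```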
Morphosyntactic Specificities of the Language in Parlameter
We performed a basic analysis of the morphosyntactic annotations of the corpus
in the form of the most significant differences in their frequencies between the Gigafida
reference corpus of Slovene4 and the Parlameter corpus, which are given in Table 4.5
Table 4: Most salient differences in morphosyntactic descriptions between Gigafida 2.0 and Parlameter.
Gigafida | Parlameter
Residual web | Pronoun personal first singular nominative
Numeral roman cardinal | Verb main present second plural
Adjective possessive positive masculine singular instrumental | Pronoun personal second masculine plural nominative
Auxiliary infinitive | Pronoun possessive first feminine singular genitive singular
Adjective possessive positive masculine plural genitive | Verb main present first plural -Negative
Adjective possessive positive masculine singular locative | Verb main present second plural -Negative
Adjective possessive positive neuter singular locative | Pronoun demonstrative neuter plural accusative
Pronoun possessive third masculine singular accusative dual | Pronoun personal first singular accusative
Adjective possessive positive masculine singular nominative -Definiteness | Verb main present first singular
Pronoun possessive third feminine plural locative singular masculine | Verb main present first singular
Adjective possessive positive masculine plural nominative | Pronoun demonstrative masculine singular dative
Noun proper feminine plural dative | Pronoun indefinite feminine singular genitive
Numeral letter ordinal neuter plural genitive | Pronoun indefinite masculine singular accusative
Pronoun personal first dual accusative | Verb auxiliary present second plural -Negative
Pronoun personal first dual dative | Verb auxiliary future first singular -Negative
Noun proper neuter singular instrumental | Pronoun personal first masculine plural nominative
Adjective possessive positive feminine singular locative | Verb auxiliary present second plural +Negative
Pronoun personal second singular accusative bound | Verb main present first plural
Pronoun personal third masculine dual dative +Clitic | Pronoun indefinite feminine singular accusative
Adjective possessive positive masculine plural locative | Pronoun demonstrative feminine plural accusative
4 For this comparison we used the deduplicated version of Gigafida 2.0. At the time of writing, this corpus was newly
made and does not yet have a reference publication. It is, however, freely available for searching and analysis at
https://www.clarin.si/noske/.
5 The morphosyntactic tags are given here in their expanded form to aid understanding. The reference to these morphosyntactic
descriptions is given in http://nl.ijs.si/ME/V6/msd/html/msd-sl.html.
The results show that the parliamentary speeches, as expected, contain more present
tense verb forms, especially in the first and second person singular or plural (e.g.,
imamo – we have, pozdravljam – I greet, zaupate – you trust), as well as personal and
demonstrative pronouns, the former most prominently as the first person singular
personal pronoun (jaz – I).
On the other hand, the parliamentary proceedings do not contain URLs or
Roman numerals. More interestingly, they also contain significantly fewer possessive
adjectives (e.g. torkovim – Tuesday’s) and pronouns (njun – theirs [dual]), proper names,
numerals, personal pronouns in the dual number (naju – us two), or in the second person
singular accusative (nate – to you) than general Slovene.
Language and Gender in Parlameter
Gender is recorded for all but one speaker in the corpus.6 In total, 1,965 speakers
are represented, 62% of whom are male and 38% female. Interestingly, the contribution
from the speakers is not proportionate to the distribution according to their gender,
with the male speakers contributing 71% of the tokens in the corpus and the female
speakers 29%. On the speech level the difference is even more pronounced, as the male
speakers delivered 73% of the speeches and female speakers only 27%, indicating
that, on average, the speeches given by female speakers were somewhat longer than
those by male speakers.
6 This missing information is due to errors in input metadata records, which will be corrected in the next version of
the corpus.
Table 5: Distribution of speakers and text production by gender in Parlameter.
Gender No. of speakers % of speakers No. of tokens % of tokens
Female 747 38% 11,838,913 29%
Male 1,217 62% 29,147,871 71%
Unknown 1 0% 732 0%
Total 1,965 100% 40,987,516 100%
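The disproportion between speaker counts and text production can be expressed as the average number of tokens per speaker. The sketch below takes the male subcorpus to be the larger one (29,147,871 tokens, i.e., 71% of the corpus, as stated in the text); the variable names are our own.

```python
# Speaker counts from Table 5; token counts assigned as stated in the text
# (male speakers 71% of the tokens, female speakers 29%).
speakers = {"female": 747, "male": 1_217, "unknown": 1}
tokens = {"female": 11_838_913, "male": 29_147_871, "unknown": 732}

total_tokens = sum(tokens.values())
token_share = {g: n / total_tokens for g, n in tokens.items()}
tokens_per_speaker = {g: tokens[g] / speakers[g] for g in speakers}

# How much more text the average male speaker produced.
male_to_female = tokens_per_speaker["male"] / tokens_per_speaker["female"]
```

On these figures the average male speaker produced roughly one and a half times as many tokens as the average female speaker.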
Table 6, which lists top-ranking 10 female and male speakers and their production
in terms of tokens, shows that the most prolific male speakers produced nearly twice
as much material as their female counterparts. Overall, all top 10 speakers except one
(Miha Kordiš, male, the Levica party) have a leading role in one or more parliamentary
or governmental bodies, including 2 ministers, both of whom are female, 2 opposition
deputy group chairs, who are both male, and the Chair of the National Assembly who
is also male. Based on their roles in the parliament or the government, top-ranking
speakers represent issues on culture, corruption, judiciary, finances, agriculture, for-
eign policy, education and infrastructure. In terms of political orientation, the larg-
est opposition party SDS is best represented with 5 top-ranking male and 3 female
speakers, including chair and vice-chair of their deputy group. Among the top-ranking
female speakers, the entire political spectrum is represented while male speakers from
the SD and DeSUS parties do not make the list, and the SMC party is only represented
by the Chair of the National Assembly whose role is most likely predominantly proce-
dural, not to promote the party agenda.
Table 6: Top-ranking 10 female and male speakers and their text production in Parlameter.
Female speakers
Name | Party affiliation // Role | Tokens | %
Anja B. Žibert | SDS // Chair of the Culture Committee | 698,883 | 6%
Jelka Godec | SDS // Chair of the Inquiry Commission on the Misuse Practices in Healthcare | 530,029 | 4%
Iva Dimic | NSI // Vice-chair of the Judiciary Committee | 509,101 | 4%
Alenka Bratušek | ZAAB // Vice-chair of the Public Finances Committee; Vice-chair of the Deputy Group ZAAB | 483,171 | 4%
Violeta Tomić | Levica // Vice-chair of the Agriculture Committee | 446,460 | 4%
Eva Irgl | SDS // Chair of the petition committee | 439,042 | 4%
Urška Ban | SMC // Chair of the Finances and Monetary Policy Committee | 382,425 | 3%
Mateja V. Erman | Minister of Finance | 381,604 | 3%
Bojana Muršič | SD // Vice-chair of the National Assembly, Vice-chair of the Education Committee | 366,547 | 3%
Julijana B. Mlakar | DeSUS // Minister of Culture; Vice-chair of the Foreign Policy Committee | 308,355 | 3%
Male speakers
Name | Party affiliation // Role | Tokens | %
Jožef Horvat | NSI // Chair of the Foreign Policy Committee; Chair of the Deputy Group NSI | 1,141,778 | 4%
Jani Möderndorfer | ZAAB // Chair of the Inquiry Commission on bank money laundering; Vice-chair of the Election Committee | 1,062,546 | 4%
Franc Trček | Levica // Vice-chair of the Infrastructure Committee; Vice-chair of the Inquiry Commission on bank money laundering | 1,060,399 | 4%
Milan Brglez | SMC // Chair of the National Assembly; Chair of the Constitution Committee | 948,334 | 3%
Vinko Gorenak | SDS // Vice-chair of the Deputy Group SDS | 788,678 | 3%
Franc Breznik | SDS // Vice-chair of the Election Committee | 763,437 | 3%
Jože Tanko | SDS // Chair of the Deputy Group SDS | 752,130 | 3%
Andrej Šircelj | SDS // Chair of the Public Finances Committee | 721,135 | 2%
Tomaž Lisec | SDS // Chair of the Agriculture Committee | 707,666 | 2%
Miha Kordiš | Levica | 676,717 | 2%
In order to compare the topics discussed by female and male speakers in the
Slovene parliament, we analysed their 100 top-ranking key lemmas, where we used
the corpus of all female speakers as the target corpus against the reference corpus of
all male speakers in the Parlameter corpus, and vice versa, so the two lists display the
distinguishing features of each of the groups. By observing their contexts via concordances,
we manually classified them into one of the 14 topics represented by the
ministries in the Slovenian government:
– agriculture, forestry and food
– culture
– defence
– economy and technology
– education, science and sport
– environment and spatial planning
– finance
– health
– foreign affairs
– infrastructure
– interior
– justice
– labour, family and social affairs
– public administration
In addition, we introduced 4 further categories for keywords that could not be
classified into any of the topics above:
– interaction/procedural for keywords which referred to other people attending the ses-
sion (e.g., references to names of other speakers, predsednik – chairman) or expressed
procedural matters during the session (e.g., prisotni – present, dobrodošli – welcome)
– style for keywords which were either distinctly colloquial or distinctly formal and
were frequently used only by a single or very few speakers in order to achieve a
special effect (e.g., penez, a very informal expression for money, šiht, a very infor-
mal expression for job)
– ideology for keywords which were used to ideologically label an individual speaker
or a group of speakers (e.g., levičarski – leftist, kapitalizem – capitalism)
– multiple for keywords which were used in several topics (e.g., zgodnji – early, fan-
tastičen – fantastic).
As can be seen from Table 7, the most frequent topics among the female speakers
are health (35) and labour, family and social affairs (33), which are followed by
public administration (13) and education, science and sport (8). Most of the 100 top-ranking
keywords uttered by male speakers, on the other hand, could not be classified
into a single topic because they were used to achieve a stylistic effect (24),
were general words used in multiple topics, such as descriptive adjectives or
legal terms (22), or were ideological expressions (6). All of these indicate a more discursive,
debating style of the male speakers, but could also stem from the fact that the leading
roles in that term were predominantly held by male members of parliament.7 Although
specific topics are much less frequent than in the female part of the corpus overall, the most
frequently represented specific topics among male speakers are infrastructure (9), interior
(6), agriculture, forestry and food (5), and defence (5), suggesting a marked difference
in the roles and interests of male and female speakers in the Slovene parliament.
7 This problem could be avoided by removing outliers regarding production in the dataset before performing the
analyses. But our goal here was to present the complete corpus and demonstrate the basic corpus analysis techniques.
Table 7: Topics of 100 top-ranking keywords of female and male speakers in Parlameter.
Topics – female | Freq. | Topics – male | Freq.
health | 35 | style | 24
labour, family & social affairs | 33 | multiple | 22
public administration | 13 | infrastructure | 9
education, science & sport | 8 | interior | 6
interaction/procedural | 3 | ideology | 6
multiple | 3 | interaction/procedural | 5
environment & spatial planning | 1 | agriculture, forestry & food | 5
agriculture, forestry & food | 1 | defence | 5
culture | 1 | foreign affairs | 4
finance | 1 | finance | 4
economy & technology | 1 | justice | 3
Total | 100 | Total | 100
Illustrative examples of the 10 top-ranking female- and male-specific keywords
with a manually assigned topic are listed in Tables 8 and 9.
Table 8: Most frequent keywords, topics and word type among female speakers in
Parlameter. N stands for nouns, Adj for adjectives, and PN for proper nouns (names).
Lemma – English translation | Topic | PoS | Freq. | Freq_ref | Score
rejništvo – foster care | labour, family & social affairs | N | 264 | 59 | 7.7
mark – mark | health | PN | 155 | 29 | 7.1
enostarševski – single-parent | labour, family & social affairs | Adj | 167 | 38 | 6.6
roditeljski – parent | labour, family & social affairs | Adj | 169 | 39 | 6.5
medical – medical | health | PN | 128 | 26 | 6.2
plazma – plasma | health | N | 82 | 9 | 6.1
pacientov – patient’s | health | Adj | 282 | 97 | 5.7
zaznamba – notice | public administration | N | 155 | 43 | 5.7
žilen – stent | health | Adj | 518 | 213 | 5.4
duševen – mental | health | Adj | 393 | 156 | 5.4
nasilnež – violent person | labour, family & social affairs | N | 98 | 21 | 5.4
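The formula behind the Score column is not stated here. The values in Table 8 nevertheless appear consistent with the "simple maths" keyness score used by Sketch Engine-family tools, i.e., the ratio of the per-million frequencies in the focus and reference subcorpora, each smoothed by adding a constant of 1. The sketch below checks this reading; it assumes subcorpus sizes of 11,838,913 tokens for female and 29,147,871 tokens for male speakers, and the function names are our own.

```python
FEMALE_TOKENS = 11_838_913  # assumed size of the female subcorpus
MALE_TOKENS = 29_147_871    # assumed size of the male subcorpus


def per_million(freq, corpus_size):
    """Frequency normalised to one million tokens."""
    return freq / corpus_size * 1_000_000


def keyness(freq_focus, size_focus, freq_ref, size_ref, n=1.0):
    """'Simple maths' score: smoothed ratio of per-million frequencies."""
    return (per_million(freq_focus, size_focus) + n) / (per_million(freq_ref, size_ref) + n)


# rejništvo: 264 hits among female speakers, 59 among male speakers.
score_rejnistvo = keyness(264, FEMALE_TOKENS, 59, MALE_TOKENS)
# plazma: 82 hits among female speakers, 9 among male speakers.
score_plazma = keyness(82, FEMALE_TOKENS, 9, MALE_TOKENS)
```

Both values round to the scores listed in the table, which supports, but does not prove, the assumption about the scoring method.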
Table 9: Most frequent keywords, topics and word type among male speakers in
Parlameter.
Lemma – English translation | Category | PoS | Freq_ref | Score
penez – inf. money | finance | N | 0 | 13.2
navsezadnje – nevertheless | multiple | Adv | 90 | 8.4
kubik – cubic | agriculture, forestry & food | N | 10 | 7.8
islam – Islam | interior | N | 6 | 6.4
levičarski – leftist | ideology | Adj | 2 | 6.2
navzoč – present | interaction/procedural | Adj | 211 | 6.0
avtošola – driving school | infrastructure | N | 1 | 5.8
socialist – socialist | ideology | N | 25 | 5.5
svojevrsten – peculiar | multiple | Adj | 16 | 5.4
e-klopa – e-bench | interaction/procedural | N | 1 | 5.3
prečenje – crossing | style | N | 3 | 5.2
That the nature and style of male speeches is quite different from the female ones
can also be seen from the analysis of the morphosyntactic types of 100 highest-ranking
keywords for male and female speakers. While nouns are the most frequent category
and are used equally frequently by both male and female speakers (44%), many more
adjectives were found among the female top-ranking keywords (33% vs. 16%), while
the male keywords had more adverbs (11% vs. 4%) and verbs (9% vs. 2%), which
again could be related to the roles of the speakers in the parliament. However, given the
results of our preliminary work on this dataset (Ljubešić et al. 2018), during which we
removed the speakers that produced most of the linguistic material from the analysis,
we see similar trends both in the gender-dependent keyword and morphosyntactic
analysis, and are therefore rather in favour of accepting the observed differences as
an effect of gender rather than role.
Language and Party Affiliation in Parlameter
Affiliation is recorded for only 113 speakers out of the 1982; however, these are
responsible for 79% of the tokens in the corpus. Affiliation is considered as either
deputy group membership or a role in the government, where it must be noted that
in this version of the corpus the metadata reflect the situation at the beginning of the
term and do not keep track of party membership transfers or resignations of ministers
or members of parliament. Also, when elected members of parliament were later
appointed as ministers, the metadata record only their party affiliation and record as
ministers only those who were appointed without being first elected to the parliament.
To facilitate more fine-grained and accurate use of the corpus in political science or
contemporary history, we plan to refine the metadata for the next release of the cor-
pus, adding also the MP’s membership in the working bodies of the National Assembly,
etc. Also, the metadata in the current version of the corpus do not flag the independent
members of parliament who do not belong to any of the parliamentary parties and oper-
ate in the Independents deputy group, which is why they are not included in our analysis.
As Table 10 shows, the most prolific deputy group is the largest opposition party
Slovenian Democratic Party (SDS), whose 20 members contributed nearly 10 million
tokens or 30% of the corpus. SDS is followed by the main governing party, Party of
Modern Centre (SMC), whose 42 members contributed 7 million tokens or 22% of
the corpus. Relative to its number of speakers, however, this party was the least
productive among the main parties: its ratio of the percentage of tokens to the percentage
of speakers (i.e., the relative token-to-speaker ratio) is 0.54, which means that
it generated only slightly more than half of the material that would have been
expected given its number of speakers and the overall activity of all the speakers.
The Left (Levica) and New Slovenia (NSi) rank third and fourth, despite the fact that
they had only 6 members each in the parliament, making them the most productive
parties with relative token-to-speaker ratios of 1.83 and 1.66, respectively. The Democratic Party
of Pensioners of Slovenia (DeSUS) had as many as 12 elected MPs but contributed about 1 million
fewer tokens than each of the two previous parties, which makes it the second least productive
party with a relative token-to-speaker ratio of 0.67.
Table 10: Distribution of speakers and text production by party affiliation in ParlaMeter
with speakers with unknown affiliation removed.8
Affiliation                                                      No. of speakers   % of speakers   No. of tokens   % of tokens
Slovenian Democratic Party Deputy Group (SDS)                    20                20%             9,516,651       30%
Party of Modern Centre Deputy Group (SMC)                        42                41%             7,162,719       22%
The Left Deputy Group (Levica)                                   6                 6%              3,438,194       11%
New Slovenia – Christian Democrats Deputy Group (NSi)            6                 6%              3,370,131       10%
Social Democrats Deputy Group (SD)                               9                 9%              2,533,019       8%
Democratic Party of Pensioners of Slovenia Deputy Group (DeSUS)  12                12%             2,435,884       8%
Party of Alenka Bratušek Deputy Group (SAB)                      4                 4%              1,876,294       6%
Italian and Hungarian National Minorities Deputy Group (IMNS)    2                 2%              117,709         0%
Government                                                       1                 1%              1,765,374       5%
Total                                                            102               100%            32,215,975      100%
8 The number of speakers per party is calculated from the ParlaMeter dump and deviates slightly from the official
member number due to different handling of speakers with multiple roles.
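The relative token-to-speaker ratio used above can be reproduced from the rounded percentages in Table 10. The following is a minimal sketch, not the authors' own script: the dictionary and function names are hypothetical, and the minority deputy group and the Government row are left out on the assumption that only the parties compared in the text are of interest. Because the inputs are rounded percentages, the results agree with the figures in the text only to about two decimal places.

```python
# Percentages of speakers and tokens per deputy group, copied from Table 10.
# Each value is (% of speakers, % of tokens).
TABLE_10 = {
    "SDS": (20, 30),
    "SMC": (41, 22),
    "Levica": (6, 11),
    "NSi": (6, 10),
    "SD": (9, 8),
    "DeSUS": (12, 8),
    "SAB": (4, 6),
}


def relative_ratio(pct_speakers, pct_tokens):
    """Share of tokens divided by share of speakers.

    A value of 1.0 means the group produced exactly as much text as
    expected from its number of speakers; values below 1.0 mean less.
    """
    return pct_tokens / pct_speakers


ratios = {
    party: relative_ratio(pct_speakers, pct_tokens)
    for party, (pct_speakers, pct_tokens) in TABLE_10.items()
}

# Print the groups from least to most productive.
for party, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{party:7s} {ratio:.2f}")
```

Computing the ratio from the raw token and speaker counts instead of the rounded percentages would give slightly different values, which is why the sketch should be read as an illustration of the measure rather than an exact replication.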
Next, we performed a manual analysis of the 100 top-ranking keywords of each
political party against the rest of the corpus. These analyses display the distinct prop-
erties of one party that are not shared by other parties. Using the concordances, we
classified the keywords into the same categories as in Section 4.1, the results of which
are summarized in Tables 11 and 12.
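To make the idea of ranking keywords of one party against the rest of the corpus concrete, the sketch below scores each lemma by a smoothed ratio of its per-million frequencies in a focus and a reference word list. This is an assumption made for illustration: the keyness measure actually used is not specified in this section, and the function name, the smoothing constant and the toy lemma lists are all hypothetical.

```python
from collections import Counter


def keyness(focus_lemmas, reference_lemmas, smoothing=1.0):
    """Rank lemmas of the focus list by a smoothed frequency ratio.

    score = (focus per-million + smoothing) / (reference per-million + smoothing)
    The smoothing constant avoids division by zero for lemmas that are
    absent from the reference list (an assumed, illustrative choice).
    """
    focus_counts = Counter(focus_lemmas)
    reference_counts = Counter(reference_lemmas)
    focus_size = len(focus_lemmas)
    reference_size = len(reference_lemmas)

    scores = {}
    for lemma, count in focus_counts.items():
        focus_pm = count / focus_size * 1_000_000
        reference_pm = reference_counts[lemma] / reference_size * 1_000_000
        scores[lemma] = (focus_pm + smoothing) / (reference_pm + smoothing)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: lemmas of one party versus lemmas of all other speakers.
party = ["upokojenec", "regres", "zakon", "upokojenec", "glasovanje"]
rest = ["zakon", "glasovanje", "zakon", "proračun", "glasovanje", "zakon"]

ranking = keyness(party, rest)
for lemma, score in ranking:
    print(f"{lemma:12s} {score:.2f}")
```

In this toy example the lemmas that never occur in the reference list rise to the top, while lemmas that are common everywhere receive scores below 1.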
Table 11: Topics of 100 top-ranking keywords of party members in Parlameter.
Topics SMC DeSUS SD SDS NSi Levica SAB
agriculture, forestry & food 0 0 34 0 27 0 0
culture 0 3 0 0 0 1 0
defense 0 0 21 5 0 0 1
economy & technology 0 0 5 1 11 13 1
education, science & sport 0 0 0 0 0 0 4
environment & spatial planning 0 0 3 0 6 1 0
finance 0 2 2 0 6 1 1
foreign affairs 0 5 0 2 4 3 0
health 0 3 0 8 1 0 5
ideology 0 0 0 15 3 9 0
infrastructure 1 0 2 0 7 1 1
interaction/procedural 99 61 14 17 10 4 14
interior 0 0 0 3 0 3 5
justice 0 1 1 8 0 0 0
labour, family & social affairs 0 13 3 1 4 13 3
multiple 0 2 6 13 8 17 29
public administration 0 2 0 5 2 1 7
style 0 8 9 22 11 33 29
Total 100 100 100 100 100 100 100
Unsurprisingly, due to the role of the main governing party SMC, practically all
their top-ranking keywords are interactional elements with the other speakers or have
a procedural nature (e.g., navzoč – present, glasovanje – voting, amandma – amendment).
That DeSUS is a single-issue party can be seen from their keywords, which, apart from
a surprisingly high proportion of interactive keywords, belong almost exclusively to
the semantic field of retirement and pension (e.g., regres – holiday pay, valorizirati – to
revalue, gmoten – material). Interestingly, even the topics of foreign affairs and culture
are nearly completely absent from their keyword list, despite the fact that the ministers
responsible for these areas came from their party, suggesting that these topics are more or
less evenly shared with other parties. SD, the third coalition party, clearly displays its priority areas of
agriculture, forestry and food (e.g., teran – Teran wine, fermentiran – fermented, kmeto-
vati – to farm) and defence (e.g., vojakinja – female soldier, neeksplodiran – unexploded,
strelivo – ammunition), which can be traced back to their ministers.
The largest opposition party SDS stands out from the rest by the number of ideological
keywords identified among the top-ranking keywords (e.g., tranzicijski – transitional,
totalitarizem – totalitarianism, lustracija – lustration). NSi and Levica, the opposition
parties with the same number of MPs but from the opposite ends of the political spectrum,
both address the widest variety of issues (their keywords were classified into 13
out of 18 topics). The topic in which they have a nearly equal number of keywords that are
completely opposite in content is economy and technology (e.g., soupravljanje – co-management for Levica vs.
espejevec – private entrepreneur for NSi). While NSi mostly talks about the local issues
related to their constituencies (e.g., samooskrba – self-sufficiency, posekan – cut down,
obdelovati – farm), Levica stands out by signature stylistic devices which range from
very informal (e.g., šlamastika – pickle, gazda – informal for master, nabijati – to bang on)
to highly elevated registers (e.g., nemara – perhaps, onkraj – beyond, ducat – dozen) and
displays the largest proportion of ideological vocabulary next to SDS (e.g., tovarišica –
comrade, revizionizem – revisionism, imperializem – imperialism). SAB seems to stand
out by a predominantly (local) administrative/procedural/governance vocabulary
(e.g., proporcionalen – proportional, odpoklic – recall, dvokrožen – double-ballot) as well
as a discursive, informal style of distinctly negative sentiment, which is characteristic
of one of their members, Vinko Möderndorfer (e.g., rešpektiram – honour, kozlarija –
nonsense, zmazek – disaster).
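The claim about topical breadth can be checked directly against Table 11 by counting, for each party, the topics that contain at least one keyword. The sketch below copies the counts from Table 11; the variable names are hypothetical and the code is only an illustration of the count, not part of the original analysis.

```python
PARTIES = ["SMC", "DeSUS", "SD", "SDS", "NSi", "Levica", "SAB"]

# Rows of Table 11: topic -> keyword counts in the order given in PARTIES.
TABLE_11 = {
    "agriculture, forestry & food": (0, 0, 34, 0, 27, 0, 0),
    "culture": (0, 3, 0, 0, 0, 1, 0),
    "defense": (0, 0, 21, 5, 0, 0, 1),
    "economy & technology": (0, 0, 5, 1, 11, 13, 1),
    "education, science & sport": (0, 0, 0, 0, 0, 0, 4),
    "environment & spatial planning": (0, 0, 3, 0, 6, 1, 0),
    "finance": (0, 2, 2, 0, 6, 1, 1),
    "foreign affairs": (0, 5, 0, 2, 4, 3, 0),
    "health": (0, 3, 0, 8, 1, 0, 5),
    "ideology": (0, 0, 0, 15, 3, 9, 0),
    "infrastructure": (1, 0, 2, 0, 7, 1, 1),
    "interaction/procedural": (99, 61, 14, 17, 10, 4, 14),
    "interior": (0, 0, 0, 3, 0, 3, 5),
    "justice": (0, 1, 1, 8, 0, 0, 0),
    "labour, family & social affairs": (0, 13, 3, 1, 4, 13, 3),
    "multiple": (0, 2, 6, 13, 8, 17, 29),
    "public administration": (0, 2, 0, 5, 2, 1, 7),
    "style": (0, 8, 9, 22, 11, 33, 29),
}

# Number of topics with at least one keyword, per party.
breadth = {
    party: sum(1 for counts in TABLE_11.values() if counts[i] > 0)
    for i, party in enumerate(PARTIES)
}

# Sanity check: every column of the table should add up to 100 keywords.
column_totals = {
    party: sum(counts[i] for counts in TABLE_11.values())
    for i, party in enumerate(PARTIES)
}

for party in PARTIES:
    print(f"{party:7s} topics: {breadth[party]:2d}  keywords: {column_totals[party]}")
```

The same dictionary can be reused to look up the dominant topic of each party, for instance the near-exclusive share of interaction/procedural keywords for SMC.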
Table 12: 100 top-ranking keywords per political party, taking into account lowercased
lemmas, computed against the rest of the Parlameter corpus and sorted according to
their keyness score.
SMC navzoč, e-klopa, udis, roberto, prekinjen, podprogram, prehajati, lipicer, kustec,
katerim, grebenšek, h, battelli, epi, stanujoč, obveščati, krajnc, zaključevati,
predajati, pričenjati, sodin, porotnica, simona, franc, glasovati, obrazložitev,
moderen, kolegij, tanko, postopkovno, potisek, končevati, nuklearen,
brezpredmeten, ep, jernej, dneven, počkaj, glasovnica, mandatno-volilen,
vojko, jožef, trček, bojan, neusklajen, tilen, prelog, ustavnorevizijski, odločanje,
arko, nadomeščati, he, branislav, matej, jože, glasovanje, prvopodpisan, e-klop,
glas, dopolnjen, porotnik, terminski, vložen, simono, franca, pogačnik, erman,
ugotavljati, klanjšček, smc, stebernak, nepovezan, jana, žibert, bien, matjaž,
šircelj, fajt, postopkoven, lilijana, skrajšan, monetaren, prekinjati, poslovniški,
matičen, bah, mag., marinka, šergan, lenča, vraničar, izvolitev, karlovšek,
razpravljavec, predstavnica, razširitev, anita, amandma, nadomeščanje, zame
DeSUS meglič, črnak, pripadajoč, desus, pogačar, dasiravno, vukov, valenca, požun,
inferioren, upajoč, möderndorfer, pregrešiti, divjak, valorizacija, korva, rezime,
kkr, kuzmanič, marijan, upokojen, vuk, mehčati, pojbič, košnik, bližnjevzhoden,
zaposlovalen, punkcija, žmavc, milojka, zaporedno, celarc, konzularen, xv.,
marija, kolar, bačič, erika, grošelj, rubelj, minski, lukić, rudarski, zadržanost,
mirjam, godec, valorizirati, sng, tašner, kušar, brinovšek, invalid, zamrznitev,
tedaj, dvoživkarstvo, nina, pirnat, dekleva, merše, federacija, nada, klanjšček,
protiukrep, jelka, ogrizek, gmoten, kisikov, ivo, majcen, izvoliti, iva, dimic.,
modifikacija, ljubič, žan, upokojenec, prikrajšanje, prečitati, šimenko, jasna,
izplačevanje, zipro, korpič, antonija, premožen, sapa, voljč, suzana, dimic, vesni,
lukič, zdravko, irena, teja, sluga, regres, ruše, janja, razparava, trivialen
SD izčistiti, genetsko, izčiščen, vezava, surov, demokrat, vojakinja, gorsko-hribovski,
travinje, potočan, vadišče, razprodati, hip, služenje, hišniški, faktorski, pripadnica,
stiskanje, zmogljivost, omd-, kočevski, anhovo, vrtojba, peterica, mineralen,
maji, krušen, kmetica, ciolos, vklop, deti, socialdemokratski, formacijski, teran,
selnica, kloniran, urszr, obramben, salonit, radeče, mlekarna, neperspektiven,
marjana, popolnjevanje, omd, odzivanje, vrtnina, vselej, zorganizirati, vikariat,
eutm, pokolp, govedo, rogaška, klirinški, razprodaja, surovina, ksenija, vinko,
izčiščevati, konzumen, refundirati, pripadnik, neeksplodiran, social, uokviriti,
žito, kfor, prebroditi, konvergenca, grajski, brecelj, hogan, administriranje, trader,
kočevsko, h4, primož, korenjak, bržkone, kmetovati, obrtništvo, vojska, strelivo,
poveljevanje, snežnik, plasiran, gorsko, refundacija, hribovski, proizvodnja,
subvencijski, dacian, missing, kmetija, opazovati, voditeljstvo, kramar,
fermentiran, viher
SDS islam, fišer, mark, svinjarija, levičarski, odnosno, medical, kb, demokratski,
odnosen, lenart, zemljarič, kučan, zalar, bordojski, kb1909, morišče, zločin,
iznenada, velikanski, tomos, kangler, patria, multikulti, masleša, prvorazreden,
škrlec, udba, stožice, tranzicijski, šef, praprotnik, moralno-etičen, ilegalno,
zločinski, bomben, peticija, porsche, srebrenica, cener, umor, totalitaren,
pokrasti, totalno, genocid, drugorazreden, tamle, erdogan, judikat, vega,
ribičič, privilegiranec, komunističen, razorožitev, varnostnoobveščevalen, žilen,
opornica, indičen, škandal, ornik, lustracija, poljanski, posavje, počenjati, furlan,
pobiti, sevnica, ubog, janković, krkovič, npu, deček, opran, bojda, blamaža, lopov,
toplak, kerševan, slikati, bmw, veselo, amen, totalen, komunizem, totalitarizem,
obsoditi, preiskati, bedarija, udbovski, pomorjen, turnšek, vladavina, zlagati,
šoping, vpiti, ukc, avion, klemenčič, koruptiven, neumnost
NSi komunalno, socialno-tržen, marn, božičnica, zidanica, egalitaren, krščanski,
espejevec, fantastičen, ekstrapolacija, planšarija, medparlamentaren, kamnik,
demografija, kapica, bundestag, podonavski, bajuk, samoprispevek, vinogradnik,
razlastiti, vipavski, prijateljstvo, kanalizacija, aksiom, pomurje, bogataš, ferenc,
parcelacija, optimirati, oljčnik, komenda, polnost, vrtalec, ozp, pomurski, ikt,
simulirati, dimniški, parlamentarec, podčrtovati, artikulirati, obžalovati, omizje,
cerknica, polčas, ginijev, zbirno-reciklažen, brutalno, prekladanje, širokogruden,
absorpcijski, šinko, dolenjsko, lestev, vodovod, rodnost, traktor, notranjska, opn,
posekan, vinograd, zaraščati, odvajanje, loža, kristjan, davno, regresen, lovrenčič,
firefox, parcela, akrapovič, obdelovati, obratovalnica, zpn, terezija, mihael,
odlašati, peskovci, vamp, notranjski, ovs, copatek, veselica, upniški, penzija,
hala, digitalen, goljuf, identifikacijski, mohar, postoriti, goveji, prirasti, splačati,
samooskrba, prazniti, odstaven, todorić, pozor
Levica penez, tuliti, vračljivost, ubesedovati, onkraj, bajta, neoliberalen, prečiti, nemara,
ducat, socialist, delavski, imperialističen, zvrniti, desnica, navsezadnje, blazen,
sociolog, šiht, soupravljanje, zategovanje, mandarin, kapitalizem, strokovec,
šlamastika, blazno, kapitalističen, tovarišica, ubesedovanje, revizionizem,
prekarnost, vzdržan, gazda, profit, sodržavljanka, izkoriščevalski, represija,
protisocialen, nabijati, prekaren, metafora, soodločanje, periferen, agregaten,
cinkarna, rezilen, mezda, amandmiranje, demokratizacija, ips, efektivno, natov,
levica, belokranjec, bučka, zaposlovalec, izhajajoč, reven, požegnati, profiten,
marof, ics, minimalec, podrejati, imperializem, kapitalist, silno, prekarizacija,
odpustek, sodržavljan, noveliranje, versus, zvo, bolgarski, zastraševanje,
informatičen, metaforično, režati, razreden, ciničen, striči, ropotati, korporacija,
rasizem, redistributiven, pregrevanje, trade, rez, omv, prekeren, deregulacija,
štacuna, grosist, znoreti, penzion, oligopolen, jahati, fevdalizacija, sočasno,
prečenje
SAB svojevrsten, večnost, mvk, pooblaščati, that’s, diskvalifikacija, prekleto, bla,
resnica, fakt, naglas, odpoklic, zavezništvo, minis, četrten, trapast, istrabenz,
zasebništvo, zamah, dvokrožen, ramšakov, diskvalificirati, športnica, drk, štos,
cetera, ups, nedostojno, redarski, strojan, nijz, proporcionalen, ma, evtanazija,
zanič, bloudkov, etc, mv, vsakič, naturalizacija, zamera, nor, listnica, smešiti,
dispečiranje, diskusija, strašansko, nefer, diskutirati, regres, sprevržen, r.,
zavrtanik, večen, hiv, nekorektno, ubežati, imperativen, presedan, prastrah,
dinozaver, halo, ekstremističen, rimskokatoliški, mvk-, namenoma, zmazek,
gedrih, somalijski, zamahniti, nonstop, kostanjevec, policaj, domišljati,
prohibicija, znakoven, paradoks, barantati, et, hecen, močvirnik, avans, nametati,
preprosto, prepričevati, podžupan, traparija, kričati, ekstra, non-stop, telovadba,
stefanovič, el-zoheiry, ničkolikokrat, kozlarija, prvenstvo, boh, domišljija,
rešpektiram
The Zeitgeist of ParlaMeter
Finally, we observe the zeitgeist of the Parlameter corpus by comparing it with its
older and smaller cousin, the SlovParl corpus, which contains material from the period
of Slovenia’s independence (1990–1992). First, we created keyword lists with each of
the two corpora acting as a focus and a reference corpus. We then manually classified
100 top-ranking keywords into the same categories as in Section 4.1, with the follow-
ing additional categories:
– abbreviations (etc., Mr.), which were in use in the SlovParl but are no longer the
convention in the ParlaMeter transcriptions of the parliamentary sessions
– IT vocabulary (internet, web), which at the time of SlovParl was not yet widespread.
If we disregard the differences in the mentions of the active politicians in the two
periods, which are the most frequent category, most of the top-ranking keywords in
both corpora belong to procedural and legal issues, which are clearly different in a
newly established state and a state integrated in the EU (see Tables 13 and 14). Apart
from that, many more topics are identified in the Parlameter corpus, such as economy
and technology, foreign affairs and health, which again is not surprising as a well-estab-
lished state will need to take care of a full spectrum of issues.
Table 13: Topics of the 100 top-ranking keywords in Parlameter and SlovParl.
Topic ParlaMeter SlovParl
abbreviation 0 3
defence 0 1
economy & technology 6 2
education 1 0
environment & spatial planning 2 0
finance 12 7
foreign affairs 4 0
health 4 0
multiple 0 1
informal vocabulary 2 0
infrastructure 1 0
interior 2 0
IT vocabulary 2 0
justice 1 0
labour, family & social affairs 3 0
legal/procedural 14 21
politician/party 46 65
Total 100 100
Table 14: 100 top-ranking keywords in Parlameter contrasted against SlovParl and vice
versa.
ParlaMeter evro, eu, desus, smc, cerar, sdh, dutb, möderndorfer, trček, bratušek, sds,
gorenak, spleten, mandatno-volilen, deležnik, koalicijski, kordiš, anja,
matej, direktiva, postopkovno, kpk, okoljski, kohezijski, javnofinančen,
tonin, bdp, veber, naročanje, korupcija, bah, jani, levica, nlb, unija, tanko,
migrantski, povprečnina, vatovec, čakalen, pojbič, migrant, varuhinja,
prikl, žnidar, šircelj, varuh, zujf, teš, violeta, tomić, mahnič, ddv, digitalen,
han, istospolen, lisec, telekom, vrtovec, dars, žibert, novela, globa, zorčič,
vajeništvo, godec, trošarina, čuš, okrožen, internet, prvopodpisan,
schengenski, matić, trajnosten, gašperšič, jurša, podneben, dz, lipica,
lah, podizvajalec, žan, uredba, blagajna, okej, verbič, ferluga, dobovšek,
mramor, računski, vraničar, zakonik, ljudmila, nevladen, postopkoven,
preiskovalen, direktorat, hanžek, muršič, irgl
SlovParl delegat, oz., glavič, družbenopolitičen, gros, dinar, republiški, usklajevalen,
din, skupščinski, starman, zakonjšek, alinea, vzdržati, potrč, vzdržan,
kolešnik, izvršen, lukač, sklepčnost, pintar, npr., navzočnost, buser,
arzenšek, feltrin, atelšek, liberalno-demokratski, smole, razpravljalec, školč,
zvezen, schwarzbartl, delegatski, tomšič, zagožen, železarna, jakič, gošnik,
skupščina, polajnar, tomažič, muren, štefančič, lastninjenje, deviza, zlobec,
šter, demos, dretnik, kreditno-monetaren, sdp, čimprej, nabornik, devizen,
marka, delegatka, sekretariat, bekeš, deželak, klavora, peterle, črnej, halb,
kreft, šonc, lokar, gradišar, šeligo, juri, perko, sfrj, voljč, požarnik, semolič,
volilec, kramarič, bučar, plebiscit, dvornik, tomše, grašič, tolar, starc, pregelj,
podobnik, pozsonec, balažic, g., moge, medzborovski, jaša, razdevšek,
rojec, šetinc, urbančič, lavtižar-bebler, vivod, anka, šešok
To illustrate differences in the zeitgeist of both corpora, we extracted the strongest
collocations of the following 3 expressions, which are frequent in both corpora, tak-
ing into account the collocation candidates that appear at least 5 times immediately
next (left or right) to the headword, and analysed the first 50 collocation candidates:
– adjective južen – southern,
– noun kriza – crisis, and
– verb sprožiti – trigger.
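The extraction criteria described above (collocates immediately to the left or right of the headword, at least 5 co-occurrences, first 50 candidates) can be sketched as follows. The association score is an assumption: the text does not say how collocation strength was measured, so the sketch ranks candidates by logDice, computed as 14 + log2(2·f(x,y) / (f(x) + f(y))). The function name and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def adjacent_collocates(sentences, headword, min_freq=5, top_n=50):
    """Return (collocate, co-occurrence count, logDice) for a headword.

    Only lemmas directly to the left or right of the headword within the
    same sentence are counted; candidates below min_freq are discarded.
    """
    lemma_freq = Counter(lemma for sentence in sentences for lemma in sentence)
    pair_freq = Counter()
    for sentence in sentences:
        for i, lemma in enumerate(sentence):
            if lemma != headword:
                continue
            if i > 0:
                pair_freq[sentence[i - 1]] += 1
            if i + 1 < len(sentence):
                pair_freq[sentence[i + 1]] += 1

    scored = []
    for collocate, f_xy in pair_freq.items():
        if f_xy < min_freq:
            continue
        # logDice association score (an assumed choice for this sketch).
        log_dice = 14 + math.log2(
            2 * f_xy / (lemma_freq[headword] + lemma_freq[collocate])
        )
        scored.append((collocate, f_xy, log_dice))

    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_n]


# Toy lemmatised sentences; a real run would iterate over the corpus.
sentences = (
    [["južen", "meja"]] * 6
    + [["južen", "trg"]] * 2
    + [["meja", "biti", "zaprt"]] * 3
)

result = adjacent_collocates(sentences, "južen")
for collocate, count, score in result:
    print(f"{collocate:8s} {count:3d} {score:.2f}")
```

In the toy data only meja passes the frequency threshold, while trg, which occurs next to the headword twice, is filtered out.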
Table 15: Comparison of collocations of južen, kriza and sprožiti in SlovParl and
ParlaMeter. Topics or morphosyntactic categories are indicated in square brackets, and
new collocations in Parlameter are highlighted in bold.
južen – SlovParl: 178 (14.03 per million)
- [GEOGRAPHY]: koreja, primorska, amerika
- [CONCRETE]: meja, železnica
- [METAPHORICAL]: trg, del, stran, republika
južen – ParlaMeter: 910 (22.20 per million)
- [GEOGRAPHY]: afrika, koreja, sredozemlje, amerika, tirolska, sudan, tirolec, koroška,
italija, evropa, nemčija, slovenija
- [CONCRETE]: meja, obvoznica, tok, sadje, odsek, železnica, ulica
- [METAPHORICAL]: sosedstvo, soseda, sosed, soseščina, del, trg, projekt, stran, država,
republika
sprožiti – SlovParl: 548 (43.19 per million)
- [CONCRETE]: spor, postopek, proces, interpelacijo, arbitražo
- [METAPHORICAL]: reakcijo, polemiko, akcijo, mehanizem, pobudo, vprašanje, diskusijo,
zahtevo, spremembo, razpravo, zadevo
sprožiti – ParlaMeter: 1,569 (38.28 per million)
- [CONCRETE]: postopek, spor, preiskavo, alarm, proces, ovadbo, tožbo, stečaj, prijavo,
revizijo
- [METAPHORICAL]: plaz, mehanizem, polemiko, reakcijo, kepo, pobudo, akcijo, iniciativo,
aktivnost, debato, kampanjo
kriza – SlovParl: 1,114 (87.79 per million)
- [GEOGRAPHY]: jugoslovanska, zalivska kriza
- [POLITICS]: vladna, gospodarska, parlamentarna, ekonomska, ustavna, politična kriza
- [METAPHORICAL]: duševna, socialna, razvojna, družbena kriza
- [MODIFIERS]: huda, moralna, globoka, katastrofalna, velika, težka kriza
- [NOUNS]: reševanje, razrešitev, rešitev, razplet, razreševanje krize
- [VERBS]: prebroditi, poglabljati, razrešiti, povzročiti, rešiti, začeti krizo
kriza – ParlaMeter: 8,062 (196.69 per million)
- [GEOGRAPHY]: ukrajinska, grška, svetovna, globalna kriza
- [POLITICS]: migrantska, begunska, gospodarska, finančna, migracijska, humanitarna,
ekonomska, dolžniška, bančna, politična, begunsko-migrantska, mlečna, javnofinančna,
varnostna, kapitalistična kriza
- [METAPHORICAL]: socialna kriza
- [MODIFIERS]: huda, kompleksna, globoka, velika kriza
- [NOUNS]: začetek, breme, izbruh, nastop, posledica, nastanek, reševanje, obdobje krize
- [VERBS]: kriza nastopi, nastane, pokaže, udari // povzročiti, reševati, poglabljati krizo
As can be seen from Table 15, the biggest difference in relative frequency between
the two corpora is observed for the noun crisis, which is more than twice as frequent
in Parlameter as in SlovParl, despite the fact that the early 1990s were marked
by a long and bloody war in the Balkans as well as severe economic hardship related to
the change of the economic and political system. For this noun, Parlameter also contains the
largest number of new collocation candidates, which indicate issues that were not present in the
period of SlovParl, such as migrant/refugee/humanitarian/security crisis. On the other hand,
the secession period was marked by constitutional and parliamentary crises, which are not
observed in the late 2010s. Interestingly, SlovParl contains more metaphorical collocations
which are not prominent in the Parlameter corpus, such as mental/social/welfare/moral
crisis. Collocations containing geographical terms indicate the key political,
military and social hotspots of each period: the Yugoslav/Gulf crisis in the early 1990s, and
the Ukrainian/Greek crisis in the late 2010s. An analysis of key verbal collocates of the noun
crisis reveals another interesting observation: in SlovParl, the verbs are mostly
about solving the crisis (to solve/resolve/untangle the crisis), whereas in Parlameter,
politicians mostly use verbs that describe the onset or deepening of the crisis (crisis
sets in/appears/starts/hits, to trigger/deepen the crisis).
The verb trigger is the only one of the three examples that has a higher relative
frequency in SlovParl. Despite this, Parlameter contains
more collocation candidates, both in the direct and the metaphorical sense, such as
trigger an investigation/indictment/lawsuit, or trigger an audit/bankruptcy.
It is interesting to note that the adjective southern is used more frequently and
has more collocations in general in ParlaMeter despite the fact that in the secession
period, links to the rest of former Yugoslavia were probably stronger and there were
probably more open issues. This may signal that certain topics were deliberately not discussed
until the issues were resolved and relations were re-established.
Especially interesting are the neighbour-related collocations, which only appear
in the Parlameter corpus, 30 years after Slovenia left Yugoslavia: southern neighbour /
neighbours / neighbourhood / market / fruit, despite the fact that, geographically speaking,
the former Yugoslav republics lie south-east, not south, of Slovenia. The one
major unsettled issue is the border with Croatia, which was even the subject of international
arbitration during the parliamentary term included in the Parlameter corpus,
which is reflected in the top-ranking strong collocation južna meja/southern border.
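The frequency comparisons in this section rest on per-million normalisation, which makes counts from corpora of different sizes comparable. The sketch below recomputes the ParlaMeter-to-SlovParl ratios from the figures in Table 15 and shows how a corpus size can be backed out of a reported count and rate. The names are hypothetical, and the implied sizes are derived values for illustration, not figures reported by the authors.

```python
# Absolute counts and per-million rates as reported in Table 15.
TABLE_15 = {
    "južen": {"SlovParl": (178, 14.03), "ParlaMeter": (910, 22.20)},
    "sprožiti": {"SlovParl": (548, 43.19), "ParlaMeter": (1569, 38.28)},
    "kriza": {"SlovParl": (1114, 87.79), "ParlaMeter": (8062, 196.69)},
}


def per_million(count, corpus_size):
    """Relative frequency: occurrences per one million tokens."""
    return count / corpus_size * 1_000_000


def implied_corpus_size(count, rate_per_million):
    """Corpus size implied by an absolute count and its per-million rate."""
    return count / rate_per_million * 1_000_000


# How many times more (or less) frequent each word is in ParlaMeter.
ratios = {
    word: rates["ParlaMeter"][1] / rates["SlovParl"][1]
    for word, rates in TABLE_15.items()
}

for word, ratio in ratios.items():
    print(f"{word:9s} ParlaMeter/SlovParl = {ratio:.2f}")

# Derived, illustrative value: size of SlovParl implied by the južen row.
slovparl_size = implied_corpus_size(*TABLE_15["južen"]["SlovParl"])
print(f"implied SlovParl size: {slovparl_size:,.0f} tokens")
```

A ratio above 1 means the word is relatively more frequent in ParlaMeter; the ratio for sprožiti falls below 1, in line with the observation that it is the only one of the three expressions with a higher relative frequency in SlovParl.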
Conclusions
In this paper we presented the Parlameter corpus of contemporary Slovene parlia-
mentary proceedings. We analysed the linguistic production of the speakers according
to the morphosyntactic annotation of the corpus and the speaker metadata.
We have shown that despite the fact that the material included in the corpus spans
the period 2014–2018, the bulk of the material was recorded in the first two full years
of the parliament. When contrasted against general Slovene, parliamentary speeches
contain more present tense forms and personal and demonstrative pronouns. A com-
parison of male and female speakers shows that while male speakers take the floor
more often than their female colleagues, it is the female speakers who make longer
contributions. Female speakers mostly address the topics of health, labour, family and
social affairs, public administration, and education, science and sport, while most of the
keywords from male speakers do not belong to specific topics, which indicates a more
discursive, debating style of the male speakers. When comparing speeches according
to party lines, the most prolific deputy group is the largest opposition party Slovenian
Democratic Party (SDS), while the ruling Party of Modern Centre (SMC) is the least
productive one relative to its number of speakers. The most productive parties in terms of
the relative token-to-speaker ratio are two of the smallest parties in this parliamentary
term, the Left (Levica) and New Slovenia (NSi).
The largest opposition party SDS stands out from the rest by the large number of ideological
keywords, while Levica stands out by signature stylistic devices which range
from very informal to highly elevated. NSi and Levica, the opposition parties with
the same number of MPs but from the opposite ends of the political spectrum, both
address the widest variety of issues. With keywords belonging almost exclusively to
the semantic field of retirement and pension, DeSUS lies on the other end of the spectrum
as a single-issue party. In a comparison with the SlovParl corpus of parliamentary
debates from the period of Slovenia’s independence, many more topics are identified
in Parlameter, which is understandable as a well-established state will need to take care
of a full spectrum of issues whereas a new state will mostly be dealing with procedural
issues and the new legislature. In the future we plan to enrich the corpus with addi-
tional session records of previous and the most recent parliamentary terms as well as
with additional metadata available through the Parlameter system, such as voting data
and accepted legislation, which are also valuable for addressing a number of research
questions in various research communities. In parallel, we also plan to develop com-
parable corpora from other parliaments, starting with Croatian and Bosnian.
Acknowledgments
The work described in this paper was funded by the Slovenian Research Agency
within the national basic research project “Resources, methods, and tools for the
understanding, identification, and classification of various forms of socially unacceptable
discourse in the information society” (J7-8280, 2017–2019) and the Slovenian
research infrastructure for language resources and technology CLARIN.SI.
Sources and Literature
Literature:
• Bayley, Paul. 2014. “Introduction: The Whys and Wherefores of Analyzing Parliamentary
Discourse.” In Cross-Cultural Perspectives on Parliamentary Discourse, edited by Paul Bayley, 1–44.
Amsterdam, Philadelphia: John Benjamins Publishing.
• Cheng, Jennifer E. 2015. “Islamophobia, Muslimophobia or Racism? Parliamentary discourses on
Islam and Muslims in Debates on the Minaret Ban in Switzerland.” Discourse & Society 26 (5):
562–86.
• Chester, Daniel Norman, and Nona Bowring. 1962. Questions in Parliament. Oxford: Clarendon
Press.
• van Dijk, Teun A. 2010. “Political Identities in Parliamentary Debates.” In European Parliaments
Under Scrutiny: Discourse Strategies and Interaction Practices, edited by Cornelia Ilie, 29–56.
Amsterdam, Philadelphia: John Benjamins Publishing.
• Fišer, Darja, and Jakob Lenardič. 2018. “Parliamentary Corpora in the CLARIN Infrastructure.”
In Selected Papers from the CLARIN Annual Conference 2017, edited by Maciej Piasecki, 75–85.
Accessed February 27, 2019. http://www.ep.liu.se/ecp/147/007/ecp17147007.pdf.
• Fišer, Darja, and Vojko Gorjanc. 2013. Korpusna analiza. Ljubljana: Znanstvena založba Filozofske
Fakultete.
• Fišer, Darja, Nikola Ljubešić, and Tomaž Erjavec. 2018. “The Janes Project: Language Resources
and Tools for Slovene user Generated Content.” Language Resources and Evaluation. In press.
https://doi.org/10.1007/s10579-018-9425-z.
• Franklin, Mark N., and Philip Norton. 1993. Parliamentary Questions: For the Study of Parliament
Group. Oxford: Oxford University Press.
• Hirst, Graeme, Vanessa Wei Feng, Christopher Cochrane, and Nona Naderi. 2014. “Argumentation,
Ideology, and Issue Framing in Parliamentary Discourse.” In ArgNLP. Accessed February 27, 2019.
ftp://www.cs.toronto.edu/pub/gh/Hirst-etal-Bertinoro-2014.pdf.
• Hughes, Lorna M., Paul S. Ell, Gareth A.G. Knight, and Milena Dobreva. 2013. “Assessing
and Measuring Impact of a Digital Collection in the Humanities: An Analysis of the SPHERE
(Stormont Parliamentary Hansards: Embedded in Research and Education) Project.” Digital
Scholarship in the Humanities 30 (2): 183–98.
• Ihalainen, Pasi, Cornelia Ilie, and Kari Palonen. 2016. Parliament and Parliamentarism: A
Comparative History of a European Concept. Oxford, New York: Berghahn Books.
• Ilie, Cornelia. 2017. “Parliamentary Debates.” In The Routledge Handbook of Language and Politics,
edited by Ruth Wodak and Bernhard Forchtner. Routledge.
• Ljubešić, Nikola, and Tomaž Erjavec. 2016. “Corpus vs. Lexicon Supervision in Morphosyntactic
Tagging: The Case of Slovene.” In Proceedings of the Tenth International Conference on Language
Resources and Evaluation (LREC 2016), edited by Nicoletta Calzolari, Khalid Choukri, Thierry
Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion
Moreno, Jan Odijk, and Stelios Piperidis, 1527–31. Accessed February 27, 2019. http://www.lrec-
conf.org/proceedings/lrec2016/pdf/811_Paper.pdf.
• Ljubešić, Nikola, Tomaž Erjavec, Darja Fišer, Tanja Samardžić, Maja Miličević, Filip Klubička,
and Filip Petkovski. 2016. “Easily Accessible Language Technologies for Slovene, Croatian and
Serbian.” In Proceedings of the Conference on Language Technologies and Digital Humanities 2016,
edited by Tomaž Erjavec and Darja Fišer, 120–24. Accessed February 27, 2019. http://www.sdjt.
si/wp/wp-content/uploads/2016/09/JTDH-2016_Ljubesic-et-al_Easily-Accessible-Language-
Technologies.pdf.
• Ljubešić, Nikola, Darja Fišer, Tomaž Erjavec, and Filip Dobranić. 2018. “The Parlameter corpus
of contemporary Slovene parliamentary proceedings.” In Proceedings of the Conference on Language
Technologies and Digital Humanities 2018, edited by Darja Fišer and Andrej Pančur, 162–167.
Accessed June 12, 2019. http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_
Ljubesic-et-al_The-Parlameter-corpus-of-contemporary-Slovene-parliamentary-proceedings.pdf.
• Pančur, Andrej, and Mojca Šorn. 2016. “Smart Big Data: Use of Slovenian Parliamentary Papers in
Digital History.” Prispevki za novejšo zgodovino 56 (3): 130–46.
• Pančur, Andrej. 2016. “Označevanje zbirke zapisnikov sej slovenskega parlamenta s smernicami
TEI.” In Proceedings of the Conference on Language Technologies and Digital Humanities 2016, edited
by Tomaž Erjavec and Darja Fišer, 142–48. Accessed February 27, 2019. http://www.sdjt.si/
wp/wp-content/uploads/2016/09/JTDH-2016_Pancur_Oznacevanje-zbirke-zapisnikov-sej-
slovenskega-parlamenta.pdf.
• Rheault, Ludovic, Kaspar Beelen, Christopher Cochrane, and Graeme Hirst. 2016. “Measuring
Emotion in Parliamentary Debates with Automated Textual Analysis.” PLoS ONE 11 (12): 1–18.
• TEI Consortium. 2017. Guidelines for Electronic Text Encoding and Interchange. Accessed
February 27, 2019. http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html.
Sources:
• Dobranić, Filip, Nikola Ljubešić, and Tomaž Erjavec. 2019. Slovenian Parliamentary Corpus
ParlaMeter-sl 1.0, Slovenian Language Resource Repository CLARIN.SI. http://hdl.handle.
net/11356/1208.
• Pančur, Andrej, Mojca Šorn, and Tomaž Erjavec. 2017. Slovenian Parliamentary Corpus SlovParl
2.0, Slovenian Language Resource Repository CLARIN.SI. http://hdl.handle.net/11356/1167.
Darja Fišer, Nikola Ljubešić, Tomaž Erjavec
PARLAMETER – A CORPUS OF CONTEMPORARY
SLOVENE PARLIAMENTARY PROCEEDINGS
SUMMARY
The unique content, structure and language, as well as the availability of records of
parliamentary debates are all factors that make them an important object of study in a
wide range of disciplines in digital humanities and social sciences. This has motivated a
number of national as well as international initiatives to compile, process and analyse
parliamentary corpora. This paper presents the Parlameter corpus of contemporary
Slovene parliamentary proceedings, which covers the VIIth mandate of the Slovene
Parliament (2014–2018). The Parlameter corpus offers rich speaker metadata (gen-
der, age, education, party affiliation) and is linguistically annotated (lemmatization,
tagging, named entity recognition).
The Parlameter corpus contains 371 sessions and 1,981 speakers who gave
133,287 speeches which contain almost 35 million words. In the paper we demon-
strate the potential of the corpus analysis techniques for investigating political debates
by analysing the linguistic production of the speakers according to the morphosyn-
tactic annotation of the corpus and the speaker metadata. When contrasted against
general Slovene, parliamentary speeches contain more present tense forms and per-
sonal and demonstrative pronouns. While male speakers take the floor more often
than their female colleagues, the female speakers’ contributions tend to be longer.
Female speakers mostly address the topics of health, labour, family and social affairs,
public administration, and education, science and sport, while most of the keywords
from male speakers do not belong to specific topics, which indicates a more discursive,
debating style of the male speakers. The most prolific deputy group overall is
the largest opposition party Slovenian Democratic Party (SDS) while the then ruling
Party of Modern Centre (SMC) is the least prolific. The most productive parties with
a relative token to speaker ratio are the smallest parties in that parliamentary term, the
Left (Levica) and New Slovenia (NSi). The largest opposition party SDS stands out
from the rest by the large amount of ideological keywords while Levica stands out by
signature stylistic devices which range from very informal to highly elevated. NSi and
Levica, the opposition parties with the same number of MPs but from the opposite
ends of the political spectrum, both address the widest variety of issues. With key-
words belonging almost exclusively to the semantic field of retirement and pension,
DeSUS lies on the other end of the spectrum as a single-issue party. A comparison
with the SlovParl corpus of parliamentary debates from the period of Slovenia’s
independence shows that many more topics are identified in Parlameter, which is
understandable, as a well-established state will need to take care of a full spectrum of
issues whereas a new state will mostly be dealing with procedural issues and the new legislature.
The Parlameter corpus is available through both CLARIN.SI concordancers, as
well as for download from its repository, both as a TEI document and in the simpler
vertical file format, under the liberal Creative Commons – Attribution-ShareAlike
(CC BY-SA 4.0) licence. The corpus architecture allows for regular extensions of the
corpus with additional Slovene data, as well as data from other parliaments, starting
with Croatian and Bosnian.
Darja Fišer, Nikola Ljubešić, Tomaž Erjavec
PARLAMETER – KORPUS RAZPRAV SLOVENSKEGA
DRŽAVNEGA ZBORA
POVZETEK
Edinstvena vsebina, struktura in jezik, pa tudi dostopnost prepisov parlamentar-
nih razprav so dejavniki, zaradi katerih so le-ti pomemben predmet raziskav v števil-
nih znanstvenih disciplinah digitalne humanistike in družboslovja. To je motiviralo
številne nacionalne in mednarodne iniciative za izgradnjo, označevanje in analizo par-
lamentarnih korpusov. V tem prispevku predstavimo korpus sodobnih parlamentar-
nih razprav Parlameter, ki vsebuje razprave 7. mandata slovenskega Državnega zbora
(2014–2018). Korpus Parlameter vsebuje bogate metapodatke o govorcih (spol,
starost, izobrazba, strankarska pripadnost) in je jezikoslovno označen (lematizacija,
tegiranje, imenske entitete).
Korpus Parlameter vsebuje 371 razprav in 1.981 govorcev, ki so prispevali 133.287
govorov oziroma 35 milijonov besed. V prispevku prikažemo potencial korpusno-
analitičnih tehnik za raziskovanje političnih razprav z analizo jezikovne produkcije
govorcev glede na morfosintaktične oznake in metapodatke o govorcih. Primerjava s
splošno slovenščino pokaže, da v parlamentarnih govorih izstopajo sedanjiške oblike
ter osebni in kazalni zaimki. Čeprav moški govorci spregovorijo večkrat, so govori
žensk daljši. Ženske večinoma razpravljajo o temah, kot so zdravje, delo, družina
in sociala, javna uprava ter izobraževanje, znanost in šport, večina ključnih besed v
moških govorih pa ni vezanih na določeno tematiko, kar nakazuje bolj diskurziven, raz-
pravljalski slog moških govorcev. V celoti gledano je najbolj produktivna strankarska
skupina največja opozicijska stranka SDS, medtem ko je vladajoča stranka SMC v kor-
pusu zastopana z najmanj izrečenimi besedami. Najvišji relativni delež števila pojavnic
na govorca imata najmanjši parlamentarni stranki tega sklica Levica in NSi. Največja
opozicijska stranka SDS izstopa po izrazito velikem obsegu ideološko obarvanih ključ-
nih besed, Levica pa po specifičnih slogovnih figurah, ki so tako zelo neformalne kot
zelo povzdignjene. NSi in Levica, opozicijski stranki z enakim številom poslancev a s
povsem različnih polov političnega spektra, obe naslavljata največje število tematik.
Po drugi strani pa s ključnimi besedami, ki skoraj v celoti spadajo v pomensko polje
upokojevanja in pokojnin, pa je povsem obratno pri stranki DeSUS, ki s tem utrjuje
svoj status problemske stranke. Primerjava s korpusom SlovParl iz obdobja slovenske
osamosvojitve kaže, da je v korpusu Parlameter obravnavanih veliko več tem kot v
korpusu SlovParl, kar je razumljivo, saj se mora uveljavljena država ukvarjati s celotnim
spektrom problematik, medtem ko se novo ustanovljena država posveča predvsem
proceduralnim vprašanjem in sprejemanju nove zakonodaje.
Korpus Parlameter je dostopen preko obeh konkordančnikov v okviru raziskovalne
infrastrukture CLARIN.SI, prav tako pa ga je mogoče prenesti z repozitorija
v format TEI, pa tudi v preprostejšem vertikalnem formatu pod licenco Creative
Commons – Attribution-ShareAlike (CC BY-SA 4.0). Korpusna arhitektura je zasno-
vana tako, da omogoča širitev korpusa na druga časovna obdobja, prav tako pa tudi
vključevanje gradiv drugih parlamentov, začenši s hrvaškim in bosanskim.
1.01 UDC: 003.295:821.163.6‘367.625
Polona Gantar,* Špela Arhar Holdt,** Jaka Čibej,***
Taja Kuzman****
Structural and Semantic
Classification of Verbal Multi-Word
Expressions in Slovene
IZVLEČEK
STRUKTURNA IN POMENSKA KLASIFIKACIJA GLAGOLSKIH
VEČBESEDNIH ENOT V SLOVENŠČINI
Prispevek je nadgrajena različica konferenčnega prispevka, v katerem predstavljamo
kategorije glagolskih večbesednih enot (GVBE), kot so bile oblikovane v okviru mednarodne
COST akcije PARSEME Shared Task 1.1. S kategorijami, ki so nadjezikovne in obenem
prilagojene posameznim vključenim jezikom, smo označili 13.511 povedi učnega korpusa
ssj500k 2.0. Rezultat označevanja je 3.364 identificiranih večbesednih glagolskih enot, ki
so klasificirane kot: inherentno povratni glagoli, zveze z glagoli v pomensko oslabljeni rabi,
predložnomorfemski glagoli in glagolski idiomi. V prispevku rezultate označevanja predsta-
vimo kvantitativno in kvalitativno, pri čemer sopostavimo predlagani sistem klasifikacije ob
obstoječe prakse na področju slovenistične obravnave GVBE in ocenimo uporabnost sistema
za nadaljnje delo.
Ključne besede: glagolske zveze, korpusni pristop, večbesedne enote, PARSEME,
slovenščina
* Department of Translation, Faculty of Arts, University of Ljubljana, Aškerčeva 2, SI-1000 Ljubljana, apolonija.
gantar@guest.arnes.si
** CJVT, Faculty of Computer and Information Science, University of Ljubljana, Večna pot 113, SI-1000 Ljubljana,
spela.arhar@cjvt.si
*** Artificial Intelligence Laboratory, Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, jaka.cibej@ijs.si
**** kuzman.taja@gmail.com
ABSTRACT
This paper is an extended version of a conference paper presenting the categorization
of verbal multi-word expressions (VMWEs) according to the PARSEME COST Action
Shared Task 1.1 Guidelines. The categorization is universal but takes into account the cha-
racteristics of the individual languages included in it. The Shared Task was used to annotate
over 13,000 sentences of the Slovene ssj500k 2.0 training corpus, which resulted in nearly
3,400 identified VMWEs categorized as inherently reflexive verbs, light verb constructions,
inherently adpositional verbs, and verbal idioms. The paper presents both the quantitative
and qualitative results of the analysis, compares the suggested categorization system to exi-
sting work on VMWEs in Slovene linguistics, and evaluates the use of the proposed system
for future work.
Keywords: verb phrases, corpus approach, multi-word expressions, PARSEME, Slovene
Introduction
In the digital medium, the bulk of interactions between users – as well as between
users and computers or applications – occur with the use of language, which is why the
existence and open accessibility of digital language infrastructure is of vital importance
to the development and vitality of a language. Slovene is no exception in this regard; it
requires an infrastructure that serves as a source of information for the language com-
munity as well as a resource to be used in applied/theoretical linguistic research and
the development of new language technology tools, methods, and services. Examples
of such infrastructure include digital language resources that allow for continued
updates and contributions from the community, language databases with structured
and machine-readable data, and training corpora in which authentic texts are anno-
tated with different linguistic categories. In this regard, digital lexicography, whose aim
is to prepare the dictionary part of this language infrastructure, plays an essential role
in the field of digital humanities.
In the field of digital lexicography, multi-word expressions (MWEs) are consid-
ered important for constructing machine-readable language resources that in turn
enable the compilation of electronic MWE lexicons and the development of language
technology tools for a specific language. In order to achieve these goals, it is crucial
to know the linguistic features of different types of MWEs and develop methods and
standards for their identification in authentic language use.
However, this is not a trivial task. Definitions and categorisations of MWEs differ
according to their methodological and theoretical basis and research goals.1 A lexi-
cographic perspective focuses on the semantic characteristics of MWEs and defines
1 For an overview of MWE classifications according to different methodological approaches, see Gantar et al. (2018).
them as “different types of phrases that demonstrate a certain degree of idiomatic
meaning” (Atkins and Rundell 2008, 166) or as phrases whose “exact meaning is not
directly obtained from its component parts” (Sag et al. 2002). On the other hand,
the definition of MWEs from the perspective of machine processing emphasises their
statistical significance: “a group of tokens in a sentence that cohere more strongly than
ordinary syntactic combinations. That is, they are idiosyncratic in form, function, or
frequency” (Schneider et al. 2014) and their inability to be split into independent
lexemes and at the same time maintain their semantic and syntactic functions, as well
as their lexical, syntactic, semantic, pragmatic and statistical idiomaticity (Baldwin and
Kim 2010, 3). Although no universally accepted definition of MWEs exists, research-
ers in linguistics and NLP both agree that the key feature separating MWEs from free
phrases is the special relationship between the elements that form the MWE. This rela-
tion is usually defined using such concepts as collocability (or statistical idiomaticity),
idiomaticity (or semantic non-compositionality), syntactic (in)flexibility, which also
includes the possibility of an internal modification of the phrase and the flexible order
of its lexicalised elements, and lexical variance.
An attempt to provide the much needed guidelines and a pilot study on the
annotation of MWEs in language corpora was made as part of the PARSEME COST
Action Shared Task 1.1.2 The task focused on the automatic identification of verbal
multi-word expressions (VMWEs) in running text. As part of the task, universal guide-
lines for VMWE classification were compiled containing examples for all languages
involved. Based on the guidelines, a multi-lingual corpus was manually annotated with
VMWEs and made available under a Creative Commons licence.
While the categories of MWEs were designed as language-independent, the spe-
cific characteristics of all the included languages had to be taken into account to reach
a universally applicable solution. In this paper, we focus on the Slovene results, which
will be useful when compiling a digital lexicon of Slovene MWEs, as well as other lan-
guage resources such as the Dictionary of Modern Slovene (Gorjanc et al. 2017) and
a corpus-based grammar of Slovene. The topic was presented in Gantar et al. (2018)
with a focus on MWEs and their theoretical framework in Slovene studies. This paper
focuses on MWEs from the perspective of a unified concept that was applied to 20
different languages within the PARSEME Shared Task 1.1. A comparison of the results
can be found in Ramisch et al. (2018).
Identifying and Categorizing Verbal Multi-Word
Expressions
The verb plays a crucial role in the sentence in terms of co-text organization,
which is why the PARSEME Shared Task focused on verbal multi-word expressions
2 Home – PARSEME, http://www.parseme.eu.
(VMWEs). For further analysis, it is crucial to determine the differences between
the definitions and categorizations of VMWEs as established in Slovene studies on
the one hand, and the international PARSEME COST Action on the other. Our task
aims to adopt the international annotation scheme in order to include Slovene. Our
research question focuses on the applicability of the PARSEME system to authentic
Slovene texts. Can the adapted PARSEME categories be applied in practice? Are they
attributable, robust, one-dimensional, and represented in actual language use? What
information do they entail (e.g. in terms of syntax), how can they contribute to the
development of new automatic extraction methods, and finally, which problems arise
when applying the system to text? In the following sections, we present the annotation
method, followed by a quantitative and a qualitative analysis. The latter
focuses on individual categories, their characteristics, and the potential problems of
the approach.
Verbal Multiword Expressions – Slovenian Case
In Slovene studies, MWEs are divided into a) phraseological units (PUs), in which
at least one component carries meaning that differs from one of its denotative “diction-
ary” senses, and expresses figurativeness, and b) all other multiword expressions (i.e.
fixed expressions), which are characterized by a certain degree of fixedness and denote
a meaning that can be predicted from the meanings of their elements. PUs are further
divided by syntactic structure: the clausal type (which also includes proverbs) and
the phrasal type (all non-verbal PUs). In Slovene linguistic theory, verbal MWEs are
determined by their morphosyntactic features (Toporišič 1973/74; Kržišnik 1994):
a MWE is classified as a VMWE if it includes a verbal element and if it functions as a
predicate. However, it remains unclear how to classify examples in which the verbal
MWE does not function as a predicate, e.g. hočeš nočeš ‘like it or not’, which includes
two verbal elements, but functions as an adverbial.
The problem of categorizing MWEs according to their morphological structure
and syntactic function was resolved in the PARSEME Shared Task by defining the
main criterion for VMWEs as follows: their syntactic head in the prototypical
form is a verb, regardless of whether the expression can also fulfil other syntactic
roles. In addition, Slovene categorizations have so far never treated verbs with the se/
si morpheme as a separate MWE category. Phrasal verbs that consist of a verb and a
preposition and carry an independent meaning were categorized as MWEs only con-
ditionally (Kržišnik 1994, 58).
Verbal Multiword Expressions within the PARSEME Shared Task 1.1
For the categorization of VMWEs within the PARSEME Shared Task 1.1, exhaustive
guidelines3 were prepared in which the VMWE categories are defined by semantic
and syntactic features and are described with decision trees. The identification and
categorization process consisted of three steps. In the first step, we identified potential
VMWEs consisting of a verb as the syntactic head of the phrase and at least one other
word. In the second step, we identified the lexicalised elements within the phrase. In
the third step, we used detailed linguistic tests consisting of generic and specific lan-
guage criteria to determine the correct category of the identified VMWE.
Based on the guidelines, VMWEs are further divided into two classes based on
whether the category can be applied to the majority of languages included in the
task, or whether they are typical of individual (groups of) languages. The universal
categories include verbal idioms (VID) and light verb constructions (LVC), which
are further divided into full (LVC.full) and causal (LVC.cause). The quasi-universal
categories, which are used within individual groups of languages, include inherently
reflexive verbs (IRV), which are typical of most Slavic languages, and verb-particle
constructions (VPC), typical of Germanic languages. In the second version of the
guidelines, an additional quasi-universal category was added: inherently adpositional
verbs (IAV), which require an open syntactic slot and are typical of Slovene and sev-
eral other Slavic languages.
For Slovene, examples of VMWEs can be found for all the categories except for
VPC. For certain categories, however, there are specific characteristics based on syn-
tactic or morphological features of Slovene or on grammatical categories that are gen-
erally accepted in Slovene but differ to some extent from other languages. The specific
Slovene features will be described along with individual VMWE types.
The Corpus and Annotation Tool
VMWEs were annotated in the Slovene ssj500k 2.0 training corpus (Krek et al.
2017), which consists of approximately 500,000 tokens and just under 28,000 sen-
tences sampled from the FidaPLUS corpus of Slovene (Arhar Holdt and Gorjanc
2007). The entire corpus is morphosyntactically tagged (Grčar et al. 2012). Certain
portions also contain named-entity annotations and syntactic dependencies
(Dobrovoljc et al. 2012). In the first annotation phase, 11,411 sentences were anno-
tated by two annotators with VMWEs as defined by the first version of the PARSEME
Guidelines (Candito et al. 2016). Disagreements in annotation were discussed and
adjusted accordingly. In the second phase, the categories were automatically modified
based on the second version of the PARSEME Guidelines and manually checked. The
3 PARSEME Shared Task 1.1 - Annotation guidelines, http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/.
second phase continued with the annotation of an additional 2,100 sentences anno-
tated in packages by individual annotators. Problematic examples were discussed and
correctly annotated.
The tool used for annotation in the first phase was SentenceMarkup System
(Figure 1), a custom tool primarily developed for syntactic dependency annotation
of Slovene texts (Dobrovoljc et al. 2012). The tool was adjusted for the annotation of
VMWEs by adding an additional independent and interconnectable annotation layer
(cf. Gantar et al. 2017).
Figure 1: Annotations in the SentenceMarkup System
In the second phase, the annotation took place in the FLAT annotation plat-
form (FoLiA Linguistic Annotation Tool), which was adapted for the purposes of
the PARSEME Shared Task and tested on 13 collaborating languages (Figure 2). The
FLAT platform allows text strings to be annotated with a set of categories. Files can
be assigned to different annotators. The supported formats are XML and TSV, while
annotated files are exported in XML. All annotations are saved automatically. The
interface also features text search using CQL.
Figure 2: Annotations in FLAT
Quantitative Analysis
The annotated VMWEs were imported into the ssj500k 2.1 training corpus (Krek
et al. 2017). Among the 13,511 sentences annotated in the first two annotation phases,
2,290 of them (approximately 22%) contain at least one VMWE. On average, each
of these sentences features 1.15 VMWEs. Taking into account all the annotated sen-
tences, each sentence contains approximately 0.25 VMWEs; in other words, on aver-
age, one VMWE is present in every fourth sentence.
Table 1 shows the distribution of the annotated VMWEs by category. The final
number of VMWEs in the training corpus is 3,364. The number of different VMWEs
(i.e. without any repetitions of the same unit) was just under 1,100. When looking
at absolute frequencies, the most frequent category is IRV (48%) and the least fre-
quent category is LVC.cause (2%). The categories with the highest number of dif-
ferent VMWEs are VID and IAV, while LVC.full and LVC.cause are the least diverse
categories. We describe each category in more detail in section 5.
Table 1: Distribution of annotated VMWEs by category
Category | Example | Translation | All VMWEs | Percent | Different VMWEs
Inherently Reflexive Verbs (IRV) | bati se | to be afraid | 1,627 | 48% | 345
Inherently Adpositional Verbs (IAV) | priti do | to come about | 710 | 21% | 154
Verbal Idioms (VID) | spati kot ubit | (lit.) to sleep like a dead person | 724 | 22% | 457
Light Verb Constructions (LVC): LVC.cause | spraviti koga v smeh | to make someone laugh | 64 | 2% | 27
Light Verb Constructions (LVC): LVC.full | biti v pomoč | to be of help | 239 | 7% | 103
Total | - | - | 3,364 | 100% | 1,086
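The category shares and the overall VMWE density can be recomputed directly from the counts in Table 1. The following is a minimal Python sketch of that arithmetic; the variable names are illustrative and are not taken from the original study.

```python
# Counts of all annotated VMWEs per category, copied from Table 1.
counts = {
    "IRV": 1627,
    "IAV": 710,
    "VID": 724,
    "LVC.cause": 64,
    "LVC.full": 239,
}
# Number of sentences annotated in the two annotation phases.
annotated_sentences = 13511

total = sum(counts.values())

# Share of each category among all annotated VMWEs, rounded to whole percent.
shares = {cat: round(100 * n / total) for cat, n in counts.items()}

# Average number of VMWEs per annotated sentence.
density = total / annotated_sentences

for cat, pct in shares.items():
    print(f"{cat}: {pct}%")
print(f"VMWEs per sentence: {density:.2f}")
```

Rounded to two decimals, the density is 0.25, which corresponds to roughly one VMWE in every fourth sentence.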
Table 2 shows the most common VMWE structures by parts of speech (V – verb,
N – noun, A – adjective, R – adverb, Pre – preposition, Pro – pronoun). The structures
occurring in the corpus with a frequency below 10 have been categorized as Other. The
most frequent structures are V + Pro, V + Pre, V + N and V + Pre + N. Collectively,
they account for approximately 85% of all annotated VMWEs.
Table 2: Distribution of annotated VMWEs by part-of-speech structure
Structure | Example | Translation | Frequency | Percent
V + Pro | bati se | to be afraid | 1,663 | 49%
V + Pre | priti do | to come about | 535 | 16%
V + N | imeti odnos | to have a relationship | 372 | 11%
V + Pre + N | biti pod vtisom | to be under the impression | 303 | 9%
V + Pro + A | biti si edini | to be unanimous | 146 | 4%
V + R | biti res | to be true | 136 | 4%
V + Pro + Pre + N | ujeti se v past | to get caught in a trap | 24 | 1%
V + A | biti jasno | to be clear | 20 | 1%
V + A + N | imeti glavno besedo | to have the last word | 19 | 1%
N + V + Pre + N | biti na robu propada | to be on the verge of collapse | 12 | <1%
V + Pro + N | vzeti si čas | to take one's time | 11 | <1%
Other | - | - | 123 | 4%
Total | - | - | 3,364 | 100%
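The grouping of rare structures into Other and the combined share of the four most frequent structures can be illustrated with a short Python sketch. The helper names `bucket_rare` and `top_share` are hypothetical, as are the toy tallies used to show the bucketing step; only the frequencies in `table2` come from Table 2.

```python
from collections import Counter


def bucket_rare(freqs, threshold=10):
    """Merge every structure whose frequency is below `threshold` into 'Other'."""
    bucketed = Counter()
    for structure, n in freqs.items():
        bucketed[structure if n >= threshold else "Other"] += n
    return bucketed


def top_share(freqs, k, total):
    """Share of `total` covered by the k most frequent named structures."""
    named = [n for s, n in freqs.items() if s != "Other"]
    return sum(sorted(named, reverse=True)[:k]) / total


# Frequencies copied from Table 2.
table2 = {
    "V + Pro": 1663, "V + Pre": 535, "V + N": 372, "V + Pre + N": 303,
    "V + Pro + A": 146, "V + R": 136, "V + Pro + Pre + N": 24,
    "V + A": 20, "V + A + N": 19, "N + V + Pre + N": 12, "V + Pro + N": 11,
    "Other": 123,
}
total = sum(table2.values())
coverage = top_share(table2, k=4, total=total)
print(f"Top four structures: {coverage:.1%} of {total} VMWEs")

# Hypothetical raw tallies, used only to demonstrate the bucketing step.
toy = bucket_rare({"V + Pro": 40, "V + R + N": 3, "V + N + N": 6})
print(dict(toy))
```

With the Table 2 frequencies, the four most frequent structures cover about 85% of the annotated VMWEs, in line with the figure given in the text.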
Qualitative Analysis
The qualitative analysis deals with the semantic and structural features of VMWEs.
Based on the PARSEME Guidelines, several characteristic features of Slovene were
identified on the level of structural and semantic tests used to determine the category
of VMWEs. In the analysis, we focused on patterns within structures for each sub-
category, the syntactic environment of the expression as a unit, and the lexical units
filling the corresponding participant slots. Based on corpus examples, we also tried to
identify the indicators of semantic integrality that could be useful when automatically
identifying VMWEs in text.
Inherently Reflexive Verbs (IRV)
The PARSEME Shared Task 1.1 guidelines treat verbs occurring with the inde-
pendent morpheme se/si as a separate category of VMWEs called inherently reflexive
verbs. It is a language-specific category that includes phrases in which the verb without
the morpheme se/si does not exist (zdeti se ‘to seem’, *zdeti) or in which the presence
of se/si changes the meaning of the verb (pobrati se ‘to recover’ vs. pobrati ‘to pick up’).
Inherently reflexive verbs cover the largest percentage of VMWEs in the training
corpus (see Table 1). Among the correctly categorized examples (1,621 in total)4 we
identified 339 different IRVs, with the following most frequently occurring verbs: zdeti
se ‘to seem’, odločiti se ‘to decide’, zgoditi se ‘to come to pass’ and pojaviti se ‘to appear’.
To test whether the expression is semantically integral and to differentiate it from
other types of verb phrases with se/si that are not defined as VMWEs, we examined
the behaviour of the verb in terms of its opening up syntactic positions as a phrase.
Inherently reflexive verbs keep se/si as an obligatory verb morpheme in all forms of
their inflectional paradigm and can be transitive (bati se koga/česa ‘to be afraid of smn/
sth’) or intransitive (znajti se ‘to find oneself somewhere’, zvečeriti se ‘to fall [evening]’).
Inherently reflexive verbs as VMWEs must be differentiated from verbs where the
reflexive pronoun se/si is not an obligatory morpheme but serves another function,
more specifically: (a) it denotes mutualness (poljubljati se ‘to kiss [each other]’, srečati
se ‘to encounter [each other]’), (b) it denotes that the target of the action is the subject
(umivati se ‘to wash [oneself]’, praskati se ‘to scratch [oneself]’), or that the action is to
the benefit of the subject (kuhati si ‘to cook [oneself sth]’)), (c) it is used for passivizing
the sentence by removing the agent (kdo ponavlja kaj ‘someone repeats something’ –
kaj se ponavlja ‘something is repeated’), and (d) it denotes a generic action (govori se
‘it is said’; se razume ‘it is understood’).
With verbs that can also occur without se/si, only the phrases where the mor-
pheme changes the verb’s meaning are categorized as IRVs. There are cases in which
4 Among the 1,627 annotated examples, four were miscategorized. In two examples, the elements of the expressions
were incorrectly annotated. These examples were excluded from further analysis.
the presence (or absence) of se/si causes a semantic shift directly tied to a human
subject. In these cases, the verb denotes a metaphorical meaning: pobrati se ‘to recover’ :
pobrati ‘to pick sth up’; gristi se ‘to worry’ : gristi ‘to bite’.
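The IRV criterion described above can be summarised as a simple two-way test. The sketch below is only an illustration under stated assumptions: the `ReflexiveUse` record and the `is_irv` function are hypothetical names, and the two Boolean judgements are assumed to be supplied by a human annotator rather than derived automatically.

```python
from dataclasses import dataclass


@dataclass
class ReflexiveUse:
    """One verb + se/si combination, with two annotator judgements (hypothetical record)."""
    verb: str
    bare_verb_exists: bool       # does the verb occur without se/si at all?
    meaning_changes: bool        # does se/si change the meaning of the verb?


def is_irv(use: ReflexiveUse) -> bool:
    """IRV if the bare verb does not exist, or if se/si changes the verb's meaning."""
    return (not use.bare_verb_exists) or use.meaning_changes


examples = [
    ReflexiveUse("zdeti se", bare_verb_exists=False, meaning_changes=False),
    ReflexiveUse("pobrati se", bare_verb_exists=True, meaning_changes=True),
    ReflexiveUse("umivati se", bare_verb_exists=True, meaning_changes=False),
]
for use in examples:
    print(use.verb, "IRV" if is_irv(use) else "not an IRV")
```

Under these assumptions, zdeti se and pobrati se are classified as IRVs, while umivati se, where se merely marks the subject as the target of the action, is not.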
In Slovene linguistics, lexicalised phrases consisting of a verb and the se/si morpheme
have so far not been treated as fixed expressions. The main focus has been
on recognizing the function of the morpheme or the reflexive pronoun in terms of
denoting different degrees of agentness or the subject’s (un)involvedness, as in the case
of the non-singular (zbrati se ‘to gather’) or generic agent (tiskati se ‘to be printed’)
(Žele 2012, 44; Toporišič 1982, 244; 2000, 503). Identifying IRVs in text
on the basis of their semantic and syntactic features is particularly important for the
automatic identification of MWEs. In future lexicons and dictionaries, IRVs should
thus be treated either as independent entries or as part of polysemy.
Light Verb Constructions (LVC)
Light verb constructions have been treated from different perspectives by different
authors (for an overview, see Soršak 2013). In most definitions, the verbs in LVCs are
categorized as something between full verbs and auxiliary verbs, while the expressions
that feature them are categorized as a phenomenon between fixed and free expressions.
Using existing typologies for Slovene (Toporišič 2000; Žele 1999), Soršak analyzes
Slovene LVCs based on the entries in the Dictionary of Standard Slovene (SSKJ). The
results highlight that the dictionary often mentions the semantically light use of a verb
in places where the use is stylistically marked, most frequently as expressive (Soršak
2013, 514; e.g. groza ga sprehaja, lit. ‘terror is walking him’). The results described in
this paper show the opposite – in the annotated corpus, LVCs are typical, stylistically
neutral, and frequently occurring.
As per the PARSEME Guidelines, a LVC must fulfil the following conditions:
it consists of a verb and a noun or a noun phrase that can also take the form of a
prepositional phrase (imeti mnenje ‘to have an opinion’, biti v dvomih ‘to be in doubt’),
and must open up its own valency slots (kdo ima predavanje za koga ‘someone holds
a lecture for someone’). Semantically, the expression must denote an action (imeti
predavanje ‘to hold a lecture’) or a state (biti v dvomih ‘to be in doubt’). According to
the verb, the category has two subtypes: (a) if the verb contributes to the meaning
on a predominantly categorical level, the expression is categorized as LVC.full (biti
v pomoč ‘to be of help’); (b) if the subject can be interpreted as the cause or source
of the denoted action, the expression is categorized as LVC.cause (spraviti v smeh ‘to
make smn laugh’). The LVC tests also take into account the abstractness of the noun
(imeti avto ‘to have a car’ is not a multiword expression, while idiomatic expressions
like imeti mačka ‘lit. to have a cat – to have a hangover’ are categorized as VIDs) and,
with LVC.full, the possibility of rephrasing by omitting the verb (Janez ima predavanje
‘Janez holds a lecture’ – Janezovo predavanje ‘Janez’s lecture’).
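The conditions listed above can be read as a small decision procedure. The following Python sketch is a simplification and not the actual PARSEME decision tree: the function name, the order of the tests, and the Boolean inputs (assumed to be judgements made by an annotator) are all assumptions introduced for illustration.

```python
from typing import Optional


def classify_verb_noun(
    abstract_noun: bool,
    idiomatic: bool,
    denotes_action_or_state: bool,
    subject_is_cause: bool,
) -> Optional[str]:
    """Return 'VID', 'LVC.cause', 'LVC.full', or None (not a VMWE of these types)."""
    if idiomatic:
        # Idiomatic combinations such as imeti mačka are verbal idioms.
        return "VID"
    if not abstract_noun or not denotes_action_or_state:
        # Concrete nouns (imeti avto) do not form a light verb construction.
        return None
    # The subject as cause or source of the action selects the causative subtype.
    return "LVC.cause" if subject_is_cause else "LVC.full"


print(classify_verb_noun(False, False, False, False))  # imeti avto
print(classify_verb_noun(False, True, True, False))    # imeti mačka
print(classify_verb_noun(True, False, True, False))    # biti v pomoč
print(classify_verb_noun(True, False, True, True))     # spraviti v smeh
```

The four calls correspond to the examples discussed in the text; the rephrasing test for LVC.full (omitting the verb) is left out of this sketch.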
Despite the somewhat elusive concept of LVCs, the annotation process has
confirmed that the PARSEME guidelines are specific enough to be successfully
applied to real text. Of the 303 examples annotated as LVCs (1 example was catego-
rized incorrectly), 78.8% were LVC.full and 21.2% LVC.cause. 87.1% of them were
combinations of a verb and a noun, while 12.9% were combinations of a verb and a
prepositional phrase. The annotated LVCs contained a total of 19 different verbs,5
predominantly the verb imeti ‘to have’ (65.6%), but also biti ‘to be’ (13.6%) and dati/
dajati ‘to give’ (a total of 9.6%).6 Other verbs (narediti ‘to do’, postaviti/postavljati ‘to
put’, ostati ‘to remain’, voditi ‘to lead’, namenjati ‘to pay [attention]’, delati ‘to do/make’,
storiti ‘to do’, vzbujati/zbujati ‘to incite’, dobiti ‘to get’, zastaviti ‘to pose’, spraviti ‘to
make’, doseči ‘to achieve’ and nositi ‘to wear’) occur less frequently, often in a single
expression (ostati v spominu ‘to remain in one’s memory’, namenjati pozornost ‘to pay
attention to sth’).
Combinations of a verb and a prepositional phrase are somewhat more typical of
the LVC.cause category. In the annotated data, LVC.cause occurs exclusively with the
prepositions v ‘in’ (33 instances) and na ‘on’ (6 instances). In the majority of cases, the
combination is biti v (biti v pomoč ‘to be of help’, biti v podporo ‘to provide support’, biti
v navadi ‘to be a habit’).
In the annotated expressions, a relatively limited set of nouns can be found: a total
of 97. The most frequent nouns are težava ‘problem’ (21) and pravica ‘right’ (20), fol-
lowed by možnost ‘possibility’, mnenje ‘opinion’, učinek ‘effect’, vloga ‘role’, vpliv ‘influ-
ence’, vtis ‘impression’, pomoč ‘help’, občutek ‘feeling’, prednost ‘advantage’, sreča ‘luck’,
korist ‘benefit’, vprašanje ‘question’, volja ‘will’, posledica ‘consequence’. As expected,
some of these nouns occur exclusively in LVC.full (pravica, možnost, mnenje, vloga),
while others occur in LVC.cause (učinek, vpliv, vtis, pomoč). In other cases, the category
depends on the meaning of the verb (dati prednost ‘to give an advantage’ → LVC.cause
and imeti prednost ‘to have an advantage’ → LVC.full).
In accordance with the conclusions made by Soršak (2013, 519), the results show
that the featured verbs can also be used with full meaning, while the semantic lightness
in LVCs is complemented by the nominal part (imeti ‘to have’ meaning ‘to possess’
compared to imeti posledice ‘to have consequences’ meaning ‘to cause/lead to conse-
quences’). Semantically, the noun groups occurring in LVC.cause describe the result
of an action, be it a type of result (učinek ‘effect’, vpliv ‘influence’, vtis ‘impression’), a
positive (korist ‘benefit’, užitek ‘pleasure’) or negative consequence (muka ‘torment’,
preglavica ‘trouble’). The semantically light verb binds the result to the subject (nekdo/
nekaj daje vtis ‘smn/sth makes an impression’, i.e. the agent is the cause of the action).
In certain cases, LVCs can be converted into semantically full verbs with a similar
morphological base (dosegati učinek ‘to achieve an effect’ – učinkovati ‘to affect’; imeti
5 This is the full set of the LVCs in the data, confirming that the set of verbs occurring in these expressions is lim-
ited. In the dictionary, Soršak (2013, 513) finds mentions of semantic lightness in 420 verb entries. However, as
mentioned, the labels often signify stylistically marked and atypical language use.
6 In Slovene linguistics, verb phrases with imeti ‘to have’ and biti ‘to be’ have been most frequently treated as the
equivalent of LVCs, but analyzed from different perspectives (see e.g. Vidovič Muha 1998).
vpliv ‘to have an influence’ – vplivati ‘to influence’), but not always (imeti posledice ‘to
have consequences’ – /).
The nouns occurring in LVC.full are semantically more diverse. Dividing them
into semantic groups reveals that the common ground of these expressions can be
defined as planning or estimating success. Among the encountered LVCs are phrases
with nouns dealing with (a) communication (mnenje ‘opinion’, predlog ‘suggestion’,
vprašanje ‘question’); or describing (b) the potential for success (možnost ‘possibil-
ity’, prednost ‘advantage’, priložnost ‘opportunity’); (c) initial steps (obljuba ‘promise’,
napoved ‘prediction’, načrt ‘plan’); (d) potential reasons for failure (napaka ‘mistake’,
pomanjkljivost ‘disadvantage’). Other groups deal with (e) negative states (težava
‘problem’, strah ‘fear’, dvom ‘doubt’), (f) positive qualities (moč ‘power’, pogum ‘cour-
age’, potrpljenje ‘patience’), (g) achieved results (izobrazba ‘education’, status ‘status’,
posel ‘business’), and (h) attitudes towards as yet unrealized goals (želja ‘wish’,
ambicija ‘ambition’, vizija ‘vision’). Again, some examples can be converted into a
semantically full verb (imeti mnenje ‘to have an opinion’ – meniti), while others cannot (imeti
ambicije ‘to have ambitions’ – /).
Inherently Adpositional Verbs (IAV)
Inherently adpositional verbs, also called verbs with a lexicalised prepositional
morpheme (Žele 2002), were included in the PARSEME Guidelines during the sec-
ond annotation phase as an optional test category.7 The guidelines define IAVs as verbs
that only occur with a prepositional morpheme (simpatizirati z ‘to sympathize with’)
or verbs that change meaning when occurring with a prepositional morpheme (biti
za ‘be for, to support’ vs. biti ‘to be’). The participants required by the verb phrase as
a whole are not a part of the VMWE, as opposed to e.g. stati na + trdnih tleh ‘to stand
on + solid ground’, which is categorized as a VID.
Prepositions have been treated as free verb morphemes as early as in Metelko’s
Grammar of Slovene (1825, 247–56) and were analyzed in further detail by Breznik
(1916, 250; 1934, 225). Verbs with a lexicalised prepositional morpheme were also
analyzed by Žele (2002) and Kržišnik (1994), the former from the perspective of the
degree of lexicality of the preposition and the latter from the perspective of phrase
fixedness as either a phraseological unit with structural fixedness (biti ob čem ‘to be
next to sth’ meaning ‘to be positioned next to sth’) or phrasemes with lexical fixedness
(biti ob kaj ‘to lose sth’).
7 Based on the feedback from the first annotation campaign and the issues discussed among the contributors, idi-
omatic combinations of verbs with prepositions or postpositions (IAVs) were separated from verb-particle con-
structions (VPCs) such as put off, to blow up, to do in, in which the particle completely changes the meaning or
adds a partly predictable but non-spatial meaning to the verb. Unlike VPCs, which are characteristic of Germanic
languages and Hungarian, less so of Romance languages, and absent in Slavic languages, IAVs can exclusively be
found in the Balto-Slavic language group.
In the training corpus, IAVs account for approximately 20% of all annotated
VMWEs (see Table 1). Among the 710 examples, 154 diverse IAVs were identified.
The following examples appear with a frequency of at least 20: iti za ‘to be about’
(always in the third person singular – gre za), priti do ‘to occur’, ukvarjati se z ‘to work
on sth’, vplivati na ‘to influence’, skrbeti za ‘to take care of ’, temeljiti na ‘to be based on’,
naleteti na ‘to encounter’, veljati za ‘to be considered’ and biti proti ‘to be against’. As
per the guidelines, the IAV category also includes verb phrases that consist of an inher-
ently reflexive verb (see 5.1) and a lexicalised prepositional morpheme (nanašati se
na ‘to refer to sth’).
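For orientation, the share of each category can be recomputed from the counts reported in the summary of this paper (1,627 IRV, 724 VID, 710 IAV, 239 LVC.full, 64 LVC.cause). The sketch below is only an arithmetic check; the variable names are illustrative and not part of the annotation scheme.

```python
# Category counts as reported in the paper's summary.
counts = {
    "IRV": 1627,
    "VID": 724,
    "IAV": 710,
    "LVC.full": 239,
    "LVC.cause": 64,
}

total = sum(counts.values())

# Percentage share of each category, rounded to one decimal place.
shares = {cat: round(100 * n / total, 1) for cat, n in counts.items()}

# Average number of occurrences per distinct IAV (710 examples, 154 distinct IAVs).
iav_tokens_per_type = round(counts["IAV"] / 154, 1)

print(total)
print(shares)
print(iav_tokens_per_type)
```

With these figures the IAV share comes to about 21%, consistent with the "approximately 20%" stated above, and each distinct IAV occurs between four and five times on average.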
The most frequent lexicalised prepositional morpheme is za ‘for’, occurring with
34 different verbs (e.g. gre za ‘to be about’), followed by na ‘on’, occurring with 33
different verbs (e.g. vplivati na ‘to influence’). Frequent prepositional morphemes are
also z/s ‘with’, do ‘to’ and v ‘in’.
The lexicalised prepositional morpheme is usually positioned after the verb, which
is true in 86% of the annotated examples. In the vast majority of cases, the morpheme
is positioned directly after the verb or in a narrow window (+3 words). An exception is
gre za, where an intervening element serves to reference preceding information (gre [v
tem primeru] za ‘it [in this case] is about’). In less frequent examples where the prepo-
sitional morpheme is positioned before the verb, the distance between the verb and
the morpheme is significantly larger (in 20% of the cases, the distance is 3+ words).
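The positional tendency described above suggests a simple candidate search: look for a lexicalised preposition within a few tokens after the verb. The following is a hedged sketch, not the method used in the annotation. It assumes lemmatized input, interprets the "+3 words" window as up to three tokens after the verb, and uses a hypothetical mini-lexicon built from examples in the text. The function name `find_iav_candidates` and the `window` parameter are illustrative.

```python
from typing import List, Set, Tuple

# Hypothetical mini-lexicon of (verb lemma, preposition lemma) pairs.
IAV_LEXICON: Set[Tuple[str, str]] = {
    ("iti", "za"),
    ("priti", "do"),
    ("vplivati", "na"),
    ("temeljiti", "na"),
}


def find_iav_candidates(lemmas: List[str], window: int = 3) -> List[Tuple[int, int]]:
    """Return (verb_index, preposition_index) pairs in which a lexicalised
    preposition follows its verb within `window` tokens."""
    verbs = {verb for verb, _ in IAV_LEXICON}
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma not in verbs:
            continue
        # Scan only the tokens that follow the verb, up to the window size.
        for j in range(i + 1, min(i + 1 + window, len(lemmas))):
            if (lemma, lemmas[j]) in IAV_LEXICON:
                hits.append((i, j))
                break
    return hits


# "to vpliva na rezultat": preposition directly after the verb.
adjacent = find_iav_candidates(["ta", "vplivati", "na", "rezultat"])

# "gre v tem primeru za napako": three intervening tokens before "za".
gre_za = ["iti", "v", "ta", "primer", "za", "napaka"]
missed = find_iav_candidates(gre_za, window=3)
found = find_iav_candidates(gre_za, window=4)
```

The second example illustrates why gre za is an exception: with three intervening tokens, the preposition falls outside a three-token window and is found only when the window is widened. The sketch ignores prepositions that precede the verb, and it cannot separate literal from idiomatic uses.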
Verbs with a lexicalised prepositional morpheme can also be identified based on
common semantic features, e.g. the expression of (a) function or quality: veljati za
[favorita] ‘to be considered [a favorite]’,8 imenovati za [direktorja] ‘to name [smn a
director]’, označiti za [laž] ‘to call [sth] out as [a lie]’; (b) (dis)agreement: biti za/proti
[globalizacijo] ‘to be for/against [globalization]’; (c) basis: temeljiti na [dejstvu] ‘to be
based on [fact]’, graditi na [zaupanju] ‘to build on [trust]’; (d) beginning or change of
action/state: pasti v [komo] ‘to fall into [a coma]’, prerasti v [ljubezen] ‘to blossom into
[love]’; (e) change of quality or form: pretvoriti v [energijo] ‘to convert into [energy]’;
(f) survival: iti skozi [proces] ‘to go through [a process]’; (g) active participation:
ukvarjati se z ‘to work on sth’, skrbeti za ‘to take care of sth’.
IAVs are characterized by the fact that the presence of the prepositional morpheme
often changes the valency qualities of the verb, e.g. (a) when the original intransitive
verb becomes transitive, as in the example živeti ‘to live’ : živeti od koga/česa ‘to live off
of sth’; (b) when there is a change in the case of the prepositional complement, e.g.
obrniti se na koga ‘to turn to someone (fig.)’ : obrniti se h komu ‘to turn to someone (lit.)’.
There are also many examples of movement verbs that as IAVs change meaning to a
non-spatial judgment of state (priti skozi ‘to go through’ in the sense of ‘to survive’).
With verbs featuring a wide semantic range, the prepositional morpheme typically
narrows down the meaning (biti ‘to be’ : biti za ‘to be for, to support sth’). Some verbs
within IAVs require an abstract object, e.g. pasti v [depresijo, vrtinec nizkotnosti] ‘to fall
8 With IAVs, we also list typical collocates from the Gigafida Corpus of Written Slovene to ease semantic disambigua-
tion.
into [depression, a whirlpool of insidiousness]’, dišati po [prevari] ‘to smell of [deceit]’,
pokati od [veselja] ‘to be bursting with [joy]’.
Identifying inherently adpositional verbs poses a challenge both for human anno-
tators and language technology tools as additional elements can intervene between the
lexicalised morpheme and the verb. In addition, numerous verb-preposition combina-
tions can denote a literal meaning while not exhibiting any change in the case of the
object complement (stati za [vrati] ‘to stand behind the door’ : stati za [dejanji] ‘to
stand by one’s actions’). They can also be polysemous (priti do [spremembe] ‘to occur
[change]’ : priti do [denarja] ‘to get [money]’). The analysis offers a starting point
for the automatic identification of IAVs and provides possibilities for more detailed
research, especially in terms of valency, sentence patterns and the semantic features
of participants.
Verbal Idioms (VID)
The PARSEME Guidelines define verbal idioms (VID) as the combination of two
lexicalised elements in which the verb is the syntactic head that requires at least one
participant in the sentence pattern. The participants can take different syntactic roles,
e.g. a direct or prepositional object complement (plačati ceno ‘to pay a price’, zravnati
z zemljo ‘to level with the earth’), a subject (zgodba se ponavlja ‘lit. the story repeats
itself ’), an adverbial (spati kot ubit ‘lit. to sleep like a dead person’) or a subordinate
clause (vedeti, koliko je ura ‘lit. to know what time it is’ in the sense ‘to know what is
going on’). VIDs must also keep a meaning that is independent of the meanings of
their elements even with certain syntactic conversions. The Guidelines mention that
the elements can appear in expected paradigms (declensions), in different tenses, in
active or passive voices, with lexical variance, etc.
The definition provided by the PARSEME Guidelines differs from the one found
in Slovene linguistics in that it focuses on the verb as the head and the lexicalised
elements within the verb’s sentence pattern. On the other hand, Slovene linguistics
focuses primarily on the ability of the verb phrase as a whole to take the role of the
predicate (Toporišič 1973/74; Kržišnik 1994). From this point of view, it is prob-
lematic to treat phrases that feature a verb as the fixed part, but as a whole do not
always take the role of the predicate. In some cases, they can take the role of an object
complement ([ne spodobi se] voditi za nos ‘lit. [it is not proper] to lead someone by the
nose’ in the sense ‘fooling someone is frowned upon’), a sentence (srce se trga [komu]
‘[someone’s] heart is breaking’), or an adverbial (hočeš nočeš ‘like it or not’).
In the training corpus, 724 units were categorized as VIDs, which represents 22%
of all VMWEs (see Table 1). As can be expected, VIDs occurring more than 10 times
feature the verbs biti ‘to be’ and imeti ‘to have’. Several other VIDs occur more than 5
times (biti kos ‘to be sth’s match’, priti prav ‘to come in handy’, igrati vlogo ‘to play a role’,
pustiti pri miru ‘leave sth be’, priskočiti na pomoč ‘to rush to smn’s aid’, and imeti opravka
s/z ‘to busy oneself with’), along with fixed discourse markers (cf. Dobrovoljc 2017):
se pravi ‘which is to say’, kdo ve ‘who knows’.
As mentioned above, the most frequent structures are combinations of the verb
biti ‘to be’ and an adverb/adjective/noun. Taking into account their structural fixed-
ness and semantic vagueness of the verb, they should be treated as separate lexicon
entries: biti všeč/res/mar/prida/prav/kos ‘to be likeable/true/to care/to be of benefit/
to be right/to be smn’s match’. This group also includes phrases with the semantically broad
verb imeti ‘to have’: imeti prav/rad ‘to be right/to love’, ne imeti pojma/smisla ‘to have
no clue/meaning’.
Another frequent structure in the training corpus is the combination of a verb
and a noun or noun phrase. Among the verbs, the most frequent are delati ‘to make’
(delati družbo/gužvo/izjeme/preglavice/razlike/sceno/škodo ‘to do/make company/a
crowd/an exception/trouble/a difference/a scene/damage’) and dati ‘to give’ (dati
polet/pečat ‘to give momentum/to leave a mark’). The latter structurally coincide with
LVCs, but cannot be converted in the same way as LVCs to express possession (Miha
ima predavanje ‘Miha holds a lecture’ → Mihovo predavanje ‘Miha’s lecture’, but not Miha
dela gužvo ‘Miha is crowding the place’ → *Mihova gužva ‘Miha’s crowd’). The largest
percentage in the training corpus is covered by VIDs consisting of a verb and a prepo-
sitional phrase. Again, the most frequent verb is biti ‘to be’ (biti na dosegu roke ‘to be in
reach’, biti na razpolago/voljo ‘to be at one’s disposal’), followed by e.g. priti ‘to come’
(priti na dan ‘to come to light’, priti na misel ‘to come to mind’) and dati ‘to give’ (dati
na izbiro ‘to give a choice’). In terms of fixedness, some combinations of a verb and
a nominal/prepositional phrase require an obligatory negation (ne moči si kaj ‘can’t
help but’, ni ne duha ne sluha o (kom/čem) ‘no trace of sth’, ni para (komu) ‘someone
has no equal’).
The training corpus also features other structures, but with lower frequencies
(solze stopijo v oči (komu) ‘someone’s eyes are watering’, časi se spreminjajo ‘times are
changing’). These also include idioms (bolje preprečiti kot zdraviti ‘lit. better to prevent
than to cure’) and comparisons (igrati se [s kom/čim] kot mačka z mišjo ‘lit. to play
[with smn/sth] like a cat plays with a mouse’), as well as verb-adverb combinations
(priti skupaj ‘to come together’, daleč priti ‘to come far’) and combinations of a verb
and a pronominal morpheme (zagosti jo (komu) ‘to create mischief for someone’).
Within their sentence patterns, VIDs open up predictable syntactic slots filled by
participants with typical semantic roles. A quick overview of the annotated examples
shows that certain verb forms are fixed or more frequent (e.g. third person or negated
forms) and that lexical elements in a certain slot are to some extent predictable: (svet,
življenje, vse) postaviti na glavo ‘to turn [the world/life/everything] upside down’.
Discussion and Conclusion
The conducted annotation task has shown that the annotation set-up (including
the tool and the annotation scheme) is suitable. However, content-wise, the task is
relatively complex and requires a more advanced linguistic background. The categories
provided in the available guidelines are attributable and formalistically distinguishable
from each other; categorization problems occur mostly when distinguishing colloca-
tions from VMWEs. The quantitative analysis shows that all categories are robust and
present in authentic texts.
Based on the annotated VMWEs, we were able to identify certain pattern features
on the syntactic and semantic levels. These patterns represent a good starting point for
a set of rules for the automatic extraction of VMWEs and further language description.
Methodologically, we made a shift in focus from a functional-syntactic perspective to
the description of interconnected features on the morphosyntactic, syntactic, seman-
tic, and lexical levels.
As expected, VMWEs are typically formed by verbs with a wide semantic range,
e.g. biti ‘to be’, dati ‘to give’, imeti ‘to have’, which makes them lose their lexical qualities,
but keep their morphological features, syntactic function, and position in the sentence
pattern. The degree to which the meaning of the verb as an element of the MWE
contributes to the meaning of the whole is often difficult to determine, one of the rea-
sons being that numerous verb phrases structurally coincide with several categories,
but denote no idiomatic meaning. In the text, they are difficult to distinguish from
free phrases or collocations (frequent, semantically sensible and structurally adequate
word co-occurrences).
On the other hand, the initial structural and semantic analysis has shown that (a)
individual types of VMWEs form recognizable structural patterns, e.g. verb + nominal/
prepositional phrase; (b) the lexicalization of elements influences the change in the par-
ticipants’ position and their semantic roles (vreči se po kom ‘to take after smn’ – vreči se v
kaj ‘to begin working enthusiastically’ – vreči koga ven ‘to throw smn out’); (c) the
sequence of verb elements in a VMWE is usually not fixed, but (d) there are certain
tendencies in word order and in the number and representation of intervening elements.
Furthermore, (e) certain lexical elements can be predicted based on the frequency and
the elements of the co-text; (f) for better automatic identification of VMWEs, their
formalized description should include information on all levels of language description.
The list of VMWEs obtained from the annotated corpus represents a set of lexicon
units that can be used in machine learning for the automatic identification of VMWEs
in text.
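How such lexicon units might be put to use can be illustrated with a small sketch. It is not the authors' implementation: the `VMWEEntry` structure, its fields, and the default `max_gap` of three intervening tokens are assumptions made for illustration, and the input is assumed to be lemmatized. The matcher deliberately ignores word order, in line with the observation that the sequence of elements is usually not fixed.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class VMWEEntry:
    """Hypothetical lexicon unit: lexicalised lemmas plus a PARSEME category."""
    lemmas: Tuple[str, ...]
    category: str
    max_gap: int = 3  # assumed limit on intervening tokens


LEXICON = [
    VMWEEntry(("imeti", "vpliv"), "LVC.full"),
    VMWEEntry(("priti", "na", "dan"), "VID"),
]


def match_entry(entry: VMWEEntry, sent: List[str]) -> Optional[Tuple[int, ...]]:
    """Return sorted token indices of the first order-free match, or None."""
    # Candidate positions for every lexicalised component.
    positions = [[i for i, lemma in enumerate(sent) if lemma == comp]
                 for comp in entry.lemmas]
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # one token cannot fill two components
        intervening = max(combo) - min(combo) + 1 - len(combo)
        if intervening <= entry.max_gap:
            return tuple(sorted(combo))
    return None


def identify(sent: List[str]) -> List[Tuple[str, Tuple[int, ...]]]:
    """Return (category, indices) for every lexicon entry found in the sentence."""
    found = []
    for entry in LEXICON:
        hit = match_entry(entry, sent)
        if hit is not None:
            found.append((entry.category, hit))
    return found


# "Resnica je končno prišla na dan."
vid_hit = identify(["resnica", "biti", "končno", "priti", "na", "dan"])

# "Vpliv na rezultat je imel tudi veter." (noun precedes the verb)
lvc_hit = identify(["vpliv", "na", "rezultat", "biti", "imeti", "tudi", "veter"])
```

A matcher of this kind only proposes candidates. Telling idiomatic uses from literal ones, as discussed above, would require the additional syntactic and semantic information that the formalized description is meant to provide.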
While our research did not include a systematic analysis of the sentence patterns,
it should be mentioned that the training corpus includes the syntactic (formalized
syntactic dependencies) and semantic (semantic role labeling) data that can be used
to analyze them. This would allow us to identify more general sentence patterns for a
certain VMWE type and use them in automatic extraction.
To correctly identify different MWEs, we will also create a typology of non-verbal
MWEs, e.g. nominal (žlahtna kapljica ‘fine wine’), adjectival (vreden greha ‘worthy of
sin’), or adverbial phrases (zdaj ali nikoli ‘now or never’), as well as phrases contain-
ing particles, conjunctions and pronouns (ja pa ja ‘as if ’, s tem da ‘taking into account
that’) which were identified as frequent n-grams (Dobrovoljc 2017). Another chal-
lenge to tackle is the relation between the canonical and converted forms of MWEs,
e.g. začarani krog ‘vicious circle’ – biti ujet v začarani krog/v začaranem krogu ‘to be
caught in a vicious circle’ – izviti se/rešiti se iz začaranega kroga ‘to escape from a
vicious circle’ – vrteti se/znajti se v začaranem krogu ‘to spin/end up in a vicious circle’
– izstopiti iz začaranega kroga ‘to step out of a vicious circle’, etc. Furthermore, it is
difficult to identify MWEs with an independent, but non-metaphorical meaning, e.g.
fixed expressions of the type tehnološki park ‘technology park’ and ustavno sodišče
‘constitutional court’, which are closer to terminology and named entities.
Acknowledgments
The authors acknowledge the financial support from the Slovenian Research
Agency: (a) research core funding No. P6-0215, Slovene Language – Basic, Contrastive,
and Applied Studies; (b) research core funding No. P6-0411, Language Resources and
Technologies for Slovene; and (c) project funding No. J6-8256, New grammar of modern
standard Slovene: resources and methods. The research was conducted within the frame-
work of the IC1207 PARSEME COST Action9 and the IS1305 ENeL COST Action.10
Sources and Literature
• Arhar Holdt, Špela, and Vojko Gorjanc. 2007. “Korpus FidaPLUS: nova generacija slovenskega
referenčnega korpusa.” Jezik in slovstvo 52, No. 2 (January): 95–110.
• Atkins, Sue B. T., and Michael Rundell. 2008. The Oxford Guide to Practical Lexicography. New
York: Oxford University Press.
• Baldwin, Timothy, and Su Nam Kim. 2010. “Multiword Expressions.” In Handbook of Natural
Language Processing, edited by Nitin Indurkhya and Fred J. Damerau, Second Edition, 267–92.
Boca Raton: CRC Press.
• Breznik, Anton. 1916. Slovenska slovnica za srednje šole. Celovec: Družba sv. Mohorja.
• Breznik, Anton. 1934. Slovenska slovnica za srednje šole. 4th, enlarged edition. Celje: Družba sv.
Mohorja.
• Candito, Marie, Fabienne Cap, Silvio Cordeiro, Vassiliki Foufi, Polona Gantar, Voula Giouli, Carlos
Herrero, Mihaela Ionescu, Verginica Mititelu, Johanna Monti, Joakim Nivre, Mihaela Onofrei,
Carla Parra Escartín, Manfred Sailer, Carlos Ramisch, Monica-Mihaela Rizea, Agata Savary, Ivelina
Stonayova, Sara Stymne, Veronika Vincze. 2016. PARSEME Shared Task 1.0 Annotation Guidelines
– version 1.6b – last updated on November 26, 2016. http://parsemefr.lif.uiv-mrs.fr/parseme-st-
guidelines/1.0/.
9 Home – PARSEME, http://www.parseme.eu.
10 Action IS1305 – COST, www.elexicography.eu.
• Dobrovoljc, Kaja. 2017. “Multi-word Discourse Markers and Their Corpus-driven Identification:
the Case of MWDM Extraction from the Reference Corpus of Spoken Slovene.” International
Journal of Corpus Linguistics 22, No. 4 (December): 551–82.
• Dobrovoljc, Kaja, Simon Krek, and Jan Rupnik. 2012. “Skladenjski razčlenjevalnik za slovenščino.”
In Zbornik Osme konference Jezikovne tehnologije, edited by Tomaž Erjavec and Jerneja Žganec Gros,
42–47. Ljubljana: Jožef Stefan Institute.
• Gantar, Polona, Lut Colman, Carla Parra Escartín and Héctor Martínez Alonso. 2018. “Multiword
Expressions: Between Lexicography and NLP.” International Journal of Lexicography: 1–25.
• Gantar, Polona, Špela Arhar Holdt, Jaka Čibej, Taja Kuzman, and Teja Kavčič. 2018. “Glagolske
večbesedne enote v učnem korpusu ssj500k 2.1.” In Proceedings of the Conference on Language
Technologies & Digital Humanities, edited by Darja Fišer and Andrej Pančur, 85–92. Ljubljana:
Znanstvena založba Filozofske fakultete.
• Gantar, Polona, Simon Krek, and Taja Kuzman. 2017. “Verbal Multiword Expressions in Slovene.”
Europhras 2017, Computational and Corpus-Based Phraseology: Proceedings, edited by Ruslan
Mitkov, 247–59. Cham: Springer.
• Godec Soršak, Lara. 2013. “Glagoli z oslabljenim pomenom v Slovarju slovenskega knjižnega
jezika.” Slavistična revija 61, No. 3 (March): 507–22.
• Gorjanc, Vojko, Polona Gantar, Iztok Kosem, and Simon Krek, eds. 2017. Dictionary of Modern
Slovene: Problems and Solutions. Ljubljana: Ljubljana University Press, Faculty of Arts. https://e-
knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/book/15.
• Grčar, Miha, Simon Krek, and Kaja Dobrovoljc. 2012. “Obeliks: statistični oblikoskladenjski
označevalnik in lematizator za slovenski jezik.” In Zbornik Osme konference Jezikovne tehnologije,
edited by Tomaž Erjavec and Jerneja Žganec Gros. Ljubljana: Jožef Stefan Institute.
• Krek, Simon, Kaja Dobrovoljc, Tomaž Erjavec, Sara Može, Nina Ledinek, Nanika Holz, Katja
Zupan, Polona Gantar, and Taja Kuzman. 2017. “Training Corpus Ssj500k 2.0.” Slovenian Language
Resource Repository CLARIN.SI. http://hdl.handle.net/11356/1165.
• Kržišnik, Erika. 1994. “Slovenski glagolski frazemi (ob primeru glagolov govorjenja).” PhD diss.,
Faculty of Arts, University of Ljubljana.
• Metelko, Franc Serafin. 1825. Lehrgebäude der slowenischen Sprache im Königreiche Illyrien und in
den benachbarten Provinzen. Laibach: Leopold Eger.
• Ramisch, Carlos, Silvio Ricardo Cordeiro, Agata Savary, Veronika Vincze, Verginica Barbu Mititelu,
Archna Bhatia, Maja Buljan, Marie Candito, Polona Gantar et al. 2018. “Edition 1.1 of the PARSEME
Shared Task on Automatic Identification of Verbal Multiword Expressions.” In Proceedings: LAW-
MWE-CxG 2018, The 12th Linguistic Annotation Workshop (LAW XII) and the 14th Workshop on
Multiword Expressions (MWE 2018), edited by Agata Savary, Carlos Ramisch, Jena D. Hwang,
Nathan Schneider, Melanie Andresen, Sameer Pradhan, and Miriam R. L. Petruck, 222–40. Santa
Fe: Association for Computational Linguistics. http://aclweb.org/anthology/W18-49.
• Sag, Ivan, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. “Multiword
Expressions: a Pain in the Neck for NLP.” In Proceedings of the 3rd International Conference on
Intelligent Text Processing and Computational Linguistics (CICLing 2002), edited by Alexander
Gelbukh, 1–15. Berlin, Heidelberg, New York: Springer.
• Schneider, Nathan, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec,
Henrietta Conrad, and Noah A. Smith. 2014. “Comprehensive Annotation of Multiword
Expressions in a Social Web Corpus.” Proceedings of the Ninth International Conference on Language
Resources and Evaluation (LREC-2014), edited by Nicoletta Calzolari, Khalid Choukri, Thierry
Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk and
Stelios Piperidis, 455–61. European Languages Resources Association (ELRA).
• Slovar slovenskega knjižnega jezika. 2nd edition. Ljubljana: SAZU and Fran Ramovš Institute of the
Slovenian Language ZRC SAZU. www.fran.si.
• Toporišič, Jože. 1973/74. “K izrazju in tipologiji slovenske frazeologije.” Jezik in slovstvo 19, No. 8
(Spring): 273–79.
• Toporišič, Jože. 1982. Nova slovenska skladnja. Ljubljana: Državna Založba Slovenije.
• Toporišič, Jože. 2000. Slovenska slovnica. Maribor: Založba Obzorja.
• Vidovič-Muha, Ada. 1998. “Pomenski preplet glagolov imeti in biti – njuna jezikovnosistemska
stilistika.” Slavistična revija 46, No. 4: 293–323.
• Žele, Andreja. 1999. “Vezljivost v slovenskem knjižnem jeziku (s poudarkom na glagolu).” PhD
diss., Faculty of Arts, University of Ljubljana.
• Žele, Andreja. 2002. “Prostomorfemski glagoli kot slovarska gesla.” Jezikoslovni zapiski 8, No. 1:
95–108.
• Žele, Andreja. 2012. Pomensko-skladenjske lastnosti slovenskega glagola. Linguistica et philologica
27. Ljubljana: Založba ZRC, ZRC SAZU.
Polona Gantar, Špela Arhar Holdt, Jaka Čibej, Taja Kuzman
Structural and Semantic Classification of Verbal
Multi-Word Expressions in Slovene
SUMMARY
In the paper, we present an analysis of Slovene verbal multi-word expressions
(VMWEs) based on the categorization made within PARSEME COST Action Shared
Task 1.1 for 20 different languages. The purpose of the task was to identify VMWEs in
running text based on syntactic and semantic guidelines, as well as to compile a manu-
ally annotated multi-language corpus to be made available under a Creative Commons
licence. The results of the analysis will be useful in the compilation of a digital lexicon
of Slovene multi-word units and will help establish a theoretical framework that takes
into account the specific characteristics of Slovene while still fulfilling international
criteria.
Unlike the functional-syntactic criteria advocated thus far in Slovene stud-
ies (Toporišič 1973/74; Kržišnik 1994), the classification of VMWEs within the
PARSEME Shared Task 1.1 focuses on the identification of the syntactic head of the
MWE. This allows MWEs to be divided into e.g. verbal, adjectival, and nominal MWEs
regardless of the function they have in the sentence as a semantic and syntactic whole.
The PARSEME classification consists of both universal and language-specific catego-
ries. Universal categories include verbal idioms (VID; plačati ceno ‘to pay the price’)
and light verb constructions, which are further divided into full (LVC.full; imeti mne-
nje ‘to have an opinion’) and causal (LVC.cause; spraviti v smeh ‘to make smn laugh’).
Language-specific categories encompass inherently reflexive verbs (IRV; zdeti se ‘to
seem’), which are typical of most Slavic languages; phrasal verbs (VPC), typical of
Germanic languages; and inherently adpositional verbs (IAV), also typical of most
Slavic languages, including Slovene. A total of 13,511 sentences in the Slovene training
corpus ssj500k 2.0 (Krek et al. 2017) were annotated with 3,364 VMWEs: 1,627 IRV
(48%), 724 VID (22%), 710 IAV (21%), 239 LVC.full (7%), and 64 LVC.cause (2%).
A linguistic analysis of the individual categories highlights numerous semantic and
syntactic characteristics of the identified VMWEs that can be taken into account in
the compilation of a MWE lexicon and the automatic identification of MWEs in text.
Among other things, the results show the importance of the criteria used to distinguish
between different types of reflexive verbs based on the role of the reflexive pronoun;
they can be viewed either as independent lexical units with their own meaning (e.g.
delati se ‘to pretend’) or as verbal phrases denoting e.g. mutual (poljubljati se ‘to kiss
each other’), reflexive (umivati se ‘to wash oneself ’), or passive actions (ponavljati se ‘to
be repeated’). The analysis has also shown that although the order of the components
of a VMWE is usually not fixed, certain tendencies exist in terms of word order and the
number of intervening elements. A semantic analysis of VMWEs has also revealed the
presence of semantic groups formed by VMWEs within an individual category, as well
as the properties of light verbs and verbs that typically form idiomatic units.
The study provides a good basis for further analyses of Slovene MWEs. In the
training corpus, VMWE annotations can be analyzed in terms of their formalized
syntactic dependency trees or the semantic roles played by the participants in the
sentence.
Polona Gantar, Špela Arhar Holdt, Jaka Čibej, Taja Kuzman
STRUKTURNA IN POMENSKA KLASIFIKACIJA
GLAGOLSKIH VEČBESEDNIH ENOT V SLOVENŠČINI
POVZETEK
V prispevku predstavljamo analizo glagolskih večbesednih enot (GVBE) v sloven-
ščini na podlagi kategorizacije, kot je bila izdelana v okviru PARSEME COST Action
Shared Task 1.1 za 20 različnih jezikov. Namen naloge je bil identificirati GVBE v
tekočem besedilu na podlagi skladenjskih in pomenskih smernic ter izdelava ročno
označenega večjezičnega korpusa, ki bo na voljo pod licenco Creative Commons.
Rezultati analize bodo uporabljeni pri izdelavi digitalnega leksikona večbesednih enot
za slovenščino kot tudi za utemeljitev teoretičnih izhodišč, ki upoštevajo specifike slo-
venščine in so hkrati usklajena z mednarodnimi merili.
Klasifikacija VMWE znotraj Parseme Shared task 1.1 za razliko od funkcijsko-
skladenjskih meril, ki jih predvideva slovenistično jezikoslovje (Toporišič 1973/74;
Kržišnik 1994), postavlja v izhodišče prepoznavanje skladenjskega jedra MWE, kar
omogoča njihovo delitev na glagolske, pridevniške, samostalniške ipd. GVBE, neod-
visno od funkcije, ki jo v stavku opravljajo kot pomenska in skladenjska celota. V
izhodišču predvideva Parsemovska klasifikacija univerzalne in jezikovnospecifične
kategorije. Znotraj prvih loči glagolske idiome (VID; plačati ceno) in zveze z glagoli v
pomensko oslabljeni rabi, ki so členjeni na prave (LVC.full; imeti mnenje) in vzročne
(LVC.cause; spraviti v smeh). Znotraj druge skupine pa inherentno povratne glagole
(IRV; zdeti se), ki so tipični za večino slovanskih jezikov, frazne glagole (VPC), zna-
čilne za germanske jezike, in glagole z leksikaliziranim predložnim morfemom (IAV),
ki so tipični za slovenščino in večino slovanskih jezikov. V učnem korpusu ssj500k
2.0 (Krek et al. 2017) smo označili 13,511 stavkov, v katerih smo identificirali skupno
3,364 VMWE v naslednjih deležih: 1,627 IRV (48 %), 724 VID (22 %), 710 IAV
(21 %), 239 LVC.full (7 %) in 64 LVC.cause (2 %).
Jezikoslovna analiza posameznih kategorij je pokazala številne semantične in
skladenjske značilnosti identificiranih GVBE, ki jih bo mogoče upoštevati pri izde-
lavi leksikona VBE ter pri njihovi avtomatski identifikaciji v besedilu. Med drugim
je izpostavila merila za ločevanje različnih tipov povratnih glagolov na podlagi vloge
povratnega zaimka, kar omogoča njihovo obravnavanje bodisi kot samostojnih leksi-
kalnih enot z lastnim pomenom (npr. delati se) bodisi kot glagolskih zvez v različnih
upovedovalnih vlogah, kot so npr. vzajemnost (poljubljati se), povratnost (umivati se),
pasivizacija (ponavljati se) ipd. Analize so tudi pokazale, da zaporedje elementov v
GVBE navadno ni ustaljeno, obstajajo pa določene tendence glede besednega reda
in števila vrivajočih se elementov. Analiza GVBE s semantičnega vidika je pokazala
navzočnost določenih semantičnih skupin, ki jih tvorijo GVBE v posamezni kategoriji,
kot tudi lastnosti glagolov v pomensko oslabljeni rabi ter glagolov, ki tipično tvorijo
idiomatične enote.
Raziskava postavlja dobre osnove za nadaljnje analize VBE v slovenščini, zlasti
ob upoštevanju skladenjskih oznak v obliki formaliziranih skladenjskih drevesnic v
učnem korpusu, in semantičnih vlog, pripisanih udeležencem v stavčnem vzorcu.
1.01 UDC: 004.934:821.163.41
Aniko Kovač,* Maja Marković**
A Mixed-principle Rule-based
Approach to the Automatic
Syllabification of Serbian
IZVLEČEK
MEŠANI PRISTOP K AVTOMATSKEMU ZLOGOVANJU V SRBŠČINI NA
PODLAGI NAČEL IN PRAVIL
V tem prispevku predstavljamo mešani pristop k avtomatskemu zlogovanju v srbščini
na podlagi načel in pravil, ki temelji na predpisnih pravilih tradicionalne slovnice v kombi-
naciji z načelom zaporedja glede na zvočnost (Sonority Sequencing Principle). Proučujemo
težave in omejitve obeh uveljavljenih pristopov, ki temeljita na zbirki pravil in zvočnosti;
vpeljujemo algoritem, ki uporablja oba načina za doseganje natančnejše členitve besed na
zloge, ki bi bila skladnejša z intuicijo rojenih govorcev; in predstavljamo statistične podatke,
povezane z razporeditvijo zlogov in njihovo strukturo v srbščini.
Ključne besede: zlog, pristop na podlagi pravil, zvočnost, računalniško jezikoslovje,
fonologija
ABSTRACT
In this paper we present a mixed-principle rule-based approach to the automatic sylla-
bification of Serbian, based on prescriptive rules from traditional grammar in combination
with the Sonority Sequencing Principle. We explore the problems and limitations of the
existing rule set and sonority-based approaches, introduce an algorithm that utilizes both
means in an attempt to produce a more accurate segmentation of words into syllables that is
better aligned with the intuition of the native speakers, and present the statistical data related
to the distribution of syllables and their structure in Serbian.
Keywords: syllable, rule-based approach, sonority, computational linguistics, phonology
* Department of Language Science and Technology, Saarland University Campus A2 2, 66123 Saarbrücken,
Germany, anikok@coli.uni-saarland.de
** Department of English Language and Literature, Faculty of Philosophy, University of Novi Sad, Dr Zorana
Đinđića 2, 21000 Novi Sad, Serbia, majamarkovic@ff.uns.ac.rs
Introduction
Syllables have been considered — although not unequivocally (cf. Koehler 1966)
— to be one of the basic units in phonology constituting the minimal units of pro-
nunciation, and to play a role in prosody, phonotactics, and phonological processing
(Ladefoged and Johnson 2014). The role of the segmentation of words into syllables
and their distributional properties began to see an increase in importance in speech
technologies in the 1990s (Iacoponi and Savy 2011), most notably in the areas of
speech recognition (SR) and text-to-speech synthesis (TTS).
Syllable segmentation today plays a role in speech technologies on the segmental
level, conditioning the length of segmental units such as consonants and vowels,
as well as on the prosodic level, governing rhythmical alternations (Bigi and Petrone
2014). Syllable segmentation is also a key component in hyphenation (e.g. Kaplar et al.
2018), although it should be noted that, at least in Serbian, hyphenation is governed by
a set of rules that partially diverges from those governing syllabification (see footnote 1).
Syllable distribution data is also of crucial importance for psycholinguistic experiments,
as syllable frequency has been shown to play a role in the processing of words (e.g. Barber
et al. 2004; Cholin et al. 2006; Cholin and Levelt 2009). Developing an automatic system
of syllabification allows for the segmentation of the large-scale language corpora needed
for the development of automatic systems or for the extraction of relevant data on
syllable frequency distributions, which would otherwise require a large number of
trained annotators and would be a resource-intensive and costly undertaking.
The two generally distinguishable approaches to automatic syllabification are
rule-based versus data-driven approaches (Marchand et al. 2009). While data-driven
approaches have taken over many aspects of natural language processing, and there
are a number of data-driven models of syllable segmentation using artificial neural
networks (e.g. Daelemans and van den Bosch 1992; Hunt 1993; Stoianov et al. 1997;
Landsiedel et al. 2011), the unavailability of segmented data for Serbian makes rule-
based approaches the only viable option for automatic syllabification in Serbian.
To the best of our knowledge, there is a single publicly available attempt at developing
a rule-based syllabifier for Serbian, by Kaplar et al. (2018). In this paper we lay
out a number of problems and limitations of the rule set used in their syllabification
system and explain why relying on the existing set of prescriptive rule descriptions from
traditional grammar is insufficient to capture and describe a syllabification system that
is aligned with the intuition of native speakers of Serbian. A related attempt at automatic
syllabification was developed by Meštrović et al. (2015) for Croatian. The key
difference between their work and ours lies in the principle behind the syllabification
algorithm, which in their case relied solely on the onset maximization principle,
limiting possible syllable onsets to valid onsets at the beginning of words. Taking into
account Morelli’s (1999) limitations on possible syllable onsets in Serbo-Croatian,
the onset maximization principle employed by Meštrović et al. could be considered
a comparatively liberal system. In order to constrain our syllabifier, we
decided on a different approach that does not rely on onset maximization, but rather on a
combination of a number of alternative principles.
1 For example, hyphenation rules ban the segmentation after a syllable consisting of a single vowel at word onset,
while this segmentation is allowed and expected according to the rules of syllabification.
In this paper we present a mixed-principle rule-based approach to the syllabifica-
tion of Serbian. Our starting set of rules is based on the Gramatika srpskoga jezika
by Stanojčić and Popović (2005), a prescriptive textbook for Serbian grammar that
presents a set of rule descriptions for the segmentation of words into syllables. In a pre-
vious version of our syllabification algorithm (Kovač and Marković 2018), we made
a number of changes to the rule descriptions of Stanojčić and Popović (2005) as the
formulation of some of the descriptions proved to be redundant, some were example-
based and not specific enough for a formal implementation, and we also expanded
them with three added modifications related to the treatment of nasals and the alveolar
sonorant /r/ based on Kašić (2014) and the treatment of alveolar sonorants /l/ and
/n/ based on Zec (2000). In this paper we extend our previous algorithm to include a
module for validating the structure of syllables in terms of their compliance with the
Sonority Sequencing Principle (SSP), thus further fine-tuning the accuracy of our seg-
mentation, and resolving a number of problems noted in our earlier implementation.
The goal of the paper is threefold: i) to improve our system for automatic rule-
based syllabification for Serbian based on the formalization of existing rule descrip-
tions by the addition of the sonority sequencing validation module, ii) to provide an
analysis of the outcomes of the automatic syllabification process in order to address
possible theoretical considerations and serve as a basis for the development of future
syllabifiers, and iii) to present statistical data related to the distribution of syllables and
their structure in Serbian.
Prescriptive Rule Descriptions
Our starting set of rules was based on the formalization of the rule descriptions
governing the segmentation of words into syllables from the Gramatika srpskoga jezika
by Stanojčić and Popović (2005). Because this is a prescriptive textbook on Serbian grammar
used at high school level by all student profiles, we expected these rules to constitute
the common knowledge base shared by the majority of native speakers.
Regarding syllable boundaries, Stanojčić and Popović (2005, 37) establish the
following general rule (1).
(1) In words made up of multiple phonemes, consonants, sonorants and vowels, the syllable
boundary comes after the vowel and before the consonant (e.g. či-ta-ti [to read]).
In addition to this general rule, they list the following rules — (2), (3), (4), (5)
and (6) — that further specify medial syllable boundaries depending on consonant
manner of articulation.
(2) Medially, in a consonant cluster which has an affricate or fricative sound in its initial
position, the syllable boundary will be before that consonant cluster (e.g. po-šta [post],
ma-čka [cat]).
(3) The syllable boundary will be before a consonant cluster if, in a consonant cluster found medi-
ally in a word, the second position in the cluster is occupied by one of the sonorants /v/, /j/,
/r/, /l/ or /ʎ/ preceded by any other consonant besides a sonorant (e.g. sve-tlost [light]).
(4) If a consonant cluster consists of two sonorants, the syllable boundary will be between
them so that one sonorant belongs to the preceding, and one sonorant belongs to the
following syllable (e.g. lom-ljen [broken]).
(5) If a consonant cluster consists of a plosive in its initial position and some other consonant
except the sonorants /j/, /v/, /l/, /ʎ/ and /r/, the syllable boundary will be between
the consonants (e.g. lep-tir [butterfly]).
(6) If, in a cluster of two sonorants, the second position is occupied by the sonorant /j/ of the
sequence je, which in the ijekavica dialect corresponds to /e/ in the ekavica dialect, the
syllable boundary will be before that cluster (e.g. čo-vjek [man]).
Stanojčić and Popović (2005, 32) also introduce the rule descriptions (7) and (8)
to define when the sonorants /r/, /l/, and /n/ constitute syllable nuclei.
(7) The sonorant /r/ can be a syllable carrier in standard Serbian when:
a. it is found medially between two consonants (e.g. tr-ča-ti [to run]),
b. it is found initially before a consonant (e.g. r-va-ti se [to wrestle]),
c. it is found after a vowel in compounds (e.g. za-r-đa-ti [to rust]),
d. before /o/ that is realized as an /l/ in other members of the paradigm (e.g. o-tr-o
(m.) from o-tr-la (f.) [wiped]).
(8) The other two alveolar sonorants, /l/ and /n/ can be syllable carriers in dialectal toponyms
(e.g. Stlp, Vlča glava, Žlne) or foreign toponyms (e.g. Vltava, Plzen) but also in other per-
sonal names (e.g. English Idn or Arabic Ibn-Saud), and in the word bicikl [bicycle].
Revising the Existing Rule Set
The development of our syllabification algorithm has been an iterative process
testing the existing rule set and making changes as needed. While other authors (e.g.
Kaplar et al. 2018) used the rule descriptions of Stanojčić and Popović (2005) directly
to implement a software architecture for syllabification in Serbian, we have found a
number of problems with this approach.
The definition of the rule description under (1) causes the initial member of a
consonant cluster in the rule descriptions under (2)–(6) to be understood as the first
consonant following a vowel. However, given that the sonorants /r/, /l/, and /n/ can
also constitute syllable nuclei in Serbian in certain positions, as presented under rule
descriptions (7) and (8), a more precise definition would be that the initial member
of a consonant cluster is the first consonant following an element that constitutes a
syllable nucleus. The general rule under (1) should be then revised as follows.
(1*) In words made up of multiple phonemes, consonants, sonorants and vowels, the syllable
boundary comes after the vowel or sonorants /r/, /l/, and /n/ in syllable bearing posi-
tions and before the consonant (e.g. či-ta-ti [to read], tr-ča-ti [to run]).
In addition to our expansion of the general rule presented under (1) to include
the syllable bearing sonorants, while formalizing the rule descriptions via finite-state
automata, rules (2) and (3) proved to be redundant as they produced identical out-
comes to the general rule under (1*). Because of this, these rules were disregarded in
our syllabification algorithm.
During our early testing of the verbatim implementation of the rule descriptions,
we also noticed that the existing rule descriptions treated a consonant cluster consist-
ing of a nasal in initial position followed by a consonant that is not one of the sonorants
/j/, /v/, /l/, /ʎ/, and /r/ as a part of the following syllable onset, producing outcomes
such as: gu-ngula [commotion], mo-mci [guys], ka-ncelarije [offices], su-nce [sun], etc.
Contrary to Stanojčić and Popović (2005), authors such as Kašić (2014) argue that
nasals should be treated analogously to plosives during syllabification because there is
a complete occlusion in the oral cavity during their production. If this principle were
to be employed, rule (5) should be revised as follows.
(5*) If a consonant cluster consists of a plosive or nasal in its initial position and some other
consonant except the sonorants /j/, /v/, /l/, /ʎ/, and /r/, the syllable boundary will
be between the consonants.
Following rule (5*), the examples above would then be segmented as: gun-gula
[commotion], mom-ci [guys], kan-celarije [offices], sun-ce [sun], etc. Even though in the
earlier implementation of our syllabifier (Kovač and Marković 2018) we did not want
to employ the Sonority Sequencing Principle (SSP), we opted for the treatment of
nasals by Kašić (2014) in our implementation, which respected the limitations put
forward by the Sonority Hierarchy, and was more in line with native speaker intuition.
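As a rough illustration of how rules (1), (4), (5*) and (6) interact, the following Python sketch places boundaries after vocalic nuclei only. It is not the authors' implementation: the function name `syllabify`, the letter sets, and the treatment of each Latin letter as one phoneme (so that digraphs such as lj and nj and syllabic sonorants are not handled) are simplifying assumptions made for this example.

```python
# Hypothetical letter classes; one Latin letter is assumed to be one phoneme.
VOWELS = set("aeiou")
SONORANTS = set("vjrlmn")          # sonorants of traditional grammar
PLOSIVES = set("ptkbdg")
NASALS = set("mn")
# Second cluster member for rule (5*): plosive, fricative, affricate or nasal.
SECOND = PLOSIVES | NASALS | set("fšhszžcčćđ")


def syllabify(word):
    """Place syllable boundaries after vocalic nuclei using rules (1), (4), (5*), (6)."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    cuts = []
    for i in nuclei[:-1]:              # the last nucleus keeps the rest of the word
        a = word[i + 1:i + 2]          # first segment after the nucleus
        b = word[i + 2:i + 3]          # second segment after the nucleus
        c = word[i + 3:i + 4]          # third segment after the nucleus
        if a in SONORANTS and b in SONORANTS:
            # Rule (6): sonorant + j followed by e stays together; otherwise rule (4).
            cut = i + 1 if (b == "j" and c == "e") else i + 2
        elif a in (PLOSIVES | NASALS) and b in SECOND:
            cut = i + 2                # rule (5*): split the cluster
        else:
            cut = i + 1                # rule (1): boundary right after the nucleus
        cuts.append(cut)
    pieces, start = [], 0
    for cut in cuts:
        pieces.append(word[start:cut])
        start = cut
    pieces.append(word[start:])
    return "-".join(pieces)


print(syllabify("sunce"), syllabify("gungula"), syllabify("čovjek"))
```

If nasals are removed from the first set in the rule (5*) branch, which corresponds to the original rule (5), the same sketch yields su-nce instead of sun-ce.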
The Sonority Hierarchy
Sonority Theory accounts for the organization of segments into well-formed
sequences, both within the syllable and across syllabic boundaries. This organization
is driven by principles of sonority, a property that is used as the basis of ranking all
sounds along a scale, from less sonorous to more sonorous ones. Although there is
a general consensus that segments are ranked by their inherent sonority, the notion
of sonority itself is not unambiguously described in the phonetic and phonological
literature. Among the phonetic approaches, Ladefoged (1982) defines sonority as the
perceptual salience or loudness of a sound, and Bloch and Trager (1942; according
to Goldsmith 1995) define it as the amount of airflow in the resonance chamber. For
others, sonority is dependent on multiple phonetic parameters (Ohala and Kawasaki
1984; Ohala 1990; Butt 1992). In the phonological literature, sonority is generally
defined as a multi-valued feature (Foley 1972; Hankamer and Aissen 1974; Selkirk
1984), although there are also authors who argue that it is derivable from the more
basic binary features of phonological theory (Clements 1990). Other questions that
are often addressed are whether sonority scales are universal or language-specific,
allowing freedom to languages in assigning sonority values, and how fine-grained dis-
tinctions sonority scales should capture. For example, Clements’ universal sonority
scale includes only four major classes of consonants (Clements 1990), ranked from
least sonorous to most sonorous, as in (i):
(i) O < N < L < G
(O = obstruents, N = nasals, L = liquids, G = glides)
Selkirk (1984, 112) proposes a much more detailed scale, which divides all sounds
into 11 groups, assuming more subtle differences in sonority values. Selkirk also states
that the sonority indices may not be as important in themselves as the sonority rela-
tions that they express. Selkirk’s scale of sonority in consonants is given in (ii):
(ii) p, t, k < b, d, g < f, θ < v, z, ð < s < m, n < l < r
Sonority scales serve as the basis for constructing segment sequences within syllables.
The universal cross-linguistic generalization is that, in a sequence of segments,
the one ranking highest on the sonority scale constitutes the peak of the syllable, i.e. it
is the syllabic nucleus. The other segments around the nucleus are organized
so that the more sonorous ones are closer to the nucleus and the less sonorous ones
are more distant. This generalization is referred to as the Sonority Sequencing Principle
(SSP). Thus a syllable with an ascending sonority slope in the onset and a descending
slope in the coda, such as blunt, is a well-formed syllable, whereas *lbutn
is prohibited, as it violates the SSP. Adopting the SSP often solves the problems
of syllabic consonants, since they generally occur in environments where they
constitute a sonority peak, as in the Serbian word pr-vi.
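The principle can be made concrete with a small Python sketch that checks a single syllable against the four-class scale in (i). The vowel class placed on top of the scale, the mapping of plain letters to classes, the function name `obeys_ssp`, and the choice to treat sonority plateaus as violations are all assumptions made for illustration only.

```python
# Classes from scale (i): obstruents < nasals < liquids < glides, with vowels
# added on top as an assumption so that the nucleus can be located.
CLASSES = ["ptkbdgfvsz", "mn", "lr", "jw", "aeiou"]
SONORITY = {ch: rank for rank, group in enumerate(CLASSES) for ch in group}


def obeys_ssp(syllable):
    """True if sonority strictly rises to a single peak and strictly falls after it."""
    ranks = [SONORITY[ch] for ch in syllable]
    peak = ranks.index(max(ranks))
    # Pairs (r0, r1), ..., (r[peak-1], r[peak]) must rise towards the peak.
    rising = all(a < b for a, b in zip(ranks, ranks[1:peak + 1]))
    # Pairs from the peak onwards must fall towards the right edge.
    falling = all(a > b for a, b in zip(ranks[peak:], ranks[peak + 1:]))
    return rising and falling


print(obeys_ssp("blunt"), obeys_ssp("lbutn"), obeys_ssp("pr"))
```

Under these assumptions blunt passes, *lbutn fails, and the first syllable of pr-vi passes because /r/ is its sonority peak.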
The Need for Sonority
Apart from the segmentation of nasals analogously to plosives following Kašić
(2014) that relied on principles of the SSP, in our initial attempt at the formalization
of the rule description under (8) of Stanojčić and Popović (2005) we had to rely on
sonority to define the criteria for when the alveolar sonorants /l/ and /n/ act as syl-
lable nuclei.
As Stanojčić and Popović gave no formal criteria defining the contexts of syllable
bearing /l/ and /n/, our initial attempt was to draw on generalizations based on their
examples for syllable carrying /l/ (Stlp, Vlča glava, Žlne, Vltava, Plzen) and /n/ (Idn,
Ibn-Saud). In analogy to the rule descriptions under (7a) and (7b) and our added
rule (7c*) defining the contexts in which the alveolar phoneme /r/ can act as a syllable
nucleus, we implemented rule (8*) to define the conditions under which the
phonemes /l/ and /n/ can act as syllable bearing nuclei.
(8*) The other two alveolar sonorants, /l/ and /n/, can be syllable carriers if they are found:
a) medially between two consonants,
b) initially before a consonant, or
c) finally after a consonant.
However, the formulation under (8*) allowed for outcomes such as: Be-rn, Ka-rl,
erla-jn, Kla-jn, kasa-rn-skim, Linko-ln, Va-jl-om, etc. in which the phonemes /l/ and
/n/ identified as syllable nuclei have a lower sonority level than the consonants in
their onset or coda. Because the phonemes /r/ and /j/ are more sonorous than the
phonemes /l/ and /n/, and the lateral phoneme /l/ is more sonorous than the nasal
phoneme /n/, native speakers do not perceive the elements of lower sonority as syl-
lable nuclei in these contexts. Zec (2000) states that alveolar sonorants can be syllable
bearing elements in Serbian only in contexts in which there is no segment of a higher
level of sonority in their immediate vicinity. Because of this, we needed to further
specify rule (8*) to take sonority constraints into consideration as follows.
(8**) The other two alveolar sonorants, /l/ and /n/, can be syllable carriers if they are
found:
a) medially between two consonants of lower sonority,
b) initially before a consonant of lower sonority, or
c) finally after a consonant of lower sonority.
It turns out that this principle can also account for the behavior of the syllable
bearing /r/ in Serbian. In fact, it not only provides a general account of consonantal
syllabic nuclei in Serbian that subsumes the rules under (7) and (8**), but also
accounts for our extension of rule (7) that keeps the consonant cluster /rje/ of
the ijekavica dialect unsegmented in initial position (see footnote 2). Because the phoneme
/j/ has a higher level of sonority than /r/, the phoneme /r/ should not be treated as a syllable
nucleus initially in words such as rjeka [river].
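The sonority condition on consonantal nuclei can be sketched in Python as follows. The ranks follow the adapted scale used in this paper, written with Serbian Latin letters; the function name `syllabic_sonorants`, the one-letter-per-phoneme simplification, and the omission of the contexts in (7c) and (7d), which are not sonority-based, are assumptions of this example rather than a description of the published implementation.

```python
# Sonority groups from least to most sonorous (Serbian Latin letters).
# Digraphs (lj, nj, dž) and đ are left out to keep the sketch short.
SCALE = ["ptk", "bdg", "cčć", "fšh", "vzž", "s", "mn", "l", "jr", "aeiou"]
RANK = {ch: rank for rank, group in enumerate(SCALE) for ch in group}


def syllabic_sonorants(word):
    """Return the positions at which l, n or r can act as a syllable nucleus."""
    hits = []
    last = len(word) - 1
    for i, ch in enumerate(word):
        if ch not in "lnr":
            continue
        prev_lower = i > 0 and RANK[word[i - 1]] < RANK[ch]
        next_lower = i < last and RANK[word[i + 1]] < RANK[ch]
        if i == 0:
            ok = next_lower                     # initially before a less sonorous consonant
        elif i == last:
            ok = prev_lower and ch in "ln"      # finally: only l and n
        else:
            ok = prev_lower and next_lower      # medially between less sonorous consonants
        if ok:
            hits.append(i)
    return hits


for w in ["trčati", "rvati", "rjeka", "bern", "karl", "bicikl", "vltava"]:
    print(w, syllabic_sonorants(w))
```

Because vowels, /j/ and /r/ rank at least as high as the candidate sonorant, the sketch finds no consonantal nucleus in rjeka, bern or karl, while trčati, rvati, bicikl and vltava each yield one.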
In our previous implementation of the syllabifier (Kovač and Marković 2018),
we attempted to limit our reliance on the Sonority Sequencing Principle to the cases
above. However, during the evaluation of our algorithm, we encountered a number of
syllable structures that were unexpected due to their absence from the onset maximi-
zation approach to syllabification developed for Croatian by Meštrović et al. (2015).
Namely, we encountered the syllable structure CCCCCVC in mo-na-rhstvom [with
the monarchy], the structure CCCCV in the words se-rbska [Serbian], ca-rstva [kingdoms],
and sta-ra-te-ljstva [custody], and the structure CCCCVC in se-rbskom [Serbian],
de-jstvom [with effect], vo-đstvom [leadership], spo-rtskim [sport], and a-lpskog [alpine].
We attempted to remedy this issue by limiting the syllable onset length to
three-consonant clusters, which is the maximum length of non-syllabic consonant clusters
word initially in Serbian (Kašić 2014). This constraint, in combination
with rules (5) and (6), resolved the issues in the examples we encountered. With
this limitation, they are segmented as mo-narh-stvom [with the monarchy], serb-ska
[Serbian] (three-consonant onset limitation + rule (5)), car-stva [kingdoms], sta-ra-telj-stva
[custody], serb-skom [Serbian], dej-stvom [with effect], vođ-stvom [leadership],
sport-skim [sport], alp-skog [alpine]. However, some medial clusters with a syllabic consonant
still remained a problem. For example, in the word najstrpljiviji [most patient], which
contains a syllabic /r/, the syllable boundary would be placed between /na/ and
/jstr/ (na-jstr-pljiviji), which does not coincide with native speaker intuition. The
Sonority Sequencing Principle seems like a perfect solution for these cases, as it
requires the structure of a syllable to follow a sonority scale, with the syllable nucleus
being the most sonorous element and sonority gradually decreasing towards
the periphery of the syllable (Zec 2000). With this added sonority requirement, the
phoneme /j/, being more sonorous than /s/ and /t/, has to constitute a part
of the previous syllable, where it is of lower sonority than its
neighbouring syllable bearing vowel, and the syllable boundary becomes naj-str-pljiviji,
which is in line with native speaker intuition.
As a final check following rules (1)–(8**), we add rule (9) that has the ability to
shift the syllable boundary in order to avoid a violation of the sonority hierarchy.
(9) If the syllable structure resulting from rules (1)–(8**) does not conform to the Sonority
Sequencing Principle, move the boundary so that the phoneme violating the sonority
sequence is shifted into the neighboring syllable.
2 It should be noted that while sonority sequencing accounts for the non-syllabic treatment of /r/ before /je/ in
initial position, our rule extension is still needed as it has a more general scope than the sonority rule and accounts
for segmentation in medial positions as well (e.g. in words such as isko-rje-nilo [eradicated]).
An Adapted Sonority Hierarchy
In our sonority sequencing module, we relied on a combination of Selkirk’s (1984)
sonority scale, the sonority apertures for Serbian described by Subotić et al. (2012),
and some notes on sonority sequencing in Serbian from Zec (2000). Our sonority
scale is shown under (iii).
(iii) p, t, k < b, d, g < ts, tʃ, tɕ < f, ʃ, h < v, z, ʒ < s < m, n, ɲ < l, ʎ < j, r < a, e, i, o, u
The highest sonority group in our implementation is made up of the vowels
of Serbian. As vowels constitute syllable nuclei and there can only be a single vowel
per syllable, we did not need to distinguish between the three sonority apertures
of vowels (i, u < e, o < a), as is the case in the hierarchy of Subotić et al. (2012).
Following Selkirk (1984), we divided sonorants into three sonority classes, and following
Zec (2000), we treated liquids as more sonorous than nasals, and, within liquids,
the phoneme /r/ as more sonorous than laterals. For the needs of our implementation,
we treated the phoneme /r/ and the glide /j/ as a single sonority group, although from a
theoretical standpoint /j/ would be considered the more sonorous of the two given
its semi-vowel nature. Following Selkirk (1984), we opted for treating /s/ as an element
of higher sonority than voiced fricatives despite its voiceless nature, and expanded
Selkirk’s hierarchy with the addition of affricates between voiced plosives and
voiceless fricatives, as a parallel to the aperture order presented by Subotić et al. (2012).
It is important to note that there are sequences which clearly do not conform with
the SSP in a number of languages, and which may undermine the relevance and power
of the sonority hierarchy. A very common pattern, found across a number of unrelated
languages, is the possibility of an /s/ + plosive sequence in the syllable onset, which
would be in clear violation if we were to adopt the sonority scale outlined above. In
Serbian, there is a known ambiguity in syllable segmentation in the case of continu-
ant fricative phonemes. For example, the word postaviti [to set] can be syllabified as
both po-sta-vi-ti and pos-ta-vi-ti (Gvozdanović 2011). We therefore adopt the view
put forward in Morelli (1999), who argues that fricatives and plosives may be treated
as a single class with respect to sonority in these cases — since splitting them into
separate classes would make wrong typological predictions — and add an exception
to our sonority sequencing module that leaves fricative + plosive sequences as a viable
sequence in the syllable onset.
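A minimal sketch of how the scale in (iii) and the fricative + plosive exception might be encoded is given below. The Latin-letter rendering of the scale, the omission of ɲ and ʎ (spelled with the digraphs nj and lj), the name `onset_is_valid`, and the decision to tolerate sonority plateaus are assumptions of this example, not details confirmed by the text.

```python
# Scale (iii) written with Serbian Latin letters, least to most sonorous.
SCALE = ["ptk", "bdg", "cčć", "fšh", "vzž", "s", "mn", "l", "jr", "aeiou"]
RANK = {ch: rank for rank, group in enumerate(SCALE) for ch in group}
PLOSIVES = set("ptkbdg")
FRICATIVES = set("fšhvzžs")


def onset_is_valid(onset):
    """True if sonority never falls towards the nucleus, allowing fricative + plosive."""
    for a, b in zip(onset, onset[1:]):
        falls = RANK[a] > RANK[b]
        exempt = a in FRICATIVES and b in PLOSIVES   # the exception described above
        if falls and not exempt:
            return False
    return True


for onset in ["st", "str", "pl", "jst", "rb"]:
    print(onset, onset_is_valid(onset))
```

With the exception in place, st and str are accepted as onsets even though /s/ outranks /t/ on the scale, whereas jst and rb are still rejected.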
Our Algorithm
Our implementation of the algorithm can be found at https://github.com/versi-regular/rule-based_syllabifier_sr,
licensed under the GNU General Public License v3.0. It was developed using Python 3.x and processes 10,380
tokens/s on average, estimated on a 4,681,713 token corpus processed on an Intel® Core™ i5-3210M CPU @
2.50GHz with 8.00 GB of DDR3L-1600 SODIMM, including pre-processing, clean-up, and transliteration.
Our mixed-principle syllabification algorithm consists of the following steps:
I. Identify vowels in the word and mark their positions as positions capable of con-
stituting syllable nuclei (based on (1)).
II. If a word contains the letters l, n or the letter r not followed by the sequence je in
the center of a consonant cluster consisting of elements of lower sonority or at
the beginning of a word followed by a consonant of lower sonority, or the letters
l or n at the end of a word preceded by a consonant of lower sonority, treat those
positions in the word as capable of constituting syllable nuclei (based on (1*), (7),
and (8**)).
III. For each position identified as capable of constituting a syllable nucleus:
A. If it is followed by a sequence of two sonorants, mark the syllable boundary
between the two sonorants (based on (4)), except if the second sonorant is j
and it is followed by e. If the second sonorant is j followed by e, mark the syllable
boundary before the sonorant cluster (based on (6)).
B. If it is followed by a sequence of a plosive or nasal and a plosive, fricative, affri-
cate or nasal, mark the syllable boundary between the two consonants (based
on (5*)).
C. In all other cases mark the syllable boundary after the syllable nucleus (based
on (1*)).
IV. Run a recursive sonority check (based on (9)):
A. If the word consists of more than one syllable, convert the syllable structures
identified by the previous steps into sonority group values.
B. For each syllable, check if there is a violation of the SSP at the edges of the syllable,
ignoring the check at the onset of the word-initial syllable and the check
in the coda of the word-final syllable.
C. If a violation found is a sequence of a fricative followed by a plosive in the onset,
ignore the violation.
D. If there is a violation, remove the letter from the edge of the syllable, and add it
onto the neighboring syllable.
E. Repeat until no violation is found.
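Step IV can be sketched in Python as a repair pass over syllables that have already been produced by steps I to III. This is an illustrative reading of the step, not the published code: the names `nucleus_index` and `sonority_repair`, the way the nucleus is located, the iterative loop with a pass limit in place of recursion, and the one-letter-per-phoneme simplification are all assumptions.

```python
SCALE = ["ptk", "bdg", "cčć", "fšh", "vzž", "s", "mn", "l", "jr", "aeiou"]
RANK = {ch: rank for rank, group in enumerate(SCALE) for ch in group}
VOWELS = set("aeiou")
PLOSIVES = set("ptkbdg")
FRICATIVES = set("fšhvzžs")


def nucleus_index(syllable):
    """Index of the first vowel, or else of the first r/l/n (assumed syllabic)."""
    for candidates in (VOWELS, set("rln")):
        for i, ch in enumerate(syllable):
            if ch in candidates:
                return i
    return None


def sonority_repair(syllables, max_passes=50):
    """Shift edge letters that violate the SSP into the neighbouring syllable."""
    syls = list(syllables)
    for _ in range(max_passes):          # the limit guards against oscillation
        moved = False
        for i, syl in enumerate(syls):
            nuc = nucleus_index(syl)
            if nuc is None:
                continue
            # Onset edge; skipped for the word-initial syllable (step IV.B).
            if i > 0 and nuc >= 1:
                first, second = syl[0], syl[1]
                exempt = first in FRICATIVES and second in PLOSIVES   # step IV.C
                if RANK[first] > RANK[second] and not exempt:
                    syls[i - 1] += first                              # step IV.D
                    syls[i] = syl[1:]
                    moved = True
                    break
            # Coda edge; skipped for the word-final syllable (step IV.B).
            if i < len(syls) - 1 and nuc < len(syl) - 1:
                if RANK[syl[-1]] > RANK[syl[-2]]:
                    syls[i + 1] = syl[-1] + syls[i + 1]               # step IV.D
                    syls[i] = syl[:-1]
                    moved = True
                    break
        if not moved:                    # step IV.E: stop when nothing changed
            break
    return syls


print(sonority_repair(["na", "jstr", "plji", "vi", "ji"]))
print(sonority_repair(["de", "jstvom"]))
```

On the inputs shown, the sketch moves /j/ into the preceding syllable, giving naj-str-plji-vi-ji and dej-stvom, and it leaves the onset st untouched because of the fricative + plosive exception.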
Syllable Distribution Data
In this section, we present the statistical distribution data of syllables in Serbian
based on our updated syllabification process applied to the Serbian Lemmatized
and PoS Annotated Corpus SrpLemKor (Popović 2010; Utvić 2011). We chose
SrpLemKor for our analysis, because its annotation allowed us to filter out numbers,
Roman numerals, abbreviations and non-Serbian words or suffixes in compounds (at
least to some extent) and thus reduce noise in the data.
The following results show the syllable distribution statistics based on 3,648,543
non-unique word-forms (word tokens) from SrpLemKor. From a total of 4,681,713
entities (punctuation and word tokens) in our version of the corpus, 113,679 (2.43%)
entities of texts #260, #4505 and #4517 were excluded because the files contained
faulty encoding. Based on corpus tags, we excluded 919,161 (19.63%) entities tagged
PUNCT (punctuation), SENT (sentence separator full-stops), RN (Roman numer-
als), NUM @card@ (Arabic numerals), ABB (abbreviations) and ? (non-Serbian
words and other uncategorized entries). An additional 815 (0.02%) entities that con-
tained the characters w, q and x were removed in an attempt to further reduce noise
stemming from foreign words, as not all foreign words were tagged as such in the
corpus. In the process of syllabification, an additional 12,877 (0.28%) entities were
removed as they were solely made up of consonant clusters with no available syllable
nucleus candidate.
Syllable Type Distributions in Serbian
In the 3,648,543 word-forms from SrpLemKor, a total of 8,196,771 syllables were
identified. Table 1 presents the syllable type distribution based on our mixed-principle
syllabification algorithm.
Table 1: Syllable structure distribution of syllables in the SrpLemKor corpus
Syllable structure    No. of instances    Percent
CV                    5030622             61.37321636
CCV                   938275              11.44688561
CVC                   913603              11.14588903
V                     852854              10.40475573
CCVC                  218126              2.661121068
VC                    141980              1.7321455
CCCV                  56168               0.685245446
CVCC                  20339               0.248134296
CCCVC                 14362               0.175215338
CCVCC                 6274                0.076542336
VCC                   2234                0.027254635
CCCCV                 780                 0.009515942
CVCCC                 731                 0.008918146
CCCVCC                170                 0.002073987
CCCCVC                84                  0.001024794
VCCC                  67                  0.000817395
CCCCVC                36                  0.000439197
Other                 66                  0.000805195
Total                 8196771             100
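A distribution of this kind can be computed from syllabified word-forms with a few lines of Python. The sketch below is only an illustration: the names `skeleton` and `distribution`, the hyphen-separated input format, and the choice to mark only vowel letters as V (how syllabic sonorants were coded in Table 1 is not stated here) are assumptions.

```python
from collections import Counter

VOWELS = set("aeiou")


def skeleton(syllable):
    """Map a syllable to its C/V pattern; only vowel letters count as V (assumption)."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())


def distribution(syllabified_words):
    """Count syllable structures and return {structure: (count, percent)}."""
    counts = Counter(
        skeleton(syl) for word in syllabified_words for syl in word.split("-")
    )
    total = sum(counts.values())
    return {s: (n, 100 * n / total) for s, n in counts.most_common()}


sample = ["či-ta-ti", "lep-tir", "sun-ce"]
for structure, (count, percent) in distribution(sample).items():
    print(f"{structure:<6}{count:>4}{percent:>10.2f}")
```

For the three sample words the sketch counts four CV and three CVC syllables.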
These results show the distribution of syllables in somewhat noisy data. We
found that the corpus still contains foreign words annotated as non-foreign, which account
for some of the less frequent syllable structures listed as “Other” in Table 1. For example,
an instance of the syllable structure VCCCCC was found to correspond to the seg-
mentation of the German word Pe-itscht [lashes], the syllable structure CCCCVCCC
was identified in the German word Fle-i-schmarkt [meat market], and the structure
CCCCCVC was found in the German word Gle-i-chschal-tung [co-ordination]. The
structure CCCCCCVC was found in the German word Na-chtschat-ten [nightshade]
and in the toponym CRYSLER. The syllable structure CCVCCCC was found in the
source transcription of the last name Pe-tritsch and in the English word knights. The
syllable structure CCCVCCC was identified to be a part of the German words Wol-
fsmilch [spurge] and E-in-ge-schickt [sent in] and to correspond to the English word
string. The syllable structure CCCCCCV was identified in the German words
We-i-hna-chtsbra-e-u-che [Christmas customs] and Stor-chschna-bel [crane’s bill], while the structure
CCCCCV was found in the words Re-chtsge-schi-chte [history of law] and Um-gan-
gsspra-che [vernacular], as well as in the sequences šttske and su-žnjstva. The syllable
structure CCCCVCC was found in the German word Ze-it-schrift [magazine], and in
multiple occurrences of the source spelling of the last names Schmidt and Rot-hchild.
The structure VCCCC was found in the German words Deutsch [German] and Ernst
[seriousness], in the sequence der-demnaechst [soon], and in the strings ikvbv and EHCmc.
As the examples above show, besides words of foreign origin, noise in the data also
stems from typos and from strings we did not manage to identify. Another example
of such a string was ngBpJKTnQ, identified as the structure VCCCCCCCC. Most structures
identified as CVCCCC were the result of typos, e.g. serbsk, kra-levstv, pod-danstv,
carstv, slav-jansk, ju-go-slo-venskg, cr-no-gorskg, but they also included names of foreign
origin, e.g. Hirsch, Herbst, Lokotsch, and Worlds, as well as strings such as majnds and
Gorrrr. In addition, there was one occurrence of the syllable structure CVCCCCCCCC,
which stood for the onomatopoeic vulgarism mršššššššš [go away].
We also found two syllable structures that were not among the structures reported by
Meštrović et al. (2015) for Croatian. The structure CCCCVC was identified in the
words vo-đstvom [with leadership], za-ko-no-da-vstvom [with legislature], mo-nar-hstvom
[with monkhood], lu-ka-vstvom [with slyness], be-zzglob-na [without wrists],
and in the paradigm members of the word po-sthlad-no-ra-to-vski [post-cold-war]. It
also occurred in the Russian word Zdra-vstvuj [hello], in the German-origin word
Ha-up-tstrum-fi-rer [mid-level commander], in the German Ra-u-schmit-tel [intoxicant]
and Li-e-be-spflan-ze [love plant], and in the misspelled Serbian words pri-ja-tljskih
[friendly] and kvdrat [square]. The structure CCCCV was found in the words bi-vstvu
[existence], va-zdu-ho-plo-vstvo [aviation], kra-lje-vstva [kingdoms], zdra-vstve-noj
[health], vo-đstvo [leadership], ču-vstva [feeling], pre-i-mu-ćstva [advantages], and
mo-gu-ćstvu [possibility]. It also occurred in German words such as Pfin-gstro-se [peony],
Ke-u-schhe-it [chastity], Schne-e-glo-ec-kchen [snowdrop], Schne-e-ro-se [Christmas rose],
Ge-i-sskle-e [cystus], Vol-ksbra-uch [popular custom], Vol-ksgla-u-ben [popular belief],
Schri-ften [regulations], Schlu-e-ssel-blu-me [cowslip], and more. We discuss the implica-
tions of these for our syllabification algorithm in the Discussion section below.
Syllable Type Positional Distributions in Serbian
We also examined the syllable type frequencies with respect to their position in a
word. Four positional frequencies are presented in Table 2: syllable type frequencies
in monosyllabic words, and syllable type frequencies in the initial position, in medial
positions, and in the final position of polysyllabic words.
Table 2: Syllable structure distribution of syllables in the SrpLemKor corpus categorized by
position
           Monosyllabic
           words             Polysyllabic words
Syllable   MONO              INITIAL           MEDIAL            FINAL
structure  No.      Percent  No.      Percent  No.      Percent  No.      Percent
CV         612214   50.382   1356771  56.064   1476732  68.956   1584905  65.49
CCV        62244    5.122    372181   15.379   305247   14.254   198603   8.21
CVC        129337   10.644   178859   7.391    211979   9.898    393428   16.26
V          301295   24.795   369133   15.253   61241    2.860    121185   5.01
CCVC       35428    2.916    50383    2.082    53397    2.493    78918    3.26
VC         64038    5.270    67539    2.791    7123     0.333    3280     0.14
CCCV       174      0.014    19754    0.816    20260    0.946    15980    0.66
CVCC       5368     0.442    1052     0.043    695      0.032    13224    0.55
CCCVC      1490     0.123    3976     0.164    4427     0.207    4469     0.18
CCVCC      1635     0.135    206      0.009    17       0.001    4416     0.18
VCC        1125     0.093    162      0.007    18       0.001    929      0.04
CCCCV      14       0.001    21       0.001    381      0.018    364      0.02
CVCCC      579      0.048    3        0.000    1        0.000    148      0.01
CCCVCC     105      0.009    0        0.000    0        0.000    65       0.00
CCCCVC     1        0.000    0        0.000    25       0.001    58       0.00
VCCC       45       0.004    0        0.000    0        0.000    22       0.00
CCCCVC     11       0.001    0        0.000    0        0.000    25       0.00
Other      38       0.003    0        0.000    7        0.000    21       0.00
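The four positional categories in Table 2 follow directly from the number of syllables in a word and the index of each syllable. The short Python sketch below shows one way such a tally could be organized. The function names are hypothetical, and the input is assumed to be a list of structure labels per word, produced by whatever syllabifier is in use.

```python
from collections import Counter


def positions(n_syllables):
    """Return the positional label of each syllable in a word."""
    if n_syllables == 1:
        return ["MONO"]
    return ["INITIAL"] + ["MEDIAL"] * (n_syllables - 2) + ["FINAL"]


def positional_counts(words):
    """Tally (position, structure) pairs.

    `words` is an iterable of lists of structure labels, one list per word,
    e.g. ["CV", "CCV", "CV", "CV"] for a four-syllable word.
    """
    counts = Counter()
    for structures in words:
        for pos, structure in zip(positions(len(structures)), structures):
            counts[(pos, structure)] += 1
    return counts
```

A disyllabic word contributes one INITIAL and one FINAL syllable and no MEDIAL ones, which is why the MEDIAL column can only be populated by words of three or more syllables.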
Based on SrpLemKor, the most frequent monosyllabic syllable structures in
Serbian are CV (50%), V (25%) and CVC (11%). The most frequent syllable struc-
tures in the initial position of polysyllabic words are CV (56%), CCV (15%) and V
(15%). In medial positions in polysyllabic words, the most frequent syllable structures
are CV (69%), CCV (14%) and CVC (10%). The most frequent syllable structures in
the final position of polysyllabic words are CV (65%), CVC (16%) and CCV (8%).
It is interesting to note the asymmetry that the syllable structures CCCVCC, VCCC,
and CCCCVC occurred only in monosyllabic words and in the final position of poly-
syllabic words, while the syllable structure CCCCVC occurred in all positions except
the initial position in polysyllabic words.
Syllable Nuclei Statistics in Serbian
The distribution of different syllable nuclei in Serbian based on the SrpLemKor
corpus is presented in Table 3.
Table 3: Syllable nuclei statistics and positional frequencies of syllables in the SrpLemKor
corpus
                           Monosyllabic
                           words             Polysyllabic words
Nucleus  TOTAL             MONO              INITIAL           MEDIAL            FINAL
         No.      Percent  No.      Percent  No.      Percent  No.      Percent  No.      Percent
a        2177498  26.566   330629   27.209   604764   24.990   585787   27.353   656318   27.120
e        1646579  20.088   304442   25.054   447662   18.498   394573   18.425   499902   20.657
i        1730439  21.111   230637   18.980   394735   16.311   600823   28.056   504244   20.836
l        939      0.011    326      0.027    32       0.001    77       0.004    504      0.021
n        1261     0.015    409      0.034    544      0.022    33       0.002    275      0.011
o        1753091  21.388   168126   13.836   671752   27.758   385687   18.010   527526   21.798
r        88021    1.074    1898     0.156    66250    2.738    19560    0.913    313      0.013
u        798943   9.747    178674   14.704   234301   9.682    155010   7.238    230958   9.544
Based on the positional nucleus distribution data, it can be seen that overall /a/
and /o/ constitute the most frequent nuclei in Serbian. However, there is some positional
variation. In the initial and final positions of polysyllabic words, the most frequent
nuclei are also /a/ and /o/, whereas in medial positions they are /i/ and /a/, and in
monosyllabic words /a/ and /e/.
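The ranking of nuclei in each column can be read off Table 3 mechanically. The Python sketch below copies the percentage values from Table 3 and returns the most frequent nuclei per column. The dictionary layout and the function name are illustrative only.

```python
# Percentages copied from Table 3; keys are the column labels of the table.
PERCENT = {
    "TOTAL":   {"a": 26.566, "e": 20.088, "i": 21.111, "l": 0.011,
                "n": 0.015, "o": 21.388, "r": 1.074, "u": 9.747},
    "MONO":    {"a": 27.209, "e": 25.054, "i": 18.980, "l": 0.027,
                "n": 0.034, "o": 13.836, "r": 0.156, "u": 14.704},
    "INITIAL": {"a": 24.990, "e": 18.498, "i": 16.311, "l": 0.001,
                "n": 0.022, "o": 27.758, "r": 2.738, "u": 9.682},
    "MEDIAL":  {"a": 27.353, "e": 18.425, "i": 28.056, "l": 0.004,
                "n": 0.002, "o": 18.010, "r": 0.913, "u": 7.238},
    "FINAL":   {"a": 27.120, "e": 20.657, "i": 20.836, "l": 0.021,
                "n": 0.011, "o": 21.798, "r": 0.013, "u": 9.544},
}


def top_nuclei(column, k=2):
    """Return the k most frequent nuclei in the given column of Table 3."""
    ranked = sorted(PERCENT[column].items(), key=lambda kv: kv[1], reverse=True)
    return [nucleus for nucleus, _ in ranked[:k]]
```

Calling `top_nuclei` for each of the five column labels gives the two leading nuclei per position.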
Discussion
While our mixed-principle rule-based syllabification algorithm is suitable for the
segmentation of words into syllables following the ruleset we devised by combining
prescriptive rule descriptions with the Sonority Sequencing Principle (SSP), there are
still some practical and theoretical considerations to be addressed.
When reporting the syllable distribution data, we noted that the 3,648,543 word-forms
extracted from SrpLemKor, which we used to calculate the statistics on the distribution
of syllables and their structure in Serbian, still contained some noise such as foreign
words, typos, and possibly random character strings. A human evaluator checked 500 random
samples taken from the syllable output data, and on that basis we estimate the amount of
such noise to be below 2%. Given the nature of corpus-based data, this noise should not
significantly impact the reliability of the distributional information.
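Because the noise estimate rests on a sample of 500 items, it carries some sampling uncertainty. Purely as an illustration, the Python sketch below computes a 95% Wilson score interval for a sample proportion. The count of 9 noisy items is hypothetical (only the bound of 2% is reported), and the function name is illustrative.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes out of n trials."""
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical count: 9 noisy items in a sample of 500 (1.8%).
low, high = wilson_interval(k=9, n=500)
```

With a sample of this size the interval spans a few percentage points at most, which is in line with the expectation that the noise does not materially affect the distributional figures.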
From a theoretical standpoint, in formulating our algorithm, we disregarded the
three-syllable consonant cluster limitation put forward by Kašić (2014) in favor of
exploring the limitations of the sonority module. The occurrence of the two syllable
types CCCCVC and CCCCV, which were not present in the onset-maximization-
based syllabification algorithm for Croatian (Meštrović et al. 2015), shows that in
a limited number of instances this constraint is needed to exclude syllable clusters
that are in accordance with the SSP and prescriptive rule descriptions, but seem contrary
to native speaker intuition about syllable boundaries. In addition, syllable segmentation
is ambiguous in the case of continuant fricative phonemes (Gvozdanović 2011), where the
continuant can occupy either the first place in the onset of the syllable or the last
place in the coda of the previous syllable; postaviti [to set], for example, can be
syllabified as either po-sta-vi-ti or pos-ta-vi-ti. A larger-scale study examining the
intuition of native speakers on syllabification would be required to establish
contemporary tendencies in segmentation in these contexts.
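A length constraint of this kind is straightforward to check on syllabified output. The Python sketch below flags syllables whose onset exceeds a given number of consonants. It assumes, as an illustration, a maximum of three onset consonants; it counts letters rather than phonemes and ignores syllabic r, l and n, so it is a rough approximation with hypothetical function names.

```python
VOWELS = set("aeiou")


def onset_length(syllable):
    """Number of letters before the first vowel (whole string if no vowel)."""
    for i, ch in enumerate(syllable.lower()):
        if ch in VOWELS:
            return i
    return len(syllable)


def long_onsets(syllabified, max_onset=3):
    """Return the syllables of a '-'-segmented word whose onset is too long."""
    return [s for s in syllabified.split("-") if onset_length(s) > max_onset]
```

Under these assumptions `long_onsets("vo-đstvom")` returns the syllable đstvom, while both po-sta-vi-ti and pos-ta-vi-ti pass the check. An onset-length limit therefore filters the four-consonant onsets but does not by itself resolve the fricative ambiguity.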
In order to verify the syllabic status of different clusters, it would be interesting to
conduct a series of monitoring studies modeled after Mehler et al. (1981), who have
shown that reaction times to a word are faster if the word is primed by a sequence cor-
responding to a syllable in the word when compared to priming with a string that does
not constitute a syllable. Bradley et al. (1993) argue that these effects produce mixed
results in some languages which contain a large number of ambisyllabic segments, so
these studies may also reveal whether and to what extent syllables play a role in pre-
lexical processing in Serbian.
Conclusion
In this paper we presented a mixed-principle rule-based syllabifier modelled after
the rule descriptions found in Stanojčić and Popović (2005), extended by rule specifi-
cations from Kašić (2014) and Zec (2000), and complemented by a sonority sequenc-
ing module based on Selkirk (1984), Subotić et al. (2012), and Zec (2000).
An implementation of the existing prescriptive rules for the segmentation of
words into syllables allowed us to gain an insight into the problem areas of the rule
descriptions, and propose a number of revisions and amendments to the existing rules.
The sonority sequencing module revealed the need for an additional onset-length limi-
tation constraint, and pointed out the limitations of sonority in ambiguous consonant
clusters that would require further exploration and validation by native speaker intui-
tion. We have also gained an insight into the distribution of different syllable structures
and syllable nuclei following this approach, which will be useful for comparison with
the performance of alternative syllabification systems.
In the future, we plan to compare our system to an onset-maximization-based syl-
labifier for Serbian in combination with the prescriptive rules to see if we can create
an alternative system that will produce outputs consistent with the intuition of native
speakers of Serbian. It would be interesting to see a systematic comparison of our
current approach and the onset-maximization approach with data gathered from the
intuition of contemporary native speakers of Serbian.
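One simple way to quantify such a comparison is to measure how many syllable boundaries two segmentations of the same word share. The Python sketch below is a hypothetical illustration, not a planned evaluation protocol: it represents each segmentation as a set of boundary offsets and computes their Jaccard overlap.

```python
def boundaries(syllabified):
    """Return the letter offsets at which syllable boundaries fall."""
    offsets, pos = set(), 0
    for part in syllabified.split("-")[:-1]:
        pos += len(part)
        offsets.add(pos)
    return offsets


def boundary_agreement(system, reference):
    """Jaccard overlap of the boundary sets of two segmentations of one word."""
    a, b = boundaries(system), boundaries(reference)
    if not a and not b:
        return 1.0                    # two unsegmented words agree trivially
    return len(a & b) / len(a | b)
```

For example, po-sta-vi-ti and pos-ta-vi-ti share two of four distinct boundary positions, so `boundary_agreement` returns 0.5 for that pair.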
We also believe that, while phonological criteria present a basis for syllabifica-
tion, in the future we will also need to test whether and to what extent approaches
based solely on phonological criteria result in syllable boundaries that coincide with
morphological boundaries. Our assumption is that phonological rules will need to be
amended by morphological criteria to result in syllabification that respects morpho-
logical boundaries as well.
In addition to these, the question of the treatment of foreign origin words and
transcribed foreign words might be an additional point to consider. As an extension of
a syllabifier, a language detection algorithm might be employed to properly seg-
ment the former, while the latter might not need special treatment as the process of
transcription should in itself contain a degree of phonological adaptation.
Acknowledgment
This research was supported by the Serbian Ministry of Education and Science
under the projects Development of Dialogue Systems for Serbian and Other South
Slavic Languages (TR-32035) and Languages and Cultures in Time and Space
(ON-178002).
Sources and Literature
Literature:
• Barber, Horacio, Marta Vergara, and Manuel Carreiras. 2004. “Syllable-frequency Effects in Visual
Word Recognition: Evidence from ERPs.” Neuroreport 15 (3): 545–48.
• Bradley, Dianne C., Rosa M. Sánchez-Casas, and José E. García-Albea. 1993. “The Status of the
Syllable in the Perception of Spanish and English.” Language and Cognitive Processes 8 (2): 197–233.
• Bigi, Brigitte, and Caterina Petrone. 2014. “A Generic Tool for the Automatic Syllabification of
Italian.” In Proceedings of The First Italian Conference on Computational Linguistics, CLiC-it, 73–77.
Pisa: Pisa University Press. http://siti.fileli.unipi.it/projects/clic/proceedings/Proceedings-
CLICit-2014.pdf.
• Butt, Matthias. 1992. “Sonority and the Explanation of Syllable Structure.” Linguistische Berichte
137: 45–67.
• Cholin, Joana, Willem J. M. Levelt, and Niels O. Schiller. 2006. “Effects of Syllable Frequency in
Speech Production.” Cognition 99 (2): 205–35.
• Cholin, Joana, and Willem J. M. Levelt. 2009. “Effects of Syllable Preparation and Syllable Frequency
in Speech Production: Further Evidence for Syllabic Units at a Post-lexical Level.” Language and
Cognitive Processes 24 (5): 662–84.
• Clements, George N. 1990. “The Role of the Sonority Cycle in Core Syllabification.” In Papers in
Laboratory Phonology I: Between the Grammar and Physics of Speech, edited by John Kingston
and Mary E. Beckman, 282–333. Cambridge: Cambridge University Press.
• Daelemans, Walter, and Antal van den Bosch. 1992. “Generalization Performance of
Backpropagation Learning on a Syllabification Task.” In Connectionism and Natural Language
Processing: Proceedings of the 3rd Twente Workshop on Language Technology, TWLT3, 27–38.
Enschede: University of Twente, Department of Computer Science. https://pure.uvt.nl/portal/
files/760578/generalization.pdf.
• Foley, James. 1972. “Rule Precursors and Phonological Change by Meta-rule.” In Linguistic
Change and Generative Theory, edited by Robert P. Stockwell and Ronald K. S. Macaulay, 96–100.
Bloomington: Indiana University Press.
• Goldsmith, John A. 1995. The Handbook of Phonological Theory. London: Blackwell Publishers.
• Gvozdanović, Jadranka. 2011. “Phonological Domains.” In Sandhi Phenomena in the Languages of
Europe, edited by Henning Andersen, 27–54. Berlin: Mouton de Gruyter.
• Hankamer, Jorge, and Judith Aissen. 1974. “The Sonority Hierarchy.” In Papers from the Parasession
on Natural Phonology, edited by Anthony Bruck, Robert Allen Fox, and Michael W. La Galy, 131–
45. Chicago: Chicago Linguistic Society.
• Hunt, Andrew. 1993. “Recurrent Neural Networks for Syllabification.” Speech Communication 13
(3–4): 323–32.
• Iacoponi, Luca, and Renata Savy. 2011. “Sylli: Automatic Phonological Syllabification for Italian.”
In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication
Association, 641–44. Florence: International Speech Communication Association. http://eden.
rutgers.edu/~li51/php/papers/interspeech2011.pdf.
• Kaplar, Sebastijan, Marija Radojičić, Ivan Obradović, Biljana Lazić, and Ranka Stanković. 2018.
“Solution for Quantitative Analysis of Texts in Serbian Based on Syllables.” In ICIST 2018
Proceedings 2, 315–20. Belgrade: Society for Information Systems and Computer Networks.
http://www.eventiotic.com/eventiotic/library/paper/429.
• Kašić, Zorka. 2014. “Opšta lingvistika 2 (Fonologija).” Lecture Materials, Faculty of Philosophy,
University of Belgrade.
• Koehler, Klaus J. 1966. “Is the Syllable a Phonological Universal?” Journal of Linguistics 2: 207–208.
• Kovač, Aniko, and Maja Marković. 2018. “A Rule-Based Syllabifier for Serbian.” In Proceedings of
the Conference on Language Technologies and Digital Humanities 2018, 140–46. Ljubljana: Ljubljana
University Press.
• Ladefoged, Peter, and Keith Johnson. 2014. A Course in Phonetics. Belmont: Wadsworth Publishing.
• Ladefoged, Peter. 1982. A Course in Phonetics. New York: Harcourt Brace Jovanovich.
• Landsiedel, Christian, Jens Edlund, Florian Eyben, Daniel Neiberg, and Björn Schuller. 2011.
“Syllabification of Conversational Speech Using Bidirectional Long-Short-Term Memory Neural
Networks.” In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 5256–9. Prague: IEEE. http://ieeexplore.ieee.org/abstract/document/5947543.
• Marchand, Yannick, Connie R. Adsett, and Robert I. Damper. 2009. “Automatic Syllabification in
English: A Comparison of Different Algorithms.” Language and Speech 52 (1): 1–27.
• Mehler, Jacques, Jean Yves Dommergues, Uli Frauenfelder, and Juan Segui. 1981. “The Syllable’s
Role in Speech Segmentation.” Journal of Verbal Learning and Verbal Behavior 20 (3): 298–305.
• Meštrović, Ana, Sanda Martinčić-Ipšić, and Mihaela Matešić. 2015. “Postupak automatskoga
slogovanja temeljem načela najvećega pristupa i statistika slogova za hrvatski jezik.” Govor, 32:
3–34.
• Morelli, Frida. 1999. “The Phonotactics and Phonology of Obstruent Clusters in Optimality
Theory.” PhD diss., University of Maryland.
• Ohala, John, and Haruko Kawasaki. 1984. “Prosodic Phonology and Phonetics.” Phonology
Yearbook, 1: 113–27.
• Ohala, John. 1990. “The Phonetics and Phonology of Aspects of Assimilation.” In Papers in
Laboratory Phonology I, edited by John Kingston and Mary E. Beckman, 258–75. Cambridge:
Cambridge University Press.
• Popović, Zoran. 2010. “Taggers Applied on Texts in Serbian.” INFOtheca 11 (2): 21a–38a.
• Selkirk, Elisabeth O. 1984. “On the Major Class Features and Syllable Theory.” In Language Sound
Structure, edited by Mark Aronoff and Richard T. Oehrle, 107–36. Cambridge: MIT Press.
• Stanojčić, Živojin, and Ljubomir Popović. 2005. Gramatika srpskoga jezika. Belgrade: Zavod za
udžbenike i nastavna sredstva Beograd.
• Stoianov, Ivelin, John Nerbonne, and Huub Bouma. 1997. “Modelling the Phonotactic Structure
of Natural Language Words with Simple Recurrent Networks.” In Computational Linguistics in the
Netherlands 1997: Selected Papers from the Eighth CLIN Meeting, 77–95. Amsterdam: Rodopi.
• Subotić, Ljiljana, Dejan Sredojević, and Isidora Bjelaković. 2012. Fonetika i fonologija: Ortoepska i
ortografska norma standardnog srpskog jezika. Novi Sad: Filozofski fakultet Univerziteta u Novom
Sadu.
• Utvić, Miloš. 2011. “Annotating the Corpus of Contemporary Serbian.” INFOtheca 12 (2): 36a–37a.
• Zec, Draga. 2000. “O strukturi sloga u srpskom jeziku.” Južnoslovenski filolog 56 (1–2): 435–48.
Aniko Kovač, Maja Marković
A MIXED-PRINCIPLE RULE-BASED APPROACH TO THE
AUTOMATIC SYLLABIFICATION OF SERBIAN
SUMMARY
In this paper we present a mixed-principle rule-based approach to the automatic
syllabification of Serbian based on prescriptive rule descriptions from traditional
grammar found in Stanojčić and Popović (2005), extended by rule specifications from
Kašić (2014) and Zec (2000), and complemented by a sonority sequencing module
based on Selkirk (1984), Subotić et al. (2012), and Zec (2000).
Syllable segmentation plays a role in speech technologies – most notably in the
areas of speech recognition and text-to-speech synthesis – at both the segmental and
prosodic levels. It is also one of the governing factors in hyphenation, and syllable
frequency distribution data is used in psycholinguistic experiments as a covariate. The
unavailability of segmented data for Serbian makes a rule-based approach to automatic
syllabification the only viable option as there is no data available for training a data-
driven neural network model, and the segmentation of large-scale language corpora
by trained annotators would be a resource- and cost-intensive undertaking.
Our goal in this paper is threefold: i) we extend and improve an earlier version of our
syllabification algorithm by introducing a sonority sequencing validation module which
resolves a number of issues present in the earlier version of our syllabifier, ii) we provide
a detailed analysis of the outcomes of the automatic syllabification process in order to
address possible theoretical considerations and serve as a basis for the development of
future syllabifiers, and iii) we present the statistical data related to the distribution of
syllables and their structure in Serbian to be used in psycholinguistic experiments.
The implementation of the existing set of prescriptive rules for the segmentation
of words into syllables in Serbian allowed us to gain an insight into problem areas of
the rule descriptions, and propose a number of revisions and amendments to the exist-
ing rules. The sonority sequencing module revealed the need for an additional onset-
length limitation constraint, and pointed out the limitations of sonority in ambiguous
consonant clusters – such is the case with continuant fricative phonemes that seem to
be able to occupy either the first place in the onset of a syllable or the last place in the
coda of a previous syllable – that would require further exploration and validation by
native speaker intuition.
The data on the distribution of different syllable structures and syllable nuclei
following this approach will be useful for comparison with the performance of alter-
native syllabification systems. In the future, it would be interesting to see a systematic
comparison of our current approach to alternative approaches such as an onset-maxi-
mization approach evaluated on segmentation data gathered from the native speakers
of Serbian.
Aniko Kovač, Maja Marković
A MIXED APPROACH TO AUTOMATIC SYLLABIFICATION IN
SERBIAN BASED ON PRINCIPLES AND RULES
SUMMARY
In this paper we present a mixed approach to automatic syllabification in Serbian,
based on principles and rules. It draws on the prescriptive rule descriptions of
traditional grammar (as given by Stanojčić and Popović 2005), extended by rule
specifications (as given by Kašić (2014) and Zec (2000)) and complemented by a
sonority sequencing module (based on Selkirk 1984; Subotić et al. 2012; Zec 2000).
Syllable segmentation plays an important role in speech technologies, especially in
speech recognition and text-to-speech conversion, at both the segmental and the
prosodic level. It is also one of the governing factors in hyphenation. Data on
syllable frequency distribution are used in psycholinguistic experiments as a
covariate. A rule-based approach to automatic syllabification is the only sensible
choice, since no segmented data are available for Serbian from which a neural network
model could learn. A project in which trained annotators segmented large language
corpora would be very demanding and expensive.
Our paper has three goals: i) to extend and improve the earlier version of our
syllabification algorithm by introducing a sonority sequencing validation module,
which resolves a number of issues in the earlier version of our syllabifier; ii) to
present a detailed analysis of the results of the automatic syllabification process
in order to encourage possible theoretical considerations and provide a basis for the
development of future syllabifiers; and iii) to present statistical data on the
distribution and structure of syllables in Serbian that can be used in
psycholinguistic experiments.
Applying the established set of prescriptive rules for segmenting words into
syllables in Serbian gave us a detailed insight into the problem areas of the rule
descriptions and allowed us to propose a number of revisions and amendments to the
established rules. The sonority sequencing module revealed the need for an additional
onset-length constraint and highlighted the limitations of sonority in ambiguous
consonant clusters (for example fricatives, which can apparently occupy either the
first place at the beginning of a syllable or the last place at the end of the
preceding syllable), which would need to be explored further and validated by native
speaker intuition.
The data on the distribution of different syllable structures and nuclei obtained
with this approach can be used for comparison with the performance of other
syllabification systems. It would be interesting to carry out a systematic comparison
of our approach with other approaches, for example an onset-maximization approach,
evaluated on segmentation data gathered from native speakers of Serbian.
1.01 UDC: 003.295:342.537.6:355.012(492)”1940/1945”
Milan M. van Lange,* Ralf D. Futselaar**
Debating Evil: Using Word
Embeddings to Analyse
Parliamentary Debates on War
Criminals in the Netherlands
ABSTRACT
DEBATING EVIL: ANALYSING PARLIAMENTARY DEBATES ON
WAR CRIMINALS IN THE NETHERLANDS WITH WORD EMBEDDINGS
We present a method for investigating changes in historical discourse that uses large text
corpora and word embedding models. As a case study, we investigate the debates on the
punishment of war criminals in the Dutch parliament in the period 1945–1975. We will show
how word embedding models trained with Google’s Word2Vec algorithm can be used to trace
the historical development of parliamentary vocabulary through time.
Keywords: war criminals, history of punishment, parliamentary history,
Word2Vec, word embedding models
ABSTRACT
We are proposing a method to investigate changes in historical discourse by using large
bodies of text and word embedding models. As a case study, we investigate discussions in
Dutch Parliament about the punishment of war criminals in the period 1945–1975. We
will demonstrate how word embedding models, trained with Google’s Word2Vec algorithm,
can be used to trace historical developments in parliamentary vocabulary through time.
Keywords: War Criminals, Penal History, Parliamentary History, Word2Vec, Word
Embedding Models
* NIOD, Institute for War, Holocaust and Genocide Studies, Herengracht 380, 1016CJ Amsterdam, The Netherlands, m.van.lange@niod.knaw.nl
** NIOD, Institute for War, Holocaust and Genocide Studies, Herengracht 380, 1016CJ Amsterdam, The Netherlands, r.futselaar@niod.knaw.nl
The Case: War Criminals
Soon after German forces in the Netherlands surrendered in May of 1945, the
question arose of how the hundreds of suspected war criminals and thousands of Nazi
collaborators in Dutch custody were to be treated. For the next five decades, this
question caused a series of heated political controversies. The debates in Dutch par-
liament about the punishment, penalty reduction, or release of these people are not
only among the longest debates in Dutch parliamentary history, but are generally con-
sidered to have been the most emotionally charged (Bootsma and Griensven 2003;
Futselaar 2015; Tames 2013).
Discourse and Controversy
In this paper, we use an implementation of word embedding models (WEMs)
to analyse parliamentary discussions concerning incarcerated war criminals and Nazi
collaborators after the end of the German occupation. At the peak, in the summer of
1945, more than a hundred thousand people were incarcerated. They were accused
of a variety of crimes, all committed during the occupation of the country: political
and military collaboration, war crimes, and (complicity in) genocide. The majority of
these prisoners were civilians, whose crimes amounted to little more than membership
of national socialist organisations. These people, and other small fry, were released
quickly. A small and dwindling number of serious offenders remained in prison, some
of them until 1989. After the 1960s, all remaining prisoners were former German offi-
cials and officers, whose initial death sentences had been commuted to life in prison.
These prisoners became the flashpoint of intense political and media attention. As long
as they remained behind bars, plans for their release continued to resurface, and cause
political controversy (Piersma 2005; Tames 2013; Futselaar 2015; Grevers 2013).
The main medium of parliamentary communication is spoken language. We aim
to demonstrate that a systematic investigation of the verbatim records of the language
used in Dutch parliament to discuss these cases can reveal historical change. The results
will enable us to track the vocabularies in these discussions through time. We assume
that this vocabulary, as we will call it, reflects the changing parliamentary discourse
about incarcerated war criminals in Dutch society. We aim to link these developments
in parliamentary vocabulary to actual historical events, developments concerning the
post-war dealing with war criminals, and discursive shifts in Dutch society (Olieman
et al. 2017). Specifically, we aim to investigate the changing political attitude towards
incarcerated war criminals and use our findings to test established notions prevalent
in Dutch historiography.
The published proceedings of the two houses of parliament provide us with a dataset
comprising all the words spoken in the plenary sessions. The completeness of the
parliamentary dataset allows us to investigate the changing parliamentary vocabulary
through time, and in the context of different discussions.
We here focus on two questions directly related to the treatment of these delin-
quents in the Dutch penal system. The first of these concerns the focus on the identi-
fication of the wronged party: did politicians focus on crimes against the Dutch nation
as a whole, or against specific groups of individual victims? The second concerns the
appropriateness of harsh punishments, specifically whether or not life imprisonment
was considered a just alternative for the death penalty. These questions both derive
directly from historiography and serve to answer an overarching question: can we
assess the validity of traditional scholarship using unsupervised text mining?
Parliamentary Proceedings
In this investigation, we rely entirely on parliamentary proceedings, known in
Dutch as the Handelingen der Staten-Generaal. The Handelingen are available in
machine-readable form. The minutes of both houses of parliament for the period 1814–
1995 were first digitised by the Royal Library of the Netherlands and made available
to the public in 2010. The dataset was dramatically improved in the PoliticalMashUp
project that ran from 2012 to 2016. This improved and enriched dataset is freely avail-
able, on request, from DANS, the Dutch national repository of research data. The
dataset consists of a large collection of XML files containing the complete minutes
of all the meetings of the lower and upper chambers of parliament, separated by date,
speaker, political affiliation, etc. This makes it an excellent corpus for various forms of
automated text analysis (Marx et al. 2012).
Word Embedding Models and Historical Research
We investigate the vocabularies used in parliament to discuss a broad category of
inmates that could be described as political delinquents, as well as the changes in these
vocabularies through time. This is a fairly normal investigation to undertake in traditional
historical research, that is to say, research without computational analysis. Historians
typically work by reading the relevant texts. This enables them to use and expand
their domain knowledge while processing the data. Although this hermeneutic step is
143M. M. Lange, R. D. Futselaar: Debating Evil: Using Word Embeddings to Analyse …
inevitably part of historical research, this approach has several disadvantages. In this
particular case the corpus to be assessed is enormous, making reading and manual
encoding of text problematic. More importantly, the traditional research process is
highly vulnerable to the biases of the reader/researcher. When studying ethically
charged controversies in the relatively recent past, this vulnerability to bias is evidently
problematic. People with an interest in recent history and knowledge of the Dutch
language almost inevitably hold an opinion on these issues and on the actors in the
debate. How do we ensure that our personal political preferences do not influence our
reading of the source materials?
Words in Vector Space
A WEM provides a possible solution to these problems. WEMs are techniques
to investigate words, and relations between words, in large text corpora. WEMs are
based on the calculation of the average distance of unique words to all other unique
words in a corpus. The position of each unique word can then be described as a list of
numerical values, representing its distance to all other unique words. This list of values
is called the ‘vector’ of the word. In principle, the number of values, also referred to as
‘coordinates’, or ‘dimensions’ of the vector, is the same as the number of unique words
in the text, minus one. The complete trained corpus, or ‘spatial model’, is often referred
to as a vector space. The method does not prioritize any particular words; the position
of each unique word is investigated and given a vector in the model.
The vectors of words within a corpus can be compared. That is to say, the closeness
of one vector to another can be calculated. High closeness often reflects a close seman-
tic relationship. Some words with similar vectors are synonyms or near synonyms, or
have very similar usages (tea and coffee, for example). Here, we use cosine similarity
to calculate the closeness of vectors, although other methods are also feasible.
Since the position of unique words relative to other words is an average calculated
on the basis of all occurrences in the text, WEMs are exceptionally effective at inves-
tigating relations between relatively frequent words in a sufficiently large text corpus.
For historical research, insight into these relations is very useful, and goes far beyond
mere closeness. With WEMs we are able to identify associations between words that
are not self-evident and would not have been found by traditional means (Schmidt
2015).
Limitations of WEMs
WEMs also have an important downside that is particularly relevant to histori-
cal research. Since the training of the model determines the position of a word rela-
tive to all other words in that specific corpus, its vector is meaningless in any other
144 Prispevki za novejšo zgodovino LIX - 1/2019
model. Word vectors, hence, can only be compared with other word vectors within
the same spatial model. For historians, this means that comparisons between differ-
ent moments in time are difficult. To make a comparison through time it would be
necessary to divide the corpus into subsets representing different periods. For each
of these period-specific corpora, a new model, based on a subset of the corpus, needs
to be trained. Since vectors of different WEMs are not readily comparable, change
through time is difficult to investigate with WEMs. This means that, while WEMs are
perfectly adequate tools for fulfilling the first of our aims, investigating vocabularies,
they are virtually useless for the second aim, investigating change through time. Since
change through time is the core of virtually all historical research (including this
investigation), this presents us with a major problem: how can we compare the outcomes of
different WEMs, trained on different periods in time?
We have, however, developed a workaround that enables us to use WEMs to investigate
changing ways of talking about certain topics through time. We do not directly compare
vectors from different models. Instead, within each model we calculate the cosine
similarity of a vocabulary to the same reference terms, and we compare these relative
positions across the models.
Word2Vec
For this investigation, we have used the relatively popular Word2Vec implemen-
tation of WEMs to train and analyse word embedding models. Word2Vec was devel-
oped by a team of Google engineers and published in 2013. It has been shown to be
a particularly effective implementation. This algorithm, however, was developed with
a different aim than the one for which we are using it. Initially, Word2Vec was a tool
to investigate natural language itself, for example to identify (near) synonyms. In our,
historical, investigation, the statistical modelling of language as such is not the objec-
tive. Rather than trying to identify linguistic regularities to investigate language, we
focus on linguistic irregularities and patterns to identify the influence of political and
historical change on the language used in political speech.
For researchers using the R programming language, the wordVectors package is readily
available to train and analyse such models. This package, created and maintained by
Benjamin Schmidt, has been used in this investigation as well (Schmidt 2015, 2017). Our method, however, is in
no way dependent on this particular platform and could also be used in Python or
any other environment. Neither is the method reliant on the Word2Vec algorithm. It
would work broadly in the same way with another implementation of word embed-
dings. Here, however, we have chosen to use a popular WEM implementation in a
relatively user-friendly and accessible environment, with the added benefit of using
open-source, free software.
Analytical Process
Text analysis with WEMs involves two necessary steps. The first of these, training on
the corpus, creates the spatial model, the WEM itself. The second step is the
analysis of the positions of specific words or word clusters within the virtual space of
the model.
The corpus of the Handelingen is vast by the standards of historical research (millions
of words per year), but not very large for the kind of analysis we are undertaking.
For the purpose of WEMs, the size is barely adequate. We have therefore trained
our models with the Skip-Gram variant of Word2Vec, which has anecdotally been shown to
yield better results on smaller samples (Gelbukh 2015). The vectors of different words
can be compared within the model by using cosine similarity. Within a vector space,
any two vectors can be described, by definition, as lying within a single plane.
Cosine similarity is the cosine of the angle between these vectors. Perfectly overlapping
vectors result in a cosine similarity of 1, perfectly opposite vectors in -1. In
practice, WEMs consist only of positive space, which means that scores fall between 0
(low, or no, similarity) and 1 (high, or perfect, similarity) (Singhal 2001).
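The cosine similarity described here can be illustrated with a minimal sketch. It is written in Python with the standard library only, whereas the investigation itself used the wordVectors package in R; the function name and the toy three-dimensional vectors are assumptions made for illustration and are not taken from the trained models.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors, invented for illustration only.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Because only the angle matters, the length of a vector (which in a WEM is partly related to word frequency) does not affect the score.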
Training the Models
The first step of our workaround is to train two WEMs (more than two is equally
feasible), based on two subsets of the corpus (in this case 1945–1955 and 1965–1975).
Each of these subsets contains ten years of parliamentary speeches. When using this
approach, it is necessary to use relatively similar training corpora, both in terms of
size and in terms of language use. For historical research into relatively short periods
of parliamentary history, this is not particularly problematic. For reasons of efficiency,
we have limited ourselves to unique words that appear at least five times in the corpus
and we have limited the number of dimensions of each vector to one hundred. This
allows this investigation to be undertaken, and repeated, using fairly normal office-grade
hardware. We have experimented with more dimensions (several hundred), but
more dimensions appear to be useful only with larger corpora. Training WEMs with several
hundreds of dimensions also requires far more computational power.
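The preparation that precedes training can be sketched as follows. This is a hedged illustration in standard-library Python, not the code used for this paper: the input format (a year paired with a list of tokens), the function names, and the inclusive treatment of the period boundaries are all assumptions. The actual training of each model is deliberately left out.

```python
from collections import Counter

# Assumed input format: one (year, tokens) pair per speech.
PERIODS = [(1945, 1955), (1965, 1975)]


def split_by_period(speeches, periods):
    """Group tokenised speeches into one sub-corpus per (start, end) period.

    Both boundary years are treated as inclusive, which is an assumption.
    """
    subsets = {period: [] for period in periods}
    for year, tokens in speeches:
        for start, end in periods:
            if start <= year <= end:
                subsets[(start, end)].append(tokens)
    return subsets


def vocabulary(sub_corpus, min_count=5):
    """Return the unique words that occur at least min_count times."""
    counts = Counter(token for tokens in sub_corpus for token in tokens)
    return {word for word, n in counts.items() if n >= min_count}


# Toy data, invented for illustration only.
toy_speeches = [
    (1946, ["doodstraf"] * 5 + ["zeldzaam"]),
    (1960, ["buiten", "beide", "perioden"]),
    (1972, ["slachtoffers"] * 6 + ["doodstraf"] * 2),
]
subsets = split_by_period(toy_speeches, PERIODS)
vocab_early = vocabulary(subsets[(1945, 1955)])
vocab_late = vocabulary(subsets[(1965, 1975)])
```

Each sub-corpus would then be passed to a separate training run with identical settings (here: one hundred dimensions and a minimum count of five), so that the resulting models differ only in their input.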
Analysing Word Vectors
Within each spatial model, we have identified the 250 words with the highest
cosine similarity to the Dutch terms for ‘war criminal’ (singular and plural, see Table
1). With these 250 nearest neighbours, we have defined the time-specific vocabulary
used in the discussion of war criminals. Obviously, these are not the same 250
words in each model. To identify changes in the discussions surrounding our topic,
we calculated the cosine similarity of each of the 250 nearest-neighbour words in each
model to two different terms that are present in each of the two corpora. This allows us
to compare the position of the vocabulary of the discussion on our topic (war crimi-
nals) in relation to, in this case, two stable concepts. The selection of these concepts
is crucial for our investigation and for this method. It is here that we translate our
research question into a formal, computational inquiry.
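Selecting the nearest neighbours of a query term can be sketched as follows. The sketch assumes that a trained model is available as a plain Python dictionary mapping words to vectors; this representation, the function names, the optional exclusion of the query words themselves, and the toy two-dimensional vectors are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_neighbours(model, query_vector, n=250, exclude=()):
    """Return the n (word, similarity) pairs most similar to query_vector."""
    scored = [
        (word, cosine_similarity(vector, query_vector))
        for word, vector in model.items()
        if word not in exclude
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy model, invented for illustration only.
toy_model = {
    "oorlogsmisdadiger": [0.9, 0.1],
    "gratie": [0.8, 0.3],
    "gevangenis": [0.6, 0.6],
    "begroting": [0.0, 1.0],
}
neighbours = nearest_neighbours(
    toy_model,
    toy_model["oorlogsmisdadiger"],
    n=2,
    exclude={"oorlogsmisdadiger"},
)
neighbour_words = [word for word, _ in neighbours]
```

Running the same selection in each period model yields one vocabulary per period; the two lists will generally contain different words.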
For now, we have chosen a two-dimensional implementation of this technique.
This is not theoretically necessary, but it allows us to visualize and analyse results
more easily in two dimensions. What is important is that concepts used to investigate
the relative position of each investigated word are the same in each of the models to
be compared. It is also necessary that the concepts are relatively stable through time.
Since concepts are represented by words in the corpus itself, words that shift meaning
dramatically, such as the English word ‘gay’, are less suitable than ‘cheerful’ or ‘homo-
sexual’, which have not undergone such dramatic change over time.
When discussing concepts, the number of possible words referring to the same
concept is often greater than one. Since our investigation focuses on concepts that
may be described with multiple words, we need to create a so-called combined vector.
We used synonyms and plurals to create a cluster of words with the shared meaning
of the concept of interest. This cluster was used as a combined vector in the model by
calculating the mean of all the vectors of the cluster words. That is to say, this word
set was treated as a single term, resulting in a vector with the same number of dimensions as a single-word
vector. This combined vector allows us to investigate our corpus using all synonyms
and near-synonyms of terms as if they were a single term, with a single vector.
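The combined vector can be sketched as the element-wise mean of the vectors in a word set. The dictionary representation of the model, the function name, the decision to skip words that are absent from a model, and the toy vectors are assumptions for illustration.

```python
def combined_vector(model, words):
    """Return the element-wise mean of the vectors of the given words.

    Words that do not occur in the model are skipped (an assumption).
    """
    vectors = [model[word] for word in words if word in model]
    if not vectors:
        raise KeyError("none of the words occur in the model")
    dimensions = len(vectors[0])
    return [
        sum(vector[i] for vector in vectors) / len(vectors)
        for i in range(dimensions)
    ]


# Toy model, invented for illustration only.
toy_model = {
    "doodstraf": [0.2, 0.8, 0.0],
    "doodstraffen": [0.4, 0.6, 0.2],
}
death_penalty = combined_vector(toy_model, ["doodstraf", "doodstraffen"])
```

The result has as many dimensions as any single-word vector, so it can be compared with other vectors in the same model by cosine similarity.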
Table 1: Word sets used in Debating Evil
Concept             Concept represented by combined vector of the Dutch words
Death penalty       ‘doodstraf’ and ‘doodstraffen’
Life imprisonment   ‘levenslang’, ‘levenslange’, ‘vrijheidsstraf’, ‘gevangenisstraffen’, ‘gevangenisstraf’, ‘opsluiting’, and ‘hechtenis’
Treason/traitor     ‘landverrader’, ‘landverraders’, ‘verrader’, ‘verraders’, and ‘landverraad’
Victim              ‘slachtoffer’ and ‘slachtoffers’
War criminal        ‘oorlogsmisdadiger’ and ‘oorlogsmisdadigers’
After selecting two concepts that are present in each of the two corpora, we can
calculate the relative similarity of other terms in the corpus to each of them. Although
vectors between the two trained WEMs are not comparable, the relative distance to
two or more other vectors can be compared very well across several models, provided
the underlying concepts are historically stable. When the terms used to estimate the
relative position of vocabularies are related but dissimilar, or even perfectly opposite,
a historically meaningful analysis becomes viable.
Using two concepts allows us to plot our ‘vocabulary’, that is, the top 250
war-criminal-related words in each of the two periods, in a two-dimensional space. Figures
1 and 2 show the similarity scores of each of the 250-word vocabularies relative to one
concept that serves as the y-axis, and another on the x-axis. Each point represents one
of the 250 words that form the war-criminal vocabulary for a specific time period.
They are plotted based on their cosine similarity score to the combined vector of the
concept ‘victim’ (x) and ‘treason’ (y) in Figure 1, and to ‘life imprisonment’ (x) and
‘death penalty’ (y) in Figure 2. The average scores of all 250 war criminal words on
the two dimensions are shown as horizontal and vertical lines. Thus, we have arrived
at a visual representation that allows for a comparison of word embedding results for
more than one corpus and hence for a comparison through time (in this case, between
two distinct historical periods).
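The comparison across models can be sketched as follows. Each period model is again assumed to be a dictionary of word vectors; the function names and toy vectors are assumptions for illustration. The two toy models deliberately use different coordinate systems, so their raw vectors cannot be compared, while the similarity scores to the two reference concepts can.

```python
import math
from statistics import mean


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def concept_coordinates(model, vocabulary, concept_x, concept_y):
    """Map each word to (similarity to concept_x, similarity to concept_y)."""
    return {
        word: (
            cosine_similarity(model[word], concept_x),
            cosine_similarity(model[word], concept_y),
        )
        for word in vocabulary
    }


def mean_position(coordinates):
    """Return the mean x and mean y score of a plotted vocabulary."""
    xs, ys = zip(*coordinates.values())
    return mean(xs), mean(ys)


# Toy period models, invented for illustration only.
early_model = {
    "slachtoffer": [1.0, 0.0],
    "landverrader": [0.0, 1.0],
    "collaborateur": [0.2, 0.9],
    "verraad": [0.3, 0.8],
}
late_model = {
    "slachtoffer": [0.0, 1.0],
    "landverrader": [1.0, 0.0],
    "gratie": [0.3, 0.9],
    "trauma": [0.2, 0.8],
}

early = concept_coordinates(
    early_model, ["collaborateur", "verraad"],
    early_model["slachtoffer"], early_model["landverrader"],
)
late = concept_coordinates(
    late_model, ["gratie", "trauma"],
    late_model["slachtoffer"], late_model["landverrader"],
)
early_mean = mean_position(early)  # (mean similarity to victim, to treason)
late_mean = mean_position(late)
```

The per-word pairs correspond to the points of a scatterplot, and the two means correspond to the vertical and horizontal lines drawn for each period.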
Results
Here, we present only two examples using four concepts and two time periods
(1945–1955 and 1965–1975). Specifically, we try to identify differences in the way
incarcerated war criminals and collaborators were discussed in the immediate after-
math of the Nazi occupation of the Netherlands, and at the height of controversies
surrounding the intended release of a number of German war criminals from Dutch
prisons, namely Kotälla, Aus der Fünten, and Fischer (Piersma 2005).
Obviously, the discussions in the two periods refer to different groups of perpe-
trators. In the immediate aftermath of the Nazi occupation the population of inmates
was large and diverse, consisting of small-time war profiteers, minor collaborators and
their families, but also mass murderers. In the second period, only a handful of elderly
foreigners were left, whose crimes were relatively similar and also similarly egregious.
For this investigation, however, our primary aim is not to unearth radically new
insights into post-war penal policy in the Netherlands, but to confront the results of
an unsupervised, ‘distant’ reading of parliamentary records with an established
historiography. Such a historiography is available for the case at hand: Dutch historians
have identified a number of trends in the thinking about political delinquents that (if
true) should be reflected in these discussions. Two changes have been identified in
particular:
1. A turn in focus from the nature of the crime committed and the person of the
perpetrator towards the lasting, psychological damage endured by the victims
(Heijden 2012; Haan 1997).
2. A decline in the support, both public and political, for harsh, vengeful punishments,
exemplified here in the discussions about the propriety of the death penalty.
Although the death penalty was (again) abolished in the 1950s, it remained a
point of discussion with regard to war criminals in custody (Futselaar 2015; Smits
2008).
Historical Case
Over the course of three decades, attitudes to incarcerated war criminals, as represented
by the vocabularies used to discuss them, changed. In the first period the
emphasis lay on crimes against the collective, whereas in the second the focus shifted
towards the plight of individual victims. As can be seen in Figure 1, the initial emphasis on
crimes against the nation (treason) in debates about war criminals declined. The average
cosine similarity between war-criminal words and treason words (horizontal lines)
decreased significantly when we compare 1945–1955 to 1965–1975. At the same
time, we observed increased closeness in vector space between war-criminal-related
words and words associated with (individual) victims, as can also be seen in Figure 1.
Figure 1: Top 250 war criminal related words 1945–1955 (grey) and 1965–1975 (black)
plotted by their cosine similarity to victim (x) and traitor (y) words.
At first glance, this observation is completely in line with the relevant historiography.
Several authors have emphasized the sharp rise of interest in the mental health
of individual war victims and their families as a decisive factor in policy making and the
formation of political opinion. Figure 1 also indicates the observed shift in discourse
from focusing on the initial crimes, committed by the war criminals, to the conse-
quences of their deeds for individual people involved (Haan 1997; Heijden 2012;
Smits 2008; Withuis 2002).
This development cannot, however, be considered a mere discursive change: the
observed shifts in parliamentary vocabulary represent actual historical developments
in the post-war treatment of war criminals. In the early 1970s, the only war criminals
remaining in Dutch prisons were German nationals. In 1945, by contrast, the majority
of the more than one hundred thousand incarcerated war criminals were Dutch citizens.
Evidently, the accusation of treason was only applicable to the latter group. Hence, if
we compare the two periods, it is not surprising that the discursive element of ‘treason’
decreased in importance in the war criminal vocabulary in Dutch parliamentary
debates between 1965 and 1975.
Although the shifts in vocabulary indicate an observable shift in
discourse, we have to stress that our analysis also indicates continuity in the parliamentary
vocabulary of 1945–1955 and 1965–1975. The scatterplots in Figure 1 indicate
a shift, but not a complete turn of the parliamentary vocabulary on war
criminals: the nearest neighbours of war-criminal-related words from both periods
overlap when scored on closeness to both treason and victim words. We have thus
observed a significant change, but also continuity and a lasting importance
of perpetration and treason in the war criminal debates.
It is imperative to remain aware of the possible pitfalls of this type of investigation.
This is evident in the sharp rise of references to the death penalty in war
criminal vocabulary that we observed (see Figure 2). During the second period under
scrutiny, capital punishment had long been discontinued in the Netherlands and could
not have been discussed as a serious penal option. Closer scrutiny of the data revealed
that in many discussions, capital punishment was not advocated, but merely used as a
reference point. The war criminals in question had originally been condemned to die,
but their punishment had been commuted to life imprisonment. Several members
of parliament felt that a pardon would mean that the original verdict (death penalty)
would be watered down twice. In these discussions, capital punishment was often ref-
erenced, even when its application was not a viable (or even legal) option (Futselaar
2015).
Figure 2: Top 250 war criminal related words 1945–1955 (grey) and 1965–1975 (black)
plotted by their cosine similarity to life imprisonment (x) and death sentence words (y).
Conclusion
This paper outlines a method for studying discursive changes in history. We trained
WEMs and calculated cosine similarities between two opposite or related concepts for
specific periods. This enabled us to compare WEMs for different periods. This opens
the door for the use of word embeddings as a tool for historical research, because it
enables us to investigate change through time in sufficiently large and consistent his-
torical textual datasets. Parliamentary records are perhaps the best example of such
datasets. This method holds considerable promise because parliamentary proceedings
and other historical sources are increasingly digitised and made available in machine-
readable form.
We have shown how developments in vocabulary can be considered reflective of
discursive changes. These changes are related to historical events and developments
in the post-war dealing with war criminals in Dutch society. Recent historiography
has suggested a dramatic shift away from the crime committed by war criminals and
towards the consequences of these deeds for victims and their relatives. We do recog-
nize that victims became more prominent in discussions about war criminals, but this
did not diminish the importance of the deed they committed. In other words, the shift
is there, but it appears to be far less radical than suggested.
We could also demonstrate that actual historical developments regarding the type
of war criminals incarcerated in the Netherlands (from many local convicts, to a hand-
ful of foreigners) were reflected by a discursive shift, in which closeness to ‘treason’
declined. German officials, in the eyes of post-war Dutch parliamentarians, did not
commit treason by committing crimes against the Dutch nation.
We have also encountered examples of pitfalls of an overly enthusiastic reliance
on word embeddings as an analytical tool. Capital punishment was mentioned par-
ticularly frequently in the 1970s, but not because the possibility of executing the war
criminals was seriously entertained. Distributional semantics are a powerful new tool
for historians, but they do not remove the need for hermeneutic awareness. In this
paper, the method is itself the main object of inquiry. We believe we have shown that
it is possible, feasible, and useful to develop and implement a coherent and widely
applicable method for investigating historical change using WEMs.
Discussion
Method Evaluation
For this paper, we have used two corpora, each representing ten years of parliamentary
debate, to train our WEMs. More interesting, from a research perspective,
would be to find out how stable our results are when using smaller, overlapping windows
of corpora over time, say with one-year steps. It is likely (but not certain) that
using more fine-grained windows will reveal similar developments and shifts in language
use over time. Repeating the analysis with more data points has the potential
to yield more insight into the gradualness and pace of the observed shifts in language
use. That said, there is a potential trade-off between detail and precision, given that
the corpora available to historians are mostly modest in size.
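The overlapping windows proposed here could be generated along the following lines. This is only a sketch of a possible follow-up: the function name, the window width of five years, and the inclusive boundaries are assumptions, since no such analysis has been carried out for this paper.

```python
def sliding_windows(first_year, last_year, width=5, step=1):
    """Return (start, end) year pairs, both inclusive, that fit in the range."""
    windows = []
    start = first_year
    while start + width - 1 <= last_year:
        windows.append((start, start + width - 1))
        start += step
    return windows


# Hypothetical five-year windows moving forward one year at a time.
windows = sliding_windows(1945, 1975, width=5, step=1)
```

One model would be trained per window, so narrower windows give more data points but leave less text for each model, which is the trade-off noted above.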
A second ambition is to look more seriously into the distribution of the cosine
similarity scores, and the changes in these distributions over time. It will be interesting
to measure, visualise, and statistically evaluate these distributions more closely, and
to see whether they can be linked to, for example, unanimity and/or homogeneity in
parliamentary discussions.
Historical Evaluation
Another remaining ambition is to compare the parliamentary vocabularies used
to discuss ‘domestic’ collaborators and foreign (usually German) war criminals.
Furthermore, we also hope to position the war criminal debates in a broader context:
how distinct are they from other war related debates, and from other discussions about
penal law or criminals in a more general sense? Just as a closer investigation of different
categories of perpetrators is viable and useful, the different groups of war victims who
were discussed in parliamentary debates also warrant further investigation. These may
have included first- and second-generation victims of wartime violence and persecution,
former forced labourers, Holocaust survivors, and the children of Holocaust victims.
Given the emphasis on the protection of war victims mentioned above, we
are interested to see if there have been changes in the groups emphasized in political
debate about the topic.
Acknowledgements
We are grateful to the participants of our Text Mining workshop at the Luxembourg
Centre for Contemporary and Digital History (C2DH) in Esch-sur-Alzette (June
2018), for their comments, input, and criticism. We would also like to thank the
participants and organisers of the Language Technologies and Digital Humanities
Conference in Ljubljana (September 2018).
Sources and Literature
Datasets and Academic Software:
• Van Lange, Milan. Debating Evil Repository. Distributed by GitHub.
https://github.com/MilanvanL/debating_evil.
• Marx, M., J. Van Doornik, A. Nusselder, and L. Buitinck. 2012. “Thematic Collection:
PoliticalMashup and Dutch Parliamentary Proceedings 1814–2013.” Distributed by Data Archiving
and Networked Services (DANS). https://doi.org/10.17026/dans-zg8-9x2v.
• Schmidt, Benjamin. 2017. “Bmschmidt/WordVectors: Tools for Creating and Analyzing Vector-
Space Models of Texts Version 2.0 from GitHub.” GitHub. Accessed on November 5, 2017.
https://rdrr.io/github/bmschmidt/wordVectors/.
• Bache, Stefan Milton, and Hadley Wickham. 2014. Magrittr: A Forward-Pipe Operator for R (version
1.5). https://CRAN.R-project.org/package=magrittr.
Literature:
• Bootsma, Peter, and Peter van Griensven. 2003. “‘Teleurstelling Is Mijn Opperste Emotie’: Vragen
over Emotie in de Politiek Aan A.A.M. van Agt.” In Jaarboek Parlementaire Geschiedenis, 2003.
Emotie in de Politiek, edited by Carla van Baalen, Willem Breedveid, Jan Willem Brouwer, Peter van
Griensven, Jan Ramakers, and Inke Secker, 121–25. Den Haag: SDU Uitgevers.
• Futselaar, Ralf. 2015. Gevangenissen in oorlogstijd: 1940–1945. 1st ed. Amsterdam: Boom.
• Gelbukh, Alexander. 2015. Computational Linguistics and Intelligent Text Processing: 16th
International Conference, CICLing 2015, Cairo, Egypt, April 14–20, 2015, Proceedings. Springer.
• Grevers, Helen. 2013. Van landverraders tot goede vaderlanders: de opsluiting van collaborateurs in
Nederland en België, 1944–1950. Amsterdam: Balans.
• Haan, Ido de. 1997. Na de ondergang: de herinnering aan de Jodenvervolging in Nederland 1945–
1995. Den Haag: SDU.
• Heijden, Chris van der. 2012. Dat nooit meer: de nasleep van de Tweede Wereldoorlog in Nederland.
3rd ed. Amsterdam: Atlas Contact.
• Olieman, Alex, Kaspar Beelen, Milan van Lange, Jaap Kamps, and Maarten Marx. 2017. “Good
Applications for Crummy Entity Linkers? The Case of Corpus Selection in Digital Humanities.”
CoRR abs/1708.01162. http://arxiv.org/abs/1708.01162.
• Piersma, Hinke. 2005. De Drie van Breda: Duitse Oorlogsmisdadigers in Nederlandse Gevangenschap,
1945–1989. 1st ed. Amsterdam: Balans.
• Schmidt, Benjamin. 2015. “Vector Space Models for the Digital Humanities.” Ben’s Bookworm
Blog. Accessed October 25, 2015.
http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html.
• Singhal, Amit. 2001. “Modern Information Retrieval: A Brief Overview.” Bulletin of the IEEE
Computer Society Technical Committee on Data Engineering 24: 9.
• Smits, Hans. 2008. Strafrechthervormers en hemelbestormers: opkomst en teloorgang van de Coornhert-
Liga. Amsterdam: Aksant.
• Tames, Ismee. 2013. Doorn in het vlees: foute Nederlanders in de jaren vijftig en zestig. Erfenissen van
Collaboratie. Amsterdam: Balans.
• Withuis, Jolande. 2002. Erkenning: van oorlogstrauma naar klaagcultuur. Amsterdam: De Bezige Bij.
Milan M. van Lange, Ralf Futselaar
DEBATING EVIL: USING WORD EMBEDDINGS TO
ANALYSE PARLIAMENTARY DEBATES ON WAR
CRIMINALS IN THE NETHERLANDS
SUMMARY
This paper presents a case study to investigate the application of text mining tech-
niques in historical research. We demonstrate the usability, advantages, and limitations
of distributional semantics when investigating large diachronic historical datasets
with word embedding models (WEMs). WEMs are applied to a large digitised and
machine-readable historical dataset, namely the verbatim proceedings of both houses
of Dutch parliament for the period 1945–1975.
WEMs are techniques to investigate relations between words in large corpora.
WEMs are based on the calculation of the average distance of unique words to all other
unique words in a corpus. The position of each unique word can then be described
as a list of numerical values, representing its distance to all other words. This list of
values is called the ‘vector’ of the word. These numerical vectors can be compared.
That is to say, the closeness of one vector to another can be calculated. High closeness
often reflects a close semantic relationship between words. Some words with similar
vectors are (near) synonyms or have very similar usages (tea and coffee, for example).
For historical research, insight into these relations is very useful. It goes far beyond mere
closeness. With WEMs we are able to identify associations between words that are not
self-evident and would not have been found by traditional means.
The paper uses WEMs to investigate a case study on the vocabulary in parlia-
mentary discussions concerning the punishment, incarceration, and release of Nazi
collaborators and war criminals in the Netherlands. We identify changes related to
historical events and developments in the post-war dealing with war criminals. Recent
historiography on the topic has suggested a dramatic shift away from the crime com-
mitted by war criminals and towards the consequences of these deeds for victims
and their relatives. We focus on two questions directly related to the treatment of
these delinquents in the Dutch penal system. The first of these concerns the focus on
the identification of the wronged party: did politicians focus on crimes against the
Dutch nation as a whole, or against specific groups of individual victims? The second
concerns the appropriateness of harsh punishments, specifically whether or not life
imprisonment was considered a just alternative to the death penalty. These questions
both derive directly from historiography and serve to answer an overarching question:
can we assess the validity of traditional scholarship using text mining?
In the paper we show how victims became more prominent in discussions about
war criminals. This did not, however, diminish the importance of the deed they committed.
In other words, the shift is there, but it appears to be far less radical than suggested.
We also demonstrate that actual historical developments regarding the type of
war criminals incarcerated in the Netherlands (from many local convicts in 1945, to a
handful of foreigners in the 1970s) were reflected by a discursive shift in the debates.
This paper also shows examples of pitfalls of an overly enthusiastic reliance on WEMs
as an analytical tool in historical research. Capital punishment was mentioned particu-
larly frequently in the debates of the 1970s, but not because MPs discussed the actual
possibility of executing the war criminals.
To conclude: distributional semantics are a powerful new tool for historians, but
they do not remove the need for hermeneutic awareness. In this paper, the method
is itself the main object of inquiry. We believe we have shown that it is possible, feasible,
and useful to develop and implement a coherent and widely applicable method
for investigating historical change using WEMs. We believe that the outcomes of this
investigation show that WEMs can be a useful and powerful tool in historical research,
provided they are used cautiously and with sufficient domain knowledge.
Milan M. van Lange, Ralf Futselaar
DEBATES ON EVIL: ANALYSING PARLIAMENTARY
DEBATES ON WAR CRIMINALS IN THE NETHERLANDS WITH
WORD EMBEDDINGS
ABSTRACT
This paper presents a case study examining the use of text mining methods
in historical research. We present the usability, advantages,
and limitations of distributional semantics in the study of large diachronic historical
datasets with word embedding models (WEMs).
We used WEMs to analyse large digitised and
machine-readable historical datasets, namely the verbatim records of proceedings
in both houses of the Dutch parliament for the period 1945–1975.
Modeli WEM so metode za proučevanje povezav med besedami v obsežnih korpusih.
Temeljijo na izračunu povprečne oddaljenosti edinstvenih besed od vseh drugih
edinstvenih besed v korpusu. Položaj vsake edinstvene besede se potem lahko opiše kot
seznam numeričnih vrednosti, ki predstavlja njeno oddaljenost od vseh drugih besed.
Seznam vrednosti se imenuje “vektor” besede. Te numerične vektorje je mogoče pri-
merjati. To pomeni, da je mogoče izračunati, kako blizu so si posamezni vektorji. Če
so si zelo blizu, to pogosto pomeni, da so besede tesno semantično povezane. Nekatere
besede s podobnimi vektorji so (skoraj) sopomenke ali imajo zelo podobno rabo (na
primer čaj in kava). Vpogled v te povezave je zelo koristen za zgodovinske raziskave
in presega samo vprašanje bližine. Z modeli WEM lahko prepoznamo povezave med
besedami, ki niso očitne in jih ne bi bilo mogoče najti na tradicionalne načine.
V prispevku smo uporabili modele WEM za proučitev študije primera besedišča
iz parlamentarnih razprav o kaznovanju, zaporni kazni in izpustitvi nacističnih kola-
borantov in vojnih zločincev na Nizozemskem. Ugotavljali smo spremembe, pove-
zane z zgodovinskimi dogodki in dogajanjem v povojni obravnavi vojnih zločincev. V
novejšem zgodovinopisju, posvečenem tej tematiki, lahko opazimo precejšen premik
od zločinov, ki so jih zagrešili vojni zločinci, k posledicam teh dejanj za žrtve in njihove
sorodnike. Osredotočili smo se na dve vprašanji, ki sta neposredno povezani z
obravnavo teh zločincev v nizozemskem sistemu kazenskega pregona. Prvo vpraša-
nje je povezano z osredotočanjem na opredelitev žrtev: ali so se politiki osredotočali
na zločine proti nizozemskemu narodu kot celoti ali proti posameznim skupinam
individualnih žrtev? Drugo vprašanje zadeva ustreznost strogih kazni, zlasti ali je
dosmrtna zaporna kazen veljala za pravično alternativo smrtni kazni. Obe vprašanji
izhajata neposredno iz zgodovinopisja in omogočata odgovor na širše vprašanje: ali
lahko presojamo tehtnost tradicionalne znanosti z rudarjenjem besedil?
V prispevku smo pokazali, kako lahko žrtve dobijo pomembnejše mesto v razpra-
vah o vojnih zločincih. S tem pa se ni zmanjšal pomen dejanj, ki so jih zločinci zagrešili.
Povedano drugače, premik je mogoče opaziti, vendar se zdi, da je precej manjši od
pričakovanega. Pokazali smo tudi, da so se dejanski zgodovinski dogodki, povezani z
vojnimi zločinci, ki so bili na Nizozemskem kaznovani z zaporom (od številnih lokal-
nih obsojencev leta 1945 do nekaj tujcev v sedemdesetih letih 20. stoletja), izrazili v
diskurzivnem premiku v razpravah. V prispevku so prikazani tudi primeri različnih
pasti, ki jih prinese preveč navdušeno opiranje na modele WEM kot analitično orodje
v zgodovinskih raziskavah. Smrtna kazen se je pogosto omenjala predvsem v razpravah
v sedemdesetih letih 20. stoletja, vendar ne zato, ker bi poslanci razpravljali o dejanski
možnosti usmrtitve vojnih zločincev.
Zaključimo lahko, da je distribucijska semantika koristno novo orodje za zgodo-
vinarje, vendar to ne pomeni, da hermenevtična zavest ni več potrebna. V tem pri-
spevku je glavni predmet proučevanja sama metoda. Menimo, da smo dokazali, da je
mogoče, izvedljivo in koristno razviti in uporabljati usklajeno ter za široko rabo pri-
merno metodo za proučevanje zgodovinskih sprememb z modeli WEM. Verjamemo,
da rezultati te raziskave dokazujejo, da so modeli WEM lahko koristno in uporabno
orodje v zgodovinskih raziskavah, če jih uporabljamo previdno in z ustreznim znanjem.
157A. Pančur: Sustainability of Digital Editions: Static Websites of the History of Slovenia …
1.01 UDC: 004.774-026.11
Andrej Pančur*
Sustainability of Digital Editions:
Static Websites of the History of
Slovenia – SIstory Portal
IZVLEČEK
TRAJNOST DIGITALNIH IZDAJ: STATIČNE SPLETNE STRANI
PORTALA ZGODOVINA SLOVENIJE – SISTORY
Prispevek izhaja iz stališča, da je pri digitalnih izdajah potrebno poskrbeti za čim bolj
celovito digitalno trajnost tako podatkov kot prezentacij, funkcionalnosti in programske
kode. To je velik izziv predvsem za manjše digitalno humanistične projekte z omejenim
financiranjem, ki ne omogoča dolgoročnega vzdrževanja tehnično zahtevnih digitalnih
izdaj. Kot alternativno rešitev so v prispevku predstavljene rešitve, ki jih v zadnjih letih
ponuja hiter razvoj statičnih spletnih strani. Digitalne izdaje, ki temeljijo na TEI, so s pomo-
čjo osnovnih XML (XSLT) in spletnih tehnologij (HTML, CSS, JavaScript) kot statične
spletne strani uspešno vključene v repozitorij portala SIstory. Vse statične spletne strani
imajo tudi možnost dinamičnega prikazovanja vsebine.
Ključne besede: digitalne izdaje, digitalno kuratorstvo, TEI, XSLT, statične spletne
strani
ABSTRACT
The contribution is based on the position that, with regard to digital editions, the highest
possible degree of digital sustainability of data, presentations, functionalities, and programme
code should be ensured. This represents a significant challenge, especially in the case
of smaller digital humanities projects with limited financing, which does not allow for the
long-term maintenance of technically demanding digital editions. The alternative solutions
facilitated by the swift development of static websites in recent years are presented in the
* Institute of Contemporary History, Kongresni trg 1, SI-1000 Ljubljana, andrej.pancur@inz.si
contribution. Digital editions based on the TEI have been successfully included in the SIstory
portal repository as static websites, employing basic XML (XSLT) and web technologies
(HTML, CSS, JavaScript). All the static websites also have the possibility of displaying
dynamic content.
Keywords: digital editions, digital curation, TEI, XSLT, static website
Introduction
In digital humanities, the awareness of the importance of digital sustainability and
permanent preservation of digital sources has been present for a long time (Schaffner
and Erway 2014, 7). The research data of an individual project usually outlives the pro-
ject in the context of which it has been collected, organised, and published. Therefore
it is very important to ensure a high-quality and sustainable storage of digital data even
after the project itself has been concluded.
In recent years, the technical aspects of research data management and long-term
archiving (metadata, archive formats, preservation media, and documentation)
have been the subject of intensive discussions. Only lately, however, have we begun
to realise that the preservation of data in accordance with the specific requirements
of various scientific disciplines is almost more important for the high-quality man-
agement and reuse of this data (Moeller et al. 2018). While in the natural and social
sciences the data from measurements and questionnaires is typically used, in the
humanities the use of cultural objects like manuscripts, texts, pictures, and recordings
is predominant. Moreover, researchers in the humanities will usually also process,
visualise, tag, link, and interpret digital cultural objects (DHd-AG Datenzentren 2017,
7).
Such data processing is particularly important in the case of digital editions, which are
a crucial part of digital humanities (Andorfer et al. 2016). Digital scholarly
editions largely consist of the research through which different transcriptions,
indications, analyses, explanations, etc., are produced. Such research data in particular
should therefore be available to the research community in the long term and under
open access conditions (Robinson 2016). In the case of digital editions, the encoded
text is the most crucial long-term result of the project. The display of information is
vital as well, as it represents the project group's view of this information in the
context of a certain application. However, no such view is unique, let alone
the only one possible: the same information can be displayed
in a variety of ways (Turska et al. 2016). With each new interpretation, the number
of potential user interfaces increases further. Each such presentation is thus a new
research result that deserves long-term storage as well.
Therefore, research results in the humanities consist not only of research data, but also
of the presentation environment and the applications that enable data interpretation,
searching, filtering, browsing, and linking (DHd-AG Datenzentren 2017, 7). If we
only stored research data, the initial presentation would be lost forever, even though
the presentation represents an integral part of any digital edition (Fechner 2018). At
the same time, we should not forget that the programming code used for the creation
of digital editions is an integral part of the scientific argumentation as well, just like
the digital editions (Andrews and Zundert 2016).
Sustainable storage of digital editions therefore represents a particularly significant
challenge. Moreover, digital editions can be very different from each other in terms of
their contents, appearance, and functionality. They mostly result from specific research
projects with relatively limited financial and human resources at their disposal. As the
project group members come from the field of humanities, they often lack the suit-
able technical expertise, which is why they mostly need to rely on external contractors
when it comes to technical development. Furthermore, digital editions depend on the
very swift development of online technologies and standards (Andorfer et al. 2016).
As the number of digital editions increases rapidly, the challenges involved in the
sustainable storage of digital editions will only become greater in the future (Fechner
2018). For smaller digital humanities projects, whose limited financing does
not allow for the long-term maintenance of technically demanding digital editions,
this is and will remain a significant challenge. Below,
I will present alternative solutions offered by the rapid development of static websites.
In recent years, static websites have become one of the main online development
trends, and it appears that this trend will persist (Williams 2019). In the
present contribution, I will present the experience gained by generating static websites
for the digital editions in the context of the activities of the Research Infrastructure
of Slovenian Historiography, which, among other tasks, also manages the History of
Slovenia – SIstory web portal.1 In this regard I will restrict my article solely to the static
websites generated from XML files, encoded in accordance with the Text Encoding
Initiative Guidelines (TEI) (TEI Consortium 2019). In digital humanities, the TEI
Guidelines are the de facto standard for text encoding, used by many different humani-
ties projects and studies (Romary et al. 2017, 5).
In the chapter Modern Static Websites, I will first present the main advantages and
disadvantages of this type of website. In our case, we have decided to upgrade the
basic XSLT Stylesheets of the TEI Consortium. In the SIstory TEI Profile chapter, I
will present a generic upgrade of the TEI Stylesheets. In the chapter Configuring and
Upgrading the SIstory TEI Profile I will outline the project-specific options for upgrad-
ing this profile. In both these chapters, I will also discuss the various options of adding
dynamic contents to static websites. In the chapter Publishing Digital Editions I will
outline how these static websites can be made available to the public, in particular by
their inclusion in the SIstory portal’s digital repository. In the Conclusion, I will also
mention a few more general findings.
1 “Research Infrastructure of Slovenian Historiography,” History of Slovenia – SIstory, accessed April 15, 2019, http://www.sistory.si/publikacije/?menuBottom=2.
Modern Static Websites
Initially, all websites were static, which is why all of the digital editions in the
field of digital humanities were first created as static HTML websites. This was
also true of the Slovenian scholarly digital editions (Ogrin and Erjavec 2009),2
which introduced the paradigm of digital editions in Slovenia (Ogrin 2005). The
creators of these digital editions soon encountered certain shortcomings of static web-
sites. In particular, they missed the option of carrying out structured text searches,
adaptable URL query string parameters, and dynamic web content association. In the
case of newer digital editions, they therefore opted for the Fedora Commons platform
(Erjavec et al. 2011).
By that point, the internet had long been dominated by dynamic
websites that had successfully replaced the outdated static websites, whose contents
could only be altered by developers directly editing the HTML code. By
means of content management systems (e.g. the very popular WordPress, Drupal, and
Joomla), dynamic websites have finally made it possible for technically unskilled users
to start publishing on the internet.
The contents of dynamic websites are stored in databases. The server does not con-
struct the contents until the user demands that a website be displayed, adapted to the
demands of the user. A suitable programming language is used to communicate with
the server. The biggest problem of such dynamic websites is that their technical solutions
are often more complicated than the actual needs of their users.
Modern static websites, however, have been created as an answer to the problems
exhibited by dynamic websites. Unlike the latter, static websites do not employ data-
bases and server-side programming languages, but are simply a collection of HTML,
CSS, and JavaScript files. Static websites therefore enjoy numerous advantages in com-
parison with dynamic websites (Rinaldi 2015):
– efficiency: as static websites do not require any databases or server-side process-
ing, they are not in danger of becoming slow;
– hosting: because static websites do not rely on a server-side programming lan-
guage, their hosting is simple and cheap. There are even free options, for example
the GitHub Pages service;
– security: static websites do not require any databases or server-side programming
languages that hackers could breach. Therefore such sites are safe as long as the files
they consist of are stored securely;
– maintenance: as static websites do not rely on any databases, server-side program-
ming languages, or content management systems, their maintenance is extremely
simple;
– versioning: since static websites consist exclusively of text files, all of their versions
can be quite simply stored in version control systems like Git.
2 Scholarly Digital Editions of Slovenian Literature, eZISS, accessed April 15, 2019, http://nl.ijs.si/e-zrc/index-en.html.
These reasons are particularly important to ensure the sustainability of digital edi-
tions. The use of standard formats like TIFF and JPEG for digital photographs, HTML
and XML for texts, and so on, ensures that the digital editions created will remain
readable and useful for a long time to come (Rosselli Del Turco 2016). Consequently,
this paradigm started to be emphasised in other similar projects in the field of digital
humanities as well (Viglianti 2017; Daengeli and Zumsteg 2017; Diaz 2018).
These reasons, however, are less convincing if we expect digital editions to
contain user-generated contents as well. Therefore, static websites are not appropriate
for all digital editions in the field of digital humanities, as such solutions will often
fail to satisfy the needs of the creators and users. On the other hand, countless digital
projects do not call for very complex content and its display. In such cases the existing
solutions provided by static websites can be more than satisfactory, especially because
modern static websites do not completely lack the option of adding dynamic contents.
In reality, static websites have only experienced their renaissance with the appearance
of various services and programming solutions that allowed such websites to include
dynamic contents.
Modern static websites are no longer coded manually, but are instead generated
by employing static website generators. Nowadays, the selection of such generators is
extremely broad. One of the most popular is Jekyll,3 which is also used in the creation
of GitHub Pages. Thus its use has also spread to the humanities (Visconti 2016). Static
website generators assume that the users will write the contents using a text formatting
syntax such as the Markdown markup language, which is very popular among developers.4
These formats can then be converted to HTML sites with a website generator and
then published online. However, the Markdown syntax is very limited and only
allows for basic content publishing. As such, it is inappropriate for the tagging of complex
humanities texts. Consequently, humanities texts are most often encoded with
Extensible Markup Language (XML). Furthermore, XSLT (Extensible Stylesheet
Language Transformations) is used as a tool for XML conversion. Together, these
are the key technologies employed by digital humanities (Flanders et al. 2016). As the
use of XSLT transformations is often very similar to static site generator conversions,
we can describe XSLT as a “modern, efficient static site generator” as well (Kraetke
and Imsieke 2016).
SIstory TEI Profile
For many years, the TEI Consortium has been regularly maintaining and updat-
ing the XSL Stylesheets, which can be used to generate, on the basis of TEI docu-
ments, not only (X)HTML websites, but also many other formats, including LaTeX,
XSL-FO, EPUB, DOCX, and ODT. These XSL stylesheets are freely available from
3 Jekyll • Simple, blog-aware, static sites, accessed April 15, 2019, https://jekyllrb.com/.
4 Daring Fireball: Markdown, accessed April 15, 2019, https://daringfireball.net/projects/markdown/.
the GitHub repository and regularly updated in accordance with the new versions of
the TEI Guidelines.5 Not only is the relevant written documentation very good, but
the programming code comments are exemplary as well. XSLT stylesheets are also
used, among other things, to generate the static website for each version of the TEI
Guidelines.6
Most importantly, by means of custom profiles, the XSLT stylesheets of the TEI
Consortium allow for very flexible adaptations to different project requirements. In
fact, the XSL Stylesheets for TEI have been written with the intention of being as
adaptable as possible. Numerous parameters exist that can be configured according
to preferences. The stylesheets contain many variables and templates, which can be
adapted to specific requirements. The authors of the code even thought of empty
(hook) templates, to which custom contents and XSLT programming code may be
added. I have made use of all these options when writing the SIstory profile for the
XSLT stylesheets of the TEI Consortium (Pančur 2019a).
Initially, I based the creation of these profiles on the needs of the Research
Infrastructure of Slovenian Historiography for flexible and prompt publication of our
technical documentation online. In the context of the Research Infrastructure, my
colleagues and I are managing the History of Slovenia – SIstory portal, which also
contains a repository and digital library. Therefore we decided to integrate these
digital editions into the existing infrastructure as closely as possible. Until 2016,
the static websites of these digital editions had been stored on an additional www2
server of the SIstory portal,7 while the digital library itself had only stored the metadata
about the digital editions and links to these static sites. After the upgrade of the SIstory
portal in 2016, we could start storing the HTML and all other files related to these
digital editions directly in the repository and the digital library.
Due to the desire to maximize the integration of digital editions into the SIstory
portal, I also tried to bring the external appearance of digital editions as close as possible
to the user interface of the portal. As an example, Figure 1 shows a snapshot of
the home page of the portal as it appeared between 2012 and 2016, and Figure 2 shows the
user interface of a digital edition from 2014.
5 TEI XSL Stylesheets, accessed April 15, 2019, https://github.com/TEIC/Stylesheets.
6 “P5: Guidelines for Electronic Text Encoding and Interchange,” TEI: Text Encoding Initiative, accessed April 15,
2019, https://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html.
7 www2.SIstory.si, accessed April 15, 2019, http://www2.sistory.si/.
Figure 1: Home page of the History of Slovenia – SIstory portal of 2016
Source: Spletni arhiv Narodne in univerzitetne knjižnice, accessed April 10, 2018, http://nukrobi2.nuk.uni-lj.si:8080/wayback/20160225143401/http://www.sistory.si/.
Figure 2: The 2014 digital edition user interface
Source: (Gašparič 2014), accessed April 10, 2018, http://www2.sistory.si/publikacije/monografije/Gasparic_Parlamentaria1/ch01.html.
Even though the colour scheme is identical and the layout of the logo, the search
bar, main top navigation menu, and the contents are very closely modelled after the
SIstory portal, the user interfaces are nevertheless not the same. At the time, the user
interface of the portal was still based on the old HTML 4 technology, but I had already
started to use responsive website design and HTML 5 for the digital editions. In this
regard, I decided to use the responsive front-end framework ZURB Foundation.8
I keep my adaptations as well as CSS and JS additions in the GitHub repository
(Pančur 2019b). As this framework proved extremely useful, we
also included it in the new SIstory portal in 2016. Subsequently I also adapted the
appearance of the digital editions to the new portal design (compare Figures 3 and 4).
Figure 3: Top navigation menu, search bar, and metadata page of the SIstory portal
Source: (Pančur 2016).
Apart from the originally envisioned technical documentation, we soon also
started to publish other sorts of publications – in particular monographs, collections
of scientific texts, and magazines – online in the HTML format. Therefore I reconfig-
ured the SIstory TEI profile with the aim of facilitating the publication of these sorts
of digital editions. The profile allows for the transformations of:
– individual TEI documents;
8 Foundation: The most advanced responsive front-end framework in the world, accessed April 15, 2019, https://foundation.zurb.com/.
– several TEI documents from a shared TEI corpus. In this case, each TEI document
needs to be converted separately. The TEI corpus itself and its <teiHeader> also need to
be converted separately, as in this manner a common cover, colophon, and tables
of contents are generated.
Figure 4: The 2016 digital edition user interface
Source: (Pančur 2016), accessed April 15, 2019, http://www.sistory.si/cdn/publikacije/36001-37000/36294/ch10.html.
The digital edition’s main navigation menu is located at the very top of the web
page, as horizontal navigation with a drop-down menu. The structure of this navigation
reflects the structure, sections, and divisions of the individual TEI documents. Below,
I will briefly outline the possible content sections of the navigation
as well as the TEI document. In practice, no TEI document contains every single one
of these sections. Instead, the authors of TEI documents can use and arrange them
completely in accordance with their needs.
The central part of the content is always contained within the <body> element.
The main content must be contained within a single or several <div> elements with
the obligatory attribute @xml:id. Each <div> element represents its own division of
the content or chapter. Therefore the navigation bar’s single drop-down menu displays
all of the <div> divisions contained within the <body> element. A variety of contents,
encoded in the relevant TEI document within the <front> and <back> elements, may
also be accessible before and after this part of the drop-down menu. Figure 5 thus
illustrates all of these main content sections.
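The relationship between the TEI structure and the navigation can be sketched in a few lines of Python. This is only an illustration and not part of the SIstory profile, which is written in XSLT: the sample document, its identifiers, and the function name navigation_entries are hypothetical, and the sketch assumes that every top-level division carries an @xml:id.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical miniature TEI document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <front><div type="preface" xml:id="preface"><head>Preface</head></div></front>
    <body>
      <div xml:id="ch01"><head>First chapter</head></div>
      <div xml:id="ch02"><head>Second chapter</head></div>
    </body>
    <back><div type="bibliography" xml:id="bibl"><head>Bibliography</head></div></back>
  </text>
</TEI>"""


def navigation_entries(tei_xml):
    """Group top-level divisions by the section (front, body, back) they sit in."""
    root = ET.fromstring(tei_xml)
    sections = {}
    for section in ("front", "body", "back"):
        entries = []
        # Only direct children are considered: each top-level <div> is one chapter.
        for div in root.findall(f"./tei:text/tei:{section}/tei:div", NS):
            head = div.find("tei:head", NS)
            title = "".join(head.itertext()).strip() if head is not None else ""
            # The HTML file of a division is named after its xml:id.
            entries.append((title, div.get(XML_ID) + ".html"))
        sections[section] = entries
    return sections


if __name__ == "__main__":
    for section, entries in navigation_entries(SAMPLE).items():
        print(section, entries)
```

The body entries correspond to the single drop-down menu of chapters, while the front and back entries would appear before and after it.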
Figure 5: The main content sections of a TEI document
Only <titlePage> is obligatory, because it is converted to the default start page
(index.html) and, as such, accessible through the navigation bar – at the very top,
as the Title Page. The <front> element may contain one or several <div> elements,
which represent the introductory chapters section in the navigation. The <back> element
includes three possible content sections (bibliographies, annexes, summaries),
which is why they must always be assigned the appropriate @type attribute. Each of
these sections can consist of one or more chapters. In most cases, the conversion of
the content of these divisions is based on the standard XSLT stylesheets of the TEI
Consortium, which I have only partly adapted to the needs of our own digital editions.
I have written the transformations for the generated divisions from scratch.
All of them have been included in the SIstory TEI profile. These generated divisions
can be included in the <front> (Figure 6) or the <back> element (Figure 7), and each
of the <divGen> elements must include a <head> with an arbitrary division title. These
titles are then included in the digital edition’s navigation.
Figure 6: The list of all possible generated divisions, contained in the <front>
element
Unlike the aforementioned <div> elements, where the use of @xml:id identifiers
is merely recommended (the HTML files that contain these divisions are named after
these identifiers), in case of generated divisions they are obligatory and also have a
semantic meaning that is of key importance for their conversion. The @type attribute
defines the main category, which is particularly highlighted in the horizontal navi-
gation. The @xml:id attribute more precisely defines the subcategory, shown in the
navigation drop-down menu. The most extensive category is the Table of Contents
(TOC) group, which, apart from the various tables of the contents of chapters and
subchapters, also contains a list of tables, figures, and charts. In reality, the list of charts
is merely a separate group of the list of figures (<figure>), which includes figures with the
@type attribute and the value chart.
The <back> element involves only a single category of generated divisions that
includes various lists of persons, places, and organisations. The generated divisions
include all of the persons mentioned in the TEI document, encoded with the <persName>
element, all places encoded with <placeName>, or all organisations encoded
with <orgName>. All of the named entities, encoded in this manner, must also be
assigned the @ref attribute, in order to refer to the appropriate canonical element in
the list of entities (<listPerson> for persons, <listOrg> for organisations, and <listPlace>
for places) in the TEI header (<teiHeader>). The <placeName> element’s @ref attribute
may also contain a reference to the GeoNames9 or DBpedia10 URI, in which case the
SIstory profile can process the geographical coordinates and display them in the list
of places.
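How such a list of persons could be compiled is sketched below in Python. This is an illustration under stated assumptions rather than the profile's actual XSLT code: the sample document, the placeholder name, the identifiers, and the function name person_index are hypothetical, and the location of the list of persons inside the header is likewise only assumed.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample: one canonical person in the header, two mentions in the text.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><particDesc>
    <listPerson>
      <person xml:id="pers1"><persName>Janez Novak</persName></person>
    </listPerson>
  </particDesc></profileDesc></teiHeader>
  <text><body><div xml:id="ch01">
    <p xml:id="ch01-p1"><persName ref="#pers1">J. Novak</persName> spoke first.</p>
    <p xml:id="ch01-p2">Later <persName ref="#pers1">Novak</persName> replied.</p>
  </div></body></text>
</TEI>"""


def person_index(tei_xml):
    """Map each canonical person name to the paragraphs that mention the person."""
    root = ET.fromstring(tei_xml)
    # Canonical entries from the header, keyed by "#id" as used in @ref.
    canonical = {}
    for person in root.findall(".//tei:listPerson/tei:person", NS):
        name = person.find("tei:persName", NS)
        canonical["#" + person.get(XML_ID)] = "".join(name.itertext()).strip()
    index = {}
    for paragraph in root.findall(".//tei:p", NS):
        for mention in paragraph.findall(".//tei:persName", NS):
            ref = mention.get("ref")
            if ref in canonical:  # mentions without a resolvable @ref are ignored
                index.setdefault(canonical[ref], []).append(paragraph.get(XML_ID))
    return index


if __name__ == "__main__":
    print(person_index(SAMPLE))
```

Lists of places and organisations would follow the same pattern with the corresponding element names.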
Figure 7: The list of all possible elements for automatically generated text division
<divGen>, contained in the <back> element
As it is also possible to use the SIstory profile to convert the TEI documents from
the TEI corpus, the <divGen> elements from the various TEI documents cannot possess
the same @xml:id identifiers. Therefore the subcategories of the generated divisions
are specified in such a manner that the subcategory’s identifier is stated after
the final hyphen of this identifier’s value (see Figures 6 and 7, where the id before the
hyphen in @xml:id attribute defines the arbitrary identifier, while the subcategory is
stated after the hyphen).
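The identifier convention can be made concrete with a short Python sketch. The function name and the example identifiers are hypothetical and serve only to illustrate the rule that everything before the final hyphen is arbitrary, while the part after it names the subcategory.

```python
def split_generated_id(xml_id):
    """Split an @xml:id into (arbitrary prefix, subcategory) at the final hyphen."""
    prefix, separator, subcategory = xml_id.rpartition("-")
    if not separator or not subcategory:
        raise ValueError(f"identifier {xml_id!r} has no subcategory after a hyphen")
    return prefix, subcategory


if __name__ == "__main__":
    # Two documents of one corpus can share a subcategory without sharing an id.
    print(split_generated_id("doc1-toc"))
    print(split_generated_id("vol-2-doc7-toc"))
```

Because only the final hyphen is significant, the arbitrary prefix may itself contain hyphens.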
The SIstory profile also allows for the display of dynamic contents. The Tipue
Search engine is included as a basic functionality.11 It can be included with a generated
division (<divGen>) of the search type in the <front> element. Tipue Search is an open
source jQuery plugin, which can be relatively easily integrated even in static sites. In
9 GeoNames, accessed April 15, 2019, http://www.geonames.org/.
10 DBpedia, accessed April 15, 2019, http://wiki.dbpedia.org/.
11 Tipue Search, accessed April 15, http://www.tipue.com/search/.
the graphical user interface, the search bar is located immediately below the bottom
navigation, while the <divGen> element generates a search.html web page that includes
a dynamic display of search results. The content of the TEI document is indexed, as a
JavaScript object (JSON), in the file tipuesearch_content.js, which needs to be located
in the same folder as the search.html file. Content indexation takes place at the level
of paragraphs (<p>), lists (<list>), tables (<table>), figures (<figure>), and all other
possible TEI elements that are direct child elements of the text division <div>.
Therefore, all of these elements must include an @xml:id attribute as a unique identifier.
Lists are the only exception: when they do not possess the @xml:id attribute, whereas
their child elements do, then the latter are indexed.
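The indexing rule, including the exception for lists, can be sketched in Python as follows. This is not the profile's XSLT implementation. The sample document and function names are hypothetical, and the layout of the generated object (a pages array whose entries have title, text, tags, and url fields assigned to a tipuesearch variable) is an assumption about the format Tipue Search expects that should be checked against the plugin's documentation.

```python
import json
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
LIST_TAG = f"{{{TEI}}}list"

# Hypothetical sample chapter, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div xml:id="ch01">
    <head>First chapter</head>
    <p xml:id="ch01-p1">Static   sites are simple.</p>
    <p>This paragraph has no identifier and is skipped.</p>
    <list>
      <item xml:id="ch01-item1">First item</item>
      <item xml:id="ch01-item2">Second item</item>
    </list>
  </div>
</body></text></TEI>"""


def index_entries(tei_xml):
    """Collect one index entry per identified direct child of each chapter."""
    root = ET.fromstring(tei_xml)
    pages = []
    for div in root.findall(".//tei:body/tei:div", NS):
        head = div.find("tei:head", NS)
        title = "".join(head.itertext()).strip() if head is not None else ""
        page = div.get(XML_ID) + ".html"
        for child in div:
            targets = [child]
            # Exception: a list without an id is replaced by its identified children.
            if child.tag == LIST_TAG and child.get(XML_ID) is None:
                targets = [item for item in child if item.get(XML_ID) is not None]
            for element in targets:
                element_id = element.get(XML_ID)
                if element_id is None:
                    continue  # elements without an identifier cannot be linked to
                text = " ".join("".join(element.itertext()).split())
                pages.append({"title": title, "text": text, "tags": "",
                              "url": f"{page}#{element_id}"})
    return pages


def tipue_content_js(pages):
    """Serialise the entries as the assumed content of tipuesearch_content.js."""
    return "var tipuesearch = " + json.dumps({"pages": pages}, ensure_ascii=False) + ";"


if __name__ == "__main__":
    print(tipue_content_js(index_entries(SAMPLE)))
```

Each entry links to the anchor of the indexed element, which is why the unique identifiers are required.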
Configuring and Upgrading the SIstory TEI Profile
Much like the main XSL Stylesheets of the TEI Consortium, the SIstory profile
has been created to allow for its adaptation to the requirements of any individual pro-
ject. To this end, it includes a few original parameters of the TEI Consortium’s XSLT
stylesheets which affect the default stylesheet output, to which I have added a few new
SIstory parameters. All of these parameters can be set up anew for each conversion, but
it is more appropriate that new project profiles be created for each individual project.
The conversion usually proceeds in the following manner: the project’s custom profile
imports the SIstory profile, which, in turn, imports the TEI XSLT transformations,
and adds overrides (see Figure 8).
Figure 8: Chained XSLT conversions with additional profiles
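The effect of this import chain on parameters can be illustrated with a Python analogy. The sketch does not run any XSLT. It only mimics the precedence rule that a definition in an importing stylesheet overrides the same definition in the imported one. The parameter names are taken from the text, whereas the values assigned to them here are purely illustrative assumptions.

```python
from collections import ChainMap

# Illustrative values only; the real defaults live in the respective stylesheets.
tei_defaults = {"splitLevel": "-1", "documentationLanguage": "en"}
sistory_profile = {"splitLevel": "0", "languages-locale": "false"}
project_profile = {"documentationLanguage": "sl"}

# Lookups consult the project profile first, then SIstory, then the TEI defaults,
# just as an importing stylesheet takes precedence over the one it imports.
effective = ChainMap(project_profile, sistory_profile, tei_defaults)

if __name__ == "__main__":
    for name in ("splitLevel", "documentationLanguage", "languages-locale"):
        print(name, "=", effective[name])
```

A project profile therefore only needs to state the parameters in which it deviates from the SIstory profile.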
For example, during conversion, the default SIstory profile expects every
chapter or the first <div> text division to become a separate HTML web page. In
this case, navigation through forward and back buttons is added to the web pages
automatically. Unlike the original TEI transformations, this navigation also includes
the generated divisions <divGen>. However, by changing the splitLevel parameter
(originally a parameter included in the TEI conversions), it is possible to specify that
subchapters also become separate HTML web pages. The forward/back and up/down
navigation between the web pages is then adapted accordingly. The current
SIstory profile only supports a depth of three text divisions.
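The effect of the split level on the set of generated pages can be sketched in Python. This is a simplified illustration, not the stylesheet logic itself: the sample document and the function name are hypothetical, and the numbering used here (0 for chapters, 1 for subchapters) is an assumption about how the parameter values are counted.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample with two chapters, the first of which has two subchapters.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
  <div xml:id="ch01">
    <div xml:id="ch01-s1"/>
    <div xml:id="ch01-s2"/>
  </div>
  <div xml:id="ch02"/>
</body></text></TEI>"""


def pages_for_split_level(tei_xml, split_level=0):
    """List the HTML pages produced when divisions up to split_level are split off."""
    root = ET.fromstring(tei_xml)
    pages = []

    def walk(div, depth):
        if depth > split_level:
            return  # deeper divisions stay on the page of their ancestor
        pages.append(div.get(XML_ID) + ".html")
        for child in div.findall("tei:div", NS):
            walk(child, depth + 1)

    for chapter in root.findall(".//tei:body/tei:div", NS):
        walk(chapter, 0)
    return pages


if __name__ == "__main__":
    print(pages_for_split_level(SAMPLE, 0))
    print(pages_for_split_level(SAMPLE, 1))
```

The order of the returned list is the document order, which is also the order a forward/back navigation would follow.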
The documentationLanguage parameter may currently be used to specify the
Slovenian, English, or Serbian navigation (in the Latin or Cyrillic script). By adding
new translations to the myi18n.xml document, it is possible to further expand this
localisation. The localisation of the Tipue Search engine has been suitably taken care
of as well.
The SIstory profile also allows for the parallel display of a text's various language
versions. In this case, all of the main <div> text divisions and generated <divGen>
divisions must contain @xml:lang attributes with the appropriate language code as well as
@corresp attributes pointing to all the other language versions of the text in question
(see Figure 9). At the same time, the languages-locale parameter must be set to the value
true, while the languages-locale-primary parameter must specify the language code of
the starting index.html file.
Figure 9: Localization and language setting in TEI document
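The linking scheme described above can be sketched in Python as follows. The sample fragment and the function name are hypothetical and serve only to show how @xml:lang and @corresp together let a converter find the other language versions of a division; the actual profile resolves these links in XSLT.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"

# Hypothetical fragment: one division in Slovenian and its English version.
SAMPLE = """<body>
  <div xml:id="intro-sl" xml:lang="sl" corresp="#intro-en"/>
  <div xml:id="intro-en" xml:lang="en" corresp="#intro-sl"/>
</body>"""


def language_versions(root):
    """For every division, map language code -> xml:id of the linked version."""
    by_id = {div.get(XML_NS + "id"): div for div in root.iter("div")}
    versions = {}
    for div_id, div in by_id.items():
        # @corresp holds whitespace-separated pointers such as "#intro-en".
        targets = [t.lstrip("#") for t in div.get("corresp", "").split()]
        versions[div_id] = {
            by_id[t].get(XML_NS + "lang"): t for t in targets if t in by_id
        }
    return versions


links = language_versions(ET.fromstring(SAMPLE))
```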
The display of the TEI document's metadata from the <teiHeader> element is
similarly adaptable. This transformation is triggered by including the
generic division (<divGen>), whose @type attribute value should be set to teiHeader
(see Figure 6). The entire content of the <teiHeader> element is converted to <dl>
(definition list HTML element), where <dt> (description term element) contains the
name of the TEI element as well as its attribute names and values (element [attribute
= value | attribute = value]), while <dd> (definition description element) contains the
text content of the TEI element. The definition lists are nested to mirror the nesting
of the TEI elements.
With additional parameters, this transformation can be configured in such a way as to
display the descriptive names of elements and attributes in the English or Slovenian
language instead of their names.
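A minimal Python sketch of this kind of conversion is given below. It is an illustration under stated assumptions rather than the profile's own XSLT: the function name and the sample header are hypothetical, and only the output pattern (a term of the form element [attribute = value], followed by the text content and a nested definition list) follows the description above.

```python
import xml.etree.ElementTree as ET
from html import escape


def to_definition_list(element):
    """Render the children of `element` as a nested HTML definition list."""
    items = []
    for child in element:
        attrs = " | ".join(f"{k} = {v}" for k, v in child.attrib.items())
        term = f"{child.tag} [{attrs}]" if attrs else child.tag
        text = (child.text or "").strip()
        # Child elements produce a nested definition list inside the description.
        nested = to_definition_list(child) if len(child) else ""
        items.append(f"<dt>{escape(term)}</dt><dd>{escape(text)}{nested}</dd>")
    return "<dl>" + "".join(items) + "</dl>"


# Hypothetical, namespace-free header used only for illustration.
header = ET.fromstring(
    "<teiHeader><fileDesc><titleStmt>"
    '<title type="main">Sample edition</title>'
    "</titleStmt></fileDesc></teiHeader>"
)
rendered = to_definition_list(header)
print(rendered)
```

Replacing the raw element names with descriptive English or Slovenian labels would only require a lookup table applied to `child.tag` before the term is written out.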
Apart from this simple SIstory profile configuration, any additional XSLT trans-
formation that can be completely adapted to the needs of an individual digital edition
can be included during the conversion of a project. Simultaneously, by using various
JavaScript libraries and plugins as well as web applications, it is also possible to enable
additional dynamic content display. For example, in the case of the SIstory portal’s
digital editions, I have successfully used DataTables12 to display large quantities of
tabled data, Highcharts13 for charts, Google Maps for maps, and ImageViewer14 and
Viewer.js for images.15 These are merely examples: there are alternatives, and every year
many new possibilities emerge.
Figure 10: The simultaneous display of facsimiles, diplomatic transcription, and critical
transcription of the Kapelski pasijon passion play.
Source: Kapelski pasijon, GitHub pages, https://dariah-si.github.io/Kapelski-pub/
12 DataTables: Table plug-in for jQuery, accessed April 15, 2019, https://datatables.net/.
13 Highcharts, accessed April 15, 2019, https://www.highcharts.com/products/highcharts/.
14 ImageViewer, accessed April 15, 2019, http://ignitersworld.com/lab/imageViewer.html.
15 Viewer.js, JavaScript image viewer, accessed April 15, 2019, https://fengyuanchen.github.io/viewerjs/.
In addition, the publication of Saxon-JS16 in 2017 further improved the possibilities
of dynamically displaying the contents of XML documents in static web pages.
Saxon-JS is an XSLT 3.0 run-time written in pure JavaScript. It could
contribute to XSLT once again becoming a client-side technology that works in a
browser (Lumley et al. 2017). For digital editions, I have started to use the
Saxon-JS extension function ixsl:query-params, which parses the query parameters
of the HTML page URI. In the case of the Kapelski pasijon (The Železna Kapla
Passion Play) digital edition, I created and used the following parameters to
generate a dynamic parallel display of facsimiles as well as the diplomatic and critical
transcriptions: type, mode, page, and lb (line break). These parameters have allowed me
to construct a dynamic display of extremely complex contents (Figure 10), which can
be extended further in future digital editions.
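To make the role of these query parameters more concrete, here is a small Python analogue of reading them from a page URI. It is not Saxon-JS code and does not reproduce the edition's stylesheet; the parameter names type, mode, page, and lb come from the text, whereas the example URL and all default values are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical defaults; the real edition defines its own values.
DEFAULTS = {"type": "diplomatic", "mode": "single", "page": "1", "lb": "1"}


def display_settings(url, defaults=DEFAULTS):
    """Combine default display settings with query parameters from the URL."""
    query = parse_qs(urlparse(url).query)
    settings = dict(defaults)
    for name in defaults:
        if name in query:
            # parse_qs returns a list per key; the first value is used here.
            settings[name] = query[name][0]
    return settings


settings = display_settings(
    "https://example.org/edition.html?mode=parallel&page=12"
)
print(settings)
```

Because the whole display state lives in the URI, a particular view of the edition can be bookmarked or cited without any server-side logic.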
Publishing Digital Editions
The default SIstory profile transformation generates all the HTML, JS, and any
other files in a single folder. As the digital editions generated in this manner consist
solely of static web pages, they can also be opened locally on a personal computer. This
makes it possible to test the digital editions before publishing them online, and the
publication itself is swift and simple on any available server. Additionally,
GitHub Pages is a free hosting option that also provides efficient
version control.
However, the main purpose of SIstory profiles is to include digital editions directly
into the SIstory portal’s repository and its digital library. Thus we can efficiently store
all of the digital editions’ files by adding persistent Handle System identifiers and
checksums for all the relevant files, as well as flexibly organise digital editions as one
or several digital objects with one or several intellectual entities. Each intellectual
entity has its own Handle identifier and metadata. It can include several files or none
at all. The files belonging to an individual intellectual entity are located in the same
folder. The path to this folder also includes the suffix of the Handle persistent identi-
fier, which is, in the case of the SIstory portal, always a numerical value (e.g., for the
suffix 555, the relative path would be /cdn/publikacije/1-1000/555/file). Therefore,
the SIstory XSLT profile must know the values of these identifiers in advance. Thus we
can precisely determine, even in advance, whether the entire contents of a digital edi-
tion should be contained in a single intellectual entity of the SIstory portal, or whether
various digital edition files should be included in various intellectual entities. These
identifiers can be recorded among the rest of the metadata in , within the
element, as a value of one or more elements. This element
16 “Saxon-JS,” Saxonica, accessed April 15, 2019, http://www.saxonica.com/saxon-js/index.xml.
requires that the value of the @type attribute be specified as sistory or si4, while the
@corresp attribute should point to all the appropriate <div> and <divGen> divisions
whose content will be included in the intellectual entity with this identifier.
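The relationship between a Handle suffix and its folder can be sketched in Python. The only fact taken from the text is the single example (suffix 555 maps to /cdn/publikacije/1-1000/555/). The rule that folders are grouped in consecutive ranges of 1000 is an assumption inferred from that example, and the function name is hypothetical.

```python
def handle_folder(suffix, bucket_size=1000, root="/cdn/publikacije"):
    """Return the relative folder for a numeric Handle suffix.

    Assumes (not confirmed by the source) that suffixes are grouped into
    consecutive ranges of `bucket_size`, e.g. 1-1000, 1001-2000, and so on.
    """
    if suffix < 1:
        raise ValueError("Handle suffix must be a positive integer")
    start = ((suffix - 1) // bucket_size) * bucket_size + 1
    end = start + bucket_size - 1
    return f"{root}/{start}-{end}/{suffix}"


print(handle_folder(555))
```

A deterministic rule of this kind is what allows the paths to be written into the generated HTML before the files are actually uploaded to the repository.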
The SIstory XSLT profile is open source and available in a GitHub repository
(Pančur 2019a). Another GitHub repository contains all of the digital editions
currently kept on the SIstory portal, together with the project-specific upgrades of
the SIstory XSLT profile for each of these editions (Pančur 2018). I regularly expand
and maintain the SIstory profile in accordance with the changes to the TEI XSLT
stylesheets.
Conclusion
There are several advantages to a digital editions infrastructure organised in this
manner:
– using the format that is most common in digital humanities: TEI XML (Neuefeind
2019, 221);
– using a single XML technology (XSLT) for various sorts of digital editions, which
enjoys a wide support in the TEI community;
– the possibility of simply including JavaScript libraries and plugins;
– flexibly adding dynamic contents with Saxon-JS;
– in comparison with other technologies (dynamic sites), static sites ensure a rela-
tive sustainability and simple maintenance of digital editions;
– using Git version control to store the various versions of digital editions together
with the software used to generate static websites;
– open access to the complete digital editions code in the GitHub and GitLab soft-
ware development platforms;
– the possibility of sharing digital editions on the GitHub Pages and GitLab Pages,
and, last but not least, the possibility of including them in the History of Slovenia –
SIstory portal.
Acknowledgements
The work presented in this paper was supported by the Slovenian historiography
research infrastructure (I0-0013) and the Slovenian ESFRI infrastructure DARIAH-SI,
which are financially supported by the Slovenian Research Agency.
Sources and Literature
Datasets and Academic Software:
• Pančur, Andrej. 2018. Electronic publishing on SIstory. Distributed by GitHub. https://github.com/
SIstory/publications.
• Pančur, Andrej. 2019a. SIstory TEI Stylesheets. Distributed by GitHub. https://github.com/
SIstory/Stylesheets.
• Pančur, Andrej. 2019b. SIstory: additional CSS and JS. Distributed by GitHub. https://github.
com/SIstory/themes.
Literature:
• Andorfer, Peter, Matej Ďurčo, Thomas Stäcker, Christian Thomas, Vera Hildenbrandt, Hubert
Stigler, Sibylle Söring, and Lukas Rosenthaler. 2016. “Nachhaltigkeit technischer Lösungen für
digitale Editionen: Eine kritische Evaluation bestehender Frameworks und Workflows von und für
Praktiker_innen.” In DHd 2016: Modellierung – Vernetzung – Visualisierung: Die Digital Humanities
als fächerübergreifendes Forschungsparadigma: Konferenzabstracts, 36–39. Universität Leipzig.
http://www.dhd2016.de/.
• Andrews, Tara, and Joris van Zundert. 2016. “What Are You Trying to Say? The Interface
as an Integral Element of Argument.” In Digital Scholarly Editions as Interfaces, International
Symposium at the University of Graz, Austria, 31–32. Graz: Centre for Information Modelling –
Austrian Centre for Digital Humanities. https://static.uni-graz.at/fileadmin/gewi-zentren/
Informationsmodellierung/PDF/dse-interfaces_BoA21092016.pdf.
• Daengeli, Peter, and Simon Zumsteg. 2017. “Hermann Burgers Lokalbericht: Hybrid-Edition
mit digitalem Schwerpunkt.” In DHd 2017: Digitale Nachhaltigkeit: Konferenzabstracts, 151–55.
Universität Bern. http://www.dhd2017.ch/.
• DHd-AG Datenzentren. 2017. Geisteswissenschaftliche Datenzentren im deutschsprachigen Raum:
Grundsatzpapier zur Sicherung der langfristigen Verfügbarkeit von Forschungsdaten. Hamburg. DOI:
10.5281/zenodo.1134760.
• Erjavec, Tomaž, Jan Jona Javoršek, Matija Ogrin, and Petra Vide Ogrin. 2011. “Od biografskega
leksikona do znanstvenokritične izdaje: vprašanje trajnosti elektronskih besedil.” Knjižnica 55, No.
1: 103–14. https://knjiznica.zbds-zveza.si/knjiznica/article/view/6004.
• Diaz, Chris. 2018. “Using Static Site Generators for Scholarly Publications and Open Educational
Resources.” Code4Lib Journal, No. 44. https://journal.code4lib.org/articles/1386.
• Fechner, Martin. 2018. “Eine nachhaltige Präsentationsschicht für digitale Editionen.” In DHd
2018: Kritik der digitalen Vernunft: Konferenzabstracts, edited by Georg Vogeler, 203–7. Universität
zu Köln. http://dhd2018.uni-koeln.de/.
• Flanders, Julia, Syd Bauman, and Sarah Connell. 2016. “XSLT: Transforming our XML data.” In
Doing Digital Humanities: Practice, Training, Research, edited by C. Crompton, R. J. Lane and R.
Siemens, 255–72. Oxon and New York: Routledge.
• Gašparič, Jure. 2014. Slovenski parlament: Politično zgodovinski pregled od začetka prvega do konca
šestega mandata (1992–2014). Ljubljana: Inštitut za novejšo zgodovino. http://hdl.handle.
net/11686/26950.
• Kraetke, Martin, and Gerrit Imsieke. 2016. “XSLT as a Modern, Powerful Static Website
Generator: Publishing Hogrefe’s Clinical Handbook of Psychotropic Drugs as a Web App.” In
Proceedings of XML in, Web Out: International Symposium on sub rosa XML, Balisage Series on
Markup Technologies, vol. 18. https://doi.org/10.4242/BalisageVol18.Kraetke02.
• Lumley, John, Debbie Lockett, and Michael Kay. 2017. “Compiling XSLT3, in the Browser,
in Itself.” In Proceedings of Balisage: The Markup Conference 2017, Balisage Series on Markup
Technologies, vol. 19. https://doi.org/10.4242/BalisageVol19.Lumley01.
• Moeller, Katrin, Matej Ďurčo, Barbara Ebert, Marina Lemaire, Lukas Rosenthaler, Patrick Sahle,
Ulrike Wuttke, and Jörg Wettlaufer. 2018. “Die Summe geisteswissenschaftlicher Methoden?
Fachspezifisches Datenmanagement als Voraussetzung zukunftsorientierten Forschens.” In DHd
2018: Kritik der digitalen Vernunft: Konferenzabstracts, edited by Georg Vogeler, 89–93. Universität
zu Köln. http://dhd2018.uni-koeln.de/.
• Neuefeind, Claes, Philip Schildkamp, and Brigitte Mathiak. 2019. “Technologienutzung im
Kontext Digitaler Edition – eine Landschaftsvermessung.” In DHd 2019: Digital Humanities:
multimedial & multimodal. Konferenzabstracts, 219–22. Universität Mainz, Universität Frankfurt.
https://dhd2019.org/.
• Ogrin, Matija, and Tomaž Erjavec. 2009. “Ekdotika in tehnologija: Elektronske znanstvenokritične
izdaje slovenskega slovstva.” Jezik in slovstvo 54, No. 6 (2009): 57–72. http://www.dlib.
si/?URN=URN:NBN:SI:doc-BOC8BANS.
• Ogrin, Matija, ed. 2005. Znanstvene razprave in elektronski mediji: razprave. Ljubljana: Založba
ZRC, ZRC SAZU. http://nl.ijs.si/e-zrc/bib/eziss-knjiga.pdf.
• Pančur, Andrej. 2016. “History of the Holocaust in Slovenia.” In Between the House of Habsburg and
Tito: A Look at the Slovenian Past, edited by Jurij Perovšek and Bojan Godeša. Ljubljana: Inštitut za
novejšo zgodovino. http://hdl.handle.net/11686/36294.
• Rinaldi, Brian. 2015. Static Site Generators: Modern Tools for Static Website Development. Sebastopol,
CA: O’Reilly Media.
• Robinson, Peter. 2016. “Why Interfaces Do Not and Should Not Matter for Scholarly Digital
Editions.” In Digital Scholarly Editions as Interfaces, International Symposium at the University of
Graz, Austria, 29–30. Centre for Information Modelling – Austrian Centre for Digital Humanities.
https://static.uni-graz.at/fileadmin/gewizentren/Informationsmodellierung/PDF/dse-
interfaces_BoA21092016.pdf.
• Rosselli Del Turco, Roberto. 2016. “The Battle We Forgot to Fight: Should We Make a Case
for Digital Editions?” In Digital Scholarly Editing: Theories and Practices, edited by Matthew
James Driscoll and Elena Pierazzo, 19–238. Cambridge: Open Book Publishers. http://dx.doi.
org/10.11647/OBP.0095.
• Romary, Laurent, Piotr Banski, Jack Bowers, Emiliano Degl’innocenti, Matej Ďurčo, Roberta
Giacomi, Klaus Illmayer, Adeline Joffres, Fahad Khan, Mohamed Khemakhem, et al. 2017. Report
on Standardization (draft). [Technical report] 4.2 Inria. https://hal.inria.fr/hal-01560563.
• Schaffner, Jennifer and Ricky Erway. 2014. Does Every Research Library Need a Digital Humanities
Center? Dublin, Ohio: OCLC Research. https://www.oclc.org/content/dam/research/
publications/library/2014/oclcresearch-digital-humanities-center-2014.pdf.
• TEI Consortium, ed. 2019. TEI P5: Guidelines for Electronic Text Encoding and Interchange 3.5.0.
TEI Consortium. http://www.tei-c.org/Guidelines/P5/.
• Turska, Magdalena, James Cummings, and Sebastian Rahtz. 2016. “Challenging the Myth of
Presentation in Digital Editions.” Journal of the Text Encoding Initiative, No. 9. DOI: 10.4000/
jtei.1453.
• Viglianti, Raffaele. 2017. “Your Own Shelley-Godwin Archive: An off-line strategy for an on-line
publication (poster).” In TEI 2017 Victoria. https://hcmc.uvic.ca/tei2017/abstracts/t_126_
viglianti_shelleygodwin.html.
• Visconti, Amanda. 2016. “Building a Static Website with Jekyll and GitHub Pages.” The
Programming Historian, 5. https://programminghistorian.org/lessons/building-static-sites-with-
jekyll-github-pages.
• Williams, Martin. 2019. Web Development Trends 2019 (blog). March 14, 2019. Accessed April 12,
2019. https://www.keycdn.com/blog/web-development-trends-2019.
Andrej Pančur
SUSTAINABILITY OF DIGITAL EDITIONS: STATIC WEBSITES
OF THE HISTORY OF SLOVENIA – SISTORY PORTAL
SUMMARY
The contribution is based on the position that, with regard to digital editions, the
highest possible degree of digital sustainability of data, presentations, functionalities,
and programme code should be ensured. This represents a significant challenge,
especially in the case of smaller digital humanities projects with limited financing, which
does not allow for the long-term maintenance of technically demanding digital editions.
The contribution presents the alternative solutions made possible by the swift
development of static web pages in recent years.
Static websites enjoy numerous advantages in comparison with dynamic websites:
efficiency, hosting, security, maintenance, and versioning. These advantages are
particularly important for ensuring the sustainability of digital editions. They are,
however, less convincing if we expect digital editions to contain user-generated
content as well. Therefore, static websites are not appropriate for all digital editions in the
field of digital humanities. On the other hand, countless digital projects do not call for
very complex content and its display. In such cases the existing solutions provided by
static websites can be more than satisfactory, especially because modern static
websites do not completely lack the option of adding dynamic content. Modern static
websites are generated by employing static website generators. Humanities texts are
most often encoded with Extensible Markup Language (XML). Extensible Stylesheet
Language Transformations (XSLT) is used as a tool for XML conversion, including
conversion into static websites. Digital editions based on the TEI have been successfully
included in the SIstory portal repository as static web pages, employing basic XML (XSLT)
and web technologies (HTML, CSS, JavaScript). All the static web pages also have the
possibility of displaying dynamic content.
In the case of the SIstory portal, we have decided to upgrade the basic XSLT
Stylesheets of the TEI Consortium. In the SIstory TEI Profile chapter, I present
a generic upgrade of the TEI Stylesheets. In the chapter Configuring and Upgrading the
SIstory TEI Profile, I outline the project-specific options for upgrading this profile.
In both these chapters, I also discuss the various options for adding dynamic
content to static websites. In the chapter Publishing Digital Editions, I outline how
these static websites can be made available to the public, in particular through their
inclusion in the SIstory portal's digital repository. In the Conclusion, I also mention a
few more general findings.
There are several advantages to a digital editions infrastructure organised in this
manner: using the format that is most common in digital humanities (TEI XML); using
a single XML technology (XSLT) for various sorts of digital editions, which enjoys
a wide support in the TEI community; the possibility of simply including JavaScript
libraries and plugins; flexibly adding dynamic contents with Saxon-JS; in comparison
with other technologies (dynamic sites), static sites ensure a relative sustainability and
simple maintenance of digital editions; using Git version control to store the various
versions of digital editions together with the software used to generate static websites;
open access to the complete digital editions code in the GitHub and GitLab soft-
ware development platforms; the possibility of sharing digital editions on the GitHub
Pages and GitLab Pages, and, last but not least, the possibility of including them in the
History of Slovenia – SIstory portal.
Andrej Pančur
SUSTAINABILITY OF DIGITAL EDITIONS: STATIC WEBSITES
OF THE HISTORY OF SLOVENIA – SISTORY PORTAL
SUMMARY (POVZETEK)
The contribution proceeds from the position that digital editions require the most
comprehensive possible digital sustainability of data as well as of presentations,
functionalities, and programme code. This is a major challenge, above all for smaller
digital humanities projects with limited funding, which does not allow for the long-term
maintenance of technically demanding digital editions. As an alternative, the
contribution presents the solutions offered by the rapid development of static web
pages in recent years.
Compared with dynamic websites, static websites have numerous advantages:
performance, hosting, security, maintenance, and version control. These reasons are
especially important for the sustainability of digital editions. However, they are less
convincing if we expect digital editions to contain user-generated content as well.
Static websites are therefore not suitable for all digital editions in the field of
digital humanities. On the other hand, there are a great many digital projects in which
the content and its display are not particularly demanding. In these cases, the existing
solutions provided by static websites would be more than satisfactory, above all because
modern static websites are not entirely without the option of adding dynamic content.
Modern static websites are generated with static website generators. Texts in the
humanities are mostly encoded in the XML markup language, and XSLT is used as a tool
for converting XML, including into static web pages. Digital editions based on the TEI
have been successfully included in the SIstory portal repository as static web pages by
means of basic XML (XSLT) and web technologies (HTML, CSS, JavaScript). All of the
static web pages also allow for the dynamic display of content.
In the case of the SIstory portal, we decided to upgrade the basic XSLT transformations
of the TEI Consortium. In the chapter SIstory TEI Profile, I present my generic upgrade
of the TEI Consortium's XSLT transformations. In the chapter Configuring and Upgrading
the SIstory TEI Profile, I then present the project-specific options for upgrading this
profile. In both chapters, I also present various options for adding dynamic content
to static web pages. In the chapter Publishing Digital Editions, I mention how
these static web pages are made available to the public, above all through their
inclusion in the digital repository of the SIstory portal. In the Conclusion, I finally
add a few more important general findings.
An infrastructure for digital editions established in this manner has several
advantages: the use of the data format that is most widespread in digital humanities
(TEI XML); the use of a single XML technology (XSLT) for various types of digital
editions, which enjoys wide support in the TEI community; the possibility of simply
including JavaScript libraries and plugins; the flexible addition of dynamic content with
Saxon-JS; in comparison with other technologies (dynamic websites), static websites
ensure the relative sustainability of digital editions and relatively simple maintenance;
the use of Git version control to store the various versions of digital editions together
with the software used to generate the static web pages; open access to the complete
code of the digital editions on the GitHub and GitLab software development platforms;
the possibility of hosting digital editions on GitHub Pages and GitLab Pages; and, last
but not least, the possibility of including them in the History of Slovenia – SIstory
portal.
179A. Pretnar, D. Podjed: Data Mining Workspace Sensors: A New Approach to Anthropology
1.01 UDC: 003.295:572+316.7
Ajda Pretnar*, Dan Podjed**
Data Mining Workspace Sensors:
A New Approach to Anthropology
ABSTRACT (IZVLEČEK)
DATA MINING WORKSPACE SENSORS: A NEW
APPROACH TO ANTHROPOLOGY
Anthropology is needlessly lagging behind in incorporating computational methods into
research, even though these are becoming increasingly popular in the social sciences and
humanities. In anthropology, too, quantitative and qualitative methods can be combined
successfully, especially when moving between the two. In this contribution we propose a
new methodological approach and describe how we used quantitative methods and data
analytics in ethnographic research. We demonstrate the methodology on the analysis of
sensor data from one of the faculty buildings of the University of Ljubljana, where we
observed the practices and behaviour of employees during working hours and examined how
they manage the building and their living environment. To study this so-called “smart
building”, we used circular mixed methods, which intertwine data analytics (a quantitative
approach) with ethnography (a qualitative approach), and at the same time empirically
identified the main advantages of the new anthropological methodology.
Keywords: computational anthropology, sensor data, data ethnography,
circular mixed methods
ABSTRACT
While social sciences and humanities are increasingly including computational methods
in their research, anthropology seems to be lagging behind. But it does not have to be so.
Anthropology is able to merge quantitative and qualitative methods successfully, espe-
cially when traversing between the two. In the following contribution, we propose a new
* Laboratory of Bioinformatics, Faculty of Computer and Information Science, University of Ljubljana, Večna pot
113, SI-1000 Ljubljana, ajda.pretnar@fri.uni-lj.si
** Institute of Slovenian Ethnology, Research Centre of the Slovenian Academy of Sciences and Arts, Novi trg 2,
SI-1000 Ljubljana, dan.podjed@zrc-sazu.si
methodological approach and describe how to engage quantitative methods and data
analysis to support ethnographic research. We showcase this methodology with the analysis
of sensor data from a University of Ljubljana’s faculty building, where we observed human
practices and behaviours of employees during working hours and analysed how they interact
with the building and their environment. We applied the proposed circular mixed methods
approach that combines data analysis (quantitative approach) with ethnography (quali-
tative approach) on an example of a “smart building” and empirically identified the main
benefits of the new anthropological methodology.
Keywords: computational anthropology, sensor data, data ethnography, circular mixed
methods
Introduction
Social sciences and humanities are rapidly adopting computational approaches
and software tools, resulting in an emerging field of digital humanities (Klein and
Gold 2016) and computational social sciences (Conte et al. 2012). Among these is
anthropology, which is particularly suitable for traversing between quantitative and
qualitative methods. Anthropologists study and analyse human habits, practices,
behaviours and cultures, with a particular focus on participant observation and long-term
fieldwork as a methodological cornerstone of the discipline. With the increasing
availability of data coming from social networks and wearable devices, among other
sources (Miller et al. 2016; Gershenfeld and Vasseur 2014), anthropologists can more
easily than ever dive into data analysis and study humans and their societies, subcultures
and cultures quantitatively as well as qualitatively.
With this contribution, we tentatively place anthropology in the field of digital
humanities,1 mostly because the suggested approach is multidisciplinary and
analogous to the shifts between distant and close reading (Jänicke et al. 2015) in
literary studies. Just as distant reading can offer an abstract (over)view of the corpus,
quantitative analyses can give a researcher a broad understanding of the population
she is investigating. And just as distant reading needs close reading to understand the
style, themes, and subtle meanings of a literary work, so does data analysis need an
ethnographic approach to contextualize the information and extract subtle meanings
of individual human experience.
As Pink et al. (2017) suggest, there is value in investigating everyday data that
reveal what is ordinary, what extraordinary and how to contextualize the two. In this
contribution we expand the idea by employing the circular mixed methods approach
that combines qualitative research from anthropology and quantitative analysis from
data mining. We consider the mixed methods (Creswell and Clark 2007; Teddlie and
1 Anthropology is considered a part of the humanities in the Slovenian academic tradition, while elsewhere it is
placed under the umbrella of social sciences. In reality, it probably lies at the intersection of both.
Tashakkori 2009) as integrative research that merges data collection, methods of
research and philosophical issues from both quantitative and qualitative research
paradigms into a single framework (Johnson et al. 2007). We also stress the need for a
circular research design, where we traverse between methods to continually verify and
enhance knowledge. Circularity gives research flexibility and enables shifting
perspectives in response to new information.
Our study, which provides the basis for this article, began in October 2017 and
includes 14 offices at one of the University of Ljubljana’s faculty buildings. The build-
ing is equipped with automation systems and sensors measuring large amounts of
data related mostly to the building’s energy performance and thermal comfort. We
retrieved measurements from approximately 20 sensors from the SCADA monitoring
system for the year 2016 and extrapolated behavioural patterns for different rooms
and, more generally, room types through data visualization and exploratory analysis.
The analysis showed specific patterns emerging in several rooms; we noticed there
were some definite outliers in terms of working hours and room interaction.
We used computational methods to gauge new perspectives on human behaviour
and invoke potentially relevant hypotheses. Data analysis provided several distinct pat-
terns of behaviour and defined the baseline for workspace use. However, this approach
was unable to provide us with a context for the data. Quantitative methods can easily
answer the ‘what’, ‘where’ and ‘when’ type of questions, but struggle with the ‘why’.
At that stage, we employed anthropological fieldwork and ethnography as the main
methods of anthropology. We conducted interviews with room occupants to explain
what the uncovered patterns mean and why people behave the way they do.
The main purpose of our study was to demonstrate how anthropologists can use
statistics and data visualisation to establish the essential facts of the observed phenom-
ena and how the traditional anthropological methods, which have not significantly
changed since the early 20th century, when Malinowski (2002 [1922]) carried out
his ground-breaking ethnographic research on the Trobriand Islands, can be
complemented and upgraded by data analysis. We call this a circular mixed methods approach,
where circular implies continual traversing between qualitative and quantitative
methods, between fieldwork and data analysis. Our research applies the proposed
methodology to sensor data obtained from a smart building and, with a combination of data
mining and ethnographic fieldwork, establishes both a wide and deep understanding of
human behaviour in a workplace setting.
While inclusion of domain experts is already a postulate in machine learning
and data mining, the opposite, the inclusion of machine learning and data mining in
anthropology, is fairly new and lacks sufficient practical application. Concurrently,
few anthropologists and even social scientists and humanists in general are included
in the development of AI solutions and data analysis, even when the data is strictly
coming from a social domain (Skeem and Lowenkamp 2016; Lum and Isaac 2016;
see also Pretnar and Robnik-Šikonja 2019). Moreover, the plot twist in anthropology
comes from the fact that anthropologists do not act as domain experts explaining the
data, but as channels and interpreters for the people to explain the data they produced
themselves. In anthropology, authority does not come from the researcher, but from
the researched – the group of people that are the source of data and information.
Development of Computational Anthropology
While digital humanities became a full-fledged field in the last couple of decades
(Hockey 2004), anthropology seems to be lagging behind. Some authors suggest
anthropology should be more concerned with digital as an object of analysis rather
than as a tool (Svensson 2010). However, there have been several attempts to include
computational methods and quantitative analyses into anthropological research.
Already in the 1960s, anthropologists looked at using computers for the organisation
of anthropological data and field notes (Kuzara et al. 1966; Podolefsky and McCarty
1983) and started using computers for social network analysis (Mitchell 1974).
Progress in text analysis, coding facts, and comparative studies in linguistics (Dobbert
et al. 1984; White and Truex 1988) followed suit.
However, only lately has there been a significant computational breakthrough in
the discipline. Digital anthropology turned disciplinary attention to the analysis of
online worlds, virtual identities, and human relationships with technology. For
example, Bell (2006) gave a cultural interpretation of the use of ICTs in South and
Southeast Asia, Boellstorff (2015) investigated the online world of Second Life, Nardi
(2010) explored gaming behaviour in World of Warcraft, and Bonilla and Rosa (2015)
described how to use hashtags for ethnographic research. Moreover, a discussion has
been opened on what ‘big data’ means for the social sciences and how to ethically
address its retrieval and analysis (boyd and Crawford 2012; Mittelstadt et al. 2016).
There was a discussion on the methodological front as well. Anderson et al.
(2009) argue for a method that combines the ethos of ethnography with database
mining techniques, something the authors call ‘ethno-mining’. Similarly, Blok and
Pedersen (2014) look at the intersection of ‘big’ and ‘small’ data to produce ‘thick’
data and include research subjects as co-producers of knowledge about themselves
(see also Hsu 2014). Finally, Krieg et al. (2017) not only elaborate on the usefulness
of algorithms for ethnographic fieldwork, but also show in detail how to conduct such
research in an example of online reports of drug experiences.
Anthropology vs. Data Analysis
For an anthropologist, statistical and computational analysis is not the first
thing that comes to mind when developing research design and methodology.
Anthropologists are trained to observe phenomena in the field, talk to people, spend
time with them, participate in daily activities, and immerse themselves in research
183A. Pretnar, D. Podjed: Data Mining Workspace Sensors: A New Approach to Anthropology
topics (Kawulich 2005; Marcus 2007). This type of information gives us detailed
stories of human lives, uncovers meanings behind rituals, habits, languages, and rela-
tionships, and provides a coherent explanation of the researched phenomena. So why
would anthropologists even have to include data analysis in their studies? Why and
when is such an approach relevant?
Sometimes, the phenomena that anthropologists are trying to explain occur in dif-
ferent places at the same time and are impossible to observe simultaneously. It could
be that anthropologists know little of the topic they are exploring and have yet to
generate their research questions. Alternatively, the nature of the phenomenon lends
itself nicely to computational analysis. For example, behaviour of many individuals is
difficult to observe in real time, especially if we want to observe them at once in dif-
ferent locations. Sensors, on the other hand, can track behaviours of these individuals
independently (Patel et al. 2012) and therefore enable a detailed comparative analysis.
With a large number of measurements, researchers can also observe seasonal varia-
tions, similarity of users, and changes through time.
Data analysis also helps us define the parameters of a research field and establish
what is ordinary and what extraordinary behaviour. Visualisations in particular are
excellent tools for exploring and understanding frequent patterns of behaviour and
outliers. When done well, visualisations harness the perceptual abilities of humans to
provide visual insights into data (Fayyad et al. 2002, 4). Moreover, they provide a new
perspective on a phenomenon and help generate research questions and hypotheses.
Once we know how research participants behave (or communicate if we are observing
textual documents or establish social ties if we are observing social networks), we can
enter the field equipped with knowledge and information to verify and contextualise.
Finally, large data sets are particularly appropriate for computational analysis.
While ‘big data’ became a popular buzzword in data science, anthropologists most
likely will not be dealing with millions of data points that can be analysed only with
graphics processing units (GPUs). However, even ten thousand observations are too
much for a researcher to make sense of. For such data, we need software tools and
visualisations, which provide an overview of the phenomenon, plot typical patterns,
and enable exploring different sub-populations.
Data Ethics and Surveillance Technologies
Computational anthropology encompasses not only methodological approaches to
data analysis, epistemological questions on the relationship of human beings to
technology, and empirical research with computational methods, but also the ethics
of data storage, processing, analysis and dissemination. In a broad sense, it includes
three axes of ethics, namely the ethics of data, the ethics of algorithms, and the
ethics of practices (Floridi and Taddeo 2016). In this contribution, we mostly touch
upon the final one, the ethics of practices.
Research ethics, in particular sensitivity to the potential harm a study could elicit,
is one of the core questions of anthropology, which is deeply immersed in the personal
human experience. First, a solid deontological paradigm is crucial for working with not
only sensitive data but any human-produced data. In this sense, we follow the princi-
ples of positivist ethics which call for human dignity, autonomy, protection, maximiz-
ing benefits and minimizing harm, respect, and justice (Markham et al. 2012; Halford
2017). In other words, anthropologists should act in the best interest of the research
participants and avoid or minimize negative effects the study could have on the people.
Secondly, anthropologists should be mindful of the potential subjectivity of their
interpretation of the data. Every data set, whether quantitative or qualitative, elicits
interpretation that inevitably stems from our own world-view. To keep the bias to a
minimum, the circular mixed methods approach proposed in this article, as well as
most other approaches with origins in anthropology, strives for continual
reinterpretation of the results within the actual social context of research participants.
Each ethnographic layer explains the results from the point of view of data producers
and thus minimizes the chance of bias and misinterpretation.
Sensors and wearable devices inevitably invoke questions of surveillance and pri-
vacy. Here, we propose a distinction between surveillance and monitoring. Surveillance
implies guiding actions of surveilled subjects, while monitoring proposes a more pas-
sive stance of observing behaviour (see Marx 2002; Nolan 2018). The present study
was not designed to guide behaviour but to observe and understand, hence being more
monitoring than surveillance focused. And even if we consider it surveillance-like,
Marx proposes “a broad comparative measure of surveillance slack which considers
the extent to which a technology is applied, rather than the absolute amount of surveil-
lance” (Marx 2002), meaning that the extent to which surveillance is harmful is the
power it holds for the user. Sensor data from a smart building that monitors only
neutral human behaviour fall on the soft side of power, which, in the opinion
of the authors, deserves some surveillance slack. Nevertheless, we strove to uphold
high ethical standards for handling the data and disseminating the results, mostly by
employing “ongoing consensual decision-making” (Ramos 1989) by informing par-
ticipants of the purpose of the research, which data are being collected and how the
findings are going to be presented.
Circular Mixed Methods
Circularity in machine learning and data mining is not a novel idea. Data science
methodology already includes ideas about circular phases of data mining (CRISP-DM,
Shearer 2000), where phases are interdependent and by reiterating through them the
analyst clarifies existing and generates new business questions. However, not much has
been written, let alone put into practice, in terms of interdisciplinary circularity and
the intertwining of methods from different scientific fields (for some pioneering
efforts see above).
For the purpose of our study, we designed a novel methodological approach,
named it circular mixed methods, and employed it to analyse the workspace behaviour
and practices of employees. This approach aims to observe the phenomenon from
several different perspectives. Nominally, we have split these perspectives into several
research stages, where we use a single method, but in reality these methods are used
interchangeably and in accordance with each particular situation. For the sake of clar-
ity, however, we will refer to the four stages of research.
The first stage involved gathering historical longitudinal data from the building’s
sensors. We used unsupervised data mining and exploratory data analysis to uncover
behavioural patterns, identify interesting individuals (outliers) and form several
hypotheses about the use of spaces and energy consumption in the building.
The second stage involved in-depth ethnographic research, where we used
interviews, questionnaires, focus groups, and, most importantly, participant observation
from a three-year period of working in the mentioned building. This part helped us
clarify the context of particular behaviours, identify the values, motivations and
deterring factors of each research participant, and confirm or reject hypotheses.
In the third stage, which is currently in process, we use text mining methods on
interview transcripts to find common topics, observe sentiment towards particular
issues, and determine which individuals have similar opinions on certain topics. This
is still a work-in-progress and the results might even be negative – since this is the
first application of text mining on ethnographic interview transcripts, we still need to
estimate the viability of such an approach for anthropology. The size of the corpus in
anthropology is normally small and it is entirely possible there is no added value to text
mining such data. Finally, we conclude with another round of ethnography, conduct-
ing the second round of interviews (normally a year after the first one) and verifying
the results from the third stage.
As mentioned before, the distinction between each particular stage is not always
strict. Circular mixed methods aim to provide the researcher with the freedom she
needs to address complex phenomena from several different perspectives. There can
be several alternating stages, where each stage contributes another interpretative layer
to the previously established facts.
There are plenty of benefits of this approach. Circular mixed methods are particu-
larly appropriate for uncovering intricate longitudinal patterns, which are incredibly
challenging for the researcher to observe at such granularity. Moreover, this method
can be used for observing simultaneous phenomena, where the data comes from several
locations at once, hence overcoming the physical limitations of a single researcher.
Finally, researchers can effectively and rapidly analyse large data collections with
computational means. Some such collections are interesting for anthropologists as
well, namely social media, archival data, wearables, sensors or audio-visual recordings.
Visualisations, one of the products of data mining, substantiate the findings and enable
researchers to uncover relations, patterns and outliers in the data. Data analysis can
thus help generate hypotheses and questions for the research. This cuts down the time
required to get familiar with the field. A researcher can come into the field equipped
with potentially interesting hypotheses and test them almost immediately.
Looking at the data alone, however, we would be unable to determine what any
of those patterns and outliers mean. To truly understand them, we need to immerse
ourselves in the field, ask questions and observe how people behave and create their
habits and practices. While quantitative analysis provides us with clues, qualitative
approaches, such as ethnography and fieldwork, explain those clues and substantiate
the superficial knowledge of the field acquired in the first and third research phases.
Metaphorically speaking, data analysis is great for scratching the surface and ethnog-
raphy excels at digging deeper. By combining the two approaches, however, we can
interpret the data in a rich and meaningful way, as we will show with the case study
presented in the article.
Data Preprocessing
In our study, we have observed sensor measurements from a faculty building
which is considered to be a state-of-the-art smart building in Slovenia. Each room
in the building is equipped with a temperature sensor and sensors on windows that
track when they are open or closed. Doors have electronic key locks that track when
the room is occupied. There were altogether 11 sensor measurements, with additional
8 measurements coming from the weather station located on the building’s rooftop.
In-room sensors report the room temperature, set temperature, ventilation speed,
daily regime, and so on, while the weather station reports the external temperature,
light, rainfall, etc.
One of the most important measures is the daily regime, which has four values,
each representing a state of the overall room setting. When a person is present in the
room, the regime is comfort (value = 0) and when a window is open, the regime is
off (value = 4). If the room is vacant, the regime goes to night (value = 1) or standby
(value = 3).2 These measurements come from electronic locks on the doors, which
record when the room is occupied, and the magnets in windows, which record when
the window is open.
We retrieved 55,456 recordings for 14 rooms of different types, namely 5
laboratories, 6 cabinets, and 3 administration rooms. Measurements are recorded every
two hours and stored in SCADA, software that allows controlling processes locally or
at remote locations, monitoring and processing real-time data, interacting with devices,
and recording events into a log file.
We decided to observe the year 2016 and later compare it to 2017. The results in
the paper refer only to 2016. The rooms are anonymised to ensure data privacy and
results for two of the rooms are not reported at the request of their occupants.
2 Standby is activated on workdays as a transitory setting between night and comfort regime.
Table 1: Original data.
Date                 Room temperature  Daily regime  Room
2016-01-01 02:10:00  20.94266          1             C
2016-01-01 02:10:00  21.65854          1             B
2016-01-01 02:10:00  20.63234          1             K
2016-01-01 02:10:00  22.41270          1             D
2016-01-01 02:10:00  20.25890          1             M
2016-01-01 02:10:00  21.45220          3             C
Source: Author data.
We performed extensive data cleaning and preprocessing and removed data
points with missing values (Table 1). The daily regime was considered the most
important variable since it reports a presence in the room or the opening of windows.
Accordingly, we retained only two variables: the daily regime, since this feature
registered human behaviour the best, and room temperature. We also generated
additional features, such as the day of the week and room type (cabinet, laboratory,
and administration).
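The cleaning and feature-generation step can be illustrated with a minimal Python sketch. It is not the workflow used in the study; the tuple layout, the function name clean, and the missing temperature in the second sample row are assumptions made for illustration, while the room types are taken from Table 2.

```python
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Room types as listed in Table 2; rooms not listed there stay "unknown".
ROOM_TYPE = {"C": "laboratory", "B": "laboratory", "K": "administration",
             "D": "cabinet", "M": "administration"}


def clean(rows):
    """Drop readings with missing values and add day and room-type features."""
    cleaned = []
    for date, temperature, regime, room in rows:
        if None in (date, temperature, regime, room):
            continue  # remove data points with missing values
        timestamp = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")
        cleaned.append({
            "timestamp": timestamp,
            "temperature": float(temperature),
            "regime": int(regime),
            "room": room,
            "day": DAYS[timestamp.weekday()],
            "type": ROOM_TYPE.get(room, "unknown"),
        })
    return cleaned


raw = [
    ("2016-01-01 02:10:00", 20.94266, 1, "C"),
    ("2016-01-01 02:10:00", None, 1, "B"),  # hypothetical missing temperature
    ("2016-01-01 02:10:00", 20.63234, 1, "K"),
]
records = clean(raw)
```

Only the two complete readings survive, each extended with the day of the week and the room type.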
Table 2: Data transformed into a behaviour vector. 1 denotes occupancy of the room,
meaning daily regime value was either 0 (comfort) or 4 (window open).
Date        0 am  1 am  2 am  3 am  …  Room  Day  Type
2016-01-01  0     1     1     0     …  C     Fri  laboratory
2016-01-01  0     0     0     0     …  B     Fri  laboratory
2016-01-01  0     0     1     1     …  K     Fri  administration
2016-01-01  0     0     0     1     …  D     Fri  cabinet
2016-01-01  0     1     1     1     …  M     Fri  administration
2016-01-02  0     0     1     1     …  C     Sat  laboratory
Source: Author data.
In the second part of the analysis, we created a transformed data set where we
merged daily readings for a room into one ‘daily behaviour’ vector (Table 2). In the
new data set, each room has a daily recording, where the new features are values of
the daily regime at each hour. Since sensors only record the state every two hours, we
filled missing values with the previously observed state. For example, if the original
vector was {0, ?, 0, ?, 1, ?, 1}, we imputed missing values to get {0, 0, 0, 0, 1, 1, 1}. As
we were interested only in the presence in the room, we put 0 where daily regime was 1
(night) or 3 (standby) and 1 where it was 0 (comfort) or 4 (window open), discarding
the information on specific temperature regimes. This gave us the final daily behaviour
vector which we could compare in time and between rooms.
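The imputation and binarisation described above can be sketched in a few lines of Python. This is an illustration under our own naming (forward_fill, to_presence, OCCUPIED_REGIMES are hypothetical), not the code used in the study; missing readings are represented as None.

```python
OCCUPIED_REGIMES = {0, 4}  # 0 = comfort, 4 = off (window open)


def forward_fill(readings):
    """Replace each missing reading (None) with the last observed state."""
    filled, last = [], None
    for value in readings:
        if value is not None:
            last = value
        filled.append(last)  # a leading gap stays None: nothing observed yet
    return filled


def to_presence(regimes):
    """Map regime codes to presence: 1 for comfort or window open, else 0."""
    return [1 if regime in OCCUPIED_REGIMES else 0 for regime in regimes]


# The example vector from the text, with '?' written as None.
observed = [0, None, 0, None, 1, None, 1]
filled = forward_fill(observed)
presence = to_presence(filled)
```

For the example vector, the filled regimes are {0, 0, 0, 0, 1, 1, 1}. Since 0 stands for comfort and 1 for night, the resulting presence vector is {1, 1, 1, 1, 0, 0, 0}.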
To sum up, we were working with two data sets, the first reporting the presence of
people in the room for a given time (11 features from room sensors) and the second
one showing the behaviour of the room throughout the day (24 features on room
occupancy at each hour).
Results
First, we wanted to see how rooms differ by room occupancy alone. We hypothesised
there would be a significant difference in occupancy between laboratories and
cabinets, since the presence of more people in a space extends the occupancy hours
(their working times do not overlap completely). We took the first data set, with
readings taken every two hours, and removed readings where the daily regime was
either 1 (night) or 3 (standby), because these readings indicate the room was not
occupied. Afterwards, we computed
the contingency matrix of room occupancy by the day of the week, which shows how
many times per year a room was occupied on a certain day. We visualised the result in
a line plot by the type of the room (Figure 1). We can notice that laboratories have a
higher presence on Saturday and Sunday than the other rooms.
Figure 1: Occupancy of the rooms for each day of the week
Source: Author data displayed in the software Orange.
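The counting step behind Figure 1 can be sketched as follows. This is an illustrative Python sketch rather than the Orange workflow used in the study; the tuple layout (date, daily regime, room) and the sample readings are hypothetical.

```python
from collections import Counter
from datetime import date

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
VACANT_REGIMES = {1, 3}  # 1 = night, 3 = standby


def occupancy_by_weekday(readings):
    """Count occupied readings per (room, day of the week)."""
    counts = Counter()
    for day, regime, room in readings:
        if regime in VACANT_REGIMES:
            continue  # the room was not occupied at this reading
        counts[(room, DAYS[day.weekday()])] += 1
    return counts


# Hypothetical readings: (date, daily regime, room).
readings = [
    (date(2016, 1, 1), 0, "N"),  # Friday, comfort
    (date(2016, 1, 2), 4, "N"),  # Saturday, window open
    (date(2016, 1, 2), 1, "N"),  # Saturday, night: skipped
    (date(2016, 1, 8), 0, "N"),  # Friday, comfort
    (date(2016, 1, 1), 3, "G"),  # Friday, standby: skipped
]
counts = occupancy_by_weekday(readings)
```

Each entry of counts corresponds to one cell of the contingency matrix; plotting the cells of one room across the seven days gives one line of the plot.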
Moreover, N and O are the top two rooms by occupancy. We know that these
two rooms belong to a single laboratory and are separated with a permanently open
door. These two rooms are occupied by the largest number of people and since the
employees of the faculty have a somewhat flexible working time, the dispersion of
working time is expectedly the highest in rooms with the most occupants (smallest
overlap in working time among employees). N and O are also among the few rooms
where occupancy goes up towards the end of the week.
F and B are also laboratories, both displaying similarly high presence across the
week. At the bottom of the plot are the cabinets, namely G, K, F. Unsurprisingly,
cabinets display lower occupancy rates than laboratories, since cabinets are used by a
single person and hence no overlap is possible. They are also functional rooms, used
predominantly for meetings, office hours, and other intermittent work of professors.
With the second room occupancy data set, we made an analysis of behavioural
patterns by the time of the day. We observed occupancy by room type in a heat map
where 1 (yellow) means presence and 0 (blue) absence. Visualisation in Figure 2 is
simplified by merging similar rows with k-means (k = 50) and clustering by similarity
(Euclidean distance, average linkage, and optimal leaf ordering). Such simplification
joins identical or highly similar patterns into one row and rearranges them so that
similar rows are put closer together.
Figure 2: Occupancy of the rooms for each day of the week
Source: Author data displayed in the software Orange.
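The grouping of similar daily vectors can be illustrated with a small average-linkage agglomerative clustering under Euclidean distance. This is a simplified sketch, not the procedure behind Figure 2: the k-means pre-merging and the optimal leaf ordering are omitted, the function name is our own, and the six-hour sample vectors are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the mean Euclidean distance
    over all pairs of their members (average linkage).
    """
    clusters = [[i] for i in range(len(vectors))]

    def avg_distance(a, b):
        total = sum(dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # combinations yields index pairs with x < y, so deleting y is safe
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg_distance(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical shortened behaviour vectors (six hours instead of 24).
vectors = [
    (0, 1, 1, 1, 0, 0),  # early presence
    (0, 1, 1, 1, 1, 0),  # early presence
    (0, 0, 0, 1, 1, 1),  # late presence
    (0, 0, 1, 1, 1, 1),  # late presence
    (0, 0, 0, 0, 0, 0),  # empty room
]
clusters = average_linkage(vectors, n_clusters=3)
```

On this toy input the early, late, and empty patterns end up in three separate clusters, which mirrors how rows with similar occupancy sequences are placed together in the heat map.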
Clustering revealed that the occupancy sequence depends strongly on the room type.
There were some erroneous data, where sensors recorded presence at unusual hours (for
example, during the night consistently across all rooms). But despite some noise in
our data, we can distinguish between typical laboratory, administration and cabinet
behaviour, since the erroneous data constitute a separate cluster (Dave 1991). Cabinets
again show the lowest occupancy with presence recorded sporadically across the day.
Normally, university lecturers spend a large portion of their time in lecture rooms and
in their respective laboratories. This is why occupancy of cabinets is so erratic and does
not display a consistent pattern. Laboratory occupants, on the other hand, usually
come late and stay late, while administration staff work regularly from 7:00 a.m. to
4:00 p.m. They both display fairly consistent behaviour.
We visualised the same data set in a line plot, which shows the frequency of attributes
on a line. In this way, we can better observe the differences between individual rooms
at each time of the day and where specific peaks (high frequencies) happen. Figure 3
displays the occupancy ratio at a specific time of day, while Figure 4 shows the ratio of
window opening.3 Several interesting observations emerge. In both cases, room O is
skewed to the right, meaning its occupants work at late hours and open windows while
working. Conversely, room J is skewed to the left, indicating its occupants start work
earlier than most. There is also a distinct peak in window opening at around lunch time.
Figure 3: Room occupancy by the time of day
3 1 would mean the room was always occupied and 0 that the room was never occupied at a specific time of the day.
In most rooms, people open windows from late morning to early afternoon. This is
not surprising, considering this is their peak working time. It is a useful indicator
for an ethnographer who wants to observe window interaction (who does it, is there a
consensus on whether or not a window should be opened, does this happen more
frequently after lunch…). Looking at the data, the best time for observing the
specified behaviour is between 10:00 a.m. and 1:00 p.m. Accordingly, data analysis
can also serve as a guide for ethnographic fieldwork.
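The ratios plotted in Figures 3 and 4, and the selection of an observation window from them, can be sketched as follows. The sketch is illustrative only: the helper names, the four sample days, and the 0.75 threshold are assumptions, not values from the study.

```python
def hourly_ratio(daily_vectors):
    """Share of days on which the event (presence, open window) occurred per hour.

    1 means the event occurred at that hour on every day, 0 that it never did.
    """
    n_days = len(daily_vectors)
    n_hours = len(daily_vectors[0])
    return [sum(day[h] for day in daily_vectors) / n_days
            for h in range(n_hours)]


def peak_hours(ratios, threshold):
    """Hours whose ratio reaches the (hypothetical) threshold."""
    return [hour for hour, ratio in enumerate(ratios) if ratio >= threshold]


def day_vector(active_hours):
    """Build a 24-hour binary vector with 1 at the given hours."""
    return [1 if hour in active_hours else 0 for hour in range(24)]


# Hypothetical window-opening vectors for one room over four days.
days = [
    day_vector({10, 11, 12}),
    day_vector({10, 11}),
    day_vector({11, 12, 15}),
    day_vector({10, 11, 12}),
]
ratios = hourly_ratio(days)
best = peak_hours(ratios, threshold=0.75)
```

In this toy example the hours from 10 to 12 reach the threshold, so an ethnographer would schedule observation of window interaction for late morning.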
Ethnography Comes In
Data analysis revealed some interesting patterns in the use of working spaces:
– laboratories work more on the weekends,
– rooms N and O work late,
– room J starts the day early and opens the windows at lunchtime, and
– in rooms H, N and O the occupancy goes up towards the end of the week.
How can we explain this? While the data gave us clues, the answers lie with the
people. Substantiating analytical findings with fieldwork ethnography is crucial for
understanding the data. We conducted semi-structured interviews with the rooms’
occupants to discover what those patterns mean and why a certain behaviour occurs.
Figure 4: Window opening frequency by the time of the day
Laboratories have a higher weekend occupancy since they offer a quiet place to
work for PhD students who are either catching deadlines for publishing papers or
using their ‘off time’ for some in-depth research. Room B, in particular, seems to like
working at weekends and we were able to identify an individual who often comes to
work on Saturdays. In the interview, he4 told us this was the time when he finally man-
aged to do some actual work: “Effectively, if you look at the duration of my focus, it is much
longer during the weekends. In my opinion, I do a day and a half worth of work during the
weekend compared to the weekday.”
Rooms N and O are quite similar in terms of presence although room N displays
a tendency to work the latest. By observing the inhabitants in this room and talking
to them, we identified an individual who preferred to work in the late afternoon and
evening. Since, as mentioned above, working time is flexible at the studied faculty, he
adjusted his working hours to suit his preferences. He also prefers fresh air to artificial
ventilation and opens the windows whenever possible. This accounts for the skew to
the right for room O in Figure 4. “I like fresh air,” he told us. “The air outside is always
better than the air inside. I opened my window at every chance, even during the winter.”
The increased occupancy in rooms N, O, and H towards the end of the week is
explained by the fact that Fridays are working sprints for the occupants of these three
rooms. The case of room H is particularly interesting. This is the room with the overall
lowest occupancy, yet the room is most frequented on Fridays, unlike in most other
rooms, where the occupancy decreases towards the end of the week. Room H is the
cabinet of a professor who runs laboratories N and O. He is also a part of the Friday
development sprints, hence the peak. Yet he is very sociable and prefers to work in
the laboratory with colleagues, rather than alone in the cabinet, as was evident from
our observation and discussion with him. This also explains the overall low and erratic
occupancy of his room during the rest of the week.
The skewed peak for room J in Figure 3 is also interesting. The occupant of this
room admitted he prefers coming to work earlier to make the most of the day. He
stressed several times that daylight is important to him and by shifting working time
to earlier hours, he was able to leave early and use the rest of the day for himself. He
also said he was the most productive in early mornings since these were the quietest
parts of the day. In his words: “[I like coming early] because I have more of the day left in
my private life. It is also quiet in the morning and I can do more work.”
Personal preferences evidently affected the discovered patterns of workday behav-
iour. In summary, people working in the researched building adjust their working hours
and their environment to suit their personal needs, values, and lifestyle. Designing a
single solution for such a diverse group not only invokes dissatisfaction among occu-
pants, but leads to lower productivity, higher stress, improvised DIY solutions, and
ultimately to higher energy consumption and worse workspace health.
4 For concealing the actual identity of the people participating in the study, the pronoun he is used to denote both
males and females.
Conclusion
In this paper, we have shown how anthropological (qualitative) research methods
can be enriched, upgraded, and substantiated by data analysis. While the findings are
still preliminary and based on a limited sample, they nevertheless pinpoint aspects of
data analysis that benefit from ethnographic insight and vice versa.
With the increasing availability of data, especially from sensors, wearable devices,
and social media, anthropologists can use computational methods and data analysis
to uncover common patterns of human behaviour and pinpoint interesting outliers.
Quantitative methods have proven useful when dealing with large data sets. In such
cases, an analysis without digital tools is virtually impossible, while visualisations offer
new insight into the problem and help present the data concisely. In addition, quantita-
tive approaches also increase the reproducibility of research.
However, patterns emerging from such analysis can hardly ever be explained with
data alone. We argue that data analysis can generate new hypotheses and research
questions (Krieg et al. 2017) and provide a general overview of the topic. Conversely,
ethnography substantiates analytical findings with the context and story behind the
data. Going back and forth, from quantitative to qualitative methods and approaches,
enables researchers to establish a research problem as suggested by the data, gauge new
perspectives on the known problems, and account for outliers and patterns in the data.
Circular research design enhances the quality of information, which does not have to
derive solely from a quantitative or qualitative approach. By combining the two, we
are using a research loop that ensures both sets of data get an additional perspective –
quantitative data are verified with ethnography in the field, while ethnographic data
become supported with statistically relevant patterns analysed by computational tools.
Such methods are already, to a certain extent, employed in digital anthropology
(Drazin 2012), but they are gaining more prominence in mainstream anthropology as
well (Krieg et al. 2017). By establishing a solid methodological framework for quanti-
tative analyses in relation to qualitative ones, we do not only strengthen the subfield of
computational anthropology, but also provide new perspectives and research ventures
to anthropology and emphasise its relevance for understanding lifestyles, habits, and
practices in data-driven societies.
Sources and Literature
Literature:
• Anderson, Ken, Dawn Nafus, Tye Rattenbury, and Ryan Aipperspach. 2009. “Numbers Have
Qualities Too: Experiences with Ethno-mining.” Ethnographic Praxis in Industry Conference
Proceedings 2009 (1): 123–40.
• Bell, Genevieve. 2006. “Satu keluarga, satu komputer (One Home, One Computer): Cultural
Accounts of ICTs in South and Southeast Asia.” Design Issues 22 (2): 35–55.
• Blok, Anders, and Morten Axel Pedersen. 2014. “Complementary Social Science? Quali-
quantitative experiments in a Big Data World.” Big Data & Society 1 (2): 1–6.
• Boellstorff, Tom. 2015. Coming of Age in Second Life: An Anthropologist Explores the Virtually
Human. Princeton: Princeton University Press.
• Bonilla, Yarimar, and Jonathan Rosa. 2015. “#Ferguson: Digital Protest, Hashtag Ethnography, and
the Racial Politics of Social Media in the United States.” American Ethnologist 42 (1): 4–17.
• boyd, danah, and Kate Crawford. 2012. “Critical Questions for Big Data: Provocations for a
Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15
(5): 662–79.
• Conte, Rosaria, Nigel Gilbert, Giulia Bonelli, Claudio Cioffi-Revilla et al. 2012. “Manifesto of
Computational Social Science.” The European Physical Journal Special Topics 214 (1): 325–46.
• Creswell, John W., and Vicki L. Plano Clark. 2007. Designing and Conducting Mixed Methods
Research. Thousand Oaks: Sage Publications.
• Dave, Rajesh N. 1991. “Characterization and Detection of Noise in Clustering.” Pattern Recognition
Letters 12 (11): 657–64.
• Dobbert, Marion Lundy, Dennis P. McGuire, James J. Pearson, and Kenneth Clarkson Taylor. 1984.
“An Application of Dimensional Analysis in Cultural Anthropology.” American Anthropologist 86
(4): 854–84.
• Drazin, Adam. 2012. “Design Anthropology: Working on, with and for Digital Technologies.” In
Digital Anthropology, edited by Heather A. Horst and Daniel Miller, 245–65. London and New
York: Berg.
• Fayyad, Usama M., Andreas Wierse, and Georges G. Grinstein, eds. 2002. Information Visualization
in Data Mining and Knowledge Discovery. San Francisco: Morgan Kaufmann.
• Floridi, Luciano, and Mariarosaria Taddeo. 2016. “What is Data Ethics?” Philosophical Transactions
A 374: 1–5.
• Gershenfeld, Neil, and J. P. Vasseur. 2014. “As Objects go Online: The Promise (and Pitfalls) of the
Internet of Things.” Foreign Affairs 93: 60.
• Halford, Susan. 2017. “The Ethical Disruptions of Social Media Data: Tales from the Field.” In The
Ethics of Online Research, 13–25. Bingley: Emerald Publishing Limited.
• Hockey, Susan. 2004. “The History of Humanities Computing.” A Companion to Digital Humanities,
3–19.
• Hsu, Wendy F. 2014. “Digital Ethnography Toward Augmented Empiricism: A New
Methodological Framework.” Journal of Digital Humanities 3 (1): 1–19.
• Jänicke, Stefan, Greta Franzini, Muhammad Faisal Cheema, and Gerik Scheuermann. 2015. “On
Close and Distant Reading in Digital Humanities: A Survey and Future Challenges.” In Eurographics
Conference on Visualization (EuroVis)-STARs. The Eurographics Association.
• Johnson, R. Burke, Anthony J. Onwuegbuzie, and Lisa A. Turner. 2007. “Toward a Definition of
Mixed Methods Research.” Journal of Mixed Methods Research 1 (2): 112–33.
• Kawulich, Barbara. 2005. “Participant Observation as a Data Collection Method.” In Qualitative
Sozialforschung / Forum: Qualitative Social Research 6 (2).
• Klein, Lauren F., and Matthew K. Gold. 2016. “Digital Humanities: The Expanded Field.” Debates
in the Digital Humanities.
• Krieg, Lisa Jenny, Moritz Berning, and Anita Hardon. 2017. “Anthropology with Algorithms? An
Exploration of Online Drug Knowledge Using Digital Methods.” Issues 5 (2).
• Kuzara, Richard S., George R. Mead, and Keith A. Dixon. 1966. “Seriation of Anthropological
Data: A Computer Program for Matrix-ordering.” American Anthropologist 68 (6): 1442–55.
• Lum, Kristian, and William Isaac. 2016. “To Predict and Serve?” Significance 13: 14–19.
• Malinowski, Bronislaw. 2002 [1922]. Argonauts of the Western Pacific: An Account of Native
Enterprise and Adventure in the Archipelagoes of Melanesian New Guinea. London: Routledge.
• Marcus, George E. 2007. “Ethnography Two Decades after Writing Culture: From the Experimental
to the Baroque.” Anthropological Quarterly 80 (4): 1127–45.
• Markham, Annette, Elizabeth Buchanan, and AoIR Ethics Working Committee. 2012. Ethical
Decision-making and Internet Research: Version 2.0. Association of Internet Researchers.
• Marx, Gary T. 2002. “What’s New about the ‘New Surveillance’? Classifying for Change and
Continuity.” Knowledge, Technology & Policy 17 (1): 18–37.
• Miller, Daniel, Elisabetta Costa, Nell Haynes, Tom McDonald, Razvan Nicolescu, Jolynna
Sinanan, Juliano Spyer, Shriram Venkatraman, and Xinyuan Wang. 2016. How the World Changed
Social Media. London: UCL Press.
• Mitchell, J. Clyde. 1974. “Social Networks.” Annual Review of Anthropology 3: 279–99.
• Mittelstadt, Brent Daniel, Patrick Allo, Mariarosaria Taddeo, Sandra Wachter, and Luciano Floridi.
2016. “The Ethics of Algorithms: Mapping the Debate.” Big Data & Society 3 (2): 1–21.
• Nardi, Bonnie. 2010. My Life as a Night Elf Priest: An Anthropological Account of World of Warcraft.
University of Michigan Press.
• Nolan, Cathy. 2018. “Data Surveillance, Monitoring, and Spying: Personal Privacy in a Data-
Gathering World.” Data Topics. Published May 2, 2018. https://www.dataversity.net/data-
surveillance-monitoring-spying-personal-privacy-data-gathering-world/.
• Patel, Shyamal, Hyung Park, Paolo Bonato, Leighton Chan, and Mary Rodgers. 2012. “A Review
of Wearable Sensors and Systems with Application in Rehabilitation.” Journal of Neuroengineering
and Rehabilitation 9 (1): 21.
• Pink, Sarah, Shanti Sumartojo, Deborah Lupton, and Christine Heyes La Bond. 2017. “Mundane
Data: The Routines, Contingencies and Accomplishments of Digital Living.” Big Data & Society 4
(1): 1–12.
• Podolefsky, Aaron, and Christopher McCarty. 1983. “Topical Sorting: A Technique for Computer
Assisted Qualitative Data Analysis.” American Anthropologist 85 (4): 886–90.
• Pretnar, Ajda, and Marko Robnik-Šikonja. 2019. “Analiza slik in besedil s pristopi umetne
inteligence.” Glasnik SED 59 (1): 49–57.
• Ramos, Mary Carol. 1989. “Some Ethical Implications of Qualitative Research.” Research in
Nursing & Health 12 (1): 57–63.
• Shearer, Colin. 2000. “The CRISP-DM Model: the New Blueprint for Data Mining.” Data
Warehousing 5: 13–22.
• Skeem, Jennifer L., and Christopher Lowenkamp. 2016. “Risk, Race, & Recidivism: Predictive
Bias and Disparate Impact.” Criminology: An Interdisciplinary Journal 54 (4): 680–712.
• Svensson, Patrik. 2010. “The Landscape of Digital Humanities.” Digital Humanities. http://
digitalhumanities.org/dhq/vol/4/1/000080/000080.html.
• Teddlie, Charles, and Abbas Tashakkori. 2009. Foundations of Mixed Methods Research: Integrating
Quantitative and Qualitative Approaches in the Social and Behavioral Sciences. Los Angeles: Sage.
• White, Douglas R., and Gregory F. Truex. 1988. “Anthropology and Computing: The Challenges
of the 1990s.” Social Science Computer Review 6 (4): 481–97.
Oral Sources:
• Personal archive.
Ajda Pretnar, Dan Podjed
DATA MINING WORKSPACE SENSORS: A NEW
APPROACH TO ANTHROPOLOGY
SUMMARY
With an increasing availability of data coming from social networks and wearable
devices, among other sources, anthropologists can more easily than ever dive into data
analysis and study humans and their societies, subcultures and cultures quantitatively
as well as qualitatively. In this contribution we extend the interdisciplinarity of anthro-
pology by employing circular mixed methods that combine qualitative (ethnographic)
approaches with quantitative approaches from data mining and machine learning.
The research, which is the basis for this contribution, began in October 2017 and
includes 14 workspaces of one of University of Ljubljana’s buildings. For the purpose
of our study, we designed a novel methodological approach and named it circular mixed
methods. We employed it to analyse workspace behaviours and practices of employees
and to develop sustainable solutions for encouraging a healthy lifestyle.
As we explain in the contribution, circular mixed methods are appropriate for
uncovering detailed longitudinal patterns, which are impossible to detect manually.
The suggested approach is also effective for analysing simultaneous phenomena, where we
retrieve the data from several locations at once, thus overcoming the physical limitations
of individual researchers. Finally, using this methodology, researchers can effectively
and rapidly analyse large data collections. Some such collections are of particular
interest to anthropologists, namely social media, archival data, wearables, sensors, and
audio-visual recordings.
While quantitative analysis helps us generate hypotheses and uncover patterns
in the data, qualitative approaches, such as ethnography and fieldwork, explain those
patterns and substantiate data with rich details. Combining the two approaches, we
can interpret the data in a contextually rich and anthropologically relevant way.
Ajda Pretnar, Dan Podjed
DATA MINING WORKSPACE SENSORS: A NEW
APPROACH TO ANTHROPOLOGY
SUMMARY (POVZETEK)
With the ever-growing amount of data obtained from, among other sources, social
networks and smart devices, anthropologists can more easily than ever use data analytics
in their research and study people and their habits, as well as cultures and subcultures,
from both a quantitative and a qualitative perspective. In this contribution we extend the
idea of interdisciplinarity in anthropology by using circular mixed methods, which link
qualitative (ethnographic) approaches with the quantitative approaches of data mining
and machine learning.
The research underlying this contribution began in October 2017 and includes 14
workspaces in one of the buildings of the University of Ljubljana. To establish new
perspectives on the habits of people in the building and to devise potentially relevant
hypotheses, we used circular mixed methods, with which we analysed behaviours and
practices and, on that basis, developed comprehensive solutions for encouraging a healthy
lifestyle and improving well-being in the workplace.
As the contribution explains, circular mixed methods are best suited to uncovering
detailed and long-term patterns that a researcher cannot observe and detect alone. The
approach is also effective for observing simultaneous events, where data are obtained
from several locations at once, thereby overcoming the researcher’s physical limitations.
In addition, the methodology is useful for the efficient and rapid analysis of large data
collections, among which data from social networks, smart devices and sensors,
audio-visual material, and digitised archival sources are of particular interest to
anthropologists.
While quantitative analysis makes it possible to form hypotheses and uncover patterns
in the data, qualitative methods, especially ethnography and field observation, explain
those patterns and enrich them with detail. Combining the two approaches ensures that
the data are interpreted in a contextually rich and anthropologically relevant way.
* University of Ljubljana, Faculty of Computer and Information Science, Večna Pot 113, SI-1000 Ljubljana, Jožef
Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, tadej.skvorc@fri.uni-lj.si
** Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, University of Ljubljana, Faculty of Arts, Aškerčeva 2,
SI-1000 Ljubljana, simon.krek@guest.arnes.si
*** Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, senja.pollak@ijs.si
**** University of Ljubljana, Faculty of Arts, Aškerčeva 2, SI-1000 Ljubljana, University of Ljubljana, Faculty of
Computer and Information Science, Večna Pot 113, SI-1000 Ljubljana, spela.arharholdt@ff.uni-lj.si
***** University of Ljubljana, Faculty of Computer and Information Science, Večna Pot 113, SI-1000 Ljubljana,
marko.robnik@fri.uni-lj.si
1.01 UDC: 070.18:329.052 (497.5)”1945/1990”
Tadej Škvorc,* Simon Krek,** Senja Pollak,***
Špela Arhar Holdt,**** Marko Robnik-Šikonja*****
Predicting Slovene Text Complexity
Using Readability Measures
ABSTRACT (IZVLEČEK)
PREDICTING SLOVENE TEXT COMPLEXITY
USING READABILITY MEASURES
Most existing readability formulas are designed for texts in English, on which their
quality has also been evaluated. In this article we present an adaptation of selected
measures for Slovene. We evaluate the performance of ten well-known formulas and eight
additional readability criteria on five groups of texts: children’s magazines, general
magazines, newspapers, technical magazines, and transcripts of National Assembly sessions.
These groups of texts have different target audiences, so we assume that they use different
writing styles, which the readability formulas and criteria should detect. In the analysis
we show which readability formulas and criteria work well and which ones failed to detect
differences between the groups.
Keywords: readability, natural language processing, text analysis
ABSTRACT
The majority of existing readability measures are designed for English texts. We aim
to adapt and test the readability measures on Slovene. We test ten well-known readability
formulas and eight additional readability criteria on five types of texts: children’s magazines,
general magazines, daily newspapers, technical magazines, and transcriptions of national
assembly sessions. As these groups of texts target different audiences, we assume that the
differences in writing styles should be reflected in their readability scores. Our analysis shows
which readability measures perform well on this task and which fail to distinguish between
the groups.
Keywords: readability, natural language processing, text analysis
Introduction
In English, the problem of determining text readability (i.e. how easy a text is
to understand) has long been a topic of research, with its origins in the 19th century
(Sherman 1893). Since then, many different methods and readability measures have
been developed, often with the goal of determining whether a text is too difficult for its
target age group. Even though the question of readability is complex from a linguistic
standpoint, a large majority of existing measures are based on simple heuristics. There
has been little research on readability in languages other than English; we therefore aim
to apply these measures to Slovene and evaluate how well they perform.
There are several factors that might cause these measures to perform poorly on
non-English languages, such as:
– Many measures are fine-tuned to correspond to the grade levels of the United
States education system. It is likely a different fine-tuning would be needed for
other languages, as a.) their education system is different from the US system,
and b.) the differences in readability between grade levels are likely to be different
between languages, meaning that each language would require specifically tuned
parameters.
– Some measures utilize a list of common English words and their results depend
on the definition of this list. For Slovene, there currently does not exist a publicly
available list of common words, so it is not known how such measures would
perform.
– The existing readability measures do not use the morphological information to
determine difficult words but rely on syllable and character counts, or a list of
difficult words. As Slovene is morphologically much more complex than English,
words with complex morphology are harder to understand than those with simple
morphology, even if they have the same number of characters or syllables.
We analyze the commonly used readability measures (as well as some novel meas-
ures) on Slovene texts and propose a word list needed to implement the word-list-
based measures. We calculate statistical distributions of scores for each readability
measure across subcorpora and assess the ability of measures to distinguish between
different subcorpora using a variety of statistical tests. We show that machine learning
classification models, using a combination of readability measures, can predict the
subcorpus a given text belongs to.
This paper extends the shorter version presented in Škvorc et al. (2018)
and is structured as follows. We first present the related work on readability measures
and describe the readability measures used in our analysis. The methodology of the
analysis is presented next, followed by the results split into three sections. The last
section concludes the paper and presents ideas for further work.
Related Work
For English, there exists a variety of works focused on determining readability by
using readability formulas. Those formulas rely on different features of the text such
as the average sentence length, percentage of difficult words, and the average number
of characters per word. Examples of such measures include the Coleman-Liau index
(Coleman and Liau 1975), LIX (Björnsson 1968), and the automated readability
index (ARI) (Senter and Smith 1967). Some formulas, like the Flesch-Kincaid grade
level (Kincaid et al. 1975) and SMOG (McLaughlin 1969), use the number of syllables
per word to determine whether a word is difficult. Additionally, some measures (e.g., the
Spache readability formula (Spache 1953) and the Dale-Chall readability formula (Dale
and Chall 1948)) rely on a pre-constructed list of difficult words.
Aside from the readability formulas, there exists a variety of other approaches that
can be used to determine readability (Bailin and Grafstein 2016). For example, vari-
ous machine-learning approaches can be used to obtain better results than readability
formulas, such as the approach presented in Francois and Miltsakaki (2012), which
outperforms readability formulas on French text.
There is little work attempting to apply these measures to Slovene texts. Most
work dealing with the readability of Slovene text is focused on manual methods. For
example, Justin (2009) analyzes Slovene textbooks from a variety of angles, including
readability. On the other hand, works that focus on automatic readability measures are
rare. Zwitter Vitez (2014) uses a variety of readability measures for author recognition
in Slovene text, but we found no works that used them to determine readability.
In addition to Slovene, some related works evaluate readability measures on other
languages. Debowski et al. (2015) evaluate readability formulas on Polish text and
show that they obtain better results by using a more complex, machine-learning-based
approach.
Readability Measures
In our analysis, we used two groups of readability measures:
– Existing readability formulas for English: we focused mainly on popular methods
that have been shown to achieve good results on English texts. These measures
mostly rely on easy-to-obtain features such as the number of difficult words,
sentence length, and word length.
– Natural-language-processing-based readability criteria: we used additional cri-
teria that are not present in the existing readability formulas but can be obtained
from tools for automatic language processing, such as the percentage of verbs,
number of unique words, and morphological difficulty of words. In the existing
English formulas, such criteria are not used but they might contain useful informa-
tion for determining the readability of Slovene texts.
In the following two subsections we present the established readability measures
for grading English text and our proposed additional criteria.
Existing Readability Formulas
There exists a variety of ways to measure the readability of texts written in English.
For our analysis, we used 10 readability formulas given below. The entities used in the
expressions correspond to the number of occurrences of a given entity, e.g., word cor-
responds to the number of words in a measured text.
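– Gunning fog index (Gunning 1952) is calculated as:
$$\mathrm{GFI} = 0.4 \left( \frac{\text{words}}{\text{sentences}} + 100 \cdot \frac{\text{complexWords}}{\text{words}} \right)$$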
where a word is considered complex if it contains three or more syllables. As
there exists no established automatic method for counting syllables of Slovene
words, we used a rule-based approach designed for English. The resulting score is
calibrated to the grade level of the USA education system.
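– Flesch reading ease (Kincaid et al. 1975) is calculated as:
$$\mathrm{FRE} = 206.835 - 1.015 \cdot \frac{\text{words}}{\text{sentences}} - 84.6 \cdot \frac{\text{syllables}}{\text{words}}$$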
The score does not correspond to grade levels. Instead, the higher the value, the
easier the text is considered to be. A text with a score of 100 should be easily
understood by 11-year-old students, while a text with a score of 0 should be
intended for university graduates.
– Flesch–Kincaid grade level (Kincaid et al. 1975) is similar to Flesch reading ease,
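but does correspond to grade levels. It is calculated as:
$$\mathrm{FKGL} = 0.39 \cdot \frac{\text{words}}{\text{sentences}} + 11.8 \cdot \frac{\text{syllables}}{\text{words}} - 15.59$$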
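– Dale–Chall readability formula (Dale and Chall 1948) is calculated as:
$$\mathrm{DCRF} = 0.1579 \cdot \left( 100 \cdot \frac{\text{difficultWords}}{\text{words}} \right) + 0.0496 \cdot \frac{\text{words}}{\text{sentences}}$$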
The formula requires a predefined list of common (easy) words and the words which
are not on the list are considered as difficult. The novelty of the Dale-Chall Formula
was that it did not use word-length counts but a count of “hard” words which do
not appear on a specially designed list of common words. This list was defined as
the words familiar to most 4th-grade students: when 80 percent of fourth-graders
indicated that they knew a word, the word was added to the list.
Higher scores indicate that the text is harder, but the resulting score does not
correspond to grade levels, nor is it appropriate for text aimed at children below
4th grade. In our analysis, we obtained the difficult words in two ways:
1. By constructing a list of “easy” words and considering every word not on the
list as difficult. The list of easy words is described later in the paper.
2. By considering words with more than seven characters as difficult.
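– Spache readability formula (Spache 1953) is calculated as:
$$\mathrm{SRF} = 0.141 \cdot \frac{\text{words}}{\text{sentences}} + 0.086 \cdot \left( 100 \cdot \frac{\text{difficultWords}}{\text{words}} \right) + 0.839$$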
Difficult words are defined as words that do not appear in the list of commonly
used words, which is the same as the one used in the Dale–Chall readability for-
mula. This method was specifically designed for texts targeting children up to the
fourth grade and was not designed to perform well on harder text. The obtained
score corresponds to grade levels.
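– Automated readability index (Senter and Smith 1967) is calculated as:
$$\mathrm{ARI} = 4.71 \cdot \frac{\text{characters}}{\text{words}} + 0.5 \cdot \frac{\text{words}}{\text{sentences}} - 21.43$$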
The formula was designed so that it could be computed automatically at a time when
texts were written on typewriters, and it therefore does not use information relating
to syllables or difficult words. The obtained score corresponds to grade levels.
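– SMOG (Simple Measure of Gobbledygook) (McLaughlin 1969) can be calculated as:
$$\mathrm{SMOG} = 1.0430 \cdot \sqrt{\text{difficultWords} \cdot \frac{30}{\text{sentences}}} + 3.1291$$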
where difficult words are defined as words with three or more syllables. The score
corresponds to grade levels.
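– LIX (Björnsson 1968) is calculated as:
$$\mathrm{LIX} = \frac{\text{words}}{\text{sentences}} + 100 \cdot \frac{\text{longWords}}{\text{words}}$$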
where long words are defined as words consisting of more than six characters. LIX
is the only measure we used that was not designed specifically for English but for
a variety of languages. Because of this, it does not use syllables or a list of unique
words. The score does not correspond to grade levels.
– RIX (Anderson 1983) is a simplification of LIX, and is calculated as:
$$\mathrm{RIX} = \frac{\text{longWords}}{\text{sentences}}$$
– Coleman-Liau index (Coleman and Liau 1975) is calculated as:
$$\mathrm{CLI} = 0.0588 \cdot L - 0.296 \cdot S - 15.8$$
where L is the average number of letters per 100 words and S is the average number
of sentences per 100 words. The obtained score corresponds to grade levels.
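The measures in this list that need only character, word, and sentence counts (ARI, LIX, RIX, and Coleman-Liau) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not necessarily the implementation used in the analysis: the regular-expression tokenizer and the names `text_counts`, `ari`, `lix`, `rix`, and `coleman_liau` are hypothetical, and the constants are those of the standard published formulas. The syllable-based and word-list-based measures would additionally need a syllable counter or a list of common words.

```python
import re


def text_counts(text):
    """Count sentences, words, characters, and long words.

    Assumption: sentences end with '.', '!' or '?', and a word is any run
    of letters or digits. A real pipeline would use a proper tokenizer.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    return {
        "sentences": len(sentences),
        "words": len(words),
        "characters": sum(len(w) for w in words),
        # LIX and RIX define long words as words of more than six characters.
        "long_words": sum(1 for w in words if len(w) > 6),
    }


def ari(c):
    """Automated readability index."""
    return (4.71 * c["characters"] / c["words"]
            + 0.5 * c["words"] / c["sentences"] - 21.43)


def lix(c):
    """LIX: average sentence length plus percentage of long words."""
    return c["words"] / c["sentences"] + 100 * c["long_words"] / c["words"]


def rix(c):
    """RIX: number of long words per sentence."""
    return c["long_words"] / c["sentences"]


def coleman_liau(c):
    """Coleman-Liau index from letters (L) and sentences (S) per 100 words."""
    letters_per_100 = 100 * c["characters"] / c["words"]
    sentences_per_100 = 100 * c["sentences"] / c["words"]
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


counts = text_counts("Mama kuha kosilo. Otroci berejo zanimive knjige.")
print(counts)
print(round(lix(counts), 2), round(rix(counts), 2))
```

Keeping the counting step separate from the formulas means that a better tokenizer or sentence splitter can be swapped in without touching the formulas themselves.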
Language-Processing-Based Readability Criteria
The readability formulas described in the previous section use a low number of
common criteria, such as the number of syllables in words or the number of words in a
sentence. In our analysis, we also analyzed Slovene texts using the following additional
statistics:
– percentage of long words,
– percentage of difficult words,
– percentage of verbs,
– percentage of adjectives,
– percentage of unique words,
– average sentence length.
Many of these (percentage of long words, difficult words, unique words, and aver-
age sentence length) are used as features in the readability measures described above.
We evaluate them individually to determine how important each of them is for Slovene
texts. The percentage of verbs is used because a higher number of verbs can indicate
more complex sentences with multiple clauses. The percentage of adjectives was cho-
sen because we assumed a higher percentage of adjectives could indicate longer, more
descriptive sentences that are harder to understand.
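As an illustration, these statistics could be computed from part-of-speech-tagged text roughly as follows. This is a sketch under stated assumptions: the input format (a list of sentences, each a list of `(word, tag)` pairs), the tag names `"VERB"` and `"ADJ"`, the long-word threshold, and the function name `pos_statistics` are all hypothetical and would have to be mapped to the tagset and definitions actually used.

```python
def pos_statistics(sentences, long_word_min_length=7):
    """Compute simple percentage-based readability criteria.

    sentences: list of sentences, each a list of (word, pos_tag) pairs.
    The tag names "VERB" and "ADJ" are placeholders, not a real tagset.
    long_word_min_length=7 assumes "long" means more than six characters.
    """
    tokens = [token for sentence in sentences for token in sentence]
    n_tokens = len(tokens)
    word_forms = [word.lower() for word, _ in tokens]
    return {
        "pct_verbs": 100 * sum(1 for _, tag in tokens if tag == "VERB") / n_tokens,
        "pct_adjectives": 100 * sum(1 for _, tag in tokens if tag == "ADJ") / n_tokens,
        "pct_unique_words": 100 * len(set(word_forms)) / n_tokens,
        "pct_long_words": 100 * sum(
            1 for w in word_forms if len(w) >= long_word_min_length) / n_tokens,
        "avg_sentence_length": n_tokens / len(sentences),
    }


example = [
    [("Pes", "NOUN"), ("laja", "VERB")],
    [("Velik", "ADJ"), ("pes", "NOUN"), ("spi", "VERB"), ("doma", "ADV")],
]
print(pos_statistics(example))
```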
To take into account the richer morphology of Slovene and its less fixed word order
compared to English, we computed two additional criteria:
– Context of difficult words, which is the average number of difficult words that
appear in a context (i.e. the three words before or after the word) of a difficult
word. Difficult words are defined as words that do not appear on the list of com-
mon words. The intuition behind this metric is that a difficult word that appears in
the context of easy words is easier to understand than if it is surrounded by other
difficult words since its meaning can be more easily inferred from the context.
– Average morphological difficulty, where we use the Slovene morphological lexi-
con Sloleks (Arhar Holdt 2009) to assign a “morphological difficulty” score to
each word. Sloleks is a lexicon of word forms and contains frequency informa-
tion for morphological variants of over 100,000 lemmas (base forms of words as
defined in a dictionary). We use the relative frequency of a word variant compared
to other variants of the same lemma as the morphological difficulty score.
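The two criteria can be sketched as follows. This is a minimal sketch rather than the exact implementation: the function names, the toy lexicon layout (`{lemma: {word_form: frequency}}`), and the handling of edge cases (the word itself is excluded from its own context, and word forms missing from the lexicon are skipped) are assumptions, and the real Sloleks format is not reproduced here.

```python
def context_of_difficult_words(words, common_words, window=3):
    """Average number of difficult neighbours around each difficult word.

    A word is difficult if it is not in common_words. The context is the
    `window` words before and after the word, excluding the word itself.
    """
    difficult = [w.lower() not in common_words for w in words]
    neighbour_counts = []
    for i, is_difficult in enumerate(difficult):
        if not is_difficult:
            continue
        start, end = max(0, i - window), min(len(words), i + window + 1)
        # Subtract 1 so the difficult word does not count itself.
        neighbour_counts.append(sum(difficult[start:end]) - 1)
    if not neighbour_counts:
        return 0.0
    return sum(neighbour_counts) / len(neighbour_counts)


def morphological_difficulty(form, lemma, lexicon):
    """Relative frequency of a word form among all forms of its lemma."""
    forms = lexicon[lemma]
    return forms[form] / sum(forms.values())


def average_morphological_difficulty(form_lemma_pairs, lexicon):
    """Mean score over the (form, lemma) pairs found in the lexicon."""
    scores = [
        morphological_difficulty(form, lemma, lexicon)
        for form, lemma in form_lemma_pairs
        if lemma in lexicon and form in lexicon[lemma]
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data with invented frequencies, for illustration only.
toy_lexicon = {"pes": {"pes": 60, "psa": 30, "psom": 10}}
print(morphological_difficulty("psa", "pes", toy_lexicon))
print(context_of_difficult_words(
    ["a", "X", "b", "Y", "c", "d", "e", "f", "Z"],
    {"a", "b", "c", "d", "e", "f"},
))
```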
In addition, we also calculated the number of words in each document, although in our
case it cannot be interpreted as a criterion for determining readability, since it is largely
determined by the type of document. For example, the documents belonging to the subcorpus
of newspapers contain individual articles and are therefore short, while the subcorpus of
computer magazines contains entire magazines, which are considerably longer.
Analysis of Slovene Texts
In this section, we describe the methodology used for our analysis. In the first sub-
section, we describe the data sets on which we conducted our analysis. In the second
subsection, we describe how we constructed the list of easy words used in some of the
readability measures.
Data Sets
We created a set of subcorpora from the Gigafida reference corpus of written
Slovene (Logar et al. 2012). Gigafida contains 39,427 Slovene texts released from 1990
to 2011, for a total of 1,187,002,502 words. We focused on texts published in maga-
zines, newspapers, and books while ignoring texts collected from the internet. The
texts in the Gigafida corpus are segmented into paragraphs and sentences, tokenized,
and part-of-speech tagged using the Obeliks tagger (Grčar et al. 2012). We grouped
the texts based on the intended audience, resulting in the following subcorpora:
– Children’s magazines include magazines aimed at younger children (to be read
independently or by their parents), namely Cicido and Ciciban.
– Pop magazines contain magazines aimed at the general public, namely Lisa, Gloss,
and Stop.
– Newspapers contain general adult population newspapers, namely Delo and
Dolenjski list.
– Computer magazines include magazines focusing on technical topics relating to
computers, namely Monitor, Računalniške novice, PC & Mediji, and Moj Mikro.
– National Assembly includes transcriptions of sessions from the National Assembly
of Slovenia.
In Table 1 we show the number of documents in each subcorpus and the aver-
age number of words per document. The subcorpus of newspapers contains the larg-
est number of documents, while the subcorpus of text sourced from the National
Assembly of Slovenia contains the fewest.
Table 1: The number of documents and the average number of words per document for
each subcorpus.
Subcorpus               #docs   Avg. #words / doc
Children's magazines      125               5,488
Pop magazines             247              33,967
Newspapers             14,011              12,881
Computer magazines        163             110,875
National Assembly          35              58,841
Our hypothesis is that the readability measures will be able to distinguish texts
from different subcorpora. We assume that children’s magazines will be easily distin-
guishable from other genres that are addressing an adult population. We also suppose
that general magazines are less complex than specialized magazines. The National
Assembly transcripts were included as they differ from other texts in two major ways:
a.) they are transcripts of spoken language and b.) they relate to a highly technical
subject matter. Because of this, we were interested in how readability measures would
grade them. To test our hypothesis and to determine how well each readability meas-
ure works, we analyzed texts from each subcorpus to obtain a score distribution for
each measure. The scores were calculated separately for each source text (e.g., one
magazine article, a newspaper, or one assembly session).
List of Common Words
For designing the list of common words, we took a corpus-based approach. Note
that the methodology of creating a list of common words from language corpora has
already been tested for other languages (see, e.g., Kilgarriff et al. 2014). We used four
corpora to create the list of common words: Kres, Janes, Gos, and Šolar:
– Šolar (Kosem et al. 2011) contains 2,703 texts written by pupils in Slovenia from
grades 6 to 13 (grade 6 to 9 in primary school, and grade 1 to 4 in secondary school).
The texts include essays, summaries, and answers to examination questions.
– Gos (Verdonik et al. 2011) contains around 120 hours of recorded spoken Slovene
(1,035,101 words), as well as transcriptions of the recordings. The recordings are
collected from a variety of sources, including conversations, television, radio, and
phone calls. Around 10% of the corpus consists of recorded lessons in primary
and secondary schools.
– Janes (Fišer et al. 2014) contains Slovene texts from various internet sources, such
as tweets, forum posts, blogs, comments, and Wikipedia talk pages.
– Kres (Logar Berginc and Šuster 2009) is a sub-corpus of Gigafida that is balanced
with respect to the source (e.g. newspapers, magazines, or internet).
We extracted the most common words and defined the common words as the ones
that appear frequently in all four corpora (and are therefore not specific to a certain
text type). We use four corpora to include texts that primarily reflect language pro-
duction by different language users (Gos, Janes, Šolar), as well as texts that primarily
reflect standard language (Kres). We aimed at covering younger school-going popula-
tion (Šolar) and adults. For some corpora, we could have assigned words to different
age levels (e.g. using pupils’ grade levels in Šolar or using the age groups available
in Gos metadata), but these corpora are very specific and the resulting word groups
would mainly reflect the genre instead of age levels. Because of this, we opted for the
approach of crossing the word lists to obtain a single list. The overlap of the most com-
mon words in four corpora eliminates frequent words which are typical for only one
of the corpora (e.g. administrative language in Kres, spoken language markers in Gos,
Twitter-specific usage in Janes, and literary references from essays in Šolar).
From each corpus, we extracted the 10,000 most frequent tuples of word lemma and
part of speech. In order to construct a list of common words representative of the
Slovene language, we selected the word lemmas that occurred in the most-frequent
lists of all four corpora. We obtained a list of 2,562 common words, which
we used in the readability measures.
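The crossing of the word lists can be sketched as a set intersection. This is a minimal sketch under stated assumptions: each corpus is represented as a plain list of `(lemma, part_of_speech)` tuples, the intersection is taken over these tuples before keeping only the lemmas, and the names `top_lemmas` and `common_word_list` as well as the toy data are hypothetical.

```python
from collections import Counter


def top_lemmas(tagged_tokens, n=10_000):
    """Return the n most frequent (lemma, part_of_speech) tuples of a corpus."""
    counts = Counter(tagged_tokens)
    return {item for item, _ in counts.most_common(n)}


def common_word_list(corpora, n=10_000):
    """Keep the lemmas whose tuple is among the top n in every corpus.

    corpora: dict mapping a corpus name to its list of (lemma, pos) tuples.
    """
    top_sets = [top_lemmas(tokens, n) for tokens in corpora.values()]
    shared = set.intersection(*top_sets)
    return {lemma for lemma, _ in shared}


# Toy corpora with invented counts; n=2 stands in for 10,000.
toy_corpora = {
    "Kres": [("biti", "V")] * 5 + [("in", "C")] * 4 + [("zakon", "N")],
    "Gos": [("biti", "V")] * 5 + [("eee", "I")] * 4 + [("in", "C")] * 3,
}
print(common_word_list(toy_corpora, n=2))
```

In the toy data, the filler `eee` is frequent only in the spoken corpus and is therefore dropped, which mirrors the motivation for intersecting the lists.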
Results
For each text in each subcorpus, we calculated readability scores using all readabil-
ity measures described in the previous section. In Figure 1 we present a few examples
of obtained score distributions. We show distributions for three text subcorpora (chil-
dren’s magazines, newspapers, and technical magazines) and three readability scores
(Gobbledygook, Coleman-Liau, and the average number of words in a sentence).
Figure 1: The score distributions for three text subcorpora and three readability
measures. The distributions show that technical magazines’ readability scores are the most
consistent, while newspapers’ scores are more diverse. Children’s magazines’ scores have
a strong peak on the left-hand side (easier texts) that is well separated from the other
sources.
To give a compact overview of all included readability measures, we calculated
the median and the first and third quartiles of the distribution for each score and each text
subcorpus. These results are shown as box-and-whisker plots in Figure 2, which shows
that most readability measures are able to distinguish between different subcorpora.
Additionally, some of the readability measures confirm our original
hypothesis, i.e. they are able to distinguish children’s magazines from other genres that
are addressing adult population, and evaluate general magazines as less complex than
computer magazines.
Figure 2: The scores of each readability measure for each subcorpus of texts, represented
with box plots. The subcorpora depicted from left to right are: 1.) Children’s magazines, 2.)
General magazines, 3.) Newspapers, 4.) Computer magazines, and 5.) National assembly
transcriptions. The boxes show the first, second, and third quartile of the distributions
while the whiskers extend for 1.5 IQR past the first and third quartile.
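The quantities behind such a box plot can be computed with the Python standard library, as in the sketch below. It rests on assumptions that the text does not specify: the quartiles use the "inclusive" interpolation method, the whisker ends are taken as the most extreme observations that still lie within 1.5 IQR of the quartiles, and the name `box_plot_summary` and the sample scores are hypothetical.

```python
import statistics


def box_plot_summary(scores):
    """Return quartiles, whisker ends, and outliers for a list of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = [s for s in scores if lower_fence <= s <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(
            s for s in scores if s < lower_fence or s > upper_fence),
    }


# Invented scores for illustration; 30 lies outside the upper fence.
print(box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```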
Figure 2 allows for an additional interpretation of readability measures. For exam-
ple, children’s magazines vs. general magazines vs. newspapers mean scores show
increasing complexity in the following measures: Percentage of long words,
Flesch–Kincaid Grade Level, Gunning Fog Index, Dale-Chall Readability Formula (based
on complexity defined by syllables), Context of Difficult Words, SMOG, LIX, RIX
and Automated Readability Index. All these measures consider the length of words
and/or sentences. The percentage of adjectives also seems to correlate with the complexity
of these three text types, although to a lesser extent. The same holds for Flesch
Reading Ease, since higher scores indicate lower complexity. For the majority of these
measures, the distinction between newspapers and specialized computer magazines is
either less evident or not evident at all, but they do indicate that computer magazines
are less readable than general magazines.
Scores using the list of common words do not lead to the same conclusions.
Percentage of Difficult Words and Dale-Chall Readability Formula with word list do
not reflect the complexity of genres, but to some extent, they do distinguish between
general and specialized texts (i.e. newspapers and general magazines have lower scores
than specialized computer magazines). One of the reasons for the relatively high
complexity scores of children’s magazines might be the large proportion of literary
language, such as poems for children, with many words that are not on the list of common
words. For example, “KRAH, KRAH, KRAH! MENE NIČ NI STRAH!” (Krah, krah,
krah! I am not afraid!) has 7 words, of which 4 are on the list of common words, while
the interjection KRAH is not. Therefore, the proportion of difficult words in this
segment is 42.9% (3 occurrences of the word KRAH out of 7 words in total). On the other
hand, the words are short, so length-based measures consider them to be simple.
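The arithmetic of this example can be reproduced with a short sketch. The tokenization, the four-entry stand-in for the common-word list, and the "more than six characters" threshold for long words (borrowed from the LIX definition) are assumptions made for illustration only.

```python
import re


def word_shares(text, common_words, long_threshold=6):
    """Return (share of words not on the list, share of long words)."""
    tokens = re.findall(r"\w+", text.lower())
    difficult = [t for t in tokens if t not in common_words]
    # Assumption: a word counts as long if it has more than six characters.
    long_words = [t for t in tokens if len(t) > long_threshold]
    return len(difficult) / len(tokens), len(long_words) / len(tokens)


# Hypothetical stand-in for the relevant entries of the common-word list.
common = {"mene", "nič", "ni", "strah"}
difficult_share, long_share = word_shares(
    "KRAH, KRAH, KRAH! MENE NIČ NI STRAH!", common
)
print(f"{difficult_share:.1%}", f"{long_share:.1%}")  # 42.9% 0.0%
```

The word-list measure flags three of the seven tokens, whereas the length-based share is zero, which illustrates why the two families of measures can disagree on children's verse.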
The readability scores for the National Assembly subcorpus show high variability
across the measures, which might be attributed to the fact that it is a different genre
(spoken, but specialized). For example, in several measures where complexity rises
from children’s magazines to general magazines and newspapers, the National Assembly
scores are close to those of general magazines. Very long words are less likely to be used
in spoken language, even in a political context. Average morphological difficulty and
context of difficult words lead to the interpretation that this genre is more complex
(less “readable”). The very high score for the context of difficult words might be
attributed to the enumeration of Assembly members, e.g., “Obveščen sem, da so zadržani
in se današnje seje ne morejo udeležiti naslednje poslanke in poslanci: Ciril Pucko,
Franc Kangler, Vincencij Demšar, Branko Kalalemina, …” (I was informed that the
following deputies are occupied and cannot attend this session: …). The relatively high
percentage of verbs can also be interpreted from this perspective, as the National
Assembly texts include many performatives, such as “Pričenjam nadaljevanje seje”
(Starting the continuation of the session) and “Ugotavljamo prisotnost v dvorani”
(Establishing the presence).
In summary, using a list of common words did not improve the partitioning of the
text subcorpora perceived as easy and as difficult to read. Both measures that use it
(Dale-Chall and Spache readability formulas) are poor separators. A number of simple
readability measures worked well, such as the percentage of long words, the percentage
of verbs/adjectives, and the average morphological difficulty.
We also calculated the sample mean and standard deviation of readability meas-
ures for each text subcorpus. The results are shown in Table 2.
Table 2: The mean and standard deviation for each subcorpus of texts and each
readability score.
| Measure | Children's mag. | General mag. | Newspapers | Technical mag. | National Assembly |
|---|---|---|---|---|---|
| % long words | 0.065 (0.015) | 0.109 (0.011) | 0.137 (0.029) | 0.146 (0.010) | 0.137 (0.046) |
| Number of words | 5488 (6184) | 33966 (34821) | 12881 (84708) | 110875 (151007) | 58841 (106515) |
| % adjectives | 0.078 (0.016) | 0.111 (0.013) | 0.120 (0.020) | 0.120 (0.008) | 0.096 (0.022) |
| % verbs | 0.216 (0.026) | 0.170 (0.015) | 0.161 (0.034) | 0.144 (0.013) | 0.180 (0.044) |
| % unique words | 0.517 (0.077) | 0.375 (0.053) | 0.513 (0.114) | 0.244 (0.144) | 0.277 (0.173) |
| Context of difficult words | 0.756 (0.054) | 0.834 (0.027) | 0.849 (0.133) | 0.808 (0.036) | 0.929 (0.044) |
| % difficult words | 0.464 (0.048) | 0.369 (0.022) | 0.356 (0.122) | 0.389 (0.032) | 0.280 (0.036) |
| Gunning Fog Index | 9.950 (1.255) | 14.272 (1.271) | 18.662 (9.319) | 17.470 (0.800) | 15.901 (3.493) |
| Flesch reading ease | 37.592 (4.989) | 23.855 (5.217) | 10.002 (24.128) | 12.520 (4.340) | 19.178 (13.098) |
| Flesch–Kincaid grade level | 10.500 (0.894) | 13.596 (1.193) | 17.356 (8.959) | 15.999 (0.741) | 14.523 (2.761) |
| Dale–Chall | 2.845 (0.425) | 4.036 (0.306) | 4.972 (1.270) | 4.941 (0.258) | 4.560 (0.971) |
| Dale–Chall with word list | 7.781 (0.720) | 6.534 (0.357) | 6.643 (2.163) | 6.955 (0.484) | 5.208 (0.539) |
| Spache readability formula | 6.217 (0.368) | 6.079 (0.348) | 6.977 (3.499) | 6.685 (0.323) | 5.482 (0.600) |
| Automated readability index | 12.873 (1.086) | 16.117 (1.428) | 20.474 (11.456) | 19.007 (0.885) | 17.014 (3.371) |
| SMOG | 12.206 (0.759) | 15.095 (1.066) | 18.200 (2.757) | 17.194 (0.611) | 15.849 (2.500) |
| LIX | 33.676 (3.384) | 44.999 (3.282) | 56.016 (23.123) | 53.260 (2.077) | 47.909 (9.073) |
| RIX | 2.381 (0.496) | 4.481 (0.781) | 7.370 (3.836) | 6.354 (0.518) | 5.250 (2.574) |
| Coleman-Liau index | 17.785 (1.120) | 19.823 (0.861) | 21.220 (1.807) | 21.762 (0.903) | 20.318 (2.170) |
| Avg. morphological difficulty | 0.419 (0.017) | 0.428 (0.010) | 0.436 (0.044) | 0.441 (0.017) | 0.445 (0.026) |
| Avg. sentence length | 8.353 (0.820) | 13.389 (2.843) | 21.120 (4.043) | 18.641 (1.960) | 19.063 (3.826) |
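Using these results, we calculated the Bhattacharyya distance between the
distributions of children’s magazines and newspapers for each score. The Bhattacharyya
distance measures the similarity between two statistical distributions. We assumed
the scores were normally distributed, since Figure 1 shows that they approximately
follow a normal distribution, and calculated the distance using the following formula:

$$D_B(p, q) = \frac{1}{4} \ln\left(\frac{1}{4}\left(\frac{\sigma_p^2}{\sigma_q^2} + \frac{\sigma_q^2}{\sigma_p^2} + 2\right)\right) + \frac{1}{4}\,\frac{(\mu_p - \mu_q)^2}{\sigma_p^2 + \sigma_q^2},$$

where $\mu_p, \mu_q$ and $\sigma_p, \sigma_q$ are the means and standard deviations of the two
distributions $p$ and $q$. We also show the Bhattacharyya coefficient, which measures the
overlap between two statistical distributions and can be calculated as:

$$BC(p, q) = e^{-D_B(p, q)}.$$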
The results are presented in Table 3. They are similar to the ones shown in Figure 2:
the readability formulas that rely on the word list show less discriminative power.
The largest distance is obtained with the average sentence length.
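As a check on how such distances arise, the sketch below applies the standard closed form of the Bhattacharyya distance for two univariate normal distributions to the means and standard deviations listed in Table 2. The function names are illustrative. That the outputs agree with Table 3 suggests, but does not prove, that this is the formula the authors used.

```python
import math


def bhattacharyya_distance(mean_p, std_p, mean_q, std_q):
    """Bhattacharyya distance between two univariate normal distributions."""
    var_p, var_q = std_p ** 2, std_q ** 2
    variance_term = 0.25 * math.log(0.25 * (var_p / var_q + var_q / var_p + 2.0))
    mean_term = 0.25 * (mean_p - mean_q) ** 2 / (var_p + var_q)
    return variance_term + mean_term


def bhattacharyya_coefficient(distance):
    """Overlap of the two distributions: 1 for identical, near 0 for disjoint."""
    return math.exp(-distance)


# Average sentence length: children's magazines vs. newspapers (Table 2).
sentence_distance = bhattacharyya_distance(8.353, 0.820, 21.120, 4.043)
sentence_overlap = bhattacharyya_coefficient(sentence_distance)
print(round(sentence_distance, 3), round(sentence_overlap, 3))  # 2.866 0.057

# SMOG: children's magazines vs. newspapers (Table 2).
smog_distance = bhattacharyya_distance(12.206, 0.759, 18.200, 2.757)
print(round(smog_distance, 3))  # 1.433
```

The mean term dominates for average sentence length (the two means lie more than 12 words apart), which is why this measure tops the ranking.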
Table 3: The Bhattacharyya distances and coefficients between the distributions of scores
for children’s magazines and newspapers for each readability measure. The results are
sorted by decreasing distance.
Measure Distance Coefficient
Average sentence length 2.866 0.057
SMOG 1.433 0.239
% long words 1.350 0.259
RIX 1.101 0.333
Flesch-Kincaid grade level 0.956 0.385
Automated readability index 0.945 0.389
Dale-Chall readability formula 0.885 0.413
Gunning fog index 0.880 0.415
LIX 0.853 0.426
Spache readability formula 0.797 0.451
Flesch reading ease 0.776 0.460
% adjectives 0.719 0.487
Coleman-Liau index 0.708 0.493
% verbs 0.432 0.649
% difficult words 0.365 0.694
Dale-Chall with word list 0.318 0.728
Context of difficult words 0.285 0.752
Avg. morphological difficulty 0.235 0.790
% unique words 0.039 0.961
Additional Statistical Tests
In addition to the initial analysis presented in the previous section, we performed
additional, more thorough statistical tests to determine which of the evaluated
measures are better at predicting the group a text belongs to. We used the following
approaches:
– Mutual information. This measure reports the amount of information we get
about a random variable Y by observing another random variable X. In our case,
mutual information reports the amount of information we get about the group
of texts by knowing the score of a certain readability measure. Mutual information is
defined as:

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)},$$

where p(x) and p(y) are the marginal probability distribution functions of X and
Y and p(x, y) is the joint probability function of X and Y. In our case, X represents
the distribution of a readability measure and Y the distribution of groups. The
higher the mutual information between the readability measure and the groups,
the more useful the measure is for determining the group membership.
– Analysis of variance (ANOVA). This method first splits the samples of a statistical
distribution into several groups (in our case, based on the group the texts belong
to) and then tests whether the groups differ significantly from one another.
We use it to determine whether the distributions obtained by calculating a
single measure on each group of texts are significantly different. If they are, the
measure can be useful for determining the group membership of a given text.
– Feature selection using a chi-squared test. Similarly to mutual information, we
use the chi-squared test to determine whether the readability measures and the
group memberships are mutually dependent. If they are, this indicates that know-
ing the value of the readability measure is useful when determining which group
a text belongs to.
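The first two tests in the list above can be sketched in plain Python. Everything here is an illustrative assumption rather than the authors' procedure: the paper does not say how mutual information was estimated, so the sketch discretizes scores into equal-width bins, and the toy scores and group labels are invented. The chi-squared test is omitted for brevity.

```python
import math
from collections import Counter


def mutual_information(scores, groups, bins=4):
    """Estimate I(X; Y) in bits after cutting the scores into equal-width bins."""
    low, high = min(scores), max(scores)
    width = (high - low) / bins or 1.0  # guard against identical scores
    binned = [min(int((s - low) / width), bins - 1) for s in scores]
    n = len(scores)
    joint = Counter(zip(binned, groups))
    count_x, count_y = Counter(binned), Counter(groups)
    # p(x, y) * log2(p(x, y) / (p(x) * p(y))), summed over observed pairs
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


def anova_f(scores, groups):
    """One-way ANOVA F statistic: between-group over within-group variance."""
    by_group = {}
    for score, group in zip(scores, groups):
        by_group.setdefault(group, []).append(score)
    n, k = len(scores), len(by_group)
    grand_mean = sum(scores) / n
    between = sum(
        len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in by_group.values()
    )
    within = sum(
        (s - sum(v) / len(v)) ** 2 for v in by_group.values() for s in v
    )
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical scores for six texts from two groups.
groups = ["children"] * 3 + ["newspapers"] * 3
informative = [8.1, 8.4, 8.9, 20.5, 21.0, 21.7]      # separates the groups
uninformative = [0.50, 0.52, 0.51, 0.50, 0.52, 0.51]  # same in both groups

mi_informative = mutual_information(informative, groups)
mi_uninformative = mutual_information(uninformative, groups)
f_informative = anova_f(informative, groups)
f_uninformative = anova_f(uninformative, groups)
print(mi_informative, mi_uninformative)
```

With two equally sized groups, the upper bound on the mutual information is one bit, which the perfectly separating toy measure reaches, while the measure that takes the same values in both groups carries no information about the group.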
In addition to the three statistical tests described above, we also ranked each feature
using a random forest classifier (Breiman 2001). The classifier is capable of
automatically combining different readability measures in order to predict which
subcorpus a given text belongs to and is also capable of calculating how important each
readability measure was when making the prediction. The classifier is described in more
detail in the next section. Using each of these four tests, we obtained scores that tell us
how useful each readability measure is for predicting the subcorpus a text came from.
The resulting rankings are presented in Table 4, where measures nearer the top are
more informative.
Table 4: The ranks of readability measures obtained by the statistical tests, which report
the usefulness of readability measures for predicting group membership. The measures
are ordered from the most useful to the least useful.
| Rank | Random Forest | ANOVA | Mutual information | Chi2 |
|---|---|---|---|---|
| 1 | Average sentence length | Average sentence length | Average sentence length | % new words |
| 2 | % new words | % difficult words SPG | RIX | Number of words |
| 3 | Number of words | % long words | SMOG | % unique words |
| 4 | % unique words | SMOG | % new words | Flesch reading ease |
| 5 | % difficult words SPG | Dale-Chall | Automated readability index | LIX |
| 6 | Gunning fog index | % adjectives | Gunning fog index | Average sentence length |
| 7 | % verbs | Coleman-Liau index | LIX | % difficult words |
| 8 | RIX | % unique words | Number of words | Gunning fog index |
| 9 | Dale-Chall (word list) | RIX | Flesch-Kincaid grade level | Automated readability index |
| 10 | SMOG | % verbs | Flesch reading ease | % difficult words SPG |
| 11 | LIX | Flesch reading ease | Dale-Chall | Flesch-Kincaid grade level |
| 12 | Flesch-Kincaid grade level | Context of difficult words | % unique words | SMOG |
| 13 | Context of difficult words | LIX | % long words | RIX |
| 14 | Dale-Chall | Gunning fog index | % difficult words | Coleman-Liau index |
| 15 | % long words | Flesch-Kincaid grade level | % difficult words SPG | Dale-Chall |
| 16 | % difficult words | % difficult words | Spache readability formula | Spache readability formula |
| 17 | Avg morphological difficulty | Automated readability index | Context of difficult words | Dale-Chall (word list) |
| 18 | Automated readability index | % new words | Coleman-Liau index | % long words |
| 19 | % adjectives | Number of words | % verbs | Context of difficult words |
| 20 | Flesch reading ease | Dale-Chall (word list) | % adjectives | % verbs |
| 21 | Spache readability formula | Spache readability formula | Dale-Chall (word list) | % adjectives |
| 22 | Coleman-Liau index | Avg morphological difficulty | Avg morphological difficulty | Avg morphological difficulty |
The results of the statistical tests show that the features commonly used by the
readability formulas (i.e. an average sentence length and number of long words) are
useful when it comes to determining group membership. In particular, the average
sentence length stands out since it is ranked as the most important measure in three
out of the four tests. At least one of either LIX or RIX is also highly ranked (in the top
50% of all measures) by all the tests. Those measures are the only ones from the tested
measures that were not designed specifically for English, which could be one of the
reasons why they perform better on Slovene texts. The results also show that a number
of proposed simpler readability criteria, such as the percentage of verbs, percentage of
adjectives, and the average morphological difficulty are less useful than the established
statistical formulas. The results do not single out one most useful readability
criterion for Slovene: several formulas and statistics are useful, but their rankings
differ between tests. When our list of common words is used, the Dale-Chall and Spache
readability formulas are again shown to perform worse than the formulas that treat
long words as difficult.
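Since LIX and RIX figure prominently in these rankings, a minimal sketch of both may be useful. It follows the usual definitions (LIX is the average sentence length plus the percentage of words longer than six letters; RIX is the number of such long words per sentence). The regular-expression tokenization and sentence splitting are simplifying assumptions and not the preprocessing used in the paper.

```python
import re


def lix_and_rix(text):
    """Return (LIX, RIX) for a text, using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    long_words = [w for w in words if len(w) > 6]  # more than six letters
    lix = len(words) / len(sentences) + 100 * len(long_words) / len(words)
    rix = len(long_words) / len(sentences)
    return lix, rix


# Two performatives quoted earlier in the text, joined into one sample.
sample = "Pričenjam nadaljevanje seje. Ugotavljamo prisotnost v dvorani."
lix, rix = lix_and_rix(sample)
print(round(lix, 2), rix)  # 74.93 2.5
```

Neither formula needs syllable counts or a word list, which is one plausible reason why they transfer to other languages more easily than formulas calibrated on English.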
Classification Results
In addition to statistical evaluation, we also performed a test with machine learn-
ing classifiers (Kononenko and Kukar 2007) to see whether we could use our readabil-
ity measures to predict which subcorpus a text belongs to. With classification models,
we can automatically learn how to split the texts into different subcorpora based on
readability formulas and other readability criteria. We used the following classification
models.
– Decision trees construct a binary decision tree where each node splits the training
set based on one readability measure. The trained tree can predict the subcorpus
of a given text.
– Random forests (Breiman 2001) create multiple decision trees in a random man-
ner. This reduces the variance of a model and often gives better prediction accu-
racy than using a single decision tree.
– Naive Bayes is a probabilistic model based on the Bayes’ theorem. The model
assumes that the readability measures are independent.
– Extreme gradient boosting (Chen and Guestrin 2016) constructs a large number of
simple classifiers and combines them to achieve state-of-the-art results on many
classification problems.
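To make the independence assumption of naive Bayes concrete, here is a small sketch of a Gaussian variant written from scratch. The Gaussian likelihood, the function names, and the six training rows (average sentence length, share of long words) are assumptions for illustration; the paper does not state which variant or implementation was used.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a class prior and per-measure mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Return the class with the highest log-posterior for one text."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        # Independence assumption: per-measure log-likelihoods are simply added.
        for value, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var)
            total -= (value - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


# Hypothetical training texts: (average sentence length, share of long words).
rows = [
    (8.0, 0.060), (8.5, 0.070), (9.0, 0.065),
    (20.0, 0.130), (21.5, 0.140), (22.0, 0.145),
]
labels = ["children"] * 3 + ["newspapers"] * 3
model = fit_gaussian_nb(rows, labels)
short_text = predict(model, (8.4, 0.066))
long_text = predict(model, (21.0, 0.137))
print(short_text, long_text)
```

Many readability measures are built from the same ingredients (word and sentence length), so they are strongly correlated, and a model that treats them as independent evidence may be poorly calibrated on such data.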
To use the classification models, we first trained them on a randomly selected 75%
of our data set. To evaluate the models, we calculated the classification accuracy
(i.e. the proportion of texts each model classified correctly) on the remaining 25%
of the data set. The results are presented in Table 5. The results obtained by the
majority classifier (i.e. classifying everything as the most frequent group) are
presented as a baseline score.
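The evaluation protocol described above can be sketched as follows. The split function, the fixed seed, and the label counts are hypothetical; the sketch only shows how a 75/25 split and a majority-class baseline fit together, not the authors' actual pipeline.

```python
import random
from collections import Counter


def train_test_split(items, train_share=0.75, seed=0):
    """Shuffle a copy of the items and cut it into training and test parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return hits / len(test_labels)


# Hypothetical, imbalanced group labels (not the real corpus counts).
labels = ["newspapers"] * 80 + ["children"] * 20
train_labels, test_labels = train_test_split(labels)
baseline = majority_baseline_accuracy(train_labels, test_labels)
print(len(train_labels), len(test_labels), baseline)
```

A high baseline signals an imbalanced data set, so a trained model's accuracy is best read relative to the baseline rather than in isolation.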
Table 5: The classification accuracies for each of the models. The numbers show the
proportion of texts for which the group membership was correctly predicted.
Model Classification Accuracy
Random Forest 0.984
Extreme Gradient Boosting 0.979
Decision Tree 0.960
Majority Classifier 0.791
Naive Bayes 0.553
Table 5 shows that we are able to predict the correct group of a text with high
accuracy, over 98% with the best-performing model (random forest), well above the
majority-class baseline of 79.1%. Naive Bayes is the only model that falls below this
baseline (55.3%). This shows that a combination of the readability measures evaluated
in this paper can be used to accurately distinguish between different groups of text.
Conclusion and Future Work
We analyzed the statistical distributions of well-known readability measures on
Slovene texts. We extracted five subcorpora of texts from the Gigafida corpus with
commonly perceived different readability levels: children’s magazines, popular
magazines, newspapers, technical magazines, and National Assembly texts. We find that
the readability formulas are able to distinguish between these subcorpora reasonably
well, with the exception of the National Assembly texts, which belong to a different,
spoken genre that the measures were not originally designed to handle. A number of
simple readability statistics, such as the context of difficult words and average
sentence length, also discriminate between the different subcorpora of text.
In this work, we only focused on simple readability formulas along with some
additional readability criteria. There exist several more complex methods for evaluat-
ing the complexity of texts, such as the one presented in Lu (2009) and Wiersma et
al. (2010). Such advanced methods might be more suitable for Slovene texts than the
simple methods used in this paper, and we plan to test them in future work.
Most of the English readability formulas we used were designed to correlate with
school grades and were initially tuned on that domain. For Slovene, there is currently
no publicly available data set with texts tagged according to the appropriate grade
level, which prevents an analysis of the readability measures from this perspective.
In future work, we plan to prepare such a corpus and design several readability scores
fit for different purposes. This will allow us to frame text complexity as a
classification problem with the goal of predicting the grade level of a text instead of
its group membership. In a similar approach, experts would annotate texts with
readability scores, which would allow us to fit a regression model using the
readability measures analyzed in this paper.
Another area that we plan to explore is the use of coherence and cohesion measures
(Barzilay and Lapata 2008; Crossley et al. 2016), which are used to determine
whether words, sentences, and paragraphs are logically connected. Coherence and
cohesion methods usually use machine learning approaches that mostly rely on
language-specific features and should therefore be evaluated on Slovene texts. The same
applies to readability measures based on machine learning (François and Miltsakaki 2012),
which we also plan to analyze in the future.
Acknowledgments
The research was financially supported by the Slovenian Research Agency through
project J6-8256 (New grammar of contemporary standard Slovene: sources and meth-
ods), project J5-7387 (Influence of formal and informal corporate communications
on capital markets), a young researcher grant, research core funding No. P6-0411 and
P2-0103; Republic of Slovenia, Ministry of Education, Science and Sport/European
social fund/European fund for regional development/European cohesion fund (pro-
ject Quality of Slovene textbooks, KaUč). This work has received funding from the
European Union’s Horizon 2020 research and innovation programme under grant
agreement No 825153 (EMBEDDIA).
Sources and Literature
Literature:
• Anderson, Jonathan. 1983. “LIX and RIX: Variations on a Little-known Readability Index.” Journal
of Reading 26, No. 6: 490–96.
• Arhar Holdt, Špela. 2009. “Učni korpus SSJ in leksikon besednih oblik za slovenščino.” Jezik in
slovstvo 54, No. 3–4: 43–56.
• Bailin, Alan, and Ann Grafstein. 2016. Readability: Text and context. Springer.
• Barzilay, Regina, and Mirella Lapata. 2008. “Modeling Local Coherence: An Entity-based
Approach.” Computational Linguistics 34, No. 1: 1–34.
• Björnsson, Carl Hugo. 1968. Läsbarhet. Liber.
• Breiman, Leo. 2001. “Random forests.” Machine learning 45, No. 1: 5–32.
• Chen, Tianqi, and Carlos Guestrin. 2016. “Xgboost: A Scalable Tree Boosting System.” In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, 785–794. ACM.
• Coleman, Meri, and Ta Lin Liau. 1975. “A Computer Readability Formula Designed for Machine
Scoring.” Journal of Applied Psychology 60, No. 2: 283.
• Crossley, Scott A., Kristopher Kyle, and Danielle S. McNamara. 2016. “The tool for the automatic
analysis of text cohesion (TAACO): Automatic Assessment of Local, Global, and Text Cohesion.”
Behavior Research Methods 48, No. 4: 1227–37.
• Dale, Edgar, and Jeanne S. Chall. 1948. “A Formula for Predicting Readability: Instructions.”
Educational Research Bulletin: 37–54.
• Dębowski, Łukasz, Bartosz Broda, Bartłomiej Nitoń, and Edyta Charzyńska. 2015. “Jasnopis–A
Program to Compute Readability of Texts in Polish Based on Psycholinguistic Research.” In
Natural Language Processing and Cognitive Science, edited by B. Sharp, W. Lubaszewski and R.
Delmonte, 51–61. Libreria Editrice Cafoscarina.
• Fišer, Darja, Tomaž Erjavec, Ana Zwitter Vitez, and Nikola Ljubešić. 2014. “JANES se predstavi:
metode, orodja in viri za nestandardno pisno spletno slovenščino.” In Language technologies:
proceedings of the 17th International Multiconference Information Society - IS 2014, edited by Tomaž
Erjavec and Jerneja Žganec Gros, 56–61. Ljubljana: Jožef Stefan Institute.
• François, Thomas, and Eleni Miltsakaki. 2012. “Do NLP and Machine Learning Improve
Traditional Readability Formulas?” In Proceedings of the First Workshop on Predicting and Improving
Text Readability for target reader populations, edited by Sandra Williams, Advaith Siddharthan and
Ani Nenkova, 49–57. Association for Computational Linguistics.
• Grčar, Miha, Simon Krek, and Kaja Dobrovoljc. 2012. “Obeliks: statistični oblikoskladenjski
označevalnik in lematizator za slovenski jezik.” In Proceedings of the Eighth Language Technologies
Conference, edited by Tomaž Erjavec and Jerneja Žganec Gros, 89–94. Ljubljana: Jožef Stefan
Institute.
• Gunning, Robert. 1952. The technique of clear writing. McGraw-Hill.
• Justin, J. 2003. Učbenik kot dejavnik uspešnosti kurikularne prenove: poročilo o rezultatih evalvacijske
študije.
• Kilgarriff, Adam, Frieda Charalabopoulou, Maria Gavrilidou, Janne Bondi Johannessen, Saussan
Khalil, Sofie Johansson Kokkinakis, Robert Lew, Serge Sharoff, Ravikiran Vadlapudi, and Elena
Volodina. 2014. “Corpus-based Vocabulary Lists for Language Learners for Nine Languages.”
Language Resources and Evaluation 48, No. 1: 121–63.
• Kincaid, J. Peter, Robert P. Fishburne Jr, Richard L. Rogers, and Brad S. Chissom. 1975. Derivation
of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease
formula) for navy enlisted personnel. Report No. 8–75.
• Kononenko, Igor, and Matjaž Kukar. 2007. Machine Learning and Data Mining. Chichester,
Horwood Publishing.
• Kosem, Iztok, Tadeja Rozman, and Mojca Stritar. 2011. “How do Slovenian Primary and Secondary
School Students Write and What Their Teachers Correct: A Corpus of Student Writing.” In
Proceedings of Corpus Linguistics Conference 2011, ICC Birmingham, 20–22.
• Logar Berginc, Nataša, and Simon Šuster. 2009. “Gradnja novega korpusa slovenščine.” Jezik in
slovstvo 54: 57–68.
• Logar Berginc, Nataša, Miha Grčar, Marko Brakus, Tomaž Erjavec, Špela Arhar Holdt, Simon
Krek, and Iztok Kosem. 2012. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES:
gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko and Faculty of Social
Sciences.
• Lu, Xiaofei. 2009. “Automatic Measurement of Syntactic Complexity in Child Language
Acquisition.” International Journal of Corpus Linguistics 14, No. 1: 3–28.
• McLaughlin, G. Harry. 1969. “SMOG Grading - a New Readability Formula.” Journal of Reading
12, No. 8: 639–46.
• Senter, R. J., and Edgar A. Smith. 1967. Automated Readability Index. Ohio; University of
Cincinnati.
• Sherman, Lucius Adelno. 1893. Analytics of Literature: A Manual for the Objective Study of English
Prose and Poetry. Boston: Ginn.
• Škvorc, Tadej, Simon Krek, Senja Pollak, Špela Arhar Holdt, and Marko Robnik-Šikonja. 2018.
“Evaluation of Statistical Readability Measures on Slovene Texts.” In Proceedings of the conference
on Language Technologies & Digital Humanities 2018, edited by Darja Fišer and Andrej Pančur,
240–47. Ljubljana: Ljubljana University Press, Faculty of Arts.
• Spache, George. 1953. “A New Readability Formula for Primary-grade Reading Materials.” The
Elementary School Journal 53, No. 7: 410–13.
• Verdonik, Darinka, Ana Zwitter Vitez, and Hotimir Tivadar. 2011. Slovenski govorni korpus Gos.
Trojina, zavod za uporabno slovenistiko.
• Wiersma, Wybo, John Nerbonne, and Timo Lauttamus. 2010. “Automatically Extracting Typical
Syntactic Differences from Corpora.” Literary and Linguistic Computing 26, No. 1: 107–24.
• Zwitter Vitez, Ana. 2014. “Ugotavljanje avtorstva besedil: primer »Trenirkarjev«.” In zbornik
Devete konference Jezikovne Tehnologije Informacijska družba – IS, edited by Tomaž Erjavec and
Jerneja Žganec Gros, 131–34. Ljubljana: Jožef Stefan Institute.
Tadej Škvorc, Simon Krek, Senja Pollak, Špela Arhar Holdt,
Marko Robnik-Šikonja
PREDICTING SLOVENE TEXT COMPLEXITY USING
READABILITY MEASURES
SUMMARY
In English, the problem of determining text readability (i.e. how easy a text is to
understand) has long been a topic of research, with its origins in the 19th century. Since
then, many different methods and readability measures have been developed, often
with the goal of determining whether a text is too difficult for its target age group.
Even though the question of readability is complex from a linguistic standpoint, a large
majority of existing measures are based on simple heuristics. Since most of these
measures were developed for English texts, it is hard to say how well they would perform
on Slovene texts. Measures designed for English are calibrated to the American school
system, are sometimes based on pre-constructed lists of easy words, which do not exist
for Slovene, and do not take morphological information into account when determining
whether a word is difficult.
In our work, we analyze some common readability measures on Slovene text. We
also introduce and analyze two additional readability criteria that do not appear in
any of the analyzed readability measures: morphological difficulty, where we assume
that word forms that appear rarely are harder to understand than those that appear
commonly, and the context of difficult words, where we assume that difficult words are
easier to understand in a context of simple words, as their meaning can be inferred from
that context. We performed the analysis on 14,581 text documents from the Gigafida
corpus, which were split into five groups based on their target audience (children’s
magazines, popular magazines, newspaper articles, computer magazines, and
transcriptions of sessions of the National Assembly). We assumed that the groups should
have different readability scores due to their differing target audiences and writing styles.
For each analyzed readability measure we checked how well it separates texts from
different groups. We did this by first obtaining the statistical distribution of readability
scores for texts in each group and checking how much the distributions differ. We
show that a number of common readability measures designed for English work well
on Slovene texts. To determine which of the measures perform the best we used sev-
eral statistical tests.
We also show that machine-learning methods can be used to accurately (over 98%
chance of a correct prediction) predict which group a text belongs to based on its
readability scores. We trained four different machine-learning models (decision trees,
random forests, naïve Bayes classifier, and extreme gradient boosting) and evaluated
them on our dataset. We obtained the best result (98.4% classification accuracy) by
using random forests.
Tadej Škvorc, Simon Krek, Senja Pollak, Špela Arhar Holdt,
Marko Robnik-Šikonja
PREDICTING THE COMPLEXITY OF SLOVENE
TEXTS USING READABILITY MEASURES
ABSTRACT
The problem of readability (i.e. how easy a text is to read) is well researched for
English. Many different methods and formulas exist for analysing English texts in terms
of readability. Although the question of readability is complex from a linguistic point
of view, most methods for assessing readability are based on simple text features.
Because most readability measures were designed for English texts, we cannot be sure
that they will work equally well on Slovene texts. English readability measures are
aligned with the American school system, are sometimes based on pre-compiled lists of
easy words, and do not take into account the difficulty of words from a morphological
point of view.
In our work, we analyse common readability measures on Slovene texts. In addition,
we introduce and analyse two further readability indicators that do not appear in the
common readability measures: the morphological difficulty of words, which captures the
assumption that rarer morphological word forms are hard to read, and the context of
difficult words, which captures the assumption that unknown words appearing in a
context of known words are easier to read, since their meaning can be inferred from the
context. We carried out the analysis on 14,581 texts from the Gigafida corpus, which we
divided into five groups according to their target audience (children’s magazines,
general magazines, newspaper articles, computer magazines, and transcriptions of
sessions of the National Assembly). We assumed that, because of their different target
audiences and topics, the magazines have different writing styles and consequently
different levels of readability.
For each readability measure, we checked how well it separates texts from different
groups. For each of them, we obtained the statistical distribution of readability
values for every group and checked whether the distributions are adequately separated.
In the analysis we show that many established measures designed for English also
perform well on Slovene texts. To find out which measures best distinguish between the
groups, we used statistical tests.
We also show that machine learning models combining the analysed readability
measures can predict with high accuracy (over 98%) which group a given text belongs to.
For this analysis we used four different machine learning methods (decision trees,
random forests, a naive Bayes classifier, and extreme gradient boosting). The best
result (98.4%) was obtained with random forests.
Reviews and Reports
Language Technologies and Digital Humanities 2018,
20–21 September 2018,
Faculty of Electrical Engineering, Ljubljana
The conference Language Technologies and Digital Humanities 2018 took place
at the Faculty of Electrical Engineering at the University of Ljubljana on 20 and 21
September 2018. It was organised by the Slovenian Language Technologies Society,1 the
Centre for Language Resources,2 the Faculty of Electrical Engineering,3 and the research
infrastructures CLARIN.SI4 and DARIAH.SI.5 The conference was the eleventh itera-
tion – as well as the 20th anniversary – of the Language Technologies conference series,6
which was started by the Slovenian Language Technologies Society and has been taking
place biennially since 1998. In 2016 it successfully expanded its scope to include
Digital Humanities as well. The 2018 edition of the conference was very international,
with authors from 17 European countries7 as well as two participants from Brazil and
Japan. This is why the conference programme was organized in such a way that talks
on Day 1 were in English and on Day 2 in Slovene.
The conference was opened by the first keynote speaker Malvina Nissim, who is
Associate Professor of Computational Linguistics and Natural Language Processing
at the University of Groningen. In her talk, titled “Too good to be true: Current
Approaches to author profiling”, she discussed novel approaches to the automatic
identification of the gender and age of social media users. In particular, she showed
that models which abstract away from the lexical content of social media posts and
instead focus on extra-linguistic information such as punctuation and emoticons,
whose use is shared across languages to a great extent, offer a robust and reliable way
to identify such personal information.
1 SDJT – Slovensko društvo za jezikovne tehnologije, http://www.sdjt.si/wp/english/.
2 CJVT – Centre for language resources and technologies, https://www.cjvt.si/en/.
3 Univerza v Ljubljani, Fakulteta za elektrotehniko, http://www.fe.uni-lj.si/en/.
4 CLARIN Slovenia, http://www.clarin.si/info/about/.
5 Dariah-SI | Digitalna humanistika, http://www.dariah.si/en/.
6 SDJT – Slovensko društvo za jezikovne tehnologije, http://www.sdjt.si/wp/dogodki/konference/strani/.
7 In addition to Slovenia, the following European countries were represented: Austria, Belgium, Bulgaria, Denmark,
Finland, Germany, Greece, Ireland, the Netherlands, Norway, Poland, Portugal, Serbia, Spain, Sweden, and
Switzerland.
Reviews and Reports
222 Prispevki za novejšo zgodovino LIX - 1/2019
The keynote talk was followed by two morning sessions devoted to topics in
machine translation and language resources. The machine translation session was
chaired by Tomaž Erjavec and comprised two talks. Gregor Donaj and Mirjam S.
Maučec compared traditional statistical machine translation with the use of neural net-
works for translating between Slovenian and English, while Mihael Arčan compared
the two approaches by using translations between three Slavic languages – Slovenian,
Croatian and Serbian.
In the subsequent session devoted to language resources, which was chaired by
Simon Krek, six papers were presented, introducing on-going work on language cor-
pora and lexical resources in Slovenian, Croatian and Portuguese. For example, Filip
Dobranić presented joint work with Nikola Ljubešić, Darja Fišer and Tomaž Erjavec
on the creation of the Parlameter corpus, which contains contemporary Slovenian par-
liamentary proceedings from 2014 to 2018 with rich speaker metadata on the gender,
age, education and party affiliation of the members of the Slovenian parliament. He
also showcased how the resource facilitates in-depth exploration of institutionalised
language use and interpersonal behaviour patterns, which is important for an inter-
disciplinary approach to the analysis of parliamentary discourse that involves collabo-
ration between researchers working in disciplines like sociology, discourse analysis,
history, sociolinguistics, and political science.
The poster session presented nine posters on various applications of quantitative
approaches to data analysis within digital humanities and social sciences. For instance,
Katja Mihurko Poniž and colleagues introduced a tool that aids in the research of the
historical representation of women’s authorship, which is an important topic in socio-
historic approaches to literary theory, while Damjan Popič and Darja Fišer presented
a corpus-driven analysis of the attitudes toward language in Slovenian, Croatian, and
Serbian computer-mediated communication.
The first afternoon session, which was chaired by Jurij Hadalin, was devoted to
Digital Humanities. Dan Podjed and Ajda Pretnar analysed the use of social media by
the Slovenian President Borut Pahor for self-promotion. On the basis of qualitative
and quantitative approaches to data analysis, they identified three distinct categories
of the President’s Instagram posts that prove to be the most popular among his fol-
lowers; namely, (i) photographs in which he is seen together with celebrities and his
family, (ii) posts in which he gives the impression of being approachable, and (iii)
photographs in which he is depicted in an unusual situation. Tobias Weber and Jeremy
Bradley discussed a novel approach to integrating computational methods, digital
resources and computer literacy skills into the curriculum of Finno-Ugric linguistics,
stressing the importance of tailoring the materials to the students’ non-computational
backgrounds in humanities and social sciences.
The subsequent session, which was chaired by Simon Dobrišek, concluded the
first day of the conference. Nine papers were presented on topics related to language
technologies and their application. For instance, in a cross-disciplinary approach to
phonetics and medicine, Tatjana Marvin introduced joint work with Jure Derganc,
Samo Beguš and Saba Battelino on a novel Slovenian Sentence Matrix Test for measuring
speech intelligibility in patients suffering from hearing loss. To give another example
in a different field of application, Milan van Lange and Ralf Futselaar presented
their use of word embeddings in the analysis of parliamentary debates on war criminals
in the Netherlands.
The second day of the conference began with a keynote talk delivered by Martijn
Kleppe, who is Head of the Research Department at the National Library of the
Netherlands. In his talk, titled “Bringing Digital Humanities to the wider public: libraries
as incubator for DH research results”, Kleppe presented one of the main aims of
the National Library of the Netherlands, which is to support researchers in the Digital
Humanities and social sciences and incorporate their research results in its services
and products. To this end, he showcased LAB, an online toolchain of the Library
which offers researchers an interoperable environment for working with richly annotated
texts and state-of-the-art tools for processing handwritten documents. He also
discussed the Library’s collaborations with other national and international research
infrastructures, such as CLARIN ERIC.
The next session, chaired by Andrej Pančur, brought two talks on issues related to
Slovenian research infrastructures. Maja Dolinar, Janez Štebe and Sonja Bezjak presented
a new set of guidelines for the acquisition and archiving of qualitative research
in the Slovenian Social Science Data Archives. Tomaž Erjavec presented joint work
with Darja Fišer and Jakob Lenardič on how linguistic data, such as those found in
language corpora, are cited in Slovenian research publications, and proposed recom-
mendations and solutions for more consistent and rigorous citation practices in line
with the Austin Principles of Data Citation in Linguistics.
In the next session, chaired by Darja Fišer, six talks were given on topics related
to corpus linguistics. For instance, Nataša Logar presented the main morphosyntac-
tic characteristics of academic Slovenian, which she analysed together with Tomaž
Erjavec on the basis of the Slovenian balanced corpus Kres and the corpus KAS, which
consists of Slovenian BA, BSc and PhD theses. Iztok Kosem presented joint work with
Simon Krek, Polona Gantar, Špela Arhar Holdt, Jaka Čibej, and Cyprian Laskowski
on the user interface of the first Collocations Dictionary for Modern Slovenian, which
was compiled on the basis of state-of-the-art lexicographic methods.
The student session, which was chaired by Iza Škrjanec, included four talks. Urška
Bratoš discussed the compilation and analysis of a corpus of tweets written by Slovenian
politicians. Isolde van Dorst then presented a statistical analysis of Shakespeare’s use of
pronominal expressions, specifically his usage of the second-person pronoun you and
its two now-obsolete informal variants – nominative thou and accusative thee. Gabi
Rolih presented an implementation of a K-means clustering method applied to com-
puter-mediated communication and discussed how it can be used to further improve a
state-of-the-art part-of-speech tagger for Slovenian. Finally, Klara Eva Kukovičič com-
pared the concordancer Sketch Engine with the tool CollTerm from the point of view
of terminology extraction. The Best Student Paper Award was presented to Isolde van
Dorst by the selection committee, which consisted of Iza Škrjanec (chair of the Student Session), Tomaž
Erjavec (on behalf of the Programme Committee) and Kaja Dobrovoljc (on behalf of
the Slovenian Language Technologies Society).
The final session, chaired by Matija Ogrin, focused again on Digital Humanities
and concluded the conference. Five papers were presented on topics related to cultural
heritage, historical studies and geography. For instance, Andrej Pančur presented the
SIstory web portal, which offers a sustainable repository for digital editions of histori-
cal texts, while Alenka Kavčič presented joint work with Ivan Lovrić and Vera Smole
on the development of an interactive online map of the seven major Slovenian dialect
groups, which includes geocoded text examples enriched with audio materials that
exemplify the salient phonological features of the dialects.
The Language Technologies and Digital Humanities 2018 conference success-
fully presented on-going and completed work on state-of-the-art language tools and
resources, as well as their application. The presentations that used computational tools
and methodologies to answer qualitative research questions were especially illustrative
in showing how language technologies facilitate research and break new ground
in fields like translation studies, political science, historical studies, phonetics and
phonology, and literary theory. Perhaps most importantly, the work presented by master’s and
doctoral students was an inspiring showcase of how young researchers use innovative
computational approaches to tackle complex research problems in such interdisciplinary
fields. The conference thus gave both novice and experienced researchers
from Slovenia and abroad a chance to strike up collaborations and get involved in
research projects that bridge the gap between language technologies on the one hand
and humanities and social sciences on the other.
Jakob Lenardič*
* Department for Translation, Faculty of Arts, University of Ljubljana, Aškerčeva 2, SI-1000 Ljubljana,
jakob.lenardic@ff.uni-lj.si
Sources and Literature
• Arčan, Mihael. 2018. “A comparison of Statistical and Neural Machine Translation for Slovene,
Serbian and Croatian.” In Proceedings of the Conference on Language Technologies & Digital
Humanities 2018, edited by Darja Fišer and Andrej Pančur, 3–10. Ljubljana: Znanstvena založba
Filozofske fakultete v Ljubljani.
• Bratoš, Urška. 2018. “Gradnja korpusa tvitov slovenskih politikov Janes-TwePo.” In Proceedings
of the Conference on Language Technologies & Digital Humanities 2018, edited by Darja Fišer and
Andrej Pančur, 269–73. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Dolinar, Maja, Janez Štebe, and Sonja Bezjak. 2018. “Razvoj smernic za predajo in arhiviranje
kvalitativnih podatkov v Arhivu družboslovnih podatkov.” In Proceedings of the Conference on
Language Technologies & Digital Humanities 2018, edited by Darja Fišer and Andrej Pančur, 55–61.
Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Donaj, Gregor, and Mirjam S. Maučec. 2018. “Prehod iz statističnega strojnega prevajanja na
prevajanje z nevronskimi omrežji za jezikovni par slovenščina-angleščina.” In Proceedings of the
Conference on Language Technologies & Digital Humanities 2018, edited by Darja Fišer and Andrej
Pančur, 62–68. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• van Dorst, Isolde. 2018. “You, Thou and Thee: A Statistical Analysis of Shakespeare’s Use of
Pronominal Address Terms.” In Proceedings of the Conference on Language Technologies & Digital
Humanities 2018, edited by Darja Fišer and Andrej Pančur, 274–80. Ljubljana: Znanstvena založba
Filozofske fakultete v Ljubljani.
• Fišer, Darja, Jakob Lenardič, and Tomaž Erjavec. 2018. “Citiranje jezikoslovnih podatkov v
slovenskih znanstvenih objavah: stanje in priporočila.” In Proceedings of the Conference on Language
Technologies & Digital Humanities 2018, edited by Darja Fišer and Andrej Pančur, 77–84. Ljubljana:
Znanstvena založba Filozofske fakultete v Ljubljani.
• Kavčič, Alenka, Ivan Lovrić, and Vera Smole. 2018. “Karta slovenskih narečnih besedil.” In
Proceedings of the Conference on Language Technologies & Digital Humanities 2018, edited by Darja
Fišer and Andrej Pančur, 121–25. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Kleppe, Martijn. 2018. “Bringing Digital Humanities to the Wider Public: Libraries as Incubator
for DH Research Results.” In Proceedings of the Conference on Language Technologies & Digital
Humanities 2018, edited by Darja Fišer and Andrej Pančur, 2. Ljubljana: Znanstvena založba
Filozofske fakultete v Ljubljani.
• Kosem, Iztok, Simon Krek, Polona Gantar, Špela Arhar Holdt, Jaka Čibej, and Cyprian Laskowski.
2018. “Kolokacijski slovar sodobne slovenščine.” In Proceedings of the Conference on Language
Technologies & Digital Humanities 2018, edited by Darja Fišer and Andrej Pančur, 133–39.
Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Kukovičič, Klara Eva. 2018. “Uporabnost luščilnikov terminologije Sketch Engine in CollTerm z
vidika (študenta) prevajalca.” In Proceedings of the Conference on Language Technologies & Digital
Humanities 2018, edited by Darja Fišer and Andrej Pančur, 281–87. Ljubljana: Znanstvena založba
Filozofske fakultete v Ljubljani.
• van Lange, Milan, and Ralf Futselaar. 2018. “Debating Evil: Using Word Embeddings to Analyze
Parliamentary Debates on War Criminals in The Netherlands.” In Proceedings of the Conference on
Language Technologies & Digital Humanities 2018, edited by Darja Fišer and Andrej Pančur, 147–
53. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Ljubešić, Nikola, Darja Fišer, Tomaž Erjavec, and Filip Dobranić. 2018. “The Parlameter corpus
of contemporary Slovene parliamentary proceedings.” In Proceedings of the Conference on Language
Technologies & Digital Humanities 2018, edited by Darja Fišer and Andrej Pančur, 162–67.
Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Logar, Nataša, and Tomaž Erjavec. 2018. “Strokovnoznanstvena slovenščina: besednovrstne in
oblikoskladenjske značilnosti.” In Proceedings of the Conference on Language Technologies & Digital
Humanities 2018, edited by Darja Fišer and Andrej Pančur, 175–80. Ljubljana: Znanstvena založba
Filozofske fakultete v Ljubljani.
• Marvin, Tatjana, Jure Derganc, Samo Beguš, and Saba Battelino. 2018. “Word Selection in the
Slovenian Sentence Matrix Test for Speech Audiometry.” In Proceedings of the Conference on
Language Technologies & Digital Humanities 2018, edited by Darja Fišer and Andrej Pančur, 181–
87. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Mihurko Poniž, Katja, Amelia Sanz, Marie Nedregotten Sørbø, Suzan van Dijk, Viola Parente-
Čapková, Narvika Bovcon, and Aleš Vaupotič. 2018. “Teaching Women Writers with NEWW
Virtual Research Environment.” In Proceedings of the Conference on Language Technologies & Digital
Humanities 2018, edited by Darja Fišer and Andrej Pančur, 254–55. Ljubljana: Znanstvena založba
Filozofske fakultete v Ljubljani.
• Nissim, Malvina. 2018. “Too Good to Be True: Current Approaches to Author profiling.” In
Proceedings of the Conference on Language Technologies & Digital Humanities 2018, edited by Darja
Fišer and Andrej Pančur, 1. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Pančur, Andrej. 2018. “Trajnost digitalnih izdaj: Uporaba statičnih spletnih strani na portalu
Zgodovina Slovenije – SIstory.” In Proceedings of the Conference on Language Technologies & Digital
Humanities 2018, edited by Darja Fišer and Andrej Pančur, 203–10. Ljubljana: Znanstvena založba
Filozofske fakultete v Ljubljani.
• Podjed, Dan, and Ajda Pretnar. 2018. “Samopromocija na Instagramu: Primer predsednikovega
profila.” In Proceedings of the Conference on Language Technologies & Digital Humanities 2018, edited
by Darja Fišer and Andrej Pančur, 221–26. Ljubljana: Znanstvena založba Filozofske fakultete v
Ljubljani.
• Popič, Damjan, and Darja Fišer. 2018. “Odnosi do jezika v slovenski, hrvaški in srbski računalniško
posredovani komunikaciji.” In Proceedings of the Conference on Language Technologies & Digital
Humanities 2018, edited by Darja Fišer and Andrej Pančur, 256–59. Ljubljana: Znanstvena založba
Filozofske fakultete v Ljubljani.
• Rolih, Gabi. 2018. “K-means Clustering of CMC Data for Tagger Improvement.” In Proceedings
of the Conference on Language Technologies & Digital Humanities 2018, edited by Darja Fišer and
Andrej Pančur, 288–91. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani.
• Weber, Tobias, and Jeremy Bradley. 2018. “Exploring Finno-Ugric Linguistics Through Solving
IT Problems.” In Proceedings of the Conference on Language Technologies & Digital Humanities
2018, edited by Darja Fišer and Andrej Pančur, 248–53. Ljubljana: Znanstvena založba Filozofske
fakultete v Ljubljani.