Zbornik 24. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2021, Zvezek C
Proceedings of the 24th International Multiconference INFORMATION SOCIETY – IS 2021, Volume C

Odkrivanje znanja in podatkovna skladišča - SiKDD
Data Mining and Data Warehouses - SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

4. oktober 2021 / 4 October 2021, Ljubljana, Slovenija / Slovenia
http://is.ijs.si

Urednika: Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana; Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2021
Informacijska družba, ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 85867267
ISBN 978-961-264-218-1 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2021

Štiriindvajseta multikonferenca Informacijska družba je preživela probleme zaradi korone v 2020. Odziv se povečuje, v 2021 imamo enajst konferenc, a pravo upanje je za 2022, ko naj bi dovolj velika precepljenost končno omogočila normalno delovanje. Tudi v 2021 gre zahvala za skoraj normalno delovanje konference tistim predsednikom konferenc, ki so kljub prvi pandemiji modernega sveta pogumno obdržali visok strokovni nivo.
Stagnacija določenih aktivnosti v 2020 in 2021 pa skoraj v ničemer ni omejila neverjetne rasti IKT-ja, informacijske družbe, umetne inteligence in znanosti nasploh, ampak nasprotno – rast znanja, računalništva in umetne inteligence se nadaljuje z že kar običajno nesluteno hitrostjo. Po drugi strani se je pospešil razpad družbenih vrednot, zaupanja v znanost in razvoj. Se pa zavedanje večine ljudi, da je potrebno podpreti stroko, čedalje bolj krepi, kar je bistvena sprememba glede na 2020.

Letos smo v multikonferenco povezali enajst odličnih neodvisnih konferenc. Zajema okoli 170 večinoma spletnih predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic ter 400 obiskovalcev. Prireditev so spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad – seveda večinoma preko spleta. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 45-letno tradicijo odlične znanstvene revije.

Multikonferenco Informacijska družba 2021 sestavljajo naslednje samostojne konference:
• Slovenska konferenca o umetni inteligenci
• Odkrivanje znanja in podatkovna skladišča
• Kognitivna znanost
• Ljudje in okolje
• 50-letnica poučevanja računalništva v slovenskih srednjih šolah
• Delavnica projekta Batman
• Delavnica projekta Insieme Interreg
• Delavnica projekta Urbanite
• Študentska konferenca o računalniškem raziskovanju 2021
• Mednarodna konferenca o prenosu tehnologij
• Vzgoja in izobraževanje v informacijski družbi

Soorganizatorji in podporniki multikonference so različne raziskovalne institucije in združenja, med njimi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi.
Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna stroka s področja opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Jernej Kozak. Priznanje za dosežek leta pripada ekipi Odseka za inteligentne sisteme Instituta »Jožef Stefan« za osvojeno drugo mesto na tekmovanju XPrize Pandemic Response Challenge za iskanje najboljših ukrepov proti koroni. »Informacijsko limono« za najmanj primerno informacijsko potezo je prejela trditev, da je aplikacija za sledenje stikom problematična za zasebnost, »informacijsko jagodo« kot najboljšo potezo pa COVID-19 Sledilnik, tj. sistem za zbiranje podatkov o koroni. Čestitke nagrajencem!

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD - INFORMATION SOCIETY 2021

The 24th Information Society Multiconference survived the COVID-19 problems. In 2021, there are eleven conferences with a growing trend, and real hopes that 2022 will be better due to successful vaccination. The multiconference survived thanks to the conference chairs who bravely decided to continue with their conferences despite the first pandemic in the modern era. The COVID-19 pandemic did not slow the growth of ICT, the information society, artificial intelligence and science overall; quite the contrary, the progress of computers, knowledge and artificial intelligence continued at a fascinating rate. However, COVID-19 did accelerate the erosion of societal norms and of trust in science and progress. On the other hand, the awareness of the majority that science and development are the only prospects for a prosperous future is growing substantially.
The Multiconference is running parallel sessions with 170 presentations of scientific papers at eleven conferences, many round tables, workshops and award ceremonies, and 400 attendees. Selected papers will be published in the Informatica journal, with its 45-year tradition of excellent research publishing.

The Information Society 2021 Multiconference consists of the following conferences:
• Slovenian Conference on Artificial Intelligence
• Data Mining and Data Warehouses
• Cognitive Science
• People and Environment
• 50 Years of High-school Computer Education in Slovenia
• Batman Project Workshop
• Insieme Interreg Project Workshop
• URBANITE Project Workshop
• Student Computer Science Research Conference 2021
• International Conference of Transfer of Technologies
• Education in Information Society

The multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e. the Slovenian chapter of the ACM, SLAIS, DKZ and the second national academy, the Slovenian Engineering Academy. In the name of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contributions and their interest in this event, and the reviewers for their thorough reviews.

The award for lifelong outstanding contributions is presented in memory of Donald Michie and Alan Turing. The Michie-Turing award was given to Prof. Dr. Jernej Kozak for his lifelong outstanding contribution to the development and promotion of the information society in our country. In addition, the yearly recognition for current achievements was awarded to the team from the Department of Intelligent Systems, Jožef Stefan Institute, for second place at the XPrize Pandemic Response Challenge for proposing the best countermeasures against COVID-19. The information lemon goes to the claim that the mobile application for tracking COVID-19 contacts would harm information privacy.
The information strawberry, as the best information service of the past year, went to COVID-19 Sledilnik, a program to regularly report all data related to COVID-19 in Slovenia. Congratulations!

Mojca Ciglarič, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee:
Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič; Klara Vulikić

Programme Committee:
Mojca Ciglarič, chair; Bogdan Filipič; Dunja Mladenić; Niko Zimic; Bojan Orel; Andrej Gams; Franc Novak; Rok Piltaver; Franc Solina; Matjaž Gams; Vladislav Rajkovič; Toma Strle; Viljan Mahnič; Mitja Luštrek; Grega Repovš; Tine Kolenik; Cene Bavec; Marko Grobelnik; Ivan Rozman; Franci Pivec; Tomaž Kalin; Nikola Guid; Niko Schlamberger; Uroš Rajkovič; Jozsef Györkös; Marjan Heričko; Stanko Strmčnik; Borut Batagelj; Tadej Bajd; Borka Jerman Blažič Džonova; Jurij Šilc; Tomaž Ogrin; Jaroslav Berce; Gorazd Kandus; Jurij Tasič; Aleš Ude; Mojca Bernik; Urban Kordeš; Denis Trček; Bojan Blažica; Marko Bohanec; Marjan Krisper; Andrej Ule; Matjaž Kljun; Ivan Bratko; Andrej Kuščer; Boštjan Vilfan; Robert Blatnik; Andrej Brodnik; Jadran Lenarčič; Baldomir Zajc; Erik Dovgan; Dušan Caf; Borut Likar; Blaž Zupan; Špela Stres; Saša Divjak; Janez Malačič; Boris Žemva; Anton Gradišek; Tomaž Erjavec; Olga Markič; Leon Žlajpah

KAZALO / TABLE OF
CONTENTS

Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD ... 1
PREDGOVOR / FOREWORD ... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ... 4
Observing odor-related information in academic domain / Novalija Inna, Massri M.Besher, Mladenić Dunja, Grobelnik Marko, Schwabe Daniel, Brank Janez ... 5
Understanding Text Using Agent Based Models / Mladenic Grobelnik Adrian, Grobelnik Marko, Mladenić Dunja ... 9
News Stream Clustering using Multilingual Language Models / Novak Erik ... 13
SloBERTa: Slovene monolingual foundation model / Ulčar Matej, Robnik-Šikonja Marko ... 17
Understanding the Impact of Geographical Bias on News Sentiment: A Case Study on London and Rio Olympics / Swati, Mladenić Dunja ... 21
An evaluation of BERT and Doc2Vec model on the IPTC Subject Codes prediction dataset / Pranjić Marko, Robnik-Šikonja Marko, PI3 ... 25
Classification of Cross-cultural News Events / Sittar Abdul, Mladenić Dunja ... 29
Zotero to Elexifinder: Collection, curation, and migration of bibliographical data / Lindemann David ... 33
Simple discovery of COVID IS WAR Metaphors Using Word Embeddings / Brglez Mojca, Pollak Senja, Vintar Špela ... 37
Topic modelling and sentiment analysis of COVID-19 related news on Croatian Internet portal / Buhin Pandur Maja, Dobša Jasminka, Beliga Slobodan, Meštrović Ana ... 41
Tackling Class Imbalance in Radiomics: the COVID-19 Use Case / Rožanec Jože M., Poštuvan Tim, Fortuna Blaž, Mladenić Dunja ... 45
Observing Water-Related Events for Evidence-Based Decision-Making / Pita Costa Joao, Massri M.Besher, Novalija Inna, Casals del Busto Ignacio, Mocanu Iulian, Rossi Maurizio, Šturm Jan, Eržin Eva, Guček Alenka, Posinković Matej, Grobelnik Marko ... 49
Anomaly Detection on Live Water Pressure Data Stream / Petkovšek Gal, Erznožnik Matic, Kenda Klemen ... 53
Entropy for Time Series Forecasting / Costa Joao, Kenda Klemen, Pita Costa Joao ... 57
Modeling stochastic processes by simultaneous optimization of latent representation and target variable / Jelenčič Jakob, Mladenić Dunja ... 61
Causal relationships among global indicators / Neumann Matej ... 65
Active Learning for Automated Visual Inspection of Manufactured Products / Trajkova Elena, Rožanec Jože M., Dam Paulien, Fortuna Blaž, Mladenić Dunja ... 69
Learning to Automatically Identify Home Appliances / Lorbek Ivančič Dan, Bertalanič Blaž, Cerar Gregor, Fortuna Carolina ... 73
Indeks avtorjev / Author index ... 77

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bilo več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke – pojavilo se je t. i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line Analytical Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore ter na vizualen način preverja in išče izstopajoče situacije.
Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz., z drugimi besedami, po tem, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

FOREWORD

Data-driven technologies have progressed significantly since the mid-1990s. The first phase was mainly focused on storing and efficiently accessing the data, which resulted in the development of industry tools for managing large databases, related standards, supporting query languages, etc. After the initial period, when data storage was no longer the primary problem, development progressed towards analytical functionalities for extracting added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line Analytical Processing (OLAP) entered as a usual part of a company's information system portfolio, requiring the user to pose well-defined questions about aggregated views of the data. Data mining is a technology developed after the year 2000, offering automatic data analysis that tries to obtain new discoveries from the existing data and gives the user new insights into the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis; Data, Text and Multimedia Mining; Semantic Technologies; Link Detection and Link Analysis; Social Network Analysis; and Data Warehouses.
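The contrast drawn above, between OLAP (the user poses an aggregate question) and data mining (the system volunteers what might be interesting), can be illustrated with a small sketch. The data and the outlier rule below are illustrative assumptions, not part of any system described in this volume:

```python
from collections import defaultdict

# Toy transaction records: (region, product, revenue).
sales = [
    ("north", "sensor", 120.0), ("north", "sensor", 80.0),
    ("south", "sensor", 95.0), ("south", "valve", 300.0),
    ("north", "valve", 40.0), ("south", "valve", 310.0),
]

# OLAP-style: the USER asks a specific aggregate question,
# e.g. "total revenue per (region, product) cell".
def olap_rollup(rows):
    cube = defaultdict(float)
    for region, product, revenue in rows:
        cube[(region, product)] += revenue
    return dict(cube)

# Data-mining-style: the SYSTEM points out what is unusual,
# e.g. cells whose revenue deviates strongly from the mean.
def find_outliers(cube, factor=2.0):
    mean = sum(cube.values()) / len(cube)
    return [cell for cell, value in cube.items() if value > factor * mean]

cube = olap_rollup(sales)
print(cube)
print(find_outliers(cube))
```

In the first case the analyst must already know which question to ask; in the second, the system surfaces the anomalous cell without being asked about it.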
PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Jožef Stefan Institute, Ljubljana
Jakob Jelenčič, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Aljaž Košmerlj, Qlector, Ljubljana
Dunja Mladenić, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Jože Rožanec, Qlector, Ljubljana
Luka Stopar, Sportradar, Ljubljana

OBSERVING ODOR-RELATED INFORMATION IN ACADEMIC DOMAIN

Inna Novalija, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, inna.koval@ijs.si
M. Besher Massri, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, besher.massri@ijs.si
Dunja Mladenić, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, dunja.mladenic@ijs.si
Marko Grobelnik, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, marko.grobelnik@ijs.si
Daniel Schwabe, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, daniel.schwabe@ijs.si
Janez Brank, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, janez.brank@ijs.si

ABSTRACT

In this paper, we demonstrate an approach for observing olfactory-related information in an academic publications environment (such as Microsoft Academic Graph) based on semantic technologies. We present an Odor Observatory tool that enables several usage scenarios, such as observing odor-related papers and topics, viewing institutions conducting olfactory research, and identifying top journals and key countries in the olfactory domain. Validation of the proposed approach on a collection of academic publications from 1800 until 1925 confirms the applicability of the proposed approach to large data collections with a wide span of time. In the usage scenarios we observed the odor-related publications in Microsoft Academic Graph by topic, discovered the journals with historical olfactory publications, and found that the most popular terms in odor-related research content are: method, olfactory, odor, device, invention, smell, preparation, utility model.

KEYWORDS

Odor, Olfactory information, Microsoft Academic Graph (MAG), Data mining.

1. INTRODUCTION

Olfaction, or the sense of smell, is the sense through which smells (or odors) are perceived [1]. Olfactory science involves studying olfaction and odor-related topics, the sensory system, physiology, human and pheromone signals.

The Odeuropa project [2] gathers and integrates expertise in sensory mining and olfactory heritage. The project partners are developing novel methods to collect information about smell from (digital) text and image collections. The Odeuropa project partners apply state-of-the-art AI techniques to text and image datasets in order to identify and trace how 'smell' was expressed in different languages, with what places it was associated, what kinds of events and practices it characterized, and to what emotions it was linked.

In this paper we present an approach for mining olfactory information from scientific research collections, such as the Microsoft Academic Graph (MAG) [3]. The olfactory mining approach combines data processing, modelling and visualization methods in order to develop applicable tools for data analysis.

We present an Odor Observatory tool [4] targeted at several visualization scenarios. In particular, the Odor Observatory allows exploring olfactory-related papers from the MAG over time and, along with current data, provides historical information starting with the early XIX century.

The data-driven functionalities of the Odor Observatory are:
▪ Possibility of exploring top-ranked topics in the olfactory academic domain;
▪ Possibility of exploring top-ranked institutions conducting olfactory research;
▪ Possibility of exploring key countries and defining top-ranking journals in the olfactory academic domain;
▪ Odor-related search functionalities;
▪ Word cloud visualization for odor-related terms.

2. RELATED WORK

Olfactory science covers different aspects of research related to odors; therefore, exploring odor-related information and data can be viewed as a complex multidisciplinary area.

Lötsch et al. [5] considered machine learning approaches for olfactory research. The authors state that the complexity of the human sense of smell is reflected in complex and high-dimensional sensory data, which supports the applicability of machine learning and data mining techniques. The use of machine learning in human olfactory research includes the following aims:
1. The study of the physiology of pattern-based odor detection and recognition processes;
2. Pattern recognition in olfactory phenotypes;
3. The development of complex disease biomarkers including olfactory features;
4. Odor prediction from physico-chemical properties of volatile molecules; and
5. Knowledge discovery in publicly available large databases.
The authors provide a review of key concepts of machine learning and summarize current applications on human olfactory data.

At the same time, the linguistic and semantic communities have focused on studying the language of smell [6]. Iatropoulos et al. developed a computational method to characterize the olfaction-related semantic content of words in a large text corpus of internet sites in English. They also introduced novel metrics, such as the olfactory association index (OAI) and the olfactory specificity index (OSI). Tonelli [7] describes olfactory information extraction and semantic processing from a multilingual perspective. The author states that several studies have found that languages seem to have a smaller vocabulary to describe smells as compared to other senses.

In our work we apply data mining and machine learning, as well as semantic approaches for enriching textual data. We use data from the Microsoft Academic Graph, and our methodologies can be regarded as being in the context of semantic and text processing research. Our approaches can cover cross-lingual and multilingual data and allow for tracking olfactory trends in time.

3. PROBLEM DEFINITION

3.1 DATA SOURCES

The Microsoft Academic Graph (MAG) [3] is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. Since this research is conducted in line with the Odeuropa project (targeted at olfactory heritage), the time frame used for MAG data is set to range from the early publications in the 19th century to the present time. The Odeuropa project is interested in particular in the data available up to 1925. Though the project is focused on the historical datasets, the developed Odor Observatory tool allows users to explore recent olfactory publications as well. The dataset is updated on a monthly basis and newly available data is uploaded into the observatory.

The Microsoft Academic Graph data schema is based on the following entity types: publication, author, author affiliation (institution), publication venue (journals and conferences), and field of study (topic). It contains information about publication dates, as well as citation pairs and co-authorship data (see Figure 1).

Figure 1: The Conceptual Schema for MAG

Figure 2 illustrates an entry in MAG for a historical publication tagged with several odor-relevant topics.

Figure 2: Publication in MAG

Figure 3 shows a representation of Odor in the MAG taxonomy, with parent topics (Organic chemistry and Neuroscience) and child topics (Olfactory learning, Geosmin, etc.).

Figure 3: Odor in the MAG Taxonomy

An important functionality while exploring the literature is the ability to expand searches by looking at topics related to a topic of interest. Figure 4 displays topics about/related to Olfaction/Odor/Smell in the MAG taxonomy.

Figure 4: Odor-related Topics in MAG

3.2 METHODOLOGY

The methodology for observing olfactory-related information from academic publication resources includes a number of steps:
▪ Using the MAG taxonomy, obtain the list of research papers that correspond to odor-related topics. Papers were filtered to those containing the topics Olfaction, Odor, Fragrance, Fragrance ingredient, as well as the "smell" keyword;
▪ Ingest the extracted corpus into the Elastic Search tool (https://www.elastic.co);
▪ Provide visualization functionalities, such as MAG time series per term.

The key challenges of the development techniques include:
▪ Interpretability and explainability of the results – the aim is for the visualizations to be easily interpretable by humans;
▪ Scalability – given the large scale of the incoming data streams, it is essential that the visualizations being built are scalable. The MAG contains more than 265 million records (August 2021), including several types of publication, such as journal articles, conference papers, books, book chapters, and papers from other repositories. In addition, MAG also indexes a large corpus of patents.

3.3 USAGE SCENARIOS

We present a couple of usage scenarios for the Odor Observatory tool, cast as questions asked by scholars studying the field.

1. What are the historical trends in odor-related publications?

Figure 5 shows the number of odor-related historical publications in MAG over time. This scenario assumes observing trends in different olfactory topics throughout a time interval. It is possible to observe that the highest numbers of publications are in the domains of biology and psychology.

Figure 5: Odor-related Publications in MAG (from year 1800 until year 1925, cumulative) by topic

2. What are the most popular terms used in odor-related publications?

This use case helps the user to visualize term usage by displaying a word cloud with the most popular olfactory terms used in the publications in the period of interest (see Figure 6).

Figure 6: Odor Terms Word Cloud in MAG

3. Which venues were mostly used when publishing odor-related research articles?

This use case shows a number of journals that had historical publications about smells (see Figure 7). The figure shows that JAMA and Nature are the most popular journals regarding historical olfactory publications.

Figure 7: Journals with Olfactory Publications in MAG (from year 1800 until year 1925, cumulative)

4. Which are the publications about smell (from a contextual point of view)?

The Research Explorer tool is a search engine that enables exploring the individual articles in the corpus of odor-related publications. The tool is built on Elastic Search and provides search by keyword and by date. It also supports smart navigation through the results by clustering the results and re-ranking them by moving the focus of search through the cluster space (see Figure 8). The goal of the tool is to enhance a search engine by providing the users multiple rankings of the results for each query. This is achieved by generating topics for the given query and its result set, and visualizing these topics on the "Ranking Space" panel. When the focus is set near a given topic, results that are on or closer to that topic are ranked higher.

Figure 8: List of Olfactory Publications in MAG (from year 1800 until year 1925) that contain the keyword "smell"

The figure shows a ranked list of relevant publications on the topic of "smells" in the period from 1900 to 1925. The list is modified by changing the context on the right side: the focus is changed by placing the cursor over a cluster, and publications associated with this cluster are displayed.

4. CONCLUSION

In this paper we demonstrated an approach towards observing olfactory-related information in scientific publications, as recorded in the MAG. In addition, we presented an Odor Observatory tool that enables several usage scenarios for exploring historical and present olfactory research.

Future work will include the exploration of other textual datasets applicable to olfactory research, with an accent on olfactory heritage information. In line with the Odeuropa project, the relevant information extracted from textual sources will be aligned, following semantic web standards, with the 'European Olfactory Knowledge Graph' (EOKG).

5. ACKNOWLEDGMENTS

This research is supported by the Slovenian research agency and by the European Union's Horizon 2020 program project Odeuropa under grant agreement number 101004469.

REFERENCES

[1] Wolfe, J. M., Kluender, K. R., Levi, D. M., Bartoshuk, L. M., Herz, R. S., Klatzky, R., Lederman, S. J., & Merfeld, D. M. (2012). Sensation & Perception (3rd ed.). Sinauer Associates.
[2] Odeuropa project, https://odeuropa.eu (accessed in August 2021).
[3] Wang, K. et al. A Review of Microsoft Academic Services for Science of Science Studies, Frontiers in Big Data, 2019, doi: 10.3389/FDATA.2019.00045.
[4] JSI Odor Observatory, public service, https://odeuropa.ijs.si/dashboards/Main/Index?visualization=visualizations-MAG--top-topics# (accessed in August 2021).
[5] Lötsch, J., Kringel, D., Hummel, T. Machine Learning in Human Olfactory Research, Chemical Senses, Volume 44, Issue 1, January 2019, Pages 11–22, https://doi.org/10.1093/chemse/bjy067.
[6] Iatropoulos, G., Herman, P., Lansner, A., Karlgren, J., Larsson, M., Olofsson, J. K. The language of smell: Connecting linguistic and psychophysical properties of odor descriptors. Cognition. 2018 Sep;178:37-49. doi: 10.1016/j.cognition.2018.05.007. Epub 2018 May 12. PMID: 2976379.
[7] Tonelli, S. A Smell is Worth a Thousand Words: Olfactory Information Extraction and Semantic Processing in a Multilingual Perspective. doi: https://doi.org/10.4230/OASIcs.LDK.2021.2, https://drops.dagstuhl.de/opus/volltexte/2021/14538/pdf/OASIcs-LDK-2021-2.pdf (accessed in August 2021).
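The topic-based filtering and per-term statistics described in Section 3.2 of the paper above can be sketched in a few lines. The records, field names, and toy corpus below are simplified assumptions for illustration; the actual pipeline works over the full MAG and ingests the corpus into Elasticsearch, which is not reproduced here:

```python
from collections import Counter, defaultdict

# Hypothetical, simplified MAG-like records (real MAG entries are far richer).
papers = [
    {"title": "On the olfactory nerve", "year": 1824,
     "topics": ["Olfaction", "Neuroscience"]},
    {"title": "A new odor preparation device", "year": 1901,
     "topics": ["Odor", "Organic chemistry"]},
    {"title": "Fragrance ingredient utility model", "year": 1922,
     "topics": ["Fragrance ingredient"]},
    {"title": "Crop rotation methods", "year": 1850,
     "topics": ["Agronomy"]},
]

# Topic filter from Section 3.2 of the paper.
ODOR_TOPICS = {"Olfaction", "Odor", "Fragrance", "Fragrance ingredient"}

def select_odor_papers(records, until_year=1925):
    """Keep records tagged with an odor-related topic, up to a cut-off year."""
    return [r for r in records
            if r["year"] <= until_year and ODOR_TOPICS & set(r["topics"])]

def yearly_counts(records):
    """Time series: number of odor-related papers per year."""
    series = defaultdict(int)
    for r in records:
        series[r["year"]] += 1
    return dict(series)

def term_frequencies(records):
    """Term counts over titles, e.g. as input for a word cloud."""
    counter = Counter()
    for r in records:
        counter.update(w.lower().strip(",.") for w in r["title"].split())
    return counter

corpus = select_odor_papers(papers)
print(len(corpus))
print(yearly_counts(corpus))
print(term_frequencies(corpus).most_common(3))
```

The same three operations (filter by taxonomy topics, aggregate per year, count terms) underlie the trend, word-cloud, and venue scenarios of Section 3.3, only executed at MAG scale behind an Elasticsearch index.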
the focus is set near a given topic, results that are on or closer to that topic are ranked higher.

Understanding Text Using Agent Based Models

Adrian Mladenic Grobelnik, Marko Grobelnik, Dunja Mladenic
Jozef Stefan Institute, Ljubljana, Slovenia
adrian.m.grobelnik@ijs.si, marko.grobelnik@ijs.si, dunja.mladenic@ijs.si

ABSTRACT
The paper proposes a novel approach to text understanding and generation focusing on short stories. The proposed approach attempts to understand and generate stories by creating an explainable, agent-based world model of the story. The world model is defined through agents, their goals, actions, attributes and the relationships between them. We demonstrate our approach on the story of 'Little Red Riding Hood', simulating it as a sequence of 48 actions involving 7 main agents and 14 goals.

KEYWORDS
Text understanding, agent-based approach, world model, agent-based model

1 Introduction
With recent advancements in deep learning and overall increases in computing power, artificial intelligence systems are now able to make commonsense inferences from simple events, as proposed in research such as COMET [1] and MultiCOMET [2]. While the aforementioned commonsense inferences can be made with a high degree of precision, they lack an explainable and comprehensive structure capable of storing and predicting future events with such inferences. Agent-based models (ABMs), while capable of simulating complex interactions between agents, rarely focus on understanding stories in greater depth. Moreover, they cannot perform commonsense reasoning on agents' goals, actions or attributes. In our research, we draw from existing work on ABMs to create a system capable of understanding short text-based stories, with the potential to incorporate commonsense inferences in the future.

Related work such as 'Automated Storytelling via Causal, Commonsense Plot Ordering' [3] and 'Modeling Protagonist Emotions for Emotion-Aware Storytelling' [4] makes use of COMET to tackle automated story plot generation. As the stories are generated using COMET's commonsense causal inferences, they lack explainability. In our work, we focus on generating explainable stories. Other related work [5] focuses on story understanding using manually supplied commonsense rules, concept patterns and story text. Our system aims to understand and simulate a story, given the story text, the goals and the initial attributes of its agents.

The main contributions of this paper are (1) a novel approach to explainable story understanding, (2) a system generating stories given a set of agents with attributes and goals, and (3) an implementation of the proposed approach, with publicly available source code [7] allowing users to create and analyze their own stories.

The rest of this paper is organized as follows: Section 2 provides a problem description. Section 3 describes the approach used to tackle the problem. Section 4 demonstrates the functioning of our approach. The paper concludes with a discussion and directions for future work in Section 5.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2021, 4-8 October 2021, Ljubljana, Slovenia. © 2021 Copyright held by the owner/author(s).

2 Problem Description
The problem we are solving is, given the text of a short story, to convert it into a machine-understandable and actionable description representing the dynamics of the story being told. Such an actionable description should encode the implicit knowledge assumed by the text in the form of an agent-based world model.

The world model should include enough representational power to fully represent the story. This includes agents, their environment and the relationships between them. The world model should be actionable enough to simulate the dynamics of an input story with all the key elements and relevant details mentioned in the input text. As the world model can represent a story given its text, it should also be able to represent and simulate other stories within the world model's constraints.

Some of the key operations the resulting system should support:
1. representation of the story
2. simulation of the story's dynamics
3. question answering about explicit and implicit elements written or assumed within the story
4. creating alternative stories, given their context

3 Approach Description
The general aim of our approach is to provide deep text understanding of the input story. Not all the steps are automatable at this stage. In particular, the biggest challenge is to automatically translate the story text into the knowledge-based representation aligned with the world model. We look forward to eventually automating all of the steps in the approach.

Figure 1: A partial representation of the Wolf agent's goals, actions and attributes.

As a running example of the input story, we selected the popular children's story 'Little Red Riding Hood' [6]. In the first stage, we restructured the original story into 73 simplified sentences, where we identified 23 key events involving 7 main agents:
1. Mother
2. Riding Hood
3. Flower Field
4. Butterfly
5. Wolf
6. Grandma
7. Woodsman

Each agent is represented by its goals, actions and attributes (see Figure 1 for an example involving the Wolf). All goals cause actions, and all actions change at least one agent's attributes. As depicted in Figure 2, an agent's goal is defined by a goal state (a set of agents with specific attribute values) and 'pre-goals' (goals that must be completed and act as preconditions for an agent to start working towards the goal).

Figure 2: An example representation of a goal

To define actions, we use an action schema proposed as part of 'UCPOP: A Sound, Complete, Partial Order Planner for ADL' [8], where each action consists of a set of parameters, preconditions and effects. We show two example action representations in Figure 3 and Figure 4. The duration of each action corresponds to the passing of one time unit.

Figure 3: An example representation of an action

Figure 4: An example pseudocode representation of a concrete action, taken from [9]

An attribute is simply defined as any information relating to the agent, for instance the agent's location, inventory of items and awareness of other agents.

Figure 5: Hierarchy of agents for the Little Red Riding Hood story

The agents are defined through a hierarchy, ensuring consistency across agent goals, actions and attributes, and providing a clear overview of the agent types, as observed in Figure 5. Throughout the story simulation of 'Little Red Riding Hood', 3 key agents jointly had 14 goals, causing them to perform a total of 48 actions composed of 12 unique action types. We propose a simple textual description of each performed action, stating why the agent executed the action and which other agents were involved. See Figure 8 for an example.

At the highest conceptual level, we randomly select an agent and simulate all of its possible next actions. We then select the action that brings the agent closest to all its currently active goals, and execute this action. We repeat this until there are no more agents with active goals in our world model, as depicted in Figure 6.

Figure 6: High-level pseudocode of the simulation within the world model

4 Approach Demonstration
We first initialize the world model to an initial setting similar to that of 'Little Red Riding Hood', illustrated in Figure 7. For instance, agents 'forest4' and 'woodsman' are in the same location, 1 unit above agent 'forest3'. The model is initialized with the agents, their initial attributes with values, and their goals in the story. Once initialized, we can run the model and see the agents interact with each other within their environment. For an example, see Figure 9.

One could divide the story into the following four main segments:
1. Riding Hood discusses visiting Grandma with Mother (6 actions)
2. Riding Hood meets Wolf and goes to Grandma (23 actions)
3. Wolf eats Grandma and tries to impersonate her; Riding Hood arrives at GrandmaHouse and cries for help (6 actions)
4. Woodsman saves Grandma and takes Wolf away, Riding Hood gifts Grandma (13 actions)

As an example, in the third story segment the actions occur in the following order:
1. Wolf eats Grandma to satisfy hunger.
2. Wolf takes perfume from GrandmaHouse's inventory to try impersonating Grandma.
3. Wolf takes nightgown from GrandmaHouse's inventory to try impersonating Grandma.
4. Wolf takes sleeping cap from GrandmaHouse's inventory to try impersonating Grandma.
5. Riding Hood moves 1 unit up to visit Grandma.
6. Riding Hood cries for help to get help.

The system is able to automatically generate the textual description of the story simulation over time, as depicted in Figure 8.
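The world model described above (goals as target attribute states, UCPOP-style actions with preconditions and effects, and the greedy simulation loop of Figure 6) can be sketched compactly. The paper's implementation is about 3,000 lines of C++; the sketch below is our own minimal Python rendering, and all class names, the attribute layout and the goal-distance heuristic are illustrative assumptions, not the authors' code.

```python
import random

class Action:
    """UCPOP-style action: a precondition test and an effect on attributes.
    Each executed action corresponds to the passing of one time unit."""
    def __init__(self, name, precondition, effect):
        self.name = name
        self.precondition = precondition  # attrs -> bool
        self.effect = effect              # attrs -> new attrs

class Agent:
    def __init__(self, name, attributes, goals, actions):
        self.name = name
        self.attributes = attributes      # e.g. {"location": 0}
        self.goals = goals                # goal states: list of attribute dicts
        self.actions = actions

    def active_goals(self):
        # a goal is active while any of its attribute values is unsatisfied
        return [g for g in self.goals
                if any(self.attributes.get(k) != v for k, v in g.items())]

def distance_to_goals(attrs, goals):
    # simple heuristic: count of goal attributes still unsatisfied
    return sum(attrs.get(k) != v for g in goals for k, v in g.items())

def simulate(agents, max_steps=100):
    """Greedy loop from Figure 6: pick a random agent with active goals,
    simulate its applicable actions, execute the one closest to its goals."""
    log = []
    for _ in range(max_steps):
        pending = [a for a in agents if a.active_goals()]
        if not pending:
            break  # no more agents with active goals
        agent = random.choice(pending)
        applicable = [act for act in agent.actions
                      if act.precondition(agent.attributes)]
        if not applicable:
            break
        best = min(applicable,
                   key=lambda act: distance_to_goals(
                       act.effect(dict(agent.attributes)),
                       agent.active_goals()))
        agent.attributes = best.effect(dict(agent.attributes))
        log.append(f"{agent.name} performs {best.name}")
    return log

# Hypothetical demo: Riding Hood needs to reach Grandma at location 3.
move_up = Action("move 1 unit up",
                 precondition=lambda a: a["location"] < 3,
                 effect=lambda a: {**a, "location": a["location"] + 1})
hood = Agent("Riding Hood", {"location": 0}, [{"location": 3}], [move_up])
log = simulate([hood])  # three 'move 1 unit up' steps
```

In this rendering the textual story description is a by-product of the simulation log, mirroring how the system explains why each action was taken.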
Figure 7: Initial state of the agents' locations within the world model; each X, Y slot includes a list of agents at that location

One of the more conceptually complex parts of the story was Riding Hood asking Mother for permission to visit Grandma. This required the creation of a new attribute for human agents to describe their opinions of other agents' goals. The most complex action implemented was "cry for help". This involved the creation of a new goal, "respond to cry for help", for all human agents within a certain radius of the agent crying for help, provided they were conscious and able to respond.

The story ends when Riding Hood gives Grandma the flowers she picked and the basket Mother gave her, and Woodsman carries the Wolf "deep into the forest where he wouldn't bother people any longer" [6]. The system was implemented in about 3,000 lines of C++ code, available on GitHub [7].

Figure 8: A part of an example story, generated by the system

Figure 9: Screenshot of two subsequent agent location configurations on the map: (1) after Riding Hood gives Grandma flowers and (2) after Woodsman carries away Wolf

5 Discussion
In our research we expanded on and adapted existing work on agent-based models, providing an alternate approach to text understanding and generation involving short stories. As a proof of concept, we applied our approach to the children's story 'Little Red Riding Hood', describing it through a series of 48 highly explainable actions involving 7 main agents.

Adapting the system to another story using our source code is relatively easy, provided the action and attribute types of the agents in the story are similar to those in 'Little Red Riding Hood'. If the story requires the implementation of new actions or attributes, this can be done by extending the class structure in C++, using already implemented actions and attributes as examples.

In our future work we intend to integrate commonsense inferences, such as those from MultiCOMET, into our model to further the system's degree of textual understanding. Our system could also benefit from the addition of dynamic and simultaneous goals that change based on the agent's environment. Another possible future line of work is to use our approach in other domains to describe more complex phenomena, such as real-world events or geopolitics. Lastly, a user evaluation of our system's performance on a variety of stories and scenarios could provide further insight into the efficacy of our approach.

ACKNOWLEDGMENTS
The research described in this paper was supported by the Slovenian research agency under the project J2-1736 Causalify and co-financed by the Republic of Slovenia and the European Union under the European Regional Development Fund. The operation is carried out under the Operational Programme for the Implementation of the EU Cohesion Policy 2014–2020.

REFERENCES
[1] Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In ACL, 4762–4779.
[2] Adrian Mladenic Grobelnik, Marko Grobelnik, Dunja Mladenic. 2020. MultiCOMET – Multilingual Commonsense Description. In Proceedings of the 23rd international multiconference Information Society, pages 37–40.
[3] Prithviraj Ammanabrolu, Wesley Cheung, William Broniec, and Mark O. Riedl. 2021. Automated storytelling via causal, commonsense plot ordering. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI).
[4] Faeze Brahman and Snigdha Chaturvedi. 2020. Modeling protagonist emotions for emotion-aware storytelling. In Proceedings of EMNLP, pages 5277–5294.
[5] Patrick Henry Winston. 2014. The Genesis story understanding and story telling system: A 21st century step toward artificial intelligence. Technical report, Center for Brains, Minds and Machines (CBMM).
[6] Little Red Riding Hood by Leanne Guenther. https://www.dltk-teach.com/RHYMES/littlered/story.htm. Accessed 16.09.2021.
[7] Understanding Text Using Agent Based Models GitHub. https://github.com/AMGrobelnik/Understanding-Text-Using-Agent-Based-Models. Accessed 16.09.2021.
[8] Penberthy, J., & Weld, D. 1992. UCPOP: a sound, complete, partial-order planner for ADL. In Proceedings of KR'92, pp. 103–114, Los Altos, CA. Kaufmann.
[9] An Introduction to AI Story Generation. https://thegradient.pub/an-introduction-to-ai-story-generation/. Accessed 16.09.2021.

News Stream Clustering using Multilingual Language Models

Erik Novak, erik.novak@ijs.si
Jožef Stefan Institute; Jožef Stefan International Postgraduate School
Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
In this paper, we propose a news stream clustering algorithm which directly outputs cross-lingual event clusters. It uses multilingual language models to generate cross-lingual article representations which enable a direct comparison of articles in different languages. The algorithm is evaluated using a cross-lingual news article data set and compared against a strong baseline algorithm. The experiment results show the algorithm has great promise, but requires additional modifications for improving its performance.

KEYWORDS
online news, event detection, news events, multilingual language model

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2021, 4-8 October 2021, Ljubljana, Slovenia. © 2021 Copyright held by the owner/author(s).

1 INTRODUCTION
Online news is producing hundreds of thousands of articles per day, reporting about any significant event that happened in the world. The articles cover various domains (such as politics, sports, and culture) and are written in different languages. In order to automatically identify these events, news stream clustering algorithms are used. These usually have the following steps: (1) they group articles written in the same language into monolingual clusters, and (2) they form cross-lingual clusters by linking monolingual clusters that report on the same event. Both steps usually employ monolingual text features such as TF-IDF vectors; these do not allow cross-lingual comparison without using advanced statistical or machine learning methods.

In this paper, we propose a news stream clustering algorithm that directly generates cross-lingual event clusters. The algorithm uses multilingual language models for generating cross-lingual content embeddings and extracting named entities found in the articles. These are used to measure if an article should be assigned to an event. The algorithm is evaluated using a cross-lingual data set consisting of articles in English, Spanish, and German, and is compared against a strong baseline. While the experiment results look promising, there is still room for improving the algorithm's performance.

The paper is structured as follows: Section 2 contains an overview of the related work on cross-lingual news stream clustering and multilingual language models. Next, we present the proposed clustering algorithm in Section 3, and describe the experiment setting in Section 4. The experiment results are found in Section 5. Finally, we conclude the paper and provide ideas for future work in Section 6.

2 RELATED WORK
News Stream Clustering. The objective of news stream clustering is to group news articles that report about the same event that happened in the world. Grouping can be a difficult task, especially if the articles are written in multiple languages. To this end, various approaches were developed for cross-lingual event clustering. A statistical approach called Generalization of Canonical Correlation Analysis is used to compare news articles in different languages [9]. Information extraction techniques, such as named entity recognition and part-of-speech tagging, are also used for event detection [6]. With the increasing popularity of neural networks, more advanced approaches are used to link event clusters. The work in [3] uses word embeddings to compare and link monolingual event clusters into cross-lingual ones. Transformer-based language models are used for event sentence coreference identification [4], a task that links parts of articles to multiple events. However, the algorithm is performed only on a monolingual data set. To the best of our knowledge, our work is the first that uses multilingual language models for grouping articles directly into cross-lingual events.

Multilingual Language Models. Since the introduction of the transformers [11], language model development has gained traction in the research community. One of the most well known language models, BERT [2], has improved the performance of various NLP tasks. By training it using multilingual documents, the multilingual BERT [5] enabled solving tasks that require cross-lingual text representations. While these models improved the performance of various NLP tasks, they do not provide good document embeddings for tasks like clustering. This changed with the introduction of Sentence-BERT [8], which generates monolingual sentence embeddings appropriate for measuring sentence similarity. A year later, an approach for making monolingual document representations cross-lingual [7] opened a way for using sentence embeddings for cross-lingual clustering. In this work, we employ the multilingual Sentence-BERT model to generate cross-lingual embeddings used to group articles into events.

3 THE CLUSTERING ALGORITHM
We propose a news stream clustering algorithm that directly outputs cross-lingual events. It uses cross-lingual embeddings, named entities, and temporal features to measure if an article should be assigned to an event cluster. If none of the events are appropriate, a new cluster is created and the article is assigned to it. Figure 1 shows the algorithm's workflow diagram.

Figure 1: The algorithm's workflow diagram. The algorithm maintains a set of event clusters which are used when assessing if a new article (a6) should be assigned to an existing event. If the conditions are met, the article is assigned to the most appropriate cluster (c2). Otherwise, an empty event cluster is created (c3), the article is assigned to it, and the newly created event is added to the cluster set.

In this section we describe how the algorithm represents the articles and events, and how it decides when to assign an article to an event cluster.

3.1 Article Representation
In this section we describe the different article representations used in the algorithm. Each article is assumed to have a title, body, and time attributes, which are used to (1) generate the content embedding and (2) extract its named entities.

Content Embedding. Each article is assigned an embedding that represents the article's content. Using multilingual Sentence-BERT¹, a language model designed for generating vectors used in cross-lingual clustering tasks, we get the content embedding by concatenating the article's title and body and inputting it into the language model. The output is a single 768-dimensional vector that captures the semantic meaning of the article.

¹ The model is available at https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1.

Article Named Entities. For each article we extract the named entities that are mentioned in the article's body. To extract them, we developed a multilingual NER model using XLM-RoBERTa [1] and fine-tuned it using the CoNLL-2003 [10] data set.² Afterwards, we filter out the duplicates and store the remaining unique entities for later use.

3.2 Event Representations
An event is represented as an aggregate of its articles. This includes (1) the event centroid, (2) the named entities, and (3) the time statistics. In this section we describe how the aggregates are calculated and updated.

Event Centroid. The centroid represents the average content embedding of the articles assigned to the event. It is used to assess if an incoming article's content is similar enough to the event. Since the algorithm is intended to work on news streams, we iteratively update the centroid with the newly assigned article's content embedding:

$$\vec{c}_e^{\,(0)} = \vec{0}, \qquad \vec{c}_e^{\,(k)} = \frac{(k-1)\cdot\vec{c}_e^{\,(k-1)} + \vec{c}_{a_k}}{k},$$

where $\vec{c}_e^{\,(k)}$ is the centroid calculated using the first $k$ articles assigned to the event $e$, and $\vec{c}_{a_k}$ is the content embedding of the $k$-th article $a_k$.

Event Named Entities. Each event stores all of the unique named entities that are found in any of its articles. The named entities are used to identify if the incoming article mentions the event's entities. The event's named entities set is updated when a new article is assigned to the event:

$$r_e^{(0)} = \emptyset, \qquad r_e^{(k)} = r_e^{(k-1)} \cup r_{a_k},$$

where $r_e^{(k)}$ is the set of named entities generated using the first $k$ articles assigned to the event $e$, and $r_{a_k}$ is the set of named entities of the $k$-th article $a_k$.

Time Statistics. The time statistics provide insights into the articles' temporal distribution. These are calculated using the articles' time attribute. In this experiment we measured the following statistics: the minimum, average, and maximum article timestamps. These are used to validate if an article was published at a time when it could still report about an existing event.

3.3 Assignment Condition
The most crucial part of the proposed algorithm is how to measure to which event an article should be assigned, if any. We propose a condition that combines (1) the cosine similarity between the article's content embedding and the event's centroid, (2) the overlap between the article's and event's named entities, and (3) the time difference between the article's time and one of the event's time statistics.

Let $E = \{e_1, e_2, \ldots, e_j\}$ be the set of existing event clusters, where each event is represented with its centroid, named entities, and one of its time statistics, $e_i = (\vec{c}_{e_i}, r_{e_i}, t_{e_i})$. Let the article be represented by its content embedding, named entities, and time attribute, $a = (\vec{c}_a, r_a, t_a)$. We then check if the following conditions are met for each event:

$$\delta_c = \frac{\langle \vec{c}_{e_i}, \vec{c}_a \rangle}{\|\vec{c}_{e_i}\|_2\,\|\vec{c}_a\|_2} \ge \alpha, \qquad \delta_r = |r_{e_i} \cap r_a| \ge \beta, \qquad \delta_t = |t_{e_i} - t_a| \le \tau, \qquad (1)$$

where $\alpha$, $\beta$ and $\tau$ are the thresholds corresponding to how similar the article's content must be to the event, the required amount of overlapping entities, and the time window in which an article has to be in order to be assigned to the event, respectively. Thus, $\delta_c$, $\delta_r$ and $\delta_t$ correspond to the content similarity, entity overlap, and time window conditions, respectively.

If an event meets the conditions described in Equation 1, the article is assigned to it. If multiple events are appropriate, the article is assigned to the event that has the greatest $\delta_c$ value. If none are appropriate, a new empty event cluster is created, the article is assigned to it, and the event representations are updated.
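The event update rules and the assignment condition above can be sketched as follows. This is our own illustrative rendering, not the paper's code: content embeddings are assumed to be pre-computed (e.g., by the multilingual Sentence-BERT model), and the class and variable names are ours; the default thresholds mirror the values used later in the experiments.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Event:
    """Event aggregate: centroid c_e, named entities r_e, time statistic t_e."""
    centroid: list = field(default_factory=list)  # running mean of embeddings
    entities: set = field(default_factory=set)    # union of article entities
    min_time: float = math.inf                    # minimum time statistic
    k: int = 0                                    # number of assigned articles

    def add(self, emb, ents, time):
        # incremental centroid update: c^(k) = ((k-1)*c^(k-1) + c_a) / k
        self.k += 1
        if not self.centroid:
            self.centroid = [0.0] * len(emb)
        self.centroid = [((self.k - 1) * c + x) / self.k
                         for c, x in zip(self.centroid, emb)]
        self.entities |= ents                     # r^(k) = r^(k-1) U r_a
        self.min_time = min(self.min_time, time)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def assign(events, emb, ents, time, alpha=0.3, beta=1, tau=3.0):
    """Assign the article to the best event meeting all three conditions
    (delta_c, delta_r, delta_t); otherwise open a new event cluster."""
    candidates = [e for e in events
                  if e.k > 0
                  and cosine(e.centroid, emb) >= alpha   # delta_c
                  and len(e.entities & ents) >= beta     # delta_r
                  and abs(e.min_time - time) <= tau]     # delta_t
    if candidates:
        best = max(candidates, key=lambda e: cosine(e.centroid, emb))
    else:
        best = Event()
        events.append(best)
    best.add(emb, ents, time)
    return best

# Hypothetical demo: two similar articles merge, a dissimilar one does not.
events = []
e1 = assign(events, [1.0, 0.0], {"Ljubljana"}, 0.0)
e2 = assign(events, [0.9, 0.1], {"Ljubljana"}, 1.0)
e3 = assign(events, [0.0, 1.0], {"Paris"}, 1.0)
```

Note the streaming design choice: the centroid is a running mean, so an article can be processed and discarded without revisiting earlier articles.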
² The code of the model is available at https://github.com/ErikNovak/named-entity-recognition.

To compare the impact of the conditions, we implement multiple versions of the algorithm that use a different combination of the $\delta_c$, $\delta_r$ and $\delta_t$ conditions. Table 1 shows all of the algorithm versions compared in the experiment.

Table 1: The list of algorithm versions. Each algorithm uses a different combination of conditions.

Algorithm            Condition combination
CONTENT              δc
CONTENT + NE         δc and δr
CONTENT + TS         δc and δt
CONTENT + NE + TS    δc and δr and δt

4 EXPERIMENTS
We now present the experiment setting. We introduce the data set and how it is prepared for the experiment. Next, we present the evaluation metrics. Finally, the baseline algorithm is described.

4.1 Data Set
To compare the algorithm performances, we use the news article data sets acquired via Event Registry and prepared by [3] for the purposes of news stream clustering. These data sets are in three different languages (English, German, and Spanish), and consist of articles containing the following attributes:

• Title. The title of the article.
• Text. The body of the article.
• Lang. The language of the article.
• Date. The datetime when the article was published.
• Event ID. The ID of the event the article is associated with. It is used to measure the performance of the algorithms.

For the experiment, we merge the three data sets together to create a single cross-lingual news article data set. We extract their content embeddings and named entities, and sort them in chronological order, i.e. from oldest to newest. Table 2 shows the data set statistics.

Table 2: Data set statistics. For each language data set we denote the number of documents in the data set (# docs), the average length of the documents (avg. length), the number of event clusters (# clusters) and the average number of documents in the clusters (avg. size).

Language    # docs    avg. length    # clusters    avg. size
English     8,726     537            238           37
German      2,101     450            122           17
Spanish     2,177     401            149           15
Together    13,004    500            427           30

4.2 Evaluation Metrics
For the evaluation we use the same metrics as [3]. Let tp be the number of correctly clustered-together article pairs, let fp be the number of incorrectly clustered-together article pairs, and let fn be the number of incorrectly not-clustered-together article pairs. Then we report precision as $P = \frac{tp}{tp+fp}$, recall as $R = \frac{tp}{tp+fn}$, and the balanced F-score as $F_1 = 2 \cdot \frac{P \cdot R}{P+R}$. While precision describes how homogeneous the clusters are, recall tells us the amount of articles that should be together but are actually found in different clusters.

4.3 Baseline Algorithm
The baseline algorithm used in the experiment is presented in [3]. It performs cross-lingual news stream clustering by first generating monolingual event clusters using TF-IDF subvectors of words, word lemmas and named entities of the articles. Afterwards, it merges monolingual clusters into cross-lingual clusters, using cross-lingual word embeddings to represent the articles. The algorithm compares two approaches when performing cross-lingual clustering:

• Global parameter. Using a global parameter for measuring distances between all language articles for cross-lingual clustering decisions.
• Pivot parameter. Using a pivot parameter, where the distances between every other language are only compared to English, and cross-lingual clustering decisions are made only based on this distance.

Since the baseline algorithm was already evaluated using the cross-lingual data set we are using in the experiment, we only report their performances from the paper.

5 RESULTS
In this section we present the experiment results. For all experiments we fix the values $\beta = 1$ and $\tau = 3$ days, and evaluate the algorithms using different values of $\alpha$. In addition, all experiments use the event's minimum time statistic when validating the time condition $\delta_t$.

Baseline Comparison. Table 3 shows the experiment results of the best performing algorithm on the evaluation data set. We report the best performing CONTENT + NE + TS algorithm, which uses the content similarity threshold $\alpha = 0.3$.

Table 3: The algorithm performances. The best reported algorithm uses all three assignment conditions.

Algorithm            F1      P       R
Baseline (global)    72.7    89.8    61.0
Baseline (pivot)     84.0    83.0    85.0
CONTENT + NE + TS    72.2    79.7    66.0

While the proposed algorithm does not perform better than any of the baselines with respect to the F1 score, our algorithm still shows promising results. Its performance is comparable to the baseline using the global parameter, and it also outperforms the baseline (global) recall by 5%, showing it is better at grouping articles.

Condition Analysis. We have analyzed the impact the conditions have on the algorithm's performance. For each algorithm version we run the experiments using different values of $\alpha \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$, and measure the balanced F-score, precision, and recall, as well as the number of clusters it generated. Table 4 shows the condition analysis results. By analysing the results we come to two conclusions:

Increasing α increases precision, decreases recall, and generates a larger number of clusters. When $\alpha$ is bigger, the content condition $\delta_c$ requires the articles to be more similar to the event. This condition is met when the article's content embedding is close to the event's centroid. Since this has to hold for all articles in the event, the articles that have high similarity are clustered together, increasing the algorithm's precision.
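The pairwise metrics of Section 4.2 can be computed directly from predicted and gold cluster labels. A small sketch; the list-of-labels encoding is our own illustrative choice:

```python
from itertools import combinations

def pairwise_scores(predicted, gold):
    """Pairwise P/R/F1: count article pairs that are correctly
    clustered together (tp), incorrectly clustered together (fp),
    and incorrectly not clustered together (fn)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical demo: one split gold cluster and one spurious merge.
p, r, f1 = pairwise_scores(["a", "a", "b", "b"], ["a", "a", "a", "b"])
# tp = 1, fp = 1, fn = 2, so P = 0.5, R ≈ 0.33, F1 ≈ 0.4
```

Because every pair of articles is compared, these scores penalize both over-merging (fp pairs) and over-splitting (fn pairs), which is why raising α trades recall for precision in Table 4.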
Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia · Erik Novak

Table 4: The condition analysis results. The bold values represent the best performances on the data set.

Algorithm            α     # clusters    F1      P      R
CONTENT             0.3        46       29.6   19.7   59.8
                    0.4       234       51.6   46.2   58.4
                    0.5       849       57.7   67.7   50.3
                    0.6      1762       45.3   73.1   32.8
                    0.7      3185       26.0   81.9   15.5
CONTENT + NE        0.3       279       43.7   33.3   63.8
                    0.4       648       52.9   55.8   50.3
                    0.5      1168       56.5   67.4   48.6
                    0.6      1939       45.1   73.6   32.5
                    0.7      3254       25.9   82.3   15.4
CONTENT + TS        0.3       344       58.8   63.2   55.0
                    0.4       806       64.1   76.5   55.2
                    0.5      1346       58.8   83.4   45.4
                    0.6      2068       47.1   81.7   33.1
                    0.7      3356       25.2   84.8   14.7
CONTENT + NE + TS   0.3       925       72.2   79.7   66.0
                    0.4      1221       72.2   80.5   65.5
                    0.5      1554       54.0   81.9   40.2
                    0.6      2174       46.7   80.7   32.9
                    0.7      3403       25.0   84.8   14.7

However, if α is too large, the condition becomes too strong: similar articles can be split into multiple clusters, which decreases recall and increases the number of clusters the algorithm generates.

Algorithms with more conditions can achieve better performance; the algorithm's performance increases with each added condition. While the worst performance is achieved when only the content condition δ_c is used (CONTENT algorithm), the best performance is reached when all three conditions are used (CONTENT + NE + TS algorithm). The most significant contribution is provided by the time condition δ_t, which drastically improves the F1 score.

6 CONCLUSION
We propose a news stream clustering algorithm that directly generates cross-lingual event clusters. It uses multilingual language models to generate cross-lingual article representations, which are then compared to form cross-lingual event clusters. The algorithm was evaluated on a news article data set and compared to a strong baseline. The experiment results look promising, but there is still room for improvement.

In the future, we intend to modify the assignment condition and learn the condition parameters instead of manually setting them. Modifying the language models to accept longer inputs could better capture the articles' semantic meaning. In addition, events from different domains are reported at different rates. Learning these rates and including them in the algorithm could improve its performance.

ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the Humane AI Net European Union's Horizon 2020 project under grant agreement No 952026.

SloBERTa: Slovene monolingual large pretrained masked language model
Matej Ulčar and Marko Robnik-Šikonja
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
{matej.ulcar, marko.robnik}@fri.uni-lj.si

ABSTRACT
Large pretrained language models, based on the transformer architecture, show excellent results in solving many natural language processing tasks. The research is mostly focused on the English language; however, many monolingual models for other languages have recently been trained. We trained the first such monolingual model for Slovene, based on the RoBERTa model. We evaluated the newly trained SloBERTa model on several classification tasks. The results show an improvement over existing multilingual and monolingual models and present the current state-of-the-art for Slovene.

KEYWORDS
natural language processing, BERT, RoBERTa, transformers, language model

1 INTRODUCTION
Solving natural language processing (NLP) tasks with neural networks requires presentation of text in a numerical vector format, called word embeddings. Embeddings assign each word its own vector in a vector space so that similar words have similar vectors, and certain relationships between word meanings are expressed in the vector space as distances and directions. Typical static word embedding models are word2vec [19], GloVe [24], and fastText [1]. ELMo [25] embeddings are an example of dynamic, contextual word embeddings. Unlike static word embeddings, where a word gets a fixed vector, contextual embeddings ascribe a different word vector to each occurrence of a word, based on its context.

State-of-the-art text representations are currently based on the transformer architecture [35]. GPT-2 [27] and BERT [5] models are among the first and most influential transformer models. Due to their ability to be successfully adapted to a wide range of tasks, such models are, somewhat impetuously, called foundation models [2, 17]. While GPT-2 uses the transformer's decoder stack to model the next word based on previous words, BERT uses the encoder stack to encode word representations of a masked word, based on the surrounding context before and after the word. Previous embedding models (e.g., ELMo and fastText) were used to extract word representations, which were then used to train a model on a specific task. In contrast, transformer models are typically fine-tuned for each individual downstream task, without extracting word vectors.

Successful transformer models typically contain more than 100 million parameters. To train, they require considerable computational resources and large training corpora. Luckily, many of these models are publicly released. Their fine-tuning is much less computationally demanding and is accessible to users with modest computational resources. In this work, we present the training of a Slovene transformer-based masked language model, named SloBERTa, based on a variant of the BERT architecture. SloBERTa is the first such publicly released model, trained exclusively on Slovene language corpora.

2 RELATED WORK
Following the success of the BERT model [5], many transformer-based language models have been released, e.g., RoBERTa [14], GPT-3 [3], and T5 [28]. The complexity of these models has been constantly increasing. The size of the newer generations of models has made training computationally prohibitive for most research organizations, leaving it feasible only for large corporations. Training also requires huge amounts of training data, which do not exist for most languages. Thus, most of these large models have been trained only for a few very well-resourced languages, chiefly English, or in a massively multilingual fashion.

The BERT model was pre-trained on two tasks simultaneously: masked token prediction and next sentence prediction. For the masked token prediction, 15% of the tokens in the training corpus were randomly masked before training. The training dataset was augmented by duplicating the training corpus a few times, with each copy having different randomly selected tokens masked. The next sentence prediction task attempts to predict whether two given sentences appear in a natural order.

The RoBERTa [14] model uses the same architecture as BERT, but drops the next sentence prediction task, as it was shown that it does not contribute to the model performance. The masked token prediction task was changed so that the tokens are randomly masked on the fly, i.e., a different subset of tokens is masked in each training epoch.

Both BERT and RoBERTa were released in different sizes. Base models use 12 hidden transformer layers of size 768. Large models use 24 hidden transformer layers of size 1024. Smaller-sized BERT models exist, obtained using knowledge distillation from pre-trained larger models [11].

A few massively multilingual models were trained on 100 or more languages simultaneously. Notable released variants are multilingual BERT (mBERT) [5] and XLM-RoBERTa (XLM-R) [4]. While massively multilingual BERT models perform well for the trained languages, they lag behind the monolingual models [36, 33]. Examples of recently released monolingual BERT models for various languages are Finnish [36], Swedish [16], Estonian [30], Latvian [37], etc.

The Slovene language is supported by the aforementioned massively multilingual models and by the trilingual CroSloEngual BERT model [33], which has been trained on three languages: Croatian, Slovene, and English. No monolingual transformer model for Slovene has been previously released.
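The contrast between BERT-style static masking and RoBERTa-style dynamic masking described above can be sketched in a few lines. This is a minimal illustration on whitespace tokens, not the actual training pipeline; the `mask_tokens` helper, the toy corpus, and the seeding scheme are assumptions, while the 15% masking rate follows the text:

```python
import random

MASK = "[MASK]"
RATE = 0.15  # fraction of tokens masked, as in BERT and RoBERTa

def mask_tokens(tokens, rng):
    """Randomly replace ~15% of tokens with [MASK]; targets map positions to originals."""
    n = max(1, round(len(tokens) * RATE))
    positions = rng.sample(range(len(tokens)), n)
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return masked, targets

corpus = [s.split() for s in [
    "jezikovni model se uči iz besedila",
    "slovenski korpusi so veliki",
]]

# Static masking (BERT): the corpus is masked once when the dataset is
# built, so every epoch sees the same masked positions (unless the corpus
# is duplicated with different maskings, as BERT's authors did).
rng = random.Random(0)
static = [mask_tokens(s, rng) for s in corpus]

# Dynamic masking (RoBERTa): a fresh subset of tokens is masked in each
# epoch, so the model sees different prediction targets over time.
for epoch in range(3):
    rng = random.Random(epoch)
    dynamic = [mask_tokens(s, rng) for s in corpus]
```

The design point is only where the random masking happens: once at dataset-construction time, or anew on every pass over the data.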
3 SLOBERTA
The presented SloBERTa model is closely related to the French CamemBERT model [18], which uses the same architecture and training approach as the RoBERTa base model [14], but uses a different tokenization model. In this section, we describe the training datasets, the architecture, and the training procedure of SloBERTa.

3.1 Datasets
Training a successful transformer language model requires a large dataset. We combined five large Slovene corpora in our training dataset. Gigafida 2.0 [13] is a general language corpus, composed of fiction and non-fiction books, newspapers, school textbooks, texts from the internet, etc. The Janes corpus [9] is composed of several subcorpora. Each subcorpus contains texts from a certain social medium or a group of similar media, including Twitter, blog posts, forum conversations, comments under articles on news sites, etc. We used all Janes subcorpora except Janes-tweet, since the contents of that subcorpus are encoded and need to be individually downloaded from Twitter, which is a lengthy process, as Twitter limits the access speed. KAS (Corpus of Academic Slovene) [8] consists of PhD, MSc, MA, BSc, and BA theses written in Slovene between 2000 and 2018. SiParl [23] contains the minutes of the Slovene national assembly between 1990 and 2018. SlWaC [15] is a web corpus collected from the .si top-level web domain. All corpora used are listed in Table 1 along with their sizes.

Table 1: Corpora used in training of SloBERTa with their sizes in billions of tokens and words. Janes* does not include the Janes-tweet subcorpus.

Corpus                      Genre               Tokens   Words
Gigafida 2.0                general language     1.33     1.11
Janes*                      social media         0.10     0.08
KAS                         academic             1.70     1.33
siParl 2.0                  parliamentary        0.24     0.20
slWaC 2.1                   web crawl            0.90     0.75
Total                                            4.27     3.47
Total after deduplication                        4.20     3.41

3.2 Data preprocessing
We deduplicated the corpora using the Onion tool [26]. We split the deduplicated corpora into three sets: training (99%), validation (0.5%), and test (0.5%). Independently of the three splits, we prepared a smaller dataset, one fifteenth of the size of the whole dataset, by randomly sampling the sentences. We used this smaller dataset to train a sentencepiece model (https://github.com/google/sentencepiece), which is used to tokenize and encode the text into subword byte-pair encodings (BPE). The sentencepiece model trained for SloBERTa has a vocabulary containing 32,000 subword tokens.

3.3 Architecture and training
SloBERTa has 12 transformer layers, which is equivalent in size to the BERT-base and RoBERTa-base models. The size of each transformer layer is 768. We trained the model for 200,000 steps (about 98 epochs) on the Slovene corpora described in Section 3.1. The model supports a maximum input sequence length of 512 subword tokens.

SloBERTa was trained as a masked language model, using the fairseq toolkit [22]. 15% of the input tokens were randomly masked, and the task was to predict the masked tokens. We used whole-word masking, meaning that if a word was split into several subtokens and one of them was masked, all the other subtokens pertaining to that word were masked as well. Tokens were masked dynamically, i.e., in each epoch a different subset of tokens was randomly selected to be masked.

4 EVALUATION
We evaluated SloBERTa on five tasks: named-entity recognition (NER), part-of-speech tagging (POS), dependency parsing (DP), sentiment analysis (SA), and word analogy (WA). We used the labeled ssj500k corpus [12, 6] for fine-tuning SloBERTa on each of the NER, POS, and DP tasks. For NER, we limited the scope to three types of named entities (person, location, and organization). We report the results as a macro-average F1 score of these three classes. For POS tagging, we used UPOS tags; the results are reported as a micro-average F1 score. For DP, we report the results as a labeled attachment score (LAS). The SA classifier was fine-tuned on a dataset composed of Slovenian tweets [20, 21], labeled as either "positive", "negative", or "neutral". We report the results as a macro-average F1 score.

The traditional WA task measures the distance between word vectors in a given analogy (e.g., man : king ≈ woman : queen). For contextual embeddings such as BERT, the task has to be modified to make sense. First, word embeddings from transformers are generally not used on their own; rather, the model is fine-tuned. Second, four words from an analogy do not provide enough context for use with transformers. In our modification, we input the four words of an analogy into the boilerplate sentence "If the word [word1] corresponds to the word [word2], then the word [word3] corresponds to the word [word4]." We then masked [word2] and attempted to predict it using masked token prediction. We used the Slovene part of the multilingual culture-independent word analogy dataset [32]. We report the results as an average precision@5 (the proportion of the correct [word2] analogy words among the 5 most probable predictions).

We compared the performance of SloBERTa with three other transformer models supporting Slovene: CroSloEngual BERT (CSE-BERT) [33], multilingual BERT (mBERT) [5], and XLM-RoBERTa (XLM-R) [4]. Where sensible, we also included the results achieved by training a classifier model using Slovene ELMo [31] and fastText embeddings.

We fine-tuned the transformer models on each task by adding a classification head on top of the model. The exception is the DP task, where we used the modified dep2label-bert tool [29, 10]. For ELMo and fastText, we extracted embeddings from the training datasets and used them to train token-level and sentence-level classifiers for each task, except for DP. The classifiers are composed of neural networks with a few LSTM layers. For the DP task, we used the modified SuPar tool, based on deep biaffine attention [7]. The details of the evaluation process are presented in [34].
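The precision@5 computation used for the word-analogy evaluation can be sketched as follows. The template string follows the boilerplate sentence described above (with a generic `<mask>` placeholder), while `predict_top5` merely stands in for the masked-token prediction of a language model and is purely hypothetical:

```python
def template(word1, word3, word4):
    """Boilerplate analogy sentence with the [word2] slot masked."""
    return (f"If the word {word1} corresponds to the word <mask>, "
            f"then the word {word3} corresponds to the word {word4}.")

def precision_at_5(examples, predict_top5):
    """Fraction of analogies whose correct [word2] is among the top 5 predictions.

    examples: list of (word1, word2, word3, word4) analogy tuples.
    predict_top5: callable returning the most probable fillers for the
    masked slot (a stand-in for the masked language model).
    """
    hits = sum(
        1 for w1, w2, w3, w4 in examples
        if w2 in predict_top5(template(w1, w3, w4))[:5]
    )
    return hits / len(examples)
```

With a real model, `predict_top5` would rank the vocabulary by the masked-token probabilities; here any callable with that interface works.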
The results are shown in Table 2. The results of ELMo and fastText, while comparable with each other, are not fully comparable with the results of the transformer models, as the classifier training approach is different.

Table 2: Results of Slovene transformer models.

Model       NER     POS     DP      SA      WA
fastText    0.478   0.527   /       0.435   /
ELMo        0.849   0.966   0.914   0.510   /
mBERT       0.885   0.984   0.681   0.576   0.061
XLM-R       0.912   0.988   0.793   0.604   0.146
CSE-BERT    0.928   0.990   0.854   0.610   0.195
SloBERTa    0.933   0.991   0.844   0.623   0.405

On the NER, POS, SA, and WA tasks, SloBERTa outperforms all other models and embeddings. For POS tagging, the differences between the models are small, except for fastText, which performs much worse. ELMo, surprisingly, outperforms all transformer models on the DP task; however, it performs worse on the other tasks. SloBERTa performs worse than CSE-BERT on the DP task, but beats the other multilingual models.

The success of ELMo on the DP task can be partially explained by the different tools used for training the classifiers. Further work is needed to fully evaluate the difference and the success of ELMo embeddings on this task.

The performance on the SA task is limited by the low inter-annotator agreement [20]. The reported average of F1 scores for the positive and negative classes is 0.542 for inter-annotator agreement and 0.726 for self-agreement. Using the same measure (average of F1 for the positive and the negative class), SloBERTa scores 0.667 and mBERT scores 0.593.

On the WA task, most models perform poorly. This is expected, because very little context was provided on the input, and the transformer models need context to perform well. SloBERTa significantly outperforms the other models, not only because it was trained only on Slovene data, but largely because its tokenizer is adapted to the Slovene language alone and does not need to cover other languages.

5 CONCLUSIONS
We present SloBERTa, the first monolingual transformer-based masked language model trained on Slovene texts. We show that the SloBERTa large pretrained masked language model outperforms existing comparable multilingual models supporting Slovene on four tasks: NER, POS tagging, sentiment analysis, and word analogy. The performance on the DP task is competitive, but lags behind some of the existing models.

In further work, we intend to compare the improvement of BERT-like monolingual models over multilingual models for other languages.

The pre-trained SloBERTa model is publicly available via the CLARIN.SI (http://hdl.handle.net/11356/1397) and Huggingface (https://huggingface.co/EMBEDDIA/sloberta) repositories. We also make the code used for preprocessing the corpora and training SloBERTa publicly available (https://github.com/clarinsi/Slovene-BERT-Tool).

ACKNOWLEDGMENTS
The work was partially supported by the Slovenian Research Agency (ARRS) core research programme P6-0411 and project J6-2581, as well as by the Ministry of Culture of the Republic of Slovenia through the project Development of Slovene in Digital Environment (RSDO). This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).

REFERENCES
[1] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
[2] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. ArXiv preprint 2108.07258. (2021).
[3] Tom Brown et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. Volume 33, 1877–1901.
[4] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. doi: 10.18653/v1/N19-1423.
[6] Kaja Dobrovoljc, Tomaž Erjavec, and Simon Krek. 2017. The universal dependencies treebank for Slovenian. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2017).
[7] Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the 5th International Conference on Learning Representations, ICLR.
[8] Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS corpus of Slovenian academic writing. Language Resources and Evaluation, 55, 2, 551–583.
[9] Darja Fišer, Tomaž Erjavec, and Nikola Ljubešić. 2016. Janes v0.4: korpus slovenskih spletnih uporabniških vsebin. Slovenščina 2.0: empirical, applied and interdisciplinary research, 4, 2, 67–99.
[10] Carlos Gómez-Rodríguez, Michalina Strzyz, and David Vilares. 2020. A unifying theory of transition-based and sequence labeling parsing. In Proceedings of the 28th International Conference on Computational Linguistics, 3776–3793. doi: 10.18653/v1/2020.coling-main.336.
[11] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. (2020). arXiv: 1909.10351 [cs.CL].
[12] Simon Krek, Kaja Dobrovoljc, Tomaž Erjavec, Sara Može, Nina Ledinek, Nanika Holz, Katja Zupan, Polona Gantar, Taja Kuzman, Jaka Čibej, Špela Arhar Holdt, Teja Kavčič, Iza Škrjanec, Dafne Marko, Lucija Jezeršek, and Anja Zajc. 2019. Training corpus ssj500k 2.2. Slovenian language resource repository CLARIN.SI. (2019).
[13] Simon Krek, Tomaž Erjavec, Andraž Repar, Jaka Čibej, Špela Arhar, Polona Gantar, Iztok Kosem, Marko Robnik, Nikola Ljubešić, Kaja Dobrovoljc, Cyprian Laskowski, Miha Grčar, Peter Holozan, Simon Šuster, Vojko Gorjanc, Marko Stabej, and Nataša Logar. 2019. Gigafida 2.0: Korpus pisne standardne slovenščine. viri.cjvt.si/gigafida. (2019).
[14] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv preprint 1907.11692. (2019).
[15] Nikola Ljubešić and Tomaž Erjavec. 2011. hrWaC and slWaC: Compiling web corpora for Croatian and Slovene. In International Conference on Text, Speech and Dialogue. Springer, 395–402.
[16] Martin Malmsten, Love Börjeson, and Chris Haffenden. 2020. Playing with Words at the National Library of Sweden – Making a Swedish BERT. ArXiv preprint 2007.01658. (2020).
[17] Gary Marcus and Ernest Davis. 2021. Has AI found a new foundation? The Gradient. 11 September 2021.
[18] Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: A tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7203–7219.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint 1301.3781.
[20] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Multilingual Twitter sentiment classification: the role of human annotators. PLOS ONE, 11, 5.
[21] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Twitter sentiment for 15 European languages. Slovenian language resource repository CLARIN.SI. (2016). http://hdl.handle.net/11356/1054.
[22] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
[23] Andrej Pančur and Tomaž Erjavec. 2020. The siParl corpus of Slovene parliamentary proceedings. In Proceedings of the Second ParlaCLARIN Workshop, 28–34.
[24] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
[25] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237. doi: 10.18653/v1/N18-1202.
[26] Jan Pomikálek. 2011. Removing boilerplate and duplicate content from web corpora. PhD thesis. Masaryk University, Brno, Czech Republic.
[27] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog. (2019).
[28] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1–67.
[29] Michalina Strzyz, David Vilares, and Carlos Gómez-Rodríguez. 2019. Viable dependency parsing as sequence labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 717–723. doi: 10.18653/v1/N19-1077.
[30] Hasan Tanvir, Claudia Kittask, and Kairit Sirts. 2020. EstBERT: A pretrained language-specific BERT for Estonian. arXiv preprint 2011.04784. (2020).
[31] Matej Ulčar and Marko Robnik-Šikonja. 2020. High quality ELMo embeddings for seven less-resourced languages. In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, 4733–4740.
[32] Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja. 2020. Multilingual culture-independent word analogy datasets. In Proceedings of the 12th Language Resources and Evaluation Conference, 4067–4073.
[33] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT: less is more in multilingual models. In Proceedings of Text, Speech, and Dialogue, TSD 2020, 104–111.
[34] Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, and Marko Robnik-Šikonja. 2021. Evaluation of contextual embeddings on less-resourced languages. ArXiv preprint 2107.10614. (2021).
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
[36] Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076. (2019).
[37] Artūrs Znotiņš and Guntis Barzdiņš. 2020. LVBERT: Transformer-based model for Latvian language understanding. In Human Language Technologies – The Baltic Perspective: Proceedings of the Ninth International Conference Baltic HLT 2020. Volume 328, 111.

Understanding the Impact of Geographical Bias on News Sentiment: A Case Study on London and Rio Olympics
Swati (swati@ijs.si) and Dunja Mladenić (dunja.mladenic@ijs.si)
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

[Figure 1 shows three pairs of similar news articles on the London and Rio Olympic legacies, with their headlines, categories, summaries, sentiment scores (0.067 to −0.435), article similarities (0.503–0.632), and sentiment dissimilarities (0.439–0.502).]

Figure 1: An example to illustrate the impact of geographical location on the sentiment of similar news articles.
ABSTRACT
There are various types of news bias, most of which play an important role in manipulating public perceptions of any event. Researchers frequently question the role of geographical location in attributing such biases. To that end, we intend to investigate the impact of geographical bias on news sentiments in related articles. As our case study, we use news articles collected from the Event Registry over two years about the Olympic legacy in London and Rio. Our experimental analysis reveals that geographical boundaries do have an impact on news sentiment.

KEYWORDS
Bias, News Bias, Geographical Bias, Olympics, Semantic Similarity, Sentiment Analysis, Dataset

1 INTRODUCTION
Claims of bias in news coverage raise questions about the role of geography in shaping public perceptions of similar events. Based on the geographical location, multiple factors, such as political affiliation, editorial independence, etc., can influence the way news articles are generated. Although it is well known that biased news can have more influence on people's thinking and decision-making processes [7, 9], it is nearly impossible to produce an article without any bias. Biased news articles have the potential to induce a variety of political and social implications, both direct and indirect. For instance, any political controversy presented from a specific perspective may alter the voting pattern [4, 1, 6].

There are different forms of news bias, and geographical bias is one of them. It exists if the sentiment polarity of similar articles published in different geographical locations is contradictory or varies significantly. Sentiment analysis methods, which are commonly used to determine news bias [3, 14], can be used to examine the shift in sentiment polarity in similar news articles. Now, an intriguing question arises: is geographical bias a factor affecting news sentiment? This study seeks to answer the above question by identifying and comparing the sentiments of similar news articles. In doing so, we demonstrate how geographical location impacts the sentiments of similar articles. We also investigate this impact in relation to several news categories such as politics, business, sports, and so on.

The Olympic Games are a symbol of the greatest sports events in the world. Every edition leaves a number of legacies for the Olympic Movement, as well as unforgettable memories for each host city, whether positive or negative. In this regard, we select news articles about the Olympic legacy in London and Rio as a case study for our analysis.

We use Event Registry (https://eventregistry.org) [10] to collect English news articles, along with their sentiment and categories, published between January 2017 and December 2020. We use the popular Sentence-BERT (SBERT) [12] embedding to represent the articles and then compute the cosine similarity between them to identify similar article pairs.

Our data and code can be found in the GitHub repository at https://github.com/Swati17293/geographical-bias.

1.1 Contributions
The paper's contributions are as follows:
• We propose a task of analyzing the impact of geographical bias on the sentiment of news articles, with data on the Olympic legacies of Rio and London as a case study.
• We present a dataset of English news articles customized to the above-mentioned task.
• We present experimental results to demonstrate the aforementioned impact of geographical bias.

2 RELATED WORK
The majority of the sentiment analysis methods for news bias analysis depend on the sentiment words that are explicitly stated. SentiWordNet is a publicly available lexical resource used by researchers for opinion mining to identify the sentiment-inducing words and classify them as positive, negative, or neutral.

Melo et al. [5] collected and analyzed articles from Brazil's news media and social media to understand the country's response to the COVID-19 pandemic. They proposed using an enhanced topic model and sentiment analysis method to tackle this task. They identified and applied the main themes under consideration in order to comprehend how their sentiments changed over time. They discovered that certain elements in both media reflected negative attitudes toward political issues.

Quijote et al. [11] used SentiWordNet along with the Inverse Reinforcement Model to analyze the bias present in news articles and to determine whether the outlets are biased or not. The lexicons were first scored for the experiments using SentiWordNet and then fed to the Inverse Reinforcement model as input. To determine the news bias, the model measured the deviation and controversy scores of the articles. The findings lead to the inference that articles from major news outlets in the Philippines are not biased, excluding those from the Manila Times.

Bharathi and Geetha [3] classified the articles published by the UK, US, and India media as positive, negative, or neutral using the content sentiment algorithm [2]. The sentiment scores of the opinion words and their polarities were used as input to the algorithm.

Existing research investigates news bias using sentiment analysis methods, but, unlike our work, it does not provide a suitable automated method for analyzing the impact of geographical bias on news sentiment.

3 DATA DESCRIPTION
3.1 Raw Data Source
We use Event Registry [10] as our raw data source, which monitors, gathers, and delivers news articles from all around the world. It also annotates articles with numerous metadata such as a unique identifier for article identification, the categories to which an article may belong, geographical location, sentiment, and so on. Its large-scale coverage can therefore be used effectively to assess the impact of geographical bias on news sentiment.

3.2 Dataset
To generate our dataset, we use a similar data collection process as described in [13]. Using the Event Registry API, we collect all English-language news articles about the Olympic legacy in London/Rio if the headline and/or summary of the article contains the keywords 'London'/'Rio', 'Olympic', and 'Legacy'. For each article, we then extract the summary, category, and sentiment. The article summaries vary in length from 290 to 6,553 words. Sentiment scores range from −1 to 1. We select seven major news categories, namely business, politics, technology, environment, health, sports, and arts-and-entertainment, and remove the rest of the categories. After excluding the duplicate articles, we end up with 8,690 and 5,120 articles about the Olympic legacy in London and Rio, respectively.

4 MATERIALS AND METHODS
4.1 Methodology
The primary task is to compute the average difference in sentiment scores between similar news articles about the Olympic legacies in Rio and London. The stated task can be subdivided and mathematically formulated as follows:
(1) Generate two distinct sets of news articles, A1 = {(a_1, s_1), (a_2, s_2), ..., (a_n, s_n)} about the London Olympic legacy and A2 = {(a'_1, s'_1), (a'_2, s'_2), ..., (a'_m, s'_m)} about the Rio Olympic legacy, where a_i is the i-th article in A1, a'_j is the j-th article in A2, n = |A1|, and m = |A2|. For each a_i ∈ A1, find the list of articles a'_j ∈ A2 that are the closest matches (cf. Section 4.1.1) to a_i.
(2) For each list, calculate D_ij, the difference between the sentiment scores s_i and s'_j of the articles a_i and a'_j.
(3) Calculate the average difference D̄ of the sentiment scores.
(4) Calculate the percentage of similar article pairs with reversed polarity and those with unchanged polarity.
The secondary task is to assess the primary task with respect to news categories, i.e., to calculate the average difference D̄ of sentiment scores for similar articles in each category.
In the following subsections, we discuss the tasks mentioned above in greater detail.

4.1.1 Article Similarity. We embed the articles in sets A1 and A2 to construct feature sets F1 = {f_1, f_2, ..., f_n} and F2 = {f'_1, f'_2, ..., f'_m}. While alternative embedding approaches can be utilized, in this study we select the popular Sentence-BERT (SBERT) [12] embedding to extract 768-dimensional feature vectors representing the individual articles in F1 and F2.

For each article a_i in A1, we compute the similarity score between a_i and every article a'_j in A2 using the cosine similarity metric cosSim(a_i, a'_j) (Eq. 1). We consider articles a_i and a'_j to be similar only if their similarity score is greater than 0.5.

    cosSim(a_i, a'_j) = (f_i · f'_j) / (||f_i|| ||f'_j||)    (1)

where f_i and f'_j represent the embedded feature vectors of articles a_i and a'_j. The similarity score ranges from −1 to 1, where −1 indicates that the articles are completely unrelated, 1 indicates that they are identical, and in-between scores indicate partial similarity or dissimilarity.

4.1.2 Average Sentiment Dissimilarity.
For every pair of similar ′ London and Rio published between January 2017 and December articles 𝑎 and 𝑎 , we calculate the difference 𝐷 between their 𝑖 𝑖 𝑗 𝑗 ′ 2020. We consider an article to be about the Olympic Legacy sentiment scores 𝑠 and 𝑠 . To calculate the average sentiment 𝑖 𝑗 2http://sentiwordnet.isti.cnr.it/ 3https://en.wikipedia.org/wiki/Cosine_similarity 22 Understanding the Impact of Geographical Bias on News Sentiment Information Society 2021, 4 October 2021, Ljubljana, Slovenia Table 1: Category-wise confusion matrix to show the percentage of similar article pairs with respect to their sentiment polarity. Sports Business Politics Environment Health Technology Arts & Entertainment Pos Neg Pos Neg Pos Neg Pos Neg Pos Neg Pos Neg Pos Neg Pos 77 10 62 28 42 18 55 18 29 12 87 4 59 16 Neg 11 2 7 4 23 16 14 12 12 46 1 0 7 18 Table 2: Confusion matrix to show the percentage of sim- ilar article pairs with respect to their sentiment polarity. Positive Negative Positive 69 15 Negative 11 4 Table 3: Distribution of average sentiment difference across news categories for similar article pairs with iden- tical category. Figure 2: Distribution of average sentiment differences News category Average Sentiment Difference across categories for similar articles in the same category. Sports 0.19 Business 0.20 Politics 0.18 Health 0.16 Environment 0.22 Technology 0.14 Arts and Entertainment 0.19 dissimilarity score 𝐷, we add all 𝐷 and divide it by the total 𝑖 𝑗 number of similar article pairs. 5 RESULTS AND ANALYSIS Figure 3: An illustration of the effect of category on senti- In our experiments, we compare 44,492,800 possible article pairs ment polarity. for similarity and discover 375,008 similar pairs. 
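The pairwise comparison of Section 4.1.1 can be vectorized so that all n x m cosine similarities (Eq. 1) are computed in one matrix product. A minimal NumPy sketch under the paper's 0.5 threshold, with small toy vectors standing in for the 768-dimensional SBERT embeddings:

```python
import numpy as np

def similar_pairs(F1: np.ndarray, F2: np.ndarray, threshold: float = 0.5):
    """Return (i, j, cosSim) for every article pair with cosSim > threshold.

    F1: (n, d) embeddings of one article set, F2: (m, d) of the other.
    Row-normalizing both matrices turns the (n, m) dot-product matrix
    into the full table of cosSim(a_i, a'_j) values from Eq. 1.
    """
    A = F1 / np.linalg.norm(F1, axis=1, keepdims=True)
    B = F2 / np.linalg.norm(F2, axis=1, keepdims=True)
    sims = A @ B.T                      # sims[i, j] = cosSim(a_i, a'_j)
    idx = np.argwhere(sims > threshold)
    return [(i, j, sims[i, j]) for i, j in idx]

# Toy example: three "London" and two "Rio" articles in a 4-dim space.
F1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.7, 0.7, 0.0, 0.0]])
F2 = np.array([[1.0, 0.1, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]])
pairs = similar_pairs(F1, F2)   # articles 0 and 2 match Rio article 0
```

With 8,690 x 5,120 articles this is a single (8690, 5120) similarity matrix, matching the 44,492,800 comparisons reported below; chunking the matrix product keeps memory bounded.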
The comparison in terms of sentiment similarity reveals that if two articles from different geographical regions are similar, in our case Rio and London, the average difference in their sentiment scores is 0.171. In addition, as shown in Table 2, we calculate the percentage of similar article pairs based on their sentiment polarity. It is worth noting that the polarity of the article is completely reversed 27% of the time, indicating the impact of geographic region on sentiments.

This is because the success of mega-events such as the Olympics in a particular host city is heavily influenced by its residents' trust in and support for the government [8]. The event can be viewed positively, as a national event with social and economic benefits, or negatively, as a source of wasted money. While the Olympics have left an economic and social legacy in London, a series of structural investment demands in Rio raises the question of whether or not the Olympics were worthwhile for the entire country.

5.1 Impact of news categories
The impact of news categories on the sentiments of similar articles with identical categories from different geographical regions is shown in Table 3. It demonstrates that certain news categories have a greater impact than others. Figure 2 depicts this distinction more clearly.

The categorical distribution of the percentage of similar article pairs in terms of sentiment polarity is shown in Table 1. 'Politics' has the highest percentage of articles with reversed polarity, while 'technology' has the lowest. Categories such as 'business' and 'entertainment', though not as clearly as 'politics', exhibit the same bias.

This disparity arises from the fact that, in contrast to other categories, politics is most influenced by geographical boundaries, whereas science and technology are typically location independent. Since politics has such a large influence on shaping beliefs and public perceptions, it is frequently twisted to fit a particular narrative of a story. It is inherently linked to geographical borders, and it can be extremely polarizing depending on the geographical region.
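The summary quantities reported above follow directly from steps (2)-(4) of Section 4.1. A small illustrative sketch with made-up sentiment pairs, assuming the difference D_ij is taken as an absolute difference (the paper does not state the sign convention):

```python
def sentiment_dissimilarity(pairs):
    """pairs: list of (s_i, s_j) sentiment scores in [-1, 1] for similar articles.

    Returns the average absolute difference (the paper's D) and the fraction
    of pairs whose polarity is reversed, i.e. whose scores have opposite signs.
    """
    diffs = [abs(s1 - s2) for s1, s2 in pairs]
    d_bar = sum(diffs) / len(pairs)
    reversed_frac = sum(1 for s1, s2 in pairs if s1 * s2 < 0) / len(pairs)
    return d_bar, reversed_frac

# Hypothetical sentiment scores for four similar London/Rio article pairs.
example = [(0.4, 0.2), (0.3, -0.1), (-0.2, -0.4), (0.5, 0.3)]
d_bar, rev = sentiment_dissimilarity(example)
# d_bar = (0.2 + 0.4 + 0.2 + 0.2) / 4 = 0.25; rev = 1/4 = 0.25
```

On the real pairs this computation yields the reported D = 0.171 and the 27% polarity-reversal rate.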
6 CONCLUSIONS AND FUTURE WORK
In this work, we use news articles about the Olympic legacy in London and Rio as a case study to understand how geographical boundaries interplay with news sentiments.

We begin by presenting a dataset of news articles collected between January 2017 and December 2020 using the Event Registry API. We compute the cosine similarity scores of all possible embedded article pairs, one from each set of Olympic legacy articles (London and Rio). We use the popular Sentence-BERT for article embedding and then compute the sentiment difference between similar article pairs. From 44,492,800 possible article pairs, we end up with 375,008 similar pairs.

In our analysis, we discovered that the sentiment reflected in similar articles from different geographical regions differed significantly. We also investigated this difference in relation to different news categories such as politics, business, sports, and so on. We find a significant difference in news sentiment across geographical boundaries when it comes to political news, while in the case of technology news the difference is much smaller. We find that while articles in categories such as politics and business can be heavily influenced by geographical location, articles in categories such as science and technology are typically location independent.

In the future, we plan to identify the most frequently mentioned topics in the Olympic legacy corpus to see how they affect the news sentiment of articles about different geographical locations. Since our study is limited to English news articles, we intend to learn more about the role of cultures and languages in this bias analysis. We also intend to broaden our investigation to discover the adjectives used to describe the negative and positive legacies of Rio and London. Such an analysis would aid in understanding the expectations from cities such as Rio (the first in South America to host the Olympics) in comparison to London.

7 ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 812997.

REFERENCES
[1] Dan Bernhardt, Stefan Krasa, and Mattias Polborn. 2008. Political polarization and the electoral effects of media bias. Journal of Public Economics, 92, 5-6, 1092–1104.
[2] Shri Bharathi and Angelina Geetha. 2017. Sentiment analysis for effective stock market prediction. International Journal of Intelligent Engineering and Systems, 10, 3, 146–153.
[3] SV Shri Bharathi and Angelina Geetha. 2019. Determination of news biasedness using content sentiment analysis algorithm. Indonesian Journal of Electrical Engineering and Computer Science, 16, 2, 882–889.
[4] Chun-Fang Chiang and Brian Knight. 2011. Media bias and influence: evidence from newspaper endorsements. The Review of Economic Studies, 78, 3, 795–820.
[5] Tiago de Melo and Carlos MS Figueiredo. 2021. Comparing news articles and tweets about COVID-19 in Brazil: sentiment analysis and topic modeling approach. JMIR Public Health and Surveillance, 7, 2, e24585.
[6] Claes H De Vreese. 2005. News framing: theory and typology. Information Design Journal & Document Design, 13, 1.
[7] John Duggan and Cesar Martinelli. 2011. A spatial theory of media slant and voter choice. The Review of Economic Studies, 78, 2, 640–666.
[8] Dogan Gursoy and KW Kendall. 2006. Hosting mega events: modeling locals' support. Annals of Tourism Research, 33, 3, 603–623.
[9] Daniel Kahneman and Amos Tversky. 2013. Choices, values, and frames. In Handbook of the Fundamentals of Financial Decision Making: Part I. World Scientific, 269–278.
[10] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107–110.
[11] TA Quijote, AD Zamoras, and A Ceniza. 2019. Bias detection in Philippine political news articles using SentiWordNet and inverse reinforcement model. In IOP Conference Series: Materials Science and Engineering, number 1, volume 482. IOP Publishing, 012036.
[12] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. http://arxiv.org/abs/1908.10084.
[13] Swati, Tomaž Erjavec, and Dunja Mladenić. 2020. EveOut: reproducible event dataset for studying and analyzing the complex event-outlet relationship.
[14] Taylor Thomsen. 2018. Do media companies drive bias? Using sentiment analysis to measure media bias in newspaper tweets.

An evaluation of BERT and Doc2Vec model on the IPTC Subject Codes prediction dataset

Marko Pranjić (marko.pranjic@styria.ai), Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, and Trikoder d.o.o., Zagreb, Croatia
Marko Robnik-Šikonja (marko.robnik@fri.uni-lj.si), University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Senja Pollak (senja.pollak@ijs.si), Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
ABSTRACT
Large pretrained language models like BERT have shown excellent generalization properties and have advanced the state of the art on various NLP tasks. In this paper, we evaluate the Finnish BERT (FinBERT) model on the IPTC Subject Codes prediction task. We compare it to a simpler Doc2Vec model used as a baseline. Due to the hierarchical nature of the IPTC Subject Codes, we also evaluate the effect of encoding the hierarchy in the network layer topology. Contrary to our expectations, the simpler baseline Doc2Vec model clearly outperforms the more complex FinBERT model, and our attempts to encode the hierarchy in the prediction network do not yield systematic improvement.

KEYWORDS
news categorization, text representation, BERT, Doc2Vec, IPTC Subject Codes

1 INTRODUCTION
The field of Natural Language Processing (NLP) has greatly benefited from the advances in deep learning. New techniques and architectures are developed at a fast pace. The Transformer architecture [12] is the foundation for most new NLP models, and it is especially successful with models for text representation, such as the BERT model [1], which dominates text classification. The gains in performance promised by the large BERT models come at the price of significant data resources and computational capabilities required in the model pretraining phase. Practitioners take one of the models pretrained in the language of the data and finetune it for the specific classification problem. Multilingual BERT-like models have also shown remarkable potential for cross-lingual transfer ([7], [8], [6]). A majority of the research with BERT-like models is focused on English, while less-resourced languages tend to be neglected.

The IPTC Subject Codes originate in the journalistic setting. News articles are tagged with the IPTC topics to enable search and classification of the news content, as well as to facilitate content storage and digital asset management of news content at media houses. The scheme provides a consistent and language-agnostic coding of topics across different news providers and across time. Solving the automatic classification of news content to the standardized set of topics would enable faster news production and higher quality of the metadata for news content.

In this paper, we use the recently published STT News [10] dataset in Finnish to evaluate the performance of the monolingual FinBERT model [13] on the IPTC Subject Codes prediction task, together with the Doc2Vec [3] model as a baseline. We attempt to encode the hierarchical nature of the prediction task in the prediction network topology by mimicking the structure of the labels. Finally, the impact of using different tokenizers with the same model is evaluated.

The paper is structured as follows. In Section 2, we describe the dataset and the labels relevant for the prediction task. Section 3 describes the methods used to model the prediction task and all variations of the experiments. In Section 4, we provide the results of our experiments and, finally, in Section 5 we conclude the paper and suggest ideas for further work.

Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia
© 2020 Copyright held by the owner/author(s).

2 DATASET
The STT corpus [10] contains 2.8 million news articles from the Finnish News Agency (STT) published between 1992 and 2018. The articles come with rich metadata, including the news article topics encoded as IPTC Subject Codes (https://iptc.org/standards/subject-codes/). The IPTC Subject Codes are a deprecated version of the IPTC taxonomy of news topics focused on text. The standard describes around 1400 topics structured in three hierarchical levels. The first level consists of the most general topics. Topics on the second level are subtopics of the ones on the first level and, likewise, topics on the third level are subtopics of the ones on the second level. All topics on the third level are leaf topics - there are no further subdivisions - but there are also some topics on the second level that are leaf topics and do not extend to the third level. The set of IPTC topics at STT is an extended version of the IPTC Subject Codes, as some codes used at STT are not part of the IPTC standard.

Not all articles in the STT corpus contain the IPTC Subject Codes, as can be seen in Figure 1, which shows the ratio of articles containing this information through time. IPTC Subject Codes were introduced at STT in May 2011, and around 10-15% of articles do not contain this information.

If an article contains a specific subtopic, it also contains its upper-level topics. For example, if an article contains the third-level topic "poetry", it also contains the second-level topic "literature" that generalizes "poetry", as well as the first-level topic "arts, culture and entertainment". In this way, the article metadata contains the full path through the topic hierarchy.

Figure 1: The ratio of news articles in the STT corpus containing IPTC Subject Codes.

Most articles are assigned only a small number of leaf-level topics (and their higher-level topics), but they can contain up to 7, 19 and 30 topics from the first, second and third level, respectively.

We split the dataset into train, validation and test sets such that all articles published after 31-12-2017 belong to the test set, and we discard articles without IPTC Subject Codes from it. The rest of the articles were randomly split such that 5% of the articles containing IPTC Subject Codes represent the validation set and all other articles belong to the train set. After this step, there are around 30 thousand articles in the validation set, around 100 thousand in the test set and 2.7 million in the training set, of which some 560 thousand contain IPTC Subject Codes annotation.

The train set contains 17 different topics on the first level, 400 on the second level, and 972 on the third (the most specific) level. In our experiments, we evaluate models only on topics found in the training set.

3 METHODOLOGY
For our experiments, we used a network design consisting of two stacked neural networks (extractor and predictor). The extractor processes the text and produces the text representation in the form of a numeric vector. The predictor (the second part) is a multi-label prediction network that maps the extracted text representation vector to IPTC Subject Codes. For the extractor part, we evaluate the Doc2Vec and BERT models, and for the predictor our models use a one- or three-layer neural network.

3.1 Doc2Vec
Before contextual token embeddings became popular, this model was regularly used to represent a text paragraph with a fixed vector. It was introduced in [3] with two variants of the algorithm: PV-DM (Paragraph Vector-Distributed Memory) and PV-CBOW (Paragraph Vector-Continuous Bag-of-Words). In the PV-DM variant, a training context is defined as a sliding window over the text. The model is a shallow neural network trained to predict the central word of this context window given the embeddings of the rest of the context words together with the embedding of the whole document. During training, the network learns both the word embeddings and the embedding for the document. The simpler PV-CBOW variant does not employ a context window; the neural network is trained to predict a randomly sampled word from the document.

Our experiments use the PV-DM variant of the algorithm available in the Gensim library (https://radimrehurek.com/gensim/) with most of the hyperparameters set to their default values. We set the context window width to 5 and train the network for 10 epochs on the news content from the training data. The model produces a 256-dimensional output vector. Once the model is trained, we do not finetune it further during the training of the prediction task.

Tokenization of the data was done using the SentencePiece [2] tokenizer. It was trained to produce a vocabulary of 40,000 tokens using 1 million randomly selected sentences sampled from the articles in the training set. Additionally, we ran experiments using the same WordPiece [14] tokenizer that is used with the FinBERT model.

3.2 BERT
BERT is a deep neural-network architecture of bidirectional text encoders introduced in [1]. The base model consists of 12 Transformer [12] layers. It is trained using the masked language modeling (MLM) and next sentence prediction (NSP) objectives on large text corpora. The maximum length of the input sequence for the model is 512 tokens, and each token is represented with 768 dimensions. Model inference produces context-dependent representations of the input tokens. The whole input sequence can be represented with a single vector by using the context-dependent representation of the [CLS] token. In [1], this representation is used as an aggregate sequence representation for classification tasks. Another way to represent the whole sequence, as used in [9], is to take the average representation of all output tokens (AVG).

In this paper, we use FinBERT, a BERT model introduced in [13] that was pretrained on Finnish corpora. (We also tested FinEst BERT [11], but since better performance was achieved with FinBERT [13], we do not include FinEst BERT in the results.) We should note that this model contains the STT corpus as part of its training data. The tokenizer used with the model is the predefined WordPiece tokenizer that came with the FinBERT model.

The input to the model is restricted to 512 tokens, and longer news articles are trimmed such that only the first 512 tokens are used. In the dataset, less than 5% of the documents in the training data and less than 7% in the test data are longer than 512 tokens. We experiment with the CLS and the AVG representations, and in both cases the article representation is a 768-dimensional vector. The FinBERT model is finetuned during the training of the IPTC Subject Codes prediction task.

3.3 Prediction network
For the predictor part, we experiment with two different architectures. The first is a single neural-network layer that maps the input vector to the predictions; it can be seen in Figure 2. The IPTC Subject Codes on all levels are concatenated together, thus producing 1389 outputs in the final layer.

The second architecture utilizes the tree hierarchy of the IPTC Subject Codes. We assumed that a flat output (the previous approach) requires the network to predict each label independently, irrespective of the level of the target label. By introducing separate layers for each target level, we expect that the model will implicitly learn the hierarchy among the labels. We designed this network in three layers; the architecture is shown in Figure 3. The first layer of the network predicts labels from the third IPTC hierarchical level (the most fine-grained topics).
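At the shape level, the two predictor variants differ only in how the 1389 = 17 + 400 + 972 outputs are produced. A NumPy sketch of both heads follows; the paper does not publish its implementation, so the hidden wiring of the tree variant (chaining each level's logits into the next layer's input) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768                      # extractor output size (FinBERT case; Doc2Vec uses 256)
L1, L2, L3 = 17, 400, 972    # topics per IPTC level, as stated in Section 2

def flat_predictor(x):
    """Flat variant: one layer mapping the text vector to all 1389 labels."""
    W = rng.normal(size=(L1 + L2 + L3, D))
    return W @ x             # logits for all levels concatenated

def tree_predictor(x):
    """Tree variant: stacked layers, finest level (3) first, top level (1) last.

    Feeding each level's logits into the next layer's input is only one
    plausible reading of the architecture in Figure 3.
    """
    W3 = rng.normal(size=(L3, D))
    z3 = W3 @ x
    W2 = rng.normal(size=(L2, D + L3))
    z2 = W2 @ np.concatenate([x, z3])
    W1 = rng.normal(size=(L1, D + L2))
    z1 = W1 @ np.concatenate([x, z2])
    return z1, z2, z3

x = rng.normal(size=D)       # a stand-in article representation
assert flat_predictor(x).shape == (1389,)
```

Either head would then be trained with a sum of per-level binary cross-entropy losses, as the training section describes.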
The second layer predicts topics from the second level, and the third layer predicts only the top-level IPTC Subject Codes.

Figure 2: Predictor network architecture, flat variant. The image does not show a normalization layer before the output layer.

Figure 3: Predictor network architecture, tree variant. The image does not show a normalization layer before each output layer.

3.4 Training
Each model was trained using a batch size of 128 articles and the AdamW [4] optimizer with a learning rate of 1e-3. We compute the metrics on the validation set every 100 iterations. Once the loss on the validation data starts increasing, we stop the training and evaluate the best performing checkpoint on the test data. The loss function used in all experiments is the sum of binary cross-entropy losses calculated at each topic level. News articles that do not have an annotation for a certain topic level do not contribute to the loss of that level.

4 EXPERIMENTS AND RESULTS
All experiments were repeated three times, and we report the median of those three runs in Table 1. The extraction network was evaluated with four configurations. The FinBERT model is evaluated using the WordPiece (WP) tokenizer and either the CLS token or the average (AVG) of all output tokens as the text representation. The Doc2Vec model uses either the WordPiece (WP) tokenizer or the SentencePiece (SP) tokenizer.

Table 1: Results for different experimental configurations.

    Extractor      Predictor  mAP(1)  mAP(2)  mAP(3)  R@10(1)  R@10(2)  R@10(3)
    FinBERT (CLS)  Flat       0.5432  0.2047  0.1031  0.9058   0.3687   0.2242
    FinBERT (CLS)  Tree       0.5434  0.1949  0.1043  0.9058   0.3602   0.2417
    FinBERT (AVG)  Flat       0.5401  0.2026  0.1006  0.9045   0.3692   0.2391
    FinBERT (AVG)  Tree       0.5410  0.2088  0.1089  0.9078   0.3724   0.2367
    Doc2Vec (WP)   Flat       0.8091  0.5204  0.2990  0.9721   0.7008   0.4750
    Doc2Vec (WP)   Tree       0.8127  0.5202  0.2972  0.9743   0.7099   0.4714
    Doc2Vec (SP)   Flat       0.8298  0.5550  0.3149  0.9803   0.7277   0.4951
    Doc2Vec (SP)   Tree       0.8315  0.5643  0.3282  0.9832   0.7358   0.4896

4.1 Evaluation metrics
We approach the article categorization problem through the information retrieval paradigm. Namely, we try to return the set of the most probable IPTC Subject Codes assigned to each article in the STT corpus. We use two performance metrics, the mean average precision (mAP) and recall at 10 (R@10). The mean average precision returns the expectation of the area under the precision-recall curve for a random query. The recall at 10 computes the ratio of correct topics found among the 10 tags with the highest predicted probability. To measure the generalization of our prediction models, we compute these metrics separately for each level of the IPTC Subject Codes.

4.2 Results and discussion
In all experiments, the Doc2Vec model performed significantly better than the FinBERT model, regardless of the specific extractor or predictor setup. This is surprising in the light of other successful applications of BERT models. Nevertheless, as less than 5% of the articles in the training set and less than 7% of the articles in the test set have more than 512 tokens (a limitation of BERT but not of Doc2Vec), we cannot attribute the poor performance of BERT to this limitation. Some other relevant findings are as follows. While for some tasks [9] the BERT average token representation performs better than the representation based on the CLS token, in our experiments the CLS and the AVG representations perform comparably.

The three-layer network mimicking the shape of the tree-like IPTC Subject Codes hierarchy did not yield any systematic improvement over the single, flat layer of the neural network. The difference in tokenizers for the Doc2Vec experiments shows a small but consistent improvement when using the SentencePiece tokenizer.

5 CONCLUSIONS AND FURTHER WORK
In this work, we have compared a monolingual FinBERT and a Doc2Vec model on the IPTC Subject Codes prediction task in the Finnish language. We evaluated several variations of experiments and achieved consistently better results with the Doc2Vec model. In contrast to Doc2Vec, the BERT model has a limitation in the form of a maximum number of input tokens. We believe the results cannot be explained by this, as the data used does not contain a significant number of documents exceeding this limit. We plan to explore this topic further in the hope of understanding and addressing this problem. Recent work on BERT finetuning strategies [5] identifies a problem of vanishing gradients due to excessive learning rates and implementation details of the optimizer.

Our attempt at encoding the hierarchical nature of the prediction task did not yield systematic improvement, and we believe it is worthwhile to explore other strategies and improve on this area, like encoding the hierarchy of the predictions in the loss function itself.

For the Doc2Vec experiments, consistently better results were achieved using the SentencePiece [2] tokenizer over the WordPiece [14] tokenizer used in the FinBERT model. Both of those tokenizers retain the whole information of the input, as there are no destructive operations on the text. We plan further experiments to confirm and quantify these findings and understand what enables such improvement of the downstream prediction task at the tokenizer level.

ACKNOWLEDGMENTS
The work was partially supported by the Slovenian Research
Agency (ARRS) core research programmes P6-0411 and P2-0103, [8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, as well as the research project J6-2581 (Computer-assisted mul- Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and tilingual news discourse analysis with contextual embeddings). Peter J. Liu. 2020. Exploring the limits of transfer learning This paper is supported by European Union’s Horizon 2020 re- with a unified text-to-text transformer. Journal of Machine search and innovation programme under grant agreement No Learning Research, 21, 140, 1–67. http://jmlr.org/papers/ 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less- v21/20- 074.html. Represented Languages in European News Media). [9] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In REFERENCES Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Joint Conference on Natural Language Processing (EMNLP- Toutanova. 2019. BERT: pre-training of deep bidirectional IJCNLP). Association for Computational Linguistics, Hong transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of Kong, China, (November 2019), 3982–3992. doi: 10.18653/ the Association for Computational Linguistics: Human Lan- v1/D19- 1410. https://aclanthology.org/D19- 1410. guage Technologies, Volume 1 (Long and Short Papers) [10] ST T. 2019. Finnish news agency archive 1992-2018, source . (June (http://urn.fi/urn:nbn:fi:lb-2019041501). (2019). 2019), 4171–4186. doi: 10.18653/v1/N19- 1423. [11] Matej Ulčar and Marko Robnik-Šikonja. 2020. Finest bert [2] Taku Kudo and John Richardson. 2018. Sentencepiece: a and crosloengual bert. In Text, Speech, and Dialogue. Petr simple and language independent subword tokenizer and Sojka, Ivan Kopeček, Karel Pala, and Aleš Horák, editors. 
detokenizer for neural text processing. In (January 2018), Springer International Publishing, Cham, 104–111. 66–71. doi: 10.18653/v1/D18- 2012. [12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- [3] Quoc Le and Tomas Mikolov. 2014. Distributed represen- reit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia tations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning Polosukhin. 2017. Attention is all you need. In Proceedings (Pro- of the 31st International Conference on Neural Information ceedings of Machine Learning Research) number 2. Eric P. Processing Systems (NIPS’17). Curran Associates Inc., Red Xing and Tony Jebara, editors. Volume 32. PMLR, Bejing, Hook, NY, USA, 6000–6010. isbn: 9781510860964. China, (June 2014), 1188–1196. https://proceedings.mlr. [13] Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, press/v32/le14.html. Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo [4] Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight Pyysalo. 2019. Multilingual is not enough: bert for finnish. decay regularization. In International Conference on Learn- ing Representations (2019). arXiv: 1912.07076 [cs.CL]. . https : / / openreview. net / forum ? id = [14] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Bkg6RiCqY7. Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, [5] Marius Mosbach, Maksym Andriushchenko, and Dietrich Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Klakow. 2021. On the stability of fine-tuning {bert}: mis- Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan conceptions, explanations, and strong baselines. In Inter- national Conference on Learning Representations Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith . https : Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff //openreview.net/forum?id=nzpLWnVAyah. 
Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: bridging the gap between human and machine translation. (2016). arXiv: 1609.08144 [cs.CL].
[6] Andraž Pelicon, Marko Pranjić, Dragana Miljković, Blaž Škrlj, and Senja Pollak. 2020. Zero-shot learning for cross-lingual news sentiment classification. Applied Sciences, 10, 17. issn: 2076-3417. doi: 10.3390/app10175993. https://www.mdpi.com/2076-3417/10/17/5993.

Classification of Cross-cultural News Events

Abdul Sittar (abdul.sittar@ijs.si), Dunja Mladenić (dunja.mladenic@ijs.si)
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
We present a methodology to support the analysis of culture from text such as news events, and demonstrate its usefulness on categorising news events from different categories (society, business, health, recreation, science, shopping, sports, arts, computers, games and home) across different geographical locations (different places in 117 countries). We group countries based on the culture that they follow and then filter the news events based on their content category. The news events are automatically labelled with the help of Hofstede's cultural dimensions. We present combinations of events across different categories and compare the performance of different classification methods. We also present an experimental comparison of different numbers of features in order to find a suitable set to represent culture.

KEYWORDS
cultural barrier, news events, text classification

1 INTRODUCTION
Culture is defined as a collective programming of the mind which distinguishes the members of one group or category of people from another [9]. It has a huge impact on people's lives and, as a result, it influences events that involve cross-cultural stakeholders. News spreading is one of the most effective mechanisms for spreading information across borders. To spread widely, news must cross multiple barriers, such as linguistic, economic, geographical, political, time-zone, and cultural barriers. Due to the rapidly growing number of events with significant international impact, cross-cultural analytics gains increased importance for professionals and researchers in many disciplines, including digital humanities, media studies, and journalism. The most recent examples of such events include COVID-19 and Brexit [1]. There are a few determinants that have a significant influence on the process of information selection, analysis and propagation. These include cultural values and differences, economic conditions, and the association between countries. For instance, if two countries are culturally more similar, there is a higher chance of a heavier news flow between them [10], [3]. In this paper, we focus on the classification of news events across different cultures. We select some of the most read daily newspapers and collect information using Event Registry about the news they have published. Event Registry is a system which analyzes news articles, identifies groups of articles that describe the same event, and represents them as a single event [7]. The description of the metadata of an event is shown in Table 1. The main scientific contributions of this paper are the following:
(1) A novel perspective of aligning news events across different cultures through categorising countries and news events.
(2) A cross-cultural, automatically annotated dataset in several different domains (Business, Science, Sports, Health etc.).
(3) An experimental comparison of several classification models adopting different sets of features (character n-grams, GloVe embeddings and word n-grams).

Table 1: The description of the metadata of an event.

Attributes   Description
title        title of the event
summary      summary of the event
source       event reported by a news source
categories   list of DMOZ categories
location     location of the event

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia. © 2021 Copyright held by the owner/author(s).

2 RELATED WORK
In this section, we review the related literature on the influence of culture, and on its representation and classification in different fields.
Countries that share a common culture are expected to have heavier news flows between them when reporting on similar events [10]. There are many quantitative studies that identified demographic, psychological, socio-cultural, source-, system-, and content-related aspects [2]. Cross-cultural research and understanding cultural influences in different fields offer competitive advantages. The goal of researching the impact of culture might be to draw conclusions about the way cultural factors influence a specific corporate action. There are many types of culture, such as societal, organizational, and business culture [8].
The hidden nature of cultural behavior causes difficulties in measuring and defining it. To cope with these difficulties, researchers have developed measurements that capture culture on a general scale, in order to compare differences among cultures and management styles. These results can be used to find similarities within a region and differences to other regions. There are many models that have tried to explain cultural differences between societies. Hofstede's national culture dimensions (HNCD) have been widely used and cited in different disciplines [6, 5]. Hofstede's dimensions are the result of a factor analysis at the country level, by means of a comprehensive survey instrument, aimed at identifying systematic differences in national culture. Their purpose is to measure culture in countries, societies, sub-groups, and organizations; they are not meant to be regarded as psychological traits.
There is a plethora of research studies that were conducted to understand cultural influences, such as cross-cultural privacy and attitude prediction, and cultural influences on today's business. [4] explores how culture affects the technological, organizational, and environmental determinants of machine learning adoption by conducting a comparative case study between Germany and the US. Rather than looking at the influence of cultural differences within one domain, we intend to understand the association between news events belonging to different domains (society, business, health, recreation, science, shopping, sports, arts, computers, games and home) and different cultures (117 countries from all the continents). We conduct this research to find an appropriate representation and classification of culture across different domains.

3 DATA DESCRIPTION
3.1 Dataset Statistics
We chose the top 10 most read daily newspapers in the world in 2020 [1] and collected the events reported by these newspapers using Event Registry [7] over the time period 2016-2020. Approximately 8000 events belong to each newspaper, with the exception of "Zaman", which has only 900 events. Figure 1 shows the number of events reported by the selected newspapers on a yearly basis. This dataset can be found on the Zenodo repository (version 1.0.0) [2].

Figure 1: Each color in a bar represents the total number of events per year by a daily newspaper, and a complete bar shows the total number of events per year by all the newspapers.

The attributes of an event, with descriptions, are displayed in Table 1. A few attributes are self-explanatory, such as title, summary, date, and source. DMOZ categories are used to represent the topics of the content. The DMOZ project is a hierarchical collection of web page links organized by subject matter [3]. Event Registry uses the top 3 levels of the DMOZ taxonomy, which amount to about 50,000 categories [4].

1 https://www.trendrr.net/
2 https://zenodo.org/record/5225053
3 https://dmoz-odp.org/
4 https://eventregistry.org/documentation?tab=terminology

4 MATERIAL AND METHODS
4.1 Problem Definition
There are two main parts of the problem that we are addressing. The first part is to label the examples by assigning a culture C to a news event E using its location L. The second part is a multi-class classification task where we predict the culture C of a news event E using its summary description S and its content category G as provided by Event Registry. This task can be formulated as:

C = f(S, G)

where C denotes the culture of the news event, f is the learning function, S denotes the summary of a news event, and G denotes the category of a news event (see Table 1).

4.2 Methodology

Figure 2: Classification of cross-cultural news events. (Workflow: the news events dataset, the clusters of countries, and the category of events feed an annotation step; character n-grams, GloVe embeddings, and word n-grams are the feature representations passed to classification.)

4.2.1 Data labeling. Each news event has information about the categories to which it belongs and the location where it happened (see Table 1). Each event has many categories, and each category has a weight reflecting its relevance for the event. We keep only the most relevant categories and group the news events based on their categories. For each group of events, we estimate the cultural characteristics of each event through the country of the place where the event occurred. We cluster the countries based on their culture. We utilize Hofstede's national culture dimensions (HNCD) to represent the culture of a country. We take the average of the cultural dimensions and call it the average cultural score. Based on this score, we find the optimal number of clusters using the popular k-means clustering algorithm (see Figure 4). Finally, we label each news event with one of the six cultural clusters.

Figure 3: The pie chart depicts the percentage of the news events that occurred in six different clusters (each cluster consists of a list of countries with similar culture).

4.2.2 Data representation. Each news event in Event Registry has associated categories along with a weight (see Table 1); we take the top categories based on their weight. In the case of multiple categories with equal weight, we sort them alphabetically and keep the first one. We represent each news event by a short summary S and a set of content categories G.

Figure 4: In the word cloud, the color of each word shows the cluster to which it belongs (see Figure 3). Radial dendrograms illustrate the shared categories of news events between the pairs of six clusters.

4.2.3 Data Modeling. For the multi-class classification task, we use simple classification models (SVM, Decision Tree, KNN, Naive Bayes, Logistic Regression) as well as a neural network.
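As an illustrative sketch (not the authors' code), the character n-gram feature extraction and top-K feature selection behind these experiments could look as follows; the corpus, the n-gram range (1 to 3), and K are placeholders:

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max of a text."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def top_k_features(corpus, k):
    """Select the k most frequent character n-grams over a corpus
    as the fixed feature vocabulary (e.g. k = 15000)."""
    counts = Counter()
    for doc in corpus:
        counts.update(char_ngrams(doc))
    return [gram for gram, _ in counts.most_common(k)]

def vectorize(doc, vocabulary):
    """Represent one document as n-gram counts over the vocabulary."""
    doc_counts = Counter(char_ngrams(doc))
    return [doc_counts[gram] for gram in vocabulary]
```

A classifier such as Logistic Regression would then be trained on these count vectors; in practice a library implementation, e.g. scikit-learn's `CountVectorizer(analyzer="char", ngram_range=(1, 3), max_features=15000)`, covers the same steps.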
For sim- select the top 15K characters for all the tested algorithms except ple classification models, we input character and word ngrams Naive Bayes which declines in performance with the growing varying the number of ngrams and compare the results. We also set of features. Based on these settings, we achieve the highest use pre-trained Glove embeddings. accuracy (0.85) using Logistic Regression. Using Glove embed- dings, we experiment with and without using the category of 5 EXPERIMENTAL EVALUATION event. The highest F1-score with and without the category is 0.80 5.1 Evaluation Metric and 0.79 respectively. For multi-class classification task, we use following most com- monly used evaluation measures: accuracy, precision, recall, and F1 score. 7 CONCLUSIONS AND FUTURE WORK 6 RESULTS AND ANALYSIS For researchers and professionals, it is very important to anal- 6.1 Annotation Results yse the cross-cultural differences in different disciplines. As the The results of annotation are six clusters where almost 50% news international impact is increasing and international events are events belong to the two clusters (shown with red and blue colors) becoming popular, the need to develop some automatic methods and remaining 50% belong to the other four clusters 3. Looking is significantly increasing and leaving a blank space. We con-in each group, we find that clusters do not lies in a specific ducted experiments on news events related to different fields geographic area or a continent. Rather all the countries in a to have a broader look on data and machine learning methods. cluster belong to the different continents. Similarly, these clusters Further research would be helpful in examining the impact of do not have all the countries that are economically rich or poor. specific socio-cultural factors on news events. 
In this research There are more categories in green and red colors in the word work, we estimate the culture of a specific place by its country, cloud (see Figure 4) which represent to the cluster with that colors. use basic features and simple classification models. To continue Radial dendrograms in Figure 4 present the shared categories this work further, we would like to improve feature set such as between the clusters. In the figure, root of the tree is data and by including part of speech tagging (POS) as well as other state then there are ten pair of clusters that share the same categories. of the art embeddings. The objective of this whole process was to keep news events according to the category to whom they belongs. Moreover, we can only observe the cultural differences when we have same type of news events from different places. ACKNOWLEDGMENTS 6.2 Classification Results The research described in this paper was supported by the Slove- Fro the experimental results we can see that the best performance nian research agency under the project J2-1736 Causalify and is achieved by Logistic Regression, kNN and Decision Tree. The by the European Union’s Horizon 2020 research and innovation performance of SVM varies depending on the number of selected programme under the Marie Skłodowska-Curie grant agreement features: the highest F1-score is achieved with the top 10K or 20K No 812997. 31 Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia Abdul and Dunja, et al. [3] Tsan-Kuo Chang and Jae-Won Lee. 1992. Factors affecting gatekeepers’ selection of foreign news: a national survey of newspaper editors. Journalism Quarterly, 69, 3, 554–561. [4] Verena Eitle and Peter Buxmann. 2020. Cultural differences in machine learning adoption: an international compari- son between germany and the united states. [5] Meihan He and Jongsu Lee. 2020. Social culture and in- novation diffusion: a theoretically founded agent-based model. 
Journal of Evolutionary Economics, 1–41. [6] Mahmood Khosrowjerdi, Anneli Sundqvist, and Katriina Byström. 2020. Cultural patterns of information source use: a global study of 47 countries. Journal of the Association for Information Science and Technology, 71, 6, 711–724. [7] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Gro- belnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Confer- ence on World Wide Web, 107–110. [8] Björn Preuss. 2017. Text mining and machine learning to capture cultural data. Technical report. working paper, 2. doi: 10.13140/RG. 2.2. 30937.42080. [9] Giselle Rampersad and Turki Althiyabi. 2020. Fake news: acceptance by demographics and culture on social media. Journal of Information Technology & Politics, 17, 1, 1–11. [10] H Denis Wu. 2007. A brave new world for international news? exploring the determinants of the coverage of for- eign news on us websites. International Communication Gazette, 69, 6, 539–551. Figure 5: First two line charts illustrate the variations in F1 score by simple classification models after varying the number of features. The first line chart depicts the results of word ngrams whereas the second one shows the results for character ngrams. The last line graph presents com- parison between Glove embeddings (with and without cat- egory feature). REFERENCES [1] Sara Abdollahi, Simon Gottschalk, and Elena Demidova. 2020. Eventkg+ click: a dataset of language-specific event- centric user interaction traces. arXiv preprint arXiv:2010.12370. [2] Hosam Al-Samarraie, Atef Eldenfria, and Husameddin Dawoud. 2017. The impact of personality traits on users’ information-seeking behavior. Information Processing & Management, 53, 1, 237–247. 
Zotero to Elexifinder: Collection, curation, and migration of bibliographical data

David Lindemann (david.lindemann@ijs.si)
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia

Figure 1: Zotero to Elexifinder workflow model

ABSTRACT
In this paper, we present ongoing work concerning a workflow and software tool pipeline for collecting and curating bibliographical data for the domain of Lexicography and Dictionary Research, and for data export in a custom JSON format as required by the Elexifinder application, a discovery portal for lexicographic literature. We present the employed software tools, which are all freely available and open source. A Wikibase instance has been chosen as the central data repository. We also present requirements for bibliographical data to be suitable for import into Elexifinder; these include the disambiguation of entities such as natural persons and natural languages, and the processing of article full texts. Beyond the domain of Lexicography, the described workflow is applicable in general to single-domain, small-scale digital bibliographies.

KEYWORDS
bibliographical data, author disambiguation, e-science corpora

1 INTRODUCTION
In 2019, version 1 of Elexifinder [1], a discovery portal for lexicographic literature, was launched in the framework of the ELEXIS project [2]. At the same time, at the University of Hildesheim, a domain ontology and bibliographical data collection for Lexicography and Dictionary Research was planned [6, 5]. Both endeavours had already compiled significant datasets. At a dedicated workshop connected to the 2019 eLex conference in Sintra (Portugal), it was decided to combine the efforts, and the workflow explained in this paper was designed, in order to merge the existing datasets, decide on criteria for data curation, and make the results available to the lexicographic community. Two years later, at the 2021 Euralex conference, Elexifinder version 2 was introduced [3]. The main shortcomings of Elexifinder version 1 have been sorted out, namely the missing author disambiguation, and the coverage of the domain's literature has been significantly increased, also regarding publication languages other than English. Moreover, a vocabulary of lexicographic terms has been developed, which is now used for content-describing indexation of article full texts.
Lexicography and Dictionary Research is a relatively small discipline, having thematic intersections with Corpus Linguistics, Terminology, Natural Language Processing, and Philology. In metalexicographic literature, all aspects of the lexicographic process, dictionary structure and functions, dictionary use, and other relevant issues are discussed. Communication in the lexicographic community mainly takes place through a small number of conference series and journals, complemented by handbooks and other edited volumes. The need for a dedicated digital bibliography arises from the following observations:
• The vast majority of publications do not have Digital Object Identifiers (DOI), and thus are not indexed in cross-domain digital collections of publication metadata. This applies to nearly all older publications, but also to many newer contributions published in the last two decades.
• When searching for metalexicographical publications in cross-domain digital collections, search results are mixed up with publications from other domains, which may disturb straightforward information retrieval.
• Author disambiguation in the domain-independent digital collections that can be considered the big players in the field (such as Google Scholar) is not at all accurate, so that very often name variants are not resolved to a single person entity, and different persons with the same name are not disambiguated.
• If articles are indexed with content-describing terms in cross-domain digital collections, the vast majority of those terms will be out of the scope of the domain we are looking at.
• Publication metadata found at big (i.e. automatically compiled) repositories is often incomplete or noisy, so that using it, e.g. for citations, requires manual intervention in order to achieve publishable quality.

Therefore, it seems useful to provide the lexicographic community with a platform that makes publications and their metadata accessible in a way that overcomes the described shortcomings. Single-domain endeavours of this kind, which all involve manual curation, are DBLP [3] for Computer Science, IxTheo [4] for Theology, or EconBiz [5] for Economics. Inspired by features found in these, we propose a workflow that involves the use of free software accessible to anybody, which makes it reproducible and cost-reducing.

1 Accessible at https://finder.elex.is.
2 See https://elex.is.

Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia. © 2020 Copyright held by the owner/author(s).

2 LEXBIB ZOTERO GROUP
Zotero [6], developed and maintained by the Corporation for Digital Scholarship [7], a non-profit organisation, is the most widely used open source citation management software application. Zotero offers functionality for web-scraping publication metadata, importing metadata from different structured formats, and an online platform for collaborative curation of metadata.
Duplicate management has been done in batches (whole journal issues or conference iterations), or one by one using Zotero's built-in duplicate detection functionality. The main criterion for the inclusion of metadata records has been the availability of the corresponding full texts. This means a clear preference for Open Access publications; but other publications have also been included, wherever a suitable license agreement allowed access to the text. [13]
Zotero data can be accessed via the API [14], or exported locally using pre-set or custom export scripts. We use an adapted version of the Zotero JSON-CSL exporter, which produces a list of JSON objects containing all metadata fields and their values as literal strings, as well as the location of all local file attachment copies.
For statements that cannot be expressed using standard Zotero fields [15], we have used Zotero tags as a workaround, following a simple syntax of predicate and object: for example, for asserting that an article is a review article, the tag ":type Review", and so on. Tags in Zotero can easily be copied from one item to others by manual drag-and-drop operations, set via the API, and also included in display styles, so that in the Zotero item listings, for example, review article titles can be preceded by a coloured symbol. With this workaround we can assert semantic triples inside Zotero. That is, for instance, for representing the statement that a certain item is contained in another item (e.g. a book chapter item in an item of type book), we use a tag beginning with ":container", followed by an identifier for the containing item; for a conference paper presented at a certain event, we use a tag beginning with ":event", followed by an identifier for that event.
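As a minimal sketch of this tag convention (the identifiers below are invented for illustration), such shortcode tags could be split into predicate-object pairs, while ordinary keyword tags pass through unchanged:

```python
def parse_tag_triples(tags):
    """Split LexBib-style Zotero tags into (predicate, object) pairs.

    Shortcode tags start with ':' and hold a predicate and an object
    separated by whitespace, e.g. ':type Review' or ':container <id>'.
    All other tags are treated as plain keyword tags.
    """
    triples, keywords = [], []
    for tag in tags:
        if tag.startswith(":"):
            predicate, _, obj = tag[1:].partition(" ")
            triples.append((predicate, obj.strip()))
        else:
            keywords.append(tag)
    return triples, keywords

# Hypothetical example: one type assertion, one container assertion,
# and a plain keyword tag.
triples, keywords = parse_tag_triples(
    [":type Review", ":container Q1234", "lexicography"]
)
```

The same pairs can then be mapped onto proper object properties once the data leaves Zotero for a triple store.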
For both of these, corresponding Zotero fields do exist possibility to attach full text PDF (and TXT versions) to metadata ("contained in", "presented at"), but these are filled by the web records. The Zotero scraper functionality allows to download scraping and importer translators with literal string values as publication metadata and attached PDF files from all those sites needed for citations, and not with unambiguous identifiers. 8 the Zotero community has provided a "translator" for, includ- For Elexifinder, a special metadatum is included in all publica- ing the web platforms of major publishing houses, Open Journal tion metadata sets: The location of the first author. This allows Systems, etc. From the Zotero platform, users are able to obtain the generation of location maps and search filters according to metadata records as single items or as batches for import into locations in the Elexifinder portal. For these locations, we insert their own citation managers, or as export records in a range of 16 English Wikipedia page titles in the Zotero "extra" field. citation styles or in structured formats such as bibtex. Members of a Zotero group can view and download full text attachments. 3 LEXBIB WIKIBASE Moreover, Zotero items can be annotated with custom tags, and 3.1 Wikibase as LOD infrastructure solution additional information (such as excerpts or comments) can be attached to them. Around Zotero, an active community is devel- The decisive shift from a metadata set as in Zotero, which con- 9 oping plug-ins that add new functionalities to Zotero. sists of certain fields and their literal values, towards unambigu- In the first planning period of the LexBib project, funded by the ous Linked Data lies in the reconciliation of those literal values University of Hildesheim, conference publications of the Euralex against existing or new unambiguous identifiers. 
For example, and the eLex conference series, and publications from a range of and this already refers to the hardest nut to crack in this context, journals and edited volumes have been added to LexBib Zotero an author may have several name variants appearing across the 10 group. Items collected for Elexifinder version 1, available as publication metadata collection, and there may be other persons tabular data, have then been merged to the Zotero group. For this sharing the same name, or any of the name variants. But one 11 purpose, tabular csv data has been transformed to RIS format author or editor (i.e., a "creator") should only have one identifier and imported to Zotero. Additionally, metadata records from (such as ORCID). Since we do not know Wikidata and/or ORCID OBELEX-meta and EURALEX-Dykstra bibliographies have been identifiers of all creators in our database, we need to create our own (and map them later). Other Zotero fields that should be 3 Accessible at https://dblp.org/. 12 See references in [3]. 4 Accessible at https://ixtheo.de/. 13 Article full text are stored and exclusively used for project-related text mining 5 Accessible at https://www.econbiz.de/. tasks; they cannot be downloaded from Zotero. We instead provide download links 6 See https://zotero.org. which lead to the download offered by the corresponding publisher, subject to 7 See https://digitalscholar.org/. applicable restrictions. 8 14 See https://www.zotero.org/support/translators. See https://www.zotero.org/support/dev/web_api/v3/start. 9 15 For example, very recently the Cita plug-in has been developed, which allows to See https://www.zotero.org/support/kb/item_types_and_fields. add citation metadata to Zotero records, see https://meta.m.wikimedia.org/wiki/ 16 Wikipedia page titles are unambiguous (see e.g. https://en.wikipedia.org/wiki/ Wikicite/grant/WikiCite_addon_for_Zotero_with_citation_graph_support. Cambridge vs. 
…https://en.wikipedia.org/wiki/Cambridge,_Massachusetts), and map to only one Wikidata entity. This strategy has turned out to be effective, since manual annotators are able to find the adequate Wikipedia page without hassle.

^10 Last version accessible at https://www.zotero.org/groups/lexbib/library.
^11 See https://en.wikipedia.org/wiki/RIS_(file_format).

Zotero to Elexifinder. Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia

…reconciled against unambiguous identifiers are those describing the containing item: the conference where the contribution was presented, the journal, the publisher, the publication place, and the publication language. For some of these, persistent identifiers are available in many cases (e.g. journals) or in all cases (languages). In general, we create our own identifiers and map them to Wikidata; in some cases immediately (languages, places and, by ISSN, also journals), and in other cases we leave that mapping to the (near) future, as is the case for creators and publishers. Other Zotero fields contain identifiers (ISSN, ISBN, DOI), which after normalisation can be taken directly as external identifiers in a Linked Database.

After experimenting with different RDF database solutions that allow representing data in the described way, we have decided for Wikibase,^17 the software infrastructure underlying Wikidata.^18 Since 2019, "Wikibase as a Service"^19 is offered to the community. Wikibase entities are items (each of which has its own identifier preceded by the letter Q) and properties (preceded by the letter P), just as in Wikidata, but in a different namespace. Properties may point to other items, other properties, or external identifiers, or hold values of a certain datatype, such as "monolingual text", "point in time", "string", "url", etc.^20

Wikibase as central data repository has several advantages compared to other infrastructure solutions for Linked Open Data (LOD):
• Entity data is displayed on entity pages, where it can be viewed and edited. These pages always reflect the last update.
• A complete edit history is available, and changes can be undone.
• Every entity page is linked to a dedicated discussion page.
• User and user rights management allow a community-driven editing process.
• In addition to the query interface and SPARQL endpoint known from other RDF database solutions, Wikibase data can be uploaded and downloaded using an API, and as entity data dump in several formats.

The backbone of LexBib Wikibase is an ontology of classes and properties,^21 which can be aligned to Wikidata or other external ontologies. We have started to define these alignments. This ensures interoperability with other resources, such as Wikidata, so that data can be transferred from LexBib to Wikidata or vice versa, or accessed in both at the same time using federated SPARQL queries.

3.2 Zotero to Wikibase migration

As mentioned before, Zotero item data is exported from a local Zotero instance, using an adapted version of the Zotero JSON-CSL exporter.^22 The exported JSON objects are then processed in the following way:
• Zotero tags that contain semantic triple shortcodes (explained above) are mapped to the corresponding LexBib wikibase properties, in this case with datatype "item", that is, to object properties.
• Creator name and publisher name literals are mapped to the properties corresponding to the creator role (author or editor) or to the publisher. This is done in such a way that the name literals appear as qualifiers to a wikibase "novalue" statement, which is a placeholder for the creator or publisher item that will be defined in the disambiguation process explained below.
• Zotero fields that contain external identifiers (ISSN, ISBN and DOI) are mapped to the corresponding properties of datatype "external identifier". Wikibase properties of that datatype allow defining a URL pattern, in order to make the identifier a valid hyperlink that can be clicked on in Wikibase entity data pages.
• As mentioned, we use the Zotero "extra" field ("note" in bibtex) for annotating the item with a Wikipedia page that corresponds to the first author's location. The Wikidata API is queried for the corresponding Wikidata entity, an equivalent of which is created in LexBib Wikibase, in order to function as object of the property "first author location".
• The Zotero "language" field in LexBib may contain a two-letter ISO-639-1 or a three-letter ISO-639-3 code. This is mapped to a property pointing to the language item corresponding to that code.
• The Zotero item URI is taken as external identifier in LexBib wikibase, with the Zotero storage location of PDF and TXT attachments as qualifiers to that statement. In addition, we annotate this statement with a qualifier asserting the presence of an abstract and, if any, in what language.^23
• The content of the remaining fields is mapped to Wikibase properties of the corresponding datatype ("URL", "string", or "point in time").

The resulting dataset is then imported into LexBib Wikibase. It is worth mentioning that uploading data to a Wikibase triple by triple using the mediawiki API^24 of the Wikibase instance takes about 0.5 seconds per triple, which is due to the need to update Wikibase search indices and edit histories for every single uploaded triple.

3.3 Entity disambiguation using Open Refine

The around 5,000 creator names appearing in LexBib Zotero by spring 2021 have been mapped to around 4,000 unique person items. This has been done by testing different clustering algorithms available in the Open Refine application,^25 by Christiane Klaes from the University of Hildesheim, in the framework of her MA thesis [1]. These are the creator items present in LexBib Wikibase experimental version 2.^26
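The name matching that follows this initial clustering goes through a reconciliation service of the kind Open Refine consumes, which accepts a JSON batch of queries and returns scored candidate items. A minimal client-side sketch under that assumption; the endpoint URL, the score threshold and the returned item IDs below are hypothetical placeholders, not the actual LexBib service:

```python
import json

# Hypothetical endpoint; the actual LexBib reconciliation service URL is not shown here.
RECON_ENDPOINT = "https://example.org/lexbib/reconcile"

def build_queries(names, type_id="Q5", limit=5):
    """Build the 'queries' payload of a reconciliation batch:
    one named query per creator name literal."""
    return {f"q{i}": {"query": name, "type": type_id, "limit": limit}
            for i, name in enumerate(names)}

def best_match(result_block, min_score=70.0):
    """Accept the top candidate if the service flagged it as a sure match
    or scored it above a threshold; otherwise signal that the literal is
    unmatched (new person item, or manual choice)."""
    candidates = result_block.get("result", [])
    if candidates and (candidates[0].get("match")
                       or candidates[0].get("score", 0) >= min_score):
        return candidates[0]["id"]
    return None

# Payload as it would be sent, and a mocked service response (IDs invented):
payload = json.dumps(build_queries(["Christiane Klaes"]))
response = {"q0": {"result": [{"id": "Q123", "name": "Christiane Klaes",
                               "score": 98.5, "match": True}]}}
print(best_match(response["q0"]))
```

In the real pipeline a failed match would trigger the creation of a new person item, as described in the text.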
From that moment on, any new Zotero item that is exported to Wikibase, which will contain, as explained above, one or more creator statements of type "novalue", is reconciled against existing LexBib Wikibase creator items, using the given and last name literal qualifiers. For this purpose, a reconciliation service for LexBib Wikibase has been set up,^27 which is then accessed by Open Refine in order to match creator name literals to creator items. If a literal can not be matched to any existing item, a new person item is created. The reconciliation also works with fuzzy matches, and all name variants attached to existing items are considered. Matches can also be chosen manually. Any additional name variant appearing in Zotero data is linked to the LexBib Wikibase person item as "alias" label, while the most frequent name variant is chosen as "preferred" label. This makes new name variants available for subsequent reconciliation iterations.

LexBib persons have up to six name variants found in Zotero data. In some cases, we have chosen the preferred name variant manually, according to the author's own choice or to conventions in the community regarding the naming of commonly known authors (see an example at http://lexbib.elex.is/entity/Q1583).

3.4 Full text processing

LexBib full text PDFs are stored in the local Zotero storage folder, which is automatically synchronised with the Zotero cloud. When processing Zotero JSON output, PDF files are sent to an installation of the GROBID application,^29 which proposes a TEI representation of the PDF content. This allows isolating the full text body from the other text components, such as title, running titles, abstract, author list, and references section. The extracted full text body is manually validated and, in case of any mistake, corrected using a plain TXT version of the PDF, which is produced by Zotero by default.

GROBID turns out to structure PDF content as TEI very efficiently if the article resembles a typical structure as found in journals and proceedings. Book chapters and review articles, which normally do not feature an abstract, in turn are usually not parsed adequately. In those cases, we now use the plain TXT version directly for producing a cleaned version manually.

The article text is then lemmatised,^30 and lexicalisations of LexVoc lexicographic terms^31 are looked up in the text. The LexVoc vocabulary is a resource still under development;^32 for the term discovery process, terms and lexicalisations (labels) are obtained from LexBib Wikibase by a SPARQL query, the result of which reflects the state of LexVoc at that particular moment. The keyword processor returns counts of every term, so that relative frequencies can be calculated for every term, according to the occurrences of its labels and the number of tokens in the article text body; this information can be uploaded to LexBib Wikibase bibliographical items, so that term indexation becomes part of their entity data.

4 WIKIBASE TO ELEXIFINDER

The described workflow is necessary for being able to export bibliographical data in a custom JSON format, as needed for Elexifinder, an application based on some of the elements of the Event Registry system architecture [4]. In particular, authors and content-describing terms (Elexifinder "categories") have to be represented as objects containing an unambiguous URI and a textual label; the containing item, the LexBib Zotero item URI, and the link for accessing the full text download are represented as URL, the publication date in ISO 8601 format, the publication language in ISO 639-3 format, and the item title as a simple string.

The full text body itself is also exported to Elexifinder, where it is used for displaying the first bits of it in search result displays, and for wikification, from which Elexifinder "concepts" are obtained, as long as the system is able to associate named entities occurring in the text with the Wikipedia pages that describe them.

5 CONCLUSIONS AND OUTLOOK

The described workflow enables us to disambiguate entities found in bibliographical datasets. For the time being, we are applying this for feeding the Elexifinder app. Having chosen Wikibase as central data repository also allows for aligning LexBib data with Wikidata in a straightforward way. In some cases, we have imported statements from Wikidata in order to enrich LexBib entities with additional information, but that can be done the other way round as well. In other words: wherever we find (or create) a Wikidata entity to align with our own, we can export the statements asserted in LexBib Wikibase to the main Wikidata. We have done this using LexBib events (conferences) as a test case, and plan to align other entity types with Wikidata in the near future, namely articles, persons, and organisations.

ACKNOWLEDGMENTS
The research received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 731015.

REFERENCES
[1] Christiane Klaes. 2021. Linked Open Data-Strategien zum Identity Management in einer Fachontologie. Master's thesis. Universität Hildesheim, Hildesheim, (June 2021). http://lexbib.elex.is/entity/Q15468.
[2] Iztok Kosem and Simon Krek. 2019. ELEXIFINDER: A Tool for Searching Lexicographic Scientific Output. In Electronic Lexicography in the 21st Century: Proceedings of the eLex 2019 Conference. Lexical Computing CZ s.r.o., Brno, 506–518. http://lexbib.elex.is/entity/Q9484.
[3] Iztok Kosem and David Lindemann. 2021. New developments in Elexifinder, a discovery portal for lexicographic literature. In Lexicography for Inclusion: Proceedings of the 19th EURALEX International Congress, 7-11 September 2021, Alexandroupolis, Vol. 2. Democritus University of Thrace, Alexandroupolis, 759–766. http://lexbib.elex.is/entity/Q15467.
[4] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International World Wide Web Conference, WWW14, Seoul, Korea, April 7-11, 2014, 107–110. doi: 10.1145/2567948.2577024.
[5] David Lindemann, Christiane Klaes, and Philipp Zumstein. 2019. Metalexicography as Knowledge Graph. OASICS, 70. http://lexbib.elex.is/entity/Q13955.
[6] David Lindemann, Fritz Kliche, and Ulrich Heid. 2018. LexBib: A Corpus and Bibliography of Metalexicographical Publications. In Lexicography in Global Contexts: Proceedings of the 18th EURALEX International Congress, 17-21 July 2018, Ljubljana. Ljubljana University Press, Ljubljana, 699–712. http://lexbib.elex.is/entity/Q6059.

^17 See http://wikiba.se; our instance is accessible at http://lexbib.elex.is.
^18 Accessible at http://www.wikidata.org.
^19 See https://www.wbstack.com. The service has been co-enabled by Adam Shoreland (https://addshore.com/), Rhizome (https://rhizome.org/), and WMDE (https://www.wikimedia.de/).
^20 See https://www.wikidata.org/wiki/Help:Data_type.
^21 For more information, see the LexBib Wikibase main page at https://lexbib.elex.is.
^22 Available at https://github.com/elexis-eu/elexifinder/blob/master/Zotero/LexBib_JSON.js.
^23 The abstract language is assumed to be the same as the publication language, if not stated differently with the tag shortcode ":abstractLang".
^24 For LexBib Wikibase, see https://lexbib.elex.is/w/api.php.
^25 Available at https://openrefine.org/.
^26 Accessible at https://data.lexbib.org.
^27 This is done using https://github.com/wetneb/openrefine-wikibase.
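The term look-up and relative-frequency step of Section 3.4 can be sketched as follows. The actual pipeline runs the flashtext keyword processor over spaCy-lemmatised text; this stand-in uses only the standard library, and the term IDs and labels are invented for illustration, not real LexVoc data:

```python
from collections import Counter

def term_frequencies(tokens, term_labels):
    """Count occurrences of each term's labels (lexicalisations) in a
    lemmatised token stream and return per-term relative frequencies,
    i.e. label occurrences divided by the number of tokens."""
    label_to_term = {label: term
                     for term, labels in term_labels.items()
                     for label in labels}
    counts = Counter()
    for tok in tokens:
        term = label_to_term.get(tok)
        if term is not None:
            counts[term] += 1
    n = len(tokens) or 1
    return {term: counts[term] / n for term in term_labels}

# Invented LexVoc-style terms, each with a set of labels:
terms = {"Q_headword": {"headword", "lemma"}, "Q_collocation": {"collocation"}}
tokens = "the headword and its collocation appear near another headword".split()
freqs = term_frequencies(tokens, terms)
print(freqs)
```

The resulting per-term frequencies are what would be uploaded to the Wikibase bibliographical items as part of their entity data.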
29 See https://grobid.readthedocs.io. 30 For the time being, we are only processing English text. For lemmatisation, we use spaCy (see https://spacy.io/). 31 This is done using https://pypi.org/project/flashtext/. 32 Described at http://lexbib.elex.is/wiki/LexVoc. 36 Simple Discovery of COVID IS WAR Metaphors Using Word Embeddings Mojca Brglez Senja Pollak Špela Vintar University of Ljubljana Jožef Stefan Institute University of Ljubljana Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia mojca.brglez@ff.uni- lj.si senja.pollak@ijs.si spela.vintar@ff.uni- lj.si ABSTRACT an innovative methodological approach. We propose a top-down method to search for expected conceptual metaphors through In the past year, the discourse on the COVID-19 pandemic has semi-automatic means employing word embeddings. While most produced a great number of metaphors stemming from the more previous corpus-based approaches to identify metaphors either basic conceptual metaphor ILLNESS IS WAR. In this paper, we use a small set of candidate words or require manual inspec- present a semi-automatic method to detect linguistic manifes- tions of large data samples, our approach reduces manual work tations of the latter in Slovene media. The method consists of on assembling linguistic data by combining existing annotated assembling a seed vocabulary of war-related words from an ex- resources and text mining methods. isting Slovene metaphor corpus, extending the vocabulary using word embeddings, and refining the extended vocabulary using 2 PROPOSED APPROACH intersection filtering. Our method offers a quick compilation of corpus data for further analysis, however, we also address is- Our method aims to discover linguistic expressions of the con- sues related to the method’s precision and the need for manual ceptual metaphor COVID IS WAR in the corpus by targeting filtering. a broader potentially metaphoric vocabulary. 
Previous related works have relied on either a limited vocabulary set (e.g. [7]) or a KEYWORDS list of words laboriously compiled from various sources such as dictionaries, thesauri and other studies on metaphor [19], or have metaphors, covid, word embeddings, media discourse used sophisticated but complex NLP methods and specialized 1 INTRODUCTION resources (e.g. [6]). In our experiment, we use a simple unsupervised approach using existing resources and language processing The COVID pandemic has been a ubiquitous topic in the dis- technologies. course of the past year, featuring in medical, political, public The main novelty of our approach is using pre-trained word and personal discourse. The emergence of a new virus of yet embeddings to extend the vocabulary, used also by e.g. [16] and unknown origin, behaviour and effects has presented itself like a [18] to extend terminology. As past research has shown [14], word complex and obscure topic. To make sense of it, we have once embeddings used for training language models retain linguistic more resorted to metaphorical language, much like we do when regularities, including syntactic and semantic relationships be- faced with other abstract, obscure concepts. According to Con- tween words. This means that similar words have similar vectors, ceptual Metaphor Theory (CMT, [11, 12]), metaphors “are among and the closer vector representations (word embeddings) are, our principal vehicles for understanding” and “play a central role the higher the chance they share a certain semantic space. We in the construction of social and political reality” ([12, p. 151]). make use of this feature by trying to capture a semantic space that In CMT, linguistic metaphors such as "food for thought" and would resemble the conceptual domain of WAR, which represents "half-baked idea" are considered manifestations of an established the source domain of the metaphor. 
conceptual mapping between a more concrete domain and a more abstract domain, here for example IDEAS ARE FOOD. The do- 2.1 Method main of DISEASES, on the other hand, is often mapped to the First, we start by collecting war-related lexical units from the domain of WAR, a more common frame of reference which has KOMET corpus [1], the only corpus of metaphors in Slovene taken hold as a fairly conventional way to talk about illnesses which was recently compiled and annotated similarly to the and their treatments, as well as several other domains ([8]). English corpus of metaphors, VUAMC [17]. KOMET contains ap-As was already observed in various studies ([19, 2, 5, 7]), the dis-proximately 200,000 words obtained from journalistic, fiction and course on the current COVID pandemic has also repeatedly used online texts and was hand-annotated for metaphoricity on the ba- the WAR domain in its metaphors. At the time of our experiment, sis of the MIPVU procedure ([17]). Additionally, the metaphoric however, no study has yet addressed the use of such metaphors expressions are tagged for one of 69 semantic frames, i.e. the in Slovene, where they were also adopted for communicating source concepts that semantically motivate them. One of these se- various implications, preventive measures, recommendations and mantic frames is #met.battle, which subsumes 105 metaphoric laws to abide by. To investigate the use and pervasiveness of this instances with 67 different lemmas, such as predati, ostrostrelec, metaphorical domain in Slovene media, we have conducted a orožje, napasti [surrender, sniper, weapon, attack]. 
These also quick analysis of a corpus of COVID-related news articles using form multi-word idioms such as železna pest [iron fist] and boriti Permission to make digital or hard copies of part or all of this work for personal se z mlini na veter [to tilt at windmills] which we exclude from or classroom use is granted without fee provided that copies are not made or our candidates list because the word embeddings we use only rep-distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this resent tokens, not whole phrases. Moreover, the lemmas within work must be honored. For all other uses, contact the owner /author(s). do not themselves necessarily represent the desired domain. We Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia also filter out some words erroneously annotated with the frame © 2021 Copyright held by the owner/author(s). such as številen [numerous]. This gives a starting vocabulary of 37 Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia Brglez et al. 51 unique seed words. Then, to extend the vocabulary further, (2) true metaphorical expressions referring to disease as target we employ Slovene word token embeddings ([13] pre-trained domain with fastText ([4]) on various large corpora of Slovene (GigaFida, For example, in the following sentence, the word brigade Janes, KAS, slWaC etc.). For each seed word in the list of words [brigades] only refers to a name of a street, which we mark extracted from the KOMET corpus, we use the Gensim library as literal usage. ([20]) to find the word’s N nearest neighbours in the fastText embeddings’ space (using the most_similar function). • /. . . / odvzem brisov pri pacientih s sumom na Covid-19: ob To increase the robustness of the extended vocabulary, we try Cesti proletarskih brigad 21 /. . . / to automatically filter out lexis not related to war. To this end, /. . . 
/ taking swabs from patients with suspected Covid-19: at we use the word embeddings intersection method ([18]). The 21, Proletarian Brigades Road /. . . / method retains only the candidates that intersect between the In the following example, the word napad [attack] is used to refer sets, meaning they occur in the neighbourhood of at least k input to another domain – INTERNET, COMP UTING, which we mark seed words. For our main experiment, presented in this paper, we as metaphor for another target domain. select the parameters N =50 and k=3. We thus obtain a maximum • Covid-19 je okrepil trend rasti kibernetskih napadov [Covid- of 2550 (50 x 51) potential candidates. In the output, there are 19 reinforced the growing trend of cyber attacks] 2078 unique words, and, after lemmatization, 1539 unique lemmas. After the intersection filtering, the vocabulary extended by word The following three example sentences contain expression that embeddings consists of 184 word lemmas: 44 of them are already we mark as metaphor for the target domain of DISEASE. included in our initial seed set and 140 are new lemmas. We join • Čeprav v boju z virusom to nikakor ni hitro. the new, extended set with the initial seed set, which yields a [Although this is by no means fast in the fight against the total of 191 lemmas to search for. virus.] 3 CORPUS • Kako bo jeseni, ko bodo »udarili« še drugi virusi? The experiment is carried out on a corpus of Slovene COVID- [What will happen in autumn, when other viruses also 19-related news articles, automatically crawled from the web by “strike”?] searching for the keyword “covid-19” in article titles (a subset of the Slovene corpus used in the Slav-NER 2021 shared task ([15]). • Prvi organski sistem v organizmu, ki ga virus napade, The corpus consists of 233 texts spanning from February 2nd povzroči pljučnico, . . . to December 11th, 2020 . 
To prepare it for analysis, we remove [The first system in the organism that the virus attacks the header of each text (comprised of the article number, locale, causes pneumonia . . . ] date and URL), then parse the text into sentences and tokens Results of this analysis are presented in Table 1, whereby using the NLTK library ([3]). We also lemmatize the corpus using we report only lemmas that were metaphorically used for the the LemmaGen lemmatization module ([9]). The pre-processed DISEASE target domain at least once. corpus contains 7,273 sentences and 151,947 tokens. As can be derived from Table 1, our proposed method correctly 3.1 Corpus search identified 25 different lemmas with a total of 123 occurrences that are used metaphorically to frame the topic of the pandemic. In the next step, we extract all sentences from the corpus con- Out of our 233 articles, 68 or 29,18% contained at least one mili- taining any of the war-related terms from our expanded vocab- taristic metaphorical expression. The ostensibly most frequent ulary of 191 lemmas. The results yield 335 instances of poten- expression used was boj [fight] with 46 metaphorical occurrences, tially metaphorical expressions. Out of the 191 lemmas on the followed by boriti [to fight] with 13 metaphorical occurrences metaphorical candidate list, the COVID corpus contains 49, ap- and soočati [to confront] with 7 metaphorical occurrences. They pearing in 268 sentences. Due to the unsupervised approach these account for 37.4%, 10.6% and 5.7% of all metaphorical expressions are still only candidate words from the semantic domain of war. found by our method, respectively, and together, they represent A manual analysis shows that in addition to war metaphors, our more than 50% of them. This points to the interpretation that the extracted sentences include the following four cases: news corpus contains mostly highly conventional and recurrent (1) Some of the seed words found in the corpus are used metaphors. 
A lot of the war-related vocabulary (potential can- literally; didates in our extended war-related lexis) is not used, meaning (2) Some of the seed words found in the corpus are a result the corpus does not, at this moment, exhibit very original, novel of lemmatization errors metaphorical expressions. Using a larger and a more recently (3) Some of the seed words found in the corpus are used compiled corpus would perhaps reveal a more innovative use of metaphorically, but refer to other target domains, such COVID IS WAR metaphors. The vocabulary extension method as POLITICS or NATURE (e. g. boriti se proti podnebnim using word embeddings has proven fruitful as it revealed some spremembam [’fight against climate change’]) metaphorical expressions that were not in the initial 51-word (4) Some of the seed words in our initial 191-candidate list list extracted from the KOMET corpus. The 9 newly discovered are not actually related to the topic of WAR but are more lemmas are: soočiti, izbojevati, zmagati, obraniti, uiti, soočanje, closely related to another topic (e.g. gol [‘goal’]) spopadati, zoperstaviti, podleči [to confront, to fight, to win, to On this account we perform a manual analysis of the extracted defend, to escape, confrontation, to combat, to oppose, to suc- sentences and categorize them as follows: cumb]. (1) falsely extracted instances due to a lemmatization error The analysis also revealed some additional lemmas that relate or literal use, or true metaphorical expressions but with the epidemic to the war frame. In the sentences containing the other source or target domain, and lemmas we searched for, there were other words from the WAR 38 Simple Discovery of COVID IS WAR Metaphors Using Word Embeddings Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia Table 1: Analysis of metaphoric lemmas from the ex- (75, 100, 150 and 200). Our initial experiments were carried out tended vocabulary on a N of 50 and intersection k of 3. 
However, by changing the parameters, the results of initial new lemmas could differ. In Lemma Corpus Literal DISEASE Figure 1, we analyse how the seed list changes with different ocuses, as target parameters: N of 50 and 75 neighbours, each combined with the curences lemma- domain intersection count k of 2, 3 and 4. Note that these refer only to tization the list of potentially metaphoric lemmas, and not to the analy- errors sis of their use, which can only be analysed in context. We see or other that the initially selected parameters (50 neighbours and 3 recur- source/target rences) are an acceptable middle-ground between precision and domain size while still maintaining an unsupervised approach, however, had we wanted more examples, we could increase the parameter Boj [fight] 57 11 46 N or decrease the parameter k. Boriti [to fight] 16 3 13 For the recall, we are not able to carry out a systematic eval- Soočati [to confront] 17 10 7 uation. Nevertheless, based on metaphor clusters analysis men- Spopad [to combat] 6 6 tioned above, we identified the set of additional words that belong Spopadanje [combat- 6 6 to the military vocabulary: fronta, strategija, preboj, akcijski, vo- ting] jen, sovražnik [front, strategy, breakthrough, action [ADJ], war Zoperstaviti [to 5 5 [ADJ], enemy]. The words vojen [war[ADJ]] and sovražnik [en- oppose] emy] would have been included if we lowered the intersection Bitka [battle] 5 1 4 parameter to k = 2 at N = 50 neighbours or extended the vo- Napad [attack] 41 37 4 cabulary by N = 75 neighbours while keeping the intersection Podleči [succumb] 5 1 4 parameter k = 3. 
Other metaphorical expressions occurring in Spopadati [to combat] 5 1 4 the corpus (fronta, preboj, strategija, akcijski) [front, strategy, Bojen [combat [ADJ]] 17 15 2 breakthrough, action [ADJ]] are not found anywhere in the first Borba [battle] 3 1 2 200 neighbours of any of the words, indicating perhaps that the Braniti [to defend] 4 2 2 number of neighbours might be further increased. However, we Napasti [to attack] 6 4 2 observe that increasing the number of neighbours leads to fuzzier Obramben [defense 9 7 2 results. The added vocabulary using 75, 100, 150, and 200 near- [ADJ]] est neighbours of our initial seed words includes increasingly Soočanje [confronting] 2 2 more words unrelated to the topic of war and some very common Soočiti [to confront] 6 4 2 words, which would need additional filtering. We assume that Žrtev [victim] 49 47 2 the reason for this is that words commonly used metaphorically Borec [fighter] 3 2 1 (conventional or dead metaphors) are “displaced” in the vector Izbojevati [to fight] 1 1 space of embeddings, moving away from the words in their orig- Obraniti [to defend] 1 1 inal semantic domains and closer to words in other semantic Štab [base, headquar- 3 2 1 domains – target domains. For example, we observed a lot of ters] sports expressions in our extended vocabulary (e.g. “ball”, “goal”, Udariti [to hit] 2 2 “goalpost”). This shows how entrenched metaphors are in our Uiti [to escape] 2 1 1 language: in the vector space of word embeddings, the seman- Zmagati [to win] 5 4 1 tic domains are already “muddled”. In the present example, this TOTAL 270 147 123 could be a due to the frequent linguistic manifestations of the conceptual metaphor COMPETITION IS WAR. domain forming so called metaphor clusters ([10]). 
Thus, we managed to capture some metaphorical expressions that appeared in close vicinity (in the same sentence) of the found metaphor- ical expressions: fronta, strategija, preboj, akcijski načrt, vojna mentaliteta, sovražnik [front, strategy, breakthrough, action plan, war mentality, enemy]. For instance, our method found the sen- tence below which, in addition to the word bitka [battle] in our candidate list, contains a metaphorical use of the word fronta [front]. • Bitka proti virusu na več frontah [Battle against the virus on multiple fronts] 4 ANALYSING DIFFERENT PARAMETER SETTINGS Figure 1: Analysis of vocabulary extension parameters N Some of the expressions mentioned above would have been cap- and k tured had we modified the parameters of vocabulary extension. Namely, we experimented with using more nearest neighbours 39 Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia Brglez et al. 5 CONCLUSION [9] Matjaž Juršič, Igor Mozetič, Tomaž Erjavec, and Nada Lavrač. 2010. Lemmagen: multilingual lemmatisation with We present an innovative approach using word embeddings as induced ripple-down rules. Journal of Universal Computer a tool for extending the vocabulary of potentially metaphoric Science, 16, 9, 1190–1214. http://www.jucs.org/jucs_16_9/ expressions and identify them in corpora. Our approach shows lemma_gen_multilingual_lemmatisation|. promise in that it correctly identifies numerous such expressions [10] Veronika Koller. 2003. Metaphor clusters, metaphor chains: and confirms that intersections of semantic spaces of metaphor- analyzing the multifunctionality of metaphor in text. In ical seed words can be used to refine the quest for words per- volume 5, 115–134. taining to the military domain. Nevertheless, some metaphoric [11] George Lakoff and Mark Johnson. 1980. Metaphors we live expressions are missed by our method and the experiment still by. University of Chicago press. needs manual analysis. 
Further research and experiments would [12] George Lakoff and Mark Johnson. 2003. Metaphors we live be needed for a larger expansion of vocabulary and a finer filter- by. University of Chicago press. ing approach as well as comparing different word embeddings, [13] Nikola Ljubešić and Tomaž Erjavec. 2018. Word embed- possibly those trained on more literal language. dings CLARIN.SI-embed.sl 1.0. Slovenian language resource ACKNOWLEDGMENTS repository CLARIN.SI. (2018). http://hdl.handle.net/11356/ 1204. This work is supported by the Slovenian Research Agency by [14] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. the research core funding P6-0215 and P2-0103, as well as by Linguistic regularities in continuous space word represen- the research project CANDAS ( J6-2581). The work has also been tations. In Proceedings of the 2013 Conference of the North supported by the European Union’s Horizon 2020 research and in- American Chapter of the Association for Computational Lin- novation programme under grant agreement No. 825153, project guistics: Human Language Technologies. Association for EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Computational Linguistics, Atlanta, Georgia, (June 2013), Languages in European News Media). The results of this paper 746–751. https://aclanthology.org/N13- 1090. reflect only the authors’ view and the Commission is not re- [15] Jakub Piskorski, Bogdan Babych, Zara Kancheva, Olga sponsible for any use that may be made of the information it Kanishcheva, Maria Lebedeva, Michał Marcińczuk, Preslav contains. Nakov, Petya Osenova, Lidia Pivovarova, Senja Pollak, REFERENCES Pavel Přibáň, Ivaylo Radev, Marko Robnik-Sikonja, Vasyl Starko, Josef Steinberger, and Roman Yangarber. 2021. [1] Špela Antloga. 2020. Metaphor corpus KOMET 1.0. Slove- Slav-NER: the 3rd cross-lingual challenge on recognition, nian language resource repository CLARIN.SI. (2020). 
http: normalization, classification, and linking of named entities //hdl.handle.net/11356/1293. across Slavic languages. In Proceedings of the 8th Workshop [2] Benjamin R. Bates. 2020. The (in)appropriateness of the on Balto-Slavic Natural Language Processing. Association war metaphor in response to SARS-CoV-2: a rapid analysis for Computational Linguistics, Kiyv, Ukraine, 122–133. of Donald J. Trump’s rhetoric. Frontiers in Communication, https://aclanthology.org/2021.bsnlp- 1.15. 5, 50, (June 2020). doi: 10.3389/fcomm.2020.000505. [16] Senja Pollak, Andraž Repar, Matej Martinc, and Vid Pod- [3] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural pečan. 2019. Karst exploration: extracting terms and defi- Language Processing with Python. (1st edition). O’Reilly nitions from karst domain corpus. In Proceedings of eLex Media, Inc. 2019, 934–956. [4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and [17] G. Steen. 2010. A Method for Linguistic Metaphor Identifica- Tomas Mikolov. 2017. Enriching word vectors with sub- tion: From MIP to MIPVU. Converging evidence in language word information. Transactions of the Association for Com- and communication research. John Benjamins Publishing putational Linguistics, 5, (June 2017), 135–146. doi: doi. Company. doi: 10.1075/celcr.14. org10.1162/tacl_a_00051. [18] Špela Vintar, Larisa Grcic, Matej Martinc, Senja Pollak, [5] Eunice Castro Seixas. 2021. War metaphors in political and Uroš Stepišnik. 2020. Mining semantic relations from communication on Covid-19. Frontiers in Sociology, 5, 112. comparable corpora through intersections of word embed- doi: 10.3389/fsoc.2020.583680. dings. In (May 2020). https://aclanthology.org/2020.bucc- [6] Jane Demmen, Elena Semino, Zsófia Demjén, Veronika 1.5.pdf . Koller, Andrew Hardie, Paul Rayson, and Sheila Payne. [19] Philipp Wicke and Marianna M. Bolognesi. 2020. Framing 2015. 
Topic modelling and sentiment analysis of COVID-19 related news on Croatian Internet portal

Maja Buhin Pandur, Jasminka Dobša
Faculty of Organization and Informatics, University of Zagreb, Varaždin, Croatia
mbuhin@foi.hr, jasminka.dobsa@foi.hr

Slobodan Beliga, Ana Meštrović
University of Rijeka, Department of Informatics, and University of Rijeka, Center for Artificial Intelligence and Cybersecurity, Rijeka, Croatia
sbeliga@uniri.hr, amestorovic@uniri.hr

ABSTRACT
The research aims to identify topics and sentiments related to the COVID-19 pandemic in Croatian online news media. For the analysis, we used news related to the COVID-19 pandemic from the Croatian portal Tportal.hr published from 1 January 2020 to 19 February 2021. Topic modelling was conducted using the LDA method, while dominant emotions and sentiments related to the extracted topics were identified with the National Research Council Canada (NRC) word-emotion lexicon, created originally for English and translated into Croatian, among other languages. We believe that the results of this research will enable a better understanding of the crisis communication in the Croatian media related to the COVID-19 pandemic.

KEYWORDS
News media, sentiment, emotions, pandemic, lexicon approach, Latent Dirichlet Allocation

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia
© 2020 Copyright held by the owner/author(s).

1 INTRODUCTION
There are three major approaches to sentiment and emotion analysis in text: the lexicon-based approach, the machine-learning-based approach [12], and the most recent deep-learning approach. In this research, we used a hybrid approach, applying the method of Latent Dirichlet Allocation (LDA) for topic modelling [6] and a lexicon approach using the NRC word-emotion lexicon [13] for the detection of sentiments (positive or negative) and basic emotions, according to Plutchik's model of emotions [15], in the extracted topics.

The main goal of this paper is to analyse sentiments and emotions in crisis communication in the news related to the COVID-19 pandemic published on a Croatian online portal. Our goal was complicated in this research because the articles belong to an objective rather than a subjective type of reporting. Another problem is the lack of lexical resources for sentiment and emotions in the Croatian language. Glavaš and co-workers [10] developed a Croatian sentiment lexicon called CroSentiLex, which consists of positive and negative lists of words ranked with PageRank scores. Nevertheless, there is no available lexicon for the analysis of emotions in the Croatian language. Our analysis uses the NRC word-emotion lexicon, initially developed for English and translated into 104 languages, including Croatian. Such an approach has disadvantages due to cultural differences, but developing emotion lexicons for low-resource languages such as Croatian is very demanding.

Sentiment analysis of COVID-19 related texts has been conducted mainly for texts written in English, such as the research by Shofiya and Abidi [17], where the SentiStrength tool was used to detect the polarity of tweets and a support vector machine (SVM) algorithm was employed for sentiment classification. In [14], tweets about COVID-19 in Brazil written in Brazilian Portuguese are, due to the lack of language resources, analysed by translating the original text from Portuguese to English and using the resources available for English. Regarding the Croatian social media space, Twitter communication has been analysed through sentiment analysis [2] and COVID-19 information spreading [3]. Crisis communication of Croatian online portals has already been explored by topic modelling of COVID-19 related articles [7]; however, that research did not include further sentiment and emotional analysis of topics. In [4], information monitoring and named entity recognition were conducted on news portal texts related to pandemics.
2 METHODS

2.1 Latent Dirichlet Allocation
LDA is a generative, probabilistic, hierarchical Bayesian model that induces topics from a document collection [5, 6]. The intuition behind topic modelling using LDA is that documents exhibit multiple topics. A topic is formally defined as a distribution over a fixed vocabulary. Induction of topics is done in three steps:
• Each document in the collection is distributed over topics that are sampled using a Dirichlet distribution.
• Each word in the document is connected with one single topic based on a Dirichlet distribution.
• Each topic is defined as a multinomial distribution over the words that are assigned to the sampled topics.

2.2 Number of topics estimation
Before performing LDA topic modelling, the number of topics has to be estimated. In this research we used four metrics from the R package ldatuning: Arun2010 [1], CaoJuan2009 [8], Deveaud2014 [9], and Griffiths2004 [11]. The measures Arun2010 and CaoJuan2009 have to be minimised, while the measures Deveaud2014 and Griffiths2004 have to be maximised. However, as Arun2010 and CaoJuan2009 generally decrease with the number of topics, and Deveaud2014 and Griffiths2004 increase with the number of topics, we chose the number of topics as the value at which the observed measures start to stagnate.

2.3 Detection of sentiments and emotions
For the association of sentiments and emotions to the extracted topics, the NRC word-emotion lexicon [13] was used, which consists of 14,182 words with scores of 0 or 1 according to their association with positive or negative sentiment or with one of the eight emotions of Plutchik's model (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) [15]. The lexicon was created manually by crowdsourcing on Mechanical Turk. For every sentiment and emotion, we created a vector with a distribution of zeros and ones over the words of a controlled dictionary created from the collection. The association of topics to sentiments and emotions is calculated as the cosine similarity between the topic vectors and the corresponding vector of a sentiment or emotion.

3 EXPERIMENT

3.1 Data set and preprocessing
The data set used for the research consists of articles from the Internet portal Tportal.hr related to the COVID-19 pandemic crisis, collected from 1 January 2020 to 19 February 2021. Each article included in the dataset is defined as a COVID-19 article only if it contains at least one keyword related to the coronavirus theme. For article filtering we use a COVID-19 thesaurus, which contains about thirty of the most important words describing the SARS-CoV-2 virus epidemic together with their corresponding morphological variations. From the total of 31,177 articles, according to the defined filtering, the dataset used in the experiment consists of 12,080 COVID-19 related articles. Articles on the portal are categorised into one of nine main categories: Biznis (Business), Sport, Kultura (Culture), Tehno (Techno), Showtime, Lifestyle, Autozona (Autozone), Funbox, and Vijesti (News) (see Table 1).

Documents of the collection are created using text from the article's subcategory, introduction, main text, and tags. The collection is preprocessed by removing English and Croatian stop words and numbers and by performing lemmatisation. A term-document matrix is created using the tf-idf weighting scheme. The collection is indexed by terms contained in at least four documents of the collection, and the final list of index terms contained 31,121 terms. Topic modelling by LDA is conducted using the stm package in R [16].

Table 1: Number of articles from the dataset categorised into one of nine main categories

Category    Number of COVID-19 articles
Business    2,767
Sport       2,008
Culture       894
Techno        101
Showtime    1,352
Lifestyle   1,442
Autozone      124
Funbox         58
News        3,334

3.2 Results
As a first step, the number of topics had to be estimated. Since articles on the portal are categorised into nine main categories, we examined numbers of topics from 5 to 15. We chose nine topics, since the metrics started to stagnate for a higher number of topics (see Figure 1).

[Figure 1: Metrics for estimation of the best fitting number of topics for 5 to 15 topics]
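The association step of Section 2.3 can be sketched in a few lines; the vocabulary, the lexicon entries, and the topic probabilities below are toy stand-ins for the controlled dictionary, the NRC lexicon, and the LDA output.

```python
# Minimal sketch of Section 2.3: a topic is associated with an emotion via
# the cosine similarity between the topic's word-probability vector and a
# 0/1 indicator vector derived from an emotion lexicon. All values are
# invented toy stand-ins for the paper's dictionary and the NRC lexicon.
import numpy as np

vocab = ["disease", "die", "vaccine", "hope", "match", "win"]

# 0/1 membership of each vocabulary word in the "fear" category (toy values).
fear = np.array([1, 1, 0, 0, 0, 0], dtype=float)

# Word-probability vectors of two toy topics over the same vocabulary.
topic_daily = np.array([0.4, 0.3, 0.2, 0.0, 0.05, 0.05])
topic_sport = np.array([0.05, 0.0, 0.0, 0.1, 0.45, 0.4])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(topic_daily, fear))  # high: the "daily reports" topic leans toward fear
print(cosine(topic_sport, fear))  # low
```

On the real data, one such indicator vector is built per sentiment and per emotion, and the similarities are compared across the nine extracted topics.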
Table 2: Top 10 words with the largest probabilities over topics, and top 10 words with a negative sentiment with the largest probabilities over topics, both sorted in descending order of their probabilities. Topics are sorted by their representation in documents in descending order.

Topic 7 – Daily reports
words by theme: osoba (person), koronavirus (coronavirus), covid, slučaj (case), mjera (measure), broj (number), županija (county), nov (new), sat (hour), bolnica (hospital)
words by negative sentiment: bolest (disease), virus (virus), zaraziti (to infect), zaraza (infection), epidemija (epidemic), umrijeti (to die), velik (big), infekcija (infection), zarazan (contagious), simptom (symptom)

Topic 1 – Sport
words by theme: koronavirus (coronavirus), liga (league), klub (club), nogometni (football), igrač (player), godina (year), utakmica (match), sezona (season), hrvatski (Croatian), nogomet (football)
words by negative sentiment: igrač (player), velik (big), problem (problem), epidemija (epidemic), odgoditi (to delay), prekinuti (to interrupt), čekati (to wait), borba (fight), napraviti (to make), posljedica (consequence)

Topic 8 – Culture
words by theme: godina (year), film (film), nov (new), festival (festival), program (program), hrvatski (Croatian), Zagreb, kultura (culture), kazalište (theater), knjiga (book)
words by negative sentiment: velik (big), mali (small), predstavljati (to present), nastup (appearance), otkazati (to cancel), odgoditi (to delay), smrt (death), rat (war), strana (side), kritika (critique)

Topic 2 – Vaccination and epidemic measures
words by theme: cijepljenje (vaccination), cjepivo (vaccine), zemlja (country), europski (European), koronavirus (coronavirus), doza (dose), predsjednik (president), vlada (government), mjera (measure), čovjek (man)
words by negative sentiment: vlada (government), velik (big), epidemija (epidemic), red (order), borba (fight), sud (court), granica (border), problem (problem), potreban (required), upozoriti (to warn)

Topic 9 – Business 2
words by theme: nov (new), proizvod (product), automobil (car), velik (big), godina (year), hrvatska (Croatia), proizvodnja (production), tvrtka (company), trgovina (market), kupac (buyer)
words by negative sentiment: velik (big), nafta (oil), epidemija (epidemic), lanac (chain), smanjiti (to decrease), kriza (crisis), mali (small), zaraza (infection), problem (problem), utjecaj (influence)

Topic 3 – Earthquake and government measures
words by theme: mjera (measure), hrvatska (Croatia), vlada (government), rad (labor), pomoć (help), potpora (support), odluka (decision), potres (earthquake), zaštita (protection), Zagreb
words by negative sentiment: potres (earthquake), velik (major), pogoditi (to hit), potreban (required), posao (job), šteta (damage), prijava (report), republika (republic), poziv (call), posljedica (consequence)

Topic 4 – Lifestyle
words by theme: modni (fashion), godina (year), pandemija (pandemic), nov (new), koronavirus (coronavirus), poznat (famous), moda (fashion), obitelj (family), brend (brand), model (model)
words by negative sentiment: velik (big), nositi (to wear), izolacija (isolation), veza (relationship), majka (mother), dug (debt), djevojka (girl), znak (sign), mali (small), pun (full)

Topic 5 – Generally
words by theme: čovjek (man), vrijeme (time), znati (to know), virus (virus), velik (big), život (life), dan (day), dijete (child), koronavirus (coronavirus), dobro (good)
words by negative sentiment: velik (big), virus (virus), problem (problem), posao (job), napraviti (to make), bolest (disease), mali (small), potreban (required), teško (hard), nositi (to wear)

Topic 6 – Business 1
words by theme: posto (percentage), godina (year), pad (drop), velik (big), pandemija (pandemic), tržište (market), rast (growth), kuna, gospodarstvo (economy), banka (bank)
words by negative sentiment: pad (drop), velik (big), kriza (crisis), vlada (government), prihod (income), smanjiti (to decrease), mali (small), trošak (expenditure), posljedica (consequence), epidemija (epidemic)

Topics were labelled based on the words with the largest probabilities in the topic vectors (keywords), shown in Table 2. Some of the topics are directly connected to the main categories on the portal: the first topic is labelled Sport, the fourth topic Lifestyle, and the eighth topic Culture, while the sixth and the ninth topics are connected to the business world and are labelled Business 1 and Business 2. Business 1 is associated with the capital market, while Business 2 is associated with production. Topic 2 is associated with vaccination and epidemic measures, while Topic 3 is associated with the earthquake and government measures. Topic 5 seems rather general, containing stories of a pandemic world, while Topic 7 contains daily reports on the state of the pandemic.

We found that all topics are mainly associated with negative sentiment. Table 2 lists the words associated with negative sentiment with the largest probabilities across topics, while the words associated with positive sentiment coincided with the words of the topics' themes. This list gives some insight into which stories "bear" negative sentiment in the topics.

Figure 2 shows the association of topics to sentiments and emotions. The ratio of positive to negative sentiment is best for the categories Sport and Culture. These categories and Lifestyle are the only categories associated with joy as one of the dominant emotions. Surprise and anticipation are dominant emotions across all topics. The categories Vaccination and epidemic measures, Earthquake and government measures, Generally, and Business 1 are associated with the emotion of sadness, while the categories Vaccination and epidemic measures and Daily reports are associated with fear.

[Figure 2: Association of topics to sentiments and emotions]

4 CONCLUSIONS AND FURTHER WORK
The main goal of this paper was to analyse sentiments and emotions in crisis communication in the news related to the COVID-19 pandemic. For that purpose, we created our own collection of documents from articles on an Internet news portal connected to the pandemic crisis and analysed it utilising the LDA method for the extraction of prevalent topics in the collection and the NRC word-emotion lexicon for the detection of sentiments and emotions associated with the extracted topics.

The application of LDA resulted in relatively intuitive topics. Some of them can be associated with the main categories of the observed portal, and the others are related to the actual situation in a pandemic world in Croatia: vaccination, the earthquake (there were two great earthquakes in Croatia in 2020), general stories, daily reports. It is shown that all extracted topics are associated dominantly with negative sentiment, while the prevalent emotions are anticipation, surprise, sadness, and fear.

By this research, we have gained insight into how the COVID-19 pandemic crisis was communicated to the public. To gain insight into how the public experienced the crisis, we could apply the same methodology to the comments on articles or on social networks. This could be a direction for further work. Also, it would be interesting to investigate how topics and sentiments/emotions change and evolve over time.

ACKNOWLEDGEMENTS
This work has been supported in part by the Croatian Science Foundation under the project IP-CORONA-04-2061, "Multilayer Framework for the Information Spreading Characterization in Social Media during the COVID-19 Crisis" (InfoCoV), and by the University of Rijeka project number uniri-drustv-sp-20-58.

REFERENCES
[1] R. Arun, V. Suresh, C. E. Madhavan and M. Narasima Murty. 2010. On finding the natural number of topics with Latent Dirichlet Allocation: some observations. In Proceedings of Advances in Knowledge Discovery and Data Mining, 14th Pacific-Asia Conference (PAKDD 2010), Hyderabad, India. doi: 10.1007/978-3-642-1357-3_43.
[2] K. Babić, M. Petrović, S. Beliga, S. Martinčić-Ipšić, A. Jarynowski and A. Meštrović. 2022. COVID-19-related communication on Twitter: analysis of the Croatian and Polish attitudes. In X. S. Yang, S. Sherratt, N. Dey and A. Joshi (eds.), Proceedings of Sixth International Congress on Information and Communication Technology, Lecture Notes in Networks and Systems, vol. 216. Springer, Singapore. Available at https://link.springer.com/chapter/10.1007%2F978-981-16-1781-2_35.
[3] K. Babić, M. Petrović, S. Beliga, S. Martinšić-Ipšić, M. Pranjić and A. Meštrović. 2021. Prediction of COVID-19 related information spreading on Twitter. In Proceedings of the IEEE International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2021), accepted for publication.
[4] S. Beliga, S. Martinčić-Ipšić, M. Matešić and A. Meštrović. 2021. Natural language processing and statistics: the first six months of the COVID-19 infodemic in Croatia. In The COVID-19 Pandemic as a Challenge for Media and Communication Studies, K. Kopecka-Piech and B. Łódzki, Eds., Routledge, Taylor & Francis Group, accepted for publication.
[5] D. M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4), 77-84. doi: 10.1145/2133806.2133826.
[6] D. M. Blei, A. Y. Ng and M. I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
[7] P. K. Bogović, S. Beliga, A. Meštrović and S. Martinčić-Ipšić. 2021. Topic modelling of Croatian news during the COVID-19 pandemic. In Proceedings of the IEEE International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2021), accepted for publication.
[8] J. Chao, L. Tian, Z. Jintao, T. Yongdong and S. Tang. 2009. A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781. doi: 10.1016/j.neucom.2008.06.0011.
[9] R. Deveaud, E. Sanjuan and P. Bellot. 2014. Accurate and effective latent concept modeling for ad hoc information retrieval. Document Numérique, 17(1). doi: 10.3166/dn.17.1.61-84.
[10] G. Glavaš, J. Šnajder and B. Dalbelo Bašić. 2012. Semi-supervised acquisition of Croatian sentiment lexicon. In Proceedings of the 15th International Conference on Text, Speech and Dialogue, TSD 2012, Brno, 166-173.
[11] T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, Suppl 1(1), 5228-35. doi: 10.1073/pnas.0307752101.
[12] H. Lane, C. Howard and H. Hapke. 2019. Natural Language Processing in Action. Manning Publications, New York, NY.
[13] S. M. Mohammad and P. D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3), 436-465.
[14] T. Melo and C. M. S. Figueiredo. 2021. Comparing news articles and tweets about COVID-19 in Brazil: sentiment analysis and topic modeling approach. JMIR Public Health and Surveillance, 7(2). doi: 10.2196/24585.
[15] R. Plutchik. 1962. The Emotions. Random House, New York, NY.
[16] M. Roberts, B. M. Stewart and D. Tingley. 2019. stm: an R package for structural topic models. Journal of Statistical Software, 91(2), 1-40. doi: 10.18637/jss.v091.i02.
[17] C. Shofiya and S. Abidi. 2021.
Sentiment analysis on COVID-19-related social distancing in Canada using Twitter data. International Journal of Environmental Research and Public Health, 18(11), 1-10.

Tackling Class Imbalance in Radiomics: the COVID-19 Use Case

Jože M. Rožanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, joze.rozanec@ijs.si
Tim Poštuvan, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, tim.postuvan@epfl.ch
Blaž Fortuna, Qlector d.o.o., Ljubljana, Slovenia, blaz.fortuna@qlector.com
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si
*Both first authors contributed equally to this research.

ABSTRACT
Since the start of the COVID-19 pandemic, much research has been published highlighting how artificial intelligence models can be used to diagnose a COVID-19 infection based on medical images. Given the scarcity of published images and the heterogeneous sources, formats, and labels, generative models can be a promising solution for data augmentation. We propose performing data augmentation in the embedding space, saving computation power and storage. Moreover, we compare different class imbalance mitigation strategies and machine learning models. We find that CTGAN data augmentation shows promising results. The best overall performance was obtained with a GBM model trained with focal loss.

CCS CONCEPTS
• Information systems → Data mining; • Computing methodologies → Computer vision problems; • Applied computing

KEYWORDS
COVID-19, CT Scans, Imbalanced Dataset, Data Augmentation, Computer-Aided Diagnosis, Radiomics, Artificial Intelligence, Machine Learning

ACM Reference Format:
Jože M. Rožanec, Tim Poštuvan, Blaž Fortuna, and Dunja Mladenić. 2021. Tackling Class Imbalance in Radiomics: the COVID-19 Use Case. In Ljubljana '21: Slovenian KDD Conference on Data Mining and Data Warehouses, October, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 4 pages.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SiKDD '21, October, 2021, Ljubljana, Slovenia
© 2021 Copyright held by the owner/author(s).

1 INTRODUCTION
In December 2019, an outbreak of the coronavirus SARS-CoV-2 infection (a.k.a. COVID-19) began in Wuhan, China. The disease rapidly spread across the world, and on January 30th 2020 the World Health Organization (WHO) declared a global health emergency. The most common COVID-19 symptoms are dry cough, sore throat, fever, loss of taste or smell, diarrhea, myalgia, and dyspnea [5]. In addition, older people or people with previous medical problems (e.g., diabetes, obesity, or hypertension) are more likely to develop a severe form of the disease [12, 42], which can derive into multiple organ failure, acute respiratory distress syndrome, fulminant pneumonia, heart failure, arrhythmias, or renal failure, among others [37, 40].

Expert radiologists have observed that the impact of the COVID-19 infection on the respiratory system can be discriminated from other viral pneumonia in computed tomography (CT) scans [7, 39]. The most frequent radiological signs include irregular ground-glass opacities and consolidations, observed mostly in the peripheral and basal sites [31]. While such opacities were observed up to a maximum of seven days before the symptoms' onset [25], they progress rapidly and remain a long time after the onset [35, 38]. While such opacities can also be observed on chest radiography, it has low sensitivity, which can lead to misleading diagnoses in early COVID-19 stages, and thus a CT scan is preferred [38].

Scientific studies have shown that Artificial Intelligence (AI) is a promising technology transforming healthcare and medical practice, helping with some clinicians' tasks (e.g., decision support, or providing disease diagnoses) [45]. In particular, the field of radiomics studies how to mine medical imaging data to create models that support or execute such tasks. Given that distinct patterns can be observed on chest radiographies and CT scans, clinicians and researchers have sought to use AI for COVID-19 diagnostics [31].

There are multiple challenges associated with radiomics and, in particular, with the COVID-19 diagnosis use case. Despite the limitations that can exist regarding privacy concerns [26, 44], many datasets have been made publicly available. Of those datasets, many are limited to a few cases [35]; were collected from different sources and image protocols, and thus cannot be merged (e.g., the gray levels across images can have different meanings [7]); or were labeled at different granularity levels (e.g., patient level or slice level) [2]. Therefore, models developed from these datasets cannot always be ported to a specific environment. Finally, limitations can exist regarding data collection, further limiting the data available to develop working models to diagnose the disease.

The main contributions of this research are (i) a comparative study of four data augmentation strategies used to deal with class imbalance, (ii) across eight frequently cited machine learning algorithms, based on a real-world dataset of chest CT scans annotated with their COVID-19 diagnosis. We developed the machine learning models with images provided by the Medical Physics Research Group at the University of Ljubljana and made them available as part of the RIS competition (http://tiziano.fmf.uni-lj.si/).
We report the models' discrimination power in terms of the area under the receiver operating characteristic curve (AUC ROC). The AUC ROC is a widely adopted classification metric that quantifies the sensitivity and specificity of the model while being invariant to a priori class probabilities.

This paper is organized as follows. Section 2 outlines related scientific works, Section 3 provides an overview of the use case, and Section 4 details the methodology. Finally, Section 5 presents and discusses the results obtained, while Section 6 concludes and describes future work.

2 RELATED WORK
The field of radiomics is concerned with extracting high-dimensional data from medical images, which can be mined to provide diagnoses and prognoses, assuming the image features reflect an underlying pathophysiology [16, 27, 28]. While research in the field is experiencing exponential growth, multiple authors have warned about common issues affecting the quality and reproducibility of radiomics research and have proposed several criteria that should be met to mitigate them (e.g., RQS, CLAIM, or TRIPOD) [10, 27, 32]. It has also been observed that the translation into clinical use has been slow [13].

Since the start of the COVID-19 pandemic, much research has been published highlighting how AI models could be used to issue COVID-19 diagnoses based on medical images. While much research was invested into transfer learning leveraging pre-trained deep learning models, or into the use of deep learning models as feature extractors [24], some authors also experimented with handcrafted features [7]. The most common machine learning approaches involved the use of deep learning (end-to-end models, or pre-trained models for feature extraction) [14, 23, 34, 36, 43], Support Vector Machines (SVM) [4, 7, 14, 22, 23, 34, 36, 38, 43], k-Nearest Neighbors (kNN) [14, 22, 23, 38, 43], Random Forests (RF) [22, 23, 36], CART [22, 23, 36], Naïve Bayes [22, 23], and Gradient Boosted Machines (GBM) [6, 22].

Two commonly faced challenges regarding COVID-19 diagnoses based on medical images are image scarcity and class imbalance. Given the heterogeneity of the datasets, it is not always possible to merge them [2, 7, 35]. Thus, some researchers successfully experimented with generative adversarial networks (GANs) to generate new images that comply with the patterns existing in the dataset [1, 34]. GANs provide means to learn deep representations from labeled data and to generate new data samples through a competition involving two models: the generator, which learns to generate new images only from its interaction with the discriminator; and the discriminator, which has access to the real and synthetic data instances and tries to tell the difference between them [3, 11]. While this method was first applied to images [17], new approaches were developed to adapt it to tabular data [41].

The fact that the classification categories are not approximately equally represented in a dataset can affect how machine learning algorithms learn, as well as their performance on unseen data, where the distribution can differ from the one observed in the training data [8]. For these reasons, care must be taken to select metrics that are not sensitive to such imbalance. Among the common strategies to deal with class imbalance we find data oversampling methods, which aim to increase the number of data instances of the minority class to balance the dataset. Oversampling methods can add data instances derived from existing ones by replicating them (e.g., using a naïve random sampler that draws new samples by randomly sampling with replacement from the available train samples), or by creating synthetic data instances (e.g., through SMOTE [9], ADASYN [19], or GANs). In addition to data oversampling, the Focal Loss [29] can be used with specific algorithms. The Focal Loss reshapes the cross-entropy loss to down-weight well-classified examples while focusing on the misclassified ones, achieving better discrimination. Finally, while the techniques mentioned above are useful for classification, we can also reframe the problem as an anomaly detection problem, attempting to detect which data instances correspond to the minority class (the anomaly).

Through the research we reviewed, we found one paper describing the use of SMOTE [14] and two papers using GANs [1, 34] for data augmentation at the image level. We found no paper performing a more extensive assessment of the influence of class imbalance, nor one comparing class imbalance strategies with respect to the COVID-19 detection models' outcomes. We propose utilizing data augmentation techniques that generate new embeddings instead of full images. Such an approach provides similar information in the embedding space as would be obtained from synthetic images, while enabling widely used techniques for tabular data oversampling. Furthermore, with GANs, new data instances are cheaper to compute and store than they would be if new images were created.

3 USE CASE
The research reported in this paper is done with images provided by the Medical Physics Research Group at the University of Ljubljana and made available as part of the RIS competition. The dataset was built from computed tomography (CT) scans obtained from three datasets reported in [18, 25, 33], corresponding to 289 healthy persons and 66 COVID-19 patients. Healthy persons are determined by a CT score between zero and five, while COVID-19 patients are considered those with a CT score equal to or higher than ten [15]. Each CT scan was segmented into twenty slices, resulting in 7,100 images with an axial view of the lungs, annotated into two classes: COVID-19 and non-COVID-19. The visual inspection of CT scans aims to determine whether the person was infected with the COVID-19 disease. Automating this task reduces manual work and speeds up the diagnosis.

4 METHODOLOGY
We propose using artificial intelligence for an automated COVID-19 diagnosis based on images obtained from CT scan segmentation, posing it as a binary classification problem.
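The Focal Loss mentioned in the related work can be sketched in a few lines; this is an illustrative NumPy version of the binary case, with gamma and alpha values chosen for the example rather than taken from the paper.

```python
# Sketch of the binary focal loss described in the related work: standard
# cross-entropy scaled by (1 - p_t)^gamma, so that well-classified examples
# contribute little and training focuses on hard, misclassified ones.
# gamma and alpha values below are illustrative, not taken from the paper.
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)            # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

y = np.array([1, 1, 0])
p = np.array([0.95, 0.30, 0.10])   # easy positive, hard positive, easy negative
losses = focal_loss(y, p)
# The hard example (p = 0.30 for a positive) dominates the total loss.
print(losses)
```

With gamma = 0 and alpha = 0.5, the expression reduces to a scaled cross-entropy, which makes the down-weighting effect of the modulating factor easy to verify.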
The discrimination equally represented in a dataset can affect how the machine learn- capability of the models is measured with the AUC ROC metric ing algorithms learn and their performance on unseen data, where with a cut threshold of 0.5. the distribution can be different from the one observed in training We use the ResNet-18 model[20] for feature extraction, retrieving the vector produced by the Average Pooling layer. Since the vector consists of 512 features, we perform feature selection computing 1http://tiziano.fmf.uni-lj.si/ the features’ mutual information and selecting the top K to avoid 46 Tackling Class Imbalance in Radiomics: the COVID-19 Use Case SiKDD ’21, October, 2021, Ljubljana, Slovenia √ overfitting. To obtain K, we follow the equation 𝐾 = 𝑁 suggested this approach leads to the best forecast outcomes with a GBM by[21], where N is the number of data instances in the train set. model trained with a Focal Loss on a dataset enriched with new To evaluate the models’ performance across different data aug- CTGAN generated instances. Moreover, we compare this approach mentation strategies, we apply a stratified ten-fold cross-validation. to other imbalanced data strategies, finding that Näive random Data augmentation is performed by introducing additional minority oversampling, SMOTE, and ADASYN degrade the resulting models’ class data samples on the train folds. We consider five imbalance mit- performance compared to the original dataset. Future work will igation strategies: NONE (without data augmentation), RANDOM focus on further understanding the cases where the CTGAN data (näive random sampler), SMOTE, ADASYN, and CTGAN (GAN that augmentation leads to poor results and provide an integral explain- enables the conditional generation of data instances based on a ability model for machine learning classifiers that consume image class label)[41]. No augmentation is performed on the test fold to embeddings. ensure measurements are comparable. 
The performance of the data augmentation strategies is measured across eight machine learning ACKNOWLEDGMENTS algorithms: SVM, kNN, RF, CART, Gaussian Näive Bayes, Multi- This work was supported by the Slovenian Research Agency. The layer Perceptron (MLP), GBM, and Isolation Forest (IF)[30]. Finally, authors acknowledge the Medical Physics Research Group at the we compare the performance of the data augmentation scenarios University of Ljubljana2 for providing the image segmentation data computing the average AUC ROC across the test folds and assess as part of the RIS competition3. if the difference is statistically significant by using the Wilcoxon signed-rank test, using a p-value of 0.05. REFERENCES [1] Erdi Acar, Engin Şahin, and İhsan Yılmaz. 2021. Improving effectiveness of 5 RESULTS AND ANALYSIS different deep learning-based models for detecting COVID-19 from computed tomography (CT) images. Neural Computing and Applications (2021), 1–21. When comparing the results across different imbalance mitigation [2] Parnian Afshar, Shahin Heidarian, Nastaran Enshaei, Farnoosh Naderkhani, strategies (see Table 1), we observed that data augmentation leads Moezedin Javad Rafiee, Anastasia Oikonomou, Faranak Babaki Fard, Kaveh to inferior results in most cases. While this outcome was expected Samimi, Konstantinos N Plataniotis, and Arash Mohammadi. 2021. COVID- CT-MD, COVID-19 computed tomography scan dataset applicable in machine for IF (the minority class is no longer an outlier after data augmen- learning and deep learning. Scientific Data 8, 1 (2021), 1–8. tation), we found that only the CART, MLP, and GBM algorithms [3] Alankrita Aggarwal, Mamta Mittal, and Gopi Battineni. 2021. Generative adversarial network: An overview of theory and applications. achieved better performance with CTGAN data augmentation com- International Journal of Information Management Data Insights (2021), 100004. pared to the original dataset. 
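The oversampling strategies compared in this work can be illustrated with a minimal, dependency-free sketch of the SMOTE heuristic: a synthetic minority sample is created by interpolating between an existing minority sample and one of its k nearest neighbours. This is illustrative only and not the authors' pipeline, which applies library implementations of SMOTE, ADASYN, and CTGAN to 512-dimensional ResNet-18 embeddings.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest neighbours (the core SMOTE heuristic)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by Euclidean distance (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        n = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, n)))
    return synthetic

# Toy 2-D "embeddings" of the minority class (real embeddings are 512-D)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new = smote_like_oversample(minority, n_new=6, k=2)
print(len(new))  # 6 synthetic points, each between two existing ones
```

Because every synthetic point lies on a segment between two real minority samples, the method stays inside the minority region but cannot invent genuinely new patterns, which is consistent with the learning-free character of SMOTE discussed in the results.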
5 RESULTS AND ANALYSIS

When comparing the results across the different imbalance mitigation strategies (see Table 1), we observed that data augmentation leads to inferior results in most cases. While this outcome was expected for IF (the minority class is no longer an outlier after data augmentation), we found that only the CART, MLP, and GBM algorithms achieved better performance with CTGAN data augmentation compared to the original dataset. Moreover, six algorithms achieved their best results when augmented with CTGAN compared to the other data imbalance strategies (except NONE). We confirmed that the AUC ROC differences between the imbalanced dataset strategies were statistically significant, with a few exceptions: SMOTE vs. ADASYN for CART, MLP, and GBM; NONE vs. RANDOM for CART; NONE vs. SMOTE for Naïve Bayes; RANDOM vs. SMOTE for SVM and RF; and RANDOM and SMOTE vs. CTGAN for SVM and IF. From the results obtained, we consider that the success of CTGAN can be attributed to the fact that the generative model can learn over time to generate high-quality data instances based on the discriminator's feedback loop, while naïve random sampling reuses existing instances (providing little new information to the dataset), and the SMOTE and ADASYN algorithms generate new samples based on heuristics without learning capabilities.

We observed that GBM models trained with a Focal Loss achieved the best results on all datasets. Even when no data augmentation is performed and the RF achieves the best result, the difference is not statistically significant compared to the GBM model. The overall best performance was obtained with a GBM model trained over a dataset with CTGAN data augmentation. The reasons behind the performance drop for the kNN, Naïve Bayes, RF, and SVM models remain unclear, and further investigation is required to clarify them. Nevertheless, we consider the CTGAN data augmentation on the embedding space a promising approach.

6 CONCLUSION

This research presents a novel approach towards data augmentation in radiomics by generating new data instances in the embedding space rather than generating new images. We demonstrate that this approach leads to the best forecast outcomes with a GBM model trained with a Focal Loss on a dataset enriched with new CTGAN-generated instances. Moreover, we compare this approach to other imbalanced data strategies, finding that naïve random oversampling, SMOTE, and ADASYN degrade the resulting models' performance compared to the original dataset. Future work will focus on further understanding the cases where CTGAN data augmentation leads to poor results, and on providing an integral explainability model for machine learning classifiers that consume image embeddings.
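The significance testing used to compare the strategies can be sketched in plain Python. The snippet below computes only the Wilcoxon signed-rank statistic W = min(T+, T-) over paired scores (zero differences dropped, tied absolute differences given average ranks); in practice one would use scipy.stats.wilcoxon, which also supplies the p-value. The paired scores below are made-up toy values, not the paper's results.

```python
def wilcoxon_statistic(a, b):
    """Wilcoxon signed-rank statistic W = min(T+, T-) for paired samples.

    Zero differences are discarded; tied absolute differences receive
    the average of the 1-based ranks they span.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # advance j over the run of equal absolute differences
        while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for idx in order[i:j]:
            ranks[idx] = avg_rank
        i = j
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(t_plus, t_minus)

# Toy paired per-fold scores (illustrative values only)
a = [1, 2, 3, 4, 5]
b = [2, 1, 5, 3, 8]
print(wilcoxon_statistic(a, b))  # → 4.0
```

A small W indicates that one member of each pair is systematically larger, which is then checked against the test's critical values (or, with scipy, the returned p-value) at the 0.05 level used in the paper.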
ACKNOWLEDGMENTS

This work was supported by the Slovenian Research Agency. The authors acknowledge the Medical Physics Research Group at the University of Ljubljana2 for providing the image segmentation data as part of the RIS competition3.

2 https://medfiz.si/en
3 http://tiziano.fmf.uni-lj.si/

Table 1: Average AUC ROC values obtained across the ten cross-validation folds. Best results are bolded, second-best results are highlighted in italics.

Class Imbalance
Mitigation Strategy   CART     IF       kNN      MLP      Naïve Bayes   RF       SVM      GBM
NONE                  0.6429   0.6802   0.8504   0.7879   0.6653        0.8601   0.8066   0.8555
RANDOM                0.6402   0.5215   0.7846   0.7993   0.6464        0.6691   0.6888   0.8150
SMOTE                 0.6147   0.5607   0.6813   0.7663   0.6590        0.6660   0.6817   0.7826
ADASYN                0.6020   0.5863   0.6660   0.7655   0.6282        0.6435   0.6652   0.7787
CTGAN                 0.7401   0.5340   0.8118   0.8419   0.6395        0.7090   0.6896   0.8871

REFERENCES
[1] Erdi Acar, Engin Şahin, and İhsan Yılmaz. 2021. Improving effectiveness of different deep learning-based models for detecting COVID-19 from computed tomography (CT) images. Neural Computing and Applications (2021), 1–21.
[2] Parnian Afshar, Shahin Heidarian, Nastaran Enshaei, Farnoosh Naderkhani, Moezedin Javad Rafiee, Anastasia Oikonomou, Faranak Babaki Fard, Kaveh Samimi, Konstantinos N Plataniotis, and Arash Mohammadi. 2021. COVID-CT-MD, COVID-19 computed tomography scan dataset applicable in machine learning and deep learning. Scientific Data 8, 1 (2021), 1–8.
[3] Alankrita Aggarwal, Mamta Mittal, and Gopi Battineni. 2021. Generative adversarial network: An overview of theory and applications. International Journal of Information Management Data Insights (2021), 100004.
[4] Dhurgham Al-Karawi, Shakir Al-Zaidi, Nisreen Polus, and Sabah Jassim. 2020. Machine learning analysis of chest CT scan images as a complementary digital test of coronavirus (COVID-19) patients. MedRxiv (2020).
[5] William E Allen, Han Altae-Tran, James Briggs, Xin Jin, Glen McGee, Andy Shi, Rumya Raghavan, Mireille Kamariza, Nicole Nova, Albert Pereta, et al. 2020. Population-scale longitudinal mapping of COVID-19 symptoms, behaviour and testing. Nature Human Behaviour 4, 9 (2020), 972–982.
[6] Eduardo J Mortani Barbosa, Bogdan Georgescu, Shikha Chaganti, Gorka Bastarrika Aleman, Jordi Broncano Cabrero, Guillaume Chabin, Thomas Flohr, Philippe Grenier, Sasa Grbic, Nakul Gupta, et al. 2021. Machine learning automatically detects COVID-19 using chest CTs in a large multicenter cohort. European Radiology (2021), 1–11.
[7] Mucahid Barstugan, Umut Ozkaya, and Saban Ozturk. 2020. Coronavirus (COVID-19) classification using CT images by machine learning methods. arXiv preprint arXiv:2003.09424 (2020).
[8] Nitesh V Chawla. 2009. Data mining for imbalanced datasets: An overview. Data Mining and Knowledge Discovery Handbook (2009), 875–886.
[9] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[10] Gary S Collins, Johannes B Reitsma, Douglas G Altman, and Karel GM Moons. 2015. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Journal of British Surgery 102, 3 (2015), 148–158.
[11] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. Generative adversarial networks: An overview. IEEE Signal Processing Magazine 35, 1 (2018), 53–65.
[12] Thays Maria Costa de Lucena, Ariane Fernandes da Silva Santos, Brenda Regina de Lima, Maria Eduarda de Albuquerque Borborema, and Jaqueline de Azevêdo Silva. 2020. Mechanism of inflammatory response in associated comorbidities in COVID-19. Diabetes & Metabolic Syndrome: Clinical Research & Reviews 14, 4 (2020), 597–600.
[13] Daniel Pinto Dos Santos, Matthias Dietzel, and Bettina Baessler. 2021. A decade of radiomics research: are images really data or just patterns in the noise?
[14] El-Sayed M El-Kenawy, Abdelhameed Ibrahim, Seyedali Mirjalili, Marwa Metwally Eid, and Sherif E Hussein. 2020. Novel feature selection and voting classifier algorithms for COVID-19 classification in CT images. IEEE Access 8 (2020), 179317–179335.
[15] Marco Francone, Franco Iafrate, Giorgio Maria Masci, Simona Coco, Francesco Cilia, Lucia Manganaro, Valeria Panebianco, Chiara Andreoli, Maria Chiara Colaiacomo, Maria Antonella Zingaropoli, et al. 2020. Chest CT score in COVID-19 patients: correlation with disease severity and short-term prognosis. European Radiology 30, 12 (2020), 6808–6817.
[16] Robert J Gillies, Paul E Kinahan, and Hedvig Hricak. 2016. Radiomics: images are more than pictures, they are data. Radiology 278, 2 (2016), 563–577.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014).
[18] Stephanie A Harmon, Thomas H Sanford, Sheng Xu, Evrim B Turkbey, Holger Roth, Ziyue Xu, Dong Yang, Andriy Myronenko, Victoria Anderson, Amel Amalou, et al. 2020. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets. Nature Communications 11, 1 (2020), 1–7.
[19] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 1322–1328.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[21] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. 2005. Optimal number of features as a function of sample size for various classification rules. Bioinformatics 21, 8 (2005), 1509–1515.
[22] Lal Hussain, Tony Nguyen, Haifang Li, Adeel A Abbasi, Kashif J Lone, Zirun Zhao, Mahnoor Zaib, Anne Chen, and Tim Q Duong. 2020. Machine-learning classification of texture features of portable chest X-ray accurately classifies COVID-19 lung infection. BioMedical Engineering OnLine 19, 1 (2020), 1–18.
[23] Seifedine Kadry, Venkatesan Rajinikanth, Seungmin Rho, Nadaradjane Sri Madhava Raja, Vaddi Seshagiri Rao, and Krishnan Palani Thanaraj. 2020. Development of a machine-learning system to classify lung CT scan images into normal/COVID-19 class. arXiv preprint arXiv:2004.13122 (2020).
[24] Sara Hosseinzadeh Kassania, Peyman Hosseinzadeh Kassanib, Michal J Wesolowskic, Kevin A Schneidera, and Ralph Detersa. 2021. Automatic detection of coronavirus disease (COVID-19) in X-ray and CT images: a machine learning based approach. Biocybernetics and Biomedical Engineering 41, 3 (2021), 867–879.
[25] Michael T Kassin, Nicole Varble, Maxime Blain, Sheng Xu, Evrim B Turkbey, Stephanie Harmon, Dong Yang, Ziyue Xu, Holger Roth, Daguang Xu, et al. 2021. Generalized chest CT and lab curves throughout the course of COVID-19. Scientific Reports 11, 1 (2021), 1–13.
[26] Virendra Kumar, Yuhua Gu, Satrajit Basu, Anders Berglund, Steven A Eschrich, Matthew B Schabath, Kenneth Forster, Hugo JWL Aerts, Andre Dekker, David Fenstermacher, et al. 2012. Radiomics: the process and the challenges. Magnetic Resonance Imaging 30, 9 (2012), 1234–1248.
[27] Philippe Lambin, Ralph TH Leijenaar, Timo M Deist, Jurgen Peerlings, Evelyn EC De Jong, Janita Van Timmeren, Sebastian Sanduleanu, Ruben THM Larue, Aniek JG Even, Arthur Jochems, et al. 2017. Radiomics: the bridge between medical imaging and personalized medicine. Nature Reviews Clinical Oncology 14, 12 (2017), 749–762.
[28] Philippe Lambin, Emmanuel Rios-Velazquez, Ralph Leijenaar, Sara Carvalho, Ruud GPM Van Stiphout, Patrick Granton, Catharina ML Zegers, Robert Gillies, Ronald Boellard, André Dekker, et al. 2012. Radiomics: extracting more information from medical images using advanced feature analysis. European Journal of Cancer 48, 4 (2012), 441–446.
[29] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[30] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 413–422.
[31] Hossein Mohammad-Rahimi, Mohadeseh Nadimi, Azadeh Ghalyanchi-Langeroudi, Mohammad Taheri, and Soudeh Ghafouri-Fard. 2021. Application of machine learning in diagnosis of COVID-19 through X-ray and CT images: a scoping review. Frontiers in Cardiovascular Medicine 8 (2021), 185.
[32] John Mongan, Linda Moy, and Charles E Kahn Jr. 2020. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers.
[33] Sergey P Morozov, Anna E Andreychenko, Ivan A Blokhin, Pavel B Gelezhe, Anna P Gonchar, Alexander E Nikolaev, Nikolay A Pavlov, Valeria Yu Chernina, and Victor A Gombolevskiy. 2020. MosMedData: data set of 1110 chest CT scans performed during the COVID-19 epidemic. Digital Diagnostics 1, 1 (2020), 49–59.
[34] Jawad Rasheed, Alaa Ali Hameed, Chawki Djeddi, Akhtar Jamil, and Fadi Al-Turjman. 2021. A machine learning-based framework for diagnosis of COVID-19 from chest X-ray images. Interdisciplinary Sciences: Computational Life Sciences 13, 1 (2021), 103–117.
[35] Michael Roberts, Derek Driggs, Matthew Thorpe, Julian Gilbey, Michael Yeung, Stephan Ursprung, Angelica I Aviles-Rivero, Christian Etmann, Cathal McCague, Lucian Beer, et al. 2021. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 3 (2021), 199–217.
[36] Prottoy Saha, Muhammad Sheikh Sadi, and Md Milon Islam. 2021. EMCNet: Automated COVID-19 diagnosis from X-ray images using convolutional neural network and ensemble of machine learning classifiers. Informatics in Medicine Unlocked 22 (2021), 100505.
[37] Adekunle Sanyaolu, Chuku Okorie, Aleksandra Marinkovic, Risha Patidar, Kokab Younis, Priyank Desai, Zaheeda Hosein, Inderbir Padda, Jasmine Mangat, and Mohsin Altaf. 2020. Comorbidity and its impact on patients with COVID-19. SN Comprehensive Clinical Medicine (2020), 1–8.
[38] Ahmet Saygılı. 2021. A new approach for computer-aided detection of coronavirus (COVID-19) from CT and X-ray images using machine learning methods. Applied Soft Computing 105 (2021), 107323.
[39] H Swapnarekha, Himansu Sekhar Behera, Janmenjoy Nayak, and Bighnaraj Naik. 2020. Role of intelligent computing in COVID-19 prognosis: A state-of-the-art review. Chaos, Solitons & Fractals 138 (2020), 109947.
[40] Tianbing Wang, Zhe Du, Fengxue Zhu, Zhaolong Cao, Youzhong An, Yan Gao, and Baoguo Jiang. 2020. Comorbidities and multi-organ injuries in the treatment of COVID-19. The Lancet 395, 10228 (2020), e52.
[41] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. 2019. Modeling tabular data using conditional GAN. arXiv preprint arXiv:1907.00503 (2019).
[42] Jing Yang, Ya Zheng, Xi Gou, Ke Pu, Zhaofeng Chen, Qinghong Guo, Rui Ji, Haojia Wang, Yuping Wang, and Yongning Zhou. 2020. Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis. Int J Infect Dis 10, 10.1016 (2020).
[43] Huseyin Yasar and Murat Ceylan. 2021. A novel comparative study for detection of Covid-19 on CT lung images using texture analysis, machine learning, and deep learning methods. Multimedia Tools and Applications 80, 4 (2021), 5423–5447.
[44] Stephen SF Yip and Hugo JWL Aerts. 2016. Applications and limitations of radiomics. Physics in Medicine & Biology 61, 13 (2016), R150.
[45] Kun-Hsing Yu, Andrew L Beam, and Isaac S Kohane. 2018. Artificial intelligence in healthcare. Nature Biomedical Engineering 2, 10 (2018), 719–731.


Observing Water-Related Events for Evidence-Based Decision-Making

SIKDD '21, October 2021, Ljubljana, Slovenia

Joao Pita Costa *, *****, M. Besher Massri *, Inna Novalija *, Ignacio Casals del Busto **, Iulian Mocanu ***, Maurizio Rossi ****, Jan Šturm *, Eva Erzin *, Alenka Guček *, Matej Posinković *, Marko Grobelnik *

* Institute Jozef Stefan, Slovenia; ** Aguas del Alicante, Spain; *** Apa Braila, Romania; **** Ville de Carouge, Switzerland; ***** Quintelligence, Slovenia

ABSTRACT

With the awareness of a changing climate impacting our sustainability, and in line with the European Green Deal initiative and the Sustainable Development Goal 6 addressing water, industry, society and local governments require reliable and comprehensive technology that can provide them with an overview of water events to anticipate problems, and with the tools to analyse the best practices appropriate to solve them. This paper presents the NAIADES Water Observatory (NWO), a digital solution offering a series of analyses and visualisations of water-related topics, helping users to extract important insights in relation to the water sector. It takes advantage of heterogeneous data sources, from the media and social media landscape to published research and global/local indicators. Through collaboration with local water resource management institutions, the NWO was configured to local priorities and ingests local datasets to better fit the needs of decision-makers.

CCS CONCEPTS
• Real-time systems • Data management systems • Life and medical sciences

KEYWORDS
Water Resource Management, Smart Water, Observatory, Water Digital Twin, Elasticsearch, Streamstory
1 Introduction

The water sector is facing rapid development towards the smart digitalisation of resources, much motivated and supported by the UN's global initiative for the Sustainable Development Goal 6 (SDG 6). In that context, the efforts to address the specific challenges related to water management data and priorities multiply globally. There are several "digital twin" systems dedicated to water, each of which focuses on different aspects of the digitalisation of signals to support water management companies, as well as water "observatories". The latter are usually meant as Geographical Information Systems that showcase the different aspects of water resources through time.

Within the scope of the European Commission-funded project NAIADES [1], focusing on the automation of water resource management and environmental monitoring, we propose a slightly different approach that integrates heterogeneous data sources to solve common research questions, as well as to support water management companies in their current problems. This solution is named the NAIADES Water Observatory (NWO), available at naiades.ijs.si, putting together: (i) real-time information from multilingual world news on water topics; (ii) data visualisation of water-related indicators through time, sourced from the datasets associated with SDG 6 (water) and other UN data (see Figure 1); and (iii) scientific knowledge from published biomedical research on water-related topics (e.g., water contamination).

Due to the rapidly growing awareness of the sustainability challenges that we are facing in Europe and worldwide in the context of water resource management, much work has been done to develop systems that are able to collect information about the available water, and even to simulate and forecast it in the near future. These are, however, usually geolocation-based systems ingesting water-related data to enable real-time monitoring of resources and usage [x] [y] [z], and thus much different from the water observatory that we propose in this paper. A typical example is the GoAigua system [4], a digital twin technology allowing, e.g., the city of Valencia to optimise its water management at the network level, improving efficiency in daily operations, planning real-time scenarios, and making some predictions about its future behaviour [5].

Figure 1: Visualisation of water-related indicators within Spain, complementing the global indicators view that ingests data from, e.g., the U.N. and the World Bank.

2 A data-driven solution for water events

The proposed Water Observatory enables the extraction of insightful water-related information, configured to use case priorities and needs, from the integration of heterogeneous data sources. This includes information from social media when the weather is favourable for floods, and historical information from news and published research on these weather-related events and on how to make better decisions to solve them.

This is complemented by data ingested from global and local indicators (i.e., datasets at the regional level), showcasing the observation of water-related datasets linked to SDG 6 at the global and country levels that can help us observe changes and trends. The NAIADES Water Observatory enables the user to explore the information provided by published science and the success stories that can be used in decision-making and water education at the local level (i.e., showcasing the resources and problematics of the region).

In this approach, the water data sensing is done over dynamic open data sources that serve as digital sensors (news, social media, indicators, publications, weather forecasts). This data is then integrated and visualised, each source in its own tab, addressing specific topics of interest. The observatory is thus composed of all that heterogeneous data coming in at different frequencies. The interactions between those data sources to solve common problems make it a Water Digital Twin. Envisioned examples include the analysis of best practices in water events in, e.g., Braila, identified in the news and explored over the published research, or the alerts triggered by weather conditions and observed over social media on a water event. The questions we are trying to solve with this innovative technology are, e.g., whether we can predict water shortages in a certain region given the historical data, or whether we can identify early signals of water-related problems from social media (see Figure 2).

Figure 2: Analysis of the sentiment in water-related posts on Twitter and its relation to consumer satisfaction and water-related events.

All of the views of this observatory, each of which represents a digital solution on its own, are configured to the local priorities of the NAIADES users as a Proof of Concept, showing that each can address specific conditions:

● Indicators: adding to the global UN indicators, we ingest curated open datasets that have regional information about water topics of interest to the stakeholder
● Media: each location has its own news and social media streams configured to the priorities and aspects of the news that stakeholders define as topics of interest (e.g., floods)
● Research: similarly to the media sources, the research topics allow for some customisation to better fit the needs of the local user
● Resources: the natural resources information provided for exploration is geolocated to the regions of interest to the user of the platform

It is relatively easy to include new use cases and corresponding workspaces after discussions on user priorities, which allow us to configure the information presented and make it meaningful.

3 Addressing the challenges of tomorrow

With the range of views provided at the observatory, the problems addressed can be of a complex nature and cover a range of concerns and workflows. The different ICT capabilities available across the water sector require intuitive and meaningful technologies to ensure the usefulness of the contribution to the community. The target users of the NWO belong to three main scenarios with different workflows that can be supported by the developed technology:

1. Water resource management: using the provided information in the resolution of problems related to weather events, to understand how their actions are perceived by the consumers, and to explore successful scenarios in similar cases
2. Local governments: to support evidence-based decision-making using open data, to better synchronise with SDG 6 and other guidelines, and to evaluate commitments over time
3. General public: for water education with a local context, in aspects that matter to the local population, based on the parts of the Water Observatory that can be open to the public

The priorities in the European Union are rapidly changing towards sustainability and environmental efficiency, transversally to most domains of action. The European Commission's Green Deal [3], aiming for a climate-neutral Europe by 2050 and boosting the economy through green technology, provides a new framework to understand and position water resource management in the context of the challenges of tomorrow. The NAIADES Water Observatory will not only contribute to the improvement of European sustainability in water-related matters, but will also assign the local actors in water resource management an active role in that. It provides the user of the NAIADES platform with the global and local insight that can be transformed into business intelligence, helping companies to steer their strategies towards customer satisfaction. In the following, we describe selected views of this observatory through the verticals (or views) News, Indicators, and Biomedical, first at the level of the specific dashboards that constitute the tabs in the online instance, and then through the extended exploratory instances, including public instances and APIs, for each of the three verticals.
4 System description and architecture

The NWO offers the user exploratory dashboards for further investigation of the news, for digging deeper into the ingested indicators, and for exploring the biomedical research on water contamination in detail. Moreover, each of the three dashboards has versions built to be exposed through, e.g., an iframe via a publicly available channel, which can be used for integration into high-management KPI-monitoring dashboards. Furthermore, we also offer part of the information in these dashboards through APIs that are easily integrable with other systems.

The Indicators view provides the user with interactive data exploration tools that allow for KPI monitoring over several water-related topics, including SDG 6, the World Bank Open Data, UN data, etc. In this module we also ingest regional data sources that include local indicators, addressing the user's priorities. Considering their well-established data types, data integration is possible and, whenever limitations appear due to a lack of data or poor data quality, the dataset is pre-processed to allow for data completion (whenever possible), or at least an improvement in data quality.

The Media view provides the user with real-time news monitoring over water-related topics (such as Water Scarcity and Water Contamination), and with the analysis of water-related tweets based on data visualisation modules. Based on the news engine Event Registry [7], this view provides the system with a continuous stream of news articles sourced from RSS-enabled sites across the world. From the data management module, the real-time news data is accessed by the news dashboard, which the NAIADES user can configure to tune the topics of interest in the configuration web app. To further explore a water-related topic, the NWO provides a dashboard for the analysis of social media posts on Twitter (see Figure 2), collected in real time, where sentiment is analysed, related concepts are extracted, and it is possible to access the raw tweets or apply several filters.

Finally, the Biomedical module allows for the exhaustive exploration of water contamination information from scientific research articles published worldwide and available through the MEDLINE biomedical open dataset [9] and the Microsoft Academic Graph [8]. The MEDLINE dataset is collected from the official FTP source made available by the North American National Library of Medicine (NLM) as an XML dump and uploaded to the Elasticsearch data management system through a Python script, while the Microsoft Academic Graph dataset is collected from an Azure container, with the data updated biweekly by the Microsoft Research team. The data management is based on the Elasticsearch technology [2], useful for both the interactive data visualisations and the Indicators Explorer view. The latter allows the NAIADES user to explore the raw data through template visualisations, to use Lucene-based queries that can leverage the loaded metadata, and to easily build visualisation modules that can define a new dashboard of data visualisations. The dataset is then called over an HTTP API by the SearchPoint technology [6] to load the dataset and the respective metadata, thus allowing for powerful Lucene-based queries and further interaction over a movable pointer. This leads to a refinement of the search for information that can then be extended over the Biomedical Explorer, which feeds on the same dataset through Kibana, but also allows for the analysis of raw data, the easy construction of data visualisation modules from templates, and an interactive data visualisation dashboard. All the mentioned dashboards can be made publicly available through, e.g., an iframe, to be integrated into high-management KPI monitors.

These dashboards come together to provide the user with a global perspective in real time, where five different tiers of usability are made available (see Figure 3). The tiers allow for the extended usability of the Water Observatory, transversally to the available data sources.

Figure 3: The global view of pilot 1 over usage and data sources.

Figure 4: System architecture of the NAIADES Water Observatory, showcasing the relation between the used technologies and the NWO views.
Available: https://ec.europa.eu/info/strategy/priorities-2019-2024/european-green-deal_en. [Accessed 1 9 2020]. [4] Idrica, "GoAigua: Smart Water for a Better World," 2020. [Online]. Available: https://www.idrica.com/goaigua/. [Accessed 1 9 2020]. [5] Idrica, "Digital Twin: implementation and benefits for the water sector," 19 2 2020. [Online]. Available: https://www.idrica.com/blog/digital- Figure 5: Preliminary data analysis of the relation twin-implementation-benefits-water-sector/. [Accessed 1 9 2020]. between news and tweets on water-related events and [6] Institute Jozef Stefan, "Streamstory". [Online]. Available: their relations with other topics (e.g., weather). http://streamstory.ijs.si/. [Accessed 26 8 2021] [7] G. Leban, B. Fortuna, J. Brank and M. Grobelnik, "Event registry: learning about world events from news," Proceedings of the 23rd International Further development to the NAIADES Water Observatory, Conference on World Wide Web, pp. 107-110, 2014. will be providing the users with tools to explore the impact of [8] Microsoft, "Microsoft Academic Graph". [Online]. Available: https://www.microsoft.com/en-us/research/project/microsoft-natural resources as, e.g., the weather, as well as predictions academic-graph/. [Accessed 26 8 2021] on the levels of the available bodies of water, based on [9] National Library of Medicine, "MEDLINE". [Online]. Available: https://www.nlm.nih.gov/medline/medline_overview.html. [Accessed 26 ingested weather data from the ECMWF (on humidity, 8 2021] temperature and rainfall) and other open data sources. This [10] L. Stopar, P. Škraba, M. Grobelnik, and D. Mladenić (2018). StreamStory: Exploring Multivariate Time Series on Multiple Scales. IEEE transactions will help the users to have some insight on the impact of the on visualization and computer graphics 25.4: 1788-1802. climate crisis in regions that directly relate to their water resources. 
Anomaly Detection on Live Water Pressure Data Stream

Gal Petkovšek, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia (gal.petkovsek@ijs.si)
Matic Erznožnik, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia (matic.erznoznik@ijs.si)
Klemen Kenda, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia (klemen.kenda@ijs.si)

ABSTRACT
We present the application of several anomaly detection algorithms to water pressure data streams. We evaluate their quality on unlabelled data sets using agreement rates. The applied algorithms are the Generative Adversarial Network (GAN), DBSCAN, Welford's algorithm and Facebook Prophet. We found that GAN performed best.

Keywords
water management, machine learning, anomaly detection

1. INTRODUCTION
In the last decades, the Internet of Things (IoT) has penetrated and shaped several fields such as energy management, traffic, health care and others. The water sector, however, is still implementing IoT solutions that will improve water management with features such as real-time consumption prediction, leakage detection, water quality estimation and others.

In the presented work, we focus on anomaly detection on the live water pressure data stream from the town of Braila (Romania). The overall goal of the research is to detect leakage points in the city's water distribution network. To detect the presence of a leakage in the system, we apply an anomaly detection algorithm to the water pressure data stream. We considered several such algorithms, which were applied and evaluated on four data streams obtained from four pressure sensors. Our goal was to find the algorithm which returns the best results. Since the data is not labeled (regular or anomalous), the estimation of accuracy was done with a method considering the relative agreement among the selected algorithms [1].

Anomaly detection on time series is a well-researched field; the algorithms applied in this paper have already been considered in related work, in different settings and for different time series. Anomaly detection can be performed by estimating the expected regular interval of the upcoming measurement. This can be achieved in an incremental fashion with a simple short-term prediction model, for example with a Kalman filter [7], or with a more advanced approach based on time-series modeling [11]. The latter can be used in several settings, for example for detecting air temperature anomalies in sewer systems [12].

DBSCAN [10] is a data clustering algorithm that can be applied to frequently changing data sets. Its incremental version [5] can be used in a streaming setting. The potential of the algorithm for anomaly detection has been demonstrated in several use cases, for example in detecting air temperature anomalies [3].

The paper that demonstrated the use of Generative Adversarial Networks for anomaly detection on data streams is fairly recent [6]. The authors have shown that this approach can outperform several other baselines on data sets obtained from NASA, Yahoo, Amazon etc. They introduced different measures of evaluating the reconstruction accuracy, which we tried to improve upon in our paper.

In this work, we use the already established anomaly detection approaches and compare their performance on an unlabeled water pressure data stream from a water distribution network. A more detailed description of the algorithms is given in the Methodology section. We argue that the relative agreement approach [1] improves the anomaly detection performance, which we demonstrate by manual evaluation of the results.
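The window-based use of DBSCAN for outlier flagging discussed in the related work can be sketched as follows. This is a minimal illustration assuming scikit-learn is available; the window length, `eps` and `min_samples` values are illustrative, not the settings used in the experiments.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_flags(values, window=5, eps=0.5, min_samples=5):
    """Embed a univariate stream into vectors of `window` consecutive
    values and mark every window that DBSCAN treats as noise (label -1)
    as anomalous. Parameter values here are illustrative only."""
    X = np.lib.stride_tricks.sliding_window_view(
        np.asarray(values, dtype=float), window)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels == -1

# A flat pressure signal with one spike: only the windows that touch
# the spike fall outside the dense cluster of normal windows.
stream = [1.0] * 50 + [10.0] + [1.0] * 50
flags = dbscan_flags(stream)
```

Note that this batch call refits DBSCAN on all windows; an actual live deployment would use an incremental variant, as referenced above [5].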
The anomaly detection algorithms that were tested were GAN (generative adversarial networks) [6], DBSCAN [10], Welford's algorithm [9] and anomaly detection with Facebook Prophet [11]. It is important to note that the first three algorithms treat the data stream as an actual live stream. This means that they consume one sample at a time (or a feature vector containing multiple past values, enrichment values and contextual data) and declare it regular or anomalous, as the algorithms are intended to work in production. In contrast, Facebook Prophet consumes the whole data stream as a batch and labels all the samples together. This makes it unusable in production (in this setting); however, it is included in the experiment since it can help to estimate the accuracy of the other algorithms.

2. DATA AND DATA PREPROCESSING
We demonstrate our anomaly detection methodology on four data sets. Each of the data sets represents the pressure values of one of the sensors, which are located at different points in Braila's water distribution network. The sensors are labeled as '5770', '5771', '5772' and '5773'. The data sets contain between 10 and 11 thousand instances, which are spaced in 15-minute intervals, so about 100 days' worth of data. The data was first pre-processed to remove any duplicated points and 'holes' in the data which were formed as a consequence of sensor down-time. When working with data streams, this process should be done automatically to avoid any incorrect analysis when feeding the data into the anomaly detection algorithms. Each of the four data sets was split into a training and an evaluation part. The training sets consisted of the first 2000 data points and the evaluation sets contained all the rest. This is done so that the algorithms which require training (GAN, DBSCAN) can be trained on one part of the data and evaluated on the other.

3. METHODOLOGY
3.1 Evaluation of algorithms
Evaluation of the performance of algorithms on unlabelled data always represents a challenge. Since we are working with such data, an actual calculation of accuracy scores would require manual labelling of the data instances. To avoid this time-consuming process, we use a method for estimating error rates (the ratio of wrong classifications to the total number of instances) from the agreement rates of multiple algorithms. The agreement rate of two classifiers f_i and f_j is defined in the following way:

a_{i,j} = (1/S) * Σ_{s=1}^{S} I{f_i(X_s) = f_j(X_s)}

where X_1, ..., X_S are unlabeled samples. The calculated agreement rates are then inserted into the following equations:

a_{i,j} = 1 − e_{i} − e_{j} + 2 e_{i,j}

Here we assume that the functions make independent errors, so we can substitute e_{i,j} with e_{i} e_{j}. With such a system of equations we can then calculate the error rates using some root-finding algorithm. Such an approach has previously been used for the evaluation of classifiers on an unlabelled dataset [1]. We therefore consider each anomaly detection algorithm as a binary classifier and use the aforementioned method for the comparison of the different algorithms. Additionally, two important assumptions were made: firstly, that the anomaly detection algorithms are independent, and secondly, that each of them performs better than a random classifier.

Since the estimated performance of one algorithm depends on the output of the others, it was important that the algorithms yield a similar percentage of anomalies. In other words, the algorithms are tuned to have a similar predicted positive condition rate (PPCR = (FP + TP) / (FP + TP + FN + TN)). For most data streams this means that 1%-3% of the samples are labelled as anomalous.

3.2 GAN
The Generative Adversarial Network (GAN) [6] is an unsupervised machine learning approach to anomaly detection. An encoder-decoder structure of the neural network is used to first encode the input data point and then decode the encoded one. The model learns to reconstruct the input data point as closely as possible. The idea is that the reconstruction should be better if the input data is 'normal' and worse if it is abnormal/anomalous. We use an input vector which is composed of 10 consecutive values of the uni-variate data stream. We then compare the input vector to the reconstructed one using the mean squared error (MSE) metric. We classify the data point as 'normal' if the value of the MSE is below the defined threshold. The authors of [6] calculated the thresholds using sliding windows on reconstruction errors (4 standard deviations from the mean of the window). We used a slightly different approach, using the moving average multiplied by a constant as the threshold. This proved to be easier to implement in our live data stream use-case.

3.3 DBSCAN
DBSCAN [4] is a well-known data clustering algorithm. It groups together points which are close together based on Euclidean distance. The group with the largest number of points is in our case considered 'normal', and the lower-density groups are outliers, which are then labeled as anomalies. The parameter which measures how close the points should be for them to still be considered part of the same group can be adjusted based on the data set and the desired sensitivity of the algorithm. For DBSCAN we also use an input vector composed of consecutive pressure values. In this case, we discovered that a vector of 5-6 values works best.

3.4 Welford's algorithm
Welford's algorithm gets its name from Welford's method for the online estimation of mean and variance. A very simple anomaly detection approach [9] can then be constructed by defining the upper and lower limits (UL and LL) of "normal" data as a function of mean and variance:

UL = mean + X * variance
LL = mean − X * variance

X is fixed and determines the threshold band. Any instance which falls outside of that band is labeled as an anomaly. Instances can then be input into the algorithm one by one to be labeled, and after each one the mean and the variance (and consequently also UL and LL) are updated.

For this experiment the actual Welford's method was not used, since the mean and variance were computed from the last 1500 samples so that they would better adapt to the new samples. Note that the first 1500 samples therefore could not be labeled; however, this was not a problem, since most of the other approaches required 2000 samples for fitting the models, and the evaluation was therefore done on the remaining stream. The upper and lower limits of the interval were still computed as shown above, with the value of X = 2.2.

3.5 Facebook Prophet
Facebook Prophet is an algorithm for time series forecasting that works especially well on data streams with multiple seasonalities [8]. Prophet also works well with missing data, which makes it a good candidate for the problem at hand. After fitting the model, it can make predictions for a chosen set of timestamps presented to it. Furthermore, besides the prediction it also outputs the upper and lower limits of the confidence interval for every sample. Ashrapov [2] demonstrates the implementation of an anomaly detection algorithm which uses this property to classify the samples inside the confidence interval as regular and the rest as anomalies. The model is fitted on the entire data set and then makes predictions on the same data set, providing both the anomaly detection and the confidence interval.

4. RESULTS
The results of the algorithms for the data stream from sensor 5770 are presented in Figures 1, 2, 3 and 4. The charts show the raw values obtained from the pressure sensors, indicating the points which are labeled as anomalies with red points.
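To make the agreement-based evaluation of Section 3.1 concrete, the sketch below solves the system a_ij = 1 − e_i − e_j + 2 e_i e_j in closed form for exactly three detectors: substituting c_i = 1 − 2 e_i collapses the system to c_i c_j = 2 a_ij − 1. This is an illustrative special case; the paper itself applies a generic root-finding approach over the full set of detectors [1].

```python
from math import sqrt

def pairwise_agreement(y1, y2):
    """Share of samples on which two detectors output the same label."""
    return sum(a == b for a, b in zip(y1, y2)) / len(y1)

def solve_error_rates(a01, a02, a12):
    """Closed-form solution of a_ij = 1 - e_i - e_j + 2*e_i*e_j for three
    detectors: with c_i = 1 - 2*e_i the system becomes c_i*c_j = 2*a_ij - 1,
    assuming independent, better-than-chance errors (so c_i > 0)."""
    g01, g02, g12 = 2 * a01 - 1, 2 * a02 - 1, 2 * a12 - 1
    c0 = sqrt(g01 * g02 / g12)       # c0^2 = (c0 c1)(c0 c2)/(c1 c2)
    c = (c0, g01 / c0, g02 / c0)
    return tuple((1 - ci) / 2 for ci in c)
```

For example, detectors with true error rates (0.05, 0.10, 0.20) and independent errors produce agreements (0.86, 0.77, 0.74), from which the formula recovers the error rates exactly.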
Since the data sets are unlabelled, it is hard to assess the accuracy of each algorithm based on the anomaly visualizations alone, but we do notice some similarities and some differences. All of the algorithms are good at identifying obvious outliers (points which fall far out of the 'normal' range). The difference between the algorithms can be noticed when classifying points closer to the normal range. For example, Welford's algorithm tends to label points as anomalies at the peaks of the daily pressure fluctuation, which might not be ideal, since we know that this behaviour can be considered normal. More sophisticated algorithms such as GAN and Prophet were also able to identify more "subtle" anomalies.

Figure 1: Anomalies found using GAN on the data stream from sensor 5770.
Figure 4: Anomalies found using Facebook Prophet on the data stream from sensor 5770.

The recall of each algorithm can be increased or decreased by modifying parameters and thresholds. Since the data sets are unlabeled, it is hard to determine the optimal parameters. We decided to tune the algorithms to have a similar recall of 1-3%, as we deemed that this would make the comparison of the algorithms the most fair. In Table 1 the shares of anomalies are presented for each separate data stream.

Table 1: Shares of anomalies for all four data streams.

Algorithm            5770    5771    5772    5773
GAN                  1.42%   0.99%   0.77%   1.13%
DBSCAN               2.63%   2.82%   2.73%   2.85%
Welford's algorithm  3.39%   3.41%   1.66%   3.16%
Facebook Prophet     1.66%   1.13%   0.46%   1.40%

The error rates calculated from the agreement rates are shown in Table 2 for each of the data streams. Since we assumed that most of the samples in the data stream were normal, these error rates are not very informative out of context. We can, however, observe that Prophet performed best, followed by GAN, DBSCAN and Welford's algorithm, respectively. The results are consistent in all four scenarios.
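The band-based labelling of Section 3.4 can be sketched as follows. The short window and the standard-deviation band are illustrative simplifications of the paper's setup (a 1500-sample window and a variance-based band with X = 2.2), chosen so the toy example below behaves predictably.

```python
from collections import deque
from math import sqrt

def band_flags(stream, window=100, x=2.2):
    """Label samples falling outside mean +/- x*std of the previous
    `window` samples. The first `window` samples stay unlabeled, which
    mirrors the warm-up period described in Section 3.4."""
    buf, flags = deque(maxlen=window), []
    for v in stream:
        if len(buf) == window:
            m = sum(buf) / window
            s = sqrt(sum((b - m) ** 2 for b in buf) / window)
            flags.append(abs(v - m) > x * s)
        else:
            flags.append(False)  # warm-up: not enough history yet
        buf.append(v)
    return flags

# Oscillating pressure plus one pronounced spike at the end:
# only the spike leaves the threshold band.
stream = [1.0 + 0.1 * (-1) ** i for i in range(200)] + [5.0]
flags = band_flags(stream)
```

Because the band tracks the recent mean, a sample at the top of a regular daily oscillation can still exit the band, which is consistent with the peak-labelling behaviour of Welford's algorithm noted above.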
If we take into consideration that Prophet worked on the whole data set at once, while the other three were limited to one sample at a time (as it is in production), we can declare that GAN performed best out of the algorithms that can detect anomalies on a live stream.

Figure 2: Anomalies found using DBSCAN on the data stream from sensor 5770.
Figure 3: Anomalies found using Welford's algorithm on the data stream from sensor 5770.

Table 2: Error rates estimated from agreement rates for all four data streams.

Algorithm            5770    5771    5772    5773
GAN                  1.34%   1.38%   0.66%   1.09%
DBSCAN               1.59%   1.70%   1.78%   1.81%
Welford's algorithm  2.44%   2.41%   1.10%   2.31%
Facebook Prophet     1.14%   0.62%   0.39%   0.81%

We also considered a state-of-the-art method, Isolation Forest; however, it was too sensitive and therefore not usable in the error rate calculation.

5. CONCLUSIONS
We have tested five anomaly detection algorithms (Generative Adversarial Network, DBSCAN, Facebook Prophet, Welford's algorithm and Isolation Forest) on four separate data streams of water pressure data. Out of those five, the Isolation Forest performed poorly, since the share of anomalies found with this method was unreasonably high, and it was therefore not included in the final error estimate calculation.

The other approaches had similar shares of anomalies and were therefore used to calculate the agreement rates and, finally, the estimated error rates of each anomaly detection algorithm. The results were consistent for all four data streams. Prophet performed best in every setting; however, it looked at the data stream as a batch and therefore could not be used for online anomaly detection. GAN performed second best, followed by DBSCAN and Welford's algorithm, which all work on a live data stream. Therefore we can conclude that the most fitting algorithm for anomaly detection on the live water pressure data from a water distribution network is GAN. In future work, Facebook Prophet could be adapted in such a way that it would also work on a live data stream, since it has shown promising results in this experiment.

6. ACKNOWLEDGMENTS
This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 820985, project NAIADES (A holistic water ecosystem for digitisation of urban water sector).

7. REFERENCES
[1] Antonios Platanios, E. Estimating accuracy from unlabeled data.
[2] Ashrapov, I. Anomaly detection in time series with Prophet library, Jun 2020.
[3] Celik, M., Dadaser-Celik, F., and Dokuz, A. S. Anomaly detection in temperature data using DBSCAN algorithm. 2011 International Symposium on Innovations in Intelligent Systems and Applications (2011).
[4] do Prado, K. S. How DBSCAN works and why should we use it?, Apr 2017.
[5] Ester, M., and Wittmann, R. Incremental generalization for mining in a data warehousing environment. In International Conference on Extending Database Technology (1998), Springer, pp. 135-149.
[6] Geiger, A., Cuesta-Infante, A., and Veeramachaneni, K. Adversarially learned anomaly detection for time series data, 2020.
[7] Kenda, K., and Mladenić, D. Autonomous sensor data cleaning in stream mining setting. Business Systems Research: International Journal of the Society for Advancing Innovation and Research in Economy 9, 2 (2018), 69-79.
[8] Krieger, M. Time series analysis with Facebook Prophet: How it works and how to use it, Mar 2021.
[9] Lobo, J. L. Detecting real-time and unsupervised anomalies in streaming data: a starting point, Feb 2020.
[10] Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu, X. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS) 42, 3 (2017), 1-21.
[11] Taylor, S. J., and Letham, B. Forecasting at scale. The American Statistician 72, 1 (2018), 37-45.
[12] Thiyagarajan, K., Kodagoda, S., Ulapane, N., and Prasad, M. A temporal forecasting driven approach using Facebook's Prophet method for anomaly detection in sewer air temperature sensor system. In 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA) (2020), pp. 25-30.

Entropy for Time Series Forecasting

João Costa, Fakulteta za matematiko in fiziko (joaocostamat@gmail.com)
António Costa, ESN Paris (antoniocbscosta@gmail.com)
Klemen Kenda, Jožef Stefan Institut (klemen.kenda@ijs.si)
João Pita Costa, IRCAI (joao.pitacosta@quintelligence.com)

Figure 1: Sample of the time series and projections of the embedding. This plot gives a geometrical representation of the theory involved in Section 3 and shows the reconstructed state space of the given time series. This can be obtained by using Takens' embedding to reconstruct the time series y, given in panel a), as the Markovian system Y_K with K time delays, and then using Principal Component Analysis to perform the change of basis of the data. The obtained projections b), c) and d) capture the dynamics of the system, which gives us the possibility to predict the time series with higher efficiency.

ABSTRACT
In this paper, we present the exploitation of a method to extract information from microscopic samples of time series data in order to provide a representation of optimized stability to a chaotic system [1]. The main goal of this approach is to predict the dynamics of a time series and therefore develop optimized forecasting algorithms.
First, we study how to increase the predictability of a system and, second, we develop a Deep Learning algorithm, namely an LSTM, that can recognize patterns in sequential data and accurately predict the future behaviour of a time series.

KEYWORDS
Recurrent Neural Networks, LSTM, Entropy, Markov Chain, Clustering, Time Series

1 INTRODUCTION
Given its intrinsic nature, mathematics concerns itself with the construction of formal statements and proofs relating the different concepts within it. Its methods are used in countless ways and effectively model the shape of our world. But how is it possible to shape the unknown? Motivated by this question and the utmost need to find ways of optimizing water resources for future generations, there has been great development in the study of dynamical systems based on, for example, (Shannon) entropy [9] and phase space reconstruction [4]. In this paper, we provide an approach to water resource management using Deep Learning and Chaos Theory, by studying the dynamics of a time series using the two main ideas cited before. This study was developed for the H2020 NAIADES Project [2] with data collected from the Municipality of Alicante (Spain). We will present this study for the Autobus Dataset, related to the Bus Station Areas in Alicante.

2 STATIONARY AND CHAOTIC NATURE
2.1 Dickey-Fuller Test for Stationarity
In order to proceed with the theory involved in the method, it is necessary to understand the behaviour of the time series and its sensitivity to initial conditions. For studying a time series' stationarity, one can use the Augmented Dickey-Fuller test, which is a type of statistical test called a unit root test, where generally the null hypothesis is that the time series can be represented by a unit root, which means that for y = {y_t}_{t=1}^{T}, the information at point y_{t−1} does not provide us the ability to predict y_t. In our case, we obtained that the p-value of the test was 0, so the null hypothesis was rejected and the time series has no unit root. Therefore, it is stationary, and the time delays will provide important information for predicting the dynamics of the time series.
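The unit-root notion behind the ADF test can be illustrated numerically (this is an illustration of the null hypothesis, not the ADF statistic itself): under a unit root y_t = y_{t−1} + ε_t, the spread of simulated paths keeps growing with time, whereas a stationary AR(1) settles to a bounded spread.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_spread(phi, steps=2000, trials=200):
    """Cross-sectional std of AR(1) paths y_t = phi*y_{t-1} + eps after
    `steps` steps; phi = 1 is the unit-root (random walk) case."""
    y = np.zeros(trials)
    for _ in range(steps):
        y = phi * y + rng.normal(size=trials)
    return float(y.std())

unit_root = final_spread(1.0)   # grows roughly like sqrt(steps)
stationary = final_spread(0.5)  # settles near sqrt(1/(1 - 0.5**2))
```

Rejecting the null, as the paper does for the Alicante series, corresponds to the bounded, stationary regime, in which past delays carry usable predictive information.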
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2020, 5-9 October 2020, Ljubljana, Slovenia
© 2020 Copyright held by the owner/author(s).

2.2 Lyapunov Exponents for Understanding Chaotic Nature
The Lyapunov Exponent is a quantifier of the sensitivity of the time series to initial conditions and therefore of its chaotic nature. The main idea is to select an array of nearest neighbors, i.e., points at minimum distance, and calculate their trajectories in time. By doing so, we can then obtain an average of this divergence exponent, which gives us the Lyapunov Exponent. Since the system is bounded, the divergence is also bounded and will reach a plateau after a certain number of timesteps. In our case, the Lyapunov Exponent, given as the initial slope, is ≈ 518 and the initial growth is exponential, as can be seen in Figure 5. Therefore, the time series is of a chaotic nature.
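As a toy illustration of a Lyapunov exponent estimate (the paper estimates it from nearest-neighbour divergence on the Alicante series; the example below is a different, derivative-based method for a system whose dynamics are known), one can average the log-derivative along an orbit of the logistic map x → 4x(1−x), whose exact exponent is ln 2:

```python
from math import log

def lyapunov_logistic(x0=0.2, n=100_000, burn=1_000):
    """Average log|f'(x)| along an orbit of the logistic map
    f(x) = 4x(1-x), where |f'(x)| = |4 - 8x|. The exact Lyapunov
    exponent of this map is ln 2, approximately 0.693."""
    x = x0
    for _ in range(burn):        # discard the transient
        x = 4 * x * (1 - x)
    acc = 0.0
    for _ in range(n):
        acc += log(abs(4 - 8 * x))
        x = 4 * x * (1 - x)
    return acc / n

lam = lyapunov_logistic()
```

A positive exponent means nearby initial conditions diverge exponentially, which is exactly the behaviour the nearest-neighbour method detects before its divergence curve plateaus.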
3 MAXIMUM PREDICTABILITY
Given the high variability of any chaotic system, it is hard to capture the whole set of variables that model the state space. This is characteristic of a non-Markovian system, which is highly unpredictable. How do we surpass this issue? Takens' Embedding Theorem [8] tells us that, under certain conditions, it is possible to use past data to reconstruct a Markovian system, thus giving us the possibility to model the initial time series with higher efficiency.

We start by considering a set of ODEs ẋ = (ẋ_1, ẋ_2, ..., ẋ_d) and the d-dimensional time series y(t) of duration T, which is a set of incomplete measurements of x given by a measure M, i.e., y = M(x). Then, in order to calculate the number K of time delays to feed the LSTM with, the d-dimensional measurements are lifted into the state space Y_K ∈ R^{d×K}, consisting of the previously referred K time delays [3]. It is possible to quantify the chaotic measure of the system Y_K by calculating the entropy resulting from clustering. This can be done by partitioning the d×K-dimensional space into N Voronoi cells using K-means clustering. Having partitioned the state space Y_K, the reconstructed dynamics are encoded as a row-stochastic transition probability matrix P = [P_ij]_{i,j}, which relates increments of the state-space density p in the following way:

p_i(t + δt) = Σ_j (P^T)_{ij} p_j(t).    (1)

The entropy rate of the initial time series y(t) is then approximated by estimating the entropy rate (Figure 3) of the associated Markov chain for the different time delays K, using Kolmogorov's definition:

h_p(K) = − Σ_{i,j} π_i P_ij log P_ij,    (2)

where π is the estimated stationary distribution of the Markov chain P. This approximation gives an estimate of the conditional entropies (Figure 6), i.e., for a discrete state with delay vectors ȳ_i^K = {ȳ_i, ..., ȳ_{i+K−1}}, the entropy of the Markov chain provides an estimate of the conditional entropy:

h_p(K) ≈ ⟨− log[p(y_{i+K} | y_i, ..., y_{i+K−1})]⟩_N
       = H_{K+1}(N) − H_K(N)    (3)
       = h_K(N),

where H_K is the Shannon entropy of the sequence obtained by partitioning the ȳ space into N partitions.

4 MODEL ARCHITECTURE
4.1 LSTM
Long Short-Term Memory (LSTM) networks are a special type of Recurrent Neural Network (RNN) which rely on gated cells that control the flow of information by choosing what elements of the sequence are passed on to the next module. This idea was introduced in order to surpass the vanishing gradient problem in conventional RNNs [7]. At each time t, consider f_t as the forget gate, i_t as the input gate and o_t as the output gate, which are functions that depend on the output of the previous LSTM module, given by h_{t−1}, and on the input of the current timestep, given by x_t. Figure 2 shows a representation of how a single LSTM cell performs its computations, which can be mathematically represented as

f_t(x_t, h_{t−1}) = σ(w_{f,x} x_t + w_{f,h} h_{t−1} + b_f)
i_t(x_t, h_{t−1}) = σ(w_{i,x} x_t + w_{i,h} h_{t−1} + b_i)    (4)
o_t(x_t, h_{t−1}) = σ(w_{o,x} x_t + w_{o,h} h_{t−1} + b_o),

where the w and b terms are weight and bias parameters and σ is an activation function.

Figure 2: An LSTM performs the following ordered computations: the first step is to forget the irrelevant history. Then, LSTMs perform a computation to decide on the relevant parts of the new information and, based on the previous two steps, selectively update the internal state. Finally, an output is generated.

4.2 Our approach
The core idea is to take a list of k training sets Q_0, Q_1, ..., Q_{k−1} and testing sets P_0, P_1, ..., P_{k−1} in order to generalize the model and do the best estimation for the time series. This is based on translating the testing sets' partitions along the time series, where the first partition P_0 = {p_0^0, ..., p_0^n} is taken from the zeroth point of the time series data and the last partition P_{k−1} = {p_{k−1}^0, ..., p_{k−1}^n} extends to the last point of the time series data, with

|P_i| = |y| / k,  ∀i ∈ {0, ..., k − 1},    (5)

where |y| stands for the cardinality of the time series y. This procedure yields k models, each of which uses its training set to make predictions on the respective test set. Given the erratic nature of the data, which was taken in 15- and 30-minute samples, a resampling to 30-minute delays had to be done on the 15-minute data points, and a masking was added to the time series in order to neglect NaN values that could be created by resampling. Therefore, a masking layer was added, and the model is composed of 3 other layers L_{n_1}, L_{n_2} and L_{n_3}, where n_1 = n_3 = 1 (we have a univariate time series) and n_2 = 64, since it gave the best results in cross-validation. A dropout regularization of 0.1 was added for better approximation of training and validation errors, and the batch size was set to 128. The mean squared error of the predictions on the training set is ≈ 0.00115 and on the testing set ≈ 0.00236. One can assess the capacity of the model, whose predictive results are shown in Figure 4.
5 FORECASTING
5.1 Forecasting Methods
Consider a time series T = {t_1, ..., t_N}. The forecasting process can be done in three ways:

(1) iterated forecasting
(2) direct forecasting
(3) multi-neural-network forecasting

Process (1) is based on a "many-to-one" forecast, for which

t_{i+n} ≈ F(t_i, ..., t_{i+n−1}),  i ∈ {1, ..., N − n}.    (6)

Then, a K-step forecast can be iteratively obtained by

t̂_{N+j} := F(t̂_{N+j−n}, ..., t̂_{N+j−2}, t̂_{N+j−1}),  j ∈ {1, ..., K}.    (7)

Process (2) can be characterized by training a "many-to-many" function F for which

(t_{i+n}, ..., t_{i+n+K−1}) ≈ F(t_i, ..., t_{i+n−1}),    (8)

where i ∈ {1, ..., N − n − K + 1}. We can obtain a K-step forecast by

(t̂_{N+1}, ..., t̂_{N+K}) := F(t_{N−n+1}, ..., t_N).    (9)

Finally, process (3) is defined by k "many-to-one" functions F_1, ..., F_k, which hold the following relationship:

t_{i+n} ≈ F_1(t_i, ..., t_{i+n−1})
  ...
t_{i+n+K−1} ≈ F_k(t_i, ..., t_{i+n−1}),    (10)

where i ranges from 1 to N − n − K + 1. Process (1) does not require k a priori, while both processes (2) and (3) depend on the choice of k.

5.2 Our Approach
We chose to do a direct forecast for the next 7 days by taking the last test set partition P_{k−1} and making a prediction on this test set. Although forecasting seems pretty motivating, by choosing a partition that captures more characteristics of the time series one can achieve even better results. The achieved forecast can be seen in Figure 8 and compared with a 7-day sample in Figure 7.
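The iterated scheme of eqs. (6)-(7) can be sketched with a least-squares one-step AR predictor standing in for the LSTM; the model, window length and series below are illustrative only.

```python
import numpy as np

def fit_ar(y, n=3):
    """Least-squares one-step model t_{i+n} = F(t_i, ..., t_{i+n-1}),
    a linear stand-in for the LSTM of eq. (6)."""
    X = np.lib.stride_tricks.sliding_window_view(y[:-1], n)
    return np.linalg.lstsq(X, y[n:], rcond=None)[0]

def iterated_forecast(y, w, K):
    """Eq. (7): feed each one-step prediction back in as the next input."""
    hist = list(y[-len(w):])
    out = []
    for _ in range(K):
        nxt = float(np.dot(w, hist))
        out.append(nxt)
        hist = hist[1:] + [nxt]
    return out

# A geometric series y_t = 0.8^t: the K-step forecast continues the decay.
y = 0.8 ** np.arange(50)
w = fit_ar(y)
forecast = iterated_forecast(y, w, 3)
```

Direct forecasting (eq. (8)) would instead fit one model that outputs all K future values at once, trading error accumulation for a harder learning problem.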
, 𝑐 ) (11) 𝑙 𝑖 𝑖 𝑖 𝑖 1 𝑙 1 𝑙 and the conditional probabilites are given by 𝑝 (𝑐 |𝑐 , . . . , 𝑐 ), (12) 𝑖 𝑖 𝑖 𝑙 +1 1 𝑙 where 𝑐𝑖 is the next Voronoi Cell after 𝑐 . We can calculate the 𝑙 +1 𝑖𝑙 entropy rate growth by considering the conditional probabilities of the system given the previous 𝑙 cells, when visiting the (𝑙 +1)-th cell, via ℎ = ⟨− log[𝑝 (𝑐 |𝑐 , . . . , 𝑐 )]⟩ = 𝐻 − 𝐻 (13) 𝑙 𝑖 𝑖 𝑖 𝑙 +1 1 𝑙 𝑙 +1 𝑙 Figure 5: In this figure, we can understand the initial expo- Taking the supremum limit over all possible partitions 𝑃 of 𝑌 , 𝐾 nential growth on distance between points (given in blue), we obtain the Kolmogorov-Sinai invariant of the system, relative to a curve of slope 1 (given in orange). ℎ = sup lim ℎ (𝑃 ). (14) 𝐾 𝑆 𝑙 𝑙 →∞ 𝑃 59 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia João Costa, et al. building other algorithms, such as Transformer neural network, that would provide even better results. Another idea is to use weather data and build a multivariate LSTM that optimally gives better results than the univariate one. 9 ACKNOWLEDGMENTS I greatly thank to António Carlos Costa for working in coopera- tion and giving me the possibility to use the powerful machinery he built in order to obtain the desired 𝐾 time delays and under- stand the complex dynamics of the system. Also, to the NAIADES team at Jožef Stefan Institute for all the knowledge exchange and, in particular, to Klemen Kenda for giving me the possibility of writing this paper and João Pita Costa for giving me insights on Figure 6: Conditional Entropies - In this plot we can see how to write and structure the paper. the entropy rate for number of partitions This paper is supported by European Union’s Horizon 2020 𝑁 = 200 which maximizes this entropy. This function reaches a plateau research and innovation programme under grant agreement No. 
at 820985, project NAIADES (A holistic water ecosystem for digiti- ≈ 24 timesteps, which gives us an idea about which is the optimal sation of urban water sector). 𝐾 to choose. Given that we have 30 minutes timesteps, this plot shows that the optimized time delay is REFERENCES of 12h which corresponds to the day and night cycles [1] Tosif Ahamed, Antonio Carlos Costa, and Greg J. Stephens. 2019. Capturing the continuous complexity of behavior in c. elegans. (2019). arXiv: 1911.10559 [q-bio.NC]. [2] 2019-2022. Cordis, "naiades project". In CORDIS. \url{https: //cordis.europa.eu/project/id/820985}. [3] Antonio Carlos Costa, Tosif Ahamed, David Jordan, and Greg Stephens. 2021. Maximally predictive ensemble dy- namics from data. (2021). arXiv: 2105.12811 [physics.bio-ph]. [4] Vicente de P. Rodrigues da Silva, Adelgicio F. Belo Filho, Vijay P. Singh, Rafaela S. Rodrigues Almeida, Bernardo B. da Silva, Inajá F. de Sousa, and Romildo Morant de Holanda. 2017. Entropy theory for analysing water resources in north- eastern region of brazil. Hydrological Sciences Journal, 62, Figure 7: 7 Days Sample 7, 1029–1038. doi: 10.1080/02626667.2015.1099789. eprint: https : / / doi . org / 10 . 1080 / 02626667 . 2015 . 1099789. https : //doi.org/10.1080/02626667.2015.1099789. [5] David A. Dickey and Wayne A. Fuller. 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, 74, 366a, 427–431. doi: 10 . 1080 / 01621459 . 1979 . 10482531. eprint: https : / / doi . org / 10 . 1080 / 01621459 . 1979 . 10482531. https : //doi.org/10.1080/01621459.1979.10482531. [6] Robert M. Gray. 2011. Entropy and Information Theory. (2nd edition). Springer Publishing Company, Incorporated. isbn: 9781441979698. [7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short- term memory. Neural Computation, 9, 8, 1735–1780. doi: Figure 8: Prediction for 7 days ahead - Actual forecast using 10.1162/neco.1997.9.8.1735. 
336 timesteps that gives a 7 day future forecast sample using [8] Floris Takens. 1981. Detecting strange attractors in tur- the LSTM model and direct forecasting. It is possible to bulence. In Lecture Notes in Mathematics. Springer Berlin observe that, as in figure 6, the values vary between ≈ 2000 Heidelberg, 366–381. doi: 10.1007/bfb0091924. to ≈ 14000 flow units and the essential dynamics of the [9] Peyman Yousefi, Gregory Courtice, Gholamreza Naser, and time series were understood by the LSTM. Hadi Mohammadi. 2020. Nonlinear dynamic modeling of urban water consumption using chaotic approach (case 8 CONCLUSION study: city of kelowna). Water, 12, 3. issn: 2073-4441. doi: 10.3390/w12030753. https://www.mdpi.com/2073- 4441/12/ Having developed all the necessary machinery for constructing 3/753. a coherent forecasting engine, we come to the conclusion that although the cardinality of the time series data was relatively small, the obtained results are promising and the model will certainly show satisfying results when applied in real time. For the future, we want to continue developing the project by 60 Modeling stochastic processes by simultaneous optimization of latent representation and target variable Jakob Jelenčič Dunja Mladenić Jozef Stefan International Jozef Stefan International Postgraduate School Postgraduate School Jozef Stefan Institute Jozef Stefan Institute Ljubljana, Slovenia Ljubljana, Slovenia jakob.jelencic@ijs.si dunja.mladenic@ijs.si ABSTRACT We have evaluated the proposed method on an equities This paper proposes a novel method for modeling stochastic dataset and a cryptocurrency dataset, in both cases achieving processes, which are known to be notoriously hard to predict extraordinary results on the test dataset. We have also shown accurately. State of the art methods quickly overfit and the importance of noise distribution and how the de-noising create big differences between train and test datasets. 
We fails if the distributions of the data and noise do not align. present a method based on simultaneous optimization of la- tent representation and the target variable that is capable of The rest of the paper is organised as follows. Section 2 dealing with stochastic processes and to some extent reduces describes the data we were using. In section 3 we introduce the overfitting. We evaluate the method on equities and the proposed method. In section 4 we present empirical cryptocurrency datasets, specifically chosen for their chaotic results. In section 5 we conclude by pointing out the main and unpredictable nature. We show that with our method results and defining guidance for the future work. we significantly reduce overfitting and increase performance, compared to several commonly used machine learning algo- 2. DATA rithms: Random forest, General linear model and LSTM The proposed method works well for stochastic processes. deep learning model. Equities are supposed to follow some form of stochastic pro- cess [9], either the Black-Scholes one or some more complex 1. INTRODUCTION process with unknown formulation. In order to evaluate our Time series prediction has always been an interesting chal- method, we have collected daily data of more than 5000 lenge. Deep learning structures that are designed for time equities listed on NASDAQ from 2007 on. The data is freely series are prone to overfitting. Especially if the underlying available on the Yahoo Finance website [2]. We transformed time series is stochastic by nature. Every young researcher’s the data using technical analysis [10] and for test set took first attempt when dealing with time series, was trying to every instance that happened after 2019. We calculated mov- learn a time series model that will predict future prices; ing average using 10 days closing price then tried to predict whether in equities, commodities, forex or cryptocurrencies. the direction of the change of this trendline. 
Unfortunately it is not that simple. One can easily build a near perfect model on the train dataset just to find it is The equity data turned out to be a little bit timid, not completely useless on the test dataset. chaotic enough to demonstrate the full ability of the pro- posed method. This is why we also collected minute data of We propose a novel method that is capable of effectively cryptocurrencies Ethereum and Bitcoin and used the method combatting the overfitting, especially this proves to be a on them as well. Data is available on the crypto exchange difficult task when one is dealing with a problem directly Kraken [1]. We used the same transformation as for the applicable in practical situations. The main idea is to add equities, but with a bit quicker trend. This time the target noise from the same distribution as the training data and variable was change in the trendline in the next 6 hours. For then at the same time optimize the target variable and the the test set we took every instance that has time stamp after latent representation with the help of the autoencoder. The December 2020. longer the training goes, the lower is the amplitude of noise and the less focus is on the optimization of the representation. The reader should note that the end goal is not to accurately predict future equity price, since that is next to impossible. As soon there is a pattern, someone will profit from it and then the pattern will change. By predicting the future trend line, one can obtain a significant confidence interval and estimates of where the price could be, and then design for example a derivative strategy that searches for favourable risk versus rewards trades. 3. PROPOSED METHOD 61 We propose the method designed for prediction of stochastic Algorithm 1 Noise definition processes. The method achieves significant results improving 1: Inputs: X, α, β, epoch the metrics and loss functions on unseen data, where standard 2: Y = [ts, ts, np] . 
Array for holding Cholesky deep learning is prone to over-fit. The main advantage is decompositions of time correlation matrices. reducing the gap between training data and testing data, 3: for t ∈ {1, . . . , np} do sometimes to a degree where one sacrifices a little bit on the 4: Σt = cov(X[, , t]) train side to actually have the model outperforming it on 5: Y [, , t] = chol(Σt) . In practice the test data. This is very important in time series, where a closest positive definite matrix of Σt is computed before prediction model is usually just one part of a bigger strategy the Cholesky decomposition. and where the train over-fit is the biggest issue. For example, 6: end for designing a trading strategy on over-fitted predictions, that 7: Z = [bs, ts, np] . Array for holding noise samples. kind of mistake can lead to huge capital losses. 8: for i ∈ {1, . . . , ts} do 9: Σi = cov(X[, i, ]) The proposed method can be broken down into 3 important 10: Z[, i, ] = mvn(bs, Σi) parts: normalization, noise addition and additional opti- 11: end for mization of latent representation. Each part can be easily 12: for j ∈ {1, . . . , np} do integrated into an already existing pipeline. 13: Z[, , j] = matmul(Z[, , j], Y [, , j]) . Correcting initially independent noise samples with respect to time. 14: end for 3.1 Empirical normalization 15: for w ∈ {1, . . . , ts} do Normalization plays an important role in deep learning mod- 16: Z[, w, ] = Z[, w, ] ∗ ((βts−w · αepoch) · sd) . Decrease els. It was shown that normalization significantly speeds up the noise during the training procedure. the gradient descent, almost independently of where normal- 17: end for ization takes place. It can be weight normalization [11] during 18: R = X + Z the actual optimization, or it can be the batch normalization 19: Return R. [8], or just normalization of the whole input data [7]. 
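Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation; the `nearest_pd` repair step, the array shapes and the default `sd` value are assumptions made for the sketch:

```python
import numpy as np

def nearest_pd(a):
    """Stand-in for the 'closest positive definite matrix' step (line 5):
    symmetrize and clip the eigenvalues so a Cholesky factor exists."""
    sym = (a + a.T) / 2.0
    w, v = np.linalg.eigh(sym)
    return v @ np.diag(np.clip(w, 1e-8, None)) @ v.T

def add_decaying_noise(x, alpha, beta, epoch, sd=1.25):
    """Sketch of Algorithm 1 for an input tensor x of shape (bs, ts, nf)."""
    bs, ts, nf = x.shape
    # Lines 3-6: Cholesky factor of each feature's time-correlation matrix.
    chol = np.empty((ts, ts, nf))
    for t in range(nf):
        sigma_t = nearest_pd(np.cov(x[:, :, t], rowvar=False))
        chol[:, :, t] = np.linalg.cholesky(sigma_t)
    # Lines 7-11: noise correlated across features at each time step.
    z = np.empty((bs, ts, nf))
    for i in range(ts):
        sigma_i = nearest_pd(np.cov(x[:, i, :], rowvar=False))
        z[:, i, :] = np.random.multivariate_normal(np.zeros(nf), sigma_i, size=bs)
    # Lines 12-14: correct the initially time-independent samples.
    for j in range(nf):
        z[:, :, j] = z[:, :, j] @ chol[:, :, j]
    # Lines 15-17: shrink the amplitude over time steps and epochs.
    for w in range(ts):
        z[:, w, :] *= (beta ** (ts - 1 - w)) * (alpha ** epoch) * sd
    # Lines 18-19: return the perturbed input.
    return x + z
```

With α = β = 0.99 and sd = 1.25, the settings reported in Section 4, the perturbation shrinks geometrically as the epoch counter grows.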
In the proposed method it is important that the 3 dimensional input data comes from the same distribution as the generated noise. Since it is fairly straightforward to sample data from a 3 dimensional normal distribution, we normalize the input data using an empirical cumulative distribution function [12] and an empirical copula [4][5]. We align all central moments of the unknown distribution to the ones of the centered and standardised normal distribution. The normalization takes place before the data is reshaped into a 3 dimensional tensor.

3.2 Noise addition
The introduction of noise is not new in unsupervised learning and it was shown that it has a positive effect [14]. Adding noise to the input data and then forcing the model to learn how to ignore it has had a lot of success in generative adversarial networks [3], where convergence can be very tricky to achieve. We transformed that idea and embedded it into a supervised learning procedure. The noise addition is described in Algorithm 1, where we use the following abbreviations:

• X = [bs, ts, np] stands for the input tensor with 3 dimensions: batch size, time steps and number of features used for predictions.
• α, β are parameters that control how fast the noise decreases during the training procedure. They should be between 0 and 1, where a lower value corresponds to a faster decrease in the amplitude of the added noise.
• mvn stands for a function sampling from a two dimensional correlated Gaussian distribution, where Σ is the covariance; matmul stands for matrix multiplication.

3.3 Optimization of latent representation
The most common issue with deep learning optimization is falling into a local optimum and being unable to move past it [13]. We introduce an autoencoder part into the optimization procedure in order to force the model to shift from going directly to a local optimum to learning the latent representation first. We expect that this, combined with the addition of noise, will force the model first to learn how to ignore the noise that we added and the noise that is already in the data by the nature of the stochastic process [15]. We optimized the model using the Adam optimizer [6]. The loss function used in the optimization is defined as

L = L_Y + W_ae · decay^epoch · L_ae,

where L_Y stands for the supervised loss function, which will depend on the problem, while L_ae stands for the loss between the encoded output and the input data. The decay weight decreases the longer the training goes on.

4. RESULTS
We have divided the results section into 2 parts: unsupervised and supervised. In the first we demonstrate why the noise distribution is important. For the unsupervised part, due to hardware constraints, we have only used the cryptocurrency dataset, since we deemed it more demanding than the equity one. In the second, we demonstrate how our method increases the test metric on both datasets.

4.1 Unsupervised learning results
In order to test the efficiency of distributed noise versus just random noise, we created 3 models. The baseline model was a deep learning model with 3 stacked LSTM layers, an encoded layer, then again 3 stacked LSTM layers for the decoded output. We used Adam as the optimizer and mean-squared error as the loss function. We stopped the learning after there was no improvement for 25 epochs on the validation set. The validation set was randomly taken out of the train set. Parameters α and β were both set to 0.99 and sd was initially set to 1.25. The noise decreases with the learning procedure; interestingly, keeping the noise constant did not achieve any results.

Initially we tested the baseline model versus the de-noising model, but with uncorrelated noise. In Figure 1 the de-noising test loss function is plotted in green and the baseline test loss function in blue. Training was stopped relatively early compared to Figure 2 and it is also obvious that the de-noising test loss is even worse than that of the classic autoencoder.

Figure 1: Test loss of the autoencoder model with random noise (green) versus no noise (blue).

In the second example we switched from uncorrelated noise to noise with the same distribution as the input data. As is apparent in Figure 2, where again the de-noising test loss is plotted in green and the classic test loss in blue, the de-noising autoencoder achieved a lower test loss than the classic one.

Figure 2: Test loss of the autoencoder model with correlated noise (green) versus no noise (blue).

What we expected is that the train and validation losses would then be worse than with the classic autoencoder. Surprisingly, that was not the case. With the de-noising autoencoder using noise with the same distribution as the input data, both train and validation losses were better than with the classic one. This result is definitely worth further investigation and experimentation.

4.2 Supervised learning results
In the previous section we have shown that the distribution of the noise matters. In this section we show that noise combined with optimization of the latent representation significantly improves metrics on unseen data. Similarly as before, α and β were both set to 0.99 and sd was initially set to 1.25. From our experience this setting achieves the best results, but further exploration needs to be done. W_ae was initially set to 5 and decay to 0.95. Since we now operate in a supervised environment, we can compare our models to the majority class. But to really demonstrate the effectiveness of the method, we chose to compare the following models:

• Majority class, which serves as a sanity check.
• Random Forest with 500 trees.
• Generalized linear model.
• Deep learning model with 3 stacked LSTM layers.
• Deep learning model with 3 stacked LSTM layers and optimization of the latent representation.
• Deep learning model with 3 stacked LSTM layers and correlated noise addition.
• Finally, a deep learning model with 3 stacked LSTM layers, correlated noise addition and optimization of the latent representation.

All 4 of the deep learning models are identical; all are optimized with Adam, with categorical cross entropy as the loss function for the supervised part and mean squared error for the autoencoder part. Initially we only tested the models on the equities data, but it turned out that the equities were not chaotic enough.
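With the reported settings (W_ae initially 5, decay 0.95), the weighting of the autoencoder term in the loss from Section 3.3 can be sketched as follows; only the quoted parameter values come from the text, the rest is illustrative:

```python
def combined_loss(l_y, l_ae, epoch, w_ae=5.0, decay=0.95):
    """L = L_Y + W_ae * decay^epoch * L_ae (Section 3.3)."""
    return l_y + w_ae * decay ** epoch * l_ae

# The autoencoder term dominates early and fades as training progresses.
ae_weights = [5.0 * 0.95 ** e for e in (0, 10, 50, 100)]
```

The geometric decay means the network is pushed to learn the latent representation first and only later focuses almost entirely on the supervised target.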
By that we mean that, especially with the deep learning models, the difference between train and test loss was not big enough to be problematic. From previous work experience we know that overfit is a big issue in the cryptocurrency dataset, so we decided to test that dataset in a supervised setting as well. All models were trained three times on each dataset and the results in Table 1 and Table 2 are the averages of the 3 runs.

In Table 1 we show the results on the equity dataset. Our method managed to improve test accuracy (from 0.673 to 0.682) without decreasing train accuracy (0.681). Maintaining the test accuracy comparable to the train one is important if one needs to build an additional strategy upon the predictions. Just the noise addition slightly improved the results (from 0.673 to 0.675), while just the optimization of the latent distribution does not improve anything.

Table 1: Supervised results on the equity dataset.
Method | Train Accuracy | Test Accuracy
Majority | 0.513 | 0.537
Random Forest | 0.649 | 0.655
GLM | 0.664 | 0.655
LSTM | 0.681 | 0.673
latent LSTM | 0.633 | 0.673
noise LSTM | 0.681 | 0.675
latent noise LSTM | 0.681 | 0.682

In Table 2 we show the results on the cryptocurrency dataset. Similarly as on the equity dataset, our method behaves as intended on the cryptocurrency dataset as well. We can see reduced overfitting compared to the plain LSTM model, where it is clearly apparent. With those results we can conclude that the proof of concept works, but for additional claims we will need more testing and a deeper parameter analysis.

Table 2: Supervised results on the cryptocurrency dataset.
Method | Train Accuracy | Test Accuracy
Majority | 0.512 | 0.556
Random Forest | 0.689 | 0.692
GLM | 0.682 | 0.695
LSTM | 0.754 | 0.696
latent LSTM | 0.736 | 0.683
noise LSTM | 0.697 | 0.695
latent noise LSTM | 0.706 | 0.714

It is interesting to point out that with the proposed method the test loss on the cryptocurrency dataset was 0.552, while the train loss was 0.592. While 0.552 was the best loss any deep learning model achieved, that wide difference indicates that we could improve our model even further by fine tuning the parameters.

5. CONCLUSIONS AND FUTURE WORK
In this work we have introduced and demonstrated how the addition of noise and the simultaneous optimization of the latent representation and the target variable reduce overfitting on time series data. In the unsupervised case we have shown that the distribution of the noise matters and must align with the input data to achieve the maximum effect from the noise addition.

In future work we have to estimate the effect of the newly introduced parameters on the method's convergence. At the same time we need to explore how the method behaves when embedded into larger models, transformers for example. We also need to evaluate the method on datasets that are by nature stochastic but do not come from the financial domain. Finally, we need to evaluate our method on a dataset that is not stochastic.

6. ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency. We also wish to thank prof. dr. Ljupčo Todorovski for his help, especially with the unsupervised results.

7. REFERENCES
[1] Kraken exchange. https://www.kraken.com/.
[2] Yahoo Finance. https://finance.yahoo.com/.
[3] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53-65, 2018.
[4] P. Jaworski, F. Durante, W. K. Hardle, and T. Rychlik. Copula theory and its applications, volume 198. Springer, 2010.
[5] H. Joe. Dependence Modeling with Copulas. CRC Press, 2014.
[6] D. Kingma and J. Ba. Adam: A method for stochastic optimization. 2014. https://arxiv.org/abs/1412.6980.
[7] K. Y. Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.
[8] M. Liu, W. Wu, Z. Gu, Z. Yu, F. Qi, and Y. Li. Deep learning based on batch normalization for P300 signal detection. Neurocomputing, 275:288-297, 2018.
[9] R. C. Merton. Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics, 3(1-2):125-144, 1976.
[10] J. J. Murphy. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications. New York Institute of Finance, 1999.
[11] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29:901-909, 2016.
[12] B. W. Turnbull. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society: Series B (Methodological), 38(3):290-295, 1976.
[13] R. Vidal, J. Bruna, R. Giryes, and S. Soatto. Mathematics of deep learning. arXiv preprint arXiv:1712.04741, 2017.
[14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103, 2008.
[15] N. Wax. Selected papers on noise and stochastic processes. Courier Dover Publications, 1954.
64 Causal relationships among global indicators Matej Neumann Marko Grobelnik Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39, Ljubljana, Slovenia Jamova cesta 39, Ljubljana, Slovenia matej.neumann@student.fmf.uni- marko.grobelnik@ijs.si lj.si ABSTRACT This is the official source published by the United Nations it It is important to know how changing one thing will affect provides information on the development and implementation of another. This becomes even more important when the thing we an indicator framework for the follow up and review of the 2030 are changing will affect a lot of people. Therefore, we need a way Agenda for Sustainable Development [4]. to visualize how all the things are connected. In this paper, we will demonstrate an approach that uses Granger causality to find 2.2 The World Bank (WB) causal relationships between global indicators. Our results show As the data set provided by the UN itself often has missing that global indicators are indeed highly interconnected however, values, which results in unhealthy timeseries and unreliable they still need to be looked at within each country individually. results, we decided to add the dataset “World Development We also comment how this approach can be used to help with Indicators” from The World Bank [5]. Although the data set policy making decisions. might not be as official as the one provided by the UN, it does contain 1440 unique indicators for 266 different countries and KEYWORDS groups, where each indicator contains a timeseries ranging from Causality, Global indicators, Granger, Timeseries, SDGs the year 1960 to the present time. This addition does not only make the dataset healthier, it also introduces new indicators that are not listed in the UN SDGs. Even so our new dataset still has 1 INTRODUCTION some limitations. From Figure 1 we can see that on average a The Sustainable Development Goals (SDGs) launched on country or groups has no values for around 33% of its indicators. 
January 1, 2016 include 17 goals, 169 targets and 232 unique Therefore, from now on when talking about the indicators, we indicators with the intent to help frame the policies of the United will restrict ourselves to just those ones that have at least 20 Nations’ (UN) member states through 2030 [8]. Because the nonmissing values in their timeseries. This restriction will insure goals are highly interconnected, as the indicators are not that we are always dealing with a healthy timeseries and it is independent, it is important to understand synergies, conflicts justified as on average those indicators make up about 50% of all and causal relationships between them to support decisions. of the ones available as seen in Figure 2. Without such understanding a policy to help one goal could hurt another. For example, a policy aiming to improve hunger could conflict with climate-mitigation. This paper will focus on finding such relationship with Granger causality. Granger causality is a statistical concept of causality that is based on prediction and was traditionally only used in the financial domain however, over recent years there has been growing interest in the use of Granger causality to identify causal interactions in neural data [6]. Similar works such as [7] and [2] have already looked for causal relationships between specific SDGs. This paper confirms the previously done work and expands it by adding additional indicators and looking for causal relationship between all the indicators, not just the ones focused on SDGs. In paper [2] the authors say that the analysis of all of the indicators country by country is without doubt impractical. Figure 1: Percentage of indicators having x nonmissing Nevertheless, Table 2 shows that however impractical it may be, values in its timeseries. it is still required, as even neighboring countries have vastly different causal relationships. 
2 DESCRIPTION OF DATA 2.1 United Nations Statistics Division (UNSD) 65 future values of Y with both the past values of X and Y and not just the past values of Y. More formally, let x and y be stationary timeseries and let x(t) and y(t) be the univariate autoregression of x and y respectfully: 𝑝 𝑥(𝑡) = 𝑏 0 + ∑ 𝑏𝑖𝑥(𝑡 − 𝑖) + 𝐸2(𝑡) 𝑖=1 𝑝 𝑦(𝑡) = 𝑎0 + ∑ 𝑎𝑖𝑦(𝑡 − 𝑖) + 𝐸1(𝑡) 𝑖=1 where p is the number of chosen lagged values included in the model, 𝑎𝑖 and 𝑏𝑖 are contributions of each lagged observation to the predicted values of 𝑥(𝑡) and 𝑦(𝑡) and 𝐸 𝑖(𝑡) the difference between the predicted value and the actual value. To test the null hypothesis that x does not Granger-cause y, we augment 𝑦(𝑡) by including the lagged values of 𝑥 to get: Figure 2: percentage of indicators having at least x 𝑝 nonmissing values in its timeseries. 𝑦(𝑡) = 𝑐0 + ∑ 𝑎𝑖𝑦(𝑡 − 𝑖) + 𝑏𝑖𝑥(𝑡) + 𝐸3(𝑡). 𝑖=1 To better imagine what kind of indicators we are dealing with, We then say that x Granger-causes y if the coefficients 𝑏𝑖 are we can check Table 1 which shows the top 10 most common ones. jointly significantly different from zero. This can be tested by performing an F-test of the null hypothesis that 𝑏𝑖 = 0 for all i. Indicator name Frequency Renewable electricity output 265 3.2 Statistical significance and the p-value (% of total electricity output) In testing, a result has statistical significance if it is unlikely to Population, total 265 occur assuming the null hypothesis. More precisely, a Population growth (annual %) 265 significance level α, is the probability of the test rejecting the null Nitrous oxide emissions in 265 hypothesis, given that the null hypothesis was assumed to be true energy sector (thousand metric and the p-value is the probability of getting result at least as tons of CO2 equivalent) extreme, given that the null hypothesis is true. Then we say that Methane emissions in energy 265 the result is statistically significant when 𝑝 ≤ 𝛼. 
sector (thousand metric tons of CO2 equivalent) 3.3 Limitations of the Granger causality test Agricultural nitrous oxide 265 As its name implies, Granger causality is not necessarily true emissions (thousand metric tons causality. Having said this, it has been argued that given a of CO2 equivalent) probabilistic view of causation, Granger causality can be Agricultural methane 265 considered true causality in that sense, especially when emissions (thousand metric tons Reichenbach's "screening off" notion of probabilistic causation of CO2 equivalent) is considered [1]. Urban population growth 263 A problem may occur if both timeseries x and y are connected (annual %) via a third timeseries z. In that case our test can reject the null Urban population (% of total 263 hypothesis even if manipulation of one of the timeseries would population) not change the other. Other possible sources of problems can Urban population 263 happen due to: (1) not frequent enough or too frequent sampling, Table 1: Most common indicators and their frequency of 20 (2) time series nonstationarity, (3) nonlinear causal relationship. nonmissing values 4 EXPERIMENTS 3 METHODOLOGY 4.1 Setup Due to time constraints and the limitations of my home system, 3.1 Granger causality we decided to limit ourselves to taking just a few countries and The causal relationships between indicators were determined by groups and calculating the causality relationships for them. The the Granger causality test. The Granger causality test is a ones we decided on are: (1) United States, (2) China, (3) statistical hypothesis test for determining whether one timeseries Uruguay, (4) Slovenia, (5) Austria, (6) Croatia, (7) Italy, (8) is useful in forecasting another. Informally we say that timeseries European Union and (9) OECD. 
Our plan was to choose X Granger-causes timeseries Y if predictions of the value of Y based on its own past values and on the past values of X are better than predictions of Y based only on Y's own past values. Or in other words X Granger-causes Y if we can better explain the 66 AUS CH CRO EU ITA OECD SLO UY USA AUS 100% 4.8% 5.1% 6.9% 6.7% 6.0% 5.9% 4.4% 7.1% CH 100% 5.6 3.5% 4.3% 3.9% 4.2% 4.7% 4.3% CRO 100% 4.6% 5% 3.3% 6.6% 3.8% 5.6% EU 100% 11% 20% 5.7% 3.6% 10% ITA 100% 6.7% 7.5% 3.8% 6.7% OECD 100% 5% 3% 17% SLO 100% 3.5% 5.6% UY 100% 4.2% USA 100% Table 2: Percentage of same causal relationships. That being said one can easily imagine why each population age Granger-causes a few of the major world powers and compare the differences and similarities between the causal relationships. the next one. For example, if we know the percentage of people 4.2 Modeling the dataset aged 4, we can pretty accurately predict what the percentage of Once the data was collected from the UNSD and WB website it people aged 5 is going to be in the next year. first had to be put into a suitable form. We decided on a 3D matrix where the first component represented the country or SDG Buzzwords group, the second component represented the time series and last Zero Hunger nourishment, food, stun, anemia, one representing the indicator. agriculture 4.3 Parameters Clean Water and water, sanitation, drinking, drink, Sanitation hygiene, freshwater As mentioned before, when searching for causal relationships in a certain country or group we limit ourselves only to those Affordable and energy, electricity, fuel indicators who have at least 20 nonmissing values. Furthermore, Clean Energy we chose a significance level of 0.05 or 5% and tested for lagged Climate Action disaster, disasters, climate, natural, values from 1 to 4. 
4.4 Determining causality
Once the modeling was done and the parameters were set, we first needed to make sure that the timeseries were stationary. To do that, we ran the ADF test and differenced the time series accordingly to make them stationary. Then we ran the Granger causality test 4 times, once for each lagged value, for each of the 9 countries and groups listed in 4.1. The results were then saved in a 1440x1440 weighted adjacency matrix, where the (i, j) element is nonzero if and only if the i-th indicator Granger-caused the j-th indicator for all lagged values between 1 and 4; its weight is the average of the 4 p-values. Once we had the weighted adjacency matrix, we matched the available indicators with the 17 SDGs by comparing the most common buzzwords found in the descriptions of the SDGs and the names of the indicators. An example of some of the buzzwords can be seen in Table 3.

Zero Hunger — nourishment, food, stun, anemia, agriculture
Clean Water and Sanitation — water, sanitation, drinking, drink, hygiene, freshwater
Affordable and Clean Energy — energy, electricity, fuel
Climate Action — disaster, disasters, climate, natural, risk, Sendai, environment, environmental, green, developed, pollution
Good Health and Well-Being — mortality, birth, infection, tuberculosis, malaria, hepatitis, disease, cancer, diabetes, treatment, alcohol, death, health, pollution, medicine

Table 3: Some of the most common buzzwords found in the SDGs

5 RESULTS
With the weighted adjacency matrix in hand, it is sensible to ask whether there exist any causal relationships that hold true for each of the tested countries or groups. The answer is positive, as seen in Figure 3. We can, however, see that the only causal relationships that survived were the ones that connected different population ages to each other. This result seems sensible, as in general no two countries are exactly the same and are therefore going to have a unique set of causal relationships.

Figure 3: Only causal relationships that are true for each of the 9 countries and groups (continuous down).
On the other hand, one may assume that if we compare countries which are close to each other or are historically connected, then the causal relationships should not differ by a lot. That, however, is not the case, as can be seen in Table 2. This suggests that, when talking about causal relationships, one must look at each country or group individually.

Figure 4: Interconnectedness of SDGs.

Therefore, let's focus just on Slovenia. Due to Slovenia having 10083 positive causal relationships, we will limit ourselves to just those that interact with SDGs. Figure 4 shows that SDGs are indeed not independent and are in fact highly interconnected. The presence of self-loops also suggests that there exist causal relationships between indicators of an SDG itself. This result has two consequences:
• When thinking about policies aiming to improve one goal, we need to be careful not to harm another.
• Instead of outright improving one goal, we can instead focus on the ones that are in a causal relationship with the one we wish to improve.

Let's give an example. Suppose we wanted to implement a policy to help lower the suicide mortality rate, but we are not sure how to do that directly. We can therefore instead check which indicators Granger-cause the one we are trying to improve. In our case, the indicator "Unemployment, youth total (% of total labor force ages 15-24)" Granger-causes the suicide mortality rate. Therefore, if we improved the percentage of unemployed young people, we would also be able to reduce the suicide mortality rate, which was our initial goal.

6 CONCLUSION AND FUTURE WORK
In this paper we demonstrated an approach for calculating causality between interdependent global indicators and outlined how this can help with implementing policies. We also showed that neighboring and similar countries in general do not have the same causal relationships, which makes it hard to group them together. However, finding such a grouping, if it exists, could be done in the future. The approach shown in this paper could also be applied to find causal relationships between certain Google searches and natural events. For example, we could check whether there is any correlation between an increase in users searching the words "water", "rain", or "cloud" and the likelihood of a flood happening.

7 ACKNOWLEDGMENTS
This work has been supported by the Slovenian research agency.

8 REFERENCES
[1] M. Michael and S. L. Bressler, "Foundational perspectives on causality in large-scale brain networks," Physics of Life Reviews, pp. 107-123, 2015.
[2] G. Dörgő, V. Sebestyén and J. Abonyi, "Evaluating the Interconnectedness of the Sustainable Development Goals Based on the Causality Analysis of Sustainability Indicators," Sustainability, 2018.
[3] C. Stefano and S. Sangwon, "Cause-effect analysis for sustainable development policy," NRC Research Press, 2017.
[4] https://unstats.un.org/sdgs/indicators/database/.
[5] https://datacatalog.worldbank.org/search/dataset/0037712/World-Development-Indicators.
[6] B. Corrado and K. Peter, "On the directionality of cortical interactions studied," Biological Cybernetics, 1999.
[7] K. Irfan, H. Fujun and P. L. Hoang, "The impact of natural resources, energy consumption, and population growth on environmental quality: Fresh evidence from the United States of America," Science of The Total Environment, 2020.
[8] H. Tomáš, J. Svatava and M. Bedřich, "Sustainable Development Goals: A need for relevant indicators," Ecological Indicators, pp. 565-573, 2016.


Active Learning for Automated Visual Inspection of Manufactured Products

Elena Trajkova* — University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia — trajkova.elena.00@gmail.com
Jože M. Rožanec* — Jožef Stefan International Postgraduate School, Ljubljana, Slovenia — joze.rozanec@ijs.si
Paulien Dam — Philips Consumer Lifestyle BV, Drachten, The Netherlands — paulien.dam@philips.com
Blaž Fortuna (Qlector d.o.o., Ljubljana, Slovenia)
Dunja Mladenić
(Jožef Stefan Institute, Ljubljana, Slovenia)
blaz.fortuna@qlector.com, dunja.mladenic@ijs.si
*Both authors contributed equally to this research.

ABSTRACT
Quality control is a key activity performed by manufacturing enterprises to ensure products meet quality standards and avoid potential damage to the brand's reputation. The decreased cost of sensors and connectivity enabled an increasing digitalization of manufacturing. In addition, artificial intelligence enables higher degrees of automation, reducing the overall costs and time required for defect inspection. In this research, we compare three active learning approaches and five machine learning algorithms applied to visual defect inspection with real-world data provided by Philips Consumer Lifestyle BV. Our results show that active learning reduces the data labeling effort without detriment to the models' performance.

CCS CONCEPTS
• Information systems → Data mining; • Computing methodologies → Computer vision problems; • Applied computing.

KEYWORDS
Smart Manufacturing, Machine Learning, Automated Visual Inspection, Defect Detection

ACM Reference Format:
Elena Trajkova, Jože M. Rožanec, Paulien Dam, Blaž Fortuna, and Dunja Mladenić. 2021. Active Learning for Automated Visual Inspection of Manufactured Products. In Ljubljana '21: Slovenian KDD Conference on Data Mining and Data Warehouses, October, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 4 pages.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SiKDD '21, October, 2021, Ljubljana, Slovenia
© 2021 Copyright held by the owner/author(s).

1 INTRODUCTION
Quality control is one of the critical activities that must be performed by manufacturing enterprises [27, 28]. The main purpose of such activity is to detect products that do not meet quality standards, avoid rework and supply chain disruptions, and avoid potential damage to the brand's reputation [3, 27]. Along with the information regarding defective products, it provides insights into when and where such defects occur, which can be used to further dig into the root causes of such defects and into mitigation actions to improve the quality of manufacturing products and processes.

The decreased cost of sensors and connectivity enabled an increasing digitalization of manufacturing [3], which, along with the adoption of Artificial Intelligence (AI) [12], represents an opportunity towards enhancing defect detection in industrial settings [5]. Manual inspection has low scalability (it requires time to train an inspector, the employees can work a limited amount of time and are subject to fatigue, and the inspection itself is slow), its quality can be affected by operator-to-operator inconsistency, and it depends on the complexity of the task, the employees (e.g., their intelligence, experience, well-being), the environment (e.g., noise and temperature), and the management support and communication [23]; none of these factors affect the outcome of automated quality inspection. Machine learning has been successfully applied to defect detection in a wide range of scenarios [1, 9, 11, 15, 21].

An annotated dataset must be acquired to successfully implement machine learning models for defect detection. The increasing number of sensors provides large amounts of data. As the manufacturing process quality increases, the data obtained from the sensors is expected to be highly imbalanced: most of the data instances will correspond to non-defective products, and a small proportion of them will correspond to different kinds of defects. Annotating all the data is prone to similar limitations as the manual inspection described in the paragraph above. It is thus imperative to provide strategies to select a limited subset of instances that are most informative to the defect detection models.

We frame the defect detection problem as a supervised learning problem. Given a large amount of unlabeled data, and based on the premise that only a tiny fraction of the data provides new information to the model and thus has the potential to enhance its performance, we adopt an active learning approach. Active learning is a subfield of machine learning that attempts to identify the most informative unlabeled data instances, for which labels are requested from some oracle (e.g., a human expert) [24]. This research compares three active learning strategies: pool-based sampling, stream-based sampling, and query by committee.

The main contributions of this research are (i) a comparative study between the five most frequently cited machine learning algorithms for automated defect detection and (ii) three active learning approaches, (iii) evaluated on a real-world multiclass classification problem.
We develop the machine learning models with images provided by the Philips Consumer Lifestyle BV corporation. The dataset comprises shaver images divided into three classes, based on the defects related to the printing of the logo of the Philips Consumer Lifestyle BV corporation: good shavers, shavers with double printing, and shavers with interrupted printing.

We evaluate the models using the area under the receiver operating characteristic curve (AUC ROC, see [4]). AUC ROC is widely adopted as a classification metric, having many desirable properties, such as being threshold independent and invariant to a priori class probabilities. We measure AUC ROC considering prediction scores cut at a threshold of 0.5.

This paper is organized as follows. Section 2 outlines the current state of the art and related works, Section 3 describes the use case, and Section 4 provides a detailed description of the methodology and experiments. Finally, Section 5 outlines the results obtained, while Section 6 concludes and describes future work.

2 RELATED WORK
Among the many techniques used for automated defect inspection, we find automated visual inspection, which refers to image processing techniques for quality control, usually applied in the production line of manufacturing industries [1]. Visual inspection requires extracting features from the images, which are used to train the machine learning model. This procedure is simplified when using deep learning models, which enable end-to-end learning, where a single architecture can perform feature extraction and classification [10, 18], and have shown state-of-the-art performance for image classification [20].

Automated visual inspection for defect detection has been applied to multiple manufacturing use cases. [21] manually extracted features (e.g., histograms) from machine component images and compared the performance of the Näive Bayes and C4.5 models. [9] extracted statistical features from the images and compared the performance of Support Vector Machines (SVM), Multilayer Perceptron (MLP), and k-nearest neighbors (kNN) models for visual inspection of microdrill bits in printed circuit board production. [11] used 3D convolutional filters applied on computed tomography images and an SVM classifier for defect detection during metallic powder bed fusion in additive manufacturing. [15] used some heuristics to detect regions of interest on slate slab images, on which they performed feature engineering to later train an SVM model. Finally, [1] reported using a custom neural network for feature extraction and an SVM model for classification when inspecting aerospace components.

While the authors cited above worked with fully labeled datasets, a production line continually generates new data, exceeding the labeling capacity. A possible solution to this issue is the use of active learning, where the active learner identifies informative unlabeled instances and requests labels from some oracle. Typical scenarios involve (i) membership query synthesis (a synthetic data instance is generated), (ii) stream-based selective sampling (the unlabeled instances are drawn one at a time, and a decision is made whether a label is requested or the sample is discarded), and (iii) pool-based selective sampling (queries samples from a pool of unlabeled data). Among the frequently used querying strategies, we find (i) uncertainty sampling (select the unlabeled sample with the highest uncertainty, given a certain metric or machine-learning model [17]), or (ii) query-by-committee (retrieve the unlabeled sample with the highest disagreement between a set of forecasting models (the committee)) [6, 24]. More recently, new scenarios have been proposed leveraging reinforcement learning, where an agent learns to select images based on the similarity relationship between the instances and rewards obtained based on the oracle's feedback [22]. In addition, it has been demonstrated that ensemble-based active learning can effectively counteract class imbalance through the acquisition of new labeled images [2].

Active learning has been successfully applied in the manufacturing domain, but scientific literature remains scarce in this domain [19]. Some use cases include the automatic optical inspection of printed circuit boards [8] and the identification of the local displacement between two layers of a chip in the semiconductor industry [25]. The use of machine learning automates defect detection, and active learning enables an inspection by exception [5], only querying for labels of the images the model is most uncertain about. While this considerably reduces the volume of required inspections, it is also essential to consider that it can produce an incomplete ground truth by missing the annotations of defective parts classified as false negatives and not queried by the active learning strategy [7].

3 USE CASE
The use case provided for this research corresponds to the visual inspection of shavers produced by Philips Consumer Lifestyle BV. The visual quality inspection aims to detect defective printing of a logo on the shavers. This use case focuses on four pad printing machines set up for a range of different products and different logos. A lot of products are produced every day on these machines; they are manually handled and inspected for their visual quality, and removed from further processing if the prints on the products are not classified as good. Operators spend several seconds handling, inspecting, and labeling each product. An automated visual quality inspection system would strongly reduce the need to manually inspect and label the images, and it could speed up the process by more than 40%. Currently, there are two types of defects related to the printing quality of the logo on the shaver: double printing and interrupted printing. Therefore, images are classified into three classes: good printing (class zero), double printing (class one), and interrupted printing (class two). A labeled dataset with a total of 3,518 images was provided to train and test the models.

4 METHODOLOGY
We pose automated defect detection as a multiclass classification problem.
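The pool-based uncertainty-sampling idea described in Section 2 can be sketched in a few lines of scikit-learn. The synthetic dataset and logistic-regression scorer below are illustrative stand-ins, not the data or models used in this study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy three-class problem standing in for the shaver images.
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_lab, y_lab, X_pool = X[:50], y[:50], X[50:]   # small labeled seed + pool

model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
proba = model.predict_proba(X_pool)
uncertainty = 1.0 - proba.max(axis=1)    # least-confidence criterion
query_idx = int(np.argmax(uncertainty))  # instance sent to the oracle
```

After the oracle labels the queried instance, it is moved from the pool to the labeled set and the model is retrained; stream-based sampling applies the same criterion to one incoming instance at a time instead of to the whole pool.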
We measure the model's performance with the AUC ROC metric, using the "one-vs-rest" heuristic method, which involves splitting the multiclass dataset into multiple binary classification problems. Furthermore, we calculate the metrics for each class and compute their average, weighted by the number of true instances for each class.

To extract features from the images, we make use of the ResNet-18 model [13], extracting embeddings from the Average Pooling layer. Since the embedding results in 512 features, which could cause overfitting, we use mutual information to evaluate the most relevant ones and select the top K features, with K = √N, where N is the number of data instances in the train set, as suggested in [14].

To evaluate the models' performance across different active learning strategies, we apply a stratified k-fold cross-validation [29], using one fold for testing, one fold as a pool of unlabeled data for active learning, and the rest for training the model.

When analyzing the results, we were interested in how the models' performance evolved through time and in significant variations between the first and last results observed. To that end, we assessed the statistical significance between the means of the first and last quartiles of the test fold for each active learning scenario. We assessed the statistical significance using the Wilcoxon signed-rank test, with a p-value of 0.05. While such variations existed and were positive in most test folds (the models learned through time), the improvements were not statistically significant in any of the scenarios.
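The K = √N mutual-information selection step can be sketched as follows. The random matrix stands in for the 512-dimensional ResNet-18 embeddings, and the sample size is an arbitrary choice for illustration, not the paper's:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
N = 400                                  # assumed train-set size
embeddings = rng.normal(size=(N, 512))   # stand-in for ResNet-18 embeddings
labels = rng.integers(0, 3, size=N)      # three print-quality classes

K = int(np.sqrt(N))                      # K = sqrt(400) = 20
mi = mutual_info_classif(embeddings, labels, random_state=0)
top_k = np.argsort(mi)[-K:]              # indices of the K highest-MI features
X_selected = embeddings[:, top_k]        # reduced feature matrix
```

With real embeddings, `mi` would concentrate on the dimensions that are informative about the defect classes; on this random stand-in it merely demonstrates the shapes involved.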
We adopt k = 10 based on recommendations by [16], and query all available unlabeled instances to evaluate the active learning approaches. We compare three active learning scenarios: (i) stream-based classifier uncertainty sampling, accepting instances with an uncertainty above the 75th percentile of observed instances, (ii) pool-based sampling, selecting the instances a given model is most uncertain about, and (iii) pool-based sampling considering a query-by-committee strategy, where the committee is created with models trained with the five algorithms we consider in this research: Gaussian Näive Bayes, CART (Classification and Regression Trees, similar to C4.5, but it does not compute rule sets), Linear SVM, MLP, and kNN. Comparing deep learning models remains a subject of future work. Finally, we compare the performance of the active learning scenarios by computing the average AUC ROC of each fold and assess whether the differences in results obtained from each model are statistically significant by using the Wilcoxon signed-rank test [26], using a p-value of 0.05.

5 RESULTS AND ANALYSIS
The results obtained from the experiments described in Section 4 are presented in Table 1 and Table 2. Table 1 reports the average AUC ROC per active learning scenario and model for each cross-validation test fold. We observe that the best model across strategies is the MLP, which achieved the best or second-best performance across almost every fold in pool-based and stream-based active learning. Among those two scenarios, the best results were obtained for stream-based active learning. We observed the same across the rest of the models, though the differences were not significant for all but the Näive Bayes models (see Table 2). Query-by-committee displayed a strong performance, showing the best results immediately after the MLP.

6 CONCLUSION
In this paper, we compared three active learning scenarios (pool-based, stream-based with classifier uncertainty sampling, and query-by-committee) across five machine learning algorithms (Gaussian Näive Bayes, CART, Linear SVM, MLP, and kNN). We found that the best performance was achieved by the MLP model regardless of the active learning strategy. The second-best performance was obtained through the query-by-committee strategy, while the frequently used SVM models ranked third. We found no significant difference between using pool-based or stream-based active learning approaches. Results from the query-by-committee approach were statistically significant in all cases and better than all the models, except for the MLPs. Finally, we found no case where the improvement between the first and last quartile of the test fold in each active learning scenario was significant. We believe that further investigation is required to determine whether a larger pool of unlabeled images would help achieve such a significant difference. Future work will focus on data augmentation techniques that could help achieve a statistically significant improvement over time when applying active learning techniques.

ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 program project STAR under grant agreement number H2020-956573. The authors acknowledge the valuable input and help of Jelle Keizer and Yvo van Vegten from Philips Consumer Lifestyle BV.

REFERENCES
[1] Carlos Beltrán-González, Matteo Bustreo, and Alessio Del Bue. 2020. External and internal quality inspection of aerospace components. In 2020 IEEE 7th International Workshop on Metrology for AeroSpace (MetroAeroSpace). IEEE, 351–355.
[2] William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler. 2018. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9368–9377.
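The per-model comparison of scenarios reduces to a paired Wilcoxon signed-rank test over the ten fold-wise AUC ROC values. For example, feeding the MLP rows of Table 1 to scipy yields a p-value close to the stream-based vs. pool-based MLP entry of Table 2 (the exact value depends on scipy's handling of the zero difference in fold 4):

```python
import numpy as np
from scipy.stats import wilcoxon

# MLP AUC ROC per cross-validation fold, copied from Table 1.
stream = np.array([0.9900, 0.9928, 0.9846, 0.9563, 0.9804,
                   0.9807, 0.9710, 0.9729, 0.9793, 0.9845])
pool = np.array([0.9892, 0.9921, 0.9845, 0.9563, 0.9790,
                 0.9803, 0.9702, 0.9723, 0.9806, 0.9840])

stat, p = wilcoxon(stream, pool)  # paired, two-sided
significant = p < 0.05            # not significant at the 5% level
```

Stream-based is better on eight of the nine non-tied folds, yet the test does not reject at the 5% level, matching the paper's conclusion that the pool-based and stream-based scenarios do not differ significantly for the MLP.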
When assessing the statistical significance between the query-by-committee scenario and the results obtained from the different models with stream-based and pool-based strategies, we observed that the differences were significant in all cases, except for the SVM models. SVM models, the most widely used in active learning literature related to automated defect inspection, were the third-best models among the tested ones, immediately after the MLPs, in stream-based and pool-based active learning and the query-by-committee approach. SVM models did not display significant differences when compared across different active learning scenarios. The worst results were consistently observed for the CART models.

Active Learning scenario / Model   Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10
stream-based
  CART                             0.8168  0.7828  0.7810  0.7694  0.8196  0.7805  0.7843  0.7970  0.8409  0.7940
  kNN                              0.9289  0.9121  0.9174  0.8686  0.9024  0.9000  0.9051  0.8960  0.9282  0.9082
  MLP                              0.9900  0.9928  0.9846  0.9563  0.9804  0.9807  0.9710  0.9729  0.9793  0.9845
  Näive Bayes                      0.8818  0.8668  0.8819  0.8686  0.8829  0.8899  0.8650  0.8877  0.8864  0.9098
  SVM                              0.9752  0.9828  0.9725  0.9530  0.9816  0.9720  0.9570  0.9412  0.9824  0.9712
pool-based
  CART                             0.7584  0.7904  0.7543  0.7468  0.8441  0.7730  0.8044  0.7701  0.7850  0.7412
  kNN                              0.9189  0.9149  0.9161  0.8581  0.9055  0.9036  0.8961  0.8910  0.9224  0.9056
  MLP                              0.9892  0.9921  0.9845  0.9563  0.9790  0.9803  0.9702  0.9723  0.9806  0.9840
  Näive Bayes                      0.8800  0.8654  0.8809  0.8677  0.8813  0.8895  0.8637  0.8873  0.8850  0.9090
  SVM                              0.9752  0.9819  0.9726  0.9518  0.9806  0.9712  0.9562  0.9412  0.9823  0.9722
query-by-committee                 0.9774  0.9824  0.9714  0.9500  0.9723  0.9726  0.9597  0.9571  0.9830  0.9734

Table 1: AUC ROC values obtained across the ten cross-validation folds. Best results are bolded, second-best results are highlighted in italics.

Model         stream-based vs. pool-based   stream-based vs. query-by-committee   pool-based vs. query-by-committee
CART          0.0840                        0.0020                                0.0020
kNN           0.1309                        0.0020                                0.0020
MLP           0.0856                        0.0039                                0.0039
Näive Bayes   0.0020                        0.0020                                0.0020
SVM           0.1824                        0.4316                                0.6250

Table 2: p-values obtained for the Wilcoxon signed-rank test when comparing the average of AUC ROC results across ten cross-validation folds.

[3] Tajeddine Benbarrad, Marouane Salhaoui, Soukaina Bakhat Kenitar, and Mounir Arioua. 2021. Intelligent machine vision model for defective product inspection based on machine learning. Journal of Sensor and Actuator Networks 10, 1 (2021), 7.
[4] Andrew P. Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 7 (1997), 1145–1159. https://doi.org/10.1016/S0031-3203(96)00142-2
[5] Amal Chouchene, Adriana Carvalho, Tânia M Lima, Fernando Charrua-Santos, Gerardo J Osório, and Walid Barhoumi. 2020. Artificial intelligence for product quality inspection toward smart industries: quality control of vehicle non-conformities. In 2020 9th International Conference on Industrial Technology and Management (ICITM). IEEE, 127–131.
[6] David Cohn, Les Atlas, and Richard Ladner. 1994. Improving generalization with active learning. Machine Learning 15, 2 (1994), 201–221.
[7] Antoine Cordier, Deepan Das, and Pierre Gutierrez. 2021. Active learning using weakly supervised signals for quality inspection. arXiv preprint arXiv:2104.02973 (2021).
[8] Wenting Dai, Abdul Mujeeb, Marius Erdt, and Alexei Sourin. 2018. Towards automatic optical inspection of soldering defects. In 2018 International Conference on Cyberworlds (CW). IEEE, 375–382.
[9] Guifang Duan, Hongcui Wang, Zhenyu Liu, and Yen-Wei Chen. 2012. A machine learning-based framework for automatic visual inspection of microdrill bits in PCB production. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 6 (2012), 1679–1689.
[10] Tobias Glasmachers. 2017. Limits of end-to-end learning. In Asian Conference on Machine Learning. PMLR, 17–32.
[11] Christian Gobert, Edward W Reutzel, Jan Petrich, Abdalla R Nassar, and Shashi Phoha. 2018. Application of supervised machine learning for defect detection during metallic powder bed fusion additive manufacturing using high resolution imaging. Additive Manufacturing 21 (2018), 517–528.
[12] Irlán Grangel-González. 2019. A knowledge graph based integration approach for industry 4.0. Ph.D. Dissertation. Universitäts- und Landesbibliothek Bonn.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[14] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. 2005. Optimal number of features as a function of sample size for various classification rules. Bioinformatics 21, 8 (2005), 1509–1515.
[15] Carla Iglesias, Javier Martínez, and Javier Taboada. 2018. Automated vision system for quality inspection of slate slabs. Computers in Industry 99 (2018), 119–129.
[16] Max Kuhn, Kjell Johnson, et al. 2013. Applied Predictive Modeling. Vol. 26. Springer.
[17] David D Lewis and Jason Catlett. 1994. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994. Elsevier, 148–156.
[18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
[19] Lingbin Meng, Brandon McWilliams, William Jarosinski, Hye-Yeong Park, Yeon-Gil Jung, Jehyun Lee, and Jing Zhang. 2020. Machine learning in additive manufacturing: A review. JOM 72, 6 (2020), 2363–2377.
[20] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria Presa Reyes, Mei-Ling Shyu, Shu-Ching Chen, and Sundaraja S Iyengar. 2018. A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys (CSUR) 51, 5 (2018), 1–36.
[21] S Ravikumar, KI Ramachandran, and V Sugumaran. 2011. Machine learning approach for automated visual inspection of machine components. Expert Systems with Applications 38, 4 (2011), 3260–3266.
[22] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. 2020. A survey of deep active learning. arXiv preprint arXiv:2009.00236 (2020).
[23] Judi E See. 2012. Visual inspection: a review of the literature. Sandia Report SAND2012-8590, Sandia National Laboratories, Albuquerque, New Mexico (2012).
[24] Burr Settles. 2009. Active learning literature survey. (2009).
[25] Karin van Garderen. 2018. Active Learning for Overlay Prediction in Semiconductor Manufacturing. (2018).
[26] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 196–202.
[27] Thorsten Wuest, Christopher Irgens, and Klaus-Dieter Thoben. 2014. An approach to monitoring quality in manufacturing using supervised machine learning on product state data. Journal of Intelligent Manufacturing 25, 5 (2014), 1167–1180.
[28] Jing Yang, Shaobo Li, Zheng Wang, Hao Dong, Jun Wang, and Shihao Tang. 2020. Using deep learning to detect defects in manufacturing: a comprehensive survey and current challenges. Materials 13, 24 (2020), 5755.
[29] Xinchuan Zeng and Tony R Martinez. 2000. Distribution-balanced stratified cross-validation for accuracy estimation. Journal of Experimental & Theoretical Artificial Intelligence 12, 1 (2000), 1–12.
Learning to Automatically Identify Home Appliances

Dan Lorbek Ivančič1, Blaž Bertalanič1,2, Gregor Cerar1, Carolina Fortuna1
1Jozef Stefan Institute, Ljubljana, Slovenia
2Faculty of Electrical Engineering, University of Ljubljana, Slovenia
E-mail: dl0586@student.uni-lj.si

Abstract. Appliance load monitoring (ALM) is a technique that enables increasing the efficiency of domestic energy usage by obtaining appliance-specific power consumption profiles. While machine learning has been shown to be suitable for ALM, work on analyzing design trade-offs during the feature and model selection steps of ML model development is limited. In this paper we show that 1) statistical features capturing the shape of the time series yield superior performance by up to 20 percentage points, and 2) our best deep neural network-based model slightly outperforms our best gradient boosted decision trees by 2 percentage points at the expense of increased training time.

1 Introduction

Household energy consumption accounts for a large proportion of the world's total energy consumption. The first studies, conducted as early as the 1970s, showed that as much as 25% of national energy was consumed by domestic appliances alone. This figure rose to 30% in 2001 [1] and continues to increase at an exponential rate. Some researchers even predict that these numbers will double by 2030 [2].

In support of rationalizing consumption, appliance load monitoring (ALM) has been introduced. It aims to help solve domestic energy usage related issues by obtaining appliance-specific power consumption profiles. Such data can help devise load scheduling strategies for optimal energy utilization [2]. Additionally, data about appliance usage can provide useful insight into the daily activities of residents, which can be useful for long-distance monitoring of elderly people who prefer to stay at home rather than going to retirement homes [2]. Other applications include theft detection, building safety monitoring, etc.

The two different ways of realizing ALM are intrusive load monitoring (ILM) and non-intrusive load monitoring (NILM). While ILM is known to be more accurate, it requires multiple sensors to be installed throughout the entire building, which incurs extra hardware cost and installation complexity. NILM, however, is a cost-effective, easy-to-maintain process for analyzing changes in the voltage [3] and current going into a building without having to install any additional sensors on different household devices, since it operates using only data obtained from the single main smart meter in a building. The obtained data is then disaggregated, and each individual appliance and its energy consumption are detected.

One promising approach to ILM for automatic identification of home appliances is the use of machine learning (ML). For instance, in [4] they used ML to find patterns in the data and extract useful information such as the type of load, electricity consumption details and the running conditions of appliances. More recently, [5] focused on the study of design trade-offs during the feature and model selection steps of the development of an ML-based classifier for ILM. In their study they considered various statistical summaries for feature engineering and classical machine learning techniques for model selection. We complement the work in [5] by extending the feature set with additional shape-capturing values and considering deep learning (DNN) and gradient boosted trees (XGBoost) as promising modelling techniques. The contributions of this paper are as follows:

• We explore a variety of different statistical features and show that the ones capturing the shape of the time series, such as longest strike above mean, longest strike below mean, absolute energy and kurtosis, yield superior performance by up to 20 percentage points.

• We show that our best DNN-based model slightly outperforms our best XGBoost by 2 percentage points at the expense of increased training time. We also show that our models outperform the results from [5] by 5 percentage points.

The paper is organized as follows. Section 2 summarizes related work, Section 3 formulates the problem and provides methodological details, Section 4 focuses on the study of feature selection trade-offs, while Section 5 discusses model selection. Concluding remarks are drawn in Section 6.

2 Related Work

Existing work that uses machine learning for ALM, such as [6], investigates the performance of deep learning neural networks on NILM classification tasks and builds a model that is able to accurately detect activations of common electrical appliances using data from the smart meter. More complex DNNs for NILM classification tasks are presented by the authors in [3], where they introduce a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) based model and show that it outperforms the considered baselines. In [7] they approach a similar problem by proposing a convolutional neural network based model that allows simultaneous detection and classification of events without having to perform double processing. In [8] the authors train a temporal convolutional neural network to automatically extract high-level load signatures for individual appliances, while in [9] a feature extraction method using multiple parallel convolutional layers is presented and an LSTM recurrent neural network based model is proposed.
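The shape-capturing statistics that turn out to matter most in the feature selection study below (longest strike above/below mean, absolute energy, kurtosis) are straightforward to compute from a raw power time series. The following is a minimal NumPy sketch of three of them; the helper names are ours for illustration, not the tsfresh API the paper actually uses:

```python
import numpy as np

def longest_strike_above_mean(x):
    """Length of the longest run of consecutive values above the series mean."""
    best = run = 0
    for above in x > x.mean():
        run = run + 1 if above else 0
        best = max(best, run)
    return best

def absolute_energy(x):
    """Sum of squared values, E = sum_i x_i^2."""
    return float(np.dot(x, x))

def kurtosis(x):
    """Excess kurtosis: how heavily the tails differ from a normal distribution."""
    mu, sigma = x.mean(), x.std()
    return float(np.mean((x - mu) ** 4) / sigma ** 4 - 3.0)

# A toy 'toaster-like' pulse: mostly zero with a short burst of power draw.
pulse = np.array([0.0, 0.0, 5.0, 5.0, 5.0, 0.0, 0.0])
features = [longest_strike_above_mean(pulse), absolute_energy(pulse), kurtosis(pulse)]
```

The longest strike below mean is symmetric (compare against `x < x.mean()`); the point of such summaries is that a short high pulse and a long low plateau with the same total energy produce very different feature vectors.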
3 Problem formulation

Our goal was to design a classifier that, given an input time series T, is able to accurately map this data to the appropriate class C, as shown in equation 1:

C = \Phi(T)   (1)

where \Phi represents the mapping function from time series to target classes and C is a set of these classes, where each class corresponds to one of the following household appliances: computer monitor, laptop computer, television, washer dryer, microwave, boiler, toaster, kettle and fridge. The appliances and measured data illustrated in Figure 1, available in the public UK-DALE dataset, are used. UK-DALE (UK Domestic Appliance-Level Electricity) contains the power demand from 5 different houses in the United Kingdom. The dataset was built at a sample rate of 16 Hz for the whole house and 0.1667 Hz for each individual appliance. Data is split into 1-hour-long segments; each dataset sample contains a time series with 600 datapoints, as depicted in Figure 1.

[Figure 1: Selected appliances, showing power in relation to time over a 1 hour interval.]

For realizing \Phi, we first perform a feature selection task followed by a model selection one. For selecting the best feature set, we perform feature selection in Section 4. For model selection, we go beyond the work in [5] and consider deep learning architectures enabled by Tensorflow and advanced decision trees that use an optimized distributed gradient boosting technique available in the XGBoost open source library, as detailed in Section 5.

4 Feature selection

As can be seen in Figure 1, the time series corresponding to each device has a unique shape and patterns, therefore an intuitive approach to feature selection is to extract statistical properties of the time series that would capture the unique properties of the signals. For instance, a summary such as the peak-to-peak value is able to capture the difference between the maximum and minimum value in a time series signal, while one such as skewness is able to describe the asymmetry in the distribution of datapoints in a particular sample. A good combination of such features would be able to inform the model with relevant information about the power consumption of each appliance, making it easier to find patterns in the data and perform the classification task more accurately. Recently, standard tools for computing a large range of such summaries are provided by dedicated time series feature engineering tools such as tsfresh (https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html).

Following an extensive evaluation of combinations of time series features, we report the results for a representative selection of three feature sets as follows:

FeatureSet1 - This feature set consists of the raw time series, containing 2517 time series samples, each with 600 datapoints. It is used as a baseline to see the performance achieved with the available data.

FeatureSet2 - This feature set consists of: mean value, maximum, minimum, standard deviation, variance, peak-to-peak, count above mean, count below mean, mean change, absolute mean change, absolute energy. The count above and below mean counts the number of values in each sample that are higher or lower than the mean value of that same sample, and helps quantify the width of a pulse such as the ones for the toaster and microwave from Figure 1. The mean absolute change gives the mean over the absolute differences between subsequent time series values. The absolute energy represents the sum of squared values, calculated using the formula shown in equation 2, and provides information on whether a specific appliance has a large consumption profile or not:

E = \sum_{i=0}^{n-1} x_i^2   (2)

FeatureSet3 - After taking a deeper look into the features from FeatureSet2, we noticed that minimum is redundant as it is usually zero in every sample, and peak-to-peak is in most cases equal to the maximum value due to the lowest value mostly being zero. This feature set consists of: maximum, standard deviation, mean absolute change, mean change, longest strike above mean, longest strike below mean, absolute energy, kurtosis, number of peaks in each signal. The longest strike above and below mean returns the length of the longest consecutive subsequence that is higher or lower than the mean value of that specific sample. The kurtosis is another metric describing the probability distribution and measures how heavily the tails of a distribution differ from the tails of a normal distribution.

4.1 Results

The results of the feature selection process are listed in Table 1 for the two techniques considered in this paper. As can be seen from the second column (Inst.) of Table 2, the dataset is balanced. From columns 3-5 of Table 1 it can be seen that for the baseline FeatureSet1, the f1 score is 0.57 for the CNN and 0.77 for XGB. By using features that better capture the shape of the time series, as in the case of FeatureSet2, an improvement of up to 20% can be seen as follows: the f1 of the CNN model increases to 0.89, the precision to 0.92 and the recall to 0.88. The XGBoost model also performed better, with an f1 of 0.87, precision of 0.87 and recall of 0.86. Finally, it can be seen from the table that FeatureSet3 performs the best, with an f1 of 0.90, precision of 0.93 and recall of 0.90 for the CNN model, and an f1 of 0.89, precision of 0.89 and recall of 0.89 for the XGB model. FeatureSet3 performed better than FeatureSet2 because its features had much less correlation between each other and all of the redundant features from FeatureSet2 were removed. For FeatureSet3, a variety of different feature orderings were also tested, but the results remained within 1% accuracy variance.

Table 1: Feature comparison using the best models.

Model    Feature set   Precision  Recall  f1
DNN3     FeatureSet1   0.638      0.595   0.573
XGB3     FeatureSet1   0.799      0.769   0.779
DNN3     FeatureSet2   0.918      0.885   0.889
XGB3     FeatureSet2   0.869      0.864   0.867
DNN3     FeatureSet3   0.931      0.898   0.902
XGB3     FeatureSet3   0.888      0.889   0.889
DNN3     best [5]      0.893      0.887   0.888
XGB3     best [5]      0.861      0.860   0.861
SVM [5]  best [5]      0.851      0.835   0.834

To gain insight into the per-class performance of FeatureSet3 with the two techniques, we present a per-device f1 score breakdown in Table 2. It can be seen that computer monitor, microwave and kettle are classified worst by all three models, as their similar consumption profiles make it difficult for the models to distinguish between them. Nevertheless, the CNN classifies all three the best due to its superior pattern recognition ability.

Table 2: Per class performance, FeatureSet3 vs best [5].

Class         Inst.  CNN f1  XGB f1  [5] f1
monitor       300    0.827   0.833   0.780
laptop        276    0.983   0.932   0.838
television    300    0.992   0.976   0.941
washer/dryer  226    0.941   0.912   0.804
microwave     300    0.688   0.620   0.687
boiler        300    1.000   0.968   0.940
toaster       215    0.949   0.940   0.806
kettle        300    0.756   0.722   0.739
fridge        300    1.000   0.983   0.970

5 Model selection

For analyzing the performance of DNN and XGBoost on our problem, we conducted extensive performance evaluations. We started by developing a deep learning sequential model, which at first consisted of three dense layers, each with an arbitrarily chosen number of neurons. By trying different combinations of hyperparameters such as the number of neurons, loss functions, optimizers, batch size, number of epochs, number of layers and learning rate, we came closer to finding the best suited model for our problem. For optimizing certain hyperparameters we took advantage of the automatic hyperparameter optimization framework Optuna (https://optuna.org). We then applied similar optimization techniques to the XGB model, although its default parameter configuration already gave good results. All the experiments were run on Google Colab using an instance with an Nvidia Tesla K80 GPU and 12.69 GB of RAM.

In this section we present and analyze three representative models from each class, DNN and XGBoost respectively.

5.1 Deep neural network

DNN1 - This model consisted of three fully connected dense layers. The first two had 32 neurons each and a ReLU (rectified linear unit) activation function, while the output layer had nine neurons, each corresponding to one of the nine possible appliances, and a Softmax activation function.

DNN2 - For this model we took the DNN1 model and added an additional dense layer with 64 neurons, as well as changed the activation function to linear in the penultimate layer. With this additional complexity we expected to see better results.

DNN3 - For this model we introduced two 1D convolution layers, the first with 128 filters and the second with 64. Then we used a flatten layer to reduce the dimensionality of the output space and make the data compatible with the following dense layer, followed by another (output) dense layer.

5.2 XGBoost

XGB1 - This is the model with the standard configuration, i.e. a maximum depth of 3, 100 estimators and a learning rate of 0.1.

XGB2 - In this model we increased the maximum depth to 4, reduced the learning rate by 50% (to 0.05) and increased the number of estimators to 200. Doing this gave slightly better results.

XGB3 - For this model we decreased the maximum depth to 2, increased the number of estimators to 500 and the learning rate to 0.25.

Table 3: Model performance on FeatureSet3.

Model    Precision  Recall  f1     Comp. time
DNN1     0.866      0.851   0.846  10.972 s
DNN2     0.900      0.887   0.889  21.026 s
DNN3     0.931      0.898   0.902  21.124 s
XGB1     0.876      0.863   0.864  1.126 s
XGB2     0.884      0.881   0.882  2.518 s
XGB3     0.888      0.889   0.889  3.225 s
SVM [5]  0.878      0.852   0.852  0.301 s

5.3 Results

5.3.1 Classification performance

The classification performance of the models is provided in Table 3. It can be seen that the best performing models are DNN3 with an f1 score of 0.90 and XGB3 with an f1 of 0.89. However, the computation time of XGB3 is only 3.23 s while for DNN3 it is 21.12 s. The XGB classifier using classical machine learning performed only about 1 percentage point worse than the CNN model, while at the same time being much less complex and able to complete the entire training process about 18 seconds faster than the CNN. In addition, the XGB model is much easier to optimize since it has no hidden layers and a default hyperparameter configuration that usually requires no further optimization at all. From the last line of the table it can be seen that the SVM-based model from [5] performs 5 percentage points worse than DNN3 on FeatureSet3.

5.3.2 Computation time

The superior performance of the DNN model comes at the cost of increased algorithm complexity and hence longer computation time. As depicted in Table 3, the first DNN model took 10.97 seconds to complete the training process and the best (most complex) one took 21.12 seconds. XGBoost, on the other hand, was much faster, with XGB1 taking only 1.12 seconds. The added depth of XGB2 caused a slight increase in computation time to 2.52 seconds, which further increased to 3.23 seconds due to the high number of estimators used in XGB3. Finally, the SVM from [5] was the fastest to complete the training process, taking only 0.3 seconds, but scored the worst in terms of performance.

6 Conclusions

In this paper we investigated the design trade-offs during the feature and model selection steps of the development of an ML-based classifier for ILM. After formulating our problem, we first show that by extracting various statistical features from raw time series data and then training our models with these features, we were able to improve the f1 score by up to 20 percentage points. Second, we propose two different ML techniques and describe our process of developing the proposed models using them. We show that optimizing hyperparameters to better suit our specific problem can improve their respective performance by around 4 percentage points. However, choosing the right features that better capture the shape of the data has a much greater impact on the end results than optimizing the models. We also show that the classical machine learning model does not perform significantly worse than the deep neural network based one, while at the same time being less computationally expensive.

References

[1] L. Shorrock, J. Utley et al., Domestic energy fact file 2003. Citeseer, 2003.
[2] A. Zoha, A. Gluhak, M. A. Imran, and S. Rajasegarar, "Non-intrusive load monitoring approaches for disaggregated energy sensing: A survey," Sensors, vol. 12, no. 12, pp. 16838-16866, 2012.
[3] J. Kim, T.-T.-H. Le, and H. Kim, "Nonintrusive Load Monitoring Based on Advanced Deep Learning and Novel Signature," Computational Intelligence and Neuroscience, vol. 2017, p. e4216281, Oct. 2017. [Online]. Available: https://www.hindawi.com/journals/cin/2017/4216281/
[4] E. Aladesanmi and K. Folly, "Overview of non-intrusive load monitoring and identification techniques," IFAC-PapersOnLine, vol. 48, no. 30, pp. 415-420, 2015, 9th IFAC Symposium on Control of Power and Energy Systems CPES 2015. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2405896315030566
[5] L. Ogrizek, B. Bertalanic, G. Cerar, M. Meza, and C. Fortuna, "Designing a machine learning based non-intrusive load monitoring classifier," in 2021 IEEE ERK, 2021, pp. 1-4.
[6] M. Devlin and B. P. Hayes, "Non-intrusive load monitoring using electricity smart meter data: A deep learning approach," in 2019 IEEE Power & Energy Society General Meeting (PESGM), 2019, pp. 1-5.
[7] F. Ciancetta, G. Bucci, E. Fiorucci, S. Mari, and A. Fioravanti, "A new convolutional neural network-based system for NILM applications," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1-12, 2021.
[8] Y. Yang, J. Zhong, W. Li, T. A. Gulliver, and S. Li, "Semisupervised multilabel deep learning based nonintrusive load monitoring in smart grids," IEEE Transactions on Industrial Informatics, vol. 16, no. 11, pp. 6892-6902, 2020.
[9] W. He and Y. Chai, "An empirical study on energy disaggregation via deep learning," Advances in Intelligent Systems Research, vol. 133, pp. 338-342, 2016.

Indeks avtorjev / Author index

Beliga Slobodan ........................... 41 Bertalanič Blaž ........................... 73 Brank Janez ........................... 5 Brglez Mojca ...........................
37 Buhin Pandur Maja....................................................................................................................................................................... 41 Casals del Busto Ignacio .............................................................................................................................................................. 49 Cerar Gregor ................................................................................................................................................................................. 73 Costa Joao .................................................................................................................................................................................... 57 Dam Paulien ................................................................................................................................................................................. 69 Dobša Jasminka ............................................................................................................................................................................ 41 Eržin Eva ...................................................................................................................................................................................... 49 Erznožnik Matic ........................................................................................................................................................................... 53 Fortuna Blaž ........................................................................................................................................................................... 45, 69 Fortuna Carolina ........................................................................................................................................................................... 
73 Grobelnik Marko .................................................................................................................................................................. 5, 9, 49 Guček Alenka ............................................................................................................................................................................... 49 Jelenčič Jakob ............................................................................................................................................................................... 61 Kenda Klemen ........................................................................................................................................................................ 53, 57 Lindemann David ......................................................................................................................................................................... 33 Lorbek Ivančič Dan ...................................................................................................................................................................... 73 Massri M.Besher ...................................................................................................................................................................... 5, 49 Meštrović Ana .............................................................................................................................................................................. 41 Mladenić Dunja ............................................................................................................................................ 5, 9, 21, 29, 45, 61, 69 Mladenic Grobelnik Adrian............................................................................................................................................................ 
9 Mocanu Iulian .............................................................................................................................................................................. 49 Neumann Matej ............................................................................................................................................................................ 65 Novak Erik ................................................................................................................................................................................... 13 Novalija Inna ............................................................................................................................................................................ 5, 49 Petkovšek Gal ............................................................................................................................................................................... 53 Pita Costa Joao ....................................................................................................................................................................... 49, 57 Pollak Senja ............................................................................................................................................................................ 25, 37 Posinković Matej .......................................................................................................................................................................... 49 Poštuvan Tim ............................................................................................................................................................................... 45 Pranjić Marko ............................................................................................................................................................................... 
25 Robnik-Šikonja Marko ........................................................................................................................................................... 17, 25 Rossi Maurizio ............................................................................................................................................................................. 49 Rožanec Jože M. .................................................................................................................................................................... 45, 69 Schwabe Daniel .............................................................................................................................................................................. 5 Sittar Abdul .................................................................................................................................................................................. 29 Šturm Jan ...................................................................................................................................................................................... 49 Swati ............................................................................................................................................................................................. 21 Trajkova Elena ............................................................................................................................................................................. 69 Ulčar Matej ................................................................................................................................................................................... 17 Vintar Špela .................................................................................................................................................................................. 
37