Zbornik 24. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2021, Zvezek C
Proceedings of the 24th International Multiconference INFORMATION SOCIETY – IS 2021, Volume C

Odkrivanje znanja in podatkovna skladišča - SiKDD
Data Mining and Data Warehouses - SiKDD

Urednika / Editors: Dunja Mladenić, Marko Grobelnik

4. oktober 2021 / 4 October 2021, Ljubljana, Slovenija / Slovenia
http://is.ijs.si

Urednika: Dunja Mladenić, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana; Marko Grobelnik, Department for Artificial Intelligence, Jožef Stefan Institute, Ljubljana
Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2021
Informacijska družba, ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID 85867267
ISBN 978-961-264-218-1 (PDF)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2021

Štiriindvajseta multikonferenca Informacijska družba je preživela probleme zaradi korone v 2020. Odziv se povečuje, v 2021 imamo enajst konferenc, a pravo upanje je za 2022, ko naj bi dovolj velika precepljenost končno omogočila normalno delovanje. Tudi v 2021 gre zahvala za skoraj normalno delovanje konference tistim predsednikom konferenc, ki so kljub prvi pandemiji modernega sveta pogumno obdržali visok strokovni nivo.
Stagnacija določenih aktivnosti v 2020 in 2021 pa skoraj v ničemer ni omejila neverjetne rasti IKT-ja, informacijske družbe, umetne inteligence in znanosti nasploh, ampak nasprotno – rast znanja, računalništva in umetne inteligence se nadaljuje z že kar običajno nesluteno hitrostjo. Po drugi strani se je pospešil razpad družbenih vrednot, zaupanja v znanost in razvoj. Se pa zavedanje večine ljudi, da je potrebno podpreti stroko, čedalje bolj krepi, kar je bistvena sprememba glede na 2020.

Letos smo v multikonferenco povezali enajst odličnih neodvisnih konferenc. Zajema okoli 170 večinoma spletnih predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic ter 400 obiskovalcev. Prireditev so spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad – seveda večinoma preko spleta. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 45-letno tradicijo odlične znanstvene revije.

Multikonferenco Informacijska družba 2021 sestavljajo naslednje samostojne konference:
• Slovenska konferenca o umetni inteligenci
• Odkrivanje znanja in podatkovna skladišča
• Kognitivna znanost
• Ljudje in okolje
• 50-letnica poučevanja računalništva v slovenskih srednjih šolah
• Delavnica projekta Batman
• Delavnica projekta Insieme Interreg
• Delavnica projekta Urbanite
• Študentska konferenca o računalniškem raziskovanju 2021
• Mednarodna konferenca o prenosu tehnologij
• Vzgoja in izobraževanje v informacijski družbi

Soorganizatorji in podporniki multikonference so različne raziskovalne institucije in združenja, med njimi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi.
Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

S podelitvijo nagrad, še posebej z nagrado Michie-Turing, se avtonomna stroka s področja opredeli do najbolj izstopajočih dosežkov. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejel prof. dr. Jernej Kozak. Priznanje za dosežek leta pripada ekipi Odseka za inteligentne sisteme Instituta »Jožef Stefan« za osvojeno drugo mesto na tekmovanju XPrize Pandemic Response Challenge za iskanje najboljših ukrepov proti koroni. »Informacijsko limono« za najmanj primerno informacijsko potezo je prejela trditev, da je aplikacija za sledenje stikom problematična za zasebnost, »informacijsko jagodo« kot najboljšo potezo pa COVID-19 Sledilnik, tj. sistem za zbiranje podatkov o koroni. Čestitke nagrajencem!

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD - INFORMATION SOCIETY 2021

The 24th Information Society Multiconference survived the COVID-19 problems. In 2021, there are eleven conferences with a growing trend, and real hopes that 2022 will be better due to successful vaccination. The multiconference survived thanks to the conference chairs who bravely decided to continue with their conferences despite the first pandemic in the modern era. The COVID-19 pandemic did not slow the growth of ICT, the information society, artificial intelligence and science overall; quite the contrary, the progress of computers, knowledge and artificial intelligence continued at a fascinating rate. However, COVID-19 did accelerate the erosion of societal norms and of trust in science and progress. On the other hand, the awareness of the majority that science and development are the only prospects for a prosperous future is growing substantially.
The Multiconference is running parallel sessions with 170 presentations of scientific papers at eleven conferences, many round tables, workshops and award ceremonies, and 400 attendees. Selected papers will be published in the Informatica journal, with its 45-year tradition of excellent research publishing.

The Information Society 2021 Multiconference consists of the following conferences:
• Slovenian Conference on Artificial Intelligence
• Data Mining and Data Warehouses
• Cognitive Science
• People and Environment
• 50 Years of High-school Computer Education in Slovenia
• Batman Project Workshop
• Insieme Interreg Project Workshop
• URBANITE Project Workshop
• Student Computer Science Research Conference 2021
• International Conference of Transfer of Technologies
• Education in Information Society

The multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e. the Slovenian chapter of the ACM, SLAIS, DKZ and the second national academy, the Slovenian Engineering Academy. In the name of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contributions and their interest in this event, and the reviewers for their thorough reviews.

The award for lifelong outstanding contributions is presented in memory of Donald Michie and Alan Turing. The Michie-Turing award was given to Prof. Dr. Jernej Kozak for his lifelong outstanding contribution to the development and promotion of the information society in our country. In addition, the yearly recognition for current achievements was awarded to the team from the Department of Intelligent Systems, Jožef Stefan Institute, for second place at the XPrize Pandemic Response Challenge for proposing the best countermeasures against COVID-19. The information lemon goes to the claim that the mobile application for tracking COVID-19 contacts would harm information privacy.
The information strawberry, as the best information service of the past year, went to COVID-19 Sledilnik, a program to regularly report all data related to COVID-19 in Slovenia. Congratulations!

Mojca Ciglarič, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee:
Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia; Sergio Campos-Cordobes, Spain; Shabnam Farahmand, Finland; Sergio Crovella, Italy

Organizing Committee:
Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič; Klara Vulikić

Programme Committee:
Mojca Ciglarič, chair; Bogdan Filipič; Dunja Mladenić; Niko Zimic; Bojan Orel; Andrej Gams; Franc Novak; Rok Piltaver; Franc Solina; Matjaž Gams; Vladislav Rajkovič; Toma Strle; Viljan Mahnič; Mitja Luštrek; Grega Repovš; Tine Kolenik; Cene Bavec; Marko Grobelnik; Ivan Rozman; Franci Pivec; Tomaž Kalin; Nikola Guid; Niko Schlamberger; Uroš Rajkovič; Jozsef Györkös; Marjan Heričko; Stanko Strmčnik; Borut Batagelj; Tadej Bajd; Borka Jerman Blažič Džonova; Jurij Šilc; Tomaž Ogrin; Jaroslav Berce; Gorazd Kandus; Jurij Tasič; Aleš Ude; Mojca Bernik; Urban Kordeš; Denis Trček; Bojan Blažica; Marko Bohanec; Marjan Krisper; Andrej Ule; Matjaž Kljun; Ivan Bratko; Andrej Kuščer; Boštjan Vilfan; Robert Blatnik; Andrej Brodnik; Jadran Lenarčič; Baldomir Zajc; Erik Dovgan; Dušan Caf; Borut Likar; Blaž Zupan; Špela Stres; Saša Divjak; Janez Malačič; Boris Žemva; Anton Gradišek; Tomaž Erjavec; Olga Markič; Leon Žlajpah

KAZALO / TABLE OF
CONTENTS

Odkrivanje znanja in podatkovna skladišča - SiKDD / Data Mining and Data Warehouses - SiKDD ... 1
PREDGOVOR / FOREWORD ... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ... 4
Observing odor-related information in academic domain / Novalija Inna, Massri M.Besher, Mladenić Dunja, Grobelnik Marko, Schwabe Daniel, Brank Janez ... 5
Understanding Text Using Agent Based Models / Mladenic Grobelnik Adrian, Grobelnik Marko, Mladenić Dunja ... 9
News Stream Clustering using Multilingual Language Models / Novak Erik ... 13
SloBERTa: Slovene monolingual foundation model / Ulčar Matej, Robnik-Šikonja Marko ... 17
Understanding the Impact of Geographical Bias on News Sentiment: A Case Study on London and Rio Olympics / Swati, Mladenić Dunja ... 21
An evaluation of BERT and Doc2Vec model on the IPTC Subject Codes prediction dataset / Pranjić Marko, Robnik-Šikonja Marko, PI3 ... 25
Classification of Cross-cultural News Events / Sittar Abdul, Mladenić Dunja ... 29
Zotero to Elexifinder: Collection, curation, and migration of bibliographical data / Lindemann David ... 33
Simple discovery of COVID IS WAR Metaphors Using Word Embeddings / Brglez Mojca, Pollak Senja, Vintar Špela ... 37
Topic modelling and sentiment analysis of COVID-19 related news on Croatian Internet portal / Buhin Pandur Maja, Dobša Jasminka, Beliga Slobodan, Meštrović Ana ... 41
Tackling Class Imbalance in Radiomics: the COVID-19 Use Case / Rožanec Jože M., Poštuvan Tim, Fortuna Blaž, Mladenić Dunja ... 45
Observing Water-Related Events for Evidence-Based Decision-Making / Pita Costa Joao, Massri M.Besher, Novalija Inna, Casals del Busto Ignacio, Mocanu Iulian, Rossi Maurizio, Šturm Jan, Eržin Eva, Guček Alenka, Posinković Matej, Grobelnik Marko ... 49
Anomaly Detection on Live Water Pressure Data Stream / Petkovšek Gal, Erznožnik Matic, Kenda Klemen ... 53
Entropy for Time Series Forecasting / Costa Joao, Kenda Klemen, Pita Costa Joao ... 57
Modeling stochastic processes by simultaneous optimization of latent representation and target variable / Jelenčič Jakob, Mladenić Dunja ... 61
Causal relationships among global indicators / Neumann Matej ... 65
Active Learning for Automated Visual Inspection of Manufactured Products / Trajkova Elena, Rožanec Jože M., Dam Paulien, Fortuna Blaž, Mladenić Dunja ... 69
Learning to Automatically Identify Home Appliances / Lorbek Ivančič Dan, Bertalanič Blaž, Cerar Gregor, Fortuna Carolina ... 73
Indeks avtorjev / Author index ... 77

PREDGOVOR

Tehnologije, ki se ukvarjajo s podatki, so v devetdesetih letih močno napredovale. Iz prve faze, kjer je šlo predvsem za shranjevanje podatkov in kako do njih učinkovito dostopati, se je razvila industrija za izdelavo orodij za delo s podatkovnimi bazami, prišlo je do standardizacije procesov, povpraševalnih jezikov itd. Ko shranjevanje podatkov ni bilo več poseben problem, se je pojavila potreba po bolj urejenih podatkovnih bazah, ki bi služile ne le transakcijskemu procesiranju, ampak tudi analitskim vpogledom v podatke – pojavilo se je t. i. skladiščenje podatkov (data warehousing), ki je postalo standarden del informacijskih sistemov v podjetjih. Paradigma OLAP (On-Line Analytical Processing) zahteva od uporabnika, da še vedno sam postavlja sistemu vprašanja in dobiva nanje odgovore ter na vizualen način preverja in išče izstopajoče situacije.
Ker seveda to ni vedno mogoče, se je pojavila potreba po avtomatski analizi podatkov oz., z drugimi besedami, po tem, da sistem sam pove, kaj bi utegnilo biti zanimivo za uporabnika – to prinašajo tehnike odkrivanja znanja v podatkih (data mining), ki iz obstoječih podatkov skušajo pridobiti novo znanje in tako uporabniku nudijo novo razumevanje dogajanj, zajetih v podatkih. Slovenska KDD konferenca pokriva vsebine, ki se ukvarjajo z analizo podatkov in odkrivanjem znanja v podatkih: pristope, orodja, probleme in rešitve.

FOREWORD

Data-driven technologies have progressed significantly since the mid-1990s. The first phase was mainly focused on storing and efficiently accessing the data, which resulted in the development of industry tools for managing large databases, related standards, supporting query languages, etc. After the initial period, when data storage was no longer the primary problem, development progressed towards analytical functionalities for extracting added value from the data; i.e., databases started supporting not only transactions but also analytical processing of the data. At this point, data warehousing with On-Line Analytical Processing (OLAP) entered as a usual part of a company's information system portfolio, requiring the user to pose well-defined questions about aggregated views of the data. Data mining is a technology developed after the year 2000, offering automatic data analysis that tries to obtain new discoveries from the existing data and gives the user new insights into the data. In this respect, the Slovenian KDD conference (SiKDD) covers a broad area including Statistical Data Analysis; Data, Text and Multimedia Mining; Semantic Technologies; Link Detection and Link Analysis; Social Network Analysis; and Data Warehouses.
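The contrast drawn above, between OLAP (the user poses an aggregate question) and data mining (the system volunteers what might be interesting), can be illustrated with a small sketch. The data and the outlier rule below are illustrative assumptions, not part of any system described in this volume:

```python
from collections import defaultdict

# Toy transaction records: (region, product, revenue).
sales = [
    ("north", "sensor", 120.0), ("north", "sensor", 80.0),
    ("south", "sensor", 95.0), ("south", "valve", 300.0),
    ("north", "valve", 40.0), ("south", "valve", 310.0),
]

# OLAP-style: the USER asks a specific aggregate question,
# e.g. "total revenue per (region, product) cell".
def olap_rollup(rows):
    cube = defaultdict(float)
    for region, product, revenue in rows:
        cube[(region, product)] += revenue
    return dict(cube)

# Data-mining-style: the SYSTEM points out what is unusual,
# e.g. cells whose revenue deviates strongly from the mean.
def find_outliers(cube, factor=2.0):
    mean = sum(cube.values()) / len(cube)
    return [cell for cell, value in cube.items() if value > factor * mean]

cube = olap_rollup(sales)
print(cube)
print(find_outliers(cube))
```

In the first case the analyst must already know which question to ask; in the second, the system surfaces the anomalous cell without being asked about it.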
PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Janez Brank, Jožef Stefan Institute, Ljubljana
Marko Grobelnik, Jožef Stefan Institute, Ljubljana
Jakob Jelenčič, Jožef Stefan Institute, Ljubljana
Branko Kavšek, University of Primorska, Koper
Aljaž Košmerlj, Qlector, Ljubljana
Dunja Mladenić, Jožef Stefan Institute, Ljubljana
Inna Novalija, Jožef Stefan Institute, Ljubljana
Jože Rožanec, Qlector, Ljubljana
Luka Stopar, Sportradar, Ljubljana

OBSERVING ODOR-RELATED INFORMATION IN ACADEMIC DOMAIN

Inna Novalija, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, inna.koval@ijs.si
M. Besher Massri, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, besher.massri@ijs.si
Dunja Mladenić, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, dunja.mladenic@ijs.si
Marko Grobelnik, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, marko.grobelnik@ijs.si
Daniel Schwabe, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, daniel.schwabe@ijs.si
Janez Brank, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, janez.brank@ijs.si

ABSTRACT

In this paper, we demonstrate an approach for observing olfactory-related information in an academic publications environment (such as Microsoft Academic Graph) based on semantic technologies. We present an Odor Observatory tool that enables several usage scenarios, such as observing odor-related papers and topics, viewing institutions conducting olfactory research, and identifying top journals and key countries in the olfactory domain. Validation of the proposed approach on a collection of academic publications from 1800 until 1925 confirms the applicability of the proposed approach to large data collections with a wide span of time. In the usage scenarios we observed the odor-related publications in Microsoft Academic Graph by topic, discovered the journals with historical olfactory publications, and found that the most popular terms in odor-related research content are: method, olfactory, odor, device, invention, smell, preparation, utility model.

KEYWORDS

Odor, Olfactory information, Microsoft Academic Graph (MAG), Data mining.

1. INTRODUCTION

Olfaction, or the sense of smell, is the sense through which smells (or odors) are perceived [1]. Olfactory science involves studying olfaction and odor-related topics, the sensory system, physiology, human and pheromone signals.

The Odeuropa project [2] gathers and integrates expertise in sensory mining and olfactory heritage. The project partners are developing novel methods to collect information about smell from (digital) text and image collections. The Odeuropa project partners apply state-of-the-art AI techniques to text and image datasets in order to identify and trace how 'smell' was expressed in different languages, with what places it was associated, what kinds of events and practices it characterized, and to what emotions it was linked.

In this paper we present an approach for mining olfactory information from scientific research collections, such as the Microsoft Academic Graph (MAG) [3]. The olfactory mining approach combines data processing, modelling and visualization methods in order to develop applicable tools for data analysis.

We present an Odor Observatory tool [4] targeted at several visualization scenarios. In particular, the Odor Observatory allows exploring olfactory-related papers from the MAG over time and, along with current data, provides historical information starting with the early XIX century.

The data-driven functionalities of the Odor Observatory are:
▪ Possibility of exploring top-ranked topics in the olfactory academic domain;
▪ Possibility of exploring top-ranked institutions conducting olfactory research;
▪ Possibility of exploring key countries and defining top-ranking journals in the olfactory academic domain;
▪ Odor-related search functionalities;
▪ Word cloud visualization for odor-related terms.

2. RELATED WORK

Olfactory science covers different aspects of research related to odors; therefore, exploring odor-related information and data can be viewed as a complex multidisciplinary area.

Lötsch et al. [5] considered machine learning approaches for olfactory research. The authors state that the complexity of the human sense of smell is reflected in complex and high-dimensional sensory data, which supports the applicability of machine learning and data mining techniques. The use of machine learning in human olfactory research includes the following aims:
1. The study of the physiology of pattern-based odor detection and recognition processes;
2. Pattern recognition in olfactory phenotypes;
3. The development of complex disease biomarkers including olfactory features;
4. Odor prediction from physico-chemical properties of volatile molecules; and
5. Knowledge discovery in publicly available large databases.
The authors provide a review of key concepts of machine learning and summarize current applications on human olfactory data.

At the same time, the linguistic and semantic communities have focused on studying the language of smell [6]. Iatropoulos et al. developed a computational method to characterize the olfaction-related semantic content of words in a large text corpus of internet sites in English. They also introduced novel metrics, such as the olfactory association index (OAI) and the olfactory specificity index (OSI). Tonelli [7] describes olfactory information extraction and semantic processing from a multilingual perspective. The author states that several studies have found that languages seem to have a smaller vocabulary to describe smells as compared to other senses.

In our work we apply data mining and machine learning, as well as semantic approaches for enriching textual data. We use data from the Microsoft Academic Graph, and our methodologies can be regarded as being in the context of semantic and text processing research. Our approaches can cover cross-lingual and multilingual data and allow for tracking olfactory trends in time.

3. PROBLEM DEFINITION

3.1 DATA SOURCES

The Microsoft Academic Graph (MAG) [3] is a heterogeneous graph containing scientific publication records, citation relationships between those publications, as well as authors, institutions, journals, conferences, and fields of study. Since this research is conducted in line with the Odeuropa project (targeted at olfactory heritage), the time frame used for MAG data is set to range from the early publications in the 19th century to the present time. The Odeuropa project is interested in particular in the data available up to 1925. Though the project is focused on the historical datasets, the developed Odor Observatory tool allows users to explore recent olfactory publications as well. The dataset is updated on a monthly basis and newly available data is uploaded into the observatory.

The Microsoft Academic Graph data schema is based on the following entity types: publication, author, author affiliation (institution), publication venue (journals and conferences), and field of study (topic). It contains information about publication dates, as well as citation pairs and co-authorship data (see Figure 1).

Figure 1: The Conceptual Schema for MAG

Figure 2 illustrates an entry in MAG for a historical publication tagged with several odor-relevant topics.

Figure 2: Publication in MAG

Figure 3 shows a representation of Odor in the MAG taxonomy, with parent topics (Organic chemistry and Neuroscience) and child topics (Olfactory learning, Geosmin, etc.).

Figure 3: Odor in the MAG Taxonomy

An important functionality while exploring the literature is the ability to expand searches by looking at topics related to a topic of interest. Figure 4 displays topics about/related to Olfaction/Odor/Smell in the MAG taxonomy.

Figure 4: Odor-related Topics in MAG

3.2 METHODOLOGY

The methodology for observing olfactory-related information from academic publication resources includes a number of steps:
▪ Using the MAG taxonomy, obtain the list of research papers that correspond to odor-related topics. Papers were filtered to those containing the topics Olfaction, Odor, Fragrance, Fragrance ingredient, as well as the "smell" keyword;
▪ Ingest the extracted corpus into the Elastic Search tool (https://www.elastic.co);
▪ Provide visualization functionalities, such as MAG time series per term.

The key challenges of the development techniques include:
▪ Interpretability and explainability of the results – the aim is for the visualizations to be easily interpretable by humans;
▪ Scalability – given the large scale of the incoming data streams, it is essential that the visualizations being built are scalable. The MAG contains more than 265 million records (August 2021), including several types of publication, such as journal articles, conference papers, books, book chapters, and papers from other repositories. In addition, MAG also indexes a large corpus of patents.

3.3 USAGE SCENARIOS

We present a couple of usage scenarios for the Odor Observatory tool, cast as questions asked by scholars studying the field.

1. What are the historical trends in odor-related publications?

Figure 5 shows the number of odor-related historical publications in MAG over time. This scenario assumes observing trends in different olfactory topics throughout a time interval. It is possible to observe that the highest numbers of publications are in the domains of biology and psychology.

Figure 5: Odor-related Publications in MAG (from year 1800 until year 1925, cumulative) by topic

2. What are the most popular terms used in odor-related publications?

This use case helps the user to visualize term usage by displaying a word cloud with the most popular olfactory terms used in the publications in the period of interest (see Figure 6).

Figure 6: Odor Terms Word Cloud in MAG

3. Which venues were mostly used when publishing odor-related research articles?

This use case shows a number of journals that had historical publications about smells (see Figure 7). The figure shows that JAMA and Nature are the most popular journals regarding historical olfactory publications.

Figure 7: Journals with Olfactory Publications in MAG (from year 1800 until year 1925, cumulative)

4. Which are the publications about smell (from a contextual point of view)?

The Research Explorer tool is a search engine that enables exploring the individual articles in the corpus of odor-related publications. The tool is built on Elastic Search and provides search by keyword and by date. It also supports smart navigation through the results by clustering the results and re-ranking them by moving the focus of search through the cluster space (see Figure 8). The goal of the tool is to enhance a search engine by providing the users multiple rankings of the results for each query. This is achieved by generating topics for the given query and its result set, and visualizing these topics on the "Ranking Space" panel. When the focus is set near a given topic, results that are on or closer to that topic are ranked higher.

Figure 8: List of Olfactory Publications in MAG (from year 1800 until year 1925) that contain the keyword "smell"

The figure shows a ranked list of relevant publications on the topic of "smells" in the period from 1900 to 1925. The list is modified by changing the context on the right side: the focus is changed by placing the cursor over a cluster, and publications associated with this cluster are displayed.

4. CONCLUSION

In this paper we demonstrated an approach towards observing olfactory-related information in scientific publications, as recorded in the MAG. In addition, we presented an Odor Observatory tool that enables several usage scenarios for exploring historical and present olfactory research.

Future work will include the exploration of other textual datasets applicable to olfactory research, with an accent on olfactory heritage information. In line with the Odeuropa project, the relevant information extracted from textual sources will be aligned, following semantic web standards, with the 'European Olfactory Knowledge Graph' (EOKG).

5. ACKNOWLEDGMENTS

This research is supported by the Slovenian research agency and by the European Union's Horizon 2020 program project Odeuropa under grant agreement number 101004469.

REFERENCES

[1] Wolfe, J. M., Kluender, K. R., Levi, D. M., Bartoshuk, L. M., Herz, R. S., Klatzky, R., Lederman, S. J., & Merfeld, D. M. (2012). Sensation & Perception (3rd ed.). Sinauer Associates.
[2] Odeuropa project, https://odeuropa.eu (accessed in August 2021).
[3] Wang, K. et al. A Review of Microsoft Academic Services for Science of Science Studies, Frontiers in Big Data, 2019, doi: 10.3389/FDATA.2019.00045.
[4] JSI Odor Observatory, public service, https://odeuropa.ijs.si/dashboards/Main/Index?visualization=visualizations-MAG--top-topics# (accessed in August 2021).
[5] Lötsch, J., Kringel, D., Hummel, T. Machine Learning in Human Olfactory Research, Chemical Senses, Volume 44, Issue 1, January 2019, Pages 11–22, https://doi.org/10.1093/chemse/bjy067.
[6] Iatropoulos, G., Herman, P., Lansner, A., Karlgren, J., Larsson, M., Olofsson, J. K. The language of smell: Connecting linguistic and psychophysical properties of odor descriptors. Cognition. 2018 Sep;178:37-49. doi: 10.1016/j.cognition.2018.05.007. Epub 2018 May 12. PMID: 2976379.
[7] Tonelli, S. A Smell is Worth a Thousand Words: Olfactory Information Extraction and Semantic Processing in a Multilingual Perspective. doi: https://doi.org/10.4230/OASIcs.LDK.2021.2, https://drops.dagstuhl.de/opus/volltexte/2021/14538/pdf/OASIcs-LDK-2021-2.pdf (accessed in August 2021).
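The topic-based filtering and per-term statistics described in Section 3.2 of the paper above can be sketched in a few lines. The records, field names, and toy corpus below are simplified assumptions for illustration; the actual pipeline works over the full MAG and ingests the corpus into Elasticsearch, which is not reproduced here:

```python
from collections import Counter, defaultdict

# Hypothetical, simplified MAG-like records (real MAG entries are far richer).
papers = [
    {"title": "On the olfactory nerve", "year": 1824,
     "topics": ["Olfaction", "Neuroscience"]},
    {"title": "A new odor preparation device", "year": 1901,
     "topics": ["Odor", "Organic chemistry"]},
    {"title": "Fragrance ingredient utility model", "year": 1922,
     "topics": ["Fragrance ingredient"]},
    {"title": "Crop rotation methods", "year": 1850,
     "topics": ["Agronomy"]},
]

# Topic filter from Section 3.2 of the paper.
ODOR_TOPICS = {"Olfaction", "Odor", "Fragrance", "Fragrance ingredient"}

def select_odor_papers(records, until_year=1925):
    """Keep records tagged with an odor-related topic, up to a cut-off year."""
    return [r for r in records
            if r["year"] <= until_year and ODOR_TOPICS & set(r["topics"])]

def yearly_counts(records):
    """Time series: number of odor-related papers per year."""
    series = defaultdict(int)
    for r in records:
        series[r["year"]] += 1
    return dict(series)

def term_frequencies(records):
    """Term counts over titles, e.g. as input for a word cloud."""
    counter = Counter()
    for r in records:
        counter.update(w.lower().strip(",.") for w in r["title"].split())
    return counter

corpus = select_odor_papers(papers)
print(len(corpus))
print(yearly_counts(corpus))
print(term_frequencies(corpus).most_common(3))
```

The same three operations (filter by taxonomy topics, aggregate per year, count terms) underlie the trend, word-cloud, and venue scenarios of Section 3.3, only executed at MAG scale behind an Elasticsearch index.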
the focus is set near a given topic, results that are on or closer to that topic are ranked higher.

Understanding Text Using Agent Based Models

Adrian Mladenic Grobelnik, Marko Grobelnik, Dunja Mladenic
Jozef Stefan Institute, Ljubljana, Slovenia
adrian.m.grobelnik@ijs.si, marko.grobelnik@ijs.si, dunja.mladenic@ijs.si

ABSTRACT
The paper proposes a novel approach to text understanding and generation focusing on short stories. The proposed approach attempts to understand and generate stories by creating an explainable, agent-based world model of the story. The world model is defined through agents, their goals, actions, attributes and the relationships between them. We demonstrate our approach on the story of 'Little Red Riding Hood', simulating it as a sequence of 48 actions involving 7 main agents and 14 goals.

KEYWORDS
Text understanding, agent-based approach, world model, agent-based model

1 Introduction
With recent advancements in deep learning and overall increases in computing power, artificial intelligence systems are now able to make commonsense inferences from simple events, as proposed in research such as COMET [1] and MultiCOMET [2]. While the aforementioned commonsense inferences can be made with a high degree of precision, they lack an explainable and comprehensive structure capable of storing and predicting future events with such inferences. Agent-based models (ABMs), while capable of simulating complex interactions between agents, rarely focus on understanding stories in greater depth. Moreover, they cannot perform commonsense reasoning on agents' goals, actions or attributes. In our research, we draw from existing work on ABMs to create a system capable of understanding short text-based stories, with the potential to incorporate commonsense inferences in the future.

Related work such as 'Automated Storytelling via Causal, Commonsense Plot Ordering' [3] and 'Modeling Protagonist Emotions for Emotion-Aware Storytelling' [4] makes use of COMET to tackle automated story plot generation. As the stories are generated using COMET's commonsense causal inferences, they lack explainability. In our work, we focus on generating explainable stories. Other related work [5] focuses on story understanding using manually supplied commonsense rules, concept patterns and story text. Our system aims to understand and simulate a story, given the story text, the goals and the initial attributes of its agents.

The main contributions of this paper are (1) a novel approach to explainable story understanding, (2) a system generating stories given a set of agents with attributes and goals, and (3) an implementation of the proposed approach, with publicly available source code [7] allowing users to create and analyze their own stories.

The rest of this paper is organized as follows: Section 2 provides a problem description. Section 3 describes the approach used to tackle the problem. Section 4 demonstrates the functioning of our approach. The paper concludes with a discussion and directions for future work in Section 5.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2021, 4-8 October 2021, Ljubljana, Slovenia. © 2021 Copyright held by the owner/author(s).

2 Problem Description
The problem we are solving is, given the text of a short story, to convert it into a machine-understandable and actionable description representing the dynamics of the story being told. Such an actionable description should encode the implicit knowledge assumed by the text in the form of an agent-based world model.

The world model should include enough representational power to fully represent the story. This includes agents, their environment and the relationships between them. The world model should be actionable enough to simulate the dynamics of an input story with all the key elements and relevant details mentioned in the input text. As the world model can represent a story given its text, it should also be able to represent and simulate other stories within the world model's constraints.

Some of the key operations the resulting system should support:
1. representation of the story
2. simulation of the story's dynamics
3. question answering about explicit and implicit elements written or assumed within the story
4. creating alternative stories, given their context

3 Approach Description
The general aim of our approach is to provide deep text understanding of the input story. Not all the steps are automatable at this stage. In particular, the biggest challenge is to automatically translate the story text into the knowledge-based representation aligned with the world model. We look forward to eventually automating all of the steps in the approach.

Figure 1: A partial representation of the Wolf agent's goals, actions and attributes.

As a running example of the input story, we selected the popular children's story 'Little Red Riding Hood' [6]. In the first stage, we restructured the original story into 73 simplified sentences, where we identified 23 key events involving 7 main agents:
1. Mother
2. Riding Hood
3. Flower Field
4. Butterfly
5. Wolf
6. Grandma
7. Woodsman

Each agent is represented by its goals, actions and attributes (see Figure 1 for an example involving the Wolf). All goals cause actions, and all actions change at least one agent's attributes. As depicted in Figure 2, an agent's goal is defined by a goal state (a set of agents with specific attribute values) and 'pre-goals' (goals that must be completed and act as preconditions for an agent to start working towards the goal).

Figure 2: An example representation of a goal

To define actions, we use an action schema proposed as part of 'UCPOP: A Sound, Complete, Partial Order Planner for ADL' [8], where each action consists of a set of parameters, preconditions and effects. We show two example action representations in Figure 3 and Figure 4. The duration of each action corresponds to the passing of one time unit.

Figure 3: An example representation of an action

Figure 4: An example pseudocode representation of a concrete action, taken from [9]

An attribute is simply defined as any information relating to the agent, for instance the agent's location, inventory of items and awareness of other agents.

Figure 5: Hierarchy of agents for the Little Red Riding Hood story

The agents are defined through a hierarchy, ensuring consistency across agent goals, actions and attributes, and providing a clear overview of the agent types, as observed in Figure 5. Throughout the story simulation of 'Little Red Riding Hood', 3 key agents jointly had 14 goals, causing them to perform a total of 48 actions composed of 12 unique action types. We propose a simple textual description of each performed action, stating why the agent executed the action and which other agents were involved. See Figure 8 for an example.

At the highest conceptual level, we randomly select an agent and simulate all of its possible next actions. We then select the action that brings the agent closest to all its currently active goals, and execute this action. We repeat this until there are no more agents with active goals in our world model, as depicted in Figure 6.

Figure 6: High-level pseudocode of the simulation within the world model

4 Approach Demonstration
We first initialize the world model to an initial setting similar to that of 'Little Red Riding Hood', illustrated in Figure 7. For instance, agents 'forest4' and 'woodsman' are in the same location, 1 unit above agent 'forest3'. The model is initialized with the agents, their initial attributes with values, and their goals in the story. Once initialized, we can run the model and see the agents interact with each other within their environment. For an example, see Figure 9.

One could divide the story into the following four main segments:
1. Riding Hood discusses visiting Grandma with Mother (6 actions)
2. Riding Hood meets Wolf and goes to Grandma (23 actions)
3. Wolf eats Grandma and tries to impersonate her; Riding Hood arrives at GrandmaHouse and cries for help (6 actions)
4. Woodsman saves Grandma and takes Wolf away, Riding Hood gifts Grandma (13 actions)

As an example, in the third story segment the actions occur in the following order:
1. Wolf eats Grandma to satisfy hunger.
2. Wolf takes perfume from GrandmaHouse's inventory to try impersonating Grandma.
3. Wolf takes nightgown from GrandmaHouse's inventory to try impersonating Grandma.
4. Wolf takes sleeping cap from GrandmaHouse's inventory to try impersonating Grandma.
5. Riding Hood moves 1 unit up to visit Grandma.
6. Riding Hood cries for help to get help.

The system is able to automatically generate the textual description of the story simulation over time, as depicted in Figure 8.
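The world model described above (goals as target attribute states, UCPOP-style actions with preconditions and effects, and the greedy simulation loop of Figure 6) can be sketched compactly. The paper's implementation is about 3,000 lines of C++; the sketch below is our own minimal Python rendering, and all class names, the attribute layout and the goal-distance heuristic are illustrative assumptions, not the authors' code.

```python
import random

class Action:
    """UCPOP-style action: a precondition test and an effect on attributes.
    Each executed action corresponds to the passing of one time unit."""
    def __init__(self, name, precondition, effect):
        self.name = name
        self.precondition = precondition  # attrs -> bool
        self.effect = effect              # attrs -> new attrs

class Agent:
    def __init__(self, name, attributes, goals, actions):
        self.name = name
        self.attributes = attributes      # e.g. {"location": 0}
        self.goals = goals                # goal states: list of attribute dicts
        self.actions = actions

    def active_goals(self):
        # a goal is active while any of its attribute values is unsatisfied
        return [g for g in self.goals
                if any(self.attributes.get(k) != v for k, v in g.items())]

def distance_to_goals(attrs, goals):
    # simple heuristic: count of goal attributes still unsatisfied
    return sum(attrs.get(k) != v for g in goals for k, v in g.items())

def simulate(agents, max_steps=100):
    """Greedy loop from Figure 6: pick a random agent with active goals,
    simulate its applicable actions, execute the one closest to its goals."""
    log = []
    for _ in range(max_steps):
        pending = [a for a in agents if a.active_goals()]
        if not pending:
            break  # no more agents with active goals
        agent = random.choice(pending)
        applicable = [act for act in agent.actions
                      if act.precondition(agent.attributes)]
        if not applicable:
            break
        best = min(applicable,
                   key=lambda act: distance_to_goals(
                       act.effect(dict(agent.attributes)),
                       agent.active_goals()))
        agent.attributes = best.effect(dict(agent.attributes))
        log.append(f"{agent.name} performs {best.name}")
    return log

# Hypothetical demo: Riding Hood needs to reach Grandma at location 3.
move_up = Action("move 1 unit up",
                 precondition=lambda a: a["location"] < 3,
                 effect=lambda a: {**a, "location": a["location"] + 1})
hood = Agent("Riding Hood", {"location": 0}, [{"location": 3}], [move_up])
log = simulate([hood])  # three 'move 1 unit up' steps
```

In this rendering the textual story description is a by-product of the simulation log, mirroring how the system explains why each action was taken.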
Figure 7: Initial state of the agents' locations within the world model; each X, Y slot includes a list of agents at that location

One of the more conceptually complex parts of the story was Riding Hood asking Mother for permission to visit Grandma. This required the creation of a new attribute for human agents to describe their opinions of other agents' goals. The most complex action implemented was "cry for help". This involved the creation of a new goal, "respond to cry for help", for all human agents within a certain radius of the agent crying for help, provided they were conscious and able to respond.

The story ends when Riding Hood gives Grandma the flowers she picked and the basket Mother gave her, and Woodsman carries the Wolf "deep into the forest where he wouldn't bother people any longer" [6]. The system was implemented in about 3,000 lines of C++ code, available on GitHub [7].

Figure 8: A part of an example story, generated by the system

Figure 9: Screenshot of two subsequent agent location configurations on the map: (1) after Riding Hood gives Grandma flowers and (2) after Woodsman carries away Wolf

5 Discussion
In our research we expanded on and adapted existing work on agent-based models, providing an alternate approach to text understanding and generation involving short stories. As a proof of concept, we applied our approach to the children's story 'Little Red Riding Hood', describing it through a series of 48 highly explainable actions involving 7 main agents.

Adapting the system to another story using our source code is relatively easy, provided the action and attribute types of the agents in the story are similar to those in 'Little Red Riding Hood'. If the story requires the implementation of new actions or attributes, this can be done by extending the class structure in C++, using already implemented actions and attributes as examples.

In our future work we intend to integrate commonsense inferences, such as those from MultiCOMET, into our model to further the system's degree of textual understanding. Our system could also benefit from the addition of dynamic and simultaneous goals that change based on the agent's environment. Another possible future line of work is to use our approach in other domains to describe more complex phenomena, such as real-world events or geopolitics. Lastly, a user evaluation of our system's performance on a variety of stories and scenarios could provide further insight into the efficacy of our approach.

ACKNOWLEDGMENTS
The research described in this paper was supported by the Slovenian research agency under the project J2-1736 Causalify and co-financed by the Republic of Slovenia and the European Union under the European Regional Development Fund. The operation is carried out under the Operational Programme for the Implementation of the EU Cohesion Policy 2014–2020.

REFERENCES
[1] Bosselut, A.; Rashkin, H.; Sap, M.; Malaviya, C.; Celikyilmaz, A.; and Choi, Y. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In ACL, 4762–4779.
[2] Adrian Mladenic Grobelnik, Marko Grobelnik, Dunja Mladenic. 2020. MultiCOMET – Multilingual Commonsense Description. In Proceedings of the 23rd international multiconference Information Society, pages 37–40.
[3] Prithviraj Ammanabrolu, Wesley Cheung, William Broniec, and Mark O. Riedl. 2021. Automated storytelling via causal, commonsense plot ordering. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI).
[4] Faeze Brahman and Snigdha Chaturvedi. 2020. Modeling protagonist emotions for emotion-aware storytelling. In Proceedings of EMNLP, pages 5277–5294.
[5] Patrick Henry Winston. 2014. The Genesis story understanding and story telling system: A 21st century step toward artificial intelligence. Technical report, Center for Brains, Minds and Machines (CBMM).
[6] Little Red Riding Hood by Leanne Guenther. https://www.dltk-teach.com/RHYMES/littlered/story.htm. Accessed 16.09.2021.
[7] Understanding Text Using Agent Based Models GitHub. https://github.com/AMGrobelnik/Understanding-Text-Using-Agent-Based-Models. Accessed 16.09.2021.
[8] Penberthy, J., & Weld, D. 1992. UCPOP: a sound, complete, partial-order planner for ADL. In Proceedings of KR'92, pp. 103–114, Los Altos, CA. Kaufmann.
[9] An Introduction to AI Story Generation. https://thegradient.pub/an-introduction-to-ai-story-generation/. Accessed 16.09.2021.

News Stream Clustering using Multilingual Language Models

Erik Novak, erik.novak@ijs.si
Jožef Stefan Institute; Jožef Stefan International Postgraduate School
Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
In this paper, we propose a news stream clustering algorithm which directly outputs cross-lingual event clusters. It uses multilingual language models to generate cross-lingual article representations which enable a direct comparison of articles in different languages. The algorithm is evaluated using a cross-lingual news article data set and compared against a strong baseline algorithm. The experiment results show the algorithm has great promise, but requires additional modifications for improving its performance.

KEYWORDS
online news, event detection, news events, multilingual language model

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Information Society 2021, 4-8 October 2021, Ljubljana, Slovenia. © 2021 Copyright held by the owner/author(s).

1 INTRODUCTION
Online news is producing hundreds of thousands of articles per day, reporting about any significant event that happened in the world. The articles cover various domains (such as politics, sports, and culture) and are written in different languages. In order to automatically identify these events, news stream clustering algorithms are used. These usually have the following steps: (1) they group articles written in the same language into monolingual clusters, and (2) they form cross-lingual clusters by linking monolingual clusters that report on the same event. Both steps usually employ monolingual text features such as TF-IDF vectors; these do not allow cross-lingual comparison without using advanced statistical or machine learning methods.

In this paper, we propose a news stream clustering algorithm that directly generates cross-lingual event clusters. The algorithm uses multilingual language models for generating cross-lingual content embeddings and extracting named entities found in the articles. These are used to measure if an article should be assigned to an event. The algorithm is evaluated using a cross-lingual data set consisting of articles in English, Spanish, and German, and is compared against a strong baseline. While the experiment results look promising, there is still room for improving the algorithm's performance.

The paper is structured as follows: Section 2 contains an overview of the related work on cross-lingual news stream clustering and multilingual language models. Next, we present the proposed clustering algorithm in Section 3, and describe the experiment setting in Section 4. The experiment results are found in Section 5. Finally, we conclude the paper and provide ideas for future work in Section 6.

2 RELATED WORK
News Stream Clustering. The objective of news stream clustering is to group news articles that report about the same event that happened in the world. Grouping can be a difficult task, especially if the articles are written in multiple languages. To this end, various approaches were developed for cross-lingual event clustering. A statistical approach called Generalization of Canonical Correlation Analysis is used to compare news articles in different languages [9]. Information extraction techniques, such as named entity recognition and part-of-speech tagging, are also used for event detection [6]. With the increasing popularity of neural networks, more advanced approaches are used to link event clusters. The work in [3] uses word embeddings to compare and link monolingual event clusters into cross-lingual ones. Transformer-based language models are used for event sentence coreference identification [4], a task that links parts of articles to multiple events. However, the algorithm is performed only on a monolingual data set. To the best of our knowledge, our work is the first that uses multilingual language models for grouping articles directly into cross-lingual events.

Multilingual Language Models. Since the introduction of the transformers [11], language model development has gained traction in the research community. One of the most well known language models, BERT [2], has improved the performance of various NLP tasks. By training it using multilingual documents, the multilingual BERT [5] enabled solving tasks that require cross-lingual text representations. While these models improved the performance of various NLP tasks, they do not provide good document embeddings for tasks like clustering. This changed with the introduction of Sentence-BERT [8], which generates monolingual sentence embeddings appropriate for measuring sentence similarity. A year later, an approach for making monolingual document representations cross-lingual [7] opened a way for using sentence embeddings for cross-lingual clustering. In this work, we employ the multilingual Sentence-BERT model to generate cross-lingual embeddings used to group articles into events.

3 THE CLUSTERING ALGORITHM
We propose a news stream clustering algorithm that directly outputs cross-lingual events. It uses cross-lingual embeddings, named entities, and temporal features to measure if an article should be assigned to an event cluster. If none of the events are appropriate, a new cluster is created and the article is assigned to it. Figure 1 shows the algorithm's workflow diagram.

Figure 1: The algorithm's workflow diagram. The algorithm maintains a set of event clusters which are used when assessing if a new article (a6) should be assigned to an existing event. If the conditions are met, the article is assigned to the most appropriate cluster (c2). Otherwise, an empty event cluster is created (c3), the article is assigned to it, and the newly created event is added to the cluster set.

In this section we describe how the algorithm represents the articles and events, and how it decides when to assign an article to an event cluster.

3.1 Article Representation
In this section we describe the different article representations used in the algorithm. Each article is assumed to have a title, body, and time attributes, which are used to (1) generate the content embedding and (2) extract its named entities.

Content Embedding. Each article is assigned an embedding that represents the article's content. Using multilingual Sentence-BERT¹, a language model designed for generating vectors used in cross-lingual clustering tasks, we get the content embedding by concatenating the article's title and body and inputting it into the language model. The output is a single 768-dimensional vector that captures the semantic meaning of the article.

¹ The model is available at https://huggingface.co/sentence-transformers/paraphrase-xlm-r-multilingual-v1.

Article Named Entities. For each article we extract the named entities that are mentioned in the article's body. To extract them, we developed a multilingual NER model using XLM-RoBERTa [1] and fine-tuned it using the CoNLL-2003 [10] data set.² Afterwards, we filter out the duplicates and store the remaining unique entities for later use.

3.2 Event Representations
An event is represented as an aggregate of its articles. This includes (1) the event centroid, (2) the named entities, and (3) the time statistics. In this section we describe how the aggregates are calculated and updated.

Event Centroid. The centroid represents the average content embedding of the articles assigned to the event. It is used to assess if an incoming article's content is similar enough to the event. Since the algorithm is intended to work on news streams, we iteratively update the centroid with the newly assigned article's content embedding:

$$\vec{c}_e^{\,(0)} = \vec{0}, \qquad \vec{c}_e^{\,(k)} = \frac{(k-1)\cdot\vec{c}_e^{\,(k-1)} + \vec{c}_{a_k}}{k},$$

where $\vec{c}_e^{\,(k)}$ is the centroid calculated using the first $k$ articles assigned to the event $e$, and $\vec{c}_{a_k}$ is the content embedding of the $k$-th article $a_k$.

Event Named Entities. Each event stores all of the unique named entities that are found in any of its articles. The named entities are used to identify if the incoming article mentions the event's entities. The event's named entities set is updated when a new article is assigned to the event:

$$r_e^{(0)} = \emptyset, \qquad r_e^{(k)} = r_e^{(k-1)} \cup r_{a_k},$$

where $r_e^{(k)}$ is the set of named entities generated using the first $k$ articles assigned to the event $e$, and $r_{a_k}$ is the set of named entities of the $k$-th article $a_k$.

Time Statistics. The time statistics provide insights into the articles' temporal distribution. These are calculated using the articles' time attribute. In this experiment we measured the following statistics: the minimum, average, and maximum article timestamps. These are used to validate if an article was published at a time when it could still report about an existing event.

3.3 Assignment Condition
The most crucial part of the proposed algorithm is how to measure to which event an article should be assigned, if any. We propose a condition that combines (1) the cosine similarity between the article's content embedding and the event's centroid, (2) the overlap between the article's and event's named entities, and (3) the time difference between the article's time and one of the event's time statistics.

Let $E = \{e_1, e_2, \ldots, e_j\}$ be the set of existing event clusters, where each event is represented with its centroid, named entities, and one of its time statistics, $e_i = (\vec{c}_{e_i}, r_{e_i}, t_{e_i})$. Let the article be represented by its content embedding, named entities, and time attribute, $a = (\vec{c}_a, r_a, t_a)$. We then check if the following conditions are met for each event:

$$\delta_c = \frac{\langle \vec{c}_{e_i}, \vec{c}_a \rangle}{\|\vec{c}_{e_i}\|_2\,\|\vec{c}_a\|_2} \ge \alpha, \qquad \delta_r = |r_{e_i} \cap r_a| \ge \beta, \qquad \delta_t = |t_{e_i} - t_a| \le \tau, \qquad (1)$$

where $\alpha$, $\beta$ and $\tau$ are the thresholds corresponding to how similar the article's content must be to the event, the required amount of overlapping entities, and the time window in which an article has to be in order to be assigned to the event, respectively. Thus, $\delta_c$, $\delta_r$ and $\delta_t$ correspond to the content similarity, entity overlap, and time window conditions, respectively.

If an event meets the conditions described in Equation 1, the article is assigned to it. If multiple events are appropriate, the article is assigned to the event that has the greatest $\delta_c$ value. If none are appropriate, a new empty event cluster is created, the article is assigned to it, and the event representations are updated.
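The event update rules and the assignment condition above can be sketched as follows. This is our own illustrative rendering, not the paper's code: content embeddings are assumed to be pre-computed (e.g., by the multilingual Sentence-BERT model), and the class and variable names are ours; the default thresholds mirror the values used later in the experiments.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Event:
    """Event aggregate: centroid c_e, named entities r_e, time statistic t_e."""
    centroid: list = field(default_factory=list)  # running mean of embeddings
    entities: set = field(default_factory=set)    # union of article entities
    min_time: float = math.inf                    # minimum time statistic
    k: int = 0                                    # number of assigned articles

    def add(self, emb, ents, time):
        # incremental centroid update: c^(k) = ((k-1)*c^(k-1) + c_a) / k
        self.k += 1
        if not self.centroid:
            self.centroid = [0.0] * len(emb)
        self.centroid = [((self.k - 1) * c + x) / self.k
                         for c, x in zip(self.centroid, emb)]
        self.entities |= ents                     # r^(k) = r^(k-1) U r_a
        self.min_time = min(self.min_time, time)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def assign(events, emb, ents, time, alpha=0.3, beta=1, tau=3.0):
    """Assign the article to the best event meeting all three conditions
    (delta_c, delta_r, delta_t); otherwise open a new event cluster."""
    candidates = [e for e in events
                  if e.k > 0
                  and cosine(e.centroid, emb) >= alpha   # delta_c
                  and len(e.entities & ents) >= beta     # delta_r
                  and abs(e.min_time - time) <= tau]     # delta_t
    if candidates:
        best = max(candidates, key=lambda e: cosine(e.centroid, emb))
    else:
        best = Event()
        events.append(best)
    best.add(emb, ents, time)
    return best

# Hypothetical demo: two similar articles merge, a dissimilar one does not.
events = []
e1 = assign(events, [1.0, 0.0], {"Ljubljana"}, 0.0)
e2 = assign(events, [0.9, 0.1], {"Ljubljana"}, 1.0)
e3 = assign(events, [0.0, 1.0], {"Paris"}, 1.0)
```

Note the streaming design choice: the centroid is a running mean, so an article can be processed and discarded without revisiting earlier articles.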
² The code of the model is available at https://github.com/ErikNovak/named-entity-recognition.

To compare the impact of the conditions, we implement multiple versions of the algorithm that use a different combination of the $\delta_c$, $\delta_r$ and $\delta_t$ conditions. Table 1 shows all of the algorithm versions compared in the experiment.

Table 1: The list of algorithm versions. Each algorithm uses a different combination of conditions.

Algorithm            Condition combination
CONTENT              δc
CONTENT + NE         δc and δr
CONTENT + TS         δc and δt
CONTENT + NE + TS    δc and δr and δt

4 EXPERIMENTS
We now present the experiment setting. We introduce the data set and how it is prepared for the experiment. Next, we present the evaluation metrics. Finally, the baseline algorithm is described.

4.1 Data Set
To compare the algorithm performances, we use the news article data sets acquired via Event Registry and prepared by [3] for the purposes of news stream clustering. These data sets are in three different languages (English, German, and Spanish), and consist of articles containing the following attributes:

• Title. The title of the article.
• Text. The body of the article.
• Lang. The language of the article.
• Date. The datetime when the article was published.
• Event ID. The ID of the event the article is associated with. It is used to measure the performance of the algorithms.

For the experiment, we merge the three data sets together to create a single cross-lingual news article data set. We extract their content embeddings and named entities, and sort them in chronological order, i.e. from oldest to newest. Table 2 shows the data set statistics.

Table 2: Data set statistics. For each language data set we denote the number of documents in the data set (# docs), the average length of the documents (avg. length), the number of event clusters (# clusters) and the average number of documents in the clusters (avg. size).

Language    # docs    avg. length    # clusters    avg. size
English     8,726     537            238           37
German      2,101     450            122           17
Spanish     2,177     401            149           15
Together    13,004    500            427           30

4.2 Evaluation Metrics
For the evaluation we use the same metrics as [3]. Let tp be the number of correctly clustered-together article pairs, let fp be the number of incorrectly clustered-together article pairs, and let fn be the number of incorrectly not-clustered-together article pairs. Then we report precision as $P = \frac{tp}{tp+fp}$, recall as $R = \frac{tp}{tp+fn}$, and the balanced F-score as $F_1 = 2 \cdot \frac{P \cdot R}{P+R}$. While precision describes how homogeneous the clusters are, recall tells us the amount of articles that should be together but are actually found in different clusters.

4.3 Baseline Algorithm
The baseline algorithm used in the experiment is presented in [3]. It performs cross-lingual news stream clustering by first generating monolingual event clusters using TF-IDF subvectors of words, word lemmas and named entities of the articles. Afterwards, it merges monolingual clusters into cross-lingual clusters, using cross-lingual word embeddings to represent the articles. The algorithm compares two approaches when performing cross-lingual clustering:

• Global parameter. Using a global parameter for measuring distances between all language articles for cross-lingual clustering decisions.
• Pivot parameter. Using a pivot parameter, where the distances between every other language are only compared to English, and cross-lingual clustering decisions are made only based on this distance.

Since the baseline algorithm was already evaluated using the cross-lingual data set we are using in the experiment, we only report their performances from the paper.

5 RESULTS
In this section we present the experiment results. For all experiments we fix the values $\beta = 1$ and $\tau = 3$ days, and evaluate the algorithms using different values of $\alpha$. In addition, all experiments use the event's minimum time statistic when validating the time condition $\delta_t$.

Baseline Comparison. Table 3 shows the experiment results of the best performing algorithm on the evaluation data set. We report the best performing CONTENT + NE + TS algorithm, which uses the content similarity threshold $\alpha = 0.3$.

Table 3: The algorithm performances. The best reported algorithm uses all three assignment conditions.

Algorithm            F1      P       R
Baseline (global)    72.7    89.8    61.0
Baseline (pivot)     84.0    83.0    85.0
CONTENT + NE + TS    72.2    79.7    66.0

While the proposed algorithm does not perform better than any of the baselines with respect to the F1 score, our algorithm still shows promising results. Its performance is comparable to the baseline using the global parameter, and it also outperforms the baseline (global) recall by 5%, showing it is better at grouping articles.

Condition Analysis. We have analyzed the impact the conditions have on the algorithm's performance. For each algorithm version we run the experiments using different values of $\alpha \in \{0.3, 0.4, 0.5, 0.6, 0.7\}$, and measure the balanced F-score, precision, and recall, as well as the number of clusters it generated. Table 4 shows the condition analysis results. By analysing the results we come to two conclusions:

Increasing α increases precision, decreases recall, and generates a larger number of clusters. When $\alpha$ is bigger, the content condition $\delta_c$ requires the articles to be more similar to the event. This condition is met when the article's content embedding is close to the event's centroid. Since this has to hold for all articles in the event, the articles that have high similarity are clustered together, increasing the algorithm's precision.
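The pairwise metrics of Section 4.2 can be computed directly from predicted and gold cluster labels. A small sketch; the list-of-labels encoding is our own illustrative choice:

```python
from itertools import combinations

def pairwise_scores(predicted, gold):
    """Pairwise P/R/F1: count article pairs that are correctly
    clustered together (tp), incorrectly clustered together (fp),
    and incorrectly not clustered together (fn)."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical demo: one split gold cluster and one spurious merge.
p, r, f1 = pairwise_scores(["a", "a", "b", "b"], ["a", "a", "a", "b"])
# tp = 1, fp = 1, fn = 2, so P = 0.5, R ≈ 0.33, F1 ≈ 0.4
```

Because every pair of articles is compared, these scores penalize both over-merging (fp pairs) and over-splitting (fn pairs), which is why raising α trades recall for precision in Table 4.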
Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia · Erik Novak

Table 4: The condition analysis results. The bold values represent the best performances on the data set.

Algorithm            α     # clusters    F1      P      R
CONTENT             0.3        46       29.6   19.7   59.8
                    0.4       234       51.6   46.2   58.4
                    0.5       849       57.7   67.7   50.3
                    0.6      1762       45.3   73.1   32.8
                    0.7      3185       26.0   81.9   15.5
CONTENT + NE        0.3       279       43.7   33.3   63.8
                    0.4       648       52.9   55.8   50.3
                    0.5      1168       56.5   67.4   48.6
                    0.6      1939       45.1   73.6   32.5
                    0.7      3254       25.9   82.3   15.4
CONTENT + TS        0.3       344       58.8   63.2   55.0
                    0.4       806       64.1   76.5   55.2
                    0.5      1346       58.8   83.4   45.4
                    0.6      2068       47.1   81.7   33.1
                    0.7      3356       25.2   84.8   14.7
CONTENT + NE + TS   0.3       925       72.2   79.7   66.0
                    0.4      1221       72.2   80.5   65.5
                    0.5      1554       54.0   81.9   40.2
                    0.6      2174       46.7   80.7   32.9
                    0.7      3403       25.0   84.8   14.7

However, if α is too large, the condition becomes too strong: similar articles can be split into multiple clusters, which decreases recall and increases the number of clusters the algorithm generates.

Algorithms with more conditions can achieve better performance; the algorithm's performance increases with each added condition. While the worst performance is achieved when only the content condition δ_c is used (CONTENT algorithm), the best performance is reached when all three conditions are used (CONTENT + NE + TS algorithm). The most significant contribution is provided by the time condition δ_t, which drastically improves the F1 score.

6 CONCLUSION
We propose a news stream clustering algorithm that directly generates cross-lingual event clusters. It uses multilingual language models to generate cross-lingual article representations, which are then compared to form cross-lingual event clusters. The algorithm was evaluated on a news article data set and compared to a strong baseline. The experiment results look promising, but there is still room for improvement.

In the future, we intend to modify the assignment condition and learn the condition parameters instead of manually setting them. Modifying the language models to accept longer inputs could better capture the articles' semantic meaning. In addition, events from different domains are reported at different rates. Learning these rates and including them in the algorithm could improve its performance.

ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the Humane AI Net European Union's Horizon 2020 project under grant agreement No 952026.

SloBERTa: Slovene monolingual large pretrained masked language model
Matej Ulčar and Marko Robnik-Šikonja
University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
{matej.ulcar, marko.robnik}@fri.uni-lj.si

ABSTRACT
Large pretrained language models, based on the transformer architecture, show excellent results in solving many natural language processing tasks. The research is mostly focused on the English language; however, many monolingual models for other languages have recently been trained. We trained the first such monolingual model for Slovene, based on the RoBERTa model. We evaluated the newly trained SloBERTa model on several classification tasks. The results show an improvement over existing multilingual and monolingual models and present the current state-of-the-art for Slovene.

KEYWORDS
natural language processing, BERT, RoBERTa, transformers, language model

1 INTRODUCTION
Solving natural language processing (NLP) tasks with neural networks requires presentation of text in a numerical vector format, called word embeddings. Embeddings assign each word its own vector in a vector space so that similar words have similar vectors, and certain relationships between word meanings are expressed in the vector space as distances and directions. Typical static word embedding models are word2vec [19], GloVe [24], and fastText [1]. ELMo [25] embeddings are an example of dynamic, contextual word embeddings. Unlike static word embeddings, where a word gets a fixed vector, contextual embeddings ascribe a different word vector to each occurrence of a word, based on its context.

State-of-the-art text representations are currently based on the transformer architecture [35]. GPT-2 [27] and BERT [5] models are among the first and most influential transformer models. Due to their ability to be successfully adapted to a wide range of tasks, such models are, somewhat impetuously, called foundation models [2, 17]. While GPT-2 uses the transformer's decoder stack to model the next word based on previous words, BERT uses the encoder stack to encode word representations of a masked word, based on the surrounding context before and after the word. Previous embedding models (e.g., ELMo and fastText) were used to extract word representations, which were then used to train a model on a specific task. In contrast, transformer models are typically fine-tuned for each individual downstream task, without extracting word vectors.

Successful transformer models typically contain more than 100 million parameters. To train, they require considerable computational resources and large training corpora. Luckily, many of these models are publicly released. Their fine-tuning is much less computationally demanding and is accessible to users with modest computational resources. In this work, we present the training of a Slovene transformer-based masked language model, named SloBERTa, based on a variant of the BERT architecture. SloBERTa is the first such publicly released model, trained exclusively on Slovene language corpora.

2 RELATED WORK
Following the success of the BERT model [5], many transformer-based language models have been released, e.g., RoBERTa [14], GPT-3 [3], and T5 [28]. The complexity of these models has been constantly increasing. The size of the newer generations of models has made training computationally prohibitive for most research organizations, leaving it feasible only for large corporations. Training also requires huge amounts of training data, which do not exist for most languages. Thus, most of these large models have been trained only for a few very well-resourced languages, chiefly English, or in a massively multilingual fashion.

The BERT model was pre-trained on two tasks simultaneously: masked token prediction and next sentence prediction. For the masked token prediction, 15% of the tokens in the training corpus were randomly masked before training. The training dataset was augmented by duplicating the training corpus a few times, with each copy having different randomly selected tokens masked. The next sentence prediction task attempts to predict whether two given sentences appear in a natural order.

The RoBERTa [14] model uses the same architecture as BERT, but drops the next sentence prediction task, as it was shown that it does not contribute to the model performance. The masked token prediction task was changed so that the tokens are randomly masked on the fly, i.e., a different subset of tokens is masked in each training epoch.

Both BERT and RoBERTa were released in different sizes. Base models use 12 hidden transformer layers of size 768. Large models use 24 hidden transformer layers of size 1024. Smaller-sized BERT models exist, obtained using knowledge distillation from pre-trained larger models [11].

A few massively multilingual models were trained on 100 or more languages simultaneously. Notable released variants are multilingual BERT (mBERT) [5] and XLM-RoBERTa (XLM-R) [4]. While massively multilingual BERT models perform well for the trained languages, they lag behind the monolingual models [36, 33]. Examples of recently released monolingual BERT models for various languages are Finnish [36], Swedish [16], Estonian [30], Latvian [37], etc.

The Slovene language is supported by the aforementioned massively multilingual models and by the trilingual CroSloEngual BERT model [33], which has been trained on three languages: Croatian, Slovene, and English. No monolingual transformer model for Slovene has been previously released.
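The contrast between BERT-style static masking and RoBERTa-style dynamic masking described above can be sketched in a few lines. This is a minimal illustration on whitespace tokens, not the actual training pipeline; the `mask_tokens` helper, the toy corpus, and the seeding scheme are assumptions, while the 15% masking rate follows the text:

```python
import random

MASK = "[MASK]"
RATE = 0.15  # fraction of tokens masked, as in BERT and RoBERTa

def mask_tokens(tokens, rng):
    """Randomly replace ~15% of tokens with [MASK]; targets map positions to originals."""
    n = max(1, round(len(tokens) * RATE))
    positions = rng.sample(range(len(tokens)), n)
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return masked, targets

corpus = [s.split() for s in [
    "jezikovni model se uči iz besedila",
    "slovenski korpusi so veliki",
]]

# Static masking (BERT): the corpus is masked once when the dataset is
# built, so every epoch sees the same masked positions (unless the corpus
# is duplicated with different maskings, as BERT's authors did).
rng = random.Random(0)
static = [mask_tokens(s, rng) for s in corpus]

# Dynamic masking (RoBERTa): a fresh subset of tokens is masked in each
# epoch, so the model sees different prediction targets over time.
for epoch in range(3):
    rng = random.Random(epoch)
    dynamic = [mask_tokens(s, rng) for s in corpus]
```

The design point is only where the random masking happens: once at dataset-construction time, or anew on every pass over the data.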
3 SLOBERTA
The presented SloBERTa model is closely related to the French CamemBERT model [18], which uses the same architecture and training approach as the RoBERTa base model [14], but uses a different tokenization model. In this section, we describe the training datasets, the architecture, and the training procedure of SloBERTa.

3.1 Datasets
Training a successful transformer language model requires a large dataset. We combined five large Slovene corpora in our training dataset. Gigafida 2.0 [13] is a general language corpus, composed of fiction and non-fiction books, newspapers, school textbooks, texts from the internet, etc. The Janes corpus [9] is composed of several subcorpora. Each subcorpus contains texts from a certain social medium or a group of similar media, including Twitter, blog posts, forum conversations, comments under articles on news sites, etc. We used all Janes subcorpora except Janes-tweet, since the contents of that subcorpus are encoded and need to be individually downloaded from Twitter, which is a lengthy process, as Twitter limits the access speed. KAS (Corpus of Academic Slovene) [8] consists of PhD, MSc, MA, BSc, and BA theses written in Slovene between 2000 and 2018. SiParl [23] contains the minutes of the Slovene national assembly between 1990 and 2018. SlWaC [15] is a web corpus collected from the .si top-level web domain. All corpora used are listed in Table 1 along with their sizes.

Table 1: Corpora used in training of SloBERTa with their sizes in billions of tokens and words. Janes* does not include the Janes-tweet subcorpus.

Corpus                      Genre               Tokens   Words
Gigafida 2.0                general language     1.33     1.11
Janes*                      social media         0.10     0.08
KAS                         academic             1.70     1.33
siParl 2.0                  parliamentary        0.24     0.20
slWaC 2.1                   web crawl            0.90     0.75
Total                                            4.27     3.47
Total after deduplication                        4.20     3.41

3.2 Data preprocessing
We deduplicated the corpora using the Onion tool [26]. We split the deduplicated corpora into three sets: training (99%), validation (0.5%), and test (0.5%). Independently of the three splits, we prepared a smaller dataset, one fifteenth of the size of the whole dataset, by randomly sampling the sentences. We used this smaller dataset to train a sentencepiece model (https://github.com/google/sentencepiece), which is used to tokenize and encode the text into subword byte-pair encodings (BPE). The sentencepiece model trained for SloBERTa has a vocabulary containing 32,000 subword tokens.

3.3 Architecture and training
SloBERTa has 12 transformer layers, which is equivalent in size to the BERT-base and RoBERTa-base models. The size of each transformer layer is 768. We trained the model for 200,000 steps (about 98 epochs) on the Slovene corpora described in Section 3.1. The model supports a maximum input sequence length of 512 subword tokens.

SloBERTa was trained as a masked language model, using the fairseq toolkit [22]. 15% of the input tokens were randomly masked, and the task was to predict the masked tokens. We used whole-word masking, meaning that if a word was split into several subtokens and one of them was masked, all the other subtokens pertaining to that word were masked as well. Tokens were masked dynamically, i.e., in each epoch a different subset of tokens was randomly selected to be masked.

4 EVALUATION
We evaluated SloBERTa on five tasks: named-entity recognition (NER), part-of-speech tagging (POS), dependency parsing (DP), sentiment analysis (SA), and word analogy (WA). We used the labeled ssj500k corpus [12, 6] for fine-tuning SloBERTa on each of the NER, POS, and DP tasks. For NER, we limited the scope to three types of named entities (person, location, and organization). We report the results as a macro-average F1 score of these three classes. For POS tagging, we used UPOS tags; the results are reported as a micro-average F1 score. For DP, we report the results as a labeled attachment score (LAS). The SA classifier was fine-tuned on a dataset composed of Slovenian tweets [20, 21], labeled as either "positive", "negative", or "neutral". We report the results as a macro-average F1 score.

The traditional WA task measures the distance between word vectors in a given analogy (e.g., man : king ≈ woman : queen). For contextual embeddings such as BERT, the task has to be modified to make sense. First, word embeddings from transformers are generally not used on their own; rather, the model is fine-tuned. Second, four words from an analogy do not provide enough context for use with transformers. In our modification, we input the four words of an analogy into the boilerplate sentence "If the word [word1] corresponds to the word [word2], then the word [word3] corresponds to the word [word4]." We then masked [word2] and attempted to predict it using masked token prediction. We used the Slovene part of the multilingual culture-independent word analogy dataset [32]. We report the results as an average precision@5 (the proportion of the correct [word2] analogy words among the 5 most probable predictions).

We compared the performance of SloBERTa with three other transformer models supporting Slovene: CroSloEngual BERT (CSE-BERT) [33], multilingual BERT (mBERT) [5], and XLM-RoBERTa (XLM-R) [4]. Where sensible, we also included the results achieved by training a classifier model using Slovene ELMo [31] and fastText embeddings.

We fine-tuned the transformer models on each task by adding a classification head on top of the model. The exception is the DP task, where we used the modified dep2label-bert tool [29, 10]. For ELMo and fastText, we extracted embeddings from the training datasets and used them to train token-level and sentence-level classifiers for each task, except for DP. The classifiers are composed of neural networks with a few LSTM layers. For the DP task, we used the modified SuPar tool, based on deep biaffine attention [7]. The details of the evaluation process are presented in [34].
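The precision@5 computation used for the word-analogy evaluation can be sketched as follows. The template string follows the boilerplate sentence described above (with a generic `<mask>` placeholder), while `predict_top5` merely stands in for the masked-token prediction of a language model and is purely hypothetical:

```python
def template(word1, word3, word4):
    """Boilerplate analogy sentence with the [word2] slot masked."""
    return (f"If the word {word1} corresponds to the word <mask>, "
            f"then the word {word3} corresponds to the word {word4}.")

def precision_at_5(examples, predict_top5):
    """Fraction of analogies whose correct [word2] is among the top 5 predictions.

    examples: list of (word1, word2, word3, word4) analogy tuples.
    predict_top5: callable returning the most probable fillers for the
    masked slot (a stand-in for the masked language model).
    """
    hits = sum(
        1 for w1, w2, w3, w4 in examples
        if w2 in predict_top5(template(w1, w3, w4))[:5]
    )
    return hits / len(examples)
```

With a real model, `predict_top5` would rank the vocabulary by the masked-token probabilities; here any callable with that interface works.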
The results are shown in Table 2. The results of ELMo and fastText, while comparable with each other, are not fully comparable with the results of the transformer models, as the classifier training approach is different.

Table 2: Results of Slovene transformer models.

Model       NER     POS     DP      SA      WA
fastText    0.478   0.527   /       0.435   /
ELMo        0.849   0.966   0.914   0.510   /
mBERT       0.885   0.984   0.681   0.576   0.061
XLM-R       0.912   0.988   0.793   0.604   0.146
CSE-BERT    0.928   0.990   0.854   0.610   0.195
SloBERTa    0.933   0.991   0.844   0.623   0.405

On the NER, POS, SA, and WA tasks, SloBERTa outperforms all other models and embeddings. For POS tagging, the differences between the models are small, except for fastText, which performs much worse. ELMo, surprisingly, outperforms all transformer models on the DP task; however, it performs worse on the other tasks. SloBERTa performs worse than CSE-BERT on the DP task, but beats the other multilingual models.

The success of ELMo on the DP task can be partially explained by the different tools used for training the classifiers. Further work is needed to fully evaluate the difference and the success of ELMo embeddings on this task.

The performance on the SA task is limited by the low inter-annotator agreement [20]. The reported average of F1 scores for the positive and negative classes is 0.542 for inter-annotator agreement and 0.726 for self-agreement. Using the same measure (average of F1 for the positive and the negative class), SloBERTa scores 0.667 and mBERT scores 0.593.

On the WA task, most models perform poorly. This is expected, because very little context was provided on the input, and the transformer models need context to perform well. SloBERTa significantly outperforms the other models, not only because it was trained only on Slovene data, but largely because its tokenizer is adapted to the Slovene language alone and does not need to cover other languages.

5 CONCLUSIONS
We present SloBERTa, the first monolingual transformer-based masked language model trained on Slovene texts. We show that the SloBERTa large pretrained masked language model outperforms existing comparable multilingual models supporting Slovene on four tasks: NER, POS tagging, sentiment analysis, and word analogy. The performance on the DP task is competitive, but lags behind some of the existing models.

In further work, we intend to compare the improvement of BERT-like monolingual models over multilingual models for other languages.

The pre-trained SloBERTa model is publicly available via the CLARIN.SI (http://hdl.handle.net/11356/1397) and Huggingface (https://huggingface.co/EMBEDDIA/sloberta) repositories. We also make the code used for preprocessing the corpora and training SloBERTa publicly available (https://github.com/clarinsi/Slovene-BERT-Tool).

ACKNOWLEDGMENTS
The work was partially supported by the Slovenian Research Agency (ARRS) core research programme P6-0411 and project J6-2581, as well as by the Ministry of Culture of the Republic of Slovenia through the project Development of Slovene in Digital Environment (RSDO). This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media).

REFERENCES
[1] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
[2] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. 2021. On the opportunities and risks of foundation models. ArXiv preprint 2108.07258. (2021).
[3] Tom Brown et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems. Volume 33, 1877–1901.
[4] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
[5] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. doi: 10.18653/v1/N19-1423.
[6] Kaja Dobrovoljc, Tomaž Erjavec, and Simon Krek. 2017. The universal dependencies treebank for Slovenian. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2017).
[7] Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In Proceedings of the 5th International Conference on Learning Representations, ICLR.
[8] Tomaž Erjavec, Darja Fišer, and Nikola Ljubešić. 2021. The KAS corpus of Slovenian academic writing. Language Resources and Evaluation, 55, 2, 551–583.
[9] Darja Fišer, Tomaž Erjavec, and Nikola Ljubešić. 2016. Janes v0.4: korpus slovenskih spletnih uporabniških vsebin. Slovenščina 2.0: empirical, applied and interdisciplinary research, 4, 2, 67–99.
[10] Carlos Gómez-Rodríguez, Michalina Strzyz, and David Vilares. 2020. A unifying theory of transition-based and sequence labeling parsing. In Proceedings of the 28th International Conference on Computational Linguistics, 3776–3793. doi: 10.18653/v1/2020.coling-main.336.
[11] Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. 2020. TinyBERT: Distilling BERT for natural language understanding. (2020). arXiv: 1909.10351 [cs.CL].
[12] Simon Krek, Kaja Dobrovoljc, Tomaž Erjavec, Sara Može, Nina Ledinek, Nanika Holz, Katja Zupan, Polona Gantar, Taja Kuzman, Jaka Čibej, Špela Arhar Holdt, Teja Kavčič, Iza Škrjanec, Dafne Marko, Lucija Jezeršek, and Anja Zajc. 2019. Training corpus ssj500k 2.2. Slovenian language resource repository CLARIN.SI. (2019).
[13] Simon Krek, Tomaž Erjavec, Andraž Repar, Jaka Čibej, Špela Arhar, Polona Gantar, Iztok Kosem, Marko Robnik, Nikola Ljubešić, Kaja Dobrovoljc, Cyprian Laskowski, Miha Grčar, Peter Holozan, Simon Šuster, Vojko Gorjanc, Marko Stabej, and Nataša Logar. 2019. Gigafida 2.0: Korpus pisne standardne slovenščine. viri.cjvt.si/gigafida. (2019).
[14] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv preprint 1907.11692. (2019).
[15] Nikola Ljubešić and Tomaž Erjavec. 2011. hrWaC and slWaC: Compiling web corpora for Croatian and Slovene. In International Conference on Text, Speech and Dialogue. Springer, 395–402.
[16] Martin Malmsten, Love Börjeson, and Chris Haffenden. 2020. Playing with Words at the National Library of Sweden – Making a Swedish BERT. ArXiv preprint 2007.01658. (2020).
[17] Gary Marcus and Ernest Davis. 2021. Has AI found a new foundation? The Gradient. 11 September 2021.
[18] Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: A tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7203–7219.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint 1301.3781.
[20] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Multilingual Twitter sentiment classification: the role of human annotators. PLOS ONE, 11, 5.
[21] Igor Mozetič, Miha Grčar, and Jasmina Smailović. 2016. Twitter sentiment for 15 European languages. Slovenian language resource repository CLARIN.SI. (2016). http://hdl.handle.net/11356/1054.
[22] Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
[23] Andrej Pančur and Tomaž Erjavec. 2020. The siParl corpus of Slovene parliamentary proceedings. In Proceedings of the Second ParlaCLARIN Workshop, 28–34.
[24] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543.
[25] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2227–2237. doi: 10.18653/v1/N18-1202.
[26] Jan Pomikálek. 2011. Removing boilerplate and duplicate content from web corpora. PhD thesis. Masaryk University, Brno, Czech Republic.
[27] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI blog. (2019).
[28] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1–67.
[29] Michalina Strzyz, David Vilares, and Carlos Gómez-Rodríguez. 2019. Viable dependency parsing as sequence labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 717–723. doi: 10.18653/v1/N19-1077.
[30] Hasan Tanvir, Claudia Kittask, and Kairit Sirts. 2020. EstBERT: A pretrained language-specific BERT for Estonian. arXiv preprint 2011.04784. (2020).
[31] Matej Ulčar and Marko Robnik-Šikonja. 2020. High quality ELMo embeddings for seven less-resourced languages. In Proceedings of the 12th Language Resources and Evaluation Conference, LREC 2020, 4733–4740.
[32] Matej Ulčar, Kristiina Vaik, Jessica Lindström, Milda Dailidėnaitė, and Marko Robnik-Šikonja. 2020. Multilingual culture-independent word analogy datasets. In Proceedings of the 12th Language Resources and Evaluation Conference, 4067–4073.
[33] Matej Ulčar and Marko Robnik-Šikonja. 2020. FinEst BERT and CroSloEngual BERT: less is more in multilingual models. In Proceedings of Text, Speech, and Dialogue, TSD 2020, 104–111.
[34] Matej Ulčar, Aleš Žagar, Carlos S. Armendariz, Andraž Repar, Senja Pollak, Matthew Purver, and Marko Robnik-Šikonja. 2021. Evaluation of contextual embeddings on less-resourced languages. ArXiv preprint 2107.10614. (2021).
[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
[36] Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo Pyysalo. 2019. Multilingual is not enough: BERT for Finnish. arXiv preprint arXiv:1912.07076. (2019).
[37] Artūrs Znotiņš and Guntis Barzdiņš. 2020. LVBERT: Transformer-based model for Latvian language understanding. In Human Language Technologies – The Baltic Perspective: Proceedings of the Ninth International Conference Baltic HLT 2020. Volume 328, 111.

Understanding the Impact of Geographical Bias on News Sentiment: A Case Study on London and Rio Olympics
Swati (swati@ijs.si) and Dunja Mladenić (dunja.mladenic@ijs.si)
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

[Figure 1 shows three pairs of similar news articles on the London and Rio Olympic legacies, with their headlines, categories, summaries, sentiment scores (0.067 to −0.435), article similarities (0.503–0.632), and sentiment dissimilarities (0.439–0.502).]

Figure 1: An example to illustrate the impact of geographical location on the sentiment of similar news articles.
ABSTRACT
There are various types of news bias, most of which play an important role in manipulating public perceptions of any event. Researchers frequently question the role of geographical location in attributing such biases. To that end, we intend to investigate the impact of geographical bias on news sentiments in related articles. As our case study, we use news articles collected from the Event Registry over two years about the Olympic legacy in London and Rio. Our experimental analysis reveals that geographical boundaries do have an impact on news sentiment.

KEYWORDS
Bias, News Bias, Geographical Bias, Olympics, Semantic Similarity, Sentiment Analysis, Dataset

1 INTRODUCTION
Claims of bias in news coverage raise questions about the role of geography in shaping public perceptions of similar events. Based on the geographical location, multiple factors, such as political affiliation, editorial independence, etc., can influence the way news articles are generated. Although it is well known that biased news can have more influence on people's thinking and decision-making processes [7, 9], it is nearly impossible to produce an article without any bias. Biased news articles have the potential to induce a variety of political and social implications, both direct and indirect. For instance, any political controversy presented from a specific perspective may alter the voting pattern [4, 1, 6].

There are different forms of news bias, and geographical bias is one of them. It exists if the sentiment polarity of similar articles published in different geographical locations is contradictory or varies significantly. Sentiment analysis methods, which are commonly used to determine news bias [3, 14], can be used to examine the shift in sentiment polarity in similar news articles. Now, an intriguing question arises: is geographical bias a factor affecting news sentiment? This study seeks to answer the above question by identifying and comparing the sentiments of similar news articles. In doing so, we demonstrate how geographical location impacts the sentiments of similar articles. We also investigate this impact in relation to several news categories such as politics, business, sports, and so on.

The Olympic Games are a symbol of the greatest sports events in the world. Every edition leaves a number of legacies for the Olympic Movement, as well as unforgettable memories for each host city, whether positive or negative. In this regard, we select news articles about the Olympic legacy in London and Rio as a case study for our analysis.

We use Event Registry (https://eventregistry.org) [10] to collect English news articles, along with their sentiment and categories, published between January 2017 and December 2020. We use the popular Sentence-BERT (SBERT) [12] embedding to represent the articles and then compute the cosine similarity between them to identify similar article pairs.

Our data and code can be found in the GitHub repository at https://github.com/Swati17293/geographical-bias.

1.1 Contributions
The paper's contributions are as follows:
• We propose a task of analyzing the impact of geographical bias on the sentiment of news articles, with data on the Olympic legacies of Rio and London as a case study.
• We present a dataset of English news articles customized to the above-mentioned task.
• We present experimental results to demonstrate the aforementioned impact of geographical bias.

2 RELATED WORK
The majority of the sentiment analysis methods for news bias analysis depend on the sentiment words that are explicitly stated. SentiWordNet is a publicly available lexical resource used by researchers for opinion mining to identify the sentiment-inducing words and classify them as positive, negative, or neutral.

Melo et al. [5] collected and analyzed articles from Brazil's news media and social media to understand the country's response to the COVID-19 pandemic. They proposed using an enhanced topic model and sentiment analysis method to tackle this task. They identified and applied the main themes under consideration in order to comprehend how their sentiments changed over time. They discovered that certain elements in both media reflected negative attitudes toward political issues.

Quijote et al. [11] used SentiWordNet along with the Inverse Reinforcement Model to analyze the bias present in news articles and to determine whether the outlets are biased or not. The lexicons were first scored for the experiments using SentiWordNet and then fed to the Inverse Reinforcement model as input. To determine the news bias, the model measured the deviation and controversy scores of the articles. The findings lead to the inference that articles from major news outlets in the Philippines are not biased, excluding those from the Manila Times.

Bharathi and Geetha [3] classified the articles published by the UK, US, and India media as positive, negative, or neutral using the content sentiment algorithm [2]. The sentiment scores of the opinion words and their polarities were used as input to the algorithm.

Existing research investigates news bias using sentiment analysis methods, but, unlike our work, it does not provide a suitable automated method for analyzing the impact of geographical bias on news sentiment.

3 DATA DESCRIPTION
3.1 Raw Data Source
We use Event Registry [10] as our raw data source, which monitors, gathers, and delivers news articles from all around the world. It also annotates articles with numerous metadata such as a unique identifier for article identification, the categories to which an article may belong, geographical location, sentiment, and so on. Its large-scale coverage can therefore be used effectively to assess the impact of geographical bias on news sentiment.

3.2 Dataset
To generate our dataset, we use a similar data collection process as described in [13]. Using the Event Registry API, we collect all English-language news articles about the Olympic legacy in London/Rio if the headline and/or summary of the article contains the keywords 'London'/'Rio', 'Olympic', and 'Legacy'. For each article, we then extract the summary, category, and sentiment. The article summaries vary in length from 290 to 6,553 words. Sentiment scores range from −1 to 1. We select seven major news categories, namely business, politics, technology, environment, health, sports, and arts-and-entertainment, and remove the rest of the categories. After excluding the duplicate articles, we end up with 8,690 and 5,120 articles about the Olympic legacy in London and Rio, respectively.

4 MATERIALS AND METHODS
4.1 Methodology
The primary task is to compute the average difference in sentiment scores between similar news articles about the Olympic legacies in Rio and London. The stated task can be subdivided and mathematically formulated as follows:
(1) Generate two distinct sets of news articles, A1 = {(a_1, s_1), (a_2, s_2), ..., (a_n, s_n)} about the London Olympic legacy and A2 = {(a'_1, s'_1), (a'_2, s'_2), ..., (a'_m, s'_m)} about the Rio Olympic legacy, where a_i is the i-th article in A1, a'_j is the j-th article in A2, n = |A1|, and m = |A2|. For each a_i ∈ A1, find the list of articles a'_j ∈ A2 that are the closest matches (cf. Section 4.1.1) to a_i.
(2) For each list, calculate D_ij, the difference between the sentiment scores s_i and s'_j of the articles a_i and a'_j.
(3) Calculate the average difference D̄ of the sentiment scores.
(4) Calculate the percentage of similar article pairs with reversed polarity and those with unchanged polarity.
The secondary task is to assess the primary task with respect to news categories, i.e., to calculate the average difference D̄ of sentiment scores for similar articles in each category.
In the following subsections, we discuss the tasks mentioned above in greater detail.

4.1.1 Article Similarity. We embed the articles in sets A1 and A2 to construct feature sets F1 = {f_1, f_2, ..., f_n} and F2 = {f'_1, f'_2, ..., f'_m}. While alternative embedding approaches can be utilized, in this study we select the popular Sentence-BERT (SBERT) [12] embedding to extract 768-dimensional feature vectors representing the individual articles in F1 and F2.

For each article a_i in A1, we compute the similarity score between a_i and every article a'_j in A2 using the cosine similarity metric cosSim(a_i, a'_j) (Eq. 1). We consider articles a_i and a'_j to be similar only if their similarity score is greater than 0.5.

    cosSim(a_i, a'_j) = (f_i · f'_j) / (||f_i|| ||f'_j||)    (1)

where f_i and f'_j represent the embedded feature vectors of articles a_i and a'_j. The similarity score ranges from −1 to 1, where −1 indicates that the articles are completely unrelated, 1 indicates that they are identical, and in-between scores indicate partial similarity or dissimilarity.

4.1.2 Average Sentiment Dissimilarity.
For every pair of similar ′ London and Rio published between January 2017 and December articles 𝑎 and 𝑎 , we calculate the difference 𝐷 between their 𝑖 𝑖 𝑗 𝑗 ′ 2020. We consider an article to be about the Olympic Legacy sentiment scores 𝑠 and 𝑠 . To calculate the average sentiment 𝑖 𝑗 2http://sentiwordnet.isti.cnr.it/ 3https://en.wikipedia.org/wiki/Cosine_similarity 22 Understanding the Impact of Geographical Bias on News Sentiment Information Society 2021, 4 October 2021, Ljubljana, Slovenia Table 1: Category-wise confusion matrix to show the percentage of similar article pairs with respect to their sentiment polarity. Sports Business Politics Environment Health Technology Arts & Entertainment Pos Neg Pos Neg Pos Neg Pos Neg Pos Neg Pos Neg Pos Neg Pos 77 10 62 28 42 18 55 18 29 12 87 4 59 16 Neg 11 2 7 4 23 16 14 12 12 46 1 0 7 18 Table 2: Confusion matrix to show the percentage of sim- ilar article pairs with respect to their sentiment polarity. Positive Negative Positive 69 15 Negative 11 4 Table 3: Distribution of average sentiment difference across news categories for similar article pairs with iden- tical category. Figure 2: Distribution of average sentiment differences News category Average Sentiment Difference across categories for similar articles in the same category. Sports 0.19 Business 0.20 Politics 0.18 Health 0.16 Environment 0.22 Technology 0.14 Arts and Entertainment 0.19 dissimilarity score 𝐷, we add all 𝐷 and divide it by the total 𝑖 𝑗 number of similar article pairs. 5 RESULTS AND ANALYSIS Figure 3: An illustration of the effect of category on senti- In our experiments, we compare 44,492,800 possible article pairs ment polarity. for similarity and discover 375,008 similar pairs. 
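The pairwise comparison of Section 4.1.1 can be vectorized so that all n x m cosine similarities (Eq. 1) are computed in one matrix product. A minimal NumPy sketch under the paper's 0.5 threshold, with small toy vectors standing in for the 768-dimensional SBERT embeddings:

```python
import numpy as np

def similar_pairs(F1: np.ndarray, F2: np.ndarray, threshold: float = 0.5):
    """Return (i, j, cosSim) for every article pair with cosSim > threshold.

    F1: (n, d) embeddings of one article set, F2: (m, d) of the other.
    Row-normalizing both matrices turns the (n, m) dot-product matrix
    into the full table of cosSim(a_i, a'_j) values from Eq. 1.
    """
    A = F1 / np.linalg.norm(F1, axis=1, keepdims=True)
    B = F2 / np.linalg.norm(F2, axis=1, keepdims=True)
    sims = A @ B.T                      # sims[i, j] = cosSim(a_i, a'_j)
    idx = np.argwhere(sims > threshold)
    return [(i, j, sims[i, j]) for i, j in idx]

# Toy example: three "London" and two "Rio" articles in a 4-dim space.
F1 = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.7, 0.7, 0.0, 0.0]])
F2 = np.array([[1.0, 0.1, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]])
pairs = similar_pairs(F1, F2)   # articles 0 and 2 match Rio article 0
```

With 8,690 x 5,120 articles this is a single (8690, 5120) similarity matrix, matching the 44,492,800 comparisons reported below; chunking the matrix product keeps memory bounded.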
The comparison in terms of sentiment similarity reveals that if two articles from different geographical regions are similar, in our case Rio and London, the average difference in their sentiment scores is 0.171. In addition, as shown in Table 2, we calculate the percentage of similar article pairs based on their sentiment polarity. It is worth noting that the polarity of the article is completely reversed 27% of the time, indicating the impact of geographic region on sentiments.

This is because the success of mega-events such as the Olympics in a particular host city is heavily influenced by its residents' trust in and support for the government [8]. The event can be viewed positively, as a national event with social and economic benefits, or negatively, as a source of wasted money. While the Olympics have left an economic and social legacy in London, a series of structural investment demands in Rio raises the question of whether or not the Olympics were worthwhile for the entire country.

5.1 Impact of news categories
The impact of news categories on the sentiments of similar articles with identical categories from different geographical regions is shown in Table 3. It demonstrates that certain news categories have a greater impact than others. Figure 2 depicts this distinction more clearly.

The categorical distribution of the percentage of similar article pairs in terms of sentiment polarity is shown in Table 1. 'Politics' has the highest percentage of articles with reversed polarity, while 'technology' has the lowest. Categories such as 'business' and 'entertainment', though not as clearly as 'politics', exhibit the same bias.

This disparity arises from the fact that, in contrast to other categories, politics is most influenced by geographical boundaries, whereas science and technology are typically location independent. Since politics has such a large influence on shaping beliefs and public perceptions, it is frequently twisted to fit a particular narrative of a story. It is inherently linked to geographical borders, and it can be extremely polarizing depending on the geographical region.
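The summary quantities reported above follow directly from steps (2)-(4) of Section 4.1. A small illustrative sketch with made-up sentiment pairs, assuming the difference D_ij is taken as an absolute difference (the paper does not state the sign convention):

```python
def sentiment_dissimilarity(pairs):
    """pairs: list of (s_i, s_j) sentiment scores in [-1, 1] for similar articles.

    Returns the average absolute difference (the paper's D) and the fraction
    of pairs whose polarity is reversed, i.e. whose scores have opposite signs.
    """
    diffs = [abs(s1 - s2) for s1, s2 in pairs]
    d_bar = sum(diffs) / len(pairs)
    reversed_frac = sum(1 for s1, s2 in pairs if s1 * s2 < 0) / len(pairs)
    return d_bar, reversed_frac

# Hypothetical sentiment scores for four similar London/Rio article pairs.
example = [(0.4, 0.2), (0.3, -0.1), (-0.2, -0.4), (0.5, 0.3)]
d_bar, rev = sentiment_dissimilarity(example)
# d_bar = (0.2 + 0.4 + 0.2 + 0.2) / 4 = 0.25; rev = 1/4 = 0.25
```

On the real pairs this computation yields the reported D = 0.171 and the 27% polarity-reversal rate.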
6 CONCLUSIONS AND FUTURE WORK
In this work, we use news articles about the Olympic legacy in London and Rio as a case study to understand how geographical boundaries interplay with news sentiments.

We begin by presenting a dataset of news articles collected between January 2017 and December 2020 using the Event Registry API. We compute the cosine similarity scores of all possible embedded article pairs, one from each set of Olympic legacy articles (London and Rio). We use the popular Sentence-BERT for article embedding and then compute the sentiment difference between similar article pairs. From 44,492,800 possible article pairs, we end up with 375,008 similar pairs.

In our analysis, we discovered that the sentiment reflected in similar articles from different geographical regions differed significantly. We also investigated this difference in relation to different news categories such as politics, business, sports, and so on. We find a significant difference in news sentiment across geographical boundaries when it comes to political news, while in the case of technology news the difference is much smaller. We find that while articles in categories such as politics and business can be heavily influenced by geographical location, articles in categories such as science and technology are typically location independent.

In the future, we plan to identify the most frequently mentioned topics in the Olympic legacy corpus to see how they affect the news sentiment of articles about different geographical locations. Since our study is limited to English news articles, we intend to learn more about the role of cultures and languages in this bias analysis. We also intend to broaden our investigation to discover the adjectives used to describe the negative and positive legacies of Rio and London. Such an analysis would aid in understanding the expectations from cities such as Rio (the first in South America to host the Olympics) in comparison to London.

7 ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 812997.

REFERENCES
[1] Dan Bernhardt, Stefan Krasa, and Mattias Polborn. 2008. Political polarization and the electoral effects of media bias. Journal of Public Economics, 92, 5-6, 1092–1104.
[2] Shri Bharathi and Angelina Geetha. 2017. Sentiment analysis for effective stock market prediction. International Journal of Intelligent Engineering and Systems, 10, 3, 146–153.
[3] SV Shri Bharathi and Angelina Geetha. 2019. Determination of news biasedness using content sentiment analysis algorithm. Indonesian Journal of Electrical Engineering and Computer Science, 16, 2, 882–889.
[4] Chun-Fang Chiang and Brian Knight. 2011. Media bias and influence: evidence from newspaper endorsements. The Review of Economic Studies, 78, 3, 795–820.
[5] Tiago de Melo and Carlos MS Figueiredo. 2021. Comparing news articles and tweets about COVID-19 in Brazil: sentiment analysis and topic modeling approach. JMIR Public Health and Surveillance, 7, 2, e24585.
[6] Claes H De Vreese. 2005. News framing: theory and typology. Information Design Journal & Document Design, 13, 1.
[7] John Duggan and Cesar Martinelli. 2011. A spatial theory of media slant and voter choice. The Review of Economic Studies, 78, 2, 640–666.
[8] Dogan Gursoy and KW Kendall. 2006. Hosting mega events: modeling locals' support. Annals of Tourism Research, 33, 3, 603–623.
[9] Daniel Kahneman and Amos Tversky. 2013. Choices, values, and frames. In Handbook of the Fundamentals of Financial Decision Making: Part I. World Scientific, 269–278.
[10] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event Registry: learning about world events from news. In Proceedings of the 23rd International Conference on World Wide Web, 107–110.
[11] TA Quijote, AD Zamoras, and A Ceniza. 2019. Bias detection in Philippine political news articles using SentiWordNet and inverse reinforcement model. In IOP Conference Series: Materials Science and Engineering, number 1, volume 482. IOP Publishing, 012036.
[12] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2019. http://arxiv.org/abs/1908.10084.
[13] Swati, Tomaž Erjavec, and Dunja Mladenić. 2020. EveOut: reproducible event dataset for studying and analyzing the complex event-outlet relationship.
[14] Taylor Thomsen. 2018. Do media companies drive bias? Using sentiment analysis to measure media bias in newspaper tweets.

An evaluation of BERT and Doc2Vec model on the IPTC Subject Codes prediction dataset

Marko Pranjić (marko.pranjic@styria.ai), Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, and Trikoder d.o.o., Zagreb, Croatia
Marko Robnik-Šikonja (marko.robnik@fri.uni-lj.si), University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Senja Pollak (senja.pollak@ijs.si), Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
ABSTRACT
Large pretrained language models like BERT have shown excellent generalization properties and have advanced the state of the art on various NLP tasks. In this paper, we evaluate the Finnish BERT (FinBERT) model on the IPTC Subject Codes prediction task. We compare it to a simpler Doc2Vec model used as a baseline. Due to the hierarchical nature of the IPTC Subject Codes, we also evaluate the effect of encoding the hierarchy in the network layer topology. Contrary to our expectations, the simpler baseline Doc2Vec model clearly outperforms the more complex FinBERT model, and our attempts to encode the hierarchy in the prediction network do not yield systematic improvement.

KEYWORDS
news categorization, text representation, BERT, Doc2Vec, IPTC Subject Codes

1 INTRODUCTION
The field of Natural Language Processing (NLP) has greatly benefited from the advances in deep learning. New techniques and architectures are developed at a fast pace. The Transformer architecture [12] is the foundation for most new NLP models, and it is especially successful with models for text representation, such as the BERT model [1], which dominates text classification. The gains in performance promised by the large BERT models come at the price of significant data resources and computational capabilities required in the model pretraining phase. Practitioners take one of the models pretrained in the language of the data and finetune it for the specific classification problem. Multilingual BERT-like models have also shown remarkable potential for cross-lingual transfer ([7], [8], [6]). A majority of the research with BERT-like models is focused on English, while less-resourced languages tend to be neglected.

The IPTC Subject Codes originate in the journalistic setting. News articles are tagged with the IPTC topics to enable search and classification of the news content, as well as to facilitate content storage and digital asset management of news content at media houses. The scheme provides a consistent and language-agnostic coding of topics across different news providers and across time. Solving the automatic classification of news content to the standardized set of topics would enable faster news production and higher quality of the metadata for news content.

In this paper, we use the recently published STT News [10] dataset in Finnish to evaluate the performance of the monolingual FinBERT model [13] on the IPTC Subject Codes prediction task, together with the Doc2Vec [3] model as a baseline. We attempt to encode the hierarchical nature of the prediction task in the prediction network topology by mimicking the structure of the labels. Finally, the impact of using different tokenizers with the same model is evaluated.

The paper is structured as follows. In Section 2, we describe the dataset and the labels relevant for the prediction task. Section 3 describes the methods used to model the prediction task and all variations of the experiments. In Section 4, we provide the results of our experiments and, finally, in Section 5 we conclude the paper and suggest ideas for further work.

Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia
© 2020 Copyright held by the owner/author(s).

2 DATASET
The STT corpus [10] contains 2.8 million news articles from the Finnish News Agency (STT) published between 1992 and 2018. The articles come with rich metadata, including the news article topics encoded as IPTC Subject Codes (https://iptc.org/standards/subject-codes/). The IPTC Subject Codes are a deprecated version of the IPTC taxonomy of news topics focused on text. The standard describes around 1400 topics structured in three hierarchical levels. The first level consists of the most general topics. Topics on the second level are subtopics of the ones on the first level and, likewise, topics on the third level are subtopics of the ones on the second level. All topics on the third level are leaf topics - there are no further subdivisions - but there are also some topics on the second level that are leaf topics and do not extend to the third level. The set of IPTC topics at STT is an extended version of the IPTC Subject Codes, as some codes used at STT are not part of the IPTC standard.

Not all articles in the STT corpus contain the IPTC Subject Codes, as can be seen in Figure 1, which shows the ratio of articles containing this information through time. IPTC Subject Codes were introduced at STT in May 2011, and around 10-15% of articles do not contain this information.

If an article contains a specific subtopic, it also contains its upper-level topics. For example, if an article contains the third-level topic "poetry", it also contains the second-level topic "literature" that generalizes "poetry", as well as the first-level topic "arts, culture and entertainment". In this way, the article metadata contains the full path through the topic hierarchy.

Figure 1: The ratio of news articles in the STT corpus containing IPTC Subject Codes.

Most articles are assigned only a small number of leaf-level topics (and their higher-level topics), but they can contain up to 7, 19 and 30 topics from the first, second and third level, respectively.

We split the dataset into train, validation and test sets such that all articles published after 31-12-2017 belong to the test set, and we discard articles without IPTC Subject Codes from it. The rest of the articles were randomly split such that 5% of the articles containing IPTC Subject Codes represent the validation set and all other articles belong to the train set. After this step, there are around 30 thousand articles in the validation set, around 100 thousand in the test set and 2.7 million in the training set, of which some 560 thousand contain IPTC Subject Codes annotation.

The train set contains 17 different topics on the first level, 400 on the second level, and 972 on the third (the most specific) level. In our experiments, we evaluate models only on topics found in the training set.

3 METHODOLOGY
For our experiments, we used a network design consisting of two stacked neural networks (extractor and predictor). The extractor processes the text and produces the text representation in the form of a numeric vector. The predictor (the second part) is a multi-label prediction network that maps the extracted text representation vector to IPTC Subject Codes. For the extractor part, we evaluate the Doc2Vec and BERT models, and for the predictor our models use a one- or three-layer neural network.

3.1 Doc2Vec
Before contextual token embeddings became popular, this model was regularly used to represent a text paragraph with a fixed vector. It was introduced in [3] with two variants of the algorithm: PV-DM (Paragraph Vector-Distributed Memory) and PV-CBOW (Paragraph Vector-Continuous Bag-of-Words). In the PV-DM variant, a training context is defined as a sliding window over the text. The model is a shallow neural network trained to predict the central word of this context window given the embeddings of the rest of the context words together with the embedding of the whole document. During training, the network learns both the word embeddings and the embedding for the document. The simpler PV-CBOW variant does not employ a context window; the neural network is trained to predict a randomly sampled word from the document.

Our experiments use the PV-DM variant of the algorithm available in the Gensim library (https://radimrehurek.com/gensim/) with most of the hyperparameters set to their default values. We set the context window width to 5 and train the network for 10 epochs on the news content from the training data. The model produces a 256-dimensional output vector. Once the model is trained, we do not finetune it further during the training of the prediction task.

Tokenization of the data was done using the SentencePiece [2] tokenizer. It was trained to produce a vocabulary of 40,000 tokens using 1 million randomly selected sentences sampled from the articles in the training set. Additionally, we ran experiments using the same WordPiece [14] tokenizer that is used with the FinBERT model.

3.2 BERT
BERT is a deep neural-network architecture of bidirectional text encoders introduced in [1]. The base model consists of 12 Transformer [12] layers. It is trained using the masked language modeling (MLM) and next sentence prediction (NSP) objectives on large text corpora. The maximum length of the input sequence for the model is 512 tokens, and each token is represented with 768 dimensions. Model inference produces context-dependent representations of the input tokens. The whole input sequence can be represented with a single vector by using the context-dependent representation of the [CLS] token. In [1], this representation is used as an aggregate sequence representation for classification tasks. Another way to represent the whole sequence, as used in [9], is to take the average representation of all output tokens (AVG).

In this paper, we use FinBERT, a BERT model introduced in [13] that was pretrained on Finnish corpora. (We also tested FinEst BERT [11], but since better performance was achieved with FinBERT [13], we do not include FinEst BERT in the results.) We should note that this model contains the STT corpus as part of its training data. The tokenizer used with the model is the predefined WordPiece tokenizer that came with the FinBERT model.

The input to the model is restricted to 512 tokens, and longer news articles are trimmed such that only the first 512 tokens are used. In the dataset, less than 5% of the documents in the training data and less than 7% in the test data are longer than 512 tokens. We experiment with the CLS and the AVG representations, and in both cases the article representation is a 768-dimensional vector. The FinBERT model is finetuned during the training of the IPTC Subject Codes prediction task.

3.3 Prediction network
For the predictor part, we experiment with two different architectures. The first is a single neural-network layer that maps the input vector to the predictions; it can be seen in Figure 2. The IPTC Subject Codes on all levels are concatenated together, thus producing 1389 outputs in the final layer.

The second architecture utilizes the tree hierarchy of the IPTC Subject Codes. We assumed that a flat output (the previous approach) requires the network to predict each label independently, irrespective of the level of the target label. By introducing separate layers for each target level, we expect that the model will implicitly learn the hierarchy among the labels. We designed this network in three layers; the architecture is shown in Figure 3. The first layer of the network predicts labels from the third IPTC hierarchical level (the most fine-grained topics).
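At the shape level, the two predictor variants differ only in how the 1389 = 17 + 400 + 972 outputs are produced. A NumPy sketch of both heads follows; the paper does not publish its implementation, so the hidden wiring of the tree variant (chaining each level's logits into the next layer's input) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 768                      # extractor output size (FinBERT case; Doc2Vec uses 256)
L1, L2, L3 = 17, 400, 972    # topics per IPTC level, as stated in Section 2

def flat_predictor(x):
    """Flat variant: one layer mapping the text vector to all 1389 labels."""
    W = rng.normal(size=(L1 + L2 + L3, D))
    return W @ x             # logits for all levels concatenated

def tree_predictor(x):
    """Tree variant: stacked layers, finest level (3) first, top level (1) last.

    Feeding each level's logits into the next layer's input is only one
    plausible reading of the architecture in Figure 3.
    """
    W3 = rng.normal(size=(L3, D))
    z3 = W3 @ x
    W2 = rng.normal(size=(L2, D + L3))
    z2 = W2 @ np.concatenate([x, z3])
    W1 = rng.normal(size=(L1, D + L2))
    z1 = W1 @ np.concatenate([x, z2])
    return z1, z2, z3

x = rng.normal(size=D)       # a stand-in article representation
assert flat_predictor(x).shape == (1389,)
```

Either head would then be trained with a sum of per-level binary cross-entropy losses, as the training section describes.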
The second layer predicts topics from the second level, and the third layer predicts only the top-level IPTC Subject Codes.

Figure 2: Predictor network architecture, flat variant. The image does not show a normalization layer before the output layer.

Figure 3: Predictor network architecture, tree variant. The image does not show a normalization layer before each output layer.

3.4 Training
Each model was trained using a batch size of 128 articles and the AdamW [4] optimizer with a learning rate of 1e-3. We compute the metrics on the validation set every 100 iterations. Once the loss on the validation data starts increasing, we stop the training and evaluate the best performing checkpoint on the test data. The loss function used in all experiments is the sum of binary cross-entropy losses calculated at each topic level. News articles that do not have an annotation for a certain topic level do not contribute to the loss of that level.

4 EXPERIMENTS AND RESULTS
All experiments were repeated three times, and we report the median of those three runs in Table 1. The extraction network was evaluated with four configurations. The FinBERT model is evaluated using the WordPiece (WP) tokenizer and either the CLS token or the average (AVG) of all output tokens as the text representation. The Doc2Vec model uses either the WordPiece (WP) tokenizer or the SentencePiece (SP) tokenizer.

Table 1: Results for different experimental configurations.

    Extractor      Predictor  mAP(1)  mAP(2)  mAP(3)  R@10(1)  R@10(2)  R@10(3)
    FinBERT (CLS)  Flat       0.5432  0.2047  0.1031  0.9058   0.3687   0.2242
    FinBERT (CLS)  Tree       0.5434  0.1949  0.1043  0.9058   0.3602   0.2417
    FinBERT (AVG)  Flat       0.5401  0.2026  0.1006  0.9045   0.3692   0.2391
    FinBERT (AVG)  Tree       0.5410  0.2088  0.1089  0.9078   0.3724   0.2367
    Doc2Vec (WP)   Flat       0.8091  0.5204  0.2990  0.9721   0.7008   0.4750
    Doc2Vec (WP)   Tree       0.8127  0.5202  0.2972  0.9743   0.7099   0.4714
    Doc2Vec (SP)   Flat       0.8298  0.5550  0.3149  0.9803   0.7277   0.4951
    Doc2Vec (SP)   Tree       0.8315  0.5643  0.3282  0.9832   0.7358   0.4896

4.1 Evaluation metrics
We approach the article categorization problem through the information retrieval paradigm. Namely, we try to return the set of the most probable IPTC Subject Codes assigned to each article in the STT corpus. We use two performance metrics, the mean average precision (mAP) and recall at 10 (R@10). The mean average precision returns the expectation of the area under the precision-recall curve for a random query. The recall at 10 computes the ratio of correct topics found among the 10 tags with the highest predicted probability. To measure the generalization of our prediction models, we compute these metrics separately for each level of the IPTC Subject Codes.

4.2 Results and discussion
In all experiments, the Doc2Vec model performed significantly better than the FinBERT model, regardless of the specific extractor or predictor setup. This is surprising in the light of other successful applications of BERT models. Nevertheless, as less than 5% of the articles in the training set and less than 7% of the articles in the test set have more than 512 tokens (a limitation of BERT but not of Doc2Vec), we cannot attribute the poor performance of BERT to this limitation. Some other relevant findings are as follows. While for some tasks [9] the BERT average token representation performs better than the representation based on the CLS token, in our experiments the CLS and the AVG representations perform comparably.

The three-layer network mimicking the shape of the tree-like IPTC Subject Codes hierarchy did not yield any systematic improvement over the single, flat layer of the neural network. The difference in tokenizers for the Doc2Vec experiments shows a small but consistent improvement when using the SentencePiece tokenizer.

5 CONCLUSIONS AND FURTHER WORK
In this work, we have compared a monolingual FinBERT and a Doc2Vec model on the IPTC Subject Codes prediction task in the Finnish language. We evaluated several variations of experiments and achieved consistently better results with the Doc2Vec model. In contrast to Doc2Vec, the BERT model has a limitation in the form of a maximum number of input tokens. We believe the results cannot be explained by this, as the data used does not contain a significant number of documents exceeding this limit. We plan to explore this topic further in the hope of understanding and addressing this problem. Recent work on BERT finetuning strategies [5] identifies a problem of vanishing gradients due to excessive learning rates and implementation details of the optimizer.

Our attempt at encoding the hierarchical nature of the prediction task did not yield systematic improvement, and we believe it is worthwhile to explore other strategies and improve on this area, like encoding the hierarchy of the predictions in the loss function itself.

For the Doc2Vec experiments, consistently better results were achieved using the SentencePiece [2] tokenizer over the WordPiece [14] tokenizer used in the FinBERT model. Both of those tokenizers retain the whole information of the input, as there are no destructive operations on the text. We plan further experiments to confirm and quantify these findings and understand what enables such improvement of the downstream prediction task at the tokenizer level.

ACKNOWLEDGMENTS
The work was partially supported by the Slovenian Research
Agency (ARRS) core research programmes P6-0411 and P2-0103, [8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, as well as the research project J6-2581 (Computer-assisted mul- Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and tilingual news discourse analysis with contextual embeddings). Peter J. Liu. 2020. Exploring the limits of transfer learning This paper is supported by European Union’s Horizon 2020 re- with a unified text-to-text transformer. Journal of Machine search and innovation programme under grant agreement No Learning Research, 21, 140, 1–67. http://jmlr.org/papers/ 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less- v21/20- 074.html. Represented Languages in European News Media). [9] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In REFERENCES Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International [1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Joint Conference on Natural Language Processing (EMNLP- Toutanova. 2019. BERT: pre-training of deep bidirectional IJCNLP). Association for Computational Linguistics, Hong transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of Kong, China, (November 2019), 3982–3992. doi: 10.18653/ the Association for Computational Linguistics: Human Lan- v1/D19- 1410. https://aclanthology.org/D19- 1410. guage Technologies, Volume 1 (Long and Short Papers) [10] ST T. 2019. Finnish news agency archive 1992-2018, source . (June (http://urn.fi/urn:nbn:fi:lb-2019041501). (2019). 2019), 4171–4186. doi: 10.18653/v1/N19- 1423. [11] Matej Ulčar and Marko Robnik-Šikonja. 2020. Finest bert [2] Taku Kudo and John Richardson. 2018. Sentencepiece: a and crosloengual bert. In Text, Speech, and Dialogue. Petr simple and language independent subword tokenizer and Sojka, Ivan Kopeček, Karel Pala, and Aleš Horák, editors. 
detokenizer for neural text processing. In (January 2018), Springer International Publishing, Cham, 104–111. 66–71. doi: 10.18653/v1/D18- 2012. [12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- [3] Quoc Le and Tomas Mikolov. 2014. Distributed represen- reit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia tations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning Polosukhin. 2017. Attention is all you need. In Proceedings (Pro- of the 31st International Conference on Neural Information ceedings of Machine Learning Research) number 2. Eric P. Processing Systems (NIPS’17). Curran Associates Inc., Red Xing and Tony Jebara, editors. Volume 32. PMLR, Bejing, Hook, NY, USA, 6000–6010. isbn: 9781510860964. China, (June 2014), 1188–1196. https://proceedings.mlr. [13] Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, press/v32/le14.html. Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and Sampo [4] Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight Pyysalo. 2019. Multilingual is not enough: bert for finnish. decay regularization. In International Conference on Learn- ing Representations (2019). arXiv: 1912.07076 [cs.CL]. . https : / / openreview. net / forum ? id = [14] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Bkg6RiCqY7. Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, [5] Marius Mosbach, Maksym Andriushchenko, and Dietrich Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Klakow. 2021. On the stability of fine-tuning {bert}: mis- Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan conceptions, explanations, and strong baselines. In Inter- national Conference on Learning Representations Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith . https : Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff //openreview.net/forum?id=nzpLWnVAyah. 
Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: bridging the gap between human and machine translation. (2016). arXiv: 1609.08144 [cs.CL].
[6] Andraž Pelicon, Marko Pranjić, Dragana Miljković, Blaž Škrlj, and Senja Pollak. 2020. Zero-shot learning for cross-lingual news sentiment classification. Applied Sciences, 10, 17. issn: 2076-3417. doi: 10.3390/app10175993. https://www.mdpi.com/2076-3417/10/17/5993.

Classification of Cross-cultural News Events

Abdul Sittar (abdul.sittar@ijs.si), Dunja Mladenić (dunja.mladenic@ijs.si)
Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
We present a methodology to support the analysis of culture from text such as news events, and demonstrate its usefulness on categorising news events from different categories (society, business, health, recreation, science, shopping, sports, arts, computers, games and home) across different geographical locations (different places in 117 countries). We group countries based on the culture that they follow and then filter the news events based on their content category. The news events are automatically labelled with the help of Hofstede's cultural dimensions. We present combinations of events across different categories and compare the performance of different classification methods. We also present an experimental comparison of different numbers of features in order to find a suitable set to represent culture.

KEYWORDS
cultural barrier, news events, text classification

1 INTRODUCTION
Culture is defined as a collective programming of the mind which distinguishes the members of one group or category of people from another [9]. It has a huge impact on people's lives and, as a result, it influences events that involve cross-cultural stakeholders. News spreading is one of the most effective mechanisms for spreading information across borders. To spread widely, news must cross multiple barriers, such as linguistic, economic, geographical, political, time-zone, and cultural barriers. Due to the rapidly growing number of events with significant international impact, cross-cultural analytics gains increased importance for professionals and researchers in many disciplines, including digital humanities, media studies, and journalism. The most recent examples of such events include COVID-19 and Brexit [1]. There are a few determinants that have a significant influence on the process of information selection, analysis and propagation. These include cultural values and differences, economic conditions, and the association between countries. For instance, if two countries are culturally more similar, there is a higher chance of a heavier news flow between them [10], [3]. In this paper, we focus on the classification of news events across different cultures. We select some of the most read daily newspapers and collect information using Event Registry about the news they have published. Event Registry is a system which analyzes news articles, identifies groups of articles that describe the same event, and represents them as a single event [7]. The description of the metadata of an event is shown in Table 1. The main scientific contributions of this paper are the following:
(1) A novel perspective of aligning news events across different cultures through categorising countries and news events.
(2) A cross-cultural, automatically annotated dataset in several different domains (Business, Science, Sports, Health etc.).
(3) An experimental comparison of several classification models adopting different sets of features (character n-grams, GloVe embeddings and word n-grams).

Table 1: The description of the metadata of an event.

Attributes   Description
title        title of the event
summary      summary of the event
source       event reported by a news source
categories   list of DMOZ categories
location     location of the event

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia. © 2021 Copyright held by the owner/author(s).

2 RELATED WORK
In this section, we review the related literature on the influence of culture, and on its representation and classification in different fields.
Countries that share a common culture are expected to have heavier news flows between them when reporting on similar events [10]. There are many quantitative studies that identified demographic, psychological, socio-cultural, source-, system-, and content-related aspects [2]. Cross-cultural research and understanding cultural influences in different fields offer competitive advantages. The goal of researching the impact of culture might be to draw conclusions about the way cultural factors influence a specific corporate action. There are many types of culture, such as societal, organizational, and business culture [8].
The hidden nature of cultural behavior causes difficulties in measuring and defining it. To cope with these difficulties, researchers have developed measurements that capture culture on a general scale, in order to compare differences among cultures and management styles. These results can be used to find similarities within a region and differences to other regions. There are many models that have tried to explain cultural differences between societies. Hofstede's national culture dimensions (HNCD) have been widely used and cited in different disciplines [6, 5]. Hofstede's dimensions are the result of a factor analysis at the country level, by means of a comprehensive survey instrument, aimed at identifying systematic differences in national culture. Their purpose is to measure culture in countries, societies, sub-groups, and organizations; they are not meant to be regarded as psychological traits.
There is a plethora of research studies that were conducted to understand cultural influences, such as cross-cultural privacy and attitude prediction, and cultural influences on today's business. [4] explores how culture affects the technological, organizational, and environmental determinants of machine learning adoption by conducting a comparative case study between Germany and the US. Rather than looking at the influence of cultural differences within one domain, we intend to understand the association between news events belonging to different domains (society, business, health, recreation, science, shopping, sports, arts, computers, games and home) and different cultures (117 countries from all the continents). We conduct this research to find an appropriate representation and classification of culture across different domains.

3 DATA DESCRIPTION
3.1 Dataset Statistics
We chose the top 10 most read daily newspapers in the world in 2020 [1] and collected the events reported by these newspapers using Event Registry [7] over the time period 2016-2020. Approximately 8000 events belong to each newspaper, with the exception of "Zaman", which has only 900 events. Figure 1 shows the number of events reported by the selected newspapers on a yearly basis. This dataset can be found on the Zenodo repository (version 1.0.0) [2].

Figure 1: Each color in a bar represents the total number of events per year by a daily newspaper, and a complete bar shows the total number of events per year by all the newspapers.

The attributes of an event, with descriptions, are displayed in Table 1. A few attributes are self-explanatory, such as title, summary, date, and source. DMOZ categories are used to represent the topics of the content. The DMOZ project is a hierarchical collection of web page links organized by subject matter [3]. Event Registry uses the top 3 levels of the DMOZ taxonomy, which amount to about 50,000 categories [4].

1 https://www.trendrr.net/
2 https://zenodo.org/record/5225053
3 https://dmoz-odp.org/
4 https://eventregistry.org/documentation?tab=terminology

4 MATERIAL AND METHODS
4.1 Problem Definition
There are two main parts of the problem that we are addressing. The first part is to label the examples by assigning a culture C to a news event E using its location L. The second part is a multi-class classification task where we predict the culture C of a news event E using its summary description S and its content category G as provided by Event Registry. This task can be formulated as:

C = f(S, G)

where C denotes the culture of the news event, f is the learning function, S denotes the summary of a news event, and G denotes the category of a news event (see Table 1).

4.2 Methodology

Figure 2: Classification of cross-cultural news events. (Workflow: the news events dataset, the clusters of countries, and the category of events feed an annotation step; character n-grams, GloVe embeddings, and word n-grams are the feature representations passed to classification.)

4.2.1 Data labeling. Each news event has information about the categories to which it belongs and the location where it happened (see Table 1). Each event has many categories, and each category has a weight reflecting its relevance for the event. We keep only the most relevant categories and group the news events based on their categories. For each group of events, we estimate the cultural characteristics of each event through the country of the place where the event occurred. We cluster the countries based on their culture. We utilize Hofstede's national culture dimensions (HNCD) to represent the culture of a country. We take the average of the cultural dimensions and call it the average cultural score. Based on this score, we find the optimal number of clusters using the popular k-means clustering algorithm (see Figure 4). Finally, we label each news event with one of the six cultural clusters.

Figure 3: The pie chart depicts the percentage of the news events that occurred in six different clusters (each cluster consists of a list of countries with similar culture).

4.2.2 Data representation. Each news event in Event Registry has associated categories along with a weight (see Table 1); we take the top categories based on their weight. In the case of multiple categories with equal weight, we sort them alphabetically and keep the first one. We represent each news event by a short summary S and a set of content categories G.

Figure 4: In the word cloud, the color of each word shows the cluster to which it belongs (see Figure 3). Radial dendrograms illustrate the shared categories of news events between the pairs of six clusters.

4.2.3 Data Modeling. For the multi-class classification task, we use simple classification models (SVM, Decision Tree, KNN, Naive Bayes, Logistic Regression) as well as a neural network.
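As an illustrative sketch (not the authors' code), the character n-gram feature extraction and top-K feature selection behind these experiments could look as follows; the corpus, the n-gram range (1 to 3), and K are placeholders:

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max of a text."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

def top_k_features(corpus, k):
    """Select the k most frequent character n-grams over a corpus
    as the fixed feature vocabulary (e.g. k = 15000)."""
    counts = Counter()
    for doc in corpus:
        counts.update(char_ngrams(doc))
    return [gram for gram, _ in counts.most_common(k)]

def vectorize(doc, vocabulary):
    """Represent one document as n-gram counts over the vocabulary."""
    doc_counts = Counter(char_ngrams(doc))
    return [doc_counts[gram] for gram in vocabulary]
```

A classifier such as Logistic Regression would then be trained on these count vectors; in practice a library implementation, e.g. scikit-learn's `CountVectorizer(analyzer="char", ngram_range=(1, 3), max_features=15000)`, covers the same steps.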
For sim- select the top 15K characters for all the tested algorithms except ple classification models, we input character and word ngrams Naive Bayes which declines in performance with the growing varying the number of ngrams and compare the results. We also set of features. Based on these settings, we achieve the highest use pre-trained Glove embeddings. accuracy (0.85) using Logistic Regression. Using Glove embed- dings, we experiment with and without using the category of 5 EXPERIMENTAL EVALUATION event. The highest F1-score with and without the category is 0.80 5.1 Evaluation Metric and 0.79 respectively. For multi-class classification task, we use following most com- monly used evaluation measures: accuracy, precision, recall, and F1 score. 7 CONCLUSIONS AND FUTURE WORK 6 RESULTS AND ANALYSIS For researchers and professionals, it is very important to anal- 6.1 Annotation Results yse the cross-cultural differences in different disciplines. As the The results of annotation are six clusters where almost 50% news international impact is increasing and international events are events belong to the two clusters (shown with red and blue colors) becoming popular, the need to develop some automatic methods and remaining 50% belong to the other four clusters 3. Looking is significantly increasing and leaving a blank space. We con-in each group, we find that clusters do not lies in a specific ducted experiments on news events related to different fields geographic area or a continent. Rather all the countries in a to have a broader look on data and machine learning methods. cluster belong to the different continents. Similarly, these clusters Further research would be helpful in examining the impact of do not have all the countries that are economically rich or poor. specific socio-cultural factors on news events. 
In this research There are more categories in green and red colors in the word work, we estimate the culture of a specific place by its country, cloud (see Figure 4) which represent to the cluster with that colors. use basic features and simple classification models. To continue Radial dendrograms in Figure 4 present the shared categories this work further, we would like to improve feature set such as between the clusters. In the figure, root of the tree is data and by including part of speech tagging (POS) as well as other state then there are ten pair of clusters that share the same categories. of the art embeddings. The objective of this whole process was to keep news events according to the category to whom they belongs. Moreover, we can only observe the cultural differences when we have same type of news events from different places. ACKNOWLEDGMENTS 6.2 Classification Results The research described in this paper was supported by the Slove- Fro the experimental results we can see that the best performance nian research agency under the project J2-1736 Causalify and is achieved by Logistic Regression, kNN and Decision Tree. The by the European Union’s Horizon 2020 research and innovation performance of SVM varies depending on the number of selected programme under the Marie Skłodowska-Curie grant agreement features: the highest F1-score is achieved with the top 10K or 20K No 812997. 31 Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia Abdul and Dunja, et al. [3] Tsan-Kuo Chang and Jae-Won Lee. 1992. Factors affecting gatekeepers’ selection of foreign news: a national survey of newspaper editors. Journalism Quarterly, 69, 3, 554–561. [4] Verena Eitle and Peter Buxmann. 2020. Cultural differences in machine learning adoption: an international compari- son between germany and the united states. [5] Meihan He and Jongsu Lee. 2020. Social culture and in- novation diffusion: a theoretically founded agent-based model. 
Journal of Evolutionary Economics, 1–41. [6] Mahmood Khosrowjerdi, Anneli Sundqvist, and Katriina Byström. 2020. Cultural patterns of information source use: a global study of 47 countries. Journal of the Association for Information Science and Technology, 71, 6, 711–724. [7] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Gro- belnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International Confer- ence on World Wide Web, 107–110. [8] Björn Preuss. 2017. Text mining and machine learning to capture cultural data. Technical report. working paper, 2. doi: 10.13140/RG. 2.2. 30937.42080. [9] Giselle Rampersad and Turki Althiyabi. 2020. Fake news: acceptance by demographics and culture on social media. Journal of Information Technology & Politics, 17, 1, 1–11. [10] H Denis Wu. 2007. A brave new world for international news? exploring the determinants of the coverage of for- eign news on us websites. International Communication Gazette, 69, 6, 539–551. Figure 5: First two line charts illustrate the variations in F1 score by simple classification models after varying the number of features. The first line chart depicts the results of word ngrams whereas the second one shows the results for character ngrams. The last line graph presents com- parison between Glove embeddings (with and without cat- egory feature). REFERENCES [1] Sara Abdollahi, Simon Gottschalk, and Elena Demidova. 2020. Eventkg+ click: a dataset of language-specific event- centric user interaction traces. arXiv preprint arXiv:2010.12370. [2] Hosam Al-Samarraie, Atef Eldenfria, and Husameddin Dawoud. 2017. The impact of personality traits on users’ information-seeking behavior. Information Processing & Management, 53, 1, 237–247. 
Zotero to Elexifinder: Collection, curation, and migration of bibliographical data

David Lindemann (david.lindemann@ijs.si)
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia

Figure 1: Zotero to Elexifinder workflow model

ABSTRACT
In this paper, we present ongoing work concerning a workflow and software tool pipeline for collecting and curating bibliographical data for the domain of Lexicography and Dictionary Research, and for data export in a custom JSON format as required by the Elexifinder application, a discovery portal for lexicographic literature. We present the employed software tools, which are all freely available and open source. A Wikibase instance has been chosen as the central data repository. We also present requirements for bibliographical data to be suitable for import into Elexifinder; these include the disambiguation of entities such as natural persons and natural languages, and the processing of article full texts. Beyond the domain of Lexicography, the described workflow is applicable in general to single-domain, small-scale digital bibliographies.

KEYWORDS
bibliographical data, author disambiguation, e-science corpora

1 INTRODUCTION
In 2019, version 1 of Elexifinder [1], a discovery portal for lexicographic literature, was launched in the framework of the ELEXIS project [2]. At the same time, at the University of Hildesheim, a domain ontology and bibliographical data collection for Lexicography and Dictionary Research was planned [6, 5]. Both endeavours had already compiled significant datasets. At a dedicated workshop connected to the 2019 eLex conference in Sintra (Portugal), it was decided to combine the efforts, and the workflow explained in this paper was designed, in order to merge the existing datasets, decide on criteria for data curation, and make the results available to the lexicographic community. Two years later, at the 2021 Euralex conference, Elexifinder version 2 was introduced [3]. The main shortcomings of Elexifinder version 1 have been sorted out, namely the missing author disambiguation, and the coverage of the domain's literature has been significantly increased, also regarding publication languages other than English. Moreover, a vocabulary of lexicographic terms has been developed, which is now used for content-describing indexation of article full texts.
Lexicography and Dictionary Research is a relatively small discipline, having thematic intersections with Corpus Linguistics, Terminology, Natural Language Processing, and Philology. In metalexicographic literature, all aspects of the lexicographic process, dictionary structure and functions, dictionary use, and other relevant issues are discussed. Communication in the lexicographic community mainly takes place through a small number of conference series and journals, complemented by handbooks and other edited volumes. The need for a dedicated digital bibliography arises from the following observations:
• The vast majority of publications do not have Digital Object Identifiers (DOI), and thus are not indexed in cross-domain digital collections of publication metadata. This applies to nearly all older publications, but also to many newer contributions published in the last two decades.
• When searching for metalexicographical publications in cross-domain digital collections, search results are mixed up with publications from other domains, which may disturb straightforward information retrieval.
• Author disambiguation in the domain-independent digital collections that can be considered the big players in the field (such as Google Scholar) is not at all accurate, so that very often name variants are not resolved to a single person entity, and different persons with the same name are not disambiguated.
• If articles are indexed with content-describing terms in cross-domain digital collections, the vast majority of those terms will be out of the scope of the domain we are looking at.
• Publication metadata found at big (i.e. automatically compiled) repositories is often incomplete or noisy, so that using it, e.g. for citations, requires manual intervention in order to achieve publishable quality.

Therefore, it seems useful to provide the lexicographic community with a platform that makes publications and their metadata accessible in a way that overcomes the described shortcomings. Single-domain endeavours of this kind, which all involve manual curation, are DBLP [3] for Computer Science, IxTheo [4] for Theology, or EconBiz [5] for Economics. Inspired by features found in these, we propose a workflow that involves the use of free software accessible to anybody, which makes it reproducible and cost-reducing.

1 Accessible at https://finder.elex.is.
2 See https://elex.is.

Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia. © 2020 Copyright held by the owner/author(s).

2 LEXBIB ZOTERO GROUP
Zotero [6], developed and maintained by the Corporation for Digital Scholarship [7], a non-profit organisation, is the most widely used open source citation management software application. Zotero offers functionality for web-scraping publication metadata, importing metadata from different structured formats, and an online platform for collaborative curation of metadata.
Duplicate management has been done in batches (whole journal issues or conference iterations), or one by one using Zotero's built-in duplicate detection functionality. The main criterion for the inclusion of metadata records has been the availability of the corresponding full texts. This means a clear preference for Open Access publications; but other publications have also been included, wherever a suitable license agreement allowed access to the text. [13]
Zotero data can be accessed via the API [14], or exported locally using pre-set or custom export scripts. We use an adapted version of the Zotero JSON-CSL exporter, which produces a list of JSON objects containing all metadata fields and their values as literal strings, as well as the location of all local file attachment copies.
For statements that cannot be expressed using standard Zotero fields [15], we have used Zotero tags as a workaround, following a simple syntax of predicate and object: for example, for asserting that an article is a review article, the tag ":type Review", and so on. Tags in Zotero can easily be copied from one item to others by manual drag-and-drop operations, set via the API, and also included in display styles, so that in the Zotero item listings, for example, review article titles can be preceded by a coloured symbol. With this workaround we can assert semantic triples inside Zotero. That is, for instance, for representing the statement that a certain item is contained in another item (e.g. a book chapter item in an item of type book), we use a tag beginning with ":container", followed by an identifier for the containing item; for a conference paper presented at a certain event, we use a tag beginning with ":event", followed by an identifier for that event.
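As a minimal sketch of this tag convention (the identifiers below are invented for illustration), such shortcode tags could be split into predicate-object pairs, while ordinary keyword tags pass through unchanged:

```python
def parse_tag_triples(tags):
    """Split LexBib-style Zotero tags into (predicate, object) pairs.

    Shortcode tags start with ':' and hold a predicate and an object
    separated by whitespace, e.g. ':type Review' or ':container <id>'.
    All other tags are treated as plain keyword tags.
    """
    triples, keywords = [], []
    for tag in tags:
        if tag.startswith(":"):
            predicate, _, obj = tag[1:].partition(" ")
            triples.append((predicate, obj.strip()))
        else:
            keywords.append(tag)
    return triples, keywords

# Hypothetical example: one type assertion, one container assertion,
# and a plain keyword tag.
triples, keywords = parse_tag_triples(
    [":type Review", ":container Q1234", "lexicography"]
)
```

The same pairs can then be mapped onto proper object properties once the data leaves Zotero for a triple store.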
For both of these, corresponding Zotero fields do exist possibility to attach full text PDF (and TXT versions) to metadata ("contained in", "presented at"), but these are filled by the web records. The Zotero scraper functionality allows to download scraping and importer translators with literal string values as publication metadata and attached PDF files from all those sites needed for citations, and not with unambiguous identifiers. 8 the Zotero community has provided a "translator" for, includ- For Elexifinder, a special metadatum is included in all publica- ing the web platforms of major publishing houses, Open Journal tion metadata sets: The location of the first author. This allows Systems, etc. From the Zotero platform, users are able to obtain the generation of location maps and search filters according to metadata records as single items or as batches for import into locations in the Elexifinder portal. For these locations, we insert their own citation managers, or as export records in a range of 16 English Wikipedia page titles in the Zotero "extra" field. citation styles or in structured formats such as bibtex. Members of a Zotero group can view and download full text attachments. 3 LEXBIB WIKIBASE Moreover, Zotero items can be annotated with custom tags, and 3.1 Wikibase as LOD infrastructure solution additional information (such as excerpts or comments) can be attached to them. Around Zotero, an active community is devel- The decisive shift from a metadata set as in Zotero, which con- 9 oping plug-ins that add new functionalities to Zotero. sists of certain fields and their literal values, towards unambigu- In the first planning period of the LexBib project, funded by the ous Linked Data lies in the reconciliation of those literal values University of Hildesheim, conference publications of the Euralex against existing or new unambiguous identifiers. 
For example, and the eLex conference series, and publications from a range of and this already refers to the hardest nut to crack in this context, journals and edited volumes have been added to LexBib Zotero an author may have several name variants appearing across the 10 group. Items collected for Elexifinder version 1, available as publication metadata collection, and there may be other persons tabular data, have then been merged to the Zotero group. For this sharing the same name, or any of the name variants. But one 11 purpose, tabular csv data has been transformed to RIS format author or editor (i.e., a "creator") should only have one identifier and imported to Zotero. Additionally, metadata records from (such as ORCID). Since we do not know Wikidata and/or ORCID OBELEX-meta and EURALEX-Dykstra bibliographies have been identifiers of all creators in our database, we need to create our own (and map them later). Other Zotero fields that should be 3 Accessible at https://dblp.org/. 12 See references in [3]. 4 Accessible at https://ixtheo.de/. 13 Article full text are stored and exclusively used for project-related text mining 5 Accessible at https://www.econbiz.de/. tasks; they cannot be downloaded from Zotero. We instead provide download links 6 See https://zotero.org. which lead to the download offered by the corresponding publisher, subject to 7 See https://digitalscholar.org/. applicable restrictions. 8 14 See https://www.zotero.org/support/translators. See https://www.zotero.org/support/dev/web_api/v3/start. 9 15 For example, very recently the Cita plug-in has been developed, which allows to See https://www.zotero.org/support/kb/item_types_and_fields. add citation metadata to Zotero records, see https://meta.m.wikimedia.org/wiki/ 16 Wikipedia page titles are unambiguous (see e.g. https://en.wikipedia.org/wiki/ Wikicite/grant/WikiCite_addon_for_Zotero_with_citation_graph_support. Cambridge vs. 
…https://en.wikipedia.org/wiki/Cambridge,_Massachusetts), and map to only one Wikidata entity. This strategy has turned out to be effective, since manual annotators are able to find the adequate Wikipedia page without hassle.

^10 Last version accessible at https://www.zotero.org/groups/lexbib/library.
^11 See https://en.wikipedia.org/wiki/RIS_(file_format).

Zotero to Elexifinder. Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia

…reconciled against unambiguous identifiers are those describing the containing item: the conference where the contribution was presented, the journal, the publisher, the publication place, and the publication language. For some of these, persistent identifiers are available in many cases (e.g. journals) or in all cases (languages). In general, we create our own identifiers and map them to Wikidata; in some cases immediately (languages, places and, by ISSN, also journals), and in other cases we leave that mapping to the (near) future, as is the case for creators and publishers. Other Zotero fields contain identifiers (ISSN, ISBN, DOI), which after normalisation can be taken directly as external identifiers in a Linked Database.

After experimenting with different RDF database solutions that allow representing data in the described way, we have decided for Wikibase,^17 the software infrastructure underlying Wikidata.^18 Since 2019, "Wikibase as a Service"^19 is offered to the community. Wikibase entities are items (each of which has its own identifier preceded by the letter Q) and properties (preceded by the letter P), just as in Wikidata, but in a different namespace. Properties may point to other items, other properties, or external identifiers, or hold values of a certain datatype, such as "monolingual text", "point in time", "string", "url", etc.^20

Wikibase as central data repository has several advantages compared to other infrastructure solutions for Linked Open Data (LOD):
• Entity data is displayed on entity pages, where it can be viewed and edited. These pages always reflect the last update.
• A complete edit history is available, and changes can be undone.
• Every entity page is linked to a dedicated discussion page.
• User and user rights management allow a community-driven editing process.
• In addition to the query interface and SPARQL endpoint known from other RDF database solutions, Wikibase data can be uploaded and downloaded using an API, and as entity data dump in several formats.

The backbone of LexBib Wikibase is an ontology of classes and properties,^21 which can be aligned to Wikidata or other external ontologies. We have started to define these alignments. This ensures interoperability with other resources, such as Wikidata, so that data can be transferred from LexBib to Wikidata or vice versa, or accessed in both at the same time using federated SPARQL queries.

3.2 Zotero to Wikibase migration

As mentioned before, Zotero item data is exported from a local Zotero instance, using an adapted version of the Zotero JSON-CSL exporter.^22 The exported JSON objects are then processed in the following way:
• Zotero tags that contain semantic triple shortcodes (explained above) are mapped to the corresponding LexBib wikibase properties, in this case with datatype "item", that is, to object properties.
• Creator name and publisher name literals are mapped to the properties corresponding to the creator role (author or editor) or to the publisher. This is done in such a way that the name literals appear as qualifiers to a wikibase "novalue" statement, which is a placeholder for the creator or publisher item that will be defined in the disambiguation process explained below.
• Zotero fields that contain external identifiers (ISSN, ISBN and DOI) are mapped to the corresponding properties of datatype "external identifier". Wikibase properties of that datatype allow defining a URL pattern, in order to make the identifier a valid hyperlink that can be clicked on in Wikibase entity data pages.
• As mentioned, we use the Zotero "extra" field ("note" in bibtex) for annotating the item with a Wikipedia page that corresponds to the first author's location. The Wikidata API is queried for the corresponding Wikidata entity, an equivalent of which is created in LexBib Wikibase, in order to function as object of the property "first author location".
• The Zotero "language" field in LexBib may contain a two-letter ISO-639-1 or a three-letter ISO-639-3 code. This is mapped to a property pointing to the language item corresponding to that code.
• The Zotero item URI is taken as external identifier in LexBib wikibase, with the Zotero storage location of PDF and TXT attachments as qualifiers to that statement. In addition, we annotate this statement with a qualifier asserting the presence of an abstract and, if any, in what language.^23
• The content of the remaining fields is mapped to Wikibase properties of the corresponding datatype ("URL", "string", or "point in time").

The resulting dataset is then imported into LexBib Wikibase. It is worth mentioning that uploading data to a Wikibase triple by triple using the mediawiki API^24 of the Wikibase instance takes about 0.5 seconds per triple, which is due to the need to update Wikibase search indices and edit histories for every single uploaded triple.

3.3 Entity disambiguation using Open Refine

The around 5,000 creator names appearing in LexBib Zotero by spring 2021 have been mapped to around 4,000 unique person items. This has been done by testing different clustering algorithms available in the Open Refine application,^25 by Christiane Klaes from the University of Hildesheim, in the framework of her MA thesis [1]. These are the creator items present in LexBib Wikibase experimental version 2.^26
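The name matching that follows this initial clustering goes through a reconciliation service of the kind Open Refine consumes, which accepts a JSON batch of queries and returns scored candidate items. A minimal client-side sketch under that assumption; the endpoint URL, the score threshold and the returned item IDs below are hypothetical placeholders, not the actual LexBib service:

```python
import json

# Hypothetical endpoint; the actual LexBib reconciliation service URL is not shown here.
RECON_ENDPOINT = "https://example.org/lexbib/reconcile"

def build_queries(names, type_id="Q5", limit=5):
    """Build the 'queries' payload of a reconciliation batch:
    one named query per creator name literal."""
    return {f"q{i}": {"query": name, "type": type_id, "limit": limit}
            for i, name in enumerate(names)}

def best_match(result_block, min_score=70.0):
    """Accept the top candidate if the service flagged it as a sure match
    or scored it above a threshold; otherwise signal that the literal is
    unmatched (new person item, or manual choice)."""
    candidates = result_block.get("result", [])
    if candidates and (candidates[0].get("match")
                       or candidates[0].get("score", 0) >= min_score):
        return candidates[0]["id"]
    return None

# Payload as it would be sent, and a mocked service response (IDs invented):
payload = json.dumps(build_queries(["Christiane Klaes"]))
response = {"q0": {"result": [{"id": "Q123", "name": "Christiane Klaes",
                               "score": 98.5, "match": True}]}}
print(best_match(response["q0"]))
```

In the real pipeline a failed match would trigger the creation of a new person item, as described in the text.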
From that moment on, any new Zotero item that is exported to Wikibase, which will contain, as explained above, one or more creator statements of type "novalue", is reconciled against existing LexBib Wikibase creator items, using the given and last name literal qualifiers. For this purpose, a reconciliation service for LexBib Wikibase has been set up,^27 which is then accessed by Open Refine in order to match creator name literals to creator items. If a literal can not be matched to any existing item, a new person item is created. The reconciliation also works with fuzzy matches, and all name variants attached to existing items are considered. Matches can also be chosen manually. Any additional name variant appearing in Zotero data is linked to the LexBib Wikibase person item as "alias" label, while the most frequent name variant is chosen as "preferred" label. This makes new name variants available for subsequent reconciliation iterations.

LexBib persons have up to six name variants found in Zotero data. In some cases, we have chosen the preferred name variant manually, according to the author's own choice or to conventions in the community regarding the naming of commonly known authors (see an example at http://lexbib.elex.is/entity/Q1583).

3.4 Full text processing

LexBib full text PDFs are stored in the local Zotero storage folder, which is automatically synchronised with the Zotero cloud. When processing Zotero JSON output, PDF files are sent to an installation of the GROBID application,^29 which proposes a TEI representation of the PDF content. This allows isolating the full text body from the other text components, such as title, running titles, abstract, author list, and references section. The extracted full text body is manually validated and, in case of any mistake, corrected using a plain TXT version of the PDF, which is produced by Zotero by default.

GROBID turns out to structure PDF content as TEI very efficiently if the article resembles a typical structure as found in journals and proceedings. Book chapters and review articles, which normally do not feature an abstract, in turn are usually not parsed adequately. In those cases, we now use the plain TXT version directly for producing a cleaned version manually.

The article text is then lemmatised,^30 and lexicalisations of LexVoc lexicographic terms^31 are looked up in the text. The LexVoc vocabulary is a resource still under development;^32 for the term discovery process, terms and lexicalisations (labels) are obtained from LexBib Wikibase by a SPARQL query, the result of which reflects the state of LexVoc at that particular moment. The keyword processor returns counts of every term, so that relative frequencies can be calculated for every term, according to the occurrences of its labels and the number of tokens in the article text body; this information can be uploaded to LexBib Wikibase bibliographical items, so that term indexation becomes part of their entity data.

4 WIKIBASE TO ELEXIFINDER

The described workflow is necessary for being able to export bibliographical data in a custom JSON format, as needed for Elexifinder, an application based on some of the elements of the Event Registry system architecture [4]. In particular, authors and content-describing terms (Elexifinder "categories") have to be represented as objects containing an unambiguous URI and a textual label; the containing item, the LexBib Zotero item URI, and the link for accessing the full text download are represented as URL, the publication date in ISO 8601 format, the publication language in ISO 639-3 format, and the item title as a simple string.

The full text body itself is also exported to Elexifinder, where it is used for displaying the first bits of it in search result displays, and for wikification, from which Elexifinder "concepts" are obtained, as long as the system is able to associate named entities occurring in the text with the Wikipedia pages that describe them.

5 CONCLUSIONS AND OUTLOOK

The described workflow enables us to disambiguate entities found in bibliographical datasets. For the time being, we are applying this for feeding the Elexifinder app. Having chosen Wikibase as central data repository also allows for aligning LexBib data with Wikidata in a straightforward way. In some cases, we have imported statements from Wikidata in order to enrich LexBib entities with additional information, but that can be done the other way round as well. In other words: wherever we find (or create) a Wikidata entity to align with our own, we can export the statements asserted in LexBib Wikibase to the main Wikidata. We have done this using LexBib events (conferences) as a test case, and plan to align other entity types with Wikidata in the near future, namely articles, persons, and organisations.

ACKNOWLEDGMENTS
The research received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 731015.

REFERENCES
[1] Christiane Klaes. 2021. Linked Open Data-Strategien zum Identity Management in einer Fachontologie. Master's thesis. Universität Hildesheim, Hildesheim, (June 2021). http://lexbib.elex.is/entity/Q15468.
[2] Iztok Kosem and Simon Krek. 2019. ELEXIFINDER: A Tool for Searching Lexicographic Scientific Output. In Electronic Lexicography in the 21st Century: Proceedings of the eLex 2019 Conference. Lexical Computing CZ s.r.o., Brno, 506–518. http://lexbib.elex.is/entity/Q9484.
[3] Iztok Kosem and David Lindemann. 2021. New developments in Elexifinder, a discovery portal for lexicographic literature. In Lexicography for Inclusion: Proceedings of the 19th EURALEX International Congress, 7-11 September 2021, Alexandroupolis, Vol. 2. Democritus University of Thrace, Alexandroupolis, 759–766. http://lexbib.elex.is/entity/Q15467.
[4] Gregor Leban, Blaz Fortuna, Janez Brank, and Marko Grobelnik. 2014. Event registry: learning about world events from news. In Proceedings of the 23rd International World Wide Web Conference, WWW14, Seoul, Korea, April 7-11, 2014, 107–110. doi: 10.1145/2567948.2577024.
[5] David Lindemann, Christiane Klaes, and Philipp Zumstein. 2019. Metalexicography as Knowledge Graph. OASICS, 70. http://lexbib.elex.is/entity/Q13955.
[6] David Lindemann, Fritz Kliche, and Ulrich Heid. 2018. LexBib: A Corpus and Bibliography of Metalexicographical Publications. In Lexicography in Global Contexts: Proceedings of the 18th EURALEX International Congress, 17-21 July 2018, Ljubljana. Ljubljana University Press, Ljubljana, 699–712. http://lexbib.elex.is/entity/Q6059.

^17 See http://wikiba.se; our instance is accessible at http://lexbib.elex.is.
^18 Accessible at http://www.wikidata.org.
^19 See https://www.wbstack.com. The service has been co-enabled by Adam Shoreland (https://addshore.com/), Rhizome (https://rhizome.org/), and WMDE (https://www.wikimedia.de/).
^20 See https://www.wikidata.org/wiki/Help:Data_type.
^21 For more information, see the LexBib Wikibase main page at https://lexbib.elex.is.
^22 Available at https://github.com/elexis-eu/elexifinder/blob/master/Zotero/LexBib_JSON.js.
^23 The abstract language is assumed to be the same as the publication language, if not stated differently with the tag shortcode ":abstractLang".
^24 For LexBib Wikibase, see https://lexbib.elex.is/w/api.php.
^25 Available at https://openrefine.org/.
^26 Accessible at https://data.lexbib.org.
^27 This is done using https://github.com/wetneb/openrefine-wikibase.
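The term look-up and relative-frequency step of Section 3.4 can be sketched as follows. The actual pipeline runs the flashtext keyword processor over spaCy-lemmatised text; this stand-in uses only the standard library, and the term IDs and labels are invented for illustration, not real LexVoc data:

```python
from collections import Counter

def term_frequencies(tokens, term_labels):
    """Count occurrences of each term's labels (lexicalisations) in a
    lemmatised token stream and return per-term relative frequencies,
    i.e. label occurrences divided by the number of tokens."""
    label_to_term = {label: term
                     for term, labels in term_labels.items()
                     for label in labels}
    counts = Counter()
    for tok in tokens:
        term = label_to_term.get(tok)
        if term is not None:
            counts[term] += 1
    n = len(tokens) or 1
    return {term: counts[term] / n for term in term_labels}

# Invented LexVoc-style terms, each with a set of labels:
terms = {"Q_headword": {"headword", "lemma"}, "Q_collocation": {"collocation"}}
tokens = "the headword and its collocation appear near another headword".split()
freqs = term_frequencies(tokens, terms)
print(freqs)
```

The resulting per-term frequencies are what would be uploaded to the Wikibase bibliographical items as part of their entity data.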
29 See https://grobid.readthedocs.io. 30 For the time being, we are only processing English text. For lemmatisation, we use spaCy (see https://spacy.io/). 31 This is done using https://pypi.org/project/flashtext/. 32 Described at http://lexbib.elex.is/wiki/LexVoc. 36 Simple Discovery of COVID IS WAR Metaphors Using Word Embeddings Mojca Brglez Senja Pollak Špela Vintar University of Ljubljana Jožef Stefan Institute University of Ljubljana Ljubljana, Slovenia Ljubljana, Slovenia Ljubljana, Slovenia mojca.brglez@ff.uni- lj.si senja.pollak@ijs.si spela.vintar@ff.uni- lj.si ABSTRACT an innovative methodological approach. We propose a top-down method to search for expected conceptual metaphors through In the past year, the discourse on the COVID-19 pandemic has semi-automatic means employing word embeddings. While most produced a great number of metaphors stemming from the more previous corpus-based approaches to identify metaphors either basic conceptual metaphor ILLNESS IS WAR. In this paper, we use a small set of candidate words or require manual inspec- present a semi-automatic method to detect linguistic manifes- tions of large data samples, our approach reduces manual work tations of the latter in Slovene media. The method consists of on assembling linguistic data by combining existing annotated assembling a seed vocabulary of war-related words from an ex- resources and text mining methods. isting Slovene metaphor corpus, extending the vocabulary using word embeddings, and refining the extended vocabulary using 2 PROPOSED APPROACH intersection filtering. Our method offers a quick compilation of corpus data for further analysis, however, we also address is- Our method aims to discover linguistic expressions of the con- sues related to the method’s precision and the need for manual ceptual metaphor COVID IS WAR in the corpus by targeting filtering. a broader potentially metaphoric vocabulary. 
Previous related works have relied on either a limited vocabulary set (e.g. [7]) or a KEYWORDS list of words laboriously compiled from various sources such as dictionaries, thesauri and other studies on metaphor [19], or have metaphors, covid, word embeddings, media discourse used sophisticated but complex NLP methods and specialized 1 INTRODUCTION resources (e.g. [6]). In our experiment, we use a simple unsupervised approach using existing resources and language processing The COVID pandemic has been a ubiquitous topic in the dis- technologies. course of the past year, featuring in medical, political, public The main novelty of our approach is using pre-trained word and personal discourse. The emergence of a new virus of yet embeddings to extend the vocabulary, used also by e.g. [16] and unknown origin, behaviour and effects has presented itself like a [18] to extend terminology. As past research has shown [14], word complex and obscure topic. To make sense of it, we have once embeddings used for training language models retain linguistic more resorted to metaphorical language, much like we do when regularities, including syntactic and semantic relationships be- faced with other abstract, obscure concepts. According to Con- tween words. This means that similar words have similar vectors, ceptual Metaphor Theory (CMT, [11, 12]), metaphors “are among and the closer vector representations (word embeddings) are, our principal vehicles for understanding” and “play a central role the higher the chance they share a certain semantic space. We in the construction of social and political reality” ([12, p. 151]). make use of this feature by trying to capture a semantic space that In CMT, linguistic metaphors such as "food for thought" and would resemble the conceptual domain of WAR, which represents "half-baked idea" are considered manifestations of an established the source domain of the metaphor. 
conceptual mapping between a more concrete domain and a more abstract domain, here for example IDEAS ARE FOOD. The do- 2.1 Method main of DISEASES, on the other hand, is often mapped to the First, we start by collecting war-related lexical units from the domain of WAR, a more common frame of reference which has KOMET corpus [1], the only corpus of metaphors in Slovene taken hold as a fairly conventional way to talk about illnesses which was recently compiled and annotated similarly to the and their treatments, as well as several other domains ([8]). English corpus of metaphors, VUAMC [17]. KOMET contains ap-As was already observed in various studies ([19, 2, 5, 7]), the dis-proximately 200,000 words obtained from journalistic, fiction and course on the current COVID pandemic has also repeatedly used online texts and was hand-annotated for metaphoricity on the ba- the WAR domain in its metaphors. At the time of our experiment, sis of the MIPVU procedure ([17]). Additionally, the metaphoric however, no study has yet addressed the use of such metaphors expressions are tagged for one of 69 semantic frames, i.e. the in Slovene, where they were also adopted for communicating source concepts that semantically motivate them. One of these se- various implications, preventive measures, recommendations and mantic frames is #met.battle, which subsumes 105 metaphoric laws to abide by. To investigate the use and pervasiveness of this instances with 67 different lemmas, such as predati, ostrostrelec, metaphorical domain in Slovene media, we have conducted a orožje, napasti [surrender, sniper, weapon, attack]. 
These also quick analysis of a corpus of COVID-related news articles using form multi-word idioms such as železna pest [iron fist] and boriti Permission to make digital or hard copies of part or all of this work for personal se z mlini na veter [to tilt at windmills] which we exclude from or classroom use is granted without fee provided that copies are not made or our candidates list because the word embeddings we use only rep-distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this resent tokens, not whole phrases. Moreover, the lemmas within work must be honored. For all other uses, contact the owner /author(s). do not themselves necessarily represent the desired domain. We Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia also filter out some words erroneously annotated with the frame © 2021 Copyright held by the owner/author(s). such as številen [numerous]. This gives a starting vocabulary of 37 Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia Brglez et al. 51 unique seed words. Then, to extend the vocabulary further, (2) true metaphorical expressions referring to disease as target we employ Slovene word token embeddings ([13] pre-trained domain with fastText ([4]) on various large corpora of Slovene (GigaFida, For example, in the following sentence, the word brigade Janes, KAS, slWaC etc.). For each seed word in the list of words [brigades] only refers to a name of a street, which we mark extracted from the KOMET corpus, we use the Gensim library as literal usage. ([20]) to find the word’s N nearest neighbours in the fastText embeddings’ space (using the most_similar function). • /. . . / odvzem brisov pri pacientih s sumom na Covid-19: ob To increase the robustness of the extended vocabulary, we try Cesti proletarskih brigad 21 /. . . / to automatically filter out lexis not related to war. To this end, /. . . 
/ taking swabs from patients with suspected Covid-19: at we use the word embeddings intersection method ([18]). The 21, Proletarian Brigades Road /. . . / method retains only the candidates that intersect between the In the following example, the word napad [attack] is used to refer sets, meaning they occur in the neighbourhood of at least k input to another domain – INTERNET, COMP UTING, which we mark seed words. For our main experiment, presented in this paper, we as metaphor for another target domain. select the parameters N =50 and k=3. We thus obtain a maximum • Covid-19 je okrepil trend rasti kibernetskih napadov [Covid- of 2550 (50 x 51) potential candidates. In the output, there are 19 reinforced the growing trend of cyber attacks] 2078 unique words, and, after lemmatization, 1539 unique lemmas. After the intersection filtering, the vocabulary extended by word The following three example sentences contain expression that embeddings consists of 184 word lemmas: 44 of them are already we mark as metaphor for the target domain of DISEASE. included in our initial seed set and 140 are new lemmas. We join • Čeprav v boju z virusom to nikakor ni hitro. the new, extended set with the initial seed set, which yields a [Although this is by no means fast in the fight against the total of 191 lemmas to search for. virus.] 3 CORPUS • Kako bo jeseni, ko bodo »udarili« še drugi virusi? The experiment is carried out on a corpus of Slovene COVID- [What will happen in autumn, when other viruses also 19-related news articles, automatically crawled from the web by “strike”?] searching for the keyword “covid-19” in article titles (a subset of the Slovene corpus used in the Slav-NER 2021 shared task ([15]). • Prvi organski sistem v organizmu, ki ga virus napade, The corpus consists of 233 texts spanning from February 2nd povzroči pljučnico, . . . to December 11th, 2020 . 
To prepare it for analysis, we remove [The first system in the organism that the virus attacks the header of each text (comprised of the article number, locale, causes pneumonia . . . ] date and URL), then parse the text into sentences and tokens Results of this analysis are presented in Table 1, whereby using the NLTK library ([3]). We also lemmatize the corpus using we report only lemmas that were metaphorically used for the the LemmaGen lemmatization module ([9]). The pre-processed DISEASE target domain at least once. corpus contains 7,273 sentences and 151,947 tokens. As can be derived from Table 1, our proposed method correctly 3.1 Corpus search identified 25 different lemmas with a total of 123 occurrences that are used metaphorically to frame the topic of the pandemic. In the next step, we extract all sentences from the corpus con- Out of our 233 articles, 68 or 29,18% contained at least one mili- taining any of the war-related terms from our expanded vocab- taristic metaphorical expression. The ostensibly most frequent ulary of 191 lemmas. The results yield 335 instances of poten- expression used was boj [fight] with 46 metaphorical occurrences, tially metaphorical expressions. Out of the 191 lemmas on the followed by boriti [to fight] with 13 metaphorical occurrences metaphorical candidate list, the COVID corpus contains 49, ap- and soočati [to confront] with 7 metaphorical occurrences. They pearing in 268 sentences. Due to the unsupervised approach these account for 37.4%, 10.6% and 5.7% of all metaphorical expressions are still only candidate words from the semantic domain of war. found by our method, respectively, and together, they represent A manual analysis shows that in addition to war metaphors, our more than 50% of them. This points to the interpretation that the extracted sentences include the following four cases: news corpus contains mostly highly conventional and recurrent (1) Some of the seed words found in the corpus are used metaphors. 
A lot of the war-related vocabulary (potential can- literally; didates in our extended war-related lexis) is not used, meaning (2) Some of the seed words found in the corpus are a result the corpus does not, at this moment, exhibit very original, novel of lemmatization errors metaphorical expressions. Using a larger and a more recently (3) Some of the seed words found in the corpus are used compiled corpus would perhaps reveal a more innovative use of metaphorically, but refer to other target domains, such COVID IS WAR metaphors. The vocabulary extension method as POLITICS or NATURE (e. g. boriti se proti podnebnim using word embeddings has proven fruitful as it revealed some spremembam [’fight against climate change’]) metaphorical expressions that were not in the initial 51-word (4) Some of the seed words in our initial 191-candidate list list extracted from the KOMET corpus. The 9 newly discovered are not actually related to the topic of WAR but are more lemmas are: soočiti, izbojevati, zmagati, obraniti, uiti, soočanje, closely related to another topic (e.g. gol [‘goal’]) spopadati, zoperstaviti, podleči [to confront, to fight, to win, to On this account we perform a manual analysis of the extracted defend, to escape, confrontation, to combat, to oppose, to suc- sentences and categorize them as follows: cumb]. (1) falsely extracted instances due to a lemmatization error The analysis also revealed some additional lemmas that relate or literal use, or true metaphorical expressions but with the epidemic to the war frame. In the sentences containing the other source or target domain, and lemmas we searched for, there were other words from the WAR 38 Simple Discovery of COVID IS WAR Metaphors Using Word Embeddings Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia Table 1: Analysis of metaphoric lemmas from the ex- (75, 100, 150 and 200). Our initial experiments were carried out tended vocabulary on a N of 50 and intersection k of 3. 
However, by changing the parameters, the results of initial new lemmas could differ. In Lemma Corpus Literal DISEASE Figure 1, we analyse how the seed list changes with different ocuses, as target parameters: N of 50 and 75 neighbours, each combined with the curences lemma- domain intersection count k of 2, 3 and 4. Note that these refer only to tization the list of potentially metaphoric lemmas, and not to the analy- errors sis of their use, which can only be analysed in context. We see or other that the initially selected parameters (50 neighbours and 3 recur- source/target rences) are an acceptable middle-ground between precision and domain size while still maintaining an unsupervised approach, however, had we wanted more examples, we could increase the parameter Boj [fight] 57 11 46 N or decrease the parameter k. Boriti [to fight] 16 3 13 For the recall, we are not able to carry out a systematic eval- Soočati [to confront] 17 10 7 uation. Nevertheless, based on metaphor clusters analysis men- Spopad [to combat] 6 6 tioned above, we identified the set of additional words that belong Spopadanje [combat- 6 6 to the military vocabulary: fronta, strategija, preboj, akcijski, vo- ting] jen, sovražnik [front, strategy, breakthrough, action [ADJ], war Zoperstaviti [to 5 5 [ADJ], enemy]. The words vojen [war[ADJ]] and sovražnik [en- oppose] emy] would have been included if we lowered the intersection Bitka [battle] 5 1 4 parameter to k = 2 at N = 50 neighbours or extended the vo- Napad [attack] 41 37 4 cabulary by N = 75 neighbours while keeping the intersection Podleči [succumb] 5 1 4 parameter k = 3. 
Other metaphorical expressions occurring in Spopadati [to combat] 5 1 4 the corpus (fronta, preboj, strategija, akcijski) [front, strategy, Bojen [combat [ADJ]] 17 15 2 breakthrough, action [ADJ]] are not found anywhere in the first Borba [battle] 3 1 2 200 neighbours of any of the words, indicating perhaps that the Braniti [to defend] 4 2 2 number of neighbours might be further increased. However, we Napasti [to attack] 6 4 2 observe that increasing the number of neighbours leads to fuzzier Obramben [defense 9 7 2 results. The added vocabulary using 75, 100, 150, and 200 near- [ADJ]] est neighbours of our initial seed words includes increasingly Soočanje [confronting] 2 2 more words unrelated to the topic of war and some very common Soočiti [to confront] 6 4 2 words, which would need additional filtering. We assume that Žrtev [victim] 49 47 2 the reason for this is that words commonly used metaphorically Borec [fighter] 3 2 1 (conventional or dead metaphors) are “displaced” in the vector Izbojevati [to fight] 1 1 space of embeddings, moving away from the words in their orig- Obraniti [to defend] 1 1 inal semantic domains and closer to words in other semantic Štab [base, headquar- 3 2 1 domains – target domains. For example, we observed a lot of ters] sports expressions in our extended vocabulary (e.g. “ball”, “goal”, Udariti [to hit] 2 2 “goalpost”). This shows how entrenched metaphors are in our Uiti [to escape] 2 1 1 language: in the vector space of word embeddings, the seman- Zmagati [to win] 5 4 1 tic domains are already “muddled”. In the present example, this TOTAL 270 147 123 could be a due to the frequent linguistic manifestations of the conceptual metaphor COMPETITION IS WAR. domain forming so called metaphor clusters ([10]). 
Thus, we managed to capture some metaphorical expressions that appeared in close vicinity (in the same sentence) of the found metaphor- ical expressions: fronta, strategija, preboj, akcijski načrt, vojna mentaliteta, sovražnik [front, strategy, breakthrough, action plan, war mentality, enemy]. For instance, our method found the sen- tence below which, in addition to the word bitka [battle] in our candidate list, contains a metaphorical use of the word fronta [front]. • Bitka proti virusu na več frontah [Battle against the virus on multiple fronts] 4 ANALYSING DIFFERENT PARAMETER SETTINGS Figure 1: Analysis of vocabulary extension parameters N Some of the expressions mentioned above would have been cap- and k tured had we modified the parameters of vocabulary extension. Namely, we experimented with using more nearest neighbours 39 Information Society 2021, 4–8 October 2021, Ljubljana, Slovenia Brglez et al. 5 CONCLUSION [9] Matjaž Juršič, Igor Mozetič, Tomaž Erjavec, and Nada Lavrač. 2010. Lemmagen: multilingual lemmatisation with We present an innovative approach using word embeddings as induced ripple-down rules. Journal of Universal Computer a tool for extending the vocabulary of potentially metaphoric Science, 16, 9, 1190–1214. http://www.jucs.org/jucs_16_9/ expressions and identify them in corpora. Our approach shows lemma_gen_multilingual_lemmatisation|. promise in that it correctly identifies numerous such expressions [10] Veronika Koller. 2003. Metaphor clusters, metaphor chains: and confirms that intersections of semantic spaces of metaphor- analyzing the multifunctionality of metaphor in text. In ical seed words can be used to refine the quest for words per- volume 5, 115–134. taining to the military domain. Nevertheless, some metaphoric [11] George Lakoff and Mark Johnson. 1980. Metaphors we live expressions are missed by our method and the experiment still by. University of Chicago press. needs manual analysis. 
Further research and experiments would [12] George Lakoff and Mark Johnson. 2003. Metaphors we live be needed for a larger expansion of vocabulary and a finer filter- by. University of Chicago press. ing approach as well as comparing different word embeddings, [13] Nikola Ljubešić and Tomaž Erjavec. 2018. Word embed- possibly those trained on more literal language. dings CLARIN.SI-embed.sl 1.0. Slovenian language resource ACKNOWLEDGMENTS repository CLARIN.SI. (2018). http://hdl.handle.net/11356/ 1204. This work is supported by the Slovenian Research Agency by [14] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. the research core funding P6-0215 and P2-0103, as well as by Linguistic regularities in continuous space word represen- the research project CANDAS ( J6-2581). The work has also been tations. In Proceedings of the 2013 Conference of the North supported by the European Union’s Horizon 2020 research and in- American Chapter of the Association for Computational Lin- novation programme under grant agreement No. 825153, project guistics: Human Language Technologies. Association for EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Computational Linguistics, Atlanta, Georgia, (June 2013), Languages in European News Media). The results of this paper 746–751. https://aclanthology.org/N13- 1090. reflect only the authors’ view and the Commission is not re- [15] Jakub Piskorski, Bogdan Babych, Zara Kancheva, Olga sponsible for any use that may be made of the information it Kanishcheva, Maria Lebedeva, Michał Marcińczuk, Preslav contains. Nakov, Petya Osenova, Lidia Pivovarova, Senja Pollak, REFERENCES Pavel Přibáň, Ivaylo Radev, Marko Robnik-Sikonja, Vasyl Starko, Josef Steinberger, and Roman Yangarber. 2021. [1] Špela Antloga. 2020. Metaphor corpus KOMET 1.0. Slove- Slav-NER: the 3rd cross-lingual challenge on recognition, nian language resource repository CLARIN.SI. (2020). 
http: normalization, classification, and linking of named entities //hdl.handle.net/11356/1293. across Slavic languages. In Proceedings of the 8th Workshop [2] Benjamin R. Bates. 2020. The (in)appropriateness of the on Balto-Slavic Natural Language Processing. Association war metaphor in response to SARS-CoV-2: a rapid analysis for Computational Linguistics, Kiyv, Ukraine, 122–133. of Donald J. Trump’s rhetoric. Frontiers in Communication, https://aclanthology.org/2021.bsnlp- 1.15. 5, 50, (June 2020). doi: 10.3389/fcomm.2020.000505. [16] Senja Pollak, Andraž Repar, Matej Martinc, and Vid Pod- [3] Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural pečan. 2019. Karst exploration: extracting terms and defi- Language Processing with Python. (1st edition). O’Reilly nitions from karst domain corpus. In Proceedings of eLex Media, Inc. 2019, 934–956. [4] Piotr Bojanowski, Edouard Grave, Armand Joulin, and [17] G. Steen. 2010. A Method for Linguistic Metaphor Identifica- Tomas Mikolov. 2017. Enriching word vectors with sub- tion: From MIP to MIPVU. Converging evidence in language word information. Transactions of the Association for Com- and communication research. John Benjamins Publishing putational Linguistics, 5, (June 2017), 135–146. doi: doi. Company. doi: 10.1075/celcr.14. org10.1162/tacl_a_00051. [18] Špela Vintar, Larisa Grcic, Matej Martinc, Senja Pollak, [5] Eunice Castro Seixas. 2021. War metaphors in political and Uroš Stepišnik. 2020. Mining semantic relations from communication on Covid-19. Frontiers in Sociology, 5, 112. comparable corpora through intersections of word embed- doi: 10.3389/fsoc.2020.583680. dings. In (May 2020). https://aclanthology.org/2020.bucc- [6] Jane Demmen, Elena Semino, Zsófia Demjén, Veronika 1.5.pdf . Koller, Andrew Hardie, Paul Rayson, and Sheila Payne. [19] Philipp Wicke and Marianna M. Bolognesi. 2020. Framing 2015. 
Topic modelling and sentiment analysis of COVID-19 related news on Croatian Internet portal

Maja Buhin Pandur, Jasminka Dobša
Faculty of Organization and Informatics, University of Zagreb, Varaždin, Croatia
mbuhin@foi.hr, jasminka.dobsa@foi.hr

Slobodan Beliga, Ana Meštrović
University of Rijeka, Department of Informatics, and University of Rijeka, Center for Artificial Intelligence and Cybersecurity, Rijeka, Croatia
sbeliga@uniri.hr, amestorovic@uniri.hr

ABSTRACT
The research aims to identify topics and sentiments related to the COVID-19 pandemic in Croatian online news media. For the analysis, we used news related to the COVID-19 pandemic from the Croatian portal Tportal.hr published from 1 January 2020 to 19 February 2021. Topic modelling was conducted using the LDA method, while dominant emotions and sentiments related to the extracted topics were identified with the National Research Council Canada (NRC) word-emotion lexicon, created originally for English and translated into Croatian, among other languages. We believe that the results of this research will enable a better understanding of the crisis communication in the Croatian media related to the COVID-19 pandemic.

KEYWORDS
News media, sentiment, emotions, pandemic, lexicon approach, Latent Dirichlet Allocation

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia
© 2020 Copyright held by the owner/author(s).

1 INTRODUCTION
There are three major approaches to sentiment and emotion analysis in text: the lexicon-based approach, the machine-learning-based approach [12], and the most recent deep-learning approach. In this research, we used a hybrid approach, applying the method of Latent Dirichlet Allocation (LDA) for topic modelling [6] and a lexicon approach using the NRC word-emotion lexicon [13] for the detection of sentiments (positive or negative) and basic emotions, according to Plutchik's model of emotions [15], in the extracted topics.

The main goal of this paper is to analyse sentiments and emotions in crisis communication in the news related to the COVID-19 pandemic published on a Croatian online portal. Our goal was complicated in this research because the articles belong to an objective rather than a subjective type of reporting. Another problem is the lack of lexical resources for sentiment and emotions in the Croatian language. Glavaš and co-workers [10] developed a Croatian sentiment lexicon called CroSentiLex, which consists of positive and negative lists of words ranked with PageRank scores. Nevertheless, there is no available lexicon for the analysis of emotions in the Croatian language. Our analysis uses the NRC word-emotion lexicon, initially developed for English and translated into 104 languages, including Croatian. Such an approach has disadvantages due to cultural differences, but developing emotion lexicons for low-resource languages such as Croatian is very demanding.

Sentiment analysis of COVID-19 related texts has been conducted mainly for texts written in English, such as the research by Shofiya and Abidi [17], where the SentiStrength tool was used to detect the polarity of tweets and a support vector machine (SVM) algorithm was employed for sentiment classification. In [14], tweets about COVID-19 in Brazil written in Brazilian Portuguese are, due to the lack of language resources, analysed by translating the original text from Portuguese to English and using the resources available for English. Regarding the Croatian social media space, Twitter communication has been analysed through sentiment analysis [2] and COVID-19 information spreading [3]. Crisis communication of Croatian online portals has already been explored by topic modelling of COVID-19 related articles [7]; however, that research did not include further sentiment and emotional analysis of topics. In [4], information monitoring and named entity recognition were conducted on news portal texts related to pandemics.
2 METHODS

2.1 Latent Dirichlet Allocation
LDA is a generative, probabilistic, hierarchical Bayesian model that induces topics from a document collection [5, 6]. The intuition behind topic modelling using LDA is that documents exhibit multiple topics. A topic is formally defined as a distribution over a fixed vocabulary. Induction of topics is done in three steps:
• Each document in the collection is distributed over topics that are sampled using a Dirichlet distribution.
• Each word in the document is connected with one single topic based on a Dirichlet distribution.
• Each topic is defined as a multinomial distribution over the words that are assigned to the sampled topics.

2.2 Number of topics estimation
Before performing LDA topic modelling, the number of topics has to be estimated. In this research we used four metrics from the R package ldatuning: Arun2010 [1], CaoJuan2009 [8], Deveaud2014 [9], and Griffiths2004 [11]. The measures Arun2010 and CaoJuan2009 have to be minimised, while the measures Deveaud2014 and Griffiths2004 have to be maximised. However, as Arun2010 and CaoJuan2009 generally decrease with the number of topics, and Deveaud2014 and Griffiths2004 increase with the number of topics, we chose the number of topics as the value at which the observed measures start to stagnate.

2.3 Detection of sentiments and emotions
For the association of sentiments and emotions to the extracted topics, the NRC word-emotion lexicon [13] was used, which consists of 14,182 words with scores of 0 or 1 according to their association with positive or negative sentiment or with one of the eight emotions of Plutchik's model (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) [15]. The lexicon was created manually by crowdsourcing on Mechanical Turk. For every sentiment and emotion, we created a vector with a distribution of zeros and ones over the words of a controlled dictionary created from the collection. The association of topics to sentiments and emotions is calculated as the cosine similarity between the topic vectors and the corresponding vector of a sentiment or emotion.

3 EXPERIMENT

3.1 Data set and preprocessing
The data set used for the research consists of articles from the Internet portal Tportal.hr related to the COVID-19 pandemic crisis, collected from 1 January 2020 to 19 February 2021. Each article included in the dataset is defined as a COVID-19 article only if it contains at least one keyword related to the coronavirus theme. For article filtering we use a COVID-19 thesaurus, which contains about thirty of the most important words describing the SARS-CoV-2 virus epidemic together with their corresponding morphological variations. From the total of 31,177 articles, according to the defined filtering, the dataset used in the experiment consists of 12,080 COVID-19 related articles. Articles on the portal are categorised into one of nine main categories: Biznis (Business), Sport, Kultura (Culture), Tehno (Techno), Showtime, Lifestyle, Autozona (Autozone), Funbox, and Vijesti (News) (see Table 1).

Documents of the collection are created using text from the article's subcategory, introduction, main text, and tags. The collection is preprocessed by removing English and Croatian stop words and numbers and by performing lemmatisation. A term-document matrix is created using the tf-idf weighting scheme. The collection is indexed by terms contained in at least four documents of the collection, and the final list of index terms contained 31,121 terms. Topic modelling by LDA is conducted using the stm package in R [16].

Table 1: Number of articles from the dataset categorised into one of nine main categories

Category    Number of COVID-19 articles
Business    2,767
Sport       2,008
Culture       894
Techno        101
Showtime    1,352
Lifestyle   1,442
Autozone      124
Funbox         58
News        3,334

3.2 Results
As a first step, the number of topics had to be estimated. Since articles on the portal are categorised into nine main categories, we examined numbers of topics from 5 to 15. We chose nine topics, since the metrics started to stagnate for a higher number of topics (see Figure 1).

[Figure 1: Metrics for estimation of the best fitting number of topics for 5 to 15 topics]
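The association step of Section 2.3 can be sketched in a few lines; the vocabulary, the lexicon entries, and the topic probabilities below are toy stand-ins for the controlled dictionary, the NRC lexicon, and the LDA output.

```python
# Minimal sketch of Section 2.3: a topic is associated with an emotion via
# the cosine similarity between the topic's word-probability vector and a
# 0/1 indicator vector derived from an emotion lexicon. All values are
# invented toy stand-ins for the paper's dictionary and the NRC lexicon.
import numpy as np

vocab = ["disease", "die", "vaccine", "hope", "match", "win"]

# 0/1 membership of each vocabulary word in the "fear" category (toy values).
fear = np.array([1, 1, 0, 0, 0, 0], dtype=float)

# Word-probability vectors of two toy topics over the same vocabulary.
topic_daily = np.array([0.4, 0.3, 0.2, 0.0, 0.05, 0.05])
topic_sport = np.array([0.05, 0.0, 0.0, 0.1, 0.45, 0.4])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(topic_daily, fear))  # high: the "daily reports" topic leans toward fear
print(cosine(topic_sport, fear))  # low
```

On the real data, one such indicator vector is built per sentiment and per emotion, and the similarities are compared across the nine extracted topics.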
Table 2: Top 10 words with the largest probabilities over topics, and top 10 words with a negative sentiment with the largest probabilities over topics, both sorted in descending order of their probabilities. Topics are sorted by their representation in documents in descending order.

Topic 7 – Daily reports
words by theme: osoba (person), koronavirus (coronavirus), covid, slučaj (case), mjera (measure), broj (number), županija (county), nov (new), sat (hour), bolnica (hospital)
words by negative sentiment: bolest (disease), virus (virus), zaraziti (to infect), zaraza (infection), epidemija (epidemic), umrijeti (to die), velik (big), infekcija (infection), zarazan (contagious), simptom (symptom)

Topic 1 – Sport
words by theme: koronavirus (coronavirus), liga (league), klub (club), nogometni (football), igrač (player), godina (year), utakmica (match), sezona (season), hrvatski (Croatian), nogomet (football)
words by negative sentiment: igrač (player), velik (big), problem (problem), epidemija (epidemic), odgoditi (to delay), prekinuti (to interrupt), čekati (to wait), borba (fight), napraviti (to make), posljedica (consequence)

Topic 8 – Culture
words by theme: godina (year), film (film), nov (new), festival (festival), program (program), hrvatski (Croatian), Zagreb, kultura (culture), kazalište (theater), knjiga (book)
words by negative sentiment: velik (big), mali (small), predstavljati (to present), nastup (appearance), otkazati (to cancel), odgoditi (to delay), smrt (death), rat (war), strana (side), kritika (critique)

Topic 2 – Vaccination and epidemic measures
words by theme: cijepljenje (vaccination), cjepivo (vaccine), zemlja (country), europski (European), koronavirus (coronavirus), doza (dose), predsjednik (president), vlada (government), mjera (measure), čovjek (man)
words by negative sentiment: vlada (government), velik (big), epidemija (epidemic), red (order), borba (fight), sud (court), granica (border), problem (problem), potreban (required), upozoriti (to warn)

Topic 9 – Business 2
words by theme: nov (new), proizvod (product), automobil (car), velik (big), godina (year), hrvatska (Croatia), proizvodnja (production), tvrtka (company), trgovina (market), kupac (buyer)
words by negative sentiment: velik (big), nafta (oil), epidemija (epidemic), lanac (chain), smanjiti (to decrease), kriza (crisis), mali (small), zaraza (infection), problem (problem), utjecaj (influence)

Topic 3 – Earthquake and government measures
words by theme: mjera (measure), hrvatska (Croatia), vlada (government), rad (labor), pomoć (help), potpora (support), odluka (decision), potres (earthquake), zaštita (protection), Zagreb
words by negative sentiment: potres (earthquake), velik (major), pogoditi (to hit), potreban (required), posao (job), šteta (damage), prijava (report), republika (republic), poziv (call), posljedica (consequence)

Topic 4 – Lifestyle
words by theme: modni (fashion), godina (year), pandemija (pandemic), nov (new), koronavirus (coronavirus), poznat (famous), moda (fashion), obitelj (family), brend (brand), model (model)
words by negative sentiment: velik (big), nositi (to wear), izolacija (isolation), veza (relationship), majka (mother), dug (debt), djevojka (girl), znak (sign), mali (small), pun (full)

Topic 5 – Generally
words by theme: čovjek (man), vrijeme (time), znati (to know), virus (virus), velik (big), život (life), dan (day), dijete (child), koronavirus (coronavirus), dobro (good)
words by negative sentiment: velik (big), virus (virus), problem (problem), posao (job), napraviti (to make), bolest (disease), mali (small), potreban (required), teško (hard), nositi (to wear)

Topic 6 – Business 1
words by theme: posto (percentage), godina (year), pad (drop), velik (big), pandemija (pandemic), tržište (market), rast (growth), kuna, gospodarstvo (economy), banka (bank)
words by negative sentiment: pad (drop), velik (big), kriza (crisis), vlada (government), prihod (income), smanjiti (to decrease), mali (small), trošak (expenditure), posljedica (consequence), epidemija (epidemic)

Topics were labelled based on the words with the largest probabilities in the topic vectors (keywords), shown in Table 2. Some of the topics are directly connected to the main categories on the portal: the first topic is labelled Sport, the fourth topic Lifestyle, and the eighth topic Culture, while the sixth and the ninth topics are connected to the business world and are labelled Business 1 and Business 2. Business 1 is associated with the capital market, while Business 2 is associated with production. Topic 2 is associated with vaccination and epidemic measures, while Topic 3 is associated with the earthquake and government measures. Topic 5 seems rather general, containing stories of a pandemic world, while Topic 7 contains daily reports on the state of the pandemic.

We found that all topics are mainly associated with negative sentiment. Table 2 lists the words associated with negative sentiment with the largest probabilities across topics, while the words associated with positive sentiment coincided with the words of the topics' themes. This list gives some insight into which stories "bear" negative sentiment in the topics.

Figure 2 shows the association of topics to sentiments and emotions. The ratio of positive to negative sentiment is best for the categories Sport and Culture. These categories and Lifestyle are the only categories associated with joy as one of the dominant emotions. Surprise and anticipation are dominant emotions across all topics. The categories Vaccination and epidemic measures, Earthquake and government measures, Generally, and Business 1 are associated with the emotion of sadness, while the categories Vaccination and epidemic measures and Daily reports are associated with fear.

[Figure 2: Association of topics to sentiments and emotions]

4 CONCLUSIONS AND FURTHER WORK
The main goal of this paper was to analyse sentiments and emotions in crisis communication in the news related to the COVID-19 pandemic. For that purpose, we created our own collection of documents from articles on an Internet news portal connected to the pandemic crisis and analysed it utilising the LDA method for the extraction of prevalent topics in the collection and the NRC word-emotion lexicon for the detection of sentiments and emotions associated with the extracted topics.

The application of LDA resulted in relatively intuitive topics. Some of them can be associated with the main categories of the observed portal, and the others are related to the actual situation in a pandemic world in Croatia: vaccination, the earthquake (there were two great earthquakes in Croatia in 2020), general stories, daily reports. It is shown that all extracted topics are associated dominantly with negative sentiment, while the prevalent emotions are anticipation, surprise, sadness, and fear.

By this research, we have gained insight into how the COVID-19 pandemic crisis was communicated to the public. To gain insight into how the public experienced the crisis, we could apply the same methodology to the comments on articles or on social networks. This could be a direction for further work. Also, it would be interesting to investigate how topics and sentiments/emotions change and evolve over time.

ACKNOWLEDGEMENTS
This work has been supported in part by the Croatian Science Foundation under the project IP-CORONA-04-2061, "Multilayer Framework for the Information Spreading Characterization in Social Media during the COVID-19 Crisis" (InfoCoV), and by the University of Rijeka project number uniri-drustv-sp-20-58.

REFERENCES
[1] R. Arun, V. Suresh, C. E. Madhavan and M. Narasima Murty. 2010. On finding the natural number of topics with Latent Dirichlet Allocation: some observations. In Proceedings of Advances in Knowledge Discovery and Data Mining, 14th Pacific-Asia Conference (PAKDD 2010), Hyderabad, India. doi: 10.1007/978-3-642-1357-3_43.
[2] K. Babić, M. Petrović, S. Beliga, S. Martinčić-Ipšić, A. Jarynowski and A. Meštrović. 2022. COVID-19-related communication on Twitter: analysis of the Croatian and Polish attitudes. In X. S. Yang, S. Sherratt, N. Dey and A. Joshi (eds.), Proceedings of Sixth International Congress on Information and Communication Technology, Lecture Notes in Networks and Systems, vol. 216. Springer, Singapore. Available at https://link.springer.com/chapter/10.1007%2F978-981-16-1781-2_35.
[3] K. Babić, M. Petrović, S. Beliga, S. Martinšić-Ipšić, M. Pranjić and A. Meštrović. 2021. Prediction of COVID-19 related information spreading on Twitter. In Proceedings of the IEEE International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2021), accepted for publication.
[4] S. Beliga, S. Martinčić-Ipšić, M. Matešić and A. Meštrović. 2021. Natural language processing and statistics: the first six months of the COVID-19 infodemic in Croatia. In The COVID-19 Pandemic as a Challenge for Media and Communication Studies, K. Kopecka-Piech and B. Łódzki, Eds., Routledge, Taylor & Francis Group, accepted for publication.
[5] D. M. Blei. 2012. Probabilistic topic models. Communications of the ACM, 55(4), 77-84. doi: 10.1145/2133806.2133826.
[6] D. M. Blei, A. Y. Ng and M. I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3, 993-1022.
[7] P. K. Bogović, S. Beliga, A. Meštrović and S. Martinčić-Ipšić. 2021. Topic modelling of Croatian news during the COVID-19 pandemic. In Proceedings of the IEEE International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2021), accepted for publication.
[8] J. Chao, L. Tian, Z. Jintao, T. Yongdong and S. Tang. 2009. A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9), 1775-1781. doi: 10.1016/j.neucom.2008.06.0011.
[9] R. Deveaud, E. Sanjuan and P. Bellot. 2014. Accurate and effective latent concept modeling for ad hoc information retrieval. Document Numérique, 17(1). doi: 10.3166/dn.17.1.61-84.
[10] G. Glavaš, J. Šnajder and B. Dalbelo Bašić. 2012. Semi-supervised acquisition of Croatian sentiment lexicon. In Proceedings of the 15th International Conference on Text, Speech and Dialogue, TSD 2012, Brno, 166-173.
[11] T. L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, Suppl 1(1), 5228-35. doi: 10.1073/pnas.0307752101.
[12] H. Lane, C. Howard and H. Hapke. 2019. Natural Language Processing in Action. Manning Publications, New York, NY.
[13] S. M. Mohammad and P. D. Turney. 2013. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3), 436-465.
[14] T. Melo and C. M. S. Figueiredo. 2021. Comparing news articles and tweets about COVID-19 in Brazil: sentiment analysis and topic modeling approach. JMIR Public Health and Surveillance, 7(2). doi: 10.2196/24585.
[15] R. Plutchik. 1962. The Emotions. Random House, New York, NY.
[16] M. Roberts, B. M. Stewart and D. Tingley. 2019. stm: an R package for structural topic models. Journal of Statistical Software, 91(2), 1-40. doi: 10.18637/jss.v091.i02.
[17] C. Shofiya and S. Abidi. 2021.
Sentiment analysis on COVID-19-related social distancing in Canada using Twitter data. International Journal of Environmental Research and Public Health, 18(11), 1-10.

Tackling Class Imbalance in Radiomics: the COVID-19 Use Case

Jože M. Rožanec, Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, joze.rozanec@ijs.si
Tim Poštuvan, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland, tim.postuvan@epfl.ch
Blaž Fortuna, Qlector d.o.o., Ljubljana, Slovenia, blaz.fortuna@qlector.com
Dunja Mladenić, Jožef Stefan Institute, Ljubljana, Slovenia, dunja.mladenic@ijs.si
*Both first authors contributed equally to this research.

ABSTRACT
Since the start of the COVID-19 pandemic, much research has been published highlighting how artificial intelligence models can be used to diagnose a COVID-19 infection based on medical images. Given the scarcity of published images and the heterogeneous sources, formats, and labels, generative models can be a promising solution for data augmentation. We propose performing data augmentation in the embedding space, saving computation power and storage. Moreover, we compare different class imbalance mitigation strategies and machine learning models. We find that CTGAN data augmentation shows promising results. The best overall performance was obtained with a GBM model trained with focal loss.

CCS CONCEPTS
• Information systems → Data mining; • Computing methodologies → Computer vision problems; • Applied computing

KEYWORDS
COVID-19, CT Scans, Imbalanced Dataset, Data Augmentation, Computer-Aided Diagnosis, Radiomics, Artificial Intelligence, Machine Learning

ACM Reference Format:
Jože M. Rožanec, Tim Poštuvan, Blaž Fortuna, and Dunja Mladenić. 2021. Tackling Class Imbalance in Radiomics: the COVID-19 Use Case. In Ljubljana '21: Slovenian KDD Conference on Data Mining and Data Warehouses, October, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 4 pages.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SiKDD '21, October, 2021, Ljubljana, Slovenia
© 2021 Copyright held by the owner/author(s).

1 INTRODUCTION
In December 2019, an outbreak of the coronavirus SARS-CoV-2 infection (a.k.a. COVID-19) began in Wuhan, China. The disease rapidly spread across the world, and on January 30th 2020 the World Health Organization (WHO) declared a global health emergency. The most common COVID-19 symptoms are dry cough, sore throat, fever, loss of taste or smell, diarrhea, myalgia, and dyspnea [5]. In addition, older people or people with previous medical problems (e.g., diabetes, obesity, or hypertension) are more likely to develop a severe form of the disease [12, 42], which can derive into multiple organ failure, acute respiratory distress syndrome, fulminant pneumonia, heart failure, arrhythmias, or renal failure, among others [37, 40].

Expert radiologists have observed that the impact of the COVID-19 infection on the respiratory system can be discriminated from other viral pneumonia in computed tomography (CT) scans [7, 39]. The most frequent radiological signs include irregular ground-glass opacities and consolidations, observed mostly in the peripheral and basal sites [31]. While such opacities were observed up to a maximum of seven days before the symptoms' onset [25], they progress rapidly and remain a long time after the onset [35, 38]. While such opacities can also be observed on chest radiography, it has low sensitivity, which can lead to misleading diagnoses in early COVID-19 stages, and thus a CT scan is preferred [38].

Scientific studies have shown that Artificial Intelligence (AI) is a promising technology transforming healthcare and medical practice, helping with some clinicians' tasks (e.g., decision support, or providing disease diagnoses) [45]. In particular, the field of radiomics studies how to mine medical imaging data to create models that support or execute such tasks. Given that distinct patterns can be observed on chest radiographies and CT scans, clinicians and researchers have sought to use AI for COVID-19 diagnostics [31].

There are multiple challenges associated with radiomics and, in particular, with the COVID-19 diagnosis use case. Despite the limitations that can exist regarding privacy concerns [26, 44], many datasets have been made publicly available. Of those datasets, many are limited to a few cases [35]; were collected from different sources and image protocols, and thus cannot be merged (e.g., the gray levels across images can have different meanings [7]); or were labeled at different granularity levels (e.g., patient level or slice level) [2]. Therefore, models developed from these datasets cannot always be ported to a specific environment. Finally, limitations can exist regarding data collection, further limiting the data available to develop working models to diagnose the disease.

The main contributions of this research are (i) a comparative study of four data augmentation strategies used to deal with class imbalance, (ii) across eight frequently cited machine learning algorithms, based on a real-world dataset of chest CT scans annotated with their COVID-19 diagnosis. We developed the machine learning models with images provided by the Medical Physics Research Group at the University of Ljubljana and made them available as part of the RIS competition (http://tiziano.fmf.uni-lj.si/).
We report the models' discrimination power in terms of the area under the receiver operating characteristic curve (AUC ROC). The AUC ROC is a widely adopted classification metric that quantifies the sensitivity and specificity of the model while being invariant to a priori class probabilities.

This paper is organized as follows. Section 2 outlines related scientific works, Section 3 provides an overview of the use case, and Section 4 details the methodology. Finally, Section 5 presents and discusses the results obtained, while Section 6 concludes and describes future work.

2 RELATED WORK
The field of radiomics is concerned with extracting high-dimensional data from medical images, which can be mined to provide diagnoses and prognoses, assuming the image features reflect an underlying pathophysiology [16, 27, 28]. While research in the field is experiencing exponential growth, multiple authors have warned about common issues affecting the quality and reproducibility of radiomics research and have proposed several criteria that should be met to mitigate them (e.g., RQS, CLAIM, or TRIPOD) [10, 27, 32]. It has also been observed that the translation into clinical use has been slow [13].

Since the start of the COVID-19 pandemic, much research has been published highlighting how AI models could be used to issue COVID-19 diagnoses based on medical images. While much research was invested into transfer learning leveraging pre-trained deep learning models, or into the use of deep learning models as feature extractors [24], some authors also experimented with handcrafted features [7]. The most common machine learning approaches involved the use of deep learning (end-to-end models, or pre-trained models for feature extraction) [14, 23, 34, 36, 43], Support Vector Machines (SVM) [4, 7, 14, 22, 23, 34, 36, 38, 43], k-Nearest Neighbors (kNN) [14, 22, 23, 38, 43], Random Forests (RF) [22, 23, 36], CART [22, 23, 36], Naïve Bayes [22, 23], and Gradient Boosted Machines (GBM) [6, 22].

Two commonly faced challenges regarding COVID-19 diagnoses based on medical images are image scarcity and class imbalance. Given the heterogeneity of the datasets, it is not always possible to merge them [2, 7, 35]. Thus, some researchers successfully experimented with generative adversarial networks (GANs) to generate new images that comply with the patterns existing in the dataset [1, 34]. GANs provide means to learn deep representations from labeled data and to generate new data samples through a competition involving two models: the generator, which learns to generate new images only from its interaction with the discriminator; and the discriminator, which has access to the real and synthetic data instances and tries to tell the difference between them [3, 11]. While this method was first applied to images [17], new approaches were developed to adapt it to tabular data [41].

The fact that the classification categories are not approximately equally represented in a dataset can affect how machine learning algorithms learn, as well as their performance on unseen data, where the distribution can differ from the one observed in the training data [8]. For these reasons, care must be taken to select metrics that are not sensitive to such imbalance. Among the common strategies to deal with class imbalance we find data oversampling methods, which aim to increase the number of data instances of the minority class to balance the dataset. Oversampling methods can add data instances derived from existing ones by replicating them (e.g., using a naïve random sampler that draws new samples by randomly sampling with replacement from the available train samples), or by creating synthetic data instances (e.g., through SMOTE [9], ADASYN [19], or GANs). In addition to data oversampling, the Focal Loss [29] can be used with specific algorithms. The Focal Loss reshapes the cross-entropy loss to down-weight well-classified examples while focusing on the misclassified ones, achieving better discrimination. Finally, while the techniques mentioned above are useful for classification, we can also reframe the problem as an anomaly detection problem, attempting to detect which data instances correspond to the minority class (the anomaly).

Through the research we reviewed, we found one paper describing the use of SMOTE [14] and two papers using GANs [1, 34] for data augmentation at the image level. We found no paper performing a more extensive assessment of the influence of class imbalance, nor one comparing class imbalance strategies with respect to the COVID-19 detection models' outcomes. We propose utilizing data augmentation techniques that generate new embeddings instead of full images. Such an approach provides similar information in the embedding space as would be obtained from synthetic images, while enabling widely used techniques for tabular data oversampling. Furthermore, with GANs, new data instances are cheaper to compute and store than they would be if new images were created.

3 USE CASE
The research reported in this paper is done with images provided by the Medical Physics Research Group at the University of Ljubljana and made available as part of the RIS competition. The dataset was built from computed tomography (CT) scans obtained from three datasets reported in [18, 25, 33], corresponding to 289 healthy persons and 66 COVID-19 patients. Healthy persons are determined by a CT score between zero and five, while COVID-19 patients are considered those with a CT score equal to or higher than ten [15]. Each CT scan was segmented into twenty slices, resulting in 7,100 images with an axial view of the lungs, annotated into two classes: COVID-19 and non-COVID-19. The visual inspection of CT scans aims to determine whether the person was infected with the COVID-19 disease. Automating this task reduces manual work and speeds up the diagnosis.

4 METHODOLOGY
We propose using artificial intelligence for an automated COVID-19 diagnosis based on images obtained from CT scan segmentation, posing it as a binary classification problem.
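The Focal Loss mentioned in the related work can be sketched in a few lines; this is an illustrative NumPy version of the binary case, with gamma and alpha values chosen for the example rather than taken from the paper.

```python
# Sketch of the binary focal loss described in the related work: standard
# cross-entropy scaled by (1 - p_t)^gamma, so that well-classified examples
# contribute little and training focuses on hard, misclassified ones.
# gamma and alpha values below are illustrative, not taken from the paper.
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)            # prob. of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

y = np.array([1, 1, 0])
p = np.array([0.95, 0.30, 0.10])   # easy positive, hard positive, easy negative
losses = focal_loss(y, p)
# The hard example (p = 0.30 for a positive) dominates the total loss.
print(losses)
```

With gamma = 0 and alpha = 0.5, the expression reduces to a scaled cross-entropy, which makes the down-weighting effect of the modulating factor easy to verify.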
The discrimination equally represented in a dataset can affect how the machine learn- capability of the models is measured with the AUC ROC metric ing algorithms learn and their performance on unseen data, where with a cut threshold of 0.5. the distribution can be different from the one observed in training We use the ResNet-18 model[20] for feature extraction, retrieving the vector produced by the Average Pooling layer. Since the vector consists of 512 features, we perform feature selection computing 1http://tiziano.fmf.uni-lj.si/ the features’ mutual information and selecting the top K to avoid 46 Tackling Class Imbalance in Radiomics: the COVID-19 Use Case SiKDD ’21, October, 2021, Ljubljana, Slovenia √ overfitting. To obtain K, we follow the equation 𝐾 = 𝑁 suggested this approach leads to the best forecast outcomes with a GBM by[21], where N is the number of data instances in the train set. model trained with a Focal Loss on a dataset enriched with new To evaluate the models’ performance across different data aug- CTGAN generated instances. Moreover, we compare this approach mentation strategies, we apply a stratified ten-fold cross-validation. to other imbalanced data strategies, finding that Näive random Data augmentation is performed by introducing additional minority oversampling, SMOTE, and ADASYN degrade the resulting models’ class data samples on the train folds. We consider five imbalance mit- performance compared to the original dataset. Future work will igation strategies: NONE (without data augmentation), RANDOM focus on further understanding the cases where the CTGAN data (näive random sampler), SMOTE, ADASYN, and CTGAN (GAN that augmentation leads to poor results and provide an integral explain- enables the conditional generation of data instances based on a ability model for machine learning classifiers that consume image class label)[41]. No augmentation is performed on the test fold to embeddings. ensure measurements are comparable. 
The performance of the data augmentation strategies is measured across eight machine learning ACKNOWLEDGMENTS algorithms: SVM, kNN, RF, CART, Gaussian Näive Bayes, Multi- This work was supported by the Slovenian Research Agency. The layer Perceptron (MLP), GBM, and Isolation Forest (IF)[30]. Finally, authors acknowledge the Medical Physics Research Group at the we compare the performance of the data augmentation scenarios University of Ljubljana2 for providing the image segmentation data computing the average AUC ROC across the test folds and assess as part of the RIS competition3. if the difference is statistically significant by using the Wilcoxon signed-rank test, using a p-value of 0.05. REFERENCES [1] Erdi Acar, Engin Şahin, and İhsan Yılmaz. 2021. Improving effectiveness of 5 RESULTS AND ANALYSIS different deep learning-based models for detecting COVID-19 from computed tomography (CT) images. Neural Computing and Applications (2021), 1–21. When comparing the results across different imbalance mitigation [2] Parnian Afshar, Shahin Heidarian, Nastaran Enshaei, Farnoosh Naderkhani, strategies (see Table 1), we observed that data augmentation leads Moezedin Javad Rafiee, Anastasia Oikonomou, Faranak Babaki Fard, Kaveh to inferior results in most cases. While this outcome was expected Samimi, Konstantinos N Plataniotis, and Arash Mohammadi. 2021. COVID- CT-MD, COVID-19 computed tomography scan dataset applicable in machine for IF (the minority class is no longer an outlier after data augmen- learning and deep learning. Scientific Data 8, 1 (2021), 1–8. tation), we found that only the CART, MLP, and GBM algorithms [3] Alankrita Aggarwal, Mamta Mittal, and Gopi Battineni. 2021. Generative adversarial network: An overview of theory and applications. achieved better performance with CTGAN data augmentation com- International Journal of Information Management Data Insights (2021), 100004. pared to the original dataset. 
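The oversampling strategies compared in this work can be illustrated with a minimal, dependency-free sketch of the SMOTE heuristic: a synthetic minority sample is created by interpolating between an existing minority sample and one of its k nearest neighbours. This is illustrative only and not the authors' pipeline, which applies library implementations of SMOTE, ADASYN, and CTGAN to 512-dimensional ResNet-18 embeddings.

```python
import math
import random

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Generate synthetic minority samples by interpolating between a
    sample and one of its k nearest neighbours (the core SMOTE heuristic)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x by Euclidean distance (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        n = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, n)))
    return synthetic

# Toy 2-D "embeddings" of the minority class (real embeddings are 512-D)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new = smote_like_oversample(minority, n_new=6, k=2)
print(len(new))  # 6 synthetic points, each between two existing ones
```

Because every synthetic point lies on a segment between two real minority samples, the method stays inside the minority region but cannot invent genuinely new patterns, which is consistent with the learning-free character of SMOTE discussed in the results.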
5 RESULTS AND ANALYSIS

When comparing the results across the different imbalance mitigation strategies (see Table 1), we observed that data augmentation leads to inferior results in most cases. While this outcome was expected for IF (the minority class is no longer an outlier after data augmentation), we found that only the CART, MLP, and GBM algorithms achieved better performance with CTGAN data augmentation compared to the original dataset. Moreover, six algorithms achieved their best results when augmented with CTGAN compared to the other data imbalance strategies (except NONE). We confirmed that the AUC ROC differences between the imbalanced dataset strategies were statistically significant, with a few exceptions: SMOTE vs. ADASYN for CART, MLP, and GBM; NONE vs. RANDOM for CART; NONE vs. SMOTE for Naïve Bayes; RANDOM vs. SMOTE for SVM and RF; and RANDOM and SMOTE vs. CTGAN for SVM and IF. From the results obtained, we consider that the success of CTGAN can be attributed to the fact that the generative model can learn over time to generate high-quality data instances based on the discriminator's feedback loop, while naïve random sampling reuses existing instances (providing little new information to the dataset), and the SMOTE and ADASYN algorithms generate new samples based on heuristics without learning capabilities.

We observed that GBM models trained with a Focal Loss achieved the best results on all datasets. Even when no data augmentation is performed and the RF achieves the best result, the difference is not statistically significant compared to the GBM model. The overall best performance was obtained with a GBM model trained over a dataset with CTGAN data augmentation. The reasons behind the performance drop for the kNN, Naïve Bayes, RF, and SVM models remain unclear, and further investigation is required to clarify them. Nevertheless, we consider the CTGAN data augmentation on the embedding space a promising approach.

6 CONCLUSION

This research presents a novel approach towards data augmentation in radiomics by generating new data instances in the embedding space rather than generating new images. We demonstrate that this approach leads to the best forecast outcomes with a GBM model trained with a Focal Loss on a dataset enriched with new CTGAN-generated instances. Moreover, we compare this approach to other imbalanced data strategies, finding that naïve random oversampling, SMOTE, and ADASYN degrade the resulting models' performance compared to the original dataset. Future work will focus on further understanding the cases where CTGAN data augmentation leads to poor results, and on providing an integral explainability model for machine learning classifiers that consume image embeddings.
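The significance testing used to compare the strategies can be sketched in plain Python. The snippet below computes only the Wilcoxon signed-rank statistic W = min(T+, T-) over paired scores (zero differences dropped, tied absolute differences given average ranks); in practice one would use scipy.stats.wilcoxon, which also supplies the p-value. The paired scores below are made-up toy values, not the paper's results.

```python
def wilcoxon_statistic(a, b):
    """Wilcoxon signed-rank statistic W = min(T+, T-) for paired samples.

    Zero differences are discarded; tied absolute differences receive
    the average of the 1-based ranks they span.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # advance j over the run of equal absolute differences
        while j < len(order) and abs(diffs[order[j]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        for idx in order[i:j]:
            ranks[idx] = avg_rank
        i = j
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(t_plus, t_minus)

# Toy paired per-fold scores (illustrative values only)
a = [1, 2, 3, 4, 5]
b = [2, 1, 5, 3, 8]
print(wilcoxon_statistic(a, b))  # → 4.0
```

A small W indicates that one member of each pair is systematically larger, which is then checked against the test's critical values (or, with scipy, the returned p-value) at the 0.05 level used in the paper.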
ACKNOWLEDGMENTS

This work was supported by the Slovenian Research Agency. The authors acknowledge the Medical Physics Research Group at the University of Ljubljana2 for providing the image segmentation data as part of the RIS competition3.

2 https://medfiz.si/en
3 http://tiziano.fmf.uni-lj.si/

Table 1: Average AUC ROC values obtained across the ten cross-validation folds. Best results are bolded, second-best results are highlighted in italics.

Class Imbalance
Mitigation Strategy   CART     IF       kNN      MLP      Naïve Bayes   RF       SVM      GBM
NONE                  0.6429   0.6802   0.8504   0.7879   0.6653        0.8601   0.8066   0.8555
RANDOM                0.6402   0.5215   0.7846   0.7993   0.6464        0.6691   0.6888   0.8150
SMOTE                 0.6147   0.5607   0.6813   0.7663   0.6590        0.6660   0.6817   0.7826
ADASYN                0.6020   0.5863   0.6660   0.7655   0.6282        0.6435   0.6652   0.7787
CTGAN                 0.7401   0.5340   0.8118   0.8419   0.6395        0.7090   0.6896   0.8871

REFERENCES
[1] Erdi Acar, Engin Şahin, and İhsan Yılmaz. 2021. Improving effectiveness of different deep learning-based models for detecting COVID-19 from computed tomography (CT) images. Neural Computing and Applications (2021), 1–21.
[2] Parnian Afshar, Shahin Heidarian, Nastaran Enshaei, Farnoosh Naderkhani, Moezedin Javad Rafiee, Anastasia Oikonomou, Faranak Babaki Fard, Kaveh Samimi, Konstantinos N Plataniotis, and Arash Mohammadi. 2021. COVID-CT-MD, COVID-19 computed tomography scan dataset applicable in machine learning and deep learning. Scientific Data 8, 1 (2021), 1–8.
[3] Alankrita Aggarwal, Mamta Mittal, and Gopi Battineni. 2021. Generative adversarial network: An overview of theory and applications. International Journal of Information Management Data Insights (2021), 100004.
[4] Dhurgham Al-Karawi, Shakir Al-Zaidi, Nisreen Polus, and Sabah Jassim. 2020. Machine learning analysis of chest CT scan images as a complementary digital test of coronavirus (COVID-19) patients. MedRxiv (2020).
[5] William E Allen, Han Altae-Tran, James Briggs, Xin Jin, Glen McGee, Andy Shi, Rumya Raghavan, Mireille Kamariza, Nicole Nova, Albert Pereta, et al. 2020. Population-scale longitudinal mapping of COVID-19 symptoms, behaviour and testing. Nature Human Behaviour 4, 9 (2020), 972–982.
[6] Eduardo J Mortani Barbosa, Bogdan Georgescu, Shikha Chaganti, Gorka Bastarrika Aleman, Jordi Broncano Cabrero, Guillaume Chabin, Thomas Flohr, Philippe Grenier, Sasa Grbic, Nakul Gupta, et al. 2021. Machine learning automatically detects COVID-19 using chest CTs in a large multicenter cohort. European Radiology (2021), 1–11.
[7] Mucahid Barstugan, Umut Ozkaya, and Saban Ozturk. 2020. Coronavirus (COVID-19) classification using CT images by machine learning methods. arXiv preprint arXiv:2003.09424 (2020).
[8] Nitesh V Chawla. 2009. Data mining for imbalanced datasets: An overview. Data Mining and Knowledge Discovery Handbook (2009), 875–886.
[9] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002), 321–357.
[10] Gary S Collins, Johannes B Reitsma, Douglas G Altman, and Karel GM Moons. 2015. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. Journal of British Surgery 102, 3 (2015), 148–158.
[11] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. Generative adversarial networks: An overview. IEEE Signal Processing Magazine 35, 1 (2018), 53–65.
[12] Thays Maria Costa de Lucena, Ariane Fernandes da Silva Santos, Brenda Regina de Lima, Maria Eduarda de Albuquerque Borborema, and Jaqueline de Azevêdo Silva. 2020. Mechanism of inflammatory response in associated comorbidities in COVID-19. Diabetes & Metabolic Syndrome: Clinical Research & Reviews 14, 4 (2020), 597–600.
[13] Daniel Pinto Dos Santos, Matthias Dietzel, and Bettina Baessler. 2021. A decade of radiomics research: are images really data or just patterns in the noise?
[14] El-Sayed M El-Kenawy, Abdelhameed Ibrahim, Seyedali Mirjalili, Marwa Metwally Eid, and Sherif E Hussein. 2020. Novel feature selection and voting classifier algorithms for COVID-19 classification in CT images. IEEE Access 8 (2020), 179317–179335.
[15] Marco Francone, Franco Iafrate, Giorgio Maria Masci, Simona Coco, Francesco Cilia, Lucia Manganaro, Valeria Panebianco, Chiara Andreoli, Maria Chiara Colaiacomo, Maria Antonella Zingaropoli, et al. 2020. Chest CT score in COVID-19 patients: correlation with disease severity and short-term prognosis. European Radiology 30, 12 (2020), 6808–6817.
[16] Robert J Gillies, Paul E Kinahan, and Hedvig Hricak. 2016. Radiomics: images are more than pictures, they are data. Radiology 278, 2 (2016), 563–577.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems 27 (2014).
[18] Stephanie A Harmon, Thomas H Sanford, Sheng Xu, Evrim B Turkbey, Holger Roth, Ziyue Xu, Dong Yang, Andriy Myronenko, Victoria Anderson, Amel Amalou, et al. 2020. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets. Nature Communications 11, 1 (2020), 1–7.
[19] Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 1322–1328.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[21] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. 2005. Optimal number of features as a function of sample size for various classification rules. Bioinformatics 21, 8 (2005), 1509–1515.
[22] Lal Hussain, Tony Nguyen, Haifang Li, Adeel A Abbasi, Kashif J Lone, Zirun Zhao, Mahnoor Zaib, Anne Chen, and Tim Q Duong. 2020. Machine-learning classification of texture features of portable chest X-ray accurately classifies COVID-19 lung infection. BioMedical Engineering OnLine 19, 1 (2020), 1–18.
[23] Seifedine Kadry, Venkatesan Rajinikanth, Seungmin Rho, Nadaradjane Sri Madhava Raja, Vaddi Seshagiri Rao, and Krishnan Palani Thanaraj. 2020. Development of a machine-learning system to classify lung CT scan images into normal/COVID-19 class. arXiv preprint arXiv:2004.13122 (2020).
[24] Sara Hosseinzadeh Kassania, Peyman Hosseinzadeh Kassanib, Michal J Wesolowskic, Kevin A Schneidera, and Ralph Detersa. 2021. Automatic detection of coronavirus disease (COVID-19) in X-ray and CT images: a machine learning based approach. Biocybernetics and Biomedical Engineering 41, 3 (2021), 867–879.
[25] Michael T Kassin, Nicole Varble, Maxime Blain, Sheng Xu, Evrim B Turkbey, Stephanie Harmon, Dong Yang, Ziyue Xu, Holger Roth, Daguang Xu, et al. 2021. Generalized chest CT and lab curves throughout the course of COVID-19. Scientific Reports 11, 1 (2021), 1–13.
[26] Virendra Kumar, Yuhua Gu, Satrajit Basu, Anders Berglund, Steven A Eschrich, Matthew B Schabath, Kenneth Forster, Hugo JWL Aerts, Andre Dekker, David Fenstermacher, et al. 2012. Radiomics: the process and the challenges. Magnetic Resonance Imaging 30, 9 (2012), 1234–1248.
[27] Philippe Lambin, Ralph TH Leijenaar, Timo M Deist, Jurgen Peerlings, Evelyn EC De Jong, Janita Van Timmeren, Sebastian Sanduleanu, Ruben THM Larue, Aniek JG Even, Arthur Jochems, et al. 2017. Radiomics: the bridge between medical imaging and personalized medicine. Nature Reviews Clinical Oncology 14, 12 (2017), 749–762.
[28] Philippe Lambin, Emmanuel Rios-Velazquez, Ralph Leijenaar, Sara Carvalho, Ruud GPM Van Stiphout, Patrick Granton, Catharina ML Zegers, Robert Gillies, Ronald Boellard, André Dekker, et al. 2012. Radiomics: extracting more information from medical images using advanced feature analysis. European Journal of Cancer 48, 4 (2012), 441–446.
[29] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision. 2980–2988.
[30] Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 413–422.
[31] Hossein Mohammad-Rahimi, Mohadeseh Nadimi, Azadeh Ghalyanchi-Langeroudi, Mohammad Taheri, and Soudeh Ghafouri-Fard. 2021. Application of machine learning in diagnosis of COVID-19 through X-ray and CT images: a scoping review. Frontiers in Cardiovascular Medicine 8 (2021), 185.
[32] John Mongan, Linda Moy, and Charles E Kahn Jr. 2020. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers.
[33] Sergey P Morozov, Anna E Andreychenko, Ivan A Blokhin, Pavel B Gelezhe, Anna P Gonchar, Alexander E Nikolaev, Nikolay A Pavlov, Valeria Yu Chernina, and Victor A Gombolevskiy. 2020. MosMedData: data set of 1110 chest CT scans performed during the COVID-19 epidemic. Digital Diagnostics 1, 1 (2020), 49–59.
[34] Jawad Rasheed, Alaa Ali Hameed, Chawki Djeddi, Akhtar Jamil, and Fadi Al-Turjman. 2021. A machine learning-based framework for diagnosis of COVID-19 from chest X-ray images. Interdisciplinary Sciences: Computational Life Sciences 13, 1 (2021), 103–117.
[35] Michael Roberts, Derek Driggs, Matthew Thorpe, Julian Gilbey, Michael Yeung, Stephan Ursprung, Angelica I Aviles-Rivero, Christian Etmann, Cathal McCague, Lucian Beer, et al. 2021. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence 3, 3 (2021), 199–217.
[36] Prottoy Saha, Muhammad Sheikh Sadi, and Md Milon Islam. 2021. EMCNet: Automated COVID-19 diagnosis from X-ray images using convolutional neural network and ensemble of machine learning classifiers. Informatics in Medicine Unlocked 22 (2021), 100505.
[37] Adekunle Sanyaolu, Chuku Okorie, Aleksandra Marinkovic, Risha Patidar, Kokab Younis, Priyank Desai, Zaheeda Hosein, Inderbir Padda, Jasmine Mangat, and Mohsin Altaf. 2020. Comorbidity and its impact on patients with COVID-19. SN Comprehensive Clinical Medicine (2020), 1–8.
[38] Ahmet Saygılı. 2021. A new approach for computer-aided detection of coronavirus (COVID-19) from CT and X-ray images using machine learning methods. Applied Soft Computing 105 (2021), 107323.
[39] H Swapnarekha, Himansu Sekhar Behera, Janmenjoy Nayak, and Bighnaraj Naik. 2020. Role of intelligent computing in COVID-19 prognosis: A state-of-the-art review. Chaos, Solitons & Fractals 138 (2020), 109947.
[40] Tianbing Wang, Zhe Du, Fengxue Zhu, Zhaolong Cao, Youzhong An, Yan Gao, and Baoguo Jiang. 2020. Comorbidities and multi-organ injuries in the treatment of COVID-19. The Lancet 395, 10228 (2020), e52.
[41] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, and Kalyan Veeramachaneni. 2019. Modeling tabular data using conditional GAN. arXiv preprint arXiv:1907.00503 (2019).
[42] Jing Yang, Ya Zheng, Xi Gou, Ke Pu, Zhaofeng Chen, Qinghong Guo, Rui Ji, Haojia Wang, Yuping Wang, and Yongning Zhou. 2020. Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis. Int J Infect Dis 10, 10.1016 (2020).
[43] Huseyin Yasar and Murat Ceylan. 2021. A novel comparative study for detection of Covid-19 on CT lung images using texture analysis, machine learning, and deep learning methods. Multimedia Tools and Applications 80, 4 (2021), 5423–5447.
[44] Stephen SF Yip and Hugo JWL Aerts. 2016. Applications and limitations of radiomics. Physics in Medicine & Biology 61, 13 (2016), R150.
[45] Kun-Hsing Yu, Andrew L Beam, and Isaac S Kohane. 2018. Artificial intelligence in healthcare. Nature Biomedical Engineering 2, 10 (2018), 719–731.


Observing Water-Related Events for Evidence-Based Decision-Making

SIKDD '21, October 2021, Ljubljana, Slovenia

Joao Pita Costa *, *****, M. Besher Massri *, Inna Novalija *, Ignacio Casals del Busto **, Iulian Mocanu ***, Maurizio Rossi ****, Jan Šturm *, Eva Erzin *, Alenka Guček *, Matej Posinković *, Marko Grobelnik *

* Institute Jozef Stefan, Slovenia; ** Aguas del Alicante, Spain; *** Apa Braila, Romania; **** Ville de Carouge, Switzerland; ***** Quintelligence, Slovenia

ABSTRACT

With the awareness of a changing climate impacting our sustainability, and in line with the European Green Deal initiative and the Sustainable Development Goal 6 addressing water, industry, society and local governments require reliable and comprehensive technology that can provide them with an overview of water events to anticipate problems, and with the tools to analyse the best practices appropriate to solve them. This paper presents the NAIADES Water Observatory (NWO), a digital solution offering a series of analyses and visualisations of water-related topics, helping users to extract important insights in relation to the water sector. It takes advantage of heterogeneous data sources, from the media and social media landscape to published research and global/local indicators. Through collaboration with local water resource management institutions, the NWO was configured to local priorities and ingests local datasets to better fit the needs of decision-makers.

CCS CONCEPTS
• Real-time systems • Data management systems • Life and medical sciences

KEYWORDS
Water Resource Management, Smart Water, Observatory, Water Digital Twin, Elasticsearch, Streamstory
1 Introduction

The water sector is facing rapid development towards the smart digitalisation of resources, much motivated and supported by the UN's global initiative for the Sustainable Development Goal 6 (SDG 6). In that context, the efforts to address the specific challenges related to water management data and priorities multiply globally. There are several "digital twin" systems dedicated to water, each of which focuses on different aspects of the digitalisation of signals to support water management companies, as well as water "observatories". The latter are usually meant as Geographical Information Systems that showcase the different aspects of water resources through time.

Within the scope of the European Commission-funded project NAIADES [1], focusing on the automation of water resource management and environmental monitoring, we propose a slightly different approach that integrates heterogeneous data sources to solve common research questions, as well as to support water management companies in their current problems. This solution is named the NAIADES Water Observatory (NWO), available at naiades.ijs.si, putting together: (i) real-time information from multilingual world news on water topics; (ii) data visualisation of water-related indicators through time, sourced from the datasets associated with SDG 6 (water) and other UN data (see Figure 1); and (iii) scientific knowledge from published biomedical research on water-related topics (e.g., water contamination).

Due to the rapidly growing awareness of the sustainability challenges that we are facing in Europe and worldwide in the context of water resource management, much work has been done to develop systems that are able to collect information about the available water, and even to simulate and forecast it in the near future. These are, however, usually geolocation-based systems ingesting water-related data to enable real-time monitoring of resources and usage [x] [y] [z], and thus much different from the water observatory that we propose in this paper. A typical example is the GoAigua system [4], a digital twin technology allowing, e.g., the city of Valencia to optimise its water management at the network level, improving efficiency in daily operations, planning real-time scenarios, and making some predictions about its future behaviour [5].

Figure 1: Visualisation of water-related indicators within Spain, complementing the global indicators view that ingests data from, e.g., the U.N. and the World Bank.

2 A data-driven solution for water events

The proposed Water Observatory enables the extraction of insightful water-related information, configured to use case priorities and needs, from the integration of heterogeneous data sources. This includes information from social media when the weather is favourable for floods, and historical information from news and published research on these weather-related events and on how to make better decisions to solve them.

This is complemented by data ingested from global and local indicators (i.e., datasets at the regional level), showcasing the observation of water-related datasets linked to SDG 6 at the global and country levels that can help us observe changes and trends. The NAIADES Water Observatory enables the user to explore the information provided by published science and the success stories that can be used in decision-making and water education at the local level (i.e., showcasing the resources and problematics of the region).

In this approach, the water data sensing is done over dynamic open data sources that serve as digital sensors (news, social media, indicators, publications, weather forecasts). This data is then integrated and visualised, each source in its own tab, addressing specific topics of interest. The observatory is thus composed of all that heterogeneous data coming in at different frequencies. The interactions between those data sources to solve common problems make it a Water Digital Twin. Envisioned examples include the analysis of best practices in water events in, e.g., Braila, identified in the news and explored over the published research, or the alerts triggered by weather conditions and observed over social media on a water event. The questions we are trying to solve with this innovative technology are, e.g., whether we can predict water shortages in a certain region given the historical data, or whether we can identify early signals of water-related problems from social media (see Figure 2).

Figure 2: Analysis of the sentiment in water-related posts on Twitter and its relation to consumer satisfaction and water-related events.

All of the views of this observatory, each of which represents a digital solution on its own, are configured to the local priorities of the NAIADES users as a Proof of Concept, showing that each can address specific conditions:

● Indicators: adding to the global UN indicators, we ingest curated open datasets that have regional information about water topics of interest to the stakeholder
● Media: each location has its own news and social media streams configured to the priorities and aspects of the news that stakeholders define as topics of interest (e.g., floods)
● Research: similarly to the media sources, the research topics allow for some customisation to better fit the needs of the local user
● Resources: the natural resources information provided for exploration is geolocated to the regions of interest to the user of the platform

It is relatively easy to include new use cases and corresponding workspaces after discussions on user priorities, which allow us to configure the information presented and make it meaningful.

3 Addressing the challenges of tomorrow

With the range of views provided at the observatory, the problems addressed can be of a complex nature and cover a range of concerns and workflows. The different ICT capabilities available across the water sector require intuitive and meaningful technologies to ensure the usefulness of the contribution to the community. The target users of the NWO belong to three main scenarios with different workflows that can be supported by the developed technology:

1. Water resource management: using the provided information in the resolution of problems related to weather events, to understand how their actions are perceived by the consumers, and to explore successful scenarios in similar cases
2. Local governments: to support evidence-based decision-making using open data, to better synchronise with SDG 6 and other guidelines, and to evaluate commitments over time
3. General public: for water education with a local context, in aspects that matter to the local population, based on the parts of the Water Observatory that can be open to the public

The priorities in the European Union are rapidly changing towards sustainability and environmental efficiency, transversally to most domains of action. The European Commission's Green Deal [3], aiming for a climate-neutral Europe by 2050 and boosting the economy through green technology, provides a new framework to understand and position water resource management in the context of the challenges of tomorrow. The NAIADES Water Observatory will not only contribute to the improvement of European sustainability in water-related matters, but will also assign the local actors in water resource management an active role in that. It provides the user of the NAIADES platform with the global and local insight that can be transformed into business intelligence, helping companies to steer their strategies towards customer satisfaction. In the following, we describe selected views of this observatory through the verticals (or views) News, Indicators, and Biomedical, first at the level of the specific dashboards that constitute the tabs in the online instance, and then through the extended exploratory instances, including public instances and APIs, for each of the three verticals.
4 System description and architecture

The NWO offers the user exploratory dashboards for further investigation of the news, for digging deeper into the ingested indicators, and for exploring the biomedical research on water contamination in detail. Moreover, each of the three dashboards has versions built to be exposed through, e.g., an iframe via a publicly available channel, which can be used for integration into high-management KPI-monitoring dashboards. Furthermore, we also offer part of the information in these dashboards through APIs that are easily integrable with other systems.

The Indicators view provides the user with interactive data exploration tools that allow for KPI monitoring over several water-related topics, including SDG 6, the World Bank Open Data, UN data, etc. In this module we also ingest regional data sources that include local indicators, addressing the user's priorities. Considering their well-established data types, data integration is possible and, whenever limitations appear due to a lack of data or poor data quality, the dataset is pre-processed to allow for data completion (whenever possible), or at least an improvement in data quality.

The Media view provides the user with real-time news monitoring over water-related topics (such as Water Scarcity and Water Contamination), and with the analysis of water-related tweets based on data visualisation modules. Based on the news engine Event Registry [7], this view provides the system with a continuous stream of news articles sourced from RSS-enabled sites across the world. From the data management module, the real-time news data is accessed by the news dashboard, which the NAIADES user can configure to tune the topics of interest in the configuration web app. To further explore a water-related topic, the NWO provides a dashboard for the analysis of social media posts on Twitter (see Figure 2), collected in real time, where sentiment is analysed, related concepts are extracted, and it is possible to access the raw tweets or apply several filters.

Finally, the Biomedical module allows for the exhaustive exploration of water contamination information from scientific research articles published worldwide and available through the MEDLINE biomedical open dataset [9] and the Microsoft Academic Graph [8]. The MEDLINE dataset is collected from the official FTP source made available by the North American National Library of Medicine (NLM) as an XML dump and uploaded to the Elasticsearch data management system through a Python script, while the Microsoft Academic Graph dataset is collected from an Azure container, with the data updated biweekly by the Microsoft Research team. The data management is based on the Elasticsearch technology [2], useful for both the interactive data visualisations and the Indicators Explorer view. The latter allows the NAIADES user to explore the raw data through template visualisations, to use Lucene-based queries that can leverage the loaded metadata, and to easily build visualisation modules that can define a new dashboard of data visualisations. The dataset is then called over an HTTP API by the SearchPoint technology [6] to load the dataset and the respective metadata, thus allowing for powerful Lucene-based queries and further interaction over a movable pointer. This leads to a refinement of the search for information that can then be extended over the Biomedical Explorer, which feeds on the same dataset through Kibana, but also allows for the analysis of raw data, the easy construction of data visualisation modules from templates, and an interactive data visualisation dashboard. All the mentioned dashboards can be made publicly available through, e.g., an iframe, to be integrated into high-management KPI monitors.

These dashboards come together to provide the user with a global perspective in real time, where five different tiers of usability are made available (see Figure 3). The tiers allow for the extended usability of the Water Observatory, transversally to the available data sources.

Figure 3: The global view of pilot 1 over usage and data sources.

Figure 4: System architecture of the NAIADES Water Observatory, showcasing the relation between the used technologies and the NWO views.
Available: https://ec.europa.eu/info/strategy/priorities-2019-2024/european-green-deal_en. [Accessed 1 9 2020]. [4] Idrica, "GoAigua: Smart Water for a Better World," 2020. [Online]. Available: https://www.idrica.com/goaigua/. [Accessed 1 9 2020]. [5] Idrica, "Digital Twin: implementation and benefits for the water sector," 19 2 2020. [Online]. Available: https://www.idrica.com/blog/digital- Figure 5: Preliminary data analysis of the relation twin-implementation-benefits-water-sector/. [Accessed 1 9 2020]. between news and tweets on water-related events and [6] Institute Jozef Stefan, "Streamstory". [Online]. Available: their relations with other topics (e.g., weather). http://streamstory.ijs.si/. [Accessed 26 8 2021] [7] G. Leban, B. Fortuna, J. Brank and M. Grobelnik, "Event registry: learning about world events from news," Proceedings of the 23rd International Further development to the NAIADES Water Observatory, Conference on World Wide Web, pp. 107-110, 2014. will be providing the users with tools to explore the impact of [8] Microsoft, "Microsoft Academic Graph". [Online]. Available: https://www.microsoft.com/en-us/research/project/microsoft-natural resources as, e.g., the weather, as well as predictions academic-graph/. [Accessed 26 8 2021] on the levels of the available bodies of water, based on [9] National Library of Medicine, "MEDLINE". [Online]. Available: https://www.nlm.nih.gov/medline/medline_overview.html. [Accessed 26 ingested weather data from the ECMWF (on humidity, 8 2021] temperature and rainfall) and other open data sources. This [10] L. Stopar, P. Škraba, M. Grobelnik, and D. Mladenić (2018). StreamStory: Exploring Multivariate Time Series on Multiple Scales. IEEE transactions will help the users to have some insight on the impact of the on visualization and computer graphics 25.4: 1788-1802. climate crisis in regions that directly relate to their water resources. 
Anomaly Detection on Live Water Pressure Data Stream

Gal Petkovšek, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia (gal.petkovsek@ijs.si)
Matic Erznožnik, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia (matic.erznoznik@ijs.si)
Klemen Kenda, Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia (klemen.kenda@ijs.si)

ABSTRACT
We present the application of several anomaly detection algorithms to water pressure data streams. We evaluate their quality on unlabelled data sets using agreement rates. The applied algorithms are the Generative Adversarial Network (GAN), DBSCAN, Welford's algorithm and Facebook Prophet. We found that GAN performed best.

Keywords
water management, machine learning, anomaly detection

1. INTRODUCTION
In the last decades, the Internet of Things (IoT) has penetrated and shaped several fields such as energy management, traffic, health care and others. The water sector, however, is still implementing IoT solutions that will improve water management with features such as real-time consumption prediction, leakage detection, water quality estimation and others.

In the presented work, we focus on anomaly detection on the live water pressure data stream from the town of Braila (Romania). The overall goal of the research is to detect leakage points in the city's water distribution network. To detect the presence of a leakage in the system, we apply an anomaly detection algorithm to the water pressure data stream. We considered several such algorithms, which were applied and evaluated on four data streams obtained from four pressure sensors. Our goal was to find the algorithm which returns the best results. Since the data is not labeled (regular or anomalous), the estimation of accuracy was done with a method considering the relative agreement among the selected algorithms [1].

Anomaly detection on time series is a well-researched field; the algorithms applied in this paper have already been considered in related work, in different settings and for different time series. Anomaly detection can be performed by estimating the expected regular interval of the upcoming measurement. This can be achieved in an incremental fashion with a simple short-term prediction model, for example with a Kalman filter [7], or with a more advanced approach based on time-series modeling [11]. The latter can be used in several settings, for example for detecting air temperature anomalies in sewer systems [12].

DBSCAN [10] is a data clustering algorithm that can be applied to frequently changing data sets. Its incremental version [5] can be used in a streaming setting. The potential of the algorithm for anomaly detection has been demonstrated in several use cases, for example in detecting air temperature anomalies [3].

The paper that demonstrated the use of Generative Adversarial Networks for anomaly detection on data streams is fairly recent [6]. The authors have shown that this approach can outperform several other baselines on data sets obtained from NASA, Yahoo, Amazon etc. They introduced different measures of evaluating the reconstruction accuracy, which we tried to improve upon in our paper.

In this work, we use the already established anomaly detection approaches and compare their performance on an unlabeled water pressure data stream from a water distribution network. A more detailed description of the algorithms is given in the Methodology section. We argue that the relative agreement approach [1] improves the anomaly detection performance, which we demonstrate by manual evaluation of the results.
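The window-based use of DBSCAN for outlier flagging discussed in the related work can be sketched as follows. This is a minimal illustration assuming scikit-learn is available; the window length, `eps` and `min_samples` values are illustrative, not the settings used in the experiments.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_flags(values, window=5, eps=0.5, min_samples=5):
    """Embed a univariate stream into vectors of `window` consecutive
    values and mark every window that DBSCAN treats as noise (label -1)
    as anomalous. Parameter values here are illustrative only."""
    X = np.lib.stride_tricks.sliding_window_view(
        np.asarray(values, dtype=float), window)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels == -1

# A flat pressure signal with one spike: only the windows that touch
# the spike fall outside the dense cluster of normal windows.
stream = [1.0] * 50 + [10.0] + [1.0] * 50
flags = dbscan_flags(stream)
```

Note that this batch call refits DBSCAN on all windows; an actual live deployment would use an incremental variant, as referenced above [5].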
The anomaly detection algorithms that were tested were GAN (generative adversarial networks) [6], DBSCAN [10], Welford's algorithm [9] and anomaly detection with Facebook Prophet [11]. It is important to note that the first three algorithms treat the data stream as an actual live stream. This means that they consume one sample at a time (or a feature vector containing multiple past values, enrichment values and contextual data) and declare it regular or anomalous, as the algorithms are intended to work in production. In contrast, Facebook Prophet consumes the whole data stream as a batch and labels all the samples together. This makes it unusable in production (in this setting); however, it is included in the experiment since it can help to estimate the accuracy of the other algorithms.

2. DATA AND DATA PREPROCESSING
We demonstrate our anomaly detection methodology on four data sets. Each of the data sets represents the pressure values of one of the sensors, which are located at different points in Braila's water distribution network. The sensors are labeled as '5770', '5771', '5772' and '5773'. The data sets contain between 10 and 11 thousand instances, which are spaced in 15-minute intervals, so about 100 days' worth of data. The data was first pre-processed to remove any duplicated points and 'holes' in the data which were formed as a consequence of sensor down-time. When working with data streams, this process should be done automatically to avoid any incorrect analysis when feeding the data into the anomaly detection algorithms. Each of the four data sets was split into a training and an evaluation part. The training sets consisted of the first 2000 data points and the evaluation sets contained all the rest. This is done so that the algorithms which require training (GAN, DBSCAN) can be trained on one part of the data and evaluated on the other.

3. METHODOLOGY
3.1 Evaluation of algorithms
Evaluation of the performance of algorithms on unlabelled data always represents a challenge. Since we are working with such data, an actual calculation of accuracy scores would require manual labelling of the data instances. To avoid this time-consuming process, we use a method for estimating error rates (the ratio of wrong classifications to the total number of instances) from the agreement rates of multiple algorithms. The agreement rate of two classifiers f_i and f_j is defined in the following way:

a_{i,j} = (1/S) * Σ_{s=1}^{S} I{f_i(X_s) = f_j(X_s)}

where X_1, ..., X_S are unlabeled samples. The calculated agreement rates are then inserted into the following equations:

a_{i,j} = 1 − e_{i} − e_{j} + 2 e_{i,j}

Here we assume that the functions make independent errors, so we can substitute e_{i,j} with e_{i} e_{j}. With such a system of equations we can then calculate the error rates using some root-finding algorithm. Such an approach has previously been used for the evaluation of classifiers on an unlabelled dataset [1]. We therefore consider each anomaly detection algorithm as a binary classifier and use the aforementioned method for the comparison of the different algorithms. Additionally, two important assumptions were made: firstly, that the anomaly detection algorithms are independent, and secondly, that each of them performs better than a random classifier.

Since the estimated performance of one algorithm depends on the output of the others, it was important that the algorithms yield a similar percentage of anomalies. In other words, the algorithms are tuned to have a similar predicted positive condition rate (PPCR = (FP + TP) / (FP + TP + FN + TN)). For most data streams this means that 1%-3% of the samples are labelled as anomalous.

3.2 GAN
The Generative Adversarial Network (GAN) [6] is an unsupervised machine learning approach to anomaly detection. An encoder-decoder structure of the neural network is used to first encode the input data point and then decode the encoded one. The model learns to reconstruct the input data point as closely as possible. The idea is that the reconstruction should be better if the input data is 'normal' and worse if it is abnormal/anomalous. We use an input vector which is composed of 10 consecutive values of the uni-variate data stream. We then compare the input vector to the reconstructed one using the mean squared error (MSE) metric. We classify the data point as 'normal' if the value of the MSE is below the defined threshold. The authors of [6] calculated the thresholds using sliding windows on reconstruction errors (4 standard deviations from the mean of the window). We used a slightly different approach, using the moving average multiplied by a constant as the threshold. This proved to be easier to implement in our live data stream use-case.

3.3 DBSCAN
DBSCAN [4] is a well-known data clustering algorithm. It groups together points which are close together based on Euclidean distance. The group with the largest number of points is in our case considered 'normal', and the lower-density groups are outliers, which are then labeled as anomalies. The parameter which measures how close the points should be for them to still be considered part of the same group can be adjusted based on the data set and the desired sensitivity of the algorithm. For DBSCAN we also use an input vector composed of consecutive pressure values. In this case, we discovered that a vector of 5-6 values works best.

3.4 Welford's algorithm
Welford's algorithm gets its name from Welford's method for the online estimation of mean and variance. A very simple anomaly detection approach [9] can then be constructed by defining the upper and lower limits (UL and LL) of "normal" data as a function of mean and variance:

UL = mean + X * variance
LL = mean − X * variance

X is fixed and determines the threshold band. Any instance which falls outside of that band is labeled as an anomaly. Instances can then be input into the algorithm one by one to be labeled, and after each one the mean and the variance (and consequently also UL and LL) are updated.

For this experiment the actual Welford's method was not used, since the mean and variance were computed from the last 1500 samples so that they would better adapt to the new samples. Note that the first 1500 samples therefore could not be labeled; however, this was not a problem, since most of the other approaches required 2000 samples for fitting the models, and the evaluation was therefore done on the remaining stream. The upper and lower limits of the interval were still computed as shown above, with the value of X = 2.2.

3.5 Facebook Prophet
Facebook Prophet is an algorithm for time series forecasting that works especially well on data streams with multiple seasonalities [8]. Prophet also works well with missing data, which makes it a good candidate for the problem at hand. After fitting the model, it can make predictions for a chosen set of timestamps presented to it. Furthermore, besides the prediction it also outputs the upper and lower limits of the confidence interval for every sample. Ashrapov [2] demonstrates the implementation of an anomaly detection algorithm which uses this property to classify the samples inside the confidence interval as regular and the rest as anomalies. The model is fitted on the entire data set and then makes predictions on the same data set, providing both the anomaly detection and the confidence interval.

4. RESULTS
The results of the algorithms for the data stream from sensor 5770 are presented in Figures 1, 2, 3 and 4. The charts show the raw values obtained from the pressure sensors, indicating the points which are labeled as anomalies with red points.
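To make the agreement-based evaluation of Section 3.1 concrete, the sketch below solves the system a_ij = 1 − e_i − e_j + 2 e_i e_j in closed form for exactly three detectors: substituting c_i = 1 − 2 e_i collapses the system to c_i c_j = 2 a_ij − 1. This is an illustrative special case; the paper itself applies a generic root-finding approach over the full set of detectors [1].

```python
from math import sqrt

def pairwise_agreement(y1, y2):
    """Share of samples on which two detectors output the same label."""
    return sum(a == b for a, b in zip(y1, y2)) / len(y1)

def solve_error_rates(a01, a02, a12):
    """Closed-form solution of a_ij = 1 - e_i - e_j + 2*e_i*e_j for three
    detectors: with c_i = 1 - 2*e_i the system becomes c_i*c_j = 2*a_ij - 1,
    assuming independent, better-than-chance errors (so c_i > 0)."""
    g01, g02, g12 = 2 * a01 - 1, 2 * a02 - 1, 2 * a12 - 1
    c0 = sqrt(g01 * g02 / g12)       # c0^2 = (c0 c1)(c0 c2)/(c1 c2)
    c = (c0, g01 / c0, g02 / c0)
    return tuple((1 - ci) / 2 for ci in c)
```

For example, detectors with true error rates (0.05, 0.10, 0.20) and independent errors produce agreements (0.86, 0.77, 0.74), from which the formula recovers the error rates exactly.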
Since the data sets are unlabelled, it is hard to assess the accuracy of each algorithm based on the anomaly visualizations alone, but we do notice some similarities and some differences. All of the algorithms are good at identifying obvious outliers (points which fall far out of the 'normal' range). The difference between the algorithms can be noticed when classifying points closer to the normal range. For example, Welford's algorithm tends to label points as anomalies at the peaks of the daily pressure fluctuation, which might not be ideal, since we know that this behaviour can be considered normal. More sophisticated algorithms such as GAN and Prophet were also able to identify more "subtle" anomalies.

Figure 1: Anomalies found using GAN on the data stream from sensor 5770.
Figure 4: Anomalies found using Facebook Prophet on the data stream from sensor 5770.

The recall of each algorithm can be increased or decreased by modifying parameters and thresholds. Since the data sets are unlabeled, it is hard to determine the optimal parameters. We decided to tune the algorithms to have a similar recall of 1-3%, as we deemed that this would make the comparison of the algorithms the most fair. In Table 1 the shares of anomalies are presented for each separate data stream.

Table 1: Shares of anomalies for all four data streams.

Algorithm            5770    5771    5772    5773
GAN                  1.42%   0.99%   0.77%   1.13%
DBSCAN               2.63%   2.82%   2.73%   2.85%
Welford's algorithm  3.39%   3.41%   1.66%   3.16%
Facebook Prophet     1.66%   1.13%   0.46%   1.40%

The error rates calculated from the agreement rates are shown in Table 2 for each of the data streams. Since we assumed that most of the samples in the data stream were normal, these error rates are not very informative out of context. We can, however, observe that Prophet performed best, followed by GAN, DBSCAN and Welford's algorithm, respectively. The results are consistent in all four scenarios.
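The band-based labelling of Section 3.4 can be sketched as follows. The short window and the standard-deviation band are illustrative simplifications of the paper's setup (a 1500-sample window and a variance-based band with X = 2.2), chosen so the toy example below behaves predictably.

```python
from collections import deque
from math import sqrt

def band_flags(stream, window=100, x=2.2):
    """Label samples falling outside mean +/- x*std of the previous
    `window` samples. The first `window` samples stay unlabeled, which
    mirrors the warm-up period described in Section 3.4."""
    buf, flags = deque(maxlen=window), []
    for v in stream:
        if len(buf) == window:
            m = sum(buf) / window
            s = sqrt(sum((b - m) ** 2 for b in buf) / window)
            flags.append(abs(v - m) > x * s)
        else:
            flags.append(False)  # warm-up: not enough history yet
        buf.append(v)
    return flags

# Oscillating pressure plus one pronounced spike at the end:
# only the spike leaves the threshold band.
stream = [1.0 + 0.1 * (-1) ** i for i in range(200)] + [5.0]
flags = band_flags(stream)
```

Because the band tracks the recent mean, a sample at the top of a regular daily oscillation can still exit the band, which is consistent with the peak-labelling behaviour of Welford's algorithm noted above.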
If we take into consideration that Prophet worked on the whole data set at once, while the other three were limited to one sample at a time (as it is in production), we can declare that GAN performed best out of the algorithms that can detect anomalies on a live stream.

Figure 2: Anomalies found using DBSCAN on the data stream from sensor 5770.
Figure 3: Anomalies found using Welford's algorithm on the data stream from sensor 5770.

Table 2: Error rates estimated from agreement rates for all four data streams.

Algorithm            5770    5771    5772    5773
GAN                  1.34%   1.38%   0.66%   1.09%
DBSCAN               1.59%   1.70%   1.78%   1.81%
Welford's algorithm  2.44%   2.41%   1.10%   2.31%
Facebook Prophet     1.14%   0.62%   0.39%   0.81%

We also considered a state-of-the-art method, Isolation Forest; however, it was too sensitive and therefore not usable in the error rate calculation.

5. CONCLUSIONS
We have tested five anomaly detection algorithms (Generative Adversarial Network, DBSCAN, Facebook Prophet, Welford's algorithm and Isolation Forest) on four separate data streams of water pressure data. Out of those five, the Isolation Forest performed poorly, since the share of anomalies found with this method was unreasonably high, and it was therefore not included in the final error estimate calculation.

The other approaches had similar shares of anomalies and were therefore used to calculate the agreement rates and, finally, the estimated error rates of each anomaly detection algorithm. The results were consistent for all four data streams. Prophet performed best in every setting; however, it looked at the data stream as a batch and therefore could not be used for online anomaly detection. GAN performed second best, followed by DBSCAN and Welford's algorithm, which all work on a live data stream. Therefore we can conclude that the most fitting algorithm for anomaly detection on the live water pressure data from a water distribution network is GAN. In future work, Facebook Prophet could be adapted in such a way that it would also work on a live data stream, since it has shown promising results in this experiment.

6. ACKNOWLEDGMENTS
This paper is supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 820985, project NAIADES (A holistic water ecosystem for digitisation of urban water sector).

7. REFERENCES
[1] Antonios Platanios, E. Estimating accuracy from unlabeled data.
[2] Ashrapov, I. Anomaly detection in time series with Prophet library, Jun 2020.
[3] Celik, M., Dadaser-Celik, F., and Dokuz, A. S. Anomaly detection in temperature data using DBSCAN algorithm. 2011 International Symposium on Innovations in Intelligent Systems and Applications (2011).
[4] do Prado, K. S. How DBSCAN works and why should we use it?, Apr 2017.
[5] Ester, M., and Wittmann, R. Incremental generalization for mining in a data warehousing environment. In International Conference on Extending Database Technology (1998), Springer, pp. 135-149.
[6] Geiger, A., Cuesta-Infante, A., and Veeramachaneni, K. Adversarially learned anomaly detection for time series data, 2020.
[7] Kenda, K., and Mladenić, D. Autonomous sensor data cleaning in stream mining setting. Business Systems Research: International Journal of the Society for Advancing Innovation and Research in Economy 9, 2 (2018), 69-79.
[8] Krieger, M. Time series analysis with Facebook Prophet: How it works and how to use it, Mar 2021.
[9] Lobo, J. L. Detecting real-time and unsupervised anomalies in streaming data: a starting point, Feb 2020.
[10] Schubert, E., Sander, J., Ester, M., Kriegel, H. P., and Xu, X. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS) 42, 3 (2017), 1-21.
[11] Taylor, S. J., and Letham, B. Forecasting at scale. The American Statistician 72, 1 (2018), 37-45.
[12] Thiyagarajan, K., Kodagoda, S., Ulapane, N., and Prasad, M. A temporal forecasting driven approach using Facebook's Prophet method for anomaly detection in sewer air temperature sensor system. In 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA) (2020), pp. 25-30.

Entropy for Time Series Forecasting

João Costa, Fakulteta za matematiko in fiziko (joaocostamat@gmail.com)
António Costa, ESN Paris (antoniocbscosta@gmail.com)
Klemen Kenda, Jožef Stefan Institut (klemen.kenda@ijs.si)
João Pita Costa, IRCAI (joao.pitacosta@quintelligence.com)

Figure 1: Sample of the time series and projections of the embedding. This plot gives a geometrical representation of the theory involved in Section 3 and shows the reconstructed state space of the given time series. This can be obtained by using Takens' embedding to reconstruct the time series y, given in panel a), as the Markovian system Y_K with K time delays, and then using Principal Component Analysis to perform the change of basis of the data. The obtained projections b), c) and d) capture the dynamics of the system, which gives us the possibility to predict the time series with higher efficiency.

ABSTRACT
In this paper, we present the exploitation of a method to extract information from microscopic samples of time series data in order to provide a representation of optimized stability to a chaotic system [1]. The main goal of this approach is to predict the dynamics of a time series and therefore develop optimized forecasting algorithms.
First, we study how to increase the predictability of a system and, second, we develop a Deep Learning algorithm, namely an LSTM, that can recognize patterns in sequential data and accurately predict the future behaviour of a time series.

KEYWORDS
Recurrent Neural Networks, LSTM, Entropy, Markov Chain, Clustering, Time Series

1 INTRODUCTION
Given its intrinsic nature, mathematics concerns itself with the construction of formal statements and proofs relating the different concepts within it. Its methods are used in countless ways and effectively model the shape of our world. But how is it possible to shape the unknown? Motivated by this question and the utmost need to find ways of optimizing water resources for future generations, there has been great development in the study of dynamical systems based on, for example, (Shannon) entropy [9] and phase space reconstruction [4]. In this paper, we provide an approach to water resource management using Deep Learning and Chaos Theory, by studying the dynamics of a time series using the two main ideas cited before. This study was developed for the H2020 NAIADES Project [2] with data collected from the Municipality of Alicante (Spain). We will present this study for the Autobus Dataset, related to the Bus Station Areas in Alicante.

2 STATIONARY AND CHAOTIC NATURE
2.1 Dickey-Fuller Test for Stationarity
In order to proceed with the theory involved in the method, it is necessary to understand the behaviour of the time series and its sensitivity to initial conditions. For studying a time series' stationarity, one can use the Augmented Dickey-Fuller test, which is a type of statistical test called a unit root test, where generally the null hypothesis is that the time series can be represented by a unit root, which means that for y = {y_t}_{t=1}^{T}, the information at point y_{t−1} does not provide us the ability to predict y_t. In our case, we obtained that the p-value of the test was 0, so the null hypothesis was rejected and the time series has no unit root. Therefore, it is stationary, and the time delays will provide important information for predicting the dynamics of the time series.
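The unit-root notion behind the ADF test can be illustrated numerically (this is an illustration of the null hypothesis, not the ADF statistic itself): under a unit root y_t = y_{t−1} + ε_t, the spread of simulated paths keeps growing with time, whereas a stationary AR(1) settles to a bounded spread.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_spread(phi, steps=2000, trials=200):
    """Cross-sectional std of AR(1) paths y_t = phi*y_{t-1} + eps after
    `steps` steps; phi = 1 is the unit-root (random walk) case."""
    y = np.zeros(trials)
    for _ in range(steps):
        y = phi * y + rng.normal(size=trials)
    return float(y.std())

unit_root = final_spread(1.0)   # grows roughly like sqrt(steps)
stationary = final_spread(0.5)  # settles near sqrt(1/(1 - 0.5**2))
```

Rejecting the null, as the paper does for the Alicante series, corresponds to the bounded, stationary regime, in which past delays carry usable predictive information.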
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
Information Society 2020, 5-9 October 2020, Ljubljana, Slovenia
© 2020 Copyright held by the owner/author(s).

2.2 Lyapunov Exponents for Understanding Chaotic Nature
The Lyapunov Exponent is a quantifier of the sensitivity of the time series to initial conditions and therefore of its chaotic nature. The main idea is to select an array of nearest neighbors, i.e., points at minimum distance, and calculate their trajectories in time. By doing so, we can then obtain an average of this divergence exponent, which gives us the Lyapunov Exponent. Since the system is bounded, the divergence is also bounded and will reach a plateau after a certain number of timesteps. In our case, the Lyapunov Exponent, given as the initial slope, is ≈ 518 and the initial growth is exponential, as can be seen in Figure 5. Therefore, the time series is of a chaotic nature.
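As a toy illustration of a Lyapunov exponent estimate (the paper estimates it from nearest-neighbour divergence on the Alicante series; the example below is a different, derivative-based method for a system whose dynamics are known), one can average the log-derivative along an orbit of the logistic map x → 4x(1−x), whose exact exponent is ln 2:

```python
from math import log

def lyapunov_logistic(x0=0.2, n=100_000, burn=1_000):
    """Average log|f'(x)| along an orbit of the logistic map
    f(x) = 4x(1-x), where |f'(x)| = |4 - 8x|. The exact Lyapunov
    exponent of this map is ln 2, approximately 0.693."""
    x = x0
    for _ in range(burn):        # discard the transient
        x = 4 * x * (1 - x)
    acc = 0.0
    for _ in range(n):
        acc += log(abs(4 - 8 * x))
        x = 4 * x * (1 - x)
    return acc / n

lam = lyapunov_logistic()
```

A positive exponent means nearby initial conditions diverge exponentially, which is exactly the behaviour the nearest-neighbour method detects before its divergence curve plateaus.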
3 MAXIMUM PREDICTABILITY
Given the high variability of any chaotic system, it is hard to capture the whole set of variables that model the state space. This is characteristic of a non-Markovian system, which is highly unpredictable. How do we surpass this issue? Takens' Embedding Theorem [8] tells us that, under certain conditions, it is possible to use past data to reconstruct a Markovian system, thus giving us the possibility to model the initial time series with higher efficiency.

We start by considering a set of ODEs ẋ = (ẋ_1, ẋ_2, ..., ẋ_d) and the d-dimensional time series y(t) of duration T, which is a set of incomplete measurements of x given by a measure M, i.e., y = M(x). Then, in order to calculate the number K of time delays to feed the LSTM with, the d-dimensional measurements are lifted into the state space Y_K ∈ R^{d×K}, consisting of the previously referred K time delays [3]. It is possible to quantify the chaotic measure of the system Y_K by calculating the entropy resulting from clustering. This can be done by partitioning the d×K-dimensional space into N Voronoi cells using K-means clustering. Having partitioned the state space Y_K, the reconstructed dynamics are encoded as a row-stochastic transition probability matrix P = [P_ij]_{i,j}, which relates increments of the state-space density p in the following way:

p_i(t + δt) = Σ_j (P^T)_{ij} p_j(t).    (1)

The entropy rate of the initial time series y(t) is then approximated by estimating the entropy rate (Figure 3) of the associated Markov chain for the different time delays K, using Kolmogorov's definition:

h_p(K) = − Σ_{i,j} π_i P_ij log P_ij,    (2)

where π is the estimated stationary distribution of the Markov chain P. This approximation gives an estimate of the conditional entropies (Figure 6), i.e., for a discrete state with delay vectors ȳ_i^K = {ȳ_i, ..., ȳ_{i+K−1}}, the entropy of the Markov chain provides an estimate of the conditional entropy:

h_p(K) ≈ ⟨− log[p(y_{i+K} | y_i, ..., y_{i+K−1})]⟩_N
       = H_{K+1}(N) − H_K(N)    (3)
       = h_K(N),

where H_K is the Shannon entropy of the sequence obtained by partitioning the ȳ space into N partitions.

4 MODEL ARCHITECTURE
4.1 LSTM
Long Short-Term Memory (LSTM) networks are a special type of Recurrent Neural Network (RNN) which rely on gated cells that control the flow of information by choosing what elements of the sequence are passed on to the next module. This idea was introduced in order to surpass the vanishing gradient problem in conventional RNNs [7]. At each time t, consider f_t as the forget gate, i_t as the input gate and o_t as the output gate, which are functions that depend on the output of the previous LSTM module, given by h_{t−1}, and on the input of the current timestep, given by x_t. Figure 2 shows a representation of how a single LSTM cell performs its computations, which can be mathematically represented as

f_t(x_t, h_{t−1}) = σ(w_{f,x} x_t + w_{f,h} h_{t−1} + b_f)
i_t(x_t, h_{t−1}) = σ(w_{i,x} x_t + w_{i,h} h_{t−1} + b_i)    (4)
o_t(x_t, h_{t−1}) = σ(w_{o,x} x_t + w_{o,h} h_{t−1} + b_o),

where the w and b terms are weight and bias parameters and σ is an activation function.

Figure 2: An LSTM performs the following ordered computations: the first step is to forget the irrelevant history. Then, LSTMs perform a computation to decide on the relevant parts of the new information and, based on the previous two steps, selectively update the internal state. Finally, an output is generated.

4.2 Our approach
The core idea is to take a list of k training sets Q_0, Q_1, ..., Q_{k−1} and testing sets P_0, P_1, ..., P_{k−1} in order to generalize the model and do the best estimation for the time series. This is based on translating the testing sets' partitions along the time series, where the first partition P_0 = {p_0^0, ..., p_0^n} is taken from the zeroth point of the time series data and the last partition P_{k−1} = {p_{k−1}^0, ..., p_{k−1}^n} extends to the last point of the time series data, with

|P_i| = |y| / k,  ∀i ∈ {0, ..., k − 1},    (5)

where |y| stands for the cardinality of the time series y. This procedure yields k models, each of which uses its training set to make predictions on the respective test set. Given the erratic nature of the data, which was taken in 15- and 30-minute samples, a resampling to 30-minute delays had to be done on the 15-minute data points, and a masking was added to the time series in order to neglect NaN values that could be created by resampling. Therefore, a masking layer was added, and the model is composed of 3 other layers L_{n_1}, L_{n_2} and L_{n_3}, where n_1 = n_3 = 1 (we have a univariate time series) and n_2 = 64, since it gave the best results in cross-validation. A dropout regularization of 0.1 was added for better approximation of training and validation errors, and the batch size was set to 128. The mean squared error of the predictions on the training set is ≈ 0.00115 and on the testing set ≈ 0.00236. One can assess the capacity of the model, whose predictive results are shown in Figure 4.
5 FORECASTING
5.1 Forecasting Methods
Consider a time series T = {t_1, ..., t_N}. The forecasting process can be done in three ways:

(1) iterated forecasting
(2) direct forecasting
(3) multi-neural-network forecasting

Process (1) is based on a "many-to-one" forecast, for which

t_{i+n} ≈ F(t_i, ..., t_{i+n−1}),  i ∈ {1, ..., N − n}.    (6)

Then, a K-step forecast can be iteratively obtained by

t̂_{N+j} := F(t̂_{N+j−n}, ..., t̂_{N+j−2}, t̂_{N+j−1}),  j ∈ {1, ..., K}.    (7)

Process (2) can be characterized by training a "many-to-many" function F for which

(t_{i+n}, ..., t_{i+n+K−1}) ≈ F(t_i, ..., t_{i+n−1}),    (8)

where i ∈ {1, ..., N − n − K + 1}. We can obtain a K-step forecast by

(t̂_{N+1}, ..., t̂_{N+K}) := F(t_{N−n+1}, ..., t_N).    (9)

Finally, process (3) is defined by k "many-to-one" functions F_1, ..., F_k, which hold the following relationship:

t_{i+n} ≈ F_1(t_i, ..., t_{i+n−1})
  ...
t_{i+n+K−1} ≈ F_k(t_i, ..., t_{i+n−1}),    (10)

where i ranges from 1 to N − n − K + 1. Process (1) does not require k a priori, while both processes (2) and (3) depend on the choice of k.

5.2 Our Approach
We chose to do a direct forecast for the next 7 days by taking the last test set partition P_{k−1} and making a prediction on this test set. Although forecasting seems pretty motivating, by choosing a partition that captures more characteristics of the time series one can achieve even better results. The achieved forecast can be seen in Figure 8 and compared with a 7-day sample in Figure 7.
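The iterated scheme of eqs. (6)-(7) can be sketched with a least-squares one-step AR predictor standing in for the LSTM; the model, window length and series below are illustrative only.

```python
import numpy as np

def fit_ar(y, n=3):
    """Least-squares one-step model t_{i+n} = F(t_i, ..., t_{i+n-1}),
    a linear stand-in for the LSTM of eq. (6)."""
    X = np.lib.stride_tricks.sliding_window_view(y[:-1], n)
    return np.linalg.lstsq(X, y[n:], rcond=None)[0]

def iterated_forecast(y, w, K):
    """Eq. (7): feed each one-step prediction back in as the next input."""
    hist = list(y[-len(w):])
    out = []
    for _ in range(K):
        nxt = float(np.dot(w, hist))
        out.append(nxt)
        hist = hist[1:] + [nxt]
    return out

# A geometric series y_t = 0.8^t: the K-step forecast continues the decay.
y = 0.8 ** np.arange(50)
w = fit_ar(y)
forecast = iterated_forecast(y, w, 3)
```

Direct forecasting (eq. (8)) would instead fit one model that outputs all K future values at once, trading error accumulation for a harder learning problem.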
, 𝑐 ) (11) 𝑙 𝑖 𝑖 𝑖 𝑖 1 𝑙 1 𝑙 and the conditional probabilites are given by 𝑝 (𝑐 |𝑐 , . . . , 𝑐 ), (12) 𝑖 𝑖 𝑖 𝑙 +1 1 𝑙 where 𝑐𝑖 is the next Voronoi Cell after 𝑐 . We can calculate the 𝑙 +1 𝑖𝑙 entropy rate growth by considering the conditional probabilities of the system given the previous 𝑙 cells, when visiting the (𝑙 +1)-th cell, via ℎ = ⟨− log[𝑝 (𝑐 |𝑐 , . . . , 𝑐 )]⟩ = 𝐻 − 𝐻 (13) 𝑙 𝑖 𝑖 𝑖 𝑙 +1 1 𝑙 𝑙 +1 𝑙 Figure 5: In this figure, we can understand the initial expo- Taking the supremum limit over all possible partitions 𝑃 of 𝑌 , 𝐾 nential growth on distance between points (given in blue), we obtain the Kolmogorov-Sinai invariant of the system, relative to a curve of slope 1 (given in orange). ℎ = sup lim ℎ (𝑃 ). (14) 𝐾 𝑆 𝑙 𝑙 →∞ 𝑃 59 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia João Costa, et al. building other algorithms, such as Transformer neural network, that would provide even better results. Another idea is to use weather data and build a multivariate LSTM that optimally gives better results than the univariate one. 9 ACKNOWLEDGMENTS I greatly thank to António Carlos Costa for working in coopera- tion and giving me the possibility to use the powerful machinery he built in order to obtain the desired 𝐾 time delays and under- stand the complex dynamics of the system. Also, to the NAIADES team at Jožef Stefan Institute for all the knowledge exchange and, in particular, to Klemen Kenda for giving me the possibility of writing this paper and João Pita Costa for giving me insights on Figure 6: Conditional Entropies - In this plot we can see how to write and structure the paper. the entropy rate for number of partitions This paper is supported by European Union’s Horizon 2020 𝑁 = 200 which maximizes this entropy. This function reaches a plateau research and innovation programme under grant agreement No. 
at 820985, project NAIADES (A holistic water ecosystem for digiti- ≈ 24 timesteps, which gives us an idea about which is the optimal sation of urban water sector). 𝐾 to choose. Given that we have 30 minutes timesteps, this plot shows that the optimized time delay is REFERENCES of 12h which corresponds to the day and night cycles [1] Tosif Ahamed, Antonio Carlos Costa, and Greg J. Stephens. 2019. Capturing the continuous complexity of behavior in c. elegans. (2019). arXiv: 1911.10559 [q-bio.NC]. [2] 2019-2022. Cordis, "naiades project". In CORDIS. \url{https: //cordis.europa.eu/project/id/820985}. [3] Antonio Carlos Costa, Tosif Ahamed, David Jordan, and Greg Stephens. 2021. Maximally predictive ensemble dy- namics from data. (2021). arXiv: 2105.12811 [physics.bio-ph]. [4] Vicente de P. Rodrigues da Silva, Adelgicio F. Belo Filho, Vijay P. Singh, Rafaela S. Rodrigues Almeida, Bernardo B. da Silva, Inajá F. de Sousa, and Romildo Morant de Holanda. 2017. Entropy theory for analysing water resources in north- eastern region of brazil. Hydrological Sciences Journal, 62, Figure 7: 7 Days Sample 7, 1029–1038. doi: 10.1080/02626667.2015.1099789. eprint: https : / / doi . org / 10 . 1080 / 02626667 . 2015 . 1099789. https : //doi.org/10.1080/02626667.2015.1099789. [5] David A. Dickey and Wayne A. Fuller. 1979. Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, 74, 366a, 427–431. doi: 10 . 1080 / 01621459 . 1979 . 10482531. eprint: https : / / doi . org / 10 . 1080 / 01621459 . 1979 . 10482531. https : //doi.org/10.1080/01621459.1979.10482531. [6] Robert M. Gray. 2011. Entropy and Information Theory. (2nd edition). Springer Publishing Company, Incorporated. isbn: 9781441979698. [7] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short- term memory. Neural Computation, 9, 8, 1735–1780. doi: Figure 8: Prediction for 7 days ahead - Actual forecast using 10.1162/neco.1997.9.8.1735. 
336 timesteps that gives a 7 day future forecast sample using [8] Floris Takens. 1981. Detecting strange attractors in tur- the LSTM model and direct forecasting. It is possible to bulence. In Lecture Notes in Mathematics. Springer Berlin observe that, as in figure 6, the values vary between ≈ 2000 Heidelberg, 366–381. doi: 10.1007/bfb0091924. to ≈ 14000 flow units and the essential dynamics of the [9] Peyman Yousefi, Gregory Courtice, Gholamreza Naser, and time series were understood by the LSTM. Hadi Mohammadi. 2020. Nonlinear dynamic modeling of urban water consumption using chaotic approach (case 8 CONCLUSION study: city of kelowna). Water, 12, 3. issn: 2073-4441. doi: 10.3390/w12030753. https://www.mdpi.com/2073- 4441/12/ Having developed all the necessary machinery for constructing 3/753. a coherent forecasting engine, we come to the conclusion that although the cardinality of the time series data was relatively small, the obtained results are promising and the model will certainly show satisfying results when applied in real time. For the future, we want to continue developing the project by 60 Modeling stochastic processes by simultaneous optimization of latent representation and target variable Jakob Jelenčič Dunja Mladenić Jozef Stefan International Jozef Stefan International Postgraduate School Postgraduate School Jozef Stefan Institute Jozef Stefan Institute Ljubljana, Slovenia Ljubljana, Slovenia jakob.jelencic@ijs.si dunja.mladenic@ijs.si ABSTRACT We have evaluated the proposed method on an equities This paper proposes a novel method for modeling stochastic dataset and a cryptocurrency dataset, in both cases achieving processes, which are known to be notoriously hard to predict extraordinary results on the test dataset. We have also shown accurately. State of the art methods quickly overfit and the importance of noise distribution and how the de-noising create big differences between train and test datasets. 
We fails if the distributions of the data and noise do not align. present a method based on simultaneous optimization of la- tent representation and the target variable that is capable of The rest of the paper is organised as follows. Section 2 dealing with stochastic processes and to some extent reduces describes the data we were using. In section 3 we introduce the overfitting. We evaluate the method on equities and the proposed method. In section 4 we present empirical cryptocurrency datasets, specifically chosen for their chaotic results. In section 5 we conclude by pointing out the main and unpredictable nature. We show that with our method results and defining guidance for the future work. we significantly reduce overfitting and increase performance, compared to several commonly used machine learning algo- 2. DATA rithms: Random forest, General linear model and LSTM The proposed method works well for stochastic processes. deep learning model. Equities are supposed to follow some form of stochastic pro- cess [9], either the Black-Scholes one or some more complex 1. INTRODUCTION process with unknown formulation. In order to evaluate our Time series prediction has always been an interesting chal- method, we have collected daily data of more than 5000 lenge. Deep learning structures that are designed for time equities listed on NASDAQ from 2007 on. The data is freely series are prone to overfitting. Especially if the underlying available on the Yahoo Finance website [2]. We transformed time series is stochastic by nature. Every young researcher’s the data using technical analysis [10] and for test set took first attempt when dealing with time series, was trying to every instance that happened after 2019. We calculated mov- learn a time series model that will predict future prices; ing average using 10 days closing price then tried to predict whether in equities, commodities, forex or cryptocurrencies. the direction of the change of this trendline. 
Unfortunately it is not that simple. One can easily build a near perfect model on the train dataset just to find it is The equity data turned out to be a little bit timid, not completely useless on the test dataset. chaotic enough to demonstrate the full ability of the pro- posed method. This is why we also collected minute data of We propose a novel method that is capable of effectively cryptocurrencies Ethereum and Bitcoin and used the method combatting the overfitting, especially this proves to be a on them as well. Data is available on the crypto exchange difficult task when one is dealing with a problem directly Kraken [1]. We used the same transformation as for the applicable in practical situations. The main idea is to add equities, but with a bit quicker trend. This time the target noise from the same distribution as the training data and variable was change in the trendline in the next 6 hours. For then at the same time optimize the target variable and the the test set we took every instance that has time stamp after latent representation with the help of the autoencoder. The December 2020. longer the training goes, the lower is the amplitude of noise and the less focus is on the optimization of the representation. The reader should note that the end goal is not to accurately predict future equity price, since that is next to impossible. As soon there is a pattern, someone will profit from it and then the pattern will change. By predicting the future trend line, one can obtain a significant confidence interval and estimates of where the price could be, and then design for example a derivative strategy that searches for favourable risk versus rewards trades. 3. PROPOSED METHOD 61 We propose the method designed for prediction of stochastic Algorithm 1 Noise definition processes. The method achieves significant results improving 1: Inputs: X, α, β, epoch the metrics and loss functions on unseen data, where standard 2: Y = [ts, ts, np] . 
Array for holding Cholesky deep learning is prone to over-fit. The main advantage is decompositions of time correlation matrices. reducing the gap between training data and testing data, 3: for t ∈ {1, . . . , np} do sometimes to a degree where one sacrifices a little bit on the 4: Σt = cov(X[, , t]) train side to actually have the model outperforming it on 5: Y [, , t] = chol(Σt) . In practice the test data. This is very important in time series, where a closest positive definite matrix of Σt is computed before prediction model is usually just one part of a bigger strategy the Cholesky decomposition. and where the train over-fit is the biggest issue. For example, 6: end for designing a trading strategy on over-fitted predictions, that 7: Z = [bs, ts, np] . Array for holding noise samples. kind of mistake can lead to huge capital losses. 8: for i ∈ {1, . . . , ts} do 9: Σi = cov(X[, i, ]) The proposed method can be broken down into 3 important 10: Z[, i, ] = mvn(bs, Σi) parts: normalization, noise addition and additional opti- 11: end for mization of latent representation. Each part can be easily 12: for j ∈ {1, . . . , np} do integrated into an already existing pipeline. 13: Z[, , j] = matmul(Z[, , j], Y [, , j]) . Correcting initially independent noise samples with respect to time. 14: end for 3.1 Empirical normalization 15: for w ∈ {1, . . . , ts} do Normalization plays an important role in deep learning mod- 16: Z[, w, ] = Z[, w, ] ∗ ((βts−w · αepoch) · sd) . Decrease els. It was shown that normalization significantly speeds up the noise during the training procedure. the gradient descent, almost independently of where normal- 17: end for ization takes place. It can be weight normalization [11] during 18: R = X + Z the actual optimization, or it can be the batch normalization 19: Return R. [8], or just normalization of the whole input data [7]. 
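Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation; the `nearest_pd` repair step, the array shapes and the default `sd` value are assumptions made for the sketch:

```python
import numpy as np

def nearest_pd(a):
    """Stand-in for the 'closest positive definite matrix' step (line 5):
    symmetrize and clip the eigenvalues so a Cholesky factor exists."""
    sym = (a + a.T) / 2.0
    w, v = np.linalg.eigh(sym)
    return v @ np.diag(np.clip(w, 1e-8, None)) @ v.T

def add_decaying_noise(x, alpha, beta, epoch, sd=1.25):
    """Sketch of Algorithm 1 for an input tensor x of shape (bs, ts, nf)."""
    bs, ts, nf = x.shape
    # Lines 3-6: Cholesky factor of each feature's time-correlation matrix.
    chol = np.empty((ts, ts, nf))
    for t in range(nf):
        sigma_t = nearest_pd(np.cov(x[:, :, t], rowvar=False))
        chol[:, :, t] = np.linalg.cholesky(sigma_t)
    # Lines 7-11: noise correlated across features at each time step.
    z = np.empty((bs, ts, nf))
    for i in range(ts):
        sigma_i = nearest_pd(np.cov(x[:, i, :], rowvar=False))
        z[:, i, :] = np.random.multivariate_normal(np.zeros(nf), sigma_i, size=bs)
    # Lines 12-14: correct the initially time-independent samples.
    for j in range(nf):
        z[:, :, j] = z[:, :, j] @ chol[:, :, j]
    # Lines 15-17: shrink the amplitude over time steps and epochs.
    for w in range(ts):
        z[:, w, :] *= (beta ** (ts - 1 - w)) * (alpha ** epoch) * sd
    # Lines 18-19: return the perturbed input.
    return x + z
```

With α = β = 0.99 and sd = 1.25, the settings reported in Section 4, the perturbation shrinks geometrically as the epoch counter grows.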
In the proposed method it is important that the 3 dimensional input data comes from the same distribution as the generated noise. Since it is fairly straightforward to sample data from a 3 dimensional normal distribution, we normalize the input data using an empirical cumulative distribution function [12] and an empirical copula [4][5]. We align all central moments of the unknown distribution to the ones of the centered and standardised normal distribution. The normalization takes place before the data is reshaped into a 3 dimensional tensor.

3.2 Noise addition
The introduction of noise is not new in unsupervised learning and it was shown that it has a positive effect [14]. Adding noise to the input data and then forcing the model to learn how to ignore it has had a lot of success in generative adversarial networks [3], where convergence can be very tricky to achieve. We transformed that idea and embedded it into a supervised learning procedure. The noise addition is described in Algorithm 1, where we use the following abbreviations:

• X = [bs, ts, np] stands for the input tensor with 3 dimensions: batch size, time steps and number of features used for predictions.
• α, β are parameters that control how fast the noise decreases during the training procedure. They should be between 0 and 1, where a lower value corresponds to a faster decrease in the amplitude of the added noise.
• mvn stands for a function sampling from a two dimensional correlated Gaussian distribution, where Σ is the covariance; matmul stands for matrix multiplication.

3.3 Optimization of latent representation
The most common issue with deep learning optimization is falling into a local optimum and being unable to move past it [13]. We introduce an autoencoder part into the optimization procedure in order to force the model to shift from going directly to a local optimum to learning the latent representation first. We expect that this, combined with the addition of noise, will force the model first to learn how to ignore the noise that we added and the noise that is already in the data by the nature of the stochastic process [15]. We optimized the model using the Adam optimizer [6]. The loss function used in the optimization is defined as

L = L_Y + W_ae · decay^epoch · L_ae,

where L_Y stands for the supervised loss function, which will depend on the problem, while L_ae stands for the loss between the encoded output and the input data. The decay weight decreases the longer the training goes on.

4. RESULTS
We have divided the results section into 2 parts: unsupervised and supervised. In the first we demonstrate why the noise distribution is important. For the unsupervised part, due to hardware constraints, we have only used the cryptocurrency dataset, since we deemed it more demanding than the equity one. In the second, we demonstrate how our method increases the test metric on both datasets.

4.1 Unsupervised learning results
In order to test the efficiency of distributed noise versus just random noise, we created 3 models. The baseline model was a deep learning model with 3 stacked LSTM layers, an encoded layer, then again 3 stacked LSTM layers for the decoded output. We used Adam as the optimizer and mean-squared error as the loss function. We stopped the learning after there was no improvement for 25 epochs on the validation set. The validation set was randomly taken out of the train set. Parameters α and β were both set to 0.99 and sd was initially set to 1.25. The noise decreases with the learning procedure; interestingly, keeping the noise constant did not achieve any results.

Initially we tested the baseline model versus the de-noising model, but with uncorrelated noise. In Figure 1 the de-noising test loss function is plotted in green and the baseline test loss function in blue. Training was stopped relatively early compared to Figure 2 and it is also obvious that the de-noising test loss is even worse than that of the classic autoencoder.

Figure 1: Test loss of the autoencoder model with random noise (green) versus no noise (blue).

In the second example we switched from uncorrelated noise to noise with the same distribution as the input data. As is apparent in Figure 2, where again the de-noising test loss is plotted in green and the classic test loss in blue, the de-noising autoencoder achieved a lower test loss than the classic one.

Figure 2: Test loss of the autoencoder model with correlated noise (green) versus no noise (blue).

What we expected is that the train and validation losses would then be worse than with the classic autoencoder. Surprisingly, that was not the case. With the de-noising autoencoder using noise with the same distribution as the input data, both train and validation losses were better than with the classic one. This result is definitely worth further investigation and experimentation.

4.2 Supervised learning results
In the previous section we have shown that the distribution of the noise matters. In this section we show that noise combined with optimization of the latent representation significantly improves metrics on unseen data. Similarly as before, α and β were both set to 0.99 and sd was initially set to 1.25. From our experience this setting achieves the best results, but further exploration needs to be done. W_ae was initially set to 5 and decay to 0.95. Since we now operate in a supervised environment, we can compare our models to the majority class. But to really demonstrate the effectiveness of the method, we chose to compare the following models:

• Majority class, which serves as a sanity check.
• Random Forest with 500 trees.
• Generalized linear model.
• Deep learning model with 3 stacked LSTM layers.
• Deep learning model with 3 stacked LSTM layers and optimization of the latent representation.
• Deep learning model with 3 stacked LSTM layers and correlated noise addition.
• Finally, a deep learning model with 3 stacked LSTM layers, correlated noise addition and optimization of the latent representation.

All 4 of the deep learning models are identical; all are optimized with Adam, with categorical cross entropy as the loss function for the supervised part and mean squared error for the autoencoder part. Initially we only tested the models on the equities data, but it turned out that the equities were not chaotic enough.
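With the reported settings (W_ae initially 5, decay 0.95), the weighting of the autoencoder term in the loss from Section 3.3 can be sketched as follows; only the quoted parameter values come from the text, the rest is illustrative:

```python
def combined_loss(l_y, l_ae, epoch, w_ae=5.0, decay=0.95):
    """L = L_Y + W_ae * decay^epoch * L_ae (Section 3.3)."""
    return l_y + w_ae * decay ** epoch * l_ae

# The autoencoder term dominates early and fades as training progresses.
ae_weights = [5.0 * 0.95 ** e for e in (0, 10, 50, 100)]
```

The geometric decay means the network is pushed to learn the latent representation first and only later focuses almost entirely on the supervised target.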
By that we mean that, especially with the deep learning models, the difference between train and test loss was not big enough to be problematic. From previous work experience we know that overfit is a big issue in the cryptocurrency dataset, so we decided to test that dataset in a supervised setting as well. All models were trained three times on each dataset and the results in Table 1 and Table 2 are the averages of the 3 runs.

In Table 1 we show the results on the equity dataset. Our method managed to improve test accuracy (from 0.673 to 0.682) without decreasing train accuracy (0.681). Maintaining the test accuracy comparable to the train one is important if one needs to build an additional strategy upon the predictions. Just the noise addition slightly improved the results (from 0.673 to 0.675), while just the optimization of the latent distribution does not improve anything.

Table 1: Supervised results on the equity dataset.
Method | Train Accuracy | Test Accuracy
Majority | 0.513 | 0.537
Random Forest | 0.649 | 0.655
GLM | 0.664 | 0.655
LSTM | 0.681 | 0.673
latent LSTM | 0.633 | 0.673
noise LSTM | 0.681 | 0.675
latent noise LSTM | 0.681 | 0.682

In Table 2 we show the results on the cryptocurrency dataset. Similarly as on the equity dataset, our method behaves as intended on the cryptocurrency dataset as well. We can see reduced overfitting compared to the plain LSTM model, where it is clearly apparent. With those results we can conclude that the proof of concept works, but for additional claims we will need more testing and a deeper parameter analysis.

Table 2: Supervised results on the cryptocurrency dataset.
Method | Train Accuracy | Test Accuracy
Majority | 0.512 | 0.556
Random Forest | 0.689 | 0.692
GLM | 0.682 | 0.695
LSTM | 0.754 | 0.696
latent LSTM | 0.736 | 0.683
noise LSTM | 0.697 | 0.695
latent noise LSTM | 0.706 | 0.714

It is interesting to point out that with the proposed method the test loss on the cryptocurrency dataset was 0.552, while the train loss was 0.592. While 0.552 was the best loss any deep learning model achieved, that wide difference indicates that we could improve our model even further by fine tuning the parameters.

5. CONCLUSIONS AND FUTURE WORK
In this work we have introduced and demonstrated how the addition of noise and the simultaneous optimization of the latent representation and the target variable reduce overfitting on time series data. In the unsupervised case we have shown that the distribution of the noise matters and must align with the input data to achieve the maximum effect from the noise addition.

In future work we have to estimate the effect of the newly introduced parameters on the method's convergence. At the same time we need to explore how the method behaves when embedded into larger models, transformers for example. We also need to evaluate the method on datasets that are by nature stochastic but do not come from the financial domain. Finally, we need to evaluate our method on a dataset that is not stochastic.

6. ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency. We also wish to thank prof. dr. Ljupčo Todorovski for his help, especially with the unsupervised results.

7. REFERENCES
[1] Kraken exchange. https://www.kraken.com/.
[2] Yahoo Finance. https://finance.yahoo.com/.
[3] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53-65, 2018.
[4] P. Jaworski, F. Durante, W. K. Hardle, and T. Rychlik. Copula theory and its applications, volume 198. Springer, 2010.
[5] H. Joe. Dependence Modeling with Copulas. CRC Press, 2014.
[6] D. Kingma and J. Ba. Adam: A method for stochastic optimization. 2014. https://arxiv.org/abs/1412.6980.
[7] K. Y. Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint arXiv:1611.04831, 2016.
[8] M. Liu, W. Wu, Z. Gu, Z. Yu, F. Qi, and Y. Li. Deep learning based on batch normalization for P300 signal detection. Neurocomputing, 275:288-297, 2018.
[9] R. C. Merton. Option pricing when underlying stock returns are discontinuous. Journal of Financial Economics, 3(1-2):125-144, 1976.
[10] J. J. Murphy. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications. New York Institute of Finance, 1999.
[11] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29:901-909, 2016.
[12] B. W. Turnbull. The empirical distribution function with arbitrarily grouped, censored and truncated data. Journal of the Royal Statistical Society: Series B (Methodological), 38(3):290-295, 1976.
[13] R. Vidal, J. Bruna, R. Giryes, and S. Soatto. Mathematics of deep learning. arXiv preprint arXiv:1712.04741, 2017.
[14] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103, 2008.
[15] N. Wax. Selected papers on noise and stochastic processes. Courier Dover Publications, 1954.
64 Causal relationships among global indicators Matej Neumann Marko Grobelnik Jožef Stefan Institute Jožef Stefan Institute Jamova cesta 39, Ljubljana, Slovenia Jamova cesta 39, Ljubljana, Slovenia matej.neumann@student.fmf.uni- marko.grobelnik@ijs.si lj.si ABSTRACT This is the official source published by the United Nations it It is important to know how changing one thing will affect provides information on the development and implementation of another. This becomes even more important when the thing we an indicator framework for the follow up and review of the 2030 are changing will affect a lot of people. Therefore, we need a way Agenda for Sustainable Development [4]. to visualize how all the things are connected. In this paper, we will demonstrate an approach that uses Granger causality to find 2.2 The World Bank (WB) causal relationships between global indicators. Our results show As the data set provided by the UN itself often has missing that global indicators are indeed highly interconnected however, values, which results in unhealthy timeseries and unreliable they still need to be looked at within each country individually. results, we decided to add the dataset “World Development We also comment how this approach can be used to help with Indicators” from The World Bank [5]. Although the data set policy making decisions. might not be as official as the one provided by the UN, it does contain 1440 unique indicators for 266 different countries and KEYWORDS groups, where each indicator contains a timeseries ranging from Causality, Global indicators, Granger, Timeseries, SDGs the year 1960 to the present time. This addition does not only make the dataset healthier, it also introduces new indicators that are not listed in the UN SDGs. Even so our new dataset still has 1 INTRODUCTION some limitations. From Figure 1 we can see that on average a The Sustainable Development Goals (SDGs) launched on country or groups has no values for around 33% of its indicators. 
January 1, 2016 include 17 goals, 169 targets and 232 unique Therefore, from now on when talking about the indicators, we indicators with the intent to help frame the policies of the United will restrict ourselves to just those ones that have at least 20 Nations’ (UN) member states through 2030 [8]. Because the nonmissing values in their timeseries. This restriction will insure goals are highly interconnected, as the indicators are not that we are always dealing with a healthy timeseries and it is independent, it is important to understand synergies, conflicts justified as on average those indicators make up about 50% of all and causal relationships between them to support decisions. of the ones available as seen in Figure 2. Without such understanding a policy to help one goal could hurt another. For example, a policy aiming to improve hunger could conflict with climate-mitigation. This paper will focus on finding such relationship with Granger causality. Granger causality is a statistical concept of causality that is based on prediction and was traditionally only used in the financial domain however, over recent years there has been growing interest in the use of Granger causality to identify causal interactions in neural data [6]. Similar works such as [7] and [2] have already looked for causal relationships between specific SDGs. This paper confirms the previously done work and expands it by adding additional indicators and looking for causal relationship between all the indicators, not just the ones focused on SDGs. In paper [2] the authors say that the analysis of all of the indicators country by country is without doubt impractical. Figure 1: Percentage of indicators having x nonmissing Nevertheless, Table 2 shows that however impractical it may be, values in its timeseries. it is still required, as even neighboring countries have vastly different causal relationships. 
2 DESCRIPTION OF DATA 2.1 United Nations Statistics Division (UNSD) 65 future values of Y with both the past values of X and Y and not just the past values of Y. More formally, let x and y be stationary timeseries and let x(t) and y(t) be the univariate autoregression of x and y respectfully: 𝑝 𝑥(𝑡) = 𝑏 0 + ∑ 𝑏𝑖𝑥(𝑡 − 𝑖) + 𝐸2(𝑡) 𝑖=1 𝑝 𝑦(𝑡) = 𝑎0 + ∑ 𝑎𝑖𝑦(𝑡 − 𝑖) + 𝐸1(𝑡) 𝑖=1 where p is the number of chosen lagged values included in the model, 𝑎𝑖 and 𝑏𝑖 are contributions of each lagged observation to the predicted values of 𝑥(𝑡) and 𝑦(𝑡) and 𝐸 𝑖(𝑡) the difference between the predicted value and the actual value. To test the null hypothesis that x does not Granger-cause y, we augment 𝑦(𝑡) by including the lagged values of 𝑥 to get: Figure 2: percentage of indicators having at least x 𝑝 nonmissing values in its timeseries. 𝑦(𝑡) = 𝑐0 + ∑ 𝑎𝑖𝑦(𝑡 − 𝑖) + 𝑏𝑖𝑥(𝑡) + 𝐸3(𝑡). 𝑖=1 To better imagine what kind of indicators we are dealing with, We then say that x Granger-causes y if the coefficients 𝑏𝑖 are we can check Table 1 which shows the top 10 most common ones. jointly significantly different from zero. This can be tested by performing an F-test of the null hypothesis that 𝑏𝑖 = 0 for all i. Indicator name Frequency Renewable electricity output 265 3.2 Statistical significance and the p-value (% of total electricity output) In testing, a result has statistical significance if it is unlikely to Population, total 265 occur assuming the null hypothesis. More precisely, a Population growth (annual %) 265 significance level α, is the probability of the test rejecting the null Nitrous oxide emissions in 265 hypothesis, given that the null hypothesis was assumed to be true energy sector (thousand metric and the p-value is the probability of getting result at least as tons of CO2 equivalent) extreme, given that the null hypothesis is true. Then we say that Methane emissions in energy 265 the result is statistically significant when 𝑝 ≤ 𝛼. 
sector (thousand metric tons of CO2 equivalent) 3.3 Limitations of the Granger causality test Agricultural nitrous oxide 265 As its name implies, Granger causality is not necessarily true emissions (thousand metric tons causality. Having said this, it has been argued that given a of CO2 equivalent) probabilistic view of causation, Granger causality can be Agricultural methane 265 considered true causality in that sense, especially when emissions (thousand metric tons Reichenbach's "screening off" notion of probabilistic causation of CO2 equivalent) is considered [1]. Urban population growth 263 A problem may occur if both timeseries x and y are connected (annual %) via a third timeseries z. In that case our test can reject the null Urban population (% of total 263 hypothesis even if manipulation of one of the timeseries would population) not change the other. Other possible sources of problems can Urban population 263 happen due to: (1) not frequent enough or too frequent sampling, Table 1: Most common indicators and their frequency of 20 (2) time series nonstationarity, (3) nonlinear causal relationship. nonmissing values 4 EXPERIMENTS 3 METHODOLOGY 4.1 Setup Due to time constraints and the limitations of my home system, 3.1 Granger causality we decided to limit ourselves to taking just a few countries and The causal relationships between indicators were determined by groups and calculating the causality relationships for them. The the Granger causality test. The Granger causality test is a ones we decided on are: (1) United States, (2) China, (3) statistical hypothesis test for determining whether one timeseries Uruguay, (4) Slovenia, (5) Austria, (6) Croatia, (7) Italy, (8) is useful in forecasting another. Informally we say that timeseries European Union and (9) OECD. 
Our plan was to choose X Granger-causes timeseries Y if predictions of the value of Y based on its own past values and on the past values of X are better than predictions of Y based only on Y's own past values. Or in other words X Granger-causes Y if we can better explain the 66 AUS CH CRO EU ITA OECD SLO UY USA AUS 100% 4.8% 5.1% 6.9% 6.7% 6.0% 5.9% 4.4% 7.1% CH 100% 5.6 3.5% 4.3% 3.9% 4.2% 4.7% 4.3% CRO 100% 4.6% 5% 3.3% 6.6% 3.8% 5.6% EU 100% 11% 20% 5.7% 3.6% 10% ITA 100% 6.7% 7.5% 3.8% 6.7% OECD 100% 5% 3% 17% SLO 100% 3.5% 5.6% UY 100% 4.2% USA 100% Table 2: Percentage of same causal relationships. That being said one can easily imagine why each population age Granger-causes a few of the major world powers and compare the differences and similarities between the causal relationships. the next one. For example, if we know the percentage of people 4.2 Modeling the dataset aged 4, we can pretty accurately predict what the percentage of Once the data was collected from the UNSD and WB website it people aged 5 is going to be in the next year. first had to be put into a suitable form. We decided on a 3D matrix where the first component represented the country or SDG Buzzwords group, the second component represented the time series and last Zero Hunger nourishment, food, stun, anemia, one representing the indicator. agriculture 4.3 Parameters Clean Water and water, sanitation, drinking, drink, Sanitation hygiene, freshwater As mentioned before, when searching for causal relationships in a certain country or group we limit ourselves only to those Affordable and energy, electricity, fuel indicators who have at least 20 nonmissing values. Furthermore, Clean Energy we chose a significance level of 0.05 or 5% and tested for lagged Climate Action disaster, disasters, climate, natural, values from 1 to 4. 
4.4 Determining causality
Once the modeling was done and the parameters were set, we first needed to make sure that the timeseries were stationary. To do that, we ran the ADF test and differenced the time series accordingly to make them stationary. Then we ran the Granger causality test 4 times, once for each lagged value, for each of the 9 countries and groups listed in 4.1. The results were then saved in a 1440x1440 weighted adjacency matrix, where the (i, j) element is nonzero if and only if the i-th indicator Granger-caused the j-th indicator for all lagged values between 1 and 4; its weight is the average of the 4 p-values. Once we had the weighted adjacency matrix, we matched the available indicators with the 17 SDGs by comparing the most common buzzwords found in the descriptions of the SDGs and the names of the indicators. An example of some of the buzzwords can be seen in Table 3.

Zero Hunger — nourishment, food, stun, anemia, agriculture
Clean Water and Sanitation — water, sanitation, drinking, drink, hygiene, freshwater
Affordable and Clean Energy — energy, electricity, fuel
Climate Action — disaster, disasters, climate, natural, risk, Sendai, environment, environmental, green, developed, pollution
Good Health and Well-Being — mortality, birth, infection, tuberculosis, malaria, hepatitis, disease, cancer, diabetes, treatment, alcohol, death, health, pollution, medicine

Table 3: Some of the most common buzzwords found in the SDGs

5 RESULTS
With the weighted adjacency matrix in hand, it is sensible to ask whether there exist any causal relationships that hold true for each of the tested countries or groups. The answer is positive, as seen in Figure 3. We can, however, see that the only causal relationships that survived were the ones that connected different population ages to each other. This result seems sensible, as in general no two countries are exactly the same and are therefore going to have a unique set of causal relationships.

Figure 3: Only causal relationships that are true for each of the 9 countries and groups (continuous down).
On the other hand, one may assume that if we compare countries which are close to each other or are historically connected, then the causal relationships should not differ by a lot. That, however, is not the case, as can be seen in Table 2. This suggests that, when talking about causal relationships, one must look at each country or group individually.

Figure 4: Interconnectedness of SDGs.

Therefore, let's focus just on Slovenia. Due to Slovenia having 10083 positive causal relationships, we will limit ourselves to just those that interact with SDGs. Figure 4 shows that SDGs are indeed not independent and are in fact highly interconnected. The presence of self-loops also suggests that there exist causal relationships between indicators of an SDG itself. This result has two consequences:
• When thinking about policies aiming to improve one goal, we need to be careful not to harm another.
• Instead of outright improving one goal, we can instead focus on the ones that are in a causal relationship with the one we wish to improve.

Let's give an example. Suppose we wanted to implement a policy to help lower the suicide mortality rate, but we are not sure how to do that directly. We can therefore instead check which indicators Granger-cause the one we are trying to improve. In our case, the indicator "Unemployment, youth total (% of total labor force ages 15-24)" Granger-causes the suicide mortality rate. Therefore, if we improved the percentage of unemployed young people, we would also be able to reduce the suicide mortality rate, which was our initial goal.

6 CONCLUSION AND FUTURE WORK
In this paper we demonstrated an approach for calculating causality between interdependent global indicators and outlined how this can help with implementing policies. We also showed that neighboring and similar countries in general do not have the same causal relationships, which makes it hard to group them together. However, finding such a grouping, if it exists, could be done in the future. The approach shown in this paper could also be applied to find causal relationships between certain Google searches and natural events. For example, we could check whether there is any correlation between an increase in users searching the words "water", "rain", or "cloud" and the likelihood of a flood happening.

7 ACKNOWLEDGMENTS
This work has been supported by the Slovenian research agency.

8 REFERENCES
[1] M. Michael and S. L. Bressler, "Foundational perspectives on causality in large-scale brain networks," Physics of Life Reviews, pp. 107-123, 2015.
[2] G. Dörgő, V. Sebestyén and J. Abonyi, "Evaluating the Interconnectedness of the Sustainable Development Goals Based on the Causality Analysis of Sustainability Indicators," Sustainability, 2018.
[3] C. Stefano and S. Sangwon, "Cause-effect analysis for sustainable development policy," NRC Research Press, 2017.
[4] https://unstats.un.org/sdgs/indicators/database/.
[5] https://datacatalog.worldbank.org/search/dataset/0037712/World-Development-Indicators.
[6] B. Corrado and K. Peter, "On the directionality of cortical interactions studied," Biological Cybernetics, 1999.
[7] K. Irfan, H. Fujun and P. L. Hoang, "The impact of natural resources, energy consumption, and population growth on environmental quality: Fresh evidence from the United States of America," Science of The Total Environment, 2020.
[8] H. Tomáš, J. Svatava and M. Bedřich, "Sustainable Development Goals: A need for relevant indicators," Ecological Indicators, pp. 565-573, 2016.


Active Learning for Automated Visual Inspection of Manufactured Products

Elena Trajkova* — University of Ljubljana, Faculty of Electrical Engineering, Ljubljana, Slovenia — trajkova.elena.00@gmail.com
Jože M. Rožanec* — Jožef Stefan International Postgraduate School, Ljubljana, Slovenia — joze.rozanec@ijs.si
Paulien Dam — Philips Consumer Lifestyle BV, Drachten, The Netherlands — paulien.dam@philips.com
Blaž Fortuna (Qlector d.o.o., Ljubljana, Slovenia)
Dunja Mladenić
(Jožef Stefan Institute, Ljubljana, Slovenia)
blaz.fortuna@qlector.com, dunja.mladenic@ijs.si
*Both authors contributed equally to this research.

ABSTRACT
Quality control is a key activity performed by manufacturing enterprises to ensure products meet quality standards and avoid potential damage to the brand's reputation. The decreased cost of sensors and connectivity enabled an increasing digitalization of manufacturing. In addition, artificial intelligence enables higher degrees of automation, reducing the overall costs and time required for defect inspection. In this research, we compare three active learning approaches and five machine learning algorithms applied to visual defect inspection with real-world data provided by Philips Consumer Lifestyle BV. Our results show that active learning reduces the data labeling effort without detriment to the models' performance.

CCS CONCEPTS
• Information systems → Data mining; • Computing methodologies → Computer vision problems; • Applied computing.

KEYWORDS
Smart Manufacturing, Machine Learning, Automated Visual Inspection, Defect Detection

ACM Reference Format:
Elena Trajkova, Jože M. Rožanec, Paulien Dam, Blaž Fortuna, and Dunja Mladenić. 2021. Active Learning for Automated Visual Inspection of Manufactured Products. In Ljubljana '21: Slovenian KDD Conference on Data Mining and Data Warehouses, October, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 4 pages.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SiKDD '21, October, 2021, Ljubljana, Slovenia
© 2021 Copyright held by the owner/author(s).

1 INTRODUCTION
Quality control is one of the critical activities that must be performed by manufacturing enterprises [27, 28]. The main purpose of such activity is to detect products that do not meet quality standards, avoid rework and supply chain disruptions, and avoid potential damage to the brand's reputation [3, 27]. Along with the information regarding defective products, it provides insights into when and where such defects occur, which can be used to further dig into the root causes of such defects and into mitigation actions to improve the quality of manufacturing products and processes.

The decreased cost of sensors and connectivity enabled an increasing digitalization of manufacturing [3], which, along with the adoption of Artificial Intelligence (AI) [12], represents an opportunity towards enhancing defect detection in industrial settings [5]. Manual inspection has low scalability (it requires time to train an inspector, the employees can work a limited amount of time and are subject to fatigue, and the inspection itself is slow), its quality can be affected by operator-to-operator inconsistency, and it depends on the complexity of the task, the employees (e.g., their intelligence, experience, well-being), the environment (e.g., noise and temperature), and the management support and communication [23]; none of these factors affect the outcome of automated quality inspection. Machine learning has been successfully applied to defect detection in a wide range of scenarios [1, 9, 11, 15, 21].

An annotated dataset must be acquired to successfully implement machine learning models for defect detection. The increasing number of sensors provides large amounts of data. As the manufacturing process quality increases, the data obtained from the sensors is expected to be highly imbalanced: most of the data instances will correspond to non-defective products, and a small proportion of them will correspond to different kinds of defects. Annotating all the data is prone to similar limitations as the manual inspection described in the paragraph above. It is thus imperative to provide strategies to select a limited subset of instances that are most informative to the defect detection models.

We frame the defect detection problem as a supervised learning problem. Given a large amount of unlabeled data, and based on the premise that only a tiny fraction of the data provides new information to the model and thus has the potential to enhance its performance, we adopt an active learning approach. Active learning is a subfield of machine learning that attempts to identify the most informative unlabeled data instances, for which labels are requested from some oracle (e.g., a human expert) [24]. This research compares three active learning strategies: pool-based sampling, stream-based sampling, and query by committee.

The main contributions of this research are (i) a comparative study between the five most frequently cited machine learning algorithms for automated defect detection and (ii) three active learning approaches, (iii) evaluated on a real-world multiclass classification problem.
We develop the machine learning models with images provided by the Philips Consumer Lifestyle BV corporation. The dataset comprises shaver images divided into three classes, based on the defects related to the printing of the logo of the Philips Consumer Lifestyle BV corporation: good shavers, shavers with double printing, and shavers with interrupted printing.

We evaluate the models using the area under the receiver operating characteristic curve (AUC ROC, see [4]). AUC ROC is widely adopted as a classification metric, having many desirable properties, such as being threshold independent and invariant to a priori class probabilities. We measure AUC ROC considering prediction scores cut at a threshold of 0.5.

This paper is organized as follows. Section 2 outlines the current state of the art and related works, Section 3 describes the use case, and Section 4 provides a detailed description of the methodology and experiments. Finally, Section 5 outlines the results obtained, while Section 6 concludes and describes future work.

2 RELATED WORK
Among the many techniques used for automated defect inspection, we find automated visual inspection, which refers to image processing techniques for quality control, usually applied in the production line of manufacturing industries [1]. Visual inspection requires extracting features from the images, which are used to train the machine learning model. This procedure is simplified when using deep learning models, which enable end-to-end learning, where a single architecture can perform feature extraction and classification [10, 18], and have shown state-of-the-art performance for image classification [20].

Automated visual inspection for defect detection has been applied to multiple manufacturing use cases. [21] manually extracted features (e.g., histograms) from machine component images and compared the performance of the Näive Bayes and C4.5 models. [9] extracted statistical features from the images and compared the performance of Support Vector Machines (SVM), Multilayer Perceptron (MLP), and k-nearest neighbors (kNN) models for visual inspection of microdrill bits in printed circuit board production. [11] used 3D convolutional filters applied on computed tomography images and an SVM classifier for defect detection during metallic powder bed fusion in additive manufacturing. [15] used some heuristics to detect regions of interest on slate slab images, on which they performed feature engineering to later train an SVM model. Finally, [1] reported using a custom neural network for feature extraction and an SVM model for classification when inspecting aerospace components.

While the authors cited above worked with fully labeled datasets, a production line continually generates new data, exceeding the labeling capacity. A possible solution to this issue is the use of active learning, where the active learner identifies informative unlabeled instances and requests labels from some oracle. Typical scenarios involve (i) membership query synthesis (a synthetic data instance is generated), (ii) stream-based selective sampling (the unlabeled instances are drawn one at a time, and a decision is made whether a label is requested or the sample is discarded), and (iii) pool-based selective sampling (queries samples from a pool of unlabeled data). Among the frequently used querying strategies, we find (i) uncertainty sampling (select the unlabeled sample with the highest uncertainty, given a certain metric or machine-learning model [17]), or (ii) query-by-committee (retrieve the unlabeled sample with the highest disagreement between a set of forecasting models (the committee)) [6, 24]. More recently, new scenarios have been proposed leveraging reinforcement learning, where an agent learns to select images based on the similarity relationship between the instances and rewards obtained based on the oracle's feedback [22]. In addition, it has been demonstrated that ensemble-based active learning can effectively counteract class imbalance through the acquisition of new labeled images [2].

Active learning has been successfully applied in the manufacturing domain, but scientific literature remains scarce in this domain [19]. Some use cases include the automatic optical inspection of printed circuit boards [8] and the identification of the local displacement between two layers of a chip in the semiconductor industry [25]. The use of machine learning automates defect detection, and active learning enables an inspection by exception [5], only querying for labels of the images the model is most uncertain about. While this considerably reduces the volume of required inspections, it is also essential to consider that it can produce an incomplete ground truth by missing the annotations of defective parts classified as false negatives and not queried by the active learning strategy [7].

3 USE CASE
The use case provided for this research corresponds to the visual inspection of shavers produced by Philips Consumer Lifestyle BV. The visual quality inspection aims to detect defective printing of a logo on the shavers. This use case focuses on four pad printing machines set up for a range of different products and different logos. A lot of products are produced every day on these machines; they are manually handled and inspected for their visual quality, and removed from further processing if the prints on the products are not classified as good. Operators spend several seconds handling, inspecting, and labeling each product. An automated visual quality inspection system would strongly reduce the need to manually inspect and label the images, and it could speed up the process by more than 40%. Currently, there are two types of defects related to the printing quality of the logo on the shaver: double printing and interrupted printing. Therefore, images are classified into three classes: good printing (class zero), double printing (class one), and interrupted printing (class two). A labeled dataset with a total of 3,518 images was provided to train and test the models.

4 METHODOLOGY
We pose automated defect detection as a multiclass classification problem.
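The pool-based uncertainty-sampling idea described in Section 2 can be sketched in a few lines of scikit-learn. The synthetic dataset and logistic-regression scorer below are illustrative stand-ins, not the data or models used in this study:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy three-class problem standing in for the shaver images.
X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_lab, y_lab, X_pool = X[:50], y[:50], X[50:]   # small labeled seed + pool

model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
proba = model.predict_proba(X_pool)
uncertainty = 1.0 - proba.max(axis=1)    # least-confidence criterion
query_idx = int(np.argmax(uncertainty))  # instance sent to the oracle
```

After the oracle labels the queried instance, it is moved from the pool to the labeled set and the model is retrained; stream-based sampling applies the same criterion to one incoming instance at a time instead of to the whole pool.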
We measure the model's performance with the AUC ROC metric, using the "one-vs-rest" heuristic method, which involves splitting the multiclass dataset into multiple binary classification problems. Furthermore, we calculate the metrics for each class and compute their average, weighted by the number of true instances for each class.

To extract features from the images, we make use of the ResNet-18 model [13], extracting embeddings from the Average Pooling layer. Since the embedding results in 512 features, which could cause overfitting, we use mutual information to evaluate the most relevant ones and select the top K features, with K = √N, where N is the number of data instances in the train set, as suggested in [14].

To evaluate the models' performance across different active learning strategies, we apply a stratified k-fold cross-validation [29], using one fold for testing, one fold as a pool of unlabeled data for active learning, and the rest for training the model.

When analyzing the results, we were interested in how the models' performance evolved through time and in significant variations between the first and last results observed. To that end, we assessed the statistical significance between the means of the first and last quartiles of the test fold for each active learning scenario. We assessed the statistical significance using the Wilcoxon signed-rank test, with a p-value of 0.05. While such variations existed and were positive in most test folds (the models learned through time), the improvements were not statistically significant in any of the scenarios.
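The K = √N mutual-information selection step can be sketched as follows. The random matrix stands in for the 512-dimensional ResNet-18 embeddings, and the sample size is an arbitrary choice for illustration, not the paper's:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
N = 400                                  # assumed train-set size
embeddings = rng.normal(size=(N, 512))   # stand-in for ResNet-18 embeddings
labels = rng.integers(0, 3, size=N)      # three print-quality classes

K = int(np.sqrt(N))                      # K = sqrt(400) = 20
mi = mutual_info_classif(embeddings, labels, random_state=0)
top_k = np.argsort(mi)[-K:]              # indices of the K highest-MI features
X_selected = embeddings[:, top_k]        # reduced feature matrix
```

With real embeddings, `mi` would concentrate on the dimensions that are informative about the defect classes; on this random stand-in it merely demonstrates the shapes involved.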
We adopt k = 10 based on recommendations by [16], and query all available unlabeled instances to evaluate the active learning approaches. We compare three active learning scenarios: (i) stream-based classifier uncertainty sampling, accepting instances with an uncertainty above the 75th percentile of observed instances, (ii) pool-based sampling, selecting the instances a given model is most uncertain about, and (iii) pool-based sampling considering a query-by-committee strategy, where the committee is created with models trained with the five algorithms we consider in this research: Gaussian Näive Bayes, CART (Classification and Regression Trees, similar to C4.5, but it does not compute rule sets), Linear SVM, MLP, and kNN. Comparing deep learning models remains a subject of future work. Finally, we compare the performance of the active learning scenarios by computing the average AUC ROC of each fold and assess whether the differences in results obtained from each model are statistically significant by using the Wilcoxon signed-rank test [26], using a p-value of 0.05.

5 RESULTS AND ANALYSIS
The results obtained from the experiments described in Section 4 are presented in Table 1 and Table 2. Table 1 reports the average AUC ROC per active learning scenario and model for each cross-validation test fold. We observe that the best model across strategies is the MLP, which achieved the best or second-best performance across almost every fold in pool-based and stream-based active learning. Among those two scenarios, the best results were obtained for stream-based active learning. We observed the same across the rest of the models, though the differences were not significant for all but the Näive Bayes models (see Table 2). Query-by-committee displayed a strong performance, showing the best results immediately after the MLP.

6 CONCLUSION
In this paper, we compared three active learning scenarios (pool-based, stream-based with classifier uncertainty sampling, and query-by-committee) across five machine learning algorithms (Gaussian Näive Bayes, CART, Linear SVM, MLP, and kNN). We found that the best performance was achieved by the MLP model regardless of the active learning strategy. The second-best performance was obtained through the query-by-committee strategy, while the frequently used SVM models ranked third. We found no significant difference between using pool-based or stream-based active learning approaches. Results from the query-by-committee approach were statistically significant in all cases and better than all the models, except for the MLPs. Finally, we found no case where the improvement between the first and last quartile of the test fold in each active learning scenario was significant. We believe that further investigation is required to determine whether a larger pool of unlabeled images would help achieve such a significant difference. Future work will focus on data augmentation techniques that could help achieve a statistically significant improvement over time when applying active learning techniques.

ACKNOWLEDGMENTS
This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 program project STAR under grant agreement number H2020-956573. The authors acknowledge the valuable input and help of Jelle Keizer and Yvo van Vegten from Philips Consumer Lifestyle BV.

REFERENCES
[1] Carlos Beltrán-González, Matteo Bustreo, and Alessio Del Bue. 2020. External and internal quality inspection of aerospace components. In 2020 IEEE 7th International Workshop on Metrology for AeroSpace (MetroAeroSpace). IEEE, 351–355.
[2] William H Beluch, Tim Genewein, Andreas Nürnberger, and Jan M Köhler. 2018. The power of ensembles for active learning in image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9368–9377.
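The per-model comparison of scenarios reduces to a paired Wilcoxon signed-rank test over the ten fold-wise AUC ROC values. For example, feeding the MLP rows of Table 1 to scipy yields a p-value close to the stream-based vs. pool-based MLP entry of Table 2 (the exact value depends on scipy's handling of the zero difference in fold 4):

```python
import numpy as np
from scipy.stats import wilcoxon

# MLP AUC ROC per cross-validation fold, copied from Table 1.
stream = np.array([0.9900, 0.9928, 0.9846, 0.9563, 0.9804,
                   0.9807, 0.9710, 0.9729, 0.9793, 0.9845])
pool = np.array([0.9892, 0.9921, 0.9845, 0.9563, 0.9790,
                 0.9803, 0.9702, 0.9723, 0.9806, 0.9840])

stat, p = wilcoxon(stream, pool)  # paired, two-sided
significant = p < 0.05            # not significant at the 5% level
```

Stream-based is better on eight of the nine non-tied folds, yet the test does not reject at the 5% level, matching the paper's conclusion that the pool-based and stream-based scenarios do not differ significantly for the MLP.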
When assessing the statistical significance between the query-by-committee scenario and the results obtained from the different models with stream-based and pool-based strategies, we observed that the differences were significant in all cases, except for the SVM models. SVM models, the most widely used in active learning literature related to automated defect inspection, were the third-best models among the tested ones, immediately after the MLPs, in stream-based and pool-based active learning and the query-by-committee approach. SVM models did not display significant differences when compared across different active learning scenarios. The worst results were consistently observed for the CART models.

Active Learning scenario / Model   Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10
stream-based
  CART                             0.8168  0.7828  0.7810  0.7694  0.8196  0.7805  0.7843  0.7970  0.8409  0.7940
  kNN                              0.9289  0.9121  0.9174  0.8686  0.9024  0.9000  0.9051  0.8960  0.9282  0.9082
  MLP                              0.9900  0.9928  0.9846  0.9563  0.9804  0.9807  0.9710  0.9729  0.9793  0.9845
  Näive Bayes                      0.8818  0.8668  0.8819  0.8686  0.8829  0.8899  0.8650  0.8877  0.8864  0.9098
  SVM                              0.9752  0.9828  0.9725  0.9530  0.9816  0.9720  0.9570  0.9412  0.9824  0.9712
pool-based
  CART                             0.7584  0.7904  0.7543  0.7468  0.8441  0.7730  0.8044  0.7701  0.7850  0.7412
  kNN                              0.9189  0.9149  0.9161  0.8581  0.9055  0.9036  0.8961  0.8910  0.9224  0.9056
  MLP                              0.9892  0.9921  0.9845  0.9563  0.9790  0.9803  0.9702  0.9723  0.9806  0.9840
  Näive Bayes                      0.8800  0.8654  0.8809  0.8677  0.8813  0.8895  0.8637  0.8873  0.8850  0.9090
  SVM                              0.9752  0.9819  0.9726  0.9518  0.9806  0.9712  0.9562  0.9412  0.9823  0.9722
query-by-committee                 0.9774  0.9824  0.9714  0.9500  0.9723  0.9726  0.9597  0.9571  0.9830  0.9734

Table 1: AUC ROC values obtained across the ten cross-validation folds. Best results are bolded, second-best results are highlighted in italics.

Model         stream-based vs. pool-based   stream-based vs. query-by-committee   pool-based vs. query-by-committee
CART          0.0840                        0.0020                                0.0020
kNN           0.1309                        0.0020                                0.0020
MLP           0.0856                        0.0039                                0.0039
Näive Bayes   0.0020                        0.0020                                0.0020
SVM           0.1824                        0.4316                                0.6250

Table 2: p-values obtained for the Wilcoxon signed-rank test when comparing the average of AUC ROC results across ten cross-validation folds.

[3] Tajeddine Benbarrad, Marouane Salhaoui, Soukaina Bakhat Kenitar, and Mounir Arioua. 2021. Intelligent machine vision model for defective product inspection based on machine learning. Journal of Sensor and Actuator Networks 10, 1 (2021), 7.
[4] Andrew P. Bradley. 1997. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 30, 7 (1997), 1145–1159. https://doi.org/10.1016/S0031-3203(96)00142-2
[5] Amal Chouchene, Adriana Carvalho, Tânia M Lima, Fernando Charrua-Santos, Gerardo J Osório, and Walid Barhoumi. 2020. Artificial intelligence for product quality inspection toward smart industries: quality control of vehicle non-conformities. In 2020 9th International Conference on Industrial Technology and Management (ICITM). IEEE, 127–131.
[6] David Cohn, Les Atlas, and Richard Ladner. 1994. Improving generalization with active learning. Machine Learning 15, 2 (1994), 201–221.
[7] Antoine Cordier, Deepan Das, and Pierre Gutierrez. 2021. Active learning using weakly supervised signals for quality inspection. arXiv preprint arXiv:2104.02973 (2021).
[8] Wenting Dai, Abdul Mujeeb, Marius Erdt, and Alexei Sourin. 2018. Towards automatic optical inspection of soldering defects. In 2018 International Conference on Cyberworlds (CW). IEEE, 375–382.
[9] Guifang Duan, Hongcui Wang, Zhenyu Liu, and Yen-Wei Chen. 2012. A machine learning-based framework for automatic visual inspection of microdrill bits in PCB production. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, 6 (2012), 1679–1689.
[10] Tobias Glasmachers. 2017. Limits of end-to-end learning. In Asian Conference on Machine Learning. PMLR, 17–32.
[11] Christian Gobert, Edward W Reutzel, Jan Petrich, Abdalla R Nassar, and Shashi Phoha. 2018. Application of supervised machine learning for defect detection during metallic powder bed fusion additive manufacturing using high resolution imaging. Additive Manufacturing 21 (2018), 517–528.
[12] Irlán Grangel-González. 2019. A knowledge graph based integration approach for industry 4.0. Ph.D. Dissertation. Universitäts- und Landesbibliothek Bonn.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[14] Jianping Hua, Zixiang Xiong, James Lowey, Edward Suh, and Edward R Dougherty. 2005. Optimal number of features as a function of sample size for various classification rules. Bioinformatics 21, 8 (2005), 1509–1515.
[15] Carla Iglesias, Javier Martínez, and Javier Taboada. 2018. Automated vision system for quality inspection of slate slabs. Computers in Industry 99 (2018), 119–129.
[16] Max Kuhn, Kjell Johnson, et al. 2013. Applied Predictive Modeling. Vol. 26. Springer.
[17] David D Lewis and Jason Catlett. 1994. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994. Elsevier, 148–156.
[18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3431–3440.
[19] Lingbin Meng, Brandon McWilliams, William Jarosinski, Hye-Yeong Park, Yeon-Gil Jung, Jehyun Lee, and Jing Zhang. 2020. Machine learning in additive manufacturing: A review. JOM 72, 6 (2020), 2363–2377.
[20] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria Presa Reyes, Mei-Ling Shyu, Shu-Ching Chen, and Sundaraja S Iyengar. 2018. A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys (CSUR) 51, 5 (2018), 1–36.
[21] S Ravikumar, KI Ramachandran, and V Sugumaran. 2011. Machine learning approach for automated visual inspection of machine components. Expert Systems with Applications 38, 4 (2011), 3260–3266.
[22] Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, and Xin Wang. 2020. A survey of deep active learning. arXiv preprint arXiv:2009.00236 (2020).
[23] Judi E See. 2012. Visual inspection: a review of the literature. Sandia Report SAND2012-8590, Sandia National Laboratories, Albuquerque, New Mexico (2012).
[24] Burr Settles. 2009. Active learning literature survey. (2009).
[25] Karin van Garderen. 2018. Active Learning for Overlay Prediction in Semiconductor Manufacturing. (2018).
[26] Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics. Springer, 196–202.
[27] Thorsten Wuest, Christopher Irgens, and Klaus-Dieter Thoben. 2014. An approach to monitoring quality in manufacturing using supervised machine learning on product state data. Journal of Intelligent Manufacturing 25, 5 (2014), 1167–1180.
[28] Jing Yang, Shaobo Li, Zheng Wang, Hao Dong, Jun Wang, and Shihao Tang. 2020. Using deep learning to detect defects in manufacturing: a comprehensive survey and current challenges. Materials 13, 24 (2020), 5755.
[29] Xinchuan Zeng and Tony R Martinez. 2000. Distribution-balanced stratified cross-validation for accuracy estimation. Journal of Experimental & Theoretical Artificial Intelligence 12, 1 (2000), 1–12.
Learning to Automatically Identify Home Appliances

Dan Lorbek Ivančič1, Blaž Bertalanič1,2, Gregor Cerar1, Carolina Fortuna1
1Jozef Stefan Institute, Ljubljana, Slovenia
2Faculty of Electrical Engineering, University of Ljubljana, Slovenia
E-mail: dl0586@student.uni-lj.si

Abstract. Appliance load monitoring (ALM) is a technique that enables increasing the efficiency of domestic energy usage by obtaining appliance-specific power consumption profiles. While machine learning has been shown to be suitable for ALM, work on analyzing design trade-offs during the feature and model selection steps of ML model development is limited. In this paper we show that 1) statistical features capturing the shape of the time series yield superior performance by up to 20 percentage points, and 2) our best deep neural network-based model slightly outperforms our best gradient boosted decision trees by 2 percentage points at the expense of increased training time.

1 Introduction

Household energy consumption accounts for a large proportion of the world's total energy consumption. The first studies, conducted as early as the 1970s, showed that as much as 25% of national energy was consumed by domestic appliances alone. This figure rose to 30% in 2001 [1] and continues to increase at an exponential rate. Some researchers even predict that these numbers will double by 2030 [2].

In support of rationalizing consumption, appliance load monitoring (ALM) has been introduced. It aims to help solve domestic energy usage related issues by obtaining appliance-specific power consumption profiles. Such data can help devise load scheduling strategies for optimal energy utilization [2]. Additionally, data about appliance usage can provide useful insight into the daily activities of residents, which can be useful for long-distance monitoring of elderly people who prefer to stay at home rather than going to retirement homes [2]. Other applications include theft detection, building safety monitoring, etc.

The two different ways of realizing ALM are intrusive load monitoring (ILM) and non-intrusive load monitoring (NILM). While ILM is known to be more accurate, it requires multiple sensors to be installed throughout the entire building, which incurs extra hardware cost and installation complexity. NILM, however, is a cost-effective, easy-to-maintain process for analyzing changes in the voltage [3] and current going into a building without having to install any additional sensors on different household devices, since it operates using only data obtained from the single main smart meter in a building. The obtained data is then disaggregated, and each individual appliance and its energy consumption are detected.

One promising approach to ILM for automatic identification of home appliances is the use of machine learning (ML). For instance, in [4] they used ML to find patterns in the data and extract useful information such as the type of load, electricity consumption details and the running conditions of appliances. More recently, [5] focused on the study of design trade-offs during the feature and model selection steps of the development of an ML-based classifier for ILM. In their study they considered various statistical summaries for feature engineering and classical machine learning techniques for model selection. We complement the work in [5] by extending the feature set with additional shape-capturing values and considering deep learning (DNN) and gradient boosted trees (XGBoost) as promising modelling techniques. The contributions of this paper are as follows:

• We explore a variety of different statistical features and show that the ones capturing the shape of the time series, such as longest strike above mean, longest strike below mean, absolute energy and kurtosis, yield superior performance by up to 20 percentage points.

• We show that our best DNN-based model slightly outperforms our best XGBoost by 2 percentage points at the expense of increased training time. We also show that our models outperform the results from [5] by 5 percentage points.

The paper is organized as follows. Section 2 summarizes related work, Section 3 formulates the problem and provides methodological details, Section 4 focuses on the study of feature selection trade-offs, while Section 5 discusses model selection. Concluding remarks are drawn in Section 6.

2 Related Work

Existing work that uses machine learning for ALM, such as [6], investigates the performance of deep learning neural networks on NILM classification tasks and builds a model that is able to accurately detect activations of common electrical appliances using data from the smart meter. More complex DNNs for NILM classification tasks are presented by the authors in [3], where they introduce a Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) based model and show that it outperforms the considered baselines. In [7] they approach a similar problem by proposing a convolutional neural network based model that allows simultaneous detection and classification of events without having to perform double processing. In [8] the authors train a temporal convolutional neural network to automatically extract high-level load signatures for individual appliances, while in [9] a feature extraction method using multiple parallel convolutional layers is presented and an LSTM recurrent neural network based model is proposed.
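The shape-capturing statistics that turn out to matter most in the feature selection study below (longest strike above/below mean, absolute energy, kurtosis) are straightforward to compute from a raw power time series. The following is a minimal NumPy sketch of three of them; the helper names are ours for illustration, not the tsfresh API the paper actually uses:

```python
import numpy as np

def longest_strike_above_mean(x):
    """Length of the longest run of consecutive values above the series mean."""
    best = run = 0
    for above in x > x.mean():
        run = run + 1 if above else 0
        best = max(best, run)
    return best

def absolute_energy(x):
    """Sum of squared values, E = sum_i x_i^2."""
    return float(np.dot(x, x))

def kurtosis(x):
    """Excess kurtosis: how heavily the tails differ from a normal distribution."""
    mu, sigma = x.mean(), x.std()
    return float(np.mean((x - mu) ** 4) / sigma ** 4 - 3.0)

# A toy 'toaster-like' pulse: mostly zero with a short burst of power draw.
pulse = np.array([0.0, 0.0, 5.0, 5.0, 5.0, 0.0, 0.0])
features = [longest_strike_above_mean(pulse), absolute_energy(pulse), kurtosis(pulse)]
```

The longest strike below mean is symmetric (compare against `x < x.mean()`); the point of such summaries is that a short high pulse and a long low plateau with the same total energy produce very different feature vectors.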
3 Problem formulation

Our goal was to design a classifier that, given an input time series T, is able to accurately map this data to the appropriate class C, as shown in equation 1:

C = \Phi(T)   (1)

where \Phi represents the mapping function from time series to target classes and C is a set of these classes, where each class corresponds to one of the following household appliances: computer monitor, laptop computer, television, washer dryer, microwave, boiler, toaster, kettle and fridge. The appliances and measured data illustrated in Figure 1, available in the public UK-DALE dataset, are used. UK-DALE (UK Domestic Appliance-Level Electricity) contains the power demand from 5 different houses in the United Kingdom. The dataset was built at a sample rate of 16 Hz for the whole house and 0.1667 Hz for each individual appliance. Data is split into 1-hour-long segments; each dataset sample contains a time series with 600 datapoints, as depicted in Figure 1.

[Figure 1: Selected appliances, showing power in relation to time over a 1 hour interval.]

For realizing \Phi, we first perform a feature selection task followed by a model selection one. For selecting the best feature set, we perform feature selection in Section 4. For model selection, we go beyond the work in [5] and consider deep learning architectures enabled by Tensorflow and advanced decision trees that use an optimized distributed gradient boosting technique available in the XGBoost open source library, as detailed in Section 5.

4 Feature selection

As can be seen in Figure 1, the time series corresponding to each device has a unique shape and patterns, therefore an intuitive approach to feature selection is to extract statistical properties of the time series that would capture the unique properties of the signals. For instance, a summary such as the peak-to-peak value is able to capture the difference between the maximum and minimum value in a time series signal, while one such as skewness is able to describe the asymmetry in the distribution of datapoints in a particular sample. A good combination of such features would be able to inform the model with relevant information about the power consumption of each appliance, making it easier to find patterns in the data and perform the classification task more accurately. Recently, standard tools for computing a large range of such summaries are provided by dedicated time series feature engineering tools such as tsfresh (https://tsfresh.readthedocs.io/en/latest/text/list_of_features.html).

Following an extensive evaluation of combinations of time series features, we report the results for a representative selection of three feature sets as follows:

FeatureSet1 - This feature set consists of the raw time series, containing 2517 time series samples, each with 600 datapoints. It is used as a baseline to see the performance achieved with the available data.

FeatureSet2 - This feature set consists of: mean value, maximum, minimum, standard deviation, variance, peak-to-peak, count above mean, count below mean, mean change, absolute mean change, absolute energy. The count above and below mean counts the number of values in each sample that are higher or lower than the mean value of that same sample, and helps quantify the width of a pulse such as the ones for the toaster and microwave from Figure 1. The mean absolute change gives the mean over the absolute differences between subsequent time series values. The absolute energy represents the sum of squared values, calculated using the formula shown in equation 2, and provides information on whether a specific appliance has a large consumption profile or not:

E = \sum_{i=0}^{n-1} x_i^2   (2)

FeatureSet3 - After taking a deeper look into the features from FeatureSet2, we noticed that minimum is redundant as it is usually zero in every sample, and peak-to-peak is in most cases equal to the maximum value due to the lowest value mostly being zero. This feature set consists of: maximum, standard deviation, mean absolute change, mean change, longest strike above mean, longest strike below mean, absolute energy, kurtosis, number of peaks in each signal. The longest strike above and below mean returns the length of the longest consecutive subsequence that is higher or lower than the mean value of that specific sample. The kurtosis is another metric describing the probability distribution and measures how heavily the tails of a distribution differ from the tails of a normal distribution.

4.1 Results

The results of the feature selection process are listed in Table 1 for the two techniques considered in this paper. As can be seen from the second column (Inst.) of Table 2, the dataset is balanced. From columns 3-5 of Table 1 it can be seen that for the baseline FeatureSet1, the f1 score is 0.57 for the CNN and 0.77 for XGB. By using features that better capture the shape of the time series, as in the case of FeatureSet2, an improvement of up to 20% can be seen as follows: the f1 of the CNN model increases to 0.89, the precision to 0.92 and the recall to 0.88. The XGBoost model also performed better, with an f1 of 0.87, precision of 0.87 and recall of 0.86. Finally, it can be seen from the table that FeatureSet3 performs the best, with an f1 of 0.90, precision of 0.93 and recall of 0.90 for the CNN model, and an f1 of 0.89, precision of 0.89 and recall of 0.89 for the XGB model. FeatureSet3 performed better than FeatureSet2 because its features had much less correlation between each other and all of the redundant features from FeatureSet2 were removed. For FeatureSet3, a variety of different feature orderings were also tested, but the results remained within 1% accuracy variance.

Table 1: Feature comparison using the best models.

Model    Feature set   Precision  Recall  f1
DNN3     FeatureSet1   0.638      0.595   0.573
XGB3     FeatureSet1   0.799      0.769   0.779
DNN3     FeatureSet2   0.918      0.885   0.889
XGB3     FeatureSet2   0.869      0.864   0.867
DNN3     FeatureSet3   0.931      0.898   0.902
XGB3     FeatureSet3   0.888      0.889   0.889
DNN3     best [5]      0.893      0.887   0.888
XGB3     best [5]      0.861      0.860   0.861
SVM [5]  best [5]      0.851      0.835   0.834

To gain insight into the per-class performance of FeatureSet3 with the two techniques, we present a per-device f1 score breakdown in Table 2. It can be seen that computer monitor, microwave and kettle are classified worst by all three models, as their similar consumption profiles make it difficult for the models to distinguish between them. Nevertheless, the CNN classifies all three the best due to its superior pattern recognition ability.

Table 2: Per class performance, FeatureSet3 vs best [5].

Class         Inst.  CNN f1  XGB f1  [5] f1
monitor       300    0.827   0.833   0.780
laptop        276    0.983   0.932   0.838
television    300    0.992   0.976   0.941
washer/dryer  226    0.941   0.912   0.804
microwave     300    0.688   0.620   0.687
boiler        300    1.000   0.968   0.940
toaster       215    0.949   0.940   0.806
kettle        300    0.756   0.722   0.739
fridge        300    1.000   0.983   0.970

5 Model selection

For analyzing the performance of DNN and XGBoost on our problem, we conducted extensive performance evaluations. We started by developing a deep learning sequential model, which at first consisted of three dense layers, each with an arbitrarily chosen number of neurons. By trying different combinations of hyperparameters such as the number of neurons, loss functions, optimizers, batch size, number of epochs, number of layers and learning rate, we came closer to finding the best suited model for our problem. For optimizing certain hyperparameters we took advantage of the automatic hyperparameter optimization framework Optuna (https://optuna.org). We then applied similar optimization techniques to the XGB model, although its default parameter configuration already gave good results. All the experiments were run on Google Colab using an instance with an Nvidia Tesla K80 GPU and 12.69 GB of RAM.

In this section we present and analyze three representative models from each class, DNN and XGBoost respectively.

5.1 Deep neural network

DNN1 - This model consisted of three fully connected dense layers. The first two had 32 neurons each and a ReLU (rectified linear unit) activation function, while the output layer had nine neurons, each corresponding to one of the nine possible appliances, and a Softmax activation function.

DNN2 - For this model we took the DNN1 model and added an additional dense layer with 64 neurons, as well as changed the activation function to linear in the penultimate layer. With this additional complexity we expected to see better results.

DNN3 - For this model we introduced two 1D convolution layers, the first with 128 filters and the second with 64. Then we used a flatten layer to reduce the dimensionality of the output space and make the data compatible with the following dense layer, followed by another (output) dense layer.

5.2 XGBoost

XGB1 - This is the model with the standard configuration, i.e. a maximum depth of 3, 100 estimators and a learning rate of 0.1.

XGB2 - In this model we increased the maximum depth to 4, reduced the learning rate by 50% (to 0.05) and increased the number of estimators to 200. Doing this gave slightly better results.

XGB3 - For this model we decreased the maximum depth to 2, increased the number of estimators to 500 and the learning rate to 0.25.

Table 3: Model performance on FeatureSet3.

Model    Precision  Recall  f1     Comp. time
DNN1     0.866      0.851   0.846  10.972 s
DNN2     0.900      0.887   0.889  21.026 s
DNN3     0.931      0.898   0.902  21.124 s
XGB1     0.876      0.863   0.864  1.126 s
XGB2     0.884      0.881   0.882  2.518 s
XGB3     0.888      0.889   0.889  3.225 s
SVM [5]  0.878      0.852   0.852  0.301 s

5.3 Results

5.3.1 Classification performance

The classification performance of the models is provided in Table 3. It can be seen that the best performing models are DNN3 with an f1 score of 0.90 and XGB3 with an f1 of 0.89. However, the computation time of XGB3 is only 3.23 s while for DNN3 it is 21.12 s. The XGB classifier using classical machine learning performed only about 1 percentage point worse than the CNN model, while at the same time being much less complex and able to complete the entire training process about 18 seconds faster than the CNN. In addition, the XGB model is much easier to optimize since it has no hidden layers and a default hyperparameter configuration that usually requires no further optimization at all. From the last line of the table it can be seen that the SVM-based model from [5] performs 5 percentage points worse than DNN3 on FeatureSet3.

5.3.2 Computation time

The superior performance of the DNN model comes at the cost of increased algorithm complexity and hence longer computation time. As depicted in Table 3, the first DNN model took 10.97 seconds to complete the training process and the best (most complex) one took 21.12 seconds. XGBoost, on the other hand, was much faster, with XGB1 taking only 1.12 seconds. The added depth of XGB2 caused a slight increase in computation time to 2.52 seconds, which further increased to 3.23 seconds due to the high number of estimators used in XGB3. Finally, the SVM from [5] was the fastest to complete the training process, taking only 0.3 seconds, but scored the worst in terms of performance.

6 Conclusions

In this paper we investigated the design trade-offs during the feature and model selection steps of the development of an ML-based classifier for ILM. After formulating our problem, we first show that by extracting various statistical features from raw time series data and then training our models with these features, we were able to improve the f1 score by up to 20 percentage points. Second, we propose two different ML techniques and describe our process of developing the proposed models using them. We show that optimizing hyperparameters to better suit our specific problem can improve their respective performance by around 4 percentage points. However, choosing the right features that better capture the shape of the data has a much greater impact on the end results than optimizing the models. We also show that the classical machine learning model does not perform significantly worse than the deep neural network based one, while at the same time being less computationally expensive.

References

[1] L. Shorrock, J. Utley et al., Domestic energy fact file 2003. Citeseer, 2003.
[2] A. Zoha, A. Gluhak, M. A. Imran, and S. Rajasegarar, "Non-intrusive load monitoring approaches for disaggregated energy sensing: A survey," Sensors, vol. 12, no. 12, pp. 16838-16866, 2012.
[3] J. Kim, T.-T.-H. Le, and H. Kim, "Nonintrusive Load Monitoring Based on Advanced Deep Learning and Novel Signature," Computational Intelligence and Neuroscience, vol. 2017, p. e4216281, Oct. 2017. [Online]. Available: https://www.hindawi.com/journals/cin/2017/4216281/
[4] E. Aladesanmi and K. Folly, "Overview of non-intrusive load monitoring and identification techniques," IFAC-PapersOnLine, vol. 48, no. 30, pp. 415-420, 2015, 9th IFAC Symposium on Control of Power and Energy Systems CPES 2015. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2405896315030566
[5] L. Ogrizek, B. Bertalanic, G. Cerar, M. Meza, and C. Fortuna, "Designing a machine learning based non-intrusive load monitoring classifier," in 2021 IEEE ERK, 2021, pp. 1-4.
[6] M. Devlin and B. P. Hayes, "Non-intrusive load monitoring using electricity smart meter data: A deep learning approach," in 2019 IEEE Power & Energy Society General Meeting (PESGM), 2019, pp. 1-5.
[7] F. Ciancetta, G. Bucci, E. Fiorucci, S. Mari, and A. Fioravanti, "A new convolutional neural network-based system for NILM applications," IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1-12, 2021.
[8] Y. Yang, J. Zhong, W. Li, T. A. Gulliver, and S. Li, "Semisupervised multilabel deep learning based nonintrusive load monitoring in smart grids," IEEE Transactions on Industrial Informatics, vol. 16, no. 11, pp. 6892-6902, 2020.
[9] W. He and Y. Chai, "An empirical study on energy disaggregation via deep learning," Advances in Intelligent Systems Research, vol. 133, pp. 338-342, 2016.

Indeks avtorjev / Author index

Beliga Slobodan ........................... 41 Bertalanič Blaž ........................... 73 Brank Janez ........................... 5 Brglez Mojca ...........................
37 Buhin Pandur Maja....................................................................................................................................................................... 41 Casals del Busto Ignacio .............................................................................................................................................................. 49 Cerar Gregor ................................................................................................................................................................................. 73 Costa Joao .................................................................................................................................................................................... 57 Dam Paulien ................................................................................................................................................................................. 69 Dobša Jasminka ............................................................................................................................................................................ 41 Eržin Eva ...................................................................................................................................................................................... 49 Erznožnik Matic ........................................................................................................................................................................... 53 Fortuna Blaž ........................................................................................................................................................................... 45, 69 Fortuna Carolina ........................................................................................................................................................................... 
73 Grobelnik Marko .................................................................................................................................................................. 5, 9, 49 Guček Alenka ............................................................................................................................................................................... 49 Jelenčič Jakob ............................................................................................................................................................................... 61 Kenda Klemen ........................................................................................................................................................................ 53, 57 Lindemann David ......................................................................................................................................................................... 33 Lorbek Ivančič Dan ...................................................................................................................................................................... 73 Massri M.Besher ...................................................................................................................................................................... 5, 49 Meštrović Ana .............................................................................................................................................................................. 41 Mladenić Dunja ............................................................................................................................................ 5, 9, 21, 29, 45, 61, 69 Mladenic Grobelnik Adrian............................................................................................................................................................ 
9 Mocanu Iulian .............................................................................................................................................................................. 49 Neumann Matej ............................................................................................................................................................................ 65 Novak Erik ................................................................................................................................................................................... 13 Novalija Inna ............................................................................................................................................................................ 5, 49 Petkovšek Gal ............................................................................................................................................................................... 53 Pita Costa Joao ....................................................................................................................................................................... 49, 57 Pollak Senja ............................................................................................................................................................................ 25, 37 Posinković Matej .......................................................................................................................................................................... 49 Poštuvan Tim ............................................................................................................................................................................... 45 Pranjić Marko ............................................................................................................................................................................... 
25 Robnik-Šikonja Marko ........................................................................................................................................................... 17, 25 Rossi Maurizio ............................................................................................................................................................................. 49 Rožanec Jože M. .................................................................................................................................................................... 45, 69 Schwabe Daniel .............................................................................................................................................................................. 5 Sittar Abdul .................................................................................................................................................................................. 29 Šturm Jan ...................................................................................................................................................................................... 49 Swati ............................................................................................................................................................................................. 21 Trajkova Elena ............................................................................................................................................................................. 69 Ulčar Matej ................................................................................................................................................................................... 17 Vintar Špela .................................................................................................................................................................................. 
37