Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2020, Zvezek A
Proceedings of the 23rd International Multiconference INFORMATION SOCIETY – IS 2020, Volume A

Slovenska konferenca o umetni inteligenci
Slovenian Conference on Artificial Intelligence

Uredili / Edited by Mitja Luštrek, Matjaž Gams, Rok Piltaver

http://is.ijs.si
6.–7. oktober 2020 / 6–7 October 2020, Ljubljana, Slovenia

Uredniki:
Mitja Luštrek, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Matjaž Gams, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Rok Piltaver, Celtra, d. o. o., in Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2020

Informacijska družba, ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID=33223427
ISBN 978-961-264-202-0 (epub)
ISBN 978-961-264-203-7 (pdf)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2020

Triindvajseta multikonferenca Informacijska družba (http://is.ijs.si) je doživela polovično zmanjšanje zaradi korone. Zahvala za preživetje gre tistim predsednikom konferenc, ki so se kljub prvi pandemiji modernega sveta pogumno odločili, da bodo izpeljali konferenco na svojem področju. Korona pa skoraj v ničemer ni omejila neverjetne rasti IKT-ja, informacijske družbe, umetne inteligence in znanosti nasploh, ampak nasprotno – kar naenkrat je bilo treba večino aktivnosti opraviti elektronsko in IKT so dokazale, da je elektronsko marsikdaj celo bolje kot fizično. Po drugi strani pa se je pospešil razpad družbenih vrednot ter zaupanja v znanost in razvoj. Celo Flynnov učinek – merjenje IQ na svetovni populaciji – kaže, da ljudje ne postajajo čedalje pametnejši. Nasprotno – čedalje več ljudi verjame, da je Zemlja ploščata, da bo cepivo za korono škodljivo ali da je korona škodljiva le toliko kot navadna gripa (v resnici je desetkrat bolj). Razkorak med rastočim znanjem in vraževerjem se povečuje.

Letos smo v multikonferenco povezali osem odličnih neodvisnih konferenc. Zajema okoli 160 večinoma spletnih predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic ter 300 obiskovalcev. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad – seveda večinoma preko spleta. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 44-letno tradicijo odlične znanstvene revije.
Multikonferenco Informacijska družba 2020 sestavljajo naslednje samostojne konference:
• Etika in stroka
• Interakcija človek-računalnik v informacijski družbi
• Izkopavanje znanja in podatkovna skladišča
• Kognitivna znanost
• Ljudje in okolje
• Mednarodna konferenca o prenosu tehnologij
• Slovenska konferenca o umetni inteligenci
• Vzgoja in izobraževanje v informacijski družbi

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

V 2020 bomo petnajstič podelili nagrado za življenjske dosežke v čast Donalda Michieja in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejela prof. dr. Lidija Zadnik Stirn. Priznanje za dosežek leta pripada Programskemu svetu tekmovanja ACM Bober. Podeljujemo tudi nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono je prejela »Neodzivnost pri razvoju elektronskega zdravstvenega kartona«, jagodo pa Laboratorij za bioinformatiko, Fakulteta za računalništvo in informatiko, Univerza v Ljubljani. Čestitke nagrajencem!

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD – INFORMATION SOCIETY 2020

The 23rd Information Society Multiconference (http://is.ijs.si) was halved due to COVID-19. The multiconference survived thanks to those conference chairs who bravely decided to continue with their conferences despite the first pandemic of the modern era. The COVID-19 pandemic did not slow the growth of ICT, the information society, artificial intelligence and science overall – quite the contrary: suddenly most activities had to be performed through ICT, which often proved more efficient than the old physical way. On the other hand, COVID-19 did accelerate the decline of societal values and of trust in science and progress. Even the Flynn effect – the measurement of IQ across the world population – indicates that the average Earthling is becoming less smart and knowledgeable. Contrary to the general consensus among scientists, the number of people believing that the Earth is flat is growing. A large number of people are wary of the COVID-19 vaccine and consider the consequences of COVID-19 to be similar to those of a common flu, despite their being empirically observed to be ten times worse.

The multiconference is running parallel sessions with around 160 presentations of scientific papers at eight conferences, many round tables, workshops and award ceremonies, and 300 attendees. Selected papers will be published in the Informatica journal, with its 44-year tradition of excellent research publishing.

The Information Society 2020 Multiconference consists of the following conferences:
• Cognitive Science
• Data Mining and Data Warehouses
• Education in Information Society
• Human-Computer Interaction in Information Society
• International Technology Transfer Conference
• People and Environment
• Professional Ethics
• Slovenian Conference on Artificial Intelligence

The multiconference is co-organized and supported by several major research institutions and societies, among them
ACM Slovenia (the Slovenian chapter of the ACM), SLAIS, DKZ and the second national engineering academy, the Slovenian Engineering Academy (IAS). In the name of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contributions and their interest in this event, as well as the reviewers for their thorough reviews.

For the fifteenth year, the award for lifelong outstanding contributions is presented in memory of Donald Michie and Alan Turing. The Michie-Turing award was given to Prof. Dr. Lidija Zadnik Stirn for her lifelong outstanding contribution to the development and promotion of the information society in our country. In addition, a recognition for current achievements was awarded to the Programme Council of the ACM Bober competition. The information lemon goes to the "Unresponsiveness in the development of the electronic health record", and the information strawberry to the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana. Congratulations!

Mojca Ciglarič, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia

Organizing Committee: Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Marjetka Šprah; Mitja Lasič; Blaž Mahnič; Jani Bizjak; Tine Kolenik

Programme Committee: Mojca Ciglarič, chair; Bojan Orel, co-chair; Franc Solina; Viljan Mahnič; Cene Bavec; Tomaž Kalin; Jozsef Györkös; Tadej Bajd; Jaroslav Berce; Mojca Bernik; Marko Bohanec; Ivan Bratko; Andrej Brodnik; Dušan Caf; Saša Divjak; Tomaž Erjavec; Bogdan Filipič; Andrej Gams; Matjaž Gams; Mitja Luštrek; Marko Grobelnik; Nikola Guid; Marjan Heričko; Borka Jerman Blažič Džonova; Gorazd Kandus; Urban Kordeš; Marjan Krisper; Andrej Kuščer; Jadran Lenarčič; Borut Likar; Janez Malačič; Olga Markič; Dunja Mladenič; Franc Novak; Vladislav Rajkovič; Grega Repovš; Ivan Rozman; Niko Schlamberger; Špela Stres; Stanko Strmčnik; Jurij Šilc; Jurij Tasič; Denis Trček; Andrej Ule; Tanja Urbančič; Boštjan Vilfan; Baldomir Zajc; Blaž Zupan; Boris Žemva; Leon Žlajpah

KAZALO / TABLE OF CONTENTS

Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence .......... 1
PREDGOVOR / FOREWORD .......... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES .......... 5
Using Mozilla's DeepSpeech to Improve Speech Emotion Recognition / Andova Andrejaana, Bromuri Stefano, Luštrek Mitja .......... 7
Towards Automatic Recognition of Similar Chess Motifs / Bizjak Miha, Guid Matej .......... 11
Drinking Detection From Videos in a Home Environment / De Masi Carlo M., Luštrek Mitja .......... 15
Semantic Feature Selection for AI-Based Estimation of Operation Durations in Individualized Tool Manufacturing / Dovgan Erik, Filipič Bogdan .......... 19
Generating Alternatives for DEX Models using Bayesian Optimization / Gjoreski Martin, Kuzmanovski Vladimir .......... 23
Detekcija napak na industrijskih izdelkih / Golob David, Petrovčič Janko, Kalabakov Stefan, Kocuvan Primož, Bizjak Jani, Dolanc Gregor, Ravničan Jože, Gams Matjaž, Bohanec Marko .......... 27
Data Protection Impact Assessment – an Integral Component of a Successful Research Project From the GDPR Point of View / Gültekin Várkonyi Gizem, Gradišek Anton .......... 32
Deep Transfer Learning for the Detection of Imperfections on Metallic Surfaces / Kalabakov Stefan, Kocuvan Primož, Bizjak Jani, Gazvoda Samo, Gams Matjaž .......... 35
Fall Detection and Remote Monitoring of Elderly People Using a Safety Watch / Kiprijanovska Ivana, Bizjak Jani, Gams Matjaž .......... 39
Machine Vision System for Quality Control in Manufacturing Lines / Kiprijanovska Ivana, Bizjak Jani, Gazvoda Samo, Gams Matjaž .......... 43
Abnormal Gait Detection Using Wrist-Worn Inertial Sensors / Kiprijanovska Ivana, Gjoreski Hristijan, Gams Matjaž .......... 47
Avtomatska detekcija obrabe posnemalnih igel / Kocuvan Primož, Bizjak Jani, Kalabakov Stefan, Gams Matjaž .......... 51
Povečevanje enakosti (oskrbe duševnega zdravja) s prepričljivo tehnologijo / Kolenik Tine, Gams Matjaž .......... 55
Analiza glasu kot diagnostična metoda za odkrivanje Parkinsonove bolezni / Levstek Andraž, Silan Darja, Vodopija Aljoša .......... 59
STRAW Application for Collecting Context Data and Ecological Momentary Assessment / Lukan Junoš, Katrašnik Marko, Bolliger Larissa, Clays Els, Luštrek Mitja .......... 63
URBANITE H2020 Project Algorithms and Simulation Techniques for Decision-Makers / Machidon Alina, Smerkol Maj, Gams Matjaž .......... 68
Towards End-to-end Text to Speech Synthesis in Macedonian Language / Neceva Marija, Stoilkovska Emilija, Gjoreski Hristijan .......... 72
Improving Mammogram Classification by Generating Artificial Images / Peterka Ana, Bosnić Zoran, Osipov Evgeny .......... 76
Mobile Nutrition Monitoring System: Qualitative and Quantitative Monitoring / Reščič Nina, Jordan Marko, De Boer Jasmijn, Bierhoff Ilse, Luštrek Mitja .......... 80
Recognition of Human Activities and Falls by Analyzing the Number of Accelerometers and their Body Location / Shulajkovska Miljana, Gjoreski Hristijan .......... 84
Sistem za ocenjevanje esejev na podlagi koherence in semantične skladnosti / Simončič Žiga, Bosnić Zoran .......... 88
Mental State Estimation of People with PIMD using Physiological Signals / Slapničar Gašper, Dovgan Erik, Valič Jakob, Luštrek Mitja .......... 92
Energy-Efficient Eating Detection Using a Wristband / Stankoski Simon, Luštrek Mitja .......... 96
Comparison of Methods for Topical Clustering of Online Multi-speaker Discourses / Stropnik Vid, Bosnić Zoran, Osipov Evgeny .......... 100
Machine Learning of Surrogate Models with an Application to Sentinel 5P / Szlupowicz Michał Artur, Brence Jure, Adams Jennifer, Malina Edward, Džeroski Sašo .......... 104
Deep Multi-label Classification of Chest X-ray Images / Štepec Dejan .......... 108
Smart Issue Retrieval Application / Zupančič Jernej, Budna Borut, Mlakar Miha, Smerkol Maj .......... 112
Adaptation of Text to Publication Type / Žontar Luka, Bosnić Zoran .......... 116
Indeks avtorjev / Author index .......... 121

Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2020, Zvezek A
Proceedings of the 23rd International Multiconference INFORMATION SOCIETY – IS 2020, Volume A

Slovenska konferenca o umetni inteligenci
Slovenian Conference on Artificial Intelligence

Uredili / Edited by Mitja Luštrek, Matjaž Gams, Rok Piltaver

http://is.ijs.si
6.–7. oktober 2020 / 6–7 October 2020, Ljubljana, Slovenia

PREDGOVOR

Leto 2020 je bilo za informacijsko družbo zelo pomembno: zmanjšanje medosebnih stikov zaradi COVID-19 je pokazalo, da se da s pomočjo informacijskih tehnologij postoriti še precej več, kot smo si do zdaj mislili. S pomočjo telekonferenčnih sistemov smo se sestajali, digitalno smo prenašali in podpisovali dokumente, prek spleta smo lahko naročili domala vse izdelke in storitve ...
Čeravno sta umetna inteligenca in informacijska družba vedno tesneje povezani, pa podobno dramatičnega napredka pri umetni inteligenci ni bilo opaziti. Seveda to ne pomeni, da napredka ni bilo – raznotere metode umetne inteligence še naprej postajajo vedno zmogljivejše in predvsem prodirajo v vedno manjše in cenejše naprave: opažamo lahko, da se namenski procesorji za operacije umetnih nevronskih mrež vedno pogosteje pojavljajo v pametnih telefonih, pametnih zvočnikih z govornimi asistenti in podobnih napravah.

Umetno inteligenco smo zapregli tudi v spopad s COVID-19. Raziskovalci so jo uporabili za določanje strukture virusa ter za iskanje učinkovitih zdravil in cepiv. Skupina ameriških organizacij je razpisala nagrado za najboljše pristope rudarjenja po besedilih, ki bodo iz 19 GB besedil, povezanih z boleznijo, izluščili koristne informacije. Razvitih je bilo več diagnostičnih sistemov za podporo odločanju, ki analizirajo slike pljuč in druge podatke. Precej raziskovalcev se je z metodami umetne inteligence lotilo napovedovanja širjenja bolezni in določanja dejavnikov, ki nanj vplivajo. Tovrstne raziskave se dogajajo tudi v Sloveniji.

K sreči COVID-19 naši konferenci ni storil dosti žalega. Resda se ob pisanju tegale uvodnika še ne ve zagotovo, ali bo konferenca potekala na daljavo ali jo bomo uspeli speljati hibridno, kot načrtujemo – da bo del udeležencev prisoten v živo v predavalnici, del pa na daljavo. A verjamemo, da to na kakovost izvedbe ne bo bistveno vplivalo. Z zadovoljstvom pa ugotavljamo, da smo letos dobili največ prispevkov v zadnjih petih letih – v zbornik jih je vključenih kar 28. Tokrat je bolje kot običajno zastopana Fakulteta za računalništvo in informatiko Univerze v Ljubljani, ki ima skupaj z Institutom »Jožef Stefan« (od koder je – kot vsako leto – največ prispevkov) vodilno vlogo pri raziskavah umetne inteligence v Sloveniji. Nekaj prispevkov je tudi iz tujine in industrije, čeprav bi si zlasti slednjih želeli več. Slovenija namreč izobrazi veliko strokovnjakov s področja umetne inteligence in precej jih najde pot v industrijo, kjer se dogaja marsikaj zanimivega, o čemer vemo premalo. V to smer si bomo zato še bolj prizadevali v prihodnjih letih.

FOREWORD

2020 was an important year for the information society: social distancing due to COVID-19 showed that information technologies allow us to do even more than we previously thought. Teleconferencing systems allowed us to meet virtually, we transferred and signed documents digitally, and we ordered every imaginable product and service online ...

However, even though artificial intelligence and the information society are increasingly interlinked, the progress of artificial intelligence this year was not as dramatic. This certainly does not mean there was no progress – various artificial-intelligence methods are still steadily improving and, perhaps even more importantly, becoming available in ever smaller and cheaper devices: dedicated processors accelerating neural-network computations are becoming common in smartphones, smart speakers with conversational assistants and similar devices.

Artificial intelligence also helps fight COVID-19. It was used to determine the structure of the virus and to discover effective drugs and vaccines. A group of US organizations offered a prize for the best text-mining methods that can extract useful information from 19 GB of texts related to the disease. Several diagnostic decision-support systems were developed, which analyse images of the lungs and other data.
Many researchers used artificial intelligence to forecast the spread of the disease and the factors that affect it. Such research is also conducted in Slovenia.

Fortunately, COVID-19 did not much affect our conference. At the time of writing this editorial, it is still not clear whether it will take place remotely or we will succeed with the planned hybrid approach, where a part of the participants attend live in a lecture room and the rest connect via teleconference. Either way, we are confident this will not have a major impact on the quality of the conference. We are pleased to report that this year we received the largest number of papers in the last five years – there are 28 in these proceedings. The Faculty of Computer and Information Science of the University of Ljubljana is represented better than in previous years, which is quite appropriate considering that – together with the Jožef Stefan Institute (which contributed the largest number of papers, as usual) – it plays the leading role in Slovenian artificial-intelligence research. There are also some papers from abroad and from industry, although we would like to see more of these, especially the latter. Slovenia educates a large number of artificial-intelligence experts and many of them find their way to industry, where many interesting but not widely known developments take place. We aim to improve on this aspect in the following years.

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Mitja Luštrek; Matjaž Gams; Rok Piltaver; Marko Bohanec; Tomaž Banovec; Cene Bavec; Jaro Berce; Marko Bonač; Ivan Bratko; Dušan Caf; Bojan Cestnik; Aleš Dobnikar; Bogdan Filipič; Nikola Guid; Borka Jerman Blažič; Tomaž Kalin; Marjan Krisper; Marjan Mernik; Vladislav Rajkovič; Ivo Rozman; Niko Schlamberger; Tomaž Seljak; Miha Smolnikar; Peter Stanovnik; Damjan Strnad; Peter Tancig; Pavle Trdan; Iztok Valenčič; Vasja Vehovar; Martin Žnidaršič

Using Mozilla's DeepSpeech to Improve Speech Emotion Recognition

Andrejaana Andova, Jožef Stefan International Postgraduate School and Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, andrejaana.andova@ijs.si
Stefano Bromuri, Open University of the Netherlands, Heerlen, Netherlands, Stefano.Bromuri@ou.nl
Mitja Luštrek, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, mitja.lustrek@ijs.si

ABSTRACT
A lot of effort in detecting emotions in speech has already been made. However, most of the related work focused on training a model on an emotional speech dataset and testing the model on the same dataset. A model trained on one dataset seems to provide poor results when tested on another dataset. This means that models trained on publicly available datasets cannot be used in real-life applications where the speech context is different. Furthermore, collecting large amounts of data to build an efficient speech emotion classifier is not possible in most cases. Because of this, some researchers tried using transfer learning to improve the performance of a baseline model trained on only one dataset. However, most of the works so far developed methods that transfer information from one emotional speech dataset into another emotional speech dataset. In this work, we instead try to transfer parameters from a pre-trained, widely used speech-to-text model. Unlike other related work, which uses emotional speech datasets that are usually small, we try to transfer information from a larger speech dataset, which was collected by Mozilla and whose main purpose was to transcribe speech. We used the first layer of the DeepSpeech model as the basis for building another deep neural network, which we trained on the improvisation utterances from the IEMOCAP dataset.

KEYWORDS
speech emotion recognition, feature transfer, DeepSpeech
1 INTRODUCTION
There are many issues when trying to build a model for speech emotion recognition, but the main problem is the lack of emotional speech data. Collecting a dataset is often a challenging and effortful task, but in speech emotion recognition a few additional problems arise. One of the main problems is that speech is context-dependent. One could gather a dataset from job interviews and build a precise model that detects emotions in job applicants' speech. However, the same model would probably not work for a phone application that tries to analyze the emotions of its users. Thus, to build a general model for speech emotion recognition, one would need to gather a dataset composed of speech used in different contexts, which is a hard task.

Most of the currently available emotional speech datasets are composed of actors performing scenes with different emotions. Finding actors and writing the scenes can be costly and effortful and, thus, it is hard to collect large amounts of data in this way. However, the major problem of this type of data is that all of the emotions are acted and may be more exaggerated than real-life emotions [8]. Such data is probably quite different from the data encountered in real-life applications, where emotions are expressed with less intensity. To solve this problem, some researchers used transfer learning methods to build models that are more robust to changes in the data.

Other researchers used speech recorded in real-life scenarios and asked people to listen to the recordings and annotate the emotions they recognize in the speakers' voices. When collecting a dataset in this way, one needs to find people who will listen to the whole dataset and annotate the data. The annotators will probably have different abilities to detect the emotions and different perceptions of what each emotion should sound like. Because of this, in many cases not all of them will agree on which emotion is present in a sample. Another drawback of this type of data collection is that most of the time people do not experience extreme emotions. Because of this, such datasets contain almost no emotional speech – the speech is mostly neutral.

The main idea behind transfer learning is to use information from one dataset, called the source dataset, to improve the performance on a target dataset. The source and the target datasets may have labeled or unlabeled data, may have the same or different data distributions, and they may be constructed to solve the same task or different tasks. Depending on this, there are different approaches to transfer learning; they are thoroughly surveyed by Pan and Yang [5].

In this work, we decided to follow the usual transfer learning approach and use a pre-trained speech-to-text model trained on a large non-emotional English dataset collected by Mozilla. This model may not contain any emotional information that would be useful for our task, but we believe it contains information about the speech of the subjects that could be used in speech emotion recognition.
2 RELATED WORK
While speech emotion recognition where training and testing are done on the same dataset has been well studied, using other datasets to make the model more general has been in focus only in recent years.

Some researchers tried using unlabeled target data to improve speech emotion recognition models. Parthasarathy and Busso [6] combined supervised and unsupervised learning to improve the performance of speech emotion recognition on a target dataset. They used a network architecture similar to autoencoders to encode large amounts of unlabeled target data in an unsupervised way, by putting the same speech at the input and the output of the network. To force the network to encode the emotional information in the speech, they connected the last encoding layer to another layer that learned the arousal, valence and dominance annotations of the speech in a supervised way. Compared to other state-of-the-art models, their method showed improvement in the arousal and dominance spaces, while in the valence space the results were slightly worse than the state of the art.

Some authors tried to bring the feature spaces of the source and the target data closer together. Song et al. [7] used MMDE optimization and dimension-reduction algorithms for this purpose, and then used the shifted feature space from the source dataset to train an SVM model. They used the EmoDB dataset as the source dataset and a Chinese emotional dataset collected by the authors as the target dataset. After training the SVM model on the source dataset only, they applied it to the target dataset, where it achieved 59.8% accuracy. This is an improvement over an SVM model trained on the source dataset and tested on the target dataset without any dimension reduction, which achieved 29.8% accuracy. However, the best performance, 85.5% accuracy, was achieved by a model trained and tested on the target dataset.

3 DATASET
In this research we used the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) [1]. IEMOCAP consists of speech from ten English-speaking actors (five male and five female), and it is the largest publicly available dataset for speech emotion recognition that we found. It consists of approximately twelve hours of data in which actors perform improvisations or scripted scenarios, specifically selected to elicit emotional expressions. Since the actors were not given any specific emotions to act, the database was annotated by multiple annotators with categorical labels, as well as dimensional labels such as valence, activation and dominance. The set of emotions the annotators could choose from was anger, happiness, excitement, sadness, frustration, fear, surprise, other, and neutral; but because most of the related work on transfer learning in speech emotion recognition only used anger, happiness, sadness and neutral utterances, we decided to use only these emotions as well.

We noticed that most of the time the three annotators did not perceive the same emotion, so we eliminated all data where the three annotators did not agree on the detected emotion. This reduced the amount of data significantly. The distribution of the emotions after the data reduction is given in Table 1.

Table 1: Emotion distribution in IEMOCAP.
Anger   Happiness   Sadness   Neutral
500     94          467       392
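As an illustration of this filtering step, the following minimal sketch keeps only the utterances whose annotators unanimously chose one of the four target emotions. The file name and column layout are hypothetical; the actual IEMOCAP annotations are distributed as per-session text files and would need their own parser.

    import pandas as pd

    TARGET_EMOTIONS = {"ang", "hap", "sad", "neu"}

    # Hypothetical CSV with one row per (utterance, annotator) pair.
    labels = pd.read_csv("iemocap_annotations.csv")  # columns: utterance_id, annotator, emotion

    def unanimous_label(group):
        emotions = set(group["emotion"])
        # Keep the utterance only if all annotators agree on a single
        # emotion and that emotion is one of the four target classes.
        if len(emotions) == 1 and emotions <= TARGET_EMOTIONS:
            return emotions.pop()
        return None

    agreed = labels.groupby("utterance_id").apply(unanimous_label).dropna()
    print(agreed.value_counts())  # class counts as in Table 1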
4 METHODOLOGY
We developed methods that transfer information from a large non-emotional speech dataset into a target emotional speech dataset. Since in most of the related work researchers extracted information from smaller emotional speech datasets and transferred it to other emotional speech datasets, this is, to our knowledge, the first attempt in speech emotion recognition to transfer information from a well-established pre-trained speech model into a smaller emotional speech dataset, which is the standard approach in most other transfer learning applications. To check whether the methods provide any useful improvement, we compare them to a baseline model that was trained and tested on IEMOCAP and does not use any kind of information transfer.

4.1 Baseline Model
To build a baseline classifier, we used standard machine learning approaches trained on features extracted using OpenSMILE [2]. After testing several different machine learning approaches, we saw that Random Forest obtained the best results for most of the target datasets. Because of this, we used a Random Forest classifier with 1000 trees and a maximal depth of 10 as the baseline model.
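A minimal sketch of such a baseline, assuming the OpenSMILE features have already been exported to disk (the file names are placeholders), with leave-one-subject-out evaluation as used in Section 5:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    # X: one row of OpenSMILE functionals per utterance; y: emotion labels;
    # groups: speaker id per utterance, used for leave-one-subject-out folds.
    X = np.load("opensmile_features.npy")
    y = np.load("emotion_labels.npy")
    groups = np.load("speaker_ids.npy")

    clf = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0)
    scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
    print(scores.mean())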
4.2 DeepSpeech Model
DeepSpeech is a model that provides transcriptions of given speech. The model was trained on the English data from the Mozilla Common Voice dataset [3]. This dataset consists of 1469 hours of speech recorded from 61521 different voices. The people whose voices were collected were of different nationalities (and thus different English accents) and different ages. All of this data is publicly available and can be easily accessed.

The architecture of the DeepSpeech model is presented in Figure 1. Each utterance is a time series in which every time-slice is a vector of MFCC audio features [4]. The goal of the network is to convert an input sequence x into a sequence of character probabilities for the transcription y. The network is composed of five hidden layers. The first three are dense layers with ReLU activation. The fourth layer is an LSTM layer, and the fifth is again a dense layer with ReLU activation. The output layer has a softmax function which outputs character probabilities. In the example in Figure 1, the first frame outputs the character 'C', the second frame the character 'A', and the third frame the character 'T', resulting in the word 'CAT'.

Figure 1: Architecture of the original DeepSpeech model.
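The topology described above can be summarised in a few lines of Keras. This is only an illustrative sketch: the layer width, the number of MFCC features and the alphabet size are assumptions, and the original DeepSpeech additionally uses context windows and CTC-based training and decoding.

    import tensorflow as tf
    from tensorflow.keras import layers

    NUM_MFCC = 26    # MFCC features per time-slice (assumed)
    NUM_CHARS = 29   # output alphabet size (assumed)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, NUM_MFCC)),      # variable-length utterance
        layers.Dense(2048, activation="relu"),       # three dense ReLU layers
        layers.Dense(2048, activation="relu"),
        layers.Dense(2048, activation="relu"),
        layers.LSTM(2048, return_sequences=True),    # recurrent fourth layer
        layers.Dense(2048, activation="relu"),       # fifth dense ReLU layer
        layers.Dense(NUM_CHARS, activation="softmax")  # per-frame character probabilities
    ])

The per-frame output of the first dense layer is exactly the 2048-dimensional representation that the transfer learning method in Section 4.3 reuses.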
4.3 Transfer Learning Using DeepSpeech
We experimented with transferring information from the DeepSpeech model that would be useful for the speech emotion recognition task. We used the representation learned by the DeepSpeech network to extract features for the IEMOCAP dataset: the output of the first layer of the DeepSpeech model served as the features for a given frame. We ended up with 2048 features for every 10-millisecond frame. So, if the whole utterance was 3 seconds long, we would receive a matrix with dimensions 1800x2048 after the DeepSpeech feature extraction.

After the features from all the samples in IEMOCAP had been extracted, we trained a deep neural network on them. We simply added the layers of the new deep neural network on top of the first layer of the DeepSpeech model, and trained the new network from scratch using only the samples from the IEMOCAP dataset. This way we repurposed the feature representations from the first layer of the DeepSpeech model.

We experimented with several different deep neural network architectures to see which one works best for this problem. In the first architecture, we used a feed-forward network on the extracted features of each frame: one hidden dense layer with ReLU activation and 204 neurons, connected to a dense layer with softmax activation which predicted the emotion probabilities for each frame separately. Although the IEMOCAP dataset has no labels for individual frames, we used the target label of the whole utterance as the target label for each of its frames.

In the second architecture, we used the features of the whole frame sequence as input, and an LSTM layer to learn representations from the features. The LSTM layer is activated by a ReLU function and has 20 hidden states. It is connected to a dense layer with softmax activation which predicts the label of the whole utterance.

The third architecture is composed of two parts. In the first part we predict the emotion probabilities for each frame separately, and in the second part we use the emotion probability predictions from the first part to predict the emotion probabilities for the whole utterance. The first part is the same as in the first architecture and is trained on one half of the training data. In the second part, we use the predictions of the first part as input to a dense layer with softmax activation, trained on the other half of the training data. In this architecture, we predict one vector of emotions for each sequence of 20 frames.

The fourth network consists of two separate parts and is presented in Figure 2. The first part takes the output of the DeepSpeech model and predicts the probability of each target emotion separately: a dense layer with ReLU activation outputs 204 features and is connected to another dense layer with softmax activation that predicts the emotions present in each frame. The second part of the network uses the output emotion probabilities of the first part as input. It consists of one LSTM layer, trained on the second half of the training data, activated by a ReLU function and with 20 hidden states, connected to a dense layer with softmax activation which predicts the label of the whole utterance. This architecture is in a way a combination of the first and the second architecture.

Figure 2: Architecture of the fourth network, combining a per-frame dense classifier with an LSTM over its predictions.
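A sketch of this fourth (best-performing) architecture in Keras; the layer sizes follow the description above, everything else is an assumption rather than the authors' implementation:

    import tensorflow as tf
    from tensorflow.keras import layers

    NUM_EMOTIONS = 4       # anger, happiness, sadness, neutral
    FRAME_FEATURES = 2048  # output of the first DeepSpeech layer per frame

    # Part 1: per-frame emotion classifier on DeepSpeech features
    # (trained on one half of the training data).
    frame_model = tf.keras.Sequential([
        tf.keras.Input(shape=(FRAME_FEATURES,)),
        layers.Dense(204, activation="relu"),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])

    # Part 2: LSTM over the sequence of per-frame emotion probabilities,
    # producing one label per utterance (trained on the other half).
    utterance_model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, NUM_EMOTIONS)),
        layers.LSTM(20, activation="relu"),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])

During training and inference, part 1 is applied to each frame independently (e.g. via layers.TimeDistributed(frame_model)) and the resulting sequence of softmax outputs is fed to part 2.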
5 RESULTS
Since the DeepSpeech model is capable of learning language phrases in the speech, we removed all scripted utterances from the IEMOCAP dataset and used just the utterances in which the actors were asked to improvise. To evaluate the neural network architectures we used leave-one-subject-out cross-validation.

In Table 2 we present the results obtained with each of the deep neural network architectures, as well as the accuracy of the baseline model and the majority classifier.

Table 2: Classification accuracy of the majority classifier and the baseline Random Forest classifier compared to the DeepSpeech-features method.
Model           Majority   Baseline   DeepSpeech features
Dense           34%        67%        58%
LSTM            34%        67%        7%
Dense1+Dense2   34%        67%        26%
Dense1+LSTM2    34%        67%        66%

In the results we can see that the LSTM architecture performs quite poorly, with a classification accuracy of only 7%. The most probable explanation is that this architecture is quite complex, since it has 2048 features for each frame and tries to train an LSTM model on all of these features. To train a model with this number of parameters, we would need many more samples than the IEMOCAP improvisations provide.

The architecture that provides the best results is the one that uses a feed-forward network to predict the emotions in each frame, and then an LSTM network to make the final emotion prediction for the whole utterance. We further experimented with this architecture to see how much the frame length changes the performance of the model. The results are presented in Figure 3. We can notice that the performance of the model can be improved by using bigger frames when training the LSTM part, but the difference is small – only a few percentage points.

Figure 3: Performance of the DeepSpeech model with different frame lengths.

The results show that some of the DeepSpeech architectures perform better than the majority classifier, but none of them outperforms the baseline model. A possible explanation is that the two tasks are simply not related enough, so we cannot use information from the DeepSpeech model to improve the performance of a model for speech emotion recognition.

6 CONCLUSION
In this work we tried to improve a baseline speech emotion recognition classifier by transferring information from a pre-trained model. Although this kind of transfer learning is widely used in other fields of computer science, most of the related work in speech emotion recognition developed transfer learning methods that transfer information from other emotional speech datasets into a target emotional speech dataset. The pre-trained model we used was Mozilla's DeepSpeech, which was developed as a speech-to-text model. To recognize emotions in speech, we used the first layer of the DeepSpeech model, on top of which we added a new classifier trained from scratch on an emotional speech dataset. This way we repurposed the previously learned feature maps for our dataset.

The results of this approach did not improve the classification accuracy on the improvisation part of the IEMOCAP dataset. A possible explanation is that the speech-to-text and speech emotion recognition tasks are simply not sufficiently related, and because of this the model could not extract any useful information from the DeepSpeech model. However, since this was the first attempt to transfer information from a well-established pre-trained model to a speech emotion recognition task, we believe it is still a valuable attempt.

7 ACKNOWLEDGMENTS
This research has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 769765.

REFERENCES
[1] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42, 4, 335.
[2] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, 1459–1462.
[3] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
[4] Beth Logan et al. 2000. Mel frequency cepstral coefficients for music modeling. In ISMIR. Volume 270, 1–11.
[5] Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 10, 1345–1359.
[6] Srinivas Parthasarathy and Carlos Busso. 2019. Semi-supervised speech emotion recognition with ladder networks. arXiv preprint arXiv:1905.02921.
[7] Peng Song, Yun Jin, Li Zhao, and Minghai Xin. 2014. Speech emotion recognition using transfer learning. IEICE Transactions on Information and Systems, 97, 9, 2530–2532.
[8] Carl E. Williams and Kenneth N. Stevens. 1972. Emotions and speech: some acoustical correlates. The Journal of the Acoustical Society of America, 52, 4B, 1238–1250.

Towards Automatic Recognition of Similar Chess Motifs

Miha Bizjak, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Matej Guid, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia

ABSTRACT
We present a novel method to find chess positions similar to a given query position in a collection of archived chess games. Our approach considers not only the static similarity due to the arrangement of the chess pieces, but also the dynamic similarity based on the recognition of chess motifs and dynamic, tactical aspects of position similarity. We use information retrieval techniques to enable efficient approximate searches, and implement a textual encoding that captures the position, accessibility and connectivity of the chess pieces, the pawn structures, and the moves that represent the solution to the problem. We have shown experimentally how important the inclusion of both static and dynamic features is for the successful detection of similar chess motifs. In another experiment the program was able to quickly traverse a large database of positions to identify similar chess tactical problems. A chess expert found the resulting program useful for automatically generating instructive examples for chess training.

KEYWORDS
problem solving, chess motifs, automatic similarity recognition
1 INTRODUCTION
A significant part of acquiring human skills is to identify our weaknesses and take measures to remedy them. In problem-solving domains such as chess, the analysis of past games is important for players trying to improve their game. Identifying their mistakes enables chess players to work on improving some aspects of their game. This is often done by training on similar problems. Finding relevant similar problems involves recognising both static patterns, i.e. finding similar chess positions, and dynamic patterns, i.e. finding similar move sequences that solve a problem. These static and dynamic patterns are often referred to as chess motifs. Learning and recognising chess motifs during the game is one of the main prerequisites for becoming a competent chess player [2].

Chess instructors often look for examples containing relevant chess motifs from real games to provide their students with useful teaching material. However, it is impossible for a human being to go through thousands or even millions of games and find problem positions with similar chess motifs and similar solutions to those overlooked by the students in their games. Finding contextually similar chess positions could also be used for annotating chess games [5] and in intelligent chess tutoring systems [10].

The goal of our research is to develop a method to automatically retrieve chess positions with similar chess motifs for a given query position from a collection of archived chess games.

1.1 Related Work
Existing chess search systems equipped with a query-by-example (QBE) [11] search interface are limited to searching only for exact matches of a given query position. To alleviate the problem of exact position searches, the Chess Query Language (CQL) [1] allows the search for approximate matches of positions. However, it requires the user to define complex queries in the system-specific language. The search results can be sorted by any user-defined feature. In addition, CQL works directly on game files and checks each game sequentially, making it inefficient for querying larger databases.

To overcome these problems, an approach based on information retrieval has been proposed for obtaining similar chess positions [4]: a textual representation is constructed for each board position, and information retrieval methods are used to calculate the similarity between these documents. Instead of constructing a query manually, the user specifies a chess position and a query encoding the characteristics of the position is generated automatically. Initially, a naive encoding was used, which only contains the positions of the individual pieces. The results were improved by including additional information about the mobility of the individual pieces and the structural relationships between the pieces. Further work improved the quality of retrieval by implementing automatic recognition of pawn structures [7]. The additional information provided by the application of domain knowledge has proved useful; however, the positions are still evaluated only statically.

All existing approaches have a common shortcoming: they only allow the search for statically similar positions, while ignoring the dynamic factors, which are often far more important for obtaining relevant search results.
2 DOMAIN DESCRIPTION
In this paper, we focus on the automatic retrieval of similar chess tactical problems from a large database of chess games. In chess, the term tactic describes a sequence of moves that takes advantage of a certain position on the board and allows the player to gain material or a positional advantage, or even leads to a forced checkmate sequence.

Chess tactical problems are particularly important for the progress of chess players. Knowledge of tactical motifs helps them to quickly recognise the possible presence of a winning or drawing combination in a position. Chess players improve their tactical skills by solving tactical problems. A large number of games are decided by tactics, since a single mistake which gives the opponent an opportunity for tactics can change the outcome of a game. To help players discover tactical possibilities in games, many common patterns or tactical motifs have been defined in the chess literature [6]. Stoiljkovikj et al. developed a method for estimating the difficulty of chess tactical problems [9]. They introduced the concept of meaningful search trees, which can potentially be used either for motif recognition or as an additional feature for positional similarity ranking.

We use standard chess annotation. Chess games are stored using Portable Game Notation (PGN), chess positions are described with Forsyth-Edwards Notation (FEN), and chess moves are described with Standard Algebraic Notation (SAN) [3].

Figure 1 shows some of the more common motifs. In Figure 1a, Black performs a double attack on the white king and queen at the same time. White must move the king out of check, allowing Black to capture the queen. Figure 1b is an example of a discovered attack: by moving the bishop, White opens the queen's line of attack on the rook on a2. After Black responds to move out of check, White can capture the black rook. The tactic in Figure 1c is called deflection. The black king protects the rook on f8. White gives a check with the bishop, forcing the black king to move away from the rook so that it can be captured.

Figure 1: Tactical motifs (a–c).

To illustrate the difference between static and dynamic similarity with an example, we compare the query position in Figure 2a with the positions in Figures 2b and 2c. The position in Figure 2b seems to be very similar to that in Figure 2a: only the white rook on h4 and the black rook on e8 have been removed. These two positions are statically similar. On the other hand, the position in Figure 2c seems quite different. However, if we compare the move sequences that represent the solutions to these two tactical problems, we notice a great dynamic similarity. The solution in Figure 2a is 1. Rh8+ Kxh8 2. Qh6+ Kg8 3. Qxg7#. The solution in Figure 2c contains the same tactical motif: the white rook is sacrificed on h8 and the black king must capture it, allowing the white queen to appear with check on h6 (note that it cannot be captured due to the activity of the white bishop along the long diagonal) and deliver checkmate on the next move. Note that such a motif is not possible in the position shown in Figure 2b.

Figure 2: Static and dynamic similarity (a–c).

We are particularly interested in recognising the dynamic similarity, i.e. finding positions with similar motif(s) in the solution of the problem. However, we also want to take into account the static similarity, i.e. finding problems with a similar initial position.

3 SIMILARITY COMPUTATION
To determine the similarity between tactical problems we use an approach based on information retrieval. A set of features is computed from each problem's starting position and its solution move sequence. The features are then converted into textual terms, forming a document that represents the problem. A collection of documents is used to build an index, which can then be queried with the textual encoding of a new position to retrieve the most similar positions in the index. For the implementation of the system for indexing and retrieval of similar tactics we use the Apache Lucene Core library. Search results are ranked using the BM25 ranking function [8].
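For reference, the Okapi BM25 score of a document D for a query Q = {q_1, ..., q_n} is, in LaTeX notation:

    \mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
        \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in document D, |D| is the document length in terms, avgdl is the average document length in the collection, and k_1 and b are free parameters (the experiments in Section 4.1 use the defaults k_1 = 1.2 and b = 0.75).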
For each tactic, the input consists of a starting position in FEN format and a solution move sequence in algebraic notation. The solution can be provided with the position or calculated using a chess engine. Sections 3.1 and 3.2 describe the features and terms that are generated, and Figure 3 shows an example of a text encoding.

3.1 Static Features
The static part of the encoding includes information about the positions of the pieces on the board, the structural relationships between the pieces, and the pawn structures present in the position. The implementation is based on previous work on similar position retrieval [4] and pawn structure detection [7], and is intended to serve as a baseline on which we aim to improve by implementing the encoding of dynamic features.

3.1.1 Piece positions and connectivity. The section of the encoding describing piece positions and connectivity consists of three parts:
• naive encoding – the positions of all the pieces on the board;
• reachable squares – all squares reachable by pieces on the board in one move, with decreasing weight based on the distance from the original position, in the format {piece symbol and position}|{weight};
• connectivity between the pieces – the structural relationships between the pieces in the position. For each piece it is recorded which other pieces it attacks, defends or attacks through another piece (X-ray attack). Attacks are encoded as {attacking piece symbol}>{attacked piece symbol and position}; for defense and X-ray attack terms, the < and = separators are used instead.

3.1.2 Pawn structures. For this section of the encoding, we use pawn structure detection algorithms [7] to detect the following pawn structures in the position and encode them into terms: isolated pawns (I{pawn position}), (protected) passed pawns (F{pawn position}), backward pawns, doubled pawns and pawn chains. The terms P({number}) and p({number}) encode the number of pawn islands for White and Black, respectively.
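A minimal sketch of how the naive-encoding and attack terms can be generated; the term formats follow the description above, while the use of the python-chess library is an assumption, as the paper does not name its move-generation library (the reachable-squares, defense and pawn-structure terms are omitted for brevity):

    import chess

    def static_terms(fen):
        board = chess.Board(fen)
        terms = []
        for square, piece in board.piece_map().items():
            symbol = piece.symbol()              # upper case = White, lower = Black
            name = chess.square_name(square)
            terms.append(symbol + name)          # naive encoding, e.g. "Qb4"
            # Connectivity: attacks on enemy pieces, e.g. "q>Pb2".
            # Defended own pieces would use '<' and X-ray attacks '='.
            for target in board.attacks(square):
                attacked = board.piece_at(target)
                if attacked is not None and attacked.color != piece.color:
                    terms.append(f"{symbol}>{attacked.symbol()}{chess.square_name(target)}")
        return terms

    print(static_terms("r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R w KQkq - 0 4"))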
3.2 Dynamic Features
In the dynamic part of the encoding, we focus on the solution of the tactical problem, trying to capture the motif behind it. We first encode some general characteristics of the solution, then add more specific terms describing the move sequence.

3.2.1 General dynamic features. In this part we encode some basic features of the solution move sequence that can help us determine similarity. We use a single term for each of the following features if it holds for the solution:
• ?px – the player captures a piece in at least one of the moves
• ?ox – the opponent captures a piece in at least one of the moves
• ?+ – the player gives a check at least once during the sequence
• ?= – the player promotes a pawn in at least one of the moves
• ?S – the player sacrifices one or more pieces
• ?# – the solution ends with a checkmate
• ?1/2 – the solution ends in a draw

3.2.2 Solution sequence features. In this section we encode information about the solution move sequence. The encoding includes a term for each:
• type of piece moved: !-{piece symbol}
• type of piece captured: !x{piece symbol}
• attack between pieces that occurs during the solution: !{attacking piece symbol}>{attacked piece symbol}
• type of piece sacrificed: !S{piece symbol}
• (if the final position is a checkmate) type of piece involved in the checkmate: !#{piece symbol}

We count a piece as involved in the checkmate if it attacks either the king directly or any of the squares to which the king could move from the current position (ignoring checks). To include information about the order of moves and captures, we also include a term for each two consecutive moves and captures in the solution, as well as a term for each pair of pieces involved in the checkmate, to capture more specific combinations of pieces.

Figure 3: Text encoding of a tactical position. (a) Encoded position; Black to play, solution: 1... Qh1+ 2. Nxh1 Rg2#. (b) Text encoding of each set of features for the above position, for example: static_positions: qc1 Pb2 Pf2 Kh2 Pa3 Rd3 Ng3 Rh3 Qb4 ...; reachable squares: qa1|0.78 qb1|0.89 qd1|0.89 qe1|0.78 ...; connectivity: q>Pb2 q>Pc4 Q>nb7 N>pg7 r>Ng3 ...; solution terms: !N>q !q>K !b>N !K>r !r>K !r>P ...
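As an illustration, the general dynamic terms of Section 3.2.1 can be derived from a FEN position and a SAN solution line as sketched below (again assuming the python-chess library; the sacrifice term ?S requires deeper analysis and is omitted):

    import chess

    def general_dynamic_terms(fen, solution_sans):
        board = chess.Board(fen)
        terms = set()
        player = board.turn                      # side solving the problem
        for san in solution_sans:
            move = board.parse_san(san)
            if board.is_capture(move):
                terms.add("?px" if board.turn == player else "?ox")
            if move.promotion and board.turn == player:
                terms.add("?=")
            board.push(move)
            # After the push it is the other side's turn; a check delivered
            # by the player leaves the opponent's side to move in check.
            if board.is_check() and board.turn != player:
                terms.add("?+")
        if board.is_checkmate():
            terms.add("?#")
        return terms

    # Back-rank mate example: 1. Rd8# yields {'?+', '?#'}.
    print(general_dynamic_terms("6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1", ["Rd8#"]))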
3.2.2 Solution sequence features. In this section we encode information about the solution move sequence. The encoding includes a term for each:

• type of piece moved: !-{piece symbol};
• type of piece captured: !x{piece symbol};
• attack between pieces that occurs during the solution: !{attacking piece symbol}>{attacked piece symbol};
• type of piece sacrificed: !S{piece symbol};
• (if the final position is a checkmate) type of piece involved in the checkmate: !#{piece symbol}.

We count a piece as involved in the checkmate if it is attacking either the king directly or any of the squares to which the king could move from the current position (ignoring checks).

To include information about the order of moves and captures, we also include a term for each two consecutive moves and captures in the solution. We also include a term for each pair of pieces involved in the checkmate, to capture more specific combinations of pieces.

Figure 3: Text encoding of a tactical position. (a) Encoded position. Black to play, solution: 1... Qh1+ 2. Nxh1 Rg2#. (b) Text encoding of each set of features for the position; examples of generated terms: static_positions: qc1 Pb2 Pf2 Kh2 Pa3 Rd3 Ng3 Rh3 Qb4 ...; reachable squares: qa1|0.78 qb1|0.89 qd1|0.89 qe1|0.78 ...; connectivity: q>Pb2 q>Pc4 Q>nb7 N>pg7 r>Ng3 ...; other (pawn and solution sequence) terms: Pq !N>q !q>K !b>N !K>r !r>K !r>P.

4 EXPERIMENTAL RESULTS

We used a chess tactics training course1 to obtain a set of position pairs that were considered similar by human experts. We manually checked the puzzles and verified the similarity between the solutions of the individual problem pairs. A total of 400 pairs were collected for the test data set. An example of such a pair is shown in Figure 4. The solution to both problems is to sacrifice the rook on the a-file to expose the king, resulting in checkmate with the other rook and the bishop on e4. The solution of the simplified problem contains the same motif, but there are far fewer pieces, so the solution is generally easier for students to find.

1 https://chesskingtraining.com/ct-art

Figure 4: A pair of tactical problems from the data set. (a) Base problem. Black to play, solution: 1... Rxa2+ 2. Kxa2 Ra8+ 3. Ba7 Rxa7+ 4. Qa5 Rxa5#. (b) Simplified problem. Black to play, solution: 1... Rxa2+ 2. Kxa2 Ra5+ 3. Qa4 Rxa4#.

4.1 Evaluation of Similarity Detection

We tested the effectiveness of our methods using the set of 400 pairs of problems described in the previous section. We first built an index using the simplified version of the problem from each pair, then performed a query on the index with each of the regular problems. For each query we recorded the rank of the matching position in the results and calculated how often the matching position appeared as the top result or within the first N results.

We tested the search accuracy using the following feature subsets: each feature group on its own, all static features, all dynamic features, and all features combined. All runs used the default BM25 parameters k1 = 1.2 and b = 0.75, and all included feature sets were weighted equally. The results are presented in Table 1.

Table 1: Success rates for different configurations.

    Feature set used               top-1   top-5   top-10
    static_positions               0.234   0.378   0.428
    static_pawns                   0.033   0.083   0.126
    dynamic_general                0.008   0.038   0.071
    dynamic_solution               0.421   0.657   0.761
    all static features            0.252   0.370   0.433
    all dynamic features           0.418   0.652   0.761
    all features, equal weights    0.481   0.736   0.814

Using either only static or only dynamic features did not yield the best results. The results improved significantly when static and dynamic features were combined. This shows that each set of features covers a different aspect of a tactic, and both need to be considered when determining similarity.
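The paper relies on Lucene's BM25 implementation; purely to make the ranking function concrete, here is a self-contained Python sketch using the same default parameters k1 = 1.2 and b = 0.75 (the example documents and variable names are ours).

    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.2, b=0.75):
        # docs: list of term lists, one per indexed tactical problem
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter(t for d in docs for t in set(d))    # document frequencies
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for t in query:
                if tf[t] == 0:
                    continue
                idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
                norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
                score += idf * tf[t] * (k1 + 1) / norm
            scores.append(score)
        return scores

    docs = [["?px", "?+", "?#"], ["?ox", "?1/2"], ["?px", "?#", "?S"]]
    print(bm25_scores(["?px", "?#"], docs))   # documents sharing more terms score higher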
4.2 Similar Position Retrieval

In the second experiment, we selected 10 contextually different chess tactical problems and then automatically retrieved the 5 most similar positions for each of them from a large database of 278,840 tactical problems constructed from the lichess.org game database. Building the index took about 14 minutes (it only needs to be done once), and retrieval was fast: only about 4 seconds.

Figure 5 shows a query position and the first two of the five most similar retrieved positions. This example illustrates how the similarity ranking works and how the static and dynamic features contribute to the similarity scores of the results. The query position is an example of a discovered attack motif. With 1... Bh2+, Black sacrifices the bishop to later capture the rook on e1 with the queen. The first result shows the same motif with an almost identical move sequence. The main difference is that the key pieces are on the d-file and not on the e-file. The second result is another case of a discovered attack. In this example it is not a bishop but a knight that is sacrificed with a check to the white king. It is the static similarity (the arrangement and position of the pieces in the initial position) that contributes most to the great overall similarity of this tactical problem, although a certain dynamic similarity was also detected.

Figure 5: Example of retrieval results. (a) Query position. Black to play, solution: 1... Bh2+ 2. Kxh2 Qxe1. (b) Retrieval results:

    Result  Solution                   Similarity score
    1       1... Bh2+ 2. Kxh2 Qxd1     static 38.95, dynamic 45.04, total 83.99
    2       1... Nf3+ 2. Qxf3 Qxe1+    static 64.62, dynamic 12.32, total 76.94

The resulting most similar positions were shown to a chess expert. The expert was asked to comment on the reasons for the similarity of the retrieved problems to the original query positions, taking into account both static and dynamic aspects. The expert was able to explain the similarity in 48 out of 50 problems. Overall, the expert praised the program's ability to detect dynamic similarity of positions, even when the initial positions differ significantly.

5 CONCLUSIONS

We introduced a novel method for retrieving similar chess positions, which takes into account not only static similarity due to the arrangement of the chess pieces, but also dynamic similarity based on the recognition of chess motifs and dynamic, tactical aspects of position similarity. The merits of the method were put to the test in two experiments. The first experiment emphasized the importance of including both static and dynamic features for the successful detection of similar chess motifs. In the second experiment, the program was able to quickly traverse a large database of positions to identify similar chess tactical problems. A chess expert was able to explain the similarity in the vast majority of the retrieved problems and praised the program's ability to detect dynamic similarity of positions even when the initial positions differ significantly. The resulting program can be useful for the automatic generation of instructive examples for chess training.

REFERENCES
[1] G. Costeff. 2004. The Chess Query Language: CQL. ICGA Journal, 27, 4, 217–225.
[2] Mark Dvoretsky and Artur Yusupov. 2006. Secrets of Chess Training. Edition Olms.
[3] International Chess Federation (FIDE). 2020. The FIDE Handbook. https://handbook.fide.com/.
[4] Debasis Ganguly, Johannes Leveling, and Gareth J. F. Jones. 2014. Retrieval of similar chess positions. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 687–696.
[5] Matej Guid, Martin Možina, Jana Krivec, Aleksander Sadikov, and Ivan Bratko. 2008. Learning positional features for annotating chess games: A case study. In International Conference on Computers and Games. Springer, 192–204.
[6] Chess Informant. 2014. Encyclopedia of Chess Combinations, 5th Edition. Chess Informant.
[7] Matic Plut. 2018. Recognition of positional motifs in chess positions. Diploma thesis. University of Ljubljana.
[8] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109, 109.
[9] Simon Stoiljkovikj, Ivan Bratko, and Matej Guid. 2015. A computational model for estimating the difficulty of chess problems. In The Third Annual Conference on Advances in Cognitive Systems.
[10] Beverly Park Woolf. 2010. Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing E-learning. Morgan Kaufmann.
[11] Moshé M. Zloof. 1975. Query-by-example: the invocation and definition of tables and forms. In Proceedings of the 1st International Conference on Very Large Data Bases, 1–24.

Drinking Detection From Videos in a Home Environment

Carlo M. De Masi, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, carlo.maria.demasi@ijs.si
Mitja Luštrek, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, mitja.lustrek@ijs.si

ABSTRACT

We present a pipeline developed with the aim of helping people with mild cognitive impairment (MCI) in the accomplishment of every-day tasks. Our system adopts a number of computer vision methods to analyze RGB videos collected from cameras, and provides successful, quasi real-time detection of the targeted activity (drinking) when the latter is at least partially visible to the camera.

KEYWORDS

computer vision, activity recognition, object detection, pose estimation
1 INTRODUCTION

Mild cognitive impairment (MCI) is a common problem among the elderly, affecting 15–20% of people over 65 in the USA [10]. In order to help people affected by MCI in the accomplishment of every-day tasks, we adopt various kinds of detection techniques to predict what users are currently doing, which, combined with knowledge of their activity schedule, allows our system to provide context-based reminders. Here, we present our attempts to detect one such activity (drinking) from videos, using computer vision and deep learning algorithms.

This paper is organized as follows. In the remainder of this section, we give an overview of the current state of the art regarding activity recognition from videos. In Section 2 we describe the computer vision techniques used to trigger the more computationally intensive task of activity recognition, in order to obtain quasi real-time monitoring of the user's activities. Finally, in Sections 3 and 4 we present the results and conclusions of the paper.

1.1 Video Activity Recognition

Unlike image classification, where in recent years a number of clear front-runner architectures and techniques have been established, activity recognition from videos still presents numerous open issues [1].

An immediate approach to the problem consists in using image classification networks to extract features from each frame of the video; predictions for the whole video can then be obtained either by pooling over frames (at the cost of losing information about temporal ordering) [5] or by adopting LSTM layers [2].

A more elaborate way to adapt the concepts used in image classification to video recognition consists in using 3DCNNs, i.e. convolutional models characterized by an additional, third temporal dimension [4, 12, 13]. The increased number of parameters makes 3DCNNs generally harder to train than their 2D counterparts. One way to fix this is to produce 3D models by "inflating" 2D ones, i.e. by adding a temporal dimension to a model pre-trained for image classification. This makes it possible to determine the architecture of the 3D network and to bootstrap its weights from the corresponding values in the 2D model: convolutional kernels with dimensions N × N are inflated to a 3D kernel with dimensions N × N × t, spanning t frames, and each of the t planes in the N × N × t kernel is initialized with the pre-trained N × N weights rescaled by 1/t [1, 9].
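The inflation of a single layer can be illustrated with the following PyTorch sketch (ours, not the code of [1, 9]): the pre-trained N × N kernel is replicated t times along the new temporal axis and rescaled by 1/t, exactly as described above.

    import torch
    import torch.nn as nn

    def inflate_conv2d(conv2d: nn.Conv2d, t: int) -> nn.Conv3d:
        # Build the 3D layer with an N x N x t kernel spanning t frames
        conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                           kernel_size=(t, *conv2d.kernel_size),
                           padding=(t // 2, *conv2d.padding),
                           bias=conv2d.bias is not None)
        with torch.no_grad():
            # (out, in, kH, kW) -> (out, in, t, kH, kW), each plane rescaled by 1/t
            conv3d.weight.copy_(conv2d.weight.unsqueeze(2).repeat(1, 1, t, 1, 1) / t)
            if conv2d.bias is not None:
                conv3d.bias.copy_(conv2d.bias)
        return conv3d

    layer2d = nn.Conv2d(3, 64, kernel_size=7, padding=3)  # e.g. from a pre-trained model
    layer3d = inflate_conv2d(layer2d, t=7)
    print(layer3d.weight.shape)                           # torch.Size([64, 3, 7, 7, 7])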
Another approach separately analyzes spatial components (i.e. single frames), providing static information about the scenes and objects in the picture, and temporal components related to motion and variation between frames [11]. A two-stream network processes single frames and optical flows in parallel, and then combines their predictions.

Finally, another method worth mentioning is based on the observation that some actions (e.g., clapping hands) are better characterized by high-frequency temporal features, whereas others (e.g., dancing) can be better understood when lower-frequency variations are observed. As a result, a model characterized by two parallel channels can be used. The first (slow) channel operates at a low framerate and analyzes a few sparse frames, in order to deduce the semantics of the action, while the second (fast) branch is responsible for capturing fast variations, and so operates at a higher framerate [3].

In this work, we adopted a modified version of an inflated 3D network as described in [14], extended to include non-local blocks. Unlike convolutional and recurrent operations, which are only able to capture spatio-temporal features in a local neighborhood, non-local blocks compute the response at a certain position as a weighted sum of the features at all positions in space and time. This allows the model to capture dependencies between pixels that are distant both in space and time, and makes it more accurate for video classification.

2 SYSTEM ARCHITECTURE

The purpose of our system is to provide users with context-based reminders related to the activity of drinking. To this aim, an RGB camera is placed in the kitchen of the user's apartment (where the activity is most likely to take place) and the video is sent through an RTSP stream to a remote server, to be analyzed by the activity recognition model during the day. The results are uploaded to a Cloud Firestore database, which is queried to determine whether the users have been drinking enough; if not, reminders are provided through an app running on a local device.

One problem arising from this scheme is that most action recognition algorithms are computationally expensive, which prevents them from running in real time. For this reason, we decided not to run the model continuously, but to execute it only in moments when it is most likely that the users are about to perform the targeted activity (sketched below). We employed a combination of classic and deep-learning-based computer vision techniques to identify triggers for the video activity recognition model, such as: (i) the user standing in certain areas of the kitchen; (ii) the user standing in certain areas of the kitchen and interacting with some objects (tap, fridge); (iii) a specific object, assumed to be used by the user for drinking, being moved from its current position.

Figure 1: System architecture. The video stream from RGB cameras is sent to a remote server and fed to the activity recognition model. Results are uploaded to a Firestore database, where they are monitored so that notifications can be sent back to an app.
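The trigger gating can be summarized with the sketch below (our illustration; cheap_trigger_fired and run_activity_recognition are hypothetical placeholders for the trigger checks of Sections 2.1–2.2 and the recognition step of Section 2.3).

    import cv2

    def monitor(rtsp_url, cheap_trigger_fired, run_activity_recognition):
        # Run lightweight trigger checks on every frame; invoke the
        # expensive activity recognition model only when a trigger fires.
        cap = cv2.VideoCapture(rtsp_url)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if cheap_trigger_fired(frame):      # ROI activation or mug displacement
                run_activity_recognition(cap)   # record and classify short clips
        cap.release()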
2.1 User Localization and Interaction with the Environment

The localization of the user and their interactions with the environment are detected through a combination of object detection and pose estimation techniques. For object detection, we adopted a Single Shot MultiBox Detector (SSD) [8], pre-trained on the 80 classes of the COCO dataset [7], which include "person". For pose estimation, we used a SimpleNet model with a ResNet backbone [15].

During the initial setup, the camera image is shown to the user (Fig. 2a) and regions of interest (ROIs) can be selected (Fig. 2b). These can be of two types, i.e. single- or double-zone. The first are identified by a single rectangular box, which is activated when the user's feet are within the box, hence providing an indication of the user's location (see Fig. 2c). Double-zone ROIs are formed by two rectangular boxes; one of them, analogously to the previous case, is activated when the user steps inside it, while the second box is activated if one of the user's hands (located by the pose estimation model) is within it (Fig. 2d). A double-zone ROI is considered activated only if both conditions are met (a sketch of this logic is given after the figure). Once the ROI is configured, the user is requested to input:

• the name used to identify the current ROI;
• an observation time t_obs (in seconds), i.e. the time after which the ROI is activated, once the requirements (user and hand positions) are met;
• an action to be performed once the ROI is activated. Currently, only one default action - recording and analyzing video clips - is supported, but this will be extended to include further possibilities.

Figure 2: Triggers based on the user's location and their interaction with the environment. Regions of interest are selected during the setup phase (b), and they are activated either if the user steps inside (c), or if the user steps inside and has their hands next to another object (d).
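The activation logic can be illustrated as follows (our sketch, assuming boxes as (x1, y1, x2, y2) tuples and key points as (x, y) pairs returned by the detectors; the observation time t_obs is handled outside this check).

    def inside(point, box):
        # True if an (x, y) point lies within an (x1, y1, x2, y2) box
        x, y = point
        x1, y1, x2, y2 = box
        return x1 <= x <= x2 and y1 <= y <= y2

    def roi_activated(roi, feet, hands):
        # Single-zone ROI: the user's feet must be inside the box.
        # Double-zone ROI: additionally, one hand must be inside the second box.
        if not all(inside(f, roi["zone"]) for f in feet):
            return False
        if "hand_zone" in roi:
            return any(inside(h, roi["hand_zone"]) for h in hands)
        return True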
2.2 Drinking Vessel Position Detection

A second trigger for activity recognition is given by the displacement of a particular object (mug, cup, glass). In this regard, in the pilot phase of the project users will be asked to always use one specific drinking vessel when drinking, which the model will be trained to recognize. For this task, we considered two possibilities:

• a classic computer vision approach, where the drinking vessel is located through color/shape-based detection;
• a deep learning object detection algorithm, re-trained to detect a personalized mug.

In the first scenario, we applied a series of filters (Gaussian blur, dilation/erosion) to reduce noise, followed by a color mask in the HSV space to select only objects with a certain color. A further selection is then made based on the shape properties of the previously selected areas: a polygonal approximation of their contours is performed, and other shape-related features such as area, circularity and convexity are considered to eliminate shapes different from the expected one (see the sketch at the end of this section). In the second case, we collected a dataset of about 500 images of the selected mug and used it to re-train a second SSD model.

In order to account for false negatives in the mug detection, which may occur in some frames even if the mug has not been moved, for each frame the current position of the mug is compared to the history of positions in the past few frames. Once a displacement of the mug is detected, the trigger is activated.

2.3 Clip Recording and Activity Recognition

Following the activation of one of the triggers, the subsequent video frames (covering a time interval of about 30 seconds) are used to generate short video clips, each with a duration of 10 seconds and an overlapping window of 4 seconds. These values were selected to give a higher probability of obtaining at least one video clip that completely captures the whole drinking process, and to match the length of the videos in the Kinetics400 dataset [6], which was used to train the activity recognition model.
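As a concrete illustration of the classic pipeline from Section 2.2, an OpenCV sketch follows (ours; the HSV bounds and shape thresholds are made-up placeholders that would require the careful tuning discussed in Section 3.2).

    import cv2
    import numpy as np

    def find_mug(frame, hsv_lo=(100, 120, 60), hsv_hi=(130, 255, 255)):
        # Denoise, mask by color in HSV, then filter candidate contours
        # by shape (area, circularity, convexity).
        blur = cv2.GaussianBlur(frame, (5, 5), 0)
        mask = cv2.inRange(cv2.cvtColor(blur, cv2.COLOR_BGR2HSV),
                           np.array(hsv_lo), np.array(hsv_hi))
        mask = cv2.dilate(cv2.erode(mask, None, iterations=2), None, iterations=2)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            area = cv2.contourArea(c)
            perimeter = cv2.arcLength(c, True)
            if area < 500 or perimeter == 0:
                continue
            circularity = 4 * np.pi * area / perimeter ** 2
            hull_area = cv2.contourArea(cv2.convexHull(c))
            convexity = area / hull_area if hull_area > 0 else 0
            if 0.5 < circularity <= 1.2 and convexity > 0.8:
                return cv2.boundingRect(c)   # (x, y, w, h) of the detected vessel
        return None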
3 RESULTS AND DISCUSSION

In this section, we present the results of the various steps involved in the whole drinking-detection pipeline.

3.1 User Localization - Results

We tested the efficiency of the localization module in different scenarios, varying based on how clearly the user was visible (completely visible; legs occluded; head occluded; head and legs occluded, only torso visible) and on which side of the user (front/back/right/left) was visible. The results showed an average accuracy of over 98%.

3.2 Drinking Vessel Position Detection - Results

As illustrated in Sec. 2.2, for the task of detecting the displacement of the drinking vessel we adopted two approaches, one based on classic computer vision methods and one on deep learning.

The first method provides neither a confidence score for detections nor the coordinates of the object's bounding box, so we took a simpler approach than with normal object detection algorithms in evaluating the results. We collected some videos in a home-like environment, with the object located in different positions or with a person handling it (moving it, using it to drink, etc.), and analyzed them frame by frame to check whether the objects present in each frame were detected or not. The resulting confusion matrix, reported in Table 1, shows that the detection algorithm scored precision and recall values of .93 (133/(133+10)) and .90 (133/(133+15)), respectively. This method proved to be very efficient when correctly fine-tuned, and the algorithm detected the object in most of the frames where it was at least partially visible. The greatest issue of the method is that it had to be very carefully tuned, especially regarding the color selection part, which remains sensitive to lighting variations even after converting the image to the HSV colorspace. False detections can also be a problem. We tested the algorithm in situations where some of the objects present in the scene had colors similar to the object we wanted to detect, and in spite of being able to filter out most of them, we still obtained some false positives, especially when the lighting varied, rendering the selection of the parameters for the color mask less effective.

Table 1: Confusion matrix for the color/shape-based detection of the mug.

              Pred P   Pred N
    True P       133       15
    True N        10        1

The results of the evaluation of the SSD model are shown in Fig. 3. As evident from the plot, the model immediately reached a very high mAP [7], on the order of ≈ 0.9, on our test dataset. It should be noted that, while preparing the training dataset, we followed a somewhat different approach than is usual for training object-detection models. In most situations, one wants to make the model as general as possible and avoid overfitting, which is achieved by taking images of the desired object in as many different conditions (size, aspect ratio, viewing angle, rotation, lighting) as possible. In our case, however, the location of the camera will be more or less constant, i.e. attached to the ceiling of the room, in order to provide a good view of the environment. As a result, this greatly limits the variability in the images of the object the system will analyze, especially regarding the aspect ratio and the orientation of the mug. Moreover, whereas an object detector is usually tasked with identifying many different instances of objects in a certain class (i.e., a generic "mug"), in our case the task is greatly simplified by the fact that we are looking to locate one very specific object.

Figure 3: mAP values on the test dataset for the SSD model, re-trained to recognize the project's custom mug.

3.3 Activity Recognition - Results

We tested the adopted activity recognition model on a new custom dataset consisting of roughly 100 videos we recorded ourselves in a variety of environments and conditions. In order to make the clips as similar as possible to real-life situations, the videos contained instances where actions similar to drinking were performed, to increase the recognition difficulty. The clips can be classified into two difficulty categories based on the angle the user was facing with respect to the camera; videos were classified as "hard" whenever this angle was greater than 90° (see Fig. 4). The precision-recall curve for the model on this dataset is shown in Fig. 5.

Figure 4: Difficulty classes for the custom dataset used to test the activity recognition model. Video clips were classified as "hard" whenever the angle between the user's front side and the camera was greater than 90°.

Figure 5: Test results of the activity recognition model on the test dataset.

4 CONCLUSIONS

The tests performed on the triggers are very encouraging for the one based on the user's location and their interaction, and indicate that the deep-learning approach should be preferred for the detection of the drinking vessel and its displacement, especially after increasing the amount of training data. The activity recognition model based on an inflated 3D CNN with the addition of non-local blocks provided the best accuracy in situations where the user is facing the camera at least partially, and the use of triggers allows for quasi real-time usage. A number of improvements will be added to the pipeline in the future. Currently, only one action is triggered, i.e. recording and analysis of video clips, but we plan to include other possibilities, such as using the information on the user's location to check whether they need assistance in operating domestic appliances. The object detection model could also be extended to identify interactions with other elements of the environment and provide corresponding context-based responses. Finally, the only action currently recognized is drinking, but as mentioned in the introduction the aim of the project is to assist users in the accomplishment of various activities. In this sense, the next planned step is to include detection of parts of morning toilet routines, such as brushing teeth and washing hands.
REFERENCES
[1] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.
[2] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, et al. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2625–2634.
[3] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, et al. 2019. SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, 6202–6211.
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2012. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1, 221–231.
[5] Andrej Karpathy, George Toderici, Sanketh Shetty, et al. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732.
[6] Will Kay, Joao Carreira, Karen Simonyan, et al. 2017. The Kinetics human action video dataset. arXiv: 1705.06950 [cs.CV].
[7] Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. 2014. Microsoft COCO: Common objects in context. arXiv: 1405.0312 [cs.CV].
[8] Wei Liu, Dragomir Anguelov, Dumitru Erhan, et al. 2016. SSD: Single shot multibox detector. Lecture Notes in Computer Science, 21–37. doi: 10.1007/978-3-319-46448-0_2.
[9] Elman Mansimov, Nitish Srivastava, and Ruslan Salakhutdinov. 2015. Initialization strategies of spatio-temporal convolutional neural networks. arXiv preprint arXiv:1503.07274.
[10] Ronald C. Petersen, Oscar Lopez, Melissa J. Armstrong, et al. 2018. Practice guideline update summary: Mild cognitive impairment: Report of the guideline development, dissemination, and implementation subcommittee of the American Academy of Neurology. Neurology, 90, 3, 126–135.
[11] Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 568–576.
[12] Du Tran, Lubomir Bourdev, Rob Fergus, et al. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489–4497.
[13] Gül Varol, Ivan Laptev, and Cordelia Schmid. 2017. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 6, 1510–1517.
[14] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
[15] Bin Xiao, Haiping Wu, and Yichen Wei. 2018. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), 466–481.

Semantic Feature Selection for AI-Based Estimation of Operation Durations in Individualized Tool Manufacturing

Erik Dovgan, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, erik.dovgan@ijs.si
Bogdan Filipič, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, bogdan.filipic@ijs.si

ABSTRACT

Accurate estimation of operation durations is of key importance in production processes, since the accuracy of the estimations directly affects the quality of production plans and thus the entire production process. This task is even more challenging when individualized tools are being produced. From the machine learning point of view, this means a low number of diverse samples, while the number of features can be significantly higher. To tackle this issue, we introduce semantic feature selection, which reduces the number of features. This results in a better ratio between the number of samples and features and, at the same time, reduces the prediction error. We demonstrate the proposed approach on the problem of estimating the operation durations in the manufacturing of injection molds and show the prediction accuracy improvement resulting from the semantic feature selection.

KEYWORDS

injection molding, tool manufacturing, duration prediction, feature selection, random forest
1 INTRODUCTION

The efficiency of tool shop manufacturing processes heavily depends on the accuracy of production plans. Inaccurate plans can lead to significant delays in production, due date violations, late delivery penalties, and even the loss of customers. A key step of planning is accurate estimation of the durations of all the operations to be executed in the manufacturing process. The estimation can be performed manually by an expert utilizing his/her expert knowledge, or automatically by means of tools such as those involving AI methods, as demonstrated, for example, in [3].

Automated estimation of operation durations with AI methods consists of learning a predictive model from the features extracted from examples of past, i.e., already concluded, operations and their actual durations, and then applying the model to new operations with known features and unknown durations. In the case of tool manufacturing, the features can be extracted from 3D computer models of already manufactured tools. To build an accurate predictive model, a large set of already manufactured tools has to be processed. However, this is not possible in certain cases, for example, when dealing with individualized tools, such as injection molds. This is due to the fact that tool shops specialized in individualized tool manufacturing typically produce only a few such tools per year. In addition, these tools are very diverse, which increases the difficulty of automated duration prediction.

We propose an approach for predicting operation durations in the manufacturing of individualized tools. The tools are manually divided into several positions of varying complexity, where each position is specified with a 3D computer model. In addition, a set of operations is predefined for each of these positions. The proposed approach processes the 3D model of each position and predicts the duration of the corresponding manufacturing operations. To this end, it first extracts a set of volume, surface, gradient and other features from the 3D model, and then applies the Random Forest regression model [1] to predict the duration of each operation. This process is additionally enhanced with semantic feature selection, which evaluates various sets of semantically related features, such as volume features, in order to assess the predictive capability of these feature sets. We demonstrate the proposed approach on the problem of estimating the operation durations in the manufacturing of injection molds in a specific tool shop. By processing a dataset from this tool shop, we show the prediction accuracy improvement resulting from the semantic feature selection.

The rest of the paper is organized as follows. Section 2 introduces the relevant tool positions and the related operations, and describes the extracted features and the semantic feature selection. Numerical experiments and the obtained results are presented in Section 3. Finally, Section 4 concludes the paper with a summary of our work and ideas for future work.

2 PREDICTING OPERATION DURATIONS WITH AI METHODS

Prediction of operation durations consists of extracting features from the tool data in the form of 3D computer models, and applying a machine learning model to predict the durations. This approach is applied to each tool position and each operation at this position independently; thus a custom machine learning model is built and applied for each combination of position and operation. In addition, when feature selection is involved, a different set of features is considered for each of these combinations.
2.1 Relevant Positions and Related Operations

The tools regarded in this study are injection molds that are used to form final products made of plastic under high pressure. Although an injection mold is composed of several positions, its most complex and thus most relevant positions are the bottom and the top element. These two elements have to be manufactured with the highest precision. Since they are in physical contact with the final product, any defect of the mold surface would result in a defect of the final product. An example of an injection mold is shown in Figure 1, where the red color indicates the surface that is in contact with the final product. In the dataset used in this study, these two elements are marked as positions 1 and 30. These positions require a set of operations, of which the most relevant are shown in Tables 1–2.

Figure 1: Example of a 3D computer model of an injection mold, https://grabcad.com/library/injection-mold-pc-abs-1 by Mauro Menchini.

Table 1: Operations at Position 1.

    Operation   Description
    32          CAM rough
    31          CAM fine
    43          CAM erosion
    19          Heat treatment
    23          Measuring machine
    36          CNC milling 3 axis, rough
    41          CNC milling 3 axis, fine
    42          CNC milling 5 axis, fine
    13          Submersible erosion

Table 2: Operations at Position 30.

    Operation   Description
    32          CAM rough
    31          CAM fine
    37          CAM wire erosion
    43          CAM erosion
    19          Heat treatment
    11          Wire erosion
    23          Measuring machine
    36          CNC milling 3 axis, rough
    41          CNC milling 3 axis, fine
    42          CNC milling 5 axis, fine
    13          Submersible erosion

2.2 Description of the Extracted Features

The proposed approach extracts a set of features from a 3D computer model of a tool. These features were suggested by a tool shop expert and can be categorized as follows:

• volumes of the entire tool position (such as the volume of the shape and the volume of the mold),
• volumes of the holes that are open, and of those that are closed,
• features for each of 6 directions, i.e., projections (x, y, z, each of them increasing or decreasing); for example, direction (z, decreasing) defines the features obtained from the top-down projection, while direction (z, increasing) defines the features obtained from the bottom-up projection; the features for each direction are as follows:
  – volumes (including the volumes of holes),
  – surface area,
  – number of faces,
  – number of faces per dm2,
  – valley features, computed as the height versus width ratio of the valleys (in all valley directions, to find the maximum value); this feature is aimed at identifying deep and narrow valleys that are harder to process,
  – valley height, computed as the height of the valleys in all valley directions (to find the maximum value); this feature is aimed at obtaining the depth of valleys, which represents the drill distance,
  – gradient features, calculated as the maximum gradient in all directions; this feature is aimed at identifying areas with a non-horizontal and non-vertical gradient that are harder to process.
Since the valley features, valley height and gradient features are calculated for each point of the projection, the number of features is very high and varies across the tool positions, which are of varying sizes. To reduce the number of features and obtain a constant number of features independently of the position size, histograms of these features are calculated using expert-defined bins.
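For example, reducing the per-point values of one projection to a fixed-length histogram feature vector could look as follows (a sketch; the bin edges below are invented stand-ins for the expert-defined ones).

    import numpy as np

    # Hypothetical expert-defined bin edges for the gradient feature
    GRADIENT_BINS = [0.0, 0.2, 0.5, 1.0, 2.0, np.inf]

    def histogram_feature(per_point_values, bins=GRADIENT_BINS):
        # Collapse a variable-sized set of per-point values (one per
        # projection point) into a constant-length histogram,
        # independent of the position size.
        counts, _ = np.histogram(per_point_values, bins=bins)
        return counts / max(len(per_point_values), 1)   # normalized frequencies

    print(histogram_feature(np.random.default_rng(0).exponential(0.5, size=1000)))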
The 3D model of each position also contains expert-defined annotations of the model parts, marked with different colors of the model faces (see the example in Figure 1). These model parts are also taken into account when extracting features, thereby obtaining additional features that characterize a feature for each part independently. For example, when calculating the number of faces, one feature is obtained for all the faces, and for each part an additional feature is calculated denoting the number of faces on that specific part. The part-specific features are calculated for the following features:

• volumes of the holes: total, open, closed,
• projection features:
  – volume,
  – surface area,
  – number of faces,
  – number of faces per dm2,
  – valley features,
  – valley height,
  – gradient features.

Examples of parts that are annotated in the 3D computer models include: (1) Free holes, (6) Tolerance holes, (7) Parting surface, (10) Matching surfaces, (12_4) Part shape: High gloss polished, (12_5) Part shape: Optical faces, (12_7) Part shape: Galvanic pins, (12_8) Part shape: Special surface finishing. In total, 30 parts are annotated by the expert.

2.3 Semantic Feature Selection

The total number of features obtained by the presented feature extraction procedure is 3472. Since this is a large number, we introduce semantic feature selection, which combines semantically similar features into (partially overlapping) feature sets. In addition, the tool shop expert also selected a set of the most relevant features for each operation; however, this was defined only for a limited set of crucial operations. The resulting feature sets and the related numbers of features are shown in Table 3. Specifically, if the name of a set contains "part", the set contains all the features of the specific part. The "valley_hist_" sets contain the valley features, "valley_h_" the valley height, and "grad_hist_" the gradient features. The projection sets "projection_" contain all the features from specific projections and are defined as follows:

• projection_100: projection from left to right (x axis)
• projection_200: projection from right to left (x axis)
• projection_010: projection from front to back (y axis)
• projection_020: projection from back to front (y axis)
• projection_001: projection from bottom to top (z axis)
• projection_002: projection from top to bottom (z axis)

In total, 60 sets of features were defined.

Table 3: Feature Sets.

    Name                              Number of features
    expert                            524 on average
    volume                            6
    volume_projection                 30
    volume_no_hole                    3
    volume_projection_no_hole         6
    volume_hole                       3
    volume_projection_hole            24
    volume_hole_part                  90
    volume_projection_no_hole_part    180
    material                          4
    surface_projection                6
    surface_projection_part           180
    faces_count_projection            6
    faces_count_projection_part       180
    faces_per_dm2_projection          6
    faces_per_dm2_projection_part     180
    valley_hist_projection            18
    valley_hist_projection_part       540
    valley_h_projection               48
    valley_h_projection_part          1440
    grad_hist_projection              18
    grad_hist_projection_part         540
    projection_*                      562
    projection_side                   2248
    projection_top_bottom             1124
    part_*                            111

3 EXPERIMENTS AND RESULTS

We evaluated the proposed approach on a dataset from the Plamtex tool shop [4, 2]. Due to individualized tool manufacturing, the number of already produced tools was low, namely 30 instances of position 1 and 26 instances of position 30. Besides the actual duration of each operation, each instance also included the duration estimated by the tool shop expert.

The operation durations were predicted with the Random Forest regression model. Its performance was assessed with the leave-one-out test using the default model-building parameters. The selected performance metric was the Root Mean Squared Error (RMSE), which has to be minimized. The RMSE was also calculated for the durations estimated by the expert. The effectiveness of feature selection was determined by comparing the Random Forest performance when using all the features and when using only a selected set of features.
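This evaluation protocol can be sketched with scikit-learn as follows (our illustration; X holds the columns of one feature set and y the actual operation durations).

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import LeaveOneOut

    def loo_rmse(X, y):
        # Leave-one-out RMSE of a default Random Forest, as used to compare
        # the full feature set against each semantic feature subset.
        errors = []
        for train, test in LeaveOneOut().split(X):
            model = RandomForestRegressor(random_state=0).fit(X[train], y[train])
            errors.append((model.predict(X[test])[0] - y[test][0]) ** 2)
        return np.sqrt(np.mean(errors))

    # rmse_per_set = {name: loo_rmse(X[:, cols], y) for name, cols in feature_sets.items()}
    # best_set = min(rmse_per_set, key=rmse_per_set.get)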
The initial experiment aimed at finding out whether the prediction of operation durations involving the proposed feature selection outperforms the prediction without feature selection, i.e., considering all the features (the default feature set). To this end, for each combination of position and operation, all the feature sets were processed and the feature set with the lowest RMSE was selected. The results are shown in Figure 2. These results are normalized with respect to the RMSE of the durations estimated by the expert and are therefore expressed as percentages of the RMSE resulting from the expert estimation. They show that for each combination of position and operation, there exists at least one set of features that allows for more accurate prediction than the default feature set (since it reduces the RMSE). In addition, for position 1, operation 32, and position 30, operation 31, the default feature set produces an RMSE equal to the RMSE of the expert estimation, while feature selection improves it. For position 30, operation 32, the default feature set results in a higher RMSE than the RMSE of the expert estimation. Although in this case feature selection improves the result, it still performs worse than the expert estimation.

Figure 2: Percentages of RMSE with respect to the RMSE of durations estimated by the tool shop expert (bars: planned by expert, no feature selection, feature selection). The horizontal axis denotes the combinations of (position, operation).

Subsequently, the most relevant combinations of positions and operations were analyzed in more detail, and selected results are presented in Figures 3–5. These results show the RMSE of the durations estimated by the expert, the RMSE obtained without feature selection, and the RMSE obtained with various sets of features. To make the figures readable, we only show the best 33% of the feature sets. Figure 3 shows position 1 and operation 36 (i.e., CNC milling 3 axis, rough). The best features are the gradient features, surface features and features from the bottom-up projection. Note also that the bottom side of this position is the most complex one, thus the bottom-up projection is of high importance. The same projection is also the most relevant for position 1, operation 13 (i.e., submersible erosion) (see Figure 4), since the erosion is applied only to the bottom side of this position. Part 9 (i.e., released surfaces) and the faces count are also among the most important features, where the faces count can be used to estimate the complexity of the surface that has to be eroded. Finally, position 30, operation 13 (i.e., submersible erosion) is shown in Figure 5. For this combination, the top-down projection is the most relevant, since the erosion is applied only to the top side of this position. Part 1 (i.e., free holes) and the faces count are also very important. The importance of the appropriate projection and the faces count is consistent with the results for position 1 and the same operation (see Figure 4).

Figure 3: RMSE obtained when predicting the duration of operation 36 (CNC milling 3 axis, rough) at position 1.

Figure 4: RMSE obtained when predicting the duration of operation 13 (submersible erosion) at position 1.

Figure 5: RMSE obtained when predicting the duration of operation 13 (submersible erosion) at position 30.
4 CONCLUSION

We presented an AI-based approach to predicting the operation durations in individualized tool manufacturing, which is, in the long run, aimed at replacing the existing human-based estimation process. The proposed approach extracts a set of features from 3D computer models of tools and applies Random Forest regression to predict the operation durations. To further improve the prediction accuracy, it includes semantic feature selection, which combines features into semantically meaningful feature sets. The experimental results showed that this approach in most cases outperforms the expert predictions. In addition, semantic feature selection outperforms the approach with no feature selection. A detailed analysis of the proposed feature selection approach showed that there exist meaningful relations between the tool manufacturing operations and the best performing feature sets for predicting the durations of these operations.

In future work we will evaluate additional regression algorithms to assess the quality of the Random Forest predictions. It would also be relevant to analyze the samples for which the prediction error is the highest. Special attention should be given to the operation for which the presented approach did not outperform the expert prediction.

ACKNOWLEDGMENTS

This work was in part funded by the KET4CleanProduction project "Improved Planning of Manufacturing Processes for Individualized Tools", where the AI-based solution was developed for the Plamtex tool shop. The authors also acknowledge the financial support from the Slovenian Research Agency (research core funding No. P2-0209). We are particularly grateful to Plamtex for sharing the tool dataset and the expert knowledge on tool manufacturing, positions, operations, and the suitable features.

REFERENCES
[1] Leo Breiman. 2001. Random forests. Machine Learning, 45, 1, 5–32.
[2] Erik Dovgan, Peter Korošec, and Bogdan Filipič. 2020. ToolAnalysis: A program for predicting the duration of machining operations in the production of tools using artificial intelligence. Technical report IJS-DP 13195. Jožef Stefan Institute, Ljubljana.
[3] Mesut Kumru and Pinar Yildiz Kumru. 2014. Using artificial neural networks to forecast operation times in metal industry. International Journal of Computer Integrated Manufacturing, 27, 1, 48–59.
[4] Plamtex INT, d.o.o. 2020. https://www.plamtex.si/en/.

Generating Alternatives for DEX Models using Bayesian Optimization

Martin Gjoreski, Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan Postgraduate School, Ljubljana, Slovenia, martin.gjoreski@ijs.si
Vladimir Kuzmanovski, Department of Computer Science, Aalto University, Finland, and Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia, vladimir.kuzmanovski@aalto.fi
Marko Bohanec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia, marko.bohanec@ijs.si

ABSTRACT

Multi-attribute decision analysis is an approach to decision support in which decision alternatives are assessed by multi-criteria models. In this paper, we address the problem of generating alternatives: given a multi-attribute model and an alternative, the goal is to generate alternatives that require the smallest change to the current alternative to obtain a desirable outcome.
We present a novel method for alternative generation based on Bayesian optimization and adapted to qualitative DEX models. The method was extensively evaluated on 42 different DEX decision models of variable complexity (e.g., variable depth and variable attribute weight distribution). The method's behavior was analyzed with respect to computing time, time to obtaining the first appropriate alternative, number of generated alternatives, and number of attribute changes required to reach the generated alternatives. The experimental results confirmed the method's suitability for the task, generating at least one appropriate alternative within one minute. The relation between the decision model's depth and the computing time was linear and not exponential, which implies that the method is scalable.

KEYWORDS

multi-attribute models, method DEX, alternatives, decision support, Bayesian optimization

1 INTRODUCTION

Hierarchical multi-attribute models are a type of decision models [1], [2], [3] which decompose the problem into smaller and less complex subproblems and represent it by a hierarchy of attributes and utility functions. Such decision models are especially useful in complex decision problems [4], [5].

DEX is a hierarchical qualitative multi-attribute method whose models are characterized by the use of qualitative (symbolic) attributes and decision rules. The method is supported by DEXi [6], [7], [8], an interactive computer program for the development of qualitative multi-attribute decision models and the evaluation of alternatives (options). DEXi has been used to analyze decision problems in different domains: healthcare [9], agriculture [10], [11], [12], economy [13], etc.

A useful extension of DEX would be the possibility to search for new alternatives that require the smallest change to the existing alternative to obtain a desirable outcome. This task is important for practical decision support [14]; however, the related work on generating alternatives for qualitative multi-attribute decision models is quite scarce. The only related study was presented by Bergez [15], in which the focus is on attribute scoring (and not on the alternatives), and the starting (current) alternative was not taken into consideration. More specifically, Bergez developed a genetic algorithm for finding a set of the "worst-best", i.e., the lowest scores for the input attributes that lead to the highest score for the root attribute (the decision model's output), and the "best-worst", i.e., the highest scores for the input attributes that lead to the lowest score for the root attribute.

In this study, we developed a stochastic method for generating alternatives that require the smallest change to the current alternative to obtain a desirable outcome. To avoid combinatorial explosion, the method uses guided search based on Bayesian optimization. The method is evaluated on 42 different qualitative multi-attribute models of varying complexity. The method's behavior was analyzed with respect to several characteristics, including: computing time, time to the first appropriate alternative, number of generated (appropriate) alternatives, and number of attribute changes required to reach the generated alternatives.

2 DOMAIN DESCRIPTION

In this study, a set of 42 DEX multi-attribute decision models was used. The models are benchmark mock models designed by Kuzmanovski et al. [16]. The decision models are designed by taking into account properties such as model depth, the distribution of the attributes' aggregation weights (weights' distribution), and the inter-dependency of attributes (input links). Table 1 presents a summary of the decision models. The weights' distribution is given with descriptive names: skewed, normal, and uniform. All the attributes in the models are defined with the same value scale (low, medium, high), including the input and the output attributes. An additional assumption is that all attribute combinations are possible.
Table 1: Properties of the mock DEX decision models.

3 METHOD FOR GENERATING ALTERNATIVES

An efficient search strategy is required to generate alternatives that require the smallest change to the current alternative to obtain a desirable outcome. A naïve approach would be to generate all possible alternatives, or to iteratively generate random alternatives, and to evaluate the outcome of each alternative. However, for reasonably complex decision models, the search space can be enormous, rendering the naïve approaches unsuitable.

A more appropriate approach is to use informed search based on the history of previously generated and evaluated alternatives. The history can be used to estimate the search space and the behavior of the decision model. Based on that estimation, more promising alternatives can be generated. By focusing on the more promising alternatives, the search space is reduced and, consequently, the time needed to find the appropriate alternatives is also reduced. The next subsections describe a stochastic method that uses Bayesian optimization to efficiently generate such alternatives. The method assumes that we do not know the internal rules by which the decision models operate; thus it falls into the category of "black-box" optimization techniques. Knowing and utilizing the decision rules might help the search algorithm, but this option was not addressed in this study.

3.1 Implementation

The problem of generating alternatives that require the smallest change to the current alternative to obtain a desirable outcome can be defined as an optimization problem with two objectives: (1) improved outcome (desired output) of the decision model, and (2) maximum similarity between the current alternative c̄ and the new proposed alternative ā. For each decision model DM, one alternative can be defined as a tuple of attributes ā = (a1, a2, ..., an), where each attribute can take any value from a limited set of values. Usually, that set includes ordinal values (e.g., low, medium and high), and those values can be encoded with integers (e.g., 0, 1 and 2). Consequently, a distance d between alternatives can be defined over a Euclidean space. The specific distance function used by the method is a modified element-wise difference between the candidate alternative ā and the current alternative c̄. This distance considers only the attributes for which the candidate alternative has higher values than the current alternative c̄:

    d(c̄, ā) = Σ_j { a_j − c_j,  if a_j > c_j
                   { 0,          if a_j ≤ c_j

From the distance function, a similarity function s can also be defined as one minus the normalized distance. The distance is normalized using the maximum plausible distance for the specific problem. For example, if ā has 20 attributes with possible values between 0 and 2 and each attribute has the highest possible value, and if c̄ has only attributes with the lowest possible value (0), then the maximum distance is 20 * 2:

    s(c̄, ā) = 1 − d(c̄, ā) / max_distance

Finally, the optimization function can be defined as:

    f(c̄, ā, DM(c̄), DM(ā)) = { s(c̄, ā),  if DM(ā) > DM(c̄)
                              { 0,         if DM(ā) ≤ DM(c̄)

where DM(∗) is the output of the decision model for the specific alternative. By optimizing f, the method searches for alternatives that are as similar as possible to c̄ and improve the output of the decision model (DM(ā) > DM(c̄)).
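In Python, the three definitions above translate directly as follows (our sketch, with alternatives encoded as integer tuples).

    def distance(c, a):
        # Element-wise difference counting only attributes where the
        # candidate alternative a exceeds the current alternative c
        return sum(aj - cj for cj, aj in zip(c, a) if aj > cj)

    def similarity(c, a, max_distance):
        # One minus the distance normalized by the maximum plausible distance
        return 1 - distance(c, a) / max_distance

    def objective(c, a, dm, max_distance):
        # Similarity if the decision model's output improves, 0 otherwise
        return similarity(c, a, max_distance) if dm(a) > dm(c) else 0.0

    # Example: 3 attributes on the scale {0, 1, 2}, so max_distance = 3 * 2
    c, a = (0, 1, 0), (1, 1, 2)
    print(distance(c, a))        # 3
    print(similarity(c, a, 6))   # 0.5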
In order to apply the Bayesian optimization approach, a surrogate function (a model), an acquisition function, and a generator of alternatives need to be defined. The surrogate model SM is a model that estimates the objective function for a given alternative as input. Typically, models based on Gaussian Processes (GP) [17] are used because, by exploiting the mean and the standard deviation of the output distribution, we can balance the trade-off between exploiting (higher mean) and exploring (higher standard deviation). Since GP models are computationally expensive, with a complexity of O(n^3), ensemble models such as Random Forest (RF) can also be used [18]. In that case, the mean and the variance are calculated based on the predictions of all base models available in the ensemble. Our method uses an RF with 1000 decision trees as base models.

The acquisition function operates on top of the mean and standard deviation of the SM's output. The final version of the method uses the expected improvement (EI) as the acquisition function [19] (sketched below). This acquisition function checks the improvement that each candidate alternative brings with respect to the maximum known value (µ(SM(ā)) − a_best), and scales those improvements with respect to the uncertainty. If two alternatives have a similar mean value, the one with the higher uncertainty (σ(SM(ā))) will be preferred by the acquisition function.

Finally, we need to define the generator of alternatives. Our method uses two generators of alternatives: a neighborhood generator and a random generator. Based on the distance function d, a neighborhood relation can be defined: two alternatives ā1 and ā2 are considered neighbors with degree k if d(ā1, ā2) = k. The random generator is a generator of alternatives which: (1) avoids generating known alternatives; and (2) is conditioned on the best-known alternative (with respect to the optimization function) discovered in the previous iterations.
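The expected improvement over an RF surrogate can be sketched as follows (our illustration; best is the best objective value observed so far, and the mean and standard deviation are taken over the per-tree predictions of the ensemble).

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(rf, candidates, best, eps=1e-9):
        # EI acquisition: mean/std are computed over the predictions of the
        # individual trees of the Random Forest surrogate
        per_tree = np.array([t.predict(candidates) for t in rf.estimators_])
        mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0) + eps
        z = (mu - best) / sigma
        return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # top-ranked candidates to evaluate with the real decision model:
    # idx = np.argsort(-expected_improvement(rf, pool, best))[:10]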
Algorithm 1 presents the implementation of the proposed method. The function check_promising_values runs the SM on a set of promising alternatives. This set contains all alternatives that have been generated as neighbors of some best alternative but have not been evaluated with the DM because the acquisition function selected other alternatives. This enables one final check of the most promising solutions, which may have been missed because of an earlier bad prediction of the SM.

Algorithm 1:
Input: Decision model DM, current alternative CA
Output: best_alternatives

# parameters and initialization
max_e = 150                # maximum number of epochs
n_candidates = 10          # candidates per iteration
objective_jitter = 0.8     # if an alternative is close to the current best
                           # (e.g., 75% as good as the current best), the
                           # alternative's neighbors should be checked
random_sample_size = 10000
best_alternatives = []
surrogate_model = new Random_Forest()
promising_alternatives_pool = []
known_alternatives = []    # initialization added for completeness
counter = 0                # initialization added for completeness

# initial values
candidate_alternatives = generate_random_alternatives(10)
real_objective_values = objective_func(DM, CA, candidate_alternatives)
surrogate_model.fit(candidate_alternatives, real_objective_values)
known_alternatives.add(candidate_alternatives, real_objective_values)
best_alternative, best_score = max(candidate_alternatives, real_objective_values)
neighboring_alternatives = gen_neighborhood(best_alternative)

while counter < max_e do:
    if size(neighboring_alternatives) > 0:
        alternatives_pool = neighboring_alternatives
    else:
        alternatives_pool = gen_rand_alternatives(best_alternative, random_sample_size)
    # get top ranked (e.g., 10) candidates using the acquisition function
    candidate_alternatives, candidate_scores =
        perform_acquisition(alternatives_pool, n_candidates)
    # evaluation of candidate alternatives
    real_objective_values = objective_func(DM, CA, candidate_alternatives)
    known_alternatives.add(candidate_alternatives, real_objective_values)
    # update current best and promising alternatives
    i = 0
    while i < size(candidate_scores) do:
        if best_score * objective_jitter <= candidate_scores[i] do:
            neighboring_alternatives = gen_neighborhood(candidate_alternatives[i])
            promising_alternatives_pool.add(neighboring_alternatives)
        if best_score < candidate_scores[i] do:
            best_score = candidate_scores[i]   # update of best score added
            best_alternatives = []
            best_alternatives.add(candidate_alternatives[i])
        else if best_score == candidate_scores[i] do:
            best_alternatives.add(candidate_alternatives[i])
        i++
    # update the surrogate model
    surrogate_model.fit(candidate_alternatives, real_objective_values)
    counter++
end
# perform final check of the promising alternatives
best_alternatives = check_promising_values(promising_alternatives_pool, best_alternatives)
return best_alternatives

4 EXPERIMENTS

4.1 Experimental Setup

The method was evaluated with the 42 decision models described in Section 2. For each decision model, nine different starting alternatives (current alternatives c̄) were randomly sampled. Three of those alternatives had a final attribute value of low, three a final attribute value of medium, and three a final attribute value of high. The desirable outcome was also varied, i.e., from low to medium, from low to high, from high to medium, and from high to low. This experimental setup resulted in 756 different experimental runs. Each experiment ran for a minimum of 100 and a maximum of 150 epochs, stopping after 50 epochs without improvement. The method and the experiments were implemented in Python, and are available online1.

1 Repository link.

4.2 Experimental Results

The average experiment duration for the models with depth 3 was less than 5 min. For the models with depth 4, the duration increased by 3 min, and for the models with depth 5 it increased by an additional 3 min. This indicates that the relation between the computational time and the model depth is linear.

The final output of the algorithm is a set of thousands of different alternatives. However, from a user perspective, only one or a few alternatives should be enough. Figure 1 presents the number of epochs required to generate the first alternative for the most complex models (depth 5). The figure shows that, on average, the first alternatives are generated within the first 10 epochs. For the less complex models, the number of required epochs was less than 5.

Figure 1: Number of epochs required to generate the first alternative in the final set of alternatives.

In each epoch, the algorithm selects the top 10 alternatives with respect to the optimization score; the higher the score, the better the alternative. The selected alternatives depend on the acquisition function, which in turn depends on the predictions of the surrogate model. Figure 2 presents the average optimization score in each epoch for the most complex models (depth 5). For comparison, the average optimization score of 10 randomly sampled alternatives at each epoch is also shown (dashed line). The figure shows that the optimization score of the random samples is significantly lower than the optimization score of the samples selected using the proposed algorithm.
Finally, the presented algorithm is stochastic, so the optimality of the solution cannot be guaranteed. One metric that reflects the quality of the solutions is the number of attribute changes required to reach the final solution from the current alternative. Figure 3 presents that metric, which is the same as the distance defined in Section 3.1. The figure shows that in the majority of cases the final solution can be reached with fewer than 5 attribute changes. The exceptions are the decision models with depth 5 and a uniform weight distribution.

Figure 2: Average optimization score for the decision models with depth 5. Full line - alternatives generated by the surrogate model. Dashed line - random alternatives. The type of attribute weights is color-coded (blue - normal, orange - skewed, green - uniform).

Figure 3: Boxplots of the number of changes required to switch from the starting alternative to the best alternative.

This is because these models have a larger number of input attributes, and the uniform distribution requires many attributes to be changed in order for the change to be propagated to the aggregate attribute. In contrast, the models with normal and skewed weight distributions require a smaller number of attribute changes for the change to be propagated to the aggregate attributes.

5 DISCUSSION AND CONCLUSION

We presented a novel method for generating alternatives for multi-attribute DEX decision models based on Bayesian optimization. The main goal of the method was to generate alternatives that require the smallest change to the current alternative to obtain a desirable outcome. The method was extensively evaluated on 42 different DEX decision models of variable complexity (e.g., variable depth and variable attribute weight distributions). The method's behavior was analyzed with respect to several characteristics: computing time, time to the first appropriate alternative, number of generated (appropriate) alternatives, and number of attribute changes required to reach the generated alternatives.

The experimental results confirmed that the method is suitable for the task, i.e., it generates at least one appropriate alternative in less than a minute, even for the most complex decision models; in the majority of cases, the computing time was lower than that. The discovery of alternatives was equally distributed throughout the overall runtime. The exception is the final check performed by the algorithm (see check_promising_values in Algorithm 1), which generates the majority of the alternatives for the more complex models (depth 4 and depth 5). The quality of the alternatives was also appropriate, as in the majority of cases the generated alternatives could be reached with fewer than 5 attribute changes. Finally, the relation between the decision model's depth and the computing time was linear rather than exponential, which implies that the method is scalable.

The method's implementation considers ordinal attribute values. However, it is possible to consider other types of distance measures that would work in nominal settings (e.g., the Levenshtein distance).

Regarding future work, the proposed method is stochastic, and the optimality of the final solution cannot be guaranteed. To address this, the method needs to be validated further. Promising options include comparing the proposed method with deterministic methods and with methods that utilize the internal rules by which the decision models operate.

REFERENCES
[1] Power, D.J. Decision Support Systems: Concepts and Resources for Managers. Quorum Books, Westport, 2002.
[2] Turban, E., Aronson, J. and Liang, T.-P. Decision Support Systems and Intelligent Systems, 7th Edition. Prentice Hall, Upper Saddle River, 2005.
[3] Mallach, E.G. Decision Support and Data Warehouse Systems. Irwin, Burr Ridge, 2000.
[4] Sadok, W., Angevin, F., Bergez, J.-E., Bockstaller, C., Colomb, B., Guichard, L., Reau, R., Messéan, A. and Doré, T. MASC: a qualitative multi-attribute decision model for ex-ante assessment of the sustainability of cropping systems. Agron. Sustain. Dev. 29, 447–461, 2009.
[5] Munda, G. Multiple criteria decision analysis and sustainable development. In: Multiple Criteria Decision Analysis: State of the Art Surveys, Springer-Verlag, New York, 2005.
[6] Bohanec, M. and Rajkovič, V. DEX: An Expert System Shell for Decision Support. Sistemica 1(1), 145–157, 1990.
[7] Bohanec, M. and Rajkovič, V. Multi-attribute decision modeling: Industrial applications of DEX. Informatica 23, 487–491, 1999.
[8] Bohanec, M. DEXi: Program for Multi-Attribute Decision Making, User's Manual. Institut Jožef Stefan, Ljubljana, 2008.
[9] Bohanec, M., Zupan, B. and Rajkovič, V. Applications of qualitative multi-attribute decision models in health care. International Journal of Medical Informatics 58–59, 191–205, 2000.
[10] Bohanec, M., Cortet, J., Griffiths, et al. A qualitative multi-attribute model for assessing the impact of cropping systems on soil quality. Pedobiologia 51, 239–250, 2007.
[11] Bohanec, M., Messéan, A., Scatasta, S., et al. A qualitative multi-attribute model for economic and ecological assessment of genetically modified crops. Ecol. Model. 215, 247–261, 2008.
[12] Coquil, X., Fiorelli, J.L., Mignolet, C., et al. Évaluation multicritère de la durabilité agro-environnementale de systèmes de polyculture élevage laitiers biologiques. Innov. Agron. 4, 239–247, 2009.
[13] Bohanec, M., Cestnik, B., Rajkovič, V. Qualitative multi-attribute modeling and its application in housing. Journal of Decision Systems 10, 175–193, 2001.
[14] Debeljak, M., Trajanov, A., Kuzmanovski, V., et al. A field-scale decision support system for assessment and management of soil functions. Frontiers in Environmental Science 7, 115, 2019.
[15] Bergez, J.-E. Using a genetic algorithm to define worst-best and best-worst options of a DEXi-type model: Application to the MASC model of cropping-system sustainability. Computers and Electronics in Agriculture 90, 93–98, 2013.
[16] Kuzmanovski, V., Trajanov, A., Džeroski, S., et al. Cascading constructive heuristic for optimization problems over hierarchically decomposed qualitative decision space. Omega, submitted September 2020.
[17] Rasmussen, C.E. and Williams, C.K.I. Gaussian Processes for Machine Learning. MIT Press, 2006.
[18] Hutter, F., Hoos, H.H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration (extended version). Technical Report TR-2010-10, University of British Columbia, Computer Science, 2010.
[19] Lizotte, D. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, Canada, 2008.
Detekcija napak na industrijskih izdelkih
Defect Detection on Industrial Products

David Golob, Institut Jožef Stefan, Ljubljana, Slovenia, david.golob@ijs.si
Janko Petrovčič, Institut Jožef Stefan, Ljubljana, Slovenia, janko.petrovcic@ijs.si
Stefan Kalabakov, Institut Jožef Stefan, Ljubljana, Slovenia, stefan.kalabakov@ijs.si
Primož Kocuvan, Institut Jožef Stefan, Ljubljana, Slovenia, primoz.kocuvan@ijs.si
Jani Bizjak, Institut Jožef Stefan, Ljubljana, Slovenia, jani.bizjak@ijs.si
Gregor Dolanc, Institut Jožef Stefan, Ljubljana, Slovenia, gregor.dolanc@ijs.si
Jože Ravničan, UNIOR Kovaška industrija d.d., Zreče, Slovenia, joze.ravnican@unior.com
Matjaž Gams, Institut Jožef Stefan, Ljubljana, Slovenia, matjaz.gams@ijs.si

ABSTRACT
In this paper, different methods for defect detection on industrial forgings are presented. Part of the research was done for the project ROBKONCEL. The types of defects to be detected are mostly scratches and dents on smooth machined metal surfaces. First, a computer vision approach is used, and then a method for detecting defects from vibrations is discussed. The initial results are not encouraging, but they could possibly be improved with a larger dataset for training.

KEYWORDS
Defect detection, computer vision, vibrations, industrial products

1 INTRODUCTION
Recently, with the progress of machine learning and artificial intelligence, quality-control processes in industry have advanced as well. The aim of our research is to develop an algorithm for detecting defects on industrial products (forgings) for the company Unior d.d. The research was carried out within the ROBKONCEL project [1], which is co-financed by the Republic of Slovenia from the European Regional Development Fund. The classical approaches used for defect detection on industrial objects are based on computer vision [2], [3], [4], [5]. In our research, we use two computer vision approaches, namely object detection and image segmentation. We also attempted to detect defects from the vibrations of the products. Since the initial experiments did not give optimal results, in the future we will focus on machine learning experiments with a larger dataset and with other sensors, specifically a laser scanner, which currently appears to be the most promising option. The research is interesting mainly because it revealed certain difficulties in applying machine intelligence methods to industrial products.
2 COMPUTER VISION APPROACH
In this approach, defects on the products are detected from ordinary images. Examples of flawless products and examples with defects, typically damage on the machined surface, are given. The algorithms that detect the defects are based on a sub-field of machine learning, namely deep learning. In the last few years, deep learning has achieved remarkable results in computer vision, e.g., in object detection, image segmentation, and image classification. The drawback of deep learning is that it requires a large set of training data. In our experiments, as mentioned, we used two (sub-)approaches, namely object detection and image segmentation. Several examples of defect detection on industrial products using computer vision are described in [2], [3], [4] and [5].

2.1 Object Detection
In the object detection approach, we typically try to find a chosen object (e.g., a car, a pedestrian, a bicycle, a traffic sign, etc.). In our problem, the chosen object is a defect on an industrial forging. For this approach, we had 9 products available, from which we produced a set of 46 images. The set of images was then split into a training and a test set. The split was made so that the same product does not appear in both sets. On each image in the training set, the defect or defects had to be manually annotated with rectangles. Once the images are annotated, they can be used to train a deep neural network capable of recognizing objects (defects) in images.

The neural network first consists of several so-called convolutional layers, followed at the end by a few fully connected layers. The convolutional layers are capable of creating useful features (such as various edges and shapes in the image), which are then used in the fully connected layers (see Figure 1 for an example). In the case of object detection, the first part of the neural network discovers so-called regions of interest in the image, in the form of rectangles. Each region of interest is then the input to the second part of the neural network, whose task is to classify the given region (see Figure 2). In our case, we used a pre-built and pre-trained neural network, which we then fine-tuned to recognize our objects (defects). The network we used is called "Faster RCNN Inception" and was trained on the dataset called "COCO" [6]. This neural network is freely available and supported by the Python library TensorFlow [7].

Figure 1: Deep neural network with convolutions, source: [8]
Figure 2: Neural network for object recognition, source: [9]

Once the network is trained, we classify a given image as a "defect" if the network detects a defect with more than 40% probability (see Figure 3 for an example). Tables 1 and 2 show the results of the network on the training and the test set, respectively.

Figure 3: Defect detection using object recognition

Table 1: Training set: 27 images, 26 with a defect, 1 without. Accuracy: 81%, recall: 81%, precision: 100%.
TP = 21, FP = 0, TN = 1, FN = 5

Table 2: Test set: 19 images, 18 with a defect, 1 without. Accuracy: 10%, recall: 5%, precision: 100%.
TP = 1, FP = 0, TN = 1, FN = 17

We observe that the model achieves satisfactory precision on the training set, but it is not capable of generalization, which is evident from the poor results on the test set. For better results, we would clearly need more images and more diverse defects.
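As a quick sanity check of the reported values, the following snippet (ours, not part of the paper) recomputes accuracy, recall and precision from the confusion-matrix counts in Tables 1 and 2:

def metrics(tp, fp, tn, fn):
    # standard definitions of accuracy, recall, and precision
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else 1.0
    return accuracy, recall, precision

print(metrics(21, 0, 1, 5))    # training set -> (~0.81, ~0.81, 1.0)
print(metrics(1, 0, 1, 17))    # test set     -> (~0.11, ~0.06, 1.0)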
2.2 Image Segmentation
In image segmentation, each pixel is classified into a certain class (see Figure 4 for an example). In our case, we have only two classes, namely "defect" and "no defect". In this approach, too, we use (deep) neural networks for segmentation and classification.

For the neural network architecture, we used an architecture that was applied to a similar problem (see [5] for details); it is shown in Figure 5. The neural network consists of two parts, a segmentation part and a classification part. The input to the segmentation part is a grayscale image of the object, while the classification part has two inputs (tensors), both taken from the segmentation network. The first tensor is the segmentation of the (downscaled) image of the object (labeled "segmentation output" in Figure 5); it is a tensor of depth 1, where each element (which can be thought of as a pixel) represents the probability of a defect. The second tensor is the penultimate tensor of the segmentation network.

The output of the classification network is the probability that the image contains a defective product, while the output of the segmentation network is the segmentation of the downscaled image of the object. The segmentation part is trained separately from the classification part: it learns from manually annotated segmentation images, whereas the classification part learns from binary labels (1 means the object has a defect and 0 means it does not).

In this approach, we split the data into a training, a validation, and a test set (where no product can appear in two sets). We then annotate every pixel in each image in the training and validation sets as "defect" or "no defect".

The neural network outputs a segmentation of the image and a classification of the image. An example of the segmentation output is shown in Figure 6. On the validation set, we determined the number of training epochs: 2900 epochs for the segmentation network and 200 epochs for the classification network. The networks were trained with the gradient descent algorithm with a learning rate of 10^-3. The results are collected in Tables 3, 4 and 5.

Figure 4: An example of image segmentation, source: [10]
Figure 5: The architecture
Figure 6: An example of image segmentation. Left: original, middle: manual segmentation, right: model segmentation.

Table 3: Training set: 43 images, 29 with a defect, 14 without. Accuracy: 100%, recall: 100%, precision: 100%.
TP = 29, FP = 0, TN = 14, FN = 0

Table 4: Validation set: 25 images, 21 with a defect, 4 without. Accuracy: 64%, recall: 66.7%, precision: 87.5%.
TP = 14, FP = 2, TN = 2, FN = 7

Table 5: Test set: 28 images, 21 with a defect, 7 without. Accuracy: 71.4%, recall: 81%, precision: 81%.
TP = 17, FP = 4, TN = 3, FN = 4

We can see that the neural network is able to fit the training set with 100% accuracy, but, like the previous approach, it has a problem with generalization.
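A minimal tf.keras sketch of such a two-part network is given below; this is our own simplified reconstruction, with illustrative layer sizes that do not follow the architecture from [5] exactly:

import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(512, 512, 1))   # grayscale image of the object
x = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(inp)
x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(x)
features = layers.Conv2D(64, 5, padding="same", activation="relu")(x)  # penultimate tensor
seg_out = layers.Conv2D(1, 1, activation="sigmoid", name="segmentation")(features)

# classification head: pools both the segmentation map and the feature tensor
pooled = layers.Concatenate()([
    layers.GlobalAveragePooling2D()(seg_out),
    layers.GlobalAveragePooling2D()(features),
])
cls_out = layers.Dense(1, activation="sigmoid", name="classification")(pooled)

seg_model = tf.keras.Model(inp, seg_out)   # trained first, on pixel masks
seg_model.compile(optimizer=tf.keras.optimizers.SGD(1e-3),
                  loss="binary_crossentropy")

# to train the two parts separately, freeze the segmentation layers before
# compiling the classification model, which is trained on binary labels
for l in seg_model.layers:
    l.trainable = False
cls_model = tf.keras.Model(inp, cls_out)
cls_model.compile(optimizer=tf.keras.optimizers.SGD(1e-3),
                  loss="binary_crossentropy")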
3 VIBRATION-BASED APPROACH
One of the "alternative" but potentially promising approaches is an analysis based on oscillatory displacement excitation. The experiment was carried out in the laboratory of department E2 at JSI. The positive of the product (the actual forging) was placed into a negative (a holder for the forgings; see Figure 7), and an oscillatory displacement of the negative (the holder) was generated with a vibration generator. We were interested in whether damage to the product on the surface in contact with the holder (the negative) would in any way affect the coupling between the product and the holder. To this end, we observed two signals, the excitation displacement signal of the holder and the measured displacement signal of the product, and monitored the relation between them. A sinusoidal excitation signal was used to drive the displacement of the negative (the holder). The displacement of the product was measured with a high-precision laser distance sensor, which continuously measures the distance to the product and then computes the velocity by numerical differentiation; this velocity is the output signal.

Figure 7: Vibration measurement

For a basic test of the feasibility of the method, we simulated a defect on one of the products by gluing a small piece of insulation tape onto the contact surface. It turned out that the tape substantially affects the product-negative coupling, which gave us hope that damage to the contact surface of the product might also affect the coupling, and thus the relation between the displacement of the negative and that of the product.

The recordings of the output signal are 10 s long. The measurements were taken under 4 different settings of the input signal:
• Setting 1: amplitude 0.389 Vpp, frequency 50 Hz
• Setting 2: amplitude 0.389 Vpp, frequency 60 Hz
• Setting 3: amplitude 0.2026 Vpp, frequency 60 Hz
• Setting 4: amplitude 0.2026 Vpp, frequency 50 Hz
The settings were chosen based on the output signal: it turns out that at higher amplitudes the output signal becomes noisy. For this approach, 24 products were available.

We tested the following approaches to defect detection from the signals:
• expert-selected features with classical machine learning methods;
• computer-generated features with a 2-layer neural network.

3.1 Expert-Selected Features and Classical Machine Learning
In this approach, the features used in the machine learning algorithms are selected based on past experience: the selected features had proven useful in another machine learning application. There are 22 selected features, covering basic characteristics of the signal from the time and frequency domains, e.g., the 3 highest peaks of the spectral density and their frequencies, the energy of the spectral density, etc.

Each recording of a forging is split into 10 chunks, each a 1 s long recording. The expert-selected features are then computed for each chunk, so for each sample we obtain 10 data points with 22 features.

The model used consists of two models, a base model and a final model. The base model computes, for each data point, the probability that the point belongs to a defective product. Since we have 10 data points per product, the base model yields 10 probabilities per product. The final model then classifies the product as a "forging with a defect" or a "forging without a defect"; its input is the 10 probabilities obtained from the base model.
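A compact scikit-learn sketch of this two-stage setup (ours; the toy data stands in for the 22 expert features per 1 s chunk, and the logistic regression/AdaBoost pair follows one of the combinations reported below):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# toy data: 24 products x 10 one-second chunks x 22 expert features
rng = np.random.default_rng(0)
X_chunks = rng.normal(size=(24, 10, 22))
y_product = rng.integers(0, 2, size=24)

# base model: per-chunk probability of belonging to a defective product
base = LogisticRegression(max_iter=1000)
base.fit(X_chunks.reshape(-1, 22), np.repeat(y_product, 10))

# final model: classifies a product from its 10 chunk probabilities
probs = base.predict_proba(X_chunks.reshape(-1, 22))[:, 1].reshape(24, 10)
final = AdaBoostClassifier()
final.fit(probs, y_product)
print(final.predict(probs[:3]))   # predicted labels for the first 3 products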
We tried several algorithms, namely the support vector classifier, random forests, logistic regression, AdaBoost, and XGBoost, both for the base and for the final model.

In the first experiment, the data were split into a training and a test set. On the training set, we selected the optimal parameters for the base and the final model with 8-fold cross-validation, and the whole model was then tested on the test set. Setting 2 of the input signal was used, with an amplitude of 0.389 Vpp and a frequency of 60 Hz. The results are collected in Tables 6 and 7.

Base model: XGBoost. Final model: random forests.

Table 6: Training set: 19 products, 12 with a defect, 7 without. Accuracy: 100%, recall: 100%, precision: 100%.
TP = 7, FP = 0, TN = 12, FN = 0

Table 7: Test set: 5 products, 2 with a defect, 3 without. Accuracy: 100%, recall: 100%, precision: 100%.
TP = 3, FP = 0, TN = 2, FN = 0

To avoid a coincidentally good result on the test set, we also ran a second experiment, using cross-validation to determine the training and test sets. Specifically, we used 5-fold cross-validation, where the data are split into 5 parts. The procedure has 5 iterations; in each iteration, one part is selected as the test set and the other four parts as the training set. In each iteration, we select the optimal parameters on the training set with 8-fold cross-validation, train the model on the training set, and then evaluate it on the test set. Since we use 5 parts, we obtain 5 estimates of accuracy, recall, and precision, from which we then compute the average. (Setting 2 of the input signal was used, with an amplitude of 0.389 Vpp and a frequency of 60 Hz.) The best test results are shown in Table 8.

Table 8: Base model: logistic regression. Final model: AdaBoost.
Accuracy: 68%, recall: 85%, precision: 76%, F1: 73%

3.2 Computer-Generated Features and a 2-Layer Neural Network
For automatic feature generation, we used a library intended for this purpose. With the FDR (False Discovery Rate) parameter set to its default value of 0.05, the statistical test yielded no feature relevant for classification. Since the library uses statistical analysis to assess feature relevance, features rejected by it are not necessarily unimportant for machine learning, so we raised the FDR threshold first to 0.5 and then to 0.99. At the value of 0.5, we obtained only one feature: the 50th Fourier coefficient, or, for settings 2 and 3, the 60th Fourier coefficient. The latter is, of course, the fundamental harmonic of the excitation signal. At some settings and at the higher FDR value, we obtained some Fourier coefficients in the vicinity of the 50th and 60th coefficients, which makes sense, because the response of the forging differs depending on the damage. We therefore decided to compute the Fourier coefficients in the vicinity of the 50th and 60th coefficients and use them for classification. We heuristically chose to compute the first 256 coefficients, which covers all the coefficients in the vicinity of the 50th and 60th. Computing too many coefficients means we could exhaust the resources available to the neural network; moreover, the official sources [11] use 28 × 28 points, i.e., input neurons, in a comparable example.

The neural network consists of an input layer with 256 neurons, followed by two hidden layers, the first with 16 neurons and the second with 8, and a final output layer of 2 neurons, which represent a damaged or an undamaged forging. These settings were obtained by repeatedly testing the model (hyperparameter optimization). Unlike in the previous approach, we used the whole 10-second recording to compute the coefficients.

As in the previous case, we first performed hyperparameter optimization on the training set, and the parameters that achieved the highest accuracy were then used to train the neural network model. All 24 examples were split into a training set (19 examples) and a test set (5 examples), and we used 5-fold cross-validation as in the previous experiment. Since we obtain 5 values of each metric, we compute their averages at the end. The results are collected in Table 9.

Table 9: Accuracy, recall and precision (without the F1 metric).
Accuracy: 48%, recall: 42%, precision: 91%
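A sketch of the Fourier-coefficient features and the described network (ours; the use of FFT magnitudes and the 1 kHz sampling rate are assumptions, as the paper does not specify them):

import numpy as np
import tensorflow as tf

def fourier_features(signal, n_coeffs=256):
    # magnitudes of the first 256 Fourier coefficients of the 10 s recording
    return np.abs(np.fft.rfft(signal))[:n_coeffs]

# toy recordings: 24 products, 10 s each at an assumed 1 kHz sampling rate
rng = np.random.default_rng(0)
signals = rng.normal(size=(24, 10_000))
labels = rng.integers(0, 2, size=24)
X = np.stack([fourier_features(s) for s in signals])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # damaged vs. undamaged
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, labels, epochs=10, verbose=0)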
4 CONCLUSION
This paper described approaches and models for detecting defects on industrial products (forgings).

The results for defect detection using computer vision and image segmentation turned out to be unsatisfactory for practical use, where high accuracy and recall are required. The results using computer vision and object detection are unsatisfactory, probably because the defects on the metal resemble the dark spots that are abundant on the forgings. The results for defect detection using vibrations are encouraging, but still unsatisfactory.

The main reason for the poorer results is the lack of data and the acquisition of data in an uncontrolled environment. We believe that the results will improve once more data is available.

5 BIBLIOGRAPHY
[1] "ROBKONCEL," SMM, January 2019. [Online]. Available: http://www.smm.si/?post_id=4682. [Accessed 30 January 2020].
[2] M. El-Agamy, M. A. Awad and H. A. Sonbol, "Automated inspection of surface defects using machine vision," in 17th Int. AMME Conference, Cairo, 2016.
[3] C. Ming, B.-C. Chen, L. G. Jacque and C. Ming-Fu, "Development of an optical inspection platform for surface defect detection in touch panel glass," International Journal of Optomechatronics, vol. 10, no. 2, pp. 63–72, 2016.
[4] X. Sun, J. Gu, S. Tang and J. Li, "Research Progress of Visual Inspection Technology of Steel Products—A Review," Applied Sciences, vol. 8, no. 11, 2018.
[5] D. Tabernik, S. Šela, J. Skvarč and D. Skočaj, "Segmentation-based deep-learning approach for surface-defect detection," Journal of Intelligent Manufacturing, 2019.
[6] [Online]. Available: http://cocodataset.org/#home.
[7] TensorFlow, "TensorFlow home page," [Online]. Available: https://www.tensorflow.org/. [Accessed 30 January 2020].
[8] [Online]. Available: https://towardsdatascience.com/mnist-handwritten-digits-classification-using-a-convolutional-neural-network-cnn-af5fafbc35e9.
[9] U. Farooq, 15 February 2018. [Online]. Available: https://medium.com/@umerfarooq_26378/from-r-cnn-to-mask-r-cnn-d6367b196cfd.
[10] J. Jordan, "Jeremy Jordan," 30 March 2018. [Online]. Available: https://www.jeremyjordan.me/evaluating-image-segmentation-models/.
[11] [Online]. Available: https://www.tensorflow.org/tutorials/keras/classification. [Accessed 2019].
[12] M. Gjoreski, S. Kalabakov, M. Luštrek and H. Gjoreski, "Cross-dataset deep transfer learning for activity recognition," in Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 2019.

Data Protection Impact Assessment - an Integral Component of a Successful Research Project From the GDPR Point of View

Gizem Gültekin Várkonyi, University of Szeged, Szeged, Hungary, gizemgv@juris.u-szeged.hu
Anton Gradišek, Jožef Stefan Institute, Ljubljana, Slovenia, anton.gradisek@ijs.si
ABSTRACT
Artificial intelligence and algorithmic decision-making systems help generate new knowledge about diseases, which then helps manage them better and assists people with clinical treatment needs. The lifeblood of such AI systems is personal data, which is both used for training and is already the output of the algorithmic assessments. This work aims to guide AI researchers in becoming familiar with the legal rules binding them while processing personal data within their AI-based projects, as indicated in the General Data Protection Regulation, with a specific focus on why and how to conduct a self-Data Protection Impact Assessment. The self-assessment guideline presented throughout the work is an output of the mutual experiences and collaboration between a lawyer and an AI researcher on the topic.

KEYWORDS
data protection, impact assessment, GDPR, artificial intelligence, medical data

1 Introduction
It is possible to look at artificial intelligence (AI) systems dealing with personal data from two different perspectives. On one hand, AI offers great benefits for users, developers, and researchers, if used correctly. For example, AI-enabled health care technologies could predict the treatment of diseases 75% better, and could reduce clinical errors by two thirds at clinics using AI compared to clinics that do not [1]. On the other hand, the improper handling of personal data can quickly lead to abuse, the sharing of sensitive information, or other problems (unwanted data disclosure, complex and costly legal procedures, high fines, etc.), so it has to be handled with the utmost care. In this paper, we focus on the legality of medical applications containing personal data that is defined as sensitive data in legal documents, such as the analysis of sensor data to help patients with chronic diseases manage their condition and improve their quality of life, or to help the elderly with independent living by providing safety features and improved communication channels.

Developing an AI-based service for a target population, for example people with diabetes, chronic heart failure, obesity, dementia, skin cancer, etc., typically starts with a research project. One of the key components of such a project is collecting substantial amounts of data in a pilot study, with participants that resemble the target audience for the final service. When planning the pilot study, researchers enter the slippery terrain of dealing with personal data, as the participants are providing their own data for the purpose of the study. As an illustration, we can imagine a project where we collect medical data of three types: general medical data provided by the medical doctor responsible for the participant, lifestyle data collected by either wearable or stationary sensors, and self-reported data obtained via questionnaires that the participants fill in.

The data provided by the participants fall under the scope of the European Union's General Data Protection Regulation (GDPR), since they refer to identified or identifiable natural persons. The GDPR entered into force on the 25th of May 2018, with one of its aims being to keep up with the technological developments challenging the efficient protection of personal data [2]. The risk-based approach embedded in the GDPR came along with several safeguards, one of which is the Data Protection Impact Assessment (DPIA). The DPIA can help AI researchers comply with the GDPR requirements at an early stage of a new project. It can help reduce the risks arising from the use of AI technologies that challenge the efficient protection of fundamental rights and principles [3]. Several policy papers generated by the EU institutions [4] [5] focusing on the regulation of AI state that legal compliance is a keyword for gaining user trust, and a DPIA is one way to reach user trust. However, there is no standard set for conducting a DPIA that could guide AI researchers. In this paper, we present some of the key points of conducting a DPIA that could be useful for AI researchers.
2 Data Protection Impact Assessment in the GDPR
The term DPIA is not specifically described in the GDPR; however, it is referred to as a process that helps manage the risks to the rights and freedoms of the data subjects (in this case, the participants of the research project) resulting from data processing. In other words, a DPIA is a process consisting of several sub-processes to describe the risks and assess the legality of the system in terms of data protection. These risks could be related to system security, system design, implementation, administration, and further development. The aim of the DPIA is to take appropriate safeguards to minimize the risks, if it is impossible to eliminate them all. A DPIA is not a simple one-time reporting activity; it is an ongoing process that should be continuously carried out during the lifetime of a project, and it should therefore always be monitored and updated [6].

It is the AI researcher's responsibility to conduct a DPIA when the data processing activity is likely to constitute a "high risk" to the rights and freedoms of natural persons (e.g., users of an AI service who both benefit from the service and contribute to it with their data). Deciding whether a certain data processing activity would result in a high risk is not an easy task, but there are several guidelines and lists of processing operations requiring a DPIA published by the National Supervisory Authorities [7]. These lists could be the first sources for data controllers to decide about the necessity of a DPIA for a certain project [8].

Failure to conduct a proper DPIA poses a risk for AI researchers: they may face several sanctions, especially financial penalties. Apart from that, conducting a proper DPIA is beneficial for data controllers beyond the legal and financial points of view. A DPIA can help data controllers avoid implementing irrelevant solutions from the beginning of the project, which amounts to assessing the technical feasibility of the system in parallel with legal compliance [8]. The DPIA can therefore help data controllers save time and money. It also protects companies from losing their reputation (or from scandals such as those that occurred with Cambridge Analytica, Equifax, Facebook, etc.). Finally, a DPIA document can prove the trustworthiness of the project team before the public, as well as the related authorities, since it is evidence of respect for the right to data protection.

An AI project aiming to collect personal data and evaluate it with an automated decision-making system with the help of profiling tools, such as surveys and hardware equipment, must be assessed from the risk point of view. Below, we present a step-by-step guideline on how to conduct a DPIA for AI-based research.
3 Conducting a Data Protection Impact Assessment
In this section, we assume a project aiming to develop medical software with the help of an algorithm that collects and processes participants' sensitive data based on profiling. Additionally, a large amount of data will be collected to feed the algorithm, meaning that the participants may lose a degree of control over their data stored and processed by the AI system. Based on these inputs, the project may reveal risks to the rights and freedoms of the data subjects involved, if these are not mitigated. Therefore, we need to conduct a DPIA and identify the risk categories with the planned mitigations.

We identified three steps for conducting a successful DPIA in the project: the Data Specific Assessment, the Data Subject Specific Assessment, and the Project Specific Assessment.

The Data Specific Assessment (DSA) is the procedure in which the data to be used in the AI project is introduced very specifically, in order to comply with the basic rules of the GDPR: mainly purpose limitation, transparency, accuracy, data minimization, and consent. It should be kept in mind that one of the requirements for ensuring valid consent is identifying the concrete list of data, together with the planned processing activities on that data within the research project. Information serving to identify the persons involved in data processing is a natural element of the DSA. For example, the AI researchers in the project should identify the data processing purposes specific to the project aims and present the list of purposes in written form to the participants. The indicated purposes should be followed by the related data to be processed, again listed in written form, followed by a clear identification of the AI researchers and the other people involved in the processing activity.

Next, the Data Subject Specific Assessment follows; this procedure focuses on explaining all the details of how the AI researchers will ensure the rights of the participants by protecting their right to informational self-determination. The key point in this assessment is to gain the trust of the participants, as required by law and ethics. One of the key aspects here is to make sure that the participants are introduced by the project team to the ways their data will be used, as well as to the possibility of requesting the removal of their data if so desired. The project team shall also ensure that the participants have a certain degree of access to the decisions made by the algorithm about them. An algorithmic decision relating to a participant's personal assessment should be explained in a way understandable to them; classification models based on decision trees are easily comprehensible to humans, whereas models based on complex multilayer neural networks are essentially black boxes, where it is not possible to determine why a particular decision was reached based on easily interpretable rules. Bearing in mind the black-box nature of algorithmic assessment, choosing a model that is, first of all, understandable and explainable to the AI researchers themselves is a suggested action in this sense. The social implications of choosing a black-box algorithm are an emerging research field. Finally, the project team should ensure that the system offers tools for the participants to keep their data accurate and to block third-party access.
The Project Specific Assessment is the last part of the DPIA, presenting and explaining the legal basis for the data processing, the external project partners involved in the data processing activities, and the security measures that will be implemented to safeguard the data processed during the project. As the project likely deals with sensitive medical data, security protocols have to be elaborated, which include a proper hierarchy regarding data access, encryption algorithms, regular security updates, and physical access to the hardware where the data is located.

The final but ongoing phase of the DPIA is the monitoring phase. Whenever a new element is embedded in the project, and this element seems to change the balance of the risks that were assessed earlier, the DPIA should be reviewed. Such an element could be the inclusion of a new data type in the algorithm or a planned commercial use of the algorithm. Bearing in mind that machine learning techniques and algorithms are referred to as entirely new technologies [3], and that the growing amount of data together with a variety of hardware raises risks to persons' right to data protection [9], we suggest that the project team review the DPIA periodically, for instance at least every year.

4 Conclusion
A Data Protection Impact Assessment is an integral part of any research project focusing on the development of an AI algorithm with personal data. Such data might be sensitive in nature, such as medical data used for developing an algorithm to detect diseases. Besides being a legal requirement provided for by the GDPR, a DPIA is a tool for AI researchers to assess the weaknesses in the system that may put at risk the protection of the fundamental rights of the persons participating in the research project, who contribute to the development of the project with their personal data. Since there are few guidelines on how to conduct a DPIA for a research project specific to this topic, this work initiates a step-by-step guideline for AI researchers.

The first step considers a Data Specific Assessment, in which the data and the purposes of the data processing are clearly identified and listed in written form to be presented to the participants. It is followed by the Data Subject Specific Assessment, which focuses on the ways the AI researchers ensure the protection of the participants' right to data protection in line with the GDPR requirements; such requirements include providing explanations of the decisions reached as a result of algorithmic assessments. The third step relates to the Project Specific Assessment, which focuses mostly on the security measures planned by the project team to mitigate the risks identified during the previous two assessments. We suggest that AI researchers review the DPIA at least once a year; a revision is also required whenever a new element resulting in new data processing is added to the system.

From the planning stage of the project to the annual revisions, the DPIA can help the project team identify potential risks and find mitigation strategies for certain weak points. Last but not least, by conducting the DPIA, the project team fulfills the legal requirements, ensures the higher trust of the people involved, and avoids unforeseeable problems that might occur later.

ACKNOWLEDGMENTS
This work was supported by the ERA PerMed project BATMAN, which was financed on the Slovenian side by the Ministry of Education, Science, and Sport (MIZŠ). An extended version of this paper was submitted to the journal Informatica.

REFERENCES
[1] "The AI effect: How artificial intelligence is making health care more human," [Online], study conducted by MIT Technology Review Insights and GE Healthcare, 2019. Available: https://www.technologyreview.com/hub/ai-effect/. Last accessed: 20 April 2020.
[2] EDPS (2012). "Opinion of the European Data Protection Supervisor on the data protection reform package" (7 March 2012).
[3] ICO (2018). Accountability and governance: Data Protection Impact Assessments (DPIAs).
[4] European Commission (2018). Communication from the Commission to the European Parliament, the European Council, the Council, the European Economic and Social Committee and the Committee of the Regions, Artificial Intelligence for Europe. COM(2018) 237 final.
[5] European Commission (2018). Communication from the Commission to the European Parliament, the European Council, the Council, the European Economic and Social Committee and the Committee of the Regions, Coordinated Plan on Artificial Intelligence. COM(2018) 795 final.
[6] Wright, David. (2012). The state of the art in privacy impact assessment. Computer Law & Security Review, 28(1), 54–61. https://doi.org/10.1016/j.clsr.2011.11.007
[7] Hungarian National Authority for Data Protection and Freedom of Information (NAIH), List of Processing Operations Subject to DPIA, Art. 35(4) GDPR. https://naih.hu/list-of-processing-operations-subject-to-dpia-35-4--gdpr.html
[8] Wright, David. (2011). Should Privacy Impact Assessments Be Mandatory? Commun. ACM, 54(8), 121–131. https://doi.org/10.1145/1978542.1978568
[9] Chandra, Sudipta, Ray, Soumya, Goswami, R.T. (2017). Big Data Security: Survey on Frameworks and Algorithms, in 2017 IEEE 7th International Advance Computing Conference (IACC), Hyderabad, pp. 48–54. doi: 10.1109/IACC.2017.0025
Deep Transfer Learning for the Detection of Imperfections on Metallic Surfaces

Stefan Kalabakov, Jožef Stefan Institute and Mednarodna podiplomska šola Jožefa Stefana, Ljubljana, Slovenia, stefan.kalabakov@ijs.si
Primož Kocuvan, Jožef Stefan Institute, Ljubljana, Slovenia, primoz.kocuvan@ijs.si
Jani Bizjak, Jožef Stefan Institute and Mednarodna podiplomska šola Jožefa Stefana, Ljubljana, Slovenia, jani.bizjak@ijs.si
Samo Gazvoda, Gorenje gospodinjski aparati, d.d., samo.gazvoda@gorenje.com
Matjaž Gams, Jožef Stefan Institute and Mednarodna podiplomska šola Jožefa Stefana, Ljubljana, Slovenia, matjaz.gams@ijs.si

ABSTRACT
In the last decade, consumers' expectations have significantly increased regarding the availability and quality of the products they buy. To this end, manufacturers have focused on streamlining their manufacturing lines by employing intelligent solutions wherever possible. Since the field of quality control remains dependent mainly on specialized workers, interest in incorporating artificial intelligence (AI) advances in this field has dramatically increased. In this paper, we present a short exploration of a computer vision system built to detect imperfections on metallic surfaces. In particular, we leverage deep transfer learning to build a model that can classify small segments of a bigger image while using a tiny dataset for training. In these initial experiments, we show that layers trained on the ImageNet dataset can be used as feature extractors when building a model for a vastly different problem.

KEYWORDS
deep transfer learning, computer vision, quality control

1 INTRODUCTION
Today, products are expected to be available fast, in vast quantities, and with exceptional quality. To this end, manufacturers have started streamlining their manufacturing lines by employing network-connected intelligent machines wherever possible [10]. This has created great interest in incorporating advances in artificial intelligence (AI) in industry. In recent years, the industrial adoption of AI has become more and more feasible [7], mainly thanks to the significant progress in hardware computational resources.

In spite of this, quality control is one manufacturing process that still remains highly dependent on expert human workers. This dependence, in some instances, makes it slower, more prone to errors, and more expensive. To mitigate this, there has been a limited adoption of computer vision systems paired with classical image processing for detecting imperfections in manufacturing processes [1]. However, these systems rely heavily on specialized lighting solutions to highlight imperfections on the surfaces of objects [6]. The systems are usually expensive and require close proximity to the object under investigation in order to provide good detection accuracy. Furthermore, methods that do not use any kind of learning require features that are hand-crafted for each application specifically, and require some degree of uniformity in the size and shape of the errors that might appear. For us, this problem with hand-crafted features exists even when using classic machine learning models, as we were not provided with details regarding the size and shape of the errors. To solve this, we opted for deep learning models, as they automatically extract features based on the training set and have proved to produce state-of-the-art results in many areas [3]. With this in mind, the aim of this paper is to investigate low-cost, state-of-the-art deep learning methods that work in suboptimal lighting and automatically extract features that are robust to the shape and size of the errors appearing on metallic surfaces. Finally, since our dataset is extremely small, we leveraged transfer learning in order to use the full potential of deep models.
2 PROBLEM DEFINITION
The ultimate goal of the ROBKONCEL project is to create a quality control process for the detection of several possible manufacturing errors on both the inside and outside of ovens. In this work, we focus on detecting scratches and dents, i.e., imperfections on the metallic surface of the oven faceplates. We perform this quality check in the final phases of the manufacturing process, as almost fully assembled ovens are transported on a conveyor belt. In order to produce a method that is least costly to implement, we chose a simple RGB camera as the sensor in this application.
The camera is positioned such that it can take a picture containing the whole metallic surface while not interfering with other quality control processes, thus improving efficiency. Finally, our method is supposed to highlight the areas where dents and scratches are found, so that an inspection of the algorithm's work can be done at any time. Figure 1 shows an example image used for the purposes of this paper.

Figure 1: An image taken by an RGB camera of the metallic surface of interest.

2.1 Data
Due to the frequency with which these imperfections occur, we did not have a large dataset to include in this study. On the contrary, the number of faceplates we could use to get the necessary number of images was only five. Of those five faceplates, one was without imperfections, and the rest contained a varying number of defects on the metallic surface. Since any deep learning approach requires a large amount of data, and since the number of faceplates is small, using images that portray the whole area of one faceplate as examples for a deep neural network (DNN) would be ineffective. To combat this problem, we took images of the different front panels (five images in total) and segmented them into hundreds of smaller examples, which we use as inputs to fine-tune several models. Additionally, by performing class-invariant transformations on these smaller images, we attempt to diversify the set of examples used to fine-tune the models. The segmentation of images into smaller examples and their augmentation are presented in subsections 3.1 and 3.2, respectively.

3 METHOD
3.1 Segmentation
In order to segment the images, we first created a hand-annotated set of binary images (masks). These masks complement the original set of five images by showing where in them a scratch or a dent is visible on the metallic surface. In more detail, the masks were produced by having humans mark the exact locations of these imperfections. In the masks, pixels that are part of some imperfection (in the RGB image) are marked in white, while all others are represented in black. An image and its corresponding mask are shown in Figure 1 and Figure 2, respectively.

Figure 2: A mask constructed for the image in Figure 1.

The next step in the segmentation process is to divide the image into chunks (windows). We do this by "sliding" a window of fixed size across the whole image. Each of these windows covers a specific area of the image and serves as a training or testing instance when fine-tuning the models. Overlap between windows is allowed; in fact, it is encouraged, since some overlap means that we can generate more examples. The size of the window is 200 by 200 pixels, and the allowed overlap between windows is 75%. However, since within this paper's scope we are only interested in the metallic parts of the faceplate, we make sure that none of the windows cover an area that includes the display. In Figure 3 we can see (in green) the windows produced by the segmentation step and how none of them overlap with the area of the display.

Figure 3: Example of image segmentation.

Finally, since the newly constructed windows will be used to train a deep learning model, we need to assign a label to each one of them. In this application, the labels are "0" and "1". If the label "0" is assigned to a window, it means that the window's area does not include any scratches or dents on the surface; the label "1" means that the area covered by the window includes a scratch or a dent. The labels are assigned by examining the mask: for each window, we take the corresponding area it covers in the mask, and if that area includes a certain number of pixels annotated as belonging to an imperfection, the window is assigned the label "1". Otherwise, it is assigned the label "0". The number of pixels used as a threshold for labeling the windows is:

    threshold = 0.1 × numPixelsInWindow
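A minimal NumPy sketch of this windowing and labeling step (ours; the 50 px stride follows from the stated 75% overlap of 200 px windows, and the check that skips windows covering the display is omitted):

import numpy as np

WIN, STEP = 200, 50   # 200x200 windows, 75% overlap -> 50 px stride

def windows_and_labels(image, mask):
    # slide a fixed-size window over the image; label a window "1" when
    # more than 10% of its mask pixels belong to an imperfection
    examples, labels = [], []
    h, w = mask.shape
    for y in range(0, h - WIN + 1, STEP):
        for x in range(0, w - WIN + 1, STEP):
            win_mask = mask[y:y + WIN, x:x + WIN]
            examples.append(image[y:y + WIN, x:x + WIN])
            labels.append(int(win_mask.sum() > 0.1 * WIN * WIN))
    return np.array(examples), np.array(labels)

# usage with random stand-ins for an image/mask pair
img = np.random.rand(600, 800, 3)
msk = (np.random.rand(600, 800) > 0.995).astype(np.uint8)
X, y = windows_and_labels(img, msk)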
3.2 Augmentation
Augmentation of images in the data space has been shown to produce great results when it comes to improving the accuracy of classifiers [5]. Since, after segmenting the image, the number of examples (windows) that do not contain an imperfection is much greater than the number of examples that do, we apply certain transformations to the windows that contain an error and save each transformed window as a new example. It is important to emphasize that none of these transformations affect the example's label, meaning that if we apply them to an example containing an error, the transformed example will also contain the same error. The transformations we use are:
• rotation
• change of contrast
• change of brightness
• flipping
After applying these transformations to a single example, 23 new samples are obtained.

3.3 Deep Transfer Learning
For the task of classifying windows based on whether they contain an imperfection or not, we tested four different model architectures. One is a simple Convolutional Neural Network (CNN), and the other three are more complicated architectures that are well established in the world of image recognition.

The simple CNN is used as a baseline for what an end-to-end model can achieve on this dataset. However, since the number of examples is still relatively low, training an end-to-end deep learning model was not expected to yield great results.

On the other hand, the VGG16 [8], InceptionV3 [9] and ResNet101V2 [2] architectures were used to leverage deep transfer learning [4]. To be more specific, all of these networks have been used in the ImageNet competition, and their internal parameters (weights) from that competition are openly available for use. By using their pretrained convolutional layers as feature extractors and training our own set of fully connected layers, we can significantly improve our performance and training time. Effectively, we transfer the knowledge stored in their parameters (weights) from the ImageNet dataset to our quality control problem. To implement this, in every architecture we disregard the fully connected layers included with these architectures and generate our own (with random weights). We then attach these fully connected layers to the output of the convolutional layers (provided as pretrained on ImageNet) and train only the fully connected layers while freezing the parameters of the convolutional layers. We generate four fully connected layers, with 512, 256, 128, and 64 neurons, respectively. The implementation and the weights of these models were acquired from the Keras package in TensorFlow.
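The setup can be sketched in tf.keras as follows (our illustration, not the authors' code; the global-average-pooling layer and the single sigmoid output neuron are assumptions, since the paper specifies only the four hidden fully connected layers):

import tensorflow as tf
from tensorflow.keras import layers

# pretrained ImageNet convolutional layers, used as a frozen feature extractor
backbone = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(200, 200, 3))
backbone.trainable = False

model = tf.keras.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # does the window contain an imperfection?
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_windows, train_labels, ...) with the 200x200 windows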
3.3 Deep Transfer Learning
For the task of classifying windows based on whether they contain an imperfection, we tested four different model architectures. One is a simple Convolutional Neural Network (CNN), and the other three are more complex architectures that are well established in the world of image recognition.

The simple CNN is used as a baseline for what an end-to-end model can achieve on this dataset. However, since the number of examples is still relatively low, training an end-to-end deep learning model was not expected to yield great results. On the other hand, the VGG16 [8], InceptionV3 [9] and ResNet101V2 [2] architectures were used to leverage deep transfer learning [4]. To be more specific, all of these networks have been used in the ImageNet competition, and their internal parameters (weights) from that competition are openly available. By using their pretrained convolutional layers as feature extractors and training our own set of fully connected layers, we can significantly improve our performance and training time. Effectively, we transfer the knowledge stored in their parameters (weights) from the ImageNet dataset to our quality control problem.

To implement this, in every architecture we disregard the fully connected layers included with the architecture and generate our own (with random weights). We then attach these fully connected layers to the output of the convolutional layers (pretrained on ImageNet) and train only the fully connected layers while freezing the convolutional layers' parameters. We generate four fully connected layers, with 512, 256, 128, and 64 neurons, respectively. The implementations and the weights of these models are acquired from the Keras package in TensorFlow.
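The construction described above can be sketched in a few lines of Keras. The frozen ImageNet backbone and the 512/256/128/64 fully connected stack come from the text; the activation functions, the pooling layer bridging the backbone to the dense head, and the training configuration are assumptions added to make the sketch runnable.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_transfer_model(backbone_cls=tf.keras.applications.VGG16):
    """Frozen ImageNet backbone + trainable fully connected head
    (512/256/128/64 neurons) for binary defect classification."""
    backbone = backbone_cls(include_top=False, weights="imagenet",
                            input_shape=(200, 200, 3))
    backbone.trainable = False                  # freeze convolutional layers

    model = models.Sequential([
        backbone,
        layers.GlobalAveragePooling2D(),        # assumed bridge to the dense head
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # window contains a defect or not
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The same function can be called with `tf.keras.applications.InceptionV3` or `tf.keras.applications.ResNet101V2` to build the other two transfer-learning models.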
4 EVALUATION
4.1 Experimental Setup
We evaluated the performance of each model using leave-one-image-out (LOIO) cross-validation. This means that models are trained using examples (windows) from all images but one, and are tested using the instances from the image excluded from the training process. The process is repeated several times, and each time a different image is used to test the models' performance. Since one of the faceplates did not have any errors on its surface, windows from that image were never used to test models; instead, they were always used for training. In summary, all the models are evaluated using 4-fold LOIO cross-validation.

4.2 Evaluation Metric
In this work, we use the F1-score with macro averaging as the metric for evaluating the models' ability to classify segmented windows. The choice of the (macro) F1-score rather than accuracy was made because of the class imbalance in our data. A significant difference between accuracy and the (macro) F1-score comes from the fact that accuracy reports a high value even when the minority class is largely misclassified. For example, a high accuracy score will be reported when a classifier predicts only positive values on a test set containing mostly positive examples, even though the classifier completely misclassifies the negative instances.

To fully understand the classification results, aside from the F1-score we also visually represent how the predictions look once all windows have been rearranged in their initial positions. This representation overlays windows in their original places but changes their pixel values to all white or black based on their predicted values. An example of this representation is shown in the middle image of the triplet in Figure 4. The top image in that same figure changes the pixel values based on the ground truth rather than the prediction. Finally, the figure's bottom image represents a color-coded version of the difference between the top two images. Windows in green have been predicted as containing a fault and do contain a fault (true positive, TP). Windows in red have been predicted as not containing a fault when in fact they do contain a fault (false negative, FN). Finally, windows in blue have been predicted as containing a fault when in fact they do not (false positive, FP).

Figure 4: Example of the custom visualisation metric. The top image has the colors of the windows selected based on the ground truth, while the middle one has them selected based on the predictions of the classifier. The bottom image represents a color-coded version of the difference between the top two images.

This view is especially useful for our evaluation since it allows us to filter out wrongly classified windows which surround green clusters. This dismissal is possible because finding the exact margins of the imperfections is not of great importance in our use case.
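A minimal sketch of the LOIO protocol with the macro F1-score, using scikit-learn for the metric. The grouping of windows by their source image follows the setup above; the model is assumed to be any classifier with scikit-learn-style fit/predict, and `train_only_id` marks the defect-free faceplate that is never used as a test fold.

```python
import numpy as np
from sklearn.metrics import f1_score

def loio_macro_f1(windows, labels, image_ids, make_model, train_only_id):
    """Leave-one-image-out CV: each image with defects is held out once;
    the defect-free image stays in the training set of every fold."""
    scores = []
    for test_id in np.unique(image_ids):
        if test_id == train_only_id:
            continue                          # never used for testing
        train = image_ids != test_id
        model = make_model()
        model.fit(windows[train], labels[train])
        preds = model.predict(windows[~train])
        scores.append(f1_score(labels[~train], preds, average="macro"))
    return np.mean(scores)
```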
5 RESULTS AND DISCUSSION
Table 1 shows the average (macro) F1-scores that each of the models achieved when performing 4-fold LOIO cross-validation.

Table 1: Average model F1-score after 4-fold LOIO cross-validation.

In all of our experiments, the Simple-CNN and VGG16 architectures produce very poor results. In our opinion, a simple stacking of convolutional layers is not enough for this particular use case, since both networks are unable to learn and instead predict every example as an example with an error. On the other hand, InceptionV3 and ResNet101V2 produce good results in comparison with the other two architectures. A head-to-head comparison of the per-image F1-scores of the two best models can be found in Table 2.

Table 2: Per-image F1-scores for the InceptionV3 and ResNet101V2 models.

Although there is only a small difference between the F1-scores of InceptionV3 and ResNet101V2 (only 2%, as seen in Table 1), there is a large difference in how they predict the same images, as we can see in Figure 5 and Figure 6.

Figure 5: Visual representation of the predictions produced by the InceptionV3 model.

Figure 6: Visual representation of the predictions produced by the ResNet101V2 model.

As is clearly visible, ResNet101V2 produces many more false positives than InceptionV3. However, if we consider each cluster of same-colored pixels as one error, ResNet101V2 produces far better results in terms of the numbers of true positives and false negatives. So, even though ResNet101V2 produces many false positives, it misses only one error across all four images, whereas InceptionV3 misses four. These results can be seen in Table 3. When counting the clusters, we do not consider red clusters surrounding green clusters as false negatives.

Table 3: Sum of the numbers of true positive and false negative clusters for each of the models across the four test images.

6 CONCLUSION AND FUTURE WORK
In this paper we presented a deep transfer learning approach to quality control in the case where imperfections on metallic surfaces should be detected. Based on the results, transfer learning seems to be a suitable tool when the target dataset is very small, even when the source and target problems are vastly different. Furthermore, more complex architectures seem to produce better results than more traditional ones. When more examples of faceplates with imperfections become available, we plan to explore the effects of fine-tuning some of the convolutional layers in these models rather than freezing all of them during training. Another possible path for future work is using GANs to generate realistic-looking samples of windows with imperfections and further augment our training set. Finally, it is important to note that exploring more appropriate lighting solutions might produce better results.

ACKNOWLEDGMENTS
Part of this research was done under and for the ROBKONCEL project. Additionally, this research was partly funded by the Slovene Human Resources Development and Scholarship Fund (Ad futura).

REFERENCES
[1] Fernando Gayubo, José Luis Gonzalez, Eusebio de la Fuente, Felix Miguel, and Jose R. Peran. 2006. On-line machine vision system for detecting split defects in sheet-metal forming processes. In 18th International Conference on Pattern Recognition (ICPR'06), Volume 1. IEEE, 723–726.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European Conference on Computer Vision. Springer, 630–645.
[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521, 7553, 436–444.
[4] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1717–1724.
[5] Luis Perez and Jason Wang. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.
[6] Franz Pernkopf and Paul O'Leary. 2002. Visual inspection of machined metallic high-precision surfaces. EURASIP Journal on Advances in Signal Processing, 2002, 7, 650750.
[7] Michael Sharp, Ronay Ak, and Thomas Hedberg Jr. 2018. A survey of the advancing use and development of machine learning in smart manufacturing. Journal of Manufacturing Systems, 48, 170–179.
[8] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[9] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
[10] Chris J. Turner, Christos Emmanouilidis, T. Tomiyama, Ashutosh Tiwari, and Rajkumar Roy. 2019. Intelligent decision support for maintenance: an overview and future trends. International Journal of Computer Integrated Manufacturing, 32, 10, 936–959.

Fall Detection and Remote Monitoring of Elderly People Using a Safety Watch

Ivana Kiprijanovska, Jani Bizjak, Matjaž Gams
Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
ivana.kiprijanovska@ijs.si, jani.bizjak@ijs.si, matjaz.gams@ijs.si

ABSTRACT
As seniors age, the risk of unforeseen accidents that affect their well-being increases. Therefore, monitoring the day-to-day routine of elderly people is an important precaution to undertake. In this paper, we present the design of a safety watch integrated in a comprehensive health monitoring system capable of observing the elderly remotely. It integrates a low-power hardware architecture and an energy-efficient software configuration, which significantly extend the battery autonomy of the device. One of the major modules running on the safety watch is the automatic detection of falls and similar dangerous situations. For that purpose, several machine learning methods were tested, among which the Random Forest method achieved the highest accuracy in the detection of falls on data recorded from 17 participants, and was implemented on the actual device.

KEYWORDS
Safety watch, remote monitoring, energy efficiency, fall detection

1 INTRODUCTION
More than 90% of the elderly desire to live in their own homes for as long as they possibly can [1]. However, as seniors age, the risk of unforeseen accidents that affect their well-being increases. For example, the lives of elderly people are very often affected by falls, which lead not only to physical injuries but also to psychological consequences that further reduce their independence and decrease their quality of life [2][3]. The lack of independence causes them to no longer feel comfortable living alone, forcing them to move into nursing homes.
This puts a burden on the health-care system, with over-crowded nursing homes and hospitals, and causes higher health-care expenditures [4]. Therefore, monitoring the day-to-day routine of the elderly who live alone is an important precaution to undertake. Due to demographic change and the aging of the population, the development of remote monitoring systems has become a center of attention for both researchers and industry. Remote health monitoring systems are essential for enhancing care in a reliable manner and allow the elderly to remain in their home environment rather than in expensive nursing homes [5]. Such systems also allow communication with remote healthcare facilities and caregivers, thus allowing healthcare personnel to keep track of the elderly's overall condition and respond, if necessary, from a distant centralized facility [6].

One of the first types of remote monitoring systems presented in the literature are camera-based systems. They are capable of recognizing complex gait activities, but restrict the movement of the user to a specific range. Apart from that, they are complex, expensive and often raise privacy concerns. A recent survey gives an insight into the studies carried out in vision-based patient monitoring [7]. In the last few years, wearable motion sensors have gained popularity for monitoring human activities in real time. They can monitor and record real-time information about one's physiological condition and motion activities. Wearable sensor-based health monitoring systems may comprise different types of sensors that can be integrated into textile fiber, clothes, and elastic bands, or can be directly attached to the human body. One such system is presented in [8]; it uses a mobile phone as an intermediary to collect vital data from various sensors and transmit the data to a server for further processing. The main limitation of this system is that the analysis is not performed where the signal is acquired, and there may be a loss of efficiency in the wireless network when the physiological signals are sent. Another wearable personal healthcare system is presented in [9]. It employs a number of wearable sensors to continuously collect users' vital signs and uses Bluetooth devices to transmit the data to a mobile phone, which can perform on-site data storage and processing. After local data processing, the mobile phone periodically reports the user's health status to a healthcare centre.
Apart from such systems, various wearable commercial products are available on the market, for example the biometric shirt by Hexoskin and fitness trackers by Fitbit and Jawbone. However, many current solutions either provide insufficient functionality at a reasonable price or are advanced but too expensive, too energy-demanding or too invasive [10].

The aim of the HomeCare2020 project was to provide a comprehensive solution for a smart healthcare monitoring system, capable of observing the elderly remotely, while eliminating the problems mentioned above. The system aims to enable the elderly to live at home independently until a later age and to make them feel safer and more confident in performing everyday tasks and activities. The developed system integrates two interconnected devices: an advanced touch-screen care-phone (HomeTab) and a multifunctional safety watch. In this paper, the design of the safety watch is presented.

2 SAFETY WATCH DESIGN
The safety watch is a custom-made wristband device meant to be worn by seniors to provide 24/7 security, inside or outside the home.

Its core part, from a hardware perspective, is an ARM-based low-power Bluetooth module by Nordic [11]. When choosing the processor and other hardware components, priority was given to energy consumption, since a device that requires everyday charging is strongly undesirable, especially for the elderly, who might have problems remembering when or how to charge the device. The safety watch integrates a low-power LSM6DSL system-in-package featuring a 3D digital accelerometer and a 3D digital gyroscope. In addition, it contains a low-power Quectel module that integrates NB-IoT and GPS functionality. Since GPS and NB-IoT consume a lot of power, these two functionalities are disabled most of the time and programmatically enabled only when needed (i.e., when an emergency call is made and the device is out of Bluetooth range of the HomeTab). The Quectel module is connected to a SIM card, which is required for NB-IoT functionality. These components are connected to a rechargeable Li-ion battery, which can be recharged using a wireless (induction) charger. The diagram of the safety watch circuit can be seen in Figure 1.

Figure 1: Diagram of the safety watch circuit

The outer side of the safety watch housing comprises a membrane keypad used for manual alarm triggering (e.g., if the individual is in a dangerous situation). The keypad also integrates a small LED used to provide feedback to the user (e.g., alarm triggered, low-battery alerts). Its appearance is shown in Figure 2.

Figure 2: Safety watch appearance

From a software perspective, the design principle behind the safety watch is to preserve the battery autonomy of the device. Therefore, the main processing unit is intended to sleep whenever possible and only wakes up when certain events happen, i.e., when there is an immediate danger to the user. The safety watch has two working modes, depending on whether the user wears the watch or not. If the watch is not worn, all working modes are disabled, since there is no need for motion monitoring, and only the device status (worn or not worn) is checked in 1-minute intervals. If the watch is worn, it monitors motion, accumulates the number of steps, and sends data over Bluetooth to the HomeTab. Once the battery of the device drops to 30% or lower, the sleeping time of the main processor increases from 5 to 10 minutes and the user is notified about the low battery level. The software design of the safety watch is illustrated in Figure 3.

Figure 3: System software design
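The wake-up policy described above can be summarised in a few lines of pseudocode-style Python. This is purely an illustrative sketch of the logic as the text describes it: the `watch` object and its method names are our own, and the real firmware runs on the embedded platform rather than in Python.

```python
WORN_CHECK_S = 60           # device status checked in 1-minute intervals
SLEEP_NORMAL_S = 5 * 60     # main-processor sleep with a healthy battery
SLEEP_LOW_BATT_S = 10 * 60  # extended sleep once battery drops to 30% or lower

def duty_cycle(watch):
    """One iteration of the watch's power-management loop (illustrative)."""
    if not watch.is_worn():
        watch.sleep(WORN_CHECK_S)       # only re-check the worn status
        return
    watch.monitor_motion()              # accumulate steps, watch for danger
    watch.send_over_bluetooth()         # report to the HomeTab
    if watch.battery_percent() <= 30:
        watch.notify_low_battery()
        watch.sleep(SLEEP_LOW_BATT_S)
    else:
        watch.sleep(SLEEP_NORMAL_S)
```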
The safety watch monitors the user's behaviour (activity levels), providing incentives to the user (through the HomeTab) to move more, and at the same time making it possible to detect unusually low activity (e.g., due to sickness). The integrated LSM6DSL step-count functionality enables the number of steps to be counted throughout the day and sent at regular 15-minute intervals via Bluetooth to the HomeTab. This gives information about the user's activity levels, which the system later analyses to detect possible irregularities in the user's behaviour (which can be caused by an undetected disease). For example, if a user is feeling ill (has the flu), he will likely stay in bed significantly longer than when healthy, so the lack of movement can be detected and caregivers notified.

2.1 Fall Detection
Automatic fall detection is one of the most important modules running on the safety watch. A machine learning method that can automatically detect falls and similar dangerous situations was developed and implemented in the final software of the safety watch.

For training the machine learning models, we used a publicly available dataset that contains acceleration data from a wrist-worn device, recorded from 17 subjects [12]. It comprises 11 daily-life activities, including 5 types of falling, namely: walking, standing, sitting, picking up an object, laying, jumping, falling backwards, falling sideward, falling forwards using knees, falling forwards using hands, and falling sitting in an empty chair. Since our aim was only to detect falls in general, we grouped all fall-related activities into one class and all other activities into another class. The non-fall activities were additionally undersampled in order to adjust the class distribution of the dataset. The data were further segmented using a sliding-window technique, with a window size of 2 seconds and 50% overlap between consecutive windows. To train the machine learning models, several statistical features were extracted from the acceleration signals, including the mean, standard deviation, median, maximum, minimum, mean absolute change, variance, kurtosis, skewness, and similar. The window size and the optimal feature set were chosen based on our previous work [13].
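A minimal sketch of this segmentation and feature-extraction step for one acceleration axis is shown below. The 2-second windows, 50% overlap and the feature list come from the text; the function name, the per-axis treatment and the sampling-rate parameter are our own assumptions.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def extract_features(signal, fs, win_s=2.0, overlap=0.5):
    """Cut a 1-D acceleration signal into 2-second windows with 50% overlap
    and compute the statistical features named in the text."""
    win = int(win_s * fs)
    step = int(win * (1 - overlap))
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        feats.append([
            w.mean(), w.std(), np.median(w), w.max(), w.min(),
            np.abs(np.diff(w)).mean(),      # mean absolute change
            w.var(), kurtosis(w), skew(w),
        ])
    return np.array(feats)
```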
Various machine learning algorithms were tested: Decision Tree (DT), Random Forest (RF), and k-nearest neighbors (kNN). The algorithms' performance was evaluated using the leave-one-subject-out (LOSO) cross-validation technique. With this technique, the data is divided into N folds (where N is the number of subjects in the dataset), each fold comprising the data of a single subject. In each iteration of the LOSO cross-validation, data from one subject is used for testing the method, and the training data comprises the remaining N−1 subjects.

Among the tested algorithms, RF proved to have the best accuracy per watt of power consumed for processing the data. RF is an ensemble classifier that fits a number of decision trees on various sub-samples of the dataset and outputs the majority class label of the constructed trees. It utilizes two random steps in the process of creating the trees — random sampling of the training data points and random choice of the splitting feature — which make it robust to noise and outliers [14]. The results achieved on the laboratory data with the best-performing RF model, the DT model and the kNN model can be seen in Table 1, Table 2, and Table 3, respectively.

Table 1: Summed and normalized (per row) confusion matrix. LOSO evaluation with the Random Forest model.

              Non-fall   Fall
  Non-fall       97        3
  Fall            2       98

Table 2: Summed and normalized (per row) confusion matrix. LOSO evaluation with the Decision Tree model.

              Non-fall   Fall
  Non-fall       91        9
  Fall            8       92

Table 3: Summed and normalized (per row) confusion matrix. LOSO evaluation with the kNN model.

              Non-fall   Fall
  Non-fall       87       13
  Fall           17       83

Since the aim of the system is to offer a high degree of accuracy in detecting actual falls, as well as in filtering false alarms, two metrics were analyzed: (i) sensitivity — the capacity to detect actual falls, defined as the ratio between the number of falls correctly detected (true positives) and the number of falls that actually happened; and (ii) specificity — the capacity to filter false alarms, defined as the ratio between properly discarded activities (true negatives) and the total number of actual non-fall activities. From the confusion matrix presented in Table 1, it can be seen that the model has a very high sensitivity score of 98% and a specificity score of 97%. Both are very important for a real-life implementation of the model: together they mean that the model accurately detects falls without triggering too many false alarms, which would be detrimental to users.
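For reference, the two metrics follow directly from a confusion matrix; the sketch below reproduces the Table 1 figures (rows are the true classes, as in the tables above).

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = detected falls / actual falls;
    specificity = correctly discarded non-falls / actual non-falls."""
    return tp / (tp + fn), tn / (tn + fp)

# Row-normalized values from Table 1 (Random Forest):
sens, spec = sensitivity_specificity(tp=98, fn=2, tn=97, fp=3)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.98, 0.97
```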
The implementation of the fall detection functionality on the hardware was also carefully managed to extend the battery life. The most significant battery saving comes from processing the acceleration data in batches. The accelerometer stores the acceleration values in its internal memory while the main processor sleeps. The accelerometer's buffer fills in 10 seconds; when it is full, it wakes the main processor and the collected data is sent to it for further processing. The main processor stores about 120 seconds of acceleration data before running the fall detection algorithm. Once the 120 seconds of data are stored, the required features are calculated from the acceleration signals and the pre-trained RF model (stored in the RAM of the safety watch) is run. If no fall is detected in the two-minute segment, the main processor goes back to sleep; otherwise, an alarm procedure is triggered. The alarm is sent via Bluetooth to the HomeTab device, which forwards it to the server for further processing. If the safety watch is out of range of the HomeTab, it uses the NB-IoT network for alarm transmission. In this case, it also tries to obtain the user's location using a GPS signal.

3 CONCLUSION
This paper presented the design of a safety watch integrated into the HomeCare2020 comprehensive solution for a smart healthcare monitoring system, primarily targeted at elderly people. The main purpose of the safety watch is to help the elderly live at home independently until a later age and to make them feel safer and more confident in performing everyday tasks and activities. One of the most important modules running on the safety watch is fall detection, which enables users to call for emergency treatment in the case of a dangerous situation. For this purpose, different machine learning models were tested and compared. Among them, the RF classification model proved to have the highest performance per watt of power consumed for processing the data, which makes it the most suitable choice for implementation.

Overall, the software design of the system is highly energy-efficient and significantly extends the service time of the wearable device, which makes it convenient for use by elderly people. The system is easy to operate and therefore shows great promise for providing long-term and continuous monitoring of the elderly in an unobtrusive way. We believe that it can efficiently contribute to improving remote healthcare services.

ACKNOWLEDGMENTS
The authors would like to thank everyone who helped in any way with producing this paper. The first author acknowledges financial support from the Slovene Human Resources Development and Scholarship Fund — Ad Futura. Part of this research was done under the EIT Health HomeCare2020 project.

REFERENCES
[1] Roy, N.; Dubé, R.; Després, C.; Freitas, A.; Légaré, F. Choosing between staying at home or moving: A systematic review of factors influencing housing decisions among frail older adults. PLoS One 2018.
[2] Institute of Medicine. Falls in Older Persons: Risk Factors and Prevention. In The Second Fifty Years: Promoting Health and Preventing Disability; 1992. ISBN 978-0-309-04681-7.
[3] Boyé, N.D.A.; Van Lieshout, E.M.M.; Van Beeck, E.F.; Hartholt, K.A.; Van Der Cammen, T.J.M.; Patka, P. The impact of falls in the elderly. Trauma 2013.
[4] Stevens, J.A.; Corso, P.S.; Finkelstein, E.A.; Miller, T.R. The costs of fatal and non-fatal falls among older adults. Inj. Prev. 2006.
[5] Klaassen, B.; van Beijnum, B.J.F.; Hermens, H.J. Usability in telemedicine systems—A literature survey. Int. J. Med. Inform. 2016.
[6] Majumder, S.; Aghayi, E.; Noferesti, M.; Memarzadeh-Tehran, H.; Mondal, T.; Pang, Z.; Deen, M.J. Smart homes for elderly healthcare—Recent advances and research challenges. Sensors (Switzerland) 2017.
[7] Sathyanarayana, S.; Satzoda, R.K.; Sathyanarayana, S.; Thambipillai, S. Vision-based patient monitoring: a comprehensive review of algorithms and technologies. J. Ambient Intell. Humaniz. Comput. 2018.
[8] Benlamri, R.; Docksteader, L. MORF: A mobile health-monitoring platform. IT Prof. 2010.
[9] Wu, W.; Cao, J.; Zheng, Y.; Zheng, Y.P. WAITER: A wearable personal healthcare and emergency aid system. In Proceedings of the 6th Annual IEEE International Conference on Pervasive Computing and Communications, PerCom 2008; 2008.
[10] Peetoom, K.K.B.; Lexis, M.A.S.; Joore, M.; Dirksen, C.D.; De Witte, L.P. Literature review on monitoring technologies and their outcomes in independently living elderly people. Disabil. Rehabil. Assist. Technol. 2015.
[11] nRF5 SDK. Available online: https://www.nordicsemi.com/Software-and-tools/Software/nRF5-SDK (accessed on Aug 26, 2020).
[12] Martínez-Villaseñor, L.; Ponce, H.; Brieva, J.; Moya-Albor, E.; Núñez-Martínez, J.; Peñafort-Asturiano, C. UP-Fall detection dataset: A multimodal approach. Sensors (Switzerland) 2019.
[13] Gjoreski, H.; Stankoski, S.; Kiprijanovska, I.; Nikolovska, A.; Mladenovska, N.; Trajanoska, M.; Velichkovska, B.; Gjoreski, M.; Luštrek, M.; Gams, M. Wearable Sensors Data-Fusion and Machine-Learning Method for Fall Detection and Activity Recognition. In Studies in Systems, Decision and Control; 2020.
[14] Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
Machine Vision System for Quality Control in Manufacturing Lines

Ivana Kiprijanovska, Jani Bizjak, Matjaž Gams
Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
ivana.kiprijanovska@ijs.si, jani.bizjak@ijs.si, matjaz.gams@ijs.si

Samo Gazvoda
Cooking Appliances Division, Gorenje Group
samo.gazvoda@gorenje.com

ABSTRACT
In manufacturing, quality control is a process that oversees the aspects of production and ensures that only products that conform to industry standards and quality criteria leave the production line. Automation of the quality control process significantly reduces the time spent on product testing, hence reducing overall manufacturing costs. In this paper, we present a brief overview of the algorithms adopted for the detection of one possible fault in the production of ovens — a non-working oven fan. The detection is performed on visual data. In the initial experiments, several image processing algorithms were used, and the preliminary results are encouraging.

KEYWORDS
machine vision, image processing, fault detection

1 INTRODUCTION
Quality control is becoming an increasingly important aspect of today's manufacturing processes [1]. For efficient and successful production, manufacturers rely on quality control systems integrated into the manufacturing process. The traditional quality control process requires vast capacities of specialized labor. High utilization of the specialists may lead to human errors, low reliability of the process, and a negative impact on the quality of production. Compared to manual quality control, automated quality control systems offer a reliable control process with various other advantages, including the ability to work 24 hours a day and, in some tasks, to perform faster measurements with higher accuracy and consistency than humans [2]. Such systems are also a practical choice when the test cases need to run regularly over a significant amount of time. Machine vision systems play a growing role in modern manufacturing quality control. These systems rely on digital sensors inside industrial cameras with specialized optics to acquire images [3]. After an image is acquired, computer hardware and software process, analyze, and measure various characteristics of the image for automated decision-making.

The development of an integrated system for comprehensive quality control in production with an intelligent process control system is the main aim of the ROBKONCEL project [4]. One of the objectives of this project is the detection of faults in the production of ovens. In this paper, we present the initial experiments in the detection of one of the possible faults — a non-working oven fan.

2 PROBLEM DEFINITION
The quality control of the ovens is intended to take place in a factory environment, where products moving on a conveyor belt are visually observed, i.e., a machine vision system acquires videos of the ovens. These videos are segmented into image frames (at a 30 fps rate), and the obtained frames are further processed to detect whether the fan is working or not. For the initial experiments, we collected a few videos in a laboratory setting, with various lightings and camera positions, resulting in approximately 7200 images (~4000 with a working fan and ~3200 with a non-working fan). Additionally, the visual data of the ovens' fans were acquired through a closed door, which makes the fault detection more challenging (Figure 1). This is preferred, as opening and closing the door in a manufacturing environment would be too slow.

Figure 1: Image of an oven's fan acquired through a closed oven door.
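A minimal sketch of the video-to-frames step with OpenCV. The file name is a placeholder; the paper states only that the recorded videos are split into frames at 30 fps.

```python
import cv2

def video_to_frames(path):
    """Read a recorded video and return its frames for further processing."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:              # end of the video stream
            break
        frames.append(frame)
    cap.release()
    return frames

frames = video_to_frames("oven_fan.avi")  # placeholder file name
```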
3 IMPLEMENTED TECHNIQUES
The image processing steps for the oven fault detection, i.e., the detection of a non-working fan, are as follows:

1. Object detection
2. Glare reduction
3. Image thresholding

Each of these steps and the image processing algorithms implemented in them are explained in the following sections.

3.1 Object Detection
In order to detect and isolate the circular area of the oven fan, we made use of the Hough Gradient Method [5], an extension of the standard Hough Transform technique [6] for isolating features of a particular shape within an image. The Hough Gradient Method is based on the gradient information of edges and is used to improve the speed of circle detection in order to meet real-time implementation requirements. Its calculation steps are as follows: (i) detect edges in the image; (ii) calculate the local gradient at the edge points using a Sobel operator; (iii) use an accumulator to count the possible circle centers along the normal direction of the edge points' tangents; (iv) choose the peak circle center and circle radius for the general circle equation.

The implementation of the Hough Gradient Method in OpenCV requires a single-channel image, so the first step in the detection of circles was to convert the acquired images from the RGB color space to grayscale. Furthermore, two parameters of the circle detection function were tuned, namely the minimum distance between the center coordinates of the detected circles and the ratio of the resolution of the original image to the accumulator resolution [5]. Before running the circle detection function, a simple median filter [7] was applied to the images for noise reduction. This helped reduce the effects of various reflections in the glass part of the oven door. In general, without blurring, the algorithm tended to extract too many circular features, resulting in the detection of false circles; this preprocessing step was therefore crucial for successful circle detection. The circle detection algorithm resulted in a single circle detected in every image, however with a varying radius. Since the further analyses require images of the same dimensions, the mean radius of the detected circles was calculated and used to isolate the fan area in the images (Figure 2).

Figure 2: Detected oven's fan area.
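A minimal OpenCV sketch of this step is shown below. The two tuned parameters correspond to OpenCV's `dp` and `minDist` arguments of `cv2.HoughCircles`; the concrete values used here and the median-kernel size are illustrative assumptions, as the paper does not report them.

```python
import cv2
import numpy as np

def detect_fan_circle(frame_bgr):
    """Median-blur the grayscale frame and find the fan circle with the
    Hough Gradient Method (assumes one circle is found per frame)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)   # noise/reflection suppression (assumed kernel)
    circles = cv2.HoughCircles(
        gray, cv2.HOUGH_GRADIENT,
        2,                # dp: image-to-accumulator resolution ratio (tuned)
        gray.shape[0],    # minDist: at most one circle per frame (tuned)
    )
    x, y, r = np.round(circles[0, 0]).astype(int)  # the single detected circle
    return x, y, r
```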
3.2 Glare Reduction
A common problem in image processing is the occurrence of specular reflections in the images. In our case, since the videos of the fan were recorded through glass, a significant amount of specular reflection, or glare, was produced during recording. To reduce its effects, a glare reduction algorithm was applied. The basic glare reduction procedure consists of three steps: (i) decomposition of the original image into color, saturation and brightness components (HSV); (ii) finding particularly bright areas in the image; and (iii) inpainting these areas with the values of the surrounding pixels.

Each image was first converted into the HSV color space, which describes the image by its hue (H), saturation (S) and brightness (V) components (Figure 3).

Figure 3: Image decomposition into hue, saturation and brightness components.

With such a decomposition, a general rule for pixels that are subject to specular reflections can be derived: an image can only contain glare where its color is not saturated and its brightness is high. Since light reflections are white, any pixel containing glare cannot be saturated (white has no color saturation). Accordingly, we first filtered out the areas that have low saturation. Next, the area of the non-saturated pixels was reduced by an erosion operation, and the brightness values of the saturated pixels were set to 0. By filtering out the very bright pixels (e.g., all pixels with a value larger than 130), we obtained the final glare mask (Figure 4).

Figure 4: Original image and the obtained glare mask.

The glared pixels were then interpolated with an inpainting operation. This operation fills the masked pixels with values that stem from the adjacent non-masked pixels. The original image and its corrected version after glare reduction can be seen in Figure 5. There is a significant amount of glare in the original image, which was effectively removed in the corrected image. The corrected image is a good approximation of the original image when no glare is present.

Figure 5: Original image and its corrected version.
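The three-step procedure could look roughly as follows in OpenCV. The saturation cut-off and the erosion kernel are illustrative assumptions; the brightness cut-off of 130 and the overall structure (HSV split → glare mask → inpainting) follow the text, which this sketch simplifies slightly.

```python
import cv2
import numpy as np

def reduce_glare(frame_bgr, sat_max=40, bright_min=130):
    """Build a glare mask from low-saturation, high-brightness pixels
    and inpaint the masked areas from their surroundings."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    low_sat = (s < sat_max).astype(np.uint8)            # candidate glare regions
    low_sat = cv2.erode(low_sat, np.ones((3, 3), np.uint8))
    mask = ((low_sat > 0) & (v > bright_min)).astype(np.uint8) * 255
    return cv2.inpaint(frame_bgr, mask, 3, cv2.INPAINT_TELEA)
```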
3.3 Thresholding
Thresholding is one of the simplest methods for image segmentation and the creation of binary images [6]. If the two images of a working and a non-working fan in Figure 6 are analyzed, it can be seen that the lighting allows the oven fan parts to stand out and be clearly seen behind the grid when the fan is not working. On the other hand, when the fan is working, the fan area behind the grid is blurred. Therefore, a simple thresholding method was utilized to distinguish a working from a non-working fan.

Figure 6: Working and non-working oven fan.

The main goal of the binary thresholding was to enhance the parts of the oven fan that are visible when it is not working. For that purpose, the images were first converted from the RGB color space to grayscale. Next, with the binary thresholding method, each pixel in the image was replaced with a black pixel if its intensity was less than a chosen constant (T = 90), or with a white pixel if its intensity was greater than that constant. This makes the illuminated parts of the oven fan completely white (when the fan is not working), while the grid and the moving fan become completely black, as can be seen in the examples in Figure 7.

Figure 7: Non-working and working oven fan — thresholded images.

As a final step, the number of white pixels was calculated in the binary-thresholded images that depict non-working fans. The 5th percentile of these values was then set as the threshold for deciding whether a given image represents a working or a non-working fan: if the image contains more than X white pixels, where X is the previously calculated 5th percentile, it is classified as a non-working oven fan; otherwise, it is classified as a working oven fan.

In a last post-processing step, the class of each image frame was taken as the majority class of the last 20 frames. This helped eliminate quick one-frame changes from working to non-working, or vice versa.
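A sketch of the per-frame decision and the 20-frame majority vote is shown below. T = 90 and the 5th-percentile rule come from the text; the helper names are our own, and the calibration frames are assumed to be frames known to show a non-working fan.

```python
import cv2
import numpy as np
from collections import deque

T = 90  # binary threshold on grayscale intensity (from the paper)

def white_pixel_count(frame_bgr):
    """Count pixels brighter than T in the binarized frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, T, 255, cv2.THRESH_BINARY)
    return int(np.count_nonzero(binary))

def calibrate_threshold(nonworking_frames):
    """X = 5th percentile of white-pixel counts over known non-working frames."""
    return np.percentile([white_pixel_count(f) for f in nonworking_frames], 5)

def classify_video(frames, x_threshold):
    """Per-frame decision smoothed by a majority vote over the last 20 frames."""
    history, labels = deque(maxlen=20), []
    for frame in frames:
        history.append(white_pixel_count(frame) > x_threshold)  # True = non-working
        labels.append(sum(history) > len(history) / 2)
    return labels
```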
Eventually, the implemented image processing method correctly classified 95% of the images across four different videos. The confusion matrix of the method is presented in Table 1.

Table 1: Confusion matrix for the proposed method.

                 Non-working   Working
  Non-working        3117          82
  Working             280        3720

As the main purpose of the system is to offer high accuracy in the detection of oven faults while filtering false alarms, we additionally analysed two metrics: (i) sensitivity, i.e., the method's capacity to detect actual faults (non-working fans), defined as the ratio between the number of non-working-fan images correctly identified (true positives) and the total number of non-working-fan images; and (ii) specificity, i.e., the method's capacity to filter false alarms, defined as the ratio between properly discarded images (true negatives) and the total number of working-fan images. The method has a very high sensitivity score of 97% and a specificity score of 93%.

4 CONCLUSION
In this paper, we presented an image processing pipeline adopted for the detection of a possible fault in the production of ovens — a non-working oven fan. The image processing steps comprise object detection (to isolate the oven fan area in the images), glare reduction (to reduce the effects of specular reflections), and image thresholding (for the final decision-making). The preliminary results show that a quality control system exploiting image processing algorithms could be used in an automated manufacturing environment. In the future, we plan to employ reflection removal algorithms, which can significantly facilitate the object detection process, such as Sparse Blind Separation with Motions (SPBS-M) [8], Superimposed Image Decomposition (SID) [9], Ghosting Cues [10] and similar. However, the utilization of such algorithms may significantly impact the time performance of the method, so an acceptable trade-off between the method's accuracy and its time performance should be explored in future analyses.

ACKNOWLEDGMENTS
The first author acknowledges financial support from the Slovene Human Resources Development and Scholarship Fund — Ad Futura. Part of this research was done under and for the ROBKONCEL project.

REFERENCES
[1] Mohamad, H.; Jenal, R.; Genas, D. Quality Control Implementation in Manufacturing Companies: Motivating Factors and Challenges. In Applications and Experiences of Quality Control; 2011.
[2] Heleno, P.; Davies, R.; Brazio Correia, B.A.; Dinis, J. A machine vision quality control system for industrial acrylic fibre production. EURASIP J. Appl. Signal Processing 2002.
[3] Golnabi, H.; Asadpour, A. Design and application of industrial machine vision systems. Robot. Comput. Integr. Manuf. 2007.
[4] ROBKONCEL. Available online: http://www.smm.si/?post_id=4682&lang=en (accessed on Aug 28, 2020).
[5] Yuen, H.K.; Princen, J.; Illingworth, J.; Kittler, J. A Comparative Study of Hough Transform Methods for Circle Finding; 2013.
[6] Shapiro, L.; Stockman, G. Computer Vision, 1st Edition; 2001. ISBN 9780130307965.
[7] Huang, T.S.; Yang, G.J.; Tang, G.Y. A Fast Two-Dimensional Median Filtering Algorithm. IEEE Trans. Acoust. 1979.
[8] Gai, K.; Shi, Z.; Zhang, C. Blind separation of superimposed moving images using image statistics. IEEE Trans. Pattern Anal. Mach. Intell. 2012.
[9] Guo, X.; Cao, X.; Ma, Y. Robust separation of reflection from multiple images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2014.
[10] Shih, Y.; Krishnan, D.; Durand, F.; Freeman, W.T. Reflection removal using ghosting cues. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2015.

Abnormal Gait Detection Using Wrist-Worn Inertial Sensors

Ivana Kiprijanovska, Matjaž Gams
Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
ivana.kiprijanovska@ijs.si, matjaz.gams@ijs.si

Hristijan Gjoreski
Faculty of Electrical Engineering and Information Technologies, Skopje, N. Macedonia
hristijang@feit.ukim.edu.mk

ABSTRACT
Falls are a major health problem among elderly people and often lead to serious physical and psychological consequences. Identification of elderly people who are at risk of falling helps in the selection of effective preventative measures that minimize the likelihood of falls. The occurrence of gait abnormalities is one of the most significant fall precursors. Wearable sensors enable continuous monitoring of gait during daily routines and therefore offer the possibility of early detection of gait changes. In this paper, we analyze the ability of machine learning models to detect gait abnormalities using data from inertial sensors integrated into a smartwatch, and how they perform on the dominant and non-dominant wrist.

KEYWORDS
Gait analysis, abnormal gait, fall risk assessment, smartwatch, wearable sensors

1 INTRODUCTION
Falls present a major health problem among elderly people.
devices, wristbands and smartwatches are increasingly popular One-third of the population aged over 65 years experience at because people find the wrist placement one of the least least one fall per year [1]. Falls greatly affect the quality of life intrusive placements to wear a device. and restrict the independence of those affected. They not only In this paper, we analyzed the ability of inertial sensors lead to severe physical consequences but also result in high integrated into smartwatches to detect human gait abnormalities health care costs. Due to the rapid aging of the population, this that are related to fall risk. Moreover, we studied how the problem will further increase in the near future [2]. Therefore, performance of machine learning models on the non-dominant there is an urgent need for reliable screening tools to identify wrist compares to the performance on the dominant wrist. those at risk and to target effective fall prevention strategies. Falls are a consequence of several intrinsic and extrinsic fall risk factors, among which balance and gait disorders are the 2 RELATED WORK most common ones [3]. Gait is a sensitive indicator of an The recent advancements in sensor technology have led to individual's overall health status, so the occurrence of abnormal applications of wearable sensor devices in gait analysis for fall gait patterns usually represents an early indication of an risk assessment. Several studies have been carried out by underlying neurodegenerative disorder. Clinical research has combining wearable devices with inertial sensors and machine learning methods. The general pipeline in these studies consists Permission to make digital or hard copies of part or all of this work for personal or of signal acquisition while the person performs everyday-life classroom use is granted without fee provided that copies are not made or activities or pre-defined functional tests, signal processing and distributed for profit or commercial advantage and that copies bear this notice and feature engineering, and lastly training a machine learning the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). classifier that produces an output that depends on the Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia application. © 2020 Copyright held by the owner/author(s). 47 Howcroft et al. [7] have presented insightful accounts of peripheral vision, visual distortion, balance deficit, and similar. features, classification models and validation strategies related These effects alter the gait and are highly correlated with an to sensor-based fall-risk assessment. They have found large increased risk of falls [3]. Both scenarios (normal and abnormal heterogeneity in terms of sensor-based features and sensor walk) were repeated by each subject five times, resulting in ten placement. Regarding the features, the existing studies most walking sessions per subject. An example of two motion often use features from the time and frequency domain, which samples from the sensors in the smartwatch worn on the right include mean, variance and energy of the windowed inertial wrist of one subject is shown in Figure 2. data, as well as spectral components such as dominant frequency and harmonic ratio [8]. 
In addition, some biomechanical gait features, such as stride length, clearance, stance and swing time for each stride, cycle time, cadence and similar, have been revealed as effective predictors of falls [9]. In terms of sensor location, the most exploited body positions are the shanks, waist, pelvis, and feet. In [10], the authors made use of wearable devices incorporating accelerometers and gyroscopes, worn on the shanks and waist. They proposed a general probabilistic modeling approach for the classification of different pathological types of gait through the estimation of spatiotemporal features. They showed that a Support Vector Machine (SVM) classifier can identify mobility impairment in elderly people with an accuracy of 90.5%. In [11], the authors showed that by assessing walking quality during a six-minute walk test with accelerometers attached to the lower leg and pelvis, prospective fallers and non-fallers can be successfully differentiated with a Random Forest (RF) classifier. Similar findings were confirmed for inertial sensors attached at the sternum in [12]. However, these body locations may be found obtrusive for wearing a device over longer periods of time. On the other hand, the wrist is considered the most unobtrusive and widely accepted position to wear a device, as it does not affect the everyday-life activities of the user. Still, sensors worn on the wrist are affected by frequent movements, as the hand is generally the most active part of the body. This makes the analysis of gait very challenging, and thus wrist-worn devices have not yet been utilized for gait abnormality detection for fall risk assessment. Considering the lack of evidence supporting the feasibility of fall risk assessment with sensors worn on the wrist, in this paper we analyze the performance of several machine learning methods that utilize inertial sensor data from a wrist-worn device.

3 DATASET
For this study, we collected a dataset comprising recordings from 18 subjects (8 males, 10 females, aged 19–54). Each subject wore two Mobvoi TicWatch E smartwatches [13], one on the left and one on the right wrist (Figure 1). The two smartwatches ran an Android application that collected data from the inertial sensors integrated into the devices — accelerometer, gyroscope, and magnetometer — at a sampling frequency of 100 Hz.

Figure 1: Equipment for data collection

The subjects walked back and forth along a 15-meter straight line and performed two scenarios: a normal walk and a simulated abnormal walk. In the normal gait scenario, subjects walked at a comfortable pace and performed their natural gait, while in the simulated abnormal walk scenario, subjects walked while wearing impairment glasses [8]. The glasses were used to simulate the effects of impairment, including reduction of peripheral vision, visual distortion, balance deficit, and similar. These effects alter the gait and are highly correlated with an increased risk of falls [3]. Both scenarios (normal and abnormal walk) were repeated by each subject five times, resulting in ten walking sessions per subject. An example of two motion samples from the sensors in the smartwatch worn on the right wrist of one subject is shown in Figure 2.

Figure 2: IMU sensor signals from a normal walking session (left) and an abnormal walking session (right)

4 METHOD
The machine learning method that we developed for this study consists of several steps: preprocessing of the acquired sensor signals (filtering and data segmentation), feature engineering and extraction from the signal segments, and training of a classification model. In the first step, the raw IMU signals were filtered with a band-pass filter with cut-off frequencies in the range of 0.5 to 3.5 Hz [14], which attenuates the frequencies outside the range related to human walking activity [15]. After the filtering step, the sensor signals were segmented using a sliding window. Since the window size and the sliding parameter have to be tuned for the task at hand, the windowing parameters were determined empirically. Eventually, we chose a window size of 8 seconds, with 50% overlap between consecutive windows.
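A minimal sketch of this preprocessing, assuming SciPy. The 0.5–3.5 Hz pass band, the 100 Hz sampling rate, and the 8 s / 50% windowing come from the text; the filter order and the zero-phase filtering via `filtfilt` are our assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100  # sampling frequency in Hz (from the text)

def preprocess(signal, win_s=8.0, overlap=0.5):
    """Band-pass a raw IMU signal to 0.5-3.5 Hz and cut it into
    8-second windows with 50% overlap."""
    b, a = butter(N=4, Wn=[0.5, 3.5], btype="bandpass", fs=FS)
    filtered = filtfilt(b, a, signal)       # zero-phase filtering (assumed)
    win = int(win_s * FS)
    step = int(win * (1 - overlap))
    return [filtered[i:i + win]
            for i in range(0, len(filtered) - win + 1, step)]
```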
To train a classification model, we extracted several features from the time and frequency domains for each sensor signal. The tsfresh Python package [16] allows general-purpose time-series feature extraction, which we exploited to generate more than 100 features per sensor stream. These features include the minimum, maximum, mean, variance, the correlation between axes, their covariance, skewness, kurtosis, the number of times the signal is above/below its mean, the signal's mean change, and its different autocorrelations, among others. An additional subset of frequency-domain features was calculated using the signal's power spectral density (PSD), which is based on the fast Fourier transform (FFT); these include the PSD energy, entropy and binned distribution, the largest magnitude of the PSD (i.e., of the dominant frequency in the signal), and the first four statistical moments of the PSD (mean, standard deviation, skewness, and kurtosis) [17][18].

We compared several different ML models that have all previously been proven suitable for the analysis of human activities:

1) Decision tree (DT) [19] is an algorithm that learns a model in the form of a tree structure, with decision nodes with two or more branches, each representing values of a tested feature, and leaf nodes which represent a decision on the target class. In other words, it predicts the target class by learning decision rules from the training features.

2) Random forest (RF) [20] is an ensemble of decision tree classifiers. It creates multiple decision trees, each trained on a bootstrapped sample of the original training data, and searches only across a randomly selected subset of features to determine a split. For the decision on the target class, each tree outputs a prediction, and the final prediction of the classifier is determined by a majority vote of the trees.

3) Support vector machine (SVM) [21] is an algorithm characterized by the use of kernel functions. They are used to transform feature vectors into a higher-dimensional space, in which a separating hyperplane is learned to best fit the training data.

4) K-nearest neighbors (kNN) [22] is an algorithm that uses feature-vector similarity: for each feature vector in the test data, it finds the k nearest neighbors in the training set. The final prediction of the classifier is determined by a majority vote of the chosen neighbors.

To estimate the generalization accuracy of the models, we utilized the leave-one-subject-out (LOSO) cross-validation technique. With this approach, the data is repeatedly split according to the number of subjects in the dataset. In each iteration, one subject is selected for testing purposes, while the other subjects are used for training the model. This procedure is repeated until data from all subjects have been used as test data.
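The LOSO protocol maps directly onto scikit-learn's LeaveOneGroupOut splitter; a sketch with the Random Forest model (the best performer in Section 5) is shown below. The feature matrix X, labels y and per-window subject IDs are assumed to come from the preprocessing and feature extraction described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

def loso_accuracy(X, y, subject_ids):
    """Leave-one-subject-out CV: every subject is held out exactly once."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores)
```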
5 EXPERIMENTAL RESULTS
To observe the performance of the models in real-life scenarios, we carried out several experiments. We observe the performance of the models on the left and right wrist separately, to see whether they achieve similar results on both wrists. Since real life poses many challenges that should be taken into account, we considered three different training scenarios for each wrist. In total, we test the accuracy of the models for six train-test combinations: training on the left wrist and testing on the right wrist (L–R), training on the right wrist and testing on the right wrist (R–R), training on both wrists and testing on the right wrist ((L+R)–R), training on the left wrist and testing on the left wrist (L–L), training on the right wrist and testing on the left wrist (R–L), and training on both wrists and testing on the left wrist ((L+R)–L). With these combinations, we want to see whether training a model with data from only one wrist or from both wrists combined leads to higher accuracy. Moreover, another challenge that we took into account is a device with a model developed for the right (left) wrist being worn on the left (right) wrist, hence the "switching wrists" combinations [23]. The results of these experiments can be seen in Table 1. The performance of the machine learning models is additionally compared with the performance of a baseline method, a majority-vote classifier.

Table 1: Gait abnormality detection accuracy of individual classifiers.

  Classifier            L–L    R–L   (L+R)–L   R–R    L–R   (L+R)–R
  Baseline (majority)   61.4   61.4   61.4     61.4   61.4   61.4
  DT                    75.1   51.2   78.0     74.5   65.6   76.6
  RF                    83.9   72.2   84.3     82.8   81.3   84.3
  SVM                   68.3   61.0   72.4     64.4   66.4   71.4
  kNN                   63.2   57.3   63.8     61.2   62.6   63.0

From the presented results, it can be seen that the RF algorithm significantly outperforms the other algorithms for each train-test combination, while kNN achieves the lowest accuracy in the detection of gait abnormalities. Moreover, the results show that the right-left combination achieves 72.2% accuracy, which is significantly lower than the left-left combination at 83.9% with the RF model. On the other hand, the difference between the left-right and right-right combinations is minor — only 1.5 percentage points. These results suggest that models trained with data from the left wrist can perform well on both wrists, but the data acquired from the right wrist does not carry enough information to train a reliable model that would also perform well on the left wrist.

However, the problem of "switching wrists" can be overcome if the models are trained with data from both wrists. In fact, the models trained with data from the left and right wrist combined outperform the other two combinations for both wrists, achieving the highest accuracies of 84.3% for the left wrist and 84.3% for the right wrist with the RF model.

Overall, the results suggest that the models perform better for the left wrist. Since all subjects included in the dataset were right-handed, we can conclude that the non-dominant hand carries more information about the walking patterns of the subjects.
5 EXPERIMENTAL RESULTS

To observe the performance of the models in realistic scenarios, we carried out several experiments, examining the performance of the models on the left and right wrist separately to see whether they achieve similar results on both wrists. Since real life poses many challenges that should be taken into account, we considered three training scenarios for each wrist; that is, we tested the accuracy of the models for six train-test combinations: training on the left wrist and testing on the right wrist (L-R), training on the right wrist and testing on the right wrist (R-R), training on both wrists and testing on the right wrist ((L+R)-R), training on the left wrist and testing on the left wrist (L-L), training on the right wrist and testing on the left wrist (R-L), and training on both wrists and testing on the left wrist ((L+R)-L). With these combinations, we want to see whether training a model with data from only a particular wrist or from both wrists combined leads to higher accuracy. Moreover, another challenge that we took into account is a device with a model developed for the right (left) wrist being worn on the left (right) wrist, hence the "switching wrists" combinations [23]. The results of these experiments are shown in Table 1. The performance of the machine learning models is additionally compared with the performance of a baseline method, a majority vote classifier.

Table 1: Gait abnormality detection accuracy of individual classifiers.

Classifier            L-L    R-L    (L+R)-L   R-R    L-R    (L+R)-R
Baseline (majority)   61.4   61.4   61.4      61.4   61.4   61.4
DT                    75.1   51.2   78.0      74.5   65.6   76.6
RF                    83.9   72.2   84.3      82.8   81.3   84.3
SVM                   68.3   61.0   72.4      64.4   66.4   71.4
kNN                   63.2   57.3   63.8      61.2   62.6   63.0

From the presented results, it can be seen that the RF algorithm significantly outperforms the other algorithms for each train-test combination, while kNN achieves the lowest accuracy in detecting gait abnormalities. Moreover, the right-left combination achieves 72.2% accuracy, significantly lower than the left-left combination, which achieves 83.9% with the RF model. On the other hand, the difference between the left-right and right-right combinations is minor, only 1.5 percentage points. These results suggest that models trained with data from the left wrist could perform well on both wrists, but the data acquired from the right wrist does not carry enough information to train a reliable model that also performs well on the left wrist. However, the problem of "switching wrists" can be overcome if the models are trained with data from both wrists: the models trained with data from the left and right wrist combined outperform the other two combinations for both wrists, achieving the highest accuracy with the RF model, 84.3% for the left wrist and 82.3% for the right wrist. Overall, the results suggest that the models perform better for the left wrist. Since all subjects included in the dataset were right-handed, we can conclude that the non-dominant hand carries more information about the walking patterns of the subjects.

6 CONCLUSION

In this paper, we analyzed the ability of machine learning algorithms to detect gait abnormalities using data from inertial sensors integrated into a smartwatch. Among the compared machine learning algorithms, Random Forest achieved the highest accuracy. The analysis of the performance of the models on the left and right wrist showed that they perform better on the left wrist, which was the non-dominant one for the subjects included in the dataset. The "switching wrists" experiments, i.e., training the models with data collected from one wrist and testing on the other, showed that the accuracy of the models drops significantly. However, when the models were trained with data from both wrists and applied to each wrist individually, the accuracy increased, outperforming even the models that were trained and tested on the same wrist. Therefore, the best practical solution is to deploy a model trained with data from both wrists. Overall, the results are satisfactory and show that data generated by wrist-worn inertial sensors is sufficient for gait abnormality detection and can be used for fall risk assessment in non-clinical environments.

ACKNOWLEDGMENTS

The authors would like to thank all the participants who took part in the dataset collection. The first author acknowledges the financial support of the Slovene Human Resources Development and Scholarship Fund – Ad Futura.
[2] Ageing and health Available online: https://www.who.int/news- [20] Breiman, L. Random Forest. Mach. Learn. 2001, 45, 5–32. room/fact-sheets/detail/ageing-and-health (accessed on Aug 30, 2020). [21] Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995. [3] Salzman, B. Gait and balance disorders in older adults. Am. Fam. [22] Aha, D.W.; Kibler, D.; Albert, M.K. Instance-Based Learning Physician 2011. Algorithms. Mach. Learn. 1991. [4] Horikawa, E.; Matsui, T.; Arai, H.; Seki, T.; Iwasaki, K.; Sasaki, H. Risk [23] Gjoreski, M.; Gjoreski, H.; Luštrek, M.; Gams, M. How accurately can of falls in Alzheimer’s disease: A prospective study. Intern. Med. 2005. your wrist device recognize daily activities and detect falls? Sensors [5] Allen, N.E.; Schwarzel, A.K.; Canning, C.G. Recurrent falls in (Switzerland) 2016. parkinson’s disease: A systematic review. Parkinsons. Dis. 2013. 50 Avtomatska detekcija obrabe posnemalnih igel Automatic Wear Detection of Broaches Primož Kocuvan Jani Bizjak primoz.kocuvan@ijs.si jani.bizjak@ijs.si Institut "Jožef Stefan" Institut "Jožef Stefan" Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenija Ljubljana, Slovenija Stefan Kalabakov Matjaž Gams stefan.kalabakov@ijs.si matjaz.gams@ijs.si Institut "Jožef Stefan" Institut "Jožef Stefan" Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenija Ljubljana, Slovenija Slika 1: Odčitki signala posnemalne igle POVZETEK najboljše metode je 27 posnemanj oz. 1,8% glede na povprečno Posnemanje materiala je ena izmed metod strojnega obdelovanja število posnemanj, ki se opravijo pred menjavo. izdelkov, ki jih dosežemo s t.i posnemalno iglo. V grobem ločimo zunanje posnemanje in notranje posnemanje materiala. V pri- KLJUČNE BESEDE spevku se posvečamo notranjemu posnemanju, pri katerem se Posnemalne igle, avtomatsko zaznavanje, regresija, strojno uče- v začetku naredi manjšo luknjo v obdelovanec, nato pa posto- nje poma oblikuje profil. To se doseže z različnimi premeri rezil tako, da je na začetku premer manjši, nato pa se postopno povečuje. ABSTRACT Tako se lahko oblikuje poljuben krožni ali n-kotni profil. Zaradi obrabe rezil pri posnemanju se morajo le-ta redno menjati. V Broaching is one of the methods in metalworking, which is per- prispevku je opisan pristop napovedovanja obrabe posnemalne formed with the so-called broach. We distinguish between exter- igle glede na cikel posnemanja. Glavna značilka, uporabljena za nal broaching and internal broaching. In this paper, an internal napovedovanje, je t.i mikroraztezanje (ang. microstrain), ki pove, broaching is presented, where a small hole is initially made in za koliko se spremeni obremenitev na merilnem mestu v delcih the workpiece, and then the broach gradually forms a profile. na milijon. V prispevku je predstavljenih več metod strojnega This is achieved with different blade diameters so that initially učenja za reševanje omenjenega problema. Povprečna napaka the diameter is smaller and then it gradually increases. Thus, any circular or polygon shape can be formed. Due to the wear during Permission to make digital or hard copies of part or all of this work for personal broaching, the blades must be replaced regularly. In this paper, or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and an approach for predicting how many broaching processes or the full citation on the first page. 
1 INTRODUCTION

Broaching is a very precise machining procedure for metal products. Broaches are relatively expensive (several thousand EUR), so broaching is used in industry only when the number of workpieces is sufficiently large. Because of mass production, using a worn or broken broach leads to high costs for the manufacturer, so broaches are currently replaced after 1500 broaching cycles regardless of their actual wear. With machine learning, it is possible to predict more precisely when a particular broach will be too worn, which gives better utilization of the broaches and immediate detection of a possible broach failure. The authors of this paper have received several awards for various industrial applications (prof. dr. Matjaž Gams [1]), and the first author worked on the processing of time signals in his bachelor's thesis [2]. Some researchers have approached the problem with a combination of machine vision and machine learning [3], [4], [5], and others with force measurements [6], [7].

Figure 2 shows an example of the broaching signal of one cycle, i.e., one workpiece. The abscissa shows time, while the ordinate shows the load, i.e., the strain on the spacer plate, expressed in microstrain [8]. The spacer plate is a metal plate that ensures the proper offset between the metal workpiece and the broach; this is where we measure the strain. We can observe how the strain on the spacer plate (which can be interpreted as force) changes as the diameter of the broach's teeth changes. Figure 3 shows an example of a broach (in general), with the workpiece marked in red; the arrow indicates the direction of movement.

Figure 2: Example of a zoomed-in view of a single cut in an arbitrary signal
Figure 3: Example of a broach [9]

2 PROBLEM DEFINITION

The blades wear out (become blunt) during use, so a greater force is needed for each cut. As the force increases, so does the probability that the cut will be faulty or that the blade will be damaged (e.g., a cutting tooth breaks off). When the blades are too worn, they can be sharpened, which is much cheaper than buying a new blade in the case of irreparable damage (e.g., a broken tooth). The current production procedure is to replace all blades after 1500 broaching cycles, since beyond that number the probability of failure is too high. The problem is that, depending on various external factors (e.g., lubricants, temperature, etc.), a blade wears faster or slower, which leads to defects on the product or to poorer utilization of the blade.

Figure 4 compares the signal of a new broach (blue) and a worn broach (red). The broach represented by the red signal generally has a larger integral (area under the curve), which means the force is larger. The signals also differ in the number and magnitude of individual peaks; in some cases certain peaks are missing entirely (the blade is completely worn out).

Figure 4: Comparison of strain between a worn and a new broach

3 SOLVING THE PROBLEM

From Figure 4 we can see that the number and height (integral) of the peaks are among the most important factors for recognizing a fault, with the shape of the peaks as a secondary factor.
We approached automatic peak detection by detecting when the signal rises above the standard deviation (noise) of the signal. Between individual cuts the broach is idle, as can be seen in Figure 1. In this way we obtained a window containing only the signal produced during cutting. Using the Tsfresh library [10], we then computed around 1000 different attributes describing the signal, such as the minimum and maximum values of the signal and the frequencies and patterns appearing in it. Next, we filtered the attributes by relevance (also with the mentioned library), which computes a p-value, i.e., the statistical significance level, for each attribute. In the final stage, the Benjamini-Yekutieli algorithm is run over the set of p-values and decides which features to keep and which to discard. It turned out that the most important attributes are the area, the maximum value, and the number of peaks, i.e., only three attributes.
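Under stated assumptions (a 1-D microstrain array, a noise-threshold factor of 3, and a known cycle number per detected cut, none of which the paper specifies exactly), the segmentation and Tsfresh-based selection described above could be sketched as:

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features, select_features

def cut_windows(signal, k=3.0):
    """Contiguous index runs where |signal - mean| exceeds k times the noise std."""
    active = np.flatnonzero(np.abs(signal - signal.mean()) > k * signal.std())
    return np.split(active, np.where(np.diff(active) > 1)[0] + 1) if active.size else []

# One row per sample: window id, time index, microstrain value
rows = [{"id": w, "time": int(t), "value": float(microstrain[t])}
        for w, win in enumerate(cut_windows(microstrain)) for t in win]
long_df = pd.DataFrame(rows)

X = extract_features(long_df, column_id="id", column_sort="time")  # ~1000 attributes
X = X.dropna(axis=1)                     # drop attributes that could not be computed
y = pd.Series(cycle_of_window, index=X.index)  # target: cycle number of each cut
X_selected = select_features(X, y)       # per-attribute p-values + Benjamini-Yekutieli
```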
With the selected attributes, we used machine learning to predict in which cycle a given broach is, i.e., how close it is to failure. We used the following approaches within the Scikit-learn framework [11], [12]:
• linear regression (Linear Regression) [13],
• gradient boosting for regression (Gradient Boosting) [14],
• AdaBoost [15],
• K nearest neighbours (K Nearest Neighbours) [16].

The broaching machine from which we obtained the measurements has a left and a right broach, both operating simultaneously, i.e., both cut material at the same time. Figure 5 shows an example of measurements of the left and right broach at a given time. One broach has a visibly larger integral, which means that the broach/sensor should be calibrated at the beginning of the measurement cycle; this would ensure the same starting point for further statistical processing of the data. To avoid this issue, in this paper we compared only broaches on the same side (left or right).

Figure 5: Comparison of strain readings from the left and right broach

4 RESULTS

For predicting the continuous value of the target variable (regression), we use the MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) metrics. The difference is that MAE returns only the absolute difference, while RMSE squares the error, thereby penalizing larger differences, i.e., cases where the prediction is off by a larger number of cycles. In our case we used only the MAE, defined by equation (1):

\( \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - x_i \rvert \)  (1)

Figure 6 shows a comparison of the signal integrals after a given number of cuts. We can see (especially for the right broach) that the integral values increase with the number of cuts, which is consistent with the expectation that the same cut with a blunt blade requires more force. A similar, though less pronounced, trend holds for the maximum force during a cut, as shown in Figure 7. Figure 8 shows the number of peaks recognized by the algorithm; as expected, the number of peaks is inversely proportional to the number of cuts. The blades on individual broaches wear out, so those broaches no longer cut and the force on the blade is low, since the blade removes no material. The next, unworn blade must then remove a larger amount of material because the previous one did not do its job, which leads to greater force and wear on that blade.

Figure 6: Signal integral versus the number of broach cuts
Figure 7: Maximum signal values versus the number of broach cuts
Figure 8: Number of signal peaks versus the number of broach cuts

Table 1 shows the cycle-prediction results for each machine learning method.

Table 1: Regressors and their corresponding MAE metrics

Regressor           MAE
Linear regression   101.25
Gradient boost      27.58
AdaBoost            165.44
KNN                 74.16
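Continuing the sketch above, the listed regressors could be compared with the MAE from equation (1) roughly as follows (the split and data are placeholders; the scikit-learn classes match the ones referenced in [13]-[16]):

```python
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# X_selected: attributes per cut (area, maximum value, number of peaks)
# y: cycle number of the cut, ranging from 0 to 1500
X_tr, X_te, y_tr, y_te = train_test_split(X_selected, y, random_state=0)

for model in (LinearRegression(), GradientBoostingRegressor(),
              AdaBoostRegressor(), KNeighborsRegressor()):
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))   # equation (1)
    print(f"{type(model).__name__}: MAE = {mae:.2f}")
```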
5 CONCLUSION

Our contribution to this field is the method of measuring the microstrain on the spacer plate and analyzing the time signal with machine learning. With the described approaches we obtained a mean absolute error (MAE) of 27.58, which means that when predicting the cycle of the current cut, our model is off by 27.58 cycles on average. The values (number of cycles) range from 0 to 1500.¹ Since 27.58/1500 ≈ 1.84%, this means that the model predicts with about 98.16% accuracy in which cycle a broach is, i.e., when the broach needs to be replaced. Further research should focus on optimizing the hyperparameters of the individual regressors. The final goal of the research is to implement this kind of signal-based comparison in the production process.

¹ The model assumes that a broach must be replaced when its signal looks like the signal of a broach with 1500 cuts. To determine the exact "breaking point", i.e., when the broach actually fails, additional measurements/recordings would be needed in which a broach is used until defects appear on the product.

ACKNOWLEDGMENTS

This research was partly funded by the ROB-KONCEL project (grant OP20.03530) and by ARRS. We thank the company UNIOR (Jože Ravničan and Tomaž Hohler).

REFERENCES
[1] 2011. Ventil - revija za fluidno tehniko, avtomatizacijo in mehatroniko. Ljubljana.
[2] Primož Kocuvan. 2015. Zaznavanje srčnega šuma v fonokardiogramih. Diploma thesis, Univerza v Ljubljani, 50.
[3] Wenmeng Tian, Lee J. Wells and Jaime Camelio. 2016. Broaching tool degradation characterization based on functional descriptors. In 11th Manufacturing Science and Engineering Conference (MSEC2016), USA.
[4] S. Kurada and C. Bradley. [n.d.] A machine vision system for tool wear assessment. 30, 295–304.
[5] S. Damodarasamy and Shivakumar Raman. [n.d.] An inexpensive system for classifying tool wear states using pattern recognition. 170, 149–160.
[6] Dongfeng Shi and Nabil N. Gindy. [n.d.] Tool wear predictive model based on least squares support vector machines. 21, 1799–1814.
[7] S. Rangwala and D. Dornfeld. [n.d.] Sensor integration using neural networks for intelligent tool condition monitoring. 112, 219–228.
[8] Anderson Langone Silva, Marcus Varanis, Arthur Guilherme Mereles, Clivaldo Oliveira and José Manoel Balthazar. [n.d.] A study of strain and deformation measurement using the Arduino microcontroller and strain gauges devices. 41.
[9] Srednja šola Koper. 2020. Posnemanje materiala. http://www2.sts.si/arhiv/tehno/projekt3/Posnemanje/posnemanje.htm.
[10] Tsfresh library. 2020. Tsfresh. https://tsfresh.readthedocs.io/en/latest/.
[11] Aurélien Géron. 2017. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, 1st edition, 574.
[12] Andreas C. Müller and Sarah Guido. 2016. Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, 1st edition, 400.
[13] Scikit-learn. 2020. Regression - linear regression. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.
[14] Scikit-learn. 2020. Regression - gradient boost. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.
[15] Scikit-learn. 2020. Regression - adaboost. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html.
[16] Scikit-learn. 2020. Regression - k-nearest-neighbour. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html.

Povečevanje enakosti (oskrbe duševnega zdravja) s prepričljivo tehnologijo
Increasing Equality (in Mental Health Care) with Persuasive Technology

Tine Kolenik (tine.kolenik@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Matjaž Gams (matjaz.gams@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT
The inability to cope with increasing mental health issues among the populace severely hampers the well-being of both the individual and society. Barriers to access and equality in mental health care, many of which are well known, range from personal stigmas to socio-economic inequality. This offers technology, especially artificial intelligence, the opportunity to try to alleviate the existing situation with unique solutions. Multi- and interdisciplinary research in the field of persuasive technology, which aims to change behavior or mental states without deception and coercion, shows success in improving the well-being of people with mental health issues. This paper presents such systems with a brief overview of the field, with the main contribution being an analysis of potential problems and solutions that persuasive technology offers in the field of mental health care. Persuasive technology seems to be able to complement existing mental health care solutions, thereby reducing unequal access to and inequality in mental health care as well as reducing inequality in general.

KEYWORDS
Digital mental health, persuasive technology, artificial intelligence, mental health care access, equality
1 INTRODUCTION

Mental health problems have been on the rise for decades, and their devastating effect has also been recognized by global decision-makers: the United Nations has placed improvement in this area among its Sustainable Development Goals [42]. Stress, anxiety and depression (SAD) stand out among these problems. In some groups, 74% of people are reported to struggle with acute stress [24], 28% with an anxiety disorder [5] and 48% with depression [36]. What seems even more problematic is that in low- and middle-income countries around 80% of people receive no treatment for their mental health problems, while in high-income countries this figure is around 35% [33]. Mental health problems cause far-reaching and multilayered consequences, felt by patients, their immediate surroundings (family, carers) and wider society [41]. Patients face a lower quality of life, poorer educational outcomes, lower productivity, potential poverty, social problems and additional health problems. Carers face greater emotional and physical challenges, as well as reduced income and increased financial costs. Society faces the loss of several percentage points of GDP and billions of dollars per country every year, together with deteriorating trust in public health institutions and the erosion of social cohesion. All of this leads to an ever stronger positive feedback loop: SAD maintains and reinforces SAD. Mental health problems also too often lead to the loss of human life, as many countries struggle with high suicide rates [8]. The reasons for the rise in SAD symptoms include a severe shortage of mental health professionals and regulations [39] and unequal access to mental health care [9]. Technological and other scientific therapeutic interventions therefore seem capable of helping to improve the current state of the system, especially since individuals with mental health problems prefer therapy to medication [2].

With the advances of the behavioral sciences in human decision-making and related phenomena [34], and with the arrival of digital technologies, artificial intelligence and big data, development has turned toward creating technologies that would help, motivate and guide people to improve themselves and the world. Persuasive technology (PT) is one result of these efforts: technology that "changes attitudes or behaviors or both (without using coercion or deception)" [12, p. 20]. Behavior change is understood as a temporary or permanent effect on an individual's behavior, attitudes and other mental states compared with the past [12]. PT is already being used to support mental health [25, 27], which contributes to equality and enables easier access to health care [37].

The paper is structured as follows: Section 2 reviews the field of PT for mental health support, Section 3 analyzes the problems and solutions that PT offers, and Section 4 gives some concluding thoughts and ideas for future work.
2 FIELD OVERVIEW

This section reviews the field of PT and the field of behavior change. Behavior change is a phenomenon considered to cause a temporary or permanent effect on an individual's behavior compared with how they behaved in the past [12]; it involves not only behavior but also mental states. Behavior change interventions are a large part of PT, which is already widely used in health domains. Existing AI-supported systems monitor people's behavior and their physiological and mental states in order to motivate them and influence their well-being, and they can do all of this in natural language [27].

One of the most frequently used persuasion and behavior change frameworks in such technologies is Cialdini's principles of persuasion (CPP) [6]. Other frameworks exist [25, 27], but for the purposes of this paper only CPP is described. Its main idea is that no general persuasion strategy works on all people; CPP therefore describes several persuasion strategies, since different people are susceptible to different ones. CPP defines seven strategic bases for persuasion: 1) authority, which targets people who are more likely to be motivated by legitimate authority; 2) commitment and consistency, aimed at people who are more likely to commit to something if they have behaved that way before; 3) social proof, which targets people who tend to behave the way others do; 4) liking, which targets people who are more likely to be motivated by someone they like; 5) reciprocity, which targets people inclined to return favors; 6) scarcity, which targets people who consider rare things more valuable; and 7) unity, which influences people who respond to appeals concerning their group identity. Different strategies influence different people, and interactive technology provides a tool for more effectively selecting the strategies that work for particular people.

To select the most effective strategy, PT often relies on personality models, such as the Big Five personality factors [31], and on questionnaires for the individual domains where PT is used (e.g., mental health). Personality is measured along several dimensions (openness, conscientiousness, extraversion, agreeableness, neuroticism) that attempt to describe an individual's tendencies related to psychological characteristics such as mental states and decision-making. Persuasion in the mental health domain is also more successful if the PT has access to data about the individual's mental health; for this purpose, SAD questionnaires [21] can be used to categorize people with SAD symptoms.
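Purely as a toy illustration of this adaptive idea (no such mapping is specified in the paper or in [6], and the scores below are invented), strategy selection from per-user susceptibility estimates can be reduced to a simple arg-max:

```python
# Toy sketch: pick the CPP strategy a user is estimated to be most susceptible to.
# In a real PT system the scores would be derived from personality and SAD
# questionnaires; here they are invented for illustration only.
STRATEGIES = ("authority", "commitment and consistency", "social proof",
              "liking", "reciprocity", "scarcity", "unity")

def pick_strategy(susceptibility):
    """Return the strategy with the highest susceptibility score."""
    return max(STRATEGIES, key=lambda s: susceptibility.get(s, 0.0))

user = {"authority": 0.2, "liking": 0.7, "unity": 0.5}  # invented example scores
print(pick_strategy(user))  # -> "liking"
```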
Persuasion frameworks can be implemented on various technological platforms. A recent review of PT for health and well-being [27] found that the most frequently used platforms are mobile devices (28%), followed by games (17%), web and social networks (14%), other specialized devices (13%), desktop applications (12%), sensors and wearables (9%), and public displays (5%). Several types of applications act as PT in this domain, with intelligent cognitive assistants (ICAs; also known as chatbots or conversational AI) being the most advanced and widespread [4, 18, 26, 27, 30, 37, 44]. ICAs exhibit numerous human-like capabilities: to some extent they can understand context, adapt, learn, communicate, collaborate, predict, perceive, interpret and reason. Most importantly, ICAs can converse in natural language and can therefore be built to offer therapeutic support. The results of various review papers [4, 18, 26, 27, 30, 37] show that ICAs are an effective means of alleviating SAD symptoms. We conducted a short review of papers on state-of-the-art ICAs for mental health and briefly present three of them to illustrate this kind of technology. All three ICAs [11, 14, 43] work similarly, offering help through scripted conversations and basic natural language processing capabilities. The help depends on a user model containing data about the users' emotions and SAD levels. In experiments, all three ICAs prove 15–20% more successful at alleviating SAD than officially recommended self-help material.

Such technology offers numerous advantages in the field of mental health: it can be free of charge, enabling help for socio-economically disadvantaged people; it is available 24 hours a day, 7 days a week, which means patients do not have to wait for their next therapy session; many people with SAD symptoms find it easier to confide in a computer than in a person [10, 22]; the technology is available in remote locations; and so on. Technology can thus reduce the burden on the health care system and its providers and lower the barriers to accessing mental health care in general. It is important to emphasize that the technology works complementarily and does not replace professionals [16, 18, 37]. The advantages of using such technology and its potential problems are discussed in more detail in the next section.
3 ADVANTAGES AND POTENTIAL PROBLEMS

This section discusses the consequences of using PT for mental health in terms of promoting equality and access to mental health care, and it also touches on the consequences in general. The consequences are divided into those that offer potential solutions to existing problems and barriers to equality and access, and those that appear as problems of this technology in achieving equality. At the end of the section, other problems are briefly discussed that are seemingly unrelated to equality but are crucial for PT to reach its potential.

Categories in which PT offers potential solutions:

Cost: The prices of services offered by mental health professionals (from psychotherapists to clinical psychologists and psychiatrists) differ from country to country and depend mainly on state regulations and subsidies. The direct costs for the patient mostly depend on the number of professionals available in a given country. Regardless of their level, costs often hinder access to care for people from lower socio-economic backgrounds [23]. Access to PT for mental health can be free of charge (and often is [11]) because of the much lower costs of producing it. Three main factors contribute to this: 1) scalability, meaning that one PT system can in theory help an unlimited number of people (the only cost of scaling is the server cost, which is marginal compared to human labor), whereas one mental health professional is limited to a certain number of people; 2) the fact that effective PT can be created by many people, mainly thanks to existing research that thoroughly reports on effective systems; and 3) the number of people able to produce such systems is much larger than the number of professionals who can offer psychotherapeutic and similar help.

Availability: The availability problem can be separated into three subcategories: 1) location-based availability, 2) time-based availability, and 3) cost-based availability. Location-based availability refers to people with mental health problems in locations without direct access to mental health professionals (or even without computer access to remote therapy) [15]; using PT for mental health is one of the few potential solutions in such cases. Time-based availability refers to people with mental health problems who need therapeutic help at a time when their chosen professional is not available. PT for mental health is available 24 hours a day, so its use complements the chosen mental health professional. Patients consistently report these needs, and such complementary uses already exist [29].
Cost-based availability refers to people with mental health problems who need therapeutic help but lack the means to access more than the minimum recommended amount of therapy per week [13], estimated at one hour per week. Research [13, 32] shows that more frequent therapy brings better results, and complementary use of PT for mental health can bridge this gap for people who cannot afford more therapy. Cost-based availability is also closely related to the broader cost problem discussed in the previous category.

Stigma: Self-stigma, the prejudices that people with mental health problems hold about themselves because of their problems, and public stigma, the general population's response to people with mental illness, represent one of the main obstacles in the fight against mental health problems [7]. The problem is twofold: because of public stigma, individuals fear what society will think of them if they seek treatment, while because of self-stigma they fear interacting with a professional and doubt that their problems even deserve help. This duality contributes to individuals with mental health problems deciding not to seek treatment from mental health professionals; up to 96% of people with SAD do not seek treatment [35]. Research on PT for mental health, especially on ICAs for treating SAD, has shown that people generally find it easier to confide their problems to a computer or mobile system than to a person [22], because they do not fear being judged and they gain privacy for disclosing their feelings and thoughts in general. This means that the number of people who avoid contact with professionals can be reduced by introducing therapeutic options that patients consider safer and free of stigma.

However, such technology also brings potential problems that must be highlighted and treated seriously for PT to reach the potential it has in the field of mental health:

Exclusion of vulnerable groups: Technology-oriented mental health care solutions can lead to the exclusion of some vulnerable groups, among them the elderly, the lowest socio-economic class, and culture-specific groups. The group most affected by the introduction of technology appears to be the elderly [1]; their lower ability to integrate technology into everyday life can deepen the divide between them and other generational groups. Another group that may be excluded from the benefits of PT for mental health are people from the lowest socio-economic class, to whom PT may not be available at all [28]; deepening the already large differences would push this group into even more catastrophic socio-economic living conditions. Groups affected in technology adoption by cultural differences are crucial when considering progress toward equality: research shows that cultures with less modern socio-political leanings show a lower tendency to adopt technology [19]. Nevertheless, a greater presence of PT research also seems to be emerging in some low-income countries [40].

Research bias: Because there are no evaluation standards for PT for mental health, the research field is more susceptible to research bias. There are many possible problems: 1) PT systems claimed to be successful are not always studied in empirical experiments (e.g., randomized controlled trials) but in quasi-experiments [43], or not at all; 2) the metric by which such systems should be evaluated is unclear (it is usually derived indirectly from their effectiveness in a study whose goal is alleviating SAD symptoms [37]); and 3) there is no consensus on which data are needed for a system to understand the user and thus offer effective help, so the choice of data types so far depends more on researchers' assumptions than on existing findings.

The use of PT for mental health also involves problems that do not relate only to achieving equality and access. Although they are extremely important, their in-depth analysis is beyond the scope of this paper. We nevertheless mention a few: 1) the problem of personal data protection [3]; 2) the lack of longitudinal research on behavior change with PT [20]; 3) the ethics of using personal data for persuasion [17]; and 4) the potential problem of automation and the loss of jobs of mental health professionals. There are certainly other problems and concerns, but we wanted to keep this list short and to show with it that other problems with PT exist and that we are aware of them.
4 CONCLUSION AND FUTURE WORK

This paper explores how persuasive technology, which tries to influence people's behavior without coercion, can increase equality in and access to mental health care, and thereby strengthen equality in general. Focusing further on stress, anxiety and depression, it examines why mental health is a considerable barrier to equality and why people with mental health problems face obstacles in accessing health care, and it then presents arguments for using persuasive technology in this domain. This is followed by a presentation of persuasive technology in its multi- and interdisciplinary composition of behavioral science, computer science and artificial intelligence. Examples of persuasive technology for mental health implemented in intelligent cognitive assistants are presented, including their effectiveness in alleviating symptoms of stress, anxiety and depression. Finally, the paper explores the potential solutions such technology offers in the field of mental health care and the potential problems it could create. Future work includes further research into these problems and solutions, a deeper study of the technical design of such technologies, especially those using artificial intelligence, and the provision of new conceptual and technical guidelines for PT for mental health in reducing inequality in mental health care and inequality in general.

ACKNOWLEDGMENTS

This work was carried out within the young researcher program funded by the Slovenian Research Agency from the state budget.

REFERENCES
[1] I. Amaral and F. Daniel, 2016. Ageism and IT: social representations, exclusion and citizenship in the digital age. Lecture Notes in Computer Science 9755 (2016), 159–166.
[2] M. C. Angermeyer and H. Matschinger, 1996. The effect of personal experience with mental illness on the attitude towards individuals suffering from mental disorders. Social Psychiatry and Psychiatric Epidemiology: The International Journal for Research in Social and Genetic Epidemiology and Mental Health Services 31, 6 (1996), 321–326.
[3] S. Avancha, A. Baxi and D. Kotz, 2012. Privacy in mobile technology for personal healthcare. ACM Computing Surveys 45, 1 (2012).
[4] D. Bakker, N. Kazantzis, D. Rickwood and N. Rickard, 2016. Mental Health Smartphone Apps: Review and Evidence-Based Recommendations for Future Developments. JMIR Mental Health 3, 1 (2016).
[5] A. Baxter, J. M. Scott, T. Vos and H. Whiteford, 2013. Global prevalence of anxiety disorders: a systematic review and meta-regression. Psychological Medicine 43 (2013), 897–910.
[6] R. Cialdini. 2016. Pre-Suasion: A Revolutionary Way to Influence and Persuade. Simon & Schuster, New York, NY.
[7] P. W. Corrigan and A. C. Watson, 2002. Understanding the impact of stigma on people with mental illness. World Psychiatry: official journal of the World Psychiatric Association (WPA) 1, 1 (2002), 16–20.
[8] S. C. Curtin, M. Warner and H. Hedegaard. 2016. Increase in suicide in the United States, 1999-2014. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Health Statistics, Hyattsville, MD.
[9] European Commission. 2018. Inequalities in access to healthcare - A study of national policies. https://ec.europa.eu/social/main.jsp?catId=738&langId=en&pubId=8152
[10] A. Fadhil and G. Schiavo, 2019. Designing for Health Chatbots. arXiv (2019). https://arxiv.org/abs/1902.09022
[11] K. K. Fitzpatrick, A. Darcy and M. Vierhile, 2017. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Mental Health 4, 2 (2017).
[12] B. J. Fogg. 2002. Persuasive technology. MK, Burlington, MA.
[13] N. Freedman et al., 1999. The Effectiveness of Psychoanalytic Psychotherapy: the Role of Treatment Duration, Frequency of Sessions, and the Therapeutic Relationship. Journal of the American Psychoanalytic Association 47, 3 (1999), 741–772.
[14] R. Fulmer et al., 2018. Using Psychological Artificial Intelligence (Tess) to Relieve Symptoms of Depression and Anxiety: Randomized Controlled Trial. JMIR Mental Health 5, 4 (2018).
[15] K. Gibson et al., 2009. Clinicians' attitudes toward the use of information and communication technologies for mental health services in remote and rural areas. Canadian Society of Telehealth Conference, Vancouver, October 3–6 (2009).
[16] C. M. Kennedy, J. Powell, T. H. Payne, J. Ainsworth and I. Buchan, 2012. Active Assistance Technology for Health-Related Behavior Change: An Interdisciplinary Review. Journal of Medical Internet Research 14, 3 (2012).
[17] D. B. Klein, 2004. Statist Quo Bias. Econ. Jour. Watch 1 (2004), 260–71.
[18] L. Laranjo et al., 2018. Conversational agents in healthcare: a systematic review. Journal of the American Medical Informatics Association 25, 9 (2018), 1248–1258.
[19] S. G. Lee, S. Trimi and C. Kim, 2013. The impact of cultural differences on technology adoption. Journal of World Business 48, 1 (2013), 20–29.
[20] S. S. Lee, Y. K. Lim and K. P. Lee, 2011. A long-term study of user experience towards interaction designs that support behavior change. In CHI'11 Extended Abstracts on Human Factors in Computing Systems, ACM, New York, NY, 2065–2070.
[21] S. H. Lovibond and P. F. Lovibond. 1996. Manual for the depression anxiety stress scales. Psychology Foundation of Australia, Sydney.
[22] G. M. Lucas, J. Gratch, A. King and L. P. Morency, 2014. It's only a computer: Virtual humans increase willingness to disclose. Computers in Human Behavior 37 (2014), 94–100.
[23] P. McCrone et al., 2004. Cost-effectiveness of computerised cognitive-behavioural therapy for anxiety and depression in primary care: Randomised controlled trial. British Journal of Psychiatry 185, 1 (2004), 55–62.
[24] Mental Health Foundation. 2018. Stress: Are we coping? Mental Health Foundation, London.
[25] D. C. Mohr et al., 2013. Behavioral intervention technologies: evidence review and recommendations for future research in mental health. General Hospital Psychiatry 35, 4 (2013).
[26] J. L. Z. Montenegro, C. A. da Costa and R. da Rosa Righi, 2019. Survey of conversational agents in health. Expert Systems with Applications 129 (2019), 56–67.
[27] R. Orji and K. Moffatt, 2016. Persuasive technology for health and wellness: State-of-the-art and emerging trends. Health Informatics Journal 24, 1 (2016), 66–91.
[28] M. Pigato. 2001. Information and communication technology, poverty, and development in sub-Saharan Africa and South Asia (English). Africa Region working paper series, no. 20. The World Bank, Washington, D.C.
[29] M. Price et al., 2013. mHealth: A Mechanism to Deliver More Accessible, More Effective Mental Health Care. Clinical Psychology & Psychotherapy 21 (2013), 427–436.
[30] S. Provoost, H. M. Lau, J. Ruwaard and H. Riper, 2017. Embodied Conversational Agents in Clinical Psychology: A Scoping Review. Journal of Medical Internet Research 19, 5 (2017).
[31] B. Rammstedt and O. P. John, 2007. Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality 41, 1 (2007), 203–212.
[32] R. Sandell et al., 2000. Varieties of long-term outcome among patients in psychoanalysis and long-term psychotherapy: a review of findings in the Stockholm Outcome of Psychoanalysis and Psychotherapy Project (STOPP). The International Journal of Psychoanalysis 81 (2000), 921–942.
[33] A. Schmidtke et al., 1996. Attempted suicide in Europe: rates, trends and sociodemographic characteristics of suicide attempters during the period 1989–1992. Acta Psychiatrica Scandinavica 93 (1996), 327–38.
[34] R. H. Thaler and C. R. Sunstein. 2008. Nudge: improving decisions using the architecture of choice. Yale University Press, New Haven, CT.
[35] G. Thornicroft et al., 2017. Undertreatment of people with major depressive disorder in 21 countries. British Journal of Psychiatry 210, 2 (2017), 119–124.
[36] J. M. Twenge, 2014. Time Period and Birth Cohort Differences in Depressive Symptoms in the U.S., 1982–2013. Social Indicators Research 121, 2 (2014), 437–454.
[37] A. N. Vaidyam et al., 2019. Chatbots and Conversational Agents in Mental Health: A Review of the Psychiatric Landscape. Canadian Journal of Psychiatry 64, 7 (2019).
[38] P. S. Wang et al., 2007. Use of mental health services for anxiety, mood, and substance disorders in 17 countries in the WHO world mental health surveys. The Lancet 370, 9590 (2007), 841–50.
[39] P. Winkler et al., 2017. A blind spot on the global mental health map: a scoping review of 25 years development of mental health care for people with severe mental illnesses in central and eastern Europe. The Lancet Psychiatry 4, 8 (2017), 634–642.
[40] H. Winschiers-Theophilus et al., 2018. Proceedings of the Second African Conference for Human Computer Interaction: Thriving Communities. Association for Computing Machinery, New York, NY.
[41] World Health Organization. 2003. Investing in Mental Health. https://apps.who.int/iris/handle/10665/42823
[42] World Health Organization (WHO). 2013. Mental Health Action Plan 2013-2020. Geneva, Switzerland.
[43] A. Yorita et al., 2018. A Robot Assisted Stress Management Framework: Using Conversation to Measure Occupational Stress. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC).
[44] M. Mlakar, A. Tavčar, G. Grasselli and M. Gams. 2018. Asistent za stres. http://poluks.ijs.si:12345/.

Analiza glasu kot diagnostična metoda za odkrivanje Parkinsonove bolezni
Speech Analysis as a Diagnostic Method for the Detection of Parkinson's Disease

Andraž Levstek (levstek.andraz@gmail.com) and Darja Silan (darja.silan@gjp.si), Gimnazija Jožeta Plečnika, Šubičeva ulica 1, Ljubljana, Slovenija
Aljoša Vodopija (aljosa.vodopija@ijs.si), Institut "Jožef Stefan", Jamova cesta 39, Ljubljana, Slovenija

ABSTRACT
Parkinson's disease is a neurodegenerative disorder that causes impaired muscle function because of a lack of dopamine in the brain stem. Parkinson's disease also affects speech ability: the voice becomes monotone, hoarse and feeble. For this reason, one of the emerging ways to diagnose Parkinson's disease is speech analysis using artificial intelligence. In this paper, we use machine learning to connect voice samples to the presence of Parkinson's disease. To improve the classification accuracy, we additionally use a dimensionality reduction approach. The most accurate classifier was built with random forest, with an accuracy of 73%. The experimental results indicate a correlation between the voice changes and the presence of Parkinson's disease. Additionally, we estimate the importance of individual voice samples and the corresponding features. The results can be used to improve the current methodology by proposing additional voice samples that contain more information on the presence of Parkinson's disease.

KEYWORDS
Parkinson's disease, speech analysis, machine learning, random forest, feature importance
1 INTRODUCTION

Parkinson's disease is a neurodegenerative and debilitating condition that affects the central nervous system. The disease affects approximately 1% of people older than 60. A patient with Parkinson's disease often trembles, has problems with walking and balance, their movement becomes slow, and rigidity appears. Mental disorders may also occur, such as anxiety, depression, and problems with sleep, thinking and behavior.

Parkinson's disease also affects the voice. Most patients have speech problems, such as a weak, breathy, hoarse, higher-pitched and monotone voice. Typical signs are hoarseness, reduced voice volume, difficulty with the correct articulation of phonemes, and mumbling [5].

No diagnostic method is yet known that would prove the presence of Parkinson's disease with complete certainty. Diagnosis is based on visible and recognizable symptoms, past medical condition, a physical and neurological examination, and the patient's anamnesis [13]. According to the criteria, akinesia and at least one other characteristic (e.g., resting hand tremor, rigidity or postural disorders) must be present to establish Parkinson's disease. Using these criteria, Parkinson's disease can be identified with 90% accuracy, but diagnosis can take several years [12]. Brain imaging with magnetic resonance, positron emission tomography and computed tomography is also used in diagnostics. All of these diagnostic methods are expensive and demanding, so cheaper and simpler methods are being sought [13].

For diagnostic purposes, the analysis of voice recordings with artificial intelligence methods (e.g., machine learning, signal processing, etc.) is increasingly used. This kind of diagnostics is completely safe, simple and fast, and it requires no expensive dedicated devices [8]; for Parkinson's disease, however, the field is still developing. Most researchers deal only with achieving the highest possible classification accuracy [1, 7, 10, 11], thereby neglecting an important aspect of the analysis: identifying the important recordings and the corresponding voice features. Such findings would contribute to a better understanding of the problem and enable the design of more precise tests.

In this paper, we report on testing the usefulness of voice analysis with machine learning methods for diagnosing Parkinson's disease. The study is based on the voice recordings of 40 persons (20 patients with Parkinson's disease) obtained in the study [10]. We tested five different machine learning algorithms on these data. To improve the results, we additionally used a dimensionality reduction method and improved the classification accuracy by about 5%.
Unlike most related work, we also estimated the importance of individual recordings and the corresponding features. For this purpose, we used the random forest method, since it achieves the highest accuracy. In this way we can determine which features and recordings contain more information on the presence of Parkinson's disease.

The paper is organized as follows. Section 2 presents the data and Section 3 describes the methodology. Sections 4 and 5 present the results and the findings. The last section concludes and outlines further work.

2 DATA

The data were collected at the Istanbul Faculty of Medicine (Istanbul University) in 2014 [10]. Voice recordings of 40 people were collected: 6 women and 14 men with Parkinson's disease, and 10 healthy women and 10 healthy men. Each person made 26 recordings, comprising vowels, short sentences and words. More precisely, recordings 1–3 are the sustained vowels "a", "o" and "u", recordings 4–13 are the numbers from 1 to 10, recordings 14–17 are short sentences, and recordings 18–26 are words. All recordings are in Turkish and were made with a Trust MC-1500 microphone (https://www.trust.com/en/product/14896-design-microphone-mc-1500).

Each person is thus associated with 26 voice recordings, and each recording with 26 linear and frequency features, computed with the Praat acoustic analysis software [2]. All features are numerical and are commonly computed for voice analysis [2, 10]; they are summarized in Table 1. In total, the dataset contains 676 features and the target class. The latter is binary and represents the presence (positive = 1) or absence (negative = 0) of Parkinson's disease.

Table 1: Voice features used for machine learning: frequency, pulse, amplitude, voicing and harmonicity features.

Group        Features
Frequency    Jitter (local); Jitter (local, absolute); Jitter (rap); Jitter (ppq5); Jitter (ddp)
Pulse        Number of voice pulses; Number of periods; Mean period; Standard deviation of periods
Amplitude    Shimmer (local); Shimmer (local, dB); Shimmer (apq3); Shimmer (apq5); Shimmer (apq11); Shimmer (dda)
Voicing      Fraction of unvoiced frames; Number of voice breaks; Degree of voice breaks
Harmonicity  Median pitch; Mean pitch; Standard deviation of pitch; Maximum pitch; Minimum pitch; Pitch autocorrelation; Noise-to-harmonics ratio; Harmonics-to-noise ratio

3 METHODOLOGY

We built classifiers with five machine learning algorithms: decision tree (C4.5), naive Bayes (NB), k-nearest neighbors (kNN), support vector machine (SVM) and random forest (RF). For all of them we used the default parameter values, since tuning did not significantly improve the classification accuracy.

The number of features greatly exceeds the number of instances, so we decided to use a dimensionality reduction method, which improved the classification accuracy by 5%. For feature selection we used the widely known method called recursive feature elimination (RFE) [4], which is based on backward elimination of unimportant features. RFE is a wrapper method and was used in combination with the machine learning algorithms listed above. The final number of features, which is a parameter of RFE, was estimated with 10-fold cross-validation. For machine learning we used the caret library [6], implemented in the R programming language [9].
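The paper uses the RFE implementation from R's caret; a rough Python analogue with scikit-learn's RFECV, which likewise picks the final number of features by cross-validation, might look as follows (placeholder data, not the study's pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# X: 40 instances x 676 voice features; y: Parkinson's (1) / healthy (0)
X, y = np.random.randn(40, 676), np.random.randint(0, 2, 40)  # placeholders

rfe = RFECV(
    estimator=RandomForestClassifier(random_state=0),
    step=10,              # drop the 10 least important features per iteration
    cv=10,                # 10-fold cross-validation picks the feature count
    scoring="accuracy",
)
rfe.fit(X, y)
print("selected features:", rfe.n_features_)
mask = rfe.support_       # boolean mask of the retained features
```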
4 RESULTS

For the evaluation and selection of the best algorithm, we used the leave-one-subject-out (LOSO) approach. First, on the training set, we estimated the final number of features, which is a parameter of the RFE method, with 10-fold cross-validation. Then, with the tuned RFE method, we selected the best features and the corresponding classifier. With the latter we classified the left-out instance, and we repeated the described procedure for all instances.

Table 2 shows the results in terms of average accuracy, average sensitivity and average specificity. The most accurate classifier is the one built with RF, and the least accurate the one built with kNN. The highest sensitivity was achieved by the classifier built with NB, and the highest specificity by the classifier built with RF. Table 3 shows the confusion matrix for the classifier built with RF: it classified 29 instances correctly and erred in 11 cases.

Table 2: Classifier results in terms of accuracy, sensitivity and specificity. The highest value of each metric is shown in bold.

Algorithm   Accuracy   Sensitivity   Specificity
C4.5        0.63       0.65          0.60
NB          0.63       0.80          0.45
kNN         0.48       0.55          0.40
SVM         0.68       0.70          0.65
RF          0.73       0.75          0.70

Table 3: Confusion matrix for the classifier built with RF.

Predicted / True   Negative (0)   Positive (1)
Negative (0)       14             5
Positive (1)       6              15

We were also interested in the importance of individual recordings and the corresponding features. For this purpose, we repeated the feature selection procedure with RF, but this time on the full data, without leaving out any instances. The importance of the selected recordings and features was computed with a procedure called permutation importance, which can be directly included in the RF method [3]. For each tree, the accuracy is computed on the left-out instances (randomly excluded when building that tree); the computation is then repeated after permuting a given feature, and the importance of that feature is the average difference in accuracy before and after its permutation. Note that correlated features are not a problem for RF here, since the procedure is applied to each individual tree, which is uncorrelated by construction.
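The out-of-bag permutation importance described above follows Breiman [3]. As a close analogue, scikit-learn's permutation_importance applies the same shuffle-and-measure idea on held-out data rather than per-tree out-of-bag samples; reusing X and y from the previous sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Mean accuracy drop after shuffling each feature, averaged over repeats
result = permutation_importance(rf, X_te, y_te, n_repeats=30,
                                scoring="accuracy", random_state=0)
ranking = result.importances_mean.argsort()[::-1]   # most important first
```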
In this way we selected 27 of the 676 attributes. Among them, frequency attributes appear most often (Figure 1), while the other attribute groups are similarly represented. Among the recordings, numbers appear most often, followed by short sentences. Sustained vowels are represented worst (Figure 2).

Figure 1: Number of selected attributes per group after applying RFE in combination with RF.

Figure 2: Number of selected recordings per group after applying RFE in combination with RF.

Figures 3 and 4 show, respectively, the importance of the selected attributes (aggregated over recordings) and the importance of the recordings (aggregated over attributes) for the RF method. The attributes and recordings are ordered from less to more important. The results show that the frequency attributes are the most important for RF, while the harmonicity attributes and the attributes derived from voice pitch are the least important. The most important recording is the number "4". We observe that numbers and short sentences carry more information than the other recordings.

Figure 3: Importance of the selected attributes for the classifier built with RF. The importance of each group is the aggregated importance of the corresponding attributes.

Figure 4: Importance of the selected recordings for the classifier built with RF. The importance of each group is the aggregated importance of the corresponding recordings.

5 DISCUSSION

Similarly to related studies [1, 7, 10, 11], our results indicate a connection between voice attributes and the presence of Parkinson's disease. The most accurate classifier is built with RF, reaching an accuracy of 73 %; for comparison, some related works report accuracies of around 80 %. The frequency attributes are the most important and the most frequently selected (Figures 1 and 3); we attribute this to the characteristic deviation of voice frequency in Parkinson's disease. Among the recordings, numbers and short sentences stand out (Figures 2 and 4): demanding and longer recordings tell us more about the presence of the disease.

Nevertheless, such a way of diagnosing is insufficient. The most accurate method misses 25 % of the patients, which is unacceptable for medical practice [13]. We must stress, however, that we were working with a limited number of instances (only 40 people were recorded). If more voice recordings of diseased and healthy people were collected, the classifier could be improved with more advanced machine learning methods, which could not be used on such a small number of instances.

It may never be possible to determine the presence of Parkinson's disease from voice analysis with machine learning methods with full certainty, but such methods could be used either as a complement that upgrades existing methods or as a screening test. We emphasise that voice analysis is a cheap examination that is completely unobtrusive and safe for the patient.

6 CONCLUSION

In this paper we used machine learning methods to compare voice recordings of healthy people and of patients with Parkinson's disease. The purpose of the study was to check whether the presence of Parkinson's disease can be inferred from voice analysis and whether a classifier usable in practice can be built. Additionally, we assessed the importance of individual recordings and of the corresponding voice attributes.

The results indicate that patients with Parkinson's disease suffer a deterioration of voice articulation, since the classifier built with the random forest method successfully detected 73 % of the patients. Nevertheless, the classifier is not yet suitable for use in practice, since its accuracy is too low. The current classifier can be used as a test complementary to the existing ones. The most important recordings turn out to be numbers and short sentences, while sustained vowels and words are the least important. Among the attributes, the frequency and amplitude attributes stand out.

We are currently exploring the possibility of collecting more related voice recordings. This would allow us to use more complex methods, capable of discovering intricate regularities that could not be discovered on such a small set of instances. Our long-term goal is to build a classifier that would successfully identify most patients, even at the cost of somewhat lower accuracy (some healthy people would be classified as diseased). Such a classifier could be used as a screening test and would thus ease the current diagnostics of Parkinson's disease.
We will also try to determine why precisely the recordings of numbers contained more information about the presence of Parkinson's disease, and use the knowledge gained to propose a more comprehensive set of expressions, words and phonemes.

ACKNOWLEDGMENTS

The authors thank Mrs. Irena Hočevar Boltežar for the explanation of the voice attributes and for the Slovenian translations. A. Vodopija additionally acknowledges the financial support of the Slovenian Research Agency (young researcher training programme).

REFERENCES
[1] I. Bhattacharya and M. P. S. Bhatia. 2010. SVM classification to distinguish parkinson disease patients. In Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in India. ACM, New York, NY, USA, 1–6. doi: 10.1145/1858378.1858392.
[2] P. Boersma. 2001. Praat, a system for doing phonetics by computer. Glot International, 5, 9/10, 341–345.
[3] L. Breiman. 2001. Random forests. Machine Learning, 45, 1, 5–32. doi: 10.1023/A:1010933404324.
[4] I. Guyon, J. Weston, S. Barnhill and V. Vapnik. 2002. Gene selection for cancer classification using support vector machines. Machine Learning, 46, 1, 389–422. doi: 10.1023/A:1012487302797.
[5] I. Hočevar Boltežar. 2013. Fiziologija in patologija glasu ter izbrana poglavja iz patologije govora. Pedagoška fakulteta. http://www.biblos.si/lib/book/9789612531416.
[6] M. Kuhn. 2008. Building predictive models in R using the caret package. Journal of Statistical Software, Articles, 28, 5, 1–26. doi: 10.18637/jss.v028.i05.
[7] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman and L. O. Ramig. 2009. Suitability of dysphonia measurements for telemonitoring of parkinson's disease. IEEE Transactions on Biomedical Engineering, 56, 4, 1015–1022. doi: 10.1109/TBME.2008.2005954.
[8] M. A. Little, P. E. McSharry, S. Roberts, D. Costello and I. Moroz. 2007. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Nature Precedings. doi: 10.1038/npre.2007.326.1.
[9] R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. http://www.R-project.org/.
[10] B. E. Sakar, M. E. Isenkul, C. O. Sakar, A. Sertbas, F. Gurgen, S. Delil, H. Apaydin and O. Kursun. 2013. Collection and analysis of a parkinson speech dataset with multiple types of sound recordings. IEEE Journal of Biomedical and Health Informatics, 17, 4, 828–834. doi: 10.1109/JBHI.2013.2245674.
[11] C. O. Sakar and O. Kursun. 2010. Telediagnosis of parkinson's disease using measurements of dysphonia. Journal of Medical Systems, 34, 4, 591–599. doi: 10.1007/s10916-009-9272-y.
[12] C. Silva. 2018. Speech analysis may help diagnose parkinson's and at earlier stage, study says. Parkinson's News Today. (2018). https://parkinsonsnewstoday.com/2018/02/05/speech-analysis-can-help-detect-parkinsons-in-early-stages-study-says/.
[13] E. Tolosa, G. Wenning and W. Poewe. 2006. The diagnosis of parkinson's disease. The Lancet Neurology, 5, 1, 75–86. doi: 10.1016/S1474-4422(05)70285-4.
STRAW Application for Collecting Context Data and Ecological Momentary Assessment

Junoš Lukan (Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, junos.lukan@ijs.si), Marko Katrašnik (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, marko.katrasnik@gmail.com), Larissa Bolliger (Department of Public Health, Ghent University, Ghent, Belgium, larissa.bolliger@ugent.be), Els Clays (Department of Public Health, Ghent University, Ghent, Belgium, els.clays@ugent.be), Mitja Luštrek (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, mitja.lustrek@ijs.si)

ABSTRACT
To study stress at the workplace and relate it to user context and self-reports, we developed an application based on the AWARE framework, a mobile instrumentation toolkit. The application serves two purposes: passively collecting data about the user's environment, and offering questionnaires as a means of ecological momentary assessment. We implemented methods to import the questionnaires into the phone's database and trigger them at the right times. We also considered the privacy implications of collecting such data and took additional measures to conceal the identity of our study's participants wherever we evaluated it was at risk of exposure. Finally, we had to establish a server application to handle the receipt and storage of the collected data, and implemented a rudimentary login process to additionally secure our servers.

KEYWORDS
context detection, application development, privacy, ecological momentary assessment

1 APPLICATION OVERVIEW
The best machine learning models for stress detection and affect recognition are multimodal [1, 17]. Combining data from different modalities is especially effective, such as using physiological, behavioural or contextual, and psychological (self-reported) data. Collecting such data in a real-world setting presents a challenge, however.

In the project called Stress at work (STRAW), the main objective is to analyse the relationship between psychosocial stress experiences in the workplace, work activities and events, and peripheral physiology. To facilitate the integration of various data sources, an application was designed to run continuously and monitor the users' environment and specific phone-related events.

The application's purpose is two-fold. The primary mode of operation is silent and continuous: the user context (such as their phone use and location) is monitored without user intervention or interaction. The second mode of operation are prompts or questions for the user, where some information about the context and the participant's mental state is gathered by asking for it explicitly.

As a starting point for writing the STRAW application, we used AWARE, a mobile instrumentation toolkit whose initial purpose was inferring users' context [5]. It enables logging of data as reported by the phone's operating system and a wide variety of hardware sensors. At several points, this toolkit was adapted to better suit our needs, and additional capabilities were added on top of it.

We also developed two modular functionalities of the application: Bluetooth integration with an Empatica E4 wristband [23], to enable simultaneous collection of physiological data, and voice detection and speaker diarization capabilities [15]. We have already reported on these developments elsewhere, whereas in this paper we give an overview of the app's capabilities.

1.1 Data Types
An important aspect of the STRAW application are prompts, called EMAs. The users can be prompted to make a diary entry at a specific time, which is called the Experience Sampling Method [ESM; 3] or, more broadly (when data other than experience are noted), Ecological Momentary Assessment [EMA; 20]. Diary methods increase the reliability of collected self-reports, as they are less prone to recall bias [14].
EMAs are the main mode of user interaction in the STRAW application. The content of the specific questions is beyond the scope of this paper, but in general, the questions are based on existing psychological questionnaires measuring stressors, stress, and related responses. The implementation of EMAs is described in Section 2.

In addition to this, we selected a subset of data that might help us determine users' context. Below is a list of the sensors used in the STRAW application, together with a description of the data they collect. Data availability from some of these sensors depends on the phone's hardware and the version of the operating system.

– Acceleration: There are several sources (i.e. virtual sensors) of acceleration data in a smartphone. Accelerometers measure acceleration magnitude in various directions and report either linear acceleration (without gravity effects), gravity, or combined acceleration. This is used further in Google's activity recognition API [10].
– Barometer: Ambient air pressure.
– Light: Luminance of the ambient light captured by the light sensor.
– Temperature: Temperature of the phone's hardware sensor.
– Bluetooth: This sensor logs surrounding Bluetooth-enabled and visible devices, specifically their hashed MAC addresses and the received signal strength indicator (RSSI) in decibels.
– Location: The device's current location (latitude, longitude, and altitude, which are masked as described in Section 3) and its velocity (speed and bearing). This uses various methods, such as GPS and known Wi-Fis in the vicinity, resulting in different degrees of accuracy. The location category is also acquired with the Foursquare API.
– Network: Network availability (e.g. none or aeroplane mode, Wi-Fi, Bluetooth, GPS, mobile) and traffic data (received and sent packets and bytes over either Wi-Fi or mobile data).
– Proximity: Uses the sensor by the device's display to detect nearby objects. It can either be a binary indicator of an object's presence or the distance to the object.
– Timezone: The device's current time zone.
– Wi-Fi: Logs of surrounding Wi-Fi access points, specifically their hashed MAC addresses, the received signal strength indicator (RSSI) in decibels, security protocols, and band frequency. The information on the currently connected access point is also included.
– Applications: This includes the category of the application currently in use (i.e. running in the foreground) and data related to notifications that any application sends. The notification header text (but not the content), the category of the application that triggered the notification, and the delivery modes (such as sound, vibration and LED light) are logged.
– Battery: Battery information, such as the current battery percentage level, voltage, temperature and health, as well as power-related events, such as charging and discharging times, are monitored.
– Communication: Information about calls and messages sent or received by the user. This includes the call or message type (i.e. incoming, outgoing, or missed), the length of the call session, and a trace, a SHA-1 hash of the phone number that was contacted. The phone numbers themselves and the contents of messages and calls are not logged.
– Processor: Processor load in CPU ticks and the percentage of load dedicated to user and system processes or idle load.
– Screen: Screen status: turned on or off, and locked or unlocked.
– Voice activity: A classifier trained using Weka [7]. The features are calculated using openSMILE [4], and the output is an indicator of human voice activity [15].

The data described in the list above are collected automatically and continuously. The application runs as a foreground service, which means that data collection continues even while the application is not actively used (i.e. it is minimized). Despite this, there exists software, specific to the operating system version and phone manufacturer, which tries to close applications for energy efficiency. We attempted to whitelist this application in the most common battery-saving software.

2 ECOLOGICAL MOMENTARY ASSESSMENT
As mentioned, one of the main functions of the STRAW application is to collect users' answers to questionnaires. AWARE already implements a 'sensor' for the experience sampling method, which shows DialogFragments such as the one in Figure 1, but it was too rudimentary for our study protocol. The main upgrades we had to make were the mechanism of triggering EMAs and the management of the database of available questions (items) to include in the questionnaires.

Figure 1: An example of an ecological momentary assessment prompt.

2.1 EMA Triggering
Originally, AWARE provides a couple of ways to trigger EMAs: at a specific time, by a certain context (i.e. taking into account values from other sensors), or on demand (manually). In our study, time is the most important trigger of EMAs, but we needed finer control.

The EMAs in our studies are divided into three types: a) morning EMAs with questions about sleep quality, b) work-hour EMAs with questions about momentary affect, job characteristics, work activities, and similar, and c) evening EMAs with questions about the whole workday and after-work activities. The first EMA is triggered in the first hour after the start of the workday as set by the user. The rest of the EMAs during work hours trigger approximately every 90 minutes, but not closer than 30 minutes apart. The time depends on the last answered EMA rather than being set in advance, and additional reminders are scheduled in the case of user inactivity. The final EMA of the day is triggered in the evening at a time set by the user.

Each of these types of EMA is implemented as a separate IntentService [11] and handled by a JobScheduler [18]. This enabled us to enforce the requirements outlined above, such as setting the minimum latency with which a job can start and making use of periodic jobs.
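As a rough, platform-neutral illustration of these timing rules (the actual application implements them with Android's JobScheduler; this Python sketch and its jitter range are our own assumptions), the work-hour scheduling logic could look like this:

```python
# Hedged sketch of the work-hour EMA timing rules described above.
# The real app uses Android's JobScheduler; the jitter range and the
# function names here are illustrative assumptions.
import random
from datetime import datetime, timedelta

TARGET_INTERVAL = timedelta(minutes=90)  # approximate spacing of work-hour EMAs
MIN_GAP = timedelta(minutes=30)          # EMAs must be at least this far apart

def first_ema(workday_start: datetime) -> datetime:
    """The first EMA fires at a random point in the first hour of the workday."""
    return workday_start + timedelta(minutes=random.randint(0, 60))

def next_work_ema(last_answered: datetime) -> datetime:
    """Subsequent work-hour EMAs are anchored to the last *answered* EMA."""
    jitter = timedelta(minutes=random.randint(-15, 15))
    candidate = last_answered + TARGET_INTERVAL + jitter
    # Never schedule closer than MIN_GAP to the previous answered EMA.
    return max(candidate, last_answered + MIN_GAP)
```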
2.2 Question Database
In the original AWARE implementation, questions are queued into a questionnaire directly in the code of the application by using its custom ESMFactory class. For our study, we use a pool of more than 200 questions per language, from which a subset is sampled for every EMA. We therefore needed a more systematic way of storing them within the application.

To ease the insertion of individual items, we prepared a spreadsheet template which is meant to be human-readable and filled out manually. Individual items from this spreadsheet are later converted into JavaScript Object Notation (JSON) and stored in an SQLite database [13] in the phone's internal storage. This implementation enabled us to adapt the content of EMAs without touching the source code of the application. It also simplified the final selection of questions, such as selecting one language (English, Dutch, or Slovenian) and grammatical gender.

3 PRIVACY ENHANCEMENTS
The data collected by the STRAW application carry different degrees of risk to the users' privacy. Their privacy would be threatened if an outsider gained unauthorized access to the data; these possible external threats are considered in Section 5. Even when the data are safely communicated and stored, however, an involuntary exposure of users' identity might still be possible. Assuming the data are well protected from unauthorized external access, these risks will in turn be treated as internal in this section.

Some of the data collected by the STRAW application are personal data, so even when storing them securely and after pseudonymization, some risk of a privacy breach remains. Since AWARE is widely used in scientific studies, it already implements some privacy-enhancing mechanisms. We performed a thorough application vulnerability analysis and identified several further threats to privacy that we wished to address. The types of data that deserve special attention are applications, communication, location, and voice activity.

As mentioned in Section 1, the notifications that other applications send are monitored in the STRAW application. The content of a notification, such as that of an instant messaging application or a calendar notification, is never actually stored. We deemed even the application names to be sensitive, so we chose to save only application categories. This process is further described in Section 4.

The content of calls or messages is never logged, but the phone numbers tied to them can be. Since we wanted to keep track of recurring contact with the same person, but not reveal their real phone number, we decided to hash the numbers using the SHA-1 algorithm. While it would be possible to recover a phone number by a brute-force attack, the AWARE implementation offers the option of adding a salt. Thus, by using the username (further described in Section 5) as a salt, the phone numbers are sufficiently protected from inadvertent disclosure risk, while the hashed value is retained even across different application installations. The MAC addresses of detected Wi-Fi and Bluetooth devices are hashed in the same way.
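A minimal sketch of this salted hashing (in Python for brevity; the application itself does this on the phone, and the exact salt ordering is our own assumption):

```python
# Hedged sketch: salted SHA-1 hashing of a phone number, as described above.
# Using the study username as the salt makes hashes stable across reinstalls
# but different between participants, which hinders brute-force lookups.
# The salt-before-number ordering is an assumption, not the app's exact code.
import hashlib

def hash_phone_number(phone_number: str, username: str) -> str:
    salted = (username + phone_number).encode("utf-8")
    return hashlib.sha1(salted).hexdigest()

# The same number contacted twice yields the same trace for one participant:
assert hash_phone_number("+38640123456", "participant01") == \
       hash_phone_number("+38640123456", "participant01")
```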
As part of the upload process, the application Some of the data collected by the STRAW application are name is received in plain text, but only retained until query personal data, so even when storing them securely and after returns its category. After that, the application name is hashed to pseudonymization, some risk of a privacy breach remains. Since enable comparisons with later records and the name in plain text AWARE is widely used in scientific studies it already implements is discarded. In this way, we could build a database of application some privacy enhancing mechanisms. We performed a thorough name hashes and their corresponding categories on the server, application vulnerability analysis and identified several further while not keeping a record of what applications individual users threats to privacy that we wished to address. While the data use. are safely communicated and stored, an involuntary exposure of The server application also provides a simple UI for admin- users’ identity might still be possible. The types of data that de- istrators, where some metadata about the data collection itself serve special attention are applications, communication, location, are shown in forms of tables and charts. We can access data on and voice activity. last upload, number of days of participation, and number of data As mentioned in Section 1, the notifications that other applica-points for each individual user. This enables us to detect any tions send are monitored in the STRAW application. The content problems with data collection and troubleshoot them early. of the notification, such as that of an instant messaging applica- tion or calendar notification, is never actually stored. We deemed 5 CLIENT-SERVER COMMUNICATION AND even the application names to be sensitive, so we chose to only LOGIN save application categories. This process is further described in The STRAW application and other sensing applications are not Section 4. special in the degree they could be subject to external attacks The content of calls or messages is never logged, but the phone [2]. An attacker might want to expose identity of a user or try to numbers tied to them can be. Since we wanted to keep track of reveal their personal data such as location. There are three points recurring contact with the same person, but not reveal their real of entry for an external attacker: local storage, transmission of phone number, we decided to encrypt them using the SHA-1 data, and the servers. algorithm. While it would be possible to decrypt a phone number While the data reside on the device they are saved locally in by a brute-force attack, the AWARE implementation offers the the phone’s storage. According to Android’s documentation, this option of adding a salt. Thus by using the username (further de- database is exclusive to the STRAW application [9]: scribed in Section 5) as a salt, the phone numbers are sufficiently protected from inadvertent disclosure risk, while the hashed Other applications cannot access files stored within value is retained even across different application installations. internal storage. This makes internal storage a good 65 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Junoš Lukan, Marko Katrašnik, Larissa Bolliger, Els Clays, and Mitja Luštrek place for application data that other applications [2] Delphine Christin. 2016. Privacy in mobile participatory shouldn’t access. sensing. Current trends and future challenges. 
As described in our previous work [15], voice activity recognition is performed on the phone in its entirety. This means that raw audio recordings can be discarded immediately after processing, and only the calculated features are saved to the database. Alternatively, only the final binary prediction of human voice presence can be retained, but this makes any post-hoc analysis (such as speaker diarization) impossible.

4 SERVER APPLICATION
For the purpose of storing the data on a server, a Python application was implemented in Flask [21], which accepts the data in JSON format and saves them in a PostgreSQL [22] database. In addition to receiving the data and managing credentials (as described in Section 5), it also performs a couple of additional functions.

As mentioned in Section 3, instead of saving application names, we only log their category as classified in the Google Play Store. To reduce the number of queries, we implemented this as a part of the server application. As part of the upload process, the application name is received in plain text, but it is only retained until the query returns its category. After that, the application name is hashed to enable comparisons with later records, and the name in plain text is discarded. In this way, we could build a database of application name hashes and their corresponding categories on the server, while not keeping a record of which applications individual users use.

The server application also provides a simple UI for administrators, where some metadata about the data collection itself are shown in the form of tables and charts. We can access data on the last upload, the number of days of participation, and the number of data points for each individual user. This enables us to detect any problems with data collection and troubleshoot them early.

5 CLIENT-SERVER COMMUNICATION AND LOGIN
The STRAW application and other sensing applications are not special in the degree to which they could be subject to external attacks [2]. An attacker might want to expose the identity of a user or try to reveal their personal data, such as location. There are three points of entry for an external attacker: local storage, transmission of data, and the servers.

While the data reside on the device, they are saved locally in the phone's storage. According to Android's documentation, this database is exclusive to the STRAW application [9]: "Other applications cannot access files stored within internal storage. This makes internal storage a good place for application data that other applications shouldn't access." Additionally, once the data are transmitted to the server, the local database is periodically deleted. This reduces the privacy risk of the database being exposed, while also decreasing the local storage requirements.

It is therefore the transmission of data that we had to secure. The data are transmitted over an encrypted HTTPS connection, which eliminates the risk of exposure during this part of the communication. They are received by an application server residing at the Jožef Stefan Institute (JSI), with a dedicated port listening for incoming transmissions. The application server communicates with another, database server, also residing at JSI. This second server can only be accessed from within the JSI local area network. The database itself is also protected with a password, and the user accessing it via the application server does not have administrator privileges.

Since the STRAW application is a part of a wider study, it is disseminated to recruited participants only. In addition to the data from this application, other data are collected, such as responses to questionnaires in baseline screening and physiological data from wristbands. It was therefore necessary that the data can be linked back to an individual, in order to join the data from various sources. We developed a login method to enable this.

Using OkHttp [19] client-side and Flask-HTTPAuth [12] server-side, we implemented basic access authentication and token authentication [16]. The login credentials are disseminated to registered participants in our study and are input upon the installation of the STRAW application. This serves multiple purposes: by requiring login, we only accept data from actual participants of our study, while we can also use the assigned username to pseudoanonymously link data from various sources.
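A minimal sketch of the server side of such a setup, assuming a Flask app with Flask-HTTPAuth basic authentication in front of a JSON upload endpoint; the route name, the credential store and store_records() are our own illustrative placeholders, not the project's actual code:

```python
# Hedged sketch: a Flask upload endpoint protected with basic authentication
# via Flask-HTTPAuth, loosely following the setup described above. The route,
# the in-memory credential store and store_records() are illustrative only.
from flask import Flask, request, jsonify
from flask_httpauth import HTTPBasicAuth

app = Flask(__name__)
auth = HTTPBasicAuth()

CREDENTIALS = {"participant01": "secret"}  # placeholder; real credentials
                                           # would be stored hashed in a DB

@auth.verify_password
def verify_password(username, password):
    if CREDENTIALS.get(username) == password:
        return username  # the username doubles as the pseudonymization key

@app.route("/upload", methods=["POST"])
@auth.login_required
def upload():
    records = request.get_json()
    store_records(auth.current_user(), records)  # e.g. INSERT into PostgreSQL
    return jsonify(status="ok")

def store_records(username, records):
    pass  # placeholder for the database insert
```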
6 CONCLUSION
The application used in the STRAW project serves a dual purpose: to collect users' answers to questionnaires, and to passively collect data about their environment and phone usage. While the application was tailored to the requirements of our study, this paper outlined the main issues and possible solutions when developing an application for research purposes.

While the AWARE framework provided a solid foundation and especially eased sensor data collection, there are additional challenges that researchers need to face when trying to use an application like this in a scientific study. The data gathered using this application will help us develop improved models of stress recognition [8], integrating physiological data with more detailed contextual data and more reliable self-reports.

ACKNOWLEDGMENTS
The authors acknowledge that the STRAW project was financially supported by the Slovenian Research Agency (ARRS, project ID N2-0081) and by the Research Foundation – Flanders, Belgium (FWO, project no. G.0318.18N).

REFERENCES
[1] Ane Alberdi, Asier Aztiria and Adrian Basarab. 2016. Towards an automatic early stress recognition system for office environments based on multimodal measurements: a review. Journal of Biomedical Informatics, 59, (February 2016), 49–75. doi: 10.1016/j.jbi.2015.11.007.
[2] Delphine Christin. 2016. Privacy in mobile participatory sensing: current trends and future challenges. Journal of Systems and Software, 116, 57–68. doi: 10.1016/j.jss.2015.03.067.
[3] Mihaly Csikszentmihalyi, Reed Larson and Suzanne Prescott. 1977. The ecology of adolescent activity and experience. Journal of Youth and Adolescence, 6, 3, (September 1977), 281–294. doi: 10.1007/bf02138940.
[4] Florian Eyben, Felix Weninger, Florian Gross and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia – MM '13. ACM Press. doi: 10.1145/2502081.2502224.
[5] Denzil Ferreira, Vassilis Kostakos and Anind K. Dey. 2015. AWARE: mobile context instrumentation framework. Frontiers in ICT, 2, 6, 1–9. issn: 2297-198X. doi: 10.3389/fict.2015.00006.
[6] Foursquare. [n. d.] Places SDK: venue search. Retrieved 26/08/2020 from https://developer.foursquare.com/docs/api-reference/venues/search/.
[7] Eibe Frank, Mark A. Hall and Ian H. Witten. 2016. The WEKA Workbench (4th edition). Morgan Kaufmann.
[8] Martin Gjoreski, Mitja Luštrek, Matjaž Gams and Hristijan Gjoreski. 2017. Monitoring stress with a wrist device using context. Journal of Biomedical Informatics, 73, 159–170. issn: 1532-0464. doi: 10.1016/j.jbi.2017.08.006.
[9] Google. [n. d.] Access app-specific files: access from internal storage. Retrieved 26/08/2020 from https://developer.android.com/training/data-storage/app-specific.
[10] Google. [n. d.] Adapt your app by understanding what users are doing. Retrieved 26/08/2020 from https://developers.google.com/location-context/activity-recognition.
[11] Google. [n. d.] IntentService. Retrieved 26/08/2020 from https://developer.android.com/reference/android/app/IntentService.html.
[12] Miguel Grinberg. [n. d.] Flask-HTTPAuth. Retrieved 26/08/2020 from https://flask-httpauth.readthedocs.io/en/latest/.
[13] D. Richard Hipp, Dan Kennedy and Joe Mistachkin. 2019. SQLite. Computer software. (2019). https://sqlite.org/index.html.
[14] Gillian H. Ice and Gary D. James, editors. 2006. Measuring emotional and behavioral response: general principles. In Measuring Stress in Humans: A Practical Guide for the Field, Part II – Measuring stress responses. Cambridge University Press, Cambridge, UK, (December 2006). Chapter 3, 60–93. isbn: 978-0-521-84479-6.
[15] Marko Katrašnik, Junoš Lukan, Mitja Luštrek and Vitomir Štruc. 2019. Razvoj postopka diarizacije govorcev z algoritmi strojnega učenja. In Proceedings of the 22nd International Multiconference INFORMATION SOCIETY – IS 2019, Slovenian Conference on Artificial Intelligence. Mitja Luštrek, Matjaž Gams and Rok Piltaver, editors. Volume A, 57–60. https://is.ijs.si/archive/proceedings/2018/files/Zbornik%20-%20A.pdf.
[16] Chris Schmidt. 2001. Token based authentication. In Accepted Papers for FOAF-Galway: 1st Workshop on Friend of a Friend, Social Networking and the Semantic Web. https://www.w3.org/2001/sw/Europe/events/foaf-galway/papers/fp/token_based_authentication/.
[17] Philip Schmidt, Attila Reiss, Robert Duerichen and Kristof Van Laerhoven. 2018. Wearable affect and stress recognition: a review. (21st November 2018).
[18] Joanna Smith. 2016. Scheduling jobs like a pro with JobScheduler. https://medium.com/google-developers/scheduling-jobs-like-a-pro-with-jobscheduler-286ef8510129.
[19] Square, Inc. 2019. OkHttp. Computer software. (2019). https://square.github.io/okhttp/.
[20] Arthur A. Stone and Saul Shiffman. 1994. Ecological momentary assessment (EMA) in behavioral medicine. Annals of Behavioral Medicine, 16, 3, 199–202. doi: 10.1093/abm/16.3.199.
[21] The Pallets team. 2010. Flask. Computer software. (2010). http://flask.pocoo.org/.
[22] The PostgreSQL Global Development Group. 2019. PostgreSQL 11.3 Documentation. Version 11.3.
[23] Marija Trajanoska, Marko Katrašnik, Junoš Lukan, Martin Gjoreski, Hristijan Gjoreski and Mitja Luštrek. 2018. Context-aware stress detection in the AWARE framework. In Proceedings of the 21st International Multiconference INFORMATION SOCIETY – IS 2018, Slovenian Conference on Artificial Intelligence. Mitja Luštrek, Rok Piltaver and Matjaž Gams, editors. Volume A, 25–28. https://is.ijs.si/archive/proceedings/2018/files/Zbornik%20-%20A.pdf.
URBANITE H2020 Project: Algorithms and Simulation Techniques for Decision-Makers

Alina Machidon (alina.machidon@ijs.si), Maj Smerkol (maj.smerkol@ijs.si), Matjaž Gams (matjaz.gams@ijs.si)
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
URBANITE (Supporting the decision-making in URBAN transformation with the use of dIsruptive TEchnologies) is an H2020 project with the goal of providing an ecosystem model that articulates the expectations, trust and attitude of civil servants, citizens and other stakeholders in the use of disruptive technologies. This model will be supported with the provision of a data management platform and algorithms for data-driven decision-making in the field of urban transformation. One of the main outputs of the project will be a decision-support system including (AI-based) predictive algorithms and simulation models for mobility that support the decision-making process by analyzing the current situation and the trends that occurred in a certain time frame, and by allowing future situations to be predicted when changing one or more variables. URBANITE will analyze the impact, trust and attitudes of civil servants, citizens and other stakeholders with respect to the integration of disruptive technologies such as Artificial Intelligence (AI), Decision Support Systems (DSS), big data analytics and predictive algorithms in a data-driven decision-making process. The results of the project will be validated in four real use cases: Amsterdam, Bilbao, Helsinki and Messina. This paper overviews the current state of the project's progress.

KEYWORDS
AI, Big Data, DSS, disruptive technologies, URBANITE project

1 INTRODUCTION
In recent times, cities and urban environments have been facing a revolution in urban mobility, bringing up unforeseen consequences that public administrations need to manage. It is in this new context that public administrations and policy-makers need means to help them understand this new scenario, supporting them in making policy-related decisions and predicting eventualities. Traditional technological solutions are no longer adequate for this situation, and therefore disruptive technologies, such as big data analytics, predictive algorithms, and decision support systems profiting from artificial intelligence techniques, come into place to support policy-makers.

The main technical objective of the URBANITE project is the development of advanced AI algorithms for the analysis of big data on mobility. The developed methods and tools will provide substantial support for policy-makers to tackle complex policy problems in the mobility domain and will enable their validation on case-specific models. The goal of the activities will be to implement novel tools and services in order to enable policy-makers to use advanced data analysis and machine learning methods during the design of novel policies for a specific city.

URBANITE will allow the analysis of the traffic flows that are currently happening and that have happened up until that moment. In addition to the visualization of the traffic, the usage of economy-sharing vehicles and other aspects, URBANITE will analyse the bottlenecks and critical points, based on a set of parameters to be determined by the civil servants. Because historic data is stored, trends can be determined by URBANITE with big data algorithms. These trend analyses can entail the understanding of, for instance, the use of a certain transportation system (e.g. bikes) in a certain neighbourhood of the municipality, or the peak hours in which a street is blocked. URBANITE will also provide means to simulate the effect of different situations, such as opening a pedestrian street at certain times, or the location of electric charging stations or bike-sharing points, through the implementation of artificial intelligence algorithms. To achieve that, URBANITE will first build generic models from the data across all the cities and then provide adaptation mechanisms to apply these models to the different use cases. From the data available, URBANITE will extract and formalize knowledge and then, through a combination of classification, regression, clustering, and frequent pattern mining algorithms, arrive at decisions and actionable models that will enable city policy-makers to simulate and assess the outcomes and implications of new policies.
2 SYSTEM'S ARCHITECTURE
The URBANITE project will combine various data sources, algorithms, libraries and tools that provide the best solutions within the scope of the project. The technical 'core' of the project has to fulfill the following objectives:

• Deploy tools for big data exploration with the active involvement of policy-makers.
• Design methods for the detection of important events that need to be addressed.

In order to provide the desired functionalities, several state-of-the-art technologies are currently being examined and tested, in order to be adapted, customized and integrated into the platform. A simplified preliminary architecture is presented in Figure 1.

2.1 Data Analysis Module
One of the first tasks involves the development of various methods for exploratory data analysis and user interaction. Multimodal methods, tools and services for big data on urban mobility will be implemented, providing exploratory analysis capabilities and enabling policy-makers to actively search for causal relations in the data.
Figure 1: High Level Architecture of the URBANITE Platform.

The methods to be included in the platform can be segmented into four main groups (a small illustration of the first two follows the list):

• clustering, where the main goal is to reduce the amount of data by grouping together similar instances. The implemented method will provide mechanisms to group instances based on GIS data or on any subset of attributes that users define. For example, platform users might choose to cluster all instances based on the type of transportation used (shared bikes, electric cars, etc.);
• projection methods, which will be used to reduce the dimensionality of the data items. The goal of these methods is to represent the data in a lower-dimensional space in such a way that the key relations of the data structures are preserved. The results of these methods can be used to visualize the data more clearly, or the transformed data can be used in the next rounds of analysis;
• self-organizing maps, which involve the use of a type of artificial neural network trained in an unsupervised manner. The method can at the same time reduce the amount of data (similar to clustering) and nonlinearly project the data into lower dimensionalities;
• prediction/regression methods, or classification models, that will allow the data to be exploited.
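As a toy illustration of the first two groups (our own minimal example, not URBANITE code), trips described by a few numeric attributes can be clustered and projected with standard scikit-learn components:

```python
# Hedged toy example of the clustering and projection method groups above,
# using scikit-learn; the synthetic "trip" attributes are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder numeric trip attributes (e.g. duration, distance, hour, mode).
trips = rng.normal(size=(500, 4))

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(trips)
projected = PCA(n_components=2).fit_transform(trips)  # 2-D view for plotting

print(np.bincount(clusters))   # cluster sizes
print(projected.shape)         # (500, 2)
```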
2.2 Recommendation Engine
Recommendation engines (also known as recommender systems) are information filtering systems that deal with the problem of information overload [6] by filtering key information 'chunks' out of a large amount of dynamically generated information, according to the user's preferences, interests, or observed behavior regarding an item [8][5]. Recommendation engines have the ability to predict whether a particular user would prefer an item or not, based on the user's profile [5]. A recommendation engine is defined as a decision-making strategy for users under complex information environments [4]. Recently, various approaches for building recommendation engines have been developed, based on either collaborative filtering, content-based filtering or hybrid filtering [12], [11], [9].

The URBANITE recommendation engine will identify and predict important or problematic events related to mobility and will provide suggestions to tackle the issue. The policy support system will provide support to the policy-makers for identifying possible policies that tackle events based on specific criteria. The inputs will have to be aggregated for effective decision-making using hierarchical multi-criteria decision models.

2.3 Policy Simulation and Validation Engine
Simulation transparency is a vital feature of the decision-making process when quantitative computer tools are used to justify strategies [10]. Simulation predictions can play a catalytic role in the development of public policies, in the elaboration of safety procedures, and in establishing legal liability. Hence, given the impact that modelling and simulation predictions are known to have, the credibility of the computational results is of crucial importance to engineering designers and managers, but also to public servants and to all citizens affected by the decisions that are based on these predictions [10].

To create trust and increase the credibility of the model and of the delivered simulation results, it is crucial to devise a validation strategy with which non-simulation-trained end-users can feel comfortable, so that they trust the simulation model [10].

In the URBANITE project, the policy simulation and validation module will provide methods and tools to simulate the efficiency of specific policies in the target domain. Given a new policy, an urban mobility model and the target parameters, the system can evaluate the performance of the new policy based on the observed parameters. The implementation of credible traffic simulations for an entire city has been addressed by various projects; however, it is not yet adequately solved, due to its complexity. In URBANITE, the constructed model will be used to predict and classify traffic flow changes based on the provided changes in the new policies. Policy-makers will select the defined KPIs that need to be evaluated by the validation engine, and based on the scores the new policies achieve, they will be able to make an informed decision about which policies should be deployed in the city.

2.4 Advanced Visualization Methods
Another important task will be the implementation of advanced visualizations of mobility patterns, important events, and results of policy validations. The main visualization functionalities will present the information on a combination of map layers, describing where in the city specific events or sequences of events occurred. Visualizations will involve the use of heat maps, traffic flow graphics, and other transportation clusters. Users will be able to change and interact with the visualization parameters, for example to select specific time ranges, zoom, highlight, or display additional information. Considering the variety and characteristics of the data, one concern regards depicting multidimensional data in a human-perceivable manner. Several graphical methods are customarily used for a preliminary analysis of generic multivariate datasets [2]: scatter plots, pie charts and bar plots, histograms, box plots, violin and bean plots, spider/radar/star/polar plots, glyph plots, mosaic and spine plots, treemaps, and others.

Traffic datasets are generally high-dimensional or spatio-temporal [3]; thus, visualizing traffic data mostly employs information visualization and visual analytics. Traffic data contain multiple variables, of which the most important ones are time and space.
Several different types of visualisation are currently used for traffic data, among them visualization of time, visualization of spatial properties, and spatio-temporal visualization. Location is the main spatial property of traffic data. Based on the aggregation level of the location information, visualization of spatial properties can be categorized into three classes: point-based visualization (no aggregation), line-based visualization (first-order aggregation), and region-based visualization (second-order aggregation) [3]. Heatmaps are the most used visualisation tools to show the integrated quantity of a large number of objects on a map.

A preliminary user interface prototype is depicted in Figure 2.

Figure 2: User Interface Mock-up of the URBANITE Platform.

3 DATA SOURCES
There are several collection procedures for traffic-related data, ranging from sensor readings to airborne imagery and social media data [13]. The involvement of the municipalities of Bilbao, Helsinki, Amsterdam and Messina will provide a wide range of data sources related to urban mobility, along with the public, open-source ones. Several types of data sources were identified for the URBANITE project:

• geospatial data, e.g. maps (OpenStreetMap, https://www.openstreetmap.org/, but also proprietary maps of the cities)
• additional info such as: car and lorry registration, information on parking lots, dynamic parking data, cadastre information, commercial register, care services, tourism accommodation
• demographics: statistical information on the number of inhabitants of different city districts, the number of households, the population's age brackets, city boundaries, etc.
• public transportation: tram and metro lines, static and dynamic information about the public bus transport service, the GPS positions of the buses
• traffic data: counts of car traffic and speeds, traffic status in real time, vehicle counts on the ring roads, etc.
• bicycle information: bike counters, bicycle collection points, calculated numbers of bikes in specific road segments, CityBikes (https://api.citybik.es/v2/)
• pedestrian: manual counts of pedestrians
• electric charging stations
• taxi stops available
• harbour transport data, ferry traffic statistics
• geographic airport information
• air quality (OpenAQ, https://openaq.org/)
• noise maps
• weather data (OpenWeatherMap, https://openweathermap.org/)

The format of these datasets varies from JSON, XML, CSV and XLSX to WMS, GeoJSON and GML. The main issue with the mobility-related data sources is the high level of heterogeneity, both in terms of data format and of data availability. Most of the cities involved in the project have some data related to traffic in the city, for example, but the format of the data, the level of granularity (how often the data is updated) and the availability of historical data (for how long the city stores historical data) vary greatly from one case to another.
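As a small, hypothetical illustration of this heterogeneity problem (the file names and fields are ours, not the project's), records from a CSV feed and a JSON feed can be normalized into one table with pandas:

```python
# Hedged sketch: harmonizing two differently formatted traffic-count feeds
# into one table. File names and field names are hypothetical placeholders.
import json
import pandas as pd

# Source 1: a CSV export with columns sensor_id, timestamp, vehicles.
csv_counts = pd.read_csv("city_a_counts.csv")

# Source 2: a JSON feed with nested records.
with open("city_b_counts.json") as f:
    raw = json.load(f)
json_counts = pd.json_normalize(raw["records"]).rename(
    columns={"sensor.id": "sensor_id", "time": "timestamp", "n": "vehicles"})

# A common schema: one timestamp convention, one row per sensor reading.
frames = []
for df in (csv_counts, json_counts):
    df = df[["sensor_id", "timestamp", "vehicles"]].copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    frames.append(df)
counts = pd.concat(frames, ignore_index=True)
```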
Another special aspect that needs to be addressed is the impact of COVID-19 on the mobility sector. Since COVID-19 has disrupted the social, economic and political aspects of life, the urban mobility area was also affected. Some analyses [1] revealed that overall mobility fell by up to 76 %, public transport use dropped by up to 93 %, NO2 emissions were reduced by up to 60 %, and traffic accidents were reduced by up to 67 % in relative terms. This phenomenon of experiencing unexpected changes of concepts or data characteristics over time is referred to as concept drift [7], and it is one of the key challenges that the URBANITE project will need to deal with when choosing how to make the most appropriate predictions regarding the impact of various traffic policies. The algorithms developed should take the stability-plasticity dilemma into consideration as a reference, especially since it is still difficult to predict how the crisis derived from the pandemic will evolve and what urban mobility will look like afterwards.

Figure 3: Data Sources for the URBANITE Platform.

4 CONCLUSIONS
The technical core of the URBANITE project focuses on the development of advanced AI algorithms for the analysis of big data on mobility. The developed methods and tools will provide substantial support for policy-makers to tackle complex policy problems in the mobility domain and will enable their validation on case-specific models. The goal of the activities is to implement novel tools and services in order to enable policy-makers to use advanced data analysis and machine learning methods during the design of novel policies for a specific city.

One underlining factor in URBANITE is the adaptation of everything that is created to civil servants, citizens and interested parties who may or may not be digitally literate. The use of big data techniques and artificial intelligence algorithms is, up till now, not a common skill among public servants, and this is one of the reasons why the data analysis processes and user interaction mechanisms described in this work are developed with the abilities of non-experts in mind too.

ACKNOWLEDGMENTS
This paper is supported by the European Union's Horizon 2020 Research and Innovation Programme, URBANITE project, under Grant Agreement No. 870338.

REFERENCES
[1] Alfredo Aloi, Borja Alonso, Juan Benavente, Rubén Cordera, Eneko Echániz, Felipe González, Claudio Ladisa, Raquel Lezama-Romanelli, Álvaro López-Parra, Vittorio Mazzei, et al. 2020. Effects of the covid-19 lockdown on urban mobility: empirical evidence from the city of Santander (Spain). Sustainability, 12, 9, 3870.
[2] Sunith Bandaru, Amos H. C. Ng, and Kalyanmoy Deb. 2017. Data mining methods for knowledge discovery in multi-objective optimization: part A - survey. Expert Systems with Applications, 70, 139–159.
[3] Wei Chen, Fangzhou Guo, and Fei-Yue Wang. 2015. A survey of traffic data visualization. IEEE Transactions on Intelligent Transportation Systems, 16, 6, 2970–2984.
[4] Ricardo Colomo-Palacios, Francisco José García-Peñalvo, Vladimir Stantchev, and Sanjay Misra. 2017. Towards a social and context-aware mobile recommendation system for tourism. Pervasive and Mobile Computing, 38, 505–515.
[5] F. O. Isinkaye, Y. O. Folajimi, and B. A. Ojokoh. 2015. Recommendation systems: principles, methods and evaluation. Egyptian Informatics Journal, 16, 3, 261–273. issn: 1110-8665. doi: 10.1016/j.eij.2015.06.005.
[6] Joseph A. Konstan and John Riedl. 2012. Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction, 22, 1-2, 101–123.
[7] Jesus L. Lobo, Javier Del Ser, Miren Nekane Bilbao, Ibai Lana, and Sancho Salcedo-Sanz. 2016. A probabilistic sample matchmaking strategy for imbalanced data streams with concept drift. In International Symposium on Intelligent and Distributed Computing. Springer, 237–246.
[8] Chenguang Pan and Wenxin Li. 2010. Research paper recommendation with topic analysis. In 2010 International Conference on Computer Design and Applications. Volume 4. IEEE, V4-264.
[9] Nymphia Pereira and Satishkumar L. Varma. 2019. Financial planning recommendation system using content-based collaborative and demographic filtering. In Smart Innovations in Communication and Computational Sciences. Springer, 141–151.
[10] Miquel Angel Piera, Roman Buil, and Egils Ginters. 2013. Validation of agent-based urban policy models by means of state space analysis. In 2013 8th EUROSIM Congress on Modelling and Simulation. IEEE, 403–408.
[11] Tomasz Rutkowski, Jakub Romanowski, Piotr Woldan, Paweł Staszewski, Radosław Nielek, and Leszek Rutkowski. 2018. A content-based recommendation system using neuro-fuzzy approach. In 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE, 1–8.
[12] Diego Sánchez-Moreno, Ana B. Gil González, M. Dolores Muñoz Vicente, Vivian F. López Batista, and María N. Moreno García. 2016. A collaborative filtering method for music recommendation using playing coefficients for artists and users. Expert Systems with Applications, 66, 234–244.
[13] G. Zhou, J. Lu, C.-Y. Wan, M. D. Yarvis, and J. A. Stankovic. 2008. Body Sensor Networks. MIT Press, Cambridge, MA.
Towards End-to-end Text to Speech Synthesis in Macedonian Language

Marija Neceva, Emilija Stoilkovska, Hristijan Gjoreski (mneceva@gmail.com, emi.stoilkovska@gmail.com, hristijang@feit.ukim.edu.mk)
Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Skopje, N. Macedonia

ABSTRACT
A text-to-speech (TTS) synthesis system typically consists of multiple stages: a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may involve brittle design choices. This paper presents an end-to-end deep learning approach to speech synthesis in the Macedonian language. The developed model uses Google's Tacotron architecture and is able to generate speech from text from multiple speakers using an attention mechanism. It consists of three parts: an encoder, an attention-based decoder and a post-processing network. The model was trained on a dataset recorded by five mixed-gender speakers, resulting in 25.5 hours of data, or 13,101 pairs of text-speech segments. The results show that the model successfully generates speech from text data, which was empirically shown using a quantitative questionnaire answered by 42 subjects.

KEYWORDS
text-to-speech, deep learning, tacotron, multi-speaker, seq2seq, text, audio, attention

1 INTRODUCTION
Modern TTS pipelines are complex [1]. For example, statistical parametric ones have a text frontend extracting various linguistic features, a duration model, an acoustic feature prediction model and a complex signal-processing-based vocoder [2][3]. These components usually require extensive domain expertise, are laborious to design, and must be trained independently. Consequently, errors from each component may compound. In contrast, implementing an integrated end-to-end TTS system offers many advantages. First, it can be trained on text-audio pairs with minimal human annotation. It also alleviates the need for laborious feature engineering. Further, it allows rich conditioning on various attributes, such as speaker or language, or on high-level features like sentiment. Similarly, adaptation to new data might also be easier. Finally, a single model is likely to be more robust than a multi-stage one. All these advantages imply that an end-to-end system allows training on huge amounts of real-world data.

However, since TTS is a large-scale inverse problem, and because of the existence of different pronunciations and speaking styles, decompressing a highly compressed source text into audio may cause difficulties in the learning task of an end-to-end model. The main problem is coping with large variations at the signal level for a given input. Moreover, unlike end-to-end speech recognition [4] or machine translation [5], TTS outputs are continuous and much longer than the input sequences.

Mainly referring to the advantages of end-to-end systems, this paper proposes an implementation of Google's Tacotron model as a TTS system for the Macedonian language. Tacotron is an end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) model [6] with the attention paradigm [7]. This model takes characters as input and outputs a raw spectrogram. We implemented our own version of Tacotron, based on a few published articles. What we kept is their deep learning architecture, but we made some changes to the model's hyperparameters and other utilities (like the known symbols, numbers, etc.). That way the model was adapted to work with Cyrillic. Given text-audio pairs, our Tacotron model was trained completely from scratch, only on our dataset. It does not require phoneme-level alignment, so it can easily scale to using large amounts of acoustic data with transcripts.

2 RELATED WORK
WaveNet [8] is a powerful, non-end-to-end, generative audio model which works well for TTS synthesis. It is used as a replacement of the vocoder and acoustic model of the system. It can be slow due to its sample-level autoregressive nature, and it also requires conditioning on linguistic features from an existing TTS frontend.

Deep Voice [9] is a neural model which replaces every component in a typical TTS pipeline with a corresponding neural network. However, each component is trained independently, and it is nontrivial to change the system to train in an end-to-end fashion.

Wang et al. [10] present one of the first studies of end-to-end TTS using seq2seq with attention. However, it requires a pre-trained hidden Markov model (HMM) aligner to help the seq2seq model learn the alignment, as well as a vocoder, since it predicts vocoder parameters. Furthermore, the model is trained on phoneme inputs, with the possibility of hurting the prosody, and it produced limited experimental results.

Char2Wav [11] is an independently developed end-to-end model that can be trained on characters. However, it still predicts vocoder parameters before using a SampleRNN neural vocoder [12], and its seq2seq and SampleRNN models need to be pre-trained separately.

MAIKA [26] is a Macedonian TTS project that was made public a few months ago. However, there is no documentation of how it works. It is therefore technically challenging to
al [10] presents one of the first studies of end-to-component may compound. Otherwise, implementing an end TTS using seq2seq with attention. However, it requires a integrated end-to-end TTS system offers many advantages. pre-trained hidden Markov model (HMM) aligner to help the First, it can be trained on pairs with minimal seq2seq model learn the alignment and a vocoder due to human annotation. It also alleviates the need for laborious predicting vocoder parameters. Furthermore, the model is feature engineering. Further, it allows rich conditioning on trained on phoneme inputs with possibilities of hurting the various attributes, such as speaker or language, or high-level prosody and producing limited experimental results. features like sentiment. Similarly, adaptation to new data might also be easier. Finally, a single model is likely to be Char2Wav [11] is an independently developed end-to-end more robust than a multi-stage. All these advantages imply model that can be trained on characters. However, it still that an end-to-end system allows training on huge amounts predicts vocoder parameters before using a SampleRNN real world data. But knowing that TTS is a large-scale inverse neural vocoder [12] and their seq2seq and SampleRNN problem and due to existence of different pronunciations or models need to be separately pre-trained. speaking styles, decompressing a highly compressed source MAIKA [26] is a Macedonian TTS project that was made text into audio may cause difficulties in the learning task of public few months ago. However, there is no documentation an end-to-end model. The main problem is coping with large of how it works. Therefore, it is technically challenging to variations at the signal level for a given input. Moreover, 72 compare with a system that only has web interface which 3.3 Decoder generates sound. Tacotron model uses a content-based tanh attention decoder eSpeak [27] is an open source TTS project that also supports [18], where a stateful recurrent layer produces the attention Macedonian language. The documentation states that the query at each decoder time step. The input of decoder’s RNN Macedonian model is based on the Croatian - which has its is formed by concatenating the context vector and the limitations since the Macedonian language is quite different, attention RNN cell output. Decoder’s internal structure is a especially the pronunciation and the grammar. stack of GRUs with vertical residual connections [5], used for 3 speeding up convergence. A simple fully-connected output MODEL ARCHITECTURE layer is used to predict the decoder targets. Its target is 80- The backbone of Tacotron is a seq2seq model with attention band mel-scale spectrogram, later converted to waveform by [7][13]. Figure 1 illustrates the model, which includes an a post-processing network. It predicts multiple, non-encoder, an attention-based decoder, and a post-processing overlapping, output frames at each decoder step. Predicting net. At a high-level, this model takes characters as input and r frames at once divides the total number of decoder steps by produces spectrogram frames, which are later converted to r, which reduces model size, training and inference time and waveforms. These components are described below. increases convergence speed. This is likely because neighboring speech frames are correlated and each character usually corresponds to multiple frames, plus emitting multiple frames allows the attention to move forward early in training. 
3.2 Encoder

The encoder extracts robust sequential representations of the text. The input to the encoder is a character sequence, with each character represented as a one-hot vector and embedded into a continuous vector. A set of non-linear transformations, known as a "pre-net", is applied to each embedding. The pre-net is a bottleneck layer with dropout, which helps convergence and improves generalization. A CBHG module then transforms the pre-net outputs into the final encoder representation used by the attention module. The CBHG-based encoder reduces overfitting and makes fewer mispronunciations than a standard multi-layer RNN encoder.

3.3 Decoder

Tacotron uses a content-based tanh attention decoder [18], where a stateful recurrent layer produces the attention query at each decoder time step. The input to the decoder's RNN is formed by concatenating the context vector and the attention RNN cell output. The decoder's internal structure is a stack of GRUs with vertical residual connections [5], used to speed up convergence. A simple fully-connected output layer predicts the decoder targets. The target is an 80-band mel-scale spectrogram, later converted to a waveform by a post-processing network. The decoder predicts multiple non-overlapping output frames at each step. Predicting r frames at once divides the total number of decoder steps by r, which reduces model size, training and inference time, and increases convergence speed. This is likely because neighboring speech frames are correlated and each character usually corresponds to multiple frames, while emitting multiple frames allows the attention to move forward early in training. To define the input of the next decoding step, the "teacher forcing" mechanism is used: at each time step, the decoder's input is the ground-truth value of the previously predicted decoder output.

3.4 Attention Mechanism

The attention mechanism is applied in order to learn mappings between input and output sequences through gradient descent and back-propagation. It is a way for the decoder to learn which internal state of the encoder deserves more attention at each time step when generating the current output. The whole process of calculating the attention weights and using them to form the decoder input is illustrated in Figure 2.

Figure 2: What is behind the attention mechanism

3.5 Post-processing Net and Waveform Synthesis

The post-processing net converts the seq2seq target into a form that can be synthesized into waveforms [20][21]. Since Griffin-Lim is used as the synthesizer, the post-processing net learns to predict the spectral magnitude sampled on a linear-frequency scale. The Griffin-Lim algorithm converges towards an estimated phase, whose quality depends on the number of iterations applied; although more iterations may lead to overfitting, they produce better audio. In our setup, Griffin-Lim converges after 50 iterations, even though 30 iterations seem to be enough.
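As an illustration of this synthesis step, the snippet below reconstructs audio from a magnitude spectrogram with librosa's Griffin-Lim routine, using the analysis parameters reported in Section 3.6 below; the input tone is only a stand-in for the network's predicted spectrogram.

```python
import numpy as np
import librosa

sr = 24000                    # sampling rate used in the paper
n_fft = 2048                  # 2048-point Fourier transform
win = int(0.050 * sr)         # 50 ms Hann window -> 1200 samples
hop = int(0.0125 * sr)        # 12.5 ms frame shift -> 300 samples

y = librosa.tone(440, sr=sr, duration=1.0)        # stand-in waveform
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                        win_length=win, window="hann"))

# Griffin-Lim iteratively estimates the phase from the magnitude alone;
# the paper reports convergence after 50 iterations.
y_hat = librosa.griffinlim(S, n_iter=50, hop_length=hop,
                           win_length=win, window="hann")
```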
3.6 Model Parameters

The log-magnitude spectrogram is obtained using Hann windowing with a 50 ms frame length, a 12.5 ms frame shift and a 2048-point Fourier transform. A 24 kHz sampling rate was used for all experiments. For both the seq2seq decoder (mel-scale spectrogram) and the post-processing net (linear-scale spectrogram), a simple L1 loss with equal weights is used. The model was trained with a batch size of 4, with all sequences padded to a maximum length.

4 DATASET

There is no public dataset of audio data in the Macedonian language, so we had to create one. We used publicly available books in Macedonian from the website of the National Association of the Blind of the Republic of North Macedonia. The books were recorded by 5 speakers, 3 male and 2 female. The recordings were segmented using an algorithm that splits the input audio based on silence length and a threshold, with the silence length varying between 700 and 1000 ms (a sketch of this step is given after Table 1). The audio clips were additionally padded with 700 ms at both the beginning and the end to avoid sudden cut-offs.

Next, the audio files were transcribed manually, aided by the written version of the audio book. The transcriptions are void of any punctuation, capitalization or special characters, including numbers; they include only the 31 letters of the Macedonian alphabet and the space character separating words. The reason is that the initial dataset was also used for another task (speech recognition), for which the punctuation was removed, and at this stage we could not retrieve the original raw data that includes it. The final dataset contains 13,101 audio files and transcripts in the Macedonian language [25]. Additional statistics about the dataset are listed in Table 1. It should be noted that the goal is not the dataset itself, but developing an end-to-end, multi-speaker, deep learning TTS for the Macedonian language. A detailed language analysis of the dataset is planned for another study, which will focus more on the linguistic aspects of the dataset.

Table 1: Dataset statistics

  Total clips         13,101
  Total words         188,521
  Distinct words      28,791
  Total duration      25:36:20
  Mean clip duration  7.04 sec
  Min clip duration   0.73 sec
  Max clip duration   97.6 sec (1 min 37.6 sec)
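A minimal sketch of the silence-based segmentation step using pydub; the input file name and the dB threshold are assumptions, since the paper only specifies the silence length and the 700 ms padding.

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

book = AudioSegment.from_file("audiobook.mp3")  # hypothetical input file

clips = split_on_silence(
    book,
    min_silence_len=700,              # ms; the paper uses 700-1000 ms
    silence_thresh=book.dBFS - 16,    # assumed threshold; not stated in the paper
    keep_silence=700)                 # keep 700 ms of padding at both ends

for i, clip in enumerate(clips):
    clip.export(f"clip_{i:05d}.wav", format="wav")
```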
5 TRAINING AND EVALUATION

5.1 Training

During the training phase, an output is produced every 1000 steps; producing an output takes a few seconds. Each output contains five files: three give information about the model formed up to that step, while the other two are an alignment plot and an audio file synthesized by that model. The synthesized audio file is used to check the quality of the current model. The alignment plot shows whether the decoder has learned which input state of the encoder is important for producing its current output: if there is an "A" on the input, an "A" should be produced as sound on the output. A good alignment plot is one that looks like a diagonal line. The system was trained for 5 days, reaching 412,000 steps and yielding 412 models. It started showing a good alignment at step 63,000. The last model was chosen as the reference one; its training and test results sound much better and are more understandable than those generated by the other models.

5.2 Evaluation

To estimate the model's performance, we used 10 out of 14 random sentences as test examples. The results show that more than half of the synthesized audio files [22] successfully represented the input sequence of the model. This was empirically shown using a quantitative questionnaire [23] answered by 42 subjects: 10 IT experts and 32 volunteers from the general public. The questionnaire consisted of 10 stages, one for each of the 10 audio files; 10 test examples were chosen to keep the questionnaire compact and quick for the evaluators. Each stage contains 3 sub-questions about the currently observed audio file. The Mean Opinion Score (MOS) [24] was used as the scoring measure. MOS is a subjective measure of audio quality, used to test the listener's perception of the quality and clarity of the audio. Each audio file had to be scored from 1 to 5 on three criteria: naturalness, intelligibility and accuracy, where naturalness stands for the similarity of the produced audio to natural human speech, intelligibility for the clarity of the spoken words, and accuracy for how well the spoken sequence corresponds to the original text that was required to be spoken.

The results of the questionnaire are shown in Table 2. Each row of the table gives the MOS for one of the three criteria, calculated separately for the experts and the volunteers; the scores for each criterion are summed and then averaged (a small sketch of this aggregation follows at the end of this subsection). The experts scored the model's performance higher than the volunteers on every criterion, with a total score 0.265 higher. We speculate that the reason is that the experts also take into account the technical challenges and aspects of such a system, while the volunteers simply evaluate the sound and its quality.

Table 2: MOS score results

                   Experts  Volunteers
  Accuracy         4.8      4.6
  Intelligibility  4.5      4.2
  Naturalness      4.1      3.9
  Total            4.5      4.2

Additionally, Figures 3 and 4 show the box plots of the answers given by the volunteers and the experts, respectively. The figures show that accuracy achieves the highest score and naturalness the lowest. We speculate that the reason for the low naturalness score is the presence of sudden pauses where words should be spoken, or of mumbling instead of clear pronunciation; there are only a few such occurrences.

Figure 3: Box plot of all grades given by the volunteers
Figure 4: Box plot of all grades given by the IT experts
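Computing the MOS from the questionnaire reduces to averaging the 1-5 scores per criterion and evaluator group; a small pandas sketch with made-up answers:

```python
import pandas as pd

# Hypothetical long-format answers: one row per (evaluator group, criterion).
answers = pd.DataFrame({
    "group":     ["expert", "expert", "volunteer", "volunteer"],
    "criterion": ["accuracy", "naturalness", "accuracy", "naturalness"],
    "score":     [5, 4, 4, 4],
})

# MOS = mean of the 1-5 scores, per criterion and per evaluator group.
mos = answers.groupby(["criterion", "group"])["score"].mean().unstack()
print(mos)
```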
6 CONCLUSION

The paper presented an end-to-end deep learning approach to speech synthesis in the Macedonian language. The developed model uses Google's Tacotron architecture and generates speech from the text of multiple speakers using an attention mechanism. The approach consists of three parts: an encoder, an attention-based decoder and a post-processing network. The model was trained on a dataset recorded by five mixed-gender speakers, resulting in nearly 25.5 hours of data. The results show that the model successfully generates speech from text, which was empirically shown using a quantitative questionnaire answered by 42 subjects.

To the best of our knowledge, this is the first end-to-end multi-speaker deep learning model for the Macedonian language. We strongly believe that it will serve as a benchmark and motivation for future studies, eventually leading to a decent TTS system for Macedonian, which would have significant societal impact.

Some limitations of the model are the gender diversity of the speakers and the limited dataset. There is definitely room for improvement, and the dataset probably plays a crucial role in it. However, data collection is an extensive and very time-consuming task, and with the given dataset we cannot estimate or empirically evaluate how much more data is needed to achieve state-of-the-art intelligibility and naturalness of artificially created speech. Additionally, a few of the generated samples contain pauses at places where a word should be spoken. The reason is that, when generating sound, the model uses character embeddings with a specific ordering learned during training; if those embeddings were never seen during training, the model will not be able to pronounce them properly. Note that this does not happen for all words absent from the training data, only on very rare occasions; normally, the model still generates speech even for words not present in the dataset.

ACKNOWLEDGEMENT

We are thankful for the support of the NVIDIA Corporation and their generous donation of a Titan Xp GPU.

REFERENCES
[1] P. Taylor. Text-to-speech synthesis. Cambridge University Press, 2009.
[2] H. Zen, K. Tokuda, A. W. Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, 2009.
[3] Y. Agiomyrgiannakis. Vocaine the vocoder and applications in speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 4230–4234. IEEE, 2015.
[4] W. Chan, N. Jaitly, Q. Le, O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 4960–4964. IEEE, 2016.
[5] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
[6] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
[7] D. Bahdanau, K. Cho, Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[8] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[9] S. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, M. Shoeybi. Deep Voice: Real-time neural text-to-speech. arXiv:1702.07825, 2017.
[10] W. Wang, S. Xu, B. Xu. First step towards end-to-end parametric TTS synthesis: Generating spectral parameters with neural attention. In Proceedings of Interspeech, pp. 2243–2247, 2016.
[11] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, Y. Bengio. Char2Wav: End-to-end speech synthesis. ICLR 2017 workshop submission, 2017.
[12] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[13] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015.
[14] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
[15] J. Chung, C. Gulcehre, K. H. Cho, Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[16] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[17] S. Ioffe, C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[18] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015.
[19] D. Kingma, J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[20] Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, N. Harada. Deep Griffin-Lim iteration. 2019.
[21] J. Wodecki. Intuitive explanation of the Griffin-Lim algorithm. 2018.
[22] Synthesized test audio files: https://drive.google.com/drive/folders/1LkgKAKcD9qNMw_3stbHEhszxhrPyPmAA?usp=sharing
[23] Quantitative questionnaire used for evaluation of the model: https://docs.google.com/forms/d/e/1FAIpQLSeJJJVRjU3tzbLi1mix9buNOs002GFaTvSp9TVO752OCPNUvA/viewform?fbclid=IwAR1bLE8hrEALj7MwHkAgDKrf0JfyClD-DTuCiGdJ8Nc68Jl1XYv_1_MRxoE
[24] P. C. Loizou. Speech quality assessment. University of Texas at Dallas, Department of Electrical Engineering, Richardson, TX, USA.
[25] M. Trajanoska, H. Gjoreski. Towards end-to-end speech recognition in Macedonian language. BalkanCom, 2019.
[26] MAIKA: https://maika.mk/
[27] eSpeak: http://espeak.sourceforge.net/

Improving Mammogram Classification by Generating Artificial Images

Ana Peterka (University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia, anapeterka1151@gmail.com), Zoran Bosnić (University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia, zoran.bosnic@fri.uni-lj.si), Evgeny Osipov (Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Luleå, Sweden, evgeny.osipov@ltu.se)

ABSTRACT

Training a deep convolutional neural network (DCNN) from scratch is difficult, because it requires large amounts of labeled training data. This is a big problem especially in the medical domain, since datasets are scarce and the data is often imbalanced, which can result in overfitting the model. Fine-tuning a model that has been pre-trained on a large dataset shows promising results. Another approach is to augment the dataset with artificially generated learning examples. In this paper, we augment the learning set with artificially generated images produced by a conditional infilling GAN. The results show that we can relatively easily generate realistic-looking mammograms that improve the classification of benign and malignant mammograms.

KEYWORDS

data augmentation, transfer learning, CNN, ResNet-50, GAN, ciGAN

1 INTRODUCTION

Breast cancer is a cancer found in the tissue of the breast, when abnormal cells grow in an uncontrolled way. It can affect both women and men, though it is prevalent in women. Statistics show that it has the highest mortality rate of any cancer in women worldwide and that 1 in 8 women in the EU will develop breast cancer before the age of 85 (https://www.europadonna.org/breast-cancer-facs/). Screening mammography helps diagnose cancer at an early stage, which significantly increases survival rates. However, the evaluation of mammograms performed by doctors and radiologists is tedious, lengthy and error-prone, and it results in a high number of false positives.

New approaches in deep learning (DL), in particular convolutional neural networks (CNNs), have proven their potential for medical imaging classification tasks. This could relieve radiologists and give patients quicker and more accurate diagnoses. However, the performance of CNNs depends on large labeled datasets, which are hard to obtain in the medical imaging field due to the privacy concerns of patients and the time-consuming expert annotations. Furthermore, the data is often imbalanced, meaning that pathological findings are relatively rare. This can result in overfitting the model and poor generalization ability.

So far, this problem has been addressed with transfer learning and data augmentation techniques. In this paper, we evaluate these techniques on CBIS-DDSM, a publicly available dataset that contains benign and malignant mammograms. We propose a novel approach of generating new images with Generative Adversarial Networks (GANs), combined with traditional data augmentation such as horizontal flipping and rotations, and evaluate whether increasing the dataset helps to achieve better classification. We also test whether fine-tuning a ResNet-50 model helps improve the results.

The paper is structured as follows.
Section 2 presents the related work, Section 3 describes the data augmentation techniques used, Section 4 the training process, and Section 5 the evaluation metrics used and the results; in Section 6 we state our conclusions and discuss prospective future work.

2 RELATED WORK

This section provides a brief review of past work, which falls into three categories:
1. improved classification with traditional data augmentation,
2. improved classification with synthetic images generated by a generative adversarial network,
3. transfer learning and fine-tuning.

The problem with small datasets, especially in the medical domain, is that models trained on them tend to overfit the data. There are many approaches to reduce overfitting, such as batch normalization, dropout, data augmentation and transfer learning. Traditional data augmentation based on affine transformations, such as translation, rotation, shearing, flipping and scaling, is the most widely used and very easy to implement. Such transformations are ubiquitous in computer vision tasks and show very promising results [1]. However, they do not bring any new visual features that could further improve the generalization of the CNN.

Synthetic image generation with GANs adds more variability to the dataset and further improves the robustness of the classification network. GANs were inspired by game theory: two neural networks are pitted against each other in a minimax game. They were first introduced in [2], and they have recently been applied to many different medical imaging applications, mostly for image-to-image translation and image inpainting.

Transfer learning and fine-tuning for mammography images was the main topic of [4] and [5]. In [4], the authors demonstrated that a whole-image model trained on DDSM can easily be transferred to INbreast without using its lesion annotations and with only a small amount of training data. In [5], the authors showed that a ResNet-50 model pre-trained on ImageNet can be fine-tuned to perform tumor classification on the CBIS-DDSM dataset.

In this paper, we first use traditional data augmentation techniques and then additionally augment the dataset by applying the ciGAN (conditional infilling GAN). We evaluate the improvements with a fine-tuned ResNet-50 model.

3 AUGMENTING THE DATASET

In this section, we first describe the dataset, then explain the traditional data augmentation methods used and a GAN method for synthesizing new images.

3.1 The CBIS-DDSM dataset

CBIS-DDSM [6] is a publicly available dataset that contains digitized images from scanned films of mammograms. It is a subset of the DDSM dataset consisting of only benign and malignant cases, while DDSM also contains normal ones. The data was acquired from 1,566 patients and contains both mediolateral oblique (MLO) and craniocaudal (CC) views of each breast. The images are grayscale and have corresponding binary masks that indicate the mass, as well as ROI images of that mass. The images are in DICOM format, which is the standard for medical imaging information. The data is already split into a training and a testing set; we used part of the testing set as a validation set for the classification network.

3.2 Traditional data augmentation

To compensate for the lack of training images, we used classical data augmentation techniques, in particular horizontal flipping, rotations of up to 30°, and a zoom range from 0.75 to 1.25, and tested whether this improves the performance of the CNN.
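These settings map directly onto Keras' ImageDataGenerator; a sketch follows (the patch arrays are placeholders, not the CBIS-DDSM loader):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(horizontal_flip=True,     # random mirroring
                               rotation_range=30,        # rotations up to 30 degrees
                               zoom_range=[0.75, 1.25])  # zoom out/in

patches = np.random.rand(8, 256, 256, 1).astype("float32")  # placeholder patches
labels = np.array([0, 1] * 4)
batches = augmenter.flow(patches, labels, batch_size=4)
x_batch, y_batch = next(batches)   # a randomly transformed training batch
```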
3.3 Data augmentation with GANs

To further augment and balance the dataset, we use a GAN variant called conditional infilling GAN (ciGAN) [3], which the authors of [3] used to synthesize lesions on mammograms. GANs are a type of generative model, which means that they are able to produce novel examples based on the training data. They consist of two neural networks, a generator and a discriminator, which are pitted against each other: the generator tries to capture the data's distribution, while the discriminator tries to distinguish real from generated examples. By training them simultaneously, the generator gets better at generating realistic data, while the discriminator gets better at distinguishing real from fake data. In the case of ciGAN, the generator is based on a cascaded refinement network (CRN) [8], where features are generated at multiple scales before being concatenated, which yields more realistic image synthesis.

In our approach, we apply the ciGAN to sample a location on a healthy mammogram and then synthesize a lesion at that location, as shown in Figure 1. The input is a concatenated stack of the following (a NumPy sketch follows Figure 1):
- a corrupted image (a one-channel grayscale image with the lesion replaced by a uniform distribution of values between 0 and 1),
- a binary mask that marks the lesion (1 representing the location of the lesion, and zeros elsewhere), and
- the class label ([1,0] representing the non-malignant class, and [0,1] representing the malignant class).

Figure 1: The ciGAN architecture. The input consists of two one-channel images and 2 class channels indicating the malignant/benign label. The output of the generator is fed, together with the real image, into the discriminator, which predicts whether each image is generated or original, and whether it contains benign or malignant lesions.
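A sketch of how such an input stack could be assembled with NumPy; the function name and shapes are illustrative, not the authors' code:

```python
import numpy as np

def build_cigan_input(patch, lesion_mask, malignant):
    """Corrupted image + binary mask + two class channels -> (256, 256, 4)."""
    corrupted = patch.copy()
    # Replace the lesion area with uniform noise in [0, 1).
    corrupted[lesion_mask == 1] = np.random.uniform(size=int(lesion_mask.sum()))
    label = [0.0, 1.0] if malignant else [1.0, 0.0]  # [1,0] benign, [0,1] malignant
    class_planes = [np.full(patch.shape, v, dtype=np.float32) for v in label]
    return np.stack([corrupted, lesion_mask.astype(np.float32), *class_planes],
                    axis=-1)

patch = np.random.rand(256, 256).astype(np.float32)
mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:140, 100:140] = 1                    # toy lesion location
x = build_cigan_input(patch, mask, malignant=True)
print(x.shape)                                # (256, 256, 4)
```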
The generator is comprised of multiple convolutional blocks. The first convolutional block receives the input stack downsampled to a 4x4 resolution, and the resolution doubles between consecutive blocks: the next convolutional block is fed the concatenation of the output of the first block, upsampled to 8x8, and the input stack resized to 8x8. This is repeated until a resolution of 256x256 is reached. The discriminator has a similar but inverse structure.

3.4 Differences to the related work

Our work is based on the aforementioned ciGAN [3], with a few improvements. While the former method was trained on non-malignant versus malignant cases, our approach uses benign and malignant cases, since we believe that the real hardship is distinguishing the lesions, not merely noticing them. Images in the original work show that, to acquire synthetic non-malignant mammograms, the lesion was removed, turning the picture into a normal mammogram. Since we used a sliding-window approach to extract normal patches instead of the mask, we did not have to remove the malignant lesion; we applied both masks independently and thus obtained only benign and malignant cases. All generated benign cases contain a lesion. We also applied zooming and rotation to the lesions before generating new images, so our generated images have more diverse tumors.

4 GENERATING ARTIFICIAL IMAGES

4.1 Preprocessing

To extract the 256x256-pixel patches that are fed into the ciGAN, we used a sliding window technique. The program loops through the whole mammogram image with a stride of 128 and checks whether the rectangular region overlaps the majority of the breast. It also checks whether the patch contains a lesion or only normal breast tissue, and labels it accordingly, by comparing the same region of the corresponding binary mask. In the end, the patch dataset contains 5,466 images: 1,743 normal, 2,198 benign and 1,525 malignant.

After acquiring the patch dataset, the program loops through all the patches containing only normal tissue. For each normal patch, it randomly chooses one patch that contains a lesion. The lesion patch is then randomly zoomed in or out by a small factor, to obtain more diverse masses. Next, we check whether the location of the lesion corresponds to breast tissue (and not background) on the normal patch. If not, the next random lesion patch is chosen and the process is repeated until a suitable match is found. Once a suitable pair is obtained, the normal image is corrupted by replacing the area defined by the lesion mask with a uniform distribution.

4.2 Loss functions

The ciGAN model is trained by utilizing three loss functions [3]:
- Perceptual loss: a loss calculated between the ground truth and the output image. Unlike a per-pixel loss, which is based on differences between individual pixels, it measures the discrepancy between high-level perceptual features extracted from pretrained networks [9]. It encourages the generator to output images with high-level features similar to those of the original image. In this case, the VGG-19 [10] convolutional neural network, pretrained on the ImageNet dataset, is used. It can be written as L_p = || phi(R) - phi(S) ||_1, where R denotes a real image, S a synthetic image and phi a feature function.
- Boundary loss: used to encourage smooth blending between the infilled component and the context of the generated image. It is an L1 difference between the real and generated images at the boundary, L_b = || w * (R - S) ||_1, where w denotes the lesion mask with a Gaussian filter of standard deviation 10 applied, and * is the element-wise product.
- Adversarial loss: the general GAN loss, defined as a distance between the true and the generated distribution at the current iteration. Its goal is to converge to the equilibrium in the minimax game between the generator G and the discriminator D, min_G max_D E[log D(R, c)] + E[log(1 - D(S, c))], where c denotes the class label.
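The following PyTorch sketch mirrors the verbal loss definitions above; the exact weighting, reductions and conditioning used in [3] may differ, and the feature extractor here is an identity stand-in for VGG-19.

```python
import torch

def perceptual_loss(feat_fn, real, fake):
    # L1 distance between high-level features phi(R) and phi(S).
    return torch.mean(torch.abs(feat_fn(real) - feat_fn(fake)))

def boundary_loss(real, fake, w):
    # L1 difference weighted by the Gaussian-blurred lesion mask w.
    return torch.mean(w * torch.abs(real - fake))

def adversarial_loss(d_real, d_fake):
    # Standard GAN minimax objective, written as separate D and G losses;
    # the class conditioning is omitted in this sketch.
    d_loss = -torch.mean(torch.log(d_real) + torch.log(1.0 - d_fake))
    g_loss = -torch.mean(torch.log(d_fake))
    return d_loss, g_loss

r, s = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
w = torch.rand(1, 1, 256, 256)            # blurred-mask placeholder
feat = lambda t: t.flatten(1)             # identity stand-in for VGG-19 features
d_out = torch.sigmoid(torch.randn(2, 8, 1))
print(perceptual_loss(feat, r, s).item(), boundary_loss(r, s, w).item(),
      [v.item() for v in adversarial_loss(d_out[0], d_out[1])])
```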
4.3 Training

The ciGAN is first pretrained on the perceptual loss for 300 epochs. Then the training of the discriminator and the generator alternates, switching whenever the loss of either drops below 0.3, for an additional 2000 epochs. The ciGAN produces realistic images, as shown in Figure 2.

Figure 2: A generated sample from ciGAN. Image 1 is the normal image without a lesion, image 2 is the binary mask representing the random malignant lesion, image 3 is the corrupted image and image 4 is the synthesized image with a malignant lesion.

5 EVALUATION AND RESULTS

Three metrics were used to evaluate the results. The first is accuracy, which tells us how many examples were correctly classified. The second is recall (sensitivity), which is the fraction of true positives over the sum of true positives and false negatives; it is the most important metric in this case, due to the risk of overlooking cancer. The third is the Area Under Curve (AUC), which measures the area under the ROC curve. We evaluate the results by performing five experiments:
1. Shallow CNN [11]: implemented as the baseline. The network is fed a patch and classifies it as either malignant or benign. It consists of three convolutional blocks, each composed of 3x3 convolutions, batch normalization, the ReLU activation function and max pooling, followed by three dense layers and a softmax function for binary classification.
2. ResNet-50: classifying the data with a ResNet-50 [12] without fine-tuning.
3. ResNet-50 with fine-tuning: checking whether transfer learning improves the results.
4. ResNet-50 + traditional data augmentation.
5. ResNet-50 + traditional data augmentation and generated artificial images.
As in [5], we fine-tuned the ResNet-50 [12] model with ImageNet weights. It is an extremely deep neural network with 150+ layers, consisting of convolutional layers, pooling layers and multiple residual blocks. In the residual blocks, the layers feed into the next layer and also directly into layers two to three hops away. The input to the ResNet-50 model is a patch of size 224x224x3. Since mammograms have only a grayscale channel, the gray values are copied over all three channels. We used the Adam optimizer with an initial learning rate of 10^-5, beta1 = 0.9, beta2 = 0.999, epsilon = 10^-8 and ImageNet weight initialization. We trained for 50 epochs with a batch size of 32 and a 0.9 learning rate decay every 30 epochs (a sketch of this setup is given after Table 1).

Table 1 shows the obtained results. Fine-tuning ResNet-50 alone already improved the results. Combining ResNet-50 with traditional data augmentation yielded even better performance metrics. Finally, by enlarging the dataset with relatively small amounts of synthetic images while simultaneously balancing it, we improved accuracy and AUC even further, at the cost of a slight decrease in recall.

Table 1: The obtained accuracy, recall and AUC scores

                                        accuracy  recall   AUC
  Shallow CNN                           0.57267   0.44810  0.54943
  ResNet-50 without finetuning          0.58295   0.53859  0.58634
  ResNet-50                             0.60155   0.55769  0.59443
  ResNet-50 + traditional               0.67132   0.64231  0.66666
  ResNet-50 + traditional + artificial  0.76145   0.61538  0.71638
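A sketch of the fine-tuning setup in Keras with the optimizer settings quoted above; the classification head is an assumption, as the paper does not describe it.

```python
import numpy as np
import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3), pooling="avg")
head = tf.keras.layers.Dense(2, activation="softmax")(base.output)
model = tf.keras.Model(base.input, head)

# Optimizer settings from the text: Adam with lr = 1e-5,
# beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, beta_1=0.9,
                                                 beta_2=0.999, epsilon=1e-8),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Grayscale patches are tiled over three channels before being fed in.
gray = np.random.rand(4, 224, 224, 1).astype("float32")   # placeholder batch
rgb = np.repeat(gray, 3, axis=-1)                          # (4, 224, 224, 3)
```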
6 CONCLUSION

In this paper we discussed overcoming the obstacle of a small and imbalanced mammography dataset. We proposed an approach for generating artificial images with a conditional infilling GAN (ciGAN). The results showed that we can relatively easily generate realistic-looking mammograms that improve the classification of benign and malignant mammograms. Further, we evaluated the learning performance when using fine-tuning, classical data augmentation and synthetic examples. Each of these techniques improved classification, with the best results obtained when using all three together.

Comparing the results to the previously developed method [3], we obtained worse results in terms of AUC, but we believe the reason is that all our images contain lesions, which must be harder for a neural network to distinguish than separating non-malignant from malignant images.

Testing these methods on different medical datasets shall be the subject of future work. One may also consider using these methods on bigger datasets to improve the current state-of-the-art algorithms. Since the ciGAN's discriminator was also conditioned on the class, we intend to extract its features and use them for classification on another mammography dataset, for example INbreast. We also plan to add more synthetic images to the dataset, to see whether we can further improve the classification.

Currently, mammogram classification is performed by doctors and radiologists, but we hope that improving the classification with machine learning, combined with these and similar techniques, could relieve them of such tasks in the near future.

REFERENCES
[1] Wang, J., & Perez, L. (2017). The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit, 11.
[2] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672–2680).
[3] Wu, E., Wu, K., Cox, D., & Lotter, W. (2018). Conditional infilling GANs for data augmentation in mammogram classification. In Image Analysis for Moving Organ, Breast, and Thoracic Images (pp. 98–106). Springer, Cham.
[4] Shen, L. (2017). End-to-end training for whole image breast cancer diagnosis using an all convolutional design. arXiv preprint arXiv:1711.05775.
[5] Agarwal, R., Diaz, O., Lladó, X., & Martí, R. (2018). Mass detection in mammograms using pre-trained deep learning models. In 14th International Workshop on Breast Imaging (IWBI 2018) (Vol. 10718, p. 107181F). International Society for Optics and Photonics.
[6] Lee, R. S., Gimenez, F., Hoogi, A., Miyake, K. K., Gorovoy, M., & Rubin, D. L. (2017). A curated mammography data set for use in computer-aided detection and diagnosis research. Scientific Data, 4, 170177, https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM.
[7] Odena, A., Olah, C., & Shlens, J. (2017). Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (pp. 2642–2651).
[8] Chen, Q., & Koltun, V. (2017). Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1511–1520).
[9] Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (pp. 694–711). Springer, Cham.
[10] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[11] Lévy, D., & Jain, A. (2016). Breast mass classification from mammograms using deep convolutional neural networks. arXiv preprint arXiv:1612.00542.
[12] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).

Mobile Nutrition Monitoring System: Qualitative and Quantitative Monitoring

Nina Reščič (nina.rescic@ijs.si, Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia), Marko Jordan (Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia), Jasmijn de Boer (ConnectedCare, Nijmegen, Netherlands), Ilse Bierhoff (ConnectedCare, Nijmegen, Netherlands), Mitja Luštrek (mitja.lustrek@ijs.si, Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia)

ABSTRACT

The WellCo project (http://wellco-project.eu) aims to provide a mobile application featuring a virtual coach for behaviour changes towards a healthier lifestyle. The nutrition monitoring module consists of two main parts: qualitative (a Food Frequency Questionnaire) and quantitative (eating detection and bite counting). In this paper we present the nutrition monitoring module that connects both monitoring aspects, as implemented in the virtual coach (mobile application).

KEYWORDS

nutrition monitoring, eating detection, FFQ

1 INTRODUCTION

Proper nutrition habits are beneficial for a healthy lifestyle and help to prevent many chronic diseases, such as cancer, diabetes and hypertension. Automated monitoring has become very important in nutrition monitoring, but it only gives quantitative information (when is the user eating, how much did they eat, ...), while qualitative information (what is the user eating) is acquired using 24-hour food recall diaries or Food Frequency Questionnaires (FFQs). In the WellCo project we aimed to develop a user-friendly nutrition module which monitors both the qualitative and the quantitative aspects of the users' nutrition. We combined a self-reported FFQ, the Extended Short Form Food Frequency Questionnaire (ESFFFQ), developed and validated in the project [5], with automated monitoring using a commercially available wearable smartwatch. This paper describes the developed module and the improvements we made since our previous papers [5, 2, 7].
By using wrist-worn devices to collect data, it is possible to recognize eating gestures [4] or even count "bites" or assess caloric intake [10]. Mirtchouk et al. [3] explored eating detection using several sensors and combining real-life and laboratory data. Thomaz et al. [8] proposed a method that recognizes each intake gesture separately and then clusters the intake gestures within a 60-minute interval.

For qualitative monitoring, we evaluated both dietary recalls and FFQs as self-reporting methods. However, dietary recalls require typing or complex food item selection, which can be cumbersome on mobile devices, so we opted for an FFQ. FFQs are the most commonly selected tools in nutrition monitoring, as they are efficient, cost-effective and non-invasive [9, 6]. The developed FFQ covers all key aspects of a healthy diet and is modular, so that only questions pertaining to certain aspects can be asked. This is important in ubiquitous settings, where one wishes to minimize the required inputs from the user.

To our knowledge, the developed application module is the first to combine qualitative (a validated FFQ) and quantitative monitoring (a bite counting method) and to provide recommendations based on the data gathered by monitoring.

2 METHOD

2.1 Method Overview

This paper describes the nutrition monitoring module developed in the WellCo project.

The qualitative monitoring starts with a five-question questionnaire that provides essential information about the user's diet. Based on this, some goals to improve the user's nutrition can already be recommended. However, the users are invited to answer a more extensive questionnaire that paints a more complete picture and allows recommending more goals. This questionnaire is an extended version of a validated questionnaire, and the extension was validated by us [5]. How successful the users are at achieving their goals is monitored with goal-specific questions on a bi-weekly basis.

The quantitative monitoring uses the accelerometer and gyroscope in a smartwatch to detect micromovements related to eating (e.g., picking up food, putting it into the mouth). From a sequence of such micromovements, we then recognise whether the user has made one "bite" (taken the food to the mouth). The improved method uses a convolutional neural network to recognise the micromovements and an LSTM neural network to recognise bites. The latter achieved higher accuracy, so it was selected for integration into the WellCo system.

2.2 FFQ - Qualitative Monitoring

When choosing goals that would help users of the WellCo virtual coach towards behavioural changes for a healthier lifestyle, we leaned on the national dietary recommendations of the pilot countries and on dietary recommendations for the elderly, combined with the expert knowledge of the nutritionist involved in the project.

Guidelines specifically for the elderly are very similar to the national dietary recommendations in all three countries involved in the pilots (Italy, Spain and Denmark), but they put additional emphasis on dairy consumption, as it is a good source of protein and calcium, which are beneficial and often under-consumed; on drinking enough water, as dehydration is often a problem with the elderly; and on leucine consumption (found in milk, peanuts, oatmeal, fish, poultry, egg white, wheat sprouts, etc.). Given these recommendations, we chose the goals that WellCo users will be suggested to follow in order to improve their diet: fruit consumption, vegetable consumption, sugar consumption, fat consumption, fibre consumption, protein consumption, salt consumption, fish consumption and water consumption.

In our search for a comprehensive but still short FFQ, we found a validated questionnaire named the Short Food Frequency Questionnaire (SFFQ) [1], which consists of 23 questions and fully covers five of our chosen goals: fruit, vegetable, sugar, fat and fish consumption. To cover the four missing goals (protein, fibre, salt and water consumption), we added 8 additional questions, turning the SFFQ into the so-called Extended Short Food Frequency Questionnaire (ESFFQ). The validation of the questionnaire is described in our previous paper [5].
2.3 Quantitative Monitoring

The main objective of the smartwatch-based nutrition monitoring is bite counting (counting the number of times the user takes food to the mouth). The bite-counting algorithm described in [2] was used as the base for all of the following work. When deciding how to present the results of the developed algorithm to the users in the mobile application, we had to make some improvements to our model. As the number of bites alone does not give much useful information to the users, we decided to join individual bites into meals and to categorize each meal as a snack, a small meal or a big meal.

2.3.1 Datasets. To construct the bite detection algorithm, we created the Wild Meals Dataset (WMD). It includes 51 sessions and 99 meals, with known starting and ending time points, belonging to 11 unique subjects, recorded "in the wild". For 68 of those meals we also obtained the approximate number of corresponding bites, since the subjects were asked to count them while eating. Additionally, we used the publicly available Food Intake Cycle (FIC) and Free Food Intake Cycle (FreeFIC) datasets. All datasets contain tri-axial signals from accelerometers and gyroscopes in wrist devices, with a sampling frequency of 100 Hz.

2.3.2 Meal detection method. The algorithm for meal detection is comprised of two parts: in the first part, probabilities that given time periods are part of eating are assigned; in the second part, these probabilities are grouped together to form meals.

First, we linearly interpolated all accelerometer and gyroscope measurements, as well as the bite probabilities, to a 4 Hz frequency. Next, normalization was applied to the interpolated accelerometer and gyroscope data. We constructed 90 s long sliding windows with a 2.5 s step. Each window contained 360 of the previously obtained accelerometer, gyroscope and bite probability values (obtained with the CNN and LSTM networks, as described in [2]). The 4 Hz frequency was used to achieve faster training and prediction, while also enabling us to construct longer windows. A window was labelled as a positive instance if the majority of the window belonged inside a meal.
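A NumPy sketch of this window construction; the signal array and labels are synthetic placeholders.

```python
import numpy as np

FS = 4                   # signals resampled to 4 Hz
WIN = 90 * FS            # 90 s window -> 360 samples
STEP = int(2.5 * FS)     # 2.5 s step -> 10 samples

def make_windows(signals, eating):
    """signals: (n, 7) array of 3 accelerometer + 3 gyroscope axes + the bite
    probability, already interpolated to 4 Hz and normalized; eating: per-sample
    0/1 flags. A window is positive when most of it lies inside a meal."""
    xs, ys = [], []
    for start in range(0, len(signals) - WIN + 1, STEP):
        xs.append(signals[start:start + WIN])
        ys.append(int(eating[start:start + WIN].mean() > 0.5))
    return np.stack(xs), np.array(ys)

x, y = make_windows(np.random.randn(4000, 7), np.random.randint(0, 2, 4000))
print(x.shape, y.shape)   # (365, 360, 7) (365,)
```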
application, we had to make some improvements to our model. As • Round 2: Find all probabilities that are higher than a p2 the number of bites does not really give much useful information threshold and group them together, if they are immediately to the users, we decided to join individual bites into meals and next to each other. For each group find the time distance to recognize meals as snack, small meal or big meal. to its nearest group. Finally remove all groups that have either 1 or 2 members and are more than t2 seconds away 2.3.1 Datasets. To construct the bite detection algorithm, we from the corresponding nearest group. created the Wild Meals Dataset (WMD). It includes 51 sessions • Round 3: If there exist any two groups of the form [A,B] and 99 meals, with known starting and ending time points, be- and [C,D], where 0 ≤ C − B ≤ t3 (all in seconds), combine longing to 11 unique subjects, recorded ’in-wild’. For 68 of those these two groups together to form a new group, [A,D]. meals we have also obtained the approximate number of the This means that indices in [A,D] can now represent the corresponding bites, since the subjects were asked to count them probabilities of zero as well. while eating. Additionally we used the publicly available The • Round 4: Similar as Round 3, but with a t4 parameter in Food Intake Cycle (FIC) dataset and The Free Food Intake Cycle place of t3. (FreeFIC). All datasets contains tri-axial signals from accelerome- At this point the probabilities of windows, previously temporar- ters and gyroscopes in wrist devices with the sampling frequency ily set to zero, are switched back to their original values. For of 100 Hz. the final model, we obtained the following values of the above hyperparameters: 2.3.2 Meal detection method. The algorithm for meal detection Since p2 > p1, this means that Round 1 in this particular case was comprised of two parts: in the first part probabilities that was not necessary, although in some other cases it could have given time periods are part of eating were assigned, whereas been. Once the candidate meals have been obtained, the features in the second part these probabilities were grouped together to are constructed for the ensemble of random forest, support vector form a meal. machine, knn and gradient boosting algorithms. The ensemble 81 Mobile Nutrition Monitoring System: Qualitative and Quantitative Monitoring Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia Table 1: Architecture of the network Type Units/Nodes Kernel/stride Output 1x1 4x1 prep 4x1 6x1 prep 6x1 Pool Inception-A 360x1x128 32 64 32 Max pool 3x1/2 180x1x128 Inception-B 180x1x128 32 64 64 16 16 16 Max pool 3x1/2 90x1x128 Inception-B 90x1x128 32 64 64 16 16 16 Max pool 3x1/2 45x1x128 Inception-B 45x1x128 32 64 64 16 16 16 Max pool 3x1/2 23x1x128 GRU 23x32 GRU 32 Dense 64 64 Dropout(0.36) 64 Dense 2 2 Table 2: Hyperparameters. Table 3: Results of bite recognition and meal detection al- gorithm. p1 t1(sec) p2 t2(sec) t3(sec) t4(sec) 0.46 61 0.87 120 63 61 F1-score precision recall cov_area outside_area Avg. 0.76 0.88 0.72 0.81 0.03 makes the final decision whether a candidate meal is in fact a Table 4: Example of recommendations for qualitative meal or not. The following features are created for each candidate monitoring (goal_sugar) and quantitative monitoring (nu- meal: trition_number_of_meal). 
3 RESULTS

3.1 Bite Counting

Table 3 presents the results of the evaluation of our work. The analysis of the entire pipeline is based on leave-one-subject-out double cross-validation. The following definitions were used for calculating the statistics:

- True positive prediction of a meal: any prediction of the respective meal for which the majority of the prediction lay inside the ground-truth meal. If there was more than one prediction of eating for a certain meal, only one prediction is counted as a true positive, but the others are not regarded as false positives. This is due to the possibility that the subjects did not eat during their entire recording time; as such, it did not seem reasonable to penalize the pipeline for predicting more than one meal. However, only one true positive is counted, in order not to encourage the algorithm to predict a bundle of eating instances.
- For the F1-score, precision and recall, definition A was used, while cov_area and outside_area used definition B. The double cross-validation results show that, with one exception, every ground-truth meal had at most one corresponding true positive predicted meal.
- Covered area (cov_area): for a given ground-truth meal, the length of the areas of the corresponding true positive meals that lay inside the ground-truth meal, divided by the length of the ground-truth meal.
- Outside area (outside_area): for a given predicted true positive meal, the length of the area that lay outside the corresponding ground-truth meal, divided by the length of the predicted meal.

Table 3: Results of the bite recognition and meal detection algorithm

        F1-score  precision  recall  cov_area  outside_area
  Avg.  0.76      0.88       0.72    0.81      0.03
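The two area statistics reduce to interval-overlap computations; a small sketch over [start, end] pairs in seconds:

```python
def overlap(a, b):
    """Length of the intersection of two [start, end] intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def cov_and_outside(truth, pred):
    """cov_area: fraction of the ground-truth meal covered by the prediction;
    outside_area: fraction of the prediction lying outside the ground truth."""
    inter = overlap(truth, pred)
    cov = inter / (truth[1] - truth[0])
    outside = 1.0 - inter / (pred[1] - pred[0])
    return cov, outside

print(cov_and_outside((100, 200), (120, 230)))  # (0.8, 0.2727...)
```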
3.2 Application Implementation

The application shows users the detected meals, the number of bites and the quality score for the chosen goals (see Figure 1). Based on the results, the user is additionally shown recommendations to follow in order to improve their nutrition. Examples of recommendations for both qualitative and quantitative monitoring are shown in Table 4.

Figure 1: Application view for both monitoring tasks.

Table 4: Example of recommendations for qualitative monitoring (goal_sugar) and quantitative monitoring (nutrition_number_of_meal)

  goal_sugar: It seems you don't eat enough vegetables. Vegetables are important sources of many nutrients, such as vitamins, minerals and dietary fibre. Try to eat 2 servings of vegetables per day. A serving is 1 cup of fresh or half a cup of cooked vegetables.
  nutrition_number_of_meal: Try to eat 3–5 meals per day (e.g. 3 bigger, 2 smaller). Avoid snacking between meals.

4 CONCLUSION

The developed nutrition monitoring module consists of two parts: qualitative monitoring and quantitative monitoring. Both developed modules are implemented in a mobile application. In future work we would like to improve the developed eating detection and bite counting algorithms.

The developed FFQ (ESFFFQ) can be used to support a wide range of nutrition goals and minimizes the number of questions asked, so it is suitable for mobile nutrition monitoring. To make the application user-friendly, the questions from the FFQ are not all asked at the same time, but separately over the course of a fortnight. This means that some of the questions won't be asked, hence it is really important to ask the right questions. In future work we will explore the problem of question ranking; with it, we would be able to ask the questions in a specific order and lose as little information as possible.

5 ACKNOWLEDGMENTS

The WellCo project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 769765.

REFERENCES
[1] Christine L. Cleghorn, Roger A. Harrison, Joan K. Ransley, Shan Wilkinson, James Thomas, and Janet E. Cade. 2016. Can a dietary quality score derived from a short-form FFQ assess dietary quality in UK adult population surveys? Public Health Nutrition, 19, 16, 2915–2923. doi: 10.1017/S1368980016001099.
[2] 2019. Counting bites with a smart watch. In Slovenian Conference on Artificial Intelligence: Proceedings of the 22nd International Multiconference Information Society, Volume A, 49–52.
[3] Mark Mirtchouk, Drew Lustig, Alexandra Smith, Ivan Ching, Min Zheng, and Samantha Kleinberg. 2017. Recognizing eating from body-worn sensors: combining free-living and laboratory data. 1, 3. doi: 10.1145/3131894.
[4] Raul I. Ramos-Garcia, Eric R. Muth, John N. Gowdy, and Adam W. Hoover. 2014. Improving the recognition of eating gestures using intergesture sequential dependencies. IEEE Journal of Biomedical and Health Informatics, 19, 3, 825–831.
[5] Nina Reščič, Eva Valenčič, Enej Mlinarič, Barbara Koroušić Seljak, and Mitja Luštrek. 2019. Mobile nutrition monitoring for well-being. In Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers (UbiComp/ISWC '19 Adjunct). London, United Kingdom, 1194–1197.
[6] J. S. Shim, K. Oh, and H. C. Kim. 2014. Dietary assessment methods in epidemiologic studies. Epidemiology and Health, 36. doi: 10.4178/epih/e2014009.
[7] Simon Stankoski, Nina Reščič, Grega Mežič, and Mitja Luštrek. 2020. Real-time eating detection using a smartwatch. Junction Publishing, USA.
[8] Edison Thomaz, Irfan Essa, and Gregory D. Abowd. 2015. A practical approach for recognizing eating moments with wrist-mounted inertial sensing. Association for Computing Machinery, New York, NY, USA. ISBN: 9781450335744. doi: 10.1145/2750858.2807545.
[9] Frances Thompson and T. Byers. 1994. Dietary assessment resource manual. The Journal of Nutrition, 124 (December 1994), 2245S–2317S. doi: 10.1093/jn/124.suppl_11.2245s.
[10] Shibo Zhang, William Stogin, and Nabil Alshurafa. 2018. I sense overeating. Information Fusion, 41, C (May 2018), 37–47. doi: 10.1016/j.inffus.2017.08.003.

Recognition of Human Activities and Falls by Analyzing the Number of Accelerometers and their Body Location

Miljana Shulajkovska, Hristijan Gjoreski
miljanash@gmail.com, hristijang@feit.ukim.edu.mk
Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Skopje, N. Macedonia
* Both authors contributed equally to this research.

ABSTRACT
This paper presents an approach to activity recognition and fall detection using wearable accelerometers placed at different locations on the human body. We studied how the location and the number of wearable accelerometers influence the recognition performance for activities and falls. The final goal was to build a machine learning model that can correctly recognize the activities and the falls using as few accelerometers as possible. The model was evaluated on a public dataset consisting of more than 850 GB of data, recorded by 17 people. In total, we evaluated 15 combinations of four accelerometers placed on the belt, the left ankle, the left wrist and the neck. The results showed that the neck and ankle accelerometers proved sufficient to correctly recognize all the activities and falls, with 94.2% accuracy; used individually, these two sensors achieved 94.02% and 93.4% accuracy, respectively.

KEYWORDS

activity recognition, fall detection, wearable sensors, machine learning

1 INTRODUCTION

According to the United Nations World Population Prospects 2019, by 2050 one in six people in the world will be over the age of 65 [1]. As people get older, their risk of falling also increases. Falls are a major public health problem in elderly people, often causing fatal injuries, and it is important to ensure that injured people receive assistance as quickly as possible. Because of this, building a good fall detection system is of great importance in helping medicine solve this problem.

The field of Human Activity Recognition (HAR) and fall detection has become one of the trendiest research topics due to the availability of low-cost, low-power sensors, i.e., accelerometers.
The recognition of human activities has been approached in two different ways, namely using ambient and wearable sensors [2]. In the former, the sensors are fixed in predetermined points of interest, so the inference of activities entirely depends on the voluntary interaction of the users with the sensors. In the latter, the sensors are attached to the body of the subject.

This paper presents a machine learning approach to activity recognition and fall detection using wearable accelerometers placed on different locations of the human body. The goal of the paper is to study how the location and the number of wearable accelerometers influence the performance of the recognition of the activities and the falls. This study is of practical importance for such systems, i.e., to build a machine learning model that can correctly recognize the activities and the falls using as few accelerometers as possible.

2 RELATED WORK
A considerable amount of work has been done in human activity recognition over the last decade, where many studies aim to identify activities based on data obtained from accelerometers as sensors widely integrated into wearable systems [3][4]. Researchers have reported high accuracy scores in detecting activities when investigating the best placement of the accelerometer on the human body [5][6][7]. Increasing the number of sensors increases the complexity of the classification problem. For these reasons, a number of studies have investigated the use of a single accelerometer. However, doing so generally decreases the number of activities that can be recognized accurately [8]. Consequently, one of the major considerations in activity recognition is the location or combination of locations of the accelerometers that provide the most relevant information.

In [5] the authors study the best location to place accelerometers for fall detection, based on the classification of postures. Four accelerometers were placed at the chest, waist, ankle and thigh. Statistical features were calculated for each axis of the accelerometer in addition to the magnitude. Results indicated that one accelerometer (chest or waist) by itself was not enough to sufficiently classify the activities (75%). There was, however, a significant improvement in classification accuracy achieved by combining the accelerometer at the chest or waist with one placed on the ankle (91%). Following the work described in [5], we explore this approach on a different dataset while investigating all possible sensor placement combinations.

3 ACTIVITY RECOGNITION

3.1 Dataset
In this research we used the UP-Fall Detection dataset, which is publicly available [9]. The dataset contains 17 subjects performing 11 activities. Each activity is performed 3 times. The activities are related to six simple human daily activities and five human falls, shown in Table 1. These types of activities and falls were chosen from the analysis of those reported in the literature [10][11]. All daily activities are performed for 60 s, except jumping, which is performed for 30 s, and picking up an object, which is an action done once within a 10-s period. A single fall is performed in each of the three ten-second trials.

Table 1: Activities performed in the dataset

Activity ID  Description                      Duration (s)
1            Falling forward using hands      10
2            Falling forward using knees      10
3            Falling backwards                10
4            Falling sideward                 10
5            Falling sitting in empty chair   10
6            Walking                          60
7            Standing                         60
8            Sitting                          60
9            Picking up an object             10
10           Jumping                          30
11           Laying                           60

To collect data from young healthy subjects without any impairment, the dataset follows a multimodal approach, sensing the activities in three different ways — with wearables, context-aware sensors and cameras — all at the same time. However, of particular interest to us is how acceleration data can be used for the recognition of activities. The analyzed data is obtained from accelerometers placed on the ankle, neck, wrist and belt. This way we created 15 different datasets representing every combination of these sensors, to show the importance of the placement of the accelerometer. In our research the sampling rate of the sensor is 18 Hz, which means 18 samples are provided every second. Figure 1 shows the raw data from a 3-axis accelerometer for a person performing three activities: standing, falling forward using hands and laying.

Figure 1: Raw data from a 3-axis accelerometer
3.2 Feature Extraction
Feature extraction is a very important step in the activity recognition process, in order to filter relevant information and obtain quantitative measures that allow signals to be compared. In our research we used statistical features to create the feature vectors. All the attributes are computed using the technique of overlapping sliding windows [5]. Because the final sampling frequency of our accelerometers was 18 Hz, we chose a window size of 18, which is a one-second time interval. We decided on a one-second time interval because among our target activities there are transitional activities (standing up and going down) that usually last from one to four seconds. Statistical attributes are extracted for each axis of the accelerometer. The feature extraction phase produces 36 features (summarized in Table 2) from the accelerations along the x, y and z axes. The first three features (Mean X/Y/Z) provide information about body posture, and the remaining features represent motion shape, motion variation and motion similarity (correlation).

Table 2: Overview of the extracted features. The number of features is represented with #

Feature name                        #
Mean (X, Y, Z)                      3
Standard deviation (X, Y, Z)        3
Root mean square (X, Y, Z)          3
Maximal amplitude (X, Y, Z)         3
Minimal amplitude (X, Y, Z)         3
Median (X, Y, Z)                    3
Number of zero-crossings (X, Y, Z)  3
Skewness (X, Y, Z)                  3
Kurtosis (X, Y, Z)                  3
First quartile (X, Y, Z)            3
Third quartile (X, Y, Z)            3
Autocorrelation (X, Y, Z)           3

Once the features are extracted (and selected), a feature vector is formed. During training, feature vectors extracted from training data are used by a machine learning algorithm to build an activity recognition model. During classification, feature vectors extracted from test data are fed into the model, which recognizes the activity.
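To make the windowing concrete, the following is a minimal sketch (our illustration, not the authors' code) of how such one-second statistical features could be computed with NumPy/SciPy; the 50% window overlap and the lag-1 autocorrelation are assumptions, since the exact overlap and autocorrelation lag are not specified above.

    import numpy as np
    from scipy.stats import skew, kurtosis

    def window_features(acc, fs=18, overlap=0.5):
        """acc: (n_samples, 3) accelerometer array sampled at fs Hz."""
        step = int(fs * (1 - overlap))  # assumed 50% overlap between windows
        feats = []
        for start in range(0, len(acc) - fs + 1, step):
            w, f = acc[start:start + fs], []
            for axis in range(3):
                x = w[:, axis]
                f += [x.mean(), x.std(), np.sqrt(np.mean(x ** 2)),  # mean, std, RMS
                      x.max(), x.min(), np.median(x),
                      int(np.sum(x[:-1] * x[1:] < 0)),              # zero-crossings
                      skew(x), kurtosis(x),
                      np.percentile(x, 25), np.percentile(x, 75),
                      np.corrcoef(x[:-1], x[1:])[0, 1]]             # lag-1 autocorrelation
            feats.append(f)
        return np.array(feats)  # 12 features per axis = 36 per window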
3.3 Methods
A machine learning approach was used for activity recognition. In this study, the machine learning task is to learn a model that will be able to classify the target activities (e.g., standing, sitting, falling) of the person wearing the accelerometers. For this purpose, we used 4 different machine learning algorithms: Random Forest, Support Vector Machine, k-Nearest Neighbors and Multilayer Perceptron.

The Random Forest (RF) classifier, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. The fundamental concept behind RF is the low correlation between the individual constituent models, which protects them from each other's individual errors.

The Support Vector Machine (SVM) method has also been broadly used in HAR, although it does not provide a set of rules understandable to humans. SVMs rely on kernel functions that project all instances to a higher-dimensional space with the aim of finding a linear decision boundary (i.e., a hyperplane) to partition the data.

The k-Nearest Neighbors (k-NN) is a supervised classification technique that uses the Euclidean distance to classify a new observation based on the similarity (distance) between the training set and the new sample to be classified.

The Multilayer Perceptron (MLP) [12] is an artificial neural network with a multilayer feed-forward architecture. The MLP minimizes the error function between the estimated and the desired network outputs, which represent the class labels in the classification context. Several studies show that the MLP is efficient in non-linear classification problems, including human activity recognition. A brief study of the MLP and other classification methods is given in [13][14].

4 EXPERIMENTS

4.1 Evaluation Techniques
To properly evaluate the models, we divided the data into train and test sets using leave-one-person-out cross-validation, in which each fold is represented by the data of one person. This means the model was trained on the data recorded for 16 people and tested on the remaining person's data. This procedure was repeated for each person's data (17 times) and the average performance was measured. Four evaluation metrics are commonly used in activity recognition: recall, precision, accuracy and F-measure. We analyzed the accuracy score, which shows how many of the predicted activities are correctly classified.
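As an illustration of this protocol, a minimal leave-one-person-out sketch with scikit-learn is given below; it is not the authors' code, and the data and Random Forest settings are synthetic placeholders.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    # Synthetic stand-ins: 36 window features, 11 activity labels, 17 subjects
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1700, 36))          # feature vectors
    y = rng.integers(0, 11, size=1700)       # activity labels
    groups = np.repeat(np.arange(17), 100)   # subject id per window

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, groups=groups,
                             cv=LeaveOneGroupOut(), scoring="accuracy")
    print("mean accuracy over 17 held-out subjects:", scores.mean())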
4.2 Results
For the first experiment we compared the 4 ML models using the ankle accelerometer, as shown in Figure 2. We used the ankle accelerometer because our initial studies showed that it performs best. Random Forest showed the best results with 92.92% accuracy, so it was used for further experiments.

Figure 2: Comparison of different algorithms using the ankle accelerometer (accuracy in %: RF 92.92, SVM 90.14, MLP 92.43, k-NN 84.31)

Table 3 shows the comparison of activity recognition accuracy using 4 accelerometers placed on the ankle, belt, neck and wrist. It shows how the number and placement of accelerometers can affect the recognition of particular activities.

Table 3: Comparison of activity recognition accuracy using different numbers of accelerometers (1, 2, 3 or 4) placed on the ankle, belt, neck and wrist

Placing the accelerometer on the belt can distinguish sitting, standing or jumping, but distinguishing different kinds of falls that include some transitions, like standing, falling and then laying, is a problem. Adding one accelerometer on the neck can slightly improve the results, but the falls still cannot be recognized correctly. The combination of the neck and ankle accelerometers gave the best results, with 94.2% accuracy. On the other hand, an accelerometer on the ankle can distinguish walking, standing and laying, but has problems with picking up an object and with recognizing the falls; most of the fall activities are recognized as standing or laying. By combining this sensor with the neck accelerometer, the algorithm can distinguish each of the discussed activities.

Because of situations like this, we decided to compare the results using different numbers of accelerometers and different body placements. The idea is to use as few sensors as possible to maximize the user's comfort, but to use enough of them to achieve satisfactory performance.

We must make a trade-off between correctly detecting simple activities and specific falls. The results showed that the neck and ankle accelerometers are best suited for fall detection, with an overall accuracy of 94.19%. The confusion matrix for the neck and ankle accelerometers is shown in Figure 3. Most false positive predictions for fall activities are predicted as laying. Also, a very small percentage of the non-fall activities are predicted as falls, which dismisses false alarms for falls.

Figure 3: Confusion matrix for the neck and ankle accelerometers

5 CONCLUSION
In this paper we presented an approach to human activity recognition and showed how the location and number of sensors can impact the process of HAR. Our aim was to build a model that can correctly recognize and classify the fall activities using a small number of accelerometers, while still obtaining high accuracy scores. With one accelerometer placed on the ankle or the neck we got high accuracy scores, but by combining these two sensors the model can classify the falls more precisely.

The main input to our system is the data from the inertial sensors. Because the data is sensory, additional attributes are calculated. This process of feature extraction is general and can be used in similar problems. Next, the algorithms for the final tasks of activity recognition and fall detection were designed and implemented using the data from the ankle accelerometer. We used a machine learning approach to solve the problem of activity recognition. We evaluated the models, and Random Forest showed the best results. Then we compared the best model on different data and concluded that the data from the ankle and neck sensors was sufficient for human activity recognition and fall detection, with an accuracy of 94.2%.

REFERENCES
[1] United Nations Publications. World Population Ageing 2019 Highlights. Department of Economic and Social Affairs, Population Division.
[2] Labrador, M.A. and Lara Yejas, O.D., 2013. Human activity recognition: using wearable sensors and smartphones. CRC Press.
[3] Ravi, N., Dandekar, N., Mysore, P. and Littman, M.L., 2005, July. Activity recognition from accelerometer data. In AAAI (Vol. 5, No. 2005, pp. 1541-1546).
[4] Kwapisz, J.R., Weiss, G.M. and Moore, S.A., 2011. Activity recognition using cell phone accelerometers. ACM SIGKDD Explorations Newsletter, 12(2), pp. 74-82.
[5] Gjoreski, H., Luštrek, M. and Gams, M., 2011, July. Accelerometer placement for posture recognition and fall detection. In 2011 Seventh International Conference on Intelligent Environments (pp. 47-54). IEEE.
[6] Gjoreski, M., Gjoreski, H., Luštrek, M. and Gams, M., 2016. How accurately can your wrist device recognize daily activities and detect falls? Sensors, 16(6), p. 800.
[7] Atallah, L., Lo, B., King, R. and Yang, G.Z., 2011. Sensor positioning for activity recognition using wearable accelerometers. IEEE Transactions on Biomedical Circuits and Systems, 5(4), pp. 320-329.
[8] Bonomi, A.G., Plasqui, G., Goris, A.H. and Westerterp, K.R., 2009. Improving assessment of daily energy expenditure by identifying types of physical activity with a single accelerometer. Journal of Applied Physiology.
[9] The Challenge UP dataset: http://sites.google.com/up.edu.mx/har-up/
[10] Igual, R., Medrano, C. and Plaza, I., 2013. Challenges, issues and trends in fall detection systems. Biomedical Engineering Online, 12(1), p. 66.
[11] Zhang, Z., Conly, C. and Athitsos, V., 2015. A survey on vision-based fall detection. In 8th ACM International Conference on PETRA '15, ACM, New York, NY, USA, Article 46, 1-7.
[12] Attal, F., Mohammed, S., Dedabrishvili, M., Chamroukhi, F., Oukhellou, L. and Amirat, Y., 2015. Physical human activity recognition using wearable sensors. Sensors, 15(12), pp. 31314-31338.
[13] Altun, K., Barshan, B. and Tunçel, O., 2010. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43(10), pp. 3605-3620.
[14] Gjoreski, M., Janko, V., Slapničar, G., Mlakar, M., Reščič, N., Bizjak, J., Drobnič, V., Marinko, M., Mlakar, N., Luštrek, M. and Gams, M., 2020. Classical and deep learning methods for recognizing human activities and modes of transportation with smartphone sensors. Information Fusion, 62, pp. 47-62.
Sistem za ocenjevanje esejev na podlagi koherence in semantične skladnosti

Automated Essay Evaluation System Based on Coherence and Semantic Consistency

Žiga Simončič
Univerza v Ljubljani, Fakulteta za računalništvo in informatiko
Večna pot 113, 1000 Ljubljana
zs3179@student.uni-lj.si

Zoran Bosnić
Univerza v Ljubljani, Fakulteta za računalništvo in informatiko
Večna pot 113, 1000 Ljubljana
zoran.bosnic@fri.uni-lj.si

ABSTRACT
In this paper we describe an implementation of an essay grading system. We lean heavily on the methodology of an existing system which, besides using syntactical measurements, also uses coherence and semantic consistency measures. We implement the methodology in the Orange data mining tool, with a friendly user interface, optional use of word embeddings for word representation and the possibility of further development of the system. The system is evaluated on public datasets from the Kaggle website. The results are, to the largest possible extent, compared with the results of the existing methodology and analyzed in detail. We also compare several attribute selection methods, which improve our results. The main contributions of this work are (1) the implementation of the system, (2) ease of use and (3) improvements upon previous work, including additional computing options and a detailed attribute selection analysis.

KEYWORDS
automated essay evaluation, semantic consistency, Orange

1 INTRODUCTION
Teachers in educational institutions are responsible for passing knowledge to a large number of students. Part of the learning process is also the writing of essays, which teachers have to read and grade. Grading essays is not only time-consuming but potentially also somewhat biased. The teacher's task is also to mark and correct the errors and to comment on the work as a whole.

Computers can make essay grading easier. Today's essay grading systems (including commercial ones) focus mainly on syntactic analysis and pay too little attention to semantics [6]. This weakness of existing systems is addressed by the SAGE system, which Zupanc describes in her dissertation [5]. SAGE achieves enviable predictive accuracy compared to other contemporary systems, but its current implementation is at the prototype stage and not ready for production.

The main goal of this work was to implement the system in a way that makes it as accessible, simple and friendly to use as possible. To meet these goals, we decided on an implementation in the Orange environment (https://orange.biolab.si/), which is intended for rapid prototyping of models and data exploration, aimed at beginners as well as advanced users. In Orange, the system is implemented in the form of widgets, which can be connected and combined, so we left file import, model building and model testing to the widgets already implemented in Orange. In total we implemented three widgets: the first implements all the attribute functions, including coherence; the second implements the system for semantic consistency analysis; and the third is intended for model evaluation with the quadratic weighted kappa.

The Zupanc system [6] is based on the extraction of various attributes from the given texts (essays) and is divided into three (sub)systems: AGE, AGE+ and SAGE. The label "the Zupanc system" denotes her implementation of all three of these systems. Each system upgrades the previous one with additional attributes. The AGE system comprises a set of attributes covering basic syntactic statistics and readability, lexical, grammatical and content measures. These include various text characteristics, from basic ones such as the number of characters, words, etc., to the number of grammatical errors and the computation of similarity to other essays. In total, this system covers 72 different attributes; as a contribution of this paper, we added five new attributes to it (the number of characters without spaces and four additional attributes that count the occurrences of individual morphosyntactic tags), for a total of 77 attributes.

To the attributes of the AGE system we add attributes for measuring coherence, which gives the AGE+ system. Coherence is measured by first splitting the text into overlapping segments (a sliding window) and transforming the individual segments into a multidimensional space. In this space the segments can be compared, and various measures can be used to assess the consistency of the text and the flow of thought. The number of attributes for measuring coherence is 29.

If we add to all the above attributes a further set of three attributes obtained by checking semantic consistency, we speak of the SAGE system. The system for detecting semantic errors uses an ontology in the background, to which facts extracted from the text are gradually added. Logical reasoning is then used to determine whether the claims in the text are logically consistent or not. This yields three additional attributes and the possibility of feedback on which sentences contain a semantic inconsistency.
od osnovnih, kot so število znakov, besed itd., pa do števila slov- The system is evaluated on public datasets from the Kaggle web- ničnih napak in računanje podobnosti z ostalimi eseji. Skupno ta site. The results are to the most possible extent compared with sistem zajema 72 različnih atributov, v prispevku tega članka pa the results of the existing methodology and analyzed in detail. smo temu sistemu dodali še pet novih atributov (št. znakov brez We also compare several attribute selection methods, which im- presledkov in štiri dodatne atribute, ki štejejo število posameznih prove our results. Main contributions of this work are comprised oblikoskladenjskih oznak). Skupno torej 77 atributov. of (1) implementation of the system, (2) ease of use and (3) im- Atributom sistema AGE dodamo atribute za merjenje kohe- provements upon previous work, including additional computing rence in s tem dobimo sistem AGE+. Koherenco merimo tako, options and detailed attribute selection analysis. da besedilo najprej razdelimo na prekrivajoče se odseke (drseče okno) in posamezne odseke pretvorimo v večdimenzionalni pro- KEYWORDS stor. V tem prostor lahko posamezne odseke primerjamo in z automated essay evaluation, semantic consistency, Orange različnimi merami ocenimo ocenimo konsistentnost besedila in tok misli. Število atributov za merjenje koherence je 29. 1 UVOD Če vsem zgornjim atributom dodamo še nabor treh atributov, ki jih pridobimo s preverjanjem semantične skladnosti, govo- Učitelji v izobraževalnih ustanovah so odgovorni za predajanje rimo o sistemu SAGE. Sistem za zaznavanje semantičnih napak v znanj velikemu številu učencev. Del učnega procesa je tudi pisa- ozadju uporablja ontologijo, kateri postopoma dodajamo dejstva, nje esejev, ki jih morajo učitelji prebrati in oceniti. Ocenjevanje ki jih izluščimo iz besedila. Z logičnim sklepanjem nato ugoto- esejev ni le časovno potratno, ampak potencialno tudi nekoliko vimo, če so trditve iz besedila logično konsistentne ali ne. To nam pristransko. Naloga učitelja je tudi, da napake označi, popravi in prinese tri dodatne atribute in možnost povratne informacije, v komentira celotno delo. katerih povedih je prišlo do semantičnega neskladja. S pomočjo računalnika lahko ocenjevanje esejev olajšamo. Dandanašnji sistemi za ocenjevanje esejev (tudi komercialni) se 1 https://orange.biolab.si/ 88 Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia Simončič in Bosnić 2 SORODNA DELA V sklopu svojega dela se je Zupanc [5] osredotočila na (v času njenega raziskovanja že zaključeno) tekmovanje avtomatskega 2 ocenjevanje esejev, ki ga je gostil Kaggle. Na tem tekmovanju so pomerili različni sistemi, s katerimi je Zupanc primerjala svoj sistem. Najboljša mesta na končni lestvici so večinoma zasedali komercialni sistemi za ocenjevanje esejev, nekaj pa je bilo tudi po meri narejenih uporabniških modelov. Komercialni sistemi 3 4 5 kot so PEG, e-rater in IntelliMetric imajo že dolgo zgodovino in s tem velik tržni delež ter izpopolnjen finančni model. V času raziskovanja noben od naštetih ni ponujal brezplačne verzije sistema. Podrobno razčlenitev modelov in splošen opis njihovega delovanja najdemo v delih Zupanc [5] ter Zupanc in Bosnić [6]. V zadnjem času se na različnih področjih čedalje bolj uve- ljavljajo nevronski modeli, zato smo pogledali in testirali nekaj izvedb. Martinc in sod. [3] opisujejo uspešnost treh različnih nevronskih modelov pri ocenjevanju besedil, ki sicer niso eseji. 
3 DESCRIPTION OF THE IMPLEMENTATION AND METHODS

3.1 Tools Used
We implemented the entire system using the Orange data mining tool, in the Python programming language. The main libraries used for parsing the text and computing the attributes are NLTK (https://www.nltk.org/), SpaCy (https://spacy.io/), scikit-learn (https://scikit-learn.org/stable/) and, for detecting spelling errors, language-check (https://pypi.org/project/language-check/). For working with ontologies we used the rdflib library (https://rdflib.readthedocs.io/en/stable/) and two external systems (in the sense of standalone local programs): ClausIE (with OpenIE5.0 also available) and HermiT (http://www.hermit-reasoner.com/).

3.2 Implementation of the Widgets in Orange
In total we developed three widgets, which cover the entire system described. Figure 1 shows all three widgets, which are described below.

Figure 1: The three implemented widgets

The first widget is intended for computing all the different measures: basic (shallow) statistical measures and readability, lexical, grammatical, content and coherence measures. The widget represents the AGE and AGE+ systems, depending on the user's choice of attributes to be computed. If we select the computation of all attributes except the coherence attributes, we speak of the AGE system; with the coherence attributes added, we speak of the AGE+ system. Since computing some of the advanced measures is more demanding, the user can decide to compute any combination of the six listed groups of measures. For the content and coherence measures, there is an additional choice of the method for transforming the text into a multidimensional vector space. Here we support two methods: the statistical TF-IDF transformation and GloVe word embeddings (in two variants: SpaCy and Flair).

The widget has three inputs:
(1) an input for graded essays,
(2) an input for ungraded essays and
(3) an input for the source text.

The inputs for graded and ungraded essays are intended for the training set of graded essays and for the set of ungraded essays whose grades we want to predict. The same attributes are computed on both sets; the attributes of the graded essays are used to build the model. The input for the source text is optional and represents the source story, book or facts that the essay writer is expected to know. If the essays are based on some source text, we connect it to the corresponding input, which adds an additional attribute (the similarity of the essay to the source text). The widget has two outputs, namely an output for the computed attributes of the graded essays and an output for the computed attributes of the ungraded essays. This allows us to set the data appropriately as inputs to other Orange widgets.
The second widget covers the work with the ontology and the search for semantic inconsistencies. It represents the computation of the additional attributes brought by the SAGE system. The widget is standalone because of its large computational and time complexity. It has two settings: whether we want to use the coreference resolver, and whether we want a detailed explanation to be returned for semantic errors. Using coreference resolution is recommended, since in cases of indirect reference to different concepts in the text it is the only way to capture the complete semantic information. We can also select a source text or story with which the ontology is extended, so that it also includes the content of the underlying text. This text is processed before everything else, and the extracted triples are added to the ontology; the extended ontology is then used for checking the consistency of the essays. If no source text is added, the basic ontology (the COSMO ontology) is used for consistency checking as usual. The widget has a single input (an input for essays) and a single output: a table of three attributes with the counts of the individual error types and a string with a basic explanation, plus an additional column with a detailed explanation if this option is selected.

The third widget is intended for the evaluation of the predicted grades against the true grades of the essays. Since Orange does not support measures for computing exact agreement and the quadratic weighted kappa (QWK), we built a widget that receives a table with the predicted grades and the true grades. We modeled it on the output of the Test and Score widget: to ensure interoperability, that output can be connected directly to the input of our widget, where the two aforementioned measures are computed. The use of the widget for computing the attributes and evaluating the model with the quadratic weighted kappa is shown in Figure 2.

Figure 2: Example use of the AGE/AGE+ system
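For illustration, the two measures computed by this widget can be obtained in a few lines of scikit-learn; this is a sketch of the metrics themselves, not of the widget's internals, and the grades are a toy example.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    y_true = np.array([2, 3, 4, 3, 2, 4])   # true essay grades (toy example)
    y_pred = np.array([2, 3, 3, 3, 2, 4])   # predicted grades, rounded to integers

    exact_agreement = np.mean(y_true == y_pred)
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    print(exact_agreement, qwk)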
3.3 Semantic Analysis
One of the main contributions of the work of Zupanc and Bosnić [6] is the use of ontologies for determining semantic consistency. This procedure is useful in two ways: it yields several additional attributes that can be used when predicting essay grades, and it additionally tells us where the semantic errors are located. The latter functionality is very important, since the student thereby receives direct information about the errors in the essay.

The procedure is based on the use of an ontology, to which sentences structured into relations are gradually added while the consistency of the ontology is checked along the way. The basic structure of the ontology is represented by "triples" of the form (subject, relation, object). A relation can represent a constraint or a conceptual connection (e.g., (Alice, isMotherOf, Bob)), or it can define a type. In the implementation we used the RDF language to represent the triples, which is similar to OWL but is not a logical language. We used the COSMO (Common Semantic Model) ontology. It is represented in the OWL (Web Ontology Language, https://www.w3.org/OWL/) semantic language, which allows the construction of complex schemas of various concepts, facts and mutual relations. If we wanted to add additional specific knowledge to the ontology, we could do so. In our case, alongside some of the essays there is also a source text on which the essays were based. The source text is added to the ontology before the essays, following the same procedure as for the essays, which is explained below.

For an individual essay, we first perform coreference resolution on the text. Resolving references allows us to discover indirect references to particular entities and replace them with the direct entity. For example, "Bob likes pizza. He eats it all the time." is replaced with "Bob likes pizza. Bob eats pizza all the time." The next step is splitting the text into individual sentences and extracting information with an OpenIE (Open Information Extraction) system. In this step, individual sentences are transformed into one or more triples that describe the relations expressed in the sentence and are suitable for logical processing. For the example above, we would thus obtain two triples: (Bob, like, pizza) and (Bob, eat, pizza). We used the ClausIE extraction system [2], and we also support the option of using the OpenIE5 system (https://github.com/dair-iitd/OpenIE-standalone).

All the obtained triples are then gradually added to the ontology while its consistency is checked. For each element of a triple, we try to find an already existing element in the ontology. In doing so we search through synonyms, hypernyms and antonyms, and in the worst case we add a new element to the ontology. After each addition of elements and triples, the consistency of the ontology is checked. Consistency is checked with the HermiT logical reasoner, which returns two types of errors. The first type occurs when a class (owl:Class) has entities assigned to it that it must not have (an unsatisfiable case). The second type is triggered when reasoning discovers a logical error, i.e., an inconsistent ontology. Such errors usually arise from direct oppositions (e.g., owl:disjointWith) between two relations, stating that an entity cannot have both relations at the same time.

Based on the types of errors raised, we construct three additional attributes that can be used when predicting essay grades: the number of unsatisfiable cases (when adding new entities to the ontology), the number of inconsistent-ontology errors (when adding triples) and the sum of the two.
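The triple-handling part of this pipeline can be sketched with rdflib as below; the namespace is hypothetical, and the actual consistency check against COSMO is performed by the external HermiT reasoner, which is not shown.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/essay#")  # hypothetical namespace
    g = Graph()

    # Triples extracted from "Bob likes pizza. Bob eats pizza all the time."
    for subj, rel, obj in [("Bob", "like", "pizza"), ("Bob", "eat", "pizza")]:
        g.add((EX[subj], EX[rel], EX[obj]))

    # The graph would then be merged with the COSMO ontology and handed to
    # HermiT, which reports unsatisfiable classes and inconsistencies.
    print(g.serialize(format="turtle"))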
3.4 Results
We tested the system on the data of the several-years-old ASAP competition (https://www.kaggle.com/c/asap-aes) on the Kaggle website. The data comprise eight different datasets (or nine, since the second dataset is graded according to two criteria). The essay topic in each dataset is different. The datasets are split into training, validation and test sets, but the grades of the validation and test sets are not available, so we used 10-fold cross-validation to evaluate our system. The range of grades differs between datasets, from 0–4 up to 0–60. To evaluate the models we used the quadratic weighted kappa measure, which takes the range of grades into account and returns the relative agreement of the predicted grade with the actual grade. We tested the system with a linear regression model and with random forests. Linear regression performed better, so we focused on it in further experiments. We used L2 regularization with the parameter α = 0.02.

Initially we built the models on the whole set of computed attributes. Since the AGE+ system, presumably because of its too-large number of attributes (106), did not achieve better results than the AGE system, we tried several attribute selection methods. The main methods of our analysis were forward attribute selection and backward feature elimination. Both methods improved the result. We used them together with 10-fold cross-validation: in each iteration of the cross-validation we added/removed individual attributes and, based on the average across all iterations, added/removed the attribute with the largest/smallest contribution. We repeated this until there was no more improvement.

During the analysis we noticed that the set of attributes that makes the final selection is relatively small. We found that, because of the cross-validation, there is a high chance of reaching a local optimum with the current attribute set. Due to the averaging across all iterations, an attribute can improve the result in one iteration and worsen it in another, and is on average marked as unsuitable. To avoid these local optima, we implemented a threshold on how many times the average result may worsen before the attribute set is marked as final. This worsened the result in the short term, but in the long term created a combination of attributes that on average gives a better result. With this method of avoiding local optima we further improved the final results, which are summarized in Table 1. In the selection and elimination columns, AGE is omitted, since AGE+ achieves better results in both cases.

Table 1: Comparison of the results without attribute selection (TF-IDF), comparison with the Zupanc system (AGE), and summarized results of attribute selection and elimination on the AGE+ system

        No selection          Zupanc    Selection   Elimination
        AGE      AGE+         (AGE)     AGE+        AGE+
DS1     0.8358   0.8343       0.8447    0.8369      0.8439
DS2a    0.7001   0.7073       0.7389    0.7158      0.7324
DS2b    0.6789   0.6676       0.5386    0.6941      0.7028
DS3     0.6578   0.6622       0.6591    0.6656      0.6958
DS4     0.7536   0.7547       0.7174    0.7619      0.7769
DS5     0.7964   0.7955       0.7949    0.8028      0.8122
DS6     0.7734   0.7675       0.7636    0.7771      0.7871
DS7     0.8071   0.8034       0.7888    0.8083      0.8183
DS8     0.7479   0.7428       0.7738    0.7681      0.7717
AVG     0.7501   0.7484       0.7356    0.7590      0.7712
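A minimal sketch of the forward selection loop with the described patience threshold is given below; it is our illustration, not the authors' code, and it scores candidate subsets with the default cross-validation score rather than the quadratic weighted kappa used in the paper.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, patience=3):
        remaining = list(range(X.shape[1]))
        chosen, best, bad_rounds = [], -np.inf, 0
        while remaining and bad_rounds < patience:
            # score every candidate attribute added to the current subset
            score, j = max((np.mean(cross_val_score(Ridge(alpha=0.02),
                                                    X[:, chosen + [k]], y, cv=10)), k)
                           for k in remaining)
            chosen.append(j)
            remaining.remove(j)
            if score > best:
                best, bad_rounds = score, 0
            else:
                bad_rounds += 1  # tolerate temporary drops to escape local optima
        return chosen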
Because the test data are no longer available, we could compare our results with the Zupanc system only through a comparison of the AGE systems with 10-fold cross-validation. We can see that we achieve very similar results to the Zupanc system, or slightly exceed them; with suitable attribute selection, our result improves further.

We left the SAGE system out of the table, since its results with attribute elimination are only marginally better than those of the AGE+ system, and we used it only on the datasets that contained a source text (only four datasets). Nevertheless, upon detecting a semantic inconsistency, the SAGE system offers feedback output. The following example shows the operation of the coreference resolver and the detection of semantic errors. Because of stemming, some words in the explanations may be truncated. The input "George likes basketball and doesn't like sports." triggers an error with the explanation: "Relation 'George likes basketball and George doesn't like sports.' is inconsistent with a relation in ontology: 'George likes basketball and George doesn't like sports.'" and the detailed explanation: "Relation not consistent: Georg likes Basketball. Relations doesNotLike and likes are opposite/disjoint. Relation not consistent: Georg doesNotLike Basketball.". The basic explanation works at the level of sentences and in this case tells us that the sentence contradicts itself. The detailed explanation says that George both likes and does not like basketball. The word "sports" does not appear in the detailed explanation because basketball is a subclass of sport and the contradiction first arises there.

We would also mention the comparison of our system with the aforementioned neural models. The model of Taghipour and Tou Ng [4] achieves results similar to our system (slightly below 0.77). Alikaniotis et al. [1] report that their model achieves a result of 0.96, but we suspect some irregularities stemming from an incorrect use of the model evaluation measure (the quadratic weighted kappa). We suspect that they used all the datasets together to train their model, since their result is in the range of almost 100% accuracy (0.96), with twice the absolute error (RMSE) of our model, which achieves a result of approximately 0.77. Using all the datasets with our system, we also obtain such a high result (0.97 or 0.94, depending on the model).

4 CONCLUSION
As part of this work, we implemented an essay grading system modeled on the work of Zupanc [5] in the Orange environment. The implementation in Orange allows for simple use of the system and compatibility with the functionalities already implemented in Orange. We added several new attributes to the system, as well as the option of representing words with GloVe word embeddings. Our implementation of the system is available in a git repository (https://github.com/venom1270/essay-grading). The system is based on the extraction of a large number of attributes from the texts and the subsequent selection of the best subset for a given dataset. The innovative part of the previous work, which is also included in our implementation, is the additional system for checking semantic consistency, with which the attribute set is further enriched, while the system can also print out all the detected semantic errors or inconsistencies. A contribution of this paper is also the comparison of attribute selection techniques and the comparison of the results with the previous work. It would make sense to test the system with other predictive models as well, since in our work we focused mostly on linear regression and random forests. An interesting challenge would also be adapting the system to the Slovenian language, which is syntactically more complex, while its text processing tools are not yet as mature as those for English.

ACKNOWLEDGMENTS
We thank the colleagues of the Bioinformatics Laboratory at the Faculty of Computer and Information Science for their support and advice in implementing the system in the Orange environment.

REFERENCES
[1] Dimitrios Alikaniotis, Helen Yannakoudakis and Marek Rei. 2016. Automatic text scoring using neural networks. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi: 10.18653/v1/p16-1068.
[2] Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: Clause-Based Open Information Extraction. In Proceedings of the 22nd International Conference on World Wide Web, 355–366.
[3] Matej Martinc, Senja Pollak and Marko Robnik-Šikonja. 2019. Supervised and unsupervised neural approaches to text readability. arXiv preprint arXiv:1907.11779.
[4] Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 1882–1891. doi: 10.18653/v1/D16-1193.
[5] Kaja Zupanc. 2018. Semantics-based automated essay evaluation. Doctoral dissertation. Fakulteta za računalništvo in informatiko, Univerza v Ljubljani.
[6] Kaja Zupanc and Zoran Bosnić. 2017. Automated essay evaluation with semantic analysis. Knowledge-Based Systems, 120, 118–132.

Mental State Estimation of People with PIMD using Physiological Signals

Gašper Slapničar
Jožef Stefan Institute, Jožef Stefan IPS
Jamova cesta 39, Ljubljana, Slovenia
gasper.slapnicar@ijs.si

Erik Dovgan
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
erik.dovgan@ijs.si

Jakob Valič
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
jakob.valic@ijs.si

Mitja Luštrek
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
mitja.lustrek@ijs.si
ABSTRACT
People with profound intellectual and multiple disabilities (PIMD) are a very diverse and vulnerable group of people. Their disabilities are cognitive, motor and sensory, and they are also incapable of symbolic communication, making them heavily reliant on caregivers. We investigated the connection between physiological signals and inner states as well as communication attempts of people with PIMD, using signal processing and machine learning techniques. The inner states were annotated by expert caregivers, and several heart rate variability features were computed from the photoplethysmogram. We then fed the features into hyper-parameter-tuned classification models. We achieved the highest accuracy of 62% and F1-score of 0.59 for inner state (pleasure, displeasure, neutral) classification using Extreme Gradient Boosting, which notably surpassed the baseline.

KEYWORDS
PIMD, mental state, physiological signals, classification

1 INTRODUCTION
People with profound intellectual and multiple disabilities (PIMD) often face extreme difficulties in their day-to-day life due to severe cognitive, motor and sensory disabilities. They require a nearly ever-present caregiver to help them with most tasks. Additionally, they are unable to communicate their feelings or express their current mental state in a traditional symbolic way. This causes a gap between a caregiver and the care recipient, as it can take an extended period of time for the caregiver to recognize any potential patterns and their relationship with the mental state of the care recipient.

The aforementioned reasons call for a technological solution that might help bridge the gap between the caregiver and the care recipient and help the former better understand the latter. The INSENSION project [8] aims to develop such assistive technology, which takes into account many aspects of the care recipient. The aim is both to bridge the previously mentioned gap and to empower the people with PIMD to interact with their surroundings through technology. One part of the system considers the patterns in a person's gestures and facial expressions, which might have some significance and correlation to their behavioural and mental state, or their communication attempt. The initial solution dealing with this part was already described by Cigale et al. [1, 2]. In this paper, we instead focus on exploring the relationship between the physiological response of the body and the mental state of people with PIMD by using features computed from the photoplethysmogram (PPG). PPG is a periodic signal in which each cycle corresponds to a single heart beat. We obtained the PPG in two different ways: 1.) by using a high-quality wearable Empatica E4 with an optical sensor measuring the reflection of light from the skin and 2.) by using a contact-free RGB camera mounted on a wall, which records the color changes of the skin pixels. The features were then used to train classification models, which predicted the person's inner state or communication attempt.

The rest of this paper is structured as follows: we first investigate the related work in Section 2, then we describe the data collected and used in the experiments in Section 3. We continue with the methodology and experimental setup description in Section 4, and conclude with results and discussion in Section 5.

2 RELATED WORK
The connection between physiological parameters and mental states is a mature and highly-researched field when it comes to average healthy people. Schachter et al. [6] investigated the emotional state of people as a function of cognitive, social and physiological state. Several propositions were made and experimentally confirmed, supporting the overall connection between emotional and physiological state.
Cigale et al. [1, 2] explored the communication signals of people with PIMD, which are atypical and idiosyncratic. They highlighted the challenging interpretation of these signals and their meaning and suggested how technology could help overcome the gap between caregivers and care recipients. Some models were proposed that take the person's non-verbal signals (NVS) as input and classify their inner state or communication attempt.

Kramer et al. [3] highlighted the challenges of analysing the NVS in people with PIMD, as they are difficult to discern, instead focusing on physiological body responses. They conducted a study in which the expressions of three emotional states of one person with PIMD were recorded during nine emotion-triggering situations. They collected heart rate (HR) and skin conductance level (SCL), and investigated the connection between these two physiological signals and the emotional state. They found higher SCL activity during anger or happiness and lower SCL activity during relaxation or a neutral state.

Vos et al. [9] confirmed that HR and skin temperature allow the same conclusions in people with PIMD and people without disabilities regarding positive and negative emotion. This finding gives additional motivation to our work, showing that the connection between physiological and mental state also holds for people with PIMD.

3 DATA
We created a recording setup in the INSENSION project, which uses two Logitech C920 cameras capable of recording full HD (1920x1080) video at 30 frames per second (fps). The cameras were set up perpendicular to one another to record from two distinct angles, allowing for decent facial exposure even when the face changes direction. The caregivers were instructed to attempt to conduct their activity in front of one of the cameras whenever possible. Additionally, the subjects were given an Empatica E4 wristband, which served both as the ground truth for PPG and as a fall-back mechanism for obtaining physiological signals in cases when the camera is unreliable or unavailable. The wristband records PPG at 64 Hz, allowing for the capture of reasonable morphological detail. The temporal synchronization between the video and the ground truth was ensured to the best of our abilities using suitable protocols and checks.

With the described setup, we obtained 48 recording sessions, each lasting between 10 and 30 minutes.
Five sessions were eliminated immediately, as there was a large mismatch between the duration of the video and the duration of the ground truth, which may happen for several reasons, such as a caregiver forgetting to turn on the wristband during a session or the wristband losing connection.

It is important to note that the recordings were made in a natural way, as the caregivers were not given any additional restrictions other than to be in front of the camera when possible. In practice this means that large parts of some recordings might be useless due to the person with PIMD being turned away or the caregiver blocking them. Examples of good and bad sessions are shown in Figure 1.

Figure 1: Example of good (green) and bad (red) video recordings.

3.1 Annotating the ground truth
In order to classify mental states of people with PIMD, we first required ground truth annotations. As it is generally difficult to obtain such ground truth, we relied on the expert knowledge of partners in the project who specialize in the education of people with special needs, alongside the caregivers, who know their care recipients best. Together they devised an annotation schema in which they annotated the inner states and communication attempts of people with PIMD, which can take the values given in Equations 1 and 2:

InnerState = \begin{cases} displeasure & \text{if } 1, 2 \text{ or } 3 \\ neutral & \text{if } 4, 5 \text{ or } 6 \\ pleasure & \text{if } 7, 8 \text{ or } 9 \end{cases} \quad (1)

CommAttempt \in \{ protest, comment, demand \} \quad (2)

The three numbers within each mental state indicate the intensity, where a lower number for displeasure means more intense displeasure, and a higher number for pleasure indicates more intense pleasure.

The caregivers were tasked with the annotation of the videos, looking at the camera recordings and marking inner states and communication attempts in time, always marking the start and end of each recognized state, regardless of its duration (which can be a few seconds or a few minutes). Naturally, large periods remained where nothing was annotated, as the experts were either not sure or did not recognize any of the pre-defined states. This does not mean that nothing is happening in those periods, but simply that the inner experience of the person with PIMD is unknown. Thus, we added an additional class value for the areas where nothing was annotated: unknown.
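As a concrete illustration of this schema (our sketch, not project code), the 1-9 intensity codes of Equation 1 map onto the three inner-state classes as follows:

    def inner_state(code: int) -> str:
        # Eq. (1): 1-3 -> displeasure, 4-6 -> neutral, 7-9 -> pleasure
        if not 1 <= code <= 9:
            raise ValueError("intensity code must be between 1 and 9")
        return ("displeasure", "neutral", "pleasure")[(code - 1) // 3]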
4 METHODOLOGY OF MENTAL STATE ESTIMATION
Having both the ground truth annotations and the physiological data and videos, we investigated two approaches: 1.) we attempted to reconstruct the PPG from the camera recordings in a contact-free manner and use the reconstructed rPPG (remote PPG) to calculate features and to classify inner state and communication attempt, and 2.) we directly used the Empatica ground truth PPG to calculate features to be used in the same classification task.

4.1 Using rPPG Reconstruction
In order to obtain the remote PPG, we used a rather standard pipeline, augmented with a convolutional neural network to further enhance the rPPG. At a high level, the pipeline consists of detection of the region of interest (ROI), extraction of the red, green and blue signal components (RGB), detrending and band-pass filtering of the RGB, rPPG reconstruction using the Plane Orthogonal to Skin (POS) algorithm, band-pass rPPG filtering (0.5 to 4.0 Hz), and rPPG enhancement via deep learning. The details were already described in our previous work [7] and are not the subject of this paper.

We ran the pipeline described above on 30-second segments of video using a sliding window without overlap. We decided on 30 seconds due to the nature of some of the frequency features that we chose, as frequency analysis makes sense once a reasonable number of periods is available; in our case this means that a sufficient number of heart cycles must be available. Additionally, this length makes sense as we are primarily attempting to predict inner states, which do not change extremely in such a short time span. An example output of the pipeline is shown in Figure 2.

Figure 2: Example of a good rPPG segment obtained with our pipeline.

We then used the rPPG to compute several heart rate variability (HRV) features. These are known to be well correlated with stress, cognitive load, conflict experience and other inner states [5, 4]. A detailed list of the computed features is given in Table 1.

Table 1: List of computed HRV features.

Feature             Description
HRmean              60/mean(NN)
HRmedian            60/median(NN)
IBImedian           median(NN)
SDNN                std(NN)
SDSD                std(abs(NN'))
RMSSD               sqrt(mean((NN')^2))
NN20 and NN50       The number of pairs of successive NNs that differ by more than 20 ms and 50 ms
pNN20 and pNN50     The proportion of NN20 and NN50 divided by the total number of NNs
SDbonus1            sqrt(0.5) * SDNN
SDbonus2            sqrt(abs(2 * SDSD^2 - 0.5 * SDSD^2))
VLF                 Area under the periodogram in the very low frequencies
LF                  Area under the periodogram in the low frequencies
HF                  Area under the periodogram in the high frequencies
LFnorm and HFnorm   Area under the periodogram in the low and high frequencies, normalized by the whole area under the periodogram
LFdHF               LF/HF

where std is the standard deviation, abs the absolute value, X' the first-order derivative, sqrt the square root, and NN the beat-to-beat intervals.
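As an illustration, the time-domain entries of Table 1 can be computed from the beat-to-beat intervals as in the sketch below (our code, not the authors'); the frequency-domain features would additionally require a periodogram, e.g. via scipy.signal.welch, and are omitted for brevity.

    import numpy as np

    def hrv_time_features(nn):
        """nn: beat-to-beat (NN) intervals in seconds within one 30 s window."""
        nn = np.asarray(nn, dtype=float)
        d = np.diff(nn)                      # NN' (first-order derivative)
        return {
            "HRmean": 60 / nn.mean(),
            "HRmedian": 60 / np.median(nn),
            "IBImedian": np.median(nn),
            "SDNN": nn.std(),
            "SDSD": np.abs(d).std(),
            "RMSSD": np.sqrt(np.mean(d ** 2)),
            "NN20": int(np.sum(np.abs(d) > 0.020)),
            "NN50": int(np.sum(np.abs(d) > 0.050)),
            "pNN20": float(np.mean(np.abs(d) > 0.020)),
            "pNN50": float(np.mean(np.abs(d) > 0.050)),
        }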
4.2 Using Empatica PPG
The Empatica records PPG directly on the skin, thus making the raw PPG readily available, without the need for additional reconstruction. Still, due to the subjects' arm and wrist movements, we opted to use similar preprocessing steps as previously, namely detrending and band-pass filtering, as the signal can sometimes be quite noisy. We computed the same set of features with the same window length as before (see Table 1), and used them in the same classification task, attempting to recognize inner states and communication attempts.

5 EXPERIMENTS AND RESULTS
Once both the input (HRV features) and the output (annotations) were known, we investigated six classification algorithms (k-Nearest Neighbours, Decision Trees, Random Forest, Support Vector Machines, AdaBoost and Extreme Gradient Boosting) for this task, always training separate models for inner state and communication attempt. We always compared each algorithm against a baseline majority-vote classifier using two metrics, accuracy and F1-score.

5.1 Using Empatica PPG
We started our evaluation using the Empatica data, as it is more reliable, since PPG reconstruction is not needed. At the time of evaluation, we had annotations for 15 recording sessions in which 2 different people with PIMD are present. Using the chosen 30-second window, we initially had 417 segments of Empatica PPG available. The unknown class label heavily skewed the data for both classes, and there is no way to know which (other) class label such a segment actually belongs to, so we decided to exclude it from the evaluation. This left us with 272 instances for the class inner state and 80 instances for the class communication attempt, which was annotated more sparsely. The final distributions for each class are shown in Figure 3.

Figure 3: Distributions of both classes.

Initially we conducted a 5-fold cross-validation (CV) to investigate the best hyper-parameters using a grid search. Once the hyper-parameters were determined, we ran a separate experiment, using the best overall hyper-parameters for each model. Again, we ran a 5-fold CV with the best hyper-parameter settings obtained on the full data to validate the performance. All the investigated algorithms (from the Scikit-learn and XGBoost packages) and their corresponding sets of optimized parameters with the best values are available from the authors, but are not listed here due to space restrictions. The results of our evaluation in terms of accuracy and F1-score for both classes are given in Table 2.

Table 2: Accuracy and F1-score for both classes.

Algorithm            ACC (inner state)    F1 (inner state)
Baseline (majority)  0.52                 0.36
kNN                  0.55                 0.55
Tree                 0.54                 0.56
RF                   0.57                 0.56
SVM                  0.55                 0.52
AdaBoost             0.59                 0.56
XGB                  0.62                 0.59

Algorithm            ACC (comm. attempt)  F1 (comm. attempt)
Baseline (majority)  0.45                 0.27
kNN                  0.42                 0.42
Tree                 0.41                 0.39
RF                   0.46                 0.43
SVM                  0.43                 0.34
AdaBoost             0.43                 0.41
XGB                  0.48                 0.45
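The hyper-parameter search can be reproduced in outline as follows (our sketch, not the authors' code); the parameter grid and the synthetic stand-in data are illustrative, as the actual grids are not listed in the paper.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(272, 17))     # stand-in for the 272 HRV feature vectors
    y = rng.integers(0, 3, size=272)   # stand-in inner-state labels

    param_grid = {"n_estimators": [50, 100, 200],    # illustrative grid
                  "max_depth": [3, 5, 7]}
    search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="f1_macro")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)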
5.2 Using rPPG reconstruction
Using the rPPG for evaluation proved to be more difficult, as we only had a limited amount of good consecutive facial crops from the videos, while also having a limited amount of annotations. This meant that the overlap between the two was very small: we had only 12 such 30-second segments for inner state and only 6 for communication attempt. Such a low amount of data is infeasible for a realistic evaluation scheme (not even all three different class labels were present), so we instead decided to use the models previously trained on the Empatica data to classify these instances obtained via the rPPG. We achieved a reasonably high accuracy of 75% and F1-score of 0.84 for inner state, and a low accuracy of 33% and F1-score of 0.33 for communication attempt. The confusion matrices are shown in Figure 4.

Figure 4: Confusion matrices for classifying rPPG instances using models trained on Empatica data. For inner state, the class values are 1.0="neutral" and 2.0="pleasure". For communication attempt, 1.0="comment" and 2.0="demand".

6 CONCLUSION
We conducted an initial investigation of the connection between physiological signals and mental states of people with PIMD, attempting to classify their inner states and communication attempts. We used HRV features computed from the PPG obtained with an Empatica E4 wristband and investigated the performance of such models on instances obtained via rPPG. XGB showed the best performance, achieving an accuracy of 62% and F1-score of 0.59 for inner state, and an accuracy of 48% and F1-score of 0.45 for communication attempt, notably surpassing the baseline majority classifier.

The limitations of our work lie in the low number of instances for communication attempt and the little variety in subjects, having just two for which annotations were available. Additionally, the evaluation using the rPPG is limited, as we had very few instances for which both high-quality segments of video and annotations were available. Thus, the focus of future work should be on gathering more data and conducting a more extensive evaluation of the methods, which is planned in the trial stage of the INSENSION project.

ACKNOWLEDGMENTS
This work is part of the INSENSION project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 780819. The authors also acknowledge the financial support from the Slovenian Research Agency (ARRS).

REFERENCES
[1] Matej Cigale and Mitja Luštrek. 2019. Multiple knowledge categorising behavioural states and communication attempts in people with profound intellectual and multiple disabilities. In AmI (Workshops/Posters), 46–54.
[2] Matej Cigale, Mitja Luštrek, Matjaž Gams, Torsten Krämer, Meike Engelhardt, and Peter Zentel. 2018. The quest for understanding: helping people with PIMD to communicate with their caregivers. INFORMATION SOCIETY-IS 2018.
[3] Torsten Krämer and Peter Zentel. 2020. Expression of emotions of people with profound intellectual and multiple disabilities. A single-case design including physiological data. Psychoeducational Assessment, Intervention and Rehabilitation, 2, 1, 15–29.
[4] Richard D Lane, Kateri McRae, Eric M Reiman, Kewei Chen, Geoffrey L Ahern, and Julian F Thayer. 2009. Neural correlates of heart rate variability during emotion. Neuroimage, 44, 1, 213–222.
[5] Junoš Lukan, Martin Gjoreski, Heidi Mauersberger, Annekatrin Hoppe, Ursula Hess, and Mitja Luštrek. 2018. Analysing physiology of interpersonal conflicts using a wrist device. In European Conference on Ambient Intelligence. Springer, 162–167.
[6] Stanley Schachter. 1964. The interaction of cognitive and physiological determinants of emotional state. In Advances in Experimental Social Psychology. Volume 1. Elsevier, 49–80.
[7] Gašper Slapničar, Erik Dovgan, Pia Čuk, and Mitja Luštrek. 2019. Contact-free monitoring of physiological parameters in people with profound intellectual and multiple disabilities. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
[8] Poznań Supercomputing and Networking Center. 2017. The INSENSION project. https://www.insension.eu/.
[9] Pieter Vos, Paul De Cock, Vera Munde, Katja Petry, Wim Van Den Noortgate, and Bea Maes. 2012. The tell-tale: what do heart rate, skin temperature and skin conductance reveal about emotions of people with severe and profound intellectual disabilities? Research in Developmental Disabilities, 33, 4, 1117–1127.

Energy-Efficient Eating Detection Using a Wristband

Simon Stankoski
Department of Intelligent Systems, Jožef Stefan Institute
Jožef Stefan International Postgraduate School
Ljubljana, Slovenia
simon.stankoski@ijs.si

Mitja Luštrek
Department of Intelligent Systems, Jožef Stefan Institute
Jožef Stefan International Postgraduate School
Ljubljana, Slovenia
mitja.lustrek@ijs.si

ABSTRACT
Understanding people's dietary habits plays a crucial role in interventions that promote a healthy lifestyle. For this purpose, a multitude of studies explored automatic eating detection with various sensors.
Despite progress over the years, most proposed approaches are not suitable for implementation on embedded devices. The purpose of this paper is to describe a method that uses a wristband configuration of sensors to continuously track wrist motion throughout the day and detect periods of eating automatically. The proposed method uses an energy-efficient approach for the activation of a machine learning model, based on a specific trigger. The method was evaluated on data recorded from 10 subjects during free-living. The results showed a precision of 0.84 and a recall of 0.75. Additionally, our analysis shows that by using the trigger, the usage of the machine learning model can be reduced by 80%.

KEYWORDS

Eating detection, wristband, energy efficient, activity recognition

1 INTRODUCTION

Understanding people's dietary habits plays a crucial role in interventions that promote a healthy lifestyle. Obesity, which is a consequence of bad nutritional habits and excessive energy intake, can be a major cause of cardiovascular diseases, diabetes or hypertension. The latest statistics indicate that obesity prevalence has increased substantially over the last three decades [1]. More than 600 million adults (13% of the total adult population) were classified as obese in 2014 [2]. In addition, the prevalence of obesity in the European Region is estimated to reach 23% by 2025. Also, in 2017, it was reported that poor diet has contributed to 11 million deaths worldwide. Monitoring the eating habits of overweight people is an essential step towards improving nutritional habits and weight management.

Another group of people that require monitoring of their eating behavior are people with mild cognitive impairment and dementia. They often forget whether they have already eaten and, as a result, eat lunch or dinner multiple times a day or not at all, which might cause additional health problems. Proper treatment of these issues requires an objective estimation of the time the meal takes place, the duration of the meal, and what the individual eats.

Wristband devices and smartwatches are increasingly popular, mainly because people are accustomed to wearing watches, which makes the wrist one of the least intrusive body placements to wear a device. Additionally, the cost of these devices is relatively low, which makes them easily accessible to everyone. However, these devices offer limited computing power and battery life, which makes the implementation of a smart feature such as eating detection on such a device a challenging task.

This paper describes a method for real-time eating detection using a wristband. The proposed method detects periods and duration of eating. The output from the method can be used to track the frequency of eating and could serve to start methods for counting food intakes.

The work done in this study is important for the following reasons. We developed a trigger that can reduce the usage of the machine learning procedure, meaning that our method will not greatly affect the battery life of the device. Additionally, we evaluated different machine learning algorithms in terms of accuracy and model size. The method was evaluated on data recorded in real life from 10 subjects.

2 RELATED WORK

Recent advancements in wearable sensing technology (e.g., commercial inertial sensors, fitness bands, and smartwatches) have allowed researchers and practitioners to utilize different types of wearable sensors to assess dietary intake and eating behavior in both laboratory and free-living conditions. A multitude of studies for the detection of eating periods have been proposed in the past decade. Mirtchouk et al.
[3] explored eating detection using several sensors, combining real-life and laboratory data. Thomaz et al. [4] proposed a method that recognizes intake gestures separately and later clusters the intake gestures within 60-minute intervals; the method was evaluated on real-life data. Dong et al. [5] proposed a method for eating detection in real-life situations based on the novel idea that meals tend to be preceded and succeeded by periods of vigorous wrist motion. Amft et al. [6] presented an accurate method for eating and drinking detection using sensors attached to the wrist and upper arm of both hands. Navarathna et al. [7] combined sensor data from a smartwatch and a smartphone, which resulted in improved eating detection accuracy compared to using only smartwatch data. Kyritsis et al. [8] proposed a deep-learning-based method that recognizes bite segments, which are used for the construction of eating periods.

The work presented in this paper is an extension of our previous work [9]; the main novelty is an energy-efficient approach for real-time eating detection.

3 METHOD

The proposed eating detection method consists of two parts: a threshold-based trigger, used for the activation of an eating detection machine learning procedure, and a machine-learning method that predicts whether eating took place.

3.1 Energy-Efficient Trigger

The recent advancements in the technological development and accessibility of wearable devices bring new opportunities in the field of human activity recognition (HAR). However, the limited battery life and computational resources remain a challenge for the real-life implementation of advanced HAR applications. Using a machine-learning-based model for eating detection that works all the time results in a rapid battery drain. Therefore, we designed a threshold-based trigger that activates the machine learning model only when specific criteria are met. The main concept behind the trigger is to select only the moments when the person is moving a hand towards the head.

For this purpose, we used data from an accelerometer. This sensor provides information about the wristband's orientation, from which we can see whether the hand is oriented towards the head. Recent accelerometers used in battery-limited devices can store acceleration values in their internal memory without interacting with the main chip of the microcontroller.

The first step of the trigger implementation is to define the buffer size in the sensor's internal memory and the sensor's sampling frequency. Based on these two parameters, we enable the accelerometer to collect data for a specific time without interacting with the main chip of the microcontroller. This means that the main chip of the microcontroller can be in sleep mode for the predefined period. When the accelerometer's buffer is full, the accelerometer interrupts the main chip and transfers the stored acceleration data to it. We use the accelerometer's y-axis and z-axis to detect moments when the individual is moving the hand towards the face. Namely, we calculate the mean value for both axes, and if both values are above a predefined threshold, the machine learning procedure for eating detection is activated. We used two axes for the trigger to reduce the possible situations in which the trigger is falsely activated. However, one can work with only one axis, which will result in more activated triggers. We could say that having more activated triggers is not desirable. However, if the eating detection method is not good enough to detect eating after a trigger is activated during a meal, then the constraints of the trigger should be reduced.

The next step is the definition of stopping criteria for the machine learning model. The idea here is to stop the machine learning procedure after a specific number of windows if no eating is detected. Each time our trigger is activated, the machine learning procedure is turned on for the next three buffers of data. The machine learning procedure is stopped if there is no positive prediction in any of the three windows. However, if there is at least one positive prediction, the machine learning procedure continues to work for another three new buffers. The number of windows for which the machine learning procedure remains active was obtained experimentally.
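A minimal sketch of the described trigger logic, assuming the interrupt delivers the buffer as an array with one column per accelerometer axis; the threshold values are hypothetical, as the paper does not report them:

```python
# Sketch: threshold-based trigger over one accelerometer buffer.
# Columns are assumed to be (x, y, z); thresholds are placeholders.
import numpy as np

THRESH_Y = 0.3  # assumed value, in g
THRESH_Z = 0.3  # assumed value, in g

def trigger_fired(buffer: np.ndarray) -> bool:
    """Return True when the mean y- and z-axis values both exceed their
    thresholds, i.e. the hand is plausibly oriented towards the head."""
    mean_y = buffer[:, 1].mean()
    mean_z = buffer[:, 2].mean()
    return mean_y > THRESH_Y and mean_z > THRESH_Z
```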
3.2 Machine-Learning Procedure

A detailed description of the used method can be found in [9]. The method is based on machine learning and consists of the following steps: filtering the accelerometer and gyroscope data coming from the wristband, segmentation of the filtered data, feature extraction, feature selection, two stages of model training, and prediction smoothing.

In the first step, the raw data were filtered with a 5th-order median filter to reduce noise. Furthermore, the median-filtered data were additionally filtered with low-pass and band-pass filters. Hence, we ended up with three different streams of data: median-, low-pass- and band-pass-filtered data.

The accelerometer and gyroscope data were segmented using a sliding window of 15 seconds with a 3-second slide between consecutive windows. This means that once we have 15 seconds of data, the buffer is adjusted to store only 3 seconds of new data. After that, each time the buffer is full, we add the new 3 seconds of data to the previous 15-second window and drop the oldest 3 seconds from it. The reason for the length of the window is that it needs to contain an entire food intake gesture [10].

After the segmentation step, we extracted three different groups of features. We also included a feature selection step to improve the computational efficiency of the method, to remove the features that did not contribute to the accuracy, and to reduce the odds of overfitting.

The training procedure for the method used in this study consists of three stages. The first two aim at training eating-detection models on an appropriate amount of representative eating and non-eating data. The third stage smooths the predictions of the model.
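A minimal sketch of the described buffering scheme, assuming a 100 Hz sampling rate (as in Section 4) and a stream of 3-second buffers delivered by the sensor; the names are illustrative only:

```python
# Sketch: 15-second windows advanced by 3 seconds. Each new 3-second
# buffer is appended and the oldest 3 seconds are dropped.
import numpy as np
from collections import deque

FS = 100              # sampling frequency in Hz, as in Section 4
WINDOW = 15 * FS      # 15-second window, in samples
STEP = 3 * FS         # 3-second slide, in samples

def sliding_windows(stream):
    """Yield 15-second windows from an iterable of 3-second buffers,
    where each buffer is an array of STEP samples."""
    window = deque(maxlen=WINDOW)
    for buffer in stream:
        window.extend(buffer)
        if len(window) == WINDOW:
            yield np.asarray(window)
```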
4 DATASET AND EXPERIMENTAL SETUP

For this study, we recorded data from 10 subjects (8 male and 2 female), ranging in age from 20 to 41 years. The data were recorded using a commercial Mobvoi TicWatch S smartwatch running Wear OS, providing 3-axis accelerometer and 3-axis gyroscope data sampled at 100 Hz. The technical description of the smartwatch's sensors shows that the recorded data are compatible with the target wristband for which we are developing our eating detection method. Additionally, the use of a commercially available smartwatch was an easier option for recording data. The collected dataset contains recordings of the usual daily activities performed by the subjects, including eating. The subjects wore the smartwatch on their dominant hand while recording. The smartwatch had an application installed on it, which enabled them to label the beginning and the end of each meal. There were no limitations on the type of meals the subjects could have while recording, which resulted in 70 different meals being included in the dataset. Furthermore, the subjects were asked to act naturally while having their meals, meaning talking, gesticulating, using the smartphone, etc. The total data duration is 161 hours and 18 minutes, out of which 8 hours and 19 minutes correspond to eating activities.

For evaluation, the leave-one-subject-out (LOSO) cross-validation technique was used. In other words, the models were trained on the whole dataset except for one subject, on which we later tested the performance. The same procedure was repeated for each subject in the dataset. The results obtained using this evaluation technique are more reliable compared to approaches where the same subject's data are used for both training and testing, which show excessively optimistic results.

As mentioned before, smartwatches offer limited resources, one of which is the size of the RAM. Therefore, we analyzed models of different sizes to see whether bigger and more complex models provide higher accuracy. We tested the performance of four different machine learning algorithms: Random Forest [11], Decision Tree [12], Logistic Regression [13] and LinearSVC [14].

We analyzed the following evaluation metrics: recall, precision and F1 score. These are the most commonly used metrics for classification tasks like ours and give a realistic estimate of the efficacy of the algorithm.
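A minimal sketch of the described LOSO evaluation, using scikit-learn's LeaveOneGroupOut with the subject ID as the group; X, y and subjects are hypothetical placeholders, and Random Forest stands in for any of the four algorithms:

```python
# Sketch: leave-one-subject-out cross-validation with per-subject
# precision and recall, assuming binary labels with eating = 1.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

def loso_evaluate(X, y, subjects):
    logo = LeaveOneGroupOut()
    precisions, recalls = [], []
    for train_idx, test_idx in logo.split(X, y, groups=subjects):
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        precisions.append(precision_score(y[test_idx], pred))
        recalls.append(recall_score(y[test_idx], pred))
    return np.mean(precisions), np.mean(recalls)
```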
Also, the final results were obtained from the whole recording of each subject. The reason for this is mainly to give a realistic picture of how good the developed method is in real-life settings.

5 RESULTS

The primary use of the trigger is to reduce the activity of the machine learning procedure. However, for the efficiency of the trigger, a very important requirement is when and how often the trigger is activated during a meal. In order to achieve accurate predictions, we want the trigger to be activated as soon as the meal is started. Additionally, the percentage of activated triggers during a meal should be bigger compared to non-eating segments. For this purpose, we explored which window size works best with our trigger. Table 1 shows the results achieved in the conducted experiments. We tested two different window sizes with two slide values for each window, resulting in a total of four combinations.

Table 1: Different window sizes for the trigger procedure.

  Window and    Trigger            % of activated   Meals
  slide size    activation time    triggers         detected
  3 - 1         36 s               34.2             68/70
  3 - 3         41 s               32.6             68/70
  15 - 3        48 s               42.0             55/70
  15 - 5        41 s               42.0             54/70

The combinations used for the window and slide size are shown in the first column of the table. The second column shows the average time needed for the trigger to be activated for the first time after a meal is started. The third column shows the average percentage of triggered windows during a meal. These two columns were used as a metric for selecting the optimal size of the window and the slide between the windows. The last column shows the number of meals for which the trigger was activated. The values in the second and third columns were obtained only from the meals for which the trigger was activated. Row-wise comparison between these columns shows the results obtained with each combination of window and slide. We can see that the optimal combination regarding the average time needed for the trigger to be activated after a meal is started is a window size of 3 seconds with a slide of 1 second between two windows. Therefore, in our further analysis, we used this combination. The optimal window size of 3 seconds is expected if we keep in mind that the usual intake gesture lasts around 2 seconds. Longer windows fail to detect the gesture during a meal, because we usually have only two or three intakes in 15 seconds and the mean value over the whole window is low.

Table 2 shows the final results obtained using the whole method described in Section 3.

Table 2: Results of the eating detection procedure achieved with different algorithms and their model size.

  Algorithm            Precision   Recall   F1 score   Model size
  Random Forest        0.84        0.75     0.79       36339 KB
  Logistic Regression  0.70        0.71     0.70       1.25 KB
  LinearSVC            0.69        0.71     0.70       1.8 KB
  Decision Tree        0.59        0.65     0.62       175 KB

Row-wise comparison between the used evaluation metrics shows the results obtained using the different algorithms shown in the first column. Additionally, the last column of the table shows the final model size. We can clearly see that the results achieved with Random Forest are better than those of the remaining algorithms. However, if we compare the model size of the best-performing algorithm with the remaining algorithms, we can say that the results achieved using Logistic Regression and LinearSVC are acceptable. Additionally, the precision value of 0.84 shows that the combination of the trigger and the machine learning procedure can differentiate between eating and non-eating segments. However, the recall value of 0.75 suggests that a more accurate method regarding the eating periods is needed.

We also analyzed how much time each of the previously described algorithms was active during the non-eating periods. The results of this experiment are shown in Table 3. Additionally, in this table we can see the false positive rate during the non-eating periods. The best results are achieved using the Random Forest classifier, which is active during only 20% of the whole non-eating period. This means that our trigger-based procedure reduces the usage of the machine-learning procedure by 80%. However, this number also depends on the detection method, because once it is activated, the eating predictions extend the active time of the method.

Table 3: Comparison of active time and false positive rate of the machine learning algorithms during the non-eating period.

  Algorithm            Active time during       False positive rate
                       non-eating period
  Random Forest        20%                      1.36%
  Logistic Regression  22%                      2.18%
  LinearSVC            22%                      2.34%
  Decision Tree        23%                      3.93%
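The model sizes reported in Table 2 could be measured, for example, by serializing each trained model; the paper does not state how the sizes were obtained, so the following is only an illustrative sketch:

```python
# Sketch: estimate a trained model's size in kilobytes by serializing it.
# How the authors actually measured model size is not stated in the paper.
import pickle

def model_size_kb(model) -> float:
    return len(pickle.dumps(model)) / 1024.0
```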
6 CONCLUSION AND FUTURE WORK

In this paper, we presented a method that can accurately detect eating moments using 3-axis accelerometer and gyroscope sensor data. Our method consists of an energy-efficient trigger and a machine-learning procedure, which is started only after the trigger is activated. We evaluated this method using a dataset of 70 meals from 10 subjects. The results of the LOSO evaluation showed that we are able to recognize eating with a precision of 0.84 and a recall of 0.75.

The presented results are important because both the training and the evaluation data were recorded in uncontrolled real-life conditions. We want to emphasize the real-life evaluation, since it shows the robustness of the method while dealing with plenty of different activities that might be mistaken for eating, as well as recognizing meals that were recorded in many different environments while using many different utensils. The proposed method can also deal with interruptions while having a meal, such as having a conversation, using the smartphone, etc. Additionally, we believe that the energy efficiency of the proposed method is very important. The proposed technique uses a trigger to activate the machine learning procedure and is able to reduce the active time of the machine learning procedure by almost 80%. If we keep in mind that wristbands are devices with limited resources, even small reductions in resource usage can be significant for longer battery life.

The initial results achieved in this study are encouraging for further work, in which we expect to improve the eating detection method. In the near future, we plan to optimize our machine learning procedure to detect eating periods more accurately once the trigger is activated. Furthermore, we want to overcome the problem of false positive predictions. For this problem, we believe that a more sophisticated method for selecting representative non-eating data will help to recognize the problematic activities and directly include them in the training data. Also, we plan to investigate personalized threshold values. We believe that personalized values for the threshold will help to activate the trigger during eating periods more easily. Additionally, this could reduce the activation of the machine-learning procedure during non-eating periods. Finally, we plan to explore memory-efficient methods for storing the models in memory.

ACKNOWLEDGMENTS

This work was supported by the WellCo and CoachMyLife projects. The WellCo project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 769765. The CoachMyLife project has received funding from the AAL programme (AAL-2018-5-120-CP) and the Ministry of Public Administration of Slovenia.

REFERENCES
[1] World Health Organization. World Health Statistics 2015. Luxembourg, WHO, 2015.
[2] Public Health England. Data Factsheet: Adult Obesity International Comparisons. London, 2016. http://webarchive.nationalarchives.gov.uk/20170110165728/http://www.noo.org.uk/NOO_pub/Key_data
[3] M. Mirtchouk, D. Lustig, A. Smith, I. Ching, M. Zheng, and S. Kleinberg. Recognizing eating from body-worn sensors: Combining free-living and laboratory data. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 1(3):85:1–85:20, Sept. 2017.
[4] E. Thomaz, I. Essa, and G. D. Abowd. A practical approach for recognizing eating moments with wrist-mounted inertial sensing. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp '15, pages 1029–1040, New York, NY, USA, 2015. ACM.
[5] Y. Dong, J. L. Scisco, M. Wilson, E. Muth, and A. W. Hoover. Detecting periods of eating during free-living by tracking wrist motion. IEEE Journal of Biomedical and Health Informatics, 18:1253–1260, 2014.
[6] O. Amft, H. Junker, and G. Tröster. Detection of eating and drinking arm gestures using inertial body-worn sensors. In Ninth IEEE International Symposium on Wearable Computers (ISWC '05), pages 160–163. IEEE, 2005.
[7] P. Navarathna, B. W. Bequette, and F. Cameron. Wearable device based activity recognition and prediction for improved feedforward control. In Proceedings of the American Control Conference, 2018. doi: 10.23919/ACC.2018.8430775.
[8] K. Kyritsis, C. Diou, and A. Delopoulos. Detecting meals in the wild using the inertial data of a typical smartwatch. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2019. doi: 10.1109/EMBC.2019.8857275.
[9] S. Stankoski, N. Reščič, G. Mezič, and M. Luštrek. Real-time eating detection using a smartwatch. In EWSN 2020, pages 247–252, February 2020.
[10] Xu Ye, Guanling Chen, and Yu Cao. Automatic eating detection using head-mount and wrist-worn accelerometers. In 2015 17th International Conference on E-health Networking, Application & Services (HealthCom), pages 578–581, October 2015.
[11] T. K. Ho. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE, 1995.
[12] P. H. Swain and H. Hauska. The decision tree classifier: Design and potential. IEEE Transactions on Geoscience Electronics, 15(3):142–147, July 1977. doi: 10.1109/TGE.1977.6498972.
[13] Youngjo Lee, John A. Nelder, and Yudi Pawitan. Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Vol. 153. CRC Press, 2018.
[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
Comparison of Methods for Topical Clustering of Online Multi-speaker Discourses

Vid Stropnik, University of Ljubljana, Faculty of Computer and Information Science, Velenje, Slovenia, vs6309@student.uni-lj.si
Zoran Bosnić, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia, zoran.bosnic@fri.uni-lj.si
Evgeny Osipov, Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Luleå, Sweden, evgeny.osipov@ltu.se

ABSTRACT

Discussions held on online forums differ from traditional text documents in several ways. In addition to the individual text bodies (submission comments, forum posts, etc.) being very short, they also have multiple messengers, each of whom may exhibit unique patterns of speech. Consequently, state-of-the-art methods for text summarization are often rendered inapplicable for these sorts of corpora. This paper evaluates the topic-clustering algorithm used in state-of-the-art online comment clustering techniques as part of commonly used summarizer models. It proposes two alternative, vector-based approaches and presents the results of a comparative external analysis, concluding that the three methods are comparable.

KEYWORDS

latent Dirichlet allocation, word embeddings, GloVe, hyperdimensional computing, self-organized maps, topical clustering, clustering evaluation, discussion summarization

1 INTRODUCTION

User generated comments carry a great amount of useful information. Big data researchers have successfully used them to predict stock market volatility [1] and to predict the characteristics of such comments that perform the best on a given online platform [2]. User comments can also offer vast amounts of complementary information, as well as being forms of information surveillance, entertainment or social utility [3]. Existing mechanisms for displaying comments on websites do not scale well and often lead to cyberpolarization [4]. Furthermore, they are platform-specific and often fail to offer an overall image of the topics discussed in a given comments section.

A comprehensive, easily understandable automatic summary of the online discourse at hand can be instinctively understood as a solution to this problem. This, however, is no easy task, seeing as these corpora are often very short and come from multiple speakers. Consequently, traditional summarization methods do not translate well to these sorts of text bodies.

In Section 2 of this paper, the related work establishes the general framework that other authors use for the task at hand. It establishes the Latent Dirichlet Allocation (LDA) topic modeling algorithm as the current leading method for topical grouping of individual comments. These topical groups play a pivotal role in later summarization steps, also presented in Section 2.

In this paper, we externally evaluate and compare LDA against two frameworks using word representations in semantic vector space. We describe the analyzed methods in Sections 3 and 4.
In Section 5, we describe the comparative evaluation methodology used to determine the applicability of each modeling technique and present our results. We follow it up by discussing further work in the conclusion of this paper.

2 RELATED WORK

Online discussion summarization is a field that has not been addressed directly by many authors. One group of works [5-7] has roughly described a three-step process, commonly presented as the state of the art. The approach includes a topical clustering of all the observed comments, establishing a ranking method for determining the most salient ones in each cluster, and later summarizing this selection. Between them, the authors confidently establish Latent Dirichlet Allocation (LDA) topic modeling as the most human-like grouping algorithm. Further work also proposes a novel graph-based linear regression model based on the Markov Cluster Algorithm (MCL) [8], which outperforms LDA, but relies on multidomain knowledge bases for its implementation. While we argue that extractive summarization is not an ideal method for the analysis of multi-speaker corpora, the first step of identifying and topically clustering individual comments in each comment section is assumed to be a required step towards successful summarization of the topics discussed therein.

To the best of our knowledge, popular NLP word embedding algorithms (i.e. word2vec, GloVe) have not been used directly for comment summarization applications up until now. Similarly, neither have hyperdimensional representations, another topic of interest.

3 NLP METHODS

In this work, we examine three distinct topical clustering models, the output of which is always a set of comment clusters, given a multi-comment input.

The first is an LDA model, using a term frequency–inverse document frequency (TFIDF) word representation as input. In this representation, the comments were hard-clustered into the groups for which their degree of membership had been the highest in the soft-clustering approach provided by the LDA model.

The second examined model uses GloVe word embeddings clustered with the k-means clustering algorithm, thus portraying words in semantic vector space using information about the contexts in which words often appear.

The third model creates hyperdimensional representations of words, maps them into a two-dimensional topology using the self-organized maps algorithm and then clusters them like the preceding model. This approach is the least explored for this use-case and is inspired by the observed differences between the functionality of the human brain and the traditional von Neumann architecture of modern computing.

We performed the comparative evaluation of the models on the Reddit Corpus (by subreddit) dataset, provided by the Cornell Conversational Analysis Toolkit (ConvoKit, https://convokit.cornell.edu/documentation/index.html). Five Conversations, corresponding to as many threads on the website Reddit, were extracted from the corpus. We selected threads discussing topics from different subject domains, where each contained at least 50 non-removed comment text bodies. Two human annotators were then asked to manually identify topical clusters in the selected Conversations. The comment texts were provided to them in the form of a set of numbered text files, containing only the text data in chronological order of submission. Reddit post titles and other metadata were not available to the annotators, and no guidance was given as to the number of topics required. The clusterings were examined as-is, with no singleton removal performed.

We describe the NLP techniques used to create the three clustering models in the following subsections, with the external evaluation results presented in Section 5.
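A minimal sketch of the described thread extraction, assuming the ConvoKit Corpus/download API; the subreddit name is a hypothetical example, as the paper does not state which subreddits were used:

```python
# Sketch: extract up to five Reddit threads with at least 50 non-removed
# comments each from a ConvoKit subreddit corpus.
from convokit import Corpus, download

corpus = Corpus(filename=download("subreddit-Cornell"))  # hypothetical choice

conversations = []
for conv in corpus.iter_conversations():
    comments = [utt.text for utt in conv.iter_utterances()
                if utt.text and utt.text != "[removed]"]
    if len(comments) >= 50:        # threshold used in the paper
        conversations.append(comments)
    if len(conversations) == 5:    # five threads, as in the paper
        break
```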
3.1 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a topic modeling technique initially proposed in the context of population genetics, but later applied in machine learning in the early 21st century. It assumes a generative process of documents as random mixtures over a collection of latent topics. Each of these topics, in turn, is characterized by a certain distribution over words. A topic model can be created by estimating the per-document distribution of topics θ and the per-topic distribution over words φ [9]. Many methods, such as variational inference, Bayesian parameter estimation [9] and collapsed Gibbs sampling [10], have been used to approximate these values. In the end, they all boil down to maximizing the model's probability of creating the exact documents provided to it in the input, assuming knowledge of the number of topic distributions.

3.2 Word Embeddings

Word embeddings is a collective name for a set of language modeling and feature learning techniques yielding word representations as vectors, the relative similarities of which correlate with the semantic similarity of the represented words. These meanings are extracted from the contexts – fixed-size windows of preceding and succeeding words – in which individual words appear in the training corpus. The generation of these vectors is achieved by context counting [11] or context prediction [12]. While there have been several claims of one of the methods for synthesizing word embeddings being superior to the other, recent work implies a correspondence between these model types [13]. Whichever way these word vectors are created, they represent semantic meaning in vector space. Using algebraic similarity measures (in our case, cosine distance) on comment-wise word averages, the relative likeness of the examined comments' meanings is calculated. Comment clusters can then be created by clustering the semantic-space points into groups with high intra-cluster and low inter-cluster similarity. These groups represent the topical clusters used in our examination.
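A minimal sketch of the second model described above: each comment is represented by the average of its GloVe word vectors (spaCy's en_core_web_md model, as named in Section 4) and the comment vectors are grouped with k-means. Unit-normalizing the vectors makes Euclidean k-means behave approximately like clustering by cosine similarity; the exact pipeline details are our assumptions:

```python
# Sketch: GloVe comment averaging + k-means topical clustering.
import numpy as np
import spacy
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_md")  # doc.vector averages the token vectors

def cluster_comments(comments, n_clusters):
    vectors = np.array([nlp(text).vector for text in comments])
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors = vectors / np.clip(norms, 1e-9, None)  # unit-normalize
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(vectors)
```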
3.3 Hyperdimensional Computing

Hyperdimensional computing is a family of biologically inspired methods for representing and manipulating concepts and their meanings in high-dimensional space. Random bipolar vectors of high but fixed dimensionality (≥ 1000) are initialized as individual word representations and are then transformed in ways that place semantically similar comments closer in the high-dimensional vector space, while the similarity of dissimilar comments is likely close to zero due to their inherent orthogonality. The methods used to transform these vectors are binding, bundling and permuting [14]. By using these methods, an individual hyperdimensional vector is created for each comment, encoding in the vector the used words and their position in the comment.

Similar to the clustering of word embeddings, semantically similar comment groups can be found by clustering, thus determining the outputs of the third model. However, the performance of this method did not yield comparative results at first. We hypothesised that this might be due to the high component count of the used vectors (more than double the dimensions of the word embedding approach), so a method of dimensionality reduction was examined, aiming to improve its results. It is described in the next subsection.

3.4 Self-Organized Maps

Self-organizing maps (SOM), also known as Kohonen networks, are computational methods for the visualization and analysis of high-dimensional data. The output of the algorithm is a set of nodes, arranged in a certain topology that represents the nodes' mutual relation, with each node being represented by a weight vector of t dimensional components, with t corresponding to the uniform dimensionality of the data being reduced [15]. As data representations in high-dimensional vector spaces are inherently vulnerable to sparseness, clustering outputs can differ in cases where the clustered data is first dimensionally reduced. Thus, we used the SOM algorithm to examine whether the results (of the examination in Section 5) of any of the proposed frameworks can be improved by dimensionally reducing the vector representations prior to clustering.

SOM proved to drastically improve the performance of the hyperdimensional computing model, while making the word-embeddings-based model perform worse. Consequently, in the evaluation presented in Section 5, we use SOM prior to clustering only in the HD-based approach.

4 IMPLEMENTATION

All implementational work was done in the Python programming language. All text corpora were pre-processed using the WordNetLemmatizer and PorterStemmer from NLTK (https://www.nltk.org/). Stop-word removal was done in the pre-processing step using the topic modeling package Gensim (https://radimrehurek.com/gensim/), which also provided the submodules for TFIDF and LdaModel, used for the implementation of Latent Dirichlet Allocation. GloVe word embeddings were provided by the open-source NLP library SpaCy (https://spacy.io/) as part of the "en_core_web_md" pretrained statistical model for the English language. The SOM algorithm was implemented using the SimpSOM package (https://github.com/fcomitani/SimpSOM/), with k-means clustering provided by Scikit-Learn (https://scikit-learn.org/).
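A minimal sketch of the LDA pipeline using the Gensim submodules named above; tokenization and preprocessing (lemmatization, stemming, stop-word removal) are assumed to have been done beforehand, and the hard-clustering step follows the description in Section 3:

```python
# Sketch: TFIDF-weighted LDA over tokenized comments, hard-clustering
# each comment into its highest-probability topic.
from gensim import corpora, models

def lda_clusters(tokenized_comments, num_topics):
    dictionary = corpora.Dictionary(tokenized_comments)
    bow = [dictionary.doc2bow(doc) for doc in tokenized_comments]
    tfidf = models.TfidfModel(bow)
    corpus_tfidf = [tfidf[vec] for vec in bow]
    lda = models.LdaModel(corpus_tfidf, id2word=dictionary,
                          num_topics=num_topics, random_state=0)
    labels = []
    for vec in corpus_tfidf:
        topics = lda.get_document_topics(vec, minimum_probability=0.0)
        labels.append(max(topics, key=lambda t: t[1])[0])
    return labels
```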
5 EVALUATION

To analyze the applicability of LDA, Word Embeddings and dimensionally reduced hyperdimensional computing for the discussed use-case, topical clustering outputs were created for 5 Reddit Conversations. Two human annotators also manually created topical groups for these conversations. The goal of our evaluation was to see which model created the most human-like clusters, consequently having the highest average agreement measure with the clustering samples provided by the two annotators.

Topical clusters created by the three models were externally evaluated using four symmetric agreement measures: the V-measure [16], the Fowlkes-Mallows index [17], the Rand index [18] and the mutual information score [19]. The latter two were also adjusted for chance. For each examined model, the best performing number of topic clusters was selected. The agreement of the clustering output of each model was measured against both of the manual clusterings, with the per-annotator average of each metric being the final output.
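A minimal sketch of this external evaluation with scikit-learn, which provides all four measures (with the chance-adjusted variants of the Rand index and mutual information); labels_a and labels_b are cluster assignments over the same comments:

```python
# Sketch: four symmetric agreement measures between two clusterings.
from sklearn.metrics import (v_measure_score, fowlkes_mallows_score,
                             adjusted_rand_score, adjusted_mutual_info_score)

def agreement(labels_a, labels_b):
    return {
        "V-measure": v_measure_score(labels_a, labels_b),
        "Fowlkes-Mallows": fowlkes_mallows_score(labels_a, labels_b),
        "Adjusted Rand": adjusted_rand_score(labels_a, labels_b),
        "Adjusted MI": adjusted_mutual_info_score(labels_a, labels_b),
    }
```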
Figure 1 shows the result scores of all four metrics for each analyzed method. In the top row, the average agreement between the two annotators is also shown. This is, expectedly, higher than the average agreement between any examined model and the human outputs.

Figure 1: Visualization of agreement metric results between the human annotators (top) and the average annotator vs. model agreement (bottom three).

A few takeaways can be addressed when examining the figure. Firstly, the different methods were successful to a varying degree depending on the used metric, with each performing best according to at least one. Secondly, when comparing their average relative success in relation to the agreement scores between Annotator A and Annotator B, we can see that their performances are very similar. This can be seen even more clearly in Figure 2, which shows each model's performance with respect to the agreement score between the two human annotators. The percentage is calculated as an averaged sum of all four metric scores, weighted by the sum of these scores achieved by the human-versus-human evaluation. In the figure, Word Embeddings can be seen as the best-performing approach, reaching 54.18% of the human agreement. The performance of LDA presented in Figure 2 is also comparable to that found in [5].

Figure 2: Percentage of the human-versus-human agreement score achieved by each model, averaged between the 4 agreement metrics (GloVe: 54.18%, LDA_TFIDF: 52.55%, SOM_HDV: 47.19%).

However, with the difference in results between the best and the worst performing models being less than 7% of the total human agreement score, this metric is not enough to establish Word Embeddings as superior to LDA or, indeed, to dimensionally reduced hyperdimensional computing. We can conclude that both hyperdimensional computing and Word Embeddings can produce topical clusters comparable to the current state-of-the-art LDA method.

Semantic document representations performing as well as the state-of-the-art topic modeling framework using LDA opens up plentiful possibilities in the field of multi-speaker conversation analysis. Whereas topic modeling's more direct approach of inferring latent conversation topics might be useful in their discovery, the possibility of applying algebraic functions to individual comment vectors might enable further topic mining and experimentation. While the k-means clustering algorithm requires a desired number of clusters as input, similar to LDA, its job in the Word Embedding or SOM-HDC framework is not to encode semantics. This means that an alternative clustering algorithm – one without the need for an input number of medoids – could be used for the task of grouping comments. This, in turn, would result in a truly unsupervised topical clustering framework. A comparative evaluation of these approaches is a field of interest for the future, as our non-conclusive experiments have already shown a vast variance in results when using different clustering approaches.

6 CONCLUSION

In this article, we work from our hypothesis that popular semantics-laden vector representations of text data can be applicable in the established framework for extractive online discussion summarization. We present two models using different vector-based representation techniques and conclude that they are both comparable to the Latent Dirichlet Allocation topic modelling technique used in most of the literature, with the Word Embeddings-based framework outperforming it in our external evaluations.
As mentioned in Section 2, the authors of this article argue that extractive summarizations are intrinsically less suitable when working with multi-speaker corpora. Our future work in this field includes the modeling of an abstractive summarizer framework using the findings presented in this paper. Our intent is to use them in conjunction with graph-based approaches that take advantage of multidomain knowledge bases like DBPedia for both clustering and topic-labelling [8, 20].

Whether used in extractive or abstractive applications, we presume that the field will greatly benefit from our findings, seeing that the two vector-based representation frameworks open a plethora of new possibilities for other researchers. These include detailed data manipulation using algebraic operations on individual comment vectors, as well as said vectors being suitable inputs for deep learning models using neural networks.

REFERENCES
[1] W. Antweiler and M. Z. Frank, 'Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards', J. Finance, vol. 59, no. 3, pp. 1259–1294, Jun. 2004, doi: 10.1111/j.1540-6261.2004.00662.x.
[2] T. Weninger, 'An exploration of submissions and discussions in social news: mining collective intelligence of Reddit', Soc. Netw. Anal. Min., vol. 4, no. 1, p. 173, Dec. 2014, doi: 10.1007/s13278-014-0173-9.
[3] E. Go, K. H. You, E. Jung, and H. Shim, 'Why do we use different types of websites and assign them different levels of credibility? Structural relations among users' motives, types of websites, information credibility, and trust in the press', Comput. Hum. Behav., vol. 54, pp. 231–239, Jan. 2016, doi: 10.1016/j.chb.2015.07.046.
[4] S. Faridani, E. Bitton, K. Ryokai, and K. Goldberg, 'Opinion space: a scalable tool for browsing online comments', in Proceedings of the 28th International Conference on Human Factors in Computing Systems - CHI '10, Atlanta, Georgia, USA, 2010, p. 1175, doi: 10.1145/1753326.1753502.
[5] C. Llewellyn, C. Grover, and J. Oberlander, 'Summarizing Newspaper Comments', Proc. Eighth Int. AAAI Conf. Weblogs Soc. Media, pp. 599–602, Jun. 2014.
[6] Z. Ma, A. Sun, Q. Yuan, and G. Cong, 'Topic-driven reader comments summarization', in Proceedings of the 21st ACM International Conference on Information and Knowledge Management - CIKM '12, Maui, Hawaii, USA, 2012, p. 265, doi: 10.1145/2396761.2396798.
[7] E. Khabiri, J. Caverlee, and C.-F. Hsu, 'Summarizing User-Contributed Comments', presented at the International AAAI Conference on Weblogs and Social Media, pp. 534–537, Barcelona, Spain, Jul. 2011.
[8] A. Aker et al., 'A Graph-Based Approach to Topic Clustering for Online Comments to News', in Advances in Information Retrieval, vol. 9626, N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. M. Di Nunzio, C. Hauff, and G. Silvello, Eds. Cham: Springer International Publishing, 2016, pp. 15–29.
[9] D. Blei, A. Y. Ng, and M. I. Jordan, 'Latent Dirichlet Allocation', J. Mach. Learn. Res., vol. 3, no. 4–5, pp. 993–1022, 2003, doi: 10.1162/jmlr.2003.3.4-5.993.
[10] W. M. Darling, 'A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling', Proc. 49th Annu. Meet. Assoc. Comput. Linguist. Hum. Lang. Technol., pp. 642–647, Dec. 2011.
[11] J. Pennington, R. Socher, and C. Manning, 'GloVe: Global Vectors for Word Representation', in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1532–1543, doi: 10.3115/v1/D14-1162.
[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean, 'Efficient Estimation of Word Representations in Vector Space', arXiv:1301.3781, Sep. 2013. Available: http://arxiv.org/abs/1301.3781.
[13] A. Österlund, D. Ödling, and M. Sahlgren, 'Factorization of Latent Variables in Distributional Semantic Models', in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 227–231, doi: 10.18653/v1/D15-1024.
[14] D. Kleyko, E. Osipov, D. De Silva, and U. Wiklund, 'Distributed Representation of n-gram Statistics for Boosting Self-organizing Maps with Hyperdimensional Computing', in Perspectives of System Informatics, 12th International Andrei P. Ershov Informatics Conference, Revised Selected Papers, pp. 64–79, Novosibirsk, Russia, 2019.
[15] T. Kohonen, T. S. Huang, and M. R. Schroeder, Self-Organizing Maps. Berlin, Heidelberg: Springer, 2012.
[16] A. Rosenberg and J. Hirschberg, 'V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure', in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, Jun. 2007, pp. 410–420. Available: https://www.aclweb.org/anthology/D07-1043.
[17] E. B. Fowlkes and C. L. Mallows, 'A Method for Comparing Two Hierarchical Clusterings', J. Am. Stat. Assoc., vol. 78, no. 383, pp. 553–569, Sep. 1983, doi: 10.1080/01621459.1983.10478008.
[18] W. M. Rand, 'Objective Criteria for the Evaluation of Clustering Methods', J. Am. Stat. Assoc., vol. 66, no. 336, pp. 846–850, Dec. 1971, doi: 10.1080/01621459.1971.10482356.
[19] N. X. Vinh, J. Epps, and J. Bailey, 'Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance', J. Mach. Learn. Res., vol. 11, pp. 2837–2854, Oct. 2010.
[20] I. Hulpus, C. Hayes, M. Karnstedt, and D. Greene, 'Unsupervised graph-based topic labelling using DBpedia', in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining - WSDM '13, Rome, Italy, 2013, p. 465, doi: 10.1145/2433396.2433454.
Machine Learning of Surrogate Models with an Application to Sentinel 5P

Michał Artur Szlupowicz, Warsaw University of Technology, Faculty of Physics, Warsaw, Poland, m.szlupowicz@gmail.com
Jure Brence, Jožef Stefan Institute, Ljubljana, Slovenia, jure.brence@ijs.si
Jennifer Adams, Φ-lab, ESA/ESRIN, Frascati, Italy, jennifer.adams@esa.it
Edward Malina, Earth and Mission Science Division, ESA/ESTEC, Noordwijk, the Netherlands, edward.malina.13@alumni.ucl.ac.uk
Sašo Džeroski, Jožef Stefan Institute, Ljubljana, Slovenia, saso.dzeroski@ijs.si

ABSTRACT

Surrogate models are efficient approximations of computationally expensive simulations or models. In this paper, we report improvements of a framework for learning surrogates on input and output spaces with reduced dimensionality. We present non-linear embeddings and feature importance as additional methods for dimensionality analysis and reduction. The choice of models for prediction is extended with two types of ensembles of decision trees. The performance of the additions is evaluated and compared with the original approaches on a dataset generated by RemoTeC, a complex radiative transfer model.

KEYWORDS

spectral data, neural network, ensemble, surrogate model, dimensionality reduction

1 INTRODUCTION

The TROPOspheric Monitoring Instrument (TROPOMI) is an on-board satellite instrument on the Copernicus Sentinel-5 Precursor satellite [9]. Its main objective is to provide accurate observations of atmospheric parameters, such as the concentrations of atmospheric constituents. Those can be used to obtain better air quality forecasts and to monitor global trends. However, the retrieval of interesting attributes involves running a retrieval algorithm, such as RemoTeC [2, 8], based on "optimal estimation methods" that tend to be computationally very expensive [7].

Machine learning techniques can be used to learn surrogate models that approximate the outputs of intensive simulations and are much faster at making predictions [13]. A framework for learning surrogates of radiative transfer models has been developed [1]. Due to the high dimensionality of both the input and output spaces, the framework employs dimensionality reduction – methods that find low-dimensional projections (embeddings) of data that preserve as much information as possible [4]. Predictive models are learned on input and output spaces with reduced dimensionality.

Despite promising results, the existing framework for learning surrogates is limited to simple feed-forward neural networks for the task of prediction, while offering a choice between PCA and autoencoders to reduce dimensionality [4, 6, 3]. In this paper we present an extension of the framework with two types of ensembles of decision trees for prediction [4], as well as an evaluation of the performance and utility of three additional algorithms for dimensionality analysis and dimensionality reduction: t-SNE [11], UMAP [12] and feature importance based on random forests [10].

2 DATASET

The training dataset was generated using the RemoTeC tool and in total consists of 50000 samples. Each input state vector contains a set of atmospheric parameters: solar zenith angle (SZA), albedo, temperature, pressure, aerosols and profiles of the CH4, CO and H2O gases (in total 125 dimensions). The sampling of the data ensures that it covers the entire range of conditions that S5P/TROPOMI is expected to encounter. Exploratory data analysis reveals three dimensions with zero variance. Removing them results in a dataset with a 122-dimensional input space. The output training data was created using the RemoTeC RTM in the S5P/TROPOMI Shortwave InfraRed (SWIR3) band. Each target vector consists of an infrared spectrum with 834 dimensions.
3 SURROGATE MODELS

The framework for learning surrogates is capable of learning both forward and backward models. The former predict spectra, given atmospheric parameters. The latter reverse this process and learn to approximate the atmospheric parameters that produce a given spectrum, which is useful for optimizing the parameters of the RemoTeC simulation. Surrogates are generally predictive models that map directly between the input and output data of a simulation or computationally expensive model. They offer much faster predictions at the cost of incurring a prediction error. However, when the data is high-dimensional and contains many samples, the computational cost of training and prediction can still be non-trivial. In such cases, methods of dimensionality reduction can offer not only time savings, but also improvements in predictive performance. In our framework, we apply dimensionality reduction to the atmospheric parameters, as well as to the spectral space. Predictive models learn to map between the reduced spaces. An inverse transformation is performed on predictions in the reduced space to obtain predictions in the original output space. For that reason, dimensionality reduction algorithms must provide an inverse transformation in order to be useful as a component of a surrogate model in our framework.

3.1 Dimensionality Reduction

A high number of dimensions makes a problem much harder for many machine learning algorithms due to the curse of dimensionality. For this reason, we tried a range of dimensionality reduction (DR) methods on our data before training on them. DR methods are (potentially unsupervised) algorithms that try to find a projection of the data to a lower-dimensional space that preserves as much information as possible. A lower number of dimensions helps reduce computation time and often even improves the predictive performance of models. Furthermore, DR methods can also be used to visualize high-dimensional data by finding an informative projection into two dimensions that is understandable to humans. Some algorithms, such as t-SNE or UMAP, serve especially this purpose.

Principal Component Analysis (PCA) is one of the most popular dimensionality reduction methods [4]. PCA finds linear projections to a lower-dimensional subspace so that the variance in the data is maximized. Visualizing the ratio of variance covered by individual principal components is a way of assessing the intrinsic dimensionality of the data, as shown in Figure 1. We see that, for the 122-dimensional atmospheric parameter space, we need:
• 23 dimensions to explain 95% of the variance,
• 45 dimensions to explain 99% of the variance,
• 73 dimensions to explain 99.9% of the variance,
and for the output 834-dimensional spectral space:
• 1 dimension to explain 95% of the variance,
• 2 dimensions to explain 99% of the variance,
• 9 dimensions to explain 99.9% of the variance.

Figure 1: Dependence of the cumulative relative variance on the number of principal components for both the input and the output space.
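A minimal sketch of this dimensionality analysis with scikit-learn, assuming a hypothetical data matrix X; the thresholds correspond to the variance levels listed above:

```python
# Sketch: number of principal components needed to reach given
# fractions of cumulative explained variance.
import numpy as np
from sklearn.decomposition import PCA

def components_needed(X, thresholds=(0.95, 0.99, 0.999)):
    cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    # Index of the first component at which each threshold is reached.
    return {t: int(np.searchsorted(cumulative, t) + 1) for t in thresholds}
```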
Autoencoders (AE) [3] are a type of artificial neural network used to learn low-dimensional representations. AE are trained to reproduce the input data on the output of the network after passing through a bottleneck in the network architecture. To prevent autoencoders from memorizing the training dataset, a variety of regularization techniques can be employed. One of the options is adding artificial noise to the input data, which forces the network to generalize.

In our framework, we employ this kind of autoencoder, often referred to as a denoising autoencoder, by adding Gaussian noise with mean 0 and standard deviation 0.1 to the input data during the training process. A more thorough investigation of the effect of this technique on the predictive power can be found in [1]. For both the atmospheric parameters and the spectral space, we used the same 7-layer architecture with an appropriate size of the input and output layers. The architecture can be summarized as follows (a code sketch is given after the list):
• input layer of size N0 + Gaussian noise,
• dense layer of size N1 < N0 and ReLU activation,
• dense layer of size N2 = N1/2 and ReLU activation,
• dense embedding layer of size N3 and linear activation,
• dense layer of size N2 and ReLU activation,
• dense layer of size N1 and ReLU activation,
• output layer of size N0 and linear activation.
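A minimal sketch of this architecture, written with Keras as an assumed implementation (the paper does not state which library was used); n0, n1 and n3 are the layer widths from the list above, with n2 = n1/2:

```python
# Sketch: 7-layer denoising autoencoder with a Gaussian-noise input.
import tensorflow as tf

def build_denoising_autoencoder(n0, n1, n3, noise_std=0.1):
    n2 = n1 // 2                                           # N2 = N1/2
    inputs = tf.keras.Input(shape=(n0,))
    x = tf.keras.layers.GaussianNoise(noise_std)(inputs)   # active only in training
    x = tf.keras.layers.Dense(n1, activation="relu")(x)
    x = tf.keras.layers.Dense(n2, activation="relu")(x)
    embedding = tf.keras.layers.Dense(n3, activation="linear")(x)
    x = tf.keras.layers.Dense(n2, activation="relu")(embedding)
    x = tf.keras.layers.Dense(n1, activation="relu")(x)
    outputs = tf.keras.layers.Dense(n0, activation="linear")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```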
The t-Distributed Stochastic Neighbor Embedding (t-SNE) [11] is a non-linear unsupervised technique for high-dimensional data visualization that can model complex, non-linear dependencies. t-SNE places points that are similar in the original space close together in the embedding with a high probability, while placing dissimilar points close together with only a low probability. Since t-SNE is a stochastic and non-parametric method, there is no way to perform a reverse transformation from the embedding space to the original space. This excludes the method from use as part of the surrogate modelling process. It can, however, be useful for visualizing the dataset. Another disadvantage of t-SNE is its high computational complexity.

Uniform Manifold Approximation and Projection (UMAP) [12] is another dimension reduction technique used for dataset visualization, constructed from a theoretical framework based in Riemannian geometry and algebraic topology. UMAP performs similarly to t-SNE, but preserves more of the global data structure with superior run time performance. As is the case with t-SNE, UMAP does not allow for reverse transformations, which means we cannot use it to learn surrogates. However, visualizations using UMAP allowed us to gain useful insights into the structure of our dataset.

3.2 Prediction Models

One of the predictors we used in our experiment was a feed-forward neural network (NN). We chose an architecture consisting of 2 hidden fully connected layers with ReLU activation functions and linear activation on the output layer [6].

Random Forest (RF) is an ensemble learning technique suited for both regression and classification problems. It uses sample bagging and feature sampling methods to train a set of decision trees. Prediction is performed by averaging over the predictions of the individual regression trees. The main advantage of RF over a simple decision tree is the much better generalization. We decided to use this kind of predictor because it is capable of performing multi-target regression [10].

Extra Random Trees (ET) is a technique very similar to random forests, with two main differences. First, it uses the whole dataset for training the individual trees instead of using bags of samples. Second, it uses random cuts for each split, instead of using the optimal one (in the case of Gini or entropy reduction). It has been shown to perform better than random forests for some problems [5].

4 EXPERIMENT

Our experiment is composed of three parts. In the first two, we employ methods of dimensionality reduction as a way to gain insight and understanding about our dataset and problem. The third part is an empirical evaluation of different combinations of methods for dimensionality reduction and prediction, aiming to identify the one that offers the best predictive performance on unseen data.

4.1 Visualization

We applied the UMAP and t-SNE visualization techniques to both the atmospheric parameters and the spectrum data. As expected, both methods showed clusters in the atmospheric parameters data. In the spectrum data space, UMAP identified a structure in the data, depicted in Figure 2. A comparison of the data points sampled from different clusters shows a large difference in the scale of individual data points. This is likely one of the reasons why such a high variance is concentrated in the first principal component (as seen in Figure 1).

Figure 2: UMAP visualization of the spectrum data. (a) UMAP visualization; (b) comparison of data points.

4.2 Feature Importance

The main advantage of using tree-based models over neural networks is their interpretability. While the ability to be understood by a human is lost when moving from a single tree to an ensemble, random forests can be very useful for estimating the importance of individual features for prediction. We trained a random forest predictor on the full dataset and visualized the feature importance values in Figure 3. We see that 70% of the feature importance is accumulated in just two dimensions. This corresponds well to the PCA estimate of most variance being encompassed by two principal components. Only about half of the features are assigned non-negligible importance. The features identified by this approach warrant further investigation by domain experts.

Figure 3: Random forest predictor importance of atmospheric data features.
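A minimal sketch of this feature-importance analysis, assuming hypothetical arrays X (atmospheric parameters) and y (spectra); scikit-learn's random forest supports multi-target regression and exposes impurity-based importances:

```python
# Sketch: rank input features by random-forest importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ranked_feature_importances(X, y):
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # most important first
    return order, rf.feature_importances_[order]
```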
4.2 Feature Importance
The main advantage of using tree-based models over neural networks is their interpretability. While the ability to be understood by a human is lost when moving from a single tree to an ensemble, random forests can be very useful for estimating the importance of individual features for prediction. We trained a random forest predictor on the full dataset and visualized the feature importance values in Figure 3. We see that 70% of the feature importance is accumulated in just two dimensions. This corresponds well to the PCA estimate of most variance being encompassed by two principal components. Only about half of the features are assigned non-negligible importance. The features identified by this approach warrant further investigation by domain experts.

Figure 3: Random forest predictor importance of atmospheric data features.

4.3 Regression
To compare different regressors and methods of dimensionality reduction, we performed forward and backward predictions using a neural network, a random forest and extra random trees for both autoencoder and PCA embeddings. We reduced the dimensionality of the input space from 123 to 73 and the dimensionality of the output space from 834 to 9. These values correspond to 99.9% explained variance when using PCA. The noise level of the autoencoder was set to σ = 0.1. A more thorough study of the effects of these parameters can be found in [1]. We compare the predictive power of various combinations of either AE or PCA for dimensionality reduction, and either a neural network, a random forest or extra trees as the predictive model, using 10-fold cross validation. In Table 1 we compare the results, using the coefficient of determination as the evaluation metric [4]:

R² = 1 − MSE(model) / variance(training set).

Table 1: Coefficient of determination for various combinations of dimensionality reduction methods (DR) and predictive models (PM), estimated by 10-fold cross validation.

            forward            backward
PM / DR     AE       PCA       AE       PCA
NN          0.9995   0.9998    0.8454   0.9206
RF          0.8931   0.9937    0.9267   0.9311
ET          0.9228   0.9958    0.9370   0.9510

For the forward model, the best performance of R² = 0.9998 is achieved by a neural network mapping between spaces reduced by PCA. For the backward model, the best performing model is extra trees paired with PCA, achieving R² = 0.9510. Both represent very satisfactory and promising models to employ as surrogates for radiative transfer modeling. From Table 1, we can also see that PCA outperformed autoencoders in all cases, while also being much faster to compute. The comparison of predictive models is not as simple. For the forward model, the neural network is the best, but only by a small margin. For the backward model, the differences are larger, with the neural network performing the worst. The performance of random forests was between the performances of the other two predictive models for both the forward and the backward problem.

Since one of the main uses for surrogate models is speeding up computation, time complexity is an important consideration. The main disadvantage of neural networks is the computational complexity required for both training and prediction. An autoencoder takes about ten times as long as PCA to transform a data point to the embedding space. Among the predictive models, the neural network used in this study needed approximately three times as long to make a prediction as the random forests and extra trees, which had a similar time complexity. Nonetheless, making predictions for a test set of 5000 points using any of the described surrogates takes up to one second, while running the full RemoTeC simulation requires several hours of computation.
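In outline, the cross-validated comparison behind Table 1 can be reproduced with scikit-learn as below. This is a sketch under the assumption that X holds the atmospheric parameters and Y_red the PCA-reduced spectra (both placeholder names), and it uses scikit-learn's r2 scorer as a stand-in for the metric defined above.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Forward model: PCA-reduced input -> (reduced) output, 10-fold CV.
for name, model in [("RF", RandomForestRegressor(n_estimators=100)),
                    ("ET", ExtraTreesRegressor(n_estimators=100))]:
    pipeline = make_pipeline(PCA(n_components=73), model)
    scores = cross_val_score(pipeline, X, Y_red, cv=10, scoring="r2")
    print(name, np.mean(scores))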
When comparing with the evaluation results reported for the original framework in [1], the performances in this paper are slightly worse. The reason is that the original study reduced the dimensions of the input space to 102 and of the output space to 50, while in this study we focused on further reducing the dimensions, to 73 for the input space and 9 for the output space. It is an interesting observation that for different dimensionalities, the best performance is achieved by different algorithms.

5 DISCUSSION AND FURTHER WORK
The original framework for learning surrogates on input and output spaces with reduced dimensionality showed high predictive and computational performance on the RemoTeC dataset. The results were very promising for applications in data analysis for Earth Observation missions as a way to dramatically speed up computation without sacrificing much accuracy. However, no single model and approach is the best for every dataset and application, which made the limited scope of options in the original framework a potential downside. With the work presented in this paper, the range of available methods has been extended. Since the choice of algorithms for dimensionality reduction on the input and output spaces, as well as the choice of prediction model for both the forward and the backward model, are all independent from each other, the number of available combinations of algorithms is considerable. Furthermore, the dimension analysis enabled by UMAP, t-SNE and feature importance represents a new way of assessing intrinsic dimensionality and making a more informed choice of the number of target dimensions.

The paper presents an evaluation of the performance of the various included methods on the RemoTeC dataset. However, each of the analyzed algorithms is defined by a number of hyperparameters, which is especially true for neural networks and autoencoders. Furthermore, the dimensions of the reduced input and output spaces can also be considered hyperparameters of the framework. For the presented evaluation we chose the hyperparameters based on values reported in previous work and to some degree optimized them manually. A more rigorous study that employs automated hyperparameter optimization is required in order to compare the available algorithms fairly and arrive at a reliable conclusion about the best approach to modeling the RemoTeC simulation.

Finally, in this study we touched upon the subject of estimating feature importance using random forests in order to gain insight about the data. However, feature importance can also be used to compute feature rankings and perform feature selection, which can be considered another method of dimensionality reduction. In further work, it might be worthwhile to investigate this approach further and include it as an option in the framework for learning surrogates.

6 ACKNOWLEDGEMENTS
We thank dr. Jovan Tanevski for his initial work on the project, as well as for his ideas and help in further work.

REFERENCES
[1] Jure Brence, Jovan Tanevski, Jennifer Adams, Edward Malina, and Sašo Džeroski. 2020. Learning surrogates of a radiative transfer model for the Sentinel 5P satellite. In Proceedings of the International Conference on Discovery Science (Lecture Notes in Computer Science). Volume 12323.
[2] A. Butz, André Galli, O. Hasekamp, J. Landgraf, P. Tol, and I. Aben. 2012. TROPOMI aboard Sentinel-5 Precursor: prospective performance of CH4 retrievals for aerosol and cirrus loaded atmospheres. Remote Sensing of Environment, 120, 267–276.
[3] David Charte, Francisco Charte, Salvador García, María J. del Jesus, and Francisco Herrera. 2018. A practical tutorial on autoencoders for nonlinear feature fusion: taxonomy, models, software and guidelines. Information Fusion, 44, 78–96.
[4] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Number 10. Volume 1. Springer Series in Statistics, New York.
[5] Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely randomized trees. Machine Learning, 63, 1, 3–42.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[7] Otto P. Hasekamp and J. Landgraf. 2002. A linearized vector radiative transfer model for atmospheric trace gas retrieval. Journal of Quantitative Spectroscopy and Radiative Transfer, 75, 2, 221–238. ISSN: 00224073. DOI: 10.1016/S0022-4073(01)00247-3.
[8] Haili Hu, Otto Hasekamp, André Butz, André Galli, Jochen Landgraf, Joost aan de Brugh, Tobias Borsdorff, Remco Scheepmaker, and Ilse Aben. 2016. The operational methane retrieval algorithm for TROPOMI. Atmospheric Measurement Techniques (AMT), 9, 11, 5423–5440.
[9] IPCC. 2014. Fifth Assessment Report – Impacts, Adaptation and Vulnerability. (2014). Retrieved 06/12/2017 from http://www.ipcc.ch/report/ar5/wg2/.
[10] Andy Liaw, Matthew Wiener, et al. 2002. Classification and regression by randomForest. R News, 2, 3, 18–22.
[11] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9, Nov, 2579–2605.
[12] Leland McInnes, John Healy, and James Melville. 2018. UMAP: uniform manifold approximation and projection for dimension reduction. (December 2018). https://arxiv.org/abs/1802.03426.
[13] J. Tanevski, S. Džeroski, and T. Todorovski. 2019. Meta-model framework for surrogate-based parameter estimation in dynamical systems. IEEE Access, 99.
Deep Multi-label Classification of Chest X-ray Images

Dejan Štepec
dejan.stepec@xlab.si
University of Ljubljana, Faculty of Computer and Information Science
XLAB d.o.o.
Ljubljana, Slovenia

ABSTRACT
In this paper we address the problem of Chest X-ray (CXR) classification in a multi-label classification (MLC) setting, in which each sample can be associated with one or several labels. The availability of large-scale CXR datasets has provided the ability to develop highly accurate deep-learning based supervised models that closely resemble the performance of human radiologists. We compare an end-to-end deep-learning based approach with different ensembles of predictive clustering trees (PCTs) and show that similar predictive performance can be achieved when using features extracted from the pre-trained deep-learning model.

KEYWORDS
Chest X-ray, deep-learning, predictive clustering trees, random forest, extra trees

1 INTRODUCTION
Chest X-ray (CXR) is one of the most common medical imaging modalities, with millions of scans performed globally every year [6]. A computer-aided diagnosis (CAD) system can significantly reduce the burden on radiologists and improve the early detection of many deadly diseases. There has been a lot of effort recently to harness the power of machine learning based methods, especially deep learning, for disease classification and localization from CXR images [17]. Interpreting CXR images is very difficult even for trained pathologists, with different visual ambiguities representing a significant challenge in distinguishing between different diseases, resulting in misdiagnoses [5].

Recently, deep-learning based approaches have been presented that, together with the availability of large-scale datasets, significantly improve the performance of CAD methods and in some cases reach radiologist-level performance [8]. In comparison with other approaches and datasets [9, 13, 1], newly presented datasets [8, 10] enable the development of CAD methods for detecting the presence of multiple diseases in CXR images at the same time.

Figure 1: Few examples of Chest X-ray images from the CheXpert dataset [8].

2 RELATED WORK
The recent prevalence of deep-learning methods and the increased availability of large-scale datasets with labeled data have provided the medical community with significant advances, in comparison with methods that require sub-optimal manual feature engineering [14]. State-of-the-art CNN models are becoming a de-facto standard for a wide range of applications in medical imaging, such as detection, classification and segmentation. Similar advances in terms of methods and available data have been observed in the domain of Chest X-ray (CXR) images.

The multi-label classification (MLC) setting is very common in interpreting CXR images, due to the presence of multiple diseases in one particular CXR sample. The deep-learning architecture CheXNet [19] was proposed, based on DenseNet-121 [7] and trained on the ChestX-ray14 dataset [21], which achieved state-of-the-art results over 14 labeled pathologies and even exceeded radiologist performance on pneumonia. Recently, very large-scale CXR datasets were presented, such as CheXpert [8] and MIMIC-CXR [10], which enabled the development of much more robust supervised models. Additionally, the new datasets also capture the notion of uncertainty through labels, and different approaches have been proposed for handling such labels. An architecture similar to CheXNet was proposed, whose performance surpassed 3 certified radiologists in 3 different pathologies [8].

The above MLC approaches do not take into account the dependencies between disease labels, which, when exploited, significantly improve the performance of the predictive models [16]. We evaluate an end-to-end deep-learning based approach for MLC of CXR images, based on the DenseNet architecture [7], and compare it with the traditional approach based on predictive clustering trees (PCTs) [2], in an ensemble setting, using the features extracted from the pre-trained deep-learning network. We demonstrate a similar predictive performance on the large-scale CheXpert dataset [8], thus opening the potential to use PCTs also in a hierarchical setting [20], which, by taking into account the underlying dependency structure and powerful deep features, could advance the current state of the art of supervised MLC deep-learning based approaches and also compete against hierarchical deep-learning based approaches [4, 16], which take the hierarchy into account implicitly, using conditional probability.
3 CHEXPERT: A LARGE CHEST RADIOGRAPH DATASET
CheXpert [8] is a large publicly available dataset for chest radiograph interpretation, consisting of 224,316 CXR images of 65,240 patients, where the presence of 14 different observations is labeled as positive, negative, uncertain or not mentioned. CXR images were collected retrospectively from Stanford Hospital, together with the associated radiology reports. Labels (and their uncertainty) were automatically extracted from the section of the radiology report which summarizes the key findings. A large list of phrases was manually curated by multiple board-certified radiologists to match the various ways observations are mentioned in the reports. Extracted phrases are then classified into positive, negative, uncertain or not-mentioned classes and aggregated into a final set of predefined observations (i.e. pathologies).

The publicly available test data consists of 234 samples from 234 patients, where the ground truth is set by a consensus of 3 radiologists who annotated the set using radiographs; these labels therefore only represent the positive or negative class, with no uncertainties. Evaluation is performed only on 5 observations, selected based on their clinical significance and prevalence in the dataset (i.e. Atelectasis, Cardiomegaly, Consolidation, Edema and Pleural Effusion).

The distribution of all the observed pathologies in the training data and their uncertainty is presented in Figure 2a, and the distribution of observations over a single example in Figure 2b, which shows that there is around a 30% chance of having at least 2 pathologies present at the same time, labeled as definite positive.

Figure 2: (a) Label uncertainty distribution over 14 pathologies in the CheXpert dataset [8] over all the samples in the training data and (b) distribution and probability of occurrence of multiple pathologies in a particular sample (multi-label classification).

In CheXpert [8], different strategies for using the uncertainty labels were evaluated. The two simplest approaches are to ignore uncertain samples during training, or to map them to either the negative or the positive class. A semi-supervised approach is also evaluated, where a model trained with the ignore approach is used to re-label the uncertain examples. Finally, a 3-class classification approach is evaluated, where the uncertain label is used as a separate class during training, while during testing only the probabilities for the positive and negative class are reported. In our work, we use the simple mapping approach, mapping uncertain labels to the positive class and not-mentioned samples to the negative class.
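A minimal sketch of this mapping, assuming the usual CheXpert CSV encoding (1.0 positive, 0.0 negative, -1.0 uncertain, blank not mentioned); the file path is illustrative.

import pandas as pd

OBSERVATIONS = ["Atelectasis", "Cardiomegaly", "Consolidation",
                "Edema", "Pleural Effusion"]

labels = pd.read_csv("CheXpert-v1.0/train.csv")  # illustrative path
for column in OBSERVATIONS:
    labels[column] = labels[column].fillna(0.0)         # not mentioned -> negative
    labels[column] = labels[column].replace(-1.0, 1.0)  # uncertain -> positive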
3.1 Methods
We evaluate an end-to-end deep-learning based approach for multi-label classification (MLC) of CXR images, based on the DenseNet-121 architecture [7], and compare it with the traditional approach based on predictive clustering trees (PCTs) [2], in an ensemble setting, using the features extracted from the pre-trained deep-learning network.

3.2 End-To-End Deep Learning
Several convolutional neural networks (CNNs) were evaluated in CheXpert [8], and the DenseNet-121 [7] architecture produced the best results. Because of that, we used DenseNet-121 for all of our experiments. The original DenseNet is designed for multi-class classification, where the neural network has the same number of output nodes as the number of classes. Each output node belongs to some class and outputs a score for that class. In a multi-class setting, the scores are passed through a softmax layer, which converts scores into probabilities (class probabilities sum to 1), and the input sample is classified into the corresponding class with the highest probability.

In a multi-label classification (MLC) setting, the difference is that an input sample can belong to multiple classes at the same time; the final score thus needs to be independent for each of the classes, which is why a sigmoid function is used instead of softmax. Additionally, the categorical cross-entropy loss function needs to be replaced with binary cross-entropy. We implemented the modified DenseNet-121 in PyTorch (https://pytorch.org/hub/pytorch_vision_densenet), using the Adam optimizer with the same learning rates and parameters as used in CheXpert [8]. The images were resized to 320 x 320, same as in [8], and we trained the network for 10 epochs using a fixed batch size of 32 images. We evaluated the performance on a left-out validation set of 500 images using the receiver operating characteristic curve (ROC) and its area under the curve (AUC), averaged across all observations. The best performing model in terms of the global AUC score was selected for the evaluation on the test set, presented in Section 4.
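The modification can be sketched in PyTorch as follows; the learning rate is an assumption, since the paper only states that the CheXpert settings were reused.

import torch
import torch.nn as nn
from torchvision import models

NUM_OBSERVATIONS = 14  # pathologies labeled in CheXpert

# Swap the multi-class classifier head for one independent logit per
# observation; BCEWithLogitsLoss applies a sigmoid per output, replacing
# the softmax + categorical cross-entropy of the multi-class setting.
model = models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, NUM_OBSERVATIONS)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed lr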
3.3 Predictive Clustering Trees
Predictive clustering trees (PCTs) [2] are decision trees viewed as a hierarchy of clusters, where the top node corresponds to one cluster containing all the data, which is recursively partitioned into smaller clusters while moving down the tree. PCTs are constructed with the standard "top-down induction of decision trees" (TDIDT) algorithm; the major difference in comparison with CART [3] or C4.5 [18] induction is that PCTs treat the variance and prototype functions as parameters, selected based on the learning task at hand. To construct a regression tree, for example, the variance function returns the variance of the given instances' target values, and the prototype is their average value. For the task of predicting tuples of discrete variables, used in multi-label classification (MLC) [15], the variance function is computed as the sum of the Gini indices of the variables from the target tuple, and the prototype function returns a vector of probabilities that an example belongs to a particular class in the target tuple.

In our work we utilized PCTs in an ensemble setting, where the predictions of a set of predictive models (i.e. PCTs) are combined to obtain a final prediction. This is especially useful for unstable base predictors (e.g. trees), where small changes in the dataset yield substantially different models, and usually achieves much better predictive performance [12]. We consider a random forest of PCTs (RF-PCT) [12] and ensembles of extremely randomized PCTs (EXTRA-PCT) [11] for MLC. In RF-PCT, several bootstrap replicates are first constructed and a randomized PCT is then applied, by selecting a subset of attributes in each node, on which all possible tests are considered and the best one is selected. The number of attributes selected is a given parameter, typically a function of the total number of attributes (e.g. log(N), where N represents the number of attributes). In EXTRA-PCT, no bootstrap replicates are constructed, and in each internal node, for each attribute, a test is selected randomly.

We used the CLUS framework (http://clus.sourceforge.net/) for the PCT construction, with 50 baseline PCTs for RF-PCT as well as for EXTRA-PCT. The input consisted of the 1024-dimensional features extracted from the pre-trained DenseNet-121 network, taken before the last fully-connected classification layer. Similarly to the DenseNet-121 end-to-end approach, RF-PCT and EXTRA-PCT were evaluated on the test set in terms of the AUC score.
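Extracting these descriptors can be sketched as below, mirroring the forward pass of torchvision's DenseNet up to, but excluding, the final classification layer.

import torch
import torch.nn.functional as F
from torchvision import models

densenet = models.densenet121(pretrained=True)
densenet.eval()

def extract_features(batch):
    """Return 1024-D descriptors: the globally average-pooled output of
    the DenseNet-121 feature extractor, taken just before the final
    fully-connected classification layer."""
    with torch.no_grad():
        fmap = F.relu(densenet.features(batch))        # (N, 1024, H, W)
        pooled = F.adaptive_avg_pool2d(fmap, 1)        # (N, 1024, 1, 1)
        return torch.flatten(pooled, 1)                # (N, 1024)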
4 RESULTS
We evaluated the different approaches on the publicly available test data, consisting of 234 samples from 234 patients, where the ground truth is set by a consensus of 3 radiologists. We report the results in terms of the Receiver Operating Characteristic curves (ROC) in Figure 3 and the area under the curve (AUC) in Table 1. Among the approaches presented in our work (i.e. DenseNet-121, RF-PCT and EXTRA-PCT), DenseNet-121 performs best, with the EXTRA-PCT approach following closely. The biggest differences are observed on the Cardiomegaly class, which coincides with the results reported in CheXpert [8]: most of the uncertain cases are borderline, which reduces the performance of the simple mapping to a positive or negative label. Table 1 also compares the presented approaches against the DenseNet-121 baseline presented in CheXpert [8], where 10 checkpoints per run were chosen and each model was run three times, generating an ensemble of 30 models, which improved the results by a small margin over our baseline DenseNet-121 approach. Nevertheless, we achieved or surpassed the CheXpert results on the Cardiomegaly and Pleural Effusion classes and achieved similar performance on the other classes.

Figure 3: Receiver Operating Characteristic curves (ROC) for the end-to-end deep-learning approach based on DenseNet-121 (a) and ensembles of Predictive Clustering Trees (PCTs) based on random forest (b) and extremely randomized trees (c).

Table 1: Comparison of different methods against the baseline CheXpert results [8] in terms of AUC scores.

Method                  Atelectasis  Cardiomegaly  Consolidation  Edema  Pleural Effusion
CheXpert (U-Ones) [8]   0.86         0.83          0.90           0.94   0.93
DenseNet-121            0.81         0.84          0.86           0.92   0.93
RF-PCT                  0.83         0.75          0.83           0.90   0.92
EXTRA-PCT               0.82         0.80          0.86           0.90   0.93

5 CONCLUSION
In this paper we addressed the problem of Chest X-ray (CXR) classification in a multi-label classification (MLC) setting and compared an end-to-end deep-learning based approach with different ensembles of predictive clustering trees (PCTs), showing that similar predictive performance can be achieved when using features extracted from the pre-trained deep-learning model. These results show the potential to use PCTs also in a hierarchical setting, which, by taking into account the underlying dependency structure and powerful deep features, could advance the current state of the art.

ACKNOWLEDGMENTS
This work has been supported by the H2020 iPC project (826121).

REFERENCES
[1] Worawate Ausawalaithong, Arjaree Thirach, Sanparith Marukatat, and Theerawit Wilaiprasitporn. 2018. Automatic lung cancer prediction from chest x-ray images using the deep learning approach. In 2018 11th Biomedical Engineering International Conference (BMEiCON). IEEE, 1–5.
[2] Hendrik Blockeel, Luc De Raedt, and Jan Ramon. 1998. Top-down induction of clustering trees. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 55–63.
[3] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Statistics/Probability Series. (1984).
[4] Haomin Chen, Shun Miao, Daguang Xu, Gregory D. Hager, and Adam P. Harrison. 2019. Deep hierarchical multi-label classification of chest x-ray images. In International Conference on Medical Imaging with Deep Learning, 109–120.
[5] Louke Delrue, Robert Gosselin, Bart Ilsen, An Van Landeghem, Johan de Mey, and Philippe Duyck. 2011. Difficulties in the interpretation of chest radiography. In Comparative Interpretation of CT and Standard Radiography of the Chest. Springer, 27–49.
[6] [n. d.] Diagnostic Imaging Dataset 2019-20 Data, NHS England. (Accessed 4 August 2020).
[7] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
[8] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence. Volume 33, 590–597.
[9] Amit Kumar Jaiswal, Prayag Tiwari, Sachin Kumar, Deepak Gupta, Ashish Khanna, and Joel J. P. C. Rodrigues. 2019. Identifying pneumonia in chest x-rays: a deep learning approach. Measurement, 145, 511–518.
[10] Alistair E. W. Johnson, Tom J. Pollard, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G. Mark, Seth J. Berkowitz, and Steven Horng. 2019. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042.
[11] Dragi Kocev and Michelangelo Ceci. 2015. Ensembles of extremely randomized trees for multi-target regression. In International Conference on Discovery Science. Springer, 86–100.
[12] Dragi Kocev, Celine Vens, Jan Struyf, and Sašo Džeroski. 2013. Tree ensembles for predicting structured outputs. Pattern Recognition, 46, 3, 817–833.
[13] Paras Lakhani and Baskaran Sundaram. 2017. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology, 284, 2, 574–582.
[14] June-Goo Lee, Sanghoon Jun, Young-Won Cho, Hyunna Lee, Guk Bae Kim, Joon Beom Seo, and Namkug Kim. 2017. Deep learning in medical imaging: general overview. Korean Journal of Radiology, 18, 4, 570–584.
[15] Gjorgji Madjarov, Dragi Kocev, Dejan Gjorgjevikj, and Sašo Džeroski. 2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 9, 3084–3104.
[16] Hieu H. Pham, Tung T. Le, Dat Q. Tran, Dat T. Ngo, and Ha Q. Nguyen. 2019. Interpreting chest x-rays via CNNs that exploit disease dependencies and uncertainty labels. arXiv preprint arXiv:1911.06475.
[17] Chunli Qin, Demin Yao, Yonghong Shi, and Zhijian Song. 2018. Computer-aided detection in chest radiography based on artificial intelligence: a survey. Biomedical Engineering Online, 17, 1, 113.
[18] J. Ross Quinlan. 2014. C4.5: Programs for Machine Learning. Elsevier.
[19] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. 2017. CheXNet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225.
[20] Celine Vens, Jan Struyf, Leander Schietgat, Sašo Džeroski, and Hendrik Blockeel. 2008. Decision trees for hierarchical multi-label classification. Machine Learning, 73, 2, 185.
[21] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. 2017. ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2097–2106.

Smart Issue Retrieval Application

Jernej Zupančič, Borut Budna, Miha Mlakar, Maj Smerkol
jernej.zupancic@ijs.si, borut.budna@ijs.si, miha.mlakar@ijs.si, maj.smerkol@ijs.si
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
Faculty of Computer and Information Science, Ljubljana, Slovenia
Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

Figure 1: SIRA screenshot

ABSTRACT
We present the Smart Issue Retrieval Application (SIRA), a customer support tool for searching for relevant email threads or issues, given an email thread and keywords. We present the overall application architecture, the processing pipeline that transforms the data into a search-friendly form, and the search algorithm itself.

KEYWORDS
customer support, language models, information retrieval

1 INTRODUCTION
Customer support is an important part of many large businesses, and high-quality customer support can improve the user experience and help businesses retain their customers for longer periods. For larger companies, it can also be a strain on their human resources, as many customer support issues need to be resolved in a short time. While the customer support team may resolve most issues on their own, sometimes they need the help of the development department. Often similar issues are presented to the developers multiple times.

In order to minimize the number of issues that need attention from other departments, we have developed an application to help customer support technicians resolve issues without help from developers. While some issues will still need the attention of developers, SIRA can help find existing answers to questions that have already been resolved by developers and therefore reduce the amount of distraction for the development team.

We use language models in order to retrieve information about the question from the issue at hand. Using multiple different approaches, the application searches the database of resolved issues in order to find the developers' answers to the same or similar questions.

2 SIRA ARCHITECTURE
SIRA comprises five main application components (Fig. 2):
(1) Database. PostgreSQL [6] is used as the application database, since it includes decent built-in text search capabilities and change data capture options.
(2) Processing daemon. A Python [7] process responsible for processing the data for search in the event of change data capture.
(3) Back-end application. A Python Flask-based back-end application exposing the application programming interface for SIRA.
(4) Front-end application. A React-based [8] single-page application for interacting with SIRA.
(5) Documentation. MkDocs-based user documentation for final users, admins, and developers.

Each SIRA component is packaged within a Docker [4] image and can be managed using the "docker-compose" [2] tool. This enables deterministic packaging of the application code for development, testing and production.

Figure 2: SIRA architecture overview
3 SIRA FUNCTIONALITY
The main goal of SIRA is to enable customer support staff to quickly find answers to similar questions that have already been resolved in the past. Search is therefore the primary functionality of the application and can be split into three parts:
(1) Processing. Upon new data arrival, pre-process the text to obtain a representation suitable for search.
(2) Search. Compute relevancy scores upon a search request by taking into account as much information about the issue or email thread as possible.
(3) Logging. To improve the search in the future, the search results and structured user feedback are gathered and stored.
In the rest of this section we describe each part in more detail.

3.1 Processing
For the search to be efficient, it is beneficial to pre-process the raw emails. The processing daemon runs as a separate Python process and utilizes PostgreSQL's logical replication functionality in order to transform new content as soon as it is written to the database. The following steps are executed when processing the issues:
(1) HTML cleaning. The Beautiful Soup [1] library is used to extract only the relevant text from the email XML markup.
(2) Empty line removal. A Python script is used to detect and remove empty lines.
(3) Repeated email removal. Parts of emails are deleted if they already appear within some previous email of the same issue.
(4) Semi-structured email handling. Some emails are actually a filled-out form in an email format. A Python script is used to extract only the relevant information.
(5) Non-author line removal. A machine learning model was developed and deployed for this task.
(6) Removal of lines consisting only of non-alphanumeric characters. A Python script is used to detect and remove such lines.
(7) Word vector representation computation and update. FastText [3] word vectors are used to compute the word vector representation of the text.
(8) Storing of the processed text. The processed text is stored in the database, where built-in database indexing is utilized to further prepare the text for efficient searching.
In the rest of this section we focus on the non-trivial processing steps.

3.1.1 Repeated email removal. There were two reasons for removing repeated emails from an email thread. First, when displaying an email, usually all the previous emails are also included, which results in poor readability. Second, some methods for comparing text take into account the number of occurrences of a particular word. This is sensible when the word actually repeats in the content; however, if it repeats due to text duplication, it could negatively impact the search results.

We define a repeated email as an email body that appears within another email body. This is usually a result of using the "Reply" functionality when responding to an email within an email client. To delete email A from email B, the following method is used:
(1) Extract only the alphanumeric characters from the two email bodies A and B to get alphanumeric(A) and alphanumeric(B).
(2) If alphanumeric(A) appears within alphanumeric(B), mark it for removal from alphanumeric(B).
(3) If alphanumeric(A) does not appear within alphanumeric(B), iterate over substrings of alphanumeric(B) and compute the matching percentage of consecutive alphanumeric blocks from alphanumeric(A). The substring with the maximum match is a candidate for removal. If it exceeds a predefined threshold, it is marked for removal from alphanumeric(B).
(4) Reconstruct B by dropping the substring marked for removal, together with all non-alphanumeric characters positioned within the marked substring when expanded with all the characters.
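A simplified sketch of steps (1)-(3): the threshold value is an assumption (the paper does not state it), and difflib's ratio is used as a stand-in for the matching percentage over consecutive alphanumeric blocks.

import re
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.9  # assumed; the paper only mentions "a predefined threshold"

def alphanumeric(text):
    """Step (1): keep only the alphanumeric characters."""
    return re.sub(r"[^0-9a-zA-Z]", "", text)

def is_repeated(email_a, email_b):
    """Steps (2)-(3): decide whether email A should be removed from B."""
    a, b = alphanumeric(email_a), alphanumeric(email_b)
    if a and a in b:
        return True
    return bool(a) and SequenceMatcher(None, a, b).ratio() >= MATCH_THRESHOLD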
3.1.2 Non-author line removal. An email body usually comprises:
(1) relevant content,
(2) a signature,
(3) a confidentiality notice,
(4) previous email headers,
(5) previous email content.
The only text that should be used for text comparison is the relevant content. While previous email content was mostly removed in the repeated email removal step (3.1.1), the other email body parts can still impact the text comparison results. Machine learning was utilized to develop a model for determining whether a particular line in the email body belongs to the relevant content of an email or not.

Dataset preparation. First, we implemented an application with a basic graphical user interface that enabled us to label each line with one of the following categories:
(1) AUTHOR. The relevant content falls into this category.
(2) QUOTED. The previous email content.
(3) AUTO-PERSONALIZED. Text that was set by the user in the email client and is inserted automatically by the email client; a signature is an example.
(4) AUTO-NON-PERSONALIZED. Text inserted by the email client automatically; previous email headers are an example.
(5) NEEDS-PRETTIFY. Sometimes the whole email body is present in one line only. To properly label the body, it should first be split into multiple lines.
(6) OTHER. Everything else.
Second, we labeled each line belonging to 100 random issues. This way we generated a dataset of 37,421 labeled lines in 586 emails. Since the assumption was that the "QUOTED" lines are already filtered out by the repeated email removal method, we omitted those lines from the dataset. This left us with 9,848 labeled lines.

Features. The computed features were of two types: local features that took into account just the current line, and global features that took into account the relative position and content of a line within the whole email.

Local features:
(1) Number and proportion of capitalized words
(2) Number and proportion of non-alphanumeric characters
(3) Number and proportion of numeric characters
(4) "CountVectorizer" from the scikit-learn [5] package
(5) "TfidfVectorizer" from the scikit-learn package
(6) Word vector representation of the line

Global features:
(1) Line position from the start
(2) Line position from the end
(3) Does "regard" appear before this line, within this line, or after this line
(4) Do four or more consecutive non-alphanumeric characters appear before this line, within this line, or after this line
(5) Does a date-like string appear before this line, within this line, or after this line
(6) Does a time-like string appear before this line, within this line, or after this line
In order to smooth the predictions, we also tested hierarchical modeling, by first building a model for "AUTHOR" detection and then using the predictions of the lower level as additional features for the higher level. One approach for using the lower-level predictions was to use the "AUTHOR" predictions of the lines just before and just after the current line; the predictions were padded with 1 at the beginning of an email and with 0 at the end. The second approach was based on the sum of three consecutive "AUTHOR" class probabilities: for the lines just before the current line, for the lines where the current line is in the middle, and for the lines just after the current line, with the same padding.

Further, the features were scaled using the StandardScaler, and the feature space dimensionality was reduced using principal component analysis (PCA), both from the scikit-learn package.

Models. For modeling we utilized the scikit-learn package and tested the following algorithms: (1) logistic regression, (2) multinomial naive Bayes, (3) support vector machine, (4) random forest classifier. Rudimentary hyper-parameter tuning was done to pick the best ones.

Evaluation. Each pipeline was evaluated using 10-fold cross validation with the splits over issues. This means that all the lines belonging to one issue were either in the training or in the testing set, to prevent data leaking.

Model selection. The performance of all models was tracked through various metrics:
(1) Confusion matrix
(2) Precision and recall at different minimum recall thresholds
(3) Precision-recall curve
(4) "AUTHOR" probabilities for each line in the test set
The main concern regarding model performance was that it should prioritize keeping the "AUTHOR" lines ("AUTHOR" recall) over average model accuracy. This is a direct result of the application architecture: if a line were removed by the chosen model, it would not be possible to search over it, which would directly impact real-world performance. Additionally, a few extra lines should not hinder readability too much. The gathered metrics enabled us to closely inspect each model and assess its expected real-world performance. A basic GUI was built to inspect the models and review the misclassified examples. In the end, the hierarchical model was chosen, with most of the presented features, with the exception of the "CountVectorizer" and "TfidfVectorizer" features. The additional chosen higher-level feature was the sum of three consecutive "AUTHOR" probabilities. Random forest was chosen as the classification algorithm, without the feature standardization or dimensionality reduction steps. The threshold probability was lowered to 0.12 so that recall could be kept high. The final model misclassified 59 out of 2,394 lines marked as "AUTHOR" (recall = 0.975) and 629 out of 7,454 lines marked as "OTHER" (recall = 0.806).

3.1.3 Word vector representation computation and update. The word vector representation of content is used to compare email bodies and email subjects between different issues. To compute the word vector representation of a text, either an issue body or an issue subject, the following steps are executed: (1) tokenize the text, (2) remove stop words, (3) query the word vector representation for each word using fastText common-crawl word vectors with dimension 300, (4) compute the mean of all word vectors belonging to the words in the text, (5) normalize the mean vector by dividing it by its length.

Instead of generating the representation vectors on the fly, they are pre-computed and only read when needed, which greatly reduces the inference time. To update the word vector representation of a particular text, the corresponding row in the word vector matrix is updated with the new values and stored on disk as a NumPy array.
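The representation of Section 3.1.3 can be sketched as follows, assuming word_vectors is a dict-like token-to-vector mapping (e.g. fastText vectors loaded with gensim) and stop_words a set of stop words; both names are assumptions, not artifacts of SIRA.

import numpy as np

def embed(text, word_vectors, stop_words):
    """Steps (1)-(5): tokenize, drop stop words, average the 300-D
    fastText vectors and normalize the mean to unit length."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(300)
    mean = np.mean(vectors, axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean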
On the other hand, a keyword can tested the following algorithms: (1) Logistic regression, (2) Multi- be implicit – soft keywords, where the user searched for relevant nomial Naive Bayes, (3) Support vector machine, (4) Random issues using a keyword, but the search results were not marked forest classifier. as relevant. Rudimentary hyper-parameter tuning was done to pick the When computing the relevancy of issues, given a starting best ones. issue and some keywords, several relevancy sub-scores are first Evaluation. Each pipeline was evaluated using 10-fold cross computed and then aggregated to form a single relevancy score. validation with the splits over issues. This means that all the lines In Table 1 all combinations for relevance sub-scores are listed. belonging to one issue were either in the training or the testing The final score is computed as a weighted average, as in equa- set to prevent data leaking. tion 1. The weights 𝑤 were determined based on the final user Model selection. The performance of all models was tracked 𝑖 feedback. through various metrics: (1) Confusion matrix finalScore = 𝑤1 · KeywordToKeywordScore (2) Precision and recall at different minimum recall thresholds + 𝑤2 · KeywordToSoftKeywordScore (3) Precision-recall curve (4) “AUTHOR” probabilities for each line in the test set + 𝑤3 · KeywordToDocumentScore The main concern regarding the model performance was that + 𝑤4 · KeywordToSubjectdScore it should prioritize keeping the “AUTHOR” lines (“AUTHOR” + 𝑤5 · DocumentToKeywordScore (1) recall) over average model accuracy. This is a direct result of the + 𝑤 application architecture – if the line would be removed by the 6 · DocumentToSoftKeywordScore chosen model, it wouldn’t be possible to search over it. This would + 𝑤7 · DocumentToDocumentScore directly impact the performance in the real-world. Additionally, + 𝑤8 · SubjectToKeywordScore few additional lines shouldn’t hinder the readability too much. The gathered metrics enabled us to closely inspect each model + 𝑤9 · SubjectToSoftKeywordScore and overview the performance regarding real-world application. + 𝑤10 · SubjectToSubjectScore 114 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Novak, et al. Table 1: Relevance sub-scores matrix Other issues (Not) Keyword Soft (Not-) keyword Document Subject (Not-) Keyword Exact match Exact match Full-text search Full-text search issue Soft (Not) Keyword / / / / ent Word vector cosine Document Reverse full-text Reverse full-text / search search similarity Curr Word vector cosine Subject Reverse full-text Reverse full-text / search search similarity 3.2.1 Exact match. This relevance score compares (soft) key- the database, including user defined keywords and appropriate words related to issues and those inserted in the keyword input results marking. box. Given a (soft) keyword, search for all the documents that are Preprocessing is done without any user interaction and in- in relation to this exact (soft) keyword. Each relation can either volves multiple algorithms and AI methods to extract the text of be positive or negative. Therefore, the returned score is positive the issue from original encoded emails. Testing of the algorithms in case of positive relation and negative otherwise. shows good results both in terms of precision and recall. Word vector representations are pre-computed in order to improve 3.2.2 Full-text search. This relevance score compares keywords performance of search algorithms. 
3.2.1 Exact match. This relevance score compares the (soft) keywords related to issues with those inserted in the keyword input box. Given a (soft) keyword, we search for all the documents that are in relation to this exact (soft) keyword. Each relation can be either positive or negative; therefore, the returned score is positive in the case of a positive relation and negative otherwise.

3.2.2 Full-text search. This relevance score compares the keywords entered in the keyword input box with the issue documents or issue subjects. The full-text search capability of PostgreSQL is leveraged for this score. However, the results are modified to return negative scores in the case of a not-keyword match.

3.2.3 Reverse full-text search. This relevance score compares the selected issue document or subject with all existing (soft) keywords. First, a full-text search relevance score is computed for each keyword. Second, for each issue in the database, the relevance scores of its related keywords are summed.

3.2.4 Word vector cosine similarity. This relevance score compares the selected issue document and subject to all existing issue documents and subjects, respectively. The pre-computed word vectors described in Section 3.1.3 are used. The relevance score is computed as:

wordVectorSimilarity(T₁, T₂) = T₁ · T₂.          (2)

Since the word vectors used are normalized, this equals 1 minus the cosine distance between T₁ and T₂.

Two other methods for comparing the text were also tested: PostgreSQL's built-in trigram text similarity, which was too slow for production use, and a tf-idf representation of text with a cosine-distance-based relevance score, which did not perform as well as the word vector method.

3.3 Logging
To improve the search performance in the future, several interactions with the application are logged:
(1) search results with relevance scores,
(2) viewed search results,
(3) relevant issue/belonging email found,
(4) no relevant issue/belonging email found.
Only after sufficient real-world usage of the application will we be able to quantitatively evaluate the performance of the whole search pipeline and act upon the results.

4 DISCUSSION AND CONCLUSION
The SIRA system was developed and deployed, including five Docker-image packaged modules. The main functionalities of the first major release include preprocessing of the issue text, search integrating four different search algorithms, and a logging system that stores interactions with the system into the database, including user-defined keywords and appropriate marking of the results.

Preprocessing is done without any user interaction and involves multiple algorithms and AI methods to extract the text of the issue from the original encoded emails. Testing of the algorithms shows good results in terms of both precision and recall. Word vector representations are pre-computed in order to improve the performance of the search algorithms.

Based on the extracted plain text of the issue, the application searches for similar issues that have already been resolved. The users can therefore quickly find the information related to the issue. The system is currently in use, and only after some time of real-world usage will we be able to evaluate the whole system. By logging the interactions in the database, we expect to be able to analyze the usage and the quality of the results. This will allow us to improve the system and add other functionality that will improve the user experience and further improve the customer support technicians' workflow.

ACKNOWLEDGMENTS
Nicelabel d.o.o. funded the research presented in this paper. We thank Gregor Grasselli, Zdenko Vuk and Miha Štravs for their help in application development.

REFERENCES
[1] Beautiful Soup Developers. 2019. Beautiful Soup. https://www.crummy.com/software/BeautifulSoup/. (2019).
[2] Docker Inc. 2019. Docker-compose. https://docs.docker.com/compose/. (2019).
[3] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
[4] Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux Journal, 2014, 239, 2.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[6] PostgreSQL Global Development Group. 2019. PostgreSQL, version 12. http://www.postgresql.org. (2019).
[7] Python Software Foundation. 2018. Python language reference, version 3.7. http://www.python.org. (2018).
[8] React Developers. 2019. React. https://reactjs.org/. (2019).
Adaptation of Text to Publication Type

Luka Žontar
University of Ljubljana, Faculty of Computer and Information Science
Ljubljana, Slovenia
zontarluka98@gmail.com

Zoran Bosnić
University of Ljubljana, Faculty of Computer and Information Science
Ljubljana, Slovenia
zoran.bosnic@fri.uni-lj.si

ABSTRACT
In this paper, we propose a methodology that can adapt texts to target publication types using summarization, natural language generation and paraphrasing. The solution is based on key text evaluation characteristics that describe different publication types. To examine types such as social media posts, newspaper articles, research articles and official statements, we use three distinct text evaluation metrics: length, text polarity and readability. Our methodology iteratively adapts each of the text evaluation metrics. To alter length, we focus on abstractive summarization using text-to-text transformers and on distinct natural language generation models that are fine-tuned for each target publication type. Next, we adapt polarity and readability using synonym replacement and, additionally, manipulate the latter by replacing sentences with paraphrases, which are automatically generated using a fine-tuned text-to-text transformer. The results show that the proposed methodology successfully adapts the text evaluation metrics to target publication types. We find that in some cases adapting the chosen text evaluation metrics is not enough and our methodology can corrupt the content. Generally, however, our methodology generates suitable texts that we could present to a target audience.

KEYWORDS
text adaptation, context-aware, artificial intelligence, text summarization, natural language processing

1 INTRODUCTION
With ever-increasing internet usage, the amount of textual data on the internet is growing rapidly. However, different media target different audiences, and thus an arbitrary article may not be appropriate for everyone. Consequently, already published content is being rewritten and adapted for other target audiences.

Why is targeting audiences so important? When speaking with someone in person, we adjust our body language, tone and the words we use, so that the audience understands the message we are trying to send. In a similar manner, we also have to be aware of the target audience when writing. Even though the task of adapting texts to different audiences may look easy to experienced writers, rookies and amateurs may struggle in selecting the information that might be relevant to a particular target audience. Nevertheless, a way with words and some common sense should be enough to complete the task, but due to the latter requirement, automating this task becomes a much harder problem.
In this paper, we adapt texts to context by manipulating three text evaluation metrics: length, polarity and readability. Our method is able to transition between social media posts, research articles, newspaper articles and official statements, where each publication type targets a different audience. While governmental institutions and academics both publish neutrally-oriented texts, research articles tend to be much longer than official statements. Social media and news usually target wider audiences, which is why their texts should be more readable; however, the two can be separated by the amount of opinion they include, as newspaper articles should be less biased and thus include fewer positively or negatively oriented words.

Our methodology iteratively adapts the key text evaluation metrics towards the mean values of the target publication type, which are calculated from a sample set of articles. In each iteration, our method first manipulates length using abstractive summarization techniques and natural language generation models. Next, it replaces words with more appropriate synonyms and adjusts the polarity and readability scores. Finally, it uses a fine-tuned text-to-text transformer to generate more appropriate paraphrases that replace whole sentences in our text and alter readability.

2 RELATED WORK
While we are trying to automatically adapt texts to a particular genre, researchers have already made progress in automatic text simplification, which adapts text to be more readable and easier to understand. Carroll and Tait [2] developed a methodology to simplify texts for people that suffer from aphasia, a disability of language processing. The developed system consists of an analyser component, which provides syntactic analysis, and a simplifier component, which adapts texts using lexical and syntactic simplification. The lexical simplifier replaces the words in the text with synonyms by considering the Kucera-Francis frequency of each available synonym held in WordNet. Syntactic constructions that are not in Subject-Verb-Object order can also be hard to process for aphasic people; therefore, the authors proposed several syntactic simplifications, such as the replacement of passive constructions with active ones.

A lot of research has already been done on how to evaluate and alter text, and we use many existing methods to help us develop our methodology. We picked three text evaluation metrics that can be reasonably altered using existing methods. Flesch [5] developed an equation that determines the readability of a text using the words-per-sentence and syllables-per-word ratios. Even though structure-based metrics are important, we also have to consider the message of the text. Using sentiment analysis, we can determine whether the writer has positive or negative affections towards the topic of the text. Feldman [4] discusses several approaches to sentiment analysis based on the unit being classified (i.e. documents, sentences, aspects).
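For reference, the Flesch Reading Ease score is 206.835 − 1.015 · (words/sentences) − 84.6 · (syllables/words), and a direct sketch might look as follows; syllable_count stands for any word-level syllable counter and is an assumed helper, not part of the original work.

import re

def flesch_reading_ease(text, syllable_count):
    """Flesch Reading Ease from the words-per-sentence and
    syllables-per-word ratios; higher scores mean easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(syllable_count(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) \
                   - 84.6 * (n_syllables / n_words)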
As length is one of the chosen text evaluation metrics that we wish to adapt, we have to be able to both summarize and extend the text. According to Allahyari et al. [1], we differentiate between extractive and abstractive summarization approaches. Extractive approaches shorten the original text by excluding less relevant sentences. The significance of a sentence can be evaluated by determining whether the sentence is related to the main topic or whether its content is distinctive in comparison to other sentences. On the other hand, abstractive approaches tend to summarize texts in a new (more human-like) manner by structuring the text into some logical form, such as graphs, trees and ontologies [6].

When adapting shorter texts to longer ones, natural language generation has proven to be a very strong tool. Radford et al. [7] developed a natural language generation technique to generate additional text and produced state-of-the-art results using unsupervised multitask learners for model learning. Their model was trained to predict the next word in a text based on 40GB of Internet content. They concluded that large training datasets and models trained to maximize the likelihood of a sufficiently varied corpus can learn a surprising number of tasks, with no supervision needed in training.

Another method that is commonly used when adapting texts to context is paraphrasing, i.e. rewording something written by changing its structure or replacing the words with their synonyms. Goutham [9] used a pre-trained text-to-text transfer transformer to generate paraphrases of questions. The model was fine-tuned with questions from Quora as the input texts and the questions labeled as their duplicates as the expected outputs.

In our paper, we plan to exploit the aforementioned abstractive summarization technique to shorten our texts and to fine-tune the pre-trained natural language generation model that Radford et al. [7] developed. Similarly to Goutham [9], we intend to fine-tune a pre-trained text-to-text transformer that is able to generate paraphrases of a sentence. To calculate the readability score of the input text, we plan to use the formula proposed by Flesch [5].
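With the Hugging Face transformers library, such paraphrase generation can be sketched as below; the checkpoint name and the "paraphrase:" prefix are assumptions, standing in for a T5 model fine-tuned on duplicate-question pairs as in [9] (a plain t5-base checkpoint will not paraphrase well on its own).

from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "t5-base"  # placeholder for a paraphrase-fine-tuned checkpoint
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def paraphrases(sentence, n=5):
    """Sample n candidate paraphrases for a single sentence."""
    inputs = tokenizer("paraphrase: " + sentence, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True, top_k=50,
                             num_return_sequences=n, max_length=64)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]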
In our paper, we exploit the aforementioned abstractive summarization technique to shorten texts, and we fine-tune the pre-trained natural language generation model developed by Radford et al. [7]. Similarly to Goutham [9], we fine-tune a pre-trained text-to-text transformer to generate paraphrases of sentences. To calculate the readability score of the input text, we use the formula proposed by Flesch [5].

3 ADAPTATION OF TEXT
As mentioned before, the proposed method iteratively manipulates the chosen text evaluation metrics to adapt a text to a different target audience. Figure 1 gives an overview of the method. Before the process starts, we calculate the initial values of the text evaluation metrics for each publication type as the average values over a set of documents. Our main dataset consists of 150 documents per publication type; all documents contain COVID-19 related content, which minimizes the effect of variables that we do not take into account in text adaptation. We also define the number of iterations (in our case 5) and the acceptable error ε (in our case ε = 0.1), which determines whether it is still worth altering a particular text evaluation metric.

Figure 1: Flowchart of the text adaptation methodology

In each iteration, the relative differences between the current and the initial values of the text evaluation metrics are calculated. If the absolute relative difference of some metric exceeds ε, we try to adjust the metric towards the target value. We adjust the key text evaluation metrics in the main loop of the process in Figure 1 using the following procedures (a simplified sketch of the synonym replacement loop is given at the end of this section):

• If the target length is smaller than the current length, we use a pre-trained T5 text-to-text transformer [8] to summarize the input text. T5 is an encoder-decoder model that uses transfer learning: it is first pre-trained on a data-rich task using texts from the Colossal Clean Crawled Corpus and then fine-tuned on the downstream task using a dataset of texts from the aforementioned corpus with their summaries as the expected outputs.

• If the input text is shorter than the average text of the target publication type, we generate additional text using fine-tuned natural language generation models. We take four copies of the pre-trained GPT2 natural language generation model [7], based on the aforementioned unsupervised multitask learners, and fine-tune each of them on a dataset of 100 texts of one of the considered publication types, so that it generates texts similar to the ones it was fine-tuned on. Consequently, we assume that the generated text needs less further adaptation.

• While adapting length may be the procedure with the most visible results, we also have to adapt the other text evaluation metrics. We developed a synonym replacement procedure that adjusts the polarity and readability scores towards the target values. The procedure runs in iterations; in each iteration, the word with the highest sum of absolute relative differences of the polarity and readability scores to the initial values of the target publication type is replaced with its optimal synonym, i.e. the synonym that minimizes this sum. We used the lexical database WordNet to acquire the synonyms of the considered word.

• Finally, we alter readability by generating paraphrases with a T5 text-to-text transformer [9] that was fine-tuned on the Microsoft Research Paraphrase Corpus [3]. We then pick the optimal paraphrase, i.e. the one that minimizes the relative difference to the target readability score.

Replacing sentences with their paraphrases could potentially also alter length and polarity. We tested this assumption by generating five paraphrases for each sentence in 100 documents per publication type and found that the relative differences of length and polarity between the initial sentences and their paraphrases are not significant: in this preliminary analysis, the mean relative difference of the polarity scores was 0.91 · 10⁻³ and the mean relative difference of the lengths was 0.11 · 10⁻³.
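The sketch below shows a simplified version of the greedy metric-adjustment loop for the synonym replacement step, using WordNet through NLTK. The polarity and readability scoring callables are assumed to be supplied by the caller, and trying every single-word substitution is our simplification; the actual procedure also interleaves the summarization, generation and paraphrasing steps:

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

EPSILON = 0.1    # acceptable error ε from Section 3
ITERATIONS = 5   # number of iterations from Section 3

def adapt_scores(text, target, polarity, readability):
    # Sum of absolute relative differences to the target metric values
    # (assumes non-zero targets).
    def error(t):
        return (abs(polarity(t) - target["polarity"]) / abs(target["polarity"])
                + abs(readability(t) - target["readability"]) / abs(target["readability"]))

    for _ in range(ITERATIONS):
        if error(text) <= EPSILON:
            break
        words = text.split()
        best = text
        for i, word in enumerate(words):
            # Candidate synonyms of the word from WordNet.
            synonyms = {lemma.name().replace("_", " ")
                        for synset in wn.synsets(word)
                        for lemma in synset.lemmas()} - {word}
            for synonym in synonyms:
                candidate = " ".join(words[:i] + [synonym] + words[i + 1:])
                if error(candidate) < error(best):
                    best = candidate  # keep the substitution that helps most
        if best == text:
            break  # no synonym improves the scores any further
        text = best
    return text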
4 EVALUATION AND RESULTS
In our experiments, we evaluate the quality of the text transformations between all possible pairs of four publication media types: social media, news, research articles and official statements. We tested the methodology by adapting a subset of the main dataset introduced in Section 3. The subset consists of 100 documents per publication type (400 altogether), randomly chosen from the main dataset. We adapted each document to the other three publication types and thus tested all 12 possible transitions. We observed how the key text evaluation metrics behaved and whether the generated text was meaningful. The values of the text evaluation metrics before and after adaptation are shown in Table 1, and Table 2 presents the content quality evaluation of the generated texts.

Input type / Metric             | Official statements | Research articles | News          | Social media
Official statements Length      | –                   | 0.79 → 0.04       | 0.04 → 0.03   | 36.39 → 0.35
Official statements Polarity    | –                   | 2.88 → 0.15       | 2.05 → 0.04   | 2.78 → 0.40
Official statements Readability | –                   | 0.36 → 0.75       | 0.23 → 0.35   | 0.40 → 0.24
Research articles Length        | 3.06 → 0.05         | –                 | 2.99 → 0.04   | 136.23 → 0.33
Research articles Polarity      | 0.81 → 0.27         | –                 | 0.33 → 0.07   | 0.18 → 0.46
Research articles Readability   | 0.17 → 0.08         | –                 | 0.34 → 0.22   | 0.45 → 0.12
News Length                     | 0.97 → 0.03         | 0.99 → 0.03       | –             | 63.79 → 0.40
News Polarity                   | 0.88 → 0.14         | 0.43 → 0.10       | –             | 0.33 → 0.37
News Readability                | 1.21 → 0.05         | 1.20 → 0.84       | –             | 0.24 → 0.11
Social media Length             | 0.69 → 0.02         | 0.64 → 0.03       | 0.97 → 0.04   | –
Social media Polarity           | 0.85 → 0.28         | 0.28 → 0.02       | 0.55 → 0.06   | –
Social media Readability        | 0.71 → 0.27         | 0.69 → 0.80       | 0.24 → 0.28   | –

Table 1: Absolute relative differences to the initial values of the target publication type before → after the transition (columns: target type; "–" marks the untested identity transition)

From Table 1 we can observe that the text evaluation metrics changed in the right direction; in most cases their values improved significantly. The length manipulation consistently decreased the relative difference to the target length and on many occasions even converged below ε. The polarity and readability scores appear harder to adapt; however, in each case we successfully reduced the sum of the relative differences of these two metrics, from which we conclude that the synonym replacement method also performs suitably. Its inefficiency may be caused by the limited choice of synonyms and paraphrases and by the limited number of words and sentences that can be replaced.

As an example, we adapted this research article into a social media post. The statements highlighted in yellow in Figure 2, such as "The authors have proposed" and "The researchers used", imply that the social media post talks about a research article, which it does. Furthermore, the replacement of the word "texts" with "written matters" and of the word "audiences" with "audience groups" indicates that the initial readability of this research article is higher than the value expected for social media posts, because these transformations lower the Flesch Reading Ease score. The content is appropriate, as it extracts some of the most crucial concepts of this article.

Figure 2: Example of text adaptation from this research article to a social media post

Additionally, we evaluated the content quality by checking the semantic similarity between the input and the generated text. Using GloVe word embeddings, we transformed the texts into vectors and calculated the angle between them: with the cosine measure we evaluated whether the vectors point in a similar direction, i.e. whether the contents of the texts are similar. Table 2 presents the mean cosine similarities between the GloVe embeddings of the input and the adapted texts (a sketch of this check follows the table). The results show that the generated texts preserve the original content. The cosine similarity scores are high for all transitions; however, they are somewhat lower when we adapt to or from a social media post. This could be a consequence of the inability to thoroughly capture the content in the short texts expected on social media.

Target type \ Original type | Research article | Official statement | Social media | News
Research article            | –                | 0.94               | 0.82         | 0.97
Official statement          | 0.95             | –                  | 0.82         | 0.97
Social media                | 0.83             | 0.93               | –            | 0.90
News                        | 0.95             | 0.96               | 0.82         | –

Table 2: Cosine similarities between GloVe embeddings of the input and the adapted texts
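This similarity check can be sketched as follows, assuming pre-trained GloVe vectors are available in a local text file; the file name, the whitespace tokenization and the averaging of word vectors into a single text vector are our illustrative choices:

import numpy as np

def load_glove(path):
    # Each line of a GloVe file: the word followed by its vector components.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def text_vector(text, vectors):
    # Represent a text as the mean of the vectors of its known words.
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

glove = load_glove("glove.6B.100d.txt")  # assumed pre-downloaded GloVe file
original = "researchers adapt texts to different audiences"
adapted = "the authors rewrite articles for new readers"
print(cosine_similarity(text_vector(original, glove), text_vector(adapted, glove)))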
While our method successfully adapts the key text evaluation metrics, the results are not perfect when it comes to content. The method has drawbacks, such as generating a lot of additional content, which often results in a disconnected text. Additionally, synonym replacement and paraphrase generation can replace an original word or sentence incorrectly: if there exist synonyms or paraphrases that change the meaning but improve the text evaluation metrics for the particular target audience, they may still be chosen. Nevertheless, our methodology generated a few sequences that could be published for the target audiences without any changes, and many texts would require only minor corrections.

To conclude this section, we are satisfied with the benchmarking results our method produced in adapting the key text evaluation metrics. The methodology produces some interesting content and can thus serve as a baseline for further work on text adaptation to target audiences.

5 CONCLUSION
In this article we developed a methodology that adapts texts to context. The methodology focuses on three text evaluation metrics: the length, readability and polarity of the text. Our method iteratively adapts a text towards the initial metric values calculated for the targeted publication type by adjusting the key text evaluation metrics. We successfully adjusted the text evaluation metrics in nearly all transitions.

While we found text evaluation metrics that characterize different publication types, in some cases adjusting these measures is not enough. When generating longer sequences of additional text, we find that the generated content is not well connected: while we can find a chain of related topics across subsections, in some cases it is hard to identify a common thread that holds throughout the whole text. Additionally, if there exist synonyms and paraphrases that corrupt the content but improve the relative differences to the targeted values of the key text evaluation metrics, the methodology will replace existing words and sentences with senseless content. Despite these drawbacks, we generated many results that reflect the targeted publication types and even more that would require only minor changes to be completely acceptable. We conclude with satisfactory results regarding both the content of the generated texts and their values of the key text evaluation metrics.

Our ideas for further work include improving the natural language generation model: the pre-trained model we used should be trained on longer texts, so that we can generate text from longer prompts and thus maintain a common thread throughout the whole text. Determining whether synonyms or paraphrases corrupt the message of the text is also very important; word embeddings can represent the context of the text, and we could use them to determine whether a synonym fits the current context. Another way to adapt text to context would be to create a dataset in which each row holds different versions of the same text, each written for a different target audience. This way we could teach text-to-text models to adapt text to context, and the methodology could also exploit patterns that might not be obvious to the human eye.

REFERENCES
[1] Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth Trippe, Juan Gutierrez, and Krys Kochut. 2017. Text summarization techniques: a brief survey. International Journal of Advanced Computer Science and Applications (IJACSA), 8, (July 2017), 397–405. doi: 10.14569/IJACSA.2017.081052.
[2] John Carroll, Guido Minnen, Yvonne Canning, Siobhan Devlin, and John Tait. 1998. Practical simplification of English newspaper text to assist aphasic readers. In Proc. of the AAAI-98 Workshop on Integrating Artificial Intelligence and Assistive Technology, (July 1998), 7–10.
[3] William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 9–16. https://www.aclweb.org/anthology/I05-5002.
[4] Ronen Feldman. 2013. Techniques and applications for sentiment analysis. Commun. ACM, 56, (April 2013), 82–89. doi: 10.1145/2436256.2436274.
[5] Rudolf Flesch. 1979. How to Write Plain English: A Book for Lawyers and Consumers. Harper & Row.
[6] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Coling 2010 Organizing Committee, Beijing, China, (August 2010), 340–348. https://www.aclweb.org/anthology/C10-1039.
[7] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language models are unsupervised multitask learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
[8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 140, 1–67. http://jmlr.org/papers/v21/20-074.html.
[9] Goutham Ramsri. 2020. Paraphrase any question with T5 (Text-To-Text Transfer Transformer). Towards Data Science. [Accessed: 17. 8. 2020]. https://towardsdatascience.com/paraphrase-any-question-with-t5-text-to-text-transfer-transformer-pretrained-model-and-cbb9e35f1555.