Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2020, Zvezek A
Proceedings of the 23rd International Multiconference INFORMATION SOCIETY – IS 2020, Volume A

Slovenska konferenca o umetni inteligenci
Slovenian Conference on Artificial Intelligence

Uredili / Edited by Mitja Luštrek, Matjaž Gams, Rok Piltaver

http://is.ijs.si
6.–7. oktober 2020 / 6–7 October 2020, Ljubljana, Slovenia

Uredniki:
Mitja Luštrek, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Matjaž Gams, Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana
Rok Piltaver, Celtra, d. o. o., in Odsek za inteligentne sisteme, Institut »Jožef Stefan«, Ljubljana

Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2020

Informacijska družba, ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID=33223427
ISBN 978-961-264-202-0 (epub)
ISBN 978-961-264-203-7 (pdf)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2020

Triindvajseta multikonferenca Informacijska družba (http://is.ijs.si) je doživela polovično zmanjšanje zaradi korone. Zahvala za preživetje gre tistim predsednikom konferenc, ki so se kljub prvi pandemiji modernega sveta pogumno odločili, da bodo izpeljali konferenco na svojem področju. Korona pa skoraj v ničemer ni omejila neverjetne rasti IKT-ja, informacijske družbe, umetne inteligence in znanosti nasploh, ampak nasprotno – kar naenkrat je bilo treba večino aktivnosti opraviti elektronsko in IKT so dokazale, da je elektronsko marsikdaj celo bolje kot fizično. Po drugi strani pa se je pospešil razpad družbenih vrednot ter zaupanja v znanost in razvoj. Celo Flynnov učinek – merjenje IQ na svetovni populaciji – kaže, da ljudje ne postajajo čedalje pametnejši. Nasprotno – čedalje več ljudi verjame, da je Zemlja ploščata, da bo cepivo za korono škodljivo ali da je korona škodljiva le toliko kot navadna gripa (v resnici je desetkrat bolj). Razkorak med rastočim znanjem in vraževerjem se povečuje.

Letos smo v multikonferenco povezali osem odličnih neodvisnih konferenc. Zajema okoli 160 večinoma spletnih predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic ter 300 obiskovalcev. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad – seveda večinoma preko spleta. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica (http://www.informatica.si/), ki se ponaša s 44-letno tradicijo odlične znanstvene revije.
Multikonferenco Informacijska družba 2020 sestavljajo naslednje samostojne konference:
• Etika in stroka
• Interakcija človek-računalnik v informacijski družbi
• Izkopavanje znanja in podatkovna skladišča
• Kognitivna znanost
• Ljudje in okolje
• Mednarodna konferenca o prenosu tehnologij
• Slovenska konferenca o umetni inteligenci
• Vzgoja in izobraževanje v informacijski družbi

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, SLAIS, DKZ in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

V 2020 bomo petnajstič podelili nagrado za življenjske dosežke v čast Donalda Michieja in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe je prejela prof. dr. Lidija Zadnik Stirn. Priznanje za dosežek leta pripada Programskemu svetu tekmovanja ACM Bober. Podeljujemo tudi nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono je prejela »Neodzivnost pri razvoju elektronskega zdravstvenega kartona«, jagodo pa Laboratorij za bioinformatiko, Fakulteta za računalništvo in informatiko, Univerza v Ljubljani. Čestitke nagrajencem!

Mojca Ciglarič, predsednica programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD – INFORMATION SOCIETY 2020

The 23rd Information Society Multiconference (http://is.ijs.si) was halved due to COVID-19. The multiconference survived thanks to those conference chairs who bravely decided to continue with their conferences despite the first pandemic of the modern era. The COVID-19 pandemic did not slow the growth of ICT, the information society, artificial intelligence and science overall – quite the contrary: suddenly most activities had to be performed through ICT, which often proved more efficient than the old physical way. On the other hand, COVID-19 did accelerate the decline of societal values and of trust in science and progress. Even the Flynn effect – the measurement of IQ across the world population – indicates that the average Earthling is becoming less smart and knowledgeable. Contrary to the general consensus among scientists, the number of people believing that the Earth is flat is growing. A large number of people are wary of the COVID-19 vaccine and consider the consequences of COVID-19 to be similar to those of a common flu, despite their being empirically observed to be ten times worse.

The multiconference is running parallel sessions with around 160 presentations of scientific papers at eight conferences, many round tables, workshops and award ceremonies, and 300 attendees. Selected papers will be published in the Informatica journal, with its 44-year tradition of excellent research publishing.

The Information Society 2020 Multiconference consists of the following conferences:
• Cognitive Science
• Data Mining and Data Warehouses
• Education in Information Society
• Human-Computer Interaction in Information Society
• International Technology Transfer Conference
• People and Environment
• Professional Ethics
• Slovenian Conference on Artificial Intelligence

The multiconference is co-organized and supported by several major research institutions and societies, among them
ACM Slovenia (the Slovenian chapter of the ACM), SLAIS, DKZ and the second national engineering academy, the Slovenian Engineering Academy (IAS). In the name of the conference organizers, we thank all the societies and institutions, and particularly all the participants for their valuable contributions and their interest in this event, as well as the reviewers for their thorough reviews.

For the fifteenth year, the award for lifelong outstanding contributions is presented in memory of Donald Michie and Alan Turing. The Michie-Turing award was given to Prof. Dr. Lidija Zadnik Stirn for her lifelong outstanding contribution to the development and promotion of the information society in our country. In addition, a recognition for current achievements was awarded to the Programme Council of the ACM Bober competition. The information lemon goes to the "Unresponsiveness in the development of the electronic health record", and the information strawberry to the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana. Congratulations!

Mojca Ciglarič, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, Chicago, USA; Toby Walsh, Australia

Organizing Committee: Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Marjetka Šprah; Mitja Lasič; Blaž Mahnič; Jani Bizjak; Tine Kolenik

Programme Committee: Mojca Ciglarič, chair; Bojan Orel, co-chair; Franc Solina; Viljan Mahnič; Cene Bavec; Tomaž Kalin; Jozsef Györkös; Tadej Bajd; Jaroslav Berce; Mojca Bernik; Marko Bohanec; Ivan Bratko; Andrej Brodnik; Dušan Caf; Saša Divjak; Tomaž Erjavec; Bogdan Filipič; Andrej Gams; Matjaž Gams; Mitja Luštrek; Marko Grobelnik; Nikola Guid; Marjan Heričko; Borka Jerman Blažič Džonova; Gorazd Kandus; Urban Kordeš; Marjan Krisper; Andrej Kuščer; Jadran Lenarčič; Borut Likar; Janez Malačič; Olga Markič; Dunja Mladenič; Franc Novak; Vladislav Rajkovič; Grega Repovš; Ivan Rozman; Niko Schlamberger; Špela Stres; Stanko Strmčnik; Jurij Šilc; Jurij Tasič; Denis Trček; Andrej Ule; Tanja Urbančič; Boštjan Vilfan; Baldomir Zajc; Blaž Zupan; Boris Žemva; Leon Žlajpah

KAZALO / TABLE OF CONTENTS

Slovenska konferenca o umetni inteligenci / Slovenian Conference on Artificial Intelligence .......... 1
PREDGOVOR / FOREWORD .......... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES .......... 5
Using Mozilla's DeepSpeech to Improve Speech Emotion Recognition / Andova Andrejaana, Bromuri Stefano, Luštrek Mitja .......... 7
Towards Automatic Recognition of Similar Chess Motifs / Bizjak Miha, Guid Matej .......... 11
Drinking Detection From Videos in a Home Environment / De Masi Carlo M., Luštrek Mitja .......... 15
Semantic Feature Selection for AI-Based Estimation of Operation Durations in Individualized Tool Manufacturing / Dovgan Erik, Filipič Bogdan .......... 19
Generating Alternatives for DEX Models using Bayesian Optimization / Gjoreski Martin, Kuzmanovski Vladimir .......... 23
Detekcija napak na industrijskih izdelkih / Golob David, Petrovčič Janko, Kalabakov Stefan, Kocuvan Primož, Bizjak Jani, Dolanc Gregor, Ravničan Jože, Gams Matjaž, Bohanec Marko .......... 27
Data Protection Impact Assessment – an Integral Component of a Successful Research Project From the GDPR Point of View / Gültekin Várkonyi Gizem, Gradišek Anton .......... 32
Deep Transfer Learning for the Detection of Imperfections on Metallic Surfaces / Kalabakov Stefan, Kocuvan Primož, Bizjak Jani, Gazvoda Samo, Gams Matjaž .......... 35
Fall Detection and Remote Monitoring of Elderly People Using a Safety Watch / Kiprijanovska Ivana, Bizjak Jani, Gams Matjaž .......... 39
Machine Vision System for Quality Control in Manufacturing Lines / Kiprijanovska Ivana, Bizjak Jani, Gazvoda Samo, Gams Matjaž .......... 43
Abnormal Gait Detection Using Wrist-Worn Inertial Sensors / Kiprijanovska Ivana, Gjoreski Hristijan, Gams Matjaž .......... 47
Avtomatska detekcija obrabe posnemalnih igel / Kocuvan Primož, Bizjak Jani, Kalabakov Stefan, Gams Matjaž .......... 51
Povečevanje enakosti (oskrbe duševnega zdravja) s prepričljivo tehnologijo / Kolenik Tine, Gams Matjaž .......... 55
Analiza glasu kot diagnostična metoda za odkrivanje Parkinsonove bolezni / Levstek Andraž, Silan Darja, Vodopija Aljoša .......... 59
STRAW Application for Collecting Context Data and Ecological Momentary Assessment / Lukan Junoš, Katrašnik Marko, Bolliger Larissa, Clays Els, Luštrek Mitja .......... 63
URBANITE H2020 Project Algorithms and Simulation Techniques for Decision-Makers / Machidon Alina, Smerkol Maj, Gams Matjaž .......... 68
Towards End-to-end Text to Speech Synthesis in Macedonian Language / Neceva Marija, Stoilkovska Emilija, Gjoreski Hristijan .......... 72
Improving Mammogram Classification by Generating Artificial Images / Peterka Ana, Bosnić Zoran, Osipov Evgeny .......... 76
Mobile Nutrition Monitoring System: Qualitative and Quantitative Monitoring / Reščič Nina, Jordan Marko, De Boer Jasmijn, Bierhoff Ilse, Luštrek Mitja .......... 80
Recognition of Human Activities and Falls by Analyzing the Number of Accelerometers and their Body Location / Shulajkovska Miljana, Gjoreski Hristijan .......... 84
Sistem za ocenjevanje esejev na podlagi koherence in semantične skladnosti / Simončič Žiga, Bosnić Zoran .......... 88
Mental State Estimation of People with PIMD using Physiological Signals / Slapničar Gašper, Dovgan Erik, Valič Jakob, Luštrek Mitja .......... 92
Energy-Efficient Eating Detection Using a Wristband / Stankoski Simon, Luštrek Mitja .......... 96
Comparison of Methods for Topical Clustering of Online Multi-speaker Discourses / Stropnik Vid, Bosnić Zoran, Osipov Evgeny .......... 100
Machine Learning of Surrogate Models with an Application to Sentinel 5P / Szlupowicz Michał Artur, Brence Jure, Adams Jennifer, Malina Edward, Džeroski Sašo .......... 104
Deep Multi-label Classification of Chest X-ray Images / Štepec Dejan .......... 108
Smart Issue Retrieval Application / Zupančič Jernej, Budna Borut, Mlakar Miha, Smerkol Maj .......... 112
Adaptation of Text to Publication Type / Žontar Luka, Bosnić Zoran .......... 116
Indeks avtorjev / Author index .......... 121

Zbornik 23. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2020, Zvezek A
Proceedings of the 23rd International Multiconference INFORMATION SOCIETY – IS 2020, Volume A

Slovenska konferenca o umetni inteligenci
Slovenian Conference on Artificial Intelligence

Uredili / Edited by Mitja Luštrek, Matjaž Gams, Rok Piltaver

http://is.ijs.si
6.–7. oktober 2020 / 6–7 October 2020, Ljubljana, Slovenia

PREDGOVOR

Leto 2020 je bilo za informacijsko družbo zelo pomembno: zmanjšanje medosebnih stikov zaradi COVID-19 je pokazalo, da se da s pomočjo informacijskih tehnologij postoriti še precej več, kot smo si do zdaj mislili. S pomočjo telekonferenčnih sistemov smo se sestajali, digitalno smo prenašali in podpisovali dokumente, prek spleta smo lahko naročili domala vse izdelke in storitve ...
Čeravno sta umetna inteligenca in informacijska družba vedno tesneje povezani, pa podobno dramatičnega napredka pri umetni inteligenci ni bilo opaziti. Seveda to ne pomeni, da napredka ni bilo – raznotere metode umetne inteligence še naprej postajajo vedno zmogljivejše in predvsem prodirajo v vedno manjše in cenejše naprave: opažamo lahko, da se namenski procesorji za operacije umetnih nevronskih mrež vedno pogosteje pojavljajo v pametnih telefonih, pametnih zvočnikih z govornimi asistenti in podobnih napravah.

Umetno inteligenco smo zapregli tudi v spopad s COVID-19. Raziskovalci so jo uporabili za določanje strukture virusa ter za iskanje učinkovitih zdravil in cepiv. Skupina ameriških organizacij je razpisala nagrado za najboljše pristope rudarjenja po besedilih, ki bodo iz 19 GB besedil, povezanih z boleznijo, izluščili koristne informacije. Razvitih je bilo več diagnostičnih sistemov za podporo odločanju, ki analizirajo slike pljuč in druge podatke. Precej raziskovalcev se je z metodami umetne inteligence lotilo napovedovanja širjenja bolezni in določanja dejavnikov, ki nanj vplivajo. Tovrstne raziskave se dogajajo tudi v Sloveniji.

K sreči COVID-19 naši konferenci ni storil dosti žalega. Resda se ob pisanju tegale uvodnika še ne ve zagotovo, ali bo konferenca potekala na daljavo ali jo bomo uspeli speljati hibridno, kot načrtujemo – da bo del udeležencev prisoten v živo v predavalnici, del pa na daljavo. A verjamemo, da to na kakovost izvedbe ne bo bistveno vplivalo. Z zadovoljstvom pa ugotavljamo, da smo letos dobili največ prispevkov v zadnjih petih letih – v zbornik jih je vključenih kar 28. Tokrat je bolje kot običajno zastopana Fakulteta za računalništvo in informatiko Univerze v Ljubljani, ki ima skupaj z Institutom »Jožef Stefan« (od koder je – kot vsako leto – največ prispevkov) vodilno vlogo pri raziskavah umetne inteligence v Sloveniji. Nekaj prispevkov je tudi iz tujine in industrije, čeprav bi si zlasti slednjih želeli več. Slovenija namreč izobrazi veliko strokovnjakov s področja umetne inteligence in precej jih najde pot v industrijo, kjer se dogaja marsikaj zanimivega, o čemer vemo premalo. V to smer si bomo zato še bolj prizadevali v prihodnjih letih.

FOREWORD

2020 was an important year for the information society: social distancing due to COVID-19 showed that information technologies allow us to do even more than we previously thought. Teleconferencing systems allowed us to meet virtually, we transferred and signed documents digitally, and we ordered every imaginable product and service online ...

However, even though artificial intelligence and the information society are increasingly interlinked, the progress of artificial intelligence this year was not as dramatic. This certainly does not mean there was no progress – various artificial-intelligence methods are still steadily improving and, perhaps even more importantly, becoming available in ever smaller and cheaper devices: dedicated processors accelerating neural-network computations are becoming common in smartphones, smart speakers with conversational assistants and similar devices.

Artificial intelligence also helps fight COVID-19. It was used to determine the structure of the virus and to discover effective drugs and vaccines. A group of US organizations offered a prize for the best text-mining methods that can extract useful information from 19 GB of texts related to the disease. Several diagnostic decision-support systems were developed, which analyse images of the lungs and other data.
Many researchers used artificial intelligence to forecast the spread of the disease and the factors that affect it. Such research is also conducted in Slovenia.

Fortunately, COVID-19 did not much affect our conference. At the time of writing this editorial, it is still not clear whether it will take place remotely or we will succeed with the planned hybrid approach, where a part of the participants attend live in a lecture room and the rest connect via teleconference. Either way, we are confident this will not have a major impact on the quality of the conference. We are pleased to report that this year we received the largest number of papers in the last five years – there are 28 in these proceedings. The Faculty of Computer and Information Science of the University of Ljubljana is represented better than in previous years, which is quite appropriate considering that – together with the Jožef Stefan Institute (which contributed the largest number of papers, as usual) – it plays the leading role in Slovenian artificial-intelligence research. There are also some papers from abroad and from industry, although we would like to see more of these, especially the latter. Slovenia educates a large number of artificial-intelligence experts and many of them find their way to industry, where many interesting but not widely known developments take place. We aim to improve on this aspect in the following years.

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Mitja Luštrek; Matjaž Gams; Rok Piltaver; Marko Bohanec; Tomaž Banovec; Cene Bavec; Jaro Berce; Marko Bonač; Ivan Bratko; Dušan Caf; Bojan Cestnik; Aleš Dobnikar; Bogdan Filipič; Nikola Guid; Borka Jerman Blažič; Tomaž Kalin; Marjan Krisper; Marjan Mernik; Vladislav Rajkovič; Ivo Rozman; Niko Schlamberger; Tomaž Seljak; Miha Smolnikar; Peter Stanovnik; Damjan Strnad; Peter Tancig; Pavle Trdan; Iztok Valenčič; Vasja Vehovar; Martin Žnidaršič

Using Mozilla's DeepSpeech to Improve Speech Emotion Recognition

Andrejaana Andova, Jožef Stefan International Postgraduate School and Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, andrejaana.andova@ijs.si
Stefano Bromuri, Open University of the Netherlands, Heerlen, Netherlands, Stefano.Bromuri@ou.nl
Mitja Luštrek, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, mitja.lustrek@ijs.si

ABSTRACT
A lot of effort in detecting emotions in speech has already been made. However, most of the related work focused on training a model on an emotional speech dataset and testing the model on the same dataset. A model trained on one dataset seems to provide poor results when tested on another dataset. This means that models trained on publicly available datasets cannot be used in real-life applications where the speech context is different. Furthermore, collecting large amounts of data to build an efficient speech emotion classifier is not possible in most cases. Because of this, some researchers tried using transfer learning to improve the performance of a baseline model trained on only one dataset. However, most of the works so far developed methods that transfer information from one emotional speech dataset into another emotional speech dataset. In this work, we instead try to transfer parameters from a pre-trained, widely used speech-to-text model. Unlike other related work, which uses emotional speech datasets that are usually small, we try to transfer information from a larger speech dataset, which was collected by Mozilla and whose main purpose was to transcribe speech. We used the first layer of the DeepSpeech model as the basis for building another deep neural network, which we trained on the improvisation utterances from the IEMOCAP dataset.

KEYWORDS
speech emotion recognition, feature transfer, DeepSpeech
1 INTRODUCTION
There are many issues when trying to build a model for speech emotion recognition, but the main problem is the lack of emotional speech data. Collecting a dataset is often a challenging and effortful task, but in speech emotion recognition a few additional problems arise. One of the main problems is that speech is context-dependent. One could gather a dataset from job interviews and build a precise model that detects emotions in job applicants' speech. However, the same model would probably not work for a phone application that tries to analyze the emotions of its users. Thus, to build a general model for speech emotion recognition, one would need to gather a dataset composed of speech used in different contexts, which is a hard task.

Most of the currently available emotional speech datasets are composed of actors performing scenes with different emotions. Finding actors and writing the scenes can be costly and effortful and, thus, it is hard to collect large amounts of data in this way. However, the major problem of this type of data is that all of the emotions are acted and may be more exaggerated than real-life emotions [8]. Such data is probably quite different from the data encountered in real-life applications, where emotions are expressed with less intensity. To solve this problem, some researchers used transfer learning methods to build models that are more robust to changes in the data.

Other researchers used speech recorded in real-life scenarios and asked people to listen to the recordings and annotate the emotions they recognize in the speakers' voices. When collecting a dataset in this way, one needs to find people who will listen to the whole dataset and annotate the data. The annotators will probably have different abilities to detect the emotions and different perceptions of what each emotion should sound like. Because of this, in many cases not all of them will agree on which emotion is present in a sample. Another drawback of this type of data collection is that most of the time people do not experience extreme emotions. Because of this, such datasets contain almost no emotional speech – the speech is mostly neutral.

The main idea behind transfer learning is to use information from one dataset, called the source dataset, to improve the performance on a target dataset. The source and the target datasets may have labeled or unlabeled data, may have the same or different data distributions, and they may be constructed to solve the same task or different tasks. Depending on this, there are different approaches to transfer learning; they are thoroughly surveyed by Pan and Yang [5].

In this work, we decided to follow the usual transfer learning approach and use a pre-trained speech-to-text model trained on a large non-emotional English dataset collected by Mozilla. This model may not contain any emotional information that would be useful for our task, but we believe it contains information about the speech of the subjects that could be used in speech emotion recognition.
2 RELATED WORK
While speech emotion recognition where training and testing are done on the same dataset has been well studied, using other datasets to make the model more general has been in focus only in recent years.

Some researchers tried using unlabeled target data to improve speech emotion recognition models. Parthasarathy and Busso [6] combined supervised and unsupervised learning to improve the performance of speech emotion recognition on a target dataset. They used a network architecture similar to autoencoders to encode large amounts of unlabeled target data in an unsupervised way, by putting the same speech at the input and the output of the network. To force the network to encode the emotional information in the speech, they connected the last encoding layer to another layer that learned the arousal, valence and dominance annotations of the speech in a supervised way. Compared to other state-of-the-art models, their method showed improvement in the arousal and dominance spaces, while in the valence space the results were slightly worse than the state of the art.

Some authors tried to bring the feature spaces of the source and the target data closer together. Song et al. [7] used MMDE optimization and dimension-reduction algorithms for this purpose, and then used the shifted feature space from the source dataset to train an SVM model. They used the EmoDB dataset as the source dataset and a Chinese emotional dataset collected by the authors as the target dataset. After training the SVM model on the source dataset only, they applied it to the target dataset, where it achieved 59.8% accuracy. This is an improvement over an SVM model trained on the source dataset and tested on the target dataset without any dimension reduction, which achieved 29.8% accuracy. However, the best performance, 85.5% accuracy, was achieved by a model trained and tested on the target dataset.

3 DATASET
In this research we used the Interactive Emotional Dyadic Motion Capture database (IEMOCAP) [1]. IEMOCAP consists of speech from ten English-speaking actors (five male and five female), and it is the largest publicly available dataset for speech emotion recognition that we found. It consists of approximately twelve hours of data in which actors perform improvisations or scripted scenarios, specifically selected to elicit emotional expressions. Since the actors were not given any specific emotions to act, the database was annotated by multiple annotators with categorical labels, as well as dimensional labels such as valence, activation and dominance. The set of emotions the annotators could choose from was anger, happiness, excitement, sadness, frustration, fear, surprise, other, and neutral; but because most of the related work on transfer learning in speech emotion recognition only used anger, happiness, sadness and neutral utterances, we decided to use only these emotions as well.

We noticed that most of the time the three annotators did not perceive the same emotion, so we eliminated all data where the three annotators did not agree on the detected emotion. This reduced the amount of data significantly. The distribution of the emotions after the data reduction is given in Table 1.

Table 1: Emotion distribution in IEMOCAP.
Anger   Happiness   Sadness   Neutral
500     94          467       392
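As an illustration of this filtering step, the following minimal sketch keeps only the utterances whose annotators unanimously chose one of the four target emotions. The file name and column layout are hypothetical; the actual IEMOCAP annotations are distributed as per-session text files and would need their own parser.

    import pandas as pd

    TARGET_EMOTIONS = {"ang", "hap", "sad", "neu"}

    # Hypothetical CSV with one row per (utterance, annotator) pair.
    labels = pd.read_csv("iemocap_annotations.csv")  # columns: utterance_id, annotator, emotion

    def unanimous_label(group):
        emotions = set(group["emotion"])
        # Keep the utterance only if all annotators agree on a single
        # emotion and that emotion is one of the four target classes.
        if len(emotions) == 1 and emotions <= TARGET_EMOTIONS:
            return emotions.pop()
        return None

    agreed = labels.groupby("utterance_id").apply(unanimous_label).dropna()
    print(agreed.value_counts())  # class counts as in Table 1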
4 METHODOLOGY
We developed methods that transfer information from a large non-emotional speech dataset into a target emotional speech dataset. Since in most of the related work researchers extracted information from smaller emotional speech datasets and transferred it to other emotional speech datasets, this is, to our knowledge, the first attempt in speech emotion recognition to transfer information from a well-established pre-trained speech model into a smaller emotional speech dataset, which is the standard approach in most other transfer learning applications. To check whether the methods provide any useful improvement, we compare them to a baseline model that was trained and tested on IEMOCAP and does not use any kind of information transfer.

4.1 Baseline Model
To build a baseline classifier, we used standard machine learning approaches trained on features extracted using OpenSMILE [2]. After testing several different machine learning approaches, we saw that Random Forest obtained the best results for most of the target datasets. Because of this, we used a Random Forest classifier with 1000 trees and a maximal depth of 10 as the baseline model.
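A minimal sketch of such a baseline, assuming the OpenSMILE features have already been exported to disk (the file names are placeholders), with leave-one-subject-out evaluation as used in Section 5:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    # X: one row of OpenSMILE functionals per utterance; y: emotion labels;
    # groups: speaker id per utterance, used for leave-one-subject-out folds.
    X = np.load("opensmile_features.npy")
    y = np.load("emotion_labels.npy")
    groups = np.load("speaker_ids.npy")

    clf = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0)
    scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())
    print(scores.mean())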
4.2 DeepSpeech Model
DeepSpeech is a model that provides transcriptions of given speech. The model was trained on the English data from the Mozilla Common Voice dataset [3]. This dataset consists of 1469 hours of speech recorded from 61521 different voices. The people whose voices were collected were of different nationalities (and thus different English accents) and different ages. All of this data is publicly available and can be easily accessed.

The architecture of the DeepSpeech model is presented in Figure 1. Each utterance is a time series in which every time-slice is a vector of MFCC audio features [4]. The goal of the network is to convert an input sequence x into a sequence of character probabilities for the transcription y. The network is composed of five hidden layers. The first three are dense layers with ReLU activation. The fourth layer is an LSTM layer, and the fifth is again a dense layer with ReLU activation. The output layer has a softmax function which outputs character probabilities. In the example in Figure 1, the first frame outputs the character 'C', the second frame the character 'A', and the third frame the character 'T', resulting in the word 'CAT'.

Figure 1: Architecture of the original DeepSpeech model.
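The topology described above can be summarised in a few lines of Keras. This is only an illustrative sketch: the layer width, the number of MFCC features and the alphabet size are assumptions, and the original DeepSpeech additionally uses context windows and CTC-based training and decoding.

    import tensorflow as tf
    from tensorflow.keras import layers

    NUM_MFCC = 26    # MFCC features per time-slice (assumed)
    NUM_CHARS = 29   # output alphabet size (assumed)

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, NUM_MFCC)),      # variable-length utterance
        layers.Dense(2048, activation="relu"),       # three dense ReLU layers
        layers.Dense(2048, activation="relu"),
        layers.Dense(2048, activation="relu"),
        layers.LSTM(2048, return_sequences=True),    # recurrent fourth layer
        layers.Dense(2048, activation="relu"),       # fifth dense ReLU layer
        layers.Dense(NUM_CHARS, activation="softmax")  # per-frame character probabilities
    ])

The per-frame output of the first dense layer is exactly the 2048-dimensional representation that the transfer learning method in Section 4.3 reuses.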
4.3 Transfer Learning Using DeepSpeech
We experimented with transferring information from the DeepSpeech model that would be useful for the speech emotion recognition task. We used the representation learned by the DeepSpeech network to extract features for the IEMOCAP dataset: the output of the first layer of the DeepSpeech model served as the features for a given frame. We ended up with 2048 features for every 10-millisecond frame. So, if the whole utterance was 3 seconds long, we would receive a matrix with dimensions 1800x2048 after the DeepSpeech feature extraction.

After the features from all the samples in IEMOCAP had been extracted, we trained a deep neural network on them. We simply added the layers of the new deep neural network on top of the first layer of the DeepSpeech model, and trained the new network from scratch using only the samples from the IEMOCAP dataset. This way we repurposed the feature representations from the first layer of the DeepSpeech model.

We experimented with several different deep neural network architectures to see which one works best for this problem. In the first architecture, we used a feed-forward network on the extracted features of each frame: one hidden dense layer with ReLU activation and 204 neurons, connected to a dense layer with softmax activation which predicted the emotion probabilities for each frame separately. Although the IEMOCAP dataset has no labels for individual frames, we used the target label of the whole utterance as the target label for each of its frames.

In the second architecture, we used the features of the whole frame sequence as input, and an LSTM layer to learn representations from the features. The LSTM layer is activated by a ReLU function and has 20 hidden states. It is connected to a dense layer with softmax activation which predicts the label of the whole utterance.

The third architecture is composed of two parts. In the first part we predict the emotion probabilities for each frame separately, and in the second part we use the emotion probability predictions from the first part to predict the emotion probabilities for the whole utterance. The first part is the same as in the first architecture and is trained on one half of the training data. In the second part, we use the predictions of the first part as input to a dense layer with softmax activation, trained on the other half of the training data. In this architecture, we predict one vector of emotions for each sequence of 20 frames.

The fourth network consists of two separate parts and is presented in Figure 2. The first part takes the output of the DeepSpeech model and predicts the probability of each target emotion separately: a dense layer with ReLU activation outputs 204 features and is connected to another dense layer with softmax activation that predicts the emotions present in each frame. The second part of the network uses the output emotion probabilities of the first part as input. It consists of one LSTM layer, trained on the second half of the training data, activated by a ReLU function and with 20 hidden states, connected to a dense layer with softmax activation which predicts the label of the whole utterance. This architecture is in a way a combination of the first and the second architecture.

Figure 2: Architecture of the fourth network, combining a per-frame dense classifier with an LSTM over its predictions.
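A sketch of this fourth (best-performing) architecture in Keras; the layer sizes follow the description above, everything else is an assumption rather than the authors' implementation:

    import tensorflow as tf
    from tensorflow.keras import layers

    NUM_EMOTIONS = 4       # anger, happiness, sadness, neutral
    FRAME_FEATURES = 2048  # output of the first DeepSpeech layer per frame

    # Part 1: per-frame emotion classifier on DeepSpeech features
    # (trained on one half of the training data).
    frame_model = tf.keras.Sequential([
        tf.keras.Input(shape=(FRAME_FEATURES,)),
        layers.Dense(204, activation="relu"),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])

    # Part 2: LSTM over the sequence of per-frame emotion probabilities,
    # producing one label per utterance (trained on the other half).
    utterance_model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, NUM_EMOTIONS)),
        layers.LSTM(20, activation="relu"),
        layers.Dense(NUM_EMOTIONS, activation="softmax"),
    ])

During training and inference, part 1 is applied to each frame independently (e.g. via layers.TimeDistributed(frame_model)) and the resulting sequence of softmax outputs is fed to part 2.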
5 RESULTS
Since the DeepSpeech model is capable of learning language phrases in the speech, we removed all scripted utterances from the IEMOCAP dataset and used just the utterances in which the actors were asked to improvise. To evaluate the neural network architectures we used leave-one-subject-out cross-validation.

In Table 2 we present the results obtained with each of the deep neural network architectures, as well as the accuracy of the baseline model and the majority classifier.

Table 2: Classification accuracy of the majority classifier and the baseline Random Forest classifier compared to the DeepSpeech-features method.
Model           Majority   Baseline   DeepSpeech features
Dense           34%        67%        58%
LSTM            34%        67%        7%
Dense1+Dense2   34%        67%        26%
Dense1+LSTM2    34%        67%        66%

In the results we can see that the LSTM architecture performs quite poorly, with a classification accuracy of only 7%. The most probable explanation is that this architecture is quite complex, since it has 2048 features for each frame and tries to train an LSTM model on all of these features. To train a model with this number of parameters, we would need many more samples than the IEMOCAP improvisations provide.

The architecture that provides the best results is the one that uses a feed-forward network to predict the emotions in each frame, and then an LSTM network to make the final emotion prediction for the whole utterance. We further experimented with this architecture to see how much the frame length changes the performance of the model. The results are presented in Figure 3. We can notice that the performance of the model can be improved by using bigger frames when training the LSTM part, but the difference is small – only a few percentage points.

Figure 3: Performance of the DeepSpeech model with different frame lengths.

The results show that some of the DeepSpeech architectures perform better than the majority classifier, but none of them outperforms the baseline model. A possible explanation is that the two tasks are simply not related enough, so we cannot use information from the DeepSpeech model to improve the performance of a model for speech emotion recognition.

6 CONCLUSION
In this work we tried to improve a baseline speech emotion recognition classifier by transferring information from a pre-trained model. Although this kind of transfer learning is widely used in other fields of computer science, most of the related work in speech emotion recognition developed transfer learning methods that transfer information from other emotional speech datasets into a target emotional speech dataset. The pre-trained model we used was Mozilla's DeepSpeech, which was developed as a speech-to-text model. To recognize emotions in speech, we used the first layer of the DeepSpeech model, on top of which we added a new classifier trained from scratch on an emotional speech dataset. This way we repurposed the previously learned feature maps for our dataset.

The results of this approach did not improve the classification accuracy on the improvisation part of the IEMOCAP dataset. A possible explanation is that the speech-to-text and speech emotion recognition tasks are simply not sufficiently related, and because of this the model could not extract any useful information from the DeepSpeech model. However, since this was the first attempt to transfer information from a well-established pre-trained model to a speech emotion recognition task, we believe it is still a valuable attempt.

7 ACKNOWLEDGMENTS
This research has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreement No 769765.

REFERENCES
[1] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N. Chang, Sungbok Lee, and Shrikanth S. Narayanan. 2008. IEMOCAP: interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42, 4, 335.
[2] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, 1459–1462.
[3] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep Speech: scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.
[4] Beth Logan et al. 2000. Mel frequency cepstral coefficients for music modeling. In ISMIR. Volume 270, 1–11.
[5] Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22, 10, 1345–1359.
[6] Srinivas Parthasarathy and Carlos Busso. 2019. Semi-supervised speech emotion recognition with ladder networks. arXiv preprint arXiv:1905.02921.
[7] Peng Song, Yun Jin, Li Zhao, and Minghai Xin. 2014. Speech emotion recognition using transfer learning. IEICE Transactions on Information and Systems, 97, 9, 2530–2532.
[8] Carl E. Williams and Kenneth N. Stevens. 1972. Emotions and speech: some acoustical correlates. The Journal of the Acoustical Society of America, 52, 4B, 1238–1250.

Towards Automatic Recognition of Similar Chess Motifs

Miha Bizjak, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia
Matej Guid, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia

ABSTRACT
We present a novel method to find chess positions similar to a given query position in a collection of archived chess games. Our approach considers not only the static similarity due to the arrangement of the chess pieces, but also the dynamic similarity based on the recognition of chess motifs and dynamic, tactical aspects of position similarity. We use information retrieval techniques to enable efficient approximate searches, and implement a textual encoding that captures the position, accessibility and connectivity of the chess pieces, the pawn structures, and the moves that represent the solution to the problem. We have shown experimentally how important the inclusion of both static and dynamic features is for the successful detection of similar chess motifs. In another experiment the program was able to quickly traverse a large database of positions to identify similar chess tactical problems. A chess expert found the resulting program useful for automatically generating instructive examples for chess training.

KEYWORDS
problem solving, chess motifs, automatic similarity recognition
1 INTRODUCTION
A significant part of acquiring human skills is to identify our weaknesses and take measures to remedy them. In problem-solving domains such as chess, the analysis of past games is important for players trying to improve their game. Identifying their mistakes enables chess players to work on improving some aspects of their game. This is often done by training on similar problems. Finding relevant similar problems involves recognising both static patterns, i.e. finding similar chess positions, and dynamic patterns, i.e. finding similar move sequences that solve a problem. These static and dynamic patterns are often referred to as chess motifs. Learning and recognising chess motifs during the game is one of the main prerequisites for becoming a competent chess player [2].

Chess instructors often look for examples containing relevant chess motifs from real games to provide their students with useful teaching material. However, it is impossible for a human being to go through thousands or even millions of games and find problem positions with similar chess motifs and similar solutions to those overlooked by the students in their games. Finding contextually similar chess positions could also be used for annotating chess games [5] and in intelligent chess tutoring systems [10].

The goal of our research is to develop a method to automatically retrieve chess positions with similar chess motifs for a given query position from a collection of archived chess games.

1.1 Related Work
Existing chess search systems equipped with a query-by-example (QBE) [11] search interface are limited to searching only for exact matches of a given query position. To alleviate the problem of exact position searches, the Chess Query Language (CQL) [1] allows the search for approximate matches of positions. However, it requires the user to define complex queries in the system-specific language. The search results can be sorted by any user-defined feature. In addition, CQL works directly on game files and checks each game sequentially, making it inefficient for querying larger databases.

To overcome these problems, an approach based on information retrieval has been proposed for obtaining similar chess positions [4]: a textual representation is constructed for each board position, and information retrieval methods are used to calculate the similarity between these documents. Instead of constructing a query manually, the user specifies a chess position and a query encoding the characteristics of the position is generated automatically. Initially, a naive encoding was used, which only contains the positions of the individual pieces. The results were improved by including additional information about the mobility of the individual pieces and the structural relationships between the pieces. Further work improved the quality of retrieval by implementing automatic recognition of pawn structures [7]. The additional information provided by the application of domain knowledge has proved useful; however, the positions are still evaluated only statically.

All existing approaches have a common shortcoming: they only allow the search for statically similar positions, while ignoring the dynamic factors, which are often far more important for obtaining relevant search results.
2 DOMAIN DESCRIPTION
In this paper, we focus on the automatic retrieval of similar chess tactical problems from a large database of chess games. In chess, the term tactic describes a sequence of moves that takes advantage of a certain position on the board and allows the player to gain material or a positional advantage, or even leads to a forced checkmate sequence.

Chess tactical problems are particularly important for the progress of chess players. Knowledge of tactical motifs helps them to quickly recognise the possible presence of a winning or drawing combination in a position. Chess players improve their tactical skills by solving tactical problems. A large number of games are decided by tactics, since a single mistake which gives the opponent an opportunity for tactics can change the outcome of a game. To help players discover tactical possibilities in games, many common patterns or tactical motifs have been defined in the chess literature [6]. Stoiljkovikj et al. developed a method for estimating the difficulty of chess tactical problems [9]. They introduced the concept of meaningful search trees, which can potentially be used either for motif recognition or as an additional feature for positional similarity ranking.

We use standard chess annotation. Chess games are stored using Portable Game Notation (PGN), chess positions are described with Forsyth-Edwards Notation (FEN), and chess moves are described with Standard Algebraic Notation (SAN) [3].

Figure 1 shows some of the more common motifs. In Figure 1a, Black performs a double attack on the white king and queen at the same time. White must move the king out of check, allowing Black to capture the queen. Figure 1b is an example of a discovered attack: by moving the bishop, White opens the queen's line of attack on the rook on a2. After Black responds to move out of check, White can capture the black rook. The tactic in Figure 1c is called deflection. The black king protects the rook on f8. White gives a check with the bishop, forcing the black king to move away from the rook so that it can be captured.

Figure 1: Tactical motifs (a–c).

To illustrate the difference between static and dynamic similarity with an example, we compare the query position in Figure 2a with the positions in Figures 2b and 2c. The position in Figure 2b seems to be very similar to that in Figure 2a: only the white rook on h4 and the black rook on e8 have been removed. These two positions are statically similar. On the other hand, the position in Figure 2c seems quite different. However, if we compare the move sequences that represent the solutions to these two tactical problems, we notice a great dynamic similarity. The solution in Figure 2a is 1. Rh8+ Kxh8 2. Qh6+ Kg8 3. Qxg7#. The solution in Figure 2c contains the same tactical motif: the white rook is sacrificed on h8 and the black king must capture it, allowing the white queen to appear with check on h6 (note that it cannot be captured due to the activity of the white bishop along the long diagonal) and deliver checkmate on the next move. Note that such a motif is not possible in the position shown in Figure 2b.

Figure 2: Static and dynamic similarity (a–c).

We are particularly interested in recognising the dynamic similarity, i.e. finding positions with similar motif(s) in the solution of the problem. However, we also want to take into account the static similarity, i.e. finding problems with a similar initial position.

3 SIMILARITY COMPUTATION
To determine the similarity between tactical problems we use an approach based on information retrieval. A set of features is computed from each problem's starting position and its solution move sequence. The features are then converted into textual terms, forming a document that represents the problem. A collection of documents is used to build an index, which can then be queried with the textual encoding of a new position to retrieve the most similar positions in the index. For the implementation of the system for indexing and retrieval of similar tactics we use the Apache Lucene Core library. Search results are ranked using the BM25 ranking function [8].
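For reference, the Okapi BM25 score of a document D for a query Q = {q_1, ..., q_n} is, in LaTeX notation:

    \mathrm{score}(D,Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
        \frac{f(q_i,D)\,(k_1+1)}{f(q_i,D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

where f(q_i, D) is the frequency of term q_i in document D, |D| is the document length in terms, avgdl is the average document length in the collection, and k_1 and b are free parameters (the experiments in Section 4.1 use the defaults k_1 = 1.2 and b = 0.75).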
For each tactic, the input consists of a starting position in FEN format and a solution move sequence in algebraic notation. The solution can be provided with the position or calculated using a chess engine. Sections 3.1 and 3.2 describe the features and terms that are generated, and Figure 3 shows an example of a text encoding.

3.1 Static Features
The static part of the encoding includes information about the positions of the pieces on the board, the structural relationships between the pieces, and the pawn structures present in the position. The implementation is based on previous work on similar position retrieval [4] and pawn structure detection [7], and is intended to serve as a baseline on which we aim to improve by implementing the encoding of dynamic features.

3.1.1 Piece positions and connectivity. The section of the encoding describing piece positions and connectivity consists of three parts:
• naive encoding – the positions of all the pieces on the board;
• reachable squares – all squares reachable by pieces on the board in one move, with decreasing weight based on the distance from the original position, in the format {piece symbol and position}|{weight};
• connectivity between the pieces – the structural relationships between the pieces in the position. For each piece it is recorded which other pieces it attacks, defends or attacks through another piece (X-ray attack). Attacks are encoded as {attacking piece symbol}>{attacked piece symbol and position}; for defense and X-ray attack terms, the < and = separators are used instead.

3.1.2 Pawn structures. For this section of the encoding, we use pawn structure detection algorithms [7] to detect the following pawn structures in the position and encode them into terms: isolated pawns (I{pawn position}), (protected) passed pawns (F{pawn position}), backward pawns, doubled pawns and pawn chains. The terms P({number}) and p({number}) encode the number of pawn islands for White and Black, respectively.
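A minimal sketch of how the naive-encoding and attack terms can be generated; the term formats follow the description above, while the use of the python-chess library is an assumption, as the paper does not name its move-generation library (the reachable-squares, defense and pawn-structure terms are omitted for brevity):

    import chess

    def static_terms(fen):
        board = chess.Board(fen)
        terms = []
        for square, piece in board.piece_map().items():
            symbol = piece.symbol()              # upper case = White, lower = Black
            name = chess.square_name(square)
            terms.append(symbol + name)          # naive encoding, e.g. "Qb4"
            # Connectivity: attacks on enemy pieces, e.g. "q>Pb2".
            # Defended own pieces would use '<' and X-ray attacks '='.
            for target in board.attacks(square):
                attacked = board.piece_at(target)
                if attacked is not None and attacked.color != piece.color:
                    terms.append(f"{symbol}>{attacked.symbol()}{chess.square_name(target)}")
        return terms

    print(static_terms("r1bqkbnr/pppp1ppp/2n5/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R w KQkq - 0 4"))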
3.2 Dynamic Features
In the dynamic part of the encoding, we focus on the solution of the tactical problem, trying to capture the motif behind it. We first encode some general characteristics of the solution, then add more specific terms describing the move sequence.

3.2.1 General dynamic features. In this part we encode some basic features of the solution move sequence that can help us determine similarity. We use a single term for each of the following features if it holds for the solution:
• ?px – the player captures a piece in at least one of the moves
• ?ox – the opponent captures a piece in at least one of the moves
• ?+ – the player gives a check at least once during the sequence
• ?= – the player promotes a pawn in at least one of the moves
• ?S – the player sacrifices one or more pieces
• ?# – the solution ends with a checkmate
• ?1/2 – the solution ends in a draw

3.2.2 Solution sequence features. In this section we encode information about the solution move sequence. The encoding includes a term for each:
• type of piece moved: !-{piece symbol}
• type of piece captured: !x{piece symbol}
• attack between pieces that occurs during the solution: !{attacking piece symbol}>{attacked piece symbol}
• type of piece sacrificed: !S{piece symbol}
• (if the final position is a checkmate) type of piece involved in the checkmate: !#{piece symbol}

We count a piece as involved in the checkmate if it attacks either the king directly or any of the squares to which the king could move from the current position (ignoring checks). To include information about the order of moves and captures, we also include a term for each two consecutive moves and captures in the solution, as well as a term for each pair of pieces involved in the checkmate, to capture more specific combinations of pieces.

Figure 3: Text encoding of a tactical position. (a) Encoded position; Black to play, solution: 1... Qh1+ 2. Nxh1 Rg2#. (b) Text encoding of each set of features for the above position, for example: static_positions: qc1 Pb2 Pf2 Kh2 Pa3 Rd3 Ng3 Rh3 Qb4 ...; reachable squares: qa1|0.78 qb1|0.89 qd1|0.89 qe1|0.78 ...; connectivity: q>Pb2 q>Pc4 Q>nb7 N>pg7 r>Ng3 ...; solution terms: !N>q !q>K !b>N !K>r !r>K !r>P ...
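As an illustration, the general dynamic terms of Section 3.2.1 can be derived from a FEN position and a SAN solution line as sketched below (again assuming the python-chess library; the sacrifice term ?S requires deeper analysis and is omitted):

    import chess

    def general_dynamic_terms(fen, solution_sans):
        board = chess.Board(fen)
        terms = set()
        player = board.turn                      # side solving the problem
        for san in solution_sans:
            move = board.parse_san(san)
            if board.is_capture(move):
                terms.add("?px" if board.turn == player else "?ox")
            if move.promotion and board.turn == player:
                terms.add("?=")
            board.push(move)
            # After the push it is the other side's turn; a check delivered
            # by the player leaves the opponent's side to move in check.
            if board.is_check() and board.turn != player:
                terms.add("?+")
        if board.is_checkmate():
            terms.add("?#")
        return terms

    # Back-rank mate example: 1. Rd8# yields {'?+', '?#'}.
    print(general_dynamic_terms("6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1", ["Rd8#"]))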
3.2.2 Solution sequence features. In this section we encode information about the solution move sequence. The encoding includes a term for each:

• type of piece moved: !-{piece symbol};
• type of piece captured: !x{piece symbol};
• attack between pieces that occurs during the solution: !{attacking piece symbol}>{attacked piece symbol};
• type of piece sacrificed: !S{piece symbol};
• (if the final position is a checkmate) type of piece involved in the checkmate: !#{piece symbol}.

We count a piece as involved in the checkmate if it is attacking either the king directly or any of the squares to which the king could move from the current position (ignoring checks).

To include information about the order of moves and captures, we also include a term for each two consecutive moves and captures in the solution. We also include a term for each pair of pieces involved in the checkmate, to capture more specific combinations of pieces.

Figure 3: Text encoding of a tactical position. (a) Encoded position. Black to play, solution: 1... Qh1+ 2. Nxh1 Rg2#. (b) Text encoding of each set of features for the position; examples of generated terms: static_positions: qc1 Pb2 Pf2 Kh2 Pa3 Rd3 Ng3 Rh3 Qb4 ...; reachable squares: qa1|0.78 qb1|0.89 qd1|0.89 qe1|0.78 ...; connectivity: q>Pb2 q>Pc4 Q>nb7 N>pg7 r>Ng3 ...; other (pawn and solution sequence) terms: Pq !N>q !q>K !b>N !K>r !r>K !r>P.

4 EXPERIMENTAL RESULTS

We used a chess tactics training course1 to obtain a set of position pairs that were considered similar by human experts. We manually checked the puzzles and verified the similarity between the solutions of the individual problem pairs. A total of 400 pairs were collected for the test data set. An example of such a pair is shown in Figure 4. The solution to both problems is to sacrifice the rook on the a-file to expose the king, resulting in checkmate with the other rook and the bishop on e4. The solution of the simplified problem contains the same motif, but there are far fewer pieces, so the solution is generally easier for students to find.

1 https://chesskingtraining.com/ct-art

Figure 4: A pair of tactical problems from the data set. (a) Base problem. Black to play, solution: 1... Rxa2+ 2. Kxa2 Ra8+ 3. Ba7 Rxa7+ 4. Qa5 Rxa5#. (b) Simplified problem. Black to play, solution: 1... Rxa2+ 2. Kxa2 Ra5+ 3. Qa4 Rxa4#.

4.1 Evaluation of Similarity Detection

We tested the effectiveness of our methods using the set of 400 pairs of problems described in the previous section. We first built an index using the simplified version of the problem from each pair, then performed a query on the index with each of the regular problems. For each query we recorded the rank of the matching position in the results and calculated how often the matching position appeared as the top result or within the first N results.

We tested the search accuracy using the following feature subsets: each feature group on its own, all static features, all dynamic features, and all features combined. All runs used the default BM25 parameters k1 = 1.2 and b = 0.75, and all included feature sets were weighted equally. The results are presented in Table 1.

Table 1: Success rates for different configurations.

    Feature set used               top-1   top-5   top-10
    static_positions               0.234   0.378   0.428
    static_pawns                   0.033   0.083   0.126
    dynamic_general                0.008   0.038   0.071
    dynamic_solution               0.421   0.657   0.761
    all static features            0.252   0.370   0.433
    all dynamic features           0.418   0.652   0.761
    all features, equal weights    0.481   0.736   0.814

Using either only static or only dynamic features did not yield the best results. The results improved significantly when static and dynamic features were combined. This shows that each set of features covers a different aspect of a tactic, and both need to be considered when determining similarity.
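The paper relies on Lucene's BM25 implementation; purely to make the ranking function concrete, here is a self-contained Python sketch using the same default parameters k1 = 1.2 and b = 0.75 (the example documents and variable names are ours).

    import math
    from collections import Counter

    def bm25_scores(query, docs, k1=1.2, b=0.75):
        # docs: list of term lists, one per indexed tactical problem
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter(t for d in docs for t in set(d))    # document frequencies
        scores = []
        for d in docs:
            tf = Counter(d)
            score = 0.0
            for t in query:
                if tf[t] == 0:
                    continue
                idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
                norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
                score += idf * tf[t] * (k1 + 1) / norm
            scores.append(score)
        return scores

    docs = [["?px", "?+", "?#"], ["?ox", "?1/2"], ["?px", "?#", "?S"]]
    print(bm25_scores(["?px", "?#"], docs))   # documents sharing more terms score higher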
4.2 Similar Position Retrieval

In the second experiment, we selected 10 contextually different chess tactical problems and then automatically retrieved the 5 most similar positions for each of them from a large database of 278,840 tactical problems constructed from the lichess.org game database. Building the index took about 14 minutes (it only needs to be done once), and retrieval was fast: only about 4 seconds.

Figure 5 shows a query position and the first two of the five most similar retrieved positions. This example illustrates how the similarity ranking works and how the static and dynamic features contribute to the similarity scores of the results. The query position is an example of a discovered attack motif. With 1... Bh2+, Black sacrifices the bishop to later capture the rook on e1 with the queen. The first result shows the same motif with an almost identical move sequence. The main difference is that the key pieces are on the d-file and not on the e-file. The second result is another case of a discovered attack. In this example it is not a bishop but a knight that is sacrificed with a check to the white king. It is the static similarity (the arrangement and position of the pieces in the initial position) that contributes most to the great overall similarity of this tactical problem, although a certain dynamic similarity was also detected.

Figure 5: Example of retrieval results. (a) Query position. Black to play, solution: 1... Bh2+ 2. Kxh2 Qxe1. (b) Retrieval results:

    Result  Solution                   Similarity score
    1       1... Bh2+ 2. Kxh2 Qxd1     static 38.95, dynamic 45.04, total 83.99
    2       1... Nf3+ 2. Qxf3 Qxe1+    static 64.62, dynamic 12.32, total 76.94

The resulting most similar positions were shown to a chess expert. The expert was asked to comment on the reasons for the similarity of the retrieved problems to the original query positions, taking into account both static and dynamic aspects. The expert was able to explain the similarity in 48 out of 50 problems. Overall, the expert praised the program's ability to detect dynamic similarity of positions, even when the initial positions differ significantly.

5 CONCLUSIONS

We introduced a novel method for retrieving similar chess positions, which takes into account not only static similarity due to the arrangement of the chess pieces, but also dynamic similarity based on the recognition of chess motifs and dynamic, tactical aspects of position similarity. The merits of the method were put to the test in two experiments. The first experiment emphasized the importance of including both static and dynamic features for the successful detection of similar chess motifs. In the second experiment, the program was able to quickly traverse a large database of positions to identify similar chess tactical problems. A chess expert was able to explain the similarity in the vast majority of the retrieved problems and praised the program's ability to detect dynamic similarity of positions even when the initial positions differ significantly. The resulting program can be useful for the automatic generation of instructive examples for chess training.

REFERENCES
[1] G. Costeff. 2004. The Chess Query Language: CQL. ICGA Journal, 27, 4, 217–225.
[2] Mark Dvoretsky and Artur Yusupov. 2006. Secrets of Chess Training. Edition Olms.
[3] International Chess Federation (FIDE). 2020. The FIDE Handbook. https://handbook.fide.com/.
[4] Debasis Ganguly, Johannes Leveling, and Gareth J. F. Jones. 2014. Retrieval of similar chess positions. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 687–696.
[5] Matej Guid, Martin Možina, Jana Krivec, Aleksander Sadikov, and Ivan Bratko. 2008. Learning positional features for annotating chess games: A case study. In International Conference on Computers and Games. Springer, 192–204.
[6] Chess Informant. 2014. Encyclopedia of Chess Combinations, 5th Edition. Chess Informant.
[7] Matic Plut. 2018. Recognition of positional motifs in chess positions. Diploma thesis. University of Ljubljana.
[8] Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, Mike Gatford, et al. 1995. Okapi at TREC-3. NIST Special Publication SP, 109, 109.
[9] Simon Stoiljkovikj, Ivan Bratko, and Matej Guid. 2015. A computational model for estimating the difficulty of chess problems. In The Third Annual Conference on Advances in Cognitive Systems.
[10] Beverly Park Woolf. 2010. Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing E-learning. Morgan Kaufmann.
[11] Moshé M. Zloof. 1975. Query-by-example: the invocation and definition of tables and forms. In Proceedings of the 1st International Conference on Very Large Data Bases, 1–24.

Drinking Detection From Videos in a Home Environment

Carlo M. De Masi, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, carlo.maria.demasi@ijs.si
Mitja Luštrek, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, mitja.lustrek@ijs.si

ABSTRACT

We present a pipeline developed with the aim of helping people with mild cognitive impairment (MCI) in the accomplishment of every-day tasks. Our system adopts a number of computer vision methods to analyze RGB videos collected from cameras, and provides successful, quasi real-time detection of the targeted activity (drinking) when the latter is at least partially visible to the camera.

KEYWORDS

computer vision, activity recognition, object detection, pose estimation
1 INTRODUCTION

Mild cognitive impairment (MCI) is a common problem among the elderly, affecting 15–20% of people over 65 in the USA [10]. In order to help people affected by MCI in the accomplishment of every-day tasks, we adopt various kinds of detection techniques to predict what users are currently doing, which, combined with knowledge of their activity schedule, allows our system to provide context-based reminders. Here, we present our attempts to detect one such activity (drinking) from videos, using computer vision and deep learning algorithms.

This paper is organized as follows. In the remainder of this section, we give an overview of the current state of the art regarding activity recognition from videos. In Section 2 we describe the computer vision techniques used to trigger the more computationally intensive task of activity recognition, in order to obtain quasi real-time monitoring of the user's activities. Finally, in Sections 3 and 4 we present the results and conclusions of the paper.

1.1 Video Activity Recognition

Unlike image classification, where in recent years a number of clear front-runner architectures and techniques have been established, activity recognition from videos still presents numerous open issues [1].

An immediate approach to the problem consists in using image classification networks to extract features from each frame of the video; predictions for the whole video can then be obtained either by pooling over frames (at the cost of losing information about temporal ordering) [5] or by adopting LSTM layers [2].

A more elaborate way to adapt the concepts used in image classification to video recognition consists in using 3DCNNs, i.e. convolutional models characterized by an additional, third temporal dimension [4, 12, 13]. The increased number of parameters makes 3DCNNs generally harder to train than their 2D counterparts. One way to fix this is to produce 3D models by "inflating" 2D ones, i.e. by adding a temporal dimension to a model pre-trained for image classification. This makes it possible to determine the architecture of the 3D network and to bootstrap its weights from the corresponding values in the 2D model: convolutional kernels with dimensions N × N are inflated to a 3D kernel with dimensions N × N × t, spanning t frames, and each of the t planes in the N × N × t kernel is initialized with the pre-trained N × N weights rescaled by 1/t [1, 9].
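The inflation of a single layer can be illustrated with the following PyTorch sketch (ours, not the code of [1, 9]): the pre-trained N × N kernel is replicated t times along the new temporal axis and rescaled by 1/t, exactly as described above.

    import torch
    import torch.nn as nn

    def inflate_conv2d(conv2d: nn.Conv2d, t: int) -> nn.Conv3d:
        # Build the 3D layer with an N x N x t kernel spanning t frames
        conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                           kernel_size=(t, *conv2d.kernel_size),
                           padding=(t // 2, *conv2d.padding),
                           bias=conv2d.bias is not None)
        with torch.no_grad():
            # (out, in, kH, kW) -> (out, in, t, kH, kW), each plane rescaled by 1/t
            conv3d.weight.copy_(conv2d.weight.unsqueeze(2).repeat(1, 1, t, 1, 1) / t)
            if conv2d.bias is not None:
                conv3d.bias.copy_(conv2d.bias)
        return conv3d

    layer2d = nn.Conv2d(3, 64, kernel_size=7, padding=3)  # e.g. from a pre-trained model
    layer3d = inflate_conv2d(layer2d, t=7)
    print(layer3d.weight.shape)                           # torch.Size([64, 3, 7, 7, 7])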
Another approach separately analyzes spatial components (i.e. single frames), providing static information about the scenes and objects in the picture, and temporal components related to motion and variation between frames [11]. A two-stream network processes single frames and optical flows in parallel, and then combines their predictions.

Finally, another method worth mentioning is based on the observation that some actions (e.g., clapping hands) are better characterized by high-frequency temporal features, whereas others (e.g., dancing) can be better understood when lower-frequency variations are observed. As a result, a model characterized by two parallel channels can be used. The first (slow) channel operates at a low framerate and analyzes a few sparse frames, in order to deduce the semantics of the action, while the second (fast) branch is responsible for capturing fast variations, and so operates at a higher framerate [3].

In this work, we adopted a modified version of an inflated 3D network as described in [14], extended to include non-local blocks. Unlike convolutional and recurrent operations, which are only able to capture spatio-temporal features in a local neighborhood, non-local blocks compute the response at a certain position as a weighted sum of the features at all positions in space and time. This allows the model to capture dependencies between pixels that are distant both in space and time, and makes it more accurate for video classification.

2 SYSTEM ARCHITECTURE

The purpose of our system is to provide users with context-based reminders related to the activity of drinking. To this aim, an RGB camera is placed in the kitchen of the user's apartment (where the activity is most likely to take place) and the video is sent through an RTSP stream to a remote server, to be analyzed by the activity recognition model during the day. The results are uploaded to a Cloud Firestore database, which is queried to determine whether the users have been drinking enough; if not, reminders are provided through an app running on a local device.

One problem arising from this scheme is that most action recognition algorithms are computationally expensive, which prevents them from running in real time. For this reason, we decided not to run the model continuously, but to execute it only in moments when it is most likely that the users are about to perform the targeted activity (sketched below). We employed a combination of classic and deep-learning-based computer vision techniques to identify triggers for the video activity recognition model, such as: (i) the user standing in certain areas of the kitchen; (ii) the user standing in certain areas of the kitchen and interacting with some objects (tap, fridge); (iii) a specific object, assumed to be used by the user for drinking, being moved from its current position.

Figure 1: System architecture. The video stream from RGB cameras is sent to a remote server and fed to the activity recognition model. Results are uploaded to a Firestore database, where they are monitored so that notifications can be sent back to an app.
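The trigger gating can be summarized with the sketch below (our illustration; cheap_trigger_fired and run_activity_recognition are hypothetical placeholders for the trigger checks of Sections 2.1–2.2 and the recognition step of Section 2.3).

    import cv2

    def monitor(rtsp_url, cheap_trigger_fired, run_activity_recognition):
        # Run lightweight trigger checks on every frame; invoke the
        # expensive activity recognition model only when a trigger fires.
        cap = cv2.VideoCapture(rtsp_url)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if cheap_trigger_fired(frame):      # ROI activation or mug displacement
                run_activity_recognition(cap)   # record and classify short clips
        cap.release()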
2.1 User Localization and Interaction with the Environment

The localization of the user and their interactions with the environment are detected through a combination of object detection and pose estimation techniques. For object detection, we adopted a Single Shot MultiBox Detector (SSD) [8], pre-trained on the 80 classes of the COCO dataset [7], which include "person". For pose estimation, we used a SimpleNet model with a ResNet backbone [15].

During the initial setup, the camera image is shown to the user (Fig. 2a) and regions of interest (ROIs) can be selected (Fig. 2b). These can be of two types, i.e. single- or double-zone. The first are identified by a single rectangular box, which is activated when the user's feet are within the box, hence providing an indication of the user's location (see Fig. 2c). Double-zone ROIs are formed by two rectangular boxes; one of them, analogously to the previous case, is activated when the user steps inside it, while the second box is activated if one of the user's hands (located by the pose estimation model) is within it (Fig. 2d). A double-zone ROI is considered activated only if both conditions are met (a sketch of this logic is given after the figure). Once the ROI is configured, the user is requested to input:

• the name used to identify the current ROI;
• an observation time t_obs (in seconds), i.e. the time after which the ROI is activated, once the requirements (user and hand positions) are met;
• an action to be performed once the ROI is activated. Currently, only one default action - recording and analyzing video clips - is supported, but this will be extended to include further possibilities.

Figure 2: Triggers based on the user's location and their interaction with the environment. Regions of interest are selected during the setup phase (b), and they are activated either if the user steps inside (c), or if the user steps inside and has their hands next to another object (d).
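The activation logic can be illustrated as follows (our sketch, assuming boxes as (x1, y1, x2, y2) tuples and key points as (x, y) pairs returned by the detectors; the observation time t_obs is handled outside this check).

    def inside(point, box):
        # True if an (x, y) point lies within an (x1, y1, x2, y2) box
        x, y = point
        x1, y1, x2, y2 = box
        return x1 <= x <= x2 and y1 <= y <= y2

    def roi_activated(roi, feet, hands):
        # Single-zone ROI: the user's feet must be inside the box.
        # Double-zone ROI: additionally, one hand must be inside the second box.
        if not all(inside(f, roi["zone"]) for f in feet):
            return False
        if "hand_zone" in roi:
            return any(inside(h, roi["hand_zone"]) for h in hands)
        return True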
2.2 Drinking Vessel Position Detection

A second trigger for activity recognition is given by the displacement of a particular object (mug, cup, glass). In this regard, in the pilot phase of the project users will be asked to always use one specific drinking vessel when drinking, which the model will be trained to recognize. For this task, we considered two possibilities:

• a classic computer vision approach, where the drinking vessel is located through color/shape-based detection;
• a deep learning object detection algorithm, re-trained to detect a personalized mug.

In the first scenario, we applied a series of filters (Gaussian blur, dilation/erosion) to reduce noise, followed by a color mask in the HSV space to select only objects with a certain color. A further selection is then made based on the shape properties of the previously selected areas: a polygonal approximation of their contours is performed, and other shape-related features such as area, circularity and convexity are considered to eliminate shapes different from the expected one (see the sketch at the end of this section). In the second case, we collected a dataset of about 500 images of the selected mug and used it to re-train a second SSD model.

In order to account for false negatives in the mug detection, which may occur in some frames even if the mug has not been moved, for each frame the current position of the mug is compared to the history of positions in the past few frames. Once a displacement of the mug is detected, the trigger is activated.

2.3 Clip Recording and Activity Recognition

Following the activation of one of the triggers, the subsequent video frames (covering a time interval of about 30 seconds) are used to generate short video clips, each with a duration of 10 seconds and an overlapping window of 4 seconds. These values were selected to give a higher probability of obtaining at least one video clip that completely captures the whole drinking process, and to match the length of the videos in the Kinetics400 dataset [6], which was used to train the activity recognition model.
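As a concrete illustration of the classic pipeline from Section 2.2, an OpenCV sketch follows (ours; the HSV bounds and shape thresholds are made-up placeholders that would require the careful tuning discussed in Section 3.2).

    import cv2
    import numpy as np

    def find_mug(frame, hsv_lo=(100, 120, 60), hsv_hi=(130, 255, 255)):
        # Denoise, mask by color in HSV, then filter candidate contours
        # by shape (area, circularity, convexity).
        blur = cv2.GaussianBlur(frame, (5, 5), 0)
        mask = cv2.inRange(cv2.cvtColor(blur, cv2.COLOR_BGR2HSV),
                           np.array(hsv_lo), np.array(hsv_hi))
        mask = cv2.dilate(cv2.erode(mask, None, iterations=2), None, iterations=2)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            area = cv2.contourArea(c)
            perimeter = cv2.arcLength(c, True)
            if area < 500 or perimeter == 0:
                continue
            circularity = 4 * np.pi * area / perimeter ** 2
            hull_area = cv2.contourArea(cv2.convexHull(c))
            convexity = area / hull_area if hull_area > 0 else 0
            if 0.5 < circularity <= 1.2 and convexity > 0.8:
                return cv2.boundingRect(c)   # (x, y, w, h) of the detected vessel
        return None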
3 RESULTS AND DISCUSSION

In this section, we present the results of the various steps involved in the whole drinking-detection pipeline.

3.1 User Localization - Results

We tested the efficiency of the localization module in different scenarios, varying based on how clearly the user was visible (completely visible; legs occluded; head occluded; head and legs occluded, only torso visible) and on which side of the user (front/back/right/left) was visible. The results showed an average accuracy of over 98%.

3.2 Drinking Vessel Position Detection - Results

As illustrated in Sec. 2.2, for the task of detecting the displacement of the drinking vessel we adopted two approaches, one based on classic computer vision methods and one on deep learning.

The first method provides neither a confidence score for detections nor the coordinates of the object's bounding box, so we took a simpler approach than with normal object detection algorithms in evaluating the results. We collected some videos in a home-like environment, with the object located in different positions or with a person handling it (moving it, using it to drink, etc.), and analyzed them frame by frame to check whether the objects present in each frame were detected or not. The resulting confusion matrix, reported in Table 1, shows that the detection algorithm scored precision and recall values of .93 (133/(133+10)) and .90 (133/(133+15)), respectively. This method proved to be very efficient when correctly fine-tuned, and the algorithm detected the object in most of the frames where it was at least partially visible. The greatest issue of the method is that it had to be very carefully tuned, especially regarding the color selection part, which remains sensitive to lighting variations even after converting the image to the HSV colorspace. False detections can also be a problem. We tested the algorithm in situations where some of the objects present in the scene had colors similar to the object we wanted to detect, and in spite of being able to filter out most of them, we still obtained some false positives, especially when the lighting varied, rendering the selection of the parameters for the color mask less effective.

Table 1: Confusion matrix for the color/shape-based detection of the mug.

              Pred P   Pred N
    True P       133       15
    True N        10        1

The results of the evaluation of the SSD model are shown in Fig. 3. As evident from the plot, the model immediately reached a very high mAP [7], on the order of ≈ 0.9, on our test dataset. It should be noted that, while preparing the training dataset, we followed a somewhat different approach than is usual for training object-detection models. In most situations, one wants to make the model as general as possible and avoid overfitting, which is achieved by taking images of the desired object in as many different conditions (size, aspect ratio, viewing angle, rotation, lighting) as possible. In our case, however, the location of the camera will be more or less constant, i.e. attached to the ceiling of the room, in order to provide a good view of the environment. As a result, this greatly limits the variability in the images of the object the system will analyze, especially regarding the aspect ratio and the orientation of the mug. Moreover, whereas an object detector is usually tasked with identifying many different instances of objects in a certain class (i.e., a generic "mug"), in our case the task is greatly simplified by the fact that we are looking to locate one very specific object.

Figure 3: mAP values on the test dataset for the SSD model, re-trained to recognize the project's custom mug.

3.3 Activity Recognition - Results

We tested the adopted activity recognition model on a new custom dataset consisting of roughly 100 videos we recorded ourselves in a variety of environments and conditions. In order to make the clips as similar as possible to real-life situations, the videos contained instances where actions similar to drinking were performed, to increase the recognition difficulty. The clips can be classified into two difficulty categories based on the angle the user was facing with respect to the camera; videos were classified as "hard" whenever this angle was greater than 90° (see Fig. 4). The precision-recall curve for the model on this dataset is shown in Fig. 5.

Figure 4: Difficulty classes for the custom dataset used to test the activity recognition model. Video clips were classified as "hard" whenever the angle between the user's front side and the camera was greater than 90°.

Figure 5: Test results of the activity recognition model on the test dataset.

4 CONCLUSIONS

The tests performed on the triggers are very encouraging for the one based on the user's location and their interaction, and indicate that the deep-learning approach should be preferred for the detection of the drinking vessel and its displacement, especially after increasing the amount of training data. The activity recognition model based on an inflated 3D CNN with the addition of non-local blocks provided the best accuracy in situations where the user is facing the camera at least partially, and the use of triggers allows for quasi real-time usage. A number of improvements will be added to the pipeline in the future. Currently, only one action is triggered, i.e. recording and analysis of video clips, but we plan to include other possibilities, such as using the information on the user's location to check whether they need assistance in operating domestic appliances. The object detection model could also be extended to identify interactions with other elements of the environment and provide corresponding context-based responses. Finally, the only action currently recognized is drinking, but as mentioned in the introduction the aim of the project is to assist users in the accomplishment of various activities. In this sense, the next planned step is to include detection of parts of morning toilet routines, such as brushing teeth and washing hands.
REFERENCES
[1] Joao Carreira and Andrew Zisserman. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.
[2] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, et al. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2625–2634.
[3] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, et al. 2019. SlowFast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, 6202–6211.
[4] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 2012. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1, 221–231.
[5] Andrej Karpathy, George Toderici, Sanketh Shetty, et al. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732.
[6] Will Kay, Joao Carreira, Karen Simonyan, et al. 2017. The Kinetics human action video dataset. arXiv: 1705.06950 [cs.CV].
[7] Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. 2014. Microsoft COCO: Common objects in context. arXiv: 1405.0312 [cs.CV].
[8] Wei Liu, Dragomir Anguelov, Dumitru Erhan, et al. 2016. SSD: Single shot multibox detector. Lecture Notes in Computer Science, 21–37. doi: 10.1007/978-3-319-46448-0_2.
[9] Elman Mansimov, Nitish Srivastava, and Ruslan Salakhutdinov. 2015. Initialization strategies of spatio-temporal convolutional neural networks. arXiv preprint arXiv:1503.07274.
[10] Ronald C. Petersen, Oscar Lopez, Melissa J. Armstrong, et al. 2018. Practice guideline update summary: Mild cognitive impairment: Report of the guideline development, dissemination, and implementation subcommittee of the American Academy of Neurology. Neurology, 90, 3, 126–135.
[11] Karen Simonyan and Andrew Zisserman. 2014. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, 568–576.
[12] Du Tran, Lubomir Bourdev, Rob Fergus, et al. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 4489–4497.
[13] Gül Varol, Ivan Laptev, and Cordelia Schmid. 2017. Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40, 6, 1510–1517.
[14] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
[15] Bin Xiao, Haiping Wu, and Yichen Wei. 2018. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), 466–481.

Semantic Feature Selection for AI-Based Estimation of Operation Durations in Individualized Tool Manufacturing

Erik Dovgan, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, erik.dovgan@ijs.si
Bogdan Filipič, Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, bogdan.filipic@ijs.si

ABSTRACT

Accurate estimation of operation durations is of key importance in production processes, since the accuracy of the estimations directly affects the quality of production plans and thus the entire production process. This task is even more challenging when individualized tools are being produced. From the machine learning point of view, this means a low number of diverse samples, while the number of features can be significantly higher. To tackle this issue, we introduce semantic feature selection, which reduces the number of features. This results in a better ratio between the number of samples and features and, at the same time, reduces the prediction error. We demonstrate the proposed approach on the problem of estimating the operation durations in the manufacturing of injection molds and show the prediction accuracy improvement resulting from the semantic feature selection.

KEYWORDS

injection molding, tool manufacturing, duration prediction, feature selection, random forest
1 INTRODUCTION

The efficiency of tool shop manufacturing processes heavily depends on the accuracy of production plans. Inaccurate plans can lead to significant delays in production, due date violations, late delivery penalties, and even the loss of customers. A key step of planning is accurate estimation of the durations of all the operations to be executed in the manufacturing process. The estimation can be performed manually by an expert utilizing his/her expert knowledge, or automatically by means of tools such as those involving AI methods, as demonstrated, for example, in [3].

Automated estimation of operation durations with AI methods consists of learning a predictive model from the features extracted from examples of past, i.e., already concluded, operations and their actual durations, and then applying the model to new operations with known features and unknown durations. In the case of tool manufacturing, the features can be extracted from 3D computer models of already manufactured tools. To build an accurate predictive model, a large set of already manufactured tools has to be processed. However, this is not possible in certain cases, for example, when dealing with individualized tools, such as injection molds. This is due to the fact that tool shops specialized in individualized tool manufacturing typically produce only a few such tools per year. In addition, these tools are very diverse, which increases the difficulty of automated duration prediction.

We propose an approach for predicting operation durations in the manufacturing of individualized tools. The tools are manually divided into several positions of varying complexity, where each position is specified with a 3D computer model. In addition, a set of operations is predefined for each of these positions. The proposed approach processes the 3D model of each position and predicts the duration of the corresponding manufacturing operations. To this end, it first extracts a set of volume, surface, gradient and other features from the 3D model, and then applies the Random Forest regression model [1] to predict the duration of each operation. This process is additionally enhanced with semantic feature selection, which evaluates various sets of semantically related features, such as volume features, in order to assess the predictive capability of these feature sets. We demonstrate the proposed approach on the problem of estimating the operation durations in the manufacturing of injection molds in a specific tool shop. By processing a dataset from this tool shop, we show the prediction accuracy improvement resulting from the semantic feature selection.

The rest of the paper is organized as follows. Section 2 introduces the relevant tool positions and the related operations, and describes the extracted features and the semantic feature selection. Numerical experiments and the obtained results are presented in Section 3. Finally, Section 4 concludes the paper with a summary of our work and ideas for future work.

2 PREDICTING OPERATION DURATIONS WITH AI METHODS

Prediction of operation durations consists of extracting features from the tool data in the form of 3D computer models, and applying a machine learning model to predict the durations. This approach is applied to each tool position and each operation at this position independently; thus a custom machine learning model is built and applied for each combination of position and operation. In addition, when feature selection is involved, a different set of features is considered for each of these combinations.
2.1 Relevant Positions and Related Operations

The tools regarded in this study are injection molds that are used to form final products made of plastic under high pressure. Although an injection mold is composed of several positions, its most complex and thus most relevant positions are the bottom and the top element. These two elements have to be manufactured with the highest precision. Since they are in physical contact with the final product, any defect of the mold surface would result in a defect of the final product. An example of an injection mold is shown in Figure 1, where the red color indicates the surface that is in contact with the final product. In the dataset used in this study, these two elements are marked as positions 1 and 30. These positions require a set of operations, of which the most relevant are shown in Tables 1–2.

Figure 1: Example of a 3D computer model of an injection mold, https://grabcad.com/library/injection-mold-pc-abs-1 by Mauro Menchini.

Table 1: Operations at Position 1.

    Operation   Description
    32          CAM rough
    31          CAM fine
    43          CAM erosion
    19          Heat treatment
    23          Measuring machine
    36          CNC milling 3 axis, rough
    41          CNC milling 3 axis, fine
    42          CNC milling 5 axis, fine
    13          Submersible erosion

Table 2: Operations at Position 30.

    Operation   Description
    32          CAM rough
    31          CAM fine
    37          CAM wire erosion
    43          CAM erosion
    19          Heat treatment
    11          Wire erosion
    23          Measuring machine
    36          CNC milling 3 axis, rough
    41          CNC milling 3 axis, fine
    42          CNC milling 5 axis, fine
    13          Submersible erosion

2.2 Description of the Extracted Features

The proposed approach extracts a set of features from a 3D computer model of a tool. These features were suggested by a tool shop expert and can be categorized as follows:

• volumes of the entire tool position (such as the volume of the shape and the volume of the mold),
• volumes of the holes that are open, and of those that are closed,
• features for each of 6 directions, i.e., projections (x, y, z, each of them increasing or decreasing); for example, direction (z, decreasing) defines the features obtained from the top-down projection, while direction (z, increasing) defines the features obtained from the bottom-up projection; the features for each direction are as follows:
  – volumes (including the volumes of holes),
  – surface area,
  – number of faces,
  – number of faces per dm2,
  – valley features, computed as the height versus width ratio of the valleys (in all valley directions, to find the maximum value); this feature is aimed at identifying deep and narrow valleys that are harder to process,
  – valley height, computed as the height of the valleys in all valley directions (to find the maximum value); this feature is aimed at obtaining the depth of valleys, which represents the drill distance,
  – gradient features, calculated as the maximum gradient in all directions; this feature is aimed at identifying areas with a non-horizontal and non-vertical gradient that are harder to process.
Since the valley features, valley height and gradient features are calculated for each point of the projection, the number of features is very high and varies across the tool positions, which are of varying sizes. To reduce the number of features and obtain a constant number of features independently of the position size, histograms of these features are calculated using expert-defined bins.
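For example, reducing the per-point values of one projection to a fixed-length histogram feature vector could look as follows (a sketch; the bin edges below are invented stand-ins for the expert-defined ones).

    import numpy as np

    # Hypothetical expert-defined bin edges for the gradient feature
    GRADIENT_BINS = [0.0, 0.2, 0.5, 1.0, 2.0, np.inf]

    def histogram_feature(per_point_values, bins=GRADIENT_BINS):
        # Collapse a variable-sized set of per-point values (one per
        # projection point) into a constant-length histogram,
        # independent of the position size.
        counts, _ = np.histogram(per_point_values, bins=bins)
        return counts / max(len(per_point_values), 1)   # normalized frequencies

    print(histogram_feature(np.random.default_rng(0).exponential(0.5, size=1000)))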
The 3D model of each position also contains expert-defined annotations of the model parts, marked with different colors of the model faces (see the example in Figure 1). These model parts are also taken into account when extracting features, thereby obtaining additional features that characterize a feature for each part independently. For example, when calculating the number of faces, one feature is obtained for all the faces, and for each part an additional feature is calculated denoting the number of faces on that specific part. The part-specific features are calculated for the following features:

• volumes of the holes: total, open, closed,
• projection features:
  – volume,
  – surface area,
  – number of faces,
  – number of faces per dm2,
  – valley features,
  – valley height,
  – gradient features.

Examples of parts that are annotated in the 3D computer models include: (1) Free holes, (6) Tolerance holes, (7) Parting surface, (10) Matching surfaces, (12_4) Part shape: High gloss polished, (12_5) Part shape: Optical faces, (12_7) Part shape: Galvanic pins, (12_8) Part shape: Special surface finishing. In total, 30 parts are annotated by the expert.

2.3 Semantic Feature Selection

The total number of features obtained by the presented feature extraction procedure is 3472. Since this is a large number, we introduce semantic feature selection, which combines semantically similar features into (partially overlapping) feature sets. In addition, the tool shop expert also selected a set of the most relevant features for each operation; however, this was defined only for a limited set of crucial operations. The resulting feature sets and the related numbers of features are shown in Table 3. Specifically, if the name of a set contains "part", the set contains all the features of the specific part. The "valley_hist_" sets contain the valley features, "valley_h_" the valley height, and "grad_hist_" the gradient features. The projection sets "projection_" contain all the features from specific projections and are defined as follows:

• projection_100: projection from left to right (x axis)
• projection_200: projection from right to left (x axis)
• projection_010: projection from front to back (y axis)
• projection_020: projection from back to front (y axis)
• projection_001: projection from bottom to top (z axis)
• projection_002: projection from top to bottom (z axis)

In total, 60 sets of features were defined.

Table 3: Feature Sets.

    Name                              Number of features
    expert                            524 on average
    volume                            6
    volume_projection                 30
    volume_no_hole                    3
    volume_projection_no_hole         6
    volume_hole                       3
    volume_projection_hole            24
    volume_hole_part                  90
    volume_projection_no_hole_part    180
    material                          4
    surface_projection                6
    surface_projection_part           180
    faces_count_projection            6
    faces_count_projection_part       180
    faces_per_dm2_projection          6
    faces_per_dm2_projection_part     180
    valley_hist_projection            18
    valley_hist_projection_part       540
    valley_h_projection               48
    valley_h_projection_part          1440
    grad_hist_projection              18
    grad_hist_projection_part         540
    projection_*                      562
    projection_side                   2248
    projection_top_bottom             1124
    part_*                            111

3 EXPERIMENTS AND RESULTS

We evaluated the proposed approach on a dataset from the Plamtex tool shop [4, 2]. Due to individualized tool manufacturing, the number of already produced tools was low, namely 30 instances of position 1 and 26 instances of position 30. Besides the actual duration of each operation, each instance also included the duration estimated by the tool shop expert.

The operation durations were predicted with the Random Forest regression model. Its performance was assessed with the leave-one-out test using the default model-building parameters. The selected performance metric was the Root Mean Squared Error (RMSE), which has to be minimized. The RMSE was also calculated for the durations estimated by the expert. The effectiveness of feature selection was determined by comparing the Random Forest performance when using all the features and when using only a selected set of features.
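This evaluation protocol can be sketched with scikit-learn as follows (our illustration; X holds the columns of one feature set and y the actual operation durations).

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import LeaveOneOut

    def loo_rmse(X, y):
        # Leave-one-out RMSE of a default Random Forest, as used to compare
        # the full feature set against each semantic feature subset.
        errors = []
        for train, test in LeaveOneOut().split(X):
            model = RandomForestRegressor(random_state=0).fit(X[train], y[train])
            errors.append((model.predict(X[test])[0] - y[test][0]) ** 2)
        return np.sqrt(np.mean(errors))

    # rmse_per_set = {name: loo_rmse(X[:, cols], y) for name, cols in feature_sets.items()}
    # best_set = min(rmse_per_set, key=rmse_per_set.get)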
The initial experiment aimed at finding out whether the prediction of operation durations involving the proposed feature selection outperforms the prediction without feature selection, i.e., considering all the features (the default feature set). To this end, for each combination of position and operation, all the feature sets were processed and the feature set with the lowest RMSE was selected. The results are shown in Figure 2. These results are normalized with respect to the RMSE of the durations estimated by the expert and are therefore expressed as percentages of the RMSE resulting from the expert estimation. They show that for each combination of position and operation, there exists at least one set of features that allows for more accurate prediction than the default feature set (since it reduces the RMSE). In addition, for position 1, operation 32, and position 30, operation 31, the default feature set produces an RMSE equal to the RMSE of the expert estimation, while feature selection improves it. For position 30, operation 32, the default feature set results in a higher RMSE than the RMSE of the expert estimation. Although in this case feature selection improves the result, it still performs worse than the expert estimation.

Figure 2: Percentages of RMSE with respect to the RMSE of durations estimated by the tool shop expert (bars: planned by expert, no feature selection, feature selection). The horizontal axis denotes the combinations of (position, operation).

Subsequently, the most relevant combinations of positions and operations were analyzed in more detail, and selected results are presented in Figures 3–5. These results show the RMSE of the durations estimated by the expert, the RMSE obtained without feature selection, and the RMSE obtained with various sets of features. To make the figures readable, we only show the best 33% of the feature sets. Figure 3 shows position 1 and operation 36 (i.e., CNC milling 3 axis, rough). The best features are the gradient features, surface features and features from the bottom-up projection. Note also that the bottom side of this position is the most complex one, thus the bottom-up projection is of high importance. The same projection is also the most relevant for position 1, operation 13 (i.e., submersible erosion) (see Figure 4), since the erosion is applied only to the bottom side of this position. Part 9 (i.e., released surfaces) and the faces count are also among the most important features, where the faces count can be used to estimate the complexity of the surface that has to be eroded. Finally, position 30, operation 13 (i.e., submersible erosion) is shown in Figure 5. For this combination, the top-down projection is the most relevant, since the erosion is applied only to the top side of this position. Part 1 (i.e., free holes) and the faces count are also very important. The importance of the appropriate projection and the faces count is consistent with the results for position 1 and the same operation (see Figure 4).

Figure 3: RMSE obtained when predicting the duration of operation 36 (CNC milling 3 axis, rough) at position 1.

Figure 4: RMSE obtained when predicting the duration of operation 13 (submersible erosion) at position 1.

Figure 5: RMSE obtained when predicting the duration of operation 13 (submersible erosion) at position 30.
4 CONCLUSION

We presented an AI-based approach to predicting the operation durations in individualized tool manufacturing, which is, in the long run, aimed at replacing the existing human-based estimation process. The proposed approach extracts a set of features from 3D computer models of tools and applies Random Forest regression to predict the operation durations. To further improve the prediction accuracy, it includes semantic feature selection, which combines features into semantically meaningful feature sets. The experimental results showed that this approach in most cases outperforms the expert predictions. In addition, semantic feature selection outperforms the approach with no feature selection. A detailed analysis of the proposed feature selection approach showed that there exist meaningful relations between the tool manufacturing operations and the best performing feature sets for predicting the durations of these operations.

In future work we will evaluate additional regression algorithms to assess the quality of the Random Forest predictions. It would also be relevant to analyze the samples for which the prediction error is the highest. Special attention should be given to the operation for which the presented approach did not outperform the expert prediction.

ACKNOWLEDGMENTS

This work was in part funded by the KET4CleanProduction project "Improved Planning of Manufacturing Processes for Individualized Tools", where the AI-based solution was developed for the Plamtex tool shop. The authors also acknowledge the financial support from the Slovenian Research Agency (research core funding No. P2-0209). We are particularly grateful to Plamtex for sharing the tool dataset and the expert knowledge on tool manufacturing, positions, operations, and the suitable features.

REFERENCES
[1] Leo Breiman. 2001. Random forests. Machine Learning, 45, 1, 5–32.
[2] Erik Dovgan, Peter Korošec, and Bogdan Filipič. 2020. ToolAnalysis: A program for predicting the duration of machining operations in the production of tools using artificial intelligence. Technical report IJS-DP 13195. Jožef Stefan Institute, Ljubljana.
[3] Mesut Kumru and Pinar Yildiz Kumru. 2014. Using artificial neural networks to forecast operation times in metal industry. International Journal of Computer Integrated Manufacturing, 27, 1, 48–59.
[4] Plamtex INT, d.o.o. 2020. https://www.plamtex.si/en/.

Generating Alternatives for DEX Models using Bayesian Optimization

Martin Gjoreski, Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan Postgraduate School, Ljubljana, Slovenia, martin.gjoreski@ijs.si
Vladimir Kuzmanovski, Department of Computer Science, Aalto University, Finland, and Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia, vladimir.kuzmanovski@aalto.fi
Marko Bohanec, Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia, marko.bohanec@ijs.si

ABSTRACT

Multi-attribute decision analysis is an approach to decision support in which decision alternatives are assessed by multi-criteria models. In this paper, we address the problem of generating alternatives: given a multi-attribute model and an alternative, the goal is to generate alternatives that require the smallest change to the current alternative to obtain a desirable outcome.
We present a novel method for alternative generation based on Bayesian optimization and adapted to qualitative DEX models. The method was extensively evaluated on 42 different DEX decision models of variable complexity (e.g., variable depth and variable attribute weight distribution). The method's behavior was analyzed with respect to computing time, time to obtaining the first appropriate alternative, number of generated alternatives, and number of attribute changes required to reach the generated alternatives. The experimental results confirmed the method's suitability for the task, generating at least one appropriate alternative within one minute. The relation between the decision model's depth and the computing time was linear and not exponential, which implies that the method is scalable.

KEYWORDS

multi-attribute models, method DEX, alternatives, decision support, Bayesian optimization

1 INTRODUCTION

Hierarchical multi-attribute models are a type of decision models [1], [2], [3] which decompose the problem into smaller and less complex subproblems and represent it by a hierarchy of attributes and utility functions. Such decision models are especially useful in complex decision problems [4], [5].

DEX is a hierarchical qualitative multi-attribute method whose models are characterized by the use of qualitative (symbolic) attributes and decision rules. The method is supported by DEXi [6], [7], [8], an interactive computer program for the development of qualitative multi-attribute decision models and the evaluation of alternatives (options). DEXi has been used to analyze decision problems in different domains: healthcare [9], agriculture [10], [11], [12], economy [13], etc.

A useful extension of DEX would be the possibility to search for new alternatives that require the smallest change to the existing alternative to obtain a desirable outcome. This task is important for practical decision support [14]; however, the related work on generating alternatives for qualitative multi-attribute decision models is quite scarce. The only related study was presented by Bergez [15], in which the focus is on attribute scoring (and not on the alternatives), and the starting (current) alternative was not taken into consideration. More specifically, Bergez developed a genetic algorithm for finding a set of the "worst-best", i.e., the lowest scores for the input attributes that lead to the highest score for the root attribute (the decision model's output), and the "best-worst", i.e., the highest scores for the input attributes that lead to the lowest score for the root attribute.

In this study, we developed a stochastic method for generating alternatives that require the smallest change to the current alternative to obtain a desirable outcome. To avoid combinatorial explosion, the method uses guided search based on Bayesian optimization. The method is evaluated on 42 different qualitative multi-attribute models of varying complexity. The method's behavior was analyzed with respect to several characteristics, including: computing time, time to the first appropriate alternative, number of generated (appropriate) alternatives, and number of attribute changes required to reach the generated alternatives.

2 DOMAIN DESCRIPTION

In this study, a set of 42 DEX multi-attribute decision models was used. The models are benchmark mock models designed by Kuzmanovski et al. [16]. The decision models are designed by taking into account properties such as model depth, the distribution of the attributes' aggregation weights (weights' distribution), and the inter-dependency of attributes (input links). Table 1 presents a summary of the decision models. The weights' distribution is given with descriptive names: skewed, normal, and uniform. All the attributes in the models are defined with the same value scale (low, medium, high), including the input and the output attributes. An additional assumption is that all attribute combinations are possible.
Table 1: Properties of the mock DEX decision models.

3 METHOD FOR GENERATING ALTERNATIVES

An efficient search strategy is required to generate alternatives that require the smallest change to the current alternative to obtain a desirable outcome. A naïve approach would be to generate all possible alternatives, or to iteratively generate random alternatives, and to evaluate the outcome of each alternative. However, for reasonably complex decision models, the search space can be enormous, rendering the naïve approaches unsuitable.

A more appropriate approach is to use informed search based on the history of previously generated and evaluated alternatives. The history can be used to estimate the search space and the behavior of the decision model. Based on that estimation, more promising alternatives can be generated. By focusing on the more promising alternatives, the search space is reduced and, consequently, the time needed to find the appropriate alternatives is also reduced. The next subsections describe a stochastic method that uses Bayesian optimization to efficiently generate such alternatives. The method assumes that we do not know the internal rules by which the decision models operate; thus it falls into the category of "black-box" optimization techniques. Knowing and utilizing the decision rules might help the search algorithm, but this option was not addressed in this study.

3.1 Implementation

The problem of generating alternatives that require the smallest change to the current alternative to obtain a desirable outcome can be defined as an optimization problem with two objectives: (1) improved outcome (desired output) of the decision model, and (2) maximum similarity between the current alternative c̄ and the new proposed alternative ā. For each decision model DM, one alternative can be defined as a tuple of attributes ā = (a1, a2, ..., an), where each attribute can take any value from a limited set of values. Usually, that set includes ordinal values (e.g., low, medium and high), and those values can be encoded with integers (e.g., 0, 1 and 2). Consequently, a distance d between alternatives can be defined over a Euclidean space. The specific distance function used by the method is a modified element-wise difference between the candidate alternative ā and the current alternative c̄. This distance considers only the attributes for which the candidate alternative has higher values than the current alternative c̄:

    d(c̄, ā) = Σ_j { a_j − c_j,  if a_j > c_j
                   { 0,          if a_j ≤ c_j

From the distance function, a similarity function s can also be defined as one minus the normalized distance. The distance is normalized using the maximum plausible distance for the specific problem. For example, if ā has 20 attributes with possible values between 0 and 2 and each attribute has the highest possible value, and if c̄ has only attributes with the lowest possible value (0), then the maximum distance is 20 * 2:

    s(c̄, ā) = 1 − d(c̄, ā) / max_distance

Finally, the optimization function can be defined as:

    f(c̄, ā, DM(c̄), DM(ā)) = { s(c̄, ā),  if DM(ā) > DM(c̄)
                              { 0,         if DM(ā) ≤ DM(c̄)

where DM(∗) is the output of the decision model for the specific alternative. By optimizing f, the method searches for alternatives that are as similar as possible to c̄ and improve the output of the decision model (DM(ā) > DM(c̄)).
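In Python, the three definitions above translate directly as follows (our sketch, with alternatives encoded as integer tuples).

    def distance(c, a):
        # Element-wise difference counting only attributes where the
        # candidate alternative a exceeds the current alternative c
        return sum(aj - cj for cj, aj in zip(c, a) if aj > cj)

    def similarity(c, a, max_distance):
        # One minus the distance normalized by the maximum plausible distance
        return 1 - distance(c, a) / max_distance

    def objective(c, a, dm, max_distance):
        # Similarity if the decision model's output improves, 0 otherwise
        return similarity(c, a, max_distance) if dm(a) > dm(c) else 0.0

    # Example: 3 attributes on the scale {0, 1, 2}, so max_distance = 3 * 2
    c, a = (0, 1, 0), (1, 1, 2)
    print(distance(c, a))        # 3
    print(similarity(c, a, 6))   # 0.5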
In order to apply the Bayesian optimization approach, a surrogate function (a model), an acquisition function, and a generator of alternatives need to be defined. The surrogate model SM is a model that estimates the objective function for a given alternative as input. Typically, models based on Gaussian Processes (GP) [17] are used because, by exploiting the mean and the standard deviation of the output distribution, we can balance the trade-off between exploiting (higher mean) and exploring (higher standard deviation). Since GP models are computationally expensive, with a complexity of O(n^3), ensemble models such as Random Forest (RF) can also be used [18]. In that case, the mean and the variance are calculated based on the predictions of all base models available in the ensemble. Our method uses an RF with 1000 decision trees as base models.

The acquisition function operates on top of the mean and standard deviation of the SM's output. The final version of the method uses the expected improvement (EI) as the acquisition function [19] (sketched below). This acquisition function checks the improvement that each candidate alternative brings with respect to the maximum known value (µ(SM(ā)) − a_best), and scales those improvements with respect to the uncertainty. If two alternatives have a similar mean value, the one with the higher uncertainty (σ(SM(ā))) will be preferred by the acquisition function.

Finally, we need to define the generator of alternatives. Our method uses two generators of alternatives: a neighborhood generator and a random generator. Based on the distance function d, a neighborhood relation can be defined: two alternatives ā1 and ā2 are considered neighbors with degree k if d(ā1, ā2) = k. The random generator is a generator of alternatives which: (1) avoids generating known alternatives; and (2) is conditioned on the best-known alternative (with respect to the optimization function) discovered in the previous iterations.
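The expected improvement over an RF surrogate can be sketched as follows (our illustration; best is the best objective value observed so far, and the mean and standard deviation are taken over the per-tree predictions of the ensemble).

    import numpy as np
    from scipy.stats import norm

    def expected_improvement(rf, candidates, best, eps=1e-9):
        # EI acquisition: mean/std are computed over the predictions of the
        # individual trees of the Random Forest surrogate
        per_tree = np.array([t.predict(candidates) for t in rf.estimators_])
        mu, sigma = per_tree.mean(axis=0), per_tree.std(axis=0) + eps
        z = (mu - best) / sigma
        return (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # top-ranked candidates to evaluate with the real decision model:
    # idx = np.argsort(-expected_improvement(rf, pool, best))[:10]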
Algorithm 1 presents the implementation of the proposed method. The function check_promising_values runs the SM on a set of promising alternatives. This set contains all alternatives that have been generated as neighbors of some best alternative but have not been evaluated with the DM because the acquisition function selected other alternatives. This enables one final check of the most promising solutions, which may have been missed because of an earlier bad prediction of the SM.

Algorithm 1:
Input: Decision model DM, current alternative CA
Output: best_alternatives

# parameters and initialization
max_e = 150                # maximum number of epochs
n_candidates = 10          # candidates per iteration
objective_jitter = 0.8     # if an alternative is close to the current best
                           # (e.g., 75% as good as the current best), the
                           # alternative's neighbors should be checked
random_sample_size = 10000
best_alternatives = []
surrogate_model = new Random_Forest()
promising_alternatives_pool = []
known_alternatives = []    # initialization added for completeness
counter = 0                # initialization added for completeness

# initial values
candidate_alternatives = generate_random_alternatives(10)
real_objective_values = objective_func(DM, CA, candidate_alternatives)
surrogate_model.fit(candidate_alternatives, real_objective_values)
known_alternatives.add(candidate_alternatives, real_objective_values)
best_alternative, best_score = max(candidate_alternatives, real_objective_values)
neighboring_alternatives = gen_neighborhood(best_alternative)

while counter < max_e do:
    if size(neighboring_alternatives) > 0:
        alternatives_pool = neighboring_alternatives
    else:
        alternatives_pool = gen_rand_alternatives(best_alternative, random_sample_size)
    # get top ranked (e.g., 10) candidates using the acquisition function
    candidate_alternatives, candidate_scores =
        perform_acquisition(alternatives_pool, n_candidates)
    # evaluation of candidate alternatives
    real_objective_values = objective_func(DM, CA, candidate_alternatives)
    known_alternatives.add(candidate_alternatives, real_objective_values)
    # update current best and promising alternatives
    i = 0
    while i < size(candidate_scores) do:
        if best_score * objective_jitter <= candidate_scores[i] do:
            neighboring_alternatives = gen_neighborhood(candidate_alternatives[i])
            promising_alternatives_pool.add(neighboring_alternatives)
        if best_score < candidate_scores[i] do:
            best_score = candidate_scores[i]   # update of best score added
            best_alternatives = []
            best_alternatives.add(candidate_alternatives[i])
        else if best_score == candidate_scores[i] do:
            best_alternatives.add(candidate_alternatives[i])
        i++
    # update the surrogate model
    surrogate_model.fit(candidate_alternatives, real_objective_values)
    counter++
end
# perform final check of the promising alternatives
best_alternatives = check_promising_values(promising_alternatives_pool, best_alternatives)
return best_alternatives

4 EXPERIMENTS

4.1 Experimental Setup

The method was evaluated with the 42 decision models described in Section 2. For each decision model, nine different starting alternatives (current alternatives c̄) were randomly sampled. Three of those alternatives had a final attribute value of low, three a final attribute value of medium, and three a final attribute value of high. The desirable outcome was also varied, i.e., from low to medium, from low to high, from high to medium, and from high to low. This experimental setup resulted in 756 different experimental runs. Each experiment ran for a minimum of 100 and a maximum of 150 epochs, stopping after 50 epochs without improvement. The method and the experiments were implemented in Python, and are available online1.

1 Repository link.

4.2 Experimental Results

The average experiment duration for the models with depth 3 was less than 5 min. For the models with depth 4, the duration increased by 3 min, and for the models with depth 5 it increased by an additional 3 min. This indicates that the relation between the computational time and the model depth is linear.

The final output of the algorithm is a set of thousands of different alternatives. However, from a user perspective, only one or a few alternatives should be enough. Figure 1 presents the number of epochs required to generate the first alternative for the most complex models (depth 5). The figure shows that, on average, the first alternatives are generated within the first 10 epochs. For the less complex models, the number of required epochs was less than 5.

Figure 1: Number of epochs required to generate the first alternative in the final set of alternatives.

In each epoch, the algorithm selects the top 10 alternatives with respect to the optimization score; the higher the score, the better the alternative. The selected alternatives depend on the acquisition function, which in turn depends on the predictions of the surrogate model. Figure 2 presents the average optimization score in each epoch for the most complex models (depth 5). For comparison, the average optimization score of 10 randomly sampled alternatives at each epoch is also shown (dashed line). The figure shows that the optimization score of the random samples is significantly lower than the optimization score of the samples selected using the proposed algorithm.
Finally, the presented algorithm is stochastic, so the optimality of the solution cannot be guaranteed. One metric that reflects the quality of the solutions is the number of attribute changes required to reach the final solution from the current alternative. Figure 3 presents that metric, which is the same as the distance defined in Section 3.1. The figure shows that in the majority of cases the final solution can be reached with fewer than 5 attribute changes. The exceptions are the decision models with depth 5 and a uniform weight distribution.

Figure 2: Average optimization score for the decision models with depth 5. Full line - alternatives generated by the surrogate model. Dashed line - random alternatives. The type of attribute weights is color-coded (blue - normal, orange - skewed, green - uniform).

Figure 3: Boxplots of the number of changes required to switch from the starting alternative to the best alternative.

This is because these models have a larger number of input attributes, and the uniform distribution requires many attributes to be changed in order for the change to be propagated to the aggregate attribute. In contrast, the models with normal and skewed weight distributions require a smaller number of attribute changes for the change to be propagated to the aggregate attributes.

5 DISCUSSION AND CONCLUSION

We presented a novel method for generating alternatives for multi-attribute DEX decision models based on Bayesian optimization. The main goal of the method was to generate alternatives that require the smallest change to the current alternative to obtain a desirable outcome. The method was extensively evaluated on 42 different DEX decision models of variable complexity (e.g., variable depth and variable attribute weight distributions). The method's behavior was analyzed with respect to several characteristics: computing time, time to the first appropriate alternative, number of generated (appropriate) alternatives, and number of attribute changes required to reach the generated alternatives.

The experimental results confirmed that the method is suitable for the task, i.e., it generates at least one appropriate alternative in less than a minute, even for the most complex decision models; in the majority of cases, the computing time was lower than that. The discovery of alternatives was equally distributed throughout the overall runtime. The exception is the final check performed by the algorithm (see check_promising_values in Algorithm 1), which generates the majority of the alternatives for the more complex models (depth 4 and depth 5). The quality of the alternatives was also appropriate, as in the majority of cases the generated alternatives could be reached with fewer than 5 attribute changes. Finally, the relation between the decision model's depth and the computing time was linear rather than exponential, which implies that the method is scalable.

The method's implementation considers ordinal attribute values. However, it is possible to consider other types of distance measures that would work in nominal settings (e.g., the Levenshtein distance).

Regarding future work, the proposed method is stochastic, and the optimality of the final solution cannot be guaranteed. To address this, the method needs to be validated further. Promising options include comparing the proposed method with deterministic methods and with methods that utilize the internal rules by which the decision models operate.

REFERENCES
[1] Power, D.J. Decision Support Systems: Concepts and Resources for Managers. Quorum Books, Westport, 2002.
[2] Turban, E., Aronson, J. and Liang, T.-P. Decision Support Systems and Intelligent Systems, 7th Edition. Prentice Hall, Upper Saddle River, 2005.
[3] Mallach, E.G. Decision Support and Data Warehouse Systems. Irwin, Burr Ridge, 2000.
[4] Sadok, W., Angevin, F., Bergez, J.-E., Bockstaller, C., Colomb, B., Guichard, L., Reau, R., Messéan, A. and Doré, T. MASC: a qualitative multi-attribute decision model for ex-ante assessment of the sustainability of cropping systems. Agron. Sustain. Dev. 29, 447–461, 2009.
[5] Munda, G. Multiple criteria decision analysis and sustainable development. In: Multiple Criteria Decision Analysis: State of the Art Surveys, Springer-Verlag, New York, 2005.
[6] Bohanec, M. and Rajkovič, V. DEX: An Expert System Shell for Decision Support. Sistemica 1(1), 145–157, 1990.
[7] Bohanec, M. and Rajkovič, V. Multi-attribute decision modeling: Industrial applications of DEX. Informatica 23, 487–491, 1999.
[8] Bohanec, M. DEXi: Program for Multi-Attribute Decision Making, User's Manual. Institut Jožef Stefan, Ljubljana, 2008.
[9] Bohanec, M., Zupan, B. and Rajkovič, V. Applications of qualitative multi-attribute decision models in health care. International Journal of Medical Informatics 58–59, 191–205, 2000.
[10] Bohanec, M., Cortet, J., Griffiths, et al. A qualitative multi-attribute model for assessing the impact of cropping systems on soil quality. Pedobiologia 51, 239–250, 2007.
[11] Bohanec, M., Messéan, A., Scatasta, S., et al. A qualitative multi-attribute model for economic and ecological assessment of genetically modified crops. Ecol. Model. 215, 247–261, 2008.
[12] Coquil, X., Fiorelli, J.L., Mignolet, C., et al. Évaluation multicritère de la durabilité agro-environnementale de systèmes de polyculture élevage laitiers biologiques. Innov. Agron. 4, 239–247, 2009.
[13] Bohanec, M., Cestnik, B., Rajkovič, V. Qualitative multi-attribute modeling and its application in housing. Journal of Decision Systems 10, 175–193, 2001.
[14] Debeljak, M., Trajanov, A., Kuzmanovski, V., et al. A field-scale decision support system for assessment and management of soil functions. Frontiers in Environmental Science 7, 115, 2019.
[15] Bergez, J.-E. Using a genetic algorithm to define worst-best and best-worst options of a DEXi-type model: Application to the MASC model of cropping-system sustainability. Computers and Electronics in Agriculture 90, 93–98, 2013.
[16] Kuzmanovski, V., Trajanov, A., Džeroski, S., et al. Cascading constructive heuristic for optimization problems over hierarchically decomposed qualitative decision space. Omega, submitted September 2020.
[17] Rasmussen, C.E. and Williams, C.K.I. Gaussian Processes for Machine Learning. MIT Press, 2006.
[18] Hutter, F., Hoos, H.H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration (extended version). Technical Report TR-2010-10, University of British Columbia, Computer Science, 2010.
[19] Lizotte, D. Practical Bayesian Optimization. PhD thesis, University of Alberta, Edmonton, Alberta, Canada, 2008.
Detekcija napak na industrijskih izdelkih
Defect Detection on Industrial Products

David Golob, Institut Jožef Stefan, Ljubljana, Slovenia, david.golob@ijs.si
Janko Petrovčič, Institut Jožef Stefan, Ljubljana, Slovenia, janko.petrovcic@ijs.si
Stefan Kalabakov, Institut Jožef Stefan, Ljubljana, Slovenia, stefan.kalabakov@ijs.si
Primož Kocuvan, Institut Jožef Stefan, Ljubljana, Slovenia, primoz.kocuvan@ijs.si
Jani Bizjak, Institut Jožef Stefan, Ljubljana, Slovenia, jani.bizjak@ijs.si
Gregor Dolanc, Institut Jožef Stefan, Ljubljana, Slovenia, gregor.dolanc@ijs.si
Jože Ravničan, UNIOR Kovaška industrija d.d., Zreče, Slovenia, joze.ravnican@unior.com
Matjaž Gams, Institut Jožef Stefan, Ljubljana, Slovenia, matjaz.gams@ijs.si

ABSTRACT
In this paper, different methods for defect detection on industrial forgings are presented. Part of the research was done for the project ROBKONCEL. The types of defects to be detected are mostly scratches and dents on smooth machined metal surfaces. First, a computer vision approach is used, and then a method for detecting defects from vibrations is discussed. The initial results are not encouraging, but they could possibly be improved with a larger dataset for training.

KEYWORDS
Defect detection, computer vision, vibrations, industrial products

1 INTRODUCTION
Recently, with the progress of machine learning and artificial intelligence, quality-control processes in industry have advanced as well. The aim of our research is to develop an algorithm for detecting defects on industrial products (forgings) for the company Unior d.d. The research was carried out within the ROBKONCEL project [1], which is co-financed by the Republic of Slovenia from the European Regional Development Fund. The classical approaches used for defect detection on industrial objects are based on computer vision [2], [3], [4], [5]. In our research, we use two computer vision approaches, namely object detection and image segmentation. We also attempted to detect defects from the vibrations of the products. Since the initial experiments did not give optimal results, in the future we will focus on machine learning experiments with a larger dataset and with other sensors, specifically a laser scanner, which currently appears to be the most promising option. The research is interesting mainly because it revealed certain difficulties in applying machine intelligence methods to industrial products.
2 COMPUTER VISION APPROACH
In this approach, defects on the products are detected from ordinary images. Examples of flawless products and examples with defects, typically damage on the machined surface, are given. The algorithms that detect the defects are based on a sub-field of machine learning, namely deep learning. In the last few years, deep learning has achieved remarkable results in computer vision, e.g., in object detection, image segmentation, and image classification. The drawback of deep learning is that it requires a large set of training data. In our experiments, as mentioned, we used two (sub-)approaches, namely object detection and image segmentation. Several examples of defect detection on industrial products using computer vision are described in [2], [3], [4] and [5].

2.1 Object Detection
In the object detection approach, we typically try to find a chosen object (e.g., a car, a pedestrian, a bicycle, a traffic sign, etc.). In our problem, the chosen object is a defect on an industrial forging. For this approach, we had 9 products available, from which we produced a set of 46 images. The set of images was then split into a training and a test set. The split was made so that the same product does not appear in both sets. On each image in the training set, the defect or defects had to be manually annotated with rectangles. Once the images are annotated, they can be used to train a deep neural network capable of recognizing objects (defects) in images.

The neural network first consists of several so-called convolutional layers, followed at the end by a few fully connected layers. The convolutional layers are capable of creating useful features (such as various edges and shapes in the image), which are then used in the fully connected layers (see Figure 1 for an example). In the case of object detection, the first part of the neural network discovers so-called regions of interest in the image, in the form of rectangles. Each region of interest is then the input to the second part of the neural network, whose task is to classify the given region (see Figure 2). In our case, we used a pre-built and pre-trained neural network, which we then fine-tuned to recognize our objects (defects). The network we used is called "Faster RCNN Inception" and was trained on the dataset called "COCO" [6]. This neural network is freely available and supported by the Python library TensorFlow [7].

Figure 1: Deep neural network with convolutions, source: [8]
Figure 2: Neural network for object recognition, source: [9]

Once the network is trained, we classify a given image as a "defect" if the network detects a defect with more than 40% probability (see Figure 3 for an example). Tables 1 and 2 show the results of the network on the training and the test set, respectively.

Figure 3: Defect detection using object recognition

Table 1: Training set: 27 images, 26 with a defect, 1 without. Accuracy: 81%, recall: 81%, precision: 100%.
TP = 21, FP = 0, TN = 1, FN = 5

Table 2: Test set: 19 images, 18 with a defect, 1 without. Accuracy: 10%, recall: 5%, precision: 100%.
TP = 1, FP = 0, TN = 1, FN = 17

We observe that the model achieves satisfactory precision on the training set, but it is not capable of generalization, which is evident from the poor results on the test set. For better results, we would clearly need more images and more diverse defects.
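As a quick sanity check of the reported values, the following snippet (ours, not part of the paper) recomputes accuracy, recall and precision from the confusion-matrix counts in Tables 1 and 2:

def metrics(tp, fp, tn, fn):
    # standard definitions of accuracy, recall, and precision
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if tp + fp else 1.0
    return accuracy, recall, precision

print(metrics(21, 0, 1, 5))    # training set -> (~0.81, ~0.81, 1.0)
print(metrics(1, 0, 1, 17))    # test set     -> (~0.11, ~0.06, 1.0)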
2.2 Image Segmentation
In image segmentation, each pixel is classified into a certain class (see Figure 4 for an example). In our case, we have only two classes, namely "defect" and "no defect". In this approach, too, we use (deep) neural networks for segmentation and classification.

For the neural network architecture, we used an architecture that was applied to a similar problem (see [5] for details); it is shown in Figure 5. The neural network consists of two parts, a segmentation part and a classification part. The input to the segmentation part is a grayscale image of the object, while the classification part has two inputs (tensors), both taken from the segmentation network. The first tensor is the segmentation of the (downscaled) image of the object (labeled "segmentation output" in Figure 5); it is a tensor of depth 1, where each element (which can be thought of as a pixel) represents the probability of a defect. The second tensor is the penultimate tensor of the segmentation network.

The output of the classification network is the probability that the image contains a defective product, while the output of the segmentation network is the segmentation of the downscaled image of the object. The segmentation part is trained separately from the classification part: it learns from manually annotated segmentation images, whereas the classification part learns from binary labels (1 means the object has a defect and 0 means it does not).

In this approach, we split the data into a training, a validation, and a test set (where no product can appear in two sets). We then annotate every pixel in each image in the training and validation sets as "defect" or "no defect".

The neural network outputs a segmentation of the image and a classification of the image. An example of the segmentation output is shown in Figure 6. On the validation set, we determined the number of training epochs: 2900 epochs for the segmentation network and 200 epochs for the classification network. The networks were trained with the gradient descent algorithm with a learning rate of 10^-3. The results are collected in Tables 3, 4 and 5.

Figure 4: An example of image segmentation, source: [10]
Figure 5: The architecture
Figure 6: An example of image segmentation. Left: original, middle: manual segmentation, right: model segmentation.

Table 3: Training set: 43 images, 29 with a defect, 14 without. Accuracy: 100%, recall: 100%, precision: 100%.
TP = 29, FP = 0, TN = 14, FN = 0

Table 4: Validation set: 25 images, 21 with a defect, 4 without. Accuracy: 64%, recall: 66.7%, precision: 87.5%.
TP = 14, FP = 2, TN = 2, FN = 7

Table 5: Test set: 28 images, 21 with a defect, 7 without. Accuracy: 71.4%, recall: 81%, precision: 81%.
TP = 17, FP = 4, TN = 3, FN = 4

We can see that the neural network is able to fit the training set with 100% accuracy, but, like the previous approach, it has a problem with generalization.
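A minimal tf.keras sketch of such a two-part network is given below; this is our own simplified reconstruction, with illustrative layer sizes that do not follow the architecture from [5] exactly:

import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(512, 512, 1))   # grayscale image of the object
x = layers.Conv2D(32, 5, strides=2, padding="same", activation="relu")(inp)
x = layers.Conv2D(64, 5, strides=2, padding="same", activation="relu")(x)
features = layers.Conv2D(64, 5, padding="same", activation="relu")(x)  # penultimate tensor
seg_out = layers.Conv2D(1, 1, activation="sigmoid", name="segmentation")(features)

# classification head: pools both the segmentation map and the feature tensor
pooled = layers.Concatenate()([
    layers.GlobalAveragePooling2D()(seg_out),
    layers.GlobalAveragePooling2D()(features),
])
cls_out = layers.Dense(1, activation="sigmoid", name="classification")(pooled)

seg_model = tf.keras.Model(inp, seg_out)   # trained first, on pixel masks
seg_model.compile(optimizer=tf.keras.optimizers.SGD(1e-3),
                  loss="binary_crossentropy")

# to train the two parts separately, freeze the segmentation layers before
# compiling the classification model, which is trained on binary labels
for l in seg_model.layers:
    l.trainable = False
cls_model = tf.keras.Model(inp, cls_out)
cls_model.compile(optimizer=tf.keras.optimizers.SGD(1e-3),
                  loss="binary_crossentropy")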
3 VIBRATION-BASED APPROACH
One of the "alternative" but potentially promising approaches is an analysis based on oscillatory displacement excitation. The experiment was carried out in the laboratory of department E2 at JSI. The positive of the product (the actual forging) was placed into a negative (a holder for the forgings; see Figure 7), and an oscillatory displacement of the negative (the holder) was generated with a vibration generator. We were interested in whether damage to the product on the surface in contact with the holder (the negative) would in any way affect the coupling between the product and the holder. To this end, we observed two signals, the excitation displacement signal of the holder and the measured displacement signal of the product, and monitored the relation between them. A sinusoidal excitation signal was used to drive the displacement of the negative (the holder). The displacement of the product was measured with a high-precision laser distance sensor, which continuously measures the distance to the product and then computes the velocity by numerical differentiation; this velocity is the output signal.

Figure 7: Vibration measurement

For a basic test of the feasibility of the method, we simulated a defect on one of the products by gluing a small piece of insulation tape onto the contact surface. It turned out that the tape substantially affects the product-negative coupling, which gave us hope that damage to the contact surface of the product might also affect the coupling, and thus the relation between the displacement of the negative and that of the product.

The recordings of the output signal are 10 s long. The measurements were taken under 4 different settings of the input signal:
• Setting 1: amplitude 0.389 Vpp, frequency 50 Hz
• Setting 2: amplitude 0.389 Vpp, frequency 60 Hz
• Setting 3: amplitude 0.2026 Vpp, frequency 60 Hz
• Setting 4: amplitude 0.2026 Vpp, frequency 50 Hz
The settings were chosen based on the output signal: it turns out that at higher amplitudes the output signal becomes noisy. For this approach, 24 products were available.

We tested the following approaches to defect detection from the signals:
• expert-selected features with classical machine learning methods;
• computer-generated features with a 2-layer neural network.

3.1 Expert-Selected Features and Classical Machine Learning
In this approach, the features used in the machine learning algorithms are selected based on past experience: the selected features had proven useful in another machine learning application. There are 22 selected features, covering basic characteristics of the signal from the time and frequency domains, e.g., the 3 highest peaks of the spectral density and their frequencies, the energy of the spectral density, etc.

Each recording of a forging is split into 10 chunks, each a 1 s long recording. The expert-selected features are then computed for each chunk, so for each sample we obtain 10 data points with 22 features.

The model used consists of two models, a base model and a final model. The base model computes, for each data point, the probability that the point belongs to a defective product. Since we have 10 data points per product, the base model yields 10 probabilities per product. The final model then classifies the product as a "forging with a defect" or a "forging without a defect"; its input is the 10 probabilities obtained from the base model.
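A compact scikit-learn sketch of this two-stage setup (ours; the toy data stands in for the 22 expert features per 1 s chunk, and the logistic regression/AdaBoost pair follows one of the combinations reported below):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

# toy data: 24 products x 10 one-second chunks x 22 expert features
rng = np.random.default_rng(0)
X_chunks = rng.normal(size=(24, 10, 22))
y_product = rng.integers(0, 2, size=24)

# base model: per-chunk probability of belonging to a defective product
base = LogisticRegression(max_iter=1000)
base.fit(X_chunks.reshape(-1, 22), np.repeat(y_product, 10))

# final model: classifies a product from its 10 chunk probabilities
probs = base.predict_proba(X_chunks.reshape(-1, 22))[:, 1].reshape(24, 10)
final = AdaBoostClassifier()
final.fit(probs, y_product)
print(final.predict(probs[:3]))   # predicted labels for the first 3 products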
We tried several algorithms, namely the support vector classifier, random forests, logistic regression, AdaBoost, and XGBoost, both for the base and for the final model.

In the first experiment, the data were split into a training and a test set. On the training set, we selected the optimal parameters for the base and the final model with 8-fold cross-validation, and the whole model was then tested on the test set. Setting 2 of the input signal was used, with an amplitude of 0.389 Vpp and a frequency of 60 Hz. The results are collected in Tables 6 and 7.

Base model: XGBoost. Final model: random forests.

Table 6: Training set: 19 products, 12 with a defect, 7 without. Accuracy: 100%, recall: 100%, precision: 100%.
TP = 7, FP = 0, TN = 12, FN = 0

Table 7: Test set: 5 products, 2 with a defect, 3 without. Accuracy: 100%, recall: 100%, precision: 100%.
TP = 3, FP = 0, TN = 2, FN = 0

To avoid a coincidentally good result on the test set, we also ran a second experiment, using cross-validation to determine the training and test sets. Specifically, we used 5-fold cross-validation, where the data are split into 5 parts. The procedure has 5 iterations; in each iteration, one part is selected as the test set and the other four parts as the training set. In each iteration, we select the optimal parameters on the training set with 8-fold cross-validation, train the model on the training set, and then evaluate it on the test set. Since we use 5 parts, we obtain 5 estimates of accuracy, recall, and precision, from which we then compute the average. (Setting 2 of the input signal was used, with an amplitude of 0.389 Vpp and a frequency of 60 Hz.) The best test results are shown in Table 8.

Table 8: Base model: logistic regression. Final model: AdaBoost.
Accuracy: 68%, recall: 85%, precision: 76%, F1: 73%

3.2 Computer-Generated Features and a 2-Layer Neural Network
For automatic feature generation, we used a library intended for this purpose. With the FDR (False Discovery Rate) parameter set to its default value of 0.05, the statistical test yielded no feature relevant for classification. Since the library uses statistical analysis to assess feature relevance, features rejected by it are not necessarily unimportant for machine learning, so we raised the FDR threshold first to 0.5 and then to 0.99. At the value of 0.5, we obtained only one feature: the 50th Fourier coefficient, or, for settings 2 and 3, the 60th Fourier coefficient. The latter is, of course, the fundamental harmonic of the excitation signal. At some settings and at the higher FDR value, we obtained some Fourier coefficients in the vicinity of the 50th and 60th coefficients, which makes sense, because the response of the forging differs depending on the damage. We therefore decided to compute the Fourier coefficients in the vicinity of the 50th and 60th coefficients and use them for classification. We heuristically chose to compute the first 256 coefficients, which covers all the coefficients in the vicinity of the 50th and 60th. Computing too many coefficients means we could exhaust the resources available to the neural network; moreover, the official sources [11] use 28 × 28 points, i.e., input neurons, in a comparable example.

The neural network consists of an input layer with 256 neurons, followed by two hidden layers, the first with 16 neurons and the second with 8, and a final output layer of 2 neurons, which represent a damaged or an undamaged forging. These settings were obtained by repeatedly testing the model (hyperparameter optimization). Unlike in the previous approach, we used the whole 10-second recording to compute the coefficients.

As in the previous case, we first performed hyperparameter optimization on the training set, and the parameters that achieved the highest accuracy were then used to train the neural network model. All 24 examples were split into a training set (19 examples) and a test set (5 examples), and we used 5-fold cross-validation as in the previous experiment. Since we obtain 5 values of each metric, we compute their averages at the end. The results are collected in Table 9.

Table 9: Accuracy, recall and precision (without the F1 metric).
Accuracy: 48%, recall: 42%, precision: 91%
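A sketch of the Fourier-coefficient features and the described network (ours; the use of FFT magnitudes and the 1 kHz sampling rate are assumptions, as the paper does not specify them):

import numpy as np
import tensorflow as tf

def fourier_features(signal, n_coeffs=256):
    # magnitudes of the first 256 Fourier coefficients of the 10 s recording
    return np.abs(np.fft.rfft(signal))[:n_coeffs]

# toy recordings: 24 products, 10 s each at an assumed 1 kHz sampling rate
rng = np.random.default_rng(0)
signals = rng.normal(size=(24, 10_000))
labels = rng.integers(0, 2, size=24)
X = np.stack([fourier_features(s) for s in signals])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(256,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),  # damaged vs. undamaged
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, labels, epochs=10, verbose=0)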
4 CONCLUSION
This paper described approaches and models for detecting defects on industrial products (forgings).

The results for defect detection using computer vision and image segmentation turned out to be unsatisfactory for practical use, where high accuracy and recall are required. The results using computer vision and object detection are unsatisfactory, probably because the defects on the metal resemble the dark spots that are abundant on the forgings. The results for defect detection using vibrations are encouraging, but still unsatisfactory.

The main reason for the poorer results is the lack of data and the acquisition of data in an uncontrolled environment. We believe that the results will improve once more data is available.

5 BIBLIOGRAPHY
[1] "ROBKONCEL," SMM, January 2019. [Online]. Available: http://www.smm.si/?post_id=4682. [Accessed 30 January 2020].
[2] M. El-Agamy, M. A. Awad and H. A. Sonbol, "Automated inspection of surface defects using machine vision," in 17th Int. AMME Conference, Cairo, 2016.
[3] C. Ming, B.-C. Chen, L. G. Jacque and C. Ming-Fu, "Development of an optical inspection platform for surface defect detection in touch panel glass," International Journal of Optomechatronics, vol. 10, no. 2, pp. 63–72, 2016.
[4] X. Sun, J. Gu, S. Tang and J. Li, "Research Progress of Visual Inspection Technology of Steel Products—A Review," Applied Sciences, vol. 8, no. 11, 2018.
[5] D. Tabernik, S. Šela, J. Skvarč and D. Skočaj, "Segmentation-based deep-learning approach for surface-defect detection," Journal of Intelligent Manufacturing, 2019.
[6] [Online]. Available: http://cocodataset.org/#home.
[7] TensorFlow, "TensorFlow home page," [Online]. Available: https://www.tensorflow.org/. [Accessed 30 January 2020].
[8] [Online]. Available: https://towardsdatascience.com/mnist-handwritten-digits-classification-using-a-convolutional-neural-network-cnn-af5fafbc35e9.
[9] U. Farooq, 15 February 2018. [Online]. Available: https://medium.com/@umerfarooq_26378/from-r-cnn-to-mask-r-cnn-d6367b196cfd.
[10] J. Jordan, "Jeremy Jordan," 30 March 2018. [Online]. Available: https://www.jeremyjordan.me/evaluating-image-segmentation-models/.
[11] [Online]. Available: https://www.tensorflow.org/tutorials/keras/classification. [Accessed 2019].
[12] M. Gjoreski, S. Kalabakov, M. Luštrek and H. Gjoreski, "Cross-dataset deep transfer learning for activity recognition," in Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 2019.

Data Protection Impact Assessment - an Integral Component of a Successful Research Project From the GDPR Point of View

Gizem Gültekin Várkonyi, University of Szeged, Szeged, Hungary, gizemgv@juris.u-szeged.hu
Anton Gradišek, Jožef Stefan Institute, Ljubljana, Slovenia, anton.gradisek@ijs.si
ABSTRACT
Artificial intelligence and algorithmic decision-making systems help generate new knowledge about diseases, which then helps manage them better and assists people with clinical treatment needs. The lifeblood of such AI systems is personal data, which is both used for training and is already the output of the algorithmic assessments. This work aims to guide AI researchers in becoming familiar with the legal rules binding them while processing personal data within their AI-based projects, as indicated in the General Data Protection Regulation, with a specific focus on why and how to conduct a self-Data Protection Impact Assessment. The self-assessment guideline presented throughout the work is an output of the mutual experiences and collaboration between a lawyer and an AI researcher on the topic.

KEYWORDS
data protection, impact assessment, GDPR, artificial intelligence, medical data

1 Introduction
It is possible to look at artificial intelligence (AI) systems dealing with personal data from two different perspectives. On one hand, AI offers great benefits for users, developers, and researchers, if used correctly. For example, AI-enabled health care technologies could predict the treatment of diseases 75% better, and could reduce clinical errors by two thirds at clinics using AI compared to clinics that do not [1]. On the other hand, the improper handling of personal data can quickly lead to abuse, the sharing of sensitive information, or other problems (unwanted data disclosure, complex and costly legal procedures, high fines, etc.), so it has to be handled with the utmost care. In this paper, we focus on the legality of medical applications containing personal data that is defined as sensitive data in legal documents, such as the analysis of sensor data to help patients with chronic diseases manage their condition and improve their quality of life, or to help the elderly with independent living by providing safety features and improved communication channels.

Developing an AI-based service for a target population, for example people with diabetes, chronic heart failure, obesity, dementia, skin cancer, etc., typically starts with a research project. One of the key components of such a project is collecting substantial amounts of data in a pilot study, with participants that resemble the target audience for the final service. When planning the pilot study, researchers enter the slippery terrain of dealing with personal data, as the participants are providing their own data for the purpose of the study. As an illustration, we can imagine a project where we collect medical data of three types: general medical data provided by the medical doctor responsible for the participant, lifestyle data collected by either wearable or stationary sensors, and self-reported data obtained via questionnaires that the participants fill in.

The data provided by the participants fall under the scope of the European Union's General Data Protection Regulation (GDPR), since they refer to identified or identifiable natural persons. The GDPR entered into force on the 25th of May 2018, with one of its aims being to keep up with the technological developments challenging the efficient protection of personal data [2]. The risk-based approach embedded in the GDPR came along with several safeguards, one of which is the Data Protection Impact Assessment (DPIA). The DPIA can help AI researchers comply with the GDPR requirements at an early stage of a new project. It can help reduce the risks arising from the use of AI technologies that challenge the efficient protection of fundamental rights and principles [3]. Several policy papers generated by the EU institutions [4] [5] focusing on the regulation of AI state that legal compliance is a keyword for gaining user trust, and a DPIA is one way to reach user trust. However, there is no standard set for conducting a DPIA that could guide AI researchers. In this paper, we present some of the key points of conducting a DPIA that could be useful for AI researchers.
2 Data Protection Impact Assessment in the GDPR
The term DPIA is not specifically described in the GDPR; however, it is referred to as a process that helps manage the risks to the rights and freedoms of the data subjects (in this case, the participants of the research project) resulting from data processing. In other words, a DPIA is a process consisting of several sub-processes to describe the risks and assess the legality of the system in terms of data protection. These risks could be related to system security, system design, implementation, administration, and further development. The aim of the DPIA is to take appropriate safeguards to minimize the risks, if it is impossible to eliminate them all. A DPIA is not a simple one-time reporting activity; it is an ongoing process that should be continuously carried out during the lifetime of a project, and it should therefore always be monitored and updated [6].

It is the AI researcher's responsibility to conduct a DPIA when the data processing activity is likely to constitute a "high risk" to the rights and freedoms of natural persons (e.g., users of an AI service who both benefit from the service and contribute to it with their data). Deciding whether a certain data processing activity would result in a high risk is not an easy task, but there are several guidelines and lists of processing operations requiring a DPIA published by the National Supervisory Authorities [7]. These lists could be the first sources for data controllers to decide about the necessity of a DPIA for a certain project [8].

Failure to conduct a proper DPIA poses a risk for AI researchers: they may face several sanctions, especially financial penalties. Apart from that, conducting a proper DPIA is beneficial for data controllers beyond the legal and financial points of view. A DPIA can help data controllers avoid implementing irrelevant solutions from the beginning of the project, which amounts to assessing the technical feasibility of the system in parallel with legal compliance [8]. The DPIA can therefore help data controllers save time and money. It also protects companies from losing their reputation (or from scandals such as those that occurred with Cambridge Analytica, Equifax, Facebook, etc.). Finally, a DPIA document can prove the trustworthiness of the project team before the public, as well as the related authorities, since it is evidence of respect for the right to data protection.

An AI project aiming to collect personal data and evaluate it with an automated decision-making system with the help of profiling tools, such as surveys and hardware equipment, must be assessed from the risk point of view. Below, we present a step-by-step guideline on how to conduct a DPIA for AI-based research.
3 Conducting a Data Protection Impact Assessment
In this section, we assume a project aiming to develop medical software with the help of an algorithm that collects and processes participants' sensitive data based on profiling. Additionally, a large amount of data will be collected to feed the algorithm, meaning that the participants may lose a degree of control over their data stored and processed by the AI system. Based on these inputs, the project may reveal risks to the rights and freedoms of the data subjects involved, if these are not mitigated. Therefore, we need to conduct a DPIA and identify the risk categories with the planned mitigations.

We identified three steps for conducting a successful DPIA in the project: the Data Specific Assessment, the Data Subject Specific Assessment, and the Project Specific Assessment.

The Data Specific Assessment (DSA) is the procedure in which the data to be used in the AI project is introduced very specifically, in order to comply with the basic rules of the GDPR: mainly purpose limitation, transparency, accuracy, data minimization, and consent. It should be kept in mind that one of the requirements for ensuring valid consent is identifying the concrete list of data, together with the planned processing activities on that data within the research project. Information serving to identify the persons involved in data processing is a natural element of the DSA. For example, the AI researchers in the project should identify the data processing purposes specific to the project aims and present the list of purposes in written form to the participants. The indicated purposes should be followed by the related data to be processed, again listed in written form, followed by a clear identification of the AI researchers and the other people involved in the processing activity.

Next, the Data Subject Specific Assessment follows; this procedure focuses on explaining all the details of how the AI researchers will ensure the rights of the participants by protecting their right to informational self-determination. The key point in this assessment is to gain the trust of the participants, as required by law and ethics. One of the key aspects here is to make sure that the participants are introduced by the project team to the ways their data will be used, as well as to the possibility of requesting the removal of their data if so desired. The project team shall also ensure that the participants have a certain degree of access to the decisions made by the algorithm about them. An algorithmic decision relating to a participant's personal assessment should be explained in a way understandable to them; classification models based on decision trees are easily comprehensible to humans, whereas models based on complex multilayer neural networks are essentially black boxes, where it is not possible to determine why a particular decision was reached based on easily interpretable rules. Bearing in mind the black-box nature of algorithmic assessment, choosing a model that is, first of all, understandable and explainable to the AI researchers themselves is a suggested action in this sense. The social implications of choosing a black-box algorithm are an emerging research field. Finally, the project team should ensure that the system offers tools for the participants to keep their data accurate and to block third-party access.
The Project Specific Assessment is the last part of the DPIA, presenting and explaining the legal basis for the data processing, the external project partners involved in the data processing activities, and the security measures that will be implemented to safeguard the data processed during the project. As the project likely deals with sensitive medical data, security protocols have to be elaborated, which include a proper hierarchy regarding data access, encryption algorithms, regular security updates, and physical access to the hardware where the data is located.

The final but ongoing phase of the DPIA is the monitoring phase. Whenever a new element is embedded in the project, and this element seems to change the balance of the risks that were assessed earlier, the DPIA should be reviewed. Such an element could be the inclusion of a new data type in the algorithm or a planned commercial use of the algorithm. Bearing in mind that machine learning techniques and algorithms are referred to as entirely new technologies [3], and that the growing amount of data together with a variety of hardware raises risks to persons' right to data protection [9], we suggest that the project team review the DPIA periodically, for instance at least every year.

4 Conclusion
A Data Protection Impact Assessment is an integral part of any research project focusing on the development of an AI algorithm with personal data. Such data might be sensitive in nature, such as medical data used for developing an algorithm to detect diseases. Besides being a legal requirement provided for by the GDPR, a DPIA is a tool for AI researchers to assess the weaknesses in the system that may put at risk the protection of the fundamental rights of the persons participating in the research project, who contribute to the development of the project with their personal data. Since there are few guidelines on how to conduct a DPIA for a research project specific to this topic, this work initiates a step-by-step guideline for AI researchers.

The first step considers a Data Specific Assessment, in which the data and the purposes of the data processing are clearly identified and listed in written form to be presented to the participants. It is followed by the Data Subject Specific Assessment, which focuses on the ways the AI researchers ensure the protection of the participants' right to data protection in line with the GDPR requirements; such requirements include providing explanations of the decisions reached as a result of algorithmic assessments. The third step relates to the Project Specific Assessment, which focuses mostly on the security measures planned by the project team to mitigate the risks identified during the previous two assessments. We suggest that AI researchers review the DPIA at least once a year; a revision is also required whenever a new element resulting in new data processing is added to the system.

From the planning stage of the project to the annual revisions, the DPIA can help the project team identify potential risks and find mitigation strategies for certain weak points. Last but not least, by conducting the DPIA, the project team fulfills the legal requirements, ensures the higher trust of the people involved, and avoids unforeseeable problems that might occur later.

ACKNOWLEDGMENTS
This work was supported by the ERA PerMed project BATMAN, which was financed on the Slovenian side by the Ministry of Education, Science, and Sport (MIZŠ). An extended version of this paper was submitted to the journal Informatica.

REFERENCES
[1] "The AI effect: How artificial intelligence is making health care more human," [Online], study conducted by MIT Technology Review Insights and GE Healthcare, 2019. Available: https://www.technologyreview.com/hub/ai-effect/. Last accessed: 20 April 2020.
[2] EDPS (2012). "Opinion of the European Data Protection Supervisor on the data protection reform package" (7 March 2012).
[3] ICO (2018). Accountability and governance: Data Protection Impact Assessments (DPIAs).
[4] European Commission (2018). Communication from the Commission to the European Parliament, the European Council, the Council, the European Economic and Social Committee and the Committee of the Regions, Artificial Intelligence for Europe. COM(2018) 237 final.
[5] European Commission (2018). Communication from the Commission to the European Parliament, the European Council, the Council, the European Economic and Social Committee and the Committee of the Regions, Coordinated Plan on Artificial Intelligence. COM(2018) 795 final.
[6] Wright, David. (2012). The state of the art in privacy impact assessment. Computer Law & Security Review, 28(1), 54–61. https://doi.org/10.1016/j.clsr.2011.11.007
[7] Hungarian National Authority for Data Protection and Freedom of Information (NAIH), List of Processing Operations Subject to DPIA, Art. 35(4) GDPR. https://naih.hu/list-of-processing-operations-subject-to-dpia-35-4--gdpr.html
[8] Wright, David. (2011). Should Privacy Impact Assessments Be Mandatory? Commun. ACM, 54(8), 121–131. https://doi.org/10.1145/1978542.1978568
[9] Chandra, Sudipta, Ray, Soumya, Goswami, R.T. (2017). Big Data Security: Survey on Frameworks and Algorithms, in 2017 IEEE 7th International Advance Computing Conference (IACC), Hyderabad, pp. 48–54. doi: 10.1109/IACC.2017.0025
Deep Transfer Learning for the Detection of Imperfections on Metallic Surfaces

Stefan Kalabakov, Jožef Stefan Institute and Mednarodna podiplomska šola Jožefa Stefana, Ljubljana, Slovenia, stefan.kalabakov@ijs.si
Primož Kocuvan, Jožef Stefan Institute, Ljubljana, Slovenia, primoz.kocuvan@ijs.si
Jani Bizjak, Jožef Stefan Institute and Mednarodna podiplomska šola Jožefa Stefana, Ljubljana, Slovenia, jani.bizjak@ijs.si
Samo Gazvoda, Gorenje gospodinjski aparati, d.d., samo.gazvoda@gorenje.com
Matjaž Gams, Jožef Stefan Institute and Mednarodna podiplomska šola Jožefa Stefana, Ljubljana, Slovenia, matjaz.gams@ijs.si

ABSTRACT
In the last decade, consumers' expectations have significantly increased regarding the availability and quality of the products they buy. To this end, manufacturers have focused on streamlining their manufacturing lines by employing intelligent solutions wherever possible. Since the field of quality control remains dependent mainly on specialized workers, interest in incorporating artificial intelligence (AI) advances in this field has dramatically increased. In this paper, we present a short exploration of a computer vision system built to detect imperfections on metallic surfaces. In particular, we leverage deep transfer learning to build a model that can classify small segments of a bigger image while using a tiny dataset for training. In these initial experiments, we show that layers trained on the ImageNet dataset can be used as feature extractors when building a model for a vastly different problem.

KEYWORDS
deep transfer learning, computer vision, quality control

1 INTRODUCTION
Today, products are expected to be available fast, in vast quantities, and with exceptional quality. To this end, manufacturers have started streamlining their manufacturing lines by employing network-connected intelligent machines wherever possible [10]. This has created great interest in incorporating advances in artificial intelligence (AI) in industry. In recent years, the industrial adoption of AI has become more and more feasible [7], mainly thanks to the significant progress in hardware computational resources.

In spite of this, quality control is one manufacturing process that still remains highly dependent on expert human workers. This dependence, in some instances, makes it slower, more prone to errors, and more expensive. To mitigate this, there has been a limited adoption of computer vision systems paired with classical image processing for detecting imperfections in manufacturing processes [1]. However, these systems rely heavily on specialized lighting solutions to highlight imperfections on the surfaces of objects [6]. The systems are usually expensive and require close proximity to the object under investigation in order to provide good detection accuracy. Furthermore, methods that do not use any kind of learning require features that are hand-crafted for each application specifically, and require some degree of uniformity in the size and shape of the errors that might appear. For us, this problem with hand-crafted features exists even when using classic machine learning models, as we were not provided with details regarding the size and shape of the errors. To solve this, we opted for deep learning models, as they automatically extract features based on the training set and have proved to produce state-of-the-art results in many areas [3]. With this in mind, the aim of this paper is to investigate low-cost, state-of-the-art deep learning methods that work in suboptimal lighting and automatically extract features that are robust to the shape and size of the errors appearing on metallic surfaces. Finally, since our dataset is extremely small, we leveraged transfer learning in order to use the full potential of deep models.
2 PROBLEM DEFINITION
The ultimate goal of the ROBKONCEL project is to create a quality control process for the detection of several possible manufacturing errors on both the inside and outside of ovens. In this work, we focus on detecting scratches and dents, i.e., imperfections on the metallic surface of the oven faceplates. We perform this quality check in the final phases of the manufacturing process, as almost fully assembled ovens are transported on a conveyor belt. In order to produce a method that is least costly to implement, we chose a simple RGB camera as the sensor in this application.
The camera is positioned such that it can take a picture containing the whole metallic surface while not interfering with other quality control processes, thus improving efficiency. Finally, our method is supposed to highlight the areas where dents and scratches are found, so that an inspection of the algorithm's work can be done at any time. Figure 1 shows an example image used for the purposes of this paper.

Figure 1: An image taken by an RGB camera of the metallic surface of interest.

2.1 Data
Due to the frequency with which these imperfections occur, we did not have a large dataset to include in this study. On the contrary, the number of faceplates we could use to get the necessary number of images was only five. Of those five faceplates, one was without imperfections, and the rest contained a varying number of defects on the metallic surface. Since any deep learning approach requires a large amount of data, and since the number of faceplates is small, using images that portray the whole area of one faceplate as examples for a deep neural network (DNN) would be ineffective. To combat this problem, we took images of the different front panels (five images in total) and segmented them into hundreds of smaller examples, which we use as inputs to fine-tune several models. Additionally, by performing class-invariant transformations on these smaller images, we attempt to diversify the set of examples used to fine-tune the models. The segmentation of images into smaller examples and their augmentation are presented in subsections 3.1 and 3.2, respectively.

3 METHOD
3.1 Segmentation
In order to segment the images, we first created a hand-annotated set of binary images (masks). These masks complement the original set of five images by showing where in them a scratch or a dent is visible on the metallic surface. In more detail, the masks were produced by having humans mark the exact locations of these imperfections. In the masks, pixels that are part of some imperfection (in the RGB image) are marked in white, while all others are represented in black. An image and its corresponding mask are shown in Figure 1 and Figure 2, respectively.

Figure 2: A mask constructed for the image in Figure 1.

The next step in the segmentation process is to divide the image into chunks (windows). We do this by "sliding" a window of fixed size across the whole image. Each of these windows covers a specific area of the image and serves as a training or testing instance when fine-tuning the models. Overlap between windows is allowed; in fact, it is encouraged, since some overlap means that we can generate more examples. The size of the window is 200 by 200 pixels, and the allowed overlap between windows is 75%. However, since within this paper's scope we are only interested in the metallic parts of the faceplate, we make sure that none of the windows cover an area that includes the display. In Figure 3 we can see (in green) the windows produced by the segmentation step and how none of them overlap with the area of the display.

Figure 3: Example of image segmentation.

Finally, since the newly constructed windows will be used to train a deep learning model, we need to assign a label to each one of them. In this application, the labels are "0" and "1". If the label "0" is assigned to a window, it means that the window's area does not include any scratches or dents on the surface; the label "1" means that the area covered by the window includes a scratch or a dent. The labels are assigned by examining the mask: for each window, we take the corresponding area it covers in the mask, and if that area includes a certain number of pixels annotated as belonging to an imperfection, the window is assigned the label "1". Otherwise, it is assigned the label "0". The number of pixels used as a threshold for labeling the windows is:

    threshold = 0.1 × numPixelsInWindow
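A minimal NumPy sketch of this windowing and labeling step (ours; the 50 px stride follows from the stated 75% overlap of 200 px windows, and the check that skips windows covering the display is omitted):

import numpy as np

WIN, STEP = 200, 50   # 200x200 windows, 75% overlap -> 50 px stride

def windows_and_labels(image, mask):
    # slide a fixed-size window over the image; label a window "1" when
    # more than 10% of its mask pixels belong to an imperfection
    examples, labels = [], []
    h, w = mask.shape
    for y in range(0, h - WIN + 1, STEP):
        for x in range(0, w - WIN + 1, STEP):
            win_mask = mask[y:y + WIN, x:x + WIN]
            examples.append(image[y:y + WIN, x:x + WIN])
            labels.append(int(win_mask.sum() > 0.1 * WIN * WIN))
    return np.array(examples), np.array(labels)

# usage with random stand-ins for an image/mask pair
img = np.random.rand(600, 800, 3)
msk = (np.random.rand(600, 800) > 0.995).astype(np.uint8)
X, y = windows_and_labels(img, msk)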
3.2 Augmentation
Augmentation of images in the data space has been shown to produce great results when it comes to improving the accuracy of classifiers [5]. Since, after segmenting the image, the number of examples (windows) that do not contain an imperfection is much greater than the number of examples that do, we apply certain transformations to the windows that contain an error and save each transformed window as a new example. It is important to emphasize that none of these transformations affect the example's label, meaning that if we apply them to an example containing an error, the transformed example will also contain the same error. The transformations we use are:
• rotation
• change of contrast
• change of brightness
• flipping
After applying these transformations to a single example, 23 new samples are obtained.

3.3 Deep Transfer Learning
For the task of classifying windows based on whether they contain an imperfection or not, we tested four different model architectures. One is a simple Convolutional Neural Network (CNN), and the other three are more complicated architectures that are well established in the world of image recognition.

The simple CNN is used as a baseline for what an end-to-end model can achieve on this dataset. However, since the number of examples is still relatively low, training an end-to-end deep learning model was not expected to yield great results.

On the other hand, the VGG16 [8], InceptionV3 [9] and ResNet101V2 [2] architectures were used to leverage deep transfer learning [4]. To be more specific, all of these networks have been used in the ImageNet competition, and their internal parameters (weights) from that competition are openly available for use. By using their pretrained convolutional layers as feature extractors and training our own set of fully connected layers, we can significantly improve our performance and training time. Effectively, we transfer the knowledge stored in their parameters (weights) from the ImageNet dataset to our quality control problem. To implement this, in every architecture we disregard the fully connected layers included with these architectures and generate our own (with random weights). We then attach these fully connected layers to the output of the convolutional layers (provided as pretrained on ImageNet) and train only the fully connected layers while freezing the parameters of the convolutional layers. We generate four fully connected layers, with 512, 256, 128, and 64 neurons, respectively. The implementation and the weights of these models were acquired from the Keras package in TensorFlow.
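The setup can be sketched in tf.keras as follows (our illustration, not the authors' code; the global-average-pooling layer and the single sigmoid output neuron are assumptions, since the paper specifies only the four hidden fully connected layers):

import tensorflow as tf
from tensorflow.keras import layers

# pretrained ImageNet convolutional layers, used as a frozen feature extractor
backbone = tf.keras.applications.InceptionV3(
    weights="imagenet", include_top=False, input_shape=(200, 200, 3))
backbone.trainable = False

model = tf.keras.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(512, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # does the window contain an imperfection?
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_windows, train_labels, ...) with the 200x200 windows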
3.3 Deep Transfer Learning
For the task of classifying windows based on whether they contain an imperfection, we tested four different model architectures. One is a simple Convolutional Neural Network (CNN), and the other three are more complex architectures that are well established in the world of image recognition.

The simple CNN is used as a baseline for what an end-to-end model can achieve on this dataset. However, since the number of examples is still relatively low, training an end-to-end deep learning model was not expected to yield great results. On the other hand, the VGG16 [8], InceptionV3 [9] and ResNet101V2 [2] architectures were used to leverage deep transfer learning [4]. To be more specific, all of these networks have been used in the ImageNet competition, and their internal parameters (weights) from that competition are openly available. By using their pretrained convolutional layers as feature extractors and training our own set of fully connected layers, we can significantly improve our performance and training time. Effectively, we transfer the knowledge stored in their parameters (weights) from the ImageNet dataset to our quality control problem.

To implement this, in every architecture we disregard the fully connected layers included with the architecture and generate our own (with random weights). We then attach these fully connected layers to the output of the convolutional layers (pretrained on ImageNet) and train only the fully connected layers while freezing the convolutional layers' parameters. We generate four fully connected layers, with 512, 256, 128, and 64 neurons, respectively. The implementations and the weights of these models are acquired from the Keras package in TensorFlow.
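The construction described above can be sketched in a few lines of Keras. The frozen ImageNet backbone and the 512/256/128/64 fully connected stack come from the text; the activation functions, the pooling layer bridging the backbone to the dense head, and the training configuration are assumptions added to make the sketch runnable.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_transfer_model(backbone_cls=tf.keras.applications.VGG16):
    """Frozen ImageNet backbone + trainable fully connected head
    (512/256/128/64 neurons) for binary defect classification."""
    backbone = backbone_cls(include_top=False, weights="imagenet",
                            input_shape=(200, 200, 3))
    backbone.trainable = False                  # freeze convolutional layers

    model = models.Sequential([
        backbone,
        layers.GlobalAveragePooling2D(),        # assumed bridge to the dense head
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # window contains a defect or not
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The same function can be called with `tf.keras.applications.InceptionV3` or `tf.keras.applications.ResNet101V2` to build the other two transfer-learning models.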
4 EVALUATION
4.1 Experimental Setup
We evaluated the performance of each model using leave-one-image-out (LOIO) cross-validation. This means that models are trained using examples (windows) from all images but one, and are tested using the instances from the image excluded from the training process. The process is repeated several times, and each time a different image is used to test the models' performance. Since one of the faceplates did not have any errors on its surface, windows from that image were never used to test models; instead, they were always used for training. In summary, all the models are evaluated using 4-fold LOIO cross-validation.

4.2 Evaluation Metric
In this work, we use the F1-score with macro averaging as the metric for evaluating the models' ability to classify segmented windows. The choice of the (macro) F1-score rather than accuracy was made because of the class imbalance in our data. A significant difference between accuracy and the (macro) F1-score comes from the fact that accuracy reports a high value even when the minority class is largely misclassified. For example, a high accuracy score will be reported when a classifier predicts only positive values on a test set containing mostly positive examples, even though the classifier completely misclassifies the negative instances.

To fully understand the classification results, aside from the F1-score we also visually represent how the predictions look once all windows have been rearranged in their initial positions. This representation overlays windows in their original places but changes their pixel values to all white or black based on their predicted values. An example of this representation is shown in the middle image of the triplet in Figure 4. The top image in that same figure changes the pixel values based on the ground truth rather than the prediction. Finally, the figure's bottom image represents a color-coded version of the difference between the top two images. Windows in green have been predicted as containing a fault and do contain a fault (true positive, TP). Windows in red have been predicted as not containing a fault when in fact they do contain a fault (false negative, FN). Finally, windows in blue have been predicted as containing a fault when in fact they do not (false positive, FP).

Figure 4: Example of the custom visualisation metric. The top image has the colors of the windows selected based on the ground truth, while the middle one has them selected based on the predictions of the classifier. The bottom image represents a color-coded version of the difference between the top two images.

This view is especially useful for our evaluation since it allows us to filter out wrongly classified windows which surround green clusters. This dismissal is possible because finding the exact margins of the imperfections is not of great importance in our use case.
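A minimal sketch of the LOIO protocol with the macro F1-score, using scikit-learn for the metric. The grouping of windows by their source image follows the setup above; the model is assumed to be any classifier with scikit-learn-style fit/predict, and `train_only_id` marks the defect-free faceplate that is never used as a test fold.

```python
import numpy as np
from sklearn.metrics import f1_score

def loio_macro_f1(windows, labels, image_ids, make_model, train_only_id):
    """Leave-one-image-out CV: each image with defects is held out once;
    the defect-free image stays in the training set of every fold."""
    scores = []
    for test_id in np.unique(image_ids):
        if test_id == train_only_id:
            continue                          # never used for testing
        train = image_ids != test_id
        model = make_model()
        model.fit(windows[train], labels[train])
        preds = model.predict(windows[~train])
        scores.append(f1_score(labels[~train], preds, average="macro"))
    return np.mean(scores)
```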
5 RESULTS AND DISCUSSION
Table 1 shows the average (macro) F1-scores that each of the models achieved when performing 4-fold LOIO cross-validation.

Table 1: Average model F1-score after 4-fold LOIO cross-validation.

In all of our experiments, the Simple-CNN and VGG16 architectures produce very poor results. In our opinion, a simple stacking of convolutional layers is not enough for this particular use case, since both networks are unable to learn and instead predict every example as an example with an error. On the other hand, InceptionV3 and ResNet101V2 produce good results in comparison with the other two architectures. A head-to-head comparison of the per-image F1-scores of the two best models can be found in Table 2.

Table 2: Per-image F1-scores for the InceptionV3 and ResNet101V2 models.

Although there is only a small difference between the F1-scores of InceptionV3 and ResNet101V2 (only 2%, as seen in Table 1), there is a large difference in how they predict the same images, as we can see in Figure 5 and Figure 6.

Figure 5: Visual representation of the predictions produced by the InceptionV3 model.

Figure 6: Visual representation of the predictions produced by the ResNet101V2 model.

As is clearly visible, ResNet101V2 produces many more false positives than InceptionV3. However, if we consider each cluster of same-colored pixels as one error, ResNet101V2 produces far better results in terms of the numbers of true positives and false negatives. So, even though ResNet101V2 produces many false positives, it misses only one error across all four images, whereas InceptionV3 misses four. These results can be seen in Table 3. When counting the clusters, we do not consider red clusters surrounding green clusters as false negatives.

Table 3: Sum of the numbers of true positive and false negative clusters for each of the models across the four test images.

6 CONCLUSION AND FUTURE WORK
In this paper we presented a deep transfer learning approach to quality control in the case where imperfections on metallic surfaces should be detected. Based on the results, transfer learning seems to be a suitable tool when the target dataset is very small, even when the source and target problems are vastly different. Furthermore, more complex architectures seem to produce better results than more traditional ones. When more examples of faceplates with imperfections become available, we plan to explore the effects of fine-tuning some of the convolutional layers in these models rather than freezing all of them during training. Another possible path for future work is using GANs to generate realistic-looking samples of windows with imperfections and further augment our training set. Finally, it is important to note that exploring more appropriate lighting solutions might produce better results.

ACKNOWLEDGMENTS
Part of this research was done under and for the ROBKONCEL project. Additionally, this research was partly funded by the Slovene Human Resources Development and Scholarship Fund (Ad futura).

REFERENCES
[1] Fernando Gayubo, José Luis Gonzalez, Eusebio de la Fuente, Felix Miguel, and Jose R. Peran. 2006. On-line machine vision system for detecting split defects in sheet-metal forming processes. In 18th International Conference on Pattern Recognition (ICPR'06), Volume 1. IEEE, 723–726.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Identity mappings in deep residual networks. In European Conference on Computer Vision. Springer, 630–645.
[3] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature, 521, 7553, 436–444.
[4] Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. 2014. Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1717–1724.
[5] Luis Perez and Jason Wang. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.
[6] Franz Pernkopf and Paul O'Leary. 2002. Visual inspection of machined metallic high-precision surfaces. EURASIP Journal on Advances in Signal Processing, 2002, 7, 650750.
[7] Michael Sharp, Ronay Ak, and Thomas Hedberg Jr. 2018. A survey of the advancing use and development of machine learning in smart manufacturing. Journal of Manufacturing Systems, 48, 170–179.
[8] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[9] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
[10] Chris J. Turner, Christos Emmanouilidis, T. Tomiyama, Ashutosh Tiwari, and Rajkumar Roy. 2019. Intelligent decision support for maintenance: an overview and future trends. International Journal of Computer Integrated Manufacturing, 32, 10, 936–959.

Fall Detection and Remote Monitoring of Elderly People Using a Safety Watch

Ivana Kiprijanovska, Jani Bizjak, Matjaž Gams
Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
ivana.kiprijanovska@ijs.si, jani.bizjak@ijs.si, matjaz.gams@ijs.si

ABSTRACT
As seniors age, the risk of unforeseen accidents that affect their well-being increases. Therefore, monitoring the day-to-day routine of elderly people is an important precaution to undertake. In this paper, we present the design of a safety watch integrated in a comprehensive health monitoring system capable of observing the elderly remotely. It integrates a low-power hardware architecture and an energy-efficient software configuration, which significantly extend the battery autonomy of the device. One of the major modules running on the safety watch is the automatic detection of falls and similar dangerous situations. For that purpose, several machine learning methods were tested, among which the Random Forest method achieved the highest accuracy in the detection of falls on data recorded from 17 participants, and was implemented on the actual device.

KEYWORDS
Safety watch, remote monitoring, energy efficiency, fall detection

1 INTRODUCTION
More than 90% of the elderly desire to live in their own homes for as long as they possibly can [1]. However, as seniors age, the risk of unforeseen accidents that affect their well-being increases. For example, the lives of elderly people are very often affected by falls, which lead not only to physical injuries but also to psychological consequences that further reduce their independence and decrease their quality of life [2][3]. The lack of independence causes them to no longer feel comfortable living alone, forcing them to move into nursing homes.
This puts a burden on the health-care system, with over-crowded nursing homes and hospitals, and causes higher health-care expenditures [4]. Therefore, monitoring the day-to-day routine of the elderly who live alone is an important precaution to undertake. Due to demographic change and the aging of the population, the development of remote monitoring systems has become a center of attention for both researchers and industry. Remote health monitoring systems are essential for enhancing care in a reliable manner and allow the elderly to remain in their home environment rather than in expensive nursing homes [5]. Such systems also allow communication with remote healthcare facilities and caregivers, thus allowing healthcare personnel to keep track of the elderly's overall condition and respond, if necessary, from a distant centralized facility [6].

One of the first types of remote monitoring systems presented in the literature are camera-based systems. They are capable of recognizing complex gait activities, but restrict the movement of the user to a specific range. Apart from that, they are complex, expensive and often raise privacy concerns. A recent survey gives an insight into the studies carried out in vision-based patient monitoring [7]. In the last few years, wearable motion sensors have gained popularity for monitoring human activities in real time. They can monitor and record real-time information about one's physiological condition and motion activities. Wearable sensor-based health monitoring systems may comprise different types of sensors that can be integrated into textile fiber, clothes, and elastic bands, or can be directly attached to the human body. One such system is presented in [8]; it uses a mobile phone as an intermediary to collect vital data from various sensors and transmit the data to a server for further processing. The main limitation of this system is that the analysis is not performed where the signal is acquired, and there may be a loss of efficiency in the wireless network when the physiological signals are sent. Another wearable personal healthcare system is presented in [9]. It employs a number of wearable sensors to continuously collect users' vital signs and uses Bluetooth devices to transmit the data to a mobile phone, which can perform on-site data storage and processing. After local data processing, the mobile phone periodically reports the user's health status to a healthcare centre.
Apart from such systems, various wearable commercial products are available on the market, for example the biometric shirt by Hexoskin and fitness trackers by Fitbit and Jawbone. However, many current solutions either provide insufficient functionality at a reasonable price or are advanced but too expensive, too energy-demanding or too invasive [10].

The aim of the HomeCare2020 project was to provide a comprehensive solution for a smart healthcare monitoring system, capable of observing the elderly remotely, while eliminating the problems mentioned above. The system aims to enable the elderly to live at home independently until a later age and to make them feel safer and more confident in performing everyday tasks and activities. The developed system integrates two interconnected devices: an advanced touch-screen care-phone (HomeTab) and a multifunctional safety watch. In this paper, the design of the safety watch is presented.

2 SAFETY WATCH DESIGN
The safety watch is a custom-made wristband device meant to be worn by seniors to provide 24/7 security, inside or outside the home.

Its core part, from a hardware perspective, is an ARM-based low-power Bluetooth module by Nordic [11]. When choosing the processor and other hardware components, priority was given to energy consumption, since a device that requires everyday charging is strongly undesirable, especially for the elderly, who might have problems remembering when or how to charge the device. The safety watch integrates a low-power LSM6DSL system-in-package featuring a 3D digital accelerometer and a 3D digital gyroscope. In addition, it contains a low-power Quectel module that integrates NB-IoT and GPS functionality. Since GPS and NB-IoT consume a lot of power, these two functionalities are disabled most of the time and programmatically enabled only when needed (i.e., when an emergency call is made and the device is out of Bluetooth range of the HomeTab). The Quectel module is connected to a SIM card, which is required for NB-IoT functionality. These components are connected to a rechargeable Li-ion battery, which can be recharged using a wireless (induction) charger. The diagram of the safety watch circuit can be seen in Figure 1.

Figure 1: Diagram of the safety watch circuit

The outer side of the safety watch housing comprises a membrane keypad used for manual alarm triggering (e.g., if the individual is in a dangerous situation). The keypad also integrates a small LED used to provide feedback to the user (e.g., alarm triggered, low-battery alerts). Its appearance is shown in Figure 2.

Figure 2: Safety watch appearance

From a software perspective, the design principle behind the safety watch is to preserve the battery autonomy of the device. Therefore, the main processing unit is intended to sleep whenever possible and only wakes up when certain events happen, i.e., when there is an immediate danger to the user. The safety watch has two working modes, depending on whether the user wears the watch or not. If the watch is not worn, all working modes are disabled, since there is no need for motion monitoring, and only the device status (worn or not worn) is checked in 1-minute intervals. If the watch is worn, it monitors motion, accumulates the number of steps, and sends data over Bluetooth to the HomeTab. Once the battery of the device drops to 30% or lower, the sleeping time of the main processor increases from 5 to 10 minutes and the user is notified about the low battery level. The software design of the safety watch is illustrated in Figure 3.

Figure 3: System software design
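The wake-up policy described above can be summarised in a few lines of pseudocode-style Python. This is purely an illustrative sketch of the logic as the text describes it: the `watch` object and its method names are our own, and the real firmware runs on the embedded platform rather than in Python.

```python
WORN_CHECK_S = 60           # device status checked in 1-minute intervals
SLEEP_NORMAL_S = 5 * 60     # main-processor sleep with a healthy battery
SLEEP_LOW_BATT_S = 10 * 60  # extended sleep once battery drops to 30% or lower

def duty_cycle(watch):
    """One iteration of the watch's power-management loop (illustrative)."""
    if not watch.is_worn():
        watch.sleep(WORN_CHECK_S)       # only re-check the worn status
        return
    watch.monitor_motion()              # accumulate steps, watch for danger
    watch.send_over_bluetooth()         # report to the HomeTab
    if watch.battery_percent() <= 30:
        watch.notify_low_battery()
        watch.sleep(SLEEP_LOW_BATT_S)
    else:
        watch.sleep(SLEEP_NORMAL_S)
```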
The safety watch monitors the user's behaviour (activity levels), providing incentives to the user (through the HomeTab) to move more, and at the same time making it possible to detect unusually low activity (e.g., due to sickness). The integrated LSM6DSL step-count functionality enables the number of steps to be counted throughout the day and sent at regular 15-minute intervals via Bluetooth to the HomeTab. This gives information about the user's activity levels, which the system later analyses to detect possible irregularities in the user's behaviour (which can be caused by an undetected disease). For example, if a user is feeling ill (has the flu), he will likely stay in bed significantly longer than when healthy, so the lack of movement can be detected and caregivers notified.

2.1 Fall Detection
Automatic fall detection is one of the most important modules running on the safety watch. A machine learning method that can automatically detect falls and similar dangerous situations was developed and implemented in the final software of the safety watch.

For training the machine learning models, we used a publicly available dataset that contains acceleration data from a wrist-worn device, recorded from 17 subjects [12]. It comprises 11 daily-life activities, including 5 types of falling, namely: walking, standing, sitting, picking up an object, laying, jumping, falling backwards, falling sideward, falling forwards using knees, falling forwards using hands, and falling sitting in an empty chair. Since our aim was only to detect falls in general, we grouped all fall-related activities into one class and all other activities into another class. The non-fall activities were additionally undersampled in order to adjust the class distribution of the dataset. The data were further segmented using a sliding-window technique, with a window size of 2 seconds and 50% overlap between consecutive windows. To train the machine learning models, several statistical features were extracted from the acceleration signals, including the mean, standard deviation, median, maximum, minimum, mean absolute change, variance, kurtosis, skewness, and similar. The window size and the optimal feature set were chosen based on our previous work [13].
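A minimal sketch of this segmentation and feature-extraction step for one acceleration axis is shown below. The 2-second windows, 50% overlap and the feature list come from the text; the function name, the per-axis treatment and the sampling-rate parameter are our own assumptions.

```python
import numpy as np
from scipy.stats import kurtosis, skew

def extract_features(signal, fs, win_s=2.0, overlap=0.5):
    """Cut a 1-D acceleration signal into 2-second windows with 50% overlap
    and compute the statistical features named in the text."""
    win = int(win_s * fs)
    step = int(win * (1 - overlap))
    feats = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        feats.append([
            w.mean(), w.std(), np.median(w), w.max(), w.min(),
            np.abs(np.diff(w)).mean(),      # mean absolute change
            w.var(), kurtosis(w), skew(w),
        ])
    return np.array(feats)
```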
Various machine learning algorithms were tested: Decision Tree (DT), Random Forest (RF), and k-nearest neighbors (kNN). The algorithms' performance was evaluated using the leave-one-subject-out (LOSO) cross-validation technique. With this technique, the data is divided into N folds (where N is the number of subjects in the dataset), each fold comprising the data of a single subject. In each iteration of the LOSO cross-validation, data from one subject is used for testing the method, and the training data comprises the remaining N−1 subjects.

Among the tested algorithms, RF proved to have the best accuracy per watt of power consumed for processing the data. RF is an ensemble classifier that fits a number of decision trees on various sub-samples of the dataset and outputs the majority class label of the constructed trees. It utilizes two random steps in the process of creating the trees — random sampling of the training data points and random choice of the splitting feature — which make it robust to noise and outliers [14]. The results achieved on the laboratory data with the best-performing RF model, the DT model and the kNN model can be seen in Table 1, Table 2, and Table 3, respectively.

Table 1: Summed and normalized (per row) confusion matrix. LOSO evaluation with the Random Forest model.

              Non-fall   Fall
  Non-fall       97        3
  Fall            2       98

Table 2: Summed and normalized (per row) confusion matrix. LOSO evaluation with the Decision Tree model.

              Non-fall   Fall
  Non-fall       91        9
  Fall            8       92

Table 3: Summed and normalized (per row) confusion matrix. LOSO evaluation with the kNN model.

              Non-fall   Fall
  Non-fall       87       13
  Fall           17       83

Since the aim of the system is to offer a high degree of accuracy in detecting actual falls, as well as in filtering false alarms, two metrics were analyzed: (i) sensitivity — the capacity to detect actual falls, defined as the ratio between the number of falls correctly detected (true positives) and the number of falls that actually happened; and (ii) specificity — the capacity to filter false alarms, defined as the ratio between properly discarded activities (true negatives) and the total number of actual non-fall activities. From the confusion matrix presented in Table 1, it can be seen that the model has a very high sensitivity score of 98% and a specificity score of 97%. Both are very important for a real-life implementation of the model: together they mean that the model accurately detects falls without triggering too many false alarms, which would be detrimental to users.
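For reference, the two metrics follow directly from a confusion matrix; the sketch below reproduces the Table 1 figures (rows are the true classes, as in the tables above).

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = detected falls / actual falls;
    specificity = correctly discarded non-falls / actual non-falls."""
    return tp / (tp + fn), tn / (tn + fp)

# Row-normalized values from Table 1 (Random Forest):
sens, spec = sensitivity_specificity(tp=98, fn=2, tn=97, fp=3)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.98, 0.97
```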
The implementation of the fall detection functionality on the hardware was also carefully managed to extend the battery life. The most significant battery saving comes from processing the acceleration data in batches. The accelerometer stores the acceleration values in its internal memory while the main processor sleeps. The accelerometer's buffer fills in 10 seconds; when it is full, it wakes the main processor and the collected data is sent to it for further processing. The main processor stores about 120 seconds of acceleration data before running the fall detection algorithm. Once the 120 seconds of data are stored, the required features are calculated from the acceleration signals and the pre-trained RF model (stored in the RAM of the safety watch) is run. If no fall is detected in the two-minute segment, the main processor goes back to sleep; otherwise, an alarm procedure is triggered. The alarm is sent via Bluetooth to the HomeTab device, which forwards it to the server for further processing. If the safety watch is out of range of the HomeTab, it uses the NB-IoT network for alarm transmission. In this case, it also tries to obtain the user's location using a GPS signal.

3 CONCLUSION
This paper presented the design of a safety watch integrated into the HomeCare2020 comprehensive solution for a smart healthcare monitoring system, primarily targeted at elderly people. The main purpose of the safety watch is to help the elderly live at home independently until a later age and to make them feel safer and more confident in performing everyday tasks and activities. One of the most important modules running on the safety watch is fall detection, which enables users to call for emergency treatment in the case of a dangerous situation. For this purpose, different machine learning models were tested and compared. Among them, the RF classification model proved to have the highest performance per watt of power consumed for processing the data, which makes it the most suitable choice for implementation.

Overall, the software design of the system is highly energy-efficient and significantly extends the service time of the wearable device, which makes it convenient for use by elderly people. The system is easy to operate and therefore shows great promise for providing long-term and continuous monitoring of the elderly in an unobtrusive way. We believe that it can efficiently contribute to improving remote healthcare services.

ACKNOWLEDGMENTS
The authors would like to thank everyone who helped in any way with producing this paper. The first author acknowledges financial support from the Slovene Human Resources Development and Scholarship Fund — Ad Futura. Part of this research was done under the EIT Health HomeCare2020 project.

REFERENCES
[1] Roy, N.; Dubé, R.; Després, C.; Freitas, A.; Légaré, F. Choosing between staying at home or moving: A systematic review of factors influencing housing decisions among frail older adults. PLoS One 2018.
[2] Institute of Medicine. Falls in Older Persons: Risk Factors and Prevention. In The Second Fifty Years: Promoting Health and Preventing Disability; 1992. ISBN 978-0-309-04681-7.
[3] Boyé, N.D.A.; Van Lieshout, E.M.M.; Van Beeck, E.F.; Hartholt, K.A.; Van Der Cammen, T.J.M.; Patka, P. The impact of falls in the elderly. Trauma 2013.
[4] Stevens, J.A.; Corso, P.S.; Finkelstein, E.A.; Miller, T.R. The costs of fatal and non-fatal falls among older adults. Inj. Prev. 2006.
[5] Klaassen, B.; van Beijnum, B.J.F.; Hermens, H.J. Usability in telemedicine systems—A literature survey. Int. J. Med. Inform. 2016.
[6] Majumder, S.; Aghayi, E.; Noferesti, M.; Memarzadeh-Tehran, H.; Mondal, T.; Pang, Z.; Deen, M.J. Smart homes for elderly healthcare—Recent advances and research challenges. Sensors (Switzerland) 2017.
[7] Sathyanarayana, S.; Satzoda, R.K.; Sathyanarayana, S.; Thambipillai, S. Vision-based patient monitoring: a comprehensive review of algorithms and technologies. J. Ambient Intell. Humaniz. Comput. 2018.
[8] Benlamri, R.; Docksteader, L. MORF: A mobile health-monitoring platform. IT Prof. 2010.
[9] Wu, W.; Cao, J.; Zheng, Y.; Zheng, Y.P. WAITER: A wearable personal healthcare and emergency aid system. In Proceedings of the 6th Annual IEEE International Conference on Pervasive Computing and Communications, PerCom 2008; 2008.
[10] Peetoom, K.K.B.; Lexis, M.A.S.; Joore, M.; Dirksen, C.D.; De Witte, L.P. Literature review on monitoring technologies and their outcomes in independently living elderly people. Disabil. Rehabil. Assist. Technol. 2015.
[11] nRF5 SDK. Available online: https://www.nordicsemi.com/Software-and-tools/Software/nRF5-SDK (accessed on Aug 26, 2020).
[12] Martínez-Villaseñor, L.; Ponce, H.; Brieva, J.; Moya-Albor, E.; Núñez-Martínez, J.; Peñafort-Asturiano, C. UP-Fall detection dataset: A multimodal approach. Sensors (Switzerland) 2019.
[13] Gjoreski, H.; Stankoski, S.; Kiprijanovska, I.; Nikolovska, A.; Mladenovska, N.; Trajanoska, M.; Velichkovska, B.; Gjoreski, M.; Luštrek, M.; Gams, M. Wearable Sensors Data-Fusion and Machine-Learning Method for Fall Detection and Activity Recognition. In Studies in Systems, Decision and Control; 2020.
[14] Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
Machine Vision System for Quality Control in Manufacturing Lines

Ivana Kiprijanovska, Jani Bizjak, Matjaž Gams
Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
ivana.kiprijanovska@ijs.si, jani.bizjak@ijs.si, matjaz.gams@ijs.si

Samo Gazvoda
Cooking Appliances Division, Gorenje Group
samo.gazvoda@gorenje.com

ABSTRACT
In manufacturing, quality control is a process that oversees the aspects of production and ensures that only products that conform to industry standards and quality criteria leave the production line. Automation of the quality control process significantly reduces the time spent on product testing, hence reducing overall manufacturing costs. In this paper, we present a brief overview of the algorithms adopted for the detection of one possible fault in the production of ovens — a non-working oven fan. The detection is performed on visual data. In the initial experiments, several image processing algorithms were used, and the preliminary results are encouraging.

KEYWORDS
machine vision, image processing, fault detection

1 INTRODUCTION
Quality control is becoming an increasingly important aspect of today's manufacturing processes [1]. For efficient and successful production, manufacturers rely on quality control systems integrated into the manufacturing process. The traditional quality control process requires vast capacities of specialized labor. High utilization of the specialists may lead to human errors, low reliability of the process, and a negative impact on the quality of production. Compared to manual quality control, automated quality control systems offer a reliable control process with various other advantages, including the ability to work 24 hours a day and, in some tasks, to perform faster measurements with higher accuracy and consistency than humans [2]. Such systems are also a practical choice when the test cases need to run regularly over a significant amount of time. Machine vision systems play a growing role in modern manufacturing quality control. These systems rely on digital sensors inside industrial cameras with specialized optics to acquire images [3]. After an image is acquired, computer hardware and software process, analyze, and measure various characteristics of the image for automated decision-making.

The development of an integrated system for comprehensive quality control in production with an intelligent process control system is the main aim of the ROBKONCEL project [4]. One of the objectives of this project is the detection of faults in the production of ovens. In this paper, we present the initial experiments in the detection of one of the possible faults — a non-working oven fan.

2 PROBLEM DEFINITION
The quality control of the ovens is intended to take place in a factory environment, where products moving on a conveyor belt are visually observed, i.e., a machine vision system acquires videos of the ovens. These videos are segmented into image frames (at a 30 fps rate), and the obtained frames are further processed to detect whether the fan is working or not. For the initial experiments, we collected a few videos in a laboratory setting, with various lightings and camera positions, resulting in approximately 7200 images (~4000 with a working fan and ~3200 with a non-working fan). Additionally, the visual data of the ovens' fans were acquired through a closed door, which makes the fault detection more challenging (Figure 1). This is preferred, as opening and closing the door in a manufacturing environment would be too slow.

Figure 1: Image of an oven's fan acquired through a closed oven door.
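A minimal sketch of the video-to-frames step with OpenCV. The file name is a placeholder; the paper states only that the recorded videos are split into frames at 30 fps.

```python
import cv2

def video_to_frames(path):
    """Read a recorded video and return its frames for further processing."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:              # end of the video stream
            break
        frames.append(frame)
    cap.release()
    return frames

frames = video_to_frames("oven_fan.avi")  # placeholder file name
```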
3 IMPLEMENTED TECHNIQUES
The image processing steps for the oven fault detection, i.e., the detection of a non-working fan, are as follows:

1. Object detection
2. Glare reduction
3. Image thresholding

Each of these steps and the image processing algorithms implemented in them are explained in the following sections.

3.1 Object Detection
In order to detect and isolate the circular area of the oven fan, we made use of the Hough Gradient Method [5], an extension of the standard Hough Transform technique [6] for isolating features of a particular shape within an image. The Hough Gradient Method is based on the gradient information of edges and is used to improve the speed of circle detection in order to meet real-time implementation requirements. Its calculation steps are as follows: (i) detect edges in the image; (ii) calculate the local gradient at the edge points using a Sobel operator; (iii) use an accumulator to count the possible circle centers along the normal direction of the edge points' tangents; (iv) choose the peak circle center and circle radius for the general circle equation.

The implementation of the Hough Gradient Method in OpenCV requires a single-channel image, so the first step in the detection of circles was to convert the acquired images from the RGB color space to grayscale. Furthermore, two parameters of the circle detection function were tuned, namely the minimum distance between the center coordinates of the detected circles and the ratio of the resolution of the original image to the accumulator resolution [5]. Before running the circle detection function, a simple median filter [7] was applied to the images for noise reduction. This helped reduce the effects of various reflections in the glass part of the oven door. In general, without blurring, the algorithm tended to extract too many circular features, resulting in the detection of false circles; this preprocessing step was therefore crucial for successful circle detection. The circle detection algorithm resulted in a single circle detected in every image, however with a varying radius. Since the further analyses require images of the same dimensions, the mean radius of the detected circles was calculated and used to isolate the fan area in the images (Figure 2).

Figure 2: Detected oven's fan area.
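A minimal OpenCV sketch of this step is shown below. The two tuned parameters correspond to OpenCV's `dp` and `minDist` arguments of `cv2.HoughCircles`; the concrete values used here and the median-kernel size are illustrative assumptions, as the paper does not report them.

```python
import cv2
import numpy as np

def detect_fan_circle(frame_bgr):
    """Median-blur the grayscale frame and find the fan circle with the
    Hough Gradient Method (assumes one circle is found per frame)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)   # noise/reflection suppression (assumed kernel)
    circles = cv2.HoughCircles(
        gray, cv2.HOUGH_GRADIENT,
        2,                # dp: image-to-accumulator resolution ratio (tuned)
        gray.shape[0],    # minDist: at most one circle per frame (tuned)
    )
    x, y, r = np.round(circles[0, 0]).astype(int)  # the single detected circle
    return x, y, r
```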
3.2 Glare Reduction
A common problem in image processing is the occurrence of specular reflections in the images. In our case, since the videos of the fan were recorded through glass, a significant amount of specular reflection, or glare, was produced during recording. To reduce its effects, a glare reduction algorithm was applied. The basic glare reduction procedure consists of three steps: (i) decomposition of the original image into color, saturation and brightness components (HSV); (ii) finding particularly bright areas in the image; and (iii) inpainting these areas with the values of the surrounding pixels.

Each image was first converted into the HSV color space, which describes the image by its hue (H), saturation (S) and brightness (V) components (Figure 3).

Figure 3: Image decomposition into hue, saturation and brightness components.

With such a decomposition, a general rule for pixels that are subject to specular reflections can be derived: an image can only contain glare where its color is not saturated and its brightness is high. Since light reflections are white, any pixel containing glare cannot be saturated (white has no color saturation). Accordingly, we first filtered out the areas that have low saturation. Next, the area of the non-saturated pixels was reduced by an erosion operation, and the brightness values of the saturated pixels were set to 0. By filtering out the very bright pixels (e.g., all pixels with a value larger than 130), we obtained the final glare mask (Figure 4).

Figure 4: Original image and the obtained glare mask.

The glared pixels were then interpolated with an inpainting operation. This operation fills the masked pixels with values that stem from the adjacent non-masked pixels. The original image and its corrected version after glare reduction can be seen in Figure 5. There is a significant amount of glare in the original image, which was effectively removed in the corrected image. The corrected image is a good approximation of the original image when no glare is present.

Figure 5: Original image and its corrected version.
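The three-step procedure could look roughly as follows in OpenCV. The saturation cut-off and the erosion kernel are illustrative assumptions; the brightness cut-off of 130 and the overall structure (HSV split → glare mask → inpainting) follow the text, which this sketch simplifies slightly.

```python
import cv2
import numpy as np

def reduce_glare(frame_bgr, sat_max=40, bright_min=130):
    """Build a glare mask from low-saturation, high-brightness pixels
    and inpaint the masked areas from their surroundings."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    low_sat = (s < sat_max).astype(np.uint8)            # candidate glare regions
    low_sat = cv2.erode(low_sat, np.ones((3, 3), np.uint8))
    mask = ((low_sat > 0) & (v > bright_min)).astype(np.uint8) * 255
    return cv2.inpaint(frame_bgr, mask, 3, cv2.INPAINT_TELEA)
```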
3.3 Thresholding
Thresholding is one of the simplest methods for image segmentation and the creation of binary images [6]. If the two images of a working and a non-working fan in Figure 6 are analyzed, it can be seen that the lighting allows the oven fan parts to stand out and be clearly seen behind the grid when the fan is not working. On the other hand, when the fan is working, the fan area behind the grid is blurred. Therefore, a simple thresholding method was utilized to distinguish a working from a non-working fan.

Figure 6: Working and non-working oven fan.

The main goal of the binary thresholding was to enhance the parts of the oven fan that are visible when it is not working. For that purpose, the images were first converted from the RGB color space to grayscale. Next, with the binary thresholding method, each pixel in the image was replaced with a black pixel if its intensity was less than a chosen constant (T = 90), or with a white pixel if its intensity was greater than that constant. This makes the illuminated parts of the oven fan completely white (when the fan is not working), while the grid and the moving fan become completely black, as can be seen in the examples in Figure 7.

Figure 7: Non-working and working oven fan — thresholded images.

As a final step, the number of white pixels was calculated in the binary-thresholded images that depict non-working fans. The 5th percentile of these values was then set as the threshold for deciding whether a given image represents a working or a non-working fan: if the image contains more than X white pixels, where X is the previously calculated 5th percentile, it is classified as a non-working oven fan; otherwise, it is classified as a working oven fan.

In a last post-processing step, the class of each image frame was taken as the majority class of the last 20 frames. This helped eliminate quick one-frame changes from working to non-working, or vice versa.
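A sketch of the per-frame decision and the 20-frame majority vote is shown below. T = 90 and the 5th-percentile rule come from the text; the helper names are our own, and the calibration frames are assumed to be frames known to show a non-working fan.

```python
import cv2
import numpy as np
from collections import deque

T = 90  # binary threshold on grayscale intensity (from the paper)

def white_pixel_count(frame_bgr):
    """Count pixels brighter than T in the binarized frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, T, 255, cv2.THRESH_BINARY)
    return int(np.count_nonzero(binary))

def calibrate_threshold(nonworking_frames):
    """X = 5th percentile of white-pixel counts over known non-working frames."""
    return np.percentile([white_pixel_count(f) for f in nonworking_frames], 5)

def classify_video(frames, x_threshold):
    """Per-frame decision smoothed by a majority vote over the last 20 frames."""
    history, labels = deque(maxlen=20), []
    for frame in frames:
        history.append(white_pixel_count(frame) > x_threshold)  # True = non-working
        labels.append(sum(history) > len(history) / 2)
    return labels
```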
Eventually, the implemented image processing method correctly classified 95% of the images across four different videos. The confusion matrix of the method is presented in Table 1.

Table 1: Confusion matrix for the proposed method.

                 Non-working   Working
  Non-working        3117          82
  Working             280        3720

As the main purpose of the system is to offer high accuracy in the detection of oven faults while filtering false alarms, we additionally analysed two metrics: (i) sensitivity, i.e., the method's capacity to detect actual faults (non-working fans), defined as the ratio between the number of non-working-fan images correctly identified (true positives) and the total number of non-working-fan images; and (ii) specificity, i.e., the method's capacity to filter false alarms, defined as the ratio between properly discarded images (true negatives) and the total number of working-fan images. The method has a very high sensitivity score of 97% and a specificity score of 93%.

4 CONCLUSION
In this paper, we presented an image processing pipeline adopted for the detection of a possible fault in the production of ovens — a non-working oven fan. The image processing steps comprise object detection (to isolate the oven fan area in the images), glare reduction (to reduce the effects of specular reflections), and image thresholding (for the final decision-making). The preliminary results show that a quality control system exploiting image processing algorithms could be used in an automated manufacturing environment. In the future, we plan to employ reflection removal algorithms, which can significantly facilitate the object detection process, such as Sparse Blind Separation with Motions (SPBS-M) [8], Superimposed Image Decomposition (SID) [9], Ghosting Cues [10] and similar. However, the utilization of such algorithms may significantly impact the time performance of the method, so an acceptable trade-off between the method's accuracy and its time performance should be explored in future analyses.

ACKNOWLEDGMENTS
The first author acknowledges financial support from the Slovene Human Resources Development and Scholarship Fund — Ad Futura. Part of this research was done under and for the ROBKONCEL project.

REFERENCES
[1] Mohamad, H.; Jenal, R.; Genas, D. Quality Control Implementation in Manufacturing Companies: Motivating Factors and Challenges. In Applications and Experiences of Quality Control; 2011.
[2] Heleno, P.; Davies, R.; Brazio Correia, B.A.; Dinis, J. A machine vision quality control system for industrial acrylic fibre production. EURASIP J. Appl. Signal Processing 2002.
[3] Golnabi, H.; Asadpour, A. Design and application of industrial machine vision systems. Robot. Comput. Integr. Manuf. 2007.
[4] ROBKONCEL. Available online: http://www.smm.si/?post_id=4682&lang=en (accessed on Aug 28, 2020).
[5] Yuen, H.K.; Princen, J.; Illingworth, J.; Kittler, J. A Comparative Study of Hough Transform Methods for Circle Finding; 2013.
[6] Shapiro, L.; Stockman, G. Computer Vision, 1st Edition; 2001. ISBN 9780130307965.
[7] Huang, T.S.; Yang, G.J.; Tang, G.Y. A Fast Two-Dimensional Median Filtering Algorithm. IEEE Trans. Acoust. 1979.
[8] Gai, K.; Shi, Z.; Zhang, C. Blind separation of superimposed moving images using image statistics. IEEE Trans. Pattern Anal. Mach. Intell. 2012.
[9] Guo, X.; Cao, X.; Ma, Y. Robust separation of reflection from multiple images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2014.
[10] Shih, Y.; Krishnan, D.; Durand, F.; Freeman, W.T. Reflection removal using ghosting cues. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; 2015.

Abnormal Gait Detection Using Wrist-Worn Inertial Sensors

Ivana Kiprijanovska, Matjaž Gams
Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
ivana.kiprijanovska@ijs.si, matjaz.gams@ijs.si

Hristijan Gjoreski
Faculty of Electrical Engineering and Information Technologies, Skopje, N. Macedonia
hristijang@feit.ukim.edu.mk

ABSTRACT
Falls are a major health problem among elderly people and often lead to serious physical and psychological consequences. Identification of elderly people who are at risk of falling helps in the selection of effective preventative measures that minimize the likelihood of falls. The occurrence of gait abnormalities is one of the most significant fall precursors. Wearable sensors enable continuous monitoring of gait during daily routines and therefore offer the possibility of early detection of gait changes. In this paper, we analyze the ability of machine learning models to detect gait abnormalities using data from inertial sensors integrated into a smartwatch, and how they perform on the dominant and non-dominant wrist.

KEYWORDS
Gait analysis, abnormal gait, fall risk assessment, smartwatch, wearable sensors

1 INTRODUCTION
Falls present a major health problem among elderly people.
devices, wristbands and smartwatches are increasingly popular One-third of the population aged over 65 years experience at because people find the wrist placement one of the least least one fall per year [1]. Falls greatly affect the quality of life intrusive placements to wear a device. and restrict the independence of those affected. They not only In this paper, we analyzed the ability of inertial sensors lead to severe physical consequences but also result in high integrated into smartwatches to detect human gait abnormalities health care costs. Due to the rapid aging of the population, this that are related to fall risk. Moreover, we studied how the problem will further increase in the near future [2]. Therefore, performance of machine learning models on the non-dominant there is an urgent need for reliable screening tools to identify wrist compares to the performance on the dominant wrist. those at risk and to target effective fall prevention strategies. Falls are a consequence of several intrinsic and extrinsic fall risk factors, among which balance and gait disorders are the 2 RELATED WORK most common ones [3]. Gait is a sensitive indicator of an The recent advancements in sensor technology have led to individual's overall health status, so the occurrence of abnormal applications of wearable sensor devices in gait analysis for fall gait patterns usually represents an early indication of an risk assessment. Several studies have been carried out by underlying neurodegenerative disorder. Clinical research has combining wearable devices with inertial sensors and machine learning methods. The general pipeline in these studies consists Permission to make digital or hard copies of part or all of this work for personal or of signal acquisition while the person performs everyday-life classroom use is granted without fee provided that copies are not made or activities or pre-defined functional tests, signal processing and distributed for profit or commercial advantage and that copies bear this notice and feature engineering, and lastly training a machine learning the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). classifier that produces an output that depends on the Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia application. © 2020 Copyright held by the owner/author(s). 47 Howcroft et al. [7] have presented insightful accounts of peripheral vision, visual distortion, balance deficit, and similar. features, classification models and validation strategies related These effects alter the gait and are highly correlated with an to sensor-based fall-risk assessment. They have found large increased risk of falls [3]. Both scenarios (normal and abnormal heterogeneity in terms of sensor-based features and sensor walk) were repeated by each subject five times, resulting in ten placement. Regarding the features, the existing studies most walking sessions per subject. An example of two motion often use features from the time and frequency domain, which samples from the sensors in the smartwatch worn on the right include mean, variance and energy of the windowed inertial wrist of one subject is shown in Figure 2. data, as well as spectral components such as dominant frequency and harmonic ratio [8]. 
In addition, some biomechanical gait features, such as stride length, clearance, stance and swing time for each stride, cycle time, cadence and similar, have been revealed as effective predictors of falls [9]. In terms of sensor location, the most exploited body positions are the shanks, waist, pelvis, and feet. In [10], the authors made use of wearable devices incorporating accelerometers and gyroscopes, worn on the shanks and waist. They proposed a general probabilistic modeling approach for the classification of different pathological types of gait through the estimation of spatiotemporal features. They showed that a Support Vector Machine (SVM) classifier can identify mobility impairment in elderly people with an accuracy of 90.5%. In [11], the authors showed that by assessing walking quality during a six-minute walk test with accelerometers attached to the lower leg and pelvis, prospective fallers and non-fallers can be successfully differentiated with a Random Forest (RF) classifier. Similar findings were confirmed for inertial sensors attached at the sternum in [12]. However, these body locations may be found obtrusive for wearing a device over longer periods of time. On the other hand, the wrist is considered the most unobtrusive and widely accepted position to wear a device, as it does not affect the everyday-life activities of the user. Still, sensors worn on the wrist are affected by frequent movements, as the hand is generally the most active part of the body. This makes the analysis of gait very challenging, and thus wrist-worn devices have not yet been utilized for gait abnormality detection for fall risk assessment. Considering the lack of evidence supporting the feasibility of fall risk assessment with sensors worn on the wrist, in this paper we analyze the performance of several machine learning methods that utilize inertial sensor data from a wrist-worn device.

3 DATASET
For this study, we collected a dataset comprising recordings from 18 subjects (8 males, 10 females, aged 19–54). Each subject wore two Mobvoi TicWatch E smartwatches [13], one on the left and one on the right wrist (Figure 1). The two smartwatches ran an Android application that collected data from the inertial sensors integrated into the devices — accelerometer, gyroscope, and magnetometer — at a sampling frequency of 100 Hz.

Figure 1: Equipment for data collection

The subjects walked back and forth along a 15-meter straight line and performed two scenarios: a normal walk and a simulated abnormal walk. In the normal gait scenario, subjects walked at a comfortable pace and performed their natural gait, while in the simulated abnormal walk scenario, subjects walked while wearing impairment glasses [8]. The glasses were used to simulate the effects of impairment, including reduction of peripheral vision, visual distortion, balance deficit, and similar. These effects alter the gait and are highly correlated with an increased risk of falls [3]. Both scenarios (normal and abnormal walk) were repeated by each subject five times, resulting in ten walking sessions per subject. An example of two motion samples from the sensors in the smartwatch worn on the right wrist of one subject is shown in Figure 2.

Figure 2: IMU sensor signals from a normal walking session (left) and an abnormal walking session (right)

4 METHOD
The machine learning method that we developed for this study consists of several steps: preprocessing of the acquired sensor signals (filtering and data segmentation), feature engineering and extraction from the signal segments, and training of a classification model. In the first step, the raw IMU signals were filtered with a band-pass filter with cut-off frequencies in the range of 0.5 to 3.5 Hz [14], which attenuates the frequencies outside the range related to human walking activity [15]. After the filtering step, the sensor signals were segmented using a sliding window. Since the window size and the sliding parameter have to be tuned for the task at hand, the windowing parameters were determined empirically. Eventually, we chose a window size of 8 seconds, with 50% overlap between consecutive windows.
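A minimal sketch of this preprocessing, assuming SciPy. The 0.5–3.5 Hz pass band, the 100 Hz sampling rate, and the 8 s / 50% windowing come from the text; the filter order and the zero-phase filtering via `filtfilt` are our assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100  # sampling frequency in Hz (from the text)

def preprocess(signal, win_s=8.0, overlap=0.5):
    """Band-pass a raw IMU signal to 0.5-3.5 Hz and cut it into
    8-second windows with 50% overlap."""
    b, a = butter(N=4, Wn=[0.5, 3.5], btype="bandpass", fs=FS)
    filtered = filtfilt(b, a, signal)       # zero-phase filtering (assumed)
    win = int(win_s * FS)
    step = int(win * (1 - overlap))
    return [filtered[i:i + win]
            for i in range(0, len(filtered) - win + 1, step)]
```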
To train a classification model, we extracted several features from the time and frequency domains for each sensor signal. The tsfresh Python package [16] allows general-purpose time-series feature extraction, which we exploited to generate more than 100 features per sensor stream. These features include the minimum, maximum, mean, variance, the correlation between axes, their covariance, skewness, kurtosis, the number of times the signal is above/below its mean, the signal's mean change, and its different autocorrelations, among others. An additional subset of frequency-domain features was calculated using the signal's power spectral density (PSD), which is based on the fast Fourier transform (FFT); these include the PSD energy, entropy and binned distribution, the largest magnitude of the PSD (i.e., of the dominant frequency in the signal), and the first four statistical moments of the PSD (mean, standard deviation, skewness, and kurtosis) [17][18].

We compared several different ML models that have all previously been proven suitable for the analysis of human activities:

1) Decision tree (DT) [19] is an algorithm that learns a model in the form of a tree structure, with decision nodes with two or more branches, each representing values of a tested feature, and leaf nodes which represent a decision on the target class. In other words, it predicts the target class by learning decision rules from the training features.

2) Random forest (RF) [20] is an ensemble of decision tree classifiers. It creates multiple decision trees, each trained on a bootstrapped sample of the original training data, and searches only across a randomly selected subset of features to determine a split. For the decision on the target class, each tree outputs a prediction, and the final prediction of the classifier is determined by a majority vote of the trees.

3) Support vector machine (SVM) [21] is an algorithm characterized by the use of kernel functions. They are used to transform feature vectors into a higher-dimensional space, in which a separating hyperplane is learned to best fit the training data.

4) K-nearest neighbors (kNN) [22] is an algorithm that uses feature-vector similarity: for each feature vector in the test data, it finds the k nearest neighbors in the training set. The final prediction of the classifier is determined by a majority vote of the chosen neighbors.

To estimate the generalization accuracy of the models, we utilized the leave-one-subject-out (LOSO) cross-validation technique. With this approach, the data is repeatedly split according to the number of subjects in the dataset. In each iteration, one subject is selected for testing purposes, while the other subjects are used for training the model. This procedure is repeated until data from all subjects have been used as test data.
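The LOSO protocol maps directly onto scikit-learn's LeaveOneGroupOut splitter; a sketch with the Random Forest model (the best performer in Section 5) is shown below. The feature matrix X, labels y and per-window subject IDs are assumed to come from the preprocessing and feature extraction described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut

def loso_accuracy(X, y, subject_ids):
    """Leave-one-subject-out CV: every subject is held out exactly once."""
    scores = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.mean(scores)
```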
5 EXPERIMENTAL RESULTS
To observe the performance of the models in real-life scenarios, we carried out several experiments. We observe the performance of the models on the left and right wrist separately, to see whether they achieve similar results on both wrists. Since real life poses many challenges that should be taken into account, we considered three different training scenarios for each wrist. In total, we test the accuracy of the models for six train-test combinations: training on the left wrist and testing on the right wrist (L–R), training on the right wrist and testing on the right wrist (R–R), training on both wrists and testing on the right wrist ((L+R)–R), training on the left wrist and testing on the left wrist (L–L), training on the right wrist and testing on the left wrist (R–L), and training on both wrists and testing on the left wrist ((L+R)–L). With these combinations, we want to see whether training a model with data from only one wrist or from both wrists combined leads to higher accuracy. Moreover, another challenge that we took into account is a device with a model developed for the right (left) wrist being worn on the left (right) wrist, hence the "switching wrists" combinations [23]. The results of these experiments can be seen in Table 1. The performance of the machine learning models is additionally compared with the performance of a baseline method, a majority-vote classifier.

Table 1: Gait abnormality detection accuracy of individual classifiers.

  Classifier            L–L    R–L   (L+R)–L   R–R    L–R   (L+R)–R
  Baseline (majority)   61.4   61.4   61.4     61.4   61.4   61.4
  DT                    75.1   51.2   78.0     74.5   65.6   76.6
  RF                    83.9   72.2   84.3     82.8   81.3   84.3
  SVM                   68.3   61.0   72.4     64.4   66.4   71.4
  kNN                   63.2   57.3   63.8     61.2   62.6   63.0

From the presented results, it can be seen that the RF algorithm significantly outperforms the other algorithms for each train-test combination, while kNN achieves the lowest accuracy in the detection of gait abnormalities. Moreover, the results show that the right-left combination achieves 72.2% accuracy, which is significantly lower than the left-left combination at 83.9% with the RF model. On the other hand, the difference between the left-right and right-right combinations is minor — only 1.5 percentage points. These results suggest that models trained with data from the left wrist can perform well on both wrists, but the data acquired from the right wrist does not carry enough information to train a reliable model that would also perform well on the left wrist.

However, the problem of "switching wrists" can be overcome if the models are trained with data from both wrists. In fact, the models trained with data from the left and right wrist combined outperform the other two combinations for both wrists, achieving the highest accuracies of 84.3% for the left wrist and 84.3% for the right wrist with the RF model.

Overall, the results suggest that the models perform better for the left wrist. Since all subjects included in the dataset were right-handed, we can conclude that the non-dominant hand carries more information about the walking patterns of the subjects.
5 EXPERIMENTAL RESULTS

To observe the performance of the models in realistic scenarios, we carried out several experiments, examining the performance of the models on the left and right wrist separately to see whether they achieve similar results on both wrists. Since real life poses many challenges that should be taken into account, we considered three training scenarios for each wrist; that is, we tested the accuracy of the models for six train-test combinations: training on the left wrist and testing on the right wrist (L-R), training on the right wrist and testing on the right wrist (R-R), training on both wrists and testing on the right wrist ((L+R)-R), training on the left wrist and testing on the left wrist (L-L), training on the right wrist and testing on the left wrist (R-L), and training on both wrists and testing on the left wrist ((L+R)-L). With these combinations, we want to see whether training a model with data from only a particular wrist or from both wrists combined leads to higher accuracy. Moreover, another challenge that we took into account is a device with a model developed for the right (left) wrist being worn on the left (right) wrist, hence the "switching wrists" combinations [23]. The results of these experiments are shown in Table 1. The performance of the machine learning models is additionally compared with the performance of a baseline method, a majority vote classifier.

Table 1: Gait abnormality detection accuracy of individual classifiers.

Classifier            L-L    R-L    (L+R)-L   R-R    L-R    (L+R)-R
Baseline (majority)   61.4   61.4   61.4      61.4   61.4   61.4
DT                    75.1   51.2   78.0      74.5   65.6   76.6
RF                    83.9   72.2   84.3      82.8   81.3   84.3
SVM                   68.3   61.0   72.4      64.4   66.4   71.4
kNN                   63.2   57.3   63.8      61.2   62.6   63.0

From the presented results, it can be seen that the RF algorithm significantly outperforms the other algorithms for each train-test combination, while kNN achieves the lowest accuracy in detecting gait abnormalities. Moreover, the right-left combination achieves 72.2% accuracy, significantly lower than the left-left combination, which achieves 83.9% with the RF model. On the other hand, the difference between the left-right and right-right combinations is minor, only 1.5 percentage points. These results suggest that models trained with data from the left wrist could perform well on both wrists, but the data acquired from the right wrist does not carry enough information to train a reliable model that also performs well on the left wrist. However, the problem of "switching wrists" can be overcome if the models are trained with data from both wrists: the models trained with data from the left and right wrist combined outperform the other two combinations for both wrists, achieving the highest accuracy with the RF model, 84.3% for the left wrist and 82.3% for the right wrist. Overall, the results suggest that the models perform better for the left wrist. Since all subjects included in the dataset were right-handed, we can conclude that the non-dominant hand carries more information about the walking patterns of the subjects.

6 CONCLUSION

In this paper, we analyzed the ability of machine learning algorithms to detect gait abnormalities using data from inertial sensors integrated into a smartwatch. Among the compared machine learning algorithms, Random Forest achieved the highest accuracy. The analysis of the performance of the models on the left and right wrist showed that they perform better on the left wrist, which was the non-dominant one for the subjects included in the dataset. The "switching wrists" experiments, i.e., training the models with data collected from one wrist and testing on the other, showed that the accuracy of the models drops significantly. However, when the models were trained with data from both wrists and applied to each wrist individually, the accuracy increased, outperforming even the models that were trained and tested on the same wrist. Therefore, the best practical solution is to deploy a model trained with data from both wrists. Overall, the results are satisfactory and show that data generated by wrist-worn inertial sensors is sufficient for gait abnormality detection and can be used for fall risk assessment in non-clinical environments.

ACKNOWLEDGMENTS

The authors would like to thank all the participants who took part in the dataset collection. The first author acknowledges the financial support of the Slovene Human Resources Development and Scholarship Fund – Ad Futura.
[2] Ageing and health Available online: https://www.who.int/news- [20] Breiman, L. Random Forest. Mach. Learn. 2001, 45, 5–32. room/fact-sheets/detail/ageing-and-health (accessed on Aug 30, 2020). [21] Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995. [3] Salzman, B. Gait and balance disorders in older adults. Am. Fam. [22] Aha, D.W.; Kibler, D.; Albert, M.K. Instance-Based Learning Physician 2011. Algorithms. Mach. Learn. 1991. [4] Horikawa, E.; Matsui, T.; Arai, H.; Seki, T.; Iwasaki, K.; Sasaki, H. Risk [23] Gjoreski, M.; Gjoreski, H.; Luštrek, M.; Gams, M. How accurately can of falls in Alzheimer’s disease: A prospective study. Intern. Med. 2005. your wrist device recognize daily activities and detect falls? Sensors [5] Allen, N.E.; Schwarzel, A.K.; Canning, C.G. Recurrent falls in (Switzerland) 2016. parkinson’s disease: A systematic review. Parkinsons. Dis. 2013. 50 Avtomatska detekcija obrabe posnemalnih igel Automatic Wear Detection of Broaches Primož Kocuvan Jani Bizjak primoz.kocuvan@ijs.si jani.bizjak@ijs.si Institut "Jožef Stefan" Institut "Jožef Stefan" Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenija Ljubljana, Slovenija Stefan Kalabakov Matjaž Gams stefan.kalabakov@ijs.si matjaz.gams@ijs.si Institut "Jožef Stefan" Institut "Jožef Stefan" Jamova cesta 39 Jamova cesta 39 Ljubljana, Slovenija Ljubljana, Slovenija Slika 1: Odčitki signala posnemalne igle POVZETEK najboljše metode je 27 posnemanj oz. 1,8% glede na povprečno Posnemanje materiala je ena izmed metod strojnega obdelovanja število posnemanj, ki se opravijo pred menjavo. izdelkov, ki jih dosežemo s t.i posnemalno iglo. V grobem ločimo zunanje posnemanje in notranje posnemanje materiala. V pri- KLJUČNE BESEDE spevku se posvečamo notranjemu posnemanju, pri katerem se Posnemalne igle, avtomatsko zaznavanje, regresija, strojno uče- v začetku naredi manjšo luknjo v obdelovanec, nato pa posto- nje poma oblikuje profil. To se doseže z različnimi premeri rezil tako, da je na začetku premer manjši, nato pa se postopno povečuje. ABSTRACT Tako se lahko oblikuje poljuben krožni ali n-kotni profil. Zaradi obrabe rezil pri posnemanju se morajo le-ta redno menjati. V Broaching is one of the methods in metalworking, which is per- prispevku je opisan pristop napovedovanja obrabe posnemalne formed with the so-called broach. We distinguish between exter- igle glede na cikel posnemanja. Glavna značilka, uporabljena za nal broaching and internal broaching. In this paper, an internal napovedovanje, je t.i mikroraztezanje (ang. microstrain), ki pove, broaching is presented, where a small hole is initially made in za koliko se spremeni obremenitev na merilnem mestu v delcih the workpiece, and then the broach gradually forms a profile. na milijon. V prispevku je predstavljenih več metod strojnega This is achieved with different blade diameters so that initially učenja za reševanje omenjenega problema. Povprečna napaka the diameter is smaller and then it gradually increases. Thus, any circular or polygon shape can be formed. Due to the wear during Permission to make digital or hard copies of part or all of this work for personal broaching, the blades must be replaced regularly. In this paper, or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and an approach for predicting how many broaching processes or the full citation on the first page. 
1 INTRODUCTION

Broaching is a very precise machining procedure for metal products. Broaches are relatively expensive (several thousand EUR), so broaching is used in industry only when the number of workpieces is sufficiently large. Because of mass production, using a worn or broken broach leads to high costs for the manufacturer, so broaches are currently replaced after 1500 broaching cycles regardless of their actual wear. With machine learning, it is possible to predict more precisely when a particular broach will be too worn, which gives better utilization of the broaches and immediate detection of a possible broach failure. The authors of this paper have received several awards for various industrial applications (prof. dr. Matjaž Gams [1]), and the first author worked on the processing of time signals in his bachelor's thesis [2]. Some researchers have approached the problem with a combination of machine vision and machine learning [3], [4], [5], and others with force measurements [6], [7].

Figure 2 shows an example of the broaching signal of one cycle, i.e., one workpiece. The abscissa shows time, while the ordinate shows the load, i.e., the strain on the spacer plate, expressed in microstrain [8]. The spacer plate is a metal plate that ensures the proper offset between the metal workpiece and the broach; this is where we measure the strain. We can observe how the strain on the spacer plate (which can be interpreted as force) changes as the diameter of the broach's teeth changes. Figure 3 shows an example of a broach (in general), with the workpiece marked in red; the arrow indicates the direction of movement.

Figure 2: Example of a zoomed-in view of a single cut in an arbitrary signal
Figure 3: Example of a broach [9]

2 PROBLEM DEFINITION

The blades wear out (become blunt) during use, so a greater force is needed for each cut. As the force increases, so does the probability that the cut will be faulty or that the blade will be damaged (e.g., a cutting tooth breaks off). When the blades are too worn, they can be sharpened, which is much cheaper than buying a new blade in the case of irreparable damage (e.g., a broken tooth). The current production procedure is to replace all blades after 1500 broaching cycles, since beyond that number the probability of failure is too high. The problem is that, depending on various external factors (e.g., lubricants, temperature, etc.), a blade wears faster or slower, which leads to defects on the product or to poorer utilization of the blade.

Figure 4 compares the signal of a new broach (blue) and a worn broach (red). The broach represented by the red signal generally has a larger integral (area under the curve), which means the force is larger. The signals also differ in the number and magnitude of individual peaks; in some cases certain peaks are missing entirely (the blade is completely worn out).

Figure 4: Comparison of strain between a worn and a new broach

3 SOLVING THE PROBLEM

From Figure 4 we can see that the number and height (integral) of the peaks are among the most important factors for recognizing a fault, with the shape of the peaks as a secondary factor.
We approached automatic peak detection by detecting when the signal rises above the standard deviation (noise) of the signal. Between individual cuts the broach is idle, as can be seen in Figure 1. In this way we obtained a window containing only the signal produced during cutting. Using the Tsfresh library [10], we then computed around 1000 different attributes describing the signal, such as the minimum and maximum values of the signal and the frequencies and patterns appearing in it. Next, we filtered the attributes by relevance (also with the mentioned library), which computes a p-value, i.e., the statistical significance level, for each attribute. In the final stage, the Benjamini-Yekutieli algorithm is run over the set of p-values and decides which features to keep and which to discard. It turned out that the most important attributes are the area, the maximum value, and the number of peaks, i.e., only three attributes.
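Under stated assumptions (a 1-D microstrain array, a noise-threshold factor of 3, and a known cycle number per detected cut, none of which the paper specifies exactly), the segmentation and Tsfresh-based selection described above could be sketched as:

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features, select_features

def cut_windows(signal, k=3.0):
    """Contiguous index runs where |signal - mean| exceeds k times the noise std."""
    active = np.flatnonzero(np.abs(signal - signal.mean()) > k * signal.std())
    return np.split(active, np.where(np.diff(active) > 1)[0] + 1) if active.size else []

# One row per sample: window id, time index, microstrain value
rows = [{"id": w, "time": int(t), "value": float(microstrain[t])}
        for w, win in enumerate(cut_windows(microstrain)) for t in win]
long_df = pd.DataFrame(rows)

X = extract_features(long_df, column_id="id", column_sort="time")  # ~1000 attributes
X = X.dropna(axis=1)                     # drop attributes that could not be computed
y = pd.Series(cycle_of_window, index=X.index)  # target: cycle number of each cut
X_selected = select_features(X, y)       # per-attribute p-values + Benjamini-Yekutieli
```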
With the selected attributes, we used machine learning to predict in which cycle a given broach is, i.e., how close it is to failure. We used the following approaches within the Scikit-learn framework [11], [12]:
• linear regression (Linear Regression) [13],
• gradient boosting for regression (Gradient Boosting) [14],
• AdaBoost [15],
• K nearest neighbours (K Nearest Neighbours) [16].

The broaching machine from which we obtained the measurements has a left and a right broach, both operating simultaneously, i.e., both cut material at the same time. Figure 5 shows an example of measurements of the left and right broach at a given time. One broach has a visibly larger integral, which means that the broach/sensor should be calibrated at the beginning of the measurement cycle; this would ensure the same starting point for further statistical processing of the data. To avoid this issue, in this paper we compared only broaches on the same side (left or right).

Figure 5: Comparison of strain readings from the left and right broach

4 RESULTS

For predicting the continuous value of the target variable (regression), we use the MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) metrics. The difference is that MAE returns only the absolute difference, while RMSE squares the error, thereby penalizing larger differences, i.e., cases where the prediction is off by a larger number of cycles. In our case we used only the MAE, defined by equation (1):

\( \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - x_i \rvert \)  (1)

Figure 6 shows a comparison of the signal integrals after a given number of cuts. We can see (especially for the right broach) that the integral values increase with the number of cuts, which is consistent with the expectation that the same cut with a blunt blade requires more force. A similar, though less pronounced, trend holds for the maximum force during a cut, as shown in Figure 7. Figure 8 shows the number of peaks recognized by the algorithm; as expected, the number of peaks is inversely proportional to the number of cuts. The blades on individual broaches wear out, so those broaches no longer cut and the force on the blade is low, since the blade removes no material. The next, unworn blade must then remove a larger amount of material because the previous one did not do its job, which leads to greater force and wear on that blade.

Figure 6: Signal integral versus the number of broach cuts
Figure 7: Maximum signal values versus the number of broach cuts
Figure 8: Number of signal peaks versus the number of broach cuts

Table 1 shows the cycle-prediction results for each machine learning method.

Table 1: Regressors and their corresponding MAE metrics

Regressor           MAE
Linear regression   101.25
Gradient boost      27.58
AdaBoost            165.44
KNN                 74.16
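Continuing the sketch above, the listed regressors could be compared with the MAE from equation (1) roughly as follows (the split and data are placeholders; the scikit-learn classes match the ones referenced in [13]-[16]):

```python
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# X_selected: attributes per cut (area, maximum value, number of peaks)
# y: cycle number of the cut, ranging from 0 to 1500
X_tr, X_te, y_tr, y_te = train_test_split(X_selected, y, random_state=0)

for model in (LinearRegression(), GradientBoostingRegressor(),
              AdaBoostRegressor(), KNeighborsRegressor()):
    model.fit(X_tr, y_tr)
    mae = mean_absolute_error(y_te, model.predict(X_te))   # equation (1)
    print(f"{type(model).__name__}: MAE = {mae:.2f}")
```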
5 CONCLUSION

Our contribution to this field is the method of measuring the microstrain on the spacer plate and analyzing the time signal with machine learning. With the described approaches we obtained a mean absolute error (MAE) of 27.58, which means that when predicting the cycle of the current cut, our model is off by 27.58 cycles on average. The values (number of cycles) range from 0 to 1500.¹ Since 27.58/1500 ≈ 1.84%, this means that the model predicts with about 98.16% accuracy in which cycle a broach is, i.e., when the broach needs to be replaced. Further research should focus on optimizing the hyperparameters of the individual regressors. The final goal of the research is to implement this kind of signal-based comparison in the production process.

¹ The model assumes that a broach must be replaced when its signal looks like the signal of a broach with 1500 cuts. To determine the exact "breaking point", i.e., when the broach actually fails, additional measurements/recordings would be needed in which a broach is used until defects appear on the product.

ACKNOWLEDGMENTS

This research was partly funded by the ROB-KONCEL project (grant OP20.03530) and by ARRS. We thank the company UNIOR (Jože Ravničan and Tomaž Hohler).

REFERENCES
[1] 2011. Ventil - revija za fluidno tehniko, avtomatizacijo in mehatroniko. Ljubljana.
[2] Primož Kocuvan. 2015. Zaznavanje srčnega šuma v fonokardiogramih. Diploma thesis, Univerza v Ljubljani, 50.
[3] Wenmeng Tian, Lee J. Wells and Jaime Camelio. 2016. Broaching tool degradation characterization based on functional descriptors. In 11th Manufacturing Science and Engineering Conference (MSEC2016), USA.
[4] S. Kurada and C. Bradley. [n.d.] A machine vision system for tool wear assessment. 30, 295–304.
[5] S. Damodarasamy and Shivakumar Raman. [n.d.] An inexpensive system for classifying tool wear states using pattern recognition. 170, 149–160.
[6] Dongfeng Shi and Nabil N. Gindy. [n.d.] Tool wear predictive model based on least squares support vector machines. 21, 1799–1814.
[7] S. Rangwala and D. Dornfeld. [n.d.] Sensor integration using neural networks for intelligent tool condition monitoring. 112, 219–228.
[8] Anderson Langone Silva, Marcus Varanis, Arthur Guilherme Mereles, Clivaldo Oliveira and José Manoel Balthazar. [n.d.] A study of strain and deformation measurement using the Arduino microcontroller and strain gauges devices. 41.
[9] Srednja šola Koper. 2020. Posnemanje materiala. http://www2.sts.si/arhiv/tehno/projekt3/Posnemanje/posnemanje.htm.
[10] Tsfresh library. 2020. Tsfresh. https://tsfresh.readthedocs.io/en/latest/.
[11] Aurélien Géron. 2017. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media, 1st edition, 574.
[12] Andreas C. Müller and Sarah Guido. 2016. Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly Media, 1st edition, 400.
[13] Scikit-learn. 2020. Regression - linear regression. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.
[14] Scikit-learn. 2020. Regression - gradient boost. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html.
[15] Scikit-learn. 2020. Regression - adaboost. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html.
[16] Scikit-learn. 2020. Regression - k-nearest-neighbour. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html.

Povečevanje enakosti (oskrbe duševnega zdravja) s prepričljivo tehnologijo
Increasing Equality (in Mental Health Care) with Persuasive Technology

Tine Kolenik (tine.kolenik@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
Matjaž Gams (matjaz.gams@ijs.si), Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia

ABSTRACT
The inability to cope with increasing mental health issues among the populace severely hampers the well-being of both the individual and society. Barriers to access and equality in mental health care, many of which are well known, range from personal stigmas to socio-economic inequality. This offers technology, especially artificial intelligence, the opportunity to try to alleviate the existing situation with unique solutions. Multi- and interdisciplinary research in the field of persuasive technology, which aims to change behavior or mental states without deception and coercion, shows success in improving the well-being of people with mental health issues. This paper presents such systems with a brief overview of the field, with the main contribution being an analysis of potential problems and solutions that persuasive technology offers in the field of mental health care. Persuasive technology seems to be able to complement existing mental health care solutions, thereby reducing unequal access to and inequality in mental health care as well as reducing inequality in general.

KEYWORDS
Digital mental health, persuasive technology, artificial intelligence, mental health care access, equality
1 INTRODUCTION

Mental health problems have been on the rise for decades, and their devastating effect has also been recognized by global decision-makers: the United Nations has placed improvement in this area among its Sustainable Development Goals [42]. Stress, anxiety and depression (SAD) stand out among these problems. In some groups, 74% of people are reported to struggle with acute stress [24], 28% with an anxiety disorder [5] and 48% with depression [36]. What seems even more problematic is that in low- and middle-income countries around 80% of people receive no treatment for their mental health problems, while in high-income countries this figure is around 35% [33]. Mental health problems cause far-reaching and multilayered consequences, felt by patients, their immediate surroundings (family, carers) and wider society [41]. Patients face a lower quality of life, poorer educational outcomes, lower productivity, potential poverty, social problems and additional health problems. Carers face greater emotional and physical challenges, as well as reduced income and increased financial costs. Society faces the loss of several percentage points of GDP and billions of dollars per country every year, together with deteriorating trust in public health institutions and the erosion of social cohesion. All of this leads to an ever stronger positive feedback loop: SAD maintains and reinforces SAD. Mental health problems also too often lead to the loss of human life, as many countries struggle with high suicide rates [8]. The reasons for the rise in SAD symptoms include a severe shortage of mental health professionals and regulations [39] and unequal access to mental health care [9]. Technological and other scientific therapeutic interventions therefore seem capable of helping to improve the current state of the system, especially since individuals with mental health problems prefer therapy to medication [2].

With the advances of the behavioral sciences in human decision-making and related phenomena [34], and with the arrival of digital technologies, artificial intelligence and big data, development has turned toward creating technologies that would help, motivate and guide people to improve themselves and the world. Persuasive technology (PT) is one result of these efforts: technology that "changes attitudes or behaviors or both (without using coercion or deception)" [12, p. 20]. Behavior change is understood as a temporary or permanent effect on an individual's behavior, attitudes and other mental states compared with the past [12]. PT is already being used to support mental health [25, 27], which contributes to equality and enables easier access to health care [37].

The paper is structured as follows: Section 2 reviews the field of PT for mental health support, Section 3 analyzes the problems and solutions that PT offers, and Section 4 gives some concluding thoughts and ideas for future work.
2 FIELD OVERVIEW

This section reviews the field of PT and the field of behavior change. Behavior change is a phenomenon considered to cause a temporary or permanent effect on an individual's behavior compared with how they behaved in the past [12]; it involves not only behavior but also mental states. Behavior change interventions are a large part of PT, which is already widely used in health domains. Existing AI-supported systems monitor people's behavior and their physiological and mental states in order to motivate them and influence their well-being, and they can do all of this in natural language [27].

One of the most frequently used persuasion and behavior change frameworks in such technologies is Cialdini's principles of persuasion (CPP) [6]. Other frameworks exist [25, 27], but for the purposes of this paper only CPP is described. Its main idea is that no general persuasion strategy works on all people; CPP therefore describes several persuasion strategies, since different people are susceptible to different ones. CPP defines seven strategic bases for persuasion: 1) authority, which targets people who are more likely to be motivated by legitimate authority; 2) commitment and consistency, aimed at people who are more likely to commit to something if they have behaved that way before; 3) social proof, which targets people who tend to behave the way others do; 4) liking, which targets people who are more likely to be motivated by someone they like; 5) reciprocity, which targets people inclined to return favors; 6) scarcity, which targets people who consider rare things more valuable; and 7) unity, which influences people who respond to appeals concerning their group identity. Different strategies influence different people, and interactive technology provides a tool for more effectively selecting the strategies that work for particular people.

To select the most effective strategy, PT often relies on personality models, such as the Big Five personality factors [31], and on questionnaires for the individual domains where PT is used (e.g., mental health). Personality is measured along several dimensions (openness, conscientiousness, extraversion, agreeableness, neuroticism) that attempt to describe an individual's tendencies related to psychological characteristics such as mental states and decision-making. Persuasion in the mental health domain is also more successful if the PT has access to data about the individual's mental health; for this purpose, SAD questionnaires [21] can be used to categorize people with SAD symptoms.
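Purely as a toy illustration of this adaptive idea (no such mapping is specified in the paper or in [6], and the scores below are invented), strategy selection from per-user susceptibility estimates can be reduced to a simple arg-max:

```python
# Toy sketch: pick the CPP strategy a user is estimated to be most susceptible to.
# In a real PT system the scores would be derived from personality and SAD
# questionnaires; here they are invented for illustration only.
STRATEGIES = ("authority", "commitment and consistency", "social proof",
              "liking", "reciprocity", "scarcity", "unity")

def pick_strategy(susceptibility):
    """Return the strategy with the highest susceptibility score."""
    return max(STRATEGIES, key=lambda s: susceptibility.get(s, 0.0))

user = {"authority": 0.2, "liking": 0.7, "unity": 0.5}  # invented example scores
print(pick_strategy(user))  # -> "liking"
```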
Persuasion frameworks can be implemented on various technological platforms. A recent review of PT for health and well-being [27] found that the most frequently used platforms are mobile devices (28%), followed by games (17%), web and social networks (14%), other specialized devices (13%), desktop applications (12%), sensors and wearables (9%), and public displays (5%). Several types of applications act as PT in this domain, with intelligent cognitive assistants (ICAs; also known as chatbots or conversational AI) being the most advanced and widespread [4, 18, 26, 27, 30, 37, 44]. ICAs exhibit numerous human-like capabilities: to some extent they can understand context, adapt, learn, communicate, collaborate, predict, perceive, interpret and reason. Most importantly, ICAs can converse in natural language and can therefore be built to offer therapeutic support. The results of various review papers [4, 18, 26, 27, 30, 37] show that ICAs are an effective means of alleviating SAD symptoms. We conducted a short review of papers on state-of-the-art ICAs for mental health and briefly present three of them to illustrate this kind of technology. All three ICAs [11, 14, 43] work similarly, offering help through scripted conversations and basic natural language processing capabilities. The help depends on a user model containing data about the users' emotions and SAD levels. In experiments, all three ICAs prove 15–20% more successful at alleviating SAD than officially recommended self-help material.

Such technology offers numerous advantages in the field of mental health: it can be free of charge, enabling help for socio-economically disadvantaged people; it is available 24 hours a day, 7 days a week, which means patients do not have to wait for their next therapy session; many people with SAD symptoms find it easier to confide in a computer than in a person [10, 22]; the technology is available in remote locations; and so on. Technology can thus reduce the burden on the health care system and its providers and lower the barriers to accessing mental health care in general. It is important to emphasize that the technology works complementarily and does not replace professionals [16, 18, 37]. The advantages of using such technology and its potential problems are discussed in more detail in the next section.
3 ADVANTAGES AND POTENTIAL PROBLEMS

This section discusses the consequences of using PT for mental health in terms of promoting equality and access to mental health care, and it also touches on the consequences in general. The consequences are divided into those that offer potential solutions to existing problems and barriers to equality and access, and those that appear as problems of this technology in achieving equality. At the end of the section, other problems are briefly discussed that are seemingly unrelated to equality but are crucial for PT to reach its potential.

Categories in which PT offers potential solutions:

Cost: The prices of services offered by mental health professionals (from psychotherapists to clinical psychologists and psychiatrists) differ from country to country and depend mainly on state regulations and subsidies. The direct costs for the patient mostly depend on the number of professionals available in a given country. Regardless of their level, costs often hinder access to care for people from lower socio-economic backgrounds [23]. Access to PT for mental health can be free of charge (and often is [11]) because of the much lower costs of producing it. Three main factors contribute to this: 1) scalability, meaning that one PT system can in theory help an unlimited number of people (the only cost of scaling is the server cost, which is marginal compared to human labor), whereas one mental health professional is limited to a certain number of people; 2) the fact that effective PT can be created by many people, mainly thanks to existing research that thoroughly reports on effective systems; and 3) the number of people able to produce such systems is much larger than the number of professionals who can offer psychotherapeutic and similar help.

Availability: The availability problem can be separated into three subcategories: 1) location-based availability, 2) time-based availability, and 3) cost-based availability. Location-based availability refers to people with mental health problems in locations without direct access to mental health professionals (or even without computer access to remote therapy) [15]; using PT for mental health is one of the few potential solutions in such cases. Time-based availability refers to people with mental health problems who need therapeutic help at a time when their chosen professional is not available. PT for mental health is available 24 hours a day, so its use complements the chosen mental health professional. Patients consistently report these needs, and such complementary uses already exist [29].
Cost-based availability refers to people with mental health problems who need therapeutic help but lack the means to access more than the minimum recommended amount of therapy per week [13], estimated at one hour per week. Research [13, 32] shows that more frequent therapy brings better results, and complementary use of PT for mental health can bridge this gap for people who cannot afford more therapy. Cost-based availability is also closely related to the broader cost problem discussed in the previous category.

Stigma: Self-stigma, the prejudices that people with mental health problems hold about themselves because of their problems, and public stigma, the general population's response to people with mental illness, represent one of the main obstacles in the fight against mental health problems [7]. The problem is twofold: because of public stigma, individuals fear what society will think of them if they seek treatment, while because of self-stigma they fear interacting with a professional and doubt that their problems even deserve help. This duality contributes to individuals with mental health problems deciding not to seek treatment from mental health professionals; up to 96% of people with SAD do not seek treatment [35]. Research on PT for mental health, especially on ICAs for treating SAD, has shown that people generally find it easier to confide their problems to a computer or mobile system than to a person [22], because they do not fear being judged and they gain privacy for disclosing their feelings and thoughts in general. This means that the number of people who avoid contact with professionals can be reduced by introducing therapeutic options that patients consider safer and free of stigma.

However, such technology also brings potential problems that must be highlighted and treated seriously for PT to reach the potential it has in the field of mental health:

Exclusion of vulnerable groups: Technology-oriented mental health care solutions can lead to the exclusion of some vulnerable groups, among them the elderly, the lowest socio-economic class, and culture-specific groups. The group most affected by the introduction of technology appears to be the elderly [1]; their lower ability to integrate technology into everyday life can deepen the divide between them and other generational groups. Another group that may be excluded from the benefits of PT for mental health are people from the lowest socio-economic class, to whom PT may not be available at all [28]; deepening the already large differences would push this group into even more catastrophic socio-economic living conditions. Groups affected in technology adoption by cultural differences are crucial when considering progress toward equality: research shows that cultures with less modern socio-political leanings show a lower tendency to adopt technology [19]. Nevertheless, a greater presence of PT research also seems to be emerging in some low-income countries [40].

Research bias: Because there are no evaluation standards for PT for mental health, the research field is more susceptible to research bias. There are many possible problems: 1) PT systems claimed to be successful are not always studied in empirical experiments (e.g., randomized controlled trials) but in quasi-experiments [43], or not at all; 2) the metric by which such systems should be evaluated is unclear (it is usually derived indirectly from their effectiveness in a study whose goal is alleviating SAD symptoms [37]); and 3) there is no consensus on which data are needed for a system to understand the user and thus offer effective help, so the choice of data types so far depends more on researchers' assumptions than on existing findings.

The use of PT for mental health also involves problems that do not relate only to achieving equality and access. Although they are extremely important, their in-depth analysis is beyond the scope of this paper. We nevertheless mention a few: 1) the problem of personal data protection [3]; 2) the lack of longitudinal research on behavior change with PT [20]; 3) the ethics of using personal data for persuasion [17]; and 4) the potential problem of automation and the loss of jobs of mental health professionals. There are certainly other problems and concerns, but we wanted to keep this list short and to show with it that other problems with PT exist and that we are aware of them.
4 CONCLUSION AND FUTURE WORK

This paper explores how persuasive technology, which tries to influence people's behavior without coercion, can increase equality in and access to mental health care, and thereby strengthen equality in general. Focusing further on stress, anxiety and depression, it examines why mental health is a considerable barrier to equality and why people with mental health problems face obstacles in accessing health care, and it then presents arguments for using persuasive technology in this domain. This is followed by a presentation of persuasive technology in its multi- and interdisciplinary composition of behavioral science, computer science and artificial intelligence. Examples of persuasive technology for mental health implemented in intelligent cognitive assistants are presented, including their effectiveness in alleviating symptoms of stress, anxiety and depression. Finally, the paper explores the potential solutions such technology offers in the field of mental health care and the potential problems it could create. Future work includes further research into these problems and solutions, a deeper study of the technical design of such technologies, especially those using artificial intelligence, and the provision of new conceptual and technical guidelines for PT for mental health in reducing inequality in mental health care and inequality in general.

ACKNOWLEDGMENTS

This work was carried out within the young researcher program funded by the Slovenian Research Agency from the state budget.

REFERENCES
[1] I. Amaral and F. Daniel, 2016. Ageism and IT: social representations, exclusion and citizenship in the digital age. Lecture Notes in Computer Science 9755 (2016), 159–166.
[2] M. C. Angermeyer and H. Matschinger, 1996. The effect of personal experience with mental illness on the attitude towards individuals suffering from mental disorders. Social Psychiatry and Psychiatric Epidemiology: The International Journal for Research in Social and Genetic Epidemiology and Mental Health Services 31, 6 (1996), 321–326.
[3] S. Avancha, A. Baxi and D. Kotz, 2012. Privacy in mobile technology for personal healthcare. ACM Computing Surveys 45, 1 (2012).
[4] D. Bakker, N. Kazantzis, D. Rickwood and N. Rickard, 2016. Mental Health Smartphone Apps: Review and Evidence-Based Recommendations for Future Developments. JMIR Mental Health 3, 1 (2016).
[5] A. Baxter, J. M. Scott, T. Vos and H. Whiteford, 2013. Global prevalence of anxiety disorders: a systematic review and meta-regression. Psychological Medicine 43 (2013), 897–910.
[6] R. Cialdini. 2016. Pre-Suasion: A Revolutionary Way to Influence and Persuade. Simon & Schuster, New York, NY.
[7] P. W. Corrigan and A. C. Watson, 2002. Understanding the impact of stigma on people with mental illness. World Psychiatry: official journal of the World Psychiatric Association (WPA) 1, 1 (2002), 16–20.
[8] S. C. Curtin, M. Warner and H. Hedegaard. 2016. Increase in suicide in the United States, 1999-2014. U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Health Statistics, Hyattsville, MD.
[9] European Commission. 2018. Inequalities in access to healthcare - A study of national policies. https://ec.europa.eu/social/main.jsp?catId=738&langId=en&pubId=8152
[10] A. Fadhil and G. Schiavo, 2019. Designing for Health Chatbots. arXiv (2019). https://arxiv.org/abs/1902.09022
[11] K. K. Fitzpatrick, A. Darcy and M. Vierhile, 2017. Delivering Cognitive Behavior Therapy to Young Adults With Symptoms of Depression and Anxiety Using a Fully Automated Conversational Agent (Woebot): A Randomized Controlled Trial. JMIR Mental Health 4, 2 (2017).
[12] B. J. Fogg. 2002. Persuasive technology. MK, Burlington, MA.
[13] N. Freedman et al., 1999. The Effectiveness of Psychoanalytic Psychotherapy: the Role of Treatment Duration, Frequency of Sessions, and the Therapeutic Relationship. Journal of the American Psychoanalytic Association 47, 3 (1999), 741–772.
[14] R. Fulmer et al., 2018. Using Psychological Artificial Intelligence (Tess) to Relieve Symptoms of Depression and Anxiety: Randomized Controlled Trial. JMIR Mental Health 5, 4 (2018).
[15] K. Gibson et al., 2009. Clinicians' attitudes toward the use of information and communication technologies for mental health services in remote and rural areas. Canadian Society of Telehealth Conference, Vancouver, October 3–6 (2009).
[16] C. M. Kennedy, J. Powell, T. H. Payne, J. Ainsworth and I. Buchan, 2012. Active Assistance Technology for Health-Related Behavior Change: An Interdisciplinary Review. Journal of Medical Internet Research 14, 3 (2012).
[17] D. B. Klein, 2004. Statist Quo Bias. Econ. Jour. Watch 1 (2004), 260–71.
[18] L. Laranjo et al., 2018. Conversational agents in healthcare: a systematic review. Journal of the American Medical Informatics Association 25, 9 (2018), 1248–1258.
[19] S. G. Lee, S. Trimi and C. Kim, 2013. The impact of cultural differences on technology adoption. Journal of World Business 48, 1 (2013), 20–29.
[20] S. S. Lee, Y. K. Lim and K. P. Lee, 2011. A long-term study of user experience towards interaction designs that support behavior change. In CHI'11 Extended Abstracts on Human Factors in Computing Systems, ACM, New York, NY, 2065–2070.
[21] S. H. Lovibond and P. F. Lovibond. 1996. Manual for the depression anxiety stress scales. Psychology Foundation of Australia, Sydney.
[22] G. M. Lucas, J. Gratch, A. King and L. P. Morency, 2014. It's only a computer: Virtual humans increase willingness to disclose. Computers in Human Behavior 37 (2014), 94–100.
[23] P. McCrone et al., 2004. Cost-effectiveness of computerised cognitive-behavioural therapy for anxiety and depression in primary care: Randomised controlled trial. British Journal of Psychiatry 185, 1 (2004), 55–62.
[24] Mental Health Foundation. 2018. Stress: Are we coping? Mental Health Foundation, London.
[25] D. C. Mohr et al., 2013. Behavioral intervention technologies: evidence review and recommendations for future research in mental health. General Hospital Psychiatry 35, 4 (2013).
[26] J. L. Z. Montenegro, C. A. da Costa and R. da Rosa Righi, 2019. Survey of conversational agents in health. Expert Systems with Applications 129 (2019), 56–67.
[27] R. Orji and K. Moffatt, 2016. Persuasive technology for health and wellness: State-of-the-art and emerging trends. Health Informatics Journal 24, 1 (2016), 66–91.
[28] M. Pigato. 2001. Information and communication technology, poverty, and development in sub-Saharan Africa and South Asia (English). Africa Region working paper series, no. 20. The World Bank, Washington, D.C.
[29] M. Price et al., 2013. mHealth: A Mechanism to Deliver More Accessible, More Effective Mental Health Care. Clinical Psychology & Psychotherapy 21 (2013), 427–436.
[30] S. Provoost, H. M. Lau, J. Ruwaard and H. Riper, 2017. Embodied Conversational Agents in Clinical Psychology: A Scoping Review. Journal of Medical Internet Research 19, 5 (2017).
[31] B. Rammstedt and O. P. John, 2007. Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality 41, 1 (2007), 203–212.
[32] R. Sandell et al., 2000. Varieties of long-term outcome among patients in psychoanalysis and long-term psychotherapy: a review of findings in the Stockholm Outcome of Psychoanalysis and Psychotherapy Project (STOPP). The International Journal of Psychoanalysis 81 (2000), 921–942.
[33] A. Schmidtke et al., 1996. Attempted suicide in Europe: rates, trends and sociodemographic characteristics of suicide attempters during the period 1989–1992. Acta Psychiatrica Scandinavica 93 (1996), 327–38.
[34] R. H. Thaler and C. R. Sunstein. 2008. Nudge: improving decisions using the architecture of choice. Yale University Press, New Haven, CT.
[35] G. Thornicroft et al., 2017. Undertreatment of people with major depressive disorder in 21 countries. British Journal of Psychiatry 210, 2 (2017), 119–124.
[36] J. M. Twenge, 2014. Time Period and Birth Cohort Differences in Depressive Symptoms in the U.S., 1982–2013. Social Indicators Research 121, 2 (2014), 437–454.
[37] A. N. Vaidyam et al., 2019. Chatbots and Conversational Agents in Mental Health: A Review of the Psychiatric Landscape. Canadian Journal of Psychiatry 64, 7 (2019).
[38] P. S. Wang et al., 2007. Use of mental health services for anxiety, mood, and substance disorders in 17 countries in the WHO world mental health surveys. The Lancet 370, 9590 (2007), 841–50.
[39] P. Winkler et al., 2017. A blind spot on the global mental health map: a scoping review of 25 years development of mental health care for people with severe mental illnesses in central and eastern Europe. The Lancet Psychiatry 4, 8 (2017), 634–642.
[40] H. Winschiers-Theophilus et al., 2018. Proceedings of the Second African Conference for Human Computer Interaction: Thriving Communities. Association for Computing Machinery, New York, NY.
[41] World Health Organization. 2003. Investing in Mental Health. https://apps.who.int/iris/handle/10665/42823
[42] World Health Organization (WHO). 2013. Mental Health Action Plan 2013-2020. Geneva, Switzerland.
[43] A. Yorita et al., 2018. A Robot Assisted Stress Management Framework: Using Conversation to Measure Occupational Stress. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC).
[44] M. Mlakar, A. Tavčar, G. Grasselli and M. Gams. 2018. Asistent za stres. http://poluks.ijs.si:12345/.

Analiza glasu kot diagnostična metoda za odkrivanje Parkinsonove bolezni
Speech Analysis as a Diagnostic Method for the Detection of Parkinson's Disease

Andraž Levstek (levstek.andraz@gmail.com) and Darja Silan (darja.silan@gjp.si), Gimnazija Jožeta Plečnika, Šubičeva ulica 1, Ljubljana, Slovenija
Aljoša Vodopija (aljosa.vodopija@ijs.si), Institut "Jožef Stefan", Jamova cesta 39, Ljubljana, Slovenija

ABSTRACT
Parkinson's disease is a neurodegenerative disorder that causes impaired muscle function because of a lack of dopamine in the brain stem. Parkinson's disease also affects speech ability: the voice becomes monotone, hoarse and feeble. For this reason, one of the emerging ways to diagnose Parkinson's disease is speech analysis using artificial intelligence. In this paper, we use machine learning to connect voice samples to the presence of Parkinson's disease. To improve the classification accuracy, we additionally use a dimensionality reduction approach. The most accurate classifier was built with random forest, with an accuracy of 73%. The experimental results indicate a correlation between the voice changes and the presence of Parkinson's disease. Additionally, we estimate the importance of individual voice samples and the corresponding features. The results can be used to improve the current methodology by proposing additional voice samples that contain more information on the presence of Parkinson's disease.

KEYWORDS
Parkinson's disease, speech analysis, machine learning, random forest, feature importance
1 INTRODUCTION

Parkinson's disease is a neurodegenerative and debilitating condition that affects the central nervous system. The disease affects approximately 1% of people older than 60. A patient with Parkinson's disease often trembles, has problems with walking and balance, their movement becomes slow, and rigidity appears. Mental disorders may also occur, such as anxiety, depression, and problems with sleep, thinking and behavior.

Parkinson's disease also affects the voice. Most patients have speech problems, such as a weak, breathy, hoarse, higher-pitched and monotone voice. Typical signs are hoarseness, reduced voice volume, difficulty with the correct articulation of phonemes, and mumbling [5].

No diagnostic method is yet known that would prove the presence of Parkinson's disease with complete certainty. Diagnosis is based on visible and recognizable symptoms, past medical condition, a physical and neurological examination, and the patient's anamnesis [13]. According to the criteria, akinesia and at least one other characteristic (e.g., resting hand tremor, rigidity or postural disorders) must be present to establish Parkinson's disease. Using these criteria, Parkinson's disease can be identified with 90% accuracy, but diagnosis can take several years [12]. Brain imaging with magnetic resonance, positron emission tomography and computed tomography is also used in diagnostics. All of these diagnostic methods are expensive and demanding, so cheaper and simpler methods are being sought [13].

For diagnostic purposes, the analysis of voice recordings with artificial intelligence methods (e.g., machine learning, signal processing, etc.) is increasingly used. This kind of diagnostics is completely safe, simple and fast, and it requires no expensive dedicated devices [8]; for Parkinson's disease, however, the field is still developing. Most researchers deal only with achieving the highest possible classification accuracy [1, 7, 10, 11], thereby neglecting an important aspect of the analysis: identifying the important recordings and the corresponding voice features. Such findings would contribute to a better understanding of the problem and enable the design of more precise tests.

In this paper, we report on testing the usefulness of voice analysis with machine learning methods for diagnosing Parkinson's disease. The study is based on the voice recordings of 40 persons (20 patients with Parkinson's disease) obtained in the study [10]. We tested five different machine learning algorithms on these data. To improve the results, we additionally used a dimensionality reduction method and improved the classification accuracy by about 5%.
Unlike most related work, we also estimated the importance of individual recordings and the corresponding features. For this purpose, we used the random forest method, since it achieves the highest accuracy. In this way we can determine which features and recordings contain more information on the presence of Parkinson's disease.

The paper is organized as follows. Section 2 presents the data and Section 3 describes the methodology. Sections 4 and 5 present the results and the findings. The last section concludes and outlines further work.

2 DATA

The data were collected at the Istanbul Faculty of Medicine (Istanbul University) in 2014 [10]. Voice recordings of 40 people were collected: 6 women and 14 men with Parkinson's disease, and 10 healthy women and 10 healthy men. Each person made 26 recordings, comprising vowels, short sentences and words. More precisely, recordings 1–3 are the sustained vowels "a", "o" and "u", recordings 4–13 are the numbers from 1 to 10, recordings 14–17 are short sentences, and recordings 18–26 are words. All recordings are in Turkish and were made with a Trust MC-1500 microphone (https://www.trust.com/en/product/14896-design-microphone-mc-1500).

Each person is thus associated with 26 voice recordings, and each recording with 26 linear and frequency features, computed with the Praat acoustic analysis software [2]. All features are numerical and are commonly computed for voice analysis [2, 10]; they are summarized in Table 1. In total, the dataset contains 676 features and the target class. The latter is binary and represents the presence (positive = 1) or absence (negative = 0) of Parkinson's disease.

Table 1: Voice features used for machine learning: frequency, pulse, amplitude, voicing and harmonicity features.

Group        Features
Frequency    Jitter (local); Jitter (local, absolute); Jitter (rap); Jitter (ppq5); Jitter (ddp)
Pulse        Number of voice pulses; Number of periods; Mean period; Standard deviation of periods
Amplitude    Shimmer (local); Shimmer (local, dB); Shimmer (apq3); Shimmer (apq5); Shimmer (apq11); Shimmer (dda)
Voicing      Fraction of unvoiced frames; Number of voice breaks; Degree of voice breaks
Harmonicity  Median pitch; Mean pitch; Standard deviation of pitch; Maximum pitch; Minimum pitch; Pitch autocorrelation; Noise-to-harmonics ratio; Harmonics-to-noise ratio

3 METHODOLOGY

We built classifiers with five machine learning algorithms: decision tree (C4.5), naive Bayes (NB), k-nearest neighbors (kNN), support vector machine (SVM) and random forest (RF). For all of them we used the default parameter values, since tuning did not significantly improve the classification accuracy.

The number of features greatly exceeds the number of instances, so we decided to use a dimensionality reduction method, which improved the classification accuracy by 5%. For feature selection we used the widely known method called recursive feature elimination (RFE) [4], which is based on backward elimination of unimportant features. RFE is a wrapper method and was used in combination with the machine learning algorithms listed above. The final number of features, which is a parameter of RFE, was estimated with 10-fold cross-validation. For machine learning we used the caret library [6], implemented in the R programming language [9].
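The paper uses the RFE implementation from R's caret; a rough Python analogue with scikit-learn's RFECV, which likewise picks the final number of features by cross-validation, might look as follows (placeholder data, not the study's pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# X: 40 instances x 676 voice features; y: Parkinson's (1) / healthy (0)
X, y = np.random.randn(40, 676), np.random.randint(0, 2, 40)  # placeholders

rfe = RFECV(
    estimator=RandomForestClassifier(random_state=0),
    step=10,              # drop the 10 least important features per iteration
    cv=10,                # 10-fold cross-validation picks the feature count
    scoring="accuracy",
)
rfe.fit(X, y)
print("selected features:", rfe.n_features_)
mask = rfe.support_       # boolean mask of the retained features
```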
4 RESULTS

For the evaluation and selection of the best algorithm, we used the leave-one-subject-out (LOSO) approach. First, on the training set, we estimated the final number of features, which is a parameter of the RFE method, with 10-fold cross-validation. Then, with the tuned RFE method, we selected the best features and the corresponding classifier. With the latter we classified the left-out instance, and we repeated the described procedure for all instances.

Table 2 shows the results in terms of average accuracy, average sensitivity and average specificity. The most accurate classifier is the one built with RF, and the least accurate the one built with kNN. The highest sensitivity was achieved by the classifier built with NB, and the highest specificity by the classifier built with RF. Table 3 shows the confusion matrix for the classifier built with RF: it classified 29 instances correctly and erred in 11 cases.

Table 2: Classifier results in terms of accuracy, sensitivity and specificity. The highest value of each metric is shown in bold.

Algorithm   Accuracy   Sensitivity   Specificity
C4.5        0.63       0.65          0.60
NB          0.63       0.80          0.45
kNN         0.48       0.55          0.40
SVM         0.68       0.70          0.65
RF          0.73       0.75          0.70

Table 3: Confusion matrix for the classifier built with RF.

Predicted / True   Negative (0)   Positive (1)
Negative (0)       14             5
Positive (1)       6              15

We were also interested in the importance of individual recordings and the corresponding features. For this purpose, we repeated the feature selection procedure with RF, but this time on the full data, without leaving out any instances. The importance of the selected recordings and features was computed with a procedure called permutation importance, which can be directly included in the RF method [3]. For each tree, the accuracy is computed on the left-out instances (randomly excluded when building that tree); the computation is then repeated after permuting a given feature, and the importance of that feature is the average difference in accuracy before and after its permutation. Note that correlated features are not a problem for RF here, since the procedure is applied to each individual tree, which is uncorrelated by construction.
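The out-of-bag permutation importance described above follows Breiman [3]. As a close analogue, scikit-learn's permutation_importance applies the same shuffle-and-measure idea on held-out data rather than per-tree out-of-bag samples; reusing X and y from the previous sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Mean accuracy drop after shuffling each feature, averaged over repeats
result = permutation_importance(rf, X_te, y_te, n_repeats=30,
                                scoring="accuracy", random_state=0)
ranking = result.importances_mean.argsort()[::-1]   # most important first
```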
In this way we selected 27 of the 676 attributes. Among them, frequency attributes appear most often (Figure 1), while the other attribute groups are similarly represented. Among the recordings, numbers appear most often, followed by short sentences. Sustained vowels are represented worst (Figure 2).

Figure 1: Number of selected attributes per group after applying RFE in combination with RF.

Figure 2: Number of selected recordings per group after applying RFE in combination with RF.

Figures 3 and 4 show, respectively, the importance of the selected attributes (aggregated over recordings) and the importance of the recordings (aggregated over attributes) for the RF method. The attributes and recordings are ordered from less to more important. The results show that the frequency attributes are the most important for RF, while the harmonicity attributes and the attributes derived from voice pitch are the least important. The most important recording is the number "4". We observe that numbers and short sentences carry more information than the other recordings.

Figure 3: Importance of the selected attributes for the classifier built with RF. The importance of each group is the aggregated importance of the corresponding attributes.

Figure 4: Importance of the selected recordings for the classifier built with RF. The importance of each group is the aggregated importance of the corresponding recordings.

5 DISCUSSION

Similarly to related studies [1, 7, 10, 11], our results indicate a connection between voice attributes and the presence of Parkinson's disease. The most accurate classifier is built with RF, reaching an accuracy of 73 %; for comparison, some related works report accuracies of around 80 %. The frequency attributes are the most important and the most frequently selected (Figures 1 and 3); we attribute this to the characteristic deviation of voice frequency in Parkinson's disease. Among the recordings, numbers and short sentences stand out (Figures 2 and 4): demanding and longer recordings tell us more about the presence of the disease.

Nevertheless, such a way of diagnosing is insufficient. The most accurate method misses 25 % of the patients, which is unacceptable for medical practice [13]. We must stress, however, that we were working with a limited number of instances (only 40 people were recorded). If more voice recordings of diseased and healthy people were collected, the classifier could be improved with more advanced machine learning methods, which could not be used on such a small number of instances.

It may never be possible to determine the presence of Parkinson's disease from voice analysis with machine learning methods with full certainty, but such methods could be used either as a complement that upgrades existing methods or as a screening test. We emphasise that voice analysis is a cheap examination that is completely unobtrusive and safe for the patient.

6 CONCLUSION

In this paper we used machine learning methods to compare voice recordings of healthy people and of patients with Parkinson's disease. The purpose of the study was to check whether the presence of Parkinson's disease can be inferred from voice analysis and whether a classifier usable in practice can be built. Additionally, we assessed the importance of individual recordings and of the corresponding voice attributes.

The results indicate that patients with Parkinson's disease suffer a deterioration of voice articulation, since the classifier built with the random forest method successfully detected 73 % of the patients. Nevertheless, the classifier is not yet suitable for use in practice, since its accuracy is too low. The current classifier can be used as a test complementary to the existing ones. The most important recordings turn out to be numbers and short sentences, while sustained vowels and words are the least important. Among the attributes, the frequency and amplitude attributes stand out.

We are currently exploring the possibility of collecting more related voice recordings. This would allow us to use more complex methods, capable of discovering intricate regularities that could not be discovered on such a small set of instances. Our long-term goal is to build a classifier that would successfully identify most patients, even at the cost of somewhat lower accuracy (some healthy people would be classified as diseased). Such a classifier could be used as a screening test and would thus ease the current diagnostics of Parkinson's disease.
We will also try to determine why precisely the recordings of numbers contained more information about the presence of Parkinson's disease, and use the knowledge gained to propose a more comprehensive set of expressions, words and phonemes.

ACKNOWLEDGMENTS

The authors thank Mrs. Irena Hočevar Boltežar for the explanation of the voice attributes and for the Slovenian translations. A. Vodopija additionally acknowledges the financial support of the Slovenian Research Agency (young researcher training programme).

REFERENCES
[1] I. Bhattacharya and M. P. S. Bhatia. 2010. SVM classification to distinguish parkinson disease patients. In Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in India. ACM, New York, NY, USA, 1–6. doi: 10.1145/1858378.1858392.
[2] P. Boersma. 2001. Praat, a system for doing phonetics by computer. Glot International, 5, 9/10, 341–345.
[3] L. Breiman. 2001. Random forests. Machine Learning, 45, 1, 5–32. doi: 10.1023/A:1010933404324.
[4] I. Guyon, J. Weston, S. Barnhill and V. Vapnik. 2002. Gene selection for cancer classification using support vector machines. Machine Learning, 46, 1, 389–422. doi: 10.1023/A:1012487302797.
[5] I. Hočevar Boltežar. 2013. Fiziologija in patologija glasu ter izbrana poglavja iz patologije govora. Pedagoška fakulteta. http://www.biblos.si/lib/book/9789612531416.
[6] M. Kuhn. 2008. Building predictive models in R using the caret package. Journal of Statistical Software, Articles, 28, 5, 1–26. doi: 10.18637/jss.v028.i05.
[7] M. A. Little, P. E. McSharry, E. J. Hunter, J. Spielman and L. O. Ramig. 2009. Suitability of dysphonia measurements for telemonitoring of parkinson's disease. IEEE Transactions on Biomedical Engineering, 56, 4, 1015–1022. doi: 10.1109/TBME.2008.2005954.
[8] M. A. Little, P. E. McSharry, S. Roberts, D. Costello and I. Moroz. 2007. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Nature Precedings. doi: 10.1038/npre.2007.326.1.
[9] R Core Team. 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. http://www.R-project.org/.
[10] B. E. Sakar, M. E. Isenkul, C. O. Sakar, A. Sertbas, F. Gurgen, S. Delil, H. Apaydin and O. Kursun. 2013. Collection and analysis of a parkinson speech dataset with multiple types of sound recordings. IEEE Journal of Biomedical and Health Informatics, 17, 4, 828–834. doi: 10.1109/JBHI.2013.2245674.
[11] C. O. Sakar and O. Kursun. 2010. Telediagnosis of parkinson's disease using measurements of dysphonia. Journal of Medical Systems, 34, 4, 591–599. doi: 10.1007/s10916-009-9272-y.
[12] C. Silva. 2018. Speech analysis may help diagnose parkinson's and at earlier stage, study says. Parkinson's News Today. (2018). https://parkinsonsnewstoday.com/2018/02/05/speech-analysis-can-help-detect-parkinsons-in-early-stages-study-says/.
[13] E. Tolosa, G. Wenning and W. Poewe. 2006. The diagnosis of parkinson's disease. The Lancet Neurology, 5, 1, 75–86. doi: 10.1016/S1474-4422(05)70285-4.
STRAW Application for Collecting Context Data and Ecological Momentary Assessment

Junoš Lukan (Jožef Stefan Institute and Jožef Stefan International Postgraduate School, Jamova cesta 39, Ljubljana, Slovenia, junos.lukan@ijs.si), Marko Katrašnik (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, marko.katrasnik@gmail.com), Larissa Bolliger (Department of Public Health, Ghent University, Ghent, Belgium, larissa.bolliger@ugent.be), Els Clays (Department of Public Health, Ghent University, Ghent, Belgium, els.clays@ugent.be), Mitja Luštrek (Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia, mitja.lustrek@ijs.si)

ABSTRACT
To study stress at the workplace and relate it to user context and self-reports, we developed an application based on the AWARE framework, a mobile instrumentation toolkit. The application serves two purposes: passively collecting data about the user's environment, and offering questionnaires as a means of ecological momentary assessment. We implemented methods to import the questionnaires into the phone's database and trigger them at the right times. We also considered the privacy implications of collecting such data and took additional measures to conceal the identity of our study's participants wherever we evaluated it was at risk of exposure. Finally, we had to establish a server application to handle the receipt and storage of the collected data, and implemented a rudimentary login process to additionally secure our servers.

KEYWORDS
context detection, application development, privacy, ecological momentary assessment

1 APPLICATION OVERVIEW
The best machine learning models for stress detection and affect recognition are multimodal [1, 17]. Combining data from different modalities is especially effective, such as using physiological, behavioural or contextual, and psychological (self-reported) data. Collecting such data in a real-world setting presents a challenge, however.

In the project called Stress at work (STRAW), the main objective is to analyse the relationship between psychosocial stress experiences in the workplace, work activities and events, and peripheral physiology. To facilitate the integration of various data sources, an application was designed to run continuously and monitor the users' environment and specific phone-related events.

The application's purpose is two-fold. The primary mode of operation is silent and continuous: the user context (such as their phone use and location) is monitored without user intervention or interaction. The second mode of operation are prompts or questions for the user, where some information about the context and the participant's mental state is gathered by asking for it explicitly.

As a starting point for writing the STRAW application, we used AWARE, a mobile instrumentation toolkit whose initial purpose was inferring users' context [5]. It enables logging of data as reported by the phone's operating system and a wide variety of hardware sensors. At several points, this toolkit was adapted to better suit our needs, and additional capabilities were added on top of it.

We also developed two modular functionalities of the application: Bluetooth integration with an Empatica E4 wristband [23], to enable simultaneous collection of physiological data, and voice detection and speaker diarization capabilities [15]. We have already reported on these developments elsewhere, whereas in this paper we give an overview of the app's capabilities.

1.1 Data Types
An important aspect of the STRAW application are prompts, called EMAs. The users can be prompted to make a diary entry at a specific time, which is called the Experience Sampling Method [ESM; 3] or, more broadly (when data other than experience are noted), Ecological Momentary Assessment [EMA; 20]. Diary methods increase the reliability of collected self-reports, as they are less prone to recall bias [14].
EMAs are the main mode of user interaction in the STRAW application. The content of the specific questions is beyond the scope of this paper, but in general, the questions are based on existing psychological questionnaires measuring stressors, stress, and related responses. The implementation of EMAs is described in Section 2.

In addition to this, we selected a subset of data that might help us determine users' context. Below is a list of the sensors used in the STRAW application, together with a description of the data they collect. Data availability from some of these sensors depends on the phone's hardware and the version of the operating system.

– Acceleration: There are several sources (i.e. virtual sensors) of acceleration data in a smartphone. Accelerometers measure acceleration magnitude in various directions and report either linear acceleration (without gravity effects), gravity, or combined acceleration. This is used further in Google's activity recognition API [10].
– Barometer: Ambient air pressure.
– Light: Luminance of the ambient light captured by the light sensor.
– Temperature: Temperature of the phone's hardware sensor.
– Bluetooth: This sensor logs surrounding Bluetooth-enabled and visible devices, specifically their hashed MAC addresses and the received signal strength indicator (RSSI) in decibels.
– Location: The device's current location (latitude, longitude, and altitude, which are masked as described in Section 3) and its velocity (speed and bearing). This uses various methods, such as GPS and known Wi-Fis in the vicinity, resulting in different degrees of accuracy. The location category is also acquired with the Foursquare API.
– Network: Network availability (e.g. none or aeroplane mode, Wi-Fi, Bluetooth, GPS, mobile) and traffic data (received and sent packets and bytes over either Wi-Fi or mobile data).
– Proximity: Uses the sensor by the device's display to detect nearby objects. It can either be a binary indicator of an object's presence or the distance to the object.
– Timezone: The device's current time zone.
– Wi-Fi: Logs of surrounding Wi-Fi access points, specifically their hashed MAC addresses, the received signal strength indicator (RSSI) in decibels, security protocols, and band frequency. The information on the currently connected access point is also included.
– Applications: This includes the category of the application currently in use (i.e. running in the foreground) and data related to notifications that any application sends. The notification header text (but not the content), the category of the application that triggered the notification, and the delivery modes (such as sound, vibration and LED light) are logged.
– Battery: Battery information, such as the current battery percentage level, voltage, temperature and health, as well as power-related events, such as charging and discharging times, are monitored.
– Communication: Information about calls and messages sent or received by the user. This includes the call or message type (i.e. incoming, outgoing, or missed), the length of the call session, and a trace, a SHA-1 hash of the phone number that was contacted. The phone numbers themselves and the contents of messages and calls are not logged.
– Processor: Processor load in CPU ticks and the percentage of load dedicated to user and system processes or idle load.
– Screen: Screen status: turned on or off, and locked or unlocked.
– Voice activity: A classifier trained using Weka [7]. The features are calculated using openSMILE [4], and the output is an indicator of human voice activity [15].

The data described in the list above are collected automatically and continuously. The application runs as a foreground service, which means that data collection continues even while the application is not actively used (i.e. it is minimized). Despite this, there exists software, specific to the operating system version and phone manufacturer, which tries to close applications for energy efficiency. We attempted to whitelist this application in the most common battery-saving software.

2 ECOLOGICAL MOMENTARY ASSESSMENT
As mentioned, one of the main functions of the STRAW application is to collect users' answers to questionnaires. AWARE already implements a 'sensor' for the experience sampling method, which shows DialogFragments such as the one in Figure 1, but it was too rudimentary for our study protocol. The main upgrades we had to make were the mechanism of triggering EMAs and the management of the database of available questions (items) to include in the questionnaires.

Figure 1: An example of an ecological momentary assessment prompt.

2.1 EMA Triggering
Originally, AWARE provides a couple of ways to trigger EMAs: at a specific time, by a certain context (i.e. taking into account values from other sensors), or on demand (manually). In our study, time is the most important trigger of EMAs, but we needed finer control.

The EMAs in our studies are divided into three types: a) morning EMAs with questions about sleep quality, b) work-hour EMAs with questions about momentary affect, job characteristics, work activities, and similar, and c) evening EMAs with questions about the whole workday and after-work activities. The first EMA is triggered in the first hour after the start of the workday as set by the user. The rest of the EMAs during work hours trigger approximately every 90 minutes, but not closer than 30 minutes apart. The time depends on the last answered EMA rather than being set in advance, and additional reminders are scheduled in the case of user inactivity. The final EMA of the day is triggered in the evening at a time set by the user.

Each of these types of EMA is implemented as a separate IntentService [11] and handled by a JobScheduler [18]. This enabled us to enforce the requirements outlined above, such as setting the minimum latency with which a job can start and making use of periodic jobs.
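As a rough, platform-neutral illustration of these timing rules (the actual application implements them with Android's JobScheduler; this Python sketch and its jitter range are our own assumptions), the work-hour scheduling logic could look like this:

```python
# Hedged sketch of the work-hour EMA timing rules described above.
# The real app uses Android's JobScheduler; the jitter range and the
# function names here are illustrative assumptions.
import random
from datetime import datetime, timedelta

TARGET_INTERVAL = timedelta(minutes=90)  # approximate spacing of work-hour EMAs
MIN_GAP = timedelta(minutes=30)          # EMAs must be at least this far apart

def first_ema(workday_start: datetime) -> datetime:
    """The first EMA fires at a random point in the first hour of the workday."""
    return workday_start + timedelta(minutes=random.randint(0, 60))

def next_work_ema(last_answered: datetime) -> datetime:
    """Subsequent work-hour EMAs are anchored to the last *answered* EMA."""
    jitter = timedelta(minutes=random.randint(-15, 15))
    candidate = last_answered + TARGET_INTERVAL + jitter
    # Never schedule closer than MIN_GAP to the previous answered EMA.
    return max(candidate, last_answered + MIN_GAP)
```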
2.2 Question Database
In the original AWARE implementation, questions are queued into a questionnaire directly in the code of the application by using its custom ESMFactory class. For our study, we use a pool of more than 200 questions per language, from which a subset is sampled for every EMA. We therefore needed a more systematic way of storing them within the application.

To ease the insertion of individual items, we prepared a spreadsheet template which is meant to be human-readable and filled out manually. Individual items from this spreadsheet are later converted into JavaScript Object Notation (JSON) and stored in an SQLite database [13] in the phone's internal storage. This implementation enabled us to adapt the content of EMAs without touching the source code of the application. It also simplified the final selection of questions, such as selecting one language (English, Dutch, or Slovenian) and grammatical gender.

3 PRIVACY ENHANCEMENTS
The data collected by the STRAW application carry different degrees of risk to the users' privacy. Their privacy would be threatened if an outsider gained unauthorized access to the data; these possible external threats are considered in Section 5. Even when the data are safely communicated and stored, however, an involuntary exposure of users' identity might still be possible. Assuming the data are well protected from unauthorized external access, these risks will in turn be treated as internal in this section.

Some of the data collected by the STRAW application are personal data, so even when storing them securely and after pseudonymization, some risk of a privacy breach remains. Since AWARE is widely used in scientific studies, it already implements some privacy-enhancing mechanisms. We performed a thorough application vulnerability analysis and identified several further threats to privacy that we wished to address. The types of data that deserve special attention are applications, communication, location, and voice activity.

As mentioned in Section 1, the notifications that other applications send are monitored in the STRAW application. The content of a notification, such as that of an instant messaging application or a calendar notification, is never actually stored. We deemed even the application names to be sensitive, so we chose to save only application categories. This process is further described in Section 4.

The content of calls or messages is never logged, but the phone numbers tied to them can be. Since we wanted to keep track of recurring contact with the same person, but not reveal their real phone number, we decided to hash the numbers using the SHA-1 algorithm. While it would be possible to recover a phone number by a brute-force attack, the AWARE implementation offers the option of adding a salt. Thus, by using the username (further described in Section 5) as a salt, the phone numbers are sufficiently protected from inadvertent disclosure risk, while the hashed value is retained even across different application installations. The MAC addresses of detected Wi-Fi and Bluetooth devices are hashed in the same way.
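A minimal sketch of this salted hashing (in Python for brevity; the application itself does this on the phone, and the exact salt ordering is our own assumption):

```python
# Hedged sketch: salted SHA-1 hashing of a phone number, as described above.
# Using the study username as the salt makes hashes stable across reinstalls
# but different between participants, which hinders brute-force lookups.
# The salt-before-number ordering is an assumption, not the app's exact code.
import hashlib

def hash_phone_number(phone_number: str, username: str) -> str:
    salted = (username + phone_number).encode("utf-8")
    return hashlib.sha1(salted).hexdigest()

# The same number contacted twice yields the same trace for one participant:
assert hash_phone_number("+38640123456", "participant01") == \
       hash_phone_number("+38640123456", "participant01")
```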
As part of the upload process, the application Some of the data collected by the STRAW application are name is received in plain text, but only retained until query personal data, so even when storing them securely and after returns its category. After that, the application name is hashed to pseudonymization, some risk of a privacy breach remains. Since enable comparisons with later records and the name in plain text AWARE is widely used in scientific studies it already implements is discarded. In this way, we could build a database of application some privacy enhancing mechanisms. We performed a thorough name hashes and their corresponding categories on the server, application vulnerability analysis and identified several further while not keeping a record of what applications individual users threats to privacy that we wished to address. While the data use. are safely communicated and stored, an involuntary exposure of The server application also provides a simple UI for admin- users’ identity might still be possible. The types of data that de- istrators, where some metadata about the data collection itself serve special attention are applications, communication, location, are shown in forms of tables and charts. We can access data on and voice activity. last upload, number of days of participation, and number of data As mentioned in Section 1, the notifications that other applica-points for each individual user. This enables us to detect any tions send are monitored in the STRAW application. The content problems with data collection and troubleshoot them early. of the notification, such as that of an instant messaging applica- tion or calendar notification, is never actually stored. We deemed 5 CLIENT-SERVER COMMUNICATION AND even the application names to be sensitive, so we chose to only LOGIN save application categories. This process is further described in The STRAW application and other sensing applications are not Section 4. special in the degree they could be subject to external attacks The content of calls or messages is never logged, but the phone [2]. An attacker might want to expose identity of a user or try to numbers tied to them can be. Since we wanted to keep track of reveal their personal data such as location. There are three points recurring contact with the same person, but not reveal their real of entry for an external attacker: local storage, transmission of phone number, we decided to encrypt them using the SHA-1 data, and the servers. algorithm. While it would be possible to decrypt a phone number While the data reside on the device they are saved locally in by a brute-force attack, the AWARE implementation offers the the phone’s storage. According to Android’s documentation, this option of adding a salt. Thus by using the username (further de- database is exclusive to the STRAW application [9]: scribed in Section 5) as a salt, the phone numbers are sufficiently protected from inadvertent disclosure risk, while the hashed Other applications cannot access files stored within value is retained even across different application installations. internal storage. This makes internal storage a good 65 Information Society 2020, 5–9 October 2020, Ljubljana, Slovenia Junoš Lukan, Marko Katrašnik, Larissa Bolliger, Els Clays, and Mitja Luštrek place for application data that other applications [2] Delphine Christin. 2016. Privacy in mobile participatory shouldn’t access. sensing. Current trends and future challenges. 
As described in our previous work [15], voice activity recognition is performed on the phone in its entirety. This means that raw audio recordings can be discarded immediately after processing, and only the calculated features are saved to the database. Alternatively, only the final binary prediction of human voice presence can be retained, but this makes any post-hoc analysis (such as speaker diarization) impossible.

4 SERVER APPLICATION
For the purpose of storing the data on a server, a Python application was implemented in Flask [21], which accepts the data in JSON format and saves them in a PostgreSQL [22] database. In addition to receiving the data and managing credentials (as described in Section 5), it also performs a couple of additional functions.

As mentioned in Section 3, instead of saving application names, we only log their category as classified in the Google Play Store. To reduce the number of queries, we implemented this as a part of the server application. As part of the upload process, the application name is received in plain text, but it is only retained until the query returns its category. After that, the application name is hashed to enable comparisons with later records, and the name in plain text is discarded. In this way, we could build a database of application name hashes and their corresponding categories on the server, while not keeping a record of which applications individual users use.

The server application also provides a simple UI for administrators, where some metadata about the data collection itself are shown in the form of tables and charts. We can access data on the last upload, the number of days of participation, and the number of data points for each individual user. This enables us to detect any problems with data collection and troubleshoot them early.

5 CLIENT-SERVER COMMUNICATION AND LOGIN
The STRAW application and other sensing applications are not special in the degree to which they could be subject to external attacks [2]. An attacker might want to expose the identity of a user or try to reveal their personal data, such as location. There are three points of entry for an external attacker: local storage, transmission of data, and the servers.

While the data reside on the device, they are saved locally in the phone's storage. According to Android's documentation, this database is exclusive to the STRAW application [9]: "Other applications cannot access files stored within internal storage. This makes internal storage a good place for application data that other applications shouldn't access." Additionally, once the data are transmitted to the server, the local database is periodically deleted. This reduces the privacy risk of the database being exposed, while also decreasing the local storage requirements.

It is therefore the transmission of data that we had to secure. The data are transmitted over an encrypted HTTPS connection, which eliminates the risk of exposure during this part of the communication. They are received by an application server residing at the Jožef Stefan Institute (JSI), with a dedicated port listening for incoming transmissions. The application server communicates with another, database server, also residing at JSI. This second server can only be accessed from within the JSI local area network. The database itself is also protected with a password, and the user accessing it via the application server does not have administrator privileges.

Since the STRAW application is a part of a wider study, it is disseminated to recruited participants only. In addition to the data from this application, other data are collected, such as responses to questionnaires in baseline screening and physiological data from wristbands. It was therefore necessary that the data can be linked back to an individual, in order to join the data from various sources. We developed a login method to enable this.

Using OkHttp [19] client-side and Flask-HTTPAuth [12] server-side, we implemented basic access authentication and token authentication [16]. The login credentials are disseminated to registered participants in our study and are input upon the installation of the STRAW application. This serves multiple purposes: by requiring login, we only accept data from actual participants of our study, while we can also use the assigned username to pseudoanonymously link data from various sources.
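A minimal sketch of the server side of such a setup, assuming a Flask app with Flask-HTTPAuth basic authentication in front of a JSON upload endpoint; the route name, the credential store and store_records() are our own illustrative placeholders, not the project's actual code:

```python
# Hedged sketch: a Flask upload endpoint protected with basic authentication
# via Flask-HTTPAuth, loosely following the setup described above. The route,
# the in-memory credential store and store_records() are illustrative only.
from flask import Flask, request, jsonify
from flask_httpauth import HTTPBasicAuth

app = Flask(__name__)
auth = HTTPBasicAuth()

CREDENTIALS = {"participant01": "secret"}  # placeholder; real credentials
                                           # would be stored hashed in a DB

@auth.verify_password
def verify_password(username, password):
    if CREDENTIALS.get(username) == password:
        return username  # the username doubles as the pseudonymization key

@app.route("/upload", methods=["POST"])
@auth.login_required
def upload():
    records = request.get_json()
    store_records(auth.current_user(), records)  # e.g. INSERT into PostgreSQL
    return jsonify(status="ok")

def store_records(username, records):
    pass  # placeholder for the database insert
```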
6 CONCLUSION
The application used in the STRAW project serves a dual purpose: to collect users' answers to questionnaires, and to passively collect data about their environment and phone usage. While the application was tailored to the requirements of our study, this paper outlined the main issues and possible solutions when developing an application for research purposes.

While the AWARE framework provided a solid foundation and especially eased sensor data collection, there are additional challenges that researchers need to face when trying to use an application like this in a scientific study. The data gathered using this application will help us develop improved models of stress recognition [8], integrating physiological data with more detailed contextual data and more reliable self-reports.

ACKNOWLEDGMENTS
The authors acknowledge that the STRAW project was financially supported by the Slovenian Research Agency (ARRS, project ID N2-0081) and by the Research Foundation – Flanders, Belgium (FWO, project no. G.0318.18N).

REFERENCES
[1] Ane Alberdi, Asier Aztiria and Adrian Basarab. 2016. Towards an automatic early stress recognition system for office environments based on multimodal measurements: a review. Journal of Biomedical Informatics, 59, (February 2016), 49–75. doi: 10.1016/j.jbi.2015.11.007.
[2] Delphine Christin. 2016. Privacy in mobile participatory sensing: current trends and future challenges. Journal of Systems and Software, 116, 57–68. doi: 10.1016/j.jss.2015.03.067.
[3] Mihaly Csikszentmihalyi, Reed Larson and Suzanne Prescott. 1977. The ecology of adolescent activity and experience. Journal of Youth and Adolescence, 6, 3, (September 1977), 281–294. doi: 10.1007/bf02138940.
[4] Florian Eyben, Felix Weninger, Florian Gross and Björn Schuller. 2013. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia – MM '13. ACM Press. doi: 10.1145/2502081.2502224.
[5] Denzil Ferreira, Vassilis Kostakos and Anind K. Dey. 2015. AWARE: mobile context instrumentation framework. Frontiers in ICT, 2, 6, 1–9. issn: 2297-198X. doi: 10.3389/fict.2015.00006.
[6] Foursquare. [n. d.] Places SDK: venue search. Retrieved 26/08/2020 from https://developer.foursquare.com/docs/api-reference/venues/search/.
[7] Eibe Frank, Mark A. Hall and Ian H. Witten. 2016. The WEKA Workbench (4th edition). Morgan Kaufmann.
[8] Martin Gjoreski, Mitja Luštrek, Matjaž Gams and Hristijan Gjoreski. 2017. Monitoring stress with a wrist device using context. Journal of Biomedical Informatics, 73, 159–170. issn: 1532-0464. doi: 10.1016/j.jbi.2017.08.006.
[9] Google. [n. d.] Access app-specific files: access from internal storage. Retrieved 26/08/2020 from https://developer.android.com/training/data-storage/app-specific.
[10] Google. [n. d.] Adapt your app by understanding what users are doing. Retrieved 26/08/2020 from https://developers.google.com/location-context/activity-recognition.
[11] Google. [n. d.] IntentService. Retrieved 26/08/2020 from https://developer.android.com/reference/android/app/IntentService.html.
[12] Miguel Grinberg. [n. d.] Flask-HTTPAuth. Retrieved 26/08/2020 from https://flask-httpauth.readthedocs.io/en/latest/.
[13] D. Richard Hipp, Dan Kennedy and Joe Mistachkin. 2019. SQLite. Computer software. (2019). https://sqlite.org/index.html.
[14] Gillian H. Ice and Gary D. James, editors. 2006. Measuring emotional and behavioral response: general principles. In Measuring Stress in Humans: A Practical Guide for the Field, Part II – Measuring stress responses. Cambridge University Press, Cambridge, UK, (December 2006). Chapter 3, 60–93. isbn: 978-0-521-84479-6.
[15] Marko Katrašnik, Junoš Lukan, Mitja Luštrek and Vitomir Štruc. 2019. Razvoj postopka diarizacije govorcev z algoritmi strojnega učenja. In Proceedings of the 22nd International Multiconference INFORMATION SOCIETY – IS 2019, Slovenian Conference on Artificial Intelligence. Mitja Luštrek, Matjaž Gams and Rok Piltaver, editors. Volume A, 57–60. https://is.ijs.si/archive/proceedings/2018/files/Zbornik%20-%20A.pdf.
[16] Chris Schmidt. 2001. Token based authentication. In Accepted Papers for FOAF-Galway: 1st Workshop on Friend of a Friend, Social Networking and the Semantic Web. https://www.w3.org/2001/sw/Europe/events/foaf-galway/papers/fp/token_based_authentication/.
[17] Philip Schmidt, Attila Reiss, Robert Duerichen and Kristof Van Laerhoven. 2018. Wearable affect and stress recognition: a review. (21st November 2018).
[18] Joanna Smith. 2016. Scheduling jobs like a pro with JobScheduler. https://medium.com/google-developers/scheduling-jobs-like-a-pro-with-jobscheduler-286ef8510129.
[19] Square, Inc. 2019. OkHttp. Computer software. (2019). https://square.github.io/okhttp/.
[20] Arthur A. Stone and Saul Shiffman. 1994. Ecological momentary assessment (EMA) in behavioral medicine. Annals of Behavioral Medicine, 16, 3, 199–202. doi: 10.1093/abm/16.3.199.
[21] The Pallets team. 2010. Flask. Computer software. (2010). http://flask.pocoo.org/.
[22] The PostgreSQL Global Development Group. 2019. PostgreSQL 11.3 Documentation. Version 11.3.
[23] Marija Trajanoska, Marko Katrašnik, Junoš Lukan, Martin Gjoreski, Hristijan Gjoreski and Mitja Luštrek. 2018. Context-aware stress detection in the AWARE framework. In Proceedings of the 21st International Multiconference INFORMATION SOCIETY – IS 2018, Slovenian Conference on Artificial Intelligence. Mitja Luštrek, Rok Piltaver and Matjaž Gams, editors. Volume A, 25–28. https://is.ijs.si/archive/proceedings/2018/files/Zbornik%20-%20A.pdf.
URBANITE H2020 Project: Algorithms and Simulation Techniques for Decision-Makers

Alina Machidon (alina.machidon@ijs.si), Maj Smerkol (maj.smerkol@ijs.si), Matjaž Gams (matjaz.gams@ijs.si)
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia

ABSTRACT
URBANITE (Supporting the decision-making in URBAN transformation with the use of dIsruptive TEchnologies) is an H2020 project with the goal of providing an ecosystem model that articulates the expectations, trust and attitude of civil servants, citizens and other stakeholders in the use of disruptive technologies. This model will be supported with the provision of a data management platform and algorithms for data-driven decision-making in the field of urban transformation. One of the main outputs of the project will be a decision-support system including (AI-based) predictive algorithms and simulation models for mobility that support the decision-making process by analyzing the current situation and the trends that occurred in a certain time frame, and by allowing future situations to be predicted when changing one or more variables. URBANITE will analyze the impact, trust and attitudes of civil servants, citizens and other stakeholders with respect to the integration of disruptive technologies such as Artificial Intelligence (AI), Decision Support Systems (DSS), big data analytics and predictive algorithms in a data-driven decision-making process. The results of the project will be validated in four real use cases: Amsterdam, Bilbao, Helsinki and Messina. This paper overviews the current state of the project's progress.

KEYWORDS
AI, Big Data, DSS, disruptive technologies, URBANITE project

1 INTRODUCTION
In recent times, cities and urban environments have been facing a revolution in urban mobility, bringing up unforeseen consequences that public administrations need to manage. It is in this new context that public administrations and policy-makers need means to help them understand this new scenario, supporting them in making policy-related decisions and predicting eventualities. Traditional technological solutions are no longer adequate for this situation, and therefore disruptive technologies, such as big data analytics, predictive algorithms, and decision support systems profiting from artificial intelligence techniques, come into place to support policy-makers.

The main technical objective of the URBANITE project is the development of advanced AI algorithms for the analysis of big data on mobility. The developed methods and tools will provide substantial support for policy-makers to tackle complex policy problems in the mobility domain and will enable their validation on case-specific models. The goal of the activities will be to implement novel tools and services in order to enable policy-makers to use advanced data analysis and machine learning methods during the design of novel policies for a specific city.

URBANITE will allow the analysis of the traffic flows that are currently happening and that have happened up until that moment. In addition to the visualization of the traffic, the usage of economy-sharing vehicles and other aspects, URBANITE will analyse the bottlenecks and critical points, based on a set of parameters to be determined by the civil servants. Because historic data is stored, trends can be determined by URBANITE with big data algorithms. These trend analyses can entail the understanding of, for instance, the use of a certain transportation system (e.g. bikes) in a certain neighbourhood of the municipality, or the peak hours in which a street is blocked. URBANITE will also provide means to simulate the effect of different situations, such as opening a pedestrian street at certain times, or the location of electric charging stations or bike-sharing points, through the implementation of artificial intelligence algorithms. To achieve that, URBANITE will first build generic models from the data across all the cities and then provide adaptation mechanisms to apply these models to the different use cases. From the data available, URBANITE will extract and formalize knowledge and then, through a combination of classification, regression, clustering, and frequent pattern mining algorithms, arrive at decisions and actionable models that will enable city policy-makers to simulate and assess the outcomes and implications of new policies.
2 SYSTEM'S ARCHITECTURE
The URBANITE project will combine various data sources, algorithms, libraries and tools that provide the best solutions within the scope of the project. The technical 'core' of the project has to fulfill the following objectives:

• Deploy tools for big data exploration with the active involvement of policy-makers.
• Design methods for the detection of important events that need to be addressed.

In order to provide the desired functionalities, several state-of-the-art technologies are currently being examined and tested, in order to be adapted, customized and integrated into the platform. A simplified preliminary architecture is presented in Figure 1.

2.1 Data Analysis Module
One of the first tasks involves the development of various methods for exploratory data analysis and user interaction. Multimodal methods, tools and services for big data on urban mobility will be implemented, providing exploratory analysis capabilities and enabling policy-makers to actively search for causal relations in the data.
Figure 1: High Level Architecture of the URBANITE Platform.

The methods to be included in the platform can be segmented into four main groups (a small illustration of the first two follows the list):

• clustering, where the main goal is to reduce the amount of data by grouping together similar instances. The implemented method will provide mechanisms to group instances based on GIS data or on any subset of attributes that users define. For example, platform users might choose to cluster all instances based on the type of transportation used (shared bikes, electric cars, etc.);
• projection methods, which will be used to reduce the dimensionality of the data items. The goal of these methods is to represent the data in a lower-dimensional space in such a way that the key relations of the data structures are preserved. The results of these methods can be used to visualize the data more clearly, or the transformed data can be used in the next rounds of analysis;
• self-organizing maps, which involve the use of a type of artificial neural network trained in an unsupervised manner. The method can at the same time reduce the amount of data (similar to clustering) and nonlinearly project the data into lower dimensionalities;
• prediction/regression methods, or classification models, that will allow the data to be exploited.
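As a toy illustration of the first two groups (our own minimal example, not URBANITE code), trips described by a few numeric attributes can be clustered and projected with standard scikit-learn components:

```python
# Hedged toy example of the clustering and projection method groups above,
# using scikit-learn; the synthetic "trip" attributes are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder numeric trip attributes (e.g. duration, distance, hour, mode).
trips = rng.normal(size=(500, 4))

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(trips)
projected = PCA(n_components=2).fit_transform(trips)  # 2-D view for plotting

print(np.bincount(clusters))   # cluster sizes
print(projected.shape)         # (500, 2)
```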
2.2 Recommendation Engine
Recommendation engines (also known as recommender systems) are information filtering systems that deal with the problem of information overload [6] by filtering key information 'chunks' out of a large amount of dynamically generated information, according to the user's preferences, interests, or observed behavior regarding an item [8][5]. Recommendation engines have the ability to predict whether a particular user would prefer an item or not, based on the user's profile [5]. A recommendation engine is defined as a decision-making strategy for users under complex information environments [4]. Recently, various approaches for building recommendation engines have been developed, based on either collaborative filtering, content-based filtering or hybrid filtering [12], [11], [9].

The URBANITE recommendation engine will identify and predict important or problematic events related to mobility and will provide suggestions to tackle the issue. The policy support system will provide support to the policy-makers for identifying possible policies that tackle events based on specific criteria. The inputs will have to be aggregated for effective decision-making using hierarchical multi-criteria decision models.

2.3 Policy Simulation and Validation Engine
Simulation transparency is a vital feature of the decision-making process when quantitative computer tools are used to justify strategies [10]. Simulation predictions can play a catalytic role in the development of public policies, in the elaboration of safety procedures, and in establishing legal liability. Hence, given the impact that modelling and simulation predictions are known to have, the credibility of the computational results is of crucial importance to engineering designers and managers, but also to public servants and to all citizens affected by the decisions that are based on these predictions [10].

To create trust and increase the credibility of the model and of the delivered simulation results, it is crucial to devise a validation strategy with which non-simulation-trained end-users can feel comfortable, so that they trust the simulation model [10].

In the URBANITE project, the policy simulation and validation module will provide methods and tools to simulate the efficiency of specific policies in the target domain. Given a new policy, an urban mobility model and the target parameters, the system can evaluate the performance of the new policy based on the observed parameters. The implementation of credible traffic simulations for an entire city has been addressed by various projects; however, it is not yet adequately solved, due to its complexity. In URBANITE, the constructed model will be used to predict and classify traffic flow changes based on the provided changes in the new policies. Policy-makers will select the defined KPIs that need to be evaluated by the validation engine, and based on the scores the new policies achieve, they will be able to make an informed decision about which policies should be deployed in the city.

2.4 Advanced Visualization Methods
Another important task will be the implementation of advanced visualizations of mobility patterns, important events, and results of policy validations. The main visualization functionalities will present the information on a combination of map layers, describing where in the city specific events or sequences of events occurred. Visualizations will involve the use of heat maps, traffic flow graphics, and other transportation clusters. Users will be able to change and interact with the visualization parameters, for example to select specific time ranges, zoom, highlight, or display additional information. Considering the variety and characteristics of the data, one concern regards depicting multidimensional data in a human-perceivable manner. Several graphical methods are customarily used for a preliminary analysis of generic multivariate datasets [2]: scatter plots, pie charts and bar plots, histograms, box plots, violin and bean plots, spider/radar/star/polar plots, glyph plots, mosaic and spine plots, treemaps, and others.

Traffic datasets are generally high-dimensional or spatio-temporal [3]; thus, visualizing traffic data mostly employs information visualization and visual analytics. Traffic data contain multiple variables, of which the most important ones are time and space.
Several different types of visualisation are currently used for traffic data, among them visualization of time, visualization of spatial properties, and spatio-temporal visualization. Location is the main spatial property of traffic data. Based on the aggregation level of the location information, visualization of spatial properties can be categorized into three classes: point-based visualization (no aggregation), line-based visualization (first-order aggregation), and region-based visualization (second-order aggregation) [3]. Heatmaps are the most used visualisation tools to show the integrated quantity of a large number of objects on a map.

A preliminary user interface prototype is depicted in Figure 2.

Figure 2: User Interface Mock-up of the URBANITE Platform.

3 DATA SOURCES
There are several collection procedures for traffic-related data, ranging from sensor readings to airborne imagery and social media data [13]. The involvement of the municipalities of Bilbao, Helsinki, Amsterdam and Messina will provide a wide range of data sources related to urban mobility, along with the public, open-source ones. Several types of data sources were identified for the URBANITE project:

• geospatial data, e.g. maps (OpenStreetMap, https://www.openstreetmap.org/, but also proprietary maps of the cities)
• additional info such as: car and lorry registration, information on parking lots, dynamic parking data, cadastre information, commercial register, care services, tourism accommodation
• demographics: statistical information on the number of inhabitants of different city districts, the number of households, the population's age brackets, city boundaries, etc.
• public transportation: tram and metro lines, static and dynamic information about the public bus transport service, the GPS positions of the buses
• traffic data: counts of car traffic and speeds, traffic status in real time, vehicle counts on the ring roads, etc.
• bicycle information: bike counters, bicycle collection points, calculated numbers of bikes in specific road segments, CityBikes (https://api.citybik.es/v2/)
• pedestrian: manual counts of pedestrians
• electric charging stations
• taxi stops available
• harbour transport data, ferry traffic statistics
• geographic airport information
• air quality (OpenAQ, https://openaq.org/)
• noise maps
• weather data (OpenWeatherMap, https://openweathermap.org/)

The format of these datasets varies from JSON, XML, CSV and XLSX to WMS, GeoJSON and GML. The main issue with the mobility-related data sources is the high level of heterogeneity, both in terms of data format and of data availability. Most of the cities involved in the project have some data related to traffic in the city, for example, but the format of the data, the level of granularity (how often the data is updated) and the availability of historical data (for how long the city stores historical data) vary greatly from one case to another.
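As a small, hypothetical illustration of this heterogeneity problem (the file names and fields are ours, not the project's), records from a CSV feed and a JSON feed can be normalized into one table with pandas:

```python
# Hedged sketch: harmonizing two differently formatted traffic-count feeds
# into one table. File names and field names are hypothetical placeholders.
import json
import pandas as pd

# Source 1: a CSV export with columns sensor_id, timestamp, vehicles.
csv_counts = pd.read_csv("city_a_counts.csv")

# Source 2: a JSON feed with nested records.
with open("city_b_counts.json") as f:
    raw = json.load(f)
json_counts = pd.json_normalize(raw["records"]).rename(
    columns={"sensor.id": "sensor_id", "time": "timestamp", "n": "vehicles"})

# A common schema: one timestamp convention, one row per sensor reading.
frames = []
for df in (csv_counts, json_counts):
    df = df[["sensor_id", "timestamp", "vehicles"]].copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
    frames.append(df)
counts = pd.concat(frames, ignore_index=True)
```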
Another special aspect that needs to be addressed is the impact of COVID-19 on the mobility sector. Since COVID-19 has disrupted the social, economic and political aspects of life, the urban mobility area was also affected. Some analyses [1] revealed that overall mobility fell by up to 76 %, public transport use dropped by up to 93 %, NO2 emissions were reduced by up to 60 %, and traffic accidents were reduced by up to 67 % in relative terms. This phenomenon of experiencing unexpected changes of concepts or data characteristics over time is referred to as concept drift [7], and it is one of the key challenges that the URBANITE project will need to deal with when choosing how to make the most appropriate predictions regarding the impact of various traffic policies. The algorithms developed should take the stability-plasticity dilemma into consideration as a reference, especially since it is still difficult to predict how the crisis derived from the pandemic will evolve and what urban mobility will look like afterwards.

Figure 3: Data Sources for the URBANITE Platform.

4 CONCLUSIONS
The technical core of the URBANITE project focuses on the development of advanced AI algorithms for the analysis of big data on mobility. The developed methods and tools will provide substantial support for policy-makers to tackle complex policy problems in the mobility domain and will enable their validation on case-specific models. The goal of the activities is to implement novel tools and services in order to enable policy-makers to use advanced data analysis and machine learning methods during the design of novel policies for a specific city.

One underlining factor in URBANITE is the adaptation of everything that is created to civil servants, citizens and interested parties who may or may not be digitally literate. The use of big data techniques and artificial intelligence algorithms is, up till now, not a common skill among public servants, and this is one of the reasons why the data analysis processes and user interaction mechanisms described in this work are developed with the abilities of non-experts in mind too.

ACKNOWLEDGMENTS
This paper is supported by the European Union's Horizon 2020 Research and Innovation Programme, URBANITE project, under Grant Agreement No. 870338.

REFERENCES
[1] Alfredo Aloi, Borja Alonso, Juan Benavente, Rubén Cordera, Eneko Echániz, Felipe González, Claudio Ladisa, Raquel Lezama-Romanelli, Álvaro López-Parra, Vittorio Mazzei, et al. 2020. Effects of the covid-19 lockdown on urban mobility: empirical evidence from the city of Santander (Spain). Sustainability, 12, 9, 3870.
[2] Sunith Bandaru, Amos H. C. Ng, and Kalyanmoy Deb. 2017. Data mining methods for knowledge discovery in multi-objective optimization: part A - survey. Expert Systems with Applications, 70, 139–159.
[3] Wei Chen, Fangzhou Guo, and Fei-Yue Wang. 2015. A survey of traffic data visualization. IEEE Transactions on Intelligent Transportation Systems, 16, 6, 2970–2984.
[4] Ricardo Colomo-Palacios, Francisco José García-Peñalvo, Vladimir Stantchev, and Sanjay Misra. 2017. Towards a social and context-aware mobile recommendation system for tourism. Pervasive and Mobile Computing, 38, 505–515.
[5] F. O. Isinkaye, Y. O. Folajimi, and B. A. Ojokoh. 2015. Recommendation systems: principles, methods and evaluation. Egyptian Informatics Journal, 16, 3, 261–273. issn: 1110-8665. doi: 10.1016/j.eij.2015.06.005.
[6] Joseph A. Konstan and John Riedl. 2012. Recommender systems: from algorithms to user experience. User Modeling and User-Adapted Interaction, 22, 1-2, 101–123.
[7] Jesus L. Lobo, Javier Del Ser, Miren Nekane Bilbao, Ibai Lana, and Sancho Salcedo-Sanz. 2016. A probabilistic sample matchmaking strategy for imbalanced data streams with concept drift. In International Symposium on Intelligent and Distributed Computing. Springer, 237–246.
[8] Chenguang Pan and Wenxin Li. 2010. Research paper recommendation with topic analysis. In 2010 International Conference on Computer Design and Applications. Volume 4. IEEE, V4-264.
[9] Nymphia Pereira and Satishkumar L. Varma. 2019. Financial planning recommendation system using content-based collaborative and demographic filtering. In Smart Innovations in Communication and Computational Sciences. Springer, 141–151.
[10] Miquel Angel Piera, Roman Buil, and Egils Ginters. 2013. Validation of agent-based urban policy models by means of state space analysis. In 2013 8th EUROSIM Congress on Modelling and Simulation. IEEE, 403–408.
[11] Tomasz Rutkowski, Jakub Romanowski, Piotr Woldan, Paweł Staszewski, Radosław Nielek, and Leszek Rutkowski. 2018. A content-based recommendation system using neuro-fuzzy approach. In 2018 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE). IEEE, 1–8.
[12] Diego Sánchez-Moreno, Ana B. Gil González, M. Dolores Muñoz Vicente, Vivian F. López Batista, and María N. Moreno García. 2016. A collaborative filtering method for music recommendation using playing coefficients for artists and users. Expert Systems with Applications, 66, 234–244.
[13] G. Zhou, J. Lu, C.-Y. Wan, M. D. Yarvis, and J. A. Stankovic. 2008. Body Sensor Networks. MIT Press, Cambridge, MA.
Towards End-to-end Text to Speech Synthesis in Macedonian Language

Marija Neceva, Emilija Stoilkovska, Hristijan Gjoreski (mneceva@gmail.com, emi.stoilkovska@gmail.com, hristijang@feit.ukim.edu.mk)
Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Skopje, N. Macedonia

ABSTRACT
A text-to-speech (TTS) synthesis system typically consists of multiple stages: a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may involve brittle design choices. This paper presents an end-to-end deep learning approach to speech synthesis in the Macedonian language. The developed model uses Google's Tacotron architecture and is able to generate speech from text from multiple speakers using an attention mechanism. It consists of three parts: an encoder, an attention-based decoder and a post-processing network. The model was trained on a dataset recorded by five mixed-gender speakers, resulting in 25.5 hours of data, or 13,101 pairs of text-speech segments. The results show that the model successfully generates speech from text data, which was empirically shown using a quantitative questionnaire answered by 42 subjects.

KEYWORDS
text-to-speech, deep learning, tacotron, multi-speaker, seq2seq, text, audio, attention

1 INTRODUCTION
Modern TTS pipelines are complex [1]. For example, statistical parametric ones have a text frontend extracting various linguistic features, a duration model, an acoustic feature prediction model and a complex signal-processing-based vocoder [2][3]. These components usually require extensive domain expertise, are laborious to design, and must be trained independently. Consequently, errors from each component may compound. In contrast, implementing an integrated end-to-end TTS system offers many advantages. First, it can be trained on text-audio pairs with minimal human annotation. It also alleviates the need for laborious feature engineering. Further, it allows rich conditioning on various attributes, such as speaker or language, or on high-level features like sentiment. Similarly, adaptation to new data might also be easier. Finally, a single model is likely to be more robust than a multi-stage one. All these advantages imply that an end-to-end system allows training on huge amounts of real-world data.

However, since TTS is a large-scale inverse problem, and because of the existence of different pronunciations and speaking styles, decompressing a highly compressed source text into audio may cause difficulties in the learning task of an end-to-end model. The main problem is coping with large variations at the signal level for a given input. Moreover, unlike end-to-end speech recognition [4] or machine translation [5], TTS outputs are continuous and much longer than the input sequences.

Mainly referring to the advantages of end-to-end systems, this paper proposes an implementation of Google's Tacotron model as a TTS system for the Macedonian language. Tacotron is an end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) model [6] with the attention paradigm [7]. This model takes characters as input and outputs a raw spectrogram. We implemented our own version of Tacotron, based on a few published articles. What we kept is their deep learning architecture, but we made some changes to the model's hyperparameters and other utilities (like the known symbols, numbers, etc.). That way the model was adapted to work with Cyrillic. Given text-audio pairs, our Tacotron model was trained completely from scratch, only on our dataset. It does not require phoneme-level alignment, so it can easily scale to using large amounts of acoustic data with transcripts.

2 RELATED WORK
WaveNet [8] is a powerful, non-end-to-end, generative audio model which works well for TTS synthesis. It is used as a replacement of the vocoder and acoustic model of the system. It can be slow due to its sample-level autoregressive nature, and it also requires conditioning on linguistic features from an existing TTS frontend.

Deep Voice [9] is a neural model which replaces every component in a typical TTS pipeline with a corresponding neural network. However, each component is trained independently, and it is nontrivial to change the system to train in an end-to-end fashion.

Wang et al. [10] present one of the first studies of end-to-end TTS using seq2seq with attention. However, it requires a pre-trained hidden Markov model (HMM) aligner to help the seq2seq model learn the alignment, as well as a vocoder, since it predicts vocoder parameters. Furthermore, the model is trained on phoneme inputs, with the possibility of hurting the prosody, and it produced limited experimental results.

Char2Wav [11] is an independently developed end-to-end model that can be trained on characters. However, it still predicts vocoder parameters before using a SampleRNN neural vocoder [12], and its seq2seq and SampleRNN models need to be pre-trained separately.

MAIKA [26] is a Macedonian TTS project that was made public a few months ago. However, there is no documentation of how it works. It is therefore technically challenging to
al [10] presents one of the first studies of end-to-component may compound. Otherwise, implementing an end TTS using seq2seq with attention. However, it requires a integrated end-to-end TTS system offers many advantages. pre-trained hidden Markov model (HMM) aligner to help the First, it can be trained on pairs with minimal seq2seq model learn the alignment and a vocoder due to human annotation. It also alleviates the need for laborious predicting vocoder parameters. Furthermore, the model is feature engineering. Further, it allows rich conditioning on trained on phoneme inputs with possibilities of hurting the various attributes, such as speaker or language, or high-level prosody and producing limited experimental results. features like sentiment. Similarly, adaptation to new data might also be easier. Finally, a single model is likely to be Char2Wav [11] is an independently developed end-to-end more robust than a multi-stage. All these advantages imply model that can be trained on characters. However, it still that an end-to-end system allows training on huge amounts predicts vocoder parameters before using a SampleRNN real world data. But knowing that TTS is a large-scale inverse neural vocoder [12] and their seq2seq and SampleRNN problem and due to existence of different pronunciations or models need to be separately pre-trained. speaking styles, decompressing a highly compressed source MAIKA [26] is a Macedonian TTS project that was made text into audio may cause difficulties in the learning task of public few months ago. However, there is no documentation an end-to-end model. The main problem is coping with large of how it works. Therefore, it is technically challenging to variations at the signal level for a given input. Moreover, 72 compare with a system that only has web interface which 3.3 Decoder generates sound. Tacotron model uses a content-based tanh attention decoder eSpeak [27] is an open source TTS project that also supports [18], where a stateful recurrent layer produces the attention Macedonian language. The documentation states that the query at each decoder time step. The input of decoder’s RNN Macedonian model is based on the Croatian - which has its is formed by concatenating the context vector and the limitations since the Macedonian language is quite different, attention RNN cell output. Decoder’s internal structure is a especially the pronunciation and the grammar. stack of GRUs with vertical residual connections [5], used for 3 speeding up convergence. A simple fully-connected output MODEL ARCHITECTURE layer is used to predict the decoder targets. Its target is 80- The backbone of Tacotron is a seq2seq model with attention band mel-scale spectrogram, later converted to waveform by [7][13]. Figure 1 illustrates the model, which includes an a post-processing network. It predicts multiple, non-encoder, an attention-based decoder, and a post-processing overlapping, output frames at each decoder step. Predicting net. At a high-level, this model takes characters as input and r frames at once divides the total number of decoder steps by produces spectrogram frames, which are later converted to r, which reduces model size, training and inference time and waveforms. These components are described below. increases convergence speed. This is likely because neighboring speech frames are correlated and each character usually corresponds to multiple frames, plus emitting multiple frames allows the attention to move forward early in training. 
3.2 Encoder

The encoder extracts robust sequential representations of the text. The input to the encoder is a character sequence, with each character represented as a one-hot vector and embedded into a continuous vector. A set of non-linear transformations, known as a "pre-net", is applied to each embedding. The pre-net is a bottleneck layer with dropout, which helps convergence and improves generalization. A CBHG module then transforms the pre-net outputs into the final encoder representation used by the attention module. The CBHG-based encoder reduces overfitting and makes fewer mispronunciations than a standard multi-layer RNN encoder.

3.3 Decoder

Tacotron uses a content-based tanh attention decoder [18], where a stateful recurrent layer produces the attention query at each decoder time step. The input to the decoder's RNN is formed by concatenating the context vector and the attention RNN cell output. The decoder's internal structure is a stack of GRUs with vertical residual connections [5], used to speed up convergence. A simple fully-connected output layer predicts the decoder targets. The target is an 80-band mel-scale spectrogram, later converted to a waveform by a post-processing network. The decoder predicts multiple non-overlapping output frames at each step. Predicting r frames at once divides the total number of decoder steps by r, which reduces model size, training and inference time, and increases convergence speed. This is likely because neighboring speech frames are correlated and each character usually corresponds to multiple frames, while emitting multiple frames allows the attention to move forward early in training. To define the input of the next decoding step, the "teacher forcing" mechanism is used: at each time step, the decoder's input is the ground-truth value of the previously predicted decoder output.

3.4 Attention Mechanism

The attention mechanism is applied in order to learn mappings between input and output sequences through gradient descent and back-propagation. It is a way for the decoder to learn which internal state of the encoder deserves more attention at each time step when generating the current output. The whole process of calculating the attention weights and using them to form the decoder input is illustrated in Figure 2.

Figure 2: What is behind the attention mechanism

3.5 Post-processing Net and Waveform Synthesis

The post-processing net converts the seq2seq target into a form that can be synthesized into waveforms [20][21]. Since Griffin-Lim is used as the synthesizer, the post-processing net learns to predict the spectral magnitude sampled on a linear-frequency scale. The Griffin-Lim algorithm converges towards an estimated phase, whose quality depends on the number of iterations applied; although more iterations may lead to overfitting, they produce better audio. In our setup, Griffin-Lim converges after 50 iterations, even though 30 iterations seem to be enough.
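As an illustration of this synthesis step, the snippet below reconstructs audio from a magnitude spectrogram with librosa's Griffin-Lim routine, using the analysis parameters reported in Section 3.6 below; the input tone is only a stand-in for the network's predicted spectrogram.

```python
import numpy as np
import librosa

sr = 24000                    # sampling rate used in the paper
n_fft = 2048                  # 2048-point Fourier transform
win = int(0.050 * sr)         # 50 ms Hann window -> 1200 samples
hop = int(0.0125 * sr)        # 12.5 ms frame shift -> 300 samples

y = librosa.tone(440, sr=sr, duration=1.0)        # stand-in waveform
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop,
                        win_length=win, window="hann"))

# Griffin-Lim iteratively estimates the phase from the magnitude alone;
# the paper reports convergence after 50 iterations.
y_hat = librosa.griffinlim(S, n_iter=50, hop_length=hop,
                           win_length=win, window="hann")
```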
3.6 Model Parameters

The log-magnitude spectrogram is obtained using Hann windowing with a 50 ms frame length, a 12.5 ms frame shift and a 2048-point Fourier transform. A 24 kHz sampling rate was used for all experiments. For both the seq2seq decoder (mel-scale spectrogram) and the post-processing net (linear-scale spectrogram), a simple L1 loss with equal weights is used. The model was trained with a batch size of 4, with all sequences padded to a maximum length.

4 DATASET

There is no public dataset of audio data in the Macedonian language, so we had to create one. We used publicly available books in Macedonian from the website of the National Association of the Blind of the Republic of North Macedonia. The books were recorded by 5 speakers, 3 male and 2 female. The recordings were segmented using an algorithm that splits the input audio based on silence length and a threshold, with the silence length varying between 700 and 1000 ms (a sketch of this step is given after Table 1). The audio clips were additionally padded with 700 ms at both the beginning and the end to avoid sudden cut-offs.

Next, the audio files were transcribed manually, aided by the written version of the audio book. The transcriptions are void of any punctuation, capitalization or special characters, including numbers; they include only the 31 letters of the Macedonian alphabet and the space character separating words. The reason is that the initial dataset was also used for another task (speech recognition), for which the punctuation was removed, and at this stage we could not retrieve the original raw data that includes it. The final dataset contains 13,101 audio files and transcripts in the Macedonian language [25]. Additional statistics about the dataset are listed in Table 1. It should be noted that the goal is not the dataset itself, but developing an end-to-end, multi-speaker, deep learning TTS for the Macedonian language. A detailed language analysis of the dataset is planned for another study, which will focus more on the linguistic aspects of the dataset.

Table 1: Dataset statistics

  Total clips         13,101
  Total words         188,521
  Distinct words      28,791
  Total duration      25:36:20
  Mean clip duration  7.04 sec
  Min clip duration   0.73 sec
  Max clip duration   97.6 sec (1 min 37.6 sec)
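A minimal sketch of the silence-based segmentation step using pydub; the input file name and the dB threshold are assumptions, since the paper only specifies the silence length and the 700 ms padding.

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

book = AudioSegment.from_file("audiobook.mp3")  # hypothetical input file

clips = split_on_silence(
    book,
    min_silence_len=700,              # ms; the paper uses 700-1000 ms
    silence_thresh=book.dBFS - 16,    # assumed threshold; not stated in the paper
    keep_silence=700)                 # keep 700 ms of padding at both ends

for i, clip in enumerate(clips):
    clip.export(f"clip_{i:05d}.wav", format="wav")
```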
5 TRAINING AND EVALUATION

5.1 Training

During the training phase, an output is produced every 1000 steps; producing an output takes a few seconds. Each output contains five files: three give information about the model formed up to that step, while the other two are an alignment plot and an audio file synthesized by that model. The synthesized audio file is used to check the quality of the current model. The alignment plot shows whether the decoder has learned which input state of the encoder is important for producing its current output: if there is an "A" on the input, an "A" should be produced as sound on the output. A good alignment plot is one that looks like a diagonal line. The system was trained for 5 days, reaching 412,000 steps and yielding 412 models. It started showing a good alignment at step 63,000. The last model was chosen as the reference one; its training and test results sound much better and are more understandable than those generated by the other models.

5.2 Evaluation

To estimate the model's performance, we used 10 out of 14 random sentences as test examples. The results show that more than half of the synthesized audio files [22] successfully represented the input sequence of the model. This was empirically shown using a quantitative questionnaire [23] answered by 42 subjects: 10 IT experts and 32 volunteers from the general public. The questionnaire consisted of 10 stages, one for each of the 10 audio files; 10 test examples were chosen to keep the questionnaire compact and quick for the evaluators. Each stage contains 3 sub-questions about the currently observed audio file. The Mean Opinion Score (MOS) [24] was used as the scoring measure. MOS is a subjective measure of audio quality, used to test the listener's perception of the quality and clarity of the audio. Each audio file had to be scored from 1 to 5 on three criteria: naturalness, intelligibility and accuracy, where naturalness stands for the similarity of the produced audio to natural human speech, intelligibility for the clarity of the spoken words, and accuracy for how well the spoken sequence corresponds to the original text that was required to be spoken.

The results of the questionnaire are shown in Table 2. Each row of the table gives the MOS for one of the three criteria, calculated separately for the experts and the volunteers; the scores for each criterion are summed and then averaged (a small sketch of this aggregation follows at the end of this subsection). The experts scored the model's performance higher than the volunteers on every criterion, with a total score 0.265 higher. We speculate that the reason is that the experts also take into account the technical challenges and aspects of such a system, while the volunteers simply evaluate the sound and its quality.

Table 2: MOS score results

                   Experts  Volunteers
  Accuracy         4.8      4.6
  Intelligibility  4.5      4.2
  Naturalness      4.1      3.9
  Total            4.5      4.2

Additionally, Figures 3 and 4 show the box plots of the answers given by the volunteers and the experts, respectively. The figures show that accuracy achieves the highest score and naturalness the lowest. We speculate that the reason for the low naturalness score is the presence of sudden pauses where words should be spoken, or of mumbling instead of clear pronunciation; there are only a few such occurrences.

Figure 3: Box plot of all grades given by the volunteers
Figure 4: Box plot of all grades given by the IT experts
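Computing the MOS from the questionnaire reduces to averaging the 1-5 scores per criterion and evaluator group; a small pandas sketch with made-up answers:

```python
import pandas as pd

# Hypothetical long-format answers: one row per (evaluator group, criterion).
answers = pd.DataFrame({
    "group":     ["expert", "expert", "volunteer", "volunteer"],
    "criterion": ["accuracy", "naturalness", "accuracy", "naturalness"],
    "score":     [5, 4, 4, 4],
})

# MOS = mean of the 1-5 scores, per criterion and per evaluator group.
mos = answers.groupby(["criterion", "group"])["score"].mean().unstack()
print(mos)
```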
6 CONCLUSION

The paper presented an end-to-end deep learning approach to speech synthesis in the Macedonian language. The developed model uses Google's Tacotron architecture and generates speech from the text of multiple speakers using an attention mechanism. The approach consists of three parts: an encoder, an attention-based decoder and a post-processing network. The model was trained on a dataset recorded by five mixed-gender speakers, resulting in nearly 25.5 hours of data. The results show that the model successfully generates speech from text, which was empirically shown using a quantitative questionnaire answered by 42 subjects.

To the best of our knowledge, this is the first end-to-end multi-speaker deep learning model for the Macedonian language. We strongly believe that it will serve as a benchmark and motivation for future studies, eventually leading to a decent TTS system for Macedonian, which would have significant societal impact.

Some limitations of the model are the gender diversity of the speakers and the limited dataset. There is definitely room for improvement, and the dataset probably plays a crucial role in it. However, data collection is an extensive and very time-consuming task, and with the given dataset we cannot estimate or empirically evaluate how much more data is needed to achieve state-of-the-art intelligibility and naturalness of artificially created speech. Additionally, a few of the generated samples contain pauses at places where a word should be spoken. The reason is that, when generating sound, the model uses character embeddings with a specific ordering learned during training; if those embeddings were never seen during training, the model will not be able to pronounce them properly. Note that this does not happen for all words absent from the training data, only on very rare occasions; normally, the model still generates speech even for words not present in the dataset.

ACKNOWLEDGEMENT

We are thankful for the support of the NVIDIA Corporation and their generous donation of a Titan Xp GPU.

REFERENCES
[1] P. Taylor. Text-to-speech synthesis. Cambridge University Press, 2009.
[2] H. Zen, K. Tokuda, A. W. Black. Statistical parametric speech synthesis. Speech Communication, 51(11):1039–1064, 2009.
[3] Y. Agiomyrgiannakis. Vocaine the vocoder and applications in speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 4230–4234. IEEE, 2015.
[4] W. Chan, N. Jaitly, Q. Le, O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 4960–4964. IEEE, 2016.
[5] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
[6] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
[7] D. Bahdanau, K. Cho, Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[8] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
[9] S. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, M. Shoeybi. Deep Voice: Real-time neural text-to-speech. arXiv:1702.07825, 2017.
[10] W. Wang, S. Xu, B. Xu. First step towards end-to-end parametric TTS synthesis: Generating spectral parameters with neural attention. In Proceedings of Interspeech, pp. 2243–2247, 2016.
[11] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, Y. Bengio. Char2Wav: End-to-end speech synthesis. ICLR 2017 workshop submission, 2017.
[12] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. arXiv:1612.07837, 2016.
[13] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015.
[14] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. arXiv:1505.00387, 2015.
[15] J. Chung, C. Gulcehre, K. H. Cho, Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014.
[16] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
[17] S. Ioffe, C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167, 2015.
[18] O. Vinyals, Ł. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pp. 2773–2781, 2015.
[19] D. Kingma, J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
[20] Y. Masuyama, K. Yatabe, Y. Koizumi, Y. Oikawa, N. Harada. Deep Griffin-Lim iteration. 2019.
[21] J. Wodecki. Intuitive explanation of the Griffin-Lim algorithm. 2018.
[22] Synthesized test audio files: https://drive.google.com/drive/folders/1LkgKAKcD9qNMw_3stbHEhszxhrPyPmAA?usp=sharing
[23] Quantitative questionnaire used for evaluation of the model: https://docs.google.com/forms/d/e/1FAIpQLSeJJJVRjU3tzbLi1mix9buNOs002GFaTvSp9TVO752OCPNUvA/viewform?fbclid=IwAR1bLE8hrEALj7MwHkAgDKrf0JfyClD-DTuCiGdJ8Nc68Jl1XYv_1_MRxoE
[24] P. C. Loizou. Speech quality assessment. University of Texas at Dallas, Department of Electrical Engineering, Richardson, TX, USA.
[25] M. Trajanoska, H. Gjoreski. Towards end-to-end speech recognition in Macedonian language. BalkanCom, 2019.
[26] MAIKA: https://maika.mk/
[27] eSpeak: http://espeak.sourceforge.net/

Improving Mammogram Classification by Generating Artificial Images

Ana Peterka (University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia, anapeterka1151@gmail.com), Zoran Bosnić (University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia, zoran.bosnic@fri.uni-lj.si), Evgeny Osipov (Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Luleå, Sweden, evgeny.osipov@ltu.se)

ABSTRACT

Training a deep convolutional neural network (DCNN) from scratch is difficult, because it requires large amounts of labeled training data. This is a big problem especially in the medical domain, since datasets are scarce and the data is often imbalanced, which can result in overfitting the model. Fine-tuning a model that has been pre-trained on a large dataset shows promising results. Another approach is to augment the dataset with artificially generated learning examples. In this paper, we augment the learning set with artificially generated images produced by a conditional infilling GAN. The results show that we can relatively easily generate realistic-looking mammograms that improve the classification of benign and malignant mammograms.

KEYWORDS

data augmentation, transfer learning, CNN, ResNet-50, GAN, ciGAN

1 INTRODUCTION

Breast cancer is a cancer found in the tissue of the breast, when abnormal cells grow in an uncontrolled way. It can affect both women and men, though it is prevalent in women. Statistics show that it has the highest mortality rate of any cancer in women worldwide and that 1 in 8 women in the EU will develop breast cancer before the age of 85 (https://www.europadonna.org/breast-cancer-facs/). Screening mammography helps diagnose cancer at an early stage, which significantly increases survival rates. However, the evaluation of mammograms performed by doctors and radiologists is tedious, lengthy and error-prone, and it results in a high number of false positives.

New approaches in deep learning (DL), in particular convolutional neural networks (CNNs), have proven their potential for medical imaging classification tasks. This could relieve radiologists and give patients quicker and more accurate diagnoses. However, the performance of CNNs depends on large labeled datasets, which are hard to obtain in the medical imaging field due to the privacy concerns of patients and the time-consuming expert annotations. Furthermore, the data is often imbalanced, meaning that pathological findings are relatively rare. This can result in overfitting the model and poor generalization ability.

So far, this problem has been addressed with transfer learning and data augmentation techniques. In this paper, we evaluate these techniques on CBIS-DDSM, a publicly available dataset that contains benign and malignant mammograms. We propose a novel approach of generating new images with Generative Adversarial Networks (GANs), combined with traditional data augmentation such as horizontal flipping and rotations, and evaluate whether increasing the dataset helps to achieve better classification. We also test whether fine-tuning a ResNet-50 model helps improve the results.

The paper is structured as follows.
Section 2 presents the related work, Section 3 describes the data augmentation techniques used, Section 4 the training process, and Section 5 the evaluation metrics used and the results; in Section 6 we state our conclusions and discuss prospective future work.

2 RELATED WORK

This section provides a brief review of past work, which falls into three categories:
1. improved classification with traditional data augmentation,
2. improved classification with synthetic images generated by a generative adversarial network,
3. transfer learning and fine-tuning.

The problem with small datasets, especially in the medical domain, is that models trained on them tend to overfit the data. There are many approaches to reduce overfitting, such as batch normalization, dropout, data augmentation and transfer learning. Traditional data augmentation based on affine transformations, such as translation, rotation, shearing, flipping and scaling, is the most widely used and very easy to implement. Such transformations are ubiquitous in computer vision tasks and show very promising results [1]. However, they do not bring any new visual features that could further improve the generalization of the CNN.

Synthetic image generation with GANs adds more variability to the dataset and further improves the robustness of the classification network. GANs were inspired by game theory: two neural networks are pitted against each other in a minimax game. They were first introduced in [2], and they have recently been applied to many different medical imaging applications, mostly for image-to-image translation and image inpainting.

Transfer learning and fine-tuning for mammography images was the main topic of [4] and [5]. In [4], the authors demonstrated that a whole-image model trained on DDSM can easily be transferred to INbreast without using its lesion annotations and with only a small amount of training data. In [5], the authors showed that a ResNet-50 model pre-trained on ImageNet can be fine-tuned to perform tumor classification on the CBIS-DDSM dataset.

In this paper, we first use traditional data augmentation techniques and then additionally augment the dataset by applying the ciGAN (conditional infilling GAN). We evaluate the improvements with a fine-tuned ResNet-50 model.

3 AUGMENTING THE DATASET

In this section, we first describe the dataset, then explain the traditional data augmentation methods used and a GAN method for synthesizing new images.

3.1 The CBIS-DDSM dataset

CBIS-DDSM [6] is a publicly available dataset that contains digitized images from scanned films of mammograms. It is a subset of the DDSM dataset consisting of only benign and malignant cases, while DDSM also contains normal ones. The data was acquired from 1,566 patients and contains both mediolateral oblique (MLO) and craniocaudal (CC) views of each breast. The images are grayscale and have corresponding binary masks that indicate the mass, as well as ROI images of that mass. The images are in DICOM format, which is the standard for medical imaging information. The data is already split into a training and a testing set; we used part of the testing set as a validation set for the classification network.

3.2 Traditional data augmentation

To compensate for the lack of training images, we used classical data augmentation techniques, in particular horizontal flipping, rotations of up to 30°, and a zoom range from 0.75 to 1.25, and tested whether this improves the performance of the CNN.
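These settings map directly onto Keras' ImageDataGenerator; a sketch follows (the patch arrays are placeholders, not the CBIS-DDSM loader):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(horizontal_flip=True,     # random mirroring
                               rotation_range=30,        # rotations up to 30 degrees
                               zoom_range=[0.75, 1.25])  # zoom out/in

patches = np.random.rand(8, 256, 256, 1).astype("float32")  # placeholder patches
labels = np.array([0, 1] * 4)
batches = augmenter.flow(patches, labels, batch_size=4)
x_batch, y_batch = next(batches)   # a randomly transformed training batch
```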
3.3 Data augmentation with GANs

To further augment and balance the dataset, we use a GAN variant called conditional infilling GAN (ciGAN) [3], which the authors of [3] used to synthesize lesions on mammograms. GANs are a type of generative model, which means that they are able to produce novel examples based on the training data. They consist of two neural networks, a generator and a discriminator, which are pitted against each other: the generator tries to capture the data's distribution, while the discriminator tries to distinguish real from generated examples. By training them simultaneously, the generator gets better at generating realistic data, while the discriminator gets better at distinguishing real from fake data. In the case of ciGAN, the generator is based on a cascaded refinement network (CRN) [8], where features are generated at multiple scales before being concatenated, which yields more realistic image synthesis.

In our approach, we apply the ciGAN to sample a location on a healthy mammogram and then synthesize a lesion at that location, as shown in Figure 1. The input is a concatenated stack of the following (a NumPy sketch follows Figure 1):
- a corrupted image (a one-channel grayscale image with the lesion replaced by a uniform distribution of values between 0 and 1),
- a binary mask that marks the lesion (1 representing the location of the lesion, and zeros elsewhere), and
- the class label ([1,0] representing the non-malignant class, and [0,1] representing the malignant class).

Figure 1: The ciGAN architecture. The input consists of two one-channel images and 2 class channels indicating the malignant/benign label. The output of the generator is fed, together with the real image, into the discriminator, which predicts whether each image is generated or original, and whether it contains benign or malignant lesions.
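A sketch of how such an input stack could be assembled with NumPy; the function name and shapes are illustrative, not the authors' code:

```python
import numpy as np

def build_cigan_input(patch, lesion_mask, malignant):
    """Corrupted image + binary mask + two class channels -> (256, 256, 4)."""
    corrupted = patch.copy()
    # Replace the lesion area with uniform noise in [0, 1).
    corrupted[lesion_mask == 1] = np.random.uniform(size=int(lesion_mask.sum()))
    label = [0.0, 1.0] if malignant else [1.0, 0.0]  # [1,0] benign, [0,1] malignant
    class_planes = [np.full(patch.shape, v, dtype=np.float32) for v in label]
    return np.stack([corrupted, lesion_mask.astype(np.float32), *class_planes],
                    axis=-1)

patch = np.random.rand(256, 256).astype(np.float32)
mask = np.zeros((256, 256), dtype=np.uint8)
mask[100:140, 100:140] = 1                    # toy lesion location
x = build_cigan_input(patch, mask, malignant=True)
print(x.shape)                                # (256, 256, 4)
```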
The generator is comprised of multiple convolutional blocks. The first convolutional block receives the input stack downsampled to a 4x4 resolution, and the resolution doubles between consecutive blocks: the next convolutional block is fed the concatenation of the output of the first block, upsampled to 8x8, and the input stack resized to 8x8. This is repeated until a resolution of 256x256 is reached. The discriminator has a similar but inverse structure.

3.4 Differences to the related work

Our work is based on the aforementioned ciGAN [3], with a few improvements. While the former method was trained on non-malignant versus malignant cases, our approach uses benign and malignant cases, since we believe that the real hardship is distinguishing the lesions, not merely noticing them. Images in the original work show that, to acquire synthetic non-malignant mammograms, the lesion was removed, turning the picture into a normal mammogram. Since we used a sliding-window approach to extract normal patches instead of the mask, we did not have to remove the malignant lesion; we applied both masks independently and thus obtained only benign and malignant cases. All generated benign cases contain a lesion. We also applied zooming and rotation to the lesions before generating new images, so our generated images have more diverse tumors.

4 GENERATING ARTIFICIAL IMAGES

4.1 Preprocessing

To extract the 256x256-pixel patches that are fed into the ciGAN, we used a sliding window technique. The program loops through the whole mammogram image with a stride of 128 and checks whether the rectangular region overlaps the majority of the breast. It also checks whether the patch contains a lesion or only normal breast tissue, and labels it accordingly, by comparing the same region of the corresponding binary mask. In the end, the patch dataset contains 5,466 images: 1,743 normal, 2,198 benign and 1,525 malignant.

After acquiring the patch dataset, the program loops through all the patches containing only normal tissue. For each normal patch, it randomly chooses one patch that contains a lesion. The lesion patch is then randomly zoomed in or out by a small factor, to obtain more diverse masses. Next, we check whether the location of the lesion corresponds to breast tissue (and not background) on the normal patch. If not, the next random lesion patch is chosen and the process is repeated until a suitable match is found. Once a suitable pair is obtained, the normal image is corrupted by replacing the area defined by the lesion mask with a uniform distribution.

4.2 Loss functions

The ciGAN model is trained by utilizing three loss functions [3]:
- Perceptual loss: a loss calculated between the ground truth and the output image. Unlike a per-pixel loss, which is based on differences between individual pixels, it measures the discrepancy between high-level perceptual features extracted from pretrained networks [9]. It encourages the generator to output images with high-level features similar to those of the original image. In this case, the VGG-19 [10] convolutional neural network, pretrained on the ImageNet dataset, is used. It can be written as L_p = || phi(R) - phi(S) ||_1, where R denotes a real image, S a synthetic image and phi a feature function.
- Boundary loss: used to encourage smooth blending between the infilled component and the context of the generated image. It is an L1 difference between the real and generated images at the boundary, L_b = || w * (R - S) ||_1, where w denotes the lesion mask with a Gaussian filter of standard deviation 10 applied, and * is the element-wise product.
- Adversarial loss: the general GAN loss, defined as a distance between the true and the generated distribution at the current iteration. Its goal is to converge to the equilibrium in the minimax game between the generator G and the discriminator D, min_G max_D E[log D(R, c)] + E[log(1 - D(S, c))], where c denotes the class label.
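The following PyTorch sketch mirrors the verbal loss definitions above; the exact weighting, reductions and conditioning used in [3] may differ, and the feature extractor here is an identity stand-in for VGG-19.

```python
import torch

def perceptual_loss(feat_fn, real, fake):
    # L1 distance between high-level features phi(R) and phi(S).
    return torch.mean(torch.abs(feat_fn(real) - feat_fn(fake)))

def boundary_loss(real, fake, w):
    # L1 difference weighted by the Gaussian-blurred lesion mask w.
    return torch.mean(w * torch.abs(real - fake))

def adversarial_loss(d_real, d_fake):
    # Standard GAN minimax objective, written as separate D and G losses;
    # the class conditioning is omitted in this sketch.
    d_loss = -torch.mean(torch.log(d_real) + torch.log(1.0 - d_fake))
    g_loss = -torch.mean(torch.log(d_fake))
    return d_loss, g_loss

r, s = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
w = torch.rand(1, 1, 256, 256)            # blurred-mask placeholder
feat = lambda t: t.flatten(1)             # identity stand-in for VGG-19 features
d_out = torch.sigmoid(torch.randn(2, 8, 1))
print(perceptual_loss(feat, r, s).item(), boundary_loss(r, s, w).item(),
      [v.item() for v in adversarial_loss(d_out[0], d_out[1])])
```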
4.3 Training

The ciGAN is first pretrained on the perceptual loss for 300 epochs. Then the training of the discriminator and the generator alternates, switching whenever the loss of either drops below 0.3, for an additional 2000 epochs. The ciGAN produces realistic images, as shown in Figure 2.

Figure 2: A generated sample from ciGAN. Image 1 is the normal image without a lesion, image 2 is the binary mask representing the random malignant lesion, image 3 is the corrupted image and image 4 is the synthesized image with a malignant lesion.

5 EVALUATION AND RESULTS

Three metrics were used to evaluate the results. The first is accuracy, which tells us how many examples were correctly classified. The second is recall (sensitivity), which is the fraction of true positives over the sum of true positives and false negatives; it is the most important metric in this case, due to the risk of overlooking cancer. The third is the Area Under Curve (AUC), which measures the area under the ROC curve. We evaluate the results by performing five experiments:
1. Shallow CNN [11]: implemented as the baseline. The network is fed a patch and classifies it as either malignant or benign. It consists of three convolutional blocks, each composed of 3x3 convolutions, batch normalization, the ReLU activation function and max pooling, followed by three dense layers and a softmax function for binary classification.
2. ResNet-50: classifying the data with a ResNet-50 [12] without fine-tuning.
3. ResNet-50 with fine-tuning: checking whether transfer learning improves the results.
4. ResNet-50 + traditional data augmentation.
5. ResNet-50 + traditional data augmentation and generated artificial images.
As in [5], we fine-tuned the ResNet-50 [12] model with ImageNet weights. It is an extremely deep neural network with 150+ layers, consisting of convolutional layers, pooling layers and multiple residual blocks. In the residual blocks, the layers feed into the next layer and also directly into layers two to three hops away. The input to the ResNet-50 model is a patch of size 224x224x3. Since mammograms have only a grayscale channel, the gray values are copied over all three channels. We used the Adam optimizer with an initial learning rate of 10^-5, beta1 = 0.9, beta2 = 0.999, epsilon = 10^-8 and ImageNet weight initialization. We trained for 50 epochs with a batch size of 32 and a 0.9 learning rate decay every 30 epochs (a sketch of this setup is given after Table 1).

Table 1 shows the obtained results. Fine-tuning ResNet-50 alone already improved the results. Combining ResNet-50 with traditional data augmentation yielded even better performance metrics. Finally, by enlarging the dataset with relatively small amounts of synthetic images while simultaneously balancing it, we improved accuracy and AUC even further, at the cost of a slight decrease in recall.

Table 1: The obtained accuracy, recall and AUC scores

                                        accuracy  recall   AUC
  Shallow CNN                           0.57267   0.44810  0.54943
  ResNet-50 without finetuning          0.58295   0.53859  0.58634
  ResNet-50                             0.60155   0.55769  0.59443
  ResNet-50 + traditional               0.67132   0.64231  0.66666
  ResNet-50 + traditional + artificial  0.76145   0.61538  0.71638
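A sketch of the fine-tuning setup in Keras with the optimizer settings quoted above; the classification head is an assumption, as the paper does not describe it.

```python
import numpy as np
import tensorflow as tf

base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3), pooling="avg")
head = tf.keras.layers.Dense(2, activation="softmax")(base.output)
model = tf.keras.Model(base.input, head)

# Optimizer settings from the text: Adam with lr = 1e-5,
# beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5, beta_1=0.9,
                                                 beta_2=0.999, epsilon=1e-8),
              loss="categorical_crossentropy", metrics=["accuracy"])

# Grayscale patches are tiled over three channels before being fed in.
gray = np.random.rand(4, 224, 224, 1).astype("float32")   # placeholder batch
rgb = np.repeat(gray, 3, axis=-1)                          # (4, 224, 224, 3)
```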
6 CONCLUSION

In this paper we discussed overcoming the obstacle of a small and imbalanced mammography dataset. We proposed an approach for generating artificial images with a conditional infilling GAN (ciGAN). The results showed that we can relatively easily generate realistic-looking mammograms that improve the classification of benign and malignant mammograms. Further, we evaluated the learning performance when using fine-tuning, classical data augmentation and synthetic examples. Each of these techniques improved classification, with the best results obtained when using all three together.

Comparing the results to the previously developed method [3], we obtained worse results in terms of AUC, but we believe the reason is that all our images contain lesions, which must be harder for a neural network to distinguish than separating non-malignant from malignant images.

Testing these methods on different medical datasets shall be the subject of future work. One may also consider using these methods on bigger datasets to improve the current state-of-the-art algorithms. Since the ciGAN's discriminator was also conditioned on the class, we intend to extract its features and use them for classification on another mammography dataset, for example INbreast. We also plan to add more synthetic images to the dataset, to see whether we can further improve the classification.

Currently, mammogram classification is performed by doctors and radiologists, but we hope that improving the classification with machine learning, combined with these and similar techniques, could relieve them of such tasks in the near future.

REFERENCES
[1] Wang, J., & Perez, L. (2017). The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit, 11.
[2] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672–2680).
[3] Wu, E., Wu, K., Cox, D., & Lotter, W. (2018). Conditional infilling GANs for data augmentation in mammogram classification. In Image Analysis for Moving Organ, Breast, and Thoracic Images (pp. 98–106). Springer, Cham.
[4] Shen, L. (2017). End-to-end training for whole image breast cancer diagnosis using an all convolutional design. arXiv preprint arXiv:1711.05775.
[5] Agarwal, R., Diaz, O., Lladó, X., & Martí, R. (2018). Mass detection in mammograms using pre-trained deep learning models. In 14th International Workshop on Breast Imaging (IWBI 2018) (Vol. 10718, p. 107181F). International Society for Optics and Photonics.
[6] Lee, R. S., Gimenez, F., Hoogi, A., Miyake, K. K., Gorovoy, M., & Rubin, D. L. (2017). A curated mammography data set for use in computer-aided detection and diagnosis research. Scientific Data, 4, 170177, https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM.
[7] Odena, A., Olah, C., & Shlens, J. (2017). Conditional image synthesis with auxiliary classifier GANs. In International Conference on Machine Learning (pp. 2642–2651).
[8] Chen, Q., & Koltun, V. (2017). Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1511–1520).
[9] Johnson, J., Alahi, A., & Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision (pp. 694–711). Springer, Cham.
[10] Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[11] Lévy, D., & Jain, A. (2016). Breast mass classification from mammograms using deep convolutional neural networks. arXiv preprint arXiv:1612.00542.
[12] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).

Mobile Nutrition Monitoring System: Qualitative and Quantitative Monitoring

Nina Reščič (nina.rescic@ijs.si, Department of Intelligent Systems, Jožef Stefan Institute, and Jožef Stefan International Postgraduate School, Ljubljana, Slovenia), Marko Jordan (Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia), Jasmijn de Boer (ConnectedCare, Nijmegen, Netherlands), Ilse Bierhoff (ConnectedCare, Nijmegen, Netherlands), Mitja Luštrek (mitja.lustrek@ijs.si, Department of Intelligent Systems, Jožef Stefan Institute, Ljubljana, Slovenia)

ABSTRACT

The WellCo project (http://wellco-project.eu) aims to provide a mobile application featuring a virtual coach for behaviour changes towards a healthier lifestyle. The nutrition monitoring module consists of two main parts: qualitative (a Food Frequency Questionnaire) and quantitative (eating detection and bite counting). In this paper we present the nutrition monitoring module that connects both monitoring aspects, as implemented in the virtual coach (mobile application).

KEYWORDS

nutrition monitoring, eating detection, FFQ

1 INTRODUCTION

Proper nutrition habits are beneficial for a healthy lifestyle and help to prevent many chronic diseases, such as cancer, diabetes and hypertension. Automated monitoring has become very important in nutrition monitoring, but it only gives quantitative information (when is the user eating, how much did they eat, ...), while qualitative information (what is the user eating) is acquired using 24-hour food recall diaries or Food Frequency Questionnaires (FFQs). In the WellCo project we aimed to develop a user-friendly nutrition module which monitors both the qualitative and the quantitative aspects of the users' nutrition. We combined a self-reported FFQ, the Extended Short Form Food Frequency Questionnaire (ESFFFQ), developed and validated in the project [5], with automated monitoring using a commercially available wearable smartwatch. This paper describes the developed module and the improvements we made since our previous papers [5, 2, 7].
By using wrist-worn devices to collect data, it is possible to recognize eating gestures [4] or even count "bites" or assess caloric intake [10]. Mirtchouk et al. [3] explored eating detection using several sensors and combining real-life and laboratory data. Thomaz et al. [8] proposed a method that recognizes each intake gesture separately and then clusters the intake gestures within a 60-minute interval.

For qualitative monitoring, we evaluated both dietary recalls and FFQs as self-reporting methods. However, dietary recalls require typing or complex food item selection, which can be cumbersome on mobile devices, so we opted for an FFQ. FFQs are the most commonly selected tools in nutrition monitoring, as they are efficient, cost-effective and non-invasive [9, 6]. The developed FFQ covers all key aspects of a healthy diet and is modular, so that only questions pertaining to certain aspects can be asked. This is important in ubiquitous settings, where one wishes to minimize the required inputs from the user.

To our knowledge, the developed application module is the first to combine qualitative (a validated FFQ) and quantitative monitoring (a bite counting method) and to provide recommendations based on the data gathered by monitoring.

2 METHOD

2.1 Method Overview

This paper describes the nutrition monitoring module developed in the WellCo project.

The qualitative monitoring starts with a five-question questionnaire that provides essential information about the user's diet. Based on this, some goals to improve the user's nutrition can already be recommended. However, the users are invited to answer a more extensive questionnaire that paints a more complete picture and allows recommending more goals. This questionnaire is an extended version of a validated questionnaire, and the extension was validated by us [5]. How successful the users are at achieving their goals is monitored with goal-specific questions on a bi-weekly basis.

The quantitative monitoring uses the accelerometer and gyroscope in a smartwatch to detect micromovements related to eating (e.g., picking up food, putting it into the mouth). From a sequence of such micromovements, we then recognise whether the user has made one "bite" (taken the food to the mouth). The improved method uses a convolutional neural network to recognise the micromovements and an LSTM neural network to recognise bites. The latter achieved higher accuracy, so it was selected for integration into the WellCo system.

2.2 FFQ - Qualitative Monitoring

When choosing goals that would help users of the WellCo virtual coach towards behavioural changes for a healthier lifestyle, we leaned on the national dietary recommendations of the pilot countries and on dietary recommendations for the elderly, combined with the expert knowledge of the nutritionist involved in the project.

Guidelines specifically for the elderly are very similar to the national dietary recommendations in all three countries involved in the pilots (Italy, Spain and Denmark), but they put additional emphasis on dairy consumption, as it is a good source of protein and calcium, which are beneficial and often under-consumed; on drinking enough water, as dehydration is often a problem with the elderly; and on leucine consumption (found in milk, peanuts, oatmeal, fish, poultry, egg white, wheat sprouts, etc.). Given these recommendations, we chose the goals that WellCo users will be suggested to follow in order to improve their diet: fruit consumption, vegetable consumption, sugar consumption, fat consumption, fibre consumption, protein consumption, salt consumption, fish consumption and water consumption.

In our search for a comprehensive but still short FFQ, we found a validated questionnaire named the Short Food Frequency Questionnaire (SFFQ) [1], which consists of 23 questions and fully covers five of our chosen goals: fruit, vegetable, sugar, fat and fish consumption. To cover the four missing goals (protein, fibre, salt and water consumption), we added 8 additional questions, turning the SFFQ into the so-called Extended Short Food Frequency Questionnaire (ESFFQ). The validation of the questionnaire is described in our previous paper [5].
2.3 Quantitative Monitoring

The main objective of the smartwatch-based nutrition monitoring is bite counting (counting the number of times the user takes food to the mouth). The bite-counting algorithm described in [2] was used as the base for all of the following work. When deciding how to present the results of the developed algorithm to the users in the mobile application, we had to make some improvements to our model. As the number of bites alone does not give much useful information to the users, we decided to join individual bites into meals and to categorize each meal as a snack, a small meal or a big meal.

2.3.1 Datasets. To construct the bite detection algorithm, we created the Wild Meals Dataset (WMD). It includes 51 sessions and 99 meals, with known starting and ending time points, belonging to 11 unique subjects, recorded "in the wild". For 68 of those meals we also obtained the approximate number of corresponding bites, since the subjects were asked to count them while eating. Additionally, we used the publicly available Food Intake Cycle (FIC) and Free Food Intake Cycle (FreeFIC) datasets. All datasets contain tri-axial signals from accelerometers and gyroscopes in wrist devices, with a sampling frequency of 100 Hz.

2.3.2 Meal detection method. The algorithm for meal detection is comprised of two parts: in the first part, probabilities that given time periods are part of eating are assigned; in the second part, these probabilities are grouped together to form meals.

First, we linearly interpolated all accelerometer and gyroscope measurements, as well as the bite probabilities, to a 4 Hz frequency. Next, normalization was applied to the interpolated accelerometer and gyroscope data. We constructed 90 s long sliding windows with a 2.5 s step. Each window contained 360 of the previously obtained accelerometer, gyroscope and bite probability values (obtained with the CNN and LSTM networks, as described in [2]). The 4 Hz frequency was used to achieve faster training and prediction, while also enabling us to construct longer windows. A window was labelled as a positive instance if the majority of the window belonged inside a meal.
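A NumPy sketch of this window construction; the signal array and labels are synthetic placeholders.

```python
import numpy as np

FS = 4                   # signals resampled to 4 Hz
WIN = 90 * FS            # 90 s window -> 360 samples
STEP = int(2.5 * FS)     # 2.5 s step -> 10 samples

def make_windows(signals, eating):
    """signals: (n, 7) array of 3 accelerometer + 3 gyroscope axes + the bite
    probability, already interpolated to 4 Hz and normalized; eating: per-sample
    0/1 flags. A window is positive when most of it lies inside a meal."""
    xs, ys = [], []
    for start in range(0, len(signals) - WIN + 1, STEP):
        xs.append(signals[start:start + WIN])
        ys.append(int(eating[start:start + WIN].mean() > 0.5))
    return np.stack(xs), np.array(ys)

x, y = make_windows(np.random.randn(4000, 7), np.random.randint(0, 2, 4000))
print(x.shape, y.shape)   # (365, 360, 7) (365,)
```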
application, we had to make some improvements to our model. As • Round 2: Find all probabilities that are higher than a p2 the number of bites does not really give much useful information threshold and group them together, if they are immediately to the users, we decided to join individual bites into meals and next to each other. For each group find the time distance to recognize meals as snack, small meal or big meal. to its nearest group. Finally remove all groups that have either 1 or 2 members and are more than t2 seconds away 2.3.1 Datasets. To construct the bite detection algorithm, we from the corresponding nearest group. created the Wild Meals Dataset (WMD). It includes 51 sessions • Round 3: If there exist any two groups of the form [A,B] and 99 meals, with known starting and ending time points, be- and [C,D], where 0 ≤ C − B ≤ t3 (all in seconds), combine longing to 11 unique subjects, recorded ’in-wild’. For 68 of those these two groups together to form a new group, [A,D]. meals we have also obtained the approximate number of the This means that indices in [A,D] can now represent the corresponding bites, since the subjects were asked to count them probabilities of zero as well. while eating. Additionally we used the publicly available The • Round 4: Similar as Round 3, but with a t4 parameter in Food Intake Cycle (FIC) dataset and The Free Food Intake Cycle place of t3. (FreeFIC). All datasets contains tri-axial signals from accelerome- At this point the probabilities of windows, previously temporar- ters and gyroscopes in wrist devices with the sampling frequency ily set to zero, are switched back to their original values. For of 100 Hz. the final model, we obtained the following values of the above hyperparameters: 2.3.2 Meal detection method. The algorithm for meal detection Since p2 > p1, this means that Round 1 in this particular case was comprised of two parts: in the first part probabilities that was not necessary, although in some other cases it could have given time periods are part of eating were assigned, whereas been. Once the candidate meals have been obtained, the features in the second part these probabilities were grouped together to are constructed for the ensemble of random forest, support vector form a meal. machine, knn and gradient boosting algorithms. The ensemble 81 Mobile Nutrition Monitoring System: Qualitative and Quantitative Monitoring Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia Table 1: Architecture of the network Type Units/Nodes Kernel/stride Output 1x1 4x1 prep 4x1 6x1 prep 6x1 Pool Inception-A 360x1x128 32 64 32 Max pool 3x1/2 180x1x128 Inception-B 180x1x128 32 64 64 16 16 16 Max pool 3x1/2 90x1x128 Inception-B 90x1x128 32 64 64 16 16 16 Max pool 3x1/2 45x1x128 Inception-B 45x1x128 32 64 64 16 16 16 Max pool 3x1/2 23x1x128 GRU 23x32 GRU 32 Dense 64 64 Dropout(0.36) 64 Dense 2 2 Table 2: Hyperparameters. Table 3: Results of bite recognition and meal detection al- gorithm. p1 t1(sec) p2 t2(sec) t3(sec) t4(sec) 0.46 61 0.87 120 63 61 F1-score precision recall cov_area outside_area Avg. 0.76 0.88 0.72 0.81 0.03 makes the final decision whether a candidate meal is in fact a Table 4: Example of recommendations for qualitative meal or not. The following features are created for each candidate monitoring (goal_sugar) and quantitative monitoring (nu- meal: trition_number_of_meal). 
3 RESULTS

3.1 Bite Counting

Table 3 presents the results of the evaluation of our work. The analysis of the entire pipeline is based on leave-one-subject-out double cross-validation. The following definitions were used for calculating the statistics:

- True positive prediction of a meal: any prediction of the respective meal for which the majority of the prediction lay inside the ground-truth meal. If there was more than one prediction of eating for a certain meal, only one prediction is counted as a true positive, but the others are not regarded as false positives. This is due to the possibility that the subjects did not eat during their entire recording time; as such, it did not seem reasonable to penalize the pipeline for predicting more than one meal. However, only one true positive is counted, in order not to encourage the algorithm to predict a bundle of eating instances.
- For the F1-score, precision and recall, definition A was used, while cov_area and outside_area used definition B. The double cross-validation results show that, with one exception, every ground-truth meal had at most one corresponding true positive predicted meal.
- Covered area (cov_area): for a given ground-truth meal, the length of the areas of the corresponding true positive meals that lay inside the ground-truth meal, divided by the length of the ground-truth meal.
- Outside area (outside_area): for a given predicted true positive meal, the length of the area that lay outside the corresponding ground-truth meal, divided by the length of the predicted meal.

Table 3: Results of the bite recognition and meal detection algorithm

        F1-score  precision  recall  cov_area  outside_area
  Avg.  0.76      0.88       0.72    0.81      0.03
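The two area statistics reduce to interval-overlap computations; a small sketch over [start, end] pairs in seconds:

```python
def overlap(a, b):
    """Length of the intersection of two [start, end] intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def cov_and_outside(truth, pred):
    """cov_area: fraction of the ground-truth meal covered by the prediction;
    outside_area: fraction of the prediction lying outside the ground truth."""
    inter = overlap(truth, pred)
    cov = inter / (truth[1] - truth[0])
    outside = 1.0 - inter / (pred[1] - pred[0])
    return cov, outside

print(cov_and_outside((100, 200), (120, 230)))  # (0.8, 0.2727...)
```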
3.2 Application Implementation

The application shows users the detected meals, the number of bites and the quality score for the chosen goals (see Figure 1). Based on the results, the user is additionally shown recommendations to follow in order to improve their nutrition. Examples of recommendations for both qualitative and quantitative monitoring are shown in Table 4.

Figure 1: Application view for both monitoring tasks.

Table 4: Example of recommendations for qualitative monitoring (goal_sugar) and quantitative monitoring (nutrition_number_of_meal)

  goal_sugar: It seems you don't eat enough vegetables. Vegetables are important sources of many nutrients, such as vitamins, minerals and dietary fibre. Try to eat 2 servings of vegetables per day. A serving is 1 cup of fresh or half a cup of cooked vegetables.
  nutrition_number_of_meal: Try to eat 3–5 meals per day (e.g. 3 bigger, 2 smaller). Avoid snacking between meals.

4 CONCLUSION

The developed nutrition monitoring module consists of two parts: qualitative monitoring and quantitative monitoring. Both developed modules are implemented in a mobile application. In future work we would like to improve the developed eating detection and bite counting algorithms.

The developed FFQ (ESFFFQ) can be used to support a wide range of nutrition goals and minimizes the number of questions asked, so it is suitable for mobile nutrition monitoring. To make the application user-friendly, the questions from the FFQ are not all asked at the same time, but separately over the course of a fortnight. This means that some of the questions won't be asked, hence it is really important to ask the right questions. In future work we will explore the problem of question ranking; with it, we would be able to ask the questions in a specific order and lose as little information as possible.

5 ACKNOWLEDGMENTS

The WellCo project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 769765.

REFERENCES
[1] Christine L. Cleghorn, Roger A. Harrison, Joan K. Ransley, Shan Wilkinson, James Thomas, and Janet E. Cade. 2016. Can a dietary quality score derived from a short-form FFQ assess dietary quality in UK adult population surveys? Public Health Nutrition, 19, 16, 2915–2923. doi: 10.1017/S1368980016001099.
[2] 2019. Counting bites with a smart watch. In Slovenian Conference on Artificial Intelligence: Proceedings of the 22nd International Multiconference Information Society, Volume A, 49–52.
[3] Mark Mirtchouk, Drew Lustig, Alexandra Smith, Ivan Ching, Min Zheng, and Samantha Kleinberg. 2017. Recognizing eating from body-worn sensors: combining free-living and laboratory data. 1, 3. doi: 10.1145/3131894.
[4] Raul I. Ramos-Garcia, Eric R. Muth, John N. Gowdy, and Adam W. Hoover. 2014. Improving the recognition of eating gestures using intergesture sequential dependencies. IEEE Journal of Biomedical and Health Informatics, 19, 3, 825–831.
[5] Nina Reščič, Eva Valenčič, Enej Mlinarič, Barbara Koroušić Seljak, and Mitja Luštrek. 2019. Mobile nutrition monitoring for well-being. In Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers (UbiComp/ISWC '19 Adjunct). London, United Kingdom, 1194–1197.
[6] J. S. Shim, K. Oh, and H. C. Kim. 2014. Dietary assessment methods in epidemiologic studies. Epidemiology and Health, 36. doi: 10.4178/epih/e2014009.
[7] Simon Stankoski, Nina Reščič, Grega Mežič, and Mitja Luštrek. 2020. Real-time eating detection using a smartwatch. Junction Publishing, USA.
[8] Edison Thomaz, Irfan Essa, and Gregory D. Abowd. 2015. A practical approach for recognizing eating moments with wrist-mounted inertial sensing. Association for Computing Machinery, New York, NY, USA. ISBN: 9781450335744. doi: 10.1145/2750858.2807545.
[9] Frances Thompson and T. Byers. 1994. Dietary assessment resource manual. The Journal of Nutrition, 124 (December 1994), 2245S–2317S. doi: 10.1093/jn/124.suppl_11.2245s.
[10] Shibo Zhang, William Stogin, and Nabil Alshurafa. 2018. I sense overeating. Information Fusion, 41, C (May 2018), 37–47. doi: 10.1016/j.inffus.2017.08.003.

Recognition of Human Activities and Falls by Analyzing the Number of Accelerometers and their Body Location

Miljana Shulajkovska, Hristijan Gjoreski
miljanash@gmail.com, hristijang@feit.ukim.edu.mk
Faculty of Electrical Engineering and Information Technologies, Ss. Cyril and Methodius University, Skopje, N. Macedonia
* Both authors contributed equally to this research.

ABSTRACT
This paper presents an approach to activity recognition and fall detection using wearable accelerometers placed at different locations on the human body. We studied how the location and the number of wearable accelerometers influence the recognition performance for activities and falls. The final goal was to build a machine learning model that can correctly recognize the activities and the falls using as few accelerometers as possible. The model was evaluated on a public dataset consisting of more than 850 GB of data, recorded by 17 people. In total, we evaluated 15 combinations of four accelerometers placed on the belt, the left ankle, the left wrist and the neck. The results showed that the neck and ankle accelerometers proved sufficient to correctly recognize all the activities and falls, with 94.2% accuracy; used individually, these two sensors achieved 94.02% and 93.4% accuracy, respectively.

KEYWORDS

activity recognition, fall detection, wearable sensors, machine learning

1 INTRODUCTION

According to the United Nations World Population Prospects 2019, by 2050 one in six people in the world will be over the age of 65 [1]. As people get older, their risk of falling also increases. Falls are a major public health problem in elderly people, often causing fatal injuries, and it is important to ensure that injured people receive assistance as quickly as possible. Because of this, building a good fall detection system is of great importance in helping medicine solve this problem.

The field of Human Activity Recognition (HAR) and fall detection has become one of the trendiest research topics due to the availability of low-cost, low-power sensors, i.e., accelerometers.
The recognition of human activities has been approached in two different ways, namely using ambient and wearable sensors [2]. In the former, the sensors are fixed in predetermined points of interest, so the inference of activities entirely depends on the voluntary interaction of the users with the sensors. In the latter, the sensors are attached to the body of the subject.

This paper presents a machine learning approach to activity recognition and fall detection using wearable accelerometers placed on different locations of the human body. The goal of the paper is to study how the location and the number of wearable accelerometers influence the performance of the recognition of the activities and the falls. This study is of practical importance for such systems, i.e., to build a machine learning model that can correctly recognize the activities and the falls using as few accelerometers as possible.

2 RELATED WORK
A considerable amount of work has been done in human activity recognition over the last decade, where many studies aim to identify activities based on data obtained from accelerometers as sensors widely integrated into wearable systems [3][4]. Researchers have reported high accuracy scores in detecting activities when investigating the best placement of the accelerometer on the human body [5][6][7]. Increasing the number of sensors increases the complexity of the classification problem. For these reasons, a number of studies have investigated the use of a single accelerometer. However, doing so generally decreases the number of activities that can be recognized accurately [8]. Consequently, one of the major considerations in activity recognition is the location or combination of locations of the accelerometers that provide the most relevant information.

In [5] the authors study the best location to place accelerometers for fall detection, based on the classification of postures. Four accelerometers were placed at the chest, waist, ankle and thigh. Statistical features were calculated for each axis of the accelerometer in addition to the magnitude. Results indicated that one accelerometer (chest or waist) by itself was not enough to sufficiently classify the activities (75%). There was, however, a significant improvement in classification accuracy achieved by combining the accelerometer at the chest or waist with one placed on the ankle (91%). Following the work described in [5], we explore this approach on a different dataset while investigating all possible sensor placement combinations.

3 ACTIVITY RECOGNITION

3.1 Dataset
In this research we used the UP-Fall Detection dataset, which is publicly available [9]. The dataset contains 17 subjects performing 11 activities. Each activity is performed 3 times. The activities are related to six simple human daily activities and five human falls, shown in Table 1. These types of activities and falls were chosen from the analysis of those reported in the literature [10][11]. All daily activities are performed for 60 s, except jumping, which is performed for 30 s, and picking up an object, which is an action done once within a 10-s period. A single fall is performed in each of the three ten-second trials.

Table 1: Activities performed in the dataset

Activity ID  Description                      Duration (s)
1            Falling forward using hands      10
2            Falling forward using knees      10
3            Falling backwards                10
4            Falling sideward                 10
5            Falling sitting in empty chair   10
6            Walking                          60
7            Standing                         60
8            Sitting                          60
9            Picking up an object             10
10           Jumping                          30
11           Laying                           60

To collect data from young healthy subjects without any impairment, the dataset follows a multimodal approach, sensing the activities in three different ways — with wearables, context-aware sensors and cameras — all at the same time. However, of particular interest to us is how acceleration data can be used for the recognition of activities. The analyzed data is obtained from accelerometers placed on the ankle, neck, wrist and belt. This way we created 15 different datasets representing every combination of these sensors, to show the importance of the placement of the accelerometer. In our research the sampling rate of the sensor is 18 Hz, which means 18 samples are provided every second. Figure 1 shows the raw data from a 3-axis accelerometer for a person performing three activities: standing, falling forward using hands and laying.

Figure 1: Raw data from a 3-axis accelerometer
3.2 Feature Extraction
Feature extraction is a very important step in the activity recognition process, in order to filter relevant information and obtain quantitative measures that allow signals to be compared. In our research we used statistical features to create the feature vectors. All the attributes are computed using the technique of overlapping sliding windows [5]. Because the final sampling frequency of our accelerometers was 18 Hz, we chose a window size of 18, which is a one-second time interval. We decided on a one-second time interval because among our target activities there are transitional activities (standing up and going down) that usually last from one to four seconds. Statistical attributes are extracted for each axis of the accelerometer. The feature extraction phase produces 36 features (summarized in Table 2) from the accelerations along the x, y and z axes. The first three features (Mean X/Y/Z) provide information about body posture, and the remaining features represent motion shape, motion variation and motion similarity (correlation).

Table 2: Overview of the extracted features. The number of features is represented with #

Feature name                        #
Mean (X, Y, Z)                      3
Standard deviation (X, Y, Z)        3
Root mean square (X, Y, Z)          3
Maximal amplitude (X, Y, Z)         3
Minimal amplitude (X, Y, Z)         3
Median (X, Y, Z)                    3
Number of zero-crossings (X, Y, Z)  3
Skewness (X, Y, Z)                  3
Kurtosis (X, Y, Z)                  3
First quartile (X, Y, Z)            3
Third quartile (X, Y, Z)            3
Autocorrelation (X, Y, Z)           3

Once the features are extracted (and selected), a feature vector is formed. During training, feature vectors extracted from training data are used by a machine learning algorithm to build an activity recognition model. During classification, feature vectors extracted from test data are fed into the model, which recognizes the activity.
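To make the windowing concrete, the following is a minimal sketch (our illustration, not the authors' code) of how such one-second statistical features could be computed with NumPy/SciPy; the 50% window overlap and the lag-1 autocorrelation are assumptions, since the exact overlap and autocorrelation lag are not specified above.

    import numpy as np
    from scipy.stats import skew, kurtosis

    def window_features(acc, fs=18, overlap=0.5):
        """acc: (n_samples, 3) accelerometer array sampled at fs Hz."""
        step = int(fs * (1 - overlap))  # assumed 50% overlap between windows
        feats = []
        for start in range(0, len(acc) - fs + 1, step):
            w, f = acc[start:start + fs], []
            for axis in range(3):
                x = w[:, axis]
                f += [x.mean(), x.std(), np.sqrt(np.mean(x ** 2)),  # mean, std, RMS
                      x.max(), x.min(), np.median(x),
                      int(np.sum(x[:-1] * x[1:] < 0)),              # zero-crossings
                      skew(x), kurtosis(x),
                      np.percentile(x, 25), np.percentile(x, 75),
                      np.corrcoef(x[:-1], x[1:])[0, 1]]             # lag-1 autocorrelation
            feats.append(f)
        return np.array(feats)  # 12 features per axis = 36 per window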
3.3 Methods
A machine learning approach was used for activity recognition. In this study, the machine learning task is to learn a model that will be able to classify the target activities (e.g., standing, sitting, falling) of the person wearing the accelerometers. For this purpose, we used 4 different machine learning algorithms: Random Forest, Support Vector Machine, k-Nearest Neighbors and Multilayer Perceptron.

The Random Forest (RF) classifier, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. The fundamental concept behind RF is the low correlation between the individual constituent models, which protects them from each other's individual errors.

The Support Vector Machine (SVM) method has also been broadly used in HAR, although it does not provide a set of rules understandable to humans. SVMs rely on kernel functions that project all instances to a higher-dimensional space with the aim of finding a linear decision boundary (i.e., a hyperplane) to partition the data.

The k-Nearest Neighbors (k-NN) is a supervised classification technique that uses the Euclidean distance to classify a new observation based on the similarity (distance) between the training set and the new sample to be classified.

The Multilayer Perceptron (MLP) [12] is an artificial neural network with a multilayer feed-forward architecture. The MLP minimizes the error function between the estimated and the desired network outputs, which represent the class labels in the classification context. Several studies show that the MLP is efficient in non-linear classification problems, including human activity recognition. A brief study of the MLP and other classification methods is given in [13][14].

4 EXPERIMENTS

4.1 Evaluation Techniques
To properly evaluate the models, we divided the data into train and test sets using leave-one-person-out cross-validation, in which each fold is represented by the data of one person. This means the model was trained on the data recorded for 16 people and tested on the remaining person's data. This procedure was repeated for each person's data (17 times) and the average performance was measured. Four evaluation metrics are commonly used in activity recognition: recall, precision, accuracy and F-measure. We analyzed the accuracy score, which shows how many of the predicted activities are correctly classified.
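As an illustration of this protocol, a minimal leave-one-person-out sketch with scikit-learn is given below; it is not the authors' code, and the data and Random Forest settings are synthetic placeholders.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

    # Synthetic stand-ins: 36 window features, 11 activity labels, 17 subjects
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1700, 36))          # feature vectors
    y = rng.integers(0, 11, size=1700)       # activity labels
    groups = np.repeat(np.arange(17), 100)   # subject id per window

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    scores = cross_val_score(clf, X, y, groups=groups,
                             cv=LeaveOneGroupOut(), scoring="accuracy")
    print("mean accuracy over 17 held-out subjects:", scores.mean())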
4.2 Results
For the first experiment we compared the 4 ML models using the ankle accelerometer, as shown in Figure 2. We used the ankle accelerometer because our initial studies showed that it performs best. Random Forest showed the best results with 92.92% accuracy, so it was used for further experiments.

Figure 2: Comparison of different algorithms using the ankle accelerometer (accuracy in %: RF 92.92, SVM 90.14, MLP 92.43, k-NN 84.31)

Table 3 shows the comparison of activity recognition accuracy using 4 accelerometers placed on the ankle, belt, neck and wrist. It shows how the number and placement of accelerometers can affect the recognition of particular activities.

Table 3: Comparison of activity recognition accuracy using different numbers of accelerometers (1, 2, 3 or 4) placed on the ankle, belt, neck and wrist

Placing the accelerometer on the belt can distinguish sitting, standing or jumping, but distinguishing different kinds of falls that include some transitions, like standing, falling and then laying, is a problem. Adding one accelerometer on the neck can slightly improve the results, but the falls still cannot be recognized correctly. The combination of the neck and ankle accelerometers gave the best results, with 94.2% accuracy. On the other hand, an accelerometer on the ankle can distinguish walking, standing and laying, but has problems with picking up an object and with recognizing the falls; most of the fall activities are recognized as standing or laying. By combining this sensor with the neck accelerometer, the algorithm can distinguish each of the discussed activities.

Because of situations like this, we decided to compare the results using different numbers of accelerometers and different body placements. The idea is to use as few sensors as possible to maximize the user's comfort, but to use enough of them to achieve satisfactory performance.

We must make a trade-off between correctly detecting simple activities and specific falls. The results showed that the neck and ankle accelerometers are best suited for fall detection, with an overall accuracy of 94.19%. The confusion matrix for the neck and ankle accelerometers is shown in Figure 3. Most false positive predictions for fall activities are predicted as laying. Also, a very small percentage of the non-fall activities are predicted as falls, which dismisses false alarms for falls.

Figure 3: Confusion matrix for the neck and ankle accelerometers

5 CONCLUSION
In this paper we presented an approach to human activity recognition and showed how the location and number of sensors can impact the process of HAR. Our aim was to build a model that can correctly recognize and classify the fall activities using a small number of accelerometers, while still obtaining high accuracy scores. With one accelerometer placed on the ankle or the neck we got high accuracy scores, but by combining these two sensors the model can classify the falls more precisely.

The main input to our system is the data from the inertial sensors. Because the data is sensory, additional attributes are calculated. This process of feature extraction is general and can be used in similar problems. Next, the algorithms for the final tasks of activity recognition and fall detection were designed and implemented using the data from the ankle accelerometer. We used a machine learning approach to solve the problem of activity recognition. We evaluated the models, and Random Forest showed the best results. Then we compared the best model on different data and concluded that the data from the ankle and neck sensors was sufficient for human activity recognition and fall detection, with an accuracy of 94.2%.

REFERENCES
[1] United Nations Publications. World Population Ageing 2019 Highlights. Department of Economic and Social Affairs, Population Division.
[2] Labrador, M.A. and Lara Yejas, O.D., 2013. Human activity recognition: using wearable sensors and smartphones. CRC Press.
[3] Ravi, N., Dandekar, N., Mysore, P. and Littman, M.L., 2005, July. Activity recognition from accelerometer data. In AAAI (Vol. 5, No. 2005, pp. 1541-1546).
[4] Kwapisz, J.R., Weiss, G.M. and Moore, S.A., 2011. Activity recognition using cell phone accelerometers. ACM SIGKDD Explorations Newsletter, 12(2), pp. 74-82.
[5] Gjoreski, H., Luštrek, M. and Gams, M., 2011, July. Accelerometer placement for posture recognition and fall detection. In 2011 Seventh International Conference on Intelligent Environments (pp. 47-54). IEEE.
[6] Gjoreski, M., Gjoreski, H., Luštrek, M. and Gams, M., 2016. How accurately can your wrist device recognize daily activities and detect falls? Sensors, 16(6), p. 800.
[7] Atallah, L., Lo, B., King, R. and Yang, G.Z., 2011. Sensor positioning for activity recognition using wearable accelerometers. IEEE Transactions on Biomedical Circuits and Systems, 5(4), pp. 320-329.
[8] Bonomi, A.G., Plasqui, G., Goris, A.H. and Westerterp, K.R., 2009. Improving assessment of daily energy expenditure by identifying types of physical activity with a single accelerometer. Journal of Applied Physiology.
[9] The Challenge UP dataset: http://sites.google.com/up.edu.mx/har-up/
[10] Igual, R., Medrano, C. and Plaza, I., 2013. Challenges, issues and trends in fall detection systems. Biomedical Engineering Online, 12(1), p. 66.
[11] Zhang, Z., Conly, C. and Athitsos, V., 2015. A survey on vision-based fall detection. In 8th ACM International Conference on PETRA '15, ACM, New York, NY, USA, Article 46, 1-7.
[12] Attal, F., Mohammed, S., Dedabrishvili, M., Chamroukhi, F., Oukhellou, L. and Amirat, Y., 2015. Physical human activity recognition using wearable sensors. Sensors, 15(12), pp. 31314-31338.
[13] Altun, K., Barshan, B. and Tunçel, O., 2010. Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition, 43(10), pp. 3605-3620.
[14] Gjoreski, M., Janko, V., Slapničar, G., Mlakar, M., Reščič, N., Bizjak, J., Drobnič, V., Marinko, M., Mlakar, N., Luštrek, M. and Gams, M., 2020. Classical and deep learning methods for recognizing human activities and modes of transportation with smartphone sensors. Information Fusion, 62, pp. 47-62.
Sistem za ocenjevanje esejev na podlagi koherence in semantične skladnosti

Automated Essay Evaluation System Based on Coherence and Semantic Consistency

Žiga Simončič
Univerza v Ljubljani, Fakulteta za računalništvo in informatiko
Večna pot 113, 1000 Ljubljana
zs3179@student.uni-lj.si

Zoran Bosnić
Univerza v Ljubljani, Fakulteta za računalništvo in informatiko
Večna pot 113, 1000 Ljubljana
zoran.bosnic@fri.uni-lj.si

ABSTRACT
In this paper we describe an implementation of an essay grading system. We lean heavily on the methodology of an existing system which, besides using syntactical measurements, also uses coherence and semantic consistency measures. We implement the methodology in the Orange data mining tool, with a friendly user interface, optional use of word embeddings for word representation and the possibility of further development of the system. The system is evaluated on public datasets from the Kaggle website. The results are, to the largest possible extent, compared with the results of the existing methodology and analyzed in detail. We also compare several attribute selection methods, which improve our results. The main contributions of this work are (1) the implementation of the system, (2) ease of use and (3) improvements upon previous work, including additional computing options and a detailed attribute selection analysis.

KEYWORDS
automated essay evaluation, semantic consistency, Orange

1 INTRODUCTION
Teachers in educational institutions are responsible for passing knowledge to a large number of students. Part of the learning process is also the writing of essays, which teachers have to read and grade. Grading essays is not only time-consuming but potentially also somewhat biased. The teacher's task is also to mark and correct the errors and to comment on the work as a whole.

Computers can make essay grading easier. Today's essay grading systems (including commercial ones) focus mainly on syntactic analysis and pay too little attention to semantics [6]. This weakness of existing systems is addressed by the SAGE system, which Zupanc describes in her dissertation [5]. SAGE achieves enviable predictive accuracy compared to other contemporary systems, but its current implementation is at the prototype stage and not ready for production.

The main goal of this work was to implement the system in a way that makes it as accessible, simple and friendly to use as possible. To meet these goals, we decided on an implementation in the Orange environment (https://orange.biolab.si/), which is intended for rapid prototyping of models and data exploration, aimed at beginners as well as advanced users. In Orange, the system is implemented in the form of widgets, which can be connected and combined, so we left file import, model building and model testing to the widgets already implemented in Orange. In total we implemented three widgets: the first implements all the attribute functions, including coherence; the second implements the system for semantic consistency analysis; and the third is intended for model evaluation with the quadratic weighted kappa.

The Zupanc system [6] is based on the extraction of various attributes from the given texts (essays) and is divided into three (sub)systems: AGE, AGE+ and SAGE. The label "the Zupanc system" denotes her implementation of all three of these systems. Each system upgrades the previous one with additional attributes. The AGE system comprises a set of attributes covering basic syntactic statistics and readability, lexical, grammatical and content measures. These include various text characteristics, from basic ones such as the number of characters, words, etc., to the number of grammatical errors and the computation of similarity to other essays. In total, this system covers 72 different attributes; as a contribution of this paper, we added five new attributes to it (the number of characters without spaces and four additional attributes that count the occurrences of individual morphosyntactic tags), for a total of 77 attributes.

To the attributes of the AGE system we add attributes for measuring coherence, which gives the AGE+ system. Coherence is measured by first splitting the text into overlapping segments (a sliding window) and transforming the individual segments into a multidimensional space. In this space the segments can be compared, and various measures can be used to assess the consistency of the text and the flow of thought. The number of attributes for measuring coherence is 29.

If we add to all the above attributes a further set of three attributes obtained by checking semantic consistency, we speak of the SAGE system. The system for detecting semantic errors uses an ontology in the background, to which facts extracted from the text are gradually added. Logical reasoning is then used to determine whether the claims in the text are logically consistent or not. This yields three additional attributes and the possibility of feedback on which sentences contain a semantic inconsistency.
od osnovnih, kot so število znakov, besed itd., pa do števila slov- The system is evaluated on public datasets from the Kaggle web- ničnih napak in računanje podobnosti z ostalimi eseji. Skupno ta site. The results are to the most possible extent compared with sistem zajema 72 različnih atributov, v prispevku tega članka pa the results of the existing methodology and analyzed in detail. smo temu sistemu dodali še pet novih atributov (št. znakov brez We also compare several attribute selection methods, which im- presledkov in štiri dodatne atribute, ki štejejo število posameznih prove our results. Main contributions of this work are comprised oblikoskladenjskih oznak). Skupno torej 77 atributov. of (1) implementation of the system, (2) ease of use and (3) im- Atributom sistema AGE dodamo atribute za merjenje kohe- provements upon previous work, including additional computing rence in s tem dobimo sistem AGE+. Koherenco merimo tako, options and detailed attribute selection analysis. da besedilo najprej razdelimo na prekrivajoče se odseke (drseče okno) in posamezne odseke pretvorimo v večdimenzionalni pro- KEYWORDS stor. V tem prostor lahko posamezne odseke primerjamo in z automated essay evaluation, semantic consistency, Orange različnimi merami ocenimo ocenimo konsistentnost besedila in tok misli. Število atributov za merjenje koherence je 29. 1 UVOD Če vsem zgornjim atributom dodamo še nabor treh atributov, ki jih pridobimo s preverjanjem semantične skladnosti, govo- Učitelji v izobraževalnih ustanovah so odgovorni za predajanje rimo o sistemu SAGE. Sistem za zaznavanje semantičnih napak v znanj velikemu številu učencev. Del učnega procesa je tudi pisa- ozadju uporablja ontologijo, kateri postopoma dodajamo dejstva, nje esejev, ki jih morajo učitelji prebrati in oceniti. Ocenjevanje ki jih izluščimo iz besedila. Z logičnim sklepanjem nato ugoto- esejev ni le časovno potratno, ampak potencialno tudi nekoliko vimo, če so trditve iz besedila logično konsistentne ali ne. To nam pristransko. Naloga učitelja je tudi, da napake označi, popravi in prinese tri dodatne atribute in možnost povratne informacije, v komentira celotno delo. katerih povedih je prišlo do semantičnega neskladja. S pomočjo računalnika lahko ocenjevanje esejev olajšamo. Dandanašnji sistemi za ocenjevanje esejev (tudi komercialni) se 1 https://orange.biolab.si/ 88 Information Society 2020, 5–9 October, 2020, Ljubljana, Slovenia Simončič in Bosnić 2 SORODNA DELA V sklopu svojega dela se je Zupanc [5] osredotočila na (v času njenega raziskovanja že zaključeno) tekmovanje avtomatskega 2 ocenjevanje esejev, ki ga je gostil Kaggle. Na tem tekmovanju so pomerili različni sistemi, s katerimi je Zupanc primerjala svoj sistem. Najboljša mesta na končni lestvici so večinoma zasedali komercialni sistemi za ocenjevanje esejev, nekaj pa je bilo tudi po meri narejenih uporabniških modelov. Komercialni sistemi 3 4 5 kot so PEG, e-rater in IntelliMetric imajo že dolgo zgodovino in s tem velik tržni delež ter izpopolnjen finančni model. V času raziskovanja noben od naštetih ni ponujal brezplačne verzije sistema. Podrobno razčlenitev modelov in splošen opis njihovega delovanja najdemo v delih Zupanc [5] ter Zupanc in Bosnić [6]. V zadnjem času se na različnih področjih čedalje bolj uve- ljavljajo nevronski modeli, zato smo pogledali in testirali nekaj izvedb. Martinc in sod. [3] opisujejo uspešnost treh različnih nevronskih modelov pri ocenjevanju besedil, ki sicer niso eseji. 
3 DESCRIPTION OF THE IMPLEMENTATION AND METHODS

3.1 Tools Used
We implemented the entire system using the Orange data mining tool, in the Python programming language. The main libraries used for parsing the text and computing the attributes are NLTK (https://www.nltk.org/), SpaCy (https://spacy.io/), scikit-learn (https://scikit-learn.org/stable/) and, for detecting spelling errors, language-check (https://pypi.org/project/language-check/). For working with ontologies we used the rdflib library (https://rdflib.readthedocs.io/en/stable/) and two external systems (in the sense of standalone local programs): ClausIE (with OpenIE5.0 also available) and HermiT (http://www.hermit-reasoner.com/).

3.2 Implementation of the Widgets in Orange
In total we developed three widgets, which cover the entire system described. Figure 1 shows all three widgets, which are described below.

Figure 1: The three implemented widgets

The first widget is intended for computing all the different measures: basic (shallow) statistical measures and readability, lexical, grammatical, content and coherence measures. The widget represents the AGE and AGE+ systems, depending on the user's choice of attributes to be computed. If we select the computation of all attributes except the coherence attributes, we speak of the AGE system; with the coherence attributes added, we speak of the AGE+ system. Since computing some of the advanced measures is more demanding, the user can decide to compute any combination of the six listed groups of measures. For the content and coherence measures, there is an additional choice of the method for transforming the text into a multidimensional vector space. Here we support two methods: the statistical TF-IDF transformation and GloVe word embeddings (in two variants: SpaCy and Flair).

The widget has three inputs:
(1) an input for graded essays,
(2) an input for ungraded essays and
(3) an input for the source text.

The inputs for graded and ungraded essays are intended for the training set of graded essays and for the set of ungraded essays whose grades we want to predict. The same attributes are computed on both sets; the attributes of the graded essays are used to build the model. The input for the source text is optional and represents the source story, book or facts that the essay writer is expected to know. If the essays are based on some source text, we connect it to the corresponding input, which adds an additional attribute (the similarity of the essay to the source text). The widget has two outputs, namely an output for the computed attributes of the graded essays and an output for the computed attributes of the ungraded essays. This allows us to set the data appropriately as inputs to other Orange widgets.
The second widget covers the work with the ontology and the search for semantic inconsistencies. It represents the computation of the additional attributes brought by the SAGE system. The widget is standalone because of its large computational and time complexity. It has two settings: whether we want to use the coreference resolver, and whether we want a detailed explanation to be returned for semantic errors. Using coreference resolution is recommended, since in cases of indirect reference to different concepts in the text it is the only way to capture the complete semantic information. We can also select a source text or story with which the ontology is extended, so that it also includes the content of the underlying text. This text is processed before everything else, and the extracted triples are added to the ontology; the extended ontology is then used for checking the consistency of the essays. If no source text is added, the basic ontology (the COSMO ontology) is used for consistency checking as usual. The widget has a single input (an input for essays) and a single output: a table of three attributes with the counts of the individual error types and a string with a basic explanation, plus an additional column with a detailed explanation if this option is selected.

The third widget is intended for the evaluation of the predicted grades against the true grades of the essays. Since Orange does not support measures for computing exact agreement and the quadratic weighted kappa (QWK), we built a widget that receives a table with the predicted grades and the true grades. We modeled it on the output of the Test and Score widget: to ensure interoperability, that output can be connected directly to the input of our widget, where the two aforementioned measures are computed. The use of the widget for computing the attributes and evaluating the model with the quadratic weighted kappa is shown in Figure 2.

Figure 2: Example use of the AGE/AGE+ system
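For illustration, the two measures computed by this widget can be obtained in a few lines of scikit-learn; this is a sketch of the metrics themselves, not of the widget's internals, and the grades are a toy example.

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    y_true = np.array([2, 3, 4, 3, 2, 4])   # true essay grades (toy example)
    y_pred = np.array([2, 3, 3, 3, 2, 4])   # predicted grades, rounded to integers

    exact_agreement = np.mean(y_true == y_pred)
    qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
    print(exact_agreement, qwk)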
3.3 Semantic Analysis
One of the main contributions of the work of Zupanc and Bosnić [6] is the use of ontologies for determining semantic consistency. This procedure is useful in two ways: it yields several additional attributes that can be used when predicting essay grades, and it additionally tells us where the semantic errors are located. The latter functionality is very important, since the student thereby receives direct information about the errors in the essay.

The procedure is based on the use of an ontology, to which sentences structured into relations are gradually added while the consistency of the ontology is checked along the way. The basic structure of the ontology is represented by "triples" of the form (subject, relation, object). A relation can represent a constraint or a conceptual connection (e.g., (Alice, isMotherOf, Bob)), or it can define a type. In the implementation we used the RDF language to represent the triples, which is similar to OWL but is not a logical language. We used the COSMO (Common Semantic Model) ontology. It is represented in the OWL (Web Ontology Language, https://www.w3.org/OWL/) semantic language, which allows the construction of complex schemas of various concepts, facts and mutual relations. If we wanted to add additional specific knowledge to the ontology, we could do so. In our case, alongside some of the essays there is also a source text on which the essays were based. The source text is added to the ontology before the essays, following the same procedure as for the essays, which is explained below.

For an individual essay, we first perform coreference resolution on the text. Resolving references allows us to discover indirect references to particular entities and replace them with the direct entity. For example, "Bob likes pizza. He eats it all the time." is replaced with "Bob likes pizza. Bob eats pizza all the time." The next step is splitting the text into individual sentences and extracting information with an OpenIE (Open Information Extraction) system. In this step, individual sentences are transformed into one or more triples that describe the relations expressed in the sentence and are suitable for logical processing. For the example above, we would thus obtain two triples: (Bob, like, pizza) and (Bob, eat, pizza). We used the ClausIE extraction system [2], and we also support the option of using the OpenIE5 system (https://github.com/dair-iitd/OpenIE-standalone).

All the obtained triples are then gradually added to the ontology while its consistency is checked. For each element of a triple, we try to find an already existing element in the ontology. In doing so we search through synonyms, hypernyms and antonyms, and in the worst case we add a new element to the ontology. After each addition of elements and triples, the consistency of the ontology is checked. Consistency is checked with the HermiT logical reasoner, which returns two types of errors. The first type occurs when a class (owl:Class) has entities assigned to it that it must not have (an unsatisfiable case). The second type is triggered when reasoning discovers a logical error, i.e., an inconsistent ontology. Such errors usually arise from direct oppositions (e.g., owl:disjointWith) between two relations, stating that an entity cannot have both relations at the same time.

Based on the types of errors raised, we construct three additional attributes that can be used when predicting essay grades: the number of unsatisfiable cases (when adding new entities to the ontology), the number of inconsistent-ontology errors (when adding triples) and the sum of the two.
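The triple-handling part of this pipeline can be sketched with rdflib as below; the namespace is hypothetical, and the actual consistency check against COSMO is performed by the external HermiT reasoner, which is not shown.

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/essay#")  # hypothetical namespace
    g = Graph()

    # Triples extracted from "Bob likes pizza. Bob eats pizza all the time."
    for subj, rel, obj in [("Bob", "like", "pizza"), ("Bob", "eat", "pizza")]:
        g.add((EX[subj], EX[rel], EX[obj]))

    # The graph would then be merged with the COSMO ontology and handed to
    # HermiT, which reports unsatisfiable classes and inconsistencies.
    print(g.serialize(format="turtle"))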
3.4 Results
We tested the system on the data of the several-years-old ASAP competition (https://www.kaggle.com/c/asap-aes) on the Kaggle website. The data comprise eight different datasets (or nine, since the second dataset is graded according to two criteria). The essay topic in each dataset is different. The datasets are split into training, validation and test sets, but the grades of the validation and test sets are not available, so we used 10-fold cross-validation to evaluate our system. The range of grades differs between datasets, from 0–4 up to 0–60. To evaluate the models we used the quadratic weighted kappa measure, which takes the range of grades into account and returns the relative agreement of the predicted grade with the actual grade. We tested the system with a linear regression model and with random forests. Linear regression performed better, so we focused on it in further experiments. We used L2 regularization with the parameter α = 0.02.

Initially we built the models on the whole set of computed attributes. Since the AGE+ system, presumably because of its too-large number of attributes (106), did not achieve better results than the AGE system, we tried several attribute selection methods. The main methods of our analysis were forward attribute selection and backward feature elimination. Both methods improved the result. We used them together with 10-fold cross-validation: in each iteration of the cross-validation we added/removed individual attributes and, based on the average across all iterations, added/removed the attribute with the largest/smallest contribution. We repeated this until there was no more improvement.

During the analysis we noticed that the set of attributes that makes the final selection is relatively small. We found that, because of the cross-validation, there is a high chance of reaching a local optimum with the current attribute set. Due to the averaging across all iterations, an attribute can improve the result in one iteration and worsen it in another, and is on average marked as unsuitable. To avoid these local optima, we implemented a threshold on how many times the average result may worsen before the attribute set is marked as final. This worsened the result in the short term, but in the long term created a combination of attributes that on average gives a better result. With this method of avoiding local optima we further improved the final results, which are summarized in Table 1. In the selection and elimination columns, AGE is omitted, since AGE+ achieves better results in both cases.

Table 1: Comparison of the results without attribute selection (TF-IDF), comparison with the Zupanc system (AGE), and summarized results of attribute selection and elimination on the AGE+ system

        No selection          Zupanc    Selection   Elimination
        AGE      AGE+         (AGE)     AGE+        AGE+
DS1     0.8358   0.8343       0.8447    0.8369      0.8439
DS2a    0.7001   0.7073       0.7389    0.7158      0.7324
DS2b    0.6789   0.6676       0.5386    0.6941      0.7028
DS3     0.6578   0.6622       0.6591    0.6656      0.6958
DS4     0.7536   0.7547       0.7174    0.7619      0.7769
DS5     0.7964   0.7955       0.7949    0.8028      0.8122
DS6     0.7734   0.7675       0.7636    0.7771      0.7871
DS7     0.8071   0.8034       0.7888    0.8083      0.8183
DS8     0.7479   0.7428       0.7738    0.7681      0.7717
AVG     0.7501   0.7484       0.7356    0.7590      0.7712
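A minimal sketch of the forward selection loop with the described patience threshold is given below; it is our illustration, not the authors' code, and it scores candidate subsets with the default cross-validation score rather than the quadratic weighted kappa used in the paper.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, patience=3):
        remaining = list(range(X.shape[1]))
        chosen, best, bad_rounds = [], -np.inf, 0
        while remaining and bad_rounds < patience:
            # score every candidate attribute added to the current subset
            score, j = max((np.mean(cross_val_score(Ridge(alpha=0.02),
                                                    X[:, chosen + [k]], y, cv=10)), k)
                           for k in remaining)
            chosen.append(j)
            remaining.remove(j)
            if score > best:
                best, bad_rounds = score, 0
            else:
                bad_rounds += 1  # tolerate temporary drops to escape local optima
        return chosen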
Because the test data are no longer available, we could compare our results with the Zupanc system only through a comparison of the AGE systems with 10-fold cross-validation. We can see that we achieve very similar results to the Zupanc system, or slightly exceed them; with suitable attribute selection, our result improves further.

We left the SAGE system out of the table, since its results with attribute elimination are only marginally better than those of the AGE+ system, and we used it only on the datasets that contained a source text (only four datasets). Nevertheless, upon detecting a semantic inconsistency, the SAGE system offers feedback output. The following example shows the operation of the coreference resolver and the detection of semantic errors. Because of stemming, some words in the explanations may be truncated. The input "George likes basketball and doesn't like sports." triggers an error with the explanation: "Relation 'George likes basketball and George doesn't like sports.' is inconsistent with a relation in ontology: 'George likes basketball and George doesn't like sports.'" and the detailed explanation: "Relation not consistent: Georg likes Basketball. Relations doesNotLike and likes are opposite/disjoint. Relation not consistent: Georg doesNotLike Basketball.". The basic explanation works at the level of sentences and in this case tells us that the sentence contradicts itself. The detailed explanation says that George both likes and does not like basketball. The word "sports" does not appear in the detailed explanation because basketball is a subclass of sport and the contradiction first arises there.

We would also mention the comparison of our system with the aforementioned neural models. The model of Taghipour and Tou Ng [4] achieves results similar to our system (slightly below 0.77). Alikaniotis et al. [1] report that their model achieves a result of 0.96, but we suspect some irregularities stemming from an incorrect use of the model evaluation measure (the quadratic weighted kappa). We suspect that they used all the datasets together to train their model, since their result is in the range of almost 100% accuracy (0.96), with twice the absolute error (RMSE) of our model, which achieves a result of approximately 0.77. Using all the datasets with our system, we also obtain such a high result (0.97 or 0.94, depending on the model).

4 CONCLUSION
As part of this work, we implemented an essay grading system modeled on the work of Zupanc [5] in the Orange environment. The implementation in Orange allows for simple use of the system and compatibility with the functionalities already implemented in Orange. We added several new attributes to the system, as well as the option of representing words with GloVe word embeddings. Our implementation of the system is available in a git repository (https://github.com/venom1270/essay-grading). The system is based on the extraction of a large number of attributes from the texts and the subsequent selection of the best subset for a given dataset. The innovative part of the previous work, which is also included in our implementation, is the additional system for checking semantic consistency, with which the attribute set is further enriched, while the system can also print out all the detected semantic errors or inconsistencies. A contribution of this paper is also the comparison of attribute selection techniques and the comparison of the results with the previous work. It would make sense to test the system with other predictive models as well, since in our work we focused mostly on linear regression and random forests. An interesting challenge would also be adapting the system to the Slovenian language, which is syntactically more complex, while its text processing tools are not yet as mature as those for English.

ACKNOWLEDGMENTS
We thank the colleagues of the Bioinformatics Laboratory at the Faculty of Computer and Information Science for their support and advice in implementing the system in the Orange environment.

REFERENCES
[1] Dimitrios Alikaniotis, Helen Yannakoudakis and Marek Rei. 2016. Automatic text scoring using neural networks. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). doi: 10.18653/v1/p16-1068.
[2] Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: Clause-Based Open Information Extraction. In Proceedings of the 22nd International Conference on World Wide Web, 355–366.
[3] Matej Martinc, Senja Pollak and Marko Robnik-Šikonja. 2019. Supervised and unsupervised neural approaches to text readability. arXiv preprint arXiv:1907.11779.
[4] Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, 1882–1891. doi: 10.18653/v1/D16-1193.
[5] Kaja Zupanc. 2018. Semantics-based automated essay evaluation. Doctoral dissertation. Fakulteta za računalništvo in informatiko, Univerza v Ljubljani.
[6] Kaja Zupanc and Zoran Bosnić. 2017. Automated essay evaluation with semantic analysis. Knowledge-Based Systems, 120, 118–132.

Mental State Estimation of People with PIMD using Physiological Signals

Gašper Slapničar
Jožef Stefan Institute, Jožef Stefan IPS
Jamova cesta 39, Ljubljana, Slovenia
gasper.slapnicar@ijs.si

Erik Dovgan
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
erik.dovgan@ijs.si

Jakob Valič
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
jakob.valic@ijs.si

Mitja Luštrek
Jožef Stefan Institute
Jamova cesta 39, Ljubljana, Slovenia
mitja.lustrek@ijs.si
ABSTRACT
People with profound intellectual and multiple disabilities (PIMD) are a very diverse and vulnerable group of people. Their disabilities are cognitive, motor and sensory, and they are also incapable of symbolic communication, making them heavily reliant on caregivers. We investigated the connection between physiological signals and inner states as well as communication attempts of people with PIMD, using signal processing and machine learning techniques. The inner states were annotated by expert caregivers, and several heart rate variability features were computed from the photoplethysmogram. We then fed the features into hyper-parameter-tuned classification models. We achieved the highest accuracy of 62% and F1-score of 0.59 for inner state (pleasure, displeasure, neutral) classification using Extreme Gradient Boosting, which notably surpassed the baseline.

KEYWORDS
PIMD, mental state, physiological signals, classification

1 INTRODUCTION
People with profound intellectual and multiple disabilities (PIMD) often face extreme difficulties in their day-to-day life due to severe cognitive, motor and sensory disabilities. They require a nearly ever-present caregiver to help them with most tasks. Additionally, they are unable to communicate their feelings or express their current mental state in a traditional symbolic way. This causes a gap between a caregiver and the care recipient, as it can take an extended period of time for the caregiver to recognize any potential patterns and their relationship with the mental state of the care recipient.

The aforementioned reasons call for a technological solution that might help bridge the gap between the caregiver and the care recipient and help the former better understand the latter. The INSENSION project [8] aims to develop such assistive technology, which takes into account many aspects of the care recipient. The aim is both to bridge the previously mentioned gap and to empower the people with PIMD to interact with their surroundings through technology. One part of the system considers the patterns in a person's gestures and facial expressions, which might have some significance and correlation to their behavioural and mental state, or their communication attempt. The initial solution dealing with this part was already described by Cigale et al. [1, 2]. In this paper, we instead focus on exploring the relationship between the physiological response of the body and the mental state of people with PIMD by using features computed from the photoplethysmogram (PPG). PPG is a periodic signal in which each cycle corresponds to a single heart beat. We obtained the PPG in two different ways: 1.) by using a high-quality wearable Empatica E4 with an optical sensor measuring the reflection of light from the skin and 2.) by using a contact-free RGB camera mounted on a wall, which records the color changes of the skin pixels. The features were then used to train classification models, which predicted the person's inner state or communication attempt.

The rest of this paper is structured as follows: we first investigate the related work in Section 2, then we describe the data collected and used in the experiments in Section 3. We continue with the methodology and experimental setup description in Section 4, and conclude with results and discussion in Section 5.

2 RELATED WORK
The connection between physiological parameters and mental states is a mature and highly-researched field when it comes to average healthy people. Schachter et al. [6] investigated the emotional state of people as a function of cognitive, social and physiological state. Several propositions were made and experimentally confirmed, supporting the overall connection between emotional and physiological state.
Cigale et al. [1, 2] explored the communication signals of people with PIMD, which are atypical and idiosyncratic. They highlighted the challenging interpretation of these signals and their meaning and suggested how technology could help overcome the gap between caregivers and care recipients. Some models were proposed that take the person's non-verbal signals (NVS) as input and classify their inner state or communication attempt.

Kramer et al. [3] highlighted the challenges of analysing the NVS in people with PIMD, as they are difficult to discern, instead focusing on physiological body responses. They conducted a study in which the expressions of three emotional states of one person with PIMD were recorded during nine emotion-triggering situations. They collected heart rate (HR) and skin conductance level (SCL), and investigated the connection between these two physiological signals and the emotional state. They found higher SCL activity during anger or happiness and lower SCL activity during relaxation or a neutral state.

Vos et al. [9] confirmed that HR and skin temperature allow the same conclusions in people with PIMD and people without disabilities regarding positive and negative emotion. This finding gives additional motivation to our work, showing that the connection between physiological and mental state also holds for people with PIMD.

3 DATA
We created a recording setup in the INSENSION project, which uses two Logitech C920 cameras capable of recording full HD (1920x1080) video at 30 frames per second (fps). The cameras were set up perpendicular to one another to record from two distinct angles, allowing for decent facial exposure even when the face changes direction. The caregivers were instructed to attempt to conduct their activity in front of one of the cameras whenever possible. Additionally, the subjects were given an Empatica E4 wristband, which served both as the ground truth for PPG and as a fall-back mechanism for obtaining physiological signals in cases when the camera is unreliable or unavailable. The wristband records PPG at 64 Hz, allowing for the capture of reasonable morphological detail. The temporal synchronization between the video and the ground truth was ensured to the best of our abilities using suitable protocols and checks.

With the described setup, we obtained 48 recording sessions, each lasting between 10 and 30 minutes.
Five sessions were eliminated immediately, as there was a large mismatch between the duration of the video and the duration of the ground truth, which may happen for several reasons, such as a caregiver forgetting to turn on the wristband during a session or the wristband losing connection.

It is important to note that the recordings were made in a natural way, as the caregivers were not given any additional restrictions other than to be in front of the camera when possible. In practice this means that large parts of some recordings might be useless due to the person with PIMD being turned away or the caregiver blocking them. Examples of good and bad sessions are shown in Figure 1.

Figure 1: Example of good (green) and bad (red) video recordings.

3.1 Annotating the ground truth
In order to classify mental states of people with PIMD, we first required ground truth annotations. As it is generally difficult to obtain such ground truth, we relied on the expert knowledge of partners in the project who specialize in the education of people with special needs, alongside the caregivers, who know their care recipients best. Together they devised an annotation schema in which they annotated the inner states and communication attempts of people with PIMD, which can take the values given in Equations 1 and 2:

InnerState = \begin{cases} displeasure & \text{if } 1, 2 \text{ or } 3 \\ neutral & \text{if } 4, 5 \text{ or } 6 \\ pleasure & \text{if } 7, 8 \text{ or } 9 \end{cases} \quad (1)

CommAttempt \in \{ protest, comment, demand \} \quad (2)

The three numbers within each mental state indicate the intensity, where a lower number for displeasure means more intense displeasure, and a higher number for pleasure indicates more intense pleasure.

The caregivers were tasked with the annotation of the videos, looking at the camera recordings and marking inner states and communication attempts in time, always marking the start and end of each recognized state, regardless of its duration (which can be a few seconds or a few minutes). Naturally, large periods remained where nothing was annotated, as the experts were either not sure or did not recognize any of the pre-defined states. This does not mean that nothing is happening in those periods, but simply that the inner experience of the person with PIMD is unknown. Thus, we added an additional class value for the areas where nothing was annotated: unknown.
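As a concrete illustration of this schema (our sketch, not project code), the 1-9 intensity codes of Equation 1 map onto the three inner-state classes as follows:

    def inner_state(code: int) -> str:
        # Eq. (1): 1-3 -> displeasure, 4-6 -> neutral, 7-9 -> pleasure
        if not 1 <= code <= 9:
            raise ValueError("intensity code must be between 1 and 9")
        return ("displeasure", "neutral", "pleasure")[(code - 1) // 3]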
4 METHODOLOGY OF MENTAL STATE ESTIMATION
Having both the ground truth annotations and the physiological data and videos, we investigated two approaches: 1.) we attempted to reconstruct the PPG from the camera recordings in a contact-free manner and use the reconstructed rPPG (remote PPG) to calculate features and to classify inner state and communication attempt, and 2.) we directly used the Empatica ground truth PPG to calculate features to be used in the same classification task.

4.1 Using rPPG Reconstruction
In order to obtain the remote PPG, we used a rather standard pipeline, augmented with a convolutional neural network to further enhance the rPPG. At a high level, the pipeline consists of detection of the region of interest (ROI), extraction of the red, green and blue signal components (RGB), detrending and band-pass filtering of the RGB, rPPG reconstruction using the Plane Orthogonal to Skin (POS) algorithm, band-pass rPPG filtering (0.5 to 4.0 Hz), and rPPG enhancement via deep learning. The details were already described in our previous work [7] and are not the subject of this paper.

We ran the pipeline described above on 30-second segments of video using a sliding window without overlap. We decided on 30 seconds due to the nature of some of the frequency features that we chose, as frequency analysis makes sense once a reasonable number of periods is available; in our case this means that a sufficient number of heart cycles must be available. Additionally, this length makes sense as we are primarily attempting to predict inner states, which do not change extremely in such a short time span. An example output of the pipeline is shown in Figure 2.

Figure 2: Example of a good rPPG segment obtained with our pipeline.

We then used the rPPG to compute several heart rate variability (HRV) features. These are known to be well correlated with stress, cognitive load, conflict experience and other inner states [5, 4]. A detailed list of the computed features is given in Table 1.

Table 1: List of computed HRV features.

Feature             Description
HRmean              60/mean(NN)
HRmedian            60/median(NN)
IBImedian           median(NN)
SDNN                std(NN)
SDSD                std(abs(NN'))
RMSSD               sqrt(mean((NN')^2))
NN20 and NN50       The number of pairs of successive NNs that differ by more than 20 ms and 50 ms
pNN20 and pNN50     The proportion of NN20 and NN50 divided by the total number of NNs
SDbonus1            sqrt(0.5) * SDNN
SDbonus2            sqrt(abs(2 * SDSD^2 - 0.5 * SDSD^2))
VLF                 Area under the periodogram in the very low frequencies
LF                  Area under the periodogram in the low frequencies
HF                  Area under the periodogram in the high frequencies
LFnorm and HFnorm   Area under the periodogram in the low and high frequencies, normalized by the whole area under the periodogram
LFdHF               LF/HF

where std is the standard deviation, abs the absolute value, X' the first-order derivative, sqrt the square root, and NN the beat-to-beat intervals.
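As an illustration, the time-domain entries of Table 1 can be computed from the beat-to-beat intervals as in the sketch below (our code, not the authors'); the frequency-domain features would additionally require a periodogram, e.g. via scipy.signal.welch, and are omitted for brevity.

    import numpy as np

    def hrv_time_features(nn):
        """nn: beat-to-beat (NN) intervals in seconds within one 30 s window."""
        nn = np.asarray(nn, dtype=float)
        d = np.diff(nn)                      # NN' (first-order derivative)
        return {
            "HRmean": 60 / nn.mean(),
            "HRmedian": 60 / np.median(nn),
            "IBImedian": np.median(nn),
            "SDNN": nn.std(),
            "SDSD": np.abs(d).std(),
            "RMSSD": np.sqrt(np.mean(d ** 2)),
            "NN20": int(np.sum(np.abs(d) > 0.020)),
            "NN50": int(np.sum(np.abs(d) > 0.050)),
            "pNN20": float(np.mean(np.abs(d) > 0.020)),
            "pNN50": float(np.mean(np.abs(d) > 0.050)),
        }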
4.2 Using Empatica PPG
The Empatica records PPG directly on the skin, thus making the raw PPG readily available, without the need for additional reconstruction. Still, due to the subjects' arm and wrist movements, we opted to use similar preprocessing steps as previously, namely detrending and band-pass filtering, as the signal can sometimes be quite noisy. We computed the same set of features with the same window length as before (see Table 1), and used them in the same classification task, attempting to recognize inner states and communication attempts.

5 EXPERIMENTS AND RESULTS
Once both the input (HRV features) and the output (annotations) were known, we investigated six classification algorithms (k-Nearest Neighbours, Decision Trees, Random Forest, Support Vector Machines, AdaBoost and Extreme Gradient Boosting) for this task, always training separate models for inner state and communication attempt. We always compared each algorithm against a baseline majority-vote classifier using two metrics, accuracy and F1-score.

5.1 Using Empatica PPG
We started our evaluation using the Empatica data, as it is more reliable, since PPG reconstruction is not needed. At the time of evaluation, we had annotations for 15 recording sessions in which 2 different people with PIMD are present. Using the chosen 30-second window, we initially had 417 segments of Empatica PPG available. The unknown class label heavily skewed the data for both classes, and there is no way to know which (other) class label such a segment actually belongs to, so we decided to exclude it from the evaluation. This left us with 272 instances for the class inner state and 80 instances for the class communication attempt, which was annotated more sparsely. The final distributions for each class are shown in Figure 3.

Figure 3: Distributions of both classes.

Initially we conducted a 5-fold cross-validation (CV) to investigate the best hyper-parameters using a grid search. Once the hyper-parameters were determined, we ran a separate experiment, using the best overall hyper-parameters for each model. Again, we ran a 5-fold CV with the best hyper-parameter settings obtained on the full data to validate the performance. All the investigated algorithms (from the Scikit-learn and XGBoost packages) and their corresponding sets of optimized parameters with the best values are available from the authors, but are not listed here due to space restrictions. The results of our evaluation in terms of accuracy and F1-score for both classes are given in Table 2.

Table 2: Accuracy and F1-score for both classes.

Algorithm            ACC (inner state)    F1 (inner state)
Baseline (majority)  0.52                 0.36
kNN                  0.55                 0.55
Tree                 0.54                 0.56
RF                   0.57                 0.56
SVM                  0.55                 0.52
AdaBoost             0.59                 0.56
XGB                  0.62                 0.59

Algorithm            ACC (comm. attempt)  F1 (comm. attempt)
Baseline (majority)  0.45                 0.27
kNN                  0.42                 0.42
Tree                 0.41                 0.39
RF                   0.46                 0.43
SVM                  0.43                 0.34
AdaBoost             0.43                 0.41
XGB                  0.48                 0.45
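The hyper-parameter search can be reproduced in outline as follows (our sketch, not the authors' code); the parameter grid and the synthetic stand-in data are illustrative, as the actual grids are not listed in the paper.

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(272, 17))     # stand-in for the 272 HRV feature vectors
    y = rng.integers(0, 3, size=272)   # stand-in inner-state labels

    param_grid = {"n_estimators": [50, 100, 200],    # illustrative grid
                  "max_depth": [3, 5, 7]}
    search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="f1_macro")
    search.fit(X, y)
    print(search.best_params_, search.best_score_)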
5.2 Using rPPG reconstruction
Using the rPPG for evaluation proved to be more difficult, as we only had a limited amount of good consecutive facial crops from the videos, while also having a limited amount of annotations. This meant that the overlap between the two was very small: we had only 12 such 30-second segments for inner state and only 6 for communication attempt. Such a low amount of data is infeasible for a realistic evaluation scheme (not even all three different class labels were present), so we instead decided to use the models previously trained on the Empatica data to classify these instances obtained via the rPPG. We achieved a reasonably high accuracy of 75% and F1-score of 0.84 for inner state, and a low accuracy of 33% and F1-score of 0.33 for communication attempt. The confusion matrices are shown in Figure 4.

Figure 4: Confusion matrices for classifying rPPG instances using models trained on Empatica data. For inner state, the class values are 1.0="neutral" and 2.0="pleasure". For communication attempt, 1.0="comment" and 2.0="demand".

6 CONCLUSION
We conducted an initial investigation of the connection between physiological signals and mental states of people with PIMD, attempting to classify their inner states and communication attempts. We used HRV features computed from the PPG obtained with an Empatica E4 wristband and investigated the performance of such models on instances obtained via rPPG. XGB showed the best performance, achieving an accuracy of 62% and F1-score of 0.59 for inner state, and an accuracy of 48% and F1-score of 0.45 for communication attempt, notably surpassing the baseline majority classifier.

The limitations of our work lie in the low number of instances for communication attempt and the little variety in subjects, having just two for which annotations were available. Additionally, the evaluation using the rPPG is limited, as we had very few instances for which both high-quality segments of video and annotations were available. Thus, the focus of future work should be on gathering more data and conducting a more extensive evaluation of the methods, which is planned in the trial stage of the INSENSION project.

ACKNOWLEDGMENTS
This work is part of the INSENSION project that has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 780819. The authors also acknowledge the financial support from the Slovenian Research Agency (ARRS).

REFERENCES
[1] Matej Cigale and Mitja Luštrek. 2019. Multiple knowledge categorising behavioural states and communication attempts in people with profound intellectual and multiple disabilities. In AmI (Workshops/Posters), 46–54.
[2] Matej Cigale, Mitja Luštrek, Matjaž Gams, Torsten Krämer, Meike Engelhardt, and Peter Zentel. 2018. The quest for understanding: helping people with PIMD to communicate with their caregivers. INFORMATION SOCIETY-IS 2018.
[3] Torsten Krämer and Peter Zentel. 2020. Expression of emotions of people with profound intellectual and multiple disabilities. A single-case design including physiological data. Psychoeducational Assessment, Intervention and Rehabilitation, 2, 1, 15–29.
[4] Richard D Lane, Kateri McRae, Eric M Reiman, Kewei Chen, Geoffrey L Ahern, and Julian F Thayer. 2009. Neural correlates of heart rate variability during emotion. Neuroimage, 44, 1, 213–222.
[5] Junoš Lukan, Martin Gjoreski, Heidi Mauersberger, Annekatrin Hoppe, Ursula Hess, and Mitja Luštrek. 2018. Analysing physiology of interpersonal conflicts using a wrist device. In European Conference on Ambient Intelligence. Springer, 162–167.
[6] Stanley Schachter. 1964. The interaction of cognitive and physiological determinants of emotional state. In Advances in Experimental Social Psychology. Volume 1. Elsevier, 49–80.
[7] Gašper Slapničar, Erik Dovgan, Pia Čuk, and Mitja Luštrek. 2019. Contact-free monitoring of physiological parameters in people with profound intellectual and multiple disabilities. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
[8] Poznań Supercomputing and Networking Center. 2017. The INSENSION project. https://www.insension.eu/.
[9] Pieter Vos, Paul De Cock, Vera Munde, Katja Petry, Wim Van Den Noortgate, and Bea Maes. 2012. The tell-tale: what do heart rate, skin temperature and skin conductance reveal about emotions of people with severe and profound intellectual disabilities? Research in Developmental Disabilities, 33, 4, 1117–1127.

Energy-Efficient Eating Detection Using a Wristband

Simon Stankoski
Department of Intelligent Systems, Jožef Stefan Institute
Jožef Stefan International Postgraduate School
Ljubljana, Slovenia
simon.stankoski@ijs.si

Mitja Luštrek
Department of Intelligent Systems, Jožef Stefan Institute
Jožef Stefan International Postgraduate School
Ljubljana, Slovenia
mitja.lustrek@ijs.si

ABSTRACT
Understanding people's dietary habits plays a crucial role in interventions that promote a healthy lifestyle. For this purpose, a multitude of studies explored automatic eating detection with various sensors.
Despite progress over the years, most proposed approaches are not suitable for implementation on embedded devices. The purpose of this paper is to describe a method that uses a wristband configuration of sensors to continuously track wrist motion throughout the day and detect periods of eating automatically. The proposed method uses an energy-efficient approach for the activation of a machine learning model, based on a specific trigger. The method was evaluated on data recorded from 10 subjects during free-living. The results showed a precision of 0.84 and a recall of 0.75. Additionally, our analysis shows that by using the trigger, the usage of the machine learning model can be reduced by 80%.

KEYWORDS

Eating detection, wristband, energy efficient, activity recognition

1 INTRODUCTION

Understanding people's dietary habits plays a crucial role in interventions that promote a healthy lifestyle. Obesity, which is a consequence of bad nutritional habits and excessive energy intake, can be a major cause of cardiovascular diseases, diabetes or hypertension. The latest statistics indicate that obesity prevalence has increased substantially over the last three decades [1]. More than 600 million adults (13% of the total adult population) were classified as obese in 2014 [2]. In addition, the prevalence of obesity in the European Region is estimated to reach 23% by 2025. Also, in 2017, it was reported that poor diet has contributed to 11 million deaths worldwide. Monitoring the eating habits of overweight people is an essential step towards improving nutritional habits and weight management.

Another group of people that require monitoring of their eating behavior are people with mild cognitive impairment and dementia. They often forget whether they have already eaten and, as a result, eat lunch or dinner multiple times a day or not at all, which might cause additional health problems. Proper treatment of these issues requires an objective estimation of the time the meal takes place, the duration of the meal, and what the individual eats.

Wristband devices and smartwatches are increasingly popular, mainly because people are accustomed to wearing watches, which makes the wrist one of the least intrusive body placements to wear a device. Additionally, the cost of these devices is relatively low, which makes them easily accessible to everyone. However, these devices offer limited computing power and battery life, which makes the implementation of a smart feature such as eating detection on such a device a challenging task.

This paper describes a method for real-time eating detection using a wristband. The proposed method detects periods and duration of eating. The output from the method can be used to track the frequency of eating and could serve to start methods for counting food intakes.

The work done in this study is important for the following reasons. We developed a trigger that can reduce the usage of the machine learning procedure, meaning that our method will not greatly affect the battery life of the device. Additionally, we evaluated different machine learning algorithms in terms of accuracy and model size. The method was evaluated on data recorded in real life from 10 subjects.

2 RELATED WORK

Recent advancements in wearable sensing technology (e.g., commercial inertial sensors, fitness bands, and smartwatches) have allowed researchers and practitioners to utilize different types of wearable sensors to assess dietary intake and eating behavior in both laboratory and free-living conditions. A multitude of studies for the detection of eating periods have been proposed in the past decade. Mirtchouk et al.
[3] explored eating detection using several sensors, combining real-life and laboratory data. Thomaz et al. [4] proposed a method that recognizes intake gestures separately and later clusters the intake gestures within 60-minute intervals; the method was evaluated on real-life data. Dong et al. [5] proposed a method for eating detection in real-life situations based on the novel idea that meals tend to be preceded and succeeded by periods of vigorous wrist motion. Amft et al. [6] presented an accurate method for eating and drinking detection using sensors attached to the wrist and upper arm of both hands. Navarathna et al. [7] combined sensor data from a smartwatch and a smartphone, which resulted in improved eating detection accuracy compared to using only smartwatch data. Kyritsis et al. [8] proposed a deep-learning-based method that recognizes bite segments, which are used for the construction of eating periods.

The work presented in this paper is an extension of our previous work [9]; the main novelty is an energy-efficient approach for real-time eating detection.

3 METHOD

The proposed eating detection method consists of two parts: a threshold-based trigger, used for the activation of an eating detection machine learning procedure, and a machine-learning method that predicts whether eating took place.

3.1 Energy-Efficient Trigger

The recent advancements in the technological development and accessibility of wearable devices bring new opportunities in the field of human activity recognition (HAR). However, the limited battery life and computational resources remain a challenge for the real-life implementation of advanced HAR applications. Using a machine-learning-based model for eating detection that works all the time results in a rapid battery drain. Therefore, we designed a threshold-based trigger that activates the machine learning model only when specific criteria are met. The main concept behind the trigger is to select only the moments when the person is moving a hand towards the head.

For this purpose, we used data from an accelerometer. This sensor provides information about the wristband's orientation, from which we can see whether the hand is oriented towards the head. Recent accelerometers used in battery-limited devices can store acceleration values in their internal memory without interacting with the main chip of the microcontroller.

The first step of the trigger implementation is to define the buffer size in the sensor's internal memory and the sensor's sampling frequency. Based on these two parameters, we enable the accelerometer to collect data for a specific time without interacting with the main chip of the microcontroller. This means that the main chip of the microcontroller can be in sleep mode for the predefined period. When the accelerometer's buffer is full, the accelerometer interrupts the main chip and transfers the stored acceleration data to it. We use the accelerometer's y-axis and z-axis to detect moments when the individual is moving the hand towards the face. Namely, we calculate the mean value for both axes, and if both values are above a predefined threshold, the machine learning procedure for eating detection is activated. We used two axes for the trigger to reduce the possible situations in which the trigger is falsely activated. However, one can work with only one axis, which will result in more activated triggers. We could say that having more activated triggers is not desirable. However, if the eating detection method is not good enough to detect eating after a trigger is activated during a meal, then the constraints of the trigger should be reduced.

The next step is the definition of stopping criteria for the machine learning model. The idea here is to stop the machine learning procedure after a specific number of windows if no eating is detected. Each time our trigger is activated, the machine learning procedure is turned on for the next three buffers of data. The machine learning procedure is stopped if there is no positive prediction in any of the three windows. However, if there is at least one positive prediction, the machine learning procedure continues to work for another three new buffers. The number of windows for which the machine learning procedure remains active was obtained experimentally.
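A minimal sketch of the described trigger logic, assuming the interrupt delivers the buffer as an array with one column per accelerometer axis; the threshold values are hypothetical, as the paper does not report them:

```python
# Sketch: threshold-based trigger over one accelerometer buffer.
# Columns are assumed to be (x, y, z); thresholds are placeholders.
import numpy as np

THRESH_Y = 0.3  # assumed value, in g
THRESH_Z = 0.3  # assumed value, in g

def trigger_fired(buffer: np.ndarray) -> bool:
    """Return True when the mean y- and z-axis values both exceed their
    thresholds, i.e. the hand is plausibly oriented towards the head."""
    mean_y = buffer[:, 1].mean()
    mean_z = buffer[:, 2].mean()
    return mean_y > THRESH_Y and mean_z > THRESH_Z
```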
3.2 Machine-Learning Procedure

A detailed description of the used method can be found in [9]. The method is based on machine learning and consists of the following steps: filtering the accelerometer and gyroscope data coming from the wristband, segmentation of the filtered data, feature extraction, feature selection, two stages of model training, and prediction smoothing.

In the first step, the raw data were filtered with a 5th-order median filter to reduce noise. Furthermore, the median-filtered data were additionally filtered with low-pass and band-pass filters. Hence, we ended up with three different streams of data: median-, low-pass- and band-pass-filtered data.

The accelerometer and gyroscope data were segmented using a sliding window of 15 seconds with a 3-second slide between consecutive windows. This means that once we have 15 seconds of data, the buffer is adjusted to store only 3 seconds of new data. After that, each time the buffer is full, we add the new 3 seconds of data to the previous 15-second window and drop the oldest 3 seconds from it. The reason for the length of the window is that it needs to contain an entire food intake gesture [10].

After the segmentation step, we extracted three different groups of features. We also included a feature selection step to improve the computational efficiency of the method, to remove the features that did not contribute to the accuracy, and to reduce the odds of overfitting.

The training procedure for the method used in this study consists of three stages. The first two aim at training eating-detection models on an appropriate amount of representative eating and non-eating data. The third stage smooths the predictions of the model.
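A minimal sketch of the described buffering scheme, assuming a 100 Hz sampling rate (as in Section 4) and a stream of 3-second buffers delivered by the sensor; the names are illustrative only:

```python
# Sketch: 15-second windows advanced by 3 seconds. Each new 3-second
# buffer is appended and the oldest 3 seconds are dropped.
import numpy as np
from collections import deque

FS = 100              # sampling frequency in Hz, as in Section 4
WINDOW = 15 * FS      # 15-second window, in samples
STEP = 3 * FS         # 3-second slide, in samples

def sliding_windows(stream):
    """Yield 15-second windows from an iterable of 3-second buffers,
    where each buffer is an array of STEP samples."""
    window = deque(maxlen=WINDOW)
    for buffer in stream:
        window.extend(buffer)
        if len(window) == WINDOW:
            yield np.asarray(window)
```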
4 DATASET AND EXPERIMENTAL SETUP

For this study, we recorded data from 10 subjects (8 male and 2 female), ranging in age from 20 to 41 years. The data were recorded using a commercial Mobvoi TicWatch S smartwatch running Wear OS, providing 3-axis accelerometer and 3-axis gyroscope data sampled at 100 Hz. The technical description of the smartwatch's sensors shows that the recorded data are compatible with the target wristband for which we are developing our eating detection method. Additionally, the use of a commercially available smartwatch was an easier option for recording data. The collected dataset contains recordings of the usual daily activities performed by the subjects, including eating. The subjects wore the smartwatch on their dominant hand while recording. The smartwatch had an application installed on it, which enabled them to label the beginning and the end of each meal. There were no limitations on the type of meals the subjects could have while recording, which resulted in 70 different meals being included in the dataset. Furthermore, the subjects were asked to act naturally while having their meals, meaning talking, gesticulating, using the smartphone, etc. The total data duration is 161 hours and 18 minutes, out of which 8 hours and 19 minutes correspond to eating activities.

For evaluation, the leave-one-subject-out (LOSO) cross-validation technique was used. In other words, the models were trained on the whole dataset except for one subject, on which we later tested the performance. The same procedure was repeated for each subject in the dataset. The results obtained using this evaluation technique are more reliable compared to approaches where the same subject's data are used for both training and testing, which show excessively optimistic results.

As mentioned before, smartwatches offer limited resources, one of which is the size of the RAM. Therefore, we analyzed models of different sizes to see whether bigger and more complex models provide higher accuracy. We tested the performance of four different machine learning algorithms: Random Forest [11], Decision Tree [12], Logistic Regression [13] and LinearSVC [14].

We analyzed the following evaluation metrics: recall, precision and F1 score. These are the most commonly used metrics for classification tasks like ours and give a realistic estimate of the efficacy of the algorithm.
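A minimal sketch of the described LOSO evaluation, using scikit-learn's LeaveOneGroupOut with the subject ID as the group; X, y and subjects are hypothetical placeholders, and Random Forest stands in for any of the four algorithms:

```python
# Sketch: leave-one-subject-out cross-validation with per-subject
# precision and recall, assuming binary labels with eating = 1.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score

def loso_evaluate(X, y, subjects):
    logo = LeaveOneGroupOut()
    precisions, recalls = [], []
    for train_idx, test_idx in logo.split(X, y, groups=subjects):
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        precisions.append(precision_score(y[test_idx], pred))
        recalls.append(recall_score(y[test_idx], pred))
    return np.mean(precisions), np.mean(recalls)
```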
Also, the final results were obtained from the whole recording of each subject. The reason for this is mainly to give a realistic picture of how good the developed method is in real-life settings.

5 RESULTS

The primary use of the trigger is to reduce the activity of the machine learning procedure. However, for the efficiency of the trigger, a very important requirement is when and how often the trigger is activated during a meal. In order to achieve accurate predictions, we want the trigger to be activated as soon as the meal is started. Additionally, the percentage of activated triggers during a meal should be bigger compared to non-eating segments. For this purpose, we explored which window size works best with our trigger. Table 1 shows the results achieved in the conducted experiments. We tested two different window sizes with two slide values for each window, resulting in a total of four combinations.

Table 1: Different window sizes for the trigger procedure.

  Window and    Trigger            % of activated   Meals
  slide size    activation time    triggers         detected
  3 - 1         36 s               34.2             68/70
  3 - 3         41 s               32.6             68/70
  15 - 3        48 s               42.0             55/70
  15 - 5        41 s               42.0             54/70

The combinations used for the window and slide size are shown in the first column of the table. The second column shows the average time needed for the trigger to be activated for the first time after a meal is started. The third column shows the average percentage of triggered windows during a meal. These two columns were used as a metric for selecting the optimal size of the window and the slide between the windows. The last column shows the number of meals for which the trigger was activated. The values in the second and third columns were obtained only from the meals for which the trigger was activated. Row-wise comparison between these columns shows the results obtained with each combination of window and slide. We can see that the optimal combination regarding the average time needed for the trigger to be activated after a meal is started is a window size of 3 seconds with a slide of 1 second between two windows. Therefore, in our further analysis, we used this combination. The optimal window size of 3 seconds is expected if we keep in mind that the usual intake gesture lasts around 2 seconds. Longer windows fail to detect the gesture during a meal, because we usually have only two or three intakes in 15 seconds and the mean value over the whole window is low.

Table 2 shows the final results obtained using the whole method described in Section 3.

Table 2: Results of the eating detection procedure achieved with different algorithms and their model size.

  Algorithm            Precision   Recall   F1 score   Model size
  Random Forest        0.84        0.75     0.79       36339 KB
  Logistic Regression  0.70        0.71     0.70       1.25 KB
  LinearSVC            0.69        0.71     0.70       1.8 KB
  Decision Tree        0.59        0.65     0.62       175 KB

Row-wise comparison between the used evaluation metrics shows the results obtained using the different algorithms shown in the first column. Additionally, the last column of the table shows the final model size. We can clearly see that the results achieved with Random Forest are better than those of the remaining algorithms. However, if we compare the model size of the best-performing algorithm with the remaining algorithms, we can say that the results achieved using Logistic Regression and LinearSVC are acceptable. Additionally, the precision value of 0.84 shows that the combination of the trigger and the machine learning procedure can differentiate between eating and non-eating segments. However, the recall value of 0.75 suggests that a more accurate method regarding the eating periods is needed.

We also analyzed how much time each of the previously described algorithms was active during the non-eating periods. The results of this experiment are shown in Table 3. Additionally, in this table we can see the false positive rate during the non-eating periods. The best results are achieved using the Random Forest classifier, which is active during only 20% of the whole non-eating period. This means that our trigger-based procedure reduces the usage of the machine-learning procedure by 80%. However, this number also depends on the detection method, because once it is activated, the eating predictions extend the active time of the method.

Table 3: Comparison of active time and false positive rate of the machine learning algorithms during the non-eating period.

  Algorithm            Active time during       False positive rate
                       non-eating period
  Random Forest        20%                      1.36%
  Logistic Regression  22%                      2.18%
  LinearSVC            22%                      2.34%
  Decision Tree        23%                      3.93%
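The model sizes reported in Table 2 could be measured, for example, by serializing each trained model; the paper does not state how the sizes were obtained, so the following is only an illustrative sketch:

```python
# Sketch: estimate a trained model's size in kilobytes by serializing it.
# How the authors actually measured model size is not stated in the paper.
import pickle

def model_size_kb(model) -> float:
    return len(pickle.dumps(model)) / 1024.0
```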
6 CONCLUSION AND FUTURE WORK

In this paper, we presented a method that can accurately detect eating moments using 3-axis accelerometer and gyroscope sensor data. Our method consists of an energy-efficient trigger and a machine-learning procedure, which is started only after the trigger is activated. We evaluated this method using a dataset of 70 meals from 10 subjects. The results of the LOSO evaluation showed that we are able to recognize eating with a precision of 0.84 and a recall of 0.75.

The presented results are important because both the training and the evaluation data were recorded in uncontrolled real-life conditions. We want to emphasize the real-life evaluation, since it shows the robustness of the method while dealing with plenty of different activities that might be mistaken for eating, as well as recognizing meals that were recorded in many different environments while using many different utensils. The proposed method can also deal with interruptions while having a meal, such as having a conversation, using the smartphone, etc. Additionally, we believe that the energy efficiency of the proposed method is very important. The proposed technique uses a trigger to activate the machine learning procedure and is able to reduce the active time of the machine learning procedure by almost 80%. If we keep in mind that wristbands are devices with limited resources, even small reductions in resource usage can be significant for longer battery life.

The initial results achieved in this study are encouraging for further work, in which we expect to improve the eating detection method. In the near future, we plan to optimize our machine learning procedure to detect eating periods more accurately once the trigger is activated. Furthermore, we want to overcome the problem of false positive predictions. For this problem, we believe that a more sophisticated method for selecting representative non-eating data will help to recognize the problematic activities and directly include them in the training data. Also, we plan to investigate personalized threshold values. We believe that personalized values for the threshold will help to activate the trigger during eating periods more easily. Additionally, this could reduce the activation of the machine-learning procedure during non-eating periods. Finally, we plan to explore memory-efficient methods for storing the models in memory.

ACKNOWLEDGMENTS

This work was supported by the WellCo and CoachMyLife projects. The WellCo project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 769765. The CoachMyLife project has received funding from the AAL programme (AAL-2018-5-120-CP) and the Ministry of Public Administration of Slovenia.

REFERENCES
[1] World Health Organization. World Health Statistics 2015. Luxembourg, WHO, 2015.
[2] Public Health England. Data Factsheet: Adult Obesity International Comparisons. London, 2016. http://webarchive.nationalarchives.gov.uk/20170110165728/http://www.noo.org.uk/NOO_pub/Key_data
[3] M. Mirtchouk, D. Lustig, A. Smith, I. Ching, M. Zheng, and S. Kleinberg. Recognizing eating from body-worn sensors: Combining free-living and laboratory data. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 1(3):85:1–85:20, Sept. 2017.
[4] E. Thomaz, I. Essa, and G. D. Abowd. A practical approach for recognizing eating moments with wrist-mounted inertial sensing. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp '15, pages 1029–1040, New York, NY, USA, 2015. ACM.
[5] Y. Dong, J. L. Scisco, M. Wilson, E. Muth, and A. W. Hoover. Detecting periods of eating during free-living by tracking wrist motion. IEEE Journal of Biomedical and Health Informatics, 18:1253–1260, 2014.
[6] O. Amft, H. Junker, and G. Tröster. Detection of eating and drinking arm gestures using inertial body-worn sensors. In Ninth IEEE International Symposium on Wearable Computers (ISWC '05), pages 160–163. IEEE, 2005.
[7] P. Navarathna, B. W. Bequette, and F. Cameron. Wearable device based activity recognition and prediction for improved feedforward control. In Proceedings of the American Control Conference, 2018. doi: 10.23919/ACC.2018.8430775.
[8] K. Kyritsis, C. Diou, and A. Delopoulos. Detecting meals in the wild using the inertial data of a typical smartwatch. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2019. doi: 10.1109/EMBC.2019.8857275.
[9] S. Stankoski, N. Reščič, G. Mezič, and M. Luštrek. Real-time eating detection using a smartwatch. In EWSN 2020, pages 247–252, February 2020.
[10] Xu Ye, Guanling Chen, and Yu Cao. Automatic eating detection using head-mount and wrist-worn accelerometers. In 2015 17th International Conference on E-health Networking, Application & Services (HealthCom), pages 578–581, October 2015.
[11] T. K. Ho. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, volume 1, pages 278–282. IEEE, 1995.
[12] P. H. Swain and H. Hauska. The decision tree classifier: Design and potential. IEEE Transactions on Geoscience Electronics, 15(3):142–147, July 1977. doi: 10.1109/TGE.1977.6498972.
[13] Youngjo Lee, John A. Nelder, and Yudi Pawitan. Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Vol. 153. CRC Press, 2018.
[14] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):1–27, 2011.
Comparison of Methods for Topical Clustering of Online Multi-speaker Discourses

Vid Stropnik, University of Ljubljana, Faculty of Computer and Information Science, Velenje, Slovenia, vs6309@student.uni-lj.si
Zoran Bosnić, University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia, zoran.bosnic@fri.uni-lj.si
Evgeny Osipov, Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Luleå, Sweden, evgeny.osipov@ltu.se

ABSTRACT

Discussions held on online forums differ from traditional text documents in several ways. In addition to the individual text bodies (submission comments, forum posts, etc.) being very short, they also have multiple messengers, each of whom may exhibit unique patterns of speech. Consequently, state-of-the-art methods for text summarization are often rendered inapplicable for these sorts of corpora. This paper evaluates the topic-clustering algorithm used in state-of-the-art online comment clustering techniques as part of commonly used summarizer models. It proposes two alternative, vector-based approaches and presents the results of a comparative external analysis, concluding that the three methods are comparable.

KEYWORDS

latent Dirichlet allocation, word embeddings, GloVe, hyperdimensional computing, self-organized maps, topical clustering, clustering evaluation, discussion summarization

1 INTRODUCTION

User generated comments carry a great amount of useful information. Big data researchers have successfully used them to predict stock market volatility [1] and to predict the characteristics of such comments that perform the best on a given online platform [2]. User comments can also offer vast amounts of complementary information, as well as being forms of information surveillance, entertainment or social utility [3]. Existing mechanisms for displaying comments on websites do not scale well and often lead to cyberpolarization [4]. Furthermore, they are platform-specific and often fail to offer an overall image of the topics discussed in a given comments section.

A comprehensive, easily understandable automatic summary of the online discourse at hand can be instinctively understood as a solution to this problem. This, however, is no easy task, seeing as these corpora are often very short and come from multiple speakers. Consequently, traditional summarization methods do not translate well to these sorts of text bodies.

In Section 2 of this paper, the related work establishes the general framework that other authors use for the task at hand. It establishes the Latent Dirichlet Allocation (LDA) topic modeling algorithm as the current leading method for topical grouping of individual comments. These topical groups play a pivotal role in later summarization steps, also presented in Section 2.

In this paper, we externally evaluate and compare LDA against two frameworks using word representations in semantic vector space. We describe the analyzed methods in Sections 3 and 4.
In Section 5, we describe the comparative evaluation methodology used to determine the applicability of each modeling technique and present our results. We follow it up by discussing further work in the conclusion of this paper.

2 RELATED WORK

Online discussion summarization is a field that has not been addressed directly by many authors. One group of works [5-7] has roughly described a three-step process, commonly presented as the state of the art. The approach includes a topical clustering of all the observed comments, establishing a ranking method for determining the most salient ones in each cluster, and later summarizing this selection. Between them, the authors confidently establish Latent Dirichlet Allocation (LDA) topic modeling as the most human-like grouping algorithm. Further work also proposes a novel graph-based linear regression model based on the Markov Cluster Algorithm (MCL) [8], which outperforms LDA, but relies on multidomain knowledge bases for its implementation. While we argue that extractive summarization is not an ideal method for the analysis of multi-speaker corpora, the first step of identifying and topically clustering individual comments in each comment section is assumed to be a required step towards successful summarization of the topics discussed therein.

To the best of our knowledge, popular NLP word embedding algorithms (i.e. word2vec, GloVe) have not been used directly for comment summarization applications up until now. Similarly, neither have hyperdimensional representations, another topic of interest.

3 NLP METHODS

In this work, we examine three distinct topical clustering models, the output of which is always a set of comment clusters, given a multi-comment input.

The first is an LDA model, using a term frequency–inverse document frequency (TFIDF) word representation as input. In this representation, the comments were hard-clustered into the groups for which their degree of membership had been the highest in the soft-clustering approach provided by the LDA model.

The second examined model uses GloVe word embeddings clustered with the k-means clustering algorithm, thus portraying words in semantic vector space using information about the contexts in which words often appear.

The third model creates hyperdimensional representations of words, maps them into a two-dimensional topology using the self-organized maps algorithm and then clusters them like the preceding model. This approach is the least explored for this use-case and is inspired by the observed differences between the functionality of the human brain and the traditional von Neumann architecture of modern computing.

We performed the comparative evaluation of the models on the Reddit Corpus (by subreddit) dataset, provided by the Cornell Conversational Analysis Toolkit (ConvoKit, https://convokit.cornell.edu/documentation/index.html). Five Conversations, corresponding to as many threads on the website Reddit, were extracted from the corpus. We selected threads discussing topics from different subject domains, where each contained at least 50 non-removed comment text bodies. Two human annotators were then asked to manually identify topical clusters in the selected Conversations. The comment texts were provided to them in the form of a set of numbered text files, containing only the text data in chronological order of submission. Reddit post titles and other metadata were not available to the annotators, and no guidance was given as to the number of topics required. The clusterings were examined as-is, with no singleton removal performed.

We describe the NLP techniques used to create the three clustering models in the following subsections, with the external evaluation results presented in Section 5.
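A minimal sketch of the described thread extraction, assuming the ConvoKit Corpus/download API; the subreddit name is a hypothetical example, as the paper does not state which subreddits were used:

```python
# Sketch: extract up to five Reddit threads with at least 50 non-removed
# comments each from a ConvoKit subreddit corpus.
from convokit import Corpus, download

corpus = Corpus(filename=download("subreddit-Cornell"))  # hypothetical choice

conversations = []
for conv in corpus.iter_conversations():
    comments = [utt.text for utt in conv.iter_utterances()
                if utt.text and utt.text != "[removed]"]
    if len(comments) >= 50:        # threshold used in the paper
        conversations.append(comments)
    if len(conversations) == 5:    # five threads, as in the paper
        break
```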
3.1 Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a topic modeling technique initially proposed in the context of population genetics, but later applied in machine learning in the early 21st century. It assumes a generative process of documents as random mixtures over a collection of latent topics. Each of these topics, in turn, is characterized by a certain distribution over words. A topic model can be created by estimating the per-document distribution of topics θ and the per-topic distribution over words φ [9]. Many methods, such as variational inference, Bayesian parameter estimation [9] and collapsed Gibbs sampling [10], have been used to approximate these values. In the end, they all boil down to maximizing the model's probability of creating the exact documents provided to it in the input, assuming knowledge of the number of topic distributions.

3.2 Word Embeddings

Word embeddings is a collective name for a set of language modeling and feature learning techniques yielding word representations as vectors, the relative similarities of which correlate with the semantic similarity of the represented words. These meanings are extracted from the contexts – fixed-size windows of preceding and succeeding words – in which individual words appear in the training corpus. The generation of these vectors is achieved by context counting [11] or context prediction [12]. While there have been several claims of one of the methods for synthesizing word embeddings being superior to the other, recent work implies a correspondence between these model types [13]. Whichever way these word vectors are created, they represent semantic meaning in vector space. Using algebraic similarity measures (in our case, cosine distance) on comment-wise word averages, the relative likeness of the examined comments' meanings is calculated. Comment clusters can then be created by clustering the semantic-space points into groups with high intra-cluster and low inter-cluster similarity. These groups represent the topical clusters used in our examination.
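A minimal sketch of the second model described above: each comment is represented by the average of its GloVe word vectors (spaCy's en_core_web_md model, as named in Section 4) and the comment vectors are grouped with k-means. Unit-normalizing the vectors makes Euclidean k-means behave approximately like clustering by cosine similarity; the exact pipeline details are our assumptions:

```python
# Sketch: GloVe comment averaging + k-means topical clustering.
import numpy as np
import spacy
from sklearn.cluster import KMeans

nlp = spacy.load("en_core_web_md")  # doc.vector averages the token vectors

def cluster_comments(comments, n_clusters):
    vectors = np.array([nlp(text).vector for text in comments])
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors = vectors / np.clip(norms, 1e-9, None)  # unit-normalize
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(vectors)
```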
3.3 Hyperdimensional Computing

Hyperdimensional computing is a family of biologically inspired methods for representing and manipulating concepts and their meanings in high-dimensional space. Random bipolar vectors of high but fixed dimensionality (≥ 1000) are initialized as individual word representations and are then transformed in ways that place semantically similar comments closer in the high-dimensional vector space, while the similarity of dissimilar comments is likely close to zero due to their inherent orthogonality. The methods used to transform these vectors are binding, bundling and permuting [14]. By using these methods, an individual hyperdimensional vector is created for each comment, encoding in the vector the used words and their position in the comment.

Similar to the clustering of word embeddings, semantically similar comment groups can be found by clustering, thus determining the outputs of the third model. However, the performance of this method did not yield comparative results at first. We hypothesised that this might be due to the high component count of the used vectors (more than double the dimensions of the word embedding approach), so a method of dimensionality reduction was examined, aiming to improve its results. It is described in the next subsection.

3.4 Self-Organized Maps

Self-organizing maps (SOM), also known as Kohonen networks, are computational methods for the visualization and analysis of high-dimensional data. The output of the algorithm is a set of nodes, arranged in a certain topology that represents the nodes' mutual relation, with each node being represented by a weight vector of t dimensional components, with t corresponding to the uniform dimensionality of the data being reduced [15]. As data representations in high-dimensional vector spaces are inherently vulnerable to sparseness, clustering outputs can differ in cases where the clustered data is first dimensionally reduced. Thus, we used the SOM algorithm to examine whether the results (of the examination in Section 5) of any of the proposed frameworks can be improved by dimensionally reducing the vector representations prior to clustering.

SOM proved to drastically improve the performance of the hyperdimensional computing model, while making the word-embeddings-based model perform worse. Consequently, in the evaluation presented in Section 5, we use SOM prior to clustering only in the HD-based approach.

4 IMPLEMENTATION

All implementational work was done in the Python programming language. All text corpora were pre-processed using the WordNetLemmatizer and PorterStemmer from NLTK (https://www.nltk.org/). Stop-word removal was done in the pre-processing step using the topic modeling package Gensim (https://radimrehurek.com/gensim/), which also provided the submodules for TFIDF and LdaModel, used for the implementation of Latent Dirichlet Allocation. GloVe word embeddings were provided by the open-source NLP library SpaCy (https://spacy.io/) as part of the "en_core_web_md" pretrained statistical model for the English language. The SOM algorithm was implemented using the SimpSOM package (https://github.com/fcomitani/SimpSOM/), with k-means clustering provided by Scikit-Learn (https://scikit-learn.org/).
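A minimal sketch of the LDA pipeline using the Gensim submodules named above; tokenization and preprocessing (lemmatization, stemming, stop-word removal) are assumed to have been done beforehand, and the hard-clustering step follows the description in Section 3:

```python
# Sketch: TFIDF-weighted LDA over tokenized comments, hard-clustering
# each comment into its highest-probability topic.
from gensim import corpora, models

def lda_clusters(tokenized_comments, num_topics):
    dictionary = corpora.Dictionary(tokenized_comments)
    bow = [dictionary.doc2bow(doc) for doc in tokenized_comments]
    tfidf = models.TfidfModel(bow)
    corpus_tfidf = [tfidf[vec] for vec in bow]
    lda = models.LdaModel(corpus_tfidf, id2word=dictionary,
                          num_topics=num_topics, random_state=0)
    labels = []
    for vec in corpus_tfidf:
        topics = lda.get_document_topics(vec, minimum_probability=0.0)
        labels.append(max(topics, key=lambda t: t[1])[0])
    return labels
```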
5 EVALUATION

To analyze the applicability of LDA, Word Embeddings and dimensionally reduced hyperdimensional computing for the discussed use-case, topical clustering outputs were created for 5 Reddit Conversations. Two human annotators also manually created topical groups for these conversations. The goal of our evaluation was to see which model created the most human-like clusters, consequently having the highest average agreement measure with the clustering samples provided by the two annotators.

Topical clusters created by the three models were externally evaluated using four symmetric agreement measures: the V-measure [16], the Fowlkes-Mallows index [17], the Rand index [18] and the mutual information score [19]. The latter two were also adjusted for chance. For each examined model, the best performing number of topic clusters was selected. The agreement of the clustering output of each model was measured against both of the manual clusterings, with the per-annotator average of each metric being the final output.
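A minimal sketch of this external evaluation with scikit-learn, which provides all four measures (with the chance-adjusted variants of the Rand index and mutual information); labels_a and labels_b are cluster assignments over the same comments:

```python
# Sketch: four symmetric agreement measures between two clusterings.
from sklearn.metrics import (v_measure_score, fowlkes_mallows_score,
                             adjusted_rand_score, adjusted_mutual_info_score)

def agreement(labels_a, labels_b):
    return {
        "V-measure": v_measure_score(labels_a, labels_b),
        "Fowlkes-Mallows": fowlkes_mallows_score(labels_a, labels_b),
        "Adjusted Rand": adjusted_rand_score(labels_a, labels_b),
        "Adjusted MI": adjusted_mutual_info_score(labels_a, labels_b),
    }
```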
Figure 1 shows the result scores of all four metrics for each analyzed method. In the top row, the average agreement between the two annotators is also shown. This is, expectedly, higher than the average agreement between any examined model and the human outputs.

Figure 1: Visualization of agreement metric results between the human annotators (top) and the average annotator vs. model agreement (bottom three).

A few takeaways can be addressed when examining the figure. Firstly, the different methods were successful to a varying degree depending on the used metric, with each performing best according to at least one. Secondly, when comparing their average relative success in relation to the agreement scores between Annotator A and Annotator B, we can see that their performances are very similar. This can be seen even more clearly in Figure 2, which shows each model's performance with respect to the agreement score between the two human annotators. The percentage is calculated as an averaged sum of all four metric scores, weighted by the sum of these scores achieved by the human-versus-human evaluation. In the figure, Word Embeddings can be seen as the best-performing approach, reaching 54.18% of the human agreement. The performance of LDA presented in Figure 2 is also comparable to that found in [5].

Figure 2: Percentage of the human-versus-human agreement score achieved by each model, averaged between the 4 agreement metrics (GloVe: 54.18%, LDA_TFIDF: 52.55%, SOM_HDV: 47.19%).

However, with the difference in results between the best and the worst performing models being less than 7% of the total human agreement score, this metric is not enough to establish Word Embeddings as superior to LDA or, indeed, to dimensionally reduced hyperdimensional computing. We can conclude that both hyperdimensional computing and Word Embeddings can produce topical clusters comparable to the current state-of-the-art LDA method.

Semantic document representations performing as well as the state-of-the-art topic modeling framework using LDA opens up plentiful possibilities in the field of multi-speaker conversation analysis. Whereas topic modeling's more direct approach of inferring latent conversation topics might be useful in their discovery, the possibility of applying algebraic functions to individual comment vectors might enable further topic mining and experimentation. While the k-means clustering algorithm requires a desired number of clusters as input, similar to LDA, its job in the Word Embedding or SOM-HDC framework is not to encode semantics. This means that an alternative clustering algorithm – one without the need for an input number of medoids – could be used for the task of grouping comments. This, in turn, would result in a truly unsupervised topical clustering framework. A comparative evaluation of these approaches is a field of interest for the future, as our non-conclusive experiments have already shown a vast variance in results when using different clustering approaches.

6 CONCLUSION

In this article, we work from our hypothesis that popular semantics-laden vector representations of text data can be applicable in the established framework for extractive online discussion summarization. We present two models using different vector-based representation techniques and conclude that they are both comparable to the Latent Dirichlet Allocation topic modelling technique used in most of the literature, with the Word Embeddings-based framework outperforming it in our external evaluations.
As mentioned in Section 2, the authors of this article argue that extractive summarizations are intrinsically less suitable when working with multi-speaker corpora. Our future work in this field includes the modeling of an abstractive summarizer framework using the findings presented in this paper. Our intent is to use them in conjunction with graph-based approaches that take advantage of multidomain knowledge bases like DBPedia for both clustering and topic-labelling [8, 20].

Whether used in extractive or abstractive applications, we presume that the field will greatly benefit from our findings, seeing that the two vector-based representation frameworks open a plethora of new possibilities for other researchers. These include detailed data manipulation using algebraic operations on individual comment vectors, as well as said vectors being suitable inputs for deep learning models using neural networks.

REFERENCES
[1] W. Antweiler and M. Z. Frank, 'Is All That Talk Just Noise? The Information Content of Internet Stock Message Boards', J. Finance, vol. 59, no. 3, pp. 1259–1294, Jun. 2004, doi: 10.1111/j.1540-6261.2004.00662.x.
[2] T. Weninger, 'An exploration of submissions and discussions in social news: mining collective intelligence of Reddit', Soc. Netw. Anal. Min., vol. 4, no. 1, p. 173, Dec. 2014, doi: 10.1007/s13278-014-0173-9.
[3] E. Go, K. H. You, E. Jung, and H. Shim, 'Why do we use different types of websites and assign them different levels of credibility? Structural relations among users' motives, types of websites, information credibility, and trust in the press', Comput. Hum. Behav., vol. 54, pp. 231–239, Jan. 2016, doi: 10.1016/j.chb.2015.07.046.
[4] S. Faridani, E. Bitton, K. Ryokai, and K. Goldberg, 'Opinion space: a scalable tool for browsing online comments', in Proceedings of the 28th International Conference on Human Factors in Computing Systems - CHI '10, Atlanta, Georgia, USA, 2010, p. 1175, doi: 10.1145/1753326.1753502.
[5] C. Llewellyn, C. Grover, and J. Oberlander, 'Summarizing Newspaper Comments', Proc. Eighth Int. AAAI Conf. Weblogs Soc. Media, pp. 599–602, Jun. 2014.
[6] Z. Ma, A. Sun, Q. Yuan, and G. Cong, 'Topic-driven reader comments summarization', in Proceedings of the 21st ACM International Conference on Information and Knowledge Management - CIKM '12, Maui, Hawaii, USA, 2012, p. 265, doi: 10.1145/2396761.2396798.
[7] E. Khabiri, J. Caverlee, and C.-F. Hsu, 'Summarizing User-Contributed Comments', presented at the International AAAI Conference on Weblogs and Social Media, pp. 534–537, Barcelona, Spain, Jul. 2011.
[8] A. Aker et al., 'A Graph-Based Approach to Topic Clustering for Online Comments to News', in Advances in Information Retrieval, vol. 9626, N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. M. Di Nunzio, C. Hauff, and G. Silvello, Eds. Cham: Springer International Publishing, 2016, pp. 15–29.
[9] D. Blei, A. Y. Ng, and M. I. Jordan, 'Latent Dirichlet Allocation', J. Mach. Learn. Res., vol. 3, no. 4–5, pp. 993–1022, 2003, doi: 10.1162/jmlr.2003.3.4-5.993.
[10] W. M. Darling, 'A Theoretical and Practical Implementation Tutorial on Topic Modeling and Gibbs Sampling', Proc. 49th Annu. Meet. Assoc. Comput. Linguist. Hum. Lang. Technol., pp. 642–647, Dec. 2011.
[11] J. Pennington, R. Socher, and C. Manning, 'GloVe: Global Vectors for Word Representation', in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1532–1543, doi: 10.3115/v1/D14-1162.
[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean, 'Efficient Estimation of Word Representations in Vector Space', arXiv:1301.3781, Sep. 2013. Available: http://arxiv.org/abs/1301.3781.
[13] A. Österlund, D. Ödling, and M. Sahlgren, 'Factorization of Latent Variables in Distributional Semantic Models', in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 2015, pp. 227–231, doi: 10.18653/v1/D15-1024.
[14] D. Kleyko, E. Osipov, D. De Silva, and U. Wiklund, 'Distributed Representation of n-gram Statistics for Boosting Self-organizing Maps with Hyperdimensional Computing', in Perspectives of System Informatics, 12th International Andrei P. Ershov Informatics Conference, Revised Selected Papers, pp. 64–79, Novosibirsk, Russia, 2019.
[15] T. Kohonen, T. S. Huang, and M. R. Schroeder, Self-Organizing Maps. Berlin, Heidelberg: Springer, 2012.
[16] A. Rosenberg and J. Hirschberg, 'V-Measure: A Conditional Entropy-Based External Cluster Evaluation Measure', in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, Jun. 2007, pp. 410–420. Available: https://www.aclweb.org/anthology/D07-1043.
[17] E. B. Fowlkes and C. L. Mallows, 'A Method for Comparing Two Hierarchical Clusterings', J. Am. Stat. Assoc., vol. 78, no. 383, pp. 553–569, Sep. 1983, doi: 10.1080/01621459.1983.10478008.
[18] W. M. Rand, 'Objective Criteria for the Evaluation of Clustering Methods', J. Am. Stat. Assoc., vol. 66, no. 336, pp. 846–850, Dec. 1971, doi: 10.1080/01621459.1971.10482356.
[19] N. X. Vinh, J. Epps, and J. Bailey, 'Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance', J. Mach. Learn. Res., vol. 11, pp. 2837–2854, Oct. 2010.
[20] I. Hulpus, C. Hayes, M. Karnstedt, and D. Greene, 'Unsupervised graph-based topic labelling using DBpedia', in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining - WSDM '13, Rome, Italy, 2013, p. 465, doi: 10.1145/2433396.2433454.
Machine Learning of Surrogate Models with an Application to Sentinel 5P

Michał Artur Szlupowicz, Warsaw University of Technology, Faculty of Physics, Warsaw, Poland, m.szlupowicz@gmail.com
Jure Brence, Jožef Stefan Institute, Ljubljana, Slovenia, jure.brence@ijs.si
Jennifer Adams, Φ-lab, ESA/ESRIN, Frascati, Italy, jennifer.adams@esa.it
Edward Malina, Earth and Mission Science Division, ESA/ESTEC, Noordwijk, the Netherlands, edward.malina.13@alumni.ucl.ac.uk
Sašo Džeroski, Jožef Stefan Institute, Ljubljana, Slovenia, saso.dzeroski@ijs.si

ABSTRACT

Surrogate models are efficient approximations of computationally expensive simulations or models. In this paper, we report improvements of a framework for learning surrogates on input and output spaces with reduced dimensionality. We present non-linear embeddings and feature importance as additional methods for dimensionality analysis and reduction. The choice of models for prediction is extended with two types of ensembles of decision trees. The performance of the additions is evaluated and compared with the original approaches on a dataset generated by RemoTeC, a complex radiative transfer model.

KEYWORDS

spectral data, neural network, ensemble, surrogate model, dimensionality reduction

1 INTRODUCTION

The TROPOspheric Monitoring Instrument (TROPOMI) is an on-board satellite instrument on the Copernicus Sentinel-5 Precursor satellite [9]. Its main objective is to provide accurate observations of atmospheric parameters, such as the concentrations of atmospheric constituents. Those can be used to obtain better air quality forecasts and to monitor global trends. However, the retrieval of interesting attributes involves running a retrieval algorithm, such as RemoTeC [2, 8], based on "optimal estimation methods" that tend to be computationally very expensive [7].

Machine learning techniques can be used to learn surrogate models that approximate the outputs of intensive simulations and are much faster at making predictions [13]. A framework for learning surrogates of radiative transfer models has been developed [1]. Due to the high dimensionality of both the input and output spaces, the framework employs dimensionality reduction – methods that find low-dimensional projections (embeddings) of data that preserve as much information as possible [4]. Predictive models are learned on input and output spaces with reduced dimensionality.

Despite promising results, the existing framework for learning surrogates is limited to simple feed-forward neural networks for the task of prediction, while offering a choice between PCA and autoencoders to reduce dimensionality [4, 6, 3]. In this paper we present an extension of the framework with two types of ensembles of decision trees for prediction [4], as well as an evaluation of the performance and utility of three additional algorithms for dimensionality analysis and dimensionality reduction: t-SNE [11], UMAP [12] and feature importance based on random forests [10].

2 DATASET

The training dataset was generated using the RemoTeC tool and in total consists of 50000 samples. Each input state vector contains a set of atmospheric parameters: solar zenith angle (SZA), albedo, temperature, pressure, aerosols and profiles of the CH4, CO and H2O gases (in total 125 dimensions). The sampling of the data ensures that it covers the entire range of conditions that S5P/TROPOMI is expected to encounter. Exploratory data analysis reveals three dimensions with zero variance. Removing them results in a dataset with a 122-dimensional input space. The output training data was created using the RemoTeC RTM in the S5P/TROPOMI Shortwave InfraRed (SWIR3) band. Each target vector consists of an infrared spectrum with 834 dimensions.
3 SURROGATE MODELS

The framework for learning surrogates is capable of learning both forward and backward models. The former predict spectra, given atmospheric parameters. The latter reverse this process and learn to approximate the atmospheric parameters that produce a given spectrum, which is useful for optimizing the parameters of the RemoTeC simulation. Surrogates are generally predictive models that map directly between the input and output data of a simulation or computationally expensive model. They offer much faster predictions at the cost of incurring a prediction error. However, when the data is high-dimensional and contains many samples, the computational cost of training and prediction can still be non-trivial. In such cases, methods of dimensionality reduction can offer not only time savings, but also improvements in predictive performance. In our framework, we apply dimensionality reduction to the atmospheric parameters, as well as to the spectral space. Predictive models learn to map between the reduced spaces. An inverse transformation is performed on predictions in the reduced space to obtain predictions in the original output space. For that reason, dimensionality reduction algorithms must provide an inverse transformation in order to be useful as a component of a surrogate model in our framework.

3.1 Dimensionality Reduction

A high number of dimensions makes a problem much harder for many machine learning algorithms due to the curse of dimensionality. For this reason, we tried a range of dimensionality reduction (DR) methods on our data before training on them. DR methods are (potentially unsupervised) algorithms that try to find a projection of the data to a lower-dimensional space that preserves as much information as possible. A lower number of dimensions helps reduce computation time and often even improves the predictive performance of models. Furthermore, DR methods can also be used to visualize high-dimensional data by finding an informative projection into two dimensions that is understandable to humans. Some algorithms, such as t-SNE or UMAP, serve especially this purpose.

Principal Component Analysis (PCA) is one of the most popular dimensionality reduction methods [4]. PCA finds linear projections to a lower-dimensional subspace so that the variance in the data is maximized. Visualizing the ratio of variance covered by individual principal components is a way of assessing the intrinsic dimensionality of the data, as shown in Figure 1. We see that, for the 122-dimensional atmospheric parameter space, we need:
• 23 dimensions to explain 95% of the variance,
• 45 dimensions to explain 99% of the variance,
• 73 dimensions to explain 99.9% of the variance,
and for the output 834-dimensional spectral space:
• 1 dimension to explain 95% of the variance,
• 2 dimensions to explain 99% of the variance,
• 9 dimensions to explain 99.9% of the variance.

Figure 1: Dependence of the cumulative relative variance on the number of principal components for both the input and the output space.
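A minimal sketch of this dimensionality analysis with scikit-learn, assuming a hypothetical data matrix X; the thresholds correspond to the variance levels listed above:

```python
# Sketch: number of principal components needed to reach given
# fractions of cumulative explained variance.
import numpy as np
from sklearn.decomposition import PCA

def components_needed(X, thresholds=(0.95, 0.99, 0.999)):
    cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    # Index of the first component at which each threshold is reached.
    return {t: int(np.searchsorted(cumulative, t) + 1) for t in thresholds}
```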
Autoencoders (AE) [3] are a type of artificial neural network used to learn low-dimensional representations. AE are trained to reproduce the input data on the output of the network after passing through a bottleneck in the network architecture. To prevent autoencoders from memorizing the training dataset, a variety of regularization techniques can be employed. One of the options is adding artificial noise to the input data, which forces the network to generalize.

In our framework, we employ this kind of autoencoder, often referred to as a denoising autoencoder, by adding Gaussian noise with mean 0 and standard deviation 0.1 to the input data during the training process. A more thorough investigation of the effect of this technique on the predictive power can be found in [1]. For both the atmospheric parameters and the spectral space, we used the same 7-layer architecture with an appropriate size of the input and output layers. The architecture can be summarized as follows (a code sketch is given after the list):
• input layer of size N0 + Gaussian noise,
• dense layer of size N1 < N0 and ReLU activation,
• dense layer of size N2 = N1/2 and ReLU activation,
• dense embedding layer of size N3 and linear activation,
• dense layer of size N2 and ReLU activation,
• dense layer of size N1 and ReLU activation,
• output layer of size N0 and linear activation.
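A minimal sketch of this architecture, written with Keras as an assumed implementation (the paper does not state which library was used); n0, n1 and n3 are the layer widths from the list above, with n2 = n1/2:

```python
# Sketch: 7-layer denoising autoencoder with a Gaussian-noise input.
import tensorflow as tf

def build_denoising_autoencoder(n0, n1, n3, noise_std=0.1):
    n2 = n1 // 2                                           # N2 = N1/2
    inputs = tf.keras.Input(shape=(n0,))
    x = tf.keras.layers.GaussianNoise(noise_std)(inputs)   # active only in training
    x = tf.keras.layers.Dense(n1, activation="relu")(x)
    x = tf.keras.layers.Dense(n2, activation="relu")(x)
    embedding = tf.keras.layers.Dense(n3, activation="linear")(x)
    x = tf.keras.layers.Dense(n2, activation="relu")(embedding)
    x = tf.keras.layers.Dense(n1, activation="relu")(x)
    outputs = tf.keras.layers.Dense(n0, activation="linear")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```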
The t-Distributed Stochastic Neighbor Embedding (t-SNE) [11] is a non-linear unsupervised technique for high-dimensional data visualization that can model complex, non-linear dependencies. t-SNE places points that are similar in the original space close together in the embedding with a high probability, while placing dissimilar points close together with only a low probability. Since t-SNE is a stochastic and non-parametric method, there is no way to perform a reverse transformation from the embedding space to the original space. This excludes the method from use as part of the surrogate modelling process. It can, however, be useful for visualizing the dataset. Another disadvantage of t-SNE is its high computational complexity.

Uniform Manifold Approximation and Projection (UMAP) [12] is another dimension reduction technique used for dataset visualization, constructed from a theoretical framework based in Riemannian geometry and algebraic topology. UMAP performs similarly to t-SNE, but preserves more of the global data structure with superior run time performance. As is the case with t-SNE, UMAP does not allow for reverse transformations, which means we cannot use it to learn surrogates. However, visualizations using UMAP allowed us to gain useful insights into the structure of our dataset.

3.2 Prediction Models

One of the predictors we used in our experiment was a feed-forward neural network (NN). We chose an architecture consisting of 2 hidden fully connected layers with ReLU activation functions and linear activation on the output layer [6].

Random Forest (RF) is an ensemble learning technique suited for both regression and classification problems. It uses sample bagging and feature sampling methods to train a set of decision trees. Prediction is performed by averaging over the predictions of the individual regression trees. The main advantage of RF over a simple decision tree is the much better generalization. We decided to use this kind of predictor because it is capable of performing multi-target regression [10].

Extra Random Trees (ET) is a technique very similar to random forests, with two main differences. First, it uses the whole dataset for training the individual trees instead of using bags of samples. Second, it uses random cuts for each split, instead of using the optimal one (in the case of Gini or entropy reduction). It has been shown to perform better than random forests for some problems [5].

4 EXPERIMENT

Our experiment is composed of three parts. In the first two, we employ methods of dimensionality reduction as a way to gain insight and understanding about our dataset and problem. The third part is an empirical evaluation of different combinations of methods for dimensionality reduction and prediction, aiming to identify the one that offers the best predictive performance on unseen data.

4.1 Visualization

We applied the UMAP and t-SNE visualization techniques to both the atmospheric parameters and the spectrum data. As expected, both methods showed clusters in the atmospheric parameters data. In the spectrum data space, UMAP identified a structure in the data, depicted in Figure 2. A comparison of the data points sampled from different clusters shows a large difference in the scale of individual data points. This is likely one of the reasons why such a high variance is concentrated in the first principal component (as seen in Figure 1).

Figure 2: UMAP visualization of the spectrum data. (a) UMAP visualization; (b) comparison of data points.

4.2 Feature Importance

The main advantage of using tree-based models over neural networks is their interpretability. While the ability to be understood by a human is lost when moving from a single tree to an ensemble, random forests can be very useful for estimating the importance of individual features for prediction. We trained a random forest predictor on the full dataset and visualized the feature importance values in Figure 3. We see that 70% of the feature importance is accumulated in just two dimensions. This corresponds well to the PCA estimate of most variance being encompassed by two principal components. Only about half of the features are assigned non-negligible importance. The features identified by this approach warrant further investigation by domain experts.

Figure 3: Random forest predictor importance of atmospheric data features.
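A minimal sketch of this feature-importance analysis, assuming hypothetical arrays X (atmospheric parameters) and y (spectra); scikit-learn's random forest supports multi-target regression and exposes impurity-based importances:

```python
# Sketch: rank input features by random-forest importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ranked_feature_importances(X, y):
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]  # most important first
    return order, rf.feature_importances_[order]
```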
4.2 Feature Importance
The main advantage of using tree-based models over neural networks is their interpretability. While the ability to be understood by a human is lost when moving from a single tree to an ensemble, random forests can be very useful for estimating the importance of individual features for prediction. We trained a random forest predictor on the full dataset and visualized the feature importance values in Figure 3. We see that 70% of the feature importance is accumulated in just two dimensions. This corresponds well to the PCA estimate of most variance being encompassed by two principal components. Only about half of the features are assigned non-negligible importance. The features identified by this approach warrant further investigation by domain experts.

Figure 3: Random forest predictor importance of atmospheric data features.

4.3 Regression
To compare different regressors and methods of dimensionality reduction, we performed forward and backward predictions using a neural network, a random forest and extra random trees for both autoencoder and PCA embeddings. We reduced the dimensionality of the input space from 123 to 73 and the dimensionality of the output space from 834 to 9. These values correspond to 99.9% explained variance when using PCA. The noise level of the autoencoder was set to σ = 0.1. A more thorough study of the effects of these parameters can be found in [1]. We compare the predictive power of various combinations of either AE or PCA for dimensionality reduction, and either a neural network, a random forest or extra trees as the predictive model, using 10-fold cross validation. In Table 1 we compare the results, using the coefficient of determination as the evaluation metric [4]:

R² = 1 − MSE(model) / variance(training set).

Table 1: Coefficient of determination for various combinations of dimensionality reduction methods (DR) and predictive models (PM), estimated by 10-fold cross validation.

            forward            backward
PM / DR     AE       PCA       AE       PCA
NN          0.9995   0.9998    0.8454   0.9206
RF          0.8931   0.9937    0.9267   0.9311
ET          0.9228   0.9958    0.9370   0.9510

For the forward model, the best performance of R² = 0.9998 is achieved by a neural network mapping between spaces reduced by PCA. For the backward model, the best performing model is extra trees paired with PCA, achieving R² = 0.9510. Both represent very satisfactory and promising models to employ as surrogates for radiative transfer modeling. From Table 1, we can also see that PCA outperformed autoencoders in all cases, while also being much faster to compute. The comparison of predictive models is not as simple. For the forward model, the neural network is the best, but only by a small margin. For the backward model, the differences are larger, with the neural network performing the worst. The performance of random forests was between the performances of the other two predictive models for both the forward and the backward problem.

Since one of the main uses for surrogate models is speeding up computation, time complexity is an important consideration. The main disadvantage of neural networks is the computational complexity required for both training and prediction. An autoencoder takes about ten times as long as PCA to transform a data point to the embedding space. Among the predictive models, the neural network used in this study needed approximately three times as long to make a prediction as the random forests and extra trees, which had a similar time complexity. Nonetheless, making predictions for a test set of 5000 points using any of the described surrogates takes up to one second, while running the full RemoTeC simulation requires several hours of computation.
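In outline, the cross-validated comparison behind Table 1 can be reproduced with scikit-learn as below. This is a sketch under the assumption that X holds the atmospheric parameters and Y_red the PCA-reduced spectra (both placeholder names), and it uses scikit-learn's r2 scorer as a stand-in for the metric defined above.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Forward model: PCA-reduced input -> (reduced) output, 10-fold CV.
for name, model in [("RF", RandomForestRegressor(n_estimators=100)),
                    ("ET", ExtraTreesRegressor(n_estimators=100))]:
    pipeline = make_pipeline(PCA(n_components=73), model)
    scores = cross_val_score(pipeline, X, Y_red, cv=10, scoring="r2")
    print(name, np.mean(scores))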
When comparing with the evaluation results reported for the original framework in [1], the performances in this paper are slightly worse. The reason is that the original study reduced the dimensions of the input space to 102 and of the output space to 50, while in this study we focused on further reducing the dimensions, to 73 for the input space and 9 for the output space. It is an interesting observation that for different dimensionalities, the best performance is achieved by different algorithms.

5 DISCUSSION AND FURTHER WORK
The original framework for learning surrogates on input and output spaces with reduced dimensionality showed high predictive and computational performance on the RemoTeC dataset. The results were very promising for applications in data analysis for Earth Observation missions as a way to dramatically speed up computation without sacrificing much accuracy. However, no single model and approach is the best for every dataset and application, which made the limited scope of options in the original framework a potential downside. With the work presented in this paper, the range of available methods has been extended. Since the choice of algorithms for dimensionality reduction on the input and output spaces, as well as the choice of prediction model for both the forward and the backward model, are all independent from each other, the number of available combinations of algorithms is considerable. Furthermore, the dimension analysis enabled by UMAP, t-SNE and feature importance represents a new way of assessing intrinsic dimensionality and making a more informed choice of the number of target dimensions.

The paper presents an evaluation of the performance of the various included methods on the RemoTeC dataset. However, each of the analyzed algorithms is defined by a number of hyperparameters, which is especially true for neural networks and autoencoders. Furthermore, the dimensions of the reduced input and output spaces can also be considered hyperparameters of the framework. For the presented evaluation we chose the hyperparameters based on values reported in previous work and to some degree optimized them manually. A more rigorous study that employs automated hyperparameter optimization is required in order to compare the available algorithms fairly and arrive at a reliable conclusion about the best approach to modeling the RemoTeC simulation.

Finally, in this study we touched upon the subject of estimating feature importance using random forests in order to gain insight about the data. However, feature importance can also be used to compute feature rankings and perform feature selection, which can be considered another method of dimensionality reduction. In further work, it might be worthwhile to investigate this approach further and include it as an option in the framework for learning surrogates.

6 ACKNOWLEDGEMENTS
We thank dr. Jovan Tanevski for his initial work on the project, as well as for his ideas and help in further work.

REFERENCES
[1] Jure Brence, Jovan Tanevski, Jennifer Adams, Edward Malina, and Sašo Džeroski. 2020. Learning surrogates of a radiative transfer model for the Sentinel 5P satellite. In Proceedings of the International Conference on Discovery Science (Lecture Notes in Computer Science). Volume 12323.
[2] A. Butz, André Galli, O. Hasekamp, J. Landgraf, P. Tol, and I. Aben. 2012. TROPOMI aboard Sentinel-5 Precursor: prospective performance of CH4 retrievals for aerosol and cirrus loaded atmospheres. Remote Sensing of Environment, 120, 267–276.
[3] David Charte, Francisco Charte, Salvador García, María J. del Jesus, and Francisco Herrera. 2018. A practical tutorial on autoencoders for nonlinear feature fusion: taxonomy, models, software and guidelines. Information Fusion, 44, 78–96.
[4] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Number 10. Volume 1. Springer Series in Statistics, New York.
[5] Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely randomized trees. Machine Learning, 63, 1, 3–42.
[6] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
[7] Otto P. Hasekamp and J. Landgraf. 2002. A linearized vector radiative transfer model for atmospheric trace gas retrieval. Journal of Quantitative Spectroscopy and Radiative Transfer, 75, 2, 221–238. ISSN: 00224073. DOI: 10.1016/S0022-4073(01)00247-3.
[8] Haili Hu, Otto Hasekamp, André Butz, André Galli, Jochen Landgraf, Joost aan de Brugh, Tobias Borsdorff, Remco Scheepmaker, and Ilse Aben. 2016. The operational methane retrieval algorithm for TROPOMI. Atmospheric Measurement Techniques (AMT), 9, 11, 5423–5440.
[9] IPCC. 2014. Fifth Assessment Report – Impacts, Adaptation and Vulnerability. (2014). Retrieved 06/12/2017 from http://www.ipcc.ch/report/ar5/wg2/.
[10] Andy Liaw, Matthew Wiener, et al. 2002. Classification and regression by randomForest. R News, 2, 3, 18–22.
[11] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9, Nov, 2579–2605.
[12] Leland McInnes, John Healy, and James Melville. 2018. UMAP: uniform manifold approximation and projection for dimension reduction. (December 2018). https://arxiv.org/abs/1802.03426.
[13] J. Tanevski, S. Džeroski, and T. Todorovski. 2019. Meta-model framework for surrogate-based parameter estimation in dynamical systems. IEEE Access, 99.
Deep Multi-label Classification of Chest X-ray Images

Dejan Štepec
dejan.stepec@xlab.si
University of Ljubljana, Faculty of Computer and Information Science
XLAB d.o.o.
Ljubljana, Slovenia

ABSTRACT
In this paper we address the problem of Chest X-ray (CXR) classification in a multi-label classification (MLC) setting, in which each sample can be associated with one or several labels. The availability of large-scale CXR datasets has provided the ability to develop highly accurate deep-learning based supervised models that closely resemble the performance of human radiologists. We compare an end-to-end deep-learning based approach with different ensembles of predictive clustering trees (PCTs) and show that similar predictive performance can be achieved when using features extracted from the pre-trained deep-learning model.

KEYWORDS
Chest X-ray, deep-learning, predictive clustering trees, random forest, extra trees

1 INTRODUCTION
Chest X-ray (CXR) is one of the most common medical imaging modalities, with millions of scans performed globally every year [6]. A computer-aided diagnosis (CAD) system can significantly reduce the burden on radiologists and improve the early detection of many deadly diseases. There has been a lot of effort recently to harness the power of machine learning based methods, especially deep learning, for disease classification and localization from CXR images [17]. Interpreting CXR images is very difficult even for trained pathologists, with different visual ambiguities representing a significant challenge in distinguishing between different diseases, resulting in misdiagnoses [5].

Recently, deep-learning based approaches have been presented that, together with the availability of large-scale datasets, significantly improve the performance of CAD methods and in some cases reach radiologist-level performance [8]. In comparison with other approaches and datasets [9, 13, 1], newly presented datasets [8, 10] enable the development of CAD methods for detecting the presence of multiple diseases in CXR images at the same time.

Figure 1: Few examples of Chest X-ray images from the CheXpert dataset [8].

2 RELATED WORK
The recent prevalence of deep-learning methods and the increased availability of large-scale datasets with labeled data have provided the medical community with significant advances, in comparison with methods that require sub-optimal manual feature engineering [14]. State-of-the-art CNN models are becoming a de-facto standard for a wide range of applications in medical imaging, such as detection, classification and segmentation. Similar advances in terms of methods and available data have been observed in the domain of Chest X-ray (CXR) images.

The multi-label classification (MLC) setting is very common in interpreting CXR images, due to the presence of multiple diseases in one particular CXR sample. The deep-learning architecture CheXNet [19] was proposed, based on DenseNet-121 [7] and trained on the ChestX-ray14 dataset [21], which achieved state-of-the-art results over 14 labeled pathologies and even exceeded radiologist performance on pneumonia. Recently, very large-scale CXR datasets were presented, such as CheXpert [8] and MIMIC-CXR [10], which enabled the development of much more robust supervised models. Additionally, the new datasets also capture the notion of uncertainty through labels, and different approaches have been proposed for handling such labels. An architecture similar to CheXNet was proposed, whose performance surpassed 3 certified radiologists in 3 different pathologies [8].

The above MLC approaches do not take into account the dependencies between disease labels, which, when exploited, significantly improve the performance of the predictive models [16]. We evaluate an end-to-end deep-learning based approach for MLC of CXR images, based on the DenseNet architecture [7], and compare it with the traditional approach based on predictive clustering trees (PCTs) [2], in an ensemble setting, using the features extracted from the pre-trained deep-learning network. We demonstrate a similar predictive performance on the large-scale CheXpert dataset [8], thus opening the potential to use PCTs also in a hierarchical setting [20], which, by taking into account the underlying dependency structure and powerful deep features, could advance the current state of the art of supervised MLC deep-learning based approaches and also compete against hierarchical deep-learning based approaches [4, 16], which take the hierarchy into account implicitly, using conditional probability.
3 CHEXPERT: A LARGE CHEST RADIOGRAPH DATASET
CheXpert [8] is a large publicly available dataset for chest radiograph interpretation, consisting of 224,316 CXR images of 65,240 patients, where the presence of 14 different observations is labeled as positive, negative, uncertain or not mentioned. CXR images were collected retrospectively from Stanford Hospital, together with the associated radiology reports. Labels (and their uncertainty) were automatically extracted from the section of the radiology report which summarizes the key findings. A large list of phrases was manually curated by multiple board-certified radiologists to match the various ways observations are mentioned in the reports. Extracted phrases are then classified into positive, negative, uncertain or not-mentioned classes and aggregated into a final set of predefined observations (i.e. pathologies).

The publicly available test data consists of 234 samples from 234 patients, where the ground truth is set by a consensus of 3 radiologists who annotated the set using radiographs; these labels therefore only represent the positive or negative class, with no uncertainties. Evaluation is performed only on 5 observations, selected based on their clinical significance and prevalence in the dataset (i.e. Atelectasis, Cardiomegaly, Consolidation, Edema and Pleural Effusion).

The distribution of all the observed pathologies in the training data and their uncertainty is presented in Figure 2a, and the distribution of observations over a single example in Figure 2b, which shows that there is around a 30% chance of having at least 2 pathologies present at the same time, labeled as definite positive.

Figure 2: (a) Label uncertainty distribution over 14 pathologies in the CheXpert dataset [8] over all the samples in the training data and (b) distribution and probability of occurrence of multiple pathologies in a particular sample (multi-label classification).

In CheXpert [8], different strategies for using the uncertainty labels were evaluated. The two simplest approaches are to ignore uncertain samples during training, or to map them to either the negative or the positive class. A semi-supervised approach is also evaluated, where a model trained with the ignore approach is used to re-label the uncertain examples. Finally, a 3-class classification approach is evaluated, where the uncertain label is used as a separate class during training, while during testing only the probabilities for the positive and negative class are reported. In our work, we use the simple mapping approach, mapping uncertain labels to the positive class and not-mentioned samples to the negative class.
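A minimal sketch of this mapping, assuming the usual CheXpert CSV encoding (1.0 positive, 0.0 negative, -1.0 uncertain, blank not mentioned); the file path is illustrative.

import pandas as pd

OBSERVATIONS = ["Atelectasis", "Cardiomegaly", "Consolidation",
                "Edema", "Pleural Effusion"]

labels = pd.read_csv("CheXpert-v1.0/train.csv")  # illustrative path
for column in OBSERVATIONS:
    labels[column] = labels[column].fillna(0.0)         # not mentioned -> negative
    labels[column] = labels[column].replace(-1.0, 1.0)  # uncertain -> positive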
3.1 Methods
We evaluate an end-to-end deep-learning based approach for multi-label classification (MLC) of CXR images, based on the DenseNet-121 architecture [7], and compare it with the traditional approach based on predictive clustering trees (PCTs) [2], in an ensemble setting, using the features extracted from the pre-trained deep-learning network.

3.2 End-To-End Deep Learning
Several convolutional neural networks (CNNs) were evaluated in CheXpert [8], and the DenseNet-121 [7] architecture produced the best results. Because of that, we used DenseNet-121 for all of our experiments. The original DenseNet is designed for multi-class classification, where the neural network has the same number of output nodes as the number of classes. Each output node belongs to some class and outputs a score for that class. In a multi-class setting, the scores are passed through a softmax layer, which converts scores into probabilities (class probabilities sum to 1), and the input sample is classified into the corresponding class with the highest probability.

In a multi-label classification (MLC) setting, the difference is that an input sample can belong to multiple classes at the same time; the final score thus needs to be independent for each of the classes, which is why a sigmoid function is used instead of softmax. Additionally, the categorical cross-entropy loss function needs to be replaced with binary cross-entropy. We implemented the modified DenseNet-121 in PyTorch (https://pytorch.org/hub/pytorch_vision_densenet), using the Adam optimizer with the same learning rates and parameters as used in CheXpert [8]. The images were resized to 320 x 320, same as in [8], and we trained the network for 10 epochs using a fixed batch size of 32 images. We evaluated the performance on a left-out validation set of 500 images using the receiver operating characteristic curve (ROC) and its area under the curve (AUC), averaged across all observations. The best performing model in terms of the global AUC score was selected for the evaluation on the test set, presented in Section 4.
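The modification can be sketched in PyTorch as follows; the learning rate is an assumption, since the paper only states that the CheXpert settings were reused.

import torch
import torch.nn as nn
from torchvision import models

NUM_OBSERVATIONS = 14  # pathologies labeled in CheXpert

# Swap the multi-class classifier head for one independent logit per
# observation; BCEWithLogitsLoss applies a sigmoid per output, replacing
# the softmax + categorical cross-entropy of the multi-class setting.
model = models.densenet121(pretrained=True)
model.classifier = nn.Linear(model.classifier.in_features, NUM_OBSERVATIONS)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # assumed lr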
3.3 Predictive Clustering Trees
Predictive clustering trees (PCTs) [2] are decision trees viewed as a hierarchy of clusters, where the top node corresponds to one cluster containing all the data, which is recursively partitioned into smaller clusters while moving down the tree. PCTs are constructed with the standard "top-down induction of decision trees" (TDIDT) algorithm; the major difference in comparison with CART [3] or C4.5 [18] induction is that PCTs treat the variance and prototype functions as parameters, selected based on the learning task at hand. To construct a regression tree, for example, the variance function returns the variance of the given instances' target values, and the prototype is their average value. For the task of predicting tuples of discrete variables, used in multi-label classification (MLC) [15], the variance function is computed as the sum of the Gini indices of the variables from the target tuple, and the prototype function returns a vector of probabilities that an example belongs to a particular class in the target tuple.

In our work we utilized PCTs in an ensemble setting, where the predictions of a set of predictive models (i.e. PCTs) are combined to obtain a final prediction. This is especially useful for unstable base predictors (e.g. trees), where small changes in the dataset yield substantially different models, and usually achieves much better predictive performance [12]. We consider a random forest of PCTs (RF-PCT) [12] and ensembles of extremely randomized PCTs (EXTRA-PCT) [11] for MLC. In RF-PCT, several bootstrap replicates are first constructed and a randomized PCT is then applied, by selecting a subset of attributes in each node, on which all possible tests are considered and the best one is selected. The number of attributes selected is a given parameter, typically a function of the total number of attributes (e.g. log(N), where N represents the number of attributes). In EXTRA-PCT, no bootstrap replicates are constructed, and in each internal node, for each attribute, a test is selected randomly.

We used the CLUS framework (http://clus.sourceforge.net/) for the PCT construction, with 50 baseline PCTs for RF-PCT as well as for EXTRA-PCT. The input consisted of the 1024-dimensional features extracted from the pre-trained DenseNet-121 network, taken before the last fully-connected classification layer. Similarly to the DenseNet-121 end-to-end approach, RF-PCT and EXTRA-PCT were evaluated on the test set in terms of the AUC score.
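Extracting these descriptors can be sketched as below, mirroring the forward pass of torchvision's DenseNet up to, but excluding, the final classification layer.

import torch
import torch.nn.functional as F
from torchvision import models

densenet = models.densenet121(pretrained=True)
densenet.eval()

def extract_features(batch):
    """Return 1024-D descriptors: the globally average-pooled output of
    the DenseNet-121 feature extractor, taken just before the final
    fully-connected classification layer."""
    with torch.no_grad():
        fmap = F.relu(densenet.features(batch))        # (N, 1024, H, W)
        pooled = F.adaptive_avg_pool2d(fmap, 1)        # (N, 1024, 1, 1)
        return torch.flatten(pooled, 1)                # (N, 1024)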
4 RESULTS
We evaluated the different approaches on the publicly available test data, consisting of 234 samples from 234 patients, where the ground truth is set by a consensus of 3 radiologists. We report the results in terms of the Receiver Operating Characteristic curves (ROC) in Figure 3 and the area under the curve (AUC) in Table 1. Among the approaches presented in our work (i.e. DenseNet-121, RF-PCT and EXTRA-PCT), DenseNet-121 performs best, with the EXTRA-PCT approach following closely. The biggest differences are observed on the Cardiomegaly class, which coincides with the results reported in CheXpert [8]: most of the uncertain cases are borderline, which reduces the performance of the simple mapping to a positive or negative label. Table 1 also compares the presented approaches against the DenseNet-121 baseline presented in CheXpert [8], where 10 checkpoints per run were chosen and each model was run three times, generating an ensemble of 30 models, which improved the results by a small margin over our baseline DenseNet-121 approach. Nevertheless, we achieved or surpassed the CheXpert results on the Cardiomegaly and Pleural Effusion classes and achieved similar performance on the other classes.

Figure 3: Receiver Operating Characteristic curves (ROC) for the end-to-end deep-learning approach based on DenseNet-121 (a) and ensembles of Predictive Clustering Trees (PCTs) based on random forest (b) and extremely randomized trees (c).

Table 1: Comparison of different methods against the baseline CheXpert results [8] in terms of AUC scores.

Method                  Atelectasis  Cardiomegaly  Consolidation  Edema  Pleural Effusion
CheXpert (U-Ones) [8]   0.86         0.83          0.90           0.94   0.93
DenseNet-121            0.81         0.84          0.86           0.92   0.93
RF-PCT                  0.83         0.75          0.83           0.90   0.92
EXTRA-PCT               0.82         0.80          0.86           0.90   0.93

5 CONCLUSION
In this paper we addressed the problem of Chest X-ray (CXR) classification in a multi-label classification (MLC) setting and compared an end-to-end deep-learning based approach with different ensembles of predictive clustering trees (PCTs), showing that similar predictive performance can be achieved when using features extracted from the pre-trained deep-learning model. These results show the potential to use PCTs also in a hierarchical setting, which, by taking into account the underlying dependency structure and powerful deep features, could advance the current state of the art.

ACKNOWLEDGMENTS
This work has been supported by the H2020 iPC project (826121).

REFERENCES
[1] Worawate Ausawalaithong, Arjaree Thirach, Sanparith Marukatat, and Theerawit Wilaiprasitporn. 2018. Automatic lung cancer prediction from chest x-ray images using the deep learning approach. In 2018 11th Biomedical Engineering International Conference (BMEiCON). IEEE, 1–5.
[2] Hendrik Blockeel, Luc De Raedt, and Jan Ramon. 1998. Top-down induction of clustering trees. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 55–63.
[3] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Statistics/Probability Series. (1984).
[4] Haomin Chen, Shun Miao, Daguang Xu, Gregory D. Hager, and Adam P. Harrison. 2019. Deep hierarchical multi-label classification of chest x-ray images. In International Conference on Medical Imaging with Deep Learning, 109–120.
[5] Louke Delrue, Robert Gosselin, Bart Ilsen, An Van Landeghem, Johan de Mey, and Philippe Duyck. 2011. Difficulties in the interpretation of chest radiography. In Comparative Interpretation of CT and Standard Radiography of the Chest. Springer, 27–49.
[6] [n. d.] Diagnostic Imaging Dataset 2019-20 Data, NHS England. (Accessed 4 August 2020).
[7] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4700–4708.
[8] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. 2019. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence. Volume 33, 590–597.
[9] Amit Kumar Jaiswal, Prayag Tiwari, Sachin Kumar, Deepak Gupta, Ashish Khanna, and Joel J. P. C. Rodrigues. 2019. Identifying pneumonia in chest x-rays: a deep learning approach. Measurement, 145, 511–518.
[10] Alistair E. W. Johnson, Tom J. Pollard, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G. Mark, Seth J. Berkowitz, and Steven Horng. 2019. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042.
[11] Dragi Kocev and Michelangelo Ceci. 2015. Ensembles of extremely randomized trees for multi-target regression. In International Conference on Discovery Science. Springer, 86–100.
[12] Dragi Kocev, Celine Vens, Jan Struyf, and Sašo Džeroski. 2013. Tree ensembles for predicting structured outputs. Pattern Recognition, 46, 3, 817–833.
[13] Paras Lakhani and Baskaran Sundaram. 2017. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology, 284, 2, 574–582.
[14] June-Goo Lee, Sanghoon Jun, Young-Won Cho, Hyunna Lee, Guk Bae Kim, Joon Beom Seo, and Namkug Kim. 2017. Deep learning in medical imaging: general overview. Korean Journal of Radiology, 18, 4, 570–584.
[15] Gjorgji Madjarov, Dragi Kocev, Dejan Gjorgjevikj, and Sašo Džeroski. 2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45, 9, 3084–3104.
[16] Hieu H. Pham, Tung T. Le, Dat Q. Tran, Dat T. Ngo, and Ha Q. Nguyen. 2019. Interpreting chest x-rays via CNNs that exploit disease dependencies and uncertainty labels. arXiv preprint arXiv:1911.06475.
[17] Chunli Qin, Demin Yao, Yonghong Shi, and Zhijian Song. 2018. Computer-aided detection in chest radiography based on artificial intelligence: a survey. Biomedical Engineering Online, 17, 1, 113.
[18] J. Ross Quinlan. 2014. C4.5: Programs for Machine Learning. Elsevier.
[19] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. 2017. CheXNet: radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225.
[20] Celine Vens, Jan Struyf, Leander Schietgat, Sašo Džeroski, and Hendrik Blockeel. 2008. Decision trees for hierarchical multi-label classification. Machine Learning, 73, 2, 185.
[21] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. 2017. ChestX-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2097–2106.

Smart Issue Retrieval Application

Jernej Zupančič, Borut Budna, Miha Mlakar, Maj Smerkol
jernej.zupancic@ijs.si, borut.budna@ijs.si, miha.mlakar@ijs.si, maj.smerkol@ijs.si
Jožef Stefan Institute, Jamova cesta 39, Ljubljana, Slovenia
Faculty of Computer and Information Science, Ljubljana, Slovenia
Jožef Stefan International Postgraduate School, Ljubljana, Slovenia

Figure 1: SIRA screenshot

ABSTRACT
We present the Smart Issue Retrieval Application (SIRA), a customer support tool for searching for relevant email threads or issues, given an email thread and keywords. We present the overall application architecture, the processing pipeline that transforms the data into a search-friendly form, and the search algorithm itself.

KEYWORDS
customer support, language models, information retrieval

1 INTRODUCTION
Customer support is an important part of many large businesses, and high-quality customer support can improve the user experience and help businesses retain their customers for longer periods. For larger companies, it can also be a strain on their human resources, as many customer support issues need to be resolved in a short time. While the customer support team may resolve most issues on their own, sometimes they need the help of the development department. Often similar issues are presented to the developers multiple times.

In order to minimize the number of issues that need attention from other departments, we have developed an application to help customer support technicians resolve issues without help from developers. While some issues will still need the attention of developers, SIRA can help find existing answers to questions that have already been resolved by developers and therefore reduce the amount of distraction for the development team.

We use language models in order to retrieve information about the question from the issue at hand. Using multiple different approaches, the application searches the database of resolved issues in order to find the developers' answers to the same or similar questions.

2 SIRA ARCHITECTURE
SIRA comprises five main application components (Fig. 2):
(1) Database. PostgreSQL [6] is used as the application database, since it includes decent built-in text search capabilities and change data capture options.
(2) Processing daemon. A Python [7] process responsible for processing the data for search in the event of change data capture.
(3) Back-end application. A Python Flask-based back-end application exposing the application programming interface for SIRA.
(4) Front-end application. A React-based [8] single-page application for interacting with SIRA.
(5) Documentation. MkDocs-based user documentation for final users, admins, and developers.

Each SIRA component is packaged within a Docker [4] image and can be managed using the "docker-compose" [2] tool. This enables deterministic packaging of the application code for development, testing and production.

Figure 2: SIRA architecture overview
3 SIRA FUNCTIONALITY
The main goal of SIRA is to enable customer support staff to quickly find answers to similar questions that have already been resolved in the past. Search is therefore the primary functionality of the application and can be split into three parts:
(1) Processing. Upon new data arrival, pre-process the text to obtain a representation suitable for search.
(2) Search. Compute relevancy scores upon a search request by taking into account as much information about the issue or email thread as possible.
(3) Logging. To improve the search in the future, the search results and structured user feedback are gathered and stored.
In the rest of this section we describe each part in more detail.

3.1 Processing
For the search to be efficient, it is beneficial to pre-process the raw emails. The processing daemon runs as a separate Python process and utilizes PostgreSQL's logical replication functionality in order to transform new content as soon as it is written to the database. The following steps are executed when processing the issues:
(1) HTML cleaning. The Beautiful Soup [1] library is used to extract only the relevant text from the email XML markup.
(2) Empty line removal. A Python script is used to detect and remove empty lines.
(3) Repeated email removal. Parts of emails are deleted if they already appear within some previous email of the same issue.
(4) Semi-structured email handling. Some emails are actually a filled-out form in an email format. A Python script is used to extract only the relevant information.
(5) Non-author line removal. A machine learning model was developed and deployed for this task.
(6) Removal of lines consisting only of non-alphanumeric characters. A Python script is used to detect and remove such lines.
(7) Word vector representation computation and update. FastText [3] word vectors are used to compute the word vector representation of the text.
(8) Storing of the processed text. The processed text is stored in the database, where built-in database indexing is utilized to further prepare the text for efficient searching.
In the rest of this section we focus on the non-trivial processing steps.

3.1.1 Repeated email removal. There were two reasons for removing repeated emails from an email thread. First, when displaying an email, usually all the previous emails are also included, which results in poor readability. Second, some methods for comparing text take into account the number of occurrences of a particular word. This is sensible when the word actually repeats in the content; however, if it repeats due to text duplication, it could negatively impact the search results.

We define a repeated email as an email body that appears within another email body. This is usually a result of using the "Reply" functionality when responding to an email within an email client. To delete email A from email B, the following method is used:
(1) Extract only the alphanumeric characters from the two email bodies A and B to get alphanumeric(A) and alphanumeric(B).
(2) If alphanumeric(A) appears within alphanumeric(B), mark it for removal from alphanumeric(B).
(3) If alphanumeric(A) does not appear within alphanumeric(B), iterate over substrings of alphanumeric(B) and compute the matching percentage of consecutive alphanumeric blocks from alphanumeric(A). The substring with the maximum match is a candidate for removal. If it exceeds a predefined threshold, it is marked for removal from alphanumeric(B).
(4) Reconstruct B by dropping the substring marked for removal, together with all non-alphanumeric characters positioned within the marked substring when expanded with all the characters.
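A simplified sketch of steps (1)-(3): the threshold value is an assumption (the paper does not state it), and difflib's ratio is used as a stand-in for the matching percentage over consecutive alphanumeric blocks.

import re
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.9  # assumed; the paper only mentions "a predefined threshold"

def alphanumeric(text):
    """Step (1): keep only the alphanumeric characters."""
    return re.sub(r"[^0-9a-zA-Z]", "", text)

def is_repeated(email_a, email_b):
    """Steps (2)-(3): decide whether email A should be removed from B."""
    a, b = alphanumeric(email_a), alphanumeric(email_b)
    if a and a in b:
        return True
    return bool(a) and SequenceMatcher(None, a, b).ratio() >= MATCH_THRESHOLD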
3.1.2 Non-author line removal. An email body usually comprises:
(1) relevant content,
(2) a signature,
(3) a confidentiality notice,
(4) previous email headers,
(5) previous email content.
The only text that should be used for text comparison is the relevant content. While previous email content was mostly removed in the repeated email removal step (3.1.1), the other email body parts can still impact the text comparison results. Machine learning was utilized to develop a model for determining whether a particular line in the email body belongs to the relevant content of an email or not.

Dataset preparation. First, we implemented an application with a basic graphical user interface that enabled us to label each line with one of the following categories:
(1) AUTHOR. The relevant content falls into this category.
(2) QUOTED. The previous email content.
(3) AUTO-PERSONALIZED. Text that was set by the user in the email client and is inserted automatically by the email client; a signature is an example.
(4) AUTO-NON-PERSONALIZED. Text inserted by the email client automatically; previous email headers are an example.
(5) NEEDS-PRETTIFY. Sometimes the whole email body is present in one line only. To properly label the body, it should first be split into multiple lines.
(6) OTHER. Everything else.
Second, we labeled each line belonging to 100 random issues. This way we generated a dataset of 37,421 labeled lines in 586 emails. Since the assumption was that the "QUOTED" lines are already filtered out by the repeated email removal method, we omitted those lines from the dataset. This left us with 9,848 labeled lines.

Features. The computed features were of two types: local features that took into account just the current line, and global features that took into account the relative position and content of a line within the whole email.

Local features:
(1) Number and proportion of capitalized words
(2) Number and proportion of non-alphanumeric characters
(3) Number and proportion of numeric characters
(4) "CountVectorizer" from the scikit-learn [5] package
(5) "TfidfVectorizer" from the scikit-learn package
(6) Word vector representation of the line

Global features:
(1) Line position from the start
(2) Line position from the end
(3) Does "regard" appear before this line, within this line, or after this line
(4) Do four or more consecutive non-alphanumeric characters appear before this line, within this line, or after this line
(5) Does a date-like string appear before this line, within this line, or after this line
(6) Does a time-like string appear before this line, within this line, or after this line
In order to smooth the predictions, we also tested hierarchical modeling, by first building a model for "AUTHOR" detection and then using the predictions of the lower level as additional features for the higher level. One approach for using the lower-level predictions was to use the "AUTHOR" predictions of the lines just before and just after the current line; the predictions were padded with 1 at the beginning of an email and with 0 at the end. The second approach was based on the sum of three consecutive "AUTHOR" class probabilities: for the lines just before the current line, for the lines where the current line is in the middle, and for the lines just after the current line, with the same padding.

Further, the features were scaled using the StandardScaler, and the feature space dimensionality was reduced using principal component analysis (PCA), both from the scikit-learn package.

Models. For modeling we utilized the scikit-learn package and tested the following algorithms: (1) logistic regression, (2) multinomial naive Bayes, (3) support vector machine, (4) random forest classifier. Rudimentary hyper-parameter tuning was done to pick the best ones.

Evaluation. Each pipeline was evaluated using 10-fold cross validation with the splits over issues. This means that all the lines belonging to one issue were either in the training or in the testing set, to prevent data leaking.

Model selection. The performance of all models was tracked through various metrics:
(1) Confusion matrix
(2) Precision and recall at different minimum recall thresholds
(3) Precision-recall curve
(4) "AUTHOR" probabilities for each line in the test set
The main concern regarding model performance was that it should prioritize keeping the "AUTHOR" lines ("AUTHOR" recall) over average model accuracy. This is a direct result of the application architecture: if a line were removed by the chosen model, it would not be possible to search over it, which would directly impact real-world performance. Additionally, a few extra lines should not hinder readability too much. The gathered metrics enabled us to closely inspect each model and assess its expected real-world performance. A basic GUI was built to inspect the models and review the misclassified examples. In the end, the hierarchical model was chosen, with most of the presented features, with the exception of the "CountVectorizer" and "TfidfVectorizer" features. The additional chosen higher-level feature was the sum of three consecutive "AUTHOR" probabilities. Random forest was chosen as the classification algorithm, without the feature standardization or dimensionality reduction steps. The threshold probability was lowered to 0.12 so that recall could be kept high. The final model misclassified 59 out of 2,394 lines marked as "AUTHOR" (recall = 0.975) and 629 out of 7,454 lines marked as "OTHER" (recall = 0.806).

3.1.3 Word vector representation computation and update. The word vector representation of content is used to compare email bodies and email subjects between different issues. To compute the word vector representation of a text, either an issue body or an issue subject, the following steps are executed: (1) tokenize the text, (2) remove stop words, (3) query the word vector representation for each word using fastText common-crawl word vectors with dimension 300, (4) compute the mean of all word vectors belonging to the words in the text, (5) normalize the mean vector by dividing it by its length.

Instead of generating the representation vectors on the fly, they are pre-computed and only read when needed, which greatly reduces the inference time. To update the word vector representation of a particular text, the corresponding row in the word vector matrix is updated with the new values and stored on disk as a NumPy array.
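The representation of Section 3.1.3 can be sketched as follows, assuming word_vectors is a dict-like token-to-vector mapping (e.g. fastText vectors loaded with gensim) and stop_words a set of stop words; both names are assumptions, not artifacts of SIRA.

import numpy as np

def embed(text, word_vectors, stop_words):
    """Steps (1)-(5): tokenize, drop stop words, average the 300-D
    fastText vectors and normalize the mean to unit length."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return np.zeros(300)
    mean = np.mean(vectors, axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean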
On the other hand, a keyword can tested the following algorithms: (1) Logistic regression, (2) Multi- be implicit – soft keywords, where the user searched for relevant nomial Naive Bayes, (3) Support vector machine, (4) Random issues using a keyword, but the search results were not marked forest classifier. as relevant. Rudimentary hyper-parameter tuning was done to pick the When computing the relevancy of issues, given a starting best ones. issue and some keywords, several relevancy sub-scores are first Evaluation. Each pipeline was evaluated using 10-fold cross computed and then aggregated to form a single relevancy score. validation with the splits over issues. This means that all the lines In Table 1 all combinations for relevance sub-scores are listed. belonging to one issue were either in the training or the testing The final score is computed as a weighted average, as in equa- set to prevent data leaking. tion 1. The weights 𝑤 were determined based on the final user Model selection. The performance of all models was tracked 𝑖 feedback. through various metrics: (1) Confusion matrix finalScore = 𝑤1 · KeywordToKeywordScore (2) Precision and recall at different minimum recall thresholds + 𝑤2 · KeywordToSoftKeywordScore (3) Precision-recall curve (4) “AUTHOR” probabilities for each line in the test set + 𝑤3 · KeywordToDocumentScore The main concern regarding the model performance was that + 𝑤4 · KeywordToSubjectdScore it should prioritize keeping the “AUTHOR” lines (“AUTHOR” + 𝑤5 · DocumentToKeywordScore (1) recall) over average model accuracy. This is a direct result of the + 𝑤 application architecture – if the line would be removed by the 6 · DocumentToSoftKeywordScore chosen model, it wouldn’t be possible to search over it. This would + 𝑤7 · DocumentToDocumentScore directly impact the performance in the real-world. Additionally, + 𝑤8 · SubjectToKeywordScore few additional lines shouldn’t hinder the readability too much. The gathered metrics enabled us to closely inspect each model + 𝑤9 · SubjectToSoftKeywordScore and overview the performance regarding real-world application. + 𝑤10 · SubjectToSubjectScore 114 Information society ’20, October 5–9, 2020, Ljubljana, Slovenia Novak, et al. Table 1: Relevance sub-scores matrix Other issues (Not) Keyword Soft (Not-) keyword Document Subject (Not-) Keyword Exact match Exact match Full-text search Full-text search issue Soft (Not) Keyword / / / / ent Word vector cosine Document Reverse full-text Reverse full-text / search search similarity Curr Word vector cosine Subject Reverse full-text Reverse full-text / search search similarity 3.2.1 Exact match. This relevance score compares (soft) key- the database, including user defined keywords and appropriate words related to issues and those inserted in the keyword input results marking. box. Given a (soft) keyword, search for all the documents that are Preprocessing is done without any user interaction and in- in relation to this exact (soft) keyword. Each relation can either volves multiple algorithms and AI methods to extract the text of be positive or negative. Therefore, the returned score is positive the issue from original encoded emails. Testing of the algorithms in case of positive relation and negative otherwise. shows good results both in terms of precision and recall. Word vector representations are pre-computed in order to improve 3.2.2 Full-text search. This relevance score compares keywords performance of search algorithms. 
3.2.1 Exact match. This relevance score compares the (soft) keywords related to issues with those inserted in the keyword input box. Given a (soft) keyword, we search for all the documents that are in relation to this exact (soft) keyword. Each relation can be either positive or negative; therefore, the returned score is positive in the case of a positive relation and negative otherwise.

3.2.2 Full-text search. This relevance score compares the keywords entered in the keyword input box with the issue documents or issue subjects. The full-text search capability of PostgreSQL is leveraged for this score. However, the results are modified to return negative scores in the case of a not-keyword match.

3.2.3 Reverse full-text search. This relevance score compares the selected issue document or subject with all existing (soft) keywords. First, a full-text search relevance score is computed for each keyword. Second, for each issue in the database, the relevance scores of its related keywords are summed.

3.2.4 Word vector cosine similarity. This relevance score compares the selected issue document and subject to all existing issue documents and subjects, respectively. The pre-computed word vectors described in Section 3.1.3 are used. The relevance score is computed as:

wordVectorSimilarity(T₁, T₂) = T₁ · T₂.          (2)

Since the word vectors used are normalized, this equals 1 minus the cosine distance between T₁ and T₂.

Two other methods for comparing the text were also tested: PostgreSQL's built-in trigram text similarity, which was too slow for production use, and a tf-idf representation of text with a cosine-distance-based relevance score, which did not perform as well as the word vector method.

3.3 Logging
To improve the search performance in the future, several interactions with the application are logged:
(1) search results with relevance scores,
(2) viewed search results,
(3) relevant issue/belonging email found,
(4) no relevant issue/belonging email found.
Only after sufficient real-world usage of the application will we be able to quantitatively evaluate the performance of the whole search pipeline and act upon the results.

4 DISCUSSION AND CONCLUSION
The SIRA system was developed and deployed, including five Docker-image packaged modules. The main functionalities of the first major release include preprocessing of the issue text, search integrating four different search algorithms, and a logging system that stores interactions with the system into the database, including user-defined keywords and appropriate marking of the results.

Preprocessing is done without any user interaction and involves multiple algorithms and AI methods to extract the text of the issue from the original encoded emails. Testing of the algorithms shows good results in terms of both precision and recall. Word vector representations are pre-computed in order to improve the performance of the search algorithms.

Based on the extracted plain text of the issue, the application searches for similar issues that have already been resolved. The users can therefore quickly find the information related to the issue. The system is currently in use, and only after some time of real-world usage will we be able to evaluate the whole system. By logging the interactions in the database, we expect to be able to analyze the usage and the quality of the results. This will allow us to improve the system and add other functionality that will improve the user experience and further improve the customer support technicians' workflow.

ACKNOWLEDGMENTS
Nicelabel d.o.o. funded the research presented in this paper. We thank Gregor Grasselli, Zdenko Vuk and Miha Štravs for their help in application development.

REFERENCES
[1] Beautiful Soup Developers. 2019. Beautiful Soup. https://www.crummy.com/software/BeautifulSoup/. (2019).
[2] Docker Inc. 2019. Docker-compose. https://docs.docker.com/compose/. (2019).
[3] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
[4] Dirk Merkel. 2014. Docker: lightweight Linux containers for consistent development and deployment. Linux Journal, 2014, 239, 2.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
[6] PostgreSQL Global Development Group. 2019. PostgreSQL, version 12. http://www.postgresql.org. (2019).
[7] Python Software Foundation. 2018. Python language reference, version 3.7. http://www.python.org. (2018).
[8] React Developers. 2019. React. https://reactjs.org/. (2019).
Adaptation of Text to Publication Type

Luka Žontar
University of Ljubljana, Faculty of Computer and Information Science
Ljubljana, Slovenia
zontarluka98@gmail.com

Zoran Bosnić
University of Ljubljana, Faculty of Computer and Information Science
Ljubljana, Slovenia
zoran.bosnic@fri.uni-lj.si

ABSTRACT
In this paper, we propose a methodology that can adapt texts to target publication types using summarization, natural language generation and paraphrasing. The solution is based on key text evaluation characteristics that describe different publication types. To examine types such as social media posts, newspaper articles, research articles and official statements, we use three distinct text evaluation metrics: length, text polarity and readability. Our methodology iteratively adapts each of the text evaluation metrics. To alter length, we focus on abstractive summarization using text-to-text transformers and on distinct natural language generation models that are fine-tuned for each target publication type. Next, we adapt polarity and readability using synonym replacement and, additionally, manipulate the latter by replacing sentences with paraphrases, which are automatically generated using a fine-tuned text-to-text transformer. The results show that the proposed methodology successfully adapts the text evaluation metrics to target publication types. We find that in some cases adapting the chosen text evaluation metrics is not enough and our methodology can corrupt the content. Generally, however, our methodology generates suitable texts that we could present to a target audience.

KEYWORDS
text adaptation, context-aware, artificial intelligence, text summarization, natural language processing

1 INTRODUCTION
With ever-increasing internet usage, the amount of textual data on the internet is growing rapidly. However, different media target different audiences, and thus an arbitrary article may not be appropriate for everyone. Consequently, already published content is being rewritten and adapted for other target audiences.

Why is targeting audiences so important? When speaking with someone in person, we adjust our body language, tone and the words we use, so that the audience understands the message we are trying to send. In a similar manner, we also have to be aware of the target audience when writing. Even though the task of adapting texts to different audiences may look easy to experienced writers, rookies and amateurs may struggle in selecting the information that might be relevant to a particular target audience. Nevertheless, a way with words and some common sense should be enough to complete the task, but due to the latter requirement, automating this task becomes a much harder problem.
In this paper, we adapt texts to context by manipulating three text evaluation metrics: length, polarity and readability. Our method is able to transition between social media posts, research articles, newspaper articles and official statements, where each publication type targets a different audience. While governmental institutions and academics both publish neutrally-oriented texts, research articles tend to be much longer than official statements. Social media and news usually target wider audiences, which is why their texts should be more readable; however, the two can be separated by the amount of opinion they include, as newspaper articles should be less biased and thus include fewer positively or negatively oriented words.

Our methodology iteratively adapts the key text evaluation metrics towards the mean values of the target publication type, which are calculated from a sample set of articles. In each iteration, our method first manipulates length using abstractive summarization techniques and natural language generation models. Next, it replaces words with more appropriate synonyms and adjusts the polarity and readability scores. Finally, it uses a fine-tuned text-to-text transformer to generate more appropriate paraphrases that replace whole sentences in our text and alter readability.

2 RELATED WORK
While we are trying to automatically adapt texts to a particular genre, researchers have already made progress in automatic text simplification, which adapts text to be more readable and easier to understand. Carroll and Tait [2] developed a methodology to simplify texts for people that suffer from aphasia, a disability of language processing. The developed system consists of an analyser component, which provides syntactic analysis, and a simplifier component, which adapts texts using lexical and syntactic simplification. The lexical simplifier replaces the words in the text with synonyms by considering the Kucera-Francis frequency of each available synonym held in WordNet. Syntactic constructions that are not in Subject-Verb-Object order can also be hard to process for aphasic people; therefore, the authors proposed several syntactic simplifications, such as the replacement of passive constructions with active ones.

A lot of research has already been done on how to evaluate and alter text, and we use many existing methods to help us develop our methodology. We picked three text evaluation metrics that can be reasonably altered using existing methods. Flesch [5] developed an equation that determines the readability of a text using the words-per-sentence and syllables-per-word ratios. Even though structure-based metrics are important, we also have to consider the message of the text. Using sentiment analysis, we can determine whether the writer has positive or negative affections towards the topic of the text. Feldman [4] discusses several approaches to sentiment analysis based on the unit being classified (i.e. documents, sentences, aspects).
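For reference, the Flesch Reading Ease score is 206.835 − 1.015 · (words/sentences) − 84.6 · (syllables/words), and a direct sketch might look as follows; syllable_count stands for any word-level syllable counter and is an assumed helper, not part of the original work.

import re

def flesch_reading_ease(text, syllable_count):
    """Flesch Reading Ease from the words-per-sentence and
    syllables-per-word ratios; higher scores mean easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(syllable_count(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) \
                   - 84.6 * (n_syllables / n_words)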
As length is one of the chosen text evaluation metrics that we wish to adapt, we have to be able to both summarize and extend the text. According to Allahyari et al. [1], we differentiate between extractive and abstractive summarization approaches. Extractive approaches shorten the original text by excluding less relevant sentences. The significance of a sentence can be evaluated by determining whether the sentence is related to the main topic or whether its content is distinctive in comparison to other sentences. On the other hand, abstractive approaches tend to summarize texts in a new (more human-like) manner by structuring the text into some logical form, such as graphs, trees and ontologies [6].

When adapting shorter texts to longer ones, natural language generation has proven to be a very strong tool. Radford et al. [7] developed a natural language generation technique to generate additional text and produced state-of-the-art results using unsupervised multitask learners for model learning. Their model was trained to predict the next word in a text based on 40GB of Internet content. They concluded that large training datasets and models trained to maximize the likelihood of a sufficiently varied corpus can learn a surprising number of tasks, with no supervision needed in training.

Another method that is commonly used when adapting texts to context is paraphrasing, i.e. rewording something written by changing its structure or replacing the words with their synonyms. Goutham [9] used a pre-trained text-to-text transfer transformer to generate paraphrases of questions. The model was fine-tuned with questions from Quora as the input texts and the questions labeled as their duplicates as the expected outputs.

In our paper, we plan to exploit the aforementioned abstractive summarization technique to shorten our texts and to fine-tune the pre-trained natural language generation model that Radford et al. [7] developed. Similarly to Goutham [9], we intend to fine-tune a pre-trained text-to-text transformer that is able to generate paraphrases of a sentence. To calculate the readability score of the input text, we plan to use the formula proposed by Flesch [5].
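With the Hugging Face transformers library, such paraphrase generation can be sketched as below; the checkpoint name and the "paraphrase:" prefix are assumptions, standing in for a T5 model fine-tuned on duplicate-question pairs as in [9] (a plain t5-base checkpoint will not paraphrase well on its own).

from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "t5-base"  # placeholder for a paraphrase-fine-tuned checkpoint
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def paraphrases(sentence, n=5):
    """Sample n candidate paraphrases for a single sentence."""
    inputs = tokenizer("paraphrase: " + sentence, return_tensors="pt")
    outputs = model.generate(**inputs, do_sample=True, top_k=50,
                             num_return_sequences=n, max_length=64)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]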
In our paper, we exploit the aforementioned abstractive summarization technique to shorten texts, and we fine-tune the pre-trained natural language generation model developed by Radford et al. [7]. Similarly to Goutham [9], we fine-tune a pre-trained text-to-text transformer to generate paraphrases of sentences. To calculate the readability score of the input text, we use the formula proposed by Flesch [5].

3 ADAPTATION OF TEXT
As mentioned before, the proposed method iteratively manipulates the chosen text evaluation metrics to adapt a text to a different target audience. Figure 1 gives an overview of the method. Before the process starts, we calculate the initial values of the text evaluation metrics for each publication type as the average values over a set of documents. Our main dataset consists of 150 documents per publication type; all documents contain COVID-19 related content, which minimizes the effect of variables that we do not take into account in text adaptation. We also define the number of iterations (in our case 5) and the acceptable error ε (in our case ε = 0.1), which determines whether it is still worth altering a particular text evaluation metric.

Figure 1: Flowchart of the text adaptation methodology

In each iteration, the relative differences between the current and the initial values of the text evaluation metrics are calculated. If the absolute relative difference of some metric exceeds ε, we try to adjust the metric towards the target value. We adjust the key text evaluation metrics in the main loop of the process in Figure 1 using the following procedures (a simplified sketch of the synonym replacement loop is given at the end of this section):

• If the target length is smaller than the current length, we use a pre-trained T5 text-to-text transformer [8] to summarize the input text. T5 is an encoder-decoder model that uses transfer learning: it is first pre-trained on a data-rich task using texts from the Colossal Clean Crawled Corpus and then fine-tuned on the downstream task using a dataset of texts from the aforementioned corpus with their summaries as the expected outputs.

• If the input text is shorter than the average text of the target publication type, we generate additional text using fine-tuned natural language generation models. We take four copies of the pre-trained GPT2 natural language generation model [7], based on the aforementioned unsupervised multitask learners, and fine-tune each of them on a dataset of 100 texts of one of the considered publication types, so that it generates texts similar to the ones it was fine-tuned on. Consequently, we assume that the generated text needs less further adaptation.

• While adapting length may be the procedure with the most visible results, we also have to adapt the other text evaluation metrics. We developed a synonym replacement procedure that adjusts the polarity and readability scores towards the target values. The procedure runs in iterations; in each iteration, the word with the highest sum of absolute relative differences of the polarity and readability scores to the initial values of the target publication type is replaced with its optimal synonym, i.e. the synonym that minimizes this sum. We used the lexical database WordNet to acquire the synonyms of the considered word.

• Finally, we alter readability by generating paraphrases with a T5 text-to-text transformer [9] that was fine-tuned on the Microsoft Research Paraphrase Corpus [3]. We then pick the optimal paraphrase, i.e. the one that minimizes the relative difference to the target readability score.

Replacing sentences with their paraphrases could potentially also alter length and polarity. We tested this assumption by generating five paraphrases for each sentence in 100 documents per publication type and found that the relative differences of length and polarity between the initial sentences and their paraphrases are not significant: in this preliminary analysis, the mean relative difference of the polarity scores was 0.91 · 10⁻³ and the mean relative difference of the lengths was 0.11 · 10⁻³.
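The sketch below shows a simplified version of the greedy metric-adjustment loop for the synonym replacement step, using WordNet through NLTK. The polarity and readability scoring callables are assumed to be supplied by the caller, and trying every single-word substitution is our simplification; the actual procedure also interleaves the summarization, generation and paraphrasing steps:

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

EPSILON = 0.1    # acceptable error ε from Section 3
ITERATIONS = 5   # number of iterations from Section 3

def adapt_scores(text, target, polarity, readability):
    # Sum of absolute relative differences to the target metric values
    # (assumes non-zero targets).
    def error(t):
        return (abs(polarity(t) - target["polarity"]) / abs(target["polarity"])
                + abs(readability(t) - target["readability"]) / abs(target["readability"]))

    for _ in range(ITERATIONS):
        if error(text) <= EPSILON:
            break
        words = text.split()
        best = text
        for i, word in enumerate(words):
            # Candidate synonyms of the word from WordNet.
            synonyms = {lemma.name().replace("_", " ")
                        for synset in wn.synsets(word)
                        for lemma in synset.lemmas()} - {word}
            for synonym in synonyms:
                candidate = " ".join(words[:i] + [synonym] + words[i + 1:])
                if error(candidate) < error(best):
                    best = candidate  # keep the substitution that helps most
        if best == text:
            break  # no synonym improves the scores any further
        text = best
    return text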
4 EVALUATION AND RESULTS
In our experiments, we evaluate the quality of the text transformations between all possible pairs of four publication media types: social media, news, research articles and official statements. We tested the methodology by adapting a subset of the main dataset introduced in Section 3. The subset consists of 100 documents per publication type (400 altogether), randomly chosen from the main dataset. We adapted each document to the other three publication types and thus tested all 12 possible transitions. We observed how the key text evaluation metrics behaved and whether the generated text was meaningful. The values of the text evaluation metrics before and after adaptation are shown in Table 1, and Table 2 presents the content quality evaluation of the generated texts.

Input type / Metric             | Official statements | Research articles | News          | Social media
Official statements Length      | –                   | 0.79 → 0.04       | 0.04 → 0.03   | 36.39 → 0.35
Official statements Polarity    | –                   | 2.88 → 0.15       | 2.05 → 0.04   | 2.78 → 0.40
Official statements Readability | –                   | 0.36 → 0.75       | 0.23 → 0.35   | 0.40 → 0.24
Research articles Length        | 3.06 → 0.05         | –                 | 2.99 → 0.04   | 136.23 → 0.33
Research articles Polarity      | 0.81 → 0.27         | –                 | 0.33 → 0.07   | 0.18 → 0.46
Research articles Readability   | 0.17 → 0.08         | –                 | 0.34 → 0.22   | 0.45 → 0.12
News Length                     | 0.97 → 0.03         | 0.99 → 0.03       | –             | 63.79 → 0.40
News Polarity                   | 0.88 → 0.14         | 0.43 → 0.10       | –             | 0.33 → 0.37
News Readability                | 1.21 → 0.05         | 1.20 → 0.84       | –             | 0.24 → 0.11
Social media Length             | 0.69 → 0.02         | 0.64 → 0.03       | 0.97 → 0.04   | –
Social media Polarity           | 0.85 → 0.28         | 0.28 → 0.02       | 0.55 → 0.06   | –
Social media Readability        | 0.71 → 0.27         | 0.69 → 0.80       | 0.24 → 0.28   | –

Table 1: Absolute relative differences to the initial values of the target publication type before → after the transition (columns: target type; "–" marks the untested identity transition)

From Table 1 we can observe that the text evaluation metrics changed in the right direction; in most cases their values improved significantly. The length manipulation consistently decreased the relative difference to the target length and on many occasions even converged below ε. The polarity and readability scores appear harder to adapt; however, in each case we successfully reduced the sum of the relative differences of these two metrics, from which we conclude that the synonym replacement method also performs suitably. Its inefficiency may be caused by the limited choice of synonyms and paraphrases and by the limited number of words and sentences that can be replaced.

As an example, we adapted this research article into a social media post. The statements highlighted in yellow in Figure 2, such as "The authors have proposed" and "The researchers used", imply that the social media post talks about a research article, which it does. Furthermore, the replacement of the word "texts" with "written matters" and of the word "audiences" with "audience groups" indicates that the initial readability of this research article is higher than the value expected for social media posts, because these transformations lower the Flesch Reading Ease score. The content is appropriate, as it extracts some of the most crucial concepts of this article.

Figure 2: Example of text adaptation from this research article to a social media post

Additionally, we evaluated the content quality by checking the semantic similarity between the input and the generated text. Using GloVe word embeddings, we transformed the texts into vectors and calculated the angle between them: with the cosine measure we evaluated whether the vectors point in a similar direction, i.e. whether the contents of the texts are similar. Table 2 presents the mean cosine similarities between the GloVe embeddings of the input and the adapted texts (a sketch of this check follows the table). The results show that the generated texts preserve the original content. The cosine similarity scores are high for all transitions; however, they are somewhat lower when we adapt to or from a social media post. This could be a consequence of the inability to thoroughly capture the content in the short texts expected on social media.

Target type \ Original type | Research article | Official statement | Social media | News
Research article            | –                | 0.94               | 0.82         | 0.97
Official statement          | 0.95             | –                  | 0.82         | 0.97
Social media                | 0.83             | 0.93               | –            | 0.90
News                        | 0.95             | 0.96               | 0.82         | –

Table 2: Cosine similarities between GloVe embeddings of the input and the adapted texts
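This similarity check can be sketched as follows, assuming pre-trained GloVe vectors are available in a local text file; the file name, the whitespace tokenization and the averaging of word vectors into a single text vector are our illustrative choices:

import numpy as np

def load_glove(path):
    # Each line of a GloVe file: the word followed by its vector components.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def text_vector(text, vectors):
    # Represent a text as the mean of the vectors of its known words.
    words = [w for w in text.lower().split() if w in vectors]
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

glove = load_glove("glove.6B.100d.txt")  # assumed pre-downloaded GloVe file
original = "researchers adapt texts to different audiences"
adapted = "the authors rewrite articles for new readers"
print(cosine_similarity(text_vector(original, glove), text_vector(adapted, glove)))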
While our method successfully adapts the key text evaluation metrics, the results are not perfect when it comes to content. The method has drawbacks, such as generating a lot of additional content, which often results in a disconnected text. Additionally, synonym replacement and paraphrase generation can replace an original word or sentence incorrectly: if there exist synonyms or paraphrases that change the meaning but improve the text evaluation metrics for the particular target audience, they may still be chosen. Nevertheless, our methodology generated a few sequences that could be published for the target audiences without any changes, and many texts would require only minor corrections.

To conclude this section, we are satisfied with the benchmarking results our method produced in adapting the key text evaluation metrics. The methodology produces some interesting content and can thus serve as a baseline for further work on text adaptation to target audiences.

5 CONCLUSION
In this article we developed a methodology that adapts texts to context. The methodology focuses on three text evaluation metrics: the length, readability and polarity of the text. Our method iteratively adapts a text towards the initial metric values calculated for the targeted publication type by adjusting the key text evaluation metrics. We successfully adjusted the text evaluation metrics in nearly all transitions.

While we found text evaluation metrics that characterize different publication types, in some cases adjusting these measures is not enough. When generating longer sequences of additional text, we find that the generated content is not well connected: while we can find a chain of related topics across subsections, in some cases it is hard to identify a common thread that holds throughout the whole text. Additionally, if there exist synonyms and paraphrases that corrupt the content but improve the relative differences to the targeted values of the key text evaluation metrics, the methodology will replace existing words and sentences with senseless content. Despite these drawbacks, we generated many results that reflect the targeted publication types and even more that would require only minor changes to be completely acceptable. We conclude with satisfactory results regarding both the content of the generated texts and their values of the key text evaluation metrics.

Our ideas for further work include improving the natural language generation model: the pre-trained model we used should be trained on longer texts, so that we can generate text from longer prompts and thus maintain a common thread throughout the whole text. Determining whether synonyms or paraphrases corrupt the message of the text is also very important; word embeddings can represent the context of the text, and we could use them to determine whether a synonym fits the current context. Another way to adapt text to context would be to create a dataset in which each row holds different versions of the same text, each written for a different target audience. This way we could teach text-to-text models to adapt text to context, and the methodology could also exploit patterns that might not be obvious to the human eye.

REFERENCES
[1] Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth Trippe, Juan Gutierrez, and Krys Kochut. 2017. Text summarization techniques: a brief survey. International Journal of Advanced Computer Science and Applications (IJACSA), 8, (July 2017), 397–405. doi: 10.14569/IJACSA.2017.081052.
[2] John Carroll, Guido Minnen, Yvonne Canning, Siobhan Devlin, and John Tait. 1998. Practical simplification of English newspaper text to assist aphasic readers. In Proc. of the AAAI-98 Workshop on Integrating Artificial Intelligence and Assistive Technology, (July 1998), 7–10.
[3] William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 9–16. https://www.aclweb.org/anthology/I05-5002.
[4] Ronen Feldman. 2013. Techniques and applications for sentiment analysis. Commun. ACM, 56, (April 2013), 82–89. doi: 10.1145/2436256.2436274.
[5] Rudolf Flesch. 1979. How to Write Plain English: A Book for Lawyers and Consumers. Harper & Row.
[6] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010). Coling 2010 Organizing Committee, Beijing, China, (August 2010), 340–348. https://www.aclweb.org/anthology/C10-1039.
[7] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2018. Language models are unsupervised multitask learners. https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
[8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 140, 1–67. http://jmlr.org/papers/v21/20-074.html.
[9] Goutham Ramsri. 2020. Paraphrase any question with T5 (Text-To-Text Transfer Transformer). Towards Data Science. [Accessed: 17. 8. 2020]. https://towardsdatascience.com/paraphrase-any-question-with-t5-text-to-text-transfer-transformer-pretrained-model-and-cbb9e35f1555.