Zbornik 21. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2018, Zvezek G
Proceedings of the 21st International Multiconference INFORMATION SOCIETY – IS 2018, Volume G

Sodelovanje, programska oprema in storitve v informacijski družbi
Collaboration, Software and Services in Information Society

Uredil / Edited by Marjan Heričko
http://is.ijs.si
8.–12. oktober 2018 / 8–12 October 2018, Ljubljana, Slovenia

Urednik: Marjan Heričko, University of Maribor, Faculty of Electrical Engineering and Computer Science
Založnik: Institut »Jožef Stefan«, Ljubljana
Priprava zbornika: Mitja Lasič, Vesna Lasič, Lana Zemljak
Oblikovanje naslovnice: Vesna Lasič
Dostop do e-publikacije: http://library.ijs.si/Stacks/Proceedings/InformationSociety

Ljubljana, oktober 2018
Informacijska družba, ISSN 2630-371X
Kataložni zapis o publikaciji (CIP) pripravili v Narodni in univerzitetni knjižnici v Ljubljani
COBISS.SI-ID=21853462
ISBN 978-961-264-141-2 (pdf)

PREDGOVOR MULTIKONFERENCI INFORMACIJSKA DRUŽBA 2018

Multikonferenca Informacijska družba (http://is.ijs.si) je z enaindvajseto zaporedno prireditvijo osrednji srednjeevropski dogodek na področju informacijske družbe, računalništva in informatike. Letošnja prireditev se ponovno odvija na več lokacijah, osrednji dogodki pa so na Institutu »Jožef Stefan«.

Informacijska družba, znanje in umetna inteligenca so še naprej nosilni koncepti človeške civilizacije. Se bo neverjetna rast nadaljevala in nas ponesla v novo civilizacijsko obdobje ali pa se bo rast upočasnila in začela stagnirati? Bosta IKT in zlasti umetna inteligenca omogočila nadaljnji razcvet civilizacije ali pa bodo demografske, družbene, medčloveške in okoljske težave povzročile zadušitev rasti? Čedalje več pokazateljev kaže v oba ekstrema – da prehajamo v naslednje civilizacijsko obdobje, hkrati pa so notranji in zunanji konflikti sodobne družbe čedalje težje obvladljivi.

Letos smo v multikonferenco povezali 11 odličnih neodvisnih konferenc. Predstavljenih bo 215 predstavitev, povzetkov in referatov v okviru samostojnih konferenc in delavnic. Prireditev bodo spremljale okrogle mize in razprave ter posebni dogodki, kot je svečana podelitev nagrad. Izbrani prispevki bodo izšli tudi v posebni številki revije Informatica, ki se ponaša z 42-letno tradicijo odlične znanstvene revije.

Multikonferenco Informacijska družba 2018 sestavljajo naslednje samostojne konference:
• Slovenska konferenca o umetni inteligenci
• Kognitivna znanost
• Odkrivanje znanja in podatkovna skladišča – SiKDD
• Mednarodna konferenca o visokozmogljivi optimizaciji v industriji, HPOI
• Delavnica AS-IT-IC
• Soočanje z demografskimi izzivi
• Sodelovanje, programska oprema in storitve v informacijski družbi
• Delavnica za elektronsko in mobilno zdravje ter pametna mesta
• Vzgoja in izobraževanje v informacijski družbi
• 5.
študentska računalniška konferenca
• Mednarodna konferenca o prenosu tehnologij (ITTC)

Soorganizatorji in podporniki konference so različne raziskovalne institucije in združenja, med njimi tudi ACM Slovenija, Slovensko društvo za umetno inteligenco (SLAIS), Slovensko društvo za kognitivne znanosti (DKZ) in druga slovenska nacionalna akademija, Inženirska akademija Slovenije (IAS). V imenu organizatorjev konference se zahvaljujemo združenjem in institucijam, še posebej pa udeležencem za njihove dragocene prispevke in priložnost, da z nami delijo svoje izkušnje o informacijski družbi. Zahvaljujemo se tudi recenzentom za njihovo pomoč pri recenziranju.

V letu 2018 bomo šestič podelili nagrado za življenjske dosežke v čast Donalda Michieja in Alana Turinga. Nagrado Michie-Turing za izjemen življenjski prispevek k razvoju in promociji informacijske družbe bo prejel prof. dr. Saša Divjak. Priznanje za dosežek leta bo pripadlo doc. dr. Marinki Žitnik. Že sedmič podeljujemo nagradi »informacijska limona« in »informacijska jagoda« za najbolj (ne)uspešne poteze v zvezi z informacijsko družbo. Limono letos prejme padanje državnih sredstev za raziskovalno dejavnost, jagodo pa Yaskawina tovarna robotov v Kočevju. Čestitke nagrajencem!

Mojca Ciglarič, predsednik programskega odbora
Matjaž Gams, predsednik organizacijskega odbora

FOREWORD – INFORMATION SOCIETY 2018

In its 21st year, the Information Society Multiconference (http://is.ijs.si) remains one of the leading conferences in Central Europe devoted to information society, computer science and informatics. In 2018, it is organized at various locations, with the main events taking place at the Jožef Stefan Institute.

Information society, knowledge and artificial intelligence continue to represent the central pillars of human civilization. Will the pace of progress of information society, knowledge and artificial intelligence continue, thus enabling unprecedented progress of human civilization, or will the progress stall and even stagnate? Will ICT and AI continue to foster human progress, or will the growth of human, demographic, social and environmental problems stall global progress? Both extremes seem to be playing out to a certain degree – we seem to be transitioning into the next civilization period, while the internal and external conflicts of contemporary society seem to be on the rise.

The Multiconference runs in parallel sessions with 215 presentations of scientific papers at eleven conferences, many round tables, workshops and award ceremonies. Selected papers will be published in the Informatica journal, which boasts a 42-year tradition of excellent research publishing.

The Information Society 2018 Multiconference consists of the following conferences:
• Slovenian Conference on Artificial Intelligence
• Cognitive Science
• Data Mining and Data Warehouses – SiKDD
• International Conference on High-Performance Optimization in Industry, HPOI
• AS-IT-IC Workshop
• Facing Demographic Challenges
• Collaboration, Software and Services in Information Society
• Workshop Electronic and Mobile Health and Smart Cities
• Education in Information Society
• 5th Student Computer Science Research Conference
• International Technology Transfer Conference (ITTC)

The Multiconference is co-organized and supported by several major research institutions and societies, among them ACM Slovenia, i.e.
the Slovenian chapter of the ACM, the Slovenian Artificial Intelligence Society (SLAIS), the Slovenian Society for Cognitive Sciences (DKZ) and the second national engineering academy, the Slovenian Engineering Academy (IAS). On behalf of the conference organizers, we thank all the societies and institutions, and particularly all the participants, for their valuable contribution and their interest in this event, and the reviewers for their thorough reviews.

For the sixth year, the award for life-long outstanding contributions will be presented in memory of Donald Michie and Alan Turing. The Michie-Turing award will be given to Prof. Saša Divjak for his life-long outstanding contribution to the development and promotion of information society in our country. In addition, an award for current achievements will be given to Assist. Prof. Marinka Žitnik. The information lemon goes to decreased national funding of research. The information strawberry is awarded to the Yaskawa robot factory in Kočevje. Congratulations!

Mojca Ciglarič, Programme Committee Chair
Matjaž Gams, Organizing Committee Chair

KONFERENČNI ODBORI / CONFERENCE COMMITTEES

International Programme Committee: Vladimir Bajic, South Africa; Heiner Benking, Germany; Se Woo Cheon, South Korea; Howie Firth, UK; Olga Fomichova, Russia; Vladimir Fomichov, Russia; Vesna Hljuz Dobric, Croatia; Alfred Inselberg, Israel; Jay Liebowitz, USA; Huan Liu, Singapore; Henz Martin, Germany; Marcin Paprzycki, USA; Karl Pribram, USA; Claude Sammut, Australia; Jiri Wiedermann, Czech Republic; Xindong Wu, USA; Yiming Ye, USA; Ning Zhong, USA; Wray Buntine, Australia; Bezalel Gavish, USA; Gal A. Kaminka, Israel; Mike Bain, Australia; Michela Milano, Italy; Derong Liu, USA; Toby Walsh, Australia.

Organizing Committee: Matjaž Gams, chair; Mitja Luštrek; Lana Zemljak; Vesna Koricki; Mitja Lasič; Blaž Mahnič; Jani Bizjak; Tine Kolenik.

Programme Committee: Franc Solina, co-chair; Viljan Mahnič, co-chair; Cene Bavec, co-chair; Tomaž Kalin, co-chair; Jozsef Györkös, co-chair; Tadej Bajd; Jaroslav Berce; Mojca Bernik; Marko Bohanec; Ivan Bratko; Andrej Brodnik; Dušan Caf; Saša Divjak; Tomaž Erjavec; Bogdan Filipič; Andrej Gams; Matjaž Gams; Marko Grobelnik; Nikola Guid; Marjan Heričko; Borka Jerman Blažič Džonova; Gorazd Kandus; Urban Kordeš; Marjan Krisper; Andrej Kuščer; Jadran Lenarčič; Borut Likar; Mitja Luštrek; Janez Malačič; Olga Markič; Dunja Mladenič; Franc Novak; Vladislav Rajkovič; Grega Repovš; Ivan Rozman; Niko Schlamberger; Stanko Strmčnik; Jurij Šilc; Jurij Tasič; Denis Trček; Andrej Ule; Tanja Urbančič; Boštjan Vilfan; Baldomir Zajc; Blaž Zupan; Boris Žemva; Leon Žlajpah.

KAZALO / TABLE OF CONTENTS

Sodelovanje, programska oprema in storitve v informacijski družbi / Collaboration, Software and Services in Information Society ..... 1
PREDGOVOR / FOREWORD ..... 3
PROGRAMSKI ODBORI / PROGRAMME COMMITTEES ..... 5
Self-Assessment Tool for Evaluating Sustainability of ICT in SMEs / Soini Jari, Leppäniemi Jari, Sillberg Pekka .....
7
Reference Standard Process Model for Farming to Support the Development of Applications for Farming / Rupnik Rok ..... 11
Semiotics of Graphical Signs in BPMN / Kuhar Saša, Polančič Gregor ..... 15
Knowledge Perception Influenced by Notation Used for Conceptual Database Design / Kamišalić Aida, Turkanović Muhamed, Heričko Marjan, Welzer Tatjana ..... 19
The Use of Standard Questionnaires for Evaluating the Usability of Gamification / Rajšp Alen, Kous Katja, Beranič Tina ..... 23
Analyzing Short Text Jokes from Online Sources with Machine Learning Approaches / Šimenko Samo, Podgorelec Vili, Karakatič Sašo ..... 27
A Data Science Approach to the Analysis of Food Recipes / Heričko Tjaša, Karakatič Sašo, Podgorelec Vili ..... 31
Introducing Blockchain Technology into a Real-Life Insurance Use Case / Vodeb Aljaž, Tišler Aljaž, Chuchurski Martin, Orgulan Mojca, Rola Tadej, Unger Tea, Žnidar Žan, Turkanović Muhamed ..... 35
A Brief Overview of Proposed Solutions to Achieve Ethereum Scalability / Podgorelec Blaž, Rek Patrik, Rola Tadej ..... 39
Integration Heaven of Nanoservices / Révész Ádám, Pataki Norbert ..... 43
Service Monitoring Agents for DevOps Dashboard Tool / Török Márk, Pataki Norbert ..... 47
Incremental Parsing of Large Legacy C/C++ Software / Fekete Anett, Cserép Máté ..... 51
Visualising Compiler-Generated Special Member Functions of C++ Types / Szalay Richárd, Porkoláb Zoltán ..... 55
How Does an Integration with VCS Affect SSQSA? / Popović Bojan, Rakić Gordana ..... 59
Indeks avtorjev / Author index ..... 63

Zbornik 21. mednarodne multikonference INFORMACIJSKA DRUŽBA – IS 2018, Zvezek G
Proceedings of the 21st International Multiconference INFORMATION SOCIETY – IS 2018, Volume G
Sodelovanje, programska oprema in storitve v informacijski družbi
Collaboration, Software and Services in Information Society
Uredil / Edited by Marjan Heričko
http://is.ijs.si
9. oktober 2018 / 9 October 2018, Ljubljana, Slovenia

PREFACE

This year, the Conference "Collaboration, Software and Services in Information Society" is being organised for the eighteenth time as a part of the "Information Society" multi-conference. As in previous years, the papers in this year's proceedings address current challenges and best practices related to the development of advanced software and information solutions, as well as collaboration in general.
Information technologies and the field of informatics have been the driving force of innovation in business, as well as in the everyday activities of individuals, for several decades. Blockchain technology, Big Data, intelligent solutions, reference models, open standards, interoperability and the increasing responsiveness of IS/IT experts are leading the way to the development of intelligent digital service platforms, innovative business models and new ecosystems, where not only partners, but also competitors, are connecting and working together. On the other hand, quality assurance remains a vital part of software and ICT-based service development and deployment. The papers in these proceedings provide a better insight into, and/or propose solutions to, challenges related to:
- Self-assessment of sustainability of ICT in SMEs;
- Ontology-based knowledge sharing on BPMN graphical signs using semiotics;
- Influence of notations used for conceptual design on knowledge perception;
- Application of machine learning techniques to obtain new knowledge;
- Establishment of domain-specific reference models;
- Introduction of Blockchain technology into real-life use cases;
- Architectural design proposals for ensuring the scalability of Blockchain platforms;
- Application of usability questionnaires when evaluating gamification and serious games;
- Visualization, analysis and comprehension of complex software systems;
- Continuous software development, integration and delivery;
- Integration of source code repositories and QA tools.
We hope that these proceedings will serve as a valuable reference and that the information in this volume will be useful for further advancements in both research and industry.

Prof. Dr. Marjan Heričko
CSS 2018 – Collaboration, Software and Services in Information Society, Conference Chair

PREDGOVOR

Konferenco "Sodelovanje, programska oprema in storitve v informacijski družbi" organiziramo v sklopu multikonference Informacijska družba že osemnajstič. Kot običajno tudi letošnji prispevki naslavljajo aktualne teme in izzive, povezane z razvojem sodobnih programskih in informacijskih rešitev ter storitev, kot tudi sodelovanja v splošnem.

Informatika in informacijske tehnologije so že več desetletij gonilo inoviranja na vseh področjih poslovanja podjetij ter delovanja posameznikov. Tehnologija veriženja blokov, velepodatki, inteligentne storitve, referenčni modeli, odprti standardi in interoperabilnost ter vedno višja odzivnost informatikov vodijo k razvoju inteligentnih digitalnih storitvenih platform in inovativnih poslovnih modelov ter novih ekosistemov, kjer se povezujejo in sodelujejo ne le partnerji, temveč tudi konkurenti. Napredne informacijske tehnologije in sodobni pristopi k razvoju, vpeljavi in upravljanju omogočajo višjo stopnjo avtomatizacije in integracije doslej ločenih svetov, saj vzpostavljajo zaključeno zanko in zagotavljajo nenehne izboljšave, ki temeljijo na aktivnem sodelovanju in povratnih informacijah vseh vključenih akterjev. Ob vsem tem zagotavljanje kakovosti ostaja eden pomembnejših vidikov razvoja in vpeljave na informacijskih tehnologijah temelječih storitev.
Prispevki, zbrani v tem zborniku, omogočajo vpogled v izzive in rešitve zanje na področjih, kot so npr.:
- samoocenitev kakovosti in zrelosti IKT podpore v malih in srednje velikih podjetjih;
- deljenje znanja o grafičnih simbolih BPMN z uporabo semiotike;
- vpliv notacije, uporabljene pri oblikovanju konceptualnih modelov, na dojeti nivo pridobljenega znanja;
- uporaba tehnik strojnega učenja za ekstrakcijo znanja;
- vzpostavitev domenskih referenčnih modelov;
- vpeljava tehnologije veriženja blokov v realne primere uporabe;
- arhitekturni predlogi za rešitev razširljivosti platform tehnologije veriženja blokov;
- uporaba standardnih vprašalnikov uporabnosti pri vrednotenju učinkov vpeljave igrifikacije in resnih iger;
- vizualizacija, analiza in razumevanje kompleksnih programskih sistemov;
- neprekinjen razvoj, integracija in dobava informacijskih rešitev;
- integracija repozitorijev izvorne kode z orodji za zagotavljanje kakovosti.
Upamo, da boste v zborniku prispevkov, ki povezujejo teoretična in praktična znanja, tudi letos našli koristne informacije za svoje nadaljnje delo tako pri temeljnem kot aplikativnem raziskovanju.

prof. dr. Marjan Heričko
predsednik konference CSS 2018 – Collaboration, Software and Services in Information Society

PROGRAMSKI ODBOR / PROGRAMME COMMITTEE

Dr. Marjan Heričko, University of Maribor, Faculty of Electrical Engineering and Computer Science
Dr. Gabriele Gianini, University of Milano, Faculty of Mathematical, Physical and Natural Sciences
Dr. Hannu Jaakkola, Tampere University of Technology, Information Technology (Pori)
Dr. Mirjana Ivanović, University of Novi Sad, Faculty of Science, Department of Mathematics and Informatics
Dr. Zoltán Porkoláb, Eötvös Loránd University, Faculty of Informatics
Dr. Stephan Schlögl, MCI Management Center Innsbruck, Department of Management, Communication & IT
Dr. Zlatko Stapić, University of Zagreb, Faculty of Organization and Informatics
Dr. Vili Podgorelec, University of Maribor, Faculty of Electrical Engineering and Computer Science
Dr. Maja Pušnik, University of Maribor, Faculty of Electrical Engineering and Computer Science
Dr. Muhamed Turkanović, University of Maribor, Faculty of Electrical Engineering and Computer Science
Dr. Boštjan Šumak, University of Maribor, Faculty of Electrical Engineering and Computer Science
Dr. Aida Kamišalić Latifić, University of Maribor, Faculty of Electrical Engineering and Computer Science
Dr. Gregor Polančič, University of Maribor, Faculty of Electrical Engineering and Computer Science
Dr. Luka Pavlič, University of Maribor, Faculty of Electrical Engineering and Computer Science

Self-Assessment Tool for Evaluating Sustainability of ICT in SMEs
Jari Soini, Jari Leppäniemi, Pekka Sillberg
Tampere University of Technology, P.O. Box 300, FI-28101 Pori, Finland
jari.o.soini@tut.fi, jari.leppaniemi@tut.fi, pekka.sillberg@tut.fi

ABSTRACT
The ever-increasing demand for ICT may compromise global objectives for emissions reduction if the aggregate effects of ICT sustainability are not considered in the business digitalization processes. In this paper, we present a free self-assessment tool enabling small and medium-sized companies to evaluate the utilized ICT in terms of sustainability. ICT4S is a free e-service, in effect a web-based self-assessment tool, that was developed in co-operation with the Swiss Green IT SIG. The assessment is currently divided into five categories of sustainability questions. The categories are strategy, procurement and recycling, practices, servers and network, and Green ICT. As the result, organizations will gain a general understanding of their state of sustainability, and practical suggestions for greater eco-friendliness and sustainability of their ICT operations.

Categories and Subject Descriptors
• Social and professional topics~Sustainability • Information systems~Web applications

General Terms
Measurement, Performance, Human Factors.

Keywords
Sustainability, Assessment, ICT, Metrics, Web tools, E-services.

1. INTRODUCTION
The study presented in this paper aims at contributing to the business activity digitalization of companies concerning the reduction of carbon footprint and the improvement of sustainability. The paper introduces a self-assessment tool, developed in the research project, that allows companies to self-evaluate the sustainability of the ICT exploited in the organization. The objective is to provide companies with concrete tools and proposals for actions enabling more ecological procedures in the organization. Additionally, the knowledge gained by using the self-assessment tool allows companies to become generally more aware of the distribution of energy consumption in a modern ICT infrastructure, as well as the factors affecting the sustainability of ICT.

2. BACKGROUND
There is a lot of evidence for significant benefits in terms of productivity and cost savings through the exploitation of ICT in the daily business activity of organizations. However, the increasingly dependent use of ICT also brings about "invisible" effects (e.g., electricity used by database servers, cloud servers, and network routers) that may not be consciously recognized [1, 2, 3, 4]. Typically, users are concerned only with the electricity consumption of their own devices. The increasing demand for ICT may, in fact, compromise the national objectives for emissions reduction if the aggregate effects of ICT un-sustainability (Figure 1) are not considered in the business digitalization processes.

Figure 1. Environmental impacts of the ICT. [5]

In 2017, it was estimated that ICT accounted for 12% of the overall electricity consumption around the globe, and the percentage is expected to increase twice as rapidly in the future (by approximately 7% per year). Most of the energy is consumed by networks, server rooms, and computing centers (Figure 2), the efficiency of which should urgently be improved.

Figure 2. Electricity consumption in the ICT sector. [6]
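To make the cited trend concrete, a back-of-the-envelope projection can be derived from the two figures given above (a 12% share in 2017 and roughly 7% annual growth). The sketch below is illustrative only and not part of the original study; it simply compounds the stated growth rate.

```python
# Illustrative projection of the ICT share of global electricity use,
# assuming the figures cited above: a 12% share in 2017 growing by
# roughly 7% per year (compound growth), all else held equal.
share = 0.12
for year in range(2017, 2026):
    print(f"{year}: ICT share of electricity use ~ {share:.1%}")
    share *= 1.07  # ~7% annual growth
```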
As most of the electricity is still being generated by using fossil fuels (Figure 3), the current ICT, with its heavy usage of electrical energy, constitutes a global issue that is, unfortunately, little known outside the expert field [7, 8]. This is partly because users do not perceive the energy consumption of data systems operating invisibly or in the background, but rather only notice the consumption of the terminal device, which, in reality, comprises a fraction of the overall energy consumption (Figure 2).

Figure 3. Electricity generation by source of energy. [9]

The problem of energy consumption due to the constantly increasing utilization of ICT is expected to further worsen (Figures 4a and 4b) through the amount of IoT devices and automatic steering systems [10]. If the majority of the predicted IoT devices and the information systems supporting them are implemented by the current practices, a near-catastrophic peak demand in terms of electricity will ensue. This, in turn, will result in an increase in emissions rather than their reduction.

Figures 4a and 4b. Estimated growth and impact of IoT devices: (a) estimated growth; (b) estimated standby energy consumption. [10]

Therefore, it is essential to establish instructions and an assessment procedure to support system planning to improve the sustainability of ICT and, thus, to promote methods for a low-carbon economy.

In 2015 through 2017, the TUT Pori Department implemented a research project (AjaTar) with the aim of improving the digitalization of organizations and companies while promoting a low-carbon economy and sustainability. As part of the project, a technology enabling organizations to self-evaluate their ICT sustainability was developed, tested, and studied, aiming at increasing general awareness of the distribution of electricity consumption in a modern IT infrastructure, in order for the organizations to be able to make ICT-related decisions more consciously than before. The most notable added value of the project comprises an increase in knowhow and knowledge promoting easy and lightweight assessment of sustainability in terms of the organization's business activities and support processes, as well as a freely available tool for evaluating the sustainability of the ICT used in the organization. By making the sustainability issues visible, the objective was to change attitudes and conventions related to the utilization of ICT in organizations: indeed, during the project, several organizations distinctly declared their need to recognize practices promoting sustainable development, as well as to invest in an eco-friendly image.

3. ICT4S SELF-ASSESSMENT TOOL
During the last six years, the SEIntS research group from the TUT Pori Department has studied, developed, and piloted innovative ICT solutions in cooperation with local organizations. Additionally, SEIntS has collaborated with, for example, Keio University in Japan, as well as with various information society associations, for example, in Switzerland regarding Green IT and the assessment of datacenters.

As a result of the AjaTar project, an open self-assessment website for organizations to quickly and easily evaluate the ecological aspects of their ICT-related operations was published at the end of 2017. The self-assessment tool, developed in collaboration with Green IT SIG, a Swiss Green IT information special interest group, is based on the assumption that most of the ICT equipment used in an organization is controllable, enabling the relatively easy adjustment of various functions. With the assessment tool developed in the project, it is possible to increase knowledge about the ecological aspects related to the use of ICT in organizations and, thus, affect their operations and practices. Based on the self-assessment, the organization is offered an overall evaluation of the current state and propositions for practices for more sustainable ICT operations.

The self-assessment tool is freely available on a dedicated website for sustainable ICT [11]. On the landing page of the tool (Figure 5) there is a welcoming message that explains the goals of the assessment. There is also information on the privacy solution that is used to keep all the information about the assessor's company private; the privacy solution is based on the HTML5 local storage concept, so the answers remain in the assessor's own browser. The assessment menu is currently divided into five categories of sustainability questions, plus the information of the organization to be evaluated. The categories are: strategy, procurement and recycling, practices, servers and network, and Green ICT.

Figure 5. Welcoming the assessors.

Each of the categories comprises several questions and additional text that explains the current issue to the assessor. While answering the questions, the assessor also receives background information on the current topic. In Figures 6 and 7, the assessor is facing questions concerning the strategy and the practices at the office.

Figure 6. Assessing the strategy.
Figure 7. Assessing the practices at the office category.

After assessing all categories, the assessment tool calculates and shows an evaluation of the given answers. The results are first shown in a short form, as in Figure 8, but users can explore the results more carefully by selecting "Display detailed evaluation." The percentage and the color of the beams give a fast indication of the maturity of the different categories. In the case of 100% and a green beam, the user can be satisfied with the sustainability state of their company in that certain category. In the case of low percentages (0-70%) or yellow or even red beams, the evaluation shows that there is room for improvement. In such a case, the user may find the detailed evaluation useful when planning concrete actions for these improvements.

Figure 8. Brief results of the assessment.
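The paper does not publish the tool's scoring algorithm, so the following is only a minimal sketch of the kind of per-category aggregation and color banding described above. The equal question weights, band thresholds and function names are our assumptions for illustration, not the ICT4S implementation.

```python
# Minimal sketch of per-category scoring with color bands, assuming
# equally weighted yes/no answers and the 0-70% "needs improvement"
# band mentioned above; the real ICT4S tool may aggregate differently.
def category_score(answers: list[bool]) -> float:
    """Return the share of sustainable answers in a category (0.0-1.0)."""
    return sum(answers) / len(answers) if answers else 0.0

def color_band(score: float) -> str:
    """Map a category score to a traffic-light style beam color."""
    if score >= 0.9:
        return "green"
    if score > 0.7:
        return "yellow"
    return "red"

categories = {
    "strategy": [True, False, True],
    "procurement and recycling": [True, True, True],
    "practices": [False, False, True],
    "servers and network": [True, True, False, True],
    "Green ICT": [True, False],
}
for name, answers in categories.items():
    s = category_score(answers)
    print(f"{name}: {s:.0%} ({color_band(s)})")
```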
The detailed evaluation can be shown by selecting the corresponding option in the user interface (see Figure 9). The user is also able to print the results – hopefully in a sustainable way, for example using an e-format such as the Portable Document Format (PDF).

Figure 9. Detailed results of the assessment.

The assessment tool has now been in use for several months. Unfortunately, we do not have exact statistics concerning the usage of the tool. However, we piloted the tool with the assistance of local companies before launching it last December. Since the piloting groups were satisfied with the tool, and because we wanted to keep our promises regarding the privacy of the assessments, we did not implement any logging system in it.

We have planned to enhance the tool with a new capability, aiming to enable an easy way to estimate the carbon footprint of the ICT usage in a company. It will not be a fully scientific life cycle assessment (LCA), but a practical version of such, targeted at non-professionals in the field of sustainability. The reasoning for this new capability is that we anticipate that by introducing easy assessment tools we will be able to raise the awareness of companies in terms of sustainability issues, and thus help them to develop their business processes toward a sustainable state.
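The planned footprint capability is only announced above, not specified. As a hedged illustration, a practical (non-LCA) estimator of the kind described typically multiplies device counts by assumed energy and emission factors; every number and name below is a hypothetical placeholder, not a value from the paper.

```python
# Hypothetical sketch of a practical (non-LCA) ICT carbon estimator:
# devices x annual energy use x grid emission factor. All numbers are
# illustrative placeholders, not figures from the ICT4S tool.
ANNUAL_KWH = {"laptop": 50, "desktop": 200, "server": 1500}  # assumed
GRID_KG_CO2_PER_KWH = 0.4  # assumed average grid emission factor

def ict_footprint_kg(inventory: dict[str, int]) -> float:
    """Estimate yearly CO2 (kg) for a device inventory."""
    return sum(count * ANNUAL_KWH[kind] * GRID_KG_CO2_PER_KWH
               for kind, count in inventory.items())

print(ict_footprint_kg({"laptop": 20, "desktop": 5, "server": 2}))  # ~2000 kg
```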
4. RESULTS AND FUTURE WORK
This paper presented the ICT4S self-assessment tool, enabling companies and other organizations to evaluate the utilized ICT in terms of a low-carbon economy and sustainability, and thus improve their image as well as their resource efficiency. As the result, organizations will gain a general understanding of the current sustainability state of their ICT and practical suggestions for more eco-friendly and sustainable operations.

The role of the TUT Pori Unit was to function as a producer and facilitator of new knowledge. The applied project aimed at contributing to business development, with the TUT Pori Unit acting as a distributor of knowledge and knowhow as well as an innovator. Within the project, the accumulation of diverse energy-related knowhow and knowledge and the exploitation of sustainable ICT solutions in organizations were successfully implemented.

Further development is planned to be realized in the ICT4LC project, launched at the beginning of 2018. It focuses on examining contemporary information processing that is based on mobile and 'thin' clients, as well as the increasing utilization rate of information networks and cloud computing. The new project explores tools for assessing the energy efficiency of business activities and support processes, as well as planning procedures of business processes, promoting responsible and sustainable utilization of ICT in organizations.

5. ACKNOWLEDGMENTS
Our thanks to Niklaus Meyer and Beat Koch from the Swiss Green IT SIG for collaboration.

6. REFERENCES
[1] Hilty, L., Arnfalk, P., Erdmann, L., Goodman, J., Lehmann, M. and Wager, A.P. 2006. The relevance of information and communication technologies for environmental sustainability – A prospective simulation study. Environmental Modelling & Software, vol. 21, issue 11, 1618-1629.
[2] Hilty, L. 2008. Information technology and sustainability: Essays on the relationship between ICT and sustainable development. Books on Demand, Norderstedt.
[3] Amsel, N., Ibrahim, Z., Malik, A. and Tomlinson, B. 2011. Toward sustainable software engineering: NIER track. In 33rd International Conference on Software Engineering (ICSE), 21-28 May 2011, Honolulu, USA.
[4] Baliga, J., Hinton, K., Ayre, R. and Tucker, R.S. 2009. Carbon footprint of the internet. Telecommunications Journal of Australia, vol. 59, no. 1, 5.1-5.14.
[5] Hilty, L. and Aebischer, B. (eds.). 2015. ICT Innovations for Sustainability. Advances in Intelligent Systems and Computing 310, Springer International Publishing, Switzerland.
[6] Corcoran, A. and Andrae, A. 2013. Emerging Trends in Electricity Consumption for Consumer ICT. Retrieved August 22, 2018 from https://www.researchgate.net/profile/Anders_Andrae/publication/255923829_Emerging_Trends_in_Electricity_Consumption_for_Consumer_ICT/
[7] Pickavet, M., Vereecken, W., Demeyer, S., Audenaert, P., Vermeulen, B., Develder, C., Colle, D., Dhoedt, B. and Demeester, P. 2008. Worldwide energy needs for ICT: The rise of power-aware networking. In Proceedings of the 2nd International Symposium on Advanced Networks and Telecommunication Systems, 1-3.
[8] Lambert, S. and Van Heddeghem, W. 2012. Worldwide electricity consumption of communication networks. Optics Express, vol. 20, no. 26, 513-524.
[9] OECD Factbook 2014: Economic, Environmental and Social Statistics. Retrieved August 27, 2018 from http://dx.doi.org/10.1787/888933025499
[10] International Energy Agency. 2016. Energy Efficiency of the Internet of Things, Technology and Energy Assessment Report. Prepared for IEA 4E EDNA. Retrieved August 27, 2018 from https://www.iea-4e.org/document/384/energy-efficiency-of-the-internet-of-things-technology-and-energy-assessment-report
[11] Tampere University of Technology. 2017. ICT4S Self-Assessment. Retrieved August 27, 2018 from https://green-ict.fi/arviointi/?lang=en
Reference Standard Process Model for Farming to Support the Development of Applications for Farming
Rok Rupnik
Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
+386 1 479 8266, rok.rupnik@fri.uni-lj.si

ABSTRACT
The paper introduces the idea and the concepts of a Reference Standard Process Model for Farming (RSPMF), which are based on the concepts of COBIT, an IT governance framework used worldwide. Our research on RSPMF is focused in two directions. First, RSPMF is aimed at becoming a support for Product Managers in software companies developing software products or IoT systems. Namely, each process in RSPMF is described through the following components: process goals, process metrics, KPIs (Key Performance Indicators) and process activities. Second, RSPMF is aimed at helping managers or owners of bigger farms in farm management. The paper presents research in progress.

Categories and Subject Descriptors
D.2.2 [Requirements/Specifications]: Tools.

General Terms
Farming, Standardization, Process model.

Keywords
Standard Process Model, COBIT, Transformation of model.

1. INTRODUCTION
In recent years, farming has become an area with an extensive need for the use of information systems and IoT technologies [1]. The experience gained in an EU funded project has revealed that software companies have diverse and unequal knowledge and understanding of farming processes, activities within processes and metrics. This causes a problem when software products and IoT systems need to be integrated. There are many software products and IoT systems on the market today, but each of them covers a quite narrow functional area and, for that reason, integration is simply a necessity [2].

The Reference Standard Process Model is one way to help Product Managers at software companies in removing the gap of diverse and unequal knowledge and understanding of farming processes, activities within processes and metrics. The reference model can become a common denominator, a kind of Esperanto, as a knowledge base for the development of software products and IoT systems for farming. The reference model, on the other hand, will also help farm managers and owners in farm management.

We built and designed a Reference Standard Process Model for Farming (RSPMF) based on the idea and concepts of the COBIT framework, which is defined for the area of IT governance [3], [4]. This paper introduces the research in progress and the concepts we have managed to define so far: domains, processes and elements of process description. We also introduce the current list of processes and domains.

The structure of the paper is as follows. The second chapter introduces the EU funded project AgroIT, during which the idea for the Reference Standard Process Model arose. Only aspects of the project relevant to the content of this paper are introduced. The third chapter introduces key findings from the AgroIT project which led to the idea of RSPMF. To support the idea of RSPMF, the COBIT framework for IT governance is also introduced, since many concepts of RSPMF are taken from the COBIT framework. The fourth chapter introduces the RSPMF, its concepts, a draft list of domains and their processes, and the methodology to facilitate the sustainability of RSPMF. The last chapter contains the conclusion and directions for future work on the RSPMF.

2. EXPERIENCE GAINED IN THE AgroIT PROJECT
AgroIT was an EU funded project covering various previously mentioned aspects and problems in today's implementation of IT and IoT in farming [5], [6]. First, the project included the implementation of ERP systems for farming: a traditional ERP system for small and medium enterprises which, additionally, also has modules for livestock, fruit growing, winery, etc. [7]. This way, the area of farm management was covered, which was the subject of several papers in recent years [1], [2], [6], [7], [9], [10]. Second, the project included the implementation of a decision support system based on advanced methods to support decision processes in farming [8]. This way, the area of the use of decision support within farm management was covered [1], [6]. Third, the project included the implementation of IoT systems, where various sensors were used to collect data about several measurements [2], [11], [12]. Having (a lot of) data available is the basis for farm management and the operations of farms [13]. Fourth, the project also covered the implementation of a cloud integration platform. All applications and IoT systems were integrated through the cloud integration platform to facilitate data exchange between them [6], [12], [14].

Six software companies (they were called software partners during the project) cooperated in the AgroIT project with their software products: applications, IoT systems and the cloud integration platform. Each software company "contributed" their product to the project and, during the project, the software products were improved significantly, i.e. upgraded and extended. They were also improved implicitly through integrations between each other.

For the pilot use of the integrated software products and IoT systems, several pilot projects were organised in 5 EU countries by pilot partners. Pilot partners did not do software implementation in the project, but supported pilot farms in the use of software products. For that reason, the pilot partners were organisations with extensive knowledge in agriculture and experience in consulting for farming.

3. KNOWLEDGE OF FARMING FOR IMPLEMENTATION OF SOFTWARE PRODUCTS AND IoT SYSTEMS FOR FARMING
Improving software products and IoT systems was based on extending their existing functionalities and upgrading them with new ones. The key goal of the project was to design functionalities which are based on integration between software products and IoT systems. This means that a software product can also use data from another software product or IoT system.

During the analysis and design phase it became apparent that software partners have diverse and unequal knowledge and understanding of farming processes, activities within processes and metrics. The gap was even bigger when compared to the knowledge and understanding of the pilot partners. The diversity mentioned, and having the expertise of COBIT, has, step-by-step, led to the idea of transferring the idea of COBIT to be used for farming [3], [4].

3.1 COBIT framework for IT governance
COBIT has, in recent years, become a de-facto standard for IT governance in companies and organisations. COBIT defines a set of generic processes (IT processes) for the management of IT. For each IT process the following is defined: process inputs and outputs, goals of the process, key process activities, metrics of the process (performance measures), and levels of process maturity (maturity model) [3]. The development of COBIT has been progressing since 1996, from version 1 to the current version 5. COBIT is the result of several working groups of highly experienced experts and of coordinated work within ISACA, which is an international professional association focused on IT governance. COBIT is defined as a process model which divides IT into four domains: Plan and Organise, Acquire and Implement, Deliver and Support, and Monitor and Evaluate. The domains altogether have 34 defined IT processes.

The schema below shows the meta model of COBIT and all of its concepts. The schema reveals the business orientation of COBIT: the aim of defining the COBIT framework is to align IT and business, where business goals dictate IT goals [3], [4].

Figure 1. COBIT meta model. [3]

A detailed explanation of the schema, i.e. a detailed explanation of the concepts and the relations between them, is beyond the scope of this paper.

3.2 The idea of the Standard Process Model for Farming
The idea and concepts of the previously introduced COBIT framework, and the problems arising from the diversity of knowledge of the partners in the project, initiated the idea of a Standard Process Model for farming. COBIT is based on various concepts, and those concepts can be used and adapted in other areas as well, not only in IT governance. The idea and concepts of COBIT were already transferred and used in the governance of flood management [15] and nursing [16].

The transfer of the idea and concepts of a particular standard or framework to another area, in this case the transfer of COBIT to the area of farming, does not mean a one-to-one transfer. Some concepts of the source area (in this case, IT governance) might not be relevant or make any sense in the destination area (in this case, farming). For this reason, a successful and significant transfer with a useful outcome can only be achieved through:
- Good understanding of the idea and concepts of the framework of the source area (in this case, COBIT),
- Extensive knowledge of and experience in the destination area: processes and their activities, metrics, responsibilities, rules, etc.

4. REFERENCE STANDARD PROCESS MODEL FOR FARMING (RSPMF)
As can be concluded from the previous discussion, we designed RSPMF on the idea and concepts of COBIT 4.1 [3]. In the literature, we have so far not found any paper presenting a Standard Process Model for Farming.

4.1 The concepts of RSPMF
Processes are divided on three hierarchical levels, which are called domains: Govern and Monitor (GM), Plan and Manage (PM) and Implement and Execute (IE). Farming has several branches: livestock, fruit growing, agriculture, winery (viticulture), etc. RSPMF enables a modular definition of processes for every branch of farming. For the Govern and Monitor domain, only common processes are defined; for the other two domains, a process module is also added for every branch of farming. For now, only the process module for livestock is defined for the domains PM and IE. Each process is described through the following components: process goals, process metrics, KPIs (Key Performance Indicators) and process activities. Each process has a unique code, which reveals the domain to which the process belongs and the process module. The code for Common Processes is CP and the code for LiveStock is LS. The aim of defining RSPMF is not to prevail over any existing standard for farming. RSPMF is defined and structured to be open, and it enables references to any existing standard in the process description section.
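The coding scheme and process components just described map naturally onto a small data structure. The sketch below is our illustration, not part of RSPMF itself; all field and function names are assumed (see also the draft process list in Section 4.3).

```python
# Sketch of an RSPMF-style process descriptor, assuming the coding
# scheme described above (e.g. "PM.LS.03" = Plan and Manage domain,
# LiveStock module, process 03). Names and fields are illustrative.
from dataclasses import dataclass, field

@dataclass
class RspmfProcess:
    code: str                 # e.g. "PM.LS.03" or "GM.01" (no module)
    name: str
    goals: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)
    kpis: list[str] = field(default_factory=list)
    activities: list[str] = field(default_factory=list)

    @property
    def domain(self) -> str:          # GM, PM or IE
        return self.code.split(".")[0]

    @property
    def module(self) -> str | None:   # CP, LS, or None for GM codes
        parts = self.code.split(".")
        return parts[1] if len(parts) == 3 else None

p = RspmfProcess("PM.LS.03", "Manage animals' health and veterinary service")
print(p.domain, p.module)  # PM LS
```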
4.2 Target groups and aimed benefits of RSPMF When designing a Standard Process Model, regardless of the area it is intended for, the group designing it must first decide which are 12 the target groups who will use the model, and what should be the  GM.09: Implement and monitor implementation of benefits of its use. For target groups this should become a strategy Reference Standard Process Model. Plan and Manage (PM) – Common Processes (CP):  We designed RSPMF for the following groups: PM.CM.01: Manage implementation of strategy and investments  Product Managers in software companies which  PM.CM.02: Manage budget and cost develop software products and IoT systems for farming.  PM.CM.03: Manage financials As can be revealed from our discussion, we noticed the  PM.CM.04: Manage risks need for a Standard Process Model,   PM.CM.05: Manage human resources Managers and owners of bigger farms: COBIT is the  PM.CM.06: Manage buildings and security first place aimed at bigger companies. Each Standard  PM.CM.07: Manage products sales Process Model should, in our opinion, be sized for bigger  institutions (organisations in general). Smaller PM.CM.08: Manage suppliers  institutions then use it to the extent for which they PM.CM.09: Manage sub-contractors believe is suitable for them. We followed this approach  PM.CM.10: Manage certifications in the designing of the RSPMF.  PM.CM.11: Manage environment and protection  PM.CM.12: Manage energy consumption The aimed benefits for Product Managers are as follows:  PM.CM.13: Manage energy production  Based on experience from the AgroIT project, we can  PM.CM.14: Manage farming machinery state that there is a diversity of farming knowledge of  PM.CM.15: Manage equipment Product Managers in software companies. RSPMF will  PM.CM.16: Manage IT become a common denominator, a kind of Esperanto as  PM.CM.17: Manage information system a knowledge base for the development of software  PM.CM.18: Manage innovations products and IoT systems for farming,   PM.CM.19: Manage investment projects We expect the integrations between various software  PM.CM.20: Manage needs and expectations products and IoT systems to be more straightforward and “softer” if  PM.CM.21: Manage knowledge and legislation Product Managers will base  functionalities on RSPMF. PM.CM.22: Manage changes based on legislation demands We are designing RSPMF to reach several aimed benefits for  PM.CM.25: Manage changes based on IT and managers and owners of bigger farms: innovation  Knowledge and experience of farming experts and  PM.CM.26: Manage assets academics will, step by step, be transferred to RSPMF.  PM.CM.27: Manage technical capacity We could say that RSPMF introduces the best practices  PM.CM.28: Manage internal control for farming, Plan and Manage (PM) – LiveStock (LS):  RSPMF provides the best practice guidelines for  PM.LS.01: Manage animal sales processes and their activities on farms. This helps  PM.LS.02: Manage animal purchases managers ensure that the processes perform according  PM.LS.03: Manage animals` health and veterinary to best practice, service  Metrics and KPI’s are defined for processes. This helps  PM.LS.04: Manage animal welfare managers to set goals and execute monitoring. This  PM.LS.05: Manage hygiene lowers various risks,  PM.LS.06: Manage animal feeding and grazing  Managers can identify gaps in process execution and  PM.LS.07: Manage animal reproduction monitoring. 
4.4 Concepts of the methodology to facilitate the sustainability of RSPMF
COBIT was first issued in 1996, which means that it has gone through an evolution in which experts from all over the world participated. COBIT is now at version 5, but had several versions before that [3], [4]. To facilitate the sustainability of RSPMF, we plan a similar approach. We plan to issue the first version in a year or a year and a half. The first version will cover only livestock. We will form an international panel of experts of various profiles: consultants, academics, Product Managers, farmers and government officials.

5. CONCLUSION AND FUTURE WORK
We have introduced the research in progress on the idea and concepts of the Reference Standard Process Model for Farming. Our aim in designing the reference model is to improve the support for managers and owners of bigger farms in farm management. Another aim is to facilitate Product Managers in the development of software products and IoT systems. In the midterm, we also want RSPMF to be suitable for government and EU officials who are responsible for farming.

At the moment, we plan to add the concept of maturity levels of a process. The maturity level of a process will show or indicate the level of detail and expertise with which a farm executes a process. This way, the comparison of different farms will also be possible.

We are aware that there are two phases of defining RSPMF: first, to define its concepts and structure; second, to put content into the structure of the process descriptions. Those two phases overlap because, while inserting the content, some ideas to change the structure will surely appear. The definition of the concepts and the structure is our research mission for the next 12 months, as we plan it.

6. REFERENCES
[1] A. Kaloxylos et al., "A cloud-based Farm Management System: Architecture and implementation," Comput. Electron. Agric., vol. 96, pp. 75–89, 2013.
[2] S. Fountas et al., "Farm management information systems: Current situation and future perspectives," Comput. Electron. Agric., vol. 115, pp. 40–50, 2015.
[3] ISACA, COBIT 4.1. 2007.
[4] ISACA, COBIT 5: Enabling Processes. 2012.
[5] L. Ruiz-Garcia and L. Lunadei, "The role of RFID in agriculture: Applications, limitations and challenges," Comput. Electron. Agric., vol. 79, no. 1, pp. 42–50, Oct. 2011.
[6] A. Kaloxylos et al., "Farm management systems and the Future Internet era," Comput. Electron. Agric., vol. 89, pp. 130–144, Nov. 2012.
[7] C. N. Verdouw, R. M. Robbemond, and J. Wolfert, "ERP in agriculture: Lessons learned from Dutch horticulture," Comput. Electron. Agric., vol. 114, pp. 125–133, 2015.
[8] R. Rupnik, M. Kukar, P. Vračar, D. Košir, D. Pevec, and Z. Bosnić, "AgroDSS: A decision support system for agriculture and farming," Comput. Electron. Agric., 2018.
[9] R. Nikkilä, I. Seilonen, and K. Koskinen, "Software architecture for farm management information systems in precision agriculture," Comput. Electron. Agric., vol. 70, no. 2, pp. 328–336, Mar. 2010.
, “Conceptual model of a future farm Another aim is to facilitate Product Managers in development of management information system,” Comput. Electron. Agric. , software products and IoT systems. vol. 72, no. 1, pp. 37–47, Jun. 2010. In midterm, we also want RSPMF to be suitable for government [11] J. De Baerdemaeker, Precision Agriculture Technology and and EU officials who are responsible for farming. At the moment, Robotics for Good Agricultural Practices, vol. 46, no. 4. we plan to add the concept of maturity levels of a process. The IFAC, 2013. maturity level of a process will show or indicate the level of detail [12] J. Santa, M. A. Zamora-Izquierdo, A. J. Jara, and A. F. and expertise with which a farm executes a process. This way, the Gómez-Skarmeta, “Telematic platform for integral comparison of different farms will also be possible. management of agricultural/perishable goods in terrestrial We are aware that there are two phases of defining RSPMF: First, logistics,” Comput. Electron. Agric. , vol. 80, no. null, pp. 31– to define its concepts and structure; second, to put content in the 40, Jan. 2012. structure of processes` descriptions. Those two phases overlap, [13] J. W. Jones et al. , “Toward a new generation of agricultural because, while inserting the content, for sure some ideas to change system data, models, and knowledge products: State of structure will appear. The definition of concepts and the structure agricultural systems science,” Agric. Syst. , vol. 155, pp. 269– is our research mission for the next 12 months, that is how we plan 288, 2017. it. [14] J. W. Kruize, R. M. Robbemond, H. Scholten, J. Wolfert, and 6. REFERENCES a. J. M. Beulens, “Improving arable farm enterprise [1] A. Kaloxylos et al. , “A cloud-based Farm Management integration - Review of existing technologies and practices System: Architecture and implementation,” from a farmer’s perspective,” Comput. Comput. Electron. Agric. , vol. Electron. Agric. , vol. 100, pp. 168–179, Jan. 2014. 96, pp. 75–89, 2013. [2] S. Fountas et al. , “Farm management information systems: [15] M. Othman, M. Nazir Ahmad, A. Suliman, N. Habibah Current situation and future perspectives,” Arshad, and S. Maidin, “COBIT principles to govern flood Comput. management,” Electron. Agric. , vol. 115, pp. 40–50, 2015. Int. J. Disaster Risk Reduct. , vol. 9, 2014. [3] ISACA, COBIT 4.1. 2007. [16] M. Burnik, “The Approach for the Presentation of Nursing Processes,” University of Primorska, 2011. [4] ISACA, COBIT5: Enabling Processes. 2012. 14 Semiotics of graphical signs in BPMN Saša Kuhar Gregor Polančič Faculty of Electrical Engineering and Computer Science Faculty of Electrical Engineering and Computer Science University of Maribor University of Maribor Maribor, Slovenia Maribor, Slovenia sasa.kuhar@um.si gregor.polancic@um.si ABSTRACT RQ2: Can we categorize graphical signs from BPMN according The terminology of graphical signs (e.g. icons, symbols, to semiotic studies? pictograms, markers etc.) is ambiguous in academic articles. This We organized the remainder of the article as follows. The next is the same with articles focusing on graphics in business chapter presents the theoretical background. Chapters 3 and 4 notations, although concepts of graphical elements in notations represent the main objective of this paper – answering the are well defined. In semiotics, on the other hand, the concepts research questions. The conclusion is given in the last chapter. related to signs are defined in detail. 
2. BACKGROUND

2.1 Semiotics
Semiotics is the study of signs and symbols (not only visual ones) and their use or interpretation. For the purpose of the terminology definition, we will summarize Daniel Chandler's book Semiotics: The Basics [3], which offers a comprehensive explanation of the field, including many views of modern theoreticians. There are two main traditions in contemporary semiotics: from Ferdinand de Saussure and from Charles Sanders Peirce.

Saussure's model of signs consists of two parts: the signifier (the form that the sign takes) and the signified (the concept to which it refers). The sign is then the whole that results from the association of the signifier and the signified (Figure 1 on the left). For Saussure, both the signifier and the signified take non-material form rather than substance. Nowadays, the common adoption of his model takes a more materialistic form, where the signifier is commonly interpreted as the material that can be seen, heard, touched, smelled or tasted. Being concerned mostly with linguistics, Saussure stressed that the relationship between the signifier and the signified is relatively arbitrary: there is no inherent, essential, transparent, self-evident or natural connection between the signifier and the signified – between the sound of a word and the concept to which it refers [3].

Figure 1: Saussure's model of signs on the left and Peirce's model of signs on the right.

Peirce, on the other hand, introduced a three-part model consisting of: the representamen (the form which the sign takes, also called the "sign vehicle" or, in the Saussurean model, the signifier), the interpretant (the sense made of the sign, or the signified in Saussure's model), and the object (something beyond the sign to which it refers, also called the referent). In this model, the sign is the unity of what is represented (the object), how it is represented (the representamen) and how it is interpreted (the interpretant) (Figure 1 on the right). The term sign is often used loosely and confused with signifier or representamen. However, the signifier or representamen is the form in which the sign appears, whereas the sign is the whole meaningful unity [3].
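One way to keep these distinctions straight is to make the triad explicit in code. The sketch below is merely our illustration of Peirce's terminology; all names and the example values are chosen by us, not taken from the paper.

```python
# Illustration of Peirce's sign triad as a record: the sign is the
# unity of representamen (form), interpretant (sense) and object
# (referent). Field names follow the terminology above.
from dataclasses import dataclass

@dataclass(frozen=True)
class PeirceanSign:
    representamen: str  # the form the sign takes, e.g. an envelope icon
    interpretant: str   # the sense made of it, e.g. "a message"
    object: str         # the referent, e.g. a message event in a process

sign = PeirceanSign("envelope icon", "a message", "message event in BPMN")
print(sign.representamen, "->", sign.interpretant)
```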
In this model, the sign is the unity of what is represented (the object), how it is represented (the representamen) and how it is interpreted (the interpretant) (Figure 1, right). The term sign is often used loosely and confused with signifier or representamen. However, the signifier or representamen is the form in which the sign appears, whereas the sign is the whole meaningful unity [3].

2.1.1 Symbol, Index, Icon
In addition to his sign model, Peirce offered a classification of signs based on the relationship between the representamen and its object or its interpretant, or, in Saussure's terms, the relationship between signifier and signified. Depending on whether this relationship is arbitrary, directly connected, or resembling, three types of signs are possible: symbol, index, and icon, respectively.

SYMBOL represents a relationship where the signifier does not resemble the signified but is arbitrary or conventional. The relationship must be agreed upon and learned, as in language (letters, words, phrases, and sentences), numbers, Morse code, traffic lights or national flags.

INDEX denotes a relationship where the signifier is not arbitrary but is connected directly (physically or causally) to the signified, which can be observed or inferred. An index indicates something that is necessarily existent. Examples are natural signs (smoke, thunder, footprints), medical symptoms (pain, a rash, pulse rate), measuring instruments (thermometer, clock), signals (a knock on a door, a phone ringing), recordings (a photograph, a film, a video shot), and personal trademarks (handwriting, catchphrases).

ICON represents a relationship where the signifier is perceived as resembling or imitating the signified, being similar in possessing some of its qualities: a portrait, a cartoon, a scale model, onomatopoeia, metaphors, sound effects in radio drama, a dubbed film soundtrack, and imitative gestures [3].

2.1.2 Synonyms of terms
The terminology of semiotics is rarely used in popular language. The semiotic term sign is frequently replaced by the term symbol in popular usage [3]. Several meanings of the term icon can also be found in everyday language: a) to be iconic means that something or someone is recognized as famous; b) in computing, an icon is a small image intended to signify a particular function to the user (in semiotic terms, such signs may be iconic, symbolic or indexical); c) religious icons represent sacred, holy images [3]. Unless stated otherwise, we use the terms as defined in semiotics throughout this paper.
2.2 Ontologies
Ontologies are explicit formal specifications of the terms in a domain and the relationships among them [4]. They define a common vocabulary and can, among other things, be used by researchers who need to understand and share the structure of information in a domain [5]. For these reasons, we find them appropriate for terminology clarification in the domain of graphical signs in BPMN. Since our research purpose is mainly the definition of terms, our ontology will, according to Obrst [6], have weak to moderately strong semantics; it is not intended for machine processing or machine interpretation (at least not at this stage of our research).

3. LINGUISTIC TERMS IN THE BPMN SPECIFICATION
To answer the first research question (What are the linguistic terms used in the BPMN specification for graphical shapes, graphical icons, and other visual signs?), we examined the BPMN specification and mapped its terms to the terms of semiotics. In the BPMN specification, the signs are denominated as follows: the term BPMN element corresponds to the semiotic term signified, while the terms shape, object, marker, indicator, icon and depiction stand for signifier. The answer to RQ1, with the detailed meaning of each BPMN term, is provided in Table 1.

Table 1: Linguistic terms used in the BPMN specification

Semiotics' term | BPMN term | Detailed meaning in the BPMN specification
Signified | BPMN element | Concept in the business notation
Signifier | Shape | Graphical element
Signifier | Object | Basic shape (e.g. circle representing a simple event)
Signifier | Marker, Indicator or Icon | Graphical icon that can be included in an object (e.g. message icon)
Signifier | Depiction | Graphical example of the usage

As we can observe from the table, many linguistic terms are used for signifier, some of them inconsistently (e.g. marker, indicator, and icon). The only term from semiotics that appears in the BPMN specification is icon, which denotes a graphical icon and stands for the term signifier.

4. ONTOLOGY CONSTRUCTION
For the purpose of ontology construction and answering RQ2 (Can we categorize graphical signs from BPMN according to semiotic studies?), we followed the recommendations in Ontology Development 101: A Guide to Creating Your First Ontology [5]. The authors suggest seven steps for ontology creation: Step 1, determine the domain and scope of the ontology; Step 2, consider reusing existing ontologies; Step 3, enumerate important terms in the ontology; Step 4, define the classes and the class hierarchy; Step 5, define the properties of the classes; Step 6, define the facets of the slots; and Step 7, create instances. Steps 4 and 5 are closely intertwined and can be executed simultaneously.

4.1 Domain and scope of the BPMN Sign Ontology
For the domain definition, the authors of [5] propose answering several questions. Our answers follow each question below.

What is the domain that the BPMN Sign Ontology will cover? Signs in BPMN.

What are we going to use the ontology for? To share a common understanding of knowledge about signs among researchers, and to be able to reuse and analyze domain knowledge.

For what types of questions should the information in the ontology provide answers? Definitions of concepts in semiotics and the relationships among them, the categorization of BPMN graphical signs according to semiotic concepts, and the frequency of occurrence of sign types in BPMN.

Who will use and maintain the ontology? The ontology will be maintained and used by us and will be available to other interested researchers.

To determine the scope of the ontology, a list of competency questions that the ontology should be able to answer can be used [5]. The competency questions we defined are:

- What does the term icon mean?
- How do icons, indices, and symbols correlate?
- Which type of sign (icon, index or symbol) is used most in BPMN?
- Are symbols always arbitrary, or can they convey a certain degree of meaning?
4.2 Reuse of existing ontologies
A literature search revealed no existing ontologies in the domain of signs or icons. However, we identified the Business Process Modelling Ontology (BPMO), which was built automatically from the XML schemas contained in the OMG's BPMN 2.0 specification [7]. It contains all the BPMN elements and their relationships as defined in the BPMN specification. The class most closely related to our research domain (graphical signs) is DiagramElement and its subclasses (Figure 2). In the BPMN specification, this class is defined in the BPMN Diagram Interchange (BPMN DI) meta-model and schema for the purpose of unambiguous rendering of BPMN diagrams in different tools [2].

Figure 2: DiagramElement class and its subclasses in BPMO, visualized by the OntoGraf plugin for Protégé

As our focus in the Sign Ontology is mainly on graphical signs, which are not contained in BPMO, we will start our own ontology and later consider the option of merging the two ontologies.

4.3 Definition of concepts in the Sign Ontology
The next step in ontology creation is the enumeration of important terms. We defined the concepts for the BPMN Sign Ontology from semiotics (Sign, Icon, Index, and Symbol) and from BPMN (BasicShape, Activity, Event, Gateway, and Data).

4.4 Relationships among concepts
For the definition of the class hierarchy and the class properties, we next define the relationships among the three types of signs, again drawing on semiotics.

At first sight, the relationship between the signifier and the signified (and, consequently, the type of a sign) seems unambiguous, but that is not always the case. We should keep in mind that signs denote concepts (not material objects), and each person has their own understanding of a certain concept. Concepts cannot be represented precisely [8]; icons, for example, cannot simply be called similar, because they are defined by perceived similarity [3]. Also, as stated in [9], the process of sign-making is the process of the constitution of metaphor, and symbols are therefore never purely arbitrary. Within each type, signs vary in their degree of conventionality. We should therefore speak not of types of signs but of modes of relationships, where the difference between signs lies in the hierarchy of their properties rather than in the properties themselves [3]. Over time, a mode can also change: originally, signs were partly iconic and partly indexical (primitive writing), and symbols came into being through development out of other signs, particularly icons [3].

4.5 Sign Ontology construction
Using the Protégé 5.2.0 software tool, we created a simple Sign Ontology according to the semiotic concepts and their relationships, as follows. We created a class Sign (with disjoint subclasses Icon, Index and Symbol), a class Relationship (with subclasses PrimaryRelationship and SecondaryRelationship), and a class BPMNElement (with subclasses BasicShape, Activity, Event, Gateway, and Data). We also created two object properties: hasRelationshipType (with subproperties hasPrimaryRelationshipType and hasSecondaryRelationshipType) and its inverse property definesModeOf (with subproperties definesPrimaryModeOf and definesSecondaryModeOf). The domain of hasPrimaryRelationshipType is the class Sign, and its range is the class PrimaryRelationship. We then defined three instances, Arbitrary, Indicative and Similar, and included them in the classes PrimaryRelationship and SecondaryRelationship. Next, we defined that if a Sign has a hasPrimaryRelationshipType property with the value Similar, it is included in the class Icon. Similarly, we defined the classes Index (hasPrimaryRelationshipType value Indicative) and Symbol (hasPrimaryRelationshipType value Arbitrary).
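The same structure can also be sketched outside Protégé. Below is a minimal reconstruction of the described classes, properties and defined-class conditions in Python with owlready2; the ontology IRI is a hypothetical placeholder, and the paper itself built the ontology interactively in Protégé 5.2.0.

```python
# A sketch of the Sign Ontology described above, using owlready2.
# The IRI is a hypothetical placeholder.
from owlready2 import Thing, ObjectProperty, AllDisjoint, get_ontology

onto = get_ontology("http://example.org/bpmn-sign-ontology.owl")

with onto:
    class Sign(Thing): pass
    class Icon(Sign): pass
    class Index(Sign): pass
    class Symbol(Sign): pass
    AllDisjoint([Icon, Index, Symbol])        # the subclasses are disjoint

    class Relationship(Thing): pass
    class PrimaryRelationship(Relationship): pass
    class SecondaryRelationship(Relationship): pass

    class BPMNElement(Thing): pass            # subclasses omitted for brevity

    class hasRelationshipType(ObjectProperty):
        domain = [Sign]
        range = [Relationship]
    class hasPrimaryRelationshipType(hasRelationshipType):
        range = [PrimaryRelationship]

    # The three modes as instances of both relationship classes
    arbitrary = PrimaryRelationship("Arbitrary")
    indicative = PrimaryRelationship("Indicative")
    similar = PrimaryRelationship("Similar")
    for mode in (arbitrary, indicative, similar):
        mode.is_a.append(SecondaryRelationship)

    # Defined classes: a Sign whose primary relationship is Similar is an
    # Icon, Indicative makes it an Index, and Arbitrary a Symbol.
    Icon.equivalent_to.append(Sign & hasPrimaryRelationshipType.value(similar))
    Index.equivalent_to.append(Sign & hasPrimaryRelationshipType.value(indicative))
    Symbol.equivalent_to.append(Sign & hasPrimaryRelationshipType.value(arbitrary))
```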
4.6 BPMN graphical shapes as instances in the Sign Ontology
To decide whether the graphical signs in BPMN are of the mode icon, index or symbol, we invited five BPMN experts to evaluate the BPMN signs and assign one sign mode to each. We chose BPMN experts because they are fully familiar with the concepts (signifieds) in BPMN. Before the evaluation, the experts were acquainted with the concepts from semiotics. The results of the evaluation are given in Table 2.

For six shapes, the experts agreed on the sign mode, thereby defining the primary relationship between signifier and signified. For the other shapes, where the experts had differing opinions, the mode was defined with a primary and a secondary relationship: the mode chosen most often by the experts was set as the primary relationship, and the mode that ranked second was set as the secondary relationship.
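The aggregation rule just described can be stated compactly. The sketch below uses hypothetical vote lists, since per-expert votes are not published in the paper; it derives the primary and secondary relationship from five expert choices and leaves a slot unset on ties, as in Table 2.

```python
# A sketch of the mode-aggregation rule described above; the vote data
# is hypothetical (per-expert votes are not published in the paper).
from collections import Counter

def aggregate(votes):
    counts = Counter(votes).most_common()
    top = [mode for mode, c in counts if c == counts[0][1]]
    if len(top) > 1:
        return None, top                 # tied primary -> two secondary modes
    primary = top[0]
    rest = counts[1:]
    if not rest:
        return primary, []               # full agreement, no secondary mode
    second = [m for m, c in rest if c == rest[0][1]]
    return primary, (second if len(second) == 1 else [])  # tie -> not set

print(aggregate(["arbitrary"] * 5))                   # ('arbitrary', [])
print(aggregate(["arbitrary"] * 4 + ["indicative"]))  # ('arbitrary', ['indicative'])
print(aggregate(["similar"] * 2 + ["indicative"] * 2 + ["arbitrary"]))
# (None, ['similar', 'indicative'])  -- like the Script task below
```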
As we can observe from Table 2, the majority of the signs were specified as symbols (their primary relationship is arbitrary). The six symbols were also the only signs where the experts agreed fully on the sign mode. Furthermore, for all but one symbol the secondary mode was set as index and, the other way around, for all indices the secondary mode was set as symbol. Consensus on the primary relationship was not possible for two signs (Script task and Data object), and on the secondary relationship for one sign (Manual task). Thus, for the Script task and the Data object the primary relationship was not set, but two secondary relationships were set; for the Manual task, only the primary relationship was set.

After the modes of the signs were defined, we included the signs in the Sign Ontology. The ontology, including the instances, is shown in Figure 3. The figure represents classes as circles and relationships as lines connecting the circles; the size of a circle corresponds to the number of instances included in the class.

Table 2: Modes of BPMN signs (the Signifier column, showing each graphical shape, is not reproduced here)

Signified | * | Secondary relationship
Primary relationship: Arbitrary (Symbol)
Activity | 5 | -
Gateway | 5 | -
Signal event | 5 | -
Multiple event | 5 | -
Ad-hoc sub-process | 5 | -
Complex gateway | 5 | -
Event | 4 | Indicative (index)
Parallel event | 4 | Indicative (index)
Escalation event | 4 | Indicative (index)
Link event | 4 | Indicative (index)
Service task | 4 | Indicative (index)
Inclusive gateway | 4 | Indicative (index)
Parallel gateway | 4 | Indicative (index)
Error event | 4 | Indicative (index)
Send task | 3 | Indicative (index)
Receive task | 3 | Indicative (index)
Business rule task | 3 | Indicative (index)
Sub-process | 3 | Indicative (index)
Exclusive gateway | 3 | Indicative (index)
Data object collection | 3 | Similar (icon)
Primary relationship: Indicative (Index)
Conditional event | 4 | Arbitrary (symbol)
Flow | 4 | Arbitrary (symbol)
Cancel event | 4 | Arbitrary (symbol)
Data store | 3 | Arbitrary (symbol)
Compensation event | 3 | Arbitrary (symbol)
Primary relationship: Similar (Icon)
Message event | 4 | Indicative (index)
Timer event | 4 | Indicative (index)
User task | 3 | Arbitrary (symbol)
Manual task | 3 | Not set
Primary relationship: Not set
Script task | 2* | Similar / Indicative
Data object | 2* | Similar / Arbitrary
* The number of experts who decided on this primary mode.

Figure 3: Sign Ontology with BPMN shapes, rendered in the NavigOwl plugin for Protégé

CONCLUSION
In this paper, we mapped linguistic terms from semiotics to the terms used for signs in the BPMN specification. We found that the BPMN specification uses many terms for the term signifier, some of them inconsistently.

To correlate concepts from semiotics with BPMN graphical signs, we developed the BPMN Sign Ontology based on definitions from semiotics. We categorized each BPMN graphical sign into a mode that represents the relationship between signifier and signified. The majority of BPMN signs are of mode symbol, followed by mode index. As the meaning of symbols needs to be learned, this indicates a possible correlation with the principle of semantic transparency from [10]. Addressing this issue, we will, in future work, compare our results with those from [11] and other related articles.

Since the current study included only five BPMN experts, resulting in possible bias, empirical research with more users is planned, as well as a thorough literature search. At this point, the BPMN Sign Ontology can serve for unambiguous knowledge definition and sharing in the BPMN domain.

5. REFERENCES
[1] M. Kocbek, G. Jošt, M. Heričko, and G. Polančič, "Business process model and notation: The current state of affairs," Comput. Sci. Inf. Syst., vol. 12, no. 2, pp. 509–539, 2015.
[2] OMG, "Business Process Modeling Notation," 2011.
[3] D. Chandler, Semiotics: The Basics, 2nd ed. London: Routledge, 2007.
[4] T. R. Gruber, "A translation approach to portable ontology specifications," Knowl. Acquis., vol. 5, no. 2, pp. 199–220, Jun. 1993.
[5] N. F. Noy and D. L. McGuinness, "Ontology Development 101: A Guide to Creating Your First Ontology," Stanford Knowl. Syst. Lab. Tech. Rep., pp. 1–25, 2001.
[6] L. Obrst, H. Liu, R. Wray, and L. Wilson, "Ontologies for semantically interoperable electronic commerce," IFIP Adv. Inf. Commun. Technol., vol. 108, pp. 325–333, 2003.
[7] L. Cabral, B. Norton, and J. Domingue, "The business process modelling ontology," Proc. 4th Int. Work. Semant. Bus. Process Manag., pp. 9–16, 2009.
[8] A. Fenk, "Symbols and icons in diagrammatic representation," Pragmat. Cogn., vol. 6, no. 1–2, pp. 301–334, 1998.
[9] G. R. Kress and T. van Leeuwen, Reading Images: The Grammar of Visual Design. London: Routledge, 1996.
[10] D. Moody, "The physics of notations: Toward a scientific basis for constructing visual notations in software engineering," IEEE Trans. Softw. Eng., vol. 35, no. 6, pp. 756–779, 2009.
[11] N. Genon, P. Heymans, and D. Amyot, "Analysing the Cognitive Effectiveness of the BPMN 2.0 Visual Notation," J. Vis. Lang. Comput., vol. 22, no. 6, pp. 377–396, 2011.

Knowledge Perception influenced by Notation Used for Conceptual Database Design

Aida Kamišalić, Muhamed Turkanović, Marjan Heričko, Tatjana Welzer
Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
aida.kamisalic@um.si, muhamed.turkanovic@um.si, marjan.hericko@um.si, tatjana.welzer@um.si

ABSTRACT
The paper presents an experimental study which examined the influence of the notation used for conceptual design on students' knowledge perception at the higher-education level. The results demonstrate that students' knowledge perception is higher than their actual knowledge throughout the entire learning process and is correlated with the notation used.
Categories and Subject Descriptors
H.2.1 [Database Management]: Logical Design; K.3.2 [Computers and Education]: Computer and Information Science Education

General Terms
Theory, Experimentation

Keywords
Entity-relationship models, conceptual design, database design learning, Barker, Bachman, knowledge perception

1. INTRODUCTION
Relational databases are a fundamental part of any information system, and conceptual and logical design represent an important segment of almost every application. Different issues related to approaches for teaching database fundamentals and design must therefore be adequately addressed. The introductory databases course is one of the fundamentals of computer science and informatics higher-education programs. It is mostly a single-semester course that covers data requirements elicitation, conceptual database design, normalization, logical database design, and physical database design [3, 4, 5]. There is much research addressing issues related to teaching computer science and informatics disciplines, including various aspects of databases [3, 5, 9], and some research has dealt with the effectiveness of teaching approaches to database design (conceptual and logical modeling) [1, 2, 7, 8]. However, to the best of our knowledge, no research has dealt with knowledge perception within the database learning environment.

In order to examine the effectiveness of learning database fundamentals depending on the notation used for conceptual design, we set up a multi-level experimental study [7]. Different experimental instruments were developed to evaluate the effectiveness of a teaching approach using the Barker or Bachman notation for conceptual database design. In contrast to the Barker notation, the Bachman notation incorporates elements of logical design (i.e. foreign keys) at the conceptual design level. Students' achievements were examined with regard to influencing factors throughout the learning process. The results indicated that introducing the Bachman notation and a manual transformation from a conceptual into a logical data model increased students' understanding of conceptual, logical and relational data model concepts (CLR concepts).

Here we present another aspect of this study: the influence of the notation used for conceptual design on students' knowledge perception. The research questions addressed and answered in the paper are: (RQ1) How does the notation used for conceptual design influence students' knowledge perception? and (RQ2) Does the correlation between students' knowledge perception and their actual knowledge of CLR concepts change throughout the learning process?

The structure of the paper is as follows. Section 2 provides the methodological framework and experimental setting. The main contribution of the paper is presented in Section 3, where the results are detailed and discussed. Finally, conclusions are presented in Section 4.

2. METHODOLOGY
2.1 Experimental framework
The study was carried out during the academic year 2016/2017 at the Faculty of Electrical Engineering and Computer Science at the University of Maribor. The experiment was performed within the Database I course, a single-semester course that includes 45 hours of theory/practice lectures and 30 hours of laboratory work in the form of computer exercises.
The focus of the experiment was on the evaluation of students' laboratory work. Students were randomly split into two approximately equal-sized groups. Both groups worked on the same database modeling tasks, using the Oracle SQL Developer Data Modeler design tool. One group used the Bachman notation, which explicitly includes the foreign key in the E-R diagram, while the other group used the Barker notation, which does not [6].

2.2 Experimental instruments
This section presents the experimental instruments used during the study. The questionnaire was conducted twice: an Intro-Questionnaire and a Final-Questionnaire. Participation was optional on both occasions. The questionnaire used in the study is available on the web (http://bit.ly/2wMvrVQ).

The questionnaire is split into three parts. The first part consists of mainly closed-ended questions related to basic demographic information and database design tools (Questions 1–6). The second part consists of a Likert-scale-like multi-level table (Question 8), where participants have to mark one of the multi-level options for five basic database terms and concepts: Entity, Relationship, Attribute, Primary Key (hereinafter PK) and Foreign Key (hereinafter FK). The values of the Likert scale are: (1) I am not familiar with the term; (2) I am familiar with the term, but not with its meaning; (3) Undefined; (4) I am familiar with the meaning, but I do not know how to use it; and (5) I am familiar with the meaning and I know how to use it.

The third part consists of open-ended questions in the form of a short test (Question 9). The short test consists of three consecutive simple tasks (9a, 9b, 9c), each related to the previous one and each increasing in difficulty. To solve the test correctly, the participants have to use a form of one-to-many (hereinafter 1:N) and many-to-many (hereinafter M:N) relationship. The participants are not given any instructions on how to solve the test; they are left to use any means and techniques that seem appropriate. The foreseen time limit is 20 minutes.

The purpose of the questionnaire was to examine whether there was any correlation between the participants' perception of their knowledge of CLR concepts (Question 8) and their actual knowledge (score on test questions 9a, 9b, 9c). When the questionnaire was handed out the second time, an additional closed-ended question was added to the first part (Question 7), in which students were asked which notation they had used during the laboratory work. The purpose of this question was to examine whether there was any correlation between the notation used during the laboratory work and the students' knowledge (score on test questions 9a, 9b, 9c).

To evaluate the questionnaire, a scoring structure for the third part (Question 9) is needed. For the first task (9a), participants have to model an entity (e.g. person) and give it some attributes and possibly a primary key. For the second task (9b), they have to model an additional entity (e.g. phones) and present a 1:N relationship between the previous entity and the newly added one. For the third task (9c), they have to add a third entity (e.g. address) and correctly use a form of M:N relationship between the previous entities and the newly added one. To analyze the results, five concepts are evaluated: entity, relationship, attribute, PK and FK. The scoring is as follows: participants get a point for a concept if they used any possible form of the concept in their solution and its use was correct. Thus, five points can be scored in total.
3. RESULTS AND DISCUSSION
The following sections report the results achieved in the experiment. Statistical analyses were performed using IBM SPSS Statistics version 23.

3.1 Knowledge perception
An analysis was performed on related samples of the perception score and the test score, based on data gathered from the Intro-Questionnaire and the Final-Questionnaire. The data for each questionnaire was analyzed separately.

In the analysis we excluded all records where students rated one of the concepts as undefined; the total number of records taken into account was thus 116. This left four levels of knowledge and five different concepts. As mentioned in the previous section, part of the questionnaire was a short test; we refer to the total test score as the test score. In order to compare actual knowledge with perception effectively, we normalized the results of knowledge perception by dividing the total score (max. 20 points) by five. We refer to the normalized perception results as the perception score. Table 1 reports the results of the analysis, which was performed using a Wilcoxon signed-rank test for related samples.

Table 1: Correlation of results for perception score and test score.

Experimental instrument | Related samples | Asymp. Sig. (2-tailed) | N | Decision
Intro-Questionnaire | Perception score - Test score | 0.000** | 107 | Reject the null hypothesis
Final-Questionnaire | Perception score - Test score | 0.000** | 116 | Reject the null hypothesis
**Significant at 1%

We used the Wilcoxon signed-rank test to compare two sets of scores that are not normally distributed, the actual test score and the normalized perception score, coming from the same participants, since each participant both solved the tasks and evaluated their own knowledge of the CLR topics. The Shapiro-Wilk test of normality indicated that the data deviates significantly from a normal distribution (p-value below 0.05). The Wilcoxon signed-rank test returns an asymptotic significance lower than 0.01, thus rejecting the null hypothesis for related samples, which states that the median of the difference between the perception score and the test score equals zero. There is a statistically significant difference between the perception score and the test score, suggesting that students' perception of their knowledge is not in accordance with their actual knowledge of CLR concepts.
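The same comparison can be reproduced outside SPSS. The sketch below uses randomly generated stand-in scores, since the raw data is not published, and applies the normalization and the tests described above with SciPy:

```python
# A sketch of the related-samples comparison described above, with
# hypothetical stand-in data; the study itself used IBM SPSS Statistics 23.
import numpy as np
from scipy.stats import shapiro, wilcoxon

rng = np.random.default_rng(42)
perception_total = rng.integers(5, 21, size=116)  # summed Likert ratings (max 20)
test_score = rng.integers(0, 6, size=116)         # short-test score (max 5)

perception_score = perception_total / 5           # normalization used in the paper

print(shapiro(perception_score - test_score))     # Shapiro-Wilk normality check
print(wilcoxon(perception_score, test_score))     # Wilcoxon signed-rank test
```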
Figures 1 and 2 depict the correlation between students' actual knowledge and their knowledge perception, indicating a higher knowledge perception than actual knowledge in both questionnaires. The results indicate that the gap between knowledge perception and actual knowledge narrows by the end of the course (Final-Questionnaire), owing to the higher knowledge achieved by then. However, knowledge perception remains at a high level.

Figure 1: Correlation between students' actual knowledge and their knowledge perception. Intro-Questionnaire (course start).

Figure 2: Correlation between students' actual knowledge and their knowledge perception. Final-Questionnaire (course end).

Table 2 reports the ranks of the performed Wilcoxon signed-rank test. In the Intro-Questionnaire, 100 out of 107 participants assessed their knowledge higher than their actual knowledge; only two participants reached the opposite result, and only five assessed their knowledge correctly. The Final-Questionnaire results showed a slight increase in correctly assessed knowledge: 100 out of 116 participants assessed their knowledge as higher than their actual knowledge, one participant reached the opposite result, and 15 assessed their knowledge correctly. A further indication of inaccurate knowledge perception can be deduced from the means of the scored results. The mean of the perception score in the Intro-Questionnaire was 4.095, while the mean test score stood at 1.74; for the Final-Questionnaire, the means were 4.957 and 3.35, respectively. We conclude that students overestimated their knowledge of CLR concepts throughout the entire course.

Table 2: Cases of knowledge perception scores versus actual knowledge scores.

Experimental instrument | Related samples | Ranks | N | Mean rank | Sum of ranks
Intro-Questionnaire | Perception score - Test score | Negative ranks | 2 a | 17.25 | 34.5
 | | Positive ranks | 100 b | 52.19 | 5218.5
 | | Ties | 5 c | |
 | | Total | 107 | |
Final-Questionnaire | Perception score - Test score | Negative ranks | 1 a | 1 | 1
 | | Positive ranks | 100 b | 51.5 | 5150
 | | Ties | 15 c | |
 | | Total | 116 | |
a Perception score < Test score; b Perception score > Test score; c Perception score = Test score

Conclusions regarding RQ2: Students overestimated their knowledge of CLR concepts throughout the entire course. The correlation between students' knowledge perception and their actual knowledge improves by the end of the course, due to the higher knowledge reached by then; however, knowledge perception remains at a high level.

3.2 Knowledge perception and notation
Additionally, we analyzed the results of students' knowledge perception and actual knowledge with respect to the notation used in the learning process. The normalized results of students' self-assessment of their knowledge and the results of our assessment of their knowledge were summed and used to assess students' perception of knowledge in dependence on the notation. The range of the summed score is thus 1–10. As the summed score approaches the extremes, students were better able to assess their knowledge; that is, their perception of their knowledge and their actual knowledge were very close. Conversely, the closer the result is to the middle, the more a student incorrectly assessed their knowledge, either overestimating or underestimating it. For example, students could assess their knowledge as high, reaching five points for perception, and also score all five points on the test, collecting ten points in total. Conversely, students could assess their knowledge as high but reach few or no points on the test, scoring five points in total. The analysis of the impact of the notation was based on the data gathered from the Final-Questionnaire only, because the impact of the notation can only be seen after the notation has been used in the learning process. We used the Mann-Whitney U test to compare differences between two independent groups (students using the Bachman or the Barker notation) on the dependent variable (students' summed test score and normalized perception score), as the groups are not normally distributed; the Shapiro-Wilk test of normality indicated that the data deviates significantly from a normal distribution (p-value below 0.05).
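A minimal SciPy equivalent of this independent-samples comparison, again with stand-in scores for the two notation groups:

```python
# A sketch of the notation comparison described above; the group sizes
# follow the paper (68 Barker, 48 Bachman), the scores are hypothetical.
import numpy as np
from scipy.stats import mannwhitneyu, shapiro

rng = np.random.default_rng(7)
barker = rng.integers(5, 11, size=68)     # summed perception + test scores (1-10)
bachman = rng.integers(6, 11, size=48)

print(shapiro(barker), shapiro(bachman))  # normality checks
print(mannwhitneyu(barker, bachman))      # Mann-Whitney U test
```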
Table 3 reports the results of the Mann-Whitney U test for independent samples.

Table 3: Correlation of summed perception and test score and influencing factor (notation used).

Experimental instrument | Independent variable | Dependent variable | N | Asymp. Sig. | Decision
Final-Questionnaire | Notation | Summed perception and test score | 116 | 0.008** | Reject the null hypothesis
*Significant at 5%; **Significant at 1%

The Mann-Whitney U test returns an asymptotic significance lower than 0.01 for the notation variable, therefore rejecting the related null hypothesis, which states that the distribution of the summed score is the same across the Bachman and Barker notation categories. Considering the results, there is a statistically significant difference between the summed scores by notation used in the learning process. 68 out of 116 students used the Barker notation during the learning process, with a summed mean score of 8.1; the Bachman notation was used by 48 students, with a summed mean score of 8.608. As is evident from Figure 3, more students who used the Bachman notation assessed their knowledge well.

Figure 3: Summed perception score and test score in correlation with the notation used during the learning process.

In contrast, there were students who used the Barker notation with a summed score of five, which indicates the worst assessment of knowledge. We conclude that students who used the Bachman notation in the learning process evaluated their knowledge better than students who used the Barker notation.

Conclusions regarding RQ1: The Bachman notation positively influences students' ability of knowledge self-assessment. By the course's end, the difference between knowledge perception and actual knowledge decreases.

4. CONCLUSIONS
The paper reported the results of an experimental study aimed at analyzing the influence of the notation used for conceptual design on students' knowledge perception. The study continues the work already presented in [7], reporting on students' knowledge perception being higher than their actual knowledge.

We examined whether students' perception of their knowledge is in accordance with their actual knowledge of CLR concepts. The results confirm that their perception is higher than their actual knowledge throughout the entire learning process. By the end, their knowledge increases while their perception remains at a similar level as at the beginning. Additionally, the results show that students who used the Bachman notation during the learning process were able to estimate their knowledge better. In the future, we plan to analyze the correlation between students' educational background and their success rate when learning CLR concepts at the higher-education level.

5. ACKNOWLEDGMENTS
The authors acknowledge the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0057).

6. REFERENCES
[1] A. Al-Shamailh. An Experimental Comparison of ER and UML Class Diagrams. International Journal of Hybrid Information Technology, 8(2):279–288, 2015.
[2] H. C. Chan, K. K. Wei, and K. L. Siau. Conceptual level versus logical level user-database interaction. In ICIS Proceedings, pages 29–40, 1991.
[3] T. M. Connolly and C. E. Begg. A Constructivist-Based Approach to Teaching Database Analysis and Design. Journal of Information Systems Education, pages 43–53, 2005.
[4] R. Dargie and A. Steele. Teaching Database Concepts using Spatial Data Types. In Proceedings of the 4th Annual Conference of Computing and Information Technology Research and Education New Zealand, pages 17–21, 2013.
[5] C. Domínguez and A. Jaime. Database design learning: A project-based approach organized through a course management system. Computers & Education, 55(3):1312–1320, 2010.
[6] D. C. Hay. A comparison of data modeling techniques. Essential Strategies, Inc., pages 1–52, 1999.
[7] A. Kamišalić, M. Heričko, T. Welzer, and M. Turkanović. Experimental Study on the Effectiveness of a Teaching Approach Using Barker or Bachman Notation for Conceptual Database Design. Computer Science and Information Systems, 15(2):421–448, 2018.
[8] H. C. Purchase, R. Welland, M. McGill, and L. Colpoys. Comprehension of diagram syntax: an empirical study of entity relationship notations. International Journal of Human-Computer Studies, 61(2):187–203, 2004.
[9] S. D. Urban and S. W. Dietrich. Integrating the Practical Use of a Database Product into a Theoretical Curriculum. SIGCSE Bull., 29(1):121–125, 1997.

The Use of Standard Questionnaires for Evaluating the Usability of Gamification

Alen Rajšp, Katja Kous, Tina Beranič
Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
alen.rajsp@um.si, katja.kous@um.si, tina.beranic@um.si

ABSTRACT
Usability has a significant impact on the satisfaction with, and frequency of use of, a designed system. Nowadays, gamification and serious-game approaches are implemented in software solutions to increase their usability. We present a literature review of 32 identified studies that measure usability with established questionnaires in gamified systems and serious games. We identified 18 different questionnaires used for measuring usability and found the System Usability Scale to be the most widely used. An immense issue exists in the field: only 22% of the studies measuring usability actually describe or define what usability is.
Categories and Subject Descriptors
H.5.2 [User Interfaces]: User-centered design

General Terms
Measurement, Experimentation, Standardization, Theory, Verification

Keywords
Usability Evaluation Method, Formal Questionnaires, Gamification, Serious Games

1. INTRODUCTION
In recent years, gamification has become an essential part of a variety of domains, from education to medicine. It is used to facilitate the use of developed products, but it cannot achieve its purpose if the usability of the product is inadequate. Usability evaluation should therefore be a crucial step of development.

Solutions utilizing (1) gamification or (2) a serious-game approach should be evaluated separately, because they are inspired by games, which have very specific (and different) natures. The primary function of games is to entertain through experience, whereas serious games and gamification have some intended useful purpose [10]. Because gamification and serious-game approaches utilize elements from games, the resulting solutions also meet, to varying degrees, the other needs of their intended audience, which increases user satisfaction.

In the web domain, only 18% of the papers reviewed in [7] present usability evaluation methods relying on the standardized definitions of usability. Fernandez et al. [7] found that 59% of the reviewed papers reported end-user-based usability testing, while 35% used inquiry methods (such as focus groups, interviews, questionnaires and surveys). Based on these facts, this research focuses on inquiry methods, more specifically on questionnaires, and investigates which standard questionnaires are used most commonly for usability evaluation in the gamification domain. Within the presented paper, we focus on the research question: Which standard questionnaires are used for evaluating the usability of gamification? Using a literature review, we study the use and popularity of established usability questionnaires in the gamification domain.

A similar study, by Yáñez-Gómez et al. [18], reviews academic methods for the usability evaluation of serious games. The scope of that study is broader, aiming at finding the preferred approach for evaluating the usability of games. As its results show, standard questionnaires are the second most used technique applied in post-use analysis [18]. The authors mention three questionnaires in use, but a detailed analysis is not provided; in addition, our search string differs from theirs. Another review, by Calderón and Ruiz [5], covers the evaluation of serious games. One of its research questions concerned evaluation techniques, and it found that questionnaires are used most commonly, but a categorization or detailed analysis of the used questionnaires was not provided.

The paper is structured as follows. We start by presenting the research background, covering usability evaluation and gamification; we continue by presenting and discussing the results of the literature review; and we close by presenting the conclusions reached by our review.

2. USABILITY EVALUATION
The term usability represents a combination of several properties and attributes [13]. Regardless of the variety of definitions by different authors [1, 3, 9, 13, 15, 17], Jeng [12] states that the Nielsen and ISO 9241-11 definitions are the most widely cited. ISO 9241-11 defines usability as "the extent to which a product can be used by a specified user to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use" [11], while Nielsen [15] defines usability as an aggregation of five attributes: learnability, efficiency, memorability, errors and satisfaction.

The usability evaluation method is defined as "a procedure, composed of a set of activities for collecting usage data related to end user interaction with a software product, and/or how the specific properties of this software product contribute to achieving a certain degree of usability" [7].
According to Battleson et al. [2], usability evaluation methods are classified into three categories: (1) inquiry methods (such as focus groups, interviews, questionnaires and surveys), (2) formal usability testing (such as interactions with a website by performing tasks) and (3) inspection methods (such as heuristic evaluation, cognitive walk-through, pluralistic walk-through and formal inspection). The first two categories involve real users, while inspection methods are based on reviewing the usability aspects of web artifacts, which have to comply with established guidelines, and are performed by expert evaluators or designers [7].

3. GAMIFICATION
Gamification is the use of design elements characteristic of games in non-game contexts [6]. Gamification should not be confused with serious games. Whereas the goal of introducing gamification is to influence learning-related behaviours and attitudes without providing knowledge, serious games should influence learning and provide knowledge through the experience itself [14]. Another way to compare them is that gamification uses only parts (game elements) of games, while serious games provide the whole gaming experience [6].

4. EVALUATING THE USABILITY OF GAMIFICATION
4.1 Research
Our research aims to find the available standard questionnaires used for evaluating the usability of gamification. Using the search string "usability" AND ("gamification" OR "serious games" OR "educational games"), we searched the following digital libraries: ScienceDirect, IEEE Xplore, ACM Digital Library and Sage Journals. Predetermined inclusion and exclusion criteria guided the study selection process. We considered papers evaluating usability with the help of established and well-known questionnaires, and therefore excluded primary studies using ad-hoc questionnaires.

After the review process, we selected 33 primary studies. The list of primary studies used as input to the data extraction and data synthesis steps is available at https://tinyurl.com/CSS2018-IJS. 26 of the 33 primary studies are conference papers, whereas seven are journal articles. Figure 1 shows the number of primary studies by year of publication. We selected 23 primary studies from the IEEE Xplore digital library, six from the ACM Digital Library, three from ScienceDirect and one from Sage Journals.

Figure 1: Primary studies by years
4.2 Results
In the data extraction, we focused on two main areas. First, we searched for the definitions of usability that were used, since usability was the property evaluated in the analysed studies. The extracted data showed that only seven primary studies (22%) defined and described the term usability. Two of them treated usability as a concept (S5, S10), while five treated it as a construct: two studies (S11, S21) used Nielsen's definition, one (S4) used the ISO definition, one (S18) described usability as "ease of use of the game", and study S25 defined usability similarly to ISO but expanded the definition with two new concepts ("simple" and "operating with ease"). The remaining studies (78%) used the term usability without providing its meaning.

The studies are classified by domain in Table 1. Over half (56%) of all studies were from the field of health and medicine. Most studies from this domain addressed (1) training of health-care personnel (S8, S17, S18), (2) rehabilitation and exercise for patients (S3, S6, S7, S16) and (3) assessing patients (S1). The second most popular domain (37%) was education and learning. All other identified domains had only one study each.

Table 1: Domain

Domain | Primary studies
Agriculture | S27
Business Intelligence | S5
Computer Science | S5
Education & Learning | S2, S8, S10, S13, S14, S16, S17, S18, S23, S28, S29, S31
Entertainment | S4
Health & Medicine | S1, S3, S6, S7, S8, S11, S12, S16, S17, S18, S19, S20, S21, S22, S24, S29, S30, S32
Social Science | S25
Task Management | S9
Travel | S15

We continued the data extraction by identifying the standard questionnaires used for usability evaluation. We followed the explanation provided by Yáñez-Gómez et al. [18], which states that standard questionnaires are those that have been validated statistically. Table 2 presents the questionnaires in connection with the primary studies. The majority of studies evaluated usability using the System Usability Scale (SUS), which was used in 78% of the primary studies. Although the Technology Acceptance Model (TAM) is used in model-driven analysis for measuring users' acceptance and usage of technology, and is not classified as a standard questionnaire for usability evaluation, it was used for the assessment of gamification in four primary studies. The Game Experience Questionnaire (GEQ), Task Load Index (TLX), Game Engagement Questionnaire (GEQ), Post-Study System Usability Questionnaire (PSSUQ) and Net Promoter Score (NPS) were each used in two primary studies. We also extracted questionnaires used in only one primary study, such as the Presence Questionnaire (PQ) and the Software Usability Measurement Inventory (SUMI).

Table 2: Standard questionnaires in use

Questionnaire | Primary studies
System Usability Scale (SUS) | S1, S3, S6, S7, S10, S11, S12, S14, S16, S17, S18, S19, S20, S21, S22, S23, S24, S25, S26, S27, S28, S29, S30, S31, S32
Technology Acceptance Model (TAM) | S2, S3, S11, S20
Game Experience Questionnaire (GEQ) | S1, S4, S30
Task Load Index (TLX) | S1, S22
Game Engagement Questionnaire (GEQ) | S11, S18
Post-Study System Usability Questionnaire (PSSUQ) | S8, S15
Net Promoter Score (NPS) | S31, S32
User Engagement Scale (UES) | S5
Computer System Usability Questionnaire (CSUQ) | S13
Software Usability Measurement Inventory (SUMI) | S7
Intrinsic Motivation Inventory (IMI) | S16
User Interaction Satisfaction (QUIS) | S18
Presence Questionnaire (PQ) | S10
Usefulness, Satisfaction, and Ease of use (USE) Questionnaire | S9
Pick-A-Mood (PAM) | S10
Technology Affinity - Electronic Devices (TA-ED) Questionnaire | S20
Game User Experience and Satisfaction Scale (GUESS) | S19
Differential Emotions Scale (DES) | S10

For a comprehensive usability evaluation, it is crucial that the measurement instruments are utilized appropriately according to the attribute they are measuring. 41% (13/32) of the primary studies (S6, S12, S17–S19, S25–S30, S32) used the most established questionnaire, SUS, for measuring usability, but did not define the measured attribute in their research.
Table 3 presents the connection between the used questionnaires and the attributes measured in at least two primary studies. The most frequently measured attributes were "ease of use" and "usability", each measured in six primary studies. In all cases, the attribute "usability" was measured with SUS, while the attribute "ease of use" was measured with three different questionnaires: SUMI (S7), USE (S9) and TAM (S2, S3, S11, S20). The second most frequently measured attribute was "usefulness". In three primary studies (S2, S11, S20) it was treated as one of the two factors defined in TAM, while in one case it was measured with USE (S9) and in another with PSSUQ (S15). The attribute "satisfaction" was the third most common, measured with two different questionnaires: SUS (S21, S22, S31) and USE (S9).

Table 3: Connection between the measured attributes and used questionnaires

Measured attribute | Questionnaires
Ease of use | SUMI (S7), USE (S9), TAM (S2, S3, S11, S20)
Usability | SUS (S10, S11, S16, S23, S24, S31)
Usefulness | TAM (S2, S11, S20), USE (S9), PSSUQ (S15)
Satisfaction | USE (S9), SUS (S21, S22, S31)
Flow | GEQ (S1, S4, S11)
Learnability | SUMI (S7), USE (S9)
Competence | GEQ (S1, S4)
Overall | CSUQ (S13), SUMI (S7)
Quality of information | CSUQ (S13), PSSUQ (S15)
Quality of interface | CSUQ (S13), PSSUQ (S15)

The most popular devices on which the developed or proposed solutions ran were computers (62%), virtual-reality equipment (22%) and mobile devices (16%), as seen in Table 4.

Table 4: Devices on which the studied system runs

Device | Primary studies
Computer | S1, S2, S4, S5, S6, S7, S8, S10, S11, S13, S14, S16, S21, S22, S23, S25, S26, S28, S29, S31
Customised system | S19
Mobile device | S9, S12, S15, S20, S27
Smart TV | S3
Virtual reality | S10, S15, S17, S18, S24, S30, S32

4.3 Discussion
An extensive collection of standard questionnaires was found for evaluating the usability of gamification, with the System Usability Scale (SUS) as the prevailing choice (84% of all studies). Since SUS is a well-known questionnaire that is easy to administer and analyse, this is not a surprise; as SUS was developed to provide a subjective assessment of usability [4], its extensive use is even more understandable.
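For reference, SUS itself is scored with a fixed rule: ten five-point items, where odd items contribute (response - 1), even items contribute (5 - response), and the sum is multiplied by 2.5 to yield a 0–100 score. A minimal sketch follows; the example responses are hypothetical.

```python
# Standard SUS scoring (Brooke [4]); the example responses are hypothetical.
def sus_score(responses):
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    odd = sum(r - 1 for r in responses[0::2])    # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])   # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5

print(sus_score([5, 1, 5, 2, 4, 1, 5, 1, 4, 2]))  # 90.0
```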
However, the majority of researchers that used SUS in their studies did not state explicitly which attribute of usability was measured. The remaining studies that used SUS defined two different attributes that can be measured with it. The first was "usability", which is in accordance with the description of SUS's purpose [4]; the second was "satisfaction", which is recommended by the ISO/TS 20282-2:2013 standard [8], where SUS is defined as a questionnaire for measuring satisfaction.

Another question is whether standard usability questionnaires can adequately evaluate the usability of gamification. Shewaga et al. [16] claim that the SUS questionnaire is a verified instrument for measuring usability in the serious-games domain. The Technology Acceptance Model (TAM) is widely used in the information-systems domain to investigate how accepted the use of a technology is among its target users. Although it is not classified as a standard questionnaire for usability evaluation, but rather as a model combining the constructs ease of use and usefulness, it was the second most used measuring instrument for usability evaluation in the reviewed literature. On the other hand, questionnaires like the Game Experience Questionnaire (GEQ) and the Game Engagement Questionnaire (GEQ), which originate from the gaming domain, are nowadays used to evaluate the usability of gamification; a fusion of the two fields can thus be perceived.

5. CONCLUSION
The paper presents a literature review aimed at finding the standard questionnaires used for the usability evaluation of gamification and serious games. We found that the majority (84%) of studies evaluate usability using the System Usability Scale (SUS), though some other questionnaires were also detected and used independently or in combination with SUS. As prospective researchers, we can determine only in a minority of cases what the primary studies were measuring, because only 22% of the primary studies measuring usability defined or described what usability is. That is an immense issue for the validity of their usability measurements, since multiple definitions of usability exist. We propose that methods for measuring usability in the field of gamification and serious games be formalised in the future. Although researchers already use standardised methods for measuring usability, research should also state what usability means for the authors and what they are measuring.

6. ACKNOWLEDGMENTS
The authors acknowledge the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0057).

7. REFERENCES
[1] A. Abran, A. Khelifi, W. Suryn, and A. Seffah. Usability meanings and interpretations in ISO standards. Information and Software Technology, 11(4):325–338, Aug 2003.
[2] B. Battleson, A. Booth, and J. Weintrop. Usability testing of an academic library web site: A case study. J. Acad. Librariansh., 27(3):325–338, 2001.
[3] T. Brinck, D. Gergle, and S. D. Wood. Designing Web Sites that Work: Usability for the Web. Morgan Kaufmann, San Francisco, 2002.
[4] J. Brooke. SUS: A quick and dirty usability scale, 1996.
[5] A. Calderón and M. Ruiz. A systematic literature review on serious games evaluation: An application to software project management. Computers & Education, 87:396–422, 2015.
[6] S. Deterding, D. Dixon, R. Khaled, and L. Nacke. From game design elements to gamefulness: Defining gamification. Proceedings of the 15th International Academic MindTrek Conference on Envisioning Future Media Environments - MindTrek '11, pages 9–11, 2011.
[7] A. Fernandez, E. Insfran, and S. Abrahão. Usability evaluation methods for the web: A systematic mapping study. Inf. Softw. Technol., 53(8):789–817, Aug 2011.
[8] International Organization for Standardization. ISO/TS 20282-2:2013 Usability of consumer products and products for public use - Part 2: Summative test method, 2013.
[9] E. Furtado, J. J. V. Furtado, F. Lincoln Mattos, and J. Vanderdonckt. Improving usability of an online learning system by means of multimedia, collaboration, and adaptation resources. In Usability Eval. Online Learn. Programs, pages 69–86, October 2003.
[10] C. Girard, J. Ecalle, and A. Magnan. Serious games as new educational tools: how effective are they? A meta-analysis of recent studies. Journal of Computer Assisted Learning, 29(3):207–219, 2013.
[11] ISO. Standard 9241: Ergonomic Requirements for Office Work with Visual Display Terminals (VDTs), Part 11: Guidance on Usability, 1998.
[12] J. Jeng. What is usability in the context of the digital library and how can it be measured? Information Technology and Libraries, 24(2):47–56, Nov 2005.
[13] Z. Kılıç Delice and E. Güngör. The usability analysis with heuristic evaluation and analytic hierarchy process. Int. J. Ind. Ergon., 39(6):934–939, Nov 2009.
[14] R. N. Landers. Developing a Theory of Gamified Learning: Linking Serious Games and Gamification of Learning. Simulation & Gaming, 45(6):752–768, 2014.
[15] J. Nielsen. Usability Engineering. Academic Press, San Diego, 1993.
[16] R. Shewaga, A. Uribe-Quevedo, B. Kapralos, K. Lee, and F. Alam. A Serious Game for Anesthesia-Based Crisis Resource Management Training. Entertainment Computing, 16(2):6:1–6:16, Apr 2018.
[17] G. Tsakonas and C. Papatheodorou. Exploring usefulness and usability in the evaluation of open access digital libraries. Information Processing & Management, 44(3):1234–1250, May 2008.
[18] R. Yáñez-Gómez, D. Cascado-Caballero, and J.-L. Sevillano. Academic methods for usability evaluation of serious games: a systematic review. Multimedia Tools and Applications, 76(4):5755–5784, Feb 2017.

Analyzing Short Text Jokes from Online Sources with Machine Learning Approaches

Samo Šimenko, Vili Podgorelec, Sašo Karakatič
Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
samo.simenko@student.um.si, vili.podgorelec@um.si, saso.karakatic@um.si

ABSTRACT
This paper presents the complete data mining process of analyzing jokes in the Slovenian language gathered from various online sources. The gathering was done with the help of a web scraping system, and an analysis was carried out on the gathered jokes to determine the properties of various types of jokes. In addition, with the help of various text-mining methods, we analyzed different types of jokes and built a machine learning model for classifying jokes into categories. These results are supplemented with a visualization of the different categories and an interpretation of the constructed machine learning classification models.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; I.2.m [Artificial Intelligence]: Miscellaneous
General Terms
Machine Learning, Data Mining, MDS, SVC

Keywords
Data mining, Machine learning, Joke analysis, Short text analysis, Text mining

1. INTRODUCTION
Due to ever-advancing technology, opportunities are opening up for analyzing all types of data, so we can make the most of this and use it for our benefit. Scientists were already studying and examining various types of texts in the initial phases of textual analysis [8, 9, 10], but studying the meaning and connections of texts presents a rather new direction of research, where there is still a lot of room for improvement. While there has been a lot of work done on various short text types, e.g. tweets [12], reviews [11], recipes [13] and others, there is a lack of research published on the topic of joke analysis.

In our paper, we present a process of gathering, parsing and pre-processing jokes and applying various data- and text-mining techniques to extract patterns and new knowledge from the joke data. By semantic text processing, we identify more than just a sequence of symbols; we can assign them meaning, which can influence the classification of jokes. In our case, we undertook the processing of various jokes in order to determine how the categories of such texts are interconnected by their content, and to find out which categories of jokes share the most similar content. Based on the texts, we created a classification model for the classification of jokes into predefined categories.

The rest of the paper is structured in the following way. The following section presents the method for gathering and parsing jokes from the online sources. The third section presents the individual steps of data- and text-mining in detail; it consists of the machine learning method description, the applications and techniques used in the process, and the results themselves. We finish up with the conclusion and the discussion on the topic of joke analysis with various data mining methods.

2. GATHERING AND PARSING OF THE JOKES FROM THE ONLINE SOURCES
In order to fulfill the set goals of analyzing jokes, we obtained them from various sources. Three different sources were used:
– jokes from the first source, a web site called VERZIVICI [2], already classified into categories;
– jokes from the second source, NAJVICI [3];
– jokes from the third source, MLADINSKI [4].

For the data acquisition we developed a program in the Visual Studio IDE, using the C# programming language, which acquired jokes from the selected sources and saved them in a suitable text format. Due to the unstructured data of the selected web resources, we used HAP (HtmlAgilityPack) for processing. HAP is an HTML parser written in C# for reading/writing the DOM (Document Object Model) and supports plain XPATH or XSLT [1]. Using the HAP library and XPATH, we could easily access the individual sections which contained the content known as a "joke".

Jokes from VERZIVICI, which were categorized when gathered, were handled manually, since the program for collecting jokes from different categories used the category name in the creation of the URL, which is used for scrolling between categories. For NAJVICI, we manually created a URL for gathering jokes, so we could easily access all jokes on the site. On the website MLADINSKI, the jokes were already grouped and were sequentially recorded on one side of the web page.

For the purpose of processing and subsequent manipulation, a simple VIC class was created, which contains two textual attributes, Text and Category. Both attributes store values in string format; the Text attribute holds the raw text of a joke, while Category holds the category in which the joke is classified. When capturing, we encountered redundant badges before and between texts, and unreadable machine records were created instead of some symbols due to encoding. All badges with their associated symbols, and the non-nominal groups of characters created instead of symbols, were manually entered into the program and then programmatically removed.
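The scraper itself was written in C# with HtmlAgilityPack; purely as an illustration of the same XPath-driven extraction idea, a minimal Python sketch is shown below. The URL, the XPath expression and the dictionary layout (mirroring the VIC class) are assumptions, since the markup of the source sites is not described here.

```python
import requests
from lxml import html

def scrape_jokes(url, joke_xpath, category):
    """Download one page and extract joke texts with an XPath query.

    Illustrative only: the original system was written in C# with
    HtmlAgilityPack; the XPath expression depends on the page markup.
    """
    page = requests.get(url)
    page.raise_for_status()
    tree = html.fromstring(page.content)
    jokes = []
    for node in tree.xpath(joke_xpath):
        text = node.text_content().strip()
        if text:
            # Mirror the VIC class: raw joke text plus its category label.
            jokes.append({"Text": text, "Category": category})
    return jokes

# Hypothetical usage; both the URL and the XPath are placeholders:
# jokes = scrape_jokes("http://www.verzi-vici.com/...", "//div[@class='joke']", "janezek")
```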
As a result of obtaining and processing the data from the selected sources, we received the data which are used as the basis below:
– VERZIVICI [2] – 13 categories, a total of 1729 jokes,
– NAJVICI [3] – a total of 297 jokes, and
– MLADINSKI [4] – a total of 145 jokes.

We saved the acquired data in the CSV format. Due to the characteristics of the CSV format, the comma symbol "," was changed to the symbol XX (addressed below), because in CSV the comma represents a field separator, while in jokes commas can have a different meaning. All of the jokes were in the Slovenian language, so this had to be taken into consideration during the text analysis.

3. DATA ANALYSIS
In this section, we present the methods and techniques for analyzing the jokes and the results of these analyses. The whole process of cleaning, preprocessing, and the analysis itself was done with the Python programming language and its libraries.

3.1 Cleaning and preprocessing the data
As mentioned, we use the Python programming language to process the data; a CSV file can simply be imported using the Pandas library [5]. Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language [5, 14]. The imported data is then appropriately structured using the DataFrame class with the following columns (attributes):
– Index,
– Category, and
– RawText.
The XX symbols are then removed and replaced with the comma symbol ",".
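A minimal sketch of the loading and comma-restoration step described above, assuming the scraped data was saved to a headerless file named jokes.csv with the three columns listed:

```python
import pandas as pd

# Load the scraped CSV (file name and missing header row are assumptions).
df = pd.read_csv("jokes.csv", names=["Index", "Category", "RawText"])

# Undo the comma workaround: the scraper stored "," as "XX" so that commas
# inside jokes would not be mistaken for CSV field separators.
df["RawText"] = df["RawText"].str.replace("XX", ",", regex=False)
```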
From the text, we also removed stop words – common words that do not carry any semantic meaning or information. Stop words occur in texts with high frequency but are of little significance and consequently uninteresting. A sample of stop words in the Slovenian language: "in" (En. and), "ali" (En. or), "je" (En. is), "za" (En. for), "to" (En. this), "na" (En. on), "ti" (En. you), "ko" (En. when), "bi" (En. would), "ne" (En. no), "da" (En. yes), "že" (En. already), "le" (En. only).

In addition, the punctuation was removed, so the resulting text was in the form of one sentence without the most common stop words. From the resulting text, we built a representation of every joke in a format appropriate for the analysis. We used the method of counting the frequency of individual words, called word frequency. This number was normalized by the frequency of the word across all categories, so more common words got a lower score and less common, possibly unique, words got a higher score. This process is called tf-idf (term frequency–inverse document frequency) and is a common word scoring method in text mining [17]. The new dataset was built in such a way that each identified word represented one attribute of the joke, and the corresponding value of that attribute is the tf-idf score of that word in that joke.

3.2 Classification of jokes in the predetermined categories
We used the classification machine learning technique in order to construct a classification model that would learn how to classify yet unseen jokes into one of the predetermined categories. This can be useful if one wants to automate joke categorization on an online joke portal without any need for human intervention. Classification is a supervised machine learning method, which means that the machine learns to classify jokes from already solved (classified) examples [15].

There are numerous different classification algorithms [18], but for our case we used the Support Vector Machine (SVM) classifier, developed by Vapnik [16]. This method learns the boundaries that separate instances (jokes in our case) of one category from another, by finding a linear separation border called a hyper-plane that has a maximum distance from the entire instance set, which is called the maximum margin. The instances that are closest to the hyper-plane are called support vectors. The SVM method also uses the kernel trick [19], which maps the attribute space of the classification instances to a higher dimensional space. In our case, we used a linear kernel, which uses a linear function to transform the attributes in such a way that the margin of the hyper-plane is maximized.

We used the implementation of SVM from the library liblinear [20], which has high flexibility in the choice of penalties and loss functions and should scale to large numbers of samples. This library supports both dense and sparse input, and multiclass problems are handled according to a one-vs-the-rest scheme [6].

Upon preliminary data preparation, the whole joke dataset was divided into train and test sets, where the training set is used to build the SVM classification model, and the test set is used to test the quality of the model – the ability to correctly classify yet unseen jokes. In our experiment, we applied stratified sampling to split the data and used 60% of the data for the training set and the remaining 40% for the test set. The results of the experiment show that the resulting classification model classifies test jokes with 61% accuracy: the classifier correctly classified more than half of the jokes into their proper category, out of 13 possible categories.

The default classification of instances into one of the 13 categories would result in only 0.08 accuracy, so our resulting classifier improves on the default classifier significantly. This represents a higher precision than was foreseen at first glance. Additionally, we also manually examined some of the jokes that were misclassified. Interestingly, although the predicted categories were not correct, several of the examined jokes would fit well into the predicted category as well, as the semantics of a joke is not always monolithic.
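Continuing the sketch above, the following illustrates how the described pipeline (tf-idf weighting, a stratified 60/40 split, and a linear SVM backed by liblinear) could look with scikit-learn. The abbreviated stop-word list and the reported ~61% accuracy come from the text; everything else is an assumption about the implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Abbreviated Slovenian stop-word list from the text above.
slovenian_stop_words = ["in", "ali", "je", "za", "to", "na", "ti",
                        "ko", "bi", "ne", "da", "že", "le"]

# tf-idf turns each joke into a vector of word scores (one attribute per word).
vectorizer = TfidfVectorizer(stop_words=slovenian_stop_words)
X = vectorizer.fit_transform(df["RawText"])
y = df["Category"]

# Stratified 60/40 split, as in the experiment described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)

clf = LinearSVC()  # linear-kernel SVM, implemented on top of liblinear
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))  # ~0.61 reported in the paper
```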
3.3 Word frequency analysis and visualization
From the dataset of jokes, with attributes holding the individual words' tf-idf scores, we built word cloud diagrams for every category of joke. The word clouds were made with the help of the matplotlib [21] and wordcloud libraries for the Python programming language. In a word cloud, the most common words (or rather those with higher tf-idf scores) are written in a larger font, while those with lower tf-idf scores are written in a smaller font. The color of the words only serves to make the words more differentiable and thus improves the readability of the word clouds.

These word clouds also show which highly informative words (non-stop words) are common for each category and can be used for manual classification. This way we can check whether a joke which reads "pride nekega dne k janezkovemu očetu domov nek njegov nadležen prijatelj tone potrka vpraša dober dan oče doma janezek tone ja kje janezek vem grem vprašat" was appropriately classified into a category (the original category is called "janezek", while "Solski" was predicted). As we can see in Figure 1, it is understandable that our model decided to classify the joke into the category "Solski", because the word "janezek" prevails in this category and is the dominant word in the content of the joke.

Figure 1: Word clouds for ten joke categories.

3.4 Hierarchy of the categories
With the help of the scipy [22] Python library, we also built a dendrogram of the relations between the categories using a hierarchical clustering method, which is shown in Figure 2. Here we also included the categories from the sources NAJVICI and MLADINSKI, so that we can visually display the content linkage between the different categories. The dendrogram is a hierarchical diagram which shows which terms (in our case joke categories) are closer together by putting the more similar categories closer together on the Y-axis. The more similar the categories are, the shorter the lines connecting them, and vice versa.

From the dendrogram we can see that the categories MLADINSKI (En. young ones) and SOLSKI (En. school ones) are the most similar, since school is usually attended by young people. Based on the names of the categories NAJVICI and Mesane sale (En. random jokes), it can also be assumed that these categories are very similar. From the dendrogram we can also see that the groups of categories marked by the red and green connections are very different. We can conclude that this division can be attributed primarily to slang expressions, which are more commonly used in foreign jokes as well as in older jokes.

Figure 2: Hierarchical clustering of joke categories.

3.5 Multidimensional scaling
Multidimensional scaling (MDS) enables the visualization of the level of similarity of individual cases of a dataset by lowering the number of different attributes to only two. It refers to a set of related ordination techniques used in information visualization, in particular to display the information contained in a distance matrix [7]. By using MDS from the sklearn.manifold [23] library and the mpl_toolkits.mplot3d [24] library, we can observe the relations between categories even more efficiently, as shown in the 2D graph in Figure 3. This plot shows which categories are closer together and which categories differ the most. Contrary to the dendrogram, we can see that "Mujo in Haso" is not so close to "Ciganski" and "Stari vici", but these three categories differ the most from the rest. This shows the seclusion of the three categories (the group marked in the dendrogram with red color, which includes Stari vici, Mujo in Haso and Ciganski) in relation to the other categories; they make up a kind of circle around the categories "NAJVICI" and "Mesana sale". The categories "NAJVICI" and "Mesana sale" are the closest neighbors, which also suggests an exceptional similarity between them. With the help of Figure 3 we can also see the relationships between the other categories even better; in the case of the categories "Moski" and "Zenske", we can see that, according to their content, these two are very similar categories.

Figure 3: 2D multidimensional scaling plot, which shows the similarity of different joke categories.
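The dendrogram and MDS plots described above could be produced along the following lines, again continuing the earlier sketch: one tf-idf centroid is computed per category, then scipy's hierarchical clustering and scikit-learn's MDS are applied to the centroids. This is an illustrative reconstruction, not the authors' code.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.manifold import MDS

# One tf-idf centroid per category: average the vectors of its jokes.
categories = sorted(df["Category"].unique())
centroids = np.vstack([
    np.asarray(X[(df["Category"] == c).values].mean(axis=0)).ravel()
    for c in categories])

# Dendrogram of category similarity (cf. Figure 2).
dendrogram(linkage(centroids, method="average"), labels=categories)
plt.show()

# 2D MDS embedding of the same centroids (cf. Figure 3).
coords = MDS(n_components=2, random_state=0).fit_transform(centroids)
plt.scatter(coords[:, 0], coords[:, 1])
for (x0, y0), c in zip(coords, categories):
    plt.annotate(c, (x0, y0))
plt.show()
```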
Figure 4 depicts a 3D graph of the category relations, for use in further discussion. Using the 3D graph, we can determine the differences between the categories of texts even more accurately. This display mode can turn out to be even more useful with larger amounts of data and when looking for interesting patterns in these texts.

Figure 4: 3D multidimensional scaling plot, which shows the similarity of different joke categories.

4. CONCLUSION
This paper presents a use case of machine learning methods in the analysis of short texts in the form of jokes. We presented the process of gathering, cleaning and preprocessing the jokes, followed by a description of the analysis done with machine learning methods and various visualization techniques. We demonstrated how jokes can be automatically categorized into predefined categories using the Support Vector Machine classification method. With two different visualizations, the dendrogram and the multidimensional scaling plot, we showed how different joke categories are similar to one another. With these methods, we demonstrated how different comparisons can be performed, which can serve us in the further processing of the data, and how the connections in the data can be visualized in a useful and interesting way.

In this paper, we only analyzed jokes in the Slovenian language. For future work, we could compare jokes in different languages to find similarities and differences of jokes and their popularity across different languages and cultures.

ACKNOWLEDGMENTS
The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. P2-0057).

REFERENCES
[1] http://html-agility-pack.net, Last visited: 20.8.2018.
[2] http://www.verzi-vici.com, Last visited: 5.8.2018.
[3] http://www.naj-vici.com, Last visited: 5.8.2018.
[4] http://www.mladinska.com/, Last visited: 5.8.2018.
[5] https://pandas.pydata.org, Last visited: 13.8.2018.
[6] http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html, Last visited: 13.8.2018.
[7] https://en.wikipedia.org/wiki/Multidimensional_scaling, Last visited: 20.8.2018.
[8] Song, G., Ye, Y., Du, X., Huang, X. and Bie, S., 2014. Short text classification: A survey. Journal of Multimedia, 9(5), p. 635.
[9] Chen, M., Jin, X. and Shen, D., 2011. Short text classification improved by learning multi-granularity topics. In IJCAI (pp. 1776–1781).
[10] Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H. and Demirbas, M., 2010. Short text classification in Twitter to improve information filtering. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 841–842). ACM.
[11] Dave, K., Lawrence, S. and Pennock, D.M., 2003. Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In Proceedings of the 12th International Conference on World Wide Web (pp. 519–528). ACM.
[12] Wang, Y., Liu, J., Qu, J., Huang, Y., Chen, J. and Feng, X., 2014. Hashtag graph based topic model for tweet mining. In Data Mining (ICDM), 2014 IEEE International Conference on (pp. 1025–1030). IEEE.
[13] Badra, F., Bendaoud, R., Bentebibel, R., Champin, P.A., Cojan, J., Cordier, A., Després, S., Jean-Daubias, S., Lieber, J., Meilender, T. and Mille, A., 2008. Taaable: Text mining, ontology engineering, and hierarchical classification for textual case-based cooking. In 9th European Conference on Case-Based Reasoning – ECCBR 2008, Workshop Proceedings (pp. 219–228).
[14] McKinney, W., 2012. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc.
[15] Friedman, J., Hastie, T. and Tibshirani, R., 2001. The Elements of Statistical Learning (Vol. 1, No. 10). New York, NY, USA: Springer Series in Statistics.
[16] Vapnik, V. and Mukherjee, S., 2000. Support vector method for multivariate density estimation. In Advances in Neural Information Processing Systems (pp. 659–665).
[17] Aizawa, A., 2003. An information-theoretic perspective of tf–idf measures. Information Processing & Management, 39(1), pp. 45–65.
[18] http://en.wikipedia.org/wiki/Category:Classification_algorithms, Last visited: 13.9.2018.
[19] https://en.wikipedia.org/wiki/Support_vector_machine, Last visited: 13.9.2018.
[20] https://www.csie.ntu.edu.tw/~cjlin/liblinear/, Last visited: 13.9.2018.
[21] https://matplotlib.org/, Last visited: 13.9.2018.
[22] https://www.scipy.org/, Last visited: 13.9.2018.
[23] http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html, Last visited: 13.9.2018.
[24] https://matplotlib.org/2.0.2/mpl_toolkits/mplot3d/api.html.
A Data Science Approach to the Analysis of Food Recipes

Tjaša Heričko, Sašo Karakatič, Vili Podgorelec
Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
tjasa.hericko@student.um.si, saso.karakatic@um.si, vili.podgorelec@um.si

ABSTRACT
In this paper, we explore the correlation between cuisine and text-based information in recipes. The experiments are conducted on a real dataset consisting of 9,080 recipes, with data science approaches focusing on enhancing cuisine prediction and providing a detailed insight into the characterization of food cultures. The findings suggest that information about ingredients is the most relevant predictor of cuisines; however, despite being less efficient, recipe name, preparation instructions, preparation time, skill level and nutritional facts can be considered as well.

Categories and Subject Descriptors
I.2.m [Artificial Intelligence]: Miscellaneous.
I.5.m [Pattern Recognition]: Miscellaneous.

General Terms
Algorithms, Measurement, Experimentation.

Keywords
Data Science, Machine Learning, Text Mining, Classification, Food Recipes, Cuisines.
1. INTRODUCTION
In response to technological advancements and social changes in the last decades, the tendency to collect and store recipes only in cookbooks has changed. Numerous online recipe portals started to rapidly accumulate food-related content, with more and more recipes being published online daily. The growth in the amount of user-generated recipe data available on the Internet has raised several issues that researchers have been trying to address in recent years. The objective of this paper is to explore the correlation between cuisine and text-based information in recipes, including recipe name, list of ingredients, preparation instructions, preparation time, skill level, calories and nutritional information. The results of this study address the issue of automatic recipe cuisine categorization, making it easier to submit a new recipe and preventing possible additional noise in the recipe database; this can be helpful for the contributors as well as for the curators of culinary websites.

We conducted a series of experiments on a real dataset retrieved from BBC Good Food (https://www.bbcgoodfood.com/), consisting of 9,080 recipes from various cuisines, with data science approaches focusing on the following: (1) providing a detailed insight into the characterization of various food cultures; (2) identifying the text-based information from recipes needed to perform well at cuisine prediction; (3) enhancing cuisine prediction.

This paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 presents the dataset used in our research. Section 4 describes the applied methodologies. Section 5 provides the results of our research. Section 6 concludes the paper by summarizing the main results of our work.

2. RELATED WORK
The correlation between recipes and their cuisines has been the subject of several studies related to recipe analysis. Mostly, previous studies focused on classifying recipes into their respective cuisines based on ingredients. H. Su et al. [1] evaluated data collected from Food (https://www.food.com/) and used the techniques of associative classification and support vector machines to classify 226,025 recipes into one of six cuisines, using ingredients as inputs, with a precision and recall of about 75 %. The researchers in [2–8] further studied the cuisine-ingredient connection, using 39,774 recipes from twenty cuisines provided by Yummly (https://www.yummly.com/). Similar studies were conducted on data from Epicurious (https://www.epicurious.com/) [9], Epicurious and Menupan (https://www.menupan.com/) [10], and Food, Epicurious and Yummly [11]. A variety of machine learning algorithms, including k-means [2, 9], random forest classifiers [2, 5, 6, 8, 9, 10], support vector machines [3, 5, 6, 7, 10, 11], logistic regression [4, 5, 6, 10, 11] and naive Bayes [5, 6, 7, 9, 10, 11], were used in these studies. Of the several tested algorithms, the linear support vector machine, reaching up to 80,9 % accuracy in [7], was found to be the most efficient for this cuisine prediction task based on ingredients.

Other studies focused on the importance of other information extracted from recipes for cuisine prediction. H. Kicherer et al. [12] evaluated the use of ingredients and preparation instructions for cuisine prediction, conducted on recipes from the German website Chefkoch (https://www.chefkoch.de/). The study revealed that ingredients alone are as good an indicator as the recipe instructions, whereas a combination of information from both – nouns from the instructions and the list of ingredients – performs better. T. Ozaki et al. [13] also demonstrated that, based on Japanese recipes from Cookpad (https://cookpad.com/), certain sets of ingredients and preparation actions deeply correspond to cuisine types.

Previous studies have thus already noted that ingredients reveal important information about cuisines and that predicting cuisines based on the ingredients is possible. Though, to our knowledge, few researchers have considered using additional text-based information from recipes, for instance preparation instructions, preparation time and nutrition facts, as possible attributes in cuisine prediction. Therefore, there is little understanding of how they are related to cuisine types. In contrast to the work presented above, we performed a richer analysis of recipes with a wider range of attributes extracted from recipes, whereas the dominant approach appears to deal only with ingredients as attributes.
3. DATASET
Our research was conducted on crawled data collected from the online food recipe portal BBC Good Food. A dataset of 9,429 recipes was scraped with Python (https://www.python.org/), using the Scrapy framework (https://scrapy.org/) and CSS selectors, in June 2018.

For each recipe, the following information was provided: recipe name, cuisine, list of ingredients, preparation instructions, preparation time, skill level and nutrition facts, including the amount of calories, total fat, saturated fat, total carbohydrate, sugars, protein, fiber and salt per serving. More details are presented in Table 1.

Table 1. Characteristics of text-based information in recipes

Information | Data type | Description
Recipe name | Unstructured | Arbitrary string described in natural language.
Cuisine | Categorical | One of 45 cuisine types.
List of ingredients | Unstructured | Arbitrary string depicting the ingredients needed for preparation, each ingredient normally consisting of an ingredient type, an amount and a unit.
Preparation instructions | Unstructured | Step-by-step instructions for preparation using the ingredients, described in natural language.
Preparation time | Numerical | A number representing the time in minutes needed for preparation.
Skill level | Categorical | One of 3 difficulties: easy, more effort or a challenge.
Nutrition facts | Numerical | A number representing nutrition per serving, measured in kcal for calorie intake or in grams for fat, saturated fat, carbohydrate, sugars, protein, fiber and salt.

4. METHODOLOGY
The methodology in this paper was implemented in the Jupyter notebook environment (http://jupyter.org/) running Python code and using a combination of Python libraries comprising pandas (https://pandas.pydata.org/), scikit-learn (http://scikit-learn.org/), NLTK (https://www.nltk.org/), seaborn (https://seaborn.pydata.org/), matplotlib (https://matplotlib.org/) and wordcloud (http://amueller.github.io/word_cloud/).

4.1 Data Preprocessing
For the dataset to be feasible for the analysis, preprocessing was performed on the raw scraped data.

During the data cleaning step, missing values and duplicates were resolved by removing these recipes from the original dataset, leaving a subset of 9,080 recipes.

The original dataset included 45 cuisine categories, many of which consisted of only a few recipes. In the next step of data preparation, based on the findings of previous research that cuisines are location-dependent [14], we combined smaller cuisines into bigger regional cuisine categories (e.g. Balinese, Thai, Vietnamese and Indonesian into Southeast Asian cuisine) and thereby reduced the number of cuisine categories to the following 13: African, Middle Eastern, South Asian, Southeast Asian, East Asian, Oceanic, American, Latin American, Western European, Northern European, Central European, Eastern European and Mediterranean.
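A hedged sketch of the cleaning and category-merging step with pandas; the file name, the column name and the mapping fragment are illustrative (the full 45-to-13 mapping is not reproduced here).

```python
import pandas as pd

recipes = pd.read_csv("bbc_recipes.csv")  # hypothetical file of scraped data

# Data cleaning: drop recipes with missing values and duplicates,
# leaving the 9,080-recipe subset described above.
recipes = recipes.dropna().drop_duplicates()

# Merge the 45 scraped cuisine labels into 13 regional categories.
# Only a fragment of the mapping is shown, following the paper's example.
regional = {"Balinese": "Southeast Asian", "Thai": "Southeast Asian",
            "Vietnamese": "Southeast Asian", "Indonesian": "Southeast Asian"}
recipes["cuisine"] = recipes["cuisine"].replace(regional)
```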
As highlighted in Table 1, preparation time and nutrition facts are numerical, cuisine and skill level are categorical, whereas recipe name, list of ingredients and preparation instructions are described in natural language. For all of them, additional preprocessing was needed prior to conducting the analyses. Numerical attributes were standardized, considering that certain algorithms used in our research are sensitive to varied number scales and intervals [15]. As scikit-learn algorithms only work on numerical data, categorical data needed to be encoded as numerical; this was done by converting categorical data into dummy variables [16]. For the unstructured data to be used for classification, several more text preprocessing methods were needed: tokenization, stop word removal, stemming and tf-idf term weighting. Tokenization is the process of segmenting a text into identifiable basic linguistic units called tokens, such as words and punctuation [17]. For better processing, all tokens were converted to lowercase. Stop words are frequently used common words, such as 'and', 'the' and 'this'. Because their presence in a text fails to distinguish it from other texts, and they are therefore not useful in classification, they were removed before further processing [18]. We also made a custom list of stop words, in which we included numbers that represent amounts and words that represent units, e.g. '2' and 'tbs', which would not be of value in the analysis. The same applies to punctuation, which was therefore filtered out as well. Next, stemming using the Porter stemming algorithm – the process of removing morphological affixes from words, which conflates variant forms of a word into a unified representation [19] – was performed. Lastly, for the word counts to be suitable for usage by a classifier, the tf-idf transform was applied. Tf-idf, short for term frequency times inverse document frequency, is used to re-weight a word's importance based on the frequency of the word in a document compared to its appearance in other documents [20].
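The preprocessing chain described above (tokenization, lowercasing, stop word and punctuation removal, Porter stemming, tf-idf) maps naturally onto NLTK and scikit-learn; a sketch is given below, continuing from the cleaned recipes frame of the previous sketch. Column names are assumptions, and the NLTK punkt and stopwords resources must be downloaded once.

```python
import pandas as pd
from nltk.tokenize import word_tokenize       # requires nltk.download("punkt")
from nltk.corpus import stopwords             # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

stemmer = PorterStemmer()
# English stop words plus a custom list of amounts and units, as in the paper.
custom_stop = set(stopwords.words("english")) | {"2", "tbs"}

def preprocess(text):
    # Tokenize, lowercase, drop stop words and punctuation, then stem.
    tokens = [t.lower() for t in word_tokenize(text)]
    tokens = [t for t in tokens if t.isalpha() and t not in custom_stop]
    return " ".join(stemmer.stem(t) for t in tokens)

ingredients = recipes["ingredients"].apply(preprocess)   # column name assumed
X_text = TfidfVectorizer().fit_transform(ingredients)    # tf-idf term weighting

# Numerical attributes standardized, categorical ones encoded as dummies.
X_num = StandardScaler().fit_transform(recipes[["prep_time_min", "kcal"]])
skill = pd.get_dummies(recipes["skill_level"])
```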
4.2 Exploratory Data Analysis
To get an overall view of the data, an exploratory data analysis was performed on the preprocessed data using graphs, word clouds and tables. Visualization was especially used to provide clarity on the characterization of the various cuisines.

4.3 Classification
Various classification algorithms were used to perform the cuisine prediction based on the information from the recipes. The recipe dataset was randomly divided into a training (75 %) and a testing set (25 %). The training set was used to train the models, while the test set was used to assess them.

4.3.1 Naive Bayes
Naive Bayes is based on applying Bayes' theorem with the naive independence assumption between every pair of features. Gaussian naive Bayes assumes the probability of the features is Gaussian. Multinomial naive Bayes implements the algorithm for text classification [21].

4.3.2 Support Vector Machine
A linear support vector machine constructs a hyper-plane or a set of hyper-planes in a high or infinite dimensional space using linear algebra [22].

4.4 Evaluation Metrics
To measure classification performance, the following metrics were used: accuracy and F-score. Accuracy is the percentage of correct predictions. F-score is a weighted average of precision and recall, where precision represents the ability of the classifier not to label as positive a sample that is negative, and recall the ability of the classifier to find all the positive samples [23].
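The evaluation protocol (75/25 split, multinomial naive Bayes on the tf-idf ingredient vectors, accuracy and weighted F-score) could be sketched as follows, continuing the earlier snippets; the ~74 % figure reported in Section 5 is the paper's result, not a guaranteed output of this sketch.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(
    X_text, recipes["cuisine"], test_size=0.25, random_state=0)

nb = MultinomialNB().fit(X_train, y_train)
pred = nb.predict(X_test)

print(accuracy_score(y_test, pred))              # ~0.74 reported for ingredients
print(f1_score(y_test, pred, average="weighted"))  # weighted F-score
```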
5. RESULTS
As an initial step, we carried out an exploratory data analysis to get a better understanding of the cuisines and their characteristics. Table 2 lists the average preparation time and calories per serving for each cuisine. Given the analysis, recipes from Northern Europe, the Middle East and Western Europe take the longest to prepare, whereas recipes from East Asia, Latin America and Southeast Asia are generally the quickest to prepare. Furthermore, on average, Mediterranean, Oceanic and American cuisines are high in energy; on the contrary, Southeast Asian, East Asian and South Asian cuisines have recipes with lower energy values.

To give an idea of the ingredients that form an integral part of each cuisine, we extracted the most common ingredients in every cuisine and visualized unigrams from the ingredient list in word clouds. As detailed in Table 2, many ingredients are frequent in all the cuisines, e.g. oil and onion; hence, these will not be useful for prediction, while others are typically used only in certain cuisines, e.g. soya sauce and clove. Figure 1 represents a word cloud consisting of the most common unigrams extracted from the ingredient list for East Asian cuisine. Although the most common ingredients did not give us much insight, these word clouds do show some typical ingredients based on which cuisines can be distinguished from one another, e.g. sugar, flour, milk, cream, chocolate, egg, mayonnaise, butter in American cuisine, and soy sauce, rice, ginger, soy, chili in East Asian cuisine.

Figure 1. Word cloud for East Asian cuisine

Table 2. Overview of the cuisines

Cuisine | Common ingredients | Average preparation time [min] | Average calories [kcal]
African | oil, onion, lemon, clove, coriander | 51,73 | 399,68
Middle Eastern | oil, onion, tomato, garlic, clove | 76,67 | 409,11
South Asian | onion, oil, coriander, chili, clove | 53,74 | 367,50
Southeast Asian | sauce, lime, chili, oil, sugar | 45,00 | 350,78
East Asian | sauce, oil, onion, chili, rice | 40,49 | 363,18
Oceanic | sugar, oil, egg | 60,36 | 430,70
American | sugar, butter, oil, flour, egg | 57,68 | 422,37
Latin American | onion, oil, chili, coriander, lime | 43,17 | 399,30
Western European | sugar, oil, butter, egg, flour | 66,59 | 394,85
Northern European | oil, sugar, onion, egg, cream | 119,59 | 374,61
Central European | sugar, butter, egg, flour, oil | 62,73 | 402,85
Eastern European | oil, butter, egg, flour, garlic | 57,96 | 390,04
Mediterranean | oil, garlic, clove, tomato, onion | 48,68 | 433,36

Cuisines also differ on nutrition facts. In Figure 2, the average value of each nutrient per serving is presented for every cuisine.

Figure 2. Nutrition facts for cuisines

In the next step, classification algorithms were applied to identify which text-based information from recipes is needed to perform well at cuisine prediction. A classification with multinomial naive Bayes, based on the list of ingredients, proved to be the most efficient; this model yielded an accuracy of 73,8 %. Less than 1 % lower was the accuracy obtained with classification based on the recipe name, and more than 2 % lower based on the preparation instructions. Classifications based on skill level, preparation time, calories and nutritional information all performed with an accuracy of about 56 %. The classification performance based on accuracy and F-score is summarized in Table 3.

Table 3. Results of classification

Information | Classifier | Accuracy | F-score
Recipe name | Multinomial naive Bayes | 72,73 % | 72,73 %
List of ingredients | Multinomial naive Bayes | 73,83 % | 73,83 %
Preparation instructions | Multinomial naive Bayes | 70,97 % | 70,97 %
Preparation time | Gaussian naive Bayes | 55,29 % | 55,29 %
Preparation time | Linear SVM | 55,68 % | 55,68 %
Skill level | Linear SVM | 56,12 % | 56,12 %
Calories | Gaussian naive Bayes | 55,68 % | 55,68 %
Calories | Linear SVM | 55,68 % | 55,68 %
Nutritional information | Gaussian naive Bayes | 53,48 % | 53,48 %
Nutritional information | Linear SVM | 57,00 % | 57,00 %

6. CONCLUSION
Thousands of recipes from various cuisines were analyzed with data science approaches, with the objective of providing a deeper understanding of culinary cultures and cuisine prediction. While previous research efforts have mostly used only ingredients for cuisine prediction, our findings demonstrate that other text-based information extracted from recipes can be used as well. While ingredients, with an obtained accuracy of almost 74 %, remain the most efficient, cuisine prediction from the recipe name and preparation instructions also performs well, whereas predictions based on preparation time, skill level and nutrition facts were discovered to be less effective, with about 56 % accuracy.

7. REFERENCES
[1] H. Su, M. K. Shan, T. W. Lin, J. Chang, and C. T. Li, "Automatic recipe cuisine classification by ingredients," Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct Publication – UbiComp '14 Adjunct, pp. 565–570, 2014.
[2] S. Srinivasasubramanian, B. Kushwaha, and V. Parekh, "Identifying Cuisines From Ingredients," 2015. [Online]. Available: https://pdfs.semanticscholar.org/3daa/3c535a3c2580e69984203137db3ee6422601.pdf. Accessed on: August 16, 2018.
[3] P. Bhat, S. Gupta, and T. Nabar, "Bon Appetite: Prediction of cuisine based on Ingredients." [Online]. Available: http://cseweb.ucsd.edu/~jmcauley/cse255/reports/fa15/020.pdf. Accessed on: August 16, 2018.
[4] H. H. Holste, M. Nyayapati, and E. Wong, "What Cuisine? – A Machine Learning Strategy for Multi-label Classification of Food Recipes," 2015. [Online]. Available: http://jmcauley.ucsd.edu/cse190/projects/fa15/022.pdf. Accessed on: August 16, 2018.
[5] R. S. Verma, and H. Arora, "Cuisine Prediction/Classification based on ingredients." [Online]. Available: http://cseweb.ucsd.edu/~jmcauley/cse255/reports/fa15/028.pdf. Accessed on: August 16, 2018.
[6] R. Ghewari, and S. Raiyani, "Predicting Cuisine from Ingredients." [Online]. Available: http://cseweb.ucsd.edu/~jmcauley/cse255/reports/fa15/029.pdf. Accessed on: August 16, 2018.
[7] S. Kalajdziski, G. Radevski, I. Ivanoska, K. Trivodaliev, and B. R. Stojkoska, "Cuisine classification using recipes ingredients," 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2018.
[8] R. M. R. V. Kumar, M. A. Kumar, and K. P. Soman, "Cuisine Prediction based on Ingredients using Tree Boosting Algorithms," Indian Journal of Science and Technology, vol. 9, no. 45, Aug. 2016.
[9] T. Arffa, R. Lim, and J. Rachleff, "Learning to cook: An exploration of recipe data." [Online]. Available: https://pdfs.semanticscholar.org/3f63/269aa7910774e9386b1ffb340a9e8638c02d.pdf. Accessed on: August 16, 2018.
[10] J. Naik, and V. Polamreddi, "Cuisine Classification and Recipe Generation," 2015. [Online]. Available: https://pdfs.semanticscholar.org/aaa9/67ce597961bad308ec137a6169e1aba1fe35.pdf. Accessed on: August 16, 2018.
[11] S. Jayaraman, T. Choudhury, and P. Kumar, "Analysis of classification models based on cuisine prediction using machine learning," 2017 International Conference On Smart Technologies For Smart Nation (SmartTechCon), pp. 1485–1490, 2017.
[12] H. Kicherer, M. Dittrich, L. Grebe, C. Scheible, and R. Klinger, "What you use, not what you do: Automatic classification and similarity detection of recipes," Data & Knowledge Engineering, 2018.
[13] T. Ozaki, X. Gao, and M. Mizutani, "Extraction of Characteristic Sets of Ingredients and Cooking Actions on Cuisine Type," 2017 31st International Conference on Advanced Information Networking and Applications Workshops (WAINA), pp. 509–513, 2017.
[14] K. J. Kim, and C. H. Chung, "Tell Me What You Eat, and I Will Tell You Where You Come From: A Data Science Approach for Global Recipe Data on the Web," IEEE Access, vol. 4, pp. 8199–8211, 2016.
[15] Scikit-learn, "sklearn.preprocessing.StandardScaler." [Online]. Available: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html. Accessed on: August 21, 2018.
[16] Pandas, "pandas.get_dummies." [Online]. Available: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html. Accessed on: August 21, 2018.
[17] NLTK, "NLP with Python – Processing Raw Text." [Online]. Available: http://www.nltk.org/book/ch03.html. Accessed on: August 21, 2018.
[18] NLTK, "NLP with Python – Accessing Text Corpora and Lexical Resources." [Online]. Available: https://www.nltk.org/book/ch02.html. Accessed on: August 21, 2018.
[19] NLTK, "NLTK HOWTOs – Stemmers." [Online]. Available: http://www.nltk.org/howto/stem.html. Accessed on: August 21, 2018.
[20] Scikit-learn, "Feature extraction." [Online]. Available: http://scikit-learn.org/stable/modules/feature_extraction.html. Accessed on: August 21, 2018.
[21] Scikit-learn, "Naive Bayes." [Online]. Available: http://scikit-learn.org/stable/modules/naive_bayes.html. Accessed on: August 21, 2018.
[22] Scikit-learn, "Support Vector Machines." [Online]. Available: http://scikit-learn.org/stable/modules/svm.html. Accessed on: August 21, 2018.
[23] Scikit-learn, "Classification metrics." [Online]. Available: http://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics. Accessed on: August 21, 2018.
Introducing Blockchain Technology into a Real-Life Insurance Use Case

Aljaž Vodeb, Martin Chuchurski, Mojca Orgulan, Tadej Rola, Žan Žnidar, Muhamed Turkanović
Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
Aljaž Tišler
Faculty of Economics and Business, University of Maribor, Maribor, Slovenia
Tea Unger
Faculty of Law, University of Maribor, Maribor, Slovenia
aljaz.vodeb@student.um.si, aljaz.tisler@student.um.si, martin.chuchurski@student.um.si, mojca.orgulan@student.um.si, tadej.rola@student.um.si, tea.unger@student.um.si, zan.znidar@student.um.si, muhamed.turkanovic@um.si

ABSTRACT
The paper presents an analysis of a possible introduction of the blockchain technology into an insurance business use case. The analysis focuses on the implications such an attempt can have from various standpoints, and on the technical workarounds needed for a prototype to be implemented.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software

General Terms
Performance, Economics, Reliability, Experimentation, Security, Legal Aspects, Verification.

Keywords
Blockchain; Smart contracts; Ethereum; Insurance
1. INTRODUCTION
Blockchain technology is nowadays considered the new IT revolution and even the messiah for all IT-based problems. Nevertheless, as with other innovative technologies, the public's hype about the technology is fading. Experts now know that the technology is useful only for specific domains and use cases, such as public, virtual and untrusted environments or cryptocurrency-based scenarios. Nonetheless, the media is full of articles and news about corporations and companies using blockchain technology for some specific use case, which may or may not be fully meaningful. The result of such news is rising prices of cryptocurrencies and, more importantly, rising stock prices of organisations [1]. The outcome of such approaches varies: (1) proposals and prototypes of blockchain-based use cases which unnecessarily use this technology, (2) prototypes which are consistent with the technology's purpose but are unpractical and not user-friendly, and (3) failed attempts to produce a practical prototype or a production system. In this article, we explore the possibility of introducing the blockchain technology in an insurance-based use case. The aim was to explore the possible reasonableness of such a use case and its possible restrictions, limitations, advantages and disadvantages. The focus of the paper is thus on the implications of such a use case on all related processes and the overall picture of a possible implementation.

2. BLOCKCHAIN
A blockchain is an invention that can be seen as a distributed ledger of all transactions or events that have been executed and shared among distributed participants. All transactions are verified by distributed consensus inside the system. Considering basic blockchain platforms, once a transaction is recorded, it cannot be removed [2]. A group of verified transactions is stored in a block. Each block contains a cryptographic hash of the previous block and a timestamp. Each newly linked block strengthens the integrity of the previous one, making the chain extremely tamper-resistant and secure. With a public blockchain, a copy of the entire transaction database (ledger) is distributed to the network, and every person can view transactions and even participate in the consensus process. Blockchain enables a more effective way to solve the virtual currency problem: it solves it in a distributed manner, without the need for a central authority [3]. A central authority represents costs and must be trusted to act honestly.

A public blockchain is not the only possible type of blockchain platform; there are also private and consortium blockchains [4]. Private blockchains keep write permission centralized to one organization, which can be useful for a single company for database management, auditing, etc. In a consortium blockchain, partner companies are joined together in a trusted and adaptable network. The right to read in such blockchain types may be public or restricted to the participants.
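To make the hash-linking described above concrete, here is a deliberately simplified, illustrative Python sketch of a block structure; real platforms add distributed consensus, Merkle trees and much more.

```python
import hashlib
import json
import time

def make_block(transactions, prev_hash):
    """A toy block: a timestamp, transactions and the previous block's hash."""
    block = {"timestamp": time.time(),
             "transactions": transactions,
             "prev_hash": prev_hash}
    # The block's own hash covers its full content, including prev_hash,
    # so changing any earlier block breaks every later link.
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

genesis = make_block(["tx0"], "0" * 64)
second = make_block(["tx1", "tx2"], genesis["hash"])
# Tampering with `genesis` changes its hash and invalidates `second`.
```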
2.1 Smart contracts
The concept of the smart contract has been known since 1994, when Nick Szabo defined it as a "computerized transaction protocol that executes the terms of a contract". In the blockchain context, smart contracts are stored on the blockchain; they can be thought of as analogous to stored procedures in relational databases. Given that smart contracts are deployed on the blockchain, they have their own unique addresses. A smart contract is invoked by executing a transaction to the unique address of the contract; it is then executed independently and automatically on each node in the network [8]. The contract has its own state and can manage assets on the ledger, and it allows expressing business logic within programming code. A well-written smart contract should describe all the possible outcomes of the contract, which means that a function should refuse to execute in the case of incorrect parameters (inconsistent with the business logic) [8]. Smart contracts are deterministic: the same input will always produce the same output. When implementing smart contracts on known platforms (e.g., Ethereum), written for example in the Solidity programming language, the developer is prevented from writing non-deterministic contracts, since the programming language does not contain non-deterministic constructs. All communication with a smart contract is done through cryptographically signed transactions, which means that all blockchain stakeholders receive a cryptographically verified trace of a contract's operations.

2.2 Oracles
Smart contracts on the Ethereum blockchain platform run within the Ethereum ecosystem, where they communicate with each other. External data can only enter the blockchain (i.e. smart contracts) through external interaction using a transaction. This is also a shortcoming of the platform, because the majority of business logic is based on external data, which is thus not part of the blockchain ledger (e.g., weather, currency prices) [9]. To overcome this shortcoming, an oracle can be used. An oracle is a trusted data source that sends external data to a smart contract in the form of a transaction; by doing so, it relieves the smart contract of the need to directly access the desired data outside of the network. Oracles are usually offered as a third-party solution [8].

The oracle service behaves like a data courier, where the communication between the service and the smart contract is asynchronous. First, a transaction executes a function within the smart contract in which the instructions for the service are sent. The oracle service then obtains a result based on the given parameters, and this result is returned to the smart contract via a special function (a callback) implemented in the main smart contract that requested the data from the service [9].
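Both the direct invocation described in Section 2.1 and the oracle callback in Section 2.2 boil down to sending a signed transaction to the contract's address. A sketch with the web3.py library (v6-style names; exact names vary slightly between versions) is shown below; the node URL, contract address, ABI and the insureBaggage function are all placeholders, not part of the described prototype.

```python
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # node URL assumed

CONTRACT_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder
CONTRACT_ABI = [{  # placeholder fragment of an ABI produced at compile time
    "name": "insureBaggage", "type": "function", "stateMutability": "payable",
    "inputs": [{"name": "bagTag", "type": "string"}], "outputs": []}]

contract = w3.eth.contract(address=CONTRACT_ADDRESS, abi=CONTRACT_ABI)

# Invoking a smart contract means sending a signed transaction to its
# unique address; every node then executes the called function.
tx_hash = contract.functions.insureBaggage("BAG-123").transact({
    "from": w3.eth.accounts[0],
    "value": Web3.to_wei(0.01, "ether"),  # the hypothetical insurance premium
})
receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
```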
3. USE CASE
To test the concept of introducing the blockchain technology into a real-life business use case, we chose the insurance domain, which is also one of the promising domains for the blockchain technology [5]. A preliminary result of a market analysis has shown that a possibly meaningful, but not yet implemented, use case would be lost baggage insurance. This specific real-life use case nowadays still represents long-term problems for passengers and airlines. To make it as user-friendly and meaningful as possible, an app was envisaged. The key functionalities of such an app, as presented in Figure 1, would be: (1) the user scans the QR code of the flight ticket, (2) confirms the read data, (3) scans the barcode of the baggage, (4) acknowledges the terms of the smart contract, and (5) receives info about the possible payout.

With the help of RFID trackers at the airports, the system would be able to track the position of a passenger's baggage, based on the newly confirmed IATA Resolution 753. In the case of lost or delayed baggage, a blockchain-based smart contract is activated. Compensation could be given in crypto or fiat currencies (e.g. ETH, EUR), within 4 levels of payout.

Figure 1: Poster for a possible lost baggage insurance.

4. IMPLICATIONS
This section provides the implications of a possible implementation of a blockchain-based solution, as presented in Section 3, on three domains: legal, economic and organizational.

4.1 Legal implications
Blockchain technology as presented in Section 3 raises some legal issues. The main legal question concerns the General Data Protection Regulation (GDPR). The GDPR is a legal framework for personal data privacy, written by the European Union (EU), which became effective on May 25th, 2018. This framework is drastically changing the business of any digital venture. The Regulation granted EU citizens new rights, e.g., the right to be forgotten and the right to request all data storage and acquisition links. The former allows an individual to ask an organization to delete all the personal data it stores about them. This specific right is also the main problem for blockchain technology: blockchain relies on the principles of decentralization and immutability, which means that data stored on the ledger cannot be deleted. When this data includes personal data, we have a problem in the GDPR area. This is the main implication in this domain, since the use case we worked on required the processing of personal data. The main question is thus how to process personal data with the blockchain while still being able to delete it if needed, or how to process it outside the blockchain. Research shows that many experts are trying to find a solution [7]. The majority of the solutions are focused on the off/on-chain paradigm, whereby personal data is never dealt with on the blockchain. Nonetheless, new problems arise, such as how to link off- and on-chain data, and whether the link itself is a GDPR violation.
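One common form of the off/on-chain paradigm mentioned above is to keep personal data in an erasable off-chain store and put only a salted fingerprint on the ledger. The sketch below illustrates the idea; whether deleting the off-chain record satisfies the GDPR right to erasure is exactly the open legal question discussed here, and all names are illustrative.

```python
import hashlib
import os
import sqlite3

db = sqlite3.connect("insurance_offchain.db")  # conventional, erasable storage
db.execute("CREATE TABLE IF NOT EXISTS personal (user_id TEXT, salt BLOB, data TEXT)")

def store_personal_data(user_id, personal_json):
    """Keep personal data off-chain; return a salted fingerprint for the ledger."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + personal_json.encode()).hexdigest()
    db.execute("INSERT INTO personal VALUES (?, ?, ?)",
               (user_id, salt, personal_json))
    return digest  # only this opaque value would be written to the blockchain

def forget_user(user_id):
    # Erasure request: delete the off-chain record (and its salt). The
    # on-chain hash remains, but can no longer be linked to a person.
    db.execute("DELETE FROM personal WHERE user_id = ?", (user_id,))
```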
4.2 Economic
The main goal of the solution is to enable air passengers to sign an ad hoc luggage insurance, which is tied to an airline ticket. The blockchain technology would be used for the insurance coverage and the payout of an insurance premium. The solution should allow the payment of the insurance coverage through cryptocurrencies, to reach the biggest customer coverage. It is a new business model, where the target group are all airline users.

The biggest negative factor associated with the possible solution is the volatility of cryptocurrencies. In practice, this represents the possibility of losing some assets, whether as a customer or as an airline. In addition to volatility, problems can occur with certain processing delays. The application itself is also linked to airline and airport data; if that system fails, automatic payment is not possible, nor can the insurance be concluded. From an economic point of view, the application also brings many positive aspects: it introduces the possibility of speeding up the rigid process of current luggage insurance and redress. The cost of maintaining a blockchain network and smart contracts is not negligible; these costs can be covered through an annual contribution from the airlines for their usage of such a solution, while at the same time a certain percentage can be collected from each insurance. The economic advantages of such a solution are many: (1) the introduction of new technology, (2) the possibility of ad hoc insurance, and (3) a new business model.
4.3 Organizational
One of the main problems of a possible solution is its organizational structure. For it to make sense, a platform should be implemented where all willing airlines could register and provide baggage insurance to all possible consumers. Each airline can and should have a partnership with an insurance company. Thus, to complete the registration, the airlines must provide their insurance price and maximum payout in the case of lost baggage. Furthermore, the solution must be automatic and enable easy baggage checks and insurance claims. A simplification of such a requirement comes with the IATA Resolution 753, which states that by June 2018 airline members must be able to, among other things, demonstrate delivery of baggage when custody changes [6]. This furthermore implies that the ecosystem must include airports, which will provide the aforementioned data about the status of the baggage. Technically, a link to a web service is required, where data about the baggage is accessible.

5. PROTOTYPING
It should be emphasized that blockchain technology is a rather unexplored field; in most cases there are no examples of good practice on how the introduction of blockchain should start. After analyzing the possible use case and its implications, we propose a prototype in the form of a decentralized application (dApp), based on Ethereum smart contracts. The front end of the solution could be a simple Angular 2 web application with an intuitive, user-friendly interface, accessible on multiple devices. The main advantage of using a web application, as opposed to device-specific applications, is the support for various operating systems and models. If a user selects to pay with cryptocurrency, he/she can use the MetaMask plugin to connect to the Web3 part of the application and send a signed transaction to a smart contract on the blockchain. According to the GDPR, personal information needs to be delible; therefore, it should be stored in a separate database off-chain, accessible through an API. Such an architecture can be achieved by storing airline information off-chain and only non-identifying user insurance data on the blockchain.

Figure 2 presents the architecture of the possible solution. Users connect to the service through a dApp, with the option to pay with crypto or fiat currencies. For clarity, the former option is marked with the letter (a) and the latter with (b). Two blockchains are used: Ethereum's MainNet to process payment transactions, and our InsurNet (a private Ethereum network) for the business logic. Crypto transactions are first processed on the MainNet (2a), where an oracle is triggered to convert the value into fiat (2.1a) before sending it to the InsurNet (2.2a), whereas fiat requests are processed directly through the API and, if successful, forwarded to the InsurNet (2b) to create the insurance (smart) contract. The InsurNet smart contract uses an oracle deployed at an airline to retrieve the status of the baggage (3.1 and 3.2) before processing the business logic to determine the validity of the claim. If the user is entitled to a payout, the payout oracle is called (4) to determine the correct payment method and convert the currency if needed. In case the user paid in cryptocurrency (5a), the payout is processed on the MainNet (6a); otherwise, the fiat payout is handled off-chain (5b).

Figure 2: Architectural model of the proposed solution.

6. DISCUSSION
Due to the Ethereum protocol, where every transaction must be validated by miners and added to a block, transactions can be processed slowly. When a user pays the insurance with the cryptocurrency Ether into the smart contract on the MainNet and the transaction is confirmed, the function in our smart contract triggers an event, which we can listen for from outside, in our dApp; we detect the event only once the transaction is confirmed. When our server detects the "Paid" event from the MainNet, it creates a new smart contract on our private blockchain, InsurNet. This is reflected in some latency for the user. Along with the aforementioned oracle, we have two more: one verifies the location of the luggage, while the other processes the payment when the event is triggered on InsurNet.

Consider the following example, where the user pays the insurance for one piece of luggage in cryptocurrency, and assume the average time to validate a transaction on the MainNet is 25 seconds. The user transfers the cryptocurrency to our smart contract, where the validation of this transaction takes 25 seconds. Then, on the triggered event, an oracle performs a new transaction on our network, where the transaction validation time is set to 10 seconds. Because the user still does not have the luggage three hours after landing, he performs a payout request using the dApp; this transaction is done within 10 seconds. An oracle then performs a new transaction to write the current location information into the smart contract (+10 seconds). Since the baggage is still not available, the user is entitled to a payout, which is reflected in a new event, whereupon an oracle performs a transaction on the MainNet. The validation of this transaction takes 25 seconds. Thus, it takes at least 80 seconds for all the transaction validations to complete.
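The event-driven flow from the discussion above, with the server reacting to a confirmed "Paid" event on the MainNet, could be sketched with web3.py as follows. Method and argument names vary slightly between web3.py versions, and the addresses, ABI fragment and event fields are placeholders.

```python
import time
from web3 import Web3

PAYMENT_ADDRESS = "0x0000000000000000000000000000000000000000"  # placeholder
PAYMENT_ABI = [{  # placeholder fragment: the MainNet payment contract's event
    "name": "Paid", "type": "event", "anonymous": False,
    "inputs": [{"name": "buyer", "type": "address", "indexed": True}]}]

mainnet = Web3(Web3.HTTPProvider("https://mainnet.node.example"))  # node URLs
insurnet = Web3(Web3.HTTPProvider("http://insurnet-node:8545"))    # are assumed

payment = mainnet.eth.contract(address=PAYMENT_ADDRESS, abi=PAYMENT_ABI)

# A "Paid" event becomes visible only once the paying transaction has been
# mined, which is the source of the latency discussed above.
paid_filter = payment.events.Paid.create_filter(fromBlock="latest")

while True:
    for event in paid_filter.get_new_entries():
        buyer = event["args"]["buyer"]
        # React off-chain: deploy/initialize the insurance smart contract
        # for this buyer on the private InsurNet chain, e.g. via a signed
        # insurnet transaction (omitted in this sketch).
        print("Paid event from", buyer)
    time.sleep(2)  # simple polling loop
```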
A Brief Overview of Proposed Solutions to Achieve Ethereum Scalability

Blaž Podgorelec, Patrik Rek, Tadej Rola, Muhamed Turkanović
Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
blaz.podgorelec@um.si, patrik.rek@um.si, tadej.rola@student.um.si, muhamed.turkanovic@um.si

ABSTRACT
Blockchain technology is part of Gartner's top technological trends for the following five years, already moving away from the peak of inflated expectations on its hype cycle towards the slope of enlightenment. With the development of blockchain technology, the emergence of completely new business processes is anticipated, as well as changes to existing business processes, which will include the use of blockchain technology in their implementation, partially or completely, thereby taking advantage of the benefits that the technology itself offers. Nevertheless, the technology has several drawbacks, of which the most vivid is the scalability problem. With the introduction of Blockchain 2.0 and the Ethereum platform, the scalability problem seemed settled for a moment, which proved otherwise with the first generations of non-fungible tokens and high traffic. Although Ethereum is in its infancy, progress is well under way, with this year's focus on the infrastructure. A lot of research and work is being done on Ethereum's layer 2 scaling solutions, such as state channels, plasma and sharding. This paper presents a brief overview of the current state of the mentioned proposed solutions and of some ongoing projects which are focused on their implementation.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software

General Terms
Performance, Design, Reliability, Experimentation, Security

Keywords
Blockchain, scalability, Ethereum, channels, plasma.
1. INTRODUCTION
In recent years, on the basis of the increase in the market capitalization [1] of the Ethereum platform, whose performance is based entirely on blockchain technology, we can conclude that it is becoming increasingly popular. The increase in popularity consequently increases the number of transactions performed within the Ethereum blockchain network [2], whereby we can assume that the number of business processes implemented with the help of blockchain technology and Ethereum is also increasing.
All transactions transmitted on the blockchain network are irreversibly recorded in a shared ledger among all network nodes [3, 4]. Nodes in the blockchain network perform a protocol defining the ability to create new blocks with associated transactions in an approximately 15-second time frame. This allows the frequency of transactions executed in the network to be approximately 7-15 transactions per second (tp/s) [5]. The open source Ethereum platform is based on a permissionless and publicly accessible blockchain network, which is at the same time a distributed and decentralized operating system for running smart contracts via its Ethereum Virtual Machine (EVM). Because of the platform's indigenous cryptocurrency, called Ether, generated by the blockchain network and defined by the protocol, the platform is often used as a payment system, like Bitcoin. Therefore it is often compared to existing non-crypto payment solutions, such as Visa, which, unlike the Ethereum platform, is capable of processing a much larger number of transactions (56,000 tp/s) [6].
In this paper, we present the problem of scaling the Ethereum network and the proposed solutions. These solutions could increase the number of transactions carried out on the Ethereum platform, thus approaching or exceeding the processing capacity of existing non-crypto payment systems. This would enable the development and implementation of new business processes with blockchain technology.

2. ETHEREUM SCALING PROBLEM
The current implementation of the Ethereum protocol requires the processing of all transactions transmitted within the network, as well as the storage of all states, by each node in the network that acts as a validator [7]. To confirm a change of the network state with a transaction, the transaction must be included in a block created by a node, which must solve the calculation puzzle defined by the distributed consensus protocol, which in the current Ethereum version is Proof of Work (PoW). The processing speed of transactions is limited by the capacity of each individual node participating in the network as a transaction validator. Such an implementation of the protocol provides increased safety in terms of secure processing of transactions within the network, which is one of the key properties of such systems. At the same time, the way in which this increased security is achieved is a major obstacle to achieving a greater number of transactions within the blockchain network, due to its need for heavy computation [8].
The number of transactions one block can include is limited by the amount of gas (the fee for processing the operations within a transaction) that can be consumed by all transactions in the block.
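A back-of-the-envelope calculation illustrates how the block gas limit caps throughput. The figures below are assumptions roughly matching the 2018 situation (an 8 million block gas limit, 21,000 gas for a simple value transfer), not data from this paper; real blocks mix in contract calls that consume far more gas, which is one reason the observed rate of 7-15 tp/s is lower than this ceiling.

# Illustrative tp/s ceiling implied by the block gas limit (assumed values).
BLOCK_GAS_LIMIT = 8_000_000   # assumed block gas limit
TRANSFER_GAS    = 21_000      # intrinsic gas of a simple value transfer
BLOCK_TIME_S    = 15          # approximate PoW block interval from the text

tx_per_block = BLOCK_GAS_LIMIT // TRANSFER_GAS   # ~380 simple transfers
tps = tx_per_block / BLOCK_TIME_S                # ~25 tp/s upper bound
print(f"{tx_per_block} transfers per block, ~{tps:.0f} tp/s at best")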
In the future, it is possible to expect a change in the way consensus is reached between the individual nodes in the Ethereum network. Namely, the transition to the Proof of Stake (PoS) protocol is planned, which would mean that the block generation time within the Ethereum network could be reduced to an average of four seconds [5]. The transition to a new protocol for reaching consensus among the nodes will thus reduce the current scaling problems. In addition, the switch to PoS distributed consensus will decrease the required computational power and thus the energy consumption of the network. Changing the network consensus protocol will have a positive effect on the transaction processing frequency within the blockchain, but it is expected that the number of processed transactions will still be significantly smaller compared to existing payment systems.
Assuming knowledge of the blockchain's structure and an understanding of the concepts of the technology, the described efficiency problems offer so-called "simple" theoretical solutions, such as:
1. The use of different "altcoins" within a variety of separate blockchain networks, which results in a strong increase in the throughput of individual transactions within the separate networks. As a result, due to the increased number of different blockchain networks, a reduced number of nodes within each network is expected, which would mean that separate blockchain networks will be more susceptible to attacks by malicious nodes than if all network nodes were merged within a single common blockchain network [9, 10].
2. Increasing the limit on the number of transactions per block, or increasing the gas consumption ceiling in the case of the Ethereum protocol, theoretically implies a larger number of processed transactions. Nevertheless, this requires significantly more computational power (when using the PoW protocol, or a larger stake when using the PoS protocol) for an individual node in the network to validate a block with an increased number of transactions [9, 11].
3. Combining computational power (when using the PoW protocol) or stake (when using the PoS protocol) between different blockchain networks can theoretically increase the throughput of transaction processing, but this could burden each individual node due to the need to process the transactions of the combined blockchain networks [12].
The described "simple" solutions directly relate to the so-called trilemma of blockchain technology, which says that a blockchain network can provide only two of the following three features:
- Decentralization
- Scalability
- Security
In the case of using different altcoins, this would mean increasing the efficiency (scalability) of transaction processing within the blockchain network, but in turn reducing the security of the network itself. Increasing the limit on the number of transactions in a single block, or aggregating computational power or stake between different blockchain networks, would theoretically increase efficiency (scalability), but would require greater use of computational power from the network nodes for processing all requirements within the blockchain network. This reduces the possibility of equal participation in the network by nodes with less computational power, which can lead to a reduction in the decentralization of the blockchain network in favor of nodes with greater computing power [8].
In the following chapters, we present some solutions that could solve the described efficiency problem without affecting the described properties of the trilemma of blockchain technology.
3. PROPOSED SOLUTIONS
The main concern of blockchain technology is security and distributed consensus in a decentralized network. The processing of every transaction by all nodes of the network is a process that provides these characteristics, but it does not leave much room for increasing efficiency and scalability. Below we describe some already proposed solutions which can help increase the efficiency and scalability of the Ethereum blockchain network without undermining the security and decentralization of the network as such.

3.1 State channels
One of the proposed solutions, currently considered the most mature and widely used, is based on processing transactions outside the blockchain network (i.e. off-chain) through the establishment of state channels [13]. The proposal derives from the so-called payment channels, the purpose of which was to allow multiple micro-transactions between two users of the system without the need to transmit each transaction through the blockchain network [14]. While payment channels focus on off-chain processing of payment transactions, the purpose of state channels is to establish a channel through which the state can be changed outside the blockchain network, between predefined participants [15]. This is because the Ethereum blockchain holds the state of each defined variable of every deployed smart contract. The need to process a transaction within the blockchain network occurs only in case of disagreement about the state changed by a transaction within the established channel by any participant, or in the case of a closed communication within the channel. If there is no disagreement about the changed state during the communication within the established channel, this solution significantly increases the number of transactions, since it aggregates micro-transactions and issues them as one at a predefined time [16].
State channels are implemented with the help of dedicated smart contracts. The establishment of communication through such a channel is carried out with a special "channel smart contract", aimed at ensuring fair communication between participants that perform operations, and at recording the final state into the blockchain network after the communication has ended. In case of a conflict between participants communicating outside the blockchain (within the channel), the smart contract has the task of selecting the most relevant last state that the users still agreed on when communicating within the channel [17]. The security of such an off-chain communication approach is based on the fact that each message sent through the state channel is cryptographically signed, with the aforementioned channel smart contract having an implementation for verifying these messages. Each participant can cancel the communication at any time, and the final state recorded in the blockchain is the one recognized by all participants in the off-chain communication [15].
This type of communication allows the implementation of more complex operations defined within smart contracts, completely independent of the blockchain network. Consequently, this means almost instantaneous execution of operations with very low total costs of executing all implemented channel transactions, since all transactions carried out within the established off-chain channel are aggregated into a single transaction [17, 13].
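The cryptographic core of such a channel, signing each off-chain state update and recovering the signer, can be sketched with the eth-account library. The state encoding below is a deliberate simplification, not the message format of any real channel implementation; on-chain, the channel smart contract performs the analogous recovery (ecrecover) to pick the latest state both parties agreed on.

# Minimal sketch of signed off-chain state updates (simplified encoding).
from eth_account import Account
from eth_account.messages import encode_defunct

key = Account.create()  # demo key pair; real participants hold their own keys

def sign_state(nonce: int, balances: dict, private_key):
    payload = encode_defunct(text=f"{nonce}:{sorted(balances.items())}")
    return payload, Account.sign_message(payload, private_key).signature

def signer_of(payload, signature) -> str:
    # The channel contract would run the same recovery to verify messages.
    return Account.recover_message(payload, signature=signature)

msg, sig = sign_state(nonce=7, balances={"A": 90, "B": 10}, private_key=key.key)
assert signer_of(msg, sig) == key.address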
3.2 Plasma
The scalability of the Ethereum network, with theoretically a trillion transactions per second, should be achieved by the introduction of a strategy called Plasma. Similarly to the solution described in Section 3.1, the purpose of Plasma is to implement transactions without the need for individual confirmation of each of them by the blockchain network. The solution envisages the introduction of several side chains, whereby the last state of a newly created chain is recorded in the main blockchain network. This could be implemented without any need to change the current protocol and Ethereum network. The most important factor in terms of achieving security in the Plasma solution relates to the privilege of every user to perform transactions within any side chain (with the exception of the main Ethereum chain), and to leave the side chain and write the final state into the main Ethereum chain, where the final valid state is defined. To prevent the recording of a false state into the main chain, the Plasma solution suggests a "challenge mechanism", which assumes that the state a user wants to record in the main chain is frozen for a certain period. During this period, other users can prove that the proposed state is not relevant. Because of this mechanism, the user must deposit a sum of the Ether cryptocurrency with the transaction that writes the state into the main Ethereum chain; if another user proves that the transaction contains an invalid state, the deposit is lost and acquired by the user who proved the invalid state. This mechanism could trigger a lot of false evidence of invalid transactions; therefore, a user wishing to prove an invalid transaction must also pledge a sum of the Ether cryptocurrency, which, in the case of false evidence of invalidity, is acquired by the user of the original transaction [18, 19].

3.3 Sharding
With the current implementation of the protocol, each node that is part of the Ethereum network must validate every transaction, which ensures a high level of network security. One solution is sharding, where the protocol would separate the network state into smaller partitions called shards. Each shard would store its separate state and transaction history. By implementing such a protocol, certain nodes would process only the transactions of certain shards. Processing transactions on different shards at the same time would increase the overall throughput [20].
Sharding is a general technique used in distributed computing, the implementation of which can be expected in Ethereum by 2020 [21]. The implementation of sharding is the only one of the described scaling solutions that will have practically no impact on end users, nor on smart contract developers on the Ethereum platform. The system for storing states will remain the same. The change will be at layer 1 of the Ethereum protocol, while the solutions mentioned in Sections 3.1 and 3.2 will work on layer 2 [22]. Sharding eliminates the need for the entire network (each node) to process all transactions. The result is an increased number of processed transactions per second [21].
Prior to implementing sharding in the protocol, various challenges must be addressed. The main challenge is a single-shard takeover attack. With such an attack, an attacker could possibly take control of an entire shard, which may result in the avoidance of sufficient validations or, even worse, in validating blocks that are incorrect. These attacks are usually prevented by random sampling schemes. The next challenge is the availability of states between different shards, where the effect of a transaction may depend on events that happened earlier in another shard. A simple example is a transfer of money where user A (e.g. in shard 2) transfers money to user B (e.g. in shard 7). First, a "debit" transaction is executed that destroys the tokens of user A (in shard 2), after which a "credit" transaction is created that creates the tokens of user B (in shard 7). The credit transaction carries a reference to the debit transaction, which proves that the credit transaction is legitimate [8].
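The two-phase cross-shard transfer just described can be modeled with a short toy sketch. The shard numbers follow the example above, while the receipt format and the in-memory state are illustrative only; a real protocol would verify the receipt against the source shard's state root rather than merely checking its presence.

# Toy model of a cross-shard debit/credit transfer with a receipt reference.
import hashlib

shards = {2: {"A": 100}, 7: {"B": 0}}            # shard id -> account balances

def debit(shard: int, user: str, amount: int) -> str:
    shards[shard][user] -= amount                 # tokens destroyed on shard 2
    return hashlib.sha256(f"{shard}:{user}:{amount}".encode()).hexdigest()

def credit(shard: int, user: str, amount: int, receipt: str) -> None:
    # Stand-in for verifying the debit receipt against shard 2's state root.
    assert receipt, "credit is only legitimate with a debit receipt"
    shards[shard][user] += amount                 # tokens created on shard 7

r = debit(2, "A", 25)
credit(7, "B", 25, r)
print(shards)   # {2: {'A': 75}, 7: {'B': 25}}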
4. CONCLUSION
In this paper, we presented several different solutions whose common purpose is to achieve greater efficiency of scalable transaction processing in the Ethereum blockchain network. State channels move state modifications outside of the main blockchain network. The Plasma solution envisages the introduction of several blockchains, whereby each chain is used for a specific purpose. Both solutions allow users to record the final state in the main Ethereum blockchain network. We also described the sharding solution, the introduction of which, in contrast to the above-mentioned solutions, requires a change of the lowest layer of the Ethereum protocol. All the described solutions pursue the goal of not reducing the current level of transaction processing security, as well as maintaining the decentralization of the blockchain itself, while achieving scalability. In the future, due to the increase in the number of transactions transmitted within the Ethereum network, it is reasonable to expect several concrete implementations (Loom Network, OmiseGO, Raiden, ...) of the described solutions, as well as their increased use in practice, since the increase in the efficiency of transaction processing is one of the key factors in achieving the optimization of existing and new business processes supported by blockchain technology.

5. ACKNOWLEDGMENTS
The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. P2-0057).

6. REFERENCES
[1] "Total Market Capitalization," coinmarketcap.com, 2018. [Online]. Available: https://coinmarketcap.com/charts/. [Accessed: 06-Jul-2018].
[2] "Ethereum Transaction Chart," etherscan.io, 2018. [Online]. Available: https://etherscan.io/chart/tx. [Accessed: 06-Jul-2018].
[3] B. Podgorelec, "Arhitektura za nadgradljivost in zamenljivost pametnih pogodb na platformi Ethereum," University of Maribor, 2018.
[4] M. Pustisek, A. Kos, and U. Sedlar, "Blockchain Based Autonomous Selection of Electric Vehicle Charging Station," 2016 Int. Conf. Identification, Inf. Knowl. Internet Things, pp. 217-222, 2016.
[5] F. M. Benčić and I. P. Žarko, "Distributed Ledger Technology: Blockchain Compared to Directed Acyclic Graph," 2018.
[6] Visa, "Visa Inc. at a Glance," no. August, p. 1, 2015.
[7] V. Buterin, "A next-generation smart contract and decentralized application platform," Ethereum, no. January, pp. 1-36, 2014.
[8] J. Ray, "On sharding blockchains," github.com/ethereum, 2018. [Online]. Available: https://github.com/ethereum/wiki/wiki/Sharding-FAQs. [Accessed: 04-Jul-2018].
[9] "The State of Scaling Ethereum - ConsenSys Media," 2018. [Online]. Available: https://media.consensys.net/the-state-of-scaling-ethereum-b4d095dbafae. [Accessed: 04-Jul-2018].
[10] A. Back, M. Corallo, and L. Dashjr, "Enabling blockchain innovations with pegged sidechains," pp. 1-25, 2014.
[11] GoChain, "GoChain: Blockchain at Scale," pp. 0-5, 2018.
[12] A. Judmayer, A. Zamyatin, N. Stifter, A. G. Voyiatzis, and E. Weippl, "Merged mining: Curse or cure?," Lect. Notes Comput. Sci., vol. 10436 LNCS, pp. 316-333, 2017.
[13] P. McCorry, S. Meiklejohn, and A. Miller, "Pisa: Arbitration Outsourcing for State Channels."
[14] "Lightning Network." [Online]. Available: https://lightning.network/. [Accessed: 01-Aug-2018].
[15] J. Coleman, L. Horne, and L. Xuanji, "Counterfactual: Generalized State Channels," L4, 2018.
[16] S. Dziembowski, L. Eckey, and S. Faust, "Perun: Virtual Payment Hubs over Cryptocurrencies."
[17] S. Dziembowski, S. Faust, and K. Hostáková, "Foundations of State Channel Networks," pp. 1-56, 2018.
[18] J. Poon and V. Buterin, "Plasma: Scalable Autonomous Smart Contracts," Whitepaper, pp. 1-47, 2017.
[19] "Explained: Ethereum Plasma - Argon Group - Medium." [Online]. Available: https://medium.com/@argongroup/ethereum-plasma-explained-608720d3c60e. [Accessed: 02-Aug-2018].
[20] R. Jordan, "How to Scale Ethereum: Sharding Explained," 2018. [Online]. Available: https://medium.com/prysmatic-labs/how-to-scale-ethereum-sharding-explained-ba2e283b7fce. [Accessed: 01-Aug-2018].
[21] J. Kim, "Vitalik Buterin: Sharding and Plasma to Help Ethereum Reach 1 Million Transactions Per Second," 2018. [Online]. Available: https://cryptoslate.com/vitalik-buterin-sharding-and-plasma-to-help-ethereum-reach-1-million-transactions-per-second/. [Accessed: 01-Aug-2018].
[22] A. Rathod, "We Should See Sharding in 2020 as Part of 'Ethereum 2.0,'" 2018. [Online]. Available: https://toshitimes.com/we-should-see-sharding-in-2020-as-part-of-ethereum-2-0-eth-foundation-researcher/. [Accessed: 01-Aug-2018].

Integration Heaven of Nanoservices

Ádám Révész, EPAM Hungary, Budapest, Hungary, Adam_Revesz@epam.com
Norbert Pataki, Department of Programming Languages and Compilers, Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary, patakino@elte.hu

ABSTRACT
Microservices have become an essential software architecture in the last few years. Nanoservices, as a generalization of the microservice architecture, have recently been getting more and more popular. However, this means that every component has more and more public interfaces, and the number of components is increasing as well.
Integration hell appeared when the number of developers increased. Developers work in parallel, so it is necessary to merge their work. Collaboration requires software support, such as version control tools and continuous integration servers.
However, modern software development tools such as build systems, testing frameworks and continuous integration servers become sensitive regarding the version of the source code they deal with. This can result in an exponential explosion in many ways when nanoservices are in the focus.
In this paper, we argue for a workflow that can handle this exponential explosion. This workflow can be included into continuous integration servers as jobs in order to execute test cases in a reproducible way, even if the test cases deal with special environment specifications. Moreover, the workflow is able to deal with building and artifact publishing processes as well.

Categories and Subject Descriptors
D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement; K.6.3 [Computing Milieux]: Software Management

Keywords
Nanoservices, Integration, version control

1. INTRODUCTION
Microservices and nanoservices have recently become essential software architectures. These architectures have many benefits: improved scalability, separate responsibilities and better maintainability, to name but a few [17]. On the other hand, having a software architecture utilizing more than 70 own-built nanoservices in active development requires special care for build processes.
In terms of continuous integration (CI) and continuous delivery (CD), modern software development process frameworks, pipelines are defined as composable parts of the process describing how the product is created, transformed and delivered, from making source code and configurations on developers' workstations to serving them to end users [16]. The pipelines mentioned in this paper are executed by automation systems following deterministic scripts referred to as "pipeline scripts".
This paper discusses the topic of bulk management of unified pipeline scripts in the aspects of reproducibility, replayability, compactness and overhead of change management.
This paper is organized as follows. We present the problem of integration hell in Section 2. We describe the problem in Section 3. Our proposed workflow is presented in Section 4. Finally, this paper concludes in Section 5.

2. INTEGRATION HELL
2.1 Case
The subject of the study is a software running on top of a container orchestration system operating over multiple nodes. Using event sourcing with Command Query Responsibility Segregation (CQRS), the software utilizes over 70 services.
Every own-built service is stored in its own version control system (VCS) repository [14]. Most of them are identical in the aspect of programming language, project structure, packaging system, types of artifacts, testing frameworks and static analysis system (e.g. [11]). The discussion continues about this kind of services.

2.2 Orchestration
A container orchestration tool manages resource allocations, configurations and credentials of containers. It provides a common internal network with service discovery and domain services, serving well-defined endpoints for outer network communications.
In terms of scalable services, when operating with nanoservices, an orchestration tool must provide a load balancer service over multiple nodes, ensuring high availability. It also provides declarative configuration and deployment management, with the ability of rolling updates and rollbacks between configuration and deployment versions.
Currently, the industry standard for a real battle-tested, serious production-grade orchestration tool is Kubernetes, developed by Google [9].

2.4.5 Common pipeline
The subject project uses mostly Java Spring Boot nanoservices, which kind of services have a common, actively developed pipeline script. The common pipeline script contains the following stages:
• VCS checkout (sometimes multiple)
• Build source code using package manager (like npm, Gradle, Cabal, etc.)

2.3 Build tools
Modern programming language ecosystems have their own package manager for dependency handling and easy build, test, install and deploy management [12]. The common pipeline script utilizes those package managers, reaching a higher level of abstraction [10].
For ex- • Run tests on the artifact using package manager ample: • Sending the source code to the static analysis system • Java, Scala: Gradle[3], Maven[5], Ant[1] • Building Docker image artifact • JavaScript - NodeJS: NPM[6], Yarn[8] • Uploading artifacts • C++: Conan [13] • Announcing build status on channels (email, instant • Python: Pip[7] messaging) • Haskell: Cabal[2] Since these are nanoservices, their Docker images differ only on the built artifact. The configurations, including en- • Docker (images): Docker (registry) [15] vironment variables, configuration and secret files, are han- Closed source software projects as the subject utilize arti- dled by the orchestration tool and building them into an fact repository systems which can serve repositories for mul- image is an anti-pattern in this use case. tiple type of packages for own artifacts and serve as cache 2.5 Integration hell definition for public domain packages (in case of outage and lowering network traffic). For example: Nexus, JFrog Artifactory. Integration hell is a place where developers have to main- tain all the pipeline scripts manually for each service or use 2.4 Pipelines a common pipeline script and update all the source codes and configurations on each service repository to be compat- The services are built automatically on VCS commit on ible with the pipeline script. Also called one pipeline script marked branches. Build pipeline scripts of actively devel- over all. oped services have to be in sync in order to guarantee the same level of quality and compatibility with environment (following its changes). 3. PROBLEM STATEMENT 2.4.1 Pipeline script 3.1 Build job generation A pipeline script is interpreted by a CI tool, a build system The jobs are generated depending on the VCS repository (e.g. Jenkins [4]), is a sequence of commands optionally path structure. The generator job accepts the list of the separated into stages. service names to make build job for. The build jobs are generated from template, the only difference is in the source 2.4.2 Pipeline script stage code repository URL and the project name. A pipeline script stage is a named sequence of commands. Used for visualizing the main parts of the script, leverag- 3.2 Single pipeline script repository approach ing process status display during execution, variable scope Having dozens of services with identical pipeline scripts, it segregation. would come in hand to use the exactly same pipeline script file checked out from one build script repository. 2.4.3 Pipeline command Each pipeline command can be variable declaration and 3.2.1 Limitations of updates definition (including functions), function invocation, shell The single pipeline script repository approach has mul- invocation. tiple pitfalls. Since the the job configuration has only the Ideally, a build system has its own pipeline script domain- repository, the branch name and the path of the pipeline specific language (DSL) with an application-programming script, any change on the pipeline script would affect all the interface (API) library for common operations like VCS check- build jobs at once. In this case either the ability to create out, packaging operations, status notifications, common con- experimental changes on the build scripts is lost or the abil- figuration and secret storage operations. ity to recreate all the build jobs without breaking any of them. 2.4.4 Build job In common CI tools, each pipeline script invoked by a 3.2.2 Lack of replayability corresponding build job. 
These jobs contain metadata for Other problem regarding the single repository approach running the pipeline script, like the location of the pipeline is the lack of replayability. Having a case when recreat- script itself. Storing and passing variables like job name, ing an artifact based on an older state of the service source parameters (given on job invocation via API call or web code repository is needed, there is no guarantee the cur- UI). rent state of the pipeline script in its repository is backward 44 Figure 1: Sequence diagram of the proposed work- flow compatible, so there is the risk of broken or unstable build (in worse case it turns out in production). The correct build script should be searched in the history of the pipeline script repository (see Figure 1). Figure 2: Sequence diagram of the single source of 3.2.3 Growing overhead truth approach The mentioned problems are getting harder to resolve as the size of the software project (the number of services) is This solution does not introduce the problem of difficult growing. The maintenance cost of those pipeline scripts is generator job but still carries the synchronization problem. high. Onboarding a new developer-, handing out the de- Pipeline scripts are being modified in multiple cases. There velopment of such project could be extremely difficult due are cases which are not strictly drived by source code changes. to the multiple tools and sytems, scripts and their difficult Having the case of enriching the log of the pipeline script in dependency graph. order to leverage traceability of the process. This change is made only in the pipeline script and the side effects are 4. PROPOSED WORKFLOW present only on the pipeline script log. Has no side effect on the artifacts or test results. There are multiple open ques- Addressing these problems a reasonable solution could be tions about which service VCS repository has to be updated a property file in each service source code repository. This first, which should be the subject of experimental changes approach makes the generator job more difficult since every and how to update all the other service pipeline script? invocation it should parse the property file of every repos- itory and generating the job according to that. An other 4.4 Automatized script updating problem is the synchronization of those property files. Addressing these questions, there is a pipeline script in the 4.1 Single source of truth VCS repository but unlike the single pipeline script reposi- tory approach (see 3.2), the service build jobs are not refer- There is an other, more compact, more robust and more ring to the script repository. There is a synchronization job redundant way to address the problems. The single source introduced instead. The pipeline script synchronization job of truth for service artifact build workflows should be the takes service name list as its arguments as the service build repository of their source code. This approach leverages the job generator job does. The pipeline script updater job has compactness of each service. The service VCS repository permission to update the service VCS repositories. To en- should contain the source code of the service, package de- force traceability an issue id referencing an issue describing scriptor (build scripts included) and the pipeline script. This the change and its cause is recommended to be present in the approach can be seen on Figure 2. commit message in all affected VCS repository. 
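The template-driven job generation and the automated script propagation described above can be illustrated with a short sketch. The repository layout, the issue id and the service names below are hypothetical; a real synchronization job would typically take the service list as a job argument and could call the CI server's API instead of shelling out to git.

# Hypothetical sketch of the pipeline script synchronization job.
import shutil
import subprocess
from pathlib import Path

CANONICAL = Path("pipeline-script/Jenkinsfile")   # assumed canonical script
ISSUE_ID = "BUILD-123"                            # hypothetical tracking issue

def update_service(repo: Path) -> None:
    # Copy the canonical pipeline script into the service repository and
    # commit with the traceability reference recommended above.
    shutil.copy(CANONICAL, repo / "Jenkinsfile")
    subprocess.run(["git", "-C", str(repo), "add", "Jenkinsfile"], check=True)
    subprocess.run(["git", "-C", str(repo), "commit", "-m",
                    f"{ISSUE_ID}: sync common pipeline script"], check=True)
    subprocess.run(["git", "-C", str(repo), "push"], check=True)

for service in ["service-a", "service-b"]:        # normally the job's arguments
    update_service(Path("repos") / service)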
The figure 4.2 Utilization of VCS 3 presents this workflow. Since the VCS repository handles the pipeline script along with the source code, any arbitrary snapshot (commit) of 5. CONCLUSION the repository in any time of its history should contain the Microservices and nanoservices are popular software archi- pipeline script which executes exactly the same pipeline with tectures. On the other, dealing with complex software devel- exactly the same result any time. opment processes and many different development software tools, the maintenance can be a critical problem because of 4.3 Keeping job generator simple the combinatorical explosion. 45 [12] M. P. Martinez, T. László, N. Pataki, C. Rotter, and C. Szalai. Multivendor deployment integration for future mobile networks. In A. M. Tjoa, L. Bellatreche, S. Biffl, J. van Leeuwen, and J. Wiedermann, editors, SOFSEM 2018: Theory and Practice of Computer Science: 44th International Conference on Current Trends in Theory and Practice of Computer Science, Krems, Austria, January 29 - February 2, 2018, Proceedings, pages 351–364, Cham, 2018. Springer International Publishing. [13] A. Miranda and J. a. Pimentel. On the use of package managers by the C++ open-source community. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing, pages 1483–1491, New York, NY, USA, 2018. ACM. Figure 3: Sequence diagram of the proposed work- [14] S. Phillips, J. Sillito, and R. Walker. Branching and flow merging: An investigation into current version control practices. In Proceedings of the 4th International Workshop on Cooperative and Human Aspects of This solution holds some security concerns like the up- Software Engineering, CHASE ’11, pages 9–15, New dater pipeline execute right has to be available for restricted York, NY, USA, 2011. ACM. group of users since the VCS enables Jenkins to commit to [15] Á. Révész and N. Pataki. Containerized A/B testing. the master (trunk) branch. In Z. Budimac, editor, Proceedings of the Sixth The current prototype version is restricted to only one Workshop on Software Quality Analysis, Monitoring, kind of services to upgrade their build pipeline. Enabling Improvement, and Applications, pages 14:1–14:8. modular build scripts and their modular upgrade could be a CEUR-WS.org, 2017. next iteration. The bulk update problem could be derivated to a version controll system problem, updating common files [16] S. Stolberg. Enabling agile testing through continuous in two or more repositories. In context of build systems like integration. In Agile Conference, 2009. AGILE ’09., Jenkins (git) submodules could not be an optimal solution pages 369–374, New York, Aug 2009. IEEE. increasing complexity. [17] E. Wolff. Microservices: Flexible Software The proposed solution grants the robust script handling Architectures. CreateSpace Independent Publishing workflow allowing bulk pipeline script updates and replaya- Platform, 2016. bility. It introduces some additional difficulty with the up- date process but it has been automatized. The approach reached a single source of truth state for each service artifact creation process and the refered source is the VCS repository which is a great tool to manage and observe the whole devel- opment of its content through time. The approach reduces the cost of maintaining pipeline scripts. 6. REFERENCES [1] Ant. https://ant.apache.org/. [2] Cabal. https://www.haskell.org/cabal/. [3] Gradle. https://gradle.org/. [4] Jenkins. https://jenkins.io/. [5] Maven. https://maven.apache.org/. [6] Npm. 
https://npmjs.com/. [7] Pip. https://pypi.org/project/pip/. [8] Yarn. https://yarnpkg.com/. [9] D. Bernstein. Containers and cloud: From LXC to Docker to Kubernetes. IEEE Cloud Computing, 1(3):81–84, Sept. 2014. [10] C. Ebert, G. Gallardo, J. Hernantes, and N. Serrano. Devops. IEEE Software, 33(3):94–100, May 2016. [11] G. Horváth and N. Pataki. Source language representation of function summaries in static analysis. In Proceedings of the 11th Workshop on Implementation, Compilation, Optimization of Object-Oriented Languages, Programs and Systems, ICOOOLPS ’16, pages 6:1–6:9, New York, NY, USA, 2016. ACM. 46 Service Monitoring Agents for DevOps Dashboard Tool Márk Török Norbert Pataki Department of Programming Languages Department of Programming Languages and Compilers, Faculty of Informatics, and Compilers, Faculty of Informatics, Eötvös Loránd University Eötvös Loránd University Budapest, Hungary Budapest, Hungary tmark@caesar.elte.hu patakino@elte.hu ABSTRACT sends report to the developers regarding the changes and their effects [6]. Deployment of the compiled application DevOps is an emerging approach that aims at the symbiosis and its necessary dependencies can be launched in various of development, quality assurance and operations. Develop- infrastructures [4]. Virtual machines in cloud, Docker con- ers need feedback from the test executions that Continuous tainers on a host take part in the deployment frequently [5]. Integration servers support. On the other hand, developers Configuration management tools (e.g. Ansible) can execute need feedback from deployed application that is in produc- specific code snippets for the deployment. Monitoring and tion. logging of the started application is useful to detect every Recently, we are working on the dashboard tool which vi- kind of runtime phenomenon and orchestrate the application sualizes the runtime circumstances for the developers and seamlessly [3]. architects. The tool requires runtime circumstances from However, tools landscape is missing good tools which are the production environment. In this paper, we introduce able to present the runtime performance of applications in our background mechanism which uses agents to retrieve staging or production environment regarding the changes of runtime information and send it to our tool. We present the source code. We are working on a dashboard tool to many specific agents that we have developed for this soft- visualize how the deployed application behaves in specific ware. Our approach deals with many useful services and environment. Many typical use-cases can be mentioned. tools, such as Docker and Tomcat. Does the memory consumption decrease when a feature’s new implementation is deployed? Which commit may cause Categories and Subject Descriptors a memory leak, if it is suspicious. Does the introduction of D.2.5 [Software Engineering]: Testing and Debugging; a new feature or API cause increase in the number of end- D.2.8 [Software Engineering]: Metrics users? How can one compare the performance of the system if the webserver or a database server is replaced? Keywords For our dashboard tool, we have developed many tool- specific agents to report runtime perception. Our tool vi- Agents, Monitoring, DevOps sualizes the reports come from agents. We have developed agents that deal with Docker, Tomcat webserver, etc. In this 1. INTRODUCTION paper, we present our agent-based approach and illustrate DevOps is an emerging approach in modern software en- some agents’ internal high-level functions. 
gineering. The key achievements of DevOps are compre- This paper is organized as follows. In section 2, we briefly hensive processes from building source to deployment, con- present the main concept of our tool. After, we present our tinuous synchronization of development and operations in agent-based approach in a detailed way with some examples order to make every new feature delivered to the end users. in section 3. Finally, this paper is concluded in section 4. DevOps emphasizes the feedback from every phase. DevOps-culture uses a wide range of software tools. Au- 2. DASHBOARD TOOL tomation of build processes is essential solution for many years. Continuous Integration (CI) servers track the version A safe software development requires control over the en- control system if a change of the source has been commited tire software development lifecycle (SDLC). During the de- [7]. In this case, the CI server (e.g. Jenkins [1]) starts the velopment, it is essential to avoid memory leakage, or overuse compilation process and executes the test cases and finally, of the CPUs. To get a good overview of the resource uti- lization engineers, DevOps engineers have to keep their eyes on these units that means they have to monitor their envi- ronments by using tools that can reflect the status of the Permission to make digital or hard copies of all or part of this work for different services, databases, network I/Os, or the amount personal or classroom use is granted without fee provided that copies are of written/read blocks. not made or distributed for profit or commercial advantage and that copies In this chapter, we would like to give a brief introduction bear this notice and the full citation on the first page. To copy otherwise, to about our Dashboard tool which can help developers to get republish, to post on servers or to redistribute to lists, requires prior specific metrics about their environments. Developers can declare permission and/or a fee. new environments on the board and assign charts to them. CSS ’18 Ljubljana, Slovenia Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. A chart represents a single observable unit from the real en- 47 vironment. Metrics are provided by agents which run on the period. Beside these steps, an agent also has minor charac- machine where the application is deployed. A continuously teristics, like running agents send the gathered information back to the • Listener. This way software and DevOps engineers can get Runs as a daemon an accurate picture immediately. A screenshot can be seen • Validates the configuration file to have proper keys in Figure 1 about how a chart looks like. • Validates the values in the configuration yml file • Checks whether the related OS-level dependencies ex- ist • Transfers the collected metric to JSON Beside these steps, an agent also has minor characteristics. It runs as a daemon. It checks whether the related OS- level dependencies exist. It transfers the collected metric in JSON. All agents require a file that contains specific information for the observed unit, as well as, parameter for the connec- tion to the Listener. One file can be used by many agents, and one file can contain configurations for multiple observed units. Here we detail some of the agents mechanism, how they Figure 1: Memory consumption of a Tomcat in- work and what information we can get from the unit. stance 3.1 Tomcat Tomcat is one of the most popular and widely-used ap- 3. AGENTS IN OUR APPLICATION plication server among Java developers. 
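Before the Tomcat agent is detailed below, the general agent loop described earlier (poll the observed unit, transform the data, transfer it to the Listener as JSON) can be sketched as follows. The Listener URL, the Tomcat manager status path and the credentials are illustrative assumptions, not values from the paper, and a real agent would parse the returned status rather than record its size.

# Minimal sketch of an agent loop reporting to a hypothetical Listener.
import time
import requests

LISTENER = "http://listener.example:9000/metrics"          # assumed endpoint
TOMCAT_STATUS = "http://localhost:8080/manager/status?XML=true"

def collect() -> dict:
    resp = requests.get(TOMCAT_STATUS, auth=("admin", "admin"), timeout=5)
    # Placeholder metric; a real agent extracts memory/thread figures here.
    return {"status_bytes": len(resp.content), "ts": time.time()}

while True:                                                # runs as a daemon
    try:
        requests.post(LISTENER, json=collect(), timeout=5)
    except requests.RequestException as err:
        print("transfer failed:", err)
    time.sleep(10)                                         # configured period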
It provides a sim- In this section, we give a detailed view of how our agents ple dashboard-like landing page where software and DevOps work and what the main steps are that we kept in focus dur- engineers can manage the deployed packages. Via this page ing the implementation. Before we go through the agents those users, who are dedicated to enter the server, can check listed below, we would like to introduce system require- the state of their applications. This can be a simple health- ments. The target hosts are always based on Debian images, check, the number of threads or how much memory is avail- or any of its derivatives, like Ubuntu or Linux Mint. As we able for Tomcat to allocate more space for the applications. present later, we have strived to use as less dependencies as The tomcat agent monitors both the inside status page possible, like OS-related functionalities or commands. Most and the process itself as well. In the configuration file (see of the commands come with the basic OS, like Listing 2), DevOps engineer has to declare specific parame- ps, but some of the switches can be different on other OS, like ters. -eo is Unix syntax, but using axo is acceptable on both Unix and BSD uri : ’ l o c a l h o s t ’ OS, as well. po rt : 8 080 The architecture consists of a server, the Listener and u s e r n a m e : ’ admin ’ nodes which serve as hosts for the agents. In our solution, p a s s w o r d : ’ admin ’ an agent is responsible for the following steps: pid : 2 4 5 6 7 • After start, it runs endlessly Listing 2: Agent configuration file example • Collects the information about the observed unit If pid is not available, agent monitors the inside status • Transforms and if necessary aggregates the collected page only. An example metric that the agent is intended to data send towards the Listener can be seen in Listing 3. • Transfers the data towards the Listener server in JSON { format " s t a t u s ": { " jvm ": { At first, we have to start the agent with an agent-specific " m e m o r y ": { sub-command and a configuration file which contains all the " fr ee ": 233 564 5 , information that are necessary to observe the chosen unit " t o t a l ": 8 8 2 3 4 1 2 3 , (e.g. see Listing 1). " max ": 2 4 5 3 4 2 2 } $ tomcat - a g e n t s t a r t -- fil e c o n f i g . yml } , " c o n n e c t o r ": { Listing 1: Launching the agent " r e q u e s t I n f o ": { When it starts running, it validates the arguments and " m a x T i m e ": 12 , then parse and validates the file against the expected con- ... figuration settings that are required to the unit. Then it } , starts monitoring and collecting metrics in a specified time " t h r e a d I n f o ": { 48 " m a x T h r e a d s ": 1 , pa th : ’/ logs ’ " c u r r e n t T h r e a d C o u n t ": 1 , fi le : ’ obs erv ed ’ " c u r r e n t T h r e a d B u s y ": 0 f o r m a t : ’ S E V E R I T Y || ’ } n u m b e r _ o f _ l i n e s : 10 } } Listing 5: Example configuration for the log agent } The path tag is responsible for the path of the folder which is considered as a log folder and Listing 3: Example metric sent by the tomcat agent file is the observed unit. To distinguish an ERROR leveled message from other messages that contains the word error, engineers have to declare the 3.2 Docker format of the log. The last key is responsible for the number Containerization is new directive in virtualization: this fetched and forwarded messages. 
A sent message example lightweight approach supports operating system-based iso- sent can be seen in Listing 6. lation among different pieces of the application. Container- { ization is on the crest of a wave since Docker has been de- " l i n e s ": [ ".. ."] , veloped. Docker provides a systematic way to automate the " s e v e r i t y ": { fast deployment of Linux applications inside portable con- " in fo ": 655 , tainers [2]. " w a r n i n g ": 848 , The name of docker is basically almost equivalent of con- " e r r o r ": 2 , tainer for most of the engineers. Docker, just like Tomcat, " f a t a l ": 0 provides a calculation on how much memory it consumes or } what the total bytes of the received and transmitted data } is over the network for each container. These are the stats. Without declaring any specific container name in the config Listing 6: JSON message example sent by the log file, the agent sends information about all the containers at collection agent the same time that are shown up in the stats. An example message can be seen in Listing 4. Since an agent is run on a machine by an arbitrary user, the software, DevOps and test engineers have to take care { that the observed log can be any file depends on the privi- " c o n t a i n e r s " : [ leges of the user. { " pid ": 38 , 3.4 Host Machine " na me " : ’ j i n g l e _ b e l l ’ , The host machine which the agent is executed on, can be " cpu " : 1.86 , a real machine, a virtual machine or a container whether it " mem ": { is on local or on remote. Whichever the host machine is, " u s a g e ": " 1 6 8 . 2 M " , from the agent perspective they are the same. From inside " l i m i t ": " 1 5 . 4 3 G " , out it seems that machine has memory, CPU (or GPU), " p e r c e n t a g e ": 1 .06 hard disk and other resources. These resources are reachable } , for the agents that means agents can use them. Having a ... picture about the usage and consumption of these resources } are essential. ] With this agent, we can monitor the above-mentioned re- } sources and gather their metrics. These metrics are cumu- lated, agent takes, for example the total memory, the total Listing 4: JSON message example sent by agent swap memory or the size of the available space on the hard disk, regardless which processes use them. 3.3 Log Here we would like to give a view which metrics are taken One of the most important mirror of the status of an ap- during the agent’s execution. We arranged the resources plication is its logs. It could contain all the steps that an into three groups. All the metrics belong to the memory, or execution takes and provide those steps in different granu- CPU, or disk storage (volume). larity. The two main approaches in case of this agent are, first, 3.4.1 Memory get the last n messages from the log and forward it to the Memory has multiple parts from total to used to swap. Listener, and second, get the number of the different severity To get an accurate picture about the consumption we use, levels. The earlier can provide a view of the latest messages, multiple commands that can help calculating the usage of which is a talkative information based on the error or excep- the different parts. The agent uses free (see Listing 7), tion messages raised in the code. The latter one can show /proc/meminfo and the vmstat commands to get metrics the ratio of the different levels giving a clear overview how about the memory (see Listing 8). All of them provide in- much warnings or errors get hit during the execution. 
To formation about how much total memory is in that host, get these two metrics we mentioned above, engineers have what the size of the cached swap or how much memory is to use such a configuration seen in Listing 5. free or how much is available for allocating new processes. ... $ f ree - m 49 t o t a l u sed fre e ... $ df - t e x t 4 Mem : 1 5 8 0 2 5 485 570 7 ... F i l e s y s t e m 1 K - b l o c k s U s e d A v a i l a b l e Use % M o u n t e d on / dev / n v m e 0 n 1 p 5 1 2 0 4 6 2 0 6 4 7 7 2 5 9 4 9 2 3 7 0 4 0 3 9 6 68% / Sw ap : 204 7 0 204 7 Listing 11: Using the df command Listing 7: Using the free command { { " f i l e s y s t e m ": "/ dev / n v m e 0 n 1 p 5 " , " Mem ": { "1 k _ b l o c k s ": 1 2 0 4 6 2 0 6 4 , " t o t a l ": 15802 , " us ed ": 7 7 2 5 9 4 9 2 , " us ed ": 5485 , " a v a i l a b l e ": 3 7 0 4 0 3 9 6 , " fr ee ": 5707 , " use ": 68 , " s h a r e d ": 2088 , " m o u n t e d _ o n ": "/" " bu ff / c a c h e ": 4609 , } " a v a i l a b l e ": 789 4 } , Listing 12: Sent JSON message about volume usage " Sw ap ": { " t o t a l ": 2047 , " us ed ": 0 , 4. CONCLUSION " fr ee ": 204 7 DevOps is an emerging approach that aims at the symbio- } sis of development, quality assurance and operations. Devel- } opers need feedback from the test executions that CI servers support. On the other hand, no tools have been created that Listing 8: Sent message about memory consumption support feedback from the production enviroment to the de- velopers to follow up the code changes and its effect on the 3.4.2 CPU end-users and the production or the staging environment. There are plenty of tools that provide the opportunity to In this paper, we argue for a new tools into the DevOps monitor the usage of the CPU. Some of them are part of toolset. The aim of this tool is retriving and visualizing the default OS, then the rest come as a third-party tool and the runtime circumstances of deployed application because require installation with privileges. We took the focus on this information can be essential for the developers and ar- those tools that are part of the OS, or used in wide range, like chitects. For this tool, we have developed many agents to vmstat, or iostat (see Listing 9). Both tools can provide a collect the runtime performance information from specific picture of the CPU utilization in percentage. services. In this paper, we presented the mechanism of some $ i o s t a t - c specific agents in Linux environment. L i n u x 4.15.0 -32 - g e n e r i c 2018 -08 -25 _ x 8 6 _ 6 4 _ (8 CPU ) avg - cpu : % u s e r % n i c e % s y s t e m % i o w a i t % s t e a l % i d l e 5. REFERENCES 24 ,97 0 ,03 6 ,07 0 ,03 0 ,00 68 ,90 [1] Jenkins. https://jenkins.io/. [2] D. Bernstein. Containers and cloud: From LXC to Listing 9: Using the iostat command Docker to Kubernetes. IEEE Cloud Computing, The agent sends the above information towards the Lis- 1(3):81–84, Sept. 2014. tener as it seen in Listing 10. [3] P. P. I. Langi, Widyawan, W. Najib, and T. B. Aji. An { evaluation of twitter river and logstash performances as " us er ": 24.97 , elasticsearch inputs for social media analysis of twitter. " ni ce ": 0.03 , In Information Communication Technology and " s y s t e m ": 6.07 , Systems (ICTS), 2015 International Conference on, " i o w a i t ": 0.03 , pages 181–186, New York, Sept 2015. IEEE. " s t e a l ": 0.00 , [4] M. Leppänen, S. Mäkinen, M. Pagels, V. P. Eloranta, " id le ": 68. 9 J. Itkonen, M. V. Mäntylä, and T. Männistö. 
The } highways and country roads to continuous deployment. IEEE Software, 32(2):64–72, Mar 2015. Listing 10: Sent JSON message about CPU usage [5] Á. Révész and N. Pataki. Containerized A/B testing. In Z. Budimac, editor, Proceedings of the Sixth Workshop 3.4.3 Volume on Software Quality Analysis, Monitoring, Volume usage does not belong to the major metrics of Improvement, and Applications, pages 14:1–14:8. the previously mentioned three units. Though it can tell CEUR-WS.org, 2017. useful information about a running application. To get a [6] J. Roche. Adopting DevOps practices in quality metric about the volume agent uses df (see Listing 11) and assurance. Commun. ACM, 56(11):38–43, Nov. 2013. du commands. Both of them are responsible for giving a [7] S. Stolberg. Enabling agile testing through continuous view of how much space is taken by a folder or how the integration. In Agile Conference, 2009. AGILE ’09., size of the local storage changes. Moreover, agent can be pages 369–374, New York, Aug 2009. IEEE. parameterized. It takes the path to the observed folder or partition of the storage of type of the disk. The agent sends aggregated information as it seen in Listing 12. 50 Incremental Parsing of Large Legacy C/C++ Software Anett Fekete, Máté Cserép Eötvös Loránd University Faculty of Informatics Budapest, Hungary {hutche, mcserep}@inf.elte.hu ABSTRACT incremental parsing [14] and the lazy analysis [10] have been CodeCompass is an open source project intended to sup- studied. A great overview of pratical algorithms and the port code comprehension by providing textual information, exsiting methodology is given by Tim A. Wagner in [13]. source code metrics, version control information and visu- C/C++ language-specific compilation tools [12, 4] and pro- alization views of the file and directory level relations for gramming environments [7] supporting incremental parsing the analyzed project. Regarding the typical software de- have also emerged as an advancement. velopment methodologies (especially the agile ones), only a smaller portion of the code base is affected by any change CodeCompass [9] is an open source, scalable code compre- during a shorter amount of time (e.g. between nightly hension tool developed by Ericsson Ltd. and the Eötvös builds), therefore parsing the entire project each time is un- Loránd University, Budapest to help understanding large necessary and expensive. A newly introduced feature, in- legacy software systems. Its web user interface provides rich cremental parsing is intended to solve this problem by only textual search and navigation functionalities and also a wide processing files that have been recently changed and leaving range of rule-based visualization features [5, 6]. The code the rest alone. This is achieved by the maintenance of the comprehension capabilities of CodeCompass is not restricted project workspace database followed by the partial parsing to the existing code base, but important architectural infor- of the project. The feature has been tested both on medium mation are also gained from the build system by processing and large scale projects and proved to be an effective tool the compilation database of the project [11]. The C/C++ in CodeCompass. 
Categories and Subject Descriptors
D.2.3 [Software Engineering]: Coding Tools and Techniques; D.3.4 [Programming Languages]: Processors

General Terms
Management, Languages

Keywords
code comprehension, software maintenance, static analysis, incremental parsing, C/C++ programming language

1. INTRODUCTION
One of the main tasks of a code comprehension software tool is to provide exact textual information and visualization views regarding the analyzed codebase to support the (newcomer) developers in understanding the source code. For an enterprise software under development this requires the frequent static reanalysis of the program, which could take several hours for a large legacy software.

Performing a complete static analysis each time is a significant waste of computational resources, since in most cases (e.g. between nightly builds) only a few percent of the file set has been affected by any change. In order to boost the parsing and compilation process and to provide a richer user experience in integrated development environments (IDEs) [8], the concept of incremental parsing and compilation has been researched for decades. More recently further approaches, like the involvement of version control systems into incremental parsing [14] and lazy analysis [10], have been studied. A great overview of practical algorithms and the existing methodology is given by Tim A. Wagner in [13]. C/C++ language-specific compilation tools [12, 4] and programming environments [7] supporting incremental parsing have also emerged as an advancement.

CodeCompass [9] is an open source, scalable code comprehension tool developed by Ericsson Ltd. and the Eötvös Loránd University, Budapest to help understanding large legacy software systems. Its web user interface provides rich textual search and navigation functionalities and also a wide range of rule-based visualization features [5, 6]. The code comprehension capabilities of CodeCompass are not restricted to the existing code base: important architectural information is also gained from the build system by processing the compilation database of the project [11]. The C/C++ static analyzer component is based on the LLVM/Clang parser [1] and stores the position and type information of specific AST nodes in the project workspace database, together with further information collected during the parsing process (e.g. the relations between files). By introducing the concept of incremental parsing into CodeCompass we can detect the added, deleted or modified files in the program and carry out maintenance operations for the database of the code comprehension tool in only the required cases. Thus the required time of the reanalysis can be reduced by multiple orders of magnitude.

In this paper we first present our research in Section 2 on how we extended the static analysis capabilities of the CodeCompass code comprehension tool with incremental parsing. Then Section 3 demonstrates the usability of the concept by showcasing incremental parsing and measuring its performance on a medium and a large size C/C++ software. Finally, Section 4 concludes the results and discusses further research opportunities.

2. METHODOLOGY
A major consideration of the introduced incremental parsing feature was to integrate it seamlessly into the existing parsing process by not differentiating in how an initial or a follow-up incremental parse should be initiated. This was achieved by utilizing the partial parsing feature of CodeCompass, which means that the tool is capable of continuing a previously aborted analysis by omitting the already parsed files which are present in the workspace database.

Therefore the main concept of the introduced incremental parsing feature consists of two steps: i) perform a database maintenance operation, where the project workspace is restored into a state from which ii) the existing partial parsing can finish the procedure.

2.1 Determining file states
When a new parse is being done in incremental mode, the state of each file is determined first. Let F_DB be the file set stored in the workspace database and F_DISK be the file set stored on the disk. A file f ∈ F_DB ∪ F_DISK may take one of the three states listed as follows.

Added files: f has been added to the project since the latest parse if f ∈ F_DISK but f ∉ F_DB.

Deleted files: f has been deleted from the project if f ∈ F_DB but f ∉ F_DISK.

Modified files: f is modified when f ∈ F_DB ∩ F_DISK at the time of the new parse, but its content has changed since the latest one. This can be determined by comparing the contents that are stored in the database and on the disk, or by their respective hashes for performance optimization.
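The three states can be computed with plain set operations over F_DB and F_DISK. The following C++ sketch is our illustration of Section 2.1, not CodeCompass source; both file sets are modeled as path-to-hash maps, so the modification check reduces to a hash comparison:

    // Minimal sketch of the file-state classification (our illustration).
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    enum class FileState { Added, Deleted, Modified };

    std::vector<std::pair<std::string, FileState>>
    classify(const std::map<std::string, std::string>& db,    // F_DB: path -> hash
             const std::map<std::string, std::string>& disk)  // F_DISK: path -> hash
    {
        std::vector<std::pair<std::string, FileState>> result;
        for (const auto& [path, hash] : disk) {
            auto it = db.find(path);
            if (it == db.end())
                result.emplace_back(path, FileState::Added);     // on disk only
            else if (it->second != hash)
                result.emplace_back(path, FileState::Modified);  // hashes differ
        }
        for (const auto& [path, hash] : db)
            if (disk.find(path) == disk.end())
                result.emplace_back(path, FileState::Deleted);   // in DB only
        return result;
    }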
2.2 Header inclusion traversal
Specifically when parsing a C or C++ language project, changes in header inclusions provide one more challenge to tackle. Upon the modification of a header file, all further files in the inclusion chain depending on it should be considered as modified, even without containing any direct changes themselves. Therefore, when determining the modified state of a file as defined in Section 2.1, the set of files defined by the header inclusion relationships should be transitively checked for changes. There are two approaches for this, as described below and shown in Figure 1.

Definition 1. For files a, b and c, given that a is included by b and b is included by c, we say that file a is in an upward connection with b and, accordingly, file c is in a downward connection with b.

Figure 1: Traversal directions.

Upward traversal model: The upward traversal model depends on the upward connections between files. When resolving the state of file a, its included headers have to be checked for modifications transitively.

Downward traversal model: Similarly, the downward traversal model uses the downward connections that can be found between files. If a file a is resolved as modified, all files that include a can be marked as modified transitively. Note that with this method, the state of any marked file can be considered final and it can be omitted from further inspections.
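As an illustration of the downward model (our sketch, not the tool's code), the traversal is a simple worklist algorithm over the transposed inclusion graph G^T; since a marked file is final, each file is expanded at most once:

    // Downward traversal sketch: starting from the directly changed files W,
    // follow reverse inclusion edges and mark every includer as modified.
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using Graph = std::map<std::string, std::vector<std::string>>;

    std::set<std::string> downwardTraversal(
        const Graph& includers,                // file -> files that include it
        const std::set<std::string>& changed)  // W, the directly changed files
    {
        std::set<std::string> marked(changed.begin(), changed.end());
        std::vector<std::string> worklist(changed.begin(), changed.end());
        while (!worklist.empty()) {
            std::string file = worklist.back();
            worklist.pop_back();
            auto it = includers.find(file);
            if (it == includers.end()) continue;
            for (const auto& includer : it->second)
                if (marked.insert(includer).second)  // newly marked -> expand once
                    worklist.push_back(includer);
        }
        return marked;  // all directly and indirectly changed files
    }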
Theorem 1. The downward traversal model has better computational complexity than the upward traversal model, and is therefore preferred for incremental parsing.

Proof. Let G = (V, W, E) be the directed acyclic graph (DAG) of header inclusions, with V containing the file set as vertices and E being the set of upward connections, n := |V|, e := |E|, and with W ⊆ V denoting the set of directly changed files, k := |W|.

Let N_G(v) be the neighborhood file set of vertex v in G, so w ∈ N_G(v) ⇔ (v, w) ∈ E. Therefore, for a file v we can define the directly included file set as N_G(v) and the includer files of v as N_{G^T}(v), where G^T is the transpose graph of G.

We define up(G, v) and down(G, v) as the file sets resulting from the upward and downward traversal for v ∈ V in G by the corresponding traversal model, as formally described below:

    up(G, v) = {v} ∪ ⋃_{w ∈ N_G(v)} up(G, w)    (1)

    down(G, v) = {v} ∪ ⋃_{w ∈ N_{G^T}(v)} down(G, w)    (2)

As a simplification in our model, let us assume a uniform distribution of header inclusions among the files. Since Σ_{v∈V} deg⁺(v) = Σ_{v∈V} deg⁻(v) = e, the average in-degree and out-degree of a file v is deg⁺(v) = deg⁻(v) = e/n, which will be denoted by d henceforth. As a consequence, the length of the longest path in G is log_d n, which is the length of the longest header inclusion chain in the project, since G was defined as a DAG.

Therefore the asymptotic tight bound both for up(G, v) and down(G, v) can be calculated as:

    Θ(up(G, v)) = Θ(down(G, v)) = d^(log_d n) = n    (3)

We define up(G) and down(G) as the upward and downward traversal algorithms which determine the indirectly changed files in V through header inclusions from W by the corresponding traversal model. We define the computational complexity of the algorithms as the number of files checked for changes in their content (or by their hash). Based on Equation 3, the asymptotic tight bounds for up(G) and down(G) can be calculated as:

    Θ(up(G)) = Σ_{v ∈ V} Θ(up(G, v)) = n²    (4)

    Θ(down(G)) = Σ_{w ∈ W} Θ(down(G, w)) = k·n    (5)

Since k ≤ n, and in a typical use case for incremental parsing k ≪ n, we obtain Θ(down(G)) < Θ(up(G)).

An example of the downward traversal model is showcased in Figure 2. On the left side of the figure the example file set is shown, with header inclusion dependencies denoted as arrows between the files. Directly modified files are marked with a dark background, while files requiring expansion through traversal to find indirectly changed files are marked with an italic font. Note that these two categories are equivalent in the initial stage. On the right side of the figure the effects of downward traversing a.h are demonstrated: files c.h, d.h, f.cpp and g.cpp are also detected as indirectly changed files. While c.h was also a directly modified file, observe that it no longer requires downward traversal.

Figure 2: Downward traversing of a.h demonstrated on a showcase file set.

2.3 Database maintenance
As mentioned above, incremental parsing includes some maintenance of the existing database, depending on the state of the changed files.

1. Added files are perceived as new files to the project and are therefore registered into the database.

2. Deleted files need to be purged from the database, as they have been removed from the project.

3. Modified files are handled as if they were a combination of deleted and added files. First, they are completely wiped out from the database – meaning that all their AST related information and file level relations are erased – thus considering them deleted, then re-registered like newly added files. Directory level relations are not sufficiently maintainable, but these relations can be effectively computed at runtime, on demand, from the file level relations.
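As a minimal sketch of this maintenance step (our illustration; the Database interface below is hypothetical, not the CodeCompass schema), the three cases reduce to a registration, a purge, or a purge followed by a re-registration:

    // Hedged sketch of the maintenance dispatch in Section 2.3.
    #include <string>

    struct Database {  // hypothetical persistence interface
        void registerFile(const std::string& path) { /* insert AST info */ }
        void purgeFile(const std::string& path)    { /* erase AST info  */ }
    };

    enum class FileState { Added, Deleted, Modified };

    void maintain(Database& db, const std::string& path, FileState state) {
        switch (state) {
        case FileState::Added:
            db.registerFile(path);
            break;
        case FileState::Deleted:
            db.purgeFile(path);
            break;
        case FileState::Modified:
            db.purgeFile(path);     // wipe all stored information first...
            db.registerFile(path);  // ...then re-register as a new file
            break;
        }
    }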
3. EXPERIMENTAL RESULTS
The go-to projects on which CodeCompass is usually tested are the Xerces-C++ [3] and LLVM [2] projects. Both are open source projects that have been under development for several years and are therefore considered legacy projects. Incremental parsing was also tested on these two, as Xerces-C++ is a medium size and LLVM is a large-scale project, and they contain enough files (347 and 2845, respectively) to produce a significant difference in runtime between even small portions of changes in the number of files.

Incremental parsing is aimed at reducing the parsing time of builds, especially nightly builds; therefore it was tested on 1, 5 and 10 percent changes of the file set, since no bigger difference between two builds is presumable. The changeset was generated automatically by random selection of files.¹ Table 1 shows the results for Xerces-C++, while Table 2 and Table 3 depict the results for LLVM. All measurements were carried out on a standard notebook computer, parsing on 2 processor cores.

¹Only leaf nodes from graph G introduced in Section 2.2 were included in the changeset, so header inclusions did not affect the number of changed files.

In order to keep database consistency in case of a graceful abort or unexpected termination of the parser module, the basic concept is that the maintenance operation of incremental parsing must be performed in a transactional mode, in one of the following ways:

- Carry out all deletions from the database in one single transaction, so the maintenance is either completely executed or no changes are performed at all.

- Generate multiple file level transactions, so the information regarding a file is either cleaned from the database or the file is untouched; therefore a consistent state of the database is always kept.

Table 1: Time measures for incrementally parsing the Xerces-C++ project
Parse type    Changed files    Time
Full parse    –                2 min 49 sec
1% change     3                10 sec
5% change     17               21 sec
10% change    35               49 sec

Table 2: Time measures for incrementally parsing the LLVM project by one atomic transaction
Parse type    Changed files    Time
Full parse    –                5 h 46 min
1% change     28               7 min 30 sec
5% change     142              1 h 58 min
10% change    284              2 h 45 min

Table 3: Time measures for incrementally parsing the LLVM project by file level transactions
Parse type    Changed files    Time
1% change     28               9 min 30 sec
5% change     142              49 min
10% change    284              1 h 21 min

Table 2 and Table 3 compare the differences when the database maintenance is executed through a single transaction and through file level transactions. It is clear that the extensive size of the database rollback log, containing all the deletion operations for a larger quantity of files, can significantly hinder the effectiveness of incremental parsing, producing a significant difference in the timespan of incremental parsing for large projects like LLVM. Hence, while a single transaction may provide stronger guarantees, file level transactions proved to be the more adequate solution, where the required time is more or less linear in the quantity of parsed files, depending on the length and content of the files in question.

4. CONCLUSIONS
Incremental parsing was introduced into CodeCompass to reduce the costs of parsing, both in time and computational resources, by omitting unchanged files in the project. The feature distinguishes added, deleted and modified files and handles them accordingly. The early tests of incremental parsing were run on the Xerces-C++ and LLVM projects and showed that it works according to its original purpose, especially in decreasing the timespan of parsing. While the results are promising, further challenges include the improved reduction of the timespan required by incremental parsing through parallelizing the process.

5. ACKNOWLEDGMENTS
This work is supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.3-VEKOP-16-2017-00002).

6. REFERENCES
[1] Clang: a C language family frontend for LLVM. https://clang.llvm.org/.
[2] The LLVM Compiler Infrastructure. https://llvm.org/.
[3] Xerces-C++ XML Parser. https://xerces.apache.org/xerces-c/.
[4] Zapcc – A (Much) Faster C++ Compiler. https://www.zapcc.com/.
[5] T. Brunner and M. Cserép. Rule based graph visualization for software systems. In Proceedings of the 9th International Conference on Applied Informatics, pages 121–130, 2014.
[6] M. Cserép and D. Krupp. Visualization Techniques of Components for Large Legacy C/C++ software. Studia Universitatis Babes-Bolyai, Informatica, 59:59–74, 2014.
[7] M. Karasick. The Architecture of Montana: An Open and Extensible Programming Environment with an Incremental C++ Compiler. SIGSOFT Softw. Eng. Notes, 23(6):131–142, Nov. 1998.
[8] R. Medina-Mora and P. H. Feiler. An incremental programming environment. IEEE Transactions on Software Engineering, (5):472–482, 1981.
[9] Z. Porkoláb, T. Brunner, D. Krupp, and M. Csordás. CodeCompass: An open software comprehension framework for industrial usage. In Proceedings of the 26th Conference on Program Comprehension, ICPC '18, pages 361–369, New York, NY, USA, 2018. ACM.
[10] V. Savitskii and D. Sidorov. Fast analysis of source code in C and C++. Programming and Computer Software, 39(1):49–55, 2013.
[11] R. Szalay, Z. Porkoláb, and D. Krupp. Towards better symbol resolution for C/C++ programs: A cluster-based solution. In IEEE 17th International Working Conference on Source Code Analysis and Manipulation (SCAM), pages 101–110. IEEE, 2017.
[12] T. Tromey. Incremental compilation for GCC. In Proceedings of the GCC Developers' Summit. Citeseer, 2008.
[13] T. A. Wagner. Practical algorithms for incremental software development environments. PhD thesis, Citeseer, 1997.
[14] T. A. Wagner and S. L. Graham. Efficient and flexible incremental parsing. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(5):980–1013, 1998.
Visualising Compiler-generated Special Member Functions of C++ Types

Richárd Szalay, Zoltán Porkoláb
Eötvös Loránd University, Faculty of Informatics
Department of Programming Languages and Compilers
Budapest, Hungary
szalayrichard@inf.elte.hu, gsd@elte.hu

ABSTRACT
In the C++ programming language, special member functions are either user-defined or automatically generated by the compiler. The detailed rules for when and how these methods are generated are complex and often surprise developers. As generated functions never appear in the source code, it is challenging to comprehend them. For a better understanding of the details under the hood, we provide a visualisation method which presents generated special functions in the form of C++ source code that is in effect identical to their implicit versions.

CCS CONCEPTS
• Software and its engineering → Source code generation; Software maintenance tools; • Human-centered computing → Information visualization;

GENERAL TERMS
programming languages, software development, visualisation

KEYWORDS
C++ programming language, compilers, code comprehension, code design

1 MOTIVATION
Languages supporting the object-oriented programming (OOP) paradigm define a central principle of object lifetime, which is surrounded by construction/initialisation and destruction/finalisation. In the Java programming language, apart from the basic default construction – where everything is initialised to the respective zero value – the developer must explicitly state their intent for different construction logic or custom finalisation. A special case is when a new object is created from an already existing one, where deep copy (clone) operations or conversions might be warranted. In C++, however, the Language Standard specifies that these aforementioned actions, in the form of special member functions [8], should have a default implementation automatically generated by the compiler if the user does not explicitly write them. The rules which dictate the conditions for generating the special member functions and their behaviour can appear dauntingly complex, and subsequent versions of the language standard may revise and elaborate these rules, increasing their complexity. The most recent, and most significant, such change came with the release of the C++11 standard, which introduced move semantics [9].

Modernising code initially written for an older standard can be cumbersome, as the behaviour of special members is never directly expressed, yet relied upon by the most trivial code. What is more, the compiler is free to lazily evaluate the generation of these members, which results in such a member's non-availability only being reported when its usage is attempted. In case the used software library is outdated, not easily modifiable, or not open source, this can result in a loss of run-time performance or in development effort wasted on having to redesign parts of the software. For discovery and understanding of the existence and behaviour of these methods, developers can either consult the Language Standard, read Abstract Syntax Trees (ASTs), or view the disassembly of the binary — none of which is favourable for the average developer.

 1  #include <iostream>
 2  struct A { int x; };
 3  int main() {
 4      A a1;       // <- Default constructor called.
 5      a1.x = 5;
 6      A a2(a1);   // <- Copy constructor called.
 7      a1.x = 6;
 8
 9      // Will print "6 5".
10      std::cout << a1.x << " " << a2.x;
11  }

Listing 1: Example code which uses a default and a copy constructor.
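A small example of why these rules surprise developers (our own illustration, not from the paper): providing any user-defined constructor already suppresses the implicit default constructor.

    struct B {
        B(int value) : x(value) {}  // user-provided constructor
        int x;
    };

    int main() {
        B b1(42);   // OK: uses the user-provided constructor
        // B b2;    // error: no default constructor is generated for B
    }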
To aid the ongoing development and code comprehension of projects, we introduced a tool that allows pretty-printing a visual representation of special member functions that is closest to how they would be written by developers. To further this aid, we do not only show the compiler-generated special members, but provide the subset of the type's member functions which shows both the user-written ones – e.g. a constructor that initialises from a different data type – and the standard, implicit ones. We used the open source LLVM/Clang Compiler Infrastructure [16] for parsing and generation.

The rest of the paper is organised as follows. In Section 2 we discuss the purpose and rules of C++ special member functions. Then, Section 3 describes the implementation approach and the challenges faced with respect to pretty-printing and presentation to the developers. The paper concludes in Section 4.

2 C++ SPECIAL MEMBER FUNCTIONS
Special member functions in C++ denote the functions that are necessary for the management of instances' lifetime [12]. These are the constructors, the assignment operators and the destructor.

2.1 Constructors
Constructors are responsible for the initialisation of an object. They are usually executed together with the memory allocation for the instance. Unless the user specifies and provides a constructor function, both C++ and Java will generate a default constructor. In Java, this function initialises every data member to its respective zero value, such as integer 0, rational 0.0, the \0 character, or a null reference. In C++, the initial state of the members depends on the storage scope of the object – in most cases, the memory garbage is retained from the memory block where the object is allocated.
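The difference can be made visible with a short example of ours (not from the paper): the generated default constructor of the class below leaves x indeterminate unless value-initialisation is requested explicitly.

    #include <iostream>

    struct C {
        int x;  // no initialiser, no user-defined constructor
    };

    int main() {
        C garbage;              // default-initialised: garbage.x is indeterminate
        C zeroed{};             // value-initialised: zeroed.x is zero
        std::cout << zeroed.x;  // prints 0; reading garbage.x would be UB
    }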
Unlike Java, however, the default constructor is not created if at least one data member does not have a default constructor.

Another case of construction is when a new object is initialised from the state of another, already living object of the same type. In Java, this functionality can be achieved in multiple ways, one of which is by using the special clone() function. This function is defined in Object and performs a shallow copy of the instance in question, only initialising the new object's members to the same values as those of the cloned one [4, 11]. In case of references to other objects this results in aliasing, the sharing of the same resource – usually an internal buffer – by two separate entities. Another problem with clone() is that the cloneability marker and the respective method must exist through the whole chain of the type hierarchy – this is usually referred to as an epidemic [10]. What is more, cloning does not actually invoke a construction, but rather creates a copy of the memory's snapshot, which means that business logic strictly bound to a constructor, such as the initialisation of read-only members, cannot be performed. In C++, the default behaviour of the copy constructor is to run the copy construction of every data member. For fundamental types, this means a copy of the value; for more complex types, their respective copy constructors are called. Thus, in case a custom resource which can be properly deep-copied is used, the copy constructor that is generated for the object using this resource will be sufficient.

2.2 Destructor
The destructor (or finaliser) is called at the end of an instance's lifetime and is responsible for tearing down the state of the instance. This most commonly means releasing resources, performing clean-up tasks and committing changes, e.g. to a database. In Java, the finalize() method's implementation is run for an object at an unspecified point in time, when the runtime's garbage collector decides that the object is to be reaped [3]. The behavioural differences between Java Virtual Machine versions and the general possibility of finalisation never happening for an instance resulted in a consensus on not using finalize() – it has also been deprecated since Java 9. Instead, the AutoCloseable design pattern is used, which explicitly requires writing a close() method that executes teardown logic but can be called arbitrarily by the developers when teardown is deemed necessary, such as at the end of a database operation. In C++, a destructor can be written by the user or is automatically generated by the compiler. It is always executed immediately when an instance's lifetime ends. The generated destructor does nothing in its body, and then the destructor of each data member is executed individually – as their lifetimes have also expired. Thus an implicit destructor always exists, unless a data member's destructor is explicitly hidden – this is a common practice for scenarios where a controller has to ensure an orderly or batch destruction.

2.3 Assignment operators
Contrary to Java, where there exist only primitive types and references, C++ is a language with value semantics. Assigning to a reference in Java only results in the actual memory modification of a memory address' size. The object that is no longer referred to by the assigned-away reference is then left for garbage collection, if applicable. In C++, however, assigning an object to another object of the same type results in the assigned-to object having a copy of the assigned object's state within its own memory region. Traditionally, copy assignment operators have a "destructor" part, where the current object's resources and buffers are released, and then a "copy constructor"-like logic where the copy of the state takes place; however, the developer is free to choose a different implementation. The compiler-generated copy assignment operator implements a memberwise copy assignment for the entire object. Thus, the copy assignment operator is not generated by the compiler, due to type infeasibility, if one of the data members cannot be copy-assigned.

It is noteworthy that not every language defines the = assignment as an operator: in some languages, such as Ada or Pascal, assignment is defined as a statement/instruction rather than an operator application. This has led to the inability to write copy assignment logic in Ada. To avoid the use of assignment on types that are not designed for memberwise copy, the limited keyword [18] and type annotation is used.

In C++ it is commonly referred to as The Rule of Three that if any of the copy constructor, copy assignment operator and destructor is written explicitly by the developer, all of them should be written explicitly. This rule of thumb is not enforced by compilers but is considered good practice because, as discussed earlier, explicitly specifying either will not stop the compiler from automatically creating the implicit definitions of the other special member functions.
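As a compact illustration of the Rule of Three (our own example, not from the paper), consider a class owning a raw buffer; once the destructor is user-defined, the memberwise copies the compiler would generate become incorrect, so both copy operations are written as well:

    #include <algorithm>
    #include <cstddef>

    class Buffer {
        std::size_t size_;
        int* data_;
    public:
        explicit Buffer(std::size_t n) : size_(n), data_(new int[n]()) {}
        ~Buffer() { delete[] data_; }                      // 1. destructor

        Buffer(const Buffer& rhs)                          // 2. copy constructor
            : size_(rhs.size_), data_(new int[rhs.size_]) {
            std::copy(rhs.data_, rhs.data_ + size_, data_);
        }

        Buffer& operator=(const Buffer& rhs) {             // 3. copy assignment
            if (this != &rhs) {
                Buffer tmp(rhs);                           // copy-and-swap
                std::swap(size_, tmp.size_);
                std::swap(data_, tmp.data_);
            }
            return *this;
        }
    };

Without the user-written copies, the generated memberwise versions would make two Buffer objects share data_, leading to a double delete.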
2.4 Members for move semantics
The release of the C++11 Language Specification introduced move semantics, which allows resources to be directly "stolen" by one variable from another, as opposed to being copy-constructed with the original data's memory then destroyed [13]. This is used heavily with temporary objects which would be destroyed in the next statement. The move special members' default implementation executes a move construction or move assignment of every data member; however, the rules for their existence are more exquisite. Move members are not generated automatically if any explicit destructor, copy or move member exists, and an explicitly defined move member also turns off the automatic generation of the copy members.

Accordingly, the Rule of Three has been extended to also include the two move members, and is referred to as The Rule of Five.

3 IMPLEMENTATION
3.1 Syntax transliteration
We used the open source LLVM/Clang Compiler Infrastructure for parsing and for the generation of special member visualisations, because Clang's object-oriented Abstract Syntax Tree (AST) API allows for an optimised and maintainable application. An example subtree of the AST corresponding to the source code in Listing 1 can be seen in Listing 2. The copy constructor's body corresponds to copying the right-hand record's single data member into the current record's corresponding data member.
CXXConstructorDecl implicit used constexpr A 'void (const struct A &) noexcept' inline
  ParmVarDecl 20f90c0 used 'const struct A &'
  CXXCtorInitializer Field 'x' 'int'
    ImplicitCastExpr 'int' <LValueToRValue>
      MemberExpr 'const int' lvalue .x
        DeclRefExpr 'const struct A' lvalue ParmVar 20f90c0 '' 'const struct A &'
  CompoundStmt
CXXConstructExpr <col:7, col:11> 'struct A' 'void (const struct A &) noexcept'

Listing 2: The Clang AST representation of the implicit copy constructor's body, and the call to it in main().

Other compilers might use different internal representations, on which these transformations would be infeasible to execute – in the case of GNU/GCC, the Register Transfer Language (RTL) is only meant to be used by compiler-internal applications, and code generation is organised into various steps called loops. An example of the same copy construction can be seen in Listing 3, which has already been stripped of semantic information; only the memory access for the data member can be studied from it by humans. It should be noted that the presented representation is the earliest and shortest one where the copy construction is apparent at the inner data member level. Previous transformation loops only show the copy constructor's call source line in its original form, i.e. A a2(a1);.

(insn 7 6 8 2
  (set (mem/c:SI (plus:DI (reg/f:DI 82 virtual-stack-vars)
          (const_int -8 [0xfffffffffffffff8]))
        [1 a2+0 S4 A64])
    (reg:SI 91)) "/tmp/main.cpp":6 -1
  (nil))

Listing 3: The GNU RTL of the copy constructor call in line 6 of Listing 1.

We have utilised Clang's architecture to perform the parsing of the translation unit, and then performed a traversal on the built AST, searching for all records, or for a particular record with a name specified by the user. Once the record is found, we visit every special member's body, and in the case of constructors their initialiser lists [5] too. The AST nodes found in the subtrees of these nodes are then manually converted into a textual, source code representation.

struct A {
    A() { }  // The default constructor.

    // The copy constructor.
    A(const A& rhs) : x(rhs.x) { }
};

Listing 4: The special members of the example class in Listing 1 translated back to source text.

There are three interesting cases that need to be noted, where explicit source code differs from what a compiler generates for itself automatically. First of all, the compiler generates the implicit members' arguments without an argument name. One such example can be seen in Listing 2, where the ParmVarDecl (parameter variable declaration) has no name, and the initialiser's DeclRefExpr (declaration reference expression) only refers to this ParmVarDecl by its memory address, 20f90c0. Such a construct cannot exist in actual source code. As a remedy, we manually assign the name rhs to the variable – or, in case multiple parameters are possible, number them as arg_1, arg_2, . . . – and use it in the pretty-printed code.

Another interesting case concerns move constructors and move assignment operators, namely that the compiler generates the argument as a temporary, an xvalue, from which move operations can be done. However, T&& rhs written in source code specifies a named variable, an lvalue, from whose members a move must explicitly be specified by using the type annotation std::move, which casts the members to xvalues – denoting variables that are essentially transformed into a temporary so that their resources can be moved from. The pretty-printer annotates the right-hand sides of move initialiser or assignment expressions with std::move to ensure the same semantics. We only do this for record types, as no fundamental type supports move operations.
The third case regards inheritance. In case a class has at least one superclass, the special members' default behaviour is to cast the current instance to the base class and call the appropriate constructor or assignment operator for each base class. A core principle in object-oriented programming is that up-casting – a cast to any base class – is always possible and well-defined; however, this would result in unintelligible source code lines, such as *this = rhs; – which would lead to an infinite recursion if written in source code verbatim. The type system allows us to see that this assignment is for the base class, so we explicitly wrap the statement into a cast at the appropriate location to show the base class initialisation to the developer. Examples of these cases are depicted in Figure 1.

We have encountered that the Standard only specifies generating a body for a special member if the currently compiled translation unit ODR-uses [7] the function. While no compiler error is given at compilation for an infeasible, implicitly deleted special member unless it is used, the type system in Clang annotates the forward declaration of the function if it is deleted. Thus, by using this annotation and the related diagnostics, we can, for each member without a body, either achieve an explicit body generation or print the reason behind the member being deleted by the type system, in a single pass. It should be noted that generating the body for members which are allowed to have one – when it was only an optimisation that the generation did not take place – is a non-functional change and does not affect the semantics of the generated code; thus this transformation can safely be integrated into other compilation steps.
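The paper does not include the tool's source. The following hedged C++ sketch of ours shows the shape of such a traversal with Clang's RecursiveASTVisitor, printing the implicitly declared members of each record; the tooling setup (FrontendAction, compilation database) is omitted, and since implicit members are declared lazily, a real implementation may first have to ask Clang's semantic analysis (e.g. Sema's ForceDeclarationOfImplicitMembers) to declare them:

    // Sketch: visit every C++ record definition and pretty-print the
    // members the compiler declared implicitly.
    #include "clang/AST/DeclCXX.h"
    #include "clang/AST/RecursiveASTVisitor.h"
    #include "llvm/Support/raw_ostream.h"

    class SpecialMemberVisitor
        : public clang::RecursiveASTVisitor<SpecialMemberVisitor> {
    public:
      bool VisitCXXRecordDecl(clang::CXXRecordDecl *RD) {
        if (!RD->isThisDeclarationADefinition())
          return true;  // skip forward declarations
        llvm::outs() << "record " << RD->getName() << ":\n";
        for (clang::CXXMethodDecl *M : RD->methods()) {
          // Implicit declarations are exactly the compiler-generated members.
          if (M->isImplicit()) {
            M->print(llvm::outs());  // textual form of the declaration
            llvm::outs() << "\n";
          }
        }
        return true;  // continue the traversal
      }
    };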
Figure 1: Special member overview for a class with two base classes and a single char data member.

3.2 Special member overview
To facilitate better code comprehension, we have decided to show not only the implicit special members but every related overload of constructors and assignment operators. This allows us to show the subset of the class' members which are related to the instance's lifetime.

The full overview proves useful when a special member is defaulted. If, for example, a class contains some constructors and a user-defined copy constructor, the move members will not be generated automatically; however, the developer can explicitly ask the compiler to generate the methods with the implicit body rules by using the = default specifier, available in C++11 and onwards. This is the suggested approach for modern C++, practised by most open-source projects. In this case, we show these members' bodies along with the rest of the class, with the annotation that the user requested the body generation.

Another case for the full view is showing the reason why a special member was not automatically generated, by printing a hint from the semantic analysis' diagnostics.

4 CONCLUSION
In this paper, we have discussed the rules and behaviour of automatically generated special member functions, an intrinsic feature of the C++ programming language. We have introduced an approach to transliterate the compiler's internal representation of these special members to source text, to promote the understanding of software projects without resorting to unfavourable techniques such as reading syntax trees manually.

We have implemented our solution in the open-source code comprehension tool CodeCompass [1, 14, 15] — http://github.com/Ericsson/CodeCompass — as an additional visualisation over C++ files. The upstreaming of this addition is underway at the time of writing this paper.

ACKNOWLEDGMENTS
The work presented in this paper was supported by the European Union, co-financed by the European Social Fund in project EFOP-3.6.3-VEKOP-16-2017-00002.

REFERENCES
[1] CodeCompass. 2012. A software comprehension tool for large-scale software written in C/C++ and Java. http://github.com/Ericsson/CodeCompass
[2] Margaret Ellis. 1990. The Annotated C++ Reference Manual. Addison-Wesley, Reading, Massachusetts, USA.
[3] James Gosling, Bill Joy, Guy L. Steele, Gilad Bracha, Alex Buckley, and Daniel Smith. 2017. Finalization of Class Instances (1st ed.), Chapter 12.6, 389–393. In [4]. https://docs.oracle.com/javase/specs/jls/se9/jls9.pdf visited on 2018-08-13.
[4] James Gosling, Bill Joy, Guy L. Steele, Gilad Bracha, Alex Buckley, and Daniel Smith. 2017. The Java Language Specification, Java SE 9 Edition. https://docs.oracle.com/javase/specs/jls/se9/jls9.pdf visited on 2018-08-13.
[5] ISO. 2012. Initializing bases and members, Chapter 12.6.2, [class.base.init]. In [6]. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372
[6] ISO. 2012. ISO/IEC 14882:2011 Information technology — Programming languages — C++, version 11 (C++11). International Organization for Standardization, Geneva, Switzerland. 1338 (est.) pages. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372
[7] ISO. 2012. One definition rule, Chapter 3.2.3, [basic.def.odr]. In [6]. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372
[8] ISO. 2012. Special member functions, Chapter 12, [special]. In [6]. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372
[9] ISO. 2012. Temporary objects, Chapter 12.2, [class.temporary]. In [6]. http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50372
[10] Marián Juhás, Zoltán Juhász, Ladislav Samuelis, and Csaba Szabó. 2009. Measuring the complexity of students' assignments. Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae. 31 (2009), 203–215.
[11] Zoltán Juhász, Marián Juhás, Ladislav Samuelis, and Csaba Szabó. 2008. Teaching Java programming using case studies. Teaching Mathematics and Computer Science. 6(2) (2008), 245–256.
[12] Stanley Lippman. 1996. Inside the C++ Object Model. Addison Wesley Longman, Reading, Massachusetts, USA.
[13] Scott Meyers. 2015. Effective Modern C++: 42 specific ways to improve your use of C++11 and C++14. O'Reilly Media, Sebastopol, California, USA.
[14] Zoltán Porkoláb and Tibor Brunner. 2018. The CodeCompass Comprehension Framework. In Proceedings of the 26th Conference on Program Comprehension (ICPC '18). ACM, New York, New York, USA, 393–396. https://doi.org/10.1145/3196321.3196352
[15] Zoltán Porkoláb, Tibor Brunner, Dániel Krupp, and Márton Csordás. 2018. CodeCompass: An Open Software Comprehension Framework for Industrial Usage. In Proceedings of the 26th Conference on Program Comprehension (ICPC '18). ACM, New York, New York, USA, 361–369. https://doi.org/10.1145/3196321.3197546
[16] The LLVM Project. 2003. Clang: C Language Family Frontend for LLVM. http://clang.llvm.org visited on 2018-08-13.
[17] Bjarne Stroustrup. 1994. The design and evolution of C++. Addison-Wesley, Reading, Massachusetts, USA.
[18] S. Tucker Taft, Robert A. Duff, Randall L. Brukardt, and Erhard Ploedereder. 2000. Consolidated Ada Reference Manual: Language and Standard Libraries. Springer-Verlag, Berlin, Heidelberg, Germany.
How Does an Integration with VCS Affect SSQSA?

Bojan Popović
Naovis d.o.o., Bulevar oslobođenja 30A, Novi Sad, Serbia
bojan.popovic@primafin.com

Gordana Rakić
University of Novi Sad, Faculty of Sciences, Trg Dositeja Obradovića 4, Novi Sad, Serbia
goca@dmi.uns.ac.rs

ABSTRACT
Contemporary trends in software development almost necessarily involve a version control system (VCS) for the storage and manipulation of source code and other artifacts. Consequently, tools supporting the development process, such as software analysis tools, integrate with VCS. In most cases tools support only the analysis of the resources in VCS repositories, while some of them rely on VCS to improve the analysis process and results. In this paper we explore how an integration of the SSQSA platform with VCS influences some of its performances.

Categories and Subject Descriptors
D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

Keywords
Software quality analysis, intermediate representation, Version Control System

1. INTRODUCTION
The quality of a software product is observed through the level of satisfied requirements. It could be assessed by executing the product and applying different techniques of dynamic analysis. These techniques are applicable when the product is ready for testing, which might be too late to recognize weaknesses or issues. On the other side, static analysis techniques traverse the source code and its various intermediate representations, which makes them applicable already in the early phases of the software development process [5].

Contemporary software development practice relies on source code repositories and their synchronization, implemented by various version control systems (VCS). VCS are used to store the whole history of activities in the evolution of a software product, from version information to the finest details about every individual change in the repository, including information about the contributors to the changes.

Consequently, software analysis tools integrate support for VCS. Usually this support means the possibility to analyze code stored in VCS repositories. In some cases tools also rely on the advantages of VCS to improve analysis performance or results.

In this paper we explore the potential advantages of integrating the SSQSA (Set of Software Quality Static Analyzers) platform [9] with Git [2] as a representative VCS. First, we introduce a concise background by describing VCS (Section 2) and SSQSA (Section 3). The prerequisites for the integration and the integration itself are described in Section 4. We discuss the results in Section 5 and possible application models and scenarios in Section 6, which is followed by a comparison to related integration solutions (Section 7). We conclude the paper in Section 8. This paper is a summary of a master thesis described in [8] (in Serbian).

2. VERSION CONTROL SYSTEMS
Version control systems (VCS) can have very broad application in different areas of content manipulation for personal or professional purposes. These are tools used primarily to support teams and individuals in the development and maintenance of software products. These systems remember all the changes of separate files, so that at any time we can recover a specific version, or follow and compare changes over time. In this way, all data is safer, good synchronization between the team members is ensured, the possibilities for errors are significantly reduced, and therefore the project development process is improved.
VCS are divided into two large groups [2]:

CVCS: Centralized Version Control Systems, where all the data are stored on a centralized server. This approach is certainly easier to maintain, but in case of a system failure, all information about the project will be lost. Additionally, the availability of a network connection is very important. Previously, this was the standard way to perform version control. Representatives of this group are CVS (Concurrent Versions System) [4] and Subversion [3].

DVCS: Distributed Version Control Systems, where clients map the whole repository. If a server failure occurs, any of the client repositories can be copied back to the server to restore it. Moreover, the local copy enables us to work on changes independently of a network connection, while the connection is necessary only for saving changes to the remote repository or taking a version from it. Files stored on the hard disk are of small size, and hence this does not pose a problem of storage space.

An additional advantage of DVCS is that we can share changes with other team members before they are shared globally. On the other hand, there is little advantage of centralized systems compared to distributed ones. Centralized systems offer an easier way to control all the people who access the server, as well as the easy provision of a central point where all the changes are in place. They also offer the option of downloading only a piece of code, if we only need to work on one project module. However, if needed, one copy of the project in a DVCS can be announced as the main one, and thus we can simulate a centralized system.

The distractions that can be attributed to distributed systems are more technical. For example, in the case of a project with many large files that cannot be compressed, more storage space is required. Additionally, if we are working on a large project that contains many customized changes, downloading a full version of the project can take longer than expected, and also take up more space on the hard drive than expected.

All the described differences led to the decision to conduct the first experimental integration of the SSQSA platform with a DVCS. Therefore, we compare Git [2] and Mercurial [6], as the main representatives of DVCS, in order to match their properties to our requirements (Table 1). We can conclude that Mercurial has better characteristics from the user's point of view, but for our integration these characteristics have no value. On the other hand, the ease of integration with other systems, the possibility to migrate to another system, and speed are extremely important to us. Therefore, in this work, we integrate SSQSA with Git.

Table 1: Comparison between Git and Mercurial
Property                               Git    Mercurial
Simple GUI                             -      +
Getting started for beginners          -      +
Simplicity of branch visualization     -      +
Speed (Windows OS)                     -      +
Speed online                           +      -
Changing the history                   +      +
Using the index                        +      -
PL independent extensions              +      -
Repo. migration to another system      +      -
3. THE SSQSA PLATFORM
SSQSA (Set of Software Quality Static Analyzers) [9] is a set of tools that enables language-independent static software product analysis based on its source code. Language independence is ensured by a universal intermediate representation of the source code called eCST (enriched Concrete Syntax Tree). Once this representation is produced for any system, written in any set of programming languages, it can be transformed into derived intermediate representations, such as dependency networks at different abstraction levels, or flow graphs. The fact that the derived representations are generated based on eCST, by a unique implementation of the derivation process, ensures their language independence and universality, too.

By traversing all or some of these universal intermediate representations, different analysis algorithms are implemented. Therefore, it is possible to have a single implementation of every functionality that we integrate into SSQSA, which ensures consistency of the results across different languages, but also adaptability to a new language and extendability by a new analysis [9]. The described process and the corresponding platform design are illustrated in Figure 1.

Figure 1: SSQSA platform and its integration with Git.

The current version of the SSQSA platform takes its input source code from a local directory (components colored gray in Figure 1), while our primary goal in this research is to integrate it to analyze code stored in a Git repository. Additionally, we will explore how the usage of a Git repository for storing the intermediate representation affects the SSQSA platform and its performances. This level of the integration will enable us to traverse only the changed fragments of the structures, which might further lead to an improvement of the performance of the analyses. The first prototype includes only the results of the generation of eCST in the repository. The new components that implement the integration are yellow-colored in Figure 1.

4. THE SSQSA AND GIT INTEGRATION
To enable the collaboration of SSQSA with Git, it was necessary to connect eCSTGenerator to the Git repository and to enable it to process the source code stored in it. After the first connection, eCSTGenerator processes the whole content of the repository and generates its eCST representation. Every subsequent time, eCSTGenerator processes only the changed files. This feature was not easily implementable before the integration with Git.

In addition, SSQSA uses the advantages of its integration with Git at one more level. Namely, after the set of eCSTs is generated, it is stored in a Git repository so that other components can also process only the changes between versions. For these purposes we do not use the same repository, as that is a dedicated development repository, and developers do not have to be affected by the analysis.
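The paper does not show how eCSTGenerator queries Git. As a rough illustration of the "process only changed files" step, the following C++ sketch of ours shells out to the git CLI and collects the paths changed between the last two commits:

    // Our illustration (not SSQSA source): list files changed between
    // the last two commits via `git diff --name-status`.
    #include <cstdio>
    #include <iostream>
    #include <string>
    #include <vector>

    std::vector<std::string> changedFiles(const std::string& repoPath) {
        std::string cmd =
            "git -C " + repoPath + " diff --name-status HEAD~1 HEAD";
        std::vector<std::string> files;
        if (FILE* pipe = popen(cmd.c_str(), "r")) {
            char line[4096];
            while (fgets(line, sizeof(line), pipe)) {
                // Each line looks like "M\tpath" (A = added, D = deleted,
                // M = modified); keep the status and path for the caller.
                files.emplace_back(line);
                if (!files.back().empty() && files.back().back() == '\n')
                    files.back().pop_back();  // strip the trailing newline
            }
            pclose(pipe);
        }
        return files;
    }

    int main() {
        for (const auto& f : changedFiles("."))
            std::cout << f << "\n";  // e.g. "M\tsrc/Generator.xyz"
    }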
The fact that 1https://github.com/kusti8/proton-native 60 Figure 1: SSQSA platform and its integration with Git Version from a from a Time for the ally works according to the pull-request model on the local no. of commit local dir Git repo Git connection level. 7. 744 ms 1250 ms 720 ms 14. 812 ms 1270 ms 754 ms The most practical model for implementing a new imple- 34. 1589 ms 1353 ms 739 ms mentation for the use of Git is the pull-request model. A 80. 1601 ms 1520 ms 870 ms project leader can start an eCST generator on a new repos- 126. 1650 ms 1515 ms 780 ms itory commit to analyze the modified file. If a developer wants to create XML trees, it can also launch an eCST gen- Table 2: Comparison between time needed for eCST erator at each commit. The problem can arise if more teams generation proccess from a local directory and from are made and the eCST generation process is lunched then a Git repository. only. In this case it must be adapted to go through all the commits, not only looking at the latest changes. Eventually, if we include functionality for committing of gen- The ”Director and Lieutenant” model is also suitable for new erated eCST to a repository, time needed for whole process implementation. Each sub-project has its own leader who goes over 6000ms. Obviously, in this scenario integration re- can create XML trees. Also, the leader of the repository duces performances of SSQSA. Still, further integration will may generate eCST when joining new changes to a branch utilize benefits of version control to improve generation of of a project (merge). Also, if developers want to generate derived intermediate representations. Finally, it will be in- XML trees, the same rules apply as with the Pull-Request tegrated with the analyzers. It can be expected that, with model. the growth of data that will be saved up in the exchange, traverse and analysis process, the benefits from the integra- The centralized model is the most unpractical model for us- tion will also grow. Therefore, effects of the integration on ing the new implementation. All team members commit other components still have to be explored (Section 8. their changes to a centralized repository, which in this case contains a lot of commits through which traversal should be 6. APPLICATION SCENARIOS conducted. Depending on a scenario, Git has three common application models: a centralized model, a pull-request model, and a Di- 7. RELATED SOLUTIONS rector and Lieutenants model [2]. In a centralized system, all Many tools also support code analysis from various VCS members of the team synchronize their changes in a central such as BCH: Better Code Hub2 and SonarQube3, primar- repository that stores all source code. In the pull-request ily because the repositories have become the standard code model, developers can make changes to his local repository, storage. However, only some tools rely on versions for more and he commits them to his own repository, and can see the advanced analysis. changes that other team members make. In this model one repository is considered the main repository. In order to ac- Lean Language Independent software analyzer (Lisa) is a complish the changes in it, a request is sent to the project software that analyzes the quality of software projects. The leader to pull the changes. The project leader can add devel- main goal of Lisa is to analyze a large number of project re- oper’s repository as a remote repository, locally test changes, visions asynchronously with minimal redundancy. 
The analyses aim to cover as many analyses, and as many programming languages, as possible. These goals are comparable with the goals of SSQSA, as well as with the new implementation presented in this paper. However, Lisa currently supports three programming languages, while the SSQSA framework currently allows us to work with more than ten programming languages. Concerning the subject of this paper, we can note certain differences in the approach to the problem and in the concrete solution implementation. For the needs of the Lisa analyzer, a special interface called SourceAgent has been developed. It supports asynchronous access to the Git repository and to file revisions [1]. On the other hand, SSQSA, with the current implementation, uses all the benefits of Git and of the library for interacting with it: it looks at the differences between the last two commits, reads all the files that have been changed, and generates XML trees for them. Furthermore, Lisa communicates directly with the Git repository by making a local copy of the remote repository on a local hard disk, while our implementation allows reading from a local disk and thus does not require an internet connection. An internet connection is only needed if we want to save the generated XML tree in a remote repository.

Analizo is a solution that analyzes source code written in different programming languages, with an emphasis on C, C++ and Java. The analysis supports reading content from remote repositories for each revision in which the source code of the project has been changed [10] and, unlike SSQSA, which currently allows reading of content only from a Git repository, it allows reading from Git and Subversion repositories, and then generates CSV files.
SSQSA also compares file revisions and decides from which files to create an XML tree. An advantage over Analizo is that we can monitor file versions on a remote repository. Again, the difference is in the number of supported languages: Analizo supports three languages, while SSQSA currently supports more than ten programming languages.

EvoJava is a tool for the static code analysis of input from a Java repository. It uses a VCS to access the code, mines the source repository, and calculates metrics. Unlike the SSQSA platform, EvoJava uses Subversion (SVN) and processes only .java files. The output file is also in XML format, but contains metric results. EvoJava takes the list of code versions that is in the repository and thus creates a model based on the XML-generated files [7]. SSQSA, on the other hand, observes the latest changes that are committed to a remote repository, finds these files in the file system and creates XML files based on them. Later it automatically commits them to a dedicated local or remote repository, where we can track what changes were made during the evolution of our software. We can also note the variety of supported programming languages in SSQSA, while EvoJava only supports the Java programming language.

8. CONCLUSION AND FUTURE WORK
Following the actual trends in software development and software analysis, the SSQSA framework is moving in the direction of integration with VCS. In this paper we compare the characteristics of different VCS and select Git as the first candidate for the integration. Further, we describe the integration of SSQSA with Git and explore the possible benefits of this integration for the performances of the platform.

The integration is developed at two levels. At the first level the platform is connected to the Git repository in order to enable processing of the source code stored in it. At the next level of the integration we use a Git repository to store the XML files containing the eCST intermediate representation of the source code, so that we can always look only at the changes and not traverse all the code, or, more precisely, its eCST representation. This is very important if we have in mind that one input file (compilation unit) is represented by one eCST.

At first look, the results of the integration are not promising. Namely, the Git connection used up the time that we can save by looking only at the changes and not at the whole source code. However, without storing trees in the Git repository we are already saving some processing time. In the case when we store eCST in a Git repository we spend more time, but in future work we will explore whether this cost may be paid off after extending this integration to the generation of derived representations and to the analyzers. For example, the generation of the dependency network currently traverses all the trees, while after the full integration with Git it will also look only at the changes. We have similar expectations for the integration of the analyzers with Git. Therefore, these integration activities will be the subject of future work, as well as the analysis of potential costs and benefits and the selection of the most suitable usage scenarios.

9. REFERENCES
[1] C. V. Alexandru, S. Panichella, and H. C. Gall. Reducing redundancies in multi-revision code analysis. In Software Analysis, Evolution and Reengineering (SANER), 2017 IEEE 24th International Conference on, pages 148–159. IEEE, 2017.
[2] S. Chacon and B. Straub. Pro Git. Apress, 2014.
[3] B. Collins-Sussman, B. W. Fitzpatrick, and C. M. Pilato. Version control with Subversion, 2006. Accessible at URL: http://svnbook.red-bean.com, 2007.
[4] D. Grune et al. Concurrent Versions System, a method for independent cooperation. VU Amsterdam, Subfaculteit Wiskunde en Informatica, 1986.
[5] G. O'Regan. Introduction to software quality. Springer, 2014.
[6] B. O'Sullivan. Mercurial: The Definitive Guide. O'Reilly Media, Inc., 2009.
[7] J. Oosterman, W. Irwin, and N. Churcher. EvoJava: A tool for measuring evolving software. In Proceedings of the Thirty-Fourth Australasian Computer Science Conference – Volume 113, pages 117–126. Australian Computer Society, Inc., 2011.
[8] B. Popović. Integration of a platform for static analysis with a version control system (in Serbian). Master's thesis, Faculty of Sciences, University of Novi Sad, 2018.
[9] G. Rakić. Extendable and adaptable framework for input language independent static analysis. PhD thesis, Faculty of Sciences, University of Novi Sad, 2015.
[10] A. Terceiro, J. Costa, J. Miranda, P. Meirelles, L. R. Rios, L. Almeida, C. Chavez, and F. Kon. Analizo: an extensible multi-language source code analysis and visualization toolkit. In Brazilian Conference on Software: Theory and Practice (Tools Session), 2010.
9. REFERENCES

[1] C. V. Alexandru, S. Panichella, and H. C. Gall. Reducing redundancies in multi-revision code analysis. In Software Analysis, Evolution and Reengineering (SANER), 2017 IEEE 24th International Conference on, pages 148-159. IEEE, 2017.
[2] S. Chacon and B. Straub. Pro Git. Apress, 2014.
[3] B. Collins-Sussman, B. W. Fitzpatrick, and C. M. Pilato. Version Control with Subversion, 2006. Available at http://svnbook.red-bean.com, 2007.
[4] D. Grune et al. Concurrent Versions System, a method for independent cooperation. VU Amsterdam, Subfaculteit Wiskunde en Informatica, 1986.
[5] G. O'Regan. Introduction to Software Quality. Springer, 2014.
[6] B. O'Sullivan. Mercurial: The Definitive Guide. O'Reilly Media, Inc., 2009.
[7] J. Oosterman, W. Irwin, and N. Churcher. EvoJava: a tool for measuring evolving software. In Proceedings of the Thirty-Fourth Australasian Computer Science Conference - Volume 113, pages 117-126. Australian Computer Society, Inc., 2011.
[8] B. Popović. Integration of a platform for static analysis with a version control system (in Serbian). Master's thesis, Faculty of Sciences, University of Novi Sad, 2018.
[9] G. Rakić. Extendable and adaptable framework for input language independent static analysis. PhD thesis, Faculty of Sciences, University of Novi Sad, 2015.
[10] A. Terceiro, J. Costa, J. Miranda, P. Meirelles, L. R. Rios, L. Almeida, C. Chavez, and F. Kon. Analizo: an extensible multi-language source code analysis and visualization toolkit. In Brazilian Conference on Software: Theory and Practice (Tools Session), 2010.

Indeks avtorjev / Author index

Beranič Tina .......... 23
Chuchurski Martin .......... 35
Cserép Máté .......... 51
Fekete Anett .......... 51
Heričko Marjan .......... 19
Heričko Tjaša .......... 31
Kamišalić Aida .......... 19
Karakatič Sašo .......... 27, 31
Kous Katja .......... 23
Kuhar Saša .......... 15
Leppäniemi Jari .......... 7
Orgulan Mojca .......... 35
Pataki Norbert .......... 43, 47
Podgorelec Blaž .......... 39
Podgorelec Vili .......... 27, 31
Polančič Gregor .......... 15
Popović Bojan .......... 59
Porkoláb Zoltán .......... 55
Rajšp Alen .......... 23
Rakić Gordana .......... 59
Rek Patrik .......... 39
Révész Ádám .......... 43
Rola Tadej .......... 35, 39
Rupnik Rok .......... 11
Sillberg Pekka .......... 7
Šimenko Samo .......... 27
Soini Jari .......... 7
Szalay Richárd .......... 55
Tišler Aljaž .......... 35
Török Márk .......... 47
Turkanović Muhamed .......... 19, 35
Unger Tea .......... 35
Vodeb Aljaž .......... 35
Welzer Tatjana .......... 19
Žnidar Žan .......... 35
Konferenca / Conference
Sodelovanje, programska oprema in storitve v informacijski družbi / Collaboration, Software and Services in Information Society
Uredil / Edited by Marjan Heričko