URN_NBN_SI_doc-CKREDCV0
Knjižnica 46(2002)1-2, 111-136

Original scientific article
UDC 001.4 : 02 : 004.021

Abstract

The article describes the preparation of a stemming algorithm for Slovenian library science texts. The work proceeded in three phases: learning, testing and evaluation. It presents the preparation of an optimal stemmer for Slovenian texts from the field of library science, its testing, and its comparison with two other stemmers for the Slovenian language: the Popovič stemmer and the Generic stemmer. A corpus of 790,000 words from the field of library science was used for learning, and lists of stems, word endings and stop-words were built from it. In the testing phase, the component parts of the algorithm were tested on an additional corpus of 167,000 words. In the evaluation phase, the three stemmers processed the same word corpus and their output was compared. The results of each stemmer were compared with an intellectually (manually) prepared control result of the stemming of the corpus, which consisted of error-free groups of semantically connected words. Understemming, measured as the number of stems an algorithm produced for semantically connected words, was monitored in particular. The results were statistically processed with the Kruskal-Wallis test. The Optimal stemmer produced the best results: it matched the reference results most closely and also gave the smallest number of stems per semantic meaning. The Popovič stemmer followed closely, and the Generic stemmer proved to be the least accurate. The procedures described in the thesis can serve as a platform for developing tools for automatic indexing and retrieval of library science texts in the Slovenian language.

Key words: stemming, stemming algorithms, Slovenian language, library and information science

1 Introduction

1.1 Information retrieval and automatic indexing

The technique described in this article belongs to the field of automatic indexing or, more broadly, to the field of information retrieval.
Both were described in a review article by Vilar and Dimec. They define information retrieval (Slovenian: poizvedovanje) as: “A translation of the English term information retrieval, for which various names appear in Slovenian usage, e.g. iskanje informacij (information searching), iskanje in priklic informacij (information search and recall), etc. It is the systematic searching of indexed information sources by discovering, selecting and obtaining data and records from them. In its broader sense, the one most established internationally, information retrieval also covers the preceding procedures of building a collection of documents, especially the procedures for describing their content” (Vilar and Dimec, 2000, p. 7).
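The link between automatic indexing, stemming and retrieval can be illustrated with a minimal Python sketch. It is not the Optimal stemmer, the Popovič stemmer or the Generic stemmer: the suffix list, the `toy_stem`, `build_index` and `search` functions, and the three sample documents are all hypothetical and serve only to show how reducing word forms to a common stem lets a query match inflected variants in an indexed collection.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration only; it is NOT the list of
# word endings built in the article.
SUFFIXES = ("ami", "ah", "am", "om", "a", "e", "i", "o", "u")


def toy_stem(word: str, min_stem: int = 4) -> str:
    """Strip the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Automatic indexing: map each stem to the documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Retrieval: stem the query term the same way and look it up."""
    return index.get(toy_stem(query), set())


# Made-up sample documents (assumption, not data from the article).
docs = {
    "d1": "knjižnica hrani knjige",
    "d2": "v knjižnicah poteka indeksiranje",
    "d3": "iskanje informacij",
}
index = build_index(docs)

# "knjižnice", "knjižnica" and "knjižnicah" all reduce to the same stem,
# so one query form retrieves both documents.
print(sorted(search(index, "knjižnice")))  # ['d1', 'd2']
```

Because the index and the query pass through the same stemmer, the quality of the stemmer directly limits retrieval: a stemmer that splits related forms across several stems (understemming) would make the query above miss one of the two documents.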