APPROACHING THE LIMIT OF CLASSIFICATION ACCURACY

INFORMATICA 2/88
UDK 519.226.3

Matjaž Gams, Matija Drobnič
Jozef Stefan Institute

ABSTRACT. One of the recently developed systems for machine learning (GINESYS) significantly outperformed all compared systems, including the theoretically optimal Bayesian classifier, which came second in both tests. We tested several options of the Bayesian classifier to investigate the real cause of the nonoptimal results and to estimate the upper limit of classification accuracy. The conclusion is that while it is possible to achieve even higher classification accuracy with suitable parameter adjustment in the Bayesian classifier, GINESYS seems to have practically achieved the optimal classification accuracy.

The empirical learning system GINESYS outperformed the compared systems, including the Bayesian classifier, in practical measurements. A detailed analysis shows that the achieved results can still be surpassed, but they are already very close to the optimal limit.

1. INTRODUCTION

Machine learning is a quickly developing area of Artificial Intelligence [Winston]. According to the major inference type used, it can be divided into rote learning, learning from instruction, learning by deduction, by analogy, from examples, and from observation and discovery [Carbonell et al; Michalski]. The scope of this article is learning from examples, or Empirical Learning (EL).

The aim of EL is to induce general descriptions of concepts from examples (instances) of these concepts. Examples are usually objects of a known class described in terms of attributes and values. The final products of learning are symbolic descriptions in human-understandable form. Induced descriptions of concepts, representing different classes of objects, can be used for classifying new objects.

EL systems basically perform the same task (classification) as statistical methods and can be directly compared to them in terms of classification accuracy. On the other hand, EL systems offer further advantages, namely a) explanation during the classification of new examples and b) insight into the laws of the domain by observing the classification rules. Explanation during classification (a) is important since it enables the user to check the line of reasoning and verify the system's decision. The knowledge base (b) can be viewed as a new representation of the domain knowledge, which can be of great value to domain experts, especially in domains that are not yet well formalized and understood.

2. A SIMPLE EXAMPLE

For a simple example, let us consider a case where we have a device with 8 binary switches representing 256 legal combinations. The device reports errors in some combinations and we want to find out which subset of switches causes them.

                        examples
                  1     2     3     4     5

    switch 1      0     1     0     1     1
    switch 2      1     0     0     0     0
    switch 3      1     1     1     1     1
    switch 4      0     1     0     1     1
    switch 5      1     1     1     0     1
    switch 6      0     1     1     0     0
    switch 7      0     1     0     1     1
    switch 8      1     0     1     1     1

    STATUS      ERROR  OK  ERROR  OK  ERROR

Table 1. The device reports an error in some combinations of the switches. Which subset of switches causes the errors?

Probably the most common answer of EL systems would be that an error is reported when switches 5 and 8 are on (= 1).

In practical tasks EL systems deal with domains of 10 to 10,000 examples (typically a few hundred) with 2 to 500 attributes (typically around ten) [Breiman et al; Quinlan; Lavrač et al]. Attributes can be real, integer or categorical with many possible values.
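To make the induction task of Table 1 concrete, the following short sketch (in Python, used for all illustrations in this edition; the systems discussed were implemented in Pascal) enumerates every rule of the form "an error is reported when all switches in a given subset are on" that is consistent with the five examples. The data encoding and the function names are our own illustration, not part of any of the systems discussed.

    from itertools import combinations

    # The five examples of Table 1: the state of switches 1..8
    # (1 = on, 0 = off) together with the reported status.
    EXAMPLES = [
        ((0, 1, 1, 0, 1, 0, 0, 1), "ERROR"),   # example 1
        ((1, 0, 1, 1, 1, 1, 1, 0), "OK"),      # example 2
        ((0, 0, 1, 0, 1, 1, 0, 1), "ERROR"),   # example 3
        ((1, 0, 1, 1, 0, 0, 1, 1), "OK"),      # example 4
        ((1, 0, 1, 1, 1, 0, 1, 1), "ERROR"),   # example 5
    ]

    def consistent(subset):
        # The rule "all switches in subset are on" must predict ERROR
        # exactly for those examples whose status is ERROR.
        return all((status == "ERROR") == all(x[i] for i in subset)
                   for x, status in EXAMPLES)

    # Enumerate candidate subsets, smallest first, and print every
    # rule that is consistent with all five examples.
    for size in range(1, 9):
        for subset in combinations(range(8), size):
            if consistent(subset):
                print("error when switches", [i + 1 for i in subset], "are on")

Run on Table 1, this prints exactly two rules: switches 5 and 8, and switches 3, 5 and 8. Both fit the data equally well, which is precisely the ambiguity exploited by the confirmation rules of section 5.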
3. EMPIRICAL LEARNING

The whole process of empirical learning consists of four steps:

- preprocessing of learning examples,
- construction of a classification rule,
- classification of new instances and
- analysing the laws of the domain.

A detailed description can be found elsewhere, e.g. in [Kononenko] or [Gams, Lavrač], together with a detailed overview of some well known algorithms: C4 [Quinlan], CART [Breiman et al], ASSISTANT 86 [Cestnik et al], CN2 [Clark, Niblett] and AQ15 [Carbonell et al]. We shall formally represent here only the domain and the classification rule.

A set of learning examples L = {(x, c)} consists of pairs (x, c), where x is a vector (denoting the properties of the object) in a measurement space X and c represents the index of the class of the example x. Components of the vectors x are called attributes or variables. The values of attributes can be numerical or categorical.

A classification or decision rule d(x) is a mapping which maps every x from X into some c from C or into a probability distribution (p1, p2, ..., pJ), where pi is a real number between 0 and 1. A classification rule d(x) splits the whole space X into subspaces X1, X2, ..., XJ, such that for every Xi only a certain subset of d(x) is relevant.

The syntax of a classification rule d(x) is:

    <d(x)>      ::= <rule> | <rule> and <d(x)>               classification rule
    <rule>      ::= if <complex> then <class>                rule
    <complex>   ::= <selector> | <selector> and <complex>    complex
    <selector>  ::= Atr <op> Val | Atr <op> <values>         selector
    <values>    ::= Val | Val or <values>                    values
    <class>     ::= 1 | 2 | 3 | ... | J                      class
    <op>        ::= < | = | >                                operators

Atr corresponds to the name of an attribute and Val is a categorical or numerical value. This syntax is transformable into DNF and is similar to the syntax of most rule-based or expert systems [Waterman, Hayes-Roth]. Note that the actual syntax is slightly more complicated [Gams].

4. DOMAIN DESCRIPTION

We performed practical measurements on two real-world domains. The data were obtained by I. Kononenko and represent descriptions and diagnoses of patients from the Oncological Institute Ljubljana. The only correction was the replacement of missing values by the most probable value for the given class. A more detailed description is given in [Gams]; here we present only cumulative data about the domains.

Domain 1

    number of attributes                  18
    possible values per attribute         2-8 (average 3.3)
    number of classes                     9
    total number of examples              150

    distribution of examples amongst classes C1 to C9:

        class:      1    2    3    4    5    6    7    8    9
        examples:   2    1   12    8   69   53    1    4    0

    importance of attributes A1 to A18: … of them is redundant

Domain 2

    number of attributes                  17
    possible values per attribute         2-3 (average 2.2)
    number of classes                     22
    total number of examples              339

    distribution of examples amongst classes C1 to C22:

        class:      1    2    3    4    5    6    7    8    9   10   11
        examples:  84   20    9   14   39    1   14    6    0    2   28

        class:     12   13   14   15   16   17   18   19   20   21   22
        examples:  16    7   24    2    1   10   29    6    2    1   24

    importance of attributes A1 to A17 (counting how many examples
    overlap when omitting the i-th attribute):

        attribute:  1    2    3    4    5    6    7    8    9
        overlap:   80   80   58   85   60   53   65   55   68

        attribute: 10   11   12   13   14   15   16   17
        overlap:   63   55   60   53   54   57   65   65

5. GINESYS

5.1. ALGORITHM DESCRIPTION

The top-level description of GINESYS (Generic INductive Expert SYstem Shell) is as follows:

    repeat
        initialize Rule;
        generate Rule;
        add Rule to d(x);
        L := L - {examples covered by Rule}
    until satisfiable(d(x))

In this general view GINESYS represents a prototype of a unifying algorithm for empirical learning, covering many other systems.
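The loop above can be paraphrased as a short program. The following sketch illustrates only the covering strategy; generate_rule, covers and satisfiable are hypothetical stand-ins for the corresponding GINESYS components, not its actual (Pascal) code.

    from typing import Callable, List, Tuple

    Example = Tuple[tuple, int]       # (attribute vector x, class index c)

    def induce(examples: List[Example],
               generate_rule: Callable,   # initialize + generate one Rule
               covers: Callable,          # does a Rule cover an example?
               satisfiable: Callable) -> list:
        d = []                            # the classification rule d(x)
        remaining = list(examples)        # the learning set L
        while remaining and not satisfiable(d):
            rule = generate_rule(remaining)   # induce a Rule from what is left
            d.append(rule)                    # add Rule to d(x)
            remaining = [e for e in remaining
                         if not covers(rule, e)]   # L := L - {covered examples}
        return d

Each pass removes the examples the new rule covers, so later rules are induced from the still-unexplained part of L; the termination test satisfiable(d(x)) decides when the rule set is good enough.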
In a slightly more specific description we obtain the following algorithm:

    repeat
        generalize Rule;
        repeat
            specialize Rule
        until stop(Rule);
        postprocess(Rule);
        add Rule to d(x);
        L := L - {examples covered by Rule}
    until satisfiable(d(x))

The main difference between GINESYS and other EL systems is in "confirmation rules". The basic idea of confirmation rules is to use several sources of information for classification, which seems to be common practice in everyday life. For example, when we try to predict the weather, we look at the official weather report, but we also look at the sky and ask our neighbour. The implementation of this idea in GINESYS is that, instead of using only one rule for classification, several rules confirm or confute the first one. In case of a confrontation between these rules, the Bayesian classifier is consulted as a method of conflict resolution [Waterman, Hayes-Roth].

One confirmation rule in our simple example in Table 1 could be: an error is reported when switches 3, 5 and 8 are on. This rule could be redundant or even wrong, but on the other hand it could be the only correct one! From the examples in Table 1 it is not clear which of these possibilities is the right one, so both (and other) rules are stored and consulted. In more detailed tests [Gams] it was shown that this method of consulting several rules (i.e. using different kinds of information) significantly improved classification accuracy.

5.2. COMPARATIVE RESULTS

A detailed comparison was made with other well known EL systems in two noisy medical domains (oncology). Table 2 shows the results in classification accuracy.

                      domain 1    domain 2

    GINESYS             69.9        51.9
    BAYES               68.4        50.1
    other systems       67.3        48.7

Table 2. Classification accuracy measured as the percentage of correct guesses.

While GINESYS achieved the best results and the Bayesian classifier the second best, none of the compared systems [Gams] outperformed the results in the last row of Table 2. These results are an average over ten runs on a randomly chosen 70% of the data for learning and the remaining 30% for testing. Further tests (t-tests, [Gams]) showed that the number of tests, their distribution and the differences between the classification accuracies were sufficient to ensure that the differences are the result of some deeper cause (e.g. a better algorithm) and not of a chance choice. Other measurements demonstrated superiority not only in classification accuracy, but also in generality, complexity of the classification rule and explanation [Gams]. GINESYS and the other algorithms discussed in this paper were implemented in Pascal on a VAX 11/750.

5.3. IRREPROACHABILITY OF MEASUREMENTS

We argue that our measurements are irreproachable (unbiased) since:

- all systems were measured on exactly the same data,
- no "cleaning" of the data was performed,
- no special form of data was allowed,
- no unusual method of measuring classification accuracy was used,
- no domain-dependent parameters were allowed,
- the number of data and tests was sufficient (t-tests) to avoid a chance choice,
- the results were strictly checked and verified by many supervisors, from the program source level to the level of the classification trace.

In our measurements some of the EL systems, especially those without special mechanisms for noisy domains, performed unexpectedly poorly compared to the results published by the originators of the algorithms. Since our implementations of those systems followed the published descriptions, several possible explanations remain: the actual implementations might use some unpublished extra features, or the domains used for testing might have been especially suitable for the specific algorithms, etc.
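The classification side of the confirmation-rule mechanism of section 5.1 can also be sketched. Here the rule representation is our illustrative assumption, and a simple frequency-based (naive) Bayesian classifier, without smoothing for brevity, stands in for the Bayesian classifier described in section 6.

    from collections import Counter, defaultdict

    def bayes_classifier(examples):
        # Relative-frequency estimates of P(c) and P(attribute i = v | c);
        # a minimal stand-in for the Bayesian classifier of section 6.
        classes = Counter(c for _, c in examples)
        cond = defaultdict(Counter)
        for x, c in examples:
            for i, v in enumerate(x):
                cond[(c, i)][v] += 1
        def classify(x):
            def score(c):
                p = classes[c] / len(examples)
                for i, v in enumerate(x):
                    p *= cond[(c, i)][v] / classes[c]
                return p
            return max(classes, key=score)
        return classify

    def classify_with_confirmation(x, rules, bayes):
        # rules: ordered list of (matches, predicted_class) pairs; the first
        # matching rule proposes a class, the later matching rules (the
        # confirmation rules) confirm or confute it.
        votes = [cls for matches, cls in rules if matches(x)]
        if votes and all(v == votes[0] for v in votes):
            return votes[0]          # the matching rules agree: accept
        return bayes(x)              # conflict or no match: consult Bayes

    # Hypothetical usage with the two Table 1 rules of section 2:
    #   rules = [(lambda x: x[4] and x[7], "ERROR"),
    #            (lambda x: x[2] and x[4] and x[7], "ERROR")]
    #   bayes = bayes_classifier(EXAMPLES)
    #   classify_with_confirmation(x, rules, bayes)

On Table 1 the two rules never disagree, so the Bayesian fallback stays idle there; in noisy domains it is precisely the disagreements between the first rule and its confirmation rules that the fallback resolves.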
The authors of this article also find it questionable to compare a system directly with a human expert, since we regard EL systems mainly as a helping tool and not as a stand-alone program. The other reason is that a fair comparison between machine and human is extremely difficult; the correct comparison should be (system + user) : user.

In most complex realistic domains, mechanisms for dealing with noise are of the greatest importance, as independently discovered in [Breiman et al; Kononenko], and it is not realistic to achieve even tolerable results without them [Kononenko; Gams].

6. BAYESIAN CLASSIFIER

6.1. THEORETICAL FOUNDATIONS

The concept of the Bayes rule is one of the most important concepts in the field of classification and also of learning. For data drawn from a probability distribution P(A, j), the most accurate rule can be given in terms of P(A, j); this rule is called the Bayes rule and is normally denoted by dB(x). Precisely, suppose that (x, y), x ∈ X, y ∈ C, is a random sample from the probability distribution P(A, j) on X x C, i.e. P(x ∈ A, y = j) = P(A, j). Then we define dB(x) as the Bayes rule if for any other classification rule d(x)

    P(dB(x) ≠ c(x)) ≤ P(d(x) ≠ c(x)).

REFERENCES

…, in: Progress in Machine Learning (eds. Bratko I., Lavrač N.), Sigma Press.

Lavrač N., Varšek A., Gams M., Kononenko I., Bratko I. (1986): "Automatic Construction of the Knowledge Base for a Steel Classification Expert System", The 6th International Workshop on Expert Systems, Avignon.

Michalski R.S. (1987): "Machine Learning", Tutorial 6, IJCAI 1987, Milano.

Michalski R.S., Chilausky R.L. (1980): "Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis", Policy Analysis and Information Systems, Vol. 4, No. 2.

Quinlan J.R. (1986): "Induction of Decision Trees", AI Summer Seminar, Dubrovnik.

Waterman D.A., Hayes-Roth F. (eds.) (1977): "Pattern-Directed Inference Systems", Academic Press.

Winston P.H. (1984): "Artificial Intelligence", Addison-Wesley.