Informatica 39 (2015) 71-85

Experimental Comparisons of Multi-class Classifiers

Lin Li
Institute of Intelligent Computing and Information Technology, Chengdu Normal University
No.99, East Haike Road, Wenjiang District, Chengdu, China
E-mail: lilin200909@gmail.com

Lin Li, Yue Wu and Mao Ye
School of Computer Science and Engineering, University of Electronic Science and Technology of China
No.2006, Xiyuan Ave, West Hi-Tech Zone, Chengdu, China

Keywords: multi-class classifier, tree classifier, boost classifier, support vector machines, combining boost classifier

Received: June 18, 2014

Multi-class classification algorithms are widely used in areas such as machine learning and computer vision. Although many publications describe multi-class algorithms, few introduce them with thorough theoretical analysis and experimental comparisons. This paper discusses the principles, important parameters, application domains, runtime performance, accuracy, etc. of twelve multi-class algorithms: decision tree, random forests, extremely randomized trees, multi-class adaboost classifier, stochastic gradient boosting, linear and nonlinear support vector machines, K nearest neighbors, multi-class logistic classifier, multi-layer perceptron, naive Bayesian classifier and conditional random fields classifier. The experiments are run on five data sets: the SPECTF heart data set, the Ionosphere radar data set, the spam junk-mail filter data set, the optdigits handwriting data set and the Scene 15 image classification data set. Our major contribution is that we study the relationships between the classifiers and the impact of each parameter on the classification results. The experiments show that gradient boosted trees, the nonlinear support vector machine and K nearest neighbors reach high performance for binary classification on small data sets, whereas for high-dimensional, multi-class, large data sets gradient boosted trees, the linear support vector machine and the multi-class logistic classifier obtain good results. Finally, the paper addresses the development and future of multi-class classification algorithms.

Povzetek: The paper gives an overview of multi-class classifiers.

1 Introduction

The multi-classification problem [1] is the following: given a set of training data $(x_1, c_1), \ldots, (x_n, c_n)$, where $x_i \in \mathbb{R}^p$ are the input features and $c_i \in \{1, 2, \ldots, K\}$ is the class label of the output variable, the task is to find a classification rule based on the training sample so that, given new features, it outputs a class label. Today multi-class algorithms are applied in a variety of areas such as radar signal classification, character recognition, remote sensing, medical diagnostics, expert systems and voice recognition. Multiple classifiers have a long history. Selfridge et al. [2] first proposed a multi-classification system based on a 'winner takes all' solution, which chooses the optimal solution as the multi-classifier output. Kanal and Minsky [3, 4] played an important role in the development of multi-class classifiers. They argued that no single classification algorithm solves all problems; specific classification algorithms need to be designed for different problems. The multilayer perceptron [5] is an artificial neural network model that can handle nonlinear data. The decision tree is a classical non-parametric classification algorithm that classifies samples according to learned classification rules.
Leo Breiman [6] proposed random forests as a good solution to the scalability issues of a single decision tree. The adaboost algorithm, proposed by Yoav Freund and Robert Schapire [34], is a meta-classification algorithm that can be combined with other classification algorithms to enhance their performance. The multi-class logistic classifier proposed by Jerome Friedman et al. [7] is another important improvement of the basic boosting algorithm. K nearest neighbors [5] classifies samples based on the spatial adjacency relationships of the features. In the 1980s, with the rise of data fusion and learning models in statistics and management science, Bayesian expert systems [8-10] became widely used. Since the 1990s, Vapnik's support vector machines, which transform features from a low-dimensional space into a high-dimensional space, have provided a better way to separate the classes. The nonlinear support vector machine has been a great success; however, it is not ideal in some cases, e.g. when the original features already live in a high-dimensional space, so the linear support vector machine [11] was proposed for these cases. Because of the complexity of the data, a single classifier often has difficulty obtaining good performance for a specific application, and it is a growing tendency to improve classification performance by combining classification methods [12].

How to solve multi-classification problems is challenging. There are two ways to deal with it [13]. Nilsson [14] first used a combination of binary classifiers to solve multi-classification problems. The other way is to directly extend a binary classifier into a multi-class classifier. The main contributions of this paper are as follows:
1. We compare twelve of the most commonly used multi-classification algorithms in several aspects, such as principle, important parameters and running time performance.
2. We consider the impact of differences in the type and size of data sets, choosing both binary and multi-class data sets and both small and large data sets for evaluation.
3. We give an in-depth per-class analysis of multi-class classification.
4. We investigate the relationship between single classifiers and combinations of single classifiers.
This paper discusses 12 multi-class classifiers: decision trees, random forests, extremely randomized trees, multi-class adaboost, stochastic gradient boosting, support vector machines (both linear and nonlinear), K nearest neighbors, multi-class logistic classifier, multilayer perceptron, naive Bayesian classifier and conditional random fields classifier. The paper is organized as follows: Section 1 is an introduction. In Section 2 we discuss related works. An overview of the algorithms is given in Section 3. Section 4 presents the experimental setup and parameter settings. In Section 5 we explain the experimental results. Section 6 concludes the paper.

2 Related works

There are few comprehensive comparisons of multi-classification algorithms. King et al. [15] is the earliest and most comprehensive study of multi-classification algorithms, including CART, the traditional C4.5 algorithm, naive Bayes, K nearest neighbor, neural networks, etc. However, several algorithms that emerged later, such as support vector machines and random forests, have since been widely used. Further, the data sets they used are small compared to current big data, and their evaluation criteria are simple.
Bauer et al. [16] thoroughly compare voting classification algorithms, including bagging, boosting and their improved versions, but they only compare these two families of voting algorithms. LeCun et al. [17] use accuracy, rejection rate and running time as evaluation criteria. They compare K nearest neighbor, linear, principal-component and polynomial classifiers on a handwritten-digit recognition data set; with only one data set, however, this is not sufficient to evaluate different application scenarios. Ding et al. [18] use neural networks and support vector machines on a protein test data set. They compare the accuracy under different parameters in detail, but the comparison is relatively simple and the data set is small. Tao et al. [19] study decision trees, support vector machines and K nearest neighbors for the classification of gene-sequence data, but this comparison also discusses a single data set. Foody et al. [20] study multi-class image classification with support vector machines, but they only compare support vector machines and do not even include the linear support vector machine. Chih-Wei Hsu et al. [21] study multi-class support vector machines with a more in-depth theoretical analysis and comparison, but the study is limited to multi-class support vector machines. Caruana et al. [22] study supervised learning algorithms (support vector machines, neural networks, decision trees, etc.) on nine different criteria such as ROC area and F evaluation, but only binary classification data sets are discussed. Krusienski et al. [23] compare the Pearson correlation method, Fisher linear discriminant, stepwise linear discriminant, linear support vector machines and Gaussian-kernel support vector machines on the P300 Speller data set. However, the data set is relatively simple, and with such a small data set it is difficult to compare the running performance of linear and non-linear support vector machines. Table 1 summarizes the current situation of classifier comparisons.

| Reference | Compared algorithms | Data sets | Evaluation criteria | Research domain |
|---|---|---|---|---|
| King et al. [15] | Symbolic learning (CART, C4.5, NewID, AC2, Cal5, CN2), statistical learning (Bayesian network, K-nearest, kernel density, linear discrimination, quadratic discrimination, logistic regression), neural network | Satellite image, handwritten digits, vehicle, segment, credit risk, Belgian data, shuttle control data, diabetes data, heart disease and head injury, German credit data | Accuracy | General purpose |
| Bauer et al. [16] | Bagging, boosting and their variants | Segment, DNA, chess, waveform, sat-image, mushroom, nursery, letter, shuttle | Average error rate, variance, bias | General purpose |
| Ding et al. [18] | Support vector machine, neural network | Protein test data set | Accuracy, Q-percentage accuracy | Bioinformatics |
| Tao et al. [19] | Support vector machine, Bayesian network, K-nearest, decision tree | ALL, GCM, SRBCT, MLL-leukemia, lymphoma, NCI60, HBC | Accuracy | Bioinformatics |
| Foody et al. [20] | Support vector machine, decision tree, discriminant analysis, neural network | Airborne sensor data | Accuracy | Remote imaging |
| Chih-Wei Hsu et al. [21] | Support vector machine | Iris, wine, glass, vehicle, segment, DNA, sat-image, letter, shuttle | Accuracy, running time | General purpose |
| Caruana et al. [22] | Support vector machine, neural network, decision tree, memory-based learning, bagged trees, boosted trees, boosted stumps | ADULT and other binary classification data sets | Accuracy, squared error, cross entropy, ROC area, F evaluation, recall and precision, average precision and recall, lift, probability calibration | General purpose |
| Krusienski et al. [23] | Pearson correlation method, Fisher linear discriminant, stepwise linear discriminant, linear support vector machine, Gaussian-kernel support vector machine | P300 Speller | ROC curve | Medicine |
| Our method | Decision tree, random forests, extremely randomized trees, multi-class adaboost classifier, multi-class logistic classifier, stochastic gradient boosting, multilayer perceptron, K nearest neighbors, naive Bayesian classifier, conditional random fields and support vector machines (both linear and nonlinear) | SPECTF, Ionosphere, spam, optdigits and Scene 15 | Overall accuracy, average precision and recall, average Jaccard, inter-class F and Jaccard evaluation, running performance | General purpose |

Table 1: Comparisons of multi-class classification algorithms.

Table 1 shows that the majority of current surveys focus on multi-class classifiers in a particular field, such as medicine, biology or remote-sensing imagery, and that the data sets and evaluation methods used are relatively simple. The evaluation of King et al. [15] is more comprehensive, but the compared algorithms are classical and new algorithms have emerged since. The algorithms we compare are among the newest and most representative of current practice across a variety of application domains.

3 Overview of Algorithms

3.1 Brief introduction of algorithms

In order to better understand the classifiers being compared, we briefly introduce each of them.

3.1.1 Decision tree

In machine learning, a decision tree [24] is a predictive model. A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test and each leaf node represents a class label; the decision is taken after evaluating the attributes along the path. A path from the root to a leaf represents a classification rule. A decision tree is in fact an adaptive basis function model [25] and can be expressed as

$$f(x) = \mathbb{E}[y \mid x] = \sum_{m=1}^{M} w_m \, \mathbb{I}(x \in R_m) = \sum_{m=1}^{M} w_m \, \phi(x; v_m) \qquad (1)$$

where $R_m$ is the $m$-th region, $w_m$ is the mean response in this region, and $v_m$ encodes the choice of variable to split on and the threshold value on the path from the root to the $m$-th leaf. The classification tree is a classical method with various variants [26]. ID3 (Iterative Dichotomiser 3), proposed by Ross Quinlan, is a greedy approach that in each iteration chooses the best attribute value to split the data, but it can get stuck in local optima. C4.5, also proposed by Ross Quinlan, is an improved ID3 that can be used for classification. CART (Classification And Regression Trees) [27], proposed by Breiman, follows the same process as C4.5, except that C4.5 uses information entropy whereas CART uses the Gini coefficient.

3.1.2 Random forests

Random forests were proposed by Leo Breiman and Adele Cutler [28]. A random forest is an ensemble learning method for classification (and regression) that constructs a multitude of decision trees at training time and outputs the class by voting of the individual trees. Random forests build a forest of uncorrelated trees using a CART-like procedure, combined with randomized node optimization and bagging [29]. Random forests are computationally efficient and improve prediction accuracy without a significant increase in computational cost. Random forests predict well with up to thousands of explanatory variables [30] and are known as one of the best current algorithms [31].
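As a concrete illustration of Sections 3.1.1 and 3.1.2 (a sketch only, using scikit-learn rather than the C++/OpenCV implementation used in our experiments; the data set and hyper-parameter values are placeholders), the following snippet trains a single CART-style tree with the Gini criterion and a bagged random forest on the same data.

```python
# Illustrative sketch: scikit-learn stand-ins for the decision tree (3.1.1)
# and random forest (3.1.2) described above; values are placeholders.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)            # small stand-in for optdigits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Single CART-like tree: Gini splitting criterion, limited depth (cf. Eq. (1)).
tree = DecisionTreeClassifier(criterion="gini", max_depth=10).fit(X_tr, y_tr)

# Random forest: bagging plus randomized node optimization over 100 trees.
forest = RandomForestClassifier(n_estimators=100, max_depth=10,
                                random_state=0).fit(X_tr, y_tr)

print("tree accuracy:  ", tree.score(X_te, y_te))
print("forest accuracy:", forest.score(X_te, y_te))
```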
3.1.3 Extremely randomized trees

Extremely randomized trees were introduced by Pierre Geurts, Damien Ernst and Louis Wehenkel [32]. The algorithm for growing extremely randomized trees is similar to random forests, but there are two differences: 1. Extremely randomized trees do not apply the bagging procedure to construct a training set for each tree; the same input training set is used to train all trees. 2. Extremely randomized trees pick a node split completely at random (both the variable index and the splitting value are chosen randomly), whereas random forests find the best split (the optimal variable index and splitting value) among a random subset of variables.

3.1.4 Multi-class adaboost classifier

Boosting has been a very successful technique for solving the two-class classification problem [33]. In going from two-class to multi-class classification, most boosting algorithms have been restricted to reducing the multi-class classification problem to multiple two-class problems [34].

3.1.5 Stochastic gradient boosting

Gradient boosting, proposed by Friedman [35], is a method that improves the basic boosting algorithm. Traditional boosting adjusts the weights of correctly and incorrectly classified samples at each iteration; the major difference of gradient boosting is that at each iteration the new model is fitted to the residuals of the ensemble built so far, following the gradient of the loss. Inspired by Breiman's [6] randomized bagging idea, Friedman introduced stochastic gradient boosting, which trains each basic classifier on a randomly down-sampled subset of the data.

3.1.6 Support vector machines

Support vector machines [36, 37] map the feature vector into a higher-dimensional vector space, where a maximum-margin hyper-plane is established. The hyper-plane is chosen so that the distance from it to the nearest data point on each side is maximized; the greater the distance between the nearest data points of different classes, the smaller the total classification error. Multi-class support vector machines [21] can be defined as follows. Suppose $l$ training samples $(x_1, c_1), \ldots, (x_l, c_l)$, where $x_i \in \mathbb{R}^n$, $i = 1, \ldots, l$, and $c_i \in \{1, \ldots, k\}$ is the class of $x_i$. The $i$-th (one-versus-rest) support vector machine solves the problem

$$\min_{w^i, b^i, \xi^i} \; \frac{1}{2} (w^i)^T w^i + C \sum_{j=1}^{l} \xi^i_j \qquad (2)$$

$$(w^i)^T \phi(x_j) + b^i \ge 1 - \xi^i_j \;\; \text{if } c_j = i, \qquad (w^i)^T \phi(x_j) + b^i \le -1 + \xi^i_j \;\; \text{if } c_j \ne i, \qquad (3)$$

$$\xi^i_j \ge 0, \quad j = 1, \ldots, l, \qquad (4)$$

where the training samples $x_j$ are mapped into the high-dimensional space by the function $\phi$ and $C$ is the regularization parameter. Minimizing $\frac{1}{2}(w^i)^T w^i$ means maximizing $2/\|w^i\|$, the margin between the two groups of data, while the penalty term $C \sum_j \xi^i_j$ reduces the number of training errors. The core concept of the support vector machine is to seek a balance between the regularization term $\frac{1}{2}(w^i)^T w^i$ and the training errors. After solving (2)-(4) we obtain $k$ decision functions

$$(w^1)^T \phi(x) + b^1, \; \ldots, \; (w^k)^T \phi(x) + b^k, \qquad (5)$$

and a sample $x$ is assigned to the class whose decision function gives the largest value.

3.1.7 Linear support vector machines

The multi-class linear support vector machine used by liblinear follows Crammer and Singer's formulation,

$$\min_{w_1, \ldots, w_k, \xi} \; \frac{1}{2} \sum_{m=1}^{k} w_m^T w_m + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad w_{c_i}^T x_i - w_m^T x_i \ge e_i^m - \xi_i, \;\; m = 1, \ldots, k, \qquad (6)$$

where $C > 0$ is the regularization parameter, $w_m$ is the weight vector associated with class $m$, $e_i^m = 1 - \alpha_{c_i,m}$, and $\alpha_{c_i,m} = 1$ if $c_i = m$, $\alpha_{c_i,m} = 0$ if $c_i \ne m$. Note that in (6) the constraint with $m = c_i$ corresponds to the non-negativity constraint $\xi_i \ge 0$. The decision function is

$$\arg\max_{m=1,\ldots,k} \; w_m^T x. \qquad (7)$$
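To make the decision rules in (5) and (7) concrete, here is a sketch (scikit-learn based, not the libsvm/liblinear C++ interfaces used later in the paper) that trains a nonlinear one-versus-rest SVM and a Crammer-Singer style linear SVM and applies the arg-max decision; data and parameters are placeholders.

```python
# Illustrative sketch: one-vs-rest kernel SVM (Eqs. (2)-(5)) and a
# Crammer-Singer multi-class linear SVM (Eqs. (6)-(7)); placeholders only.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# k one-vs-rest decision functions (w^i)^T phi(x) + b^i with a polynomial kernel.
ovr = OneVsRestClassifier(SVC(kernel="poly", degree=3, C=1.0)).fit(X_tr, y_tr)
scores = ovr.decision_function(X_te)           # shape: (n_samples, k)
pred_ovr = np.argmax(scores, axis=1)           # arg-max rule of Eq. (5)

# Crammer-Singer linear multi-class SVM, decision rule argmax_m w_m^T x (Eq. (7)).
cs = LinearSVC(multi_class="crammer_singer", C=1.0).fit(X_tr, y_tr)
pred_cs = cs.predict(X_te)

print("one-vs-rest accuracy:   ", np.mean(pred_ovr == y_te))
print("Crammer-Singer accuracy:", np.mean(pred_cs == y_te))
```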
3.1.8 K nearest neighbors

The K nearest neighbours algorithm (k-NN for short) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. In k-NN classification, the output is a class membership: an object is classified by a majority vote of its neighbours, being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbour.

3.1.9 Multi-class logistic classifier (maximum entropy classifier)

In some cases, multi-class logistic regression fits the features well. It has the form [25]

$$p(y = c \mid x, W) = \frac{\exp(w_c^T x)}{\sum_{c'=1}^{C} \exp(w_{c'}^T x)} \qquad (8)$$

where $p(y = c \mid x, W)$ is the predictive probability, $c$ is one of the $C$ class types and $w_c$ is the weight vector of class $c$, estimated by maximizing the posterior probability. With this, the log-likelihood has the form

$$\ell(W) = \sum_{i=1}^{N} \log \prod_{c=1}^{C} \mu_{ic}^{y_{ic}} = \sum_{i=1}^{N} \left[ \sum_{c=1}^{C} y_{ic} \, w_c^T x_i - \log \sum_{c'=1}^{C} \exp(w_{c'}^T x_i) \right] \qquad (9)$$

where $\mu_{ic} = p(y_i = c \mid x_i, W)$ and $y_{ic} = 1$ if sample $i$ belongs to class $c$ and $0$ otherwise. This model can be optimized by L-BFGS [41].

3.1.10 Multilayer perceptron

A multilayer perceptron (MLP) is a feed-forward artificial neural network model that maps sets of input data onto a set of appropriate outputs. An MLP consists of multiple layers of nodes in a directed graph, with each layer fully connected to the next one. Except for the input nodes, each node is a neuron (processing element) with a nonlinear activation function. The MLP [57] has evolved over the years into a very powerful technique for solving a wide variety of problems. Much progress has been made in improving performance and in understanding how these neural networks operate; however, there is still a need for improvements in training, since the training process is very chaotic in nature.

3.1.11 Naive Bayesian classifier

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". Assuming sample $X$ belongs to class $C_i$, the class posterior is

$$P(C_i \mid X) = \frac{f(X \mid C_i) \, P(C_i)}{p(X)} = \frac{f(X \mid C_i) \, P(C_i)}{\sum_{j} f(X \mid C_j) \, P(C_j)} \qquad (10)$$

where $f(\cdot \mid C)$ is the class-conditional likelihood, $X$ is the input feature vector and $C$ is the class. A simpler alternative to generative and discriminative learning is to dispense with probabilities altogether and to learn a function, called a discriminant function, that directly maps inputs to outputs. The decision function of the naive Bayesian classifier is

$$c = \arg\max_{c_k} \; P(C = c_k) \prod_{i} P(X_i = x_i \mid C = c_k). \qquad (11)$$
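A minimal sketch of the decision rule (11), assuming Gaussian class-conditional densities for each feature (the independence assumption lets the joint likelihood factorize); this is an illustration only, not the OpenCV normal Bayes classifier used in the experiments, and the data below are random placeholders.

```python
# Illustrative Gaussian naive Bayes: argmax_k [log P(C=c_k) + sum_i log P(x_i|C=c_k)].
import numpy as np

def fit_gaussian_nb(X, y):
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    classes, priors, means, variances = model
    # log prior + summed per-feature Gaussian log-likelihoods: Eq. (11) in log form.
    log_post = np.log(priors) - 0.5 * np.sum(
        np.log(2 * np.pi * variances)[None, :, :]
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :],
        axis=2)
    return classes[np.argmax(log_post, axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
model = fit_gaussian_nb(X, y)
print("training accuracy:", np.mean(predict_gaussian_nb(model, X) == y))
```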
3.1.12 Conditional random fields classifier

Conditional random fields (CRFs) are a class of statistical modelling methods often applied in pattern recognition and machine learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a single sample without regard to "neighboring" samples, a CRF can take context into account; e.g., the linear-chain CRF popular in natural language processing predicts sequences of labels for sequences of input samples. We define a CRF on observations $X$ and random variables $Y$ as follows. Let $G = (V, E)$ be a graph such that $Y = (Y_v)_{v \in V}$, so that $Y$ is indexed by the vertices of $G$. Then $(X, Y)$ is a conditional random field when the random variables $Y_v$, conditioned on $X$, obey the Markov property with respect to the graph:

$$p(Y_v \mid X, Y_w, w \ne v) = p(Y_v \mid X, Y_w, w \sim v) \qquad (12)$$

where $w \sim v$ means that $w$ and $v$ are neighbors in $G$. This means that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets, $X$ and $Y$, the observed and output variables, respectively; the conditional distribution $p(y \mid x)$ is then modeled. For classification, we compute the maximum of the conditional probability distribution.

3.2 Comparison of algorithms

With reference to [45], we compare the 12 multi-classification algorithms as shown in Table 2. We analyze and summarize Table 2 as follows.

1. Algorithm type. Aside from naive Bayes, all the algorithms are discriminative models. The Bayesian algorithm learns the joint distribution $P(c, x)$ and then obtains the conditional probability $P(x \mid c)$; the classification prediction is achieved by maximizing the likelihood approximation. The discriminative methods instead make predictions directly through a discriminant function or a conditional probability.

2. Algorithm traits. Decision trees, random forests, extremely randomized trees, multi-class adaboost and stochastic gradient boosting all belong to the family of models with adaptive basis functions, which can be grouped into the common additive model [25], as shown in Equation (13):

$$f(x) = \alpha + f_1(x_1) + \cdots + f_D(x_D) \qquad (13)$$

where $f_j$ is a sub-model obtained from the training sample and $f(x)$ is the superposition of the sub-models. The decision tree is the basic sub-model of the tree-like algorithms. Depending on whether all samples are used for training, these algorithms split into random forests and extremely randomized trees: random forests build sub-models through random bagging sampling, whereas extremely randomized trees obtain sub-models using all training samples but randomly select the splitting threshold. Multi-class adaboost and stochastic gradient boosting are linear combinations of sub-models (weak classifiers); the difference lies in their learning algorithms. The multilayer perceptron and the linear and non-linear support vector machines can be classified as kernel methods [46]; the unified formula has the form

$$f(x) = w^T \Phi(x) + b \qquad (14)$$

where $w$ is a real weight vector, $b \in \mathbb{R}$ is a bias and the function $\Phi(\cdot)$ depends on the type of classifier. For the MLP, $\Phi(x) = (\phi_1(x), \ldots, \phi_d(x))$, where the $i$-th hidden unit is $\phi_i(x) = h(v_i^T x)$ and $h$ is the activation (mapping) function, generally a hyperbolic tangent or sigmoid (S-shaped) function. For the linear support vector machine $\Phi(\cdot)$ is a linear function, whereas for the non-linear support vector machine it is a polynomial function, Gaussian kernel function, etc. The K nearest neighbor model is constructed from the partition induced by the distance relationships of the training feature space and classifies by majority voting. Multi-class logistic regression (maximum entropy) is a probabilistic choice model with constraints, in which uncertain outcomes are treated as equally probable by maximizing the entropy. The naive Bayes classifier is based on the conditional independence assumption on the training samples; it learns the parameters of the joint probability distribution through Bayes' theorem and then classifies a given sample by maximizing the posterior probability.
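The additive-model view of Equation (13) can be illustrated with the stochastic gradient boosting classifier of Section 3.1.5, whose ensemble accumulates sub-models stage by stage. The sketch below uses scikit-learn (not our C++ implementation); the data set and hyper-parameters are placeholders that loosely echo the settings of Section 4.3.5.

```python
# Illustrative sketch of the additive model in Eq. (13): a stochastic gradient
# boosting ensemble adds one small tree per stage.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# subsample < 1.0 gives the "stochastic" variant: each tree sees a random subset.
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 learning_rate=0.1, subsample=0.8,
                                 random_state=0).fit(X_tr, y_tr)

# staged_predict exposes f(x) after 1, 2, ..., M sub-models have been added.
for stage, y_hat in enumerate(gbt.staged_predict(X_te), start=1):
    if stage % 25 == 0:
        print(f"after {stage:3d} trees: accuracy = {np.mean(y_hat == y_te):.3f}")
```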
| Algorithm | Type | Characteristic | Learning policy | Loss | Learning algorithm | Classification strategy |
|---|---|---|---|---|---|---|
| Decision tree | Discriminative | Classification tree | Regularized maximum likelihood estimation | Log-likelihood loss | Feature selection, generation, pruning | IF-THEN rules based on tree splitting |
| Random forests | Discriminative | Classification trees (based on bagging) | Regularized maximum likelihood estimation | Log-likelihood loss | Building multiple decision trees on bagged sub-samples | Sub-tree voting |
| Extremely randomized trees | Discriminative | Classification trees | Regularized maximum likelihood estimation | Log-likelihood loss | Building multiple decision trees | Sub-tree voting |
| Multi-class adaboost | Discriminative | Linear combination of weak classifiers (based on decision trees) | Additive minimization of loss | Exponential loss | Forward additive algorithm | Maximum of the weighted linear combination of weak classifiers |
| Stochastic gradient boosting | Discriminative | Linear combination of weak classifiers (based on decision trees) | Additive minimization of loss | Exponential loss | Stochastic gradient descent | Maximum of the weighted linear combination of weak classifiers |
| Non-linear SVM (based on libsvm) | Discriminative | Separating hyper-plane, kernel trick | Minimizing regularized hinge loss, soft-margin maximization | Hinge loss | Sequential minimal optimization (SMO) | Maximum class score of the test sample |
| Linear SVM (based on liblinear) | Discriminative | Separating hyper-plane | Minimizing regularized hinge loss, soft-margin maximization | Hinge loss | Sequential dual method | Maximum weighted score of the test sample |
| K-nearest | Discriminative | Distance in feature space | — | — | — | Majority voting, empirical loss minimization |
| Multi-class logistic (maximum entropy) | Discriminative | Conditional probability distribution, log-linear model | Regularized maximum likelihood estimation | Logistic loss | L-BFGS | Maximum likelihood estimation |
| Multilayer perceptron | Discriminative | Separating hyper-plane | Minimizing the distance of misclassified points to the hyper-plane | Distance of misclassified points to the hyper-plane | Stochastic gradient descent | Maximum weighted score of the test sample |
| Naive Bayesian classifier | Generative | Joint distribution of features and class, conditional independence assumption | Maximum likelihood estimation, maximum posterior probability | Log-likelihood loss | Probabilistic computation | Maximum posterior probability |
| Conditional random fields | Discriminative | Conditional probability distribution given the observation sequence, log-linear model | Maximum likelihood estimation, regularized maximum likelihood estimation | Log-likelihood loss | Stochastic gradient descent, quasi-Newton methods | Maximum likelihood estimation |

Table 2: Summary of twelve multi-class methods.

3. Learning policy, loss and algorithms. Decision trees, random forests and extremely randomized trees follow the maximum likelihood estimation learning strategy, with a log-likelihood loss. The optimal strategy of the decision tree is to learn features through a recursive process and split the tree in accordance with the features of the training samples; to obtain better generalization ability, the decision tree model has to be pruned to remove over-fitting. Random forests use random (bagging) sampling to build a forest of many trees, and extremely randomized trees randomly select the splitting value to build the forest of decision trees. The multilayer perceptron and the linear and non-linear support vector machines learn a separating hyper-plane.
The difference is that the multilayer perceptron minimizes the classification error of the hyper-plane, whereas the linear and non-linear support vector machines minimize the hinge loss. The perceptron learning algorithm is stochastic gradient descent, the linear support vector machine uses a sequential dual method, and the non-linear support vector machine uses sequential minimal optimization. K-nearest neighbor is based on distances in feature space. The learning strategy of the multi-class logistic classifier (maximum entropy) can be seen either as maximum likelihood estimation or as logistic loss minimization; the loss is a log-linear function, the model is learned by maximum likelihood estimation or regularized maximum likelihood estimation under certain training conditions, and the optimization algorithm is L-BFGS. Naive Bayes is a typical generative model: the maximum likelihood estimate and the Bayesian posterior probabilities are calculated from the joint probability under the conditional independence assumption.

4. Classification strategy. The decision tree uses IF-THEN rules for classification decisions based on the learned model. Random forests and extremely randomized trees are based on the votes of the individual decision trees, taking the class with the highest vote as the final result. The multilayer perceptron, multi-class logistic regression (maximum entropy) and the linear and non-linear support vector machines all share the same form of classification decision,

$$\text{class of } x = \arg\max_{i} \left( (w^i)^T \phi(x) + b^i \right). \qquad (15)$$

The difference lies in the choice of $\phi(\cdot)$: the multilayer perceptron chooses sigmoid (S-shaped) functions, hyperbolic tangent functions, etc.; a log-linear function is chosen for multi-class logistic regression; the linear support vector machine chooses a linear function; and the nonlinear support vector machine chooses a non-linear kernel function. K nearest neighbor uses majority voting: the output class is determined by the votes of the K training samples nearest to the test sample under the learned model. The decision strategy of naive Bayes is the rule of maximizing the posterior probability.

4 Experimental Setup

In order to evaluate the performance of the various classifiers, we implemented our comparisons in C++ based on Darwin [47], OpenCV [56], libsvm [48] and liblinear [11]. This paper compares 12 algorithms.

4.1 Performance comparisons

The confusion matrix [49, 50] is a common performance evaluation method in pattern recognition and classification; it characterizes the relationship between the true classes and the recognized classes. For a multi-classification problem (we take three categories as an example to simplify the presentation) it is illustrated in Table 3.

| True class \ Prediction | A | B | C |
|---|---|---|---|
| A | AA | AB | AC |
| B | BA | BB | BC |
| C | CA | CB | CC |

Table 3: Statistics of the confusion matrix for sample classification.

Here A, B and C are three classes; AA, BB and CC are the numbers of correctly predicted samples, and the remaining entries are the numbers of incorrectly predicted samples. AA is the number of samples of class A correctly identified as class A; AB is the number of samples of class A incorrectly predicted as class B; the remaining entries have analogous meanings. The total accuracy can be calculated from the confusion matrix as

$$TA = (AA + BB + CC) / (AA + AB + \cdots + CC) \qquad (16)$$

where $AA + AB + \cdots + CC$ is the total number of samples and $TA$ is the total accuracy. Precision and recall are quantitative evaluation methods; they are used not only to evaluate the accuracy of each class but also as important standards to measure the performance of a classification system. Precision is the fraction of retrieved instances that are relevant and reflects the classification accuracy. In practice, the average precision is often used to evaluate multi-class classification (again taking three categories as an example), calculated as

$$\text{avgprecision} = \left( \frac{AA}{AA + BA + CA} + \frac{BB}{AB + BB + CB} + \frac{CC}{AC + BC + CC} \right) / 3. \qquad (17)$$

Recall is the fraction of relevant instances that are retrieved and reflects the classification comprehensiveness. In practice, the average recall is often used to evaluate multi-class classification, calculated as

$$\text{avgrecall} = \left( \frac{AA}{AA + AB + AC} + \frac{BB}{BA + BB + BC} + \frac{CC}{CA + CB + CC} \right) / 3. \qquad (18)$$

The F measure is an integrated measure of recall and precision; higher values reflect a better combination of recall and precision. F is defined as

$$F = \frac{2 \cdot \text{avgprecision} \cdot \text{avgrecall}}{\text{avgprecision} + \text{avgrecall}}. \qquad (19)$$

The Jaccard coefficient [51] is used to compare the similarity and diversity of sample sets. Taking class A as an example,

$$JC_A = \frac{AA}{AA + AB + AC + BA + CA}. \qquad (20)$$

The average Jaccard coefficient is the average of the per-class Jaccard coefficients,

$$\text{avgJC} = \left( \frac{AA}{AA + AB + AC + BA + CA} + \frac{BB}{AB + BB + CB + BA + BC} + \frac{CC}{CA + CB + CC + AC + BC} \right) / 3. \qquad (21)$$

A higher Jaccard coefficient reflects more accurate prediction, and the Jaccard coefficient can be used to evaluate each class of a multi-class classification separately.
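As a worked illustration of Equations (16)-(21), the following snippet computes the total accuracy, average precision and recall, F measure and average Jaccard coefficient from a 3x3 confusion matrix whose rows are true classes and whose columns are predictions. The matrix values are made-up placeholders, not one of our experimental results.

```python
# Metrics of Section 4.1 computed from a confusion matrix (rows = true class,
# columns = predicted class). The matrix entries are illustrative placeholders.
import numpy as np

cm = np.array([[50,  3,  2],    # true A: AA, AB, AC
               [ 4, 45,  6],    # true B: BA, BB, BC
               [ 1,  5, 44]])   # true C: CA, CB, CC

total_accuracy = np.trace(cm) / cm.sum()                    # Eq. (16)
precision_per_class = np.diag(cm) / cm.sum(axis=0)          # column sums
recall_per_class = np.diag(cm) / cm.sum(axis=1)             # row sums
avg_precision = precision_per_class.mean()                  # Eq. (17)
avg_recall = recall_per_class.mean()                        # Eq. (18)
f_measure = 2 * avg_precision * avg_recall / (avg_precision + avg_recall)  # Eq. (19)

# Jaccard per class: correct / (row sum + column sum - correct), Eqs. (20)-(21).
jaccard_per_class = np.diag(cm) / (cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm))
avg_jaccard = jaccard_per_class.mean()

print(f"TA={total_accuracy:.3f}  avgP={avg_precision:.3f}  "
      f"avgR={avg_recall:.3f}  F={f_measure:.3f}  avgJC={avg_jaccard:.3f}")
```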
4.2 Data sets and data transformation

We evaluate the performance of these algorithms on five data sets, consisting of three binary classification data sets and two multi-classification data sets.

4.2.1 SPECTF heart data set

SPECTF [52] is a single-photon emission computed tomography cardiac data set and a relatively new data set [53]. In the cleaning process, records with missing information or incomplete pictures were filtered out; since the original image scale is not uniform, the original images were transformed into gray images (0 to 255). After cleaning, the data set contains 44 features that record 22 regions of interest for the cardiac systolic and expanded states. The data type is integer (ranging from 0 to 100). The data set has 267 instances in total, of which 80 are training samples and 187 are test samples. Each instance has two states, normal and abnormal; in the experiment we converted the class labels into class 0 (normal) and class 1 (abnormal). The histogram of the training samples has an approximately normal shape, as shown in Figure 1(a).

4.2.2 Ionosphere data set

The Ionosphere data set [52] is a set of radar data collected by a military system at the NATO airbase in Goose Bay, Labrador, Canada. The data set has 34 features (integer and decimal types) that record the labels and values of 17 sub-pulses. It consists of 351 instances, of which 100 are training samples and 251 are test samples. Each instance has two states, good and bad; in the experiment we converted the class labels into class 0 (representing the existence of the fact) and class 1 (the fact does not exist). The feature type of this data set is real, ranging from -1 to 1. The histogram of the distribution of the training samples is shown in Figure 1(b).
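A small sketch of the label conversion and fixed train/test split described in 4.2.1-4.2.2, assuming the UCI file has been saved locally as CSV (the file name, and the 'g'/'b' label encoding of the UCI Ionosphere file, are assumptions of this illustration, not part of our pipeline):

```python
# Sketch of the label conversion for the Ionosphere data set (hypothetical local copy).
import numpy as np
import pandas as pd

df = pd.read_csv("ionosphere.data", header=None)   # placeholder file name
X = df.iloc[:, :-1].to_numpy(dtype=float)          # 34 real-valued features
labels = df.iloc[:, -1].to_numpy()

# 'g' (good) -> class 0, 'b' (bad) -> class 1, as described in Section 4.2.2.
y = np.where(labels == "g", 0, 1)

# Fixed split sizes used in the paper: 100 training and 251 test instances.
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]
print(X_train.shape, X_test.shape)
```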
4.2.3 Spam data set

Spam [52] is a spam-filtering data set with 57 features. 48 continuous real-valued features represent the percentage of words in the e-mail matching 48 given words (e.g. make, address, all, conference), and 6 features represent the occurrence of given characters (e.g. !, (, [, $, #) in the e-mail. The data set has 4601 instances in total, of which 3065 are training instances and 1536 are test instances. Each instance has two states: 0 for not spam and 1 for junk mail. The features are real-valued, ranging from 0 to 9090. The training sample distribution histogram is scattered, as shown in Figure 1(c).

4.2.4 Optdigits data set

Optdigits [52] is a collection of handwritten digits from 43 people (30 contributed to the training set and 13 to the test set), extracted as optical-character-recognition bitmaps following the standard of the American National Institute of Standards and Technology (NIST). The data set has 5620 instances in total, of which 3823 are training instances and 1797 are test instances. The instances have ten classes, the digits 0 to 9. The feature data type is a positive integer ranging from 0 to 16. The histogram of the distribution of the training samples is shown in Figure 1(d).

4.2.5 Scene 15 data set

The Scene 15 data set [54] is a post-processed data set. The original data set has 15 classes and 4485 images in total. We randomly selected 2250 samples as training samples and 2235 samples for testing. We used PHOW (Pyramid Histogram Of visual Words) [54] for feature extraction, obtaining 12000 descriptors for each image. In order to obtain better class separation within the features, we used the method proposed by Vedaldi [55] for a kernel transformation of the features; finally we obtained 36000-dimensional feature descriptors to characterize a single image. In the experiment we converted the 15 classes into classes 0 to 14, representing bedroom, CALsuburb, industrial, kitchen, livingroom, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice and store, respectively. The feature data type is a real number in the range -0.3485 to 0.4851; furthermore, the feature data are relatively concentrated, similar to a normal distribution. The histogram of the distribution of the training samples is shown in Figure 1(e). This data set has a large feature dimension and a large number of samples; it was selected mainly to test the adaptability of the algorithms to realistic application scenarios.

Figure 1: The histogram characteristics of the training data sets.

Figure 1 demonstrates the diversity of our data sets. SPECTF, Ionosphere and spam are small-scale data sets used for binary classification; optdigits and Scene 15 are large-scale data sets used for multi-class classification. At the same time, the ranges of the feature values vary widely: SPECTF ranges from 0 to 100, Ionosphere from -1 to 1, spam from 0 to 1000, optdigits from 0 to 20 and Scene 15 from -0.5 to 0.5. This diversity is necessary for our comparative experiments and shows that our experiments are rich and valuable.
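The Scene 15 feature pipeline of Section 4.2.5 can be sketched as follows. This is not the authors' vlfeat/PHOW code: scikit-learn's additive chi-squared feature map is used here as a stand-in for the Vedaldi-style kernel transformation, and the visual-word histograms are random placeholders for precomputed PHOW descriptors.

```python
# Sketch of 4.2.5: visual-word histograms are expanded with an explicit
# additive-kernel feature map and fed to a linear SVM (placeholder data).
import numpy as np
from sklearn.kernel_approximation import AdditiveChi2Sampler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_images, n_bins, n_classes = 300, 1200, 15
X_hist = rng.random((n_images, n_bins))            # placeholder PHOW histograms
X_hist /= X_hist.sum(axis=1, keepdims=True)        # L1-normalize each histogram
y = rng.integers(0, n_classes, size=n_images)      # placeholder scene labels

# Explicit feature map approximating the additive chi-squared kernel (in the
# spirit of Vedaldi's transformation); sample_steps=2 triples the dimension,
# roughly matching the 12000 -> 36000 expansion described above.
chi2_map = AdditiveChi2Sampler(sample_steps=2)
X_mapped = chi2_map.fit_transform(X_hist)
print("expanded feature dimension:", X_mapped.shape[1])

clf = LinearSVC(C=1.0).fit(X_mapped, y)            # linear SVM on mapped features
print("training accuracy (placeholder data):", clf.score(X_mapped, y))
```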
4.2.6 Dataset selection

We selected the data sets based on the following considerations.
1. Both binary classification and multi-classification data sets were taken into account, because some algorithms are superior for binary classification while others are more suitable for multi-classification problems.
2. For a better evaluation of the adaptability of the algorithms, we chose both a low-dimensional data set of small size and a high-dimensional data set of large size.
3. We also paid attention to differences in data type: both pure integer and double data types were included, and there were real-valued features that are wholly positive as well as features that take negative values.
4. As can be seen in Figure 1, the distributions of the chosen data sets differ considerably: some are concentrated, close to a normal distribution, while others are relatively sparse.

4.3 Parameter setting and selection

4.3.1 Decision tree

1. The maximum depth of the tree: the default value was 1. As the value increases, classification accuracy and running time increase. We set the maximum depth to 2, 4, 6, 8, 10, 12, 14, 16, 18 and 20 respectively and found the highest accuracy at a maximum depth of 10.
2. Splitting criterion: we tested the misclassification rule, the entropy rule and the Gini coefficient rule respectively. The Gini coefficient performed best and the misclassification rule worst.
3. The minimum number of samples required for a split was tested with values of 0, 2, 4, 6, 8 and 10 respectively; as the minimum number of samples increased, performance degraded.
4. The maximum number of features of the training sample was left at the default of 1000.

4.3.2 Random forests

1. The maximum depth of the trees in the random forest: as for the decision tree, the optimal value was 10.
2. The maximum number of training sample features: the default was 1000; the value should change with the number of sample features, up to the largest number of features of the samples themselves.
3. The number of decision trees: we tested values of 20, 40, 60, 80, 100, 120, 140, 160, 180 and 200 respectively. As the number increases, the execution time also increases, and beyond 100 the improvement in accuracy is not obvious, so we set it to 100.
4. The accuracy of the random forest, used to control the iteration: we tested values of 0.0001, 0.001, 0.01 and 0.1 respectively. The smaller the value, the longer the execution time; beyond 0.001 the accuracy showed no substantial change, so we set it to 0.001.

4.3.3 Extremely randomized trees

The parameters were kept consistent with those of the random forest.

4.3.4 Multi-class adaboost classifier

1. Boosting method: we tested discrete boosting, gentle boosting and real boosting respectively and found gentle boosting to be the best.
2. The tree depth was initialized to 10.
3. Shrinkage factor: we tested values of 0.1, 0.2, 0.3 and 0.4 respectively and found 0.1 to be the best.
4. Maximum number of boosting rounds: we tested values of 10, 50, 100, 200 and 400 respectively. Beyond 100 the precision showed no significant increase, so we set it to 100.

4.3.5 Stochastic gradient boosting

1. The maximum depth of the trees, as for the decision tree, was set to 10.
2. The loss function for classification problems was the cross-entropy loss.
3. The shrinkage factor was set to 0.1.
4. The subsampling percentage, which controls the fraction of samples used for every single tree in the training process, was set to 0.8.
5. The maximum number of boosting iterations was set to 100.
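The depth and ensemble-size sweeps described in 4.3.1-4.3.5 can be reproduced in outline as follows (a scikit-learn sketch, not our C++/OpenCV code; the data set and parameter grids are placeholders):

```python
# Illustrative parameter sweep in the spirit of Sections 4.3.1-4.3.5:
# vary tree depth and ensemble size and keep the best-scoring setting.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

param_grid = {
    "max_depth": [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],   # depth sweep of 4.3.1
    "n_estimators": [20, 60, 100, 140, 200],              # tree-count sweep of 4.3.2
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, n_jobs=-1).fit(X_tr, y_tr)

print("best parameters:", search.best_params_)
print("test accuracy:  ", search.score(X_te, y_te))
```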
4.3.6 Support vector machines

1. Type: we tested both C-support vector classification (C-SVC) and ν-support vector classification (ν-SVC) and found that ν-SVC performed better than C-SVC. The penalty factor for C-SVC was set to 1. The parameter ν of ν-SVC has range (0, 1]; we set it to 0.5, and p was set to 0.1.
2. Kernel types are linear, radial basis function (RBF), sigmoid and polynomial (POLY).

| Method | Kernel | SPECTF | Ionosphere | Spam | optdigits |
|---|---|---|---|---|---|
| C-SVM | POLY | 0.770 | 0.565 | 0.618 | 0.978 |
| C-SVM | RBF | 0.898 | 0.749 | 0.772 | 0.562 |
| C-SVM | Sigmoid | 0.080 | 0.693 | 0.374 | 0.101 |
| ν-SVM | POLY | 0.748 | 0.733 | 0.629 | 0.934 |
| ν-SVM | RBF | 0.903 | 0.796 | 0.779 | 0.586 |
| ν-SVM | Sigmoid | 0.080 | 0.840 | 0.648 | 0.101 |

Table 4: Overall accuracy of libsvm on four data sets.

Table 4 reports the total accuracy of the non-linear support vector machines on four data sets. We selected ν-support vector classification with the POLY kernel as the non-linear support vector machine configuration.

4.3.7 Linear support vector machines

From Table 5 we can see that the L2R_L2LOSS_SVC loss function obtained good accuracy, so we chose L2R_L2LOSS_SVC as the loss type of the linear support vector machine.

| Loss type | SPECTF | Ionosphere | Spam | Optdigits |
|---|---|---|---|---|
| L2R_L2LOSS_SVC | 0.620 | 0.856 | 0.898 | 0.947 |
| L2R_L2LOSS_SVC_DUAL | 0.577 | 0.856 | 0.617 | 0.939 |
| L2R_L1LOSS_SVC_DUAL | 0.577 | 0.848 | 0.865 | 0.935 |
| MCSVM_CS | 0.805 | 0.860 | 0.631 | 0.933 |

Table 5: Overall accuracy of liblinear on four data sets.

4.3.8 K nearest neighbors

The number of nearest neighbors K was set to 1, 3, 5, 7, 9, 11, 13 and 15 respectively. The accuracy was found to decrease with increasing K, so we set it to 1.

4.3.9 Multi-class logistic classifier

In a single-variable test with regularization factor values of 1.0e-12, 1.0e-11, 1.0e-10, 1.0e-9, 1.0e-8, 1.0e-7 and 1.0, the best results were obtained with a regularization factor of 1.0e-9, so in our comparison experiments the regularization factor (REG_STRENGTH) was set to 1.0e-9. Similarly, in the same kind of experiment the optimum results were found with between 800 and 980 iterations, so in our comparison experiments the maximum number of iterations was set to 1000 to avoid losing the optimum.

4.3.10 Multilayer perceptron

1. The propagation algorithm was tested with backward and forward propagation respectively; the backward propagation algorithm achieved significantly higher accuracy.
2. The gradient weight (learning rate) was set to the typical value of 0.1.
3. The momentum, which weights the difference of the weights over the two previous iterations, was set to the typical value of 0.1.
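For illustration, a scikit-learn stand-in for the multilayer perceptron configured along the lines of 4.3.10 (backpropagation via stochastic gradient descent, learning rate 0.1, momentum 0.1); the hidden-layer size and data set are placeholders, and this is not the OpenCV MLP used in our experiments.

```python
# Illustrative MLP with the hyper-parameters discussed in 4.3.10 (sketch only).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(64,),      # placeholder architecture
                    solver="sgd",                  # backpropagation with SGD
                    learning_rate_init=0.1,        # gradient weight of 0.1
                    momentum=0.1,                  # momentum of 0.1
                    activation="tanh",             # hyperbolic tangent units
                    max_iter=500,
                    random_state=0).fit(X_tr, y_tr)
print("test accuracy:", mlp.score(X_te, y_te))
```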
4.3.11 Naive Bayesian classifier

No parameters needed to be set.

4.3.12 Conditional random fields classifier

Only the classification type needed to be set in these experiments.

5 Experimental results and analysis

First we define the symbols used in the tables. DecisionTree represents the decision tree classifier, RandomForest the random forests, ExtraTrees the extremely randomized trees, BoostedClassifier the multi-class adaboost classifier, GradientBoostTree stochastic gradient boosting, libSVM the (non-linear) support vector machine, libLinear the linear support vector machine, Knearest the K nearest neighbor classifier, Multi-classLogistic the multi-class logistic classifier, MultiLayerPerceptron the multilayer perceptron, NaiveBayesianNet the naive Bayesian classifier and CRF the conditional random fields classifier.

5.1 Overall accuracy on five data sets

| Algorithm | Overall accuracy | Average precision | Average recall | Average Jaccard coefficient |
|---|---|---|---|---|
| DecisionTree | 0.657 | 0.5421 | 0.631 | 0.381 |
| RandomForest | 0.759 | 0.5721 | 0.686 | 0.456 |
| ExtraTrees | 0.668 | 0.5537 | 0.667 | 0.394 |
| BoostedClassifier | 0.716 | 0.5847 | 0.754 | 0.440 |
| GradientBoostTree | 1.000 | 1.000 | 1.000 | 1.000 |
| libSVM | 0.770 | 0.5664 | 0.662 | 0.458 |
| libLinear | 0.620 | 0.5342 | 0.611 | 0.356 |
| Knearest | 0.604 | 0.5666 | 0.724 | 0.362 |
| Multi-classLogistic | 0.620 | 0.5342 | 0.611 | 0.356 |
| MultiLayerPerceptron | 0.679 | 0.5377 | 0.612 | 0.391 |
| NaiveBayesianNet | 0.588 | 0.510 | 0.532 | 0.327 |
| CRF | 0.683 | 0.543 | 0.652 | 0.439 |

Table 6: Overall accuracy on SPECTF data set.

From Table 6, on the SPECTF data set, stochastic gradient boosting achieved the highest overall accuracy, average precision, recall and Jaccard value, followed by the nonlinear support vector machine and random forests; the adaboost classifier and CRF also achieved relatively good performance.

| Algorithm | Overall accuracy | Average precision | Average recall | Average Jaccard coefficient |
|---|---|---|---|---|
| DecisionTree | 0.792 | 0.765 | 0.799 | 0.635 |
| RandomForest | 0.828 | 0.807 | 0.854 | 0.691 |
| ExtraTrees | 0.888 | 0.862 | 0.897 | 0.780 |
| BoostedClassifier | 0.900 | 0.876 | 0.899 | 0.798 |
| GradientBoostTree | 1.000 | 1.000 | 1.000 | 1.000 |
| libSVM | 0.749 | 0.735 | 0.775 | 0.584 |
| libLinear | 0.856 | 0.867 | 0.787 | 0.694 |
| Knearest | 0.844 | 0.822 | 0.804 | 0.691 |
| Multi-classLogistic | 0.768 | 0.732 | 0.750 | 0.593 |
| MultiLayerPerceptron | 0.768 | 0.742 | 0.775 | 0.603 |
| NaiveBayesianNet | 0.733 | 0.683 | 0.633 | 0.500 |
| CRF | 0.863 | 0.870 | 0.793 | 0.612 |

Table 7: Overall accuracy on Ionosphere data set.

From Table 7, on the Ionosphere data set, stochastic gradient boosting achieved the highest overall accuracy, average precision, recall and Jaccard value, followed by the adaboost classifier, extremely randomized trees, CRF and the linear support vector machine.

| Algorithm | Overall accuracy | Average precision | Average recall | Average Jaccard coefficient |
|---|---|---|---|---|
| DecisionTree | 0.905 | 0.905 | 0.896 | 0.819 |
| RandomForest | 0.930 | 0.936 | 0.918 | 0.862 |
| ExtraTrees | 0.863 | 0.897 | 0.832 | 0.738 |
| BoostedClassifier | 0.940 | 0.942 | 0.933 | 0.882 |
| GradientBoostTree | 0.981 | 0.981 | 0.980 | 0.962 |
| libSVM | 0.618 | 0.687 | 0.522 | 0.332 |
| libLinear | 0.898 | 0.892 | 0.901 | 0.811 |
| Knearest | 0.775 | 0.765 | 0.764 | 0.622 |
| Multi-classLogistic | 0.923 | 0.921 | 0.918 | 0.852 |
| MultiLayerPerceptron | 0.908 | 0.904 | 0.905 | 0.827 |
| NaiveBayesianNet | 0.602 | 0.801 | 0.500 | 0.301 |
| CRF | 0.784 | 0.778 | 0.790 | 0.653 |

Table 8: Overall accuracy on spam data set.

From Table 8, on the spam data set, stochastic gradient boosting again achieved the highest overall accuracy, average precision, recall and Jaccard value, followed by the multi-class adaboost classifier and random forests. For binary classification problems on low-dimensional data sets, stochastic gradient boosting and ensemble classifiers such as multi-class adaboost thus show high performance.

| Algorithm | Overall accuracy | Average precision | Average recall | Average Jaccard coefficient |
|---|---|---|---|---|
| DecisionTree | 0.847 | 0.850 | 0.847 | 0.738 |
| RandomForest | 0.966 | 0.966 | 0.966 | 0.936 |
| ExtraTrees | 0.969 | 0.970 | 0.969 | 0.942 |
| BoostedClassifier | 0.870 | 0.880 | 0.870 | 0.778 |
| GradientBoostTree | 1.000 | 1.000 | 1.000 | 1.000 |
| libSVM | 0.978 | 0.979 | 0.978 | 0.959 |
| libLinear | 0.947 | 0.948 | 0.946 | 0.901 |
| Knearest | 0.979 | 0.980 | 0.979 | 0.961 |
| Multi-classLogistic | 0.943 | 0.944 | 0.943 | 0.893 |
| MultiLayerPerceptron | 0.946 | 0.947 | 0.946 | 0.897 |
| NaiveBayesianNet | 0.843 | 0.881 | 0.844 | 0.732 |
| CRF | 0.960 | 0.962 | 0.958 | 0.932 |

Table 9: Overall accuracy on optdigits data set.

From Table 9, on the optdigits data set, a multi-classification problem, stochastic gradient boosting achieved the highest overall accuracy, average precision, recall and Jaccard value. K nearest neighbor, the non-linear support vector machine, random forests, extremely randomized trees and CRF also achieved good overall performance. The linear support vector machine, the multi-class logistic classifier and the multi-layer perceptron showed ordinary performance, and the decision tree and naive Bayes classifier achieved the worst performance of all.

| Algorithm | Overall accuracy | Average precision | Average recall | Average Jaccard coefficient |
|---|---|---|---|---|
| DecisionTree | 0.397 | 0.391 | 0.376 | 0.246 |
| RandomForest | 0.531 | 0.548 | 0.493 | 0.339 |
| ExtraTrees | 0.655 | 0.639 | 0.637 | 0.479 |
| BoostedClassifier | 0.422 | 0.502 | 0.424 | 0.274 |
| GradientBoostTree | 0.999 | 0.999 | 0.999 | 0.999 |
| libSVM | Invalid | Invalid | Invalid | Invalid |
| libLinear | 0.815 | 0.813 | 0.8113 | 0.694 |
| Knearest | 0.565 | 0.602 | 0.5421 | 0.382 |
| Multi-classLogistic | 0.794 | 0.797 | 0.7915 | 0.665 |
| MultiLayerPerceptron | Invalid | Invalid | Invalid | Invalid |
| NaiveBayesianNet | Invalid | Invalid | Invalid | Invalid |
| CRF | 0.650 | 0.625 | 0.621 | 0.456 |

Table 10: Overall accuracy on Scene15 data set.

From Table 10, due to the high-dimensional feature data of the Scene 15 data set, the multi-layer perceptron, the naive Bayesian classifier and the nonlinear support vector machine failed. This is mainly caused by intrinsic shortcomings of the nonlinear SVM, the Bayesian network and the multi-layer perceptron: they have many inner loops and intermediate stages that require computation and storage, which becomes worse when the feature data are high dimensional. Another reason is that our testing environment is limited; with enough memory and a strong CPU, we expect this phenomenon to disappear. Viewed another way, this also reveals that these algorithms face tough conditions in real applications. Stochastic gradient boosting again achieved the highest overall accuracy, average precision, recall and Jaccard value, followed by the linear support vector machine and multi-class logistic regression; the other algorithms performed poorly.

5.2 Inter-class accuracy and Jaccard coefficient evaluation on five data sets

The statistical results above reflect the overall performance of the algorithms. However, the inter-class F and Jaccard values reveal more detailed information: the accuracy of each class may vary greatly due to differences in the data, while the overall accuracy is determined by the average over the per-class accuracies.

| Algorithm | F Class0 | F Class1 | Jaccard Class0 | Jaccard Class1 |
|---|---|---|---|---|
| RandomForest | 0.286 | 0.855 | 0.167 | 0.747 |
| ExtraTrees | 0.244 | 0.788 | 0.139 | 0.650 |
| BoostedClassifier | 0.312 | 0.822 | 0.185 | 0.697 |
| GradientBoostTree | 1.000 | 1.000 | 1.000 | 1.000 |
| libSVM | 0.271 | 0.863 | 0.157 | 0.760 |
| libLinear | 0.202 | 0.751 | 0.113 | 0.601 |
| Knearest | 0.260 | 0.730 | 0.149 | 0.575 |
| Multi-classLogistic | 0.202 | 0.751 | 0.113 | 0.601 |
| MultiLayerPerceptron | 0.211 | 0.799 | 0.118 | 0.665 |
| NaiveBayesianNet | 0.154 | 0.728 | 0.083 | 0.572 |
| CRF | 0.278 | 0.813 | 0.156 | 0.745 |

Table 11: Inter-class accuracy on SPECTF data set.

From Table 11, on the SPECTF data set, class 0 represents normal and class 1 abnormal. Stochastic gradient boosting recognized all samples, so it has the highest F and Jaccard coefficients in each sub-class (both class 0 and class 1). The F and Jaccard coefficients of the remaining algorithms are not high in class 0 but high in class 1; this indicates that the overall accuracy is boosted by the accuracy on class 1. The distribution trend of the Jaccard coefficient is in accordance with that of F: the higher F, the higher the Jaccard coefficient in class 1. Further, the support vector machines, random forests and CRF also have high F and Jaccard coefficients in class 1.

| Algorithm | F Class0 | F Class1 | Jaccard Class0 | Jaccard Class1 |
|---|---|---|---|---|
| RandomForest | 0.919 | 0.840 | 0.850 | 0.724 |
| ExtraTrees | 0.916 | 0.835 | 0.844 | 0.717 |
| BoostedClassifier | 0.926 | 0.847 | 0.863 | 0.734 |
| GradientBoostTree | 1.000 | 1.000 | 1.000 | 1.000 |
| libSVM | 0.796 | 0.674 | 0.661 | 0.508 |
| libLinear | 0.903 | 0.723 | 0.824 | 0.566 |
| Knearest | 0.890 | 0.735 | 0.802 | 0.581 |
| Multi-classLogistic | 0.827 | 0.651 | 0.706 | 0.482 |
| MultiLayerPerceptron | 0.820 | 0.678 | 0.695 | 0.513 |
| NaiveBayesianNet | 0.822 | 0.464 | 0.698 | 0.302 |
| CRF | 0.908 | 0.838 | 0.846 | 0.721 |

Table 12: Inter-class accuracy on Ionosphere data set.

From Table 12, on the Ionosphere data set, class 0 represents the presence of the fact and class 1 its absence. Performance is higher in class 0 than in class 1. Stochastic gradient boosting achieves the highest values; random forests, CRF and the multi-class adaboost classifier also perform well.

| Algorithm | F Class0 | F Class1 | Jaccard Class0 | Jaccard Class1 |
|---|---|---|---|---|
| RandomForest | 0.944 | 0.908 | 0.894 | 0.831 |
| ExtraTrees | 0.897 | 0.798 | 0.814 | 0.664 |
| BoostedClassifier | 0.952 | 0.923 | 0.908 | 0.857 |
| GradientBoostTree | 0.985 | 0.977 | 0.970 | 0.955 |
| libSVM | 0.757 | 0.107 | 0.609 | 0.056 |
| libLinear | 0.914 | 0.878 | 0.841 | 0.783 |
| Knearest | 0.815 | 0.715 | 0.687 | 0.557 |
| Multi-classLogistic | 0.937 | 0.903 | 0.881 | 0.823 |
| MultiLayerPerceptron | 0.924 | 0.886 | 0.859 | 0.795 |
| NaiveBayesianNet | 0.752 | 0.002 | 0.603 | 0.001 |
| CRF | 0.933 | 0.891 | 0.885 | 0.838 |

Table 13: Inter-class accuracy statistics on spam data set.

| Algorithm | Measure | Class0 | Class1 | Class2 | Class3 | Class4 | Class5 | Class6 | Class7 | Class8 | Class9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DecisionTree | F | 0.933 | 0.830 | 0.813 | 0.817 | 0.803 | 0.891 | 0.940 | 0.836 | 0.777 | 0.832 |
| DecisionTree | Jaccard | 0.875 | 0.710 | 0.685 | 0.690 | 0.671 | 0.804 | 0.887 | 0.719 | 0.636 | 0.712 |
| RandomForest | F | 0.989 | 0.965 | 0.983 | 0.962 | 0.981 | 0.975 | 0.986 | 0.972 | 0.920 | 0.933 |
| RandomForest | Jaccard | 0.978 | 0.933 | 0.966 | 0.926 | 0.962 | 0.952 | 0.973 | 0.945 | 0.851 | 0.875 |
| ExtraTrees | F | 0.992 | 0.963 | 0.994 | 0.956 | 0.986 | 0.975 | 0.989 | 0.972 | 0.938 | 0.934 |
| ExtraTrees | Jaccard | 0.983 | 0.928 | 0.989 | 0.915 | 0.973 | 0.952 | 0.978 | 0.945 | 0.883 | 0.877 |
| BoostedClassifier | F | 0.960 | 0.804 | 0.930 | 0.813 | 0.899 | 0.919 | 0.960 | 0.850 | 0.783 | 0.798 |
| BoostedClassifier | Jaccard | 0.923 | 0.672 | 0.870 | 0.685 | 0.816 | 0.850 | 0.924 | 0.740 | 0.644 | 0.664 |
| GradientBoostTree | F | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| GradientBoostTree | Jaccard | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| libSVM | F | 0.997 | 0.978 | 0.994 | 0.972 | 0.994 | 0.986 | 0.997 | 0.977 | 0.948 | 0.943 |
| libSVM | Jaccard | 0.994 | 0.957 | 0.989 | 0.946 | 0.989 | 0.973 | 0.995 | 0.956 | 0.901 | 0.892 |
| libLinear | F | 0.992 | 0.914 | 0.983 | 0.942 | 0.964 | 0.941 | 0.983 | 0.954 | 0.890 | 0.911 |
| libLinear | Jaccard | 0.983 | 0.842 | 0.966 | 0.889 | 0.930 | 0.889 | 0.967 | 0.912 | 0.801 | 0.837 |
| Knearest | F | 1.000 | 0.965 | 0.994 | 0.978 | 0.981 | 0.986 | 1.000 | 0.989 | 0.956 | 0.949 |
| Knearest | Jaccard | 1.000 | 0.933 | 0.989 | 0.957 | 0.962 | 0.973 | 1.000 | 0.978 | 0.916 | 0.904 |
| Multi-classLogistic | F | 0.977 | 0.957 | 0.963 | 0.934 | 0.961 | 0.937 | 0.978 | 0.946 | 0.892 | 0.885 |
| Multi-classLogistic | Jaccard | 0.956 | 0.918 | 0.929 | 0.877 | 0.925 | 0.881 | 0.957 | 0.897 | 0.805 | 0.794 |
| MultiLayerPerceptron | F | 0.986 | 0.948 | 0.938 | 0.925 | 0.951 | 0.957 | 0.962 | 0.949 | 0.913 | 0.930 |
| MultiLayerPerceptron | Jaccard | 0.973 | 0.902 | 0.884 | 0.861 | 0.906 | 0.918 | 0.926 | 0.902 | 0.840 | 0.869 |
| NaiveBayesianNet | F | 0.942 | 0.914 | 0.940 | 0.891 | 0.011 | 0.967 | 0.949 | 0.762 | 0.793 | 0.883 |
| NaiveBayesianNet | Jaccard | 0.890 | 0.842 | 0.886 | 0.804 | 0.006 | 0.936 | 0.902 | 0.616 | 0.656 | 0.791 |
| CRF | | | | | | | | | | | |

Table 14: Inter-class accuracy on optdigits data set.

From Table 14, on the optdigits data set, classes 0 to 9 represent the ten digits 0 to 9. All algorithms perform relatively evenly across the classes. Stochastic gradient boosting, the non-linear support vector machine, random forests, the multilayer perceptron, the linear multi-class support vector machine and the multi-class logistic classifier have high performance in every class.

| Algorithm | Measure | Class0 | Class1 | Class2 | Class3 | Class4 | Class5 | Class6 | Class7 |
|---|---|---|---|---|---|---|---|---|---|
| DecisionTree | F | 0.238 | 0.618 | 0.718 | 0.550 | 0.304 | 0.397 | 0.471 | 0.397 |
| DecisionTree | Jaccard | 0.135 | 0.447 | 0.561 | 0.380 | 0.179 | 0.247 | 0.308 | 0.248 |
| RandomForest | F | 0.848 | 0.654 | 0.852 | 0.327 | 0.491 | 0.571 | 0.618 | 0.688 |
| RandomForest | Jaccard | 0.736 | 0.486 | 0.742 | 0.196 | 0.326 | 0.400 | 0.448 | 0.524 |
| ExtraTrees | F | 0.805 | 0.816 | 0.859 | 0.779 | 0.590 | 0.744 | 0.652 | 0.777 |
| ExtraTrees | Jaccard | 0.673 | 0.689 | 0.752 | 0.638 | 0.419 | 0.593 | 0.483 | 0.635 |
| BoostedClassifier | F | 0.606 | 0.126 | 0.290 | 0.415 | 0.531 | 0.399 | 0.447 | 0.608 |
| BoostedClassifier | Jaccard | 0.435 | 0.067 | 0.170 | 0.262 | 0.362 | 0.249 | 0.287 | 0.437 |
| GradientBoostTree | F | 1.000 | 1.000 | 1.000 | 1.000 | 0.997 | 1.000 | 1.000 | 1.000 |
| GradientBoostTree | Jaccard | 1.000 | 1.000 | 1.000 | 1.000 | 0.994 | 1.000 | 1.000 | 1.000 |
| libSVM | F | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| libSVM | Jaccard | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| libLinear | F | 0.959 | 0.860 | 0.931 | 0.864 | 0.784 | 0.887 | 0.823 | 0.890 |
| libLinear | Jaccard | 0.921 | 0.754 | 0.871 | 0.761 | 0.645 | 0.797 | 0.699 | 0.803 |
| Knearest | F | 0.606 | 0.126 | 0.290 | 0.415 | 0.531 | 0.399 | 0.447 | 0.608 |
| Knearest | Jaccard | 0.435 | 0.067 | 0.170 | 0.262 | 0.362 | 0.249 | 0.287 | 0.437 |
| Multi-classLogistic | F | 0.941 | 0.801 | 0.873 | 0.820 | 0.736 | 0.866 | 0.770 | 0.871 |
| Multi-classLogistic | Jaccard | 0.936 | 0.701 | 0.827 | 0.721 | 0.617 | 0.712 | 0.656 | 0.776 |
| MultiLayerPerceptron | F | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| MultiLayerPerceptron | Jaccard | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| NaiveBayesianNet | F | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| NaiveBayesianNet | Jaccard | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| CRF | F | 0.835 | 0.647 | 0.843 | 0.318 | 0.485 | 0.565 | 0.546 | 0.659 |
| CRF | Jaccard | 0.736 | 0.477 | 0.729 | 0.187 | 0.319 | 0.389 | 0.432 | 0.510 |

Table 15: Inter-class accuracy on Scene15 data set.

| Algorithm | Measure | Class8 | Class9 | Class10 | Class11 | Class12 | Class13 | Class14 |
|---|---|---|---|---|---|---|---|---|
| DecisionTree | F | 0.463 | 0.330 | 0.142 | 0.234 | 0.220 | 0.326 | 0.280 |
| DecisionTree | Jaccard | 0.301 | 0.198 | 0.076 | 0.133 | 0.123 | 0.195 | 0.162 |
| RandomForest | F | 0.495 | 0.410 | 0.034 | 0.197 | 0.071 | 0.282 | 0.482 |
| RandomForest | Jaccard | 0.328 | 0.258 | 0.017 | 0.109 | 0.037 | 0.164 | 0.318 |
| ExtraTrees | F | 0.681 | 0.645 | 0.339 | 0.286 | 0.409 | 0.496 | 0.531 |
| ExtraTrees | Jaccard | 0.517 | 0.477 | 0.204 | 0.167 | 0.257 | 0.330 | 0.362 |
| BoostedClassifier | F | 0.611 | 0.447 | 0.342 | 0.320 | 0.325 | 0.356 | 0.485 |
| BoostedClassifier | Jaccard | 0.440 | 0.288 | 0.206 | 0.190 | 0.194 | 0.216 | 0.320 |
| GradientBoostTree | F | 0.997 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| GradientBoostTree | Jaccard | 0.994 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| BoostedClassifier | F | 0.623 | 0.459 | 0.309 | 0.288 | 0.324 | 0.415 | 0.528 |
| BoostedClassifier | Jaccard | 0.452 | 0.298 | 0.183 | 0.168 | 0.193 | 0.262 | 0.359 |
| libSVM | F | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| libSVM | Jaccard | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| libLinear | F | 0.862 | 0.898 | 0.633 | 0.645 | 0.701 | 0.663 | 0.685 |
| libLinear | Jaccard | 0.758 | 0.815 | 0.463 | 0.476 | 0.539 | 0.495 | 0.521 |
| Knearest | F | 0.611 | 0.447 | 0.342 | 0.320 | 0.325 | 0.356 | 0.485 |
| Knearest | Jaccard | 0.440 | 0.288 | 0.206 | 0.190 | 0.194 | 0.216 | 0.320 |
| Multi-classLogistic | F | 0.847 | 0.898 | 0.721 | 0.613 | 0.761 | 0.721 | 0.655 |
| Multi-classLogistic | Jaccard | 0.889 | 0.668 | 0.775 | 0.694 | 0.582 | 0.764 | 0.625 |
| MultiLayerPerceptron | F | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| MultiLayerPerceptron | Jaccard | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| NaiveBayesianNet | F | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| NaiveBayesianNet | Jaccard | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid | Invalid |
| CRF | F | 0.488 | 0.402 | 0.028 | 0.188 | 0.066 | 0.277 | 0.466 |
| CRF | Jaccard | 0.313 | 0.248 | 0.016 | 0.100 | 0.036 | 0.155 | 0.309 |

Table 16: Inter-class accuracy on Scene15 data set (cont.).
From Tables 15 and 16, on the Scene15 data set, classes 0 to 14 represent the fifteen scene categories. The multilayer perceptron and the nonlinear support vector machine failed because of their computational cost, and the naive Bayesian classifier failed because of its huge storage requirement. Stochastic gradient boosting and the linear support vector machine achieved good performance, followed by the multi-class logistic classifier.

5.3 Running time performance on five data sets

Algorithm              Binary data sets                     Multi-class data sets
                       SPECTF     Ionosphere   Spam         optdigits    Scene15
DecisionTree           1          30.9         79.9         202.9        19.562 s
RandomForest           93.9       326.9        2.819        8112         34 min
ExtraTrees             266.0      437.0        3.036        12541        1 h 22 min
BoostedClassifier      202.9      451.9        1761         2.697        477.873 s
GradientBoostTree      437.0      656.0        6732         53807        8 h 42 min
libSVM                 46.9       108.9        915184       3276         Invalid
libLinear              389.9      749.0        3.165        1006         2 h 10 min
Knearest               1          16.0         280.9        708.9        120.054 s
Multi-classLogistic    30.9       46.9         377.9        2153         190.660 s
MultiLayerPerceptron   857.9      4306         33727        172847       Invalid
NaiveBayesianNet       30.9       93.0         63.0         375.9        Invalid
CRF                    168.9      235.2        1.522        6420         66 min
Table 17: Running time of twelve algorithms on five data sets (unit: millisecond, except the Scene15 column).

From Table 17, the running times of the twelve algorithms on the five data sets show that (a minimal timing sketch is given after this list):
1. On the small data sets (SPECTF and Ionosphere), the multilayer perceptron was significantly slower than the other algorithms, whose running times were roughly similar. The linear support vector machine (based on liblinear) was, contrary to expectation, slower than the nonlinear support vector machine based on libsvm.
2. On the large data sets (spam, optdigits and Scene15), the differences in running time were significant. The linear support vector machine was clearly much faster than the nonlinear support vector machine.
3. Among the tree classifiers, the decision tree was the fastest, followed by random forests; extremely randomized trees were the slowest.
4. Among the boosting methods, stochastic gradient boosting was slower than the multi-class adaboost.
5. Because of the large dimensionality of the Scene15 data, the nonlinear support vector machine, the naive Bayesian classifier and the multilayer perceptron did not finish.
6. The CRF was faster than extremely randomized trees, but slower than random forests.
7. In short, the naive Bayesian classifier, K nearest neighbors and the decision tree were the fastest, followed by random forests, the multi-class logistic classifier and the linear support vector machine. Stochastic gradient boosting was the slowest of all the algorithms except the multilayer perceptron.
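A minimal sketch of how such wall-clock training-time comparisons can be reproduced is given below. It is illustrative only: it assumes scikit-learn implementations rather than the libraries used in the experiments above, and it uses the small digits data set bundled with scikit-learn as a stand-in, so the absolute numbers will not match Table 17.

```python
# Illustrative timing sketch (assumes scikit-learn; not the toolchain used in
# this paper). Measures wall-clock training time for a few of the compared
# classifier families on a small digits data set.
import time
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

classifiers = {
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(n_estimators=100),
    "GradientBoostTree": GradientBoostingClassifier(n_estimators=100),
    "libSVM-like (RBF SVC)": SVC(kernel="rbf"),
    "libLinear-like (LinearSVC)": LinearSVC(),
    "Knearest": KNeighborsClassifier(n_neighbors=5),
}

for name, clf in classifiers.items():
    start = time.perf_counter()
    clf.fit(X, y)                      # training time only; prediction not timed
    elapsed = (time.perf_counter() - start) * 1000.0
    print(f"{name:28s} {elapsed:10.1f} ms")
```

Timing only fit() mirrors a training-time comparison; prediction time would need to be measured separately.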
6 Conclusion and future work

This article compares twelve commonly used multi-class classification algorithms. From the experiments we found that:
1. The same algorithm showed different performance on different data sets; choosing an algorithm suited to the data set at hand is therefore essential.
2. Stochastic gradient boosting achieved the best classification accuracy on all test data sets, but its running time was longer than that of every other algorithm except the multilayer perceptron.
3. The composite classifiers performed better than single classifiers; for example, stochastic gradient boosting, random forests and extremely randomized trees were all better than the basic decision tree. At the same time, the more complex the combined model was, the longer its running time became.
4. Compared with the nonlinear support vector machine, the linear support vector machine achieved good results in both accuracy and total execution time on the large data sets.

There are still some deficiencies in our comparative study, and further research is needed:
1. We compared only the basic version of each of the twelve algorithms; every algorithm has variants that improve on the original.
2. How to choose the optimal parameter settings for an algorithm is critical for its performance; more work remains to be done here.
3. When dealing with large data sets, how many samples should be chosen for training? Finding the best balance between training time and accuracy is worth further exploration.
4. Combining classifiers often leads to higher accuracy, but, as mentioned above, model training time increases significantly. Stochastic gradient boosting obtained good accuracy on all five data sets, but its running time was long: each boosting stage fits a new weak learner to the residuals left by all previous stages, so the stages depend strongly on one another and the outer loop cannot be parallelized directly (the sketch below illustrates this dependence). Improving running time at the cost of only a small loss in accuracy is a meaningful research direction; in other words, a balance between accuracy and running time must be found.
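To make the sequential dependence concrete, here is a minimal least-squares gradient-boosting sketch. It is an illustrative simplification, not the stochastic multi-class variant evaluated in this paper: each stage must wait for the predictions of all previous stages before it can compute its residuals.

```python
# Minimal gradient-boosting sketch (least-squares version, illustrative only):
# stage m fits a small tree to the residuals left by stages 0..m-1, so the
# outer loop is inherently sequential and cannot be parallelized directly.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    pred = np.full(len(y), y.mean())          # stage 0: constant model
    stages = []
    for m in range(n_stages):
        residual = y - pred                   # depends on ALL previous stages
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)
        stages.append(tree)
    return stages

# toy usage on synthetic data
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.randn(200)
model = gradient_boost(X, y)
```

In practice, speedups usually come from parallelizing the work inside each stage (for example, split finding when building a tree) rather than across stages.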
Acknowledgement
This work was supported in part by the Scientific Research Fund of the Sichuan Provincial Education Department of China (No. 15TD0038) and the Fundamental Research Funds for the Central Universities.

References
[1] ZHU J, ROSSET S, ZOU H, et al. Multi-class adaboost [J]. Ann Arbor, 2006, 1001(48109): 1612.
[2] SELFRIDGE O G. Pandemonium: a paradigm for learning [C]. Mechanisation of Thought Processes, 1958.
[3] KANAL L. Patterns in pattern recognition: 1968-1974 [J]. IEEE Transactions on Information Theory, 1974, 20(6): 697-722.
[4] MINSKY M L. Logical versus analogical or symbolic versus connectionist or neat versus scruffy [J]. AI Magazine, 1991, 12(2): 34.
[5] ROSENBLATT F. Principles of neurodynamics: perceptrons and the theory of brain mechanisms [R]. DTIC Document, 1961.
[6] BREIMAN L. Random forests [J]. Machine Learning, 2001, 45(1): 5-32.
[7] FRIEDMAN J, HASTIE T, TIBSHIRANI R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors) [J]. The Annals of Statistics, 2000, 28(2): 337-407.
[8] FRENCH S. Group consensus probability distributions: a critical survey [J]. Bayesian Statistics, 1985, 2: 183-202.
[9] BENEDIKTSSON J A, SWAIN P H. Consensus theoretic classification methods [J]. IEEE Transactions on Systems, Man and Cybernetics, 1992, 22(4): 688-704.
[10] BERNARDO J M, SMITH A F. Bayesian theory [J]. Measurement Science and Technology, 2001, 12(2): 221.
[11] FAN R E, CHANG K W, HSIEH C J, et al. LIBLINEAR: a library for large linear classification [J]. The Journal of Machine Learning Research, 2008, 9: 1871-1874.
[12] DUIN R, TAX D. Experiments with classifier combining rules [J]. Multiple Classifier Systems, 2000: 16-29.
[13] ALY M. Survey on multi-class classification methods [J]. Neural Networks, 2005: 1-9.
[14] NILSSON N. Learning machines: foundations of trainable pattern classifying systems [M]. New York: McGraw-Hill, 1965.
[15] KING R D, FENG C, SUTHERLAND A. Statlog: comparison of classification algorithms on large real-world problems [J]. Applied Artificial Intelligence, 1995, 9(3): 289-333.
[16] BAUER E, KOHAVI R. An empirical comparison of voting classification algorithms: bagging, boosting, and variants [J]. Machine Learning, 1999, 36(1-2): 105-139.
[17] LECUN Y, JACKEL L, BOTTOU L, et al. Comparison of learning algorithms for handwritten digit recognition [C]. Proceedings of the International Conference on Artificial Neural Networks, 1995.
[18] DING C H, DUBCHAK I. Multi-class protein fold recognition using support vector machines and neural networks [J]. Bioinformatics, 2001, 17(2): 107-38.
[19] LI T, ZHANG C, OGIHARA M. A comparative study of feature selection and multi-class classification methods for tissue classification based on gene expression [J]. Bioinformatics, 2004, 20(15): 2429-37.
[20] FOODY G M, MATHUR A. A relative evaluation of multi-class image classification by support vector machines [J]. IEEE Transactions on Geoscience and Remote Sensing, 2004, 42(6): 1335-43.
[21] HSU C-W, LIN C-J. A comparison of methods for multi-class support vector machines [J]. IEEE Transactions on Neural Networks, 2002, 13(2): 415-25.
[22] CARUANA R, NICULESCU-MIZIL A. An empirical comparison of supervised learning algorithms [C]. Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006.
[23] KRUSIENSKI D J, SELLERS E W, CABESTAING F, et al. A comparison of classification techniques for the P300 Speller [J]. Journal of Neural Engineering, 2006, 3(4): 299.
[24] WITTEN I H, FRANK E. Data mining: practical machine learning tools and techniques [M]. Morgan Kaufmann, 2005.
[25] MURPHY K P. Machine learning: a probabilistic perspective [M]. MIT Press, 2012.
[26] QUINLAN J R. Induction of decision trees [J]. Machine Learning, 1986, 1(1): 81-106.
[27] BREIMAN L, FRIEDMAN J, STONE C J, et al. Classification and regression trees [M]. Chapman & Hall/CRC, 1984.
[28] HO T K. Random decision forests [C]. Proceedings of the Third International Conference on Document Analysis and Recognition, IEEE, 1995.
[29] BREIMAN L. Bagging predictors [J]. Machine Learning, 1996, 24(2): 123-40.
[30] BREIMAN L. Statistical modeling: the two cultures (with comments and a rejoinder by the author) [J]. Statistical Science, 2001, 16(3): 199-231.
[31] IVERSON L R, PRASAD A M, MATTHEWS S N, et al. Estimating potential habitat for 134 eastern US tree species under six climate scenarios [J]. Forest Ecology and Management, 2008, 254(3): 390-406.
[32] GEURTS P, ERNST D, WEHENKEL L. Extremely randomized trees [J]. Machine Learning, 2006, 63(1): 3-42.
[33] HASTIE T, TIBSHIRANI R, FRIEDMAN J, et al. The elements of statistical learning: data mining, inference and prediction [J]. The Mathematical Intelligencer, 2005, 27(2): 83-5.
[34] FREUND Y, SCHAPIRE R E. A decision-theoretic generalization of on-line learning and an application to boosting [J]. Journal of Computer and System Sciences, 1997, 55(1): 119-39.
[35] FRIEDMAN J H. Stochastic gradient boosting [J]. Computational Statistics & Data Analysis, 2002, 38(4): 367-78.
[36] BURGES C J C. A tutorial on support vector machines for pattern recognition [J]. Data Mining and Knowledge Discovery, 1998, 2(2): 121-67.
[37] CORTES C, VAPNIK V. Support-vector networks [J]. Machine Learning, 1995, 20(3): 273-97.
[38] KEERTHI S S, SUNDARARAJAN S, CHANG K-W, et al. A sequential dual method for large scale multi-class linear SVMs [C]. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2008.
[39] CRAMMER K, SINGER Y. On the algorithmic implementation of multi-class kernel-based vector machines [J]. The Journal of Machine Learning Research, 2002, 2: 265-92.
[40] COVER T, HART P. Nearest neighbor pattern classification [J]. IEEE Transactions on Information Theory, 1967, 13(1): 21-7.
[41] LIU D C, NOCEDAL J. On the limited memory BFGS method for large scale optimization [J]. Mathematical Programming, 1989, 45(1): 503-28.
[42] ROSENBLATT F. The perceptron: a probabilistic model for information storage and organization in the brain [J]. Psychological Review, 1958, 65(6): 386-408.
[43] BISHOP C M. Pattern recognition and machine learning [M]. New York: Springer, 2006.
[44] WASSERMAN P D, SCHWARTZ T. Neural networks. II. What are they and why is everybody so interested in them now? [J]. IEEE Expert, 1988, 3(1): 10-5.
[45] LI H. Statistical learning [M]. Tsinghua University Press, 2012.
[46] COLLOBERT R, BENGIO S. Links between perceptrons, MLPs and SVMs [C]. Proceedings of the Twenty-first International Conference on Machine Learning, ACM, 2004.
[47] GOULD S. DARWIN: a framework for machine learning and computer vision research and development [J]. Journal of Machine Learning Research, 2012, 13(12): 3499-503.
[48] CHANG C C, LIN C J. LIBSVM: a library for support vector machines [J]. ACM Transactions on Intelligent Systems and Technology (TIST), 2011, 2(3): 27.
[49] FAWCETT T. An introduction to ROC analysis [J]. Pattern Recognition Letters, 2006, 27(8): 861-74.
[50] DAVIS J, GOADRICH M. The relationship between Precision-Recall and ROC curves [C]. Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006.
[51] JACCARD P. The distribution of the flora in the alpine zone [J]. New Phytologist, 1912, 11(2): 37-50.
[52] ASUNCION A, NEWMAN D J. UCI machine learning repository [M]. 2007.
[53] KURGAN L A, CIOS K J, TADEUSIEWICZ R, et al. Knowledge discovery approach to automated cardiac SPECT diagnosis [J]. Artificial Intelligence in Medicine, 2001, 23(2): 149-69.
[54] LAZEBNIK S, SCHMID C, PONCE J. Beyond bags of features: spatial pyramid matching for recognizing natural scene categories [C]. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2006.
[55] VEDALDI A, ZISSERMAN A. Efficient additive kernels via explicit feature maps [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(3): 480-92.
[56] BRADSKI G, KAEHLER A. Learning OpenCV: computer vision with the OpenCV library [M]. O'Reilly Media, 2008.
[57] DELASHMIT W H, MANRY M T. Recent developments in multilayer perceptron neural networks [C]. Proceedings of the Seventh Annual Memphis Area Engineering and Science Conference (MAESC), 2005.