Using M Tree Data Structure as Unsupervised Classification Method

Marian Cristian Mihaescu and Dumitru Dan Burdescu
University of Craiova, Faculty of Automation, Computers and Electronics, Romania
E-mail: mihaescu@software.ucv.ro, burdescu@software.ucv.ro

Keywords: M Tree, k-means, unsupervised classification, cognitonics

Received: December 8, 2011

Increasing the effectiveness of educational processes is one of the greatest challenges for the information society. This paper presents the usage of the M Tree structure for classifying learners based on the final marks obtained in their respective courses. The classical M Tree building algorithm was implemented together with an original, custom clustering procedure. The data managed within the M Tree structure are represented by instances. The main goal of the structure is to provide information to students and course managers regarding the knowledge level reached by students. The proposed clustering procedure, used for splitting full M Tree nodes, is designed to properly classify learners. A baseline classification scheme based on k-means clustering and a custom M Tree clustering are presented and compared using classical characterization formulas.

Povzetek: A method for improving learning based on M-trees is described.

1 Introduction

The ability to classify a student's performance is very important in internet-based educational environments. A very promising way to attain this objective is the use of specially designed data structures. In fact, one of the most useful applications of modern algorithms in e-Learning is classification. E-students are students who follow courses within an e-Learning platform. There are different educational objectives for using classification, such as: discovering potential student groups with similar characteristics and reactions to a particular learning strategy, improving a student's capacity for learning, grouping students who are failure-driven and helping them improve their skills, and identifying learners with low motivation and finding remedial actions to lower drop-out rates. In the following, we apply a classification method using algorithms which have a common base (tree classification).

One of the greatest challenges in the e-Learning area is to continuously improve existing systems. Overcoming this challenge requires sound procedures whose task is to prove that a challenger procedure creates a better system than the existing one. The key issue regarding the effectiveness of the educational process is classification. The main goal of this paper is to obtain a better, or at least acceptable, classification scheme with less computational power. One possible outcome of such an analysis process is predicting students' grades from test scores, classifying them into three classes/clusters: low priority (weak learners), medium priority (easy learners), and high priority (competitive learners).

Clustering algorithms are part of the unsupervised classification techniques. They try to group a set of items into subsets or clusters. The clustering algorithms' goal is to create clusters that are coherent internally but clearly different from each other. In other words, items within a cluster should be as similar as possible, and items in one cluster should be as dissimilar as possible from items in other clusters. In this paper, learners represent the items. The standard k-means algorithm [1] is used as the baseline unsupervised classifier.
K-means is the most important flat clustering algorithm. Its objective is to minimize the average squared Euclidean distance of items from their cluster centres, where a cluster centre is defined as the mean or centroid of the items in a cluster.

The M-tree [2, 3] is a dynamic access method suitable for indexing generic "metric spaces", where the function used to compute the distance between any two objects satisfies the positivity, symmetry, and triangle inequality postulates. The M-tree design fulfils typical requirements of multimedia applications, where objects are indexed using complex features and similarity queries can require the application of time-consuming distance functions. In [2, 3], the authors describe the basic search and management algorithms of the M-tree, introduce several heuristic split policies, and experimentally evaluate them, considering both I/O and CPU costs. Their results also show that the M-tree performs better than the R*-tree on high-dimensional vector spaces.

2 Related Work

It is now recognized that e-learning further requires the means to summarize and classify learner trends and patterns. One serious candidate solution is DM (data mining), already quite successful in e-commerce and bio-informatics, where results are achieved through the use of associators, classifiers, clusterers [8], pattern analysers, and statistical tools. Educational Data Mining [7] is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students and the settings in which they learn.

Since the mid-1990s, e-learning has epitomized a broad range of learning categories while reinforcing four major pedagogical perspectives often neglected during e-learning system development. First, insight from cognitive learning processes can shed light on how the brain functions. Second, emotional aspects of learning can be traced, such as interest, motivation, interaction, fulfilment, and enjoyment. The third perspective incorporates skills and behaviours, such as role-playing, that are particularly useful in real settings. Lastly, a social perspective involving the interaction with other people permits a focus on collaborative discovery, namely the interplay of peer pressure and support.

The complexity of the approach is high, and specialists have been more preoccupied with the development of information systems from the perspective of the technological informatics infrastructure. The studies devoted to the technology infrastructures embedded in information systems are insufficiently presented in the literature [5]. One of the constructive steps in this direction was taken by V. Fomichov and O. Fomichova in [6]. The authors introduced the notion of conceptual-visual dynamic schemes (CVD-schemes). The CVD-schemes are marked oriented graphs introduced in the cognitonics domain for inventing effective teaching analogies. Such graphs establish a correspondence between the components of a piece of theoretical material to be studied and the components of a bright fragment of the learner's inner world picture, one that is either well known or freshly created by the teacher.

Novel database applications, such as multimedia, data mining, e-commerce, and many others, make intensive use of similarity queries [2] in order to retrieve the objects that best fit a user request.
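Since both the k-means baseline and the M-tree rely on these notions, it is worth writing them out explicitly; the following are the standard textbook definitions of the k-means objective and of the metric postulates, not notation taken from [2, 3]:

\[
\mathrm{RSS} = \sum_{k=1}^{K} \sum_{x \in \omega_k} \lVert x - \mu(\omega_k) \rVert^2,
\qquad
\mu(\omega_k) = \frac{1}{|\omega_k|} \sum_{x \in \omega_k} x
\]

\[
d(x, y) \ge 0 \ \text{(with } d(x, y) = 0 \iff x = y\text{)}, \qquad
d(x, y) = d(y, x), \qquad
d(x, z) \le d(x, y) + d(y, z)
\]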
Since the effectiveness of such queries improves when the user is allowed to personalize the similarity criterion according to which database objects are evaluated and ranked, the development of access methods able to efficiently support user-defined similarity queries becomes a basic requirement. The M-tree [2, 3] can process user-defined queries in generic metric spaces, that is, where the only information about indexed objects is their relative distances. The M-tree is a metric access method that can deal with several distinct distances at a time: (1) a query (user-defined) distance, (2) an index distance (used to build the tree), and (3) a comparison (approximate) distance (used to quickly discard uninteresting parts of the tree from the search). The authors of [2, 3] develop an analytical cost model that accurately characterizes the performance of the M-tree and validate this model through extensive experimentation on real metric data sets. In particular, their analysis is able to predict the best evaluation strategy (i.e., which distances to use) under a variety of configurations, by properly taking into account relevant factors such as the distribution of distances, the cost of computing distances, and the actual index structure.

The M-tree access method is proposed to organize and search large data sets from a generic "metric space", i.e. where object proximity is only defined by a distance function satisfying the positivity, symmetry, and triangle inequality postulates. The M-tree design has been motivated by retrieval requirements of typical multimedia database applications, where objects such as text, images, and video are indexed using complex feature representations, and the search for objects similar to a query object can involve the application of time-consuming distance functions. The authors detail algorithms for the insertion of objects and split management which keep the M-tree always balanced; several heuristic split alternatives are considered and experimentally evaluated. Algorithms for similarity (range and k-nearest neighbours) queries are also described. The results from extensive experimentation with a prototype system are reported, considering as performance criteria the number of page I/Os and the number of distance computations. The results demonstrate that the M-tree indeed extends the domain of applicability beyond the traditional vector spaces, performs reasonably well in high-dimensional data spaces, and scales well in the case of growing files.

As our project goal is to bring a contribution to the e-learning domain, our idea is to structure didactic chapters as concept maps and provide an efficient electronic distribution of students by their results. The chapter's notions are split depending on their level of priority as follows:
- Low priority notions - referring to introductory notions of the chapter.
- Medium priority notions - referring to basic notions of the chapter.
- Maximum priority notions - referring to advanced notions of the chapter.

As our concept maps are represented as tree structures, where each path of the tree is assigned a weight, these priorities can be computed from the assigned weights. Assuming that, in order to evaluate a number of students, each chapter presents a final quiz, we have decided to use concept maps as weighted trees in order to generate tests containing notions of different priority levels. The algorithms responsible for generating tests are based on tree search methods.
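Although the paper does not give code for the concept maps, the idea can be illustrated with a small C sketch; the node layout, the helper name priorityOf, and the weight thresholds below are illustrative assumptions, not the authors' implementation:

#include <stdio.h>

#define MAX_CHILDREN 8

/* A concept-map node: a notion of the chapter, with a weight on the
   edge leading to it and up to MAX_CHILDREN sub-notions. */
struct ConceptNode {
    const char *notion;                        /* name of the notion */
    float weight;                              /* weight assigned to this path */
    int nrChildren;
    struct ConceptNode *children[MAX_CHILDREN];
};

/* Map the accumulated weight along a root-to-node path to one of the
   three priority levels; the thresholds 1.0 and 2.0 are placeholders. */
const char *priorityOf(float pathWeight) {
    if (pathWeight < 1.0f) return "low priority";
    if (pathWeight < 2.0f) return "medium priority";
    return "maximum priority";
}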
3 Building Clusters with M Tree and k-Means Algorithms

Our implementation uses the following student data structure:

struct Student {
    int   IDStudent;   /* identifier of the student */
    float Lp;          /* mark for low priority notions */
    float Mp;          /* mark for medium priority notions */
    float Maxp;        /* mark for maximum priority notions */
    int   Where[3];    /* flags indicating which notion levels to review */
};

The IDStudent is the identifier corresponding to each student entering the online distribution program. In order to evaluate these students, each chapter presents a final quiz containing notions belonging to the previously discussed levels of priority. Thereby, each student will get a mark for each type of notion contained in the chapter:
- an Lp mark corresponding to low priority notions,
- an Mp mark corresponding to medium priority notions,
- a Maxp mark corresponding to maximum priority notions.

Each one of these marks is important, because they represent the guiding tool for a student. Example: let us suppose that the student identified by IDStudent = 1002 takes the quiz at the end of the chapter and gets the following results: (Lp = 7.70, Mp = 6.78, Maxp = 5.00). Right now, a teacher, or even an electronic program, is able to compute the minimum performance of this student, reaching the conclusion that the advanced notions of the chapter (indicated by Maxp) have not been covered properly by this student, and therefore he/she needs to put more energy in this direction. These directions are given by the Where vector, represented graphically in Figure 1.

Figure 1: The Where vector, used for providing instructions regarding where a student should improve his/her level of knowledge.

Initially, Where = (0, 0, 0). If any of this vector's components changes to 1, it becomes active. For instance, if StudentA.Where = (0, 1, 0), it means that he/she needs to review notions of the medium priority level. These students are then distributed and placed by our algorithm in an M-tree structure.

Before moving on with the algorithm, let us present the structure of the M-tree. As explained above, the M-tree is a spatial, metric tree consisting of one root and k leaves containing students (as we will see later, these leaves represent clusters of students). For now, let us stick to the structure of the M-tree, mentioning that this tree is actually a spatial one, where its leaves can be imagined as spheres containing points, which are actually students. The M-tree node's structure is:

struct m_Node {
    int    nrKeys;                  /* number of students in this node (cluster) */
    bool   isLeaf;                  /* whether the node is a leaf (terminal) node */
    float  radius[NMAX];            /* radii of the child spheres */
    struct m_Node *routes[NMAX];    /* children of the node */
    struct Student students[NMAX];  /* students contained in the node */
};

The nrKeys represents the number of students contained in a node (cluster). As far as our M-tree is concerned, we pay extra attention to the nodes, because it is very important whether they are leaves (terminal nodes) or internal nodes, as we will see later in the algorithm. The boolean variable isLeaf points out this exact concern: whether a node is a leaf or internal. Moving on, as was previously said, we consider our nodes as spheres, and as any sphere, a node is geometrically represented by its center C(x, y, z) and its radius R. However, in order to match these notions to our real implementation, we have constructed an abstract interpretation of this geometrical representation. The center C of a sphere (cluster) will be represented by the average student belonging to the set of students contained in that particular cluster.
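As a minimal illustration of this abstract center, the following C sketch computes a centerStudent as the component-wise average of a cluster's members; computeCenter is an illustrative helper name, not necessarily the routine of the actual implementation:

struct Student computeCenter(const struct Student cluster[], int n) {
    /* Average the three marks over the n students of the cluster (n > 0). */
    struct Student center = {0, 0.0f, 0.0f, 0.0f, {0, 0, 0}};
    for (int i = 0; i < n; i++) {
        center.Lp   += cluster[i].Lp;
        center.Mp   += cluster[i].Mp;
        center.Maxp += cluster[i].Maxp;
    }
    center.Lp   /= n;
    center.Mp   /= n;
    center.Maxp /= n;
    return center;
}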
Instead of spatial coordinates (x, y, z), our centerStudent will be represented by its elements (IDStudent, Lp, Mp, Maxp), which were presented earlier. The radius of this abstract sphere will be represented by the distance between the centerStudent and the student (or students) with the lowest results in that cluster. We will take a closer look at this abstract system later, when we discuss the implemented algorithm. The routes represent the children of a particular node and, of course, the nrKeys points inside the sphere are actually the students, as in our abstract system a spatial point is represented by a student.

Our implementation is based on the idea of distributing students depending on their results on a quiz at the end of a chapter. For a better understanding of our implementation, let us consider a real situation. Let us consider a finite set S of k students defined as S = {St1, St2, ..., Stk}, k > 0. Let us suppose that all these students have taken a quiz at the end of a chapter in order to evaluate their level of knowledge. Each student is represented by his/her IDStudent and his/her grades: Lp, Mp, Maxp (discussed earlier). Let us assume that we want to create a hierarchy among these students, depending on their results. In order to do that, we need to group these students into clusters, each cluster having its own attribute. An attribute, for a cluster, represents the level of performance of that particular group of students. Moreover, these attributes are also used as indicators pointing out the type of notion (low priority, medium priority, maximum priority) the student needs to review. After a group of students (cluster) is formed, a center is chosen, that is, the average student in that group, and all the other students are distributed in a spherical manner around him.

Computation of a radius for a cluster: the radius of a cluster represents the maximum possible distance between the centerStudent and the rest. The bigger the distance, the better or the worse that student's results are. Just as in real cases, when we say there is a big distance between the average student and student A, this means that student A has either better results or worse results; we do not know for sure. Anyhow, should the distance between the centerStudent and any other student be greater than the cluster's radius, that particular student does not belong to that cluster, for the simple reason that he/she is smarter than all the students in that cluster or his/her results are lower than any other's in that cluster. The radius of a cluster is computed depending on the results of each student, being the maximum distance between two students. We define the distance between StudentA and StudentB as follows:

dAB = max{ |LpA - LpB|, |MpA - MpB|, |MaxpA - MaxpB| }     (1)

As can be seen, when we measure the distance between two students, we are looking for the most marked difference between them. This also helps us in defining the attributes of a cluster, depending on the type of notion students should focus on (low priority notions, medium priority notions, maximum priority notions). Moreover, relation (1) guarantees that the radius of a cluster represents the biggest difference between the levels of knowledge of the students belonging to that cluster. As an example, let us consider StudentA with his/her results (9.60, 8, 7.50) and StudentB with his/her results (7.60, 8, 6.50).
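Relation (1) translates directly into C; the following sketch (with the illustrative name distance) is a plain transcription of the formula, not code taken verbatim from the authors' implementation:

#include <math.h>

/* Distance between two students: the largest absolute difference
   among their low, medium, and maximum priority marks (relation (1)). */
float distance(const struct Student *a, const struct Student *b) {
    float d  = fabsf(a->Lp   - b->Lp);
    float dm = fabsf(a->Mp   - b->Mp);
    float dx = fabsf(a->Maxp - b->Maxp);
    if (dm > d) d = dm;
    if (dx > d) d = dx;
    return d;
}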
Following relation (1), we get the biggest difference, 2.00 (9.60 - 7.60). This is the biggest distance between the two. So they might have similar knowledge of medium priority notions (8, 8) and maximum priority notions (7.50, 6.50), but when it comes to low priority notions, we see a gap between them (StudentA: 9.60, StudentB: 7.60). Let us consider that StudentB has the lowest result in the cluster and StudentA is the centerStudent. Then, as we presented earlier, they can be grouped in a cluster with radius 2. Let us suppose now that StudentC gets the results (5, 6.30, 5). We get:

dAC = max{ |9.60 - 5|, |8 - 6.30|, |7.50 - 5| } = 4.60
dBC = max{ |7.60 - 5|, |8 - 6.30|, |6.50 - 5| } = 2.60

As can be seen, neither of these distances is lower than our cluster's supposed radius; the difference between StudentC and the students StudentA and StudentB is huge, so StudentC will not become a member of this cluster. Moreover, based on the present results of StudentA and StudentB, we can define an attribute for this cluster: all students belonging to this cluster will possess similar knowledge levels for medium and maximum priority notions, but the marked difference between them will be represented by the low priority notions, so all of them need to review this part of the chapter.

The main steps of the algorithm are:

Step 1. We start from a simple representation of students, identified by their elements:
- IDStudent
- Lp (score)
- Mp (score)
- Maxp (score)

Step 2. We picture the set of students who have taken the test as points in 3D space.

Our algorithm involves two major operations:
- a clustering operation
- a split operation

We will first describe the splitting method. We have decided that these groups of students should have a maximum number of allowed members. Let us denote this number as the filling factor of a cluster (student group). Whenever the number of students in a particular cluster becomes greater than this filling factor, a cluster split is performed. This is how the M-tree extends its nodes. The splitting procedure works as follows. At the beginning, two random students from that cluster are chosen as the centers of the new clusters resulting after the split. Let us denote them Student1 and Student2. Next, we distribute the rest of the students around the new centers Student1 and Student2. If, for instance, we have Student1 and Student2 as centers, the question is: to which center should we attach Student3? We compute the distances (Student1, Student3) and (Student2, Student3) using relation (1). Student3 will go near the student who is closer to him/her (that is, from a level-of-knowledge point of view). After the distribution is completed, we start the chooseCenter method, which recalculates the new centers of the clusters, and if new centers are found, the entire process described above happens again, until no new centers are found. This process is called the Clustering Process. After that, the effective split happens, and the initial tree node is split into two nodes; a sketch of this step is given below. When these clusters are formed, they are also assigned attributes (we discussed them earlier). What is interesting is that these attributes undergo constant evolution, depending on the students inserted in that specific cluster.
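The redistribution step of this Clustering Process can be sketched in C as follows; splitAround is an illustrative name, the distance helper is the one sketched above, and the surrounding chooseCenter iteration is omitted:

/* Distribute n students between two candidate centers c1 and c2:
   each student joins the group whose center is closer by relation (1). */
void splitAround(const struct Student all[], int n,
                 const struct Student *c1, const struct Student *c2,
                 struct Student g1[], int *n1,
                 struct Student g2[], int *n2) {
    *n1 = *n2 = 0;
    for (int i = 0; i < n; i++) {
        if (distance(&all[i], c1) <= distance(&all[i], c2))
            g1[(*n1)++] = all[i];  /* closer to the first center */
        else
            g2[(*n2)++] = all[i];  /* closer to the second center */
    }
}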
As the clusters undergo constant splits and modifications whenever the number of students inside is greater than the filling factor, the attributes change as well, transforming part of the old cluster into a better one (sheltering students with higher levels of knowledge) or even a lower one (sheltering students with lower levels of knowledge).

The classical M Tree algorithm has been adapted such that the final structure has two levels. The procedure for building the structure takes into consideration both the desired number of clusters and the filling factor of a leaf node:

procedure MTree(x1, x2, ..., xN; K; F)
// K - the number of clusters
// F - the filling factor of a leaf node
for (i = 1; i <= N; i++) {
    if (root->nrKeys < F)
        make a simple insertion in the actual root
    else
        apply Cluster_and_split
} // end for

The classical standard k-means algorithm partitions a dataset of N instances into K clusters. The algorithm is:

procedure k-means(x1, x2, ..., xN; K)
{c1, c2, ..., cK} <- Select Random Centroids
repeat
    for (k = 1; k <= K; k++)
        cluster_k <- {}
    for (i = 1; i <= N; i++)
        j <- arg min_k distance(xi, ck)
        cluster_j <- cluster_j + {xi}
    for (k = 1; k <= K; k++)
        ck <- mean of cluster_k
until the centroids no longer change
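As a concrete counterpart of the pseudocode, here is a compact C sketch of the k-means baseline over Student records; it reuses the distance helper sketched earlier, while kmeans, MAXITER, and the assign array (initialized to -1 by the caller) are illustrative assumptions rather than the authors' exact code:

#define MAXITER 100

void kmeans(const struct Student x[], int n,
            struct Student c[], int K, int assign[]) {
    for (int it = 0; it < MAXITER; it++) {
        int changed = 0;
        /* Assignment step: each student joins the nearest centroid. */
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int k = 1; k < K; k++)
                if (distance(&x[i], &c[k]) < distance(&x[i], &c[best]))
                    best = k;
            if (assign[i] != best) { assign[i] = best; changed = 1; }
        }
        if (!changed) break;  /* assignments are stable: converged */
        /* Update step: recompute each centroid as its cluster's average. */
        for (int k = 0; k < K; k++) {
            struct Student sum = {0, 0.0f, 0.0f, 0.0f, {0, 0, 0}};
            int m = 0;
            for (int i = 0; i < n; i++)
                if (assign[i] == k) {
                    sum.Lp   += x[i].Lp;
                    sum.Mp   += x[i].Mp;
                    sum.Maxp += x[i].Maxp;
                    m++;
                }
            if (m > 0) {
                sum.Lp /= m; sum.Mp /= m; sum.Maxp /= m;
                c[k] = sum;
            }
        }
    }
}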