https://doi.org/10.31449/inf.v47i8.4767 Informatica 47 (2023) 1–12

LP SVM with a Novel Similarity Function Outperforms Powerful LP-QP-Kernel-SVM Considering Efficient Classification

Rezaul Karim 1,*, Mahmudul Hasan 2, Amit Kumar Kundu 1 and Ali Ahmed Ave 1
1 Department of Electrical and Electronic Engineering, Uttara University, Dhaka, Bangladesh
2 Department of Electrical Engineering, University of British Columbia, Okanagan Campus
E-mail: rezkar@uttarauniversity.edu.bd, mahmud67@student.ubc.ca, amit31416@gmail.com, ali.ave@uttarauniversity.edu.bd
*Corresponding author

Keywords: classification, SVM, kernel, similarity function, QP, sparse, LP, efficiency, machine accuracy time (MAT), performance

Received: March 27, 2023

While the core quality of the SVM comes from its ability to reach the global optimum, its classification performance also depends on computing kernels. However, while this kernel complexity generates the power of the machine, it is also responsible for the computational load of executing the kernel. Moreover, insisting that a similarity function be a positive definite kernel demands properties that sometimes seem unproductive, raising the question of which similarity measures should be used for a classifier. We model Vapnik's LPSVM, proposing a new similarity function that replaces the kernel function. Following the strategy of "accuracy first, speed second", we have modelled a similarity function that is mathematically well-defined, grounded in analysis as well as geometry, and complex enough to train the machine to generate solid generalization ability. Consistent with the theory of learning by Balcan and Blum [1], our similarity function does not need to be a valid kernel function and demands less computational cost to execute than counterparts such as the RBF or other kernels, while providing sufficient power to the classifier through its optimal complexity. Benchmarking shows that our similarity-function-based LPSVM yields a test error 0.86 times that of the most powerful RBF-based QP SVM while demanding only 0.40 times its computational cost.

Povzetek: A new similarity function that replaces the kernel function is proposed for SVM; it requires less computation and achieves high accuracy and speed.

1 Introduction

A frequent contemporary approach to data classification problems in machine learning is to encode prior knowledge about data patterns (objects) through a kernel operation that takes in two patterns, maps them into a higher-dimensional feature space, and outputs a number representing similarity or dissimilarity, with the condition that it must form a positive semi-definite (PSD) matrix when applied to all pairs of patterns. To assess the true structure and input-output relation of the data and to solve many real-world problems efficiently with the Support Vector Machine, appropriate selection of this kernel is a crucial issue, where the PSD property of the kernel similarity matrix ensures that the SVM can be solved efficiently by convex quadratic programming. For more about kernel-SVM-based classification and sparse learning, see [2–10]. However, while kernel theory is quite well designed, there are other issues that sometimes make the kernel less attractive to use, such as: i) The formal statement of Mercer's condition is tough to verify.
ii) When designing a kernel function for a learning problem, the usual perception is that a better kernel for a dataset would also serve as a better similarity function with respect to that data. On the contrary, the SVM kernel deals with the margin in a possibly very high-dimensional space that is implicit and generally not perceptible in the innate representation of the data. This gives a designer little intuition for modelling or selecting an appropriate kernel for the learning task at hand.

iii) One may question the usefulness of an algorithm being controlled by the characteristics of an implicit mapping that might never even be computed. Additionally, if a kernel is functionally just a black-box method that takes two examples as input and outputs a number giving some notion of how similar they are, it is not clear if or why the positive semi-definiteness property is certainly needed; the physical significance of positive semi-definiteness is not obvious, although we know that it is needed for the problem to be convex and to have a unique minimum. Moreover, the constraint of positive semi-definiteness may reject many natural similarity functions for a given problem, and there are many similarity functions that are not naturally PSD but have proved to be useful, for example the sigmoid function.

This motivates working with a possibly more suitable similarity function, relaxing the kernel constraints for data learning, such as learning with a non-PSD similarity or distance function. Learning with indefinite kernels or non-PSD similarity matrices has attracted considerable attention [11–19]. However, [20] divided recent work on training SVMs with indefinite kernels into three main kinds: PSD kernel approximation, non-convex optimization, and learning in Krein spaces, concluding that none of these methods is fully adequate: they either introduce inconsistency in handling training and test patterns through kernel approximation, which harms generalization guarantees, settle for approximate local-minimum solutions via non-convex optimization, or generate non-sparse solutions. But there is another approach, studied in a sequence of papers [1], [21], [13], [22], that adopts a certain "goodness" property, formally defined for the similarity function, and provides both generalization guarantees in terms of how well-suited the similarity function is to the classification task at hand and the ability to use fast algorithmic techniques. Informally, a similarity function can be considered good if, in some sense, patterns of the same class are closer to each other than patterns of different classes. The model proposed by [1], [22] developed a general theory of learning with a pairwise similarity function that need not be a valid positive semi-definite kernel, along with sufficient conditions for the function to support learning well; these conditions involve no reference to implicit spaces and do not demand that the function be PSD. Inspired by this, in this paper we follow the formulation of the SVM, modelling a Manhattan-distance-based similarity function that replaces the kernel function.

The rest of the paper is organized as follows. Section 2 provides short related basics about the SVM and kernels, whereas Section 3 explains the proposed similarity function with analysis.
In Section 4, the experimental results are shown along with a description of the data and models, while the conclusion and future work are presented in Section 5.

2 Related basics

As the SVM and the kernel are the core of a detector, we provide a short introduction to them here.

2.1 Support vector machine (SVM)

Support Vector Machines (SVMs) are advanced classifiers that use a higher-dimensional feature space and are powerful tools for supervised classification. Two methods for constructing support vector machines are illustrated here: the first is the conventional and standard method based on quadratic programming (QP), which we term QPSVM, whereas the second is based on linear programming (LP), which we call VLPSVM.

2.2 Quadratic programming SVM (QPSVM)

QPSVM finds the optimal separating hyperplane by maximizing the margin between two classes [23]. For a training set of instance-label pairs (x_i, y_i), i = 1,...,N, where x_i ∈ R^d and y_i ∈ {1, −1}, the SVM classifier offers a decision function for finding the class of a pattern x in the form f(x) = sgn(w · ϕ(x) + b), where w is the weight vector, b ∈ R is the bias, and K(x_i, x_j) = ϕ(x_i) · ϕ(x_j) is a kernel function. It solves the following primal problem:

min_{w,b,ζ} f_P(w) = (1/2) ‖w‖² + C ∑_{i=1}^{N} ζ_i   (1)
s.t. y_i ( w · ϕ(x_i) + b ) ≥ 1 − ζ_i,   (2)
     ζ_i ≥ 0,  i = 1, 2,...,N,   (3)

where the ζ_i measure the misclassification errors. The objective above defines a convex programming problem, where C is a user-defined characteristic parameter of the classifier that adjusts the trade-off between maximizing the margin and minimizing the overall error on the training vectors. After a few calculations, the corresponding dual problem becomes

max_α f_D(α) = ∑_{i=1}^{N} α_i − (1/2) ∑_{i=1}^{N} ∑_{j=1}^{N} α_i α_j y_i y_j ϕ(x_i) · ϕ(x_j)   (4)
s.t. ∑_{i=1}^{N} α_i y_i = 0,   (5)
     0 ≤ α_i ≤ C,  i = 1, 2,...,N.   (6)

The solution of the QP maximization problem (4)-(6) gives the values of the variables α_i, and the primal variable w is determined from them via w = ∑_{i=1}^{N} α_i y_i ϕ(x_i). In addition, the KKT conditions are used to determine the bias b.
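For concreteness, the following is a minimal sketch of training a QPSVM with an RBF kernel on a synthetic two-class problem using scikit-learn, whose SVC solver optimizes the dual (4)-(6) internally. The dataset and the values of C and gamma are arbitrary choices for illustration, not the paper's experimental setup.

```python
# Illustrative QPSVM sketch (not the authors' experimental code).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
y = np.where(y == 1, 1, -1)                 # labels in {1, -1} as in the text

clf = SVC(C=10.0, kernel="rbf", gamma=0.5)  # gamma plays the role of 1/(2*sigma^2)
clf.fit(X, y)

print("number of support vectors:", clf.n_support_.sum())
print("training error rate:", np.mean(clf.predict(X) != y))
```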
2.3 Vapnik's LP SVM (VLPSVM)

Vapnik suggested VLPSVM to build a separating hyperplane while minimizing the number of support vectors (a heuristic SVM). An elaborate introduction to, and the mathematical background of, both QP- and LP-based SVMs can be found in [4]. Vapnik formulated an alternative approach that uses linear programming to find a separating hyperplane similar to that of QPSVM, considering the trade-off between minimizing the sum of the coefficients associated with the KCVs (kernel computing vectors, which play a role very similar to that of the support vectors (SVs) in QPSVM) and minimizing the error by finding w and b as below:

Figure 1: Pictorial representation of the maximum-margin concept behind the SVM [7]. Positive and negative patterns are represented by triangles and squares, respectively. The red straight line is the optimal separating hyperplane and the black lines indicate the margins of the respective classes. Patterns that lie outward from their classes are called outward-deviated patterns, and the amount of this deviation is represented by a non-negative variable ζ. The objective of the QPSVM is to maximize the margin with minimum training error. The red line is also called the decision boundary of the training output.

min_{λ,ξ,b_V} ∑_{i=1}^{N} λ_i + C_V ∑_{i=1}^{N} ξ_i   (7)
s.t. y_i ( ∑_{j=1}^{N} λ_j y_j ϕ(x_j) · ϕ(x_i) + b_V ) ≥ 1 − ξ_i,   (8)
     λ_j ≥ 0,  j = 1, 2,...,N,   (9)
     ξ_i ≥ 0,  i = 1, 2,...,N,   (10)

where λ_1, λ_2,...,λ_N, ξ_1, ξ_2,...,ξ_N, b_V are the optimization variables. The penalty parameter C_V > 0 in VLPSVM is defined by the user and controls overfitting and how closely the data are learned. The slack variables ξ_i > 0 account for the absolute-unity-outward-deviated patterns [6]. After solving the LP optimization problem (7)-(10), the optimal λ values are used to calculate the weight vector w_V = ∑_{j=1}^{N} λ_j y_j ϕ(x_j), and the decision function takes the form f(x) = sgn(w_V · ϕ(x) + b_V).

2.4 Kernels

As the popularity of Support Vector Machines (SVMs) has increased in recent years, kernel methods have received major attention. Kernel functions, which can be expressed as dot products, are used to bridge from linearity to non-linearity in many applications. These methods map the data in a structured way into a higher-dimensional space so that the data can be separated more easily. Besides, kernel functions must have some important properties: they should be continuous, symmetric and, most preferably, positive semi-definite (meaning the kernel matrices must have no negative eigenvalues). Kernels satisfying Mercer's theorem are positive semi-definite, which ensures that the optimization problem is convex and has a unique solution. However, many so-called kernel functions that are not positive semi-definite still perform very well. For example, the sigmoid kernel is not positive semi-definite for certain values of its parameters, yet it is used in a wide range of applications. Some of the conventional kernels are the Gaussian kernel, the polynomial kernel and the Bessel kernel, among which the Gaussian (also known as RBF) kernel is the most powerful and popular. It is given as K(x,y) = exp(−‖x − y‖² / (2σ²)), where ‖x − y‖² is the squared Euclidean distance between the two feature vectors and σ is an adjustable parameter that plays an important role in the performance of the kernel.
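To illustrate how the linear program (7)-(10) could be set up and solved in practice, here is a minimal sketch using scipy.optimize.linprog with a precomputed kernel (or, more generally, similarity) matrix. The variable stacking, the splitting of the free bias into two non-negative parts, and the choice of solver are our own assumptions for this sketch, not the paper's implementation.

```python
# Illustrative VLPSVM sketch for problem (7)-(10); not the paper's code.
import numpy as np
from scipy.optimize import linprog

def train_vlpsvm(K, y, C_V=1.0):
    """K: (N, N) precomputed kernel/similarity matrix, y: labels in {1, -1}."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    # Decision variables: [lambda_1..lambda_N, xi_1..xi_N, b_pos, b_neg], all >= 0.
    c = np.concatenate([np.ones(N), C_V * np.ones(N), [0.0, 0.0]])
    # Constraint (8) rewritten as: -y_i * (sum_j lambda_j y_j K_ij + b) - xi_i <= -1.
    G = np.zeros((N, 2 * N + 2))
    G[:, :N] = -(y[:, None] * y[None, :] * K)   # coefficients of lambda_j
    G[:, N:2 * N] = -np.eye(N)                  # coefficients of xi_i
    G[:, 2 * N] = -y                            # b = b_pos - b_neg
    G[:, 2 * N + 1] = y
    h = -np.ones(N)
    res = linprog(c, A_ub=G, b_ub=h,
                  bounds=[(0, None)] * (2 * N + 2), method="highs")
    lam = res.x[:N]
    b = res.x[2 * N] - res.x[2 * N + 1]
    return lam, b

# Decision value for a test pattern z: f(z) = sum_j lam_j * y_j * K(x_j, z) + b.
```

Nothing in this LP requires the matrix K to be positive semi-definite, which is what allows the kernel to be replaced by a general similarity function later in the paper.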
3 Proposed distance-based similarity function (DSF)

Since in supervised learning the training examples are used to tell the classifier about the amount of dissimilarity/similarity among examples from the opposite/own class, learning a sound classifier seems dubious if the given similarity function misses some innate "goodness" property, where, naturally, the goodness of a similarity function needs to be suitable for the classification problem at hand. To say it more explicitly and specifically, the discriminators of both QPSVM and VLPSVM can be rewritten in the following common form, using the common term "kernel computing vector (KCV)" for SVs, EVs, or other such kernel-operating patterns: f(z) = ∑_{i∈KCVset} λ_i y_i S(x_i, z) + b, where z is any pattern and S(x_i, z) is the kernel, similarity measure, or strength of some sort of attraction force between the patterns x_i and z. This term is the source of the main computational expense, and our main goal is to find a suitable mathematical expression (function) for it that is optimally complex (powerful) while not being too expensive to execute.

Thus, it is worthwhile to review the notions of similarity, dissimilarity and distance; we discuss distance first, as below.

3.0.1 Different distances

There are many distance functions, of which the Euclidean distance and the Manhattan distance seem to give the best results in distance-based data learning [25]. They are given below.

Figure 2: Decision boundaries with training error rates and data size from (a) QPSVM (RBF kernel), (b) VLPSVM (RBF kernel), and (c) VLPSVM (DSF) on an artificial dataset named Half kernel [24] before adding noise.

Euclidean Distance (ED): The Euclidean distance is the straight-line distance between two points in Euclidean space and an extension of the Pythagorean theorem. It is also known as the L2 norm or ruler distance. The Euclidean distance between points x and y is expressed as ED(x,y) = sqrt( ∑_{i=1}^{d} |x_i − y_i|² ). This distance is used in the so far most powerful and popular kernel in machine learning, the RBF kernel, against which we compare our similarity function in terms of classification efficiency based on the similarity measure.

Manhattan Distance (MD): The Manhattan distance represents the distance between two points as the sum of the absolute differences of their Cartesian coordinates. It is also known as the L1 norm or city-block distance. The Manhattan distance between points x and y is expressed as MD(x,y) = ∑_{i=1}^{d} |x_i − y_i|, and we use this distance to model our similarity function.

3.0.2 Main idea

We start from the conclusion of [18] that objects that are similar in their representation are also similar in reality and therefore belong to the same class. The general idea is to transform a similarity into a dissimilarity function, or vice versa, by applying a monotonically decreasing function. This follows the general intuition that a distance is small when the similarity is large, and vice versa.

Theorem: Similarity decreases as distance increases.

Proof: Let three points x, y, z lie on the same straight path with distance between x and y, d(x,y) = a > 0, distance between y and z, d(y,z) = b > 0, and distance between x and z, d(x,z) = a + b. Then the similarity between x and y is s(x,y) ∝ 1/a ⇒ s(x,y) = K/a, the similarity between y and z is s(y,z) ∝ 1/b ⇒ s(y,z) = K/b, and the similarity between x and z is s(x,z) ∝ 1/(a+b) ⇒ s(x,z) = K/(a+b), for a constant K. Now, as a, b > 0, we get a², b², ab > 0. Thus a² + b² + ab > 0 ⇒ (a + b)² > ab ⇒ (a+b)/(ab) > 1/(a+b) ⇒ 1/a + 1/b > 1/(a+b) ⇒ K/a + K/b > K/(a+b) ⇒ s(x,z) < s(x,y) + s(y,z). Now, let p be another point on the same straight path further out beyond z, with distance between z and p, d(z,p) = c > 0; then the similarity between x and p is s(x,p) = K/(a+b+c) < K/(a+b) = s(x,z), since c > 0.

Although there are a number of ways to convert between a distance metric and a similarity measure, we select ours so as to have i) a non-linear mapping, ii) less computational cost, and iii) scaling that keeps the value between 0 and 1 for numerical efficiency.
Hence, our proposed Distance Similarity Function (DSF) is expressed as

S(x,y) = 1 / ( ∑_{i=1}^{d} |x_i − y_i| / σ + 1 ).

If x coincides with y, or x → y, then |x_i − y_i| → 0 ⇒ S(x,y) → 1, representing the highest similarity between x and y, whereas if x moves infinitely far away from y, then |x_i − y_i| → ∞ ⇒ S(x,y) → 0, representing the lowest similarity between x and y.

Figure 3: Decision boundaries with training error rates and data size from (a) QPSVM (RBF kernel), (b) VLPSVM (RBF kernel), and (c) VLPSVM (DSF) on an artificial dataset named Half kernel [24] after adding noise.

From the expression of the similarity function above, we see that σ → ∞ ⇒ S(x,y) → 1, meaning x and y have the highest similarity, and σ → 0 ⇒ S(x,y) → 0, meaning x and y have the lowest similarity. So, if σ is overestimated, the expression behaves almost linearly and the non-linear mapping starts to lose its power, while, on the contrary, if it is underestimated, the function lacks regularization and the decision boundary becomes highly sensitive to noise in the training data. Thus, this adjustable parameter σ plays a very significant role in the performance of the similarity function and should be carefully tuned according to the problem at hand. Furthermore, for any increment ϵ > 0 in the distance, the distance becomes d + ϵ and our similarity function S decreases, since 1/(d + ϵ) goes down; the opposite holds for a decrement of the distance by ϵ > 0, since 1/(d − ϵ) rises. This shows that our similarity function is a monotonic function whose value increases/decreases with the decrement/increment of the distance between two patterns. A rough graphical representation of our similarity function (DSF) can be seen in Fig. 4.

Figure 4: Distance-Similarity Function (DSF) using parameter σ = 1.

Non-linear mapping with mathematical expression: The motivation for such an embedding comes from the hope that the non-linear transformation of the input data into a higher-dimensional space H allows linear techniques to be used in H: non-linear classification in the lower dimension becomes linear in the higher dimension. Our similarity function f does so in the following way: f(d) = 1/(d/constant + 1), which can be written as f(d) = 1/(z + 1), where z = d/constant, and which can be expanded in two cases as below:

Case I: For |z| < 1, 1/(z+1) = 1 − z + z² − z³ + ..., and 1/2 < 1/(z+1) ≤ 1.

Case II: For |z| > 1, 1/(z+1) = (1/z) · 1/(1 + 1/z) = (1/z)(1 − z⁻¹ + z⁻² − z⁻³ + ...) = z⁻¹ − z⁻² + z⁻³ − z⁻⁴ + ..., and 0 ≤ 1/(z+1) < 1/2.

3.1 Goodness of our similarity function

[1] provided a criterion for a good similarity function to be used in a discriminator, which says, approximately, that a similarity function is good if the mean intra-class similarity it generates is sufficiently large compared to the mean inter-class similarity. After calculating the pattern similarities on benchmark data in the following way, we have found that our proposed function obeys this criterion in general, without any deviation.

Calculating similarity: The sum of the similarities of a pattern x_j with all patterns x_i, for i = 1, 2,...,N, is S(x_j, all) = ∑_{i=1}^{N} S(x_i, x_j). Then the average similarity (of all patterns with the other patterns) is mean_i(mean_j(S(x_i, x_j))).
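As a concrete illustration (not the paper's benchmark code), the following sketch computes the proposed DSF on synthetic data and then the mean own-class versus opposite-class similarity described above; the random data, the σ value and the helper name dsf_matrix are our own placeholders.

```python
# Illustrative sketch of the DSF and the mean same/opposite-class similarity check.
import numpy as np

def dsf_matrix(X, Z, sigma=1.0):
    """S(x, z) = 1 / (sum_i |x_i - z_i| / sigma + 1) for all pairs (x, z)."""
    manhattan = np.abs(X[:, None, :] - Z[None, :, :]).sum(axis=-1)
    return 1.0 / (manhattan / sigma + 1.0)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(2, 1, (50, 5))])
y = np.array([1] * 50 + [-1] * 50)

S = dsf_matrix(X, X, sigma=1.0)
same = y[:, None] == y[None, :]
mean_own = S[same].mean()    # mean intra-class similarity (diagonal included for simplicity)
mean_opp = S[~same].mean()   # mean inter-class similarity
print(mean_own, mean_opp)    # a "good" similarity should give mean_own > mean_opp
```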
On the Breast Cancer dataset [26] (or [27]), the average similarity of the training patterns using our proposed function (DSF) is 0.3614 to the own class and 0.3235 to the opposite class, which is consistent with the theorem of Balcan-Blum in [1]. On this dataset, the average kernel (or similarity) value of the training patterns using the Radial Basis Function (RBF) is 0.6267 to the own class and 0.5748 to the opposite class, which is also consistent with the theorem of Balcan-Blum in [1]. However, comparing the values from these two functions, we see that DSF gives a comparatively higher Class Variational Similarity Deviation (= (similarity to own class − similarity to opposite class) / similarity to own class, which is the higher the better for classification, as it recognises the discrepancy between patterns from different classes): it is (0.3614 − 0.3235) / 0.3614 = 0.1049 in the case of DSF and (0.6267 − 0.5748) / 0.6267 = 0.0828 in the case of the RBF-based kernel function. The similarity values from these two functions can be seen in more detail in Fig. 5.

Figure 5: Mean kernel (similarity) function value comparison between the RBF kernel and DSF.

However, the values generated by a similarity (or kernel) function operating on patterns depend heavily on the pattern scattering, i.e., the data complexity. If a pattern from one class is very near to the patterns of the other class, its overall distance to the patterns of the opposite class is smaller than that to the patterns of its own class. Consequently, the similarity behaves inversely. Thus, we investigate these similarity values further, going a bit deeper through a numerical and functional analysis of the patterns.

3.1.1 Numerical and functional analysis of patterns

We can determine a pattern's position from its decision function value and its slack variable. Considering the geometric positions of the patterns in the pattern space, we can divide them into two main groups: inliers and outliers. A pattern is said to be an "inlier" if it belongs to the region of its own class boundary; similarly, an "outlier" pattern of a training set is one that stays outside of its own class boundary. Hence a training pattern x_j with class label y_j and decision function value f(x_j) is an inlier if y_j = sgn(f(x_j)) ⇒ 1 = y_j sgn(f(x_j)) ⇒ y_j f(x_j) > 0, and an outlier if y_j = −sgn(f(x_j)) ⇒ −1 = y_j sgn(f(x_j)) ⇒ y_j f(x_j) < 0. Now, re-formatting the error constraints of the SVM in (2) and in (8) into a single form using the slack variable ξ, we get y_j f(x_j) ≥ 1 − ξ_j, which gives ξ_j ≥ 1 − y_j f(x_j), leading to 0 ≤ ξ_j ≤ 1 for inlier patterns and ξ_j > 1 for outlier patterns. We discuss them further below.

In case of inlier patterns: As these patterns stay inside the class decision boundaries of their own classes, generally their overall distances to the patterns of their own class are smaller than their distances to the patterns of the opposite class, and the similarity values behave inversely, that is, their overall similarities to the patterns of their own class are higher than their similarities to the patterns of the opposite class. For each inlier, we calculate the ratio of the sum of its similarities to the patterns of its own class to the sum of its similarities to the patterns of the opposite class (that is, own class / opposite class) in the way given below, and we expect it to be larger than 1.
For any inlier pattern x_i with class label y_i, the sum of its similarities to the patterns of its own class is SumSimOwn_i = ∑_m S(x_i, x_m) | (sgn(f(x_i)) = y_i = y_m), and the sum of its similarities to the patterns of its opposite class is SumSimOps_i = ∑_n S(x_i, x_n) | (sgn(f(x_i)) = y_i = −y_n). Then their similarity ratio is

RatSumSim^{mn}_i = SumSimOwn_i / SumSimOps_i = [ ∑_m S(x_i, x_m) | (sgn(f(x_i)) = y_i = y_m) ] / [ ∑_n S(x_i, x_n) | (sgn(f(x_i)) = y_i = −y_n) ].

The inlier-related similarity ratio values from the two functions can be seen in more detail in Figs. 6 and 7.

Figure 6: Similarity ratio of inlier patterns (with their corresponding pattern function (absolute) values; a higher function value means the pattern stays deeper inside its own class) to the patterns of their own class with respect to the similarities to the patterns of the opposite class (that is, own class / opposite class) on the Breast Cancer dataset using our similarity function (DSF). As inlier patterns stay inside the decision boundary of their own classes, for them, generally, the similarities to the patterns of their own class are higher than the similarities to the patterns of the opposite class; hence the ratio of the sum of similarities to the own class with respect to the opposite class should be higher for a good similarity function. This is also the case for our similarity function. However, for very few patterns, which are placed in very special and isolated places of their own class, this may differ depending on their geometric positions. Remarkably, the figure also shows that these similarity ratio values are higher for the patterns staying deeper inside their classes, quite as expected.

In case of outlier patterns: For a complex or very noisy dataset, a well-generalized SVM is expected to produce more outliers. As these patterns stay outside of their own classes, they behave nearly in the opposite way to the inlier patterns with respect to pattern distance and similarity. So, for these patterns, generally their overall distances to the patterns of their own class are larger than their distances to the patterns of the opposite class; consequently, the similarity values behave inversely, that is, their overall similarities to the patterns of their own class are smaller than their similarities to the patterns of the opposite class. For each outlier, we calculate the ratio of the sum of its similarities to the patterns of its opposite class to the sum of its similarities to the patterns of its own class (that is, opposite class / own class) in the way given below, and we expect it to be larger than 1. For any outlier pattern x_o with class label y_o, the sum of its similarities to the patterns of its opposite class is SumSimOps_o = ∑_n S(x_o, x_n) | (sgn(f(x_o)) = −y_o = y_n), and the sum of its similarities to the patterns of its own class is SumSimOwn_o = ∑_m S(x_o, x_m) | (sgn(f(x_o)) = −y_o = −y_m). Then their similarity ratio is

RatSumSim^{nm}_o = SumSimOps_o / SumSimOwn_o = [ ∑_n S(x_o, x_n) | (sgn(f(x_o)) = −y_o = y_n) ] / [ ∑_m S(x_o, x_m) | (sgn(f(x_o)) = −y_o = −y_m) ].

The outlier-related similarity ratio values from the two functions can be seen in more detail in Figs. 8 and 9.
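The following sketch (again illustrative only) computes the per-pattern ratios RatSumSim defined above, reusing the hypothetical dsf_matrix output S and labels y from the earlier sketch; f_vals stands for the decision function values f(x_j) of some trained machine and is a placeholder, not the paper's trained discriminator.

```python
# Illustrative inlier/outlier similarity-ratio check.
import numpy as np

def similarity_ratios(S, y, f_vals):
    """RatSumSim per pattern: own/opposite class for inliers,
    opposite/own class for outliers; both are expected to exceed 1."""
    ratios = []
    for i in range(len(y)):
        own = S[i, y == y[i]].sum() - S[i, i]   # exclude self-similarity
        opp = S[i, y != y[i]].sum()
        inlier = y[i] * f_vals[i] > 0           # inlier iff y_i * f(x_i) > 0
        ratios.append(own / opp if inlier else opp / own)
    return np.array(ratios)

# Usage sketch: ratios = similarity_ratios(S, y, f_vals)
```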
Why a Manhattan-distance-based kernel and LPSVM: Compared to the L2 norm (also known as the Euclidean norm, which gives the ordinary Pythagorean distance from the origin to a point) used in QPSVM, the L1 norm leads to a much sparser solution [28] in the case of VLPSVM. This is useful for minimizing the kernel computation by minimizing the number of kernel computing vectors (KCVs), the basis patterns that build the discriminator function of a machine by executing kernel operations (called Support Vectors, Expansion Vectors, Basis Vectors or Machine Vectors, however they are named by individual authors for various machines). Moreover, some other issues regarding distance have inspired the selection of this distance: i) A problem with the Euclidean distance, as a member of the Minkowski metric family, is that the largest-scaled feature dominates the others [29]. ii) The Euclidean distance is noise sensitive [30]. iii) The Manhattan distance function requires less computation [31]. iv) Distance learning based on the L_k norm is sensitive to the value of k in high dimensions, which means that the Manhattan distance metric (L1 norm) is consistently preferable to the Euclidean distance metric (L2 norm) for high-dimensional data mining applications [32]. v) In the case of the Manhattan distance, the distance between two exterior points x and z through another interior point y can be written as d(x,z) = d(x,y) + d(y,z). So the distances are linearly related, which matches better with our distance-inverted similarity measure.

Figure 7: Similarity ratio of inlier patterns (with their corresponding pattern function (absolute) values) to the patterns of their own class with respect to the similarities to the patterns of the opposite class (that is, own class / opposite class) on the Breast Cancer dataset using the RBF kernel. Interestingly, it behaves nearly the same way as our similarity function. Note that the RBF- and DSF-based SVMs are two different machines, so they have two different discriminators. Thus, they may produce different function values for the same patterns. As a result, the two machines may yield different inlier and outlier counts from the same dataset, and patterns detected as inliers/outliers by the RBF SVM may not be detected as inliers/outliers by the DSF SVM. This is why the numbers of variables on the horizontal axes of the two plots are different.

vi) By taking just the sum of absolute values, the Manhattan distance considers only the real topological distance, but by taking squared values the Euclidean distance may sometimes lead to misinterpretation, for example, when 0