Image Anal Stereol 2016;35:181-193 doi:10.5566/ias.1490 Original Research Paper ASSESSINGDISSIMILARITYOFRANDOMSETSTHROUGHCONVEX COMPACTAPPROXIMATIONS,SUPPORTFUNCTIONSAND ENVELOPETESTS VESNA GOTOVAC 1 , KATE ˇ RINA HELISOV ´ A B ;2 AND IVO UGRINA 3 1 Department of Mathematics, Faculty of Science, University of Split, 21000 Split, Croatia; 2 Department of Mathematics, Faculty of Electrical Engineering, Czech Technical University in Prague, 162 27 Prague 6 – Dejvice, Czech Republic; 3 Faculty of Pharmacy and Biochemistry, University of Zagreb, 10000 Zagreb, Croatia e-mail: vgotovac@pmfst.hr, helisova@math.feld.cvut.cz, ivougrina@gmail.com (Received February 24, 2016; revised November 16, 2016; accepted November 17, 2016) ABSTRACT In recent years random sets were recognized as a valuable tool in modelling different processes from fields like biology, biomedicine or material sciences. Nevertheless, the full potential of applications has not still been reached and one of the main problems in advancement is the usual inability to correctly differentiate between underlying processes generating real world realisations. This paper presents a measure of dissimilarity of stationary and isotropic random sets through a heuristic based on convex compact approximations, support functions and envelope tests. The choice is justified through simulation studies of common random models like Boolean and Quermass-interaction processes. Keywords: approximations, dissimilarity, envelope tests, random sets, stochastic geometry, support functions . INTRODUCTION The need for continuous improvement in fields like biology, biomedicine or material sciences urges us to develop and explore new techniques and models in applications. Over few years, one of these models received a lot of attention due to its strong mathematical foundations and general nature. As a model, random sets can describe and explain many events. Some examples are dynamics of cells in organisms (Mrkviˇ cka and Mattfeldt, 2011; Hermann et al., 2015), particles in materials (Helisov´ a, 2014; Neumann et al., 2016) or the presence of different plants in ecosystems (Diggle, 1981; Møller and Helisov´ a, 2010). Although three dimensional applications are of interest, applications of random sets are mainly done through two-dimensional modelling. We are often interested only in the projection of objects to the plane (e.g., ground area of plants or trees) or we study only cross-sections of a mass which create planar formations and suppose that the behaviour of the studied object is stationary in the third dimension (e.g., organic cells or material particles). Random sets are represented by different models ranging from simple and intuitive ones to those that are highly complex and specialized. One of the basic models is the Boolean model (Chiu et al., 2013). Due to its simplicity many theoretical results can be derived for it but, unfortunately, the model is not sufficient in many applications. For example, in Diggle (1981) researchers tried to model a data set concerning heather plants and the model proved to be unsatisfactory due to some interactions between bushes. An example of a sophisticated model is Quermass-interaction process. This model introduces dependencies on geometrical characteristics of sets like area or boundary. It was first studied by Kendall et al. (1999) and later extended by Møller and Helisov´ a (2008). As with classical statistics, being able to specify a particular model (e.g., distribution in hypothesis testing) is of great benefit. However, it is not necessarily feasible. Mrkviˇ cka and Mattfeldt (2011) explored tumour cells data with the Boolean model with the aim to use estimated parameters for detecting the differences between mammary cancer and masthopatic tissue. This model proved to be unsatisfactory since only 7% of images of mammary cancer and 1% of images of mastopathy were compatible with the model on the significance level of 0.05. Later in Hermann et al. (2015), the Quermass- interaction process was fitted to the same data, but the compatibility was again rejected in many cases (only 27.5% cases of mammary cancer and 6% cases of mastopathic tissue were accepted with a significance level of 0.05). 181 GOTOVAC V ET AL: Dissimilarity of random sets There are some instances where knowledge about the underlying model is not necessary, although it could be beneficial. For example, in classical hypothesis testing we could use rank based statistics without specifying strict probability distributions. Translated to the language of random sets, one is interested only in distinguishing between two sets seen as realisations of the same or different models. Exactly this problem, restricted only to stationary and isotropic random sets, will be the main topic explored in the rest of the paper. In its most general form the problem can be described by the following diagram: RealisationA RealisationB Characteristics ofA Characteristics ofB Dissimilarity measure Given two realisations of stationary and isotropic random sets one would like to find characteristics useful when making a decision about strength of dissimilarity of the underlying processes that have generated those sets. Strength of dissimilarity will be represented as a number a2 [0;1]; where the lower value of a means lower belief that the realisations come from the same underlying processes. Notice that we have posed the problem starting with realisations of random sets. Contrary to the classical point measurements, where points can be described purely by numbers (up to an error), describing realisations of random sets is much harder. Especially in practice where one usually obtains an image in raster format (like JPG or PNG), i.e., a grid of pixels (usually black and white) representing an observed realisation of a random set. Also, realisations of random sets are often of rugged shape and it is hard to work with them since even the description of the realisations is complicated. Due to these reasons, it seems plausible to try to approximate realisations by sets of more appealing characteristics. One of the approaches, and the one taken in this article, is to use convex compact sets due to some useful properties describing them. Therefore, we shall assess dissimilarity of the underlying process generating realisations through assessing dissimilarity of convex compact sets approximating our realisations. The approach is graphically presented in Fig. 1 and can be described in the following steps: starting with realisations of random sets (images in practice) find a suitable covering with simple (well described) sets like balls (step 1), break the covering into convex compact sets (step 2), derive some characteristics of this approximation (step 3) and use these characteristics to assess the dissimilarity of the underlying processes. To the best of our knowledge, this approach had, until now, not been taken into consideration in the known literature. It is really important to stress that we are not interested in assessing the similarity of images representing realisations of random sets like in the literature on analysis of binary images (e.g., Hall, 1985; Ohser and M¨ ucklich, 2000, or Pratt, 2001) but on assessing the dissimilarity of underlying processes. The paper is organised as follows. In the Material and Methods section we provide basic definitions, methods and procedures. Roughly, we define support functions, introduce a procedure for covering a set by discs, define V oronoi tessellation on the union of discs, introduce the theory behind envelope tests and give a few remarks on free parameters. In the Results section we present a simulation study and numerical results based on the methodology from the Material and Methods section. In the last section we give an overview of the presented results with comments and remarks. MATERIAL AND METHODS BASICS OF RANDOM SETS In this section, we recall some terms related to the basic theory of random sets. Definitions 2-5 concerning basic terms can be found in (Chiu et al., 2013) while Definition 6 of Quermass-interaction process in this form is taken from (Møller and Helisov´ a, 2008). Readers familiar with this topic can skip to the following section. Definition 1. The set K R d is said to be convex if cx+(1 c)y2 K for all x;y2 K and all 0 < c < 1. The set K R d is said to be compact if it is closed and bounded. Definition 2. Let (W;F;P) be a probability space, C the system of closed sets in R d and C = sfC K : K is a compact subset ofR d g, whereC K =fC2C : C\ K6= / 0g. Then a random closed set X in R d is a measurable mapping from(W;F) to(C;C). Definition 3. The distribution P X of a random set X is given by the relation P X (F) = P(fw2W : X(w)2 Fg) for F2C. 182 Image Anal Stereol 2016;35:181-193 P. HERMANN ETAL. (a) Mastopathic tissue (b) Mammary cancer tissue Figure17. Mastopathic(a)andmammarycancer(b)tissuepictureofwhichtheHausdorffmeasureisestimated. Table VII. Hausdorffmeasureofmastopathictissuescalculatedbyusing(10). No. #ofholes diam(U) !(U) Haus.dimension Haus.Measure 11 20 . 3 7 5 00 . 0 6 5 81 . 5 6 3 63 . 2 7 7 0 23 30 . 7 0 2 90 . 2 3 3 71 . 7 6 6 92 . 2 9 4 7 31 60 . 4 4 4 80 . 0 8 2 61 . 6 0 0 03 . 3 1 1 9 41 10 . 7 1 0 00 . 1 2 9 61 . 6 7 2 24 . 3 5 1 7 5–––– – 61 10 . 3 0 9 80 . 0 6 9 21 . 5 7 1 62 . 2 9 0 9 71 90 . 3 8 7 70 . 1 4 7 81 . 6 9 3 41 . 3 5 9 3 81 30 . 9 9 6 50 . 2 0 3 31 . 7 4 4 44 . 8 8 9 6 921 . 0 0 1 70 . 0 5 0 31 . 5 2 0 41 9 . 9 4 5 2 10 29 1.0063 0.2633 1.7860 3.8404 Table VIII. Hausdorffmeasureofmammarycancertissuescalculatedbyusing(10). No. #ofholes diam(U) !(U) Haus.dimension Haus.Measure 11 0 30 . 4 2 5 70 . 3 6 0 91 . 8 3 6 50 . 5 7 7 4 29 40 . 3 6 6 40 . 3 3 5 61 . 8 2 4 90 . 4 7 6 9 36 30 . 4 3 0 80 . 2 6 7 21 . 7 8 8 30 . 8 3 0 1 46 20 . 5 1 7 60 . 2 1 0 91 . 7 5 0 41 . 4 9 7 1 57 80 . 2 8 3 20 . 2 6 6 11 . 7 8 7 60 . 3 9 4 0 66 10 . 3 4 0 20 . 2 6 5 51 . 7 8 7 30 . 5 4 8 4 75 90 . 2 5 9 00 . 2 7 5 91 . 7 9 3 50 . 3 2 1 4 86 50 . 1 9 7 70 . 2 5 6 11 . 7 8 1 50 . 2 1 7 6 93 00 . 2 8 0 60 . 1 7 9 51 . 7 2 4 50 . 6 2 2 8 10 58 0.4530 0.3821 1.8457 0.6068 and Table VIII). Tables VII and VIII provide quantitative information about mastopathic (Table VII) and mammary cancer (Table VIII) tissue dissimilation within the monitored 2D section pattern. The measurable entries are performed for the sake of rating the layout of the tissue arising (R) within the parentone:numberofcoherentcircumscribed‘holes’inparenttissue!lledbythemastopathic/mammary cancertissueequalsthenumberofelementsofR,diameter(diam(R))infactperformsthediameterofthe greatest hole in the picture (herein, the diameter stands for the maximal distance of two arbitrary points withinone‘hole’),density !(R)standsfortheratiobetweenmastopathic(ormammarycancer)tissueto the entire image area, and HM of the pattern depicted on particular picture represents the ratio between diam(R)and!(R).Theremarkablevalue19.9452inTableVIIforHMresultsfromthecombinationofa smallamountoflittleareas(twoholes;see!rsttheplotofthethirdrowinFigure17(a)),alargeenough diameter(morethan1)anddensitywhichhastinyvaluesbecauseofalongandthintissue. Copyright©2015JohnWiley&Sons,Ltd. Statist. Med. 2015,342636–2661 2653 P. HERMANN ETAL. (a) Mastopathic tissue (b) Mammary cancer tissue Figure17. Mastopathic(a)andmammarycancer(b)tissuepictureofwhichtheHausdorffmeasureisestimated. Table VII. Hausdorffmeasureofmastopathictissuescalculatedbyusing(10). No. #ofholes diam(U) !(U) Haus.dimension Haus.Measure 11 20 . 3 7 5 00 . 0 6 5 81 . 5 6 3 63 . 2 7 7 0 23 30 . 7 0 2 90 . 2 3 3 71 . 7 6 6 92 . 2 9 4 7 31 60 . 4 4 4 80 . 0 8 2 61 . 6 0 0 03 . 3 1 1 9 41 10 . 7 1 0 00 . 1 2 9 61 . 6 7 2 24 . 3 5 1 7 5–––– – 61 10 . 3 0 9 80 . 0 6 9 21 . 5 7 1 62 . 2 9 0 9 71 90 . 3 8 7 70 . 1 4 7 81 . 6 9 3 41 . 3 5 9 3 81 30 . 9 9 6 50 . 2 0 3 31 . 7 4 4 44 . 8 8 9 6 921 . 0 0 1 70 . 0 5 0 31 . 5 2 0 41 9 . 9 4 5 2 10 29 1.0063 0.2633 1.7860 3.8404 Table VIII. Hausdorffmeasureofmammarycancertissuescalculatedbyusing(10). No. #ofholes diam(U) !(U) Haus.dimension Haus.Measure 11 0 30 . 4 2 5 70 . 3 6 0 91 . 8 3 6 50 . 5 7 7 4 29 40 . 3 6 6 40 . 3 3 5 61 . 8 2 4 90 . 4 7 6 9 36 30 . 4 3 0 80 . 2 6 7 21 . 7 8 8 30 . 8 3 0 1 46 20 . 5 1 7 60 . 2 1 0 91 . 7 5 0 41 . 4 9 7 1 57 80 . 2 8 3 20 . 2 6 6 11 . 7 8 7 60 . 3 9 4 0 66 10 . 3 4 0 20 . 2 6 5 51 . 7 8 7 30 . 5 4 8 4 75 90 . 2 5 9 00 . 2 7 5 91 . 7 9 3 50 . 3 2 1 4 86 50 . 1 9 7 70 . 2 5 6 11 . 7 8 1 50 . 2 1 7 6 93 00 . 2 8 0 60 . 1 7 9 51 . 7 2 4 50 . 6 2 2 8 10 58 0.4530 0.3821 1.8457 0.6068 and Table VIII). Tables VII and VIII provide quantitative information about mastopathic (Table VII) and mammary cancer (Table VIII) tissue dissimilation within the monitored 2D section pattern. The measurable entries are performed for the sake of rating the layout of the tissue arising (R) within the parentone:numberofcoherentcircumscribed‘holes’inparenttissue!lledbythemastopathic/mammary cancertissueequalsthenumberofelementsofR,diameter(diam(R))infactperformsthediameterofthe greatest hole in the picture (herein, the diameter stands for the maximal distance of two arbitrary points withinone‘hole’),density !(R)standsfortheratiobetweenmastopathic(ormammarycancer)tissueto the entire image area, and HM of the pattern depicted on particular picture represents the ratio between diam(R)and!(R).Theremarkablevalue19.9452inTableVIIforHMresultsfromthecombinationofa smallamountoflittleareas(twoholes;see!rsttheplotofthethirdrowinFigure17(a)),alargeenough diameter(morethan1)anddensitywhichhastinyvaluesbecauseofalongandthintissue. Copyright©2015JohnWiley&Sons,Ltd. Statist. Med. 2015,342636–2661 2653 0123456 3.0 3.5 4.0 4.5 5.0 5.5 6.0 0123456 3.0 3.5 4.0 4.5 5.0 5.5 6.0 Dissimilarity measure 1 1 2 2 3 3 Fig. 1. Steps in assessing dissimilarity of the underlying processes generating realizations through dissimilarity convex compact sets approximating realizations. The original images (left) were kindly provided by Mrkviˇ cka and Mattfeldt (2011). Definition 4. A random set X is stationary if its distribution is invariant under translation, i.e., for all v2R d , the distribution of X+ v=fu+ v;v2 Xg is the same as that of X. A random set X is isotropic if its distribution is invariant under rotation. If a random set is both stationary and isotropic, it is called motion invariant. For A;B R d let us denote by A B :=fx+ y : x2 A;y2 Bg and byjAj the d-dimensional Lebesgue measure of the set A. Definition 5. Let Y =fy 1 ;y 2 ;:::g be a stationary Poisson point process in R d and fB 1 ;B 2 ;:::g be a sequence of independent identically distributed random compact sets in R d that are independent of Y . IfEjB 1 Kj <¥ for all compact sets K, then the random set B=[ ¥ n=1 (y n + B n ) (1) is called Boolean model. Definition 6. Consider a planar random disc Boolean model, i.e., the Boolean model with B 1 being a disc inR 2 with random radius. Quermass-interaction process is a random set whose probability measure is absolutely continuous with respect to the probability measure of the given Boolean model and the density of its probability measure with respect to the probability measure of the given Boolean model is of the form f q (b)= 1 c q expfq 1 A(U b )+q 2 L(U b )+q 3 c(U b )g for each finite disc configuration b =fb 1 :::;b n g, where A = A(U b ) is the area, L = L(U b ) is the perimeter,c =c(U b ) is Euler-Poincar´ e characteristic (i.e., the number of connected components minus the number of holes) of the union U b =[ n i=1 b i , q = (q 1 ;q 2 ;q 3 ) is 3-dimensional vector of parameters and c q is the normalising constant. SUPPORT FUNCTION OF A CONVEX COMPACT SET In the following text, convex compact sets have a central role in our approach and we are restricted only to the planar case, i.e., the convex compact sets inR 2 . One of the most important basic concepts related to convex compact sets is the support function. Definition 7. For a (random) convex compact set X define its support function as h X (u)= sup x2X hu;xi; u2¶b(0;1); where ¶b(0;1) is the unit sphere inR 2 , i.e., the circle with radius 1 and the centre in the origin. Fig. 2 illustrates the support function of two sets, a disc and a square centred in the origin. The importance of the support function in our approach is derived from the following result on the equality of distributions of two random convex compact sets via their support functions (for the proof please take a look at Lavie (2000)). Theorem 8. For two random convex compact sets X 1 and X 2 it holds X 1 = (D) X 2 ,(h X 1 (u i )) i2I = (D) (h X 2 (u i )) i2I for all finite index sets I and (u n ) n2N dense subset of ¶b(0;1): 183 GOTOVAC V ET AL: Dissimilarity of random sets 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 0123456 0.6 0.8 1.0 1.2 1.4 angles Support function 1.5 1.0 0.5 0.0 0.5 1.0 1.5 1.5 0.5 0.0 0.5 1.0 1.5 0123456 1.0 1.1 1.2 1.3 1.4 angles Support function Fig. 2. Examples of the support function for a disc and a square, respectively. COVERING A SET BY DISCS WITH IDENTICAL RADII As mentioned in the Introduction, our aim is to divide a planar data set S (e.g., Fig. 3a) into a family of convex compact sets. In practice we usually start with a digital binary image I containing the observed set, i.e., a picture consisting of black and white pixels, as an approximation of the initial set and its complement. This approximation can be achieved by pixelisation through a simple structure, where pixels are squares with side length D and the binary image I containing an arbitrary planar set S is a matrix M=[m i; j ] of black and white pixels (or the matrix of 1’s and 0’s) such that the pixel m i; j is black if and only if its centre lies in the set S, i.e., if and only if (i D D=2; j D D=2)2 S, and it is white otherwise (Fig. 3b). Let us denote by D the approximation of S in the digital image I. Having a digital approximation D of our initial set S, it can be useful to approximate the shape of the given set S by a geometrical object A with suitable mathematical properties. One of the approaches, and the one taken in this article, would be to cover the pixelised version with discs of identical radii. This can be achieved by utilising a customized version of maximal Poisson-disc sampling algorithm (Ebeida et al., 2011). In the maximal Poisson-disc sampling algorithm, set D is covered by discs b(x i ;r) with centres x i and radii r so that the discs centres come from the Poisson-disc sampling process, i.e., the point process X =fx i ; i = 1;:::;ng is constructed point by point satisfying the following conditions: 1.8x i ;x j 2 X;x i 6= x j :jjx i x j jj r, 2.If D i 1 = Dn[ j=1;:::;i 1 b(x j ;r), then 8x i 2 X;8B D i 1 : P(x i 2 B)= jBj jD i 1 j ; 3.8x2 D9x i 2 X :jjx x i jj< r. It is obvious that in this approach the area of approximating set is larger than the area of the original set D. Depending on the choice of radius r this can lead to many white pixels of the digital image I becoming black, therefore leading to an increase in geometrical properties like area or boundary length. To account for this we introduce a customized version of the maximal Poisson-disc sampling (called CMPD in future) such that we first reduce the set D by a radius r, i.e., we construct D r =fu : b(u;r) Dg; (Fig. 3c). Now, we cover the newly obtained set D r by the maximal Poisson-disc sampling obtaining a set ˆ D r . In order to keep the original shape as precise as possible, we start covering on the border of D r , i.e., the centres are first sampled from the boundary pixels (Fig. 3d). After covering all the boundary pixels we choose the centres from the remaining black pixels of D r (Fig. 3e). However, since set ˆ D r will be a subset of the starting set D we can try to cover what has not yet been covered, i.e., ˜ D= Dn ˆ D r , using the conditions 1 and 2 from the maximal Poisson-disc sampling and adjusted condition 3: 1.8x i ;x j 2 X;x i 6= x j :jjx i x j jj r, 184 Image Anal Stereol 2016;35:181-193 (a) Basic planar set (b) Digital approximation (c) Reduced set D r (d) Covering on boundary pixels (e) Covering of inner pixels (f) Additional covering based on PD(r)a; 186 Image Anal Stereol 2016;35:181-193 where H 0 is a simple null hypothesis. Therefore, if the observed vector T 1 leaves envelope at some point, i.e., R 1 < R (a) the null hypothesis is rejected at significance level a. If the observed vector lies completely inside this envelope, i.e., R 1 > R (a) , the null hypothesis is not rejected at significance level a. If the observed vector coincides in some point with the border of the envelope, i.e., R 1 = R (a) the rejection of the null hypothesis remains undecided. Testing can also be conducted through a customized version of p-values which are defined by assigning an extreme rank measure R k to each of the vectors T k , such that the lowest rank corresponds to the most extreme values of statistic: p = 1 s+ 1 s+1 k=1 1(R k < R 1 ); p + = 1 s+ 1 s+1 k=1 1(R k R 1 ): From the definition of p-values it is easy to see that R 1 < R (a) is equivalent to p + a (null hypothesis rejected), R 1 > R (a) is equivalent to p > a (null hypothesis not rejected) and R 1 = R (a) is equivalent to p a < p + (the rejection of the null hypothesis remains undecided). The width of the interval between p + and p is p + p = 1 s+ 1 s+1 k=1 1(R k = R 1 ) 2n s+ 1 : and in the case of lots of ties it can be large. This problem can sometimes be solved by additional ordering as described by Myllym¨ aki et al. (2016). It consists in replacing ranks R k by vector ranks R k which allow finer ordering among ranks so that in some cases it eliminates ties. Consider the vectors of point-wise ordered ranks R k = (R k;(1) ;:::;R k;(n) ), wherefR k;(1) ;:::;R k;(n) g =fR k (j 1 );:::;R k (j n )g and R k;( j 1 ) R k;( j 2 ) whenever j 1 j 2 . Then R k 1 R k 2 ,9n 0 n : R k 1 ;( j) = R k 2 ;( j) 8 j< n 0 & R k 1 ;(n 0 ) < R k 2 ;(n 0 ) : (3) Using R k instead of R k , k= 1;:::;s+1; for calculation of the width p + p of the p-value interval, it may theoretically happen that the value is the same as for R k , but in the most cases it is significantly narrower. Now lets describe the application of this test to our data. In our case, we compare two realisations of random sets in the way that first, each of the sets is covered by discs with identical radii, then the corresponding V oronoi tessellation is constructed, and finally, m non-neighbouring cells from each of the V oronoi tessellations are sampled. The sampling is carried out so that the first cell B 1 is chosen randomly uniformly from all the cells in the corresponding tessellation and having k sampled cells, the (k+ 1)-th cell B k+1 is chosen randomly uniformly from the set of cellsfB j : B j \ B k+1 = / 0g for k = 1;:::;m 1. It means that our data consist of 2 m characteristics of convex compact sets, more precisely of 2 m support functions related to the centres of the corresponding covering discs and evaluated in some (equidistant) partition of(0;2p]. Thus we do not explicitly have only one characteristic T 1 (j) to be compared to T k (j), k= 2;:::;s+ 1. Therefore we decided to use a permutation version of the above-mentioned test to attain functional ANOV A procedure (Mrkviˇ cka et al., 2015). It works as follows. We form matrices H A and H B by putting in their rows discretized support functions from the first set and the second set, respectively. We denote by H A (j j ) and H B (j j ), respectively, the j th column of matrices H A and H B while the column represents values of support functions of sample cells evaluated at angles j j = j 2p n , j = 1;2;:::;n: Further, denote by H A (j j ), H B (j j ), Var(H B (j j )) and Var(H A (j j )) the mean values and variances, respectively, of each column in each matrix. Assume that there exist non random functions m A (j) andm B (j) such that H A;i (j)=m A (j)+ e A;i (j); H B;i (j)=m B (j)+ e B;i (j) for i = 1;:::;m, where e A;i (j) and e B;i (j) are i.i.d. samples from distribution G(j) for every j; where G(j) has zero mean and finite variance. We want to test the hypothesis H 0 :m A (j) m B (j) 0: The characteristic T 1 (j j ) is the mean difference of support functions H A (j j ) and H B (j j ) normalised to unit variance, i.e., T 1 (j j )= H A (j j ) H B (j j ) p Var(H A (j j ))+ Var(H B (j j )) for j = 1;2;:::;n. Characteristics T k (j j ) for k = 2;:::;s+ 1 are given by T k (j j )= H 0 1 (j j ) H 0 2 (j j ) p Var(H 0 1 (j j ))+ Var(H 0 2 (j j )) ; 187 GOTOVAC V ET AL: Dissimilarity of random sets where H 0 1 ; and H 0 2 are obtained by randomly permuting rows of matrix H A H B and then splitting back into two matrices with equal number of rows. An example of graphical output of the test is shown in Fig. 4. The introduced plots come from the simulation study described in details below in Results section. 0 1 2 3 4 5 6 -0.4 -0.2 0.0 0.2 0.4 angles Envelopes 0 1 2 3 4 5 6 -0.6 -0.4 -0.2 0.0 0.2 0.4 angles Envelopes Fig. 4. Graphical representation of envelopes for number of permutations s= 4999 when comparing two Boolean models, where [p ; p + ) = [0:1726;0:1774) (left), and Cluster model with Repulsive model, where [p ; p + )=[0:0054;0:0102) (right), for details concerning the mentioned models Simulated data section. Dot-dashed black line represents envelope for a = 0:05: ALIGNMENTOFSUPPORTFUNCTIONS Since we assume that the distribution of original random set is isotropic, the same should be true for the distribution of a randomly chosen cell in the V oronoi tessellation of its coverage. However, the mean value of support functions of two cells having identical shape but being rotated by different angles could significantly differ from their original support function. In order to avoid this “loss of information”, we cluster and align support functions of cells that are “almost equal” if transformed under rotation. Alignment is achieved through an adjusted version of agglomerative hierarchical clustering (James et al., 2013) with average linkage and correlation as the dissimilarity measure. In more details, let us denote by H = [h i; j ] = [h i (j j )] a matrix, where h i (j j ) stands for the value of support function for the i-th sampled cell of the approximating tessellation at the j-th angle on a mesh of angles (e.g., j j = j 2p n ; j = 1;:::;n). Also, let us denote with h i;1: the i-th row of the matrix H and with h i;l: =(h i;l ;h i;l+1 ;:::;h i;n ;h i;1 ;:::;h i;l 1 ) a shifted version of h i;1: by l places. Now, for rows i 1 and i 2 we can define the value f cor(i 1 ;i 2 )= max cor h i 1 ;1: ;h i 2 ;l: : l= 1;2;:::;n: : Using this newly defined measure of similarity between rows of H we can build a dendrogram (or clustering tree) for rows thus obtaining the clustering between rows based on the similarity between support functions regardless of rotation. As with almost all applications of hierarchical clustering a threshold C for the correlation should be defined by a researcher to cut the dendrogram into clusters. Alignment is now performed within the obtained clusters by shifting appropriate rows to achieve maximal f cor. Since the main problem in this paper is comparing two random sets, matrix H is defined as H = H A H B where H A and H B are matrices representing support functions for the first and the second set, respectively. Therefore, alignment is done together for both datasets. This approach removes a problem with the choice of the starting row within a cluster to align with since all similar tessellations will be aligned in the same way regardless of affiliation to the first or the second random set. The effects of such alignment when comparing realisations from different underlying processes conducted in our later simulation study can be seen in Fig. 5. CHOOSING “FREE” PARAMETERS Following the aforementioned remarks on radius we define a criterion for choosing the suitable radius 188 Image Anal Stereol 2016;35:181-193 0 1 2 3 4 5 6 2.5 3.0 3.5 4.0 4.5 5.0 angles Supportfunctions 0 1 2 3 4 5 6 2.5 3.0 3.5 4.0 4.5 5.0 angles Supportfunctions 0 1 2 3 4 5 6 -0.4 -0.2 0.0 0.2 0.4 angles Testfunction 0 1 2 3 4 5 6 2.5 3.0 3.5 4.0 4.5 5.0 angles Supportfunctions 0 1 2 3 4 5 6 2.5 3.0 3.5 4.0 4.5 5.0 angles Supportfunctions 0 1 2 3 4 5 6 -0.6 -0.4 -0.2 0.0 0.2 0.4 angles Testfunction Fig. 5. Support functions of 100 independent and randomly chosen cells from Boolean model (left) and Repulsive model (middle) with black lines representing mean values of these functions, and graphical representation of envelope test for these two models without alignment (upper line) and with alignment for C=0.9 (lower line). (expressed again in pixels) as R= max i=1;2;::: fi : PD(i) dg; (4) whered is some predefined pixel difference level. Note that the values PD(i) depend on random covering, so it is a random variable and therefore the value R is random, too. In the section on simulation studies, we have a lot of input realisations of the same random set, thus we use the mean value of PD(i) of all these realisations to find the suitable radius R. In real applications where usually only one realisation is available, we can take R calculated from only one covering of that realisation since our simulation study suggests that the values PD(i) for a given realisation do not change significantly. Since we are comparing two data sets D A and D B it may happen that the corresponding optimal radii R A and R B are different. In that case we define R as R= minfR A ;R B g: (5) RESULTS SIMULATED DATA We applied the methodology described in Material and Methods section to a few preselected pairings between three different simulated realisations of random sets. One random set was a Boolean model (see Definition 5) and the other ones were random disc Quermass-interaction processes (see Definition 6) with parameters chosen so that the processes formed clusters or mutually repulsive components, respectively. More precisely, the Boolean model used for the simulation study had centres of discs in the window 25 25, intensity of the disc centres equal to 0.4 and uniform distribution of radii on the interval (0:5;1) (see the left picture in Fig. 6). The second simulated data set was given by realisations of Quermass-interaction process with parameters q 1 = 0:65, q 2 = 1:1 and q 3 = 0 with respect to the mentioned Boolean model. Since this process produces realisations with larger area and smaller perimeter compared to the reference process, it tends to create clusters (see the middle picture in Fig. 6). Therefore, we will refer to it as the cluster process in the rest of the text. On the other hand, the third data set simulated as Quermass-interaction process with parameters q 1 = 1, q 2 = 1 and q 3 = 0 prefers smaller area and larger perimeter. Realisations are usually small non-overlapping components (see the right picture in Fig. 6) and therefore the process will be referenced as the repulsive process in the rest of the text. 189 GOTOVAC V ET AL: Dissimilarity of random sets We simulated 400 realisations for each of the mentioned processes. All realisations were transformed to matrices of 400 400 black and white pixels (an example of such a matrix for the Boolean model is shown in Fig. 3b) and these matrices played the role of the input data. To explore the sensitivity of the methodology within and between classes of processes the following approach was taken. First, the input matrices were divided into two groups of 200 realisations for Boolean model, cluster process and repulsive process, respectively, and dissimilarity was studied on these groups separately for different processes. Additionally, we considered two groups of 200 realisations of different processes, namely Boolean vs cluster, Boolean vs repulsive and cluster vs repulsive process, and studied the dissimilarity again. CHOOSING THE OPTIMAL RADIUS In order to choose optimal radii, i.e., to determine a suitable pixel difference leveld for approximations of data sets by unions of discs, we studied the behaviour of the approximations for different values of radii and their corresponding pixel difference levels. Data sets introduced in the Results section were covered by discs with identical radii using the method described in the Material and Methods section and taking the values of radii r = 3;:::;15 while the corresponding pixel differences PD(r) were calculated. Consequently the V oronoi tessellations on unions of covering discs were constructed as described in the Material and Methods section. Then we studied the quality of coverings for chosen pixel differences levels d = 10%;20%;30% by visual comparison of the shape of original and approximating sets. Moreover, we explored the shape of cells in the corresponding V oronoi tessellations as well as their quantities. For a realisation of the Boolean model, three different approximations using radii r = 4;7;9 corresponding to pixel differences levels d = 10%;20%;30%; respectively, are presented in Fig. 7. By visual comparison of the similarity of the original set and V oronoi tessellations of its approximations, we have inferred that for a pixel difference level of 10% cells of the tessellation are too small thus do not give us enough information on the inner structure of the original set while for a pixel difference level of 30% the approximation of the original data set is not good enough in description of its shape. However, for a pixel difference level of 20% covering discs approximate original set shape more appropriately and thereby the corresponding tessellation provides good information about inner structure of the original set, too. Thus, we have chosen the pixel difference leveld = 20%. It is easy to notice that the optimal radius gives us information about the structure of the underlying random set and it could be tempting to conclude that the difference between optimal radii in models could be used as the criterion for the final decision on dissimilarity. However, one should be careful in using it as the criterion for the final decision since different realisations of the same process may have different optimal radii and, moreover, since covering is random, even one realisation may result in significant differences of PD for the same radii, see Fig. 8, leading to different optimal radii. Fig. 8. The original set in resolution 43 43, its reduced form and two ways of covering by radius 10, where in the first case, we have 268 white pixels, i.e., PD(10) = 268=43 2 = 0:14, and in the second case, we have 606 white pixels so we obtain PD(10) = 606=43 2 = 0:33. NUMERICAL RESULTS OF ENVELOPE TEST As mentioned in the previous subsection, we considered 400 realisations of each of the Boolean model, cluster process and repulsive process and studied the dissimilarity between realisations of the same processes as well as between realisations of different processes. Therefore, we had six different combinations of pairs in the study. First, we had to choose the optimal radius for each of studied process. Using pixel a difference level of 20% as described in Material and Methods section we obtained from (4) the optimal radii R = 7 for the Boolean model, R = 13 for the cluster process and R= 5 for the repulsive process. Recall that these values were derived so that in (4) we considered the mean pixel difference PD(R) calculated from all 400 190 Image Anal Stereol 2016;35:181-193 Fig. 6. Examples of realisations of Boolean model with intensity of the disc centres 0.4 in the window 25 25 and 0 otherwise, and distribution of radii U(0:5;1) (left picture), Quermass-interaction process with parameters q 1 = 0:65,q 2 = 1:1 andq 3 = 0 (middle picture) and Quermass-interaction process with parametersq 1 = 1, q 2 = 1 andq 3 = 0 (right picture). Fig. 7. Approximations by covering using discs of r= 4 (left figure, corresponding to d = 10%), r= 7 (middle figure, corresponding tod = 20%) and r= 9 (right figure, corresponding tod = 30%) for the planar set and its digital approximation from Fig. 3. realisations of the given process while we obtained the following characteristics for PD (in %): radius PD mean PD sd number of realisations having PD < 20% when covering with the corresponding radius Boolean 7 19.81 1.03 252 (63%) Cluster 13 18.55 2.07 323 (81%) Repulsive 5 19.78 0.64 284 (71%) Due to different radii between processes Eq. 5 was applied leading to the following radii for different combinations: Boolean Cluster Repulsive Boolean 7 7 5 Cluster 13 5 Repulsive 5 Then we applied the envelope test introduced in the Material and Methods section to two groups of 200 realisations for each combination obtaining appropriate p-values. More precisely, for each of the six combinations and appropriate groups, realisations were covered by discs with radii derived with Eq. 5. These coverings were translated to V oronoi tessellations. Consequently, we randomly sampled 100 independent (i.e., non-neighbouring) cells of the tessellation and calculated their support functions. Finally, the envelope test as described in the Material and Methods section was applied to pairs of these support functions, both with and without previous alignment performed for C= 0:9: Histograms of corresponding p-values are shown in Fig. 9. We can observe that comparing groups of realisations of the same processes, with or without 191 GOTOVAC V ET AL: Dissimilarity of random sets alignment, p-values are approximately uniformly distributed which means that the test considers the cells to be identically distributed, in other words that the inner structure is the same so the sets are similar enough. On the other hand when assessing dissimilarity of different processes, p-values are mostly very close to zero, so the hypothesis of identical distribution of the cells is significantly often rejected, so we can conclude that the similarity measure is very low, i.e., the original sets are not similar. The effect of alignment can be best seen when observing histogram of p-values for testing Boolean vs repulsive models, where p-values of the test have significantly lower values after performing alignment (without alignment 48:5% of the p-values were less than 0.05 while 64:5% of p-values were less than 0.05 when using alignment). Since the p-values of test are already very close to zero when comparing Boolean vs cluster and cluster vs repulsive models (all less then 0.05), the effect of alignment is not so noticeable. DISCUSSION Assessing dissimilarity of random sets could have big implications in applied science but is, unfortunately, hard to achieve. Especially if theoretical results are of interest due to high complexity of some random sets. Nevertheless, some advancement can be achieved if heuristic results are considered. In this paper we have presented one of such results. We have presented a method based on covering of random sets with unions of convex compact sets and consequent application of envelope test on support functions of these approximating sets. To justify our choice we have conducted a simulation study on different types of some common models of random sets and obtained valuable results. The presented method, however, suffers from some problems. The main weakness, in our humble opinion, is the covering of the input data set by a union of discs, especially the choice of suitable radius and pixel difference level, i.e., the measure of inaccuracy of the approximation. Here we had to make several decisions based on visual comparisons and results obtained by additional simulation studies. Despite the fact that the obtained numerical results are satisfactory, this part has the potential to be improved. One of the more straightforward approaches to try to improve the current choice of free parameters is in the case were multiple realizations for processes are available. Here, a researcher could use the machine learning approach with training and test sets to derive (optimized) best parameters. Additionally, the workflow presented in this paper can be represented through modules: cover a set, from the covering of the original set derive convex compact sets representing an approximation, derive some characteristics of the approximation and use these characteristics for the dissimilarity measure. Therefore, the workflow is suitable to changes only within modules. If a better covering algorithm can be achieved or a different test devised they can easily be used instead of the original modules. One of the common problems in practice, that was not investigated as a special case in this paper, is the existence of edge-effects, i.e., the case when the observation window contains only parts of the original set and the other parts lying outside the window are not observed (an example is presented in Fig. 3a). In our simulation study we tried to avoid this problem by ignoring the cells lying on the edge of the observation window, sampling only those that were fully contained in the window. For random sets with significant parts observed on the edges appropriate adjustments should be devised. We have left this for future work. Also, in future work additional ways for testing equality of distributions of convex compact approximations might be applied and further simulation studies of the current model would be beneficial to better understand properties of the introduced method. This will, hopefully, lead us to suggest suitable modifications or convenient alternatives to the current method. ACKNOWLEDGEMENTS This work was supported by The Czech Science Foundation (GA ˇ CR), project No.13-05466P. REFERENCES Chiu SN, Stoyan D, Kendall WS, Mecke J (2013). Stochastic geometry and its applications. Chichester: John Wiley & Sons. Diggle PJ (1981). Binary mosaics and the spatial pattern of heather. Biometrics 37:531–9. Ebeida MS, Davidson AA, Patney A, Knupp PM, Mitchell SA, Owens JD (2011). Efficient maximal Poisson-disk sampling. In: ACM SIGGRAPH 2011 Papers. New York: ACM. pp. 49:1–49:12. Hall P (1985). Counting methods for inference in binary mosaics. Biometrics 41:1049–52. Helisov´ a K (2014). Modeling, statistical analyses and simulations of random items and behavior on material 192 Image Anal Stereol 2016;35:181-193 BooleanvsCluster p-values Frequency 0.00 0.01 0.02 0.03 0.04 0.05 0 50 100 150 200 withalignment withoutalignment BooleanvsRepulsive p-values Frequency 0.0 0.2 0.4 0.6 0.8 1.0 0 20 40 60 80 100 120 withalignment withoutalignment ClustervsRepulsive p-values Frequency 0.00 0.01 0.02 0.03 0.04 0.05 0 50 100 150 withalignment withoutalignment BooleanvsBoolean p-values Frequency 0.0 0.2 0.4 0.6 0.8 1.0 0 5 10 15 20 25 withalignment withoutalignment ClustervsCluster p-values Frequency 0.0 0.2 0.4 0.6 0.8 1.0 0 5 10 15 20 25 withalignment withoutalignment RepulsivevsRepulsive p-values Frequency 0.0 0.2 0.4 0.6 0.8 1.0 0 5 10 15 20 25 withalignment withoutalignment Fig. 9. Histograms of p-values for testing all pairs of simulated processes, i.e., Boolean-cluster (upper left), Boolean-repulsive (upper middle), cluster-repulsive (upper right), Boolean-Boolean (lower left), cluster-cluster (lower middle) and repulsive-repulsive (lower right); 200 realisations of each process were used. surfaces. In: TMS 2014 Suppl Proc. Hoboken: John Wiley & Sons. pp. 461–8. Hermann P, Mrkviˇ cka T, Mattfeldt T, Min´ arov´ a M, Helisov´ a K, Nicolis O, Wartner F, Stehl´ ık M (2015). Fractal and stochastic geometry inference for breast cancer: a case study with random fractal models and Quermass- interaction process. Stat Med 34:2636–61. Imai H, Iri M, Murota K (1985). V oronoi diagram in the laguerre geometry and its applications. SIAM Comp 14:93–105. James G, Witten D, Hastie T, Tibshirani R (2013). An introduction to statistical learning. New York: Springer. Kendall WS, Van Lieshout MNM, Baddeley AJ (1999). Quermass-interaction processes: Conditions for stability. Adv Appl Probab 31:315–42. Lavie M (2000). Characteristic function for random sets and convergence of sums of independent random sets. Acta Math Viet 25:87–99. Mrkviˇ cka T, Mattfeldt T (2011). Testing histological images of mammary tissues on compatibility with the Boolean model of random sets. Image Anal Stereol 30:11–8. Mrkviˇ cka T, Myllym¨ aki M, Hahn U (2015). Multiple Monte Carlo testing with applications in spatial point processes. arXiv: 1506.01646. Myllym¨ aki M, Mrkviˇ cka T, Grabarnik P, Seijo H, Hahn U (2016). Global envelope tests for spatial processes. J Roy Stat Soc B (in press). doi: 10.1111/rssb.12172 Møller J, Helisov´ a K (2008). Power diagrams and interaction processes for unions of discs. Adv Appl Probab 40:321–47. Møller J, Helisov´ a K (2010). Likelihood inference for unions of interacting discs. Scand Stat 37:365–81. Neumann M, Stanˇ ek J, Pecho OM, Holzer L, Beneˇ s V , Schmidt V (2016). Stochastic 3d modeling of complex three-phase microstructures in sofc-electrodes with completely connected phases. Comp Mat Sci 118:353– 64. Ohser J, M¨ ucklich F (2000). Statistical analysis of microstructures in materials science. Chichester: John Wiley & Sons. Pratt WK (2001). Digital image processing: PIKS Inside. 3rd Ed. New York: John Wiley & Sons. 193