https://doi.org/10.31449/inf.v44i2.3079 Informatica 44 (2020) 139–145 139 A Novel Method for Determining Research Groups from Co-Authorship Network and Scientific Fields of Authors Jan Pisanski University of Ljubljana, Faculty of Arts E-mail: Jan.Pisanski@ff.uni-lj.si, http://oddelki.ff.uni-lj.si/biblio/oddelek/osebje/pisanski.html ORCID 0000-0002-3060-111X Mark Pisanski Tomaž Pisanski (corresponding author) University of Primorska, FAMNIT E-mail: Tomaz.Pisanski@upr.si, https://en.wikipedia.org/wiki/Tomaz Pisanski, ORCID 0000-0002-1257-5376 Keywords: co-authorship network, scientific field, maximal spanning tree, induced subnetwork, pruning of networks, pathfinder network, MST, line-cut Received: March 9, 2020 Large networks not only have a large number of vertices but also have a large number of edges. Although such networks are generally sparse they are usually difficult to visualise, even locally. This paper considers the case where large weights on edges represent similarity between the corresponding end-vertices. We follow two main ideas in this paper. The first one is network pruning, that is removal of edges that makes the resulting network more manageable while keeping the main characteristic of the original network. The other idea is to partition the network vertex set in such a way that the induced connected components repre- sent groups of network elements that fit together. Furthermore, we assume that the vertices of the network are labeled by types. Here we apply our approach to co-authorship network of researchers in Slovenia in order to identify research groups, finding group leaders and the degree of interdisciplinarity of the group. For the network pruning phase we use a MST-pathfinder network and for vertex partition appropriate line- cuts. Each cluster is assigned a distribution of types. In this case, the types correspond to scientific fields, also known as research interests of authors. A measure of interdisciplinarity of research group is derived from such a distribution. Povzetek: Velika omrežja nimajo le mnogo vozlišˇ c, ampak imajo tudi mnogo povezav. ˇ Ceprav so obiˇ ca- jno taka omrežja redka, so nepregledna in jih je težko prikazati na sliki, tudi lokalno. Ta prispevek obrav- nava omrežja, pri katerih velike vrednosti uteži na povezavah pomenijo podobnost pripadajoˇ cih krajišˇ c. V prispevku sledimo dvema glavnima idejama. Prva je klešˇ cenje omrežja, to je odstranitev manj pomembnih povezav, zaradi ˇ cesar je nastalo omrežje bolj obvladljivo, hkrati pa se ohrani glavna znaˇ cilnost prvotnega omrežja. Druga ideja je razdeliti vozlišˇ ca omrežja tako, da inducirane povezane komponente predstavljajo skupine omrežnih elementov, ki se med seboj prilegajo. Poleg tega predpostavljamo, da so vozlišˇ ca omrežja oznaˇ cena s tipi. V tem prispevku uporabljamo naš pristop k omrežju soavtorstev raziskovalcev v Sloveniji z namenom identifikacije raziskovalnih skupin, iskanja voditeljev skupin in stopnje interdisciplinarnosti skupine. Za fazo klešˇ cenja omrežja uporabljamo usmerjevalno omrežje (MST-pathfinder network), za vozlišˇ cno razbitje pa ustrezne reze povezav. Vsaki skupini je dodeljena porazdelitev tipov. Mero inter- disciplinarnosti raziskovalne skupine izpeljemo iz takšne porazdelitve. V tem primeru tipi predstavljajo znanstvena podroˇ cja, oz. raziskovalne interese avtorjev. 1 Introduction In contemporary research community scientists collabo- rate within formal or informal research groups. Identify- ing such groups from data available in various bibliometric networks is an interesting challenge. In this note we pro- pose a method that uses the co-authorship network on the one hand and declared scientific field of authors that can be extracted from some bibliographic databases, on the other. We propose a theoretical model that uses a network, i.e. graph with weights on edges and labels, called types, on its vertices. We may view labels as scientific fields or sub- fields. Our approach is quite general and can be applied to any weighted network with types. In this paper we apply it to co-authorship networks. Note that scientific fields are sometimes caller research interests. The method consists of two steps. In the first step the original co-authorship network is pruned in order to reduce 140 Informatica 44 (2020) 139–145 J. Pisanski et al. the number of edges and increase the number of compo- nents, in our case producing research groups. In this step line-cuts are determined. In the second step a collection of induced monotype subnetworks is pruned by applying MST-pathfinder algorithm to further reduce the number of edges while keeping the same connectivity. Our original contribution is combination of both methods and the use of symmetric predicate in the first step; see Algorithm 3. Note that the idea of using MST, pathfinder and MST-pathfinder has been used extensively in the past in variety of contexts of bibliographic and other research[6, 8, 11, 22, 23]. This rough general approach may be refined in several different ways. We present in detail only one such refine- ment and discuss some others in the conclusion. In general, bibliographic networks are very large and allow for a vari- ety of methods for data mining [15], however in this pilot study we focus our attention on a relatively small data set. The data is restricted to Slovenian researchers and is taken from Slovenian bibliographic system SICRIS. Moreover, only researchers that are co-authors of mathematicians are considered. 2 Pruning of co-authorship network 2.1 Co-authorship network For basics in graph theory, the reader is referred to [4], for network theory, see for instance [3]. Let V be a list of authors from some bibliographic database. We say that u;v2 V are adjacent: u v, if u andv are co-authors of a common work from the corre- sponding database. Sometimes we restrict our attention to certain types of works or certain types of co-authorships. Usually only scientific works are considered and the co- authorship graph is computed from a two-mode network WA composed of pairs (w;a), works and authors for each co-authora of workw. Since binary relation is irreflex- ive 1 and symmetric it defines a simple graphG = (V; ) that we call the co-authorship graph. Let E = fuv 2 V 2 ju vg denote the set of unordered pairs of adjacent vertices ofG. Instead ofG = (V; ) we may use notation G = (V;E) to denote the same graph. The graph may be weighted where the weightsw on the edges represent the number of joint papers between the two authors. In this way a networkN = (V;E;w) is obtained. Letw(u;v) de- note this weight. Sometimes, we may consider the weight of co-authorship differently for different number n(w) of co-authors of work w. Let W (u;v) denote the collection of works co-authored byu andv. For any workw letn(w) denote the number of authors ofw. Then w(u;v) =jW (u;v)j: 1 Sometimes one may use also loops at each vertex. The weight associ- ated with a loop may depend on the method that the co-authorship graph is constructed. If it is obtained by multiplication of two-mode networks [4] it represent the number of works for a given author. In the fractional approach it may represent the total contribution of an author. Loops are removed if we follow Newman’s approach. In a fractional approach [2] the weight f(u;v) is defined as: f(u;v) = X w2W(u;v) 2 n(w) 2 : In case of Newman’s normalization the weight is: f(u;v) = X w2W(u;v) 2 n(w) (n(w) 1) A networkN is a weighted graphN = (V;E;w), where w :E!R is the weight function. In our case it is positive and the value 0 means there is no edge betweenu andv. A graph G = (V; ) is transformed into the network N = (V;E;a), wherea(u;v) = 1 for all adjacent pairs of verticesu v. The same bibliographic database can pro- duce at least three types of networks for the weight func- tionsa;w;f; defined above: 1. (V;E;a), the binary case, 2. (V;E;w) the standard case, and 3. (V;E;f), the fractional case. Algorithm 4 Prune the networkN = (V;E;w); n =jVj;m =jEj 1: Partition the edge set E into subsets E i with equal weights:E i =fe2Ejw(e) =w i g. 2: Order the parts in descending order of weightsw 1 > w 2 >:::>w k 3: foru2V do 4: S u =fug 5: F =; 6: fori = 1; 2;:::,k do 7: F i =; 8: fore =uv2E i do 9: LetS u ;S v be the corresponding sets. 10: ifS u 6=S v then 11: Appende toF i . 12: fore =uv2F i do 13: ifS u 6=S v then 14: S u =S v =S u [S v 15: ExtendF byF i 16: return subnetworkPr(N) = (V;F;w). There is another aspect that we have not considered in this paper. Namely, the weight of an edgee =uv between two authors u and v may depend also on the total num- ber of papers authored by each of the two authors. In this case we may modify the network to allow loops and define w (u;v) = w(u;v) foru6= v and letw (u) = w (u;u) denote the total number of papers having u as an author. Note that in generalw cannot be computed directly from w since we have no information about the single-authored papers. In this case the best way to computew is to mul- tiplyWA T byWA, whereWA represents a two-mode net- work work-author. The theory of two mode networks and A Novel Method for Determining Research Groups . . . Informatica 44 (2020) 139–145 141 their applications to bibliographic data can be found, for instance in [3]. 2.2 Pruning networks In the analysis of large networks, dense networks present a challenge. Usually one tends to partition the set of vertices and investigate the induced networks on such parts. In [3] one may find a variety of concepts that are useful in such analysis, e.g. cuts, islands, etc. Nevertheless, such subnet- works may be dense again and the role of particular vertices is not clearly visible. For this reason we prune the original networkN = (V;E;w) by appropriately selecting a subset of important edgesE 0 E. Ifw 0 denotes the restriction of w on E 0 , the pruned subnetwork N 0 = (V;E 0 ;w 0 ) is obtained. In case of co-authorship networks large weights indicate close collaboration between authors. When considering re- search groups one may assume strong collaboration within each group. Hence, in such a case a natural approach to pruning would be to remove all edges of lesser weights, while keeping the same connected components. A pos- sible solution is given by the well-known maximum cost spanning tree. More precisely, in case of a disconnected network the resulting graph is a maximum cost spanning forest. However, the problem with a maximum cost spanning forest is that, in case when several edges have the same weights, the forest may not be unique. We use a Kruskal- like algorithm that produces a unique pruned network. Algorithm 1 is almost identical to the MST-pathfinder algorithm of [14] and produces the pathfinder network Pn(1;n 1); for discussion and various aspects see also [19, 5, 20]. It is not hard to see, that the following is true: Theorem 1. The network N 0 (V;F;w) is uniquely deter- mined from the original network. If all weights are positive, the connected components are the same as in the original network. Theorem 2. If all weights inN(V;E;w) are distinct, Al- gorithm 1 produces the (unique) maximum cost spanning forest. On the contrary, if all weights are the same no edge is discarded. Moreover, we easily compute the running time of Algo- rithm 1. Theorem 3. The time complexity of the above algorithm is O(m logm). In fact, the time complexity is the same as for Kruskal’s algorithm[10]. The sorting and partitioning takes O(m logm) steps. There are two loops, each withO(m) steps, and the time complexity for the UNION-FIND is of lesser order. By applying this pruning method strong ties among the nodes remain visible. 3 Line-cuts For further refining the network N(V;E;w) one may choose a parameter t > 0, the threshold, or cut parame- ter and prune the edges with weights less than t. In this way the networkN t (V;E t ;w) is obtained, where E t =fe2Ejw(e) tg: The choice of parametert depends on our aims. There are several obvious goals. For instance: 1. We may choose maximal value oft that guarantees at least a prescribed number of connected components, say . 2. An alternative is to insist that all components have at most prescribed number of vertices, say . We present the basic pruning algorithm; see Algorithm 2. It produces essentially a line-cut, see for instance [3]. The only difference is that we keep isolated vertices. Algorithm 5 Prune the network N = (V;E;w), given threshold parameter t. Connected components of the re- sulting network are called line-cuts. 1: F =; 2: fore2E do 3: ifw(e) t then 4: Appende toF . 5: return subnetworkPr(N;t) = (V;F;w). In Python, Algorithm 2 can be reduced to a single state- ment: F = [e fore2E ifw(e) t] 4 Pruning networks with vertex types Let us assume we are given a finite number of types, or col- orsT , a networkN(V;E;w) and a mappingc : V ! T . The structure N(V;E;w;T;c) will be called a weighted network with vertex types. When pruning network with ver- tex types, a connected component consisting of vertices of a single type will be called monotype. Additionally, we will refer to the number of types used in a connected com- ponent as its type number. The maximum of type numbers of network components is called the type number of the network, In particular, we are interested in networks of low type number, preferably with monotype networks. Parameters of pruning may be adjusted in such a way that a monotype network is obtained. For networks with vertex types, in addition to the two goals described in Sec- tion 3, a third goal may be considered. 142 Informatica 44 (2020) 139–145 J. Pisanski et al. – One may insist that all connected components are monotype, or more general that each component has at most types (colors). The following basic algorithm (Algorithm 3) for a given network with types removes all edges that have endpoints of different types, or more generally, when they satisfy a symmetric predicateP . Algorithm 6 Prune the network with vertex types N = (V;E;w;T;c) with given threshold parametert and (sym- metric) predicate P : T 2 :! f>;?g: Connected com- ponents of the resulting network are called monotype line- cuts. 1: F =; 2: foruv =e2E do 3: ifP (c(u);c(v)) andw(e) t then 4: Appende toF . 5: return subnetworkPr(N;t;P ) = (V;F;w;T;c). As we mentioned above the predicateP usually is true if both endpoints are of the same type. However, other op- tions are possible. Namely we may have a similarity im- posed on the predicates andP signifies that two types are sufficiently similar. We need an algorithm to analyse the network with vertex types; see Algorithm 4. Using these numbers we may select different parameters and re-run this algorithm to reduce the size of the maximal component or alternatively limit the number of different components. We may also insist that all components be composed of a single type. 5 Interdisciplinarity of research groups and leaders of research groups For a given network with vertex types one may perform basic statistics on it. Namely, one may compute absolute frequencies of types on the vertex set. f(x) =jfv2Vjc(v) =xgj f i (x) =jfv2V i jc(v) =xgj or relative frequencies (x) =f(x)=n i (x) =f i (x)=n i wheren =jVj andn i =jV i j. We consider two measures: r(N) = maxf (x)jx2Tg s(N) =jsupp j =jfx2Tj (x)> 0gj and for each component: r(V i ) = maxf i (x)jx2Tg s(V i ) =jsupp i j =jfx2Tj i (x)> 0gj Both measure the diversity of research interests in a re- search group. If r(V i ) < 0:5 there is no dominant disci- pline. Ifr(V i ) = 1, the group is totally homogeneous. Algorithm 7 Analyse network with vertex types N = (V;E;w;T;c). 1: PartitionV into connected componentsV 1 ;V 2 ;:::;V k 2: Letd = maxfjV j j;j = 1; 2;:::;kg 3: forj = 1; 2;:::;k do 4: b(j) = number of different types inV j . 5: Let = maxfb(j);j = 1; 2;:::;kg 6: return number of connected components k, order of maximal connected componentd, and maximal num- ber of types in any component. One way to define a leader of a research group is to de- termine the vertex of maximal degree in the corresponding network, or even better the sum of weights of edges to the neighbouring vertices. There are two parameters that we are interested in. Letm be the number of edges of network N and letd be the maximal degree attained at vertexx. Let d 0 be the second largest degree. Thenx can be defined as a leader of the research group, while dominance is the quo- tientd=m and absolutism is defined by expression 1 d 0 =d. Note that it would be also interesting to explore the diver- sity index [21] in this context. However, we will address all of these in a future work. 6 Example The data used in our experiments was taken from COBIS- S/SICRIS [18] for the works indexed in Scopus [17]. Only papers, where at least one author was a mathematician, were considered. Co-authors that were not registered as researchers in Slovenia were not included. Scientific fields alias research interests, used in Slovenia have three levels. On Level 1 we have: 1 Science 2 Engineering 3 Medicine 4 Biotechnology 5 Social Sciences 6 Humanities 7 Interdisciplinary The next table shows division of Science on Level 2. A Novel Method for Determining Research Groups . . . Informatica 44 (2020) 139–145 143 1.01 Mathematics 1.02 Physics 1.03 Biology 1.04 Chemistry 1.05 Biochemistry and Molecular Biology 1.06 Geology 1.07 Comp. Intensive Methods and Appl. 1.08 Environment Protection 1.09 Pharmacy Figure 1: Line-cuts for threshold valuest = 0; 1; 2;::: for four different predicates, each depending on the level‘ = 0; 1; 2; 3. Each predicate depends on the interpretation of equality between two research types. Red –‘ = 0, blue – ‘ = 1, green –‘ = 2 and yellow –‘ = 3. Finally, the division of Mathematics in the Level 3 is indicated here: 1.01.01 Analysis 1.01.02 Topology 1.01.03 Numerical and Computer Mathematics 1.01.04 Algebra 1.01.05 Graph Theory 1.01.06 Probability and Statistics Level‘ may be interpreted as the length of the research interest code that is used to test equality: for ‘ = 0, the string is not used at all, for‘ = 1 only the first characters are compared, for‘ = 2, the first four characters are com- pared, while for‘ = 3 all seven characters are compared. Different levels can be associated with the suitable choice of predicateP in Algorithm 3. LetP ‘ denote the predicate applicable to level ‘. For instance, for u = 1:01:01 and v = 1:01:04P 2 (u;v) => whileP 3 (u;v) =?. Here we give an example of a pruned research group net- work. We intend to perform a thorough analysis on more complete data set elsewhere. Figures 2 and 3 depict the same research group. The network in Figure 3 is tree-like and is obtained from the Figure 2: One of several research groups determined by the line-cut fort = 8. Figure 3: The research group of Figure 2 , pruned by the MST-pathfinder network. Red – Graph Theory, Yel- low – Algebra, Blue – Numerical Mathematics, Green – Mathematics network of Figure 2 by MST-pathfinder method. Vertex colors denote vertex types: red: 1.01.05, yellow: 1.01.04, blue: 1.01.03 and green 1.01.00. In the database some re- searchers were assigned research interest at level 2, e.g 1.01 (Mathematics). For consistency, we expanded that to level three as 1.01.00. Note that the research group in Figure 3 is composed of two subgroups, one predominantly interested in graph theory and the other in algebra. There is a central triangle connecting the two subgroups. 7 Conclusion Co-authorship graphs and networks are important in the study of research structure and dynamics; see for instance [7, 9, 12, 1]. Their practical value has first been recog- nised by specialised systems, such as MathSciNet and Zb- Math; see [16, 24]. Including them in more general bibli- ographic systems such as SICRIS [18] would be beneficial for most users. Potential applications are plenty. In this paper we presented only one aspect of such applications. In a recent paper [13] a completely different application is sought, namely, organising talks at a conference in such a way that speakers with similar topics are scheduled at dif- ferent times. The data that was available to us has also authors with UNKOWN research interest. In this preliminary study we considered it as a separate research interest. It would be in- teresting to repeat the study with some flexibility and con- 144 Informatica 44 (2020) 139–145 J. Pisanski et al. sider the function:c :V !T[fUNKNOWNg. Clearly line-cuts refine the vertex partition and apply only within a component. Note that in general one could take different thresholds in different components. In case we intend to have components with given maximal size , then indeed different threshold values may be used. In or future more comprehensive work we intend to ad- dress some further extensions and applications of the MST- pathfinder method as well as some of the parameters that we have introduced. Acknowledgement We would like to thank Vladimir Batagelj for useful ad- vice and fruitful discussion. The work of both referees improved the presentation considerably and is gratefully acknowledged. Work of Jan Pisanski is supported in part by the ARRS grants P5-0361 and J5-8247, while work of Tomaž Pisanski is supported in part by the ARRS grants P1-0294,J1-7051, N1-0032, and J1-9187. References [1] T. Bartol, G. Budimir, P. Južniˇ c, K. Stopar. Mapping and classification of agriculture in Web of Science: other subject categories and research fields may bene- fit. Scientometrics, vol. 109 (2016) no. 2, pp. 979-996. https://doi.org/10.1007/s11192-016-2071-6 [2] V . Batagelj. On Fractional Approach to Analy- sis of Linked Networks, Scientometrics (2020). https://doi.org/10.1007/s11192-020-03383-y [3] V . Batagelj, P. Doreian, A. Ferligoj, N. Kejžar. Un- derstanding large temporal networks and spatial net- works: Exploration, pattern searching, visualization and network evolution, (2014) John Wiley & Sons. [4] J.A. Bondy and U.S.R. Murty. Graph theory, (2008) Graduate Texts in Mathematics, 244. Springer, New York. [5] C. Chen. Science mapping: a systematic re- view of the literature. Journal of Data and In- formation Science, vol. 2 (2017) no.2, pp1-40. https://doi.org/10.1515/jdis-2017-0006 [6] C. Chen, S. Morris. Visualizing evolving net- works: Minimum spanning trees versus pathfinder networks. In IEEE Symposium on Informa- tion Visualization 2003 (2003), pp. 67-74. https://doi.org/10.1109/INFVIS.2003.1249010 [7] A. Ferligoj et al. Scientific collaboration dynam- ics in a national scientific system. Scientomet- rics, vol. 104, (2015), no. 3, pp. 985–1012. https://doi.org/10.1007/s11192-015-1585-7 [8] M. Gallivan, M. Ahuja. . Co-authorship, ho- mophily, and scholarly influence in information systems research. Journal of the Association for Information Systems, vol. 16 (2015) no. 12, 2. https://doi.org/10.17705/1jais.00416 [9] L. Kronegger, F. Mali, A. Ferligoj, and P. Doreian. Collaboration structures in Slovenian scientific com- munities. Scientometrics, vol. 90 (2012), no.2, pp. 631–647. https://doi.org/10.1007/s11192-011-0493-8 [10] JB. Kruskal. On the shortest spanning sub- tree of a graph and the traveling salesman problem. Proceedings of the American Mathe- matical Society. vol.7 (1956) no.1, pp. 48–50. https://doi.org/10.1090/S0002-9939-1956-0078686- 7 [11] T. Jacobsen, RL. Punzalan, ML. Hedstrom. Invok- ing “collective memory”: Mapping the emergence of a concept in archival science. Archival Science, vol. 13(2013)no. 2-3, pp. 217-251. [12] S. Peˇ clin, P. Južniˇ c, R. Blagus, M ˇ C. Sajko, J. Stare. Effects of international collaboration and status of journal on impact of papers. Scien- tometrics, vol. 93 (2012) no. 3, pp. 937-948. https://doi.org/10.1007/s11192-012-0768-8 [13] J. Pisanski, T. Pisanski. The use of collaboration distance in scheduling conference talks. Informat- ica : an international journal of computing and informatics, vol. 43 (2019) no. 4, pp. 461–466, https://doi.org/10.31449/inf.v43i4.2832. [14] A. Quirin O. Cordón, V . P. Guerrero–Bote, B. Vargas– Quesada, F. Moya–Anegón. A quick MST-based al- gorithm to obtain Pathfinder networks (1;n 1), Journal of the American Society for Information Sci- ence and Technology vol. 59 (2008) no. 12, pp.1912– 1924. https://doi.org/10.1002/asi.20904 [15] J. Leskovec, A. Rajaraman, and J. Ullman. Mining of Massive Datasets (2014), Cambridge University Press. https://doi.org/10.1017/CBO9781139924801 [16] MathSciNet: https://mathscinet.ams.org/mathscinet/index.html [17] Scopus: https://www.scopus.com/home.uri [18] SICRIS: https://www.sicris.si/public/jqm/cris.aspx?lang=eng [19] A. Vavpetiˇ c, V . Batagelj, V . Podpeˇ can. An implemen- tation of the Pathfinder algorithm for sparse networks and its application on text network, In M. Bohanec (ed.),12th International Multiconference Information Society, vol. A (2009) pp. 236–239. A Novel Method for Determining Research Groups . . . Informatica 44 (2020) 139–145 145 [20] HD. White. Pathfinder networks and author cocitation analysis: A remapping of paradigmatic information scientists. Journal of the American Society for Infor- mation Science and Technology, vol. 54 (2003) no. 5, pp. 423-434. https://doi.org/10.1002/asi.10228 [21] Diversity index, Wikipedia, https://en.wikipedia.org/wiki/Diversity_index [22] H. Yang, HJ. Lee. Research trend visualiza- tion by MeSH terms from Pubmed. Interna- tional journal of environmental research and public health, vol. 15 (2018) no. 6, 1113. https://doi.org/10.3390/ijerph15061113 [23] SY . Yu. Detecting collaboration patterns among iSchools by linking scholarly communication to so- cial networking at the macro and micro levels. LI- BRES: Library and Information Science Research Electronic Journal vol. 23 (2013) no. 2, pp. 1–13. [24] zbMATH: https://zbmath.org/ 146 Informatica 44 (2020) 139–145 J. Pisanski et al.