FOLIA BIOLOGICA ET GEOLOGICA 53/1-2, 79–82, LJUBLJANA 2012

MOLEKBASE: USER FRIENDLY SYSTEM FOR STORING, FILTERING AND CONVERTING POPULATION MOLECULAR DATA

MOLEKBASE: UPORABNIKU PRIJAZEN SYSTEM ZA HRANITEV, IZBIRO IN PRETVORBO MOLEKULSKIH PODATKOV V POPULACIJSKI GENETIKI

Marjana WESTERGREN 1 & Hojka KRAIGHER 2

1 Dr., Department of Forest Physiology and Genetics, Slovenian Forestry Institute, Večna pot 2, 1000 Ljubljana, marjana.westergren@gozdis.si
2 Prof. Dr., Department of Forest Physiology and Genetics, Slovenian Forestry Institute, Večna pot 2, 1000 Ljubljana, hojka.kraigher@gozdis.si

ABSTRACT
UDC 575.17:004.4
MOLEKBASE: user friendly system for storing, filtering and converting population molecular data

Molecular experimental data for population genetics are often stored in spreadsheet programmes or as input files for computer programmes that perform population genetic analyses. Such data can often be interpreted only by the researcher who conducted the experiment, which diminishes the transparency of the whole study. Additionally, the same data may be stored at several locations, so that a change made in a single location generates inconsistencies in the dataset. A database layout in Access was developed to facilitate transparent storage of population genetic data in a single location and to simplify their use in population genetic analysis through a computer programme that filters the data and transforms them into Genepop, SpaGeDi, Structure, Baps and Convert input files. The MOLEKBASE system is freely available at http://www.gozdis.si/index.php?id=151.
Keywords: population genetics, molecular database, data filtering, data conversion

IZVLEČEK (ABSTRACT)
UDC 575.17:004.4
MOLEKBASE: user friendly system for storing, selecting and converting molecular data in population genetics

Molecular data for genetic analyses of populations are often stored as spreadsheets or as input data for the programmes used to analyse them. Such data can often be interpreted only by the researcher who carried out the experiment, which reduces the transparency of the whole study. Data stored in this way are often kept at several locations. A change to the data at one location introduces inconsistencies into the dataset. We developed a database layout in Access for transparent storage of population molecular data in a single location and prepared a programme that selects the data and converts them into the formats recognised by the population genetic analysis programmes Genepop, SpaGeDi, Structure, Baps and Convert. The newly developed MOLEKBASE system is freely available at http://www.gozdis.si/index.php?id=151.

Keywords: population genetics, molecular database, data selection, data conversion

INTRODUCTION

Population genetic analysis requires large data sets. Hundreds of individuals belonging to different populations or species are analysed at as few as five co-dominant loci in population studies of forest trees (e.g. Heuertz et al. 2003; Fernandez-Manjarres et al. 2006; Heuertz et al. 2004) and at up to 377 co-dominant loci in human population studies (Rosenberg et al. 2002). In population genetic analysis of forest trees, microsatellites and isozymes are the markers of choice and the datasets usually consist of a low to medium number of loci, e.g. five to 15. A small analysis of four populations with 50 samples in each population would therefore yield 2000 to 6000 data points for co-dominant markers (4 populations × 50 samples × 5 to 15 loci × 2 alleles per diploid locus).

Molecular experimental data for population genetics are usually stored in tables of spreadsheet programmes such as Excel or as input files for the various computer programmes that perform population genetic analyses. This can lead to the same data being stored at several locations, so that changing the data at only one location generates inconsistencies in the dataset. Additionally, such experimental data can often be interpreted only by the researcher who conducted the experiment, which diminishes the transparency of the whole study. Re-analysing the data and combining different studies is difficult, especially if some time has passed or the personnel in the laboratory have changed. To overcome these problems we have developed a database layout in Access, in which data from population genetic studies can be stored in a single place. Individuals or populations needed for a specific analysis can be filtered out and the selected data transformed into some of the most common freely available population genetic programme input formats without making changes to the original data set. The system was developed to help us manage the large datasets of population genetic data needed for the analysis of forest genetic resources, but could be useful in other fields.

MATERIALS AND METHODS

A review of population genetic studies of forest trees has shown that microsatellites and isozymes are the markers of choice for population genetic analysis of forest trees. The datasets usually consist of a low number of loci for microsatellites to a medium number of loci for isozymes.

Access was used to develop the layout of the database. The layout allows the user to add further categories (i.e. columns) as needed.
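The workflow that this layout is meant to support (one store, a filtered selection, conversion to an analysis format) can be pictured with a small sketch. The Python example below uses SQLite in place of Access and is not the MOLEKBASE code; the table and column names, the two-locus layout, the sample values and all function names are assumptions made purely for illustration. Only the output layout follows the Genepop input format: a title line, one locus name per line, a POP line before each population, and one six-digit genotype per locus when alleles are coded with three digits (000000 for missing data).

```python
import sqlite3

# Illustrative layout only: all names are assumptions, and two loci stand in
# for the up to 25 diploid loci that the MOLEKBASE layout supports.
SCHEMA = """
CREATE TABLE population (
    pop_code  TEXT PRIMARY KEY,
    latitude  REAL,
    longitude REAL,
    altitude  REAL
);
CREATE TABLE locus (
    species    TEXT,
    position   INTEGER,          -- order of the locus within the genotype
    locus_name TEXT,
    PRIMARY KEY (species, position)
);
CREATE TABLE molecular_data (
    sample_code TEXT PRIMARY KEY,
    pop_code    TEXT REFERENCES population(pop_code),
    species     TEXT,
    year        INTEGER,
    l1_a TEXT, l1_b TEXT,        -- two three-digit alleles per diploid locus
    l2_a TEXT, l2_b TEXT
);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up example records."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    # A user-defined field appended after the predefined ones.
    con.execute("ALTER TABLE population ADD COLUMN seed_stand_id TEXT")
    con.executemany(
        "INSERT INTO population VALUES (?, ?, ?, ?, ?)",
        [("SI1", 46.05, 14.48, 300.0, "stand-01"),
         ("SI2", 45.80, 15.17, 450.0, None)],
    )
    con.executemany(
        "INSERT INTO locus VALUES (?, ?, ?)",
        [("Fraxinus excelsior", 1, "LocA"), ("Fraxinus excelsior", 2, "LocB")],
    )
    con.executemany(
        "INSERT INTO molecular_data VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        [("A01", "SI1", "Fraxinus excelsior", 2010, "180", "184", "201", "201"),
         ("A02", "SI1", "Fraxinus excelsior", 2010, "182", "182", "000", "000"),
         ("B01", "SI2", "Fraxinus excelsior", 2010, "180", "180", "199", "203"),
         ("B02", "SI2", "Fraxinus excelsior", 2011, "184", "186", "201", "203")],
    )
    return con


def locus_names(con, species):
    """Return the locus names of one species in genotype order."""
    rows = con.execute(
        "SELECT locus_name FROM locus WHERE species = ? ORDER BY position",
        (species,),
    )
    return [name for (name,) in rows]


def select_samples(con, species, years):
    """Filter samples by species AND sampling year; the stored data stay untouched."""
    placeholders = ",".join("?" * len(years))
    sql = (
        "SELECT sample_code, pop_code, l1_a, l1_b, l2_a, l2_b "
        "FROM molecular_data "
        f"WHERE species = ? AND year IN ({placeholders}) "
        "ORDER BY pop_code, sample_code"
    )
    return con.execute(sql, [species, *years]).fetchall()


def to_genepop(title, loci, rows):
    """Format rows (already sorted by population) as Genepop input text."""
    lines = [title, *loci]
    current_pop = None
    for sample, pop, *alleles in rows:
        if pop != current_pop:          # start a new population block
            lines.append("POP")
            current_pop = pop
        # Join the two alleles of each locus into one six-digit genotype.
        genotypes = [a + b for a, b in zip(alleles[0::2], alleles[1::2])]
        lines.append(f"{sample} , " + " ".join(genotypes))
    return "\n".join(lines) + "\n"
```

Calling `to_genepop("demo", locus_names(con, "Fraxinus excelsior"), select_samples(con, "Fraxinus excelsior", [2010]))` on a database from `build_demo_db()` gives a file with two POP blocks. Because the selection is a read-only query, other subsets can be exported from the same stored records without creating a second copy of the data.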
The data filtering and conversion programme was written in vb.net using MS Visual Studio 2005. The MOLEKBASE system (database layout in Access, Windows executable file and the source code), including the user manual and example files, can be freely downloaded from http://www.gozdis.si/index.php?id=151.

RESULTS

Experimental data and background information in the MOLEKBASE system are stored in three tables: Molecular data, Population and Locus. The first table contains molecular data as relative sizes or codes in a three-digit format for up to 25 co-dominant loci, together with information on individual samples, such as sample code, population code, species, laboratory code and year of analysis. The second table stores information on the sampled populations: population codes, geographic location in latitude/longitude format and/or UTM coordinates, and altitude. Other fields describing individual samples and/or populations can be added after the predefined fields. For forestry purposes, these might be vitality, developmental stage, origin of populations, seed stand identifiers etc. The last table stores the names and number of loci belonging to each species and/or experiment.

Currently, the database layout supports data storage and manipulation for up to 25 co-dominant diploid loci, which, according to a survey of the literature, is sufficient for most population genetic studies in the forestry field. The database layout was primarily developed for microsatellites but can store and manipulate any co-dominant data in three-digit format.

With the help of scripts, samples of interest for a certain analysis can be selected and converted. The programme enables selection of data based on country of origin, species, population, sampling year and individuals. Boolean operators are used to combine different filters. Selected data can be transformed into five different input formats, read by the following population genetics programmes: Genepop (Raymond & Rousset 1995, Rousset 2008), SpaGeDi (Hardy & Vekemans 2002), Structure (Pritchard, Stephens & Donnelly 2000), Baps (Corander, Waldmann & Sillanpää 2003) and Convert (Glaubitz 2004).

Figure 1: Data filtering and conversion form

CONCLUSION

MOLEKBASE is a database layout in Access with an accompanying computer programme. It facilitates transparent storage of molecular data for population genetic analysis in a single location and simplifies their use by filtering selected molecular data and converting them into five different input formats.

The MOLEKBASE system, including the user manual and example files, can be freely downloaded from http://www.gozdis.si/index.php?id=151.

POVZETEK (SUMMARY)

Research in population genetics requires large amounts of data. The genetic markers used in population genetic analyses of trees are most often microsatellites or isozymes, and individual trees are analysed at a low to medium number of loci (the number of analysed loci usually ranges from five to 15, which in a small analysis of four populations with 50 samples per population amounts to between 2000 and 6000 data points). Molecular data for genetic analyses of populations are often stored as spreadsheets or as input data for the programmes used to analyse them. In most cases such data can be interpreted only by the researcher who carried out the experiment, which reduces the transparency of the whole study. Data stored in this way are often kept at several locations. A change to the data at one location introduces inconsistencies into the dataset. Re-analysis, or the combination of several studies, is as a rule difficult, especially if some time has passed since the original analysis or the laboratory personnel have changed. We therefore developed a database layout in Access for transparent storage of population molecular data in a single location, and a programme, written in MS Visual Studio 2005 vb.net, for selecting data and converting them into five different formats.

Experimental data and other information in the MOLEKBASE system are stored in three tables. The "Molecular data" table holds molecular data as three-digit codes or relative lengths for up to 25 co-dominant loci, together with the data linked to each sample or analysed individual. The "Population" table holds data relating to the analysed population, and the "Locus" table stores the names and number of analysed loci for each species and/or experiment. The system allows the user to add new fields as required. With the help of scripts, the user can select data for a given analysis on the basis of country of origin, species, population, sampling year or individuals, and convert them into five different formats recognised by the population genetic data analysis programmes Genepop, SpaGeDi, Structure, Baps and Convert.

MOLEKBASE, including the user manual and test data, is freely available at http://www.gozdis.si/index.php?id=151.

ACKNOWLEDGEMENTS

The work was supported by the Slovenian Ministry of Higher Education, Science and Technology through the Slovenian Research Agency: the Young Researchers scheme grant no. 3331-03-831659 and the research programme P4-0107.

REFERENCES

Corander, J., P. Waldmann & M.J. Sillanpää, 2003: Bayesian analysis of genetic differentiation between populations.
Genetics (Austin, Texas) 163: 367–374.
Fernandez-Manjarres, J., P. Gerard, J. Dufour, C. Raquin & N. Frascaria-Lacoste, 2006: Differential patterns of morphological and molecular hybridization between Fraxinus excelsior L. and Fraxinus angustifolia Vahl (Oleaceae) in eastern and western France. Mol Ecol (Oxford, United Kingdom) 15: 3245–3257.
Glaubitz, J.C., 2004: Convert: A user-friendly program to reformat diploid genotypic data for commonly used population genetic software packages. Mol Ecol Notes (Oxford, United Kingdom) 4 (2): 309–310.
Hardy, O.J. & X. Vekemans, 2002: SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Mol Ecol Notes (Oxford, United Kingdom) 2: 618–620.
Heuertz, M., J.F. Hausman, O.J. Hardy, G.G. Vendramin, N. Frascaria-Lacoste & X. Vekemans, 2004: Nuclear microsatellites reveal contrasting patterns of genetic structure between western and southeastern European populations of the common ash (Fraxinus excelsior L.). Evolution (Lancaster, Pennsylvania) 58 (5): 976–988.
Heuertz, M., X. Vekemans, J.F. Hausman, M. Palada & O.J. Hardy, 2003: Estimating seed vs. pollen dispersal from spatial genetic structure in the common ash. Mol Ecol (Oxford, United Kingdom) 12: 2483–2495.
Pritchard, J.K., M. Stephens & P. Donnelly, 2000: Inference of population structure using multilocus genotype data. Genetics (Austin, Texas) 155: 945–959.
Raymond, M. & F. Rousset, 1995: Genepop (version 1.2): population genetics software for exact tests and ecumenicism. J Hered (Washington, D.C.) 86 (3): 248–249.
Rosenberg, N.A., J.K. Pritchard, J.L. Weber, H.M. Cann, K.K. Kidd, L.A. Zhivotovsky & M.W. Feldman, 2002: The genetic structure of human populations. Science (New York) 298 (5602): 2381–2385.
Rousset, F., 2008: Genepop'007: A complete re-implementation of the Genepop software for Windows and Linux. Mol Ecol Resour (Oxford, United Kingdom) 8 (1): 103–106.