Informatica An International Journal of Computing and Informatics Special Issue: Demography and Informatics: Some Interconnections Guest Editors: Janez MalaCiC Janina JóZwiak Alexia Fürnkranz-Prskawetz EDITORIAL BOARDS, PUBLISHING COUNCIL Informatica is a journal primarily covering the European computer science and informatics community; scientific and educational as well as technical, commercial and industrial. Its basic aim is to enhance communications between different European structures on the basis of equal rights and international referee-ing. It publishes scientific papers accepted by at least two referees outside the author's country. In addition, it contains information about conferences, opinions, critical examinations of existing publications and news. Finally, major practical achievements and innovations in the computer and information industry are presented through commercial publications as well as through independent evaluations. Editing and refereeing are distributed. Each editor from the Editorial Board can conduct the refereeing process by appointing two new referees or referees from the Board of Referees or Editorial Board. Referees should not be from the author's country. If new referees are appointed, their names will appear in the list of referees. Each paper bears the name of the editor who appointed the referees. Each editor can propose new members for the Editorial Board or referees. Editors and referees inactive for a longer period can be automatically replaced. Changes in the Editorial Board are confirmed by the Executive Editors. The coordination necessary is made through the Executive Editors who examine the reviews, sort the accepted articles and maintain appropriate international distribution. The Executive Board is appointed by the Society Informatika. Informatica is partially supported by the Slovenian Ministry of Higher Education, Science and Technology. Each author is guaranteed to receive the reviews of his article. When accepted, publication in Informatica is guaranteed in less than one year after the Executive Editors receive the corrected version of the article. Executive Editor - Editor in Chief Anton P. Železnikar Volariceva 8, Ljubljana, Slovenia s51em@lea.hamradio.si http://lea.hamradio.si/~s51em/ Executive Associate Editor - Managing Editor Matjaž Gams, Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia Phone: +386 1 4773 900, Fax: +386 1 251 93 85 matjaz.gams@ijs.si http://dis.ijs.si/mezi/matjaz.html Executive Associate Editor - Deputy Managing Editor Mitja Luštrek, Jožef Stefan Institute mitja.lustrek@ijs.si Executive Associate Editor - Technical Editor Drago Torkar, Jožef Stefan Institute Jamova 39, 1000 Ljubljana, Slovenia Phone: +386 1 4773 900, Fax: +386 1 251 93 85 drago.torkar@ijs.si Editorial Board Juan Carlos Augusto (Argentina) Costin Badica (Romania) Vladimir Batagelj (Slovenia) Francesco Bergadano (Italy) Ranjit Biswas (India) Marco Botta (Italy) Pavel Brazdil (Portugal) Andrej Brodnik (Slovenia) Ivan Bruha (Canada) Wray Buntine (Finland) Hubert L. Dreyfus (USA) Jozo Dujmovic (USA) Johann Eder (Austria) Vladimir A. Fomichov (Russia) Maria Ganzha (Poland) Janez Grad (Slovenia) Marjan Gušev (Macedonia) Dimitris Kanellopoulos (Greece) Hiroaki Kitano (Japan) Igor Kononenko (Slovenia) Miroslav Kubat (USA) Ante Lauc (Croatia) Jadran Lenarcic (Slovenia) Huan Liu (USA) Suzana Loskovska (Macedonia) Ramon L. de Mantras (Spain) Angelo Montanari (Italy) Pavol Nävrat (Slovakia) Jerzy R. Nawrocki (Poland) Nadja Nedjah (Brasil) Franc Novak (Slovenia) Marcin Paprzycki (USA/Poland) Gert S. Pedersen (Denmark) Ivana Podnar Žarko (Croatia) Karl H. Pribram (USA) Luc De Raedt (Belgium) Dejan Rakovic (Serbia) Jean Ramaekers (Belgium) Wilhelm Rossak (Germany) Ivan Rozman (Slovenia) Sugata Sanyal (India) Walter Schempp (Germany) Johannes Schwinn (Germany) Zhongzhi Shi (China) Oliviero Stock (Italy) Robert Trappl (Austria) Terry Winograd (USA) Stefan Wrobel (Germany) Konrad Wrona (France) Xindong Wu (USA) Board of Advisors: Ivan Bratko, Marko Jagodic, Tomaž Pisanski, Stanko Strmcnik Publishing Council: Ciril Baškovic, Cene Bavec, Jožko (Ćuk, Matjan Krisper, Vladislav Rajkovic, Tatjana Welzer Editor's Introduction to the Special Issue Demography and Informatics: Some Interconnections Our modern time is a period of specialization in almost every field of human activity. The scientific activity is probably the best example of the specialization of scientists in usually very tiny spheres of investigation and research. This kind of approach to the development of the scientific knowledge has been very fruitful in practically all scientific disciplines, also including demography and informatics. However, in science and research work the cross disciplinary cooperation is also very important. From time to time it has provided a real breakthrough in many research areas. Therefore, the interdisciplinary discussions, investigations and collaborations are of crucial importance for the modern science. In this line of thinking the editors of Informatica journal have invited the three of us to edit a special issue with the title "Demography and Informatics: Some Interconnections". This invitation was an extension of the collaboration which started at the 10th International Multiconference Information Society 2007, 8th - 9th October 2007, Ljubljana, Slovenia. Between several other conferences there was also a demographic one titled "Slovenian Demographic Challenges of the 21st Century". The conference gathered professionals from different scientific disciplines to discuss recent demographic situations and problems of Slovenia. The idea of a special issue of the Informatica journal was to upgrade some contributions from the conference and to acquire some additional papers from the broader international scientific community and to produce an issue which would be devoted to (some of) the interconnections between demography and informatics. The time period was short and the number of collected papers limited. In spite of that we have succeeded to prepare this special issue. It contains six very different papers dealing with numerous demographic and broader informatics topics. Only two of them are devoted to the local Slovenian situation. In the first of these two J. Sambt and M. Čok analyze the economic consequences of the demographic pressure on the public pension system in Slovenia. In the second M. Černič Istenič and A. Kveder study the relationship between fertility decisions of different generations and developmental characteristics of urban and rural areas in Slovenia. This second paper is more sociologically oriented. The next two papers of J. Malačič and J. Bijak and D. Kupiszewska deal with broader European topics. J. Malačič analyzes recent dynamics of late fertility trends in Europe and concludes that late age - specific fertility will very likely retain more or less marginal shares of total fertility in a modern demographic regime. The paper of J. Bijak and D. Kupiszewska addresses selected computational issues of poor statistics on international migration. A range of computational methods and the algorithm for choosing the best method for estimating missing data are proposed and illustrated by examples for selected countries. The last two papers cover the topics which deal with the world perspective. M. Gams and J. Krivec use data mining techniques to discover the determinants of fertility. The decision trees are presented and the iterative use of data mining techniques employed for the finding of complex relations. Finally, S. Korenjak - Černe, N. Kejžar and V. Batagelj present the results of the clustering of the population pyramids. The big number of countries and the counties of the United States of America are examined. J. Malačič, J. Józwiak and A. Fürnkranz-Prskawetz Demographic Pressure on the Public Pension System Jože Sambt University of Ljubljana, Faculty of Economics Kardeljeva ploščad 17, SI-1000 Ljubljana, Slovenia E-mail: joze.sambt@ef.uni-lj.si Mitja Čok University of Ljubljana, Faculty of Economics Kardeljeva ploščad 17, SI-1000 Ljubljana, Slovenia E-mail: mitja.cok@ef.uni-lj.si Keywords: models, projections, Slovenia, pensions, intergenerational relations Received: March 15, 2007 The combination of fertility, decreasing mortality and the baby-boom generation entering retirement will dramatically increase the share of elderly people in Slovenia in future decades. Without further changes in the pension system this will bring about strong pressure on the public pension system. In the analysis we use a cohort-based model to project the share of public expenditure on pensions in gross domestic product. This model enables us to analyse the long-term effects of the forthcoming demographic changes in connection with the current public pension system. The projected rise in pension expenditure will have to be mitigated at some point in the future and reducing pension benefits is one of the options. The Slovenian pension legislation provides equity among pensioners who retire at different points in time. An equal reduction ofpension benefits suggests an equal distribution of burdens arising from the ageing population. However, the model reveals very different effects of this measure in relation to different cohorts. The analysis tackles increasingly relevant topics of intergenerational relations and questions the distribution of fiscal burdens and benefit among cohorts and generations. Povzetek: Članek predstavi predviden prihodnji pritisk spremenjene starostne strukture prebivalstva Slovenije na javnofinančni pokojninski sistem in učinke njegove blažitve na posamezne kohorte. generations. The current pension legislation provides 1 Introduction horizontal equity between existing and new pensioners. We argue that this seemingly fair arrangement is only For decades the research community has warned the one possibility which brings about a different impact on public via demographic projections concerning different cohorts when introducing the time dimension to forthcoming radical demographic changes. However, this the analysis, as our estimations created by the cohort- did not actually receive much general attention until based model reveal developments started to influence current generations In Section 2, the latest demographic projections for and caused problems associated with the long-term Slovenia are given; presenting forthcoming demographic sustainability of the public finance system. Resolving changes. In Section 3 the cohort-based model used in the population-ageing pressures on the public finance system analysis is explained. Section 4 includes projections of means elevating the tax burden or cutting benefits to public pension expenditures in Slovenia expressed as a individuals. Of course, these measures do not appeal to share of GDP. Section 5 presents the effect of limited the public and politicians are trying to delay them as long pension spending on different cohorts. The conclusions as possible. Lately, this is hardly possible any more and are given in Section 6 population ageing is becoming one of the central issues facing the European Union and many other institutions and countries around the globe. 2 Future demographic development The pressure on public expenditure stems in Slovenia predominantly from three systems: health care, long-term The Slovenian population belongs to the modern care and the pension system. In the paper we concentrate demographic regime with low levels of fertility and on public pension expenditure. We present projections of mortality. In 1981 the total fertility rate ('TFR') dropped this expenditure in the future. It is unlikely that Slovenia's present public finance system can absorb such - a large increase in pension expenditure. An adjustment in The total fertility rate is the average number of children that a woman the direction of a sustainable path raises questions about gives birth to i" her llfetlme, assuming: 1) that the prevaiiing rates distributing burdens over different cohorts and and 22) she w111 survive from birth through to the end of her reproductive life. below 2.1, which represents the replacement fertility rate for developed countries. Since then, the TFR has been continuingly falling and in the last few years it has stabilized at a level somewhat above 1.2. Since 1960 mortality has also been declining. Life expectancy at birth increased in the 1958/59 - 2005/06 period from 65.6 to 74.8 years for males and from 70.7 to 81.9 years for females. During the 1960s Slovenia transformed from a traditional emigration country to an immigration destination. The most important here was the Balkan South-East to North-West immigration stream. In the 1970-1990 period, all net migration flows between Slovenia and other federal parts of Yugoslavia were positive for the then north-west developed Yugoslav republic [11]. Since 1990 this pattern has not changed in spite of the several new state borders which have emerged after the breakdown of Yugoslavia. In the last decade the net migration has amounted to 2,000 to 3,000 people per year, with higher values being seen in the last two years (6,400 in 2005 and 6,200 in 2006). These trends of fertility, mortality and net migration formed the basis of the Eurostat demographic projections [6] published in 2005. Figure 1 to Figure 3 present the assumptions about fertility, mortality and migrations on which those projections are based. Life expectancy at birth (eo) S < 2.00 1.90 1.80 1.70 1.60 1.50 1.40 1.30 1.20 1.10 1.00^1 Total fertility rate (Tf) ■ft 80 ■a E - High variant -Medium variant- - Low variant 78--------------, Year Figure 2: Demographic assumption: mortality Net migrations Year Figure 1: Demographic assumption: fertility The results for the low, medium and high variants of the demographic projections are summarised in Table 1. According to the medium variant, the size of the population decreases by about 100,000 inhabitants by 2050. Because of an assumed substantial positive net migration, a fertility increase (compared to the current level) and increased longevity the projected drop in the total population is only moderate. However, the share of elderly people (aged 65 years and over) is projected to double in the period up until 2050: from the current 16% to 31%. Note: A dashed line denotes values for females while a full line denotes values for males. Figure 3: Demographic assumption: migration The high (low) variant projects much higher (lower) number of inhabitants since it combines optimistic (pessimistic) assumptions regarding all three dimensions: fertility, mortality and net migration. Despite the big differences in those two variants compared to the medium variant, the share of people aged 65 years and over is very similar. To achieve further information, two additional variants are simulated by rearranging the assumptions relating to fertility, mortality and net migration to obtain a range of extremes regarding the share of elderly people (65 years and over). In the 'favourable'2 variant we combine fertility and net migrations from the high variant with the mortality from the low variant, while in the 'unfavourable' variant we combine fertility and net migrations from the low variant and mortality from the high variant. However, even with the very optimistic combination of assumptions the projected share of people aged 65 years and over increases from the current 16% to 24% by 2050, while the pessimistic combination of assumptions even yields an increase to 38%. 90 82 72 1.1.2005* 1. 1. 2010 1. 1.2020 1.1.2030 1. 1. 2040 1.1.2050 Low variant 1,997,590 1,963,853 1,890,415 1,801,674 1,663,014 1,490,760 Aged 65 years and over (%) 15.3 16.7 20.8 25.8 29.4 32.7 Medium variant 1,997,590 2,014,802 2,016,694 2,005,999 1,965,314 1,900,839 Aged 65 years and over (%) 15.3 16.5 20.4 25.1 28.4 31.1 High variant 1,997,590 2,069,175 2,170,058 2,271,619 2,383,601 2,520,801 Aged 65 years and over (%) 15.3 16.3 19.7 23.8 26.5 28.0 'Favourable' variant 1,997,590 2,063,048 2,143,299 2,216,623 2,276,009 2,355,838 Aged 65 years and over (%) 15.3 16.1 19.0 22.4 23.7 23.9 'Unfavourable' variant 1,997,590 1,969,808 1,916,117 1,853,220 1,760,870 1,634,846 Aged 65 years and over (%) 15.3 16.9 21.6 27.4 32.6 37.7 Actual number of inhabitants. Table 1: Eurostat's demographic projections for Slovenia, published in 2005 Besides these assumptions, the age structure of the population also affects the results. The large baby-boom generations born after World War II are approaching their retirement. In the next decade these people are going to shift from employment to retirement status, rapidly expanding the share of the elderly population. These processes are also reflected in Figure 4 where a population pyramid3 graphically represents the projected demographic development of Slovenia. The pyramid in solid colour is for the year 2020, while the shades are outlines presenting demographic developments in the years 2005 to 2050. Shading in the lower age groups depicts the number of people in those age groups in the time period 2005-2019, while the shading in the higher age groups represents the number of people in those older cohorts for the projected period 2021-2050. The figure presents an intermediate stage (the situation in 2020) i.e. the 'emptying' of the number of people in lower age groups and the 'filling in' of higher age groups during the period of the projections. Europe and many other countries around the world, especially developed ones, are also facing a similar process of population ageing so Slovenia is no exception in this regard. However, longevity in Slovenia is increasing relatively rapidly compared with other developed countries and fertility is among the lowest in the world and therefore the process is especially intensified. 3 Cohort-based model The analysis used in this section derives from a cohort-based model which simulates pension expenditures for different cohorts. It is based on a pension profiles matrix, population matrix and a coefficient matrix. The pension profile matrix includes average pensions by age. It builds on the situation from the base year (2006). 3 The population pyramid is a graphical presentation of the population age structure in a presented year. On the vertical axis are age groups and on the horizontal axis is the number (sometimes the share) of the population (males on the left-hand side and females on the right-hand side) by those age groups. Figure 4: Population pyramid for projected Slovenian population in 2020 The population matrix is based on the Eurostat demographic projections presented earlier. Where a longer time span is required, we extend the medium variant of the projections. We thereby use the same data set and software as Eurostat, holding demographic assumptions at the level for 2050. The coefficient matrix summarises the effects of the Pension and Disability Insurance Act introduced in 1999 (PDIA-1999) and gradually coming into effect after 2000. The PDIA-1999 will thus be fully effective in 2024. The transition period is taken into account along with further changes to the pension legislation from 2005. With detailed data about individuals retiring before introducing the PDIA-1999 we simulated the retirement behaviour, wage level etc. - amidst the new conditions. Technically, the matrices have age (a) in their rows and calendar years (t) in their columns. The matrix of pension profiles (PROF) has the pension levels in its cells; the population matrix (P) has the number of people in its cells; and the coefficients matrix (C) contains the coefficients of adjustments. Pensions paid to individuals aged k in year t are thus calculated as: PENS , = PROF tP C Gt a,t a,t a,t a,t t (Eq. 1) where G contains coefficients of the cumulative growth of wages from the base year (in our case 2006) to * time t. Namely, according to the Slovenian pension legislation the growth of pensions is fully indexed in line with the growth of wages (but in practice in the period up until 2024 pensions will grow more slowly due to certain provisions of the pension legislation which are captured by the coefficient matrix (C)). Pension expenditures in year t are calculated as the sum of projected pension expenditures by all age groups: PENSt =£ PENS^t, (Eq. 2) where index a runs from 0 to D; with D denoting the maximum length of life (in our model it is the age group 100+). This pension module is linked to the GDP module. Pension expenditures are namely expressed as a share of GDP. GDP growth is calculated as the sum of the labour productivity growth rate and the labour input growth rate. Further, the labour input growth is defined as the growth of employees in the 16-64 years age group. The same procedure is also used by the European Commission (Ageing Working Group) when projecting public expenditures related to the ageing population [5]. In our model various demographic projections thus affect public pension expenditures expressed as a share of GDP through the pension expenditures and GDP. Labour productivity growth enters into the calculations exogenously, neglecting any possible dependence on the number and age structure of the population (employees). In our analysis we are interested in the effects on pension benefits of currently living cohorts on the assumption that the government caps pension spending at some point. Depending on the chosen ceiling level (cap) the extent of the pension cuts differs. Thus we calculate the reduction in pension benefits that representatives of different cohorts will receive in their remaining lifetime, i.e. the reduction of their pension wealth. For a detailed explanation of the pension wealth definition and empirical results, see, for example [2] and [7]. Technically, pension wealth is obtained by performing a diagonal aggregation of the expected pension benefits in the future, discounted back to the base year t0 (in our case 2006). PENSW PENS,,,^(1 + r ) - (Eq. 3) of 1992 (PDIA-1992) introduced a gradual increase in the retirement age and some other measures to cope with rapidly growing pension expenditures. In 1999 the share of pensions in GDP was 13.4% but the projections simulated a sharp increase in the future if no further measures are introduced. In 1999 a new Pension and Disability Insurance Act (PDIA-1999) was adopted. It tightens retirement conditions and decreases benefits deriving from the mandatory pension system (for details, see [3] and [12]). The effects of this pension reform have been analysed by several researchers using different models, assumptions and partial simulations of the complex Slovenian pension system. However, all of them concluded that, despite the positive effects of the pension reform starting in 2000, further measures will be required in the future to maintain the system's long-term fiscal sustainability (see, for example, [3], [4], [13], [15] and [16]). In the analysis we present results of the projections stemming from the cohort-based model, presented earlier in the text. Apart from the methodology and assumptions described earlier, we applied assumptions about macroeconomic variables (like productivity growth, activity rates etc.) provided by the European Commission [5]. For linking employment rates with the retirement rates the sub-model of the Institute of Macroeconomic Analysis and Development [10] is used. Without going into further details about the assumptions and calculations in Figure 5 we present the results by different demographic variants. In the analysis we excluded some categories of pension expenditure which predominantly or exclusively have a social function (e.g., state pensions). Ph ^^21 19 -- • 'Unavourable' variant • Low variant • Medium variant • High variant 'Favourable' variant ! 17 ^ 15- ^ 13Ì----------- 3 ^^ 11 -9 (N (N m m 2222222222 2222 Year Figure 5: Projections of pension expenditures in GDP 4 Projecting pension expenditures Slovenia inherited a PAYG system from former Yugoslavia after gaining its independence in 1991. The transition to a market economy and loss of markets in other Yugoslav republics caused high unemployment and other labour-market problems. Mass early retirements in the early 1990s was used to mitigate them. Consequently, the share of pensions in GDP rose from 9.6% in 1989 to 14.4% in 1994. The Pension and Disability Insurance Act It is projected that, without further measures in the next decade, demographic pressure of increased longevity and low fertility, further enhanced by babyboom generations entering retirement is transmitted into public pension system. 5 Distributing the fiscal burden One of the cornerstones of the Slovenian pension legislation is the principle of equal benefits for a=0 i-a individuals with the same pension conditions, regardless of the time they retired. The first item in Article 151 of the latest PDIA (a4djusted in 2005) explicitly states that upon the February4 adjustment of the growth of pensions in line with the growth of wages an adjustment for existing pensioners relative to new pensioners is also taking place to assure equal rights for pensioners, who have retired at different time points'. That is to say, someone who will retire 10 years from now will have for the same retiring conditions the same net replacement rate5 as someone who is already retired. According to the current pension legislation, the replacement rate for people entering retirement is decreasing up until 2024, but the pension growth of existing pensioners will also lag behind the growth of wages to keep pace with the conditions for new pensioners. This arrangement suggests fairness in the light of growing questions about the positions of different generations to engage in the problems (or challenges if we employ the word used by politicians) of an ageing population. If we ignore payments and benefits into/from the public system that individuals faced in the past, it seems reasonable and fair to distribute future burdens equally across all generations. In the rest of the paper we contrast this view with the results of the cohort-based model. With the model for each cohort we follow all taxes/benefits that it will pay/receive to/from the public finance system. In this paper we concentrate on public pension benefits only. We calculate pension wealth by cohorts by discounting projected future pensions to the base year, which in our case is 2006. The present value of future pensions is very sensitive to the assumed discount rate. A 5% discount rate was used. This value, for example, is also used as a default value in the generational accounting method for discounting future flows to the base year. However, since we do not analyse absolute values this effect is much smaller as we analyse the relative position regarding the present value of future pension benefits (by cohorts). The range from 2 to 7 percent has been tested without having a significant effect on the results and without altering the conclusions of the analysis. Estimating the effects of the pension legislation on an individual's pension benefits is undertaken by following the parameters of the pension system and the life expectancy of the individual. For an individual with full pension conditions the scale of the projected net replacement rate is presented in Figure 6 - until 2008 there are actual values, thereafter followed by projected values. On the other hand, a calculation at the cohort level has to take into consideration the heterogeneity of the cohort in terms of service years, the future mortality of the cohort members etc. The growth of pensions is adjusted in line with the growth of wages in February and in November. The net replacement rate for pensions is defined as a person's net pension divided by their net wage before retirement. This includes an assumption about non-extreme high or low values which are limited by maximum and minimum values etc. Figure 6: Net replacement rate (in %) for an individual with full pension conditions We believe that the government will not allow an increase in pension expenditures as a share of GDP to the levels presented in Section 4. In the analysis we thus assume that at a certain point the government will limit any further rise in public pension expenditure. This could be done in various ways, among which we analyse the option of cutting pensions. We set the tolerated maximum share of pensions in GDP, alternatively, at rates of 11, 12, 13, 14 and 15 percent; i.e. we assume that after reaching this 'tolerated' maximum the government would cut all pensions simultaneously in order to achieve this goal. We concentrate only on medium variant of demographic projections. )100 X 2 t^ ä ^ g H o S g u q: .ts S Q Ü ;3 O © li J= .iS :s 'iü > s (S U e e J= 95 90 80 -- 15% share of GDP 14% share of GDP 13% share of GDP 12% share of GDP 11% share of GDP 949494949494949 (N r^ r^ 05 \o \o Age 88990 Figure 7: Reduction of pensions by age when limiting pension expenditure Figure 7 compares: 1) the present values of pensions that representatives of a certain age group would receive in their remaining lifetime when limiting pension expenditure with 2) the case without limitations. Limiting share of pensions in GDP to: Age 15% 14% 13% 12% 11% 20-24 94.7 89.4 83.5 77.1 70.7 25-29 95.0 89.8 83.8 77.5 71.1 30-34 95.9 91.4 85.8 79.5 73.0 35-39 97.2 93.8 89.2 83.3 76.8 40-44 98.2 96.0 92.6 87.8 81.5 45-49 99.1 97.8 95.5 92.0 86.8 50-54 99.6 98.9 97.6 95.1 91.1 55-59 99.9 99.6 98.8 97.2 94.3 60-64 100.0 99.9 99.5 98.6 96.6 65-69 100.0 100.0 99.8 99.4 98.0 70-74 100.0 100.0 100.0 99.8 99.0 75-79 100.0 100.0 100.0 99.9 99.6 80-84 100.0 100.0 100.0 100.0 99.9 85-89 100.0 100.0 100.0 100.0 100.0 90-94 100.0 100.0 100.0 100.0 100.0 95-99 100.0 100.0 100.0 100.0 100.0 100+ 100.0 100.0 100.0 100.0 100.0 Table 2: Reduction of discounted pension benefits by age when limiting pension expenditure Figure 7 and Table 2 should be read as follows: if the government were to limit pension expenditures in GDP to 15%, then a representative of the 20-24 years age group (i.e. being on average 22.5 years old) would receive in their remaining life time 94.7% of the amount of pension benefits (discounted to the base year) compared to their remaining lifetime had the government not posed such limitations. This cut would thus reduce the discounted value of expected pension benefits for a person aged 22.5 years by 5.3%. However, the same measure would reduce the discounted value of expected pension benefits of someone aged 52.5 years by just 0.4%. The key factor driving this part of the results is the timing of a cut in pension benefits. According to the projections, the share of public pension expenditure exceeds the 15% limit in 2039. A cut in pensions would follow thereafter; therefore the 50-54 years cohort would be only slightly affected. It would namely collect pension benefits until then at an unchanged rate and only a few of them would still be living to collect benefits at the reduced rate. Further, people aged 80-84 years in 2006, for example, would not be affected at all since according to demographic projections none of them will still be alive in 2039. On the contrary, those aged 20-24 years in the base year would receive reduced pensions for their entire period when retired. If the government were to decide on a much tougher limitation of public pension expenditure - e.g. to 11% of GDP, the effect would be much greater. The cutting of pensions would have to start already in 2018 so practically all cohorts except those aged 85 years and more would be affected. But the magnitude of the reduction for different cohorts would be very different. For those aged 70-74 years this measure would reduce the discounted value of their pension benefits collected in their remaining lifetime by only 1%, while for those aged 20-24 years the reduction would be 29.3%. This can be explained by virtue of the fact that at the beginning only minor pension reductions would be required to stay within the 11% limit. This cohort would thus not be strongly affected by this measure. On the other hand, when cohorts currently aged 20-24 years collect pension benefits, a strong cut of pensions will be required to stay within the 11% limit. The results of the analysis reveal that the timing of measures for mitigating the pressure of an ageing population on pension expenditures decisively determines the distribution of burdens across different cohorts. It is evident that pensioners and people approaching retirement will prefer delaying measures in the form of cutting pensions as long as possible. Ideally for them, they should not be implemented while they are still alive. On the contrary, younger cohort/generations would prefer (or at least they should) prompt actions to distribute the burden over all generations instead of only turning the burden on to them. These opposite aspirations are confronted in the political field since decisions are made by politicians who are elected by people with a right to vote. Positions in this intergenerational 'battle' are thus very unequal. Children do not have voting power; nor do generations that have still to be born, which is especially emphasised by the method of generational accounts (see, for example, [1]), have representatives in these 'negotiations'. On the contrary, there is a rapidly growing number of older people who have voting power and participate at elections over-proportionally (compared to those aged 18-30 years, for example) and who have very clear and unified criteria - the level of benefits they expect to receive from the government. 'In democracies, one-issue voters have a disproportionate impact on the political process, since they don't split their votes because of conflicting interests on other issues' [14]. Some authors see this as enormous issue in the future, employing expressions like 'war between generations' [8] and the 'coming generational storm' [9], while some of them even see this as a threat to democracy in the future [14]. 6 Conclusions According to population projections published by Eurostat in 2005, drastic demographic changes are forthcoming. The share of elderly people aged 65 years and over is expected to about double from the current 16% to about 31% by 2050 in Slovenia. Other European countries and many other countries around the world face the same process of rapid population ageing. In Slovenia it is especially emphasised because of the still rapidly increasing longevity and the very low fertility which is among the lowest in the world. This strong demographic pressure will effect public systems, especially the public pension system which is the focus of this article. It includes simulations of future public pension expenditures as a share of GDP using the cohort-based model. The effects of the pension reform passed in 1999 are expected to almost neutralise demographic pressure during the next decade. Thereafter, the share would increase rapidly from about 11% to almost 17% in 2050 if no further measures are implemented. Our results are in line with earlier results using different models. However, the message is not to postpone measures for a decade. On the contrary, the results speak in favour of acting in a timely fashion since the necessary measures will have to be more drastic if they are delayed. Further, as the analysis reveals there is a huge difference in distribution of the burden across cohorts depending on when the measures are implemented. In the pension legislation the proclaimed equity of replacement rates for pensioners retiring at different times thus does not mean equal burdens on all cohorts and generations. Younger generations would prefer immediate action while older generations would benefit from delaying them for as long as possible. In this 'conflict' the older generations are in a much better position since they have voting power, they over-proportionally participate at elections and their political preferences are much more straightforward and therefore more powerful. 7 References [1] Auerbach, A. J., Gokhale, J., Kotlikoff, L. J. (1991). A Meaningful Alternative to Deficit Accounting. Working Paper no. 3589. National Bureau of Economic Research. [2] Brugiavini, K., Maser, A., Sundén, A. (2005). Measuring Pension Wealth. LWS Workshop "Construction and Usage of Comparable Microdata on Wealth: The LWS". Banca d'Italia, Perugia, Italy, 27-29 (January 2005). [3] Čok, M., Sambt, J., Berk, A., Košak, M. (2008). Long-Term Sustainability of the Slovenian Pension System. Economic and Business Review, vol. 10, no. 4., forthcoming. University of Ljubljana, Faculty of Economics. [4] European Commission (2006). The Impact of Ageing on Public Expenditure: Projections for the EU25 Member States on Pensions, Health Care, Long Term Care, Education and Unemployment Transfers (2004-2050). European Economy, Special report no. 1/2006. Office for Official Publications of the European Communities. [5] European Commission (2006). The 2005 Projections of Age-Related Expenditure (2004-50) for the EU-25 Member States: Underlying Assumptions and Projection Methodologies. European Economy, Special report no. 4/2005. Office for Official Publications of the European Communities. [6] Eurostat: Population projections. http://epp.eurostat.ec.europa.eu/portal/page7_pageid =1996,45323734&_dad=portal&_schema=PORTAL &screen=welcomeref&open=/popul/popula/proj/proj _trend/proj_tbp&language=en&product=EU_MAIN _TREE&root=EU_MAIN_TREE&scrollto=358 [7] Feldstein, M. (1974). Social Security, Induced Retirement, and Aggregate Capital Accumulation. Journal of Political Economy, vol. 82, no. 5, 905926. The University of Chicago Press. [8] Gokhale, J., Kotlikoff, J. L. (2001). Is War Between Generations Inevitable? NCPA Policy Report no. 246. National Center for Policy Analysis. [9] Kotlikoff, J. L., Burns, S. (2004). The Coming Generational Storm: What You Need to Know about America's Economic Future. MIT Press. [10] Kraigher, T. (2005). Middle- and log term projection of demographic development of Slovenia and its socio-economic components (in Slovene). Working Paper no. 10/2005, pp. 78. Insitute of Macroeconomic Analysis and Development. [11] Malačič, J. (2000). The Balkan migration stream South-East to North-West. Studi emigrazione, vol. 37, no. 139, pp. 581-594. Centro Studi Emigrazione Roma. [12] Pension and Disability Insurance Act (PDIA-1999), revised in 2005. http://www.uradni-list.si/_pdf/2005/Ur/u2005104.pdf [13] Sambt, J. (2004). Generational Accounts for Slovenia (in Slovene). Master's thesis. University of Ljubljana, Faculty of Economics. [14] Thurow, C. L. (1999). Generational Equity and the Birth of a Revolutionary Class. In the book: Williamson, B. J., Watts, M. D., Kingson, R. E.: The Generational Equity Debate. Columbia University Press. [15] Verbič, M. (2007). Varying the Parameters of the Slovenian Pension System: an Analysis with an Overlapping-Generations General Equilibrium Model. Post-Communist Economies, vol. 19, no. 4, pp. 449-470. Routledge. [16] Verbič, M., Majcen B., Van Nieuwkoop, R. (2006). Sustainability of the Slovenian Pension System: An Analysis with an Overlapping-Generations General Equilibrium Model. Eastern European Economics, vol. 44, no. 4, pp. 60-81. M.E. Sharpe Inc. Urban-Rural Life Setting as the Explanatory Factor of Differences in Fertility Behaviour in Slovenia Majda Černič Istenič Sociomedical Institute at Scientific Research Centre of Slovenian Academy of Sciences and Arts, Novi trg 2, 1000 Ljubljana, Slovenia University of Ljubljana, Biotechnical Faculty, Agronomy Department, Jamnikarjeva 101, 1000 Ljubljana, Slovenia, E-mail: majdaci@zrc-sazu.si Andrej Kveder Population Activities Unit, UNECE, Palais des Nations, CH-1211 Geneva 10, E-mail: andrej.kveder@unece.org Keywords: survey data, statistical data, fertility behaviour, typology of urban-rural settings, Slovenia Received: March 31, 2008 Contemporary research of fertility behaviour that considers simultaneous longitudinal and prospective inclusion of individual and contextual levels of observation increases its explanatory power of explanation. The only research data of the kind in Slovenian is Fertility behaviour of Slovenians which was a part of international project the Family and Fertility Survey (FFS), carried out in 1995. On the basis of this survey data combined with statistical census data in the form of socio-economic typology of Slovenian countryside the article aims to explain the tendency and reasons of still persistent difference in urban-rural fertility. The study of relationship between fertility decisions of different generations and developmental characteristics of rural and urban areas in Slovenia reveals that this relationship is a complex and dynamic one. Obtained research results call for diversified actions of population policies in Slovenia. In to its field of actions not merely family and social policy should be integrated, but also space and regional measures that will consider different every day's life conditions and needs of people, living in particular space setting, and in this way assist them to fulfil their desired number of children. Povzetek: Narejena je analiza slovenske rodnosti na podeželju in v mestih. 1 Introduction According to Notestein's theory of demographic 1986). In spite of relatively great variety in the beginning transition (1953) a high-level fertility regime of pre- of demographic transition and the level of urbanisation modern societies was replaced with a low-level one causal relations between these two phenomena certainly owing to the process of modernisation and its existed. Research on social-group forerunners of fertility accompanying processes: industrialisation and control in Europe (Livi-Bacci, 1986) proved that urban- urbanisation. As supposed, the growth of big city rural difference in fertility existed even prior to the onset agglomerations and mobile urban population disrupted of general decline in fertility. Some groups within the the strength of traditional norms and commitments to a city population (mainly elites: aristocracies and traditional way of life and stimulated individualism and bourgeoisie) were practising effective family limitation affirmation of personal aspirations. However, as later which influenced the overall urban fertility level. Other research revealed (e.g. "The European Fertility Project" reasons for variation in levels of fertility between urban (Coale, Watkins, 1986)), the casual relationship between and rural populations before the demographic transition urbanisation and the beginning of demographic transition are also low levels of nuptiality in the cities due to high was not always entirely direct and unproblematic. For concentration of servants (it was not easy for them to get instance, the decline in fertility in France "began in the married) (Sharlin, 1986). Urban-rural difference in late eighteenth century, long before the appearance of the natural fertility was also affected with factors such as modern city, while the decline in fertility in England only infant mortality (higher in the cities) and breast-feeding got underway decades after cities like Birmingham and (more widespread in the country) (van de Walle, 1986). Manchester become grimy industrial centres" (Sharlin, After the completion of the transition from high fertility to the low one (enforcement of modern demographic behaviour), urban marital fertility remains lower than rural marital fertility. The differences declined in size, but nonetheless continued to exist in most of the countries that experienced demographic transition (Andorka, 1978). As some recent research (Černič Istenič, Kveder, 2000, Černič Istenič, Kveder, Obersnel Kveder, 2000) indicates, this 'state of affairs' also holds true for Slovenia: today all social strata that live in the cities still have fewer children than those who live in the countryside despite the fact that the difference in the level of fertility among various social strata is indeed diminishing (Javornik, 1999) due to the predominance of two children per family norm (Malačič, 1990, Černič Istenič 1994, Obersnel Kveder et al., 2001). To explicate this issue more precisely in the present article we explore the relationship between fertility decisions of different generations and developmental characteristics of rural and urban areas in Slovenia. On the basis of individual survey data combined with statistical census data in the form of socio-economic typology of Slovenian countryside we intend to discover the tendency of this still persistent difference in urban-rural fertility and its reasons. In this vein, the main traits of Slovenian urbanisation and deagrarisation process over the last 150 years are firstly briefly sketched. Secondly, the main analytical frame of our analysis is outlined. It follows the pertinent theoretical observations which indicate that due to urbanization social, economic and political differentiation firstly increased elsewhere, but later on, a kind of homogenisation in social behaviour of the population took place due to further impacts of urban life patterns on the countryside. In the third part the main explanations of the selected data and applied methods are presented. In the forth section, the results are presented according to the outlined hypothesis and in the final part they are discussed and conclusions are outlined. 1.1 Urbanisation and deagrarisation of Slovenia The beginning of urbanisation in Slovenia was relatively late. A growth of cities started only after 1848. In that time Ljubljana, the capital and the biggest city of today's Slovenia, had just 17000 residents. Urban-rural difference in fertility level existed before, during and after the demographic transition in Slovenia. The earliest available data for the level of fertility in Slovenia that make the comparison between urban and rural fertility possible pertain to generations born during 1873-1877 and during 1898-1902. This comparison show that urban women had from 34 to 47 percentage points lower fertility than countryside women did (Šircelj, 1991). The beginning of demographic transition in Slovenia started only in the first quarter of the twenty-century. However, fertility in the cities already began to decline towards the end of the nineteenth-century. Firstly it started to decline in larger cities and then spread onto the smaller ones. At the break of the twentieth century fertility in cities declined for about 40 percentage points and for about 20 percentage points in the countryside (Šircelj, 1991). Slovenian society was still weakly structured in the middle of the nineteenth century. The main occupation was farming. Specifically, in 1868 the share of farm population was 81,4 percent. A the break of the 20'h century it decreased to 73,2 percent (Klemenčič 2002). The difference between areas with the highest and the lowest share of farm population was 13 percentage points. In 1931 this difference reached 50 percentage points. The share of farm population decreased especially in the neighbourhoods of Ljubljana. At that time no association between the share of farm population and the level of fertility was observed. It appeared only after the Second World War. "Up to that time the level of fertility was much more influenced by the type of settlement where a woman lived than by the share of farm population" (Šircelj, 1991: 244). Since the Second World War farm women have the highest fertility and today they are the only socio-professional group who assures itself biological reproduction (Šircelj, Ilič, Kuhar, Zupančič, 1990, Šircelj 2007). However, fertility of farm women has actually no effect on the national fertility level due to its very low share compared to the whole population1. In the period 1931-2002 the share of farm population declined from 60 to 6,5 percent, the most rapidly in the period 1948-1981; from 48,9 to 9,4 percent (Klemenčič 2002, Statistical Yearbook 2003). At the same time, the age structure of agrarian countryside deteriorated significantly. Young generation was/is immigrating into the cities whereas older generations remain in the countryside. In some parts of Slovenia (hills, highlands, the Karst and particularly borderlands) the age structure of rural areas is so unfavourable that it causes demographic extinction and stagnation. Namely, only 2 percent of the population lives on the above mentioned areas which represent 20 percent of the surface (Pečan, Ravbar, 1999). On the other hand, countryside has also experienced social strata transformation. Due to abandonment of farming and forestry, moving of provincials into other occupations and owing to permanent or accessional immigration of a part of the urban population into the countryside, its social structure becomes more and more heterogeneous (Barbič, 1991; Kovačič, 1995). In relation to this process the way of life of the countryside population is changing. Achievements of urban society are advancing rapidly into the rural areas; activities and services once typical for urban areas become more and more widespread in rural areas. Thus, rural population is getting similar to the urban one (Pečan, Ravbar, 1999). It can be expected that all these processes are influencing social behaviour of the countryside people and consequently their fertility behaviour as well. 1 However, the correlation between decrease in total fertility rate in Slovenia and the decline in the share of farm population was high after the Second World War period. 1.2 Theoretical background In now already classical textbook "Determinants of fertility in advanced societies" by Andorka (1978) a relationship between fertility behaviour and place of residence was characterised as a direct linkage. There is a fairly consistent correlation between urban or rural trait of the place of residence and fertility. The place of residence has a property of natural or man-made environment. In this sense highly populated, densely build-up areas heavily loaded with traffic are defined as urban areas. Correspondingly, there is little space left for parks, private gardens and other places where children can safely spend time outside their homes. On the other hand, the trait of rural area is determined by living predominantly in a one-family house with garden or in a relatively small apartment house. Life there is quieter and safer and children have plentiful space for playing outside their homes. According to Andorka, this ecological characteristic of urban-rural differential is also connected with different monetary costs and efforts necessary for raising and educating children that are much greater in urban areas than in the rural ones. Besides, he presupposed different preferences and different consumption alternatives in these two types of residence. To understand urban-rural differences in fertility behaviour of advanced societies it is also necessary to take into consideration the social characteristics of these types of environment. Mackensen (1982) believes that one general theory of fertility that could adequately explain fertility behaviour in all societies and at all periods of time is neither possible nor justifiable. He is convinced that for this reason, every explanation, observation and research of fertility behaviour like any other social behaviour should proceed from the concept of specific structural and cultural characteristics of each society, which is the product of certain historical processes. Hoffnan-Nowotny (1987) who asserts that fertility behaviour of an individual is connected with structural and cultural characteristics of his/her micro, mezzo and macro social environment shares similar view. Importance of geographical variations in place or context in understanding fertility decision-making is further stressed by Boyle (2003). Discussion concerning typical characteristics of urban and rural places is already a long present one. Classics of sociological thought like K. Marx, F. Tonnies, G. Simmel, (European representatives) and L. Wirth, R. Redfield (attached or influenced by the Chicago School) argued that strong distinctions exist between urban and rural societies. E.g. accordingly to Marx's theorization, an individual living in urban place has universal chances to develop all his/her abilities while in the countryside he/she is bounded by constant reiteration of firmly established patterns of thinking and acting derived from direct dependence of men to nature. Quite on the contrary, Tonnies was convinced that life in the countryside (in Gemeinshaft) bounded together by a unity of wills and solidarity with its tradition and social order, gave an individual a supreme opportunity to harmoniously live with others, while life in urban areas (in Gesellschaft), constituted by commodity exchange and rested on "union of rationale wills", led to undermining genuine attachment between people and community (see Bonner, 1999 for more extensive overview of the urban-rural debate). Extensive empirical research of the rural society evolved since the 1950s, showed that there is no clear rural-urban distinction. Concepts like community and locality did not wholly prove their justification. Fieldwork investigations (Williams, 1963, Bell and Newby, 1971, Pahl, 1965, 1966, Newby, 1978, 1980) revealed that there are no close, isolated, harmonious, functioning in traditional manners and closely kinship-bounded communities. They also showed that low population density or certain fixed settlement patterns were insufficient basis to distinguish between urban and rural places. Marsden (1998, 1999), focusing on restructuring of agriculture, drew attention to the varying degrees of influence and interaction of agricultural, residential and commercial interests in shaping differentiating rural areas that could be analyzed by modelling of typologies (see Mahon, 2001 for more extensive review of recent comprehension of rurality). According to Bourdieu (2003) each classification of social world should take into consideration the principle of differentiation in order to theoretically construct empirical reality. Basically, this principle pertains to the distribution of the forms of power or the kinds of capital that vary according to the specific place and moment. This means that the set of agents or institutions which possess sufficient quantum of a specific capital (especially economic and cultural), that enables them to possess a dominant social position, conserve or transform the "exchange rate" between different kinds of capital through more or less administrative measures. Of course, this holds true for relationships between city and country as well. Characteristics of today's urban and rural societies are undoubtedly strongly related to their specific position and mutual relationship in the last century and a half, which has often taken the framework of city domination over the countryside (Hays 1993). An urbanized society evolved out of the former rural society by exploiting the material and human resources of the countryside intensively, which led to considerable economic, social and political inequalities between these two settings. Along with this, cities changed or gave up thoroughly the culture and the way of life, which long prevailed in the countryside. However in this process all parts of countryside were not equally affected. Hypothesis 1 : on the basis of the above statements, we expected in our analysis that fertility behaviour of individuals is closely linked with economic and social characteristics of their life settings observed through the selected typology of rural areas. Over the last decades, the social structure and culture of rural areas in Europe and other industrially developed countries has changed significantly. Due to massive abandonment of agriculture by a great part of rural population and their engagement in other occupations, rural areas became multifunctional and multistructural (Djurfeldt, 1999). Migrations from urban to rural areas contributed to this heterogeneity to some extent as well. With rapid development of new technologies (information and computer sciences) and improved traffic connections among urban and rural areas, the entire societies became increasingly urbanised, "infected" with urban values and the urban way of life. Owing to these changes, it could be supposed that urban-rural difference in fertility behaviour is diminishing or even vanishing. However, changes in social behaviour do not occur quickly, or at the run of one single generation. According to Inglehart's (1989) theory of value changes they occur with the exchange of generations. This theory is based on two hypotheses: 1. deficiency effect: An individual most strongly appreciates the things that are relatively rare in his/her socio-economic setting. 2. socialisation effect: Value priorities of an individual are not direct reflections of his/hers present socioeconomic setting but are the reflections of conditions in which he/she grew up. Therefore it could be supposed that due to increased impact of globalisation (preferring urban life style) every younger rural generation will be more similar in its social and consequently fertility behaviour to urban population than previous rural generations were. Hypothesis 2: in this respect we expected in our analysis smaller differences in fertility behaviour in younger generation of the respondents from different space settings than in the older ones. Cohorts who voluntarily limited marital fertility, enforced demographic transition, which took place during the second part of the 19th and the first part of the 20*^ century in industrialised world, from high to low fertility. Their belief in marriage and family was still very strong, although they practised contraception from preventive reasons according to Van de Kaa (1999). They wanted to give their children a good start in life and stemmed towards limiting the number of children to be correspondent with that goal. They disciplined themselves to remain married even when love was lost. However, these features of "modern demographic behaviour" did not last very long. Since 1965, new demographic changes have been observed in many European countries: decrease of marriages, increase of cohabitations, postponement or abandonment of parenthood, increase in divorces and single parent households. These changes also denoted as the second demographic transition (Van de Kaa, 1987) presumably occur due to a shift of value orientation from a modern trend to post-modern that signifies further enforcement of the individual's free choice principle, which was introduced during the time of Renaissance and Enlightenment centuries ago. The motto of this trend is that individuals can and should make their own choices. "Post-modern demographic behaviour" of contemporary reproductive cohorts presumably corresponds to the individualistic "lifestyle, where it is understood that sex and marriage/union are no longer closely related, and that contraception is only interrupted to have a self-fulfilling conception" (Van de Kaa, 1999:31). This new pattern of behaviour is seemingly reflected in the changes of the life course of young generations - earlier entry into first sexual intercourse but later achievement of economic and housing autonomy and formation of own families (ledema et al., 1997; Cordon, 1997; Nave-Herz, 1997). Hypothesis 3: in this vein in our analysis we supposed considerable differences in the life course, especially during the transition into adulthood between younger and older generations of the respondents from different space settings. 2 Methods 2.1 Data The analysis was based on the Family and Fertility Survey data collected in Slovenia between December 1994 and December 1995. A representative area sample of the inhabitants of the Republic of Slovenia in their reproductive age period (i.e. aged 15 to 45) of both genders was drawn. Face to face interviewing was used as the data collection method. Realised sample consisted of n=4.559 respondents. The data were weighted according to survey design and population adjustment based on a specified set of socio-demographic variables: gender, age and size of the settlement (see Obersnel Kveder et al., 2001 for more detailed description of the data set). Considering our objective to explore differences between generations, two birth cohorts were emphasized in the present analysis: • respondents aged 20-24 years of both genders -the "younger" generation at the entrance to their reproductive period and also representing contemporary fertility patterns, • respondents aged 40-45 years of both genders -the "older" generation at the end of their reproductive period and also presenting the immediate past fertility patterns. The main interest of the present article was in the variables concerned with the timing of childbearing as well as other events relating to the entry of an individual into adulthood such as sexual debut, partnership history, leaving parental home, education and employment spells. All these events were measured as retrospective event histories. For all strictly reproductive events (i.e. childbearing, use of contraception), the age at the first sexual intercourse was taken as the threshold of becoming at risk, while for all others (i.e. partnership, education, employment, and leaving parental home) the threshold was the respondents' birth. Right censoring was determined by the date of the interview. Besides, the variables describing respondent's preferences, values, attitudes and status characteristics were considered. Some of them: household structure, attitudes towards abortion, gender roles and marriage were constructed from the set of other variables from the survey data set. The variable 'Household structure' was derived from the household grid. It has 13 distinct attributive values depicting various household types in which the respondents can live: 1 alone 2 with a partner 3 with a partner and others (e.g. his/hers parents and siblings, partner's parents and siblings, other relatives)* 4 with a partner and children 5 with a partner, children and others* 6 with children 7 with children and others* 8 with parents and siblings 9 with parents, siblings and other relatives 10 with one parent 11 with non-relatives 12 with other relatives 13 with other relatives and non-relatives The variable 'Abortion' was derived by taking into account "approval" answers from the set of the following dichotomous statements: • An abortion is approved/not approved when mother's life is in danger due to pregnancy. • An abortion is approved/not approved when the risk of the birth of an abnormal child is great. • An abortion is approved/not approved when a woman is not married. • An abortion is approved/not approved when a married couple does not want to have another child. • An abortion is approved/not approved when a woman does not want to have a child at the time being. The variable 'Gender roles' was constructed by taking into account answers "strongly agree" and "agree" from the Likert 5- item scale of the following attitudes: • An employed mother can create as warm and safe relationship with her children as a mother who is not employed. • Employment is the best way for a woman to achieve independence. • Being a housewife is as fulfilling for a woman as being employed. • Both man and woman should equally contribute to their household's budget. • Preschool child would most probably suffer if his mother was employed. • It is acceptable for a woman to be employed, but what most of women really want is home and children. The variable 'Marriage vis-a-vis Cohabitation' was constructed by taking into account responses "very favourable" and "favourable" from the Likert 5- item scale regarding advantages of cohabitation over marriage to achieve the following aims: • general happiness, • economic security, • friendly relationship with others, • personal freedom, • stable relationship, • having children, • social acceptance. Following the aim of this article, fertility behaviour should be put in the perspective of the contextual micro, mezzo and macro level variables. Microenvironment is defined as the immediate living surrounding of the individual varying from their family, household to the neighbourhood. Mezzo environment is by definition broader than the micro and thus can encapsulate a variety of geographical units from the settlement to the municipality and region. Macro socio-economic context is usually associated with the national level indicators. This article focuses on the use of one exemplary indicator measured at mezzo level; the respondent's place of residence determined by the basis of Slovenian Census in 1991. In its essence, the indicator reflects the urban-rural dichotomy and defines 4 possible types of living surroundings: urban, suburban, typical rural and rural depopulation areas (Kovačič et al. 2000, Kovačič, et al. 2002): • Those places that have an urban management character according to the space planning documents have been classified as urban environments. Additionally, the function of centrality of each geographical unit was considered as criteria of demarcation between urban and rural areas. Slovenian geographers classify settlements into seven groups according to their centrality (Vrišer, 1998). Those settlements with the centrality index between 7 and 3 were also defined as urban environments. The settlements with the lowest centrality index 3 were additionally bounded by the minimal size of 3.500 inhabitants. The share of population living in this type of settlements is estimated at 39,20 percent. • All local communities (slo. krajevne skupnosti) with the density of the population greater than 200/km2 were considered as suburban. In addition, the areas with the density lower than 100/mk2, with the index of population growth in the period 1981/91 greater then 110, were also included in this type of settlements. The share of population living in this type of settlements is estimated at 14,80 percent. • All local communities with the long-term (1961/919 and short-term (1981/91) population growth index below 97,5 were considered as depopulation areas. In addition, local communities with the non-negative short-term population growth and with the index of population ageing above the absolute demographic threshold (i.e. 72) were also considered as depopulating. The share of population living in this type of settlements is estimated at 14,87 percent. • All rural areas between suburban and depopulation were considered as typically rural. The share of population living in this type of settlements is estimated at 31,22 percent. Additional analysis which applied the above typology (Perpar, Kovačič, 2002) revealed that these types of rural settlements significantly differ among each other in terms of socio-economic development, infrastructure and natural resources. Taking into account several indicators pertaining to statistical data from 1991, suburban areas are in the most favourable position and depopulation areas are in the least favourable one. Considering the employment structure of population, the highest share of population in suburban and typically rural areas works in secondary sector, whereas in depopulation areas population mostly works in the 2 primary sector . Daily migrations additionally indicate the level of engagement in gainful employment. Its share is again the highest in suburban areas (770 per 1000 inhabitants) and the lowest in depopulation areas (540 per 1000 inhabitants). Further, significant differences among the areas are also observed concerning the education. The highest level of education is reached in suburban areas (43 percent of inhabitants finished at least high school), whereas in typically rural and especially in depopulation areas this share is considerably smaller. The same picture is indicated by the proportion of students per 1000 inhabitants; it is again the most favourable in suburban areas (25 students per 1000 inhabitants), less favourable in typical rural areas (21 to 25 students/1000 inhabitants) and the least favourable in depopulation areas (20 students/1000 inhabitants). Furthermore, indicators pertaining to economic situation confirm already indicated differences. E.g. density of business entities that indicates economic development of the area is the most favourable again in suburban settlements with more than 14 business entities per 1000 inhabitants, less favourable in typical rural areas (12 to 14 business entities per 1000 inhabitants) and the least favourable in depopulation areas (less then 12 business entities per 1000 inhabitants). Considering the infrastructure, the analysis (Perpar, Kovačič, 2002) demonstrated that all areas are relatively well equipped with basic infrastructure, but the best equipped are the suburban ones. In typical and depopulation areas the population is still frequently faced with the problems of drinking water supply, unsettled canalisation and purifying plants and maintenance of local roads. An additional problem of the depopulation areas is abandonment of farming and consequently the forest over-growing. 2 The share of farm population in depopulation areas presents 200 per 1000 inhabitants, in typical rural areas this proportion counts 120 per 1000 inhabitants and in suburban areas only 40 per 1000 inhabitants (Perpar, Kovačič, 2002). Each respondent from the survey was ascribed a settlement type according to his/her residence at the time of the interview. The key for information matching was the local community that could be matched to the survey data as well as the settlement typology specification. The analysed sub-samples were as follows: Table 1: The size of sub-samples (N) Settlements type 20-24 40-45 Urban 192 302 Suburban 109 182 Typical rural 204 295 Depopulation rural 75 131 2.2 Analysis Bivariate associations were analyzed using either the analysis of contingency tables either using comparisons among means. Timings of observed events were analyzed through event history models (Allison, 1995), which enable the estimation of the differences in the individuals' life course. Since only the distribution of the probability of time (T) was taken into account, it was described through cumulative distribution function perspective (survivor function): S (t ) = Pr{r > t}= 1 - F (t ) (1) The result of the survivor function is a probability that an individual "survives" in the process beyond time t. It is defined on the interval from 0 to 1. Life-table method was used for the estimation of the survivor function. The estimate is calculated using the conditional probability of failure (i.e. the probability for an event within a certain interval, given that an individual made it to the start of interval): S (ti )=i[ì (1 - qj ) (2) j=1 where qj is the conditional estimation of failure and is calculated as the ratio of number failed over the effective sample size. Major events in a life history were depicted by the means of calculating the quartiles of the survivor function. 2.3 Results In order to evaluate the importance of contextual information in explaining fertility and family behaviour, the major moments of the reproductive life period, structural and cultural characteristics of individuals were compared against the regional typology. At first, the processes of entry into the first, second and third parenthood were compared among the regions, taking into account both generations together (Figure 1, 2 and 3). The major differences that can be observed at the first birth are related to the differentiation of the urban areas from the rest. As survivor functions show, inhabitants of the urban areas tend to delay the birth of the first child more than the inhabitants of the other regions. There are only 15 percent of women from suburban and rural areas that are still childless 10 years after their sexual debut, while the proportion of nullipara women in the urban areas is above 30 percent. At the second birth, the differentiation of the urban areas from the rest still prevails; with more than 30 percent of inhabitants that did not experience the birth of the second child 10 years after the entry into the first parenthood. However, in this case the suburban areas also differ significantly from typical rural and depopulation rural areas; approximately 24 percent of its inhabitants did not experience the birth of the second child 10 years after the first birth, whereas in typical rural and depopulation rural areas only 18 and 14 percent of inhabitants did respectively. In the case of the third birth, the difference between areas got the character of polarity. The share of inhabitants that experienced the third birth decreased significantly in all regions in comparison with the share of inhabitants that experienced the second birth, but particularly in urban and suburban ones, where more than 80 percent of dwellers did not experience that event at all. In typical rural and depopulation rural areas the comparable shares are 69 and 66 percent respectively. 1 3 5 7 9 11 13 15 17 19 21 23 25 duraticn (inyeais) -o-urtsn -*-sutuiban -^-typical rural -o-c^lcpulaticnIural 29 Figure 1: Time to the first birth. 1 3 7 9 11 13 15 17 19 21 23 25 27 duration (in years) -D- urba^ HK- suburban ^ typical rura^ ^ depopulation rural Figure 2: Time to the second birth. 1 0,9 0,8 0,7 0,6 0,5 0,4 0,3 0,2 0,1 0 1 3 5 7 9 17 19 21 11 13 15 duration (in years) -B- urba^ Hie suburban ^ typical ruia^ ^ depopulation rural Figure 3: Time to the third birth. Table 2 shows times in years from the first sexual intercourse to the first, second and third parenthood for both generations separately. The older generation relates a more homogenous picture then the younger one, where the differences among the rural and urban areas are more transparent. The most evidently, the 20-24 urban generation postpones the parenthood in comparison to the urban older one and their rural coevals as well. Among the younger generation in urban areas, 25 percent enters into first parenthood more then 4 years later (7,1 years in comparison to 2,9 years) then the first quarter of the older generation and approximately 3 years later than their counterparts in typical and depopulation rural areas. Furthermore, the postponement of parenthood among younger urban generation is manifested also in the absence of any childbirth at 50th and 75th percentile and consequently any second or third birth at all. Results pertaining to the younger generation also show that the first births reach their maximum level the fastest in typical rural areas. Table 2: Time to parenthood (in years) First Birth Second birth Third birth 20-24 40-45 20-24 40-45 20-24 40-45 25th percentile Urban 7.1 2.9 3.0 Suburban 4.3 2.3 3.0 2.3 - - Typical rural 3.3 2.2 2.6 2.3 3.2 6.0 Depopulation 3.9 2.4 3.6 2.0 - 3.8 50th percentile Urban 7.2 5.1 Suburban 11.9 4.3 4.3 4.4 Typical rural 6.6 6.6 4.2 3.7 - - Depopulation 6.8 4.1 - 3.3 - - 75th percentile Urban 10.3 Suburban 7.7 6.9 9.3 Typical rural 8.8 7.0 - 6.0 - - Depopulation 6.5 - 6.4 - - Although different patterns of entering into parenthood can be observed within generations, substantial differences in the 40-45 generation can be observed in the 75'h percentile values. At this level the second births are rare among urban dwellers. There are also scarcely any families with three or more children. The third births are presented only at the level of 25th percentile among families in typical and depopulation rural areas. In general, the intervals between childbirths are considerably shorter in rural then in urban areas, what has an important implication on the overall number of children in each of the regions. respondents, those from suburban areas came the most near to their desired number. Among older generation, concerning preferred number of children, urban and suburban areas create one pattern (lower number) and typical and depopulation rural areas create the other one (higher number). This observation is consistent with previous as well as with subsequent findings. ui^an sufcui^an lypicai rural depopulation rural ■aU °40-455eais °20t245eais Figure 4: Actual number of children (means with confidence intervals). As a consequence of these events' and spells' patterns, in general, families in both types of rural areas have statistically significantly (F=27.595***3, Bonferroni post-hoc mean difference tests for Urban compared to other types are -0.216***, -0.308*** and -0.358*** respectively) more children then families in urban areas have (Figure 4). Suburban areas do not differ significantly from urban or the rest of rural areas, however they are more similar to urban than rural areas. In the case of 20-24 generation, the differentiation is significant among urban and typical rural areas (F=8.562***, Bonferroni post-hoc mean difference tests for Urban compared to other types are -0.240**, -0.248*** and -0.209 respectively). It is also observed that in this generation, the number of children slightly decreases in depopulation areas in comparison with typical rural and suburban areas. Considering the perception of the ideal number of children per family, there are no great differences among rural and urban areas (Figure 5). The overall perceived ideal family size is above two children, though it is a bit diverse concerning the region type and age. In particular, the members of the younger generation from urban and suburban areas want more children then their older counterparts do. In this regards they are quite similar to their rural counterparts. Of course, more conclusive observations concerning the match between preferred and actual number of children can only be drawn from the older generation, which more or less already finished its reproductive period. It seems that among all older urban suburfcan typical rural depopulation "All °40-4Sjears °20t24jears 'ual Figure 5: Preferences towards the number of children (means with confidence intervals). Table 3 distinctly shows extremely interlaced differences in the ages of entering certain life events among the two observed generations and the four area types. The prevailing pattern of sequence of events can be described as starting with the entry into sexual life, finishing education, getting the first job, leaving the parental home and as the last step, entry into partnership union. The entry into partnership union is consistently the last event to be experienced by any observed group. Some resemblances can be observed between both urban and suburban areas, as well as between typical rural and depopulation areas. In rural areas the conclusion of education process tends to precede the sexual debut. Table 3: Age at life events (in years). First sexual intercourse First partnership The end of education Leaving parental home First employment 20-24 140-45 20-24 1 40-45 20-24 1 40-45 20-24 1 40-45 20-24 140-45 25'h percentile Urban 16.0 17.9 22.3 20.8 17.8 16.9 19.3 18.4 19.0 18.2 Suburban 16.0 17.0 20.5 20.0 17.2 15.8 19.7 18.9 18.5 17.7 Typical rural 17.0 17.0 20.5 20.6 15.9 14.5 19.3 18.3 18.0 17.3 Depopulation 17.0 17.0 20.1 20.3 15.9 14.5 18.9 18.0 17.9 17.2 50th percentile Median) Urban 17.0 18.9 24.8 23.3 20.6 20.5 22.7 20.3 21.3 19.8 Suburban 18.0 18.0 22.1 18.3 18.3 24.4 20.8 19.8 18.8 Typical rural 18.0 18.0 22.9 22.5 17.4 17.5 22.8 20.5 18.8 18.5 Depopulation 18.0 18.0 23.3 22.4 17.1 16.7 20.4 19.3 18.6 18.6 75th percentile Urban 18.0 20.0 26.7 25.8 24.3 22.5 Suburban 18.0 20.0 - 25.4 23.7 24.9 - 24.8 23.9 21.5 Typical rural 19.0 20.0 25.3 19.1 22.7 24.7 20.2 21.2 Depopulation 19.0 20.0 - 24.8 18.8 18.4 - 24.0 20.2 21.6 ' p < 0.01 - ***, 0.01 < p < 0.05 - **, 0.05 < p < 0.10 - * The main difference between the two generations irrespectively of the region is in the earlier sexual debut and the postponing of all other events of the younger population, especially setting up a partnership union and an independent life away form parents, which is most evident in the young urban generation. The end of education process tends to differentiate most strongly among generations and regions. The older generation in depopulation areas is the first to finish their education, whereas the young urban generation is the last. As in case of entry into parenthood, the intervals between above listed events which are considerably shorter in rural areas then in urban ones, additionally contribute to a different fertility level in observed areas. Table 3 can be observed in relation to the contraceptive use (Figure 6). Among the younger generation, the most common contraception method used at the time of the interview was hormonal contraception following by condom and withdrawal. The rarest method used was injection. Quite substantial differences can be observed among different regions. Depopulation areas have a very low condom or diaphragm usage (6,7 percent) compared to other regions (14 percent in urban to 23 percent in suburban areas), but on the other hand, they have the highest usage of hormonal contraception (33,3 percent). However, the key finding is higher proportion of non users among typical rural dwellers (40 percent) than in other areas (24 to 33 percent). Thus, the differences among the regions in relation to the entries into certain life events, as well as in the birth of the first child, are reflected in variations of contraceptive use. Share of respondents in % 0,0 5,0 10,0 15,0 20,0 25,0 30,0 35,0 Pill ' Condom Current non-user Non-user less than 3 months No sex ho Ever non-user te ^ Withdrawal Periodical abstinence UID Pregnancy Diaphragm Injection □ urba^ □ suburban □ typical rura^ Depopulation rural 3,6 3,4 3,2 3 2,8 2,6 2,4 urban suburban typical rural depopulation rural OAll 0 40-45 years A20-24 years Figure 7: Attitudes towards abortion (means with confidence intervals). Attitudes towards abortion are in line with the respondents' expressed religiosity. Figure 8 shows quite high and significant differences across observed regions. Only one third of urban dwellers defined themselves as being religious, while more than a half of the inhabitants of the other three regions did. Typically, rural areas have the highest proportion of religious people. Within the younger generation, these differences are even more expressed (x2 sig. p=0.000; Adjusted Standardized Residuals for Urban 20-24 are -5.6 "Religious" and 6.4 "Not Religious"). E^ not know Not religious 20-25 Nethernor years Keligious Do not know 40-45 Not religious years Nether nor Religious 0,0 10,0 20,0 30,0 40,0 Share in % 50,0 60,0 70,0 □ urban H suburban □ typical rural ■ depopulation rural Figure 6: Current usage of contraception (20-25 years). Figure 8: Religiosity. Explanation for the difference in the number of children can also be indicated in relation with the attitudes towards abortion. The results in Figure 7 show, that abortion is not a very acceptable phenomenon among the respondents. However, there is a small, but a consistent downward trend in acceptability of abortion from urban to depopulation areas. Abortion tends to be more often considered fairly unacceptable in rural areas, whereas in urban areas, the average attitude is more pro-choice oriented. In general, there are no statistically significant differences between the two observed generations, except in typical rural areas where the young show significantly greater (F=10.370***, Bonferroni post-hoc mean difference tests for Typical Rural compared to urban is -0.613***) opposition towards the abortion as the free choice act than the older generation does. The results concerning the attitudes towards marriage and gender roles reveal a rather different picture. As Figure 9 shows, irrespective of the region and the age among the respondents, there are no significant differences in the preference of either cohabitation or institution of marriage. Only the 40-45 generation in urban areas shows a significant tendency (F=8.620***; Bonferroni post-hoc mean difference tests for Urban compared to Suburban and Depopulation Rural urban are -0.2379** and -2562** respectively) towards favouring the cohabitation over marriage in comparison to other regions, whereas younger generation everywhere is more inclined towards cohabitation than marriage. These results are consistent with the share of respondents living in cohabitation unions: in urban areas the share is 15 percent, in suburban 10,8, in typical rural 12,4 and in depopulation rural 13,7 percent. All respondents also share the same tendency in attitudes towards supporting the equality of gender roles (Figure 10). There are no major differences between generations or regions. 2,8 2,6 suburban typical rural depopulation rural OAll 040-45 years A20-24 years Figure 9: Marriage vis-a-vis cohabitation (means with confidence intervals). 3,6 3,2 2,8 2,4 2,2 others with other relatives non-relatives with one parent parents and others parents and siblings children and others children partner, t^dren and others partner and children partner and others partner T . □ ..... 1, W urban suburban typical rural depopulation rural OAll 040-45 years A20-24 years Figure 10: Gender (means with confidence intervals). Substantive differences among regions (x2 sig. p=0.000) are revealed in the household composition (Figure 11). The largest differences are related to the rate of nuclear and extended families - of orientation and procreation -which represent two predominant household composition types in all regions and both generations. The extended family in depopulation regions of Slovenia represents one third of households, while in the urban areas, this proportion is less then 10 percent. Nuclear family is the most representative composition type where the majority of the respondents live. However, in this regards the difference among the regions is significant (Adjusted Standardized Residuals for Depopulation Rural are -5.4 "Partner and Children" and -2.5 "Parents and siblings"); depopulation areas are just above the 50 percent mark, whereas in other regions the share of this type of households is around 70 percent, being the highest in suburban areas (over 75 percent). Although one-parent family households are relatively rare, they can only be observed in the urban and suburban areas (6,9 - 4,1 percent), while in the rural areas they are more an exception then a rule (1,9 - 0,9 percent). Very similar proportions among regions hold true for those living alone. Thus, the variation in number of children in various areas is reflected in the household composition and consequently in their life style. 0 10 20 30 40 50 60 Share of respondents in %o □ urban Esuburban Qtypical rural ■ depopulation rural Figure 11 : Household composition. 2.4 Conclusions Based on the presented results, the first and the third hypothesis can fully be confirmed. The analysis revealed that fertility behaviour of Slovenian population is significantly related to the socio-economic context of diverse socio-geographical regions; the more developed is one setting in terms of favourable and diverse economic activities, available infrastructural capacities and social services, the lower is fertility of its dwellers. Nevertheless, the application of regional typology also revealed that relations of urban-rural division are not uniform and static. Population of urban areas with the pattern of postponement of first births considerably differs from the other regions, whereas at the second and even more at the third birth, the population of suburban areas is getting gradually more similar to the urban ones. The typical rural areas are the most stable in relatively high fertility pattern, whereas in depopulation areas, this pattern is no longer very firm, most probably due to less favourable developmental conditions. Changing similarities and differences among the regions are particularly related to behavioural patterns of the younger generation. The results revealed that tempo and sequences of events in entering into adulthood changed significantly between generations. In older generation among all regions, the life events, that normatively precede parenthood, followed each other quicker than in the case of the younger generation. In older generation among the regions the variations in transition to adulthood certainly exist. But contrary to our expectations, formulated in the second hypothesis, these variations are even more pronounced among the younger generation, particularly due to distinctive behaviour of urban youth. According to the theory of second demographic transition, this group shows an indicative pattern of post-modern demographic life course both through their behaviour and expressed attitudes, whereas rural young generation, for the most part those living in typical rural areas, mainly follow the pattern of their older counterparts. As our analysis indicates, variations in fertility behaviour among the regions and generations can be more pertinently explained by social structural factors (education, household type) then values and attitudes pertaining to the post-modern life patterns. 3.2 3 2 3,4 3 2,6 2 The obtained results call for greater attention to more contextualised approaches in demographic research. Our analysis demonstrated that rural setting as a social space is not a homogeneous category, but a grouping of various structural and cultural characteristics that might specifically determine fertility behaviour. By neglecting this reality, too much information needed for better understandings of variations in fertility behaviour could be lost unduly. The residues of the past urban-rural differences in fertility behaviour are still present today despite the effects of global economic and social trends that carry structural and ideational changes into the countryside and will, as our analysis indicates, also remain in the near future. From this point of view, it was unreasonable that international project Fertility and Family Surveys in ECE countries under supervision of United Nation gave so little attention to urban-rural dimension. Among 22 European countries included in this project, only five of them (Poland, Estonia, Lithuania, Switzerland and Slovenia) took into consideration the respondents' place of residence. In the new Generations and Gender Programme (GGP), which started in the year 2000, the need for more contextual approach is fully considered; all micro, mezzo and macro levels are taken into account. On that basis pertinent results are expected. Hopefully, Slovenian researchers will also have the opportunity to join these endeavours. The observed differences in fertility behaviour patterns among generations living in various geographical regions with different socio-economic characteristics also call for diversified actions of population policies. In the near past it was anticipated that family and social policy measures would contribute to uninterrupted population reproduction in Slovenia the most successfully. However, as our results indicate, population policy also has to integrate space and regional measures in to its field of actions; i.e. urban and rural developmental programmes that will consider different every day's life conditions and needs of people living in particular space setting. To scientifically support such actions, a new research data, e.g. in the frame of GGP, that will provide the picture of the present generations of reproductive age from various viewing angles, are urgently needed. References [1] Allison, P.D. (1995) Survival Analysis Using The SAS System. A Practical Guide. Cary. SAS Institute Inc. [2] Andorka, F. (1978) Determinants of Fertility in Advanced Countries. Methuen & Co. Ltd. London. [3] Barbič, A. (1991) K "novemu" pojmovanju ruralnosti (Towards a "New" Comprehension of Rurality) In: Barbič, A. (ed.) Prihodnost slovenskega podeželja. Prostor, prebivalci, gospodarske dejavnosti (A future of Slovenian Countryside. Space, Inhabitants and Economic Activities). Dolenjska založba. Novo mesto. [4] Bell, C., Newby, H. (1971) Community Studies. Allen and Unwin Ltd. London. [5] Bonner, K.M. (1999) A Great Place to Raise Kids. Interpretation, Science, and the Urban-Rural Debate. McGill-Queen's University Press, Montreal & Kingstone • London • Ithach. [6] Boyle, P. (2003) Population geography: does geography matter in fertility research? Progress in Human Geography, 27(5): 615-626. [7] Bourdieu, P. (2003) Practical Reason. On the Theory of Action. Polity Press. Cambridge. [8] Coale, A. J., Watkins, S. C.(1986) The Fertility Decline in Europe. Princeton University Press. Princeton. New Jersey. [9] Cordon, J. A. F. (1997) Youth residential independence and autonomy: A comparative study. Journal of Family Issues. 18 (6): 576-607. [10] Černič Istenič, M. (1994) Rodost v Sloveniji. (Fertility in Slovenia) Znanstveno in publicistično središče. Ljubljana. [11] Černič Istenič, M., Obersnel-Kveder, D., Kveder, A. (2000) Differences in contraceptive behaviour of men and women in Slovenia regarding their partnership and parenthood history. Paper presented at the Flagship Conference, Brussels, May 29-31. [12] Černič Istenič, M., Kveder, A. (2000) Urban-rural differences in fertility behaviour in Slovenia. Paper presented at X. World Conference of Rural Sociologists, Rio de Janeiro; July 30 - August 5. [13] Djurfeldt, G. (1999) From CAP to CAR? New Directions in European Rural Sociology. http://www.soc.lu.se/esrs. [14] Hays, S.P. (1993) From the history of the city to the history of the urbanized society. Journal of Urban History. 19(4): 3-23. [15] Hoffman-Novotny, H.-J. (1987) The Future of the Family. European Population Conference. Issues and Prospects. Plenary Séances. Central Statistical Office of Finland. Helsinki. pp. 113-199. [16] Iedema, J., Becker, H. A., Sanders, K. (1997). Transition into independence: A comparison of cohorts born since 1930in The Netherlands. European Sociological Review, 13 ( 2): 117-137. [17] Ingleahrt, R. (1989). Kultureller Umbruch. Campus Verlag. Frankfurt/New York. [18] Javornik, J.(1999). Slovenija v 90-ih. Prebivalstvo (Slovenia in the 1990s). In: Hanžek, M. (ed.) Poročilo o človekovemu razvoju Slovenije 1999 (Human Development Report for Slovenia 1999). Urad R Slovenije za makroekonomske analize in razvoj. Ljubljana. pp. 65-67. [19] Kaa, Van de, D. (1987) Europe's second demographic transition. Population Bulletin, 42(1): 1-47. [20] Kaa, Van de, D. (1999) Europe and its population: the long view. In: Kaa, Van de, D., Leridon, H., Gesano, G., Okolski, M. (eds.) European Populations Unity in Diversity. European Population Studies no.6. Kluwer Academic Publishers. Dordrecht/Boston / London. pp. 1-49. [21] Klemenčič, V. (2002) Procesi deagrarizacije in urbanizacije slovenskega podeželja (Processes of deagrarisation and urbanisation of Slovenian countryside). Podeželje na prelomu tisočletja (Countryside at the break of the century/ Mednarodna znanstvena geografska konferenca (International scientific geographic conference), Ljubljana, 19.-21. September 2001. Ljubljana. Oddelek za geografijo Filozofske fakultete. Accessible on: (http://www.ff.uni-lj.si /oddelki /geo / publikacije/dela/files/Dela_17/Vladimir_Klemencic .pdf) [22] Kovačič, M, Perpar, A., Gosar, L. (2002) Členitev podeželja v Sloveniji (Differentiation of countryside in Slovenia) Sodobno kmetijstvo. 35(2): 62-66. [23] Kovačič, M., Gosar, L., Fabijan, R., Perpar, A. (2000). Razvojno - tipološka členitev podeželja v Republiki Sloveniji (Developmentally -Typological Differentiation of Rural Areas in the Republic of Slovenia.). Institute of Agricultural Economics, Biotechnical Faculty, University of Ljubljana. [24] Kovačič, M. (1995) Idejne in teoretske osnove za razvoj podeželja (Ideal and theoretical basis for the countryside development) In: Kovačič, M. (ed.) Izhodišča, sestavine in problemi celovitega razvoja podeželja v Sloveniji (Starting-points, components and problems of integral countryside development in Slovenia) Biotechnical Faculty, University of Ljubljana, pp. 3-20 [25] Livi-Bacci, M. (1986) Social-Group Forerunners in Fertility Control in Europe. In: Coale, A. J., Watkins, S. C. (ed.): The Fertility Decline in Europe. Princeton University Press. Princeton. New Jersey, pp.182-200. [26] Mackensen, R. (1982) Social Change and Reproductive Behaviour - on Continuos Transition. In: Hohn, C., Mackensen, R. (eds.) Determinants of Fertility Trends: Theories Re-examined. Ordina ed.. Liege. pp. 251-279. [27] Mahon, M. (2001) Reconceptualising rural change. A literature review. Paper presented at XIX European Congress for Rural Sociology. Dijon, September 3-7. [28] Malačič, J. (1990) Načrtovanje zdravega potomstva - odgovornost posameznika do družbe (The planning of healthy future generations - the individual's responsibility towards society). Zdravstveni vestnik. 59(7-8): 349-354. [29] Marsden, T. (1998) New rural territories: Regulating the differentiated rural space. Journal of Rural Studies. 14(1): 107-117. [30] Marsden, T. (1999) Rural futures: The consumption countryside and its regulation. Sociologia Ruralis. 39(4): 501-520. [31] Nave-Herz, R. (1997) Still in the nest: The family and young adults in Germany. Journal of Family Issues. 18(6): 671-689. [32] Newby, H. (1978) Property, Paternalism and Power: Class and Control in Rural England. Hutchinson. London. [33] Newby, H. (1980) Urbanisation and the Rural Class Structure: Reflections on a Case Study. In The Rural Sociology of Advanced Societies: Critical Perspectives. Croom Helm. London. [34] Notestein, F. (1953) Economic problems of population change. In: Proceedings of the Eight International Conference of Agricultural Economists. London. Oxford University Press. pp. 13-31. [35] Obersnel Kveder, D., Kožuh Novak, M., Černič Istenič, M., Šircelj, V., Vehovar, V., Rojnik, B. (2001) Fertility and Family Surveys in Countries of the ECE Region - Standard Country Report, Slovenia. Economic Studies No. 10r, UNECE, UNFPA, New York/Geneva. [36] Pahl, R.E. (1965) Urbs in Rure: The Metropolitan Fringe in Hertfordshire. Wiedenfeld & Nickelson. London. [37] Pahl R.E. (1966) The rural-urban continuum. Sociologia Ruralis. 6(3): 299-329. [38] Pečan, J., Ravbar, M. (1999) Regionalni razvoj Slovenije (Regional Development of Slovenia). In: Hanžek, M. (ed.) Poročilo o človekovemu razvoju Slovenije 1999 (Human Development Report for Slovenia 1999). Urad R Slovenije za makroekonomske analize in razvoj. Ljubljana. pp. 31-63. [39] Perpar, A., Kovačič, M. (2002) Razvojno stanje, značilnosti in problemi podeželskih območij v Sloveniji (Development situation, characteristics and problems of rural areas in Slovenia). Sodobno kmetijstvo. 35( 2): 52-61. [40] Sharlin, A. (1986) Urban-Rural Difference in Fertility in Europe during the Demographic Transition. In: Coale, A. J., Watkins, S. C. (ed.): The Fertility Decline in Europe. Princeton University Press. Princeton. New Jersey, pp.234257. [41] Statistical Yearbook (2003) Statistical Office of Republic of Slovenia. Ljubljana. [42] Šircelj, V. (2007) Fertility and educational attainment in Slovenia. Anthropological Notebooks. 13(2): 11-34. [43] Šircelj, V., Ilič, M., Kuhar, A., Zupančič, M. (1990) Projekcije prebivalstva R Slovenije 1986-2006 (Population projection of republic of Slovenia 1986-2006). Statistical Office of Republic of Slovenia. Ljubljana. [44] Šircelj, V. (1991) Determinante rodnosti v Sloveniji (Determinants of fertility in Slovenia): Doctoral Dissertation. Faculty of Arts, University of Ljubljana. [45] Vrišer, I. (1998) Središčna (centralna) naselja. (Central (capital) settlements). In: Geografski atlas Slovenije. Ljubljana, DZS. pp. 308-309. [46] Williams, W.M. (1963) A West Country [47] Village: Ashworthy: Family, Kinship, and Land. Routledge Kegan Paul. London. Late Fertility Trends in Europe Janez Malačič University of Ljubljana, Faculty of Economics Kardeljeva ploscad 17, Ljubljana E-mail: janez.malacic@ef.uni-lj.si Keywords: modern demographic regime, fertility trends, late age-specific fertility, Europe Received: March 25, 2008 European continent has been for-runner in the demographic transition process. Age-specific fertility patterns had changed tremendously during this process in Europe. However, it is still mainly concentrated in the female age group 20 - 34 years. Women and couples tend to limit fertility in marginal age groups of female's reproductive period. In this paper the author addresses late age-specific fertility trends in selected European countries. Age - specific fertility is relatively low after the age of 35 in modern demographic regime. These rates show very low fertility in the age group 40 - 44 years and only exceptional childbearing in the age group 45 - 49 years. In contrast, the age group 35 -39 years is not only dynamic but also still important for the procreation. In general, total late fertility was higher in 1960 than in 2003 in majority of European countries. Further increase in total late fertility in most of Europe is likely, Eastern Europe included. However, late age - specific fertility will very likely retain more or less marginal share of total fertility in modern demographic regime. Late age - specific abortion will considerably contribute to this result. Povzetek: Prispevek obravnava značilnosti pozne rodnosti v Evropi v razmerah sodobnega načina obnavljanja prebivalstva. 1 Introduction European continent has been for-runner in demographic development for several centuries. The most advanced European countries were pioneers at the beginning and at the end of the demographic transition. The main result of the process of demographic transition in Europe was modern demographic regime. Demographers differ in views about the detailed characteristics of the modern demographic regime and about the timing of its appearance. However, there is general agreement that low fertility and low mortality levels are the most prominent features of the post-transitional demographic regime in Europe. Broadly speaking, it is possible to state that the modern demographic regime in Europe emerged in the 1950s. At present, it is about sixty years old. It is hardly enough for the development of all facets of the new way of population or generational replacement. We need much longer period if we want to competently compare modern regime with the traditional or pre-transitional regime of population development. It is also premature to expect that demographers would have reached general consensus on the theory of modern demographic regime. The demographers simply need longer period and more countries and national populations with finished demographic transitions to be able to study modern demographic processes in different environments. From this point of view the so called theory of the second demographic transition is more confusing than illuminating. According to my opinion, it is not wise to speak about the second demographic transition before we really know whether below replacement fertility is one of the main characteristics of the modern demographic regime or not. If the answer is yes, then we do not have any transition at all. In spite of the theoretical and practical dilemmas it is very important for demographers to study different characteristics of the modern demographic regime in Europe. The author of this paper will focus on late age-specific fertility trends in Europe. Age-specific fertility patterns had changed tremendously during the period of the demographic transition in Europe. A uni-modal distribution of age-specific fertility rates considerably slimmed as a consequence of the fertility decline in younger and older ages of the female reproductive period. Age-specific fertility in advanced European countries with modern demographic regime is concentrated in the female age group 20 - 34 years. Women and couples tend to limit fertility in marginal age groups of female's reproductive period. Adolescent fertility is not only unwanted but more and more unacceptable for an individual and the society. Main reasons are prolonged schooling and late economic emancipation of young people. Late age-specific fertility is much more in the domain of an individual woman and her family in spite of the fact that physicians advise avoiding a childbearing after the age of 35 years. Widespread employment of women, low reproductive norms and modern life styles in Europe do not support the childbearing in the period 35 - 49 years of age. Age-specific fertility rates in this period show efficient family planning and the use of modern contraceptives. In this paper late age-specific fertility is defined by the age group 35-49 years of the females. The author will follow available statistical data on the late age-specific fertility and abortion trends in Europe up to the recent years. He will also try to answer the question how much have late age-specific fertility and abortion changed as the consequence of very low level of fertility and of the postponement of childbearing among young generations. Late fertility and late abortion differentials will also be analysed. The author will devote much attention to the question whether any breakthrough can be anticipated in late age-specific fertility in Europe in the future or should we further maintain the position that childbearing after the woman's age of 35 will remain marginal for still very obvious reasons. 2 Low fertility level in Europe Generally, fertility level in almost all of Europe is low indeed. All national period total fertility rates in consecutive years 2003 and 2004 were below 2.04 if we exclude Turkey and Albania (RDDE, 2006, p. 78). Particular rates covered wide range between 1.17 in Ukraine in 2003 and 2.04 in Iceland in 2004. Sporadically and regionally the indicator can be even lower. In the lands of former German Democratic Republic it was only 0.84 in 1995. Intrinsic rates of increase of model populations with such fertility levels are negative. If we take for example Slovenian total fertility level of 1.2 in the year 2003, it is possible to calculate the intrinsic rate of -1.9 which means that model population with such a rate would diminish to a half of its original number in only about 37 years. European fertility decline is also evident in cohort total fertility rates. Completed fertility rates of a female birth cohort born in 1967 were predominantly under 2.0 with the exception of Albania, Armenia, Azerbaijan, Cyprus, Iceland, Ireland (1966), Norway, Serbia and Montenegro and Macedonia (RDDE, 2005, p. 89). Mainland Europe is well below 2.0. Its range was between 1.46 in Germany and 1.99 in France and Slovak Republic in 1967. Below replacement fertility is predominant characteristic of Europe at the beginning of the 21st century. Low fertility level in Europe is main cause for growing number of countries with negative natural increase. In 1980, only Austria and Federal Republic of Germany had negative natural increase of the population. Ten years later, number of births was lower than number of deaths in three countries (Bulgaria, Germany and Hungary). At the turn of the century, in 2003, the number was 19 (RDDE, 2006, p. 56) and it will grow in the future. Mean age of women at birth of first child has been increasing in recent decades in Europe. In spite of some minor problems with missing data it is possible to see considerable increase in the number of countries with the mean age of women at birth of first child of 25 years and more in the period 1970 - 2000. In the years 1970, 1980, 1990 and 2000 there were 4, 8, 16 and 23 such countries respectively in Europe. In 2004, 9 out of 28 countries with published data had the mean age of women at birth of first child of more than 28 years. There were no such countries in 1990 (RDDE, 2006, p. 87). Trends of the mean age of women at childbearing in Europe in the period 1960 - 2003 are more complex. They have been influenced by the decline of number of births of the higher parity birth orders, which have generally occurred at the higher age of the mother and by the increase of the mean age of women at birth of lower parity birth orders. These two forces have worked in opposite directions. Real trends in the period considered show quite similar pattern in majority of European countries. There was decline in the mean age of women at childbearing till the 1970s or early 1980s. Thereafter, the trend has changed the direction. The mean age of women at childbearing has started to increase. In 2004, there were 21 out of 32 countries with the published data where the mean age of women at childbearing was more than 28 years. In 8 countries the indicator was higher than 30 years. These countries were Andorra, Denmark, Ireland, Liechtenstein, the Netherlands, San Marino, Sweden and Switzerland (RDDE, 2006, p. 88). Some decades ago G. Calot and C. Blayo wrote about considerable homogenization of fertility in Western Europe (Calot and Blayo, 1982, p. 353). In recent decades the process has spread over the entire continent. However, it is far from universality. In 2000, the difference between the highest (Iceland) and the lowest (Ukraine) total fertility rate in Europe, Albania and Turkey excluded, was still practically one child per woman (RDDE, 2005, p. 76). Therefore, it is possible to conclude that below replacement fertility level is more important characteristic of European fertility than the homogenization. 3 Late age - specific fertility Late motherhood and postponement of childbearing in Europe are frequent research topics in recent years. Some examples of studies are Van Nimwegen et al, 2002, Sobotka, 2004, Ni Bhrolchain and Toulemon, 2005 and Billari, 2005. It is possible to see different meanings of the terms late motherhood and postponement in these studies. The Dutch authors use the term late motherhood in the meaning of postponement of the childbearing (Van Nimwegen et al, 2002. p. 10 - 16). Many other authors use the term postponement in two different meanings. The first meaning stresses the possibility of compensation of the fertility decline at younger ages with (at least partial) fertility rise at later ages. The second meaning most often simply refers to an increase in the mean age of first birth or in the mean age at childbearing (Ni Bhrolchain and Toulemon, 2005, p. 86). However, in almost all of these studies very little is said about late age -specific fertility in the last 15 years of female reproductive period. Late motherhood or late fertility can be understood also from the viewpoint of the late age -specific fertility. The author of this paper studied late age - specific fertility in Europe in the period 1961 - 1985 ^^alacic, 1994). Late age - specific fertility considerably decreased in the studied period in Europe. However, toward the end of the period in certain European countries some discontinuities and turnabouts in the prevalent tendency became evident. In spite of that, the author of the paper forecasted in the cited paper that the late age - specific fertility would retain a marginal share in total fertility in Europe in decades to come. Late age - specific fertility will be analysed in this paper on the basis of age - specific fertility rates in the age groups 35 - 39, 40 - 44 and 45 - 49 years and late total fertility rate in the age group 35 - 49 years which is defined as the sum of five years age - specific fertility rates in the 15 years age range multiplied by five. Late total fertility rate can be interpreted as the number of childbirths to the hypothetical cohort of women in the age group 35 -49 years under the condition that registered five years age - specific fertility rates in the age group 35 - 49 years refer to the given cohort of females. Late total fertility rate is not dependent on the age structure of the particular population. Age - specific fertility is relatively low after the age of 35 for different reasons. They work practically over the whole reproductive age span and can be grouped as follows: 1. intrauterine mortality, postpartum amenorrhea, ovulation without conception; 2. early sterility and possibly higher intrauterine mortality for older women; 3. extension of birth intervals by abstinence or prolonged breast-feeding and 4. birth control by contraception or abortion (Lutz, 1989, p. 7). The one and two groups are more important for natural fertility regime which can be illustrated by Hutterite fertility. Hutterite fertility rates for age groups 35 -39, 40 - 44 and 45 - 49 are 0.406, 0.222 and 0.061 respectively (Malacic, 2006, p. 55). Contraception and abortion are predominant causes of low levels of late age - specific rates in modern fertility regime of present day Europe. They are the consequence of more or less conscious decisions. Nine European countries and the period 1960 -2003 will be analysed in this paper. The countries will be divided in three groups. In the first group are Denmark, Sweden and United Kingdom. Their basic characteristic is higher fertility rate in the age group 35 - 39 in 2003 than in 1960. The second group includes France, Germany and Italy. Their size is the main reason for the selection from the Denmark Sweden UK1 France2 Germany3 Italy3 Poland Russia Ukraine4 f(35-39) 1960 39.2 39.6 44.1 50.6 44.2 61.2 58.4 53.6 39.6 1983 15.0 26.0 23.0 22.0 16.0 24.8 30.2 22.2 20.2 1993 33.4 43.4 33.4 32.2 21.6 30.4 25.4 10.8 12.0 2003 46.6 51.0 46.0 46.6 30.2 42.6 21.0 16.0 11.0 I2003/1983 310.7 196.2 200.0 211.8 188.7 171.8 69.5 72.1 54.4 I2003/1960 118.9 128.8 104.3 92.1 68.3 69.6 36.0 29.8 27.8 f(40-44) 1960 11.2 12.0 12.3 15.6 11.8 22.2 21.5 18.8 10.6 1983 2.4 4.6 4.4 4.4 2.6 5.2 7.4 4.8 3.6 1993 5.0 7.6 5.8 6.2 3.4 5.6 6.2 2.4 2.0 2003 7.6 9.6 9.4 9.2 4.8 8.0 4.8 2.8 2.0 I2003/1983 316.7 208.7 213.6 209.1 184.6 153.8 64.0 58.3 55.6 I2003/1960 67.9 80.0 76.4 59.0 40.7 36.0 22.3 14.9 18.9 f(45-49) 1960 0.8 0.8 0.8 1.0 0.8 1.8 2.3 2.8 1.0 1983 0.2 0.2 0.4 0.2 0.2 0.4 0.4 0.4 0.4 1993 0.2 0.2 0.2 0.2 0.2 0.2 0.4 0.2 0.0 2003 0.2 0.4 0.6 0.4 0.2 0.4 0.2 0.2 0.2 I2003/1983 100.0 200.0 150.0 200.0 100.0 100.0 50.0 50.0 50.0 I2003/1960 25.0 50.0 75.0 40.0 25.0 22.2 8.7 7.1 20.0 Table 1: Late age - specific fertility rates (in and indexes 2003/1983 and 2003/1960 for selected European countries in 1960, 1983, 1993 and 2003. Source: Recent demographic developments in Europe 2004, Council of Europe, Strasbourg 2005; Demographic Yearbook. Special issue: Historical supplement, United Nations, New York 1979 Notes: 1 1959 instead of 1960; 2 2002 instead of 2003; 3 2001 instead of 2003; 4 1961 instead of 1960. pool of North - Western, South and Central European countries with typical U shaped f35-39 in the studied period. The third group includes Poland, Russia and Ukraine. These countries represent Eastern Europe and to some extent Albania and Turkey. Late age - specific fertility has declined in this group practically over the whole period covered in the analysis. Late age - specific fertility rates for age groups 35 - 39, 40 - 44 and 45 - 49 and indexes 2003/1983 and 2003/1960 for selected European countries and for the years 1960, 1983, 1993 and 2003 are shown in table 1. Statistical sources do not cover all data demanded in the title of the table 1. Therefore, some minor data substitutions were necessary for particular years and countries. It should be outlined also that Germany, Russia and Ukraine have not been independent countries in today's state borders since the 1960s. Fortunately, statistical data for these three states in their present size are available for the whole period studied and can be analysed in this paper. Additionally, figures 1 - 3 are included in the paper to show graphical illustration of the late age - specific fertility trends for two rates f35-39 and f40-44 for the period 1960 -2003. The trends for the three groups of countries are shown separately. Late age - specific fertility rates in selected European countries show very low fertility in the age groups 40 - 44 and 45 - 49 years. Childbearing in the age group 45 - 49 years is really exceptional in modern Europe. In contrast, the age group 35 - Figure 1 : Sum, by five-year age groups, of late age-specific fertility rates f35-39 and f40-44 for Denmark, Sweden and UK for the period 1960 - 2003. Source: RDDE 2004, Council of Europe, Strasbourg 2005. 350 ^ ■ France f35-39 ■ France f40-44 • Germany f35-39 • Germany f40-44 ■ Italy f35-39 ■ Italy f40-44 Year Figure 2: Sum, by five-year age groups, of late age-specific fertility rates f35-39 and f40-44 for France, Germany and Italy for the period 1960 - 2003. Source: RDDE 2004, Council of Europe, Strasbourg 2005. Figure 3 : Sum, by five-year age groups, of late age-specific fertility rates f35-39 and f40-44 for Poland Russian Federation and Ukraine for the period 1960 - 2003. Source: RDDE 2004, Council of Europe, Strasbourg 2005. 39 years is not only dynamic but also still important for the procreation. Our three groups of countries behave differently in this respect. The difference is smaller between for-runners Denmark, Sweden and UK and France, Germany and Italy which represent mainland Europe than between these two groups and the Eastern European group. In the for-runners group we have only three European countries where f35-39 was higher in 2003 than in 1960. In other respects for-runners and mainland groups are quite similar. In all six countries three late age-specific fertility rates declined between 1960 and 1983. Thereafter, two of them, f35 -39 and f40-44, have increased considerably as indexes 2003/1983 show. The third rate, f45-49, remains more or less at very low level. The dynamics in the Eastern European group was very simple in the period considered. All three age-specific fertility rates declined and reached at the end of the period practically the same level of particular rates as other six countries had in 1983. In this respect Poland, Russia and Ukraine are latecomers. In Russia, some signs of turnarounds in 1993 were already visible in f35-39 and f40-44. For more complete elaboration of late age-specific fertility it is necessary to compare late and total fertility. Therefore, we have selected some indicators of late and total fertility for selected European countries for the years 1960, 1983, 1993 and 2003. The indicators are shown in table 2.These indicators are total late fertility rate (Tf,35+), Tf,35+ as a percent of Tf, percent change of the Tf, 35+ in the period 1983 - 2003 and percent change of the Tf in the period 1983 - 2003. For-runners and mainland groups show similar Denmark Sweden UK1 France2 Germany3 Italy3 Poland Russia Ukraine4 Tf, 35+ 1960 256 262 286 336 284 426 411 376 256 1983 88 154 139 133 94 152 190 137 121 1993 193 256 197 193 126 181 160 67 70 2003 272 305 280 281 176 255 130 95 66 Tf, 35+ As a % of Tf 1960 10.0 11.9 11.7 12.3 12.0 17.7 13.8 14.7 11.4 1983 6.4 9.6 7.8 7.5 6.6 10.1 7.9 6.6 5.9 1993 11.0 12.9 11.2 11.6 9.8 14.2 8.6 4.9 4.5 2003 15.5 17.8 16.4 14.9 13.0 20.4 10.6 7.2 5.5 ATf,35+ in 1983 - 2003 in % + 209.1 + 98.1 +101 .4 + 45.1 + 87.2 + 67.8 - 31.6 - 30.7 - 45.5 A Tf in 1983 -2003 in % + 27.5 + 6.2 - 3.4 + 5.6 - 5.6 - 17.2 - 49.0 - 36.8 - 41.7 Table 2: Selected indicators of late and total fertility for selected European countries in 1960, 1983, 1993 and 2003. Source: RDDE 2004, Council of Europe, Strasbourg 2005; Demographic Yearbook. Special issue: Historical supplement, United Nations, New York 1979. Notes: 1 1959 instead of 1960; 2 2002 instead of 2003; 3 2001 instead of 2003; 4 1961 instead of 1960. trends in selected late age-specific fertility indicators in the period 1960 - 2003. Total late fertility rate which is un-comparably lower in modern Europe than in the case of Hutterite fertility declined between 1960 and 1983 and has increased thereafter. Similar dynamics characterised total late fertility as the percentage of the total fertility in six countries of the for-runners and mainland groups. In 2003, however, the total late fertility had higher percentage of the total fertility than in the year 1960. In Italy, 20.4 % of period total fertility occurred in the age group 35 - 49 in the year 2003. Total late fertility as a percent of total fertility increased considerably in the period 1983 - 2003 in all six countries. The increase was un-comparable to the change of the total fertility rates of these six countries in the same period. In the period studied, total fertility increased in Denmark, Sweden and France and declined in UK, Germany and Italy. It is more than evident that late age-specific fertility trends in the for-runners and mainland groups of European countries indicate certain degree of postponement of childbearing at least in the period 1983 - 2003. The Eastern European group has had different development in the period studied. Total late and total fertility rates in this group of countries have had declining trends practically to the end of the period. In two decades of the period 1983 - 2003 total fertility declined more than the share of total late in total fertility in Poland and Russia. The opposite was true for Ukraine during the same period. There are no signs of any postponement of childbearing in Poland, Russia and Ukraine in the age groups 35 - 39 or 40 - 44 years yet.Late age-specific fertility differentials Late age-specific fertility differentials statistical data are rare in international as well as in national data sources. The data shortage is combined with the differences in periods covered and statistical definitions used in particular countries. The author of this paper uses international data sources mainly. Therefore, very limited picture of the late age-specific fertility differentials can be shown in the paper. Only three late age-specific fertility differentials will be used on the basis of the rates f35-39, f40-44 and f45-49. These three characteristics are birth order, rural-urban residence and marital fertility. Late age-specific fertility rates will be transformed in total late fertility by particular characteristic. The method of calculation and the analytical value of the indicator have been explained in previous section. In table 3 total late fertility by birth order parity for our selected European countries in the years 1985 and 2001 are shown. Russian Federation and Ukraine are omitted because of the lack of data. The highest birth order parity is 5+ which is appropriate for the selected countries. European females in the age group 35-49 years gave births of different birth order parity in the years 1985 and 2001. Some of them had their first child while others gave birth to the children of higher birth order parity. The highest values of the indicator in table 3 are for the second and the third birth order parity with the exception of France for year 2001. Parity specific trends in the period 1985 - 2001 followed the division of the selected countries in the three groups, again with the exception of UK, which is closer to mainland group than to the for-runners group. The for-runners countries Denmark and Sweden show increase practically in all birth order Country Year Parity Total 1 2 3 4 5+ Denmark 1985 105 16 32 33 15 8 2001 227 47 81 62 23 14 Sweden 1985 181 27 50 58 28 14 2001 281 66 97 70 28 20 UK1 1985 145 26 44 41 21 18 2001 198 51 75 42 17 13 France2 1985 181 39 47 41 21 23 2001 244 93 66 47 20 18 Germany3 1985 131 35 46 33 14 11 2001 183 56 71 35 12 9 Italy 1981 153 26 47 42 21 17 1997 236 66 101 48 14 7 Poland 1985 179 16 37 48 39 39 2001 115 11 25 31 20 28 Table 3: Parity-specific total late fertility in selected European countries in the years 1985 and 2001 (sum, by each parity in the group 35-49 years of age of mothers). Source: Demographic Yearbook 1986, United Nations, New York, 1988, pp. 791-815 and Eurostat. Notes: ^England and Wales for 1985; 2Marital fertility for 1985; 3Marital fertility and FR Germany for 1985. parities of the late fertility in the period studied. UK and the countries of mainland group experienced increase in the first three parities and decline for the fourth and 5+ parities. The only country with available data from the Eastern European group, Poland, shows decline for all parities in the period 1985 - 2001. In the for-runners and mainland groups of countries the highest relative increase happened in the case of the first parity. In France the first parity had the highest value of all parities in the year 2001. This trend clearly shows certain postponement of childbearing in these countries. Late fertility statistical data in table 3 are very good illustration of the low reproductive norms in the modern demographic regime. Parity specific values of late fertility for the and 5+ birth orders are very low and declining. This is very important for population policy. Parity specific population policy measures should be concentrated in younger ages of females (parents) and for the second and third birth order parity. Rural - urban and marital late fertility in selected European countries and selected years with available data are shown in table 4. Unfortunately, quite a lot of data are not available. Therefore, the table 4 is only partly informative. In Eastern European countries urban and rural late fertility declined in the period studied while the data on marital fertility are not available for Russian Federation and Ukraine. The case of France which is only other country with limited trend data shows increase of both urban and rural late fertility in the period 1982 - 1990. Similar trends are likely in other 5 selected countries of the for-runners and mainland groups of countries in spite of the fact that we do not have that data. In Eastern Europe rural late fertility is still higher than urban one, while it is impossible to say the same for other two groups of countries. Total late marital fertility data for countries with available data are rather old. The data are for the late 1970s and early or middle of the 1980s. Total late marital fertility increased in the period covered in all countries with the exception of Italy. However, this evidence is not very important because of considerable transformations of family structures in Europe in the 1970s and the 1980s. Simultaneously, de iure marriages lost importance in comparison with de facto unions in almost all countries selected. 4 Late fertility and late abortion In modern demographic regime family planning is very important part of the way of life. Different forms of male and female contraception are widespread. People usually do not rely on contraception when they want to conceive. However, from time to time even modern contraception fails and people are forced to choose between childbirth and abortion The frequency of legal induced abortion is an indicator of unwanted pregnancies and failed use of contraception. Abortion rate in particular country depends on legal status of induced abortion and political perception of the fertility level in the Country Urban-Rural fertility Marital late fertility Year Urban Rural Year Marital Tf, 35+ Denmark 1969 127 185 1978 30 1999 n.a. n.a. 1985 100 Sweden 1985 n.a. n.a. 1978 109 1999 n.a. n.a. 1985 188 UK1 1972 188 165 1977 148 1999 n.a. n.a. 1985 146 France 1982 178 144 1979 129 1990 228 178 1985 181 Germany2 1981 46 65 1978 115 1999 n.a. n.a. 1985 143 Italy 1981 n.a. n.a. 1977 208 1999 n.a. n.a. 1981 171 Poland 1985 151 268 1978 168 1999 110 171 1984 201 Russia 1989 120 212 n.a. n.a. 1999 60 90 n.a. n.a. Ukraine 1987 114 173 n.a. n.a. 1998 47 80 n.a. n.a. Table 4: Rural-urban and marital late fertility in selected European countries and selected years with available data (sum, by each characteristic in the group 35-49 years of age of mothers). Source: Demographic Yearbook 1986, United Nations, New York, 1988, pp. 655 - 657 and 876-877. Demographic Yearbooks 1992, 1997, 2000 and 2001, United Nations, New York. Notes: ^England and Wales in 1972. 2DR Germany in 1981 and FR Germany in 1978 and 1985. country. Induced abortion is legal in all selected European countries with the exception of Poland. However, political attitude and legal conditions for the realization of the abortion are quite different in particular countries included in this study. Selected abortion indicators for seven European countries are shown in table 5. Poland and Ukraine are omitted because of unavailable data. Very limited data are available for Russia which has very high total abortion rate. Abortion indicators are constructed in the same way as fertility indicators. There are six abortion indicators in the table 5. They are three age - specific abortion rates, a35-39, a40-44 and a45-49, general abortion rate, a15-49, total late abortion rate, Ta,35-49, and total abortion rate, Ta. The last column of the table 5 shows a percentage of the total late abortion rate in the total abortion rate. Abortion data in table 5 are directly comparable with fertility data in other tables of the paper. Total abortion rate increased in Sweden, UK and Germany and declined in Denmark, France and Italy in the period covered. In particular years between 1997 and 2001 which are included in the table 5 the percentages of Ta in Tf were 24.5 in Denmark, 35.9 in Sweden, 27.5 in UK, 22.5 in France, 17.9 in Germany and 26.9 in Italy (RDDE, 2005). In Russia, an index T^/Tf was 196.3 in 1995 which is incomparably higher than in other countries included in the study. It seems that abortion has replaced modern contraceptives as a dominant mean of family planning in Russia. Late abortion trends were also different in particular countries included in our study and in the period covered. The changes in UK and Italy were considerable. UK experienced increase while in Italy it was decline. The changes in other countries were modest. It was increase in Germany and decline in Denmark, Sweden and France. However, late abortion represents still considerable share of total abortion in the countries included in our consideration. The share was somewhere between one fifth and one quarter in most countries as is shown in column 9 in table 5. There are again two exceptions. Russia has the lowest and UK has the highest figure in column 9 of the table 5. UK experienced also the highest increase in the period 1985 - 1999. The percentage of total late abortion in total abortion increased from 12.6 in 1985 to 30.3 in 1999 in UK. It is more than obvious that late abortion is important mean for achieving low level of late age - specific fertility in all countries which are included in table 5. 5 Conclusion Age - specific fertility distribution has shown considerable changes in modern demographic regime. It seems that postponement of childbearing has been the most prominent feature of these changes in Europe in recent decades. The postponement has influenced at least partly late age - specific fertility as well. Total late fertility increased considerably in the two groups of countries and declined in Eastern European group in the period 1983 - 2003. However, for - runners in total late fertility increase are still rare in Europe. Generally, total late fertility was higher in 1960 than in 2003 in majority of European countries. Further increase in total late fertility in most of Europe is likely, Eastern Europe included. However, late age - specific fertility will very likely retain more or less marginal share of total fertility in modern demographic regime. Late age - specific abortion will considerably contribute to this result. Country Year a35-39 a40-44 a45-49 a15-49 Ta.35-49 Ta 7/8 in % Denmark 1985 13.3 6.7 1.0 15.6 105 546 19.2 2000 13.1 4.8 0.5 12.4 92 433 21.2 Sweden 1985 15.3 7.3 1.0 15.5 118 542 21.8 2001 16.0 5.7 0.6 15.8 111 563 19.7 UK1 1985 7.0 2.9 0.3 11.6 51 406 12.6 1999 14.5 10.5 3.3 13.7 141 465 30.3 France 1984 12.0 5.7 0.8 13.4 92 469 19.6 1997 11.4 4.9 0.5 11.1 84 389 21.6 Germany2 1985 6.2 2.4 0.4 5.3 45 185 24.3 2000 6.6 2.7 0.3 6.8 48 247 19.4 Italy 1982 21.9 9.6 1.0 16.9 162 590 27.5 1998 11.5 5.5 0.5 9.6 87 326 26.7 Russia - n.a. n.a. n.a. n.a. n.a. n.a. n.a. 1995 n.a. n.a. n.a. 58.5 390 2630 14.8 Table 5: Selected indicators of late and total abortion for selected European countries and for the years 1985 and 2000 or the nearest year with available data. Source: Demographic Yearbook 1986, United Nations, New York, 1988, pp. 212-219 and 1036-1038; Demographic Yearbooks 1988, 2000 and 2001, United Nations, New York; RDDE, selected issues, Council of Europe, Strasbourg. Notes: ^England and Wales in 1985; 2FR Germany in 1985. References [1] Billari, F.C. (2005). The Transition to Parenthood in European Societies. Conference Report, European Population Conference 2005, Demographic Challenges for Social Cohesion, Council of Europe, Strasbourg, 7 - 8 April 2005. 49 p. [2] Calot, G. and Blayo C. (1982). Recent course of fertility in Western Europe, Population Studies, 1982/3, pp. 349 - 372. [3] Eurostat home page, November 2005 and 2007. [4] Lutz, W. (1989). Distributional Aspects of Human Fertility. A global comparative study, Academic Press, London. Xi + 282 p. [5] Malacic, J. (1994). Kasni fertilitet i savremena reprodukcija stanovnistva (Late fertility and modern reproduction of the population). In: Maksimovic, I., ed. Prilozi demografskim i ekonomskim naukama (Contributions to the Demographic and Economic Sciences), SANU (Serbian Academy of Art and Sciences), Monografije, Vol. 103, Beograd. Pp. 189 -198. [6] Malacic, J. (2006). Demografija. Teorija, analiza, metode in modeli (Demography. Theory, analysis, methods and models), 6. izdaja, Ekonomska fakulteta, Ljubljana, viii + 339 p. [7] Ni Bhrolchain, M. and Toulemon, L. (2005). Does Postponement Explain the Trend to Later Childbearing in France? In: Vienna Yearbook of Population Research 2005, Vienna Institute of Demography, Austrian Academy of Sciences. Pp. 83 - 107. [8] Recent demographic developments in Europe (RDDE), different yearly issues, Council of Europe, Strasbourg. [9] Sabotka, T. (2004). Is lowest - low fertility in Europe explained by the postponement of childbearing. Population and Development Review, 2004/2, pp. 195 - 220. [10] Van Nimwegen, N. et al (2002). Late motherhood in the Netherlands: current trends, attitudes and policies, Genus, 2002/2, pp. 9 -34. [11] United Nations (1979). Demographic Yearbook. Special issue: Historical supplement, New York. [12] United Nations, Demographic Yearbook, different issues, New York Methodology for the Estimation of Annual Population Stocks by Citizenship Group, Age and Sex in the EU and EFTA Countries Jakub Bijak and Dorota Kupiszewska Central European Forum for Migration and Population Research (CEFMR) ul. Twarda 51/55, 00-818 Warsaw, Poland E-mail: j.bijak@cefmr.pan.pl, d.kupisz@cefmr.pan.pl www.cefmr.pan.pl Keywords: population estimates, stocks by citizenship, Europe, missing data, cohort-wise methods, fitting methods Received: March 9, 2008 The paper addresses selected computational issues related to the challenge of dealing with poor statistics on international migration. Partial results of the on-going Eurostat-funded project on "Modelling of statistical data on migration and migrant population" (MIMOSA) are presented. The focus is on the data on population stocks by broad group of citizenship, sex and age. After a brief overview of the main problems with data on population by citizenship for 31 European countries (27 European Union countries, Iceland, Liechtenstein, Norway and Switzerland), a range of computational methods is proposed including cohort-wise interpolation, cohort-component projections, cohort-wise weights propagation and proportional fitting methods, as well as other, auxiliary methods. The algorithm for choosing the best method for estimating missing data on population stock by broad citizenship (nationals, foreigners - EU27 citizens, foreigners - non EU27 citizens), five-year age group (up to 85+) and sex on 1st January 2002-2006 is proposed and illustrated by examples of its application for selected countries. Povzetek: Opisane so različne metode za ocenjevanje demografskih podatkov. 1 Introduction The deficiencies of statistical information on migration-related variables, such as population flows or stocks, are well-known and widely discussed in the literature [1, 7]. The aim of the paper is to contribute to the works on dealing with these shortcomings and to propose a set of computational methods, as well as an algorithmic procedure of selecting the best one, for the estimation of population stocks as of 1st January in a breakdown by sex, age group and broad citizenship category, for the countries for which information is unavailable or incomplete. The study was carried out within a Eurostat-funded project on "Modelling of statistical data on migration and migrant population" (MIMOSA). It covers 31 European countries, of which 27 belong to the European Union (as per 1st January 2007), and further four - to the EFTA (Iceland, Liechtenstein, Norway and Switzerland). The period of interest is 2002-2006. The citizenship groups under study are: nationals, European Union (EU27) foreigners and non-EU27 foreigners, while the age groups are five-year, with the last, open-ended category being 85 years or more. Generally, the proposed estimation methods aim to combine data from different sources (population census, vital statistics, data on acquisition of citizenship, specific surveys, etc.). In principle, the data that are already available are not modified (for example, in order to harmonise definitions, or for any other reason), unless in the case of inconsistencies between the sources. In the latter cases, the demographic data, provided to Eurostat by national statistical institutes (NSIs), are given priority. Apart from the Introduction, the paper is structured in four sections. Section 2 contains summary information on the availability and quality of the 2002-2006 data on population stocks for 31 countries under study. In Section 3, the proposed methodology for estimating population stocks by sex, age and citizenship groups is discussed. This section presents such tools as estimation of data in single years of age from five-year age-groups, cohort-wise interpolation of population stocks, cohort-component projections, cohort-wise propagation of weights, proportional fitting, as well as other, auxiliary methods. Subsequently, Section 4 contains recommendations concerning the procedure of selecting appropriate estimation methods for each of the countries under study, presented in the form of a decision algorithm and accompanied by several illustrative examples for selected countries. The discussion is summarised in Section 5. The study is based on the data available in the Eurostat databases, supplemented by additional information obtained from national statistical institutes, whenever required and feasible. Throughout the paper, the abbreviation 'NSI' is used to denote the national statistical institute of the respective country, 'JMQ' stands for the Joint Questionnaire on International Migration Statistics (hereafter: Joint Migration Questionnaire) of Eurostat, UN Statistical Division, UN Economic Commission for Europe, the Council of Europe and the International Labour Office. 'LFS' depicts the Labour Force Survey. 2 Availability of the 2002-2006 data on population stocks for 31 European countries Annual statistics on usually resident population by citizenship, sex and age are collected by Eurostat from the NSIs via the Joint Questionnaire on International Migration, together with migration flow data. Population statistics for 37 European countries, collected through the JMQ are checked and subsequently loaded into Euro-stat's on-line database, NewCronos. The data are located under the Population and Social conditions theme, in the International Migration and Asylum domain (MIGR), tables migr st_popctz (population by sex and citizenship) and migr_st_popage (population by age group, citizenship and sex). The data for 2000-2006 come from the following tables in the 2000-2006 JMQs: ■ Table 7a (for 2000-2003, 8a): Usually resident population by citizenship and age, both sexes; ■ Table 7b (for 2000-2003, 8b): Usually resident population by citizenship and age, males; ■ Table 7c (for 2000-2003, 8c): Usually resident population by citizenship and age, females. A detailed analysis of statistics on population stocks by citizenship provided by the 31 countries covered the JMQs for the reference period 2002-2006. Selected results of the analysis of the data availability for particular countries are summarised in Table 1, providing an overview of the situation for all 31 countries. The information on the lack of data, marked as 'not available' in Table 1, was based on the information provided in the JMQ or on information obtained during the THESIM project1. Other missing data were marked as 'not provided to Eurostat'. In addition to missing data, a number of other problems were detected, for example the presence of provisional data, some citizenship categories only, broad age groups, or a different reference date than 1st January. Data on total population stock on 1st January, not disaggregated by citizenship, are also collected by Eurostat within the framework of the Annual Demographic Statistics data collection. These data, disaggregated by sex and age, are located under the Population and social conditions theme, in the Demography (DEMO) domain of the database, table demo_pjan. The results of the review of the availability of these data for the years 20022006 revealed that the data on total population stock by sex and five-year age group (up to 85+) are available for all 31 countries, with the following exceptions: there is no 2006 data by age for Belgium and Italy, while for Romania the highest age group in 2004 data is 80+. 1 Research project THESIM: Towards Harmonised European Statistics on International Migration, funded by the European Commission through the Sixth Framework Programme and executed by a research consortium led by Groupe d'étude de démographie appliquée (GéDAP), Université Catholique de Louvain. In addition to the annual data, Eurostat also collects and disseminates statistics on population by citizenship, sex and age obtained by the countries during population censuses. Like other statistics, the census data are located under the Population and social conditions theme, in the Census (CENS) domain of the database, table cens nsctz. Unlike annual population figures, the census data on population by citizenship, sex and age are available for almost all 31 countries, with the notable exceptions of the United Kingdom, Germany and Malta. A supplementary source of data on population stock by citizenship is the Labour Force Survey. However, the availability of data from the LFS in the Eurostat database is very limited and the reliability of data is probably not high, due to the nature of the data source. By definition, the LFS statistics are estimates and thus bear certain errors, which can be relatively high for disaggregated categories (e.g., for population broken down by age, sex and citizenship groups). However, some use of the LFS data could be considered as an alternative to the proposed methods in the countries where data on total nationals and total foreigners are missing. In the Eurostat database, the LFS tables are located under the Population and social conditions theme, the Labour market (LABOUR) domain, in the table with population data containing the nationality dimension (population by sex, age groups, nationality and labour status, table lfsa_pganws). However, the table does not contain data on the level of individual countries of citizenship and only data on total population and on nationals could be useful for this project. Estimates of the 2002-2006 stock of the EU27 citizens cannot be prepared using the LFS tables in the Eurostat database. These considerations need to be taken into account when proposing computation methods for the current study. 3 Proposed methods of estimating population stocks by citizenship, sex and age The current section presents a theoretical background of the methods proposed for the calculations of the missing elements in population stocks by age, sex and citizenship group. After a brief summary of the notation, the following methods are discussed: interpolation of five-year into one-year age groups, regarded as a preparatory method (Section 3.2), followed by cohort-wise interpolation of population stocks (3.3), cohort-component projections, traditionally used in demography (3.4) and cohort-wise weights propagation (3.5). Further, Section 3.6 describes selected proportional fitting methods, which category encompasses three approaches, depending on the availability of information: the proportional adjustment, direct proportional fitting and iterative proportional fitting. Section 3 concludes by presenting some auxiliary methods for dealing with the Unknown categories, and for the estimation of missing elements of age distributions (3.7). Country 2002 2003 2004 2005 2006 Austria + + + + - Belgium + x - - - Bulgaria ..... Cyprus dref - - - - Czech Republic + + + + + Denmark + + + + + Estonia na na na na - Finland + + + + + France . . - - Germany for, broad age, ±agesex, i ±age, i for, i i P, i Greece . . i for, ±sex for Hungary for for + + + Ireland p, ±ctz, broad age, dref p, ±ctz, broad age, dref p, ±ctz, broad age, dref p, ±ctz, broad age, dref p, ±ctz, broad age, dref Italy dref - - -age -age Latvia .age + + + + Lithuania - -ctz -ctz + + Luxembourg . . tot, nat ±ctz, ±age, ±sex ±ctz, ±age, ±sex Malta . . . . . Netherlands + + + + + Poland dref - -ctz - Portugal p, for, -age p, for - - - Romania dref + + + Slovakia . - for i i Slovenia + + + + + Spain + p + + Sweden + + + + + United Kingdom . ±ctz, dref ±ctz, dref, a70 ±ctz, broad age, dref - Iceland + + - - - Lichtenstein . . . . . Norway + + + + + Switzerland + + + + + + data provided to Eurostat; - data not provided to Eurostat; -age no disaggregation by age; -ctz no disaggregation by citizenship; ±age age disaggregation only for a few citizenship categories; ±agesex disaggregation by age not provided for Males and Females; ±ctz data provided for a few citizenship categories; ±sex disaggregation by sex provided for a few citizenship categories only; a70 age provided only until 70 years, with the open-ended group 70+; broad age data disaggregated by broad age groups; dref reference date different than 1st January; for data provided for foreigners only; i data inconsistency problems; na data not available; nat data provided for nationals; p provisional data; x problems detected in the data sent by the NSI, tot data provided for Total. Table 1 : Availability of data on population stock by citizenship, sex and age in the JMQ, 31 countries, as of 1st January 2002-2006. 3.1 Notation and basic concepts Throughout the paper, the notation used for population variables follows a common convention presented below. In all cases, the superscript n indicates one of the three broad groups of citizenship: nationals, EU27 foreigners or non-EU27 foreigners, abbreviated as N, EU and nEU, respectively, thus reflecting the composition of the European Union as of 1st January 2007. The non-EU27 group includes also the stateless persons. An abbreviation FOR is used for all foreign population (EU27 and non-EU27 together). For the transparency of presentation, the country index is skipped, as all calculations proposed in the paper are always country-specific. The variables in question are as follows: Stock variables: Pn(x, t) - Population in broad citizenship group n, in the age of x years on 1st January, year t. Pn(x, c) - Population in broad citizenship group n, in the age of x years at the census date c. Event variables: Bn(f) - Births during calendar year t in citizenship group n; Dn(x, f) - Deaths of persons aged x years, belonging to citizenship group n, during calendar year t; In(x, t) - Registered immigration of persons in citizenship group n, aged x years, during calendar year t, regardless of the country of origin; En(x, t) - Registered emigration of persons in citizenship group n, aged x years, during calendar year t, regardless of the country of destination; Rn(x, t) - Outcome of the regularisation of the status of formerly irregular residents (cf. [4]) aged x, in year t, by definition referring only to foreigners, n e {EU, nEU}, thus with RN(x, t) = 0; Sn(x, t) - Statistical adjustment (official correction) of the size of population aged x, in year t, due to the reasons other than regularisations; ^n(x, t) - Acquisitions of citizenship by the population aged x, in year t, by definition referring only to foreigners, n e {EU, nEU}, with AN(x, t) = 0. Unless noted otherwise, age is reported in years reached during a given calendar year, and thus the events in question (deaths, migrations, citizenship changes, etc.) correspond to parallelograms with vertical sides on the Lexis diagram. An illustration of the relevant concepts on a Lexis plane is shown in Figure 2, in Section 3.4. Whenever necessary, the index denoting sex is added as an additional subscript g e {m, f} for males and females, respectively, e.g. Pf"(x, t) refers to female population stock, and Dmn(x, t) to deaths among males. In order to distinguish five-year age groups, an additional left-hand side subscript '5' is added. For example, 5Pm"(x, t) refers to male population belonging to broad citizenship group n which was in the age of [x, x+5) years on 1st January of year t. The same principle applies to almost all event variables (D, I, E, R, S and A), with a clear exception of B. In some instances, for the clarity of presentation, the summation of a particular variable over a given index is indicated by an asterisk in a respective place, e.g. AnEU(*, t) = Sx A"EU(x, t) refers to all acquisitions of citizenship by non-EU27 foreigners in year t, regardless of age. Similarly, I*(x, t) = S„ I"(x, t) denotes all immigrants aged x, in year t, irrespective of their citizenship, and D*(*, t) = S„ Sx D"(x, t) refers to all deaths registered in year t, without respect to nationality or age. It has to be noted that in several cases the summation over n involves only two components, e.g. n e {EU, nEU} for R"(x, t) and An(x, t). 3.2 Interpolation of five-year age groups into one-year groups Among the preparatory steps for the estimation of missing data, the most frequent problem concerns disaggrega-tion of five-year age groups of population (or events) into single years. This has to be performed in order to enable cohort-wise interpolations, cohort-component projections with yearly steps, or cohort-wise weights propagation, as described in Sections 3.3, 3.4 and 3.5. If auxiliary information is available from a different source (e.g. from a census, from the previous or next year, etc.), the population size or the number of events can be disaggregated using a 'Prorating' method [11, p. 5-61], whereby the relative distribution from the auxiliary source is imposed on the data in question. The obtained distribution might need to be further adjusted to marginal totals, by means of proportional fitting procedures, described in Section 3.6. If the data on population stocks by sex, broad citizenship group and five-year age group 5Pn(x, t) are available and the stocks by sex and one-year age group P*(x, t) are also known, then, assuming no other information about the distribution by single years, we can estimate the missing distributions for particular c;itizenship groups proportionally, that is as: P"(x+/, t) = 5P"(x, t) ■ P*(x+/, t) / / 5P*(x, t). This is an example of the application of the direct proportional fitting described in Section 3.6.2. If none of the above information is available, the proposed methodology is to use the well-known interpo-lative four-term third-difference solution of Karup and King [11, p. 5-65]. For each five-year group, the disag-gregation into fifths is done via applying multiplicative coefficients to the global value of this group and the neighbouring ones. Different multipliers are used for the first group, the middle groups and the last group, as set forth in Table 2. For example, if we want to split a middle five-year group with population Ni into five single-year groups n1, n2, ns, n4, n5, then: n1= 0.064 Ni-1 + 0.152 N, - 0.016 N,+1 , n2= 0.008 Ni-1 + 0.224 N, - 0.032 N,+1 , etc. When Karup-King multipliers are used, the condition N, = n1 + n2 + n3 + n4 + n5 is automatically fulfilled. First group, N0 Middle groups, N, Last group, Nk No Ni N2 N,+1 Nk-2 Nk-1 Nk First fifth +0.344 -0.208 +0.064 +0.064 +0.152 -0.016 -0.016 +0.112 +0.104 Second fifth +0.248 -0.056 +0.008 +0.008 +0.224 -0.032 -0.032 +0.104 +0.128 Third fifth +0.176 +0.048 -0.024 -0.024 +0.248 -0.024 -0.024 +0.048 +0.176 Fourth fifth +0.128 +0.104 -0.032 -0.032 +0.224 +0.008 +0.008 -0.056 +0.248 Last fifth +0.104 +0.112 -0.016 -0.016 +0.152 +0.064 +0.064 -0.208 +0.344 Source: [11], Table C-1, p. 5-69. Table 2: Coefficients for the Karup-King interpolation formula. As an alternative to the Karup-King interpolation, the six-term fifth-difference interpolative formulae of Sprague or Beers can be applied, which however use information from more surrounding groups. Methodological details can be found in Shryock et al. [11]. In our case, the Karup-King interpolation is recommended for the sake of simplicity. For variables depicting non-vital events, like migration or citizenship acquisitions, the estimates for particular cohorts can be obtained from two neighbouring period-age estimates yielded by the Karup-King formula, split equally by half. For the first cohort, we can assume that a half of the relevant period-age events concern the cohort in question, while for the last, open-ended cohort, we can add up the period-age estimate for the open-ended group and a half of the events concerning the age group immediately preceding the last one. The underlying rationale is an assumption that non-vital events are equally spread over the period-age squares of the Lexis diagram. In any case, the estimates for the eldest cohorts would be close to zero for all practical migration-related applications. Regardless of the method, if the disaggregation is performed on data broken down by sex or citizenship, the final estimate might need to be obtained by proportional fitting methods (described in Section 3.6), in order to ensure the summation to available marginal totals. 3.3 Cohort-wise interpolation of population stocks Given the information on the age structures of the population for two non-adjacent moments of time, a simple idea to obtain the missing figures for in-between moments would be to apply interpolation techniques. In this case, we propose cohort-wise interpolation for all cohorts apart from the youngest and oldest one, which are discussed separately. Overall, this method requires much less information on input than the cohort-component projections presented in the next section, but it requires information about population both before and after the moment for which the estimates are to be done. The in-terpolative approach is recommended for the cases where (a) the span between two points with available data is not wide (say, two-three years), and (b) no information on the distribution of vital and migratory events by citizenship is available. In practical applications, as the ones described in Section 4, it often happens that the data are available for year t from the census conducted at time c (t < c < t+1), and for 1st January of the year t+k, not immediately following the census. Such situations can be put in a general framework illustrated on a Lexis diagram in Figure 1, where a denotes the fraction of a year remaining after the census until 31st December. Figure 1 and the methodology proposed below cover also the situations when data come from other sources than the census, and the situations when the reference date of the data for year t is 1st January. In the latter case it suffices to set a = 1. For the cohorts already existent at the census date c, the interpolation can follow various patterns, but an arithmetic and geometric pattern of growth [3, 10] will be the most frequent choices. As noted by Rowland [10, p. 50], "under arithmetic growth, successive population totals differ from one another by a constant amount [, while] under geometric growth they differ by a constant ratio". For short-period interpolations, both approaches should yield similar results, although this is an empirical issue, and there is no convincing argument in favour of either of them. Hence, a selection of appropriate methods should rely on case-specific judgements. Age x+k+1 y (1-a) ■ P"(x+1, c). a^ P"(x, €)' 7 P"(x+k, t+k) x+k x+2 x+1 1.01. t c 1.01. t+1 1.01. 1.01. t+k 1.01 Year (1-a) a Source: Own elaboration. Figure 1: Cohort-wise interpolation of population stocks: a general idea. It has to be noted that the cohort aged x completed years on 1st January t+k was split at the census date between two age groups: the younger one (aged x completed years) and the older (aged x+1), as shown in Figure 1. Therefore, the interpolative estimate of P"(x, t+i) depends on P"(x, c), P"(x+1, c) and P"(x, t+k). Given the above, the formula for an interpolative estimate of population sizes belonging to a particular age group x+i and citizenship group n, assuming the linear pattern of change, is as follows: Pn(x+i, t+i) = (k-i) / (k-1+a) ■ [a ■ Pn(x, c) + (1-a) ■ ■ Pn(x+1, c)] + (i-1+a) / (k-1+a) ■ Pn(x+k, t+k), (1a) while for the geometric change: Pn(x+i, t+i) = {[a ■ Pn(x, c) + (1-a) ■ Pn(x+1, c)]k-' ■ ■ Pn(x+k, t+k)'-1+a}1 / (k-1+a). (1b) For the youngest and oldest cohorts, for which interpolation as proposed above is not possible, a simplified solution is proposed. In such cases, we suggest to take the average shares (proportions) of the sizes of the respective age groups in the total population, calculated from the data available for neighbouring periods, weighted by the distance between the available data points and estimation point. In order to ensure consistency of the results and summation of the age-specific estimates to the marginal totals by sex or citizenship group, whenever available, the estimates have to be adjusted by the means of proportional fitting, presented in Section 3.6. The framework presented above can be easily generalised to a much less frequent situation with interpolation between two censuses - in such case, a fraction ß of a year between the 1st January of the year of the second census and the second census date, c', should be additionally accounted for. However, the estimates obtained in such cases would be only very approximate, due to a usually large time span between the censuses. It should be noted that an identical solution as shown above in (1a), or in (1b) can be used for extrapolating cohort sizes beyond the available data points, in whichever direction. In either case, it would suffice to put an appropriate integer i < 0 for the backward extrapolation (in particular, following the example from Figure 1, set i = 0 to obtain values for the beginning of the census year), or i > k for the forward extrapolation. The methods discussed above resemble to some extent the ones presented in the Human Mortality Database Methods Protocol [15], with the exception of the oldest age groups, where the quoted study suggests more sophisticated extinct cohort and survivor ratios approaches. Direct application of the methods proposed by Wilmoth et al. [15] would be, however, difficult. This is not because of computational reasons, but rather due to the lack of yearly estimates of deaths, births and migratory events broken down by citizenship groups, which has been listed at the beginning of the current section as a precondition for selecting cohort-wise interpolation method. 3.4 Cohort-component projections As concerns projections, let us denote by Xn(x, t) a sum of all event variables not related to the natural change of population stocks (i.e. all but births and deaths), thus: ^"(x, t) = In(x, t) - En(x, t) + SSn(x, t) + Ske {E^. nEU} Ak(x, t), for n = N; (2a) Xn(x, t) = In(x, t) - En(x, t) + Sn(x, t) + Rn{x, t) - An(x, t), for n 4- N. (2b) Given (2a) and (2b), the population accounting equations for each broad citizenship group are: Pn(0, t+1) = Bn(t) - Dn(0, t) + Xn(0, t); (3a) Pn(x, t+1) = Pn(x-1, t) - Dn(x, t) + ^"(x, t), for x e {1, 2, ..., xmax-1}; (3b) Pn(xmax, t+1) = [Pn(xmax-1, t) + Pn(xmax, t)] - Dn(xmax, t) + + Xn(xmax, t). (3c) In (3c), xmax stands for the highest (open-ended) age group for which information is available. Note also that deaths and other event variables in age group xmax refer to the trapezoid on the Lexis diagram rather than to a parallelogram, while for age group 0 - to a right triangle, as shown in Figure 2. Age m xmax xmax 1 n ^ P (xmax, t) / -Dn(xmax, t) y Pn(xmx, t+1) / Pn(xmax-1, t) / 1.01. t-1 1.01. t 1.01. t+1 Year Age x+1 x-1 Pn(x-1, t) / / Pn(x, t+1) Pn(0, t+1) 0 1.01. t-1 1.01. t 1.01. t+1 Year Source: Own elaboration. Figure 2: Relationships between population stocks Pn, and events Bn, Dn and X" on a Lexis diagram. Under the assumptions presented above, the projection is made following the equations (3a), (3b) and (3c) for consecutive years, on the basis of information available for single-year age groups, decomposed from the five-year groups, if needed. Note that the default citizenship of a newborn child can differ between the countries, either following the ius soli principle, whereby a child acquires the citizenship of the country of birth, or ,us sangu,n,s, according to which a child inherits the citizenship of its parent(s), or finally a mixture of those two, for example differentiating between the generations of migrants, taking into account the length of stay in the country, etc. The general rules are as follows: a) Ius sanguinis If the child gets citizenship of any of the parents, then B"(t) in equation (3 a) may be assumed to be roughly proportional to P"(t). If the child acquires citizenship of the mother and we have no separate estimate of fertility for nationals and foreigners, then B"(t) may be assumed to be roughly proportional to Pf"(t). If the estimates of fertility by broad citizenship and age of mother exist then a better estimate may be obtained using the formula: = B*(t) Sx f"(x) Pf "(x, t) / S,, x fk(x) Pfk(x, t), (4) first and the last one in each year. In particular, this situation applies if the following four conditions hold: 1. Total population by age, 5P*(x, t), is known for successive years, but the citizenship structure is missing; 2. We may assume that the distribution of deaths and migration flows by broad citizenship is the same as the citizenship composition of the population; 3. Acquisitions of citizenship may be ignored; 4. There was no regularization, or it may be ignored. In such cases, the projection equation (3b) combined with proportional fitting is equivalent to proportional decomposition of 5P*(x, t) by citizenship group described in Section 3.6.1. The estimations can be performed using the formula: 5P"(x, t) = 5P*(x, t) ■ 5P"(x-1, t-1) / 5P*(x-1, t-1). (5) The first and the last cohort may be disaggregated using the citizenship composition of the first and last age group in the previous year. In such cases, the following formulas apply: 5P"(0, t) = 5P*(0, t) ■ 5P"(0, t-1) / 5P*(0, t-1), or: 5P"(Xmax, t) = 5P*(xmax, t) " 5P"(Xmax, t-1) / 5P*( x^ (6a) t-1). (6b) where f "(x) denotes age-specific fertility rates for women in age group x, belonging to the group of citizenship n. If the estimates of fertility are available by broad citizenship group, but not by the age of mother, the formula (4) would have to be modified, so as the summation over age reflects only the female population aged 15-49 years. b) Ius soli If the child automatically acquires the citizenship of a given country, then the balance equation for the youngest age group, (3a), becomes, depending on the citizenship in question: PN(0, t+1) = B*(t) - Dn(0, t) + XN(0, t), for n = N; (3a') P"(0, t+1) = X"(0, t) - D"(0, t), for n + N. (3a") In mixed cases, it is recommended to project one part of births according to formulas for ,us sol, and another part according to the ius sanguinis principle. Note also that losses of citizenship are not accounted for, as they in most instances concern persons in reality either already living abroad, or emigrating (and counted in E). For acquisitions of citizenship, we assume that non-nationals fall in the category of nationals upon naturalization, in order to count the same people only once, regardless of the number of citizenships they have. If the breakdown by citizenship group of all variables referring to vital and migratory events can be assumed proportional to the citizenship structure of the population at the beginning of each year, then the projection methodology can be often de facto simplified to proportional adjustment / decomposition, whereby the citizenship distribution of the considered cohort in the previous year would directly apply to all cohorts except the 3.5 Cohort-wise weights propagation In some cases, too much information on the age-sex-citizenship distribution of the components of population change is missing, which renders projections too dubious with respect to the number of assumptions that need to be made. In practice, in such instances the only reliable information comes from the population census and from annual population stocks available in the DEMO domain of the NewCronos database. Hence, the proposed procedure is as follows. For the census population, apply the structure by citizenship, taken from each five-year age group, to the respective single-year age groups (i.e. from age group 04 to single ages 0, 1, 4; from 5-9 to 5, 6, 9 etc.). Let w"(x, c) = P"(x, c) / P(x, c) denote the age-specific shares ('weights') of citizenship group n in the census. Further, set a as a fraction of the calendar year before the census date. It is implicitly assumed that the census population in single-year age groups can be divided between 'older' and 'younger' cohorts using the a and (1-a) partition. For the census date, use the following formula to calculate the share of citizenship group n in the cohort that was aged x years on 1st January of the census year: w"(x+a, c) = [(1- a) ■ P"(x, c) + a ■ P"(x+1, c)] / / [(1- a) ■ P'(x, c) + a ■ P*(x+1, c)], for x < x^ax; (7a) w"(xmax+a, c) = P"(xmax, c) / P*(xmax, c). (7b) For the 1st January of the census year assume that the weights w"(x, t) = w"(x+a, c). For the 1st January of the year following the census year (t > c), assume in turn: w"(x, t) = w"(x-1+a, c), for 0 < x 1). The starting value k = 1 defines also the initial estimate of the joint sex-age-citizenship distribution, Pg(1) "(x, t) = Pg ' "(x, t). Subsequent steps are computed as follows: Pg('k) "(x, t) = P. = p (2k-1) " '(x, t) • Pg(x, t) / P (2k-1) (x, t); (12a) Pg^'k+1) "(x, t) = Pg('k) "(x, t) • P."(*, t) / P. (2k) "(*, t). (12b) The procedure defined by (12a) and (12b) is repeated iteratively till some convergence criterion is achieved. For example, the estimates yielded by consecutive steps should differ by no more than by an arbitrarily-selected small number s. More details of the method have been discussed by Willekens [13, pp. 69-71], Willekens et al. [14], Rees [8] and Norman [6]. Although the IPF method is purely mechanical, its main advantage is that it does not require any additional information (such as data on vital events or migration) or excessive labour resources, and the obtained results (in terms of joint distributions by all variables under study) are automatically coherent with marginal distributions of particular variables. Moreover, under some general assumptions, the IPF estimates can be interpreted from a statistical viewpoint as joint probability distributions obtained using the maximum likelihood or entropy maximisation methods [2, pp. 83-97; after: 13, p. 70]. category, appropriate constraints should be set during the estimation procedure. In either case, when adjustment to broader age groups is needed in order to ensure summation to respective totals (e.g. for functional age groups), it can be done via proportional fitting presented in Section 3.6. 4 Estimating population stock for EU27 and EFTA countries The current section briefly summarises the algorithm for the selection of an appropriate method of computations for a given country (Section 4.1), followed by a brief illustration of the proposed approach employed for the 31 countries under study, and a selection of the results (4.2). 3.7 Auxiliary methods Among the auxiliary methods proposed in the current study, the foremost one is the decomposition of the Unknown category wherever it appears (i.e., with respect to age, citizenship, or even sex, as in the case of Greece for 2005). The universal solution proposed in such cases is a proportional disaggregation: population belonging to the Unknown category is broken down proportionally to the existing, well defined categories (citizenship groups, age groups, etc.) and the resulting parts are attached to these categories. For example, if total population P consists of n well-defined groups Pi, P„, and the Unknown category, Punk, such that P = Si Pi + Pu,é, where i = 1, n, then the following corrections apply: P'j = Pj ■ Punk • Pj / Si P, = Pj (1 + Punk / Si Pi) , for all j, with i = 1, n. (13) If some elements of age structures are missing (e.g. tails of respective age distributions, or a breakdown into five-year groups given the availability of broader ones), we may either use a structure from a different year or fit a mathematical function to available data. For example, we can assume that foreign population stocks are a double-exponential function of age, as originally proposed for the intensity of migration flows by Rogers and Castro [5, 9]. The number of foreign population aged x, ^(x), would then be given by the following equation: ^(x) = c + a1 • exp(-a1 • x) + a2 • exp{-a2 • (x - /u2) + - exp[-l2 • (x - ^2)]}. (14) The parameters c, a1, a1, a2, a2, I2 and ^2 can be estimated separately for each sex, for example using the ordinary least squares method (OLS) on the basis of the data for the available age groups (for example, below 65 years of age). Technically, the calculations can be done in a spreadsheet (e.g. MS Excel) using a solver-like tool, controlling for sensitivity of the algorithm to the choice of initial input values. Based on the obtained parameter estimates, formula (14) yields approximations of ^(x) for the remaining age groups. The last, open-ended group (85+) can be obtained by subtraction of all other figures from the total. To avoid negative numbers in the 85+ 4.1 Procedure for selecting an estimation method In the light of the overview of data availability presented in Section 2 and the methodological discussion presented in Section 3, it is suggested to inspect the following general options of data availability, in order to apply the relevant data estimation procedures: Option 1. All the required data are available in the Eurostat database Whenever all data are available in the Eurostat database, the following five-step procedure is recommended: 1. Organize the data in a database; 2. Verify the data (perform data validation and internal consistency checks); 3. Deal with the Unknown categories (if applicable); 4. Calculate the required aggregates; 5. Check the results. This option includes cases when there is a need for combining data from various parts of the Eurostat database (e.g. in DEMO and in JMQ), and the cases where there is an 'Unknown' category, which has to be disaggregated proportionally among the well specified categories, as described in Section 3.7 on 'Auxilliary methods'. Option 2. Some of the data missing in the Eurostat database can be obtained from the respective NSI or from other sources In this case, two situations are possible: Option 2a. All the missing data may be obtained without contacting the NSI If all the missing information is publicly available, for example from the NSI webpage, it should be downloaded and combined with the Eurostat data. Such an overall dataset should be then subject to a procedure described under Option 1, points 2. through 5. Option 2b. Some (or all) missing data are (or are suspected to be) available either from the NSI or from other sources If some missing information is downloadable from sources like the NSI webpage, it should be collected and merged with the Eurostat data. Nonetheless, there are cases when data are not publicly available but it can be suspected that either some, or even all the missing information is in the possession of the NSI. In such case, the undertaken actions should be as follows: 1. Contact the NSI in order to obtain the missing data. If successful, proceed as in Option 2a; 2. For the data that are still unavailable, but can be estimated, proceed as in Option 3; 3. For the data that are still not available and cannot be estimated, look at Option 4. Option 2 includes cases when data from various national sources has to be combined, for example aggregated data obtained using the component method and data on citizenship composition from the register of foreigners. some methods described in Section 3.7 should be treated as auxiliary to all the remaining ones rather than constituting separate estimation methods per se. Option 3. Some data are not available anywhere but can be estimated Even if some auxiliary data can be collected from whichever source, in many instances the available information can be still incomplete. In a vast majority of such cases, the missing information can be still estimated, either following the methodological guidelines and techniques outlined in Section 3, or by means of more straightforward and easy-to-apply solutions. The order of undertaken actions is then as follows: 1. Organize the available data (from Eurostat and other sources); 2. Verify the data (perform data validation and internal consistency checks); 3. Deal with the Unknown categories (if applicable); 4. Calculate the required aggregates for the available data; 5. For each year for which data are missing select the best method to estimate missing data; 6. Collect supplementary data needed for the estimations; 7. Estimate the missing data; 8. Check the results. For the estimations (item 7), various methods can be used, depending on the range and type of the missing information and data availability. In general, five broad groups of methods can be distinguished here, following the outline presented in Section 3 : a. Proportional fitting methods (Section 3.6); b. Cohort-wise weights propagation (Section 3.5); c. Cohort-wise interpolation of population stocks (Section 3.3); d. Cohort-component projections (Section 3.4); e. Other solutions, not listed above, or combined approaches. The methodology of interpolating the five-year into one-year age groups, presented in Section 3.2, as well as Option 4. Data are not available, and no or only very rough estimates can be produced In principle, this should be a very infrequent option. If no information is available that would enable estimation under Option 3, none or only very rough approximations can be performed, such as for example the 50-50 division of all foreigners into the EU27 and non-EU27 categories. Under Option 3, in all the cases where several methods could be alternatively applied, preference is given to the more straightforward ones, and definitely to the ones having less judgemental elements, thus less potential sources of error. This approach conforms to the Occam's razor principle, stating that "entities are not to be multiplied beyond necessity"2, which in this case means that the proposed models should not be more sophisticated than necessary, due to various possible sources of error. For example, given complete data on stocks by age and citizenship group for two moments of time (e.g. from successive population censuses), if the data on flows (I and E), natural change (B and D) and citizenship acquisitions (A ) are not available by citizenship and/or age, and require estimation, then the intermediate values are recommended to be calculated using cohort-wise interpolation rather than projection. In the former case, the only source of possible error is the composition of population as such, whereas in the latter, judgemental assumptions on the relevant distributions of all components of the balance equation are likely to result in higher uncertainty of the ultimate results, which within the deterministic framework of the project is impossible to assess. Figure 3 presents a decision tree summarising the procedure for selecting the estimation methodology, taking into account all the above options. 4.2 Application of the methodology, examples, selected results The decision tree presented in Figure 3 has been used to select the best estimation method for each of the 31 EU and EFTA countries, accounting for the availability of data in the Eurostat database (either on-line or in the JMQs), in the NSI databases, and at other sources. It turned out that complete data needed to estimate population by broad group of citizenship, sex and age on 1st January 2002-2006 were available in the JMQs for nine countries: Austria, the Czech Republic, Denmark, Finland, Hungary3, Norway, Slovenia and Sweden. After: 'Ockham's razor', in: Encyclopedia Britannica Online, http:// www.britannica.com/eb/article-9056716, accessed on 21st May 2007. 3 For Hungary, data on total population and on the number of Hungarian citizens were not always provided in the JMQ and therefore not available in the migration part of the Eurostat database. However, data on total population were available in the demographic part of the Eurostat database and the number of Hungarian citizens could be calculated directly as a difference between total population and total foreigners, the latter taken from the JMQ. Source: Own elaboration. Figure 3: Decision algorithm for obtaining population stocks by broad citizenship group, sex and age. For additional four countries it was possible either to collect all the missing data from the NSI websites (Belgium and Iceland), or to get them by contacting directly the NSI (Lichtenstein and Switzerland). For the remaining 18 countries some estimations were necessary. The method that proved to useful in the largest number of cases was some sort of proportional fitting (one of the three versions presented in Section 3.6). It was used as the main method for estimating population by broad citizenship in Cyprus, France, Germany, Greece, Italy, Latvia, Luxembourg, Malta, Slovakia, Spain and the UK. In all cases the total population was assumed to be as reported by the NSI in their demographic statistics, while the citizenship structure was taken from varied sources, for example the JMQ data for the same year, data taken from the NSI website (Italy), the census data (Cyprus, France), the data for another year (Romania, Spain), the LFS data (Cyprus, France) or the data from the register of foreigners (Germany) (see also examples below). The cohort-wise interpolation method was used for Ireland, Lithuania and Portugal. For Bulgaria, Estonia and Poland, where only data from the census were available, the cohort-wise weight propagation was applied. For Cyprus, Lithuania, Luxembourg and Poland it was originally planned to use a projection method, however it was decided that it would require too many assumptions that would be difficult to justify, and that the final result would not be reliable enough to justify the additional effort required when using this method. Estimations done for Romania do not fit any of the above groups. They involved simple combination of data coming from various sources. Below, more details about the estimation procedures are provided for selected countries. In doing so, we have tried to give an example for each estimation method. The resulting numbers in terms of the estimated citizenship structures of the populations of 18 European countries (all being EU Member States) on 1st January 2006, are presented in Table 3. Country Total Nationals EU27 foreigners Non-EU27 foreigners Bulgaria 7 718 750 7 693 214 3 855 21 681 Cyprus 766 414 678 114 52 217 36 084 Estonia 1 344 684 1 082 605 3 961 258 118 France 61 166 822 58 208 155 1 148 691 1 809 976 Germany 82 437 995 75 148 846 2 448 113 4 841 036 Greece 11 125 179 10 165 903 180 282 778 994 Ireland 4 209 019 3 779 755 295 165 134 099 Italy 58 751 711 56 081 197 538 853 2 131 661 Latvia 2 294 590 1 837 832 5 527 451 231 Lithuania 3 403 284 3 370 422 1 962 30 900 Luxembourg 469 086 280 938 171 876 16 273 Malta 404 962 392 850 7 022 5 090 Poland 38 157 055 38 115 920 18 660 22 476 Portugal 10 569 592 10 293 686 80 039 195 867 Romania 21 610 213 21 584 220 6 058 19 935 Slovakia 5 389 180 5 368 255 12 289 8 636 Spain 43 758 250 39 755 741 1 326 128 2 676 381 United Kingdom 60 393 100 56 990 704 1 365 190 2 036 807 Source: Own calculations based on the Eurostat and NSI data. described in Section 3.3, was used to obtain the initial estimates of males and females on 1st January 2002, 2003 and 2004. In the next step those initial estimates where proportionally adjusted to the known numbers of males and females by age, taken from the Eurostat demographic database. In Bulgaria, annual data on population by citizenship were not available. The only information on citizenship structure came from the census of 1st March 2001. There are also annual data on population by age and sex prepared by the NSI using the component method, available from the Eurostat database. The estimates of annual 2002-2006 population by citizenship, sex and age were prepared using the cohort-wise weight propagation method. The census data were used as the starting point for calculating the initial shares (weights) of citizenship groups in each age cohort. These shares were iteratively propagated forward as described in Section 3.5 and the resulting weights were combined with the available data on population by sex and age to calculate the required joint distribution by citizenship, sex and age. Table 3 : Estimated population by broad group of citizenship in 18 EU countries, as of 1st January 2006. As concerns particular examples: in Germany, data on foreigners come from two different sources. The component method (Bevölkerungsfortschreibung), based on the last traditional German census of 25th May 1987, is used by the NSI to produce annual figures on total population, total nationals and total foreigners, as well as nationals and foreigners by sex and age. The other source is the Central Register on Foreigners which contains data on foreigners by citizenship, sex and age. The total numbers of foreigners and their sex and age structures differ between both sources. In order to obtain a single set of estimates, the total number of German citizens, the total number of foreigners, as well as the age structures of Germans and foreigners were taken, following the NSI procedures, from the Bevölkerungsfortschreibung data. The distribution of foreigners into EU27 and non-EU27 foreigners was done in proportion to their shares in respective age groups according to the data from the Central Register of Foreigners. Thus, all in all, the proportional decomposition method was used. In Latvia, no joint distribution of population by citizenship and age was available for 1st January 2002, only the structures by age and by citizenship separately. However, the full joint distribution was available for 2003. The iterative proportional fitting method was selected to deal with this case. The joint distribution by citizenship group and age on 1st January 2003 was taken as the starting point for estimating the 2002 structure of population, which was then iteratively adjusted to the known marginal totals. Lithuania is an example for the application of a cohort-wise interpolation method. In this country, the joint distribution of population by sex, age and citizenship was available for the Census date (6th April 2001), as well as for 1st January 2005. The cohort-wise interpolation, as 5 Conclusion As it can be seen from the country-specific overview of problems with data on population stocks by age, sex and citizenship, there is no universal solution for estimating the missing pieces of information in the European countries under study. Nevertheless, depending on the availability of data at hand, either in the Eurostat / JMQ, or in the respective national statistical institutes, several estimation procedures can be proposed and applied, as mentioned in Sections 3 and 4. The methods and algorithm we proposed for this purpose do not, however, consider the issue of the harmonisation of the data and definitions, as mentioned in Section 1. More work would be needed in order to recalculate the population stocks into a common definition (cf. [1, 7]), and make them consistent with the (also re-estimated) statistics on migration flows. These very important research tasks are still to be performed in the subsequent tasks of the MIMOSA research project, of which the current study forms a part. Acknowledgement The paper was prepared within the framework of the research project on "Modelling of statistical data on migration and migrant population" (MIMOSA), commissioned by Eurostat (contract no. 2006/S 100-106607/EN; Lot 2) to the Netherlands Interdisciplinary Demographic Institute (NIDI) and conducted jointly by NIDI, Central European Forum for Migration and Population Research (CEFMR), Groupe d'étude de démographie appliquée, Université Catholique de Louvain (GéDAP uCl) and Southampton Statistical Sciences Research Institute of the University of Southampton (S3RI). We are grateful to our colleagues for their comments and discussions. Special credits go to Peter Ekamper from NIDI for his assistance in the computational part of the current study. References [1] Bilsborrow, R., Hugo, G., Oberai, A.S. and Zlotnik, H. (1997). International migration statistics. Guidelines for improving data collection systems. International Labour Organisation, Geneva. [2] Bishop, Y.M.M., Fienberg, E.F. and Holland, P.W. (1975). Discrete multivariate analysis. MIT Press, Cambridge, MA. [3] Calot, G. and Sardon, J.-P. (2003). Methodology for the calculation of Eurostat's demographic indicators. Eurostat Working Papers and Studies, Population and Social Conditions 3/2003/F/n° 26. Eurostat, Luxembourg. [4] Cangiano, A. (2008). Foreign Migrants in Southern European Countries: Evaluation of Recent Data. In: J. Raymer, F. Willekens (eds.), International Migration in Europe: Data, Models and Estimates. John Wiley, Chichester, pp. 89-114. [5] Castro, L.J. and Rogers, A. (1983). Patterns of Family Migration: Two Methodological Approaches. Environment and Planning A, 15 (2), pp. 237-254. [6] Norman, P. (1999). Putting Iterative Proportional Fitting on the Researcher's Desk. Working Paper 99/03. School of Geography, University of Leeds. [7] Poulain, M., Perrin, N. and Singleton, A. (eds.) (2006), THESIM: Towards Harmonised European Statistics on International Migration, Presses Universitaires de Louvain, Louvain-la-Neuve. [8] Rees, P. (1994). Estimating and projecting the populations of urban communities. Environment and Planning A, 26 (11), pp. 1671-1697. [9] Rogers, A. and Castro, L.J. (1981). Model Migration Schedules. II AS A Report RR-81-30. International Institute for Applied Systems Analysis, Lax-enburg. [10] Rowland, D.T. (2006). Demographic methods and concepts. Oxford University Press, Oxford. [11] Shryock, H.S., Siegel, J.S. and Associates (1993). Interpolation: Selected General Methods. In: D.J. Bogue, E.E. Arriaga, D.L. Anderton and G.W. Rum-sey (eds.), Readings in Population Research Methodology. Vol. 1: Basic Tools. Social Development Center / UN Population Fund, Chicago, pp. 5-48-572. [12] Willekens, F. (1977). The Recovery of Detailed Migration Patterns from Aggregate Data: An Entropy Maximizing Approach. IIASA Report RR-81-30. International Institute for Applied Systems Analysis, Laxenburg. [13] Willekens, F. (1982). Multidimensional Population Analysis with Incomplete Data. In: K.C. Land and A. Rogers (eds.), Multidimensional Mathematical Demography. Academic Press, New York, pp. 43111. [14] Willekens F.J., Pór A. and Raquillet R. (1981). Entropy, multiproportional, and quadratic techniques for inferring patterns of migration from aggregate data. In: A. Rogers (ed.), Advances in Multiregional Demography. IIASA Report RR-81-6. International Institute for Applied Systems Analysis, Laxenburg, pp. 83-124. [15] Wilmoth, J.R., Andreev, K., Jdanov, D., Glei, D.A. with the assistance of C. Boe, M. Bubenheim, D Philipov, V. Shkolnikov and P. Vachon (2005) Methods Protocol for the Human Mortality Data base, version 4. Department of Demography, Uni versity of California, Berkeley. Available from http://www.mortality.org/Public/Docs/MethodsProt ocol.pdf (accessed on 17th May 2007). Demographic Analysis of Fertility Using Data Mining Tools Matjaž Gams and Jana Krivec Department of intelligent systems Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia E-mail: matjaz.gams@ijs.si, jana.krivec@ijs.si Keywords: demography, fertility, data mining Received: March 12, 2008 We used data mining techniques to discover which attributes have the highest impact on country fertility rates. The data was analyzed in various ways; altogether and joined in smaller, meaningful groups, such as sociological, economical, philosophical, biological, etc.. We separately analyzed different groups of countries, current state and fertility trends, and tested several class schemas. Most relevant decision trees are presented and interpreted showing some known and some new conclusions. The iterative use of data mining techniques again proved to be successful in finding complex relations, but still needing expert interpretation as any computer method. Povzetek: Analizirani so poglavitni razlogi za premajhno oz. preveliko rodnost. 1 Introduction Populations change through three major processes: fertility, mortality, and migration. A useful way to express the rate at which women have children is the Total Fertility Rate (TFR). TFR is the average number of children that would be born per woman if all women lived to the end of their childbearing years and bore children according to a given set of age-specific fertility rates [21, 3]. If the average woman has approximately 2 children in her lifetime, this is just enough to maintain the population [7]. Figure 9, Total Fertility Rate Relative to the Replacement Leve] lor Each Country Across the Globe: 2002 In 2002, fertility levels WQre highest In the countries of Sub-Saliaran Africa and tue Near East, and lowest in the more developed countries. I I Bsloiv rsplacemeru I-1 Lessthan I child per I-1 above replacement \-2 children perwomar ibove repbcement B or rrore children per ibove rep lace mart I I NotiViilable Figure 1: TFR in countries in 2002 [21]. As seen in Figure 1, some countries have high and some low TFR. In most European countries TFR in 2006 was below 1.5 children per women [19], which is far less than desired [15]. Namely, sustained low fertility rates can lead to a rapidly aging population and, in the longrun, may place a burden on the economy and the social security system because the pool of younger workers responsible for supporting the dependent elderly population is getting smaller. Tracking trends of fertility rates and factors that influence them helps to support effective social planning and the allocation of basic resources across generations [8]. In this article we present a demographic analysis of 147 worldwide countries described by 95 basic attributes that might affect fertility rate. Even though the idea is everything but new, our approach to this problem is. Namely, our research group has decades of experiences in developing and using data mining (DM) and machine learning (ML) systems such as Weka [25] and Orange [4], last being developed in our broader research group. We typically approach a problem domain in a specific manner, usually obtaining similar results than those of the best experts in the field. It was a particular challenge to test our methods on the demographic problem. We use terms attributes, indicators and factors as synonyms. 2 Related work So far scientific efforts in demography were devoted mainly to exploration and definition of the process of data collection and qualitative interpretation of the statistical results, consequently not putting emphasis on new data analyzing methods. Data is typically analyzed with event history regression methods, Markov transition models and Optimal matching method using common spread statistical packages like (SPSS, SAS, S-Plus, Stata, R, TDA, etc.) [14]. The hypothesis is that between these typical aggregate descriptions and causal analysis there is a deficit of research on complex relations. Several modern methods, including data mining, offer opportunities to fill this gap. In the last decades, data mining tools for knowledge discovery from data (KDD) proved successful in various fields. However, searching through the internet showed that these approaches have received little attention in demographic analyses. There are some publications, e.g. Blockeel et al. [2] showed how mining frequent item sets may be used to detect temporal changes in event sequences frequency from the Austrian FFS data. In Billari et al. [1], three of the authors experienced an induction tree approach for exploring differences in Austrian and Italian life event sequences. Oris et al. [12] initiated social mobility analysis with induction trees. Unlike the statistical modeling approach, the methods make no assumptions about an underlying process generating the data and proceeds mainly heuristically. The approach differs from ours because we study rather static data and do not yet apply sequential rule mining analysis on historical demographic data. We have not noticed DM analyses on the level of countries, similar to ours. 3 Data mining for demography Successful data mining is based on various investigations of the data using different methods, parameters, and data to find most meaningful relations. 3.1 Basic data description Data for machine learning and data mining are most commonly presented in attribute-class form, i.e. in a "learning matrix", where rows represent examples and columns attributes [22]. In our case, an example corresponds to one country, and a class of the country, presented in the last column, denotes fertility rate. The first attribute is the name of the country. Alltogether there are 95 basic attributes and 147 countries. Attributes and their values were partially obtained from the demographic sources such as UN [20], Eurostat [5], and the Slovenian statistical database [16]. Several of the attributes were obtained from the internet, based on the assumption that they might show some interesting demographic relation. We were trying to get as many attributes as possible, nondiscriminatory whether positive or negative in terms of fertility rate. Attributes in demographic literature are grouped into biological and social [9] since human fertility is a socially formed biological process [18]. Newer literature introduces more and more complex structures, based on detailed grouping of social factors. Malai [10] divided factors that impact fertility rate in six groups: (1) biological, (2) economical, (3) social, (4) cultural, (5) anthropological and (6) psychological. Our 95 basic attributes correspond to these six categories, e.g.: state politics towards maternity leave, homosexuality, religion, suicide, abortion, military etc. Some of our measurements were performed on specific groups like (2) economical, consisting of 12 attributes like unemployment rate, GDP ($) per habitant, GDP growth (%). Biological factors (1) include 6 attributes (number of habitants, life expectancy rate, number of men per 1000 women to mention a few of them). On top of these six we added a special category "education and R&D (research and development)" with 38 attributes. There are 11 binary attributes, 2 discrete and the rest numerical. For the basic class we have chosen Total Fertility Rate (TFR), discretized into two values: high (>2) an low (<2). The branching point 2 was chosen because it represents the replacement level of the population. In reality, replacement level is a bit higher, around 2.1, but this number depends on several other parameters such as mortality rate and immigrations, and furthermore only two countries have fertility rate between 2 and 2.1. 3.2 Data modifications 3.2.1 Attribute modifications By attribute modifications we denote eliminating some columns in the learning matrix, and adding new columns, i.e. attributes. Subgroups of columns were chosen based on the demographic categories, and by DM methods. There were 5 new attributes added during the process of DM, thus bringing the total number of attributes to 100. Around half of the experiments were performed on 100 attributes. 3.2.2 Class modifications Besides the basic class discretization into two values, we tried three values of TFR as well: low (< 2), middle (2-3) and high (>3). In another attempt we classified countries according to decrease or increase of TFR. We first calculated average UN predicted TFR for years 20052010 and subtracted average TFR for years 2000-2005. The obtained value was discretized into two classes: ATFR>0, ATFR<0; or three classes: decreasing (ATFR<0.5), stable (-0.50.5). 3.2.3 Modifications of learning examples Learning examples consisted of 147 countries, each represented by a row in the learning matrix. Modifications were performed as eliminating or choosing specific rows to form a new learning matrix. A typical example would be a subgroup "developed countries", consisting only of countries with high gross domestic product (GDP) or Failed States Index (FSI). GDP is defined as the total market value of all final goods and services produced within a given country or region in a given period of time (usually a calendar year) [24]. FSI on the other hand consists of several attributes, describing the strength of central government, provision of public services, level of corruption and criminality; percentage of refugees and involuntary movement of populations, and an amount of economic decline. Since 2005, the index has been published annually by the United States think-tank, the Fund for Peace and the magazine Foreign Policy [23]. GDP review extracted two groups of countries: well developed countries with GDP above 1000$ per habitant (39 countries), and developing countries with GDP less than 1000$ per habitant (108 countries). Examination of FSI revealed three groups of countries: developed with FSI lower then 39.45 (29 countries), moderately developed with FSI 39.45-61.4 (21 countries), and developing with FSI>61.4 (97 countries). We also prepared data for analysis based on the geographical region of the country. In correspondence with UN regional classification [21], we grouped our cases in 6 regions: Asia, Africa, Latin America and the Caribbean, Oceania, Europe and Northern America, and 20 sub regions. 3.3 Methods Machine learning and lately data mining are among the most successful artificial intelligent application areas. Whenever there are lots of learning examples, these systems learn properties of the domain and make predictions about future cases. These systems not only compete with statistical methods in terms of accuracy, they also introduce several new approaches such as cooperation between systems and humans. The constructed knowledge is often in the form of readable, understandable trees, rules and other representations thus enabling further study and fine tuning. Two examples of successful scientific and engineering DM tools are Weka [27] and Orange [4]. Both systems provide tens of DM systems, several data preprocessing and visualization tools. From the ML and DM techniques available in Weka and Orange we have chosen J48, the implementation of C4.5 [25], a method used for induction of classification trees. This method is most commonly used when the emphasis is on transparency of the constructed knowledge. In our case this was indeed so, since the task was to extract most meaningful relation from hundreds of constructed trees. Most meaningful relations are those most significant to humans with best classification accuracy at the same time. To estimate the accuracy of the trees, we used 10-fold cross-validation, built in the system. The estimated accuracy of a classification tree corresponds to a probability that a new example will be correctly classified. A short description of decision trees is presented in this paragraph for readers not familiar with classification trees. Classification trees are built in a top-down manner. The first task is to choose the most informative attribute which will be placed at the root of the classification tree. The next step is to add branches according to the values of the attribute. For a discrete attribute, there are as many branches as there are different values. In case of a numeric attribute, there are only two branches, one that represents values less or equal than the border value as proposed by the system, and the other branch with greater values. The set of examples is divided into subsets corresponding to the branches. Now the process can be repeated recursively for each branch, using only those instances at each particular branch. If at any time all instances at a node have the same classification, further branching is stopped and the classification into that class is proclaimed. The splitting process is usually stopped as soon as sufficient statistical significance is obtained, classifying into the majority class. Classification is performed by starting at the top of the tree and choosing appropriate attribute values to proceed with the chosen branch. At the leaf, the numbers represent all examples and those with different class. Experiments were performed with various method parameters, mainly changing levels of pruning. However, it turned out that default parameters were most successful. Unlike in our typical DM session, we did not modify sources of the DM programs. J48 turned out successful enough. 4 Experiments Tens of trees were created in a systematic way, as presented in Figure 2. First experiments were performed with TFR and ATFR, then with all and only developed countries. Finally, several selections of the attributes were tested: all, economical, direct, social, economical, and educational. These tests resulted in 24 basic trees. In addition, various further experiments were performed. Due to lack of space, only most interesting trees are presented in this paper, those with most meaningful relations to humans and with best classification accuracy at the same time. Figure 2: Structure of experiment. 4.1 TFR class Firstly, the analysis was based on TFR as a class, with 2, 3 or more values. Only experiments with 2 or 3 values were interesting enough to be presented in this paper. 4.1.1 All countries 4.1.1.1 All attributes In the first fertility rate analysis all 147 countries and all 95 available attributes were taken into consideration. The obtained tree is presented in Figure 3, showing that the most important indicator for high TFR is the number of stillborn children per 1000 births. More than 11.5 stillborn children per 1000 births is a strong supporting factor in favor of high TFR of the country and vice versa. The results are consistent with practically all literature in the demographic field and experts' opinions, who claim that death of newborns is in tight connection with social and economical status of mothers who need to have several children to compensate for those dead. According to experts, higher educated mothers usually have less children and lower newborn mortality, low percentage of stillborns is supposed to be related to the costs of child life-support , different life condition of the urbanized and industrialized society, changes of the attitude towards women, decaying of old patriarchal community etc. as the main reasons for fertility decline. As the tree in Figure 3 shows, these relations are indeed statistically most relevant. However, the tree shows additional relations in a structured way with appropriately weighted leaves, i.e. nodes at the bottom of the tree. For example, the top right leaf "high (104/16)" includes 88 countries with high TFR and 16 with low TFR. The bottom left leaf, on the other hand, encapsulates only 2 countries with high TFR, rendering this information as statistically less important. Therefore, in the tree there is just another statistically strongly confirmed relation: when number of dead born children is less than 11.5 and majority religion is Christianity and there are fewer men than women then TFR is low (35/1). This relation shows another crucial matter regarding interpretations of the tree. Why should Christian majority be negative for fertility rate while Christians give high emphasis on families, strong marriages and devotion to children? Indeed, further analysis show, as pointed out by demographic experts long time ago that population in these countries have high divorce rates etc. meaning that people do not follow church directions, but live according to their own desires. The bottom right part of the tree, starting with low percentage of women in the population is statistically rather meaningless, however, density and number of inhabitants gives some indication that these are among relevant attributes. Therefore, reading and interpreting trees demands some understanding of statistics, trees and demographic literature. At each Figure title, there is cross-validation accuracy estimate. For Figure 3 it is over 80%, which is a reasonably good result. Default accuracy obtained by classifying only into the majority class is 89,4%. In another attempt we divided the class values into three groups: low, moderate and high TFR rate. Immediately it should be noted that the accuracy of such a tree seemingly decreases. Namely, the tree in Figure 3 Figure 3: Two-valued TFR classification tree, all attributes (accuracy is 80,3%). classifies into two classes, therefore, accuracy of blind guessing is 50%. For three classes, blind guessing results in 33% accuracy, and the default accuracy is 36,7% for the majority class. Having these statistics in mind, the obtained classification tree in Figure 4 achieves even better classification accuracy with 74,8%. Figure 4: TFR classification tree with three-valued class, considering all attributes (accuracy is 74,8%). The experiment once again revealed the most important attribute: "number of stillborn children". However, the branching point leading to high TFR is in this case much higher: 53.56 children per 1000 births. In this tree, there are three major groups all from 30 to 40 countries: high, moderate and low. The major attribute distinguishing between moderate and low TFR countries is the length of the maternity leave. At this point one should be aware that such attributes are semantically potentially misleading - countries with low TFR probably introduced lengthier maternity leave as a consequence and not as cause. The tree therefore shows most important relations without knowing the nature of them. After obtaining the first tree, in a series of tests seemingly most important attributes are being eliminated in order to test if other attributes can replace them and still obtain similar accuracy. Instead of "number of stillborn children" several attributes can be used: human development index (HDI), life expectancy rate, literacy rate, etc. all denoting the same concept. It is generally accepted that in these, developing countries, TFR is high. For the maternity leave, the elimination of the attribute results in lower accuracy 68%. Although this attribute is obviously important, we are not able to establish the type of relation. Whatever the case, countries with short maternity leave have moderate TFR, and those with long maternity leave low TFR. Although the rest of the relations are not so significant, they represent a bigger share than in the previous tree and they seem to have two common denominators: developmental status and value system. Altogether, analysis so far indicate that the developed countries have low TFR, e.g. most of the European and north American countries, developing countries have high TFR, and moderately developed countries like Botswana, Bolivia, Honduras, Jamaica, etc. have moderate TFR. 4.1.1.2 Selected attributes We further filtered attributes according to the algorithms in DM tools. Again, as seen in Figure 5, the most distinctive attribute regarding TFR rate appears to be the number of stillborn children per 1000 births. When this number is lower or equal to 11.55, the TFR is low (under 2), with the exception of the countries that do not ensure appropriate delivery treatment and invest most of its educational foundation in a primary sector. On the other hand, TFR is low despite high number of stillborn children in the case when the human development index (consists of life expectancy rate, literacy rate, educational rate and standard of living) is high, abortion is allowed and unemployment rate is low (under 13.9 %), or if abortion is not allowed, but the country invests most of its educational foundation in a primary sector and has long maternity leave (more than 11 weeks). The discovered relations indicate a meta attribute - developmental status of the country. 4.1.1.3 Direct attributes The demographic experts classify fertility attributes, i.e. factors, on direct and indirect [10]. Direct factors have direct influence on fertile persons. In this context we built a decision tree including 4 attributes: legality of demanded abortion, number of abortions per 1000 people, percent of married women (between 15 and 49 years old) that use contraception and percent of elders infected with HIV virus or AIDS. The obtained 82,31% accurate tree is presented in Figure 6. Legal abortion associated with low percent of HIV infected elderly relates to low TFR while illegal abortion and lower percent (less then 70) of women using contraception leads to high TFR. These attributes again seems to correlate to the meta attribute - developmental state of the country and to the value system. The other derivation could be that the value system plays an important role. The accuracy is very high indicating that these attributes are meaningful. Figure 5: TFR classification tree with two-valued class, considering only automatically selected attributes (accuracy is 81,6%). H LOW HIGH H ii 14.15-3.55^ n -i ž ■A.lS-^.Zi] Figure 6: TFR classification tree with two-valued class considering only direct attributes (82,3%). 4.1.1.4 Social attributes Since many experts in the field agree [10] that only direct factors can not explain the fertility rate determination, we further examined influence of the indirect TFR factors. We analyzed 11 attributes that express the society attitude towards general life questions: legality of homosexuality, legality of homosexual marriages, possibility of adoptions to homosexuals, number of suicides per 10000 persons (men only, women only, altogether), legality of abortion, number of abortions per 1000 people, number of divorces per 1000 persons, percent of women in the parliament. Figure 7: TFR classification tree with two-valued class, considering only social attributes (81,6%). From Figure 7 one can conclude that TFR is high in more conservative countries that don't allow abortion and adoptions to homosexuals. TFR is high also in the countries that allow abortions but prohibit homosexuality. By this view more liberal countries have low TFR. The accuracy is as high as of the tree constructed on all the attributes, selected by the system. Again it seems that the value system plays an important role. 4.1.1.5 Economical attributes Experts generally find low TFR strongly related to the economical factors, society modernization and liberalization [19]. We wanted to established the nature of economic relations by extracting 13 economical attributes that refer to the field of unemployment, GDP, public health and social protection expenditure, number of working ours per week and inflation rate. The constructed tree is presented in Figure 8. The tree indicates that high GDP, low unemployment rate and high inflation GDP deflator relate to low fertility rate, while low GDP per capita usually relates to a high TFR. As David Heer said [13], economical progress should positively influence fertility rate. Overall statistics significantly disconfirm the hypothesis at least in the modern world where food is not scarce. Our analysis indicates that direct economical attributes are not very relevant for fertility on their own, at least not as other groups of attributes. For example in figure 8 in some cases high GDP per capita leads to high and in others to low TFR. Becker (1981) [17] presents a plausible explanation of such GDP-TFR relations. He claims that TFR depends on the disposable expenses and expected usefulness of the children. To uphold the thesis he gives an example of the rural family that used to have more children in order to assure help for maintaining the family. Human resources were urgent for working on the fields, in the woods, etc. Nowadays, agriculture has become more and more automated, thus reducing the need for human forces. Consequently, the cost benefit of the children dropped drastically and families began to shrink. Besides, factors like higher educational level, lower child mortality rate, and the desire for career making among young people, pushes TFR even lower. This linkage between income and fertility is typical for developed countries, where despite constant income growth, TFR is continually decreasing. Whereas in developing countries, low income does not influence fertility rate. Figure 8: TFR classification tree with two-valued class, considering only economical attributes (78,2%). In any case, the tree from Figure 8 is only 78,23 % accurate, which is low in comparison with trees based on other attributes. This indicates that direct economical factors are not the main cause for the distinction among countries with low and countries with high fertility rate. 4.1.1.6 Educational attributes Analyzing the relation between educational factors and TFR resulted in the tree presented in Figure 9. High percentage of enrolment in primary educational level is in general related with high fertility rate, whereas low TFR is more related to enrolment in secondary or tertiary educational factors. As observed by experts before, high education, especially of women, decreases TFR. 4.1.1.7 Developed countries While developing countries have problems with too high TFR, developed countries, especially in Europe, have problems with low TFR. Mark Steyn, a conservative polemicist, argues that Europe is quickly becoming a barren, ageing, enfeebled place [6]. In the decades after the second world war, rich countries everywhere experienced similar trends. The bonds of traditional family life began to slacken, more women got jobs, people sought enjoyment and satisfaction more and more through individual pursuits rather than in families. This social transformation, which is occurring also in America and East Asia, led to a demographic bonus (a bulge of people working) and to what might be called "the postponement of everything". People left school later, left home later, married later, had children later, they also died later [6]. Even though these interpretations are not uniformly accepted, they seem to be statistically quite well grounded. Figure 9: TFR classification tree with two-valued class, considering only educational attributes (78,2%). Having that in mind, the relevant question is: Why do some rich countries still have high TFR? In the following experiments we denoted 39 countries with high GDP as rich. 4.1.1.8 Selected attributes The tree in Figure 10 indicates that exceptions to the low fertility rate have poor education and social system. Further analyses showed that these countries rely on natural resources such as oil. Figure 10: TFR classification tree with two-valued class and automatically selected attributes (78,2%). 4.1.1.9 Social attributes Analyses of the obtained tree presented in Figure 11 revealed that countries with oil are rich and have Muslim religion. But the relation can be interpreted originally as follows: when Islam is the prevailing religion of the country, then TFR is most likely to be high, while otherwise, TFR decline is the more likely option. Results are consistent with the previously observed relations that TFR is higher in more conservative countries, which Islam countries certainly are. f MUSLIM ^ ( RELIGION ) NO YES LOW (34/4) HIGH C5A)) Figure 11. TFR classification tree with two-valued class considering only social attributes (89,7%). 4.2 ATFR class The newest studies of Worldwatch Institute conclude that there is so much variability in fertility rates that we can not know with any confidence how many people the future holds [11]. Indeed, it seems reasonable that ATFR analyses are a bit less relevant as those with TFR, since they measure the amount of change and not the obtained situation. Even though, our next attempt was to established factors that might influence TFR growth and decline. In the next section a few of the most interesting and accurate trees are presented. 4.2.1 All countries 4.2.1.1 All attributes Again, literacy seems to be an important indicator of TFR trends (see Figures 12 and 13). Countries with low percent of literate habitants generally have increase in TFR. Countries with high percent of literate citizens (above 97.9%) and low unemployment rate (below 9.6%) on the other hand have decreasing TFR trend. LITERACY m INCREASE (-109.53/7 .Dì UNBWIPLOYMENT^ RATE INCREASE DECREASE 1-4.671 ß.33/D.331 Figure 12: ATFR classification tree with three-valued class (81,1%). «■isssai tusmn <■0.11 >D.1I / \ / \ INCRBtìE DECREWE INCREWE DECREASE 0.1) (4,67) (2.33jD,33) Figure 14: ATFR classification tree with three-valued class, automatically selected attributes (85,3%). 4.2.1.2 Social attributes Considering only social attributes, the same tree as in the case of TFR class appeared (see Figure 7), again exposing the importance of conservative politics of the country for the TFR growth trend. Countries that don't allow abortion and adoption to homosexuals have TFR growth trend, whereas countries that allow abortion and homosexuality have TFR decline trend. Accuracy in this case is 78.32%, much lower than in the tree presented in Figure 14. 4.2.2 Developed countries In this case our criteria for dividing countries by their developmental status was FSI. A country was classified as well developed if FSI index was less than 39.45, resulting in 27 countries. Analyses were performed on the attributes separately merged in smaller groups. 4.2.2.1 All attributes Figure 15: ATFR classification tree with two-valued class (accuracy is 84.6 %). Figure 13: Unprunned ATFR classification tree with three-valued class indicator (83,2%). Similar conclusions can be drawn from the tree on Figure 14, when attributes were automatically selected. This tree has surprising high accuracy. Race appeared to be an important factor of TFR trend (see Figure 15). In nations with prevalent Asian and combined race, TFR is likely to increase, while in countries with a majority of white race, TFR is declining. The nature of this genetic relation is not cleat at this point. 4.2.2.2 Economical attributes Figure 16: ATFR classification tree with two-valued class, considering only economical attributes (84.6 %). We can see that highly economical developed countries with more than 10100 GDP per capita ($) have TFR decline. This thesis is for example not in agreement with the Worldwatch Institute study noting that fertility rate is rising in the United States [11]. However, this study is violating the age-old dictum that rich countries do not make lots of babies as well [11]. The tree based on economical attributes is this time quite accurate. Therefore, ATFR analyses gave more statistical relevance on economical attributes than analyses with TFR. 4.2.2.3 Social attributes When selecting only social attributes, the accuracy of ATFR classification trees dropped drastically (on 76.9%) what means that these factors are not good indicators for TFR trends. 5 Conclusion and discussion In fertility analyses, the data mining tools again proved their major asset: the constructed knowledge is in a transparent form, enabling human comprehension of relevant relations in complex forms. In this way, an interactive and interaction process is enabled between computers and humans, exploiting best properties of the two most advanced information machines. Computers fast examine vast search spaces with their advanced speed and accuracy while humans make conclusions and guide search with the advanced cognitive skills. To readdress the problem, let us restate that the space of all potential hypothesis for 100 binary attributes 2100 and a single binary class is 2 . This number is far larger than the number of all atoms in our universe, which is according to Wikipedia around 1080, i.e. 2266. Therefore, there is no way humans can analyze any meaningful share of all the hypotheses. But we can examine results of one search, make conclusions and redo the search changing specific details of the search. In this way humans can "mine" for relevant hypothesis. Regarding the fertility relations, the DM tools enabled rediscovery of major properties. The authors are not experts in the fertility or demographic field, therefore verification of our conclusions by an expert and further analyses of interesting new patterns are a matter of further research. However, we report our impressions for further discussions: - Firstly, we were surprised that there are so many distinctive hypothesis, i.e. patterns discriminating countries with low from those of high fertility rate. Rich countries are predominantly white, have good education, women live longer, literacy is high, the predominant race is white, people have no strong religion obligations, they are liberal etc. and vice versa for the developing countries. - Secondly, according to the constructed trees, it is rather simple to influence the fertility rate - just improve literacy in women or just allow liberalization and decrease the influence of religion. And vice versa - to improve fertility, just e.g. improve moral values and decrease liberalization or decrease literacy or apply any of the remaining 10 or 15 attributes. Some are costly, some are unacceptable, e.g. decreasing literacy. According to some experts, practically all of these attributes are hard to implement in democratic countries. Still, the trees indicates that there are several mechanisms, some of them rather costless, that will change the process in European countries, leading first to economic problems and later to extinction of nations and cultures. - We did not have time to study each particular attribute in detail, such as the length of maternity leave. While the trees so often show relevance of maternity leave for decreasing fertility, the trees do not show whether this is a cause or a consequence. At first, we thought that it is just a consequence of low fertility, just a mechanism of countries promoting higher fertility. In addition, mothers are generally in favor of longer leaves. But after so many trees, and having in mind that this is one of very costly mechanisms, it is becoming more than a suspicion that longer maternity leave is at least a controversial matter. Our analyses dealt with countries and not with individuals. Obviously, several of the fertility and demographic matters are open for further investigation with DM techniques, next time with fertility experts. Acknowledgement In the analyses performed, contributions of the following computer-science students should be acknowledged: Barbara Tvrdi, Anže Jazbec, Marko Kovki, Domen Muren. We also thank Statistical office of the Republic of Slovenia for the answers and the IS 2007 program committee members for the constructive suggestions. Reference [1] F.C. Billari, J. Fürnkranz, and A. Prskawetz (2000). Timing, sequencing, and quantum of life course events: a machine learning approach. Working paper 010, Max-Plank-Institute for Demographic Research, Rostock. [2] H. Blockel, J. Fürnkranz, A. Prskawetz, and F. Billari (2001). Detecting temporal change in event sequences: An application to demographic data. In L.D. Raedt and A. Siebes (eds.), Principles of data Mining and Knowledge discovery: 5'h European Conference, PKDD 2001, Volume LNCS 2168, pp.29-41. Freiburg in Brisgau: Springer. [3] M. Christenson, McDevitt, T., and Stanecki, K. (2004). Global Population Profile: 2002. International Population Reports. Health Studies Branch, International Programs Center, Washington Plaza II, Room 313A U.S. Census Bureau, Washington, DC 20233-8860. [4] J. Demsar, and B. Zupan (2005). From Experimental Machine Learning to Interactive Data Mining, White Paper (www.ailab.si/orange), Faculty of Computer and Information science, University of Ljubljana. [5] Eurostat http://epp.eurostat.ec.europa.eu/ [6] Economist http://www.economist.com/world/europe/displaysto ry.cfm?story_id=9334869 [7] M.Gams (2007). Osnovna demografska gibanja (Basic Demographic Dynamics). In J. Malačič, M. Gams (eds.), Proceedings of the 10t'' International Multi-conference Information Society (volume B) Slovenian Demographic Challenges of the 21st Century. Ljubljana: "Jožef Stefan" Institute, pp. 3537. [8] M. Gams, J. Krivec (2007). Analiza vplivov na rodnost (Analysis of Impacts on Fertility). In J. Malačič, M. Gams (Eds.), Proceedings of the 10th International Multi-conference Information Society (volume B) Slovenian Demographic Challenges of the 21st Century. Ljubljana: "Jožef Stefan" Institute, pp. 35-37. [9] J. Malačič (2006). Demografija: teorija, analiza, metode in modeli (Demography: theory, methods and models), (EF, Manual). 6. edition. Ljubljana: Faculty of Economy, pp. 339 [10] J. Malačič (2000). Demografija- teorija, analiza, metode in modeli (Demography: theory, methods and models). 4. edition. Ljubljana. [11] New York Times http://dotearth.blogs.nytimes.com/2008/03/14/earth-2050-population-unknowable/ [12] M. Oris, G. Ritschard, and A. Berchtold (2000). The use of Markow process and induction trees for the study intergenerational social mobility in nineteenth century Geneva. In Social Science History Association Annual Meeting, Baltimore. [13] M. Potts, Society and Fertility. Estover: McDonald and Evans. 1979. pp. 384 [14] G. Ritschard, and M. Oris (2005). Dealing with Life Course Data in Demography: Statistical and Data mining Approaches, in R. Lévy, P. Ghisletta, J.-M. Le Goff, D. Spini et E. Widmer (eds.), Towards an Interdisciplinary Perspective on the Life Course, Advances in Life Course Research, Vol. 10. Amsterdam: Elsevier, pp. 283-314. [15] Slovenija v številkah (Slovenia in Numbers). Ljubljana: Statistical office of the Republic of Slovenia, 2006. pp. 79 [16] Statistical office of the Republic of Slovenia; Slovenian Statistical Database: http://www.stat.si/ [17] N. Stropnik (1997). Ekonomski vidik starševstva (Economical side of Parenthood). Ljubljana, Scientific and publication center , pp.221. [18] M. Šircelj (1991). Determinante rodnosti v Sloveniji (Fertility Determinates in Slovenia). Doctoral dissertation, Faculty of Art, Ljubljana. [19] United Nations, Department of Economic and Social Affairs, Population Division (2007). World Population Prospects: The 2006 Review, Hilights, Working Paper No.ESA/P/WP.202. [20] UN http://esa.un.org/unpp/ [21] U.S. Census Bureau, International Population Reports WP/02, Global Population Profile: 2002, U.S. Government Printing Office, Washington, DC, 2004. [22] V. Vidulin, M. Gams (2006). "Vpliv investicij v izobraževanje in R&R na gospodarsko rast," Elektroteh. vestn., vol. 73, nr. 5, pp. 285-290. [23] WIKI-1 http://en.wikipedia.org/wiki/Failed States Index [24] WIKI-2 http://en.wikipedia.org/wiki/Gross_Domestic_Prod uct [25] I. H. Witten, E., Frank (2005). "Data Mining -Practical Machine Learning Tools and Techniques (sec. ed.)," Morgan Kaufmann. Clustering of Population Pyramids Simona Korenjak-CCerne University of Ljubljana, Faculty of Economics, Department of Statistics simona.cerne@ef.uni-lj.si Nataša Kejžar University of Ljubljana, Faculty of Social Sciences, Department of Informatics and Methodology natasa.kejzar@fdv.uni-lj.si Vladimir Batagelj University of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics vladimir.batagelj@fmf.uni-lj.si Keywords: clustering, population pyramid, Ward hierarchical method, hierarchical clustering with relational constraint Received: March 6, 2008 Population pyramid is a verypopularpresentation of the age-sex distribution of the human population of a particular regi on. The shape of the pyramid shows many demographic, social, and poli tical characteristics of the time and the region. In the paper results of hierarchical clustering of the world countries based on population pyramids are presented. Special attention is given to the shapes of the pyramids. The changes of the pyramids' shapes, and also changes of the countries inside main clusters are examined for the years 1996, 2001, and 2006. Also smaller territorial units of a country can be observed through clusters. To illustrate this, clusters of3111 mainland US counties in the year 2000 obtained using the hierarchical clustering with relational constraint of counties' population pyramids are examined. In the paper, the results for clustering into nine main clusters are presented. Povzetek: Prikazano je grupiranje demografskih piramid. 1 Introduction Population pyramid is a very popular presentation of the age-sex distribution of the human population of a particular region. It gives picture of a population's age-sex structure, and can also be used for displaying historical and future trends. Generally, there are three main pyramids' shapes: expansive, constrictive, and stationary (Figure 1). The expan- li Figure 1: Three main shapes of population pyramids sive shape is typical for fast-growing populations where each birth cohort (a group of people born in the same year or years period) is larger than the previous one (Latin America, Africa). Constrictive shape displays lower percentages of younger population (United States). Stationary shape present somehow similar percentages for almost all age groups. The population pyramids of the Scandinavian countries tend to fall in this group. Since the biggest influence on the pyramid's shape have fertility and mortality, the explanation of the pyramids' shapes is often related to the "Demographic Transition Model" (DTM) that describes the population changes over time (Figure 2). It is based on an interpretation that begun in 1929 by the American demographer Warren Thompson, of the observed changes, or transitions, in birth and death rates in industrialized societies over the past two hundred years. Besides births and mortality, also other processes, depending on social or/and political policy and events (migrations, birth control policy, war, life-style etc.) have strong influence on age-sex structure of the population, that reflect also on the shape of the population pyramid. Population pyramids are very easily understandable to almost everyone. In the combination with professional knowledge, they offer also many additional explanations about different processes to experts (e.g. demographers, sociologists, politicians, economists, geographers). Due to these facts we decided to observe clusters of the world countries based on the shapes of their population pyramids. We observed how countries and clusters of them fit with the main pyramids' shapes and how stable are clusters that we got with hierarchical clustering. Selected clustering procedure to determine clusters of the world coun- EXPANSIVE CONSTRICTIVE STATIONARY stage 1: EXPANDING Stage 2: EXPANDING Males Females High birth rate; high death rate; short life expectancy stage 3: STATIONART Males Females High birth rate; fall in death rate; slightly longer life expectancy Stage 4: CONTRACTING Males Females Declining birth rate; low death rate; longer life expectancy Males Females Low birth rate; low death rate; longer life expectancy Figure 2: Shapes of population pyramids for 4 stages of the demographic transition model tries is based on the Ward's hierarchical clustering method. For each country a clustering can be applied also on smaller regions. In smaller regions the influence of other processes besides births and mortality (for example local migrations caused by schooling, religion, work or health reasons etc.) is even more emphasized, which reflect in differences of pyramids' shapes. Therefore we further examined clusters of 3111 mainland US counties. To determine clusters, hierarchical clustering with relational constraint was used. Since we are not demographers we focus on the presentation of the methods and results they produce, and we limited the explanation of them to the 'technical' characteristics (results that can be directly seen from the obtained graphs and clusters), hoping that the proposed approaches will catch the attention of researches using pyramids' analysis in their work. 2 Clustering procedure Data on the population pyramids of the world countries used in our analysis were taken from the web page of the International Data Base (IDB). Age is divided into 17 five-years groups (0-4 years, 5-9 years, 10-14 years, ..., 75-79 years, 80+). In our model, for each country each age group is considered as a separate variable for each sex, so each country is presented with 34 variables: 17 variables for 5-years age groups for men, and 17 variables for 5-years age groups for women. Values are normalized so that they present percentages of the country's population in each age group. Euclidean distance between corresponding vectors is used. Although some objections against the usage of this difference measure can be found (Andreev, 2004), in our opinion in observation of the shapes of the population pyramids, each age-group can be considered as a separate variable. For clustering procedure, the algorithm for the Ward's hierarchical method, implemented in a package 'cluster' in the statistical environment R was used. This clustering procedure is implemented based on the description of the method in Kaufman and Rousseeuw, 1990. Since each country is represented with normalized vector (relative age-sex distribution), the 'centroid' pyramids of the clusters of countries are not real population pyramids describing the whole population in the clusters, but are based on the new vector got with the clustering procedure. So they can be interpreted only in terms of shapes, not as a population pyramids of countries' clusters, because the population size of the whole cluster is not taken into account. But on the other hand such approach enables us to detect even small countries with very different pyramids' shapes based on special countries' characteristics. 3 Clusters of the world's countries over time We observed clusters of the world countries obtained by the Ward's clustering procedure for each year from 1996 to 2007. Although the time period is rather short for the human life, substantial changes can be seen. In Figure 9, Figure 10, and Figure 11 respectively, hierarchical trees for the years 1996, 2001, and 2006 with the appropriate pyramids' shapes for some of the main clusters are presented. 3.1 Pyramids' shapes of the clusters of world countries Our first examination is concentrated on the shapes of the population pyramids of the clusters in the hierarchies. For each of the years 1996, 2001, and 2006, some of the interpretations are given. Much more detailed information can be obtained depending on the cutting level in the hierarchy and on the interest of the observer. 3.1.1 Year 1996 For the year 1996, the dataset contains 215 countries. Four main pyramids's shapes are presented in Figure 3. The first cluster has typical expansive pyramid's shape. It includes 77 countries (most of African countries etc.). When comparing these pyramids' shapes with pyramids' shapes of the DTM shown in Figure 2, we can say that the first one corresponds to the Stage 1, the second and the third have characteristics of the Stage 2 and 3, and the last looks between stages 3 and 4. For easier explanation of later observations we will denote them with letters A, B, C, and D respectively. Clusters at the bottom levels of the hierarchy are more and more similar. When we cut at the level with eight clusters, clusters A and C are each divided into two smaller clusters presented in Figure 4 for cluster A and in Figure 5 for cluster C. Cluster B remains the same also when cutting into eight clusters. First shape from Figure 4 corresponds 0 0 4 0 0810 0 0 6 0 02 0 CO 0 04 OOS A B C D Figure 3: Pyramids's shapes of four main clusters of the countries for the year 1996 000 004 008 Figure 4: Pyramids's shapes of the first two (from the left) of eight clusters of the countries for the year 1996 to the Stage 1, but the second one looks closer to Stage 2 of the DTM from Figure 2. 006 0 02 0 00 0 04 0 08 4ÌL 0 05 002 000 004 008 i 006 0 02 0 00 0 04 0 08 Figure 6: Pyramids' shapes of the last three (from the right) of eight clusters of the countries for the year 1996 interesting points for discussion about similarities and differences among countries and/or clusters. 3.1.2 Year 2001 Hierarchy on the 222 world countries for the year 2001 is presented in Figure 10. At the upper level of the dendrogram two main shapes of pyramids can be seen. The first pyramid's shape approximately corresponds to the Stage 2 of the DTM shown in Figure 2, and the second one approximately corresponds to the Stage 4. One level lower each of two main clusters is divided into two additional clusters. Their pyramids' shapes are presented in Figure 7. Their shapes are similar to those in the year 1996, therefore we denoted appropriate clusters with the same letters A, B, C and D. Figure 5: Pyramids's shapes of the fourth and fifth (from the left) of eight clusters of the countries for the year 1996 The last cluster in Figure 5 includes countries with very big differences between gender's distributions: Bahrain, Kuwait, Qatar, and United Arab Emirates. Since in all these countries the population of men in the data is much bigger (specially in the middle ages) than of women, this has effect also on the pyramid's shape of the cluster. We also calculated dissimilarities between gender's distributions for all countries in 1996, and five of them with the largest differences are: United Arab Emirates, Qatar, Kuwait, Oman, and Bahrain. Four of them are included in the described cluster, detected in the hierarchy. Cluster D is at lower level divided into three smaller clusters. Their pyramids's shapes are presented in Figure 6. Slovenia is in the last cluster from the right together with 34 other countries. The most similar country to Slovenia is Croatia, and after it also Belgium, France, Finland, and Gibraltar what can be seen on the dendrogram in Figure 9. The age distributions of both genders in the pyramid's shape are quite similar, although there are slightly more women, specially those that are older than 70. Similar and even more detailed descriptions at lower levels can be obtained for each group in the hierarchy. With additional knowledge about countries more detailed explanations of the shapes can be given which also offer many 0 05 002 000 004 008 A B C D Figure 7: Pyramids's shapes of four main clusters of the countries for the year 2001 Inspecting lower level of the hierarchy, eight clusters can be detected. First cluster is the same as the first one (A) presented in Figure 7. It includes 60 countries and has typical expanding shape of the population pyramids. Cluster B is at the lower level divided into two smaller clusters. The shapes of the pyramids are presented in Figure 8. They approximately fit to Stage 2 and Stage 3 of the DTM. That shows some progress in countries' development comparing with the year 1996. Figure 8: Pyramids's shapes of the second and third (from left) of eight clusters of the countries for the year 2001 GafmDb^nislgn 20 30 40 50 60 70 80 20 30 40 50 60 70 80 Ihi..^ Cluster C is at the lower level divided into three additional smaller clusters. Their pyramids' shapes are presented in Figure 12. Among eight clusters United Arab Emirates forms separate cluster by itself (the right pyramid in Figure 12). Its population pyramid's shape shows big differences between gender's distributions in the country. At the upper levels, Northern Mariana Islands, Qatar etc. join United Arab Emirates. i m 0 05 002 000 004 008 i 0 05 002 000 004 008 Figure 13: Pyramids's shapes of the last two among eight clusters of the countries for the year 2001 3.1.3 Year 2006 The last hierarchy we present is the hierarchy of 222 the countries for the year 2006. It is presented in Figure 11. The shapes of the pyramids of the four main clusters are presented separately in Figure 14. Also these shapes are similar as in the years 1996 and 2001 therefore appropriate clusters are denoted with the same letters. As for the previous two years also for the year 2006 more detailed explanations for each cluster of the hierarchy could be found. 3.2 Stability of the clusters in the hierarchies In the following section we observe how stable are the main clusters over time. For each of the years 1996, 2001, and 0 06 002 000 004 008 A B C D Figure 12: Pyramids's shapes of three of eight clusters of the countries for the year 2001 The last cluster D among four main clusters is at lower level divided into two smaller and more similar clusters of countries. Their pyramids' shapes are presented in Figure 13. Slovenia is in the right cluster in Figure 13 together with 27 other countries. Among them were twelve countries (Belgium, Bulgaria, Croatia, Czech Republic, Finland, France, Gibraltar, Greece, Hungary, Japan, Portugal, and Spain) also in the selected cluster with Slovenia among eight main clusters for the year 1996. The shape of the pyramid in this cluster is rather symmetric, although there are larger values in the women side (specially for older than 75). Figure 14: Population pyramids of the main clusters of the countries for the year 2006 2006 we present the results of observation of four main clusters. From Figure 3, Figure 7, and Figure 14 we can see four main rather similar pyramids' shapes, although even these are slightly changing over time. Observing changes of countries inside each of the clusters A, B, C and D for each of the years 1996, 2001 and 2006, we can conclude the following: Countries from the cluster A in the year 1996 (Figure 3) are in the year 2001 in clusters A and B (Figure 7). All these countries except one (Vanuatu) are included in the cluster A in the year 2006 (Figure 14). Cluster B in the year 1996 is included in cluster B in 2001 (the only exception is Oman), and also mostly corresponds to the cluster B in the year 2006. Cluster C in the year 1996 is mainly presented in clusters B and C in the year 2001, and all except three of the countries from it are included in the cluster C in the year 2006. The remaining three countries (Kuwait, Turks and Caicos Islands, and United Arab Emirate) are in the year 2006 included in the cluster B. This is not surprising because of the differences of gender's distributions in these countries. The fourth cluster D of four main clusters in the year 1996 is mainly included in cluster D in the year 2001, and most of the countries, precisely 46 from 60 countries from it, are also included in cluster D in the year 2006. The remaining 14 countries from it are in cluster C in 2006. Similar comparisons were made for the years 2001 and 2006: The first cluster A of four main clusters in the year 2001 (Figure 7) is included in cluster A in 2006 (Figure 14). Most of the countries from cluster B in 2001 are included in clusters A and B in 2006, except six countries (Albania, Armenia, Dominica, Kazakhstan, Saint Kitts and Nevis, and Trinidad and Tobago), that are in 2006 included in cluster C. Five of the countries (Brunei, Guam, Northern Mariana Islands, Turks and Caicos Islands, and United Arab Emirates) from cluster C in 2001 are moved to cluster B in 2006, all the remaining countries are in 2006 included in cluster C. 46 of 54 countries (including Slovenia) from cluster D in 2001 create cluster D in 2006, the remaining eight countries are in 2006 included in cluster C. In the Figure 15 we present these movements among four main clusters for the years 1996, 2001 and 2006 with the number of countries. A B C D 1996 2001 2006 77 47 31 60 57 20 1 46 5 25 1 7 53 60 72 36 54 60 5 31 26 40 6 46 86 45 45 46 Figure 15: Movements presented with the number of countries among four main clusters for the years 1996,2001 and 2006 Differences among clusters can be observed in greater detail considering the hierarchies in each of the clusters. 4 Hierarchical clustering with relational constraint of US Counties For US counties age in population pyramids is divided into 18 five-years groups (0-4 years, 5-9 years, 10-14 years,..., 75-79 years, 80-84 years, 85+). In our model, for each US county each age group is considered as a separate variable for each gender, so each county is presented with 36 variables: 18 variables for 5-years age groups for men, and 18 variables for 5-years age groups for women. Values are normalized so that they present percentages of the county's population in each age group among the whole county population. Euclidean distance between the corresponding vectors is used. NA values in data for 6 counties were replaced with 0. For clustering procedure, the algorithm for hierarchical clustering with relational constraints based on the maximum hierarchical method was used. It is implemented in Pajek, the program for analysis and visualization of large networks. The agglomeration of two counties was restricted with the relational constraint based on neighboring counties (Ferligoj, Batagelj, 1983). The maximal method to calculate new dissimilarity between clusters was used. The neighboring relation is symmetric. Therefore the tolerant strategy to determine the relation between the new cluster and other clusters is used (Batagelj, Ferligoj, Mr-var, 2008). As in the case with the world's countries, also here the 'centroid' pyramids of the clusters of counties are not real population pyramids describing the whole population in the clusters, but are based on the new vector produced by the clustering procedure. So they can be interpreted only in terms of shapes, not as a population pyramids of counties' clusters, because the population size of the whole cluster is not taken into account. The cut of the dendrogram was done at height 0.06, which divided counties in 36 clusters with 155 isolates (counties that are very different from all their neighbors). Out of these 36 clusters were only 14 clusters with more than 10 vertices, therefore we decided to increase the height. At height 0.1 we obtained 9 clusters and 54 isolated counties. There are 6 clusters of more than 6 counties (precisely with 7, 15, 69, 402, 1152 and 1406 counties) and 3 of them with 2 counties. Clusters (groups of counties) are presented in Figure 16 with different shapes and colors. Group 1 is the largest group situated at eastern part of the USA (light gray circles). Darker gray group with triangles that borders group 1 is group 2, the second largest group of counties. Dark gray circular vertices at the Florida peninsula belong to group 5. 7 dark gray vertices in the middle of group 2 (in the center of the USA) represent group 7. The large white squares in the north and middle of the USA represent group 4, while area with lighter gray circles inside the bottom of group 2 belongs to group 3. Groups 6, 8 and 9 with 2 counties each are in Figure 16 represented with dark gray diamonds (group 6 is in the north-west of the USA, group 8 at the south east of group 3, and group 9 in the middle of the largest group 1). Further inspection of the pyramids' shapes of the clusters of counties shows that all three 2-vertex clusters (groups 6, 8 and 9) have average population pyramid with mostly young people in their 20s (Figure 17). We conjecture these are counties with mainly student population (surroundings of larger universities and colleges). More precisely: cluster 6 includes counties with University of Idaho and Washington State University. In cluster 8 are Madison and Walker County, Texas, with median ages 33 and 31 and with male population for more than 50% larger than female population. Cluster 9 includes Montgomery and Radford Counties, Virginia, with West Virginia University Institute of Technology, popularly called WVU Tech, and Radford University, which have strong influence on the age distribution. : ; JL Ài I i J Figure 17: Pyramids' shapes of three clusters with two counties Pyramids' shapes of the two largest clusters show typical all-American population pyramid (Figure 18) with rather typical constrictive shape (Figure 1). In two of other four clusters pyramids' shapes older 8 Figure 16: Clustering of US counties in the year 2000 with relational constraints Figure 18: Pyramids' shapes of two largest clusters population is more pronounced (looking bottom-up the pyramid bars start shrinking later than the overall American population pyramid). The groups are concentrated in Florida (first in Figure 19) and in Missouri (the second one in Figure 19). Figure 19: Pyramids' shapes of clusters with older population First of the last two among nine clusters shows relatively less people older than 30 than the overall American population, while the second one (North and middle of the USA) indicate less people in the 20s (they might be away for study). Figure 20: Pyramids' shapes of the two remaining clusters Because of the relational constraint (regional neighborhood) we got 54 isolated counties at the cutting level 0.1. In Figure 16 they are represented with black circles. They remain isolated because they have different pyramids' shapes than the neighboring counties. Most of the isolates (29 of them) are due to the proximity of a university. The shape of their population pyramids looks very much like those in Figure 17. 8 other isolates are more gender specific (have mostly more men than women). There are 5 isolates in the state of New York that have slightly different population distribution as the shape of group around them. The other isolates are of two types: they have either considerably less youngsters (or older people) than the surrounding counties or their pyramids look very random due to the small number of inhabitants. 5 Conclusion Population pyramid is a very popular graphical presentation of the age-sex distribution of the human population of a particular region. Its shape is influenced besides fertility and mortality (usually presented with demographical indicators as birth rates, death rates and growth rates) also by many other social and political policies and events, such as migrations, birth control policy, wars, life-style etc. Population pyramid offers insight into different phenomena in many fields interested in population observations, such as demography, geography, sociology, economy, politics etc. The aim of the paper was to observe how population pyramids of the world's countries corresponds to the main pyramids' shapes, which are usually related to the "Demographic Transition Model". Although the observation period of 10 years was short for the human life, substantial changes can be seen. Roughly speaking we can conclude from our observations, that the pyramid's shapes of the main clusters correspond to the Stages 1 and Stage between 3 and 4 of the "Demographic Transition Model" in the year 1996, and later are moved to the Stages 2 and even closer to 4 in the 2001 and 2006. The divide between the undeveloped and developed countries is increasing. Most of the main four clusters are quite stable through observed years. We are aware that for some observers differences are more important than generalization and they can be observed in detail with the separate inspection of the smaller parts of the hierarchy that belong to each cluster. In the second part, we examine pyramids' shapes of clusters of US counties, because in smaller territorial units the influence of local characteristics is even more emphasized, which reflects also on pyramids' shapes. For clustering 3111 mainland US counties, hierarchical agglom-erative clustering procedure with neighborhood constraint was used. The results confirm strong influences of local characteristics (for example universities) on the pyramids' shapes of smaller populations. The clustering procedure exposed some groups of counties with pyramids' shapes which strongly differs from all-American constrictive population pyramid's shape. Acknowledgement This work was partially supported by the Slovenian Research Agency, Project J1-6062-0101. References [1] Andreev, L. and Andreev, M. (2004) Analysis of Population Pyramids by a New Method for Intelligent Pattern Recognition, Matrixreasoning, Equicom, Inc. [2] Andreev, L. (2004) Fusion of Socio-Demographic Variables, Matrixreasoning, Equicom, Inc. http://www.matrixreasoning.com/ publications/l.html [3] Batagelj V., Ferligoj A. and Mrvar A. (2008): Hierarchical clustering in large networks. Presented at Sunbelt XXVIII, 22-27. January 2008, St. Pete Beach, Florida, USA. [4] Ferligoj A. and Batagelj V. (1983): Some types of clustering with relational constraints. Psychometrika, 48(4), p. 541-552. [5] Kaufman, L. and Rousseeuw, P.J. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York. [6] Pressat, R. (1978) Statistical Demography (Translated and adapted by Damien A. Courtney), Methuen, University Press, Cambridge. [7] Pressat, R. (1988) The Dictionary of Demography (Edited by Wilson, C.), Basil Blackwell. [8] International Data Base. http://www.census.gov/ipc/www/idbnew.html [9] Mrvar, A. and Batagelj, V. (1996-2008) The Pajek program - home page. http://pajek.imfm.si/ [10] R Development Core Team (2008) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org. Geometric-Invariant Image Watermarking by Key-Dependent Triangulation Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University Houghton, MI 49931, USA E-mail: shiyan@mtu.edu Keywords: image watermarking, geometric invariance, key-dependent triangulation, security, robustness Received: August 9, 2006 Fast and massive dissemination of image data across the Internet imposes great challenges of protecting images against illegal access and unauthorized reproduction. Image watermarking provides a powerful solution for intellectual protection. In this paper, a new image watermarking approach robust to various geometric distortions is proposed. The new scheme involves detecting image feature points and triangulating them in a secure key-dependent manner. The neighborhood pixel ratio of gray-scale image is investigated in the paper. It is a novel robust image feature which can be seamlessly combined with the proposed key-dependent triangulation scheme. A random pre-warping framework is adopted to make the scheme robust to collusion attack. Our experiments demonstrate that the new scheme is robust to rotation, scaling, noise addition, JPEG compression, StirMark, shearing transformation, collusion, and other common attacks in both spatial and frequency domain. Povzetek: Predstavljen je nov postopek za varovanje slik na internetu. 1 Introduction Fast and massive dissemination of image data across the Internet imposes great challenges of protecting images against illegal access and unauthorized reproduction. As an effective and efficient solution, image watermarking superimposes a copyright message into a host image before dissemination and then unauthorized reproduction can be recognized by extracting the copyright information. Numerous image watermarking techniques have been proposed in the literature such as [1, 2, 3, 4, 5, 6, 7, 8]. Along with the rapid growth of novel watermarking schemes, various attacking attempts have also been developed to destroy watermarks. Among these attacks, geometric attacks are very difficult to handle. This is mainly due to the fact that slight geometric manipulation to the marked image, such as scaling or rotation, could significantly reduce the possibility of a successful watermark retrieval, provided that the watermarking extractor has no knowledge of the distortion parameters. In another word, geometric distortion can easily introduce synchronization errors into the watermark extracting process [9, 10]. In recent years, a number of approaches have been proposed to counteract the geometric synchronization attacks. Popular techniques in the literature can be loosely classified into three categories: 1. Geometric invariant domain based watermarking schemes. In these algorithms, Fourier-Mellin transform is incorporated into some watermarking schemes (e.g., [11, 12]) to tackle with geometric attacks such as rotation, scaling and translation. However, these algorithms are computationally inefficient, hard to implement and cannot survive aspect ratio change [13]. 2. Template Matching-based watermarking schemes. In this class of algorithms, a template is embedded into the host image besides the watermark. The affine geometric distortions to the marked image can be reverted using the estimated parameters through detecting the template. After compensating geometric distortions, the watermark can be easily retrieved by the watermarking extractor. The major drawbacks of these techniques are that the template can be easily detected and removed by attackers [10, 14, 15]. 3. Content-based watermarking schemes. This class of watermarking schemes achieve recovery from geometric distortion using image content. A particular interesting scheme in this category is proposed in [15] where feature points are used as a content descriptor. The method works as follows. First a set of feature points which are robust to geometric distortion are detected. These points are often near corners or edges of the image. A Delau-nay triangulation is computed on the feature points and the watermark is then embedded into the resulting triangles. The above technique has two drawbacks. First, it always computes a Delaunay triangulation on the feature points. Therefore, provided that an attacker can successfully retrieve those feature points, the presence of a watermark can be easily determined and the watermark may be removed or distorted. Since usually well-known feature-point detectors (e.g., Harris detector [16]) are adopted, most feature points can actually be found by attackers. Second, the method is not as robust as expected: feature points are robust (i.e., can be completely retrieved by the watermarking extractor) only against small-degree rotation, and the per- formance of these techniques is considerably compromised when large-degree rotation occurs. This drawback has also been reported in [15, 17]. In this paper, we propose a new feature point-based image watermarking algorithm which is secure and robust to common attacks in both spatial and frequency domain. The proposed scheme first generates four image feature points using a novel robust intersection-based feature point detector. Based on the feature points, a number of additional points are generated and then triangulated in a key-dependent manner. Finally, the watermark is embedded into the resulting triangles. The key dependance properties of the proposed technique is motivated by the following results from the computational geometry literature. It is shown in [18] that there exist Q(2.33") different triangula-tions for a planar set of n points in general position. Therefore, even if attackers repeat the feature points, they are generally not able to compute the right triangulation. We also consider the application of the proposed watermarking scheme to image fingerprinting. Under this situation, robustness against collusion attacks becomes critical. Therefore, a random pre-warping framework is adopted to make the proposed scheme robust against such attacks. The performance of the new scheme is substantiated by the extensive experiments. Our experimental results demonstrate that the new scheme is robust to rotation, scaling, noise addition, JPEG compression, StirMark, shearing transformation, collusion, and other common attacks in both spatial and frequency domain. The rest of the paper is organized as follows: Section 2 describes the robust intersection-based feature point detector. Section 3 describes the key-dependent triangulation-based watermarking scheme. Section 4 presents the experimental results and analysis. A summary of work is given in Section 5. 2 Robust intersection-based feature point detector The first step is to compute some feature points from an image. To this effect, numerous techniques can be applied, however, even the popular Harris detector [16] cannot guarantee the repeatability of feature points after a large degree rotation [15,17]. To settle this problem, our strategy is that we first rotate an image by each integer degree, and apply Harris detector to each resulting image. The intersection of the detected points forms the feature point set. Note that smaller degree interval could be applied, however, integer degree interval suffices as indicated in our experiments. The parameters of Harris detector are determined as follows. In principle, we try to find a nice set of parameters such that the intersection of feature points from all images, after rotated back, contains only four feature points. Therefore, all that we need to record for the watermarking extractor is the set of these parameters and the key. The latter will be described in Section 3. For convenience, we use the popular Lena image as an example to illustrate the ideas in this paper. We first rotate the Lena image (of size 512 x 512) by 1°, 2°,..., 359° and then apply Harris detector with the same parameter values to each resulting image. Some detection results are shown in Fig. 1. We then rotate each image back (e.g., we rotate the second image in Fig. 1 by 15° in clockwise direction), and compute the intersection of all the feature points after necessary translation. The resulting intersection contains only four points as illustrated in Fig. 2(a). For completeness, some details of our implementation of Harris detector [16] are elaborated as follows. 1. Compute x and y derivatives of image I Ix = * I,Iy = * I. (1) 2. Compute products of derivatives at every pixel Ix2 — Ix ' Ix, Iy2 — Iy ' Iy ? Ixy — Ix ' Iy. (2) 3. Compute the sums of the products of derivatives at each pixel tSx2 = G^l * Ix2 , Sx2 = G^l * Iy2 , Sxy = G^l * Ixy ■ 4. Define at each pixel (x, y) the matrix H (x,y) — Sx2 ? Sxy Sx,y, Sy2 y2- / 5. Compute the response of the detector at each pixel R — Det(H) - k(Trace(H))2. (3) (4) (5) Several parameters are to be determined: the sigma of Gaussian derivatives, the sigma of the Gaussian integration, the k in the computation of "cornerness", the size of the window for computing the local maximum in R, and finally the threshold for "cornerness". The parameters used for Fig. 1 are a — 0.5, a' — 0.8, k — 0.05, Theshold — 52000, and window size is set to 30 x 30. The heuristic to determine the parameters reads as follows. a, a', k and window size are first set and Theshold is changed from larger values to smaller values to obtain the desired effect. It is possible that Harris detector with a set of parameters generate more than four intersection points, while slightly increasing the threshold will lead to less than four intersection points. In this case, we do not increase the threshold, instead, we compute four special points from the obtained intersection points. The following process is also useful for recomputing the intersection points in breaking a tie (see Section 3.1). Suppose that there are k intersection points. We first compute the convex hull [19] on them and three cases follow. (1). Exactly four points lie on the convex hull. Then they are returned as the final intersection points. (2). More than four points lie on the convex hull. In this case, the hull edges are sorted according to their lengths and points pi and p j linking the shortest edge are merged. That is, pi (resp. p j ) is removed if pi (resp. p j ) is to the left Figure 1: Feature points (denoted by +) obtained by Harris detector for Lena after rotation of 0°, 15°, 30°, 60°, 120°, 150°, respectively. Figure 2: (a) Intersection points: + denotes the intersection of feature points by Harris detector for Lena. (b) New points for Lena: + denotes four feature points, x denotes 30 generated points. of p j (resp. Pi) in clockwise direction, and the convex hull is then accordingly updated. The process is repeated until only four points are on the convex hull. (3). Three points lie on the convex hull. We then arbitrarily pick a feature point inside the hull and return these four points. The newly picked point does not impact the generation of additional points, refer to Section 3.1. 3 Key-dependent triangulation based watermarking 3.1 Generating additional points Through the above phase, we have four feature points in hand. The following process is carried out to generate N new points in a key-dependent manner. Key-dependent property involves using pseudo-random numbers. Throughout this paper, pseudo-random numbers are generated depending on a secret key, which is stored for the watermarking extractor. The procedure reads as follows. (1). In the case of four hull vertices, we first compute the longer one of the two diagonal segments formed by these vertices. Denote by p a, Pc the two endpoints and by pb, p d other points where pa,pb,pc,pd are in clock-wise order and pa = aigmaxpi^_{pa pcyd2{pi,pb) + d2(pi,pd), d2(-) being the Euclidean distance function. In the case of three hull vertices, the longest hull edge is returned as papc and another hull vertex is pb such thatpa,pb,pc are in clock-wise order. pd is the point inside the hull. Rotate the image such that p^pa is 45° with respect to the horizontal direction. Whenever there is a tie, we choose another set of parameters for Harris detector to recompute the intersection points. Note that images shown in this paper are first rotated back to its original position for the convenience of illustration. (2). Compute the bounding box of feature points in the rotated image as follows. Let A = {a, b, c, d}, and let minx = mini^A {xi},maxx = maxi^A{xi},miny = mvinieA{yi},maxy = maxi£A{y,}. The bounding box is defined by {(minx,miny), (maxx,miny), (maxx, maxy), (minx, maxy)}. (3). Generate two uniform deviates Ai and A2 with our key. A new point is generated as (minx ■ A1 + maxx ■ (1 — Ai), miny ■ A2 + maxy ■ (1 — A2)). (4). Repeat Step (3) until the number of new points reaches N. Refer to Fig. 2(b) for 30 newly generated points. Taking the first generated point as the origin and the vector formed by the first and the second points as the direction of x-coordinate, we build up a reference coordinate system conforming to the right-hand rule. Note that the original four feature points will not be used in the following watermark embedding process. We are now ready to triangulate the N new points in a key-dependent way. 3.2 Computing key-dependent triangulation Refer to Fig. 3(a) for a reference coordinate system with the origin v s, where the dotted line with an arrow represents x-axis. Recall that a triangulation of a planar point set is a maximal set of non-intersecting straight-line segments connecting points in it [19]. The key-dependent triangulation is computed as follows. (1). Sort all vertices (i.e., points) by their polar angle in the reference system. The distance to the origin vs is used Figure 4: Sample triangulations for Lena with different generated points. for tie breaking. Denote the resulting set by V1. Refer to Fig. 3(b) where the vertices are numbered in the order. (2). The triangulation is built incrementally and involves using pseudo-random numbers. First compute as Vvs the set of all vertices visible to vs in V. Note that va and vb are visible to each other if Va, vb are distinct and the line segment va vb does not intersect with any existing segment in the incomplete triangulation. Of course, we require that vavb itself has not been inserted yet. In Fig. 3(b), there is no inserted segment, so Vys = {vi,..., vg}. Randomly pick a vertex vh^ from Vys and insert line segment vgvh^. We then compute Vyhi which contains all the vertices visible to vhl, randomly select a vertex vh2 from Vyhi, and insert the segment vh^ vh2 into the graph. As an illustration, suppose that at a time point, the incomplete triangulation is as Fig. 3(c) and the current vertex is vr, then Vv7 = {v4, ve, vg}. The above process is repeated until either the triangulation is completed or no vertex is visible to the current vertex. Suppose that the triangulation proceeds to Fig. 3(d). The last few visited points form the sequence of vg - v1 - v2 - v3 - v5 - v4. Since there is no vertex visible to the current vertex v4, we have to find the next vertex through backtracking, i.e., the current vertex is changed to vs which is the most recent vertex except v4 . If yet no visible vertex for vs, we will continue the backtracking process. In the case of Fig. 3(d), we need to backtrack to v3 where Vv3 = {vs, vi, vg}. A complete triangulation is shown in Fig. 3(e). For the Lena image, two sample triangulations are shown in Fig. 4. 3.3 Watermark embedding process Suppose that the reference coordinate system and the triangulation can be repeated after various attacks to the original image, the present issue is to come up with a robust embedding method for each triangle. For this purpose, we investigate as follows a neighborhood information-based image feature which is robust to various attacks such as noise, JPEG compression and geometric distortions. This feature has been successfully applied to watermarking binary document images in the previous work [20]. However, more work is needed for extending this tool to watermark € V may be regarded as the 0-th vertex. grayscale images. First note that too small triangles (with the area below a threshold) are first eliminated from consideration. All remaining triangles are then ordered/indexed in the following way. For two triangles defined by vertices va,vb,vc and v d ,ve,vf respectively (without loss of generality, assume that a < b < c, d < e < f ), the order of them is determined by a and d; If a = d, then compare b and e; If still tie, then compare c and f. Given N points, all triangles are uniquely indexed from 1 to N' < 2N - k - 2 where k is the number of vertices on the convex hull of these N points (see [19]). Note that the inequality is due to removal of small triangles. We call each triangle a partition of the image. Denote each partition as pavi, i = 1,... ,N '. We now discuss how to embed the watermark bitstream to a single partition. The weight w (pav i ) of a partition pav i is defined as follows. A counter, initialized to 0, is associated to pavi. For each pixel g inside pavi, we check for its eight neighbors: if more than three neighbors of g have intensity values larger than a threshold Tl, the counter corresponding to pavi is incremented. w(pavi) is defined as "OaUnnl^ir where aveai denotes the area in pixels of pavi. We call this ratio (or partition weight) the neighborhood pixel ratio (NPR). The NPR ratio for gray-scale image is an extension of the NPR ratio for binary image, which is a robust feature as shown in [20]. For a single partition, noise can be often "filtered out", e.g., random noise can be "filtered out" from incrementing the counter since we increase it only when at least four neighbors of a pixel have intensity values larger than TL. Such a computation captures the intrinsic characteristic of a partition to some extent. We illustrate this fact through an example, refer to Fig. 5(a) for a partition from Lena. The involved threshold TL is set to 120. The original partition Fig. 5(a) and the noisy partition Fig. 5(b) are visually different, however, their neighborhood pixel ratios are 0.3267 and 0.3140, respectively. Two ratios differ only by 1%! We also test NPR ratio for scaling, rotation, JPEG compression, low pass and median filtering. Refer to Table 1 for the resulting NPR ratios. All ratios are similar to each other. In contrast, the ratio of another partition from Lena shown in Fig. 5(d) computes to 0.2183, which differs from the original's by 31%! The combinations of the above attacks are also performed, and the NPR ratio is resistant to these attacks. As an example, Fig. 5(c) shows a rotated noisy partition whose NPR ratio computes to 0.3129, very close to 0.3267. The above experiments demonstrate the robustness of the neighborhood pixel ratio. Refer to Section 4 for further experiments on robustness against geometric distortions. It remains to present the principle for modifying a partition - we only modify pixels whose intensity values are close to Tl. For each such pixel, we may either increase or decrease the intensity value depending on whether the ratio is to be increased or decreased. For example, increasing a pixel's intensity from Tl - 2 to Tl is possible to increase the partition weight. The process is repeated un- 1; .-i > • 8 • 7 4 Figure 3: A simple example for key-dependent triangulation, from left to right (a)(b)(c)(d)(e). Figure 5: Top to bottom, left to right: (a) the original partition (b) the partition corrupted with synthetic Gaussian noise with a = 30, (c) rotating by 30° followed by noise addition (the resulting image is scaled down here due to space limitation) (d) another partition (e) a modified partition of (a). Table 1: NPR ratio for the attacked image partition. Attack A partition with the weight of 0.3267 JPEG with QF of 10% 0.3178 Add. White. Gauss. Noise, a = 20 0.3140 Low Pass Filter 0.3126 3 X 3 Median Filter 0.3098 Scale by Factor of 0.5 0.3248 Scale by Factor of 2.0 0.3263 Rotation by 30° 0.3202 Rotation by 60° 0.3126 til the counter reaches the goal within an error of 4/area. For instance, refer to Fig. 5(e) for a modified partition (of Fig. 5(a)) whose ratio is 0.2651 compared to the original ratio 0.3267. The modification is not visually perceptible. Based on the NPR ratio, we are ready to present the watermark embedding process, which is motivated by [21]. Recall that every triangle is indexed. We first randomly select [N'/2J triangles and denote the triangle sequence by Oi,O2,... ,OiN i/2j. We then randomly select the remaining triangles to form the sequence Zi, Z2,..., Z^^i/2j. The watermarking extractor works to retrieve the embedded bitstream. It compares w(Oi) to w(Zi) for each i to decide about the marked bit: if w(Oi) - w(Zi) > Tj, then a 1 is embedded; if w(Zi) - w(Oi) > Tj, then a 0 is embedded. Therefore, the watermarking embedder needs to accordingly modify the relationship between w(Oi) and w(Zi) to embed bitstream. To this effect, without loss of generality, assume that we aim to embed a 0, however, presently w(Oi) > w(Zi). In this case, we modify the par- represented as [22] titions to change w(Oi) to w(Oi)+w(Zi) - and w(Zi) to w(Oi)+w(Zi) 2 + ^J 3.4 Robustness against collusion attacks In this section, we discuss the application of the proposed watermarking algorithm to image fingerprinting. The term fingerprinting refers to superimposing a unique watermark onto each copy of the distributed data. The embedded watermark can be used to identify the unauthorized copies of the data [22]. Digital watermarking can naturally serve as an effective and efficient approach to fingerprint digital data. However, a main shortcoming in applying conventional image watermarking techniques for fingerprinting is that they are not designed to be robust against collusion attacks, which are common attacks to destroy fingerprints. These attacks are performed by a group of colluders with the same digital data containing different fingerprints. A common implementation of such an attack is simply averaging multiple marked versions of an image [23]. Most existing watermarking schemes robust to collusion, such as [24, 25, 26, 27, 28, 29], have shortcomings including compromised watermarking capacity and decreasing effectiveness with the increasing number of colluders [22]. In this paper, we adopt the random pre-warping framework originally proposed in [22] to design a collusion and geometric resistant watermarking scheme. For completeness, some details of the approach in [22] are included as follows. Other than trying to detect collusion and identify traitors by fingerprint, the random pre-warping framework shoots for preventing traitors from obtaining a high-quality copy through collusion. Basically, the method randomly distorts the host image before embedding fingerprint to it such that averaging multiple versions will introduce annoying artifacts and only result in a low-quality image with no commercial value. The idea is feasible due to the following reasons as shown in [22]. For additive watermarking procedures, an averaged image from K distinct copies can be 1 K ! K Sa — ^E Si — S + ^E i=1 K (6) i=1 where S is the host image, Wi is the watermark, and Si — S + Wi is the watermarked image. It is exjpected that Sa looks similar to S due to the fact that ^ Wi should vanish since each watermark Wi can be regarded as a random pattern. In contrast, if we distort S before superimposing Wi onto it, we have [22] K K K Sa — ^E Si — ^E ^i (S ) + ^E Wi, i=1 K i=1 K (7) i=1 where denotes the distortion function. Even if the second term vanishes, we can choose (^(■) such that ^K EK=1 ^i(S) is visually different from S citeCST04. It is shown in [22] that this can be achieved using the standard Stirmark tool [30, 31] which forms the basis of desynchro-nization attacks on many watermarking schemes. With the above introduction, a watermarking scheme robust to both geometric attacks and collusion attacks is clear: we first randomly pre-warp the image followed by embedding watermarks as in the previous sections. The complete process for embedding watermarks onto a host image is summarized in Algorithm 1. To extract a watermark from a possibly modified marked image, we carry out the extracting process as shown in Algorithm 2. 4 Experimental results We have performed experiments over various gray-scale images. We choose the Lena image to present our results and analysis. The extensive experiments on our whole image set are described in the end of this section (see Table 3). To evaluate the robustness of the proposed watermarking scheme, common attacks are tested. For a possibly modified watermarked triangulation, we define the watermark strength as follows. Recall that w(Zi) - w(Oi) > TJ denotes embedding 0 while w(Oi) - w(Zi) > TJ denotes embedding 1. When a marked triangle is attacked, a 1 (resp. 0) can be extracted if w(Oi) - w(Zi) > 0 (resp. w(Oi) - w(Zi) < 0). Therefore, our scheme can tolerate up to T J unit changes in triangle weight. Denote by (resp. ^2) the smallest value of w(Oi) - w(Zi) (resp. w(Zi) - w(Oi)) for all i where a 1 (resp. 0) is actually embedded. The watermark strength is defined as . Clearly, the watermark becomes more robust with thJe increasing value of the watermark strength. It follows from Section 3.3 that the maximum possible value of watermark strength is 12. If a watermark strength is negative, the watermark may not be correctly extracted. When this hap- 2Exception occurs when relationship between w{Zi) and w{Oi) for each i exactly matches the embedding bit sequence. However, it is very unlikely and not observed in the experiments. Algorithm 1 Watermark Embedding Process 1: Use Stirmark to randomly pre-warp the image. Only geometric distortions in StirMark is applied. 2: Compute the four feature points as in Section 2, i.e., determine a set of parameters such that the intersection of feature points for all rotated image versions contains only four points. The set of parameters is recorded for the watermarking extractor. 3: Based on the intersection points, generate N new points in a key-dependent manner. The key is recorded for the watermarking extractor. 4: Triangulate the generated points in a key-dependent manner. 5: Remove too small triangles whose areas are below a threshold and index the remaining ones. 6: Use two pseudo-random triangle sequences to embed the watermark as in Section 3.3. Algorithm 2 Watermark Extracting Process 1: Use the set of recorded parameters to compute the four intersection feature points, i.e., rotate the possibly modified marked image and the intersection of feature points for all rotated image versions should contain only four points. 2: Based on the computed four feature points, generate N new points using the user's key. 3: Triangulate the generated N points using the user's key. 4: Remove too small triangles and index the remaining ones. 5: Generate two pseudo-random triangle sequences with the key and extract the watermark as follows. Compare w(Oi) with w(Zi) for each i: if w(Oi) — w(Zi) > Tj, then a 1 is embedded; if w(Zi) — w(Oi) > TJ, then a 0 is embedded. We set up a small positive value ö for fault-tolerance purpose, i.e., we treat Tj « Tj ± ö. pens, we are yet interested in how many bits can be correctly extracted. We are now ready to present our experimental results. 1. The different image transformations tested are scaling, rotation, and a combination of these transformations. The original Lena image of size 512 x 512 is shown in Fig. 6(a). The watermarked Lena image is shown in Fig. 6(b). The involved Tl is set to the average intensity value of each image. PSNR for the watermarked image is low (18.82) due to the random pre-warping by StirMark. Without applying random pre-warping, the PSNR for watermarked image is 46.13 (see Fig. 6(c)). In the case of single transformation, the proposed technique is robust to any degree of rotation (without cropping) and scaling with factor of 0.5 and 2, respectively. In the above attacks, all embedded bits are extracted. Note that in scaling attacks, all generated points lie in the bounding box of the rotated convex hull and are dependent on some random ratios (i.e., A1,A2). Therefore, as long as the scaling transformation is uniform to the image and the index (i.e., the order) of bounding box vertices is repeatable, the generated points and thus the triangulation are repeatable. In the case of combined transformation, we test the scheme on various sequences of transformations, e.g., Fig. 6(d) shows the resulting image after the sequence of transformations including scaling by factor of 2, 20° rotation and cropping that maintains 60% of the image. All bits are successfully extracted in this class of tests, except those with too much cropping such that feature points have been removed. Since feature points are intrinsic to an image, we consider that such an attack has degraded the quality of the original image to a significant extent and thus the cropped image becomes less useful. 2. For further analysis, we have tested the proposed watermarking scheme for nonlinear geometric attacks through StirMark (with default parameter values), shearing transformation (Gimp software [32] is used to perform this transformation), compression attacks using JPEG (with a wide variety of quality factors ranging from 10% to 90%), and addition of Gaussian noise with a = 20. In all cases, successful retrieval of watermark is reported in our experiments. Note that applying StirMark to a watermarked image still gives acceptable quality of image, see Fig. 6(e). Refer to Fig. 6(f) for the image after shearing distortion. 3. To show the robustness of the proposed scheme to re-watermarking, it is necessary to consider the following scenario. We first embed a watermark Wx to an image, then another watermark Wy is embedded to the marked image. We also embed Wy directly to the original image. Two resulting images must be different, and this is the case as demonstrated by our experiment. In addition, we even re-watermark a marked image (1) without applying the StirMark (since it usually causes considerable difference) (2) using the same set of parameters for detecting feature points. That is, two embedding processes only differ in the generated points and thus the triangulations, both of which are purely dependent on the user's key. Refer to Figure 6: Left to Right, Top to bottom: (a) original Lena image, (b) watermarked Lena with random pre-warping, (c) watermarked Lena without random pre-warping, (d) scaled, rotated and cropped image, (e) StirMarked watermarked image, (f) marked image after shearing distortion, (g) re-watermarking of marked image. Fig. 6(g) for a re-watermarked Lena over Fig. 6(b). The normalized Li distance between Fig. 6(b) and Fig. 6(g) is 4.02, i.e., on average the corresponding intensity values in the two images differ by 4. 4. The next test is to modify the intensity values of the marked image by either increasing or decreasing intensity value by a fixed small amount. Recall that the involved average intensity value Tl is set to the average intensity value of the whole image, thus the watermark is robust to such attacks (refer to Table 2). It is still the case for modifying intensity value by some small random amounts. A more effective attack is to modify the intensity value very close to the average intensity value. However, this attack will not really cause trouble, since instead of setting TL to the average intensity value, we can set TL to be the average intensity value multiplied by a key-dependent random number between [0.5,1.5]. The improved watermarking scheme by key-dependent Tl is tested and with ó = Tj/2, we are usually able to extract all bits. The exception occurs when the attacker correctly guesses TL and randomly exchanges considerable amount of pixels across Tl. In that case, we are still able to extract more than 80% bits as indicated by our experiments. Assume that the maximum amount for modifying any intensity value is 3, then to effectively defeat the watermarking scheme, one needs to correctly estimate the average intensity value as within the range of [TL - 3,TL + 3]. For an image with the average intensity value of 120, TL may be of any value in [60,180]. Thus, an effective guess happens only with a probability of 6/120 = 5%. Refer to Table 2 for quantitative results. Refer to Fig. 7 for watermark strength versus trials of attacks. Out of 100 attacks, only 6 attacks effectively defeat the scheme. 5. Spatial domain filtering is also a class of common attacks. In our experiment, 3 X 3 median filtering and 3 X 3 mean filtering are considered. A frequency-domain low 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 0 10 20 30 40 50 60 70 80 90 100 Trial of attacks Figure 7: Watermark strength in different trials of attacks through random intensity manipulation in spatial domain. pass filter is also applied. Our further analysis includes a random change in the image's frequency domain. The maximum possible change to amplitude of FFT coefficients is ±10% of the original value. Refer to Table 2 for the watermark strength and the ratio of correctly detected bits after these attacks. 6. Printing and Scanning. The print and scan test combines multiple attacks. For example, printing introduces re-quantization and grid apparition while scanning introduces geometric distortions [15]. Our experiments (refer to Table 2) demonstrate that the proposed scheme is robust to the print/scan attack. 7. Collusion Attacks. Due to the random pre-warping framework, the quality of colluded images should be significantly degraded. This is the case as indicated by our experiment. Refer to Fig. 8 for colluded copies of water- marked images. Both Lenas are significantly blurred and are thus of less commercial value. In concluding this section, we present the results for carrying out all the above tests on 50 collected images, refer to Table 3. All ratios shown are averaged over 50 images. Note that for the combinational attack with cropping, the watermark strength is not computed since some triangula-tions are not repeatable due to too much cropping and thus no ratio can be computed in those cases. From Table 3, one sees that the watermark strength is high for all types of attacks and on average, more than 80% marked bits can be correctly extracted even for the most effective attacks. Our experimental results clearly demonstrate the effectiveness of the proposed method. 5 Conclusion We propose a new content-based image watermarking scheme. The scheme belongs to the class of second generation watermarking schemes whose advantages include automatic re-synchronization and exclusion of unreliable template embedding [15, 33]. The strength of the proposed scheme is demonstrated through successful watermark detection after various common attacks such as geometric distortions, StirMark attacks and shearing transformations. The main contribution of this paper is three-fold. First, a spatial domain key-dependent triangulation framework is proposed. Based on the framework, a highly secure and robust image watermarking scheme is presented. Second, a novel feature for gray-scale images, the neighborhood pixel ratio is investigated in this paper. It is an extension of the binary image NPR ratio presented in our previous work [20]. Third, detecting rotation-invariant feature points through inspecting all rotated images is investigated in this paper. Such an idea may have its own interest as well. The proposed key-dependent triangulation framework can be easily combined with other watermarking techniques to obtain a highly secure watermarking scheme. For example, NPR ratio-based embedding process (for each triangle) could be substituted by proper existing geometric-resistant techniques. To design and analyze the combination of the key-dependent triangulation framework with the existing watermarking methods would be an interesting future work. Acknowledgement The author would like to thank the anonymous reviewers for their helpful comments. References [1] I. Cox, J. Kilian, F. Leighton, and T. Shamoon, "Secure spread spectrum watermarking for multime- dia," IEEE Transactions on Image Processing, vol. 6, no. 12, pp. 1673 - 1687, 1997. [2] J. Hernandez, M. Amado, and F. Perez-Gonzalez, "Dct-domain watermarking techniques for still images: detector performance analysis and a new structure," IEEE Transactions on Image Processing, vol. 9, no. 1,pp. 55-68,2000. [3] Z.-M. Lu, D.-G. Xu, and S.-H. Sun, "Multipurpose image watermarking algorithm based on multistage vector quantization," IEEE Transactions on Image Processing, vol. 14, no. 5, pp. 822 - 831, 2005. [4] S.-H. Wang and Y.-P. Lin, "Wavelet tree quantization for copyright protection watermarking," IEEE Transactions on Image Processing, vol. 13, no. 2, pp. 154 - 165, 2004. [5] C.-H. Chang, Z. Ye, and M. Zhang, "Fuzzy-art based adaptive digital watermarking scheme," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 65-81, 2005. [6] J. Cannons and P. Moulin, "Design and statistical analysis of a hash-aided image watermarking system," IEEE Transactions on Image Processing, vol. 13, no. 10, pp. 1393 - 1408, 2004. [7] A. Alattar, "Reversible watermark using the difference expansion of a generalized integer transform," IEEE Transactions on Image Processing, vol. 13, no. 8, pp. 1147- 1156, 2004. [8] J. Tzeng, W.-L. Hwang, and I.-L. Chern, "An asymmetric subspace watermarking method for copyright protection," IEEE Transactions on Signal Processing, vol. 53, no. 2, pp. 784 - 792, 2005. [9] M. Alghoniemy and A. Tewfik, "Geometric invari-ance in image watermarking," IEEE Transactions on Image Processing, vol. 13, no. 2, 2004. [10] P. Dong and N. Galatsanos, "Affine transformation resistant watermarking based on image normalization," Proceedings of International Conference on Image Processing (ICIP), vol. 3, pp. 489 - 492, 2002. [11] J. Ruanaidh and T. Pun, "Rotation, scale and translation invariant digital image watermarking," Proceedings of the IEEE International Conference on Image Processing (ICIP), 1997. [12] C.-Y. Lin, M. Wu, J. Bloom, M. Miller, I. Cox, and Y.-M. Lui, "Rotation, scale, and translation resilient public watermarking for images," IEEE Transactions on Image Processing, vol. 10, no. 5, pp. 767-782, 2001. [13] X. Qi and J. Qi, "Improved affine resistant watermarking by using robust templates," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. III, pp. 405-408, 2004. Table 2: Watermark strength of the attacked Lena image. Image Lena Attack Watermark strength Ratio of cor. det. bits Scale by Factor 0.5 0.95 100% Scale by Factor 2.0 0.98 100% Rotation by 10° 1 100% Rotation by 20° 0.98 100% StirMark 0.76 100% Shearing 0.82 100% JPEG QF 10% 0.91 100% Noise 0.83 100% Low Pass Filter 0.80 100% 3 x 3 Median Filter 0.91 100% 3 x 3 Mean Filter 0.82 100% Rand. change by same amount 0.91 100% Worst result in Rand. change in Spa. Domain -0.25 87% Worst result in Rand. change in Fre. Domain 0.18 100% Print and Scan 0.32 100% Figure 8: Colluded copies of Lena. (a) colluded image using two copies (b) colluded image using five copies. Table 3: Averaged watermark strength over 50 attacked images. Attack Avg. watermark str. Ratio of cor. det. bits Scale by Factor 0.5 0.95 100% Scale by Factor 2.0 0.97 100% Rotation by 10° 0.99 100% Rotation by 20° 0.96 100% Rotation by 60° (no cropping) 0.93 100% Combinational attack (no cropping) 0.95 100% Combinational attack (with cropping) - 83% StirMark 0.81 98% Shearing 0.83 98% JPEG QF 10% 0.89 100% Noise 0.86 99% Low Pass Filter 0.81 100% 3 X 3 Median Filter 0.91 100% 3 X 3 Mean Filter 0.85 100% Rand. change in Spa. Domain 0.58 89% Rand. change in Fre. Domain 0.70 96% Print and Scan 0.65 96% [14] A. Herrigel, S. Voloshynovskiy, and Y. Rytsar, "The watermark template attack," Proceedings of SPIE: Security and Watermarking of Multimedia Contents III, pp. 394-405, 2001. [15] P. Bas, J.-M. Chassery, and B. Macq, "Geometrically invariant watermarking using feature point," IEEE Transactions on Image Processing, vol. 11, no. 9, pp. 1014-1028, 2002. [16] C. Harris and M. Stephens, "A combined corner and edge detector," Proceedings of 4th Alvey Vision Conference,pp. 147-151, 1988. [17] J. Dittmann, T. Fiebig, and R. Steinmetz, "New approach for transformation-invariant image and video watermarking in the spatial domain: self-spanning patterns (ssp)," Proceedings of SPIE: Security and Watermarking of Multimedia Contents II, vol. 3971, pp. 176-185, 2000. [18] O. Aichholzer, F. Hurtado, and M. Noy, "A lower bound on the number of triangulations of planar point sets," Computational Geometry: Theory and Applications, vol. 29, no. 2, pp. 135-145, 2004. [19] M. de Berg, M. van Kreveld, M. Overmars, and O. Schwarzkopf, Computational Geometry: Algorithms and Applications, 2nd ed. Springer-Verlag, 2000. [20] S. Hu, "Document image watermarking algorithm based on neighborhood pixel ratio," Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2005. [21] J. Smith and B. Comiskey, "Modulation and information hiding in images," Proceedings of Information Hiding Workshop, 1996. [22] M. Celik, G. Sharma, and A. Tekalp, "Collusion-resilient fingerprinting by random pre-warping," IEEE Signal Processing Letters, vol. 11, no. 10, pp. 831 -835, 2004. [23] M. Wu, W. Trappe, Z. Wang, and K. Liu, "Collusion-resistant fingerprinting for multimedia," Signal Processing Magazine, IEEE, vol. 21, no. 2, pp. 15 - 27, 2004. [24] D. Boneh and J. Show, "Collusion-secure fingerprinting for digital data," IEEE Transactions on Information Theory, vol. 44, no. 5, pp. 1897-1905, 1998. [25] J. Dittmann, A. Behr, M. Stabenau, P. Schmitt, J. Schwenk, and J. Ueberberg, "Combining digital watermarks and collision secure fingerprints for digital images," Proceedings of SPIE: Security and Watermarking of Multimedia Contents, pp. 171-182, 1999. [26] J. Su, J. Eggers, and B. Girod, "Capacity of digital watermarks subjected to an optimal collusion attack," Proceedings of the European Signal Processing Conference (EUSIPCO), 2000. [27] J. Domingo-Ferrer and J. Herrera-Joancomarti, "Simple collusion-secure fingerprinting schemes for images," Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC), 2000. [28] J. Dittmann, P. Schmitt, E. Saar, J. Schwenk, and J. Ueberberg, "Combining digital watermarks and collusion secure fingerprints for digital images," Journal of Electronic Imaging, vol. 9, 2000. [29] Z. Wang, M. Wu, W. Trappe, and K. Liu, "Anti-collusion of group-oriented fingerprinting," Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), 2003. [30] F. Petitcolas, R. Anderson, and M. Kuhn, "Attacks on copyright marking systems," Proceedings of the Second Information Hiding Workshop, LNCS vol. 1525, Springer-Verlag, New York, pp. 219-239, 1998. [31] F. Petitcolas, "Watermarking schemes evaluation," IEEE Signal Processing, vol. 17, no. 5, pp. 58-64, 2000. [32] M. Cutts, "An introduction to the gimp," http://www.acm.org/crossroads/xrds3-4/gimp.html, 1997. [33] M. Kutter, S. Bhattacharjee, and T. Ebrahimi, "Towards second generation watermarking schemes," Proceedings of International Conference on Image Processing (ICIP), vol. 1, pp. 320 - 323, 1999. A Simple Algorithm for the Restoration of Clipped Speech Signal Abdelhaklm Dahlmene, Mohamed Noureddlne and Aarab Azrar Electrical and Electronic Engineering department, Boumerdes University Boumerdes, Algeria, 35000 E-mail: dahimenehakim@yahoo.fr Keywords: speech signal, clipped speech, restoration, interpolation, linear prediction, least square method, Kalman filter Received: August 6, 2007 This paper deals with the problem of peak clipped speech. Our basic assumption is that the clipped speech is voiced and can be linearly predicted with a high accuracy. The coefficients of linear prediction are computed using two different algorithms: a least square direct method and a recursive Kalman filter. The speech reconstruction is accomplished using backward prediction. Povzetek: Predstavljen je algoritem za obnavljanje zvočnega signala. 1 Introduction Speech acquired by personal computer sound cards is often confronted with two main problems: DC level wandering and peak clipping. While building a data base for our speech recognition project, we have been confronted with both problems. The first one is easily eliminated by simple linear processing but the second one requires more complex algorithms. Peak clipping is fundamentally a non linear distortion. It is characterized by the fact that several successive values of the signal disappear and are replaced by a constant. However, it happens that speech signal is highly predictable. So, in essence, peak clipped speech restoration is a problem of interpolation since we are trying to find missing values by using the properties of the signal itself. There exist several methods of interpolation: polynomial (Lagrange, Newton), spline, etc. In the case of peaked clipped speech, an appropriate method is statistical interpolation [1]. 1E+7-= 1E46 - 1F»5 - 1E-M 1E+3 Q(D 10000 200.00 300.00 400.00 Figure 1: Mean magnitude and ZCR scatter plot [3] 2 Justification of the method When there is no a priori information on the signal, the classical numerical interpolation methods (polynomial and spline) should be used. Band limited interpolation [2] uses only the fact that the signal is band limited. Statistical interpolation based on linear prediction [2, 4] uses the fact that that speech signal is highly predictable. A speech segment is composed of a sequence of voiced, unvoiced and silence (noise) segments [2]. The type of speech signal that has the greatest probability for being peak clipped is voiced speech [2, 3]. Figure 1 represents a scatter plot of voiced, unvoiced and silence mean magnitude and zero crossing rate of segments of speech. Voiced speech segments are indicated by the letter "V", unvoiced segments by the letter "U" and the silent segments by the letter "S". It shows clearly that the voiced signals cluster at high mean magnitude values. The mean magnitude is defined as: TO M^ = V x(m) w(n - m) (1) where w(n) is a rectangular window of length 256 samples and the zero crossing rate (ZCR) is: +M ZCR(n) = V |sgn[x(m)] - sgn[x(m - 1)]|w(n - m) (2) m Fortunately, voiced speech happens to be quite predictable. Voiced speech follows quite closely the linear prediction equations [4, 5]. Commercial software like DC-6, from Diamond Cut products, use low order linear prediction for clipped audio signal restoration and the problem of audio signal interpolation have also been addressed by Vaseghi [1] who uses linear prediction from adjacent samples and samples one period away (audio signals are assumed to be periodic). Voiced speech can be considered as a quasi periodic signal. It can be modelled as the output of a linear time invariant system (during few milliseconds, the system can safely be assumed to be time invariant) driven by a periodic train of impulses. In this case, a quite general formulation of the signal will be: p p xn = Z akXn-k + Z bkUn-k (3) k=1 k=0 where the signal uk is equal to 1 every T seconds and zero otherwise, T being the pitch period. a^ and b^ are respectively the recursive and the non recursive parameters of the above production filter of order p. So, within a pitch period (NT samples) and after p samples, we can write: Xn = Z ak k=1 (4) The above equation breaks down in the part of the speech signal that is clipped. So, if we start the time axis at the beginning of a pitch period and if we call NT the number of samples within the pitch period, we can write: Xn = Z akX k=1 ^n-k ' p < n < Nt (5) for < X„ Xmax being the saturation value. / \ xp+1 xp+2 = V y >-1 -p+1 ^ ^Nt-1 ■^Nt-2 in which all the rows such that Equation (6) can be written as: b = X a V ap y (6) 'Nt - P, = Xmax are deleted. (7) and the least square solution of equation (7) can be obtained as [6] : a = ( XTX)-1XTb (8) Another approach to the evaluation of the prediction coefficient is the following sequential algorithm (Kalman filter) [7, 8] based on the subsequent set of equations and on an autoregressive model. Consider the next state equation: a(k +1) = a(k ) + w (k ) (9) where a(k) = (a1, a2, k , a^ )T and w(k) is a white stationary sequence. The observation model is: p Xn =Z arXn-r + b0 Un (10) i=1 and let us consider C(k -1) = x^-1, x^ - 2, l , x^ - p then the observation model becomes: z(k) = x, = C(k - 1)a(k)+v(k) (11) if it is taken that: v(k ) = b0uk then, starting from an initial estimate a(0), we obtain the following recursion: a(k+1) = a(k )+K (k+1)[ xk+j - C(k )a(k )] (12) where K is the Kalman gain given by: Va(k )CT (k ) 3 The proposed restoration algorithm The proposed algorithm for clipped speech restoration is going to be based on linearly predicting the missing values using equation (4). So, the algorithm consists of two following steps: - Computation of the prediction coefficients ak . - Linear prediction of the missing values. 3.1 Computation of the prediction coefficients The computation of the prediction coefficients ak can be accomplished either by using a least square solution or by using a recursive algorithm based on Kalman filtering. For the least square algorithm, we can use equation (5) and build the following matrix vector equation relating speech samples xk: K (k +1) = - (13) b00 + C(k )Va(k )C (k ) and the matrix V,; is the variance matrix of the estimator a and is given by the following equation: Va (k +1) = [I - K (k + 1)C(k )]Va (k ) + Vw (14) where Vw is the variance matrix of the white noise process w(k). The algorithm can be initialized by: Vw =a^I, V,, (0) = 0 and b0 = 1 and stopped by using the criterion: ||a(k+1) - a(k )|2 <£ (15) The stopping criterion can also be used for pitch detection because it is evident that the above norm will be large while being in a clipped part, since the autoregressive model will not be valid. 3.2 Interpolation of the missing samples For the computation of the missing samples, equation (4) can be used starting from p previous samples. This interpolation is referred to as forward. The missing samples can also be predicted from p samples that follow the missing part. The first sample can be obtained by solving equation (4) as: xn-p =Za^Xn - p+i ^ p + 1 < n < NT (I6) i=1 where the coefficients ai are computed from the coefficients ai using: 'p-1 . ; L ; ap-1 = ; ap =■ 1 (17) x, n x k Consequently, the reconstruction is done using backward interpolation. 4 Results In order to test the previously defined algorithms, we are going to use synthetic and natural speech. The natural speech comes from a very large database of speech samples that were collected for the construction of a speech recognition system in colloquial Algerian Arabic [9]. The pitch frequency is about 100 Hz for male speaker and about 220 Hz for a female one. This corresponds to a pitch period T being between 4.5 ms to 10 ms. So, if a reliable estimation of the prediction parameters is desired, we need a fairly high sampling frequency. For example, a sampling frequency of 10 kHz (sampling period of 100 ^s) will provide between 45 and 100 samples for a pitch period. A sampling frequency of 44.1 kHz (sampling period of 22.73 ^s) is chosen, which provides between 198 and 440 samples for a pitch period, which is quite reasonable. Also, in all of the following tests, the speech signal is normalized to a maximum value of one. 4.1 Synthetic speech First the algorithm is tested with a synthetic vowel. The choice of synthetic speech is motivated by the fact that it follows exactly the linear prediction model. The vowel /a/ is generated using the following formants [5]: - The frequencies (F^ ) and the bandwidths ( BW^ ) necessary to specify each formant are shown in Formant F ( Hz ) BW^ ( Hz) 1 730 60 2 1090 100 o 3 2440 120 4 3500 175 5 4500 281 Table 1: Formant Frequencies and Bandwidths [5] - The pitch frequency is 120 Hz (male speaker), which corresponds to NT = 367 samples. Figure 2 shows few periods of the synthetic vowel /a/. The prediction order is set to p = 10. This signal is clipped to a level of ±0.5 and restored using both methods (least square and Kalman filter method). Figure 3 shows one pitch period of the clipped signal. A window of at least 75 samples following the clipped region is used to compute the predictor coefficients. 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 0 500 1000 1500 2000 2500 3000 3500 Sample Number Figure 2 : Synthetic Vowel /a/ Normalized Amplitude Waveform Figure 3 : Clipped artificial vowel The least square computation of the prediction coefficients along with both forward and backward reconstruction produces the error plots shown in figure 4. It can be seen that the backward error is much smaller than the forward one. Also, the error occurs at the end of the reconstruction. The error can be reduced by performing both reconstructions and averaging the results. However, since the error is essentially a high frequency signal, simple low pass filtering after backward reconstruction yields the same result. r\ J \ 1 1 I 1 1 1 1 1 1 1 /W ' ' ' \ 1 1 1 1 1 1 1 1 1 1 1 1 150 200 250 300 Sample Number Fdwerd Recc^iucticn Eircr 150 200 Sample Number Backward Reccnstiucticn Encr 0 600 800 Sa^e N^^ 1 1 1 1 -- 1 1 L 1 _L \ -1 1 1 _L Y 8 137e-006 ^ 1 III III 1 600 800 Sa^e N^^ 50 100 150 200 250 300 350 Sample Nl^ Figure 4 : Forward and Backward Reconstruction Error for Synthetic Speech Kalman filter is also used for the estimation of the prediction parameters. The stopping criterion of the recursive Kalman algorithm is defined as: \a(k +1)-a(k)2