Electronic lexicography in the 21st century: New applications for new users
Proceedings of eLex 2011, Bled, Slovenia, 10-12 November 2011
Edited by Iztok Kosem and Karmen Kosem
http://www.trojina.si/elex2011

Front cover designed by Karmen Kosem
Published by Trojina, Institute for Applied Slovene Studies
© Trojina, Institute for Applied Slovene Studies, Ljubljana, November 2011

CIP - Kataložni zapis o publikaciji
Narodna in univerzitetna knjižnica, Ljubljana
81'374:004.9(082)(0.034.2)
Electronic lexicography in the 21st century [Elektronski vir] : new applications for new users : proceedings of eLex 2011, 10-12 November 2011, Bled, Slovenia / edited by Iztok Kosem and Karmen Kosem. - El. knjiga. - Ljubljana : Trojina, Institute for Applied Slovene Studies, 2011
Način dostopa (URL): http://www.trojina.si/elex2011/elex2011 proceedings.pdf
ISBN 978-961-92983-3-6 (pdf)
1.
Kosem, Iztok
258763008

Acknowledgements

We would like to thank our academic partners and sponsors for supporting the conference.
Academic partners: META-NET
Main sponsors: ABBYY
Supporting sponsors: Oxford University Press, HTC

[Figure 7: Extracts from a DCD entry (second stage) — semantic descriptions of constellations for "goods", e.g. Verb + Modifier + goods with illegal, stolen, contraband, fake, counterfeit (TAKE THINGS SECRETLY TO OR FROM A PLACE; TRANSPORT THINGS FROM ONE PLACE TO ANOTHER) and with faulty, defective, damaged (REFUSE TO ACCEPT THE GOODS RECEIVED OR BOUGHT; PROVIDE NEW GOODS).]

[Figure 8: Extracts from a DCD entry (third stage) — corpus examples for each constellation, e.g. "You are not legally obliged to return faulty goods to the seller." (REFUSE TO ACCEPT THE GOODS RECEIVED OR BOUGHT) and "If you transport dangerous goods, you must be trained." (TRANSPORT THINGS FROM ONE PLACE TO ANOTHER).]

The level of detail or granularity in DCD entries unfolds gradually through three steps. In the first stage, the screen displays collocational pairs, similarly to conventional collocation dictionaries (Figure 6). In the second stage, the screen displays a semantic description of the lexical constellations related to the collocate on which the user has clicked (see Figure 7). Finally, in the third stage, the user is provided with a series of examples representing different lexical realisations of the constellation (see Figure 8).
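The three-stage unfolding just described lends itself to a layered data model. The sketch below is our own hypothetical illustration, not the actual DCD implementation; all field names and the `stage` helper are invented, and the sample content is taken from the goods entry shown in Figures 7 and 8:

```python
# Hypothetical three-stage model of a DCD entry (illustrative only).
# Stage 1: collocational pairs; stage 2: semantic descriptions of
# lexical constellations; stage 3: authentic corpus examples.
entry = {
    "headword": "goods",
    "constellations": [
        {
            "pattern": "Verb + Modifier + goods",
            "collocates": ["faulty", "defective", "damaged"],
            "description": "REFUSE TO ACCEPT THE GOODS RECEIVED OR BOUGHT",
            "examples": [
                "You are not legally obliged to return faulty goods to the seller.",
                "Usually there are no problems with rejecting faulty goods.",
            ],
            "cross_references": [],
        },
        {
            "pattern": "Verb + Modifier + goods",
            "collocates": ["dangerous", "hazardous"],
            "description": "TRANSPORT THINGS FROM ONE PLACE TO ANOTHER",
            "examples": [
                "If you transport dangerous goods, you must be trained.",
            ],
            # underlined links to other headwords sharing the constellation
            "cross_references": ["cargo", "load", "substance"],
        },
    ],
}

def stage(entry, level):
    """Return only the information revealed at display stage 1, 2 or 3."""
    shown = []
    for c in entry["constellations"]:
        item = {"pattern": c["pattern"], "collocates": c["collocates"]}
        if level >= 2:
            item["description"] = c["description"]
        if level >= 3:
            item["examples"] = c["examples"]
            item["cross_references"] = c["cross_references"]
        shown.append(item)
    return shown
```

Clicking a collocate would correspond to moving from `stage(entry, 1)` to `stage(entry, 2)`, and clicking a semantic description to `stage(entry, 3)`, which also exposes the underlined cross-references.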
This list is accessed by clicking on the semantic description of the constellation. Where relevant, the list includes references to other headwords sharing the same lexical constellation pattern (e.g. cargo, load, substance, etc., in the lower part of Figure 8). In these cases, the words are underlined so that the user can follow the link to the corresponding noun entry. Concerning the third guideline, i.e. compactness, information about lexical constellations is presented in as succinct a format as possible. One implication is that labels such as "lexical constellation", "inter-collocability" or "positive co-collocate" are never explicitly mentioned in the entry. This marks a difference from some collocation dictionaries, especially those in the Meaning-Text Theory (MTT) framework (notably the DiCE), which make extensive use of specialised terms that are not known to the wider audience and the lay speaker. These terms include MTT jargon such as gloss and lexical function labels such as 'Magn', 'Anti Bon', etc. In the DCD project we try to make the dictionary accessible by keeping metalinguistic data to a minimum. Metalinguistic information is reduced to basic grammatical categories (Verb, Noun, Adjective, etc.) and to semantic labels. For similar reasons, probability and statistical data are not shown to the user. The structure of constellations is signalled only by means of symbols such as arrows, and by highlighting words in authentic examples (see Figures 7 and 8). Finally, the fourth guiding principle is the maximisation of systematicity. This apparently trivial statement contains important implications for the design of dictionary entries. It entails, among other things, an attempt to subsume as much lexical information as possible under general combination rules. This implies first and foremost that semantic labels will be used to show the interconnectedness of several collocational patterns. This practice, i.e.
the grouping of different collocations under meaning categories, has been adopted to a greater or lesser extent by previous collocation dictionaries such as the MCD, REDES and the DiCE, but not by others such as the OCD or the BBI. The specific challenge now faced by the DCD is to extend this strategy to the description of the semantic regularities underlying lexical constellations. This problem is resolved by inserting semantic paraphrases of constellations at an intermediate stage between collocational information and real examples of constellations (see Figures 7 and 8). The rationale behind this emphasis on the connection between the combinatorial and semantic properties of words is our aim of bridging the distance between the collocation dictionary and the general-purpose dictionary. In line with neo-Firthian thinking, it is our conviction that a well-organised, detailed description of the syntagmatic behaviour of a word has a definitional value. Collocation provides a representation of word meaning, as Firth suggested.

5. Conclusion

In this article we have argued that mainstream approaches to collocation have missed an important aspect of collocational patterning, namely, the operation of dependency relations between different collocations. Crucially, this level of analysis should not be confused with the observation of dependency relations between the parts of a collocation. Collocability must be analysed at a different level from inter-collocability. It has also been argued that the LCM provides an adequate analytical framework for inter-collocability. After applying the methodology of constellational analysis to the collocational patterns of the noun goods, we have confirmed that different collocations influence in different ways the selection of other collocations of the same noun.
Finally, we have explained that dealing with lexical constellations in a dictionary is only possible in an electronic format and requires us to introduce a number of substantial changes with respect to the conventional micro-structural design of collocation dictionaries (including electronic ones). Some of these changes have been illustrated with reference to sample parts of the DCD.

6. Acknowledgements

The project presented in this paper is generously funded by a grant from the Fundación Séneca, Agencia de Ciencia y Tecnología de la Región de Murcia (Ref. 08594/PHCS/08). We are most grateful for this financial support.

7. References

Almela, M. (2011). Improving corpus-driven methods of semantic analysis: a case study of the collocational profile of 'incidence'. English Studies, 92(1), pp. 84-99.
Almela, M., Cantos, P. & Sánchez, A. (2011). From collocation to meaning: revising corpus-based techniques of lexical semantic analysis. In I. Balteiro (ed.) New Approaches to Specialized English Lexicology and Lexicography. Newcastle u. T.: Cambridge Scholars Press, pp. 47-62.
The BBI Dictionary of English Word Combinations (1997). Compiled by M. Benson, E. Benson & R. Ilson. Amsterdam: John Benjamins.
Bosque, I. (2001). Sobre el concepto de 'colocación' y sus límites. Lingüística Española Actual, 23(1), pp. 9-40.
Bosque, I. (2004). La direccionalidad en los diccionarios combinatorios y el problema de la selección léxica. In T. Cabré (ed.) Lingüística teòrica: anàlisi i perspectives. Bellaterra: Universitat Autònoma de Barcelona, pp. 13-58.
Cantos, P. & Sánchez, A. (2001). Lexical constellations: what collocates fail to tell. International Journal of Corpus Linguistics, 6(2), pp. 199-228.
DiCE: Diccionario de colocaciones del español. Accessed at: http://www.dicesp.com.
Hanks, P. & Pustejovsky, J. (2005). A Pattern Dictionary for Natural Language Processing. Revue Française de Linguistique Appliquée, 10, pp. 63-82.
Herbst, T., Heath, D., Roe, I.F. & Götz, D. (2004).
A Valency Dictionary of English. A Corpus-Based Analysis of the Complementation Patterns of English Verbs, Nouns and Adjectives. Berlin: Mouton de Gruyter.
Mason, O. (2000). Parameters of collocation: the word in the centre of gravity. In J.M. Kirk (ed.) Corpora Galore. Analyses and Techniques in Describing English. Amsterdam: Rodopi, pp. 267-280.
Macmillan Collocations Dictionary for Learners of English (2010). Compiled by M. Rundell. Oxford: Macmillan.
Oxford Collocations Dictionary for Students of English (2009). Compiled by C. McIntosh. Oxford: Oxford University Press.
A Pattern Dictionary of English Verbs. Accessed at: http://deb.fi.muni.cz/pdev/.
REDES: Diccionario combinatorio del español contemporáneo (2004). Compiled by I. Bosque. Madrid: SM.
Renouf, A. (1996). Les nyms: en quête du thésaurus des textes. Lingvisticae Investigationes, 20(1), pp. 145-165.
Rychlý, P. (2008). A lexicographer-friendly association score. In P. Sojka, A. Horák (eds.) Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2008. Brno: Masaryk University, pp. 6-9.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.

Collocational networks and their application to an E-Advanced Learner's Dictionary of Verbs in Science (DicSci)

Araceli Alonso, Chrystel Millon, Geoffrey Williams
Équipe LiCoRN, Université de Bretagne-Sud
Faculté de Sciences Humaines et Sociales
4 rue Jean Zay, 56321 Lorient CEDEX (France)
E-mail: araceli.alonso@univ-ubs.fr, chrystel.millon@univ-ubs.fr, geoffrey.williams@univ-ubs.fr

Abstract

The present article deals with a situation that lies between the needs of an advanced learner's dictionary and those of a specialised dictionary in attempting to build a pattern dictionary for verbs used in scientific research papers. Current dictionaries do not necessarily assist in the particular production environment of the scientific article.
This can be tackled by building a bottom-up phraseological dictionary which will help both with decoding and encoding. The building method uses collocational networks in order to compile a dictionary which will demonstrate the usage of individual verbs, grouping them into a natural classification system that grows from the corpus data. This organic dictionary ultimately makes wide use of mind-mapping technology to allow the user to navigate within the dictionary. It contains both individual entries containing phraseological information and super entries linking quasi-synonyms and providing writing assistance. The dictionary provides an environment which can link phraseological patterns to the corpus data so as to streamline the information retrieval process whilst providing real examples of language in use in specialised contexts.

Keywords: learner's dictionary; specialised dictionary; organic dictionary; phraseology; collocational networks; verbal patterns

1. Introduction

In recent years, developments in technology have brought about some major changes in dictionary-writing. The ground-breaking work of Sinclair and the COBUILD team in the 1980s introduced a move in lexicographical practice towards the creation of corpus-based dictionaries, on the basis that users need to know not only the meaning of a word but also the way the word is used in context. Many monolingual and, especially, learner's dictionaries have applied corpus-based techniques to the representation of word uses by giving examples taken from a corpus. Although the corpus is now integrated as a source, most of these dictionaries, whether in print or CD-ROM format, have not implemented the full potential of adopting a corpus-driven approach to what may be extracted from a corpus, such as the networks of relations between words.
The rise of electronic dictionaries due to the widespread use of computers, and especially of the Internet, has also contributed to pushing lexicographical practice further, even if technology changes much more quickly than the dictionary-writing process. As a result, many on-line dictionaries do not take full advantage of the potential offered by web technology. In fact, many of them are just a copy of the paper dictionary. Some attempts at creating genuine on-line dictionaries have been made, such as the visual dictionaries or Wordnik (http://www.wordnik.com), based on the web 2.0 or social web, but there is still a long way to go. New approaches to dictionary-making practice are needed. In a society of knowledge and technology, dictionaries must be updated and adapted to users' needs. In the case of science dictionaries, there is a real need for innovation. Most dictionaries of science are very traditional in outlook or simply take the form of terminological databases applying an onomasiological approach, which supply the user with a definition and a context, and in some cases relations between the units, but fail to give detailed information on the syntagmatic and paradigmatic relations between technical words, and between technical words and 'general' words. In reality, specialised communication is not just about technical words. In most cases, scientists already know the definition of the technical word, but look up the 'specialised' meaning of a general word in the dictionary in order to get information about the behaviour of the word in a domain-specific context. The wealth of language lies in semi-technical words and general words in specific contexts. As has been stated by many authors (Cabré, 1999; Meyer, 2000; Ciapuscio, 2003; Hunston & Sinclair, 2003; ten Hacken, 2008), the dichotomy between general and specialised languages must be seen in terms of a continuum; they are not clearly separable entities.
It can be stated that there is a transfer of lexical units from one side to the other; processes of determinologization or banalization, terminologization and pluriterminologization take place (Meyer, Mackintosh & Varantola, 1997; Cabré, 1999). This passage of meaning potentials from general language to specialised language, and back, is particularly a problem for non-native speakers who need to communicate in scientific contexts. Furthermore, most specialised dictionaries consider only nouns as dictionary entries, since, according to the classical perspective of terminology, the noun was considered the only category to have a terminological value; they do not take into account the role that other categories, such as verbs, can play in specialised discourse. In order to produce a text, non-native speakers need to understand the characteristics of specialised discourse, and it is not only the noun which plays a relevant role. Verbs, for instance, can help to organize the discourse, to articulate and structure the text, to establish links between different referential lexical units, to express the point of view of the author, to interact with the reader, to understand the meaning of a word, etc. As Hanks states, 'meanings are constructed around the verb, the pivot of the clause' (Hanks, 2010a:3). Therefore, for a language learner, it is extremely important to get to know the behaviour and use of verbs in order to be able to produce and understand specialised discourse. A dictionary of verbs used in the sciences can assist by helping users to overcome their basic communication problems. The main objective of this paper is to present the potential of collocational networks for a new approach to an experimental dictionary conceived from the beginning as a virtual dictionary, the E-Advanced Learner's Dictionary of Verbs in Science (DicSci).
Collocational networks for the building-up of dictionary entries will be discussed and exemplified with reference to the most frequent verbs extracted from a corpus related to the Biosciences. This paper first shows how specialised learner's dictionaries have evolved. The article then presents the initial premises of the lexicographical project DicSci, paying special attention to the 'organic' nature of the E-Advanced Learner's Dictionary of Verbs in Science, and describes the working methodology and the building-up of the dictionary, taking the verb to take as an example. Finally, some conclusions and perspectives are outlined.

2. Learner's dictionaries of science

Dictionaries of science, or specialised dictionaries, are usually terminologically based, elaborated within the theoretical and methodological frameworks of terminology rather than those of lexicography, particularly those of advanced learner's dictionaries. In many cases, they are terminological databases. Most of these terminological dictionaries or databases are based on an onomasiological perspective, that is, the different entries are organized by concepts. The terminological units are just the means of linguistic expression of the conceptual organisation of a particular domain. The focus is on explaining the concept, with the terminological unit observed only as a way of designating the concept. Therefore, no attention is paid to the different senses of a term, as the term is not considered as a lexical unit. More recent approaches to terminology advocate a semasiological approach — see L'Homme (2005) for more detailed information —, considering the term as a lexical unit which can have the same characteristics as other lexical units of general language. Despite this fact, little progress has been made in specialised dictionary practice.
As Williams (2003:94) states, most of these dictionaries, whether multilingual or bilingual terminologies, mostly address the translator, whereas the monolingual encyclopaedic dictionaries principally address the subject specialist. The latter are prescriptive dictionaries whose main aim is to fix and explain terms for native speakers of the language. They have not been compiled with the foreign learner's needs in mind. They do not explain the use of terms in context and, therefore, are of little help for encoding purposes. Already in the 1980s, Moulin (1983:151) considered that existing specialised dictionaries were of little use to foreign learners. Not much progress has been made since then. Even though some attempts have been made over the years to answer foreign learners' needs, learner's dictionaries of science are not really satisfactory. Some authors (Bergenholtz & Tarp, 1995; Fuertes-Olivera, 2009, 2010; Tarp, 2008) have defended a functional approach to lexicography, usually referred to as the Function Theory of Lexicography, which considers lexicography an area of social practice where the dictionary must take into account users' specific types of problems and situations and satisfy users' needs. From this perspective some specialised dictionaries for foreign learners have been compiled. And even though more attention has been paid to the linguistic characteristics of terminological units, many problems of grammar and usage have received only minimal attention. On the other hand, learner's dictionaries of English as a foreign language have a strong tradition, but aim at general usage, with little coverage of the sciences. Over the years, learner's dictionaries of English as a foreign language have increased in number and variety.
Since the first learner's dictionaries, much work has been done to provide more information — see Cowie (2002, 2009) for a detailed history of English dictionaries for foreign learners —, paying special attention to the linguistic features of language. However, most advanced learner's dictionaries have paid little attention to the representation of specialised lexical units, being primarily aimed at learners of the language for general purposes. A similar situation can be found with standard bilingual dictionaries, which essentially provide decontextualized equivalents with a minimum of encoding assistance. Consequently, many scientists have to rely on 'native English speakers', hopefully with an awareness of genre specificities, to correct their texts. Learner's dictionaries of English as a foreign language deal with grammatical and usage aspects of lexical units, as the learner of the language needs information not only for understanding texts, but also for producing texts in the foreign language. For instance, many dictionaries have made an attempt to introduce information on collocations, "lexical co-occurrences of words" (Sinclair, 1991:170), in order to give more information about the use of words in context, taking into consideration information extracted from corpora. This has also been recognized as a useful addition to specialised dictionaries, especially in relation to the user's needs for encoding. However, as explained by L'Homme & Leroyer (2009:259), "there does not seem to be a general agreement as to what types of word combinations should be listed, nor as to how they should be presented in specialised reference works." A learner's dictionary of science must be a tool for an ongoing learning process where specific collocations and lexical patterns can help non-native speakers who need to produce scientific texts in English. To do this, we propose to use a bottom-up model to create an experimental dictionary dealing with verbs used in scientific texts.
We focus on the verbal category, as verbs are the centre of the clause, linking nodes of specific terminology, and are of phraseological interest. From a classical perspective of terminology, verbs were not considered of interest as they were not proper terminological units. Recent approaches to terminology have shown not only that the nominal category can have a terminological value, but that other categories, such as adjectives or verbs, can also be domain-specific lexical units — see Lorente (2007, 2009) for more information. According to Lorente (2009:59), verbs are not per se terminological units, but can acquire a 'specialised value' in context when their immediate environment also provides specialised knowledge. Lorente (2007) establishes a classification of verbs used in scientific texts: a) verbos casi-términos ('near-term verbs'), such as to ionize; b) verbos fraseológicos ('phraseological verbs'), such as to codify (i.e. to codify a protein); c) verbos de relación lógica ('verbs of logical relation'), such as to present; d) verbos performativos del discurso ('discourse-performative verbs'), such as to conclude. As can be observed from Lorente's classification, in most cases the 'specialised value' of a verb is determined by the company it keeps. As Hanks (2010b) establishes, taking into consideration Sinclair's distinction (Sinclair, 1991) between the open-choice principle and the idiom principle, many units have both a terminological tendency (open-choice principle) and a phraseological tendency (idiom principle). Verbs have mainly a phraseological tendency. It is impossible to know the meaning of some of these verbs without knowing the phraseological context in which the verb is used. This phraseological context is the information to which a learner of the language needs to pay particular attention. The difficulty for the learner of science lies in the phraseology being used and not in the designation of a concept.
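For illustration, Lorente's four-way classification above could be encoded as a simple lookup table. This is our own sketch, not part of Lorente's proposal; the example verbs are the ones cited in the text, and the `classify` helper is invented:

```python
# Illustrative encoding of Lorente's (2007) classification of verbs
# in scientific texts, with the example verbs cited in the text.
LORENTE_VERB_CLASSES = {
    "casi-términos": ["ionize"],                 # 'near-term verbs'
    "fraseológicos": ["codify"],                 # e.g. "codify a protein"
    "de relación lógica": ["present"],           # 'verbs of logical relation'
    "performativos del discurso": ["conclude"],  # 'discourse-performative verbs'
}

def classify(verb):
    """Return the Lorente class of a verb, or None if it is not listed."""
    for cls, examples in LORENTE_VERB_CLASSES.items():
        if verb in examples:
            return cls
    return None
```

In a real dictionary pipeline, of course, class membership would be decided from corpus context rather than from a fixed list, since the same verb can behave differently in different environments.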
An advanced specialised learner's dictionary must pay special attention to those units with a phraseological tendency. In order to write scientific texts in a foreign language, the learner of the language needs to know the meaning of the specific words used in specific contexts and, as has been mentioned before, there are many words whose meaning can only be understood by knowing the environment in which the word is used. The DicSci is an advanced learner's dictionary of verbs whose main aim is to give an account of the functioning of a verb in a scientific context, showing its phraseological behaviour, taking into account its collocates and its textual environment.

3. DicSci - An E-Advanced Learner's Dictionary of Verbs in Science

The lexicographical project DicSci starts off from ongoing work, both theoretical and practical in nature, related to two research projects on science corpora, coupled with an analysis of the place of scientific usage in advanced learner's dictionaries and the application of the methodology of collocational networks and collocational resonance (Williams, 1998, 2002, 2003, 2006, 2008a, 2008b, 2008c; Williams & Millon, 2009, 2010) and of the technique developed by Patrick Hanks — see Hanks (2004, 2006), Hanks & Ježek (2008) for more detailed information —, named Corpus Pattern Analysis or CPA, supported by the Theory of Norms and Exploitations or TNE (Hanks, forthcoming). On the theoretical side, the objective of the lexicographical project is to show how collocational networks, collocational resonance and lexical patterns can assist with understanding not just meaning change, but the carry-over of aspects of meaning from changing contextual environments, and also the relations between technical and general lexical units.
The practical final outcome is an E-Advanced Learner's Dictionary of Verbs in Science (DicSci) built bottom-up using corpus-driven methodologies for the selection of headwords, the semantic organisation of the data, the representation of norms and exploitations, word syntagmatic and paradigmatic relations, and the movement of meanings between contexts. The working methodology is based on the use of collocational networks and collocational resonance. This can be further enhanced by applying Corpus Pattern Analysis, or CPA. In previous studies (Alonso, 2009; Williams, 1998, 2002, 2006, 2008a, 2008b, 2008c; Williams & Millon, 2009, 2010), these statistically based chains of collocations have been used to demonstrate thematic patterns in texts, as well as a means of selecting the lexis for a specialised language dictionary, of observing the movement of meanings between contexts, of establishing syntagmatic and paradigmatic relations between units, and of determining the difference between 'specialised' and 'general' language use. The methodology proposed is influenced by John Sinclair's insights into collocation and the idiom principle (Sinclair, 1991), Wittgenstein's approach to prototypes (1953), the work on scientific texts developed by Roe (1977) and the later studies of the phraseology of scientific texts developed by Gledhill (2000), the work on pattern grammar by Hunston & Francis (1999), the study of semantic prosody by Louw (1993, 2000 [2008]), and the theory of Lexical Priming proposed by Hoey (2005).
Finally, as has been shown in previous studies (Alonso, 2009; Renau & Alonso, in press), the application of Corpus Pattern Analysis, proposed by Hanks (2004) for building up the Pattern Dictionary of English Verbs (PDEV)1, is useful for analysing the normal use of lexical units in scientific contexts and establishing differences between the general and specialised use of a lexical unit; it can also help to improve the dictionary entry, as it provides a systematic and very fine-grained analysis of language in use. CPA is a technique which can complement the information given by the collocational networks. Collocational networks, proposed by Williams (1998, 2002), are statistically based chains of collocations: a web of interlocking conceptual clusters realised in the form of words linked through the process of collocation. The idea that collocations 'cluster', forming interwoven meaning networks, comes from Phillips (1985). Phillips's aim was the study of metastructure within texts and the notion of 'aboutness'. Williams (1998) considered Phillips's work and hypothesised that "the patterns of co-occurrence forming the collocational networks will be unique to any one sublanguage and serve to define the frames of reference within that sublanguage" (Williams 1998:157). Starting from a high-frequency lexical unit, considered as the node of the network, the collocates are calculated using a statistical measure, mainly MI or Z-score, even though other statistical measures can be considered. The collocates are then treated as nodes, and the collocates of each collocate are then calculated. The network is allowed to extend through collocational chains until a point is reached where either no more significant collocates are found or a word-form that has occurred earlier in the network is encountered. A detailed description of the procedure for the creation of collocational networks is given in Williams (1998).
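The procedure just outlined — score each candidate collocate of the node, keep those above a significance threshold, then recurse on each new collocate until no significant collocates remain or an already-seen word-form recurs — can be sketched as follows. This is a simplified illustration, not the actual software behind DicSci: the corpus interface (`cooc`, `freqs`) and the threshold value are our own assumptions, and the MI formula follows the standard pointwise formulation of Church & Hanks (1990); exact formulations vary between tools.

```python
import math

def mutual_information(f_x, f_y, f_xy, corpus_size):
    """Pointwise MI (Church & Hanks, 1990): log2 of observed vs expected."""
    expected = f_x * f_y / corpus_size
    return math.log2(f_xy / expected)

def significant_collocates(node, cooc, freqs, N, threshold=5.0):
    """Collocates of `node` whose MI score reaches the (assumed) threshold."""
    kept = []
    for colloc, f_pair in cooc.get(node, {}).items():
        if mutual_information(freqs[node], freqs[colloc], f_pair, N) >= threshold:
            kept.append(colloc)
    return kept

def build_network(seed, cooc, freqs, N):
    """Extend collocational chains breadth-first; a word-form already in the
    network is not expanded again (one of Williams' stopping conditions)."""
    network = {}                      # node -> its significant collocates
    frontier = [seed]
    while frontier:
        next_frontier = []
        for node in frontier:
            if node in network:       # already encountered: stop this chain
                continue
            collocs = significant_collocates(node, cooc, freqs, N)
            network[node] = collocs
            next_frontier.extend(c for c in collocs if c not in network)
        frontier = next_frontier      # no significant collocates -> loop ends
    return network

# Toy counts (invented): "the" is frequent but weakly associated, so MI
# filters it out, while the rarer "encoding" passes the threshold.
freqs = {"gene": 100, "encoding": 20, "the": 50_000}
cooc = {"gene": {"encoding": 10, "the": 60}, "encoding": {"gene": 10}}
print(build_network("gene", cooc, freqs, 1_000_000))
# {'gene': ['encoding'], 'encoding': ['gene']}
```

Swapping `mutual_information` for a z-score (or any other association measure) changes which collocates survive the threshold, which is exactly why the choice of measure matters for the shape of the resulting network.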
The importance of the statistical measure selected for calculating the most significant collocates of a lexical unit must be taken into account, as different measures will give different results. For instance, Mutual Information foregrounds rarer items, whereas Z-score gives more general collocates — see Church & Hanks (1990) for more information on measuring word association norms. It is also important to bear in mind that the collocational network can vary depending on the form of the lexical unit. For instance, in texts related to Molecular Biology, the environment developed by the use of 'gene' in the singular is quite different from the environment of the plural form:

[Figure 1: First level of the collocational network of 'gene', extracted from Williams (2008c:140); the network includes items such as bioenergetic, encoding, chlororespiration, "gename" and components.]

[Figure 2: First level of the collocational network of 'genes', extracted from Williams (2008c:140).]

1 The Pattern Dictionary of English Verbs (PDEV) is an ongoing project whose first results are freely available on the Internet (http://deb.fi.muni.cz/pdev).

Despite the different collocates associated with each form of the lexical unit, the lemmatised network must also be considered in order to have a complete panorama of the total environment of a word. On the other hand, collocational resonance is also a tool used in DicSci to show how elements of meaning are carried over from one textual environment to another. The mechanism of collocational resonance has been described in Williams (2008b) and Williams & Millon (2009). The notion of collocational resonance is based on the assumption that language users carry over aspects of meaning from previously encountered usage, consciously and subconsciously, subcategorised for topic and genre, colouring the meanings and prosodies in use. This can be mapped by using lexicographical prototypes. For instance, if we consider the word 'culture', one of its meanings is that of farming.
When 'we culture children', there are pieces of meaning that still carry a resonance of the meaning of 'culture' as farming. A detailed explanation of resonance with reference to the word 'probe' can be found in Williams & Millon (2009). Collocational resonance is used to explain particular patterns of usage. It can assist in understanding the movement from general to specialised usage of language, or from specialised to general, and it can also help to build up the definitions in dictionary entries. In the present study we concentrate on collocational networks rather than on collocational resonance, as collocational networks are the primary tool for building up DicSci. The third element of the working methodology in compiling DicSci is the use of Corpus Pattern Analysis to give a more accurate account of the normal uses of each of the significant collocates which form the collocational network. CPA is a work-in-progress corpus-driven methodology developed by Hanks for 'mapping meanings onto use' (Hanks, 2002). According to Hanks (2010c: 590), "a corpus does not show directly what a word means, but it provides evidence on the basis of which meanings can be inferred." It provides evidence of word use, and most uses are highly patterned; each unique pattern is usually associated with a specific meaning. CPA is a methodology for identifying the prototypical syntagmatic patterns with which words in use are associated. As Hanks (2006: 1165) explains, "a pattern consists of a verb with its valencies, plus semantic values for each valency and other relevant clues, and is associated with an implicature that associates the meaning with the context rather than with the word in isolation." A pattern is based on the structure of English clause roles described in systemic grammar (Halliday, 1961): subject, predicator, object, complement, adverbial. Each clause role or argument is 'populated' by a set of collocations.
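A CPA pattern, i.e. a verb with typed valencies, can be represented as a simple data structure. The class design below is only an illustrative sketch, not the PDEV's internal representation; the 'filter' pattern and the narrowing of a semantic type to a lexical item use the notation discussed in the text.

```python
from dataclasses import dataclass

@dataclass
class ArgSlot:
    """One argument slot of a CPA-style pattern: a semantic type,
    optionally narrowed to a lexical set ([[Type=Item]] notation)."""
    sem_type: str
    lexical_set: tuple = ()          # empty tuple = any member of the type

    def __str__(self):
        if self.lexical_set:
            return f"[[{self.sem_type}={'|'.join(self.lexical_set)}]]"
        return f"[[{self.sem_type}]]"

@dataclass
class Pattern:
    """A minimal subject-verb-object pattern; real CPA patterns also
    carry adverbials, clues and an implicature."""
    subject: ArgSlot
    verb: str
    obj: ArgSlot

    def __str__(self):
        return f"{self.subject} {self.verb} {self.obj}"

general = Pattern(ArgSlot("Human"), "filter", ArgSlot("Liquid"))
narrowed = Pattern(ArgSlot("Human"), "filter", ArgSlot("Liquid", ("Water",)))
print(general)    # [[Human]] filter [[Liquid]]
print(narrowed)   # [[Human]] filter [[Liquid=Water]]
```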
The most significant collocates of a verb are usually nouns which share a semantic aspect of meaning. The meaning of a group of collocates is expressed by a semantic type. In Hanks' words, semantic types represent 'folk concepts'. All semantic types are stored in a hierarchically structured shallow ontology which is continuously under review; the CPA ontology is corpus-driven. There are cases in which an argument slot is populated by one or more lexical items which cannot be grouped together into semantic types; these are treated as lexical sets. In other cases, the semantic type is complemented by a semantic role. The semantic type is an intrinsic property of the collocate, while a semantic role is an extrinsic property assigned by context. For instance, if we consider the verb to filter, one normal pattern would be [[Human]] filter [[Liquid]]. However, the corpus can show cases in which not all kinds of liquids are filtered but only specific ones, such as water. The pattern in this case would be [[Human]] filter [[Liquid=Water]]. The organisation of semantic types and semantic roles is not easy, and it is only through corpus evidence that this task can be achieved. For more detailed information on the general principles of CPA, see Hanks (2004, 2006, 2010a); for an explanation of the CPA ontology, see Ježek & Hanks (2010). As mentioned before, DicSci is a corpus-driven dictionary which takes into account the use of words in scientific texts; a corpus of scientific texts is therefore needed. To begin building the dictionary, a corpus was compiled: the BioMed Central corpus (BMC). The BMC is a 33-million-word English-language corpus built as part of the Scientext initiative, a project for the creation of comparable corpora carried out by a consortium of three French universities led by the Université de Grenoble 3.
The BMC corpus, which is now freely available online at the Scientext website2, stands at 33 million words drawn from 8,945 scientific texts from 137 different journals, made freely accessible online by the independent publishing house BioMed Central3. The texts have been selected from a number of journals dating from 1997 to 2005. All texts have been formatted according to the TEI guidelines and have been part-of-speech tagged and lemmatised using TreeTagger4. The texts in the BMC corpus cover a large number of topics and genres, all related to two main areas: biology and medical research. Each text carries XML-TEI annotation recording the topic(s) and genre to which it belongs. The corpus cannot be considered fully representative of published scientific research, as it is focused on articles related to the biosciences, and the distribution of topics and genres is not well balanced, as stated in Williams & Millon (2009). In the present work, however, the subcategorisation of the corpus has not been exploited. Despite these limitations, the size of the BMC corpus means that it provides adequate data for work on an experimental dictionary such as DicSci. More details about the corpus can be found on the Scientext website. Finally, the experimental dictionary presented here is considered an 'organic' dictionary: a living dictionary that organises itself in a natural way thanks to the links between words shown by the collocational networks. Collocational networks are used for headword selection, for structuring and grouping verbs into classes, and as a means of navigation. The dictionary will ultimately make wide use of mind-mapping technology to allow users to navigate between the different entries. It will provide an environment which links phraseological patterns to the corpus data while providing real examples of language in use in specialised contexts.
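The topic/genre labelling described for the BMC corpus might be read off a TEI header roughly as follows. The fragment and its element names are illustrative only, in the general style of TEI `textClass` keywords, and are not the actual Scientext/BMC schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment: element names are illustrative,
# not the actual annotation scheme of the BMC corpus.
doc = """<TEI>
  <teiHeader>
    <textClass>
      <keywords type="topic"><term>biology</term></keywords>
      <keywords type="genre"><term>research article</term></keywords>
    </textClass>
  </teiHeader>
</TEI>"""

root = ET.fromstring(doc)
labels = {kw.get("type"): kw.findtext("term")
          for kw in root.iter("keywords")}
print(labels)   # {'topic': 'biology', 'genre': 'research article'}
```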
In the following chapter, the use of collocational networks in building up our dictionary is illustrated through the exploration of the verb to treat.

2 http://scientext.msh-alpes.fr/scientext-site/?article30
3 http://www.biomedcentral.com
4 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

4. Collocational networks and dictionary making: the verb to treat

To treat is the 49th most frequent verb in the BMC corpus, with 13,018 occurrences. The collocational network was created by measuring the most significant collocates of the verb. Due to space restrictions, Figure 3 below shows only the first level of the collocational network of to treat, as the main aim of this paper is to demonstrate the principles, not to present full networks. This network contains the eight most statistically significant noun collocates of to treat, namely animal, rat, mouse, patient, intention, control, vehicle and cell (shown in red in Figure 3), and the ten most statistically significant verb collocates of each of these nouns. The collocates were calculated by means of z-score in a span of 5:5, and collocations with fewer than 3 occurrences were discarded. Five verbal collocates were nevertheless removed from the network: deciduoma-bearing, coimmunized, frequency-matched, transfected and exhaust. The first four are word-forms not recognised by TreeTagger; exhaust was removed because, in the noun-verb collocation 'vehicle exhaust', exhaust is in fact a noun belonging to the syntagmatic lexical unit motor vehicle exhaust. In total, 54 verb collocates were considered for the network. Among them, seven are amongst the 100 most frequent verbs in the BMC corpus: compare, express, grow, include, receive, stain and use.
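The collocate-scoring step (a 5:5 span, a minimum of 3 co-occurrences, association measured by z-score) can be sketched as follows. The z-score and MI formulas here are simplified textbook variants based on Church & Hanks (1990), not necessarily the exact implementation used for DicSci.

```python
import math
from collections import Counter

def association(tokens, node, span=5, min_freq=3):
    """Rank the collocates of `node` found in a +/-`span` window by
    pointwise Mutual Information and a simplified z-score, discarding
    pairs seen fewer than `min_freq` times."""
    n = len(tokens)
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for w in tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]:
                if w != node:
                    co[w] += 1
    scores = {}
    for w, o in co.items():
        if o < min_freq:
            continue
        # expected chance co-occurrence in a window of 2*span tokens
        expected = freq[node] * freq[w] * 2 * span / n
        scores[w] = {"MI": math.log2(o / expected),
                     "z": (o - expected) / math.sqrt(expected)}
    return scores

# Toy corpus: 'patient' systematically appears near 'treat'.
tokens = (["we", "treat", "the", "patient", "carefully"] * 20) + ["filler"] * 100
scores = association(tokens, "treat")
```

On this toy data, 'patient' receives a clearly positive z-score, while 'filler', which never occurs near 'treat', is absent from the results.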
Moreover, there are eight verbs (not counting treat) that are shared by several of the seven noun collocates, namely anesthetize, compare, feed, immunize, inject, receive, sacrifice and stain (marked in green in Figure 3).

Figure 3: Collocational network of the verb to treat

Through the collocational network, verbs that are not on the top-100 list are thus introduced. In our illustration, this concerns 47 verbs of the network. Naturally, some of this set of 'new' verbs may already have been entered in the dictionary, as they may have been introduced in a previous analysis. However, not all verbs present in the network will be selected as headwords and given entries in the dictionary; this also depends on frequency. This brief example illustrates the organic nature of the constitution of the dictionary, which grows in a natural way by selecting what is statistically significant in the textual environment of words. It is through the study of the 100 most frequent verbs that other verbs attested in the BMC corpus will in turn be entered in the dictionary. The constitution of the dictionary thus follows an iterative process: the analysis of one verb from the top-100 list leads to the consideration of verbs that are not on this list, the analysis of one of those leads to the consideration of further verbs, and so on. As mentioned in the previous chapters, collocational networks are a mechanism for headword selection. They also give a first picture of the environment of scientific texts, showing the most significant lexical units which are 'pivots' (in Hanks' terminology) of the clauses, or the main cognitive nodes that form the texts' framework. The collocational network provides a global picture of the node of the network, in this case the verb to treat. A lexicographical analysis of the network also shows that collocates can be grouped into different conceptual classes.
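The iterative, 'organic' headword-selection process can be sketched as a simple worklist algorithm. The frequency threshold and the toy data below are invented for illustration; DicSci's actual admission criteria are not stated numerically in the text.

```python
from collections import deque

def select_headwords(seed_verbs, network_verbs_of, freq, min_freq=100):
    """Iterative headword selection: start from the top-frequency verbs;
    each analysed verb's collocational network proposes further verbs,
    admitted if frequent enough and not yet entered."""
    entered, queue = set(), deque(seed_verbs)
    while queue:
        verb = queue.popleft()
        if verb in entered or freq.get(verb, 0) < min_freq:
            continue
        entered.add(verb)
        queue.extend(network_verbs_of(verb))   # propose this verb's collocates
    return entered

# Toy networks and frequencies (illustrative; only treat's 13,018 is from the text).
nets = {"treat": ["anesthetize", "inject"], "inject": ["dissolve"]}
freq = {"treat": 13018, "inject": 900, "anesthetize": 40, "dissolve": 350}
hw = select_headwords(["treat"], lambda v: nets.get(v, []), freq)
print(sorted(hw))   # ['dissolve', 'inject', 'treat'] (anesthetize too rare)
```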
In previous research (Williams & Millon, 2009), Levin's classification of verbs was considered (Levin, 1993). However, this classification does not suit all cases, as it was not built taking corpus data into account. Another option would have been to use a vast hierarchical ontology such as WordNet but, as Hanks (2006) points out, not all lexical items fit into a hierarchical ontology: the relations between lexical units are not always of the same kind. Indeed, Hanks' point of view has inspired our way of grouping the different collocates into classes. Moreover, an analysis of the different collocations observed in the network brings out different semantic patterns of usage. These lexical patterns are determined using CPA. For our example, Hanks establishes four CPA patterns in general texts, as shown in Figure 4:

No. 1 (69%): [[Human 1 | Institution 1 | Animal 1]] treat [[Human 2 | Animal 2 | Entity | Event]] [Adv[Manner]]
Implicature: [[Human 1 | Institution 1 | Animal 1]] behaves toward [[Human 2 | Animal 2 | Entity | Event]] in the [[Manner]] specified

No. 2 (17%): [[{Human 1 = Health Professional} | {Process = Medical} | Drug]] treat [[{Human 2 = Patient} | {Animal = Patient} | Disease | Injury]] [NO ADV]
Implicature: [[Human 1 = Health Professional]] applies a [[Drug]] or [[Process = Medical]] to [[Human 2 = Patient]] for the purpose of curing the patient's [[Disease | Injury]]

No. 3 (5%): [[Human]] treat [[Inanimate]] (with [[Stuff]] | by [[Process]])
Implicature: The chemical or other properties of [[Inanimate]] are improved or otherwise changed by [[Process]] or the application of [[Stuff]]

No. 4 (5%): [[Human 1]] treat [[Human 2 | Self]] {(to [[Eventuality = Good]])}
Implicature: [[Human 1]] gives or pays for [[Eventuality = Good]] as a benefit for [[Human 2 | Self]]
Figure 4: CPA patterns of the verb to treat, extracted from Hanks' PDEV5

5 http://deb.fi.muni.cz/pdev/?action=patterns&id=treat

As can be seen, the four patterns stand for different meanings of the verb to treat. The percentages assigned to the patterns show their distribution within the corpus. At first sight, patterns 2 and 3 seem the most thematically marked, the former related to the medical domain and the latter to chemistry, so one might expect these to be the patterns commonly used in scientific texts. However, analysing the BMC corpus with CPA brings out differences of usage. The collocational network already shows that the patterns found do not always coincide with those distinguished in general texts. In illustrating our working methodology with the verb to treat in the BMC corpus, pattern 1 of the verb is close to the second CPA pattern (see Figure 4) in that it refers to a medical context. Indeed, in the BMC corpus the following normal pattern is found:

• X treat Y with Z
o [[Human 1 | Human Group]] treat [[Human 2 = Patient | Laboratory Animal = Rat, Mouse | Organism = Cell]] (with [[Drug = Vehicle]])

In this pattern, the different collocates are gathered into different semantic types, as in CPA. When applying CPA to the BMC corpus, however, it is clearly not always possible to use the same ontology. The ontology used in CPA is a corpus-driven shallow ontology created from a general corpus. Many of its semantic types are not necessary in our case; conversely, semantic types that are not considered in the CPA ontology are needed to explain specific uses of a word in biomedical texts. It is in fact the selection of specific semantic types, semantic roles and lexical sets which makes the difference between the general and specialised use of a lexical unit. For instance, in the pattern shown above, not all animals are treated: the semantic type specifically refers to 'Laboratory Animals'.
There is a restriction on what is being treated. In reality, the lexical sets that define a given semantic type change according to each verb: for example, we treat rats and mice, but we do not treat lions or elephants. Hanks & Ježek (2008) have referred to this change as 'shimmering lexical sets'. By looking at the concordances of treat, a slight difference between CPA pattern 2 and our pattern 1 can also be detected. Most occurrences of treat refer to medical research, not to medical practice: an animal is treated not for the purpose of being cured, but in order to find a cure for a disease. The implicature is not exactly the same. The collocational network shown in Figure 3 also reveals that the collocate vehicle is polysemous. Indeed, in the collocational network, the verbal collocates of the nominal collocates of the central verb treat do not necessarily collocate with treat, since the nouns have in turn been taken as word-nodes. Thus, collocational networks do not stand for one particular meaning of the verb from which they are built. If we consider the noun vehicle (see Figure 3), within the occurrences of the collocations (at the lemma level) vehicle - operate and vehicle - move, the noun vehicle denotes a means of transport, whereas in its syntagmatic lexical relations with the verb treat, or with its other verbal collocates in the network, it is a medical term referring to an excipient; hence the presence in the network of its verbal collocates dissolve, deliver, administer, receive and inject. These two meanings are linked, because an excipient serves to 'transport' the active ingredients of a medication. This leads us to posit two nominal semantic types to which the noun vehicle will be attached: 'Transport' and 'Drug'. The verbs dissolve, deliver, administer, receive and inject stand in lexical relation with the semantic type 'Drug', grouping together into a verbal conceptual class that we might name 'Giving drugs'.
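The reasoning about 'vehicle' can be sketched as a tiny collocate-based disambiguation routine: a noun's verbal collocates vote for the semantic types it can bear. The cue sets below are illustrative, not taken from DicSci.

```python
# Illustrative mapping from verbal collocates to nominal semantic types;
# the cues for 'Drug' follow the verbs named in the text, 'Transport' is assumed.
TYPE_CUES = {
    "Drug": {"dissolve", "deliver", "administer", "receive", "inject"},
    "Transport": {"operate", "move", "drive"},
}

def semantic_types(noun_collocates):
    """Return every semantic type supported by at least one verbal collocate."""
    return {t for t, cues in TYPE_CUES.items() if cues & set(noun_collocates)}

# 'vehicle' attracts verbs from both cue sets, so both types fire:
print(sorted(semantic_types(["operate", "move", "inject", "dissolve"])))
# ['Drug', 'Transport']
```

A noun for which both types fire, like vehicle here, is flagged as polysemous in the network.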
Concerning the conceptual classes into which the verbs of DicSci will be gathered, FrameNet is consulted but, ultimately, the verbal clustering in DicSci is based on the specialised contexts of the BMC corpus. Using CPA has brought out the need for a shallow ontology to explain the phraseological tendencies of verbs used in science. Indeed, phraseology occupies a central place in language use, notably through collocation: in the lexicon of a given language, there are strong syntagmatic links between words. The phraseology of a language means that a speaker (or writer), especially a non-native one, can produce unnatural speech by using a 'wrong' word even if it matches the idea to be expressed. Language use is largely made up of conventional lexical combinations that native speakers have unconsciously memorised through lifelong exposure. Non-native speakers, who lack this linguistic experience, tend to construct their speech according to the semantic compatibility between words rather than their lexical compatibility. Thus speakers, especially non-native ones, must know the phraseology in use within the language in order to produce natural speech. Naturally, within the same language, lexical preferences may differ notably between the general language and specialised ones, as L'Homme (1998) and Heid & Freibott (1991) notably observe. The mechanism used allows the conceptual classes that semantically link verbs in the dictionary to grow naturally as new verbs are analysed, eventually splitting into several sub-classes; this has been illustrated in Williams & Millon (2009). In addition to conceptual classes of verbs, nominal ones are also created, based on the collocational networks of the verbs and, notably, on the shared collocates reported in them.
It is important to underline that although the networks can be built automatically, the eye of the lexicographer is essential. What we extract are potential collocates; only through analysis of the concordances can potential definitions be drawn up. The semantic grouping of verbs and nouns follows the same principle: although the items fall together naturally, their interpretation and naming is the work of the lexicographer. Nevertheless, we plan to apply the word sense discrimination algorithm developed by Millon (2011), as we believe this processing would help with the task. The next step in the creation of DicSci is to add the information extracted from the collocational networks and verbal patterns to the entries of the dictionary. For this, the dictionary production software TshwaneLex6 is being used. The E-Advanced Dictionary of Verbs in Science is conceived as a virtual dictionary. Using visualisation techniques, the idea is to enter the dictionary by means of the collocational networks and, from there, to move to the verbal patterns, concordances and dictionary entries. The grouping of verbs into classes will also give the user more options for visualising not only syntagmatic but also paradigmatic relations between lexical units.

5. Conclusions

The first aim of the DicSci project is to build an organic online dictionary of verbs used in science which reflects usage and assists non-native speakers of English with production. To this end, a working methodology based on collocational networks, collocational resonance and Hanks' Corpus Pattern Analysis (CPA) is being developed. In this article, special attention has been paid to the use of collocational networks and to the application of CPA in building up the dictionary.
Collocational networks provide a natural selection of the main cognitive nodes of scientific texts, show links between lexical units, demonstrate thematic patterns in texts, and facilitate observation of the 'normal' use(s) of a specific lexical unit in a scientific context. By taking each collocate in turn, a number of lexico-semantic patterns can be detected. For this, the Corpus Pattern Analysis procedure described by Patrick Hanks is used. The CPA method allows us to show the central and prototypical uses of a verb in science, and by comparing our results with the output of Hanks' own CPA work, the PDEV, differences between 'general' and 'specialised' uses can be highlighted. From the patterns, the meaning potentials of the verbs can be inferred in a second stage. Furthermore, collocational networks and semantic patterns show similarities and differences between the different uses of a lexical unit; both mechanisms facilitate sense disambiguation of polysemous words. The proposed methodology also shows differences and similarities between different lexical units. Words that are semantically related can be clustered together naturally in a conceptual class. In this way, both paradigmatic and syntagmatic relations can be illustrated. The working methodology permits different ways to structure, organise and access the DicSci entries. In this sense, the dictionary is structured and organised according to the collocational networks. Apart from the traditional alphabetical ordering of entries, in DicSci each central node of a network, which corresponds to a verb, is an access point to the entries of the dictionary. Each verbal collocate can also be the central node of another network and, therefore, another way into the dictionary. At the same time, other collocates, such as nouns or adjectives, can also serve as a means of access. The grouping of verbs will also permit access to the main verbal lexical units.

6 http://tshwanedje.com/tshwanelex/
The dictionary is both semasiologically and onomasiologically conceived. DicSci is an ongoing bottom-up, corpus-driven dictionary which describes how verbs are used in science. It is organic in the sense that it is being developed in a natural and continuous process. It is dynamic, a moving system: each collocational network can bring about new uses and new relations between verbs and lexical units already included in the dictionary, and the relations between the units are continuously in motion. In this paper, we have explored the first stage of building up the dictionary, which concerns its global organisation and structure, the selection of headwords, the establishment of classes and the demonstration of semantic patterns. Further development is needed in relation to the definition and naming of conceptual classes and the microstructure of each entry. In a second stage, we also expect to apply the mechanism of collocational resonance to assist a better understanding of the movement from general to specialised usage of language, or vice versa. The final aim of the DicSci project is to compile a dictionary which explains not only the terminological tendencies of words used in science, but also their phraseological tendencies. The information included will help non-native speakers of English who need to produce scientific texts in English to improve their communication skills at different levels.

6. Acknowledgements

This paper was written during a postdoctoral research stay by one of the authors at the Equipe LiCoRN-Laboratoire HCTI of the Université de Bretagne-Sud, directed by Prof. Geoffrey Williams, within the framework of the National Mobility Programme of Human Resources of the R+D National Programme 2008-2011, financed by the Spanish Ministry of Education.
This research has also been funded by the European project Metricc and the Spanish national project HUM2009-07588/FILO, supported by the Spanish Ministry of Education and Science.

7. References

Alonso, A. (2009). Características del léxico del medio ambiente y pautas de representación en el diccionario general. Unpublished PhD thesis. Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, Barcelona.
Bergenholtz, H., Tarp, S. (1995). Manual of Specialised Lexicography. The Preparation of Specialised Dictionaries. Benjamins Translation Library 12. Amsterdam/Philadelphia: John Benjamins.
Cabré, M. T. (1999). La terminología. Representación y comunicación. Elementos para una teoría de base comunicativa y otros artículos. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra.
Ciapuscio, G. (2003). Textos especializados y terminología. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra.
Cowie, A.P. (2002). English Dictionaries for Foreign Learners. A History. Oxford: Oxford University Press.
Cowie, A.P. (2009). The Oxford History of English Lexicography. Volume II. Oxford: Clarendon.
Church, K., Hanks, P. (1990). Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1), pp. 22-29.
Fuertes-Olivera, P.A. (2009). Specialised Lexicography for Learners: Specific Proposals for the Construction of Pedagogically Oriented Printed Business Dictionaries. Hermes - Journal of Language and Communication Studies, 42, pp. 167-188.
Fuertes-Olivera, P.A. (ed.) (2010). Specialised Dictionaries for Learners. Lexicographica Series Maior, 136. Berlin/New York: De Gruyter.
Gledhill, C. (2000). Collocations in Science Writing. Tübingen: Gunter Narr Verlag.
Halliday, M.A.K. (1961). Categories of the Theory of Grammar. Word, 17, pp. 241-292.
Hanks, P. (2002). Mapping Meaning onto Use. In M.-H. Corréard (ed.)
Lexicography and Natural Language Processing: A Festschrift in Honour of B. T. S. Atkins. Göteborg: Euralex, Göteborg University, pp. 156-198.
Hanks, P. (2004). The Syntagmatics of Metaphor and Idiom. International Journal of Lexicography, 17(3), pp. 245-274.
Hanks, P. (2006). The Organization of the Lexicon: Semantic Types and Lexical Sets. In C. Marello et al. (eds.) Proceedings of the XII EURALEX International Congress. Torino: Università di Torino, pp. 1165-1168.
Hanks, P. (2010a). How People Use Words to Make Meanings. In B. Sharp, M. Zock (eds.) Proceedings of the 7th International Workshop on Natural Language Processing and Cognitive Science, NLPCS 2010, in conjunction with ICEIS 2010, Funchal, Madeira, Portugal, pp. 3-13.
Hanks, P. (2010b). Terminology, Phraseology, and Lexicography. In A. Dykstra, T. Schoonheim (eds.) Proceedings of the XIV EURALEX International Congress. Leeuwarden, The Netherlands: Fryske Akademy.
Hanks, P. (2010c). Compiling a Monolingual Dictionary for Native Speakers. Lexikos, 20, pp. 580-598.
Hanks, P. (forthcoming). Lexical Analysis: Norms and Exploitations. Cambridge, MA: The MIT Press.
Hanks, P., Ježek, E. (2008). Shimmering Lexical Sets. In E. Bernal, J. DeCesaris (eds.) Proceedings of the XIII EURALEX International Congress. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, pp. 391-402.
Heid, U., Freibott, G. (1991). Collocations dans une base de données terminologique et lexicale. Meta, 36(1), pp. 77-91.
Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. London: Routledge.
Hunston, S., Francis, G. (1999). Pattern Grammar. A Corpus-Driven Approach to the Lexical Grammar of English. Amsterdam/Philadelphia: John Benjamins.
Hunston, S., Sinclair, J. (2003). A Local Grammar of Evaluation. In S. Hunston, G. Thompson (eds.) Evaluation in Text: Authorial Stance and the Construction of Discourse. Oxford: Oxford University Press, pp. 74-101.
Ježek, E., Hanks, P. (2010).
What Lexical Sets Tell Us about Conceptual Categories. In Corpus Linguistics and the Lexicon, special issue of Lexis, E-Journal in English Lexicology, 4, pp. 7-22.
L'Homme, M.-C. (1998). Caractérisation des combinaisons lexicales spécialisées par rapport aux collocations de langue générale. In Proceedings of the VIII EURALEX International Congress. Liège, Belgium, pp. 513-522.
L'Homme, M.-C. (2005). Sur la notion de «terme». Meta: Translators' Journal, 50(4), pp. 1112-1132. Online: http://id.erudit.org/iderudit/012064ar
L'Homme, M.-C., Leroyer, P. (2009). Combining the Semantics of Collocations with Situation-Driven Search Paths in Specialized Dictionaries. Terminology, 15(2), pp. 258-283.
Levin, B. (1993). English Verb Classes and Alternations: A Preliminary Investigation. Chicago: University of Chicago Press.
Lorente, M. (2007). Les unitats lèxiques verbals dels textos especialitzats. Redefinició d'una proposta de classificació. In M. Lorente et al. (eds.) Estudis de lingüística i de lingüística aplicada en honor de M. Teresa Cabré Castellví. Vol. 2: De deixebles 2. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra; Documenta Universitaria, pp. 365-380.
Lorente, M. (2009). Verbos y fraseología en los discursos de especialidad. In M. Casas, R. Márquez (eds.) XI Jornadas de Lingüística: homenaje al profesor José Luis Guijarro Morales (Cádiz, 22 y 23 de abril de 2008). Cádiz: Universidad de Cádiz, Servicio de Publicaciones, pp. 55-84.
Louw, B. (1993). Irony in the Text or Insincerity in the Writer? The Diagnostic Potential of Semantic Prosodies. In M. Baker (ed.) Text and Technology. Amsterdam: John Benjamins, pp. 157-176.
Louw, B. (2000/2008). Contextual Prosody Theory: Bringing Semantic Prosodies to Life. In C. Heffer, H. Sauntson (eds.) Words in Context: A Tribute to John Sinclair on his Retirement. CD-ROM: English Language Research Discourse Analysis Monograph No. 18.
Reprinted in the online journal Texto (2008): http://www.revue-texto.net/index.php?id=124.
Meyer, I. (2000). Computer Words in Our Everyday Lives: How Are They Interesting for Terminography and Lexicography? In U. Heid et al. (eds.) Proceedings of the IX EURALEX International Congress 2000. Stuttgart: Universität Stuttgart, pp. 39-57.
Meyer, I., Mackintosh, K., Varantola, K. (1997). Exploring the Reality of Virtual: On the Lexical Implications of Becoming a Knowledge Society. Lexicology, 3(1), pp. 129-163.
Millon, C. (2011). Acquisition automatique de relations lexicales désambiguïsées à partir du Web. Unpublished PhD thesis. Université de Bretagne-Sud, Lorient.
Moulin, A. (1983). LSP Dictionaries for EFL Learners. In R.R.K. Hartmann (ed.) Lexicography: Principles and Practice. London: Academic Press, pp. 144-152.
Phillips, M. (1985). Aspects of Text Structure: An Investigation of the Lexical Organisation of Text. Amsterdam: North-Holland.
Renau, I., Alonso, A. (in press). Using Corpus Pattern Analysis for the Spanish Learner's Dictionary DAELE (Diccionario de aprendizaje del español como lengua extranjera). In Proceedings of the Corpus Linguistics Conference 2011. Birmingham: University of Birmingham.
Roe, P. (1977). The Notion of Difficulty in Scientific Text. Unpublished PhD thesis. University of Birmingham, Birmingham.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Tarp, S. (2008). Lexicography in the Borderland between Knowledge and Non-knowledge: General Lexicographical Theory with Particular Focus on Learner's Lexicography. Lexicographica Series Maior 134. Tübingen: Max Niemeyer Verlag.
ten Hacken, P. (2008). Prototypes and Discreteness in Terminology. In E. Bernal, J. DeCesaris (eds.) Proceedings of the XIII EURALEX International Congress, Barcelona, 15-19 July 2008. Papers de l'IULA, Sèrie Activitats 20. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra; Documenta Universitaria.
Williams, G.
(1998). Collocational Networks: Interlocking Patterns of Lexis in a Corpus of Plant Biology Research Articles. International Journal of Corpus Linguistics, 3(1), pp. 151-171.
Williams, G. (2002). In Search of Representativity in Specialised Corpora: Categorisation through Collocation. International Journal of Corpus Linguistics, 7(1), pp. 43-64.
Williams, G. (2003). From Meaning to Words and Back: Corpus Linguistics and Specialised Lexicography. ASp, la revue du GERAS, 39-40, pp. 91-106. Online: http://asp.revues.org/1320.
Williams, G. (2006). Advanced ESP and the Learner's Dictionary. In C. Marello et al. (eds.) Proceedings of the XII EURALEX International Congress. Torino: Università di Torino, pp. 795-801.
Williams, G. (2008a). Verbs of Science and the Learner's Dictionary. In J. DeCesaris, E. Bernal (eds.) Proceedings of the XIII EURALEX International Congress, Barcelona, 15-19 July 2008. Papers de l'IULA, Sèrie Activitats 20. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra; Documenta Universitaria.
Williams, G. (2008b). The Good Lord and His Works: A Corpus-Based Study of Collocational Resonance. In S. Granger, F. Meunier (eds.) Phraseology: An Interdisciplinary Perspective. Amsterdam: John Benjamins, pp. 159-174.
Williams, G. (2008c). Les corpus et le dictionnaire dans les langues scientifiques. In F. Maniez et al. (eds.) Corpus et dictionnaires de langues de spécialité. Grenoble: Presses Universitaires de Grenoble.
Williams, G., Millon, C. (2009). The General and the Specific: Collocational Resonance of Scientific Language. In Proceedings of the Corpus Linguistics Conference CL2009, 20-23 July 2009. Liverpool: University of Liverpool.
Williams, G., Millon, C. (2010). Going Organic: Building an Experimental Bottom-up Dictionary of Verbs in Science. In A. Dykstra, T. Schoonheim (eds.) Proceedings of the XIV EURALEX International Congress. Leeuwarden, The Netherlands: Fryske Akademy, pp. 1251-1257.
Wittgenstein, L. (1953).
Lexical Profiling for Arabic

Mohammed Attia, Pavel Pecina, Lamia Tounsi, Antonio Toral and Josef van Genabith
School of Computing, Dublin City University, Dublin, Ireland
E-mail: {mattia, ppecina, atoral, ltounsi, josef}@computing.dcu.ie

Abstract

We provide lexical profiling for Arabic by covering two important linguistic aspects of Arabic lexical information, namely morphological inflectional paradigms and syntactic subcategorization frames, making our database a rich repository of Arabic lexicographic details. First, we provide a complete description of the inflectional behaviour of Arabic lemmas based on statistical distribution. We use a corpus of 1,089,111,204 words, a pre-annotation tool, knowledge-based rules, and machine learning techniques to automatically acquire lexical knowledge about words' morpho-syntactic attributes and inflection possibilities. Second, we automatically extract Arabic subcategorization frames (or predicate-argument structures) from the Penn Arabic Treebank (ATB) for a large number of Arabic lemmas, including verbs, nouns and adjectives. We compare the results against a manually constructed collection of subcategorization frames designed for an Arabic LFG parser. The comparison shows that we achieve high precision scores for all three word classes. Both morphological and syntactic specifications are combined and connected in a scalable and interoperable lexical database suitable for constructing a morphological analyser, aiding a syntactic parser, or even building an Arabic dictionary. We have built a web application, AraComLex (Arabic Computer Lexicon), available at http://www.cngl.ie/aracomlex, for managing and maintaining the standardized and scalable lexical database.

Keywords: Arabic; subcategorization frames; morphological analysis; morphological paradigms

1.
Introduction

In a typical dictionary entry for a word, one expects to find basic information pertaining to the word's morphology (possible inflections) and syntax (part of speech; whether it is transitive or intransitive, in the case of verbs; and what prepositions it can co-occur with). Yet existing Arabic dictionaries have several limitations. Most of them do not rely on a corpus for attesting the validity of their entries (as in the COBUILD approach (Sinclair, 1987)); instead, they typically offer refinements, expansions, corrections, or organisational improvements over previous dictionaries. Therefore, they tend to include obsolete words that are not in contemporary use. Furthermore, they often do not explicitly state all the possible inflection paradigms, and they do not provide sufficient syntactic information on a word's obligatory combinations (or argument list). The aim here is to resolve these shortcomings by automatically providing a complete description of the inflectional and syntactic behaviour of Arabic lexical entries, based on statistical distribution in treebanks and un-annotated corpora. The work described in this paper is divided into two major parts. The first focuses on examining the statistical distribution of inflection paradigms for lexical entries in a large corpus pre-annotated with MADA (Roth et al., 2008), a tool which performs morphological analysis and disambiguation using the Buckwalter morphological analyser (Buckwalter, 2004) and machine learning. The second concerns the automatic extraction of syntactic information, or subcategorization frames, from the Penn Arabic Treebank (ATB) (Maamouri and Bies, 2004). To the best of our knowledge, this is the first attempt at extracting subcategorization frames from the ATB. The subcategorization requirements of lexical entries are an important type of lexical information, as they indicate the argument(s) a predicate needs in order to form a well-formed syntactic structure.
Yet producing such resources by hand is costly and time-consuming. Moreover, as Manning (1993) indicates, dictionaries produced by hand will tend to lag behind real language use because of their static nature. Therefore a complete, or at least complementary, automatic process is highly desirable.

This paper is structured as follows. In the introduction we describe the motivation behind our work. We differentiate between Modern Standard Arabic (MSA), the focus of this research, and Classical Arabic (CA), a historical form of the language. We briefly explain the current state of Arabic lexicography and describe how outdated words remain abundant in current dictionaries. We then outline the Arabic morphological system to show what layers and tiers are involved in word derivation and inflection. In Section 2, we present the results obtained to date in building and extending the lexical database using a data-driven filtering method and machine learning techniques. We also explain how we use knowledge-based pattern matching to detect and extract broken plural forms. In Section 3, we explain the method we followed in extracting and evaluating subcategorization frames for Arabic verbs, nouns and adjectives. In Section 4, we describe AraComLex, a web application we built for curating and combining our lexical resources. Finally, Section 5 concludes.

1.1 Modern Standard Arabic vs. Classical Arabic

Modern Standard Arabic (MSA), the subject of our research, is the language of modern writing, prepared speeches, and the news. It is universally understood by Arabic speakers around the world. MSA stands in contrast to both Classical Arabic (CA) and the vernacular Arabic dialects. CA is the language which originated in the Arabian Peninsula centuries before the emergence of Islam and continued to be the standard language until medieval times.
CA continues to the present day as the language of religious teaching, poetry, and scholarly literature. MSA is a direct descendant of CA and is used today throughout the Arab World in writing and in formal speaking (Bin-Muqbil, 2006). MSA differs from CA at the lexical, morphological, and syntactic levels (Watson, 2002; Elgibali and Badawi, 1996; Fischer, 1997). At the lexical level, there has been a significant expansion of the lexicon to cater for the needs of modernity. New words are constantly coined or borrowed from foreign languages, while many words from CA have become obsolete. Although MSA conforms to the general rules of CA, it shows a tendency towards simplification, and modern writers use only a subset of the full range of structures, inflections, and derivations available in CA. For example, Arabic speakers no longer strictly abide by case ending rules, which has led some structures to become obsolete, while some syntactic structures which were marginal in CA have gained salience in MSA. For instance, the object-verb-subject word order, one of the classical structures, is rarely found in MSA, while the relatively marginal subject-verb-object order of CA is gaining weight in MSA. This is confirmed by Van Mol (2003), who pointed out that MSA word order has shifted balance, as the subject now precedes the verb more frequently, breaking from the classical default word order of verb-subject-object.

1.2 The Current State of Arabic Lexicography

To date, there is no large-scale lexicon (computational or otherwise) for MSA that is truly representative of the language. Al-Sulaiti (2006) emphasises that existing dictionaries are not corpus-based. Ghazali and Braham (2001) stress the need for new dictionaries based on an empirical approach that makes use of contextual analysis of modern language corpora.
They point out that traditional Arabic dictionaries are based on historical perspectives and tend to include obsolete words that are no longer in current use. The inclusion of these rarities inevitably affects the representativeness of dictionaries and marks a significant bias towards historical or literary forms. In recent years, some advances have been made (Van Mol, 2000; Boudelaa and Marslen-Wilson, 2010), but they are not enough in terms of size or breadth of linguistic description. The Buckwalter Arabic Morphological Analyzer (BAMA) (Buckwalter, 2004) is widely used by the Arabic NLP research community. It is a de facto standard tool and has been described as the "most respected lexical resource of its kind" (Hajič et al., 2005). It is designed as a main database of 40,648 lemmas supplemented by three morphological compatibility tables used for controlling affix-stem combinations. Further advantages of BAMA are that it provides information on the root, reconstructs vowel marks and provides an English glossary. The latest version of BAMA has been renamed SAMA (Standard Arabic Morphological Analyzer), version 3.1 (Maamouri et al., 2010). Unfortunately, there are drawbacks in the SAMA lexical database that call into question whether it is a faithful representation of MSA. We estimate that about 25% of the lexical items included in SAMA are outdated, based on the data-driven filtering method explained in Section 2.2.1. SAMA suffers from a legacy of heavy reliance on older Arabic dictionaries, particularly Wehr's dictionary (Wehr & Cowan, 1976), in the compilation of its lexical database. Therefore, there is a strong need to compile a lexicon for MSA that follows modern lexicographic conventions (Atkins and Rundell, 2008), in order to make the lexicon a reliable representation of the language and a useful resource for NLP applications dealing with MSA. Our work represents a further step towards addressing this critical gap in Arabic lexicography.
We use a large corpus of more than one billion words to automatically create a lexical database for MSA. We enrich the lexicon with syntactic information by extracting subcategorization frames and significant preposition collocates from the ATB.

1.3 Arabic Morphological System

Arabic morphology is well known for being rich and complex. It has a multi-tiered structure where words are originally derived from roots and pass through a series of affixations and clitic attachments until they finally appear as surface forms. Morphotactics refers to the way morphemes combine together to form words (Beesley and Karttunen, 2003). Generally speaking, morphotactics can be concatenative, with morphemes either prefixed or suffixed to stems, or non-concatenative, with stems undergoing internal alterations to convey morpho-syntactic information (Kiraz, 2001). Arabic is considered a typical example of a language that employs both concatenative and non-concatenative morphotactics. For example, the verb {istaEomaluwhA 'they-used-it' and the noun wAl{istiEomAlAt 'and-the-uses' both originate from the root Eml.1 Figure 1 shows the layers and tiers embedded in the representation of the Arabic morphological system. The derivation layer is non-concatenative and opaque in the sense that it is a sort of abstraction that affects the choice of a part of speech (POS), and it does not have a direct explicit surface manifestation. By contrast, the inflection layer is more transparent. It applies concatenative morphotactics by using affixes to express morpho-syntactic features. We note that verbs at this level show what are called 'separated dependencies', meaning that some prefixes determine the selection of suffixes.

1 All examples are written in Buckwalter Transliteration.

Figure 1: Arabic Morphology's Multi-tier Structure

In the derivational layer, Arabic words are formed through the amalgamation of two tiers, namely root and pattern.
A root is a sequence of three consonants, and the pattern is a template of vowels with slots into which the consonants of the root are inserted. This process of insertion is called interdigitation (Beesley and Karttunen, 2003). An example is shown in Table 1.

Root:    drs        drs         drs        drs
POS:     V          V           N          N
Pattern: R1aR2aR3   R1aR2R2aR3  R1AR2iR3   muR1aR2~iR3
Stem:    darasa     darrasa     dAris      mudar~is
Gloss:   'study'    'teach'     'student'  'teacher'

Table 1: Root and Pattern Interdigitation

2. Extending the Existing Lexicon

In this section, we describe the small-scale, manually-constructed lexical resources we started with, and how we managed to significantly extend them. We explain how we filter out obsolete words, how we use machine learning to acquire knowledge on morphological paradigms (or continuation classes) for new entries, and how we extract broken plural forms from our corpus. The corpus we use contains 1,089,111,204 words, consisting of 925,461,707 words from the Arabic Gigaword (Parker et al., 2009), in addition to 163,649,497 words from news articles we collected from the Al-Jazeera web site.2

2.1 Existing Lexical Resources

There are three key components in the Arabic morphological system: root, pattern and lemma. To accommodate these components, we acquire three lexical databases: one for lemmas, one for word patterns, and one for lemma-root lookup. The lemma database is collected from Attia (2006) and was developed manually. It includes 5,925 nominal lemmas (nouns and adjectives) and 1,529 verb lemmas. The advantage of the lemma entries in this resource is that they are fully specified with the necessary morpho-syntactic information. In addition to the usual specification of gender, number and person, it provides information on continuation classes for nominals (as shown in Table 2) and on whether a noun denotes a human or non-human entity. For verbs, it gives details on transitivity, on whether the passive voice is allowed, and on whether the imperative mood is allowed.
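The interdigitation process in Table 1 can be sketched as a simple template-filling operation. This is a minimal illustration, assuming patterns are written with R1/R2/R3 as slots for the three root consonants; the slot notation and the trailing vowel on the perfective verb patterns are illustrative conventions, not the authors' internal format.

```python
# A minimal sketch of root-and-pattern interdigitation (cf. Table 1).
# Patterns use hypothetical R1/R2/R3 slots for the root consonants.
def interdigitate(root: str, pattern: str) -> str:
    """Insert the consonants of a triliteral root into a pattern template."""
    stem = pattern
    for i, consonant in enumerate(root, start=1):
        stem = stem.replace(f"R{i}", consonant)
    return stem

# Perfective verb patterns here carry a final vowel 'a' (an assumption
# made so the output matches the cited surface forms).
print(interdigitate("drs", "R1aR2aR3a"))     # darasa   'study'
print(interdigitate("drs", "R1aR2R2aR3a"))   # darrasa  'teach'
print(interdigitate("drs", "R1AR2iR3"))      # dAris    'student'
print(interdigitate("drs", "muR1aR2~iR3"))   # mudar~is 'teacher'
```

The same mechanism runs in reverse for pattern matching: abstracting a surface stem back to its template is what makes patterns useful for coarse-grained categorisation of word forms.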
We automatically create the lemma-root lookup database relying on the SAMA database. We manually developed a database of Arabic patterns that includes 490 patterns (456 for nominals and 34 for verbs). These patterns can be used as indicators of the morphological inflectional and derivational behaviour of Arabic words. Patterns are also powerful in the abstraction and coarse-grained categorisation of word forms.

2.2 Extending the Lexical Database

In extending our lexicon, we rely on Attia's manually-constructed lexicon (Attia, 2006) and the lexical database in SAMA 3.1 (Maamouri et al., 2010). Creating a lexicon is usually a labour-intensive task. For instance, Attia took three years to develop his morphology, while SAMA and its predecessor, BAMA, were developed over more than a decade, with at least seven people involved in updating and maintaining the morphology. Our objective here is to automatically extend Attia's lexicon (Attia, 2006) using SAMA's database. In order to do this, we need to solve two problems. First, SAMA suffers from a legacy of obsolete entries, and we need to filter out these outdated words, as we want to enrich the lexicon only with lexical items that are still in current use. Second, Attia's lexicon requires features (such as humanness for nouns and transitivity for verbs) that are not provided by SAMA, and we want to automatically induce these features.

2.2.1 Lexical Enrichment

To address the first problem, we use a data-driven filtering method that combines open web search engines and our pre-annotated corpus. Using frequency statistics3 on lemmas from three web search engines (Al-Jazeera,4 Arabic Wikipedia,5 and the Arabic BBC website6), we find that 7,095 lemmas in SAMA have zero hits.

2 http://aljazeera.net/portal. Collected in January 2010.
3 Statistics were collected in January 2011.
4 http://aljazeera.net/portal
5 http://ar.wikipedia.org
6 http://www.bbc.co.uk/arabic/

Frequency statistics from our corpus show that 3,604 lemmas are not used in the corpus at all, and 4,471 lemmas occur fewer than 10 times. Combining the frequency statistics from the web and the corpus, we find that 29,627 lemmas returned at least one hit in the web queries and occurred at least 10 times in the corpus. The threshold of 10 occurrences is discretionary, but the aim is to separate the stable core of the language from instances where the use of a word is perhaps accidental or somewhat idiosyncratic. We consider the refined list as representative of the lexicon of MSA as attested by our statistics.

No | Lemma (gloss) | Attested inflected forms | Continuation class
1 | muEal~im 'teacher' | muEal~imap, muEal~imAn, muEal~imatAn, muEal~imuwn, muEal~imAt | F-Mdu-Fdu-Mpl-Fpl
2 | TAlib 'student' | TAlibap, TAlibAn, TAlibatAn, TAlibAt | F-Mdu-Fdu-Fpl
3 | taHoDiyriy~ 'preparatory' | taHoDiyriy~ap, taHoDiyriy~An, taHoDiyriy~atAn | F-Mdu-Fdu
4 | baqarap 'cow' | baqaratAn, baqarAt | Fdu-Fpl
5 | tanAzul 'concession' | tanAzulAt | Fpl
6 | DaHiy~ap 'victim' | DaHiy~atAn | Fdu
7 | maHoD 'mere' | maHoDap | F
8 | {imotiHAn 'exam' | {imotiHAnAn, {imotiHAnAt | Mdu-Fdu
9 | Tay~Ar 'pilot' | Tay~ArAn, Tay~Aruwn | Mdu-Mpl
10 | kitAb 'book' | kitAbAn | Mdu
11 | diymuqoratiy~ 'democrat' | diymuqoratiy~uwn | Mpl
12 | xuruwj 'exiting' | (no number inflection) | NoNum
13 | mabAHiv 'investigators' | (irregular plural) | Irreg_pl

Table 2: Arabic Continuation Classes based on the inflection grid

2.2.2 Feature Enrichment

To address the second problem, we use a machine learning classification algorithm, the Multilayer Perceptron (Haykin, 1998). The main idea of machine learning is to automatically learn complex patterns from existing (training) data and make intelligent decisions on new (test) data. In our case, we have a seed lexicon (Attia, 2006) with lemmas manually annotated with classes, and we want to build a model for predicting the same classes for each new lemma added to the lexicon. The classes (second column in Table 3) for nominals are continuation classes (or inflection paths), the semantico-grammatical feature of humanness, and POS (noun or adjective). The classes for verbs are transitivity, allowing the passive voice, and allowing the imperative mood. From our seed lexicon we extract two datasets of 4,816 nominals and 1,448 verbs. We feed these datasets with frequency statistics from our pre-annotated corpus and build the statistics into a vector grid. The features (third column in Table 3) for nominals are number, gender, case and clitics; for verbs, they are number, gender, person, aspect, mood, voice, and clitics. For the implementation of the machine learning algorithm, we use the open-source application Weka, version 3.6.4.7 We split each dataset into 66% for training and 34% for testing. We conduct six classification experiments to provide the classes that we need to include in our lexical database. Table 3 gives the results of the experiments in terms of precision, recall, and f-measure. The results show that the highest f-measure scores (above 80%) are achieved for 'Human', 'POS', and 'Transitivity'. Typically one would assume that these features are hard to predict with any reasonable accuracy without taking the context into account. It was surprising to obtain such good prediction results based on statistics on morphological features alone. We also note that the f-measure for 'Continuation Classes' is comparatively low, but considering that here we are classifying for 13 classes, the results are in fact quite acceptable. Using the machine learning model, we annotate 12,974 new nominals and 5,034 verbs.

7 http://www.cs.waikato.ac.nz/ml/weka/

No | Classes | Features | P | R | F
Nominals
1 | Continuation classes: 13 classes | number, gender, case, clitics | 0.62 | 0.65 | 0.63
2 | Human: yes, no, unspecified | number, gender, case, clitics | 0.86 | 0.87 | 0.86
3 | POS: noun, adjective | number, gender, case, clitics | 0.85 | 0.86 | 0.85
Verbs
4 | Transitivity: transitive, intransitive | number, gender, person, aspect, mood, voice, clitics | 0.85 | 0.85 | 0.84
5 | Allow passive: yes, no | number, gender, person, aspect, mood, voice, clitics | 0.72 | 0.72 | 0.72
6 | Allow imperative: yes, no | number, gender, person, aspect, mood, voice, clitics | 0.63 | 0.65 | 0.64

Table 3: Results of the Classification Experiments

2.3 Handling Broken Plurals

Broken plurals are an interesting phenomenon in Arabic, where the plural is formed not through regular suffixation but by changing the word pattern. In our seed lexicon (Attia, 2006), we have 950 broken plurals which were collected manually and clearly tagged. In SAMA, however, broken plurals are rather poorly handled. SAMA does not mark broken plurals as "plurals" either in the source file or in the morphology output, and there is no straightforward way to automatically collect the list of all broken plural forms from SAMA. For example, the singular form jAnib "side" and the broken plural jawAnib "sides" are analysed as in (1) and (2) respectively.

(1) jAnib_1 jAnib jAnib/NOUN side/aspect
(2) jAnib_1 jawAnib jawAnib/NOUN sides/aspects

The only tags that distinguish the singular from the broken plural form are the gloss (or translation) and the voc (or vocalisation). We also note that MADA passes this problem on unsolved: broken plurals are all marked num=s, meaning that the number is singular. We believe that this shortcoming can have a detrimental effect on the performance of any syntactic parser based on such data. To extract broken plurals from our large MSA corpus (which is annotated with SAMA tags), we rely on the glosses of entries sharing the same LemmaID. We use Levenshtein Distance, which measures the similarity between two strings.
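The data-driven filtering described in Section 2.2.1 amounts to a simple conjunction of two attestation tests. The sketch below illustrates the idea; the lemma names and counts are hypothetical, and only the threshold of 10 corpus occurrences comes from the text.

```python
# A minimal sketch of the lemma-filtering step: keep a lemma only if it
# has at least one web hit AND at least 10 corpus occurrences.
# The counts below are hypothetical illustration data.
CORPUS_THRESHOLD = 10

def filter_lemmas(web_hits: dict, corpus_freq: dict) -> set:
    """Return the lemmas attested both on the web and in the corpus."""
    return {
        lemma
        for lemma, hits in web_hits.items()
        if hits > 0 and corpus_freq.get(lemma, 0) >= CORPUS_THRESHOLD
    }

web_hits = {"kitAb": 9500, "qiTEap": 4200, "obsolete_lemma": 0}
corpus_freq = {"kitAb": 88210, "qiTEap": 7, "obsolete_lemma": 3}
print(sorted(filter_lemmas(web_hits, corpus_freq)))  # ['kitAb']
```

A lemma with web hits but too few corpus occurrences (here "qiTEap") is dropped along with unattested entries, which is exactly how the refined list of 29,627 lemmas was obtained from the full SAMA inventory.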
For example, using Levenshtein Distance to measure the difference between "sides/aspects" and "side/aspect" gives a distance of 2. When this number is divided by the length of the first string, we obtain 0.15, which is within the threshold (here set to 0.4). Thus the two entries pass the test as possible broken plural candidates. Using this method, we collect 2,266 candidates. We believe, however, that many broken plural forms went undetected because the translation did not follow the assumed format. For example, the word harb has the translation "war/warfare" while the plural form huruwb has the translation "wars". To validate the list of candidates, we use Arabic word pattern matching. For instance, in the jAnib example above, the singular form (vocalisation) follows the pattern fAEil (or the regular expression .A.i.) and the plural form follows the pattern fawAEil (or .awA.i.). In our manually developed pattern database, fawAEil is a possible plural pattern for fAEil. Therefore, the matching succeeds, and the candidate is considered a valid broken plural entry. We compiled a list of 135 singular patterns that choose from a set of 82 broken plural patterns. The choice, however, is not free: each singular form has a limited predefined set of broken plural patterns to select from. Of the 2,266 candidates produced by Levenshtein Distance, 1,965 were validated using pattern matching, that is, 87% of the instances. When we remove the entries that intersect with our 950 manually collected broken plurals, 1,780 forms are left. This means that our lexicon now contains a total of 2,730 broken plural forms. Some insights can be gained from the statistics on Arabic plurals in our corpus. The corpus contains 5,570 lemmas which take a feminine plural suffix, 1,942 lemmas with a masculine plural suffix (of these, 1,273 forms intersect with the feminine plural suffix), and about 1,965 lemmas with a broken plural form.
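The two-stage test described above, Levenshtein distance on the glosses followed by regular-expression pattern validation, can be sketched as follows. The distance implementation is a generic textbook version, not the authors' code, and the 0.4 threshold and the .A.i. / .awA.i. regular expressions come from the text.

```python
import re

# Textbook Levenshtein distance (dynamic programming over two rows).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

# Stage 1: glosses of two entries with the same LemmaID are broken-plural
# candidates if distance / len(first gloss) is within the 0.4 threshold.
def is_plural_candidate(gloss_a: str, gloss_b: str, threshold: float = 0.4) -> bool:
    return levenshtein(gloss_a, gloss_b) / len(gloss_a) <= threshold

print(is_plural_candidate("sides/aspects", "side/aspect"))  # True (2/13 = 0.15)
print(is_plural_candidate("wars", "war/warfare"))           # False: format differs

# Stage 2: validate a candidate pair against the pattern database,
# e.g. singular fAEil (.A.i.) with broken plural fawAEil (.awA.i.).
def matches_pattern(form: str, pattern_regex: str) -> bool:
    return re.fullmatch(pattern_regex, form) is not None

print(matches_pattern("jAnib", ".A.i."))      # True
print(matches_pattern("jawAnib", ".awA.i."))  # True
```

The harb/huruwb example shows stage 1's known weakness: when the two glosses do not share a common format, the normalized distance exceeds the threshold and the pair is missed.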
This means that broken plural formation in Arabic is as productive as regular plural suffixation. Currently, we cannot explain why the feminine plural suffix enjoys this high preference, but we can point to the fact that masculine plural suffixes are used almost exclusively with natural gender, while the feminine plural suffix, as well as broken plurals, are used liberally with grammatical gender in addition to natural gender.

3. Automatic Extraction of Subcategorization Frames

The encoding of syntactic subcategorization frames is an essential requirement in the construction of computational and paper lexicons alike. For English, the construction and extraction of subcategorization frames has received a lot of attention; one example is the specialized lexicon COMLEX (Grishman et al., 1994), an extensive computational lexicon containing syntactic information for approximately 38,000 English headwords, with detailed information on subcategorization covering 138 distinct verb frames for 5,662 active verb lemmas. For Arabic, attention has been directed, almost exclusively, to the construction and automatic extraction of semantic roles (Palmer et al., 2008; Attia et al., 2008). Semantic roles are related to syntactic functions and surface phrase structures, but the three are at totally different and distinct layers of analysis, with grammatical functions in an intermediary position between phrase structures and semantic roles. A major concern in semantic role labelling is to achieve a greater level of generalization, and there is an emphasis on the fact that semantic labels do not vary across different syntactic constructions (Palmer et al., 2008).
For example, the Arabic verb lAHaZ "noticed" has two subcategorization frames: one for lAHaZa Al-faroq "He noticed the difference" and one for lAHaZa >an~a Al-maHoSuwla yanoquS "He noticed that the crop is decreasing". Yet, in the Arabic Propbank annotation,8 both frames have the same roleset:

Arg0: observer
Arg1: thing noticed or observed

To our knowledge, the only resource that currently exists for Arabic subcategorization frames is the lexicon manually developed for the Arabic LFG Parser (Attia, 2008). It is published as an open-source resource under the GPLv3 license.9 It contains 64 frame types, 2,709 lemma types, and 2,901 lemma-frame types, averaging 1.07 frames per lemma. The resource incorporates control information and details of specific prepositions with obliques. We use this resource in the evaluation of our automatically induced lexicon of Arabic subcategorization frames.

3.1 LFG Subcategorization Frames

The LFG syntactic theory (Dalrymple, 2001) distinguishes between governable (subcategorizable) and non-governable (non-subcategorizable) grammatical functions. The governable grammatical functions are the arguments required by some predicates in order to produce a well-formed syntactic structure; they are SUBJ(ect), OBJ(ect), OBJθ, OBL(ique)θ, COMP(lement) and XCOMP. The non-governable grammatical functions are not required in the sentence to form a well-formed structure; they are ADJ(unct) and XADJ. The subcategorization requirements in LFG are expressed in the following format (O'Donovan et al., 2005):

    π<gf1, gf2, ..., gfn>

where π is the lemma (predicate or semantic form) and each gf is a governable grammatical function. The value of the argument list of the semantic form ensures the well-formedness of the sentence. For example, in the sentence {iEotamada Al-Tifolu EalaY wAlidati-hi "The child relied on his mother", the verb {iEotamada "to rely" has the argument structure {iEotamada<(↑SUBJ)(↑OBL_EalaY)>.
By including a subject and an oblique with the preposition EalaY, we ensure that the verb's subcategorization requirements are met and that the sentence is well-formed, or syntactically valid.

3.2 Extracting Subcategorization Frames from the Arabic Treebank

8 http://verbs.colorado.edu/propbank/framesets-arabic
9 http://arasubcats-lfg.sourceforge.net

We follow the successful model of previous subcategorization extraction efforts for other languages, including English (O'Donovan et al., 2005) and German (Rehbein and van Genabith, 2009), taking into consideration the specifics of the Arabic language and the resources available for evaluation. We automatically extract Arabic syntactic-function-based subcategorization frames by utilizing an automatic Lexical-Functional Grammar (LFG) f-structure annotation algorithm for Arabic developed by Tounsi et al. (2009). The syntactic annotations in the ATB provide explicit information on deep representation in the phrase structure, such as traces in the case of pro-dropped arguments, which helped make the automatic extraction of subcategorization frames complete. After we extract the surface forms, we lemmatize all forms by re-analysing all the words using the Buckwalter morphology and then choosing the analysis where the word diacritization and the tag set in the ATB match those in the Buckwalter analysis. We provide information on the prepositions of obliques, distinguish between active and passive frames, and provide a probability score for each frame and a frequency count for each lemma. We extract 240 frame types for 3,295 lemma types, with 7,746 lemma-frame types (for verbs, nouns and adjectives), averaging 2.35 frames per lemma. We make this resource available under the open-source license GPLv3.10 Table 5 shows the list of grammatical functions included in our frames, with examples.
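The core of frame extraction, collecting the governable functions attached to a predicate and recording the preposition of obliques, can be illustrated with a toy sketch. This is not the authors' pipeline (which operates on full ATB f-structure annotations); the function names, the annotation format, and the output frame notation are all hypothetical.

```python
# Illustrative sketch: build a syntactic-function-based frame for a
# predicate from (function, value) annotations, keeping only governable
# functions and recording the governing preposition of obliques.
GOVERNABLE = {"subj", "obj", "obj2", "obl", "comp", "xcomp"}

def extract_frame(pred: str, annotations: list) -> str:
    """Return a frame string such as pred<[subj,obl:EalaY]>."""
    args = []
    for gf, value in annotations:
        if gf in GOVERNABLE:
            # for obliques, the preposition lemma is part of the frame
            args.append(f"obl:{value}" if gf == "obl" else gf)
    return f"{pred}<[{','.join(args)}]>"

# {iEotamada 'rely' with a subject and an oblique headed by EalaY;
# the adjunct is non-governable and is excluded from the frame.
frame = extract_frame(
    "{iEotamada",
    [("subj", "Al-Tifolu"), ("obl", "EalaY"), ("adjunct", "Al-yawma")],
)
print(frame)  # {iEotamada<[subj,obl:EalaY]>
```

Excluding adjuncts while keeping oblique prepositions is exactly the distinction that makes the resulting frames usable for parsing: the parser can check an argument list without being misled by optional modifiers.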
We compare and evaluate the complete set of extracted subcategorization frames against the manually developed subcategorization frames in the Arabic LFG Parser. Our extraction algorithm deals with the passive voice and its effect on subcategorization behaviour. We find that in Arabic the passive forms stand at 12% of the active forms, compared to 31% in English (O'Donovan et al., 2005), as shown in Table 4. Our explanation of the low frequency of the passive in Arabic is that there is a tendency to avoid passive verb forms when active readings are also possible, in order to avoid ambiguity and improve readability. For example, the verb form nZm "organize" can have two readings, one active and one passive, depending on diacritization, or how the word is pronounced. Therefore, instead of the ambiguous passive form, the alternative syntactic construction tam~a "performed/done" + verbal noun is used, giving tam~a tanZiymuhu "lit. organizing it has been done / it was organized". One piece of evidence for the validity of this explanation is that tam~a is the seventh most frequent verb in the ATB, following kAn "be", qAla "say", >aEolana "declare", >ak~ada "confirm", >aDAfa "add" and {iEotabar "consider".

10 http://arabicsubcats.sourceforge.net

                      Active   Passive   Passive %
Arabic verb frames    5,915    681       12
English verb frames   16,000   5,005     31

Table 4: Comparing active and passive subcategorization frames in Arabic and English

No | Tag | Treebank tag | Source | Meaning | Example
1 | subj | -SBJ | L-T | subject | jaA'a Al-waqt, lit. came the time, "The time came"
2 | obj | -OBJ | L-T | object | Earaftu Al-Tariyq "I knew the way"
3 | obj2 | -DTV/-BNF | L-T | secondary object | >aEoTA-hu TaEAmA "gave him food"
4 | obl | -CLR | L-T | oblique | {iEotamad EalaY wAlidi-hi "relied on his father"
5 | obl2 | | L | secondary oblique | tanAfasa maEa-hu fi Al-sibAq "competed with him in the race"
6 | obl-betweenAnd | | L | oblique for between ... and | tanaq~ala bayona Al-EirAq wa-Al-kwiyt "moved between Iraq and Kuwait"
7 | obl-fromTo | | L | oblique for from ... to | sAfara min Al-EirAq [...]
8-9 | [entries illegible in the source; surviving example: >amkana-hu >an yarAhaA]
10 | compH | | L | heavy complementizer >an~a | >a*aEa >an~a-hum harabuwA "announced that they escaped"
11 | vcomp | | L | verb complement | bada>a yasoquT, lit. started fall, "started to fall"
12 | xcomp | | L | obligatory control | >arAda >an yusAfir "wanted to travel"
13 | xcomp-pred | -PRD | T | copular complement | kAna mariDA "was sick"
14 | xcomp-verb | (VP) | T | verb complement | same as 11
15 | comp-sbar | (SBAR) | T | complement with complementizer | same as 9 and 10
16 | comp-nom | (S-NOM) | T | gerund (masdar) complement | nafaY Eilma-hu bi-Al-wAqiEap "denied knowing the incident"
17 | comp-s | (S) | T | sentential complement | qAla lAbud~a min Al-ta$Awur "he said there must be negotiations"

L: LFG Parser, T: Treebank
Table 5: List of Arabic subcategorization frames suffixed with phrase structure information

3.3 Estimating the Subcategorization Probability

In order to estimate the likelihood of the occurrence of a certain argument list with a predicate (or lemma), we compute the conditional probability of subcategorization frames based on the number of token occurrences in the ATB, according to the following formula (O'Donovan et al., 2005):

    P(ArgList_i | π) = count(π, ArgList_i) / Σ_{j=1..n} count(π, ArgList_j)

where ArgList_1 ... ArgList_n are all the possible argument lists that co-occur with the predicate π. Because of the variations in verbal subcategorization, probabilities are useful for discriminating prominent frames from accidental ones. An example is shown in Table 6 for the verb $Ahada "watch", which has a frequency of 40 occurrences in the ATB.
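The conditional-probability formula in Section 3.3 is a simple relative-frequency estimate over the frames observed with a lemma. The sketch below reproduces it; the token counts for $Ahad are reconstructed from the probabilities in Table 6 (probability times the 40 attested occurrences), not taken directly from the ATB.

```python
from collections import Counter

# P(ArgList_i | lemma) = count(lemma, ArgList_i) / sum_j count(lemma, ArgList_j)
def frame_probabilities(frame_counts: Counter) -> dict:
    """Relative-frequency estimate of each argument list given the lemma."""
    total = sum(frame_counts.values())
    return {frame: count / total for frame, count in frame_counts.items()}

# Hypothetical token counts for $Ahad 'watch' (40 ATB occurrences),
# reconstructed so that the probabilities match Table 6.
counts = Counter({
    "[subj,obj]": 32,
    "[subj,passive]": 4,
    "[subj,obj,comp-sbar]": 2,
    "[subj,obj,comp-s]": 1,
    "[subj]": 1,
})
probs = frame_probabilities(counts)
print(probs["[subj,obj]"])       # 0.8
print(probs["[subj,passive]"])   # 0.1
```

With only 40 tokens, the single-occurrence frames ([subj,obj,comp-s] and [subj]) receive probability 0.025 each, which is why a probability threshold is a natural way to separate prominent frames from accidental ones.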
Lemma with argument list          Conditional probability
$Ahad_1([subj,obj,comp-s])        0.0250
$Ahad_1([subj,obj,comp-sbar])     0.0500
$Ahad_1([subj,passive])           0.1000
$Ahad_1([subj,obj])               0.8000
$Ahad_1([subj])                   0.0250

Table 6: Subcategorization frames with probabilities

3.4 Evaluating the Subcategorization Frames

We compare our resource of subcategorization frames against a manually created subcategorization lexicon used in a rule-based LFG parser. The Arabic LFG Parser has detailed subcategorization information for lexical entries, including the preposition of obliques, control relationships (or XCOMPs), and the type of complementizer in verbs that take complements. The number of subcategorization frames collected in the ATB-induced resource is comparable to the manually constructed frames in the Arabic LFG Parser for nouns and adjectives, but it is almost four times larger for verbs, as shown in Table 7. Figure 2 compares the size of the two resources in proportional intersecting circles. The circles on the left represent the treebank-induced resource, and the circles on the right represent the manually constructed resource.

                                       Verbs   Nouns   Adjectives
Lemma-subcat pairs in the ATB          6,596     855          295
Lemma-subcat pairs in the LFG Parser   1,621     991          289
Common lemmas                          1,447     268           70

Table 7: Number of subcat frames in the ATB and the Arabic LFG Parser

Figure 2: Intersecting circles of ATB subcategorization frames (left) and the LFG Parser (right)

We compare the subcategorization frames in terms of precision, defined here as the number of exact matches of the argument list divided by the number of common lemmas. Table 8 shows the results of matching on all grammatical functions and on selected grammatical relations. We conduct the evaluation experiment at four levels: (1) we match the full argument list between the two data sets, (2) we remove the value of the preposition in obliques, (3) we also remove COMPs and XCOMPs, and (4) we leave only SUBJs, OBJs and OBJ2s.
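The four matching levels can be sketched as follows; the frame representation (lists of grammatical-function labels, with an oblique's preposition attached after a colon, e.g. 'obl:EalaY') and the toy lexicons are assumptions for illustration, not the paper's actual encoding:

```python
def normalise(frame, level):
    """Reduce a frame to the granularity of one of the four evaluation levels."""
    out = []
    for gf in frame:
        base = gf.split(":")[0]  # drop the preposition value, e.g. obl:EalaY -> obl
        if level == 1:
            out.append(gf)       # (1) full argument list, prepositions included
        elif level == 2:
            out.append(base)     # (2) preposition values removed
        elif level == 3 and not base.startswith(("comp", "xcomp")):
            out.append(base)     # (3) COMPs and XCOMPs also removed
        elif level == 4 and base in ("subj", "obj", "obj2"):
            out.append(base)     # (4) transitivity only: SUBJ, OBJ, OBJ2
    return tuple(out)

def precision(induced, gold, level):
    # Exact matches of the (normalised) argument list / number of common lemmas.
    common = induced.keys() & gold.keys()
    hits = sum(normalise(induced[l], level) == normalise(gold[l], level)
               for l in common)
    return hits / len(common)

# Two toy lexicons sharing two lemmas (hypothetical entries).
induced = {"{iEotamad": ["subj", "obl:EalaY"], "qAl": ["subj", "comp-s"]}
gold = {"{iEotamad": ["subj", "obl:bi"], "qAl": ["subj", "comp-sbar"],
        "kAn": ["subj"]}
```

On this toy pair, precision rises monotonically from level 1 to level 4 as the matching criterion is relaxed, mirroring the pattern in Table 8.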
Number (4) denotes transitivity, i.e. the most important type of argument. The smaller the number, the less important we consider the argument type to be.

   Precision                             Verbs   Nouns   Adjectives
1  Full argument list                     0.78    0.50         0.53
2  Without preps                          0.82    0.52         0.66
3  Without preps, comps and xcomps        0.84    0.54         0.67
4  Without obls, comps and xcomps         0.97    0.73         0.86

Table 8: Evaluating the treebank-induced subcategorization frames against the resource in the Arabic LFG Parser

Table 8 shows that, at level 4, there is a high level of agreement between the two resources. At level 1, although precision is comparatively low for nouns and adjectives, it is high for verbs, which constitute the largest portion of the data and the most important type of predicates when dealing with subcategorization frames.

4. AraComLex Lexical Management Application

In order to manage our lexical database, we have developed the AraComLex (Arabic Computer Lexicon) authoring system, which provides a graphical user interface for human lexicographers to review, modify and update the automatically derived lexical and morphological information. We use AraComLex for storing the lexical resources mentioned in this paper as well as for generating data for other purposes, such as frequency counts and data for extending our morphological transducer. The data used in AraComLex is stored in a relational database, with the various tables connected together as shown in Figure 3, which presents an entity-relationship diagram (Chen, 1976) of the database. In this diagram, entities are drawn as rectangles and relationships as diamonds. Relationships connect pairs of entities with given cardinality constraints (represented as numbers surrounding the relationship).
Three types of cardinality constraints are used in the diagram: 0 (entries in the entity are not required to take part in the relationship), 1 (each entry takes part in exactly one relationship) and n (entries can take part in an arbitrary number of relationships). Entities correspond to tables in the database, while relationships model the relations between the tables.

Figure 3: Entity-relationship diagram of AraComLex

AraComLex lists all the relevant morphological and morpho-syntactic features for each lemma. We use finite sets of values implemented as drop-down menus to allow lexicographers to edit entries while ensuring consistency, as shown in Figure 4. Two of the innovative features added are the "human" feature and the 13 continuation classes, which stand for the inflection grid, or all possible inflection paths, for nominals, as shown in Table 2 above. Statistics show the total frequency of the lemma in the corpus and the weights of each morpho-syntactic feature.

Lemma Index ID: 16935
id        full_form   stem          freq
1053310   EAml        EAmil         45134
355702    AlEAmlyn    EAmil+iyna    34447
1255209   AlEAml      EAmil         19995
1082157   EAmlA       EAmil         18194
1284539   AlEAmlyn    EAmil+ayoni    6541

Figure 5: Lemma Stems

The lexicographer can go even further by reviewing the examples in which the words occurred, sorted according to frequency, as shown in Figure 6. For practical reasons, and to keep the size of the database within reasonable bounds, we only keep records of the word's bigrams, which in most cases are enough to provide a glimpse of the context and possible collocates.

Figure 6: Word Examples
[Figure 4 screenshot: the nominal entry for the lemma EAmil "worker", with root Eml, template @A@i@, part of speech noun, human: yes, continuation class FemMascduFemduMascplFempl, and corpus statistics (lemma_freq: 160490, masc_sg: 90295, masc_dl: 12068, masc_pl: 55901, fem_sg: 204, fem_dl: 127, fem_pl: 1895).]

Figure 4: A nominal entry in AraComLex

Figure 4 shows the features specified for nominal lemmas in AraComLex. The "lemma_morph" feature can be either 'masc' or 'fem' for nouns, and can also be 'unspec' (unspecified) for adjectives. Following SAMA, "partOfSpeech" can be 'noun', 'noun_prop', 'noun_quant', 'noun_num', 'adj', 'adj_comp' or 'adj_num'. For lexicographic purposes, a lexicographer can review the lemma in detail by looking into the stems and full forms, as shown in Figure 5. For verb lemmas, as shown in Figure 7, we provide information on whether the verb is transitive or intransitive and whether it allows passive and imperative inflection, as well as the usual information on the template and the root. One of the features that can be highly valuable for a lexicographer is the link to subcategorization frames.

Figure 7: A verb entry in AraComLex

The subcategorization frames, as shown in Figure 8, are sorted by probability, ensuring that the more frequent subcategorization frames appear at the top. As the figure shows, information on passive occurrences and prepositions for obliques is also included.
Lemma ID: >avobat_1
id      lemma_id    subcats                      prob     freq
11105   >avobat_1   subj,comp-sbar               0.4839   62
11110   >avobat_1   subj,obj                     0.3710   62
11113   >avobat_1   subj                         0.0484   62
11112   >avobat_1   subj,passive                 0.0323   62
11107   >avobat_1   subj,obj,comp-sbar           0.0161   62
11108   >avobat_1   subj,obj,obl-clr@li          0.0161   62
11109   >avobat_1   subj,passive                 0.0161   62
11111   >avobat_1   subj,obl-clr@li,comp-sbar    0.0161   62

Figure 8: Verb Subcategorization Frames

5. Conclusion

We build a lexicon for MSA, focusing on the problem that existing lexical resources tend to include a large subset of obsolete lexical entries no longer attested in contemporary data, and do not contain sufficient syntactic information. We start with a manually constructed lexicon of 10,799 MSA lemmas and automatically extend it using lexical entries from SAMA's lexical database, carefully excluding obsolete entries and analyses. We use machine learning on statistics derived from a large pre-annotated corpus to automatically predict inflectional paradigms, successfully extending the lexicon to 30,587 lemmas. We also provide essential lexicographic information by automatically building a lexicon of subcategorization frames from the ATB. We develop a lexicon authoring system, AraComLex,11 to aid the manual revision of the lexical database by lexicographers. As an output of this project, we create and distribute an open-source finite-state morphological transducer.12 We also distribute a number of open-source resources that are of essential importance for lexicographic work, including a list of Arabic morphological patterns,13 subcategorization frames14 and Arabic lemma frequency counts.15

6. Acknowledgements

This research is funded by Enterprise Ireland (PC/09/037), the Irish Research Council for Science, Engineering and Technology (IRCSET), and the EU projects PANACEA (7FP-ITC-248064) and META-NET (FP7-ICT-249119).

7.
References

Al-Sulaiti, L., Atwell, E. (2006). The design of a corpus of contemporary Arabic. International Journal of Corpus Linguistics, 11(1), pp. 135-171.
Atkins, B.T.S., Rundell, M. (2008). The Oxford Guide to Practical Lexicography. Oxford University Press.
Attia, M., Rashwan, M., Ragheb, A., Al-Badrashiny, M., Al-Basoumy, H. & Abdou, S. (2008). A Compact Arabic Lexical Semantics Language Resource Based on the Theory of Semantic Fields. In B. Nordström, A. Ranta (eds.) GoTAL '08: Proceedings of the 6th International Conference on Advances in Natural Language Processing. Gothenburg, Sweden: Springer-Verlag, Berlin, Heidelberg, pp. 65-76.
Attia, M. (2006). An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic Modelling Finite State Networks. In Challenges of Arabic for NLP/MT Conference, The British Computer Society, London, UK.
Attia, M. (2008). Handling Arabic Morphological and Syntactic Ambiguity within the LFG Framework with a View to Machine Translation. Ph.D. thesis. The University of Manchester, Manchester, UK.
Beesley, K.R., Karttunen, L. (2003). Finite State Morphology. CSLI Studies in Computational Linguistics. Stanford, California: CSLI.
Bin-Muqbil, M. (2006). Phonetic and Phonological Aspects of Arabic Emphatics and Gutturals. Ph.D. thesis, University of Wisconsin, Madison.
Boudelaa, S., Marslen-Wilson, W.D. (2010). Aralex: A lexical database for Modern Standard Arabic. Behavior Research Methods, 42(2), pp. 481-487.
Buckwalter, T. (2004). Buckwalter Arabic Morphological Analyzer (BAMA) Version 2.0. Linguistic Data Consortium (LDC) catalogue number LDC2004L02, ISBN 1-58563-324-0.
Chen, P.P. (1976). The Entity-Relationship Model: Toward a Unified View of Data. ACM Transactions on Database Systems, 1, pp. 9-36.
Dalrymple, M. (2001). Lexical Functional Grammar. Volume 34 of Syntax and Semantics. New York: Academic Press.
Elgibali, A., Badawi, E.M. (1996).
Understanding Arabic: Essays in Contemporary Arabic Linguistics in Honor of El-Said M. Badawi. Egypt: American University in Cairo Press.
Fischer, W. (1997). Classical Arabic. In R. Hetzron (ed.) The Semitic Languages. London: Routledge, pp. 397-405.
Ghazali, S., Braham, A. (2001). Dictionary Definitions and Corpus-Based Evidence in Modern Standard Arabic. In Arabic NLP Workshop at ACL/EACL. Toulouse, France.
Grishman, R., MacLeod, C. & Meyers, A. (1994). COMLEX syntax: Building a computational lexicon. In Proceedings of the 15th International Conference on Computational Linguistics, Kyoto, pp. 268-272.
Hajič, J., Smrž, O., Buckwalter, T. & Jin, H. (2005). Feature-Based Tagger of Approximations of Functional Arabic Morphology. In The 4th Workshop on Treebanks and Linguistic Theories (TLT 2005), Barcelona, Spain.

11 http://www.cngl.ie/aracomlex
12 http://aracomlex.sourceforge.net
13 http://arabicpatterns.sourceforge.net
14 http://arabicsubcats.sourceforge.net
15 http://arabicwordcount.sourceforge.net

Haykin, S. (1998). Neural Networks: A Comprehensive Foundation (2nd ed.). Prentice Hall.
Kiraz, G.A. (2001). Computational Nonlinear Morphology: With Emphasis on Semitic Languages. Cambridge: Cambridge University Press.
Maamouri, M., Graff, D., Bouziri, B., Krouna, S. & Kulick, S. (2010). LDC Standard Arabic Morphological Analyzer (SAMA) v. 3.0. LDC Catalog No. LDC2010L01, ISBN 1-58563-555-3.
Maamouri, M., Bies, A. (2004). Developing an Arabic Treebank: Methods, guidelines, procedures, and tools. In Workshop on Computational Approaches to Arabic Script-based Languages, COLING.
Manning, C. (1993). Automatic acquisition of a large subcategorisation dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pp. 235-242.
O'Donovan, R., Burke, M., Cahill, A., van Genabith, J. & Way, A. (2005). Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II and Penn-III Treebanks.
Computational Linguistics, 31(3), pp. 329-366.
Owens, J. (1997). The Arabic Grammatical Tradition. In R. Hetzron (ed.) The Semitic Languages. London: Routledge, pp. 46-48.
Palmer, M., Bies, A., Babko-Malaya, O., Diab, M., Maamouri, M., Mansouri, A. & Zaghouni, W. (2008). A pilot Arabic Propbank. In Proceedings of LREC, Marrakech, Morocco.
Parker, R., Graff, D., Chen, K., Kong, J. & Maeda, K. (2009). Arabic Gigaword Fourth Edition. LDC Catalog No. LDC2009T30, ISBN 1-58563-532-4.
Rehbein, I., van Genabith, J. (2009). Automatic Acquisition of LFG Resources for German - As Good As It Gets. In Proceedings of the LFG09 Conference. Cambridge, UK.
Roth, R., Rambow, O., Habash, N., Diab, M. & Rudin, C. (2008). Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking. In Proceedings of the Association for Computational Linguistics (ACL), Columbus, Ohio.
Sinclair, J.M. (ed.) (1987). Looking Up: An Account of the COBUILD Project in Lexical Computing. London: Collins.
Tounsi, L., Attia, M. & van Genabith, J. (2009). Automatic Treebank-Based Acquisition of Arabic LFG Dependency Structures. In EACL Workshop on Computational Approaches to Semitic Languages, Athens, Greece.
Van Mol, M. (2000). The development of a new learner's dictionary for Modern Standard Arabic: the linguistic corpus approach. In U. Heid, S. Evert, E. Lehmann & C. Rohrer (eds.) Proceedings of the Ninth EURALEX International Congress, Stuttgart, pp. 831-836.
Van Mol, M. (2003). Variation in Modern Standard Arabic in Radio News Broadcasts: A Synchronic Descriptive Investigation in the Use of Complementary Particles. Leuven, OLA 117.
Watson, J. (2002). The Phonology and Morphology of Arabic. New York: Oxford University Press.
Wehr, H., Cowan, J.M. (1976). Dictionary of Modern Written Arabic, pp. VII-XV. Ithaca, New York: Spoken Language Services.
A model for integrated dictionaries of fixed expressions

Henning Bergenholtz(1,2,3), Theo Bothma(2) and Rufus Gouws(3)
(1) Aarhus University, Center for Lexicography, Fuglesangs Allé 4, 8210 Aarhus V, Denmark
(2) University of Pretoria, Department of Information Science, Private Bag X20, Hatfield, 0028, South Africa
(3) Stellenbosch University, Department of Afrikaans and Dutch, Private Bag X1, Matieland, 7602, South Africa
E-mail: hb@asb.dk, tbothma@up.ac.za, rhg@sun.ac.za

Abstract

This paper discusses a project for the creation of a theoretical model for integrated e-dictionaries, illustrated by means of an e-information tool for the presentation and treatment of fixed expressions, using Afrikaans as the example language. To achieve this, a database of fixed expressions is compiled in which data are treated in such a way that access can be provided through a variety of dictionaries for specific situations, based on specific lexicographic functions, e.g. the cognitive function as well as the communicative functions of text reception and text production. From one database, the user will have access to six monofunctional dictionaries of fixed expressions. Each of these dictionaries provides a view on selected fields of the database, i.e. a search is carried out on selected fields in the database and only the data in the specific fields that are relevant for the specific dictionary are displayed. There are unique user needs that may not necessarily be satisfied by means of these six dictionaries. Individualised search facilities will therefore be provided to enable a user to retrieve data from a single data field or a user-specified selection of data fields. Phase two will provide the option of setting up a user profile, an extension of data fields and linking to external data sources.
The result of the project will therefore be a comprehensive database of Afrikaans fixed expressions that may be accessed through six monofunctional dictionaries, as well as individualised search options, user profiling and the possibility of displaying additional data on demand.

Keywords: fixed expressions; databases; integrated dictionaries; monofunctional and polyfunctional dictionaries

1. A purportedly user-friendly e-idiom dictionary

For many centuries lexicographers have proudly claimed that their own dictionary in particular was user-friendly and satisfied the needs of all users. This was, and still is, in most cases an immunising and self-serving assertion. It is based on real, factual research only to a limited extent, and at the same time it is an advertising measure to persuade potential dictionary buyers. However, one thing has changed. Up to some 30 years ago, dictionary user research was de facto nonexistent. This was formulated quite succinctly by Wiegand, who referred to the "known unknown" that needed to be researched (Wiegand, 1977: 59). Apparently, much research has indeed been done. Numerous surveys of all kinds have been conducted, but much of this assumed the form of memory-based questions such as: "How often do you use a dictionary? Daily? Weekly? Monthly? Rarely?"; "What kind of information do you look for? Grammatical? Orthographic? Semantic?"; "What kind of entries do you look for? Collocations? Examples? Items about style?" These days it is hardly possible any more to read and digest all the contributions made to and by user studies. In our view this is not worthwhile either, as such surveys mostly ask questions which are constructed instead of being rooted in reality. The research should be conducted on real users with their real and specific needs and on their actual use of dictionaries, but in most cases it is not. A user with a cognition-related information need may be looking explicitly for certain types of data (examples, for instance).
That is not what users with a need for communication-related information do. They have a problem concerning text reception, text production or translation and hope to get the necessary help in this regard. Ordinary users do not know exactly what lexicographers (and linguists in particular) call such items. We therefore hold the view that lexicographers should not act as dictionary philologists or interpreters of what users remember about their use of dictionaries, but should instead develop new concepts on the basis of theoretical considerations concerning the needs of certain types of users in foreseen situations. It is not the actual user that matters, but the potential user and his/her potential needs in situations anticipated by the lexicographer. For these needs the lexicographer develops a (new) tool which he/she assumes can satisfy the needs he/she foresees. Within the function theory such dictionaries are typically monofunctional, i.e. they address a specific need of a specific user group in a specific situation (see, for example, Bergenholtz, 2010; Bergenholtz, 2011; Bergenholtz & Bergenholtz, 2011; Tarp, 2007; Tarp, 2008; Tarp, 2009a; Tarp, 2009b; Tarp, 2011). Yet a majority of practical and theoretical lexicographers assume that all dictionaries should always provide as much data as possible for the identified user groups, and therefore always were, and should continue to be, polyfunctional (see Bergenholtz, 2010). The above introduction reflects the arguments which have frequently been put forward in lexicographic discussions in recent years. Tarp (2002) proposed the following basic division of lexicography into two types: In contemplative lexicography, existing dictionaries are analysed and users are questioned about their use of existing dictionaries to date.
In transformative lexicography, theoretical analyses of the potential user situations, the respective user conditions and the user needs are used to develop new concepts for compiling new dictionaries, typically monofunctional ones. On the basis of theoretical analyses the lexicographer therefore decides what the characteristics are of the monofunctional dictionaries that will satisfy specific user needs. In the case of the Centlex dictionaries developed at the University of Aarhus, no general surveys on the use of these dictionaries are undertaken, but feedback in the form of e-mails is analysed and taken into consideration. Moreover, log file analyses are done which, in selected cases, are linked to enquiries among a handful of users (see Bergenholtz & Johnsen, 2005; Bergenholtz & Johnsen, 2007). Such log file analyses and feedback can lead to small changes, but also to a complete redesign of the dictionary, as was the case with the e-idiom dictionary by Vrang, Bergenholtz & Lund (2003-2005) (Den danske Idiomordbog). This was a dictionary of idioms containing the relatively large number of 8,500 idioms. It had been designed especially as a reception dictionary, as it contained only meaning items. In the user guide and in the outer text on the structure of the dictionary, the meaning of 'idiom' was explained clearly. The authors received a fair number of e-mails from users with feedback on this dictionary. None of these mails asked what actually constitutes an idiom. The authors were, however, quite frequently asked why this or that combination of words could not be found in Den danske Idiomordbog. The typical answer to this question was that the expression in question is a proverb, not an idiom, and is therefore not in this dictionary. This happened regardless of the fact that the terms were clearly defined in the user guide. During the period from mid-2003 until mid-2004 the number of unsuccessful dictionary searches was relatively high.
(Misspelled search terms are included here, but amounted to fewer than 3% of searches; searches for unlemmatised idioms are also included, but amounted to less than 1% of searches.)

Number of searches in Den danske Idiomordbog
With result       70.4%
Without result    29.6%

Table 1: Percentage of successful and unsuccessful searches in Den danske Idiomordbog, 2003-2004

There are two sides to the bare figures for successful and unsuccessful uses of Den danske Idiomordbog. The positive side is that users found the idiom they were looking for in more than 70% of all enquiries. The negative side is that in about 26% of all enquiries (cases with incorrect spelling and deficient lemmas have been deducted from the 29.6%) users were looking for 'idioms' which are not idioms but proverbs, sayings ('winged words'), standard formulations, multiword expressions from other languages and many more. We do not believe that another definition of idioms would have improved this rate. For an internet dictionary of idioms there is, however, an obvious solution: do not make one at all; make one that contains all forms of fixed expressions instead. Moreover, the user with a reception problem does not even need to know what kind of fixed expression he/she is dealing with; he/she needs only the meaning. This insight led to a new concept for a new Danish database of fixed expressions from which several monofunctional dictionaries are offered to users (see Bergenholtz, 2011). It also led to the decision to compile a database of fixed expressions in Afrikaans, rather than a database of idioms. The concept for the Danish database with its several monofunctional dictionaries is the point of departure for the concept of the Afrikaans database presented here. The new database differs from the previous concept in some respects, however, especially as regards the number of fields for item types. The Danish database has 14 fields; the new database has 36.
Also, this is a database for two languages, viz. Afrikaans and English, not for one. Nevertheless, the basic concept remains intact. A database and a dictionary are not the same thing. A single dictionary can be extracted from a database, and the result will normally be a polyfunctional dictionary. From a database, on the other hand, as many dictionaries can be extracted as are deemed relevant on the basis of theoretical considerations and experience with earlier databases, and these should be function-oriented monofunctional dictionaries (see Bergenholtz & Tarp, 2002; Bergenholtz & Tarp, 2003; Bergenholtz & Tarp, 2005).

2. Afrikaans dictionaries with fixed expressions

Afrikaans dictionaries represent a wide-ranging typological variety, compiled to assist different users in finding help with regard to both language for general purposes and language for special purposes. Within the category of general dictionaries, various monolingual and bilingual dictionaries offer an extensive presentation of fixed expressions. The category of restricted dictionaries also includes a few dictionaries that focus on fixed expressions, cf. Malherbe (1924), De Villiers & Gouws (1988), Botha, Kroes & Winckler (1994), Prinsloo (1997) and Prinsloo (2009). The extent and nature of both the macrostructural coverage and the treatment of fixed expressions in these dictionaries differ considerably. They share one feature: they have all been produced in printed format. The only non-printed presentation of Afrikaans fixed expressions can be found in the treatment of this category of lexical items in those general monolingual and bilingual dictionaries that are available on CD-ROM or online. Afrikaans has a real need for an e-dictionary of fixed expressions.
The advantage of the fact that all dictionaries of fixed expressions are only available in printed format is that no bad e-dictionary exists, and a transformative approach to the planning and compilation of an e-dictionary of fixed expressions does not have to pay attention to any electronic predecessor. In the following sections the plan for an innovative e-dictionary that deals with Afrikaans fixed expressions is discussed.

3. Concept for a database system for Afrikaans fixed expressions

The database system consists of the database itself and the database management system. The database is developed in MySQL. It is integrated in a database management system that has been developed using open source software (HTML, XML, XSLT, Perl, CGI and related technologies). The database management system has a comprehensive administrative back-end which manages access, data security and integrity, including aspects such as version control and back-up. The system has two further interfaces, viz. an interface for the researchers contributing the data for the different fields in the database, and an interface for end-users through which they obtain access to the dictionaries and other customisation functionalities. In principle, hundreds or even thousands of fields related to one or more phenomena could be provided in a database with one or more languages. For instance, there is a total of 84 fields in a Danish, English and Spanish accounting database, from which 23 different dictionaries are offered to users at present (see Bergenholtz, 2011). In this case, the database provides only for fields whose data - apart from the lexicographers' notes on the work in progress - also find their way into one or more dictionaries.
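The central design idea, one database with each monofunctional dictionary defined as a view on selected fields, can be sketched in miniature as follows. The sketch uses Python's built-in sqlite3 in place of the MySQL back-end, a handful of field names modelled on the project's fields, and placeholder record content; none of this reflects the project's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fixed_expression (
    id INTEGER PRIMARY KEY,
    core_field TEXT,        -- field 1
    meaning_afr TEXT,       -- field 2
    expression_afr TEXT,    -- field 12
    style TEXT,             -- field 17
    example_sentence TEXT,  -- field 25
    keywords TEXT           -- field 35
);
-- A reception dictionary exposes only the fields needed to solve a
-- reception problem; other dictionaries would simply be further views.
CREATE VIEW meaning_of_fixed_expressions AS
    SELECT core_field, meaning_afr, expression_afr FROM fixed_expression;
""")
con.execute(
    "INSERT INTO fixed_expression VALUES (1, ?, ?, ?, ?, ?, ?)",
    ("core words", "the meaning", "the full fixed expression",
     "informal", "An example sentence.", "core words expression"),
)
rows = con.execute(
    "SELECT meaning_afr FROM meaning_of_fixed_expressions "
    "WHERE expression_afr LIKE ?", ("%fixed%",)
).fetchall()
```

A query against the view returns only the reception-relevant fields; adding a further dictionary then amounts to defining another view on the same data, not duplicating it.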
If one or more collaborating lexicographers have data at their disposal which are not intended to be presented in at least one dictionary, specific fields could be created for such data so that they could perhaps be accommodated in one or more additional dictionaries. The limit must be drawn where the number of fields becomes so large that the lexicographer loses sight of the big picture and the first presentation of the database takes too long. The order of the fields in the "Field" column in Table 2 is a working order; the order in the individual dictionaries is determined for each respective dictionary.

Field
1. Core field
2. Meaning in Afrikaans
3. Internet link to meaning
4. Further meaning item in Afrikaans
5. Meaning in English
6. Grammar
7. Comment on grammar
8. Internet link to grammar
9. Background remark(s)
10. Comment on background remark(s)
11. Internet link to background remark(s)
12. Fixed expression(s) in Afrikaans
13. Remarks on the fixed expression(s)
14. References to fixed expression(s)
15. Internet link to variants, e.g. statistical
16. Fixed expression(s) in English translated from Afrikaans
17. Style
18. Comment on style
19. Internet link to style
20. Classification of the fixed expression
21. Comment on classification
22. Collocation(s)
23. Comment on collocation(s)
24. Internet link to collocation(s)
25. Example(s)
26. Comment on example(s)
27. Internet link to example(s)
28. Synonym(s)
29. Comment on synonym(s)
30. Internet link to synonym(s)
31. Antonym(s)
32. Comment on antonym(s)
33. Internet link to antonym(s)
34. Associated concept(s)
35. Key word(s)
36. Memo field

Table 2: Data fields for the database of fixed expressions in Afrikaans

There is not enough space here to justify all the fields. However, some fields which are not self-explanatory do require some explanation. Field 12 contains only one expression in some cases, but in others there are more if variants exist, e.g.
the same core of a fixed expression combined with different verbs. If one wants to, one can call this a lemma field. We do not; that label would rather fit Field 1, the core field, which is identical to Field 12 if there are no variants, and contains only the words shared by all the variants in Field 12 if there are variants. This field is used for automatic searches on the one hand, and on the other for items the user can use as links if search results are displayed as a list, or if synonyms or antonyms are provided. Field 35 contains key words covering all the lexical words, including irregularly conjugated forms, which occur in the fixed expression(s) of a particular card. In Field 22, we use the term 'collocations' in the sense of combinations of words in which the fixed expression occurs. A collocation is never a complete sentence, unlike the data in Field 25, where 'example' refers to a full sentence. Field 9 contains a brief history behind the full expression; if there are two different histories and it is not clear which one is correct, both are given. In addition, in some cases reference is made to background histories which are given in various textbooks and dictionaries (but which are not necessarily correct). Lastly, Field 34 contains associated concept(s). These are concepts which can be associated with the meaning and use of the fixed expression. Finding such concepts could be very time-consuming if some sort of semantic system were applied. That is not how it is done here. In fact, the editor's lexicographic instruction is to write down up to five such associated concepts within 30 seconds (but never more).

4. Six dictionaries with fixed expressions

At present the concept provides for six dictionaries - five monofunctional and one polyfunctional.
It is a model that can be used not only for these languages and this language combination, but in principle also for at least all Indo-European languages, and probably also for other language families, for example the Austronesian languages.

4.1 Meaning of Fixed Expressions

Access to the first dictionary is gained by pressing the button "I am reading a text, but do not understand the meaning of a fixed expression". Here the user enters an expression or part of an expression in the search field and obtains the desired information, i.e. the meaning of the fixed expression. This dictionary is called Meaning of Fixed Expressions. When a search is done in this dictionary, the program looks in two of the fields in the database in the order indicated by the figures1 (see column 1 in Table 3 below). For this dictionary a maximising search is done. The user obtains one or several articles with the content of three of the fields in the database (see column 3). In other words, the user receives only a small part of the database as an article, but it is exactly the part that is needed to solve a reception problem. If more than 10 articles are found, they are displayed as a list showing the content of the core field and the first line of the meaning. The following tables show only those fields which are used for a search and those fields from which data are presented in the dictionary article of the specific dictionary. Because of space limitations in the headings of the tables, "Search" is used as an indication of the fields that are searched, and the numbers indicate the order in which the search is carried out. Similarly, "Entry" refers to the fields that are shown to the user, and the numbers indicate the ordering sequence of the fields in the specific dictionary article. "List" is used as an indication of which data are displayed when a list is needed (i.e. when more than 10 articles are found).
1 In a maximising search the order does not really matter, as all subresults for each individual search field are added up in the overall result. In a minimising search this is different: the search ends after one field has been searched if one or more results are found, and the remaining fields are therefore not searched.

Search  Field                                   Entry  List
1       1. Core field                                  1
        2. Meaning in Afrikaans                 1      2 (1st line)
        5. Further meaning item in Afrikaans    2
2       12. Fixed expression(s) in Afrikaans    3
3       35. Key word(s)

Table 3: Search and data fields in the dictionary Meaning of Fixed Expressions

4.2 Use of Fixed Expressions

The second dictionary is activated by pressing the button "I am writing a text with a specific fixed expression". Here the user enters a fixed expression or part of it in the search field and obtains information about the use of the fixed expression, including its meaning, grammar, collocations, example sentences and synonymous or antonymous fixed expressions. In other words, the search is expression-specific. We call this dictionary Use of Fixed Expressions. When this dictionary is activated, four fields of the database are searched; however, this is a minimising search, where the search is terminated after one field type has been searched and other fields are therefore not searched. The items relevant to text production are reflected as figures in column 3; if there are more than 10 articles, a list is shown.

Search  Field                                   Entry  List
1       1. Core field                                  1
        2. Meaning in Afrikaans                 3      2 (1st line)
        3. Internet link to meaning             5
        4. Further meaning item in Afrikaans    4
        6. Grammar                              8
        7. Comment on grammar                   9
2       12. Fixed expression(s) in Afrikaans    1
        13. Remarks on the fixed expression(s)  2
        17. Style                               6
        18. Comment on style                    7
4       22. Collocation(s)                      10
        23. Comment on collocation              11
5       25. Example(s)                          12
        26. Comment on examples                 13
        28. Synonym(s)                          14
        29. Comment on synonyms                 15
        31. Antonym(s)                          16
        32. Comment on antonyms                 17
3       35. Key word(s)

Table 4: Search and data fields in the dictionary Use of Fixed Expressions

4.3 Fixed Expressions with a Specific Meaning

The third dictionary is activated by pressing the button "I am writing a text and am looking for a fixed expression with a specific meaning". Here the user can enter one or several words with a specific meaning and find expressions with this meaning or part of this meaning. The user then receives information about the use of the expression, including its meaning, grammar, collocations, example sentences and synonymous or antonymous fixed expressions. In other words, the point of departure is a meaning, which can be very wide and can therefore yield many hits. If a more restricted meaning is used as the search string, fewer hits may be found or even none at all. This dictionary is called Fixed Expressions with a Specific Meaning. When a search is done in this dictionary, the program looks in three of the fields in the database, in the case of a maximising search. The data are presented as in the dictionary mentioned above (Use of Fixed Expressions), as the function is the same, i.e. assistance with text production problems.

Search  Field                                   Entry  List
        1. Core field                                  1
1       2. Meaning in Afrikaans                 3      2 (1st line)
        3. Internet link to meaning             5
2       4. Further meaning item in Afrikaans    4
        6. Grammar                              8
        7. Comment on grammar                   9
        12. Fixed expression(s) in Afrikaans    1
        13. Remarks on the fixed expression(s)  2
        17. Style                               6
        18. Comment on style                    7
        22. Collocation(s)                      10
        23. Comment on collocation              11
        25. Example(s)                          12
        26. Comment on examples                 13
        28. Synonym(s)                          14
        29. Comment on synonyms                 15
        31. Antonym(s)                          16
        32. Comment on antonyms                 17
3       34. Associated concept(s)

Table 5: Search and data fields in the dictionary Fixed Expressions with a Specific Meaning

One can then click on the core expression to get to the dictionary article which gives a meaning that fits the context.
An article will be displayed with a set of corresponding data, as was illustrated above for the Use of Fixed Expressions dictionary. Although the data presentation of the two dictionaries is identical, the dictionaries are not. In the dictionary Fixed Expressions with a Specific Meaning access is gained by means of a meaning-oriented search, as in a printed dictionary with a systematic macrostructure and with one or more registers, whereas the dictionary Use of Fixed Expressions corresponds to a dictionary with an alphabetic macrostructure without registers. But the information the user is looking for to assist him/her with the production of a text is the same for both dictionaries.

4.4 Knowledge about Fixed Expressions

For the Danish dictionaries with fixed expressions mentioned above there are four dictionaries, the first three like those presented here and a fourth which shows all fields in the database. Here we found that the two text production dictionaries accounted for only about 9% of all user actions during the second period in Table 6. A comparison with the log file analysis from the earlier period (2007), when there was only one production dictionary, shows that this share is relatively stable. Compared with the polyfunctional dictionary, which shows everything, the reception dictionary showed a substantial shift in the user actions between the two periods, which are presented in Table 6 as absolute figures and as percentages. Feedback from a random selection of users showed that the change is explained by the fact that many users are looking particularly for the historical (= generic) background to the fixed expression and selected the dictionary that displayed all data for this reason. In view of this experience, we therefore offer a separate dictionary that supplies such historical data as well as meaning items.
It is therefore a cognitive dictionary in which a maximising search is performed (left column) and the items of the respective fields are shown in the third column in the order indicated. We call this dictionary Knowledge about Fixed Expressions.

27 February 2007 until 17 December 2007
Understanding a text                       51 242   60.33%
Writing a text                              5 294    6.23%
All data                                   28 405   33.44%

17 December 2007 until 1 December 2008
Understanding a text                      154 239   29.57%
Writing a text with a known expression     19 386    3.72%
Writing a text with a known meaning        27 052    5.19%
All data                                  320 865   61.52%

Table 6: Usage statistics for the Danish dictionaries of fixed expressions

Search  Field                                        Entry  List
1       1. Core field                                       1
        2. Meaning in Afrikaans                             2 (1st line)
        9. Background remark(s)                      6
        10. Comment on background remark(s)          7
        11. Internet link to background remark(s)    8
2       12. Fixed expression(s) in Afrikaans         1
        13. Remark(s) on the fixed expression(s)     2
        14. References to fixed expression(s)        3
        20. Classification of the fixed expression   4
        21. Comment on classification                5
3       35. Key word(s)

Table 7: Search and data fields in the dictionary Knowledge about Fixed Expressions

4.5 Afrikaans-English Dictionary of Fixed Expressions

We call the fifth dictionary the Afrikaans-English Dictionary of Fixed Expressions. It is a communication dictionary with the function of translation. It is not an ideal translation dictionary, however, as no grammatical information on the English equivalents is presented and no translations of collocations or examples are supplied in Afrikaans.

Search  Field                                                         Entry  List
1       1. Core field                                                        1
        2. Meaning in Afrikaans                                       2
        3. Internet link to meaning                                   3
        4. Further meaning item in Afrikaans                          4
        5. Meaning in English                                         6
2       12. Fixed expression(s) in Afrikaans                          1
        16. Fixed expression(s) in English translated from Afrikaans  5
3       35. Key word(s)

Table 8: Search and data fields in the Afrikaans-English Dictionary of Fixed Expressions

4.6 Comprehensive Knowledge about Fixed Expressions

The sixth dictionary is activated by pressing the button "I want to know as much as possible about fixed expressions". We call it Comprehensive Knowledge about Fixed Expressions. It is a traditional polyfunctional dictionary that shows all fields (except for the field for working notes). A minimising search is performed.

Search  Field                                                         Entry  List
1       1. Core field                                                 1      1
        2. Meaning in Afrikaans                                       12
        3. Internet link to meaning                                   13
        4. Further meaning item in Afrikaans                          14
        5. Meaning in English                                         15
        6. Grammar                                                    19
        7. Comment on grammar                                         20
        8. Internet link to grammar                                   21
        9. Background remark(s)                                       16
        10. Comment on background remark(s)                           17
        11. Internet link to background remark(s)                     18
2       12. Fixed expression(s) in Afrikaans                          7
        13. Remark(s) on the fixed expression(s)                      8
        14. References to fixed expression(s)                         9
        15. Internet link to variants, e.g. statistical               10
        16. Fixed expression(s) in English translated from Afrikaans  11
        17. Style                                                     2
        18. Comment on style                                          3
        19. Internet link to style                                    4
        20. Classification of the fixed expression                    5
        21. Comments on classification                                6
        22. Collocation(s)                                            22
        23. Comment on collocations                                   23
        24. Internet link to collocations                             24
        25. Example(s)                                                25
        26. Comment on examples                                       26
        27. Internet link to examples                                 27
        28. Synonym(s)                                                28
        29. Comment on synonyms                                       29
        30. Internet link to synonyms                                 30
        31. Antonym(s)                                                31
        32. Comment on antonyms                                       32
        33. Internet link to antonyms                                 33
        34. Associated concept(s)                                     34
3       35. Key word(s)                                               35
        36. Memo field

Table 9: Search and data fields in the dictionary Comprehensive Knowledge about Fixed Expressions

5. Forthcoming attractions

Ultimately, a database and the dictionaries extracted from it are never finished, as new cards can constantly be added and those that have already been made can also be expanded or corrected. Our aim is to build up a database of 10,000 to 15,000 cards.
However, we will already make the lexicographically recorded expressions available to users once the first 1,000 cards are ready. In the further course of the work we will, as explained with reference to the Danish dictionaries above, amend or add specific / additional data on the basis of log file analyses and user feedback, as well as on the basis of further research on and experimentation with different concepts and tools for manipulating data in the e-environment. Provision has already been made for expansion. The intention is to give users the opportunity to define their profiles, to define their search criteria and to select fields and the order in which they are displayed. For some fields we intend providing the option of displaying more detailed information on request and access to advanced tools. We assume that only a small number of users will make use of these options. Nevertheless, when they do, even more dictionaries will be extracted from one and the same database. It may not be possible to give each of the new, user-defined dictionaries a functional description, as has been done here. However, such options will be "capable of meeting all the users' needs in specific types of situations" (Tarp 2009a: 292) by providing "dynamic articles [...] structured in different ways according to each type of search criteria", "articles that are especially adapted", resulting in "the 'individualization' of the lexical product, adapting to the concrete needs of a concrete user" (Tarp, 2009b: 57-61).

5.1 User profiling

We intend providing users with the possibility to define a user profile at the beginning of a consultation session; see Bothma (2011) for details about user profiling technologies. Users will be able to set up a persistent profile that will remain active across multiple user sessions, but will be able to either reset or change this profile at any stage.
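As a rough illustration of such a persistent profile, the following sketch (all names invented, with a plain dict standing in for real server-side storage) shows a default dictionary choice that survives across sessions until the user resets or changes it:

```python
# Hypothetical sketch: a module-level dict stands in for persistent
# server-side storage; class and method names are invented.

POLYFUNCTIONAL = "Comprehensive Knowledge about Fixed Expressions"

class UserProfile:
    _store = {}  # persists across "sessions" (UserProfile instances)

    def __init__(self, user_id):
        self.user_id = user_id
        # A returning user gets his/her previously saved default back.
        self.default_dictionary = UserProfile._store.get(user_id, POLYFUNCTIONAL)

    def set_default(self, dictionary_name):
        """Change the default dictionary; the choice is persisted."""
        self.default_dictionary = dictionary_name
        UserProfile._store[self.user_id] = dictionary_name

    def reset(self):
        """Drop the saved profile and fall back to the polyfunctional view."""
        UserProfile._store.pop(self.user_id, None)
        self.default_dictionary = POLYFUNCTIONAL
```

A real implementation would of course also carry the personalised search options over into the profile, but the session-spanning behaviour would follow the same pattern.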
Profiles will enable users to define the specific dictionary they intend consulting during a specific interaction session. For example, a user who is reading a text and regularly needs help only with the meaning of fixed expressions may set his/her profile to use the dictionary Meaning of Fixed Expressions as the default dictionary. A user will also be able to set personalised search options (as discussed below) as default.

5.2 Personalised search and display options

The six dictionaries discussed above are six different customised views on the database. Each of these dictionaries is defined in terms of a specific type of user need defined by the lexicographer. Each of the dictionaries is monofunctional in terms of a text reception, text production, text translation or a cognitive information need (in addition to a "traditional" polyfunctional dictionary). It is possible to provide any further number of monofunctional dictionaries in terms of the lexicographer's analysis of perceived user needs. However, it is also possible to provide the user with the option to define his/her own search and therefore define his/her own personalised / customised dictionary. The principles are discussed in Bothma (2011) and Bergenholtz & Bothma (2011). We intend providing such customised advanced search facilities where the user can define exactly which data are to be displayed. The user will be able to display the data of only a single field or any combination of fields to satisfy unique information needs in a given situation.

5.3 Additional fields for more detailed information

Currently we assume that all users require the same amount of detail when accessing a dictionary article by means of any of the six dictionaries and / or the customisation options. However, this is not necessarily the case. Some users may require only a brief description whereas others may require a detailed exposition.
This obviously does not apply to all fields, but could typically apply to, for example, background remarks (fields 9-11) and examples (fields 25-27). A user may require only a few brief comments about the origin and/or history of a fixed expression, or, alternatively, could require a comprehensive exposition on the origins of an expression, alternative views about the origin, a discussion about erroneous or popularly held beliefs about the origins of the expression, etc. The database should make provision to satisfy these individualised user needs as well. The content required for these details can be provided by a member of the lexicographic team (probably a team member who has a background or interest in history, heritage and culture studies) or could be a link to external source(s) where the background of a fixed expression may have been discussed in detail. We intend providing such a facility for expansion. These data can be made accessible on demand, either by means of a "Read more" button when data of fields 9-11 are displayed or by adapting the user profile at the start of the consultation session. The current database structure makes provision for examples with comments about and links to the original contexts of the examples. We provide a highly selective list of examples to illustrate the meaning and use of a specific fixed expression. However, we foresee that in individual cases users may require either more examples or additional detail. For example, in a text production situation, a user writing a historical novel may need to know which of two current variants of a fixed expression was used (or was the more common variant) at the time in which the novel is set. This requires access to data typically not within the database and tools for text manipulation that are not associated with a lexicographical database.
(One of a number of dictionaries that does incorporate such a facility is the Base lexicale du français (BLF) (http://ilt.kuleuven.be/blf), which provides the user with the option of linking to various corpora, including a set of documents of the European Parliament and Wikipedia. The selection of the examples does not require any input from the lexicographer, as the BLF and the corpora are linked automatically. These examples are displayed by the BLF only when the user requires them, so the potentially overwhelming amount of information appears only on demand.) In the above example a user may need to see the actual examples in context, i.e. a concordance of examples in a keyword in context (KWIC) format; alternatively, a user may need to see a table that provides only a statistical analysis of the occurrence of variants at a specific time. The two options require two different types of tool, namely a tool that can present "raw" corpus data in a KWIC format as well as a tool that can do statistical analysis of the "raw" corpus data and present the results in statistical tables. We hope to incorporate such facilities in due course. This will, however, require a considerable amount of both theoretical and empirical research and depends on the availability of suitable corpora. Research issues that need to be taken into account to incorporate such a facility are, inter alia:

• How should the data in the external database(s) be marked up to enable access to specific data at a fine level of granularity? In terms of the above example, granularity may include mark-up for different time periods, different genres, etc.

• How are word form variants such as inflections and conjugations to be handled? For example, does the database require detailed tagging of morphological forms beforehand, or would it be possible to link to the "raw" text of the corpora on the fly without prior tagging?

• What type of tools will be required to make this type of searching/linking possible?
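As an illustration of the first of the two tools mentioned above, a minimal KWIC display over "raw" text can be sketched as follows (this is not the authors' implementation; the function name and window size are invented):

```python
import re

def kwic(text, keyword, window=30):
    """Return one keyword-in-context line per occurrence of the keyword,
    with a fixed window of raw text on either side of each match."""
    lines = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - window):match.start()]
        right = text[match.end():match.end() + window]
        # Pad both sides so the keyword column lines up across results.
        lines.append(f"{left.rjust(window)} | {match.group(0)} | {right.ljust(window)}")
    return lines
```

The statistical tool would instead aggregate the same matches, e.g. counting occurrences of each variant per time period, which is why the two needs call for two different tools over the same corpus data.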
5.4 Multi-language databases

Currently, the database makes provision for Afrikaans and only a single field for English. It is feasible to use the concepts and database structures outlined here for other languages as well, as indicated above. It is therefore feasible to create multiple interlinked databases for fixed expressions in multiple languages. For translation purposes such multiple databases could be interlinked via a pivot language, for example English. Existing databases of fixed expressions could also be linked, even if the data fields in the different databases are not identical. The minimum requirement would be that there is at least a minimal set of corresponding fields, or that translation tables between different fields can be created.

6. Conclusion

Some of the envisaged expansions discussed above may not currently be commercially feasible, since the time required to do the programming or to write / collate / select the data may simply be too much to complete the dictionary in a reasonable time. In addition, some of these expansions may not be what individual users require. However, if researchers do not experiment with concepts and technologies that currently do not seem commercially realistic or feasible, innovation in e-information tools will be stifled. Such "blue sky" research could eventually lead to e-information tools that are not only incrementally better than those that are currently available, but provide different tools through disruptive innovation. The current project therefore has two aims:

• To create a database of fixed expressions, as well as to develop the necessary database tools, administrative backend, user interface and search functions, that enable users to have access to a number of monofunctional and one polyfunctional dictionaries. To result in a useful product this database and set of tools has to be completed in a limited timeframe (even though further extensions and updates need to be added regularly).
• To provide a platform to experiment with disruptive technologies and see to what extent any of these technologies can add value for the user in providing access to information in terms of the user's specific information need in a given user situation. Such "blue sky" research is absolutely essential to ensure that not only better but different types of e-tools are developed.

After all, the development of new cars is not left up to the drivers. One can ask drivers which aspects of their cars they are not quite satisfied with, and the designers and manufacturers of cars can then make the required improvements. However, drivers do not possess the know-how and the technical creativity that are necessary to design and develop cars that are totally new, much better and also manufactured quite differently. As Henry Ford allegedly said, "If I had asked people what they wanted, they would have said faster horses." e-Dictionaries are no different. Users may help to improve e-dictionaries incrementally, but only fundamental research in metalexicography, user needs, database technologies and principles of information organisation, access and retrieval will result in different types of e-tools.

7. References

Bergenholtz, H. (2010). Needs-adapted data access and data presentation. In Doctorado Honoris Causa del Excmo. Sr. D. Henning Bergenholtz. Valladolid, pp. 41-57.
Bergenholtz, H. (2011). Access to and presentation of needs-adapted data in monofunctional internet dictionaries. In P.A. Fuertes-Olivera, H. Bergenholtz (eds.) e-Lexicography: The Internet, Digital Initiatives and Lexicography. London & New York: Continuum, pp. 30-53.
Bergenholtz, H., Bergenholtz, I. (2011). A dictionary is a tool, a good dictionary is a monofunctional tool. In P.A. Fuertes-Olivera, H. Bergenholtz (eds.) e-Lexicography: The Internet, Digital Initiatives and Lexicography. London & New York: Continuum, pp. 188-207.
Bergenholtz, H., Bothma, T.J.D. (2011).
Needs-adapted data presentation in e-information tools. Lexikos, in press.
Bergenholtz, H., Johnsen, M. (2005). Log files as a tool for improving Internet dictionaries. Hermes, 34, pp. 117-141.
Bergenholtz, H., Johnsen, M. (2007). Log files can and should be prepared for a functionalistic approach. Lexikos, 17, pp. 1-20.
Bergenholtz, H., Tarp, S. (2002). Die moderne lexikographische Funktionslehre. Diskussionsbeitrag zu neuen und alten Paradigmen, die Wörterbücher als Gebrauchsgegenstände verstehen. Lexicographica, 18, pp. 253-263.
Bergenholtz, H., Tarp, S. (2003). Two opposing theories: On H.E. Wiegand's recent discovery of lexicographic functions. Hermes, 31, pp. 171-196.
Bergenholtz, H., Tarp, S. (2005). Wörterbuchfunktionen. In I. Barz, H. Bergenholtz & J. Korhonen (eds.) Schreiben, Verstehen, Übersetzen und Lernen: Zu ein- und zweisprachigen Wörterbüchern mit Deutsch. Frankfurt a.M./Bern/New York/Paris: Peter Lang, pp. 11-25.
Botha, R.P., Kroes, G. & Winckler, C.H. (1994). Afrikaanse idiome en ander vaste uitdrukkings. Halfweghuis: Southern.
Bothma, T.J.D. (2011). Filtering and adapting data and information in the online environment in response to user needs. In P.A. Fuertes-Olivera, H. Bergenholtz (eds.) e-Lexicography: The Internet, Digital Initiatives and Lexicography. London & New York: Continuum, pp. 71-102.
De Villiers, M., Gouws, R.H. (1988). Idiomewoordeboek. Cape Town: Nasou.
Malherbe, D.F. (1924). Afrikaanse spreekwoorde en verwante vorme. Bloemfontein: Nasionale Pers.
Prinsloo, A.F. (1997). Afrikaanse spreekwoorde en uitdrukkings. Pretoria: J.L. van Schaik.
Prinsloo, A.F. (2009). Spreekwoorde en waar hulle vandaan kom. Cape Town: Pharos.
Tarp, S. (2002). Translation dictionaries and bilingual dictionaries. Two different concepts. Journal of Translation Studies, 7, pp. 59-84.
Tarp, S. (2007). Lexicography in the Information Age. Lexikos, 17, pp. 170-179.
Tarp, S. (2008). Lexicography in the borderland between knowledge and non-knowledge.
General lexicographical theory with particular focus on learners' lexicography. (Lexicographica. Series Maior 134). Tübingen: Max Niemeyer.
Tarp, S. (2009a). Reflections on lexicographical user research. Lexikos, 19, pp. 275-296.
Tarp, S. (2009b). Reflections on data access in lexicographic works. In S. Nielsen, S. Tarp (eds.) Lexicography in the 21st Century. In Honour of Henning Bergenholtz. (Terminology and Lexicography Research and Practice, Volume 12). Amsterdam: John Benjamins, pp. 43-65.
Tarp, S. (2011). Lexicographical and other e-tools for consultation purposes: Towards the individualization of needs satisfaction. In P.A. Fuertes-Olivera, H. Bergenholtz (eds.) e-Lexicography: The Internet, Digital Initiatives and Lexicography. London & New York: Continuum, pp. 55-70.
Vrang, V., Bergenholtz, H. & Lund, L. (2003-2005). Den danske Idiomordbog. Database and layout: Richard Almind. www.idiomordbogen.dk.
Wiegand, H.E. (1977). Nachdenken über Wörterbücher. In Nachdenken über Wörterbücher. Mannheim/Wien/Zürich, pp. 51-102.

Describing Linde's Dictionary of Polish for Digitalisation Purposes

Joanna Bilińska
Formal Linguistics Department, University of Warsaw, Browarna 8/10, 00-927 Warsaw, Poland
E-mail: j.bilinska@uw.edu.pl

Abstract

The present paper describes the attempts at digitalising the so-called Linde's dictionary of Polish, published in 6 volumes between 1807 and 1814 by Samuel Bogumił Linde. We are working on a formal description of the dictionary's structure, whose purpose will be to allow programmers to design a tool for automatic tagging of the text. The dictionary is multilingual, so performing OCR with good quality is a difficult task. The paper also describes the indexes that are going to be added. Compiling an a tergo index and indexes of abbreviations, qualifiers and the names of quotation authors would improve the quality and usefulness of the digitalised version.
Our work with the 2nd edition of the dictionary (1854-1861) allows us to test several tools (in different stages of development) that are being developed within the framework of a Polish government grant directed by Janusz S. Bień.

Keywords: digitalisation; old dictionaries; Linde's dictionary of Polish

1. Linde's dictionary

The paper will demonstrate the attempts made to digitalise the so-called Linde's dictionary of Polish (Linde, 1807-14), published in 6 volumes between 1807 and 1814 by Samuel Bogumił Linde. It was the first work of its kind for Polish and it met with an excellent reception in Poland and abroad. Being part of Polish cultural heritage, it ought to be represented in digital form to allow more people to get acquainted with it and to enable more advanced use of it. The author's intention was to present the Polish language extensively. The dictionary contains as much Polish vocabulary as the author was able to find in available texts. Every word was supplied with all the typical pieces of information, such as grammatical properties, a definition and quotations from source texts. Moreover, translations are provided into German (in Gothic script), Slavic languages and sometimes also other languages (e.g. Hebrew), as in the author's opinion they were useful for understanding older Polish words. The dictionary is both descriptive and normative, because it includes additional information on whether a word is no longer used or whether it is more likely to be encountered in poetry or in speech.

[Facsimile of the multilingual entry MAGIEL; image not reproduced]

Figure 1: An example entry from Linde's dictionary

Due to its multilingualism, the dictionary's usefulness as a research resource is not limited to Poland. It could be used for research purposes by any historian, librarian or lexicographer interested in other (mainly Slavic) languages and cultures. Moreover, it can be interesting for other scholars studying old books, especially dictionaries, and for people interested in digitising them. We are trying to create a formal description of the dictionary's structure, as this could be used by programmers to tag the text automatically in terms of entry names and abbreviations, especially those naming the languages.

2. Digitalisations

Several digital versions of this dictionary exist in digital libraries, for example Google Books and the Kujawsko-Pomorska Biblioteka Cyfrowa¹. Both the 1st and the 2nd edition are freely available on the Internet, but their quality is not perfect. Generally, they are available in scanned form with OCR that is far from perfect. As such, they are great examples of care for the Polish heritage, but at the same time they are useless for research purposes. However, since the dictionary contains a lot of vocabulary in languages other than Polish, in different alphabets and fonts, it is very difficult to perform good OCR. Unfortunately, FineReader does not work sufficiently well with multilingual texts and does not recognise texts that are written in Gothic. While the OCR of the Polish parts of the dictionary is good, the parts in other languages are virtually useless.
Furthermore, the book is old, and even the 2nd (and last) edition, which was later only reprinted, comes from the 19th century (1854-1861). This results in print errors, such as the variable position of words when they are typed in other alphabets, etc. For example:

[Image not reproduced]

Figure 2: Variable position of words

[Image not reproduced]

Figure 3: OCR quality

¹ http://books.google.pl/books?id=rs0GAAAAQAAJ&printsec=frontcover#v=onepage&q&f=false (Google Books) and http://kpbc.umk.pl/publication/8173

This is why this type of digitalisation needs to be done with several specialised tools in addition to a standard OCR program. Some of the tools used will be presented in the paper.

3. Linde's dictionary as a corpus

Searching the hidden text layer in large DjVu files, e.g. dictionaries, is not really efficient, as it demands loading the whole file. It was decided that it would be much easier to treat the dictionaries' texts as corpora (Bień, 2011) and to use a specialised search engine for corpora. Therefore, a digital version of the 2nd edition of Linde's dictionary was recently made available at the University of Warsaw, with a preliminary OCR (SJPL, 2010).
This version of OCR was prepared with FineReader 10 (with 300 dpi resolution) and then converted from PDF to DjVu. It was then converted into a text corpus which now consists of ca. 7 million segments (http://poliqarp.wbl.klf.uw.edu.pl/slownik-lindego/).

[Screenshot of the browser interface, not reproduced; it lists the available texts (J. Karłowicz, A. Kryński, W. Niedźwiedzki, Dictionary of Polish, Warsaw 1900-1927; S. Bąk, M. R. Mayenowa, F. Pepłowski (eds.), Dictionary of the 16th century Polish, work in progress; M. Samuel Bogumił Linde, Dictionary of Polish, 2nd edition, Lwów 1854-1861; B. Chlebowski, F. Sulimierski, W. Walewski (eds.), The Geographical Dictionary of the Polish Kingdom and other Slavic Countries, Warszawa 1880-1902) together with a sample query and its results]

Figure 4: Lexicographic browser

The current version of the dictionary can be searched with the Poliqarp for DjVu browser search engine and concordancer (also called marasca)², which allows users to browse the dictionaries as if they were corpora, returning lists of concordances as search results. One of the most useful features of the search engine is that the query results can be shown as graphical concordances (see Figure 5).
And as the results are linked with the scans, one can see the original page of the dictionary with the marked result. An example can be seen in Figure 6. Regular expressions can be used for searching, as this is one of the standard features of the Poliqarp concordancer³, which Poliqarp for DjVu is based on. The query syntax is thus the same as in the Poliqarp version used for the National Corpus of Polish, which makes the tool easy to use for people familiar with the earlier version.

[Screenshot of sample query results, not reproduced]