Lexicon management and standard formats

Éric Laporte
Institut Gaspard-Monge (IGM) - CNRS & University of Marne-la-Vallée
15 bd Descartes, F 77474 Marne-la-Vallée CEDEX 2 - France
eric.laporte@univ-mlv.fr

Abstract

International standards for lexicon formats are in preparation. To a certain extent, the proposed formats converge with prior results of standardization projects. However, their adequacy for (i) lexicon management and (ii) lexicon-driven applications has been little debated in the past, nor is it debated as part of the present standardization effort. We examine these issues. IGM has developed XML formats compatible with the emerging international standards, and we report experimental results on large-coverage lexica.

Introduction

International standards for lexicon formats are in preparation in order to facilitate associated software development, meta-documentation and exchange of language resources. To a certain extent, the proposed formats converge with prior results of standardization projects. However, their adequacy for (i) lexicon management and (ii) lexicon-driven applications has been little debated in the past, nor is it debated as part of the present standardization effort. We examine these issues. IGM has developed XML formats of lexical resources compatible with the emerging international standards, along with corresponding software tools, and has carried out experiments on large-coverage lexica of English and other languages. We report experimental results.

In the next section, we briefly describe the standard lexicon model presently under construction. Section 2 examines how adequate this model is for lexicon management. Section 3 takes into account the requirements of lexicon-based lexical tagging. The conclusion synthesizes our results.

1. Previous work

A series of standards for the representation of lexica for natural language processing (NLP) have been successively proposed, from Genelex (Normier, Nossin 1990) to the present ISO group on Language resource management (Ide, Romary 2002). Though some authors emphasize the differences between the formats of lexica for written text processing currently in use (Wittenburg et al. 2002), there is much in common among the various models, which seem to be converging to an emerging ISO standard. IGM participates in this effort through the Outilex and Normalangue projects [1]. In this section, we describe the overall structure of the emerging standard, and in particular we examine how it handles the dichotomy between lemma and inflected form in inflectional languages.

[1] This paper owes much to the discussions inside this group and to the anonymous reviewers' interesting remarks and constructive suggestions.

1.1. Lemmas

All proposed models have a lemma-based overall structure. In a lemma-based model, the set of lexical items is a set of lemma entries, i.e. nodes each of which represents a lemma of the language (Fig. 1).

    <dic>
      <entry>
        <lemma>game</lemma>
        <pos name='noun'/>
        <f name='reliability' value='1'/>
        <inflection>...</inflection>
      </entry>
    </dic>

Figure 1: Sample of a lemma-based lexicon

The notion of lemma exists in all languages. Part-of-speech, an essential feature, is attached to lemma entries. In the Olif model (Lieske et al. 2001), which integrates terminological with other lexical information and is consistent with international standards in terminography (TMF: Romary 2001), terminological information is attached to lemmas.

Higher-level features are attached either to lemma entries or to senses, which are themselves attached to lemmas but have a finer granularity. In the draft model of the Lexical Resource Markup Framework (LMF, ISO TC 37/SC4: Francopoulo 2003, George 2003), features such as the applicability of syntactic constructions are attached to senses. Senses play the part of the nodes of a thesaurus: semantic links are attached to them. In the Papillon model (Boitet et al. 2002), multilingual links are attached to senses.

1.2. Inflection

The capacity to provide links between lemmas and inflected forms is part of the information contained in a lexicon of an inflectional language. Inflectional information in a lemma-based lexicon model can be specified either in the form of inflectional rules or as a complete paradigm of inflected forms attached to the lemma. In the second case, we obtain a variant of the lemma-based model in which elements that represent inflected forms of a lexical item (word-form entries) are embedded in the corresponding lemma entry. We call this variant a mixed model because it combines these two types of entries (Fig. 2). Inflectional features such as number, person, mood, tense, gender, case, etc. are attached to word-form entries.

    <dic>
      <entry>
        <lemma>game</lemma>
        <pos name='noun'/>
        <f name='reliability' value='1'/>
        <inflected>
          <form>game</form>
          <f name='number' value='singular'/>
        </inflected>
        <inflected>
          <form>games</form>
          <f name='number' value='plural'/>
        </inflected>
      </entry>
    </dic>

Figure 2: Sample of a mixed lexic
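As an illustration of how a mixed-model entry might be consumed by software, the following minimal Python sketch reads a sample shaped like Fig. 2 and builds an index from each word form to its lemma, part of speech and inflectional features. The tag and attribute names (dic, entry, lemma, pos, f, inflected, form) are read off the figure and are assumptions rather than a published schema; the function name index_by_form is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample laid out like Fig. 2. Tag and attribute names are
# assumptions taken from the figure, not from an official schema.
SAMPLE = """
<dic>
  <entry>
    <lemma>game</lemma>
    <pos name="noun"/>
    <f name="reliability" value="1"/>
    <inflected>
      <form>game</form>
      <f name="number" value="singular"/>
    </inflected>
    <inflected>
      <form>games</form>
      <f name="number" value="plural"/>
    </inflected>
  </entry>
</dic>
"""


def index_by_form(xml_text):
    """Map each inflected form to a list of (lemma, pos, features) analyses."""
    index = defaultdict(list)
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        # Lemma-level information is shared by all word forms of the entry.
        lemma = entry.findtext("lemma").strip()
        pos = entry.find("pos").get("name")
        # Only direct children are searched, so the entry-level <f> elements
        # (e.g. reliability) are not mixed with inflectional features.
        for inflected in entry.findall("inflected"):
            form = inflected.findtext("form").strip()
            features = {f.get("name"): f.get("value")
                        for f in inflected.findall("f")}
            index[form].append((lemma, pos, features))
    return dict(index)


if __name__ == "__main__":
    for form, analyses in index_by_form(SAMPLE).items():
        print(form, analyses)
```

Each form maps to a list because a single word form can belong to several lemmas or carry several feature sets. Keeping lemma-level and word-form-level features apart in the index mirrors the two types of entries that the mixed model combines.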