
Download Nodalida07KonstantinosCharitakisGreekParallelCorpora.pdf

Using parallel corpora to create a Greek-English dictionary with Uplug

Konstantinos Charitakis
Department of Computer and Systems Sciences (DSV)
KTH-Stockholm University
164 40 Kista, Stockholm, Sweden
kcha@kth.se

Abstract

This paper presents the construction of a Greek-English bilingual dictionary from parallel corpora that were created manually from documents collected from the Internet. The parallel corpora were processed by the Uplug word alignment system without the use of language-specific information. A sample was extracted from the population of suggested translations and included in questionnaires sent out to Greek-English speakers, who evaluated the quality of the translation pairs. For the suggested translation pairs of the sample belonging to the stratum with the higher frequency of occurrence, 67.11% correct translations were achieved. With an overall 50.63% of correct translations in the sample, the results were promising considering the minimal optimisation of the corpus and the differences between the two languages.

1 Introduction

Due to the diversity of languages and the vast amount of resources required to produce a bilingual dictionary, efforts have turned towards automating the task. Statistical methods have shown promising results: they give results accurate enough for automated dictionary extraction while requiring less effort and fewer resources (Brown et al. 1990). Parallel corpora, which are texts aligned with their translations in one or more languages, are extensively used in statistical translation methods as they contain a vast amount of bilingual lexical information (Veronis 2000). After the emergence of statistical translation methods, many corpus processing systems and tools have been implemented and applied to parallel corpora of most of the popular natural languages.
However, there are not many projects on the automated creation of a dictionary for the Greek-English language pair. Similar work on extracting a Greek-English dictionary was performed by Piperidis et al. (1997, 2005), although in both cases the approach was different: it employed statistical techniques coupled with linguistic information for better results, and it was applied to a corpus in the software domain and to a corpus consisting of official EU documents, respectively. Related work using the same system is described by Dalianis et al. (2007), who used Uplug on Scandinavian and English parallel corpora and obtained 71% precision and 93% recall for Swedish-English dictionaries.

The primary focus of this paper is on the extraction and evaluation of a Greek-English dictionary created from parallel corpora using the Uplug system. The experiment was performed without the use of linguistic information and without the use of optimised sentence-aligned corpora for the Greek-English language pair.

2 Dictionary Extraction and Evaluation

2.1 The Uplug System

For the processing of the corpora, the Uplug word alignment system was used. Uplug originates from a project at Uppsala University and provides a collection of tools for linguistic corpus processing, word alignment and term extraction from parallel corpora (Tiedemann 1999). Uplug uses language-specific pre-processing modules if available. Otherwise, it uses the basic pre-processing modules, which run the general tokenizer and the sentence splitter and add simple XML markup. The word aligner implemented in the Uplug system is the Clue Aligner, which is based on the combination of word alignment clues. The idea is that features such as frequency, part-of-speech, parsing and word form, together with similarity and frequency measures, are taken into account and considered as association clues between words.
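As a rough illustration of how several association clues for one word pair might be merged into a single score, the sketch below treats each clue as an independent score in [0, 1] and combines them as a probabilistic disjunction. The combination formula, the function names `combine_clues` and `best_links`, and all scores are assumptions made for this example; the text above does not specify how Uplug computes or weights its clues.

```python
from functools import reduce


def combine_clues(scores):
    """Combine clue scores in [0, 1] as a probabilistic disjunction.

    Assumption for illustration: clues are treated as independent, so the
    combined score is 1 - prod(1 - s) over all clue scores s.
    """
    return 1.0 - reduce(lambda acc, s: acc * (1.0 - s), scores, 1.0)


def best_links(clues):
    """Pick, for each source word, the target word with the highest combined score.

    clues maps (source_word, target_word) to a list of clue scores.
    Returns a dict mapping source_word to (target_word, combined_score).
    """
    best = {}
    for (src, tgt), scores in clues.items():
        total = combine_clues(scores)
        if src not in best or total > best[src][1]:
            best[src] = (tgt, total)
    return best


# Hypothetical clue scores (for instance co-occurrence and string similarity);
# the values are invented for the example.
example = {
    ("σπίτι", "house"): [0.6, 0.1],
    ("σπίτι", "home"): [0.5, 0.1],
    ("νερό", "water"): [0.7],
}
links = best_links(example)
```

With these invented scores, the pair with the strongest combined evidence is kept for each Greek word. A real aligner would also have to handle multi-word links and competing links on the target side.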
All these association clues are then combined in order to find links between words in the source and target languages (Tiedemann 2003). Uplug uses the Clue Aligner for iterative size reduction and alignment of the corpora.

2.2 Collection of Parallel Corpora

There are many public corpora available on the web. One of the most interesting efforts is the OPUS corpus (Tiedemann and Nygaard 2004). However, the corpora provided are in most cases already aligned, most often at sentence level, and tagged in XML format. There was a concern that optimised corpora would give optimised results, while our intention was to work with input as realistic as possible. In order to test the full potential of the Uplug system, including its sentence alignment process, the use of raw-text parallel corpora was considered necessary. Therefore, a manually created corpus was used. The English and Greek translated documents included in the corpus were mainly collected from the web site of the European
  • Source: people.dsv.su.se