Sentence-Alignment and Application of Russian-German Multi-Target Parallel Corpora for Linguistic Analysis and Literary Studies

  • Desislava Zhekova Ludwig-Maximilians-Universität München
  • Robert Zangenfeind Ludwig-Maximilians-Universität München
  • Alena Mikhaylova Ludwig-Maximilians-Universität München
  • Tetiana Nikolaienko Ludwig-Maximilians-Universität München

Abstract

This paper presents the application of multi-target parallel corpora, each consisting of a single source text and multiple target translations of it, to linguistic analysis. We discuss the alignment, interactive search and visualization of this type of data in a dedicated tool called ALuDo (Alignment with Lucene for Dostoyevsky). ALuDo is a Java implementation that uses local grammars, ontological information, bilingual dictionaries and statistical approaches for alignment and search. The data set is the Russian novel Crime and Punishment by Fyodor Dostoyevsky together with three German translations of it. This bilingual corpus supports a wide range of investigations in linguistics and literary studies. Additionally, we release part of the resulting parallel corpus.

DOI: http://dx.doi.org/10.14195/2182-8830_4-1_3

Published
2015-02-28
How to Cite
ZHEKOVA, Desislava et al. Sentence-Alignment and Application of Russian-German Multi-Target Parallel Corpora for Linguistic Analysis and Literary Studies. MATLIT: Materialities of Literature, [S.l.], v. 4, n. 1, p. 45-61, feb. 2015. ISSN 2182-8830. Available at: <http://impactum-journals.uc.pt/matlit/article/view/2333>. Date accessed: 23 feb. 2019. doi: https://doi.org/10.14195/2182-8830.
Section
Secção Temática | Thematic Section

Keywords

interactive alignment; rule-based alignment; statistical alignment; coreference resolution; paraphrase identification