Sentence-Alignment and Application of Russian-German Multi-Target Parallel Corpora for Linguistic Analysis and Literary Studies
This paper presents the application of multi-target parallel corpora, each consisting of a single source text and multiple translations of it, to linguistic analysis. We discuss the alignment, interactive search, and visualization of this type of data in a dedicated tool called ALuDo (Alignment with Lucene for Dostoyevsky), a Java implementation that uses local grammars, ontological information, bilingual dictionaries, and statistical approaches for alignment and search. The data set is the Russian novel Crime and Punishment by Fyodor Dostoyevsky together with three German translations of it. This bilingual corpus supports a range of investigations in linguistics and in literary studies. Additionally, we release part of the resulting parallel corpus.
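To make the idea of dictionary-supported multi-target alignment concrete, the following is a minimal sketch in Python. It is not ALuDo's implementation (which is written in Java and also draws on local grammars, ontological information, and statistical cues). All function names, the toy lexicon, and the example sentences are hypothetical and invented for illustration only. They are not taken from the released corpus. The sketch scores each candidate target sentence by the share of source tokens that have a dictionary translation in it, and picks the best candidate independently in each translation.

```python
import re
from typing import Dict, List, Set

# Hypothetical toy Russian-to-German lexicon (illustration only).
TOY_LEXICON: Dict[str, Set[str]] = {
    "он": {"er"},
    "вышел": {"ging", "trat"},
    "улицу": {"straße", "gasse"},
}


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and split it into word tokens, dropping punctuation."""
    return re.findall(r"\w+", sentence.lower())


def overlap_score(src_tokens: List[str], tgt_tokens: List[str],
                  lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with at least one dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt)
    return hits / len(src_tokens)


def best_match(src_sentence: str, candidates: List[str],
               lexicon: Dict[str, Set[str]]) -> int:
    """Index of the candidate with the highest overlap score (first one wins on ties)."""
    src_tokens = tokenize(src_sentence)
    scores = [overlap_score(src_tokens, tokenize(c), lexicon) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def align_to_translations(src_sentence: str,
                          translations: Dict[str, List[str]],
                          lexicon: Dict[str, Set[str]]) -> Dict[str, int]:
    """Align one source sentence to its best candidate in each translation."""
    return {name: best_match(src_sentence, sents, lexicon)
            for name, sents in translations.items()}
```

Calling `align_to_translations` with one source sentence and a mapping from translation name to candidate sentences returns, per translation, the index of the best-scoring candidate. A realistic aligner would additionally need lemmatization, sentence-length statistics, and support for one-to-many alignments, none of which this sketch attempts.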
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.