Meeting on Corpora

Abstracts

Dr. Vassiliki Foufi (School of Modern Greek), Eleni Kogitsidou (Université de Grenoble), Dr. Athanasios Mavropoulos (Centre for the Greek Language), Dr. Olympia Tsaknaki (Aristotle University of Thessaloniki)

Compilation of a Literary Text Corpus

The ultimate goal of our research is the construction of a literary bitext. The research is carried out in the context of the project Compilation of a parallel corpus of French fiction translated into Greek, led by Prof. Titika Dimitroulia and financed by the AUTH Research Committee. The bitext is intended to enrich Greek digital content and to support the study of key issues in translation and Translation Studies, such as the contribution of translation to shaping the language of its time, the imprint of the time on translation, the translator's style, etc.

Parallel texts are considered very important for natural language processing applications and for linguistic research, as they help to resolve semantic ambiguities and contribute to terminology extraction and contrastive corpus studies.

We will present in detail the steps we followed to construct the parallel corpus, based on electronic texts and using multiple tools. First, we converted the files into editable formats by means of ABBYY FineReader, which provides optical character recognition (http://www.abbyy.com.gr/). Then, we processed and corrected the texts in order to remove the problems that arose during file conversion (unrecognized accented characters, typographical errors, etc.). Finally, text alignment at sentence level was performed with the open-source LF Aligner (http://sourceforge.net/projects/aligner/). We will present mismatches between the source text and the translation, such as different formatting, improper text segmentation in the source language or the target language, mismatches between the textual units, etc.
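As an illustration of how mismatches between textual units might be spotted after alignment, the following is a minimal sketch and not the procedure used in the project. It assumes that the aligned output is available as tab-separated source/target lines (an assumption about the export format), and it flags pairs with an empty side or an unusual character-length ratio. The function name flag_suspect_pairs, the thresholds and the sample sentences are all hypothetical.

```python
# Hypothetical helper: flags aligned pairs that deserve a manual check.
# Assumption: each line holds "source<TAB>target"; thresholds are arbitrary.
def flag_suspect_pairs(lines, low=0.5, high=2.0):
    suspects = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        source = parts[0].strip() if len(parts) > 0 else ""
        target = parts[1].strip() if len(parts) > 1 else ""
        if not source or not target:
            # One side is missing: likely a segmentation or alignment problem.
            suspects.append((number, "empty side"))
            continue
        ratio = len(target) / len(source)
        if ratio < low or ratio > high:
            # Very different lengths often indicate merged or split sentences.
            suspects.append((number, "length ratio %.2f" % ratio))
    return suspects


# Invented sample lines, for illustration only.
sample = [
    "Il faisait froid.\tΈκανε κρύο.",
    "Elle ouvrit la porte et sortit.\t",
    "Oui.\tΝαι, είπε εκείνος, και ύστερα έφυγε χωρίς να κοιτάξει πίσω του.",
]
suspects = flag_suspect_pairs(sample)
for number, reason in suspects:
    print(number, reason)
```

A length-ratio check of this kind only points to candidates for inspection; literary translation legitimately expands or condenses sentences, so flagged pairs still need to be reviewed by hand.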


Prof. Dionysis Goutsos (National and Kapodistrian University of Athens)

Greek corpus building and analysis: The story so far and what is to follow

The paper offers a state-of-the-art account of corpus research on Greek, focusing on both corpus compilation and analysis. It outlines the main phases of development of Modern Greek corpora and presents the most important findings on the description of the Greek language deriving from corpora, with specific examples. The main focus of the paper is on the relevance of these findings for the study of translation in Greek, as well as their implications for translation theory and practice. Finally, the perspectives of corpus-related research on Greek are outlined and some translation hypotheses for further exploration are pointed out.


Prof. Titika Dimitroulia (Aristotle University of Thessaloniki)

Design and compilation of a literary parallel corpus: aims and applications

The compilation of a parallel corpus of French literary fiction translated into Greek is situated in the context of Corpus-based Translation Studies (CTS) in the Greek-speaking world and its applications to translation didactics. At the same time, the corpus, containing works published after 1974, constitutes a first sample of contemporary translated Greek literary discourse, and thus we believe that it can and even must be incorporated, together with all other similar works in progress, into monolingual Greek corpora (HNC, SEK, Diachronic Greek Corpora) and comparable corpora.

The choice of works in the design of this small-scale corpus, financed by the AUTH Research Committee, complies with the criterion of representativeness/balance: an effort was made to include works covering three centuries (18th-20th) and different genres, so that the interference of the source language and the genre can be studied better. For the same reason, we tried to include the different categories of translators (ordinary mediators and established writers, academics, professional translators). The texts are released with the permission of their publishers, always under the copyright regime. This is why the corpus, which will be online and open-access, will allow the full texts to be accessed and downloaded.

Based on the corpus and taking into account extra-textual parameters, we plan to study the style of individual translators, collective stylistic features, authorship attribution, translationese and translation universals (explicitation, simplification, normalization, etc.), among others. In the field of translation didactics, we examine the translation of culturemes and intertextuality, and discuss the concept of quality in literary translation. Finally, the corpus will be used in the context of the current debate on digital literature and the digital humanities.


Prof. Fryni Kakoyianni-Doa & Dr. Eleni Tziafa (University of Cyprus)

The SOURCe Project

The SOURCe project was developed in three parts and includes (a) the search engine for the Searchable Online French-Greek Parallel Corpus for the University of Cyprus (SOURCe), (b) the Pencil and (c) the Library tool. These are designed as freely available resources for language processing, along with the data to be processed, in usable formats for teachers, learners and translators. Our aim is to describe the design principles and the properties of the SOURCe Project and to outline its future perspectives and applications. The project is led by Fryni Kakoyianni-Doa and is fully funded by the University of Cyprus. The core of the project is a collection of parallel corpora: original and translated texts in French and Greek, aligned at sentence level. In order to spare teachers and translators lengthy preparation and complex corpus building, we propose the construction of simple online corpora with basic text-searching facilities, avoiding automatically annotated, tagged or parsed corpora, which are more appropriate for detailed linguistic research. We designed a simple interface through which the user may search existing corpora, upload texts, and view them online. Moreover, we enabled different 'viewpoints', so that different types of users can see different views of the same underlying datasets.


Prof. Rudy Loock (Université Lille 3)

Intra-language differences and translation quality

The aim of this presentation is to raise the question of whether the measurement of intra-language differences between original language and translated language can be used as a tool for translation quality assessment.

To ask such a question is to enter the thorny debate on the interpretation of intra-language differences: should we consider translated language as variation comparable to dialectal variation or should we consider that the over-representation or under-representation of a given linguistic construction means that the quality of the translation should be improved? From an even more general perspective, should we consider that translated language is intrinsically different and represents what researchers have called a third code or should we consider that “the utopian goal is to make it virtually impossible to tell the translation from an original text in that language” (Teubert 1996: 241)?

Through the analysis of a learner corpus (translation tasks from English to French performed by first-year and master’s students) for two case studies (derived adverbs and existential constructions), we try to see whether some correlation can be found between the observed intra-language differences and the overall quality of the translation tasks.


Prof. Sofia Malamatidou (University of Birmingham)

Translation and Language Change: The Interplay of Diachronic and Synchronic Corpus-Based Studies

Corpus-based research has yielded important insights into translation in recent years, but most studies in the field have focused on synchronic analyses, thus neglecting the potential for diachronic analysis to enhance our understanding of how translation might contribute to important phenomena such as language change. Recently, a number of scholars have adopted a corpus-based approach in the investigation of translation as a form of language contact and its impact on the target language. However, no diachronic corpus-based study of translation involving Modern Greek has so far been attempted. Similarly, comparable and parallel corpora have not been efficiently used by linguists for the analysis of diachronic phenomena.

This study aims to combine synchronic and diachronic corpus-based approaches, as well as parallel and comparable corpora, for the analysis of linguistic features of translated texts and their impact on non-translated texts. Unlike most studies employing comparable corpora, which focus on revealing recurrent features of translated language independently of the source language (SL) and target language (TL), this study approaches texts with the intention of revealing features that are dependent on the specific language pair involved in the translation process, i.e. English and Modern Greek.

The study involves the diachronic analysis of the TROY Corpus: a corpus of Modern Greek non-translated and translated popular science articles, along with their English source texts, covering a 20-year period (1990-2010) and consisting of approximately half a million words. The corpus is divided into three subcorpora. The first subcorpus consists of non-translated Modern Greek popular science articles published in 1990-1991. The second subcorpus consists of non-translated and translated Modern Greek popular science articles published in 2003-2004, as well as the source texts of the translations. The third subcorpus includes non-translated as well as translated texts and their source texts, all published in 2010-2011. The linguistic feature analysed for the purposes of this study is the frequency of passive-voice reporting verbs.


Spyridon Pilos (Head of Sector “Language Applications”, Informatics Unit, Resources Directorate, Directorate General for Translation, European Commission, Luxembourg)

The public translation memories and corpora of DG Translation of the European Commission

The Directorate General for Translation (DGT) has made available two data sets relevant for translation: the DGT-TM and the DGT-Acquis. The DGT-TM, i.e. "DGT Translation Memory", is a collection of compressed multilingual files in the TMX format, an XML-based standard used for the exchange of translation memory data. The present update covers the entire body of EU law as published in the L-Series of the Official Journal between 1972 and 2012, in 23 official languages of the EU. Updates are scheduled to be released on a yearly basis. The DGT-Acquis is a paragraph-aligned parallel corpus consisting of full-text documents with added meta-information on which paragraphs are aligned with which others in the other languages. In this corpus, one can thus see each sentence in its context, while in translation memories each sentence appears in isolation, i.e. out of context. The DGT-Acquis contains not only the L-series of the Official Journal but also the LM, C, CA and CE collections. Both resources are available for download from the Joint Research Centre's website on language technology resources. The DGT-TM is also accessible through the EU Open Data Portal.
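To make the TMX format more concrete, the following minimal sketch reads translation units from a small TMX document and extracts the segment pairs for two languages. It uses only the Python standard library. The sample document is invented and is not taken from the DGT-TM; the function name read_pairs and the language codes used in the sample are assumptions, and actual DGT-TM files may use different codes and additional attributes.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample for illustration; header values are placeholders.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="example"/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>This Regulation shall enter into force.</seg></tuv>
      <tuv xml:lang="EL"><seg>Ο παρών κανονισμός αρχίζει να ισχύει.</seg></tuv>
      <tuv xml:lang="FR"><seg>Le présent règlement entre en vigueur.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="EN"><seg>Done at Brussels.</seg></tuv>
      <tuv xml:lang="FR"><seg>Fait à Bruxelles.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) pairs for units containing both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.iter("tuv"):
            lang = variant.get(XML_LANG, "").lower()
            seg = variant.find("seg")
            if seg is not None:
                # itertext() also collects text nested in inline markup.
                segments[lang] = "".join(seg.itertext()).strip()
        # Match on the language prefix so that "en" also accepts "en-gb".
        source = next((t for l, t in segments.items() if l.startswith(source_lang)), None)
        target = next((t for l, t in segments.items() if l.startswith(target_lang)), None)
        if source is not None and target is not None:
            pairs.append((source, target))
    return pairs


pairs = read_pairs(SAMPLE_TMX, "en", "el")
for source, target in pairs:
    print(source, "|", target)
```

Because each translation unit is self-contained, the pairs come back without their surrounding text, which illustrates the point made above about sentences in a translation memory being out of context.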


Prof. Mojca Schlamberger Brezar (University of Ljubljana)

Arguing for or against: connectives in translation from French into Slovene in a parallel journalistic corpus

Since the 1990s, corpus linguistics has made revolutionary progress in data processing, and the field of translation studies is no exception. We will discuss the rise of corpus linguistics for Slovene, a language with just over 2 million speakers.

After a brief account of the state of the art in corpus linguistics for Slovene, we will present research on French argumentative and counter-argumentative connectives of the type parce que, puisque, mais, pourtant, etc. in the two parts, journalistic and literary, of the French-Slovene parallel corpus FraSloK, and we will collect their equivalents in Slovene translation. We will look at the argumentative strategies used in the source text and their impact on the choice of connectives in translation. We will analyse the translation strategies for these connectives and study how they depend on the text type, on the different language registers present in the corpus, and on the personal choices of the translators.


Prof. Federico Zanettin (Università di Perugia)

Corpora and literary translation research: Issues and challenges

In this presentation, I consider applications of corpus linguistics tools and methodologies to descriptive translation studies. More specifically, I discuss ways in which corpora of different types can help investigate literary translation, both from a quantitative and a qualitative perspective. I provide an overview of the main research lines, namely research on so-called translation universals, translation norms and translator style. Finally, I consider the main stages of corpus compilation and use, from corpus design and annotation to search and visualization techniques, with a focus on parallel corpora.

Symposium

Corpus-Based Translation Studies with emphasis on parallel literary corpora

17 January 2014

School of French
Aristotle University of Thessaloniki

Organiser

Titika Dimitroulia

Sponsors


AUTH Research Committee


Laboratory of Translation and Speech Processing (EMEL)
School of French
