Part of the 46th Annual Meeting of the Societas Linguistica Europaea (SLE), 18-21 September 2013. Conference website: http://www.sle2013.eu/
Location: University of Split, Croatia
Dates: 20-21 September 2013
WORKSHOP DESCRIPTION
This interdisciplinary, methodologically oriented workshop focuses on the application of parallel corpora in linguistic studies. Parallel corpora are understood here in the broadest possible sense, as any collections of texts in different languages and language varieties that convey similar information and/or are produced under similar pragmatic conditions. They include translated corpora, comparable corpora (balanced samples of the same registers/genres from different languages and language varieties, cf. Aijmer 2008), as well as related texts produced by different speakers of one language (cf. Chafe's 1980 Pear Stories). Thanks to the recent development of NLP resources, parallel corpora are becoming increasingly available for a large number of languages and language varieties. How can linguistic theory benefit from this abundance? The aim of this workshop is to bring together linguists who use different types of parallel corpora in their research, and to encourage cross-fertilization between the approaches. The main focus points of the meeting are as follows:
1. Which theoretically relevant questions can be answered with the help of parallel corpora?
Although parallel corpora have long been known to humanity – from translations of the Bible and the Universal Declaration of Human Rights to home appliance manuals and film subtitles – they have drawn the attention of theoretical linguists only recently, mostly in typology (cf. Cysouw & Wälchli 2007) and contrastive linguistics (cf. Granger 2010). These new strands of research are represented at this workshop by papers on such fundamental theoretical topics as the encoding of motion events (the contributions of Viberg and Verkerk), argument structure (Celano, Lehmann, Luraghi & Marschke), impersonal reference (Rudolf & Deringer), force dynamics (Levshina), modality (Mauri, Nissim, Pietrandrea & Sansò) and frame semantics (Čulo). These papers demonstrate that languages develop different constructional inventories for similar purposes, depending on their internal 'ecologies' (Achard 2002), cultural-historical factors, and other parameters.
2. What are the similarities and differences in the application of multilingual and monolingual parallel corpora?
The conception of language as a heterogeneous entity, a fuzzy cluster of language varieties (lects), has been gaining importance in contemporary models of language. This tendency has been accompanied by increasing interaction between dialectology and typology (e.g. Kortmann 2004). Several papers at this workshop explore the potential of monolingual comparable corpora for modelling differences and similarities between language varieties, ranging from regiolects (Wielfaert, Heylen & Speelman) to registers (Wiechmann) and idiolects (Barlow). Bringing together studies based on multilingual and multilectal corpora promises to add new dimensions to our understanding of language variation at large.
3. Are different types of parallel corpora equally suitable for any theoretical question? What kind of information do theorists need to obtain from parallel corpora and how can it be extracted (e.g. annotation schemas)?
With a few exceptions, most parallel corpora that are available today belong to a specific register or text type, for instance, the proceedings of the European Parliament (Rudolf & Deringer), Wikipedia articles (Rama & Borin), film subtitles (Levshina), translations of the Bible (Mayer & Cysouw; Celano et al.) and fiction (Verkerk). One of the aims of this workshop is to pinpoint the strengths and weaknesses of each corpus type for theoretical research. Another crucial issue is what kind of information can be extracted from a parallel corpus. For instance, von Waldenfels discusses automatic morphosyntactic annotation solutions for a particular language family (Slavic), whereas Mayer and Cysouw propose a novel statistical approach that can be used for word-by-word alignment for any set of languages. The problems of fine-grained semantic annotation and their possible solutions are discussed in Mauri et al.'s paper on the annotation of modality and Čulo's contribution on frame-semantic annotation.
4. Which quantitative methods should be used to carry out cross-linguistic and cross-lectal comparisons of different linguistic phenomena?
Large-scale multilingual or multilectal data require adequate analytical methods that can detect both underlying commonalities and interesting 'deviations' in different languages and lects. The papers in the workshop demonstrate a variety of advanced quantitative techniques that can serve these purposes: from classical correlation and factor analysis (Barlow) and Multidimensional Scaling (Levshina) to less commonly applied semantic vector spaces (Wielfaert et al.), configural frequency analysis (Wiechmann) and phylogenetic clustering (Verkerk). One of the aims of the workshop is to discuss the advantages and limitations of these and other quantitative methods.
Last updated on June 12th, 2014