Workshop on Corpus-based Quantitative Typology (CoQuaT 2013)

Pre-conference workshop in connection with the 10th Biennial Conference of the Association for Linguistic Typology (ALT X), Leipzig, Germany, August 15-18, 2013

Location: University of Leipzig, Germany
Venue: Felix-Klein-Hörsaal, Paulinum, 5th floor
Date: August 14, 2013

Submission deadline: March 31, 2013
Notification of acceptance: April 22, 2013

CONVENORS

Michael Cysouw (Philipps University of Marburg)
Dirk Goldhahn (University of Leipzig)
Thomas Mayer (Philipps University of Marburg)
Uwe Quasthoff (University of Leipzig)

INVITED SPEAKERS

Östen Dahl (Stockholm University)
Kevin Scannell (Saint Louis University)

PROGRAM (Updated)

09.00-09.30: Coffee + Welcome
09.30-10.30: Kevin Scannell: How many languages are on the web? The Crúbadán project ten years on
10.30-11.00: Dirk Goldhahn and Uwe Quasthoff: Using the Leipzig Corpora Collection for language comparison
11.00-11.30: Coffee
11.30-12.00: Harald Hammarström, Mikael Parkvall and Sean Roberts: A parallel text approach to measuring forced expressivity across languages
12.00-12.30: Gertraud Fenk-Oczlon and August Fenk: Quantitative typology based on a matched set of simple declarative sentences
12.30-14.00: Lunch
14.00-14.30: Maria Kholodilova: Animacy distinction in Slavic possessive relative pronouns
14.30-15.00: Harald Hammarström: Three approaches to prefix and suffix statistics in the languages of the world
15.00-16.00: Poster Session (+ Coffee)
16.00-16.30: Thomas Mayer and Michael Cysouw: Using matrix algebra to analyze a massively parallel Bible corpus
16.30-17.30: Östen Dahl: Computer-aided extraction of grammatical morphemes and constructions from a parallel corpus

WORKSHOP DESCRIPTION

The number of available (textual) corpora of the world’s languages is currently rising rapidly. The aim of this workshop is to bring together researchers working on corpus-based quantitative language comparison and to encourage typological studies that rely on corpus data.

A growing body of research uses corpora to investigate the structure of individual languages. There is also a large body of research on worldwide linguistic diversity, though mostly on the basis of information manually extracted from published sources. In contrast, the combination of the two is still rare. There are only a few quantitative typological investigations with a worldwide scope that use corpora to infer cross-linguistic generalizations and insights. Some previous work compiled quantitative data through manual corpus annotation (e.g. Greenberg 1960; Wälchli 2005) or automatically with the help of computer programs (e.g. Mayer and Cysouw 2012). In addition, there is some relevant work using corpora to compare a smaller number of (genealogically related) languages (e.g. Bickel 2003; van der Auwera et al. 2005).

Cross-linguistic corpora, in particular (massively) parallel corpora (cf. Cysouw and Wälchli 2007) or comparable corpora compiled through web crawling (e.g. Scannell 2007; Goldhahn et al. 2012), provide an enormous amount of information about the world's languages. Although such data are often not ideal from a linguistic point of view (involving problems of translationese, or being restricted to special textual genres), it would be a waste not to at least try to use them for comparative linguistic purposes.

One of the reasons for the shortage of quantitative cross-linguistic work is the lack of adequate resources for a representative sample of languages. Consequently, on top of the laborious manual analysis, typologically interested researchers face the time-consuming task of building their own corpora from scratch. One goal of this workshop is therefore to collect (online) resources (especially for lesser-studied languages) and to exchange experiences with crawling texts from the web. Furthermore, we intend to discuss the formats in which cross-linguistic corpora should be made publicly available so that typologists can best benefit from them without violating copyright laws.

CALL FOR PAPERS

For this workshop, we welcome any type of cross-linguistic quantitative corpus-based work. We are interested both in the collection and preparation of (massively) cross-linguistic corpora and in investigations that rely on such a resource for language comparison.

A) Possible topics concerning the collection and preparation of text data for a larger number of languages:

  • presentations about projects collecting and organizing (massively) parallel or comparable corpora
  • presentations about projects crawling web data to build a cross-linguistic corpus
  • approaches to (semi-)automatic annotation of corpora for typological research
  • proposals of corpus formats that are useful for typological research and can easily be imported into standard formats

B) Specific examples of corpus-based language comparison, focusing on a particular linguistic topic of choice, using approaches like:

  • (massively) parallel text analysis
  • corpus-based multivariate quantitative comparison of languages
  • unsupervised or semi-supervised language analysis for language comparison
  • evaluation of cross-linguistic corpus-based studies

SUBMISSION PROCEDURE

Please send an abstract of approx. 500 words (excluding references) to coquat2013@gmail.com. Abstracts should contain the author's name, affiliation and contact email. The deadline for the submission of proposals is March 31, 2013. Notification of acceptance is May 1, 2013.

REFERENCES

Bickel, B. 2003. Referential density in discourse and syntactic typology. Language 79. 708-739.

Cysouw, M. and B. Wälchli (eds.). 2007. Parallel Texts. Using Translational Equivalents in Linguistic Typology. Theme issue in Sprachtypologie & Universalienforschung STUF 60.2.

Goldhahn, D., T. Eckart and U. Quasthoff. 2012. Building large monolingual dictionaries at the Leipzig Corpora Collection: From 100 to 200 languages. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), 23-25.

Greenberg, J. H. 1960. A quantitative approach to the morphological typology of language. International Journal of American Linguistics 26. 178-194.

Mayer, T. and M. Cysouw. 2012. Language comparison through sparse multilingual word alignment. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH. 54-62.

Scannell, K. P. 2007. The Crúbadán Project: Corpus building for under-resourced languages. In C. Fairon, H. Naets, A. Kilgarriff, and G-M. de Schryver (eds.), Building and exploring web corpora: proceedings of the 3rd Web as Corpus Workshop, Cahiers du Central: 4, 5-15. Louvain: Presses Universitaires de Louvain.

van der Auwera, J., E. Schalley and J. Nuyts. 2005. Epistemic possibility in a Slavonic parallel corpus - a pilot study. In B. Hansen and P. Karlik (eds.), Modality in Slavonic Languages, New Perspectives. München: Sagner. 201-217.

Wälchli, B. 2005. Co-compounds and Natural Coordination. Oxford: Oxford University Press.

Last updated on October 14th, 2014