In this talk we will discuss a collaborative effort, funded by the European Research Council, to build a Python library for academic research in historical-comparative linguistics. Historical-comparative linguistics compares language data from multiple sources to establish the genealogical relatedness of languages, to infer language family trees, and ultimately to discover their common origin.
Like species, all languages have evolved from a common origin. In molecular biology, the alignment of protein and DNA sequences is the mainstream method for reconstructing the phylogeny of organisms. Although this comparison is computationally expensive because DNA strings are extremely long, DNA sequences essentially involve only four bases (C, T, G, A) with fairly predictable mutations. The comparison of language data, on the other hand, involves short strings (i.e. words, which are fairly limited in length) that can nonetheless draw on scores of distinct sounds (the “letters” equivalent to DNA bases). Some languages have well over 100 different sounds, which makes aligning sound sequences to compare words across languages extremely challenging. Even more problematic is that the mutation of sounds, which linguists call “sound change”, is not fully understood. Although fairly common sound changes have been described, many are unpredictable because they arise from factors of social interaction, e.g. the mixture of different native-speaking communities or the way in which each new generation adapts the language of its parents. Over time, these and other factors cause a language to diverge (e.g. Latin evolved into Spanish, French, Italian, etc.), and that is precisely what we aim to model and quantify programmatically.
In our project, we adapt state-of-the-art methods from biology and apply them to the task of language comparison. We have implemented many different methods and algorithms for automatic sequence alignment. We are also building models of sound change, applying information-theoretic approaches to quantify the complexity of languages, and orthographically parsing data from a large set of languages for comparative analysis.
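To make the idea of aligning sound sequences concrete, here is a minimal sketch of global pairwise alignment (the classic Needleman-Wunsch dynamic program from bioinformatics) applied to words that have already been split into sound segments. It is an illustration only and not the project library's API: the function name `align`, the `GAP` symbol and the flat scoring scheme (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity. A realistic scorer would weight substitutions by how plausible the corresponding sound change is.

```python
from typing import List, Sequence, Tuple

GAP = "-"  # hypothetical gap symbol, chosen for this sketch


def align(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 1,
    mismatch: int = -1,
    gap: int = -1,
) -> Tuple[int, List[str], List[str]]:
    """Globally align two sequences of sound segments (Needleman-Wunsch)."""
    n, m = len(a), len(b)

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # a[i-1] faces a gap
                score[i][j - 1] + gap,      # b[j-1] faces a gap
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], out_a[::-1], out_b[::-1]


# Latin "noctem" and Spanish "noche", written here as orthographic segments.
latin = ["n", "o", "c", "t", "e", "m"]
spanish = ["n", "o", "ch", "e"]
total, top, bottom = align(latin, spanish)
print(total)
print(" ".join(top))
print(" ".join(bottom))
```

Because the sequences are lists of segments rather than raw characters, a multi-character unit such as "ch" is compared as a single sound, which is the main way this differs from aligning plain strings.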
Our aim in implementing quantitative methods, specifically in Python, is to transform historical-comparative linguistics from a primarily handcrafted scholarly endeavor, performed by individual researchers, into a quantitative and collaborative field of research involving linguists, mathematicians and computer scientists. By using Python and leveraging packages including numpy, scipy and regex (not re), our project takes a quantitative approach to uncovering and clarifying phylogenetic relationships between under-studied and endangered languages.
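As an illustration of what orthographic parsing can involve, the sketch below splits a written word into sound-sized segments by greedy longest-match against a list of multi-character units. It is a simplified stand-in, not the project's implementation: the function name `segment` and the small Spanish digraph inventory are hypothetical, and it uses only the standard library rather than the regex package mentioned above.

```python
from typing import Iterable, List


def segment(word: str, inventory: Iterable[str]) -> List[str]:
    """Split `word` into segments, preferring the longest known unit at each position."""
    # Try longer units first so that "ch" wins over "c" followed by "h".
    units = sorted((u for u in set(inventory) if u), key=len, reverse=True)

    segments: List[str] = []
    i = 0
    while i < len(word):
        for unit in units:
            if word.startswith(unit, i):
                segments.append(unit)
                i += len(unit)
                break
        else:
            # No known unit starts here: fall back to a single character.
            segments.append(word[i])
            i += 1
    return segments


# Hypothetical inventory of Spanish digraphs that each stand for one sound.
spanish_digraphs = ["ch", "ll", "rr", "qu"]
print(segment("noche", spanish_digraphs))  # ['n', 'o', 'ch', 'e']
```

Segment lists of this kind are the natural input to a sequence alignment, since each element then corresponds to one sound rather than one letter.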