### Alignment Analyses

Alignment analyses are the most common way to compare sequences. Since phonetic sequences are the basic comparanda in both historical linguistics and dialectology, one would expect alignment analyses to play a crucial role in both disciplines. Without alignments, i.e., without the explicit matching of sounds, regular sound correspondences could not be detected, and neither cognacy between words nor genetic relationship between languages could be proven. However, although language comparison is always based on an implicit alignment of words, this alignment is rarely visualized or named as such, and in the rare cases where scholars explicitly use alignments to visualize correspondence patterns in words, they serve merely illustrative purposes.

### Basic Formats for Alignment Analyses

In order to exchange, edit, and compare phonetic alignments, different formats are used in the BDPA. We distinguish between formats for pairwise alignments and formats for multiple alignments. For practical reasons, the BDPA uses the alignment formats generally employed in LingPy. All formats are text-based and can be edited with the help of simple text editors.

The basic format for representing multiple alignment analyses is the MSA-format. Files in this format have the extension "msa". The first line of an MSA-file identifies the dataset from which the alignment was taken. There are no further format restrictions, and the user can freely choose the identifier as long as it does not exceed the first line. In the BDPA, we use the names of our subsets as dataset identifiers. The second line is reserved for an identifier of the set of aligned sound sequences, which can again be freely chosen by the user. In the BDPA, we generally use the meaning of the sound sequences as identifier, supplemented by additional information such as the ancestral form (in language families) or the orthography of the corresponding word in the standard variety (in dialect datasets).

The following lines give the phonetic sequences in aligned form, separated by a tab-stop and preceded by language identifiers (ISO-code, language name, dialect location) in the first column of the alignment matrix. The hash symbol ("#") is used as a comment character: when placed at the beginning of a line, it indicates that the line should be ignored when parsing the file.

Inspired by alignment formats in bioinformatics, LingPy allows for specific additional lines which can be used to annotate the alignments. Instances of metathesis, for example, can be represented by adding a line which starts with the keyword "SWAPS", in which a plus character ("+") marks the beginning of a swapped region, a dash character ("-") its center, and another plus character its end. All sites which are not affected by swaps contain a dot ("."). In the BDPA, 66 out of 750 multiple alignments contain instances of metathesis and are consistently annotated in the way just described. As an example, consider the file harry_potter.msa:

1 Harry Potter Testset
2 Woldemort (in different languages)
3 English     v     o     l     -     d     e     m     o     r     t
4 German.     w     a     l     -     d     e     m     a     r     -
5 Russian     v     -     l     a     d     i     m     i     r     -
6 SWAPS..     .     +     -     +     .     .     .     .     .     .
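
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not LingPy's own parser: the function names `parse_msa` and `swap_regions` are hypothetical, the input is assumed to be tab-separated and free of the display line numbers shown in the listing, trailing dots in the first column are assumed to be padding, and a swapped region is assumed to span exactly three columns ("+", "-", "+").

```python
ANNOTATION_KEYS = {"SWAPS"}  # only the keyword mentioned in the text

def parse_msa(text):
    """Parse MSA-formatted text into a dictionary (illustrative sketch)."""
    # Ignore blank lines and comment lines starting with "#".
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    msa = {"dataset": lines[0].strip(), "seq_id": lines[1].strip(),
           "taxa": [], "alignment": [], "annotations": {}}
    for line in lines[2:]:
        cells = [c.strip() for c in line.split("\t")]
        # Assumption: trailing dots only pad labels to equal width.
        label, row = cells[0].rstrip("."), cells[1:]
        if label in ANNOTATION_KEYS:
            msa["annotations"][label] = row
        else:
            msa["taxa"].append(label)
            msa["alignment"].append(row)
    return msa

def swap_regions(swaps):
    """Return (start, end) column indices of each '+ - +' region."""
    regions, i = [], 0
    while i < len(swaps) - 2:
        if swaps[i:i + 3] == ["+", "-", "+"]:
            regions.append((i, i + 2))
            i += 3
        else:
            i += 1
    return regions

example_msa = "\n".join([
    "Harry Potter Testset",
    "Woldemort (in different languages)",
    "# this comment line is skipped",
    "English\tv\to\tl\t-\td\te\tm\to\tr\tt",
    "German.\tw\ta\tl\t-\td\te\tm\ta\tr\t-",
    "Russian\tv\t-\tl\ta\td\ti\tm\ti\tr\t-",
    "SWAPS..\t.\t+\t-\t+\t.\t.\t.\t.\t.\t.",
])

msa = parse_msa(example_msa)
regions = swap_regions(msa["annotations"]["SWAPS"])
print(msa["taxa"], regions)
```

Keeping annotation rows apart from the sequence rows means the alignment matrix itself stays rectangular and contains only language data.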


In principle, the MSA-format can also be used to represent pairwise alignment analyses. However, since each MSA-file is a single text-file, we would need 7 197 different text-files to represent all sequence pairs of our master benchmark for pairwise alignment analyses. Using such a large number of text-files to represent the rather small amount of information available in pairwise alignments is not only impractical for a shared digital resource, but also very inefficient for computation.

In order to store large numbers of pairwise alignments in one and the same text-file, LingPy offers an additional format for pairwise alignment analyses. This format is called PSA-format, and files in this format have the extension "psa". As in the MSA-format, the first line of a PSA-file is reserved for an identifier that refers to the dataset from which the data was taken. The sequence pairs themselves are given in triplets: the first line of a triplet contains a sequence identifier (the meaning, or orthographical information), and the second and third lines contain the alignment matrix of the two sequences, with the language identifiers placed in the first column. All triplets (sequence pair identifier and two sequences) are separated by one empty line. As an example, consider the file harry_potter.psa:

1 Harry Potter Testset
2 Woldemort in German and Russian
3 German.     w     a     l     -     d     e     m     a     r
4 Russian     v     -     l     a     d     i     m     i     r
5
6 Woldemort in English and Russian
7 English     w     o     l     -     d     e     m     o     r     t
8 Russian     v     -     l     a     d     i     m     i     r     -
9
10 Woldemort in English and German
11 English     w     o     l     d     e     m     o     r     t
12 German.     w     a     l     d     e     m     a     r     -
13
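
The triplet structure can be read in the same spirit. The sketch below is again only an illustration under stated assumptions, not LingPy's implementation: `parse_psa` is a hypothetical name, the input is assumed to be tab-separated without the display line numbers, and trailing dots in language labels are assumed to be padding.

```python
def parse_psa(text):
    """Parse PSA-formatted text into (dataset, list of pairs)."""
    lines = text.splitlines()
    dataset, pairs, block = lines[0].strip(), [], []
    # The extra "" at the end flushes the last triplet.
    for line in lines[1:] + [""]:
        if line.strip():
            block.append(line)
            continue
        if block:
            if len(block) != 3:
                raise ValueError(
                    f"expected a triplet, got {len(block)} lines")
            rows = [[c.strip() for c in ln.split("\t")]
                    for ln in block[1:]]
            pairs.append({
                "pair_id": block[0].strip(),
                # Assumption: trailing dots only pad the labels.
                "taxa": tuple(r[0].rstrip(".") for r in rows),
                "alignment": tuple(r[1:] for r in rows),
            })
            block = []
    return dataset, pairs

example_psa = "\n".join([
    "Harry Potter Testset",
    "Woldemort in German and Russian",
    "German.\tw\ta\tl\t-\td\te\tm\ta\tr",
    "Russian\tv\t-\tl\ta\td\ti\tm\ti\tr",
    "",
    "Woldemort in English and German",
    "English\tw\to\tl\td\te\tm\to\tr\tt",
    "German.\tw\ta\tl\td\te\tm\ta\tr\t-",
    "",
])

psa_dataset, psa_pairs = parse_psa(example_psa)
for pair in psa_pairs:
    print(pair["pair_id"], pair["taxa"])
```

Because every triplet carries its own identifier, any number of sequence pairs fits into one file, which is the point of the PSA-format.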


In the BDPA, the pairwise benchmarks, as described above, are provided in PSA-format. Additionally, we extracted all possible pairwise alignments inherent in our master set of 750 multiple alignments and offer them for download in PSA-format. You can download both MSA and PSA files for each subset from here.
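
The text does not spell out how the pairwise alignments were extracted from the multiple alignments. One common way to do this, shown here purely as an assumption, is to take every pair of rows from the alignment matrix and drop the columns in which both rows contain a gap. The function name `pairwise_from_msa` is hypothetical.

```python
from itertools import combinations

def pairwise_from_msa(taxa, alignment, gap="-"):
    """Project every pair of rows out of a multiple alignment.

    Columns in which both sequences have a gap carry no information
    for the pair and are removed (an assumed, not documented, step).
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(alignment), 2):
        cols = [(x, y) for x, y in zip(a, b)
                if not (x == gap and y == gap)]
        seq_a = [x for x, _ in cols]
        seq_b = [y for _, y in cols]
        pairs.append((taxa[i], taxa[j], seq_a, seq_b))
    return pairs

taxa = ["English", "German", "Russian"]
alignment = [
    list("vol-demort"),
    list("wal-demar-"),
    list("v-ladimir-"),
]

extracted = pairwise_from_msa(taxa, alignment)
for name_a, name_b, seq_a, seq_b in extracted:
    print(name_a, " ".join(seq_a))
    print(name_b, " ".join(seq_b))
    print()
```

A multiple alignment of n sequences yields n(n-1)/2 pairs in this way, so the three-language example produces three pairwise alignments.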

### Citing BDPA

If you use this database, please cite the following paper:

• List, Johann-Mattis and Jelena Prokić. (2014). A benchmark database of phonetic alignments in historical linguistics and dialectology. In: Proceedings of the International Conference on Language Resources and Evaluation (LREC), 26-31 May 2014, Reykjavik. 288-294.

The paper can be downloaded from this link. Please make sure that you also cite all individual sources of BDPA which you are using. For example, if you use the alignments of the Bai dialects in BDPA, you should cite both original sources from which they were taken, namely:

• Wang, F. (2006): Comparison of languages in contact. The distillation method and the case of Bai. Taipei: Institute of Linguistics, Academia Sinica.
• Allen, B. (2007): Bai dialect survey. SIL International. URL: http://www.sil.org/silesr/2007/silesr2007-012.pdf

### Sources

All the sources we used to create the alignments can be found here.

### Contact

For technical questions regarding the data, please contact Johann-Mattis List (Philipps-Universität Marburg) or Jelena Prokić (Philipps-Universität Marburg).