The Corpus Described in "A Code-Switching Corpus of Turkish-German Conversations", LAW 2017

 

The Data Description

We present a corpus collected from Turkish-German bilingual speakers. It is annotated with sentence and code-switching boundaries in both the audio files and their corresponding transcriptions, which are provided as verbal and normalised tiers. In total, the corpus contains 5 hours of speech and 3614 sentences.

The Data Distribution

Transcriptions are available for academic research purposes. Audio files will be manipulated before distribution to conceal the speakers' identities, in compliance with the German national and state data privacy laws.

The Data Format

The audio files are transcribed and annotated using the Praat tool, and its output is stored in .TextGrid files. An example screenshot of the Praat transcriptions and annotations is given below:

[Figure: Praat screenshot]

For a more human-readable format, we converted the .TextGrid files into a one-sentence-per-line format. An excerpt from a converted file is given below. The first three sentences correspond to the spk1_norm tier of the Praat screenshot.

spk2: <SB><DE> Okay , <WCS><TR> şimdi yaptığınız dersin adı ne ?
spk1: <SB><TR> Vallaha şimdi yaptığımız ders <WCS><DE> ist Fertigungsverfahren .
spk1: <SCS><TR> Ehm bizim haftaya <WCS><DE> Hausübung Abgabe <§><TR> miz var .
spk1: <SB><TR> Ve ehm birkaç <WCS><DE> Aufgabe <§><TR> ler yaptık *arkadaşnan ehm .
spk1: <SCS><DE> Wir sind halt nicht weiter gekommen , <WCS><TR> çünkü bazı ehm yerlerde <WCS><DE> sind wir hängen geblieben eh .
spk1: <SCS><TR> Haftaya kadar yetişmesi lazım .
spk1: <SCS><DE> Deswegen <WCS><TR> bayağı bir eh başında oturduk ama anlamadığımız için .
spk1: <SB><TR> İnşallah eh <WCS><DE> werden es morgen noch fortführen nach der Gruppenbesprechung .
spk1: <SCS><TR> Ehm şimdi başka bir <WCS><DE> Übung <§><TR> a geçtim .
spk1: <SCS><DE> Ich muss zeichnen .
spk1: <SCS><TR> Eh bunun da <WCS><DE> Abgabe <§><TR> si var .
spk1: <SCS><DE> Ehm diese Woche müssen wir wirklich sehr viel abgeben .
spk1: <SCS><TR> Bayağı bir ehm <WCS><DE> Abgabe <§><TR> lerimiz olduğu için bu hafta çok yoğun geçiyor bizim için .

There are two consecutive tags at each boundary point, each enclosed in angle brackets. The first tag corresponds to the codesw tier in the Praat annotation and indicates a sentence boundary (SB), an intersentential code-switching point (SCS), an intrasentential code-switching point (WCS), or an intra-word code-switching point (§). The second tag corresponds to the lang tier in the Praat annotation and takes one of three values: Turkish (TR), German (DE), or third language (LANG3).
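As an illustration of the tag scheme described above, the following is a minimal Python sketch that splits one line of the sentence format into its speaker label and tagged segments. It is not the authors' tooling; the names `Segment`, `TAG_PAIR`, and `parse_line` are hypothetical, and the sketch assumes every segment starts with exactly one boundary tag followed by one language tag, as in the excerpt.

```python
import re
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    boundary: str  # SB, SCS, WCS, or § (codesw tier)
    lang: str      # TR, DE, or LANG3 (lang tier)
    text: str      # tokens up to the next boundary point


# Assumption: a boundary tag is always immediately followed by a language tag.
TAG_PAIR = re.compile(r"<(SB|SCS|WCS|§)><(TR|DE|LANG3)>")


def parse_line(line: str) -> Tuple[str, List[Segment]]:
    """Split 'spkN: <X><Y> text ...' into (speaker, [Segment, ...])."""
    speaker, _, rest = line.partition(":")
    matches = list(TAG_PAIR.finditer(rest))
    segments = []
    for i, m in enumerate(matches):
        # A segment's text runs until the next tag pair or the end of the line.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(rest)
        segments.append(Segment(m.group(1), m.group(2), rest[m.end():end].strip()))
    return speaker.strip(), segments
```

For example, applying `parse_line` to the third line of the excerpt yields the speaker `spk1` and three segments: a Turkish segment opened by SCS, a German segment opened by WCS, and a Turkish suffix segment opened by §.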

.TextGrid files are first converted to the tabular .Table format within Praat. The conversion script that reads .Table files and creates the sentence format above is given here.  
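For readers who want to inspect the intermediate .Table files themselves, a rough Python sketch of a first reading step is given below. It is not the conversion script mentioned above. It assumes the table was saved from Praat as a tab-separated text file with the header columns `tmin`, `tier`, `text`, and `tmax`; the function name `read_praat_table` is hypothetical, and the column names and file encoding should be checked against the actual files.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def read_praat_table(lines: Iterable[str]) -> Dict[str, List[Tuple[float, float, str]]]:
    """Group rows of a tab-separated Praat table by tier name.

    `lines` is any iterable of text lines, e.g. an open file object.
    Assumed header columns: tmin, tier, text, tmax.
    """
    tiers = defaultdict(list)
    # QUOTE_NONE keeps quotation marks in the transcriptions as literal text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        tiers[row["tier"]].append(
            (float(row["tmin"]), float(row["tmax"]), row["text"])
        )
    return dict(tiers)
```

A file can be passed in with `open(path, encoding=..., newline="")`, where the encoding depends on the Praat text-writing settings used when the table was saved. The returned dictionary maps each tier name (such as codesw, lang, or spk1_norm) to its list of `(tmin, tmax, text)` rows in file order.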

For questions and licence requests, please contact Özlem Çetinoğlu.