Data from "A Turkish-German Code-Switching Corpus", LREC 2016
Data and Scripts
Due to the restrictions of Twitter’s Terms of Service, we distribute the tweet IDs instead of actual tweets. We also distribute the edit transcript that converts the original tweets to edited versions, and the language identification annotation aligned with the edited version. You can download the data and scripts from here.
The annotation tool can also be downloaded from here.
Steps to Generate the Corpus
1. Download the tweets with IDs given in all_cs_id.col
2. Put the original tweets into a single file in the format
<tweetID><tab><original tweet>
On this page, this file is called all_cs_idtweet.tab
3. Convert original tweets to edited versions with the following script
perl convert_original_to_edited.pl -i all_cs_idtweet.tab -t all_cs_transcript.tab > all_cs_idtweet_norm.tab
4. Convert the one-line-per-tweet format to the one-token-per-line format
perl convert_tab_to_col.pl -i all_cs_idtweet_norm.tab -o all_cs_idtweet_norm.col
5. Merge the tweets in all_cs_idtweet_norm.col with the language IDs in all_cs_id_langid.col
paste all_cs_idtweet_norm.col all_cs_id_langid.col | cut -f 1,2,4 | sed $'s/^\t*$//g' > all_cs_idtweet_norm_langid.col
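For readers without paste, cut and sed, step 5 can be approximated in Python. This is a minimal sketch, not part of the distributed scripts. It assumes that both .col files have two tab-separated columns with the same key (the tweet ID) in the first column, and that tweets are separated by empty lines; the `cut -f 1,2,4` call suggests this layout but the page does not state it. The function name `merge_columns` is hypothetical. Unlike the shell pipeline, the sketch also checks that the first columns agree and stops if they do not.

```python
from itertools import zip_longest


def merge_columns(token_lines, langid_lines):
    """Join key/token lines with key/langid lines into key, token, langid.

    Assumption: each non-empty line has exactly two tab-separated fields
    and empty lines separate tweets in both inputs.
    """
    merged = []
    pairs = zip_longest(token_lines, langid_lines, fillvalue="")
    for line_no, (tok_line, lang_line) in enumerate(pairs, start=1):
        tok_fields = tok_line.rstrip("\n").split("\t")
        lang_fields = lang_line.rstrip("\n").split("\t")

        # A separator line in both files stays an empty line.
        if tok_fields == [""] and lang_fields == [""]:
            merged.append("")
            continue

        if len(tok_fields) != 2 or len(lang_fields) != 2:
            raise ValueError(f"line {line_no}: expected two columns in both files")
        if tok_fields[0] != lang_fields[0]:
            raise ValueError(f"line {line_no}: first columns differ, files are misaligned")

        merged.append("\t".join([tok_fields[0], tok_fields[1], lang_fields[1]]))
    return merged
```

Called with the lines of all_cs_idtweet_norm.col and all_cs_id_langid.col, the function returns the lines to write to all_cs_idtweet_norm_langid.col. A misalignment error usually means that a tweet could not be downloaded or was tokenized differently from the annotated version.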
For questions please contact Ozlem Cetinoglu.
Language Identification Guidelines
Tag Set
- TR: Turkish, e.g., ben ‘I’.
- DE: German, e.g., komisch ‘funny’.
- LANG3: Third language, e.g., ‘no way’.
- MIXED: Intra-word CS, e.g., traurigim ‘I am sad’.
- NE: Named entity, e.g., Bern, Ankara, DW (German international broadcaster), Kanal D (Turkish TV channel).
- AMBIG: Ambiguous, i.e., words that exist in both languages and cannot be disambiguated by the given context.
- OTHER: Punctuation, numbers, emoticons, symbols, and any token that cannot be classified with previous labels, e.g., ‘RT’.
Tool
Run the annotation tool with the command below
python identify_language.py <name>
Replace <name> with your own name so that the program shows only your tweets.
The tool lets you choose one of the 7 tags from the tag set. You can also use the tag FLAG for cases where you are not sure, and decide later.
If you make a mistake, you can go back one token by typing 'b'. You can type the tags in uppercase or lowercase; the tool is not case sensitive. You cannot abbreviate the tags, though. You can use the up arrow key to recall previously used tags instead of typing them again. To exit the program, type 'quit'. If you quit in the middle of a tweet, no .langid file is created for that tweet and you lose the part you have already annotated. It is therefore better to quit right after starting a new tweet.
Guidelines
Use the guidelines in the appendix of http://www.aclweb.org/anthology/W15-1608. Some of our rules may differ; in case of contradiction, follow our rules. Pay attention to the following cases:
* Interjections such as ay, oh, eh, ... are annotated with either TR or DE, depending on the language of the surrounding words. For instance:
oh TR
ne TR
güzel TR
oh DE
mein DE
Gott DE
* The tag for numbers is OTHER
ilk TR
11 OTHER
Garantie DE
* Proper names, whether Turkish or German, get the tag NE whether or not they carry an additional morpheme. The language ID is then appended to the tag:
Frankfurter NE.DE
Allgemeine§'nin NE.MIXED
#mülteci TR
yorumu TR
Istanbul'da NE.TR
Google NE.LANG3
Taner NE.TR
Merkel NE.DE
* If it is not a proper name, and a Turkish suffix is added to a German word, the tag is MIXED
Nerde TR
3 OTHER
Semester§dayım MIXED
daha TR
* re or RE is a Twitter-specific token and has to be tagged OTHER
re OTHER
: OTHER
[url] OTHER
* Hashtags are annotated depending on which language they belong to. For instance:
#ciddiyim TR
#happy LANG3
* Check the context if a proper name is identical in German and Turkish. If you cannot decide, tag it with AMBIG.
@username OTHER
@username OTHER
Ben TR
Stuttgart'tayım NE.TR --> it is not MIXED since the context is Turkish
. OTHER
Benim TR
de TR
Sprachdiplom DE
vardı TR
ama TR
yine TR
de TR
gittim TR
kursa TR
. OTHER
Ökumenisches NE.DE
Zentrum NE.DE
- OTHER
Uni NE.DE
Ökumenisches NE.DE
Stuttgart NE.DE --> It is German since Uni Stuttgart is a German expression
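The tags used in the examples above can be checked mechanically. The following Python sketch is an illustration only and is not taken from identify_language.py. It accepts the 7 tags of the tag set plus NE combined with a language ID. The set of language IDs allowed after NE (TR, DE, LANG3, MIXED) is an assumption drawn from the examples on this page, the name `normalize_tag` is hypothetical, and the temporary FLAG tag is deliberately not accepted.

```python
# The 7 tags of the tag set.
BASE_TAGS = {"TR", "DE", "LANG3", "MIXED", "NE", "AMBIG", "OTHER"}

# Assumption: language IDs seen after "NE." in the examples.
NE_LANGUAGES = {"TR", "DE", "LANG3", "MIXED"}


def normalize_tag(raw):
    """Return the canonical uppercase tag, or None if the tag is not valid.

    Tags are matched case-insensitively and must be written in full,
    mirroring the rule that tags cannot be abbreviated.
    """
    tag = raw.strip().upper()
    if tag in BASE_TAGS:
        return tag
    head, sep, lang = tag.partition(".")
    if sep and head == "NE" and lang in NE_LANGUAGES:
        return tag
    return None
```

For example, `normalize_tag("ne.tr")` returns `"NE.TR"`, while an abbreviated tag such as `"T"` or a leftover `"FLAG"` returns `None` and can be reported for manual correction.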