Data from "A Turkish-German Code-Switching Corpus", LREC 2016
Data and Scripts
Due to the restrictions of Twitter’s Terms of Service, we distribute the tweet IDs instead of actual tweets. We also distribute the edit transcript that converts the original tweets to edited versions, and the language identification annotation aligned with the edited version. You can download the data and scripts from here.
The annotation tool can also be downloaded from here.
Steps to Generate the Corpus
1. Download the tweets with IDs given in all_cs_id.col
2. Put the original tweets into a single file in the format
On this page, this file is called all_cs_idtweet.tab
3. Convert original tweets to edited versions with the following script
perl convert_original_to_edited.pl -i all_cs_idtweet.tab -t all_cs_transcript.tab > all_cs_idtweet_norm.tab
4. Convert the format one-line-per-tweet to the format one-token-per-line.
perl convert_tab_to_col.pl -i all_cs_idtweet_norm.tab -o all_cs_idtweet_norm.col
5. Merge the tweets in all_cs_idtweet_norm.col with the language IDs in all_cs_id_langid.col
paste all_cs_idtweet_norm.col all_cs_id_langid.col | cut -f 1,2,4 | sed $'s/^\t*$//g' > all_cs_idtweet_norm_langid.col
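For readers who want to see what the step 5 pipeline does, here is a minimal Python sketch of the same paste, cut and sed sequence. The function name merge_columns is hypothetical and is not part of the distributed scripts. The sketch also assumes that each .col file has two tab-separated columns per token line and an empty line between tweets, which is what the cut field list and the sed cleanup suggest but the page does not state.

```python
from itertools import zip_longest


def merge_columns(norm_lines, langid_lines):
    """Mimic: paste A B | cut -f 1,2,4 | sed $'s/^\\t*$//g' (illustrative only)."""
    merged = []
    for left, right in zip_longest(norm_lines, langid_lines, fillvalue=""):
        # paste: join the corresponding lines of both files with a tab
        pasted = left.rstrip("\n") + "\t" + right.rstrip("\n")
        fields = pasted.split("\t")
        # cut -f 1,2,4: keep fields 1, 2 and 4, skipping any that are absent
        kept = [fields[i] for i in (0, 1, 3) if i < len(fields)]
        line = "\t".join(kept)
        # sed: lines consisting only of tabs become empty lines
        if line.strip("\t") == "":
            line = ""
        merged.append(line)
    return merged


# Assumed layout: ID<TAB>token in the first file, ID<TAB>tag in the second
norm = ["1\tben", "1\ttraurigim", "", "2\tkomisch"]
langid = ["1\tTR", "1\tMIXED", "", "2\tDE"]
for row in merge_columns(norm, langid):
    print(row)
```

The shell one-liner remains the reference. The sketch only makes visible that the third pasted column (a repeat of the first under the assumed layout) is dropped and that separator lines stay empty.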
For questions please contact Ozlem Cetinoglu.
Language Identification Guidelines
- TR: Turkish, e.g., ben ‘I’.
- DE: German, e.g., komisch ‘funny’.
- LANG3: Third language, e.g., ‘no way’.
- MIXED: Intra-word CS, e.g., traurigim ‘I am sad’.
- NE: Named entity, e.g., Bern, Ankara, DW (German international broadcaster), Kanal D (Turkish TV channel).
- AMBIG: Ambiguous, i.e., words that exist in both languages and cannot be disambiguated by the given context.
- OTHER: Punctuation, numbers, emoticons, symbols, and any token that cannot be classified with previous labels, e.g., ‘RT’.
Run the annotation tool with the command below
python identify_language.py <name>
Replace <name> with your own name so that the program shows only your tweets.
The tool lets you choose one of the 7 tags from the tag set. You can also use the tag FLAG for cases where you are not sure and want to decide later.
If you make a mistake, you can go back one token by typing 'b'. You can type the tags in uppercase or lowercase, since the tool is not case sensitive, but you cannot abbreviate them. You can use the up arrow key to recall previously used tags instead of typing them again. To exit the program, type 'quit'. If you quit in the middle of a tweet, no .langid file is created for that tweet and you lose the part you have already annotated. It is therefore better to quit right when a new tweet starts.
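The input rules above can be summarised in a small Python sketch. This is not the code of identify_language.py. The function name interpret_input and its return values are hypothetical. Accepting the compound labels NE.TR and NE.DE as typed input is also an assumption, based only on the examples in these guidelines.

```python
# Hypothetical sketch of the input rules; not taken from identify_language.py.
BASE_TAGS = {"TR", "DE", "LANG3", "MIXED", "NE", "AMBIG", "OTHER"}
# Assumption: named entities may be entered with a language suffix.
NE_LANG_TAGS = {"NE.TR", "NE.DE"}
VALID_TAGS = BASE_TAGS | NE_LANG_TAGS | {"FLAG"}


def interpret_input(text):
    """Classify one line of annotator input as a command, a tag, or invalid."""
    entry = text.strip()
    if entry == "b":
        return ("back", None)   # go back one token
    if entry == "quit":
        return ("quit", None)   # leave the program
    tag = entry.upper()         # tags are not case sensitive
    if tag in VALID_TAGS:       # exact match only, so no abbreviations
        return ("tag", tag)
    return ("invalid", entry)


print(interpret_input("mixed"))
print(interpret_input("mix"))
```

Under these assumptions, 'mixed' is accepted as MIXED, while the abbreviation 'mix' is rejected.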
Use the guidelines in the appendix of http://www.aclweb.org/anthology/W15-1608. Note that some of our rules differ from them. Where the two contradict each other, follow our rules. Pay attention to the following cases:
* Interjections like ay, oh, eh,… are annotated with either TR or DE. For instance:
* The tag for numbers is OTHER
* Proper names, whether Turkish or German and whether or not they carry a suffix, get the tag NE followed by the language ID, e.g., NE.TR or NE.DE
* If a Turkish suffix is added to a German word that is not a proper name, the tag is MIXED
* re or RE is a Twitter-specific token and has to be tagged OTHER
* Hashtags are annotated depending on which language they belong to. For instance:
* Check the context if a proper name is identical in German and Turkish. If you cannot decide, tag it with AMBIG.
Stuttgart'tayım NE.TR --> it is not MIXED since the context is Turkish
Stuttgart NE.DE --> It is German since Uni Stuttgart is a German expression
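A few of the rules above do not depend on context and could be checked mechanically before manual annotation. The following Python sketch illustrates this. The function rule_based_tag is hypothetical and is not part of the annotation tool. Treating every token made up solely of Unicode punctuation, symbol or number characters as OTHER is an assumption about how "punctuation, numbers, emoticons, symbols" could be approximated.

```python
import unicodedata

# Twitter-specific tokens that the guidelines assign to OTHER
TWITTER_TOKENS = {"rt", "re"}


def rule_based_tag(token):
    """Return 'OTHER' for context-free cases, else None (needs an annotator)."""
    if not token:
        return None
    if token.lower() in TWITTER_TOKENS:
        return "OTHER"
    # Assumption: a token made only of punctuation (P*), symbol (S*) or
    # number (N*) characters is covered by the OTHER label.
    if all(unicodedata.category(ch)[0] in ("P", "S", "N") for ch in token):
        return "OTHER"
    return None


for tok in ["RT", "2016", ":)", "ben", "Stuttgart'tayım"]:
    print(tok, rule_based_tag(tok))
```

Everything that returns None, including named entities, interjections and hashtags, still needs a decision made in context.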