
Data from "A Turkish-German Code-Switching Corpus", LREC 2016

 

Data and Scripts

Due to the restrictions of Twitter’s Terms of Service, we distribute the tweet IDs instead of actual tweets. We also distribute the edit transcript that converts the original tweets to edited versions, and the language identification annotation aligned with the edited version. You can download the data and scripts from here.

The annotation tool can also be downloaded from here.

Steps to Generate the Corpus

1. Download the tweets with IDs given in all_cs_id.col

2. Put the original tweets into a single file in the format

<tweetID><tab><original tweet>

On this page, this file is called all_cs_idtweet.tab
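Before running the conversion scripts, it can help to check that the file really follows the `<tweetID><tab><original tweet>` layout and to see which IDs have no tweet. The following is a minimal sketch and not part of the distributed scripts. It assumes that all_cs_id.col contains one tweet ID per line; the function names are hypothetical.

```python
def parse_id_tweet_line(line):
    """Split one '<tweetID><tab><original tweet>' line into (tweet_id, text)."""
    tweet_id, sep, text = line.rstrip("\n").partition("\t")
    # Twitter IDs are numeric; a missing tab means the line is malformed.
    if not sep or not tweet_id.isdigit():
        raise ValueError("not in <tweetID><tab><tweet> format: %r" % line)
    return tweet_id, text


def find_missing_ids(id_lines, tab_lines):
    """Return the IDs from the ID list that have no line in the tab file.

    Assumption: the ID list holds one tweet ID per line.
    """
    wanted = [line.strip() for line in id_lines if line.strip()]
    found = {parse_id_tweet_line(line)[0] for line in tab_lines if line.strip()}
    return [tweet_id for tweet_id in wanted if tweet_id not in found]


def check_files(id_path="all_cs_id.col", tab_path="all_cs_idtweet.tab"):
    """Convenience wrapper that reads both files as UTF-8."""
    with open(id_path, encoding="utf-8") as ids, \
            open(tab_path, encoding="utf-8") as tab:
        return find_missing_ids(ids, tab)
```

Calling `check_files()` in the directory that holds both files returns the list of IDs for which no tweet was found, for example tweets that could no longer be downloaded.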

3. Convert original tweets to edited versions with the following script

perl convert_original_to_edited.pl -i all_cs_idtweet.tab -t all_cs_transcript.tab > all_cs_idtweet_norm.tab

4. Convert the one-tweet-per-line format to the one-token-per-line format.

perl convert_tab_to_col.pl -i all_cs_idtweet_norm.tab -o all_cs_idtweet_norm.col

5. Merge the tweets in all_cs_idtweet_norm.col with the language IDs in all_cs_id_langid.col

paste all_cs_idtweet_norm.col all_cs_id_langid.col | cut -f 1,2,4 | sed $'s/^\t*$//g' > all_cs_idtweet_norm_langid.col
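For readers without paste, cut, and sed at hand, the same merge can be sketched in Python. This is only an illustration of what the pipeline above does, not a distributed script. The column layout (tweet ID and token in the first file, tweet ID and language ID in the second) is an assumption inferred from `cut -f 1,2,4`, and the function names are hypothetical.

```python
from itertools import zip_longest


def merge_lines(token_lines, langid_lines):
    """Mimic: paste A B | cut -f 1,2,4 | sed $'s/^\\t*$//g'.

    Both arguments are lists of lines without trailing newlines.
    """
    merged = []
    # paste joins line i of both files with a tab and pads the shorter file.
    for left, right in zip_longest(token_lines, langid_lines, fillvalue=""):
        fields = (left + "\t" + right).split("\t")
        # cut -f 1,2,4 keeps fields 1, 2 and 4 when they exist.
        kept = [fields[i] for i in (0, 1, 3) if i < len(fields)]
        line = "\t".join(kept)
        # sed turns lines that consist only of tabs into empty lines.
        if line.strip("\t") == "":
            line = ""
        merged.append(line)
    return merged


def merge_files(token_path, langid_path, out_path):
    """Read both column files as UTF-8 and write the merged file."""
    with open(token_path, encoding="utf-8") as f:
        token_lines = f.read().splitlines()
    with open(langid_path, encoding="utf-8") as f:
        langid_lines = f.read().splitlines()
    with open(out_path, "w", encoding="utf-8") as out:
        for line in merge_lines(token_lines, langid_lines):
            out.write(line + "\n")
```

For example, `merge_files("all_cs_idtweet_norm.col", "all_cs_id_langid.col", "all_cs_idtweet_norm_langid.col")` would write the merged file. Empty separator lines in the inputs stay empty in the output, which is what the sed step takes care of in the shell version.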

For questions please contact Ozlem Cetinoglu.

Language Identification Guidelines

Tag Set

  • TR: Turkish, e.g., ben ‘I’.
  • DE: German, e.g., komisch ‘funny’.
  • LANG3: Third language, e.g., ‘no way’.
  • MIXED: Intra-word CS, e.g., traurigim ‘I am sad’.
  • NE: Named entity, e.g., Bern, Ankara, DW (German international broadcaster), Kanal D (Turkish TV channel).
  • AMBIG: Ambiguous, i.e., words that exist in both languages and cannot be disambiguated by the given context.
  • OTHER: Punctuation, numbers, emoticons, symbols, and any token that cannot be classified with previous labels, e.g., ‘RT’.

Tool

Run the annotation tool with the command below

python identify_language.py <name>

Replace <name> with your own name so that the program shows only your tweets.

The tool lets you choose one of the 7 tags from the tag set, or the tag FLAG for cases where you are unsure and want to decide later.

If you make a mistake, you can go back one token by typing 'b'. You can type the tags in uppercase or lowercase, since the tool is not case sensitive, but you cannot abbreviate them. You can use the up arrow key to recall previously used tags instead of typing them again. To exit the program, type 'quit'. If you quit in the middle of a tweet, no .langid file is created for that tweet and you lose the part you have already annotated. It is therefore better to quit as soon as you start a new tweet.
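The input rules described above can be illustrated with a small sketch. This is not the code of identify_language.py; it only shows one way such input could be interpreted. The function name and the returned action labels are hypothetical, and treating 'b' and 'quit' as case insensitive is an assumption. Compound labels such as NE.TR, which appear in the guidelines, are not handled here.

```python
# The 7 tags of the tag set plus FLAG for undecided cases.
VALID_TAGS = ("TR", "DE", "LANG3", "MIXED", "NE", "AMBIG", "OTHER", "FLAG")


def interpret_input(raw):
    """Map one line of annotator input to an (action, tag) pair.

    Actions: "back" for 'b', "quit" for 'quit', "tag" for a full tag name
    in any case, and "invalid" for everything else, including abbreviations.
    """
    text = raw.strip()
    lowered = text.lower()
    if lowered == "b":
        return ("back", None)
    if lowered == "quit":
        return ("quit", None)
    upper = text.upper()
    if upper in VALID_TAGS:
        return ("tag", upper)
    return ("invalid", None)
```

With this sketch, `interpret_input("mixed")` yields the tag MIXED, while an abbreviation such as `"amb"` is rejected as invalid.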

Guidelines

Use the guidelines in the appendix of http://www.aclweb.org/anthology/W15-1608. Note that some of our rules may differ; in cases of contradiction, follow our rules. Pay attention to the following cases:

* Interjections like ay, oh, eh, … are annotated with either TR or DE. For instance:

oh TR
ne TR
güzel TR

oh DE
mein DE
Gott DE

 

* The tag for numbers is OTHER.

ilk TR
11 OTHER
Garantie DE

 

* Proper names, whether Turkish or German and whether or not they carry an additional morpheme, get the tag NE, followed by the language ID.

Frankfurter NE.DE
Allgemeine§'nin NE.MIXED
#mülteci TR
yorumu TR
Istanbul'da NE.TR
Google NE.LANG3
Taner NE.TR
Merkel NE.DE

 

* If a word is not a proper name and a Turkish suffix is added to a German word, the tag is MIXED.

Nerde TR
3 OTHER
Semester§dayım MIXED
daha TR

 

* re or RE is a Twitter-specific token and has to be tagged OTHER.

re OTHER
: OTHER
[url] OTHER

 

* Hashtags are annotated depending on which language they belong to. For instance:

#ciddiyim	TR
#happy		LANG3

 

* If a proper name is identical in German and Turkish, check the context. If you cannot decide, tag it with AMBIG.

@username OTHER
@username OTHER
Ben TR
Stuttgart'tayım NE.TR --> it is not MIXED since the context is Turkish
. OTHER
Benim TR
de TR
Sprachdiplom DE
vardı TR
ama TR
yine TR
de TR
gittim TR
kursa TR
. OTHER
Ökumenisches NE.DE
Zentrum NE.DE
- OTHER
Uni NE.DE
Ökumenisches NE.DE
Stuttgart NE.DE --> It is German since Uni Stuttgart is a German expression
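
Some of the rules above are mechanical: numbers, re/RE, and punctuation are always OTHER, and in the examples the same holds for @username and [url]. A finished annotation could be screened for slips on these tokens with a small script. The following is a heuristic sketch, not part of the distributed tool. The function names are hypothetical, and treating RT, @mentions, and [url] as always OTHER is an assumption based on the tag set description and the examples.

```python
import unicodedata


def must_be_other(token):
    """Return True if the guidelines leave no choice but OTHER for the token."""
    if token.isdigit():                 # numbers, e.g. 11
        return True
    if token.lower() in ("re", "rt"):   # Twitter-specific tokens
        return True
    if token.startswith("@") and len(token) > 1:   # mentions (assumption)
        return True
    if token == "[url]":                # URL placeholder (assumption)
        return True
    # Tokens made up only of punctuation (P*) or symbol (S*) characters.
    return bool(token) and all(
        unicodedata.category(ch)[0] in ("P", "S") for ch in token
    )


def find_violations(rows):
    """Return the (token, tag) pairs that should be OTHER but are not."""
    return [(tok, tag) for tok, tag in rows
            if must_be_other(tok) and tag != "OTHER"]
```

For instance, `find_violations([("ilk", "TR"), ("11", "TR")])` reports the pair `("11", "TR")`. Hashtags such as #ciddiyim are not flagged, since they contain letters and are tagged by language.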