Data from "A Turkish-German Code-Switching Corpus", LREC 2016

Data and Scripts

Due to the restrictions of Twitter’s Terms of Service, we distribute the tweet IDs instead of actual tweets. We also distribute the edit transcript that converts the original tweets to edited versions, and the language identification annotation aligned with the edited version. You can download the data and scripts from here.

The annotation tool can also be downloaded from here.

Steps to Generate the Corpus

1. Download the tweets with IDs given in all_cs_id.col

2. Put the original tweets into a single file in the format

<tweetID><tab><original tweet>

On this page, this file is called all_cs_idtweet.tab
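Step 2 could be sketched in Python as follows, assuming the downloaded tweets are already available as (ID, text) pairs; how you obtain them depends on your download tool. The names format_idtweet_line and write_idtweet_tab are illustrative and not part of the distributed scripts. Flattening tabs and line breaks inside a tweet to spaces is also an assumption made here to keep one tweet per line; this page does not say how the edit transcript expects such characters to be handled, so verify the output of step 3.

```python
def format_idtweet_line(tweet_id, text):
    """Return one '<tweetID><tab><original tweet>' line (no trailing newline)."""
    # Assumption: embedded line breaks and tabs are replaced by single spaces
    # so that each tweet occupies exactly one line of the file.
    flat = (
        text.replace("\r\n", " ")
        .replace("\n", " ")
        .replace("\r", " ")
        .replace("\t", " ")
    )
    return f"{tweet_id}\t{flat}"


def write_idtweet_tab(tweets, path="all_cs_idtweet.tab"):
    """Write an iterable of (tweet_id, text) pairs to `path`, one per line."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for tweet_id, text in tweets:
            out.write(format_idtweet_line(tweet_id, text) + "\n")
```

A call such as write_idtweet_tab(pairs) then produces the file used as input to step 3.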

3. Convert original tweets to edited versions with the following script

perl convert_original_to_edited.pl -i all_cs_idtweet.tab -t all_cs_transcript.tab > all_cs_idtweet_norm.tab

4. Convert the one-line-per-tweet format to the one-token-per-line format

perl convert_tab_to_col.pl -i all_cs_idtweet_norm.tab -o all_cs_idtweet_norm.col

5. Merge the tweets in all_cs_idtweet_norm.col with the language IDs in all_cs_id_langid.col

paste all_cs_idtweet_norm.col all_cs_id_langid.col | cut -f 1,2,4 | sed $'s/^\t*$//g' > all_cs_idtweet_norm_langid.col
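For readers without paste, cut, and sed, the merge in step 5 can be approximated in Python. This is a sketch under stated assumptions rather than the authors' script: it assumes that every token line in both .col files has exactly two tab-separated columns (inferred from cut -f 1,2,4) and that blank lines separate tweets. The function names merge_lines and merge_files are hypothetical. Unlike the shell pipeline, it raises an error when the two files are not aligned line by line.

```python
from itertools import zip_longest


def merge_lines(token_lines, langid_lines):
    """Yield merged lines: both token-file columns plus the language ID."""
    pairs = zip_longest(token_lines, langid_lines)
    for n, (tok, lid) in enumerate(pairs, start=1):
        if tok is None or lid is None:
            raise ValueError(f"files differ in length at line {n}")
        tok = tok.rstrip("\r\n")
        lid = lid.rstrip("\r\n")
        if not tok.strip() and not lid.strip():
            yield ""  # blank line in both files: tweet boundary
            continue
        if not tok.strip() or not lid.strip():
            raise ValueError(f"blank line in only one file at line {n}")
        tok_cols = tok.split("\t")
        lid_cols = lid.split("\t")
        # Assumption: two columns per file, as implied by `cut -f 1,2,4`.
        if len(tok_cols) != 2 or len(lid_cols) != 2:
            raise ValueError(f"expected two columns per file at line {n}")
        yield "\t".join(tok_cols + [lid_cols[1]])


def merge_files(tokens_path, langid_path, out_path):
    """File-based wrapper around merge_lines."""
    with open(tokens_path, encoding="utf-8") as tok_f, \
         open(langid_path, encoding="utf-8") as lid_f, \
         open(out_path, "w", encoding="utf-8", newline="\n") as out:
        for line in merge_lines(tok_f, lid_f):
            out.write(line + "\n")
```

Calling merge_files("all_cs_idtweet_norm.col", "all_cs_id_langid.col", "all_cs_idtweet_norm_langid.col") would then write the merged file.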

For questions please contact Ozlem Cetinoglu.

Language Identification Guidelines

Tag Set

TR: Turkish, e.g., ben ‘I’.
DE: German, e.g., komisch ‘funny’.
LANG3: Third language, e.g., ‘no way’.
MIXED: Intra-word CS, e.g., traurigim ‘I am sad’.
NE: Named entity, e.g., Bern, Ankara, DW (German international broadcaster), Kanal D (Turkish TV channel).
AMBIG: Ambiguous, i.e., words that exist in both languages and cannot be disambiguated by the given context.
OTHER: Punctuation, numbers, emoticons, symbols, and any token that cannot be classified with previous labels, e.g., ‘RT’.

Tool

Run the annotation tool with the command below

python identify_language.py <name>

Replace <name> with your own name so that the program shows only your tweets.

The tool lets you choose one of the 7 tags from the tag set. If you are unsure about a token, you can use the tag FLAG and decide later.

If you make a mistake, you can go back one token by typing 'b'. You can type the tags in uppercase or lowercase; the tool is not case sensitive. You cannot abbreviate the tags, though. You can use the up arrow key to recall previously used tags instead of typing them again. To exit the program, type 'quit'. If you quit in the middle of a tweet, no .langid file is created for that tweet and you lose the part you have already annotated. It is therefore better to quit right when a new tweet starts.

Guidelines

Use the guidelines in the appendix of http://www.aclweb.org/anthology/W15-1608. Note that some of our rules differ; where they contradict, follow our rules. Pay attention to the following cases:

* Interjections such as ay, oh, eh, ... are annotated with either TR or DE, following the language of their context. For instance:

oh		TR
ne		TR 
güzel		TR

oh		DE
mein		DE
Gott		DE

* The tag for numbers is OTHER

ilk		TR
11		OTHER
Garantie	DE

* Proper names, whether Turkish or German and whether or not a suffix is attached, get the tag NE, followed by their language ID

Frankfurter 	NE.DE
Allgemeine§'nin	NE.MIXED
#mülteci	TR
yorumu		TR

Istanbul'da	NE.TR

Google		NE.LANG3

Taner		NE.TR

Merkel		NE.DE

* If a Turkish suffix is attached to a German word that is not a proper name, the tag is MIXED

Nerde		TR
3		OTHER
Semester§dayım	MIXED
daha		TR

* re or RE is a Twitter-specific token and has to be tagged OTHER

re		OTHER
:		OTHER
[url]		OTHER

* Hashtags are annotated depending on which language they belong to. For instance:

#ciddiyim	TR

#happy		LANG3

* Check the context if a proper name is identical in German and Turkish. If you cannot decide, tag it with AMBIG.

@username	OTHER
@username	OTHER
Ben		TR 
Stuttgart'tayım	NE.TR  --> it is not MIXED since the context is Turkish
.		OTHER 
Benim		TR 
de		TR 
Sprachdiplom	DE 
vardı		TR 
ama		TR 
yine		TR 
de		TR 
gittim		TR 
kursa		TR 
.		OTHER 
Ökumenisches	NE.DE 
Zentrum		NE.DE 
-		OTHER 
Uni		NE.DE
Stuttgart	NE.DE  --> It is German since Uni Stuttgart is a German expression
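A finished annotation can be checked against the tag set with a few lines of Python. The sketch below is not part of the distributed tool; the names BASE_TAGS, NE_LANGS, is_valid_tag, and find_invalid_tags are hypothetical. It assumes that tags are stored in uppercase and that the only language IDs attached to NE are the ones appearing in the examples above (TR, DE, LANG3, MIXED). The temporary tag FLAG is deliberately rejected so that undecided tokens are reported.

```python
# The seven tags from the tag set.
BASE_TAGS = {"TR", "DE", "LANG3", "MIXED", "NE", "AMBIG", "OTHER"}

# Assumption: language IDs observed after "NE." in the guideline examples.
NE_LANGS = {"TR", "DE", "LANG3", "MIXED"}


def is_valid_tag(tag):
    """Return True for a base tag or an 'NE.<language ID>' combination."""
    if tag in BASE_TAGS:
        return True
    head, sep, lang = tag.partition(".")
    return head == "NE" and sep == "." and lang in NE_LANGS


def find_invalid_tags(tags):
    """Return (position, tag) pairs (1-based) for tags that fail the check."""
    return [(i, t) for i, t in enumerate(tags, start=1) if not is_valid_tag(t)]
```

For example, find_invalid_tags(["TR", "NE.DE", "FLAG"]) reports the third tag, which still needs a decision.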
