Data from "Part of Speech Annotation of a Turkish-German Code-Switching Corpus", LAW 2016


Data and Scripts

Because of the restrictions in Twitter’s Terms of Service, we distribute tweet IDs rather than the tweets themselves. We also distribute the edit transcript that converts the original tweets to their edited versions, and the language identification annotation aligned with the edited versions. You can download the data and scripts from here.

The POS tagging guidelines are also included in the download. 

Steps to Generate the Corpus

1. Download the tweets whose IDs are listed in all_cs_id.col.

2. Put the original tweets into a single file in the format:

<tweetID><tab><original tweet>

On this page, this file is called
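As a rough illustration of the file layout in step 2, the sketch below writes (ID, text) pairs as one `<tweetID><tab><original tweet>` line per tweet. The function names and the sample tweets are hypothetical and not part of the distributed scripts. How the original pipeline treated tweets that contain tabs or line breaks is not stated on this page, so the sketch simply rejects them instead of altering the text, since any change could affect how the edit transcript applies.

```python
import io


def format_tweet_line(tweet_id, text):
    """Return one '<tweetID><tab><original tweet>' line, without the newline."""
    # A tab or line break inside the tweet would break the one-line-per-tweet
    # layout. This sketch refuses such input rather than silently changing it
    # (an assumption; the original handling is not documented here).
    if any(ch in text for ch in "\t\r\n"):
        raise ValueError(f"tweet {tweet_id} contains a tab or line break")
    return f"{tweet_id}\t{text}"


def write_tweets(tweets, out):
    """Write (tweet_id, text) pairs to the open text file `out`, one per line."""
    for tweet_id, text in tweets:
        out.write(format_tweet_line(tweet_id, text) + "\n")


if __name__ == "__main__":
    # Made-up IDs and placeholder texts, for illustration only.
    sample = [("1001", "first example tweet"), ("1002", "second example tweet")]
    buf = io.StringIO()
    write_tweets(sample, buf)
    print(buf.getvalue(), end="")
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", encoding="utf-8")` as `out`.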

3. Convert the original tweets to their edited versions with the following script:

perl -i -t >

4. Convert the one-line-per-tweet format to a one-token-per-line format:

perl -i -o all_cs_idtweet_norm.col

5. Merge the tweets in all_cs_idtweet_norm.col with the language IDs and POS tags in all_cs_id_langid_pos.col:

paste all_cs_idtweet_norm.col all_cs_id_langid_pos.col | cut -f 1,2,4,5 | sed $'s/^\t*$//g' > all_cs_idtweet_norm_langid_pos.col
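For readers without `paste`, `cut`, and `sed`, the merge in step 5 can be approximated in Python. This is a minimal sketch that mirrors the three stages of the shell pipeline; `merge_columns` is a hypothetical helper, not one of the distributed scripts. Only the field positions (1, 2, 4, 5) come from the command above. The sample values are placeholders and make no claim about what each column in the real files contains.

```python
from itertools import zip_longest


def merge_columns(left_lines, right_lines, keep=(1, 2, 4, 5)):
    """Join two line sequences side by side and keep selected 1-based fields."""
    merged = []
    for left, right in zip_longest(left_lines, right_lines, fillvalue=""):
        # paste: join corresponding lines with a single tab.
        fields = (left.rstrip("\n") + "\t" + right.rstrip("\n")).split("\t")
        # cut -f 1,2,4,5: keep the requested fields that exist on this line.
        kept = [fields[i - 1] for i in keep if i <= len(fields)]
        line = "\t".join(kept)
        # sed: a line made up only of tabs (a blank separator line in both
        # inputs) becomes an empty line again.
        if line.strip("\t") == "":
            line = ""
        merged.append(line)
    return merged


if __name__ == "__main__":
    # Placeholder rows: two columns on the left, three on the right.
    left = ["c1\tc2", "", "d1\td2"]
    right = ["c3\tc4\tc5", "", "d3\td4\td5"]
    for row in merge_columns(left, right):
        print(row)
```

With real data, the two arguments can be open file objects for all_cs_idtweet_norm.col and all_cs_id_langid_pos.col, and the returned lines can be written to all_cs_idtweet_norm_langid_pos.col.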

For questions, please contact Ozlem Cetinoglu.