Data from "Part of Speech Annotation of a Turkish-German Code-Switching Corpus", LAW 2016

 

Data and Scripts

Because Twitter’s Terms of Service restrict redistribution, we distribute the tweet IDs rather than the tweets themselves. We also distribute the edit transcript that converts the original tweets into their edited versions, and the language identification annotation aligned with the edited versions. You can download the data and scripts from here.

The POS tagging guidelines are also included in the download. 

Steps to Generate the Corpus

1. Download the tweets whose IDs are listed in all_cs_id.col.

2. Put the original tweets into a single file, one tweet per line, in the format:

<tweetID><tab><original tweet>

In the steps below, this file is called all_cs_idtweet.tab.
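Before running the conversion scripts, it can help to confirm that the file really follows this layout. The following is a minimal sketch, not part of the distributed scripts: the function name `check_idtweet_tab` is hypothetical, and it assumes that tweet IDs are purely numeric and that each tweet occupies a single line with no embedded tabs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the distributed scripts).
# Reports every line that does not look like <tweetID><tab><original tweet>.
# Assumptions: IDs are numeric, and a tweet contains no tabs or line breaks.
check_idtweet_tab() {
  awk -F'\t' '
    # A valid line has exactly two tab-separated fields and a numeric ID.
    NF != 2 || $1 !~ /^[0-9]+$/ {
      printf "line %d: unexpected format\n", NR
      bad = 1
    }
    # Exit with status 1 if any line was flagged, 0 otherwise.
    END { exit (bad ? 1 : 0) }
  ' "$1"
}
```

Calling `check_idtweet_tab all_cs_idtweet.tab` prints the numbers of any offending lines and returns a non-zero status if there are any. Tweets that contain line breaks would need to be flattened to one line first; how best to do that depends on how the tweets were downloaded.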

3. Convert the original tweets to their edited versions with the following script:

perl convert_original_to_edited.pl -i all_cs_idtweet.tab -t all_cs_transcript.tab > all_cs_idtweet_norm.tab

4. Convert the file from one-tweet-per-line format to one-token-per-line format:

perl convert_tab_to_col.pl -i all_cs_idtweet_norm.tab -o all_cs_idtweet_norm.col

5. Merge the tweets in all_cs_idtweet_norm.col with the language IDs and POS tags in all_cs_id_langid_pos.col:

paste all_cs_idtweet_norm.col all_cs_id_langid_pos.col | cut -f 1,2,4,5 | sed $'s/^\t*$//g' > all_cs_idtweet_norm_langid_pos.col
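The merge command in step 5 joins the two files line by line, so it only produces sensible output when both files have the same number of lines. The sketch below wraps the same pipeline in a hypothetical function, `merge_cols`, that checks the line counts first. It assumes a layout that the `cut -f 1,2,4,5` call suggests but that is not stated on this page: a two-column token file and a three-column annotation file whose first column repeats the tweet ID, with empty lines separating tweets. The `$'...'` quoting used for the tab character requires bash or a compatible shell.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the merge pipeline from step 5.
# Assumed layout: the token file has 2 columns, the annotation file has 3,
# and column 1 of the annotation file duplicates the tweet ID.
merge_cols() {
  local tokens="$1" tags="$2"
  local n_tokens n_tags
  # tr strips the padding that some wc implementations add.
  n_tokens=$(wc -l < "$tokens" | tr -d ' ')
  n_tags=$(wc -l < "$tags" | tr -d ' ')
  if [ "$n_tokens" -ne "$n_tags" ]; then
    echo "line count mismatch: $tokens has $n_tokens, $tags has $n_tags" >&2
    return 1
  fi
  # paste: join the files side by side; cut: drop the duplicated ID column;
  # sed: turn tab-only lines (pasted empty separator lines) back into empty lines.
  paste "$tokens" "$tags" | cut -f 1,2,4,5 | sed $'s/^\t*$//g'
}
```

Under these assumptions, `merge_cols all_cs_idtweet_norm.col all_cs_id_langid_pos.col > all_cs_idtweet_norm_langid_pos.col` does the same work as the command above. A line count mismatch most likely means that some tweets could not be downloaded or that an earlier step changed the tokenization.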

For questions, please contact Ozlem Cetinoglu.