Documentation for the corpora used in the paper
"Generalization in Native Language Identification: Learners versus Scientists"
Sabrina Stehwien and Sebastian Pado (2015)

For all corpora we used only the L1 classes: GE, ES, FR, IT, JP, TR, ZH

1.) TOEFL11

Full corpus provided by Blanchard et al. (2013).
Used as training data; no preprocessing was necessary.

2.) ICLE

Subset of the International Corpus of Learner English, Version 2 (Granger et al., 2009).
One-line headers were removed and 251 documents per L1 class were extracted
(an illustrative preprocessing sketch is given in Appendix 1 at the end of this file).
The IDs of the test files used in each cross-validation cycle are given in icle_ids.txt.

3.) Lang-8

This corpus was scraped from www.lang-8.com using a web scraper provided by
Julian Brooke and Graeme Hirst: http://www.cs.toronto.edu/~jbrooke/Lang8.zip
The script was adapted to fit our purposes (lang8-scraper.py), and an additional
script was used to sort the output into directories (lang8-builder.py).
These scripts are provided in Lang8-Scripts/.
176 documents per L1 class were extracted; the IDs of the test files used in each
cross-validation cycle are given in lang8_ids.txt.

4.) ACL

Subset of the ACL Anthology Network Corpus (Radev et al., 2013).

Steps used to create the corpus:

Step 1: Extracted documents with the desired author e-mail domains
        .de, .es, .it, .tr, .jp, .fr, .cn
        (an illustrative filtering sketch is given in Appendix 2 at the end of this file).
        The resulting collection contains only these e-mail domains; files with
        other domains such as .org, .edu, .com and .net were removed.
Step 2: Stripped documents of headers, acknowledgements and reference sections,
        keeping the otherwise uncleaned text from abstract to conclusion,
        i.e. full texts. In our case, the resulting documents are called
        *.stripped.txt.
        This procedure yielded a sufficient number of documents per L1,
        except for Turkish.
Step 3: Manually extracted further documents by Turkish authors.
        This way we were able to increase the Turkish subset to 54 documents.
Step 4: Extracted 54 documents per L1.

The full corpus is provided in ACL-NLI/.
The IDs of the test files used in each cross-validation cycle are given in acl_ids.txt.
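
Appendix 1: Sketch of the per-class extraction (ICLE-style preprocessing)

The following Python sketch illustrates the kind of preprocessing described for
ICLE (and, with different counts, Lang-8 and ACL): removing a one-line header and
extracting a fixed number of documents per L1 class. The directory layout, the
random sampling, and the assumption that the header is exactly the first line of
each file are our own illustrative choices; this is not the script actually used
for the paper.

    import os
    import random

    L1_CLASSES = ["GE", "ES", "FR", "IT", "JP", "TR", "ZH"]

    def strip_one_line_header(text):
        # Drop the first line of a document and keep the rest.
        return text.split("\n", 1)[1] if "\n" in text else ""

    def build_subset(src_root, dst_root, docs_per_class, seed=42):
        # Sample docs_per_class files per L1 directory and write cleaned copies.
        random.seed(seed)
        for l1 in L1_CLASSES:
            src_dir = os.path.join(src_root, l1)
            dst_dir = os.path.join(dst_root, l1)
            os.makedirs(dst_dir, exist_ok=True)
            files = sorted(os.listdir(src_dir))
            for name in random.sample(files, docs_per_class):
                with open(os.path.join(src_dir, name), encoding="utf-8") as f:
                    cleaned = strip_one_line_header(f.read())
                with open(os.path.join(dst_dir, name), "w", encoding="utf-8") as f:
                    f.write(cleaned)

Example usage (hypothetical paths): build_subset("ICLE-raw", "ICLE-clean", 251)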
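
Appendix 2: Sketch of the ACL e-mail-domain filter (Step 1)

The sketch below shows one way the e-mail-domain filter from Step 1 of the ACL
corpus construction could be implemented. The regular expression and the
assumption that author e-mail addresses appear verbatim in the extracted paper
text are illustrative; the actual extraction may have been done differently.

    import re

    DESIRED_TLDS = {"de", "es", "it", "tr", "jp", "fr", "cn"}
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.([A-Za-z]{2,})")

    def has_desired_domain(text):
        # Keep a paper if any e-mail address in it ends in a desired country TLD.
        return any(m.group(1).lower() in DESIRED_TLDS
                   for m in EMAIL_RE.finditer(text))

Example usage: keep only papers whose full text satisfies has_desired_domain.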