Documentation for the corpora used in the paper
"Generalization in Native Language Identification: Learners versus Scientists"
Sabrina Stehwien and Sebastian Pado (2015)

For all corpora we used only the L1 classes: GE, ES, FR, IT, JP, TR, ZH

1.) TOEFL11

Full corpus provided by Blanchard et al. (2013).
Used as training data; no preprocessing was necessary.

2.) ICLE

Subset of the International Corpus of Learner English, Version 2 (Granger et al., 2009).
One-line headers were removed and 251 documents per L1 class were extracted
(an illustrative preprocessing sketch is given in Appendix 1 at the end of this file).
The IDs of the test files used in each cross-validation cycle are given in icle_ids.txt.

3.) Lang-8

This corpus was scraped from www.lang-8.com using a web scraper provided by
Julian Brooke and Graeme Hirst: http://www.cs.toronto.edu/~jbrooke/Lang8.zip
The script was adapted to fit our purposes (lang8-scraper.py), and an additional
script was used to sort the output into directories (lang8-builder.py).
These scripts are provided in Lang8-Scripts/.
176 documents per L1 class were extracted; the IDs of the test files used in each
cross-validation cycle are given in lang8_ids.txt.

4.) ACL

Subset of the ACL Anthology Network Corpus (Radev et al., 2013).

Steps used to create the corpus:

Step 1: Extracted documents with the desired author e-mail domains
        .de, .es, .it, .tr, .jp, .fr, .cn
        (an illustrative filtering sketch is given in Appendix 2 at the end of this file).
        The resulting collection contains only these e-mail domains; files with
        other domains such as .org, .edu, .com and .net were removed.
Step 2: Stripped documents of headers, acknowledgements and reference sections,
        keeping the otherwise uncleaned text from abstract to conclusion,
        i.e. full texts. In our case, the resulting documents are called
        *.stripped.txt.
        This procedure yielded a sufficient number of documents per L1,
        except for Turkish.
Step 3: Manually extracted further documents by Turkish authors.
        This way we were able to increase the Turkish subset to 54 documents.
Step 4: Extracted 54 documents per L1.

The full corpus is provided in ACL-NLI/.
The IDs of the test files used in each cross-validation cycle are given in acl_ids.txt.
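
Appendix 1: Sketch of the per-class extraction (ICLE-style preprocessing)

The following Python sketch illustrates the kind of preprocessing described for
ICLE (and, with different counts, Lang-8 and ACL): removing a one-line header and
extracting a fixed number of documents per L1 class. The directory layout, the
random sampling, and the assumption that the header is exactly the first line of
each file are our own illustrative choices; this is not the script actually used
for the paper.

    import os
    import random

    L1_CLASSES = ["GE", "ES", "FR", "IT", "JP", "TR", "ZH"]

    def strip_one_line_header(text):
        # Drop the first line of a document and keep the rest.
        return text.split("\n", 1)[1] if "\n" in text else ""

    def build_subset(src_root, dst_root, docs_per_class, seed=42):
        # Sample docs_per_class files per L1 directory and write cleaned copies.
        random.seed(seed)
        for l1 in L1_CLASSES:
            src_dir = os.path.join(src_root, l1)
            dst_dir = os.path.join(dst_root, l1)
            os.makedirs(dst_dir, exist_ok=True)
            files = sorted(os.listdir(src_dir))
            for name in random.sample(files, docs_per_class):
                with open(os.path.join(src_dir, name), encoding="utf-8") as f:
                    cleaned = strip_one_line_header(f.read())
                with open(os.path.join(dst_dir, name), "w", encoding="utf-8") as f:
                    f.write(cleaned)

Example usage (hypothetical paths): build_subset("ICLE-raw", "ICLE-clean", 251)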
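
Appendix 2: Sketch of the ACL e-mail-domain filter (Step 1)

The sketch below shows one way the e-mail-domain filter from Step 1 of the ACL
corpus construction could be implemented. The regular expression and the
assumption that author e-mail addresses appear verbatim in the extracted paper
text are illustrative; the actual extraction may have been done differently.

    import re

    DESIRED_TLDS = {"de", "es", "it", "tr", "jp", "fr", "cn"}
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w.-]+\.([A-Za-z]{2,})")

    def has_desired_domain(text):
        # Keep a paper if any e-mail address in it ends in a desired country TLD.
        return any(m.group(1).lower() in DESIRED_TLDS
                   for m in EMAIL_RE.finditer(text))

Example usage: keep only papers whose full text satisfies has_desired_domain.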