***********************************
*** MOP Compound Splitter (MCS) ***
***********************************

*** INTRODUCTION ***

This system is a compound splitter based on constituent normalization using Morphological Operation Patterns (MOPs). As its core splitting method, it uses a recursive frequency-based architecture inspired by Koehn and Knight (2003). Using MOPs learned from regular word inflection (i.e., Word MOPs) allows for compound splitting in many languages without the need for language-specific knowledge about constituent inflection. The MCS is described in Ziering and Van der Plas (2016) and Ziering (2018).

*** MAIN MODES ***

There are four main modes: RESOURCE (for collecting lemma frequencies, part-of-speech (PoS) probabilities and MOPs from a PoS-tagged and lemmatized corpus, as resources for the compound splitter), SPLIT (for MOP-based compound splitting), MOP (for testing the potential of MOPs) and HELP (a help page containing all relevant arguments and options).

** RESOURCE **
The purpose of this mode is to compile a lemma set (containing lemmas with corpus frequency and PoS probability) and an MOP set (containing all MOPs used for transforming a lemma into its corresponding word form, with corpus frequency and PoS probability).
Input file format: word PoS lemma; one token per line.
The outputs are a lemma set and an MOP set.

*Usage*
java -jar MCS.jar --RESOURCE --CORPUS

** SPLITTING **
This mode is the core method of MCS: the MOP-based compound splitter, which generates a linear and a hierarchical structure for a given target compound. As resources, it uses a lemma set and an MOP set. Depending on the language, the user can distinguish between MOPs for the head and MOPs for the modifier. There are three different kinds of input: TERM (for a single term), TEXT (for a tokenized text file) and COLLECTION (for a list of target compounds, one per line). Various additional options are described in the help page (java -jar MCS.jar --HELP SPLITTING).
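
The frequency-based idea behind the SPLITTING mode can be illustrated with a small sketch in the spirit of Koehn and Knight (2003), who rank candidate splits by the geometric mean of the corpus frequencies of their parts. This is not the MCS implementation: the function name best_split, the min_len parameter and the toy frequencies are assumptions made for illustration, and the sketch performs no MOP-based normalization of constituents.

```python
from math import prod

# Toy lemma frequencies (invented for illustration, not taken from any corpus).
TOY_FREQ = {"gast": 500, "raum": 800, "gas": 300, "traum": 200, "gastraum": 5}


def segmentations(rest, freq, min_len=3):
    """Recursively yield every way to cut `rest` into parts found in `freq`."""
    if not rest:
        yield []
        return
    for i in range(min_len, len(rest) + 1):
        head = rest[:i]
        if head in freq:
            for tail in segmentations(rest[i:], freq, min_len):
                yield [head] + tail


def best_split(word, freq, min_len=3):
    """Return (score, parts) for the segmentation with the highest
    geometric mean of part frequencies. The unsplit word competes
    with its own frequency; unknown words come back unsplit with score 0.0."""
    word = word.lower()
    best = (0.0, [word])
    for parts in segmentations(word, freq, min_len):
        score = prod(freq[p] for p in parts) ** (1.0 / len(parts))
        if score > best[0]:
            best = (score, parts)
    return best


print(best_split("Gastraum", TOY_FREQ)[1])  # ['gast', 'raum']
```

With these toy counts, gast|raum outranks gas|traum because both of its parts are more frequent. A compound whose modifier carries an inflection-like ending (such as "suppen" in "Hühnersuppenrezept") would not be found by this sketch, which is the gap that constituent normalization with MOPs is meant to close.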

*Output format*
There are two types of output, each reported for several numbers of constituents and rankings: (1) TREE OUTPUT and (2) BINARY SPLIT OUTPUT.

The format of the TREE OUTPUT is:
@ =T

For example, splitting "Hühnersuppenrezept" (chicken soup recipe) yields:
3@1#>>>>>> =T null 3 Hühnersuppenrezept hühner|suppen|rezept Huhn Suppe Rezept [ [ Huhn Suppe ] Rezept ]
2@1#>>>>>> =T null 2 Hühnersuppenrezept hühnersuppen|rezept Hühnersuppe Rezept [ Hühnersuppe Rezept ]

In the current version, there is no score for trees, only for binary split decisions. Providing scores for different tree structures (i.e., compound parsing) will be addressed in future work.

The format of the BINARY SPLIT OUTPUT is:
@ =B

For example, splitting "Gastraum" (guest room / gas dream?) into two constituents yields:
2@1#>>>>>> =B 421683.74529459066 2 gastraum gast|raum gast raum null
2@2#>>>>>> =B 98741.69382132468 2 gastraum gas|traum gas traum null

*Usage examples*
Splitting a term with two specified MOP sets and verbose command line output:
java -jar MCS.jar --SPLIT --LEMMASET --modifierMOPs --headMOPs --TERM --VERBOSE

Splitting a collection of terms (one per line) with MOPs for the modifier:
java -jar MCS.jar --SPLIT --LEMMASET --modifierMOPs --COLLECTION --OUTPUT

Splitting a (tokenized) text (several space-separated tokens per line) with MOPs for the modifier, lower-casing all tokens and using a maximum splitting depth of 3 constituents:
java -jar MCS.jar --SPLIT --LEMMASET --modifierMOPs --COLLECTION --OUTPUT --kmax 3

** MOP **
The purpose of this mode is to test the potential of MOPs, i.e., to COMPILE an MOP from two strings, to APPLY an MOP to a string, to INVERT an MOP and to LEMMATIZE a term/text/collection using a Word MOP set.
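
To give a feel for what compiling, applying and inverting a pattern can mean, here is a deliberately simplified sketch. It assumes a pattern is just a suffix substitution (strip, add); this README does not specify how MCS represents MOPs internally, and real MOPs presumably cover more than suffixes (the "Hühner" to "Huhn" normalization shown above involves a stem change). The function names compile_pattern, apply_pattern and invert_pattern are hypothetical.

```python
from os.path import commonprefix


def compile_pattern(source, target):
    """Compile a (strip, add) suffix pattern that turns `source` into `target`,
    by removing the longest common prefix of both strings."""
    prefix = commonprefix([source, target])
    return (source[len(prefix):], target[len(prefix):])


def apply_pattern(pattern, term):
    """Apply a (strip, add) pattern to `term`; return None if it does not match."""
    strip, add = pattern
    if not term.endswith(strip):
        return None
    return term[:len(term) - len(strip)] + add


def invert_pattern(pattern):
    """Swap the two sides, so a lemma-to-word-form pattern
    becomes a word-form-to-lemma pattern."""
    strip, add = pattern
    return (add, strip)


# A pattern learned from regular inflection: lemma "Suppe" -> plural "Suppen".
plural = compile_pattern("Suppe", "Suppen")
print(plural)                                            # ('', 'n')
print(apply_pattern(plural, "Tasche"))                   # Taschen
# Inverted, the same pattern normalizes a compound modifier.
print(apply_pattern(invert_pattern(plural), "suppen"))   # suppe
```

The last call mirrors the idea stated in the introduction: a pattern learned from ordinary word inflection is reused to map a compound constituent back to a lemma, without any compound-specific rules.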

*Usage examples*
Compiling an MOP for transforming a given source term into a given target term:
java -jar MCS.jar --MOP --COMPILE --A --B

Applying a given MOP to a given term:
java -jar MCS.jar --MOP --APPLY --TERM

Inverting a given MOP:
java -jar MCS.jar --MOP --INVERT

Lemmatizing a term using Word MOPs:
java -jar MCS.jar --MOP --LEMMATIZE --LEMMASET --TERM

** HELP **
A detailed description of the usage is given in the help page:
> java -jar MCS.jar --HELP

For compound splitting, more information is given in:
> java -jar MCS.jar --HELP SPLITTING

For the MOP mode, more information is given in:
> java -jar MCS.jar --HELP MOP

For data processing and the creation of the resources (i.e., lemma set and MOP set), more information is given in:
> java -jar MCS.jar --HELP RESOURCE

*** REFERENCES ***

Philipp Koehn and Kevin Knight. 2003. Empirical methods for compound splitting. In Proceedings of EACL 2003.

Patrick Ziering and Lonneke van der Plas. 2016. Towards Unsupervised and Language-independent Compound Splitting using Inflectional Morphological Transformations. In Proceedings of NAACL-HLT 2016.

Patrick Ziering. 2018. Indirect Supervision for the Determination and Structural Analysis of Nominal Compounds. PhD thesis.