SIGMORPHON–UniMorph Shared Task on Typologically Diverse and Acquisition-Inspired Morphological Inflection Generation
SIGMORPHON’s seventh installment of its inflection generation shared task will be divided into two parts:
- Part 1: Typologically Diverse Morphological (Re-)Inflection
- Part 2: (Automatic) Morphological Acquisition Trajectories
Please join our Google Group to stay up to date.
Click here to register for the task!
In this shared task, participants will design a model that learns to generate morphological inflections from a lemma and a set of morphosyntactic features of the target form. Each language in the task has its own training, development, and test splits. Training and development splits contain triples, each consisting of a lemma, a target form, and a set of morphological features, provided in the UniMorph format (the “Data” section below provides an example of input format). Test splits only provide lemmas and morphological tags: your model will need to predict the missing target form.
The model should be general enough to work for natural languages of any typological patterning. For example, Tagalog verbs exhibit circumfixation; thus, a model with a strong inductive bias towards suffixing will likely not work well for Tagalog.
This task will proceed in three phases: the Development Phase, the Generalization Phase, and the Evaluation Phase. As the phases advance, more data and more languages will be released.
In the Development Phase, we will provide training and development splits that should be used to develop your system. We will refer to them as the development languages. The list of languages is below. There is a small training set for each language and a large training set for languages with enough annotated paradigms.
In the Generalization Phase, we will provide training and development splits for new languages where approximately half are genetically related (belong to the same family) and half are genetically unrelated (are isolates or belong to a different family) to the development languages. We will also keep the genetically unrelated language families a surprise, though some languages will come from the same families as those in the Development Phase.
In the Evaluation Phase, the participants’ models will be evaluated on held-out forms from all of the languages from the previous phases. The languages from the Development Phase and the Generalization Phase are evaluated simultaneously. The only difference is that there has been more time to construct a model for those languages released in the Development Phase. It follows that a model could easily overfit to or favor phenomena that are more frequent in languages presented in the Development Phase, especially if parameters are shared across languages. For instance, a model based on the morphological patterning of the Indo-European languages may end up with a bias towards suffixing and will struggle to learn prefixing or circumfixation, to a degree that only becomes apparent during experimentation on other languages whose inflectional morphology patterns differ. Of course, the model architecture itself could explicitly or implicitly favor certain word formation types (suffixing, prefixing, etc.).
Language | Family | Code | UniMorph | Contributors |
---|---|---|---|---|
Arabic, Modern Standard | Semitic (Afro-Asiatic) | ara | https://github.com/unimorph/ara/ara_atb | Salam Khalifa, Nizar Habash |
Assamese | Indic (Indo-European) | asm | https://github.com/unimorph/asm/ | Khuyagbaatar Batsuren, Aryaman Arora |
Braj | Indic (Indo-European) | bra | https://github.com/unimorph/bra/ | Shyam Ratan, Ritesh Kumar |
Chukchi | Chukotko-Kamchatkan | ckt | https://github.com/unimorph/ckt/ | Karina Sheifer, Maria Ryskina |
English, Old | Germanic (Indo-European) | ang | https://github.com/unimorph/ang/ | Jeremiah Young |
Evenki | Tungusic | evn | https://github.com/unimorph/evn/ | Elena Klyachko |
Georgian | Kartvelian | kat | https://github.com/unimorph/kat/ | Simon Guriel, Silvia Guriel-Agiashvili & Nona Atanelov |
German, Low | Germanic (Indo-European) | nds | https://github.com/unimorph/nds/ | Jeremiah Young |
German, Middle Low | Germanic (Indo-European) | gml | https://github.com/unimorph/gml/ | Jeremiah Young |
German, Old High | Germanic (Indo-European) | goh | https://github.com/unimorph/goh/ | Jeremiah Young |
Gothic | Germanic (Indo-European) | got | https://github.com/unimorph/got/ | Jeremiah Young |
Gujarati | Indic (Indo-European) | guj | https://github.com/unimorph/guj/ | Aryaman Arora, Khuyagbaatar Batsuren |
Hebrew | Semitic (Afro-Asiatic) | heb | https://github.com/unimorph/heb/ | Omer Goldman |
Hungarian | Ugric (Uralic) | hun | https://github.com/unimorph/hun/ | Judit Ács, Khuyagbaatar Batsuren, Gábor Bella, Ryan Cotterell, Christo Kirov |
Itelmen | Chukotko-Kamchatkan | itl | https://github.com/unimorph/itl/ | Karina Sheifer, Sofya Ganieva, Matvey Plugaryov |
Karelian | Finnic (Uralic) | krl | https://github.com/unimorph/krl/ | (Wiktionary, VepKar) |
Ket | Yeniseian | ket | https://github.com/unimorph/ket/ | Elena Budianskaya, Polina Mashkovtseva, Alexandra Serova |
Kholosi | Indic (Indo-European) | hsi | https://github.com/unimorph/hsi/ | Aryaman Arora |
Korean | Koreanic | kor | https://github.com/unimorph/kor/ | Maria Nepomniashchaya, Daria Rodionova, Anastasia Yemelina |
Ludian | Finnic (Uralic) | lud | https://github.com/unimorph/lud/ | (VepKar) |
Magahi | Indic (Indo-European) | mag | https://github.com/unimorph/mag/ | Mohit Raj, Ritesh Kumar |
Mongolian, Khalkha | Mongolic | khk | https://github.com/unimorph/khk/ | Khuyagbaatar Batsuren |
Norse, Old | Germanic (Indo-European) | non | https://github.com/unimorph/non/ | Jeremiah Young |
Polish | Slavic (Indo-European) | pol | https://github.com/unimorph/pol/ | Marcin Woliński, Zygmunt Saloni, Robert Wołosz, Włodzimierz Gruszczyński, Danuta Skowrońska, Zbigniew Bronk, Witold Kieraś |
Pomak | Slavic (Indo-European) | poma | https://github.com/unimorph/poma/ | Ritvan Karahodja, Antonios Anastasopoulos |
Slovak | Slavic (Indo-European) | slk | https://github.com/unimorph/slk/ | Jan Hajič, Jan Hric, Witold Kieraś |
Sorbian, Upper | Slavic (Indo-European) | hsb | https://github.com/unimorph/hsb/ | Taras Andrushko, Igor Marchenko |
Turkish | Turkic | tur | https://github.com/unimorph/tur/ | Omer Goldman, Duygu Ataman |
Veps | Finnic (Uralic) | vep | https://github.com/unimorph/vep/ | (VepKar) |
Xibe | Tungusic | sjo | https://github.com/unimorph/sjo/ | Elena Klyachko |
Language | Family | Code | UniMorph | Contributors |
---|---|---|---|---|
Armenian | Indo-European | hye | https://github.com/unimorph/hye/ | Hossep Dolatian, Khuyagbaatar Batsuren, Ryan Cotterell |
Kazakh | Turkic | kaz | https://github.com/unimorph/kaz/ | Eleanor Chodroff, Khuyagbaatar Batsuren |
Lamaholot | Austronesian | slp | https://github.com/unimorph/slp | Yustinus Ghanggo Ate |
Stage 1: Development Phase
- March 29, 2022: Training and development splits for development languages released; we invite participants to report errors.
- March 29, 2022: Neural and non-neural baselines for development languages released.
Stage 2: Generalization Phase
- April 17, 2022: Training and development splits for surprise languages released. (This is not a zero-shot learning task. Participants will be given training data for all languages.)
Stage 3: Evaluation Phase
- April 22, 2022: Test splits for all languages (both development and surprise) released.
- May 17, 2022: Participants submit test predictions on all languages.
Stage 4: Write-up Phase
- June 3, 2022: Participants’ system description papers due.
The training and development data are provided in a simple utf-8 encoded text format for both the development and surprise languages. Each line in a file is an example that consists of word forms and corresponding morphosyntactic descriptions (MSDs) provided as a set of features, separated by semicolons. We refer to the MSDs as (morphological) tags for simplicity. The fields on a line are TAB-separated. The fields are: lemma, target form, tag. Here we present an example from the Akan training data (the Akan verb “bisa” means “to ask” in English):
bisa mmbisa V;PRS;HAB;NEG
In the training data, we give all three fields. In the test phase, we omit field 2.
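As a concrete illustration, here is a minimal Python sketch of how such files could be read. The helper name, the example file names, and the exact layout of test files (which omit the target form) are our assumptions, not part of any released tooling; check the released data before relying on them.

def read_triples(path, has_target=True):
    # Each line: lemma TAB target form TAB tag, with tag features joined by ";".
    # We assume test files contain only lemma TAB tag.
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            fields = line.split("\t")
            if has_target:
                lemma, form, tag = fields
            else:
                lemma, tag = fields
                form = None
            examples.append((lemma, form, tag))
    return examples

# e.g. train = read_triples("ara.train"); test = read_triples("ara.test", has_target=False)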
We will provide varying amounts of labeled training data, depending on the language, to assess models’ ability to generalize to novel forms, in addition to information about each language’s family and sub-family, and WALS features which participants may optionally use. For each language, the possible inflections are taken from a finite set of morphological tags, presented in the UniMorph schema.
Two training sets are provided for most languages in order to test models' behavior on smaller and larger data sets.
Evaluation script available here https://github.com/sigmorphon/2022InflectionST/tree/main/evaluation
The language generalization evaluation extends previous years' design. We will simultaneously evaluate models for both the Development languages, whose training and development sets will be available for an extended period of time, and the Surprise languages, whose training and development sets will only be available for a short time prior to submission, which precludes extensive tuning. To be officially ranked, you must submit results for all evaluation languages. Thus, to succeed, your class of models (e.g. neural sequence-to-sequence models or weighted finite-state transducers with hand-crafted features) must generalize well to the group of Surprise languages, which are typologically distinct from the Development languages you performed model selection on. To repeat: this is not a zero-shot learning task; rather, our evaluation set-up is designed to test the inherent inductive bias of the participants' chosen model class.
Evaluation is designed to provide insights into performance over typologically distinct languages. Accuracy on held out forms will be evaluated separately for three classes of languages:
- held-out forms from the Development languages
- held-out forms from genetically related Surprise languages
- held-out forms from genetically unrelated Surprise languages
For each language, accuracy will be evaluated on the entirety of the held-out forms as well as on the following subsets. This will provide insights into systems' ability to generalize across morphological information within languages. Since (lemma, features) pairs are provided at test time and no (lemma, features) pair can have been attested at training time, there are three logical subsets (a sketch of how these could be computed follows the list):
- held-out forms for lemmas attested in training, feature sets unattested in training
- held-out forms for feature sets attested in training, lemmas unattested in training
- held-out forms for (lemma, feature) pairs completely unattested in training
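A rough sketch of how these subsets and per-subset exact-match accuracy could be computed is below. It is illustrative only: the official evaluation script linked above is authoritative, and the bucket names and the handling of the third subset are our assumptions.

def subset_accuracies(train, test, predictions):
    # train and test are lists of (lemma, form, tag) triples; predictions is a
    # list of predicted forms aligned with test. Buckets follow the three subsets above.
    seen_lemmas = {lemma for lemma, _, _ in train}
    seen_tags = {tag for _, _, tag in train}
    buckets = {"lemma attested, features unattested": [],
               "features attested, lemma unattested": [],
               "completely unattested": []}
    for (lemma, gold, tag), pred in zip(test, predictions):
        if lemma in seen_lemmas and tag not in seen_tags:
            key = "lemma attested, features unattested"
        elif tag in seen_tags and lemma not in seen_lemmas:
            key = "features attested, lemma unattested"
        else:
            key = "completely unattested"
        buckets[key].append(pred == gold)
    return {name: sum(hits) / len(hits) for name, hits in buckets.items() if hits}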
The human-like generalization part of this shared task will be evaluated as described above, but with an additional analysis of error types and their relationships to attested error types observed during child development: the distribution of "over-regularization" and "over-irregularization" errors, omission and commission errors, whether errors yield U-shaped developmental trajectories on languages for which the phenomenon is well attested, and the relative order of acquisition of generalizations as data size increases.
Please submit your team's results to jordan.kodner@stonybrook.edu, CCing your teammates, by May 17th, 2022.
Baseline results available here https://github.com/sigmorphon/2022InflectionST/tree/main/evaluation
The organizers will provide one non-neural and one neural baseline for the participants’ use. Their use is optional; they are provided to help participants develop their own models faster. The neural baseline is a multilingual transformer (Vaswani et al., 2017). The version of this model adapted for character-level tasks currently holds the state of the art on the 2017 SIGMORPHON shared task data. The transformer takes the lemma and morphological tags as input and outputs the target inflection. Given the low-resource setup, a single model will be trained on all languages. Additionally, we consider the data augmentation technique used by Anastasopoulos and Neubig (2019) as another baseline.
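For intuition, the sketch below shows the general idea behind such hallucination-style augmentation: the shared "stem" of a (lemma, form) pair is replaced with a random string so that affixal material is seen with many novel stems. This is a simplified illustration under our own assumptions, not the actual augment.sh implementation, and the function name and parameters are hypothetical.

import random
from difflib import SequenceMatcher

def hallucinate(lemma, form, alphabet, rng=random):
    # Treat the longest common substring of lemma and form as the stem and
    # swap it for a random string, keeping prefixes/suffixes intact.
    match = SequenceMatcher(None, lemma, form).find_longest_match(0, len(lemma), 0, len(form))
    if match.size < 3:  # no usable shared stem; skip this pair
        return None
    fake_stem = "".join(rng.choice(alphabet) for _ in range(match.size))
    new_lemma = lemma[:match.a] + fake_stem + lemma[match.a + match.size:]
    new_form = form[:match.b] + fake_stem + form[match.b + match.size:]
    return new_lemma, new_form

# alphabet could be collected from the training data, e.g.:
# alphabet = sorted({ch for lemma, form, _ in train for ch in lemma + form})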
To run the non-neural baseline, use the command:
$ python baselines/nonneural/baseline.py --path part1/development_languages/
To run the neural baseline, first download and augment the data (Anastasopoulos and Neubig, 2019):
$ mkdir part1/original
$ cp part1/development_languages/* part1/original
$ bash baselines/neural/example/sigmorphon2021-shared-tasks/augment.sh
$ python baselines/neural/example/sigmorphon2021-shared-tasks/task0-build-dataset.py all
Then run the transducer (Wu et al., 2021), training one model per language:
$ bash baselines/neural/example/sigmorphon2021-shared-tasks/task0-launch.sh
Task Logistics: Jordan Kodner, Salam Khalifa, Khuyagbaatar Batsuren, Ekaterina Vylomova, Maria Ryskina, Omer Goldman, Jeffrey Heinz, Ryan Cotterell, Mans Hulden, Garret Nicolai, David Yarowsky
Data Preparation: Antonios Anastasopoulos, Taras Andrushko, Aryaman Arora, Duygu Ataman, Nona Atanelov, Khuyagbaatar Batsuren, Zbigniew Bronk, Elena Budianskaya, Hossep Dolatian, Sofya Ganieva, Omer Goldman, Włodzimierz Gruszczyński, Simon Guriel, Silvia Guriel-Agiashvili, David Guriel, Nizar Habash, Jan Hajič, Jan Hric, Salam Khalifa, Witold Kieraś, Elena Klyachko, Ritesh Kumar, Ritvan Karahodja, Igor Marchenko, Polina Mashkovtseva, Maria Nepomniashchaya, Matvey Plugaryov, Mohit Raj, Shyam Ratan, Daria Rodionova, Maria Ryskina, Zygmunt Saloni, Alexandra Serova, Karina Sheifer, Danuta Skowrońska, Marcin Woliński, Robert Wołosz, Anastasia Yemelina, Jeremiah Young
Exactly how children acquire their native morphologies and carry out morphological generalization in practice remains a major question in language acquisition and theoretical morphology, one with major implications for cognitive science more broadly. Neural approaches have long played a role in these discussions, with their early promise kicking off the so-called "Past-Tense Debate" of the 1980s and 90s. See Gary Marcus’ book The Algebraic Mind for an overview.
The recent success and popularity of improved neural methods have brought renewed interest in these questions from the computational linguistics and cognitive science communities. A series of papers and responses have been published in the last few years: Kirov and Cotterell (2018), Corkery et al. (2019), McCurdy et al. (2020), Belth, Payne, et al. (2021), and Dankers et al. (2021). Last year's SIGMORPHON shared task threw its hat into the ring as well, offering a subtask correlating predicted and human well-formedness ratings for inflected forms.
This year's subtask approaches the question from a different angle. Instead of predicting the well-formedness of nonce word forms, systems will instead be evaluated on their ability to generalize over naturalistic low-resource inputs. We have prepared data to determine systems' learning trajectories and compare them against the wealth of data that has been collected about human learning trajectories for three famous problems: the English past tense, the namesake of the Past-Tense Debate (e.g., Marcus et al. 1992); German noun plurals, a well-studied challenge case which may have a minority-default pattern (Clahsen et al. 1992, Marcus et al. 1995); and Arabic noun plurals, an even more challenging case with several competing affixal and templatic patterns (Ravid & Farah 1999, Dawdy-Hesterberg & Pierrehumbert 2014).
Generalization is a core task facing any morphology learner, human or otherwise, because morphological data is extremely sparse. Even for languages with moderately sized paradigms, the vast majority of possible forms will not be attested even in millions of tokens of input. These distributions are highly skewed, such that a few lemmas may have mostly complete paradigms while most lemmas will only be attested in a couple of forms (Chan 2008). Because of this, speakers young and old are often forced to infer inflected forms that were not well attested in their inputs up to that point, even for known lemmas.
Children make mistakes when inferring forms, and these errors provide insights into the organization of their grammars over time. Much has been discovered about the learning trajectories of children acquiring these patterns. Learners are more likely to over-apply apparently productive patterns than to over-apply non-productive patterns (over-regularization over "over-irregularization") and are more likely to omit inflectional information than to substitute other inflectional information (omission over commission). English past tense and Arabic noun plural learners exhibit U-shaped developmental regression, in which their accuracy improves, degrades, and improves again.
In order to evaluate to what extent automatic systems exhibit these patterns or different ones, we have prepared a series of training sets of increasing size so that performance can be tracked as input increases and the presence or absence of particular inputs can be correlated with particular model behaviors. Training sets are sampled, weighted by frequency, from German and English child-directed speech corpora available from UniMorph, with frequencies from the CHILDES database (MacWhinney 2000), such that the smallest training sets contain only the highest-frequency words. Arabic is sampled in the same way, but words and their frequencies are taken from the Penn Arabic Treebank (Maamouri et al. 2004). A development and test set are held out in each case.
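A minimal sketch of this kind of frequency-weighted sampling is shown below, under the assumption that each candidate triple comes paired with a corpus frequency. The released splits were prepared by the organizers and may differ in detail; the function and its arguments are illustrative.

import random

def sample_training_set(candidates, size, seed=0):
    # candidates: list of ((lemma, form, tag), frequency) pairs.
    # Frequency-weighted sampling without replacement, so small samples
    # are dominated by the highest-frequency words.
    rng = random.Random(seed)
    pool = list(candidates)
    sample = []
    while pool and len(sample) < size:
        total = sum(freq for _, freq in pool)
        r = rng.uniform(0, total)
        running = 0.0
        for i, (triple, freq) in enumerate(pool):
            running += freq
            if r <= running:
                sample.append(triple)
                pool.pop(i)
                break
    return sample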
Performance will be evaluated by accuracy across attested and unattested lemmas and features, as in Part 1, for each training size. This will provide a developmental trajectory. In addition, error types will be analyzed and classified in a manner similar to Gorman et al. (2019), but with reference to the error types attested and unattested in acquisition. In the lowest training settings, it may not be possible for a system to achieve perfect accuracy on the test set. Those cases will provide insights into what generalizations systems are making. Consider the following hypothetical example from English: the following verb lemmas are phonotactically similar but form the past tense differently. If a system were to adopt any one pattern based on the -ing verbs in its training data, it would not achieve 100% accuracy, but which generalization it picks (-ed or another one) would reveal something about the model.
sing sang V;PST
sting stung V;PST
bring brought V;PST
ping pinged V;PST
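One way this kind of error analysis could be operationalized is sketched below. The regular-pattern check and the error labels are illustrative assumptions covering only the English case; the actual analysis will follow the error taxonomy described above.

def classify_error(lemma, gold, predicted):
    # Very rough error typing for English past tense predictions.
    regular = lemma + "d" if lemma.endswith("e") else lemma + "ed"
    if predicted == gold:
        return "correct"
    if predicted == lemma:
        return "omission"                # inflection left off entirely
    if predicted == regular and gold != regular:
        return "over-regularization"     # regular -ed applied to an irregular verb
    return "other/commission"

# classify_error("sing", "sang", "singed") -> "over-regularization"
# classify_error("sing", "sang", "sing")   -> "omission"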
Training and dev data may be downloaded here.
The data are in the standard UniMorph triple file format:
swim swam V;PST
- March 15, 2022: Training data for English, German, and Arabic are released. We invite participants to report errors.
- March 29, 2022: Neural and non-neural baselines for English, German, and Arabic are released.
- April 22, 2022: Test splits for all languages (both development and surprise) released.
- May 17, 2022: Participants submit test predictions on all languages.
- June 3, 2022: Participants’ system description papers due.
Please submit your team's results to jordan.kodner@stonybrook.edu, CCing your teammates, by May 17, 2022. Please use "SIGMORPHON Task 0 Part 2" in your subject line.
- Jordan Kodner (Stony Brook University)
- Salam Khalifa (Stony Brook University)
Belth, C. A., Payne, S. R., Beser, D., Kodner, J., & Yang, C. (2021). The Greedy and Recursive Search for Morphological Productivity. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 43, No. 43).
Chan, Erwin. (2008). Structures and distributions in morphology learning. Doctoral dissertation, University of Pennsylvania.
Clahsen, H., Rothweiler, M., Woest, A., & Marcus, G. F. (1992). Regular and irregular inflection in the acquisition of German noun plurals. Cognition, 45(3), 225-255.
Corkery, M., Matusevych, Y., and Goldwater, S. (2019). Are we there yet? Encoder-decoder neural networks as cognitive models of English past tense inflection. In Proceedings of ACL 2019.
Dankers, V., Langedijk, A., McCurdy, K., Williams, A., & Hupkes, D. (2021). Generalising to German Plural Noun Classes, from the Perspective of a Recurrent Neural Network. In Proceedings of CoNLL 2021 (pp. 94-108).
Dawdy-Hesterberg, L. G., & Pierrehumbert, J. B. (2014). Learnability and generalisation of Arabic broken plural nouns. Language, cognition and neuroscience, 29(10), 1268-1282.
Gorman, K., McCarthy, A. D., Cotterell, R., Vylomova, E., Silfverberg, M., & Markowska, M. (2019, November). Weird Inflects but OK: Making Sense of Morphological Generation Errors. In Proceedings of CoNLL 2019 (pp. 140-151).
Kirov, C. and Cotterell, R. (2018). Recurrent Neural Networks in Linguistic Theory: Revisiting Pinker and Prince (1988) and the Past Tense Debate. TACL, 6, 651-665.
Kirov, C., Cotterell, R., Sylak-Glassman, J., Walther, G., Vylomova, E., Xia, P., Faruqui, M., Mielke, S., McCarthy, A., Kübler, S., Yarowsky, D., Eisner, J., and Hulden, M. (2018). UniMorph 2.0: Universal Morphology. In Proceedings of LREC 2018.
Maamouri, M., Bies, A., Buckwalter, T., & Mekki, W. (2004). The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. In NEMLAR conference on Arabic language resources and tools (Vol. 27, pp. 466-467).
MacWhinney, B. (2000). The CHILDES project: The database.
Marcus, G. F. (2001). The Algebraic Mind: Integrating Connectionism and Cognitive Science. MIT Press.
Marcus, G. F., Pinker, S., Ullman, M., Hollander, M., Rosen, T. J., Xu, F., & Clahsen, H. (1992). Overregularization in language acquisition. Monographs of the society for research in child development.
Marcus, G. F., Brinkmann, U., Clahsen, H., Wiese, R., & Pinker, S. (1995). German inflection: The exception that proves the rule. Cognitive psychology, 29(3), 189-256.
McCurdy, K., Goldwater, S., and Lopez, A. (2020). Inflecting When There’s No Majority: Limitations of Encoder-Decoder Neural Networks as Cognitive Models for German Plurals. In Proceedings of ACL 2020.
Ravid, D., & Farah, R. (1999). Learning about noun plurals in early Palestinian Arabic. First Language, 19(56), 187-206.
We do not tolerate harassment in our shared task. Anyone who has been previously found to have harassed one of the organizers cannot participate in our task in any capacity. There are too few organizers for us to accommodate the harasser in a manner that ensures the safety of their victims. (This policy was written on June 22nd and cannot be applied retroactively, but will be in place for all future iterations of the task.)