LinguisticStructureLM: Transformer-based Language Modeling with Symbolic Linguistic Structure Representations
Published at NAACL-HLT 2022 as "Linguistic Frameworks Go Toe-to-Toe at Neuro-Symbolic Language Modeling" by Jakob Prange, Nathan Schneider, and Lingpeng Kong.
Please cite as:
```bibtex
@inproceedings{prange-etal-2022-linguistic,
    title = "Linguistic Frameworks Go Toe-to-Toe at Neuro-Symbolic Language Modeling",
    author = "Prange, Jakob and
      Schneider, Nathan and
      Kong, Lingpeng",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.325",
    pages = "4375--4391",
}
```
- To install dependencies, run:
  ```shell
  pip install -r requirements.txt
  ```
- Download the trained models into this directory.
- Obtain annotated data and store all training and evaluation files as `FORMALISM.training.mrp` and `FORMALISM.validation.mrp` (where `FORMALISM` is one of `{dm, psd, eds, ptg, ud, ptb-phrase, ptb-func, empty}`) in a directory called `mrp/`, which is a subdirectory of this one. Note: We used the annotated and MRP-formatted WSJ data, so we cannot publicly release it here. Please contact me or open an issue! (You'll probably need an LDC license to get the data.)
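Once the data is in place, a quick sanity check like the following can confirm the layout matches what the scripts expect. This is just a sketch, not part of the repository; the formalism list and file-naming scheme are taken from the note above:

```shell
# Sketch: verify that all expected .mrp files are present under mrp/.
# Run from the repository root; prints one line per missing file.
for FORMALISM in dm psd eds ptg ud ptb-phrase ptb-func empty; do
  for SPLIT in training validation; do
    f="mrp/$FORMALISM.$SPLIT.mrp"
    [ -f "$f" ] || echo "missing: $f"
  done
done
```

If the loop prints nothing, all sixteen files are in place.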
To reproduce the main results (Table 2 in the paper), complete the following steps:
- Edit `lm_eval.sh` to match your local environment.
- Run:
  ```shell
  sh eval_all_lm.sh
  ```
- The results are written to stdout by `eval.py` and collected by `lm_eval.sh` in a file called `eval-dm,dm,psd,eds,ptg,ud,ptb-phrase,ptb-func-10-0001-0.0_0.0-0-14combined.out`. Run:
  ```shell
  cat eval-dm,dm,psd,eds,ptg,ud,ptb-phrase,ptb-func-10-0001-0.0_0.0-0-14combined.out | grep ";all;" | grep gold
  ```
  This gives you a bunch of semicolon-separated lines you can paste into your favorite spreadsheet. Voila!
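If your spreadsheet prefers commas, the filtered lines can be turned into a CSV file. This one-liner is a hypothetical convenience, not part of the repository; `results.csv` is an assumed output name:

```shell
# Hypothetical convenience: filter the result lines as above and convert
# the semicolon separators to commas for easy spreadsheet import.
grep -h ";all;" eval-*combined.out 2>/dev/null | grep gold | tr ';' ',' > results.csv
```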
To get more info on command-line arguments, run `python3 train.py` or `python3 eval.py`.
To evaluate a trained model more generally (this may require an additional input file; contact me!), edit `lm_eval.sh` to match your environment and directory structure, uncomment the lines you want in `eval_all_lm.sh`, and run:
```shell
sh eval_all_lm.sh SEED
```
where `SEED` is the last number before `.pt` in the model name (currently only seed=14 models are available for download).
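As a sketch of that naming convention, the seed can be pulled out of a model filename automatically. The filename below is hypothetical; only the `.pt` suffix and the trailing seed number are assumed from the description above:

```shell
# Hypothetical model filename; the seed is the last number before ".pt".
model="lstlm-dm-10-0001-14.pt"
SEED=$(echo "$model" | sed -E 's/.*[^0-9]([0-9]+)\.pt$/\1/')
echo "$SEED"   # the value to pass to eval_all_lm.sh, here: 14
```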
To train a new model (requires access to `.mrp`-formatted and preprocessed data, which you can find here and/or contact me about), edit `lm.sh` to match your environment and directory structure, uncomment the lines you want in `run_all_lm.sh`, and run:
```shell
sh run_all_lm.sh SEED
```
where `SEED` is a custom random seed you can set.