Towards accurate dependency parsing for Galician with limited resources

  1. Sarymsakova, Albina
  2. Sánchez-Rodríguez, Xulia
  3. Garcia, Marcos
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 247-257

Type: Article

Abstract

Automatic syntactic parsing is a fundamental task in NLP. However, parsers need large, high-quality annotated treebanks to perform well, so parsing quality for low-resource languages such as Galician remains inadequate. In this context, the present study explores several approaches to improving the automatic syntactic analysis of Galician within the Universal Dependencies (UD) framework. First, we analyze how model quality changes when the initial training corpus is enlarged with data from the Galician PUD treebank. Second, we explore the benefits of incorporating contextualized vector representations by comparing several BERT models. Lastly, we assess the impact of adding cross-lingual training data from similar varieties, analyzing the models’ performance across the treebanks used. Our findings show that (1) more training data correlates with better model performance across the treebanks used; (2) monolingual BERT models outperform their multilingual counterparts; and (3) incorporating cross-lingual data improves overall model performance across the treebanks used.

Bibliographic References

  • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.
  • Gamallo, P. and I. González. 2012. DepPattern: a multilingual dependency parser. In Demo Session of the International Conference on Computational Processing of the Portuguese Language (PROPOR 2012), pages 17–20. Citeseer.
  • Garcia, M. 2021. Exploring the representation of word meanings in context: A case study on homonymy and synonymy. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3625–3640, Online, August. Association for Computational Linguistics.
  • Garcia, M., C. Gómez-Rodríguez, and M. A. Alonso. 2018. New treebank or repurposed? On the feasibility of cross-lingual parsing of Romance languages with Universal Dependencies. Natural Language Engineering, 24(1):91–122.
  • Glavaš, G. and I. Vulić. 2021. Climbing the tower of treebanks: Improving low-resource dependency parsing via hierarchical source selection. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4878–4888, Online, August. Association for Computational Linguistics.
  • Kann, K., K. Cho, and S. R. Bowman. 2019. Towards realistic practices in low-resource natural language processing: The development set. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3342–3349, Hong Kong, China, November. Association for Computational Linguistics.
  • Kondratyuk, D. and M. Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.
  • Lopes, L. and T. Pardo. 2024. Towards portparser - a highly accurate parsing system for Brazilian Portuguese following the Universal Dependencies framework. In P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira, and R. Amaro, editors, Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 401–410, Santiago de Compostela, Galicia/Spain, March. Association for Computational Linguistics.
  • Müller-Eberstein, M., R. van der Goot, and B. Plank. 2021. Genre as weak supervision for cross-lingual dependency parsing. In M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4786–4802, Online and Punta Cana, Dominican Republic, November. Association for Computational Linguistics.
  • Sánchez-Rodríguez, X., A. Sarymsakova, L. Castro, and M. Garcia. 2024. Increasing manually annotated resources for Galician: the parallel Universal Dependencies treebank. In P. Gamallo, D. Claro, A. Teixeira, L. Real, M. Garcia, H. G. Oliveira, and R. Amaro, editors, Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 587–592, Santiago de Compostela, Galicia/Spain, March. Association for Computational Linguistics.
  • Vania, C., Y. Kementchedjhieva, A. Søgaard, and A. Lopez. 2019. A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1105–1116, Hong Kong, China, November. Association for Computational Linguistics.
  • Vilares, D., M. Garcia, and C. Gómez-Rodríguez. 2021. Bertinho: Galician BERT representations. arXiv preprint arXiv:2103.13799.
  • Zeman, D., J. Hajič, M. Popel, M. Potthast, M. Straka, F. Ginter, J. Nivre, and S. Petrov. 2018. CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies. In D. Zeman and J. Hajič, editors, Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–21, Brussels, Belgium, October. Association for Computational Linguistics.
  • Zeman, D., M. Popel, M. Straka, [et al.]. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to Universal Dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19, Vancouver, Canada, August. Association for Computational Linguistics.