The representativeness threshold for the CETA subcorpus of the Coruña Corpus

  1. Elena Alfaya Lamas (1)
  2. Garrote Espantoso, Menchu

  (1) Universidade da Coruña, La Coruña, Spain. ROR: https://ror.org/01qckj285

Journal:
LFE: revista de lenguas para fines específicos

ISSN: 1133-1127

Year of publication: 2021

Volume: 27

Issue: 2

Pages: 125-139

Type: Article

DOI: 10.20420/RLFE.2021.440

Abstract

The concept of representativeness is the main distinguishing characteristic of specialised corpora in comparison to other sets of texts. The Coruña Corpus of English Scientific Writing currently comprises four published subcorpora (astronomy, life sciences, history, and philosophy) plus three others under compilation (physics, chemistry and linguistics). In this paper we aim to assess the lexical density of the text samples in CETA, the Corpus of English Texts on Astronomy, by means of the ReCor tool, a posteriori. The study is motivated by the following question: does quantitative representativeness analysis using ReCor provide, in the form of a cross-check, further validation of previous research on the representativeness of CETA? Previous work (Crespo and Moskowich, 2010) has indicated that the CETA corpus is well designed and valid for the purposes for which it was intended. We will here suggest metrics to measure these findings. The most important contribution of this study is to offer quantitative data collection results using the ReCor tool, which allows data triangulation and consequently ensures overall data quality. Results show that data analysis with the ReCor tool supports previous findings, and thus we are able to verify that CETA is indeed representative of the language of its time and register.

References

  • Biber, D. (1993). “Using Register-diversified Corpora for General Language Studies”. Computational Linguistics, 19 (2), 219-241.
  • Biber, D., Conrad, S. & Reppen, R. (1998a). Preface. In: D. Biber, S. Conrad & R. Reppen (eds.), Corpus Linguistics: Investigating Language Structure and Use (pp. ix-x). Cambridge: Cambridge University Press.
  • Biber, D., Conrad, S. & Reppen, R. (1998b). Introduction: Goals and Methods of the Corpus-based Approach. In: D. Biber, S. Conrad & R. Reppen (eds.), Corpus Linguistics: Investigating Language Structure and Use (pp. 1-18). Cambridge: Cambridge University Press.
  • Booth, A. D. (1967). “A Law of Occurrences for Words of Low Frequency”. Information and Control, 10 (4), 386-393.
  • Corpas, G. & Seghiri, M. (2010). “Size Matters: A Quantitative Approach to Corpus Representativeness”. In: R. Rabadán (ed.), Lengua, traducción, recepción. En honor de Julio César Santoyo (pp. 112-146). Secretar: Universidad de Alicante.
  • Crespo, B. & Moskowich-Spiegel, I. (2010). “CETA in the Context of the Coruña Corpus”. Literary and Linguistic Computing, 25(2), 153-164.
  • Francis, W. N. (1982). Problems of Assembling and Computerizing Large Corpora. In: S. Johansson (ed.) et al., Computer Corpora in English Language Research (pp. 7-24). Norway: Norwegian Computing Centre for the Humanities.
  • Moskowich-Spiegel, I., Lareo, I., Camiña, G. & Crespo, B. (comps.) (2012). Corpus of English Texts on Astronomy. Amsterdam: John Benjamins.
  • Moskowich-Spiegel, I. (2011). “The Golden Rule of Divine Philosophy: Exemplified in the Coruña Corpus of English Scientific Writing”. Revista de Lenguas para Fines Específicos, 17, 167-197.
  • Moskowich, I. & Crespo García, B. (eds.) (2012). Astronomy ‘playne and simple’: The Writing of Science between 1700 and 1900. Amsterdam: John Benjamins.
  • Moyotl-Hernández, E. & Macías-Pérez, M. (2016). “Método para autocompletar consultas basado en cadenas de Markov y la ley de Zipf”. Research in Computing Science, 115, 157-170.
  • Parapar, J. & Moskowich-Spiegel, I. (2007). “The Coruña Corpus Tool”. Revista de Procesamiento del Lenguaje Natural, 39, 289-290.
  • Sidorov, G. (2013). “N-gramas sintácticos no-continuos”. Polibits, 48, 69-78.
  • Seghiri, M. (2011). “Metodología protocolizada de compilación de un corpus de seguros de viajes: aspectos de diseño y representatividad”. Revista de Lingüística teórica y Aplicada, 49 (2), 13-30.
  • Seghiri, M. (2014). “Too Big or not too Big: Establishing the Minimum Size for a Legal ad hoc Corpus”. Hermes: Journal of Language and Communication in Business, 27 (53), 85-98.
  • Seghiri, M. (2015). Determinación de la representatividad cuantitativa de un corpus ad hoc bilingüe (inglés-español) de manuales de instrucciones generales de lectores electrónicos. In: M. T. Sánchez (ed.), Corpus-based Translation and Interpreting Studies: From description to application (pp. 125-146). Frankfurt: Frank & Timme.
  • Sinclair, J. (1991). Glossary. In: J. Sinclair (ed.) Corpus, Concordance, Collocation (pp. 169-176). Oxford: Oxford University Press.
  • Torruella, J. & Llisterri, J. (1999). Diseño de corpus textuales y orales. In: J. M. Blecua (ed.) et al. Filología e informática. Nuevas tecnologías en los estudios filológicos (pp. 45-77). Barcelona: Universidad Autónoma de Barcelona.