Segmentación de palabras en español mediante modelos del lenguaje basados en redes neuronales (Spanish word segmentation using language models based on neural networks)

  1. Gómez Rodríguez, Carlos
  2. Vilares Ferro, Jesús
  3. Doval, Yerai
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2016

Issue: 57

Pages: 75-82

Type: Article

Abstract

Social media platforms abound in special tokens, such as hashtags and mentions, in which multiple words are written together without spaces between them, e.g. #leapyear or @ryanreynoldsnet. Because of the way this kind of text is written, this word assembly phenomenon can appear together with its opposite, word segmentation; both can affect any token of the text and make it harder to analyze. In this work we present an algorithmic approach based on a language model, in this case a neural model, to solve the problem of word segmentation and assembly: we try to recover the standard spacing of the words that have undergone one of these transformations by adding or deleting spaces where necessary. The promising results indicate that, after some further refinement of the language model, it will be possible to surpass the state of the art.
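
The idea of recovering standard spacing by adding or deleting spaces can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not describe the search procedure, and the neural language model is replaced here by a toy unigram model. The names `TOY_COUNTS`, `MAX_WORD_LEN`, `word_logprob` and `respace`, as well as the counts and the unknown-word penalty, are assumptions made purely for illustration.

```python
import math

# Hypothetical toy unigram counts; a stand-in for the neural language model.
TOY_COUNTS = {"leap": 5, "year": 8, "a": 20, "is": 15, "this": 10, "long": 4}
TOTAL = sum(TOY_COUNTS.values())
MAX_WORD_LEN = 10  # assumed upper bound on candidate word length


def word_logprob(word):
    """Log-probability of a word; unseen words get a length-based penalty."""
    count = TOY_COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    return math.log(1.0 / TOTAL) - 5.0 * len(word)


def respace(text):
    """Drop the existing spaces, then choose the best-scoring segmentation."""
    chars = text.replace(" ", "")
    n = len(chars)
    # best[i] holds (score, words) for the best segmentation of chars[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            prev_score, prev_words = best[start]
            candidate = chars[start:end]
            score = prev_score + word_logprob(candidate)
            if score > best[end][0]:
                best[end] = (score, prev_words + [candidate])
    return " ".join(best[n][1])


print(respace("leapyear"))              # joined words: spaces are added
print(respace("this is a le ap year"))  # split word: a space is deleted
```

Removing every space before searching lets a single dynamic-programming pass handle both transformations, since any spurious space disappears and any missing one can be reinserted. A stronger model would score each word in context rather than in isolation.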

Bibliographic References

  • Adda-Decker, M., G. Adda, and L. Lamel. 2000. Investigating text normalization and pronunciation variants for German broadcast transcription. In ICSLP’2000, pages 266–269.
  • Alfonseca, E., S. Bilac, and S. Pharies. 2008. Decompounding query keywords from compounding languages. In Proc. of the 46th Annual Meeting of the ACL: Short Papers, HLT-Short ’08, pages 253–256, Stroudsburg, PA, USA. ACL.
  • Alonso, M. A., C. Gómez-Rodríguez, D. Vilares, Y. Doval, and J. Vilares. 2015. Seguimiento y análisis automático de contenidos en redes sociales. In Actas: III Congreso Nacional de i+d en Defensa y Seguridad, DESEi+d 2015, pages 899–906.
  • Alonso, M. A. and D. Vilares. 2016. A review on political analysis and social media. Procesamiento del Lenguaje Natural, 56:13–24.
  • Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.
  • Berger, A. L., S. D. Pietra, and V. J. D. Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71.
  • Brown, P. F., P. V. deSouza, R. L. Mercer, V. J. D. Pietra, and J. C. Lai. 1992. Class-based n-gram models of natural language. Comput. Linguist., 18(4):467–479, December.
  • Cebrian, M. 2012. Using friends as sensors to detect planetary-scale contagious outbreaks. In Proc. of the 1st International Workshop on Multimodal Crowd Sensing, CrowdSens ’12, pages 15–16, New York, NY, USA. ACM.
  • Chen, S. F. 1998. An empirical study of smoothing techniques for language modeling. Technical report.
  • Chi, C.-H., C. Ding, and A. Lim. 1999. Word segmentation and recognition for web document framework. In Proc. of the Eighth International Conference on Information and Knowledge Management, CIKM ’99, pages 458–465, New York, NY, USA. ACM.
  • Gallinucci, E., M. Golfarelli, and S. Rizzi. 2013. Meta-stars: Multidimensional modeling for social business intelligence. In Proc. of the Sixteenth International Workshop on Data Warehousing and OLAP, DOLAP ’13, pages 11–18, New York, NY, USA. ACM.
  • Hochreiter, S. and J. Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Huang, C. and H. Zhao. 2007. Chinese word segmentation: A decade review. Journal of Chinese Information Processing, 21(3):8–20.
  • Józefowicz, R., W. Zaremba, and I. Sutskever. 2015. An empirical exploration of recurrent network architectures. In Proc. of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, pages 2342–2350.
  • Kacmarcik, G., C. Brockett, and H. Suzuki. 2000. Robust segmentation of Japanese text into a lattice for parsing. In Proc. of the 18th Conference on Computational Linguistics - Volume 1, COLING ’00, pages 390–396, Stroudsburg, PA, USA. ACL.
  • Koehn, P. and K. Knight. 2003. Empirical methods for compound splitting. In Proc. of the Tenth Conference on European Chapter of the ACL - Volume 1, EACL ’03, pages 187–193, Stroudsburg, PA, USA. ACL.
  • Lafferty, J. D., A. McCallum, and F. C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the Eighteenth International Conference on Machine Learning (ICML 2001), Williams College, Williamstown, MA, USA, pages 282–289.
  • Maynard, D. and M. A. Greenwood. 2014. Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis. In LREC, pages 4238–4243.
  • Mikolov, T. and G. Zweig. 2012. Context dependent recurrent neural network language model. In 2012 IEEE Spoken Language Technology Workshop (SLT), Miami, FL, USA, pages 234–239.
  • Srinivasan, S., S. Bhattacharya, and R. Chakraborty. 2012. Segmenting web-domains and hashtags using length specific models. In Proc. of the 21st ACM International Conference on Information and Knowledge Management, CIKM ’12, pages 1113–1122, New York, NY, USA. ACM.
  • Suzuki, H., C. Brockett, and G. Kacmarcik. 2000. Using a broad-coverage parser for word-breaking in Japanese. In Proc. of the 18th Conference on Computational Linguistics - Volume 2, COLING ’00, pages 822–828, Stroudsburg, PA, USA. ACL.
  • Wang, K., C. Thrasher, and B.-J. P. Hsu. 2011. Web scale NLP: A case study on URL word breaking. In Proc. of the 20th International Conference on World Wide Web, WWW ’11, pages 357–366, New York, NY, USA. ACM.
  • Wu, A. and Z. Jiang. 1998. Word segmentation in sentence analysis. In Proc. of the 1998 International Conference on Chinese Information Processing, pages 169–180.