Word segmentation in Spanish using language models based on neural networks

  1. Gómez Rodríguez, Carlos
  2. Vilares Ferro, Jesús
  3. Doval, Yerai
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2016

Issue: 57

Pages: 75-82

Type: Article


Social media platforms abound with special tokens, such as hashtags and mentions, in which multiple words are written together without spaces between them; e.g. #leapyear or @ryanreynoldsnet. Because of the way this kind of text is written, this word assembly phenomenon can appear alongside its opposite, word segmentation, and either can affect any token of the text, making the text harder to analyze. In this work we present an algorithmic approach based on a language model, in this case a neural model, to the problem of word segmentation and assembly: we try to recover the standard spacing of words that have undergone one of these transformations by adding or deleting spaces where necessary. The promising results indicate that, after some further refinement of the language model, it will be possible to surpass the state of the art.
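The general idea of recovering spacing with a language model can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not describe the algorithm's details, so the sketch swaps in a toy unigram model with made-up counts for the neural model and uses a simple dynamic-programming search. The names `TOY_COUNTS`, `log_prob`, and `respace` are hypothetical, and the input is assumed to be lowercase with the leading `#` or `@` already stripped.

```python
import math

# Hypothetical toy vocabulary with made-up counts; a stand-in for a trained
# (e.g. neural) language model, which the sketch does not implement.
TOY_COUNTS = {"leap": 5, "year": 8, "ryan": 3, "reynolds": 2, "net": 4, "a": 10, "in": 9}
TOTAL = sum(TOY_COUNTS.values())


def log_prob(word):
    """Unigram log-probability, with a length-based penalty for unknown words."""
    if word in TOY_COUNTS:
        return math.log(TOY_COUNTS[word] / TOTAL)
    # Unknown words: the penalty grows with length, discouraging long unknown chunks.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def respace(text, max_word_len=20):
    """Drop existing spaces, then pick the highest-scoring spacing by dynamic programming."""
    chars = text.replace(" ", "")
    n = len(chars)
    # best[i] holds (score, words) for the best segmentation of chars[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + log_prob(chars[start:end])
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [chars[start:end]])
    return " ".join(best[n][1])


print(respace("leapyear"))         # leap year
print(respace("le ap year"))       # leap year
print(respace("ryanreynoldsnet"))  # ryan reynolds net
```

Because the sketch removes all spaces before searching, the same routine handles both joined words (spaces must be added) and wrongly split words (spaces must be deleted). A stronger language model would replace `log_prob` with a context-aware score.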

