Worst-Case-Optimal Similarity Joins on Graph Databases

  1. Arroyuelo, Diego 2
  2. Bustos, Benjamin 3
  3. Gómez-Brandón, Adrián 5
  4. Hogan, Aidan 1
  5. Navarro, Gonzalo 1
  6. Reutter, Juan 4
  1. 1 DCC, University of Chile & IMFD, Santiago, Chile
  2. 2 DCC, Escuela de Ingeniería, Pontificia Universidad Católica & IMFD, Santiago, Chile
  3. 3 DCC, University of Chile & IMFD, Santiago, Chile, Chile
  4. 4 DCC, Escuela de Ingeniería, Pontificia Universidad Católica & Instituto de Ingeniería Matemática y Computaciónal, Pontificia Universidad Católica, Santiago, Chile
  5. 5 Universidade da Coruña
    info

    Universidade da Coruña

    La Coruña, España

    ROR https://ror.org/01qckj285

Revista:
Proceedings of the ACM on Management of Data

ISSN: 2836-6573

Año de publicación: 2024

Volumen: 2

Número: 1

Páginas: 1-26

Tipo: Artículo

DOI: 10.1145/3639294 GOOGLE SCHOLAR lock_openAcceso abierto editor

Otras publicaciones en: Proceedings of the ACM on Management of Data

Resumen

We extend the concept of worst-case optimal equijoins in graph databases to the case where some nodes arerequired to be within the ?-nearest neighbors (?-NN) of others under some similarity function. We modelthe problem by superimposing the database graph with the ?-NN graph and show that a variant of LeapfrogTrieJoin (LTJ) implemented over a compact data structure called the Ring can be seamlessly extended tointegrate similarity clauses with the equijoins in the LTJ query process, retaining worst-case optimality inmany relevant cases. Our experiments on a benchmark that combines Wikidata and IMGpedia show that ourenhanced LTJ algorithm outperforms by a considerable margin a baseline that first applies classic LTJ andthen completes the query by applying the similarity predicates. The difference is more pronounced on querieswhere the similarity clauses are more densely connected to the query, becoming of an order of magnitude insome cases.CCS Concepts: • Theory of computation → Database query processing and optimization (theory);Data structures and algorithms for data management.Additional Key Words and Phrases: Worst-case optimal joins; Leapfrog Triejoin; graph patterns; graphdatabases; graph indexing; similarity joins; nearest-neighbor graphs

Información de financiación

Financiadores

Referencias bibliográficas

  • 10.1145/2902251.2902289
  • 10.1145/3034786.3056105
  • 10.14778/3184470.3184473
  • 10.1145/3104031
  • 10.1145/1322432.1322433
  • 10.1145/3448016.3457256
  • 10.1145/3514231
  • B. Arsintescu. How LIquid connects everything so our members can do anything. LinkedIn Engineering Blog, 2023. https://engineering.linkedin.com/blog/2023/how-liquid-connects-everything-so-our-members-can-do-anything.
  • 10.1137/110859440
  • 10.3233/SW-2012-0065
  • B. R. Bebee, D. Choi, A. Gupta, A. Gutmans, A. Khandelwal, Y. Kiran, S. Mallidi, B. McGaughy, M. Personick, K. Rajan, S. Rondelli, A. Ryazanov, M. Schmidt, K. Sengupta, B. B. Thompson, D. Vaidya, and S. Wang. Amazon Neptune: Graph data management in the Cloud. In Proc. ISWC Posters & Demonstrations, 2018.
  • 10.1145/3308558.3313472
  • 10.1109/ICDE.2006.9
  • W. W. Cohen and H. Hirsh. Joins that generalize: Text classification using WHIRL. In Proc. 4th International Conference on Knowledge Discovery and Data Mining (KDD), pages 169--173, 1998.
  • 10.24432/C5CC9H
  • 10.1145/2948992.2949016
  • 10.1371/journal.pone.0074113
  • 10.1145/3514221.3526057
  • 10.1016/0925-7721(95)00009-7
  • 10.1145/1963405.1963487
  • 10.1007/978-3-319-68204-4_8
  • 10.1007/978-3-030-62419-4_12
  • 10.14778/3407790.3407797
  • 10.1145/3398682.3399162
  • 10.1007/978-3-319-67162-8_26
  • 10.1007/978-3-319-07959-2_28
  • R. Grossi, A. Gupta, and J. Vitter. High-order entropy-compressed text indexes. In Proc. 14th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 841--850, 2003.
  • 10.14778/2733004.2733010
  • 10.1080/00031305.1998.10480559
  • 10.1145/276304.276326
  • 10.1007/978-3-642-30284-8_53
  • 10.1007/978-3-030-30793-6_15
  • 10.1109/ICDE53745.2022.00247
  • 10.1007/978-3-540-76298-0_22
  • 10.24432/C50S4B
  • M. Koklu and I. A. Ozkan. Multiclass classification of dry beans using computer vision and machine learning techniques. Computers and Electronics in Agriculture, 174:article 105507, 2020.
  • 10.1007/978-3-030-00668-6_23
  • 10.1145/3446980
  • 10.14778/3554821.3554894
  • 10.1007/3-540-62034-6_35
  • 10.1016/j.jda.2013.07.004
  • 10.1145/3196959.3196990
  • 10.1145/3180143
  • 10.1145/2764947.2764948
  • 10.1007/11764298_8
  • 10.3233/SW-170285
  • 10.1109/ICDE.2010.5447873
  • 10.1145/1978802.1978813
  • 10.14778/3397230.3397250
  • 10.14778/3476249.3476306
  • 10.1007/BF02187718
  • T. L. Veldhuizen. Triejoin: A simple, worst-case optimal join algorithm. In Proc. International Conference on Database Theory (ICDT), pages 96--106, 2014.
  • 10.1145/2629489
  • 10.1109/CVPR.2012.6247790
  • 10.1109/ICDE.1996.492202
  • 10.1007/0-387-29151-2
  • 10.1109/GEOINFORMATICS.2010.5567605
  • Y.-M. Zhang, K. Huang, G. Geng, and C.-L. Liu. Fast knn graph construction with locality sensitive hashing. In Machine Learning and Knowledge Discovery in Databases, pages 660--674. Springer, 2013.