Optimizing SPARQL Queries using Shape Statistics

Cardinality estimates are essential for finding a good join order and thereby improving query performance. We performed these experiments to assess the impact of shapes statistics of RDF graphs on cardinality estimation. We generated global and shapes statistics and proposed a join ordering technique that uses these statistics to estimate cardinalities and produce efficient query plans. We used two synthetic datasets (LUBM and WATDIV) and one real dataset (YAGO-4). We compared our plans against the query plans proposed by the Jena ARQ query engine, GraphDB, Characteristic Sets, and the SumRDF approach. On this page we present the technical details of our experiments: how to generate the statistics, how to run the experiments, links to the datasets, and finally the results.
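
To illustrate why cardinality estimates matter for join ordering, here is a minimal sketch of a greedy ordering over triple patterns. It is not the join ordering technique evaluated in this repository; the function names, the query, and the cardinality values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# A triple pattern is (subject, predicate, object); variables start with "?".
Triple = Tuple[str, str, str]


def variables(tp: Triple) -> Set[str]:
    """Return the set of variables appearing in a triple pattern."""
    return {term for term in tp if term.startswith("?")}


def greedy_join_order(patterns: List[Triple],
                      est_card: Dict[Triple, int]) -> List[Triple]:
    """Order triple patterns greedily by estimated cardinality.

    Start with the pattern that has the smallest estimate, then repeatedly
    pick the cheapest remaining pattern that shares a variable with the
    patterns already chosen, so Cartesian products are avoided if possible.
    """
    if not patterns:
        return []
    remaining = list(patterns)
    first = min(remaining, key=lambda tp: est_card[tp])
    order = [first]
    remaining.remove(first)
    bound = variables(first)
    while remaining:
        connected = [tp for tp in remaining if variables(tp) & bound]
        # Fall back to any remaining pattern if none is connected.
        candidates = connected or remaining
        nxt = min(candidates, key=lambda tp: est_card[tp])
        order.append(nxt)
        remaining.remove(nxt)
        bound |= variables(nxt)
    return order


if __name__ == "__main__":
    query = [
        ("?x", "rdf:type", "ub:GraduateStudent"),
        ("?x", "ub:takesCourse", "?c"),
        ("?c", "rdf:type", "ub:GraduateCourse"),
    ]
    # Hypothetical estimates, not taken from the published statistics files.
    cards = {query[0]: 20000, query[1]: 70000, query[2]: 5000}
    for tp in greedy_join_order(query, cards):
        print(tp)
```

With these made-up estimates the most selective pattern is evaluated first, and each later pattern joins on an already bound variable. The better the estimates, the more likely such an ordering keeps intermediate results small.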

Persistent URI & Licence:

All data and results presented in our experimental study are available at https://github.com/Kashif-Rabbani/sparql-optimization/ under the Apache License 2.0.

Datasets, Queries and the Statistics used:

We used the following datasets, queries, and statistics:

Dataset          RDF Dump  Queries             Stats
LUBM             Download  See LUBM Queries    Global and Shapes Statistics
YAGO-4           Download  See YAGO-4 Queries  Global and Shapes Statistics
WATDIV-100M      Download  See WATDIV Queries  Global and Shapes Statistics
WATDIV-1Billion  Download  See WATDIV Queries  Global and Shapes Statistics

How does it work?

1. Generating SHACL Shapes Graph:

  Given an RDF graph, we used the shaclgen library (https://pypi.org/project/shaclgen/) to generate its SHACL shapes graph.

2. Generating Shapes Statistics:

  We use the Shapes Annotator component to extend the SHACL shapes graph with the statistics of the RDF graph. For example, for the YAGO-4 dataset, we use the https://github.com/Kashif-Rabbani/sparql-optimization/blob/main/code/yagoConfig.properties file with generateStatistics=true.
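
  As a rough illustration of what annotating shapes with statistics involves, the sketch below counts instances per class and, for each class and property, the number of triples and distinct objects. It is not the Shapes Annotator implementation: the function name, the dictionary keys, and the sample triples are hypothetical, and the input is assumed to be a duplicate-free list of triples with prefixed names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"


def shape_statistics(triples: Iterable[Triple]) -> Dict[str, dict]:
    """Compute per-class and per-(class, property) counts from triples."""
    triples = list(triples)

    # Map each subject to the classes it is declared an instance of.
    classes_of = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            classes_of[s].add(o)

    # Count the instances of every class (one node shape per class).
    stats: Dict[str, dict] = {}
    for classes in classes_of.values():
        for c in classes:
            entry = stats.setdefault(c, {"instances": 0, "properties": {}})
            entry["instances"] += 1

    # Count triples and collect distinct objects per (class, property).
    objects = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            continue
        for c in classes_of.get(s, ()):
            prop = stats[c]["properties"].setdefault(
                p, {"count": 0, "distinct_objects": 0})
            prop["count"] += 1
            objects[(c, p)].add(o)
    for (c, p), objs in objects.items():
        stats[c]["properties"][p]["distinct_objects"] = len(objs)
    return stats


if __name__ == "__main__":
    # Hypothetical sample data, not drawn from any of the datasets.
    sample = [
        ("ex:alice", "rdf:type", "ex:Student"),
        ("ex:bob", "rdf:type", "ex:Student"),
        ("ex:alice", "ex:takesCourse", "ex:db"),
        ("ex:alice", "ex:takesCourse", "ex:ml"),
        ("ex:bob", "ex:takesCourse", "ex:db"),
    ]
    print(shape_statistics(sample))
```

  Counts of this kind, attached to the node and property shapes of a class, are more specific than global per-predicate counts, because they describe how a property behaves for instances of that class only.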

3. Running Experiments:

  We loaded all datasets into Jena TDB, bundled the code in a JAR, and created a config file for each type of experiment. For example, we used the following pattern to run experiments using:

Evaluation Results:

The results are discussed in the paper and are available in the results_data folder.