Discussion:
Performance regression between Jena 3.1.0 and 3.2.0
Osma Suominen
2017-03-09 14:48:47 UTC
Hi,

I wanted to report a performance regression I found. This is probably
something that happened to the query optimizer in the Jena 3.1.1
development. It may be rather benign, but the result was a severe
performance regression in my application.

With YSO [1] as data loaded into TDB, this query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
  <http://www.yso.fi/onto/yso/p8627> ?p ?o .
  OPTIONAL {
    { ?p rdfs:subPropertyOf ?pp }
    UNION
    { ?o a ?ot }
  }
}

takes about 300 ms on Jena 3.2.0, while it took only around 25 ms on
Jena 3.1.0.

The fix was to separate the single OPTIONAL block into two:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
  <http://www.yso.fi/onto/yso/p8627> ?p ?o .
  OPTIONAL { ?p rdfs:subPropertyOf ?pp }
  OPTIONAL { ?o a ?ot }
}

The result is that both Jena versions execute the query in around 25 ms.
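
For anyone who wants to reproduce the comparison, here is a minimal timing sketch in Python. It assumes the YSO data is served from a SPARQL endpoint over HTTP (for example Fuseki); the endpoint URL, the dataset name "yso" and the helper names run_select and median_ms are all hypothetical and not part of the original setup. Timings taken this way include HTTP overhead, so they will not match the figures above exactly.

```python
import json
import statistics
import time
import urllib.parse
import urllib.request

# Hypothetical endpoint: assumes a local Fuseki server with a dataset named "yso".
ENDPOINT = "http://localhost:3030/yso/sparql"

PREFIX = "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"

# The slow form: one OPTIONAL block containing a UNION over ?p and ?o.
QUERY_UNION = PREFIX + """SELECT * WHERE {
  <http://www.yso.fi/onto/yso/p8627> ?p ?o .
  OPTIONAL {
    { ?p rdfs:subPropertyOf ?pp }
    UNION
    { ?o a ?ot }
  }
}"""

# The rewritten form: one OPTIONAL block per variable.
QUERY_SPLIT = PREFIX + """SELECT * WHERE {
  <http://www.yso.fi/onto/yso/p8627> ?p ?o .
  OPTIONAL { ?p rdfs:subPropertyOf ?pp }
  OPTIONAL { ?o a ?ot }
}"""


def run_select(query, endpoint=ENDPOINT):
    """Send a SELECT query via a SPARQL 1.1 Protocol form POST; return the bindings."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def median_ms(run, query, repeats=10):
    """Call run(query) several times and return the median wall-clock time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(query)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

With a server running, median_ms(run_select, QUERY_UNION) and median_ms(run_select, QUERY_SPLIT) can be compared on each Jena version. Taking the median over several runs keeps a cold first run from dominating the result.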

You may wonder why I had a query like that in the first place. This is
not the actual query I started with: the original is a far more complex
CONSTRUCT query with many UNIONs within the OPTIONAL block (see [2]).

The important thing was to separate the OPTIONAL block dealing with ?p
from the OPTIONAL block dealing with ?o. As long as a block only deals
with one variable from the pattern above, it may contain multiple
UNIONs. In fact, it makes sense to use UNIONs there to avoid internal
cross products and combinatorial explosion when there are multiple
solutions for each pattern.
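
To make the cross-product point concrete, here is a small Python sketch that evaluates both query shapes over a handful of made-up triples. It is a toy evaluator written only for this illustration, not how Jena executes queries; the ex: names, the data and the function names are all invented. It follows the standard SPARQL reading of OPTIONAL as a left join: a row is extended by every match of the optional part and kept unchanged if there is none.

```python
SUBPROP = "rdfs:subPropertyOf"
TYPE = "rdf:type"

# Made-up data: one (?p, ?o) pair whose property has 2 superproperties
# and whose object has 3 types.
TRIPLES = {
    ("ex:s", "ex:p", "ex:o"),
    ("ex:p", SUBPROP, "ex:pp1"),
    ("ex:p", SUBPROP, "ex:pp2"),
    ("ex:o", TYPE, "ex:T1"),
    ("ex:o", TYPE, "ex:T2"),
    ("ex:o", TYPE, "ex:T3"),
}


def match(triples, subject, predicate):
    """Return the objects of all triples with the given subject and predicate."""
    return [o for (s, p, o) in sorted(triples) if s == subject and p == predicate]


def base_rows(triples, subject):
    """Evaluate the pattern `<subject> ?p ?o`."""
    return [{"p": p, "o": o} for (s, p, o) in sorted(triples) if s == subject]


def optional_union(triples, subject):
    """OPTIONAL { { ?p rdfs:subPropertyOf ?pp } UNION { ?o a ?ot } }"""
    rows = []
    for row in base_rows(triples, subject):
        # Each UNION branch extends the row on its own, so ?pp and ?ot
        # are never bound together in one solution.
        extended = [dict(row, pp=pp) for pp in match(triples, row["p"], SUBPROP)]
        extended += [dict(row, ot=ot) for ot in match(triples, row["o"], TYPE)]
        rows.extend(extended or [row])  # left join: keep unmatched rows
    return rows


def two_optionals(triples, subject):
    """OPTIONAL { ?p rdfs:subPropertyOf ?pp } OPTIONAL { ?o a ?ot }"""
    rows = []
    for row in base_rows(triples, subject):
        first = [dict(row, pp=pp) for pp in match(triples, row["p"], SUBPROP)] or [row]
        for partial in first:
            # The second OPTIONAL is applied to every row produced by the first.
            second = [dict(partial, ot=ot) for ot in match(triples, partial["o"], TYPE)]
            rows.extend(second or [partial])
    return rows


union_rows = optional_union(TRIPLES, "ex:s")   # 2 + 3 = 5 rows
split_rows = two_optionals(TRIPLES, "ex:s")    # 2 * 3 = 6 rows
```

On this data the UNION form yields one row per match (5), while the two separate OPTIONALs yield every combination of ?pp and ?ot (6). The two forms therefore do not return identical SELECT results, although the distinct values bound to ?pp and ?ot are the same in both.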

-Osma


[1] http://api.finto.fi/download/yso/yso-skos.ttl

[2] https://github.com/NatLibFi/Skosmos/blob/master/model/sparql/GenericSparql.php#L404
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi
Andy Seaborne
2017-03-13 15:09:14 UTC
Post by Osma Suominen
Hi,
I wanted to report a performance regression I found. This is probably
something that happened to the query optimizer in the Jena 3.1.1
development. It may be rather benign, but the result was a severe
performance regression in my application.
This comes from the optimizer being more cautious. It does not
distinguish the case of a UNION binding variables in some solutions and
not others from the case of variables being set in nested OPTIONALs.

IMO the rewrite is better anyway.

Thanks for reporting it. It is useful information for any future
optimization work, but as far as I can see there is no limited-scope fix
that could be applied. I have it set up for investigation locally.

Andy