Placement of VALUES block affects performance in 3.1.1

Discussion:

Osma Suominen

2016-11-01 09:03:05 UTC

Hi,

I'm investigating a performance regression we're seeing with the current
Jena 3.1.1-SNAPSHOT compared to 3.1.0.

The data in graph <http://www.yso.fi/onto/yso/> is the YSO ontology,
available from http://api.finto.fi/download/yso/yso-skos.ttl

This query used to take about 0.2 seconds (with 3.1.0) and now takes
about 10 seconds:

--cut--
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
FROM NAMED <http://www.yso.fi/onto/yso/>
WHERE {
?uri ?p ?o .
OPTIONAL {
?x skos:member ?o .
FILTER NOT EXISTS {
?x skos:member ?other .
FILTER NOT EXISTS {
?other skos:broader ?uri
}
}
}
VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
}
--cut--

If I move the VALUES block to the top of the query, right after WHERE,
then the query becomes fast again.

Is the placement of the VALUES block supposed to affect query evaluation
order in this way? It appears to me that in the slow version, ?uri is
not bound inside the inner FILTER NOT EXISTS, which causes an explosion
of results internally.

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi

Osma Suominen

2016-11-01 12:38:55 UTC

Permalink

Hi,

Some further observations. The query I sent earlier was a minimal
example, and it was possible to fix it by just moving the VALUES block.
But a slightly more realistic (closer to the original query I'm having
problems with) example involves a UNION and cannot be fixed so easily -
placing the VALUES block first doesn't help:

--cut--
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
WHERE {
VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
{ ?s ?p ?uri }
UNION
{ ?uri ?p ?o
OPTIONAL {
?x skos:member ?o .
FILTER NOT EXISTS {
?x skos:member ?other .
FILTER NOT EXISTS {
?other skos:broader ?uri
}
}
} }
}
--cut--

Jena 3.1.0 tdbquery: 0.9 seconds
Jena 3.1.1-SNAPSHOT tdbquery: 12.8 seconds

I'm aware that in SPARQL, evaluation proceeds from the inside out and
Jena ARQ has moved more and more in this direction with recent releases,
which may also explain this change. But how should VALUES blocks be
placed for optimal query execution? It seems like a waste not to
propagate those fixed bindings into inner parts of the query, even
though that may violate the inside-out order. In the above query, I
don't know where to place the VALUES so that the binding for ?uri (in
effect, changing the variable to a constant) would be applied in all
parts of the query.

Placing the VALUES block at the bottom of the query (outside the WHERE
block) doesn't help either. In fact execution time increases to 17
seconds with 3.1.1-SNAPSHOT (but is unchanged with 3.1.0).

I tried --engine=ref and it was extremely slow also with 3.1.0, so in
that sense, nothing has changed, only an optimization has been dropped
somewhere.

Should I report this as an issue? Or am I just doing something wrong?

-Osma

Post by Osma Suominen
Hi,
I'm investigating a performance regression we're seeing with the current
Jena 3.1.1-SNAPSHOT compared to 3.1.0.
The data in graph <http://www.yso.fi/onto/yso/> is the YSO ontology,
available from http://api.finto.fi/download/yso/yso-skos.ttl
This query used to take about 0.2 seconds (with 3.1.0) and now takes
--cut--
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
FROM NAMED <http://www.yso.fi/onto/yso/>
WHERE {
?uri ?p ?o .
OPTIONAL {
?x skos:member ?o .
FILTER NOT EXISTS {
?x skos:member ?other .
FILTER NOT EXISTS {
?other skos:broader ?uri
}
}
}
VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
}
--cut--
If I move the VALUES block to the top of the query, right after WHERE,
then the query becomes fast again.
Is the placement of the VALUES block supposed to affect query evaluation
order in this way? It appears to me that in the slow version, ?uri is
not bound inside the inner FILTER NOT EXISTS, which causes an explosion
of results internally.
-Osma

Andy Seaborne

2016-11-01 19:12:08 UTC

Permalink

Post by Osma Suominen
Hi,
Some further observations. The query I sent earlier was a minimal
example, and it was possible to fix it by just moving the VALUES block.
But a slightly more realistic (closer to the original query I'm having
problems with) example involves a UNION and cannot be fixed so easily -
--cut--
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT *
WHERE {
VALUES ?uri { <http://www.yso.fi/onto/yso/p864> }
{ ?s ?p ?uri }
UNION
{ ?uri ?p ?o
OPTIONAL {
?x skos:member ?o .
FILTER NOT EXISTS {
?x skos:member ?other .
FILTER NOT EXISTS {
?other skos:broader ?uri

^^^^^^ [*]

Post by Osma Suominen
}
}
} }
}
--cut--
Jena 3.1.0 tdbquery: 0.9 seconds
Jena 3.1.1-SNAPSHOT tdbquery: 12.8 seconds
I'm aware that in SPARQL, evaluation proceeds from the inside out and
Jena ARQ has moved more and more in this direction with recent releases,
which may also explain this change.

It has always been inside-out then optimized to use stream based index
joins.

However in this case "inside out" is confusing because the query has a
double negation of FILTER NOT EXIST.

At 3.1.1 (JENA-1171), EXISTS are analysed whereas previous they were
skipped which could lead to wrong answers.

Osma - could you please try putting the VALUES in each arm of the UNION
which gets you to something like the first example.

The issue is [*], using the variable ?uri again inside an OPTIONAL.

It is possible that ?uri will range over more than the VALUES setting
and affect the OPTIONAL yet the inner EXISTS usage does not set ?uri and
it is not propagated to be joined with the set value.

As wikipedia says for correlated subquery in SQL:
"Because the subquery is evaluated once for each row processed by the
outer query, it can be inefficient."

Post by Osma Suominen
But how should VALUES blocks be
placed for optimal query execution? It seems like a waste not to
propagate those fixed bindings into inner parts of the query, even
though that may violate the inside-out order.

It can be pushed in because:

join(A, union(B,C)) == union(join(A,B), join(A,C))

now if A is an complex expression, that is a bad idea (probably).

If A is a small VALUES block then it makes sense. It isn't done though.

Post by Osma Suominen
In the above query, I
don't know where to place the VALUES so that the binding for ?uri (in
effect, changing the variable to a constant) would be applied in all
parts of the query.

See above.

Post by Osma Suominen
Placing the VALUES block at the bottom of the query (outside the WHERE
block) doesn't help either. In fact execution time increases to 17
seconds with 3.1.1-SNAPSHOT (but is unchanged with 3.1.0).
I tried --engine=ref and it was extremely slow also with 3.1.0, so in
that sense, nothing has changed, only an optimization has been dropped
somewhere.
Should I report this as an issue? Or am I just doing something wrong?
-Osma

james anderson

2016-11-01 22:40:42 UTC

Permalink

good evening;

Hi,
[about a query which exercises the interaction between a values clause and statement pattern at locations relatively remote in the query. ]

[ the question leads to a point about the scope of the respective ?uri variables, ]
The issue is [*], using the variable ?uri again inside an OPTIONAL.
It is possible that ?uri will range over more than the VALUES setting and affect the OPTIONAL yet the inner EXISTS usage does not set ?uri and it is not propagated to be joined with the set value.
[âŠ]
[ âŠ and to one about algebraic equivalence among expressions and the consequent latitude to relocate the values clause.]
join(A, union(B,C)) == union(join(A,B), join(A,C))
now if A is an complex expression, that is a bad idea (probably).
If A is a small VALUES block then it makes sense. It isn't done though.

both of which are significant, but remain side matters to the principal issue, which concerns a reasonable expectation - despite that it may be subject to revision, that the values clause introduces bindings with indefinite scope.[1]

as sparql plays in increasingly significant role in service implementations, library-level support for query parameters[2] needs to give way to protocol-level mechanisms,[3] whereby the language semantics needs more attention than it has yet received.
both immediately federated request combinations and request chains which are mediated by a third service need to propagate intermediate solution sets across requests.
the use case behind this question is one example.

the implementations for elementary variables[3] do not provide sufficient capacity - not to mention the deficiencies of their substitution models.
the sparql federated query recommendation suggests a role which the values clause might play,[4] but does not provide an exact interpretation.

has jenaâs community given this topic any thought?

best regards, from berlin,
- - -
[1] : http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node43.html#SECTION00700000000000000000
[2] : https://jena.apache.org/documentation/query/parameterized-sparql-strings.html
[3] : http://docs.rdf4j.org/rest-api/#_repository_queries
[4] : http://www.w3.org/TR/sparql11-federated-query/#values

---
james anderson | ***@dydra.com | http://dydra.com

Andy Seaborne

2016-11-02 10:05:52 UTC

Permalink

Post by james anderson
good evening;

Hi,
[about a query which exercises the interaction between a values clause and statement pattern at locations relatively remote in the query. ]

[ the question leads to a point about the scope of the respective ?uri variables, ]
The issue is [*], using the variable ?uri again inside an OPTIONAL.
It is possible that ?uri will range over more than the VALUES setting and affect the OPTIONAL yet the inner EXISTS usage does not set ?uri and it is not propagated to be joined with the set value.
[…]
[ … and to one about algebraic equivalence among expressions and the consequent latitude to relocate the values clause.]
join(A, union(B,C)) == union(join(A,B), join(A,C))
now if A is an complex expression, that is a bad idea (probably).
If A is a small VALUES block then it makes sense. It isn't done though.

The performance issue is that

{ P1 OPTIONAL{P2} VALUES } != { VALUES P1 OPTIONAL{P2} }
but is instead:
{ P1 OPTIONAL{P2} VALUES } == { VALUES...{} { P1 OPTIONAL{P2} } }

but in this data it gets the same answers (presumably) and rather faster.

This is not really about EXISTS at all. Other {P} will have the same
effect.

The programming notion of scope is only an analogy for query languages.
In a query language, same scope means same reference - ie. equivalence.
"Same scope" in a query language means equality-inner-join at some later
point to test they are the same value. Equating variables from
sub-pattern evaluation is like SQL "natural join".

The optimizer is finding situations when equality can be executed as
equivalence (c.f. index join).

In ARQ execution, variables are transformed by renaming apart early in
optimization making static scoping a non-issue for the rule based
optimizations.

Andy

Post by james anderson
as sparql plays in increasingly significant role in service implementations, library-level support for query parameters[2] needs to give way to protocol-level mechanisms,[3] whereby the language semantics needs more attention than it has yet received.
both immediately federated request combinations and request chains which are mediated by a third service need to propagate intermediate solution sets across requests.
the use case behind this question is one example.
the implementations for elementary variables[3] do not provide sufficient capacity - not to mention the deficiencies of their substitution models.
the sparql federated query recommendation suggests a role which the values clause might play,[4] but does not provide an exact interpretation.
has jena’s community given this topic any thought?
best regards, from berlin,
- - -
[1] : http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node43.html#SECTION00700000000000000000
[2] : https://jena.apache.org/documentation/query/parameterized-sparql-strings.html
[3] : http://docs.rdf4j.org/rest-api/#_repository_queries
[4] : http://www.w3.org/TR/sparql11-federated-query/#values
---

Osma Suominen

2016-11-02 11:08:18 UTC

Permalink

Hi Andy!

Thanks a lot for looking into this and your very clear explanation.

Post by Andy Seaborne
It has always been inside-out then optimized to use stream based index
joins.
However in this case "inside out" is confusing because the query has a
double negation of FILTER NOT EXIST.

Right. Let me explain why the query does this.

The query (or actually a longer variant thereof) is used in the Skosmos
application to generate a list of search results. At the time this query
is used the application already has a (possibly long) list of the SKOS
concept URIs it wants to display. This query is used to look up
additional information about those concepts.

Instead of performing a separate query for each concept URI, a single
query with a VALUES block containing up to 20 concept URIs is used. So
in this case, the VALUES is used thinking of it as a sort of for-each
loop. A single query for 20 concepts is faster than performing 20
queries for individual concepts - at least it used to be.

The reason for the double negation is this:

For each concept X, we want to display the SKOS Collections in which
concept X is a member, but only if the collection consists only of
siblings of X (i.e. sharing at least one broader concept with X).
In practice, this has to be turned around into a double negation: for
each collection, check that it doesn't include a concept that doesn't
share any parent with X.

It's possible that a MINUS expression could be used instead of FILTER
NOT EXISTS and perform better. I will have to test this. But other than
switching to MINUS, I can't think of any way to express this constraint
on collections without using some kind of double negation.

Post by Andy Seaborne
At 3.1.1 (JENA-1171), EXISTS are analysed whereas previous they were
skipped which could lead to wrong answers.
Osma - could you please try putting the VALUES in each arm of the UNION
which gets you to something like the first example.

I will try this as well.

Post by Andy Seaborne
join(A, union(B,C)) == union(join(A,B), join(A,C))
now if A is an complex expression, that is a bad idea (probably).
If A is a small VALUES block then it makes sense. It isn't done though.

Ok. So a potential future optimization perhaps.

-Osma

Osma Suominen

2016-11-02 16:24:34 UTC

Permalink

Post by Osma Suominen
It's possible that a MINUS expression could be used instead of FILTER
NOT EXISTS and perform better. I will have to test this. But other than
switching to MINUS, I can't think of any way to express this constraint
on collections without using some kind of double negation.

I played with MINUS, but it was no better than FILTER NOT EXISTS.

Post by Osma Suominen

Post by Andy Seaborne
Osma - could you please try putting the VALUES in each arm of the UNION
which gets you to something like the first example.

I will try this as well.

This helped - a lot. The query time is practically the same now as it
was with Jena 3.1.0. Thanks!

The original query, though, has four UNION branches. It seems a bit
stupid to put the same VALUES statement (which may have 20 URIs or more)
in each of those arms, but I will do that if it's the only way to get
good performance with 3.1.1.

-Osma