Discussion:
Fuseki/TDB property path performance
Michael Brunnbauer
2015-06-02 13:58:41 UTC
hi all,

I have performance problems with queries using property paths on a Fuseki
2.0.0 TDB with half a billion triples from Wikidata. Random disk access does
not seem to be the cause: I use an SSD and see low IO tps values during queries
but high CPU usage. I tried with and without the automatically generated
stats.opt.

Counting all birds takes ca. 8s if not called for the first time (no disk
access, everything in memory):

select count(*) where {
?d1 ( <http://www.wikidata.org/entity/P171s> / <http://www.wikidata.org/entity/P171v> )+ <http://www.wikidata.org/entity/Q5113>
}

Counting all beetles does not seem to finish:

select count(*) where {
?d1 ( <http://www.wikidata.org/entity/P171s> / <http://www.wikidata.org/entity/P171v> )+ <http://www.wikidata.org/entity/Q22671>
}

I tried with and without stats.opt and also with inverse paths (^property)
without success.

I guess this is not the "Counting Beyond a Yottabyte" problem?

http://www.w3.org/blog/SW/2012/04/19/no-more-counting-beyond-a-yottabyte-or-why-the-w3c-process-works/
https://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Apr/0003.html

If I do a count(distinct ?d1) in the Bird query, I get the same number so I
guess that the + makes the query "non-counting".

Any idea if this slow performance is to be expected and why?

Regards,

Michael Brunnbauer
--
++ Michael Brunnbauer
++ netEstate GmbH
++ Geisenhausener Straße 11a
++ 81379 München
++ Tel +49 89 32 19 77 80
++ Fax +49 89 32 19 77 89
++ E-Mail ***@netestate.de
++ http://www.netestate.de/
++
++ Sitz: München, HRB Nr.142452 (Handelsregister B München)
++ USt-IdNr. DE221033342
++ Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
++ Prokurist: Dipl. Kfm. (Univ.) Markus Hendel
Andy Seaborne
2015-06-02 14:51:55 UTC
Hi Michael,

A few facts please:

How many birds are there?

What's

SELECT (count(*) AS ?C1)
{ ?d1 <http://www.wikidata.org/entity/P171s> ?v }

SELECT (count(*) AS ?C2)
{ ?s <http://www.wikidata.org/entity/P171v>
<http://www.wikidata.org/entity/Q5113> }

It's probably a bad execution plan. It's supposed to execute the path
backwards, which, subject to the fan-out of the reverse links, should be OK.
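
A minimal sketch of what "executing the path backwards" could look like, written in Python over a toy in-memory triple list rather than TDB. The reverse index, the function names and the sample taxa are all assumptions made for illustration; this is not Jena's actual evaluator.

```python
from collections import defaultdict, deque

P171S = "P171s"  # taxon -> statement node (short names are hypothetical)
P171V = "P171v"  # statement node -> parent taxon

# Toy data, invented for this sketch.
TRIPLES = [
    ("sparrow", P171S, "s1"), ("s1", P171V, "passerine"),
    ("passerine", P171S, "s2"), ("s2", P171V, "bird"),
    ("eagle", P171S, "s3"), ("s3", P171V, "bird"),
    ("beetle", P171S, "s4"), ("s4", P171V, "insect"),
]

def build_reverse_index(triples):
    """Map (predicate, object) -> subjects, so lookups start from the object."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(p, o)].add(s)
    return index

def step_back(index, node):
    """One reversed step of P171s/P171v: all ?d with ?d P171s ?x . ?x P171v node."""
    found = set()
    for stmt in index.get((P171V, node), ()):
        found |= index.get((P171S, stmt), set())
    return found

def subjects_reaching(triples, target):
    """All ?d1 matching ?d1 (P171s/P171v)+ target, found by walking backwards."""
    index = build_reverse_index(triples)
    seen = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for subject in step_back(index, node):
            if subject not in seen:  # visit each node once (set semantics)
                seen.add(subject)
                queue.append(subject)
    return seen

print(sorted(subjects_reaching(TRIPLES, "bird")))  # ['eagle', 'passerine', 'sparrow']
```

The work done grows with the number of reverse links followed from each node, which is the fan-out caveat mentioned above.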
Post by Michael Brunnbauer
hi all,
I have performance problems with queries using property paths on a Fuseki
2.0.0 TDB with half a billion triples from Wikidata. Random disk access does
not seem to be the cause: I use an SSD and see low IO tps values during queries
but high CPU usage. I tried with and without the automatically generated
stats.opt.
Counting all birds takes ca. 8s if not called for the first time (no disk
access, everything in memory):
select count(*) where {
?d1 ( <http://www.wikidata.org/entity/P171s> / <http://www.wikidata.org/entity/P171v> )+ <http://www.wikidata.org/entity/Q5113>
}
(That's not legal SPARQL 1.1: the aggregate has to be written as
(count(*) AS ?c). The joys of compatibility mode. :-)
Post by Michael Brunnbauer
select count(*) where {
?d1 ( <http://www.wikidata.org/entity/P171s> / <http://www.wikidata.org/entity/P171v> )+ <http://www.wikidata.org/entity/Q22671>
}
I tried with and without stats.opt and also with inverse paths (^property)
without success.
I guess this is not the "Counting Beyond a Yottabyte" problem?
http://www.w3.org/blog/SW/2012/04/19/no-more-counting-beyond-a-yottabyte-or-why-the-w3c-process-works/
https://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2012Apr/0003.html
No. (The data isn't a clique unless the data is really bizarre.)

That is a theoretical piece of work on a different design.

(The fact it uses hyperbole and ridicule to make a technical point is
merely annoying.)
Post by Michael Brunnbauer
If I do a count(distinct ?d1) in the Bird query, I get the same number so I
guess that the + makes the query "non-counting".
Yes.
Post by Michael Brunnbauer
Any idea if this slow performance is to be expected and why?
Regards,
Michael Brunnbauer
Michael Brunnbauer
2015-06-02 14:58:05 UTC
Hello Andy,
Post by Andy Seaborne
Hi Michael,
How many birds are there?
14264
Post by Andy Seaborne
What's
SELECT (count(*) AS ?C1)
{ ?d1 <http://www.wikidata.org/entity/P171s> ?v }
1833580
Post by Andy Seaborne
SELECT (count(*) AS ?C2)
{ ?s <http://www.wikidata.org/entity/P171v>
<http://www.wikidata.org/entity/Q5113> }
31

Regards,

Michael Brunnbauer
Andy Seaborne
2015-06-02 14:58:57 UTC
Post by Andy Seaborne
Hi Michael,
How many birds are there?
What's
SELECT (count(*) AS ?C1)
{ ?d1 <http://www.wikidata.org/entity/P171s> ?v }
SELECT (count(*) AS ?C2)
{ ?s <http://www.wikidata.org/entity/P171v>
<http://www.wikidata.org/entity/Q5113> }
[pressed <send> too quickly:]

One other experiment:

select count(*) where {
?x <http://www.wikidata.org/entity/P171v>+
<http://www.wikidata.org/entity/Q5113> .
?d1 <http://www.wikidata.org/entity/P171s> ?x
}

Andy
Michael Brunnbauer
2015-06-02 15:07:10 UTC
Hello Andy,
Post by Andy Seaborne
select count(*) where {
?x <http://www.wikidata.org/entity/P171v>+
<http://www.wikidata.org/entity/Q5113> .
?d1 <http://www.wikidata.org/entity/P171s> ?x
}
31

BTW: I tested with a Virtuoso SPARQL endpoint on a potentially different
Wikidata dump:

select count(*) where {
?d1 ( <http://www.wikidata.org/entity/P171s> /
<http://www.wikidata.org/entity/P171v> )+
<http://www.wikidata.org/entity/Q5113>
}

yields 29626 and

select count(distinct ?d1) where {
?d1 ( <http://www.wikidata.org/entity/P171s> /
<http://www.wikidata.org/entity/P171v> )+
<http://www.wikidata.org/entity/Q5113>
}

yields 14264. Fuseki yields the same number, 14264, for both queries. Did they
get the property path semantics wrong?
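
A toy illustration of how count(*) and count(distinct ?d1) can diverge, assuming an engine that produces one solution per route through the graph. In SPARQL 1.1 the + operator is evaluated with set semantics, so each matching ?d1 appears once. The graph, the node names and both helper functions below are invented for this sketch, and each edge stands for one P171s/P171v hop. It says nothing about how Virtuoso or Fuseki actually work internally.

```python
# child -> parents; "hybrid" has two parents that both lead to "bird".
EDGES = {
    "hybrid": ["a", "b"],
    "a": ["bird"],
    "b": ["bird"],
}

def count_routes(edges, start, target):
    """Path-counting view: number of distinct simple routes from start to target."""
    def walk(node, visited):
        total = 0
        for parent in edges.get(node, ()):
            if parent in visited:
                continue  # never revisit a node on the same route
            if parent == target:
                total += 1
            else:
                total += walk(parent, visited | {parent})
        return total
    return walk(start, {start})

def reaches(edges, start, target):
    """Set-semantics view: is target reachable from start in one or more hops?"""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for parent in edges.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return target in seen

counting = sum(count_routes(EDGES, n, "bird") for n in EDGES)
distinct = sum(1 for n in EDGES if reaches(EDGES, n, "bird"))
print(counting, distinct)  # 4 3
```

Here "hybrid" contributes two routes but only one distinct binding, so the path-counting total is larger than the distinct total.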

Counting beetles works with Virtuoso:

select count(*) where {
?d1 ( <http://www.wikidata.org/entity/P171s> /
<http://www.wikidata.org/entity/P171v> )+
<http://www.wikidata.org/entity/Q22671>
}

yields 210127 and

select count(distinct ?d1) where {
?d1 ( <http://www.wikidata.org/entity/P171s> /
<http://www.wikidata.org/entity/P171v> )+
<http://www.wikidata.org/entity/Q22671>
}

yields 207816.

Regards,

Michael Brunnbauer
Andy Seaborne
2016-07-12 07:58:20 UTC
JENA-1195 should improve the performance of Kleene star patterns.

Andy

https://issues.apache.org/jira/browse/JENA-1195
Now in the snapshot build.
