Discussion:
How do I do a join between multiple model.listStatements calls?
Niels Andersen
2016-11-13 06:59:37 UTC
Permalink
Dear user community,

Our current approach to joining multiple model.listStatements (with SimpleSelector) calls is to take the contents of the iterators returned, add them to separate HashSets, and then use methods such as retainAll to find the intersection between the two sets.
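For concreteness, that approach looks roughly like this (a minimal sketch; the property names are placeholders, not our actual vocabulary):

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;

public class ClientSideJoin {

    // Collect the subjects of all statements that use the given predicate.
    static Set<Resource> subjectsOf(Model model, Property p) {
        Set<Resource> subjects = new HashSet<>();
        model.listStatements(null, p, (RDFNode) null)
             .forEachRemaining(stmt -> subjects.add(stmt.getSubject()));
        return subjects;
    }

    // "Join" two listStatements calls by intersecting their subject sets.
    public static Set<Resource> joinSubjects(Model model, Property p1, Property p2) {
        Set<Resource> left = subjectsOf(model, p1);
        left.retainAll(subjectsOf(model, p2)); // keep subjects present in both
        return left;
    }
}
```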

This works relatively well when model.listStatements returns a small to medium number of statements.

My problem is that this seems to be a very inefficient way of joining two sets of data that are already ordered in TDB. I assume that there must be a better way to do this. I have searched the web, but all uses of listStatements are very simple.

I have also not found an effective way to do filtering (for instance, literal less than 5) without comparing every statement that listStatements returns.

My questions are:

* What is the recommended way to do a join between two lists of statements?

* What is the recommended way to implement filtering?

* Is there anything other than SimpleSelector? Are there any advanced selectors?

Thanks in advance,
Niels
Martynas Jusevičius
2016-11-13 07:39:19 UTC
Permalink
Why not SPARQL FILTER?
https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#expressions
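For example, the "literal less than 5" filtering from the original post can be pushed into the query itself rather than applied statement by statement (a sketch; the ex: names are placeholders):

```sparql
PREFIX ex: <http://example.org/>
SELECT ?child ?value
WHERE {
  ex:node ex:child ?child .    # first triple pattern
  ?child  ex:value ?value .    # joined with the first on ?child
  FILTER (?value < 5)          # evaluated by the query engine, not the client
}
```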
Post by Niels Andersen
Dear user community,
Our current approach to joining multiple model.listStatements (with
SimpleSelector) calls is to take the content of the iterators returned and
add them to separate HashSets and then use functions such as retainAll to
find the intersection between the two sets.
This works relatively well when model.listStatements returns a small to
medium number of statements.
My problem is that this seems to be a very inefficient way of joining two
sets of data that are already ordered in TDB. I assume that there must be a
better way to do this. I have searched the web, but all uses of
listStatements are very simple.
I have also not found an effective way to do filtering (for instance
literal less than 5) without comparing every statement that listStatements
returns
* What is the recommended way to do a join between two lists of statements?
* What is the recommended way to implement filtering?
* Is there anything else than SimpleSelector? Are there any Advanced selectors?
Thanks in advance,
Niels
Claude Warren
2016-11-13 09:04:04 UTC
Permalink
Niels,

SPARQL (https://www.w3.org/TR/rdf-sparql-query/) provides a simple way to
join the triples of different statements and can be called from within your
java code (http://jena.apache.org/documentation/query/index.html).
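A minimal sketch of that, assuming placeholder ex: properties — the two triple patterns share ?s, which is exactly the join asked about in the original question:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;

public class SparqlJoinExample {

    // Let the query engine perform the join; collect the bound subjects.
    public static List<String> joinedSubjects(Model model) {
        String q =
            "PREFIX ex: <http://example.org/> " +
            "SELECT ?s WHERE { ?s ex:p1 ?x . ?s ex:p2 ?y }"; // join on ?s
        List<String> out = new ArrayList<>();
        try (QueryExecution qexec = QueryExecutionFactory.create(q, model)) {
            ResultSet rs = qexec.execSelect();
            while (rs.hasNext()) {
                out.add(rs.next().get("s").toString());
            }
        }
        return out;
    }
}
```

The same code works unchanged against a TDB-backed model; the join then runs over TDB's indexes rather than in client memory.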

As noted previously using a filter should do the trick. There is
documentation for how to write your own filter if you need to but you may
find that your filter requirements are already met by existing filters.

Claude
Post by Niels Andersen
Dear user community,
Our current approach to joining multiple model.listStatements (with
SimpleSelector) calls is to take the content of the iterators returned and
add them to separate HashSets and then use functions such as retainAll to
find the intersection between the two sets.
This works relatively well when model.listStatements returns a small to
medium number of statements.
My problem is that this seems to be a very inefficient way of joining two
sets of data that are already ordered in TDB. I assume that there must be a
better way to do this. I have searched the web, but all uses of
listStatements are very simple.
I have also not found an effective way to do filtering (for instance
literal less than 5) without comparing every statement that listStatements
returns
* What is the recommended way to do a join between two lists of statements?
* What is the recommended way to implement filtering?
* Is there anything else than SimpleSelector? Are there any Advanced selectors?
Thanks in advance,
Niels
--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Niels Andersen
2016-11-13 19:33:09 UTC
Permalink
Claude and Martynas,



Thank you for your quick response.



We are aware that the SPARQL language provides join and filtering capabilities; it is, however, important to be reminded that it exists so that we do not get stuck in a single implementation track. Thanks for reminding us.



I apologize that this email is longer than I intended and may seem like a rant against SPARQL. I don’t want to anger anyone with this email; this is a summary of what I have observed and believe to be facts. I want to make it very clear: “Nothing would make me happier than to be proven wrong; I would love to see SPARQL work in large web-scale applications.” To people who choose to reply to this email: keep in mind that I only care about scientific and provable facts, not opinions.



We started with SPARQL as our core language. In the beginning it looked very promising. Due to the following, we are now looking at SPARQL as an add on capability, not the core capability:

1. The performance was poor for advanced queries.

a. It made us question if Jena SPARQL is only viable for simple queries.

b. In particular queries that returned a large dataset in JSON would take a long time to return data.

2. SPARQL does not appear to be a mature language:

a. Compared to SQL: There is no concept of views, functions or procedures. This is particularly a problem as triple stores have weak schema capabilities and the schema must be enforced in the application that interacts with the data.

b. Poor subquery capabilities and performance. No procedural multi-statement capabilities. For instance, it is not possible to do the equivalent of SQL selecting into a temporary table in one statement and use this temporary table in a subsequent query.

c. How do I take the result set of one query and pass it to the next query? Do I have to use CONSTRUCT to insert this relationship into the model and re-use it?

3. All SPARQL examples used in documentation are very simple.

a. Again, made us question if SPARQL was fit for more advanced queries.

4. Jena ARQ is limited by the capabilities of the underlying technology.

a. If the underlying technology is incapable of doing an effective join, then a system built on top of it will be equally ineffective at doing the same. The fact that the SPARQL language provides join capabilities does not mean that Jena provides an effective implementation of this language.

5. All large-scale Jena implementations seem to use the Jena API instead of Jena ARQ.

a. Again, making us question the capabilities and maturity of both SPARQL and Jena ARQ

b. SPARQL seemed to be a dead end, only suitable for small solutions and demonstrations.

c. In particular Jena Fuseki is referred to as only fit for smaller solutions.

6. There seemed to be a lack of good query optimizers

a. Even simple things such as changing the order of triples in the WHERE clause would lead to significantly different performance.

7. Public SPARQL end-points are notoriously bad.

a. They are constantly down.

b. Queries are slow

c. Queries are often limited to simple triple sets.

d. Some queries would not return and even crash or overload the server.

8. Poor SPARQL documentation:

a. The W3C documentation is hard to read and hard to understand. Combine this with the W3C RDF, OWL and OWL2 documentation and you will see a real issue.

b. The more accessible documentation is shallow and incomplete. Only simple SPARQL queries are shown.

c. There are no really good sources of best practice and application examples. Some of them are even contradicting each other.

d. It seems like there are a lot of good intentions when people start using SPARQL, but they all end up being dead ends.

e. A lot of the documentation seems to be “old”, written in 2008/2009 and not updated since.

f. The biggest red flag is the number of broken links to SPARQL, RDF and OWL documentation on the web.

9. SPARQL can only return rectangular data:

a. This is the same limitation as SQL, but in SQL I can create a procedure that will return multiple datasets with common keys.

b. Rectangular datasets cause duplicate data and loss of structure.

10. Building SPARQL strings to send to the server is not an effective way to deal with queries

a. This is probably more of an opinion than a fact. Excuse me for putting it in the list.

11. The lack of adoption of RDF stores compared to other data stores:

a. Not sure how scientific this chart is, but even if it is off by a factor of 10 it shows a big difference: http://db-engines.com/en/ranking_trend/system/Jena%3BMicrosoft+SQL+Server%3BMongoDB%3BMySQL%3BNeo4j



We did not give up, and dug into the problems to find solutions. We observed that some of the query complexity could be simplified by using SPARQL CONSTRUCT statements or Jena inference rules to pre-create relationships that users might want to query on. This provided much faster queries, but made the underlying model more murky with significant duplication of data. The proliferation of the vocabulary (predicates/properties) became a concern. Having to use CONSTRUCTS and rules to “pre-answer” complex questions also contradicts the primary reason to use a triple store in the first place; “we wanted a data store that could answer the questions that no one had thought about”.
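For reference, the pre-materialization described above takes roughly this shape (a sketch; the ex: names stand in for our vocabulary):

```sparql
PREFIX ex: <http://example.org/>
# Pre-create a derived relationship so later queries can match it directly.
CONSTRUCT { ?a ex:grandchild ?c }
WHERE {
  ?a ex:child ?b .
  ?b ex:child ?c .
}
```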



While SPARQL seems to promise to do what we want, the reality is that we have been unable to apply it in a way that delivers what we want. I am aware that this might be a failure of understanding how to use SPARQL.



So, please help us understand the following:

1. Are our observations correct? Please prove/disprove each point, it would make me happy to see that I am wrong.

2. Are these issues resolved in the latest Jena and Jena Fuseki implementations? I see that there are comments about faster SPARQL queries in the latest release. Is there any documentation showing what was done to improve it?

3. Are we using SPARQL incorrectly? How should we use it?

4. Is there documentation available that we do not know about? Please point us to the really good documentation. (We have read the positively rated books on the subjects as well as every website that refers to Jena, SPARQL, RDF, OWL, Semantic web within the first page of Google search).

5. Are there examples of large scale solutions built on Jena ARQ/SPARQL without the use of the Jena API? Can we see their reference architectures?

6. How can it be that the Jena API cannot do an effective join? Is SPARQL based on this API? Is there another API available to effectively get to the data?

7. How is ARQ implemented? Does it use the indexed data in Jena TDB? How does it handle indexes in subqueries?

Looking forward to hearing from you again.



Best regards,

Niels

-----Original Message-----
From: Claude Warren [mailto:***@xenei.com]
Sent: Sunday, November 13, 2016 01:04
To: ***@jena.apache.org
Subject: Re: How do I do a join between multiple model.listStatements calls?



Niels,



SPARQL (https://www.w3.org/TR/rdf-sparql-query/) provides a simple way to join the triples of different statements and can be called from within your java code (http://jena.apache.org/documentation/query/index.html).



As noted previously using a filter should do the trick. There is documentation for how to write your own filter if you need to but you may find that your filter requirements are already met by existing filters.



Claude
Post by Niels Andersen
Dear user community,
Our current approach to joining multiple model.listStatements (with
SimpleSelector) calls is to take the content of the iterators returned
and add them to separate HashSets and then use functions such as
retainAll to find the intersection between the two sets.
This works relatively well when model.listStatements returns a small to
medium number of statements.
My problem is that this seems to be a very inefficient way of joining
two sets of data that are already ordered in TDB. I assume that there
must be a better way to do this. I have searched the web, but all uses
of listStatements are very simple.
I have also not found an effective way to do filtering (for instance
literal less than 5) without comparing every statement that
listStatements returns
* What is the recommended way to do a join between two lists of
statements?
* What is the recommended way to implement filtering?
* Is there anything else than SimpleSelector? Are there any
Advanced selectors?
Thanks in advance,
Niels
--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>

LinkedIn: http://www.linkedin.com/in/claudewarren
Andy Seaborne
2016-11-13 21:19:58 UTC
Permalink
ARQ is either as fast at joins as listStatements (because it is using
the underlying Graph.find that backs listStatements) or is faster because
it avoids churning a lot of unnecessary bytes.

As many NoSQL applications have discovered, reinventing joins
client-side results in a lot of data transfer from data storage to the client.
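In concrete terms (a sketch; the predicate URI is a placeholder), both
paths bottom out in the same primitive:

```java
import org.apache.jena.graph.Graph;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.util.iterator.ExtendedIterator;

public class FindSketch {

    // Count the triples matching (ANY, p, ANY) via the Graph.find primitive
    // that both Model.listStatements and ARQ pattern matching sit on top of.
    public static int countMatches(Graph g, String predicateUri) {
        Node p = NodeFactory.createURI(predicateUri);
        int n = 0;
        ExtendedIterator<Triple> it = g.find(Node.ANY, p, Node.ANY);
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Graph g = ModelFactory.createDefaultModel().getGraph();
        System.out.println(countMatches(g, "http://example.org/child"));
    }
}
```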

Andy
Claude Warren
2016-11-14 09:43:24 UTC
Permalink
In response to #5: Are there examples of large scale solutions built on
Jena ARQ/SPARQL without the use of the Jena API? Can we see their reference
architectures?

I'm not sure this qualifies, but the Granatum project used SPARQL to query
the data. It also used the Jena API in several other places, most notably
to track which endpoints were up/down and their response time. This is a
tacit acknowledgement of your point "Public SPARQL end-points are
notoriously bad." The project implemented a preprocessor to the Jena
Query Engine that distributed the queries across multiple endpoints to pick
up the necessary data to answer the query. The entire user/researcher
front end was SPARQL.

https://aran.library.nuigalway.ie/xmlui/bitstream/handle/10379/4845/Linked_Biomedical_Dataspace_-_Lessons_Learned_integrating_Data_for_Drug_Discovery_%28Final%29.pdf?sequence=1

Claude
--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Niels Andersen
2016-11-13 20:17:31 UTC
Permalink
Claude and Martynas,



Thank you for your quick response.



We are aware that the SPARQL language provides join and filtering capabilities; it is, however, important to be reminded that it exists so that we do not get stuck in a single implementation track. Thanks for reminding us.



My question was specifically with regards to using the Jena API, which I understand is a supported interface to Jena TDB. The question was "How do I do a join between multiple model.listStatements calls?".



I apologize that this email is longer than I intended and may seem like a rant against SPARQL. I don't want to anger anyone with this email; this is a summary of what I have observed and believe to be facts. I want to make it very clear: "Nothing would make me happier than to be proven wrong; I would love to see SPARQL work in large web-scale applications." To people who choose to reply to this email: keep in mind that I only care about scientific and provable facts, not opinions.



We started with SPARQL as our core language. In the beginning it looked very promising. Due to the following, we are now looking at SPARQL as an add on capability, not the core capability:

1. The performance was poor for advanced queries.

a. It made us question if Jena SPARQL is only viable for simple queries.

b. In particular queries that returned a large dataset in JSON would take a long time to return data.

2. SPARQL does not appear to be a mature language:

a. Compared to SQL: There is no concept of views, functions or procedures. This is particularly a problem as triple stores have weak schema capabilities and the schema must be enforced in the application that interacts with the data.

b. Poor subquery capabilities and performance. No procedural multi-statement capabilities. For instance, it is not possible to do the equivalent of SQL selecting into a temporary table in one statement and use this temporary table in a subsequent query.

c. How do I take the result set of one query and pass it to the next query? Do I have to use CONSTRUCT to insert this relationship into the model and re-use it?

3. All SPARQL examples used in documentation are very simple.

a. Again, made us question if SPARQL was fit for more advanced queries.

4. Jena ARQ is limited by the capabilities of the underlying technology.

a. If the underlying technology is incapable of doing an effective join, then a system built on top of it will be equally ineffective at doing the same. The fact that the SPARQL language provides join capabilities does not mean that Jena provides an effective implementation of this language.

5. All large-scale Jena implementations seem to use the Jena API instead of Jena ARQ.

a. Again, making us question the capabilities and maturity of both SPARQL and Jena ARQ

b. SPARQL seemed to be a dead end, only suitable for small solutions and demonstrations.

c. In particular Jena Fuseki is referred to as only fit for smaller solutions.

6. There seemed to be a lack of good query optimizers

a. Even simple things such as changing the order of triples in the WHERE clause would lead to significantly different performance.

7. Public SPARQL end-points are notoriously bad.

a. They are constantly down.

b. Queries are slow

c. Queries are often limited to simple triple sets.

d. Some queries would not return and even crash or overload the server.

8. Poor SPARQL documentation:

a. The W3C documentation is hard to read and hard to understand. Combine this with the W3C RDF, OWL and OWL2 documentation and you will see a real issue.

b. The more accessible documentation is shallow and incomplete. Only simple SPARQL queries are shown.

c. There are no really good sources of best practice and application examples. Some of them are even contradicting each other.

d. It seems like there are a lot of good intentions when people start using SPARQL, but they all end up being dead ends.

e. A lot of the documentation seems to be "old", written in 2008/2009 and not updated since.

f. The biggest red flag is the number of broken links to SPARQL, RDF and OWL documentation on the web.

9. SPARQL can only return rectangular data:

a. This is the same limitation as SQL, but in SQL I can create a procedure that will return multiple datasets with common keys.

b. Rectangular datasets cause duplicate data and loss of structure.

10. Building SPARQL strings to send to the server is not an effective way to deal with queries

a. This is probably more of an opinion than a fact. Excuse me for putting it in the list.

11. The lack of adoption of RDF stores compared to other data stores:

a. Not sure how scientific this chart is, but even if it is off by a factor of 10 it shows a big difference: http://db-engines.com/en/ranking_trend/system/Jena%3BMicrosoft+SQL+Server%3BMongoDB%3BMySQL%3BNeo4j



We did not give up, and dug into the problems to find solutions. We observed that some of the query complexity could be simplified by using SPARQL CONSTRUCT statements or Jena inference rules to pre-create relationships that users might want to query on. This provided much faster queries, but made the underlying model more murky with significant duplication of data. The proliferation of the vocabulary (predicates/properties) became a concern. Having to use CONSTRUCTS and rules to "pre-answer" complex questions also contradicts the primary reason to use a triple store in the first place; "we wanted a data store that could answer the questions that no one had thought about".



While SPARQL seems to promise to do what we want, the reality is that we have been unable to apply it in a way that delivers what we want. I am aware that this might be a failure of understanding how to use SPARQL.



So, please help us understand the following:

1. Are our observations correct? Please prove/disprove each point, it would make me happy to see that I am wrong.

2. Are these issues resolved in the latest Jena and Jena Fuseki implementations? I see that there are comments about faster SPARQL queries in the latest release. Is there any documentation showing what was done to improve it?

3. Are we using SPARQL incorrectly? How should we use it?

4. Is there documentation available that we do not know about? Please point us to the really good documentation. (We have read the positively rated books on the subjects as well as every website that refers to Jena, SPARQL, RDF, OWL, Semantic web within the first page of Google search).

5. Are there examples of large scale solutions built on Jena ARQ/SPARQL without the use of the Jena API? Can we see their reference architectures?

6. How can it be that the Jena API cannot do an effective join? Is SPARQL based on this API? Is there another API available to effectively get to the data?

7. How is ARQ implemented? Does it use the indexed data in Jena TDB? How does it handle indexes in subqueries?



Looking forward to hearing from you again.



Best regards,

Niels

-----Original Message-----
From: Claude Warren [mailto:***@xenei.com]
Sent: Sunday, November 13, 2016 01:04
To: ***@jena.apache.org<mailto:***@jena.apache.org>
Subject: Re: How do I do a join between multiple model.listStatements calls?



Niels,



SPARQL (https://www.w3.org/TR/rdf-sparql-query/) provides a simple way to join the triples of different statements and can be called from within your java code (http://jena.apache.org/documentation/query/index.html).



As noted previously using a filter should do the trick. There is documentation for how to write your own filter if you need to but you may find that your filter requirements are already met by existing filters.



Claude
Post by Niels Andersen
Dear user community,
Our current approach to joining multiple model.listStatements (with
SimpleSelector) calls is to take the content of the iterators returned
and add them to separate HashSets and then use functions such as
retainAll to find the intersection between the two sets.
This works relatively well when model.listStatements returns a small to
medium number of statements.
My problem is that this seems to be a very inefficient way of joining
two sets of data that are already ordered in TDB. I assume that there
must be a better way to do this. I have searched the web, but all uses
of listStatements are very simple.
I have also not found an effective way to do filtering (for instance
literal less than 5) without comparing every statement that
listStatements returns
* What is the recommended way to do a join between two lists of
statements?
* What is the recommended way to implement filtering?
* Is there anything else than SimpleSelector? Are there any
Advanced selectors?
Thanks in advance,
Niels
--

I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>

LinkedIn: http://www.linkedin.com/in/claudewarren
Niels Andersen
2016-11-13 20:32:39 UTC
Permalink
Dear Jena User Group,

A side note: It looks like the user group is blocking my emails and claiming that it is phishing. Not sure why. In this email I will try to remove web links documenting my statements. If you receive this email from me, but not from the email list, you will know that Jena blocked the email.

First; Claude and Martynas, thank you for your quick response.

We are aware that the SPARQL language provides join and filtering capabilities; it is, however, important to be reminded that it exists so that we do not get stuck in a single implementation track. Thanks for reminding us.

My question was specifically with regards to using the Jena API, which I understand is a supported interface to Jena TDB. The question was "How do I do a join between multiple model.listStatements calls?".

I apologize that this email is longer than I intended and may seem like a rant against SPARQL. I don't want to anger anyone with this email; this is a summary of what I have observed and believe to be facts. I want to make it very clear: "Nothing would make me happier than to be proven wrong; I would love to see SPARQL work in large web-scale applications." To people who choose to reply to this email: keep in mind that I only care about scientific and provable facts, not opinions.

We started with SPARQL as our core language. In the beginning it looked very promising. Due to the following, we are now looking at SPARQL as an add on capability, not the core capability:
1. The performance was poor for advanced queries.
a. It made us question if Jena SPARQL is only viable for simple queries.
b. In particular queries that returned a large dataset in JSON would take a long time to return data.
2. SPARQL does not appear to be a mature language:
a. Compared to SQL: There is no concept of views, functions or procedures. This is particularly a problem as triple stores have weak schema capabilities and the schema must be enforced in the application that interacts with the data.
b. Poor subquery capabilities and performance. No procedural multi-statement capabilities. For instance, it is not possible to do the equivalent of SQL selecting into a temporary table in one statement and use this temporary table in a subsequent query.
c. How do I take the result set of one query and pass it to the next query? Do I have to use CONSTRUCT to insert this relationship into the model and re-use it?
3. All SPARQL examples used in documentation are very simple.
a. Again, made us question if SPARQL was fit for more advanced queries.
4. Jena ARQ is limited by the capabilities of the underlying technology.
a. If the underlying technology is incapable of doing an effective join, then a system built on top of it will be equally ineffective at doing the same. The fact that the SPARQL language provides join capabilities does not mean that Jena provides an effective implementation of this language.
5. All large-scale Jena implementations seem to use the Jena API instead of Jena ARQ.
a. Again, making us question the capabilities and maturity of both SPARQL and Jena ARQ
b. SPARQL seemed to be a dead end, only suitable for small solutions and demonstrations.
c. In particular Jena Fuseki is referred to as only fit for smaller solutions.
6. There seemed to be a lack of good query optimizers
a. Even simple things such as changing the order of triples in the WHERE clause would lead to significantly different performance.
7. Public SPARQL end-points are notoriously bad.
a. They are constantly down.
b. Queries are slow
c. Queries are often limited to simple triple sets.
d. Some queries would not return and even crash or overload the server.
8. Poor SPARQL documentation:
a. The W3C documentation is hard to read and hard to understand. Combine this with the W3C RDF, OWL and OWL2 documentation and you will see a real issue.
b. The more accessible documentation is shallow and incomplete. Only simple SPARQL queries are shown.
c. There are no really good sources of best practice and application examples. Some of them are even contradicting each other.
d. It seems like there are a lot of good intentions when people start using SPARQL, but they all end up being dead ends.
e. A lot of the documentation seems to be "old", written in 2008/2009 and not updated since.
f. The biggest red flag is the number of broken links to SPARQL, RDF and OWL documentation on the web.
9. SPARQL can only return rectangular data:
a. This is the same limitation as SQL, but in SQL I can create a procedure that will return multiple datasets with common keys.
b. Rectangular datasets cause duplicate data and loss of structure.
10. Building SPARQL strings to send to the server is not an effective way to deal with queries
a. This is probably more of an opinion than a fact. Excuse me for putting it in the list.
11. The lack of adoption of RDF stores compared to other data stores:
a. I originally had a link to DB-Engines to show the difference in adoption. I removed it to allow the message to go through to the list.

We did not give up, and dug into the problems to find solutions. We observed that some of the query complexity could be simplified by using SPARQL CONSTRUCT statements or Jena inference rules to pre-create relationships that users might want to query on. This provided much faster queries, but made the underlying model more murky with significant duplication of data. The proliferation of the vocabulary (predicates/properties) became a concern. Having to use CONSTRUCTS and rules to "pre-answer" complex questions also contradicts the primary reason to use a triple store in the first place; "we wanted a data store that could answer the questions that no one had thought about".

While SPARQL seems to promise to do what we want, the reality is that we have been unable to apply it in a way that delivers what we want. I am aware that this might be a failure of understanding how to use SPARQL.

So, please help us understand the following:
1. Are our observations correct? Please prove/disprove each point, it would make me happy to see that I am wrong.
2. Are these issues resolved in the latest Jena and Jena Fuseki implementations? I see that there are comments about faster SPARQL queries in the latest release. Is there any documentation showing what was done to improve it?
3. Are we using SPARQL incorrectly? How should we use it?
4. Is there documentation available that we do not know about? Please point us to the really good documentation. (We have read the positively rated books on the subjects as well as every website that refers to Jena, SPARQL, RDF, OWL, Semantic web within the first page of Google search).
5. Are there examples of large scale solutions built on Jena ARQ/SPARQL without the use of the Jena API? Can we see their reference architectures?
6. How can it be that the Jena API cannot do an effective join? Is SPARQL based on this API? Is there another API available to effectively get to the data?
7. How is ARQ implemented? Does it use the indexed data in Jena TDB? How does it handle indexes in subqueries?

Looking forward to hearing from you again.

Best regards,
Niels




-----Original Message-----
From: Claude Warren [mailto:***@xenei.com]
Sent: Sunday, November 13, 2016 01:04
To: mailto:***@jena.apache.org
Subject: Re: How do I do a join between multiple model.listStatements calls?

Niels,

SPARQL (https://www.w3.org/TR/rdf-sparql-query/) provides a simple way to join the triples of different statements and can be called from within your java code (http://jena.apache.org/documentation/query/index.html).

As noted previously using a filter should do the trick.  There is documentation for how to write your own filter if you need to but you may find that your filter requirements are already met by existing filters.

Claude
Post by Niels Andersen
Dear user community,
Our current approach to joining multiple model.listStatements (with
SimpleSelector) calls is to take the content of the iterators returned
and add them to separate HashSets and then use functions such as
retainAll to find the intersection between the two sets.
This works relatively well when model.listStatements returns a small to
medium number of statements.
My problem is that this seems to be a very inefficient way of joining
two sets of data that are already ordered in TDB. I assume that there
must be a better way to do this. I have searched the web, but all uses
of listStatements are very simple.
I have also not found an effective way to do filtering (for instance
literal less than 5) without comparing every statement that
listStatements returns
*         What is the recommended way to do a join between two lists of
statements?
*         What is the recommended way to implement filtering?
*         Is there anything else than SimpleSelector? Are there any
Advanced selectors?
Thanks in advance,
Niels
--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
james anderson
2016-11-13 21:57:53 UTC
Permalink
good evening mr andersen,

i am genuinely curious, why you and your group would be experiencing such difficulties and would like to understand more about what you are doing.
Post by Niels Andersen
Dear Jena User Group,
[
]
... To people who choose to reply to this email; keep in mind that I only care about scientific and provable facts, I do not care about opinions.
then, let us start there.
your post, as it stood, provides little room for a considered response, as it includes too little information about what you are doing.
if you are at liberty to provide specifics about your application data models, your persistent data vocabularies and storage statistics, your attempts to combine the two through sparql queries, your deployment specifics, and details about observed performance and reliability, it could aid your cause to do so.

your complaints imply that you have specific experience which could relate such information, but, in themselves, permit little beyond commiseration.
if you were to follow up in concrete terms, you might be more likely to benefit from the collected knowledge in the group.

best regards, from berlin,
---
james anderson | ***@dydra.com | http://dydra.com
Niels Andersen
2016-11-14 01:10:39 UTC
Permalink
Good evening to you as well Mr. Anderson,

We are building an application where we will end up with several hundreds of millions of triples. So the scope of the application could be considered large.

As for the initial question about model.listStatements joins, here is a code snippet:

// Query the model for all the children of nodeResource
StmtIterator iterator1 = model.listStatements(nodeResource, MY_VOCAB.child, (RDFNode) null);

// Iterate through all the statements returned
iterator1.forEachRemaining(childStatement -> {
    // Find all the labels for the object of each statement
    // (assume that there is more than one language label for each)
    StmtIterator iterator2 = model.listStatements(childStatement.getObject().asResource(), RDFS.label, (RDFNode) null);
    iterator2.forEachRemaining(labelStatement -> {
        // Print the statement to System.out
        System.out.println(labelStatement.toString());
    });
});
If the query above returned 10,000 children in iterator1, then iterator2 will be called 10,000 times. This does not seem to be very efficient.

To the best of my knowledge, TDB already has indexed lists of OSP, POS and SPO. I would have thought that there was a way to run the second query by just passing an ordered list of the objects returned in the first query. This provides for far better matching than having to run the same query many times.

The alternative approach that we are looking at is to run a second query where we return all the labels of all objects, store the results of each query in a HashSet indexed on ObjectResource, and do a retainAll to join the two sets. The problem with this is that there are way too many labels in the system to do this effectively. I can also create a code snippet for this if it is necessary.

So my question is: What is the correct way to join the results from two model.listStatements?
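For comparison, the nested listStatements loops above correspond to a single SPARQL basic graph pattern, which lets the store perform the join itself. This is only a sketch; the my: prefix and node URI below are placeholders standing in for MY_VOCAB and nodeResource:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX my:   <http://example.org/my-vocab#>    # placeholder for MY_VOCAB

SELECT ?child ?label
WHERE {
  # the join over ?child is performed by the query engine over the store's indexes
  <http://example.org/nodeResource> my:child ?child .
  ?child rdfs:label ?label .
}
```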

As for the initial question about model.listStatements filtering, here is a code snippet:

StmtIterator iterator1 = model.listStatements(
    new SimpleSelector(nodeResource, MY_VOCAB.value, (RDFNode) null)
    {
        public boolean selects(Statement s)
        {
            // return only the statements whose object literal is > 12345
            return s.getObject().asLiteral().getInt() > 12345;
        }
    });
In the query above, for every value result, the selector has to do a comparison with the filter value. I would have thought that it was easier for TDB to do the filtering than to include it in a SimpleSelector.

My question is: What is the correct way to implement filtering?
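As with the join, the comparison can be pushed into a SPARQL query so that it runs in the query engine rather than in a SimpleSelector. Again a sketch, with the my: prefix standing in for MY_VOCAB:

```sparql
PREFIX my: <http://example.org/my-vocab#>    # placeholder for MY_VOCAB

SELECT ?value
WHERE {
  <http://example.org/nodeResource> my:value ?value .
  FILTER(?value > 12345)    # evaluated close to the data, not per statement in API code
}
```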

As for the long list in my email that I accidentally sent multiple times; I hope the concerns and questions are clear enough to be answered. Let me know if clarification is needed.

Hope that this makes it clearer.

Thanks in advance,
Niels



james anderson
2016-11-14 07:56:01 UTC
Permalink
good morning;
Post by Niels Andersen
Good evening to you as well Mr. Anderson,
We are building an application where we will end up with several hundreds of millions of triples. So the scope of the application could be considered large.
[nested loop natural join implementation
]
If the query above returned 10,000 children in iterator1, then iterator2 will be called 10,000 times. This does not seem to be very efficient.
there is no reason to expect that it would be.
at an abstract level, it ignores two rather central principles for effective query processing:
- if you do not need data, do not touch it.
- if you do not need data, do not move it.

on one hand, there was this message, earlier in the thread,
Post by Niels Andersen
ARQ is either as fast at joins as listStatements (because it is using the underlying Graph.find that backs listStatement) or is faster because it avoids churning lot of unnecessary bytes.
As many NoSQL applications have discovered, reinventing joins client-side results in a lot of data transfer from data storage to client.
which alludes to general experience in this regard.
on the other, one could perform concrete timing experiments to determine the respective wild-subject match rate and the statement scan rate for your particular repository statistics, profile the time spent where in the stack, and predict quantitatively that the approach would likely underperform one which leaves the join process to mechanisms which are closer to the store and move less data.
Post by Niels Andersen
To the best of my knowledge, TDB already has indexed lists of OSP, POS and SPO. I would have thought that there was a way to run the second query by just passing an ordered list of the objects returned in the first query. This provides for far better matching than having to run the same query many times.
were that the case, the api documentation would describe it.
does it?
Post by Niels Andersen
The alternative approach that we are looking at is to run a second query where we return all the labels of all objects, store the results of each query in a HashSet indexed on ObjectResource and do a RetainAll to join the two sets. The problem with this is that there are way too many labels in the system to do this effectively. I can also create a code snippet for this if it is necessary.
So my question is: What is the correct way to join the results from two model.listStatements?
my question is, why is it necessary to do that on the client side?
Post by Niels Andersen
StmtIterator iterator1 = model.listStatements(
new SimpleSelector(nodeResource, MY_VOCAB.value, (RDFNode)null)
{
public boolean selects(Statement s)
{
// return the object literals > 12345
return (s.getObject().asLiteral().getInt() > 12345);
}
});
In the query above; for every value result, the selector has to do a comparison with the filter value. I would have thought that it was easier for TDB to do the filtering, than to include it in a SimpleSelector.
My question is: What is the correct way to implement filtering?
while “correct” depends much on the concrete case, the method, above, relies on the same problematic approach as your join implementation, yet it makes no case for a mechanism which performs the work on the client side rather than leaving it to a query processor.
Post by Niels Andersen
As for the long list in my email that I accidentally sent multiple times; I hope the concerns and questions are clear enough to be answered. Let me know if clarification is needed.
those are the points which permit commiseration only.
so long as they remain abstract complaints, it is difficult to bring experience to bear on them.
my experience differs from yours in significant ways, but without concrete information, it is not possible to explore, why.

your described case would appear to require a query with a single bgp, which contains two statement patterns and a filter.
given that case, your complaints leave the impression, that the sparql processor executed queries of that form less effectively than you expected and/or was not stable in the process.
at the level of detail which you have supplied, i would not expect that to have been the case.
you will need to say more.

best regards, from berlin,

---
james anderson | ***@dydra.com | http://dydra.com
Andy Seaborne
2016-11-14 09:13:29 UTC
Permalink
Post by Niels Andersen
If the query above returned 10,000 children in iterator1, then
iterator2 will be called 10,000 times. This does not seem to be very
efficient.
Compared with what?

If the pattern for iterator2, without the information from iterator1,
returns 10,000,000 items (which it would in the hash-join case), then
it would perform worse.
Post by Niels Andersen
To the best of my knowledge, TDB already has indexed lists of OSP,
POS and SPO. I would have thought that there was a way to run the
second query by just passing an ordered list of the objects returned
in the first query. This provides for far better matching than having
to run the same query many times.
SPO, POS, OSP are not lists (they are B+Trees with range scans).

TDB usually uses an index join (there are hash joins as well).

It does it efficiently, not retrieving the RDF term representation (which
would require persistent storage access, although it is heavily cached)
but using the internal numbers used in the index.

{ :nodeResource :child ?X .
?X rdfs:label ?Y
}

TDB will, in the absence of a stats file, execute in that order.
If you swap them, it will still start at ":nodeResource :child ?X ."

I don't see where filtering "< 5" fits into this example. rdfs:label
values are typically strings.

FILTER(?Z > 12345) is faster if done by TDB than in API code.
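Putting the pattern and the filter together into one query would look roughly like this (a sketch; the :value pattern supplying ?Z is an assumption, since the example data only shows :child and rdfs:label):

```sparql
SELECT ?X ?Y
WHERE {
  :nodeResource :child ?X .
  ?X rdfs:label ?Y .
  ?X :value ?Z .            # assumed property supplying the numeric ?Z
  FILTER(?Z > 12345)        # evaluated by TDB, not in API code
}
```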


If you are calling the pattern repeatedly with different :nodeResource
values, then you will incur overhead.

Andy
Niels Andersen
2016-11-14 19:10:28 UTC
Permalink
Thanks Mr. Anderson,

If I understand your comments correctly, then you are saying that it is more efficient to perform these types of joins on the server side, with an optimized query processor, than on the client side. I fully agree, and I would appreciate it if someone could show me how Jena performs that join on the server side and how this differs from my example.

Regarding server-side processing and "central principles for effective query processing", you hit the nail on the head. My understanding is that the Jena API operates directly on the underlying data, and my example is hence server-side (it runs in the same JVM as the TDB database, which is the only supported deployment of the API; see https://jena.apache.org/documentation/tdb/faqs.html#can-i-share-a-tdb-dataset-between-multiple-applications ). There is no such thing as a client version of the Jena API; only SPARQL is supported client side. Do I understand this correctly?

It therefore looks like we have a disconnect. I believe that the Jena API is a server-side function and you state that it is a client-side function. It would be great to get clarity on this disconnect.

Here is a basic assumption that I have: I may be incorrect, but my impression is that Jena is a set of APIs, not a traditional database. The RDF API is the core that allows interactions with triples, Jena TDB is the persistent file storage, Jena ARQ is the Jena implementation of SPARQL, Jena Fuseki is an implementation of a database, the Jena Ontology API provides a higher level interface to OWL and other models, and the Jena Inferencing API provides reasoning over the data. Users of Jena can use Jena Fuseki or build their own database(s). Do I understand this correctly?

In answering my original question about joins, Andy stated that Jena ARQ uses the Jena API, Graph.find and listStatements (you included this in your response). Again, if I understand this correctly, Jena ARQ does not implement a join algorithm based on two sorted lists, so the join must be performed using lookups for each element returned from the first list (as I showed in my example). While this is OK for small datasets, it becomes problematic for large datasets. Do I understand this correctly?

Regarding my statement "TDB already has indexed lists of OSP, POS and SPO" and your response "were that the case, the api documentation would describe it. does it?": The documentation refers to this at https://jena.apache.org/documentation/tdb/store-parameters.html and in passing in other places; I am not aware of any place in the documentation where it states how the API uses these indexes. Again, this goes back to the core of my question about effective joins, which are hard to do without indexed data.

My goal is to understand how to best use Jena. There may be places that Jena is not a good fit, that is OK, I just need to know where those places are so that we can work around them or avoid them.

My gut feeling is that Jena is a great choice when the user needs to follow her nose into the data and not return large datasets. If the queries are small and specific and return a small set of data, then Jena will provide good performance. If extensive joins are needed or large datasets are returned, then the user has to think about which API to use (core or ARQ); there will be situations where Jena does not provide the optimal solution and may not be the right choice.

Finally, regarding my list of concerns and questions, let's start with a specific one: Is there a SPARQL equivalent to SQL views, functions and stored procedures? I believe that the answer is no; if so, what is the best practice for providing this functionality?

Again; thanks for your help.

Best regards,
Niels


Andy Seaborne
2016-11-14 19:45:18 UTC
Permalink
Jena has APIs for local and remote access for SPARQL.

Many large installations are a SPARQL triple store with a business logic
layer.
Post by Niels Andersen
Andy is answering my original question about joins, he stated that
Jena ARQ is using the Jena API, Graph.find and listStatement (you
included this in your response).
I said it uses Graph.find or is faster.

TDB cuts through Graph.find and listStatements to work on the indexes
themselves.
Post by Niels Andersen
Again, if I understand this
correctly, then Jena ARQ does not implement a join algorithm based on
two sorted lists, so the join must be performed using lookups for
each element returned from the first list (like I showed in my
example). While this is OK for small datasets, it becomes problematic
for large datasets. Do I understand this correctly?
It's called an index join, and in TDB it works not with RDF terms but with
internal ids (which are fixed 8 bytes long). The representations of the
RDF terms are left on disk unless needed later ("if you do not need
data, do not touch it.").

If the first set is small, an index join is faster than a merge join. A
merge join still needs to traverse the whole of both sides if it does not
use sideways passing ... in which case it becomes a form of index join.
Due to caching, index lookup is not necessarily expensive.

I would still like to hear what you are intending to use RDF for. What
features of the semantic web, or RDF, are you exploiting? Your email address
suggests an IoT application.

Andy
Niels Andersen
2016-11-14 21:30:00 UTC
Permalink
Andy,

Thanks for the clarification regarding ARQ.

I am happy to hear that ARQ is using underlying indexes.

Best regards,
Niels

Rob Vesse
2016-11-15 10:15:10 UTC
Permalink
On 14/11/2016 19:10, "Niels Andersen" <***@thinkiq.com> wrote:

Finally, regarding my list of concerns and questions. Let's start with a specific one: Is there a SPARQL equivalent to SQL views, functions and stored procedures? I believe that the answer is no, and if it is then what is the best practice to provide this functionality?

For views you can use an INSERT {} WHERE {} update to create a new graph with the data of interest, e.g.

INSERT
{
GRAPH <urn:temporary:example> {
?s ?p ?o
}
}
WHERE
{
# Match ?s ?p ?o as desired
}

The WHERE clause can contain an arbitrarily complex query pattern, and the INSERT clause defines a template for the new data to be created. You can then direct subsequent queries against that temporary graph. Once you’re done with it, you can delete it with DROP <urn:temporary:example>.

Note that there is no support for views in the traditional SQL sense in most implementations that I’m aware of. So if your data changes you would need to rerun the INSERT updates to recreate your temporary graphs.

In terms of functions, the specifications define an extension mechanism based upon naming functions with URIs. The range of supported functions will vary between implementations. You can find details on the extensions that ARQ supports at the following page:

http://jena.apache.org/documentation/query/library-function.html

It is also possible to define and register your own functions; the mechanism for this will vary between vendors. Again, the ARQ documentation for this can be found at the following page:

http://jena.apache.org/documentation/query/writing_functions.html

As for stored procedures, there is no support in ARQ currently, nor do I believe it is planned. Andy and I have discussed the idea of something similar in the past, but neither of us has been in a position to implement it. I have seen some support for these from other vendors, e.g. TopQuadrant’s SPIN, but this is not widely supported.

As Andy already implied, most large applications use a business logic layer that generates queries and updates as necessary.

Rob
Niels Andersen
2016-11-15 18:38:06 UTC
Permalink
Thanks Rob,

The temporary graph is a good idea. Much appreciated.

What is the best way to ensure that the temporary graph is stored in memory and not in the underlying TDB?

It would be great to have a set of best practices where these examples are shown.

Best regards,
Niels
