Dear Jena User Group,
A side note: It looks like the user group is blocking my emails and claiming that it is phishing. Not sure why. In this email I will try to remove web links documenting my statements. If you receive this email from me, but not from the email list, you will know that Jena blocked the email.
First; Claude and Martynas, thank you for your quick response.
We are aware that the SPARQL language provides join and filtering capabilities. It is nevertheless important to be reminded that it exists, so that we do not get stuck on a single implementation track. Thanks for reminding us.
My question was specifically with regard to using the Jena API, which I understand is a supported interface to Jena TDB. The question was "How do I do a join between multiple model.listStatments calls?".
I apologize that this email is longer than I intended and may seem like a rant against SPARQL. I don't want to anger anyone with this email. It is a summary of what I have observed and believe to be facts. I want to make it very clear: "Nothing would make me happier than to be proven wrong, I would love to see SPARQL work in large web scale applications". To people who choose to reply to this email: keep in mind that I only care about scientific and provable facts. I do not care about opinions.
We started with SPARQL as our core language. In the beginning it looked very promising. For the following reasons, we are now looking at SPARQL as an add-on capability, not the core capability:
1. The performance was poor for advanced queries.
a. It made us question if Jena SPARQL is only viable for simple queries.
b. In particular queries that returned a large dataset in JSON would take a long time to return data.
2. SPARQL does not appear to be a mature language:
a. Compared to SQL: There is no concept of views, functions or procedures. This is particularly a problem as triple stores have weak schema capabilities and the schema must be enforced in the application that interacts with the data.
b. Poor subquery capabilities and performance. No procedural multi statement capabilities. For instance, it is not possible to do the equivalent of SQL selecting into a temporary table in one statement and use this temporary table in a subsequent query.
c. How do I take the result set of one query and pass it to the next query? Do I have to use CONSTRUCT to insert this relationship into the model and re-use it?
3. All SPARQL examples used in documentation are very simple.
a. Again, made us question if SPARQL was fit for more advanced queries.
4. Jena ARQ is limited by the capabilities of the underlying technology.
a. If the underlying technology is incapable of doing an effective join, then a system put on top of it will be equally ineffective at doing the same. The fact that the SPARQL language provides join capabilities does not mean that Jena provides an effective implementation of this language.
5. All large scale Jena implementations seem to use the Jena API instead of Jena ARQ.
a. Again, making us question the capabilities and maturity of both SPARQL and Jena ARQ
b. SPARQL seemed to be a dead end, only suitable for small solutions and demonstrations.
c. In particular Jena Fuseki is referred to as only fit for smaller solutions.
6. There seemed to be a lack of good query optimizers
a. Even simple things such as changing the order of triples in the WHERE clause would lead to significantly different performance.
7. Public SPARQL end-points are notoriously bad.
a. They are constantly down.
b. Queries are slow
c. Queries are often limited to simple triple sets.
d. Some queries would not return and even crash or overload the server.
8. Poor SPARQL documentation:
a. The W3C documentation is hard to read and hard to understand. Combine this with the W3C RDF, OWL and OWL2 documentation and you will see a real issue.
b. The more accessible documentation is shallow and incomplete. Only simple SPARQL queries are shown.
c. There are no really good sources of best practice and application examples. Some of them are even contradicting each other.
d. It seems like there are a lot of good intentions when people start using SPARQL, but they all end up being dead ends.
e. A lot of the documentation seems to be "old", written in 2008/2009 and not updated since.
f. The biggest red flag is the number of broken links to SPARQL, RDF and OWL documentation on the web.
9. SPARQL can only return rectangular data:
a. This is the same limitation as SQL, but in SQL I can create a procedure that will return multiple datasets with common keys.
b. Rectangular datasets cause duplicate data and loss of structure.
10. Building SPARQL strings to send to the server is not an effective way to deal with queries
a. This is probably more of an opinion than a fact. Excuse me for putting it in the list.
11. The lack of adoption of RDF stores compared to other data stores:
a. I originally had a link to DB-Engines to show the difference in adoption. I removed it to allow the message to go through to the list.
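To illustrate point 9, here is a minimal sketch using only the Java standard library. The rows stand in for a SELECT result with made-up columns (person, email, phone). The class and method names (RectangularRows, nest) are hypothetical and not part of Jena.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: rows stand in for a rectangular query result with the
// hypothetical columns (person, email, phone). Nothing here is Jena API.
final class RectangularRows {

    // Recovers the distinct values of one column per key from the flat rows.
    static Map<String, Set<String>> nest(List<String[]> rows, int keyCol, int valueCol) {
        Map<String, Set<String>> out = new LinkedHashMap<>();
        for (String[] row : rows) {
            out.computeIfAbsent(row[keyCol], k -> new LinkedHashSet<>()).add(row[valueCol]);
        }
        return out;
    }

    public static void main(String[] args) {
        // One person with two emails and two phones comes back as 2 x 2 = 4 rows.
        List<String[]> rows = Arrays.asList(
            new String[] {"ex:alice", "a1@example.org", "555-0100"},
            new String[] {"ex:alice", "a1@example.org", "555-0101"},
            new String[] {"ex:alice", "a2@example.org", "555-0100"},
            new String[] {"ex:alice", "a2@example.org", "555-0101"});

        System.out.println(rows.size() + " rows returned");
        System.out.println("emails: " + nest(rows, 0, 1));
        System.out.println("phones: " + nest(rows, 0, 2));
    }
}
```

The four rows carry only two distinct emails and two distinct phones, so the application has to regroup the result to get the structure back.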
We did not give up, and dug into the problems to find solutions. We observed that some of the query complexity could be reduced by using SPARQL CONSTRUCT statements or Jena inference rules to pre-create relationships that users might want to query on. This provided much faster queries, but made the underlying model murkier, with significant duplication of data. The proliferation of the vocabulary (predicates/properties) became a concern. Having to use CONSTRUCT statements and rules to "pre-answer" complex questions also contradicts the primary reason to use a triple store in the first place: "we wanted a data store that could answer the questions that no one had thought about".
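The trade-off can be sketched with the Java standard library alone. This is an illustration of pre-creating a derived relationship, not our actual rules. Triples are modelled as three-element lists, and the predicate names and the class Materializer are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: a triple is a List of (subject, predicate, object).
// Class, method and predicate names are hypothetical, not Jena API.
final class Materializer {

    // For every pair (a, base, b) and (b, base, c), adds (a, derived, c).
    // Returns a new set containing the original triples plus the derived ones.
    static Set<List<String>> materializeTwoHop(
            Set<List<String>> triples, String base, String derived) {
        // Index objects by subject for the base predicate.
        Map<String, List<String>> objectsBySubject = new HashMap<>();
        for (List<String> t : triples) {
            if (t.get(1).equals(base)) {
                objectsBySubject.computeIfAbsent(t.get(0), k -> new ArrayList<>()).add(t.get(2));
            }
        }
        Set<List<String>> out = new HashSet<>(triples);
        for (Map.Entry<String, List<String>> e : objectsBySubject.entrySet()) {
            for (String mid : e.getValue()) {
                List<String> ends = objectsBySubject.getOrDefault(mid, Collections.emptyList());
                for (String end : ends) {
                    out.add(Arrays.asList(e.getKey(), derived, end));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<List<String>> triples = new HashSet<>();
        triples.add(Arrays.asList("ex:alice", "ex:parentOf", "ex:bob"));
        triples.add(Arrays.asList("ex:bob", "ex:parentOf", "ex:carol"));

        Set<List<String>> expanded =
            materializeTwoHop(triples, "ex:parentOf", "ex:grandparentOf");
        System.out.println(triples.size() + " triples before, " + expanded.size() + " after");
    }
}
```

The two-hop question becomes a single lookup afterwards, but the store now holds an extra predicate and extra triples that restate what was already there.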
While SPARQL seems to promise to do what we want, the reality is that we have been unable to apply it in a way that delivers what we want. I am aware that this might be a failure of understanding how to use SPARQL.
So, please help us understand the following:
1. Are our observations correct? Please prove or disprove each point. It would make me happy to see that I am wrong.
2. Are these issues resolved in the latest Jena and Jena Fuseki implementations? I see that there are comments about faster SPARQL queries in the latest release. Is there any documentation showing what was done to improve it?
3. Are we using SPARQL incorrectly? How should we use it?
4. Is there documentation available that we do not know about? Please point us to the really good documentation. (We have read the positively rated books on the subjects as well as every website that refers to Jena, SPARQL, RDF, OWL, Semantic web within the first page of Google search).
5. Are there examples of large scale solutions built on Jena ARQ/SPARQL without the use of the Jena API? Can we see their reference architectures?
6. How can it be that the Jena API cannot do an effective join? Is SPARQL based on this API? Is there another API available to effectively get to the data?
7. How is ARQ implemented? Does it use the indexed data in Jena TDB? How does it handle indexes in subqueries?
Looking forward to hearing from you again.
Best regards,
Niels
-----Original Message-----
From: Claude Warren [mailto:***@xenei.com]
Sent: Sunday, November 13, 2016 01:04
To: mailto:***@jena.apache.org
Subject: Re: How do I do a join between multiple model.listStatments calls?
Niels,
SPARQL (https://www.w3.org/TR/rdf-sparql-query/) provides a simple way to join the triples of different statements and can be called from within your java code (http://jena.apache.org/documentation/query/index.html).
As noted previously using a filter should do the trick. There is documentation for how to write your own filter if you need to but you may find that your filter requirements are already met by existing filters.
Claude
Post by Niels Andersen
Dear user community,
Our current approach to joining multiple model.listStatements (with
SimpleSelector) calls is to take the content of the iterators returned
and add them to separate HashSets and then use functions such as
retainAll to find the intersection between the two sets.
This works relatively well when model.listStatements returns a small to
medium number of statements.
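A minimal sketch of the approach described above, using only the Java standard library. Plain strings stand in for the statements that the iterators would yield, and the names SetIntersectionJoin, drain and intersect are hypothetical, not Jena API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Illustration only: the iterators stand in for what two listStatements
// calls would return. No Jena classes are used here.
final class SetIntersectionJoin {

    // Copies everything the iterator yields into a HashSet.
    static <T> Set<T> drain(Iterator<T> it) {
        Set<T> out = new HashSet<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    // Returns the elements that both iterators yield.
    static <T> Set<T> intersect(Iterator<T> left, Iterator<T> right) {
        Set<T> result = drain(left);
        result.retainAll(drain(right));
        return result;
    }

    public static void main(String[] args) {
        // Made-up subjects matching two different patterns.
        List<String> a = Arrays.asList("ex:s1", "ex:s2", "ex:s3");
        List<String> b = Arrays.asList("ex:s2", "ex:s3", "ex:s4");
        System.out.println(intersect(a.iterator(), b.iterator()));
    }
}
```

Both sides are copied into memory in full before the intersection starts, which is why it only holds up for small to medium result sizes.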
My problem is that this seems to be a very inefficient way of joining
two sets of data that are already ordered in TDB. I assume that there
must be a better way to do this. I have searched the web, but all uses
of listStatements are very simple.
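What I have in mind is something like a merge join. The sketch below uses only the Java standard library and assumes that both inputs arrive sorted ascending on the same key and without duplicates. Whether the Jena API exposes statements in such an order is exactly what I do not know. SortedMergeJoin and intersectSorted are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustration only: a merge-style intersection over two sorted iterators.
// Assumes ascending order and no duplicates on each side. Not Jena API.
final class SortedMergeJoin {

    static <T extends Comparable<T>> List<T> intersectSorted(
            Iterator<T> left, Iterator<T> right) {
        List<T> out = new ArrayList<>();
        if (!left.hasNext() || !right.hasNext()) {
            return out;
        }
        T l = left.next();
        T r = right.next();
        while (true) {
            int cmp = l.compareTo(r);
            if (cmp == 0) {
                // Match: emit it and advance both sides.
                out.add(l);
                if (!left.hasNext() || !right.hasNext()) {
                    break;
                }
                l = left.next();
                r = right.next();
            } else if (cmp < 0) {
                // Left is behind: advance left only.
                if (!left.hasNext()) {
                    break;
                }
                l = left.next();
            } else {
                // Right is behind: advance right only.
                if (!right.hasNext()) {
                    break;
                }
                r = right.next();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(1, 3, 5, 7);
        List<Integer> b = Arrays.asList(3, 4, 7, 9);
        System.out.println(intersectSorted(a.iterator(), b.iterator())); // prints [3, 7]
    }
}
```

Each input is read once and nothing but the matches is held in memory, unlike the HashSet approach.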
I have also not found an effective way to do filtering (for instance
literal less than 5) without comparing every statement that
listStatements returns
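As an illustration of what I mean by filtering without touching every statement, here is a sketch with the Java standard library. It keeps subjects in a sorted map keyed by literal value, so "less than 5" becomes a range lookup. This is not a claim about how TDB stores literals, and RangeFilterIndex and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Illustration only: a sorted index from integer literal value to subjects.
// Not Jena API and not a description of TDB internals.
final class RangeFilterIndex {

    private final NavigableMap<Integer, List<String>> byValue = new TreeMap<>();

    // Registers that the subject has the given literal value.
    void add(String subject, int value) {
        byValue.computeIfAbsent(value, k -> new ArrayList<>()).add(subject);
    }

    // Returns subjects whose value is strictly less than bound, in ascending
    // value order, visiting only the entries below the bound.
    List<String> subjectsLessThan(int bound) {
        List<String> out = new ArrayList<>();
        for (List<String> subjects : byValue.headMap(bound, false).values()) {
            out.addAll(subjects);
        }
        return out;
    }

    public static void main(String[] args) {
        RangeFilterIndex index = new RangeFilterIndex();
        index.add("ex:a", 3);
        index.add("ex:b", 7);
        index.add("ex:c", 1);
        index.add("ex:d", 5);
        System.out.println(index.subjectsLessThan(5)); // prints [ex:c, ex:a]
    }
}
```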
* What is the recommended way to do a join between two lists of
statements?
* What is the recommended way to implement filtering?
* Is there anything other than SimpleSelector? Are there any
advanced selectors?
Thanks in advance,
Niels
--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren