Simon Schäfer
2016-10-25 21:09:23 UTC
Hello,
I have the problem that I need to split up my index because it can grow very large, and I'm not sure how best to do the split, so I'm looking for some advice. Right now the index still fits on one machine, so there is no need yet to build a distributed system. At the moment I mainly care about search performance once the index reaches gigabytes in size, so I think it is enough to have multiple datasets plus some mechanism that decides which data goes into which dataset.
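To make the idea more concrete, this is roughly the kind of routing I have in mind. It is only a sketch: the idea of partitioning by predicate namespace, the namespaces and the TDB locations are just placeholders I made up for illustration.

import org.apache.jena.graph.Triple;
import org.apache.jena.query.Dataset;
import org.apache.jena.tdb.TDBFactory;

import java.util.HashMap;
import java.util.Map;

/** Sketch: route triples into several TDB datasets by predicate namespace. */
public class PartitionedStore {

    // One TDB dataset per partition; keys and locations are placeholders.
    private final Map<String, Dataset> partitions = new HashMap<>();
    private final Dataset fallback = TDBFactory.createDataset("tdb/default");

    public PartitionedStore() {
        partitions.put("http://example.org/onto/people#", TDBFactory.createDataset("tdb/people"));
        partitions.put("http://example.org/onto/projects#", TDBFactory.createDataset("tdb/projects"));
    }

    /** Decide which dataset a triple belongs to; fall back to a default partition. */
    public Dataset partitionFor(Triple t) {
        String ns = t.getPredicate().getNameSpace();
        return partitions.getOrDefault(ns, fallback);
    }

    /** Add a triple to whichever partition its predicate maps to. */
    public void add(Triple t) {
        partitionFor(t).getDefaultModel().getGraph().add(t);
    }
}

Writing the data this way is easy enough; my problem is the query side described below.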
There is one key goal that must be fulfilled: SPARQL queries sent to the server must not need to know that the system may be decentralized or even distributed. In other words, the user writes a SPARQL query as if all the data were in one big index, and it is the job of the indexer to figure out how to process the query correctly.
Fulfilling this goal should be possible because the ontologies used by my system follow a graph structure, i.e. there are relations between the classes and properties in the system. The part I haven't figured out yet is at which point the system can look at the ontologies and, based on that information, decide which datasets it needs to read.
I thought it would make the most sense to implement this mechanism inside Jena's SPARQL evaluation engine. The evaluation engine already has to work out what a SPARQL query means, and adding more logic to it seems like the way to go. My question: is it possible to access/alter the behavior of the SPARQL evaluation engine, and if yes, how? Can anyone think of another way to solve my problem without going very deeply into Jena itself?
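For reference, this is what I have been looking at so far: ARQ seems to allow registering a custom query engine, and something like the sketch below is what I imagine, based on the custom query engine example shipped with ARQ. I'm not sure I'm using the API correctly, and the actual rewriting/routing of the algebra is exactly the part I haven't worked out.

import org.apache.jena.query.Query;
import org.apache.jena.sparql.algebra.Op;
import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.sparql.engine.Plan;
import org.apache.jena.sparql.engine.QueryEngineFactory;
import org.apache.jena.sparql.engine.QueryEngineRegistry;
import org.apache.jena.sparql.engine.QueryIterator;
import org.apache.jena.sparql.engine.binding.Binding;
import org.apache.jena.sparql.engine.main.QueryEngineMain;
import org.apache.jena.sparql.util.Context;

/** Sketch: a query engine that could inspect the algebra before evaluation. */
public class PartitionAwareQueryEngine extends QueryEngineMain {

    public PartitionAwareQueryEngine(Query query, DatasetGraph dataset, Binding initial, Context context) {
        super(query, dataset, initial, context);
    }

    @Override
    public QueryIterator eval(Op op, DatasetGraph dsg, Binding initial, Context context) {
        // This is where I imagine looking at the algebra (op), consulting the
        // ontologies, and deciding which partition(s) actually need to be read.
        // For now it just delegates to the standard evaluation.
        return super.eval(op, dsg, initial, context);
    }

    // Factory + registration, so ARQ picks this engine for incoming queries.
    private static final QueryEngineFactory factory = new QueryEngineFactory() {
        @Override public boolean accept(Query query, DatasetGraph dsg, Context context) { return true; }
        @Override public Plan create(Query query, DatasetGraph dsg, Binding initial, Context context) {
            PartitionAwareQueryEngine engine = new PartitionAwareQueryEngine(query, dsg, initial, context);
            return engine.getPlan();
        }
        @Override public boolean accept(Op op, DatasetGraph dsg, Context context) { return false; }
        @Override public Plan create(Op op, DatasetGraph dsg, Binding initial, Context context) { return null; }
    };

    public static void register() {
        QueryEngineRegistry.addFactory(factory);
    }
}

Is this the right extension point for what I want, or would a StageGenerator/OpExecutor (or something else entirely) be a better fit?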
Simon