Discussion:
Advice on how to develop a decentralized indexer with Jena
Simon Schäfer
2016-10-25 21:09:23 UTC
Permalink
Hello,

I have the problem that my index can grow very large and needs to be split up, and I'm not sure which approach to take for the split, so I'm looking for some advice. Right now the index fits on one machine, which means there is no need yet to build a distributed system. My main concern is keeping search performance acceptable once the index reaches gigabytes in size, so I think it is enough to have multiple datasets plus some mechanism that figures out which data needs to go into which dataset.

There is one key goal that needs to be fulfilled: SPARQL queries sent to the server must not need to know anything about the fact that the system can be decentralized or even distributed. The user writes a SPARQL query as if all the data were inside one big index, and it is the job of the indexer to figure out how to process the query correctly.

Fulfilling this goal should be possible because the ontologies used by my system follow a graph structure; in other words, there are relations between the classes and properties in the system. The part I haven't figured out yet is at which point the system can look at the ontologies and, based on that information, decide which datasets to read from.

I thought it would make most sense to implement this mechanism inside Jena's SPARQL evaluation engine. The evaluation engine already has to figure out what a SPARQL query means, so adding more logic to it seems the way to go. My question: is it possible to access/alter the behavior of the SPARQL evaluation engine, and if so, how? Can anyone think of another way to solve my problem without going very deeply into Jena itself?
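
For orientation, a minimal sketch of what such a hook could look like in ARQ, Jena's SPARQL engine (assuming Jena 3.x; the class name and the routing decision are placeholders - this stub just delegates to the default behaviour):

    import org.apache.jena.query.ARQ;
    import org.apache.jena.sparql.algebra.op.OpBGP;
    import org.apache.jena.sparql.engine.ExecutionContext;
    import org.apache.jena.sparql.engine.QueryIterator;
    import org.apache.jena.sparql.engine.main.OpExecutor;
    import org.apache.jena.sparql.engine.main.OpExecutorFactory;
    import org.apache.jena.sparql.engine.main.QC;

    public class RoutingOpExecutor extends OpExecutor {

        protected RoutingOpExecutor(ExecutionContext execCxt) {
            super(execCxt);
        }

        @Override
        protected QueryIterator execute(OpBGP opBGP, QueryIterator input) {
            // Inspect opBGP.getPattern() to decide which dataset(s) hold the
            // classes/properties involved; this stub simply falls back to the
            // standard execution.
            return super.execute(opBGP, input);
        }

        /** Install this executor for queries run through ARQ's global context. */
        public static void register() {
            OpExecutorFactory factory = RoutingOpExecutor::new;
            QC.setFactory(ARQ.getContext(), factory);
        }
    }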

Simon
Rob Vesse
2016-10-26 08:44:29 UTC
Permalink
Simon

What are you referring to when you say index?

If you are referring to the RDF itself, and assuming you are using TDB, then there is no way to split the index. Andy has done some investigations on a clustered variant of TDB but this is not production ready. Therefore if you need to split the RDF then you will need to adopt a commercial triple store that supports some variation of clustering. Stardog and Virtuoso are two such examples but there are various others available.

If you are referring to a secondary index such as a free-text index, then I believe we do support using Apache Solr (which is built on Lucene) as the indexer, and Solr has clustering support built in.

If you are referring to something else entirely then you will need to explain in more detail what you mean.

Rob

Simon Schäfer
2016-10-26 10:18:25 UTC
Permalink
Post by Rob Vesse
Simon
What are you referring to when you say index?
If you are referring to the RDF itself, and assuming you are using TDB, then there is no way to split the index. Andy has done some investigations on a clustered variant of TDB but this is not production ready. Therefore if you need to split the RDF then you will need to adopt a commercial triple store that supports some variation of clustering. Stardog and Virtuoso are two such examples but there are various others available.
Yes, I was referring to TDB. I already assumed that Jena can't do it because I couldn't find any documentation for it. However, I'm willing to implement this functionality by myself. I mostly would need some pointers to the location in the sources where I could add such functionality. Assuming of course that I don't have to rewrite half of Jena...
Simon
Claude Warren
2016-10-26 11:06:28 UTC
Permalink
I implemented a custom search engine when I worked at the Digital
Enterprise Research Institute. (
https://www.researchgate.net/publication/271191966_A_Roadmap_for_Navigating_the_Life_Sciences_Linked_Open_Data_Cloud
)

In that case we had a vocabulary that our users used. The query engine
had to determine which external endpoints might have the data we wanted,
translate the query to the endpoint vocabulary, and then translate the
results back again.

You could do something similar but simpler as you don't have to do the
translation. Basically take your data set and segregate it by Class (or
some other reasonable partition). Write groups of classes into different
database implementations.

Your query engine knows which classes are in which databases, as well as which properties are available on those classes, and uses that information to tease a given query apart into multiple queries against each of the database implementations. You can then use a set of federated queries to reassemble the results into the data requested by the query.

Not trivial, but not impossible either.
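
As a rough illustration of the reassembly step, a federated query can pull the per-class partitions back together with SERVICE clauses. The sketch below assumes two hypothetical endpoints and a made-up vocabulary; it only shows the shape of the query, and running it obviously requires those endpoints to exist:

    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.query.ResultSetFormatter;

    public class FederatedLookup {
        public static void main(String[] args) {
            // Partition A holds Person instances, partition B holds Order
            // instances; both endpoint URLs are placeholders.
            String query =
                "SELECT ?person ?order WHERE {\n" +
                "  SERVICE <http://partition-a.example/sparql> {\n" +
                "    ?person a <http://example.org/Person> .\n" +
                "  }\n" +
                "  SERVICE <http://partition-b.example/sparql> {\n" +
                "    ?order <http://example.org/placedBy> ?person .\n" +
                "  }\n" +
                "}";
            try (QueryExecution qe = QueryExecutionFactory.create(query, DatasetFactory.create())) {
                ResultSet rs = qe.execSelect();
                ResultSetFormatter.out(rs);
            }
        }
    }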

Claude
Simon Schäfer
2016-10-26 16:09:00 UTC
Permalink
This technique wouldn't work for me, since I have only one SPARQL endpoint, which directly reads data from multiple locations.
Claude Warren
2016-10-26 18:25:00 UTC
Permalink
I agree the current architecture would not work, but you could place a SPARQL endpoint at each of the multiple locations. The resulting aggregating query engine might be an interesting project in its own right.

Hmmmmmm...... I need to think about that.

Claude
--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Simon Schäfer
2016-10-26 20:29:33 UTC
Permalink
Post by Claude Warren
I agree the current architecture would not work, but you could place a
sparql endpoint at each of the multiple locations. The resulting
aggregating query engine might be an interesting project on its own.
Yes, providing multiple endpoints is possible, but in my case it violates the design goal mentioned in my first mail: the user must not know anything about different endpoints. The user should write only one query, without having to care about where the data is stored.
Claude Warren
2016-10-27 06:33:41 UTC
Permalink
Ahhh, you missed, or I did not call out, a critical part of the implementation we did. All of the endpoints the engine queried were hidden from the user; the user only saw one endpoint. The system took the user's query to that one endpoint, distributed it across the known endpoints as appropriate, and then combined the results. To the user there was one endpoint, but the system queried across bio2rdf, goodrelations, chebi and others.

Claude
Andy Seaborne
2016-10-26 11:32:32 UTC
Permalink
Simon,

The system Rob is referring to is "Lizard" - it's on my personal GitHub account. In essence it implements replicated index files. It is not, however, replicating the SPARQL query engine - that is still one machine for one query. It is not production-ready.

There are several ways to go beyond a single machine for different usages.

Reusing a store - any key/value or column store that can provide range scans easily (range scans are the core of SPARQL execution) - is a reasonable baseline.
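
A toy illustration of that point, with a sorted map standing in for the key/value store: a triple pattern with a bound subject becomes a prefix scan over an SPO-ordered index (plain Java, purely illustrative):

    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class RangeScanIdea {
        public static void main(String[] args) {
            // Keys are subject|predicate|object, kept in SPO order by the map.
            NavigableMap<String, String> spo = new TreeMap<>();
            spo.put("ex:alice|ex:knows|ex:bob", "");
            spo.put("ex:alice|ex:name|\"Alice\"", "");
            spo.put("ex:carol|ex:name|\"Carol\"", "");

            // Pattern (ex:alice ?p ?o): scan every key with the subject prefix.
            String prefix = "ex:alice|";
            spo.subMap(prefix, true, prefix + Character.MAX_VALUE, false)
               .keySet()
               .forEach(System.out::println);
        }
    }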

The other choice these days is Apache Spark.

So how big is big (in terms of triples and in gigabytes), and what's the primary use case (the type of query that matters)?

Or we could try to make the indexes smaller ...

Andy
Simon Schäfer
2016-10-26 13:44:31 UTC
Permalink
Post by Andy Seaborne
Simon,
The system Rob is referring to is "Lizard" - its on my personal github
account. In essence it is implementing replicated index files. It is
not however, replicating the SPARQL query engine - that is still one
machine for one query. It is not production ready.
At least right now I do not need more than a single node to process the SPARQL query - but the query engine needs to read data from multiple datasets (i.e. different locations on the filesystem). Making the file system itself distributed is not yet needed.
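
For the single-node, multiple-locations case, one pragmatic sketch (the directory paths are placeholders, and transaction handling is left out) is to wrap the default graphs of several TDB datasets in a MultiUnion so that one query sees all of them:

    import org.apache.jena.graph.compose.MultiUnion;
    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.tdb.TDBFactory;

    public class UnionOfStores {
        public static void main(String[] args) {
            Dataset a = TDBFactory.createDataset("/data/tdb-partition-a");
            Dataset b = TDBFactory.createDataset("/data/tdb-partition-b");

            MultiUnion union = new MultiUnion();
            union.addGraph(a.getDefaultModel().getGraph());
            union.addGraph(b.getDefaultModel().getGraph());

            // Queries against 'combined' now see the triples of both stores.
            Model combined = ModelFactory.createModelForGraph(union);
            System.out.println("Combined size: " + combined.size());
        }
    }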
Post by Andy Seaborne
There are several ways to go beyond a single machine for different usages.
Reusing a store any key/value or column store tat can provide range
scans easily - that's the core of SPARQL execution is a reasonable baseline.
The other choice these days is Apache Spark.
I guess Spark only helps for cluster processing but that is not what I need for now.
Post by Andy Seaborne
So how big is big? (in terms of triples and in gigabytes) and what's the
primary use case (stype of query that matters)?
So far it has mostly been theoretical thinking; I don't know yet how many triples I may end up with. In my tests Jena's dataset already takes about 100 MB (and 500k triples) on disk, and that was just for a single use case. And I do have about 100k use cases (even though the tested use case was one of the largest ones).

The platform I'm building is a search engine: data is written rarely but needs to be read many times afterwards. In the end, SPARQL SELECTs (i.e. read-only queries) will be most common.
Post by Andy Seaborne
Or we could try to make the indexes smaller ...
Yes, that is exactly what I'm trying to do, but maybe I mean something different by it than you do. As said above, SPARQL SELECTs are the most common use case. I want users to be able to write their SELECTs without having to worry about how the data is physically stored. It needs to be the job of the Linked Data indexer to figure out how to split up the data and how to handle the queries correctly.
Rob Vesse
2016-10-27 09:31:57 UTC
Permalink
To be honest this is not particularly huge. TDB is capable of scaling happily into the tens of millions or even hundreds of millions of triples, depending on the nature of the queries that will be asked of the data, provided that it is given a machine with sufficient RAM, since it relies upon memory-mapped files for performance.

Have you observed particular performance problems in your tests so far? If so, partitioning your data across multiple instances is unlikely to provide any improvements; multiple instances will likely have larger disk and memory footprints overall than a single instance.

I’m not saying that you shouldn’t go down this route but it seems like a bit of an XY problem at the moment.

Rob

Simon Schäfer
2016-10-27 12:37:49 UTC
Permalink
Post by Rob Vesse
To be honest this is not particularly huge. TDB is capable of scaling happily into the tens of millions or even hundreds of millions of triples depending on the nature of the queries that will be asked of the data. Provided that it is given a machine with sufficient RAM since it relies upon memory mapped files for performance.
Have you observed particular performance problems in your tests so far? If so partitioning your data across multiple instances is unlikely to provide any improvements. Multiple instances will likely have larger disk and memory footprints overall than a single instance.
I’m not saying that you shouldn’t go down this route but it seems like a bit of an XY problem at the moment.
Yes, probably it is. I'm going to stick with a single index for now and see how far I can go; everything else would be over-engineering at the moment. Mostly I wanted to find out whether Jena already supports the functionality I asked for - in that case it would have been a no-brainer to just adopt it.
Joint
2016-10-26 17:09:28 UTC
Permalink
Hi.
I implemented a multi-TDB cluster which I think would do what you want, but it's memory-hungry on the front end because of the way it has to deduplicate (distinct) the multiple TDB streams.
Rather simplistically, the system overrides the find methods on the Dataset interface and forwards each find to multiple TDB back ends. We tried various forwarding methods and currently use RMI. RMI allows us to leverage Java 8 streams, which handle the heavy lifting of producing a distinct stream of the quads returned by the find.
In a bit more detail, we stream the quad pattern we are looking for and map the TDB streams in parallel with a distinct. Hence the potential for high memory usage.
My need was for concurrent write transactions which required this solution to query the solution...
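
Not Dick's actual code, but a minimal sketch of the fan-out-and-distinct idea he describes (transaction handling and the RMI transport are omitted; the class and method names are made up):

    import java.util.Iterator;
    import java.util.List;
    import java.util.Spliterators;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;

    import org.apache.jena.graph.Node;
    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.sparql.core.Quad;

    public class FanOutFind {

        /** Merge find() over several back ends, removing duplicate quads. */
        public static Iterator<Quad> find(List<DatasetGraph> backends,
                                          Node g, Node s, Node p, Node o) {
            return backends.parallelStream()
                    .flatMap(dsg -> streamOf(dsg.find(g, s, p, o)))
                    .distinct()   // the memory-hungry step
                    .iterator();
        }

        private static Stream<Quad> streamOf(Iterator<Quad> it) {
            return StreamSupport.stream(
                    Spliterators.spliteratorUnknownSize(it, 0), false);
        }
    }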


Dick

Simon Schäfer
2016-10-26 20:37:21 UTC
Permalink
Post by Joint
Hi.
I implemented a multi TDB cluster which I think would do what you want. But it's memory hungry on the front end because of the way it needs to distinct the multiple TDB streams.
Rather simplistically the system overrides the find methods on the Dataset interface and forwards the find to multiple TDB back ends. We tried various forward methods and currently use RMI. RMI allows us to leverage streams in Java 8 which handles the heavy lifting of producing a distinct stream of quads returned by the find.
In a bit more detail we stream the quad we are looking for and map the TDB streams in parallel with a distinct. Hence the potential for high memory usage.
My need was for concurrent write transactions which required this solution to query the solution...
Sounds interesting. Your work isn't publicly available, is it?
Joint
2016-10-27 16:59:04 UTC
Permalink
Hi.
I'm in agreement: we run a TDB instance which currently holds 4.5B triples, and performance meets our requirements on a 16 GB, 8-core server. Use SSD drives! We also have this data set split across multiple TDB stores, which are fronted by the aforementioned specialised dataset graph.

Dick

Joint
2016-10-30 10:51:29 UTC
Permalink
Hi.
Not currently, because it was written within another framework, but I'm working on pulling the relevant bits out into a separate project with no dependencies other than Jena and core Java.
I just need work to give me more free time... ;-)


Dick

