Discussion:
Estimation for Jena/Fuseki hardware requirements?
Ignacio Tripodi
2016-03-20 17:16:18 UTC
Hello,

I was wondering if you had any minimum hardware suggestions for a
Jena/Fuseki Linux deployment, based on the number of triples used. Is there
a rough guideline for how much RAM should be available in production, as a
function of the size of the imported RDF file (currently less than 2Gb),
number of concurrent requests, etc?

The main use for this will be wildcard text searches using the Lucene
full-text index (basically, unfiltered queries against the inverted index).
No SPARQL Update needed. Other resource-intensive operations would be
refreshing the RDF data monthly, followed by rebuilding indices. The test
deployment on my 2012 MacBook runs queries in the order of tens of ms
(unless it's been idle for a while, then the first query is usually in the
order of hundreds of ms for some reason), so I imagine the hardware
requirements can't be that stringent. If it helps, I had to increase my
Java heap size to 3072Mb.
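
To give a concrete picture, the queries are roughly of this shape (a sketch
using the standard jena-text text:query property function; the rdfs:label
property and the search term are illustrative, not taken from the actual
dataset):

  PREFIX text: <http://jena.apache.org/text#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?entity ?label
  WHERE {
    # wildcard match resolved by the Lucene inverted index
    ?entity text:query (rdfs:label "escherich*") .
    ?entity rdfs:label ?label .
  }
  LIMIT 5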

Thanks for any feedback you could provide!
Andy Seaborne
2016-03-20 21:38:10 UTC
[[
This has been asked on StackOverflow - please copy answers from one
place to the other.
]]

2G in bytes - what is it in triples?
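
(For reference, one way to get that figure is a plain aggregate query over
the dataset - a generic sketch, not tied to any particular setup:)

  SELECT (COUNT(*) AS ?triples)
  WHERE { ?s ?p ?o }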

Is this Lucene or Solr?

Is the RDF data held in TDB as the storage? If so, then part of the memory
use is due to TDB's memory-mapped files - these live in the OS file system
cache, not in the Java heap. The amount of space they take flexes with use
(the OS does the flexing automatically).

For TDB:

TDB write transactions use memory for intermediate space. Read requests
do not normally take space over and above the database caching.

If the data has many large literals, then more heap may be needed;
otherwise the space is due to Lucene itself. The jena text subsystem
materializes results, so very large result sets may also be a factor.

A slow first query after the machine has been idle is possibly because
either the machine is swapping and the in-RAM cached data got swapped out,
or the file system cache has displaced the data, so it has to go back to
persistent storage. If you were doing other things on the machine, the
latter is more likely.

Andy
Ignacio Tripodi
2016-03-21 01:28:30 UTC
Hey Andy,

Sorry about the duplicate post, I just removed the one on StackOverflow.

This is using Lucene. Currently at 1.6Gb, most of the content is a
collection of (biological) taxonomic entities plus a few OWL definitions to
lay out the ontology, and as you correctly guessed, it is stored in TDB. All
.dat and .idn files after importing and rebuilding the indices add up to
about 2.1Gb. Is it fair to assume that with at least 2.1Gb of free memory
in this case, we would be in an optimal situation for caching?

All SPARQL queries for partial string matches will be limited to only the
first handful of (say, 5) results. Should I consider large result sets in
the hardware estimations, regardless? Does Jena still have to internally
bring up the entire result set before filtering the response?

Your theory about swapping for the scenario of slow first requests makes
sense. I'm not too concerned about it (at least until I see how it behaves
in production).

Many thanks for the insights,

-i
Andy
Andy Seaborne
2016-03-27 12:05:51 UTC
Post by Ignacio Tripodi
Hey Andy,
Sorry about the duplicate post, I just removed the one on StackOverflow.
This is using Lucene. Currently at 1.6Gb, most of the content is a
collection of (biological) taxonomic entities plus a few OWL definitions to
lay out the ontology, and as you correctly guessed, it is stored in TDB. All
.dat and .idn files after importing and rebuilding the indices add up to
about 2.1Gb. Is it fair to assume that with at least 2.1Gb of free memory
in this case, we would be in an optimal situation for caching?
Yes - that is a good starting point.

(Counts in triples would be useful.)
Post by Ignacio Tripodi
All SPARQL queries for partial string matches will be limited to only the
first handful of (say, 5) results. Should I consider large result sets in
the hardware estimations, regardless? Does Jena still have to internally
bring up the entire result set before filtering the response?
For a text query, it does have to get all the text index results.

Lucene's IndexSearcher.search method returns a TopDocs containing all of
the results (after Lucene's own limiting).
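
(One mitigation, assuming the standard jena-text query form: the optional
limit argument in the text:query object list caps the number of hits Lucene
returns before they are materialized, independently of any SPARQL LIMIT.
A sketch, with illustrative property and search term:)

  PREFIX text: <http://jena.apache.org/text#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?entity
  WHERE {
    # ask Lucene for at most 5 hits instead of trimming afterwards
    ?entity text:query (rdfs:label "escherich*" 5) .
  }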

Andy
Ignacio Tripodi
2016-03-29 17:38:40 UTC
Thanks, Andy. This is a collection of roughly 4.7M triples. How would the
triple count affect the usage estimation, in relation to the total size?

Best regards,

-i
Andy Seaborne
2016-03-30 19:54:09 UTC
Post by Ignacio Tripodi
Thanks, Andy. This is a collection of roughly 4.7M triples. How would the
triple count affect the usage estimation, in relation to the total size?
From the RDF storage point of view, 4.7 million triples isn't big (it
fits in memory [*], or gets so cached in TDB that it's effectively
in-memory, much of which is not in-heap). Together with the Lucene side,
it looks fine for a current server-class machine (which seems to mean the
16G tending to 32G range these days - this email will be out of date soon! [**]).

An SSD is good.

And generally, portables of any kind have slower I/O paths than servers.


Andy

[*] The amount of heap needed for parsed files decreases with the next
release, due to some node caching.


[**]
Don't set a Java heap between 32G and 48G!
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
Ignacio Tripodi
2016-03-30 20:07:45 UTC
Excellent! I'll keep an eye out for the next release. In the meantime, this
provides enough background information.
Thanks, Andy

-i