Discussion:
Estimation for Jena/Fuseki hardware requirements?
Ignacio Tripodi
2016-03-20 17:16:18 UTC
Hello,

I was wondering if you had any minimum hardware suggestions for a
Jena/Fuseki Linux deployment, based on the number of triples used. Is there
a rough guideline for how much RAM should be available in production, as a
function of the size of the imported RDF file (currently less than 2Gb),
number of concurrent requests, etc?

The main use for this will be wildcard text searches using the Lucene
full-text index (basically, unfiltered queries against the inverted index).
No SPARQL Update needed. Other resource-intensive operations would be
refreshing the RDF data monthly, followed by rebuilding indices. The test
deployment on my 2012 MacBook runs queries in the order of tens of ms
(unless it's been idle for a while, then the first query is usually in the
order of hundreds of ms for some reason), so I imagine the hardware
requirements can't be that stringent. If it helps, I had to increase my
Java heap size to 3072Mb.
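
To give a concrete picture, the queries are roughly of this shape (a sketch
using the standard jena-text text:query property function; the rdfs:label
property and the search term are illustrative, not taken from the actual
dataset):

  PREFIX text: <http://jena.apache.org/text#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?entity ?label
  WHERE {
    # wildcard match resolved by the Lucene inverted index
    ?entity text:query (rdfs:label "escherich*") .
    ?entity rdfs:label ?label .
  }
  LIMIT 5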

Thanks for any feedback you could provide!
Andy Seaborne
2016-03-20 21:38:10 UTC
[[
This has been asked on StackOverflow - please copy answers from one
place to the other.
]]

2G in bytes - what is it in triples?
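
(For reference, one way to get that figure is a plain aggregate query over
the dataset - a generic sketch, not tied to any particular setup:)

  SELECT (COUNT(*) AS ?triples)
  WHERE { ?s ?p ?o }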

Is this Lucene or Solr?

Is the RDF data held in TDB as the storage? If so, then part of the memory
use is due to TDB's memory-mapped files - these live in the OS file system
cache, not in the Java heap. The amount of space they take flexes with use
(the OS does the flexing automatically).

For TDB:

TDB write transactions use memory for intermediate space. Read requests
do not normally take space over and above the database caching.

If the data has many large literals, then more heap may be needed;
otherwise the space is due to Lucene itself. The jena text subsystem
materializes results, so very large result sets may also be a factor.

A slow first query after the machine has been idle is possibly because
either the machine is swapping and the in-RAM cached data got swapped out,
or the file system cache has displaced the data, so it has to go back to
persistent storage. If you were doing other things on the machine, the
latter is more likely.

Andy
Ignacio Tripodi
2016-03-21 01:28:30 UTC
Hey Andy,

Sorry about the duplicate post, I just removed the one on StackOverflow.

This is using Lucene. Currently at 1.6Gb, most of the content is a
collection of (biological) taxonomic entities plus a few OWL definitions to
lay out the ontology, and as you correctly guessed, it is stored in TDB. All
.dat and .idn files after importing and rebuilding the indices add up to
about 2.1Gb. Is it fair to assume that with at least 2.1Gb of free memory
in this case, we would be in an optimal situation for caching?

All SPARQL queries for partial string matches will be limited to only the
first handful of (say, 5) results. Should I consider large result sets in
the hardware estimations, regardless? Does Jena still have to internally
bring up the entire result set before filtering the response?

Your theory about swapping for the scenario of slow first requests makes
sense. I'm not too concerned about it (at least until I see how it behaves
in production).

Many thanks for the insights,

-i
Andy
Andy Seaborne
2016-03-27 12:05:51 UTC
Post by Ignacio Tripodi
Hey Andy,
Sorry about the duplicate post, I just removed the one on StackOverflow.
This is using Lucene. Currently at 1.6Gb, most of the content is a
collection of (biological) taxonomic entities plus a few OWL definitions to
lay out the ontology, and as you correctly guessed, it is stored in TDB. All
.dat and .idn files after importing and rebuilding the indices add up to
about 2.1Gb. Is it fair to assume that with at least 2.1Gb of free memory
in this case, we would be in an optimal situation for caching?
Yes - that is a good starting point.

(Counts in triples would be useful.)
Post by Ignacio Tripodi
All SPARQL queries for partial string matches will be limited to only the
first handful of (say, 5) results. Should I consider large result sets in
the hardware estimations, regardless? Does Jena still have to internally
bring up the entire result set before filtering the response?
For a text query, it does have to get all the text index results.

Lucene's IndexSearcher.search method returns a TopDocs containing all of
the results (after Lucene's own limiting).
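
(One mitigation, assuming the standard jena-text query form: the optional
limit argument in the text:query object list caps the number of hits Lucene
returns before they are materialized, independently of any SPARQL LIMIT.
A sketch, with illustrative property and search term:)

  PREFIX text: <http://jena.apache.org/text#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?entity
  WHERE {
    # ask Lucene for at most 5 hits instead of trimming afterwards
    ?entity text:query (rdfs:label "escherich*" 5) .
  }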

Andy
Ignacio Tripodi
2016-03-29 17:38:40 UTC
Thanks, Andy. This is a collection of roughly 4.7M triples. How would the
triple count affect the usage estimation, in relation to the total size?

Best regards,

-i
Andy Seaborne
2016-03-30 19:54:09 UTC
Post by Ignacio Tripodi
Thanks, Andy. This is a collection of roughly 4.7M triples. How would the
triple count affect the usage estimation, in relation to the total size?
From the RDF storage point of view, 4.7 million triples isn't big (it
fits in memory [*], or gets so cached in TDB that it's effectively
in-memory, much of which is not in-heap). Together with the Lucene side,
it looks fine for a current server-class machine (which seems to mean the
16G tending to 32G range these days - this email will be out of date soon! [**]).

An SSD is good.

And generally, portables of any kind have slower I/O paths than servers.


Andy

[*] The amount of heap needed for parsed files decreases with the next
release, due to some node caching.


[**]
Don't set a Java heap between 32G and 48G!
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
Ignacio Tripodi
2016-03-30 20:07:45 UTC
Excellent! I'll keep an eye out for the next release. In the meantime, this
provides enough background information.
Thanks, Andy

-i