Discussion:
Jena with Large Data and single process
Adeeb Noor
2013-07-18 07:00:29 UTC
Hi All:

I have been using Jena for a large biomedical application, with a TDB store
of 3G and around 2 billion triples, and I have noticed performance issues.
So I have two questions:

1- Is Jena built to handle an application as large as the one I am
developing, or do I have to use a commercial tool?

2- Is Jena single-threaded, or can it run on multiple threads?

Thanks
--
Adeeb Noor
Ph.D. Candidate
Dept of Computer Science
University of Colorado at Boulder
Cell: 571-484-3303
Email: Adeeb.noor-UWkI3MzZw7X2fBVCVOL8/***@public.gmane.org
Dave Reynolds
2013-07-19 23:34:53 UTC
That size is probably pushing things, but it depends on your queries and the amount of memory.

Note that for TDB, assuming 64-bit and normal memory-mapped files, you don't want to give Java all the memory; leave the OS some for caching.

Regarding threads, TDB doesn't use threads as far as I know. It is thread-safe, so you can have multiple threads in your code doing reads. But whether that will speed things up or just lead to more cache contention, I don't know.

Dave
Andy Seaborne
2013-07-21 12:26:17 UTC
Post by Adeeb Noor
I have been using Jena for a large biomedical application, with a TDB store
of 3G and around 2 billion triples, and I have noticed performance issues.
1- Is Jena built to handle an application as large as the one I am
developing, or do I have to use a commercial tool?
2 billion (2*10^9) is pushing the limits of TDB. It will only handle
very simple lookup queries at this scale.

Have you considered 4Store?
Post by Adeeb Noor
2- Is Jena single-threaded, or can it run on multiple threads?
The query engine and TDB use one thread per request but there can be
many requests at the same time. This is what Fuseki is doing.

(Actually, there is a bit of internal multithreading for sorting)
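
The one-thread-per-request model described above can be sketched roughly as follows. This is a minimal illustration in Python, not Jena or Fuseki code: the names TRIPLES, handle_request and serve are hypothetical, and the toy tuple of triples stands in for a read-only store. Each request is answered entirely on one worker thread, while a thread pool lets several requests proceed at the same time.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy read-only "store" of (subject, predicate, object) triples.
# Hypothetical data, for illustration only.
TRIPLES = (
    ("ex:aspirin", "ex:treats", "ex:headache"),
    ("ex:aspirin", "ex:interactsWith", "ex:warfarin"),
    ("ex:ibuprofen", "ex:treats", "ex:headache"),
)


def handle_request(pattern):
    """Answer one lookup on the calling thread; None acts as a wildcard."""
    s, p, o = pattern
    return [
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def serve(requests, workers=4):
    """Run each request on a single worker thread, several at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the requests.
        return list(pool.map(handle_request, requests))


# Two independent requests, each handled by one thread.
results = serve([
    (None, "ex:treats", "ex:headache"),
    ("ex:aspirin", None, None),
])
```

Because the requests only read shared data, no locking is needed in this sketch; a real store also has to coordinate readers with writers.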

Andy
Marco Neumann
2013-07-21 12:34:25 UTC
What are the most critical issues that prevent TDB from handling larger
data sets at the moment, in your opinion? The JVM? The index?
--
Marco Neumann
KONA
Andy Seaborne
2013-07-21 15:36:46 UTC
Post by Marco Neumann
What are the most critical issues that prevent TDB from handling larger
data sets at the moment, in your opinion? The JVM? The index?
Hi Marco,

Probably the lack of some degree of clustering.

Even improved, more compact indexing (simple run-length encoding of the
B+Tree leaves, for example) doesn't make the jump IMO. More RAM and more
CPU do.

There is a limit on system bus I/O to main RAM: commodity hardware has
only one bus, and the processor isn't actually running at 100%. Database
code does a lot of data structure walking and not a lot of CPU-intensive
compute. [1]

So scaling on a single machine means one with a special (= expensive)
interconnect and RAM. Not so commodity any more.

Several machines, by which I mean a few, like 4-10, seem a more
effective way to go.

There are some consequences of this: MVCC data structures are better for
transactions across a cluster. The demands of multi-machine transaction
coordination are easier to meet.

(MVCC data structures are ones where, when you write, you also copy all
the tree nodes from the root to the active block, so the structure has one
root per transaction. Transactions also become one-write, not two
(once to the log, once to the main DB). See CouchDB or Mulgara for example.)
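
The path-copying idea in that parenthesis can be illustrated with a small sketch. It is written in Python and uses a plain binary search tree rather than a B+Tree, purely to keep it short; it is not how TDB, CouchDB or Mulgara are implemented, and the names Node, insert and lookup are hypothetical. A write copies only the nodes on the path from the root to the changed node and returns a new root, so each committed version keeps its own root while sharing the untouched subtrees.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node; a write never modifies an existing node."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return a new root; only nodes on the path root -> key are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    # Same key: copy this node with the new value.
    return root._replace(value=value)


def lookup(root, key):
    """Read from whichever version (root) the caller holds."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


# One root per committed write: v1 stays readable after v2 is created.
v1 = insert(insert(insert(None, 5, "a"), 2, "b"), 8, "c")
v2 = insert(v1, 8, "c2")
```

A reader holding v1 still sees the old value for key 8 while a reader holding v2 sees the new one, and the left subtree is the same object in both versions. That sharing is why a write only needs to produce the copied path once, rather than writing to a log and then again to the main structure.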

And "just doing it" so there is a working architecture that can be
improved rather than trying to be perfect first time.

Andy

[1] See
http://highscalability.com/blog/2013/6/13/busting-4-modern-hardware-myths-are-memory-hdds-and-ssds-rea.html
and the 45min video


