Jena with Large Data and single process

Discussion:

Adeeb Noor

2013-07-18 07:00:29 UTC

Hi All:

I have been using Jena for a huge biomedical application, with a TDB store
of 3G and around 2 billion triples, and I have noticed performance issues.
So I have two questions:

1- Is Jena built to handle a huge application like the one I am developing,
or do I have to use a commercial tool?

2- Is Jena single-threaded, or can it run on multiple threads?

Thanks

--
Adeeb Noor
Ph.D. Candidate
Dept of Computer Science
University of Colorado at Boulder
Cell: 571-484-3303

Dave Reynolds

2013-07-19 23:34:53 UTC

That size is probably pushing things, but it depends on your queries and the amount of memory you have.

Note that for TDB, assuming a 64-bit JVM and the normal memory-mapped files, you don't want to give Java all the memory; leave the OS some for caching.
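
To make the point about heap versus OS cache concrete, here is a minimal sketch that reports the JVM's maximum heap and how much physical RAM that leaves for the operating system's file cache. The class name HeapBudget, the helper bytesLeftForOsCache, and the 16 GiB default are all hypothetical; physical RAM is supplied by hand, and nothing here is part of Jena or TDB.

```java
class HeapBudget {
    /** Bytes left for the OS file cache once the JVM heap is taken out of physical RAM. */
    static long bytesLeftForOsCache(long physicalRamBytes, long jvmMaxHeapBytes) {
        return Math.max(0L, physicalRamBytes - jvmMaxHeapBytes);
    }

    public static void main(String[] args) {
        // Physical RAM in GiB is passed on the command line; 16 is only a placeholder.
        long ramGiB = args.length > 0 ? Long.parseLong(args[0]) : 16L;
        long ramBytes = ramGiB * 1024L * 1024L * 1024L;

        // Upper bound on the heap, as set by -Xmx or the JVM default.
        long heapBytes = Runtime.getRuntime().maxMemory();

        long left = bytesLeftForOsCache(ramBytes, heapBytes);
        System.out.printf("max heap: %d MiB, left for OS cache: %d MiB%n",
                heapBytes >> 20, left >> 20);
    }
}
```

Running it with different -Xmx settings shows how quickly a large heap eats into the memory the OS could otherwise use to cache the memory-mapped index files.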

Regarding threads, TDB doesn't use threads internally as far as I know. It is thread-safe, so you can have multiple threads in your code doing reads. Whether that will speed things up or just lead to more cache contention, I don't know.

Dave

Andy Seaborne

2013-07-21 12:26:17 UTC

2 billion (2*10^9) is pushing the limits of TDB. It will only handle
very simple lookup queries at this scale.

Have you considered 4Store?

Post by Adeeb Noor
2- Is Jena single-threaded, or can it run on multiple threads?

The query engine and TDB use one thread per request but there can be
many requests at the same time. This is what Fuseki is doing.

(Actually, there is a bit of internal multithreading for sorting)
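
The "one thread per request, many requests at once" model can be sketched with nothing but the Java standard library. This is an illustration of the pattern only: the read-only Map stands in for a dataset, each lookup stands in for a query, and the names ConcurrentReads and runRequests are hypothetical. It does not use the Jena, TDB, or Fuseki APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ConcurrentReads {
    /** Submits one task per request; each request is served start to finish by one pool thread. */
    static List<Integer> runRequests(Map<String, Integer> store, List<String> keys, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String key : keys) {
                // The store is only read, never modified, while requests are running.
                Callable<Integer> request = () -> store.getOrDefault(key, -1);
                futures.add(pool.submit(request));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get()); // results come back in submission order
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> store = new HashMap<>();
        store.put("a", 1);
        store.put("b", 2);
        System.out.println(runRequests(store, Arrays.asList("a", "b", "zz"), 4)); // [1, 2, -1]
    }
}
```

More threads raise throughput only while the requests are not all competing for the same cache and memory bandwidth, which is the contention concern raised earlier in the thread.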

Andy

Marco Neumann

2013-07-21 12:34:25 UTC

What are the most critical issues that prevent TDB from handling larger
data sets at the moment, in your opinion? The JVM? The index?

--
---
Marco Neumann
KONA

Andy Seaborne

2013-07-21 15:36:46 UTC

Post by Marco Neumann
what are the most critical issues that prevent TDB from handling larger
data sets at the moment iyo? jvm? index?

Hi Marco,

Probably what is needed is some degree of clustering.

Even improved, more compact indexing (simple run-length encoding of the
B+Tree leaves, for example) doesn't make the jump IMO. More RAM and more
CPU do.

There is a limitation on system bus I/O to main RAM - there is only one
bus on commodity hardware and the processor isn't actually running at
100%. Database code is doing a lot of data structure walking, and not a
lot of CPU-intensive compute. [1]

So a single machine as the way to scale means one with special
(=expensive) interconnect and RAM. Not so commodity any more.

Several machines, by which I mean a few, like 4-10, seems a more
effective way to go.

There are some consequences of this - MVCC data structures are better for
transactions across a cluster. The demands of multi-machine transaction
coordination are easier to meet.

(MVCC data structures are ones where, when you write, you also copy all
the tree nodes from the root to the active block, so the structure has one
root per transaction. Transactions also become one-write rather than two
(once to the log, once to the main DB). See CouchDB or Mulgara for example.)
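
The copy-from-root-to-active-block idea can be sketched as path copying in an immutable tree. This is a simplified illustration, assuming a plain binary search tree in memory rather than a block-based B+Tree on disk; the names PathCopyTree, Node, insert, and contains are hypothetical and are not taken from TDB, CouchDB, or Mulgara.

```java
class PathCopyTree {
    /** Immutable node: once a version is published, its nodes are never modified. */
    static final class Node {
        final int key;
        final Node left;
        final Node right;

        Node(int key, Node left, Node right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    /** Returns the root of a new version; only nodes on the root-to-leaf path are copied. */
    static Node insert(Node root, int key) {
        if (root == null) return new Node(key, null, null);
        if (key < root.key) return new Node(root.key, insert(root.left, key), root.right);
        if (key > root.key) return new Node(root.key, root.left, insert(root.right, key));
        return root; // key already present: the existing version is reused unchanged
    }

    static boolean contains(Node root, int key) {
        while (root != null) {
            if (key == root.key) return true;
            root = key < root.key ? root.left : root.right;
        }
        return false;
    }

    public static void main(String[] args) {
        Node v1 = insert(insert(insert(null, 50), 30), 70); // root seen by an older reader
        Node v2 = insert(v1, 60);                            // a writer produces a new root

        System.out.println(contains(v1, 60));   // false: the old version is untouched
        System.out.println(contains(v2, 60));   // true
        System.out.println(v1.left == v2.left); // true: the unmodified subtree is shared
    }
}
```

Each write yields a new root while readers holding an older root keep a consistent view, which is what gives one root per transaction. The sketch does not model persistence to disk or the single-write commit mentioned above.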

And "just doing it" so there is a working architecture that can be
improved rather than trying to be perfect first time.

Andy

[1] See
http://highscalability.com/blog/2013/6/13/busting-4-modern-hardware-myths-are-memory-hdds-and-ssds-rea.html
and the 45min video
