Fuseki hangs under heavy SPARQL query load

Discussion:

Petr Baudis

2014-12-21 16:54:50 UTC

Hi!

I tried to use Apache Fuseki for my QA system, loaded up with part of
DBpedia and set up according to:

https://github.com/brmson/yodaqa/blob/master/data/dbpedia/README.md

It works beautifully, but the system puts Fuseki under a pretty heavy
load, with several tens of SPARQL queries per second at times, often in
parallel. And after about an hour on average, Fuseki just hangs up,
still accepting new queries but never generating a result.

I suspect it might be some kind of deadlock, but I would need some
advice on how to debug it best or what kind of data you would need.

(If you think a different kind of server would be better for this use
case, I'll be happy to hear suggestions too. :-) I was using Virtuoso
so far, but with abysmal experience (self-corrupting database, hangs of
a different kind), and couldn't get 4store to work; I imported the data
but never got it to return any data in finite time with SPARQL queries
that work with Virtuoso and Fuseki.)

Thanks,

Petr Baudis

Andy Seaborne

2014-12-21 20:30:02 UTC

Hi Petr,

What is the setup in terms of hardware (RAM size, number of CPUs etc
etc), operating system and versions? The details do matter here.

This may be JENA-801 [1].

If so, there is a suggested fix but the code contribution hasn't arrived
yet.

I believe the reporter of JENA-801 (Bala Kolla) used something related
to this to replace CacheLRU:

https://github.com/afs/AFS-Dev/blob/master/src%2Fmain%2Fjava%2Fprojects%2Fcache%2FCacheGuava.java

though if we are going to use Guava Cache (which is highly likely) then
there is an even better way to use it in the presence of updates. That's
why I'd like to see what has been done, to know if the update changes
were also tried out. If your usage is just query load, I can
build a special for you to try out (or you can: replace the body of
CacheLRU with the CacheGuava body, add the dependency to ARQ and build with maven).

Andy

[1]
https://issues.apache.org/jira/browse/JENA-801

Petr Baudis

2014-12-22 02:03:03 UTC

Hi!

Post by Andy Seaborne

Post by Petr Baudis
It works beautifully, but the system puts Fuseki under a pretty heavy
load, with several tens of SPARQL queries per second at times, often in
parallel. And after about an hour on average, Fuseki just hangs up,
still accepting new queries but never generating a result.

..snip..

Post by Andy Seaborne
What is the setup in terms of hardware (RAM size, number of CPUs etc
etc), operating system and versions? The details do matter here.

This is:

24GiB RAM
8x AMD FX(tm)-8350 Eight-Core Processor
Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt2-1 (2014-12-08) x86_64 GNU/Linux
(Debian Wheezy, with some leaf Jessie packages mixed in)
Fuseki 1.1.1 (binary distribution)
Jena 2.12.1 (used for tdbloader)

The SPARQL endpoint is publicly available at

http://pasky.or.cz:3030/dbpedia/query

(right now I just run fuseki in a loop and kill it every 10 minutes
as a stopgap measure).

Post by Andy Seaborne
This may be JENA-801 [1].

Hmm. I see, that's interesting. Just to clarify, though - what I'm
seeing is a hard hang, the Fuseki process is not consuming any CPU and
no queries are ever answered (at least in the order of hours). It is
not simply a performance degradation, which I get the impression is
what JENA-801 is about.

(Also, while I'm hitting Fuseki with a lot of queries, I believe there
should never be more than two connections + queries going on at the same
time. The queries are pretty simple, typically take 2-3ms to service.)

(Also, I really do just queries, no updates, I'm running in read-only
mode.)

(Also, no messages like Java GC notifications are printed on Fuseki's
console in the event of this deadlock.)

So this really seems quite different to what I'm reading there and in
JENA-689, JENA-703.

Post by Andy Seaborne
That's why I'd like to see what has been done, to know if the update
changes were also tried out. If your usage is just query
load, I can build a special for you to try out (or you can: replace
the body of CacheLRU with the CacheGuava body, add the dependency to ARQ and
build with maven).

If in the light of the above you still think trying this out makes
sense, I will be happy to do that in the course of next few days. If
building a special for me would be easy for you, I'd appreciate that,
but otherwise I can give it a try.

--
Petr Baudis
If you do not work on an important problem, it's unlikely
you'll do important work. -- R. Hamming
http://www.cs.virginia.edu/~robins/YouAndYourResearch.html

Petr Baudis

2014-12-22 02:04:35 UTC

Post by Petr Baudis
8x AMD FX(tm)-8350 Eight-Core Processor

^^^ this is just 8 cores, not 8*8 cores. ;-)

Petr Baudis

Andy Seaborne

2014-12-22 20:46:39 UTC

Post by Petr Baudis
Hmm. I see, that's interesting. Just to clarify, though - what I'm
seeing is a hard hang, the Fuseki process is not consuming any CPU and
no queries are ever answered (at least in the order of hours). It is
not simply a performance degradation, which I get the impression is
what JENA-801 is about.
(Also, while I'm hitting Fuseki with a lot of queries, I believe there
should never be more than two connections + queries going on at the same
time. The queries are pretty simple, typically take 2-3ms to service.)
(Also, I really do just queries, no updates, I'm running in read-only
mode.)
(Also, no messages like Java GC notifications are printed on Fuseki's
console in the event of this deadlock.)
So this really seems quite different to what I'm reading there and in
JENA-689, JENA-703.

Certainly not JENA-689 or JENA-703, which are update related and so not
obviously relevant here. JENA-801 is an issue about locking on the node
table, and that synchronization happens in the read-only situation as
well, which is why I thought it might be related.

If there are only a few connections (<=2) with large numbers of small queries
issued over each connection, and assuming there are no sorts and no timeouts
set, then the execution of each query should be all on the thread that it
came in on. And you have 8 (shame it's not 8*8!) cores. Do you have a couple
of example queries you can share?

Does the CPU load increase to start with, then drop off? Fuseki/TDB is
typically CPU-busy once the OS has warmed up and the working set of index
files is in memory.

Maybe the first thing to try is to point jvisualvm (in the JDK) or some
other monitoring tool at the Fuseki process and see if there is any
evidence. The thread dump would be useful. (jconsole even has a "Detect
Deadlock" which I have never used but the button label is suggestive)
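
(For illustration, the JDK also exposes deadlock detection and thread dumps programmatically through java.lang.management.ThreadMXBean. The sketch below is only an assumption about how one might use it: the class name DeadlockCheck is made up, and the code inspects the JVM it runs in, so for Fuseki it would have to run inside the server process or be reached over JMX, whereas jstack and jconsole attach from outside.)

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class DeadlockCheck {
    /** Returns the number of threads this JVM reports as deadlocked (0 if none). */
    public static int countDeadlockedThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no deadlock is found
        return ids == null ? 0 : ids.length;
    }

    /** Builds a plain-text dump of every live thread in this JVM. */
    public static String dumpAllThreads() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details, which keeps the dump cheap to produce
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append(info.toString()); // note: toString() shortens long stack traces
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + countDeadlockedThreads());
        System.out.print(dumpAllThreads());
    }
}
```

A count of zero together with many threads parked in the same frame would point away from a lock cycle and towards some other kind of stall.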

Andy

Petr Baudis

2014-12-22 22:58:38 UTC

Hi!

Post by Andy Seaborne
If there are only a few connections (<=2) with large numbers of small
queries issued over each connection, and assuming there are no sorts and
no timeouts set, then the execution of each query should be all on
the thread that it came in on. And you have 8 (shame it's not 8*8!)
cores. Do you have a couple of example queries you can share?

Sure! It is typically something like

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX : <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?t WHERE {
  {
    ?res rdfs:label "California"@en
  } UNION {
    ?redir dbo:wikiPageRedirects ?res .
    ?redir rdfs:label "California"@en
  }
  ?res rdf:type ?t
  FILTER ( ! regex(str(?res), "^http://dbpedia.org/resource/[^_]*:", "i") )
}

or variations for different labels.
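
(To make the FILTER easier to read: it drops resources whose local name has a colon before any underscore, which is how namespace-style pages such as categories appear. The sketch below mirrors the same pattern with java.util.regex; the class name ResourceFilter and the example IRIs are illustrative assumptions, not part of the original setup.)

```java
import java.util.regex.Pattern;

public class ResourceFilter {
    // Same pattern and "i" flag as the FILTER in the query above.
    private static final Pattern NAMESPACED =
        Pattern.compile("^http://dbpedia.org/resource/[^_]*:", Pattern.CASE_INSENSITIVE);

    /** True when the query's FILTER would keep this resource IRI. */
    public static boolean isKept(String iri) {
        // SPARQL regex() matches anywhere in the string, so find() is the counterpart.
        return !NAMESPACED.matcher(iri).find();
    }

    public static void main(String[] args) {
        System.out.println(isKept("http://dbpedia.org/resource/California"));          // true
        System.out.println(isKept("http://dbpedia.org/resource/Category:California")); // false
    }
}
```

Because `[^_]*` cannot cross an underscore, a name whose colon only appears after an underscore is still kept.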

Post by Andy Seaborne
Does the CPU load increase to start with, then drop off?
Fuseki/TDB is typically CPU-busy once the OS has warmed up and the
working set of index files is in memory.

I see no obvious CPU load variations. A lot of the queries are
repeated (so quickly warmed cache) and the server runs the user software
itself too.

Post by Andy Seaborne
Maybe the first thing to try is to point jvisualvm (in the JDK) or
some other monitoring tool at the Fuseki process and see if there is
any evidence. The thread dump would be useful. (jconsole even has a
"Detect Deadlock" which I have never used but the button label is
suggestive)

Hmm, seems like that requires a GUI. I can give that a whirl at the
end of the week as I have only remote access to the machine until then.

--
Petr Baudis

Andy Seaborne

2014-12-23 09:31:24 UTC

Post by Petr Baudis
Hmm, seems like that requires a GUI. I can give that a whirl at the
end of the week as I have only remote access to the machine until then.

You can do it from the command line with the original tool that was swept
up into jvisualvm:

jstack ProcessId > stack_dump

(IIRC it's officially unsupported these days, but my Java 7 and 8
installations have it)

Andy

Petr Baudis

2014-12-23 20:40:34 UTC

Post by Andy Seaborne
You can do it from the command line with the original tool that was
swept up into jvisualvm:
jstack ProcessId > stack_dump
(IIRC it's officially unsupported these days, but my Java 7 and 8
installations have it)

Thanks for the hint! So I saw 1024 threads with

Thread 11824: (state = IN_NATIVE)
- sun.nio.ch.FileDispatcherImpl.read0(java.io.FileDescriptor, long, int) @bci=0 (Compiled frame; information may be imprecise)
- sun.nio.ch.SocketDispatcher.read(java.io.FileDescriptor, long, int) @bci=4, line=39 (Compiled frame)
- sun.nio.ch.IOUtil.readIntoNativeBuffer(java.io.FileDescriptor, java.nio.ByteBuffer, long, sun.nio.ch.NativeDispatcher) @bci=114, line=223 (Compiled frame)
- sun.nio.ch.IOUtil.read(java.io.FileDescriptor, java.nio.ByteBuffer, long, sun.nio.ch.NativeDispatcher) @bci=48, line=197 (Compiled frame)
- sun.nio.ch.SocketChannelImpl.read(java.nio.ByteBuffer) @bci=234, line=379 (Compiled frame)
- org.eclipse.jetty.io.nio.ChannelEndPoint.fill(org.eclipse.jetty.io.Buffer) @bci=64, line=235 (Compiled frame)
- org.eclipse.jetty.server.nio.BlockingChannelConnector$BlockingChannelEndPoint.fill(org.eclipse.jetty.io.Buffer) @bci=9, line=242 (Compiled frame)
- org.eclipse.jetty.http.HttpParser.fill() @bci=322, line=1044 (Compiled frame)
- org.eclipse.jetty.http.HttpParser.parseNext() @bci=177, line=298 (Compiled frame)
- org.eclipse.jetty.http.HttpParser.parseAvailable() @bci=1, line=235 (Compiled frame)
- org.eclipse.jetty.server.BlockingHttpConnection.handle() @bci=51, line=72 (Compiled frame)
- org.eclipse.jetty.server.nio.BlockingChannelConnector$BlockingChannelEndPoint.run() @bci=129, line=298 (Compiled frame)
- org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(java.lang.Runnable) @bci=1, line=608 (Compiled frame)
- org.eclipse.jetty.util.thread.QueuedThreadPool$3.run() @bci=47, line=543 (Compiled frame)
- java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)

and 2 management threads, and the cause is pretty obvious at that point
- yes, I simply ran out of sockets! It seems Fuseki does not time out
inactive connections, but of course the true fault lies in my code which
creates so many of them - or rather never closes the connections.
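
(With a dump this large, counting threads by their topmost frame makes the pattern stand out. The following is a minimal sketch that assumes the dump layout shown above, with a "Thread N:" header followed by "- frame" lines; the class name StackDumpSummary is hypothetical and the second thread in the sample is made up for the demonstration.)

```java
import java.util.Map;
import java.util.TreeMap;

public class StackDumpSummary {
    /** Counts threads by the method named in their topmost stack frame. */
    public static Map<String, Integer> countByTopFrame(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        boolean expectTopFrame = false;
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith("Thread ")) {
                expectTopFrame = true; // the next frame line belongs to this thread
            } else if (expectTopFrame && line.startsWith("- ")) {
                int paren = line.indexOf('(');
                String method = line.substring(2, paren < 0 ? line.length() : paren).trim();
                counts.merge(method, 1, Integer::sum);
                expectTopFrame = false; // ignore the deeper frames
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "Thread 11824: (state = IN_NATIVE)",
            " - sun.nio.ch.FileDispatcherImpl.read0(java.io.FileDescriptor, long, int) @bci=0",
            " - sun.nio.ch.SocketDispatcher.read(java.io.FileDescriptor, long, int) @bci=4, line=39",
            "Thread 11825: (state = IN_NATIVE)", // made-up second thread
            " - sun.nio.ch.FileDispatcherImpl.read0(java.io.FileDescriptor, long, int) @bci=0");
        System.out.println(countByTopFrame(sample));
    }
}
```

A single method accounting for nearly every thread, as with the socket read here, suggests the threads are all waiting on idle connections rather than on each other.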

That was caused by me misreading

https://jena.apache.org/documentation/query/app_api.html

and not doing qexec.close() even though I did not use the try (...) { }
construct. (It would be nice if I could reuse the same HTTP connection
for multiple *different* queries - but I don't think that's possible
here, or is it?)
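
(The difference between the two styles can be shown without Jena at all. In the sketch below, FakeQueryExecution is a hypothetical stand-in for a query execution that holds one connection; it is not the Jena class. It only assumes that the real object is released by close(), as the message above describes.)

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CloseDemo {
    /** Hypothetical stand-in for a query execution that holds one connection. */
    static final class FakeQueryExecution implements AutoCloseable {
        static final AtomicInteger OPEN = new AtomicInteger();
        FakeQueryExecution() { OPEN.incrementAndGet(); }
        String execSelect(boolean fail) {
            if (fail) throw new IllegalStateException("query failed");
            return "results";
        }
        @Override public void close() { OPEN.decrementAndGet(); }
    }

    /** Runs n queries without closing; returns how many executions stay open. */
    public static int runLeaky(int n) {
        FakeQueryExecution.OPEN.set(0);
        for (int i = 0; i < n; i++) {
            new FakeQueryExecution().execSelect(false); // never closed
        }
        return FakeQueryExecution.OPEN.get();
    }

    /** Runs n queries in try-with-resources, every other one failing. */
    public static int runClosed(int n) {
        FakeQueryExecution.OPEN.set(0);
        for (int i = 0; i < n; i++) {
            try (FakeQueryExecution qexec = new FakeQueryExecution()) {
                qexec.execSelect(i % 2 == 1);
            } catch (IllegalStateException e) {
                // close() has already run by the time this handler executes
            }
        }
        return FakeQueryExecution.OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("left open without close(): " + runLeaky(1024));
        System.out.println("left open with try-with-resources: " + runClosed(1024));
    }
}
```

The try-with-resources form releases the resource even when the query throws, which a hand-written close() at the end of the method would miss.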

Thanks,

--
Petr Baudis

Andy Seaborne

2014-12-24 10:01:41 UTC

Hi Petr,

Thanks for the update.

Jena uses Apache HttpComponents Client (HttpClient) via code in
org.apache.jena.riot.web.HttpOp.

It should be using a caching ClientConnectionManager. The default cache
size isn't very large.

Or you can install your own client setup with HttpOp.setDefaultHttpClient.

What I think is happening is that if you don't do the close, then the
connection isn't returned to the pool and a new one is created when the
next request comes in. Hence lots of connections all the way through to
the server.
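
(A toy model of that behaviour, for illustration only: Pool below is a hypothetical, much simplified pool and says nothing about how HttpClient is actually implemented. It just reuses an idle connection when one exists and opens a new one otherwise.)

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PoolDemo {
    /** Hypothetical pool: reuse an idle connection or open a new one. */
    static final class Pool {
        private final Deque<Integer> idle = new ArrayDeque<>();
        private int created = 0;
        int acquire() { return idle.isEmpty() ? ++created : idle.pop(); }
        void release(int conn) { idle.push(conn); }
        int created() { return created; }
    }

    /** Issues the given number of requests and returns how many connections were opened. */
    public static int connectionsOpened(int requests, boolean closeAfterUse) {
        Pool pool = new Pool();
        for (int i = 0; i < requests; i++) {
            int conn = pool.acquire();
            // ... the query would be sent and the response read over conn here ...
            if (closeAfterUse) {
                pool.release(conn); // closing hands the connection back for reuse
            }
        }
        return pool.created();
    }

    public static void main(String[] args) {
        System.out.println(connectionsOpened(1024, true));  // 1
        System.out.println(connectionsOpened(1024, false)); // 1024
    }
}
```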

Andy

Petr Baudis

2014-12-24 11:04:51 UTC

Hi!

Post by Andy Seaborne
What I think is happening is that if you don't do the close, then
the connection isn't returned to the pool and a new one is created
when the next request comes in. Hence lots of connections all the
way through to the server.

Hmm, I don't see connections being reused even when I do .close(),
though. I'll take a look later to see if I can easily make it reuse
connections with a custom HTTP client class.

Still, I think it'd be worthwhile to time out open connections on the
Fuseki server side after a while. Otherwise, it is e.g. trivial to DDoS
a server open to the internet.
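
(The idea of a server-side idle timeout can be sketched with plain sockets. This is not how Fuseki or Jetty implement it, where it would be a connector setting rather than hand-written code; the class name IdleTimeoutDemo and the timeout value are assumptions for the demonstration.)

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class IdleTimeoutDemo {
    /**
     * Accepts one loopback connection, waits up to idleMillis for its first
     * byte, and returns true if the connection was dropped for being idle.
     */
    public static boolean dropsIdleClient(int idleMillis) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
             Socket accepted = server.accept()) {
            accepted.setSoTimeout(idleMillis); // bound how long read() may block
            try {
                accepted.getInputStream().read(); // the client never sends anything
                return false;
            } catch (SocketTimeoutException idle) {
                return true; // leaving the outer try closes the idle socket
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("idle client dropped: " + dropsIdleClient(200));
    }
}
```

Without the timeout, the read would block for as long as the client keeps the connection open, which matches the threads seen in the stack dump.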

--
Petr Baudis

Andy Seaborne

2014-12-24 17:31:48 UTC

Post by Petr Baudis
Hi!

Hmm, I don't see connections being reused even when I do .close(),
though. I'll take a look later to see if I can easily make it reuse
connections with a custom HTTP client class.
Still, I think it'd be worthwhile to time out open connections on the
Fuseki server side after a while. Otherwise, it is e.g. trivial to DDoS
a server open to the internet.

Do you have a test case?

I wrote a quick test and traced connections getting put back in the pool
on the client side in tests of 20K requests.
ManagedClientConnectionImpl does get called to recycle the connection.

But it was a same-machine test (due to where I am ATM). From memory,
freeing client and server side resources isn't completely synchronous
with the local OS and it has been possible to run faster than the OS
frees up connections. It might be better to reduce the HttpClient
configuration.

On the server side this is all inside Jetty. There is a tension between
freeing resources and caching. Maybe the code is asking for cached
connections too quickly.

Andy