Forward/backward rules (and reasoner memory leaks)

Discussion:

Martynas Jusevičius

2016-06-20 21:18:03 UTC

Hey,

after using GenericRuleReasoner and InfModel more extensively, we
started experiencing memory leaks that eventually kill our webapp
because it runs out of heap space. Jena version is 2.11.0.

After some profiling, it seems that RETEEngine.clauseIndex and/or
RETEEngine.infGraph are retaining a lot of references. It might be
related to this report, but I'm not sure:
https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%***@gmail.com%3E

The suggestion was to use backward rules instead of forward rules.
I have read the following:
https://jena.apache.org/documentation/inference/#rules

But still I fail to understand in which situations backward rules
can/should be used instead of forward rules? I guess simply replacing
-> with <- will not be enough? The actual rules in question look like
this:

[gp: (?class rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
     (?class ?p ?o),
     (?p rdf:type owl:AnnotationProperty),
     (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
     (?subClass rdfs:subClassOf ?class),
     (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
     noValue(?subClass ?p)
  -> (?subClass ?p ?o) ]
[gcdm: (?template rdf:type <http://graphity.org/gp#Template>),
     (?template <http://graphity.org/gc#defaultMode> ?o),
     (?subClass rdfs:subClassOf ?template),
     (?subClass rdf:type <http://graphity.org/gp#Template>),
     noValue(?subClass <http://graphity.org/gc#defaultMode>)
  -> (?subClass <http://graphity.org/gc#defaultMode> ?o) ]
[gcsm: (?template rdf:type <http://graphity.org/gp#Template>),
     (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
     (?subClass rdfs:subClassOf ?template),
     (?subClass rdf:type <http://graphity.org/gp#Template>)
  -> (?subClass <http://graphity.org/gc#supportedMode> ?supportedMode) ]
[rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]

Can these be rewritten as backward rules instead? Does it involve code
changes, such as calling reset() etc?

I would appreciate any help.

Martynas
atomgraph.com

Martynas Jusevičius

2016-06-20 23:05:13 UTC

What is the status of JENA-650 by the way?
https://issues.apache.org/jira/browse/JENA-650

Dave Reynolds

2016-06-21 08:28:50 UTC

Hi Martynas,

If it is related to that, then it is not a leak; it is "just" memory use.

A leak implies that when you turn over data then unused internal state
objects are not reclaimed. Are you continuously adding and deleting
data? If so then the delete should release the whole of the RETEEngine
state and start over. If that isn't happening then that's a bug, but you
could work around it with an explicit reset() or even delete and recreate
your InfGraph at that stage. A delete loses all the state anyway.

Post by Martynas Jusevičius
The suggestion was to use backward rules instead of forward rules.
https://jena.apache.org/documentation/inference/#rules
But still I fail to understand in which situations backward rules
can/should be used instead of forward rules?

Forward rules are generally faster because they keep all that partially
matched state. So if you have stable data or just add triples
monotonically, and have a lot of queries, then generally use forward
rules for performance.

Backward rules (without tabling) keep no state so there's less memory
overhead and no cost for delete but they are slow and have to redo the
work for every query.

Strictly the performance trade-off is a bit more subtle than that.
Forward rules will try to work out all the entailments whereas backward
rules are just responding to specific queries. So if your queries only
touch a small part of the possible space then backward rules could be
more efficient. However, in practice RDF rules seem to involve a lot of
unground terms and lots of rules match nearly every query.
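
The trade-off described above can be illustrated with a toy model of the rdfs9 rule in plain Java. This is only a sketch under stated assumptions: the class Rdfs9Toy and its methods are made up for illustration, it does not use Jena, and it does not mirror how the RETE or backward engines actually work. materialize() plays the role of forward rules (every entailment is computed and held in memory), while hasType() plays the role of an untabled backward rule (one goal answered per call, nothing retained).

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of rule rdfs9 only. It does not use Jena and does not mirror its engines.
final class Rdfs9Toy {
    private final Map<String, Set<String>> superClasses = new HashMap<>();
    private final Map<String, Set<String>> assertedTypes = new HashMap<>();

    void subClassOf(String sub, String sup) {
        superClasses.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
    }

    void type(String instance, String cls) {
        assertedTypes.computeIfAbsent(instance, k -> new HashSet<>()).add(cls);
    }

    // Forward style: compute every (instance, class) entailment up front and retain it.
    Map<String, Set<String>> materialize() {
        Map<String, Set<String>> closure = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : assertedTypes.entrySet()) {
            closure.put(e.getKey(), reachable(e.getValue(), null));
        }
        return closure;
    }

    // Backward style: answer a single goal on demand and keep nothing afterwards.
    boolean hasType(String instance, String cls) {
        Set<String> start = assertedTypes.getOrDefault(instance, Collections.emptySet());
        return reachable(start, cls).contains(cls);
    }

    // Walks superclass edges from 'start'; stops early once 'target' (if non-null) is found.
    private Set<String> reachable(Set<String> start, String target) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>(start);
        while (!todo.isEmpty()) {
            String c = todo.pop();
            if (!seen.add(c)) continue;   // already visited; also guards against cycles
            if (c.equals(target)) break;  // goal proven, no need to explore further
            todo.addAll(superClasses.getOrDefault(c, Collections.emptySet()));
        }
        return seen;
    }

    public static void main(String[] args) {
        Rdfs9Toy kb = new Rdfs9Toy();
        kb.subClassOf("Dog", "Mammal");
        kb.subClassOf("Mammal", "Animal");
        kb.type("rex", "Dog");
        System.out.println(kb.materialize());            // all entailed types, held in memory
        System.out.println(kb.hasType("rex", "Animal")); // one goal, recomputed on each call
    }
}
```

In this toy, the forward variant pays memory for answers nobody may ask for, and the backward variant repeats the graph walk on every query, which is the same tension as in the real engines.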

Tabling allows you to selectively cache certain predicates which can
enable you to get more reasonable performance while keeping memory use
under control. You can also do some tuning of how the rules execute by
testing if variables are bound or not and using different clause
orderings for different query patterns.
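
As a rough illustration of what tabling buys, here is a plain Java sketch that treats it as memoisation of goal answers. It is a toy, not Jena's implementation: the class TabledSuperClasses, the derivations counter and the choice to clear the table on every data change are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy illustration of tabling as memoisation of goal answers; not Jena's implementation.
final class TabledSuperClasses {
    private final Map<String, Set<String>> direct = new HashMap<>();
    private final Map<String, Set<String>> table = new HashMap<>(); // cached goal answers
    private final boolean tabled;
    private int derivations = 0; // number of goals solved from scratch

    TabledSuperClasses(boolean tabled) {
        this.tabled = tabled;
    }

    void subClassOf(String sub, String sup) {
        direct.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
        table.clear(); // cached answers may be stale once the data changes
    }

    // Goal: all (transitive) superclasses of 'cls'.
    Set<String> superClassesOf(String cls) {
        if (tabled && table.containsKey(cls)) {
            return table.get(cls); // answered from the table, no re-derivation
        }
        derivations++;
        Set<String> result = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>(direct.getOrDefault(cls, Collections.emptySet()));
        while (!todo.isEmpty()) {
            String c = todo.pop();
            if (result.add(c)) {
                todo.addAll(direct.getOrDefault(c, Collections.emptySet()));
            }
        }
        Set<String> answer = Collections.unmodifiableSet(result);
        if (tabled) {
            table.put(cls, answer); // memory is spent only on goals that were actually asked
        }
        return answer;
    }

    int derivations() {
        return derivations;
    }

    public static void main(String[] args) {
        TabledSuperClasses kb = new TabledSuperClasses(true);
        kb.subClassOf("Dog", "Mammal");
        kb.subClassOf("Mammal", "Animal");
        kb.superClassesOf("Dog");
        kb.superClassesOf("Dog"); // second call is served from the table
        System.out.println("derivations: " + kb.derivations());
    }
}
```

Constructing the object with tabled set to false gives the untabled behaviour: every call re-derives the answer and no memory is retained between queries.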

Post by Martynas Jusevičius
I guess simply replacing
-> with <- will not be enough?

Unless you use non-monotonic predicates (which, sadly, you do), that
would be enough to get something working. In fact you don't even need to
do that. If you create a pure backward reasoner instance (as opposed to
the hybrid reasoner), it'll read forward-syntax rules but treat them as
backward.

Post by Martynas Jusevičius
The actual rules in question look like
[gp: (?class rdf:type
<http://www.w3.org/2000/01/rdf-schema#Class>), (?class ?p ?o), (?p
rdf:type owl:AnnotationProperty), (?p rdfs:isDefinedBy
<http://graphity.org/gp#>), (?subClass rdfs:subClassOf ?class),
(?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
noValue(?subClass ?p) -> (?subClass ?p ?o) ]

That's a horrible rule from the engine's point of view. The head is
completely ungrounded, so when running backwards it will need to run
for *every* triple pattern. [It also makes no sense to me as a use of
owl:AnnotationProperty but whatever.] You could try it backwards but put
the clauses in a more efficient order:

(?subClass ?p ?o) <-
(?p rdf:type owl:AnnotationProperty),
(?p rdfs:isDefinedBy <http://graphity.org/gp#>),
(?subClass rdfs:subClassOf ?class), (?class ?p ?o) .

The rdf:type rdfs:Class constraints are pointless since those are
implied by rdfs:subClassOf anyway. The noValue check is probably best
avoided for both cases.

Alternatively, depending on the nature of your space leak you could use
hybrid rules:

(?p rdf:type owl:AnnotationProperty),
(?p rdfs:isDefinedBy <http://graphity.org/gp#>)
->
[ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
(?class ?p ?o) ]

That way the forward engine is only looking at your annotations and the
backward engine then has rules that have grounded predicates. You could
also table those predicates:

(?p rdf:type owl:AnnotationProperty),
(?p rdfs:isDefinedBy <http://graphity.org/gp#>)
->
table(?p),
[ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
(?class ?p ?o) ]

Post by Martynas Jusevičius
[gcdm: (?template rdf:type <http://graphity.org/gp#Template>),
(?template <http://graphity.org/gc#defaultMode> ?o), (?subClass
rdfs:subClassOf ?template), (?subClass rdf:type
<http://graphity.org/gp#Template>), noValue(?subClass
<http://graphity.org/gc#defaultMode>) -> (?subClass
<http://graphity.org/gc#defaultMode> ?o) ]
[gcsm: (?template rdf:type <http://graphity.org/gp#Template>),
(?template <http://graphity.org/gc#supportedMode> ?supportedMode),
(?subClass rdfs:subClassOf ?template), (?subClass rdf:type
<http://graphity.org/gp#Template>) -> (?subClass
<http://graphity.org/gc#supportedMode> ?supportedMode) ]

These two are more reasonable and could be used backwards or hybrid.

Post by Martynas Jusevičius
[rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]

That would work backwards. Depending on the scale of your data you might
want to table rdf:type for performance/space tradeoff.

Post by Martynas Jusevičius
Can these be rewritten as backward rules instead?

Sure, the challenge is performance tuning as noted above.

Post by Martynas Jusevičius
Does it involve code changes, such as calling reset() etc?

Shouldn't do.

Dave

Andy Seaborne

2016-06-21 08:59:33 UTC

We have outstanding:

https://github.com/apache/jena/pull/47

which changes the cache to LRU from fixed.
That does not fix any memory leaks but might mitigate them.

There are two FIXMEs in the PR which could do with looking at.

Andy

Martynas Jusevičius

2016-06-21 20:20:50 UTC

What about https://issues.apache.org/jira/browse/JENA-650?

Andy Seaborne

2016-06-22 12:30:54 UTC

Post by Martynas Jusevičius
What about https://issues.apache.org/jira/browse/JENA-650?

It was a GSoC project and provides some useful prototyping - that was
the project goal and it was successful.

It isn't in a state to integrate into the release - how about trying it
out?

Andy

Stian Soiland-Reyes

2016-06-24 12:43:32 UTC

I rebased and solved the FIXMEs. The memory leaks are still there in a
way, but the Guava cache would flush them out once it reaches the
configured maximum (I set the default to 512k goals, but the memory
usage per goal could vary a lot depending on the rules).
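
For readers unfamiliar with the idea of a size-bounded cache that evicts the least recently used entry, here is a minimal plain Java sketch. It is not the code from the pull request: it uses java.util.LinkedHashMap as a stand-in for the Guava cache mentioned above, and the class name LruCache and the tiny limit of 2 entries are chosen purely for illustration. Whether evicting entries is safe for the reasoner's goal table is a separate question that this sketch does not address.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Size-bounded map that drops the least recently used entry when it grows too large.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true: iteration runs from least to most recently used
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true evicts the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("goalA", 1);
        cache.put("goalB", 2);
        cache.get("goalA");    // touch goalA so that goalB becomes the eldest entry
        cache.put("goalC", 3); // exceeds the limit, so the eldest entry is evicted
        System.out.println(cache.keySet());
    }
}
```

The bound caps the number of entries, not bytes, so (as noted above) actual memory use still depends on how large each cached item is.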

--
Stian Soiland-Reyes
Apache Taverna (incubating), Apache Commons
http://orcid.org/0000-0001-9842-9718

Dave Reynolds

2016-06-24 12:57:28 UTC

That's not a cache but a table. I don't think it's guaranteed to be safe
to delete from it, but I may be misremembering - it was a long time ago!

Dave

Martynas Jusevičius

2016-06-23 20:38:49 UTC

Hey again,

I have profiled the CPU time, and it seems that a lot of it (93.5%
after some 22500 HTTP requests) is spent in the following methods:

com.hp.hpl.jena.rdf.model.ModelFactory.createInfModel
(com.hp.hpl.jena.reasoner.Reasoner, com.hp.hpl.jena.rdf.model.Model,
com.hp.hpl.jena.rdf.model.Model)
com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner.bindSchema
(com.hp.hpl.jena.graph.Graph)
com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.prepare ()
com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.fastInit
(com.hp.hpl.jena.reasoner.Finder)

It is probably not so smart to create an InfModel with every
request/response. But in my case it is built from the HTTP response
only: the Model from the response body, and the schema OntModel from
the header metadata, so I'm not sure how it could be cached. Here is
the code:
https://github.com/AtomGraph/Processor/blob/master/src/main/java/org/graphity/processor/filter/response/HypermediaFilter.java#L107

I would appreciate suggestions on how to improve performance.
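
One possible direction, sketched here in plain Java and not something proposed in this thread: if many responses share the same schema, the schema-dependent setup could be done once per schema and reused. Everything in the sketch is an assumption for illustration: the class PerSchemaCache, the idea that a stable key (for example a schema URI taken from the headers) identifies the schema, and the expensiveBind function, which is a hypothetical stand-in for whatever per-schema preparation dominates the profile. Whether the resulting object is safe to share between requests would need to be checked against the Jena documentation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Memoises an expensive per-schema computation, keyed by a stable schema identifier.
final class PerSchemaCache<S, R> {
    private final Map<S, R> cache = new ConcurrentHashMap<>();
    private final Function<S, R> expensiveBind; // hypothetical stand-in for schema preparation

    PerSchemaCache(Function<S, R> expensiveBind) {
        this.expensiveBind = expensiveBind;
    }

    // Runs expensiveBind at most once per key; later calls return the stored result.
    R get(S schemaKey) {
        return cache.computeIfAbsent(schemaKey, expensiveBind);
    }

    public static void main(String[] args) {
        AtomicInteger binds = new AtomicInteger();
        PerSchemaCache<String, String> cache = new PerSchemaCache<>(uri -> {
            binds.incrementAndGet(); // count how often the expensive step really runs
            return "prepared:" + uri;
        });
        cache.get("http://example.org/schema1");
        cache.get("http://example.org/schema1"); // reused, no second bind
        System.out.println("binds: " + binds.get());
    }
}
```

An unbounded map like this would reintroduce memory growth if the set of schemas is large or changes often, so in practice it would need a size limit or expiry.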

Martynas

On Tue, Jun 21, 2016 at 10:28 AM, Dave Reynolds

Post by Dave Reynolds
Hi Martynas,

If it is related to that then it is not a leak it is "just" memory use.
A leak implies that when you turn over data then unused internal state
objects are not reclaimed. Are you continuously adding and deleting data? If
so then the delete should release the whole of the RETEEngine state and
start over. If that isn't happening then that's a bug but you could work
around with an explicit reset() or even delete and recreate your InfGraph at
that stage. A delete loses all the state anyway.

Forward rules are generally faster because they keep all that partially
matched state. So if you have stable data or just add triples monotonically,
and have a lot of queries, then generally use forward rules for performance.
Backward rules (without tabling) keep no state so there's less memory
overhead and no cost for delete but they are slow and have to redo the work
for every query.
Strictly the performance trade-off is a bit more subtle than that. Forward
rules will try to work out all the entailments whereas backward rules are
just responding to specific queries. So if your queries only touch a small
part of the possible space then backward rules could be more efficient.
However in practice RDF rules seem involve a lot of unground terms and lots
of rules match nearly every query.
Tabling allows you to selectively cache certain predicates which can enable
you to get more reasonable performance while keeping memory use under
control. You can also do some tuning of how the rules execute by testing if
variables are bound or not and using different clause orderings for
different query patterns.

Post by Martynas JuseviÄius
I guess simply replacing
-> with <- will not be enough?

Unless you use non-monotonic predicates (which, sadly, you do) then that
would be enough to get something working. In fact you don't even need to do
that. If you create a pure backward reasoner instances (as opposed to the
hybrid) reasoner it'll read forward syntax rules but treat them as backward.

That's a horrible rule from the engine's point of view. The head is
completely ungrounded so when running backwards then it will need to run for
*every* triple pattern. [It also makes no sense to me as a use of
owl:AnnotationProperty but whatever.] You could try it backwards but put the
(?subClass ?p ?o) <-
(?p rdf:type owl:AnnotationProperty),
(?p rdfs:isDefinedBy <http://graphity.org/gp#>),
(?subClass rdfs:subClassOf ?class), (?class ?p ?o) .
The rdf:type rdfs:Class constraints are pointless since those are implied by
rdfs:subClassOf anyway. The noValue check is probably best avoided for both
cases.
Alternatively, depending on the nature of your space leak you could use
(?p rdf:type owl:AnnotationProperty),
(?p rdfs:isDefinedBy <http://graphity.org/gp#>)
->
[ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
(?class ?p ?o) ]
That way the forward engine is only looking at your annotations and the
backward engine then has rules that have grounded predicates. You could also
(?p rdf:type owl:AnnotationProperty),
(?p rdfs:isDefinedBy <http://graphity.org/gp#>)
->
table(?p),
[ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
(?class ?p ?o) ]

These two are more reasonable and could be used backwards or hybrid.

Post by Martynas Jusevičius
[rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]

That would work backwards. Depending on the scale of your data you might
want to table rdf:type for performance/space tradeoff.

Post by Martynas Jusevičius
Can these be rewritten as backward rules instead?

Sure, the challenge is performance tuning as noted above.

Post by Martynas Jusevičius
Does it involve code changes, such as calling reset() etc?

Shouldn't do.
Dave

Martynas Jusevičius

2016-06-23 20:52:40 UTC


Maybe I should evaluate whether I need an InfModel there in the first place...

On Thu, Jun 23, 2016 at 10:38 PM, Martynas Jusevičius

Post by Martynas Jusevičius
Hey again,
I have profiled the CPU time, and it seems that a lot of it (93.5%
com.hp.hpl.jena.rdf.model.ModelFactory.createInfModel
(com.hp.hpl.jena.reasoner.Reasoner, com.hp.hpl.jena.rdf.model.Model,
com.hp.hpl.jena.rdf.model.Model)
com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner.bindSchema
(com.hp.hpl.jena.graph.Graph)
com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.prepare ()
com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.fastInit
(com.hp.hpl.jena.reasoner.Finder)
Probably not so smart to create an InfModel with every
request/response. But in my case it is created using HTTP response
body and metadata only: Model from response body, and schema OntModel
from headers metadata, so I'm not sure how it could be cached. Here is
https://github.com/AtomGraph/Processor/blob/master/src/main/java/org/graphity/processor/filter/response/HypermediaFilter.java#L107
I would appreciate suggestions on how to improve performance.
Martynas

Dave Reynolds

2016-06-23 21:04:17 UTC


Hi Martynas,

If it really is a different schema and data each time then you can't
cache.

If you only have a small number of schemas then you could use bindSchema
to generate a set of partially-evaluated reasoners, cache those, and
pick the right one to use for a given set of message headers.
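
The caching idea above could be sketched roughly as follows. This is plain Java with a hypothetical helper class, not part of Jena: the key type, the class name and the binder function are all assumptions. With Jena, the binder would presumably wrap a call to bindSchema on the rule reasoner, keyed by whatever identifies the schema in the message headers (for example an ontology URI), and the cached result would then be passed to ModelFactory.createInfModel together with the per-request data.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Keeps one expensively prepared object per schema key.
 * Hypothetical helper for illustration; K might be a schema URI and
 * R a reasoner already bound to that schema.
 */
final class BoundReasonerCache<K, R> {
    private final Map<K, R> cache = new ConcurrentHashMap<>();
    private final Function<K, R> binder;

    BoundReasonerCache(Function<K, R> binder) {
        this.binder = binder;
    }

    /** Returns the cached value for the key, running the binder only on the first request. */
    R get(K key) {
        return cache.computeIfAbsent(key, binder);
    }

    /** Number of distinct schemas bound so far. */
    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        // The lambda stands in for the costly schema-binding step.
        BoundReasonerCache<String, String> cache =
                new BoundReasonerCache<>(uri -> "reasoner bound to " + uri);
        cache.get("http://example.org/schema-a");
        cache.get("http://example.org/schema-a"); // served from the cache
        System.out.println(cache.size()); // prints 1
    }
}
```

This only pays off if the number of distinct schemas is small and they repeat across requests. The sketch never evicts entries, so an unbounded stream of new schemas would grow the map without limit.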

The other option (apart from stopping using rules) would be to use
backward rules. As already discussed the forward engine does all the
inferences up front whereas the backward rules do them on demand.

Dave
