Mark Wharton
2016-03-17 12:23:21 UTC
Hi Jena Users.
We've been experiencing some peculiar behaviour with Jena/Fuseki and
Lucene - particularly, but not entirely, around special characters.
We are currently running Fuseki 2.3.0, which seems to include Lucene
4.9.1, as far as we can tell.
Using the query:
PREFIX text: <http://jena.apache.org/text#>
SELECT ?ent ?score
{ (?ent ?score) text:query (<TEXT> 'lang:en') }
...and different values of <TEXT>, the following happens:
1) <TEXT> = ''
Server error: Cannot parse '() AND lang:en'
2) <TEXT> = '*' - 26 results
3) <TEXT> = '\\*' - 26 results
4) <TEXT> = '\\?' - 26 results
5) <TEXT> = 'will' - 26 results
("will" is one of the words which is ignored by Lucene; see e.g.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.9.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51)
6) <TEXT> = '(?)' - 3 results
(presumably labels/comments with single-character words in them?)
7) <TEXT> = '(\\?)' - 26 results
8) <TEXT> = '\\(\\?\\)' - 26 results
It looks to us as if the following is happening:
Fuseki turns
"<TEXT>" into "(<TEXT>) AND lang:en",
so when <TEXT> reduces to an empty match (grouped in
parentheses), ALL entries end up being matched.
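To make the suspected rewriting concrete, here is a minimal Python sketch of it. The function name compose_lucene_query is hypothetical, and the behaviour is only inferred from the error message in case 1; it is not taken from the Jena source.

```python
def compose_lucene_query(text: str, lang: str) -> str:
    """Mimic the rewriting we observe: wrap the user text in
    parentheses and AND it with the language field.
    Assumption: inferred from the error in case 1, not from Jena's code."""
    return f"({text}) AND lang:{lang}"


if __name__ == "__main__":
    # The backslash is doubled here only because of Python string escaping.
    for text in ["", "will", "\\?"]:
        print(repr(text), "->", compose_lucene_query(text, "en"))
```

With an empty <TEXT> this yields the string from the server error. For the other inputs the string still parses, which would fit the idea that the first clause only becomes empty later, during analysis.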
Problem:
Unless we know the complete list of ignored words and characters that
Lucene goes on to turn into an empty match, it is impossible to stop
Fuseki returning ALL results for certain queries!
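A rough sketch of the kind of client-side guard we would otherwise have to maintain is below. It assumes the index uses Lucene's default English stop set (as listed in the StopAnalyzer source for 4.9.1) and approximates the analyzer's tokenisation with a regular expression. The name has_searchable_term is hypothetical, and a differently configured analyzer would need a different list.

```python
import re

# Assumption: the default English stop words from Lucene 4.9.1's StopAnalyzer.
ENGLISH_STOP_WORDS = frozenset([
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
])


def has_searchable_term(text: str) -> bool:
    """Return True if at least one token is likely to survive analysis.

    Tokenisation is approximated as runs of letters/digits; this is not
    the exact behaviour of the Lucene tokenizer.
    """
    tokens = re.findall(r"[^\W_]+", text.lower())
    return any(token not in ENGLISH_STOP_WORDS for token in tokens)


if __name__ == "__main__":
    # Backslashes are doubled only because of Python string escaping.
    for text in ["", "*", "\\?", "will", "will power"]:
        print(repr(text), has_searchable_term(text))
```

A guard like this is fragile: it has to track the analyzer configuration exactly, and it also rejects wildcard-only input such as '(?)', which in our tests returned 3 results rather than all 26.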
Thanks in advance for any thoughts and help
Mark
--
Technology Lead, Iotic Labs
+44 7973 674404
***@iotic-labs.com
https://www.iotic-labs.com