Fuseki + Lucene special characters.

Osma Suominen

2016-03-18 07:31:54 UTC

Hi Mark!

Thanks for pointing this out. This seems to be a bug/feature of the
jena-text module. It is not directly related to Fuseki (which is really
the web server module), but Fuseki includes jena-text. Your
interpretation of what is happening seems correct to me.

Do you have a suggestion of how this could be resolved? What kind of
query results would you expect for, say, text:query ('' 'lang:en') or
text:query ('will' 'lang:en') ? Do you happen to know a better way to
construct the Lucene query than just ANDing the language restriction to
the keyword part, as is currently done?
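
For reference, here is a minimal Python sketch of the construction being discussed. It assumes, going by the report quoted below, that the keyword part is wrapped in parentheses and the language restriction is ANDed onto it. The function name build_lucene_query is hypothetical and is not part of jena-text; this only mimics the resulting query string.

```python
def build_lucene_query(text, lang=None):
    """Mimic the reported behaviour: "(<TEXT>) AND lang:<lang>".

    Hypothetical helper for illustration only, not jena-text code.
    """
    if lang is None:
        return text
    return "(" + text + ") AND lang:" + lang


if __name__ == "__main__":
    # An empty keyword part yields "() AND lang:en", the string the
    # server reported it could not parse.
    for text in ["", "will", r"\?"]:
        print(repr(text), "->", build_lucene_query(text, "en"))
```

If the analyzer later drops everything inside the parentheses (a stop word, an escaped special character), only the language restriction is left to match on, which would be consistent with all entries being returned.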

One thing that might help a bit is to use a different analyzer than the
default StandardAnalyzer. StandardAnalyzer has a lot of smarts, including
a built-in stop word list, but in your case this causes problems with
stop words such as "will". If you used, for example, SimpleAnalyzer, then
this would not be an issue. But I guess there would still be problems
with the wildcard-type queries.

-Osma

Post by Mark Wharton
Hi Jena Users.
We've been experiencing some peculiar behaviour with Jena/Fuseki and
Lucene - particularly, but not entirely, around special characters.
We are currently running Fuseki 2.3.0, which seems to include Lucene
4.9.1, as far as we can tell.
PREFIX text: <http://jena.apache.org/text#>
SELECT ?ent ?score
{ (?ent ?score) text:query (<TEXT> 'lang:en') }
...and different values of <TEXT>, the following happens:
1) <TEXT> = ''
Get server error: "Cannot parse '() AND lang:en'"
2) <TEXT> = '*' - 26 results
3) <TEXT> = '\\*' - 26 results
4) <TEXT> = '\\?' - 26 results
5) <TEXT> = 'will' - 26 results
("will" is one of the words which is ignored by Lucene, see e.g.
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.9.1/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51 )
6) <TEXT> = '(?)' - 3 results
(labels/comments with single-character words in them?)
7) <TEXT> = '(\\?)' - 26 results
8) <TEXT> = '\\(\\?\\)' - 26 results
Since Fuseki turns
"<TEXT>" into "(<TEXT>) AND lang:en",
it would appear that empty matches for TEXT (grouped with
parentheses) result in ALL entries being matched.
Unless we know the complete list of ignored words & characters that Lucene
then goes on to turn into an empty match, it is impossible to stop Fuseki
returning ALL results with certain queries!
Thanks in advance for any thoughts and help
Mark
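
As a stopgap for the problem described above, a client could decline to send search text that is likely to analyze to nothing. The following is a rough Python sketch under stated assumptions: the STOP_WORDS set is intended to mirror the default English stop set in Lucene 4.9.1's StopAnalyzer (please verify it against the linked source), and has_searchable_term is a hypothetical helper that only approximates StandardAnalyzer tokenization.

```python
import re

# Assumption: mirrors the default English stop set of Lucene 4.9.1's
# StopAnalyzer; check against the Lucene version actually deployed.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}


def has_searchable_term(text, stop_words=STOP_WORDS):
    """Return True if text has an alphanumeric token that is not a stop word.

    Hypothetical client-side check; a rough approximation of what the
    analyzer keeps, not a reimplementation of it.
    """
    tokens = re.findall(r"[^\W_]+", text.lower())
    return any(token not in stop_words for token in tokens)
```

With this check, '', '*', the escaped '\*' and '\?', and 'will' are all rejected before the query is sent, while 'will smith' is accepted. Note that it also rejects a pure wildcard such as '(?)', which may or may not be what you want.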

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi