completion with Lucene: desirable from SPARQL

Hi Jean-Marc,

Depending on what exactly you want from such a service, this may be
already possible with jena-text.

I'm assuming that you want to perform a prefix search such as "édu*" and
get possible completions for that, such as "éducation".

You can of course already do a prefix search with jena-text. What you
will get back will be the RDF resources which have labels that contain
this prefix. If the text index is configured to store literal values,
you can ask for the actual values as well.

E.g. with this data:

ex:cse rdfs:label "Conseil supérieur de l'éducation"@fr .

and a suitably configured jena-text index, you can perform this query:

(?s ?score ?literal) text:query (rdfs:label "édu*") .

and get back these bindings:

?s=ex:cse ?literal="Conseil supérieur de l'éducation"@fr

However, you will get the full original literal value, not just the
individual word that matched ("éducation"). If you want just the matched
word, you will need special support that jena-text doesn't currently have.
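
Until such support exists, the matched word can be recovered on the client side from the literal that the query returns. The following is a minimal sketch in plain Python, not part of jena-text. The function name `matched_words` is made up for illustration, and the simple regex tokenizer only approximates what a Lucene analyzer would do.

```python
import re

def matched_words(literal, prefix):
    """Return the words of `literal` that start with `prefix` (case-insensitive)."""
    # \w+ is Unicode-aware in Python 3, so accented letters stay inside words.
    words = re.findall(r"\w+", literal)
    p = prefix.casefold()
    return [w for w in words if w.casefold().startswith(p)]

# The literal returned for ex:cse in the example above.
label = "Conseil supérieur de l'éducation"
print(matched_words(label, "édu"))
```

This runs after the query, so it costs nothing in the index, but it only sees the literals that were actually returned.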

-Osma

Post by Jean-Marc Vanel
Hi
I'm implementing an equivalent of the dbPedia lookup service [1] in
semantic_forms, leveraging the Lucene integration in TDB and a dbPedia mirror
in TDB [2]. The existing service has two problems:
- the hosted service is often down
- completion is in English only
A lookup service with TDB and Lucene would overcome these 2 problems.
So I would need completion with Lucene from SPARQL.
https://jena.apache.org/documentation/query/text-query.html#query-with-sparql
Searching for "lucene completion" turns up plenty of pages.
One of them has a code snippet:
http://stackoverflow.com/questions/120180/how-to-do-query-auto-completion-suggestions-in-lucene
but a regular Lucene API may exist.
[1] https://github.com/dbpedia/lookup
[2]
https://github.com/jmvanel/semantic_forms/blob/master/doc/en/administration.md#populating-with-dbpedia-mirroring-dbpedia

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi

Jean-Marc Vanel

2016-11-01 09:01:28 UTC

It's too bad that the * wildcard feature, and other details of the SPARQL to
Lucene query translation, are not documented on the Jena text search page.

Anyway, it works for my use case, I now have on my laptop a (kind of)
replacement of dbPedia lookup service.

To experiment with the original dbPedia lookup service, you can go to
semantic_forms sandbox:
http://163.172.179.125:9111/create?uri=&uri=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2FPerson
and type a few letters in the dct:subject field.

I don't need the full original literal value, because the URI results of
the query are labelled in the application: a foaf:Person is labelled by
given and family names, etc.

BUT, there is a "but": the results of the dbPedia lookup service are
appropriately ordered by "notoriety".
Instead, this is what I currently get with
http://localhost:9000/lookup?q=*Pari* on my TDB that mirrors dbPedia:

<ArrayOfResult>
<Result>
<Label>Université Pierre-et-Marie-Curie</Label>
<URI>http://dbpedia.org/resource/Pierre_and_Marie_Curie_University</URI>
</Result><Result>
<Label>Guillaume Le Gentil</Label>
<URI>http://dbpedia.org/resource/Guillaume_Le_Gentil</URI>
</Result><Result>
<Label>1 E1 m</Label>
<URI>http://dbpedia.org/resource/1_decametre</URI>
</Result><Result>
<Label>1 E4 m</Label>
<URI>http://dbpedia.org/resource/1_myriametre</URI>
</Result><Result>
<Label>Nadia Boulanger</Label>
<URI>http://dbpedia.org/resource/Nadia_Boulanger</URI>
</Result><Result>
<Label>Luis Mariano</Label>
<URI>http://dbpedia.org/resource/Luis_Mariano</URI>
</Result><Result>
<Label>Paul Chemetov</Label>
<URI>http://dbpedia.org/resource/Paul_Chemetov</URI>
</Result><Result>
<Label>Marc Boegner</Label>
<URI>http://dbpedia.org/resource/Marc_Boegner</URI>
</Result><Result>
<Label>Cassandre (graphiste)</Label>
<URI>http://dbpedia.org/resource/Cassandre_(artist)</URI>
</Result><Result>
<Label>La Norville</Label>
<URI>http://dbpedia.org/resource/La_Norville</URI>
</Result>
</ArrayOfResult>

My understanding is that I need to set a weight on URI's in Lucene to
reflect their "notoriety".
I see 2 ways:

1. easy to implement: just count the triples from and to the URI
2. also take into account the URIs consulted by users in my
application (but currently I don't record that information); there is
also the issue of combining weights 1) and 2)

Google search does both weightings.
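
The two weightings can be sketched in a few lines. This is plain Python over an in-memory list of triples, not code against Jena or Lucene. The names `degree_weights` and `combined_weight`, the example.org triples, and the mixing factor `alpha` are all assumptions made for illustration.

```python
from collections import Counter

def degree_weights(triples):
    """Weight 1: count the triples from and to each URI."""
    counts = Counter()
    for s, p, o in triples:
        counts[s] += 1  # triple going out of s
        # Toy convention: objects starting with "http" are URIs, the rest literals.
        if isinstance(o, str) and o.startswith("http"):
            counts[o] += 1  # triple pointing to o
    return counts

def combined_weight(degree, visits, alpha=0.5):
    """One possible way to mix weight 1 (degree) and weight 2 (user visits)."""
    return alpha * degree + (1 - alpha) * visits

triples = [
    ("http://example.org/city", "label", "City"),
    ("http://example.org/p1", "bornIn", "http://example.org/city"),
    ("http://example.org/p2", "bornIn", "http://example.org/city"),
]
weights = degree_weights(triples)
print(weights["http://example.org/city"])
```

A linear mix is only one option. How the two weights should really be balanced is the open issue mentioned in point 2.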

So, in the short term I have to figure out how to add weights to the
Lucene index used by Jena.

Then I have to read what dbPedia lookup does, and other background material.


Osma Suominen

2016-11-01 12:59:02 UTC

Hi Jean-Marc,

The wildcard queries etc. are basic Lucene features, part of the Lucene
query syntax, so that's probably why they are not documented on the
jena-text page. The query string is simply passed to the Lucene query
parser by jena-text and should support any features of Lucene, see:
http://lucene.apache.org/core/6_2_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package.description

Glad you were able to get your lookup service working!

Regarding the saving of weights: I think you could simply save them as
triples (perhaps in a separate graph), outside the Lucene index. Then
combine the results of the text:query with the weights from triples
using SPARQL.

The jena-text query also returns score values. I'm not sure how useful
they are in your use case, but they could potentially be used as a
factor in the overall "notoriety" calculation. Though if you are
searching just for single words or prefixes, chances are that the score
values will be the same for all results.
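
One way to use the score as a factor is to multiply it by a dampened weight, so that a very well-connected resource does not completely drown out text relevance. This is a sketch in plain Python. The formula, the name `rank_hits`, and the sample numbers are assumptions, not something jena-text provides.

```python
import math

def rank_hits(hits, weights, limit=10):
    """Order (uri, text_score) hits by text_score * log(1 + weight)."""
    def notoriety(hit):
        uri, score = hit
        # log1p dampens large triple counts; unknown URIs get weight 0.
        return score * math.log1p(weights.get(uri, 0))
    ranked = sorted(hits, key=notoriety, reverse=True)
    return [uri for uri, _ in ranked[:limit]]

# Equal text scores, as expected for a single prefix query.
hits = [("ex:a", 1.0), ("ex:b", 1.0), ("ex:c", 1.0)]
weights = {"ex:a": 5, "ex:b": 3000, "ex:c": 70}
print(rank_hits(hits, weights, limit=2))
```

When all scores are equal, as in this example, the ordering is decided by the weights alone.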

Thanks for all the work on the Lucene 5 and 6 upgrade (JENA-1250)! I
hope we can finish that work and get it merged soon after the 3.1.1
release. In any case the newer Lucene version should perform better and
be easier to maintain moving forward.

-Osma


Jean-Marc Vanel

2016-11-03 11:51:12 UTC

Hi Osma

First I will implement the weight by counting the triples from and to each
URI being indexed in Lucene by Jena-text.
This will give users a first ordering in results, hopefully satisfying.
This is quite similar to the Google page rank, except that instead of
counting the <a href="XXX"> , it will count the triples.

I sketched some code here with most of the plumbing:
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala

Comments welcome. It's in Scala, but it should be understandable.
Note that I have one more library dependency :
libraryDependencies += "org.apache.lucene" % "lucene-suggest" % "4.9.1"

This is code for batch primary indexing or re-indexing.
If this works well, I'll have to implement also the callback for updates
like class TextDocProducerTriples in Jena-text.


Osma Suominen

2016-11-03 12:34:14 UTC

Hi Jean-Marc,

I'm not sure I understand why you need to put the weights inside the
Lucene index. Is it done for performance reasons?

What if the data changes? I mean, not the indexed subject itself, but
for example additional triples get added to the dataset using the same
subject. Surely the Lucene index will get out of date?

-Osma


Jean-Marc Vanel

2016-11-03 13:12:02 UTC

Post by Osma Suominen
Hi Jean-Marc,
I'm not sure I understand why you need to put the weights inside the
Lucene index. Is it done for performance reasons?

AFAIK using the weights to order results is intimately linked to the text
index querying.
If I want the top 10 results, the search must have the weights beforehand
otherwise I must get all the results to filter later.
This is the reason for using AnalyzingInfixSuggester.
Lucene 4_9_1
https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
Lucene 6_2_1
https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html

I guess this is what you call "performance reasons" .
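
The difference between the two approaches can be illustrated without Lucene. In this plain Python sketch, the function names and the toy data are made up, and nothing here uses or imitates the AnalyzingInfixSuggester API. The first function has to look at every match before it can pick the top k. The second walks entries already stored in descending weight order and can stop as soon as it has k matches.

```python
import heapq

def top_k_external(entries, weights, prefix, k):
    """Weights live outside the index: collect all matches, then rank them."""
    matches = [label for label in entries if label.lower().startswith(prefix)]
    return heapq.nlargest(k, matches, key=lambda label: weights[label])

def top_k_presorted(sorted_entries, prefix, k):
    """Entries are (weight, label) pairs stored by descending weight: stop early."""
    result = []
    for _weight, label in sorted_entries:
        if label.lower().startswith(prefix):
            result.append(label)
            if len(result) == k:
                break  # no need to visit the remaining entries
    return result

# Toy labels and weights, invented for the example.
weights = {"alpha": 5, "alpine": 50, "alps": 20, "beta": 99}
sorted_entries = sorted(((w, l) for l, w in weights.items()), reverse=True)

print(top_k_external(list(weights), weights, "al", 2))
print(top_k_presorted(sorted_entries, "al", 2))
```

Both functions return the same labels. The point is only how much of the data each one has to visit.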

Post by Osma Suominen
What if the data changes? I mean, not the indexed subject itself, but for
example additional triples get added to the dataset using the same subject.
Surely the Lucene index will get out of date?

As I wrote in the original post, "I'll have to implement also the callback
for updates
like class TextDocProducerTriples in Jena-text." .
http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html


Osma Suominen

2016-11-03 13:30:32 UTC

Hi Jean-Marc!

Post by Jean-Marc Vanel
AFAIK using the weights to order results is intimately linked to the text
index querying.
If I want the top 10 results, the search must have the weights beforehand
otherwise I must get all the results to filter later.
This is the reason for using AnalyzingInfixSuggester.
Lucene 4_9_1
https://lucene.apache.org/core/4_9_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
Lucene 6_2_1
https://lucene.apache.org/core/6_2_1/suggest/org/apache/lucene/search/suggest/analyzing/AnalyzingInfixSuggester.html
I guess this is what you call "performance reasons" .

I don't see why you couldn't, in principle, do something like this:

PREFIX text: <http://jena.apache.org/text#>

SELECT ?s (COUNT(*) as ?count)
WHERE {
?s text:query "édu*" .
?s ?p ?o .
}
GROUP BY ?s
ORDER BY DESC(?count)
LIMIT 10

(note: untested query)

I'm sure it will get slow if the number of hits from the text index is
more than a few dozen. But for a small number of results at a time, it
might work.

Post by Jean-Marc Vanel
As I wrote in the original post, "I'll have to implement also the callback
for updates
like class TextDocProducerTriples in Jena-text." .
http://jena.apache.org/documentation/javadoc/text/org/apache/jena/query/text/TextDocProducerTriples.html

Isn't that called only when the indexed triple changes (e.g. the one
with rdfs:label or skos:prefLabel or whatever property you are
indexing), but not when other data related to the same subject changes?
So if new triples are added for the same subject, but its label is
unchanged, then the text index won't see the update and thus the count
of references/triples won't be updated either.

I may be wrong here, I'm not sure how the update tracking works.
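
The concern can be made concrete with a toy model. In this plain Python sketch, `ToyIndex` is a hypothetical stand-in that mimics the behaviour described above: it recomputes the stored count only when a triple with the indexed property changes. Whether jena-text really behaves this way is exactly the open question, so this illustrates the scenario and does not describe the actual update tracking.

```python
class ToyIndex:
    """Stores a triple count per subject, refreshed only on label changes."""

    def __init__(self, indexed_property="rdfs:label"):
        self.indexed_property = indexed_property
        self.triples = []
        self.stored_count = {}  # weight saved inside the "index"

    def add(self, s, p, o):
        self.triples.append((s, p, o))
        if p == self.indexed_property:
            # Only this branch touches the index entry.
            self.stored_count[s] = self.actual_count(s)

    def actual_count(self, s):
        return sum(1 for t in self.triples if t[0] == s)

idx = ToyIndex()
idx.add("ex:cse", "rdfs:label", "Conseil supérieur de l'éducation")
idx.add("ex:cse", "ex:related", "ex:other")  # label unchanged, made-up triple
print(idx.stored_count["ex:cse"], idx.actual_count("ex:cse"))
```

The stored count stays at its index-time value while the real count has grown, which is the stale-weight situation described above.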

-Osma

Jean-Marc Vanel

2016-11-03 16:35:27 UTC

Osma,

That makes sense,
and the first tests are not bad.

I'm surprised, though, that "par*" does not return dbpedia:Paris in the
first 10, whereas "pari*" returns it in first position:

"count" "s"
"3090"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Paris
"2676"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/London
"72"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Émile_Durkheim
"68"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Henri_Bergson
"66"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/20th_arrondissement_of_Paris
"64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Cornelius_Castoriadis
"64"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jacques_Derrida
"63"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Michel_Foucault
"62"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Louis,_Grand_Condé
"60"^^http://www.w3.org/2001/XMLSchema#integer http://dbpedia.org/resource/Jean-Jacques_Rousseau

I'll add that SPARQL in my sandbox as a replacement of dbpedia lookup
service,
and tell you how it goes.
But I foresee that using the Lucene implementation after adding the weights
will be more efficient. But that demands more work...

--
Jean-Marc Vanel
Profile: http://163.172.179.125:9111/display?displayuri=http%3A%2F%2Fjmvanel.free.fr%2Fjmv.rdf%23me
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui

Lorenz B.

2016-11-04 09:05:17 UTC

Hello Jean-Marc,

I think adding something like a pagerank score would improve the
results. Lucene itself just uses more or less the standard IR measure
TF/IDF.
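
For readers unfamiliar with the measure, here is a minimal textbook TF-IDF sketch in Python. It is not Lucene's actual scoring formula, which adds further normalisation, and the tokenised labels are made up for illustration.

```python
import math


def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: term frequency in doc times log inverse document frequency."""
    tf = doc.count(term) / len(doc)
    docs_with_term = sum(1 for d in corpus if term in d)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf


# Made-up labels, tokenised into lower-case words.
corpus = [
    ["paris"],
    ["20th", "arrondissement", "of", "paris"],
    ["london"],
]
print(tf_idf("paris", corpus[0], corpus))
print(tf_idf("paris", corpus[1], corpus))
```

The score depends only on the words in the labels: a short label that matches scores higher than a long one, and nothing in it reflects how well linked a resource is. That is the gap a link-based score would fill.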

Cheers,
Lorenz

--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center

Jean-Marc Vanel

2016-11-04 10:11:38 UTC

Guten Tag Lorenz !

I don't know what "IR" is.

And reusing Lucene is the plan.
The current code is here (as I mentioned earlier in this thread):
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala

I don't know how to combine TF-IDF with ranking based on links.
I'm not even sure that, in an RDF world, term frequency brings much
useful information.
If you have some survey articles to recommend on search in the RDF world, or
in general, that would help.

I have put the new search ranking on the sandbox (counting the links, à la
Google rank), so my FOAF profile now comes first, thanks to its many
cco:expertise links:
http://163.172.179.125:9111/wordsearch?q=Jean-Marc
In good company with Jean Sablon, Jean Moulin, and Pope John Paul II.

The TDB was populated from DBpedia with these scripts:
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms_play/scripts/download-dbpedia.sh
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms_play/scripts/populate_with_dbpedia.sh

Lorenz B.

2016-11-04 10:43:28 UTC

Hi Jean-Marc,

Post by Jean-Marc Vanel
Guten Tag Lorenz !

Good job! German is a very difficult language.

Post by Jean-Marc Vanel
I don't know what "IR" is.

IR = Information Retrieval, which is what Lucene is basically made for.

Post by Jean-Marc Vanel
And reusing Lucene is the plan.
https://github.com/jmvanel/semantic_forms/blob/master/scala/forms/src/main/scala/deductions/runtime/jena/lucene/TextIndexerWeight.scala
I don't know how to combine TF-IDF with ranking based on links.
I'm not even sure that, in an RDF world, term frequency is bringing much
useful information.
If you have some synthesis articles to recommend on search in RDF world, or
in general, that would help.

There has been some discussion of how to combine ranking metrics like
PageRank with the standard Lucene score, e.g. [1], [2].
I think this can be done via boosting during indexing or by some
user-defined sort.
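
As a rough illustration of the "user-defined sort" option, here is a Python sketch that re-ranks text hits by multiplying the text score with a damped link count. The formula, the `link_weight` parameter and the sample numbers are assumptions chosen for illustration, not something taken from Lucene or from [1] and [2].

```python
import math


def combined_score(text_score, link_count, link_weight=1.0):
    """Multiply the text score by a damped function of the link count.

    Assumption: log damping is used only so that very well linked
    resources do not swamp the text match entirely.
    """
    return text_score * (1.0 + link_weight * math.log1p(link_count))


def rerank(hits, limit=10):
    """hits: iterable of (resource, text_score, link_count) tuples."""
    scored = [(res, combined_score(ts, links)) for res, ts, links in hits]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]


# Made-up scores and link counts, for illustration only.
hits = [
    ("ex:Parma", 1.0, 50),
    ("ex:Paris", 0.8, 3000),
    ("ex:Paraguay_River", 1.2, 5),
]
for resource, score in rerank(hits):
    print(resource, round(score, 2))
```

Sorting after the search means all hits must be fetched first. Boosting at indexing time avoids that, at the price of having to re-index when the link counts change.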

There has been a lot of research regarding entity ranking; among
others, you can have a look at [3].

[1] http://blog.trifork.com/2011/11/16/apache-lucene-flexiblescoring-with-indexdocvalues/
[2] http://stackoverflow.com/questions/22473498/solr-boost-score-based-on-wikipedia-pagerank-and-solr-score
[3] http://ceur-ws.org/Vol-1586/know2.pdf

Jean-Marc Vanel

2016-11-04 07:27:55 UTC

Looking for Pari* with your SPARQL on dbPedia takes 4 seconds on my
supposedly efficient laptop CPU:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
Stepping:              3
CPU MHz:               2644.789
CPU max MHz:           3500,0000
CPU min MHz:           800,0000
BogoMIPS:              5181.67

I should try with an SSD.
I don't know whether TDB can exploit a multi-core CPU.
Also, I don't know whether I can pre-compile the query and supply the
search term as a parameter at run time.

Anyway, I'll implement the ordering by triple count in semantic_forms.
Maybe later it can be helpful within jena-text.

Osma Suominen

2016-11-04 11:47:49 UTC