Empty index with Jena Text and Fuseki

Neubert Joachim

2013-06-21 08:33:21 UTC

If I understood correctly, Fuseki is supposed to build the text index when it starts up. However, this did not work for me.

When I start Fuseki (jena-fuseki-0.2.8-20130618.075236-28-server.jar) with an empty index directory, the directory looks like this for a very short time:

-rw-r--r--. 1 root root 45 Jun 20 13:46 segments_1
-rw-r--r--. 1 root root 0 Jun 20 13:46 write.lock

and then it stays like this:

-rw-r--r--. 1 root root 45 Jun 20 13:46 segments_1
-rw-r--r--. 1 root root 20 Jun 20 13:46 segments.gen

Text queries yield an empty result, while standard SPARQL queries work.

I can't figure out what could be wrong with my config:

## ---------------------------------------------------------------
## Read-only TDB dataset (only read services enabled).

<#service_stw_combined> rdf:type fuseki:Service ;
    rdfs:label "STW combined TDB Service (R)" ;
    fuseki:name "stw_combined" ;
    fuseki:serviceQuery "query" ;
    fuseki:serviceQuery "sparql" ;
    ##fuseki:serviceUpdate "update" ;
    fuseki:serviceReadGraphStore "data" ;
    fuseki:serviceReadGraphStore "get" ;
    fuseki:dataset :stw_combined ;
    .

:stw_combined rdf:type text:TextDataset ;
    text:dataset <#stw> ;
    text:index <#stwIndex> ;
    .

<#stw> rdf:type tdb:DatasetTDB ;
    tdb:location "/opt/thes/var/stw/latest/tdb" ;
    ##tdb:unionDefaultGraph true ;
    .

<#stwIndex> a text:TextIndexLucene ;
    text:directory <file:/opt/thes/var/stw/latest/tdb_lucene> ;
    text:entityMap <#entMap> ;
    .

<#entMap> a text:EntityMap ;
    text:entityField "uri" ;
    text:defaultField "text" ; ## Must be defined in the text:map
    text:map (
        # skos:prefLabel
        [ text:field "text" ; text:predicate skos:prefLabel ]
        # skos:altLabel
        [ text:field "text" ; text:predicate skos:altLabel ]
        # skos:hiddenLabel
        [ text:field "text" ; text:predicate skos:hiddenLabel ]
    ) .

Help would be much appreciated.

Cheers, Joachim

baran_H

2013-06-21 09:11:41 UTC

I have had the same problem with all jena-fuseki-0.2.8 + jena-text builds,
when indexing rdfs:label for the DBpedia dataset.
It would be very useful to have a small example which runs without
problems, perhaps a config.ttl for books.ttl in TDB or something similar.
baran

Andy Seaborne

2013-06-21 10:09:20 UTC

Hi,

There is config-tdb-text.ttl in the distribution that is a combination
of text indexing and TDB.

Andy

baran_H

2013-06-21 11:03:55 UTC

OK, after reading your reply to Joachim, I loaded books.ttl with this command line:

-Xmx2048M
-Dlog4j.configuration=jena-fuseki-0.2.8-SNAPSHOT\log4j.properties -cp
fuseki-0.2.8-SNAPSHOT\*; tdb.tdbloader --loc myDB myDB\myData\books.ttl

What must the command line be so that indexing of dc:title is also
handled?

Thanks, baran

Andy Seaborne

2013-06-21 13:33:37 UTC

You'll need to write an appropriate configuration file and use
jena.textindexer from the #31 development build.

There are too many configuration details to have a setup whereby only
arguments on the command line are used.

It can be the Fuseki configuration file if there is only one defined
text dataset in it.

Andy

baran_H

2013-06-21 14:25:06 UTC

I already have a config file, based on the documentation at
http://jena.staging.apache.org/documentation/query/text-query.html
which loads without error messages.

I have also already loaded my dataset, without any indexing options, from
the command line (above).

With #31 I would then run this command line:
java -cp fuseki-server.jar jena.textindexer MyFusekiConfigFile

which creates a Lucene directory with the index data. MyFusekiConfigFile
is exactly the same file I use to start Fuseki once indexing has
finished.

Sorry to ask again, Andy: is that the right thing to try?

Thanks, baran.

Andy Seaborne

2013-06-21 19:09:59 UTC

Post by baran_H

I have already a config file made from the info
http://jena.staging.apache.org/documentation/query/text-query.html
which works without err-messages.
And I already loaded my dataset without any indexing addings from
commandline (above).
java -cp fuseki-server.jar jena.textindexer MyFusekiConfigFile

... when the server wasn't running ...

Post by baran_H
which creates a lucene directory with indexing info where
MyFusekiConfigFile is exactly the same with which i start Fuseki
when indexing finished.

That should work - worked for me when I started up Fuseki on these
directories.

Post by baran_H
Sorry, Andy, i must again ask, is that ok to try it?

I don't understand the question - ask what again?

baran_H

2013-06-22 07:59:49 UTC

Well, I tried it exactly as we described above, and my minimal
example runs OK, including the query check! The next step, with a local
DBpedia, is a bit more complicated, as I have another dataset in memory
without indexing; I will report here if I run into difficulties.

But I have one LAST question here:

Is PREFIX text: <http://jena.apache.org/text#> defined in the SPARQL spec
with the aim of identical query syntax for all SPARQL implementations
supporting text indexing? And if not, is something similar planned for
the future?

Thanks, baran

Andy Seaborne

2013-06-22 10:38:56 UTC

On 22/06/13 08:59, baran_H wrote:
...
Promise?

Post by baran_H
Is PREFIX text: <http://jena.apache.org/text#> defined in SPARQL-spec
with the aim of identical query syntax for all SPARQL implementations
supporting text-indexing and if not, is a similar thing planned for
the future?

Property functions (sometimes called magic properties) are within SPARQL
syntax but there is no formal definition in the SPARQL specs. The style
is used by other systems, going back to cwm/N3.

PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?s
{ ?s text:query (rdfs:label 'word' 10) ;
     rdfs:label ?label
}

In fact, with possibly a slightly generous reading, there doesn't need to
be any spec text. You can imagine there really is a pattern in the data
that matches { ?s text:query 'word' } with the resource having a
property text:query whose values are all the strings it matches. Think of
it as a weird kind of entailment.

Extension within the syntax is more popular than extension that adds
non-standard syntax. If syntax were being added, then

?s TMATCH (rdfs:label, 'word', 10)

or (SPARQL likes simple - keyword first:)

TMATCH ?s WITH (rdfs:label, 'word', 10)

There are no plans that I know of to standardise this - it came up in
scoping SPARQL 1.1 at the use case and requirements stage. The big
problem is defining the text search language. A standard for SPARQL
text search needs a standard for the search string. But while many of
the candidates look the same, they differ in the details. This, coupled
with the fact that implementers do not want to implement text search
themselves but use an existing engine, does make standardizing it unlikely.

The first thing is to let SPARQL 1.1 get established. Any new round of
standardisation should wait to see what the real needs are - not what
the initial issues are.

Areas that could be interesting:

1/ Experimentation with graph operators beyond property paths
2/ Better/different syntax targeting the same algebra

Anyone interested should just dive in.

Andy

baran_H

2013-06-22 17:43:09 UTC

Post by Andy Seaborne

Property functions (sometimes called magic properties) are within SPARQL
syntax but there is no formal definition in the SPARQL specs. The style
is used by other systems, going back to cwm/N3.
PREFIX text: <http://jena.apache.org/text#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s
{ ?s text:query (rdfs:label 'word' 10) ;
rdfs:label ?label
}
In fact, with possible a slight generous reading, there doesn't need to
be any spec text. You can imagine there really is a pattern in the data
that matches { ?s text:query 'word' } with the resource having a
property text:query and value all the strings it matches. Think of it
as a weird kind of entailment.
...

But this is a bit of juggling with syntax. What I mean is two SPARQL
implementations with the same dataset, where I can compare performance
using identical query syntax, however each of them has realized its
text indexing.

Post by Andy Seaborne
There are no plans that I know of to standardise this - it came up in
scoping SPARQL 1.1 at the use case and requirements stage. The big
problem is defining the text search language. A standard for SPARQL
text search needs a standard for the search string. But while many of
the candidates look the same, they differ in the details. This, coupled
with the fact that implementers do not want to implement text search
themselves but use an existing engine, does make standardizing it unlikely.
The first thing is to let SPARQL 1.1 get established. Any new round of
standardisation should wait to see what the real needs are - not what the
initial issues are.

I think a standard query syntax for indexed text is not a big problem.
It seems irrelevant to me how different SPARQL implementers realize
their text search internally; that does not need standardization.
What is essential is that they make it possible to query it with the same
syntax and let users easily compare overall performance.

The Semantic Web as a whole has been ailing from performance issues all
along: there is so much work to do to get into it, but what you get back
is poor performance in results of any kind. The SPARQL spec and SPARQL
implementations are important bottlenecks here, and text indexing is
ONE of the heavy influences on the overall performance of a SPARQL
implementation.
Therefore I cannot understand far-fetched-sounding arguments against
standardizing it at the query-syntax level, such as 'it is not a real need'.

Another aspect is the long-expected huge crowds of the future querying
public SPARQL endpoints. Do you want to tell each of them what kind of
SPARQL implementation each endpoint runs, so that they can write a
correct query? If yes, there will never be huge crowds querying public
SPARQL endpoints.

Only if SPARQL prefers the closed world of, say, a medical company
as its real audience can I somewhat understand this kind of argument.

Post by Andy Seaborne
1/ Experimentation with graph operators beyond property paths
2/ Better/different syntax targeting the same algebra
Anyone interested should just dive in.
Andy

I cannot dive in, Andy. I definitely do not see 1/ as a priority, but if
you mean it is a first-class performance driver, then I can really
understand you. In addition, my real profession is embedded systems.
For many years I wrote, on the side, reports (with live online
presentations) about the Semantic Web for the staff of a company, because
it interested me and they were also interested, until they said: this
will never deliver anything comparable to Google.
And now I am alone with my hobby and hope very much for continued
free support when I have time for it...

baran.

Andy Seaborne

2013-06-21 10:24:14 UTC

Post by Neubert Joachim
When I got it right, Fuseki is supposed to build the text index when it starts up. However, this did not work for me.

Joachim,

Fuseki indexes the data as it's loaded; it does not index existing data
on startup. I see what you see in the Lucene directory before data is
loaded.

How is the data being loaded into the store?

Have you tried the config-tdb-text.ttl example? I have just checked
using that, and also modified to add something more like the entity map
you have and it works for me.

I've tried s-put and the web UI (SPARQL update) to load data into the
current snapshot build and text queries returned something.

If you have a complete, minimal example of the load-query lifecycle, that
would be most useful.

Andy
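
A quick way to check for the situation described in the original report is to look at what the Lucene directory contains. The following is a minimal sketch. It assumes, going only by the listings in this thread, that an index with nothing in it shows just segments_N, segments.gen and possibly write.lock; the function name index_looks_empty is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report whether a Lucene index directory holds only bookkeeping
# files. Assumption (taken from the listings in this thread, not from the
# Lucene documentation): an index without documents contains nothing
# besides segments_N, segments.gen and possibly write.lock.
# The function name index_looks_empty is hypothetical.

index_looks_empty() {
    local dir="$1"
    local f name
    [ -d "$dir" ] || return 2       # not a directory: cannot tell
    for f in "$dir"/*; do
        [ -e "$f" ] || continue     # glob matched nothing: nothing to inspect
        name="${f##*/}"
        case "$name" in
            segments_*|segments.gen|write.lock) ;;  # bookkeeping only
            *) return 1 ;;                          # some other index file exists
        esac
    done
    return 0
}
```

For example, `index_looks_empty /opt/thes/var/stw/latest/tdb_lucene && echo "nothing indexed yet"` would flag the state shown in the first message. A return status of 1 only means that other files exist, not that the index is complete.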

Neubert Joachim

2013-06-21 10:45:15 UTC

Hi Andy,

thanks for the quick response, which makes quite clear what was wrong: as before with Joseki, I used a pre-built read-only TDB database.

So I have to use Fuseki for building the TDB store as well. I'll check and report back.

Cheers, Joachim

Andy Seaborne

2013-06-21 13:30:29 UTC

There is a command line tool, jena.textindexer, that takes a dataset and
produces an index.

java -cp fuseki-server.jar jena.textindexer YourJosekiConfigFile

But it was broken in the way it handled the command line args - I've
just fixed it and used it to index a store that wasn't loaded with text
indexing enabled:

tdb.tdbloader -loc=DIR
jena.textindexer ....

and it worked for me. You'll need the latest development build (#31),
which I just kicked off for a full rebuild.

https://repository.apache.org/content/repositories/snapshots/org/apache/jena/jena-fuseki/0.2.8-SNAPSHOT/jena-fuseki-0.2.8-20130621.132913-31-distribution.zip

Andy
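
The two steps outlined in this message can be sketched as a small shell function. This is only a sketch under assumptions: it uses the command forms quoted in this thread (tdb.tdbloader with --loc, and jena.textindexer taking the configuration file as its argument), which may differ in other builds. The function names, the DRY_RUN switch and the FUSEKI_JAR variable are made up for illustration. As noted elsewhere in the thread, the indexer was run while the server wasn't running.

```shell
#!/usr/bin/env bash
# Sketch: load data into TDB without Fuseki, then build the text index.
# Assumptions: fuseki-server.jar provides both tools (as the commands in
# this thread suggest); argument forms are copied from the thread and may
# differ between builds. run, build_store_and_index, DRY_RUN and
# FUSEKI_JAR are hypothetical names.

FUSEKI_JAR="${FUSEKI_JAR:-fuseki-server.jar}"

run() {
    # Print the command; execute it only when DRY_RUN is not set to 1.
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

build_store_and_index() {
    local tdb_dir="$1" data_file="$2" config_file="$3"
    # Step 1: bulk-load the RDF file into the TDB directory.
    run java -cp "$FUSEKI_JAR" tdb.tdbloader --loc "$tdb_dir" "$data_file" || return 1
    # Step 2: index the loaded data for the text dataset described in
    # the configuration file.
    run java -cp "$FUSEKI_JAR" jena.textindexer "$config_file" || return 1
}
```

Calling `DRY_RUN=1 build_store_and_index myDB books.ttl config.ttl` only prints the two commands, which is a cheap way to check paths before a long load.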

Neubert Joachim

2013-06-22 12:54:37 UTC

Hi Andy,

indexing via Fuseki (#31) and via jena.textindexer both worked for me now - thanks for your help.

In a production setting, I'd prefer the latter, because

a) the Fuseki datastore should preferably be read-only, and
b) on large datasets, loading and index building may take some hours, and this is easier to control in a "local" script

From the script, I reference a temporary config file which holds definitions for only one dataset (whereas the Fuseki config may hold many), in order to (re-)build only one index.

Thanks again - Joachim
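
The temporary single-dataset configuration described above could be generated roughly as in the sketch below. Assumptions: the Turtle mirrors the text dataset from the original post, the prefix IRIs are the commonly used ones for RDF, TDB and SKOS (the original post does not show its prefix declarations), and the function name write_single_dataset_config is hypothetical. Depending on the build, further declarations from config-tdb-text.ttl may be needed.

```shell
#!/usr/bin/env bash
# Sketch: write a temporary configuration describing exactly one text
# dataset, so that the indexer (re)builds only that index.
# write_single_dataset_config is a hypothetical helper; the Turtle mirrors
# the dataset from the original post, with assumed prefix declarations.

write_single_dataset_config() {
    local tdb_dir="$1" index_dir="$2"
    local config_dir config
    config_dir="$(mktemp -d)" || return 1
    # Use a .ttl file name so the file is recognisable as Turtle.
    config="$config_dir/text-index-config.ttl"
    cat > "$config" <<EOF
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tdb:  <http://jena.hpl.hp.com/2008/tdb#> .
@prefix text: <http://jena.apache.org/text#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<#textDataset> rdf:type text:TextDataset ;
    text:dataset <#tdb> ;
    text:index   <#index> .

<#tdb> rdf:type tdb:DatasetTDB ;
    tdb:location "$tdb_dir" .

<#index> rdf:type text:TextIndexLucene ;
    text:directory <file:$index_dir> ;
    text:entityMap <#entMap> .

<#entMap> rdf:type text:EntityMap ;
    text:entityField  "uri" ;
    text:defaultField "text" ;
    text:map (
        [ text:field "text" ; text:predicate skos:prefLabel ]
        [ text:field "text" ; text:predicate skos:altLabel ]
        [ text:field "text" ; text:predicate skos:hiddenLabel ]
    ) .
EOF
    echo "$config"   # print the path so the caller can pass it on
}

# Possible use, with the server stopped (command form as quoted in this thread):
#   config="$(write_single_dataset_config /opt/thes/var/stw/latest/tdb /opt/thes/var/stw/latest/tdb_lucene)"
#   java -cp fuseki-server.jar jena.textindexer "$config"
```

Keeping the generated file separate from the Fuseki configuration means the server's own config can keep listing many datasets while the script touches only one index.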

Andy Seaborne

2013-06-22 17:18:46 UTC

Joachim,

That makes a lot of sense. Would you like to write a paragraph or two
on that? I'll add it to the documentation.

Andy

Andy Seaborne

2013-06-23 19:42:58 UTC

I've been through jena-text and made various fixes including to the
command line indexer. Things should be a lot better now when working
with multiple predicates mapped to the same Lucene field.

Development build of Fuseki #35.

Andy

Neubert Joachim

2013-06-24 07:14:15 UTC

Build #35 worked fine for me - thanks a lot!

I added the newly introduced text:defaultPredicate, but could not figure out what it is supposed to do. The inline comment in config-tdb-text.ttl is not helpful here - it would be great if you could enhance it.

Also, I'd suggest adding a

[ text:field "text" ; text:predicate dc:title ]

line to text:map there. Firstly, this would exemplify the syntax for multiple properties, and secondly, it would create index entries for books.ttl out-of-the-box.

Cheers, Joachim

Andy Seaborne

2013-06-24 09:44:45 UTC

Post by Neubert Joachim
Build #35 worked fine for me - thanks a lot!
I added the newly introduced text:defaultPredicate, but could not
figure out what it is supposed to do. The inline comment in
config-tdb-text.ttl is not helpful here - it would be great if you
could enhance it.

Good point - actually, having included it, I am not so sure it's needed
if the default field is known.

The default predicate relates to:

?s text:query 'search' .
?s text:query ('search' 10) .

where the predicate isn't explicitly given but then it does not matter
as the search is on the default Lucene field, regardless of predicate.

So I'll go back and see if it's really needed, which would solve the
matter in a very different way! I now suspect it's driven by the
internal code structure, not the usage so the code should be fixed.

Post by Neubert Joachim
Also, I'd suggest adding a
[ text:field "text" ; text:predicate dc:title ]
line to text:map there. Firstly, this would exemplify the syntax for
multiple properties, and secondly, it would create index entries for
books.ttl out-of-the-box.

Thanks for the feedback,
Andy
