Extending Jena Text to Support ElasticSearch as Indexing/Querying Engine

Discussion:

anuj kumar

2017-02-14 11:32:59 UTC

Hi,
I am working on an application where my data (in N-Triples format) is
stored in HBase, and we now want to provide a free-text search capability to
our end users. We have decided to use ElasticSearch (I am not listing the
actual reasons here to keep things simple, but the gist is that using Lucene
or Solr directly has been ruled out).

Thus I started looking at how Jena works with existing indexing
capabilities (Lucene and Solr) to extend Jena to support ElasticSearch.
I figured out that I probably need to perform the following steps/changes:

- Create a Java Class that extends org.apache.jena.query.text.TextIndex
class. I called this Java class: TextIndexES.java. This class is simply
a copy-paste of the TextIndexLucene class.
- Create a Java Class that extends
org.apache.jena.assembler.assemblers.AssemblerBase java class. I called
this Java class: TextIndexESAssembler.java
- Update the org.apache.jena.query.text.TextDataFactory.java class to
include a new method :

public static TextIndex createESIndex(Directory dir,
TextIndexConfig config) {}

This method initiates the TextIndexES class, in case there is no
MultiLingual Support specified.

- Create a TTL file (text-config-es.ttl) to include ES Index mapping capabilities
- Create a simple test that tries to load this TTL file.

The Test fails with the following error:

org.apache.jena.assembler.exceptions.NoSpecificTypeException: the root
file:///Users/LT-Mac-Akumar/personal-projects/jena/jena-text/testing/TextQuery/text-config-es.ttl#indexES
has no most specific type that is a subclass of ja:Object

doing:
root: http://localhost/jena_example/#text_dataset with type:
http://jena.apache.org/text#TextDataset assembler class: class
org.apache.jena.query.text.assembler.TextDatasetAssembler

at org.apache.jena.assembler.assemblers.AssemblerGroup$PlainAssemblerGroup.open(AssemblerGroup.java:125)
at org.apache.jena.assembler.assemblers.AssemblerGroup$ExpandingAssemblerGroup.open(AssemblerGroup.java:81)
at org.apache.jena.assembler.assemblers.AssemblerBase.open(AssemblerBase.java:39)
at org.apache.jena.assembler.assemblers.AssemblerBase.open(AssemblerBase.java:35)
at org.apache.jena.query.text.assembler.TextDatasetAssembler.open(TextDatasetAssembler.java:62)
at org.apache.jena.query.text.assembler.TextDatasetAssembler.open(TextDatasetAssembler.java:42)
at org.apache.jena.assembler.assemblers.AssemblerGroup$PlainAssemblerGroup.openBySpecificType(AssemblerGroup.java:143)
at org.apache.jena.assembler.assemblers.AssemblerGroup$PlainAssemblerGroup.open(AssemblerGroup.java:130)
at org.apache.jena.assembler.assemblers.AssemblerGroup$ExpandingAssemblerGroup.open(AssemblerGroup.java:81)
at org.apache.jena.assembler.assemblers.AssemblerBase.open(AssemblerBase.java:39)
at org.apache.jena.assembler.assemblers.AssemblerBase.open(AssemblerBase.java:35)
at org.apache.jena.query.DatasetFactory.assemble(DatasetFactory.java:290)
at org.apache.jena.query.DatasetFactory.assemble(DatasetFactory.java:264)
at org.apache.jena.query.text.TestBuildTextDataset.createAssembler(TestBuildTextDataset.java:124)
at org.apache.jena.query.text.TestBuildTextDataset.buildText_99(TestBuildTextDataset.java:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:119)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:42)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:234)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:74)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
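
The exception says that the resource #indexES carries no type the assembler recognises as a subclass of ja:Object, and the trace shows it being raised while TextDatasetAssembler opens that nested resource. One plausible cause, offered as an assumption rather than a confirmed diagnosis, is that the new index type was never registered with the assembler group the way the Lucene and Solr index types are. The sketch below is a simplified stand-in, not Jena's implementation: ToyAssemblerGroup, its methods, and the text#TextIndexES type URI are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-in for an assembler group; NOT Jena's actual code.
class ToyAssemblerGroup {
    // Maps a type URI to the factory able to build objects of that type.
    private final Map<String, Function<String, Object>> byType = new HashMap<>();

    // Registers a factory for a type (hypothetical analogue of registering an assembler).
    ToyAssemblerGroup implementWith(String typeUri, Function<String, Object> factory) {
        byType.put(typeUri, factory);
        return this;
    }

    // Builds the object for a root resource, failing if its type was never registered.
    Object open(String rootUri, String typeUri) {
        Function<String, Object> factory = byType.get(typeUri);
        if (factory == null) {
            throw new IllegalStateException(
                "the root " + rootUri + " has no registered type (got " + typeUri + ")");
        }
        return factory.apply(rootUri);
    }
}

class ToyAssemblerDemo {
    static final String TEXT_NS = "http://jena.apache.org/text#";

    public static void main(String[] args) {
        ToyAssemblerGroup group = new ToyAssemblerGroup()
            .implementWith(TEXT_NS + "TextIndexLucene", root -> "lucene index for " + root);

        // A registered type resolves to its factory.
        System.out.println(group.open("#indexLucene", TEXT_NS + "TextIndexLucene"));

        // A new type (hypothetical URI) that was never registered fails at open time.
        try {
            group.open("#indexES", TEXT_NS + "TextIndexES");
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

If this reading is right, the thing to check in the fork is whether the new type and TextIndexESAssembler are registered in the same place where jena-text registers its existing index assemblers.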

Can someone please point out what I am doing wrong or what I may have
missed updating?
For reference, all the classes that I created or modified are attached, so
that, if required, the issue can be reproduced.

Thanks and looking forward to some pointers/resolutions.

--
*Anuj Kumar*

Lorenz B.

2017-02-14 11:46:19 UTC

Attachments do not work on this mailing list, so it's better to share
resources via a service such as GitHub.

--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center

anuj kumar

2017-02-14 12:06:44 UTC

Thanks, Lorenz, for the quick heads-up. Here is the GitHub link to the listed
files: https://github.com/EaseTech/jena-text

Thanks,
Anuj Kumar

--
*Anuj Kumar*

Osma Suominen

2017-02-14 12:47:25 UTC

Hi Anuj,

I'm not sure what the problem is - maybe others more familiar with the
assembler can help - but would it be helpful to work on a fork of the
Jena source tree instead of a separate project? Then all the scaffolding
to load the right classes etc. would already be in place. Maybe you are
already doing it that way (I see that the package declaration is
"package org.apache.jena.query.text;") but it's not obvious from the
files you posted to GitHub.

If you make a good implementation of jena-text with ES (including
writing unit tests), I don't see why it couldn't later be merged to Jena
itself. If you were working on a fork, you could then do a pull request
so that it can be reviewed and, if appropriate, merged.

-Osma

--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi

anuj kumar

2017-02-14 13:03:14 UTC

Thanks Osma.
I was working on a local copy of Jena source code initially.
I have now forked Jena and added my specific files as I specified in my
previous email to ease debugging by more experienced Jena developers.
The forked repo can be found here : https://github.com/EaseTech/jena

You will see that most of the code in these new files is simply copied from
the existing Lucene-based files. My first goal is to instantiate the
TextIndexES class and get the test case working. I will then move on to
implementing the actual ES code, which, IMO, should be much faster.

Thanks,
Anuj Kumar

--
*Anuj Kumar*

Osma Suominen

2017-02-14 13:13:33 UTC

Post by anuj kumar
I was working on a local copy of Jena source code initially.
I have now forked Jena and added my specific files as I specified in my
previous email to ease debugging by more experienced Jena developers.
The forked repo can be found here : https://github.com/EaseTech/jena
You will see that most of the code in these new files is simply copied from
the existing Lucene-based files. My first goal is to instantiate the
TextIndexES class and get the test case working. I will then move on to
implementing the actual ES code, which, IMO, should be much faster.

Great, I hope you get this working!

If you feel that you are duplicating existing Lucene code in your ES
implementation, consider abstracting that out into e.g. a common
superclass instead. This is something that already bothers me in the
current Lucene vs Solr implementations - there's even a "DRY" comment in
the code showing that somebody else has thought about it too.
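
A minimal sketch of the common-superclass idea, assuming the shared part is query preparation and only the search call differs per engine. All names here (AbstractTextIndexSketch, LuceneIndexSketch, EsIndexSketch, the default limit of 10000) are hypothetical and do not correspond to existing Jena classes or settings.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical shared base class: holds the logic common to every backend.
abstract class AbstractTextIndexSketch {
    private final String defaultField;

    AbstractTextIndexSketch(String defaultField) {
        this.defaultField = defaultField;
    }

    // Shared: qualify the query and normalise the limit once, then delegate.
    final List<String> query(String queryString, int limit) {
        String qualified = queryString.contains(":")
            ? queryString
            : defaultField + ":" + queryString;
        int effectiveLimit = limit <= 0 ? 10000 : limit;
        return search(qualified, effectiveLimit);
    }

    // Backend-specific: the only part each engine has to supply.
    protected abstract List<String> search(String qualifiedQuery, int limit);
}

// Stand-in for a Lucene-backed index; returns a description instead of real hits.
class LuceneIndexSketch extends AbstractTextIndexSketch {
    LuceneIndexSketch(String defaultField) { super(defaultField); }

    @Override
    protected List<String> search(String qualifiedQuery, int limit) {
        return Collections.singletonList("lucene[" + qualifiedQuery + " limit=" + limit + "]");
    }
}

// Stand-in for an ES-backed index; same contract, different engine.
class EsIndexSketch extends AbstractTextIndexSketch {
    EsIndexSketch(String defaultField) { super(defaultField); }

    @Override
    protected List<String> search(String qualifiedQuery, int limit) {
        return Collections.singletonList("es[" + qualifiedQuery + " limit=" + limit + "]");
    }
}
```

With this shape, a fix to the shared query handling lands in one place, and a new engine only has to implement search.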

Also it might be helpful to try to reuse all the Lucene unit tests for
ES as well, if you can figure out a way to do that.
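
One way to reuse tests across engines, sketched here in plain Java without JUnit, is to write the checks once against an interface and pass in a factory for each backend. TinyIndex, TinyIndexContract, and MapBackedIndex are hypothetical names for illustration, not part of jena-text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical minimal index interface, standing in for the real one.
interface TinyIndex {
    void add(String id, String text);
    boolean matches(String id, String term);
}

// Backend-agnostic checks, written once and run against any implementation.
final class TinyIndexContract {
    private TinyIndexContract() {}

    static void verify(Supplier<TinyIndex> factory) {
        TinyIndex index = factory.get();
        index.add("doc1", "free text search");
        check(index.matches("doc1", "text"), "indexed term should match");
        check(!index.matches("doc1", "sparql"), "absent term should not match");
        check(!index.matches("doc2", "text"), "unknown document should not match");
    }

    private static void check(boolean ok, String message) {
        if (!ok) {
            throw new AssertionError(message);
        }
    }
}

// In-memory stand-in for one backend; a second backend would reuse verify().
class MapBackedIndex implements TinyIndex {
    private final Map<String, String> docs = new HashMap<>();

    @Override
    public void add(String id, String text) {
        docs.put(id, text);
    }

    @Override
    public boolean matches(String id, String term) {
        String text = docs.get(id);
        return text != null && text.contains(term);
    }
}
```

Calling TinyIndexContract.verify(MapBackedIndex::new) runs the whole set against that backend. In a JUnit setup the same idea is usually expressed as an abstract test class with one factory method that each engine's test class overrides.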

-Osma

anuj kumar

2017-02-14 13:15:42 UTC

I will do it. But I first need to get the simple test working in order to
move forward. I hope someone here can help me.

Thanks,
Anuj Kumar