Osma Suominen
2016-10-25 12:05:08 UTC
Hi,
I'm trying to post-process a large bibliographic data set which, among
its 30M or so triples split into 300 N-Triples files, contains a few bad
URIs. Because of the bad URIs, I run into problems when trying to use
the data, e.g. to load it into TDB or SDB. The data set is created from
MARC records using an XQuery-based conversion process [1] that isn't
very careful with URIs, so bad URIs or other errors in the original
records may be passed through and will be present in the output files.
What I'd like to do is to merge the 300 files into a single N-Triples
file, without including the triples with the bad URIs, using e.g. riot
from the command line, like this:
riot input*.nt >output.nt
But the bad URIs in the input files cause parsing errors and subsequent
triples in the same file will not be included in the output.
Here is a small example file, with a bad URI on the 2nd line:
--cut--
<http://example.org/007334701> <http://schema.org/name> "example bad URL" .
<http://example.org/007334701> <http://schema.org/url> <http://example.org/007334701.pdf |q PDF> .
<http://example.org/007334701> <http://schema.org/description> "an example with a bad URL" .
--cut--
When parsed using the above riot command, I get this output:
14:47:45 ERROR riot :: [line: 2, col: 90] Bad character in IRI (space): <http://example.org/007334701.pdf[space]...>
<http://example.org/007334701> <http://schema.org/name> "example bad URL" .
So the command outputs just the first triple (i.e. anything before the
bad URI), but omits the bad one as well as the last one which came after
the bad URI. If I have a file with 100000 triples with one having a bad
URI on line 50000, the last 50000 triples in that file are discarded.
I tried the --nocheck option, but it didn't seem to make any
difference; the result is exactly the same.
There is also the --stop option, but it would do the opposite of what I
want: I don't want to stop on the first error, but instead continue
parsing.
I see that ModLangParse, the class used to process command line options
in riot, has some initial support for a --skip option [2] that would
probably do what I want, i.e. omit the bad triples while preserving all
the valid ones. But that option-handling code is commented out, and
CmdLangParse doesn't do anything with skipOnBadTerm (the boolean field
that would be set based on the option) [3].
So how can I get rid of the few bad triples in my input files while
preserving all the good ones?
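In the meantime, the workaround I've been considering is to exploit the
fact that N-Triples is line-oriented and pre-filter the files before
handing them to riot. The sketch below is not a real N-Triples
validator; it only drops lines in which an IRI (anything between < and
>) contains a character that is plainly illegal in an IRI, such as the
space and pipe in my example. Anything more subtle would slip through:

```python
import re
import sys

# Extract everything between < and > — the IRIs in an N-Triples line.
# Crude: a literal containing "<" could confuse it, but that's rare here.
IRI_RE = re.compile(r'<([^>]*)>')

# Characters that may never appear in an IRI: space, quote, braces,
# pipe, caret, backtick, backslash. Not full RFC 3987 validation.
BAD_CHARS = re.compile(r'[ "{}|^`\\]')

def is_valid_line(line):
    """True if no IRI on the line contains an obviously bad character."""
    return not any(BAD_CHARS.search(iri) for iri in IRI_RE.findall(line))

def filter_ntriples(lines):
    """Keep only the lines whose IRIs pass the crude check."""
    return [line for line in lines if is_valid_line(line)]

if __name__ == '__main__':
    # Filter stdin to stdout: python filter.py < input.nt > filtered.nt
    for line in sys.stdin:
        if is_valid_line(line):
            sys.stdout.write(line)
```

I'd still run riot over the filtered files afterwards for proper
validation and merging; the script only strips the known-bad lines so
the parser no longer aborts mid-file. But this feels like reinventing a
wheel that --skip was apparently meant to provide.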
I'm using apache-jena 3.1.1-SNAPSHOT from 2016-10-24.
Thanks,
Osma
[1] https://github.com/lcnetdev/marc2bibframe
[2]
https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/arq/cmdline/ModLangParse.java#L78
[3]
https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/riotcmd/CmdLangParse.java#L224
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi