Discussion: Getting rid of triples with bad URIs
Osma Suominen
2016-10-25 12:05:08 UTC
Hi,

I'm trying to post-process a large bibliographic data set which, among
its 30M or so triples split into 300 N-Triples files, contains a few bad
URIs. Because of the bad URIs, I run into problems when trying to use
the data, e.g. to load it into TDB or SDB. The data set is created from
MARC records using an XQuery-based conversion process [1] that isn't very
careful with URIs, so bad URIs or other errors in the original records
may be passed through and will be present in the output files.

What I'd like to do is to merge the 300 files into a single N-Triples
file, without including the triples with the bad URIs, using e.g. riot
from the command line, like this:

riot input*.nt >output.nt

But the bad URIs in the input files cause parsing errors and subsequent
triples in the same file will not be included in the output.

Here is a small example file, with a bad URI on the 2nd line:
--cut--
<http://example.org/007334701> <http://schema.org/name> "example bad URL" .
<http://example.org/007334701> <http://schema.org/url> <http://example.org/007334701.pdf |q PDF> .
<http://example.org/007334701> <http://schema.org/description> "an example with a bad URL" .
--cut--

When parsed using the above riot command, I get this output:

14:47:45 ERROR riot :: [line: 2, col: 90] Bad character in IRI (space): <http://example.org/007334701.pdf[space]...>
<http://example.org/007334701> <http://schema.org/name> "example bad URL" .

So the command outputs just the first triple (i.e. anything before the
bad URI), but omits the bad one as well as the last one which came after
the bad URI. If I have a file with 100000 triples with one having a bad
URI on line 50000, the last 50000 triples in that file are discarded.

I tried the --nocheck option but it didn't seem to make any difference;
the result is exactly the same.

Also there is the --stop option, but it would do the opposite of what I
want - I don't want to stop on the first error, but instead continue
with the parsing.

I see that ModLangParse, the class used to process command line options
in riot, has some initial support for a --skip option [2] that would
probably do what I want, i.e. omit the bad triples while preserving all
the valid ones. But that option handling code is commented out and
CmdLangParse doesn't do anything with skipOnBadTerm (the boolean field
that would be set based on that option) [3].

So how can I get rid of the few bad triples in my input files while
preserving all the good ones?

I'm using apache-jena 3.1.1-SNAPSHOT from 2016-10-24.

Thanks,
Osma


[1] https://github.com/lcnetdev/marc2bibframe

[2]
https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/arq/cmdline/ModLangParse.java#L78

[3]
https://github.com/apache/jena/blob/master/jena-cmds/src/main/java/riotcmd/CmdLangParse.java#L224
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
***@helsinki.fi
http://www.nationallibrary.fi
Neubert, Joachim
2016-10-25 18:07:21 UTC
Hi Osma,

What a coincidence: today I ran into the same problem here. I have (many large) JSON-LD files with a few messy URIs, like this:

{
  "@context" : {
    "dcterms": "http://purl.org/dc/terms/",
    "eb": "http://zbw.eu/beta/resource/title/",
    "gnd": "http://d-nb.info/gnd/",
    "subject_gnd": { "@id": "dcterms:subject", "@type": "@id" }
  },
  "@graph" : [
    {
      "subject_gnd" : [
        "gnd:4114557-4 4070699-0",
        "gnd:4114247-0"
      ],
      "@id" : "eb:10010237512"
    }
  ]
}

riot produces a warning and 2 triples:

# riot --check --strict /tmp/example.jsonld
20:00:26 WARN riot :: Bad IRI: <http://d-nb.info/gnd/4114557-4 4070699-0> Code: 17/WHITESPACE in PATH: A single whitespace character. These match no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
<http://zbw.eu/beta/resource/title/10010237512> <http://purl.org/dc/terms/subject> <http://d-nb.info/gnd/4114557-4 4070699-0> .
<http://zbw.eu/beta/resource/title/10010237512> <http://purl.org/dc/terms/subject> <http://d-nb.info/gnd/4114247-0> .

Strangely, I can load the .jsonld file via tdbloader (resulting in the very same two triples shown above in TDB). Loading the equivalent .nt file aborts with an exception:

ERROR [line: 1, col: 116] Bad character in IRI (space): <http://d-nb.info/gnd/4114557-4[space]...>
org.apache.jena.riot.RiotException: [line: 1, col: 116] Bad character in IRI (space): <http://d-nb.info/gnd/4114557-4[space]...>

Neither of these behaviors is very helpful. A --skip option which consistently skips the bad triples and outputs or loads the good ones would be great. Or perhaps somebody has another idea of how to get rid of the bad URIs?

Cheers, Joachim
Andy Seaborne
2016-10-26 11:50:15 UTC
Hi Osma,

I usually treat this as an ETL cleaning problem and text-process it - it's
not just finding the duff URIs but also fixing them in some way.

We could change the parser behaviour for bad URIs. There is a reason
why it is picky though - if bad data gets into a database it is very
hard to fix it up afterwards. Often, problems arise days/weeks/months
later, and may show up in the interaction with other systems when query
results are published.

Turtle and N-triples explicitly define a token rule (N-triples):

[8] IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'

whereby space is ruled out at the bottom-most level of the parsing process.
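
For illustration only, that token rule translates fairly directly into a
regular expression. A rough Python sketch (not the exact check the Jena
tokenizer performs; is_valid_iriref is just an illustrative helper name):

--cut--
import re

# N-Triples token rule:
#   IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
# where UCHAR is a \uXXXX or \UXXXXXXXX escape.
UCHAR = r'\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}'
IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|' + UCHAR + r')*>')

def is_valid_iriref(token):
    """True if the whole <...> token matches the IRIREF production."""
    return IRIREF.fullmatch(token) is not None

# is_valid_iriref('<http://example.org/007334701>')            -> True
# is_valid_iriref('<http://example.org/007334701.pdf |q PDF>') -> False
--cut--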

JSON-LD is handled by a 3rd-party system: jsonld-java.

Looks to me like Jena is not checking the output from that as it creates
the Jena objects, because "ParserProfileChecker" checks for triple
problems (literals as subjects etc.) and assumes its input terms are valid.

Andy
Osma Suominen
2016-10-27 07:06:01 UTC
Hi Andy!

You're right - these problems should be fixed, preferably at the source
(in my case, the bad MARC records). And I will try to do that. But I'm
setting up a conversion pipeline [1] to be run periodically, and I want
that to be robust, so that small errors like this do not cause big
problems later on. Even if I fix the current problems, one day someone
will introduce a new bad URI into a MARC record. It is better to simply
drop a single bad triple instead of losing 50k triples from the same batch.

I was surprised that riot didn't help here, particularly since it has
the --nocheck option, and --stop is not the default mode of operation.

I could use unix tools like grep, awk and/or sed to check for bad URIs
and fix or filter them on the fly, but it's nontrivial - I might miss an
edge case somewhere. I thought it would be better if I could use the
same tool that already validates URIs/IRIs to also reject the bad triples.

What is --nocheck in riot supposed to do, if it has no effect in this case?

The --skip option seems to be half-implemented, do you (or anyone else)
know why?

I can try to patch up the code if it's obvious what should be done.
Right now I'm a bit confused about how the options are supposed to work
and whether there's a bug somewhere, or just a missing feature.

-Osma
Rob Vesse
2016-10-27 09:19:42 UTC
Skipping bad data in parsers tends to be a non-trivial problem, particularly with more complex formats. Most parsers, whether hand-written or generated, are based on tokenising the input stream into discrete recognisable tokens, using the grammar rules to decide what kind of token is expected next. In the event that you hit a bad token you then need to recover somehow. In practice this usually means discarding tokens and/or input until you reach a point where you can safely restart parsing. For N-Triples this is relatively easy since you can simply read to the next new line.

However, in many other formats it is difficult to impossible to successfully recover from errors, particularly in the case of formats with global state, e.g. prefix mappings, because if you skip over a section of invalid data that would have changed the global state, your interpretation of the rest of the data might be completely incorrect.
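
To make that concrete for the N-Triples case (a sketch only, not how riot
is implemented, and the helper names are made up): because every statement
ends at a newline, a recovering reader can attempt each line independently
and resynchronise at the next line after a failure. The regexes below are a
deliberately crude stand-in for a real tokenizer:

--cut--
import re
import sys

# Crude per-line shape check standing in for a real N-Triples parser:
# IRI or blank node subject, IRI predicate, any term as object, final dot.
IRI = r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*>'
BNODE = r'_:\S+'
LITERAL = r'"(?:[^"\\]|\\.)*"(?:\^\^' + IRI + r'|@[A-Za-z][A-Za-z0-9-]*)?'
STATEMENT = re.compile(r'\s*(?:%s|%s)\s+%s\s+(?:%s|%s|%s)\s*\.\s*$'
                       % (IRI, BNODE, IRI, IRI, BNODE, LITERAL))

def recovering_read(stream):
    """Yield plausible N-Triples lines; on a bad line, report it and
    resynchronise at the next newline instead of giving up."""
    for lineno, line in enumerate(stream, start=1):
        if not line.strip() or line.lstrip().startswith('#'):
            continue                      # blank line or comment
        if STATEMENT.match(line):
            yield line
        else:
            print("skipping bad line %d: %s" % (lineno, line.rstrip()),
                  file=sys.stderr)
--cut--

A caller would then do something like "for line in
recovering_read(open('input.nt')): out.write(line)" and only the offending
lines are lost.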

Rob

Osma Suominen
2016-10-27 09:46:31 UTC
Hi Andy!
Post by Andy Seaborne
Shouldn't the conversion to triples check the URIs for validity? At
[8] IRIREF ::= '<' ([^#x00-#x20<>"{}|^`\] | UCHAR)* '>'
That rule was chosen (by EricP) as a balance between full and expensive
URI checking and some degree of correctness with a regex or simple
scanning check.
Probably it should, but it's a converter developed by the Library of
Congress (https://github.com/lcnetdev/marc2bibframe) and the XQueries
are quite big beasts already. It's not being maintained anymore and I'm
reluctant to change it on my own. Instead I try to work around any
issues by pre- and post-processing my data.
Post by Andy Seaborne
Having bad URIs in the database is, in my experience, a big problem. They
are hard to find later and fix once it is in a database (best way I know
- dump the database to N-Quads and fix the text). Usually, the first
report is when users of the system report issues some time later.
Yes, I'm not planning to put the bad URIs in a database. Instead I try
to get rid of them as soon as possible - either eliminating them at the
source, or failing that, right after the conversion to RDF.
Post by Andy Seaborne
What does your pipeline do about IRI warnings? Or other broken URIs?
Most URIs in the data are generated in the conversion process itself,
using only alphanumeric characters etc. So the problem is really only a
handful of URIs (Web document URLs generally) that were incorrectly
entered into the MARC records.
Post by Andy Seaborne
That's open source for you.
Right - you get to keep both pieces when it breaks. :)

No, really - I'm trying to understand the issue so that I can propose or
even fix things myself.
Post by Andy Seaborne
It is one line to grep for spaces in URIs with the bonus you can write
those lines to a separate file for accurate reporting of problems.
Right. I had this in mind. Except it is not enough to check for spaces,
since there are other kinds of bad URIs as well - I recall seeing at
least unescaped braces in there. But the IRIREF regex is a good starting
point, sure.
Post by Andy Seaborne
It does not need to be an "either/or" - one stage of the pipeline checks
the data (there are other useful checks like all lines end in a DOT),
then parse it to get other checking. All checking does not have to be
bundled into one stage.
Yes, I'm just trying to make this as efficient as possible, within
reason. But definitely this can be broken up into a separate validation step.
Post by Andy Seaborne
Unfortunately, this is a low level syntax (tokenization) issue. I will
put in some code that can be used to change this one case (I'll prepare
the PR in a few minutes; the code exists because I did some maintenance
investigating this yesterday), but you'll encounter other problems as well.
* <http://example/<<<<>>>>>>
* Bad unicode sequences. Quite nasty as reporting the line number is bad
if java conversion to unicode is done. JavaCC has this problem as well.
* Stray newlines: literals and URIs.
<http://example/abc
def> .
"I forgot the
triple quotes"
and these are harder to have any recovery policy for. There is a real
performance/functionality tradeoff here. To be able to skip bad data
(error recovery) is at odds with fast tokenizing and input caching.
Very good examples!
Post by Andy Seaborne
Post by Osma Suominen
The --skip option seems to be half-implemented, do you (or anyone else)
know why?
I am a lazy, good-for-nothing programmer.
Oh really :) I think the above already explains why this hasn't been
implemented. I just happened to notice that someone had at least thought
of a --skip option, even though it wasn't really implemented; that's why
I asked.
Post by Andy Seaborne
The best approach is to add a new parser for N-triples (which is not at
all hard - N-Triples is so simple) which can do recovery, reporting and
splitting the output between good and bad. The current parser can't
output to different places. It should be easy to register it as a
replacement for the standard one.
Okay. I will think about this. But most likely I'll just use a separate
regex validation/filtering step outside Jena.

-Osma
james anderson
2016-10-27 10:24:03 UTC
good afternoon;
as andy noted, the iriref syntax is well suited to regex use and sed is well capable of applying it to effect for line-oriented content.
if the statement constituency does not matter (that is, no literal subjects), then that should be true for any text encoding.

taken as the canonical criterion, it eliminates misgivings about "[missing] an edge case somewhere" while offering the advantage that, if this is not a casual application,
- you have a low implementation threshold for tooling which will operate on effectively unlimited datasets,
- it can produce diffs to allow one to record and report on deficiencies in the initial data,
- repeat the result with less resource expenditure, and even
- reflect on the transformation in order to produce purposeful corrections in order to improve the result quality.

while all of these would be possible to achieve through a process which is integrated into the parser, the effort would likely be greater.
if this is an ongoing production case, an independent transformation stage could even put you in position to reconcile later documents through other channels - eg sparql queries, which otherwise could end up divorced from their intended target terms.

best regards, from berlin,
---
james anderson | ***@dydra.com | http://dydra.com
Osma Suominen
2016-10-31 17:04:46 UTC
Hi all,

I wrote a little Python script to do the N-triples parsing/validation
using a regex as suggested:
https://github.com/NatLibFi/bib-rdf-pipeline/blob/master/scripts/filter-bad-ntriples.py

It doesn't check for absolutely everything (e.g. the formatting of language
tags or datatypes) but it's enough for what I need right now.

The reason I didn't use grep was that I want to both pass through the
valid triples (on stdout) and report the bad ones (on stderr). Maybe sed
could do it too, but this was easier for me.
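
In case it helps anyone else, the core idea fits in a few lines. Here is a
minimal sketch of that kind of filter - just the general shape, not the
actual script linked above - using the IRIREF character rule discussed
earlier in the thread:

--cut--
#!/usr/bin/env python3
# Minimal sketch: copy N-Triples lines whose <...> terms all satisfy the
# IRIREF character rule to stdout, and report the rest on stderr.
import re
import sys

IRIREF = re.compile(r'<(?:[^\x00-\x20<>"{}|^`\\]|\\u[0-9A-Fa-f]{4}'
                    r'|\\U[0-9A-Fa-f]{8})*>')
LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')

def line_is_clean(line):
    """Every '<' outside a quoted literal must open a valid IRIREF."""
    stripped = LITERAL.sub('""', line)   # ignore any '<' inside literals
    pos = 0
    while True:
        pos = stripped.find('<', pos)
        if pos == -1:
            return True
        match = IRIREF.match(stripped, pos)
        if match is None:
            return False
        pos = match.end()

for line in sys.stdin:
    (sys.stdout if line_is_clean(line) else sys.stderr).write(line)
--cut--

Run e.g. as "python3 filter.py < input.nt > good.nt 2> bad.nt" (the file
name is arbitrary): the good triples end up in one file and the rejects in
another, which is handy for reporting.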

Thanks for all the advice!

-Osma

PS. I also found a tool called "reshaperdf" which has a "correct"
command that does a very similar operation - fixing some bad triples and
reporting others. It only checks for spaces in URIs but not e.g. braces,
so it wasn't useful to me without modifications.
https://github.com/linked-swissbib/reshaperdf/blob/master/src/main/java/org/gesis/reshaperdf/cmd/correct/CorrectCommand.java