A structured approach to covert XML into RDF

Jan Wielemaker

Abstract

This document describes a Prolog-based toolkit that supports a structured approach to convert vocabularies and metadata collections from XML into RDF. First, the data is translated automatically from XML to RDF using a generic procedure. Next, we use a graph-rewriting language to translate the initial RDF into its final form. The toolkit provides a web-based interface to analyse the data in all its stages.

1 Introduction

2 Converting XML into RDF

3 Default XML name mapping

4 Subsequent meta-data mapping

4.1 Fix the node-structure

4.2 Re-establish internal links

4.3 Re-establish external links

4.4 Create a mapping schema

4.5 Assign URIs to blank nodes where applicable.

5 Enriching the crude RDF

5.1 Renaming resoures (or naming blank-nodes)

5.2 Putting triples in another graph

5.3 Utility predicates

6 Putting it all together (examples)

6.1 Deleting a triple

6.2 Preserving info about unknown creators

6.3 Negation

7 Running the toolkit

1 Introduction

Many datasets are transferred as XML, providing a tree-based datamodel that is purely syntactic in nature. Semantic processing is standardised around RDF, which provides a graph-based model. In the transformation process we must identify syntactic artifacts such as meaningless ordering in the XML data, lacking structure (e.g., the creator of an artwork is not a literal string by a person identified by a resource with properties) and overly structured data (e.g. the dimension of an object is a property of the object, not of some placeholder that combines physical properties of the object). These syntactic artifacts must be translated into a proper semantic model where objects and properties are typed and semantically related to common vocabularies such as SKOS and Dublin Core.

This document describes our toolkit for supporting this transformation process, together with examples taken from actual translations. The toolkit is implemented in SWI-Prolog and can be downloaded using GIT from one of the addresses below. Running the toolkit requires SWI-Prolog, which can be downloaded from http://www.swi-prolog.org for Windows, MacOS and Linux or in source for many other platforms.

git://eculture.cs.vu.nl/home/git/econnect/xmlrdf.git
http://eculture.cs.vu.nl/git/econnect/xmlrdf.git

The graph-rewrite engine is written in Prolog. This document does not assume any knowledge about Prolog. The rule-language is, as far as possible, a clean declarative graph-rewrite language. The transformation process for actual data however can be complicated. For these cases the rule-system allow mixing rules with arbitrary Prolog code, providing an unconstrained transformation system. We provide a (dynamically extended) library of Prolog routines for typical conversion tasks.

2 Converting XML into RDF

The core idea behind converting `data-xml' into RDF is that every complex XML element maps to a resource (often a bnode) and every atomic attribute maps to an attribute of this bnode. Such a translation gives a valid RDF document, which is much easier to access for further processing.

There are a few places where we must be more subtle in the initial conversion. First, the XML reserved attributes:

The xml:lang attribute is kept around and if we create an RDF literal, it is used to create a literal in the current language.
xmlns declarations are ignored (they make the declarations available to the application, but the namespaces are already processed by the XML parser).

Second, we may wish to map some of our properties into rdfs:XMLLiteral or RDF dataTypes. In particular the first must be done in the first pass to avoid all the complexities of turning the RDF back into XML (think of the above mentioned declarations, but ordering requirements can make this fundamentally impossible).

Because this step needs type information about properties, we might as well allow for some very simple transformations in the first phase. These transformations are guided by the target RDF schema. The transformation process can add additional properties to the target RDF properties and RDF classes. The property is called map:xmlname, where the map prefix is currently defined as http://cs.vu.nl/eculture/map/. If this property is associated to a class, an XML element with the defined name is translated into an instance of this class. If it is associated to a property, it affects XML attribute or atomic element translation in two ways:

It uses the RDF property name rather than the XML name for the property
The rdfs:range of the property affects the value translation:
- If it is rdfs:XMLLiteral, the sub-element is translated to an RDF XMLLiteral.
- If it is an XSD datatype, the sub-element is translated into a typed RDF literal
- It it is a proper class and the value is a valid URI, the URI is used as value without translation into a literal.

Below is an example that maps XML elements record into vra:Work instances and maps the XML attribute title into the vra:title property. Note that it is not required (and not desirable) to add the map:xmlname properties to the actual schema files. Instead, put them in a separate file and load both into the conversion engine.

@prefix   vra: <http://www.vraweb.org/vracore/vracore3#> .
@prefix   map: <http://cs.vu.nl/eculture/map/> .

# Map element-names to rdf:type

vra:Work map:xmlname "record" .

# Map xml attribute and sub-element names to properties

vra:title map:xmlname "title" .

3 Default XML name mapping

The initial XML to RDF mapper uses the XML attribute and tag-names for creating RDF properties. It provides two optional processing steps that make identifiers fit better with the RDF practice.

It can add a prefix to each XML name to create a fully qualified URI. E.g.,

?- rdf_current_ns(ahm, Prefix),
   load_xml_as_rdf('data.xml',
                   [ prefix(Prefix)
                   ]).

It `restyles' XML identifiers. Notably identifiers that contain a dot (.) are hard to process using Turtle. The library identifies alphanumerical substrings of the XML name and constructs new identifiers from these parts. By default, predicates start with a lowercase letter and each new part starts with an uppercase letter, as in oneTwo. Types (classes) start with an uppercase letter, as in OneTwo. This behaviour can be controlled with the options predicate_style and class_style of load_xml_as_rdf/2.

4 Subsequent meta-data mapping

Further mapping of meta-data consists of the following steps:

Fix the node-structure.
Re-establish internal links.
Re-establish external links.
Create a mapping schema that link the classes and predicates of the source to the target schema (e.g., Dublin Core).
Assign URIs to blank nodes where applicable.

4.1 Fix the node-structure

Source-data generally uses a record structure. Sometimes, each record is a simple flat list of properties, while in other cases it has a deeply nested structure. We distinguish three types of properties:

Properties with a clear single literal value, such as a collection-identifier. Such properties are directly mapped to RDF literals.
Properties with instance-specific scope that may have annotations. Typical examples are the title (multiple, translations, who has given the work a title, etc.) or a dimension (unit, which dimension, etc.). In this case, we create a new RDF node for each instance.
Properties that link to external resources: persons (creator), material (linking to a controlled vocabulary), etc. In this case the mapper unites multiple values that have the same properties. E.g., we create a single creator node for all creators found in a collection that have the same name a date of birth.
Addition (bibliographical) information is accumulated in the RDF node.
PROBLEM: sometimes the additional information clarifies the relation of the shared resource to a specific work and sometimes it privides more information about the resource (e.g. place of birth).

For cases (2) and (3) above, each metadata field has zero or more RDF nodes that act as value. The principal value is represented by rdf:value, while the others use the original property name. E.g., the AHM data contains

Record title Title .
Record title.type Type .

This is translated into

Record title [ a ahm:Title ;
               rdf:value "Some title" ;
               ahm:titleType Type ;
             ] .

If the work has multiple titles, each title is represented by a separate node.

Because this step may involve using ordering information of the initial XML data that is still present in the raw converted RDF graph, this step must be performed before the data is saved.

4.2 Re-establish internal links

This step is generally trivial. Some properties represent links to other works in the collection. The property value is typically a literal representing a unique identifier to the target object such as the collection identifier or a database key. This step replaces the predicate-value with an actual link to the target resource.

4.3 Re-establish external links

This step re-establishes links from external resources such as vocabularies which we know to be used during the annotation. In this step we only make mapping for which we are absolutely sure. I.e., if there is any ambiguity, which is not uncommon, we maintain the value as a blank node created in step (1).

4.4 Create a mapping schema

It is adviced to maintain the original property- and type-names (classes) in the RDF because this

Allows to reason about possible subtle differences between the source-specific properties and properties that come from generic schemas such as Dublin Core. E.g., a creator as listed for a work in a museum for architecture is typically an architect and the work in the museum is some form of reproduction on the real physical object. If we had replaced the original creator property by dcterms:creator, this information is lost.
It makes it much easier to relate the RDF to the original collection data. One of the advantages of this is that it becomes easier to reuse the result of semantic enrichment in the original data-source.

This implies that the converted data is normally accompagnied by a schema that lists the properties and types in the data and relates them using rdfs:subPropertyOf or rdfs:subClassOf to one or more generic schemas (e.g., Dublic Core). ClioPatria provides a facility to compute a schema for a graph from the actual data. This schema can be used as a starting point. To get this schema, open the ClioPatria web-interface, Use Places/Graphs to locate the graph and choose the option Compute a schema for this graph and Show the result as Turtle.

4.5 Assign URIs to blank nodes where applicable.

Any blank node we may wish to link to from the outside world needs to be given a real URI. The record-URIs are typically created from the collection-identifier. For other blank nodes, we look for distinguishing (short) literals.

5 Enriching the crude RDF

The obtained RDF is generally rather crude. Typical `flaws' are:

It contains literals where it should have references to other RDF instances
One probably wants proper resources for many of the blank nodes.
Some blank nodes provide no semantic organization and must be removed.
At other place, intermediate instances must be created (as blank nodes or named instances).
In addition to the above, some literal fields need to be rewritten, sometimes to (multiple) new literals and sometimes to a named or bnode instance.

Our rewrite language is a production-rule system, where the syntax is modelled after CHR (a committed-choice language for constraint programming) and the triple notation is based on Turtle/SPARQL. There are 3 types of production rules:

Propagation rules add triples
Simplication rules delete triples and add new triples.
Simpagation rules are in between. They match triples, delete triples and add triples,

The overall syntax for the three rule-types is (in the order above):

<name>? @@ <triple>* ==> <guard>? , <triple>*.
<name>? @@ <triple>* <=> <guard>? , <triple>*.
<name>? @@ <triple>* \ <triple>* <=> <guard>? , <triple>*.

Here, <guard> is an arbitrary Prolog term. <triple> is a triple in a Turtle-like, but Prolog native, syntax:

{ <subject> , <predicate> , <object> }

Any of these fields may contain a variable, written as a Prolog variable: an uppercase letter followed by zero or more letters, digits or the underscore. E.g., Hello, Hello_world, A9. Resources are either fully (single-)quoted Prolog atoms (E.g. 'http://example.com/me', or terms of the form <prefix> : <local>, where <prefix> is a defined prefix (see rdf_register_ns/2) and <local> is a possible quoted Prolog atom. E.g., vra:title or ulan:'Person' (note the quotes to avoid interpretation as a variable). Literals can use a more elaborate syntax:

<string> ^^ <type>
<string> @ <lang>
<string>
literal(Atom)

Here, <string> is a double-quoted Prolog string and <type> is a resource. The form literal(Atom) can be used to match the text of an otherwise unqualified literal with a variable. I.e.,

{ S, vra:title, literal(A) }

has the same meaning as the SPARQL expression ?S vra:title ?A FILTER isLiteral(?A),

Triples in the condition side can be postfixed using '?', in which case they are optional matches. If the triple cannot be matched, triples on the production-side that use the variable are ignored.

Triples in the condition can also be enclosed in a Prolog list ([...]), In this case, the triples are requested to be in the order specified. Ordering is not an official part of the RDF specs, but the SWI-Prolog RDF store maintains the order of triples in generated in the XML conversion process. An ordered set can match multiple times on a given subject, where it AB can match both AAABBB and ABABAB. Both forms appear in real-world XML data.

Finally, on the production side, the object can take this form:

bnode([ {<predicate> = <object>}
      ],
      [ {<option>}
      ])

This means, `for the object, create a bnode from the given <predicate> = <object> pairs'. The <option>s guide the process. At this moment, there is only one option with two values:

share_if(equal)
share_if(equal([<predicate>*]))

Without any option, each execution of the rule creates a new bnode. With the share_if option equal, it uses the same bnode-id for all productions that produce the same predicate-object list (in canonical order, after removing duplicates). Using the last form, it considers two blank nodes equal if they have the same triples on the given predicates. All other predicates are simply added to the blank-node.

5.1 Renaming resoures (or naming blank-nodes)

The construct {X} can be used on the condition and action side of a rule. If used, there must be exactly one such construct, one for the resource to be deleted and one for the resource to be added. All resources for which the condition matches are renamed. Below is an example rule. The first triple extracts the identifier. This triple must remain in the database. The =|\ {A}=| binds the (blank node) identifier to be renamed. The two Prolog guards verify that the resource is a blank node and generate an identifier (URI). The action ({S}) gives the rule engine the URI that must be given to the matched =|{A}=|.

work_uris @@
{ A, vra:'idNumber.currentRepository', ID } \ {A} <=>
    rdf_is_bnode(A),
    literal_to_id(ID, ahm, S),
    {S}.

5.2 Putting triples in another graph

Triples created by the action side of a rule are added to the graph that is being rewritten. It is also possible to add them to another graph using the syntax below:

    { S,P,O } >> Graph

E.g., if we want to store the information about person resources that we create in a graph named persons, we can so so using a rule like this:

person @@
{S, creator, Name},
{S, 'creator.date_of_birth', Born} ?,
{S, 'creator.date_of_death', Died} ?,
{S, 'creator.role', Role} ?
        <=>
        Name \== "onbekend",
        name_to_id(Name, ahm, Creator),
        { S, vra:creator, Creator },
        { Creator, rdf:type, ulan:'Person' } >> persons,
        { Creator, vp:labelPreferred, Name } >> persons,
        { Creator, ulan:birthDate, Born }    >> persons,
        { Creator, ulan:deathDate, Died }    >> persons,
        { Creator, ulan:role, Role }	 >> persons.

5.3 Utility predicates

The rewriting process is often guided by a guard which is, as already mentioned, an arbitrary Prolog goal. Because translation of repositories shares a lot of common tasks, we plan to develop a library for these. This section documents the available predicates.

[det]literal_to_id(+LiteralOrList, +NS, -ID)

Generate an identifier from a literal by mapping all characters that are not allowed in a (Turtle) identifier to _. LiteralOrList can be a list. In this case we generate an id for each element in LiteralOrList and append these. A typical usage scenario is to add a type:

literal_to_id(['book-', Literal], NS, ID)

Another is to add the label of the parent:

literal_to_id([ParentLit, '-', Literal], NS, ID)

To be done: - Verify that the generated URI is unique!
- Remove diacritics for non-iso-latin-1 text

name_to_id(+Literal, +NS, -ID)

Similar to literal_to_id/3, but intended to deal with person names.

To be done: Now simply the same as literal_to_id/3

6 Putting it all together (examples)

Below we give some rules that we wrote to convert real data.

6.1 Deleting a triple

Sometimes XML contains data that simply means `nothing'. We want to delete this data:

{_, creator, "onbekend" } <=>
    true.

Now, in the data from which this was extracted, this is a bit too crude because some records keep data about the creator even though his/her name is not known. Therefore, we preceed the rule with the rule of the next section. Note that the order of rules matter: a rule is executed before the next one. In this particular case we could have removed the {S, creator, "onbekend"} triple from the example below to make it match after the rule above is executed.

6.2 Preserving info about unknown creators

The example below deals with entries in the database where the `creator' is unknown (Dutch: onbekend), but some properties are known about him or her. The remainder of the condition matches possible information about this creator using an `optional' match. The guard verifies there is at least some information about our unknown creator. The action part of the rule associates a new blank node as a creator.

creator_onbekend @@
{S, creator, "onbekend"},
{S, 'creator.date_of_birth', Born} ?,
{S, 'creator.date_of_death', Died} ?,
{S, 'creator.role', Role} ?
        <=>
        at_least_one_given([Born, Died, Role]),
        { S, vra:creator,
          bnode([ ulan:birthDate = Born,
                  ulan:deathDate = Died,
                  ulan:role = Role
                ])
        }.

at_least_one_given(Values) :-
        member(V, Values),
        ground(V), !.

6.3 Negation

Negation is only provided as Prolog negation--by-failure in the guard. This implies that we cannot use the {...} triple notation to test on the absence of a triple, but instead we need to use the SWI-Prolog RDF-DB primitive rdf/3. For example, to delete all person records that have no name, we can use the rule below. The first triple verifies the record-type. The second matches all triples on that record and the guard verifies that the subject has no triples for the property ahm:name.

delete_no_name @@
{ S, rdf:type, ahm:'Person' },
{ S, _, _ }
        <=>
        \+ rdf(S, ahm:name, _).

7 Running the toolkit

Currently, there is no well-defined workflow for running the tools. The files run.pl and rewrite.pl contain a skeleton that I use to convert the data from AHM (Amsterdams Historisch Museum). The file run.pl loads relevant background data and defines run/0 to call the initial converter. The relevant steps of the initial converter are to load VRA and mapping.ttl that contains the map:xmlname declarations discussed above. Next, we load the XML into crude RDF using the call below. The options specify that the input in XML without namespaces (dialect xml rather than xmlns) and that the file contains XML elements named record as the desired unit of data for conversion.

run(File) :-
        load_xml_as_rdf(File,
                        [ dialect(xml),
                          unit(record)
                        ]).

The result can be browsed by typing ?- triple20.

The file rewrite.pl scripts the rewrite phase. It sets up namespaces, calls to the rewrite predicates with the proper arguments and finally provides the rules. Here are the toplevel predicates:

rewrite: Apply all rules on the graph data
rewrite(+Rule): Apply the given rule on the graph data
rewrite(+Graph, +Rule): Apply the given rule on the given graph.
rewrite(:To, +From): Invoke the term-rewriting system
list_rules: List the available rules to the console.

Below is an example run, showing all available rules and running a single rule. The example demonstrates that rules are applied until a fixed-point is reached (i.e., the RDF database does not change by applying the rules).

?- [rewrite].
true.

?- list_rules.
Defined RDF mapping rules:

        title_translations
        dimension
        work_uris
        creator_sequence
        creator_onbekend
        delete_unknown_creator
        delete_empty_literal
        creator
        material_aat
        related_object

true.

?- rewrite(delete_empty_literal).
% Applying ... delete_empty_literal (1)
% 0.100 seconds; 23,456 changes; 2,008,860 --> 1,985,404 triples
% Step 1: generation 2,020,746 --> 2,044,202
% Applying ... delete_empty_literal (1)
% 0.000 seconds; no change
% Step 2: generation 2,044,202 --> 2,044,202
true.

Index

L
list_rules/0
literal_to_id/3
N
name_to_id/3
R
rewrite/0
rewrite/1
rewrite/2