
RDF and Domain Specific Languages (DSL) - A Perfect Match

For those who know me, it is no secret that I am addicted to RDF. It is not only the simplest data model I know but also by far the most powerful one. Since I first heard about it in 2008 I have written a lot of RDF manually, either to prototype how data could be modeled in RDF or to "serialize" some information into RDF, like R2RML mappings. However, writing RDF by hand is not a lot of fun: it can be very verbose, and typos happen easily, even when using serializations like Turtle.

Last year my Zazuko colleague Michael Rauch started talking about so-called Domain Specific Languages (DSLs). He had worked with DSLs at a previous job and thought they might be a good way to both save time and avoid pesky typos by generating RDF instead of writing it by hand. From Wikipedia:

A domain-specific language (DSL) is a computer language specialized to a particular application domain.

[...]

A domain-specific language is somewhere between a tiny programming language and a scripting language and is often used in a way analogous to a programming library. The boundaries between these concepts are quite blurry, much like the boundary between scripting languages and general-purpose languages.

At Zazuko we convert a lot of existing data into RDF. Much of this data is maintained or exported in different data sources and formats, for example relational databases, CSV or Excel files, XML or JSON. We believe that RDF can only be used in real-world production environments when these conversions to RDF are completely automated. This ensures that the conversion can run quickly and in a repeatable fashion, which in turn ensures that the knowledge graph we are building is always up to date. A nice side effect of full automation is that the cost is minimized as well.

To achieve this goal, we are working on multiple abstraction layers, including an automated pipelining system that we will present in a future blog post. In this article we want to show how a DSL can be used to simplify writing and maintaining mappings of non-RDF resources to RDF. These mappings can be written in many ways.

In this blog post, we will focus on R2RML, the language that defines how relational systems can be exposed as RDF. We use it a lot for our customers, and I used to write R2RML mappings manually. The mappings shown in this post are based on the R2RML example, which includes information about the schema of the database we want to map. The following snippet shows how one can map a row of a table to RDF:

@prefix rr: <http://www.w3.org/ns/r2rml#>.
@prefix ex: <http://example.com/ns#>.

<#TriplesMap1>
    rr:logicalTable [ rr:tableName "EMP" ];
    rr:subjectMap [
        rr:template "http://data.example.com/employee/{EMPNO}";
        rr:class ex:Employee;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rr:column "ENAME" ];
    ].

This will map the relational table EMP to RDF and generate the following triples:

<http://data.example.com/employee/7369> rdf:type ex:Employee.
<http://data.example.com/employee/7369> ex:name "SMITH".
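
Each additional column we want to map needs its own rr:predicateObjectMap block. As a rough sketch of how the file grows (ex:jobTitle is a predicate I made up for illustration; JOB is another column of the EMP table in the spec's example), mapping one more column means adding something like this to <#TriplesMap1>:

<#TriplesMap1>
    # additional predicate-object mapping for the JOB column
    rr:predicateObjectMap [
        rr:predicate ex:jobTitle;
        rr:objectMap [ rr:column "JOB" ];
    ].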

As you can see in the mapping example, this is not necessarily fun to write. For large relational databases, these mapping files can easily become thousands of lines long. I have written larger mappings in R2RML, and the longer the file gets, the higher the chance that I introduce errors and reference tables or columns that do not exist in the database. In short, the process is very error-prone, time-consuming and difficult to maintain. Some vendors started creating their own abstractions, one example being the Stardog Mapping Syntax (SMS), which I also used for a while. I liked it better because it felt like writing Turtle templates, but it tied me to Stardog databases and did not offer any tooling support either, so I could still make plenty of typos.

By now you probably get where I am heading: my colleague sat down and started to create a DSL for generating R2RML mappings! With this blog post, we are happy to announce the first public version of what we have already been using in production within Zazuko for about a year. With it I wrote mappings that generate thousands of lines of R2RML in Turtle syntax. I can now maintain the DSL mapping as a text file in a GitHub repository, and the DSL tooling (Xtext in our case) provides code-assist for everything and generates the R2RML mapping file in Turtle.

The DSL for R2RML currently supports:

  • An abstracted syntax tailored to mapping relational tables (also CSVs) to RDF.
  • Full code-assist within the tool. This applies to the DSL itself as well as to mappings to RDF terms (predicates and classes) and table columns. For me this is the most important feature: no one wants to code without good autocompletion these days, and I don't want to map any data without support from whatever tooling I use.
  • Syntax validation for everything. If you mistype something, it will be highlighted in the editor and no new output will be generated until the syntax issues are fixed. This is also super useful to see which mappings need to be adjusted when the table schema changes.
  • R2RML and RML output. At the moment we use it within Zazuko for relational tables & CSV mappings; see the RML sketch after this list.
  • Text-based DSL. You can easily track changes to your mappings with standard tooling like git.
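
To give a rough idea of what the RML output for a CSV source can look like, here is a sketch of the example mapping rewritten with the standard RML vocabulary (the file name EMP.csv is an assumption): the logical table becomes a logical source, and rr:column becomes rml:reference, while the subject and predicate parts stay the same.

@prefix rr:  <http://www.w3.org/ns/r2rml#>.
@prefix rml: <http://semweb.mmlab.be/ns/rml#>.
@prefix ql:  <http://semweb.mmlab.be/ns/ql#>.
@prefix ex:  <http://example.com/ns#>.

<#TriplesMap1>
    # read from a CSV file instead of a database table
    rml:logicalSource [
        rml:source "EMP.csv";
        rml:referenceFormulation ql:CSV
    ];
    rr:subjectMap [
        rr:template "http://data.example.com/employee/{EMPNO}";
        rr:class ex:Employee;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:name;
        rr:objectMap [ rml:reference "ENAME" ];
    ].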

Coming back to our example from the R2RML specification, this is how the mapping looks in our DSL:

map TriplesMap1 from EMPLOYEE {
  subject template "http://data.example.com/employee/{0}" with EMPNO;
  
  types ex.Employee;
  
  properties
    ex.name from ENAME;
}

This is much easier to read, especially when you start to repeat the properties part for all the columns you want to map (see the sketch below). But the fun really starts when you work in Eclipse with autocomplete enabled; the screencast linked further down shows how I would create this mapping.
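
A sketch of such a repeated properties part, reusing the syntax shown above (I am assuming that ex.job resolves to a suitable term, that JOB is declared as a column of the EMPLOYEE table, and that additional properties are simply listed line by line):

map TriplesMap1 from EMPLOYEE {
  subject template "http://data.example.com/employee/{0}" with EMPNO;

  types ex.Employee;

  properties
    ex.name from ENAME;
    ex.job from JOB;
}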

Note that we omitted two things needed to make this a complete, valid mapping: the metadata about which tables and columns exist in the relational database, and which vocabularies we autocomplete against. These can be maintained directly in the mapping file itself or in separate files. The full example can be found on GitHub. Obviously this is something we want to be able to do with introspection in the future, as indicated in our wishlist at the end of this article.

You can also check out our screencast, in which we explain a few additional details and show you how to use it.

If you would like to try our tool, you can install it as a precompiled extension in Eclipse. Step-by-step instructions for the installation and for getting the samples up and running are in the repository with the documentation for the DSL on GitHub.

In case you run into issues, GitHub is also the place where you can file bugs and give us feedback. Note that at the time of writing, the source code of this extension is not public.

We have many ideas of where this extension could and should go, among others:

  • Introspection of the data source: Right now we have to define the structure of our tables manually; this should be relatively easy to generate via JDBC or by inspecting CSV headers.
  • Autocomplete for existing RDF Vocabularies, based on the work we did for our Prefix Server.
  • Support for XML and JSON: We should be able to use the same concepts for XML and JSON as well. There is RML and JSON-LD (in particular the context), which should all be relatively easy to generate from our DSL.
  • Support the DSL with all features like autocomplete outside of Eclipse. We have ideas about how this could be done in other editors like Visual Studio Code, Atom or even pure web browsers (using web IDEs & Xtext without Eclipse).

We will develop these features based on user feedback. Please get in contact with us if you would like to have features prioritized for your data scientists and development teams; we are happy to help!

Note that it is very likely that the syntax will change; that is why the release number is not 1.0 yet. So if you start using it, you might have to adjust your mappings in the future to reflect these changes.