
Linked Data Cubes - Creating Datasets

Many people and organizations already publish statistical datasets using the RDF Data Cube Vocabulary. As mentioned in the specification, the Data Cube vocabulary can be used for any multi-dimensional data, but I have not yet encountered RDF Data Cube data in the wild that was not statistical in nature.

At Zazuko we help organizations publish statistical data, and we also use RDF Data Cubes in diverse machine learning projects.

Based on our experience I would like to propose some minor changes that make RDF Data Cubes not only more powerful but also easier to publish.

The life cycle of Data Cubes comprises creating, describing and aggregating Data Cubes. Each of these steps should be handled separately:

  • Creating Datasets
  • Describing Datasets (Metadata)
  • Aggregating Datasets

Creating Datasets

In this first blog post, I will have a look at how to create and publish Datasets with the Data Cube Vocabulary. First of all, let's define what the scope of this step is:

  • An Observer generates Observations.
  • Each Observation consists of one or more dimensions, one or more measures, and optional attributes.
  • Multiple Observations can be combined into a Dataset.

Let's keep it at that for a start.

The RDF Data Cube Vocabulary uses one subject per Observation. Dimensions, measures and attributes are attached to the subject of the Observation. To connect an Observation with its Dataset, the Observation has an outgoing link to the Dataset. This works very well for publishing but can cause multiple challenges in follow-up processes.
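
As a minimal sketch of that structure (all IRIs here are made-up examples, not taken from a real dataset), a single Observation could look like this in Turtle:

```turtle
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/> .

# The Dataset itself does not reference its Observations.
ex:waterDataset a qb:DataSet .

# Each Observation is its own subject; dimensions, measures and
# attributes hang off it, and the link points Observation -> Dataset.
ex:obs1 a qb:Observation ;
  qb:dataSet ex:waterDataset ;
  ex:date "2018-01-15"^^xsd:date ;   # dimension
  ex:value 7.8 ;                     # measure
  ex:unit ex:milligramPerLitre .     # attribute
```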

SPARQL exclusive - No follow your nose

If you know the IRI (Internationalized Resource Identifier) of a Dataset, SPARQL is the only way to find its Observations, because the links point from the Observations to the Dataset. Most of the time you will work with data on a SPARQL endpoint or with data dumps that you import into your own SPARQL endpoint, but wouldn't it be nice to also have a dereferenceable version of your data? This can be especially useful for inspecting and debugging Datasets.
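
For illustration (the Dataset IRI is again a placeholder), finding the Observations requires a query along these lines, because nothing on the Dataset itself leads to them:

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Find all Observations that point to the given Dataset.
SELECT ?observation WHERE {
  ?observation qb:dataSet <http://example.org/waterDataset> .
}
```

Dereferencing the Dataset IRI alone gives you no path to its Observations, so following links from the Dataset never reaches them.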

No recomposition of Datasets

Since the link to the Dataset is part of the Observation, an Observation can't be assigned to a different Dataset without changing the Observation itself.

Ideally, I would like to be able to combine multiple Datasets into a bigger one. For example, all European countries independently publish the same kind of statistical data. It should be possible to recompose the Observations into a bigger Dataset for the whole of Europe without creating copies of the data.

In some situations, you might want the opposite: splitting a Dataset into multiple smaller ones. One use case is machine learning, where you want to split a Dataset into a training and a test Dataset. This should be possible without creating copies of the existing data, just by linking to the Observations.

The solution to this problem is very simple: reverse the direction of the link, so that it points from the Dataset to the Observation. New Datasets can then be created by pointing from the new Dataset to existing Observations.
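
A sketch of what this could look like, using a hypothetical ex:observation property (the Data Cube vocabulary itself does not define such a term):

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix ex: <http://example.org/> .

# Each Dataset now links to its Observations.
ex:germanyDataset     a qb:DataSet ; ex:observation ex:obs1, ex:obs2 .
ex:switzerlandDataset a qb:DataSet ; ex:observation ex:obs3 .

# A recomposed Dataset simply links to the same Observations,
# without copying any of the underlying data.
ex:europeDataset a qb:DataSet ;
  ex:observation ex:obs1, ex:obs2, ex:obs3 .
```

The same mechanism covers splitting: a training Dataset and a test Dataset can each link to a disjoint subset of the same Observations.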

Observed by property for the Observations

The origin of the Observation might still be of interest, but at the moment there is no standardized way of stating that within the RDF Data Cube vocabulary. An observed by property attached to the Observation should point to the person, organization or machine (Observer) that created the Observation. The Observer can then carry a more detailed description of how the Observations were measured.
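
Sticking with hypothetical terms (ex:observedBy is not defined in any vocabulary, it just illustrates the idea), this could look like:

```turtle
@prefix qb: <http://purl.org/linked-data/cube#> .
@prefix schema: <http://schema.org/> .
@prefix ex: <http://example.org/> .

ex:obs1 a qb:Observation ;
  ex:observedBy ex:monitoringStation1 .

# The Observer carries the details of how the measurements were taken.
ex:monitoringStation1 a schema:Organization ;
  schema:name "Example monitoring station" .
```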

Examples

Water Quality of Rivers

In the field of Open Data, environmental data is one of my favorite topics. For this reason, I was very happy to work on a project aimed at publishing the data from the Rheinüberwachungsstation Weil am Rhein (monitoring station on the river Rhine at the border of Switzerland and Germany) as linked data. The Observations are RDF Data Cube compatible, and at the same time prepared for the improved structure with the observedBy property. It would be possible to attach that information to the Dataset, but with the information at the Observation level, it's very easy to merge a similar Dataset from a different monitoring station into a bigger Dataset and distinguish them based on the observedBy dimension.

The data is publicly available and browsable, but it contains many Observations. That's why I prepared a SPARQL query that selects only a single day of Observations. Trifid is used to render the results into a browsable table, so they can easily be explored by following links.
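
The published query differs in its exact vocabulary, but as a rough idea of its shape (the dimension and measure IRIs here are placeholders, not the ones used in the published data), such a query restricts the Observations to one date:

```sparql
PREFIX qb:  <http://purl.org/linked-data/cube#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ex:  <http://example.org/>

# Select all Observations of a single day.
SELECT ?observation ?value WHERE {
  ?observation a qb:Observation ;
    ex:date "2018-01-15"^^xsd:date ;
    ex:value ?value .
}
```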

Air Quality (in progress)

A very similar project on my todo list is the data of my air quality sensor, which sends data to luftdaten.info. It's possible to send the data to my own URL via a hook in the firmware. This would allow me to create my own Dataset in order to keep historical data. Others could do the same, and the Datasets could be merged into a bigger Dataset.

Heater Controllers becoming smarter

I have a lot of sensors and some actuators like heater controllers in my smart home. All sensor data is dumped into a SPARQL store using the adapted Data Cube structure. Each sensor has its own Dataset, which allows me to generate nice charts of sensor data trends.

To save energy I reduce the room temperature when I am not using a room, or even the whole apartment. Heating up a room to the right temperature at the right time can be more complex than you might think. Depending on the outside temperature and the temperature of the surrounding rooms, the time needed to heat up a room can vary a lot. A simple approach would be to heat up the room far in advance, but that would waste a lot of energy.

For this use case I have created two machine learning applications:

  • A reinforcement learning application to control the heater controllers
  • A virtual apartment to experiment with the first application faster than real-time

This is where the improvements to the Data Cube structure come into play. To train the neural network of the virtual apartment, I feed it random data from historical Datasets and expect a forecast of the next values.
Besides the training data, I also need some test data. The option to recompose Datasets makes it easier to create training and test Datasets, and it also makes the training more transparent. During the training process, I simply point to the composed training Dataset, without handling CSV files or different versions of the same file.

Adding new data is simple too. I just copy the existing Dataset and add links to the new Observations. Since the reinforcement learning heater controller application may try new things in my real apartment, I can make iterations of the Datasets and simply train the virtual apartment again with the latest Dataset.
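
As a rough sketch (reusing the hypothetical ex:observation property from above), a new Dataset iteration can be created with a SPARQL update that copies the existing links and adds the new Observations:

```sparql
PREFIX ex: <http://example.org/>

# Copy all Observation links from the previous iteration ...
INSERT { ex:trainingDatasetV2 ex:observation ?obs }
WHERE  { ex:trainingDatasetV1 ex:observation ?obs } ;

# ... and link the newly recorded Observations.
INSERT DATA {
  ex:trainingDatasetV2 ex:observation ex:newObs1, ex:newObs2 .
}
```

No Observation data is copied; both Dataset versions keep pointing at the same underlying Observations.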

Summary

In this blog post I have shown how reversing the direction of the link from Observation -> Dataset to Dataset -> Observation makes Data Cubes much more powerful, both in theory and in practice with real-world use cases. It is still possible to keep track of the origin of the Observation with the observed by property.

In a follow-up post, I will take a look at Dataset metadata.