Isn't SPARQL slow?
How do you scale an RDF-based web application?
These are two questions commonly asked by people and organizations new to RDF. When I started working with RDF, this was one of my main concerns as well. More than a decade later, I'm no longer worried about scaling RDF, largely because I've seen other organizations do it and because we do it for our customers, on large datasets.
In this blog post, we will look at a simple but effective way to reduce the load on any SPARQL endpoint: using an HTTP caching server like Varnish.
If your organization is using RDF, there is a good chance you expose some or all of it using a SPARQL endpoint. SPARQL 1.1 consists of multiple W3C specifications, one of which is the SPARQL 1.1 Protocol. As you can see in this specification, the standard means of transport for talking to SPARQL endpoints is the HTTP protocol and fortunately, this is something that can be cached really well.
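For illustration, here is roughly what a simple SELECT query looks like on the wire when sent with the GET method, as defined by the SPARQL 1.1 Protocol (host and path are placeholders; the query string is the percent-encoded form of `SELECT * WHERE { ?s ?p ?o } LIMIT 10`):

```http
GET /sparql?query=SELECT%20%2A%20WHERE%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20LIMIT%2010 HTTP/1.1
Host: example.org
Accept: application/sparql-results+json

HTTP/1.1 200 OK
Content-Type: application/sparql-results+json
```

From the point of view of an HTTP cache, this is just an ordinary GET request with a query string and an `Accept` header, which is exactly what caching proxies were built for.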
By caching SPARQL HTTP requests, most web applications and many other use cases should become very fast. It also reduces the general load on your SPARQL endpoint. More complicated setups, like adding Elastic as a middleware between your web application and RDF/SPARQL, might no longer be necessary.
Update 23.02.2022: Miel pointed us to this project from the Japanese Database Center for Life Science. This looks like a more versatile but also more complicated approach that might be very interesting for some use cases.
Varnish & SPARQL
One of the most popular HTTP caching servers is Varnish. As described on its homepage:
Varnish Cache is a web application accelerator also known as a caching HTTP reverse proxy. You install it in front of any server that speaks HTTP and configure it to cache the contents. Varnish Cache is really, really fast. It typically speeds up delivery with a factor of 300 - 1000x, depending on your architecture.
That sounds like something we want in front of a popular SPARQL endpoint!
By default, Varnish is pretty easy to set up for simple HTTP GET requests. Once your web application provides the right HTTP headers, caching should work according to the caching policy. However, setting it up for the SPARQL 1.1 Protocol is a bit trickier, for multiple reasons:
- SPARQL queries can be passed using GET or POST. While you might get away with GET (pun intended) for a while, larger SPARQL queries require POST. And sooner or later someone or some service will come up with a larger query.
- SPARQL provides multiple query result formats. There are currently specifications for CSV/TSV, JSON and XML. This means the cached result for the same input query depends on the requested result format, even if the SPARQL query itself is the same.
- Cache invalidation is one of the two hard things in Computer Science. This applies to every SPARQL endpoint as well, so your mileage may vary.
To give you a starting point, we provide a docker image of an opinionated Varnish configuration. Feel free to extend and adjust it according to your needs. As usual, pull requests or other contributions are welcome! We can also provide commercial support for it, contact us for more information. A few things to consider when using it:
- Caching makes sense for read operations. Only use it with the SPARQL 1.1 Query Language and not with SPARQL 1.1 Update. Our configuration cannot tell which kind of SPARQL request you send, so be careful how you use it.
- Varnish can only serve what is cached already. If it's not in the cache, your SPARQL endpoint will have to provide it first. Depending on your workload, you might want to warm up the cache.
- We can only cache requests that are exactly the same. A SPARQL query that is identical from a triple-pattern point of view is not identical for the HTTP caching server if a single character in the query changes. Depending on your SPARQL endpoint, it might still use some internal cache to give you a fast answer, but it won't be considered a cache hit by the HTTP caching server.
- SPARQL results can become very large. Some of our customers provide large OLAP cubes as RDF using the RDF Cube Schema. These datasets can contain millions of triples and, depending on the query, the potential answer could be gigabytes of data. This can be cached by Varnish, but you will need enough available memory for it to be useful.
- It might be a good idea not to cache the default SPARQL endpoint and to provide the cached one as a "virtual/proxy" SPARQL endpoint under a different URL. That way people can interact with the endpoint as they normally would, without worrying about wrong or outdated answers. Once the application and its queries are stable, you can switch to the cached endpoint.
- For large SPARQL results, simply enabling HTTP compression might have a positive impact as well.
- SPARQL query optimization is always a good idea.
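The "exactly the same" caveat can be illustrated with a small sketch: an HTTP cache keys on the raw bytes of the request, not on the query's meaning. The `cache_key` function below is a hypothetical stand-in for what a cache like Varnish does when it hashes the URL and request body:

```python
import hashlib

def cache_key(method: str, url: str, body: bytes = b"") -> str:
    """Build a cache key the way an HTTP cache might: from the raw
    bytes of the request, not from the query's meaning."""
    h = hashlib.sha256()
    h.update(method.encode())
    h.update(url.encode())
    h.update(body)
    return h.hexdigest()

# Two semantically identical SPARQL queries, differing by one space:
q1 = b"SELECT * WHERE { ?s ?p ?o } LIMIT 10"
q2 = b"SELECT *  WHERE { ?s ?p ?o } LIMIT 10"

k1 = cache_key("POST", "https://example.org/query", q1)
k2 = cache_key("POST", "https://example.org/query", q2)
print(k1 == k2)  # False: same triple patterns, but a cache miss
```

If you control the clients, normalizing queries before sending them (consistent whitespace, prefixes and formatting) therefore directly improves your cache hit rate.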
As mentioned before, Varnish will not cache anything unless an HTTP header tells it to do so. Either the SPARQL endpoint or a proxy/frontend like Trifid has to provide at least these two HTTP headers in the response to make sure Varnish properly handles RDF & SPARQL:
- Cache-Control, according to RFC7234
- Vary, according to RFC7231
Our default configuration in Trifid sets:
- Vary: Accept by default, to make sure the content-type is not ignored, and
- Cache-Control: public, max-age=120 for a default cache lifetime of 2 minutes.
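With these defaults in place, a cacheable response from the endpoint would carry headers roughly like this (the content-type shown is just one possible result format):

```http
HTTP/1.1 200 OK
Content-Type: application/sparql-results+json
Vary: Accept
Cache-Control: public, max-age=120
```

The `Vary: Accept` header tells Varnish to keep separate cache entries per requested result format, and `Cache-Control` tells it how long an entry may be served.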
There are additional things we could do in our setup, like setting a grace period for cached requests. Unfortunately, most SPARQL endpoints do not seem to provide an ETag header at the moment, so this can't be implemented yet; see this issue for more details.
Varnish Configuration Details
To help you understand or adjust the image, our colleague Ludovic explains some of its configuration. Some things like backend host and port, timeout, time-to-live and body size are configurable using environment variables; see the configuration section for details.
The configuration itself lives in the file default.vcl. It provides a list of tweaks to make it suitable for caching SPARQL HTTP queries:
- Definition of the backend & timeout
- Remove all cookies from the response, so that responses are cached without any cookie
- Remove incoming cookies & allow caching of POST requests
- Remove incoming cookies & X-Body-Len
- Allow caching of POST requests
- Recalculate the body size, so that we are sure the body does not exceed the body size specified in the configuration
- If the body was too big, empty or missing, just forward the request to the backend
- Otherwise, serve the request from the cache if it exists, or create the cache entry
- Change the hash function, so that a different body leads to a different hash
- Make sure that the backend receives a POST request
- Add a response header to see if it was a cache miss or a cache hit; useful for debugging purposes
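The POST-related tweaks above follow a well-known Varnish pattern based on the std and bodyaccess VMODs. What follows is a condensed, hypothetical sketch of that pattern, not the actual default.vcl; the backend host, port and body-size limit are placeholders:

```vcl
vcl 4.1;

import std;
import bodyaccess;

backend default {
    .host = "127.0.0.1";   # placeholder: your SPARQL endpoint
    .port = "8080";
}

sub vcl_recv {
    unset req.http.Cookie;             # SPARQL results should not vary per user
    unset req.http.X-Body-Len;
    if (req.method == "POST") {
        # Buffer the request body so it can be hashed and replayed to the backend
        std.cache_req_body(110KB);     # placeholder size limit
        set req.http.X-Body-Len = bodyaccess.len_req_body();
        if (req.http.X-Body-Len == "-1") {
            return (pass);             # body too large to buffer: skip the cache
        }
        return (hash);                 # look up POST requests in the cache, like GET
    }
}

sub vcl_hash {
    # Make the request body part of the cache key,
    # so that a different query leads to a different hash
    if (req.http.X-Body-Len) {
        bodyaccess.hash_req_body();
    } else {
        hash_data("");
    }
}

sub vcl_backend_fetch {
    # Varnish rewrites cached POSTs to GET by default; restore the method
    if (bereq.http.X-Body-Len) {
        set bereq.method = "POST";
    }
}
```

The key idea is in vcl_hash: by mixing the request body into the cache key, two POST requests with different SPARQL queries get different cache entries, while a byte-identical repeat of the same query becomes a cache hit.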
From a routing perspective, this is the setup we use: we usually create another subdomain for the caching endpoint. If the uncached endpoint is, for example, at ld.zazuko.com/query, the cached version would be at ld-cached.zazuko.com/query. That way it is easy to see from the name whether an endpoint is cached or not.
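One possible way to wire up such a split is a reverse proxy in front of both endpoints. This is a hypothetical sketch, not our actual setup: the hostnames come from the example above, while the ports and upstream names are assumptions:

```nginx
# Uncached endpoint: straight to the SPARQL backend, always fresh
server {
    listen 80;
    server_name ld.zazuko.com;
    location / {
        proxy_pass http://sparql-backend:8080;  # assumed upstream name/port
    }
}

# Cached endpoint: route through Varnish first
server {
    listen 80;
    server_name ld-cached.zazuko.com;
    location / {
        proxy_pass http://varnish:6081;         # assumed Varnish listen port
    }
}
```

Both virtual hosts end up at the same SPARQL endpoint; the only difference is whether Varnish sits in the middle.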