2014/06/10

presenting kbodata

There's a scary thing out there. It can tell computers the chemical composition of your prescription drugs. It can tell computers who the top-scoring soccer player at the World Cup is (once that competition gets underway, of course). The computers can then act on that information. It is called The Web of Data and, like it or not, it will change a lot of things in our everyday lives.

While it uses the same technology as The Web, it is different in that it is made not for human consumption but for machine consumption. Everybody reading this blog knows what Wikipedia is, but did you know about DBpedia? The difference? Well, a human can make sense of a regular webpage and can infer meaning from it. A machine can do no such thing. It needs the data in a structured format and it needs to be told what the data means.

There is a steady movement towards making more and more data publicly available. Tim Berners-Lee (yes, him again) described a five-star system for publishing data in this way. For once, governments are leading the movement (often because regulations make it mandatory for them to open up their data), although more and more corporations are joining every day.
So, when the Belgian government decided to publish the Belgian company data (KBO/BCE/CBE) as three-star data (CSV files), Paul Hermans and I decided to add a couple of stars. We created a KBOData showcase:
  • Paul added a semantic layer, turning the CSV files into RDF. NetKernel is used to do the actual transformation in batch.
  • The resulting triples are stored in Stardog.
  • Based on input from Paul, I developed a(n almost) general purpose module (very little customization needed) in NetKernel for publishing the data.
  • NetKernel also takes care of the caching, both in-memory and persistent where needed.
  • Benjamin Nowack added the user experience (also served on NetKernel), for while it is about machine consumption, a showcase implies there's something to see/do for humans too. Note that what you see as a human is exactly the same as what a machine 'sees', there is no second set of data.
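
To give an idea of what "turning CSV into RDF" means in the first step above, here is a minimal Python sketch. It is not the NetKernel transformation used in the showcase. The base URI, the column names and the sample row are all hypothetical, and every value is emitted as a plain string literal in N-Triples syntax.

```python
import csv
import io

# Hypothetical base URI; the real KBOData vocabulary and identifiers may differ.
BASE = "http://example.org/kbo/"

# Hypothetical sample; the real KBO CSV files have their own columns.
SAMPLE = "EnterpriseNumber,Status,StartDate\n0123.456.789,AC,01-01-2000\n"


def escape_literal(value):
    """Escape the characters that may not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def row_to_triples(row, key="EnterpriseNumber"):
    """Turn one CSV row into N-Triples lines, one per non-empty column."""
    subject = f"<{BASE}enterprise/{row[key]}>"
    triples = []
    for column, value in row.items():
        if column == key or not value:
            continue  # the key becomes the subject; empty cells yield no fact
        predicate = f"<{BASE}def#{column}>"
        triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


def csv_to_ntriples(text):
    """Convert CSV text (with a header row) into a list of N-Triples lines."""
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        lines.extend(row_to_triples(row))
    return lines


if __name__ == "__main__":
    for line in csv_to_ntriples(SAMPLE):
        print(line)
```

Each cell becomes one fact about the enterprise in its row, which is why the number of triples grows so much faster than the number of CSV lines.
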
We learned a lot during the process. For one, we seriously underestimated the amount of data (more than 74 000 000 triples/facts). This will lead to more use of paging in a second iteration. NetKernel is a natural match for structured data with lots of transformations (which is what this is all about), but even NetKernel cannot shield against an open-ended request for the whole database.
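
As an illustration of what paging could look like, the Python sketch below builds SPARQL queries that fetch one bounded page at a time using the standard LIMIT and OFFSET clauses. The page size of 1000 and the catch-all triple pattern are assumptions made for the example, not the queries used by the showcase.

```python
def paged_query(page, page_size=1000):
    """Build a SPARQL query for one page of triples (page numbers start at 0)."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    offset = page * page_size
    # ORDER BY keeps the pages stable between requests; it can be costly on a
    # large store, so a real setup may prefer a narrower pattern per page.
    return (
        "SELECT ?s ?p ?o\n"
        "WHERE { ?s ?p ?o }\n"
        "ORDER BY ?s ?p ?o\n"
        f"LIMIT {page_size}\n"
        f"OFFSET {offset}"
    )


def page_count(total, page_size=1000):
    """Number of pages needed to cover `total` results (ceiling division)."""
    return -(-total // page_size)


if __name__ == "__main__":
    print(paged_query(2, page_size=500))
    print(page_count(74_000_000))
```

With a page size of 1000, 74 000 000 triples come to 74 000 pages, so no single request ever has to return the whole database.
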

We added a bit of cutting-edge technology with the fragments server. Linked Data Fragments is a recent development from Ghent University to make The Web of Data more scalable. So when I say paging, it is very likely that the whole site will be based on fragments in the next iteration.
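
A fragments server answers simple triple-pattern requests page by page, and leaves the heavier query work to the client. The Python sketch below only builds the URL for such a request. The endpoint address is hypothetical, and the parameter names `subject`, `predicate`, `object` and `page` are an assumption: a real client should read them from the hypermedia controls the server returns rather than hard-coding them.

```python
from urllib.parse import urlencode


def fragment_url(base, subject=None, predicate=None, obj=None, page=None):
    """Build a request URL for one page of a triple-pattern fragment.

    The parameter names are assumed for illustration; a real server
    advertises its own names in the responses it sends.
    """
    params = {}
    if subject:
        params["subject"] = subject
    if predicate:
        params["predicate"] = predicate
    if obj:
        params["object"] = obj
    if page is not None:
        params["page"] = page
    query = urlencode(params)  # percent-encodes the IRIs
    return f"{base}?{query}" if query else base


if __name__ == "__main__":
    # Hypothetical endpoint and predicate IRI.
    print(fragment_url(
        "http://example.org/fragments",
        predicate="http://example.org/kbo/def#Status",
        page=2,
    ))
```

Every request matches at most one triple pattern and returns one page, which keeps the cost per request on the server small and predictable.
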

If you're interested in the finer details and/or want a similar implementation for your data, contact Paul or me and we'll help you along.