2014/10/14

back to the beginning ... a library module

It has been a while since my last post. But like my friend who blogs here - check it out, he has style - I just can't stop doing it.

In my last post I presented the KBOData site. For all its fancy features, the real work is to get all the raw data (1.12 Gigabytes worth of csv files) into the correct format for the - monthly - database load. The database is Stardog so the csv has to be transformed into one of the rdf formats. Turtle was selected.

For your information, the 1.12 Gigabytes of csv gets turned into 9.68 Gigabytes worth of triples.

Now there are a lot of tools available in NetKernel and we could definitely write our own csv-processor, but there are good libraries out there. I selected Super CSV and created a library module with it. A library module provides - in case that wasn't clear - functionality to other modules.

I'm not going to discuss the whole module (which you can find here, the module name is urn.org.elbeesee.supercsv), if you've followed the Back to the beginning series most of it should be familiar. I am going to discuss the new stuff though.

I removed the class file and the supercsv jar file before checking the module into Github (both to save space on Github and to avoid errors due to a different environment). This means the module will not work as is, you'll need to compile it yourself.

One. The version in module.xml matches the version of the Super CSV jar file (2.2.0 at the time I write this). This is good practice when you wrap 3rd party software (as we are doing here).

Two. The module contains a lib directory underneath the module's root. This is where we're going to put the 3rd party jars. In this case super-csv-2.2.0.jar which you can get from the Super CSV download.

Three. We add functionality. The active:csvfreemarker accessor takes a csv file as input, applies a freemarker template to each line and writes the output to a different file. It assumes the first row of the csv contains the column headers.

We could export the Super CSV classes so that they can be used directly in other modules. While there may be cases where this is useful, this often quickly leads to classloader hell. Keeping the 3rd party functionality wrapped inside is the best way to go.

Four. The accessor itself contains nothing special (you'll find it - minus the freemarker processing - in the examples on the Super CSV site).

Five. The active:csvfreemarker response just mentions the input file has been processed. It is the side-effect (the output file) that we are interested in, the response is expired immediately.

Six. A unittest is provided. You need to replace the values for infile, outfile and the stringEquals value with your own. The input file could for example contain this :

firstname,lastname
tom,geudens
peter,rodgers
tony,butterfield
tom,mueck
rené,luyckx


Which will result in this output file :
geudens tom
rodgers peter
butterfield tony
mueck tom
luyckx rené
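The behaviour can be sketched in plain Java. This is a simplified stand-in (naive comma splitting, ${name} placeholders), not the module's actual code, which uses Super CSV for the parsing and Freemarker for the templating :

```java
import java.util.*;

// Simplified sketch of the active:csvfreemarker idea : read a header row,
// then apply a template with ${column} placeholders to every data row.
// The real module uses Super CSV and Freemarker ; this stand-in only
// handles plain comma-separated values.
public class CsvTemplateSketch {

    // Substitute each ${column} placeholder with the row's value.
    static String apply(String template, Map<String, String> row) {
        String out = template;
        for (Map.Entry<String, String> e : row.entrySet()) {
            out = out.replace("${" + e.getKey() + "}", e.getValue());
        }
        return out;
    }

    // First line contains the column headers, every other line is a row.
    static List<String> process(List<String> csvLines, String template) {
        String[] headers = csvLines.get(0).split(",");
        List<String> result = new ArrayList<>();
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] values = line.split(",");
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < headers.length; i++) {
                row.put(headers[i], values[i]);
            }
            result.add(apply(template, row));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> csv = Arrays.asList("firstname,lastname", "tom,geudens", "peter,rodgers");
        // prints "geudens tom" and then "rodgers peter"
        for (String line : process(csv, "${lastname} ${firstname}")) {
            System.out.println(line);
        }
    }
}
```

The output shown above presumably corresponds to a template along the lines of ${lastname} ${firstname}.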


Note that Freemarker allows a lot more in its templates than is shown in the unittest. Here's one from KBOData :

<#if CONTACTTYPE == "TEL">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasTelephone> "${VALUE}" .
<#elseif CONTACTTYPE == "EMAIL">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasEmail> "${VALUE}" .
<#elseif CONTACTTYPE == "WEB">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasURL> "${VALUE}" .
</#if>


Seven. Usage. In fact the unittest shows how to use the library module. You import the public space and provide freemarker templates as res:/resources/freemarker/[template].freemarker resources. Enjoy !




P.S. I noticed that so far I have assumed that you know how to set up your development environment in order to build a NetKernel module. If this is not the case, there are tutorials for Eclipse and IntelliJ in the documentation.

P.P.S. Applying a Freemarker request to every single line takes - even in NetKernel and even using the lifesaver for batch processing - a while (depending on the size of the input of course). In a next post I'll discuss how we can fan out the requests.

2014/06/10

presenting kbodata

There's a scary thing out there. It can tell computers the chemical composition of your prescription drugs. It can tell computers who's the top scoring soccer player at the World Cup (once that competition gets underway of course). The computers can then act on that information. It is called The Web of Data and like it or not, it will change a lot of things in our everyday lives.

While it uses the same technology as The Web, it is different in that it is not made for human consumption but for machine consumption. Everybody reading this blog knows what Wikipedia is, but did you know about DBPedia? The difference ? Well, a human can make sense of a regular webpage and can infer meaning from it. A machine can do no such thing. It needs the data in a structured format and it needs to be told what the meaning of the data is.

There is a steady movement towards making more and more data publicly available. Tim Berners-Lee (yes, him again) described a five-star system for publishing data in this way. For once governments are leading the movement (often because there are regulations that make it mandatory for them to open up their data) although more and more corporations are joining every day.
So, when the Belgian government decided to publish the Belgian company data (KBO/BCE/CBE) as three-star data (csv files), Paul Hermans and myself decided to add a couple of stars. We created a KBOData showcase :
  • Paul added a semantic layer, turning the csv files into rdf. NetKernel is used to do the actual transformation in batch.
  • The resulting triples are stored into Stardog.
  • Based on input from Paul, I developed a(n almost) general purpose module (very little customization needed) in NetKernel for publishing the data.
  • NetKernel also takes care of the caching, both in-memory and persistent where needed.
  • Benjamin Nowack added the user experience (also served on NetKernel), for while it is about machine consumption, a showcase implies there's something to see/do for humans too. Note that what you see as a human is exactly the same as what a machine 'sees', there is no second set of data.
We learned a lot during the process. For one, we seriously underestimated the amount of data (more than 74 000 000 triples/facts). This will lead to more use of paging in a second iteration. NetKernel is a natural match for structured data with lots of transformations (which is what this is all about), but even NetKernel cannot shield against an open-ended request for the whole database.

A bit of cutting edge was added with the fragments-server. Linked Data Fragments is a recent development from Ghent University to make The Web of Data more scalable. So when I say paging, it is very likely that the whole site will be based on fragments in the next iteration.

If you're interested in the finer details and/or want a similar implementation for your data, contact Paul or myself and we'll help you along.

2014/05/28

back to the beginning ... xml recursion language

ROC/NetKernel was originally thought out and developed with XML in mind. Note that it was never (not then, not today) bound to this data exchange format, everything is after all a resource and another format is just a transreption away, but at the time it was the prevailing format. Today JSON is, tomorrow we might lose confidence in braces, ... actually that whole discussion is moot.

Transreption : Lossless transformation of one representation format to another representation format.

If you're a fanboy ... yes, I dropped the word isomorphic from the above definition. That word may mean something to you ; to me it means I like using difficult words.

It'll come as no surprise then that there are quite a few NetKernel batteries for slicing and dicing XML.

Battery : Normally some sort of electrochemical cell, here it means an addition that makes usage of a given thing easier, in this case NetKernel.

The single most powerful of these is the XML Recursion Language (XRL). In order to discuss what it does, here's a small XML snippet (that could be the footer of an HTML page) :
<div id="footer">
    <p>All rights reserved. © <span>2013</span> Elephant Bird Consulting BVBA</p>
</div>


No, my calendar is not behind. This snippet is a (file) resource that I use over and over again as footer for my webpages. Only, I have to manually update it every year, on every server where I use it. Tedious work and I quite often forget to change it here or there.

Here's the same small XML snippet that solves my problem using XRL :
<div xmlns:xrl="http://netkernel.org/xrl" id="footer">
    <p>All rights reserved. © <span xrl:eval="text">active:widgetCurrentYear</span> Elephant Bird Consulting BVBA</p>
</div>


Now, when I use this template in an active:xrl2 request, it in turn requests active:widgetCurrentYear which is a small component that returns the current year.

That's cool, but it gets even better. Consider this template :
<html xmlns:xrl="http://netkernel.org/xrl">
    <xrl:include identifier="res:/elbeesee/demo/xrl/header"/>
    <xrl:include identifier="res:/elbeesee/demo/xrl/body"/>
    <xrl:include identifier="res:/elbeesee/demo/xrl/footer"/>
</html>

Do you see ? When we request active:xrl2 with this template, it will request (and include) the specified resources. Our footer snippet could be the last one. And this is where the recursion comes in. Automagically it will then request active:widgetCurrentYear. And so on, as deep as you care to go !
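To make the recursion tangible, here is a toy model in plain Java. A map stands in for the resource space and [[identifier]] markers stand in for the xrl attributes ; NetKernel's actual implementation looks nothing like this :

```java
import java.util.*;

// Toy model of the XRL recursion idea : a "resource space" maps
// identifiers to templates, and every [[identifier]] marker in a
// template is resolved recursively before being spliced in.
public class XrlSketch {

    static String resolve(Map<String, String> space, String template) {
        String out = template;
        int start;
        while ((start = out.indexOf("[[")) >= 0) {
            int end = out.indexOf("]]", start);
            String identifier = out.substring(start + 2, end);
            // The included resource may itself contain markers,
            // so resolve it recursively before splicing it in.
            String included = resolve(space, space.get(identifier));
            out = out.substring(0, start) + included + out.substring(end + 2);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> space = new HashMap<>();
        space.put("res:/demo/footer", "(c) [[active:currentYear]] Elephant Bird");
        space.put("active:currentYear", "2014");
        // prints "<html>(c) 2014 Elephant Bird</html>"
        System.out.println(resolve(space, "<html>[[res:/demo/footer]]</html>"));
    }
}
```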

By the way, it's active:xrl2 because NetKernel 3 contained a quite different version of the tool which is kept available for legacy reasons.

If you want the example, the basic bits (just for the footer, you can no doubt add the complete page with header and body yourself) can be found in my public github archive, you'll need the following modules :
  • urn.com.elbeesee.tool.widget
  • urn.com.elbeesee.demo.xrl
Enjoy !

2014/05/14

lifesaver for batch processing

It's been a while since the last post, I've been quite occupied with KBOData, more information on that will follow soon (here and on http://practical-linkeddata.blogspot.com). Today a short tip.

Batch processing. At some point in my IT career it occupied all of my time. Batch processing on mainframe (using PL/I and IDMS) to be exact. The performance we got back then is unmatched by anything I've seen since, you just can't beat the big iron as far as batch processing goes.

Standard ROC processing isn't optimized for batch processing. Look at it this way ... say you request the resource that is your batch process, then out-of-the-box NetKernel keeps track of every dependency that makes up that resource. In a batch process this can pretty quickly turn nasty on memory usage. And think about it, rarely do you want the result of a batch process to be cached.

It is possible to do very efficient batch processing with ROC though. You can fan out requests asynchronously for example. More on that another time. For now, here's the lifesaver I got from Tony Butterfield yesterday. Not only did it shorten execution time massively, it also kept memory usage down (to next to nothing) :
<request>
  <identifier>active:csv2nt+filein@data:text/plain,file:/var/tmp/input/address.csv+fileout@data:text/plain,file:/var/tmp/output/address.ttl+template@data:text/plain,address</identifier>
  <header name="forget-dependencies">
    <literal type="boolean">true</literal>
  </header>
</request>


What this exact request does is not so important (it converts a huge csv file into rdf syntax using a freemarker template on each line), what is important is the header. For that header makes the difference between "batch sucks on NetKernel" and "batch roc(k)s on NetKernel". Thanks Tony !

2014/03/08

back to the beginning ... construct

I had a couple of constructive discussions after my last post. A couple of doubts were raised about the reality of what I said.

There's only one way to answer those doubts and that is by showing you. So when I was revising some stuff for a customer earlier this week, I reconstructed one component for discussion here. Before I go into the details (and show you where you can find it), allow me to say a couple of words about APIs on the Internet.


Well, like most of the Apis mellifera family, APIs on the Internet have quite a sting but unlike bees, they regularly sting again (wasp like), as anybody trying to keep up with the Google, Facebook, Twitter, Dropbox, <you name it>, APIs can attest to.

However, for all their perceived flaws (which I won't go into) they are a step towards a programmable web and I deal with them on a frequent basis. What I typically do in order to use them is construct a component that :
  1. Shields the complexity.
  2. Constrains the usage.
  3. Improves the caching.
As an example, I created the urn.com.elbeesee.api.rovi module which is now available on Github. The Rovi (http://developer.rovicorp.com) API provides metadata for movies and music products.

Note that I only provided the source, if you want to use it you'll have to build the Java code. If this is making too much of an assumption on my part, contact me and I'll walk you through, no problem. If I get lots of requests, I'll blog about that next time.

You'll notice that the module provides two accessors. One - active:rovimd5 - is private and computes the signature necessary to make requests. The other - active:rovirelease - is public, takes an upcid as argument and provides access to the Rovi Release API.

In order to use active:rovirelease it needs to be able to find three resources when you issue a request to it : rovi:apikey, rovi:secretkey and rovi:expiry.

The first two are obvious and it is obvious why I'm not providing those in the module itself. The third one may be less obvious, but you'll note the following in the code :

rovireleaserequest.setHeader("exclude-dependencies", true); // we'll determine cacheability ourselves
 

When making the actual request to Rovi I ignore any caching directives that come my way. And on the response I do the following :

vResponse.setExpiry(INKFResponse.EXPIRY_MIN_CONSTANT_DEPENDENT, System.currentTimeMillis() + vRoviExpiry);

Two questions that can be raised about this are :
  1. Why are you doing this ?
  2. Is this legal ?
The Why is easy to answer. It is my business logic that should decide how quickly it wants updates, not a 3rd party API that wants to make money out of my requests. 

The legal aspect is not so clear and you should carefully read what the terms of usage are. Note however that I am not persisting any results from the API, I'm just letting the business logic dictate how long they are relevant in memory (and since memory is limited the distribution of the requests will determine which results remain in memory and which do not).
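The idea - the caller's business logic sets the in-memory lifetime, regardless of what the API's caching directives say - can be sketched like this (names and structure are illustrative, not taken from the module) :

```java
import java.util.*;

// Sketch of the idea behind setting the expiry ourselves : the caller,
// not the remote API, decides how long a result stays valid in memory.
public class TtlCacheSketch {
    static class Entry {
        final String value;
        final long expiresAt;
        Entry(String value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final long ttlMillis;

    TtlCacheSketch(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Returns the cached value, or null once the business-defined TTL has passed.
    String get(String key, long now) {
        Entry e = cache.get(key);
        return (e != null && now < e.expiresAt) ? e.value : null;
    }

    void put(String key, String value, long now) {
        // Business logic dictates the expiry, whatever the API said.
        cache.put(key, new Entry(value, now + ttlMillis));
    }

    public static void main(String[] args) {
        TtlCacheSketch c = new TtlCacheSketch(1000);
        c.put("release:123", "metadata", 0);
        System.out.println(c.get("release:123", 500));   // prints "metadata" : still valid
        System.out.println(c.get("release:123", 2000));  // prints "null" : expired from memory
    }
}
```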

Adding persistence would not be very hard, however especially for paying services you then need to be fully aware of the terms of usage. Contact me for details if you want to know how to add a persistence layer.

Another takeaway from this module is that I throttle active:rovirelease. Granted, this is maybe also something that shouldn't be done in there (as it may depend on your business model) but controlling the flow is an important aspect of using APIs and this is a - simple - way to do it.
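One simple way to control the flow can be sketched as follows (illustrative only, not the module's actual mechanism) : allow at most one request per fixed interval and tell callers how long to wait for the next free slot.

```java
// Minimal sketch of throttling an accessor : one request per fixed
// interval. Time is passed in explicitly to keep the sketch testable.
public class ThrottleSketch {
    private final long intervalMillis;
    private long nextFreeSlot = 0;

    ThrottleSketch(long intervalMillis) { this.intervalMillis = intervalMillis; }

    // Returns how long the caller must wait before issuing its request.
    synchronized long acquire(long now) {
        long wait = Math.max(0, nextFreeSlot - now);
        nextFreeSlot = Math.max(now, nextFreeSlot) + intervalMillis;
        return wait;
    }

    public static void main(String[] args) {
        ThrottleSketch t = new ThrottleSketch(200);
        System.out.println(t.acquire(0));    // prints 0   : first request goes straight through
        System.out.println(t.acquire(0));    // prints 200 : second must wait one interval
        System.out.println(t.acquire(1000)); // prints 0   : plenty of time has passed
    }
}
```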

A last takeaway is that I don't interpret the result of the request in this component, nor do I force it into a format of any kind. And while I will grant that some form of handling after the actual API call could be useful, the takeaway is important : this component shields the API. What is done with the result belongs in another component.

This component is used in production. It is reality. You'll also find that it doesn't contain any kind of magic or clever coding (or a lot of coding at all). And yet it accomplishes quite a few things. The main thing it accomplishes is that it turns an API not under your control into a resource that is.

2014/01/27

back to the beginning ... context

I gather that Thor (aka Peter Rodgers) approves of my Back To The Beginning series. Let me tell you, it is hard going and not a post goes by without me making a couple more assumptions than I wanted and creating a couple more loose ends that I need to tie up. It is mostly context, ROC/NetKernel is what I have been breathing in the past six years. When I look at other evolutions in the IT field (yes, I do) they look as alien to me as ROC/NetKernel might look to you. It's all a matter of point-of-view, of context.

This brings me to a question I got about my last post. Actually, two questions.

The first one was : "Why are you not showing a lot more code ? Things remain at the child's play stage without it, a real application has a lot more code !". While I disagree with the statement, that is a good observation that deserves an answer.

I'm not easily offended and I had a very good email discussion with the person that made the observation. Feel free to contact me if you have questions too !


ROC development is not code-centric. It is about constructing small components (endpoints) that you can compose together and constrain as needed. If you know about shell scripting, you are already familiar with this concept. A lot of components (for example, I used an XSLT transformation in an earlier post) are already available.

The components should be made such that they do one task, one task only. For myself I use the following rule-of-thumb ... if a component requires more than 200 lines of source code, comments and logging included, I have not been thinking things through enough.

Tony Butterfield gave me the following thought exercise to come to grips with it. Would the Linux ls command be a good candidate for a ROC component ? The answer is ... no. The core of the command (directory listing) is fine but it has too many formatting and filtering options. By putting those options in the same component you would take away the chance for the ROC system to optimize them. They should have been in different components.

So the reason there hasn't been a lot of code in this series shows the reality of developing in ROC/NetKernel, it is not me trying to avoid complexity.

The second question was : "What is this aContext-thingie in your Java code ?". Ah, right, oops, I didn't actually discuss coding in the ROC world at all yet, did I ?

Well, to start, if it runs in the JVM, you can potentially use it to write ROC code in. In practice, I find that a combination of Java and Groovy is ideal. Note that I wasn't formally trained in either and that I am pretty proficient in Python (and Jython is an option in NetKernel). However, if 200 lines are all I'm going to write for a given component, I'm not going to require wizardry in any language, right ? So I decided to use Java for writing new components (since NetKernel was developed in it, it is closest to the metal) and I use Groovy as my composition language.

I am quite serious, I can't stand up to any seasoned J2EE developer in a "pure" Java coding challenge and I have great respect for their skills. However, if I'm allowed to use ROC/NetKernel to solve the problem I will go toe to toe with the best.

Writing a new component always follows these four steps :
  1. What have I promised that I will provide ?
  2. What do I need in order to deliver my promise ?
  3. Add value (the thing that makes this special).
  4. Return a response.
When a request is made of an endpoint, you are handed the context of that request to take those steps. This context allows you to pull in the arguments to the request, for example :

String theargument = context.source("arg:theargument",String.class);


It allows you to create new (sub)requests in order to add value, for example :

INKFRequest subrequest = context.createRequest("active:somecomponent");

subrequest.addArgumentByValue("theargument", theargument);
String theresult = (String)context.issueRequest(subrequest);

And finally it allows you to return a response :

context.createResponseFrom(theresult);

When using Java you are given slightly more control over things, but the HelloWorldAccessor with the onSource method from the introduction module is a good starting point. We'll discuss different types of endpoints and verbs in a later post (loose ends again, I know). The same thing in Groovy would look like this :

import org.netkernel.layer0.nkf.*;
import org.netkernel.layer0.representation.*
import org.netkernel.layer0.representation.impl.*;


// context
INKFRequestContext aContext = (INKFRequestContext)context;

//

aContext.createResponseFrom("Hello World");

Due to the way a Groovy program is instantiated you are dropped straight into the equivalent of the onSource method in a Java accessor. Also, the assignment of context to aContext is strictly speaking not necessary, it is a coding practice that allows me to see things correctly in my editor (Eclipse). In any of the available scripting languages you'll always have context available.

So ... why are these two (the Java accessor in the introduction module and the Groovy program above) good starting points but actually bad components ? Because they don't add value, the response is a static string, I could just as well - and did in the introduction module - define a literal resource.

Food for thought, no ?

2014/01/03

back to the beginning ... logging

In the second half of the 1990s I was an IDMS database administrator for Belgium's biggest retailer. When our resident guru almost got killed by the job I got most of the main databases in my care ... and I must admit I ruled supreme. If you've never heard of the BOFH, check him out here and here. I don't know if any of those stories are based on reality, but they are nothing compared to some of the stuff I pulled off.

I hated logging.

Let me place that statement in the correct context. PL/I did not allow asynchronous actions, logging ate processing time. Also, disk storage was not cheap, the estimated costs of storage could and would often kill a project before it even started. Database storage was even more expensive. Migration to tape was an option but it made consultation afterwards very impractical.

This brings me to the why of logging. Opinions may differ but I see only two reasons :
  • Audit
  • Bug fixing by means of postmortem analysis
Audit. You want to know what was requested and - if you're a security auditor - who requested it. This is a - and in my view the only - legitimate reason for logging. However, even back then tools existed that allowed auditing without having to put it in the code. A fellow database administrator that was a bit too curious about the wages of the others and looked them up on the database found that out the hard way.

Bug fixing by means of postmortem analysis. You want to know what the state of the system was at the time of an error. This requires excessive amounts of logging. It did back then and it does today. And let me tell you something ... it's never enough.

You might say I'm an old fart that's not up to speed with the current state of technology. Storage is very cheap, asynchronous logging has become the standard ... and doesn't everybody say that you should log everything because it can be big stash of useful data itself ?

As a matter of fact, they - whoever they are - don't. They mean audit data collected on the edges of your system, not the things you'd typically put in a put skip list or System.out.println.


I still hate logging. And therefore I was very happy that when NetKernel 4 was released, it contained a back-in-time machine. Such was the power that 1060 Research also released a backport for NetKernel 3. This time machine is also known as the Visualizer. When running, it captures the complete (!) state of any request in the NetKernel system. Anybody that has worked with it agrees that it is sheer magic, for you can pinpoint any error almost immediately. Such a powerful tool warrants its own blogpost, so that's for next time.

All personal opinion aside, how does one log in NetKernel then ? Well, let's see how we can add some to our introduction module. First I want to have some audit of the incoming requests. We could write our own transparent overlay - another topic for a future blogpost - for this, but as it happens the HTTP Jetty Fulcrums have everything that's needed.


Open [installationdirectory]/modules/urn.org.netkernel.fulcrum.frontend-1.7.12/etc/HTTPServerConfig.xml in your favorite text/xml editor.
Remove the <!--Uncomment for NCSA Logging line and the matching --> line. You can also change the settings and/or the name of the logfile. Restart NetKernel (this change is not picked up dynamically). You should now find a new logfile under [installationdirectory]/log.

Now try http://localhost:8080/introduction/helloworld-file in your browser. Open up the logfile and you should see something like this (given date, locale and browser differences) :

0:0:0:0:0:0:0:1 -  -  [03/jan/2014:09:42:35 +0000] "GET /introduction/helloworld-file HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:26.0) Gecko/20100101 Firefox/26.0"

That's audit taken care of. Do I have to mention the fact that this logfile can easily be picked up by tools like Splunk and Logstash ?

If you need help with integrations like that, feel free to contact me. I've been there and done that.

Maybe you think I'm full of shit in my rant above or maybe you just have to comply with the rules and regulations of a customer. If so, yes, you can also do explicit logging in NetKernel.

Since the HelloWorldAccessor is the only piece of code we have, it's in there that we'll add it. The onSource method looks like this :

    public void onSource(INKFRequestContext aContext) throws Exception {
        aContext.createResponseFrom("Hello World");
    }


Adding logging is as simple as :

    public void onSource(INKFRequestContext aContext) throws Exception {
        aContext.logRaw(INKFLocale.LEVEL_INFO,"Logging from HelloWorldAccessor");
        aContext.createResponseFrom("Hello World");
    }


You'll notice the logging appears in two places. Firstly in the standard output of the NetKernel process :

I 11:00:34 HelloWorldAc~ Logging from HelloWorldAccessor

Secondly in [installationdirectory]/log/netkernel-0.log :

<record>
  <date>2014-01-03T11:00:34</date>
  <millis>1388743234821</millis>
  <sequence>278</sequence>
  <logger>NetKernel</logger>
  <level>INFO</level>
  <class>HelloWorldAccessor</class>
  <thread>167</thread>
  <message>Logging from HelloWorldAccessor</message>
</record>


Why is this ? Well, the log methods look for a configuration resource. Either you pass this resource in the method, or the resource res:/etc/system/LogConfig.xml is used (if it can be found), or - as a final resort - [installationdirectory]/etc/KernelLogConfig.xml is used. Check it out, it has two handlers.

So, to preempt a couple of questions, if you want a different log for each application, you can. If you want a JSON formatted log, no problem. Another common request these days (for yes, I am up to speed) is that the log messages themselves have to be formatted.

In order to do that, your module requires a res:/etc/messages.properties file resource. An actual file yes, logging is provided at such a low level that not all the resource abstractions are in place yet. The file can contain things like :

AUDIT_BEGIN={"timestamp":"%1","component":"%2", "verb":"%3", "type": "AUDIT_BEGIN"}
AUDIT_END={"timestamp":"%1","component":"%2", "verb":"%3", "type": "AUDIT_END"}


In your code you can then write :

aContext.logFormatted(INKFLocale.LEVEL_INFO, "AUDIT_BEGIN", System.currentTimeMillis(), "HelloWorldAccessor", "SOURCE");

And the results look like this :

I 11:42:03 HelloWorldAc~{"timestamp":"1388745723359","component":"HelloWorldAccessor", "verb":"SOURCE", "type": "AUDIT_BEGIN"}

and :

<record>
  <date>2014-01-03T11:42:03</date>
  <millis>1388745723360</millis>
  <sequence>263</sequence>
  <logger>NetKernel</logger>
  <level>INFO</level>
  <class>HelloWorldAccessor</class>
  <thread>176</thread>
  <message>{"timestamp":"1388745723359","component":"HelloWorldAccessor", "verb":"SOURCE", "type": "AUDIT_BEGIN"}</message>
</record>
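The %1, %2, ... substitution itself is nothing magic. Here is a plain Java stand-in for the mechanic (NetKernel's own formatting code may well differ) :

```java
// Sketch of the %1, %2, ... placeholder substitution used in
// messages.properties. NetKernel does this internally ; this stand-in
// just shows the mechanic.
public class LogFormatSketch {
    static String format(String template, Object... args) {
        String out = template;
        // Replace highest numbers first so %10 is not mangled by %1.
        for (int i = args.length; i >= 1; i--) {
            out = out.replace("%" + i, String.valueOf(args[i - 1]));
        }
        return out;
    }

    public static void main(String[] args) {
        String template = "{\"timestamp\":\"%1\",\"component\":\"%2\", \"verb\": \"%3\"}";
        System.out.println(format(template, 1388745723359L, "HelloWorldAccessor", "SOURCE"));
    }
}
```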


Again I'll mention the fact that these logfiles can easily be picked up by tools like Splunk and Logstash. And there you have it, a complete - and customizable - logging system. To close my post I'm going to talk about the loglevels for a moment, NetKernel provides :
  • INKFLocale.LEVEL_DEBUG
  • INKFLocale.LEVEL_FINEST
  • INKFLocale.LEVEL_FINER
  • INKFLocale.LEVEL_INFO
  • INKFLocale.LEVEL_WARNING
  • INKFLocale.LEVEL_SEVERE
Not only can these be easily matched to ITIL-aware operations systems, you can also turn them off and on in NetKernel itself. This will allow you to save quite a bit of storage ... you never know when that might come in handy ;-).