2014/11/08

ten

When writing, you occasionally make bold statements. Rarely does somebody call my bluff. But somebody did, and so I have to show you ten significantly different Hello World examples.

One
<literal type="string" uri="res:/tomgeudens/helloworld-literal">Hello World</literal>

As I've said before, we should have that one in the list of Hello World programs.

Two
<accessor>
    <id>tomgeudens:helloworld:java:accessor</id>
    <class>org.tomgeudens.helloworld.HelloWorldAccessor</class>
    <grammar>res:/tomgeudens/helloworld-java</grammar>
</accessor>


For the code itself I refer to earlier blogposts.
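
For reference, here is a minimal sketch of what such an accessor looks like, assuming the standard NKF accessor pattern (see the earlier posts for the full walkthrough):

package org.tomgeudens.helloworld;

import org.netkernel.layer0.nkf.INKFRequestContext;
import org.netkernel.module.standard.endpoint.StandardAccessorImpl;

public class HelloWorldAccessor extends StandardAccessorImpl {
    public void onSource(INKFRequestContext aContext) throws Exception {
        // respond with the constant string representation
        aContext.createResponseFrom("Hello World");
    }
}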

Three
<accessor>
    <id>tomgeudens:helloworld:groovy:accessor</id>
    <prototype>GroovyPrototype</prototype>
    <script>res:/resources/groovy/helloworld.groovy</script>
    <grammar>res:/tomgeudens/helloworld-groovy</grammar>
</accessor>

Which requires both the script and the language import:
<literal type="string" uri="res:/resources/groovy/helloworld.groovy">
    context.createResponseFrom("Hello World");
</literal>


<import>
    <!-- contains GroovyPrototype -->
    <uri>urn:org:netkernel:lang:groovy</uri>
</import>


Four
<mapper>
    <config>
        <endpoint>
            <grammar>res:/tomgeudens/helloworld-data</grammar>
            <request>
                <identifier>data:text/plain,Hello World</identifier>
            </request>
        </endpoint>
    </config>
    <space>

        <import>
            <!-- contains data:/ scheme -->
            <uri>urn:org:netkernel:ext:layer1</uri>
        </import>

    </space>
</mapper>


Five
<mapper>
    <config>
        <endpoint>
            <grammar>res:/tomgeudens/helloworld-file</grammar>
            <request>
                <identifier>file:/c:/temp/helloworld.txt</identifier>
            </request>
        </endpoint>
    </config>
    <space>

        <import>
            <!-- contains file:/ scheme -->
            <uri>urn:org:netkernel:ext:layer1</uri>
        </import> 

    </space>
</mapper>


Of course you need to replace the identifier with an existing file of your own.

Six
<mapper>
    <config>
        <endpoint>
            <grammar>res:/tomgeudens/helloworld-fileset</grammar>
            <request>
                <identifier>res:/resources/txt/helloworld.txt</identifier>
            </request>
        </endpoint>
    </config>
    <space>

        <fileset>
            <regex>res:/resources/txt/.*</regex>
        </fileset>

    </space>
</mapper>


Seven
<mapper>
    <config>
        <endpoint>
            <grammar>res:/tomgeudens/helloworld-freemarker</grammar>
            <request>
                <identifier>active:freemarker</identifier>
                <argument name="operator">data:text/plain,${one} ${two}</argument>
                <argument name="one">data:text/plain,Hello</argument>
                <argument name="two">data:text/plain,World</argument>
            </request>
        </endpoint>
    </config>
    <space>

        <import>
            <!-- contains active:freemarker -->
            <uri>urn:org:netkernel:lang:freemarker</uri>
        </import>
 

        <import>
            <!-- contains data:/ scheme -->
            <uri>urn:org:netkernel:ext:layer1</uri>
        </import> 

    </space>
</mapper>


Eight
<mapper>
    <config>
        <endpoint>
            <grammar>res:/tomgeudens/helloworld-http</grammar>
            <request>
                <identifier>http://localhost:8080/tomgeudens/helloworld-literal</identifier>
            </request>
        </endpoint>
    </config>
    <space>

        <import>
            <!-- contains http:/ scheme -->
            <uri>urn:org:netkernel:client:http</uri>
        </import>

    </space>
</mapper>


Which requires that the first example is exposed on the frontend fulcrum.
 
Nine
<mapper>
    <config>
        <endpoint>
            <grammar>res:/tomgeudens/helloworld-xpath</grammar>
            <request>
                <identifier>active:xpath</identifier>
                <argument name="operand">
                    <literal type="xml">
                        <document>Hello World</document>
                    </literal>
                </argument>
                <argument name="operator">
                    <literal type="string">string(/document)</literal>
                </argument>
            </request>
        </endpoint>
    </config>
    <space>

        <import>
            <!-- contains active:xpath -->
            <uri>urn:org:netkernel:xml:core</uri>
        </import>

    </space>
</mapper>


Ten
<mapper>
    <config>
        <endpoint>
            <grammar>res:/tomgeudens/helloworld-dpml</grammar>
            <request>
                <identifier>active:dpml</identifier>
                <argument name="operator">res:/resources/dpml/helloworld.dpml</argument>
            </request>
        </endpoint>
    </config>
    <space>

        <literal type="xml" uri="res:/resources/dpml/helloworld.dpml">
            <sequence>
                <literal assignment="response" type="string">Hello World</literal>
            </sequence>
        </literal>


        <import>
            <!-- contains active:dpml -->
            <uri>urn:org:netkernel:lang:dpml</uri>
        </import>

    </space>
</mapper>



And there you go, ten resource oriented Hello World examples. There are many more possibilities, but I think the above shows both that there is a lot available and that the patterns are always the same. Enjoy.



2014/10/24

back to the beginning ... async 101

Even the most humble of modern laptops today has multiple cores at its disposal. When you work Resource Oriented, you benefit from the fact that resource requests are automatically spread over the available cores. However, within one (root) request you typically make subrequests sequentially. In most cases this is exactly what you want, as one subrequest provides the input for the next ... and so on.

There are cases, however, where you can benefit from parallel processing. A webpage, for example, can be composed from several snippets, which can be requested in parallel. In a previous post I discussed the XRL language:

<html xmlns:xrl="http://netkernel.org/xrl">
    <xrl:include identifier="res:/elbeesee/demo/xrl/header" async="true"/>
    <xrl:include identifier="res:/elbeesee/demo/xrl/body" async="true"/>
    <xrl:include identifier="res:/elbeesee/demo/xrl/footer" async="true"/>
</html>


Another use case for parallel processing is batch processing. In my last post I developed an active:csvfreemarker component. It applies a freemarker template to every csv row in an input file and writes the result to an output file. It works. However, the files I want processed contain millions of rows and applying a freemarker template does take a bit of time. Can parallel processing help? Yes it can! Here's the relevant bit of code:

// keep going until the csv reader is exhausted
while (vCsvMap != null) {
    int i = 0;
    List<INKFAsyncRequestHandle> vHandles = new ArrayList<INKFAsyncRequestHandle>();

    // issue up to 8 freemarker requests asynchronously ...
    while ( (vCsvMap != null) && (i < 8) ) {
        INKFRequest freemarkerrequest = aContext.createRequest("active:freemarker");
        freemarkerrequest.addArgument("operator", "res:/resources/freemarker/" + aTemplate + ".freemarker");
        // pass every csv column as a pass-by-value argument, upper-cased to match the template
        for (Map.Entry<String,String> vCsvEntry : vCsvMap.entrySet()) {
            freemarkerrequest.addArgumentByValue(vCsvEntry.getKey().toUpperCase(), vCsvEntry.getValue());
        }
        freemarkerrequest.setRepresentationClass(String.class);
        INKFAsyncRequestHandle vHandle = aContext.issueAsyncRequest(freemarkerrequest);
        vHandles.add(vHandle);

        vCsvMap = vInReader.read(vHeader);
        i = i + 1;
    }

    // ... then join them in order and write the results to the output file
    for (int j = 0; j < i; j++) {
        INKFAsyncRequestHandle vHandle = vHandles.get(j);
        String vOut = (String)vHandle.join();
        vOutWriter.append(vOut).append("\n");
    }

}

The freemarker requests are issued as async requests in groups of eight. Their results are then processed in order in the for-loop.

Why eight? That number depends on several things: the number of cores available, the duration of each async request, and so on. You'll need to experiment a bit to see what fits your environment/requirements. So actually the number should not be hard-coded. Bad me.
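
If I were to fix that, one option would be to read the batch size from an (optional) accessor argument. A sketch, assuming a hypothetical batchsize argument:

// hypothetical: take the degree of parallelism from a "batchsize" argument, defaulting to 8
int vBatchSize = 8;
if (aContext.getThisRequest().argumentExists("batchsize")) {
    vBatchSize = Integer.parseInt(aContext.source("arg:batchsize", String.class));
}

The inner while-loop condition then becomes (i < vBatchSize).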

2014/10/14

back to the beginning ... a library module

It has been a while since my last post. But like my friend who blogs here - check it out, he has style - I just can't stop doing it.

In my last post I presented the KBOData site. For all its fancy features, the real work is to get all the raw data (1.12 Gigabytes worth of csv files) into the correct format for the - monthly - database load. The database is Stardog so the csv has to be transformed into one of the rdf formats. Turtle was selected.

For your information, the 1.12 Gigabytes of csv gets turned into 9.68 Gigabytes worth of triples.

Now there are a lot of tools available in NetKernel and we could definitely write our own csv-processor but there are good libraries available. I selected Super CSV and created a library module with it. A library module provides - in case that wasn't clear - functionality to other modules.

I'm not going to discuss the whole module (which you can find here, the module name is urn.org.elbeesee.supercsv); if you've followed the Back to the beginning series, most of it should be familiar. I am going to discuss the new stuff though.

I removed the class file and the supercsv jar file before checking the module into Github (both to save space on Github and to avoid errors due to a different environment). This means the module will not work as is; you'll need to compile it yourself.

One. The version in module.xml matches the version of the Super CSV jar file (2.2.0 at the time I write this). This is good practice when you wrap 3rd party software (as we are doing here).

Two. The module contains a lib directory underneath the module's root. This is where we're going to put the 3rd party jars, in this case super-csv-2.2.0.jar, which you can get from the Super CSV download.

Three. We add functionality. The active:csvfreemarker accessor takes a csv file as input, applies a freemarker template to each line and writes the output to a different file. It assumes the first row of the csv contains the column headers.

We could export the Super CSV classes so that they can be used directly in other modules. While there may be cases where this is useful, this often quickly leads to classloader hell. Keeping the 3rd party functionality wrapped inside is the best way to go.

Four. The accessor itself contains nothing special (you'll find it - minus the freemarker processing - in the examples on the Super CSV site).

Five. The active:csvfreemarker response just mentions that the input file has been processed. It is the side-effect (the output file) that we are interested in, so the response is expired immediately.
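
In the accessor that boils down to something like this (a sketch; the exact message doesn't matter):

// respond with a simple status message and make sure it is never cached
INKFResponse vResponse = aContext.createResponseFrom("input file processed");
vResponse.setExpiry(INKFResponse.EXPIRY_ALWAYS);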

Six. A unittest is provided. You need to replace the values for infile, outfile and stringEquals with your own. The input file could for example contain this:

firstname,lastname
tom,geudens
peter,rodgers
tony,butterfield
tom,mueck
rené,luyckx


Which will result in this output file :
geudens tom
rodgers peter
butterfield tony
mueck tom
luyckx rené
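
(Given that the accessor upper-cases the column headers, the template used in the unittest is presumably something as simple as ${LASTNAME} ${FIRSTNAME}.)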


Note that Freemarker allows a lot more in its templates than is shown in the unittest. Here's one from KBOData:

<#if CONTACTTYPE == "TEL">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasTelephone> "${VALUE}" .
<#elseif CONTACTTYPE == "EMAIL">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasEmail> "${VALUE}" .
<#elseif CONTACTTYPE == "WEB">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasURL> "${VALUE}" .
</#if>


Seven. Usage. In fact the unittest shows how to use the library module: you import the public space and provide freemarker templates as res:/resources/freemarker/[template].freemarker resources. A request from an importing module then looks roughly like the sketch below. Enjoy!
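
A sketch, assuming an accessor in the importing module; the argument names infile, outfile and template follow the unittest, the file paths and template name are just placeholders:

// call active:csvfreemarker from a module that imports the supercsv public space
INKFRequest vRequest = aContext.createRequest("active:csvfreemarker");
vRequest.addArgument("infile", "file:/var/tmp/input/names.csv");   // placeholder
vRequest.addArgument("outfile", "file:/var/tmp/output/names.txt"); // placeholder
vRequest.addArgument("template", "helloname"); // expects res:/resources/freemarker/helloname.freemarker
aContext.issueRequest(vRequest);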




P.S. I noticed that so far I have assumed that you know how to set up your development environment in order to build a NetKernel module. If this is not the case, there are tutorials for Eclipse and IntelliJ in the documentation.

P.P.S. Applying a Freemarker request to every single line takes - even in NetKernel and even using the lifesaver for batch processing - a while (depending on the size of the input of course). In a next post I'll discuss how we can fan out the requests.

2014/06/10

presenting kbodata

There's a scary thing out there. It can tell computers the chemical composition of your prescription drugs. It can tell computers who's the top scoring soccer player at the World Cup (once that competition gets underway, of course). The computers can then act on that information. It is called The Web of Data and, like it or not, it will change a lot of things in our everyday lives.

While it uses the same technology as The Web, it is different in that it is not made for human consumption but for machine consumption. Everybody reading this blog knows what Wikipedia is, but did you know about DBPedia? The difference? Well, a human can make sense of a regular webpage and can infer meaning from it. A machine can do no such thing. It needs the data in a structured format and it needs to be told what the meaning of the data is.

There is a steady movement towards making more and more data publicly available. Tim Berners-Lee (yes, him again) described a five-star system for publishing data in this way. For once governments are leading the movement (often because there are regulations that make it mandatory for them to open up their data), although more and more corporations are joining every day.
So, when the Belgian government decided to publish the Belgian company data (KBO/BCE/CBE) as three-star data (csv files), Paul Hermans and I decided to add a couple of stars. We created a KBOData showcase:
  • Paul added a semantic layer, turning the csv files into rdf. NetKernel is used to do the actual transformation in batch.
  • The resulting triples are stored into Stardog.
  • Based on input from Paul, I developed a(n almost) general purpose module (very little customization needed) in NetKernel for publishing the data.
  • NetKernel also takes care of the caching, both in-memory and persistent where needed.
  • Benjamin Nowack added the user experience (also served on NetKernel), for while it is about machine consumption, a showcase implies there's something to see/do for humans too. Note that what you see as a human is exactly the same as what a machine 'sees', there is no second set of data.
We learned a lot during the process. For one, we seriously underestimated the amount of data (more than 74 000 000 triples/facts). This will lead to more use of paging in a second iteration. NetKernel is a natural match for structured data with lots of transformations (which is what this is all about), but even NetKernel cannot shield against an open-ended request for the whole database.

A bit of cutting-edge was added with the fragments-server. Linked Data Fragments is a recent development from Ghent University to make The Web of Data more scalable. So when I say paging, it is very likely that the whole site will be based on fragments in the next iteration.

If you're interested in the finer details and/or want a similar implementation for your data, contact Paul or myself and we'll help you along.

2014/05/28

back to the beginning ... xml recursion language

ROC/NetKernel was originally thought out and developed with XML in mind. Note that it was never (not then, not today) bound to this data exchange format - everything is, after all, a resource and another format is just a transreption away - but at the time XML was the prevailing format. Today JSON is, tomorrow we might lose confidence in braces ... actually, that whole discussion is moot.

Transreption: Lossless transformation of one representation format into another representation format.

If you're a fanboy ... yes, I dropped the word isomorphic from the above definition. That word may mean something to you; to me it means I like using difficult words.

It'll come as no surprise then that there are quite a few NetKernel batteries for slicing and dicing XML.

Battery: Normally some sort of electrochemical cell, here it means an addition that makes usage of a given thing easier, in this case NetKernel.

The single most powerful of these is the XML Recursion Language (XRL). In order to discuss what it does, here's a small XML snippet (that could be the footer of an HTML page):
<div id="footer">
    <p>All rights reserved. © <span>2013</span> Elephant Bird Consulting BVBA</p>
</div>


No, my calendar is not behind. This snippet is a (file) resource that I use over and over again as footer for my webpages. Only, I have to manually update it every year, on every server where I use it. Tedious work and I quite often forget to change it here or there.

Here's the same small XML snippet, now solving my problem using XRL:
<div xmlns:xrl="http://netkernel.org/xrl" id="footer">
    <p>All rights reserved. © <span xrl:eval="text">active:widgetCurrentYear</span> Elephant Bird Consulting BVBA</p>
</div>


Now, when I use this template in an active:xrl2 request, it in turn requests active:widgetCurrentYear, which is a small component that returns the current year.
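
The actual widget lives in the urn.com.elbeesee.tool.widget module mentioned at the end of this post; a minimal sketch of the idea (hypothetical class name, not necessarily the module's exact code):

import java.util.Calendar;

import org.netkernel.layer0.nkf.INKFRequestContext;
import org.netkernel.module.standard.endpoint.StandardAccessorImpl;

public class CurrentYearAccessor extends StandardAccessorImpl {
    public void onSource(INKFRequestContext aContext) throws Exception {
        // respond with the current year as a string, e.g. "2014"
        aContext.createResponseFrom(Integer.toString(Calendar.getInstance().get(Calendar.YEAR)));
    }
}

A real implementation would probably also set an expiry so the response gets recomputed when the year changes.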

That's cool, but it gets even better. Consider this template :
<html xmlns:xrl="http://netkernel.org/xrl">
    <xrl:include identifier="res:/elbeesee/demo/xrl/header"/>
    <xrl:include identifier="res:/elbeesee/demo/xrl/body"/>
    <xrl:include identifier="res:/elbeesee/demo/xrl/footer"/>
</html>

Do you see? When we request active:xrl2 with this template, it will request (and include) the specified resources. Our footer snippet could be the last one. And this is where the recursion comes in: automagically it will then request active:widgetCurrentYear. And so on, as deep as you care to go!

By the way, it's active:xrl2 because NetKernel 3 contained a quite different version of the tool which is kept available for legacy reasons.

If you want the example, the basic bits (just for the footer, you can no doubt add the complete page with header and body yourself) can be found in my public github archive; you'll need the following modules:
  • urn.com.elbeesee.tool.widget
  • urn.com.elbeesee.demo.xrl
Enjoy!

2014/05/14

lifesaver for batch processing

It's been a while since the last post; I've been quite occupied with KBOData. More information on that will follow soon (here and on http://practical-linkeddata.blogspot.com). Today, a short tip.

Batch processing. At some point in my IT career it occupied all of my time. Batch processing on mainframe (using PL/I and IDMS) to be exact. The performance we got back then is unmatched by anything I've seen since, you just can't beat the big iron as far as batch processing goes.

Standard ROC processing isn't optimized for batch processing. Look at it this way ... say you request the resource that is your batch process; out-of-the-box, NetKernel then keeps track of every dependency that makes up that resource. In a batch process this can pretty quickly turn nasty on memory usage. And think about it: rarely do you want the result of a batch process to be cached.

It is possible to do very efficient batch processing with ROC though. You can fan out requests asynchronously, for example. More on that another time. For now, here's the lifesaver I got from Tony Butterfield yesterday. Not only did it shorten execution time massively, it also kept memory usage down (to next to nothing):
<request>
  <identifier>active:csv2nt+filein@data:text/plain,file:/var/tmp/input/address.csv+fileout@data:text/plain,file:/var/tmp/output/address.ttl+template@data:text/plain,address</identifier>
  <header name="forget-dependencies">
    <literal type="boolean">true</literal>
  </header>
</request>


What this exact request does is not so important (it converts a huge csv file into rdf syntax, using a freemarker template on each line); what is important is the header. For that header makes the difference between "batch sucks on NetKernel" and "batch roc(k)s on NetKernel". Thanks Tony!
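
For completeness, the same header can be set when you build the request programmatically in an accessor. A sketch of the NKF equivalent of the declarative request above (argument names taken from the identifier, the rest assumed):

// issue the same request from Java, with the forget-dependencies header set
INKFRequest vRequest = aContext.createRequest("active:csv2nt");
vRequest.addArgument("filein", "data:text/plain,file:/var/tmp/input/address.csv");
vRequest.addArgument("fileout", "data:text/plain,file:/var/tmp/output/address.ttl");
vRequest.addArgument("template", "data:text/plain,address");
vRequest.setHeader("forget-dependencies", true); // the lifesaver
aContext.issueRequest(vRequest);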

2014/03/08

back to the beginning ... construct

I had a couple of constructive discussions after my last post. A couple of doubts were raised about the reality of what I said.

There's only one way to answer those doubts and that is by showing you. So when I was revising some stuff for a customer earlier this week, I reconstructed one component for discussion here. Before I go into the details (and show you where you can find it), allow me to say a couple of words about APIs on the Internet.


Well, like most of the Apis mellifera family, APIs on the Internet have quite a sting, but unlike bees they regularly sting again (wasp-like), as anybody trying to keep up with the Google, Facebook, Twitter, Dropbox, <you name it> APIs can attest.

However, for all their perceived flaws (which I won't go into), they are a step towards a programmable web and I deal with them on a frequent basis. What I typically do in order to use them is construct a component that:
  1. Shields the complexity.
  2. Constrains the usage.
  3. Improves the caching.
As an example, I created the urn.com.elbeesee.api.rovi module which is now available on Github. The Rovi (http://developer.rovicorp.com) API provides metadata for movies and music products.

Note that I only provided the source; if you want to use it you'll have to build the Java code yourself. If this is making too much of an assumption on my part, contact me and I'll walk you through it, no problem. If I get lots of requests, I'll blog about that next time.

You'll notice that the module provides two accessors. One - active:rovimd5 - is private and computes the signature necessary to make requests. The other one - active:rovirelease - is public, takes an upcid as argument and provides access to the Rovi Release API.

In order to use active:rovirelease, it needs to be able to find three resources when you issue a request to it: rovi:apikey, rovi:secretkey and rovi:expiry.
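
Inside the accessor these are presumably sourced along these lines (a sketch; variable names are mine):

// sourced from the requesting space, so the consumer decides where they come from
String vApiKey    = aContext.source("rovi:apikey", String.class);
String vSecretKey = aContext.source("rovi:secretkey", String.class);
long   vRoviExpiry = Long.parseLong(aContext.source("rovi:expiry", String.class));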

The first two are obvious, and it is obvious why I'm not providing those in the module itself. The third one may be less obvious, but you'll note the following in the code:

rovireleaserequest.setHeader("exclude-dependencies",true); // we'll determine cacheability ourselves
 

When making the actual request to Rovi I ignore any caching directives that come my way. And on the response I do the following:

vResponse.setExpiry(INKFResponse.EXPIRY_MIN_CONSTANT_DEPENDENT, System.currentTimeMillis() + vRoviExpiry);

Two questions that can be raised about this are:
  1. Why are you doing this?
  2. Is this legal?
The Why is easy to answer. It is my business logic that should decide how quickly it wants updates, not a 3rd party API that wants to make money out of my requests. 

The legal aspect is not so clear and you should carefully read what the terms of usage are. Note however that I am not persisting any results from the API; I'm just letting the business logic dictate how long they are relevant in memory (and since memory is limited, the distribution of the requests will determine which results remain in memory and which do not).

Adding persistence would not be very hard; however, especially for paying services, you then need to be fully aware of the terms of usage. Contact me for details if you want to know how to add a persistence layer.

Another takeaway from this module is that I throttle active:rovirelease. Granted, this is maybe also something that shouldn't be done in there (as it may depend on your business model), but controlling the flow is an important aspect of using APIs and this is a - simple - way to do it.

A last takeaway is that I don't interpret the result of the request in this component, nor do I force it into a format of any kind. And while I will grant that adding some form of handling after the actual API call could be useful, it is an important takeaway: this component shields the API. What is done with the result belongs in another component.

This component is used in production. It is reality. You'll also find that it doesn't contain any kind of magic or clever coding (or a lot of coding at all). And yet it accomplishes quite a few things. The main thing it accomplishes is that it turns an API not under your control into a resource that is.