2014/10/24

back to the beginning ... async 101

Even the most humble of modern laptops has multiple cores at its disposal. When you work Resource Oriented you benefit from the fact that resource requests are automatically spread over the available cores. However, within one (root) request you typically make subrequests sequentially. In most cases this is exactly what you want, as one subrequest provides the input for the next ... and so on.

There are cases however where you can benefit from parallel processing. A webpage, for example, can be composed from several snippets which can be requested in parallel. In a previous post I discussed the XRL language :

<html xmlns:xrl="http://netkernel.org/xrl">
    <xrl:include identifier="res:/elbeesee/demo/xrl/header" async="true"/>
    <xrl:include identifier="res:/elbeesee/demo/xrl/body" async="true"/>
    <xrl:include identifier="res:/elbeesee/demo/xrl/footer" async="true"/>
</html>


Another use case for parallel processing is batch processing. In my last post I developed an active:csvfreemarker component. It applies a freemarker template to every csv row in an input file and writes the result to an output file. It works. However, the files I want processed contain millions of rows and applying a freemarker template does take a bit of time. Can parallel processing help ? Yes it can ! Here's the relevant bit of code :

// vCsvMap holds the current csv row (header name -> value), null at end of file
while(vCsvMap != null) {
    int i = 0;
    List<INKFAsyncRequestHandle> vHandles = new ArrayList<INKFAsyncRequestHandle>();

    // issue up to eight freemarker requests without waiting for their results
    while( (vCsvMap != null) && (i < 8) ) {
        INKFRequest freemarkerrequest = aContext.createRequest("active:freemarker");
        freemarkerrequest.addArgument("operator", "res:/resources/freemarker/" + aTemplate + ".freemarker");
        // every csv column becomes an upper-cased, pass-by-value argument
        for (Map.Entry<String,String> vCsvEntry : vCsvMap.entrySet()) {
            freemarkerrequest.addArgumentByValue(vCsvEntry.getKey().toUpperCase(), vCsvEntry.getValue());
        }
        freemarkerrequest.setRepresentationClass(String.class);
        INKFAsyncRequestHandle vHandle = aContext.issueAsyncRequest(freemarkerrequest);
        vHandles.add(vHandle);

        vCsvMap = vInReader.read(vHeader);
        i = i + 1;
    }
    // join the handles in the order they were issued, so output order matches input order
    for (int j=0; j<i; j++) {
        INKFAsyncRequestHandle vHandle = vHandles.get(j);
        String vOut = (String)vHandle.join();
        vOutWriter.append(vOut).append("\n");
    }

}

The freemarker requests are issued as async requests in groups of eight. Their results are then processed in order in the for-loop.

Why eight ? That number depends on several things. The number of cores available, the duration of each async request, ... You'll need to experiment a bit to see what fits your environment/requirements. So actually the number should not be hard-coded. Bad me.
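
One way to avoid the hard-coded eight is to start from the number of available cores and tune from there. The sketch below is plain JDK, not NetKernel : an ExecutorService stands in for issueAsyncRequest and Future.get() stands in for join(). The names BatchedFanOut, defaultBatchSize and process are made up for this illustration, and a core count is only a starting point, not a recommendation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, plain JDK only (no NetKernel classes involved).
final class BatchedFanOut {

    // Assumption: one in-flight task per available core is a sensible first guess.
    static int defaultBatchSize() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    // Runs 'work' on the rows in groups of batchSize; results keep the input order.
    static <T, R> List<R> process(List<T> rows, int batchSize, Function<T, R> work) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(batchSize);
        List<R> results = new ArrayList<>(rows.size());
        try {
            for (int start = 0; start < rows.size(); start += batchSize) {
                int end = Math.min(start + batchSize, rows.size());
                // Fan out: submit the whole group without waiting.
                List<Future<R>> handles = new ArrayList<>();
                for (T row : rows.subList(start, end)) {
                    Callable<R> task = () -> work.apply(row);
                    handles.add(pool.submit(task));
                }
                // Join in submission order before starting the next group.
                for (Future<R> handle : handles) {
                    results.add(handle.get());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return results;
    }
}
```

A call such as BatchedFanOut.process(rows, BatchedFanOut.defaultBatchSize(), row -> row.toUpperCase()) then gives you a single place to experiment with the group size.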

2014/10/14

back to the beginning ... a library module

It has been a while since my last post. But like my friend who blogs here - check it out, he has style - I just can't stop doing it.

In my last post I presented the KBOData site. For all its fancy features, the real work is to get all the raw data (1.12 Gigabytes worth of csv files) into the correct format for the - monthly - database load. The database is Stardog so the csv has to be transformed into one of the rdf formats. Turtle was selected.

For your information, the 1.12 Gigabytes of csv gets turned into 9.68 Gigabytes worth of triples.

Now there are a lot of tools available in NetKernel and we could definitely write our own csv-processor but there are good libraries available. I selected Super CSV and created a library module with it. A library module provides - in case that wasn't clear - functionality to other modules.

I'm not going to discuss the whole module (which you can find here, the module name is urn.org.elbeesee.supercsv), if you've followed the Back to the beginning series most of it should be familiar. I am going to discuss the new stuff though.

I removed the class file and the supercsv jar file before checking the module into Github (both to save space on Github and to avoid errors due to a different environment). This means the module will not work as is; you'll need to compile it yourself.

One. The version in module.xml matches the version of the Super CSV jar file (2.2.0 at the time I write this). This is good practice when you wrap third-party software (as we are doing here).

Two. The module contains a lib directory underneath the module's root. This is where we're going to put the 3rd party jars. In this case super-csv-2.2.0.jar which you can get from the Super CSV download.

Three. We add functionality. The active:csvfreemarker accessor takes a csv file as input, applies a freemarker template to each line and writes the output to a different file. It assumes the first row of the csv to contain the column headers.

We could export the Super CSV classes so that they can be used directly in other modules. While there may be cases where this is useful, this often quickly leads to classloader hell. Keeping the 3rd party functionality wrapped inside is the best way to go.

Four. The accessor itself contains nothing special (you'll find it - minus the freemarker processing - in the examples on the Super CSV site).

Five. The active:csvfreemarker response just mentions that the input file has been processed. It is the side-effect (the output file) that we are interested in, so the response is expired immediately.

Six. A unittest is provided. You need to replace the values for infile, outfile and the stringEquals value with your own. The input file could for example contain this :

firstname,lastname
tom,geudens
peter,rodgers
tony,butterfield
tom,mueck
rené,luyckx


Which will result in this output file :
geudens tom
rodgers peter
butterfield tony
mueck tom
luyckx rené
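
To make the row-by-row idea concrete, here is a minimal plain-Java sketch of the same flow : the first line supplies the column names, each following line becomes a map with upper-cased keys, and a template is filled in per row. It is a stand-in only. It uses neither Super CSV nor Freemarker, splits naively on commas (no quoting or escaping) and supports only simple ${KEY} placeholders. The class CsvTemplateSketch and its methods are made up for illustration, and the template "${LASTNAME} ${FIRSTNAME}" is my guess at what would produce the output above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the accessor's logic; not Super CSV, not Freemarker.
final class CsvTemplateSketch {

    // Maps one data line onto the header names, upper-casing the keys.
    static Map<String, String> toRowMap(String[] header, String line) {
        String[] values = line.split(",", -1); // naive split: no quoted fields
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < header.length && i < values.length; i++) {
            row.put(header[i].trim().toUpperCase(Locale.ROOT), values[i]);
        }
        return row;
    }

    // Replaces every ${KEY} placeholder with the row's value for KEY.
    static String apply(String template, Map<String, String> row) {
        String out = template;
        for (Map.Entry<String, String> entry : row.entrySet()) {
            out = out.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return out;
    }

    // Treats the first line as the header and renders one output line per data row.
    static List<String> transform(List<String> csvLines, String template) {
        List<String> out = new ArrayList<>();
        if (csvLines.isEmpty()) {
            return out;
        }
        String[] header = csvLines.get(0).split(",", -1);
        for (String line : csvLines.subList(1, csvLines.size())) {
            out.add(apply(template, toRowMap(header, line)));
        }
        return out;
    }
}
```

In the real module Super CSV does the parsing and active:freemarker does the templating; the sketch only shows how the header row, the upper-cased keys and the template fit together.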


Note that Freemarker allows a lot more in its templates than is shown in the unittest. Here's one from KBOData :

<#if CONTACTTYPE == "TEL">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasTelephone> "${VALUE}" .
<#elseif CONTACTTYPE == "EMAIL">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasEmail> "${VALUE}" .
<#elseif CONTACTTYPE == "WEB">
<http://data.kbodata.be/organisation/${ENTITYNUMBER?replace(".", "_")}#id> <http://www.w3.org/2006/vcard/ns#hasURL> "${VALUE}" .
</#if>
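
For readers who don't know Freemarker, the template boils down to roughly the following, written here as a plain-Java sketch. The class ContactTriples and the method toTriple are made up for illustration, the sketch emits a plain subject-predicate-object line, and it does not escape quotes in the value, which real data may require.

```java
// Hypothetical plain-Java equivalent of the contact template; names are illustrative.
final class ContactTriples {

    private static final String ORGANISATION = "http://data.kbodata.be/organisation/";
    private static final String VCARD = "http://www.w3.org/2006/vcard/ns#";

    // Returns one triple for a known contact type, or an empty string otherwise.
    static String toTriple(String entityNumber, String contactType, String value) {
        String predicate;
        switch (contactType) {
            case "TEL":
                predicate = "hasTelephone";
                break;
            case "EMAIL":
                predicate = "hasEmail";
                break;
            case "WEB":
                predicate = "hasURL";
                break;
            default:
                return ""; // like the template: no branch matches, nothing is emitted
        }
        // Same effect as ?replace(".", "_") : every dot becomes an underscore.
        String subject = ORGANISATION + entityNumber.replace(".", "_") + "#id";
        return "<" + subject + "> <" + VCARD + predicate + "> \"" + value + "\" .";
    }
}
```

The template has the advantage that it can be changed without recompiling anything, which is why it lives in a resource and not in code.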


Seven. Usage. In fact the unittest shows how to use the library module. You import the public space and provide freemarker templates as res:/resources/freemarker/[template].freemarker resources. Enjoy !




P.S. I noticed that so far I have assumed that you know how to set up your development environment in order to build a NetKernel module. If this is not the case, there are tutorials for Eclipse and IntelliJ in the documentation.

P.P.S. Applying a Freemarker request to every single line takes - even in NetKernel and even using the lifesaver for batch processing - a while (depending on the size of the input of course). In the next post I'll discuss how we can fan out the requests.