get by and go on: xml

2011/05/07

Selecting the right key value for the Muenchian method

I recently faced the problem of grouping elements in XSLT, so I had to learn and understand what the Muenchian method is about. It seems to be the de facto solution for grouping in XSLT because it performs well (or because the alternative solutions perform badly).

I had to transform a Subversion XML verbose log to get a list of changed files for a given issue. Here is a sample document:

One issue may comprise several SVN revisions, which could (and usually do) affect the same files. The file above illustrates this situation. A simple template that matches the log entries for the issue and copies all their path nodes produces a list with repeated items.
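For readers without such a log at hand, a hypothetical input of the same shape might look like the following. It is not the original sample: the revisions, authors, paths and messages are invented, and it assumes the element layout emitted by svn log --xml --verbose and commit messages that begin with the ticket id followed by a colon.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical data: three revisions, two issues, overlapping paths -->
<log>
  <logentry revision="1">
    <author>alice</author>
    <date>2011-05-01T10:00:00.000000Z</date>
    <paths>
      <path action="M" kind="file">/trunk/a.c</path>
      <path action="M" kind="file">/trunk/b.c</path>
    </paths>
    <msg>ISSUE-1: first change</msg>
  </logentry>
  <logentry revision="2">
    <author>bob</author>
    <date>2011-05-02T10:00:00.000000Z</date>
    <paths>
      <path action="M" kind="file">/trunk/b.c</path>
      <path action="M" kind="file">/trunk/c.c</path>
    </paths>
    <msg>ISSUE-2: second change</msg>
  </logentry>
  <logentry revision="3">
    <author>bob</author>
    <date>2011-05-03T10:00:00.000000Z</date>
    <paths>
      <path action="M" kind="file">/trunk/c.c</path>
      <path action="A" kind="file">/trunk/d.c</path>
    </paths>
    <msg>ISSUE-2: follow-up</msg>
  </logentry>
</log>
```

In this made-up log, ISSUE-2 spans revisions 2 and 3, both of which touch /trunk/c.c, and it shares /trunk/b.c with ISSUE-1.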

The transformation used:

Transformation's result using 'ISSUE-2' as value for 'ticket' parameter:

So I turned to the Muenchian method to get rid of the repeated items.

Transformation using Muenchian method to create just one entry for each path:

Transformation's result using 'ISSUE-2' as value for 'ticket' parameter:

No repeated items in the list, right. But there are missing files! Adding some debugging code to the transformation showed that the missing files were those which also appeared in earlier log entries for other issues in the input file. This is pretty clear in this example, but when working with real data and a larger input file, it took me some hours to realize what was actually going on.

The expression generate-id() = generate-id(key('path-key', text())[1]) evaluates to false for these nodes, since the first node with the same key lies outside the node-set to which the template is applied.
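To make the failure concrete, here is a minimal sketch of a stylesheet that behaves this way. It assumes the element layout of svn log --xml --verbose (log/logentry/paths/path and msg) and that each commit message begins with the ticket id followed by a colon. Only the key name path-key and the ticket parameter come from the text above; everything else is illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="ticket"/>

  <!-- The key indexes EVERY path in the document by its text,
       no matter which issue its log entry belongs to. -->
  <xsl:key name="path-key" match="path" use="text()"/>

  <xsl:template match="/">
    <!-- Visit only the paths of the requested ticket (assumed msg format:
         "TICKET-ID: description"), keeping a path only if it is the first
         node in the whole document with that key value. -->
    <xsl:for-each select="log/logentry[substring-before(msg, ':') = $ticket]
                          /paths/path[generate-id() =
                                      generate-id(key('path-key', text())[1])]">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

With this sketch, a path that was changed under ISSUE-1 earlier in the document and again under ISSUE-2 is dropped from the ISSUE-2 list: key('path-key', text())[1] returns the ISSUE-1 occurrence, which the for-each never visits, so no visited node passes the test for that path.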

The Muenchian method does not work so simply when grouping child nodes of nodes that have been selected using a transformation parameter. So my first approach to solving the problem was to use two transformations and write a short shell script that calls xsltproc twice in a pipeline.

The first transformation takes a ticket parameter and copies just the interesting log entries:
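A minimal sketch of such a filtering pass could look like this. As an assumption for illustration, it treats the text before the first colon in msg as the ticket id; the real matching rule depends on how commit messages are written.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:param name="ticket"/>

  <!-- Rebuild the root element and deep-copy only the log entries
       whose message starts with the requested ticket id. -->
  <xsl:template match="/log">
    <log>
      <xsl:copy-of select="logentry[substring-before(msg, ':') = $ticket]"/>
    </log>
  </xsl:template>
</xsl:stylesheet>
```

The output is a smaller log that contains only one issue, so a second stylesheet can build its key over the whole document without seeing paths from other issues. The parameter can be passed with xsltproc's --stringparam option.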

The second applies the Muenchian method to get just the unique entries for affected paths in those log entries:

Transformation's result using 'ISSUE-2' as value for 'ticket' parameter for the first transformation:

This approach worked perfectly, but I kept thinking about a possible solution using a single transformation, and eventually found one using a different, slightly more complex key. The problem was that the key I was using to identify unique entries had to take the ticket into account as well as the path.
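The idea can be sketched as follows. XSLT 1.0 does not allow variable references in the match or use attributes of xsl:key, so the key cannot simply be restricted to $ticket; instead, the ticket becomes part of the key value. The sketch assumes that the ticket id is the text before the first colon in msg and that the '|' separator never occurs in a ticket id. The key name ticket-path-key is made up for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="ticket"/>

  <!-- Composite key: ticket id of the enclosing log entry + '|' + path.
       ../../msg walks from path up to paths, then to logentry. -->
  <xsl:key name="ticket-path-key" match="path"
           use="concat(substring-before(../../msg, ':'), '|', .)"/>

  <xsl:template match="/">
    <!-- A path is kept only if it is the first node in the document
         with the same (ticket, path) pair. -->
    <xsl:for-each select="log/logentry[substring-before(msg, ':') = $ticket]
                          /paths/path[generate-id() =
                                      generate-id(key('ticket-path-key',
                                                      concat($ticket, '|', .))[1])]">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Because the key value now includes the ticket, occurrences of the same path under other issues land in different key groups, and the first node of each group always belongs to a log entry that the for-each visits.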

Here is the transformation:

2010/12/15

xmllint and xsltproc

I've recently been developing XSL transformations. I've also been implementing C++ code to edit XML files using XPath.

I had previously used xsltproc, the command line utility distributed together with libxslt, to test and debug my XSL transformations.

I have also used xmllint to check that hand-written XML files are both well-formed and valid against a given DTD or XML Schema.

But I recently discovered a wonderful xmllint feature: the --shell option. It runs an interactive shell that lets the user navigate within the XML document as in a file system. Navigation is not the most useful feature I found in the shell, though: that would be the xpath command.

It lets you quickly check which node-set is obtained when applying an XPath expression to a given XML input file.

It's a really useful utility when writing complex XPath expressions or checking out results for XPath expressions on complex documents.