Unique key grouping with multiple input documents and nodesets in variables

September 8, 2006

The problem

A common requirement in XSL transformations is to group unsorted data by a key unique to each group, and sort it for display. For example:

<?xml version="1.0"?>
<food>
  <item type="Fruit" name="Orange" />
  <item type="Vegetable" name="Cucumber" />
  <item type="Meat" name="Chicken" />
  <item type="Vegetable" name="Carrot" />
  <item type="Vegetable" name="Potato" />
  <item type="Meat" name="Pork" />
  <item type="Fruit" name="Banana" />
  <item type="Fruit" name="Apple" />
</food>

We may typically want to output an HTML table with each food type (Fruit, Meat, Vegetable) as a header with all the items in each food category as table rows/records, ie.

Fruit

Apple

Banana

Orange

Meat

Chicken

Pork

Vegetable

Carrot

Cucumber

Potato

When dealing with a single input document addressed directly (ie. not through a variable), this is a trivial problem for XSL, easily solvable using <xsl:key> and key().
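For reference, here is a minimal sketch of that standard key-based approach (commonly known as Muenchian grouping), assuming the food document above is the main source document. The key name items-by-type is a name chosen for this illustration only.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

 <xsl:output method="html" />

 <!-- Index every <item> in the source document by its @type -->
 <xsl:key name="items-by-type" match="item" use="@type" />

 <xsl:template match="/food">
  <table>
   <!-- Keep only the item that is the first node key() returns for
        its @type, i.e. one representative per group -->
   <xsl:for-each select="item[generate-id() = generate-id(key('items-by-type', @type)[1])]">
    <xsl:sort select="@type" data-type="text" order="ascending" />

    <!-- Output group header -->
    <tr>
     <td style="background-color: yellow">
      <xsl:value-of select="@type" />
     </td>
    </tr>

    <!-- Output every item in the group, sorted by name -->
    <xsl:for-each select="key('items-by-type', @type)">
     <xsl:sort select="@name" data-type="text" order="ascending" />
     <tr>
      <td>
       <xsl:value-of select="@name" />
      </td>
     </tr>
    </xsl:for-each>

   </xsl:for-each>
  </table>
 </xsl:template>
</xsl:stylesheet>
```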

This technique cannot be applied directly when you must group a node-set stored in a variable or in an external document referenced with document(). <xsl:key> does not allow variable references in its match or use attributes, and key() only searches the document that contains the context node, which is normally the main source document. XPath axes we might use for testing uniqueness in a set (typically preceding-sibling:: in an <xsl:for-each> to see if we have found the first occurrence of a new group key) are evaluated within a node's own document, so they do work on nodes loaded with document(); they do not work on a result tree fragment built inside an <xsl:variable> unless it is first converted to a node-set with a processor extension.

So how do we group with XSL variables or multiple input documents?

The solution

The solution is to use a nested iteration (<xsl:for-each>), whereby the outer loop scans looking for each previously unencountered key – ignoring those we have found already – and the inner loop selects, sorts and outputs all the items sharing that key (ie. all the items in the group). Here is an example for the XML document above, assuming the node-set is stored in the variable $food:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

 <xsl:output method="html" />

<!-- Note this is just an arbitrary example to show grouping
 from documents and variables other than the source document; if
 food.xml is your only input, it is better to use standard
 grouping techniques using <xsl:key> -->

 <xsl:variable name="food" select="document('food.xml')/food" />

 <xsl:template match="/">

  <table>
   <xsl:for-each select="$food/item">
   <xsl:sort select="@type" data-type="text" order="ascending" />

    <xsl:if test="not(preceding-sibling::item[@type=current()/@type])">

     <!-- Output group header -->
     <tr>
      <td style="background-color: yellow">
       <xsl:value-of select="@type" />
      </td>
     </tr>

     <xsl:for-each select="$food/item[@type=current()/@type]">
     <xsl:sort select="@name" data-type="text" order="ascending" />

      <!-- Output group item -->
      <tr>
       <td>
        <xsl:value-of select="@name" />
       </td>
      </tr>

     </xsl:for-each>
    </xsl:if>

   </xsl:for-each>
  </table>

 </xsl:template>
</xsl:stylesheet>

This code is rather deceptive to the eye. Here is how it works:

Each item in the table (in this case, each <item> child of <food>) is examined.
The outer sort does not affect the grouping or uniqueness tests; it only sets the output/display order of the groups. We sort in ascending order of @type in the example, so the final groups will be listed in the order: Fruit, Meat, Vegetable.
The <xsl:if> specifies what our grouping key is: in this case @type. The test compares the grouping key of every preceding sibling in document order to that of the item we’re iterating over; if no preceding sibling has the same key, then we have found the first occurrence of an item with this key in the document.
Because <xsl:for-each> visits items in the specified sorted order rather than in document order, the outer loop may examine several (or many) items with a given grouping key before it reaches the one that appears first in the document, particularly if you sort on multiple criteria. This is inconsequential, and is mentioned only for clarity: the first occurrence of each grouping key in the document is always reached eventually, so you can apply any combination of sorts required for your output without causing the grouping algorithm to fail.
All items which are not the first occurrence of a particular grouping key are ignored by the outer <xsl:for-each>. Only the first occurrence of each key is processed, which ensures that each grouping key in the document is processed by the inner <xsl:for-each> exactly once.
Once we’ve found the first instance of a unique grouping key, the inner <xsl:for-each> selects all instances in the document with the same grouping key, and outputs them as desired (in this case as an HTML table row). Again, the sort does not affect the grouping and only modifies the order in which elements are sorted for output within the individual group. We have sorted on @name in ascending order in the example, so the fruits will be output eg.: Apple, Banana, Orange.

XPath expressions as grouping keys

Any XPath expression can be used as a grouping key – it doesn’t have to be a simple element or attribute reference. In a real-world example, a show scheduling system has an input XML document with a list of shows with their start times in ISO date format (YYYY-MM-DDTHH:MM:SS+ZZ:ZZ), and we want to group the shows by day in ascending chronological order for display on a web site, printing a header for each day with the date, and listing the shows for that day underneath.

In this case, every show start time is unique, but the first 10 characters of the time (YYYY-MM-DD) can be used as a grouping key. The first 10 characters will group together all the shows which start on the same day.

The only change you need to make is to the <xsl:if> test:

<xsl:if test="not(preceding-sibling::broadcast[
substring(current()/startTimeISODisplay,1,10)
=substring(startTimeISODisplay,1,10)])">

(the document contains a number of <broadcast> elements which each have one <startTimeISODisplay> child containing the ISO start date of the show)

Instead of using an attribute reference such as @type as we did before, now we use a more complicated XPath expression – substring(startTimeISODisplay,1,10) – as the grouping key. Remember that both sides of the comparison must use the same grouping key.

(Aside: using ISO dates is a smart choice for processing dates and times in XSL stylesheets, because the ISO date can be treated as text; when lexically sorted, the dates will run in ascending or descending chronological order as specified without further manipulation – this works because the ISO date format lists each unit of time in descending order of importance from left to right, ie. year first, seconds last)
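Putting the pieces together, a complete day-grouping stylesheet might look like the sketch below. The file name schedules.xml, the <broadcasts> wrapper element and the <title> child are assumptions made for this illustration; only <broadcast> and <startTimeISODisplay> come from the description above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

 <xsl:output method="html" />

 <!-- Assumption: schedules.xml has the shape
      <broadcasts>
        <broadcast>
          <title>Some show</title>
          <startTimeISODisplay>2006-09-08T20:00:00+01:00</startTimeISODisplay>
        </broadcast>
      </broadcasts>
      with one <broadcast> per show -->
 <xsl:variable name="schedules" select="document('schedules.xml')" />

 <xsl:template match="/">
  <table>
   <xsl:for-each select="$schedules/broadcasts/broadcast">
    <xsl:sort select="startTimeISODisplay" data-type="text" order="ascending" />

    <!-- True only for the first broadcast (in document order) on each day -->
    <xsl:if test="not(preceding-sibling::broadcast[
        substring(current()/startTimeISODisplay,1,10)
        =substring(startTimeISODisplay,1,10)])">

     <!-- Day header: the grouping key itself (YYYY-MM-DD) -->
     <tr>
      <th><xsl:value-of select="substring(startTimeISODisplay,1,10)" /></th>
     </tr>

     <!-- All broadcasts sharing this day, in chronological order -->
     <xsl:for-each select="$schedules/broadcasts/broadcast[
         substring(startTimeISODisplay,1,10)
         =substring(current()/startTimeISODisplay,1,10)]">
      <xsl:sort select="startTimeISODisplay" data-type="text" order="ascending" />
      <tr>
       <td>
        <!-- Characters 12-16 of the ISO string are HH:MM -->
        <xsl:value-of select="substring(startTimeISODisplay,12,5)" />
        <xsl:text> </xsl:text>
        <xsl:value-of select="title" />
       </td>
      </tr>
     </xsl:for-each>

    </xsl:if>
   </xsl:for-each>
  </table>
 </xsl:template>
</xsl:stylesheet>
```

The same key expression appears three times here (in the test, in the header and in the inner select), which is the kind of repetition that makes it worth keeping the expression in one place if your stylesheet grows.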

Filtered grouping

Let’s say you want to process only a subset of the items in your source data. For example, in our radio scheduling system, we may want to display only shows up to a certain time in the future, or with a particular presenter. You can use any XPath expression as filter criteria. Two changes must be made to the grouping code:

The filter criteria must be inserted as a predicate in the inner <xsl:for-each> element (the one which processes each matching item in a specific group).
The output of a group header must be placed inside the inner <xsl:for-each> loop instead of the outer one. This is because the outer loop iterates over every element in the list, whereas the inner loop only iterates over the filtered list. This can lead to a situation where the inner loop has zero elements to process and isn’t executed at all. In this case, if the group header output code is left in the outer loop, group headers may be output for cases where there are no matching elements after filtering. Placing the header output code in the inner loop ensures headers are only output for filtered groups with at least one element in them.

As an arbitrary example, let’s suppose we want to process our food list above, displaying only items beginning with the letters A-M, which are also not meats. One possible XPath expression for this is:

string-length(translate(substring(@name, 1, 1), 'ABCDEFGHIJKLM', '')) = 0 and @type != 'Meat'

We re-write the example above as follows:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
 <xsl:output method="html" />

 <xsl:variable name="food" select="document('food.xml')/food" />

 <xsl:template match="/">

  <table>
   <xsl:for-each select="$food/item">
   <xsl:sort select="@type" data-type="text" order="ascending" />

    <xsl:if test="not(preceding-sibling::item[@type=current()/@type])">

     <xsl:for-each select="$food/item[string-length(translate(substring(@name, 1, 1),
     'ABCDEFGHIJKLM', '')) = 0 and @type != 'Meat' and @type=current()/@type]">

     <xsl:sort select="@name" data-type="text" order="ascending" />

      <!-- Output group header first if processing the first item in the list -->
      <xsl:if test="position()=1">
       <tr>
        <td style="background-color: yellow">
         <xsl:value-of select="@type" />
        </td>
       </tr>
      </xsl:if>

      <!-- Output group item -->
      <tr>
       <td>
        <xsl:value-of select="@name" />
       </td>
      </tr>

     </xsl:for-each>
    </xsl:if>

   </xsl:for-each>
  </table>

 </xsl:template>
</xsl:stylesheet>

This stylesheet outputs the following:

Fruit

Apple

Banana

Vegetable

Carrot

Cucumber

Chicken and Pork are excluded because they are Meats (and no Meat header is displayed, as it would be if we had left the group header output code in the outer loop); Orange and Potato are excluded because they don’t start with any of the letters A-M.

Grouping re-use with XSL templates

Sometimes you want to re-use the same groups in different ways, or to run the same grouping algorithm over different subsets of your input. One way to do this is to use an XSL template to store the grouping code.

You can supply a parameter which is a filtered nodeset to be used in the inner <xsl:for-each> loop, in order to determine what subset of your input to use. You can also supply additional parameters to control how the matching items will be processed.

Here is an example using our food document again:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

 <xsl:output method="html" />

 <xsl:variable name="food" select="document('food.xml')/food" />

 <xsl:template match="/">
  Fruits and Vegetables from A-M as a table:
  <table>
   <xsl:call-template name="filtered-food">
    <xsl:with-param name="subset" select="$food/item[string-length(translate
    (substring(@name, 1, 1), 'ABCDEFGHIJKLM', ''))=0 and @type!='Meat']" />
    <xsl:with-param name="style" select="'table'" />
   </xsl:call-template>
  </table>

  All food types from N-Z as a list:
  <ul>
   <xsl:call-template name="filtered-food">
    <xsl:with-param name="subset" select="$food/item[string-length(translate
    (substring(@name, 1, 1), 'NOPQRSTUVWXYZ', ''))=0]" />
    <xsl:with-param name="style" select="'list'" />
   </xsl:call-template>
  </ul>
 </xsl:template>

 <xsl:template name="filtered-food">
  <xsl:param name="subset" />
  <xsl:param name="style" />

  <xsl:for-each select="$food/item">
  <xsl:sort select="@type" data-type="text" order="ascending" />

   <xsl:if test="not(preceding-sibling::item[@type=current()/@type])">

    <xsl:for-each select="$subset[@type=current()/@type]">
    <xsl:sort select="@name" data-type="text" order="ascending" />

     <xsl:if test="$style='table'">
      <!-- Output group header first if processing the first item in the list -->
      <xsl:if test="position()=1">
       <tr>
        <td style="background-color: yellow">
         <xsl:value-of select="@type" />
        </td>
       </tr>
      </xsl:if>

      <!-- Output group item -->
      <tr>
       <td>
        <xsl:value-of select="@name" />
       </td>
      </tr>
     </xsl:if>

     <xsl:if test="$style='list'">
      <!-- Output group header first if processing the first item in the list -->
      <xsl:if test="position()=1">
       <li>
        <b><xsl:value-of select="@type" /></b>
       </li>
      </xsl:if>

      <!-- Output group item -->
      <li>
       <xsl:value-of select="@name" />
      </li>
     </xsl:if>

    </xsl:for-each>
   </xsl:if>

  </xsl:for-each>

 </xsl:template>
</xsl:stylesheet>

This stylesheet produces the following output:

Fruits and Vegetables from A-M as a table:

Fruit

Apple

Banana

Vegetable

Carrot

Cucumber

All food types from N-Z as a list:

Fruit
Orange
Meat
Pork
Vegetable
Potato

The grouping code has been placed in a template called filtered-food. The subset parameter gives a filtered list of items we want to process; the style parameter determines whether to display them as a table or a list.

<xsl:call-template name="filtered-food"> can be called as many times as required with different subsets to get the output we want.

Notice that the outer loop in the grouping code still iterates over every element in the list ($food/item) and the <xsl:if> grouping test is the same; all we have changed is the nodeset processed by the inner loop, which now processes $subset instead of $food/item.

Note that for readability it is usually best to encapsulate the table/list display code into its own template, and call this from within the inner loop of the grouping template, passing the style parameter along (or using it to select a display template to call).
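As a rough sketch of that refactoring, the stylesheet below splits the food example into a grouping template and a display template. The template name display-item and its first parameter are hypothetical names introduced here; the grouping logic is unchanged.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

 <xsl:output method="html" />

 <xsl:variable name="food" select="document('food.xml')/food" />

 <xsl:template match="/">
  <ul>
   <xsl:call-template name="filtered-food">
    <xsl:with-param name="subset" select="$food/item[@type!='Meat']" />
    <xsl:with-param name="style" select="'list'" />
   </xsl:call-template>
  </ul>
 </xsl:template>

 <!-- Grouping only: no markup decisions are made here -->
 <xsl:template name="filtered-food">
  <xsl:param name="subset" />
  <xsl:param name="style" />

  <xsl:for-each select="$food/item">
   <xsl:sort select="@type" data-type="text" order="ascending" />
   <xsl:if test="not(preceding-sibling::item[@type=current()/@type])">
    <xsl:for-each select="$subset[@type=current()/@type]">
     <xsl:sort select="@name" data-type="text" order="ascending" />
     <xsl:call-template name="display-item">
      <xsl:with-param name="style" select="$style" />
      <!-- True for the first item of the group in sorted order -->
      <xsl:with-param name="first" select="position()=1" />
     </xsl:call-template>
    </xsl:for-each>
   </xsl:if>
  </xsl:for-each>
 </xsl:template>

 <!-- Display only: xsl:call-template keeps the current <item> as context -->
 <xsl:template name="display-item">
  <xsl:param name="style" />
  <xsl:param name="first" />
  <xsl:choose>
   <xsl:when test="$style='table'">
    <xsl:if test="$first">
     <tr><td style="background-color: yellow"><xsl:value-of select="@type" /></td></tr>
    </xsl:if>
    <tr><td><xsl:value-of select="@name" /></td></tr>
   </xsl:when>
   <xsl:otherwise>
    <xsl:if test="$first">
     <li><b><xsl:value-of select="@type" /></b></li>
    </xsl:if>
    <li><xsl:value-of select="@name" /></li>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:template>
</xsl:stylesheet>
```

Adding a new display style then only means adding another <xsl:when> branch to display-item; the grouping template does not need to change.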

Real world example: You can see a real-world example of filtered grouping and display styles at our radio station web site ~~Deviant Audio~~ (the radio station is now defunct). The front page has a show schedules mini-view showing the forthcoming shows. These are filtered by time: it shows broadcasts from now until 2 weeks in the future. By using the form you can also filter by genre, DJ etc. On the ~~main schedules page~~ the same grouping code is used to produce the list of shows, however this time the code is called twice, once for each of the columns, with different filters. The left-hand column shows broadcasts from 0-7 days into the future; the right-hand from 7-14 days into the future. The style parameter passed is also different from that on the homepage, causing the generation of larger, more detailed listings. We’ll look at some of the code for this at the end of the article.

Using entities to simplify grouping code

Entities are XML constructs whose references are replaced by predefined replacement text when the document is parsed. Entity references always start with an ampersand (&) and end with a semi-colon (;). Some entities such as &amp; (&) and &lt; (<) are already defined, but you can also create your own.

Entities can contain anything including stylesheet code, which makes them useful for macro substitution. Substitution is done before the stylesheet code is parsed so the substitution is treated as code, not as a string to be directly copied to the output.

Some XSL purists would argue that using XML entities to store stylesheet code is a misuse of entities. I would argue that it increases the readability of complex stylesheets, and in the case where you have the same XPath expression repeated several times, reduces the scope for error when the expression has to be changed.

Entities are defined like this:

<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
  <!ENTITY entity-name "entity content" >
  <!ENTITY entity-2 "entity content 2" >
]>

<xsl:stylesheet ...

To use the entities above in your stylesheet, you might write something like:

&entity-name;
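As a concrete but purely illustrative case, the filtered food example could be written with two entities as sketched below. The entity names name-a-to-m and first-of-type are invented for this sketch. Depending on your XML parser, entity substitution may need to be enabled when the stylesheet is loaded.

```xml
<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
  <!-- Hypothetical entity names, chosen for this sketch -->
  <!ENTITY name-a-to-m "string-length(translate(substring(@name, 1, 1), 'ABCDEFGHIJKLM', '')) = 0" >
  <!ENTITY first-of-type "not(preceding-sibling::item[@type=current()/@type])" >
]>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

 <xsl:output method="html" />

 <xsl:variable name="food" select="document('food.xml')/food" />

 <xsl:template match="/">
  <table>
   <xsl:for-each select="$food/item">
    <xsl:sort select="@type" data-type="text" order="ascending" />

    <!-- Expands to the full preceding-sibling uniqueness test -->
    <xsl:if test="&first-of-type;">

     <!-- Expands to the A-M name filter, combined with the group key -->
     <xsl:for-each select="$food/item[&name-a-to-m; and @type=current()/@type]">
      <xsl:sort select="@name" data-type="text" order="ascending" />

      <xsl:if test="position()=1">
       <tr>
        <td style="background-color: yellow">
         <xsl:value-of select="@type" />
        </td>
       </tr>
      </xsl:if>

      <tr>
       <td>
        <xsl:value-of select="@name" />
       </td>
      </tr>

     </xsl:for-each>
    </xsl:if>
   </xsl:for-each>
  </table>
 </xsl:template>
</xsl:stylesheet>
```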

The scheduling system I developed for ~~Deviant Audio~~ uses entities to simplify grouping. Here are the entity definitions:

<!DOCTYPE xsl:stylesheet [
  <!ENTITY broadcast-filter "(meta/genre=$_genre or $_genre='') and
	(authority=$_dj or $_dj='') and
	(@channel=$_channel or $_channel='') and
	endTimeUTC > php:function('time') and
	(@hidden!='true' or not(@hidden))" >

  <!-- Matches all broadcasts that meet our filter criteria which end in the
  future (gives the show currently on if any, and all future shows).
  Although the scheduler web service can return pre-filtered results,
  sometimes we may want to re-filter cached data, and/or eliminate past
  shows that are still cached. -->
  <!ENTITY matching-broadcasts "$schedules/broadcasts/broadcast[&broadcast-filter;]" >

  <!ENTITY matching-14days "&matching-broadcasts;[startTimeUTC &lt;= $Within14Days]" >
  <!ENTITY matching-7days "&matching-broadcasts;[startTimeUTC &lt;= $Within7Days]" >
  <!ENTITY matching-7to14days "&matching-broadcasts;[startTimeUTC > $Within7Days and startTimeUTC &lt;= $Within14Days]" >

  <!-- Returns true if this is the first occurrence of a particular show date
  in document order, false if not the first occurrence -->
  <!ENTITY first-show-of-local-day "not(preceding-sibling::broadcast[
  substring(current()/startTimeISODisplay,1,10)=substring
  (startTimeISODisplay,1,10)])" >
]>

The first entity broadcast-filter defines filter conditions that will apply to all grouping scenarios: matching or all (if no parameter supplied) genres, DJs, channels, ending in the future (don’t display past shows; we use a PHP extension to get the current time), and only events that should be displayed in the public schedules.

The second entity matching-broadcasts is an XPath expression which defines a nodeset that is a list of all the schedule entries which meet our basic filter criteria (that should be applied to all grouping operations).

In the next three entities, we define further subsets of the filtered list, splitting it into shows occurring in the next 7 days, in the next 14 days and between 7 and 14 days in the future (“next week”). These will be used to generate specific groupings for display in different places on the web site.

Finally, the last entity – first-show-of-local-day – is the grouping test that will go in our <xsl:if> statement, and as discussed in the section “XPath expressions as grouping keys”, groups each set of shows by day, so that they can be displayed in chronological order on the site.

Now we turn to how the entities are used in the stylesheet. Our scheduler uses the XSL template method described above to make the grouping code re-usable in a template called filtered-schedules. In a two-column layout presenting the shows in the next 7 days in the left column, and for the 7 days after that in the right column, we use code such as:

<table id="TwoColumnSchedules" cellpadding="0" cellspacing="0">
    <tr>
        <!-- Shows in the next week -->
        <td id="TCSFirstColumn">
            <h1>Next 7 days:</h1>

            <table cellpadding="1" cellspacing="1">
                <xsl:call-template name="filtered-schedules">
                    <xsl:with-param name="filtered-schedules" select="&matching-7days;" />
                    <xsl:with-param name="style" select="'full'" />
                </xsl:call-template>
            </table>
        </td>

        <!-- Shows the week after -->
        <td id="TCSSecondColumn">
            <h1>Following 7 days:</h1>

            <table cellpadding="1" cellspacing="1">
                <xsl:call-template name="filtered-schedules">
                    <xsl:with-param name="filtered-schedules" select="&matching-7to14days;" />
                    <xsl:with-param name="style" select="'full'" />
                </xsl:call-template>
            </table>
        </td>
    </tr>
</table>

As you can see, the entities provide a very readable and greatly simplified way of specifying a subset. Without the entity code, the first call for example would appear like this:

<xsl:call-template name="filtered-schedules">
    <xsl:with-param name="filtered-schedules"
        select="$schedules/broadcasts/broadcast[(meta/genre=$_genre or $_genre='') and
	(authority=$_dj or $_dj='') and
	(@channel=$_channel or $_channel='') and
	endTimeUTC > php:function('time') and
	(@hidden!='true' or not(@hidden))]
         [startTimeUTC &lt;= $Within7Days]" />
    <xsl:with-param name="style" select="'full'" />
</xsl:call-template>

In the grouping code template itself, the only change we make is to the <xsl:if> statement:

<xsl:for-each select="$schedules/broadcasts/broadcast">
    <xsl:sort select="startTimeISODisplay" data-type="text" order="ascending" />
<xsl:if test="&first-show-of-local-day;">
...

Using entities also lets us use the same criteria in non-grouping code. For example, when no scheduled shows meet the filter criteria, we can let the user know with code such as:

<xsl:if test="not(&matching-broadcasts;)">
    <h1>No shows currently scheduled matching your criteria.</h1>
</xsl:if>

Excluding an individual item

Sometimes it is desirable to exclude a specific item from the list. In our scheduling system, we highlight the next broadcast in a separate display area to the rest of the schedules, and include a countdown on the time remaining until it begins.

One way to exclude a specific item (node) from the list is to use generate-id(). This XSLT function generates a unique identifier string for any given node, which is guaranteed to be the same during a single execution of the stylesheet wherever and however many times generate-id() is called for the same node. This allows specific nodes to be isolated using two steps:

Use generate-id() to find the ID of the node you want to exclude and save it in a variable
When iterating the nodes in the inner loop, compare the result of generate-id() on the current node being processed with the ID saved in step one. If they match, you are processing the isolated node, and can highlight or exclude it as appropriate. For exclusion, you simply add the generate-id() comparison to the existing filter criteria.

Here is how we isolate a node:

<!-- Gets the ID of the first (oldest) show that meets our criteria.
     This will either be the show currently broadcasting or the next show
     to be aired. -->
<xsl:variable name="schedules-nextshow-id">
    <xsl:for-each select="&matching-broadcasts;">
        <xsl:sort select="startTimeISODisplay" data-type="text" order="ascending" />
        <xsl:if test="position()=1">
            <xsl:value-of select="generate-id(.)" />
        </xsl:if>
    </xsl:for-each>
</xsl:variable>

<!-- Gets the first (oldest) show that meets our criteria -->
<xsl:variable name="schedules-nextshow" select="&matching-broadcasts;[generate-id(.)=$schedules-nextshow-id]" />

These should be top-level variable definitions so they can be used by any template in the stylesheet.

The first variable gets the ID of the node to isolate. It works by iterating over the filtered list of items to group, sorted such that the node we want to isolate is the first one iterated (in this case, by sorting the events in chronological order), using generate-id() on the first node iterated and storing the result, and skipping processing of all other nodes by way of the condition position()=1.

The second variable gets the actual isolated node from its ID by using an XPath expression to filter out all items except the one with the stored ID. This variable definition is not actually required for grouping purposes, but is used to access the isolated node separately in the code that processes it. For example in our scheduler, the isolated node is displayed highlighted by other code in the stylesheet.

In our grouping template, we just add an additional condition to the filter criteria in the inner loop:

<xsl:if test="&first-show-of-local-day;">
    <xsl:for-each select="$filtered-schedules[substring(startTimeISODisplay,1,10)=substring(current()/startTimeISODisplay,1,10) and generate-id(.)!=$schedules-nextshow-id]">
    <xsl:sort select="startTimeISODisplay" data-type="text" order="ascending" />

Remember that $filtered-schedules is the subset of items we want the template to process, and has already had all the other unwanted nodes filtered out.

This example always excludes the isolated node from processing. If you just want to exclude it in certain cases, you can instead add the filter condition generate-id()!=$schedules-nextshow-id in the subset parameter ($filtered-schedules in this case) in your call to the grouping template (<xsl:call-template name="filtered-schedules"> in this case).
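To illustrate excluding a node at call time rather than inside the grouping template, here is a self-contained sketch using the food document instead of the scheduler data. The variable name first-item-id is made up for this example; it isolates the item that sorts first by name and excludes it from the first call only.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

 <xsl:output method="html" />

 <xsl:variable name="food" select="document('food.xml')/food" />

 <!-- ID of the item that sorts first by @name (hypothetical variable name) -->
 <xsl:variable name="first-item-id">
  <xsl:for-each select="$food/item">
   <xsl:sort select="@name" data-type="text" order="ascending" />
   <xsl:if test="position()=1">
    <xsl:value-of select="generate-id(.)" />
   </xsl:if>
  </xsl:for-each>
 </xsl:variable>

 <xsl:template match="/">
  <!-- First call: the isolated item is filtered out in the subset parameter -->
  <table>
   <xsl:call-template name="filtered-food">
    <xsl:with-param name="subset" select="$food/item[generate-id(.)!=$first-item-id]" />
   </xsl:call-template>
  </table>

  <!-- Second call: same template, nothing excluded -->
  <table>
   <xsl:call-template name="filtered-food">
    <xsl:with-param name="subset" select="$food/item" />
   </xsl:call-template>
  </table>
 </xsl:template>

 <!-- Grouping template: knows nothing about the excluded node -->
 <xsl:template name="filtered-food">
  <xsl:param name="subset" />
  <xsl:for-each select="$food/item">
   <xsl:sort select="@type" data-type="text" order="ascending" />
   <xsl:if test="not(preceding-sibling::item[@type=current()/@type])">
    <xsl:for-each select="$subset[@type=current()/@type]">
     <xsl:sort select="@name" data-type="text" order="ascending" />
     <xsl:if test="position()=1">
      <tr><td style="background-color: yellow"><xsl:value-of select="@type" /></td></tr>
     </xsl:if>
     <tr><td><xsl:value-of select="@name" /></td></tr>
    </xsl:for-each>
   </xsl:if>
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>
```

Keeping the exclusion in the caller means the grouping template stays generic, at the cost of repeating the generate-id() condition in each call that needs it.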

If we want to do custom processing for the isolated node, we can reference it however we like elsewhere in the stylesheet. For example, in our scheduler, we use this code to indicate to the user whether the next show is broadcasting right now, or coming up next:

<xsl:choose>
    <xsl:when test="php:function('time') > $schedules-nextshow/startTimeUTC
and php:function('time') &lt;= $schedules-nextshow/endTimeUTC">
        <xsl:text>Show currently in progress </xsl:text>
    </xsl:when>
    <xsl:otherwise>
        <xsl:text>Next show </xsl:text>
    </xsl:otherwise>
</xsl:choose>

Conclusion

Grouping in XSL without <xsl:key> is useful when the source data is not the main source document being processed by the stylesheet. This occurs either when the source data is stored as a nodeset in a variable, or when document() is used to reference nodes in an external XML document. This kind of grouping can be awkward and inefficient (especially for large datasets, since every item is compared against all of its preceding siblings), but I hope the techniques presented here make your grouping problems a little less troublesome!
