How to tidy up badly formatted XML with an XSL stylesheet

January 21, 2008

How do you turn this:

<broadcastScheduler><!-- ================================================================ --><!--
CURRENT SHOWS --><!-- ================================================================ --><recurr
ingBroadcast href="show-varee-season-2" indexStart="26" recurFrequency="14"><firstAirDate channel
="d2" date="2007-09-21T22:00:00-06:00" dstRules="us" automaticStartup="false"/><firstAirDate chan
nel="d1" date="2007-10-05T22:00:00+00:00" dstRules="eu" automaticStartup="true"/></recurringBroad
cast><recurringBroadcast href="show-dan-ascherl-season-1" indexStart="6" recurFrequency="1md"><fi
rstAirDate channel="d1" date="2007-05-04T22:00:00+01:00" dstRules="eu" automaticStartup="true"/><
firstAirDate channel="d2" date="2007-05-04T22:00:00-05:00" dstRules="us" automaticStartup="true"/
><firstAirDate channel="d1" date="2007-05-18T22:00:00+01:00" dstRules="eu" automaticStartup="true
"/><firstAirDate channel="d2" date="2007-05-18T22:00:00-05:00" dstRules="us" automaticStartup="tr
ue"/></recurringBroadcast><recurringBroadcast href="show-bent-killer-season-1" indexStart="21" re
curFrequency="14"><firstAirDate channel="d1" date="2007-08-29T19:30:00+01:00" dstRules="eu" autom
aticStartup="true"/><firstAirDate channel="d2" date="2007-08-29T19:30:00-05:00" dstRules="us" aut
omaticStartup="true"/></recurringBroadcast><recurringBroadcast href="show-serge-season-1" indexSt
art="6" recurFrequency="7"><firstAirDate channel="d1" date="2007-05-04T22:00:00+02:00" dstRules="
eu" automaticStartup="true"/><firstAirDate channel="d2" date="2007-05-04T21:00:00-05:00" dstRules
="us" automaticStartup="true"/></recurringBroadcast><recurringBroadcast href="show-neil-bowles-se
ason-1" indexStart="15" recurFrequency="14"><firstAirDate channel="d1" date="2007-11-01T23:00:00+
01:00" dstRules="eu" automaticStartup="true"/><firstAirDate channel="d2" date="2007-11-01T23:00:0
0-05:00" dstRules="us" automaticStartup="true"/></recurringBroadcast>

into this:

<broadcastScheduler>
<!-- ================================================================ -->
<!-- CURRENT SHOWS -->
<!-- ================================================================ -->
  <recurringBroadcast href="show-varee-season-2" indexStart="26" recurFrequency="14">
    <firstAirDate channel="d2" date="2007-09-21T22:00:00-06:00" dstRules="us" automaticStartup="false"/>
    <firstAirDate channel="d1" date="2007-10-05T22:00:00+00:00" dstRules="eu" automaticStartup="true"/>
  </recurringBroadcast>
  <recurringBroadcast href="show-dan-ascherl-season-1" indexStart="6" recurFrequency="1md">
    <firstAirDate channel="d1" date="2007-05-04T22:00:00+01:00" dstRules="eu" automaticStartup="true"/>
    <firstAirDate channel="d2" date="2007-05-04T22:00:00-05:00" dstRules="us" automaticStartup="true"/>
    <firstAirDate channel="d1" date="2007-05-18T22:00:00+01:00" dstRules="eu" automaticStartup="true"/>
    <firstAirDate channel="d2" date="2007-05-18T22:00:00-05:00" dstRules="us" automaticStartup="true"/>
  </recurringBroadcast>
  <recurringBroadcast href="show-bent-killer-season-1" indexStart="21" recurFrequency="14">
    <firstAirDate channel="d1" date="2007-08-29T19:30:00+01:00" dstRules="eu" automaticStartup="true"/>
    <firstAirDate channel="d2" date="2007-08-29T19:30:00-05:00" dstRules="us" automaticStartup="true"/>
  </recurringBroadcast>
  <recurringBroadcast href="show-serge-season-1" indexStart="6" recurFrequency="7">
    <firstAirDate channel="d1" date="2007-05-04T22:00:00+02:00" dstRules="eu" automaticStartup="true"/>
    <firstAirDate channel="d2" date="2007-05-04T21:00:00-05:00" dstRules="us" automaticStartup="true"/>
  </recurringBroadcast>
  <recurringBroadcast href="show-neil-bowles-season-1" indexStart="15" recurFrequency="14">
    <firstAirDate channel="d1" date="2007-11-01T23:00:00+01:00" dstRules="eu" automaticStartup="true"/>
    <firstAirDate channel="d2" date="2007-11-01T23:00:00-05:00" dstRules="us" automaticStartup="true"/>
  </recurringBroadcast>

Three thousand lines of the above were the horror I woke up to one day when PHP decided to stop properly formatting automated changes to our radio schedules. We need our schedule XML to be human-readable, so I had to come up with a quick way to reformat it.

This XSL stylesheet will do the job for you:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output indent="yes" method="xml" />

    <xsl:strip-space elements="*" />

    <!-- Start at the document root: process top-level elements and comments -->
    <xsl:template match="/">
        <xsl:apply-templates select="*|comment()" />
    </xsl:template>

    <!-- Put each comment on its own line -->
    <xsl:template match="comment()">
        <xsl:text>&#13;&#10;</xsl:text>
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Copy each element with its attributes, recurse, then end the line -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:copy-of select="@*" />

            <xsl:apply-templates select="*|comment()|text()" />
        </xsl:copy>
        <xsl:text>&#13;&#10;</xsl:text>
    </xsl:template>

    <!-- Keep text only when it has no sibling elements -->
    <xsl:template match="text()">
        <xsl:if test="not(preceding-sibling::*) and not(following-sibling::*)">
            <xsl:copy-of select="." />
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>

There are four key points to how this works:

  1. In xsl:output, the indent attribute is set to yes. This asks the XSLT processor to indent each nested level of elements automatically.
  2. The <xsl:strip-space elements="*" /> declaration removes all whitespace-only text between elements, for example between the end of one element and the start of its next sibling. The formatting rules in the rest of the stylesheet won't work without this, because between-element whitespace would otherwise be preserved and prevent auto-indentation from working.
  3. Before comments and after element closing tags, newlines are inserted with the character references &#13; and &#10; (carriage return and linefeed respectively), so that each comment and each element gets its own line.
  4. Text nodes are only copied if they have no sibling elements (see the text() template definition). Without this, stray text sitting between elements would be copied to the output and prevent indentation from working. Note that this also drops text from mixed content, where an element contains both text and child elements.
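
For files with no comments or stray text to worry about, points 1 and 2 may be enough on their own. As a minimal sketch, assuming a standard XSLT 1.0 processor, they can be paired with the usual identity template (this is a cut-down variant for illustration, not the stylesheet above):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <!-- Point 1: ask the processor to indent the output -->
    <xsl:output method="xml" indent="yes" />

    <!-- Point 2: drop whitespace-only text nodes between elements -->
    <xsl:strip-space elements="*" />

    <!-- Identity template: copy every attribute and node unchanged -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" />
        </xsl:copy>
    </xsl:template>
</xsl:stylesheet>
```

Unlike the full stylesheet, this variant copies every text node as-is and adds no explicit newlines, so how comments and mixed content are laid out is left to the processor's own indenter and can differ between processors.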

If you use a program like Visual Studio or ActiveState Komodo, you can run the stylesheet on your badly behaved XML right from your development environment. Paste the stylesheet above into your editor, run it, and select the XML file to process.

I hope you find the stylesheet useful!

  1. Anonymous
    September 17, 2012 at 20:43

    Hi Katy, thanks a lot for the explanation and the code as well – keep up the good job!!!
