Tuesday, January 23, 2007

Converting OPML files with XSLT

We just got some nice new Dell laptops here at Magenic. Each laptop comes with IE7 and Office 2007 (Word 2007, Outlook 2007, and so on). IE7 can track RSS feeds. Our Minneapolis GM, Dave Meier, was using Onfolio to track his RSS feeds and decided to switch to IE7. In theory, this should be as simple as exporting the OPML file that describes the feeds from Onfolio and importing that file into IE7. There is a standard for these files (see http://www.opml.org/spec2#subscriptionLists). However, it turns out that Onfolio was not outputting an attribute “required” by the standard, specifically type=”rss”. It seems that some packages simply assume that RSS is the target and leave the attribute out. IE7 was enforcing the requirement that the attribute be present (Microsoft enforcing standards, who knew?).

Obviously, one could edit the OPML file in a text editor and fix it manually. This would not be too bad with a macro capability that could loop through the text and make the changes, but what would be the fun in that? Dave Meier was aware of XSLT and thought that it might be of help. For those of you not familiar with XSLT, it is a language (expressed in XML) that defines a set of transformations from an input XML file to an output file. The output file may be XML (as you will see is the case here), HTML, or plain text. XSLT is very much a pattern matching language: the language says “when you see this pattern in the XML input, do these things to create output”.

Let’s take a look at the example that I created for Dave Meier. The input looked something like this:

<opml version="1.1">
  <head>
    <title>My Feeds</title>
  </head>
  <body>
    <outline text="News">
      <outline text="Latest news from Minneapolis/St. Paul Business Journal" xmlUrl="http://www.bizjournals.com/rss/feed/daily/twincities" htmlUrl="http://twincities.bizjournals.com/twincities/breaking_news.html?from_rss=1" />
    </outline>
  </body>
</opml>

What complicates things is that there are two “outline” elements at different levels of the hierarchy. The higher (or grouping) element is distinguished from the lower (or specifying) element by the fact that only the lower element has the “xmlUrl” attribute.
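To make the distinction concrete, here is a minimal sketch in Python (not part of the original toolchain, which is XSLT and .NET) that separates the two kinds of “outline” elements by testing for “xmlUrl”. It assumes the sample is wrapped in the usual OPML head and body elements; the names SAMPLE_OPML and split_outlines are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed shape of the sample input, with the standard head/body wrappers.
SAMPLE_OPML = """<opml version="1.1">
  <head>
    <title>My Feeds</title>
  </head>
  <body>
    <outline text="News">
      <outline text="Latest news from Minneapolis/St. Paul Business Journal"
               xmlUrl="http://www.bizjournals.com/rss/feed/daily/twincities"
               htmlUrl="http://twincities.bizjournals.com/twincities/breaking_news.html?from_rss=1" />
    </outline>
  </body>
</opml>"""


def split_outlines(opml_text):
    """Return (grouping, feed) lists of outline elements, split on xmlUrl."""
    root = ET.fromstring(opml_text)
    outlines = list(root.iter("outline"))
    # Only the lower (feed) outlines carry an xmlUrl attribute.
    feeds = [o for o in outlines if "xmlUrl" in o.attrib]
    groups = [o for o in outlines if "xmlUrl" not in o.attrib]
    return groups, feeds


if __name__ == "__main__":
    groups, feeds = split_outlines(SAMPLE_OPML)
    print("groups:", [g.get("text") for g in groups])
    print("feeds: ", [f.get("xmlUrl") for f in feeds])
```

The test on “xmlUrl” here plays the same role as the predicate in the XSLT match pattern outline[@xmlUrl].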

However, with XSLT this is not a problem. Here is the XSLT “script” (XSLT is typically compiled, which makes it a “programming” language, but I have mostly used it in scripting situations, which is why I call it a script):


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" encoding="UTF-8" omit-xml-declaration="yes" indent="no"/>
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="outline[@xmlUrl]">
    <xsl:copy>
      <xsl:attribute name="type">rss</xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

This script contains two templates. The first is a general template that matches every node and attribute in the incoming XML file. It appears in many XSLT scripts and is called the “identity” template because it copies everything in the input to the output. The second template is specific to our application. It matches only those “outline” elements that have an “xmlUrl” attribute. Its output actions emit the “outline” element, then the missing “type” attribute, and then the rest of the matched “outline” element (its existing attributes and children). We could test for an existing “type” attribute, but there is no need: when an attribute is added to an element that already has one of the same name, XSLT simply replaces the earlier value. Because the existing attributes are copied after our “rss” default, the script in effect keeps any existing value and adds “rss” only where there is no “type” attribute.
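As a cross-check on that behavior, the following is a rough Python equivalent of what the two templates do together, using only the standard library. It is a sketch under stated assumptions, not the XSLT itself: the function name add_rss_type, the example.com URLs, and the type value “custom” are all hypothetical.

```python
import xml.etree.ElementTree as ET


def add_rss_type(opml_text):
    """Return the OPML text with type="rss" added to feed outlines lacking a type."""
    root = ET.fromstring(opml_text)
    for outline in root.iter("outline"):
        if "xmlUrl" in outline.attrib:
            # setdefault leaves an existing type untouched, mirroring the way
            # the copied attribute wins over the "rss" default in the stylesheet.
            outline.attrib.setdefault("type", "rss")
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: one grouping outline, one feed without a type,
# and one feed that already declares a (made-up) type.
EXAMPLE = (
    '<opml version="1.1"><body>'
    '<outline text="News">'
    '<outline text="A" xmlUrl="http://example.com/a.xml"/>'
    '<outline text="B" xmlUrl="http://example.com/b.xml" type="custom"/>'
    '</outline>'
    '</body></opml>'
)

if __name__ == "__main__":
    print(add_rss_type(EXAMPLE))
```

Feed “A” gains type="rss", feed “B” keeps its existing type, and the grouping outline is left alone, which is the same outcome described for the stylesheet.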

A question of interest is “what happens when two or more templates match a particular node of the input?” The answer is that the XSLT processor assigns a default priority to each template based upon how widely the template “spreads its pattern matching net”. The narrower the match, the higher the priority; the wider the match, the lower the priority. There are more rules (an explicit priority attribute, import precedence), but in the end, when multiple templates match a node, the template with the highest priority is applied. In our example, this means that the upper “identity” template handles everything except the input matched by the lower “application” template. In practice, many applications of XSLT consist of dropping the identity template into the script and adding just the templates needed for the specific input that must change.
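For the curious, the default priorities defined by XSLT 1.0 can be modeled in a few lines. The sketch below is a simplified illustration in Python, not something an XSLT processor exposes: the function names are invented, and the union split is naive (it would break on a “|” inside a predicate).

```python
import re

NAME = r"[A-Za-z_][\w.-]*"
AXIS = r"(?:@|child::|attribute::)?"
PI_LITERAL = r"""processing-instruction\((?:'[^']*'|"[^"]*")\)"""


def default_priority(pattern):
    """Default priority of a single (non-union) XSLT 1.0 match pattern."""
    p = pattern.strip()
    # A plain name test such as "outline" or "@type" gets 0.
    if re.fullmatch(AXIS + r"(?:" + NAME + r":)?" + NAME, p):
        return 0.0
    # So does processing-instruction('name').
    if re.fullmatch(AXIS + PI_LITERAL, p):
        return 0.0
    # A namespace wildcard such as "ns:*" gets -0.25.
    if re.fullmatch(AXIS + NAME + r":\*", p):
        return -0.25
    # Any other bare node test ("*", "@*", "node()", ...) gets -0.5.
    if re.fullmatch(
        AXIS + r"(?:\*|node\(\)|text\(\)|comment\(\)|processing-instruction\(\))", p
    ):
        return -0.5
    # Everything more specific (predicates, multiple steps) gets 0.5.
    return 0.5


def union_priorities(pattern):
    """Each alternative of a union is treated as its own template rule."""
    return [default_priority(part) for part in pattern.split("|")]


if __name__ == "__main__":
    print(union_priorities("@*|node()"))
    print(default_priority("outline[@xmlUrl]"))
```

Under these rules both halves of the identity pattern rank at -0.5 while outline[@xmlUrl] ranks at 0.5, which is why the application template wins for feed outlines.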

If you have an XSLT processor handy, you just invoke that processor, specifying the input XML, the script, and the output XML file. There are several free processors available, most of which use Java. Since we are a Microsoft shop, you should know that the .NET BCL includes a class, XslCompiledTransform, that handles XSLT 1.0 (there is an XSLT 2.0 standard, but Microsoft has so far chosen not to support it). Here is the .NET C# source code for a simple command line utility that transforms an input file using an XSLT script to produce an output XML file (watch for word wrap):


using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;
using System.Text;

namespace RunXsltTransformation
{
    class MainProgram
    {
        private static FileInfo InputXML;
        private static FileInfo XsltXML;
        private static FileInfo OutputXML;

        static void Main(string[] args)
        {
            bool CanContinue = true;

            if (CanContinue)
            {
                CanContinue = PrepareFiles(args);
            }

            if (CanContinue)
            {
                CanContinue = TransformInput();
            }

            if (CanContinue)
            {
                Console.WriteLine("Completed Successfully");
            }
            else
            {
                Console.WriteLine("Could not continue because of errors");
            }
        }

        // Validates the three command line arguments and fills in the FileInfo fields.
        private static bool PrepareFiles(string[] args)
        {
            try
            {
                if (3 != args.Length)
                {
                    WriteMessage("There must be three arguments: input file path, Xslt file path, output file path");
                    return false;
                }
                InputXML = GetIncomingFileInfo(args[0], "Input XML");
                XsltXML = GetIncomingFileInfo(args[1], "XSLT XML");
                OutputXML = GetOutgoingFileInfo(args[2], "Output XML");
                if (null == InputXML || null == XsltXML || null == OutputXML)
                {
                    return false;
                }
                return true;
            }
            catch (Exception ex)
            {
                WriteMessage(ex.Message);
                return false;
            }
        }

        // Loads the stylesheet and applies it to the input, writing UTF-8 output.
        private static bool TransformInput()
        {
            XmlReader InputReader = null;
            StreamWriter OutputWriter = null;
            try
            {
                // build the transform
                XslCompiledTransform transform = new XslCompiledTransform();
                InputReader = new XmlTextReader(new StreamReader(InputXML.FullName, System.Text.Encoding.UTF8));
                OutputWriter = new StreamWriter(OutputXML.FullName, false, System.Text.Encoding.UTF8);
                transform.Load(XsltXML.FullName, XsltSettings.Default, null);

                // and now the transform
                transform.Transform(InputReader, null, OutputWriter);
                return true;
            }
            catch (Exception ex)
            {
                WriteMessage(ex.Message);
                return false;
            }
            finally
            {
                // Either may still be null if opening the files failed.
                if (null != InputReader)
                {
                    InputReader.Close();
                }
                if (null != OutputWriter)
                {
                    OutputWriter.Close();
                }
            }
        }

        // Returns a FileInfo for a file that must already exist, or null on error.
        private static FileInfo GetIncomingFileInfo(string theFilePath, string theTitle)
        {
            if (null == theFilePath)
            {
                WriteMessage(theTitle + " file path cannot be null");
                return null;
            }
            if (theFilePath.Trim().Length == 0)
            {
                WriteMessage(theTitle + " file path cannot be empty");
                return null;
            }
            if (false == File.Exists(theFilePath))
            {
                WriteMessage(theTitle + " file path must exist");
                return null;
            }
            return new FileInfo(theFilePath);
        }

        // Returns a FileInfo for the output file, deleting any existing copy first.
        private static FileInfo GetOutgoingFileInfo(string theFilePath, string theTitle)
        {
            if (null == theFilePath)
            {
                WriteMessage(theTitle + " file path cannot be null");
                return null;
            }
            if (theFilePath.Trim().Length == 0)
            {
                WriteMessage(theTitle + " file path cannot be empty");
                return null;
            }
            if (true == File.Exists(theFilePath))
            {
                File.Delete(theFilePath);
            }
            return new FileInfo(theFilePath);
        }

        private static void WriteMessage(string theMessage)
        {
            Console.WriteLine(theMessage);
        }
    }
}

This is taken from a VS2005 project that compiles the binary into a file called RunXsltTransformation.exe. With that binary, the call to convert the feeds file, myFeeds.xml, to an output file, myConvertedFeeds.xml, using an XSLT script, ConvertOPML.xslt, would look like this (running from the command line in the same directory where all of the files are located):

RunXsltTransformation myFeeds.xml ConvertOPML.xslt myConvertedFeeds.xml