A few weeks ago, I had to handle a few scenarios involving XML files. One of the problems I had was comparing some large XML files that had small content and schema changes.
My first attempt was to use a regular text diff tool like KDiff3. I did not get the result I wanted. Since the whitespace differed from one file to the other, I could not easily pinpoint where the meaningful differences were.
For example, take these two XML files, where [TAB] is the tab character:
<client id="30" name="Georges">
<phone>
<number>
555-555-5555
</number>
</phone>
<phone />
</client>
<client name="Georges" id="30">
[TAB]<phone></phone>
[TAB]<phone>
[TAB][TAB]<number>
[TAB][TAB][TAB]555-555-5555
[TAB][TAB]</number>
[TAB]</phone>
</client>
They can be exactly the same XML file with the same meaning, but to a text diff utility, they look quite different. Whitespace, empty tags written with or without an end tag, attribute order and tag order can all make it hard to find the real differences between XML files with simple tools.
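To make the point concrete, here is a minimal Python sketch that reduces each document to an order-insensitive summary before comparing. The helper name `canonical` is made up for this illustration. It assumes that sibling order and surrounding whitespace carry no meaning, which is true for the sample above but not for every XML vocabulary. It also ignores text that follows a child element (mixed content).

```python
import xml.etree.ElementTree as ET

def canonical(elem):
    """Return a hashable summary that ignores whitespace, attribute order
    and sibling order (an assumption that fits the sample files only)."""
    text = (elem.text or "").strip()
    attrs = tuple(sorted(elem.attrib.items()))
    children = tuple(sorted(canonical(child) for child in elem))
    return (elem.tag, attrs, text, children)

FILE_A = """<client id="30" name="Georges">
<phone>
<number>
555-555-5555
</number>
</phone>
<phone />
</client>"""

FILE_B = """<client name="Georges" id="30">
\t<phone></phone>
\t<phone>
\t\t<number>
\t\t\t555-555-5555
\t\t</number>
\t</phone>
</client>"""

same = canonical(ET.fromstring(FILE_A)) == canonical(ET.fromstring(FILE_B))
print(same)  # True: the two samples carry the same data
```

This only answers "same or not". It says nothing about where two documents differ, which is the hard part.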
I began searching for some real XML diff tools. I found a few commercial offerings including DeltaXML and a few open source projects including xmldiff and XMLUnit that could be of help.
The commercial products were pretty good. I had a hard time using the open source ones. They all had some problems that I could not get past. I decided to build my own library to do what I wanted. I thought it would be easy to build.
I started with these goals in mind:
- It should work fine on large streams using a read-only forward-only access interface.
- It should support namespaces.
- It should detect additions, removals, renames and moves of elements, attributes, namespaces and data.
- It should be accessible as a library and as a command line tool.
Goals 2 and 4 were quite easy. Many XML interfaces support namespaces these days, and building a command line tool over a well-defined library is a simple matter.
Goals 1 and 3 were another matter. There are many APIs for accessing an XML file in a read-only, forward-only manner. I settled on pulldom with Python.
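For readers who have not used it, this is roughly what forward-only reading with `xml.dom.pulldom` from the Python standard library looks like. It is a sketch, not the code I wrote at the time. The function name `iter_events` and the simplified tuples it yields are made up for this illustration.

```python
from xml.dom import pulldom

def iter_events(xml_text):
    """Yield simplified (kind, value, attrs) tuples in document order.
    Nothing is kept once an event has been yielded."""
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT:
            # Sort attributes so their order in the file does not matter.
            yield ("start", node.tagName, sorted(node.attributes.items()))
        elif event == pulldom.END_ELEMENT:
            yield ("end", node.tagName, [])
        elif event == pulldom.CHARACTERS:
            text = node.data.strip()
            if text:  # skip whitespace-only chunks
                yield ("text", text, [])

SAMPLE = """<client name="Georges" id="30">
\t<phone></phone>
\t<phone>
\t\t<number>555-555-5555</number>
\t</phone>
</client>"""

events = list(iter_events(SAMPLE))
for kind, value, attrs in events:
    print(kind, value, attrs)
```

The parser may deliver one run of text as several character events, so a real tool would have to join adjacent chunks before comparing them.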
After writing some code to read the files, I began to sketch my algorithm on a blank sheet of paper. I often like to step back and check on paper where I am going before investing more time in coding. My first scenario was the simplest one: finding out whether an element has been added to or removed from either file.
After a few sketches, I found out that it would be impossible to handle even that simple case in a forward-only manner without storing some part of the file in memory and, in the worst case, nearly the whole file. That pretty much killed my first goal.
Next, I chose to search the web for tips on how to compare hierarchical structures like XML files using random access. My new goal was to load both trees in memory and compare them. I found many graduate papers on different algorithms to perform that kind of comparison, but none of them were simple.
Finally, I chose not to build it. XML diff is hard and it is not for me. I will use the tools already available instead of building my own.
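To give an idea of what "load both trees and compare them" means in its most naive form, here is a small Python sketch. It is not one of the algorithms from those papers. It only counts element paths in each tree and reports the ones that appear more often on one side. It cannot detect renamed or moved elements, and it ignores attributes and text. The names `element_paths` and `naive_diff` are made up for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def element_paths(elem, prefix=""):
    """Yield a slash-separated tag path for elem and every descendant."""
    path = f"{prefix}/{elem.tag}"
    yield path
    for child in elem:
        yield from element_paths(child, path)

def naive_diff(old_xml, new_xml):
    """Return (removed, added) path lists; both trees are fully in memory."""
    old = Counter(element_paths(ET.fromstring(old_xml)))
    new = Counter(element_paths(ET.fromstring(new_xml)))
    removed = sorted((old - new).elements())
    added = sorted((new - old).elements())
    return removed, added

OLD = "<client><phone><number>555-555-5555</number></phone></client>"
NEW = "<client><phone/><email/></client>"

removed, added = naive_diff(OLD, NEW)
print(removed)  # ['/client/phone/number']
print(added)    # ['/client/email']
```

Even this toy version needs every path from both files before it can say anything, which is why the forward-only goal did not survive.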