Central control, central power

2009 is here. My New Year's resolution is actually a wish. I wish we would move towards a more decentralized way of doing things, whether it is within the political system, within companies, within open source projects or in the way we create things.

The more I think about it, the more it makes sense. How can a central unit, person or group know what is right for everyone else?

There are some situations that require a decision that goes against the will of the majority, or that a decentralized model cannot easily settle, even though it is just the right thing to do. A good example is the plastic bag tax. In many countries, when a tax on plastic bags was first announced in order to reduce their use, people were against it. After a few months, plastic bag usage had dropped significantly. People changed their habits, changed their minds and actually started promoting the use of reusable bags.

We should find a way to isolate those cases and resolve them with a strong decision making process like a benevolent dictator, but the normal way of resolving problems should come from the bottom and not the top.

By the way, happy new year!

My failed attempt at building an XML diff library

A few weeks ago, I had to handle some scenarios involving XML files. One of the problems I had was comparing large XML files that had small content and schema changes.

My first attempt was to use a regular text diff tool like KDiff3. I did not get the result I wanted. Since the whitespace was different in each file, I could not easily pinpoint where the meaningful differences were.

For example, take these two XML files, where [TAB] is the tab character:

<client id="30" name="Georges">
 <phone>
  <number>
   555-555-5555
  </number>
 </phone>
 <phone />
</client>

<client name="Georges" id="30">
[TAB]<phone></phone>
[TAB]<phone>
[TAB][TAB]<number>
[TAB][TAB][TAB]555-555-5555
[TAB][TAB]</number>
[TAB]</phone>
</client>

They can represent exactly the same XML document with the same meaning, but to a text diff utility they look quite different. Whitespace, empty tags written with or without an end tag, attribute order and tag order all make it hard to find the real differences between XML files with simple tools.
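To make the point concrete, here is a minimal Python sketch that reduces both documents to a normalized form before comparing them. It is only an illustration, not one of the tools mentioned in this post. The function names are made up for the example, and sorting sibling elements assumes that tag order carries no meaning in your format. Text that follows a closing tag (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET


def canonical(element):
    """Return a nested tuple that ignores formatting-only differences."""
    # Surrounding whitespace in the text is dropped; missing text becomes "".
    text = (element.text or "").strip()
    # Attributes are sorted so that their order does not matter.
    attributes = tuple(sorted(element.attrib.items()))
    # Children are sorted so that sibling order does not matter
    # (assumption: element order is not significant in this format).
    children = tuple(sorted(canonical(child) for child in element))
    return (element.tag, attributes, text, children)


def same_meaning(xml_a, xml_b):
    """Compare two XML strings after normalization."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))


FIRST = """<client id="30" name="Georges">
 <phone>
  <number>
   555-555-5555
  </number>
 </phone>
 <phone />
</client>"""

SECOND = (
    '<client name="Georges" id="30">\n'
    "\t<phone></phone>\n"
    "\t<phone>\n"
    "\t\t<number>\n"
    "\t\t\t555-555-5555\n"
    "\t\t</number>\n"
    "\t</phone>\n"
    "</client>"
)

if __name__ == "__main__":
    print(same_meaning(FIRST, SECOND))
```

This only answers "same or different". It says nothing about where the differences are, which is the hard part.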

I began searching for some real XML diff tools. I found a few commercial offerings including DeltaXML and a few open source projects including xmldiff and XMLUnit that could be of help.

The commercial products were pretty good. I had a hard time using the open source ones. They all had some problems that I could not get past. I decided to build my own library to do what I wanted. I thought it would be easy to build.

I started with these goals in mind:

  1. It should work fine on large streams using a read-only forward-only access interface.
  2. It should support namespaces.
  3. It should detect added, removed, renamed and moved actions for elements, attributes, namespaces and data.
  4. It should be accessible as a library and as a command line tool.

Goals 2 and 4 were quite easy. Many XML interfaces support namespaces these days, and building a command line tool over a well-defined library is a simple matter.

Goals 1 and 3 were another matter. There are many APIs to access an XML file in a read-only, forward-only manner. I settled on pulldom with Python.
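For readers who have never used it, this is roughly what forward-only reading looks like with pulldom from the Python standard library. It is a small sketch, not the code I wrote at the time, and the function name is made up for the example. The parser hands out one event at a time and never builds the whole tree.

```python
from xml.dom import pulldom


def element_names(xml_text):
    """Yield element names in document order, one event at a time."""
    for event, node in pulldom.parseString(xml_text):
        # Only opening tags are reported; text and closing tags are skipped.
        if event == pulldom.START_ELEMENT:
            yield node.tagName


if __name__ == "__main__":
    sample = '<client id="30"><phone/><phone><number>555</number></phone></client>'
    for name in element_names(sample):
        print(name)
```

For a file on disk, pulldom.parse() accepts a file name or an open stream instead of a string.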

I began to draw my algorithm on a blank sheet of paper after creating some code to read the files. I often like to step back and check where I am going on paper before investing more time coding. My first scenario was the simplest one: finding out if an element has been added or removed from either file.

After a few sketches, I found out it would be impossible to handle even that simple case in a forward-only manner without storing some part of the file in memory and, in the worst case, nearly the whole file. That pretty much killed my first goal.

Afterwards, I chose to search the web for some tips on how to compare hierarchical structures like XML files using random access. My new goal was to load both trees in memory and compare them. I found many graduate papers on different algorithms to perform that kind of comparison, but none of them was simple.

Finally, I chose not to build it. XML diff is hard and it is not for me. I will use the tools already available instead of building my own.
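To show what "load both trees and compare them" can mean in its most naive form, here is a Python sketch that counts element paths in each document and reports what was added or removed. It is not one of the algorithms from those papers. The function names are hypothetical, and it cannot detect moved or renamed elements, attributes or text changes.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def element_paths(element, prefix=""):
    """Yield a slash-separated path for the element and all its descendants."""
    path = f"{prefix}/{element.tag}"
    yield path
    for child in element:
        yield from element_paths(child, path)


def added_and_removed(old_xml, new_xml):
    """Return (added, removed) element paths between two XML strings."""
    old = Counter(element_paths(ET.fromstring(old_xml)))
    new = Counter(element_paths(ET.fromstring(new_xml)))
    # Counter subtraction keeps only positive counts, so duplicates are handled.
    added = new - old
    removed = old - new
    return sorted(added.elements()), sorted(removed.elements())


if __name__ == "__main__":
    before = "<client><phone/><fax/></client>"
    after = "<client><phone/><phone/><email/></client>"
    print(added_and_removed(before, after))
```

Both documents are fully loaded in memory here, which is exactly what my first goal was trying to avoid.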

How it all started

My first moment of awe working with a computer was when I was 10. It was after I completed my first “real” program.

I was toying with the different commands available on my father’s personal computer running a version of MS-DOS. I quickly got tired of using “dir” and “copy”, and then I discovered “qbasic”.

It was a wonderful discovery, full of menus, an editor, some basic windows and the ability to run something. What in the world could I run, and how?

I knew absolutely nothing about QBasic. I went into the help documentation to find some basic statements to produce something meaningful. After a while, I found out how to print some text on the screen and how to do some simple math. I knew nothing about algebra, but I eventually figured out I could store a value in a named area.

Printing text and doing some simple math became boring after a while. I searched my local BBS to find something else to do with this newly acquired power. I found some simple programs written by other people that did various things. I copied and pasted a few pieces from those sources into my program to try out new things. That is when I discovered I could create random things.

It took a few hours, but I managed to put all that knowledge together and create a working D&D character sheet creator (yes, I am a total nerd).

This is how it all started.

My rant about Team Foundation Server

Many developers all tied together in an endless loop by Jonathan Caves

This is my first rant and it is probably not the last one.

Team Foundation Server (TFS) is the Microsoft way of managing source code and doing some project tracking at the same time. It integrates so many Microsoft technologies that you are totally tied to one of Bill Gates’ legs when you are using it.

Let’s get started with the good parts. TFS is not such bad software. It is at least five times better than what Microsoft previously offered in the same field, which was Visual SourceSafe. It is well integrated with one of the most used IDEs, Visual Studio.

I have to use TFS on a daily basis, mainly through Visual Studio Team System (VSTS), which is the add-on for Visual Studio to access TFS. Getting used to it was a challenge since I had tasted flexible, lightweight and fully distributed source control management systems before. I am still trying to figure out some of the details, but I am slowly merging with the Borg. Here is what I dislike about TFS.

You have to “Check out” a file before modifying it

After getting the source code of your application, you have to perform a “Check out” on a file before modifying it. Why? Well, that is a good question. It might be to keep up with the good old way of doing things in Visual SourceSafe. It might be to inform the central server of who is working on the file. It might be to prevent developers from modifying a file that has been locked. I should probably not mention that you can remove the read-only attribute of the file to start messing with this scheme.

The whole concept of “Check out” is my biggest complaint about TFS, and it is useless to say the least.

It is centrally based

Source control management should not be centrally based anymore. This is 2008 and the first distributed version control system was created in 1997. There is no reason to be tied to a central server and be locked out during outages.

A developer should be able to work with his local repository the way he wants without being tied to a central server.

It uses weird concepts like shelves

Shelves are one of those things that are only really useful in some very specific situations. In my opinion, they should not have been implemented in the first place, or only as an optional add-on. A shelf in TFS is a group of files that is stored in a temporary place under a given name. It is like a lightweight branch where you can easily share your modifications, for instance for approval by others, without having to create a branch for them.

The problem with this concept is that it is not a branch. You are actually storing files on the TFS server and not changes. When you are recovering those shelves, or “unshelving” them, you are restoring those files and not the changes you made. If one of your colleagues modified a file while your copy was sitting on the server, you have to apply your colleague’s changes to the file you restored before checking it in. It is a real mess and quite error-prone.

Merging is not trivial

I had to do my first merge between branches today. I must say that it was quite painful. After making Visual Studio crash twice in a row, I resorted to using the command line client. It took me a while just to find out where it was. I had to read some online documentation to figure it all out because there is no command line help. Trying to get help from the command line actually launches a Windows help file.

My first try failed because I had forgotten to specify the recursive switch. While waiting for my second failed try to complete, I searched the blogosphere for some concrete examples. I figured out I had to specify a changeset range instead of just the changeset I wanted. Finally, I succeeded, but it took a while even once I had the right command.

When you are finally doing the merge, it shows a bunch of windows specifying which files need to get merged without mentioning if there are any conflicting merges or if it can all be done without your intervention.

It is trying to do too many things at the same time

TFS is a you-will-not-need-any-other-tool-to-manage-your-project kind of application. It is trying to be a source control management software, a bug tracking software, a collaborative application, a reporting server, a continuous integration server, a portal server and some others. At best, it is succeeding in only one or two of those areas.

If you do not know anything else, it is probably the best thing in the world. Once you have tried a few specialized tools in each of those areas, you might stop relying on it.

It is not bundled with what you expect from a VCS

I expect my version control system to be bundled at least with a blame or annotate tool to find out who wrote which line in a file. You will not find this in the normal client package for TFS. You will need to install the power tools for TFS. Not to mention that in order to install the power tools, I had to install a bunch of other packages that all refused to install until I had installed some other prerequisites.

My suggestion is to stay away from this monster.

Continuous integration

A small robot giving you a hand by woordenaar

The time has come to get out of the integrated development environment mindset. Software development should not be based on a monolithic piece of software like the IDE. In the beginning, I thought it was great to have all those tools just a click away, but the more I think about it, the more I am repulsed by it. I want to be independent from that easy build button. It seems like it does not fit with that mindset.

In my never-ending quest for better software quality, I got interested in the continuous integration process. Before getting too deep into the matter, let’s try to define what the integration process is in software development. Integration happens when you make a change to your software and you want to make sure everything works with the software you had before and that your change conforms to some quality standards. A change can be a lot of different things like adding a simple message, adding a module, changing the colour of a box or refactoring your whole application. It can happen while starting up your project and developing it, it can be a change during the beta phase or it can be an emergency bug fix for a running application. Whatever is changed, however it is done and wherever it goes, you want to make sure it works.

Integration can be quite easy for small projects with small changes while being quite hard for large projects with big changes. Continuous integration is about making that process an incremental, routinely done, automated and simple thing to do.

How does it work? You will often have one or many dedicated servers waiting for changes to be included in the source control management system. It can be with a hook script or with a pull process. Often, you will also see some of those continuous integration servers set up to run some tasks at predetermined intervals, like every midnight. Whether it is event based or scheduled, they will most likely run one or more of the following tasks:

  • Building the source
  • Running unit tests
  • Running a code coverage analysis
  • Running a code analysis for standards conformity
  • Running a performance analysis
  • Generating code from a model
  • Building the documentation
  • Deploying the results on a test server

For each task, the server will keep a report on what happened, how it went and what the results are. For instance, for the unit tests task, you might want to have a report with how many tests succeeded, how many failed and which ones failed. A concrete example would be the waterfall view for the Google Chrome continuous integration process.

There are many advantages to using a continuous integration process. The most important one is finding bugs early and correcting them early.
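As a rough idea of what such a server does for each run, here is a small Python sketch that executes a list of tasks in order and keeps a report for each one. It is not how BuildBot or CruiseControl work internally. The task names and commands are placeholders invented for the example, and a real setup would call your build tool and test runner instead.

```python
import subprocess
import sys


def run_tasks(tasks):
    """Run each (name, command) pair in order and collect a small report."""
    report = []
    for name, command in tasks:
        completed = subprocess.run(command, capture_output=True, text=True)
        report.append({
            "task": name,
            "succeeded": completed.returncode == 0,
            "output": completed.stdout + completed.stderr,
        })
        # Stop at the first failure: later tasks depend on earlier ones.
        if completed.returncode != 0:
            break
    return report


# Placeholder commands (assumption): each one simply runs a tiny Python snippet.
TASKS = [
    ("build", [sys.executable, "-c", "print('build ok')"]),
    ("unit tests", [sys.executable, "-c", "import sys; sys.exit(1)"]),
    ("documentation", [sys.executable, "-c", "print('docs ok')"]),
]

if __name__ == "__main__":
    for entry in run_tasks(TASKS):
        status = "ok" if entry["succeeded"] else "FAILED"
        print(f"{entry['task']}: {status}")
```

Called from a commit hook or a scheduled job, a script like this already gives you the quick feedback that matters most.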

Let’s say you configured your continuous integration server to build your source whenever you commit a change in your source control management system. If you commit a change that breaks the build process, you can be notified quickly that you did something wrong and correct it. The cost of correcting a bug is often proportional to the length of time from the moment it was introduced to the moment it was found.

There are many solutions available to fill your needs: open source, free and commercial. BuildBot and CruiseControl are some of the popular ones. A simple script might also be the best thing for you if not much is needed.

With the quality goal in mind, a proper continuous integration process is a must.

Automation and you

Keeping the gears running by Curious Expeditions

Many new business software projects are about automating repetitive and boring tasks. Instead of having a big spreadsheet in which we all enter our time logs and that we share by email to get some reports at the end of the month, we create a centralized client/server application where everyone can enter their time log whenever they want and that automatically generates and sends that report at the end of the month. Instead of manually copying our past time logs from our old spreadsheets into our new system, we create a simple script that reads those spreadsheets and imports that data into our new centralized application. Tasks that could have taken days are completed in minutes instead.

In many situations, I think we could push even more for automation. In the software development business, I feel like we are working on automating tasks for our clients but we are not thinking enough about automating our own tasks. I guess it is different from place to place, but in each of my job experiences, I have found processes that could get an easy productivity boost simply by creating a simple script.

The reasons are numerous for not automating more. We do not have time for it, we are not using the right tools, we do not know how to create the right tools, we fear changes, we do not know how to validate our automations, we are using applications or processes which do not have any automation entry points or we are plainly lazy and we do not want to learn how to automate things. I include myself because I have used each of those reasons at least once for not doing it.

If you do not have time to automate your tasks, you are just wasting your time. Just think about it. Automating is all about saving time. The exception would be a task that is quick to execute or rarely executed that would take an enormous amount of effort to automate. I will let you be the judge on this one, but do not pretend it always requires huge efforts to automate a process.

Using the wrong tool is a frequent reason. It is hard to overcome because in many cases, you just do not know that there are better tools, applications or ways to automate your task. One way is to try and stay up-to-date with the latest technologies and to learn new stuff like a scripting language (Python is probably a good start). Automating a task that requires you to use a GUI application with no alternatives like a command line interface, a library interface or any public API is obviously hard. Even then, there are tools which can help you out in those situations.

If you are like me, with a strong Windows background where the GUI is king, you might want to see how things are done in the Unix/Linux world where everything is a small command line program that is often used in a long tool chain to automate many tasks. If you are shy of installing a full-blown free Linux operating system to toy with those tools, you can do as I do and install a Linux-like environment for Windows.

To end this post, I will show you some examples and tools that I have been using recently to automate boring tasks.

Mass image manipulation

To automate the creation of thumbnails for a folder of a thousand pictures, I have been using ImageMagick and a command line similar to this one:

mogrify -format jpg -auto-orient -thumbnail 600x600 *.jpg

XML manipulation

To update large and complex XML configuration files, I have been using different XSLT templates with Saxon, a free and open source XSLT processor.

Data transformation, importation and exportation

I often had to import or export data from and to text files, spreadsheets, different databases, LDAP repositories or Outlook/Exchange. I used to create small C# command line programs, but I have switched to Python recently for the productivity gain. My strategy is always the same: find a component that reads the raw data of the input source, find another component that can write to the destination and transform the data in the middle.
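Here is a small Python sketch of that read, transform and write strategy, using only the standard library. The column names ("name" and "hours") and the function names are made up for the example. A real script would swap in whatever reader and writer match its source and destination.

```python
import csv
import io
import json


def read_rows(csv_text):
    """Reader: turn raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(row):
    """Transformation: clean up the name and convert hours to a number."""
    return {"name": row["name"].strip().title(), "hours": float(row["hours"])}


def write_json(records):
    """Writer: serialize the transformed records for the destination."""
    return json.dumps(records, indent=2)


def convert(csv_text):
    """Chain the three steps: read, transform in the middle, write."""
    return write_json([transform(row) for row in read_rows(csv_text)])


if __name__ == "__main__":
    print(convert("name,hours\n georges ,7.5\nanna,8\n"))
```

Keeping the three steps separate means a new source or destination only requires replacing one function.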