Moving away from CVS
When LShift first started off in 2000, the only real option for mature, open source version control was CVS. We’ve used CVS for most of our projects since then, and gone on to develop a strong infrastructure for managing CVS-backed projects, including a web interface for viewing versions, a web-based searchable database for related CVS commits (“CVSzilla”) which infers transactions from multiple simultaneous commits, and integration with the Bugzilla bug tracker.
Today, there are many other options, and I’ll discuss six major alternatives here: Subversion, Monotone, darcs, Git, Bazaar, and Mercurial. They all aim to do better than CVS in a variety of ways; these include:
Entire tree versioning: a version consists of a snapshot of an entire source tree, and a single change may affect many files. This is something our CVSzilla tool simulates for CVS, but it’s built in to modern systems, and it makes it easy to ask questions like “what was the most recent change to the source?”.
Support for renames: you can rename a file without losing the connection with the history of the file before the rename. This is made conceptually much simpler by entire tree versioning.
Cheap branching: creating branches is a low-cost operation even for large source trees. Not all version control systems offer this; I once worked with one where branching was so expensive that the downtime for creating a branch had to be a part of the project schedule.
Explicit merging: the system can create a version which is the merge of two branches, including the changes in both and deferring conflicts to the user, and mark it as such in the version metadata.
General removal of cruft: CVS is now over 20 years old and was one of the first open source systems of its kind, and the experience of two decades allows many opportunities to streamline and modernise.
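To make entire tree versioning concrete, here is a minimal sketch of how tree-wide changes might be inferred from per-file commits, in the spirit of what CVSzilla does for CVS. The grouping rule (same author, same log message, within a short time window), the `FileCommit` type and the `infer_transactions` function are all assumptions made for illustration; this is not CVSzilla's actual heuristic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileCommit:
    """One per-file commit, as a CVS-style system records it (hypothetical type)."""
    path: str
    author: str
    message: str
    timestamp: int  # seconds since the epoch


def infer_transactions(commits, window=60):
    """Group per-file commits into tree-wide changes.

    Assumption: commits by the same author with the same log message, each
    within `window` seconds of the previous one, belong to a single change.
    """
    transactions = []
    latest_by_key = {}  # (author, message) -> most recent transaction for that key
    for commit in sorted(commits, key=lambda c: c.timestamp):
        key = (commit.author, commit.message)
        current = latest_by_key.get(key)
        if current is not None and commit.timestamp - current[-1].timestamp <= window:
            current.append(commit)
        else:
            current = [commit]
            latest_by_key[key] = current
            transactions.append(current)
    return transactions


commits = [
    FileCommit("src/main.c", "alice", "Fix parser crash", 1000),
    FileCommit("src/parse.c", "alice", "Fix parser crash", 1002),
    FileCommit("README", "bob", "Update docs", 1003),
    FileCommit("src/main.c", "alice", "Fix parser crash", 5000),
]
transactions = infer_transactions(commits)

# "What was the most recent change to the source?" becomes a simple lookup.
most_recent = transactions[-1]
```

A system with entire tree versioning records these groupings directly, so no guessing from timestamps is needed.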
These advantages hold for all of the major alternatives to CVS, and in particular they hold for Subversion, the oldest and one of the most popular. Subversion aims to be a “better CVS” and its CVS lineage shows clearly in its design. In particular, it is based around a centralized development model – when you want to use the version control system, you connect to the central version control server. This once seemed like the only way such a thing could work, but version control is now heading in a very different direction.
In a distributed version control system (DVCS), a developer may have a local copy not only of their current version of the sources, but of the entire version database (the “repository”), and this local copy supports not only examining the history but also adding to it. When you wish to share your changes with others, you connect to a remote repository and push your changes to it, and you can similarly pull other people’s changes into yours. Why is this useful?
Speed: everything except the push/pull operations is local, and with no network latency or bandwidth issues to contend with, these operations can be much faster.
Disconnected operation: you can do most version control operations while disconnected from the network, such as during a flight. This is one of the main original motivations of distributed version control, though today the typical developer spends so little time disconnected (even flights are getting wired now) that for many this isn’t the compelling advantage it once was.
Open source branching: if I want to create a branch of an open source project hosted using a DVCS, I don’t have to either persuade the lead developers to give me commit access to their project or aggressively “fork” the project: I can create my own public repository which includes both their changes and mine, and any developers that are interested can pull from both of us.
A DVCS has to have excellent support for branching and merging. This is because in a distributed system, if you and I both check out the same repository and check in changes locally, there is no way to ensure that one of our versions is a successor of the other; we will have created a fork in the version tree. If it’s not easy to re-unite the version history and create a new version that includes both our changes, the project will quickly fall apart. That’s why all five popular DVCSes offer these features:
A version DAG: a version may have more than one parent, and the metadata explicitly includes the DAG that relates versions to each other.
History-aware merging: when two versions are to be merged, the history of both and the way they are related is taken into account.
By contrast, if you merge a development branch into the trunk in Subversion and then make further changes on the branch, you will not be able to merge these further changes into the trunk unless you calculate which changes are new and which the trunk has already seen, and merge in by hand only the novel changes. Actually, this may no longer be true; the Subversion developers were aware of this problem when I last checked over a year ago and may now have found a fix.
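The version DAG and history-aware merging described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `parents` mapping, the version names and the `merge_bases` helper are hypothetical, and real systems use more efficient algorithms and differ in how they choose among multiple bases. The idea is that a merge starts from the most recent common ancestors of the two versions, so changes that an earlier merge already brought in are not applied again.

```python
from collections import deque

# Hypothetical version DAG: each version maps to the list of its parents.
# The root has no parents; an explicit merge has two.
parents = {
    "root": [],
    "a1": ["root"],      # my local commit
    "b1": ["root"],      # your local commit
    "m1": ["a1", "b1"],  # explicit merge of both lines of work
    "b2": ["b1"],        # further work on your branch after the merge
}


def ancestors(dag, version):
    """Return the version itself plus everything reachable via parent links."""
    seen = set()
    queue = deque([version])
    while queue:
        v = queue.popleft()
        if v not in seen:
            seen.add(v)
            queue.extend(dag[v])
    return seen


def merge_bases(dag, left, right):
    """Common ancestors of both versions that no other common ancestor descends from."""
    common = ancestors(dag, left) & ancestors(dag, right)
    return {
        c for c in common
        if not any(c != other and c in ancestors(dag, other) for other in common)
    }


print(merge_bases(parents, "a1", "b1"))  # → {'root'}
print(merge_bases(parents, "m1", "b2"))  # → {'b1'}
```

The second call is the repeated-merge case: because `m1` records `b1` as a parent, merging `b2` into `m1` only has to consider what changed between `b1` and `b2`, with no manual bookkeeping of which changes the trunk has already seen.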
In a DVCS, history-aware merging must form a part of the design right from the start if the system is to be at all useful. Once you get used to working with a system that supports a version DAG with history-aware merging, every other way of framing version control seems like a poor approximation of this more fundamental model. In particular, it supports new ways of working that allow the VCS to do more work for the developers:
Commit-first: if you have made a change and so has another developer, you commit your change before you merge with theirs. Merges can go wrong – either the system or the human it asks can make the wrong choices – and having a permanent record of the pre-merge state can be a lifesaver.
Branch-per-bug development: create a new branch for every bug/feature you are working on, and merge with the trunk at the last minute. This means that if work on some features is complete while others are half-done, a new version can be produced that includes only the completed features.
This last practice in particular offers an invaluable boost to our agile way of working, making it far more likely we can produce a working version of the software at the end of the timebox even if development on some nice-to-have features is still incomplete and would leave the software in a broken state.
This article lays out more than enough reasons to migrate away from CVS to a more advanced system. But such a migration imposes costs: we have to gain experience using the system, and provide the same infrastructure support we’ve written code to provide for CVS. Unless we want to incur that expense six more times, in order to move away from CVS we must choose which system we will migrate to and use for future projects. That wasn’t an easy decision, as I’ll get into in my next blog post on the subject.
Continued in Part 2: Choosing a new version control system.