Choosing a new version control system

By: on April 30, 2008

(Continued from Moving away from CVS)

The wealth of options for a replacement for CVS presents us with a problem. We can’t choose a version control system by comparing feature lists: what seems perverse when presented in the manual may become natural in real use (which is the reaction many have to CVS’s “merge-don’t-lock” way of working at first), and conversely what seems attractive on paper may prove problematic in real use (the system may claim sophisticated merging, but will it actually do what you want given your version history?). Equally, however, trying to use every system in anger would impose a very serious cost: unless we write the infrastructure for every system we test, some live project will have to do without it while it tries out the shiny new system, and for every system someone will have to undergo the considerable expense of really learning how to use it and make it behave well. So we have to find ways to at least thin the candidate list.

We first narrow the list to the six candidates mentioned in the previous post: Subversion, Monotone, darcs, Git, Bazaar, and Mercurial. All of these have a sizeable community behind them and are used by popular projects. This means they have demonstrated themselves fit for purpose, and that there is a community who will provide help if we encounter problems, and code to support integration with other pieces of software. Other candidates may have interesting properties, but to choose them would be to be relatively out on our own; their lack of popularity also increases the risk that they will simply be abandoned after we have invested in them. In particular, this eliminates Codeville, the innovative DVCS designed by BitTorrent inventor Bram Cohen, though there seems little reason to pick it up in any case now that its main selling point, a smart text merging algorithm, has been picked up by Bazaar and could later be supported by some of the other systems if it is found to be usefully superior.

Of the six, the non-distributed Subversion is the first to be thrown out. This isn’t because we expect to benefit greatly from the possibility of disconnected operation, though it may prove useful sometimes; it is because we would like the other features of DVCSes described in the last article, in particular history-aware merging, and the general cleanliness of the underlying model. It’s a difficult decision, because Subversion has by far the best tool support of all of our candidates, including a mature Eclipse plugin; however, this is a decision we need to make based on the long-term future, and we anticipate that if we can pick a system that will remain popular then such support is just a matter of waiting for the tools to catch up.

The remaining five are very hard to choose between; I’ve had a hard time even finding discussion of how to choose one, because most articles focus on how each one is better than CVS or Subversion rather than comparing them to their DVCS peers. All are licensed under the GPL.

Monotone is the oldest of the five remaining candidates, and the first that I took an interest in. It has an attractively clean model of how a DVCS should work, and is in many ways the “most decentralized” of the five, because of the way it handles authentication. In any other DVCS, if I pull from your repository or allow you to push to mine, I am implicitly trusting you as a source of good revisions that I might like to build on. In Monotone, revisions are cryptographically signed, and it is these signatures that decide which revisions I will pay attention to; as a result, Monotone servers exchange not assertions but facts, and you don’t have to go to a particular server to get “authoritative” information on which is the right revision.

However, these signatures represent an unsolved management headache: how do you decide which keys to trust? As things stand, everyone has to update their keyring when a new developer joins the project. In February of last year, I attended a week-long Monotone developers’ summit in San Francisco hosted by Google, and my sole personal goal while there was to find a better solution. I met a great many very smart people and we had some fascinating discussions around the idea of “policy branches” to solve this problem, but we were never able to agree on exactly how such branches should work, and as far as I know the problem is still unsolved.
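To make the keyring problem concrete, here is a minimal Python sketch of signature-based trust. It is a toy model, not Monotone's actual data model or API: the `SignedRevision` class, the `trusted_revisions` function, and the key names are all hypothetical, and real cryptographic verification is assumed to have happened already.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignedRevision:
    revision_id: str  # hypothetical identifier for a revision
    signer: str       # hypothetical name of the key that signed it


def trusted_revisions(received, keyring):
    """Keep only revisions signed by a key in our local keyring.

    Assumption: each `signer` field has already been verified
    cryptographically; this sketch only models the trust decision.
    """
    return [rev.revision_id for rev in received if rev.signer in keyring]


# Revisions can arrive from any server; the server itself is not trusted.
received = [
    SignedRevision("rev-a", "alice@example.com"),
    SignedRevision("rev-b", "mallory@example.com"),
    SignedRevision("rev-c", "newdev@example.com"),
]

keyring = {"alice@example.com"}
before = trusted_revisions(received, keyring)

# The management headache: the new developer's work is not paid
# attention to until each of us adds their key to our own keyring.
keyring.add("newdev@example.com")
after = trusted_revisions(received, keyring)
```

In this model `before` contains only Alice's revision, and the new developer's revision appears in `after` only once the local keyring has been updated, which is the step every developer currently has to repeat by hand.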

Experiments with using Monotone internally showed other problems. Monotone repositories have a single global lock, so if for example a repository is made available in a web interface you can’t commit to it at the same time, a problem we were able to work around only with some very nasty hacks using multiple repositories. The same problem makes email notification hooks difficult to write, with the additional constraint that they must be written in an obscure interpreted language called Lua, and if more than one hook is to be run for the same event, the programmer must handle this themselves. Monotone itself is written in an eclectic style of C++ that makes it very hard to hack on or even understand what is happening internally. Finally, Monotone tends to be slow in normal use. Overall, we didn’t find working with Monotone to be an enjoyable experience, and we started looking at other candidates.

darcs has its supporters in this office. It’s written in Haskell, the statically typed pure functional programming language which had a place on our “Language du jour” whiteboard for much more than a day. It has by far the best support for “cherry-picking” (pulling in a change to a branch without pulling in all the changes that led to it) thanks to its “algebra of patches” that underlies its operation. However, this model is also what puts me off about it: it is very hard for darcs to cleanly support binary files, for example, because they aren’t well expressed by patches, and patches underlie every part of darcs including the storage and network formats; the other DVCSs have binary storage and network formats and consider the line-oriented nature of files only at merge time. To embed the assumption that all files are line-oriented text files so deeply into the architecture of a DVCS seems to me like a wrong turn that it would be very hard to back out of, so I kept looking.
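As a rough illustration of why a patch-based model makes cherry-picking natural, here is a toy Python sketch of commuting two line-insertion patches. It is not darcs's actual algorithm or storage format: the `Insert` class and the `apply_patch` and `commute` functions are hypothetical, and only single-line insertions are modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    index: int  # position at which the line is inserted
    text: str   # the line to insert


def apply_patch(patch, lines):
    """Return a new list of lines with the patch applied."""
    return lines[:patch.index] + [patch.text] + lines[patch.index:]


def commute(a, b):
    """Given patches applied as a-then-b, return (b2, a2) such that
    applying b2-then-a2 produces the same final file."""
    if b.index > a.index:
        # b was positioned after a's new line; without a, it shifts up.
        return Insert(b.index - 1, b.text), a
    # b was positioned at or before a's line; a shifts down instead.
    return b, Insert(a.index + 1, a.text)


base = ["one", "two", "three"]
a = Insert(1, "feature A")  # recorded first
b = Insert(3, "bug fix B")  # recorded second, on top of a

b2, a2 = commute(a, b)

# Cherry-pick: take the bug fix without the feature that preceded it.
cherry_picked = apply_patch(b2, base)

# Sanity check: both orders lead to the same final file.
original_order = apply_patch(b, apply_patch(a, base))
commuted_order = apply_patch(a2, apply_patch(b2, base))
```

Because `b2` applies directly to `base`, the bug fix can be pulled in without the feature. The sketch also shows the limitation discussed above: the whole scheme is defined in terms of lines, so it has nothing sensible to say about a binary file.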

That leaves three: Git, Bazaar, and Mercurial. All three date from around 2005, when Larry McVoy withdrew the limited license grant on his proprietary BitKeeper DVCS and the Linux kernel had to find a replacement in a hurry, a disaster for kernel development that vividly demonstrated the short-sightedness of Linus’s policy of trying to pretend that software licences don’t matter. All three have been chosen by major projects: Git is used most famously by the Linux kernel, Bazaar by Ubuntu’s Launchpad development centre, and Mercurial by the Java and Mozilla projects. A full evaluation of all three would be a fantastically costly exercise, so we had to use more superficial characteristics to decide which one to explore next.

Git is Linus’s own creation, started (I’m told) when Linus learned that the lead Monotone dev was on holiday and wasn’t about to start hacking on Monotone to improve performance until his return. To be sure, Git has very impressive performance, but there are several areas of concern: Git has over a hundred subcommands, betraying a lack of focus in interface design, and Win32 support (essential for us) is poor. In the end I felt I didn’t have faith in Git’s technical direction; I got the feeling that it was too wedded to a worse-is-better philosophy in which performance is more important than a clean model. To us this meant that it would take reports of crippling performance problems from other systems before we’d reassess Git.

The choice between Bazaar and Mercurial was in some ways the most arbitrary. Both are in Python, and both have a strong supporting community with lots of extensions – these two are not unrelated, as the choice of Python as implementation language lowers the barriers to getting involved. Each has a comparison page about the other, cross-linked, indicating their relative strengths, and updated as each draws features and ideas from the other or shoots ahead in an area it was formerly behind. There have even been joint Bazaar/Mercurial summit meetings hosted by Canonical, which resulted not in either project subsuming the other but in a rapid cross-fertilization of ideas. In the end I chose based on my feel for which had the clearest architectural vision, and based on the choices other projects have made, in particular projects which I felt would be good at making good choices, such as Java and Coyotos, and other LShift developers agreed: the choice was Mercurial.

Since then we’ve used Mercurial in anger for several projects, and done quite a bit of infrastructure work, integrating Mercurial with other tools that we use and otherwise making it more useful to us. So how’s it been working out for us? We’ll cover that in Part Three…


11 Comments

  1. tonyg says:

    I’m curious as to what makes you think darcs is text-centric? I’ve never noticed a particular bias in my use of darcs – it handles binary files in much the same way as every other version control system I’ve used, at least as far as I’ve noticed.

  2. Tom Berger says:

    For obvious reasons*, I’m quite curious to hear more about what led you to choose Mercurial over Bazaar.

    You only mention two criteria:

    1. …clearest architectural vision…

    I find it hard to think that this is the case, but, either it’s true (in which case the bzr folks should work on focusing their ‘architectural vision’) or it isn’t (in which case they should work on better PR). What in particular made you think that Mercurial’s architectural vision is clearer?

    2. projects which I felt would be good at making good choices, such as Java and Coyotos

    Setting aside the paradox of Java choosing a system written in a language other than Java (and with a radically different architectural vision too!), I’m pretty sure that most big projects (like Java, Coyotos, Mozilla) that have chosen Mercurial over Bazaar did so because at the time, hg had better performance for extremely large trees, despite the fact that bzr had many other clear advantages. Meanwhile bzr has improved a lot without compromising many of its process and architecture goals (turns out they were right about premature optimization after all). 99% of users (LShift probably among them) shouldn’t suffer from these performance problems, and definitely don’t today, with the latest versions of bzr.

    If you don’t feel like posting in more detail here, or sending something to the bzr mailing list, I wouldn’t mind chatting about this (I don’t hope to convince you, and Mercurial is not a bad choice, just want to learn a thing or two).

    Also, I won’t bother listing here what are, IMHO, the clear advantages of bzr, but ask if you’re interested.

    Tom

    * disclosure: the commenter is employed by Canonical, the company sponsoring bzr.
  3. Paul Crowley says:

    Darcs and binary files: see

    http://wiki.darcs.net/DarcsWiki/FrequentlyAskedQuestions#head-f475693ac8dd4af9a381aad47b3c9dc90d6d7a32
    and this thread: http://osdir.com/ml/version-control.darcs.user/2004-07/msg00018.html

    Darcs stores binary files as hex, and does not attempt to delta-compress them at all, which could be painful on network bandwidth. As I said, the storage and network format is fundamentally based on patches which are fundamentally only meaningful on line-oriented text files.

    Tom: I’m very interested to hear about the advantages of Bazaar over Mercurial; deciding between those two was as I said the most arbitrary part of the decision.

  4. tonyg says:

    Delta-compression is a pretty important feature, I agree, but really only for space-saving: it’s a representation issue rather than a semantic issue, if you see what I mean. No matter the system, binary files are treated as blobs (diff & merge over binaries not being a realistic option?). Darcs could add delta compression to its repository- and network-format and it wouldn’t change the way the system worked at all.

  5. Great article Paul, look forward to the next one.

    We are currently adopting Mercurial too, but have a lot invested in Subversion – particularly in training for non-techies to get the version control bug thanks to TortoiseSVN, although part of the company is still using CVS as well.

    I’d be very interested in how you migrate away from Subversion and/or how you find ways to maintain both repositories side by side especially in reusing your existing tools.

    The MercurialPlugin for Trac has eased our pain, and there appear to be possibilities for using a DVCS with an existing SVN working area:

    http://trac.edgewall.org/wiki/TracMercurial
    http://www.selenic.com/mercurial/wiki/index.cgi/WorkingWithSubversion

  6. Paul Crowley says:

    Tony: sure, but representations in VCSs are hard to change, especially the network protocol. Some binary formats can be merged with special tools, but only if the VCS can defer merging to external executables, which again runs entirely contrary to the spirit of darcs.

    I’d like to see Mercurial (and/or Bazaar I guess) adopt something like darcs’s approach to history-aware cherry picking, but retain the binary approach to storage and transmission and use the darcs algorithms only at merge time.

  7. tonyg says:

    Liam, have you tried TortoiseHg? How do you find it, if so?

  8. David Roussel says:

    ahh!

    If you get the captcha wrong you lose your comment!!!!

    Again, but shorter:

    • looking forward to part 3

    • agree about the architectural vision. hg has it, git has it (for the lower, plumbing, layers), but bzr doesn’t. Maybe I haven’t read enough about bzr.

    • how did you get on with the merging in hg? How did you do your merging on Windows?

  9. bastiand says:

    Nice post. The Mercurial Eclipse plugin has seen quite a few enhancements recently, so I’d like to invite you to try it and tell us which are the biggest roadblocks for you when using it :-).

    Bastian

  10. FooBat says:

    Mercurial makes sense. I don’t see how Bazaar can even be considered for projects with larger repositories or more history. Even using Bazaar on its own repository is painfully slow. A pull to update a clone of bzr.dev took well over 6 hours for me–for a <100MB repository! In the same time, I had time to update git, linux-kernel, and mercurial clones, rebuild them, make dinner, and write some code.

  11. Mike Kramlich says:

    My favorite currently is git.
    In the past I’ve used CVS, Subversion, Perforce, ClearCase and AccuRev and they all sucked in various ways.
    Git basically just does version tracking on a file tree.
    All of Git’s persisted history & metadata about your tracked file tree lives under a single .git subdir that hangs off the root of the tree. There is no central repository or other file magic, no hanging chads or dot-file seeds, etc.
    It’s easy to use. As an example, once Git is installed, you can literally pull down the current official Linux kernel trunk in a single CLI invocation, with no previous setup steps necessary. Subsequent refreshes (pulls) are also that easy.
    Git also has the advantage that it’s of the distributed flavor of VCS’s so very friendly for open source projects or distributed/global employees; it’s become the official one for Linux; and it’s lightning fast in the most common use cases for developers since it’s all about running small processes locally and talking to a local filesystem. So the problems of a client/server or centralized/remote architecture go away. There are no locking/in-use problems between concurrent users, for example.
    Git seems to be one of the faster systems, due to its architecture.
    Also, if you don’t trust its developer (Linus) to create a VCS to put your golden eggs into, then you probably shouldn’t be using Linux either. (Though it’s a much younger codebase, of course, so beware.)
    I can’t speak to how well it handles binary files.
    I have not used Mercurial, Bazaar, Darcs or Arch, so I can’t say if git is better or worse than any of those. A rough initial glance at them did not impress me enough to look closer, unlike git. Some Googling will uncover several in-depth comparisons and benchmarking in this area.
