technology from back to front

Archive for the ‘Version control’ Category

Enhancing peer review through GitHub

You love GitHub. Of course you do. You love peer review. You especially love sending a pull request back asking for nits to be picked. So when your submitter claims to have addressed your concerns, how do you check? You could walk the commits. You could diff the entire pull request against master. If only you could diff the HEAD of the pull request against the original state of the pull request, letting you check just the new set of commits…

With github-differ you can!

Simply add this tiny extension[1] to your Chrome, and it will decorate each commit in GitHub’s Commits tab. Pick any two commits, and the extension will redirect you to a page showing the comparison of those two commits! Job done!

[1]The JavaScript involved is so small that it should be trivial to port this to FireFox’s GreaseMonkey framework.

Frank Shearar

Continuous Integration for Github Pull Requests with Teamcity

Most developers with an interest in open source software these days have seen the Github interface for handling pull requests, and relatedly, Travis CI’s support for pull requests. And so we thought it’d be useful to have something similar for our internal CI system.

Read more…

Ceri Storey

mercurial-server 0.8 released

mercurial-server home page

mercurial-server gives your developers remote read/write access to centralized Mercurial repositories using SSH public key authentication; it provides convenient and fine-grained key management and access control.

Read more…

Paul Crowley

Mix and match version control

LShift’s standard version control platform these days is Mercurial, but just before we adopted it, I started a project using Trac and Subversion, mostly because that’s what Trac does out of the box.

Later, we branched the project to add a large new project, and during that branch we converted from using ant to Maven and modularised the project, resulting in a lot of moved files. This made what we were doing on the branch a lot easier, but left us with a merge that subversion wasn’t capable of, even though we had used svn mv to move all the files.

What was capable of the merge was Mercurial. I imported the whole subversion repository using hg convert. See the convert extension documentation. It works exactly as described, but make sure you have 1.0.1 or later – I had problems with earlier versions.

The merge went reasonably well, so I was left with a merged version in a Mercurial repository. I was going to switch to using Mercurial, and its Trac integration, when I discovered that couldn’t cope with multiple repositories. The Trac instance was managing several different source projects, which would have to go into several mercurial repositories, which I couldn’t merge together in any satisfactory way.

There are several projects around to address this (I’ll probably cover them in another post), none of which are ready for production yet. I decided the most expedient thing would be to try and generate a patch for my merge, and apply it to the subversion repository.

A conventional patch would lose the version history of all the moved files, so I decided a git diff would do the job. You can certainly, with some patience, get git-svn to do this, and understand what it was doing. Lacking that patience, I wrote a script to do the job. It parses the git diff and deals with any directory creation needed, calls to svn mv, svn add, and svn rm as required by the diff. It actually turns out to be a bit more work than I was expecting, so I’ve published it here.


Smalltalk vs. Javascript; Diff and Diff3 for Squeak Smalltalk

Many of my recent posts here have discussed the diff and diff3 code I wrote in Javascript. A couple of weekends ago I sat down and translated the code into Squeak Smalltalk. The experience of writing the “same code” for the two different environments let me compare them fairly directly.

To sum up, Smalltalk was much more pleasant than working with Javascript, and produced higher-quality code (in my opinion) in less time. It was nice to be reminded that there are some programming languages and environments that are actually pleasant to use.

The biggest win was Smalltalk’s collection objects. Where stock Javascript limits you to the non-polymorphic

for (var index = 0; index < someArray.length; index++) {
  var item = someArray[index];
  /* do something with item, and/or index */

Smalltalk permits

someCollection do: [:item | "do something with item"].

or, alternatively

someCollection withIndexDo:
    [:item :index | "do something with item and index"].

Smalltalk collections are properly object-oriented, meaning that the code above is fully polymorphic. The Javascript equivalent only works with the built-in, not-even-proper-object Arrays.

Of course, I could use one of the many, many, many, many Javascript support libraries that are out there; the nice thing about Smalltalk is that I don’t have to find and configure an ill-fitting third-party bolt-on collections library, and that because the standard library is simple yet rich, I don’t have to worry about potential incompatibilities between third-party libraries, such as can occur in Javascript if you’re mixing and matching code from several sources.

Other points that occurred to me as I was working:

  • Smalltalk has simple, sane syntax; Javascript… doesn’t. (The number of times I get caught out by the semantics of this alone…!)
  • Smalltalk has simple, sane scoping rules; Javascript doesn’t. (O, for lexical scope!)
  • Smalltalk’s uniform, integrated development tools (including automated refactorings and an excellent object explorer) helped keep the code clean and object-oriented.
  • The built-in SUnit test runner let me develop unit tests alongside the code.

The end result of a couple of hours’ hacking is an implementation of Hunt-McIlroy text diff (that works over arbitrary SequenceableCollections, and has room for alternative diff implementations) and a diff3 merge engine, with a few unit tests. You can read a fileout of the code, or use Monticello to load the DiffMerge module from my public Monticello repository. [Update: Use the DiffMerge Monticello repository on SqueakSource.]

If Monticello didn’t already exist, it’d be a very straightforward matter indeed to build a DVCS for Smalltalk from here. I wonder if Spoon could use something along these lines?

It also occurred to me it’d be a great thing to use OMeta/JS to support the use of

<script type="text/smalltalk">"<![CDATA["
  (document getElementById: 'someId') innerHTML: '<p>Hello, world!</p>'

by compiling it to Javascript at load-time (or off-line). Smalltalk would make a much better language for AJAX client-side programming.


Adding distributed version control to TiddlyWiki

After my talk on Javascript DVCS at the Osmosoft Open Source Show’n’tell, I went to visit Osmosoft, the developers of TiddlyWiki, to talk about giving TiddlyWiki some DVCS-like abilities. Martin Budden and I sat down and built a couple of prototypes: one where each tiddler is versioned every time it is edited, and one where versions are snapshots of the entire wiki, and are created each time the whole wiki is saved to disk.

Regular DVCS SynchroTiddly
Repository The html file contains everything
File within repository Tiddler within wiki
Commit a revision Save the wiki to disk
Save a text file Edit a tiddler
Push/pull synchronisation Import from other file

If you have Firefox (it doesn’t work with other browsers yet!) you can experiment with an alpha-quality DVCS-enabled TiddlyWiki here. Take a look at the “Versions” tab, in the control panel at the right-hand-side of the page. You’ll have to download it to your local hard disk if you want to save any changes.

It’s still a prototype, a work-in-progress: the user interface for version management is clunky, it’s not cross-browser, there are issues with shadow tiddlers, and I’d like to experiment with a slightly different factoring of the repository format, but it’s good enough to get a feel for the kinds of things you might try with a DVCS-enabled TiddlyWiki.

Despite its prototypical status, it can synchronize content between different instances of itself. For example, you can have a copy of a SynchroTiddly on your laptop, email it to someone else or share it via HTTP, and import and merge their changes when they make their modified copy visible via an HTTP server or email it back to you.

I’ve been documenting it in the wiki itself — if anyone tries it out, please feel free to contribute more documentation; you could even make your altered wiki instance available via public HTTP so I can import and merge your changes back in.


Mercurial merge technique

We’re using Mercurial here at LShift for much of our development work, now, and we’re finding it a great tool. We make heavy use of branches (“branch per bug”) for many projects, and this is also a pretty smooth experience. One issue that has come up is policy regarding merging the trunk (“default”) into any long-lived feature/bug branches: should you do it, or should you not?

My vote is that you should merge default into long-lived branches fairly regularly; otherwise, you have a big-bang, all-at-once nightmare of a merge looming ahead of you. If you do merge frequently, though, there’s one subtlety to be aware of: hg diff is not history aware, so in order to get an accurate, focussed picture of all the changes that have been made on your long-lived branch, you need to do one of two things:

* either, merge default into your long-lived branch right before you merge the long-lived branch back into default, and run hg diff after that’s complete; or
* (recommended) do a throw-away test-merge of the long-lived branch into default directly.

Imagine a history like this:

 (2) (3)
  |   |
   \ /

… where (1) is an ancestral revision, (2) is the default branch, and (3) is the long-lived branch – let’s call it “foo”.

Given this history, running hg update -C default (to make the working copy be the default branch, i.e. revision (2)) followed by hg diff foo will give you a misleading diff – one that undoes the changes (1) to (2) before doing the changes from (1) to (3). This is almost certainly not what you want!

Instead, run a test merge, by hg update -C default followed by hg merge foo and then plain old hg diff. Note that this modifies your working copy! You will need to revert (by hg update -C default) if you decide the merge isn’t ready to be committed.

The output of hg diff after the hg merge shows a history-aware summary of the changes that the merge would introduce to your checked-out branch. It’s this history-awareness (“three-way merge”) that makes it so much superior to the history-unaware simple diff (“two-way merge”).


diff3, merging, and distributed version control

Yesterday I presented my work on Javascript diff, diff3, merging and version control at the Osmosoft Open Source Show ‘n Tell. (Previous posts about this stuff: here and here.)
The slides for the talk are here. They’re a work-in-progress – as I think of things, I’ll continue to update them.

To summarise: I’ve used the diff3 I built in May to make a simple Javascript distributed version-control system that manages a collection of JSON structures. It supports named branches, merging, and import/export of revisions. So far, there’s no network synchronisation protocol, although it’d be easy to build a simple one using the rev import/export feature and XMLHttpRequest, and the storage format and repository representation is brutally naive (and because it doesn’t yet delta-compress historical versions of files, it is a bit wasteful of memory).

You can try out a few browser-based demos of the features of the diff and DVCS libraries:

* a demo of diff, comm, and patch functionality.
* a demo of three-way merge and conflict-handling functionality.
* a demo of a Javascript DVCS, a bit like Mercurial, that manages a collection of JSON objects (presenting them in a file-like way, for the purposes of the demo).

The code is available using Mercurial by hg clone (or by simply browsing to that URL and exploring from there). It’s quite small and (I hope) easily understood – at the time of writing,

* the diff/diff3 code and support utilities are ~310 lines; and
* the DVCS code is ~370 lines.

The core interfaces, algorithms and internal structures of the DVCS code seem quite usable to me. In order to get to an efficient DVCS from here, the issues of storage and network formats will have to be addressed. Fortunately, storage and network formats are only about efficiency, not about features or correctness, and so they can be addressed separately from the core system. It will also eventually be necessary to revisit the naive LCA-computation code I’ve written, which is used to select an ancestor for use in a merge.

The code is split into a few different files:

* The sources for the diff and diff3 demos and the DVCS demo. In the latter, check out the definition of presets.preset1 for an example of how to use the DVCS, and presets.ambiguousLCA for an example of the repository format and the use of the revision import feature.
* The diff and diff3 code itself.
Graph utilities (for computing LCA etc)
* The DVCS and pseudo-file-system code.
* The repository history-graph-drawing code and a python script for drawing the little tile images used in rendering a repository history graph.


Diff for Javascript, revisited

Last weekend I finally revisited the diff-in-javascript code I’d written a couple of years back, adding (very simple) patch-like and diff3-like functionality.

On the way, not only did I discover Khanna, Kunal and Pierce’s excellent paper “A Formal Investigation of Diff3“, but I found, the revision-control wiki, which I’m just starting to get my teeth into. I’m looking forward to learning more about merge algorithms.

The code I wrote last weekend is available: just download diff.js. The tools included:

* Diff.diff_comm – works like a simple Unix comm(1)
* Diff.diff_patch – works like a simple Unix diff(1)
* Diff.patch – works like a (very) simple Unix patch(1) (it’s not a patch on Wall’s patch)
* Diff.diff3_merge – works like a couple of the variations on GNU’s diff3(1)

Let’s try out a few examples. First, we need to set up a few “files” that we’ll run through the tools. The tools work with files represented as arrays of strings – each string representing one line in the file – so we’ll define three such arrays:

var base = "the quick brown fox jumped over a dog".split(/\s+/);
var derived1 = "the quick fox jumps over some lazy dog".split(/\s+/);
var derived2 =
  "the quick brown fox jumps over some record dog".split(/\s+/);

Examining base shows us that we have the right format:

js> uneval(base);
["the", "quick", "brown", "fox", "jumped", "over", "a", "dog"]

First, let’s run the subroutine I originally wrote back in 2006:

js> uneval(Diff.diff_comm(base, derived1));
[{common:["the", "quick"]},
 {file1:["brown"], file2:[]},
 {file1:["jumped"], file2:["jumps"]},
 {file1:["a"], file2:["some", "lazy"]},

The result is an analysis of the chunks in the two files that are the same, and those that are different. The same results can be presented in a terser differential format, similar to Unix diff:

js> uneval(Diff.diff_patch(base, derived1));
[{file1:{offset:2, length:1, chunk:["brown"]},
  file2:{offset:2, length:0, chunk:[]}},

 {file1:{offset:4, length:1, chunk:["jumped"]},
  file2:{offset:3, length:1, chunk:["jumps"]}},

 {file1:{offset:6, length:1, chunk:["a"]},
  file2:{offset:5, length:2, chunk:["some", "lazy"]}}]

Note the similarity of the results to the line-numbers and line-counts that diff(1) outputs by default.

The next example takes a diff-like patch, and uses it to reconstruct a derived file from the base file:

js> uneval(Diff.patch(base, Diff.diff_patch(base, derived1)));
["the", "quick", "fox", "jumps", "over", "some", "lazy", "dog"]

Finally, we attempt a diff3 style merge of the changes made in derived1 and derived2, hopefully ending up with a file without conflicts:

js> uneval(Diff.diff3_merge(derived1, base, derived2, true));
[{ok:["the", "quick", "fox", "jumps", "over"]},
 {conflict:{a:["some", "lazy"], aIndex:5,
            o:["a"], oIndex:6,
            b:["some", "record"], bIndex:6}},

We see that the result isn’t what we’d hoped for: while two regions are unproblematic, and merge cleanly, the algorithm couldn’t decide how to merge one part of the inputs, and has left the conflict details in the output for the user to resolve.

There are algorithms that give different (and usually better) results than diff3 – the paper I mentioned above explains some of the problems with diff3, and I’m looking forward to reading about what alternatives others have come up with at


Choosing a new version control system

(Continued from Moving away from CVS)

The wealth of options for a replacement for CVS presents us with a problem. We can’t choose a version control system by comparing feature lists: what seems perverse when presented in the manual may become natural in real use (which is the reaction many have to CVS’s “merge-don’t-lock” way of working at first), and contrarily what seems attractive on paper may prove problematic in real use (the system may claim sophisticated merging, but will it actually do what you want given your version history?). Equally, however, trying to use every system in anger would impose a very serious cost: unless we write the infrastructure for every system we test, some live project will have to do without it while they try out the shiny new system, and for every system someone will have to undergo the considerable expense of really learning how to use it and make it behave well. So we have to find ways to at least thin the candidate list.

We first narrow the list to the six candidates mentioned in the previous post: Subversion, Monotone, darcs, Git, Bazaar, and Mercurial. All of these have a sizeable community behind them and are used by popular projects. This means they have demonstrated themselves fit for purpose, and that there is a community who will provide help if we encounter problems, and code to support integration with other pieces of software. Other candidates may have interesting properties, but to choose them would be to be relatively out on our own; their lack of popularity also increases the risk that they will simply be abandoned after we have invested in them. In particular, this eliminates Codeville, the innovative DVCS designed by BitTorrent inventor Bram Cohen, though there seems little reason to pick it up in any case now that its main selling point, a smart text merging algorithm, has been picked up by Bazaar and could later be supported by some of the other systems if it is found to be usefully superior.

Of the six, the non-distributed Subversion is the first to be thrown out. This isn’t because we expect to benefit greatly from the possibility of disconnected operation, though it may prove useful sometimes; it is because we would like the other features of DVCSes described in the last article, in particular history-aware merging, and the general cleanliness of the underlying model. It’s a difficult decision, because Subversion has by far the best tool support of all of our candidates, including a mature Eclipse plugin; however, this is a decision we need to make based on the long-term future, and we anticipate that if we can pick a system that will remain popular then such support is just a matter of waiting for the tools to catch up.

The remaining five are very hard to choose between; I’ve had a hard time even finding discussion of how to choose one, because most articles focus on how each one is better than CVS or Subversion rather than comparing them to their DVCS peers. All are licensed under the GPL.

Monotone is the oldest of the five remaining candidates, and the first that I took an interest in. It has an attractively clean model of how a DVCS should work, and is in many ways the “most decentralized” of the five, because of the way it handles authentication. In any other DVCS, if I pull from your repository or allow you to push to mine, I am implicitly trusting you as a source of good revisions that I might like to build on. In Monotone, revisions are cryptographically signed, and it is these signatures that decide which revisions I will pay attention to; as a result, Monotone servers exchange not assertions but facts, and you don’t have to go to a particular server to get “authoritative” information on which is the right revision.

However, these signatures represent an unsolved management headache: how do you decide which keys to trust? As things stand, everyone has to update their keyring when a new developer joins the project. In February of last year, I attended a week-long Monotone developer’s summit in San Francisco hosted by Google and my sole personal goal while there was to find a better solution; I met a great many very very smart people and we had some fascinating discusions around the idea of “policy branches” to solve this problem, but we were never able to agree on exactly how such branches should work and as far as I know the problem is still unsolved.

Experiments with using Monotone internally showed other problems. Monotone repositories have a single global lock, so if for example a repository is made available in a web interface you can’t commit to it at the same time, a problem we were able to work around only with some very nasty hacks using multiple repositories. The same problem makes email notification hooks difficult to write, with the additional constraint that they must be written in an obscure interpreted language called Lua, and if more than one hook is to be run for the same event, the programmer must handle this themselves. Monotone itself is written in an eclectic style of C++ that makes it very hard to hack on or even understand what is happening internally. Finally, Monotone tends to be slow in normal use. Overall, we didn’t find working with Monotone to be an enjoyable experience, and we started looking at other candidates.

darcs has its supporters in this office. It’s written in Haskell, the statically typed pure functional programming language which had a place on our “Language du jour” whiteboard for much more than a day. It has by far the best support for “cherry-picking” (pulling in a change to a branch without pulling in all the changes that led to it) thanks to its “algebra of patches” that underlies its operation. However, this model is also what puts me off about it: it is very hard for darcs to cleanly support binary files, for example, because they aren’t well expressed by patches, and patches underlie every part of darcs including the storage and network formats; the other DVCSs have binary storage and network formats and consider the line-oriented nature of files only at merge time. To embed the assumption that all files are line-oriented text files so deeply into the architecture of a DVCS seems to me like a wrong turn that it would be very hard to back out of, so I kept looking.

That leaves three: Git, Bazaar, and Mercurial. All three date from around 2005, when Larry McVoy withdrew the limited license grant on his proprietary BitKeeper DVCS and the Linux kernel had to find a replacement in a hurry, a disaster for kernel development that vividly demonstrated the short-sightedness of Linus’s policy of trying to pretend that software licences don’t matter. All three have been chosen by major projects: Git is used most famously by the Linux kernel, Bazaar by Ubuntu’s Launchpad development centre, and Mercurial by the Java and Mozilla projects. A full evaluation of all three would be a fantastically costly exercise, so we had to use more superficial characteristics to decide which one to explore next.

Git is Linus’s own creation, started (I’m told) when Linus learned that the lead Monotone dev was on holiday and wasn’t about to start hacking on Monotone to improve performance until his return. To be sure, Git has very impressive performance, but there are several areas of concern: git has over a hundred subcommands betraying a lack of focus in interface design, and Win32 support (essential for us) is poor. In the end I felt I didn’t have faith in Git’s technical direction; I got the feeling that it was too wedded to a worse-is-better philosophy in which performance is more important than a clean model. To us this meant that it would take reports of crippling performance problems from other systems before we’d reassess Git.

The choice between Bazaar and Mercurial was in some ways the most arbitrary. Both are in Python, and both have a strong supporting community with lots of extensions – these two are not unrelated, as the choice of Python as implementation language lowers the barriers to getting involved. Each has a comparison page about the other, cross-linked, indicating their relative strengths, and updated as each draws features and ideas from the other or shoots ahead in an area it was formerly behind. There have even been joint Bazaar/Mercurial summit meetings hosted by Canonical, which didn’t result in either project subsuming the other but a rapid cross-fertilization of ideas. In the end I chose based on my feel for which had the clearest architectural vision, and based on the choices other projects have made, in particular projects which I felt would be good at making good choices, such as Java and Coyotos, and other LShift developers agreed: the choice was Mercurial.

Since then we’ve used Mercurial in anger for several projects, and done quite a bit of infrastructure work, integrating Mercurial with other tools that we use and otherwise making it more useful to us. So how’s it been working out for us? We’ll cover that in Part Three…

Paul Crowley



You are currently browsing the archives for the Version control category.



2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060   Contact us