DVCS Boxing Match: Git v. Hg
From Google Summer of Code Mentor Wiki
I've been waiting for the organizers of this session to publish their slides, but it looks like they're not going to, so I'm just going to add this page with my notes.
Caveat: I was listening to a number of side conversations, plus taking mad notes, so take these points with a grain of salt. I'll try to mark the places where the information is fuzzy.
The session featured major developers for both Git and Hg. They went down a list of features of the two systems (they had collaborated on the list earlier in the day). They marked items with a plus for elements where one product excelled, minus for where one sucked, and a bullet otherwise. Each had its plusses and minuses. After that, they answered questions from the audience. (I wish that Bazaar had been represented; the three-corner comparison would have been beneficial.) I didn't manage to write down all the items, but these are the ones that seemed most important to me.
Git has a longer history; Hg is the relative newcomer (less than three years). For an application as mission-critical as a source-control system, that's a very short time to ensure reliability.
The history they accumulate is roughly similar (and will likely get closer); tools that convert from one to the other are almost completely lossless. Converting then converting back loses only a small amount of information.
Hg is considered "friendlier" with a gentler learning curve. This is despite the fact that Hg uses two distinct sets of commands and two distinct vocabularies for operations, depending on whether the repository is local or remote.
Documentation for Hg is substantially better, including a book (the hgbook, http://hgbook.red-bean.com/). They've also had the advantage of trying the documentation on a fairly savvy group of developers (Mozilla) who gave them lots of feedback that helped polish the rough edges.
Both allow "dumb" HTTP as a (read-only?) protocol; both do better with a specialized server. Git's server uses its own protocol; Hg uses HTTP via either CGI (et al.) or a standalone server. (Despite the distributed model, both expect the central server to use the specialized protocol and users to expose their own trees with "dumb" HTTP.)
From the joint description they gave of their shared problems with "dumb" HTTP, I am under the impression that both can act on a file-by-file basis.
Neither seemed to deal well with centralized copies of work in progress (a "branches" subdirectory in SVN). (Very fuzzy here.) There may be implications here for things like GSoC, where a student is expected to make regular checkins.
Despite the clear asymmetry of their models in terms of support features for a central server, both claimed that there was nothing special about their own central server. (For some reason when they said this, I kept hearing "All animals are equal.")
They both have roughly the same client-side storage requirements, give or take, coming in at about 1.8 times that of CVS, or about half of what SVN requires. Even with less storage than SVN, more information is available locally, so more operations run without a network connection.
Each minimizes storage by sharing common fragments (e.g., the license at the head of every file) so the total requirement is smaller than you'd expect. Git can be manually helped to guide the search for common elements (all Makefiles, all *.py files, etc.) and usually outperforms Hg in this area.
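To make the fragment-sharing idea concrete, here is a toy sketch in Python. It is not how either Git or Hg actually stores data; it only shares a common prefix (such as a license header) between two files. The names `LICENSE`, `encode_against`, and `decode_against` are hypothetical and exist only for this illustration.

```python
from typing import Tuple

# Hypothetical license header that every file in the project starts with.
LICENSE = (
    "# Copyright (c) Example Project (hypothetical)\n"
    "# Licensed under an example license.\n"
)


def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters the two strings share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def encode_against(base: str, target: str) -> Tuple[int, str]:
    """Store target as (shared prefix length, remaining text)."""
    shared = common_prefix_len(base, target)
    return shared, target[shared:]


def decode_against(base: str, delta: Tuple[int, str]) -> str:
    """Rebuild the original text from the base and the stored delta."""
    shared, rest = delta
    return base[:shared] + rest


file_a = LICENSE + "def area(w, h):\n    return w * h\n"
file_b = LICENSE + "def perimeter(w, h):\n    return 2 * (w + h)\n"

# file_b is stored relative to file_a, so the license text is kept only once.
delta_b = encode_against(file_a, file_b)
print(len(file_b), "characters in full,", len(delta_b[1]), "stored in the delta")
```

A real tool has to find shared fragments anywhere in a file and choose which file to use as the base; that search is presumably what the manual hints to Git help with.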
Both have some support for "shallow" checkouts (restricting the range of history acquired) to minimize client-side storage. Hg also has some support for a "narrow" (partial-tree) checkout although it's quite new. Neither deals well with a very "broad" project (lots of smaller, semi-independent subprojects); they both suggest multiple repositories for those cases.
Neither deals gracefully with binary files that are frequently updated; in effect, the delta between two files is bigger than the files themselves, so both files are kept in the history and must be included in the client-side storage requirements. (Both recommended SVN for projects with lots of files like that, where it primarily impacts the server-side storage.) Both agree that a way to selectively forget some history (at least in the client, as with the shallow checkout) would be useful. Both identified (different?) cases where an "abandoned" development line (tips that were no longer accessible) could be usefully pruned from the client-side history. (Hence, I'd imagine that neither can keep part of the history remotely and fetch it when needed.) This is all for future development.
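The binary-file problem can be illustrated with a small Python sketch. It makes several assumptions: zlib's preset-dictionary feature stands in for a real delta algorithm (neither speaker said their tool works this way), seeded pseudo-random bytes stand in for a compressed binary format such as an image or archive, and the helper names `delta_size` and `pseudo_binary` are made up for the example.

```python
import random
import zlib


def delta_size(old: bytes, new: bytes) -> int:
    """Bytes needed to store `new` when `old` is available as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=old)
    return len(compressor.compress(new) + compressor.flush())


def pseudo_binary(seed: int, size: int) -> bytes:
    """Deterministic random bytes, standing in for an already-compressed binary file."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


# Two versions of a text file that differ in a single line.
text_v1 = "".join(f"setting_{i} = {i * 7}\n" for i in range(200)).encode()
text_v2 = text_v1.replace(b"setting_100 = 700", b"setting_100 = 701")

# Two "versions" of a binary file that share no byte patterns at all.
bin_v1 = pseudo_binary(1, len(text_v1))
bin_v2 = pseudo_binary(2, len(text_v1))

print("text:  ", len(text_v2), "bytes, delta", delta_size(text_v1, text_v2))
print("binary:", len(bin_v2), "bytes, delta", delta_size(bin_v1, bin_v2))
```

Under these assumptions the text delta is a small fraction of the file, while the binary "delta" is about as large as the file itself, so keeping every version in full costs roughly the same.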
Git requires substantially more server-side storage in the naive case, but with planning and tweaking it can be compressed further and ends up needing less storage. The management is an ongoing manual requirement; audience feedback considered it essential. Hg uses compressed deltas, which gives pretty good average performance, but not as good as well-tuned Git.
Git uses write-once files (i.e., read-only after creation); Hg appends to files, so has a greater risk of corruption in the event of a crash. Their storage formats are optimized for different aspects and are quite different. Both are considering stealing some elements from the other.
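The difference between the two write patterns can be sketched in Python. This is only an illustration of "write once, then read-only" versus "append to an existing file"; it does not reproduce either tool's on-disk format, and the helper names `write_once` and `append_record` are hypothetical.

```python
import os
import tempfile


def write_once(directory: str, name: str, data: bytes) -> str:
    """Create `name` exactly once; an existing file is never modified."""
    final_path = os.path.join(directory, name)
    if os.path.exists(final_path):
        return final_path  # already stored, leave it untouched
    # Write the full content to a temporary file in the same directory first.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    os.chmod(tmp_path, 0o444)  # mark read-only before it becomes visible
    # The rename makes the complete file appear under its final name in one step.
    os.replace(tmp_path, final_path)
    return final_path


def append_record(path: str, data: bytes) -> None:
    """Append to an existing file; a crash mid-write can leave a partial tail."""
    with open(path, "ab") as log:
        log.write(data)


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    print(write_once(workdir, "object-1", b"first version"))
    append_record(os.path.join(workdir, "history.log"), b"record 1\n")
```

With the write-once pattern, an interrupted write leaves at most a stray temporary file, and files already stored are untouched. With the append pattern, an interrupted write lands in a file that also holds earlier data, which is the corruption risk mentioned above.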
Both can run off of an SVN backend via a bridge, but both say the performance sucks. Both can create their own repository from SVN and incrementally update it (i.e., copy changes made in SVN into the local repository).
Git consists of 180+ individual programs, each doing a small job well. Projects are expected to have their own shell scripts that implement their specific workflow. (I only heard part of this, but an audience member said that dozens of already-written samples come with the distribution, and hundreds of examples are available at various sites; it's their basic form of extension.) This is fast on *IX, stunningly slow on DOS. They anticipate that they will be creating some composite programs, mostly to avoid the overhead on DOS.
Hg is a Python module; their command-line tool simply calls the Python function that implements the action. The exact count of functions was unknown, but similar in scope to Git's commands. Like Git, there are hundreds of extensions (some so ubiquitous that even the developers use them) but rather than a fork/exec to do some action, it's a VM function call, so it's fairly fast everywhere.
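The cost difference between launching a separate program per action and calling a function in the same process can be measured with a rough Python sketch. The names `status`, `run_in_process`, `run_as_subprocess`, and `time_call` are hypothetical, and the child process here is a whole Python interpreter, which is heavier than one of Git's small programs, so the absolute numbers are only indicative.

```python
import subprocess
import sys
import time


def status() -> str:
    """Stand-in for the function that implements some action."""
    return "ok"


def run_in_process() -> str:
    """Hg-style: the command-line front end calls the function directly."""
    return status()


def run_as_subprocess() -> str:
    """Git-style: each action launches a separate program and reads its output."""
    result = subprocess.run(
        [sys.executable, "-c", "print('ok')"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def time_call(fn, repeat: int = 5) -> float:
    """Average wall-clock seconds per call over `repeat` runs."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) / repeat


if __name__ == "__main__":
    print("in-process:", time_call(run_in_process))
    print("subprocess:", time_call(run_as_subprocess))
```

Process creation is comparatively cheap on *IX and expensive on DOS, which fits the speakers' description of where the many-small-programs design hurts.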
Performance for both is dominated by network access, so it's generally similar. That said, Git uses substantially less CPU on *IX, so it's quicker on most benchmarks. Both are considered "fast enough." Neither had performance data for server-side operations (an audience member said it was "noise" on their Apache server compared to the load of serving normal pages, but when I asked, he said they didn't use CGI, so they didn't have the startup cost for every operation).
I asked about code size (an indirect measure of bugginess, as there are no bugs possible in code that doesn't exist): Git is 350,000 lines of code, mostly C; Hg is 40,000+ lines of code, virtually all in Python. Neither appeared to have a large-scale test suite.
I forgot to ask about GUI support, although given their anticipation that projects will create their own workflows from the pieces, writing a GUI could present serious difficulties.

