Hacker News
Lessons from PostgreSQL's Git transition (lwn.net)
75 points by corbet on Oct 18, 2010 | 35 comments



GitHub makes a tarball of each repository available for download, which is nice for those who don't have (or don't want to install) Git.

I understand that the Postgres team isn't using GitHub, but it makes me think -- could they use an old-school tarball to update their BuildFarm nodes that cannot run Git? Or better yet, rsync, so only changes are transferred?

Anyway, this is a decent writeup. I've wondered before about large-project migrations to Git.


Someone suggested that in the comments, and the reply was that the amount of bandwidth to pull down the tarball is greater than just syncing the changes.

It seems like they could use some sort of centralized checkout system + rsync to pull down changes. Possibly using gitolite's master/slave mirroring functionality.


IIRC, they even provide a two-way (commit and fetch) Subversion->Git gateway...



Great pains were taken to preserve the full commit history... does anyone know how valuable it really is? (Not to imply that it is worthless, but how much effort went into preserving it versus how much effort would be spent if it were not available?) More of a general VCS-conversion question, I suppose.


In my experience with Django, the commit history has been incredibly critical. There's been a good number of bugs that required digging back to nearly the beginning of the public commit history (July 2005). There have even been a few bugs that we only really understood after prying into the old private repo (which was converted from CVS and goes back to 2003).


+1. You can tell how valuable it is because I always end up crying when I'm trying to track the history of something and it turns out the change came from a branch merge that wasn't properly tracked, so you can't get at the precise commit that introduced it.


Depends on the project, but for a long-running project with a large code base and many contributors (many of whom are no longer actively involved in the project), it can be very valuable. In the case of PostgreSQL, the commit history goes back to ~1995, and most changes are described with a detailed commit message, which can be very helpful when understanding why a certain piece of code behaves the way that it does.


But shouldn't the comments from the source code provide an explanation?


Commit messages are a better place than comments for explaining the rationale behind cross-cutting changes.


I admit that I've been in situations where I had to find out in which commit a piece of code was introduced, but on the other hand the code and the comments weren't too great either. One of my goals was also to find the author, so that I could pass the bug to him.

I'm curious in what situations a commit message would be more appropriate than a comment. A couple of examples would be great.

By the way, is there something like Perforce's Time-lapse View tool[1] for git?

[1] http://www.perforce.com/perforce/press/pr55.html


Example: "Gathered & cut this code from files A, B, and C, because Client X changed their mind about situation Y, and now we can drop those special cases." Anything involving moving / deleting a lot of code. Adding comments to explain deleted code in several places really doesn't make sense.

I don't know about the viewer, but I'm going to look into this soon. We've used p4 for years and are strongly considering switching to git, and if so, I will be doing a lot of helping/explaining during the transition. I really dislike p4 as a VCS (though it was probably the best choice at the time), but agree its diff-viewing tool is actually quite good.


I agree that you can't put that explanation into a comment and that a changelog would be fine, but the project's documentation should also describe what's needed and what isn't. I guess this is one more argument for Fossil SCM's built-in wiki.


Git has a limited version of that built in: 'git gui blame' (also integrated with the 'gitk' tool). However, that doesn't let you jump quickly to different places in history. If you need something that shows you the file 'blame' and lets you easily jump around in history, I recommend 'qgit'.
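For reference, the GUI tools above are wrappers around plain `git blame` and `git log`. A throwaway-repo sketch of the underlying commands (file name, contents, and author are made up):

```shell
# Build a throwaway repo so the commands below run anywhere.
cd "$(mktemp -d)"
git init -q .
G="git -c user.name=demo -c user.email=demo@example.org"
echo "int main(void) { return 0; }" > main.c
git add main.c
$G commit -q -m "add main.c"

# Annotate each line of the file with the commit that last touched it.
git blame main.c

# Walk the file's own history ('--follow' tracks it across renames).
git log --follow --oneline -- main.c
```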


Thanks for the qgit tip, but it's still no match for Perforce.


And knowing what files were touched for the various commits is a big advantage as well.


It's also worth mentioning that it's sometimes critical for open source projects to be able to identify all authors of a particular part of the code. There were several cases in the past when a project had to go through a painful process of tracking down all the contributors in order to ask for their permission to relicense the code. Without version control history this would have been much harder, if not impossible.

Besides that, it just doesn't seem right to throw away potentially valuable historical information merely to avoid a manageable amount of work. After all, the conversion only has to be done once. Not to mention that if you don't migrate the history, there is a significant chance you'll regret it when it's already too late.

There is also a precedent to the contrary: when Linux migrated to the brand-new Git, all the historical information was discarded. The reason was probably that the metadata was stored in a proprietary BitKeeper repository, and there were obviously no migration tools available at the time.


Well, since they wanted the ability to build any previous version of PostgreSQL from the version history, I'd say it was quite valuable to them. I suspect this is the main driving issue in deciding whether to preserve commit history when porting to a new VCS; some developers may almost never go back and rebuild old versions from scratch. For OSS developers, though, I can see that they would need to preserve "full disclosure" and, if possible, make the full commit history available for others, if not the devs themselves, to analyze.


Yes, PostgreSQL supports major version releases for five years, so preserving the full history is critically important. Developers regularly back-patch bugfixes as far back as 7.4 (released November 2003, and just EOL'd this month).


Perl kept their commit history back to 1987 when they migrated to Git: http://perl5.git.perl.org/perl.git/commit/8d063cd8450e59ea1c... Presumably it’s worth something even if it’s just historical interest.


Also, the alternative to a commit history in Git is not "no commit history at all", it's a CVS repository on a read-only filesystem.


Sometimes it's absolutely critical for tracking down bugs - going back through previous versions to locate which exact commit introduced an error can be a huge time saver, and sometimes the only way to track an issue down in a reasonable period of time. Even if that isn't something that happens for your project very often, I tend to think of commit history as being much like insurance - it's better to have it and not need it than need it and not have it.


In another context, when making a point about code maturity, someone noted that people were still finding and fixing bugs in Postgres code that was over 10 years old. In those situations, I think preserving info about the provenance of every piece of code is extremely valuable. Particularly if you expect the project to live long enough to make another SCM system transition in the future.


Even after reading the comments here, surely just keeping a frozen CVS tree would have worked?


"Yes, that's correct. No merge commits. To submit a patch, extract it as a context diff and e-mail it. Committers are to apply the patch under their own names, without branch history. The project has decided, more-or-less, to use Git like it was CVS as far as commits to the main repository are concerned. Rather than adapt the PostgreSQL project's workflow to Git, Git would be adapted to the project's workflow."

I hope they fix that culture issue soon, one of the biggest strengths of Git is its merging.


Peter notes in the LWN comments that committers regularly use merging in their personal repos. The restriction applies only to commits to the master repository. Many people push their individual repos and branches to git.postgresql.org (and GitHub).

The point is to keep our commit history clean while we figure out best practices for the project.


I've been wondering what the point of keeping a "clean" history is. In what way is less information better? All this means is you're destroying the real history, rewriting it to look more like a straight line, which is not at all how it was developed.

If you want such a history to reflect how it was developed, there are wonderful version control programs out there. Like RCS. Why not just use that?


While I can't speak for the parent comment, one reason people try to keep clean commit histories in git is that you can easily undo, cherry pick, etc. individual commits.

If all of the changes necessary to add a feature / fix a major bug are in one commit, backporting that commit becomes much easier. If those changes are broken up in multiple commits ("fix this issue" + "oops, forgot that file (and unrelated changes in the same file)"), then it's more trouble. It usually doesn't make sense to have the repository in those intermediate states anyway, once the primary commit is done. History rewriting can make those commits atomic.

Generally speaking, "clean" git histories have relevant history combined, but no significant information lost.
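A self-contained sketch of that squashing, in a throwaway repo with made-up file names (interactively one would use `git rebase -i` instead of `git reset --soft`):

```shell
cd "$(mktemp -d)"
git init -q .
G="git -c user.name=demo -c user.email=demo@example.org"
$G commit -q --allow-empty -m "base"

# The messy intermediate history: a fix, then an "oops" follow-up.
echo "the fix" > fix.c
git add fix.c
$G commit -q -m "fix this issue"
echo "forgotten file" > other.c
git add other.c
$G commit -q -m "oops, forgot that file"

# Fold both commits into one atomic commit before publishing.
# (Interactively: git rebase -i HEAD~2 and mark the second as "squash".)
git reset --soft HEAD~2
$G commit -q -m "fix this issue (one atomic commit)"
git log --oneline
```

After this, backporting the fix is a single `git cherry-pick` of one commit rather than two.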


This is definitely true of Postgres, where backporting is relatively common (currently to 5 versions).

Another cultural difference is that the narrative in our commit log has a pretty strong voice. That consistency is valued above individual details about exactly who did what to produce the patch. If you want the details, we have extensive discussions of what led to a patch on the pgsql-hackers mailing list.

Our archives and threads are pretty well organized. We commonly refer back to threads 3-4 years old when discussing design decisions.

I agree with many of the commenters that it would be nice to have all of the details that led up to the final patch. Maybe something like a sub-commit history that is generally hidden, but possible to mine when desired. Here my ignorance of git is showing, as I don't know exactly what is possible.


I don't think there's any reason you couldn't have both merges and a clean history.

IIRC the git project itself accomplishes this: the maintainer forms topic branches from patch series sent to the mailing list (a single patch must still be a self-contained change, so "cleanliness" and bisectability are retained) and then merges those topic branches into whatever branch they're approved for: "pu" for proposed updates, "next" for the next major release, and "master" for the stable branch. The system is interesting in that "pu" gets completely rebased regularly while "next" and "master" have immutable history; and "master" is periodically merged into "next", so that any fixes for the stable branch get propagated into "next" as well.
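A minimal sketch of that topic-branch flow in a throwaway repo (branch and file names here are invented; the real git.git branch names are as described above):

```shell
cd "$(mktemp -d)"
git init -q .
G="git -c user.name=demo -c user.email=demo@example.org"
$G commit -q --allow-empty -m "stable: base"

# A self-contained change lands on its own topic branch...
git checkout -q -b topic/widget
echo "widget" > widget.c
git add widget.c
$G commit -q -m "widget: add widget"

# ...and is merged (not rebased) into the target branch when approved.
# --no-ff forces a real merge commit, so the topic stays visible.
git checkout -q -
$G merge -q --no-ff -m "Merge branch 'topic/widget'" topic/widget
git log --oneline --graph
```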


Indeed. I don't understand the problem the postgres people had with merge. I'm sure they have their reasons, though.

When people talk about clean commit histories in git, they usually mean rebasing / merging topic branches atomically like we said, though, and Groxx's comment asked about the rationale.


Also keep in mind that a common workflow for tracking down an issue is to step back through previous versions until you narrow down which one introduced the bug. In that case, you most certainly want fewer, cleaner commits. Narrowing a bug down to a commit that introduced four new features, fixed six bugs, and cleaned up a couple of dozen oopsies in the code base doesn't help you much.
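That narrowing-down is exactly what `git bisect` automates. A self-contained sketch in a throwaway repo, where the "bug" is just a grep-able string and all names are invented:

```shell
cd "$(mktemp -d)"
git init -q .
G="git -c user.name=demo -c user.email=demo@example.org"

# Five commits; pretend commit 3 introduced the bug ("line 3").
for i in 1 2 3 4 5; do
    echo "line $i" >> code.txt
    git add code.txt
    $G commit -q -m "commit $i"
done

# Mark HEAD bad and the root commit good, then let a test script drive
# the binary search: exit status 0 means "this revision is good".
git bisect start HEAD "$(git rev-list --max-parents=0 HEAD)"
git bisect run sh -c '! grep -q "line 3" code.txt'
git bisect log > bisect.log   # records the first bad commit
git bisect reset
```

With atomic commits, the commit bisect points at contains exactly one change, which is why cleaner histories make this so effective.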


Cleaner commits: agree. But the goal of fewer commits doesn't usually lead to fewer features per commit, rather more.


We strive to have exactly one feature or bugfix per commit. Squashing commits to a single one does not mean we mix stuff randomly.


Everyone here should subscribe to LWN and get this kind of content sooner.





