Migrating from SVN to Git



Over the last few weeks, LiveRamp has converted from an all-SVN shop to an all-Git shop, switching over our entire stack of Java/Ruby + Ant/Ivy/Rake + Hudson. It was a big effort, but it has yielded big results. It wasn’t easy to decide to commit to the migration, but in the end I think it’s clear we’ve come out ahead. In this post, I’ll try to detail the factors we considered when deciding to switch and then talk about some practical tips we learned about how to do the actual migration.

Why switch?

We didn’t decide to look down this migration path just because Git is the hot new thing these days. GitHub has done a lot to build Git’s popularity, and after using it to release some of our dev kits, we decided to develop Hank (our open-source distributed database) on our public GitHub account. The experience left us wishing we could use the same capabilities on all our projects.

A lot of things about Git are great. We don’t need to rehash the whole “why Git?” debate in general, but some inherent features like cheap branching and easy merging are just obvious productivity boosts for our team. However, the most compelling initial feature for us was the possibility of the many hyper-flexible workflows allowed by the decentralized nature of Git.

The most immediate workflow improvement we wanted to realize centered around our intern program. In order to protect our codebase from accidental catastrophic commits (e.g., a bad database migration), we restrict all new interns from committing directly to our production repos. Instead, the changes they make must be reviewed and committed by their mentors. It’s not reasonable to deny our interns the ability to do intermediate commits, since that would compromise their ability to share their work with others and make them susceptible to accidental code loss. The compromise we arrived at was to have interns maintain their own SVN branches of the projects they worked on. The commits in these branches would eventually be merged back after review. This system worked all right, but there were a lot of points of friction: administering the permissions settings is a bit annoying, and until recently, SVN merging was about as much fun as having your teeth pulled. With Git, however, this kind of branching, forking, and eventual merging is the norm and is very well supported. Add to that tools like Pull Requests, and the whole thing becomes very streamlined.

Which leads us to the next reason for switching: access to fantastic tools. When we saw that GitHub Enterprise had debuted, we were really excited. From our experience with Hank, we knew that GitHub offered a great many highly usable features that we wanted to use all the time. Of course, we already had some similar tools in-house for code browsing (Trac) and reviews (ReviewBoard), but honestly, I wouldn’t say that anyone here was particularly satisfied with their capabilities. With GitHub Enterprise, however, we get a first-rate code browser and perhaps the best code review tool any of us have ever used. Pull Requests are great for facilitating code reviews, but we also really like the fact that you can just comment on any commit at any time. This allows us to do lightweight ad-hoc code reviews without a second thought.

Why not to switch

As anyone with a few years’ worth of startup life can attest to, your code really adds up fast. Before you know it, you go from your first tiny project to a proliferation of them, all strung together in complex and interesting ways. At the center of this tangle is likely to be your source control system. Systems you build will start to rely on your source control system in subtle but important ways, and each of these new dependencies is another big reason to hold off from a migration even when you have a good list of reasons to give it a shot.

It’s rational to be risk-averse in these situations, because unless you have prior experience with the new tools, you’re going to have to learn a whole bunch of new things in a very short amount of time. You might have to scramble to figure out how to replace customizations you’ve built or work around features that don’t translate. Builds will break, and people will have to get over the initial learning curve, during which they’ll commit ugly disasters that you’ll have to learn how to undo.

We had a variety of problems like this come up in our migration process. The most troublesome involved our extensive use of SVN externals to manage dependencies in our system. While Git does have the concept of submodules, they’re not a direct replacement, and even though the differences could be worked around, it felt like exchanging one SCM-specific hack for another. What made this one even more troubling was that it was clearly a blocker: if we didn’t come up with a good replacement, then we would be faced with mountains of tedious work.

We spent a long time weighing all the positives and the negatives, specifically trying to figure out a good solution to the blockers like externals. Finally we felt like we had a path to making this a reality, believed the benefits outweighed the cost, and steeled ourselves for the work.

What we did

The migration process took the better part of a month with the labor of one person almost full time and one person pitching in from the devops side.

Install GitHub Enterprise

One of the very first things we did was to get a trial going with GitHub Enterprise. Since it was going to be replacing our SVN repo, it was of the utmost importance that we start to understand how we could operate the new repo software. GitHub provides a virtual appliance and a software package that automates virtually all of the installation process – you just supply a machine with hypervisor software installed, and it takes care of everything else. It wasn’t totally painless, as we had to suss out exactly how to configure some important integration points like our LDAP service, but in the end it wasn’t more than a few hours of clicking around.

A key mistake we made when first configuring GitHub Enterprise was to let it put our repos on the root of the virtual device. It seemed attractive as a way to just get up and running, but once we started importing projects, it started filling up very rapidly. It never quite got to fire drill stage, but it didn’t feel good to see that we were already at 50% space with far less than 50% of our projects imported. Our advice is to configure GitHub Enterprise to use an external block device immediately. The good news is that even if you don’t do this, when you ultimately switch over, GitHub just automatically moves your repos to the bigger disk. Yay for easy migration!

After we had the system running, we spent a little time configuring the Organization and Teams. Since we have a pretty flat structure, we only ended up making one organization, which we called MasterRepos, and one team for full-time employees, to whom we gave push/pull privileges on all repos owned by MasterRepos. The only people left were our interns, who by default get read-only access to the MasterRepos repos.

Dealing with externals

Before we could move our projects over to Git, we had to eliminate our dependency on SVN externals. We had basically three different classes of code artifacts that we shared through externals: common build files, Ivy configuration files, and shared Ruby code.

Ruby code was the easiest to resolve, as all we had to do was package the components up as gems and reconfigure our projects to use the appropriate gem. That’s not to say that it was easy, exactly. The key problem we faced was that our shared code wasn’t structured as coherent libraries, so it lacked the cohesiveness that would have made it trivial to gemify. Instead, we needed to put in the effort to figure out which files belonged in each gem and then unify them so that they could all be required together. As a result, there were some challenges in getting all the load path stuff correct, and the language certainly isn’t going to help you be absolutely sure that you got everything right. A good test suite was crucial during this step for exercising as many code paths as we could. Still, we didn’t catch everything until we did some staging deploys.

We ended up solving the problem of distributing build files and Ivy configuration files the same way. We made our Ant build system self-bootstrapping: it includes a target that can be used to download a replacement for itself. When we started the migration, we were doing an “svn export” to get the build files from where they were checked into SVN. After we’d migrated the build files themselves into a Git repo, we changed the build files, and the next time they were bootstrapped, they switched from doing “svn export” to “git clone”. Likewise, we added an Ant build target to download the Ivy config files from our Ivy repository.
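The switch boils down to changing the one command the bootstrap step runs. Here is a rough sketch of that logic in Python rather than Ant. The function names and the shallow-clone option are assumptions made for illustration, not a copy of our actual build target.

```python
import subprocess


def bootstrap_command(scm, url, dest):
    """Return the argv that fetches the shared build files into dest."""
    if scm == "svn":
        # "svn export" writes a clean copy with no SVN metadata.
        return ["svn", "export", url, dest]
    if scm == "git":
        # Assumption: a shallow clone is enough, since the bootstrap
        # only needs the latest version of the build files.
        return ["git", "clone", "--depth", "1", url, dest]
    raise ValueError("unsupported SCM: " + scm)


def bootstrap(scm, url, dest):
    """Run the fetch; raises CalledProcessError if the command fails."""
    subprocess.run(bootstrap_command(scm, url, dest), check=True)
```

Keeping the choice of command in one place is what lets a single update to the build files flip every project over the next time it bootstraps.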

Once we had the strategy all laid out and a few demo projects working, it was just a matter of systematically moving through every project and replacing the externals with the proper solution. This part was tedious, but not difficult.

Incrementally migrate projects to Git

Generally, as soon as we had eliminated all the external dependencies from a project, we would migrate it to a Git repo. Step one was to make sure that the developers working in a given project could accept some short downtime. After that, we would have someone from our Ops team change the permissions on the SVN project and deny all commits, guaranteeing that no one could introduce divergent changes.

Once the project was locked down, we’d use “git svn clone” to import the history into a new Git repo, one per project. We never ran into any problems with this step – git svn clone worked flawlessly. However, be aware that the history import process can take a long time and use lots of disk space, particularly when there are lots of binary files committed to SVN. (For instance, we had some really slow imports due to the way we used to check in our lib jars, rather than using Ivy.)
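If you want to script the import, a minimal sketch might look like the following. The helper name is hypothetical, and the two options shown are assumptions about your setup rather than a record of what we ran: “--stdlayout” only applies if the SVN project uses the usual trunk/branches/tags layout, and “--authors-file” is only needed if you want SVN usernames mapped to Git identities.

```python
def git_svn_clone_command(svn_url, dest, authors_file=None, stdlayout=True):
    """Build the argv for importing one SVN project, history included."""
    cmd = ["git", "svn", "clone"]
    if stdlayout:
        # Assumes the conventional trunk/branches/tags layout in SVN.
        cmd.append("--stdlayout")
    if authors_file:
        # File of lines like: jdoe = Jane Doe <jdoe@example.com>
        cmd.append("--authors-file=" + authors_file)
    cmd += [svn_url, dest]
    return cmd
```

Passing the result to subprocess.run(..., check=True) performs the import. Expect it to run for a while on projects with long histories or checked-in binaries.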

When the import’s done, all that’s left is to create a new repo on our GitHub machine and do the initial push. I made a point of always doing this to repos created and owned by our MasterRepos organization, but it’s not hard to move things around once they’re pushed. The final step would be to announce to the dev team that the project has moved into Git for future development.
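The initial push itself is just two Git commands. As a sketch, assuming an empty repo already exists on the GitHub machine and the imported history sits on “master” (the helper names and any remote URL you pass in are placeholders):

```python
import subprocess


def initial_push_commands(remote_url, branch="master"):
    """Return the Git commands that publish a freshly imported repo."""
    return [
        # Point the local repo at the new, empty repo on the server.
        ["git", "remote", "add", "origin", remote_url],
        # Push the imported history and track the remote branch.
        ["git", "push", "--set-upstream", "origin", branch],
    ]


def push_new_repo(repo_dir, remote_url):
    """Run the commands inside repo_dir, stopping at the first failure."""
    for cmd in initial_push_commands(remote_url):
        subprocess.run(cmd, cwd=repo_dir, check=True)
```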

Chase down dependent systems: deploys, CI builds, peripheral tools

Getting the projects from SVN to Git feels like a major achievement, and it is, but it’s just the start. Once the code has moved, it’s time to start pointing all your tools at the new spot.

We switched our Jenkins builds to reference the appropriate Git repo URLs. One surprise here was that if you don’t specify which branch to build, Jenkins will build all of them, which is probably not what you want. We always set this to “master”. Under the “advanced options,” we select “Wipe repo” to blow away any possible local changes when the build starts, and also set “Checkout/merge to local branch (optional)” to “master” so that we can automatically commit from our builds if necessary. This last one was a particular gotcha for getting some things to work right.

Updating our deploy system was pretty trivial. Typically we updated the deploy to point to the correct project URL right after we moved the project to Git.

We also had to go around and update various extraneous systems to look in the right places. For instance, we had an automatic JavaDoc publishing script that pulled from SVN that had to be redirected.

Further work

Today everything is up and running, but of course, “everything” is relative. There are still a bunch of things we’d like to get totally figured out.

One is GitHub’s notification scheme. As it stands, creating a pull request notifies a lot more people than we’d like. It seems like we’re stuck waiting for GitHub to update their capabilities. For now, we’re just getting some extra emails.

We used to have a number of pre-commit hooks in SVN. Unfortunately, cloning a Git repo does NOT clone the hooks, which means it’s difficult to distribute pre-commit hooks. We’re probably going to end up making the hook functionality more a part of our build system, which isn’t ideal, but it will get the job done. Relatedly, we tried to get server-side hooks configured, but right now GitHub Enterprise doesn’t support server-side hooks at all. They’re thinking about it, but again, we’ll have to wait.
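One way the build system could take this on is to keep the hook scripts in a tracked directory and have a build step copy them into each clone. This is only a sketch of that idea, not something we have settled on. The “build/git-hooks” directory and the function name are hypothetical.

```python
import shutil
import stat
from pathlib import Path


def install_hooks(repo_root, source_dir="build/git-hooks"):
    """Copy tracked hook scripts into .git/hooks and mark them executable.

    Returns the names of the hooks that were installed, in sorted order.
    """
    repo_root = Path(repo_root)
    source = repo_root / source_dir
    target = repo_root / ".git" / "hooks"
    target.mkdir(parents=True, exist_ok=True)

    installed = []
    for hook in sorted(source.iterdir()):
        if not hook.is_file():
            continue
        dest = target / hook.name
        shutil.copyfile(hook, dest)
        # Git will not run a hook that is not executable.
        mode = dest.stat().st_mode
        dest.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        installed.append(hook.name)
    return installed
```

Because the copy happens on every build, a developer’s hooks stay current without anyone having to remember a manual install step. Anyone can still bypass a client-side hook, though, so this does not replace server-side enforcement.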

We also want to put together a really good set of command line tools for working with our Git repos and pull requests. We have some started, but we’re confident there are more to be made. As we use the system more, we think these things will bubble up, and we’ll tackle them as they arise.