Transferring Files Between Git Repos

LiveRamp’s code base, like many large primarily-Java code bases, is made of a series of Maven modules. Over our years of operation, we’ve built up some very large kitchen-sink style projects that include code for many different features. These projects get in the way of modularization and abstraction by making it very easy to entangle dependencies between various pieces of code. Large projects like this also obscure code ownership and make it unclear who should be asked about modifications or enhancements to a particular bit of functionality. Both of these issues have been causing us increasing pain as we’ve rapidly expanded our engineering team, and so we recently began the process of not only splitting up these monolithic projects into sensible modules, but doing our best to automate the process as we go. We’ve encountered a few interesting problems while tackling this and wanted to share the solutions we’ve found. This will be the first in a series of posts about the process and tools we’ve developed along the way.

Transferring Files While Maintaining History

One large concern the team had when we began this project was maintaining history in our SCM (git) for files which move to a new repository. We had a pretty long list of requirements for the process:

  1. Original history should be maintained, including timestamps and commit authorship. One of our primary uses for our source history is figuring out who to ask about a change, and this information needs to be correct to enable that.
  2. The process should work not just for splitting off a new project but for transferring between two existing projects as well. This enables a wider range of refactorings and lets us more easily repair mistakes if code is left out of a split-off project.
  3. History for files not being transferred should be omitted. If moving a single file from project A to project B causes B to gain the entire git history of A, our git repos would become much larger than necessary and searching the history would become pretty confusing.
  4. The process should be fast enough to use while iterating. Disentangling code that lives in the same Maven module can be tricky and may require many attempts, so each transfer needs to be quick.
  5. Transfers should be easy to revert in case of mistakes.

We discussed a number of approaches and ended up settling on one that is fairly complex, but which satisfies all of these requirements. I’ll illustrate our approach by outlining the steps for transferring a package from a project Foo to a project Bar. Splitting a project is a special case of this: make a new project Bar and transfer the things you want split out of Foo.

First we identify all the file paths we want to move from Foo to Bar. This process handles directories recursively so that we can move whole Java packages together if we want. Let’s say we want to move the package com.liveramp.foo.widgets over to the project Bar. Our first step is to create a patch, generated from Foo’s history, which represents all of the commits for the files in the directories src/java/com/liveramp/foo/widgets and test/java/com/liveramp/foo/widgets. We do this using the following command:

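# Run from the source project (Foo), since that is where the history lives;
# the patch is written next to Bar so the later steps can find it.
cd Foo
git log --pretty=email --patch-with-stat -m --first-parent --reverse --full-index --binary -- src/java/com/liveramp/foo/widgets test/java/com/liveramp/foo/widgets > ../Bar/widgets.patch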

Most of the options here are standard to produce a patch we can later apply using ‘git am’, but the -m and --first-parent options were something we had to discover by trial and error. Without these options, the produced patch will not cleanly apply later if there were any merge commits along the way. If you make heavy use of GitHub’s pull request feature, your git history will be littered with merge commits, making these options necessary to get a usable patch.

The next step is to apply the patch we just made to an empty repository. We do this in order to satisfy requirement 5: if we simply applied this patch to project Bar, we’d get a linear series of commits on top of Bar’s history that, in case of a mistake or other problem, would need to be reverted one by one (this is not terribly hard in git, but it is a little bit error-prone). What we want instead is a merge commit that represents all of our changes and which can be reverted all in one go.
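To make the "reverted all in one go" point concrete, here is a minimal sketch in a throwaway repository. The repository, branch name (incoming), and file names are hypothetical and exist only for illustration; the relevant part is that a single git revert -m 1 on the merge commit undoes everything the merge brought in.

```shell
set -eu

# Hypothetical sandbox repository, created in a temp directory.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Example Dev"
git config user.email "dev@example.com"

echo base > base.txt
git add base.txt
git commit -q -m "base"
main_branch=$(git symbolic-ref --short HEAD)

# Three commits on a side branch stand in for the transferred history.
git checkout -q -b incoming
for n in 1 2 3; do
    echo "$n" > "file$n.txt"
    git add "file$n.txt"
    git commit -q -m "add file$n"
done

# Bring them in with an explicit merge commit.
git checkout -q "$main_branch"
git merge -q --no-ff -m "transfer files" incoming

# One revert undoes all three commits: -m 1 keeps the first parent
# (the mainline) and backs out everything the merge introduced.
git revert --no-edit -m 1 HEAD > /dev/null
```

After the revert, base.txt is still present and the three transferred files are gone, while the original commits remain visible in the history.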

We create the empty repo with

# Step back out of the project directory so FooTemp is created alongside Bar
cd ..
mkdir FooTemp
git init FooTemp
cd FooTemp
git am --committer-date-is-author-date < ../Bar/widgets.patch

The --committer-date-is-author-date option makes the commit date match the author date recorded in the original repo. Without this option, the commit date would show up as the date when the patch was applied (right now), which could be confusing.
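A quick way to see the effect of this option is to apply the same patch twice, once with the flag and once without, and compare the author date (%ad) with the committer date (%cd). The sketch below does that in a sandbox; the repository names (source, with_flag, without_flag) and the file are hypothetical.

```shell
set -eu

# Hypothetical sandbox with one source repo and two empty targets.
sandbox=$(mktemp -d)
cd "$sandbox"
for repo in source with_flag without_flag; do
    git init -q "$repo"
    git -C "$repo" config user.name "Example Dev"
    git -C "$repo" config user.email "dev@example.com"
done

# One commit with a deliberately old date.
echo hello > source/hello.txt
git -C source add hello.txt
GIT_AUTHOR_DATE="2012-03-04 12:00:00 +0000" \
GIT_COMMITTER_DATE="2012-03-04 12:00:00 +0000" \
    git -C source commit -q -m "add hello"

git -C source log --pretty=email --patch-with-stat --reverse \
    --full-index --binary > hello.patch

git -C with_flag am -q --committer-date-is-author-date < hello.patch
git -C without_flag am -q < hello.patch

# With the flag both dates are from 2012; without it the committer
# date is the moment the patch was applied.
git -C with_flag log --date=short --format='with flag:    author %ad, committer %cd'
git -C without_flag log --date=short --format='without flag: author %ad, committer %cd'
```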

Finally we need to pull the contents of this repo into Bar. We do this with

cd ../Bar
git fetch file:/;
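# Git 2.9 and later refuse to merge histories with no common ancestor
# unless --allow-unrelated-histories is given; older versions do not need it.
git merge --allow-unrelated-histories FETCH_HEAD -m "transfer widgets from Foo"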

We now have a merge commit which adds all the files we wanted with their original git history. Now we simply remove the files from Foo and we’re done!
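The whole procedure can be tried end to end in a sandbox before running it on real projects. The following is a minimal sketch, not our production tooling: the toy Foo and Bar repositories, the author "Alice Example", and the file contents are all hypothetical, and --allow-unrelated-histories is an assumption that you are on Git 2.9 or later.

```shell
set -eu

# Hypothetical sandbox: throwaway Foo, Bar and FooTemp repos.
work=$(mktemp -d)
cd "$work"
for repo in Foo Bar FooTemp; do
    git init -q "$repo"
    git -C "$repo" config user.name "Example Dev"
    git -C "$repo" config user.email "dev@example.com"
done

# Foo gets one commit touching the widgets package and an unrelated file.
widgets=src/java/com/liveramp/foo/widgets
mkdir -p "Foo/$widgets"
echo "class Widget {}" > "Foo/$widgets/Widget.java"
echo "class Other {}" > Foo/Other.java
git -C Foo add .
GIT_AUTHOR_DATE="2012-03-04 12:00:00 +0000" \
GIT_COMMITTER_DATE="2012-03-04 12:00:00 +0000" \
    git -C Foo commit -q --author="Alice Example <alice@example.com>" \
        -m "add widget and other"

# Bar already has history of its own.
echo "bar" > Bar/README
git -C Bar add README
git -C Bar commit -q -m "initial Bar commit"

# 1. Export only the history of the paths being moved.
git -C Foo log --pretty=email --patch-with-stat -m --first-parent --reverse \
    --full-index --binary -- "$widgets" > widgets.patch

# 2. Replay that history into the empty repository.
git -C FooTemp am -q --committer-date-is-author-date < widgets.patch

# 3. Merge it into Bar as a single, revertable merge commit.
git -C Bar fetch -q "$work/FooTemp"
git -C Bar merge -q --allow-unrelated-histories \
    -m "transfer widgets from Foo" FETCH_HEAD

# The transferred file should show its original author and date.
git -C Bar log --date=short --format='%an %ad %s' -- "$widgets"
```

Note that Other.java, which shared a commit with the widget in Foo, does not appear in Bar: the path arguments to git log limit both which commits are exported and which file changes each exported patch contains.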

A few approaches we considered but didn’t go with:

  1. Anything involving git filter-branch – solutions involving this command are often very slow when operating on old repositories with large numbers of commits. Combine that with the fact that --subdirectory-filter destroys directory information and that anything involving --index-filter would be very difficult to understand, and we discarded this class of solutions.
  2. Clone + delete – one of our engineers pointed out that splitting is super easy if you just make a new clone of the project, rename it, and delete all the files you don’t want. We ended up not going this route because of requirements 2 and 3: you end up with an over-large repo and no easy way to correct issues after the fact with this method.