From SVN to Git

February 3, 2019

Source Control Management (SCM) is a key part of our everyday work. Nowadays, it is also a collaboration platform rather than just a way to safeguard a team's work. In this field, Git has become a first-class citizen, especially over the last decade with the wide adoption of GitHub, GitLab and Bitbucket.

Given the many advantages of Git, migrating from another SCM makes sense even for legacy code bases.

This article is based on a migration we performed from a huge SVN repository.

By huge, we mean a single 10-year-old SVN repository containing:

  • Hundreds of underlying projects
  • ~450,000 commits
  • Hundreds of committers
  • Multiple active branches

This might not matter for brand new projects, but here we had to deal with commit history: for maintenance purposes, we need to be able to read the latest changes made to a particular set of files.

What we wanted to achieve:

  • Extract a single project at a time (a specific subset of our SVN repository)
  • Target a specific branch
  • Recover SVN committers
  • Recover SVN history
  • Complete the migration of a single project within a day

We started by looking at the existing tools and found plenty of homemade libraries for this purpose. Most of them worked fine when migrating a « snapshot » of an SVN repository, that is, without history.

At this point, no library was able to read our huge history successfully. Another constraint worked against us: installing software directly on the SVN server was not an option for security reasons. History had to be recovered using standard remote protocols (such as svn+ssh).

Git itself also provides a built-in, well-documented command line: « git svn ».

We finally moved to this approach to see what was possible and got better results, especially regarding SVN history. Nonetheless, we had to face a sad truth: recovering the full range of SVN revision history was impossible.

Our understanding of « git svn clone » is not perfect, but SVN revisions seem to be read in a strictly linear way, which makes the process time-consuming and I/O intensive.

At some point in the process, « git svn clone » appeared frozen and unable to make further progress. We decided to make a tradeoff: limit the number of recovered SVN revisions. That is precisely what the « -r » argument of the command line allows.
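As a rough illustration of that tradeoff, the lower bound passed to « -r » can be derived from the latest revision number and the number of revisions to keep. This is only a sketch: the function name min_revision is hypothetical, and the figures simply reuse the orders of magnitude mentioned in this article, assuming the latest revision number is close to the total commit count.

```shell
#!/usr/bin/env bash
# Sketch: compute the lower bound of a `git svn clone -r MIN:HEAD` range.
# head_rev and keep are illustrative values, not read from a real server.

min_revision() {
  # $1 = latest revision number, $2 = number of revisions to keep
  local head_rev=$1 keep=$2
  local min=$(( head_rev - keep + 1 ))
  # Never go below the first revision
  if (( min < 1 )); then
    min=1
  fi
  echo "$min"
}

head_rev=450000   # assumption: in practice this would be read from the SVN server
keep=200000       # how much history we accept to recover
range="$(min_revision "$head_rev" "$keep"):HEAD"

# Dry run: only print the command that would be executed
echo "git svn clone -r $range <svn-url> <target-dir>"
```

Printing the command instead of running it makes it easy to check the range before launching a clone that may take hours.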

In our case, we were able to recover about half of the SVN revisions, around 200,000. This was not perfect but acceptable. Here comes the first lesson: the « single-repository » SVN pattern should be avoided.

We created a bash script that automates as much of the process as possible.

It assumes a GitLab target.

# Prerequisites :
# * an empty target GitLab project named after TARGET_PROJECT editable property

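export USER=j.doe
export SOURCE_SVN_ROOT=svn/bar
export SOURCE_SVN_BRANCH=trunk
export TARGET_GITLAB_ROOT=gitlab/bar
export TARGET_GIT_BRANCH=master

# The steps below also rely on the following variables, which are not defined in this listing;
# fail early if any of them is missing
: "${SVN_PROJECT_PATH:?must be set}"
: "${SOURCE_SVN_MIN_REVISION:?must be set}"
: "${TARGET_GITLAB_PROJECT:?must be set}"
: "${GITLAB_PROJECT_ORIGIN:?must be set}"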


# STEP 1 : recovering SVN committers for history

echo "Retrieving SVN committers..."
svn log "$SVN_PROJECT_PATH" --xml | grep -P "^<author" | sort -u | perl -pe 's/<author>(.*?)<\/author>/$1 = $1 <$1\>/' > users
echo "Committers successfully retrieved"

# STEP 2 : initial conversion process using standard SVN layout for branches and tags

echo "Cloning from SVN repository $SVN_PROJECT_PATH (can be long)"
# -T/-b/-t already describe the layout explicitly, so the -s (--stdlayout) shorthand is not needed
git svn clone -r "$SOURCE_SVN_MIN_REVISION:HEAD" -T "$SOURCE_SVN_BRANCH" -b branches -t tags --authors-file=users --no-metadata "$SVN_PROJECT_PATH" "$TARGET_GITLAB_PROJECT"
echo "git svn clone finished"

echo "moving to $TARGET_GITLAB_PROJECT"
cd "$TARGET_GITLAB_PROJECT" || exit 1

# STEP 3 : restore regular GIT tags from their SVN counterparts

echo "Restoring proper Git tags from SVN tags..."
git for-each-ref refs/remotes/origin/tags | cut -d / -f 5- | grep -v @ | while read -r tagname; do git tag "$tagname" "origin/tags/$tagname"; git branch -r -d "origin/tags/$tagname"; done
echo "tags successfully restored"

# STEP 4 : restore GIT branches with their remote references

echo "Restoring local branches from refs/remotes/origin..."
git for-each-ref refs/remotes/origin | cut -d / -f 4- | grep -v @ | while read -r branchname; do git branch "$branchname" "refs/remotes/origin/$branchname"; git branch -r -d "origin/$branchname"; done
echo "branches restored"

# STEP 5 : set GIT remote origin

echo "Set remote origin to $GITLAB_PROJECT_ORIGIN"
git remote add origin "$GITLAB_PROJECT_ORIGIN"

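# STEP 6 (optional) : rename the migrated branch

git checkout "$SOURCE_SVN_BRANCH"
# remove the existing local $TARGET_GIT_BRANCH (if any), then rename the migrated branch in its place
git branch -d "$TARGET_GIT_BRANCH"
git branch -m "$SOURCE_SVN_BRANCH" "$TARGET_GIT_BRANCH"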

# FINAL STEP : manual push to remote repository

echo "Last step left manual, push to GitLab using the following command : git push --set-upstream origin $TARGET_GIT_BRANCH"
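
To make step 1 more concrete, here is a small sketch of the same idea applied to a hard-coded sample instead of a real « svn log --xml » output. It produces one « login = Name <email> » line per unique committer, which is the format expected by « --authors-file ». The function name authors_from_xml, the sample log and the example.com domain are assumptions made for illustration; the script above simply reuses the SVN login as the e-mail part.

```shell
#!/usr/bin/env bash
# Sketch: turn `svn log --xml` output into a git-svn authors file.
# The sample log and the example.com domain are made up for illustration.

authors_from_xml() {
  # Reads `svn log --xml` output on stdin and prints, for each unique login:
  #   login = login <login@DOMAIN>
  local domain=$1
  sed -n 's|^<author>\(.*\)</author>$|\1|p' \
    | sort -u \
    | while IFS= read -r login; do
        printf '%s = %s <%s@%s>\n' "$login" "$login" "$login" "$domain"
      done
}

sample_log='<?xml version="1.0"?>
<log>
<logentry revision="3">
<author>j.doe</author>
<msg>third</msg>
</logentry>
<logentry revision="2">
<author>a.smith</author>
<msg>second</msg>
</logentry>
<logentry revision="1">
<author>j.doe</author>
<msg>first</msg>
</logentry>
</log>'

authors_file=$(printf '%s\n' "$sample_log" | authors_from_xml example.com)
printf '%s\n' "$authors_file"
```

Reviewing the generated file by hand before the clone is worthwhile, since « git svn clone » stops when it meets a committer that is missing from the authors file.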