Tue 05 December 2023

Using Git LFS and fast-import together

Introduction

Recently, we worked on some mirroring software using Lorry, a tool we developed at Codethink as part of our Software Mirroring Solution. Lorry mirrors Git repositories, and can also mirror other kinds of repository, such as Subversion and Bazaar, by converting them to Git.

The normal process for writing changes to a Git repository is to git clone the repository, git checkout the branch the changes go into, make the changes (adding, removing, and modifying files), then git add the changed files and git commit them with an appropriate message and author. This has its downsides: you have to check out the entire contents of the head of a branch to make any change to it, which is very filesystem-intensive if that branch contains a lot of files.

Fortunately, Git's authors are well aware of this problem and wrote git fast-import, a tool for writing commits directly to Git history, which can be performed on a bare repository without having to check out any branches. Lorry's plugins all use Git fast-import, so my new plugin for adding tarballs of source code to a repository should, too.

Unfortunately, Git is inefficient at storing files that aren't plain text (like, say, gzip-compressed data).

Git LFS is a tool for efficiently adding large text files and binaries to Git. It does this by storing the files in a special subdirectory in .git and committing a "pointer" file into the Git history. Unfortunately, the command-line interface for this requires a Git checkout.

After trying to find an explanation on how to use these two together and failing, we decided to write our own so that whoever has to think about this problem next has a head start.

How to do it

If you simply need to add some files to a Git repository, you can scroll to the end of this article, where we have linked a rough-around-the-edges Python script that will do this for you.

If you need to know how that script works, need to implement that functionality as part of a bigger project, or are just curious, then read on...

Create or clone a Git repository

First of all, you need your Git repository. Given we don't need a working area, this is as simple as:

git init --bare my_repo

When adding files to an existing repository, clone it bare instead, so that it has the same layout as a repository created with git init --bare:

git clone --bare git@your-host.com:project/your_repo my_repo

Install Git LFS into that repository

Vitally, we also need to install Git LFS in that repository (and our system in general).

git -C my_repo lfs install

You'll know this has worked if the post-checkout, post-commit, post-merge, and pre-push hooks have been installed to my_repo/hooks, and git config --global --list shows entries under filter.lfs.

Define a .gitattributes file

Git LFS will only be used for files that are specified in .gitattributes.

If we were using a working tree, this would be a simple matter of running git lfs track <expression>, where <expression> is a glob pattern that matches those files (using much the same pattern syntax as .gitignore).

Assuming the branch we're committing to doesn't already contain an appropriately-formatted .gitattributes, we'll need to do that ourselves.

A useful basic example would be

* filter=lfs diff=lfs merge=lfs -text
.gitattributes filter diff merge text=auto

Later lines take precedence over earlier ones, so what this specifies is:

  1. Use "lfs" as the value of the 'filter', 'diff' and 'merge' attributes, and unset the 'text' attribute, for all files.

  2. Except .gitattributes itself, where 'filter', 'diff' and 'merge' are set without a value, so the "lfs" drivers no longer apply to it, and the 'text' attribute is "auto".

For your own purposes, you might specify individual file formats (e.g. *.jpg), or individual files.
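
As a rough illustration of tracking only specific patterns, the sketch below generates .gitattributes lines in the same form as the catch-all line above. The function name lfs_attribute_lines is our own invention for this example, and it assumes the patterns contain no spaces (which would need quoting).

```python
def lfs_attribute_lines(patterns):
    """Return .gitattributes content that routes each pattern through Git LFS.

    Assumption: patterns contain no whitespace, so no quoting is needed.
    """
    lines = [
        f"{pattern} filter=lfs diff=lfs merge=lfs -text" for pattern in patterns
    ]
    # One attribute line per pattern, with a trailing newline at the end.
    return "\n".join(lines) + "\n"
```

Calling lfs_attribute_lines(["*.jpg", "*.tar.gz"]) gives two lines, one per pattern, which can be used as the content of the .gitattributes file.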

Get the pointer digests of the files you're adding

Next, we need the pointer digests of these files. Let's suppose we have the file foo.jpg.

git lfs pointer --file foo.jpg

Will report on stderr

Git LFS pointer for foo.jpg

And on stdout

version https://git-lfs.github.com/spec/v1
oid sha256:1efb87c81994f1d308e3f315fc8d1192605e636404f8371baed1aa875667e0d2
size 645485

Notably, it contains a sha256sum of the file, and the size of the file in bytes.
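
Since the pointer is just these three lines, it can also be produced without calling git lfs pointer. The following is a minimal sketch, assuming the three-line format shown above is all we need; the function name lfs_pointer and the chunk_size parameter are our own choices for illustration.

```python
import hashlib

LFS_SPEC_VERSION = "https://git-lfs.github.com/spec/v1"


def lfs_pointer(path, chunk_size=1024 * 1024):
    """Return the Git LFS pointer text for the file at `path`."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so that large files are never held in memory whole.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return (
        f"version {LFS_SPEC_VERSION}\n"
        f"oid sha256:{digest.hexdigest()}\n"
        f"size {size}\n"
    )
```

The hex digest is the same value that names the stored object in the repository's LFS object store, so it is worth keeping alongside the pointer text.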

Copy the large files into your Git repo

Normally, Git LFS would be responsible for storing these files, but Git fast-import skips that stage, so we'll have to do that ourselves.

Large files are stored in a similar way to Git objects - based on their hash, with two subdirectories of hexadecimal two-digit pairs, e.g. in a directory starting with objects/1a/2b/1a2b3c4f...

Assuming we're still using that foo.jpg file as earlier, whose digest told us its sha256sum:

mkdir -p my_repo/lfs/objects/1e/fb
cp foo.jpg my_repo/lfs/objects/1e/fb/1efb87c81994f1d308e3f315fc8d1192605e636404f8371baed1aa875667e0d2

Note that the two subfolders are the first four digits of the sha256sum, then the full sum is used as the filename.
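
The same layout can be sketched in Python. This assumes a bare repository (so the LFS store lives directly under my_repo/lfs) and that the caller already knows the file's sha256 digest; the function names are hypothetical.

```python
import shutil
from pathlib import Path


def lfs_object_path(git_dir, oid):
    """Return where Git LFS expects the object with this sha256 `oid`."""
    # Two levels of two-hex-digit directories, then the full digest.
    return Path(git_dir) / "lfs" / "objects" / oid[:2] / oid[2:4] / oid


def store_lfs_object(git_dir, source, oid):
    """Copy `source` into the LFS object store of the repository at `git_dir`.

    The caller is responsible for `oid` being the sha256 of `source`;
    this sketch does not re-check it.
    """
    destination = lfs_object_path(git_dir, oid)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    return destination
```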

Write these files using Git fast-import

Now comes the hard part. Git fast-import has a complicated but thoroughly documented input format, described in its man page.

Git fast-import takes one long stream of commands into its stdin.

For the purposes of this example, we will be using the branch main

We will need a commit author. For the purposes of this example, I will use LFS Fast Import <lfsfastimport@example.com>.

We also need a time (in seconds since the epoch) to say that these files were committed. For convenience, I will be using the time of writing, 1652290426.

With shell commands, we'd start with

git -C my_repo fast-import --quiet <<EOF

Write the gitattributes commit

First, we'll start with .gitattributes.

commit refs/heads/main
committer LFS Fast Import <lfsfastimport@example.com> 1652290426 +0000
data <<EOM
Write gitattributes for LFS
EOM

A commit written without a parent starts a fresh history for the branch. To append to the branch's existing history instead, add a 'from' line naming the commit currently at the tip of that branch, directly after the commit message.

e.g.

$ git -C my_repo rev-parse main
c83d3f31d2f68299cdd639d9b4dd93aaf92d9dc0

Using that, we can base this next commit off that previous commit by adding to stdin:

from c83d3f31d2f68299cdd639d9b4dd93aaf92d9dc0

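If you don't want to retain history, don't add a line here. Note that if the branch already exists, fast-import will refuse to replace it with unrelated history unless it is run with --force.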
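
To make the placement of the optional 'from' line concrete, here is a small sketch that builds a commit header as bytes. The function name commit_header and its parameters are assumptions made for this example. It uses the exact byte count form of the data command (data <count>), which fast-import accepts as an alternative to the <<EOM delimiter form.

```python
def commit_header(ref, name, email, timestamp, message, parent=None):
    """Build the start of a fast-import commit command as bytes.

    `parent` is an optional commit hash; when given, a 'from' line is
    emitted straight after the commit message.
    """
    body = message.encode("utf-8")
    parts = [
        f"commit {ref}\n".encode("utf-8"),
        f"committer {name} <{email}> {timestamp} +0000\n".encode("utf-8"),
        # Exact byte count form: 'data <count>', the raw bytes, then a newline.
        f"data {len(body)}\n".encode("utf-8"),
        body,
        b"\n",
    ]
    if parent is not None:
        parts.append(f"from {parent}\n".encode("utf-8"))
    return b"".join(parts)
```

Passing parent=None gives a header with no 'from' line, which corresponds to not retaining history.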

Now, we continue with the actual file we're adding in this commit

M 100644 inline .gitattributes
data <<EOM
* filter=lfs diff=lfs merge=lfs -text
.gitattributes filter diff merge text=auto
EOM

Note that the extra blank line at the end is intentional: it separates this commit from the next one.

Commit the pointer files

First, we'll write a new commit header

commit refs/heads/main
committer LFS Fast Import <lfsfastimport@example.com> 1652290426 +0000
data <<EOM
Add some large files with LFS
EOM

We can omit the 'from' line since we've already specified what we're basing it on in this series of commits.

Now, for each file we're adding, we write an entry to fast-import. This example assumes you're writing foo.jpg to images/foo.jpg

M 100644 inline images/foo.jpg
data <<EOM
version https://git-lfs.github.com/spec/v1
oid sha256:1efb87c81994f1d308e3f315fc8d1192605e636404f8371baed1aa875667e0d2
size 645485
EOM

Then we finish the commit by adding a blank line.

Ending the fast-import

Now we've added all our commits, we finish our input stream.

EOF
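
The whole stream can also be assembled programmatically rather than through a shell here-document. The sketch below is a minimal Python version under some assumptions: the helper names (data_block, commit, run_fast_import) are ours for illustration, paths are assumed to need no quoting, and it uses the exact byte count form of the data command instead of the <<EOM delimiter form, which avoids trouble if the content happens to contain the delimiter.

```python
import subprocess


def data_block(payload):
    """Encode bytes as a fast-import 'data <count>' block."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def commit(ref, committer, timestamp, message, files, parent=None):
    """Build one fast-import commit as bytes.

    `committer` is 'Name <email>'; `files` maps repository paths to the
    bytes to store at that path (pointer text for LFS-tracked files).
    """
    out = f"commit {ref}\n".encode("utf-8")
    out += f"committer {committer} {timestamp} +0000\n".encode("utf-8")
    out += data_block(message.encode("utf-8"))
    if parent:
        out += f"from {parent}\n".encode("utf-8")
    for path, content in files.items():
        out += f"M 100644 inline {path}\n".encode("utf-8")
        out += data_block(content)
    # A blank line ends the commit and separates it from the next one.
    return out + b"\n"


def run_fast_import(git_dir, stream):
    """Feed the finished stream to git fast-import on stdin."""
    subprocess.run(
        ["git", "-C", git_dir, "fast-import", "--quiet"],
        input=stream,
        check=True,
    )
```

A caller would concatenate the bytes from one commit(...) call for .gitattributes and another for the pointer files, then hand the result to run_fast_import("my_repo", stream).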

Finishing up

Now that we've finished adding our files to the repository, we can check our changes.

We can check that the commits have been added into our history with

git -C my_repo log -p main

We can check that the files are properly part of LFS by cloning the repository and verifying that the actual file appears in the checkout

git clone file://$PWD/my_repo my_repo_2
cd my_repo_2
git checkout main
file images/foo.jpg

Note: we clone with a URL starting with file:// so that Git and Git LFS treat my_repo as a remote URL rather than as a plain local path (this does not involve HTTP). Without it, Git may throw an error citing batch request: missing protocol.

As long as file doesn't tell you foo.jpg is ASCII text, you've successfully added a file using Git LFS and fast-import!
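
If you would rather check this from a script than by eye, one option is to test whether the checked-out file still begins with the pointer's version line, which would suggest that Git LFS never swapped in the real content. This is a rough heuristic of our own, not part of Git LFS, and the function name is hypothetical.

```python
POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1\n"


def looks_like_lfs_pointer(path):
    """Return True if the file at `path` starts like a Git LFS pointer."""
    with open(path, "rb") as handle:
        # Only the first line is compared; the rest of the file is not read.
        return handle.read(len(POINTER_PREFIX)) == POINTER_PREFIX
```

For the checkout above, looks_like_lfs_pointer("images/foo.jpg") should be False once the real image has been fetched.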

In conclusion

As promised, the ready-made script that does all this is hosted at Codethink Labs.

Adding large files with Git fast-import requires you to know how to use Git fast-import, and how Git LFS works behind the scenes. Fortunately, Git fast-import is well-documented, the man page for Git LFS describes what it does, and the format of the Git LFS pointer files is relatively simple.

Codethink are available to help with advanced Git workflows. Please get in touch via sales@codethink.co.uk to find out more.
