Introduction
Recently, we worked on some mirroring software using Lorry, which is a tool we developed at Codethink as part of our Software Mirroring Solution. Lorry mirrors git repos and can also mirror other types of repo such as Subversion and Bazaar by converting them to Git.
The normal process for writing changes to a Git repository is to git clone the repository, git checkout the branch the changes go into, make the changes (adding, removing, and modifying files), then git add the changed files and git commit them with an appropriate message and author.
This has its downsides, such as having to check out the entire contents of
the head of a branch to make any changes to that branch, which is very
filesystem-intensive if you have a lot of files in that branch.
Fortunately, Git's authors are well aware of this problem and wrote git fast-import, a tool for writing commits directly to Git history, which can be performed on a bare repository without having to check out any branches. Lorry's plugins all use Git fast-import, so my new plugin for adding tarballs of source code to a repository should, too.
Unfortunately, Git is inefficient at storing files that aren't plain text (like, say, gzip-compressed data).
Git LFS is a tool for efficiently adding large text files and binaries to Git. It does this by storing the files in a special subdirectory in .git and committing a "pointer" file into the Git history. Unfortunately, the command-line interface for this requires a Git checkout.
After trying to find an explanation on how to use these two together and failing, we decided to write our own so that whoever has to think about this problem next has a head start.
How to do it
If you simply need to add some files to a Git repository, you can scroll to the end of this article, where we have linked a rough-around-the-edges Python script that will do this for you.
If you need to know how that script works, need to implement that functionality as part of a bigger project, or are just curious, then read on...
Create or clone a Git repository
First of all, you need your Git repository. Given we don't need a working area, this is as simple as:
git init --bare my_repo
When adding files to an existing repository, this step would instead resemble
git clone --bare git@your-host.com:project/your_repo my_repo
Install Git LFS into that repository
Crucially, Git LFS must also be installed on the system and then enabled in that repository:
git -C my_repo lfs install
You'll know this is installed if the post-checkout, post-commit, post-merge, and pre-push scripts have been installed to my_repo/hooks, and git config --global --list shows entries under filter.lfs.
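The hook check described above can be automated. The following is a minimal sketch, assuming a bare repository whose hooks live directly in its hooks directory; the function name missing_lfs_hooks is hypothetical and not part of Git LFS.

```python
from pathlib import Path

# Hook scripts that `git lfs install` is expected to create (as listed above).
LFS_HOOKS = ("post-checkout", "post-commit", "post-merge", "pre-push")


def missing_lfs_hooks(repo_path):
    """Return the names of the LFS hooks that are absent from a bare repo.

    Assumes a bare repository layout, i.e. hooks live in <repo>/hooks.
    """
    hooks_dir = Path(repo_path) / "hooks"
    return [name for name in LFS_HOOKS if not (hooks_dir / name).is_file()]
```

Calling missing_lfs_hooks("my_repo") should return an empty list once Git LFS has been installed into the repository. It does not check the filter.lfs configuration entries.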
Define a .gitattributes file
Git LFS will only be used for files that are specified in .gitattributes.
If we were using a working tree, this would be a simple matter of running git lfs track <expression>, giving a globbing expression that matches those files (in exactly the same way as files are matched in .gitignore).
Since we have no working tree, and assuming the branch we're committing to doesn't already contain an appropriately-formatted .gitattributes, we'll need to write one ourselves.
A useful basic example would be
* filter=lfs diff=lfs merge=lfs -text
.gitattributes filter diff merge text=auto
Later entries take precedence over earlier ones, so what this specifies is:
- Use "lfs" as the value of the 'filter', 'diff' and 'merge' attributes, and unset the 'text' attribute, for all files.
- Except .gitattributes itself, which has 'filter', 'diff' and 'merge' set without the "lfs" value (so LFS is not applied to it), and "auto" for the 'text' attribute.
For your own purposes, you might specify individual file formats (e.g. *.jpg), or individual files.
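If the attributes file is generated by a script, the rules above can be assembled from a list of patterns. This is only a sketch: build_gitattributes is a hypothetical helper, and it assumes simple patterns that need no quoting (for example, no spaces).

```python
def build_gitattributes(patterns=("*",)):
    """Build .gitattributes text that routes the given patterns through LFS.

    Assumes each pattern is a plain glob without spaces or other characters
    that would need quoting in a .gitattributes file.
    """
    lines = [f"{pattern} filter=lfs diff=lfs merge=lfs -text" for pattern in patterns]
    # Listed last so that it overrides any earlier pattern that also matches it.
    lines.append(".gitattributes filter diff merge text=auto")
    return "\n".join(lines) + "\n"
```

With the default argument this reproduces the two-line example above; build_gitattributes(["*.jpg", "*.png"]) would track only those two formats.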
Get the pointer digests of the files you're adding
Next, we need the pointer digests of these files. Let's suppose we have the file foo.jpg.
git lfs pointer --file foo.jpg
Will report on stderr
Git LFS pointer for foo.jpg
And on stdout
version https://git-lfs.github.com/spec/v1
oid sha256:1efb87c81994f1d308e3f315fc8d1192605e636404f8371baed1aa875667e0d2
size 645485
Notably, the pointer contains the SHA-256 sum of the file and the size of the file in bytes.
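Because the pointer is just a SHA-256 sum and a byte count, it can also be computed without calling git lfs pointer. The sketch below assumes the three-line pointer layout shown above; lfs_pointer is a hypothetical name, and its output should be compared against git lfs pointer before relying on it.

```python
import hashlib


def lfs_pointer(path, chunk_size=1024 * 1024):
    """Return (oid, size, pointer_text) for the file at `path`."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so that large files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    oid = digest.hexdigest()
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {size}\n"
    )
    return oid, size, pointer
```

The oid and size are returned separately because the oid is needed again when copying the large file into the repository.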
Copy the large files into your Git repo
Normally, Git LFS would be responsible for storing these files, but Git fast-import skips that stage, so we'll have to do that ourselves.
Large files are stored in a similar way to Git objects: named by their hash, inside two levels of subdirectory named after successive pairs of hexadecimal digits from that hash, e.g. in a directory starting with objects/1a/2b/1a2b3c4f...
Assuming we're still using the same foo.jpg file as earlier, whose digest told us its sha256sum:
mkdir -p my_repo/lfs/objects/1e/fb
cp foo.jpg my_repo/lfs/objects/1e/fb/1efb87c81994f1d308e3f315fc8d1192605e636404f8371baed1aa875667e0d2
Note that the two subfolders are named after the first two and the next two hexadecimal digits of the sha256sum, and the full sum is used as the filename.
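The same copy step can be sketched in Python. It assumes the bare-repository layout used above (lfs/objects directly under the repository directory); lfs_object_path and store_lfs_object are hypothetical helper names.

```python
import shutil
from pathlib import Path


def lfs_object_path(repo_path, oid):
    """Return where a bare repository keeps the LFS object with this oid."""
    return Path(repo_path) / "lfs" / "objects" / oid[:2] / oid[2:4] / oid


def store_lfs_object(repo_path, source, oid):
    """Copy `source` into the repository's LFS store unless it is already there."""
    target = lfs_object_path(repo_path, oid)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        shutil.copyfile(source, target)
    return target
```

Skipping the copy when the target exists is a design choice: the filename is the content hash, so an existing file with that name is assumed to hold the same bytes.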
Write these files using Git fast-import
Now comes the hard part. Git fast-import has a complicated but thoroughly explained input format, documented in the git-fast-import man page.
Git fast-import takes one long stream of commands into its stdin.
For the purposes of this example, we will be using the branch main.
We will need a commit author. For the purposes of this example, I will use LFS Fast Import <lfsfastimport@example.com>.
We also need a time (in seconds since the epoch) to say that these files were committed. For convenience, I will be using the time of writing, 1652290426.
With shell commands, we'd start with
git -C my_repo fast-import --quiet <<EOF
Write the gitattributes commit
First, we'll start with .gitattributes.
commit refs/heads/main
committer LFS Fast Import <lfsfastimport@example.com> 1652290426 +0000
data <<EOM
Write gitattributes for LFS
EOM
Fast-import will either replace a given branch with the commits in the stream, or append those commits on top of it. If you want to retain history, add a line with the hash of the last commit of that branch.
e.g.
$ git -C my_repo rev-parse main
c83d3f31d2f68299cdd639d9b4dd93aaf92d9dc0
Using that, we can base this next commit off that previous commit by adding to stdin:
from c83d3f31d2f68299cdd639d9b4dd93aaf92d9dc0
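If you don't want to retain history, don't add a line here. Note that fast-import refuses to replace an existing branch in this way unless it is run with --force.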
Now, we continue with the actual file we're adding in this commit
M 100644 inline .gitattributes
data <<EOM
* filter=lfs diff=lfs merge=lfs -text
.gitattributes filter diff merge text=auto
EOM
Note that the extra blank line at the end is intentional; it separates this commit from the next one.
Commit the pointer files
First, we'll write a new commit header
commit refs/heads/main
committer LFS Fast Import <lfsfastimport@example.com> 1652290426 +0000
data <<EOM
Add some large files with LFS
EOM
We can omit the 'from' line since we've already specified what we're basing it on in this series of commits.
Now, for each file we're adding, we write an entry to fast-import whose content is the pointer text we generated earlier. This example assumes you're writing foo.jpg to images/foo.jpg
M 100644 inline images/foo.jpg
data <<EOM
version https://git-lfs.github.com/spec/v1
oid sha256:1efb87c81994f1d308e3f315fc8d1192605e636404f8371baed1aa875667e0d2
size 645485
EOM
Then we finish the commit by adding a blank line
Ending the fast-import
Now that we've added all our commits, we finish our input stream.
EOF
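When the stream is generated by a program rather than typed into a heredoc, it can be built as bytes and piped to fast-import. The sketch below is one way to do that, not necessarily how the linked script does it. It uses fast-import's exact-byte-count form (data <count>) in place of the data <<EOM delimiter form shown above, so content can never collide with the delimiter. The names _data, commit_block and run_fast_import are hypothetical, and paths are assumed to be simple enough not to need quoting.

```python
import subprocess


def _data(payload):
    """Encode bytes in fast-import's exact-byte-count `data` form."""
    return b"data %d\n" % len(payload) + payload + b"\n"


def commit_block(ref, committer, when, message, files, parent=None):
    """Build one fast-import `commit` command as bytes.

    `files` maps repository paths to the bytes stored at that path
    (for LFS-tracked files, the pointer text rather than the real content).
    """
    out = [
        f"commit {ref}\n".encode(),
        f"committer {committer} {when} +0000\n".encode(),
        _data(message.encode()),
    ]
    if parent is not None:
        # Only needed on the first commit when extending existing history.
        out.append(f"from {parent}\n".encode())
    for path, content in files.items():
        out.append(f"M 100644 inline {path}\n".encode())
        out.append(_data(content))
    out.append(b"\n")  # blank line terminating the commit
    return b"".join(out)


def run_fast_import(repo_path, stream):
    """Pipe a complete stream into `git fast-import` for the given repository."""
    subprocess.run(
        ["git", "-C", repo_path, "fast-import", "--quiet"],
        input=stream,
        check=True,
    )
```

The two commits from this walkthrough would be built with two commit_block calls, passing parent only to the first, then concatenated and handed to run_fast_import("my_repo", stream).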
Finishing up
Now that we've finished adding our files to the repository, we can check our changes.
We can check that the commits have been added into our history with
git -C my_repo log -p main
We can check that the files are properly part of LFS by cloning the repository and verifying that the actual file appears in the checkout
git clone file://$PWD/my_repo my_repo_2
cd my_repo_2
git checkout main
file images/foo.jpg
Note: we clone with a URL starting with file:// so that Git LFS sees a remote URL with an explicit protocol, rather than a plain local path. Without it, git may throw an error citing batch request: missing protocol
As long as file doesn't tell you foo.jpg is ASCII text, you've successfully added a file using Git LFS and fast-import!
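The same check can be scripted by testing whether the checked-out file still begins with the pointer's version line. This is a minimal sketch, assuming the pointer format shown earlier; is_lfs_pointer is a hypothetical name.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1\n"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts like an LFS pointer file."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

If is_lfs_pointer("images/foo.jpg") returns True in the fresh clone, the real content was not fetched, which suggests the object was not stored where Git LFS expected to find it.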
In conclusion
As promised, the ready-made script that does all this is hosted at Codethink Labs.
Adding large files with Git fast-import requires you to know how to use Git fast-import, and how Git LFS works behind the scenes. Fortunately, Git fast-import is well-documented, the man page for Git LFS describes what it does, and the format of the Git LFS pointer files is relatively simple.
Codethink are available to help with advanced Git workflows. Please get in touch via sales@codethink.co.uk to find out more.