June 8, 2014

How does Git work?

So what’s behind the abstractions of branches and commits in git? How are the files really stored? At the heart of git is an object database, and everything in it is an object: commits, files and folders. Inside your repo, the entire history is stored in your .git directory.

Git takes the contents of every file, prepends a short header, computes the SHA-1 hash of the result, compresses it using zlib/deflate and stores it in its object database, where each object is a file named after its SHA-1 hash. A file stored this way is called a blob. Each directory is stored as a tree object, which is basically a flat list of its files and subdirectories with their modes (permissions), names and hash references. A commit object contains the commit message, its parent (or parents, for a merge), its author and committer, and a reference to the hash of the root directory tree. So when you make a change to a file, its hash changes. When you commit it, git writes a new blob, a new tree object for every directory on the path to that file, and a new commit pointing at the new root tree; unchanged files and directories are simply reused by hash. A branch is simply a reference to a commit.

Git refuses to switch branches when doing so would overwrite your uncommitted changes, because changes that exist only in your working directory are not in its object database, so git could not restore them afterwards. Stashing stores your changed files as objects and records them in a temporary commit that sits off to the side of your branch. That is how conceptually simple git is!
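To make the object format concrete, here is a minimal sketch in Python of how blob, tree and commit hashes can be computed in memory. The function names (`encode_object`, `object_id`, `loose_object`, `tree_body`, `commit_body`) and the author string are made up for illustration, and the tree sorting is simplified, so treat this as an approximation of what git does rather than its actual implementation.

```python
import hashlib
import zlib


def encode_object(kind, body):
    # Every object is stored as "<type> <size>" + NUL byte + body.
    return kind + b" " + str(len(body)).encode() + b"\0" + body


def object_id(kind, body):
    # The object's name is the SHA-1 of header plus body, as 40 hex characters.
    return hashlib.sha1(encode_object(kind, body)).hexdigest()


def loose_object(kind, body):
    # Returns the path below .git/objects (first two hex characters are the
    # directory name) and the zlib-compressed bytes that would be written there.
    oid = object_id(kind, body)
    return oid[:2] + "/" + oid[2:], zlib.compress(encode_object(kind, body))


def tree_body(entries):
    # entries: list of (mode, name, hex id). Each is stored as
    # "<mode> <name>" + NUL byte + the 20 raw bytes of the hash.
    # Simplification: plain sort by name (git orders directories slightly differently).
    return b"".join(
        mode + b" " + name + b"\0" + bytes.fromhex(oid)
        for mode, name, oid in sorted(entries, key=lambda e: e[1])
    )


def commit_body(tree, parents, author, message):
    # A commit is plain text: tree, parents, author, committer, blank line, message.
    lines = ["tree " + tree] + ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + author, "", message]
    return "\n".join(lines).encode() + b"\n"


blob_id = object_id(b"blob", b"hello world\n")
tree_id = object_id(b"tree", tree_body([(b"100644", b"hello.txt", blob_id)]))
commit_id = object_id(
    b"commit",
    # Hypothetical author, timestamp and timezone, for illustration only.
    commit_body(tree_id, [], "A U Thor <author@example.com> 1402185600 +0000", "Initial commit"),
)
print(blob_id)  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The blob hash printed here should match what `git hash-object` reports for a file containing `hello world` followed by a newline. Note how the commit only names the root tree, and the tree only names the blob: changing the file changes all three hashes.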

You must be thinking that there is a problem with this approach: what if you change a single file in 50 different commits, does git create 50 different copies of the same file? Yes and no. Initially, each version is stored as its own complete object. Git’s fix for this is a “garbage collection” step (git gc), which runs automatically once enough loose objects pile up; a similar packing step happens when you push to a remote. It looks for objects with the same file name and a similar size. It stores one version of the file in full (usually the most recent one) and the other versions as deltas against it, and combines them into a packfile. A packfile has an accompanying index file, which contains a sorted list of the hashes of the objects it contains and their offsets in the packfile.
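The role of the index file can be illustrated with a toy model in Python. This is not git's real packfile or index format: it leaves out delta compression entirely, keys objects by the SHA-1 of their raw bytes, and the names `build_pack` and `read_object` are invented for this sketch. It only shows the idea of one big file plus a sorted list of hashes and offsets.

```python
import bisect
import hashlib
import zlib


def build_pack(objects):
    # Concatenate compressed objects into one buffer and record where each
    # one starts. Each record is a 4-byte length followed by the zlib data.
    pack = bytearray()
    index = []
    seen = set()
    for data in objects:
        oid = hashlib.sha1(data).hexdigest()
        if oid in seen:  # identical content is stored only once
            continue
        seen.add(oid)
        index.append((oid, len(pack)))
        compressed = zlib.compress(data)
        pack += len(compressed).to_bytes(4, "big") + compressed
    index.sort()  # sorted by hash so lookups can use binary search
    return bytes(pack), index


def read_object(pack, index, oid):
    # Binary-search the sorted index for the hash, then jump to its offset.
    i = bisect.bisect_left(index, (oid,))
    if i == len(index) or index[i][0] != oid:
        raise KeyError(oid)
    offset = index[i][1]
    size = int.from_bytes(pack[offset:offset + 4], "big")
    return zlib.decompress(pack[offset + 4:offset + 4 + size])


versions = [b"line 1\n", b"line 1\nline 2\n", b"line 1\nline 2\nline 3\n"]
pack, index = build_pack(versions)
latest = hashlib.sha1(versions[-1]).hexdigest()
print(read_object(pack, index, latest) == versions[-1])  # True
```

Because the index is sorted, finding any object takes a binary search over the hashes instead of a scan through the whole packfile.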

So this was just a high-level overview of how git stores things internally; the actual implementation details may vary. You can find out more about git internals in this book.
