Git Objects

November 8, 2025

Git is a distributed version control system, primarily used for source code. This article introduces its data model and describes how data is stored and retrieved.

I am rewriting some parts of Git to learn Zig,^[1] and this article serves both as documentation for that activity and as a tool to help consolidate concepts.^[2]

What Is a Repository

A Git repository is a directory in which the data handed to Git is saved, along with all the metadata needed to maintain the version history. This is usually the hidden .git directory inside the directory in which you are working. Together, data and metadata make up the objects that Git stores internally in the so-called object database.

The working directory is also referred to as the worktree. It is needed for working with versions and for executing some commands, but it is not always required: the .git directory contains all the information needed to recreate a worktree. When only the .git directory is present (possibly under a different name) without a worktree, it is called a bare repository. This is the situation normally handled by remote synchronization services, like GitHub or GitLab, where there is no need to work directly on checked-out files.

Structure of the `.git` Directory

The .git directory has a default structure that, upon initialization, contains the empty folders objects/info, objects/pack, refs/heads, refs/tags, and the HEAD file.

The objects directory is used for the object database, while the refs directory holds references (that is, the files that track local or remote branches and tags). The HEAD file contains text pointing to the current branch, for example “ref: refs/heads/main”, where refs/heads is the relative path within the .git directory and main is the name of the file corresponding to the branch (this file does not exist until the first commit).
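The HEAD lookup described above can be sketched in a few lines of Python. This is only an illustration: parse_head and current_branch_file are hypothetical helper names, and the sketch assumes HEAD holds a single "ref: " line like the example above.

```python
from pathlib import Path
from typing import Optional


def parse_head(text: str) -> Optional[str]:
    """Return the ref path named by HEAD, or None if there is no 'ref: ' prefix."""
    text = text.strip()
    prefix = "ref: "
    if text.startswith(prefix):
        return text[len(prefix):]
    return None  # assumption: anything else is not a pointer to a branch


def current_branch_file(git_dir: Path) -> Optional[Path]:
    """Locate the branch file that HEAD points to, relative to git_dir."""
    ref = parse_head((git_dir / "HEAD").read_text())
    return git_dir / ref if ref is not None else None


print(parse_head("ref: refs/heads/main\n"))  # prints: refs/heads/main
```

Note that the returned path may not exist yet, since the branch file is only created by the first commit.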

The directory contains other files, but since we are only dealing with the data model here, we will ignore them.

You can create a new repository using the init command within a directory of your choice. For example:

❯ mkdir new-repository
❯ cd new-repository/
❯ git init
Initialized empty Git repository in /path/to/new-repository/.git/
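To make the layout concrete, here is a rough Python sketch that creates only the directories and the HEAD file listed in the previous section. It is not a replacement for git init, which also writes the other files this article ignores. The function name init_skeleton and the default branch name main are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def init_skeleton(worktree: Path, branch: str = "main") -> Path:
    """Create the bare-minimum .git layout described in the article."""
    git_dir = worktree / ".git"
    for sub in ("objects/info", "objects/pack", "refs/heads", "refs/tags"):
        (git_dir / sub).mkdir(parents=True, exist_ok=True)
    # HEAD names the branch file, which does not exist until the first commit.
    (git_dir / "HEAD").write_text(f"ref: refs/heads/{branch}\n")
    return git_dir


with tempfile.TemporaryDirectory() as tmp:
    git_dir = init_skeleton(Path(tmp))
    for path in sorted(git_dir.rglob("*")):
        print(path.relative_to(git_dir).as_posix())
```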

Object Types

All files that are given to Git become anonymous BLOBs (Binary Large Objects). Versioning and tagging operations lead to the creation of metadata for maintaining:

the directory tree structure and file names (TREE),
the version creation information (COMMIT),
the tag applied to a specific version (TAG).

These four (blob, tree, commit, and tag) are the only object types managed by Git.

Blobs are saved based only on their binary content; they have no logical structure and do not retain any file name. This means that saving two different files with the same content will produce only one object in the database. Objects are identified exclusively by their content (as explained below), which is why Git is a content-addressable filesystem.
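The deduplication effect can be illustrated with a toy in-memory store. This is a deliberately simplified sketch, not Git's format: the class ContentStore is hypothetical, and the key here is a plain SHA-1 of the raw bytes, whereas Git hashes the content together with a header, as described later in this article.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the content."""

    def __init__(self) -> None:
        self.objects = {}  # maps hex key -> stored bytes

    def put(self, content: bytes) -> str:
        # Simplification: hash only the raw bytes (Git also hashes a header).
        key = hashlib.sha1(content).hexdigest()
        self.objects[key] = content
        return key


store = ContentStore()
first = store.put(b"same bytes")   # say, the content of a.txt
second = store.put(b"same bytes")  # say, the content of b.txt
print(first == second, len(store.objects))  # prints: True 1
```

No file name is involved at any point, which is why two files with identical content end up as a single object.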

Tree, commit, and tag objects, unlike blobs, have a logical structure defined by Git (which resembles that of email messages^[3]). Writing these objects requires serializing them to the binary format used within the database, and reading them requires deserializing them from it.

The Database

There are various mechanisms and formats for saving objects in the database. The basic one discussed here, which is not particularly efficient, writes each object to its own file. Objects stored this way are called loose objects, to distinguish them from other formats such as packfiles.

Saving a New Object

To save an object in the database, it must first be serialized into binary format (this operation is not needed for blobs, which are already in the required format). A header is prepended to this binary content in the form of a C-style string (null-terminated), consisting of the object type (blob, tree, commit, or tag), a space, the content length in bytes written as a decimal ASCII number, and a null byte. The resulting sequence of bytes is the object that will actually be saved in the database and used for identification. From now on, we will call this byte sequence the encoded content.
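As a minimal sketch of this step, assuming the content has already been serialized, the header can be built like this in Python. The function name encode_object is hypothetical. The header is the type, a single space, the length as decimal ASCII digits, and a null byte.

```python
def encode_object(object_type: str, content: bytes) -> bytes:
    """Prepend the header (type, space, decimal length, null byte) to content."""
    if object_type not in ("blob", "tree", "commit", "tag"):
        raise ValueError(f"unknown object type: {object_type}")
    header = f"{object_type} {len(content)}".encode("ascii") + b"\x00"
    return header + content


print(encode_object("blob", b"give me a name"))
# prints: b'blob 14\x00give me a name'
```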

Before saving, the encoded content is compressed with the zlib deflate function. The save path, inside the objects directory, is determined by the object's identifier (see Object Identification below). To avoid cluttering a single directory with all the saved files, the first two characters of the identifier are used as the name of an intermediate directory, in which the file is saved under a name consisting of the remaining characters.
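The compression and path layout just described might look as follows in Python. This is a sketch under assumptions: the helper names are hypothetical, the identifier is taken to be the SHA-1 of the encoded content (covered in the Object Identification section), and the file is written directly, without the extra safeguards a real implementation would want.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def loose_object_path(git_dir: Path, object_name: str) -> Path:
    """Map an object name to objects/<first 2 chars>/<remaining chars>."""
    return git_dir / "objects" / object_name[:2] / object_name[2:]


def write_loose_object(git_dir: Path, encoded: bytes) -> str:
    """Compress the encoded content and store it under its own name."""
    object_name = hashlib.sha1(encoded).hexdigest()
    path = loose_object_path(git_dir, object_name)
    path.parent.mkdir(parents=True, exist_ok=True)
    if not path.exists():  # same content, same name: nothing new to write
        path.write_bytes(zlib.compress(encoded))
    return object_name


with tempfile.TemporaryDirectory() as tmp:
    name = write_loose_object(Path(tmp), b"blob 14\x00give me a name")
    print(loose_object_path(Path(tmp), name).relative_to(tmp).as_posix())
```

If the sketch follows the description faithfully, the printed path should start with objects/ followed by the first two characters of the name reported by hash-object.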

You can create a new blob using the hash-object command. For example, you can read directly from standard input:

❯ echo -n "give me a name" | git hash-object -w --stdin
dfa75596eeaaa914b9ee90b177ae16767f8d96a0

or you can use a file:

❯ echo -n "give me a name" > new-file.txt
❯ git hash-object -w new-file.txt
dfa75596eeaaa914b9ee90b177ae16767f8d96a0

Reading an Existing Object

To read an object from the database, the operations performed for writing must be carried out in reverse. The file is decompressed with the zlib inflate function, the binary content is separated from the header, and finally, for anything other than a blob, the object's logical structure is deserialized.
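The first two steps of the read path (inflate, then split header from content) can be sketched in Python as follows. The function name decode_object is hypothetical, and deserializing trees, commits, and tags is left out.

```python
import zlib
from typing import Tuple


def decode_object(compressed: bytes) -> Tuple[str, bytes]:
    """Inflate a stored object and split it into (type, content)."""
    encoded = zlib.decompress(compressed)
    # The header ends at the first null byte; the content may contain more.
    header, _, content = encoded.partition(b"\x00")
    object_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("length in header does not match the content")
    return object_type.decode("ascii"), content


stored = zlib.compress(b"blob 14\x00give me a name")
print(decode_object(stored))  # prints: ('blob', b'give me a name')
```

Checking the declared length against the actual content is a cheap way to notice a truncated or damaged file.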

You can read an existing blob using the cat-file command. For example, you can read the content:

❯ git cat-file -p dfa75596eeaaa914b9ee90b177ae16767f8d96a0
give me a name

or get the type:

❯ git cat-file -t dfa7559
blob

or get the size in bytes:

❯ git cat-file -s dfa7559
14

Object Identification

Git is often referred to as a key-value store, meaning that complex objects can be saved or retrieved based on an identifying key. A key feature of Git is that this key is a signature of the entire content of the saved object rather than one of its properties or attributes. The saving operation itself produces the key, and it does not need to be determined beforehand.

The identifier is calculated by applying the SHA-1 hashing function to the encoded content. This produces a fixed-length sequence of bytes (20 bytes for SHA-1) used as the access key to the object. For ease of use, this identifier is presented as a hexadecimal text string of 40 characters (often abbreviated to a shorter prefix) and is referred to as the object name (not to be confused with the file name).
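As a sketch of this calculation, the following Python function builds the encoded content and hashes it. The function name object_name is hypothetical. The empty blob is used as a check because its name is widely known.

```python
import hashlib


def object_name(object_type: str, content: bytes) -> str:
    """Hex SHA-1 of the encoded content (header plus serialized content)."""
    header = f"{object_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


print(object_name("blob", b""))
# prints: e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

name = object_name("blob", b"give me a name")
print(name, name[:7])  # full object name and an abbreviated form
```

The second call should reproduce the name returned by hash-object in the earlier example, provided the sketch matches what Git does for blobs.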

Note that the hashing function is used exclusively to obtain a signature of the content. This makes it useful for detecting content corruption during saving or network transfer, but it is not meant for security purposes. To make object names more robust, Git is gradually adopting the SHA-256 hashing function.

References

Pro Git book, “Git Objects”
VonC's answer to “Why does Git use a cryptographic hash function?”
Thread “Starting to think about sha-256?”

[1] https://codeberg.org/rafftre/zit
[2] Most of the many texts that discuss Git are either too scattered or too superficial, and the only reliable source remains the source code. For example, the README of one of the early versions of Git describes object management better than any other text that attempts to explain it.
[3] RFC 5322, Internet Message Format.