Git Storage Internals: From Snapshots to Checkout
Many people think of Git as “a system that records an operation log for every step.” That mental model creates confusion again and again when you learn rebase, merge, and reset.
Git’s real model is closer to this: an immutable, content-addressed object database.
Git Stores Snapshots, Not Operations
Each commit is not “a record of what you did.” It is “a record of what the project looked like at this moment.”
That “look” is a snapshot.
If you did not change a file between two commits, Git does not store another copy of the same content. It keeps referencing the old object. That gives you two results:
- Semantically, a commit is a complete snapshot.
- At the storage layer, Git deduplicates by reusing objects.
This is where many first-time Git users get misled: conceptually it looks like “backing up the whole repository,” but the implementation is really “object reuse.”
Git’s Four Core Object Types
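Git’s internal structure can be summarized in four keywords:
- blob: file content
- tree: directory structure
- commit: one snapshot of the project plus its metadata
- tag: an annotated tag that points at another object, usually a commit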
blob: Content Only, Not Filenames
A blob stores the raw bytes of file content. Git calculates a hash over that content prefixed with a short header (the object type and the size), using SHA-1 by default; newer versions also support a SHA-256 repository format. That hash is the object identifier.
The same content only needs one blob in the repository.
tree: A Directory Index
A tree records directory entries. Each entry contains a mode, a name, and the hash of the object it points to.
It describes “which files or subdirectories are under this directory, and which objects they point to.”
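As a rough sketch of what “mode, name, hash” looks like in bytes, the Python below serializes a tree the way Git's SHA-1 object format lays it out: each entry is the mode, a space, the name, a NUL byte, and the 20 raw hash bytes. The helper names git_object_id and tree_body are invented for this example, and it only handles SHA-1 repositories.

```python
import hashlib


def git_object_id(kind: bytes, body: bytes) -> str:
    """Hash an object body with the '<type> <size>' header and a NUL byte."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries into a tree object body."""

    def sort_key(entry):
        mode, name, _ = entry
        raw = name.encode("utf-8")
        # Directories (mode 40000) sort as if their name ended with '/'.
        return raw + b"/" if mode == "40000" else raw

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return body


hello_blob = git_object_id(b"blob", b"hello world\n")
root = tree_body([("100644", "hello.txt", hello_blob)])
root_id = git_object_id(b"tree", root)
```

Because the tree's own ID is a hash of its entries, changing any blob underneath it changes the tree's ID as well.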
commit: Connecting History
A commit contains at least the following information:
- A pointer to a root tree
- Pointers to its parent commits: none for the first commit, one for an ordinary commit, several for a merge commit
- Author and committer identities, each with a timestamp
- Commit message
That naturally makes commits form a directed acyclic graph, or DAG.
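The fields listed above can be sketched as a small Python function that builds a commit body and hashes it. The identity string, timestamps, and messages here are placeholder values, and commit_id is a name chosen only for this illustration; signed commits and extra headers are ignored.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of a tree with no entries


def commit_id(tree, parents, author, committer, message):
    """Build a commit body from its fields and return its SHA-1 object ID."""
    lines = ["tree " + tree]
    lines += ["parent " + parent for parent in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    body = "\n".join(lines).encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Placeholder identity: name, email, Unix timestamp, timezone offset.
who = "Ada <ada@example.com> 1700000000 +0000"

root = commit_id(EMPTY_TREE, [], who, who, "first\n")
other_root = commit_id(EMPTY_TREE, [], who, who, "another start\n")

# Same tree, author, and message, but a different parent: a different commit.
child_on_root = commit_id(EMPTY_TREE, [root], who, who, "second\n")
child_on_other = commit_id(EMPTY_TREE, [other_root], who, who, "second\n")
```

Since the parent IDs are part of the hashed body, a commit can only point at commits that already exist, which is why the graph cannot contain cycles.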
A Minimal Example: What Actually Happens Across Two Commits
Assume the project only has hello.txt.
The first commit contains:
hello world
Git creates a set of objects:
blob A (hello world)
tree A (hello.txt -> blob A)
commit A (root tree = tree A)
You change the file to:
hello world!!!
After the second commit:
blob B (hello world!!!)
tree B (hello.txt -> blob B)
commit B (root tree = tree B, parent = commit A)
If another file in the repository did not change, tree B still points to its original blob instead of copying a new object.
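The two-commit example can be simulated with a toy in-memory object store. This is a simplified sketch, not Git's implementation: ObjectStore and write_snapshot are hypothetical names, only a flat directory is supported, and README.md is added as an assumed second file that never changes.

```python
import hashlib


class ObjectStore:
    """A toy content-addressed store: object ID -> (kind, body)."""

    def __init__(self):
        self.objects = {}

    def put(self, kind, body):
        header = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
        oid = hashlib.sha1(header + body).hexdigest()
        # Storing the same content twice is a no-op: the key already exists.
        self.objects.setdefault(oid, (kind, body))
        return oid


def write_snapshot(store, files):
    """Store every file as a blob, then store one flat tree; return the tree ID."""
    body = b""
    for name in sorted(files):
        blob_id = store.put("blob", files[name])
        body += b"100644 " + name.encode("utf-8") + b"\0" + bytes.fromhex(blob_id)
    return store.put("tree", body)


store = ObjectStore()
tree_a = write_snapshot(store, {"hello.txt": b"hello world", "README.md": b"docs"})
count_after_a = len(store.objects)  # two blobs and one tree

tree_b = write_snapshot(store, {"hello.txt": b"hello world!!!", "README.md": b"docs"})
count_after_b = len(store.objects)  # one new blob and one new tree; README.md is reused
```

The second snapshot is complete in meaning, yet it only adds the objects that actually changed.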
What git log Really Does: Traverse Parent Pointers
git log does not read some “operation log table.”
Its core action is this: start from the commit pointed to by the current HEAD, then walk backward through parent pointers.
Simplified, it looks like this:
A <- B <- C <- D (HEAD)
Running git log means walking from D to C, B, and A, then formatting and displaying the metadata of each commit.
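A minimal sketch of that walk in Python, using a plain dictionary as a stand-in for the commit objects. The commit names and messages are illustrative, and the ordering is simplified: real git log also takes commit dates into account when history branches.

```python
# Hypothetical commit graph matching the diagram: A <- B <- C <- D.
commits = {
    "A": {"parents": [], "message": "first"},
    "B": {"parents": ["A"], "message": "second"},
    "C": {"parents": ["B"], "message": "third"},
    "D": {"parents": ["C"], "message": "fourth"},
}


def walk_history(commits, head):
    """Visit every commit reachable from head exactly once, following parents."""
    seen = set()
    stack = [head]
    order = []
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue  # merge histories can reach the same ancestor twice
        seen.add(commit)
        order.append(commit)
        # Push parents in reverse so the first parent is visited first.
        stack.extend(reversed(commits[commit]["parents"]))
    return order


history = walk_history(commits, "D")
log_lines = [f"{commit} {commits[commit]['message']}" for commit in history]
```

Nothing in this loop consults a separate log: the history is whatever the parent pointers make reachable from the starting commit.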
What git checkout <commit> Actually Does in Two Steps
Take git checkout B as an example. The core process can be split into two steps:
- Update where HEAD points.
- Rebuild the working directory files from the target commit’s root tree.
In other words:
- Find commit B
- Read the tree B it points to
- Write the corresponding blob contents from the tree back to disk
What you see is “the project returned to the state it had at moment B.”
Many people worry that “the later commits were lost.” Usually they are not. The commit objects are still in the repository, and any branch that pointed to them still does. Only HEAD has moved.
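The checkout steps can be sketched in Python with dictionaries standing in for the object database and the working directory. The repo layout and the function names checkout and materialize are assumptions made for this example. Real Git also updates the index and refuses to overwrite uncommitted local changes, which this sketch leaves out.

```python
# Hypothetical repository state from the two-commit example.
repo = {
    "HEAD": "commit-B",
    "commits": {
        "commit-A": {"tree": "tree-A", "parents": []},
        "commit-B": {"tree": "tree-B", "parents": ["commit-A"]},
    },
    # Each tree maps an entry name to (kind, object ID).
    "trees": {
        "tree-A": {"hello.txt": ("blob", "blob-A")},
        "tree-B": {"hello.txt": ("blob", "blob-B")},
    },
    "blobs": {"blob-A": b"hello world", "blob-B": b"hello world!!!"},
}


def materialize(repo, tree_id, prefix=""):
    """Recursively expand a tree into a {path: content} mapping."""
    files = {}
    for name, (kind, oid) in repo["trees"][tree_id].items():
        path = prefix + name
        if kind == "tree":
            files.update(materialize(repo, oid, path + "/"))
        else:
            files[path] = repo["blobs"][oid]
    return files


def checkout(repo, commit_id):
    """Step 1: move HEAD. Step 2: rebuild the working files from the root tree."""
    repo["HEAD"] = commit_id
    root_tree = repo["commits"][commit_id]["tree"]
    return materialize(repo, root_tree)


working_dir = checkout(repo, "commit-A")
```

After the call, commit-B is still present in repo["commits"]; only HEAD and the returned files changed.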
Git and diff: Separate the Logical Layer from the Storage Layer
A common question is: “Doesn’t Git store diffs?”
The answer has two layers:
- Logical model: Git organizes history as snapshot objects.
- Transfer and compression: packfiles may use delta compression to reduce size.
So when you learn Git conceptually, start by holding onto the “snapshot model.” Do not mistake the compression details of packfiles for Git’s core abstraction.
Connecting the Structure with an ASCII Diagram
commit D
|
v
tree D
|
+-- src/ -> tree X
| |
| +-- main.ts -> blob M
|
+-- README.md -> blob R
commit C
|
v
tree C
|
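+-- src/ -> tree Y
| |
| +-- main.ts -> blob K
|
+-- README.md -> blob R (reused)
Here README.md did not change, so blob R is reused by both commits. main.ts did change, so src/ points to a different tree in each commit (tree X in D, tree Y in C): a tree’s hash covers its entries, so a new blob hash produces a new tree hash.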
Why This Model Matters
Once you understand this model, many commands become predictable:
- Creating a branch is cheap because a branch is essentially a movable reference.
- merge creates a new commit with multiple parents, unless Git can simply fast-forward the branch.
- rebase replays commits on top of a new parent and creates new objects.
- reflog can save you because it records changes in reference positions.
You no longer rely on “memorizing commands.” You reason about behavior through changes in the object graph.
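To illustrate why branches are cheap and why a record of reference moves is useful, here is a toy Python model of references. The Refs class, the commit IDs c1 to c3, and the reason strings are all made up for this sketch; it is not how Git stores refs or reflogs on disk.

```python
class Refs:
    """A toy reference table that remembers every move."""

    def __init__(self):
        self.refs = {}    # branch name -> commit ID
        self.reflog = []  # (name, old ID, new ID, reason)

    def update(self, name, new_id, reason):
        old_id = self.refs.get(name)
        self.refs[name] = new_id
        self.reflog.append((name, old_id, new_id, reason))


refs = Refs()
refs.update("main", "c3", "commit: third change")

# Creating a branch copies one commit ID; no objects are duplicated.
refs.update("feature", refs.refs["main"], "branch: created from main")

# Moving main backward makes c3 unreachable from main...
refs.update("main", "c1", "reset: moving to c1")

# ...but the recorded move still tells us where main used to point.
previous_main = [old for name, old, new, _ in refs.reflog
                 if name == "main" and old is not None][-1]
```

The commits themselves never change in this model. Only the small table of names does, which is the part the reflog protects.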
Summary
Git is not an operation log system. It is a snapshot-driven object database:
- blob stores content
- tree stores directory mappings
- commit connects history
- log traverses the commit graph
- checkout switches references and restores a tree
When you view Git through this model, most “mysterious” day-to-day development problems become engineering questions you can verify.