Merkle tree design¶
A Merkle tree is a tree where every node's identity is the hash of its content and its children's identities. Two subtrees with the same content collapse to the same node. This is the same insight Git is built on.
Three node kinds¶
Neogit's Merkle tree has three kinds of node. The canonical serializations live
in neogit/core/merkle/filesystem.py (Blob, Tree) and neogit/merkle/hasher.py
(Commit). The contract is: same inputs, same hash, always.
Blob: SHA-1(content)¶
Leaves. The identity is the SHA-1 of the node's content, where "content"
depends on the file kind (FSMerkleVisitor.visit_FSFileNode):
- Regular file: the raw file bytes.
- Symlink: the link target string from
os.readlink(), not the bytes of whatever it points at. The symlink's own target is what is captured and hashed. - Anything else (sockets, FIFOs, devices, or a file that raises
OSErrorwhen read, such as an unreadable FUSE/reparse entry): hashed as empty content (b"").
Tree: SHA-1(sorted child entries)¶
A directory. FSMerkleVisitor.visit_FSDirectoryNode walks the directory's
children, sorted directories first, then by name, and for each child feeds
the bytes f"{child_name}{child_hash}\n" into the running SHA-1. The final
digest is the tree's hash.
# Equivalent of the canonical form, for one directory:
for child in sorted(children, key=lambda c: (not c.is_dir, c.name)):
h.update(f"{child.name}{child.hash}\n".encode())
tree_hash = h.hexdigest()
Note that the entry contains only name and child_hash; there is no
explicit "kind" byte. The hash is unambiguous because each directory commits
to its sorted, newline-delimited child list.
Commit: SHA-1(name + date + tree_sha1)¶
A snapshot. From hasher.py:
Note what is not in the commit hash: the description, the previous-commit
pointer, the branch. Two commits with the same name, same date, and the same
root tree therefore hash identically, and Neo4j's uniqueness constraint on
Commit.hash deduplicates them. If you need the description or the history
edge to participate in identity, that's a deliberate change to the canonical
form.
Why the children are sorted¶
If we hashed children in insertion order, two directories with the same files in different filesystem-walk orders would get different hashes. The deterministic sort (dirs-first, then by name) makes the hash a true function of the content of the directory, not the order the OS happened to return entries in.
What dedup buys us¶
Consider a 50,000-file source tree, snapshotted twice with one file changed.
- 1 new
Blob(the changed file) - ≤ depth-of-changed-file new
Treenodes (every ancestor directory) - 1 new
Commit
Everything else is reused: no new database rows, no new object-store uploads. A "snapshot" of a barely-changed tree costs roughly log(N) writes, not N.
For OSWatcher's whole-OS captures, where consecutive snapshots differ in dozens of files out of millions, this matters by orders of magnitude.
Why diffs are cheap¶
A diff between two commits walks both root trees in parallel:
- If both subtrees have the same hash → no recursion, no changes there
- If hashes differ → recurse into children
- Hit a
Blobmismatch → that's an edit; mismatchedBlobon one side → add/remove
Most of the tree usually hashes-equal at the top, so diffs touch O(changed paths) nodes, not O(total paths).
What we don't do (and could)¶
- No re-chunking of large files. A 4 GB ISO that flips one byte gets re-uploaded in full. Real Git mitigates this with delta packs; neogit doesn't (yet).
- No content-defined chunking. Each blob is one file. Fine for typical OS captures; weaker for huge binaries.
These are extension points, not architectural limits: the object-storage layer already abstracts the byte plane, so an alternative chunking strategy could slot in without touching the graph.