Why neogit?

neogit turns filesystem history into a content-addressed temporal graph. That phrase is doing a lot of work, so this page unpacks it, then shows the three kinds of question the graph makes cheap. Git can answer two of them only by scanning its entire history.

A content-addressed temporal graph

neogit borrows Git's model:

Content-addressed. Every file, directory, and commit is identified by the hash of its own content. Identical content is the same node everywhere, so deduplication is free and identity is global across snapshots, branches, and machines.
Temporal. The time axis is the commit DAG, exactly like Git. History is a set of immutable snapshots ordered by parent links, not a timestamp column.
Graph. Instead of packing that DAG into files on disk, neogit stores it in Neo4j, so the history is queryable with Cypher rather than only walkable through git's history commands.

The data model is small (Commit, Branch, Tree, Blob; see the data model reference). The power comes from putting it in a graph.
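As a rough illustration of what "identified by the hash of its own content" means for blobs, trees, and commits, here is a minimal sketch. It is not neogit's hashing scheme: SHA-256, the type prefixes, the JSON encoding, and the function names `blob_id`, `tree_id`, and `commit_id` are all assumptions made for the example.

```python
import hashlib
import json


def blob_id(data: bytes) -> str:
    """Identify a file by the hash of its bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(b"blob\0" + data).hexdigest()


def tree_id(entries: dict[str, str]) -> str:
    """Identify a directory by the hash of its sorted (name, child id) pairs."""
    payload = json.dumps(sorted(entries.items())).encode()
    return hashlib.sha256(b"tree\0" + payload).hexdigest()


def commit_id(tree: str, parents: list[str], message: str) -> str:
    """Identify a commit by its root tree, its parent commits, and its message."""
    payload = json.dumps(
        {"tree": tree, "parents": parents, "message": message},
        sort_keys=True,
    ).encode()
    return hashlib.sha256(b"commit\0" + payload).hexdigest()


# Two hypothetical snapshots that share one unchanged file.
readme = blob_id(b"hello\n")
v1 = tree_id({"README": readme, "app.cfg": blob_id(b"debug=0\n")})
v2 = tree_id({"README": readme, "app.cfg": blob_id(b"debug=1\n")})

c1 = commit_id(v1, [], "first snapshot")
c2 = commit_id(v2, [c1], "second snapshot")
```

The unchanged README hashes to the same id in both snapshots, so it would be stored once, while the two trees and the two commits get distinct ids because their content differs.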

neogit is a substrate you enrich

Capturing the filesystem is only the foundation. Because every node is content-addressed and lives in one graph, downstream tools can hang their own hashable characteristics off it: symbols, structs, registry values, syscalls, anything you can hash. Those characteristics inherit the same deduplication, the same temporal identity, and the same queryability as the file bytes.

OSWatcher is the reference example. It builds on neogit to trace a file, a registry key, a symbol, or a struct field across an operating system's entire release history. It is the same "git log for any characteristic" experience, applied to data Git's object model never knew about.

The three pillars

1. Evolution: how one thing changed over time

The classic version-control question: walk a characteristic backwards through history to see when it appeared and how it changed. This is what neogit diff and OSWatcher's "git log" feature surface. Examples include how _EPROCESS.Flags2 (a Windows kernel struct field) evolved from build to build, or every SHA-1 of /Windows/System32/OpenSSH/ssh.exe since it first shipped (demo).

Git can answer this one too, with git log and git diff. The next two are where it can't follow.

2. Provenance: where has this characteristic ever appeared?

Given one object, find every commit that contained it. Because objects are content-addressed and stored as graph nodes, this is a reverse traversal along edges that already exist: an inverted index that comes for free and that Git has no equivalent for. Because enrichment nodes live in the same graph, the same query answers "which operating systems ever shipped this exact symbol, struct, or registry value?" You just start the traversal from the enriched node instead of a Blob.
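To make the reverse traversal concrete, the sketch below models it with an in-memory inverted index. It only illustrates the idea and is not how neogit stores or queries data: the `snapshots` history, the hash strings, the `symbol_in_blobs` enrichment, and every function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical history: commit id -> {path: content hash}.
snapshots = {
    "c1": {"bin/ssh": "h_ssh_a", "etc/motd": "h_motd"},
    "c2": {"bin/ssh": "h_ssh_a", "etc/motd": "h_motd2"},
    "c3": {"bin/ssh": "h_ssh_b", "etc/motd": "h_motd2", "bin/ssh.bak": "h_ssh_a"},
}


def build_reverse_index(snapshots):
    """Map each content hash to the set of commits whose tree contains it."""
    index = defaultdict(set)
    for commit, files in snapshots.items():
        for content_hash in files.values():
            index[content_hash].add(commit)
    return index


contained_in = build_reverse_index(snapshots)


def provenance(content_hash):
    """Every commit that contained this exact content, changed or not."""
    return sorted(contained_in.get(content_hash, ()))


# Hypothetical enrichment: a symbol hash linked to the blobs that define it.
symbol_in_blobs = {"sym_main": {"h_ssh_a", "h_ssh_b"}}


def provenance_of_symbol(symbol_hash):
    """Same lookup, started from an enrichment node instead of a blob."""
    commits = set()
    for blob in symbol_in_blobs.get(symbol_hash, ()):
        commits |= contained_in.get(blob, set())
    return sorted(commits)
```

Here `provenance("h_ssh_a")` returns all three commits, including `c2`, which carried the file unchanged, and `c3`, where the same bytes reappear under a different path. The sketch has to build its index by scanning every snapshot once; in a graph store the equivalent reverse edges exist as a by-product of writing the data.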

3. Commonality: what is common or stable across all of history?

Aggregate across the whole corpus, not just two snapshots. Pairwise intersection and difference (what neogit diff does) is the n = 2 special case; the graph also answers corpus-wide questions such as "the 20 most widely shared files across every captured snapshot." "Most stable" is the same idea with the time axis folded in: a characteristic whose content hash stays unchanged across the longest run of consecutive releases. "Top 20 most stable kernel structs across Windows history" is a single query. It has no Git equivalent, because Git cannot enumerate membership across history without scanning all of it.
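The two aggregates described above, "most widely shared" and "most stable", can be sketched over a toy corpus as follows. The `releases` data, the struct names, and the helper functions are invented for illustration; in neogit these would be graph queries rather than Python loops.

```python
from collections import Counter

# Hypothetical ordered releases: (release name, {characteristic: content hash}).
releases = [
    ("r1", {"StructA": "a1", "StructB": "b1", "StructC": "c1"}),
    ("r2", {"StructA": "a1", "StructB": "b2", "StructC": "c1"}),
    ("r3", {"StructA": "a1", "StructB": "b2", "StructC": "c2"}),
    ("r4", {"StructA": "a1", "StructB": "b2", "StructC": "c2"}),
]


def most_shared(releases, top=20):
    """Rank content hashes by how many releases contain them."""
    counts = Counter()
    for _, items in releases:
        counts.update(set(items.values()))  # count each hash once per release
    return counts.most_common(top)


def longest_stable_run(releases, name):
    """Longest run of consecutive releases in which `name` keeps one hash."""
    best = run = 0
    previous = None
    for _, items in releases:
        current = items.get(name)
        if current is not None and current == previous:
            run += 1
        else:
            run = 1 if current is not None else 0
        best = max(best, run)
        previous = current
    return best


def most_stable(releases, top=20):
    """Rank characteristics by their longest unchanged run, ties by name."""
    names = {name for _, items in releases for name in items}
    scored = [(name, longest_stable_run(releases, name)) for name in names]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top]
```

With this data, `most_stable(releases)` ranks `StructA` first (unchanged for four releases), then `StructB` (three), then `StructC` (two). Setting `top=2` on either function shows how the n = 2 pairwise case generalizes to a corpus-wide ranking.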

Why a graph, and not Git plus a script?

The honest answer is one structural fact about how Git's object model works:

Git objects are addressed by the hash of their own content, and an object's bytes never reference what points at it, so the object graph is navigable only forward: commit → tree → blob. There is no native way to ask the reverse, "which commits have this object in their tree?" git log --find-object looks like the answer, but it belongs to the pickaxe family: it reports commits whose diff changes the number of occurrences of the object (an addition or a deletion), so it misses commits that carry the object unchanged and falsely includes commits that deleted it. The only correct answer is to walk all of history and inspect every tree, a scan rather than a lookup. (The commit-graph file and reachability bitmaps don't help: both are forward-only.)
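The difference between "the object's count changed here" and "the object is present here" is easy to see on a toy linear history. The sketch below does not invoke Git; it only models the two semantics described above, and the `history` data and function names are hypothetical.

```python
# Hypothetical linear history: (commit id, object ids in that commit's tree).
history = [
    ("c1", ["a"]),       # target not present yet
    ("c2", ["a", "T"]),  # target added
    ("c3", ["b", "T"]),  # target carried unchanged while another file changes
    ("c4", ["b"]),       # target deleted
]


def count_changed(history, target):
    """Commits where the number of occurrences of `target` differs from the
    parent's, modelling the pickaxe-style behaviour described in the text."""
    hits = []
    previous = 0  # the first commit is compared against an empty tree
    for commit, objects in history:
        current = objects.count(target)
        if current != previous:
            hits.append(commit)
        previous = current
    return hits


def contains(history, target):
    """Commits whose tree actually holds `target`: a scan over every tree."""
    return [commit for commit, objects in history if target in objects]
```

For the target `"T"`, `count_changed` reports `c2` and `c4`, while `contains` reports `c2` and `c3`. The first misses `c3`, which carries the object unchanged, and includes `c4`, which no longer has it. The second is correct but has to visit every commit.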

Provenance and commonality are exactly those reverse and aggregate questions. Put the content-addressed objects in a graph and they become single traversals, which is the whole reason neogit exists.

Where to look next

Why Neo4j?: the choice of graph database, and the Git comparison in full
Architecture overview: how structure and bytes are split across two stores
Merkle tree design: how the hashing works
Data model reference: exact node and edge shapes