|
| 1 | +# Building an Immutable, Content-Addressed Filesystem in Python |
| 2 | + |
| 3 | +*How Git-style content addressing creates elegant, functional data structures* |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +When you run `git commit`, something interesting happens: Git doesn't store your files by name. Instead, it computes a SHA-1 hash of each file's content and stores the file under that hash. The filename is just a pointer to the hash. This is **content addressing**—identifying data by what it contains rather than where it lives. |
| 8 | + |
| 9 | +This seemingly simple idea has profound implications. In this post, I'll show how to build a content-addressed virtual filesystem in Python, exploring how immutability and content addressing work together to create elegant, functional data structures. |
| 10 | + |
| 11 | +## Why Content Addressing? |
| 12 | + |
| 13 | +Consider a traditional filesystem. When you modify a file, the system overwrites the old content. The file's identity (its path) stays the same, but its content changes. This mutable approach has problems: |
| 14 | + |
| 15 | +1. **No automatic history**: Once you overwrite, the old data is gone |
| 16 | +2. **No deduplication**: Two identical files take up twice the space |
| 17 | +3. **No integrity verification**: Corruption can go undetected |
| 18 | + |
| 19 | +Content addressing solves all three. If a file's identity *is* its content (via a hash), then: |
| 20 | + |
| 21 | +1. **History is preserved**: Changing content creates a new hash, so the old version still exists |
| 22 | +2. **Deduplication is automatic**: Identical content has identical hashes—stored once |
| 23 | +3. **Integrity is built-in**: If the content doesn't match the hash, you know something's wrong |
| 24 | + |
| 25 | +## The Node Hierarchy |
| 26 | + |
| 27 | +Let's build this. First, we define our filesystem nodes using Python's frozen dataclasses: |
| 28 | + |
| 29 | +```python |
| 30 | +from dataclasses import dataclass, field |
| 31 | +import hashlib |
| 32 | +import json |
| 33 | + |
| 34 | +@dataclass(frozen=True) |
| 35 | +class Node: |
| 36 | + """Base class for all filesystem nodes.""" |
| 37 | + mode: int |
| 38 | + uid: int = 1000 |
| 39 | + gid: int = 1000 |
| 40 | + mtime: float = field(default_factory=time.time) |
| 41 | + |
| 42 | + def compute_hash(self) -> str: |
| 43 | + """Compute SHA256 hash of this node including all metadata.""" |
| 44 | + data = json.dumps(self.to_dict(), sort_keys=True) |
| 45 | + return hashlib.sha256(data.encode()).hexdigest() |
| 46 | +``` |
| 47 | + |
| 48 | +The `frozen=True` parameter is crucial. It makes instances immutable—you cannot modify a Node after creation. Any "change" requires creating a new Node. |
| 49 | + |
| 50 | +We then specialize for different node types: |
| 51 | + |
| 52 | +```python |
| 53 | +@dataclass(frozen=True) |
| 54 | +class FileNode(Node): |
| 55 | + """Regular file node.""" |
| 56 | + content: bytes = b"" |
| 57 | + |
| 58 | +@dataclass(frozen=True) |
| 59 | +class DirNode(Node): |
| 60 | + """Directory node containing references to child nodes.""" |
| 61 | + children: Dict[str, str] = field(default_factory=dict) # name -> hash |
| 62 | +``` |
| 63 | + |
| 64 | +Notice that `DirNode.children` maps names to *hashes*, not to Node objects directly. This is the key insight: directories don't contain files; they contain *references* to file hashes. The actual nodes live in a separate store. |
| 65 | + |
| 66 | +## The DAG Structure |
| 67 | + |
| 68 | +This reference-based approach creates a Directed Acyclic Graph (DAG): |
| 69 | + |
| 70 | +```python |
| 71 | +class FileSystem: |
| 72 | + """Content-addressable virtual filesystem.""" |
| 73 | + |
| 74 | + def __init__(self): |
| 75 | + # The DAG: hash -> Node |
| 76 | + self.nodes: Dict[str, Node] = {} |
| 77 | + |
| 78 | + # Path index: absolute path -> hash |
| 79 | + self.paths: Dict[str, str] = {} |
| 80 | + |
| 81 | + def _add_node(self, node: Node) -> str: |
| 82 | + """Add a node to the DAG, returning its hash.""" |
| 83 | + node_hash = node.compute_hash() |
| 84 | + if node_hash not in self.nodes: |
| 85 | + self.nodes[node_hash] = node |
| 86 | + return node_hash |
| 87 | +``` |
| 88 | + |
| 89 | +When we add a node, we compute its hash and store the mapping `hash → node`. If an identical node already exists (same hash), we don't duplicate it—we just return the existing hash. **Deduplication is automatic.** |
| 90 | + |
| 91 | +## Immutable Updates |
| 92 | + |
| 93 | +Here's where immutability shines. When we write to a file, we don't modify anything. Instead, we: |
| 94 | + |
| 95 | +1. Create a new FileNode with the new content |
| 96 | +2. Create a new DirNode for the parent, pointing to the new file hash |
| 97 | +3. Update the path index |
| 98 | + |
| 99 | +```python |
| 100 | +def write(self, path: str, content: bytes) -> bool: |
| 101 | + """Write content to a file.""" |
| 102 | + parent_path, name = self._get_parent_path(path) |
| 103 | + parent_hash = self.paths[parent_path] |
| 104 | + parent = self.nodes[parent_hash] |
| 105 | + |
| 106 | + # Create new file node |
| 107 | + file_node = FileNode(content) |
| 108 | + file_hash = self._add_node(file_node) |
| 109 | + |
| 110 | + # Create new parent directory with updated child reference |
| 111 | + new_children = dict(parent.children) |
| 112 | + new_children[name] = file_hash |
| 113 | + new_parent = DirNode(children=new_children) |
| 114 | + new_parent_hash = self._add_node(new_parent) |
| 115 | + |
| 116 | + # Update path index |
| 117 | + self.paths[parent_path] = new_parent_hash |
| 118 | + self.paths[path] = file_hash |
| 119 | + |
| 120 | + return True |
| 121 | +``` |
| 122 | + |
| 123 | +The old FileNode still exists in `self.nodes`. The old DirNode still exists too. We've just created new versions and updated where the path points. This is **structural sharing**—unchanged parts of the tree are shared between versions. |
| 124 | + |
| 125 | +## Visualizing the DAG |
| 126 | + |
| 127 | +Let's trace through an example: |
| 128 | + |
| 129 | +```python |
| 130 | +fs = FileSystem() |
| 131 | +fs.mkdir("/project") |
| 132 | +fs.write("/project/main.py", b"print('hello')") |
| 133 | +fs.write("/project/main.py", b"print('world')") |
| 134 | +``` |
| 135 | + |
| 136 | +After these operations, our DAG contains: |
| 137 | + |
| 138 | +``` |
| 139 | +Hash: a1b2c3... → DirNode(children={}) # original /project |
| 140 | +Hash: d4e5f6... → FileNode("print('hello')") # first version |
| 141 | +Hash: g7h8i9... → DirNode(children={"main.py": "d4e5f6..."}) |
| 142 | +Hash: j0k1l2... → FileNode("print('world')") # second version |
| 143 | +Hash: m3n4o5... → DirNode(children={"main.py": "j0k1l2..."}) |
| 144 | +``` |
| 145 | + |
| 146 | +Both versions of `main.py` exist. The path `/project/main.py` points to the latest hash (`j0k1l2...`), but we could easily restore the old version if we tracked which hashes corresponded to which versions. |
| 147 | + |
| 148 | +## Benefits in Practice |
| 149 | + |
| 150 | +This design enables powerful features almost for free: |
| 151 | + |
| 152 | +**Snapshots**: Save the current `paths` dictionary. Restore it later to go back in time. |
| 153 | + |
| 154 | +```python |
| 155 | +def snapshot(self) -> Dict[str, str]: |
| 156 | + """Create a snapshot of the current filesystem state.""" |
| 157 | + return dict(self.paths) |
| 158 | + |
| 159 | +def restore(self, snapshot: Dict[str, str]): |
| 160 | + """Restore filesystem to a previous snapshot.""" |
| 161 | + self.paths = dict(snapshot) |
| 162 | +``` |
| 163 | + |
| 164 | +**Deduplication**: Multiple paths can point to the same hash. |
| 165 | + |
| 166 | +```python |
| 167 | +# These might share the same underlying node if content is identical |
| 168 | +fs.write("/file1.txt", b"hello") |
| 169 | +fs.write("/file2.txt", b"hello") # Same hash, no new storage |
| 170 | +``` |
| 171 | + |
| 172 | +**Integrity checking**: If someone asks for a file, we can verify it. |
| 173 | + |
| 174 | +```python |
| 175 | +def verify(self, path: str) -> bool: |
| 176 | + """Verify a file's integrity.""" |
| 177 | + node_hash = self.paths[path] |
| 178 | + node = self.nodes[node_hash] |
| 179 | + return node.compute_hash() == node_hash |
| 180 | +``` |
| 181 | + |
| 182 | +## The Functional Programming Connection |
| 183 | + |
| 184 | +This approach is deeply connected to functional programming. In FP: |
| 185 | + |
| 186 | +- Data is immutable |
| 187 | +- "Changes" create new values |
| 188 | +- Sharing is safe because nothing mutates |
| 189 | + |
| 190 | +Our filesystem follows these principles exactly. Nodes are frozen. "Writing" creates new nodes. Multiple paths can safely share nodes because nodes never change. |
| 191 | + |
| 192 | +This is why Clojure's persistent data structures, Haskell's pure values, and Git's object store all use similar ideas. **Content addressing + immutability = safe, efficient, verifiable data.** |
| 193 | + |
| 194 | +## Trade-offs |
| 195 | + |
| 196 | +Nothing is free. This approach has costs: |
| 197 | + |
| 198 | +1. **Memory**: Old versions accumulate. You need garbage collection to reclaim space from unreachable nodes. |
| 199 | + |
| 200 | +2. **Performance**: Creating new nodes for every change can be slower than in-place mutation for write-heavy workloads. |
| 201 | + |
| 202 | +3. **Complexity**: Path resolution requires extra indirection through the hash table. |
| 203 | + |
| 204 | +For many use cases—especially those valuing history, integrity, and safe concurrency—these trade-offs are worthwhile. |
| 205 | + |
| 206 | +## Conclusion |
| 207 | + |
| 208 | +Content addressing transforms how we think about data. Instead of "where is this file?" we ask "what is this content's identity?" Instead of destructive updates, we create new versions while sharing unchanged structure. |
| 209 | + |
| 210 | +This pattern appears everywhere: Git, IPFS, Nix, Docker layers, and many database internals. Understanding it opens doors to building robust, elegant systems. |
| 211 | + |
| 212 | +The full implementation in [DagShell](https://github.com/queelius/dagshell) extends these ideas with a complete POSIX-like interface, demonstrating how content addressing can underpin a full virtual filesystem. |
| 213 | + |
| 214 | +--- |
| 215 | + |
| 216 | +*Next in this series: [Unix Philosophy in Python](#) — building composable commands with method chaining.* |
0 commit comments