Skip to content

Commit ee81813

Browse files
queeliusclaude
andcommitted
Add blog posts on DagShell design patterns
Three pedagogical posts covering: - Immutable, content-addressed filesystem design - Unix philosophy with method chaining in Python - Embedding a Scheme interpreter as a DSL 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
1 parent 518154d commit ee81813

3 files changed

Lines changed: 995 additions & 0 deletions

File tree

Lines changed: 216 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,216 @@
1+
# Building an Immutable, Content-Addressed Filesystem in Python
2+
3+
*How Git-style content addressing creates elegant, functional data structures*
4+
5+
---
6+
7+
When you run `git commit`, something interesting happens: Git doesn't store your files by name. Instead, it computes a SHA-1 hash of each file's content and stores the file under that hash. The filename is just a pointer to the hash. This is **content addressing**—identifying data by what it contains rather than where it lives.
8+
9+
This seemingly simple idea has profound implications. In this post, I'll show how to build a content-addressed virtual filesystem in Python, exploring how immutability and content addressing work together to create elegant, functional data structures.
10+
11+
## Why Content Addressing?
12+
13+
Consider a traditional filesystem. When you modify a file, the system overwrites the old content. The file's identity (its path) stays the same, but its content changes. This mutable approach has problems:
14+
15+
1. **No automatic history**: Once you overwrite, the old data is gone
16+
2. **No deduplication**: Two identical files take up twice the space
17+
3. **No integrity verification**: Corruption can go undetected
18+
19+
Content addressing solves all three. If a file's identity *is* its content (via a hash), then:
20+
21+
1. **History is preserved**: Changing content creates a new hash, so the old version still exists
22+
2. **Deduplication is automatic**: Identical content has identical hashes—stored once
23+
3. **Integrity is built-in**: If the content doesn't match the hash, you know something's wrong
24+
25+
## The Node Hierarchy
26+
27+
Let's build this. First, we define our filesystem nodes using Python's frozen dataclasses:
28+
29+
```python
30+
from dataclasses import dataclass, field
31+
import hashlib
32+
import json
33+
34+
@dataclass(frozen=True)
35+
class Node:
36+
"""Base class for all filesystem nodes."""
37+
mode: int
38+
uid: int = 1000
39+
gid: int = 1000
40+
mtime: float = field(default_factory=time.time)
41+
42+
def compute_hash(self) -> str:
43+
"""Compute SHA256 hash of this node including all metadata."""
44+
data = json.dumps(self.to_dict(), sort_keys=True)
45+
return hashlib.sha256(data.encode()).hexdigest()
46+
```
47+
48+
The `frozen=True` parameter is crucial. It makes instances immutable—you cannot modify a Node after creation. Any "change" requires creating a new Node.
49+
50+
We then specialize for different node types:
51+
52+
```python
53+
@dataclass(frozen=True)
54+
class FileNode(Node):
55+
"""Regular file node."""
56+
content: bytes = b""
57+
58+
@dataclass(frozen=True)
59+
class DirNode(Node):
60+
"""Directory node containing references to child nodes."""
61+
children: Dict[str, str] = field(default_factory=dict) # name -> hash
62+
```
63+
64+
Notice that `DirNode.children` maps names to *hashes*, not to Node objects directly. This is the key insight: directories don't contain files; they contain *references* to file hashes. The actual nodes live in a separate store.
65+
66+
## The DAG Structure
67+
68+
This reference-based approach creates a Directed Acyclic Graph (DAG):
69+
70+
```python
71+
class FileSystem:
72+
"""Content-addressable virtual filesystem."""
73+
74+
def __init__(self):
75+
# The DAG: hash -> Node
76+
self.nodes: Dict[str, Node] = {}
77+
78+
# Path index: absolute path -> hash
79+
self.paths: Dict[str, str] = {}
80+
81+
def _add_node(self, node: Node) -> str:
82+
"""Add a node to the DAG, returning its hash."""
83+
node_hash = node.compute_hash()
84+
if node_hash not in self.nodes:
85+
self.nodes[node_hash] = node
86+
return node_hash
87+
```
88+
89+
When we add a node, we compute its hash and store the mapping `hash → node`. If an identical node already exists (same hash), we don't duplicate it—we just return the existing hash. **Deduplication is automatic.**
90+
91+
## Immutable Updates
92+
93+
Here's where immutability shines. When we write to a file, we don't modify anything. Instead, we:
94+
95+
1. Create a new FileNode with the new content
96+
2. Create a new DirNode for the parent, pointing to the new file hash
97+
3. Update the path index
98+
99+
```python
100+
def write(self, path: str, content: bytes) -> bool:
101+
"""Write content to a file."""
102+
parent_path, name = self._get_parent_path(path)
103+
parent_hash = self.paths[parent_path]
104+
parent = self.nodes[parent_hash]
105+
106+
# Create new file node
107+
file_node = FileNode(content)
108+
file_hash = self._add_node(file_node)
109+
110+
# Create new parent directory with updated child reference
111+
new_children = dict(parent.children)
112+
new_children[name] = file_hash
113+
new_parent = DirNode(children=new_children)
114+
new_parent_hash = self._add_node(new_parent)
115+
116+
# Update path index
117+
self.paths[parent_path] = new_parent_hash
118+
self.paths[path] = file_hash
119+
120+
return True
121+
```
122+
123+
The old FileNode still exists in `self.nodes`. The old DirNode still exists too. We've just created new versions and updated where the path points. This is **structural sharing**—unchanged parts of the tree are shared between versions.
124+
125+
## Visualizing the DAG
126+
127+
Let's trace through an example:
128+
129+
```python
130+
fs = FileSystem()
131+
fs.mkdir("/project")
132+
fs.write("/project/main.py", b"print('hello')")
133+
fs.write("/project/main.py", b"print('world')")
134+
```
135+
136+
After these operations, our DAG contains:
137+
138+
```
139+
Hash: a1b2c3... → DirNode(children={}) # original /project
140+
Hash: d4e5f6... → FileNode("print('hello')") # first version
141+
Hash: g7h8i9... → DirNode(children={"main.py": "d4e5f6..."})
142+
Hash: j0k1l2... → FileNode("print('world')") # second version
143+
Hash: m3n4o5... → DirNode(children={"main.py": "j0k1l2..."})
144+
```
145+
146+
Both versions of `main.py` exist. The path `/project/main.py` points to the latest hash (`j0k1l2...`), but we could easily restore the old version if we tracked which hashes corresponded to which versions.
147+
148+
## Benefits in Practice
149+
150+
This design enables powerful features almost for free:
151+
152+
**Snapshots**: Save the current `paths` dictionary. Restore it later to go back in time.
153+
154+
```python
155+
def snapshot(self) -> Dict[str, str]:
156+
"""Create a snapshot of the current filesystem state."""
157+
return dict(self.paths)
158+
159+
def restore(self, snapshot: Dict[str, str]):
160+
"""Restore filesystem to a previous snapshot."""
161+
self.paths = dict(snapshot)
162+
```
163+
164+
**Deduplication**: Multiple paths can point to the same hash.
165+
166+
```python
167+
# These might share the same underlying node if content is identical
168+
fs.write("/file1.txt", b"hello")
169+
fs.write("/file2.txt", b"hello") # Same hash, no new storage
170+
```
171+
172+
**Integrity checking**: If someone asks for a file, we can verify it.
173+
174+
```python
175+
def verify(self, path: str) -> bool:
176+
"""Verify a file's integrity."""
177+
node_hash = self.paths[path]
178+
node = self.nodes[node_hash]
179+
return node.compute_hash() == node_hash
180+
```
181+
182+
## The Functional Programming Connection
183+
184+
This approach is deeply connected to functional programming. In FP:
185+
186+
- Data is immutable
187+
- "Changes" create new values
188+
- Sharing is safe because nothing mutates
189+
190+
Our filesystem follows these principles exactly. Nodes are frozen. "Writing" creates new nodes. Multiple paths can safely share nodes because nodes never change.
191+
192+
This is why Clojure's persistent data structures, Haskell's pure values, and Git's object store all use similar ideas. **Content addressing + immutability = safe, efficient, verifiable data.**
193+
194+
## Trade-offs
195+
196+
Nothing is free. This approach has costs:
197+
198+
1. **Memory**: Old versions accumulate. You need garbage collection to reclaim space from unreachable nodes.
199+
200+
2. **Performance**: Creating new nodes for every change can be slower than in-place mutation for write-heavy workloads.
201+
202+
3. **Complexity**: Path resolution requires extra indirection through the hash table.
203+
204+
For many use cases—especially those valuing history, integrity, and safe concurrency—these trade-offs are worthwhile.
205+
206+
## Conclusion
207+
208+
Content addressing transforms how we think about data. Instead of "where is this file?" we ask "what is this content's identity?" Instead of destructive updates, we create new versions while sharing unchanged structure.
209+
210+
This pattern appears everywhere: Git, IPFS, Nix, Docker layers, and many database internals. Understanding it opens doors to building robust, elegant systems.
211+
212+
The full implementation in [DagShell](https://github.com/queelius/dagshell) extends these ideas with a complete POSIX-like interface, demonstrating how content addressing can underpin a full virtual filesystem.
213+
214+
---
215+
216+
*Next in this series: [Unix Philosophy in Python](#) — building composable commands with method chaining.*

0 commit comments

Comments
 (0)