105 changes: 105 additions & 0 deletions blog/2026-04-06-curvine-metadata-benchmark/index.md
# Curvine Benchmark: 300 Million Files in Just 38 GB of Memory

*Translated from the original Chinese article published on April 6, 2026.*

In distributed file systems, metadata memory efficiency, concurrent request handling, and small-file throughput are core indicators of overall system capability. Curvine recently completed a high-intensity metadata benchmark, and the results were clear: Curvine reached a new high-water mark for open-source metadata efficiency while delivering core capabilities comparable to commercial distributed storage products.

### 🔥 Key Takeaways

- **Efficient memory usage**: With **800,000 directories** and **300 million files**, and one block written per file, Curvine used just **38 GB** of memory. That is roughly on par with the metadata-memory capability described for the commercial edition of JuiceFS in reference [1].
- **Low latency under massive concurrency**: With **100,000 clients** looping operations, throughput held steady at **53,000 ops/s**. Average command latency stayed **below 2 ms**, and **P99 latency stayed below 9 ms**.
- **High small-file throughput**: Under heavy concurrent small-file writes, Curvine sustained **12 million small files per hour**, with an average write time of **0.3 ms per file**.

## 📝 Test Setup

- **Curvine cluster**: one Master and one Worker
- **Benchmark machine**: Alibaba Cloud `ecs.i5.8xlarge`, 32 cores, 256 GB RAM
- **Clients**: 100,000 FUSE clients
- **Operations**: repeated high-frequency commands such as `mkdir`, `touch`, file writes, and `ls`

## 📊 Core Benchmark Results

### 🧠 Memory Efficiency: A New Open-Source High-Water Mark

- Managed scale: **800,000 directories + 300 million files**
- Per-file data written: **1 block**
- Total memory usage: **38 GB**
- Comparison point: roughly on par with the metadata-memory capability described for the commercial edition of JuiceFS in reference [1]

![Memory efficiency benchmark](./memory-efficiency.png)
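To put the headline number in per-file terms, the figures above can be divided out. This is a back-of-the-envelope sketch, not a measurement: it assumes "38 GB" means 38 GiB and that all of the Master's memory is attributable to namespace metadata. Under those assumptions the result is roughly 136 bytes per file (about 127 bytes if decimal gigabytes are meant).

```python
# Rough per-entry memory estimate from the reported benchmark totals.
# Assumptions (not stated in the article): binary units, and all Master
# memory counted as metadata.
GIB = 1024 ** 3

total_memory_bytes = 38 * GIB
files = 300_000_000
directories = 800_000

# Memory divided over files only, and over files plus directories.
bytes_per_file = total_memory_bytes / files
bytes_per_entry = total_memory_bytes / (files + directories)

print(f"{bytes_per_file:.0f} bytes per file")
print(f"{bytes_per_entry:.0f} bytes per namespace entry")
```

Directories make up less than 0.3% of the entries, so counting them barely changes the estimate.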

### ⏱️ High Concurrency, Low Latency at 100,000 Clients

- Concurrent clients: **100,000 FUSE clients**
- Stable throughput: **53,000 ops/s**
- Average latency: **no more than 2 ms**
- P99 latency: **no more than 9 ms**

![QPS under concurrency](./qps.png)

![Latency under concurrency](./latency.png)
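One way to read these numbers together is Little's Law (average requests in flight = throughput × average latency). The sketch below applies it to the reported figures, using the 2 ms bound as the average latency. It is an interpretation under that assumption, not something measured in the benchmark.

```python
# Little's Law applied to the reported figures: L = lambda * W.
clients = 100_000
throughput_ops_per_s = 53_000
avg_latency_s = 0.002  # reported upper bound on average latency

# Average number of requests in flight at any instant.
in_flight = throughput_ops_per_s * avg_latency_s

# Average request rate of a single client, and the gap between its requests.
ops_per_client_per_s = throughput_ops_per_s / clients
seconds_between_ops = 1 / ops_per_client_per_s

print(f"about {in_flight:.0f} requests in flight on average")
print(f"each client issues one operation every {seconds_between_ops:.1f} s")
```

Read this way, the 100,000 clients are mostly idle connections at any given moment, which is why per-connection memory overhead matters as much as raw request throughput in this test.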

Connection overhead was also low: **100,000 live connections consumed only 1.1 GB**, or about **11.5 KB per connection**.

![Connection overhead](./connection-overhead.png)

Once the benchmark stopped, Master memory dropped immediately from **39.1 GB** back to **38 GB**.

![Master memory after benchmark stop](./master-memory-recovery.png)
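The connection figures above are consistent with each other, as a quick check shows. The sketch assumes binary units (GiB and KiB), which is the interpretation that reproduces the quoted 11.5 KB per connection.

```python
# Cross-check of the connection overhead figures, assuming binary units.
KIB = 1024
GIB = 1024 ** 3

connections = 100_000
connection_memory_bytes = 1.1 * GIB

# Memory attributed to each live connection.
per_connection_kib = connection_memory_bytes / connections / KIB

# Memory released when the benchmark stops and connections close.
peak_gib, idle_gib = 39.1, 38.0
released_gib = peak_gib - idle_gib

print(f"{per_connection_kib:.1f} KiB per connection")
print(f"{released_gib:.1f} GiB released after the benchmark stopped")
```

The released memory matches the 1.1 GB attributed to connections, which suggests the remaining 38 GB is the steady-state metadata footprint.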

### 🚀 Small-File Throughput: Built for Scale

- Files written per hour: **12 million small files**
- Average write time per file: **0.3 ms**
- Write throughput stayed at its peak under high concurrency

At **15:00**, Curvine had written **287 million files**:

![Small-file count at 15:00](./small-file-count-1500.png)

At **16:00**, the total had reached **299 million files**:

![Small-file count at 16:00](./small-file-count-1600.png)
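The hourly rate and the per-file time follow directly from the two counts above. The short calculation below makes the arithmetic explicit. Note that the 0.3 ms figure derived this way is an amortized time per file across all concurrent writers (the inverse of throughput), which is not necessarily the latency a single client observes.

```python
# Derive the hourly write rate and amortized per-file time from the two counts.
files_at_1500 = 287_000_000
files_at_1600 = 299_000_000

files_per_hour = files_at_1600 - files_at_1500
files_per_second = files_per_hour / 3600

# Wall-clock milliseconds in an hour divided by files written in that hour.
avg_ms_per_file = 3600 * 1000 / files_per_hour

print(f"{files_per_hour:,} files per hour")
print(f"{files_per_second:,.0f} files per second")
print(f"{avg_ms_per_file:.1f} ms per file (amortized)")
```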

## 🏗️ Metadata Architecture

Curvine's metadata subsystem stands out in large-scale memory efficiency and high-concurrency performance, and it compares well with other open-source systems. Those results come from a deliberately designed metadata architecture.

![Curvine metadata architecture overview](./metadata-architecture.png)

### 💡 Design Principles

1. A single Master should support very large namespaces and massive numbers of small files.
2. The system should provide high concurrency and low latency for frequent operations such as create, delete, and update.
3. External dependencies should be minimized to reduce operational complexity while keeping the system stable.

Based on those goals, Curvine combines an **in-memory directory tree**, **standalone RocksDB**, and a **Raft-based consistency mechanism**. This three-layer design balances performance, scale, and stability.

| Layer | Core Responsibility | Why It Exists |
| --- | --- | --- |
| In-memory directory tree | Stores directory structure metadata such as directory names and parent-child relationships; handles path resolution, directory listing, and other high-frequency namespace operations | Keeps the hottest namespace operations in memory so directory lookups and path matching stay in the microsecond range; stores only lightweight directory structure to maximize scale |
| Metadata RocksDB (`inode` engine) | Persists complete file and directory metadata, including file size, permissions, `mtime`, block locations, and full directory relationships | Uses column families to separate different metadata types, improving read/write efficiency and making frequent metadata updates easier to manage |
| Raft log RocksDB | Persists the log of all metadata mutations, including create, delete, and update operations, in order for node-to-node synchronization | Separates log storage from metadata storage so replication, compaction, cleanup, and recovery do not interfere with metadata reads and writes |
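The division of labor in the table can be illustrated with a toy model. The sketch below is not Curvine's implementation: `ToyMaster`, `DirNode`, and every method name are hypothetical, a plain dict stands in for the metadata RocksDB, and a list stands in for the Raft log RocksDB. It only shows which layer each kind of operation touches.

```python
from dataclasses import dataclass, field


@dataclass
class DirNode:
    """In-memory directory node: holds only a name and child directories."""
    name: str
    children: dict = field(default_factory=dict)


class ToyMaster:
    """Hypothetical three-layer metadata model (illustration only)."""

    def __init__(self):
        self.root = DirNode("/")  # layer 1: in-memory directory tree
        self.inode_store = {}     # layer 2: stand-in for the metadata RocksDB
        self.raft_log = []        # layer 3: stand-in for the Raft log RocksDB

    def _resolve_dir(self, path):
        # Path resolution walks the in-memory tree only.
        node = self.root
        for part in [p for p in path.split("/") if p]:
            node = node.children[part]  # KeyError if a directory is missing
        return node

    def _log_and_store(self, op, path, attrs):
        # Record the mutation in the log, then persist the full metadata.
        self.raft_log.append((len(self.raft_log) + 1, op, path))
        self.inode_store[path] = attrs

    def mkdir(self, path):
        parent, _, name = path.rstrip("/").rpartition("/")
        parent_node = self._resolve_dir(parent)
        self._log_and_store("mkdir", path, {"type": "dir"})
        parent_node.children[name] = DirNode(name)

    def create(self, path, size=0):
        parent, _, _ = path.rpartition("/")
        self._resolve_dir(parent)  # validate the parent directory in memory
        self._log_and_store("create", path, {"type": "file", "size": size})

    def list_dirs(self, path):
        return sorted(self._resolve_dir(path).children)


master = ToyMaster()
master.mkdir("/a")
master.mkdir("/a/b")
master.create("/a/b/f.txt", size=10)
print(master.list_dirs("/a"))
print(len(master.raft_log), "logged mutations")
```

In this toy version, file attributes live only in the persistent store while the tree keeps directory structure, mirroring the "lightweight directory structure" rationale in the table. Replication, column families, and crash recovery are deliberately left out.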

### 🛡️ FsMode: Working with UFS for Safe Durability

Curvine also supports **FsMode**, which synchronizes metadata and file data to the underlying file system (UFS). This creates a dual safety model of **local storage plus disk-backed fallback**, preventing data loss without sacrificing runtime performance.

## 🚀 Future Directions

Curvine's metadata system will keep pushing forward in three areas:

1. **10 billion files on a single node**: continue deepening single-node capability until a standard **512 GB** memory machine can manage metadata for **10 billion files**.
2. **Federation**: improve cluster-scale metadata expansion with an HDFS Federation-like model that partitions by directory and can scale beyond **100 billion files**. Federation is especially strong for centralized metadata operations such as `mv` and `ls`, but it requires directory planning up front.
3. **Pluggable metadata management**: abstract the metadata interface and support pluggable metadata backends for better flexibility and adaptability.
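The trade-off in the Federation item (directory-partitioned namespaces that need planning up front) can be sketched with a mount table resolved by longest-prefix match, in the spirit of the HDFS Router-based Federation design in reference [2]. The mount points, namespace names, and functions below are hypothetical and do not describe Curvine's planned design.

```python
# Hypothetical mount table: directory prefix -> owning metadata namespace.
MOUNT_TABLE = {
    "/": "ns-default",
    "/warehouse": "ns-1",
    "/logs": "ns-2",
}


def route(path, mount_table=MOUNT_TABLE):
    """Return the namespace that owns `path`, using longest-prefix matching."""
    best = "/"
    for mount in mount_table:
        if mount == "/":
            continue
        # Match whole path components so "/warehouse2" does not hit "/warehouse".
        if path == mount or path.startswith(mount + "/"):
            if len(mount) > len(best):
                best = mount
    return mount_table[best]


def is_local_rename(src, dst):
    """A rename stays on one metadata node only if both paths share an owner."""
    return route(src) == route(dst)


print(route("/warehouse/t1/part-0"))
print(route("/tmp/scratch"))
print(is_local_rename("/warehouse/a", "/warehouse/b"))
print(is_local_rename("/warehouse/a", "/logs/a"))
```

Operations such as `mv` and `ls` stay on a single metadata node as long as they remain inside one partition, which is why the directory layout has to be planned before data is written.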

## 📚 References

1. https://mp.weixin.qq.com/s/zbBUQ4P53PPWQjOHQmw8uw
2. https://hadoop.apache.org/docs/r3.4.0/hadoop-project-dist/hadoop-hdfs-rbf/HDFS%20RouterFederation.html

### 👇 Follow Us

We regularly share hands-on work on distributed storage, metadata optimization, and high-concurrency benchmarking.

GitHub: https://github.com/CurvineIO/curvine