263 changes: 203 additions & 60 deletions docs/3-User-Manuals/1-Key-Features/01-ufs.md

Large diffs are not rendered by default.

223 changes: 173 additions & 50 deletions docs/3-User-Manuals/1-Key-Features/02-cache.md

Large diffs are not rendered by default.

95 changes: 95 additions & 0 deletions docs/3-User-Manuals/1-Key-Features/03-fuse.md
@@ -0,0 +1,95 @@
# FUSE

## 1. Basic Understanding of FUSE

### 1. What Is FUSE?

FUSE stands for Filesystem in Userspace. It is a general mechanism provided by the Linux kernel that lets a user-space program implement a file system. Its biggest advantage is that distributed or custom storage can be mounted and used like a local disk.

### 2. Core Role of FUSE in Curvine

FUSE is Curvine's most important and most general access interface, and also the simplest way to use Curvine:

- Zero code changes: applications need no modification or extra adaptation. Users work with Curvine storage as they would with a local disk, using standard tools such as `ls`, `cat`, `vim`, and `git`.
- High-performance foundation: Curvine is written in Rust and reimplements the FUSE user-space layer in a fully asynchronous style. It depends only on the kernel's native FUSE module, with no additional dependencies, which underpins its support for high concurrency and high performance.

## 2. Data Consistency

### 1. Conventional Distributed Storage: Close-to-Open Semantics

Most distributed storage systems follow Close-to-Open semantics by default: data written to a file becomes visible to other clients only after the writer closes the file. Data written while the file is still open is not exposed.

### 2. Curvine Consistency: Visible After Flush

Curvine does not follow Close-to-Open. Instead, it is fully consistent with local disk file semantics:

- Core rule: once data is flushed to storage, it is immediately visible to all clients, without waiting for the file to be closed.
- Core value: this is what allows Curvine to support complex workloads such as `git clone` and database storage, which require file semantics fully consistent with a local disk.
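
The visibility rule can be illustrated with a minimal sketch, run here against a local temporary directory. `CURVINE_DEMO_DIR` is a hypothetical variable; pointing it at a Curvine FUSE mount is an assumption, and on such a mount the moment at which a write counts as flushed depends on the client.

```shell
# Sketch: an independent reader sees data before the writer closes the file.
# CURVINE_DEMO_DIR is a hypothetical override; by default a local temp dir is used.
demo_dir="${CURVINE_DEMO_DIR:-$(mktemp -d)}"
f="$demo_dir/visible-after-flush.txt"

exec 3>"$f"                        # open for writing and keep descriptor 3 open
printf 'hello\n' >&3               # the write is issued while the file stays open
seen_before_close="$(cat "$f")"    # separate reader; the writer has not closed yet
exec 3>&-                          # close the writer

echo "read before close: $seen_before_close"
```

Under Close-to-Open semantics, the read in the middle would not be guaranteed to return the new data until the writer closed the file.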

### 3. Notes on Concurrent Reads and Writes

Curvine supports multiple clients reading and writing the same file concurrently, with the following caveats:

- To fully support POSIX semantics, Curvine does not enforce write protection.
- Risk: without write protection, multiple clients writing to the same file at the same time can overwrite each other's data or cause dirty reads.
- Summary of the trade-off: Curvine matches local-disk semantics, giving up part of write protection in exchange for concurrency and random-write capability. Other distributed storage systems combine write protection with Close-to-Open, giving up visibility in exchange for strong consistency.
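
The overwrite risk can be sketched as follows. The two "clients" are simulated by two commands on a local temporary file, and the variable names are hypothetical. Nothing stops the second writer from replacing bytes the first writer produced.

```shell
# Sketch: an uncoordinated second writer overwrites part of the first writer's data.
# CURVINE_DEMO_DIR is a hypothetical override; by default a local temp dir is used.
demo_dir="${CURVINE_DEMO_DIR:-$(mktemp -d)}"
shared="$demo_dir/shared.bin"

printf 'AAAAAAAA' > "$shared"      # "client A" writes 8 bytes
# "client B" performs a random write of 2 bytes at offset 2 without truncating
printf 'BB' | dd of="$shared" bs=1 seek=2 conv=notrunc 2>/dev/null

final_content="$(cat "$shared")"
echo "$final_content"              # prints AABBAAAA
```

Applications that share a file across clients therefore need their own coordination, for example by assigning disjoint byte ranges to each writer.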

## 3. Metadata Cache

Metadata is the descriptive information about a file, such as its name, size, permissions, modification time, and whether it is a file or a directory. Caching metadata can greatly improve file lookup performance. Curvine provides two metadata caches: the kernel metadata cache and the client metadata cache.

### 1. Kernel Metadata Cache

The Curvine FUSE client can control kernel-level metadata cache through configuration. The default cache time is 1 second, which significantly improves query efficiency.

Core configuration parameters:

| Configuration Item | Default Value | Function |
| --- | --- | --- |
| `attr_timeout` | 1 second | File attribute cache time, accelerating `getattr` operations. |
| `entry_timeout` | 1 second | Cache time for directory entries (name lookup results), accelerating `lookup` operations. |
| `negative_timeout` | 1 second | Cache time for failed queries of nonexistent files, avoiding repeated invalid requests. |

Why caching helps:

When accessing a path such as `/a/b/c.log`, the kernel looks up `/`, `/a`, `/a/b`, and `/a/b/c.log` in sequence. Without a cache, this takes more than 4 remote RPC requests. With a cache, the results are served directly from the kernel. The gain is especially large for workloads with massive numbers of small files. When files do not change frequently, the cache time can be increased accordingly.
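
The lookup sequence described above can be enumerated with a small helper. `path_prefixes` is a hypothetical name used only for illustration; it lists the path prefixes that are resolved in order and does not call into the kernel or Curvine.

```shell
# Sketch: list the prefixes resolved when a path is accessed.
# The function body runs in a subshell so its variables do not leak.
path_prefixes() (
  rest="${1#/}"      # drop the leading slash
  prefix=""
  echo "/"
  while [ -n "$rest" ]; do
    part="${rest%%/*}"               # next path component
    case "$rest" in
      */*) rest="${rest#*/}" ;;      # more components remain
      *)   rest="" ;;                # this was the last one
    esac
    [ -n "$part" ] || continue       # skip empty components such as "//"
    prefix="$prefix/$part"
    echo "$prefix"
  done
)

path_prefixes /a/b/c.log
```

For `/a/b/c.log` this prints four lines, one per step that can be answered from the kernel cache instead of a remote request.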

Core characteristics:

- Reduces context switching between kernel mode and user mode. On a cache hit, the request never reaches the Curvine user-space process.
- Accelerates only a few interfaces, such as `lookup` and `getattr`, and the cached content is limited.
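
As a rough mental model of the timeouts (not the kernel's actual implementation), a cached entry can be served without contacting the Curvine process while its age is below the configured timeout. The function name and the timestamps below are hypothetical.

```shell
# Sketch: entry_is_fresh CACHED_AT NOW TIMEOUT (all in whole seconds).
# Succeeds while the cached entry is younger than the timeout.
entry_is_fresh() {
  [ $(( $2 - $1 )) -lt "$3" ]
}

# With a 1-second timeout, an entry cached at t=100 is reused at t=100
# and must be fetched again at t=101.
entry_is_fresh 100 100 1 && echo "t=100: served from kernel cache"
entry_is_fresh 100 101 1 || echo "t=101: expired, ask the Curvine client"
```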

### 2. Client Metadata Cache

Curvine also implements its own client-level cache, which holds more complete content than the kernel cache:

- Caches complete file attributes, file block information, and the full file listing of each directory.
- Advantage: repeatedly opening the same file does not require contacting the metadata service, which greatly reduces remote RPC calls.
- Limitation: consistency is guaranteed only within a single client. A client sees files it creates immediately, but it sees files created by other clients only after its own cache expires.

### 3. Comparison and Usage Recommendations

- Kernel cache: reduces switching between kernel mode and user mode, but accelerates only some interfaces. Performance is best on a full cache hit.
- Client cache: caches complete content and reduces RPCs, but cannot reduce mode switching.
- Usage: either cache can be used on its own, or both can be used together. Tune them to the business scenario.

## 4. Data Cache

### 1. Kernel Page Cache

- Principle: the kernel keeps file data that has already been read in the in-memory page cache. Repeated reads are served directly from memory.
- Curvine optimization: Curvine tracks opened files. If a file is modified, its page cache is invalidated automatically so that the latest data is read.
- Performance: repeatedly reading the same file can reach microsecond-level latency and throughput of tens of GiB per second.
- Default: enabled. When memory is insufficient, set `direct_io = true` to disable the page cache.

### 2. Kernel Writeback Cache

- Kernel requirement: Linux kernel 3.15 or later. This is a FUSE-specific feature.
- Principle: the kernel merges many small, high-frequency random write requests and writes them in batches, reducing the number of I/O operations and improving random write performance.
- Side effect: it can turn sequential writes into random writes, which seriously reduces sequential write performance.
- Default: disabled; it must be enabled manually. It suits only workloads dominated by random writes and is not recommended for sequential write workloads.

## 5. FUSE Version Selection

- Recommended configuration: Linux kernel 5.0 or later, as shipped with distributions such as Ubuntu 22.04 and Rocky Linux 9. This delivers the best FUSE performance, with full support for the asynchronous and cache features.
- Low-kernel risk: on kernels older than 4.15, FUSE concurrency is poor and performance bottlenecks are obvious. Such kernels are not recommended for production environments.
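
A host can be checked against these thresholds with a short sketch such as the one below. `kernel_tier` and its labels are hypothetical. The text gives no guidance for kernels from 4.15 up to 5.0, so that range is reported separately.

```shell
# Sketch: classify a kernel release string against the thresholds above.
# Defaults to the running kernel when no argument is given.
kernel_tier() (
  ver="${1:-$(uname -r)}"
  major="${ver%%.*}"              # text before the first dot
  rest="${ver#*.}"
  minor="${rest%%[!0-9]*}"        # leading digits after the first dot
  minor="${minor:-0}"
  if [ "$major" -ge 5 ]; then
    echo "recommended"
  elif [ "$major" -lt 4 ] || [ "$minor" -lt 15 ]; then
    echo "not-recommended"        # older than 4.15
  else
    echo "between-4.15-and-5.0"   # not covered by the recommendations
  fi
)

kernel_tier 5.15.0-91-generic     # prints recommended
kernel_tier 4.4.0                 # prints not-recommended
```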
102 changes: 65 additions & 37 deletions docs/4-Benchmark/01-meta.md
@@ -1,58 +1,86 @@
# Metadata Benchmark
# Metadata Performance Test

This page documents the metadata benchmark workflow that is currently checked into Curvine. The source of truth is `build/tests/meta-bench.sh`.
## 1. Test Environment Configuration

## What the script runs
This test runs on a Curvine storage environment deployed on Alibaba Cloud ECS servers. All configurations were verified to keep the test environment stable and consistent and the results reproducible. The detailed configuration is as follows:

The wrapper loads `../conf/curvine-env.sh`, sets `CLASSPATH` to `lib/curvine-hadoop-*shade.jar`, and invokes:
### 1.1 Server Configuration

```bash
java -Xms256m -Xmx256m io.curvine.bench.NNBenchWithoutMR -operation $ACTION -bytesToWrite 0 -confDir ${CURVINE_HOME}/conf -threads 10 -baseDir cv://default/fs-meta -numFiles 1000
```
- Test instance type: Alibaba Cloud ECS i5.8xlarge with a 32-core CPU and 256 GB of memory, which meets the compute requirements of high-concurrency tests.
- Network bandwidth: 80 Gbps, so that the network is not the bottleneck and the results reflect the performance of the storage itself.

Because `-bytesToWrite` is fixed to `0`, this script is aimed at metadata operations rather than data-path throughput.
### 1.2 Deployment Architecture

## Supported actions
- Service node: 1 ECS server dedicated to the core services `curvine-master` and `curvine-worker`, so that the services run independently of the client.
- Client node: 1 ECS server running the FUSE client to simulate storage access requests from real business scenarios.

The checked-in script lists these actions:
## 2. NNBench Test Configuration and Operations

- `createWrite`
- `openRead`
- `rename`
- `delete`
- `rmdir`
### 2.1 Test Tool and Parameters

Each run executes one action at a time.
The HDFS `NNBenchWithoutMR` tool is used for metadata performance testing. The test parameters are fixed as follows to ensure consistent test pressure:

## Default checked-in workload
- Number of test threads: 40
- Number of files processed by a single thread: 10000
- Test path: `cv://default/fs-meta`
- Bytes written: 0, so only metadata operations are tested and no actual data is written.

The active wrapper parameters are:
### 2.2 Test Script Modification

- Java heap: `256m`
- Threads: `10`
- Base path: `cv://default/fs-meta`
- File count: `1000`
Update the `tests/meta-bench.sh` script as follows. The content below can be copied as-is to replace the original script:

Older benchmark notes that mention larger heaps, higher thread counts, or published QPS tables are not reflected in the current checked-in script.
```bash
# Load the Curvine environment configuration
. "$(cd "$(dirname "$0")" && pwd)/../conf/curvine-env.sh"

# Build the classpath from the Curvine Hadoop shaded jar(s)
export CLASSPATH=$(echo "$CURVINE_HOME"/lib/curvine-hadoop-*shade.jar | tr ' ' ':')

# Test operation type (required). Valid values:
#   createWrite: create write test
#   openRead:    open read test
#   rename:      rename test
#   delete:      delete test
#   rmdir:       remove directory test
ACTION=$1
if [ -z "$ACTION" ]; then
  echo "Usage: $0 <createWrite|openRead|rename|delete|rmdir>" >&2
  exit 1
fi

# Run the NNBenchWithoutMR test
java -Xms256m -Xmx256m \
  io.curvine.bench.NNBenchWithoutMR \
  -operation "$ACTION" \
  -bytesToWrite 0 \
  -confDir "${CURVINE_HOME}/conf" \
  -threads 40 \
  -baseDir cv://default/fs-meta \
  -numFiles 10000
```

## How to run
### 2.3 Configuration Parameter Modification

From the source tree:
In `curvine-site.xml`, set the master connection pool size to 3 for the best performance:

```bash
bash build/tests/meta-bench.sh createWrite
bash build/tests/meta-bench.sh openRead
bash build/tests/meta-bench.sh rename
bash build/tests/meta-bench.sh delete
bash build/tests/meta-bench.sh rmdir
```
```xml
<property>
  <name>fs.cv.master_conn_pool_size</name>
  <value>3</value>
</property>
```

Before running the benchmark, make sure:
### 2.4 Supplementary Notes

- Running the script: in the `tests` directory, run `sh meta-bench.sh [test operation type]`. For example, `sh meta-bench.sh createWrite` runs the create write test.
- JVM parameters: `-Xms256m -Xmx256m` fixes the JVM heap size so that memory fluctuation does not affect the test results.
- Dependencies: make sure the `CURVINE_HOME` environment variable is set correctly and that the matching `curvine-hadoop-*shade.jar` exists in the `lib` directory.
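
Running the tested operations back to back can be sketched with a small wrapper. `BENCH_CMD` and `run_all_actions` are hypothetical names. The order shown assumes that `openRead`, `rename`, and `delete` operate on the files that `createWrite` produces, which the text above does not state explicitly.

```shell
# Command used to launch one benchmark action; override it for a dry run.
BENCH_CMD="${BENCH_CMD:-sh meta-bench.sh}"

# Run each tested action once, stopping at the first failure.
run_all_actions() {
  for action in createWrite openRead rename delete; do
    echo ">>> $action"
    $BENCH_CMD "$action" || return 1
  done
}

# Usage, from the tests directory:
#   run_all_actions
```

Setting `BENCH_CMD=echo` before calling `run_all_actions` prints the actions without starting a benchmark.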

## 3. NNBench Test Results

- Curvine is built and `${CURVINE_HOME}/lib/curvine-hadoop-*shade.jar` is present.
- `${CURVINE_HOME}/conf` contains the cluster configuration.
- The target cluster is reachable at the `cv://default` endpoint used by the config.
With the environment and parameters above, this test ran four metadata operations: `createWrite` (create write), `openRead` (open read), `rename` (rename), and `delete` (delete). Each operation was run 3 times, and the average of the 3 runs is reported as the final result.

## Results
### 3.1 Test Result Summary Table

The reference tree does not check in official metadata benchmark result tables for this workload. Record QPS from your own runs and annotate them with the cluster shape, storage tier, and config used for that run.
| Test Operation Type | Average Operations per Second (QPS) |
| --- | ---: |
| `createWrite` (create write) | 21192 |
| `openRead` (open read) | 60181 |
| `rename` (rename) | 27776 |
| `delete` (delete) | 30511 |
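
The three-run averaging behind each row can be sketched as follows. `average_qps` is a hypothetical helper, and the numbers passed to it are placeholders rather than measurements from this test.

```shell
# Sketch: average the QPS values of several runs, rounded to an integer.
average_qps() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR > 0) printf "%.0f\n", sum / NR }'
}

average_qps 1000 1100 1200   # placeholder inputs; prints 1100
```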