166 changes: 166 additions & 0 deletions CPU_Optimization.md
@@ -0,0 +1,166 @@
### Whisper CPU Performance Optimization Tips (Windows)

This document covers running this repository's Whisper on CPU. It lists practical optimizations ordered by cost-to-benefit, with copy-pasteable command lines and code examples.

---

### Quick Summary (try these first)

- **Simplify decoding**: `--temperature 0 --beam_size 1 --best_of 1`
- **Disable word-level timestamps**: `--word_timestamps False`
- **Fix the language**: add `--language zh` when the language is known (skips auto-detection)
- **Set the thread count**: `--threads <physical cores or slightly fewer>` (e.g. 8)
- **More aggressive silence skipping**: `--no_speech_threshold 0.8` (tune by results)
- **Process only the needed clips**: `--clip_timestamps start,end`

Example (PowerShell — note that PowerShell uses the backtick, not the backslash, for line continuation):
```powershell
python -m whisper .\audio.wav `
  --device cpu --model small `
  --threads 8 `
  --temperature 0 --beam_size 1 --best_of 1 `
  --word_timestamps False `
  --language zh `
  --no_speech_threshold 0.8
```

---

### 1) Runtime Parameters (no code changes, quick wins)

- **Decoding strategy**: prefer greedy decoding on CPU (reduces search overhead).
  - Settings: `--temperature 0 --beam_size 1 --best_of 1`

- **Disable word-level timestamps**: word-level alignment performs extra attention/DTW computation, which is costly on CPU.
  - Settings: `--word_timestamps False`
  - Relevant code:
```401:411:whisper/transcribe.py
if word_timestamps:
    add_word_timestamps(
        segments=current_segments,
        model=model,
        tokenizer=tokenizer,
        mel=mel_segment,
        num_frames=segment_size,
        prepend_punctuations=prepend_punctuations,
        append_punctuations=append_punctuations,
        last_speech_timestamp=last_speech_timestamp,
    )
```

- **Fix the language / skip detection**: automatic language detection only uses the first 30 s of audio, but still incurs preprocessing and inference overhead.
  - Settings: `--language zh` (or another target language)

- **Set a sensible thread count**: generally the number of physical cores or slightly fewer; avoid the scheduling overhead of hyper-threading.
  - Settings: `--threads <N>`
  - Relevant parameter definitions:
```564:566:whisper/transcribe.py
parser.add_argument("--threads", type=optional_int, default=0, help="number of threads used by torch for CPU inference; supercedes MKL_NUM_THREADS/OMP_NUM_THREADS")
parser.add_argument("--clip_timestamps", type=str, default="0", help="comma-separated list start,end,start,end,... timestamps (in seconds) of clips to process, where the last end timestamp defaults to the end of the file")
```

- **Silence-skip threshold**: raising `--no_speech_threshold` reduces time spent on "empty" windows (slight risk of dropping real speech; tune by results).

- **Process only the needed clips**: for long audio, use `--clip_timestamps` to crop on demand.

- **Optional: reduce console output**: heavy I/O adds a small overhead; `--verbose False` can shave a little time (limited benefit).
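To make the `--clip_timestamps` format concrete, here is a small sketch of how the comma-separated values pair up into clips (a hedged illustration only; `parse_clip_timestamps` is a hypothetical helper, and Whisper's internal parsing may differ in details):

```python
def parse_clip_timestamps(spec: str, audio_duration: float):
    """Parse "start,end,start,end,..." into (start, end) pairs.

    A missing final end defaults to the end of the audio, mirroring
    the CLI help text for --clip_timestamps.
    """
    values = [float(v) for v in spec.split(",") if v != ""]
    if len(values) % 2 == 1:
        values.append(audio_duration)  # last end defaults to file end
    return list(zip(values[::2], values[1::2]))

print(parse_clip_timestamps("10,60,90", 120.0))  # → [(10.0, 60.0), (90.0, 120.0)]
```

Only the listed clips are transcribed, so on a long recording this can cut processing time roughly in proportion to the audio you skip.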

---

### 2) Model Choice (speed/quality trade-off)

- Prefer a smaller model for faster CPU inference: `tiny` < `base` < `small` < `medium`.
- Start with `small`: quality is acceptable and it is significantly faster than `medium`; if speed matters above all, try `base`/`tiny`.
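When comparing model sizes, the real-time factor (RTF — processing time divided by audio duration) is a handy yardstick. A minimal sketch (the timing numbers in the example calls are made up for illustration):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# Hypothetical timings for a 60 s clip on two model sizes:
print(real_time_factor(45.0, 60.0))   # small  → 0.75 (faster than real time)
print(real_time_factor(150.0, 60.0))  # medium → 2.5  (slower than real time)
```

Measuring RTF on a short representative clip before committing to a model size makes the trade-off concrete for your hardware.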

---

### 3) A Faster Inference Backend: Faster-Whisper (CTranslate2)

On CPU, CTranslate2's **INT8/INT8_float32** inference is usually significantly faster than native PyTorch.

- Install:
```bash
pip install faster-whisper
```

- Minimal example:
```python
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")  # or "int8_float32"
segments, info = model.transcribe("audio.wav", language="zh", beam_size=1)
text = "".join([s.text for s in segments])
```

- Tips:
  - Prefer `compute_type="int8"`; if accuracy is a concern, try `"int8_float32"`.
  - Combine with the runtime-parameter tips above (fix the language, reduce search, crop clips).

---

### 4) PyTorch and Environment-Level Tuning (optional, advanced)

- **Set thread environment variables (Windows)**:
  - PowerShell:
```powershell
$env:OMP_NUM_THREADS = "8"
$env:MKL_NUM_THREADS = "8"
```
  - Keep these consistent with `--threads`, or slightly lower, to avoid over-parallelization.

- **Set the thread count in code**:
```python
import torch
torch.set_num_threads(8)
```
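As a stdlib-only sketch for picking that number, assuming hyper-threading doubles the logical core count (common on desktop CPUs but not universal; `suggested_threads` is a hypothetical helper, not part of Whisper):

```python
import os

def suggested_threads() -> int:
    """Heuristic: assume half the logical CPUs are physical cores."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2)

# Pass the result to torch.set_num_threads() or the --threads flag.
```

On machines without hyper-threading, benchmark with the full logical count as well; the heuristic is a starting point, not a rule.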

- **Try PyTorch 2.x `torch.compile` (can pay off on CPU too)**: the first call incurs a compilation warm-up cost, so this suits long audio or batch workloads.
```python
import torch
model = torch.compile(model, backend="inductor", mode="reduce-overhead")
```

- **Dynamically quantize Linear layers** (a common CPU technique; can improve speed with usually minor accuracy impact):
```python
import torch
from torch.ao.quantization import quantize_dynamic

model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```
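For intuition on the memory side of int8 quantization, a back-of-the-envelope sketch (pure arithmetic; `linear_weight_bytes` is a hypothetical helper, and real savings depend on which layers are actually quantized):

```python
def linear_weight_bytes(in_features: int, out_features: int, bytes_per_param: int) -> int:
    """Weight storage for one Linear layer (bias ignored for simplicity)."""
    return in_features * out_features * bytes_per_param

fp32 = linear_weight_bytes(1024, 1024, 4)  # fp32: 4 bytes per weight
int8 = linear_weight_bytes(1024, 1024, 1)  # int8: 1 byte per weight
print(fp32 // int8)  # → 4 (int8 weights are 4x smaller)
```

Smaller weights also mean less memory bandwidth per inference step, which is often where the CPU speedup comes from.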

- **Keep dependencies current**: a recent PyTorch (with MKL/OpenMP optimizations) and NumPy pick up low-level performance improvements.

---

### 5) Integration with This Repository's CLI

- The CLI already exposes the key switches (`--threads`, `--word_timestamps`, `--language`, `--clip_timestamps`, etc.), so these take effect without code changes.
- FP16 is automatically disabled on CPU; if needed in the future, a BF16 switch could be added (requires CPU support).

---

### 6) Common Trade-offs and Troubleshooting

- Smaller models are faster but less accurate; weigh `small` against `base` first.
- Raising `--no_speech_threshold` may misclassify very quiet speech as silence; tune per material.
- The first `torch.compile` call is slower (compilation); the benefit only shows up in batch workloads.
- Dynamic quantization may slightly reduce accuracy for some languages/scenarios; verify with an A/B test.
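For that A/B verification, character error rate (CER) is a reasonable metric for Chinese transcripts. A minimal Levenshtein-based sketch (pure stdlib; `cer` is a hypothetical helper, not from any particular evaluation toolkit):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # dp[j] holds the edit distance between reference[:i] and hypothesis[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + cost)
            prev = cur
    return dp[n] / max(m, 1)

print(cer("abcd", "abed"))  # one substitution over 4 chars → 0.25
```

Run the same audio through the baseline and the quantized model and compare CER against a trusted reference transcript; a difference within your tolerance means the speedup is free.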

---

### 7) Recommended CPU-Friendly Launch Template

```powershell
python -m whisper .\audio.wav `
  --device cpu --model small `
  --threads 8 `
  --temperature 0 --beam_size 1 --best_of 1 `
  --word_timestamps False `
  --language zh `
  --no_speech_threshold 0.8 `
  --output_dir . --output_format all
```

For further speedups, first try a smaller model (`base`/`tiny`) or Faster-Whisper's `int8` inference.


16 changes: 16 additions & 0 deletions dist/install_ffmpeg_with_choco.bat
@@ -0,0 +1,16 @@
@echo off
setlocal

rem Install Chocolatey (run the official install script via PowerShell)
powershell -NoProfile -ExecutionPolicy Bypass -Command "Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))"

rem Install ffmpeg via choco
rem If PATH has not been refreshed in this session, prefer calling choco by its full path; otherwise fall back to running the command in CMD
if exist "%ProgramData%\chocolatey\bin\choco.exe" (
"%ProgramData%\chocolatey\bin\choco.exe" install ffmpeg -y
) else (
cmd /c "choco install ffmpeg -y"
)

endlocal

96 changes: 96 additions & 0 deletions v2w_service.py
@@ -0,0 +1,96 @@
import os
import tempfile
import time
import datetime

import whisper
from flask import Flask, request, jsonify
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config['JSON_AS_ASCII'] = False  # keep non-ASCII (e.g. Chinese) readable in JSON responses

# Global variable holding the loaded model
model = None


def load_model():
    global model
    print("Loading Whisper model...")
    start_time = time.time()

    # Explicit model storage path; change as needed
    model_path = "./models"

    # Choose CPU or GPU according to your environment
    model = whisper.load_model("medium", download_root=model_path)

    load_time = time.time() - start_time
    print(f"Model loaded in {datetime.timedelta(seconds=load_time)}")
    print(f"Model storage path: {model_path}")


# Lazily load the model on the first request (if it was not preloaded)
@app.before_request
def before_first_request():
    global model
    if model is None:
        print("First request; loading model...")
        load_model()


def _save_upload(audio_file):
    """Save an uploaded file to the temp directory (not the filesystem root) and return its path."""
    audio_path = os.path.join(tempfile.gettempdir(), secure_filename(audio_file.filename))
    audio_file.save(audio_path)
    return audio_path


@app.route('/transcribe', methods=['POST'])
def transcribe():
    if model is None:
        return jsonify({"error": "Model not loaded yet"}), 503

    if 'audio' not in request.files:
        return jsonify({"error": "No audio file provided"}), 400

    audio_path = _save_upload(request.files['audio'])

    # Transcribe with FP32 (fp16 disabled) to avoid NaNs on CPU
    start_time = time.time()
    result = model.transcribe(audio_path, language="zh", fp16=False)
    transcription_time = time.time() - start_time

    return jsonify({
        "text": result["text"],
        "processing_time": transcription_time
    })


@app.route('/transcribe_text', methods=['POST'])
def transcribe_text():
    """Return the transcription as plain text, convenient for the command line."""
    if model is None:
        return "Model not loaded yet", 503

    if 'audio' not in request.files:
        return "No audio file provided", 400

    audio_path = _save_upload(request.files['audio'])

    # Transcribe with FP32 (fp16 disabled) to avoid NaNs on CPU
    start_time = time.time()
    result = model.transcribe(audio_path, language="zh", fp16=False)
    transcription_time = time.time() - start_time

    return f"{result['text']}\r\nProcessing time: {transcription_time:.2f}s"


@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({
        "status": "ok",
        "model_loaded": model is not None
    })


if __name__ == '__main__':
    # Preload the model before starting the app
    print("Preloading model before starting the service...")
    load_model()
    app.run(host='0.0.0.0', port=5000, threaded=True)
40 changes: 40 additions & 0 deletions voice2word.py
@@ -0,0 +1,40 @@
import whisper
import time
import datetime


def format_time(seconds):
    """Format a number of seconds as a human-readable time string."""
    return str(datetime.timedelta(seconds=seconds))


def transcribe_with_timing():
    # Record the overall start time
    start_time = time.time()

    print("Loading Whisper model...")
    model_load_start = time.time()
    model = whisper.load_model("medium")
    model_load_time = time.time() - model_load_start
    print(f"Model loaded in {format_time(model_load_time)}")

    print("Starting transcription...")
    transcription_start = time.time()
    result = model.transcribe("dingzhen.wav", language="zh")
    transcription_time = time.time() - transcription_start
    print(f"Transcription finished in {format_time(transcription_time)}")

    # Print the result
    print("\nTranscription result:")
    print(result["text"])

    # Total runtime breakdown
    total_time = time.time() - start_time
    print(f"\nTotal runtime: {format_time(total_time)}")
    print("Breakdown:")
    print(f"- Model loading: {format_time(model_load_time)} ({model_load_time / total_time:.1%})")
    print(f"- Transcription: {format_time(transcription_time)} ({transcription_time / total_time:.1%})")


if __name__ == "__main__":
    transcribe_with_timing()
13 changes: 12 additions & 1 deletion whisper/audio.py
@@ -1,5 +1,7 @@
import os
import sys
from functools import lru_cache
from pathlib import Path
from subprocess import CalledProcessError, run
from typing import Optional, Union

@@ -102,7 +104,16 @@ def mel_filters(device, n_mels: int) -> torch.Tensor:
"""
assert n_mels in {80, 128}, f"Unsupported n_mels: {n_mels}"

filters_path = os.path.join(os.path.dirname(__file__), "assets", "mel_filters.npz")
# Use pathlib for the path, supporting both development and packaged environments
if getattr(sys, 'frozen', False):
# Packaged exe environment
exe_dir = Path(sys.executable).parent
filters_path = exe_dir / "whisper" / "assets" / "mel_filters.npz"
else:
# Development environment
filters_path = Path(__file__).parent / "assets" / "mel_filters.npz"

print(f"filters_path: {filters_path}")
with np.load(filters_path, allow_pickle=False) as f:
return torch.from_numpy(f[f"mel_{n_mels}"]).to(device)

13 changes: 12 additions & 1 deletion whisper/tokenizer.py
@@ -1,8 +1,10 @@
import base64
import os
import string
import sys
from dataclasses import dataclass, field
from functools import cached_property, lru_cache
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import tiktoken
@@ -329,7 +331,16 @@ def split_tokens_on_spaces(self, tokens: List[int]):

@lru_cache(maxsize=None)
def get_encoding(name: str = "gpt2", num_languages: int = 99):
vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
# Use pathlib for the path, supporting both development and packaged environments
if getattr(sys, 'frozen', False):
# Packaged exe environment
exe_dir = Path(sys.executable).parent
vocab_path = exe_dir / "whisper" / "assets" / f"{name}.tiktoken"
else:
# Development environment
vocab_path = Path(__file__).parent / "assets" / f"{name}.tiktoken"

print(f"vocab_path: {vocab_path}")
ranks = {
base64.b64decode(token): int(rank)
for token, rank in (line.split() for line in open(vocab_path) if line)