166 changes: 166 additions & 0 deletions CPU_Optimization.md
@@ -0,0 +1,166 @@
### Whisper CPU Performance Optimization Tips (Windows)

This document covers running this repository's Whisper on CPU. It lists practical optimizations ordered by cost-to-benefit, with copy-pasteable command lines and code examples.

---

### Quick Summary (try these first)

- **Simplify decoding**: `--temperature 0 --beam_size 1 --best_of 1`
- **Disable word-level timestamps**: `--word_timestamps False`
- **Fix the language**: add `--language zh` when the language is known (skips auto-detection)
- **Set the thread count**: `--threads <physical cores or slightly fewer>` (e.g. 8)
- **More aggressive silence skipping**: `--no_speech_threshold 0.8` (tune by results)
- **Process only the needed clips**: `--clip_timestamps start,end`

Example (PowerShell — note that PowerShell uses the backtick, not the backslash, for line continuation):
```powershell
python -m whisper .\audio.wav `
  --device cpu --model small `
  --threads 8 `
  --temperature 0 --beam_size 1 --best_of 1 `
  --word_timestamps False `
  --language zh `
  --no_speech_threshold 0.8
```

---

### 1) Runtime Parameters (no code changes, quick wins)

- **Decoding strategy**: prefer greedy decoding on CPU (reduces search overhead).
  - Settings: `--temperature 0 --beam_size 1 --best_of 1`

- **Disable word-level timestamps**: word-level alignment performs extra attention/DTW computation, which is costly on CPU.
  - Settings: `--word_timestamps False`
  - Relevant code:
```401:411:whisper/transcribe.py
if word_timestamps:
    add_word_timestamps(
        segments=current_segments,
        model=model,
        tokenizer=tokenizer,
        mel=mel_segment,
        num_frames=segment_size,
        prepend_punctuations=prepend_punctuations,
        append_punctuations=append_punctuations,
        last_speech_timestamp=last_speech_timestamp,
    )
```

- **Fix the language / skip detection**: automatic language detection only uses the first 30 s of audio, but still incurs preprocessing and inference overhead.
  - Settings: `--language zh` (or another target language)

- **Set a sensible thread count**: generally the number of physical cores or slightly fewer; avoid the scheduling overhead of hyper-threading.
  - Settings: `--threads <N>`
  - Relevant parameter definitions:
```564:566:whisper/transcribe.py
parser.add_argument("--threads", type=optional_int, default=0, help="number of threads used by torch for CPU inference; supercedes MKL_NUM_THREADS/OMP_NUM_THREADS")
parser.add_argument("--clip_timestamps", type=str, default="0", help="comma-separated list start,end,start,end,... timestamps (in seconds) of clips to process, where the last end timestamp defaults to the end of the file")
```

- **Silence-skip threshold**: raising `--no_speech_threshold` reduces time spent on "empty" windows (slight risk of dropping real speech; tune by results).

- **Process only the needed clips**: for long audio, use `--clip_timestamps` to crop on demand.

- **Optional: reduce console output**: heavy I/O adds a small overhead; `--verbose False` can shave a little time (limited benefit).
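To make the `--clip_timestamps` format concrete, here is a small sketch of how the comma-separated values pair up into clips (a hedged illustration only; `parse_clip_timestamps` is a hypothetical helper, and Whisper's internal parsing may differ in details):

```python
def parse_clip_timestamps(spec: str, audio_duration: float):
    """Parse "start,end,start,end,..." into (start, end) pairs.

    A missing final end defaults to the end of the audio, mirroring
    the CLI help text for --clip_timestamps.
    """
    values = [float(v) for v in spec.split(",") if v != ""]
    if len(values) % 2 == 1:
        values.append(audio_duration)  # last end defaults to file end
    return list(zip(values[::2], values[1::2]))

print(parse_clip_timestamps("10,60,90", 120.0))  # → [(10.0, 60.0), (90.0, 120.0)]
```

Only the listed clips are transcribed, so on a long recording this can cut processing time roughly in proportion to the audio you skip.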

---

### 2) Model Choice (speed/quality trade-off)

- Prefer a smaller model for faster CPU inference: `tiny` < `base` < `small` < `medium`.
- Start with `small`: quality is acceptable and it is significantly faster than `medium`; if speed matters above all, try `base`/`tiny`.
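When comparing model sizes, the real-time factor (RTF — processing time divided by audio duration) is a handy yardstick. A minimal sketch (the timing numbers in the example calls are made up for illustration):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

# Hypothetical timings for a 60 s clip on two model sizes:
print(real_time_factor(45.0, 60.0))   # small  → 0.75 (faster than real time)
print(real_time_factor(150.0, 60.0))  # medium → 2.5  (slower than real time)
```

Measuring RTF on a short representative clip before committing to a model size makes the trade-off concrete for your hardware.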

---

### 3) A Faster Inference Backend: Faster-Whisper (CTranslate2)

On CPU, CTranslate2's **INT8/INT8_float32** inference is usually significantly faster than native PyTorch.

- Install:
```bash
pip install faster-whisper
```

- Minimal example:
```python
from faster_whisper import WhisperModel

model = WhisperModel("small", device="cpu", compute_type="int8")  # or "int8_float32"
segments, info = model.transcribe("audio.wav", language="zh", beam_size=1)
text = "".join([s.text for s in segments])
```

- Tips:
  - Prefer `compute_type="int8"`; if accuracy is a concern, try `"int8_float32"`.
  - Combine with the runtime-parameter tips above (fix the language, reduce search, crop clips).

---

### 4) PyTorch and Environment-Level Tuning (optional, advanced)

- **Set thread environment variables (Windows)**:
  - PowerShell:
```powershell
$env:OMP_NUM_THREADS = "8"
$env:MKL_NUM_THREADS = "8"
```
  - Keep these consistent with `--threads`, or slightly lower, to avoid over-parallelization.

- **Set the thread count in code**:
```python
import torch
torch.set_num_threads(8)
```
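As a stdlib-only sketch for picking that number, assuming hyper-threading doubles the logical core count (common on desktop CPUs but not universal; `suggested_threads` is a hypothetical helper, not part of Whisper):

```python
import os

def suggested_threads() -> int:
    """Heuristic: assume half the logical CPUs are physical cores."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2)

# Pass the result to torch.set_num_threads() or the --threads flag.
```

On machines without hyper-threading, benchmark with the full logical count as well; the heuristic is a starting point, not a rule.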

- **Try PyTorch 2.x `torch.compile` (can pay off on CPU too)**: the first call incurs a compilation warm-up cost, so this suits long audio or batch workloads.
```python
import torch
model = torch.compile(model, backend="inductor", mode="reduce-overhead")
```

- **Dynamically quantize Linear layers** (a common CPU technique; can improve speed with usually minor accuracy impact):
```python
import torch
from torch.ao.quantization import quantize_dynamic

model = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```
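For intuition on the memory side of int8 quantization, a back-of-the-envelope sketch (pure arithmetic; `linear_weight_bytes` is a hypothetical helper, and real savings depend on which layers are actually quantized):

```python
def linear_weight_bytes(in_features: int, out_features: int, bytes_per_param: int) -> int:
    """Weight storage for one Linear layer (bias ignored for simplicity)."""
    return in_features * out_features * bytes_per_param

fp32 = linear_weight_bytes(1024, 1024, 4)  # fp32: 4 bytes per weight
int8 = linear_weight_bytes(1024, 1024, 1)  # int8: 1 byte per weight
print(fp32 // int8)  # → 4 (int8 weights are 4x smaller)
```

Smaller weights also mean less memory bandwidth per inference step, which is often where the CPU speedup comes from.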

- **Keep dependencies current**: a recent PyTorch (with MKL/OpenMP optimizations) and NumPy pick up low-level performance improvements.

---

### 5) Integration with This Repository's CLI

- The CLI already exposes the key switches (`--threads`, `--word_timestamps`, `--language`, `--clip_timestamps`, etc.), so these take effect without code changes.
- FP16 is automatically disabled on CPU; if needed in the future, a BF16 switch could be added (requires CPU support).

---

### 6) Common Trade-offs and Troubleshooting

- Smaller models are faster but less accurate; weigh `small` against `base` first.
- Raising `--no_speech_threshold` may misclassify very quiet speech as silence; tune per material.
- The first `torch.compile` call is slower (compilation); the benefit only shows up in batch workloads.
- Dynamic quantization may slightly reduce accuracy for some languages/scenarios; verify with an A/B test.
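For that A/B verification, character error rate (CER) is a reasonable metric for Chinese transcripts. A minimal Levenshtein-based sketch (pure stdlib; `cer` is a hypothetical helper, not from any particular evaluation toolkit):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # dp[j] holds the edit distance between reference[:i] and hypothesis[:j]
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + cost)
            prev = cur
    return dp[n] / max(m, 1)

print(cer("abcd", "abed"))  # one substitution over 4 chars → 0.25
```

Run the same audio through the baseline and the quantized model and compare CER against a trusted reference transcript; a difference within your tolerance means the speedup is free.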

---

### 7) Recommended CPU-Friendly Launch Template

```powershell
python -m whisper .\audio.wav `
  --device cpu --model small `
  --threads 8 `
  --temperature 0 --beam_size 1 --best_of 1 `
  --word_timestamps False `
  --language zh `
  --no_speech_threshold 0.8 `
  --output_dir . --output_format all
```

For further speedups, first try a smaller model (`base`/`tiny`) or Faster-Whisper's `int8` inference.


16 changes: 16 additions & 0 deletions dist/install_ffmpeg_with_choco.bat
@@ -0,0 +1,16 @@
@echo off
setlocal

rem Install Chocolatey (run the official install script via PowerShell)
powershell -NoProfile -ExecutionPolicy Bypass -Command "Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1'))"

rem Install ffmpeg via choco
rem If PATH has not been refreshed in this session, prefer calling choco by its full path; otherwise fall back to running the command in CMD
if exist "%ProgramData%\chocolatey\bin\choco.exe" (
"%ProgramData%\chocolatey\bin\choco.exe" install ffmpeg -y
) else (
cmd /c "choco install ffmpeg -y"
)

endlocal

96 changes: 96 additions & 0 deletions v2w_service.py
@@ -0,0 +1,96 @@
import os
import tempfile
import time
import datetime

import whisper
from flask import Flask, request, jsonify
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config['JSON_AS_ASCII'] = False  # keep non-ASCII (e.g. Chinese) readable in JSON responses

# Global variable holding the loaded model
model = None


def load_model():
    global model
    print("Loading Whisper model...")
    start_time = time.time()

    # Explicit model storage path; change as needed
    model_path = "./models"

    # Choose CPU or GPU according to your environment
    model = whisper.load_model("medium", download_root=model_path)

    load_time = time.time() - start_time
    print(f"Model loaded in {datetime.timedelta(seconds=load_time)}")
    print(f"Model storage path: {model_path}")


# Lazily load the model on the first request (if it was not preloaded)
@app.before_request
def before_first_request():
    global model
    if model is None:
        print("First request; loading model...")
        load_model()


def _save_upload(audio_file):
    """Save an uploaded file to the temp directory (not the filesystem root) and return its path."""
    audio_path = os.path.join(tempfile.gettempdir(), secure_filename(audio_file.filename))
    audio_file.save(audio_path)
    return audio_path


@app.route('/transcribe', methods=['POST'])
def transcribe():
    if model is None:
        return jsonify({"error": "Model not loaded yet"}), 503

    if 'audio' not in request.files:
        return jsonify({"error": "No audio file provided"}), 400

    audio_path = _save_upload(request.files['audio'])

    # Transcribe with FP32 (fp16 disabled) to avoid NaNs on CPU
    start_time = time.time()
    result = model.transcribe(audio_path, language="zh", fp16=False)
    transcription_time = time.time() - start_time

    return jsonify({
        "text": result["text"],
        "processing_time": transcription_time
    })


@app.route('/transcribe_text', methods=['POST'])
def transcribe_text():
    """Return the transcription as plain text, convenient for the command line."""
    if model is None:
        return "Model not loaded yet", 503

    if 'audio' not in request.files:
        return "No audio file provided", 400

    audio_path = _save_upload(request.files['audio'])

    # Transcribe with FP32 (fp16 disabled) to avoid NaNs on CPU
    start_time = time.time()
    result = model.transcribe(audio_path, language="zh", fp16=False)
    transcription_time = time.time() - start_time

    return f"{result['text']}\r\nProcessing time: {transcription_time:.2f}s"


@app.route('/health', methods=['GET'])
def health_check():
    return jsonify({
        "status": "ok",
        "model_loaded": model is not None
    })


if __name__ == '__main__':
    # Preload the model before starting the app
    print("Preloading model before starting the service...")
    load_model()
    app.run(host='0.0.0.0', port=5000, threaded=True)
40 changes: 40 additions & 0 deletions voice2word.py
@@ -0,0 +1,40 @@
import whisper
import time
import datetime


def format_time(seconds):
    """Format a number of seconds as a human-readable time string."""
    return str(datetime.timedelta(seconds=seconds))


def transcribe_with_timing():
    # Record the overall start time
    start_time = time.time()

    print("Loading Whisper model...")
    model_load_start = time.time()
    model = whisper.load_model("medium")
    model_load_time = time.time() - model_load_start
    print(f"Model loaded in {format_time(model_load_time)}")

    print("Starting transcription...")
    transcription_start = time.time()
    result = model.transcribe("dingzhen.wav", language="zh")
    transcription_time = time.time() - transcription_start
    print(f"Transcription finished in {format_time(transcription_time)}")

    # Print the result
    print("\nTranscription result:")
    print(result["text"])

    # Total runtime breakdown
    total_time = time.time() - start_time
    print(f"\nTotal runtime: {format_time(total_time)}")
    print("Breakdown:")
    print(f"- Model loading: {format_time(model_load_time)} ({model_load_time / total_time:.1%})")
    print(f"- Transcription: {format_time(transcription_time)} ({transcription_time / total_time:.1%})")


if __name__ == "__main__":
    transcribe_with_timing()
13 changes: 12 additions & 1 deletion whisper/audio.py
@@ -1,5 +1,7 @@
import os
import sys
from functools import lru_cache
from pathlib import Path
from subprocess import CalledProcessError, run
from typing import Optional, Union

@@ -102,7 +104,16 @@ def mel_filters(device, n_mels: int) -> torch.Tensor:
"""
assert n_mels in {80, 128}, f"Unsupported n_mels: {n_mels}"

filters_path = os.path.join(os.path.dirname(__file__), "assets", "mel_filters.npz")
# Use pathlib for the path, supporting both development and packaged environments
if getattr(sys, 'frozen', False):
# Packaged exe environment
exe_dir = Path(sys.executable).parent
filters_path = exe_dir / "whisper" / "assets" / "mel_filters.npz"
else:
# Development environment
filters_path = Path(__file__).parent / "assets" / "mel_filters.npz"

print(f"filters_path: {filters_path}")
with np.load(filters_path, allow_pickle=False) as f:
return torch.from_numpy(f[f"mel_{n_mels}"]).to(device)

13 changes: 12 additions & 1 deletion whisper/tokenizer.py
@@ -1,8 +1,10 @@
import base64
import os
import string
import sys
from dataclasses import dataclass, field
from functools import cached_property, lru_cache
from pathlib import Path
from typing import Dict, List, Optional, Tuple

import tiktoken
@@ -329,7 +331,16 @@ def split_tokens_on_spaces(self, tokens: List[int]):

@lru_cache(maxsize=None)
def get_encoding(name: str = "gpt2", num_languages: int = 99):
vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
# Use pathlib for the path, supporting both development and packaged environments
if getattr(sys, 'frozen', False):
# Packaged exe environment
exe_dir = Path(sys.executable).parent
vocab_path = exe_dir / "whisper" / "assets" / f"{name}.tiktoken"
else:
# Development environment
vocab_path = Path(__file__).parent / "assets" / f"{name}.tiktoken"

print(f"vocab_path: {vocab_path}")
ranks = {
base64.b64decode(token): int(rank)
for token, rank in (line.split() for line in open(vocab_path) if line)