
Add LongBench v2 support for DeepSeek #2410

Merged
yiliu30 merged 2 commits into longbench from longbench-deepseek
Feb 14, 2026

@yiliu30 yiliu30 commented Feb 14, 2026

This PR adds LongBench v2 support for DeepSeek, similar to the Qwen implementation in PR #2406.

Changes

  • DeepSeek LongBench v2 support

Type of Change

Feature enhancement - adds long-context evaluation support for DeepSeek models

Description

Adapts the LongBench v2 evaluation framework for DeepSeek models:

  • Dynamic configuration: Automatically adjusts max_length to 40960 for longbench tasks (vs 8192 for standard tasks)
  • Server-based evaluation: Uses vLLM server with API-based evaluation for better stability with long contexts
  • Modular functions: Refactored code into reusable functions for server management and evaluation
  • Parallel execution: Supports 512 threads for efficient longbench evaluation
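The dynamic configuration above can be sketched as a pair of small shell helpers. This is a minimal sketch, not the code in run_evaluation.sh: the function names `is_longbench_task` and `select_max_length` are hypothetical, and only the substring check on 'longbench' and the two values (40960 and 8192) come from this description.

```shell
#!/usr/bin/env sh
# Hypothetical helpers; the names are illustrative and are not taken
# from run_evaluation.sh.

# Succeed (exit status 0) when the task name contains "longbench".
is_longbench_task() {
    case "$1" in
        *longbench*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the max_length for a task: 40960 for longbench tasks,
# 8192 for everything else.
select_max_length() {
    if is_longbench_task "$1"; then
        echo 40960
    else
        echo 8192
    fi
}
```

A caller could then write `MAX_LENGTH=$(select_max_length "$TASK_NAME")`, where `TASK_NAME` is an assumed variable name.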

Implementation Details

  1. Added long-bench-eval dependency to requirements.txt
  2. Refactored run_evaluation.sh with:
    • Task-based routing (detects 'longbench' in task name)
    • vLLM server lifecycle management (start, health check, cleanup)
    • Proper error handling and logging
  3. Maintains backward compatibility with standard lm-eval tasks
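The server lifecycle management in item 2 could look roughly like the sketch below. It is an illustration under assumptions, not the PR's code: `wait_for_ready`, `cleanup_server`, and `SERVER_PID` are hypothetical names, and the retry counts, port, and `MODEL_PATH` in the commented usage lines are placeholders. The usage comments rely on `vllm serve`, `--max-model-len`, and the `/health` endpoint as vLLM provides them, but the exact invocation used in run_evaluation.sh is not shown in this PR description.

```shell
#!/usr/bin/env sh
# Hypothetical sketch; function names, port, and retry settings are assumptions.

SERVER_PID=""

# Poll a health probe until it succeeds or the attempts run out.
# $1: maximum attempts, $2: seconds between attempts, rest: probe command.
wait_for_ready() {
    local attempts="$1"
    local delay="$2"
    local i=0
    shift 2
    while [ "$i" -lt "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}

# Stop the background server if it is still running, then clear the PID.
cleanup_server() {
    if [ -n "$SERVER_PID" ] && kill -0 "$SERVER_PID" 2>/dev/null; then
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
    fi
    SERVER_PID=""
}

# Assumed usage (placeholders, not the PR's actual command line):
#   vllm serve "$MODEL_PATH" --max-model-len 40960 --port 8000 &
#   SERVER_PID=$!
#   trap cleanup_server EXIT
#   wait_for_ready 120 5 curl -sf http://localhost:8000/health
```

Passing the probe as a command keeps `wait_for_ready` independent of how the server is checked, so the same helper works with `curl`, a TCP check, or a stub in tests.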

Expected Behavior

When the task name contains 'longbench', the script will:

  1. Start a vLLM server with 40K context length support
  2. Wait for server to be ready (with health checks)
  3. Run LongBench evaluation via API
  4. Clean up server on completion or error

For standard tasks, the script uses the original direct lm-eval execution.
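The four longbench steps, including cleanup on both success and failure, can be sequenced as in the sketch below. All names here are hypothetical: `run_longbench_flow` and the four step functions (`start_server`, `wait_until_healthy`, `run_api_eval`, `stop_server`) are placeholders that the caller is assumed to define, and none of them are taken from run_evaluation.sh.

```shell
#!/usr/bin/env sh
# Hypothetical orchestration sketch. The four step functions are placeholders
# that must be defined by the caller before run_longbench_flow is invoked.

run_longbench_flow() {
    local status=0
    # Steps 1-3 run in order; the chain stops at the first failure
    # and records that step's exit status.
    start_server && wait_until_healthy && run_api_eval || status=$?
    # Step 4 runs whether the earlier steps succeeded or failed.
    stop_server
    return "$status"
}
```

This version does not cover interruption by signals such as Ctrl-C; a `trap` on `EXIT` or `INT` would be one way to handle that case as well.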

How has this PR been tested?

Mirrors the Qwen implementation from PR #2406

Dependency Change

Added: long-bench-eval @ git+https://github.com/yiliu30/long-bench-eval

- Add long-bench-eval dependency to requirements.txt
- Refactor run_evaluation.sh to support both standard and LongBench v2 tasks
- Add dynamic max_length configuration (40960 for longbench tasks)
- Implement vLLM server-based evaluation for LongBench
- Add helper functions for server lifecycle management
- Support up to 40K context length evaluation with 512 threads

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@yiliu30 yiliu30 merged commit cb1b4a3 into longbench Feb 14, 2026
7 checks passed
@yiliu30 yiliu30 deleted the longbench-deepseek branch February 14, 2026 01:14