Skip to content

Conversation

@43758726
Copy link
Collaborator

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily receiving feedbacks. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

Avoiding finding dataset module when not using lite module.

Modification

Move load_dataset location from outside to the function of get_calib_loaders in lmdeploy/lmdeploy/lite/utils/calib_dataloader.py.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

Copilot AI review requested due to automatic review settings January 23, 2026 11:33
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Moves Hugging Face datasets loading/imports into get_calib_loaders to prevent datasets being imported at module import time when lite calibration isn’t used.

Changes:

  • Removed module-level datasets imports and relocated load_dataset calls into get_calib_loaders.
  • Updated dataset-specific helper functions (get_wikitext2, get_c4, etc.) to accept a pre-loaded dataset argument instead of calling load_dataset internally.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Returns:
List of sampled and tokenized training examples.
"""
from datasets import VerificationMode, load_dataset
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_calib_loaders unconditionally imports datasets at runtime. If datasets is an optional dependency (as implied by the motivation), consider wrapping this import in a try/except ImportError and raising a clearer error instructing users how to install the lite/extra requirements, rather than propagating a bare ModuleNotFoundError.

Suggested change
from datasets import VerificationMode, load_dataset
try:
from datasets import VerificationMode, load_dataset
except ImportError as e:
raise ImportError(
"The 'datasets' package is required to use calibration data loaders. "
"Please install lmdeploy with the 'lite' extra (e.g. `pip install lmdeploy[lite]`) "
"or install the 'datasets' package directly."
) from e

Copilot uses AI. Check for mistakes.
raise InterruptedError('There have been some issues when generating '
'the dataset, you could try to download it '
'locally first, and replace the `data_files`'
'with local addresses or use other datasets '
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The InterruptedError message concatenates two string literals without a separating space: "replace the data_files" + "with local addresses...", which renders as "data_fileswith". Add a trailing/leading space so the error message is readable.

Suggested change
'with local addresses or use other datasets '
' with local addresses or use other datasets '

Copilot uses AI. Check for mistakes.
Comment on lines +376 to +379
from datasets import VerificationMode, load_dataset
if 'wikitext2' in name:
return get_wikitext2(tokenizer, nsamples, seed, seqlen)
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train')
return get_wikitext2(dataset, tokenizer, nsamples, seed, seqlen)
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is intended to avoid importing datasets at module import time; there are no unit tests covering this behavior (e.g., that import lmdeploy.lite.utils.calib_dataloader succeeds when datasets is not installed, and that get_calib_loaders still calls load_dataset for each supported name). Please add tests that mock/monkeypatch datasets.load_dataset (and simulate ImportError for datasets) to prevent regressions without requiring network downloads.

Copilot uses AI. Check for mistakes.
@lvhan028 lvhan028 merged commit 65735d2 into InternLM:main Jan 23, 2026
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants