-
Notifications
You must be signed in to change notification settings - Fork 649
[Fix] move calibrate load dataset location #4289
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix] move calibrate load dataset location #4289
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Moves Hugging Face datasets loading/imports into get_calib_loaders to prevent datasets being imported at module import time when lite calibration isn’t used.
Changes:
- Removed module-level
datasetsimports and relocatedload_datasetcalls intoget_calib_loaders. - Updated dataset-specific helper functions (
get_wikitext2,get_c4, etc.) to accept a pre-loadeddatasetargument instead of callingload_datasetinternally.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Returns: | ||
| List of sampled and tokenized training examples. | ||
| """ | ||
| from datasets import VerificationMode, load_dataset |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_calib_loaders unconditionally imports datasets at runtime. If datasets is an optional dependency (as implied by the motivation), consider wrapping this import in a try/except ImportError and raising a clearer error instructing users how to install the lite/extra requirements, rather than propagating a bare ModuleNotFoundError.
| from datasets import VerificationMode, load_dataset | |
| try: | |
| from datasets import VerificationMode, load_dataset | |
| except ImportError as e: | |
| raise ImportError( | |
| "The 'datasets' package is required to use calibration data loaders. " | |
| "Please install lmdeploy with the 'lite' extra (e.g. `pip install lmdeploy[lite]`) " | |
| "or install the 'datasets' package directly." | |
| ) from e |
| raise InterruptedError('There have been some issues when generating ' | ||
| 'the dataset, you could try to download it ' | ||
| 'locally first, and replace the `data_files`' | ||
| 'with local addresses or use other datasets ' |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The InterruptedError message concatenates two string literals without a separating space: "replace the data_files" + "with local addresses...", which renders as "data_fileswith". Add a trailing/leading space so the error message is readable.
| 'with local addresses or use other datasets ' | |
| ' with local addresses or use other datasets ' |
| from datasets import VerificationMode, load_dataset | ||
| if 'wikitext2' in name: | ||
| return get_wikitext2(tokenizer, nsamples, seed, seqlen) | ||
| dataset = load_dataset('wikitext', 'wikitext-2-raw-v1', split='train') | ||
| return get_wikitext2(dataset, tokenizer, nsamples, seed, seqlen) |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is intended to avoid importing datasets at module import time; there are no unit tests covering this behavior (e.g., that import lmdeploy.lite.utils.calib_dataloader succeeds when datasets is not installed, and that get_calib_loaders still calls load_dataset for each supported name). Please add tests that mock/monkeypatch datasets.load_dataset (and simulate ImportError for datasets) to prevent regressions without requiring network downloads.
Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily receiving feedbacks. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
Motivation
Avoiding finding dataset module when not using lite module.
Modification
Move load_dataset location from outside to the function of get_calib_loaders in lmdeploy/lmdeploy/lite/utils/calib_dataloader.py.
Checklist