@jberchtold-nvidia
Collaborator

Description

The Hugging Face "glue" dataset appears to have moved from "glue" to "nyu-mll/glue". This small PR updates the dataset path, since we have started to see 404 errors.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Use "nyu-mll/glue" instead of "glue" dataset path
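As a rough illustration of the path change, here is a minimal sketch that assumes the tests load data through `load_dataset` from the Hugging Face `datasets` library. The names `LEGACY_TO_NAMESPACED`, `resolve_dataset_path`, and `load_cola_train` are hypothetical and are not part of this PR.

```python
# Hypothetical mapping from legacy short names to the
# organization-prefixed Hub paths mentioned in this PR.
LEGACY_TO_NAMESPACED = {
    "glue": "nyu-mll/glue",
    "mnist": "ylecun/mnist",
}


def resolve_dataset_path(name: str) -> str:
    """Return the namespaced path for a legacy name; pass other names through."""
    return LEGACY_TO_NAMESPACED.get(name, name)


def load_cola_train():
    """Load the GLUE CoLA training split (needs network access or a warm cache)."""
    # Imported lazily so the mapping can be used without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(resolve_dataset_path("glue"), "cola", split="train")
```

The PR itself replaces the string literals directly; a mapping like this would only make sense if several call sites needed to stay in sync.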

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@greptile-apps
Contributor

greptile-apps bot commented Jan 26, 2026

Greptile Overview

Greptile Summary

This PR updates HuggingFace dataset paths to use the new organization-prefixed format (nyu-mll/glue and ylecun/mnist) instead of the deprecated short names (glue and mnist), resolving 404 errors that were occurring when tests attempted to download datasets.

  • Updated all encoder test files to use nyu-mll/glue for the GLUE CoLA dataset
  • Updated MNIST test to use ylecun/mnist for the MNIST dataset
  • Added datasets.txt manifest file documenting the datasets used by tests for pre-emptive caching
  • Minor style issue: datasets.txt is missing a trailing newline
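One way a manifest like `datasets.txt` could be consumed for pre-emptive caching is sketched below. This is an assumption about usage, not something the PR shows: `parse_manifest` and `prefetch` are hypothetical helpers, and whether CI uses `huggingface_hub.snapshot_download` (or anything similar) is not stated in this thread.

```python
from pathlib import Path


def parse_manifest(text: str) -> list[str]:
    """Return dataset paths from manifest text, skipping blanks and '#' comments."""
    paths = []
    for line in text.splitlines():
        entry = line.strip()
        if entry and not entry.startswith("#"):
            paths.append(entry)
    return paths


def prefetch(manifest_file: str) -> None:
    """Download each dataset repository listed in the manifest into the local Hub cache."""
    # Lazy import: only required when actually downloading.
    from huggingface_hub import snapshot_download

    for repo_id in parse_manifest(Path(manifest_file).read_text()):
        snapshot_download(repo_id=repo_id, repo_type="dataset")
```

Because `splitlines()` does not depend on a final line terminator, the parser reads the last entry correctly even when the file lacks a trailing newline.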

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes are straightforward string replacements updating dataset paths to match HuggingFace's new organizational structure. All modifications follow the same pattern across test files, the logic remains unchanged, and the fix directly addresses a known 404 error issue. The only minor issue is a missing trailing newline in the new manifest file.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| examples/jax/datasets.txt | New manifest file listing datasets for caching; missing trailing newline |
| examples/jax/encoder/test_model_parallel_encoder.py | Updated dataset path from glue to nyu-mll/glue to fix 404 errors |
| examples/jax/encoder/test_multigpu_encoder.py | Updated dataset path from glue to nyu-mll/glue to fix 404 errors |
| examples/jax/encoder/test_multiprocessing_encoder.py | Updated dataset path from glue to nyu-mll/glue to fix 404 errors |
| examples/jax/encoder/test_single_gpu_encoder.py | Updated dataset path from glue to nyu-mll/glue to fix 404 errors |
| examples/jax/mnist/test_single_gpu_mnist.py | Updated dataset path from mnist to ylecun/mnist to fix 404 errors |

Sequence Diagram

sequenceDiagram
    participant Test as Test Script
    participant HF as HuggingFace Datasets API
    participant Dataset as Dataset Repository
    
    Note over Test,Dataset: Before: Using deprecated paths
    Test->>HF: load_dataset("glue", "cola")
    HF->>Dataset: Request from old path
    Dataset-->>HF: 404 Not Found
    HF-->>Test: Error
    
    Note over Test,Dataset: After: Using new organization paths
    Test->>HF: load_dataset("nyu-mll/glue", "cola")
    HF->>Dataset: Request from nyu-mll/glue
    Dataset-->>HF: Return dataset
    HF-->>Test: Dataset loaded
    
    Test->>HF: load_dataset("ylecun/mnist")
    HF->>Dataset: Request from ylecun/mnist
    Dataset-->>HF: Return dataset
    HF-->>Test: Dataset loaded

Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


@jberchtold-nvidia force-pushed the jberchtold/fix-glue-cola-dataset-name branch from bc3729d to 4c8eb15 on January 26, 2026 21:35
@jberchtold-nvidia
Collaborator Author

/te-ci L0 jax

Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


timmoon10 previously approved these changes Jan 27, 2026
Collaborator

@timmoon10 timmoon10 left a comment


LGTM

Signed-off-by: Jeremy Berchtold <[email protected]>
@jberchtold-nvidia
Collaborator Author

/te-ci L0 jax

Contributor

@greptile-apps greptile-apps bot left a comment


No files reviewed, no comments


Signed-off-by: Jeremy Berchtold <[email protected]>
Contributor

@greptile-apps greptile-apps bot left a comment


1 file reviewed, 1 comment


@@ -0,0 +1,3 @@
# Datasets used by TE encoder tests. Pull these to pre-emptively cache datasets
ylecun/mnist
nyu-mll/glue
\ No newline at end of file
Contributor


missing newline at end of file

Suggested change
-nyu-mll/glue
\ No newline at end of file
+nyu-mll/glue

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
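For reference, the trailing-newline fix suggested above can be expressed as a small check. This is only an illustrative sketch; `ensure_trailing_newline` is a hypothetical helper, and the PR does not include such a function.

```python
def ensure_trailing_newline(text: str) -> str:
    """Return text with a final newline appended if non-empty text lacks one."""
    if text and not text.endswith("\n"):
        return text + "\n"
    return text
```

The function is idempotent, so applying it to a file that already ends with a newline leaves the content unchanged.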

@jberchtold-nvidia
Collaborator Author

/te-ci L0 jax

@KshitijLakhani KshitijLakhani self-requested a review January 27, 2026 18:43
Collaborator

@KshitijLakhani KshitijLakhani left a comment


LGTM!
Thanks, Jeremy

@KshitijLakhani KshitijLakhani self-requested a review January 27, 2026 18:44
@jberchtold-nvidia jberchtold-nvidia merged commit 2104e4c into NVIDIA:main Jan 27, 2026
19 of 23 checks passed
@jberchtold-nvidia jberchtold-nvidia deleted the jberchtold/fix-glue-cola-dataset-name branch January 27, 2026 20:10
KshitijLakhani pushed a commit that referenced this pull request Jan 28, 2026
…x 404 error (#2625)

* Use "nyu-mll/glue" instead of "glue" for encoder datasets to fix 404 error

Signed-off-by: Jeremy Berchtold <[email protected]>

* rename mnist dataset path

Signed-off-by: Jeremy Berchtold <[email protected]>

* add dataset manifest

Signed-off-by: Jeremy Berchtold <[email protected]>

---------

Signed-off-by: Jeremy Berchtold <[email protected]>