@tompollard
Member

As discussed in #165, it would be helpful to have a pre-submission validation tool that helps contributors verify their datasets before uploading to PhysioNet.

This pull request adds a new `validate` module and command-line interface. Users can now run `physionet validate PATH` to generate a report card for their dataset.

Validation Categories:

  • Filesystem: File naming conventions, proprietary formats, hidden/temp files, version control artifacts
  • Documentation: Required files (README.md by default)
  • Integrity: CSV structure, encoding, duplicate columns, inconsistent rows
  • Quality: Empty columns in CSV
  • Privacy: Sensitive information (SSN, email, phone, ages >89), sensitive config files
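To make the Filesystem category concrete, here is a minimal sketch of how filename checks of this kind could work. It is not the code added in this pull request: the function name `check_filename`, the `PROPRIETARY_EXTENSIONS` set, and the exact message strings are assumptions for illustration. Only the allowed character set (letters, numbers, underscores, hyphens, periods) and the treatment of `.mat` as proprietary are taken from the example report.

```python
import os
import re

# Characters the report's recommendations describe as acceptable.
ALLOWED_CHAR = re.compile(r"[A-Za-z0-9_.\-]")

# Hypothetical list; the report only confirms that .mat is flagged.
PROPRIETARY_EXTENSIONS = {".mat"}


def check_filename(name):
    """Return a list of (severity, message) findings for one filename."""
    findings = []

    # Spaces are reported as a warning rather than an error.
    if " " in name:
        findings.append(("warning", "Filename contains spaces"))

    # Any other character outside the allowed set is an error.
    invalid = sorted({c for c in name if c != " " and not ALLOWED_CHAR.fullmatch(c)})
    if invalid:
        findings.append(("error", f"Filename contains invalid characters: {invalid}"))

    # Flag formats that require proprietary software to read.
    extension = os.path.splitext(name)[1].lower()
    if extension in PROPRIETARY_EXTENSIONS:
        findings.append(("warning", "Proprietary file format detected"))

    return findings
```

Separating severities this way lets a report distinguish issues that block submission (errors) from issues that only deserve a second look (warnings).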

An example report is below:

PhysioNet Dataset Validation Report
==================================================

Metadata:
  Dataset: organ-retrieval-and-collection-of-health-information-for-donation-orchid-2.1.1
  Validator version: 0.1.4
  Timestamp: 2025-12-12 20:58:28 UTC
  Total size: 682.6 MB (17 files)

Validation Results:
==================================================
✗ Filesystem (1 error, 3 warnings)
  ⚠ Data Description.csv - Filename contains spaces: Data Description.csv
  ⚠ OPOReferrals.mat - Proprietary file format detected: OPOReferrals.mat
  ⚠ DataDescription copy.dicom - Filename contains spaces: DataDescription copy.dicom
  ✗ CultureEvents****(.csv - Filename contains invalid characters ('*'): CultureEvents****(.csv
✗ Documentation (1 error)
  ✗ README.md - Required file not found: README.md
✓ Integrity
✓ Quality
✗ Privacy (1 error, 1 warning)
  ✗ credentials.json - Sensitive file detected: credential file
  ⚠ LICENSE.txt - Potential private information detected (email address)

Summary:
==================================================
3 errors, 4 warnings

✗ Dataset has errors that must be fixed before submission

Recommendations:
==================================================

Filesystem:
  ✗ Remove special characters from filename (use only letters, numbers, underscores, hyphens, and periods)
  ⚠ Replace spaces with underscores or hyphens (2 files)
  ⚠ MATLAB format; consider .csv, .zarr, .parquet, or .npy instead

Documentation:
  ✗ Add README.md to your dataset. At minimum, the file should include a title and a brief description of the package content.

Privacy:
  ✗ Remove 'credentials.json' from the dataset before submission
  ⚠ Review and remove or de-identify sensitive information

Note: A validation report (PHYSIONET_REPORT.md) has been saved in your
      dataset folder. Please include this file in your final submission.
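As an illustration of the Privacy findings in the report above, the following sketch shows one way pattern-based detection of sensitive information might be done. It is an assumption-laden example, not the validator's implementation: the `PATTERNS` table, the `scan_text` function, and the specific regular expressions are hypothetical, and phone number detection is omitted for brevity.

```python
import re

# Hypothetical patterns; a real validator would likely need stricter rules.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Matches "age" followed by a number of 90 or more (two or three digits).
    "age over 89": re.compile(r"\bage[:\s]+(9\d|[1-9]\d{2})\b", re.IGNORECASE),
}


def scan_text(text):
    """Return the labels of all patterns that match somewhere in the text."""
    return [label for label, pattern in PATTERNS.items() if pattern.search(text)]
```

Simple patterns like these produce false positives (for example, a maintainer contact email in a license file), which is presumably why such matches are reported as warnings for review rather than as errors.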

@tompollard tompollard merged commit d1315b5 into main Dec 12, 2025
4 checks passed
@tompollard tompollard deleted the tp/validator branch December 12, 2025 21:16