feat: Pull test data from public KABR HuggingFace datasets #129

Draft
EmersonFras wants to merge 2 commits into main from feat/issue-38/public-test-data

Conversation

@EmersonFras
Contributor

Closes data-infra-fwg #38.

imageomics/kabr_testing is a private/gated dataset. This replaces it with three public datasets:

  • imageomics/KABR-mini-scene-raw-videos (behavior video, miniscene, annotation, metadata)
  • imageomics/KABR-raw-videos (detection video)
  • imageomics/kabr-worked-examples (detection annotation)

Because the public filenames differ from what the tools expect (e.g. DJI_0068_trimmed.mp4 vs DJI_0068.mp4), get_behavior() and get_detection() now stage downloaded files into a tempfile.mkdtemp() workspace with the canonical names the tools require. Large video files are symlinked to avoid copying.
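The staging step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `stage_file` is a hypothetical helper, the copy fallback for platforms without symlink support is an added assumption, and a small dummy file stands in for a video from the Hugging Face cache.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_file(src, dest, link=True):
    """Expose `src` under the canonical path `dest` (hypothetical helper).

    Symlinks when possible so large videos are not duplicated; falls back
    to a copy if the platform refuses to create the link (an assumption,
    not something the PR is known to do).
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if link:
        try:
            os.symlink(Path(src).resolve(), dest)
            return dest
        except OSError:
            pass  # e.g. symlinks unavailable; copy instead
    shutil.copy2(src, dest)
    return dest


# Dummy stand-in for a file downloaded under its public name.
cache_dir = Path(tempfile.mkdtemp())
downloaded = cache_dir / "DJI_0068_trimmed.mp4"
downloaded.write_bytes(b"not a real video")

# Stage it in a fresh workspace under the name the tools expect.
workspace = Path(tempfile.mkdtemp())
staged = stage_file(downloaded, workspace / "DJI_0068" / "DJI_0068.mp4")
```

Note that `tempfile.mkdtemp()` leaves cleanup to the caller, so a workspace created this way persists until it is removed explicitly.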

Test assertions that were hardcoded to specific track counts, frame counts, and split thresholds are now derived from the annotation at test time, so they no longer depend on the contents of one particular dataset.
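As a rough sketch of what deriving expectations from the annotation can look like: the snippet below assumes a CVAT-style video XML with `<track id=...>` elements containing `<box frame=... outside=...>` children, and assumes per-track outputs are named `<track id>.mp4`. The helper name and the inline sample are illustrative only and are not taken from the test suite.

```python
import xml.etree.ElementTree as ET

# Tiny inline sample in the assumed CVAT-style layout.
SAMPLE = """<annotations>
  <track id="0" label="zebra">
    <box frame="0" outside="0" xtl="1" ytl="1" xbr="5" ybr="5"/>
    <box frame="1" outside="0" xtl="2" ytl="1" xbr="6" ybr="5"/>
    <box frame="2" outside="1" xtl="3" ytl="1" xbr="7" ybr="5"/>
  </track>
  <track id="3" label="zebra">
    <box frame="1" outside="0" xtl="4" ytl="4" xbr="9" ybr="9"/>
  </track>
</annotations>"""


def visible_frames_by_track(root):
    """Map each track id to the frames where its box is not marked outside."""
    frames = {}
    for track in root.iter("track"):
        frames[track.get("id")] = [
            int(box.get("frame"))
            for box in track.iter("box")
            if box.get("outside") != "1"
        ]
    return frames


expected = visible_frames_by_track(ET.fromstring(SAMPLE))
# Expected output names follow from the annotation instead of being fixed to 0/1.
expected_mp4s = sorted(f"{track_id}.mp4" for track_id in expected)
```

A test can then compare the files a tool produced against `expected_mp4s`, whatever track IDs the input annotation happens to contain.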

Replace private imageomics/kabr_testing with three public repos:
- KABR-mini-scene-raw-videos for DJI_0001 behavior data
- KABR-raw-videos for DJI_0068 detection video
- kabr-worked-examples for DJI_0068 detection annotation

Downloads are staged in a per-call temp directory with canonical
names (DJI_0001/DJI_0001.mp4, DJI_0068/DJI_0068.mp4) so existing
tool naming assumptions remain satisfied.

Remove hardcoded assumptions tied to the old private test dataset:
- test_tracks_extractor: track IDs and counts now derived from annotation
- test_cvat2slowfast: row count now checked as > 0 rather than == 91
- test_cvat2ultralytics: split thresholds and frame indices computed
  from the annotation at runtime instead of being hardcoded
Contributor

Copilot AI left a comment


Pull request overview

Switches the test data source from the gated imageomics/kabr_testing dataset to three public KABR datasets on Hugging Face. Because the public filenames differ from the canonical names the tools expect, the test helpers stage downloaded files into a temp workspace under the expected names (symlinking large videos), and brittle hardcoded assertions on track counts, frame counts, and split thresholds are now derived from the annotation at runtime.

Changes:

  • Replace single DATA_HUB with three public hubs and update file paths in tests/utils.py; rewrite get_behavior()/get_detection() to stage files via tempfile.mkdtemp().
  • Generalize tests/test_tracks_extractor.py to derive expected mp4s, tracks, and color indices from the parsed annotation.
  • Generalize tests/test_cvat2ultralytics.py to mirror the tool's split logic (int(n*0.8)/int(n*0.87)) and derive image/label expectations from the annotation, and relax tests/test_cvat2slowfast.py row-count assertion.
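The split arithmetic mentioned in the last bullet can be illustrated with a short sketch. Only the two cut points, `int(n*0.8)` and `int(n*0.87)`, come from the review text; the function name and the reading of the three ranges as train, validation, and test are assumptions made for illustration.

```python
def split_indices(n, first_cut=0.8, second_cut=0.87):
    """Return three index ranges cut at int(n*0.8) and int(n*0.87).

    Which range the tool treats as train/val/test is assumed here.
    """
    train_end = int(n * first_cut)
    val_end = int(n * second_cut)
    return range(0, train_end), range(train_end, val_end), range(val_end, n)


train, val, test = split_indices(50)

# With few items both cut points can truncate to the same index,
# leaving the middle range empty.
small_train, small_val, small_test = split_indices(10)
```

Because the cut points truncate, the sizes of the three ranges depend on the number of annotated items, which is why computing them from the annotation is sturdier than hardcoding frame indices.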

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/utils.py | Switch to three public HF datasets and stage files into a tmpdir with canonical names. |
| tests/test_tracks_extractor.py | Derive expected per-track mp4s and color indices from the annotation rather than hardcoding 0/1. |
| tests/test_cvat2ultralytics.py | Replicate tool's track2end/skip/split logic to compute expected outputs dynamically. |
| tests/test_cvat2slowfast.py | Replace hardcoded frame counts with bounds derived from the produced data.csv. |


Comment thread tests/utils.py
Comment on lines +33 to +55
+    tmpdir = tempfile.mkdtemp()
+    base = Path(tmpdir) / "DJI_0001"
+    (base / "actions").mkdir(parents=True)
+    (base / "metadata").mkdir(parents=True)
+
-def get_behavior():
-    video = get_cached_datafile(BEHAVIOR_VIDEO)
-    miniscene = get_cached_datafile(BEHAVIOR_MINISCENE)
-    annotation = get_cached_datafile(BEHAVIOR_ANNOTATION)
-    metadata = get_cached_datafile(BEHAVIOR_METADATA)
-    return video, miniscene, annotation, metadata
+    video = base / "DJI_0001.mp4"
+    miniscene = base / "43.mp4"
+    annotation = base / "actions" / "43.xml"
+    metadata = base / "metadata" / "DJI_0001_metadata.json"
+
+    os.symlink(video_hf, video)
+    os.symlink(miniscene_hf, miniscene)
+    os.symlink(annotation_hf, annotation)
+    shutil.copy2(metadata_hf, metadata)
+
+    return str(video), str(miniscene), str(annotation), str(metadata)


 def get_detection():
-    video = get_cached_datafile(DETECTION_VIDEO)
-    annotation = get_cached_datafile(DETECTION_ANNOTATION)
-    return video, annotation
+    video_hf = get_hf(DETECTION_VIDEO_HUB, DETECTION_VIDEO, REPO_TYPE)
+    annotation_hf = get_hf(DETECTION_ANNOTATION_HUB, DETECTION_ANNOTATION, REPO_TYPE)
+
+    tmpdir = tempfile.mkdtemp()

Comment thread tests/utils.py
+    video = base / "DJI_0068.mp4"
+    annotation = base / "DJI_0068.xml"
+
+    os.symlink(video_hf, video)