feat: Pull test data from public KABR HuggingFace datasets #129
Draft
EmersonFras wants to merge 2 commits into
Replace private imageomics/kabr_testing with three public repos:
- KABR-mini-scene-raw-videos for DJI_0001 behavior data
- KABR-raw-videos for DJI_0068 detection video
- kabr-worked-examples for DJI_0068 detection annotation

Downloads are staged in a per-call temp directory with canonical names (DJI_0001/DJI_0001.mp4, DJI_0068/DJI_0068.mp4) so existing tool naming assumptions remain satisfied.

Remove hardcoded assumptions tied to the old private test dataset:
- test_tracks_extractor: track IDs and counts now derived from annotation
- test_cvat2slowfast: row count now checked as > 0 rather than == 91
- test_cvat2ultralytics: split thresholds and frame indices computed from the annotation at runtime instead of being hardcoded
Pull request overview
Switches the test data source from the gated imageomics/kabr_testing dataset to three public KABR datasets on Hugging Face. Because the public filenames differ from the canonical names the tools expect, the test helpers stage downloaded files into a temp workspace under the expected names (symlinking large videos), and brittle hardcoded assertions on track counts, frame counts, and split thresholds are now derived from the annotation at runtime.
Changes:
- Replace single `DATA_HUB` with three public hubs and update file paths in `tests/utils.py`; rewrite `get_behavior()`/`get_detection()` to stage files via `tempfile.mkdtemp()`.
- Generalize `tests/test_tracks_extractor.py` to derive expected mp4s, tracks, and color indices from the parsed annotation.
- Generalize `tests/test_cvat2ultralytics.py` to mirror the tool's split logic (`int(n*0.8)`/`int(n*0.87)`) and derive image/label expectations from the annotation, and relax the `tests/test_cvat2slowfast.py` row-count assertion.
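The split thresholds mentioned in the last item can be illustrated with a minimal sketch. Only the two expressions `int(n*0.8)` and `int(n*0.87)` come from this PR; the function names and the train/val/test labels below are hypothetical and may not match how the tool names or orders its splits.

```python
def split_bounds(n):
    """Return the two cut indices for n items, using the expressions from the PR."""
    return int(n * 0.8), int(n * 0.87)


def assign_split(index, n):
    """Map an item index to a split name.

    The train/val/test labels and their order are assumptions made for
    illustration; the PR only states the two threshold expressions.
    """
    first_cut, second_cut = split_bounds(n)
    if index < first_cut:
        return "train"
    if index < second_cut:
        return "val"
    return "test"


def split_counts(n):
    """Count how many of n items land in each split."""
    counts = {"train": 0, "val": 0, "test": 0}
    for index in range(n):
        counts[assign_split(index, n)] += 1
    return counts
```

Computing the cuts from `n` at test time, rather than hardcoding them, means the expectations follow whatever number of frames the annotation actually contains.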
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/utils.py | Switch to three public HF datasets and stage files into a tmpdir with canonical names. |
| tests/test_tracks_extractor.py | Derive expected per-track mp4s and color indices from the annotation rather than hardcoding 0/1. |
| tests/test_cvat2ultralytics.py | Replicate tool's track2end/skip/split logic to compute expected outputs dynamically. |
| tests/test_cvat2slowfast.py | Replace hardcoded frame counts with bounds derived from the produced data.csv. |
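As a rough sketch of deriving expectations from the annotation instead of hardcoding them, the snippet below reads track IDs from an XML string. It assumes the annotation follows the CVAT video layout with `<track id="...">` elements, and it assumes per-track output files are named `<id>.mp4`; both are assumptions for illustration, and the helper names are hypothetical rather than taken from the test suite.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real annotation would be loaded with ET.parse(path).
SAMPLE_ANNOTATION = """
<annotations>
  <track id="43" label="zebra"><box frame="0" /></track>
  <track id="7" label="zebra"><box frame="3" /></track>
</annotations>
"""


def track_ids(annotation_xml):
    """Return the sorted integer IDs of all <track> elements (assumed layout)."""
    root = ET.fromstring(annotation_xml)
    return sorted(int(track.get("id")) for track in root.iter("track"))


def expected_track_videos(annotation_xml):
    """Build expected per-track file names, assuming an '<id>.mp4' naming scheme."""
    return [f"{track_id}.mp4" for track_id in track_ids(annotation_xml)]
```

A test can then compare the files a tool produced against `expected_track_videos(...)` rather than against a fixed list such as track 0 and track 1.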
Comment on lines +33 to +55
tmpdir = tempfile.mkdtemp()
base = Path(tmpdir) / "DJI_0001"
(base / "actions").mkdir(parents=True)
(base / "metadata").mkdir(parents=True)

def get_behavior():
video = get_cached_datafile(BEHAVIOR_VIDEO)
miniscene = get_cached_datafile(BEHAVIOR_MINISCENE)
annotation = get_cached_datafile(BEHAVIOR_ANNOTATION)
metadata = get_cached_datafile(BEHAVIOR_METADATA)
return video, miniscene, annotation, metadata
video = base / "DJI_0001.mp4"
miniscene = base / "43.mp4"
annotation = base / "actions" / "43.xml"
metadata = base / "metadata" / "DJI_0001_metadata.json"

os.symlink(video_hf, video)
os.symlink(miniscene_hf, miniscene)
os.symlink(annotation_hf, annotation)
shutil.copy2(metadata_hf, metadata)

return str(video), str(miniscene), str(annotation), str(metadata)

def get_detection():
video = get_cached_datafile(DETECTION_VIDEO)
annotation = get_cached_datafile(DETECTION_ANNOTATION)
return video, annotation
video_hf = get_hf(DETECTION_VIDEO_HUB, DETECTION_VIDEO, REPO_TYPE)
annotation_hf = get_hf(DETECTION_ANNOTATION_HUB, DETECTION_ANNOTATION, REPO_TYPE)

tmpdir = tempfile.mkdtemp()
video = base / "DJI_0068.mp4"
annotation = base / "DJI_0068.xml"

os.symlink(video_hf, video)
Closes data-infra-fwg #38.
`imageomics/kabr_testing` is a private/gated dataset. This replaces it with three public datasets:
- `imageomics/KABR-mini-scene-raw-videos` (behavior video, miniscene, annotation, metadata)
- `imageomics/KABR-raw-videos` (detection video)
- `imageomics/kabr-worked-examples` (detection annotation)

Because the public filenames differ from what the tools expect (e.g. `DJI_0068_trimmed.mp4` vs `DJI_0068.mp4`), `get_behavior()` and `get_detection()` now stage downloaded files into a `tempfile.mkdtemp()` workspace with the canonical names the tools require. Large video files are symlinked to avoid copying.

Test assertions that were hardcoded to specific track counts, frame counts, and split thresholds are now derived from the annotation at test time, so they no longer depend on the contents of one particular dataset.
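The staging approach described above can be sketched generically as follows. This is not the code in `tests/utils.py`: `stage_files` and its parameters are hypothetical names, and the sketch only shows the idea of symlinking large files and copying small ones into a fresh temp directory under canonical names.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_files(sources, copy_names=()):
    """Stage files in a new temp directory under canonical relative names.

    sources maps a canonical relative path (e.g. "DJI_0068/DJI_0068.mp4")
    to the absolute path of the downloaded file. Names listed in copy_names
    are copied; everything else is symlinked to avoid duplicating large files.
    """
    workspace = Path(tempfile.mkdtemp())
    staged = {}
    for relative_name, source in sources.items():
        target = workspace / relative_name
        target.parent.mkdir(parents=True, exist_ok=True)
        if relative_name in copy_names:
            shutil.copy2(source, target)
        else:
            os.symlink(source, target)
        staged[relative_name] = str(target)
    return staged
```

Note that `tempfile.mkdtemp()` does not delete the directory on its own, so a caller that wants cleanup would need to remove the workspace (for example with `shutil.rmtree`) once the test finishes.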