feat: Pull test data from public KABR HuggingFace datasets by EmersonFras · Pull Request #129 · Imageomics/kabr-tools

EmersonFras · 2026-05-14T17:31:12Z

imageomics/kabr_testing is a private/gated dataset. This replaces it with three public datasets:

imageomics/KABR-mini-scene-raw-videos (behavior video, miniscene, annotation, metadata)
imageomics/KABR-raw-videos (detection video)
imageomics/kabr-worked-examples (detection annotation)

Because the public filenames differ from what the tools expect (e.g. DJI_0068_trimmed.mp4 vs DJI_0068.mp4), get_behavior() and get_detection() now stage downloaded files into a tempfile.mkdtemp() workspace with the canonical names the tools require. Large video files are symlinked to avoid copying.
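The staging step described above can be sketched roughly as follows. This is a minimal illustration, not the code in tests/utils.py: `stage_file` and `stage_workspace` are hypothetical helper names, and the fallback to copying when a symlink cannot be created is an added assumption that the PR does not describe.

```python
import os
import shutil
import tempfile
from pathlib import Path


def stage_file(src, dest, link=True):
    """Expose src under the canonical path dest (hypothetical helper)."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if link:
        try:
            # Symlink so large videos are not duplicated on disk.
            os.symlink(os.path.abspath(src), dest)
            return dest
        except (OSError, NotImplementedError):
            pass  # assumption: fall back to a copy where symlinks are unavailable
    shutil.copy2(src, dest)
    return dest


def stage_workspace(files):
    """Stage {canonical relative path: downloaded path} into a fresh temp dir."""
    root = Path(tempfile.mkdtemp())
    staged = {rel: stage_file(src, root / rel) for rel, src in files.items()}
    return root, staged
```

A call such as `stage_workspace({"DJI_0068/DJI_0068.mp4": downloaded_path})` would then yield a workspace in which the tool sees the name it expects, whatever the file is called in the Hugging Face cache.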

Test assertions that were hardcoded to specific track counts, frame counts, and split thresholds are now derived from the annotation at test time, so they no longer depend on the contents of one particular dataset.

Replace private imageomics/kabr_testing with three public repos:
- KABR-mini-scene-raw-videos for DJI_0001 behavior data
- KABR-raw-videos for DJI_0068 detection video
- kabr-worked-examples for DJI_0068 detection annotation

Downloads are staged in a per-call temp directory with canonical names (DJI_0001/DJI_0001.mp4, DJI_0068/DJI_0068.mp4) so existing tool naming assumptions remain satisfied.

Remove hardcoded assumptions tied to the old private test dataset:
- test_tracks_extractor: track IDs and counts now derived from annotation
- test_cvat2slowfast: row count now checked as > 0 rather than == 91
- test_cvat2ultralytics: split thresholds and frame indices computed from the annotation at runtime instead of being hardcoded
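As an illustration of deriving expectations from the annotation instead of hardcoding them, the sketch below reads track IDs and visible-box counts from a CVAT-style XML document. It assumes the usual CVAT video layout (`<track id=...>` elements containing `<box>` elements with an `outside` attribute). The function name `tracks_from_annotation` is hypothetical and is not taken from the PR.

```python
import xml.etree.ElementTree as ET


def tracks_from_annotation(xml_text):
    """Return {track_id: number of visible boxes} for a CVAT-style annotation."""
    root = ET.fromstring(xml_text)
    counts = {}
    for track in root.iter("track"):
        # Boxes flagged outside="1" mark frames where the object has left the view.
        visible = [b for b in track.iter("box") if b.get("outside", "0") == "0"]
        counts[int(track.get("id"))] = len(visible)
    return counts
```

A test could then compare the tool's output against `tracks_from_annotation(...)` for whatever annotation was downloaded, rather than asserting a fixed set of IDs such as `0` and `1`.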

Copilot

Pull request overview

Switches the test data source from the gated imageomics/kabr_testing dataset to three public KABR datasets on Hugging Face. Because the public filenames differ from the canonical names the tools expect, the test helpers stage downloaded files into a temp workspace under the expected names (symlinking large videos), and brittle hardcoded assertions on track counts, frame counts, and split thresholds are now derived from the annotation at runtime.

Changes:

Replace single DATA_HUB with three public hubs and update file paths in tests/utils.py; rewrite get_behavior()/get_detection() to stage files via tempfile.mkdtemp().
Generalize tests/test_tracks_extractor.py to derive expected mp4s, tracks, and color indices from the parsed annotation.
Generalize tests/test_cvat2ultralytics.py to mirror the tool's split logic (int(n*0.8)/int(n*0.87)) and derive image/label expectations from the annotation, and relax tests/test_cvat2slowfast.py row-count assertion.
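The split thresholds mentioned above, `int(n*0.8)` and `int(n*0.87)`, can be mirrored in a test with a few lines. The sketch below is an assumption about how the boundaries are used: it treats indices below the first threshold as train, those below the second as val, and the rest as test. The summary gives only the two expressions, not the assignment order, and the function names are hypothetical.

```python
def split_bounds(n):
    """Return the two cut points used to partition n items."""
    return int(n * 0.8), int(n * 0.87)


def assign_split(index, n):
    """Assumed ordering: train, then val, then test."""
    train_end, val_end = split_bounds(n)
    if index < train_end:
        return "train"
    if index < val_end:
        return "val"
    return "test"
```

Because both cut points are truncated with `int`, small inputs can produce an empty middle segment (for `n = 10` both thresholds are 8), which is one reason to compute expectations from the data rather than hardcode them.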

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| tests/utils.py | Switch to three public HF datasets and stage files into a tmpdir with canonical names. |
| tests/test_tracks_extractor.py | Derive expected per-track mp4s and color indices from the annotation rather than hardcoding `0`/`1`. |
| tests/test_cvat2ultralytics.py | Replicate tool's `track2end`/skip/split logic to compute expected outputs dynamically. |
| tests/test_cvat2slowfast.py | Replace hardcoded frame counts with bounds derived from the produced `data.csv`. |
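For the relaxed `data.csv` check, a data-independent assertion might look like the sketch below. The layout of `data.csv` is not shown in this PR, so the single header row and the space delimiter are assumptions, and `count_data_rows` is a hypothetical helper.

```python
import csv


def count_data_rows(lines, delimiter=" "):
    """Count non-empty rows after an assumed single header row."""
    rows = [row for row in csv.reader(lines, delimiter=delimiter) if row]
    return max(len(rows) - 1, 0)
```

In a test this could be used as `with open(csv_path, newline="") as f: assert count_data_rows(f) > 0`, which passes for any annotation that yields at least one row instead of requiring exactly 91.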


+    tmpdir = tempfile.mkdtemp()
+    base = Path(tmpdir) / "DJI_0001"
+    (base / "actions").mkdir(parents=True)
+    (base / "metadata").mkdir(parents=True)

-def get_behavior():
-    video = get_cached_datafile(BEHAVIOR_VIDEO)
-    miniscene = get_cached_datafile(BEHAVIOR_MINISCENE)
-    annotation = get_cached_datafile(BEHAVIOR_ANNOTATION)
-    metadata = get_cached_datafile(BEHAVIOR_METADATA)
-    return video, miniscene, annotation, metadata
+    video = base / "DJI_0001.mp4"
+    miniscene = base / "43.mp4"
+    annotation = base / "actions" / "43.xml"
+    metadata = base / "metadata" / "DJI_0001_metadata.json"
+
+    os.symlink(video_hf, video)
+    os.symlink(miniscene_hf, miniscene)
+    os.symlink(annotation_hf, annotation)
+    shutil.copy2(metadata_hf, metadata)
+
+    return str(video), str(miniscene), str(annotation), str(metadata)


 def get_detection():
-    video = get_cached_datafile(DETECTION_VIDEO)
-    annotation = get_cached_datafile(DETECTION_ANNOTATION)
-    return video, annotation
+    video_hf = get_hf(DETECTION_VIDEO_HUB, DETECTION_VIDEO, REPO_TYPE)
+    annotation_hf = get_hf(DETECTION_ANNOTATION_HUB, DETECTION_ANNOTATION, REPO_TYPE)
+
+    tmpdir = tempfile.mkdtemp()


+    video = base / "DJI_0068.mp4"
+    annotation = base / "DJI_0068.xml"
+
+    os.symlink(video_hf, video)


EmersonFras added 2 commits May 14, 2026 13:04

EmersonFras requested a review from Copilot May 14, 2026 17:31

Copilot started reviewing on behalf of EmersonFras May 14, 2026 17:31

Copilot AI reviewed May 14, 2026

EmersonFras wants to merge 2 commits into main from feat/issue-38/public-test-data
