Commit 7b70079

AssetStore Import Updates (#1507)
* import annotation files, process dangling annotation files, collection destination imports
* replace largeImage import folder creation with static importing, add recursive dangling annotation processing
* initial sample creation script
* adding minIO script to launch and readme
* updating readme instructions
* update readme
* fix: update spacing on minIOConfig.py
* feat: update documentation for importing annotations
1 parent 165c593 commit 7b70079

File tree

7 files changed (+621 -21 lines)


docs/Deployment-Storage.md

Lines changed: 7 additions & 0 deletions
@@ -75,6 +75,13 @@ If you have data in S3 or MinIO, you can mirror it in DIVE for annotation.

* You should not make changes to folder contents once a folder has been mirrored into DIVE. Adding or removing images in a particular folder may cause annotation alignment issues.
* Adding entire new folders is supported, and will require a re-index of your S3 bucket.
### S3/MinIO and Annotation Importing

During the import process, annotations associated with image sequences or video files can be imported automatically:

* **Video** - For video files, the annotation file (CSV or JSON) must have the same name as the video with the extension changed, e.g. `video.mp4` pairs with `video.csv` or `video.json`. Those annotations are imported automatically when the S3/GCP indexing/importing runs.
* **Image Sequence** - Image sequences should already be in their own folder. The annotation file (CSV or JSON) just needs to be in the same folder; its name doesn't matter during importing.

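The two matching rules above can be sketched as a small helper. This is hypothetical illustration code, not part of DIVE; the function names and the CSV/JSON extension check are assumptions based on the rules described:

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_video_annotation(video: Path) -> Optional[Path]:
    """A video pairs with a CSV/JSON file sharing its stem (video.mp4 -> video.json)."""
    for ext in (".csv", ".json"):
        candidate = video.with_suffix(ext)
        if candidate.exists():
            return candidate
    return None


def find_sequence_annotation(folder: Path) -> Optional[Path]:
    """An image-sequence folder pairs with any CSV/JSON file inside it, regardless of name."""
    for candidate in sorted(folder.iterdir()):
        if candidate.suffix in (".csv", ".json"):
            return candidate
    return None


# Quick demonstration against a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "video.mp4").touch()
    (root / "video.json").touch()
    seq = root / "frames"
    seq.mkdir()
    (seq / "frame_0001.jpg").touch()
    (seq / "tracks.json").touch()
    video_match = find_video_annotation(root / "video.mp4").name
    seq_match = find_sequence_annotation(seq).name
    print(video_match, seq_match)  # video.json tracks.json
```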
### Pub/Sub notifications

Creating pub/sub notifications is **optional**, but will keep your mount point up-to-date automatically with new data added to the bucket. In order to make use of this feature, your DIVE server must have a public static IP address or domain name.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
*.json
*.jpg
*.jpeg
*.mp4
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
# Sample Data Generation and MinIO Setup for DIVE

This folder contains scripts to generate sample video and image-sequence data, and to host it in a MinIO bucket suitable for importing into DIVE using the AssetStore importing tool.

Prerequisites

- UV installed for running scripts
- ffmpeg installed and available in your PATH
- docker installed and running
- the DIVE docker compose web application up and running

1. Generate Sample Data

The script generateSampleData.py creates a folder structure containing:

- Videos: MP4 format, H.264 codec, random duration (5–30 seconds), 1280x720 resolution.
- Image Sequences: Extracted frames from temporary videos, stored as sequential JPGs.
- Annotations: Each video or image sequence is accompanied by a JSON annotation file describing moving or scaling geometric shapes (rectangle, star, circle, diamond) per frame.
  - Videos: the annotation JSON file has the same name as the video file, just with the extension `.json`.
  - Image Sequences: any `.json` file in the same folder will be imported as the annotations.
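For reference, the generated annotation files follow the track JSON structure (schema version 2) that generateSampleData.py emits. A minimal single-frame sketch, with made-up coordinate and label values:

```python
import json

# Minimal DIVE track annotation (schema version 2), mirroring the structure
# generateSampleData.py writes; the bounds and polygon values here are made up.
annotation = {
    "tracks": {
        "0": {
            "id": 0,
            "meta": {"shape": "rectangle"},
            "attributes": {},
            "confidencePairs": [["sample", 0.9]],
            "begin": 0,
            "end": 0,
            "features": [
                {
                    "frame": 0,
                    "bounds": [100, 100, 200, 200],  # [x1, y1, x2, y2]
                    "keyframe": True,
                    "geometry": {
                        "type": "FeatureCollection",
                        "features": [{
                            "type": "Feature",
                            "geometry": {
                                "type": "Polygon",
                                "coordinates": [[[100, 100], [200, 100],
                                                 [200, 200], [100, 200],
                                                 [100, 100]]],
                            },
                            "properties": {"key": ""},
                        }],
                    },
                }
            ],
        }
    },
    "groups": {},
    "version": 2,
}

# The file must round-trip cleanly as JSON for DIVE to import it
print(json.dumps(annotation)[:20])
```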

Usage

```bash
uv run --script generateSampleData.py
```

- Most fields are optional, and the default values should work well enough.
- --output (-o): Base output directory (default: ./sample)
- --folders (-f): Number of top-level folders (default: 3)
- --max-depth (-d): Maximum subfolder depth (default: 2)
- --videos (-v): Maximum videos per folder (default: 2)
- --total (-t): Total number of datasets (videos or image sequences) to create (default: 10)

The script randomly generates videos or image sequences with annotations inside the output directory, which defaults to ./sample.

2. Setup MinIO and Upload Sample Data

The script minIOConfig.py launches a MinIO server in Docker, creates a bucket, and uploads the generated sample data. This bucket can then be mounted in DIVE for importing datasets.

Usage

```bash
uv run --script minIOConfig.py
```

- --data-dir (-d): Path to the folder containing the generated sample data (default: ./sample)
- --api-port: Port for S3 API access (default: 9000)
- --console-port: Port for MinIO Console access (default: 9001)

What it does

1. Starts a MinIO server in Docker (minio_server) with persistent storage.
2. Starts a persistent mc client container (minio_client) to configure the bucket.
3. Creates a bucket called dive-sample-data.
4. Uploads all generated sample data into the bucket.
5. Creates an access key and secret key for S3 API access.
6. Provides URLs to access the MinIO Console and API endpoints.

Example Output

✅ MinIO setup complete!
Console: http://localhost:9001 (user: rootuser / rootpass123)
S3 API: http://minio_server:9000
S3 API: http://172.19.0.9:9000
Bucket: dive-sample-data

3. Mount Bucket in DIVE

1. Open DIVE.
2. Go to the Girder interface at http://localhost:8010/girder
3. Create a new Collection or a folder as the destination for importing this S3 data.
4. Go to Add S3 Dataset.
   - http://localhost:8010/girder#assetstores
   - Click on "Create new Amazon S3 assetstore".
5. Enter the credentials for importing:
   - Assetstore Name: `Test MinIO Importing` (can be a custom name)
   - S3 bucket Name: `dive-sample-data`
   - Path prefix (optional): leave it blank
   - Access key ID: `OMKF2I2NOD7JGYZ9XHE3`
   - Secret access key: `xbze+fJ6Wrfplq17JjSCZZJSz7AxEwRWm1MZXH2O`
   - Service: `http://{S3 API IP Address}:9000`, e.g. `http://172.19.0.9:9000` if that address appears in the console output
   - Region: leave it blank
   - Options can be left unchecked
6. After saving, you should return to the assetstore list.
7. Find the `Test MinIO Importing` assetstore (or the custom name you used).
8. Click on Import data.
   - Choose either the folder or Collection created in Step 3 and import the data.
   - Importing can take a while depending on the number of datasets created.
9. After the import completes, processing tasks kick off to check the videos and convert them if needed.
   - Check the http://localhost:8010/girder#jobs page to see that the import jobs have completed.
10. Finally, go to the regular DIVE interface (http://localhost:8010), click the globe in the breadcrumb bar at the top of your user directory, and navigate to your collection.
    - You should be able to open the DIVE datasets, and they should contain random annotations.
    - Video datasets should play annotations at a speed relative to the video; image sequences default to 1 FPS.
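As a rough sketch of that playback behavior, an annotation frame index maps to a timestamp by dividing by the dataset's frame rate. The helper below is hypothetical and only illustrates the arithmetic (the sample videos are generated at 30 FPS; image sequences fall back to 1 FPS):

```python
def frame_to_seconds(frame: int, fps: float) -> float:
    """Convert an annotation frame index to a playback timestamp in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame / fps


# Frame 90 of a 30 FPS sample video is 3 seconds in;
# the same frame index in a 1 FPS image sequence is 90 seconds in.
print(frame_to_seconds(90, 30.0))  # 3.0
print(frame_to_seconds(90, 1.0))   # 90.0
```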

Notes

- The sample data is randomized for testing purposes: moving shapes, bouncing objects, and varying sizes.
- Image sequences are generated by extracting frames from temporary videos.
- MinIO is run in Docker for ease of setup and cleanup.
- Once you stop the minio_server docker container, you won't have access to the video files or image sequences anymore. They still appear in Girder, but attempting to load them will show blank media.

Cleanup

To remove MinIO containers:

docker rm -f minio_server
Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
# /// script
# requires-python = ">=3.8"
# dependencies = [
#     "click",
#     "faker",
# ]
# ///
import random
import subprocess
import json
from pathlib import Path
import click
from faker import Faker
import math

fake = Faker()

FRAME_WIDTH = 1280
FRAME_HEIGHT = 720


def create_random_video(file_path: Path, duration: int):
    """Create a random test video using ffmpeg (MP4 container, H.264 codec)."""
    cmd = [
        "ffmpeg", "-y",
        "-f", "lavfi", "-i", "testsrc=size=1280x720:rate=30",
        "-t", str(duration),
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",
        str(file_path)
    ]
    subprocess.run(cmd, check=True)


def extract_frames_from_video(video_path: Path, image_dir: Path):
    """Extract frames from a video and save as sequential JPG files."""
    image_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        "ffmpeg", "-y",
        "-i", str(video_path),
        str(image_dir / "frame_%04d.jpg")
    ]
    subprocess.run(cmd, check=True)


def generate_star_points(cx, cy, r, spikes=5):
    """Generate points for a star polygon."""
    pts = []
    angle = math.pi / spikes
    for i in range(2 * spikes):
        radius = r if i % 2 == 0 else r / 2
        x = cx + math.cos(i * angle) * radius
        y = cy + math.sin(i * angle) * radius
        pts.append([x, y])
    pts.append(pts[0])  # close polygon
    return pts


def generate_diamond_points(cx, cy, r):
    return [
        [cx, cy - r],
        [cx + r, cy],
        [cx, cy + r],
        [cx - r, cy],
        [cx, cy - r],
    ]


def generate_circle_points(cx, cy, r, segments=12):
    pts = []
    for i in range(segments + 1):
        angle = 2 * math.pi * i / segments
        x = cx + math.cos(angle) * r
        y = cy + math.sin(angle) * r
        pts.append([x, y])
    return pts


def generate_geometry(shape: str, cx: float, cy: float, size: float):
    """Return GeoJSON polygon of the shape centered at (cx, cy)."""
    if shape == "star":
        coords = generate_star_points(cx, cy, size)
    elif shape == "diamond":
        coords = generate_diamond_points(cx, cy, size)
    elif shape == "circle":
        coords = generate_circle_points(cx, cy, size)
    else:  # rectangle fallback
        half = size
        coords = [
            [cx - half, cy - half],
            [cx + half, cy - half],
            [cx + half, cy + half],
            [cx - half, cy + half],
            [cx - half, cy - half]
        ]
    return {
        'geojson': {
            "type": "FeatureCollection",
            "features": [
                {
                    "type": "Feature",
                    "geometry": {
                        "type": "Polygon",
                        "coordinates": [coords]
                    },
                    "properties": {"key": ""}
                }
            ]
        },
        'coords': coords
    }


def geometry_bounds(coords):
    """Calculate bounding box [x1, y1, x2, y2] from polygon points."""
    xs = [pt[0] for pt in coords]
    ys = [pt[1] for pt in coords]
    return [min(xs), min(ys), max(xs), max(ys)]


def generate_annotation_json(num_frames: int, output_file: Path):
    """Generate annotation JSON with moving/scaling geometry."""
    num_tracks = random.randint(3, 5)
    tracks = {}

    for i in range(num_tracks):
        track_id = i
        shape_type = random.choice(["circle", "rectangle", "diamond", "star"])
        begin = 0
        end = num_frames - 1

        # Initial position and motion
        x, y = random.randint(100, FRAME_WIDTH - 100), random.randint(100, FRAME_HEIGHT - 100)
        dx, dy = random.choice([-5, 5]), random.choice([-3, 3])
        base_size = random.randint(40, 80)
        growth_rate = random.uniform(0.05, 0.15)

        features = []
        for frame in range(num_frames):
            # Update position and bounce
            x += dx
            y += dy
            if x < 50 or x > FRAME_WIDTH - 50:
                dx *= -1
                x += dx
            if y < 50 or y > FRAME_HEIGHT - 50:
                dy *= -1
                y += dy

            # Smooth scaling
            scale = 0.5 * (1 + math.sin(growth_rate * frame))
            size = base_size * (0.75 + 0.5 * scale)

            # Create moving geometry
            output_data = generate_geometry(shape_type, x, y, size)
            geom = output_data['geojson']
            coords = output_data['coords']
            bounds = geometry_bounds(coords)

            feature = {
                "frame": frame,
                "bounds": bounds,
                "keyframe": True,
                "geometry": geom
            }
            features.append(feature)

        tracks[str(track_id)] = {
            "id": track_id,
            "meta": {"shape": shape_type},
            "attributes": {},
            "confidencePairs": [[fake.word(), float(random.randrange(0, 100) / 100)]],
            "begin": begin,
            "end": end,
            "features": features
        }

    annotation = {
        "tracks": tracks,
        "groups": {},
        "version": 2
    }
    with open(output_file, "w") as f:
        json.dump(annotation, f, indent=2)


def create_video_content(base_dir: Path, max_videos: int, counter: dict, total: int):
    """Create videos and associated JSON annotations."""
    if counter['count'] >= total:
        return
    num_videos = random.randint(1, max_videos)
    for _ in range(num_videos):
        if counter['count'] >= total:
            break
        duration = random.randint(5, 30)
        name = fake.word() + ".mp4"
        video_path = base_dir / name
        create_random_video(video_path, duration)
        counter['count'] += 1

        # Generate annotation JSON
        generate_annotation_json(duration * 30, video_path.with_suffix(".json"))


def create_image_sequence_content(base_dir: Path, counter: dict, total: int):
    """Create image sequence from a temporary video and generate JSON annotations."""
    if counter['count'] >= total:
        return
    duration = random.randint(5, 30)
    tmp_video = base_dir / (fake.word() + "_tmp.mp4")
    create_random_video(tmp_video, duration)
    seq_folder = base_dir / (tmp_video.stem.replace("_tmp", "") + "_frames")
    extract_frames_from_video(tmp_video, seq_folder)
    tmp_video.unlink()
    counter['count'] += 1

    # Generate annotation JSON for image sequence folder
    annotation_file = seq_folder / (seq_folder.stem + ".json")
    generate_annotation_json(duration * 30, annotation_file)


def create_folder_structure(base_dir: Path, depth: int, max_depth: int,
                            max_videos: int, counter: dict, total: int):
    """Recursively create folders with either videos or image sequences."""
    if counter['count'] >= total:
        return

    content_type = random.choice(["video", "images"])
    if content_type == "video":
        create_video_content(base_dir, max_videos, counter, total)
    else:
        create_image_sequence_content(base_dir, counter, total)

    if counter['count'] >= total:
        return

    num_subfolders = random.randint(0, 3)
    for _ in range(num_subfolders):
        if counter['count'] >= total:
            break
        subfolder = base_dir / fake.word()
        subfolder.mkdir(parents=True, exist_ok=True)
        if depth < max_depth:
            create_folder_structure(subfolder, depth + 1, max_depth, max_videos, counter, total)
        else:
            leaf_type = random.choice(["video", "images"])
            if leaf_type == "video":
                create_video_content(subfolder, max_videos, counter, total)
            else:
                create_image_sequence_content(subfolder, counter, total)


@click.command()
@click.option('--output', '-o', default='./sample', show_default=True,
              type=click.Path(file_okay=False), help="Base output directory")
@click.option('--folders', '-f', default=3, show_default=True,
              help="Number of top-level folders to create")
@click.option('--max-depth', '-d', default=2, show_default=True,
              help="Maximum subfolder depth")
@click.option('--videos', '-v', default=2, show_default=True,
              help="Maximum videos per folder")
@click.option('--total', '-t', default=10, show_default=True,
              help="Total number of datasets (videos or image sequences)")
def main(output, folders, max_depth, videos, total):
    base_path = Path(output)
    base_path.mkdir(parents=True, exist_ok=True)

    counter = {'count': 0}
    click.echo(f"Generating up to {total} datasets in {base_path}...")
    for _ in range(folders):
        if counter['count'] >= total:
            break
        folder_path = base_path / fake.word()
        folder_path.mkdir(parents=True, exist_ok=True)
        create_folder_structure(folder_path, 1, max_depth, videos, counter, total)

    click.echo(f"Done! Created {counter['count']} datasets.")


if __name__ == '__main__':
    main()
