Sample iOS app for microsoft/MoGe (CVPR 2025 Oral) — open-domain monocular 3D geometry estimation. Given a single photo it predicts a metric depth map, surface normals, and a confidence mask in one forward pass on a DINOv2 ViT-B backbone with three task heads.
This demo uses the MoGe-2 ViT-B + normal variant (104 M parameters) at a fixed 504 × 504 input. The CoreML model is one self-contained mlpackage (~200 MB at FP16) that returns five tensors: `points`, `depth`, `normal`, `mask`, `metric_scale`.
Left: original photo, center: metric depth (turbo colormap), right: surface normals.
| Module | Inputs | Outputs |
|---|---|---|
| `MoGe2_ViTB_Normal_504` | `image` [1, 3, 504, 504] in [0, 1] | `points` [1, 504, 504, 3], `depth` [1, 504, 504], `normal` [1, 504, 504, 3], `mask` [1, 504, 504], `metric_scale` [1] |
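
You can confirm this I/O contract from Python on a Mac before touching Xcode. A minimal sketch (the names and shapes come from the table above; nothing else is assumed):

```python
# Sanity-check the mlpackage I/O on macOS with coremltools.
import coremltools as ct

mlmodel = ct.models.MLModel("MoGe2_ViTB_Normal_504.mlpackage")
spec = mlmodel.get_spec()
print(spec.description.input)    # the single 504x504 image input
print(spec.description.output)   # points, depth, normal, mask, metric_scale
```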
- Download `MoGe2_ViTB_Normal_504.mlpackage.zip` (or build it from `conversion_scripts/convert_moge2.py`)
- Unzip and drop `MoGe2_ViTB_Normal_504.mlpackage` into the Xcode project
- Build and run on a physical device (iOS 17+)
- DINOv2 backbone with a frozen `pos_embed`. The stock DINOv2 `interpolate_pos_encoding` does a bicubic + antialias resize of the pretrained positional embedding every forward call; coremltools cannot trace bicubic + antialias cleanly. The converter pre-computes the interpolated `pos_embed` once for the fixed 36 × 36 token grid and replaces the method with a constant lookup, so the traced graph never hits bicubic interpolation (see the sketch after this list). `onnx_compatible_mode = True` disables the antialias path in `DINOv2Encoder.forward` as well, leaving only bilinear `F.interpolate` calls that coremltools handles natively.
- Aspect-ratio-aware `num_tokens` path is collapsed. `MoGeModel` computes `base_h, base_w` at runtime from `(num_tokens, aspect_ratio)`. The wrapper hard-codes `base_h = base_w = 36` (= 504 / 14) so the trace sees Python ints all the way through.
- Pyramid features and UV grids are pre-computed. `normalized_view_plane_uv` for each of the 5 levels is registered as a non-persistent buffer at wrapper construction time, so the converted graph contains no `linspace`/`meshgrid` ops.
- `int` op patch for multi-dim shape casts is required (same as SinSR / Swin Transformer). DINOv2's positional indexing emits int casts on a 2-element shape tensor that the stock coremltools converter assumes are scalars. The patched op accepts both.
- Outputs match `MoGeModel.forward` with `remap_output='exp'` baked in (the wrapper inlines the `xy * z, z = exp(z)` remap so Swift just receives the final `(B, H, W, 3)` point map plus a depth slice).
- Focal / shift / intrinsics recovery is left to the Swift driver. The CoreML model returns the affine point map plus `metric_scale`; the demo app applies the scale and ignores any focal-shift refinement (good enough for visualization). For metric SLAM-style use you would port `recover_focal_shift` to Swift.
- FP16 throughout. DINOv2 ViT attention does not overflow at this resolution; FP16 + `.cpuAndNeuralEngine` runs comfortably on iPhone 15 / 17. Real-image parity vs the PyTorch reference is ~1 % relative on depth and < 6° on surface normals.
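
The `pos_embed` freezing works roughly like this. This is a minimal sketch, not the converter's actual code (see `conversion_scripts/convert_moge2.py` for that); the function name `bake_pos_embed` and the CLS-token-first `pos_embed` layout are assumptions based on stock DINOv2:

```python
# Sketch: pre-compute the bicubic-interpolated pos_embed once, eagerly,
# then swap in a constant lookup so tracing never sees bicubic + antialias.
# Assumes a DINOv2-style backbone with pos_embed of shape (1, 1 + N, D).
import torch
import torch.nn.functional as F

def bake_pos_embed(backbone, grid: int = 36):
    pos = backbone.pos_embed                      # (1, 1 + N, D), CLS first
    cls_tok, patch_tok = pos[:, :1], pos[:, 1:]
    n = int(patch_tok.shape[1] ** 0.5)            # pretrained grid side
    d = patch_tok.shape[-1]
    patch_tok = patch_tok.reshape(1, n, n, d).permute(0, 3, 1, 2)
    patch_tok = F.interpolate(patch_tok, size=(grid, grid),
                              mode="bicubic", antialias=True)  # once, outside the trace
    patch_tok = patch_tok.permute(0, 2, 3, 1).reshape(1, grid * grid, d)
    baked = torch.cat([cls_tok, patch_tok], dim=1)
    backbone.register_buffer("pos_embed_baked", baked, persistent=False)
    # Constant lookup: the traced graph just reads a buffer.
    backbone.interpolate_pos_encoding = lambda x, w, h: backbone.pos_embed_baked
```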
- Stretch-resize the photo to 504 × 504 (the converted graph bakes `aspect_ratio = 1.0` into its UV grids, so a plain square resize is what the model expects; the original aspect is restored at display time).
- Wrap in a BGRA `CVPixelBuffer` (the model's `ImageType` input applies `scale = 1/255` automatically).
- Run `MLModel.prediction`. Read `depth`, `normal`, `mask`, `metric_scale` with stride-aware access (the ANE returns non-contiguous strides; see Basic Pitch in `docs/coreml_conversion_notes.md`).
- Mask out background pixels and multiply depth by `metric_scale` to get meters (mirrored in the Python sketch after this list).
- Render: turbo colormap for depth (near = warm), surface-normal RGB for normals (`nx, -ny, nz` → `r, g, b`). The SwiftUI view re-applies the photo's original aspect so the `original`/`depth`/`normal` toggles line up pixel-for-pixel.
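
For parity checks against the Swift renderer, the same post-processing can be run on a Mac with coremltools. A hedged sketch: the I/O names come from the table above, while `photo.jpg` and the 0.5 mask threshold are illustrative choices, not values taken from the demo app:

```python
# Desktop-side mirror of the Swift pipeline, useful for eyeballing parity.
import coremltools as ct
import numpy as np
from PIL import Image

model = ct.models.MLModel("MoGe2_ViTB_Normal_504.mlpackage")
img = Image.open("photo.jpg").convert("RGB").resize((504, 504))  # stretch-resize

out = model.predict({"image": img})         # ImageType inputs accept PIL images
depth = out["depth"][0]                     # (504, 504)
mask = out["mask"][0] > 0.5                 # foreground confidence (assumed threshold)
depth_m = depth * float(out["metric_scale"][0])  # apply metric scale -> meters
depth_m[~mask] = np.nan                     # drop background pixels

normal = out["normal"][0]                   # (504, 504, 3) in [-1, 1]
normal_rgb = ((normal * [1.0, -1.0, 1.0] + 1.0) * 127.5).astype(np.uint8)
# depth visualization: e.g. matplotlib's "turbo" colormap, near = warm
```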
All variants are produced by `conversion_scripts/convert_moge2.py`:

```bash
python convert_moge2.py                                   # ViT-B normal, 504×504
python convert_moge2.py --variant vits-normal --size 392  # smaller / faster
python convert_moge2.py --variant vitl-normal --size 504  # higher quality, ~660 MB
```

