Achieving 1 Billion Row Spatial Joins on a Single GPU with H3-Turbo (SYCL) #1137
cflockhart started this conversation in Show and tell
Hi everyone,
I wanted to share an exciting milestone we’ve reached with h3-turbo, a project we've been working on.
We set out to see how far we could push H3's performance by moving core operations entirely to the GPU. To benchmark this, we replicated Query 11 from SpatialBench (a massive Point-in-Polygon spatial join commonly used to benchmark distributed systems like Apache Sedona).
The Benchmark:
Dataset: 1.1 Billion points (pings) joined against 100 Million polygons (zones).
Hardware: A single consumer-grade GPU node (e.g., an RTX 4090 with 24 GB VRAM and >20 GB system RAM on RunPod).
Framework: Python, utilizing h3-turbo with dynamic GPU batching to prevent OOM errors.
The Result: Instead of spinning up a multi-node Spark cluster with massive memory and S3 overhead, we successfully executed the entire 1.1B x 100M spatial join on a single machine.
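To make the dynamic-batching idea concrete, here is a minimal CPU-runnable sketch of the driver loop. The `process_batch` stand-in and the per-row byte estimate are illustrative assumptions, not h3-turbo's actual API:

```python
import numpy as np

# Minimal sketch of dynamic batching: split a huge uint64 cell array into
# batches sized against a memory budget so each GPU launch fits in VRAM.
# `process_batch` is a hypothetical stand-in for the real GPU join kernel;
# here it just echoes its input so the driver loop is runnable on CPU.
def process_batch(cells: np.ndarray) -> np.ndarray:
    return cells  # placeholder for the GPU point-in-polygon join

def batched_run(cells: np.ndarray, budget_bytes: int,
                bytes_per_row: int = 64) -> np.ndarray:
    # bytes_per_row is an assumed working-set estimate per input row.
    batch_rows = max(1, budget_bytes // bytes_per_row)
    parts = [process_batch(cells[i:i + batch_rows])
             for i in range(0, len(cells), batch_rows)]
    return np.concatenate(parts)

cells = np.arange(1_000_000, dtype=np.uint64)
result = batched_run(cells, budget_bytes=8 * 1024 * 1024)  # toy 8 MiB budget
```

On a 24 GB card the same loop simply gets a larger budget; the point is that the driver never materializes more than one batch on the device at a time, which is what keeps a 1.1B-row input from OOMing.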
How it works under the hood:
SYCL / AdaptiveCpp: We ported core H3 logic (like latlng_to_cell and cell_to_parent) into SYCL kernels. This allows the code to run natively on NVIDIA, AMD, and Intel GPUs from a single codebase, although it has so far only been tested on NVIDIA due to the scarcity of AMD and Intel GPUs on sites like RunPod. We can create builds for AMD and Intel for anyone who wants to try those architectures.
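For a feel of what porting core H3 logic involves, here is a vectorized NumPy sketch of `cell_to_parent` following the published H3 index bit layout (a 4-bit resolution field at bits 52-55, fifteen 3-bit digits below it, with unused digits set to 0b111). This is a CPU reference for illustration, not the library's actual SYCL kernel:

```python
import numpy as np

# H3 stores the cell resolution in bits 52-55 of the uint64 index.
H3_RES_OFFSET = np.uint64(52)
H3_RES_MASK = np.uint64(0xF) << H3_RES_OFFSET

def get_resolution(h: np.ndarray) -> np.ndarray:
    return (h >> H3_RES_OFFSET) & np.uint64(0xF)

def cell_to_parent(h: np.ndarray, parent_res: int) -> np.ndarray:
    # 1. Overwrite the resolution field with the parent resolution.
    out = (h & ~H3_RES_MASK) | (np.uint64(parent_res) << H3_RES_OFFSET)
    # 2. Mark every digit finer than parent_res as unused (0b111).
    #    Digit r (1-indexed) lives at bits (15 - r) * 3 .. + 2.
    for r in range(parent_res + 1, 16):
        out = out | (np.uint64(0x7) << np.uint64((15 - r) * 3))
    return out

# Res-10 cell -> its res-5 ancestor, as plain uint64 bit manipulation.
parent = cell_to_parent(np.array([0x8A2A1072B59FFFF], dtype=np.uint64), 5)
```

A SYCL kernel would presumably run the same bit manipulation per work-item over the device array; it is pure integer arithmetic with no lookups, which is what makes these operations so GPU-friendly.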
Zero-Copy Python Bindings: The library takes standard NumPy arrays (uint64) directly from Python and pushes them to the GPU.
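To show what "zero-copy" means at the binding layer, here is a small demonstration of the mechanism: a contiguous NumPy uint64 array exposes a raw data pointer that a C/SYCL extension can read in place, with no copy. How h3-turbo consumes that pointer internally is an assumption; only the NumPy/ctypes mechanics below are shown:

```python
import ctypes
import numpy as np

# A contiguous uint64 array of H3 cells, as the bindings would receive it.
cells = np.array([0x8A2A1072B59FFFF, 0x8A2A1072B597FFF], dtype=np.uint64)
assert cells.flags["C_CONTIGUOUS"]  # bindings typically require contiguity

# The raw pointer a native extension would receive (no data is copied):
ptr = cells.ctypes.data_as(ctypes.POINTER(ctypes.c_uint64))
first = ptr[0]  # reading through the pointer sees the same memory
```

Because the pointer aliases the array's own buffer, writes on either side are immediately visible on the other, which is exactly the property that lets the library push arrays to the GPU without an intermediate copy on the host.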
PySpark Pandas UDFs: For those already using Databricks or Spark, we built vectorized Pandas UDF wrappers (e.g., spatial_join_udf, latlng_to_cell_udf) that map distributed DataFrame partitions directly onto the GPU hardware.
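The vectorized UDF pattern can be sketched as follows. `fake_latlng_to_cell` is a hypothetical stand-in for the GPU call (it returns zeros, not real H3 cells); the shape of the wrapper is the point, since it is what lets each Spark partition map onto one GPU launch:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the GPU kernel; real bindings would push the
# arrays to the device and return computed H3 cells.
def fake_latlng_to_cell(lat: np.ndarray, lng: np.ndarray, res: int) -> np.ndarray:
    return np.zeros(len(lat), dtype=np.uint64)  # placeholder, not real H3 math

def latlng_to_cell_pandas(lat: pd.Series, lng: pd.Series) -> pd.Series:
    # Each Spark partition arrives as pandas Series; forward the whole
    # partition to the device in one vectorized call, then hand back a Series.
    cells = fake_latlng_to_cell(lat.to_numpy(), lng.to_numpy(), res=9)
    return pd.Series(cells)

# On Spark, a function of this shape is registered with
# pyspark.sql.functions.pandas_udf, so partitions stream through Arrow
# batches instead of a Python loop per row.
```

The key design choice is that the UDF body never touches individual rows: everything stays as NumPy arrays end to end, so the per-row Python overhead that normally dominates Spark UDFs disappears.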
If anyone is dealing with massive-scale geospatial joins or heavy H3 aggregations and wants to dramatically reduce their compute footprint (moving from a cluster to a single GPU), I'd love to hear your thoughts or use cases!
Sample Jupyter notebooks with results at: