
Commit b6d9801

Simple GeluAndMul kernels example

Add a Hugging Face kernel hub example using the kernels-community/activation GeluAndMul kernel.

Signed-off-by: Steven Royer <sroyer@redhat.com>

1 parent 66cae79

3 files changed: 147 additions & 0 deletions


Containerfile

Lines changed: 22 additions & 0 deletions
FROM nvcr.io/nvidia/cuda:13.0.3-cudnn-devel-ubi9

RUN dnf update -y && \
    dnf install -y python3.12 python3.12-pip python3.12-devel vim

RUN pip3.12 install --upgrade pip
RUN pip3.12 install uv

WORKDIR /src

# Create virtual env
RUN uv venv venv --python 3.12
ENV PATH="/src/venv/bin:/src/venv/lib64/python3.12/site-packages/nvidia/cu13/bin:$PATH"
ENV VIRTUAL_ENV=/src/venv
ENV UV_LINK_MODE=copy

RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install torch==2.11 torchvision kernels

COPY gelu-and-mul-test.py /src/

README.md

Lines changed: 78 additions & 0 deletions
# Hugging Face kernel hub example

This example uses the Hugging Face kernel hub to download a pre-compiled
kernel. Here, the gelu_and_mul kernel from
[activation](https://huggingface.co/kernels/kernels-community/activation)
is demonstrated. At the time of writing, this kernel set has builds for
Nvidia CUDA and Apple Metal; check the kernel card for the exact set of
supported hardware.
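
For a quick feel of the hub interface before building anything, a minimal
sketch (not part of this commit) that fetches the same kernel directly via
`kernels.get_kernel` might look like the following. The `gelu_and_mul(out, x)`
call and the halved last dimension of `out` are assumptions based on the
vLLM-style activation op that the test script below reimplements:

```python
# Minimal sketch, assuming a CUDA device and the kernels package installed.
import torch
from kernels import get_kernel

# Downloads the pre-compiled binaries for this platform on first use.
activation = get_kernel("kernels-community/activation")

x = torch.randn(32, 512, device="cuda", dtype=torch.bfloat16)
# gelu_and_mul writes gelu(x[..., :d]) * x[..., d:] into out, halving the last dim.
out = torch.empty(32, 256, device="cuda", dtype=torch.bfloat16)
activation.gelu_and_mul(out, x)
```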

## Build

There is a Containerfile that encapsulates the environment needed to run a
torch application that uses the kernels interface on Nvidia GPUs.

```bash
podman build . -t gelu:latest
```

If you prefer not to use containers (for example, to try it on Apple Metal),
you can read the Containerfile and recreate what it does in a local Python
virtual environment, as sketched below.
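
A rough local equivalent of the Containerfile's steps (assuming `python3.12`
and `uv` are already installed; the `torch==2.11` pin mirrors the
Containerfile) would be:

```bash
# Rough local equivalent of the Containerfile steps.
uv venv venv --python 3.12
source venv/bin/activate
uv pip install torch==2.11 torchvision kernels
python gelu-and-mul-test.py
```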

## Run

Set the HF_TOKEN environment variable to your Hugging Face token. Then:

```bash
podman run -it --rm --device nvidia.com/gpu=all --security-opt=label=disable -e HF_TOKEN=${HF_TOKEN} gelu:latest python3.12 gelu-and-mul-test.py
```

The output should look something like this if you have a supported Nvidia GPU
and a driver version >= 580:

```bash
==========
== CUDA ==
==========

CUDA Version 13.0.3

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Fetching 6 files: 100%|███████████████████████████████████████████████████████████████| 6/6 [00:01<00:00, 3.99it/s]
Download complete: : 4.18MB [00:01, 3.30MB/s] Success! | 2/6 [00:01<00:03, 1.17it/s]
Download complete: : 4.18MB [00:01, 2.62MB/s]
```

Note that the kernels package writes download progress information to stderr.
If you want cleaner output, you can redirect stderr to /dev/null; just be
aware that this also hides error details, so debugging failures will be
harder that way.

```bash
$ podman run -it --rm --device nvidia.com/gpu=all --security-opt=label=disable -e HF_TOKEN=${HF_TOKEN} gelu:latest bash

==========
== CUDA ==
==========

CUDA Version 13.0.3

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

[root@fc9f150b0e9e src]# python3.12 gelu-and-mul-test.py 2> /dev/null
Success!
```

gelu-and-mul-test.py

Lines changed: 47 additions & 0 deletions
import torch
import torch.nn as nn
import torch.nn.functional as F

from kernels import use_kernel_forward_from_hub
from kernels import use_kernel_mapping, LayerRepository
from kernels import Mode, kernelize

# Define the hub kernel to use for this test on cuda devices
kernel_layer_mapping = {
    "GeluAndMul": {
        "cuda": LayerRepository(
            repo_id="kernels-community/activation",
            layer_name="GeluAndMul",
            version=1,
        )
    }
}

# Implement the torch fallback method and request to use a hub kernel if available
@use_kernel_forward_from_hub("GeluAndMul")
class GeluAndMul(nn.Module):
    """Implementation from https://github.com/vllm-project/vllm
    vllm/model_executor/layers/activation.py
    """
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = x.shape[-1] // 2
        return F.gelu(x[..., :d], approximate="none") * x[..., d:]

# Run the pure torch method first so we can compare the hub kernel
x = torch.randn(32, 512, device="cuda", dtype=torch.bfloat16)
model = GeluAndMul()
torch_out = model(x)
hub_out = None

# Run the hub optimized kernel now
with use_kernel_mapping(kernel_layer_mapping):
    # Tell kernels that we want to do inference and enable torch.compile
    model = kernelize(model, device="cuda", mode=Mode.INFERENCE | Mode.TORCH_COMPILE)

    hub_out = model(x)

# Make sure the hub optimized kernel gives the same output
if torch.allclose(hub_out, torch_out, atol=1e-3, rtol=1e-3):
    print("Success!")
else:
    print("Failed...")
