@@ -983,6 +983,53 @@ def from_lmm(cls, lmm: LMM | str, result: str | dict, **kwargs: Any) -> Detectio
983983 ```
984984
985985 !!! example "Gemini 2.0"
986+
987+ ??? tip "Prompt engineering"
988+
989+ From Gemini 2.0 onwards, models are further trained to detect objects in
990+ an image and return their bounding box coordinates. The coordinates are
991+ normalized to the range [0, 1000] relative to the image dimensions, so
992+ you need to convert them back to pixel coordinates using the original
993+ image size.
994+
995+ According to the [Gemini API documentation on image prompts](
996+ https://ai.google.dev/gemini-api/docs/vision#image-input), when using a
997+ single image with text, the recommended approach is to place the text
998+ prompt after the image part in the `contents` array. This ordering tends
999+ to produce better results in practice.
1000+
1001+ For example, when calling the `generateContent` REST endpoint directly
1002+ (the model is specified in the request URL, e.g.
1003+ `models/gemini-2.0-flash:generateContent`, not in the body), you can
1004+ structure the request with the image part first and the text prompt
1005+ second in the `parts` list:
1006+
1007+ ```json
1008+ {
1009+   "contents": [
1010+     {
1011+       "role": "user",
1012+       "parts": [
1013+         {
1014+           "inline_data": {
1015+             "mime_type": "image/png",
1016+             "data": "<BASE64_IMAGE_BYTES>"
1017+           }
1018+         },
1019+         {
1020+           "text": "Detect all the cats and dogs in the image..."
1021+         }
1022+       ]
1023+     }
1024+   ]
1025+ }
1026+ ```
1026+ To get the best results from Google Gemini 2.0, use the following prompt.
1027+
1028+ ```
1029+ Detect all the cats and dogs in the image. The box_2d should be
1030+ [ymin, xmin, ymax, xmax] normalized to 0-1000.
1031+ ```
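The normalized `box_2d` values can then be mapped back to pixels with a small rescaling helper. A minimal sketch (the helper name `box_2d_to_pixels` is illustrative, not part of supervision):

```python
def box_2d_to_pixels(box_2d, image_width, image_height):
    # Gemini returns [ymin, xmin, ymax, xmax] normalized to 0-1000;
    # rescale each coordinate by the matching image dimension.
    ymin, xmin, ymax, xmax = box_2d
    return (
        xmin / 1000 * image_width,
        ymin / 1000 * image_height,
        xmax / 1000 * image_width,
        ymax / 1000 * image_height,
    )

# A box on a 640x480 image:
print(box_2d_to_pixels([100, 250, 500, 750], 640, 480))
# (160.0, 48.0, 480.0, 240.0)
```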
1032+
9861033 ```python
9871034 import supervision as sv
9881035
@@ -1020,6 +1067,31 @@ def from_lmm(cls, lmm: LMM | str, result: str | dict, **kwargs: Any) -> Detectio
10201067 including small, distant, or partially visible ones, and to return
10211068 tight bounding boxes.
10221069
1070+ According to the [Gemini API documentation on image prompts](
1071+ https://ai.google.dev/gemini-api/docs/vision#multi-part-input), when using
1072+ a single image with text, the recommended approach is to place the text
1073+ prompt after the image part in the `contents` array.
1075+
1076+ For example, using the `google-genai` client (assuming `image_bytes`
1077+ holds the raw PNG bytes):
1078+
1079+ ```python
1080+ from google import genai
1081+ from google.genai import types
1082+
1083+ client = genai.Client()
1084+ response = client.models.generate_content(
1085+     model="gemini-2.0-flash",
1086+     contents=[
1087+         types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
1088+         "Carefully examine this image and detect ALL visible objects, "
1089+         "including small, distant, or partially visible ones.",
1090+     ],
1091+ )
1092+ ```
1093+ This ordering (image first, then text) tends to produce better results in practice.
1094+
10231095 ```
10241096 Carefully examine this image and detect ALL visible objects, including
10251097 small, distant, or partially visible ones.
@@ -1391,6 +1463,28 @@ def from_vlm(cls, vlm: VLM | str, result: str | dict, **kwargs: Any) -> Detectio
13911463 ```
13921464
13931465 !!! example "Gemini 2.0"
1466+
1467+ ??? tip "Prompt engineering"
1468+
1469+ From Gemini 2.0 onwards, models are further trained to detect objects in
1470+ an image and return their bounding box coordinates. The coordinates are
1471+ normalized to the range [0, 1000] relative to the image dimensions, so
1472+ you need to convert them back to pixel coordinates using the original
1473+ image size.
1474+
1475+ According to the [Gemini API documentation on image prompts](
1476+ https://ai.google.dev/gemini-api/docs/vision?lang=python#image_prompts), when using
1477+ a single image with text, the recommended approach is to place the text
1478+ prompt after the image part in the `contents` array (for example,
1479+ `contents=[image_part, text_part]`). This ordering tends to produce better results in practice.
1480+
1481+ To get the best results from Google Gemini 2.0, use the following prompt.
1482+
1483+ ```
1484+ Detect all the cats and dogs in the image. The box_2d should be
1485+ [ymin, xmin, ymax, xmax] normalized to 0-1000.
1486+ ```
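Converting the normalized `box_2d` values back to pixel coordinates is a simple rescale. A minimal sketch (the helper name `box_2d_to_pixels` is illustrative, not part of supervision):

```python
def box_2d_to_pixels(box_2d, image_width, image_height):
    # Gemini returns [ymin, xmin, ymax, xmax] normalized to 0-1000;
    # rescale each coordinate by the matching image dimension.
    ymin, xmin, ymax, xmax = box_2d
    return (
        xmin / 1000 * image_width,
        ymin / 1000 * image_height,
        xmax / 1000 * image_width,
        ymax / 1000 * image_height,
    )

# A box on a 640x480 image:
print(box_2d_to_pixels([100, 250, 500, 750], 640, 480))
# (160.0, 48.0, 480.0, 240.0)
```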
1487+
13941488 ```python
13951489 import supervision as sv
13961490
@@ -1428,6 +1522,27 @@ def from_vlm(cls, vlm: VLM | str, result: str | dict, **kwargs: Any) -> Detectio
14281522 including small, distant, or partially visible ones, and to return
14291523 tight bounding boxes.
14301524
1525+ According to the [Gemini API documentation on image prompts](
1526+ https://ai.google.dev/gemini-api/docs/vision?hl=en),
1527+ when using a single image with text, place the text prompt after the image
1528+ part in the `contents` array. For example, with the `google-genai` client:
1529+
1530+ ```python
1531+ from google import genai
1532+ from google.genai import types
1533+
1534+ client = genai.Client()
1535+ response = client.models.generate_content(
1536+     model="gemini-2.0-flash",
1537+     contents=[
1538+         types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
1539+         types.Part.from_text(text=prompt),
1540+     ],
1541+ )
1542+ ```
1543+
1544+ This ordering tends to produce better results in practice.
1545+
14311546 ```
14321547 Carefully examine this image and detect ALL visible objects, including
14331548 small, distant, or partially visible ones.