
Commit 5c5bfe5

Authored by tberends with pre-commit-ci[bot], Borda, and Copilot

docs: update detection core with tips for using Gemini integration (#1925)

* docs: update detection core with tips for using Gemini integration
* fix(pre_commit): 🎨 auto format pre-commit hooks
* Apply suggestions from code review

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Jirka Borovec <[email protected]>
Co-authored-by: Copilot <[email protected]>
1 parent 0ebab21 commit 5c5bfe5

File tree

1 file changed: +115 -0 lines changed

src/supervision/detection/core.py

Lines changed: 115 additions & 0 deletions
@@ -983,6 +983,53 @@ def from_lmm(cls, lmm: LMM | str, result: str | dict, **kwargs: Any) -> Detectio
        ```

    !!! example "Gemini 2.0"

        ??? tip "Prompt engineering"

            From Gemini 2.0 onwards, models are further trained to detect objects
            in an image and return their bounding box coordinates. The
            coordinates are normalized to the range [0, 1000] relative to the
            image dimensions, so you need to convert them back to pixel
            coordinates using your original image size.

            According to the Gemini API documentation on image prompts (see
            https://ai.google.dev/gemini-api/docs/vision#image-input), when using
            a single image with text, the recommended approach is to place the
            text prompt after the image part in the `contents` array. This
            ordering has been shown to produce significantly better results in
            practice.

            For example, when calling the Gemini API directly, you can structure
            the request like this, with the image part first and the text prompt
            second in the `parts` list:

            ```json
            {
                "model": "models/gemini-2.0-flash",
                "contents": [
                    {
                        "role": "user",
                        "parts": [
                            {
                                "inline_data": {
                                    "mime_type": "image/png",
                                    "data": "<BASE64_IMAGE_BYTES>"
                                }
                            },
                            {
                                "text": "Detect all the cats and dogs in the image..."
                            }
                        ]
                    }
                ]
            }
            ```

        To get the best results from Google Gemini 2.0, use the following prompt.

        ```
        Detect all the cats and dogs in the image. The box_2d should be
        [ymin, xmin, ymax, xmax] normalized to 0-1000.
        ```
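The coordinate conversion the tip describes can be sketched in a few lines; the helper name `denormalize_box` and the image size below are illustrative, not part of the supervision API:

```python
def denormalize_box(box_2d, image_width, image_height):
    # Gemini returns [ymin, xmin, ymax, xmax] normalized to 0-1000;
    # rescale to pixel coordinates and reorder to [xmin, ymin, xmax, ymax].
    ymin, xmin, ymax, xmax = box_2d
    return [
        int(xmin / 1000 * image_width),
        int(ymin / 1000 * image_height),
        int(xmax / 1000 * image_width),
        int(ymax / 1000 * image_height),
    ]

# A box spanning the full normalized range maps to the full 640x480 frame.
print(denormalize_box([0, 0, 1000, 1000], 640, 480))  # [0, 0, 640, 480]
```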
        ```python
        import supervision as sv
@@ -1020,6 +1067,31 @@ def from_lmm(cls, lmm: LMM | str, result: str | dict, **kwargs: Any) -> Detectio
        including small, distant, or partially visible ones, and to return
        tight bounding boxes.

        According to the Gemini API documentation on image prompts, when using
        a single image with text, the recommended approach is to place the text
        prompt after the image part in the `contents` array. See the official
        Gemini vision docs for details:
        https://ai.google.dev/gemini-api/docs/vision#multi-part-input

        For example, using the `google-generativeai` client, pass the image
        part first and the text prompt second:

        ```python
        import google.generativeai as genai

        model = genai.GenerativeModel("gemini-2.0-flash")
        response = model.generate_content(
            contents=[
                # Raw image bytes are passed as an inline blob dict.
                {"mime_type": "image/png", "data": image_bytes},
                "Carefully examine this image and detect ALL visible objects, "
                "including small, distant, or partially visible ones.",
            ],
            generation_config=generation_config,
            safety_settings=safety_settings,
        )
        ```

        This ordering (image first, then text) has been shown to produce
        significantly better results in practice.
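When building the raw REST payload instead of using a client, the image part is mostly base64 plumbing; a minimal sketch, where the placeholder bytes stand in for real PNG data:

```python
import base64

image_bytes = b"\x89PNG\r\n\x1a\n"  # placeholder: use real PNG bytes in practice

# inline_data carries the base64-encoded image; the text part follows it.
image_part = {
    "inline_data": {
        "mime_type": "image/png",
        "data": base64.b64encode(image_bytes).decode("ascii"),
    }
}
text_part = {"text": "Carefully examine this image and detect ALL visible objects."}

# Image part first, text prompt second, per the ordering recommendation.
contents = [{"role": "user", "parts": [image_part, text_part]}]
```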
        ```
        Carefully examine this image and detect ALL visible objects, including
        small, distant, or partially visible ones.
@@ -1391,6 +1463,28 @@ def from_vlm(cls, vlm: VLM | str, result: str | dict, **kwargs: Any) -> Detectio
        ```

    !!! example "Gemini 2.0"

        ??? tip "Prompt engineering"

            From Gemini 2.0 onwards, models are further trained to detect objects
            in an image and return their bounding box coordinates. The
            coordinates are normalized to the range [0, 1000] relative to the
            image dimensions, so you need to convert them back to pixel
            coordinates based on your original image size.

            According to the [Gemini API documentation on image prompts](
            https://ai.google.dev/gemini-api/docs/vision?lang=python#image_prompts),
            when using a single image with text, the recommended approach is to
            place the text prompt after the image part in the `contents` array
            (for example, `contents=[image_part, text_part]`). This ordering has
            been shown to produce significantly better results in practice.

        To get the best results from Google Gemini 2.0, use the following prompt.

        ```
        Detect all the cats and dogs in the image. The box_2d should be
        [ymin, xmin, ymax, xmax] normalized to 0-1000.
        ```
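A reply to the prompt above can be parsed with the standard library; the reply text below is a fabricated illustration of the expected shape, not real model output:

```python
import json

# Illustrative reply: a JSON array of objects with label and box_2d fields.
reply = (
    '[{"label": "cat", "box_2d": [100, 200, 300, 400]},'
    ' {"label": "dog", "box_2d": [500, 100, 900, 600]}]'
)

for obj in json.loads(reply):
    ymin, xmin, ymax, xmax = obj["box_2d"]  # still normalized to 0-1000
    print(obj["label"], (xmin, ymin, xmax, ymax))
```

In practice the model may wrap the JSON in a Markdown code fence, which must be stripped before `json.loads`.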
        ```python
        import supervision as sv
@@ -1428,6 +1522,27 @@ def from_vlm(cls, vlm: VLM | str, result: str | dict, **kwargs: Any) -> Detectio
        including small, distant, or partially visible ones, and to return
        tight bounding boxes.

        According to the [Gemini API documentation on image prompts](
        https://ai.google.dev/gemini-api/docs/vision?hl=en),
        when using a single image with text, place the text prompt after the
        image part in the `contents` array. For example, with the `google-genai`
        client:

        ```python
        from google import genai
        from google.genai import types

        client = genai.Client()
        response = client.models.generate_content(
            model="gemini-2.0-flash",
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
                types.Part.from_text(text=prompt),
            ],
        )
        ```

        This ordering has been shown to produce significantly better results
        in practice.
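Once boxes are parsed, the 0-1000 normalization can be undone for a whole batch at once; a sketch with NumPy, where the array values are illustrative:

```python
import numpy as np

# Rows of Gemini box_2d values: [ymin, xmin, ymax, xmax], normalized to 0-1000.
boxes_2d = np.array([[100, 200, 300, 400], [0, 0, 1000, 1000]])
w, h = 640, 480

# Reorder columns to [xmin, ymin, xmax, ymax] and rescale to pixel coordinates.
xyxy = boxes_2d[:, [1, 0, 3, 2]] / 1000.0 * np.array([w, h, w, h])
print(xyxy)
```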
        ```
        Carefully examine this image and detect ALL visible objects, including
        small, distant, or partially visible ones.
