@@ -983,6 +983,53 @@ def from_lmm(cls, lmm: LMM | str, result: str | dict, **kwargs: Any) -> Detectio
983983 ```
984984
985985 !!! example "Gemini 2.0"
986+
987+ ??? tip "Prompt engineering"
988+
989+ From Gemini 2.0 onwards, models are further trained to detect objects in
990+ an image and return their bounding box coordinates. The coordinates are
991+ normalized to the range [0, 1000] relative to the image dimensions, so
992+ you need to convert them back to pixel coordinates using the original
993+ image size.
994+
995+ According to the [Gemini API documentation on image prompts](
996+ https://ai.google.dev/gemini-api/docs/vision#image-input), when using a
997+ single image with text, the recommended approach is to place the text
998+ prompt after the image part in the `contents` array. This ordering tends
999+ to produce better results in practice.
1000+
1001+ For example, when calling the `generateContent` REST endpoint directly
1002+ (the model is specified in the request URL, e.g.
1003+ `models/gemini-2.0-flash:generateContent`, not in the body), you can
1004+ structure the request with the image part first and the text prompt
1005+ second in the `parts` list:
1006+
1007+ ```json
1008+ {
1009+   "contents": [
1010+     {
1011+       "role": "user",
1012+       "parts": [
1013+         {
1014+           "inline_data": {
1015+             "mime_type": "image/png",
1016+             "data": "<BASE64_IMAGE_BYTES>"
1017+           }
1018+         },
1019+         {
1020+           "text": "Detect all the cats and dogs in the image..."
1021+         }
1022+       ]
1023+     }
1024+   ]
1025+ }
1026+ ```
1026+ To get the best results from Google Gemini 2.0, use the following prompt.
1027+
1028+ ```
1029+ Detect all the cats and dogs in the image. The box_2d should be
1030+ [ymin, xmin, ymax, xmax] normalized to 0-1000.
1031+ ```
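The normalized `box_2d` values can then be mapped back to pixels with a small rescaling helper. A minimal sketch (the helper name `box_2d_to_pixels` is illustrative, not part of supervision):

```python
def box_2d_to_pixels(box_2d, image_width, image_height):
    # Gemini returns [ymin, xmin, ymax, xmax] normalized to 0-1000;
    # rescale each coordinate by the matching image dimension.
    ymin, xmin, ymax, xmax = box_2d
    return (
        xmin / 1000 * image_width,
        ymin / 1000 * image_height,
        xmax / 1000 * image_width,
        ymax / 1000 * image_height,
    )

# A box on a 640x480 image:
print(box_2d_to_pixels([100, 250, 500, 750], 640, 480))
# (160.0, 48.0, 480.0, 240.0)
```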
1032+
9861033 ```python
9871034 import supervision as sv
9881035
@@ -1020,6 +1067,31 @@ def from_lmm(cls, lmm: LMM | str, result: str | dict, **kwargs: Any) -> Detectio
10201067 including small, distant, or partially visible ones, and to return
10211068 tight bounding boxes.
10221069
1070+ According to the [Gemini API documentation on image prompts](
1071+ https://ai.google.dev/gemini-api/docs/vision#multi-part-input), when using
1072+ a single image with text, the recommended approach is to place the text
1073+ prompt after the image part in the `contents` array.
1075+
1076+ For example, using the `google-genai` client (assuming `image_bytes`
1077+ holds the raw PNG bytes):
1078+
1079+ ```python
1080+ from google import genai
1081+ from google.genai import types
1082+
1083+ client = genai.Client()
1084+ response = client.models.generate_content(
1085+     model="gemini-2.0-flash",
1086+     contents=[
1087+         types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
1088+         "Carefully examine this image and detect ALL visible objects, "
1089+         "including small, distant, or partially visible ones.",
1090+     ],
1091+ )
1092+ ```
1093+ This ordering (image first, then text) tends to produce better results in practice.
1094+
10231095 ```
10241096 Carefully examine this image and detect ALL visible objects, including
10251097 small, distant, or partially visible ones.
@@ -1391,6 +1463,28 @@ def from_vlm(cls, vlm: VLM | str, result: str | dict, **kwargs: Any) -> Detectio
13911463 ```
13921464
13931465 !!! example "Gemini 2.0"
1466+
1467+ ??? tip "Prompt engineering"
1468+
1469+ From Gemini 2.0 onwards, models are further trained to detect objects in
1470+ an image and return their bounding box coordinates. The coordinates are
1471+ normalized to the range [0, 1000] relative to the image dimensions, so
1472+ you need to convert them back to pixel coordinates using the original
1473+ image size.
1474+
1475+ According to the [Gemini API documentation on image prompts](
1476+ https://ai.google.dev/gemini-api/docs/vision?lang=python#image_prompts), when using
1477+ a single image with text, the recommended approach is to place the text
1478+ prompt after the image part in the `contents` array (for example,
1479+ `contents=[image_part, text_part]`). This ordering tends to produce better results in practice.
1480+
1481+ To get the best results from Google Gemini 2.0, use the following prompt.
1482+
1483+ ```
1484+ Detect all the cats and dogs in the image. The box_2d should be
1485+ [ymin, xmin, ymax, xmax] normalized to 0-1000.
1486+ ```
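Converting the normalized `box_2d` values back to pixel coordinates is a simple rescale. A minimal sketch (the helper name `box_2d_to_pixels` is illustrative, not part of supervision):

```python
def box_2d_to_pixels(box_2d, image_width, image_height):
    # Gemini returns [ymin, xmin, ymax, xmax] normalized to 0-1000;
    # rescale each coordinate by the matching image dimension.
    ymin, xmin, ymax, xmax = box_2d
    return (
        xmin / 1000 * image_width,
        ymin / 1000 * image_height,
        xmax / 1000 * image_width,
        ymax / 1000 * image_height,
    )

# A box on a 640x480 image:
print(box_2d_to_pixels([100, 250, 500, 750], 640, 480))
# (160.0, 48.0, 480.0, 240.0)
```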
1487+
13941488 ```python
13951489 import supervision as sv
13961490
@@ -1428,6 +1522,27 @@ def from_vlm(cls, vlm: VLM | str, result: str | dict, **kwargs: Any) -> Detectio
14281522 including small, distant, or partially visible ones, and to return
14291523 tight bounding boxes.
14301524
1525+ According to the [Gemini API documentation on image prompts](
1526+ https://ai.google.dev/gemini-api/docs/vision?hl=en),
1527+ when using a single image with text, place the text prompt after the image
1528+ part in the `contents` array. For example, with the `google-genai` client:
1529+
1530+ ```python
1531+ from google import genai
1532+ from google.genai import types
1533+
1534+ client = genai.Client()
1535+ response = client.models.generate_content(
1536+     model="gemini-2.0-flash",
1537+     contents=[
1538+         types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
1539+         types.Part.from_text(text=prompt),
1540+     ],
1541+ )
1542+ ```
1543+
1544+ This ordering tends to produce better results in practice.
1545+
14311546 ```
14321547 Carefully examine this image and detect ALL visible objects, including
14331548 small, distant, or partially visible ones.