Skip to content

Owl-1.5模型的上下文问题 #266

@someonetang

Description

@someonetang

项目代码中,在Agent运行单个任务时,注意到在多轮步骤后,发送给模型的请求中包含了之前所有步骤图片的Base64数据,具体如下

  {
    "role": "system",
    "content": [
      {
        "type": "text",
        "text": "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{\"type\": \"function\", \"function\": {\"name_for_human\": \"mobile_use\", \"name\": \"mobile_use\", \"description\": \"Use a touchscreen to interact with a mobile device, and take screenshots.\n* This is an interface to a mobile device with touchscreen. You can perform actions like clicking, typing, swiping, etc.\n* Some applications may take time to start or process actions, so you may need to wait and take successive screenshots to see the results of your actions.\n* The screen's resolution is 1000x1000.\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. Don't click boxes on their edges unless asked.\", \"parameters\": {\"properties\": {\"action\": {\"description\": \"The action to perform. The available actions are:\n* `key`: Perform a key event on the mobile device.\n    - This supports adb's `keyevent` syntax.\n    - Examples: \"volume_up\", \"volume_down\", \"power\", \"camera\", \"clear\".\n* `click`: Click the point on the screen with coordinate (x, y).\n* `long_press`: Press the point on the screen with coordinate (x, y) for specified seconds.\n* `swipe`: Swipe from the starting point with coordinate (x, y) to the end point with coordinates2 (x2, y2).\n* `type`: Input the specified text into the activated input box.\n* `system_button`: Press the system button.\n* `open`: Open an app on the device.\n* `wait`: Wait specified seconds for the change to happen.\n* `answer`: Terminate the current task and output the answer.\n* `terminate`: Terminate the current task and report its completion status.\", \"enum\": [\"key\", \"click\", \"long_press\", \"swipe\", \"type\", \"system_button\", \"open\", \"wait\", \"answer\", \"terminate\"], \"type\": \"string\"}, \"coordinate\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=click`, `action=long_press`, and `action=swipe`.\", \"type\": \"array\"}, \"coordinate2\": {\"description\": \"(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=swipe`.\", \"type\": \"array\"}, \"text\": {\"description\": \"Required only by `action=key`, `action=type`, `action=open`, `action=answer`.\", \"type\": \"string\"}, \"time\": {\"description\": \"The seconds to wait. Required only by `action=long_press` and `action=wait`.\", \"type\": \"number\"}, \"button\": {\"description\": \"Back means returning to the previous interface, Home means returning to the desktop, Menu means opening the application background menu, and Enter means pressing the enter. Required only by `action=system_button`\", \"enum\": [\"Back\", \"Home\", \"Menu\", \"Enter\"], \"type\": \"string\"}, \"status\": {\"description\": \"The status of the task. Required only by `action=terminate`.\", \"type\": \"string\", \"enum\": [\"success\", \"failure\"]}}, \"required\": [\"action\"], \"type\": \"object\"}, \"args_format\": \"Format the arguments as a JSON object.\"}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call>\n\n# Response format\n\nResponse format for every step:\n1) Action: a short imperative describing what to do in the UI.\n2) A single <tool_call>...</tool_call> block containing only the JSON: {\"name\": <function-name>, \"arguments\": <args-json-object>}.\n\nRules:\n- Output exactly in the order: Action, <tool_call>.\n- Be brief: one for Action.\n- Do not output anything else outside those two parts.\n- If finishing, use action=terminate in the tool call."
      }
    ]
  },
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Please generate the next move according to the UI screenshot, instruction and previous actions.\n\nInstruction: Record an audio clip and save it with name \"presentation_fGwr.m4a\" using Audio Recorder app.\n\nPrevious actions:\nNo previous action."
      },
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEU......"
        }
      }
    ]
  },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "<think>I'm on the Android home screen and need to record an audio clip using the Audio Recorder app, saving it as 'presentation_fGwr.m4a'. The app isn't visible on the current screen, so I'll swipe up from the bottom to access the app drawer, which contains all installed apps. Once in the app drawer, I'll locate and open the Audio Recorder app to proceed.</think>\nAction: Swipe upward from the lower portion of the screen to open the app drawer and access all installed applications.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"swipe\", \"coordinate\": [510, 748], \"coordinate2\": [502, 322]}}\n</tool_call>"
      }
    ]
  },
  {
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,iVBORw0KGg......"
        }
      }
    ]
  },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "<think>The swipe up successfully opened the app drawer, revealing all installed apps, including the Audio Recorder app in the second row. I’ve accessed the app drawer, and the next step is to tap the Audio Recorder icon to launch it. Once opened, I can proceed with recording the audio clip and saving it with the name 'presentation_fGwr.m4a'. The Audio Recorder app will provide the necessary tools for this task.</think>\nAction: Click on the Audio Recorder app icon in the second row to launch the application.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [396, 328]}}\n</tool_call>"
      }
    ]
  },
  {
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUg......"
        }
      }
    ]
  },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "<think>The Audio Recorder app launched, but a warning dialog appeared, indicating some records were deleted or moved. This is a common notification and doesn't affect recording new audio. The dialog offers two options: 'Ok' and 'Details'. To proceed, I'll click 'Ok' to dismiss the dialog, as it’s the standard action for such notifications. Afterward, I can use the app to record and save the audio clip with the name 'presentation_fGwr.m4a'.</think>\nAction: Click on the \"Ok\" button in the warning dialog to dismiss the notification and proceed with using the Audio Recorder app.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [440, 598]}}\n</tool_call>"
      }
    ]
  },
  {
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,iVBORw0KGgoA......"
        }
      }
    ]
  },
  {
    "role": "assistant",
    "content": [
      {
        "type": "text",
        "text": "<think>The warning dialog is dismissed, and the Audio Recorder app is ready. I've opened the app drawer, launched the app, and cleared the initial warning. The interface shows a waveform timeline, playback button, timer, and a file labeled 'Record-1'. A large white circular button at the bottom center appears to be the record button. My tasks are: 1) Start recording, 2) Record audio, and 3) Save as 'presentation_fGwr.m4a'. Clicking the record button will start the process, change the interface to recording mode, and allow me to proceed.</think>\nAction: Click on the large white circular record button at the bottom center of the screen to start recording audio.\n<tool_call>\n{\"name\": \"mobile_use\", \"arguments\": {\"action\": \"click\", \"coordinate\": [510, 915]}}\n</tool_call>"
      }
    ]
  },
  {
    "role": "user",
    "content": [
      {
        "type": "image_url",
        "image_url": {
          "url": "data:image/png;base64,iVBORw0KGgoAAAANSUh......"
        }
      }
    ]
  }
]

这会在端侧部署时对模型上下文能力造成很大的挑战,为何不使用在MobileAgent-v3时使用的conclusion模式,或者模型是否能直接使用MobileAgent-v3组织的上下文进行输入?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions