New Foundry Evaluations - sample.tool_calls property is not being filled correctly #295
Thanks for laying this out so clearly. I have validated that in the current New Foundry evaluation pipeline, the sample.tool_calls array carries only the tool_call_id, with none of the tool call's request or response content.
Meanwhile, the tracing view does have the full tool call content (request + response), which confirms that the data exists; it is just not being surfaced into the evaluation sample. This mismatch breaks any evaluator that depends on tool semantics, including the built-in AI evaluators that judge the agent's use of tools.
So your diagnosis is spot-on: the evaluation sample is incomplete.
Desired Outcome
Surfacing the full tool call data into sample.tool_calls would allow tool-aware evaluators to work as intended. This is a completely reasonable expectation, especially since the tracing system already has the data.
Current Workaround
You're right: there is no viable workaround today. Because the evaluation sample exposes only the tool_call_id, and the richer trace data is not available to the evaluation pipeline, any evaluator that depends on tool semantics is effectively blind.
Why this matters
When building agents with real toolchains, governance requirements, and auditability needs, this limitation blocks meaningful evaluation of tool use.
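To make "effectively blind" concrete, here is a minimal sketch of what a tool-aware check sees today. The function and shapes are illustrative assumptions, not the built-in evaluator API:

```python
# Illustrative sketch only -- not the built-in evaluator API. With just an
# opaque tool_call_id, there are no tool semantics to judge.
def judge_tool_use(tool_call: dict) -> str:
    if "request" not in tool_call or "response" not in tool_call:
        return "cannot evaluate: tool request/response content is missing"
    return "evaluate the request/response against the task"

# What the evaluation sample currently provides (id value is made up):
print(judge_tool_use({"tool_call_id": "call_abc123"}))
# -> cannot evaluate: tool request/response content is missing
```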
Thank you @leestott for your thorough investigation and confirmation. I'd like to emphasize that the evaluators this breaks are built-in offerings from Microsoft that are suggested for use when an Evaluation is created. My end goal is for these evaluators to work as described. Based on this, could you provide a ballpark estimate of when we can expect a resolution? That would let me advise my client on the roadmap for the application I'm working on.
Technical Feedback
When inspecting the results.jsonl of an evaluation, I can confirm that in the Evaluations process (in New Foundry) the tool calls made by the target agent are not included in the sample data. There is an object sample.tool_calls, and it contains the correct tool_call_id, but it does not contain any actual content about the tool call or its result. In the tracing view, the same tool_call_id can be seen with much more data, including the request and response made. In this case, azure_ai_search is the tool, and the query and retrieved documents are visible. Built-in AI Evaluators that judge the agent's use of tools do not work because of this problem.
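To make the observed behavior concrete, here is a minimal sketch for inspecting those tool calls, assuming each line of results.jsonl is a JSON row carrying a sample object as described above (the schema details are inferred from this report, not a documented contract):

```python
import json

# Minimal sketch, assuming each line of results.jsonl is a JSON row with a
# "sample" object as described above (schema inferred, not documented).
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        for call in row.get("sample", {}).get("tool_calls", []):
            # Observed today: only the id survives, e.g.
            # {"tool_call_id": "call_abc123"} -- no tool name, no request
            # arguments, no response content.
            print(call)
```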
Desired Outcome
The desired outcome is that all available data about the agent's tool calls in the thread is present in the sample.tool_calls array.
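For illustration, a hypothetical enriched entry might look like the sketch below; the field names beyond tool_call_id are assumptions, not a confirmed Foundry schema:

```python
# Hypothetical enriched sample.tool_calls entry. Field names other than
# tool_call_id are illustrative assumptions, not a confirmed schema; the
# point is that the request and response travel with the id.
desired_tool_call = {
    "tool_call_id": "call_abc123",        # id already present today
    "name": "azure_ai_search",            # which tool was invoked
    "request": {"query": "..."},          # what the agent asked the tool
    "response": {"documents": ["..."]},   # what the tool returned
}
```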
Current Workaround
I have not found a current workaround.