New Foundry Evaluations - sample.tool_calls property is not being filled correctly #295
Thanks for laying this out so clearly. I have validated that in the current New Foundry evaluation pipeline, the sample.tool_calls array carries only the tool_call_id, with none of the tool call's request or response content.
Meanwhile, the tracing view does have the full tool call content (request + response), which confirms that the data exists; it is just not being surfaced into the evaluation sample. This mismatch breaks any evaluator that depends on tool semantics, including the built-in AI evaluators that judge the agent's use of tools.
So your diagnosis is spot-on: the evaluation sample is incomplete.
Desired Outcome
Surfacing the full tool call data into sample.tool_calls would allow tool-aware evaluators to work as intended. This is a completely reasonable expectation, especially since the tracing system already has the data.
Current Workaround
You're right: there is no viable workaround today. Because the evaluation sample exposes only the tool_call_id, and the richer trace data is not available to the evaluation pipeline, any evaluator that depends on tool semantics is effectively blind.
Why this matters
When building agents with real toolchains, governance requirements, and auditability needs, this limitation blocks meaningful evaluation of tool use.
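To make "effectively blind" concrete, here is a minimal sketch of what a tool-aware check sees today. The function and shapes are illustrative assumptions, not the built-in evaluator API:

```python
# Illustrative sketch only -- not the built-in evaluator API. With just an
# opaque tool_call_id, there are no tool semantics to judge.
def judge_tool_use(tool_call: dict) -> str:
    if "request" not in tool_call or "response" not in tool_call:
        return "cannot evaluate: tool request/response content is missing"
    return "evaluate the request/response against the task"

# What the evaluation sample currently provides (id value is made up):
print(judge_tool_use({"tool_call_id": "call_abc123"}))
# -> cannot evaluate: tool request/response content is missing
```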
Thank you @leestott for your thorough investigation and confirmation. I'd like to emphasize that the evaluators this breaks are built-in offerings from Microsoft that are suggested for use when an Evaluation is created. My end goal is for these evaluators to work as described. Based on this, could you provide a ballpark estimate of when we can expect a resolution? That would let me advise my client on the roadmap for the application I'm working on.
Technical Feedback
When inspecting the results.jsonl of an evaluation, I can confirm that in the Evaluations process (in New Foundry) the tool calls made by the target agent are not included in the sample data. There is an object sample.tool_calls, and it contains the correct tool_call_id, but it does not contain any actual content about the tool call or its result. In the tracing view, the same tool_call_id can be seen with much more data, including the request and response made. In this case, azure_ai_search is the tool, and the query and retrieved documents are visible. Built-in AI Evaluators that judge the agent's use of tools do not work because of this problem.
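To make the observed behavior concrete, here is a minimal sketch for inspecting those tool calls, assuming each line of results.jsonl is a JSON row carrying a sample object as described above (the schema details are inferred from this report, not a documented contract):

```python
import json

# Minimal sketch, assuming each line of results.jsonl is a JSON row with a
# "sample" object as described above (schema inferred, not documented).
with open("results.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        for call in row.get("sample", {}).get("tool_calls", []):
            # Observed today: only the id survives, e.g.
            # {"tool_call_id": "call_abc123"} -- no tool name, no request
            # arguments, no response content.
            print(call)
```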
Desired Outcome
The desired outcome is that all available data about the agent's tool calls in the thread is present in the sample.tool_calls array.
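For illustration, a hypothetical enriched entry might look like the sketch below; the field names beyond tool_call_id are assumptions, not a confirmed Foundry schema:

```python
# Hypothetical enriched sample.tool_calls entry. Field names other than
# tool_call_id are illustrative assumptions, not a confirmed schema; the
# point is that the request and response travel with the id.
desired_tool_call = {
    "tool_call_id": "call_abc123",        # id already present today
    "name": "azure_ai_search",            # which tool was invoked
    "request": {"query": "..."},          # what the agent asked the tool
    "response": {"documents": ["..."]},   # what the tool returned
}
```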
Current Workaround
I have not found a current workaround.