Important
Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. Agents typically first reason through the user's intent in a conversation, select the correct tools to satisfy the request, and complete tasks according to their instructions. We currently support these agent-specific evaluators for agentic workflows: IntentResolution, ToolCallAccuracy, and TaskAdherence.
Evaluating Azure AI agents
Agents emit messages, and providing the required evaluator inputs typically requires parsing those messages and extracting the relevant information. If you're building agents using Azure AI Agent Service, we provide native integration for evaluation that directly takes their agent messages. To learn more, see an end-to-end example of evaluating agents in Azure AI Agent Service.
Besides IntentResolution, ToolCallAccuracy, and TaskAdherence, which are specific to agentic workflows, you can also assess other quality and safety aspects of your agentic workflows using our comprehensive suite of built-in evaluators. We support this list of evaluators for Azure AI agent messages from our converter:
- Quality: IntentResolution, ToolCallAccuracy, TaskAdherence, Relevance, Coherence, Fluency
- Safety: CodeVulnerabilities, Violence, Self-harm, Sexual, HateUnfairness, IndirectAttack, ProtectedMaterials
In this article, we show examples of IntentResolution, ToolCallAccuracy, and TaskAdherence. For examples of using other evaluators with Azure AI agent messages, see evaluating Azure AI agents.
Model configuration for AI-assisted evaluators
For reference in the following code snippets, the AI-assisted evaluators use a model configuration for the LLM judge:
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ["AZURE_API_KEY"],
    azure_deployment=os.environ["AZURE_DEPLOYMENT_NAME"],
    api_version=os.environ["AZURE_API_VERSION"],
)
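Because every field above is read from the environment, a missing variable surfaces as a confusing error later. A minimal sketch of a fail-fast check (the `missing_env_vars` helper is our own illustration, not part of the SDK):

```python
import os

def missing_env_vars(names, env=os.environ):
    """Return the subset of names that are unset or empty in env."""
    return [n for n in names if not env.get(n)]

# The four variables the model configuration above reads.
required = ["AZURE_ENDPOINT", "AZURE_API_KEY", "AZURE_DEPLOYMENT_NAME", "AZURE_API_VERSION"]

# Demonstration with a partially populated mapping standing in for os.environ:
demo_env = {"AZURE_ENDPOINT": "https://example.openai.azure.com", "AZURE_API_KEY": ""}
print(missing_env_vars(required, demo_env))
# → ['AZURE_API_KEY', 'AZURE_DEPLOYMENT_NAME', 'AZURE_API_VERSION']
```

In real use you would call `missing_env_vars(required)` after `load_dotenv()` and raise before constructing the configuration.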
Evaluator model support
We support both reasoning and non-reasoning models from Azure OpenAI or OpenAI as the LLM judge, depending on the evaluator:
| Evaluators | Reasoning models as judge (example: o-series models from Azure OpenAI / OpenAI) | Non-reasoning models as judge (example: gpt-4.1, gpt-4o) | To enable |
|---|---|---|---|
| Intent Resolution, Task Adherence, Tool Call Accuracy, Response Completeness | Supported | Supported | Set the additional parameter is_reasoning_model=True when initializing the evaluator |
| Other quality evaluators | Not supported | Supported | -- |
For complex evaluation that requires refined reasoning, we recommend a strong reasoning model like o3-mini, or the o-series mini models released after it, which balance reasoning performance with cost efficiency.
Intent resolution
IntentResolutionEvaluator
measures how well the system identifies and understands a user's request, including how well it scopes the user's intent, asks clarifying questions, and reminds end users of its scope of capabilities. A higher score means better identification of the user's intent.
Intent resolution example
from azure.ai.evaluation import IntentResolutionEvaluator
intent_resolution = IntentResolutionEvaluator(model_config=model_config, threshold=3)
intent_resolution(
query="What are the opening hours of the Eiffel Tower?",
response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM."
)
Intent resolution output
The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. The reason and additional-details fields can help you understand why the score is high or low.
{
"intent_resolution": 5.0,
"intent_resolution_result": "pass",
"intent_resolution_threshold": 3,
"intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower clearly and accurately, directly addressing the user's query. It includes specific times, which fully resolves the user's request for information about the opening hours.",
"additional_details": {
"conversation_has_intent": True,
"agent_perceived_intent": "inquire about opening hours",
"actual_user_intent": "find out the opening hours of the Eiffel Tower",
"correct_intent_detected": True,
"intent_resolved": True
}
}
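The pass/fail field in the output above is a plain threshold comparison, which you can reproduce when post-processing results. A minimal sketch (the `apply_threshold` helper is our own illustration, not an SDK function):

```python
def apply_threshold(score, threshold=3):
    """Mirror the evaluators' pass/fail rule: pass when score >= threshold."""
    return "pass" if score >= threshold else "fail"

# Recompute the result field from the sample output above.
result = {"intent_resolution": 5.0, "intent_resolution_threshold": 3}
result["intent_resolution_result"] = apply_threshold(
    result["intent_resolution"], result["intent_resolution_threshold"]
)
print(result["intent_resolution_result"])  # → pass
```

The same rule applies to the other threshold-based evaluators in this article, so a helper like this lets you re-gate stored scores against a stricter threshold without re-running the evaluation.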
If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for Intent Resolution.
Tool call accuracy
ToolCallAccuracyEvaluator
measures the accuracy and efficiency of tool calls made by an agent in a run. It provides a 1-5 score based on:
- the relevance and helpfulness of the tool invoked;
- the correctness of parameters used in tool calls;
- the counts of missing or excessive calls.
Note
ToolCallAccuracyEvaluator
only supports Azure AI Agent's Function Tool evaluation; it doesn't support Built-in Tool evaluation. To be evaluated, the agent run must make at least one Function Tool call and no Built-in Tool calls.
Tool call accuracy example
from azure.ai.evaluation import ToolCallAccuracyEvaluator
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config, threshold=3)
tool_call_accuracy(
query="How is the weather in Seattle?",
tool_calls=[{
"type": "tool_call",
"tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
"name": "fetch_weather",
"arguments": {
"___location": "Seattle"
}
}],
tool_definitions=[{
"id": "fetch_weather",
"name": "fetch_weather",
"description": "Fetches the weather information for the specified ___location.",
"parameters": {
"type": "object",
"properties": {
"___location": {
"type": "string",
"description": "The ___location to fetch weather for."
}
}
}
}
]
)
Tool call accuracy output
The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. The reason and tool-call detail fields can help you understand why the score is high or low.
{
"tool_call_accuracy": 5,
"tool_call_accuracy_result": "pass",
"tool_call_accuracy_threshold": 3,
"details": {
"tool_calls_made_by_agent": 1,
"correct_tool_calls_made_by_agent": 1,
"per_tool_call_details": [
{
"tool_name": "fetch_weather",
"total_calls_required": 1,
"correct_calls_made_by_agent": 1,
"correct_tool_percentage": 1.0,
"tool_call_errors": 0,
"tool_success_result": "pass"
}
],
"excess_tool_calls": {
"total": 0,
"details": []
},
"missing_tool_calls": {
"total": 0,
"details": []
}
}
}
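The details block above is useful for aggregating results across many runs. A sketch of a summary helper (the function and its return shape are our own illustration; the input field names follow the sample output above):

```python
def summarize_tool_calls(details):
    """Summarize the 'details' block of a ToolCallAccuracyEvaluator result."""
    made = details["tool_calls_made_by_agent"]
    correct = details["correct_tool_calls_made_by_agent"]
    return {
        "accuracy": correct / made if made else 0.0,
        "excess": details["excess_tool_calls"]["total"],
        "missing": details["missing_tool_calls"]["total"],
        "failing_tools": [
            d["tool_name"]
            for d in details["per_tool_call_details"]
            if d["tool_success_result"] != "pass"
        ],
    }

# Trimmed version of the sample details block above.
sample = {
    "tool_calls_made_by_agent": 1,
    "correct_tool_calls_made_by_agent": 1,
    "per_tool_call_details": [
        {"tool_name": "fetch_weather", "tool_success_result": "pass"}
    ],
    "excess_tool_calls": {"total": 0},
    "missing_tool_calls": {"total": 0},
}
print(summarize_tool_calls(sample))
```

A roll-up like this makes it easy to spot which tools fail most often once you evaluate a batch of agent runs.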
If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for Tool Call Accuracy.
Task adherence
In task-oriented AI systems such as agentic systems, it's important to assess whether the agent stays on track to complete a given task instead of taking inefficient or out-of-scope steps. TaskAdherenceEvaluator
measures how well an agent's response adheres to its assigned task, according to its task instructions (extracted from the system message and user query) and available tools. A higher score means better adherence to the system instructions for resolving the given task.
Task adherence example
from azure.ai.evaluation import TaskAdherenceEvaluator
task_adherence = TaskAdherenceEvaluator(model_config=model_config, threshold=3)
task_adherence(
query="What are the best practices for maintaining a healthy rose garden during the summer?",
response="Make sure to water your roses regularly and trim them occasionally."
)
Task adherence output
The numerical score is on a Likert scale (integer 1 to 5), and a higher score is better. Given a numerical threshold (default 3), we also output "pass" if the score >= threshold, or "fail" otherwise. The reason field can help you understand why the score is high or low.
{
"task_adherence": 2.0,
"task_adherence_result": "fail",
"task_adherence_threshold": 3,
"task_adherence_reason": "The response partially addresses the query by mentioning relevant practices but lacks critical details and depth, making it insufficient for a comprehensive understanding of maintaining a rose garden in summer."
}
If you're building agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. To learn more, see our sample notebook for Task Adherence.
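Because every evaluator in this article emits a flat dict with a *_result field, you can combine several results and gate a release on all of them passing. A minimal sketch (the `gate` helper is our own illustration, not an SDK function):

```python
def gate(results):
    """Return True only if every evaluator's *_result field is 'pass'.

    Works on the flat result dicts shown in this article, merged into one dict.
    """
    return all(
        value == "pass"
        for key, value in results.items()
        if key.endswith("_result")
    )

# Merged scores from two evaluator runs shown in this article.
combined = {
    "intent_resolution": 5.0,
    "intent_resolution_result": "pass",
    "task_adherence": 2.0,
    "task_adherence_result": "fail",
}
print(gate(combined))  # → False
```

Non-result fields (scores, thresholds, reasons) are ignored, so you can pass the merged evaluator output through unchanged.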