에이전트 평가자(미리 보기)

2025-07-16

Important

이 문서에 표시된 항목(미리 보기)은 현재 퍼블릭 미리 보기에서 확인할 수 있습니다. 이 미리 보기는 서비스 수준 계약 없이 제공되며, 프로덕션 워크로드에는 권장되지 않습니다. 특정 기능이 지원되지 않거나 기능이 제한될 수 있습니다. 자세한 내용은 Microsoft Azure Preview에 대한 추가 사용 약관을 참조하세요.

에이전트는 강력한 생산성 도우미입니다. 작업을 계획, 결정 및 실행할 수 있습니다. 에이전트는 일반적으로 대화에서 사용자 의도를 통해 가장 먼저 이유를 지정하고, 사용자 요청을 호출하고 충족할 올바른 도구를 선택하고 , 지침에 따라 다양한 작업을 완료 합니다. 현재 에이전트 워크플로에 대해 이러한 에이전트별 평가자를 지원합니다.

Intent resolution
도구 호출 정확도
Task adherence

Azure AI 에이전트 평가

에이전트는 메시지를 내보내고 위의 입력을 제공하려면 일반적으로 메시지를 구문 분석하고 관련 정보를 추출해야 합니다. Azure AI 에이전트 서비스를 사용하여 에이전트를 빌드하는 경우 에이전트 메시지를 직접 받는 평가를 위한 네이티브 통합을 제공합니다. 자세한 내용은 Azure AI 에이전트 서비스에서 에이전트를 평가하는 엔드 투 엔드 예제를 참조하세요.

IntentResolution ToolCallAccuracy TaskAdherence 뿐만 아니라 에이전트 워크플로와 관련된 포괄적인 기본 제공 평가기 제품군을 사용하여 에이전트 워크플로의 다른 품질 및 안전 측면을 평가할 수도 있습니다. 변환기에서 Azure AI 에이전트 메시지에 대한 평가자 목록을 지원합니다.

Quality: IntentResolution, ToolCallAccuracy, TaskAdherence, Relevance, Coherence, Fluency
Safety: CodeVulnerabilities, Violence, Self-harm, Sexual, HateUnfairness, IndirectAttack, ProtectedMaterials.

이 문서에서는 , ToolCallAccuracy및 TaskAdherence.의 IntentResolution예를 보여 줍니다. Azure AI 에이전트 메시지와 함께 다른 평가자를 사용하는 예제는 Azure AI 에이전트 평가를 참조하세요.

AI 지원 평가자에 대한 모델 구성

다음 코드 조각에서 참조하기 위해 AI 지원 평가자는 LLM-judge에 대한 모델 구성을 사용합니다.

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get["AZURE_API_KEY"],
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)

평가기 모델 지원

We support AzureOpenAI or OpenAI reasoning models and non-reasoning models for the LLM-judge depending on the evaluators:

Evaluators	판단 모델 추론(예: Azure OpenAI/OpenAI의 o 시리즈 모델)	판사로서의 비 추론 모델(예: gpt-4.1, gpt-4o 등)	To enable
`Intent Resolution`, `Task Adherence`, `Tool Call AccuracyResponse Completeness`	Supported	Supported	계산기 초기화에 추가 매개 변수 `is_reasoning_model=True` 설정
기타 품질 평가자	Not Supported	Supported	--

정교한 추론이 필요한 복잡한 평가의 경우 추론 성능과 비용 효율성의 균형을 통해 나중에 출시된 강력한 추론 모델과 O o3-mini 시리즈 미니 모델을 사용하는 것이 좋습니다.

Intent resolution

IntentResolutionEvaluator 는 시스템의 의도 범위를 얼마나 잘 지정하고, 명확한 질문을 하고, 최종 사용자에게 해당 기능 범위를 미리 알려주는지를 포함하여 사용자의 요청을 얼마나 잘 식별하고 이해하는지를 측정합니다. 점수가 높을수록 사용자 의도를 더 잘 식별할 수 있습니다.

의도 확인 예제

from azure.ai.evaluation import IntentResolutionEvaluator

intent_resolution = IntentResolutionEvaluator(model_config=model_config, threshold=3)
intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM."
)

의도 확인 출력

Likert 눈금(정수 1~5)의 숫자 점수와 더 높은 점수가 더 좋습니다. 숫자 임계값(기본값: 3)이 지정된 경우 점수 >= 임계값이면 "pass"를 출력하거나 그렇지 않으면 "fail"를 출력합니다. 이유 및 추가 필드를 사용하면 점수가 높거나 낮은 이유를 이해하는 데 도움이 될 수 있습니다.

{
    "intent_resolution": 5.0,
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower clearly and accurately, directly addressing the user's query. It includes specific times, which fully resolves the user's request for information about the opening hours.",
    "additional_details": {
        "conversation_has_intent": True,
        "agent_perceived_intent": "inquire about opening hours",
        "actual_user_intent": "find out the opening hours of the Eiffel Tower",
        "correct_intent_detected": True,
        "intent_resolved": True
    }
}

Azure AI 에이전트 Serice 외부에서 에이전트를 빌드하는 경우 이 평가자는 에이전트 메시지에 일반적인 스키마를 허용합니다. To learn more, see our sample notebook for Intent Resolution.

도구 호출 정확도

ToolCallAccuracyEvaluator 는 에이전트 워크플로의 이전 단계에서 적절한 도구를 선택하고, 추출하고, 올바른 매개 변수를 처리하는 에이전트의 기능을 측정합니다. 각 도구 호출이 정확한지(이진) 여부를 감지하고 평균 점수를 다시 보고하며, 이는 도구 호출의 통과율로 해석될 수 있습니다.

Note

ToolCallAccuracyEvaluator Azure AI 에이전트의 함수 도구 평가만 지원하지만 기본 제공 도구 평가는 지원하지 않습니다. 에이전트 메시지에는 평가하기 위해 실제로 호출된 함수 도구가 하나 이상 있어야 합니다.

도구 호출 정확도 예제

from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config, threshold=3)
tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[{
                    "type": "tool_call",
                    "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
                    "name": "fetch_weather",
                    "arguments": {
                        "___location": "Seattle"
                    }
                }],
    tool_definitions=[{
                    "id": "fetch_weather",
                    "name": "fetch_weather",
                    "description": "Fetches the weather information for the specified ___location.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "___location": {
                                "type": "string",
                                "description": "The ___location to fetch weather for."
                            }
                        }
                    }
                }
    ]
)

도구 호출 정확도 출력

숫자 점수(올바른 도구 호출의 전달 속도)는 0-1이고 더 높은 점수는 더 좋습니다. 숫자 임계값(기본값: 3)이 지정된 경우 점수 >= 임계값이면 "pass"를 출력하거나 그렇지 않으면 "fail"를 출력합니다. 이유 및 도구 호출 세부 정보 필드를 사용하면 점수가 높거나 낮은 이유를 이해하는 데 도움이 될 수 있습니다.

{
    "tool_call_accuracy": 1.0,
    "tool_call_accuracy_result": "pass",
    "tool_call_accuracy_threshold": 0.8,
    "per_tool_call_details": [
        {
            "tool_call_accurate": True,
            "tool_call_accurate_reason": "The input Data should get a Score of 1 because the TOOL CALL is directly relevant to the user's question about the weather in Seattle, includes appropriate parameters that match the TOOL DEFINITION, and the parameter values are correct and relevant to the user's query.",
            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ"
        }
    ]
}

Azure AI 에이전트 서비스 외부에서 에이전트를 빌드하는 경우 이 평가자는 에이전트 메시지에 일반적인 스키마를 허용합니다. 자세한 내용은 도구 호출 정확도에 대한 샘플 Notebook을 참조하세요.

Task adherence

에이전트 시스템과 같은 다양한 작업 지향 AI 시스템에서 에이전트가 비효율적이거나 범위를 벗어난 단계를 수행하는 대신 지정된 작업을 완료하기 위해 계속 진행되었는지 여부를 평가하는 것이 중요합니다. TaskAdherenceEvaluator 는 작업 지침(시스템 메시지 및 사용자 쿼리에서 추출됨) 및 사용 가능한 도구에 따라 에이전트의 응답이 할당된 작업을 얼마나 잘 준수하는지 측정합니다. 점수가 높을수록 지정된 작업을 해결하기 위해 시스템 명령을 더 잘 준수할 수 있습니다.

작업 준수 예제

from azure.ai.evaluation import TaskAdherenceEvaluator

task_adherence = TaskAdherenceEvaluator(model_config=model_config, threshold=3)
task_adherence(
        query="What are the best practices for maintaining a healthy rose garden during the summer?",
        response="Make sure to water your roses regularly and trim them occasionally."                         
)

작업 준수 출력

Likert 눈금(정수 1~5)의 숫자 점수와 더 높은 점수가 더 좋습니다. 숫자 임계값(기본값: 3)이 지정된 경우 점수 >= 임계값이면 "pass"를 출력하거나 그렇지 않으면 "fail"를 출력합니다. 이유 필드를 사용하면 점수가 높거나 낮은 이유를 이해하는 데 도움이 될 수 있습니다.

{
   "task_adherence": 2.0,
    "task_adherence_result": "fail",
    "task_adherence_threshold": 3,
    "task_adherence_reason": "The response partially addresses the query by mentioning relevant practices but lacks critical details and depth, making it insufficient for a comprehensive understanding of maintaining a rose garden in summer."
}

Azure AI 에이전트 서비스 외부에서 에이전트를 빌드하는 경우 이 평가자는 에이전트 메시지에 일반적인 스키마를 허용합니다. To learn more, see our sample notebook for Task Adherence.