Agent evaluators (preview)

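Important

Items marked "(preview)" in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.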

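Agents are powerful productivity assistants. They can plan, make decisions, and execute actions. An agent typically first reasons about the user's intent in a conversation, then selects and calls the right tools to satisfy the request, and completes tasks according to its instructions. Azure AI Foundry currently supports the following agent-specific evaluators for agentic workflows:

Intent resolution
Tool call accuracy
Task adherence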

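Evaluating Azure AI agents

Agents emit messages. Supplying evaluator inputs usually requires parsing those messages and extracting the relevant information. If you build agents with Azure AI Agent Service, the service provides native integration for evaluation that takes the agent messages directly. For an example, see Evaluate AI agents.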

Besides IntentResolution, ToolCallAccuracy, and TaskAdherence, which are specific to agentic workflows, you can assess other quality and safety aspects of your agentic workflows with the comprehensive suite of built-in evaluators. Azure AI Foundry supports the following evaluators for Azure AI agent messages produced by the converter:

Quality: IntentResolution, ToolCallAccuracy, TaskAdherence, Relevance, Coherence, Fluency
Safety: CodeVulnerabilities, Violence, Self-harm, Sexual, HateUnfairness, IndirectAttack, ProtectedMaterials

This article shows examples of IntentResolution, ToolCallAccuracy, and TaskAdherence. For examples of using other evaluators with Azure AI agent messages, see Evaluate Azure AI agents.

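Model configuration for AI-assisted evaluators

For reference in the following code snippets, the AI-assisted evaluators use this model configuration for the large language model judge (LLM judge):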

import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)
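
The configuration above reads four environment variables, three of them through `os.environ.get`, which silently yields `None` when a variable is unset. The following is a minimal sketch of a fail-fast check, assuming key-based authentication as in the snippet. The helper name `missing_judge_settings` is hypothetical and is not part of `azure.ai.evaluation`.

```python
import os

# Variable names taken from the configuration snippet above.
REQUIRED_VARS = (
    "AZURE_ENDPOINT",
    "AZURE_API_KEY",
    "AZURE_DEPLOYMENT_NAME",
    "AZURE_API_VERSION",
)


def missing_judge_settings(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper: `env` defaults to os.environ, and a plain dict
    can be passed instead for testing.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_judge_settings()` before constructing `AzureOpenAIModelConfiguration`, and raising an error if the returned list is not empty, surfaces configuration problems before the first judge call.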

Evaluator model support

Azure AI Agent Service supports Azure OpenAI or OpenAI reasoning models and non-reasoning models as the LLM judge, depending on the evaluator:

Evaluators	Reasoning models as judge (example: o-series models from Azure OpenAI/OpenAI)	Non-reasoning models as judge (example: gpt-4.1 or gpt-4o)	To enable
`IntentResolution`, `TaskAdherence`, `ToolCallAccuracy`, `ResponseCompleteness`, `Coherence`, `Fluency`, `Similarity`, `Groundedness`, `Retrieval`, `Relevance`	Supported	Supported	Set the additional parameter `is_reasoning_model=True` when initializing the evaluator
Other evaluators	Not supported	Supported	--

For complex evaluations that require refined reasoning, we recommend a strong reasoning model such as 4.1-mini, which balances reasoning performance and cost efficiency.
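
As a rough illustration of the table above, the following sketch builds the extra initialization keyword for a reasoning-model judge and rejects evaluators that the table lists as unsupported. The helper `judge_init_kwargs` and the set name are hypothetical. Only the `is_reasoning_model=True` parameter and the evaluator names come from the table.

```python
# Evaluators that the support table lists as accepting a reasoning-model judge.
REASONING_JUDGE_EVALUATORS = frozenset({
    "IntentResolution", "TaskAdherence", "ToolCallAccuracy",
    "ResponseCompleteness", "Coherence", "Fluency",
    "Similarity", "Groundedness", "Retrieval", "Relevance",
})


def judge_init_kwargs(evaluator_name, use_reasoning_model):
    """Return extra keyword arguments for initializing an evaluator.

    Hypothetical helper: returns {} for a non-reasoning judge and
    {"is_reasoning_model": True} for a reasoning judge, and raises
    ValueError when the table marks the combination as not supported.
    """
    if not use_reasoning_model:
        return {}
    if evaluator_name not in REASONING_JUDGE_EVALUATORS:
        raise ValueError(
            f"{evaluator_name} is not listed as supporting a reasoning-model judge"
        )
    return {"is_reasoning_model": True}
```

The result can be unpacked into an evaluator constructor, for example `IntentResolutionEvaluator(model_config=model_config, **judge_init_kwargs("IntentResolution", True))`.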

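Intent resolution

IntentResolutionEvaluator measures how well the system identifies and understands a user's request. This includes how well it scopes the user's intent, asks clarifying questions, and reminds end users of the scope of its capabilities. A higher score means better identification of user intent.

Intent resolution example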

from azure.ai.evaluation import IntentResolutionEvaluator

intent_resolution = IntentResolutionEvaluator(model_config=model_config, threshold=3)
intent_resolution(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM."
)

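Intent resolution output

The numerical score is on a Likert scale (integers 1 to 5), and higher is better. Given a numerical threshold (default 3), the evaluator also outputs "pass" if the score is greater than or equal to the threshold, and "fail" otherwise. The reason and additional fields help you understand why the score is high or low.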

{
    "intent_resolution": 5.0,
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower clearly and accurately, directly addressing the user's query. It includes specific times, which fully resolves the user's request for information about the opening hours.",
    "additional_details": {
        "conversation_has_intent": True,
        "agent_perceived_intent": "inquire about opening hours",
        "actual_user_intent": "find out the opening hours of the Eiffel Tower",
        "correct_intent_detected": True,
        "intent_resolved": True
    }
}
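
The pass/fail rule can be sketched in a few lines. This is an illustration of the rule as described, not the SDK's implementation. The function names are hypothetical, and the key layout (`<metric>`, `<metric>_result`, `<metric>_threshold`) is assumed from the sample output above.

```python
def expected_label(score, threshold=3):
    """Return "pass" when score >= threshold, otherwise "fail"."""
    return "pass" if score >= threshold else "fail"


def label_is_consistent(result, metric):
    """Check that a result's pass/fail label matches its score and threshold.

    Assumes the keys `<metric>`, `<metric>_result`, and
    `<metric>_threshold` seen in the sample output; the threshold falls
    back to 3 when the key is absent.
    """
    score = result[metric]
    threshold = result.get(f"{metric}_threshold", 3)
    return result[f"{metric}_result"] == expected_label(score, threshold)
```

For the sample output above, `label_is_consistent(result, "intent_resolution")` holds because 5.0 is at or above the threshold of 3 and the label is "pass".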

If you build agents outside of Azure AI Foundry Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see Intent Resolution.

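Tool call accuracy

ToolCallAccuracyEvaluator measures the accuracy and efficiency of the tool calls an agent makes in a run. It provides a score from 1 to 5 based on:

The relevance and helpfulness of the tool invoked
The correctness of the parameters used in tool calls
The counts of missing or excessive calls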

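Tool call evaluation support

ToolCallAccuracyEvaluator supports evaluation in Azure AI Foundry Agent Service for the following tools:

File Search
Azure AI Search
Bing Grounding
Bing Custom Search
SharePoint Grounding
Code Interpreter
Fabric Data Agent
OpenAPI
Function Tool (user-defined tools)

If an unsupported tool is used in the agent run, the evaluator outputs "pass" together with a reason stating that evaluating the invoked tool isn't supported. This makes such cases easy to filter out. To enable evaluation, we recommend wrapping unsupported tools as user-defined tools.

Tool call accuracy example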

from azure.ai.evaluation import ToolCallAccuracyEvaluator

tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config, threshold=3)

# provide the agent response with tool calls 
tool_call_accuracy(
    query="What timezone corresponds to 41.8781,-87.6298?",
    response=[
    {
        "createdAt": "2025-04-25T23:55:52Z",
        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
                "name": "azure_maps_timezone",
                "arguments": {
                    "lat": 41.878100000000003,
                    "lon": -87.629800000000003
                }
            }
        ]
    },    
    {
        "createdAt": "2025-04-25T23:55:54Z",
        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
        "tool_call_id": "call_qi2ug31JqzDuLy7zF5uiMbGU",
        "role": "tool",
        "content": [
            {
                "type": "tool_result",
                "tool_result": {
                    "ianaId": "America/Chicago",
                    "utcOffset": None,
                    "abbreviation": None,
                    "isDaylightSavingTime": None
                }
            }
        ]
    },
    {
        "createdAt": "2025-04-25T23:55:55Z",
        "run_id": "run_DmnhUGqYd1vCBolcjjODVitB",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The timezone for the coordinates 41.8781, -87.6298 is America/Chicago."
            }
        ]
    }
    ],   
    tool_definitions=[
        {
            "name": "azure_maps_timezone",
            "description": "local time zone information for a given latitude and longitude.",
            "parameters": {
                "type": "object",
                "properties": {
                    "lat": {
                        "type": "float",
                        "description": "The latitude of the location."
                    },
                    "lon": {
                        "type": "float",
                        "description": "The longitude of the location."
                    }
                }
            }
        }
    ]
)

# alternatively, provide the tool calls directly without the full agent response
tool_call_accuracy(
    query="How is the weather in Seattle?",
    tool_calls=[
        {
            "type": "tool_call",
            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
            "name": "fetch_weather",
            "arguments": {
                "location": "Seattle"
            }
        }
    ],
    tool_definitions=[
        {
            "id": "fetch_weather",
            "name": "fetch_weather",
            "description": "Fetches the weather information for the specified location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The location to fetch weather for."
                    }
                }
            }
        }
    ]
)
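
The two input forms above are related: the `tool_calls` list has the same shape as the `tool_call` content items inside the assistant messages of a full `response`. The following is a minimal sketch of collecting those items from a message list, assuming the message schema shown in the first example. The helper `extract_tool_calls` is hypothetical, and whether the evaluator performs an equivalent step internally is not stated here.

```python
def extract_tool_calls(messages):
    """Collect content items of type "tool_call" from assistant messages.

    Hypothetical helper. Assumes each message is a dict with a "role"
    and a list-valued "content", as in the example response above.
    Tool result messages and text items are skipped.
    """
    calls = []
    for message in messages:
        if message.get("role") != "assistant":
            continue
        content = message.get("content", [])
        if not isinstance(content, list):
            continue  # plain-string content carries no tool calls
        for item in content:
            if isinstance(item, dict) and item.get("type") == "tool_call":
                calls.append(item)
    return calls
```

Applied to the three-message response in the first example, this returns a single item, the `azure_maps_timezone` call.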

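Tool call accuracy output

The numerical score is on a Likert scale (integers 1 to 5), and higher is better. Given a numerical threshold (default 3), the evaluator also outputs "pass" if the score is greater than or equal to the threshold, and "fail" otherwise. Use the reason and tool call detail fields to understand why the score is high or low.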

{
    "tool_call_accuracy": 5,
    "tool_call_accuracy_result": "pass",
    "tool_call_accuracy_threshold": 3,
    "details": {
        "tool_calls_made_by_agent": 1,
        "correct_tool_calls_made_by_agent": 1,
        "per_tool_call_details": [
            {
                "tool_name": "fetch_weather",
                "total_calls_required": 1,
                "correct_calls_made_by_agent": 1,
                "correct_tool_percentage": 1.0,
                "tool_call_errors": 0,
                "tool_success_result": "pass"
            }
        ],
        "excess_tool_calls": {
            "total": 0,
            "details": []
        },
        "missing_tool_calls": {
            "total": 0,
            "details": []
        }
    }
}
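
When reviewing many runs, the nested `details` object can be flattened into a few counters. The following sketch assumes the key names shown in the sample output above, which may differ between SDK versions. The helper `summarize_tool_call_details` is hypothetical.

```python
def summarize_tool_call_details(result):
    """Flatten the "details" block of a tool call accuracy result.

    Hypothetical helper. Key names are taken from the sample output;
    any missing key falls back to 0 so partial results do not raise.
    """
    details = result.get("details", {})
    return {
        "made": details.get("tool_calls_made_by_agent", 0),
        "correct": details.get("correct_tool_calls_made_by_agent", 0),
        "excess": details.get("excess_tool_calls", {}).get("total", 0),
        "missing": details.get("missing_tool_calls", {}).get("total", 0),
    }
```

For the sample output, the summary reports one call made, one correct call, and no excess or missing calls.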

If you build agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see Tool Call Accuracy.

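Task adherence

In task-oriented AI systems such as agentic systems, it's important to assess whether the agent stays on track to complete a given task instead of taking inefficient or out-of-scope steps. TaskAdherenceEvaluator measures how well an agent's response adheres to its assigned task, according to its task instructions and available tools. The task instructions are extracted from the system message and the user query. A higher score means better adherence to the system instructions in resolving the given task.

Task adherence example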

from azure.ai.evaluation import TaskAdherenceEvaluator

task_adherence = TaskAdherenceEvaluator(model_config=model_config, threshold=3)
task_adherence(
    query="What are the best practices for maintaining a healthy rose garden during the summer?",
    response="Make sure to water your roses regularly and trim them occasionally."
)

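Task adherence output

The numerical score is on a Likert scale (integers 1 to 5), and higher is better. Given a numerical threshold (default 3), the evaluator also outputs "pass" if the score is greater than or equal to the threshold, and "fail" otherwise. Use the reason field to understand why the score is high or low.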

{
    "task_adherence": 2.0,
    "task_adherence_result": "fail",
    "task_adherence_threshold": 3,
    "task_adherence_reason": "The response partially addresses the query by mentioning relevant practices but lacks critical details and depth, making it insufficient for a comprehensive understanding of maintaining a rose garden in summer."
}
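
Because every evaluator in this article reports a `<metric>_result` label, results collected over several queries can be reduced to a pass rate. The following is a minimal sketch, assuming the results are plain dicts shaped like the sample outputs. The helper `pass_rate` is hypothetical and not part of the SDK.

```python
def pass_rate(results, metric):
    """Return the fraction of results labeled "pass" for `metric`.

    Hypothetical helper. Results without a `<metric>_result` key are
    ignored; returns None when no result carries the label.
    """
    key = f"{metric}_result"
    labels = [result[key] for result in results if key in result]
    if not labels:
        return None
    return sum(label == "pass" for label in labels) / len(labels)
```

For example, one passing and one failing task adherence result give a pass rate of 0.5.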

If you build agents outside of Azure AI Agent Service, this evaluator accepts a schema typical for agent messages. For a sample notebook, see Task Adherence.


Last updated on 2025-10-30