Azure OpenAI 평가판(미리 보기)

2025-08-16

큰 언어 모델의 평가는 다양한 작업 및 차원에서 성능을 측정하는 중요한 단계입니다. 이는 학습에서 성능 향상(또는 손실)을 평가하는 것이 중요한 미세 조정된 모델에 특히 중요합니다. 철저한 평가는 다양한 버전의 모델이 애플리케이션 또는 시나리오에 미치는 영향을 이해하는 데 도움이 될 수 있습니다.

Azure OpenAI 평가를 통해 개발자는 예상되는 입출력 쌍에 대해 테스트하기 위한 평가 실행을 만들고 정확도, 안정성, 전반적인 성능과 같은 주요 메트릭에 걸쳐 모델의 성능을 평가할 수 있습니다.

평가 지원

지역별 가용성

오스트레일리아 동부
브라질 남부
캐나다 중부
미국 중부
미국 동부 2
프랑스 중부
독일 중서부
이탈리아 북부
일본 동부
일본 서부
한국 중부
미국 중북부
노르웨이 동부
폴란드 중부
남아프리카 북부
동남아시아
스페인 중부
스웨덴 중부
스위스 북부
스위스 서부
아랍에미리트 북부
영국 남부
영국 서부
서유럽
미국 서부
미국 서부 2
미국 서부 3

선호하는 지역이 누락된 경우 Azure OpenAI 지역을 참조하고 Azure OpenAI 지역 가용성 영역 중 하나인지 확인합니다.

지원되는 배포 유형

스탠다드
글로벌 표준
데이터 영역 표준
프로비전 관리
글로벌 프로비전 관리
데이터 영역 프로비전 관리

평가 API(미리 보기)

평가 API를 사용하면 API 호출을 통해 직접 모델 출력을 테스트하고 프로그래밍 방식으로 모델 품질 및 성능을 평가할 수 있습니다. Evaluation API를 사용하려면 REST API 설명서를 확인하세요.

평가 파이프라인

테스트 데이터

테스트하려는 참조 자료 데이터 세트를 조립해야 합니다. 데이터 세트 만들기는 일반적으로 시간이 지나도 평가가 시나리오와 관련성을 유지하도록 보장하는 반복적인 프로세스입니다. 이러한 참조 자료 데이터 세트는 일반적으로 수작업으로 작성되며 모델에서 예상되는 동작을 나타냅니다. 데이터 세트에는 레이블이 지정되어 있으며 예상 답변이 포함되어 있습니다.

비고

감정 및 유효한 JSON 또는 XML과 같은 일부 평가 테스트에는 기본 진리 데이터가 필요하지 않습니다.

데이터 원본은 JSONL 형식이어야 합니다. 다음은 JSONL 평가 데이터 세트의 두 가지 예입니다.

{"question": "Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.", "subject": "abstract_algebra", "A": "0", "B": "4", "C": "2", "D": "6", "answer": "B", "completion": "B"}
{"question": "Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5.", "subject": "abstract_algebra", "A": "8", "B": "2", "C": "24", "D": "120", "answer": "C", "completion": "C"}
{"question": "Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5", "subject": "abstract_algebra", "A": "0", "B": "1", "C": "0,1", "D": "0,4", "answer": "D", "completion": "D"}

평가 파일을 업로드하고 선택하면 처음 세 줄의 미리 보기가 반환됩니다.

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}

평가 파일을 업로드하고 선택하면 처음 세 줄의 미리 보기가 반환됩니다.

이전에 업로드한 기존 데이터 세트를 선택하거나 새 데이터 세트를 업로드할 수 있습니다.

응답 만들기(선택 사항)

평가에 사용하는 프롬프트는 프로덕션에서 사용할 프롬프트와 일치해야 합니다. 이러한 프롬프트는 모델이 따라야 할 지침을 제공합니다. 플레이그라운드 환경과 마찬가지로, 프롬프트에 퓨샷 장면 예를 포함하도록 여러 개의 입력을 만들 수 있습니다. 자세한 내용은 프롬프트 디자인 및 프롬프트 엔지니어링 의 일부 고급 기술에 대한 자세한 내용은 프롬프트 엔지니어링 기술을 참조하세요.

{{input.column_name}} 형식을 사용하여 프롬프트 내에서 입력 데이터를 참조할 수 있습니다. 여기서 column_name은 입력 파일의 열 이름에 해당합니다.

평가 중에 생성된 출력은 {{sample.output_text}} 형식을 사용하여 후속 단계에서 참조됩니다.

비고

데이터를 올바르게 참조하려면 이중 중괄호를 사용해야 합니다.

모델 배포

Azure OpenAI에서는 평가에 사용할 모델 배포를 만들어야 합니다. 필요에 따라 단일 모델 또는 여러 모델을 선택하고 배포할 수 있습니다. 이러한 모델 배포는 기본 모델 또는 선택한 테스트 조건으로 미세 조정된 모델을 채점할 때 사용됩니다. 배포된 모델을 사용하여 제공된 프롬프트에 대한 응답을 자동으로 생성할 수도 있습니다.

목록에서 사용할 수 있는 배포는 Azure OpenAI 리소스 내에서 만든 배포에 따라 달라집니다. 원하는 배포를 찾을 수 없는 경우 Azure OpenAI 평가 페이지에서 새 배포를 만들 수 있습니다.

테스트 기준

테스트 기준은 대상 모델에서 생성된 각 출력의 효과를 평가하는 데 사용됩니다. 이러한 테스트는 일관성을 보장하기 위해 입력 데이터와 출력 데이터를 비교합니다. 다양한 수준에서 출력의 품질과 관련성을 테스트하고 측정하기 위해 다양한 기준을 유연하게 구성할 수 있습니다.

각 테스트 조건을 클릭하면 고유한 평가 데이터 세트 및 조건에 따라 수정할 수 있는 미리 설정된 스키마뿐만 아니라 다양한 유형의 채점자가 표시됩니다.

시작하기

평가 만들기

Azure AI Foundry 포털 내에서 Azure OpenAI 평가(미리 보기)를 선택합니다. 이 보기를 옵션으로 보려면 먼저 지원되는 지역에서 기존 Azure OpenAI 리소스를 선택합니다.
선택 + 새 평가
평가를 위해 테스트 데이터를 제공하는 방법을 선택합니다. 저장된 채팅 완료를 가져오거나, 제공된 기본 템플릿을 사용하여 데이터를 만들거나, 고유한 데이터를 업로드할 수 있습니다. 사용자 고유의 데이터 업로드를 살펴보겠습니다.

.jsonl 형식의 평가용 데이터를 선택하세요. 기존 데이터가 이미 있는 경우 해당 데이터를 선택하거나 새 데이터를 업로드할 수 있습니다.

새 데이터를 업로드하면 파일의 처음 세 줄이 오른쪽에 미리 보기로 표시됩니다.

샘플 테스트 파일이 필요한 경우 이 샘플 .jsonl 텍스트를 사용할 수 있습니다. 이 샘플에는 다양한 기술 콘텐츠의 문장이 포함되어 있으며, 이러한 문장에서 의미 체계 유사성을 평가할 예정입니다.

{"input": [{"role": "system", "content": "Provide a clear and concise summary of the technical content, highlighting key concepts and their relationships. Focus on the main ideas and practical implications."}, {"role": "user", "content": "Tokenization is a key step in preprocessing for natural language processing, involving the division of text into smaller components called tokens. These can be words, subwords, or characters, depending on the method chosen. Word tokenization divides text at word boundaries, while subword techniques like Byte Pair Encoding (BPE) or WordPiece can manage unknown words by breaking them into subunits. Character tokenization splits text into individual characters, useful for multiple languages and misspellings. The tokenization method chosen greatly affects model performance and its capacity to handle various languages and vocabularies."}], "output": "Tokenization divides text into smaller units (tokens) for NLP applications, using word, subword (e.g., BPE), or character methods. Each has unique benefits, impacting model performance and language processing capabilities."}      
{"input": [{"role": "system", "content": "Create a comprehensive yet concise summary that captures the essential technical details and significance of the topic."}, {"role": "user", "content": "Self-attention mechanisms are vital in modern transformer models, allowing them to evaluate the relevance of different parts of an input sequence when processing each element. In self-attention, each position in a sequence learns to focus on all other positions, forming a weighted sum of their values. Attention weights are derived using queries, keys, and values, with compatibility between a query and key determining attention to each value. This mechanism enables capturing long-range dependencies and relationships within a sequence, making it effective for tasks needing context understanding and inter-part relationships."}], "output": "Self-attention is crucial in transformers, enabling each sequence position to evaluate and collect data from all others. It uses queries, keys, and values for attention weights, effectively capturing long-range dependencies and contextual links."}      
{"input": [{"role": "system", "content": "Synthesize the technical information into a brief, accurate summary that maintains the core concepts and their significance."}, {"role": "user", "content": "The transformer architecture significantly changed natural language processing by introducing a new method for sequence processing. Unlike recurrent neural networks, transformers process whole sequences at once using self-attention. The architecture consists of an encoder and decoder, each with multiple layers of multi-head attention, feed-forward networks, and normalization layers. The encoder processes the input sequence, and the decoder produces the output sequence. Positional encodings are added to retain sequence order information since the model lacks inherent sequence position awareness. This architecture underlies many leading language models like BERT and GPT."}], "output": "Transformers are a transformative neural architecture processing sequences with self-attention instead of recurrence. Their encoder-decoder design, with multi-head attention and position encodings, is fundamental to modern language models like BERT and GPT."}      
{"input": [{"role": "system", "content": "Distill the technical content into a clear, concise summary that preserves the most important concepts and their practical applications."}, {"role": "user", "content": "Cosine similarity is a metric for measuring similarity between vectors by calculating the cosine of the angle between them. The outcome ranges from -1 to 1, with 1 indicating identical orientation, 0 orthogonality, and -1 opposite orientation. In machine learning and NLP, cosine similarity is useful for comparing document vectors, word embeddings, or any high-dimensional representations. It's often preferred over Euclidean distance because it's invariant to vector magnitude, focusing instead on orientation. This makes it particularly useful for comparing documents or embeddings of various lengths or scales."}], "output": "Cosine similarity evaluates vector similarity by computing their angle's cosine, with values from -1 to 1. It's widely used in ML and NLP for comparing document vectors and embeddings due to its magnitude-invariant nature."}      
{"input": [{"role": "system", "content": "Provide a succinct summary that captures the key technical concepts and their relationships while maintaining clarity and accuracy."}, {"role": "user", "content": "Neural network activation functions are crucial for introducing non-linearity to deep learning models. Commonly used activation functions include ReLU (Rectified Linear Unit), which outputs the input directly if positive and zero otherwise, sigmoid, which compresses values between 0 and 1, and tanh, which normalizes values between -1 and 1. Each function has specific benefits and applications: ReLU avoids vanishing gradients and enables sparse activation, sigmoid is useful for binary classification, and tanh is often used in hidden layers for its zero-centered output. The activation function choice greatly affects model training dynamics and ultimate performance."}], "output": "Activation functions introduce non-linearity to neural networks, with ReLU, sigmoid, and tanh being the most common. Each serves distinct purposes: ReLU prevents vanishing gradients, sigmoid is suitable for binary classification, and tanh offers zero-centered output, all affecting model performance differently."}      
{"input": [{"role": "system", "content": "Create a focused summary that emphasizes the main technical concepts and their practical implications while maintaining accuracy."}, {"role": "user", "content": "Batch normalization is a technique that stabilizes and accelerates deep neural network training by normalizing each layer's inputs. It normalizes activations using the mean and variance computed over the current data batch. This normalization reduces internal covariate shift, where the input distribution changes during training. Batch normalization includes learnable parameters (gamma and beta) allowing the network to reverse normalization if necessary. Benefits include enabling higher learning rates, lessening the importance of precise parameter initialization, and serving as a regularizer, potentially removing the need for dropout."}], "output": "Batch normalization stabilizes neural network training by normalizing layer inputs using batch statistics. It addresses internal covariate shift with learnable parameters, allowing higher learning rates and improved training dynamics, while offering regularization benefits."}      
{"input": [{"role": "system", "content": "Summarize the technical content while preserving the essential concepts and their interconnections, ensuring clarity and precision."}, {"role": "user", "content": "Gradient descent optimization algorithms are foundational to training machine learning models. The basic version updates parameters in the steepest descent direction of the loss function. Variants like Stochastic Gradient Descent (SGD) use random samples to estimate gradients, enhancing computational efficiency. Advanced optimizers like Adam combine momentum (helping overcome local minima) and RMSprop (adapting learning rates for each parameter). These optimizers maintain running averages of gradients and their squares, enabling adaptive learning rates and more efficient training. The optimizer choice can significantly influence model convergence speed and final performance."}], "output": "Gradient descent algorithms optimize ML models by updating parameters based on loss gradients. Variants like SGD use sampling for efficiency, while advanced optimizers like Adam combine momentum with adaptive learning rates for improved training performance."}      
{"input": [{"role": "system", "content": "Deliver a concise yet comprehensive summary that highlights the key technical aspects and their practical significance."}, {"role": "user", "content": "Cross-validation is a resampling method used to evaluate machine learning models on limited data. The most common form, k-fold cross-validation, involves dividing the dataset into k equal parts, training the model on k-1 folds while validating on the remaining fold, and rotating through all combinations. This provides a more robust model performance assessment than a single train-test split, helping detect overfitting and estimate model generalization to unseen data. Variations include stratified k-fold (maintaining class distribution), leave-one-out (k equals dataset size), and time series cross-validation for temporal data."}], "output": "Cross-validation evaluates ML models by training and testing on different data splits, typically using k-fold methodology. This approach offers better performance assessment than single splits, with variations for different data types and requirements."}      
{"input": [{"role": "system", "content": "Generate a clear and focused summary that captures the essential technical details while maintaining their relationships and significance."}, {"role": "user", "content": "Transfer learning is a machine learning method where a model developed for one task is reused as the starting point for a model on a second task. This approach is powerful in deep learning, where pre-trained models on large datasets (like ImageNet for computer vision or BERT for NLP) are fine-tuned on specific downstream tasks. Transfer learning reduces the need for large amounts of task-specific training data and computational resources, as the model has already learned useful features from the source ___domain. Common strategies include feature extraction (freezing pre-trained layers) and fine-tuning (updating all or some pre-trained weights)."}], "output": "Transfer learning reuses models trained on one task for different tasks, particularly effective in deep learning. It leverages pre-trained models through feature extraction or fine-tuning, reducing data and computational needs for new tasks."}      
{"input": [{"role": "system", "content": "Provide a precise and informative summary that distills the key technical concepts while maintaining their relationships and practical importance."}, {"role": "user", "content": "Ensemble methods combine multiple machine learning models to create a more robust and accurate predictor. Common techniques include bagging (training models on random data subsets), boosting (sequentially training models to correct earlier errors), and stacking (using a meta-model to combine base model predictions). Random Forests, a popular bagging method, create multiple decision trees using random feature subsets. Gradient Boosting builds trees sequentially, with each tree correcting the errors of previous ones. These methods often outperform single models by reducing overfitting and variance while capturing different data aspects."}], "output": "Ensemble methods enhance prediction accuracy by combining multiple models through techniques like bagging, boosting, and stacking. Popular implementations include Random Forests (using multiple trees with random features) and Gradient Boosting (sequential error correction), offering better performance than single models."}

테스트 데이터의 입력을 사용하여 새 응답을 만들려면 새 응답 생성을 선택할 수 있습니다. 그러면 평가 파일의 입력 필드가 출력을 생성하기 위해 선택한 모델에 대한 개별 프롬프트에 삽입됩니다.

선택한 모델을 선택합니다. 모델이 없는 경우 새 모델 배포를 만들 수 있습니다. 선택한 모델은 입력 데이터를 가져와 고유한 출력을 생성합니다. 이 경우 변수 {{sample.output_text}}에 저장됩니다. 그런 다음 나중에 테스트 조건의 일부로 해당 출력을 사용합니다. 또는 사용자 고유의 사용자 지정 시스템 메시지와 개별 메시지 예제를 수동으로 제공할 수 있습니다.

테스트 조건을 만들려면 추가를 선택합니다. 제공된 예제 파일의 경우 의미 체계 유사성을 평가할 것입니다. 의미 체계 유사성에 대한 테스트 조건 사전 설정이 포함된 모델 득점자를 선택합니다.

맨 위에서 의미 체계 유사성 을 선택합니다. 아래쪽으로 스크롤한 후, User 섹션에서 {{item.output}}을 Ground truth으로 지정하고, {{sample.output_text}}를 Output로 지정합니다. 이렇게 하면 평가 .jsonl 파일(제공된 샘플 파일)의 원래 참조 출력을 가져와서 이전 단계에서 선택한 모델에서 생성된 출력과 비교합니다.

추가를 선택하여 이 테스트 조건을 추가합니다. 테스트 조건을 더 추가하려는 경우 이 단계에서 추가할 수 있습니다.
이제 평가를 만들 준비가 되었습니다. 평가 이름을 입력하고, 모든 항목이 올바른지 검토하고, 제출 하여 평가 작업을 만듭니다. 평가 작업의 상태 페이지로 이동하면 상태가 대기 중으로 표시됩니다.

평가 작업이 만들어지면 작업을 선택하여 작업의 전체 세부 정보를 볼 수 있습니다.

의미 유사성 출력 세부 정보 보기에는 통과한 테스트 결과를 복사/붙여넣을 수 있는 JSON 표현이 포함되어 있습니다.

평가 작업 페이지의 왼쪽 위 모서리에서 + 실행 추가 단추를 선택하여 Eval 실행을 더 추가할 수도 있습니다.

평가 만들기

데이터 원본 구성 및 평가 테스트 조건을 지정하여 평가를 만들 수 있습니다. 다음은 데이터 원본 구성을 정의할 수 있는 여러 가지 방법 중 하나입니다. 하나 이상의 테스트 조건을 지정할 수도 있습니다.

import asyncio
import json
import requests

async def create_eval():
    response = await asyncio.to_thread(
        requests.post,
        f'{API_ENDPOINT}/openai/v1/evals',
        headers={
            'api-key': API_KEY,
            'aoai-evals': 'preview'
        },
        json={
            'name': 'My Evaluation',
            'data_source_config': {
                'type': 'custom',
                'item_schema': {
                'type': 'object',
                'properties': {
                    'question': {
                    'type': 'string'
                    },
                    'subject': {
                    'type': 'string'
                    },
                    'A': {
                    'type': 'string'
                    },
                    'B': {
                    'type': 'string'
                    },
                    'C': {
                    'type': 'string'
                    },
                    'D': {
                    'type': 'string'
                    },
                    'answer': {
                    'type': 'string'
                    },
                    'completion': {
                    'type': 'string'
                    }
                }
                }
            },
            'testing_criteria': [
                {
                'type': 'string_check',
                'reference': '{{item.completion}}',
                'input': '{{item.answer}}',
                'operation': 'eq',
                'name': 'string check'
                }
            ]
        })

    print(response.status_code)
    print(json.dumps(response.json(), indent=2))

단일 실행 만들기

Azure OpenAI Evaluation을 사용하면 평가 작업에서 여러 실행을 만들 수 있습니다. 단일 평가 실행을 기존 평가에 추가하려면 기존 평가의 eval_id 실행을 지정할 수 있습니다.

import asyncio
import requests
import json


response = await asyncio.to_thread(
    requests.post,
    f'{API_ENDPOINT}/openai/v1/evals/{eval_id}/runs',
    headers={
        'api-key': API_KEY,
        'aoai-evals': 'preview'
    },
    json={
        "name": "No sample",
        "metadata": {
            "sample_generation": "off",
            "file_format": "jsonl"
        },
        "data_source": {
            "type": "jsonl",
            "source": {
            "type": "file_id",
            "id": "file-75099d8d4b5b4abca7cc91e9eca7bba1"
            }
        }
    })

print(response.status_code)
print(json.dumps(response.json(), indent=2))

기존 평가 업데이트

import asyncio
import requests
import json

async def update_eval():
    response = await asyncio.to_thread(
        requests.post,
        f'{API_ENDPOINT}/openai/v1/evals/{eval_id}',
        headers={
            'api-key': API_KEY,
            'aoai-evals': 'preview'
        },
        json={
            "name": "Updated Eval Name",
            "metadata": {
                "sample_generation": "off",
                "file_format": "jsonl",
                "updated": "metadata"
            }
        })

    print(response.status_code)
    print(json.dumps(response.json(), indent=2))

평가 결과

평가가 완료되면 를 지정하여 eval_id평가 작업에 대한 평가 결과를 가져올 수 있습니다.

import asyncio
import requests

async def get_eval():
    response = await asyncio.to_thread(
        requests.get,
        f'{API_ENDPOINT}/openai/v1/evals/{eval_id}',
        headers={
            'api-key': API_KEY,
            'aoai-evals': 'preview'
        })

    print(response.status_code)
    print(response.json())

단일 평가 실행 결과

기존 평가 작업에서 단일 평가 실행을 만드는 방법과 마찬가지로 단일 실행에 대한 결과를 검색할 수도 있습니다.

import asyncio
import requests
import json

async def get_eval_run():
    response = await asyncio.to_thread(
        requests.get,
        f'{API_ENDPOINT}/openai/v1/evals/eval_67fd95c864f08190817f0dff5f42f49e/runs/evalrun_67fe987a6c548190ba6f33f7cd89343d',
        headers={
            'api-key': API_KEY,
            'aoai-evals': 'preview'
        })

    print(response.status_code)
    print(json.dumps(response.json(), indent=2))

위의 예제의 매개 변수 외에도 필요에 따라 더 구체적인 드릴다운을 위해 이러한 매개 변수를 평가 결과에 추가할 수 있습니다.

이름	안으로	필수	유형	설명
끝점	길	예	문자열	지원되는 Azure OpenAI 엔드포인트(프로토콜 및 호스트 이름(예: `https://aoairesource.openai.azure.com`). "aoairesource"를 Azure OpenAI 리소스 이름으로 대체). https://{your-resource-name}.openai.azure.com
eval-id	길	예	문자열	실행의 조회를 위한 평가 ID입니다.
실행-아이디	길	예	문자열	출력 항목을 검색할 실행의 ID입니다.
후	문의	아니오	문자열	이전 페이지 매김 요청의 마지막 출력 항목에 대한 식별자입니다.
한계	문의	아니오	정수	검색할 출력 항목의 수입니다.
상태	문의	아니오	문자열	가능한 값: fail, pass. 상태별로 출력 항목을 필터링합니다. 실패한 출력 항목으로 필터링하지 못하거나 전달된 출력 항목을 필터링하는 데 실패합니다.
주문	문의	아니오	문자열	가능한 값: asc, desc. 타임스탬프를 기준으로 출력 항목의 정렬 순서입니다. 오름차순으로 asc를 사용하거나 내림차순으로 desc를 사용합니다. 기본값은 asc입니다.
API 버전	문의	예	문자열	요청된 API 버전입니다.

평가 목록

생성된 모든 평가 작업 목록을 보려면 다음을 수행합니다.

import asyncio
import requests
import json

async def get_eval_list():
    response = await asyncio.to_thread(
        requests.get,
        f'{API_ENDPOINT}/openai/v1/evals',
        headers={
            'api-key': API_KEY,
            'aoai-evals': 'preview'
        })

    print(response.status_code)
    print(json.dumps(response.json(), indent=2))

실행에 대한 출력 세부 정보

단일 평가 실행에 대해 채점자에서 생성된 개별 출력을 볼 수 있습니다.

import asyncio
import requests
import json

async def get_eval_output_item_list():
    response = await asyncio.to_thread(
        requests.get,
        f'{API_ENDPOINT}/openai/v1/evals/eval_67fd95c864f08190817f0dff5f42f49e/runs/evalrun_67fe987a6c548190ba6f33f7cd89343d/output_items',
        headers={
            'api-key': API_KEY,
            'aoai-evals': 'preview'
        })

    print(response.status_code)
    print(json.dumps(response.json(), indent=2))

보고 싶은 특정 출력 결과가 있는 경우 출력 항목 ID를 지정할 수 있습니다.

import asyncio
import requests
import json

async def get_eval_output_item():
    response = await asyncio.to_thread(
        requests.get,
        f'{API_ENDPOINT}/openai/v1/evals/eval_67fd95c864f08190817f0dff5f42f49e/runs/evalrun_67fe987a6c548190ba6f33f7cd89343d/output_items/outputitem_67fe988369308190b50d805120945deb',
        headers={'api-key': API_KEY})

    print(response.status_code)
    print(json.dumps(response.json(), indent=2))

평가 만들기

curl -X POST "$AZURE_OPENAI_ENDPOINT/openai/v1/evals" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" \
  -H "aoai-evals: preview" \
  -d '{
    "name": "Math Quiz",
    "data_source_config": {
      "type": "custom",
      "include_sample_schema": true,
      "item_schema": {
        "type": "object",
        "properties": {
          "question": { "type": "string" },
          "A": { "type": "string" },
          "B": { "type": "string" },
          "C": { "type": "string" },
          "D": { "type": "string" },
          "answer": { "type": "string" }
        }
      }
    },
    "testing_criteria": [
      {
        "type": "string_check",
        "reference": "{{item.answer}}",
        "input": "{{sample.output_text}}",
        "operation": "eq",
        "name": "string check"
      }
    ]
  }'

단일 실행 만들기

Azure OpenAI Evaluation을 사용하면 평가 작업에서 여러 실행을 만들 수 있습니다. 를 지정하여 이전 단계에서 만든 평가 작업에 새 평가 실행을 추가할 수 있습니다 eval-id.

curl -X POST "$AZURE_OPENAI_ENDPOINT/openai/v1/evals/{eval-id}/runs" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" 
  -H "aoai-evals: preview" \

기존 평가 업데이트

curl -X POST "$AZURE_OPENAI_ENDPOINT/openai/v1/evals/{eval-id} \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" 
  -H "aoai-evals: preview" \

평가 결과

평가가 완료되면 를 지정하여 eval_id평가 작업에 대한 평가 결과를 가져올 수 있습니다.

curl -X GET "$AZURE_OPENAI_ENDPOINT/openai/v1/evals/{eval-id}" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" 
  -H "aoai-evals: preview" \

단일 평가 실행 결과

기존 평가 작업에서 단일 평가 실행을 만드는 방법과 마찬가지로 단일 실행에 대한 결과를 검색할 수도 있습니다.

curl -X GET "$AZURE_OPENAI_ENDPOINT/openai/v1/evals/{eval-id}/runs/{run-id}" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" 
  -H "aoai-evals: preview" \

위의 예제의 매개 변수 외에도 필요에 따라 더 구체적인 드릴다운을 위해 이러한 매개 변수를 평가 결과에 추가할 수 있습니다.

이름	안으로	필수	유형	설명
끝점	길	예	문자열	지원되는 Azure OpenAI 엔드포인트(프로토콜 및 호스트 이름(예: `https://aoairesource.openai.azure.com`). "aoairesource"를 Azure OpenAI 리소스 이름으로 대체). https://{your-resource-name}.openai.azure.com
eval-id	길	예	문자열	실행의 조회를 위한 평가 ID입니다.
실행-아이디	길	예	문자열	출력 항목을 검색할 실행의 ID입니다.
후	문의	아니오	문자열	이전 페이지 매김 요청의 마지막 출력 항목에 대한 식별자입니다.
한계	문의	아니오	정수	검색할 출력 항목의 수입니다.
상태	문의	아니오	문자열	가능한 값: fail, pass. 상태별로 출력 항목을 필터링합니다. 실패한 출력 항목으로 필터링하지 못하거나 전달된 출력 항목을 필터링하는 데 실패합니다.
주문	문의	아니오	문자열	가능한 값: asc, desc. 타임스탬프를 기준으로 출력 항목의 정렬 순서입니다. 오름차순으로 asc를 사용하거나 내림차순으로 desc를 사용합니다. 기본값은 asc입니다.
API 버전	문의	예	문자열	요청된 API 버전입니다.

평가 목록

생성된 모든 평가 작업 목록을 보려면 다음을 수행합니다.

curl -X GET "$AZURE_OPENAI_ENDPOINT/openai/v1/evals/{eval-id}/runs" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" 
  -H "aoai-evals: preview" \

실행에 대한 출력 세부 정보

단일 평가 실행에 대해 채점자에서 생성된 개별 출력을 볼 수 있습니다.

curl -X GET "$AZURE_OPENAI_ENDPOINT/openai/v1/evals/{eval-id}/runs/{run-id}/output_items" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY" 
  -H "aoai-evals: preview" \

보고 싶은 특정 출력 결과가 있는 경우 출력 항목 ID를 지정할 수 있습니다.

curl -X GET "$AZURE_OPENAI_ENDPOINT/openai/v1/evals/{eval-id}/runs/{run-id}/output_items/{output-item-id}" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_OPENAI_API_KEY"
  -H "aoai-evals: preview" \

테스트 조건 유형

Azure OpenAI 평가는 제공된 예제에서 본 의미 체계 유사성을 기반으로 다양한 평가 테스트 조건을 제공합니다. 이 섹션에서는 각 테스트 조건에 대한 정보를 훨씬 더 자세히 설명합니다.

사실성

제출된 답변의 실제 정확도를 전문가 답변과 비교하여 평가합니다.

사실성은 제출된 답변의 실제 정확도를 전문가 답변과 비교하여 평가합니다. 채점자는 CoT(생각의 사슬) 프롬프트를 활용하여 제출된 답변이 전문가 답변과 일관성이 있는지, 하위 집합인지, 상위 집합인지, 또는 상충되는지를 판단합니다. 이는 스타일, 문법 또는 문장 부호의 차이를 무시하고 오로지 실제 콘텐츠에만 포커스를 맞춥니다. 사실성은 AI가 제공하는 답변의 정확도를 보장하는 교육 도구 및 콘텐츠 검증을 포함하되 이에 국한되지 않는 여러 시나리오에서 유용할 수 있습니다.

이 테스트 기준의 일부로 사용되는 프롬프트 텍스트를 보려면 프롬프트 옆에 있는 드롭다운을 선택합니다. 현재 프롬프트 텍스트는 다음과 같습니다.

Prompt
You are comparing a submitted answer to an expert answer on a given question.
Here is the data:
[BEGIN DATA]
************
[Question]: {input}
************
[Expert]: {ideal}
************
[Submission]: {completion}
************
[END DATA]
Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
(A) The submitted answer is a subset of the expert answer and is fully consistent with it.
(B) The submitted answer is a superset of the expert answer and is fully consistent with it.
(C) The submitted answer contains all the same details as the expert answer.
(D) There is a disagreement between the submitted answer and the expert answer.
(E) The answers differ, but these differences don't matter from the perspective of factuality.

의미 체계 유사성

모델의 응답과 참조 간의 유사성 정도를 측정합니다. Grades: 1 (completely different) - 5 (very similar);

감정

출력의 감정적 톤을 식별하려고 시도합니다.

이 테스트 기준의 일부로 사용되는 프롬프트 텍스트를 보려면 프롬프트 옆에 있는 드롭다운을 선택합니다. 현재 프롬프트 텍스트는 다음과 같습니다.

Prompt
You will be presented with a text generated by a large language model. Your job is to rate the sentiment of the text. Your options are:

A) Positive
B) Neutral
C) Negative
D) Unsure

[BEGIN TEXT]
***
[{text}]
***
[END TEXT]

First, write out in a step by step manner your reasoning about the answer to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character (without quotes or punctuation) on its own line corresponding to the correct answer. At the end, repeat just the letter again by itself on a new line

문자열 확인

출력이 예상 문자열과 정확히 일치하는지 확인합니다.

문자열 검사는 두 개의 문자열 변수에 대해 다양한 이진 연산을 수행하여 다양한 평가 기준을 허용합니다. 이는 동등성, 포함, 특정 패턴 등 다양한 문자열 관계를 확인하는 데 도움이 됩니다. 이 평가기는 대/소문자를 구분하거나 구분하지 않는 비교를 허용합니다. 또한 참 또는 거짓 결과에 대해 지정된 등급을 제공하여 비교 결과에 따라 사용자 지정된 평가 결과를 제공합니다. 지원되는 작업 형식은 다음과 같습니다.

equals: 출력 문자열이 평가 문자열과 정확히 같은지 확인합니다.
contains: 평가 문자열이 출력 문자열의 부분 문자열인지 확인합니다.
starts-with: 출력 문자열이 평가 문자열로 시작하는지 확인합니다.
ends-with: 출력 문자열이 평가 문자열로 끝나는지 확인합니다.

비고

테스트 기준에서 특정 매개 변수를 설정할 때 변수 와 템플릿 중에서 선택할 수 있는 옵션이 있습니다. 입력 데이터의 열을 참조하려면 변수 를 선택합니다. 고정 문자열을 제공하려는 경우 템플릿 을 선택합니다.

유효한 JSON 또는 XML

출력이 유효한 JSON인지 XML인지 확인합니다.

스키마와 일치

출력이 지정된 구조를 따르는지 확인합니다.

기준 일치

모델의 응답이 기준에 맞는지 평가합니다. 등급: 합격 또는 불합격.

이 테스트 기준의 일부로 사용되는 프롬프트 텍스트를 보려면 프롬프트 옆에 있는 드롭다운을 선택합니다. 현재 프롬프트 텍스트는 다음과 같습니다.

Prompt
Your job is to assess the final response of an assistant based on conversation history and provided criteria for what makes a good response from the assistant. Here is the data:

[BEGIN DATA]
***
[Conversation]: {conversation}
***
[Response]: {response}
***
[Criteria]: {criteria}
***
[END DATA]

Does the response meet the criteria? First, write out in a step by step manner your reasoning about the criteria to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then print only the single character "Y" or "N" (without quotes or punctuation) on its own line corresponding to the correct answer. "Y" for yes if the response meets the criteria, and "N" for no if it does not. At the end, repeat just the letter again by itself on a new line.
Reasoning:

텍스트 품질

참조 텍스트와 비교하여 텍스트의 질을 평가합니다.

요약:

BLEU 점수: BLEU 점수를 사용하여 생성된 텍스트를 하나 이상의 고품질 참조 번역과 비교하여 품질을 평가합니다.
ROUGE 점수: ROUGE 점수를 사용하여 생성된 텍스트와 참조 요약을 비교하여 텍스트의 품질을 평가합니다.
코사인: 코사인 유사성이라고도 하며, 두 개의 텍스트 포함(모델 출력과 참조 텍스트 등)의 의미가 얼마나 일치하는지를 측정하여 두 텍스트 포함 간의 의미 체계 유사성을 평가하는 데 도움이 됩니다. 이는 벡터 공간에서 거리를 측정하여 수행됩니다.

세부 정보:

BLEU(Bilingual Evaluation Understudy) 점수는 NLP(자연어 처리) 및 기계 번역에서 일반적으로 사용됩니다. 텍스트 요약 및 텍스트 생성 사용 사례에 널리 사용됩니다. 생성된 텍스트가 참조 텍스트와 얼마나 일치하는지 평가합니다. BLEU 점수는 0~1까지이며, 점수가 높을수록 품질이 좋음을 나타냅니다.

ROUGE(Recall-Oriented Understudy for Gisting Evaluation)는 자동 요약 및 기계 번역을 평가하는 데 사용되는 메트릭 세트입니다. 생성된 텍스트와 참조 요약 간의 중첩을 측정합니다. ROUGE는 생성된 텍스트가 참조 텍스트를 얼마나 잘 다루는지 평가하기 위해 재현율 지향적 측정값에 중점을 둡니다. ROUGE 점수는 다음을 비롯한 다양한 메트릭을 제공합니다.

ROUGE-1: 생성된 텍스트와 참조 텍스트 사이에 유니그램(단일 단어)이 겹칩니다.
ROUGE-2: 생성된 텍스트와 참조 텍스트 사이에 2그램(두 개의 연속된 단어)이 겹칩니다.
ROUGE-3: 생성된 텍스트와 참조 텍스트 사이에 3그램(연속된 세 단어)가 겹칩니다.
ROUGE-4: 생성된 텍스트와 참조 텍스트 사이에 4그램(4개의 연속된 단어)이 겹칩니다.
ROUGE-5: 생성된 텍스트와 참조 텍스트 사이에 5그램(5개의 연속된 단어)이 겹칩니다.
ROUGE-L: 생성된 텍스트와 참조 텍스트 사이에 L-그램(L개의 연속된 단어)이 겹칩니다.

텍스트 요약 및 문서 비교는 특히 텍스트 일관성 및 관련성이 중요한 시나리오에서 ROUGE에 대한 최적의 사용 사례 중 하나입니다.

코사인 유사성은 두 개의 텍스트 포함(모델 출력과 참조 텍스트 등)의 의미가 얼마나 일치하는지를 측정하여 두 텍스트 포함 간의 의미 체계 유사성을 평가하는 데 도움이 됩니다. 다른 모델 기반 평가기와 마찬가지로 평가를 위해 모델 배포를 제공해야 합니다.

중요합니다

이 평가기에서는 포함 모델만 지원됩니다.

text-embedding-3-small
text-embedding-3-large
text-embedding-ada-002

사용자 지정 프롬프트

모델을 사용하여 출력을 지정된 레이블 집합으로 분류합니다. 이 평가기는 사용자가 정의해야 하는 사용자 지정 프롬프트를 사용합니다.

피드백

이 페이지가 도움이 되었나요?

다음을 통해 공유

Azure OpenAI 평가판(미리 보기)

평가 지원

지역별 가용성

지원되는 배포 유형

평가 API(미리 보기)

평가 파이프라인

테스트 데이터

평가 형식

응답 만들기(선택 사항)

모델 배포

테스트 기준

시작하기

평가 만들기

평가 만들기

단일 실행 만들기

기존 평가 업데이트

평가 결과

단일 평가 실행 결과

평가 목록

실행에 대한 출력 세부 정보

평가 만들기

단일 실행 만들기

기존 평가 업데이트

평가 결과

단일 평가 실행 결과

평가 목록

실행에 대한 출력 세부 정보

테스트 조건 유형

사실성

의미 체계 유사성

감정

문자열 확인

유효한 JSON 또는 XML

스키마와 일치

기준 일치

텍스트 품질

사용자 지정 프롬프트

피드백

추가 리소스