Observability in generative AI

Important

Items marked (preview) in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

In today's AI-driven world, Generative AI Operations (GenAIOps) is revolutionizing how organizations build and deploy intelligent systems. As companies increasingly use AI to transform decision-making, enhance customer experiences, and fuel innovation, one element stands paramount: robust evaluation frameworks. Evaluation isn't just a checkpoint. It's the foundation of trust in AI applications. Without rigorous assessment, AI systems can produce content that's:

  • Fabricated or ungrounded in reality
  • Irrelevant or incoherent to user needs
  • Harmful in perpetuating content risks and stereotypes
  • Dangerous in spreading misinformation
  • Vulnerable to security exploits

This is where evaluators become essential. These specialized tools measure both the frequency and severity of risks in AI outputs, enabling teams to systematically address quality, safety, and security concerns throughout the entire AI development journey—from selecting the right model to monitoring production performance, quality, and safety.

What are evaluators?

Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users. The following supported evaluators provide comprehensive assessment capabilities across different AI application types and concerns:

General purpose

| Evaluator | Purpose | Inputs |
| --- | --- | --- |
| Coherence | Measures logical consistency and flow of responses. | Query, response |
| Fluency | Measures natural language quality and readability. | Response |
| QA | Comprehensively measures various quality aspects in question answering. | Query, context, response, ground truth |

To learn more, see General purpose evaluators.
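For example, a minimal sketch of calling these evaluators with the Azure AI Evaluation SDK (the `azure-ai-evaluation` Python package) might look like the following; the endpoint, key, and deployment values are placeholders, and exact signatures can vary by SDK version.

```python
# Minimal sketch using the azure-ai-evaluation Python package (pip install azure-ai-evaluation).
# Endpoint, key, and deployment values are placeholders; signatures can vary by SDK version.
from azure.ai.evaluation import CoherenceEvaluator, FluencyEvaluator

# AI-assisted evaluators are backed by a judge model, configured here via an Azure OpenAI deployment.
model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

coherence = CoherenceEvaluator(model_config=model_config)
fluency = FluencyEvaluator(model_config=model_config)

query = "What is included in a monthly billing statement?"
response = "A monthly statement lists your charges, payments, and the balance due."

# Each evaluator returns a dictionary of scores (and, for AI-assisted metrics, a reason).
print(coherence(query=query, response=response))
print(fluency(response=response))
```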

Textual similarity

| Evaluator | Purpose | Inputs |
| --- | --- | --- |
| Similarity | AI-assisted textual similarity measurement. | Query, response, ground truth |
| F1 Score | Harmonic mean of precision and recall over token overlap between response and ground truth. | Response, ground truth |
| BLEU | Bilingual Evaluation Understudy score for translation quality; measures n-gram overlap between response and ground truth. | Response, ground truth |
| GLEU | Google-BLEU variant for sentence-level assessment; measures n-gram overlap between response and ground truth. | Response, ground truth |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation; measures n-gram overlap between response and ground truth. | Response, ground truth |
| METEOR | Metric for Evaluation of Translation with Explicit Ordering; measures n-gram overlap between response and ground truth. | Response, ground truth |

To learn more, see Textual similarity evaluators.
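Because the F1, BLEU, GLEU, ROUGE, and METEOR scores are computed mathematically from token overlap, they don't need a judge model. A minimal sketch follows, assuming the `azure-ai-evaluation` package; import names can vary by SDK version.

```python
# Minimal sketch: math-based textual similarity evaluators need no model configuration.
# Assumes the azure-ai-evaluation package; import names can vary by SDK version.
from azure.ai.evaluation import BleuScoreEvaluator, F1ScoreEvaluator, RougeScoreEvaluator, RougeType

response = "Tokyo is the capital of Japan and its largest city."
ground_truth = "The capital of Japan is Tokyo."

# Each evaluator is a callable that compares the response against the ground truth.
print(F1ScoreEvaluator()(response=response, ground_truth=ground_truth))
print(BleuScoreEvaluator()(response=response, ground_truth=ground_truth))
print(RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L)(response=response, ground_truth=ground_truth))
```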

RAG (retrieval augmented generation)

| Evaluator | Purpose | Inputs |
| --- | --- | --- |
| Retrieval | Measures how effectively the system retrieves relevant information. | Query, context |
| Document Retrieval | Measures accuracy in retrieval results given ground truth. | Ground truth, retrieved documents |
| Groundedness | Measures how consistent the response is with respect to the retrieved context. | Query (optional), context, response |
| Groundedness Pro | Measures whether the response is consistent with respect to the retrieved context. | Query, context, response |
| Relevance | Measures how relevant the response is with respect to the query. | Query, response |
| Response Completeness | Measures to what extent the response is complete (not missing critical information) with respect to the ground truth. | Response, ground truth |

To learn more, see Retrieval-augmented Generation (RAG) evaluators.
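As a minimal sketch, the RAG evaluators take the user query, the retrieved context, and the generated response; the model configuration below reuses the placeholder judge-model settings shown earlier.

```python
# Minimal sketch: scoring groundedness and relevance for a RAG-style response.
# The judge-model configuration values are placeholders.
from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

groundedness = GroundednessEvaluator(model_config=model_config)
relevance = RelevanceEvaluator(model_config=model_config)

query = "When was Contoso founded?"
context = "Contoso was founded in 2010 and is headquartered in Redmond."
response = "Contoso was founded in 2010."

print(groundedness(query=query, context=context, response=response))  # consistency with retrieved context
print(relevance(query=query, response=response))                      # relevance to the user's query
```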

Safety and security (preview)

| Evaluator | Purpose | Inputs |
| --- | --- | --- |
| Hate and Unfairness | Identifies biased, discriminatory, or hateful content. | Query, response |
| Sexual | Identifies inappropriate sexual content. | Query, response |
| Violence | Detects violent content or incitement. | Query, response |
| Self-Harm | Detects content promoting or describing self-harm. | Query, response |
| Content Safety | Comprehensive assessment of various safety concerns. | Query, response |
| Protected Materials | Detects unauthorized use of copyrighted or protected content. | Query, response |
| Code Vulnerability | Identifies security issues in generated code. | Query, response |
| Ungrounded Attributes | Detects fabricated or hallucinated information inferred from user interactions. | Query, context, response |

To learn more, see Risk and safety evaluators.
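A minimal sketch of calling the risk and safety evaluators follows. These evaluators are backed by a service in your Azure AI project, so they take a project reference and an Azure credential rather than a judge-model configuration; the project values are placeholders, and the expected shape of the project reference can differ across SDK versions.

```python
# Minimal sketch: risk and safety evaluators call a service hosted by your Azure AI project,
# so they take a project reference and a credential instead of a judge-model config.
# Depending on SDK version, azure_ai_project may be a dict (as below) or a project endpoint URL.
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import ContentSafetyEvaluator, ViolenceEvaluator

azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}
credential = DefaultAzureCredential()

content_safety = ContentSafetyEvaluator(azure_ai_project=azure_ai_project, credential=credential)
violence = ViolenceEvaluator(azure_ai_project=azure_ai_project, credential=credential)

query = "Describe the safety features of the product."
response = "The product includes an automatic shutoff and a child lock."

print(content_safety(query=query, response=response))  # severity labels and scores across categories
print(violence(query=query, response=response))
```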

Agents (preview)

| Evaluator | Purpose | Inputs |
| --- | --- | --- |
| Intent Resolution | Measures how accurately the agent identifies and addresses user intentions. | Query, response |
| Task Adherence | Measures how well the agent follows through on identified tasks. | Query, response, tool definitions (optional) |
| Tool Call Accuracy | Measures how well the agent selects and calls the correct tools to complete a task. | Query, either response or tool calls, tool definitions |

To learn more, see Agent evaluators.
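A minimal sketch of the preview agent evaluators is shown below; because these evaluators are in preview, class names and signatures may change between releases, and the model configuration values are placeholders.

```python
# Minimal sketch for the preview agent evaluators; names and signatures may change
# between preview releases. model_config values are placeholders.
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

intent_resolution = IntentResolutionEvaluator(model_config=model_config)
task_adherence = TaskAdherenceEvaluator(model_config=model_config)

query = "Book a table for two at 7 PM tonight."
response = "I've booked a table for two at 7:00 PM tonight and sent the confirmation to your email."

print(intent_resolution(query=query, response=response))
print(task_adherence(query=query, response=response))
```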

Azure OpenAI graders (preview)

| Evaluator | Purpose | Inputs |
| --- | --- | --- |
| Model Labeler | Classifies content using custom guidelines and labels. | Query, response, ground truth |
| String Checker | Performs flexible text validations and pattern matching. | Response |
| Text Similarity | Evaluates text quality or determines semantic closeness. | Response, ground truth |
| Model Scorer | Generates numerical scores (customized range) for content based on custom guidelines. | Query, response, ground truth |

To learn more, see Azure OpenAI Graders.

Evaluators in the development lifecycle

By using these evaluators strategically throughout the development lifecycle, teams can build more reliable, safe, and effective AI applications that meet user needs while minimizing potential risks.

Diagram of enterprise GenAIOps lifecycle, showing model selection, building an AI application, and operationalizing.

The three stages of GenAIOps evaluation

Base model selection

Before building your application, you need to select the right foundation. This initial evaluation helps you compare different models based on:

  • Quality and accuracy: How relevant and coherent are the model's responses?
  • Task performance: Does the model handle your specific use cases efficiently?
  • Ethical considerations: Is the model free from harmful biases?
  • Safety profile: What is the risk of generating unsafe content?

Tools available: Azure AI Foundry benchmark for comparing models on public datasets or your own data, and the Azure AI Evaluation SDK for testing specific model endpoints.
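As a rough sketch of endpoint testing during model selection, the following compares two candidate deployments by scoring their responses to the same test queries with an AI-assisted quality evaluator. The deployment names, endpoint, keys, and the use of the `openai` package's AzureOpenAI client are assumptions for illustration.

```python
# Rough sketch: score two candidate model deployments on the same test queries with an
# AI-assisted quality evaluator. Deployment names, endpoint, keys, and API version are placeholders.
from openai import AzureOpenAI
from azure.ai.evaluation import RelevanceEvaluator

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)
judge_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<judge-deployment>",
}
relevance = RelevanceEvaluator(model_config=judge_config)

test_queries = ["How do I reset my password?", "What is your refund policy?"]

for deployment in ["<candidate-model-a>", "<candidate-model-b>"]:
    for query in test_queries:
        completion = client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": query}],
        )
        response = completion.choices[0].message.content
        # The returned dictionary includes the relevance score and the judge's reasoning.
        print(deployment, query, relevance(query=query, response=response))
```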

Pre-production evaluation

After you select a base model, the next step is to develop an AI application—such as an AI-powered chatbot, a retrieval-augmented generation (RAG) application, an agentic AI application, or any other generative AI tool. Once development is complete, pre-production evaluation begins. Before deploying to a production environment, thorough testing is essential to ensure the model is ready for real-world use.

Pre-production evaluation involves:

  • Testing with evaluation datasets: These datasets simulate realistic user interactions to ensure the AI application performs as expected.
  • Identifying edge cases: Finding scenarios where the AI application's response quality might degrade or produce undesirable outputs.
  • Assessing robustness: Ensuring that the model can handle a range of input variations without significant drops in quality or safety.
  • Measuring key metrics: Metrics such as response groundedness, relevance, and safety are evaluated to confirm readiness for production.

Diagram of pre-production evaluation for models and applications with the six steps.

The pre-production stage acts as a final quality check, reducing the risk of deploying an AI application that doesn't meet the desired performance or safety standards.

Evaluation Tools and Approaches:

  • Bring your own data: You can evaluate your AI applications in pre-production using your own evaluation data. Use Azure AI Foundry's evaluation wizard or the Azure AI Evaluation SDK with supported evaluators, including generation quality, safety, or custom evaluators, and view results via the Azure AI Foundry portal (see the sketch after this list).
  • Simulators and AI red teaming agent (preview): If you don’t have evaluation data (test data), Azure AI Evaluation SDK’s simulators can help by generating topic-related or adversarial queries. These simulators test the model’s response to situation-appropriate or attack-like queries (edge cases).
    • Adversarial simulators inject static queries that mimic potential safety risks or security attacks, such as attempted jailbreaks, helping identify limitations and prepare the model for unexpected conditions.
    • Context-appropriate simulators generate typical, relevant conversations you’d expect from users to test quality of responses. With context-appropriate simulators you can assess metrics such as groundedness, relevance, coherence, and fluency of generated responses.
    • AI red teaming agent (preview) simulates complex adversarial attacks against your AI system, using a broad range of safety and security attack techniques from Microsoft's open framework, the Python Risk Identification Tool (PyRIT). Automated scans using the AI red teaming agent enhance pre-production risk assessment by systematically testing AI applications for risks. This process involves simulated attack scenarios that identify weaknesses in model responses before real-world deployment, so you can detect and mitigate potential safety issues before release. Use this tool together with human-in-the-loop processes, such as conventional AI red teaming probing, to accelerate risk identification and aid assessment by a human expert.

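As a rough sketch of the bring-your-own-data path mentioned above, the SDK's `evaluate` function can run several evaluators over a JSONL dataset and optionally publish results to your Azure AI Foundry project. The file name, column names, and project reference below are placeholders.

```python
# Rough sketch: batch-evaluate a JSONL dataset (one record per line with "query", "context",
# "response", and "ground_truth" columns) using several evaluators at once.
# File name, column names, and project reference are placeholders.
from azure.ai.evaluation import evaluate, GroundednessEvaluator, RelevanceEvaluator, F1ScoreEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-deployment>",
}

result = evaluate(
    data="evaluation_dataset.jsonl",
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "relevance": RelevanceEvaluator(model_config=model_config),
        "f1": F1ScoreEvaluator(),
    },
    # Optional: pass your Azure AI project reference to view the run in the Foundry portal.
    # azure_ai_project={"subscription_id": "...", "resource_group_name": "...", "project_name": "..."},
    output_path="./evaluation_results.json",
)
print(result["metrics"])  # aggregate scores across the dataset
```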
Alternatively, you can also use Azure AI Foundry portal's evaluation widget for testing your generative AI applications.

Once satisfactory results are achieved, the AI application can be deployed to production.

Post-production monitoring

After deployment, continuous monitoring ensures your AI application maintains quality in real-world conditions:

  • Performance tracking: Regular measurement of key metrics.
  • Incident response: Swift action when harmful or inappropriate outputs occur.

Effective monitoring helps maintain user trust and allows for rapid issue resolution.

Azure AI Foundry Observability provides comprehensive monitoring capabilities essential for today's complex and rapidly evolving AI landscape. Seamlessly integrated with Azure Monitor Application Insights, this solution enables continuous monitoring of deployed AI applications to ensure optimal performance, safety, and quality in production environments. The Foundry Observability dashboard delivers real-time insights into critical metrics, allowing teams to quickly identify and address performance issues, safety concerns, or quality degradation. For agent-based applications, Foundry offers enhanced continuous evaluation capabilities that you can enable for deeper visibility into quality and safety metrics, creating a monitoring ecosystem that adapts to the dynamic nature of AI applications while maintaining high standards of performance and reliability.

By continuously monitoring the AI application's behavior in production, you can maintain high-quality user experiences and swiftly address any issues that surface.

Building trust through systematic evaluation

GenAIOps establishes a reliable process for managing AI applications throughout their lifecycle. By implementing thorough evaluation at each stage—from model selection through deployment and beyond—teams can create AI solutions that aren't just powerful but trustworthy and safe.

Evaluation cheat sheet

| Purpose | Process | Parameters |
| --- | --- | --- |
| What are you evaluating for? | Identify or build relevant evaluators | Quality and performance (sample notebook); Agents response quality; Safety and security (sample notebook); Custom (sample notebook) |
| What data should you use? | Upload or generate a relevant dataset | Generic simulator for measuring quality and performance (sample notebook); Adversarial simulator for measuring safety and security (sample notebook); AI red teaming agent for running automated scans to assess safety and security vulnerabilities (sample notebook) |
| What resources should conduct the evaluation? | Run evaluation | Local run; Remote cloud run |
| How did my model/app perform? | Analyze results | View aggregate scores, view details, score details, compare evaluation runs |
| How can I improve? | Make changes to model, app, or evaluators | If evaluation results didn't align with human feedback, adjust your evaluator; if results aligned with human feedback but didn't meet quality/safety thresholds, apply targeted mitigations (for example, Azure AI Content Safety) |
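Where the cheat sheet mentions custom evaluators, a code-based custom evaluator can be as simple as a callable that accepts your dataset's columns as keyword arguments and returns a dictionary of scores. The class below is a hypothetical example, not part of the SDK.

```python
# Hypothetical example of a code-based custom evaluator: any callable that accepts the
# dataset's columns as keyword arguments and returns a dict of scores can be used.
class ResponseLengthEvaluator:
    """Scores whether a response stays within a target word count."""

    def __init__(self, max_words: int = 150):
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs):
        word_count = len(response.split())
        return {
            "word_count": word_count,
            "within_limit": float(word_count <= self.max_words),
        }

# Usage with the evaluate function shown earlier:
# evaluate(data="evaluation_dataset.jsonl", evaluators={"length": ResponseLengthEvaluator()})
```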

Region support

Currently certain AI-assisted evaluators are available only in the following regions:

| Region | Hate and unfairness, Sexual, Violent, Self-harm, Indirect attack, Code vulnerabilities, Ungrounded attributes | Groundedness Pro | Protected material |
| --- | --- | --- | --- |
| East US 2 | Supported | Supported | Supported |
| Sweden Central | Supported | Supported | N/A |
| US North Central | Supported | N/A | N/A |
| France Central | Supported | N/A | N/A |
| Switzerland West | Supported | N/A | N/A |

Pricing

Observability features such as Risk and Safety Evaluations and Continuous Evaluations are billed based on consumption, as listed on our Azure pricing page. Select the tab labeled Complete AI Toolchain to view the pricing details for evaluations.