Mosaic AI Agent Evaluation tutorial notebook (MLflow 2)

2025-09-25

Important

Databricks recommends using MLflow 3 for evaluating and monitoring GenAI apps. This page describes MLflow 2 Agent Evaluation.

For an introduction to evaluation and monitoring on MLflow 3, see Evaluation and monitoring.
For information about migrating to MLflow 3, see Migrate to MLflow 3 from Agent Evaluation.

The following notebook demonstrates how to evaluate a gen AI app using Agent Evaluation's proprietary LLM judges, custom metrics, and labels from domain experts. It demonstrates the following:

How to load production logs (traces) into an evaluation dataset.
How to run an evaluation and do root cause analysis.
How to create custom metrics to automatically detect quality issues.
How to send production logs to subject matter experts (SMEs) for labeling, and use those labels to evolve the evaluation dataset.
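
The workflow in the list above can be pictured with a small, library-independent sketch. It does not use the Agent Evaluation or MLflow APIs that the notebook relies on; the log records, the EvalRow type, the "answered" metric, and every function name below are hypothetical and exist only to illustrate the shape of the steps: turn logs into evaluation rows, score each row with a custom metric, and collect the failing rows that an expert might be asked to label.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical, simplified stand-in for production traces. Real traces carry
# far more detail (spans, timings, retrieved context).
RAW_LOGS = [
    {"request": "How do I reset my password?",
     "response": "Open Settings and choose Reset password."},
    {"request": "What is your refund policy?",
     "response": "I'm sorry, I can't help with that."},
    {"request": "Where is my invoice?", "response": ""},
]


@dataclass
class EvalRow:
    """One row of an illustrative evaluation dataset."""
    request: str
    response: str


def build_eval_dataset(logs: List[dict]) -> List[EvalRow]:
    """Step 1: load logged request/response pairs into evaluation rows."""
    return [EvalRow(request=log["request"], response=log["response"]) for log in logs]


def answered(row: EvalRow) -> bool:
    """Illustrative custom metric: pass when the response is non-empty and
    does not look like a refusal."""
    text = row.response.strip().lower()
    if not text:
        return False
    refusal_markers = ("i can't help", "i cannot help")
    return not any(marker in text for marker in refusal_markers)


def evaluate(rows: List[EvalRow],
             metrics: Dict[str, Callable[[EvalRow], bool]]) -> List[dict]:
    """Steps 2 and 3: apply every metric to every row and record the scores."""
    results = []
    for row in rows:
        scores = {name: fn(row) for name, fn in metrics.items()}
        results.append({"request": row.request, **scores})
    return results


def failing_rows(results: List[dict], metric_name: str) -> List[str]:
    """Step 4: select the requests whose metric failed, as candidates for
    expert labeling."""
    return [r["request"] for r in results if not r[metric_name]]


dataset = build_eval_dataset(RAW_LOGS)
results = evaluate(dataset, {"answered": answered})
needs_review = failing_rows(results, "answered")
```

In this sketch, needs_review holds the two requests whose responses were a refusal or empty. In the notebook itself, the judges, metric registration, and labeling workflow are provided by Agent Evaluation rather than hand-written loops like these.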

To get your agent ready for pre-production, see the Mosaic AI agent demo notebook. For general information, see Mosaic AI Agent Evaluation (MLflow 2).

Agent Evaluation custom metrics, guidelines, and domain expert labels notebook

