CI/CD (Continuous Integration and Continuous Delivery) has become a cornerstone of modern data engineering and analytics, as it ensures that code changes are integrated, tested, and deployed rapidly and reliably. Databricks recognizes that you may have diverse CI/CD requirements shaped by your organizational preferences, existing workflows, and specific technology environment, and provides a flexible framework that supports various CI/CD options.
This page describes best practices to help you design and build robust, customized CI/CD pipelines that align with your unique needs and constraints. By leveraging these insights, you can accelerate your data engineering and analytics initiatives, improve code quality, and reduce the risk of deployment failures.
Core principles of CI/CD
Effective CI/CD pipelines share foundational principles regardless of implementation specifics. The following best practices apply across organizational preferences, developer workflows, and cloud environments, whether your team prioritizes notebook-first development or infrastructure-as-code workflows. Adopt these principles as guardrails while tailoring the specifics to your organization's technology stack and processes.
- Version control everything
  - Store notebooks, scripts, infrastructure definitions (IaC), and job configurations in Git.
  - Use branching strategies, such as Gitflow, that are aligned with development, staging, and production deployment environments.
- Automate testing
  - Implement unit tests for business logic using libraries such as pytest for Python and ScalaTest for Scala.
  - Validate notebook and workflow functionality with tools such as Databricks CLI bundle validate.
  - Use integration tests for workflows and data pipelines, such as chispa for Spark DataFrames.
- Employ Infrastructure as Code (IaC)
  - Define clusters, jobs, and workspace configurations with Databricks Asset Bundles YAML or Terraform.
  - Parameterize environment-specific settings, such as cluster size and secrets, instead of hardcoding them (see the sketch after this list).
- Isolate environments
  - Maintain separate workspaces for development, staging, and production.
  - Use MLflow Model Registry for model versioning across environments.
- Choose tools that match your cloud ecosystem
  - Azure: Azure DevOps and Databricks Asset Bundles or Terraform.
  - AWS: GitHub Actions and Databricks Asset Bundles or Terraform.
  - GCP: Cloud Build and Databricks Asset Bundles or Terraform.
- Monitor and automate rollbacks
  - Track deployment success rates, job performance, and test coverage.
  - Implement automated rollback mechanisms for failed deployments.
- Unify asset management
  - Use Databricks Asset Bundles to deploy code, jobs, and infrastructure as a single unit. Avoid siloed management of notebooks, libraries, and workflows.
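To make the parameterization principle concrete, the following is a minimal, hypothetical databricks.yml fragment. The variable name, node types, and workspace URLs are illustrative assumptions, not values from this page:

```yaml
# Hypothetical sketch: declare a variable once, override it per deployment target.
bundle:
  name: my_project

variables:
  node_type:
    description: Cluster node type for this environment
    default: Standard_DS3_v2          # used by the dev target

targets:
  dev:
    mode: development
    workspace:
      host: https://<dev-workspace-url>
  prod:
    mode: production
    workspace:
      host: https://<prod-workspace-url>
    variables:
      node_type: Standard_DS5_v2      # larger nodes only in production
```

Cluster definitions can then reference ${var.node_type} instead of a hardcoded value, and each target deploys to its own isolated workspace.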
Databricks Asset Bundles for CI/CD
Databricks Asset Bundles offer a powerful, unified approach to managing code, workflows, and infrastructure within the Databricks ecosystem and are recommended for your CI/CD pipelines. By bundling these elements into a single YAML-defined unit, bundles simplify deployment and ensure consistency across environments. However, for users accustomed to traditional CI/CD workflows, adopting bundles may require a shift in mindset.
For example, Java developers are used to building JARs with Maven or Gradle, running unit tests with JUnit, and integrating these steps into CI/CD pipelines. Similarly, Python developers often package code into wheels and test with pytest, while SQL developers focus on query validation and notebook management. With bundles, these workflows converge into a more structured and prescriptive format, emphasizing bundling code and infrastructure for seamless deployment.
The following sections explore how developers can adapt their workflows to leverage bundles effectively.
To quickly get started with Databricks Asset Bundles, try a tutorial: Develop a job with Databricks Asset Bundles or Develop DLT pipelines with Databricks Asset Bundles.
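For orientation, a minimal databricks.yml might bundle a notebook, the job that runs it, and the job's compute in one unit. This is a hypothetical sketch; the bundle name, notebook path, and cluster settings are assumptions, not values from this page:

```yaml
# Hypothetical sketch of a single-unit bundle: code, job, and compute together.
bundle:
  name: my_first_bundle

resources:
  jobs:
    nightly_etl:
      name: nightly-etl
      tasks:
        - task_key: run_notebook
          notebook_task:
            notebook_path: ./src/etl_notebook.ipynb   # notebook shipped with the bundle
          new_cluster:
            spark_version: 15.4.x-scala2.12
            node_type_id: Standard_DS3_v2
            num_workers: 2
```

A single databricks bundle deploy then creates or updates all of these resources together.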
CI/CD source control recommendations
The first choice developers need to make when implementing CI/CD is how to store and version source files. Bundles make it easy to keep everything (source code, build artifacts, and configuration files) in the same source code repository, but another option is to separate the bundle configuration files from the code-related files. The choice depends on your team's workflow, project complexity, and CI/CD requirements, but Databricks recommends the following:
- For small projects or tight coupling between code and configuration, use a single repository for both code and bundle configuration to simplify workflows.
- For larger teams or independent release cycles, use separate repositories for code and bundle configuration, but establish clear CI/CD pipelines that ensure compatibility between versions.
Whether you choose to co-locate or separate your code-related files from your bundle configuration files, always use versioned artifacts, such as Git commit hashes, when uploading to Databricks or external storage to ensure traceability and rollback capabilities.
Single repository for code and configuration
In this approach, both the source code and bundle configuration files are stored in the same repository. This simplifies workflows and ensures atomic changes.
Example: Python code in a bundle
This example has Python files and bundle files in one repository:
databricks-dab-repo/
├── databricks.yml # Bundle definition
├── resources/
│ ├── workflows/
│ │ ├── my_pipeline.yml # YAML pipeline def
│ │ └── my_pipeline_job.yml # YAML job def that runs pipeline
│ ├── clusters/
│ │ ├── dev_cluster.yml # development cluster def
│ │ └── prod_cluster.yml # production cluster def
├── src/
│ ├── dlt_pipeline.ipynb # pipeline notebook
│ └── mypython.py # Additional Python
└── README.md
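As a hedged sketch of what the databricks.yml at the root of this layout might contain (the target names and modes are assumptions), the bundle can pull in the resource definitions shown above with include globs:

```yaml
# Hypothetical sketch for the single-repository layout above.
bundle:
  name: databricks-dab-repo

include:
  - resources/workflows/*.yml   # pipeline and job definitions
  - resources/clusters/*.yml    # cluster definitions

targets:
  dev:
    mode: development
    default: true
  prod:
    mode: production
```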
Separate repositories for code and configuration
In this approach, the source code resides in one repository, while the bundle configuration files are maintained in another. This option is ideal for larger teams or projects where separate groups handle application development and Databricks workflow management.
Example: Java project and bundle
In this example, a Java project and its files are in one repository and the bundle files are in another repository.
Repository 1: Java files
The first repository contains all Java-related files:
java-app-repo/
├── pom.xml # Maven build configuration
├── src/
│ ├── main/
│ │ ├── java/ # Java source code
│ │ │ └── com/
│ │ │ └── mycompany/
│ │ │ └── app/
│ │ │ └── App.java
│ │ └── resources/ # Application resources
│ └── test/
│ ├── java/ # Unit tests for Java code
│ │ └── com/
│ │ └── mycompany/
│ │ └── app/
│ │ └── AppTest.java
│ └── resources/ # Test-specific resources
├── target/ # Compiled JARs and classes
└── README.md
- Developers write application code in src/main/java or src/main/scala.
- Unit tests are stored in src/test/java or src/test/scala.
- On a pull request or commit, CI/CD pipelines (sketched after this list):
  - Compile the code into a JAR, for example, target/my-app-1.0.jar.
  - Upload the JAR to a Databricks Unity Catalog volume. See upload JAR.
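A hedged GitHub Actions fragment for those CI steps might look like the following. The volume path, secret names, and the assumption that the Databricks CLI is already installed on the runner are illustrative, not prescribed by this page:

```yaml
# Hypothetical CI fragment: build the JAR, then copy it to a Unity Catalog volume
# under a name that embeds the commit SHA for traceability.
- name: Build and unit test
  run: mvn --batch-mode package

- name: Upload JAR to a Unity Catalog volume
  run: |
    databricks fs cp target/my-app-1.0.jar \
      dbfs:/Volumes/artifacts/default/jars/my-app-${{ github.sha }}.jar
  env:
    DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
    DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```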
Repository 2: Bundle files
A second repository contains only the bundle configuration files:
databricks-dab-repo/
├── databricks.yml # Bundle definition
├── resources/
│ ├── jobs/
│ │ ├── my_java_job.yml # YAML job def
│ │ └── my_other_job.yml # Additional job definitions
│ ├── clusters/
│ │ ├── dev_cluster.yml # development cluster def
│ │ └── prod_cluster.yml # production cluster def
└── README.md
The bundle configuration databricks.yml and job definitions are maintained independently.
The databricks.yml references the uploaded JAR artifact, for example:
- jar: /Volumes/artifacts/my-app-${{ GIT_SHA }}.jar
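For context, such a reference usually sits in a job task's libraries list. The following is a hedged sketch; the job name, cluster variable, and the main class (taken from the tree above) are assumptions about how the pieces fit together:

```yaml
# Hypothetical job definition that runs the uploaded JAR.
resources:
  jobs:
    my_java_job:
      name: my-java-job
      tasks:
        - task_key: main
          existing_cluster_id: ${var.cluster_id}    # assumed variable for the target cluster
          spark_jar_task:
            main_class_name: com.mycompany.app.App
          libraries:
            - jar: /Volumes/artifacts/my-app-${{ GIT_SHA }}.jar   # placeholder substituted by CI, as above
```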
Recommended CI/CD workflow
Regardless of whether you co-locate or separate your code files and your bundle configuration files, Databricks recommends the following workflow:

1. Compile and test the code
   - Triggered on a pull request or a commit to the main branch.
   - Compile code and run unit tests.
   - Output a versioned file, for example, my-app-1.0.jar.
2. Upload and store the compiled file, such as a JAR
   - Store the compiled file in a Databricks Unity Catalog volume or an artifact repository like AWS S3 or Azure Blob Storage.
   - Use a versioning scheme tied to Git commit hashes or semantic versioning, for example, dbfs:/mnt/artifacts/my-app-${{ github.sha }}.jar.
3. Validate the bundle
   - Run databricks bundle validate to ensure that the databricks.yml configuration is correct.
   - This step catches misconfigurations, for example missing libraries, early.
4. Deploy the bundle
   - Use databricks bundle deploy to deploy the bundle to a staging or production environment.
   - Reference the uploaded compiled library in databricks.yml. For information about referencing libraries, see Databricks Asset Bundles library dependencies.
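Strung together, the validate and deploy steps are just two CLI calls that most CI systems can run as shell steps. The target names here are assumptions and must match targets defined in your databricks.yml:

```bash
# Hypothetical promotion flow using the commands described above.
databricks bundle validate --target staging   # catch misconfigurations early
databricks bundle deploy   --target staging   # deploy to the staging environment
databricks bundle deploy   --target prod      # promote to production after approval
```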
CI/CD for machine learning
Machine learning projects introduce unique CI/CD challenges compared to traditional software development. When implementing CI/CD for ML projects, you will likely need to consider the following:
- Multi-team coordination: Data scientists, engineers, and MLOps teams often use different tools and workflows. Databricks unifies these processes with MLflow for experiment tracking, Delta Sharing for data governance, and Databricks Asset Bundles for infrastructure-as-code.
- Data and model versioning: ML pipelines require tracking not just code but also training data schemas, feature distributions, and model artifacts. Databricks Delta Lake provides ACID transactions and time travel for data versioning, while MLflow Model Registry handles model lineage.
- Reproducibility across environments: ML models depend on specific data, code, and infrastructure combinations. Databricks Asset Bundles ensure atomic deployment of these components across development, staging, and production environments with YAML definitions.
- Continuous retraining and monitoring: Models degrade due to data drift. Databricks Workflows enable automated retraining pipelines, while MLflow integrates with Prometheus and Lakehouse Monitoring for performance tracking.
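To illustrate the model-versioning point above, here is a minimal, hypothetical Python sketch that ties a registered model version to the commit that produced it. The catalog, schema, and model names are placeholders, and it assumes MLflow and scikit-learn are available:

```python
# Hypothetical sketch: version a model alongside the code that trained it.
import mlflow
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_registry_uri("databricks-uc")  # register model versions in Unity Catalog

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    mlflow.set_tag("git_sha", "abc1234")  # record the code version for reproducibility
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="main.ml_models.demo_model",  # placeholder UC model name
    )
```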
MLOps Stacks for ML CI/CD
Databricks addresses ML CI/CD complexity through MLOps Stacks, a production-grade framework that combines Databricks Asset Bundles, preconfigured CI/CD workflows, and modular ML project templates. These stacks enforce best practices while allowing flexibility for multi-team collaboration across data engineering, data science, and MLOps roles.
Team | Responsibilities | Example bundle components | Example artifacts |
---|---|---|---|
Data engineers | Build ETL pipelines, enforce data quality | DLT YAML, cluster policies | etl_pipeline.yml, feature_store_job.yml |
Data scientists | Develop model training logic, validate metrics | MLflow Projects, notebook-based workflows | train_model.yml, batch_inference_job.yml |
MLOps engineers | Orchestrate deployments, monitor pipelines | Environment variables, monitoring dashboards | databricks.yml, lakehouse_monitoring.yml |
ML CI/CD collaboration might look like:
- Data engineers commit ETL pipeline changes to a bundle, triggering automated schema validation and a staging deployment.
- Data scientists submit ML code changes, which trigger unit tests and a deployment to a staging workspace for integration testing.
- MLOps engineers review validation metrics and promote vetted models to production using the MLflow Registry.
For implementation details, see:
- MLOps Stacks bundle: Step-by-step guidance for bundle initialization and deployment.
- MLOps Stacks GitHub repository: Preconfigured templates for training, inference, and CI/CD.
By aligning teams with standardized bundles and MLOps Stacks, organizations can streamline collaboration while maintaining auditability across the ML lifecycle.
CI/CD for SQL developers
SQL developers using Databricks SQL to manage streaming tables and materialized views can leverage Git integration and CI/CD pipelines to streamline their workflows and maintain quality. With Git support for queries, SQL developers can focus on writing queries while using Git to version control their .sql files, which enables collaboration and automation without needing deep infrastructure expertise. In addition, the SQL editor enables real-time collaboration and integrates seamlessly with Git workflows.
For SQL-centric workflows:
1. Version control SQL files
   - Store .sql files in Git repositories using Databricks Git folders or external Git providers, for example, GitHub or Azure DevOps.
   - Use branches (for example, development, staging, production) to manage environment-specific changes.
2. Integrate .sql files into CI/CD pipelines to automate deployment
   - Validate syntax and schema changes during pull requests.
   - Deploy .sql files to Databricks SQL workflows or jobs.
3. Parameterize for environment isolation
   - Use variables in .sql files to dynamically reference environment-specific resources, such as data paths or table names:

   ```sql
   CREATE OR REFRESH STREAMING TABLE ${env}_sales_ingest
   AS SELECT * FROM read_files('s3://${env}-sales-data')
   ```

4. Schedule and monitor refreshes
   - Use SQL tasks in a Databricks job to schedule updates to tables and materialized views (REFRESH MATERIALIZED VIEW view_name); see the sketch after this list.
   - Monitor refresh history using system tables.
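As a hedged sketch of the scheduling step (the job name, cron expression, SQL file path, and warehouse variable are assumptions), a bundle job with a SQL file task might look like:

```yaml
# Hypothetical bundle job that refreshes a materialized view on a schedule.
resources:
  jobs:
    refresh_sales:
      name: refresh-sales
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"   # daily at 06:00
        timezone_id: UTC
      tasks:
        - task_key: refresh
          sql_task:
            warehouse_id: ${var.warehouse_id}
            file:
              path: ./sql/refresh_sales.sql      # contains REFRESH MATERIALIZED VIEW ...
```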
A workflow might be:
- Develop: Write and test .sql scripts locally or in the Databricks SQL editor, then commit them to a Git branch.
- Validate: During a pull request, validate syntax and schema compatibility using automated CI checks (see the example after this list).
- Deploy: Upon merge, deploy the .sql scripts to the target environment using CI/CD pipelines, for example, GitHub Actions or Azure Pipelines.
- Monitor: Use Databricks dashboards and alerts to track query performance and data freshness.
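One possible way to automate the Validate step referenced above is to run a SQL linter in CI. SQLFluff is used purely as an illustration; it is not prescribed by this page, and the directory name is an assumption:

```yaml
# Hypothetical GitHub Actions step that lints committed SQL files.
- name: Lint SQL files
  run: |
    pip install sqlfluff
    sqlfluff lint sql/ --dialect databricks
```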
CI/CD for dashboard developers
Databricks supports integrating dashboards into CI/CD workflows using Databricks Asset Bundles. This capability enables dashboard developers to:
- Version-control dashboards, which ensures auditability and simplifies collaboration between teams.
- Automate deployments of dashboards alongside jobs and pipelines across environments, for end-to-end alignment.
- Reduce manual errors and ensure that updates are applied consistently across environments.
- Maintain high-quality analytics workflows while adhering to CI/CD best practices.
For dashboards in CI/CD:
1. Use the databricks bundle generate command to export existing dashboards as JSON files and generate the YAML configuration that includes them in the bundle:

   ```yaml
   resources:
     dashboards:
       sales_dashboard:
         display_name: 'Sales Dashboard'
         file_path: ./dashboards/sales_dashboard.lvdash.json
         warehouse_id: ${var.warehouse_id}
   ```

2. Store these .lvdash.json files in Git repositories to track changes and collaborate effectively.
3. Automatically deploy dashboards in CI/CD pipelines with databricks bundle deploy. For example, the GitHub Actions step for deployment:

   ```yaml
   name: Deploy Dashboard
   run: databricks bundle deploy --target=prod
   env:
     DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
   ```

4. Use variables, for example ${var.warehouse_id}, to parameterize configurations like SQL warehouses or data sources, ensuring seamless deployment across dev, staging, and production environments.
5. Use the bundle generate --watch option to continuously sync local dashboard JSON files with changes made in the Databricks UI. If discrepancies occur, use the --force flag during deployment to overwrite remote dashboards with local versions.
For information about dashboards in bundles, see dashboard resource. For details about bundle commands, see the bundle command group.