Best practices and recommended CI/CD workflows on Databricks

CI/CD (Continuous Integration and Continuous Delivery) has become a cornerstone of modern data engineering and analytics, as it ensures that code changes are integrated, tested, and deployed rapidly and reliably. Databricks recognizes that you may have diverse CI/CD requirements shaped by your organizational preferences, existing workflows, and specific technology environment, and provides a flexible framework that supports various CI/CD options.

This page describes best practices to help you design and build robust, customized CI/CD pipelines that align with your unique needs and constraints. By leveraging these insights, you can accelerate your data engineering and analytics initiatives, improve code quality, and reduce the risk of deployment failures.

Core principles of CI/CD

Effective CI/CD pipelines share foundational principles regardless of implementation specifics. The following best practices apply across organizational preferences, developer workflows, and cloud environments, and ensure consistency whether your team prioritizes notebook-first development or infrastructure-as-code workflows. Adopt these principles as guardrails while tailoring the specifics to your organization's technology stack and processes.

  • Version control everything
    • Store notebooks, scripts, infrastructure definitions (IaC), and job configurations in Git.
    • Use branching strategies, such as Gitflow, that are aligned with standard development, staging, and production deployment environments.
  • Automate testing
    • Implement unit tests for business logic using libraries such as pytest for Python and ScalaTest for Scala.
    • Validate notebook and workflow functionality with tools such as the Databricks CLI bundle validate command.
    • Write integration tests for workflows and data pipelines, using helpers such as chispa for comparing Spark DataFrames.
  • Employ Infrastructure as Code (IaC)
    • Define clusters, jobs, and workspace configurations with Databricks Asset Bundles YAML or Terraform.
    • Parameterize environment-specific settings, such as cluster size and secrets, instead of hardcoding them (see the sketch after this list).
  • Isolate environments
    • Maintain separate workspaces for development, staging, and production.
    • Use MLflow Model Registry for model versioning across environments.
  • Choose tools that match your cloud ecosystem:
    • Azure: Azure DevOps and Databricks Asset Bundles or Terraform.
    • AWS: GitHub Actions and Databricks Asset Bundles or Terraform.
    • GCP: Cloud Build and Databricks Asset Bundles or Terraform.
  • Monitor and automate rollbacks
    • Track deployment success rates, job performance, and test coverage.
    • Implement automated rollback mechanisms for failed deployments.
  • Unify asset management
    • Use Databricks Asset Bundles to deploy code, jobs, and infrastructure as a single unit. Avoid siloed management of notebooks, libraries, and workflows.
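
For example, a minimal databricks.yml sketch that applies several of these principles, defining infrastructure as code, parameterizing environment-specific values, and isolating dev and prod targets, might look like the following. The bundle name, workspace hosts, and variable values are illustrative only:

    # databricks.yml -- illustrative sketch; names, hosts, and sizes are placeholders
    bundle:
      name: my_project
    variables:
      num_workers:
        description: Number of workers for job clusters
        default: 1
    targets:
      dev:
        mode: development
        default: true
        workspace:
          host: https://dev-workspace.cloud.databricks.com
      prod:
        mode: production
        workspace:
          host: https://prod-workspace.cloud.databricks.com
        variables:
          num_workers: 4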

Databricks Asset Bundles for CI/CD

Databricks Asset Bundles offer a powerful, unified approach to managing code, workflows, and infrastructure within the Databricks ecosystem and are recommended for your CI/CD pipelines. By bundling these elements into a single YAML-defined unit, bundles simplify deployment and ensure consistency across environments. However, for users accustomed to traditional CI/CD workflows, adopting bundles may require a shift in mindset.

For example, Java developers are used to building JARs with Maven or Gradle, running unit tests with JUnit, and integrating these steps into CI/CD pipelines. Similarly, Python developers often package code into wheels and test with pytest, while SQL developers focus on query validation and notebook management. With bundles, these workflows converge into a more structured and prescriptive format, emphasizing bundling code and infrastructure for seamless deployment.

The following sections explore how developers can adapt their workflows to leverage bundles effectively.

To quickly get started with Databricks Asset Bundles, try a tutorial: Develop a job with Databricks Asset Bundles or Develop DLT pipelines with Databricks Asset Bundles.

CI/CD source control recommendations

The first choice developers need to make when implementing CI/CD is how to store and version source files. Bundles make it easy to keep everything, including source code, build artifacts, and configuration files, in the same source code repository, but another option is to separate the bundle configuration files from the code-related files. The choice depends on your team's workflow, project complexity, and CI/CD requirements, but Databricks recommends the following:

  • For small projects or tight coupling between code and configuration, use a single repository for both code and bundle configuration to simplify workflows.
  • For larger teams or independent release cycles, use separate repositories for code and bundle configuration, but establish clear CI/CD pipelines that ensure compatibility between versions.

Whether you choose to co-locate or separate your code-related files from your bundle configuration files, always use versioned artifacts, such as Git commit hashes, when uploading to Databricks or external storage to ensure traceability and rollback capabilities.

Single repository for code and configuration

In this approach, both the source code and bundle configuration files are stored in the same repository. This simplifies workflows and ensures atomic changes.

Pros:
  • All related artifacts, code, and YAML configurations are versioned together, which reduces coordination overhead.
  • A single pull request can update both the compiled build file and its corresponding bundle configuration.
  • The CI/CD pipeline can build, test, validate, and deploy from a single repository.

Cons:
  • Over time, the repository may become bloated with both code and configuration files.
  • Code and bundle changes require a coordinated release.

Example: Python code in a bundle

This example has Python files and bundle files in one repository:

databricks-dab-repo/
├── databricks.yml               # Bundle definition
├── resources/
│   ├── workflows/
│   │   ├── my_pipeline.yml      # YAML pipeline def
│   │   └── my_pipeline_job.yml  # YAML job def that runs pipeline
│   ├── clusters/
│   │   ├── dev_cluster.yml      # development cluster def
│   │   └── prod_cluster.yml     # production def
├── src/
│   ├── dlt_pipeline.ipynb       # pipeline notebook
│   └── mypython.py              # Additional Python
└── README.md
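
A CI pipeline for this layout can build, validate, and deploy the bundle from the single repository. The following GitHub Actions sketch is illustrative only; it assumes the databricks/setup-cli action and workspace credentials stored as repository secrets:

    # .github/workflows/deploy-bundle.yml -- illustrative sketch
    name: Deploy bundle
    on:
      push:
        branches: [main]
    jobs:
      deploy:
        runs-on: ubuntu-latest
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        steps:
          - uses: actions/checkout@v4
          - uses: databricks/setup-cli@main
          - name: Validate bundle
            run: databricks bundle validate
          - name: Deploy to production target
            run: databricks bundle deploy --target prod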

Separate repositories for code and configuration

In this approach, the source code resides in one repository, while the bundle configuration files are maintained in another. This option is ideal for larger teams or projects where separate groups handle application development and Databricks workflow management.

Pros:
  • Teams that develop the application code can focus on their repository, while the data engineering team manages the bundle configurations.
  • Compiled code, such as a JAR, and bundle configurations can be versioned and released independently, without coupling them.
  • Each repository is smaller and easier to manage.

Cons:
  • Requires additional coordination between repositories during deployment.
  • You must ensure that the correct versions of dependencies, such as the JAR version, are referenced in the bundle repository.

Example: Java project and bundle

In this example, a Java project and its files are in one repository and the bundle files are in another repository.

Repository 1: Java files

The first repository contains all Java-related files:

java-app-repo/
├── pom.xml                  # Maven build configuration
├── src/
│   ├── main/
│   │   ├── java/            # Java source code
│   │   │   └── com/
│   │   │       └── mycompany/
│   │   │           └── app/
│   │   │               └── App.java
│   │   └── resources/       # Application resources
│   └── test/
│       ├── java/            # Unit tests for Java code
│       │   └── com/
│       │       └── mycompany/
│       │           └── app/
│       │               └── AppTest.java
│       └── resources/       # Test-specific resources
├── target/                  # Compiled JARs and classes
└── README.md

  • Developers write application code in src/main/java or src/main/scala.
  • Unit tests are stored in src/test/java or src/test/scala.
  • On a pull request or commit, CI/CD pipelines:
    • Compile the code into a JAR, for example, target/my-app-1.0.jar.
    • Upload the JAR to a Databricks Unity Catalog volume. See upload JAR.

Repository 2: Bundle files

A second repository contains only the bundle configuration files:

databricks-dab-repo/
├── databricks.yml               # Bundle definition
├── resources/
│   ├── jobs/
│   │   ├── my_java_job.yml  # YAML job def
│   │   └── my_other_job.yml # Additional job definitions
│   ├── clusters/
│   │   ├── dev_cluster.yml  # development cluster def
│   │   └── prod_cluster.yml # production def
└── README.md

  • The bundle configuration databricks.yml and job definitions are maintained independently.

  • The databricks.yml references the uploaded JAR artifact, for example:

    - jar: /Volumes/artifacts/my-app-${{ GIT_SHA }}.jar
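
  • A fuller sketch of how this reference might appear inside a job task definition follows. The job name, task key, and ${var.artifact_version} bundle variable are hypothetical; a CI pipeline would set the variable to the Git commit hash:

    resources:
      jobs:
        my_java_job:
          name: my-java-job
          tasks:
            - task_key: run_app
              spark_jar_task:
                main_class_name: com.mycompany.app.App
              # Cluster configuration is omitted for brevity
              libraries:
                - jar: /Volumes/artifacts/my-app-${var.artifact_version}.jar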
    

Regardless of whether you co-locate or separate your code files and your bundle configuration files, a recommended workflow is the following (a CI sketch follows the list):

  1. Compile and test the code

    • Triggered on a pull request or a commit to the main branch.
    • Compile code and run unit tests.
    • Output a versioned file, for example, my-app-1.0.jar.
  2. Upload the compiled file, such as a JAR

    • Store the compiled file in a Databricks Unity Catalog volume or an artifact repository, such as AWS S3 or Azure Blob Storage.
    • Use a versioning scheme tied to Git commit hashes or semantic versioning, for example, /Volumes/artifacts/my-app-${{ github.sha }}.jar.
  3. Validate the bundle

    • Run databricks bundle validate to ensure that the databricks.yml configuration is correct.
    • This step ensures that misconfigurations, for example, missing libraries, are caught early.
  4. Deploy the bundle

    • Use databricks bundle deploy to deploy the bundle to a staging or production environment.
    • Reference the uploaded compiled library in databricks.yml. For information about referencing libraries, see Databricks Asset Bundles library dependencies.
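
Sketched as a single GitHub Actions job, that workflow might look like the following. This is illustrative only: it assumes a Maven build, the databricks/setup-cli action, the hypothetical /Volumes/artifacts volume used above, and workspace credentials stored as repository secrets:

    # .github/workflows/build-and-deploy.yml -- illustrative sketch
    name: Build and deploy
    on:
      push:
        branches: [main]
    jobs:
      build-and-deploy:
        runs-on: ubuntu-latest
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        steps:
          - uses: actions/checkout@v4
          # 1. Compile the code and run unit tests
          - name: Build and test JAR
            run: mvn --batch-mode verify
          - uses: databricks/setup-cli@main
          # 2. Upload a versioned JAR to a Unity Catalog volume (path is hypothetical)
          - name: Upload JAR
            run: databricks fs cp target/my-app-1.0.jar dbfs:/Volumes/artifacts/my-app-${{ github.sha }}.jar
          # 3. Validate the bundle configuration
          - name: Validate bundle
            run: databricks bundle validate
          # 4. Deploy the bundle to the target environment
          - name: Deploy bundle
            run: databricks bundle deploy --target prod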

CI/CD for machine learning

Machine learning projects introduce unique CI/CD challenges compared to traditional software development. When implementing CI/CD for ML projects, you will likely need to consider the following:

  • Multi-team coordination: Data scientists, engineers, and MLOps teams often use different tools and workflows. Databricks unifies these processes with MLflow for experiment tracking, Delta Sharing for data governance, and Databricks Asset Bundles for infrastructure-as-code.
  • Data and model versioning: ML pipelines require tracking not just code but also training data schemas, feature distributions, and model artifacts. Databricks Delta Lake provides ACID transactions and time travel for data versioning, while MLflow Model Registry handles model lineage.
  • Reproducibility across environments: ML models depend on specific data, code, and infrastructure combinations. Databricks Asset Bundles ensure atomic deployment of these components across development, staging, and production environments with YAML definitions.
  • Continuous retraining and monitoring: Models degrade due to data drift. Databricks Workflows enable automated retraining pipelines, while MLflow integrates with Prometheus and Lakehouse Monitoring for performance tracking.
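
For example, the retraining part of that last point can be expressed as a scheduled job resource in a bundle. The following sketch is illustrative; the job name, notebook path, and schedule are hypothetical:

    resources:
      jobs:
        retrain_model_job:
          name: retrain-model
          schedule:
            quartz_cron_expression: "0 0 2 * * ?"   # nightly at 02:00
            timezone_id: UTC
          tasks:
            - task_key: retrain
              notebook_task:
                notebook_path: ../src/train_model.ipynb   # hypothetical path
              # Cluster configuration is omitted for brevity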

MLOps Stacks for ML CI/CD

Databricks addresses ML CI/CD complexity through MLOps Stacks, a production-grade framework that combines Databricks Asset Bundles, preconfigured CI/CD workflows, and modular ML project templates. These stacks enforce best practices while allowing flexibility for multi-team collaboration across data engineering, data science, and MLOps roles.

  • Data engineers
    • Responsibilities: Build ETL pipelines, enforce data quality
    • Example bundle components: DLT YAML, cluster policies
    • Example artifacts: etl_pipeline.yml, feature_store_job.yml
  • Data scientists
    • Responsibilities: Develop model training logic, validate metrics
    • Example bundle components: MLflow Projects, notebook-based workflows
    • Example artifacts: train_model.yml, batch_inference_job.yml
  • MLOps engineers
    • Responsibilities: Orchestrate deployments, monitor pipelines
    • Example bundle components: Environment variables, monitoring dashboards
    • Example artifacts: databricks.yml, lakehouse_monitoring.yml

ML CI/CD collaboration might look like:

  • Data engineers commit ETL pipeline changes to a bundle, triggering automated schema validation and a staging deployment.
  • Data scientists submit ML code changes, which trigger unit tests and a deployment to a staging workspace for integration testing.
  • MLOps engineers review validation metrics and promote vetted models to production using the MLflow Registry.

For implementation details, see the MLOps Stacks documentation.

By aligning teams with standardized bundles and MLOps Stacks, organizations can streamline collaboration while maintaining auditability across the ML lifecycle.

CI/CD for SQL developers

SQL developers using Databricks SQL to manage streaming tables and materialized views can leverage Git integration and CI/CD pipelines to streamline their workflows and maintain high-quality pipelines. With the introduction of Git support for queries, SQL developers can focus on writing queries while leveraging Git to version control their .sql files, which enables collaboration and automation without needing deep infrastructure expertise. In addition, the SQL editor enables real-time collaboration and integrates seamlessly with Git workflows.

For SQL-centric workflows:

  • Version control SQL files

    • Store .sql files in Git repositories using Databricks Git folders or external Git providers, for example, GitHub, Azure DevOps.
    • Use branches (for example, development, staging, production) to manage environment-specific changes.
  • Integrate .sql files into CI/CD pipelines to automate deployment:

    • Validate syntax and schema changes during pull requests.
    • Deploy .sql files to Databricks SQL workflows or jobs.
  • Parameterize for environment isolation

    • Use variables in .sql files to dynamically reference environment-specific resources, such as data paths or table names:

      CREATE OR REFRESH STREAMING TABLE ${env}_sales_ingest AS SELECT * FROM read_files('s3://${env}-sales-data')
      
  • Schedule and monitor refreshes

    • Use SQL tasks in a Databricks Job to schedule updates to tables and materialized views (REFRESH MATERIALIZED VIEW view_name).
    • Monitor refresh history using system tables.
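
For example, a scheduled refresh can be defined as a SQL file task in a bundle job. The following sketch is illustrative; the job name, schedule, file path, and warehouse variable are hypothetical:

    resources:
      jobs:
        refresh_sales_views:
          name: refresh-sales-views
          schedule:
            quartz_cron_expression: "0 0 6 * * ?"   # daily at 06:00
            timezone_id: UTC
          tasks:
            - task_key: refresh
              sql_task:
                warehouse_id: ${var.warehouse_id}
                file:
                  path: ../src/refresh_sales.sql   # hypothetical path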

A workflow might be:

  1. Develop: Write and test .sql scripts locally or in the Databricks SQL editor, then commit them to a Git branch.
  2. Validate: During a pull request, validate syntax and schema compatibility using automated CI checks.
  3. Deploy: Upon merge, deploy the .sql scripts to the target environment using CI/CD pipelines, for example, GitHub Actions or Azure Pipelines.
  4. Monitor: Use Databricks dashboards and alerts to track query performance and data freshness.
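
The validate and deploy steps of this workflow might be wired into CI as follows. This GitHub Actions sketch is illustrative only; it assumes the databricks/setup-cli action, a bundle that includes the .sql files, and workspace credentials stored as repository secrets:

    # .github/workflows/sql-ci.yml -- illustrative sketch
    name: SQL CI
    on:
      pull_request:
      push:
        branches: [main]
    jobs:
      validate:
        if: github.event_name == 'pull_request'
        runs-on: ubuntu-latest
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        steps:
          - uses: actions/checkout@v4
          - uses: databricks/setup-cli@main
          - run: databricks bundle validate
      deploy:
        if: github.event_name == 'push'
        runs-on: ubuntu-latest
        env:
          DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
        steps:
          - uses: actions/checkout@v4
          - uses: databricks/setup-cli@main
          - run: databricks bundle deploy --target prod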

CI/CD for dashboard developers

Databricks supports integrating dashboards into CI/CD workflows using Databricks Asset Bundles. This capability enables dashboard developers to:

  • Version-control dashboards, which ensures auditability and simplifies collaboration between teams.
  • Automate deployments of dashboards alongside jobs and pipelines across environments, for end-to-end alignment.
  • Reduce manual errors and ensure that updates are applied consistently across environments.
  • Maintain high-quality analytics workflows while adhering to CI/CD best practices.

For dashboards in CI/CD:

  • Use the databricks bundle generate command to export existing dashboards as JSON files and generate the YAML configuration that includes them in the bundle:

    resources:
      dashboards:
        sales_dashboard:
          display_name: 'Sales Dashboard'
          file_path: ./dashboards/sales_dashboard.lvdash.json
          warehouse_id: ${var.warehouse_id}
    
  • Store these .lvdash.json files in Git repositories to track changes and collaborate effectively.

  • Automatically deploy dashboards in CI/CD pipelines with databricks bundle deploy. For example, a GitHub Actions step for deployment:

    - name: Deploy Dashboard
      run: databricks bundle deploy --target=prod
      env:
        DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
    
  • Use variables, for example ${var.warehouse_id}, to parameterize configurations like SQL warehouses or data sources, ensuring seamless deployment across dev, staging, and production environments.

  • Use the bundle generate --watch option to continuously sync local dashboard JSON files with changes made in the Databricks UI. If discrepancies occur, use the --force flag during deployment to overwrite remote dashboards with local versions.

For information about dashboards in bundles, see dashboard resource. For details about bundle commands, see bundle command group.