Replicate and sync mainframe data to Azure


This example architecture outlines an implementation plan for replicating and syncing data to Azure during mainframe modernization. It describes technical aspects like data stores, tools, and services.

Architecture

Architecture diagram that shows how to sync on-premises data and Azure database data during mainframe modernization.

Download a Visio file of this architecture.

Workflow

The following workflow corresponds to the previous diagram:

Mainframe and midrange systems update on-premises application databases at regular intervals. To maintain consistency, this solution syncs the latest data with Azure databases. The sync process involves the following steps.

  1. Azure Data Factory dynamic pipelines orchestrate activities that range from data extraction to data loading. You can schedule pipeline activities, start them manually, or trigger them automatically.

    Pipelines group the activities that perform tasks. To extract data, Azure Data Factory dynamically creates one pipeline for each on-premises table. You can then use a massively parallel implementation when you replicate data in Azure. You can also configure the solution to meet your requirements:

    • Full replication: You replicate the entire database and make the necessary modifications to data types and fields in the target Azure database.

    • Partial, delta, or incremental replication: You use watermark columns in source tables to sync only the updated rows with Azure databases. These columns contain either a continuously incrementing key or a time stamp that indicates the table's last update. (A minimal sketch of this watermark pattern appears after the following list.)

    Azure Data Factory also uses pipelines for the following transformation tasks:

    • Data-type conversion
    • Data manipulation
    • Data formatting
    • Column derivation
    • Data flattening
    • Data sorting
    • Data filtering
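
    The following sketch illustrates the watermark pattern described earlier. It's a minimal example, not production code: the DSN, table, and column names are hypothetical, and it assumes a pyodbc-accessible Db2 source.

    ```python
    import pyodbc

    # Hypothetical DSN for the on-premises Db2 source; replace with real values.
    SOURCE_DSN = "DSN=DB2PROD;UID=etl_user;PWD=<secret>"

    def extract_delta(table: str, watermark_column: str, last_watermark):
        """Return only the rows that changed since the last successful sync.

        The watermark column holds a continuously incrementing key or a
        time stamp, so rows beyond the stored watermark are exactly the
        inserts and updates that still need replication.
        """
        with pyodbc.connect(SOURCE_DSN) as conn:
            cursor = conn.cursor()
            cursor.execute(
                f"SELECT * FROM {table} WHERE {watermark_column} > ?",
                last_watermark,
            )
            return cursor.fetchall()

    # Pull the rows updated since the previously recorded high-water mark.
    changed_rows = extract_delta("ORDERS", "LAST_UPDATED_TS", "2024-06-01-00.00.00")
    ```
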
  2. On-premises databases like Db2 for z/OS, Db2 for i, and Db2 LUW store the application data.

  3. A self-hosted integration runtime (IR) provides the environment that Azure Data Factory uses to run and dispatch activities.

  4. Azure Data Lake Storage Gen2 and Azure Blob Storage stage the data. This step is sometimes required to transform and merge data from multiple sources.
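
    As a minimal illustration of this staging step, the following sketch uploads an extracted file to Blob Storage by using the azure-storage-blob library. The connection string, container, and file names are hypothetical.

    ```python
    from azure.storage.blob import BlobServiceClient

    # Hypothetical connection string and names; replace with real values.
    service = BlobServiceClient.from_connection_string("<storage-connection-string>")
    container = service.get_container_client("staging")

    # Stage the extracted file so later activities can transform and merge it
    # with data from other sources before the final load.
    with open("db2_extract_orders.csv", "rb") as data:
        container.upload_blob(name="raw/orders.csv", data=data, overwrite=True)
    ```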

  5. For data preparation, Azure Data Factory uses Azure Databricks, custom activities, and pipeline data flows to transform data quickly and effectively.
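
    For example, a Databricks notebook can perform several of the transformation tasks from step 1 in a few lines of PySpark. This is a sketch that assumes hypothetical column names and storage paths; spark is the SparkSession that Databricks provides.

    ```python
    from pyspark.sql import functions as F

    # Read the staged file from storage (the path and schema are hypothetical).
    raw = spark.read.option("header", True).csv(
        "abfss://staging@mystagingaccount.dfs.core.windows.net/raw/orders.csv"
    )

    prepared = (
        raw
        # Data-type conversion: cast character fields to numeric and date types.
        .withColumn("order_amount", F.col("order_amount").cast("decimal(12,2)"))
        .withColumn("order_date", F.to_date("order_date", "yyyyMMdd"))
        # Column derivation: build a key from existing columns.
        .withColumn("order_key", F.concat_ws("-", "branch_id", "order_id"))
        # Data filtering and sorting.
        .filter(F.col("order_amount") > 0)
        .orderBy("order_date")
    )
    ```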

  6. Azure Data Factory loads data into the following relational and nonrelational Azure databases, as shown in the load sketch after this list:

    • Azure SQL
    • Azure Database for PostgreSQL
    • Azure Cosmos DB
    • Azure Data Lake Storage
    • Azure Database for MySQL
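
    The following sketch loads prepared rows into Azure SQL Database through pyodbc. The server, database, table, and column names are hypothetical, and the managed-identity option assumes ODBC Driver 18 for SQL Server.

    ```python
    import pyodbc

    # Hypothetical Azure SQL connection string; replace with real values.
    conn_str = (
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=tcp:myserver.database.windows.net,1433;"
        "Database=mydb;Authentication=ActiveDirectoryMsi;"
    )

    rows = [("1001-42", "2024-06-01", 199.99)]  # prepared rows from earlier steps

    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.fast_executemany = True  # batch the inserts for large loads
        cursor.executemany(
            "INSERT INTO dbo.orders (order_key, order_date, order_amount) "
            "VALUES (?, ?, ?)",
            rows,
        )
        conn.commit()
    ```
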
  7. SQL Server Integration Services (SSIS) extracts, transforms, and loads data.

  8. The on-premises data gateway is a locally installed Windows client application that acts as a bridge between your local on-premises data sources and Azure services.

  9. A data pipeline in Microsoft Fabric is a logical grouping of activities that perform data ingestion from Db2 to Azure storage and databases.

  10. If the solution requires near real-time replication, you can use non-Microsoft tools.

Components

This section describes other tools that you can use during data modernization, synchronization, and integration.

Data integrators

  • Azure Data Factory is a hybrid data integration service. You can use this fully managed, serverless solution to create, schedule, and orchestrate extract, transform, and load (ETL) workflows and extract, load, and transform (ELT) workflows.
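
    As a hedged illustration, the following sketch starts a pipeline run on demand by using the azure-mgmt-datafactory SDK. The subscription, resource group, factory, pipeline, and parameter names are hypothetical.

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Hypothetical identifiers; replace with your own values.
    client = DataFactoryManagementClient(
        credential=DefaultAzureCredential(),
        subscription_id="00000000-0000-0000-0000-000000000000",
    )

    # Start a pipeline run manually. The same pipeline can also run on a
    # schedule or be triggered automatically.
    run = client.pipelines.create_run(
        resource_group_name="rg-mainframe-sync",
        factory_name="adf-mainframe-sync",
        pipeline_name="copy_db2_to_sql",
        parameters={"table_name": "ORDERS"},
    )
    print(run.run_id)
    ```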

  • Azure Synapse Analytics is an enterprise analytics service that accelerates time to insight across data warehouses and big data systems. Azure Synapse Analytics combines the best of the following technologies and services:

    • SQL technologies for enterprise data warehousing

    • Spark technologies for big data

    • Azure Data Explorer for log and time series analytics

    • Pipelines for data integration and ETL and ELT workflows

    • Deep integration with other Azure services, such as Power BI, Azure Cosmos DB, and Azure Machine Learning

  • SSIS is a platform for building enterprise-level data integration and transformation solutions. You can use SSIS to manage, replicate, cleanse, and mine data.

  • Azure Databricks is a data analytics platform. It's based on the Apache Spark open-source distributed processing system and is optimized for the Azure cloud platform. In an analytics workflow, Azure Databricks reads data from multiple sources and uses Spark to provide insights.

Data storage

  • Azure SQL Database is part of the Azure SQL family and is built for the cloud. This service provides the benefits of a fully managed and evergreen platform as a service (PaaS). SQL Database also provides AI-powered, automated features that optimize performance and durability. Serverless compute and hyperscale storage options automatically scale resources on demand.

  • Azure SQL Managed Instance is part of the Azure SQL service portfolio. This intelligent and scalable cloud database service combines the broadest SQL Server engine compatibility with all the benefits of a fully managed and evergreen PaaS. Use SQL Managed Instance to modernize existing apps at scale.

  • SQL Server on Azure Virtual Machines provides a way to lift and shift SQL Server workloads to the cloud with complete code compatibility. As part of the Azure SQL family, SQL Server on Azure Virtual Machines provides the combined performance, security, and analytics of SQL Server with the flexibility and hybrid connectivity of Azure. Use SQL Server on Azure Virtual Machines to migrate existing apps or build new apps. You can also access the latest SQL Server updates and releases, including SQL Server 2019.

  • Azure Database for PostgreSQL is a fully managed relational database service that's based on the community edition of the open-source PostgreSQL database engine. Use this service to focus on application innovation instead of database management. You can also scale your workload quickly and easily.

  • Azure Cosmos DB is a globally distributed, multimodel database. Use Azure Cosmos DB to ensure that your solutions can elastically and independently scale throughput and storage across any number of geographic regions. This fully managed NoSQL database service guarantees single-digit millisecond latencies at the ninety-ninth percentile anywhere in the world.
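
    For a nonrelational target, the following sketch upserts a document by using the azure-cosmos library. The account endpoint, key, database, container, and document shape are hypothetical.

    ```python
    from azure.cosmos import CosmosClient

    # Hypothetical endpoint and key; replace with your own values.
    client = CosmosClient(
        url="https://myaccount.documents.azure.com:443/",
        credential="<account-key>",
    )
    container = client.get_database_client("sync").get_container_client("orders")

    # Upserts keep replication idempotent: re-running a sync overwrites the
    # same documents instead of creating duplicates.
    container.upsert_item({
        "id": "1001-42",
        "orderDate": "2024-06-01",
        "orderAmount": 199.99,
    })
    ```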

  • Data Lake Storage is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are optimized for scaling to terabytes and petabytes of data. The data typically comes from multiple, heterogeneous sources and can be structured, semi-structured, or unstructured. Data Lake Storage Gen2 combines Data Lake Storage Gen1 capabilities with Blob Storage. This next-generation data lake solution provides file system semantics, file-level security, and scale. It also provides the tiered storage, high availability, and disaster recovery capabilities of Blob Storage.

  • Microsoft Fabric is an enterprise-ready, end-to-end analytics platform. It unifies data movement, data processing, ingestion, transformation, real-time event routing, and report building. It supports these capabilities by using integrated services like Fabric Data Engineer, Fabric Data Factory, Fabric Data Science, Fabric Real-Time Intelligence, Fabric Data Warehouse, and Fabric Databases.

  • Azure Database for MySQL is a fully managed relational database service that's based on the community edition of the open-source MySQL database engine.


Scenario details

Data availability and integrity are essential in mainframe and midrange modernization. Data-first strategies help keep data intact and available during the migration to Azure. To prevent disruptions during modernization, sometimes you need to replicate data quickly or keep on-premises data in sync with Azure databases.

Specifically, this solution covers:

  • Extraction: Connect to and extract data from a source database.

  • Transformation:

    • Staging: Temporarily store data in its original format and prepare it for transformation.

    • Preparation: Transform and manipulate data by using mapping rules that meet target database requirements.

  • Loading: Insert data into a target database.

Potential use cases

Data replication and sync scenarios that can benefit from this solution include:

  • Command Query Responsibility Segregation (CQRS) architectures that use Azure to service all inquiry channels.

  • Environments that test on-premises applications and rehosted or re-engineered applications in parallel.

  • On-premises systems that have tightly coupled applications that require phased remediation or modernization.

Recommendations

You can apply the following recommendations to most scenarios. Follow these recommendations unless you have a specific requirement that overrides them.

When you use Azure Data Factory to extract data, take steps to tune the performance of the copy activity.
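
For example, the copy activity exposes parallelism and compute settings in its JSON definition. The following snippet mirrors those typeProperties as a Python dict; the property names are part of the copy activity schema, but the values shown are illustrative starting points, not recommendations.

```python
# Copy activity tuning knobs (JSON typeProperties shown as a Python dict).
# The values are illustrative; measure and adjust them for your workload.
copy_activity_settings = {
    "parallelCopies": 8,          # concurrent reads from the source
    "dataIntegrationUnits": 16,   # compute assigned to an Azure IR copy
    "enableStaging": True,        # stage through Blob Storage when needed
    "stagingSettings": {
        "linkedServiceName": {
            "referenceName": "StagingBlobStorage",  # hypothetical name
            "type": "LinkedServiceReference",
        }
    },
}
```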

Considerations

These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.

Reliability

Reliability helps ensure that your application can meet the commitments that you make to your customers. For more information, see Design review checklist for Reliability.

  • Infrastructure management, including availability, is automated in Azure databases.

  • For information about Microsoft Service for DRDA failover protection, see Pooling and failover.

  • You can cluster the on-premises data gateway and IR to provide higher availability guarantees.

Security

Security provides assurances against deliberate attacks and the misuse of your valuable data and systems. For more information, see Design review checklist for Security.

  • Use network security groups to limit access to only the necessary functions for each service.

  • Use private endpoints for your PaaS services, and supplement them with service firewalls, whether or not a service is reachable through the internet.

  • Use managed identities for component-to-component dataflows.
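
    For example, with a managed identity assigned to the compute resource, a component can reach storage without any secret in code or configuration. This sketch assumes the identity already has a role assignment on the storage account, and the account URL is hypothetical.

    ```python
    from azure.identity import DefaultAzureCredential
    from azure.storage.blob import BlobServiceClient

    # DefaultAzureCredential resolves the managed identity at runtime, so no
    # connection string or key appears in code or configuration.
    service = BlobServiceClient(
        account_url="https://mystagingaccount.blob.core.windows.net",
        credential=DefaultAzureCredential(),
    )
    ```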

  • For information about the types of client connections that Microsoft Service for DRDA supports, see Plan and architect solutions by using Microsoft Service for DRDA. Client connections affect the nature of transactions, pooling, failover, authentication, and encryption on your network.

Cost Optimization

Cost Optimization focuses on ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Design review checklist for Cost Optimization.

  • Pricing models vary between component services. Review the pricing models of the available component services to ensure that they fit your budget.

  • Use the Azure pricing calculator to estimate the cost of implementing this solution.

Operational Excellence

Operational Excellence covers the operations processes that deploy an application and keep it running in production. For more information, see Design review checklist for Operational Excellence.

  • Infrastructure management, including scalability, is automated in Azure databases.

  • You can scale out the self-hosted IR by associating the logical instance with multiple on-premises machines in active-active mode.

Performance Efficiency

Performance Efficiency refers to your workload's ability to scale to meet user demands efficiently. For more information, see Design review checklist for Performance Efficiency.

  • Consider Azure ExpressRoute as a high-scale option if your implementation uses significant bandwidth for initial replication or ongoing changed data replication.

  • Choose the right IR configuration for your scenario.
