As enterprises are adopting cloud, the common migration activity is migrating the on-premises database to cloud. What to migrate, when to migrate, where to migrate and how to migrate are million dollar questions. Migration could be just Lift & Shift Migration, it could be Modernization or it could be part of Transformation. First step in this journey starts with Discovery and Assessment and that’s why Google has acquired StratZone which helps enterprises with discovery and assessment of their IT landscape and helps them moving to Google Cloud.
For various reasons cloud adoption is not an option but a requirement and pandemic has accelerated the cloud adoption. As per Gartner1 “Digital native companies are those that started using public cloud from the ground up and include it at the center of their business model”.
All migrations, modernization or transformation plans have one common problem or task which is “Migrating the Database”. Let’s start with some common use cases for migrating the database to the cloud:
Infrastructure modernization Adopting serverless or cloud native architecture and technologies Making data available closer to application’s region For better disaster recovery capabilities and/or replicating the critical data to cloud for safe keeping Near real-time analytics, ad-hoc analysis and reports against operational and transactional data without impacting the OLTP workloads Availability of cloud services and on-demand resources which helps you innovate faster as well as meeting the regulatory and compliance requirements Cost advantage of using cloud based computing and storage.
When migrating database to Google Cloud following are the most common services you would use for your specific requirements:
Database Migration Service Datastream Dataflow Data Fusion Google Cloud Marketplace
Most common Google Cloud services where you would like to migrate your database are following:
Cloud SQL (for MySQL, PostgreSQL, and SQL Server) Bare Metal Solution for Oracle BigQuery Cloud Spanner
Normally we classify our database migration under two categories: Heterogeneous and Homogenous Migrations. In this article we will see the different and commonly used options of migrating the MySQL and PostgreSQL database engines to Google Cloud.
Migrating MySQL and PostgreSQL database engines to Cloud SQL (for MySQL, PostgreSQL)
Google Cloud provide Database Migration Service for out-of-box migration of same or compatible database engines (homogenous) which are either MySQL or PostgreSQL based. We calls this Lift & Shift migration. Below figures shows the list of source database engines supported by Google Cloud Database Migration Service. Based on your selection of source the target data source will be selected automatically.
If database migration is part of App Modernization and needs ETL because of schema change then use Datastream + Dataflow as per your specific requirements.
Migrating MySQL and PostgreSQL database engines to Cloud Spanner
Cloud Spanner supports horizontal scalability and global availability of data and at the same time provides strong consistency while optimizes performance by automatically sharding data which makes it unique. It is a globally scalable and distributed SQL database service which decouples compute from storage. In addition to this Spanner also supports PostgreSQL Interface which helps when migrating applications which are using PostgreSQL.
But it comes with limitations and should be used for some specific use cases. When migrating MySQL or PostgreSQL to Cloud Spanner (heterogeneous) then we need to refactor application and schema, some important tasks are:
Move stored procedures and triggers to application as Spanner doesn’t support procedures and triggers. Any sequence generator logic also needs to be moved to application layer. Spanner support limited data validation constraints, so any complex data constraints needs to be moved to application layer. Also you need to be very careful in selecting the primary key to avoid hotspots. Spanner comes with Interleaving tables which supports pre-joining parent-child tables, but it requires you to define primary keys in particular way. Spanner offers read/write and read-only transactions and also support bounded staleness when it comes to query data and hence provide performance benefits for some particular use cases. Spanner supports PostgreSQL Interface hence applications using PostgreSQL has minimal impact but applications layer using MySQL needs to be refactored.
For data validation we can use Dataflow to perform a detailed comparison over large data sets by using the “Join transform”. This transform takes 2 keyed data sets, and matches the values by key.
Bulk data export/import:
To export date from MySQL either we can use “mysqldump” tool can export the entire database into well-formed XML files or we can use SELECT … INTO OUTTFILE sql statement to create CSV files for each table. In case of PostgresSQL you can use “pg_dump”.
You can load it into MySQL by using the “LOAD DATA INFILE” command. You can use “pg_restore” to import data to PostgreSQL.
Since schema is different between MySQL/PostgreSQL and Spanner, we need to do some amount of ETL. Once data is extracted from data sources you can use either Dataflow or Data Fusion to push it to Spanner after ETL processing. You can also use Datastream as explained in below section.
Change Data Capture (CDC):
Often after migrating the initial set of data we need to maintain two data source in sync before cutover. You need to do CDC (change data capture) until you decide when is the best time to cutover. Other reasons are like doing data validation, avoid downtime, continuous replication for using cloud for analytics and meeting other business requirements.
You need to enable MySQL and PostgreSQL source databases for change data capture events, below are the some of the options:
For MySQL, Maxwell’s daemon is one of the open source project for CDC stream for your database. For PostgreSQL, PostgreSQL Logical Decoding (which uses write-ahead log) for CDC. But downside is that developer has to write elaborated logic to process events.
Once configures you can use following products and services for ingesting continuous stream of events:
Datastream – Datastream is serverless CDC and replication service provided by Google Cloud. It not only capture CDC data (DML changes e.g. insert, update and delete) but also replicate historical data. It comes with some limitations e.g. limited to 10,000 tables, event size limitation of 3 MB etc.
Datastream supports the following versions of MySQL database2:
MySQL 5.6 MySQL 5.7 MySQL 8.0
Datastream supports the following types of MySQL database2:
Self-hosted MySQL Cloud SQL for MySQL Amazon RDS for MySQL Amazon Aurora MySQL MariaDB
Debezium – Debezium is an open source distributed platform for change data capture. When using Debezium, one of the possible architecture is below (from Pub/Sub we can use Dataflow to push data to Spanner):
Striim on Google Cloud Marketplace
Other not so popular option is to use Audit Triggers.
Migrating MySQL and PostgreSQL database engines to BigQuery
BigQuery is serverless, highly scalable and cost-effective data warehouse. When migrating data from MySQL or PostgreSQL to BigQuery we either do as-is migration or we do ETL for some analytical use case, often enrich and join multiple data sources before pushing it to BigQuery.
When we develop applications we choose different types of data stores each optimized for its specific use case e.g. transactions, relational constraints, nosql, indexes, complex query etc. But we need to do analytics to get the bigger picture across multiple sources then BigQuery comes handy and then comes the requirement of data migration or replication.
Once data is in BigQuery and streaming pipeline is in place, it could be used for variety of use cases ranging from real-time analytics to predictive analytics to geospatial analysis, developing reports and dashboards, sharing data and actionable insights etc.
Bulk data export/import:
BigQuery provides you “bq” command-line tool which is python based and you can use “bq load” command to bulk upload the data into BigQuery. Before using “bq load” command you need to pull data out of MySQL/PostgreSQL. In case of MySQL you can use “mysqldump” or SQL query to pull data as discussed in previous section. In case of PostgresSQL you can use “pg_dump”. Once data is extracted from data sources you can use either Dataflow or Data Fusion to push it to BigQuery after ETL processing. You can also use Datastream as explained in below section.
Google Cloud Dataprep by Trifacta (on Google Cloud Marketplace), provides BigQuery pushdown which gives flexibility to run the ETL jobs using either BigQuery or Dataflow. And you can visually exploring, cleaning, and preparing data in your ETL pipeline. BigQuery pushdown on Dataprep enables faster data transformations, optimized costs, and increased flexibility.
Change Data Capture (CDC):
Using change data capture (CDC) solution you get the stream of real-time updates in your source data. Some important use cases for CDC are replication of critical database, synchronization between a transactional data store and a data warehouse-type store and zero downtime cutover strategy for your database migration job.
You can use Datastream or Debezium to capture the CDC stream from source database and use the Dataflow to push the data in BigQuery after doing the ETL processing.
You can use Datastream for migrating MySQL and Debezium for both MySQL and PostgreSQL data sources.
Data Fusion, is another option for some complex use cases like ingesting data from variety of sources and data formats to BigQuery. For replication jobs Data Fusion only support BigQuery as target.
Striim on Google Cloud Marketplace, is another solution of moving data from your transactional MySQL or PostgreSQL databases to BigQuery.
Google Cloud, Marketplace and Open Source community provides lots of solutions to simplify your migration journey, at the same time you need to do very careful analysis of your business requirements, use case and target architectures before short listing one. In some complex cases you might need to choose multiple options as each requirement is different and each option have benefits and flexibility over other based on the specific use case requirements.
https://www.gartner.com/en/articles/the-top-6-trends-impacting-infrastructure-and-operations-for-2022,Trend No. 2: Digital natives.
https://cloud.google.com/datastream/docs/sources,Source MySQL database