Most people who use cloud services use more than one product. These products may come from different cloud providers or differ in their underlying technology. Transferring data among multiple cloud products has therefore become a basic necessity.
For example:
1. A user migrates data from an Oracle database in an offline data center to RDS PPAS on the cloud.
2. A user uses RDS MySQL as the database for business transactions and HybridDB for PostgreSQL as their data warehouse.
3. A user imports massive amounts of data from ODPS into HybridDB for PostgreSQL for real-time analysis.
In all of these situations, migrating data to the cloud is unavoidable. This article presents possible solutions for these scenarios that I hope will prove helpful as you use cloud products.
Before we even begin migrating data, we need a certain understanding of the underlying network and hardware environment so that we can choose the most appropriate solution.
If your data is on the cloud and in the same zone as the destination system, congratulations! You're in the ideal situation. Do your best to only migrate data within the same zone.
Current cloud products are commonly configured with Ethernet connections of at least 1 gigabit, and usually around 10 gigabits. Intra-zone data migration has low latency and high bandwidth, and passes through the minimum required number of VSwitches, making throughput speeds ideal.
Therefore, we should do our best to make sure that the backend database, front-end ECS, and the OSS (which stores massive amounts of data) are in the same zone.
Some customers with high availability demands choose multi-zone deployment within a single city, or even deployment across zones in different cities. Alibaba Cloud also has many data products that natively support multi-zone deployment.
Alibaba Cloud zones in the same city or between cities are connected through a leased line network. While transferring data under these conditions may not be as efficient as transferring within the same zone, network quality is still quite high.
Network communication efficiency ranks as follows, in descending order:
The same zone > Multi-zones in the same city > Multi-zones between cities
For example:
(Within East China I Zone B) > (Between East China I Zone B and East China I Zone C) > (Between East China I Zone B and North China I Zone B)
The least efficient, and unfortunately the most common, scenario for data migration to the cloud is transfer over the public Internet, as described in the background reading. The data travels over a physical channel that is public and cannot be controlled, so latency is usually high and quality can fluctuate.
We can use software to make up for these deficiencies. We usually recommend that users select software and services with the following features.
• Has a retry mechanism and supports resumable data transfers, preventing a large task from failing due to a single error.
• Allows for high concurrency to increase throughput.
• Uses incremental data migration to reduce service downtime.
Now let's talk about data formats in data exchange.
There are usually two methods to transfer data between different data products:
The first method is online migration: connect the software or service to the source and target databases simultaneously, pull data from the source database, convert it into a format recognized by the target database, and immediately write it to the target database.
This method does not require any intermediate data storage space, but it does require high network quality. If the data is very large, e.g. at the petabyte level, migration can take a very long time.
Alibaba Cloud's open-source tool rds_dbsync, CDP (DataX), and Data Transmission Service (DTS) all fall into this category.
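The tools above are products in their own right, but the underlying pull-and-write pattern can be illustrated with PostgreSQL's postgres_fdw extension (a foreign data wrapper). This is a minimal sketch of the pattern, not how those tools are implemented; all object names and connection details are hypothetical, and the local target table is assumed to exist already.

```sql
-- A minimal sketch of the online pull-and-write pattern using postgres_fdw.
-- src_server, src_orders, orders, and the connection details are hypothetical.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Register the source database as a foreign server.
CREATE SERVER src_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'source-host', port '5432', dbname 'srcdb');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER src_server
    OPTIONS (user 'migrator', password 'secret');

-- Expose the source table locally, then pull its rows into the target table.
CREATE FOREIGN TABLE src_orders (
    id      bigint,
    amount  numeric,
    created timestamptz
) SERVER src_server OPTIONS (schema_name 'public', table_name 'orders');

INSERT INTO orders SELECT * FROM src_orders;
```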
The second method is offline migration, which is recommended if your data size is large, e.g. a data warehouse with dozens of terabytes of data.
Offline migration refers to the process of exporting all data from the source database into a common data format and then importing it into the target database.
Offline migration offers the following advantages over online migration:
• Data exported offline is usually compressed at a ratio of between 1:2 and 1:5, which significantly reduces network costs and improves overall efficiency (see the sketch after this list).
• Offline migration can easily be parallelized, which is the most effective way to increase efficiency.
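As a sketch of what both advantages look like in practice, PostgreSQL's COPY command can compress data on the fly by piping it through an external program. COPY ... PROGRAM runs on the database server and requires superuser privileges (or the pg_execute_server_program role) on community PostgreSQL; the table and file names below are hypothetical.

```sql
-- Export: write the table as CSV, compressing it before it leaves the server.
COPY orders TO PROGRAM 'gzip > /tmp/orders.csv.gz' WITH (FORMAT csv);

-- Import on the target side: decompress while loading.
COPY orders FROM PROGRAM 'gunzip -c /tmp/orders.csv.gz' WITH (FORMAT csv);

-- Parallelism: split the export on a key and run the slices concurrently.
COPY (SELECT * FROM orders WHERE id % 2 = 0)
    TO PROGRAM 'gzip > /tmp/orders_0.csv.gz' WITH (FORMAT csv);
COPY (SELECT * FROM orders WHERE id % 2 = 1)
    TO PROGRAM 'gzip > /tmp/orders_1.csv.gz' WITH (FORMAT csv);
```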
For file-based data migration, the data format is key: it defines how data is organized within a file.
The following are the most commonly used file formats: TXT/CSV, TSV, ORC, Parquet, protobuf, etc.
Some of these formats, specifically ORC and Parquet, come with built-in compression. For uncompressed formats, e.g. CSV, we can choose a general-purpose compression algorithm, e.g. gzip, bzip2, snappy, etc.
• For structured data, the ideal format is CSV. CSV is a commonly used, simple, and transparent data format. For more information, refer to: Common Format and MIME Type for Comma-Separated Values (CSV) Files (RFC 4180).
• The PostgreSQL CSV parameters are specified in reference document 2. They apply to community PostgreSQL as well as Alibaba Cloud's PostgreSQL, Greenplum, and HybridDB for PostgreSQL.
• Any file that complies with the CSV standard can be imported into these PostgreSQL products:
o PostgreSQL imports data through the COPY command (push mode)
o HybridDB for PostgreSQL writes data through the COPY command (push mode)
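Here is a minimal example of both import paths, assuming a hypothetical users table and file. Server-side COPY reads a file on the database host, while psql's \copy meta-command pushes a file from the client machine, which is the usual case when loading data into a cloud instance.

```sql
-- Server-side COPY: the file must be readable by the database server itself.
COPY users FROM '/data/users.csv' WITH (FORMAT csv, HEADER true);

-- Client-side \copy (a psql meta-command): the file lives on the client.
\copy users FROM 'users.csv' WITH (FORMAT csv, HEADER true)
```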
CSV is a relatively simple text format. Its advantage is clearly defined semantics, which can handle even complex scenarios.
• The line delimiter of a CSV file is '\n', the newline character.
• DELIMITER defines the column delimiter.
o When user data contains the DELIMITER character, QUOTE needs to be used as well.
o The recommended column delimiters are ',', '\t', '|', or other uncommon characters.
• The QUOTE character encloses column values that contain special characters.
o Strings containing special characters are enclosed in QUOTE characters to differentiate user data from control characters.
o Where this is unnecessary, e.g. for integers, data is not enclosed in QUOTE characters (to optimize efficiency).
o The QUOTE character cannot be the same as the DELIMITER; the default QUOTE is the double quotation mark.
o If user data contains the QUOTE character, it must be set off with an escape character.
• ESCAPE escapes special characters.
o Place an escape character before a special character to indicate that it should not be treated as a control character.
o The default ESCAPE character is the same as the QUOTE character, i.e., the double quotation mark.
o You can also use '\' (the default escape character in MySQL) or another character.
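Putting these control characters together, here is a sketch of a COPY invocation that sets each one explicitly; the events table and file path are hypothetical.

```sql
COPY events FROM '/data/events.csv' WITH (
    FORMAT    csv,
    DELIMITER '|',  -- column delimiter, chosen because ',' occurs in the data
    QUOTE     '"',  -- encloses values that contain '|', '"', or newlines
    ESCAPE    '\',  -- MySQL-style escaping; the default is the QUOTE character
    NULL      ''    -- an unquoted empty string is read as NULL
);
```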
The following are the default control characters for the TEXT and CSV formats.

| Control Character | TEXT | CSV |
| --- | --- | --- |
| DELIMITER (column delimiter) | \t (tab) | , (comma) |
| QUOTE | " (double quotation mark) | " (double quotation mark) |
| ESCAPE | (N/A) | same as QUOTE |
| NULL | \N (backslash-N) | (empty string without quotation marks) |
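To see these defaults in action, the following sketch exports the same hypothetical table in both formats (server-side COPY TO also requires appropriate file-writing privileges). The TEXT output is tab-delimited with NULL written as \N; the CSV output is comma-delimited with NULL written as an unquoted empty string.

```sql
-- TEXT (the default format): tab-delimited, no quoting, NULL written as \N.
COPY metrics TO '/tmp/metrics.txt';

-- CSV: comma-delimited, double-quote QUOTE, NULL as an unquoted empty string.
COPY metrics TO '/tmp/metrics.csv' WITH (FORMAT csv);
```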