Database Migration from Microsoft Azure to Snowflake on AWS: Part 1

Technological advancements have increased the demand for enhanced infrastructure with fast deployments. Public cloud providers are constantly upgrading their policies and infrastructure to match ever-growing business requirements. This competition gives businesses the ability to choose the cloud provider(s) that best fit their specific governance and cost-effectiveness needs.

In this blog, we discuss migrating databases from SQL Server on an Azure VM to Snowflake on AWS.

Understanding the Problem

Microsoft Azure is a cloud computing offering built on Microsoft-managed data centers, while Snowflake is a cloud-based data warehousing solution delivered as software-as-a-service (SaaS) on top of various public cloud providers.

In this scenario, we needed to migrate the SQL Server databases to Snowflake. The migrated data also needed to land in Snowflake with multiple schemas corresponding to the database names, with the data in precise and correct form.

While Azure supports ingestion from various sources and clouds, it did not support direct egress to other cloud providers. Finding a workable solution for moving data out to AWS-based Snowflake was the first challenge.

Our Approach to Migration

According to the Snowflake documentation, Snowflake has a stage feature that can address the above issue. It is, essentially, a path in which the data files that need to be ingested are stored, similar to the concept of an SMB/Samba mount. Using a Snowflake stage allowed us to point at Azure Blob storage, after which Snowflake could read and ingest the data from flat files. The next step was to move data from SQL Server to Azure Blob storage.

The Migration Process

The migration process consisted of the following steps:

  • Replicate the database schema in Snowflake to match the Azure SQL database
  • Set up an Azure Data Factory pipeline to create Parquet (Snappy-compressed) flat files on Blob storage
    • Use Parquet files for data compression and quick data loading into Snowflake
  • Create a file format in Snowflake
    • create or replace file format <file_format_name> type = 'parquet';
  • Create a stage in Snowflake (a consolidated sketch of the Snowflake statements follows this list)
    • create or replace stage <Stage_Name>
    • url = '<Azure Blob file/folder path>'
    • credentials = (azure_sas_token = '<token>')
    • file_format = <file_format_name>;
  • Verify that the files are staged
  • Finally, load the data into the Snowflake table from the stage
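
To put the Snowflake-side steps together, here is a minimal consolidated sketch; the database, schema, file format, and stage names, as well as the storage account, container, folder, and SAS token, are placeholders rather than values from the actual project.

-- Replicate the source database as a database/schema in Snowflake (placeholder names)
create database if not exists MIGRATION_DB;
create schema if not exists MIGRATION_DB.SALESLT;
use schema MIGRATION_DB.SALESLT;

-- File format for the Parquet files exported by Data Factory
create or replace file format parquet_ff
  type = 'parquet';

-- External stage pointing at the Azure Blob container that holds the exported files
create or replace stage azure_parquet_stage
  url = 'azure://<storage_account>.blob.core.windows.net/<container>/<folder>'
  credentials = (azure_sas_token = '<token>')
  file_format = parquet_ff;

-- Confirm that the exported files are visible to Snowflake before loading
list @azure_parquet_stage;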

(Note that all Parquet data is stored in a single column ($1).)

copy into TEST1
from (select
    $1:CustomerID::varchar,
    $1:NameStyle::varchar,
    $1:Title::varchar,
    $1:FirstName::varchar,
    $1:MiddleName::varchar,
    $1:LastName::varchar,
    $1:Suffix::varchar,
    $1:CompanyName::varchar,
    $1:SalesPerson::varchar,
    $1:EmailAddress::varchar,
    $1:Phone::varchar,
    $1:PasswordHash::varchar,
    $1:PasswordSalt::varchar,
    $1:rowguid::varchar,
    $1:ModifiedDate::datetime
from @<Stage_Name>);
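
As a quick post-load sanity check (not part of the original step list; TEST1 is the example table from the COPY statement above), the row count can be compared against the source table and the recent load history can be inspected:

-- Row count in the target table, to compare against the source row count in SQL Server
select count(*) from TEST1;

-- Recent COPY activity for the table (last hour shown as an example window)
select *
from table(information_schema.copy_history(
  table_name => 'TEST1',
  start_time => dateadd(hour, -1, current_timestamp())));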

Let's break down the details of each step in the process.

1. Leveraging Azure Data Factory

Azure Data Factory is a GUI-based tool that facilitates an end-to-end ETL solution and provides a step-by-step guide for building the pipeline. It has a source (SQL Server), a target (Blob storage), and the settings required for tuning the performance of the pipeline. These options made it an ideal solution for the customized needs of this migration project, which included wrangling the data before exporting it to Blob storage. Data Factory addressed this seamlessly, as covered in detail later in this blog.

2. Tuning the Performance of the Data Factory Pipeline

The tricky part of this migration was that while the pipeline was easy to build, using its full potential and deriving optimal performance was a challenge.

There were terabytes of data that needed to be exported from SQL to Blob storage, which would have taken weeks to transfer without tuning. After an adequate POC, we found that Data Factory supports a dynamic range when reading data from the source.

Let's say there is a table XYZ that is 800 GB. As per the approach mentioned above, Data Factory is required to move this large volume of data into blobs. With the standard method, the GUI, by default, writes the data to Blob storage serially, which is slower.

Now, if we look at the table XYZ with a column "date", the 800 GB of data can be partitioned into smaller sets by month or year. This means that each partition is not directly dependent on the other date partitions and can be written in parallel, which is quicker and more resource-efficient.

This can be achieved by using the dynamic range filter, which can only be applied by writing the select statement yourself rather than ticking the checkbox for existing tables; a sketch of such a source query is shown below.
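
The copy activity's dynamic range option expects the source query to include a partition-condition placeholder that Data Factory replaces with a range predicate on the chosen partition column. A minimal sketch for the hypothetical table XYZ, assuming its "date" column is configured as the partition column, might look like this:

-- Source query for the Data Factory copy activity with dynamic range partitioning;
-- ?AdfDynamicRangePartitionCondition is substituted by Data Factory at run time,
-- so each date-range partition can be read and written in parallel.
select *
from XYZ
where ?AdfDynamicRangePartitionCondition;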

3. Using Parquet Files

The exported data needed to be stored in flat files while maintaining integrity and compression. CSV was the first choice, but during the POC many challenges were faced while writing the files; preserving spaces and newline characters corrupted the data. Data Factory offered the Parquet file format, which had a great compression rate (75%) and also maintained the integrity of the data. Parquet is optimized to work with complex data in bulk and was therefore suitable for this project. In our case, 40 GB of data was compressed to 11 GB.

4. Integration Runtime

For Data Factory to work, it required additional compute power, which was facilitated in the following ways:

  • Auto-Resolve Integration Runtime

As the name suggests, the compute resources are managed and provisioned by Microsoft data centers, and the cost is incurred on a per-unit basis. The region of these resources is decided automatically based on availability. This runtime is selected by default when running a Data Factory pipeline.

  • Self-Hosted Integration Runtime

This runtime uses resources that already exist. For example, the self-hosted IR involves downloading a client program onto the machine whose resources are required, creating a service, and coupling it with Data Factory.

5. Setting Up the Self-Hosted Integration Runtime

This was the best available option since the SQL Server was already hosted on a standalone Azure VM, which provided the freedom to use the full capacity of the resources attached to it. It included the following steps:

1. Setting Up the Self-Hosted IR

  • For Azure Data Factory to work with the Azure VM, the Azure Data Factory Integration Runtime needed to be set up
    • In Azure Data Factory, select "Manage" and then "Integration Runtimes"
    • Select "+ New", then "Azure, Self-Hosted"
    • Next, select "Network Environment -> Self-Hosted"
    • Next, give the self-hosted IR an appropriate name
    • Once the IR is created, authentication keys will be presented. Copy these keys
  • The last screen provides a link to download the Microsoft Integration Runtime
      • Download and install the integration runtime from the Microsoft link
      • Once installed, enter the Auth Key1 value and register via "Launch Configuration Manager"
      • When registered, the self-hosted IR will bind with the Data Factory
      • Now, install the 64-bit Java Runtime, as it is required for the self-hosted IR to work. Refer to this guide.

2. Create a Linked Service in Data Factory

  • Proceed to Data Factory
    • Select "Manage" and then "Linked Services"
    • Select SQL Server as the type of service and give it an appropriate name
    • Now, under "Connect via Integration Runtime", select the created self-hosted IR
    • Put in the server name. It is important to note that this should be the same name that is used to connect to the SQL Server successfully
    • Put in the credentials and test the connection

In the next blog, we will look at some of the challenges encountered during the migration, cost-saving measures, and our approach to data validation.
