BUSINESS INSIGHTS

Dec 21, 2016

5 Important Steps When Migrating to your Scaled-Out Data Warehouse

Josh Fennessy Posted by Josh Fennessy

In Leo Furlong’s and Merrill Aldrich's recent post, Options to Scale Your SQL Server Data Warehouse, they described different scenarios for growing your data warehouse beyond a single-server solution. While appliances like APS or platform solutions like Azure SQL Data Warehouse (DW) remove the need to build your own scaled-out infrastructure, they still require solution planning to migrate your existing single-server DW to a distributed format.

iStock-501195316edited.png

1. Know how your data will be distributed

This is the most important step to plan for when moving to a scaled-out DW solution. Data distribution is a key step in the linear performance scale that you can achieve in a distributed environment. Distribution is very important because an incorrect distribution key between two or more tables will result in the need for data shuffling between nodes in the database. The shuffling that can occur is very expensive because in this operation, data is physically being moved between disks, file groups, or in some cases, even physical nodes. While this data is being shuffled, no processing is happening – meaning your query stalls while the data is being moved around. Therefore, it is important to design your distribution patterns in such a way that you will minimize the amount of shuffling that will happen.

In both cases of APS or Azure SQL DW, data is distributed on a single column in a table. You specify this column at the time the table is created, and it cannot be changed. APS supports three types of table distribution:

REPLICATE – a copy of the entire table is stored on each distribution of the distributed database. NOTE: Replicated tables are currently only supported in APS. Azure SQL DW does not support replicated tables.

ROUND_ROBIN – the data is distributed row by row among the distributions in the distributed database.

HASH – a single column is chosen as the distribution key, and rows are distributed to head distribution based on values in the chosen column.

In a production solution, you should NEVER use ROUND_ROBIN. It will never result in a distribution that avoids shuffling. Since the data is distributed row by row, even simple queries will result in a significant amount of data shuffling.

This leaves two types of distribution patterns that are acceptable for a production solution. REPLICATED tables should be used for tables that are less than 5GB in total size, and aren't updated that often. Command examples of good candidates for replicated tables are smaller dimension tables in a star schema. This type of replication works well for smaller tables because any joins to these tables can be satisfied on each distribution without any shuffling, as all of the data is available.

For your larger tables, you'll be using the HASH distribution, which means that you must pick an appropriate distribution column. There are a few factors to consider when choosing your distribution key:

Skew – One key factor in choosing the distribution key is how much skew will be created. Skew happens when more data exists on one distribution than the other, or when there aren't enough unique values in the column to distribute data to all distributions. For example, a column containing Yes/No values is not a good distribution key, since the data will only be distributed to 2 distributions, instead of the 16+ distributions that are available.

Changing Values – Once a row has been written to the table, the value in the distribution key column CAN NOT change. To change that value in the column you would need to delete the row and reinsert it. Therefore, it is a best practice to choose a distribution key column whose values do not change after being written.

Join/Group by Usage – For many tables, there is a column that is often used in Joins to other large tables. This column is generally a good candidate for a distribution key – as long as it doesn't break the two previous best practice rules. For example, the OrderHeader and OrderLine tables are always joined by OrderNumber. The OrderNumber column would be ideal for the distribution key, because all rows in the OrderHeader table would be able to join to their corresponding rows in the OrderLine table without shuffling. The HASH algorithm used on both tables is the same, and since the columns match, we know the rows will live in the same distribution.

For more information on how APS and Azure SQL DW works with distributed tables, check out this short Developer Introduction by James Rowland-Jones, Principal Program Manager for Azure SQL DW at Microsoft.

2. Test your reporting applications

While both APS and Azure SQL DW work with standard SQL Server ODBC connections, it's still a good idea to test your reporting applications during a proof of concept or trial period.

Some reporting applications write their own SQL statements to issue to the database. While not common, it is possible that the reporting application issues a statement that isn't compatible with APS and Azure SQL DW's DSQL interpretation. It's a good idea to test your most used reports against the system to ensure that the query issues are compatible.

Also, some reporting applications use non-standard SQL server drivers. Understanding exactly how your reporting system connects to the appliance will be very important prior to migrating your DW solution over.

If you don't yet have your APS appliance installed, we've helped past customers run a proof of concept in partnership with a Microsoft Technology Center (MTC). Many of these MTC locations have an APS appliance installed, and we can help schedule time on that appliance for a customer to do a test migration, and ensure that their current reporting systems will integrate with the appliance. If you're considering moving to Azure SQL DW instead of the on-premises route, Microsoft is currently offering a 1-month free trial – which is a perfect opportunity to shake out any major connectivity issues that you may face.

3. Check your procedural code for incompatibilities

Although migrating to APS or Azure SQL DW doesn't require a complete relearn of query and procedural programming, DSQL does have some important differences between TSQL that can have an impact on your current data processing and querying solutions.

While there are migration tools to help migrate a DW's table schema and data from a single-server approach to a distributed one, there doesn't exist a solution to convert incompatible procedural code – that still must be done manually.

Some common incompatibilities are:

  • Variable assignment using the SELECT @variable = {statement} format. All variables need to be assigned using a SET statement.
  • CTE expressions in UPDATES and DELETES are not supported. CTEs are supports for SELECTS and INSERTS.
  • Temporary tables need to be modified to include a distribution scheme appropriate for the operation being performed. Also, existence must be checked when using IF OBJECT_ID() IS NOT NULL to determine the table already exists. In APS/Azure SQL DW temp tables exist for the entire session, not just the query context.
  • ANSI joins are not allowed in UPDATE or DELETE statements. Implicit joins are acceptable, but often it is required to use a temp table to join the data together to be updated prior to running the statement.
  • SELECT…INTO is not supported in DSQL, instead the CREATE TABLE AS SELECT (CTAS) statement is used to perform the operation of creating a new table as the result of a select statement.
  • The MERGE statement is also not supported in DSQL, so instead, a solution using CTAS can be implemented and is described in this Microsoft article.

4. Be ready to modify your ETL processing

One of the biggest development hurdles to cross when migrating to a scale-out DW solution is what do with your current ETL processes.

Often ETL is written in a tool like SSIS or Informatica. While APS/Azure SQL DW supports both tools, and even ships with special destination adapters to maximize data loading performance, it's often more efficient to translate your ETL to stored procedures that run natively on the appliance.

If you rely on SSIS or Informatica to do your data cleansing and transformation, you are not taking advantage of the distributed processing engine of APS/Azure SQL DW. Often, ETL processing time is an important reason to move to a distributed solution, and therefore, it's important to understand how to maximize that performance.

As you are migrating your ETL processes to DSQL stored procedures, a few common design patterns should be paid attention to:

  • Consider staging data to the appliance with dwloader (APS) or PolyBase (Azure SQL DW). These tools are optimized for inserting bulk data into the environment, and are often the fastest way to get staging data loaded – often surpassing the 2TB/hour rate.
  • Processing that currently makes use of UPDATE statements may be more efficient with the CTAS and RENAME OBJECT commands to build a new copy of the table and replace it. This will also reduce the amount of object locking occurring, and result in reduced downtime for user queries.
  • Use PARTITION to quickly replace volatile data (such as current month transactions that are updated often) or archive data to other tables quickly. Table partitioning in APS/Azure SQL DW is a simpler implementation than that in SQL Server and is a bit easier to manage during reoccurring ETL processing.

Don't let the magnitude of ETL migration scare you away from moving to a distributed processing environment. Often, your current ETL processes can be repointed to insert directly into the new system with minor modifications – a lift and shift approach – then, a phased project approach can be implemented to migrate your ETL to take advantage of the distributed processing paradigms.

5. Identify your highest priority pain points

You've probably made the decision to move to a scaled-out DW solution because you JUST WANT THINGS TO BE FASTER! And, good choice, your solution likely WILL be faster, much faster, than you could achieve with your single-server solution of yesteryear.

However, the road to get to the faster solution will be one that could last as little as a few weeks, or as much as 2 years, depending on the complexity of your data warehouse. You can also expect and plan for continued optimization and new development after that. So, knowing you can't tackle everything at once, it's important to optimize your highest pain points first:

Is your user report performance the most important aspect of your decision? 
If so, then taking advantage of existing ETL processing, and focusing on distribution design and query performance should be the efforts of your first phases of solution migration.

Are your current ETL processes running so long that users are negatively affected?
If this is the strongest pain point for you and your team, then it's a good idea to start tackling those long-running ETL processes and get them translated to DSQL ASAP. Your user reports will continue to run faster against the distributed environment, but you can optimize those later, after your ETL issues are sorted out.

While distributed processing is a great solution to scale issues, it's not magic. Every process you run in a distributed environment will not be faster – you'll run into situations where your new environment is not as efficient for a specific process as your old one. You'll need to plan for those types of issues and realize that you can't tackle everything at the same time. Having a priority plan of attack will help to show the quick value of your new solution implementation.

You're not alone in this journey

Moving to a distributed processing environment really is a journey. In some ways, it will feel painless – just a simple connection modification and your reports are now working! But in others, it will feel like you're relearning everything you know about data warehousing.

You're not alone! We've worked with many clients on this path, and we've found that it helps to have an experienced team on your side. If you're ready to take your first steps, contact us today and we'll help you build a plan to ensure that your scale-out DW is a huge success.

New Call-to-action
Josh Fennessy

About The Author

Josh Fennessy

Josh is a Solution Architect at BlueGranite. Josh is passionate about enabling information workers to become data leaders. His passions in the data space include: Modern Data Warehousing, unstructured analytics, distributed computing, and NoSQL database solutions.

Latest Posts

Click here to register!