In my eleventh entry on SSIS optimization techniques I am going to focus on design patterns for loading data in parallel. When I first started working with SSIS, my goal was simply to get the right objects in the right order to do a reasonably good job of moving data. Eventually I started working with larger and larger data sets, and that mindset was no longer meeting the business needs. Loads were taking longer and load windows were getting shorter. I had to find another way.
Parallelism in SSIS and SQL Server can take many forms. The first and most obvious is to schedule different packages pulling different data at the same time; I think most people do this without even thinking about it. Another common scenario is an SSIS master package that calls some packages concurrently and others in sequence according to their dependencies. As with most things, there are some limits to the master package design. There are settings to control the number of concurrent executions in the control flow (the MaxConcurrentExecutables property) and in the data flow (the EngineThreads property). By default, MaxConcurrentExecutables is -1, which allows the number of logical processors plus two, and EngineThreads defaults to 10. You can find a great explanation of these properties in an MSDN article under the heading Configure the Package for Parallel Execution. They are well worth taking the time to learn and understand.
Designing SSIS Parallelism
Aside from the straightforward concepts presented above, there are a few more things you can do to enhance the parallelism of your load processes. Most of these are also pretty straightforward, but you need to think of them and design with them in mind. The first thing I like to do is address the issue of loading large tables; more specifically, tables where you move enough data every day that a single failure can stress the entire load process or jeopardize your load window SLAs.
Designing a package with multiple concurrent executions
The idea is pretty simple: build a single package with a single data flow to pull from your source and load your destination. The only additional thing is a WHERE clause in your source query that enables you to split the data set you load into multiple discrete chunks. Say, for example, you already have a WHERE clause with a date constraint, something like
WHERE ActivityDate = CONVERT(DATE,DATEADD(dd,-1,GetDate()))
to pull only yesterday's data. I would add an additional constraint to the query that allows me to split the data into even smaller amounts. This additional constraint should be something meaningful for your data; the key is to choose a data point that divides the set easily and, preferably, fairly evenly. Good examples are geography or sales rep. Do a little analysis on your data and determine the best values to use to split it into smaller chunks. You can use as many chunks as you like; I have some data broken down into 14 chunks, and some I have only split in half to improve load times. Once you have analyzed your data, add the additional constraint to your query as a parameter, or build the query in a variable with an expression that uses input parameters. To keep it simple, say you choose geography and you want to split and load your data by two regions. The WHERE clause in your source query might look something like this:
WHERE ActivityDate = CONVERT(DATE,DATEADD(dd,-1,GetDate())) AND Region = ?
The question mark (?) represents a parameter in the source query that takes the input for a region value. In one execution of the package you configure a region parameter value of "East," and in the other execution a region parameter value of "West." Make sure you use parameters and don't hard-code anything here; this gives you the ability to change the values easily in the future. The easy way to execute this is to set up 2 SQL Agent jobs to run at the same time, one loading East data and one loading West data. Depending on your source, target, and network resources, splitting your load into 2 or more parts that run concurrently can produce a dramatic drop in your load times.
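Before committing to a split column, it is worth checking how evenly its values divide the data. A minimal sketch, assuming a hypothetical dbo.SourceActivity table with the ActivityDate and Region columns used above:

-- Check how evenly each candidate value splits yesterday's data.
-- Roughly equal row counts mean roughly equal chunks.
SELECT Region, COUNT(*) AS RowsPerChunk
FROM dbo.SourceActivity
WHERE ActivityDate = CONVERT(DATE, DATEADD(dd, -1, GETDATE()))
GROUP BY Region
ORDER BY RowsPerChunk DESC;

If one value dominates the counts, it is usually better to pick a different column or combine values so each concurrent execution does a similar amount of work.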
Managing contention with parallel loads
First, I am a big believer in using staging tables as a target for all my load processes; it is widely considered a best practice. There are also several things you can do with staging tables that would be more difficult when loading directly to production. Let me walk through the different approaches you can use to set up your staging environment. Which of these are available to you depends partly on the version of SQL Server you are running.
The most common pattern is a single source, with a single load process loading a single, unpartitioned target. If the target has any indexing on it, there is very little you can do to parallelize without dropping the indexes: you will end up with locking contention running multiple loads against an indexed, unpartitioned target. If you drop all the indexes, including the clustered index, so that the target is a heap, you can usually use the fast load option in SSIS and run several concurrent loads without significant contention. In many cases this will be faster even after adding the index rebuild time, but you should test this scenario before using it in production.
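A minimal sketch of that drop, load, and rebuild pattern, assuming a hypothetical dbo.ActivityStage table with one clustered and one nonclustered index:

-- Before the parallel loads: drop the indexes so the target is a heap.
DROP INDEX IX_ActivityStage_Region ON dbo.ActivityStage;
DROP INDEX CIX_ActivityStage ON dbo.ActivityStage;

-- ...run the concurrent SSIS data flows with fast load here...

-- After the loads: rebuild the indexes once.
CREATE CLUSTERED INDEX CIX_ActivityStage ON dbo.ActivityStage (ActivityDate);
CREATE NONCLUSTERED INDEX IX_ActivityStage_Region ON dbo.ActivityStage (Region);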
A little explanation of table partitioning is in order. Table partitioning is a way to divide larger tables into smaller pieces, typically by the values in a single column. These smaller pieces can be spread across different filegroups and storage, yet a partitioned table is still addressed as a single logical entity. The end result is that data access is usually faster, because the server can work on each piece of the table independently rather than on the entire table. Developers still see a single table, but knowledge of the partitioning scheme can help them write more efficient queries.
If you are running SQL Server Enterprise Edition, then table partitioning is available to you, and in some cases it can enhance your loads. However, as a DBA friend of mine once reminded me, table partitioning is a storage strategy, not an optimization strategy. Basically, I agree, but that doesn't mean table partitioning can't be both; it all comes down to the column used to partition the table. For a good overview of partitioning in SQL Server, I suggest you check out this MSDN article.
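To make the later examples concrete, here is a minimal sketch of a daily partitioning setup on an OrderDate column. The names (pfOrderDate, psOrderDate, dbo.Orders) and boundary dates are hypothetical:

-- One boundary per day; RANGE RIGHT puts each boundary date in the partition to its right.
CREATE PARTITION FUNCTION pfOrderDate (date)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-01-02', '2024-01-03');

-- Map every partition to a filegroup (ALL TO [PRIMARY] keeps the sketch simple).
CREATE PARTITION SCHEME psOrderDate
    AS PARTITION pfOrderDate ALL TO ([PRIMARY]);

-- The table is created on the scheme, partitioned by OrderDate.
CREATE TABLE dbo.Orders
(
    OrderID   bigint NOT NULL,
    OrderDate date   NOT NULL,
    Amount    money  NOT NULL
) ON psOrderDate (OrderDate);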
Compared to the single source and single target above, a simple load to a partitioned target still uses a single source query; the SQL Server engine on the target separates the incoming data and routes each row to the proper partition based on the value in the partitioning column. The SSIS data flow looks exactly the same as a single source to a single target, because all of the work to separate rows by partition is done by the SQL Server engine.
The next option for loading a partitioned target allows for more parallelism and a reduction of your overall load time: run several copies of the load concurrently, each writing to a different partition.
The way to load this is with 3 queries that differ only in their WHERE clauses. Let's take a more practical example. Suppose your partition scheme is on a date column with a different partition for each day, and you want to load the last 3 days' worth of data. You would set up a single SSIS package with a parameter-driven query in the source, something like
WHERE OrderDate = ?
The question mark (?) would be the placeholder for your date. I usually don't use a true date as an input parameter to a package; I find it easier to use an integer parameter based on the number of days ago, because that makes the SQL Agent jobs easier to set up. I then take that "DaysAgo" parameter and use it in a variable to create the proper date value. The expression for the variable would look something like this:
(DT_WSTR, 10)(DT_DBDATE)DATEADD("dd", -@[$Package::DaysAgo], GetDate())
If you plug this variable into your source query parameter, you have a single package that you can simply call 3 times with 3 SQL Agent jobs, each with a different input for the DaysAgo value: one job passes 3 (for 3 days ago), the next 2, and the last 1. And since you are loading a target that is partitioned by day, you can run all three concurrently without worrying about locking, because each one loads a separate partition.
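As a sketch of what those three concurrent executions might look like under the project deployment model, each SQL Agent job step would run something like the following dtexec command; the catalog path and server name are hypothetical, and only the DaysAgo value differs:

dtexec /ISServer "\SSISDB\ETL\DailyLoads\LoadOrders.dtsx" /Server "MyETLServer" /Par "$Package::DaysAgo";3
dtexec /ISServer "\SSISDB\ETL\DailyLoads\LoadOrders.dtsx" /Server "MyETLServer" /Par "$Package::DaysAgo";2
dtexec /ISServer "\SSISDB\ETL\DailyLoads\LoadOrders.dtsx" /Server "MyETLServer" /Par "$Package::DaysAgo";1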
This approach can reduce your overall load times as long as you are not saturating the source, the target, or your network, and it generally takes a lot to saturate any of them. I like to keep adding parallel, concurrent processes until I see a degradation in overall speed. Just keep an eye on your logs so you can monitor performance.
Table partitions also offer an interesting and very quick way to move data from staging to your production table, called partition switching. There are a number of prerequisites that are beyond the scope of this article, but I found a great overall view (including a poster) of partition switching on the Brent Ozar website. If you are considering using table partitioning, it is a good place to start.
Partition switching allows you to move the data you have loaded into your staging tables into production tables by simply changing some metadata, rather than moving the data itself. Something else I find can improve load times is to load individual heaps (one for each day, in my example), then add the indexing and constraints that align each heap with the table partition scheme, and then do the switch. This takes more time and effort to set up and get working, but you get both the speed advantage of loading a heap and the speed advantage of a partition switch.
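A minimal sketch of the switch itself, reusing the hypothetical dbo.Orders table and pfOrderDate function from above and assuming a dbo.Orders_Stage table that has already been loaded, indexed to match dbo.Orders, placed on the matching filegroup, and check-constrained to the one day being switched:

-- Find the partition number that holds the staged day.
DECLARE @p int = $PARTITION.pfOrderDate('2024-01-02');

-- Metadata-only move: the staged rows become that partition of dbo.Orders.
-- The target partition must be empty for the switch to succeed.
ALTER TABLE dbo.Orders_Stage SWITCH TO dbo.Orders PARTITION @p;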
This does get somewhat complicated, and it takes a lot of explanation and setup, but it can pay dividends. The real key is understanding all the issues around table partitioning and how you can take advantage of them to move your data into production.
I covered a few more advanced topics here, but if you take the time to follow the links and understand the issues, you will find that you can really decrease your load times, in some cases quite drastically.
I hope you find this useful. Thanks for reading.