Whenever I start an SSIS development activity, one thing I keep in mind is to design and develop the package so that it delivers the best performance possible. With that in mind, I have collected the best practices that I and many others follow.
These practices will help you build a highly efficient ETL project. However, other factors such as infrastructure and network also affect performance, so do thorough testing before putting these changes into your production environment.
Naming Conventions
Every coding language has its own naming standards. Naming conventions help by packing the most useful information about an object into the shortest possible name. For instance, in a SQL statement, if you see 'dbo.t_dim_product' or 'mart.v_fact_sales_yearly', you can tell that these are two different objects: the former is a dimension table containing product information and the latter is a view based on the sales fact table that returns yearly data.
Make it a practice to name all package elements and to give every object a short description; this is very helpful while reviewing and debugging the code.
- Avoid using connection-specific information (such as server name, database name or file path) in the names of connection managers. For example, 'TransactionHistory' or 'Trans_Hist' is safer than '100.100.10.1\Transaction_Hist.Orders'.
- Avoid using a dot ('.') in the names of package elements. A dot in an element name is sometimes confused with SQL Server object naming (which throws validation errors) and hence should be avoided. An underscore ('_') is recommended instead.
- Avoid lengthy package names; make sure a package name doesn't exceed 100 characters. When deploying to SQL Server storage, any characters beyond the first 100 in a package name are truncated, which can cause the package to fail at runtime, especially when it is called through an 'Execute Package Task'.
I would also like to share a link where you can get more insight into naming conventions.
Projecting Columns
In the SQL statement that retrieves data from the source, we often select columns that are not used at all in the process. These unused columns occupy part of the buffer and ultimately lead to poor performance. The following tips help boost process performance:
- Moving data from upstream to downstream in a package is a memory-intensive task. At most steps and components, carefully check that no unnecessary columns are passed downstream. This avoids extra execution overhead and improves the package's overall performance.
- While configuring an OLEDB connection manager as a source, avoid using 'Table or view' as the data access mode. This is equivalent to 'SELECT * FROM <table>' and, as most of us know, it pulls every column in the table, including those that are not even required. Always try to use the 'SQL command' data access mode and include only the required column names in your SELECT statement. This blocks unnecessary columns from being passed downstream (a minimal sketch follows this list).
- Use the 'NOLOCK' hint to avoid locking overhead. This improves the speed of large table scans.
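As a minimal sketch, assuming a hypothetical dbo.TransactionHistory table and columns, a 'SQL command' for the OLE DB Source might look like this:

```sql
-- Only the columns used downstream are projected; NOLOCK avoids shared-lock
-- overhead on a large scan (at the cost of possibly reading uncommitted data).
SELECT transaction_id,
       product_id,
       transaction_date,
       quantity,
       actual_cost
FROM dbo.TransactionHistory WITH (NOLOCK);
```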
Effective Data Types
An ETL process is all about handling data effectively, and keeping data flow performance on par takes some care. You have to design the process well and plan every aspect of the data that travels through the pipeline. The choice of data types also affects the data flow and the transformations applied to it. Here are a few pointers to look out for:
- Make sure to use narrow data types, which allocate less memory for the transformation. Avoid excessive casting of data types (see the sketch after this list).
- Keep an eye on precision issues when using data types such as money, float and decimal.
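A minimal sketch of the narrow-types idea, assuming a hypothetical dbo.t_fact_sales table and columns:

```sql
-- Cast only where needed, to the narrowest type that safely holds the data.
SELECT CAST(order_qty  AS smallint)       AS order_qty,   -- fits comfortably in 2 bytes
       CAST(unit_price AS decimal(10, 2)) AS unit_price,  -- explicit precision and scale
       order_date                                         -- already a date, no cast needed
FROM dbo.t_fact_sales;
```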
Filtered Source Data
If passing only the required columns is a best practice, why not also extract a filtered result set? Not just the number of columns but also the number of rows entering the pipeline affects data flow task execution time and performance.
It is recommended to filter data in the source adapter rather than using a 'Conditional Split' transformation in later stages. This boosts data flow execution time significantly when processing medium or large volumes of business data.
Please note that it is the developer's duty to carefully verify the source data, filter out unwanted information and select only the result set required for further processing.
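A minimal sketch of filtering at the source, assuming a hypothetical dbo.t_fact_sales table and an incremental load driven by a package variable:

```sql
-- Parameterized 'SQL command' for the OLE DB Source; the '?' placeholder is mapped
-- to a variable (e.g. User::LastExtractDate) on the source's Parameters page.
SELECT sales_order_id, product_id, order_qty, unit_price, order_date
FROM dbo.t_fact_sales
WHERE order_date >= ?;
```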
Pre-Sorting Data
Sorting data is an expensive operation. In SSIS you can sort data coming from upstream using the 'Sort' transformation, but this is a memory-intensive task and most of the time it reduces data flow performance.
As a best practice, wherever you know the data is coming from database tables, it is better to perform the sorting at the database level, within the query itself. This is a very good practice because the database engine is far better optimized for sorting, and the work happens on the database server.
Have a look at this useful article by Rob Farley, 'The SSIS tuning tip that everyone misses', wherein he clearly explains why pre-sorting is important. He has also shared a lot of helpful information on SQL querying in his blog.
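A minimal sketch of pre-sorting at the database, assuming a hypothetical dbo.t_dim_product table:

```sql
-- Sort in the query, ideally on an indexed column, instead of using a Sort transformation.
SELECT product_id, product_name, list_price
FROM dbo.t_dim_product
ORDER BY product_id;
```

After this, open the source's Advanced Editor and set 'IsSorted' to True on the output and 'SortKeyPosition' to 1 on product_id, so that downstream components such as Merge or Merge Join know the data is already sorted.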
Pulling High Volumes of Data
In the case of bulk inserts, the data flow task can execute at a snail's pace. This could be due to many reasons: all buffers may be allocated and the process is waiting for free buffers, locks may be held on the target tables, or data retrieval may simply be slow. The buffer issues can be avoided by setting a proper buffer size and buffer storage path; slow data retrieval can be mitigated by using hints in the retrieval SQL statements.
There is one more area that needs a cautious approach, and that is indexes. Indexes are created to speed up data searches, but if the target table has indexes defined, bulk data loading becomes a lot slower.
Historical data loads are normally bulk loads, typical of ETL processes that perform a complete refresh of historical data. The recommendation is to consider dropping (or disabling) the target table's indexes before inserting the data, especially when the volume of inserts is very high, and rebuilding them afterwards. A minimal sketch of this pattern follows.
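The pattern below is a sketch only; the index and table names are assumptions, and the two statements would typically run in Execute SQL Tasks placed before and after the Data Flow Task. Note that disabling works for nonclustered indexes; disabling a clustered index makes the table inaccessible, so handle that case separately.

```sql
-- Before the load: disable the nonclustered index so the bulk insert doesn't maintain it.
ALTER INDEX IX_fact_sales_product ON dbo.t_fact_sales DISABLE;

-- ... the Data Flow Task performs the bulk insert here ...

-- After the load: rebuild the index once over the complete data set.
ALTER INDEX IX_fact_sales_product ON dbo.t_fact_sales REBUILD;
```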
OLEDB Destination Settings
After extracting and transforming the data comes the final and vital phase: loading it. To load data into a target we need a component that connects to the target data source, and the 'OLEDB Destination' is the most commonly used one. If configured carefully, this component contributes immensely to data loading performance.
The following are a few configurable properties of the component:
Data Access Mode - This property provides a fast load option, which internally uses a BULK INSERT statement to load data into the destination table instead of a simple INSERT statement for each row. So unless you have a reason to change it, keep this default option.
(If you opt for the 'fast load' option, there are a few more settings you should configure accordingly, as explained below.)
Keep Identity - By default this setting is disabled, which means the destination table (if it has an identity column) will generate identity values on its own. If you enable it, the SSIS dataflow engine preserves the source identity values and inserts the same incoming values into the destination table.
Keep Nulls - By default this setting is disabled, which means that when a NULL arrives from the source for a column that has a default constraint on the destination, the default value is inserted. If you enable it, the default constraint on the destination column is ignored and the NULL from the source column is preserved and inserted into the destination.
Table Lock - By default this setting is enabled, and it is recommended to leave it that way unless the table may be used simultaneously by multiple tasks or processes. It means a table-level lock is acquired on the destination table instead of many row-level locks, which could otherwise escalate into locking problems.
Check Constraints - By default this setting is enabled, and the recommendation is to disable it. After disabling it, make sure the incoming data is clean and doesn't violate the destination table's constraints; otherwise reset it to its default state. This setting specifies that the dataflow pipeline engine validates incoming data against the constraints of the target table; disabling it bypasses the constraint check and improves data load performance.
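As a rough analogy only (SSIS does not run this exact statement), the fast load settings correspond to bulk load options such as these; the file path and table name are hypothetical:

```sql
BULK INSERT dbo.t_fact_sales
FROM '\\fileserver\staging\sales.dat'
WITH (
    TABLOCK,           -- Table Lock
    KEEPIDENTITY,      -- Keep Identity
    KEEPNULLS,         -- Keep Nulls
    CHECK_CONSTRAINTS  -- Check Constraints
);
```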
Rows per Batch and Maximum Insert Commit Size Settings
Buffer management is one of the fundamental activities of any data transfer process. In SSIS, buffer size and allocation can be managed to an extent using two properties, 'Rows per Batch' and 'Maximum Insert Commit Size'. These are very important for keeping tempdb healthy and controlling transaction log growth.

Rows per batch - The default value for this property is -1, which means all incoming rows are treated as a single batch. You can change this default behavior and break the incoming rows into multiple batches; the only allowed value is a positive integer specifying the maximum number of rows per batch.
Maximum insert commit size - The default value for this property is 2147483647 (the largest value of a 4-byte signed integer), which means all incoming rows are committed once, on successful completion. You can specify a smaller positive value to make a commit happen after that number of rows.
You might be wondering whether changing the default values of these properties adds extra overhead for the dataflow engine, which now has to load and commit several times. Yes, it does, but at the same time it relieves the pressure on the transaction log and tempdb and stops them from growing tremendously.
Specifically during high-volume data transfers, if these properties are left at their defaults, the transaction log and tempdb keep growing during the load; tempdb can soon run out of space and the load will fail. So it is recommended to set these properties to optimum values.
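While tuning these values, a quick way to watch transaction log growth between test runs is the command below, which reports the log size and percentage used for every database:

```sql
DBCC SQLPERF(LOGSPACE);
```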
Package Protection/Security
During SSIS package development, you often have to share a package with other team members or deploy it to another environment, such as UAT or production. At that point, the developer has to ensure the correct package protection level. If a package is deployed with the default protection level, 'EncryptSensitiveWithUserKey', it might not execute as expected in other environments, because the package was encrypted with the developer's personal user key.

In general, to avoid package deployment errors due to security settings and to ensure smooth execution across all environments, set the package protection level to 'DontSaveSensitive'.
Before changing protection levels, you should first understand their behavior; for that, please refer to this link.
Execute SQL Task
'Execute SQL Task' is our best friend in SSIS; I hope everyone agrees on this. We can use it to run a single SQL statement or multiple statements at a time. The beauty of this component is that it can return results in different ways: as a single row, a full result set or XML. You can create different types of connections with this component using providers such as OLEDB, ODBC, ADO and ADO.NET. It is frequently used together with a 'Foreach Loop Container' to drive an iteration loop from the result set returned by the 'Execute SQL Task'.
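As a minimal sketch, assuming a hypothetical dbo.t_etl_file_queue control table, a 'Full result set' query like the one below can be stored in an Object-typed variable (for example User::FileList) and iterated by a Foreach Loop Container using the Foreach ADO Enumerator:

```sql
-- Returns the list of files still to be processed; each row drives one loop iteration.
SELECT file_name, file_path
FROM dbo.t_etl_file_queue
WHERE processed_flag = 0;
```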
UNION ALL or MERGE Transformation
As you know, the 'Merge' transformation merges two sorted datasets and 'Union All' merges multiple datasets. It is preferable to use the 'Union All' transformation if your package doesn't require sorted output, as it performs no reordering of output rows.
The 'Merge' transformation, on the other hand, requires sorted inputs, so it is usually preceded by a 'Sort' transformation on each input (when you are not sure the incoming data is already sorted). If you are sure the data in the pipeline is already sorted, set the 'IsSorted' property of the source adapter's output to 'True'; this avoids unnecessary and expensive sort operations.

Group by Sequence Container
It is a best practice to use 'Sequence Containers' in SSIS packages to group related components at the 'Control Flow' level. This offers a rich set of facilities:
- Provides a scope for variables that a group of related tasks and containers can use.
- Provides facility to manage properties of multiple tasks by setting property at Sequence container level.
- Provides facility to set transaction isolation level at Sequence container level.
Flat File Connection Manager
When it comes to tuning flat file connection managers, I remember one issue where column data types are determined automatically. If you are using a flat file connection manager, after verifying that the number of columns is correct, make sure their data types are appropriate in length and type. Creating a Flat File connection manager (for .txt, and sometimes even for .xls or .csv files) with the default settings defines every column's data type as string [DT_STR]. This can be clearly observed in the screenshots below.

A connection with the default data types might not be the right option, because your source files may contain numeric, integer or Boolean columns; passing them downstream as strings takes unnecessary memory and can lead to package failure in later stages.
Checkpoint to Restart
Whether you are designing an ETL solution for a small, medium or large enterprise, it is always good to be able to restart failed packages from the point of failure. SSIS has a great feature called 'Checkpoints' to support restarting failed packages from the point of failure; however, you have to configure the checkpoint feature at the package level.
When using checkpoints, use an expression to populate the 'CheckpointFilename' property so that it includes the value returned by 'System::PackageName'. This also makes it easy to identify which checkpoint file is used by which package.
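As a minimal sketch, assuming a hypothetical C:\SSIS\Checkpoints folder, the property expression could be "C:\\SSIS\\Checkpoints\\" + @[System::PackageName] + ".chk", which yields a separate, clearly named checkpoint file for each package.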
Refer to 'Checkpoints in SSIS' (http://www.sqlservercentral.com/articles/Integration+Services+(SSIS)/65473/) for the steps to create checkpoints. I found this article, by Dinesh Asanka, on SQLServerCentral.
Package Re-usability
The most desired feature in SSIS package development is re-usability. We can think of reusable packages as standard packages that can be re-used while developing different ETL components (mainly to standardize logging, event handling and configuration settings).
In SSIS, this can be achieved easily using the template feature. SSIS template packages are re-usable packages that you can use in any SSIS project, any number of times. To learn how to configure this, please see http://support.microsoft.com/kb/908018
Delay Validation Property
This is a very useful property available on most SSIS elements; it delays validation of an element until its execution begins. You can set 'DelayValidation' to True on package elements whose configuration is not valid at design time, to prevent validation errors.

For example, you may have a Data Flow task that uses a destination table that does not exist until an 'Execute SQL Task' creates it at run time. The 'DelayValidation' property can be enabled at the package level, or at the level of the individual tasks and containers that the package includes.
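A minimal sketch of such an upstream 'Execute SQL Task' statement, assuming a hypothetical dbo.stg_daily_sales staging table:

```sql
-- Until this runs, the destination table does not exist, so the Data Flow Task
-- that loads it needs DelayValidation = True to pass package validation.
IF OBJECT_ID(N'dbo.stg_daily_sales', N'U') IS NULL
    CREATE TABLE dbo.stg_daily_sales
    (
        sales_order_id int            NOT NULL,
        order_date     date           NOT NULL,
        amount         decimal(12, 2) NOT NULL
    );
```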
OLEDB Destination VS SQL Server Destination
SSIS provides two data flow destination components for SQL Server (both of which use OLE DB connection managers): the OLEDB Destination and the SQL Server Destination. The former can connect to remote databases, while the latter connects only to the local database.

For SQL Server destination tables, there is a considerable performance improvement when the SQL Server Destination is used. However, it has the limitation that the destination database must be located on the same machine where Integration Services is installed; this may not always be true across all environments (Dev, Test and Production), and hence the 'OLEDB Destination' is more widely used.
Summary
In this article, we looked at best practices for components and tasks, security settings, and efficient memory usage. Most of these settings and properties must be configured cautiously to get the best out of them. If you have noticed, all of these tips and tweaks revolve around one crucial point, and that is data transfer performance. Applied together, they help you get the best performance the tool can offer.
Further Reading
http://bi-polar23.blogspot.com/2007/11/ssis-best-practices-part-1.html
http://consultingblogs.emc.com/jamiethomson/archive/2006/01/05/2554.aspx
http://blogs.msdn.com/ashvinis/archive/2005/09/27/474563.aspx
http://www.mssqltips.com/tip.asp?tip=1840
http://sqlcat.com/top10lists/archive/2008/10/01/top-10-sql-server-integration-services-best-practices.aspx