Tag Archives: SSIS

SSIS, Tempdb Database, and SQL Server Log Files

The tempdb system database is a global resource that is available to all users connected to the instance of SQL Server. Tempdb is re-created every time SQL Server is started, which means the system always starts with a clean copy of the database and there is never anything in tempdb that is saved from one session of SQL Server to another.

Tempdb usage can be roughly divided into three categories:

User objects: Any user with permission to connect to an instance of SQL Server can create temporary tables and table variables. Explicitly created temporary user objects, such as global or local temporary tables, temporary stored procedures, table variables, and cursors, are stored in the tempdb database.

Internal objects: Tempdb holds the intermediate data when SQL Server performs tasks such as:

  • DBCC CHECKDB or DBCC CHECKTABLE;
  • Common Table Expressions (CTE);
  • ORDER BY, GROUP BY, UNION Queries;
  • Index rebuild or creation.

Version store: Version Store is a collection of data pages that hold the data rows used for row versioning. Triggers, MARS (Multiple Active Result Sets), and online index operations are examples of SQL Server features that use version store.

In this post, I'm going to talk about how tempdb is affected when SSIS is used to load data into target tables or update previously loaded data (specifically in fact tables), and provide a workaround to keep the database from growing too large. The workaround has much the same effect on SQL Server logs, resulting in smaller log files and saved space.

Populating Fact Tables

A fact table in a data warehouse is related, directly or indirectly (through mapping tables), to dimension tables. Whether the related table is a dimension or a mapping table, the fact table must contain a foreign key column referring to the primary key of the linked table. This means there are 2 sets of columns in a fact table: measure columns, which hold the numeric values to be analysed, and foreign keys to dimension tables, which are used to slice and dice the measure values.

The set of dimension and/or mapping foreign keys in a fact table defines its granularity: the lowest level to which data should be aggregated before being loaded into the DW. In other words, data should be aggregated at the right granular level, which is defined by the fact table's foreign key columns.

A good way of aggregating data at the required granular level is using a GROUP BY clause in the query that is used to pump the data into DW’s fact table. For example, if the structure of the fact table to be populated is like the one depicted below:

[Figure: a FactSales table with ProductKey and DateKey foreign keys and a SalesAmount measure]

Then the following query is a good way to populate the table:

SELECT     'ProductKey'  = p.[ProductKey]
         , 'DateKey'     = d.[DateKey]
         , 'SalesAmount' = SUM(s.Sales_Value)
FROM       [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
GROUP BY   p.[ProductKey]
         , d.[DateKey]

What should be considered at this stage is that populating a fact table, especially for the first time, moves a huge number of records from the OLTP database to the DW database, and the query that does so has a GROUP BY clause that engages tempdb while preparing the data. The ETL should therefore be designed so that it doesn't blow up tempdb: instead of loading everything in a single data transfer, it must break the data into smaller chunks and load them one at a time.

The best candidate for breaking down data is the date-related factor at which the data is aggregated. In our example the data is aggregated at the date level (it could be at any other level like Week, Month, or Year.)

Design and implementation of ETL using SSIS:

As explained before, the ETL for populating a fact table should be designed in such a way that it loads data into the DW period by period (day, week, month, or year) to prevent tempdb and the log files from growing too large. This needs our SSIS package to do the following steps:

A- Get the list of distinct periods for the whole data that is going to be loaded,

B- Save the periods extracted in previous step into an SSIS variable,

C- Run the process that cleanses and inserts data into the DW once for each period.

Let’s see how each step could be implemented using SSIS:

1- Open BIDS, create a new SSIS project;

2- Add a new Execute SQL Task to your SSIS project. This task is responsible for reading the data to be loaded and building a list of distinct periods;

3- Create a variable of type Object at the scope of Package. This variable is going to hold the list of periods and will be used by different tasks at different steps of loading data. I named it Periods;

4- Add another variable of type Integer, Int32, to your package again at the scope of Package. This variable will be used to hold each period of time read from Periods variable, for which the package is going to load data. In my example this variable is called CalendarDate;

5- The next step is to extract the periods for which the data should be loaded into the fact table. This is done by writing a query that extracts the distinct list of time periods from the OLTP database, and configuring the previously added Execute SQL Task to run the query and save the result in the Periods variable:

a- Add the SQL script to your Execute SQL Task (a sketch of such a query is shown after these steps);

b- Change the ResultSet property of the Execute SQL Task to Full Result Set;

c- Click on the Result Set tab, and add a new result set that stores the result of your query in the Periods variable:
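Below is a sketch of such a query, assuming the fact table's grain is the calendar date and reusing the hypothetical table names from the earlier example:

SELECT DISTINCT d.[DateKey]
FROM       [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
ORDER BY   d.[DateKey]

Each DateKey returned will become one iteration of the loop built in the next step.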

6- To get the package to transfer data once for a period at a time, we need a Foreach Loop Container. The Foreach container is going to loop through what we have already loaded into Periods SSIS variable, and run the process for transferring the data once for each period:

a- Add a Foreach Loop container to your project and open its editor by double-clicking on it;

b- Go to the Collection tab, and select “Foreach ADO Enumerator” from the Enumerator drop-down and “User::Periods” from ADO object source variable:

c- Click on the Variable Mapping tab of the Foreach Loop Editor and add the CalendarDate variable. This variable is going to hold the period for which the data is going to be transferred from OLTP to DW at each iteration:

7- Add the DFT and the sequence container (optional) to your project to do the transfer using the CalendarDate variable. You should be able to find loads of resources on the internet on how to use a variable in an OLE DB source of a DFT, in case you don't know how to do that.
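For reference, here is a sketch of what the parameterized OLE DB source query could look like, with the ? placeholder mapped to User::CalendarDate in the Parameters dialog (table and column names are the hypothetical ones used earlier):

SELECT     'ProductKey'  = p.[ProductKey]
         , 'DateKey'     = d.[DateKey]
         , 'SalesAmount' = SUM(s.Sales_Value)
FROM       [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
WHERE      d.[DateKey] = ?
GROUP BY   p.[ProductKey]
         , d.[DateKey]

Because of the WHERE clause, each execution only groups one period's worth of rows, which keeps the pressure on tempdb and the log files low.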


Improve your SSIS package’s performance

Hello everyone.

I spent almost the whole of last week and the first 2 days of this week trying to improve my BI solutions' performance. In my quest to learn the tricks that make my packages faster, I came across the SSIS Performance Design Patterns video by Matt Masson. A comprehensive discussion indeed, which I'm gonna summarize in brief here:
1- Utilize parallelism: It is easy to utilize parallelism in SSIS. All you need to do is recognize which Data Flow Tasks (DFTs) could be started at the same time and set the control flow constraints of your package so that they can all run simultaneously.

2- Synchronous vs. Asynchronous components: A synchronous transformation in SSIS takes a buffer, processes it, and passes the result through without waiting for the next buffer to come in. On the other hand, an asynchronous transformation needs to process all its input data before it can produce any output. This can cause serious performance issues when the input to the asynchronous transformation is too big to fit into memory and has to be spilled to disk at multiple stages.

3- Execution tree: An execution tree starts where a buffer starts and ends where the same buffer ends. These execution trees specify how buffers and threads are allocated in the package. Each tree creates a new buffer and may execute on a different thread.  When a new buffer is created such as when a partially blocking or blocking transformation is added to the pipeline, additional memory is required to handle the data transformation; however, it is important to note that each new tree may also give you an additional worker thread.

4- OLE DB Command transformation: OLE DB Command is a row-by-row transformation, meaning that it runs its command once for each of its input rows. This makes it painfully slow as the number of rows goes up. The fix is to stage the data into a temporary table inside the DFT and run a set-based command in an Execute SQL Task outside that DFT.
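As a sketch of this pattern (all table and column names here are hypothetical): the DFT loads the rows into a staging table, and an Execute SQL Task after it applies one set-based statement instead of thousands of per-row commands:

UPDATE f
SET    f.SalesAmount = s.SalesAmount
FROM   dbo.FactSales AS f
INNER JOIN dbo.StagingSales AS s ON s.ProductKey = f.ProductKey
                                AND s.DateKey    = f.DateKey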

5- SQL Server Destination vs. OLE DB Destination: There are multiple reasons to use OLE DB Destination rather than SQL Server Destination:

  • OLE DB Destination is mostly faster,
  • OLE DB Destination is a lot clearer when it fails (The error message is more helpful),
  • SQL Server Destination works only when SSIS is installed on the destination server.

6- Change Data Capture (CDC): Try to reduce the amount of data to be transferred as much as you can, and do it as close to the source as you can. A Modified On column on the source table(s) helps a lot in this case.
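For example, a source query along these lines (the ModifiedOn column and the parameter are hypothetical) transfers only the rows changed since the last successful load:

SELECT Sales_Id, Product_Key, Sales_DateTime, Sales_Value
FROM   dbo.Sales
WHERE  ModifiedOn > ?   -- timestamp of the last successful load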

7- Slowly Changing Dimension (SCD) transformation: There is only one piece of advice about SSIS's Slowly Changing Dimension transformation, and that is: get rid of it! The reasons are:

  • It doesn’t use any cached data, and goes to the data source every single time it is called,
  • It uses many OLE DB Command transformations,
  • Fast Data Load is off by default on its OLE DB Destination.

I recommend watching the whole video if you have enough time; all these topics are discussed there in more detail.
Cheers!


Splitting records into multiple rows using SSIS Script Component

Hi folks.
This is a post in response to the Add Interim Rows Depending on Max and Min Values question asked in the MSDN SSIS Forum, but it can be applied to any similar situation. Here I'm going to describe the steps needed to split the records coming from a database table into multiple records and insert them into a destination table.

Let’s say we have a table with the following data in it:

Key  min  max
A     0    3
B     2    5

and we need to get:

Key  Value
A      0
A      1
A      2
A      3
B      2
B      3
B      4
B      5

out of it. This situation could be easily addressed using SSIS Script Component. The steps to achieve that are as follows:

1- Open BIDS, create a new Integration Services project, and open Package.dtsx file;

2- Add a new Data Flow Task to the package, and double-click on it to navigate to Data Flow tab;

3- Add a Script Component to your data flow. When the “Select Script Component Type” dialog box opens, select Source and hit OK;

4- Open the Script Component by double-clicking on it and navigate to the “Inputs and Outputs” tab. Add 2 output columns by selecting the “Output Columns” folder and hitting the Add Column button: name them oName (string) and oValue (four-byte signed integer), as these are the names the script below writes to;

5- Add a new Connection Manager to the “Connection Managers” tab to point to your instance of SQL Server DB:

6- Now navigate to the Script tab, and hit Edit Script. This will get you to Visual Studio Tools for Applications (VSTA), where you can edit your script’s code. Replace the code with the following snippet:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Data.SqlClient;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private SqlConnection cnn;

    public override void AcquireConnections(object Transaction)
    {
        // Build an ADO.NET connection string from the OLE DB connection
        // manager: keep the part from "Data Source" up to the "Provider"
        // clause, which SqlConnection does not accept (this assumes the
        // connection string starts with Data Source and Provider follows it).
        IDTSConnectionManager100 cnnMgr = this.Connections.Connection;
        string cnnstr = cnnMgr.ConnectionString.Substring(
            cnnMgr.ConnectionString.IndexOf("Data Source"),
            cnnMgr.ConnectionString.IndexOf("Provider"));
        cnnstr += "Integrated Security=SSPI;";
        cnn = new SqlConnection(cnnstr);
    }

    public override void PreExecute()
    {
        base.PreExecute();
        cnn.Open();
    }

    public override void PostExecute()
    {
        base.PostExecute();
        cnn.Close();
    }

    public override void CreateNewOutputRows()
    {
        // ss is the source table holding the (Key, min, max) records
        SqlCommand cmd = new SqlCommand("SELECT * FROM ss", cnn);
        using (SqlDataReader reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                // Emit one output row per value between min and max (inclusive)
                for (int i = (int)reader[1]; i <= (int)reader[2]; i++)
                {
                    Output0Buffer.AddRow();
                    Output0Buffer.oName = (string)reader[0];
                    Output0Buffer.oValue = i;
                }
            }
        }
    }
}

7- Add a new OLE DB Destination to the Data Flow tab. Take the output (green arrow) of the Script Component and connect it to the OLE DB Destination just added. Configure the OLE DB Destination to insert data into the destination table:

Run the package. You'll get the split records in your destination table.

SSIS SQL Server Deployment, File System Deployment, and SSIS Package Store

SSIS packages can be stored in the file system as well as in SQL Server. Let's have a look at each of these options for storing our SSIS packages.

1- SQL Server:

When a package is deployed to SQL Server, it is actually stored in the MSDB database. The easiest way to deploy an SSIS package to SQL Server is by using the Package Deployment Utility:

  • Open your SSIS project in BIDS (Business Intelligence Development Studio),
  • Right-click on the project entry in solution explorer and select properties,
  • Navigate to the Deployment tab, and set CreateDeploymentUtility to true,
  • Build your project.

After completing these steps, go to the bin -> Deployment folder of your project's folder and double-click on the .SSISDeploymentManifest file. This will launch the Package Installation Wizard, which will take you through the necessary steps to deploy your package to SQL Server:

  • In the Deploy SSIS Package page of the wizard, select SQL Server deployment and click Next:


  • The Specify Target SQL Server page is where you specify the instance of SQL Server to which you want to deploy your package. Enter the instance name in the Server name drop-down list, select the appropriate Authentication mode, and set the Package Path to the folder in which the package will be deployed:


Note that the list of folders you'll get for the Package Path is exactly the same as the folders you'll see when you connect to SSIS using SSMS. If you have a new project in hand and need to create a new folder for its packages, connect to SSIS through SSMS, right-click on the MSDB folder in Object Explorer, and select New Folder.

  • The next page is Select Installation Folder. You can set where the package and its dependencies are to be stored on your hard drive by clicking the Browse button:


  • The last step is to click Next on the Confirm Installation page:


Apart from the Package Deployment Utility, developers can use the DTUtil command-line utility for moving a package, which I believe is more complex and time-consuming than this option.
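For reference, a typical DTUtil call for deploying a package to SQL Server looks roughly like this (paths and names here are hypothetical):

dtutil /FILE "C:\Packages\MyPackage.dtsx" /COPY SQL;"MyFolder\MyPackage" /DESTSERVER localhost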

Now that you have completed these steps, your package is stored in the MSDB database, in the dbo.sysssispackages table.

2- File System:

When a package is stored in the file system, the package itself and its dependencies are stored in a folder on the hard drive. A package can be deployed to the file system simply by copying and pasting the .dtsx file to the target folder; this is the easiest way. The Package Deployment Utility and DTUtil can be used as well, just like with SQL Server deployment.

If you connect to SSIS through SSMS, you'll see a folder named File System. This folder displays the packages that are deployed to the file system, but not every package on your hard drive will show up here. The folder that SSMS looks into to list the packages is C:\Program Files\Microsoft SQL Server\100\DTS\Packages. You can create new subfolders there and copy your packages into them, for better management.

If you'd like folders other than just File System and MSDB to be viewable in SSMS for your SSIS packages, you should locate the file named “MsDtsSrvr.ini.xml” and configure it. Basically, what you need to do is add a new Folder node to this XML file, as sketched below.
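A new file system folder entry inside the TopLevelFolders element would look something like this (the folder name and path are hypothetical, up to you):

<Folder xsi:type="FileSystemFolder">
    <Name>My Projects</Name>
    <StorePath>..\Packages\MyProjects</StorePath>
</Folder>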

Do not forget to restart the SSIS service after making any of these changes, otherwise you won't see the new folders.

SSIS Package Store:

Some websites and blogs mention a third type of package storage named the SSIS Package Store, which is not correct. The SSIS Package Store is nothing but the combination of SQL Server and file system deployment, as you can see when you connect to SSIS through SSMS: it looks like a store that has organized its contents (packages) into categories based on its manager's taste, the manager being you, the package developer. So don't mistake it for something different from the 2 types of package deployment.


Changing a Connection Manager's Connection String using C# – SSIS 2008

An SSIS package can be run using C#, or basically any CLR language. The piece of code below calls a package, tries to execute it, and prints to the output any error that occurs during the execution of the package.
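A minimal sketch of that code (the package path here is hypothetical):

using System;
using Microsoft.SqlServer.Dts.Runtime;

class Program
{
    static void Main()
    {
        Application app = new Application();

        // Load the package from the file system (hypothetical path)
        Package pkg = app.LoadPackage(@"C:\Packages\LoadDW.dtsx", null);

        // Execute the package and print any errors it collected
        DTSExecResult result = pkg.Execute();
        if (result == DTSExecResult.Failure)
        {
            foreach (DtsError error in pkg.Errors)
                Console.WriteLine(error.Description);
        }
    }
}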

Simple and easy!

Now suppose our package has 2 connection managers: one named “OLTP Source” and another one named “DW Destination”. And let’s suppose that OLTP Source connection manager is created prior to DW Destination, and therefore it is listed first in the Connection Managers pane of BIDS.

The connection managers in a package are accessible in C# code through the Connections collection, using an index into the collection. For example, pkg.Connections[0] points to the first connection manager listed in the Connection Managers pane. The connection string of each connection manager can be set through its ConnectionString property.

The following line of code sets the connection string of the second connection manager in the package (in our example, DW Destination):

pkg.Connections[1].ConnectionString = "Data Source=localhost;Initial Catalog=DW;Integrated Security=SSPI";
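The Connections collection also accepts the connection manager's name, which is more robust than a positional index if connection managers are added or reordered later:

pkg.Connections["DW Destination"].ConnectionString = "Data Source=localhost;Initial Catalog=DW;Integrated Security=SSPI";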

This scenario can be used to update any property of any object in the package when the package is going to be run using a CLR language.


SQL Server 2000 stored procedure and SSIS 2008 OLE DB Source

What is the first thing that comes to your mind as a Microsoft BI expert, when you think of a data migration tool? Undoubtedly SSIS!

As you all know, SSIS is a great tool for migrating data from legacy systems and databases into newer and more robust repositories, to better serve knowledge and data workers.
Yesterday, I was assigned a task: getting the list of customers from SQL Server 2000 into SQL Server 2008, using SSIS 2008. It didn't seem extraordinary at first: all I needed to do was add an OLE DB source to my Data Flow Task, use a lookup to filter out records already in the table, and pass the new rows to the destination table. I then realized that not all the customers need to be migrated, and there is a stored procedure written in SQL Server 2000 that returns the list of the customers that should be migrated. The SP works this way: it gets the ID of a principal as input, and extracts the list of customers for that principal based on some pre-defined business logic.
What I needed to be able to do was pass the principal ID as my package's parameter to the SP, and use the result returned by the SP in my source query's WHERE clause. Passing a parameter to a package and then using it in a T-SQL query is nothing difficult. It gets messy when you need to use that parameter in a nested query: using the IN keyword in the main query and using the SP in the inner query to build a temporary table for the IN keyword.
I have only one word for doing this: Impossible!
I tried different ways of calling an SP in another query's WHERE clause. I even took the SP's code and inlined it into my T-SQL query. But as you may know, SSIS's OLE DB source does not support parameters in nested queries.
So, I tried to design a strategy to get rid of the parameterized nested query.
What I did is, I added an Execute SQL Task to my package, which gets executed before my Data Flow Task. The role of this Execute SQL Task is to get the list of customer IDs from the source database by executing the SP, and store them in a temporary table in the destination SQL Server 2008 using this piece of code:

IF Object_Id('TemporaryTable', 'U') IS NOT NULL
    DROP TABLE TemporaryTable
CREATE TABLE TemporaryTable(CustID int)
INSERT INTO TemporaryTable EXECUTE dbo.Sql2000SP ? --? is the principal ID parameter passed to the SP

Then I connected this Execute SQL Task to my Data Flow Task. This way, the Data Flow Task is not parameterized anymore, because the parameter (the principal ID) has already been used in the previous step to pull the list of customers into the temporary table via the Execute SQL Task. This is the code that gets the data in my OLE DB source:

SELECT Column1, Column2, Column3
FROM OriginalTable
WHERE CustId IN (SELECT CustID FROM TemporaryTable)

Bingo! It works without any problem: no parameter has to be passed to an SP or a nested query inside the OLE DB source, and there is no need to replicate the SP's logic there! 🙂