Category Archives: SSIS

SQL Server 2014 and SSDT (AKA BIDS)

Hey folks. This is going to be a short post; I just wanted to mention something that may come in handy for those who are interested in playing with SQL Server 2014.
I downloaded SQL Server 2014 a couple of weeks ago and started exploring its new features to see what has changed/improved. After going through a few blogs I usually check and running some tests on the DB engine (and being amazed by how efficient the new version is compared to 2008 R2), I wanted to start playing with the tools that I am most interested in: SSIS and SSAS.
What do you guys do when you want to add a new SSIS project? That’s right, you open Visual Studio or SSDT (aka BIDS) and create a new project. But wait a minute, where is the new Visual Studio that is supposed to come with SQL Server 2014? I looked into the folder created in my start menu and couldn’t find anything in there.
After looking into MSDN I realized that, unlike previous versions, SSDT does not come with the 2014 version of SQL Server and has to be downloaded separately. It’s ironic that our friends at Microsoft forgot to mention it, isn’t it?
Anyway, let’s not be fusspots about how they could forget to mention this, and download it ourselves from here.

Hope this can help, cheers.


SSIS, Tempdb Database, and SQL Server Log Files

The tempdb system database is a global resource that is available to all users connected to the instance of SQL Server. Tempdb is re-created every time SQL Server is started, which means the system always starts with a clean copy of the database and there is never anything in tempdb that is saved from one session of SQL Server to another.

Tempdb usage can be roughly separated into three categories:

User objects: Any user with permission to connect to an instance of SQL Server can create temporary tables and table variables. Explicitly created temporary user objects, such as global or local temporary tables, temporary stored procedures, table variables, and cursors, are stored in the tempdb database.
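For illustration, both of the following objects live in tempdb (the table and column names here are made up):

-- Local temporary table: created in tempdb, dropped when the session ends
CREATE TABLE #RecentSales (SaleID int, Amount money);

-- Table variable: also backed by tempdb
DECLARE @RecentSales TABLE (SaleID int, Amount money);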

Internal objects: Tempdb holds intermediate data when SQL Server is doing tasks like:

  • DBCC CHECKDB or DBCC CHECKTABLE;
  • Common Table Expressions (CTEs);
  • Queries using ORDER BY, GROUP BY, or UNION;
  • Index rebuilds or creation.

Version store: Version Store is a collection of data pages that hold the data rows used for row versioning. Triggers, MARS (Multiple Active Result Sets), and online index operations are examples of SQL Server features that use version store.

Here in this post, I’m going to talk about how tempdb is affected when SSIS is used to load data into target tables or update previously loaded data (specifically in fact tables), and provide a workaround for preventing the database from growing too large. The solution provided here has much the same effect on SQL Server log files, resulting in smaller logs and saved space.

Populating Fact Tables

A fact table in a data warehouse has a direct or indirect (through mapping tables) relationship to its dimension tables. Whether the related table is a dimension or a mapping table, there must be a column (foreign key) in the fact table that refers to the primary key of the linked table. This means there are 2 sets of columns in a fact table: measure columns, which hold the numeric values to be analysed, and foreign keys to dimension tables, which are used to slice and dice the measure values.

The set of dimension and/or mapping foreign keys in a fact table defines its granularity: the lowest level to which data should be aggregated before being loaded into the DW. In other words, data should be aggregated at the right granular level, which is defined by the fact table’s foreign key columns.

A good way of aggregating data at the required granular level is using a GROUP BY clause in the query that is used to pump the data into DW’s fact table. For example, if the structure of the fact table to be populated is like the one depicted below:

[Image: fact table structure – ProductKey and DateKey foreign keys plus a SalesAmount measure column]

Then the following query is well suited to populate the table:

SELECT     'ProductKey'  = p.[ProductKey]
          ,'DateKey'     = d.[DateKey]
          ,'SalesAmount' = SUM(s.Sales_Value)
FROM       [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
GROUP BY   p.[ProductKey]
          ,d.[DateKey]

What should be considered at this stage is that populating a fact table, especially for the first time, requires a huge number of records to be transferred from the OLTP database to the DW database, and the query doing so has a GROUP BY clause that engages tempdb while preparing the data. The ETL should therefore be designed so that it doesn’t blow up tempdb: instead of loading the data in a single data transfer process, the ETL for populating fact tables must break it into smaller chunks of data and load one chunk at a time.

The best candidate for breaking down data is the date-related factor at which the data is aggregated. In our example the data is aggregated at the date level (it could be at any other level, like week, month, or year).

Design and implementation of ETL using SSIS:

As explained before, the ETL for populating a fact table should be designed in such a way that it loads data into the DW by period (day, week, month, or year) to prevent tempdb and the log files from growing too large. This needs our SSIS package to do the following steps:

A- Get the list of distinct periods for the whole data that is going to be loaded,

B- Save the periods extracted in previous step into an SSIS variable,

C- Run the process that cleanses and inserts data into the DW once for each period.

Let’s see how each step could be implemented using SSIS:

1- Open BIDS, create a new SSIS project;

2- Add a new Execute SQL Task to your SSIS project. This task is responsible for reading the data to be loaded and building a list of distinct periods;

3- Create a variable of type Object at the scope of Package. This variable is going to hold the list of periods and will be used by different tasks at different steps of loading data. I named it Periods;

4- Add another variable of type Integer, Int32, to your package again at the scope of Package. This variable will be used to hold each period of time read from Periods variable, for which the package is going to load data. In my example this variable is called CalendarDate;

5- The next step is to extract the periods for which the data should be loaded into the fact table. This is done by writing a query that extracts the distinct list of time periods from the OLTP database, and configuring the previously added Execute SQL Task to run the query and save the result in the Periods variable:

a- Add the SQL script to your Execute SQL Task (a sample query is shown after this list);

b- Change the ResultSet property of the Execute SQL Task to Full Result Set;

c- Click on the Result Set tab, and add a new result set that stores the result of your query in the Periods variable.
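As an illustration, assuming the source column is Sales_DateTime (as in the fact query above) and that periods are represented as yyyymmdd integers to match the Int32 CalendarDate variable, the Execute SQL Task could run something like:

-- Distinct list of periods (days), as yyyymmdd integers
SELECT DISTINCT CONVERT(int, CONVERT(char(8), s.Sales_DateTime, 112)) AS CalendarDate
FROM   [OLTP].[dbo].[Sales] s
ORDER  BY CalendarDate;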

6- To get the package to transfer data one period at a time, we need a Foreach Loop Container. The Foreach container is going to loop through what we have already loaded into the Periods SSIS variable, and run the data-transfer process once for each period:

a- Add a Foreach Loop container to your project and open its editor by double-clicking on it;

b- Go to the Collection tab, and select “Foreach ADO Enumerator” from the Enumerator drop-down and “User::Periods” from the ADO object source variable drop-down;

c- Click on the Variable Mapping tab of the Foreach Loop Editor and add the CalendarDate variable. This variable is going to hold the period for which the data is going to be transferred from OLTP to DW at each iteration;

7- Add the DFT and the sequence container (optional) to your project to do the transfer using the CalendarDate variable. You should be able to find loads of resources on the internet about how to use a variable in an OLE DB source of a DFT, in case you don’t know how to do that.
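If you’re not sure what that looks like, here is a sketch: set the OLE DB Source’s data access mode to “SQL command”, map the CalendarDate variable to the ? parameter, and reuse the aggregation query with a period filter (again assuming yyyymmdd integer periods):

SELECT     'ProductKey'  = p.[ProductKey]
          ,'DateKey'     = d.[DateKey]
          ,'SalesAmount' = SUM(s.Sales_Value)
FROM       [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
WHERE      CONVERT(int, CONVERT(char(8), s.Sales_DateTime, 112)) = ?   -- CalendarDate
GROUP BY   p.[ProductKey]
          ,d.[DateKey]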


Improve your SSIS package’s performance

Hello everyone.

I spent almost the whole of last week and the first 2 days of this week trying to improve my BI solutions’ performance. In my quest to learn the tricks that make my packages faster, I came across the SSIS Performance Design Patterns video by Matt Masson. A comprehensive discussion indeed, which I’m gonna list in brief here:
1- Utilize parallelism: It is easy to utilize parallelism in SSIS. All you need to do is to recognize which Data Flow Tasks (DFTs) could be started at the same time and set the control flow constraints of your package in such a way that they can all run simultaneously.

2- Synchronous vs. Asynchronous components: A synchronous transformation in SSIS takes a buffer, processes it, and passes the result through without waiting for the next buffer to come in. On the other hand, an asynchronous transformation needs to process all of its input data before it can produce any output. This can cause serious performance issues when the input to the asynchronous transformation is too big to fit into memory and has to be spilled to disk at multiple stages.

3- Execution tree: An execution tree starts where a buffer starts and ends where the same buffer ends. These execution trees specify how buffers and threads are allocated in the package. Each tree creates a new buffer and may execute on a different thread.  When a new buffer is created such as when a partially blocking or blocking transformation is added to the pipeline, additional memory is required to handle the data transformation; however, it is important to note that each new tree may also give you an additional worker thread.

4- OLE DB Command transformation: OLE DB Command is a row-by-row transformation, meaning that it runs its command once for each of its input rows. This makes it painfully slow when the number of rows goes up. The solution for boosting performance is to stage the data into a temporary table and apply the changes with an Execute SQL Task outside that DFT.
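As a sketch, assuming the rows to apply have been staged into a table named stg.SalesUpdates (a made-up name), the per-row UPDATEs an OLE DB Command would issue can be replaced by one set-based statement in an Execute SQL Task:

-- One set-based UPDATE instead of one UPDATE per input row
UPDATE f
SET    f.SalesAmount = s.SalesAmount
FROM   [DW].[dbo].[FactSales] f
INNER JOIN [DW].[stg].[SalesUpdates] s
        ON f.ProductKey = s.ProductKey
       AND f.DateKey    = s.DateKey;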

5- SQL Server Destination vs. OLE DB Destination: There are multiple reasons to use OLE DB Destination and not SQL Server Destination:

  • OLE DB Destination is mostly faster,
  • OLE DB Destination is a lot clearer when it fails (The error message is more helpful),
  • SQL Server Destination works only when SSIS is installed on the destination server.

6- Change Data Capture (CDC): Try to reduce the amount of data to be transferred as much as you can, and do it as close to the source as you can. A Modified On column on the source table(s) helps a lot in this case.
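For example, a minimal sketch of an incremental extract, assuming a Modified_On column on the source table and a made-up control table that stores the last successful load time:

DECLARE @LastLoad datetime;
SELECT @LastLoad = LastLoadTime
FROM   [DW].[etl].[LoadControl]
WHERE  TableName = 'Sales';

-- Pull only the rows changed since the last load
SELECT *
FROM   [OLTP].[dbo].[Sales]
WHERE  Modified_On > @LastLoad;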

7- Slowly Changing Dimension (SCD) transformation: There is only one piece of advice about SSIS’s Slowly Changing Dimension transformation, and that is: get rid of it! The reasons are:

  • It doesn’t use any cached data, and goes to the data source every single time it is called,
  • It uses many OLE DB Command transformations,
  • Fast Data Load is off by default on its OLE DB Destination.

I recommend you go and watch the whole video if you have enough time. All these topics are discussed in more detail in the video.
Cheers!


Splitting records into multiple rows using SSIS Script Component

Hi folks.
This is a post in response to the Add Interim Rows Depending on Max and Min Values question asked in the MSDN SSIS Forum, but it could be applied to any similar situation and problem. Here I am gonna describe the steps that need to be taken to split the records coming from a database table into multiple records and insert them into a destination table.

Let’s say we have a table with the following data in it:

Key  min  max
A     0    3
B     2    5

and we need to get:

Key  Values
A      0
A      1
A      2
A      3
B      2
B      …
B      5

out of it. This situation could be easily addressed using SSIS Script Component. The steps to achieve that are as follows:

1- Open BIDS, create a new Integration Services project, and open Package.dtsx file;

2- Add a new Data Flow Task to the package, and double-click on it to navigate to Data Flow tab;

3- Add a Script Component to your data flow. When the “Select Script Component Type” dialog box opens, select Source and hit OK;

4- Open the Script Component by double-clicking on it and navigate to the “Inputs and Outputs” tab. Add 2 output columns (named oName, a string, and oValue, an integer, to match the script below) by selecting the “Output Columns” folder and hitting the Add Column button;

5- Add a new Connection Manager to the “Connection Managers” tab to point to your instance of SQL Server DB;

6- Now navigate to the Script tab, and hit Edit Script. This will get you to Visual Studio Tools for Applications (VSTA), where you can edit your script’s code. Replace the code with the following snippet:

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Data.SqlClient;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private SqlConnection cnn;

    public override void AcquireConnections(object Transaction)
    {
        // Build a SqlClient connection string from the OLE DB connection manager,
        // keeping everything from "Data Source" up to the "Provider" attribute
        // (this assumes the connection string contains both, in that order).
        IDTSConnectionManager100 cnnMgr = this.Connections.Connection;
        string oleDbCnnStr = cnnMgr.ConnectionString;
        int start = oleDbCnnStr.IndexOf("Data Source");
        int length = oleDbCnnStr.IndexOf("Provider") - start;
        string cnnstr = oleDbCnnStr.Substring(start, length);
        cnnstr += "Integrated Security=SSPI;";
        cnn = new SqlConnection(cnnstr);
    }

    public override void PreExecute()
    {
        base.PreExecute();
        cnn.Open();
    }

    public override void PostExecute()
    {
        base.PostExecute();
        cnn.Close();
    }

    public override void CreateNewOutputRows()
    {
        // Read each (Key, min, max) row from the source table "ss" and emit
        // one output row per integer value between min and max, inclusive.
        SqlCommand cmd = new SqlCommand("SELECT * FROM ss", cnn);
        SqlDataReader reader = cmd.ExecuteReader();

        while (reader.Read())
        {
            for (int i = (int)reader[1]; i <= (int)reader[2]; i++)
            {
                Output0Buffer.AddRow();
                Output0Buffer.oName = (string)reader[0];
                Output0Buffer.oValue = i;
            }
        }
    }
}

7- Add a new OLE DB Destination to the Data Flow tab. Take the output (green arrow) of the Script Component and connect it to the OLE DB Destination just added. Configure the OLE DB Destination to insert data into the destination table:

Run the package. You’ll get the split records in your destination table.


Populating Data Warehouse Using SSIS

ETL tools are meant to be used for moving data from a source to a destination. They make the life of a data warehouse or BI developer easy by providing easy-to-use interfaces and many functionalities to cleanse data before loading it into the destination. But the situation is not always this easy, especially when moving data from the production database (OLTP) to the data warehouse: you need to reference data in both of the databases to be able to populate your DW with correct data, which generally means substituting dimension ID columns with the “Surrogate Keys” that are pre-loaded into the data warehouse. Let’s have a look at the different choices a developer has for referencing multiple databases in an SSIS package:

1- Using Linked Servers: The first solution that might occur to a developer could be referencing the tables in both databases at the query level, by doing a JOIN between them and stating the databases’ names before each table’s name to differentiate tables from different sources. This requires a linked server to be set up between the 2 referenced databases. You can check this page if you’d like to learn how to create a linked server. Let’s see an example of the queries that can be used when having a linked server. Suppose that the production database instance is named OLTP and the instance of the database that is acting as data warehouse is named DW. Let’s assume there are 2 tables in our OLTP database, with the following relationship:

When pulling data from this database into DW, the dimension tables’ IDs (in this case, SalesPersonID) must be replaced with surrogate keys:

If there is a linked server set up between OLTP and DW servers of the above example, then the query that populates SalesFact table in DW looks like this:

SELECT     'Amount'         = s.Amount
          ,'SalesPersonKey' = dsp.SalesPersonKey
FROM       [OLTP].[dbo].Sales s
INNER JOIN [DW].[dbo].DimSalesPerson dsp ON s.SalesPersonID = dsp.SalesPersonFullAlternateKey

2- SSIS Lookup Transformation: In cases when it is not possible to create a linked server between the two data sources, for reasons like security concerns, SSIS’s Lookup transformation can be used. The Lookup transformation is capable of doing many things for a BI developer, from adding new records to a destination table to replacing columns with ones from other tables. To replace a column from one table with a column from another using the Lookup transformation, all you need to do is tick the column from the referencing table in the Columns tab of your Lookup transformation:

    You can even replace the old column (SalesPersonID) with the new one (SalesPersonKey) by setting the Lookup Operation column:

This mechanism could be used to replace many IDs with the appropriate Surrogate Keys by using a waterfall of lookups, one for each dimension.


SSIS SQL Server Deployment, File System Deployment, and SSIS Package Store

SSIS packages can be stored in the file system, as well as in SQL Server. Let’s have a look at each one of these options for storing our SSIS packages.

1- SQL Server:

When a package is deployed to SQL Server, it is actually stored in the MSDB database. The easiest way to deploy an SSIS package to SQL Server is by using the Package Deployment Utility:

  • Open your SSIS project in BIDS (Business Intelligence Development Studio),
  • Right-click on the project entry in solution explorer and select properties,
  • Navigate to the deployment tab, and set CreateDeploymentUtility to true,
  • Build your project.

After completing these steps, go to the bin -> Deployment folder of your project’s folder and double-click on the .SSISDeploymentManifest file. This will launch the Package Installation Wizard, which will take you through the necessary steps to deploy your package to SQL Server:

  • In the Deploy SSIS Package page of the wizard, select SQL Server deployment and click Next;


  • The Specify Target SQL Server page is where you must specify the instance of SQL Server to which you want to deploy your package. Enter the instance name in the Server name drop-down list, select the appropriate Authentication mode, and set the Path for the folder in which the package will be deployed;


Note that the list of folders you’ll get for the Package Path is exactly the same as the folders you’ll see when you connect to SSIS using SSMS. If you have a new project in hand and you need to create a new folder for this project’s packages, you should connect to SSIS through SSMS, right-click on the MSDB folder in Object Explorer, and select New Folder.

  • The next page is Select Installation Folder. You can set where the package and its dependencies are to be stored on your hard drive by clicking the Browse button;


  • The last step is to click Next in the Confirm Installation page.


Apart from the Package Deployment Utility, developers can use the DTUtil command-line utility for moving a package, which I believe is more complex and time-consuming than this option.
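For reference, a minimal DTUtil example (the path, server, and package names here are placeholders) that copies a package from the file system into MSDB looks like this:

dtutil /FILE "C:\Packages\MyPackage.dtsx" /DestServer localhost /COPY SQL;MyPackage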

Now that you have completed these steps, your package is stored in the MSDB database, in the dbo.sysssispackages table.

2- File System:

When a package is stored in the file system, the package itself and its dependencies are stored in a folder on the hard drive. A package can be deployed to the file system easily, by copying and pasting the .dtsx file to the target folder. This is the easiest way. The Package Deployment Utility and DTUtil can be used as well, just like with SQL Server deployment.

If you connect to SSIS through SSMS, you’ll see a folder named File System. This folder displays the packages that are deployed to the file system. But not all the packages available on your hard drive will be displayed here: the folder that SSMS looks into to list the packages is C:\Program Files\Microsoft SQL Server\100\DTS\Packages. You can add new folders there and then copy your packages into them, for better management.

If you’d like folders other than just File System and MSDB to be viewable in SSMS for your SSIS packages, you should locate the file named “MsDtsSrvr.ini.xml” and configure it. Basically, what you need to do is add a new Folder node to this XML file.
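For example, a Folder node like the following (the name and path are placeholders), added inside the TopLevelFolders element of MsDtsSrvr.ini.xml, makes a new file-system folder show up in SSMS:

<Folder xsi:type="FileSystemFolder">
  <Name>My ETL Packages</Name>
  <StorePath>D:\SSIS\Packages</StorePath>
</Folder>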

Do not forget to restart your SSIS service after making any of these changes, otherwise you won’t see any result.

SSIS Package Store:

Some websites and blogs mention a third type of package storage named SSIS Package Store, which is not correct. The SSIS Package Store is nothing but the combination of SQL Server and File System deployment, as you can see when you connect to SSIS through SSMS: it looks like a store that has categorized its contents (packages) into different categories based on its manager’s (which is you, as the package developer) taste. So, don’t mistake it for something different from the 2 types of package deployment.


Changing a Connection Manager’s Connection String using C# – SSIS 2008

An SSIS package can be run using C#, or basically any CLR language: the code loads a package, tries to execute it, and prints to the output any error or exception that occurs during execution of the package.
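A minimal sketch of such code, assuming a reference to the SSIS 2008 managed runtime (Microsoft.SQLServer.ManagedDTS.dll) and a placeholder package path:

using System;
using Microsoft.SqlServer.Dts.Runtime;

class PackageRunner
{
    static void Main()
    {
        Application app = new Application();

        // Load the package from the file system (the path is a placeholder)
        Package pkg = app.LoadPackage(@"C:\Packages\MyPackage.dtsx", null);

        // Execute the package and report the outcome
        DTSExecResult result = pkg.Execute();
        Console.WriteLine("Execution result: " + result);

        // Print any errors collected during execution
        foreach (DtsError err in pkg.Errors)
            Console.WriteLine(err.Description);
    }
}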

Simple and easy!

Now suppose our package has 2 connection managers: one named “OLTP Source” and another one named “DW Destination”. And let’s suppose that the OLTP Source connection manager was created prior to DW Destination, and is therefore listed first in the Connection Managers pane of BIDS.

The connection managers in a package are accessible in C# code through the Connections collection, using an index to enumerate the collection. For example, pkg.Connections[0] points to the first connection manager listed in the Connection Managers pane. And the connection string of each connection manager can be set by editing the connection manager’s ConnectionString property.

The following line of code sets the connection string of the second connection manager in the package (in our example, DW Destination):

pkg.Connections[1].ConnectionString = "Data Source=localhost;Initial Catalog=DW;Integrated Security=SSPI";

This scenario can be used to update any property of any object in the package when the package is going to be run using a CLR language.