An alternative to an index with INCLUDE columns: a (UNIQUE) CLUSTERED index

Indexes are good and helpful for reading from tables: you can add indexes tailored to whatever queries are going to be issued against a table and make sure those queries always hit an index, not the table. That strategy has a drawback, though: it slows down inserts, because every time a new record is added to the table, all the indexes on the table need to be updated as well. That was the reason I was looking at the list of indexes on one of my busiest summary tables, to see if I could get rid of any of them and make my inserts faster (this table is updated once a day, and the load takes about 8 minutes for 1.5 million records).
I realized I have 3 separate indexes on different subsets of the columns that make up my UNIQUE index, each one with different INCLUDE columns. In case you don't know what INCLUDE columns do: they are saved alongside the key columns in the index so that they can be read by accessing the index alone, rather than having to go back to the table to get them. This can easily mean doubling (or tripling) the space the table takes on disk.
Now, let's see what a UNIQUE CLUSTERED index does. A clustered index dictates the order in which the records are stored on disk. For example, if our table has 2 columns, ID and Name, and there is a clustered index on the ID column:

CREATE TABLE [dbo].[Table_1](
[ID] [int] NOT NULL,
[Name] [varchar](50) NOT NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [Ind1] ON [dbo].[Table_1]
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

Now, let’s add 3 records to the table:
INSERT INTO [dbo].[Table_1]
([ID]
,[Name])
VALUES
(2
,'S')
GO

INSERT INTO [dbo].[Table_1]
([ID]
,[Name])
VALUES
(3
,'M')
GO

INSERT INTO [dbo].[Table_1]
([ID]
,[Name])
VALUES
(1
,'Z')
GO

If you query the table now, you'll see that the result comes back sorted by the ID column, without any ORDER BY clause in the query. This means the data inserted into the table is stored on disk sorted by the clustered index column, not in the order in which it was inserted. In other words, if we query the table and filter by ID, the query engine can locate the record satisfying the WHERE clause without having to search through the whole table. And guess what: this has exactly the same effect as having a non-clustered index on ID with Name as an INCLUDE column.
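To make the comparison concrete, here is the non-clustered index with an INCLUDE column that would otherwise be needed, plus a lookup that either index can satisfy without touching the table (a sketch based on the sample table above; the index name is made up):

-- The non-clustered alternative: key on ID, with Name carried as an INCLUDE column
CREATE UNIQUE NONCLUSTERED INDEX [Ind1_NonClustered] ON [dbo].[Table_1]
(
[ID] ASC
)
INCLUDE ([Name])
GO

-- Either index covers this query, so it is answered from the index alone
SELECT [ID], [Name]
FROM [dbo].[Table_1]
WHERE [ID] = 2
GO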

This works in any case where you would otherwise create an index with INCLUDE columns. It works best when the columns making up the clustered index form a unique key on the table, simply because a UNIQUE CLUSTERED INDEX not only tells the query engine how to locate the records, it also guarantees that every key value occurs once and only once.


SQL Server 2014 and SSDT (AKA BIDS)

Hey folks. This is going to be a short post; I just wanted to mention something that may come in handy for those who are interested in playing with SQL Server 2014.
I downloaded SQL Server 2014 a couple of weeks ago and started exploring its new features to see what has changed or improved. After going through the few blogs I usually check and running some tests on the DB engine (and being amazed by how efficient the new version is compared to 2008 R2), I wanted to start playing with the tools I am most interested in: SSIS and SSAS.
What do you do when you want to add a new SSIS project? That's right, you open Visual Studio or SSDT (aka BIDS) and create a new project. But wait a minute, where is the new Visual Studio that is supposed to come with SQL Server 2014? I looked into the folder created in my Start menu and couldn't find anything in there.
After looking into MSDN I realized that, unlike previous versions, SSDT does not come with the 2014 version of SQL Server and has to be downloaded separately. It's rather ironic that our friends at Microsoft forgot to mention it, isn't it?
Anyway, let's not be fusspots about how they could forget to mention this, and just download it ourselves from here.

Hope this can help, cheers.


Drop failed for DatabaseRole: The database principal owns a schema in the database and cannot be dropped

This error is raised when there is a schema owned by the role you are trying to drop. The most straightforward and quick fix is to revert the schema's ownership to the appropriate principal in the database, so that the role you are dropping no longer owns any schema.

For example, if you have assigned the db_datareader schema to be owned by the database role you created, you'll notice that you can't simply untick the ownership from your role. To work around this, open the db_datareader database role of the database and make it the owner of the db_datareader schema (tick the box under "Schemas owned by this role" for db_datareader). After doing so, you'll see that the schema in question is no longer owned by your role, and you'll be able to drop the database role without any problem.
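The same change can be made in T-SQL with ALTER AUTHORIZATION; for example, to hand the db_datareader schema back to the db_datareader role before dropping your custom role (the role name below is just a placeholder):

-- Give ownership of the db_datareader schema back to the db_datareader role
ALTER AUTHORIZATION ON SCHEMA::[db_datareader] TO [db_datareader];
GO

-- The custom role no longer owns any schema, so it can now be dropped
DROP ROLE [MyCustomRole];   -- hypothetical role name
GO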


SSIS, Tempdb Database, and SQL Server Log Files

The tempdb system database is a global resource that is available to all users connected to the instance of SQL Server. Tempdb is re-created every time SQL Server is started, which means the system always starts with a clean copy of the database and there is never anything in tempdb that is saved from one session of SQL Server to another.

Tempdb usage can be roughly separated into 3 categories:

User objects: Any user with permission to connect to an instance of SQL Server can create temporary tables and table variables. Explicitly created temporary user objects, such as global or local temporary tables, temporary stored procedures, table variables, and cursors, are stored in the tempdb database.

Internal objects: Tempdb holds the intermediate data while SQL Server is doing tasks like:

  • DBCC CHECKDB or DBCC CHECKTABLE;
  • Common Table Expressions (CTE);
  • ORDER BY, GROUP BY, UNION Queries;
  • Index rebuild or creation.

Version store: Version Store is a collection of data pages that hold the data rows used for row versioning. Triggers, MARS (Multiple Active Result Sets), and online index operations are examples of SQL Server features that use version store.

In this post, I'm going to talk about how tempdb is affected when SSIS is used to load data into target tables or to update previously loaded data (specifically in fact tables), and provide a workaround to prevent the database from growing too large. The solution described here has pretty much the same effect on the SQL Server log, resulting in smaller log files and saved space.

Populating Fact Tables

A fact table in a data warehouse is related, directly or indirectly (through mapping tables), to its dimension tables. Whether the table a fact table is related to is a dimension or a mapping table, there must be a column (foreign key) in the fact table that refers to the primary key of the linked table. This means there are 2 sets of columns in a fact table: measure columns, which hold the numeric values to be analysed, and foreign keys to dimension tables, which are used to slice and dice the measure values.

The set of dimension and/or mapping foreign keys in a fact table defines its granularity: the lowest level to which data should be aggregated before being loaded into the DW. In other words, data should be aggregated at the right granular level, which is defined by the fact table's foreign key columns.

A good way of aggregating data at the required granular level is to use a GROUP BY clause in the query that pumps the data into the DW's fact table. For example, if the structure of the fact table to be populated is like the one depicted below:

[Figure: a fact table with ProductKey and DateKey foreign key columns and a SalesAmount measure column]
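In DDL terms, the table in the diagram would look roughly like this (a sketch only; the table name and data types are assumptions, since the original diagram is not reproduced here):

CREATE TABLE [dbo].[FactSales](
[ProductKey] [int] NOT NULL,            -- foreign key to DimProduct
[DateKey] [int] NOT NULL,               -- foreign key to DimDate
[SalesAmount] [decimal](18, 2) NOT NULL -- measure
) ON [PRIMARY]
GO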

Then a query like the following is a good fit for populating the table:

SELECT      p.[ProductKey]
           ,d.[DateKey]
           ,SUM(s.Sales_Value) AS SalesAmount
FROM        [OLTP].[dbo].[Sales] s
INNER JOIN  [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN  [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
GROUP BY    p.[ProductKey]
           ,d.[DateKey]

What should be considered at this stage is that populating a fact table, especially for the first time, means transferring a huge number of records from the OLTP database to the DW database, and the query doing it has a GROUP BY clause that engages tempdb while preparing the data. The ETL should therefore be designed so that it doesn't blow up tempdb: instead of loading everything in a single data transfer, it must break the data into smaller chunks and load them one at a time.

The best candidate for breaking down the data is the date-related factor at which the data is aggregated. In our example the data is aggregated at the date level (it could be at any other level, like week, month, or year).

Design and implementation of ETL using SSIS:

As explained before, the ETL for populating the fact table should be designed so that it loads data into the DW by period (day, week, month, or year) to prevent tempdb and the log files from growing too large. This requires our SSIS package to do the following steps:

A- Get the list of distinct periods for the whole data set that is going to be loaded,

B- Save the periods extracted in previous step into an SSIS variable,

C- Run the process that cleanses and inserts data into the DW once for each period.

Let’s see how each step could be implemented using SSIS:

1- Open BIDS, create a new SSIS project;

2- Add a new Execute SQL Task to your SSIS project. This task is responsible for reading the data to be loaded and building a list of distinct periods;

3- Create a variable of type Object at the scope of Package. This variable is going to hold the list of periods and will be used by different tasks at different steps of loading data. I named it Periods;

4- Add another variable of type Integer (Int32) to your package, again at the scope of Package. This variable will hold each period read from the Periods variable, for which the package is going to load data. In my example this variable is called CalendarDate;

5- The next step is to extract the periods for which the data should be loaded into the fact table. This is done by writing a query that extracts the distinct list of time periods from the OLTP database, and configuring the previously added Execute SQL Task to run the query and save the result in the Periods variable:

a- Add the SQL script to your Execute SQL Task (a sample query is sketched after this list):

b- Change the ResultSet property of the Execute SQL Task to Full Result Set;

c- Click on the Result Set tab, and add a new result set to store the result of running your query in the Periods variable:
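For example, if the periods are the date keys used in the earlier fact-load query, the script for the Execute SQL Task could look like this (a sketch only; the actual table and column names depend on your schema):

-- Distinct list of date keys present in the source data, ordered so that
-- the fact table is loaded period by period (object names follow the earlier example)
SELECT DISTINCT d.[DateKey]
FROM [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
ORDER BY d.[DateKey]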

6- To get the package to transfer data for one period at a time, we need a Foreach Loop Container. The Foreach container is going to loop through what we have already loaded into the Periods SSIS variable, and run the data transfer process once for each period:

a- Add a Foreach Loop container to your project and open its editor by double-clicking on it;

b- Go to the Collection tab, and select "Foreach ADO Enumerator" from the Enumerator drop-down and "User::Periods" from ADO object source variable:

c- Click on the Variable Mappings tab of the Foreach Loop Editor and add the CalendarDate variable. This variable is going to hold the period for which the data is transferred from OLTP to the DW on each iteration:

7- Add the DFT and the sequence container (optional) to your project to do the transfer using the CalendarDate variable. You should be able to find loads of resources on the internet about how to use a variable in the OLE DB source of a DFT, in case you don't know how to do that.
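The idea in that OLE DB source is simply to filter the fact-load query by the current period. A sketch follows (the ? placeholder is mapped to the User::CalendarDate variable; object names are the ones assumed in the earlier example):

-- OLE DB source query for a single period; ? is mapped to User::CalendarDate
SELECT      p.[ProductKey]
           ,d.[DateKey]
           ,SUM(s.Sales_Value) AS SalesAmount
FROM        [OLTP].[dbo].[Sales] s
INNER JOIN  [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN  [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
WHERE       d.[DateKey] = ?
GROUP BY    p.[ProductKey]
           ,d.[DateKey]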


Get tables and their data without backing up the database: SQL Server 2008 R2

There may be situations when you want to get all of the tables in a SQL Server database, along with the data they currently hold, but you can't back up the whole database (because of the permissions assigned to your account, for example) or don't want to. Follow these steps to achieve this:

1- Open SSMS and connect to the correct instance of SQL Server;

2- Right-click on the appropriate database entry under Databases folder, go to Tasks, and then select “Generate Scripts…” . Generate and Publish Scripts wizard launches;

3- Click Next on the Introduction page;

4- In the next page, Choose Objects, pick "Select specific database objects" and then check the checkbox beside the Tables entry. You can even select the specific tables you want scripts for by clicking on the + sign beside the Tables entry. Click Next;

5- Here's the tricky part: you can get either just the tables' schema, or the schema together with the data, for the tables selected in the previous step. To do this, click on the Advanced button. Then, in the Advanced Scripting Options page, find the "Types of data to script" property and change it to "Schema and data" (it is set to "Schema only" by default). Click OK to go back to the Set Scripting Options page;

6- Select "Save to new query window" and hit Next twice. When all the listed actions are completed, click Finish.

After completing these steps, you’ll get a new query window opened in SSMS which has the script to create all the tables and insert the records they already contain in the original database.


Improve your SSIS package’s performance

Hello everyone.

I spent almost the whole of last week and the first 2 days of this week trying to improve my BI solutions' performance. In my quest to learn the tricks for making my packages faster, I came across the SSIS Performance Design Patterns video by Matt Masson. A comprehensive discussion indeed, which I'm going to summarize briefly here:
1- Utilize parallelism: It is easy to utilize parallelism in SSIS. All you need to do is recognize which Data Flow Tasks (DFTs) can be started at the same time and set your package's control flow constraints so that they can all run simultaneously.

2- Synchronous vs. asynchronous components: A synchronous transformation in SSIS takes a buffer, processes it, and passes the result through without waiting for the next buffer to come in. An asynchronous transformation, on the other hand, needs to process all of its input data before it can produce any output. This can cause serious performance issues when the input to the asynchronous transformation is too big to fit into memory and has to be spilled to disk in multiple stages.

3- Execution trees: An execution tree starts where a buffer starts and ends where that same buffer ends. Execution trees specify how buffers and threads are allocated in the package. Each tree creates a new buffer and may execute on a different thread. When a new buffer is created, such as when a partially blocking or blocking transformation is added to the pipeline, additional memory is required to handle the data transformation; however, each new tree may also give you an additional worker thread.

4- OLE DB Command transformation: OLE DB Command is a row-by-row transformation, meaning it runs its command against each one of its input rows. This makes it painfully slow as the number of rows goes up. The fix is to stage the data into a temporary table inside the DFT and then run a set-based statement in an Execute SQL Task outside that DFT.
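As an illustration, the set-based statement in that Execute SQL Task could look like this (a sketch; the fact and staging table names are made up for the example):

-- Set-based alternative to a row-by-row OLE DB Command update:
-- the DFT bulk-loads the changed rows into a staging table first,
-- then this single statement applies them to the target table.
UPDATE t
SET    t.SalesAmount = st.SalesAmount
FROM   [DW].[dbo].[FactSales] t
INNER JOIN [DW].[dbo].[Staging_FactSales] st
        ON t.ProductKey = st.ProductKey
       AND t.DateKey    = st.DateKey;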

5- SQL Server Destination vs. OLE DB Destination: There are multiple reasons to use the OLE DB Destination rather than the SQL Server Destination:

  • OLE DB Destination is mostly faster,
  • OLE DB Destination is a lot clearer when it fails (The error message is more helpful),
  • SQL Server Destination works only when SSIS is installed on the destination server.

6- Change Data Capture (CDC): Try to reduce the amount of data to be transferred as much as you can, and do it as close to the source as you can. A Modified On column on the source table(s) helps a lot in this case.
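A sketch of such an incremental extract (the Modified_On column and the variable are assumptions, not part of the original example):

-- Pull only the rows changed since the last successful load
DECLARE @LastLoadTime DATETIME = '2014-01-01';  -- would come from your ETL metadata

SELECT s.Product_Key
      ,s.Sales_DateTime
      ,s.Sales_Value
FROM   [OLTP].[dbo].[Sales] s
WHERE  s.Modified_On > @LastLoadTime;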

7- Slowly Changing Dimension (SCD) transformation: There is only one piece of advice about SSIS's Slowly Changing Dimension transformation, and that is to get rid of it! The reasons are:

  • It doesn’t use any cached data, and goes to the data source every single time it is called,
  • It uses many OLE DB Command transformations,
  • Fast Data Load is off by default on its OLE DB Destination.

I recommend you go and watch the whole video if you have enough time. All these topics are discussed in more detail there.
Cheers!


Making SSRS reports faster: get rid of Parameter Sniffing

Is your SSRS report running slowly? Are you using a stored procedure to pull the data and pass it to the report? If your answer to these questions is yes, then you may be a victim of SQL Server's parameter sniffing.
The first question is: what is parameter sniffing? It refers to SQL Server's effort to reduce CPU overhead by reusing the same query execution plan for all similar queries instead of compiling the query every time it is executed. As long as the queries really would have produced the same plan, this is a big performance win. SQL Server even tries to automatically turn simple non-parameterized user queries into parameterized queries to take advantage of this gain.
Parameter use, especially in more complex scenarios, can also cause performance issues. If the queries are complex and/or the data distribution on the columns the parameter is compared against varies, the cost of different plan choices can change: a plan that is optimal for one parameter value may perform poorly for another. The optimizer estimates the selectivity and cardinality of the predicates using the parameter value it sees at compile time, and that plan is then reused for every subsequent value.
So, what can you do to avoid being a victim of this behaviour? It is very easy: all you need to do is declare local variable(s) inside your stored procedure's code and assign the values passed to the SP to the newly introduced variables. For example, let's say we have an SP with the following definition:

CREATE PROCEDURE [SP_Test_ParameterSniffing]
@CustomerID INT
AS
BEGIN

SELECT *
FROM Customer c
WHERE c.CustomerID = @CustomerID

END

All you need to do is add a local variable to your SP's code, assign the value passed to the SP to this new variable, and use it in your WHERE clause:

CREATE PROCEDURE [SP_Test_ParameterSniffing]
@CustomerID INT
AS
BEGIN

DECLARE @CustomerID2 INT;
SET @CustomerID2 = @CustomerID;
SELECT *
FROM Customer c
WHERE c.CustomerID = @CustomerID2

END

This stops the optimizer from sniffing the value passed in when the plan is compiled, so a plan built around one particular parameter value is no longer reused for, and no longer hurts, every other value.