
An alternative for Index with Include: CLUSTERED (UNIQUE) Index

Indexes are good and helpful for reading from tables: you can add different indexes to cover the kinds of queries that will be issued against a table and make sure they always hit an index, not the table. But that strategy has a drawback: it slows down inserts into the table, because every time a new record is added, all of the indexes on the table need to be updated. That was the reason I was looking at the list of indexes on one of my busiest summary tables, to see if I could get rid of any of them and make my inserts faster. (This table is updated once a day, and loading about 1.5 million records takes about 8 minutes.)
I realized I had 3 separate indexes on different subsets of the columns that are in my UNIQUE index, each one with different INCLUDE columns. In case you don't know what INCLUDE columns do: they are stored alongside the key columns at the leaf level of the index, so they can be read from the index itself rather than by going back to the table. This can mean doubling (or tripling) the space the table takes on disk.
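For instance, an index like the one below (the table, index, and column names are made up for illustration) stores Quantity and Amount at the leaf level in addition to its two key columns:

-- Quantity and Amount can be returned straight from the index pages
CREATE NONCLUSTERED INDEX [IX_SalesSummary_Customer_Date] ON [dbo].[SalesSummary]
(
[CustomerID] ASC,
[SalesDate] ASC
)
INCLUDE ([Quantity], [Amount])
GO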
Now, let's see what a UNIQUE CLUSTERED index does. A clustered index dictates the order in which the records are stored on disk. For example, suppose our table has 2 columns, ID and Name, and there is a clustered index on the ID column:

CREATE TABLE [dbo].[Table_1](
[ID] [int] NOT NULL,
[Name] [varchar](50) NOT NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [Ind1] ON [dbo].[Table_1]
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

Now, let’s add 3 records to the table:
INSERT INTO [dbo].[Table_1]
([ID]
,[Name])
VALUES
(2
,'S')
GO

INSERT INTO [dbo].[Table_1]
([ID]
,[Name])
VALUES
(3
,'M')
GO

INSERT INTO [dbo].[Table_1]
([ID]
,[Name])
VALUES
(1
,'Z')
GO

If you query the table now, you'll see the result sorted by the values in the ID column, without any ORDER BY clause in the query. This means the data inserted into the table is saved to disk sorted by the clustered index column, not in the order in which it was inserted. In other words, if we query the table and filter by ID, the query engine can locate the record satisfying the WHERE clause without having to search through the whole table. And guess what: this has exactly the same effect as having a nonclustered index on ID with Name as an INCLUDE column.

This approach can work wherever there is a need for an index with INCLUDE columns. It works best when the columns making up the clustered index form a unique key on the table, simply because a UNIQUE CLUSTERED INDEX not only tells the query engine how to locate the records, it also guarantees that every key value appears exactly once.
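To make the equivalence concrete, here is the nonclustered alternative that the clustered index above replaces (a sketch; the index name is arbitrary). With either index in place, a query like the one below is answered by a single index seek, with no lookup into the base table:

-- The nonclustered equivalent of the UNIQUE CLUSTERED index [Ind1]
CREATE NONCLUSTERED INDEX [Ind2] ON [dbo].[Table_1]
(
[ID] ASC
)
INCLUDE ([Name])
GO

-- Seeks straight to ID = 2 and reads Name from the index itself
SELECT [Name]
FROM [dbo].[Table_1]
WHERE [ID] = 2
GO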


SSIS, Tempdb Database, and SQL Server Log Files

The tempdb system database is a global resource that is available to all users connected to the instance of SQL Server. Tempdb is re-created every time SQL Server is started, which means the system always starts with a clean copy of the database and there is never anything in tempdb that is saved from one session of SQL Server to another.

Tempdb usage can be roughly separated into 3 categories:

User objects: Any user with permission to connect to an instance of SQL Server can create temporary tables and table variables. Temporary user objects that are explicitly created, such as global or local temporary tables, temporary stored procedures, table variables, or cursors, are stored in the tempdb database.
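As a quick sketch (the object names are made up), both objects below are allocated in tempdb rather than in the user database the session is connected to:

-- Both of these live in tempdb, not in the current user database
CREATE TABLE #OrderStage ([OrderID] int NOT NULL, [Amount] money NOT NULL)
DECLARE @OrderStage TABLE ([OrderID] int NOT NULL, [Amount] money NOT NULL)

-- The temporary table appears in tempdb's catalog under a decorated name
SELECT name, create_date
FROM tempdb.sys.objects
WHERE name LIKE '#OrderStage%'
GO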

Internal objects: Tempdb holds the intermediate data when SQL Server is performing tasks like the following (a quick way to watch this is sketched after the list):

  • DBCC CHECKDB or DBCC CHECKTABLE;
  • Common Table Expressions (CTE);
  • ORDER BY, GROUP BY, and UNION queries;
  • Index rebuild or creation.
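A rough way to observe this is the sys.dm_db_session_space_usage DMV, which reports tempdb pages allocated per session; internal_objects_alloc_page_count covers the work tables and sort runs created behind the scenes:

-- Run in the context of tempdb to see per-session tempdb allocations
USE tempdb
GO
SELECT session_id
,internal_objects_alloc_page_count
,user_objects_alloc_page_count
FROM sys.dm_db_session_space_usage
ORDER BY internal_objects_alloc_page_count DESC
GO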

Version store: Version Store is a collection of data pages that hold the data rows used for row versioning. Triggers, MARS (Multiple Active Result Sets), and online index operations are examples of SQL Server features that use version store.
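Similarly, the version store's footprint inside tempdb can be checked with a query along these lines (the page counts come from sys.dm_db_file_space_usage):

-- Pages currently reserved for the version store in tempdb's data files
SELECT SUM(version_store_reserved_page_count) AS [VersionStorePages]
FROM tempdb.sys.dm_db_file_space_usage
GO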

In this post, I'm going to talk about how tempdb is affected when SSIS is used to load data into target tables or to update previously loaded data (specifically in fact tables), and provide a workaround for preventing the database from growing too large. The solution provided here has much the same effect on the SQL Server logs, resulting in smaller log files and saving space.

Populating Fact Tables

A fact table in a data warehouse has a direct or indirect (through mapping tables) relationship to the dimension tables it is related to. Whether the table a fact table references is a dimension or a mapping table, there must be a column (foreign key) in the fact table that refers to the primary key of the linked table. This means there are 2 sets of columns in a fact table: measure columns, which hold the numeric values to be analysed, and foreign keys to dimension tables, which are used to slice and dice the measure values.

The set of dimension and/or mapping foreign keys in a fact table defines its granularity: the lowest level to which data should be aggregated before being loaded into the DW. In other words, data should be aggregated at the granular level defined by the fact table's foreign key columns.

A good way of aggregating data at the required granular level is using a GROUP BY clause in the query that is used to pump the data into DW’s fact table. For example, if the structure of the fact table to be populated is like the one depicted below:

[Figure: a fact table whose columns are the ProductKey and DateKey foreign keys plus a SalesAmount measure]

Then a query like the following is well suited to populating the table:

SELECT      [ProductKey]  = p.[ProductKey]
           ,[DateKey]     = d.[DateKey]
           ,[SalesAmount] = SUM(s.Sales_Value)
FROM        [OLTP].[dbo].[Sales] s
INNER JOIN  [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN  [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
GROUP BY    p.[ProductKey]
           ,d.[DateKey]
What should be considered at this stage is that populating a fact table, especially for the first time, requires a huge number of records to be transferred from the OLTP database to the DW database, and the query that does so has a GROUP BY clause that engages the tempdb database while preparing the data. The ETL must therefore be designed so that it doesn't blow up tempdb: instead of loading the data in a single transfer process, it should break the data into smaller chunks and load them one at a time.

The best candidate for breaking down the data is the date-related level at which the data is aggregated. In our example the data is aggregated at the date level (it could be at any other level, like week, month, or year).

Design and implementation of ETL using SSIS:

As explained before, the ETL for populating the fact table should be designed so that it loads data into the DW by period (day, week, month, or year) to prevent tempdb and the log files from growing too large. This requires our SSIS package to perform the following steps:

A- Get the list of distinct periods for the whole data set that is going to be loaded;

B- Save the periods extracted in the previous step into an SSIS variable;

C- Run the process that cleanses and inserts data into the DW once for each period.

Let’s see how each step could be implemented using SSIS:

1- Open BIDS, create a new SSIS project;

2- Add a new Execute SQL Task to your SSIS project. This task is responsible for reading the data to be loaded and building a list of distinct periods;

3- Create a variable of type Object at the scope of the package. This variable is going to hold the list of periods and will be used by different tasks at different steps of loading the data. I named it Periods;

4- Add another variable of type Integer (Int32) to your package, again at the scope of the package. This variable will hold each period read from the Periods variable, for which the package will load data. In my example this variable is called CalendarDate;

5- The next step is to extract the periods for which the data should be loaded into the fact table. This is done by writing a query that extracts the distinct list of time periods from the OLTP database, and configuring the previously added Execute SQL Task to run the query and save the result in the Periods variable:

a- Add a SQL script like the one below to your Execute SQL Task:
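A sketch of such a script, reusing the table names from the fact query above (the exact column to group on depends on your chosen period: day, week, month, or year):

-- One row per distinct period present in the source data
SELECT DISTINCT d.[DateKey]
FROM [OLTP].[dbo].[Sales] s
INNER JOIN [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
ORDER BY d.[DateKey]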

b- Change the ResultSet property of the Execute SQL Task to Full Result Set;

c- Click on the Result Set tab, and add a new result set that stores the result of running your query in the Periods variable:

6- To get the package to transfer data for one period at a time, we need a Foreach Loop Container. The Foreach container loops through what we have loaded into the Periods SSIS variable and runs the data-transfer process once for each period:

a- Add a Foreach Loop container to your project and open its editor by double-clicking on it;

b- Go to the Collection tab, and select "Foreach ADO Enumerator" from the Enumerator drop-down and "User::Periods" from the ADO object source variable drop-down:

c- Click on the Variable Mappings tab of the Foreach Loop Editor and add the CalendarDate variable. This variable is going to hold the period for which the data is transferred from OLTP to DW on each iteration:

7- Add the Data Flow Task (DFT) and an optional Sequence Container to your project to do the transfer using the CalendarDate variable. You should be able to find plenty of resources on the internet about using a variable in the OLE DB source of a DFT, in case you don't know how to do that; a sketch of the source query follows.
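As a sketch (assuming the same table and column names as in the query earlier in this post), the OLE DB source query could look like this, with the ? placeholder mapped to the User::CalendarDate variable in the source's Parameters dialog:

-- Loads exactly one period per iteration of the Foreach loop
SELECT      [ProductKey]  = p.[ProductKey]
           ,[DateKey]     = d.[DateKey]
           ,[SalesAmount] = SUM(s.Sales_Value)
FROM        [OLTP].[dbo].[Sales] s
INNER JOIN  [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN  [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
WHERE       d.[DateKey] = ?
GROUP BY    p.[ProductKey]
           ,d.[DateKey]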


Get tables and their data without backing up the database: SQL Server 2008 R2

There may be situations where you want to get all of the tables in a SQL Server database, along with the data they currently hold, but you can't back up the whole database (because of the permissions assigned to your account, for example) or don't want to. Follow these steps to achieve this:

1- Open SSMS and connect to the correct instance of SQL Server;

2- Right-click the appropriate database under the Databases folder, go to Tasks, and then select "Generate Scripts…". The Generate and Publish Scripts wizard launches;

3- Click Next on the Introduction page;

4- On the next page, Choose Objects, select "Select specific database objects" and then check the checkbox beside the Tables entry. You can also pick exactly which tables to script by clicking the + sign beside the Tables entry. Click Next;

5- Here's the tricky part: you can get either just the tables' schema, or the schema together with the data, for the tables selected in the previous step. To do this, click the Advanced button. Then, in the Advanced Scripting Options page, find the "Types of data to script" property and change it to "Schema and data" (it is set to "Schema only" by default). Click OK to go back to the Set Scripting Options page;

6- Select "Save to new query window" and hit Next twice. When all of the listed actions are complete, click Finish.

After completing these steps, you'll get a new query window in SSMS containing the script to create all of the tables and insert the records they currently contain in the original database.
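For a small two-column table, the generated script looks roughly like the sketch below (the names are illustrative); the wizard emits a CREATE TABLE statement followed by one INSERT per row:

CREATE TABLE [dbo].[Table_1](
[ID] [int] NOT NULL,
[Name] [varchar](50) NOT NULL
) ON [PRIMARY]
GO
INSERT [dbo].[Table_1] ([ID], [Name]) VALUES (1, N'Z')
GO
INSERT [dbo].[Table_1] ([ID], [Name]) VALUES (2, N'S')
GO
INSERT [dbo].[Table_1] ([ID], [Name]) VALUES (3, N'M')
GO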