How to import org.apache.spark.sql.SQLContext.implicits in Spark 1.6: error “value toDF is not a member of org.apache.spark.rdd.RDD”

I am doing a mini project for my company using Spark/Scala and had been stuck with the error mentioned in the title for a couple of days. Googling the error suggested importing org.apache.spark.sql.SQLContext.implicits, and that's what I did:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SQLContext.implicits

import org.apache.spark.SparkConf
object TestSQLContext {
[…..]
def main(args:Array[String]) {
[…..]
}
}

And that was the start of the problem: my application started to give a new error:

object SQLContext is not a member of package org.apache.spark.sql
[error] Note: class SQLContext exists, but it has no companion object.

The problem is that none of those online posts mention that we need to create an instance of org.apache.spark.sql.SQLContext and import its implicits before toDF becomes available. This is the right way to do it:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql._
import org.apache.spark.SparkConf

object Hi {

  case class DimC(ID: Int, Name: String, City: String, EffectiveFrom: Int, EffectiveTo: Int)

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("LoadDW")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // The implicits that provide toDF live on the SQLContext *instance*,
    // so this import has to come after the instance is created.
    import sqlContext.implicits._

    val fDimCustomer = sc.textFile("DimCustomer.txt")

    val dimCustomer1 = fDimCustomer
      .map(_.split(','))
      .map(r => DimC(r(0).toInt, r(1), r(2), r(3).toInt, r(4).toInt))
      .toDF()

    dimCustomer1.registerTempTable("Cust_1")

    val customers = sqlContext.sql("select * from Cust_1")

    customers.show()
  }
}

Hope this post helps, and please do not hesitate to ask your questions in the comments section.

Cheers.

OBIEE RPD Design: Converting a Snowflake to a Star Schema from Multiple Sources (Combining Dimensions)

As I play more with OBIEE, I learn more about what it is capable of and where its main power resides. OBIEE has 3 layers: Physical, Business Model and Mapping, and Presentation. The middle layer, BMM, is what makes OBIEE special: it is where we can define how data from different sources and tables come together and form a nice and clean star schema.

Like most of my posts, I am gonna explain how this is achievable by walking through a demo. I defined 2 databases in my local instance of SQL Server: one with a fact table and 3 dimensions (DimCustomer, DimLocation, DimProduct and FactTest in database Test), and another with a single dimension that is referenced by one of the dimensions in the first database (DimProductCategory in database Test2).

Screen Shot 2015-02-10 at 9.56.51 am

Let’s go through the steps to create and design the RPD for our data sources. Launch the Oracle BI Administration Tool:

a) Create first database connection:

Click on File and then New Repository. This will launch the Create New Repository wizard. As you know, we can define only one data source using the Create Repository wizard, so I used it to create my first database connection, to the Test database. I haven’t listed the steps here because they are very easy and straightforward.

b) Create second database connection:

1- Right-click on Physical pane of Oracle BI Administration Tool and select “New Database..”

2- Provide a name in the General tab. Click on the “Connection Pools” tab and then click on the green “+” sign.

3- Give the new Connection Pool a name, select the data source pointing to the second database from “Data Source Name” drop-down list and enter appropriate “User Name” and “Password”, as shown below. You will be prompted to re-enter the password after clicking Ok.

Screen Shot 2015-02-10 at 10.38.51 am

4- Click Ok again in New database window.

5- Now that we have added a new database and connection pool to our Physical layer, it’s time to add tables to the new connection. Right-click on the entry for the new connection pool (in my case, Connection Pool 2) and select “Import Metadata”. This will launch the Import Metadata wizard, where you can select the objects that should be added to the Physical layer.

6- Select the type of objects you want to import and click Next. In my tutorial, I select DimProductCategory from database Test2 and add it to Repository View, and then click Finish:

Screen Shot 2015-02-10 at 11.21.11 am

Your Physical layer pane should now have 2 entries, one for each database connection, with the tables imported under each. My Physical layer looks like this:

Screen Shot 2015-02-10 at 11.33.17 am

c) Physical Diagram

Now it is time to define how the tables are joined together in the Physical layer.

1- Select tables from both connections, right-click and select “Physical Diagram” and then “Selected Object(s) Only”.

I didn’t define any foreign keys in my database, so none of the tables are related in my initial physical diagram and I have to define the joins manually.

2- To define the first join, click on New Join button, click on the fact table and then on DimProduct. This will create a join between those 2 tables based on the columns with the same name. You can change this in the next window that opens, “Physical Foreign Key”:

Screen Shot 2015-02-10 at 11.56.36 am

3- Do the same for the rest of the joins, including DimProduct and DimProductCategory. Note that there is no difference between joining the tables on the same database and tables that are on different databases. My diagram looks like this after setting up the join between all tables:

Screen Shot 2015-02-10 at 12.21.40 pm

d) Business Model and Mapping

After finishing with the Physical layer, it is time to define the business logic in the BMM layer. In this section, we are gonna create a new Business Model, add objects from the Physical layer to it, and define our star schema on top of the snowflake diagram we created in the Physical layer by combining DimCustomer & DimLocation into one dimension, and DimProduct & DimProductCategory into another.

1- Right-click in Business Model and Mapping pane and select “New Business Model..”. Give your model a name and un-check Disabled checkbox.

2- Drag and Drop your fact table (FactTest) into the new BMM model:

Screen Shot 2015-02-10 at 1.28.36 pm

3- Now it’s time to combine our first tables together and create a single dimension. Let’s start with DimProduct and DimProductCategory, the 2 tables that reside on separate databases. We will use “Logical Table” to do this. Right-click on Business Model, then “New Object” and then select “Logical Table…”. Logical Table window will open.

4- Give your new logical table a name in “General”. Let’s call it DimMasterProduct.

5- Click on “Sources” tab. As its name implies, this is where we can define the source(s) for our new logical table from tables available in Physical layer. Click on the green “+” button. Logical Table Source window opens.

6- Give it a name.

7- The next step is very important: click on the green “+” button below the Name textbox to map the logical table source to a physical table. A new window opens which lists all the tables used in our Physical diagram:

Screen Shot 2015-02-10 at 1.44.56 pm

Click on DimProduct and then click “Select” Button. You’ll see Logical Table Source window again, with DimProduct added as the source:

Screen Shot 2015-02-10 at 1.47.25 pm

8- Click on the “+” button again to add another table. You’ll see that instead of all tables in Physical layer, only those that are joined to DimProduct in Physical Diagram are listed here. Select DimProductCategory. You’ll see a new Join is defined and you can change it from inner join to right or left join:

Screen Shot 2015-02-10 at 1.51.04 pm

9- Click on “Column Mapping” tab. Here we can add the columns from our sources to our logical table. Click on “Add New Column” button. Logical column window opens.

10- Give your new column a name (Product_ID). Click on “Column Source” tab. Select (Click) the logical table source you created in previous step and click OK:

Screen Shot 2015-02-10 at 2.04.52 pm

A new Logical Column is added to the “Column Mapping” tab of “Logical Table Source”. Select the right column from Expression drop-down list:

Screen Shot 2015-02-10 at 2.09.01 pm

11- Repeat the steps to map the rest of columns for the logical table. I added 4 columns, as shown below:

Screen Shot 2015-02-10 at 2.11.17 pm

Click OK to go back to the Logical Table window, and OK again to close it.

e) Business Model Diagram

Notice that both FactTest and DimMasterProduct have a # mark on them, implying they are fact tables. OBIEE treats any table in the BMM layer with no join pointing to it as a fact table. To change this, we need to define our “Business Model Diagram”.

1- Right-click on your Business Model and select “Business Model Diagram”, and then “Whole Diagram”. My 2 tables in BMM layer namely FactTest and DimMasterProduct are shown.

2- Like what we did for Physical Layer, join the 2 tables together by clicking on “New Join” button, and then FactTest and DimMasterProduct. Logical Join window opens:

Screen Shot 2015-02-10 at 2.29.55 pm

Notice the difference? Here, unlike the Physical layer, we don’t need to define which columns are used to join the 2 tables together: OBIEE works it out based on the physical relationship of the tables in the Physical layer.

My diagram looks like this now (Note there is no # on DimMasterProduct anymore):

Screen Shot 2015-02-10 at 2.48.01 pm

f) Presentation Layer

As you probably know, you can add your BMM model to the Presentation layer just by dragging and dropping it there:

Screen Shot 2015-02-10 at 2.54.09 pm

Now, save your solution and, when asked if you want OBIEE to check Global Consistency, click Yes. You’ll notice that the Global Consistency check fails with an error about the logical table not having a primary key defined:

Screen Shot 2015-02-10 at 2.56.50 pm

This error can be fixed very easily:

1- Double-click on your logical table’s entry in BMM layer.

2- Go to “Keys” tab. There should be an empty row added.

3- Enter a name for the Key in “Key Name” column.

4- Select the column that is the primary key of your table from “Columns” drop-down.

5- Select the key you just created from “Primary Key” drop-down.

Screen Shot 2015-02-10 at 3.00.38 pm

6- Click Ok and save your RPD. Click yes for OBIEE to check Global Consistency.

7- Notice the change in the icon of Business Model in BMM layer.

Now your RPD is ready to be deployed and used for analysis. This method applies to any other scenario where 2 or more tables need to be joined together to create a single dimension: in other words, to convert a snowflake schema to a star schema.

Hope I didn’t confuse you with such a long post,

Cheers.


OBIEE: Multiple joins between same tables (Fact to Dim)

Hi all. I am finally writing a new post after more than 1 year and surprisingly it is not on SQL Server! I must confess that I am not a front-end kinda person and do not particularly enjoy doing dashboards and reports. But I recently started a new job and my first project is going to be on OBIEE.

Since I learn best by doing and getting my hands dirty (instead of reading pages and pages of tutorials), I defined a new project for myself that I am not gonna bore you with. What we need to know for the sake of this post is that I have a fact table with 5 columns:

  • Buyer_ID
  • Product_ID
  • Seller_ID
  • Sale_Date
  • Sale_Qty

The twist here is that both Buyer_ID and Seller_ID point to the same dimension table, DimCustomer. Since we can import each physical table into the Physical layer of the Oracle BI Administration Tool only once, I was wondering how I should model this in OBIEE.

Basically what we need to do is to tell OBIEE to join the fact table twice to DimCustomer:

  • First on Fact.Buyer_ID = DimCustomer.Customer_ID
  • Then on Fact.Seller_ID = DimCustomer.Customer_ID

And we already know that the physical structure of the queries sent to the data source is defined by how objects are related and joined to each other in the Physical layer of the BI Administration Tool. Therefore, we need to find a way to create 2 objects in the Physical layer based on DimCustomer. Well, this is very easy: it can be done by creating an Alias based on the original DimCustomer.

Screen Shot 2015-02-03 at 1.38.25 pm

From here, everything is very straightforward:

  • On the General tab, give the new Alias a name and make sure it is based on the right physical table by checking the Source Table field right under the Name textbox;
  • If this Alias is going to be used as a dimension table, define its primary key on the Keys tab. You’ll need to enter a name in the Key Name column and then choose the actual column from the Columns drop-down to be the primary key of the new Alias table.

And now you are done. You’ll see a new entry in the Physical layer as a table that you can use in your physical diagram to define your model properly.

Hope this post will come in handy, cheers.

An alternative for an index with INCLUDE columns: a (UNIQUE) CLUSTERED index

Indexes are good and helpful for reading from tables: one can add different indexes based on whatever queries are going to be issued against a table and make sure they always hit an index, not the table. But that strategy has a drawback: it slows down inserts, because every time a new record is added to the table, all the indexes on the table need to be updated. That was the reason I was looking at the list of indexes I have on one of my busiest summary tables, to see if I could get rid of any of them and make my inserts faster (this table is updated once a day and takes about 8 minutes for 1.5 million records).
I realized I have 3 separate indexes on different subsets of the columns that are in my UNIQUE index, each one with different INCLUDE columns. In case you don’t know what INCLUDE columns do: they are stored alongside the key columns in the index so that they can be read by accessing the index alone, rather than having to go to the table and get them from there. This could mean doubling (or tripling) the space the table takes on disk.
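
For illustration, this is roughly what one of those covering indexes might look like on the sample Table_1 defined below (the index name and column choice are just placeholders):

-- A covering index: ID is the key column, Name is carried in the leaf
-- pages via INCLUDE so a query on ID never has to touch the base table.
CREATE NONCLUSTERED INDEX [IX_Table_1_ID_incl_Name] ON [dbo].[Table_1]
(
[ID] ASC
)
INCLUDE ([Name])
GO

The INCLUDE column is what lets the query avoid the base table, and also what costs the extra disk space mentioned above.
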
Now, let’s see what a UNIQUE CLUSTERED index does. A clustered index forces the order in which the records are stored on disk. For example, say our table has 2 columns, ID and Name, and there is a clustered index on the ID column:

CREATE TABLE [dbo].[Table_1](
[ID] [int] NOT NULL,
[Name] [varchar](50) NOT NULL
) ON [PRIMARY]

CREATE UNIQUE CLUSTERED INDEX [Ind1] ON [dbo].[Table_1]
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

Now, let’s add 3 records to the table:
INSERT INTO [dbo].[Table_1] ([ID], [Name])
VALUES (2, 'S')
GO

INSERT INTO [dbo].[Table_1] ([ID], [Name])
VALUES (3, 'M')
GO

INSERT INTO [dbo].[Table_1] ([ID], [Name])
VALUES (1, 'Z')
GO

If you query the table now, you’ll see that the result comes back sorted by the values in the ID column, without any ORDER BY clause in the query. This means the data inserted into the table is stored on disk sorted by the clustered index column, not in the order in which it was inserted. In other words, if we query the table and filter by ID, the query engine can locate the record satisfying the WHERE clause without having to scan the whole table. And guess what: this has exactly the same effect as if we had a non-clustered index on ID with Name included in it.
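
As a quick sanity check on the sample table above, filtering on the clustered key lets the engine seek straight to the row:

-- With the unique clustered index on ID, this lookup is an index seek,
-- and Name comes back with no extra lookup because the clustered index
-- leaf level is the table data itself.
SELECT [ID], [Name]
FROM [dbo].[Table_1]
WHERE [ID] = 2

You can confirm the seek by looking at the actual execution plan in Management Studio.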

This can work in any case where there is a need for an index with INCLUDE columns. It works best when the columns making up the clustered index form a unique key on the table, simply because a UNIQUE CLUSTERED INDEX not only tells the query engine how to locate the records, it also guarantees that every key value is unique and occurs once and only once.


SQL Server 2014 and SSDT (AKA BIDS)

Hey folks. This is going to be a short post; I just wanted to mention something that may come in handy for those who are interested in playing with SQL Server 2014.
I downloaded SQL Server 2014 a couple of weeks ago and started exploring its new features to see what has changed and improved. After going through the few blogs I usually check and running some tests on the DB engine (and being amazed by how efficient the new version is compared to 2008 R2), I wanted to start playing with the tools that I am most interested in: SSIS and SSAS.
What do you guys do when you want to add a new SSIS project? That’s right, you open Visual Studio or SSDT (aka BIDS) and create a new project. But wait a minute, where is the new Visual Studio that is supposed to come with SQL Server 2014? I looked into the folder created in my Start menu and couldn’t find anything in there.
After looking into MSDN I realized that, unlike previous versions, SSDT does not come with the 2014 version of SQL Server and has to be downloaded separately. It’s ironic that our friends at Microsoft forgot to mention it, isn’t it?
Anyway, let’s not be fusspots about how they could forget to mention this and just download it ourselves from here.

Hope this can help, cheers.


Drop failed for DatabaseRole: The database principal owns a schema in the database, and cannot be dropped

This error is raised when there is a schema owned by the role you are trying to drop. The most straightforward and quickest fix is to revert the schema ownership to the appropriate principal in the database, so that the role being dropped no longer owns any schema.

For example, if you have assigned the db_datareader schema to be owned by the database role you created, you’ll notice that you can’t simply untick that ownership on the role. To work around this, open the db_datareader database role of the database and make it the owner of the db_datareader schema (tick the box in “Schemas owned by this role” for db_datareader). After doing so, the schema in question is no longer owned by your role, and you’ll be able to drop the database role without any problem.
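
The same fix can also be scripted; a minimal T-SQL sketch, assuming the role you want to drop is called MyRole and it currently owns the db_datareader schema (the role name is just a placeholder):

-- Find the schemas currently owned by the role we want to drop.
SELECT s.name AS owned_schema
FROM sys.schemas AS s
WHERE s.principal_id = USER_ID('MyRole')
GO

-- Hand the db_datareader schema back to its matching built-in role,
-- then the custom role can be dropped.
ALTER AUTHORIZATION ON SCHEMA::db_datareader TO db_datareader
GO
DROP ROLE MyRole
GO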


SSIS, Tempdb Database, and SQL Server Log Files

The tempdb system database is a global resource that is available to all users connected to the instance of SQL Server. Tempdb is re-created every time SQL Server is started, which means the system always starts with a clean copy of the database and there is never anything in tempdb that is saved from one session of SQL Server to another.

Tempdb usage can be roughly separated into 3 categories:

User objects: Any user with permission to connect to an instance of SQL Server can create temporary tables and table variables. Temporary user objects that are explicitly created, such as global or local temporary tables, temporary stored procedures, table variables, or cursors, are stored in the tempdb database.
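
For example, both of the following end up in tempdb for the duration of the batch or session (the names are just examples):

-- A local temporary table and a table variable.
CREATE TABLE #StagingSales (ProductKey INT, SalesAmount DECIMAL(18, 2))

DECLARE @Totals TABLE (ProductKey INT, SalesAmount DECIMAL(18, 2))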

Internal objects: Tempdb holds the intermediate data when SQL Server is doing tasks like:

  • DBCC CHECKDB or DBCC CHECKTABLE;
  • Common Table Expressions (CTE);
  • ORDER BY, GROUP BY, UNION Queries;
  • Index rebuild or creation.

Version store: Version Store is a collection of data pages that hold the data rows used for row versioning. Triggers, MARS (Multiple Active Result Sets), and online index operations are examples of SQL Server features that use version store.

Here in this post, I’m going to talk about how tempdb is affected when SSIS is used to load data into target tables or update previously loaded data (specifically in fact tables), and provide a workaround for preventing the database from growing too large. The solution provided here has pretty much the same effect on the SQL Server log, resulting in smaller log files and saved space.

Populating Fact Tables

A fact table in a data warehouse has a direct or indirect (through mapping tables) relationship to the dimension tables it is related to. Whether the linked table is a dimension or a mapping table, there must be a foreign key column in the fact table that refers to the primary key of the linked table. This means there are 2 sets of columns in a fact table: measure columns, which hold the numeric values to be analysed, and foreign keys to dimension tables, which are used to slice and dice the measure values.

The set of dimension and/or mapping foreign keys in a fact table defines its granularity: the lowest level at which data should be aggregated before being loaded into the DW. In other words, data should be aggregated at the right granular level, which is defined by the fact table’s foreign key columns.

A good way of aggregating data at the required granular level is using a GROUP BY clause in the query that is used to pump the data into DW’s fact table. For example, if the structure of the fact table to be populated is like the one depicted below:

Image: structure of the fact table to be populated

Then the following query is the best to populate the table:

SELECT      'ProductKey'  = p.[ProductKey]
           ,'DateKey'     = d.[DateKey]
           ,'SalesAmount' = SUM(s.Sales_Value)
FROM        [OLTP].[dbo].[Sales] s
INNER JOIN  [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN  [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
GROUP BY    p.[ProductKey]
           ,d.[DateKey]

What should be considered at this stage is that populating a fact table, especially for the first time, requires a huge number of records to be transferred from the OLTP database to the DW database, and the query that does so has a GROUP BY clause that engages the tempdb database while preparing the data. The ETL must therefore be designed so that it doesn’t blow up tempdb: instead of loading everything in a single data transfer process, it should break the data into smaller chunks and load one chunk at a time.

The best candidate for breaking down the data is the date-related level at which the data is aggregated. In our example the data is aggregated at the date level (it could be at any other level, like week, month, or year).

Design and implementation of ETL using SSIS:

As explained before, the ETL for populating the fact table should be designed so that it loads data into the DW by period (day, week, month, or year) to prevent tempdb and the log files from growing too large. This requires our SSIS package to do the following steps:

A- Get the list of distinct periods for the whole data that is going to be loaded,

B- Save the periods extracted in previous step into an SSIS variable,

C- Run the process that cleanses and inserts data into the DW once for each period.

Let’s see how each step could be implemented using SSIS:

1- Open BIDS, create a new SSIS project;

2- Add a new Execute SQL Task to your SSIS project. This task is responsible for reading the data to be loaded and building a list of distinct periods;

3- Create a variable of type Object at the scope of Package. This variable is going to hold the list of periods and will be used by different tasks at different steps of loading data. I named it Periods;

4- Add another variable of type Integer, Int32, to your package again at the scope of Package. This variable will be used to hold each period of time read from Periods variable, for which the package is going to load data. In my example this variable is called CalendarDate;

5- The next step is to extract the periods for which the data should be loaded into the fact table. This is done by writing a query that extracts the distinct list of time periods from the OLTP database and configuring the previously added Execute SQL Task to run the query and save the result in the Periods variable (a sample query is sketched after the sub-steps below):

a-  Add the SQL script to your Execute SQL Task:

b- Change the ResultSet property of the Execute SQL Task to Full Result Set;

c- Click on the Result Set tab and add a new result set to store the result of your query in the Periods variable:
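
A minimal sketch of such a query, reusing the tables from the earlier example (the exact table and column names are assumptions based on that example):

-- Distinct list of date keys (periods) present in the source data:
-- one row per chunk that the package will load.
SELECT DISTINCT d.[DateKey]
FROM        [OLTP].[dbo].[Sales] s
INNER JOIN  [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
ORDER BY    d.[DateKey]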

6- To get the package to transfer data one period at a time, we need a Foreach Loop Container. The Foreach container is going to loop through what we have already loaded into the Periods SSIS variable and run the data transfer process once for each period:

a- Add a Foreach Loop container to your project and open its editor by double-clicking on it;

b- Go to the Collection tab, select “Foreach ADO Enumerator” from the Enumerator drop-down and “User::Periods” as the ADO object source variable:

c- Click on the Variable Mapping tab of the Foreach Loop Editor and add the CalendarDate variable. This variable is going to hold the period for which the data is transferred from OLTP to DW on each iteration:

7- Add the Data Flow Task (and, optionally, a sequence container) to your project to do the transfer using the CalendarDate variable. You should be able to find loads of resources on the internet on how to use a variable in the OLE DB source of a DFT, in case you don’t know how to do that; one common approach is sketched below.
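
As a sketch (one common approach, not the only one): set the OLE DB Source to use a parameterized SQL command and map the ? parameter to User::CalendarDate in the Parameters dialog, so that each loop iteration loads exactly one period. Assuming the fact-load query from earlier in this post:

-- SQL command text of the OLE DB Source; the ? parameter is mapped
-- to the User::CalendarDate variable for the current iteration.
SELECT      'ProductKey'  = p.[ProductKey]
           ,'DateKey'     = d.[DateKey]
           ,'SalesAmount' = SUM(s.Sales_Value)
FROM        [OLTP].[dbo].[Sales] s
INNER JOIN  [DW].[dbo].[DimProduct] p ON s.Product_Key = p.ProductKey
INNER JOIN  [DW].[dbo].[DimDate] d ON s.Sales_DateTime = d.DateTime_Value
WHERE       d.[DateKey] = ?
GROUP BY    p.[ProductKey]
           ,d.[DateKey]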