First Datawarehousing project...need some help and input !!

  • Hey guys,

    As you may have understood from the title, my company has given me the responsibility of bringing up the DW using SQL Server 2005.

    It's turning out to be a little more complicated than I thought, and I need some help to expedite my project.

    What I'm done with so far:

    (a) Importing all data from the various transaction databases into the staging area

    (b) The last phase of de-duplicating the data (I had hoped SSIS would make it easier, or at least help with automation, but I have ended up manually importing and de-duplicating the data)

    What I need help with is:

    (A) Preparing for the next phase, SSAS. Do I really need to use this tool? I have the dimensions set up and the data imported. Once the data is de-duplicated, is it ready for reporting, i.e. SSRS?

    (B) So far data cleansing has only meant removing duplicates. Is there anything else I should look for?

    (C) What's the best way to prepare for the reporting phase?

    Could you please give me your feedback along these lines? In addition, any comments/tips/advice from those who have projects under their belt would help me in my first one.

    Thanks, I really need some advice...

    Most companies think building a data warehouse is like implementing another database. It is totally different.

    (a) Importing all data from the various transaction databases into the staging area

    What do you mean by that? A data warehouse has dimension tables and fact tables. The staging area should have staging dimension tables and fact tables; after you imported the data, did you load it into those tables? Then, once you think it is correct, create and load the tables in the data warehouse database.
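
    To make that concrete, here is a rough illustration only (the table names and columns below are assumptions, not from this thread) of what a staging dimension table and a staging fact table might look like:

    -- Illustrative staging tables; validate the data here before loading the warehouse.
    CREATE TABLE Staging.dbo.Customer_Dim_Stg
    (
        CustomerID int,
        FirstName  varchar(50),
        LastName   varchar(50),
        Email      varchar(100)
    );

    CREATE TABLE Staging.dbo.Sales_Fact_Stg
    (
        CustomerID  int,
        OrderDate   datetime,
        SalesAmount money
    );

    You validate and cleanse the data in these staging tables first, and only then load it into the real dimension and fact tables in the warehouse database.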

    (b) The last phase of de-duplicating the data (I had hoped SSIS would make it easier, or at least help with automation, but I have ended up manually importing and de-duplicating the data)

    SSIS will not de-duplicate the data for you. You have to write the procedure yourself. I wish SQL Server 2005 would build all the tables for me by itself too, but Microsoft refuses to take my idea.

    (B) So far data cleansing has only meant removing duplicates. Is there anything else I should look for?

    You have to ask the business what they want. Building a data warehouse is not just an IT project; it involves customer input. For example, what kind of hierarchy do the customers want in the dimension tables so they can drill up and drill down through the data? What kind of measures do they want to see in the fact table? What kind of business rules do they want to implement?

    (C) What's the best way to prepare for the reporting phase?

    You can query the cube using MDX, or you can use another reporting tool.

  • Thanks for your reply,

    What I meant was that I have imported and loaded data into the staging area, and I am now de-duplicating it. It takes so long, though, when there are about 35 million records in the customer table.

    Which brings me to my next point

    I tried running an SSIS package: import (from the various customer tables) ---> Fuzzy Grouping ---> Sort ---> insert into the Customer_Dimension table.

    But it kept giving me memory issues, so I stopped. I'm now de-duplicating by using SELECT and GROUP BY statements.

    My question is:

    I'm trying to have a process that would (once the DW becomes fully operational):

    (A) Identify if a record already exists in the database

    (B) De-duplicate/cleanse the data

    (C) Import the data from the staging-area dimension tables to the DW dimension/fact tables

    (D) Present the data for reporting, almost on a daily basis (this is what management/marketing expects)

    I had hoped to use SSIS (fuzzy grouping, etc.) to automate all these tasks, but it seems like a remote possibility with all the memory contention issues.

    Is there a better way?

    Lastly, how long does such an implementation take (an entire DW project)? Right now I'm at the point of just getting it done, while I still feel there must be a more sophisticated way to import/cleanse the data.

    Hope I made some sense

    🙂

    As always, any direction or advice for a newbie like me would be beneficial.

    thanks

  • It always takes a long time when you load the data into the data warehouse table the first time.

    Anyway, where is the data coming from? Do you have multiple sources feeding the customer table?

    Usually, after the first load, people do incremental loading. That is, during the ETL part, only extract data from the source system if it is new or has been updated. Then you have a small set of input data to check against the dimension table to see if each record already exists.
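
    A minimal sketch of what incremental extraction could look like, assuming the source table carries a LastModifiedDate column and you keep a small log of load times (the column and table names below are assumptions for illustration):

    -- Hypothetical incremental extract: only pull rows changed since the last load.
    DECLARE @LastLoad datetime;
    SELECT @LastLoad = MAX(LoadDate) FROM dbo.ETL_LoadLog;   -- assumed load-log table

    INSERT INTO Staging.dbo.Customer_Stg (CustomerID, FirstName, LastName, Email)
    SELECT c.CustomerID, c.FirstName, c.LastName, c.Email
    FROM SourceDB.dbo.Customer c
    WHERE c.LastModifiedDate > @LastLoad;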

    The implementation time depends on how big your data warehouse is. The first time, I spent two years building an enterprise data warehouse. That's why people tend to build data marts instead.

    Yes, 2 years sounds about right to me. It takes a long time to make a data warehouse robust, correct and fast!

    With regard to dedupe, this will become a vast program that you need to make as fast as possible. Note that data warehouses are normally required to be data enrichment databases. For example, if you have 2 duplicate records where one has an email address but no phone number and the other one has a phone number but no email address, then your new master record should have both the phone number and the email address. It's not a simple case of throwing away one of the records!
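
    As a rough sketch of that enrichment idea (the Phone column is assumed; the other columns appear elsewhere in the thread): MAX() ignores NULLs, so grouping the duplicates lets each surviving row pick up whichever contact details exist on any of its copies.

    -- Sketch only: collapse duplicates while keeping the best-populated contact details.
    SELECT FirstName, LastName,
           MAX(Email) AS Email,   -- picks a non-NULL email if any duplicate has one
           MAX(Phone) AS Phone    -- likewise for the phone number
    FROM Customers_Dimension
    GROUP BY FirstName, LastName;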

    If you're running out of memory and you're unable to get more, then you need to do less work. Therefore, initially why don't you identify and dedupe EXACT matches first. Then, when they're all cleared, start on the fuzzy matches.

    One other very important aspect: always store the URN from the source system alongside the URN allocated to the customer within the warehouse. That way, when the record next arrives with a completely different address (because the customer has moved house and this has been recorded on the source system), you will still be able to match them to their data warehouse record and update their address accordingly, without creating a new customer record.
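
    One way to hold that mapping (all names below are placeholders, not from your design) is a small cross-reference table keyed on the source system and its URN:

    -- Placeholder key-map table: one row per source-system customer,
    -- pointing at the surrogate key used in the warehouse.
    CREATE TABLE dbo.Customer_KeyMap
    (
        SourceSystem      varchar(50) NOT NULL,
        SourceCustomerURN varchar(50) NOT NULL,
        CustomerKey       int         NOT NULL,  -- warehouse surrogate key
        CONSTRAINT PK_Customer_KeyMap PRIMARY KEY (SourceSystem, SourceCustomerURN)
    );

    When a changed record arrives, you look up its CustomerKey through this table and update the existing dimension row instead of inserting a new customer.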

  • Thanks again!

    The expectation for me is currently a couple of months. I'm not sure whether that is a realistic expectation, or how to convince others of that.

    Getting back to the real issue...

    De-duplicating the data is not in itself that complicated. What messed it up was trying to use SSIS to sort/de-duplicate. It couldn't handle the amount of data, and its transformations kept eating up too much memory.

    So I'm attempting to do it manually... grouping by attributes such as fname, lname, email, etc., and then deleting all but the row with the max id in each subset.

    Would using indexes improve query/ETL performance?

    Regards,

    JD

    How do you define a duplicate customer in the system: the same customer number, the same name, the same address? And are you going to put an index on all of those fields? That will actually hurt performance.

    You have to ask your users about de-duplication. If the customer name is the same but the address is different, do you update the address or do you add another record?

    Actually, I want to know why you keep saying SSIS has a problem de-duplicating the records. What exactly is the problem?

    I am currently working on a business intelligence solution for a variety of systems. The end goal is to have a data warehouse that enables our customers to pull data from all of their systems. The overall goal is twofold: reporting from Reporting Services (or a similar report-writing tool) and providing a cube. In the process, we are copying and cleansing data from the production server to a report server. The database in question is a multi-million-transaction-line database.

    I ran into a few issues regarding memory early on. My first change to the process was setting the target (data warehouse) database's recovery model to Simple. This reduced the amount of space the log takes up during the cleansing process - very important!
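
    For reference, switching the warehouse database to the simple recovery model is a one-liner (the database name below is a placeholder):

    ALTER DATABASE DataWarehouse SET RECOVERY SIMPLE;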

    Other memory issues I found were regarding the location of the files. Make sure that your drives haven't been partitioned. I have often installed in the default location on the C drive, only to find that the server was partitioned for all data to be on another larger drive.

    The next issue we ran into with memory and transaction times was working over a linked server. I was originally working off temp tables and source tables to import into the dimension tables. This was not an issue when the system was located on the same server. However, when working with linked servers, the same query would not complete. I discovered that the system was copying the entire remote tables over and holding them in memory - ouch! We finally decided to copy the source tables into a staging database on the data warehouse server. We opted to do this with a T-SQL query that we turned into a procedure. (We have found SSIS is often not the smoothest when run through the job agent.)
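
    A simplified sketch of that staging copy, assuming a linked server called SourceServer and made-up database/table names - the point is to land the remote rows in a local staging table before doing any joins:

    -- Copy the remote source table into a local staging table first,
    -- so later joins run entirely on the warehouse server.
    INSERT INTO StagingDB.dbo.Customer_Stg (CustomerID, FirstName, LastName, Email)
    SELECT CustomerID, FirstName, LastName, Email
    FROM [SourceServer].SourceDB.dbo.Customer;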

    All of the cleansing we have done through T-SQL procedures. We have encrypted these procedures, as we are selling this product and want to protect the code from prying eyes. 🙂 We use a SQL Server Agent job to schedule the procedure(s).
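
    The WITH ENCRYPTION option is what hides the procedure text; a bare-bones sketch (the procedure name and body are placeholders):

    CREATE PROCEDURE dbo.usp_CleanseCustomer
    WITH ENCRYPTION
    AS
    BEGIN
        SET NOCOUNT ON;
        -- de-duplication / cleansing statements would go here
    END;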

    The scripts populate a star-schema data warehouse. Once the data is in the fact and dimension structures, it can be used immediately in SSRS to create a report model or a report directly from the source.

    If you wish to have a cube, you can then use SSAS to create the cube. This has its own set of tasks to learn - such as building the cube, partitioning, processing, etc etc.

    We have been working on version one of our data warehouse for the past four months. We are in the process of releasing phase one for two different systems. I spend a good 20 hours a week just on these two data warehouses - on top of my other tasks of report writing, needs analysis, and so forth.

    I recommend that you sit down with your company and set a phase one deliverable for a few months. Keep it simple and focused on a small subset of needs. Once you have proven the architecture works in your environment, move to stage two, and so forth. It will take a little business analysis to figure out what they "really need" right now. In the end, it will make the entire data warehouse much stronger!

    I also recommend the book: Delivering Business Intelligence with Microsoft SQL Server 2005. It is available at most major bookstores and online. This book helped me get up to speed on the world of Business Intelligence. It includes everything from the theories of data warehousing to data restructuring and how to use the various SQL tools. I have not used all of the tools they describe (due to time and the fact that some of our customers will be on SS2000) but intend to at some point!

    Good luck!

    Alexis Parker

    Consultant - Venue Management

    I'm trying to de-duplicate data from the customers table:

    1] If there is a match on firstname, lastname, email, then it is a duplicate --- delete the duplicates.

    2] If a match cannot be found on first name, last name and email, then I try fname, lname, ssn; then fname, lname, addr; and so on.

    Maybe when I use the Sort transformation in SSIS (sorting on fname, lname, email, addr, ssn) it can't cope, and hence it runs out of memory.

    I tried running a simple delete statement on the customers_dimension table:

    DELETE FROM customers_dimension
    WHERE customerKey NOT IN
    (
        SELECT MAX(customerKey)
        FROM customers_dimension
        GROUP BY firstname, lastname, email
    )

    The query ran for 17 hrs; I stopped it since I didn't want to consume more resources.

    There are about 140 million records.
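
    An alternative would be ROW_NUMBER() (available from SQL Server 2005), which can express the same "keep the max key per group" rule without the NOT IN subquery - a rough sketch, untested at this volume:

    -- Alternative de-dup pattern using ROW_NUMBER().
    -- Rows are numbered within each duplicate group, highest customerKey first;
    -- everything after the first row in a group is deleted.
    WITH ranked AS
    (
        SELECT customerKey,
               ROW_NUMBER() OVER (PARTITION BY firstname, lastname, email
                                  ORDER BY customerKey DESC) AS rn
        FROM customers_dimension
    )
    DELETE FROM ranked
    WHERE rn > 1;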

    Please do let me know about your thoughts

    Thanks

  • Is there a reason you are using the group by in this case? Have you attempted to run this on a smaller set of data with and without the group by? The group by will definitely increase the time it takes to run a 140 million record query.

  • I'm not really sure how to identify duplicates without using the group by clause.

    Secondly, to really identify the duplicates, it would have to sort through the entire set of data first. That's why I figure I need it.

    I'm also wondering if this could work:

    1] Select all distinct rows (SELECT DISTINCT fname, lname, email FROM customers_dimension) into another table

    2] Do a join from this second table back onto the parent customers table (on the id, fname, lname, email, addr, ssn)

    I'm not sure if the DISTINCT clause would work across multiple tables, though.
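
    Roughly, step 1] would be something like this (the target table name is just made up):

    -- DISTINCT applies across all the listed columns of a single result set.
    SELECT DISTINCT fname, lname, email
    INTO dbo.customers_distinct        -- new work table
    FROM customers_dimension;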

    ---> I'm currently running the GROUP BY with a slightly different syntax:

    DELETE FROM c1
    FROM customers_dimension c1
    INNER JOIN
    (
        SELECT MAX(customerKey) AS customerKey,
               CustomerID, firstname, lastname, email
        FROM Customers_dimension
        GROUP BY CustomerID, firstname, lastname, email
        HAVING COUNT(*) > 1
    ) c2
    ON (c1.CustomerID = c2.CustomerID
        AND c1.firstname = c2.firstname
        AND c1.lastname = c2.lastname
        AND c1.email = c2.email
        AND c1.customerKey <> c2.customerKey)

    It has been running for the past hour... it has worked on really small subsets (say 10 rows or so), so that's why I have persisted with the above methodology.

    Let me know your thoughts...

    Thanks

  • I have thought of another solution...

    The initial problem occurred when I tried to join data across 9 different customer tables (on the customer id).

    For some reason, duplicates were created of each and every customer record. This caused the size to balloon from 35 million to 140 million records.

    I am thinking of:

    Just de-duplicating the customers table (the customers table has fname, lname and email, but does not have the address or SSN, which are also deciding parameters when trying to de-duplicate the data).

    Then adding the fields from the other tables for the records that still remain in the de-duplicated customers table.

    Two problems that I see arising are:

    1] On what fields do I join? Meaning, after de-duplicating the data and then inserting data from the 8 other customer tables, how do I join? I'm not sure the customer id alone would be sufficient, since the other tables will probably also have duplicates.

    2] This may be silly, but: how do I insert into the new DB without it just appending?

    Any suggestions friends?

    Thanks for all your input so far:)

    Maybe you should take care of each customer table first; each dataset is smaller and easier to handle.

    To find customers with duplicate records:

    SELECT FirstName, LastName, Email, COUNT(*)
    FROM Customer
    GROUP BY FirstName, LastName, Email
    HAVING COUNT(*) > 1

    After you de-duplicate each customer table, load the first one into the staging table. Then, when you load the second one, find out which records are already in the staging table and do not load those records:

    INSERT INTO CustomerDimStaging (....)
    SELECT c2......
    FROM Customer2 c2
    LEFT OUTER JOIN CustomerDimStaging s ON s.FirstName = c2.FirstName
        AND s.LastName = c2.LastName
        AND s.Email = c2.Email
    WHERE s.FirstName IS NULL AND s.LastName IS NULL AND s.Email IS NULL

    Thanks!! I get what you are saying; however, there are 9 customer tables, and the only attribute I could use to join with the Dim table would be the customerID. This is because I have separate tables for address, ssn, etc.

    So the query you gave me would be modified to:

    INSERT INTO CustomerDimStaging (....)
    SELECT c2......
    FROM Customer2 c2
    LEFT OUTER JOIN CustomerDimStaging s ON s.customerid = c2.customerid

    I tried a query similar to this; however, the CustomerDimStaging and customer tables reside in different databases (schemas), and I receive errors when I try to access them. It seems relatively straightforward, but I cannot access the two different databases in one query [for example, using datawarehouse.dbo.customerDimstaging and transactDB.dbo.customers].

    Do let me know..thanks!

    Additionally, I have some more questions:

    After integrating and removing duplicates, can I move directly to using this data source for building reports with SSRS?

    Does it matter (functionally and performance-wise) if I create an SSAS database first? My thinking is that I skip SSAS for now, start with reporting, and later use SSAS when the company needs to do data mining.

    Please do let me know if this thinking is correct?

    Jude

    If the tables are in different databases, you need to use databasename.owner.tablename in the query. If the databases are on different servers, you need to create a linked server; otherwise you can't access the database on the other server.
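
    Using the names from your example, that would look roughly like this (the linked-server name in the second query is just a placeholder):

    -- Same server, different databases: three-part names work in one query.
    SELECT *
    FROM datawarehouse.dbo.customerDimStaging s
    INNER JOIN transactDB.dbo.customers c ON c.customerid = s.customerid;

    -- Different server: four-part name through a linked server.
    SELECT *
    FROM [LinkedServerName].transactDB.dbo.customers;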

    I create the dimension tables and fact table in SQL Server. Then I go to SSAS to create the cube.
