As part of a recent project I was faced with a problem. The new "proof of concept" corporate OLAP data warehouse that we were developing was starting to grow way beyond its initial specification, and the time taken to carry out daily processing would soon become difficult to explain to management. There was one clear area where the increase in processing time was becoming unsustainable: the preparation of the fact and dimensional tables in a SQL Server database that were then used by SSAS to populate the Analysis Services cube.
The increase in processing time was a worry, as this was only a small data warehouse, with three fact tables of 40 to 50 million rows and a dozen dimensions, none of which exceeded a hundred thousand rows. It is worth noting, however, that this process always rebuilt the cube completely - and completely repopulated the dimensional tables - as this was fundamental to the nature of the data warehouse project. In fact, new source data, or updated and cleansed source data from the past, was continually being added, and so it was impossible merely to add new data on an incremental basis. This allowed the test users to define reports while the data was being pieced together, and so accelerate the development of the entire BI solution. Hey! That's Agile development!
So, if we exclude the hundreds of hours of work that went into the data cleansing and aggregation, we were looking at a data remapping exercise, which had to be accelerated. In essence a relational OLTP data structure had to be transformed into a dimensional OLAP structure - but processed in the shortest possible time.
Here, then, is a high-level overview of the process:
The data staging was an extremely complex process, which amalgamated data from nearly a dozen source systems, and fed the source data into a relational model. This relational model was then used to create the dimensional model in SQL Server. We were not given the option of altering the data or processing model for various reasons, but had to find ways of accelerating the preparation of the SQL Server tables which lay behind the Analysis Services DSV which populated the cube.
Why did we need to go faster? Simply because this was a daily process, and the relational to dimensional mapping was taking over four hours - whereas the cube processed in 40 minutes, after some intensive optimization. Corporate IT strategy imposed both the overall multi-database architecture, and the use of surrogate IDs in SSAS.
Relational to dimensional mapping
The process steps for the relational to dimensional mapping were as follows:
- The initial fact tables were defined to hold all the fact data as well as all data that would allow the creation of surrogate IDs to map to all dimensions. Multiple empty columns were also created to hold the surrogate IDs.
- The fact tables were populated by joining data tables from the Normalized database, and extracting all the fact data, as well as the data that would allow mapping to the dimension tables.
- All dimension tables were created in the Dimensional database, then with each process they were truncated and populated using INSERT INTO ... SELECT statements based, again, on joining data tables from the Normalized database. Each dimension table had a surrogate ID to identify each record uniquely.
- The fact tables were then updated with the surrogate IDs from the dimension tables. This process involved multiple sequential updates to the fact tables, often requiring complex mappings between the fact and dimension tables.
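To make the cost of that last step concrete, each dimension required an update of roughly this shape. This is an illustrative sketch only - the table and column names here stand in for the project's real ones:

```sql
-- Illustrative only: one such UPDATE ran per dimension, writing the
-- surrogate ID back into the very wide fact table.
UPDATE F
SET    F.ClientID = D.ClientID
FROM   dbo.Fact_Sales F
INNER JOIN dbo.Dim_Client D
       ON D.ClientName = F.ClientName;
```

Run a dozen times over fact tables of 40 to 50 million rows, updates of this kind came to dominate the processing window.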
As you can imagine, the actual data reorganisation, from relational to dimensional, was not technically difficult. It was essentially a series of lookup operations. The problem, as I said in the introduction, was the ever-increasing slowness of the process. Analysis of the T-SQL steps showed that the majority of the duration of the process was spent updating the fact table, which was extremely large given that it had to contain facts, mapping data and eventually the surrogate IDs. In a perfect world, of course, we would be doing incremental updates to get round the problem of the ever-increasing processing time. However, we were adding new data from different corporate sources with alarming frequency, and these data sources affected the entire dataset - as did updates to existing sources - so we could never say for sure that data was fixed in time. This meant that a standard time-based incremental update was out of the question. Anyway, it was considered good practice to reprocess everything, so that is what we had to do...
Still, necessity being the mother of invention, what was the solution that we came up with?
After testing various indexing strategies, it became clear that indexing alone was not going to provide the required increase in speed. It helped, certainly, with the dimension mapping, but further gains were needed. Another approach was required to avoid updating tens of millions of records. So, if updates were not the answer, perhaps inserts could be. This idea became more appealing when we noted (with relief) that all the databases involved - SQL Server 2005 databases - used the simple recovery model, as is perhaps normal for ETL databases, so bulk operations such as SELECT ... INTO would be minimally logged. Partitioning was not an option, as we were not using the Enterprise edition of SQL Server.
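If you want to verify the same precondition on your own ETL databases before trying this approach, a quick check along these lines will do (the database names are placeholders for your own):

```sql
-- Confirm that the ETL databases use the SIMPLE recovery model,
-- which is what allows SELECT ... INTO to be minimally logged.
SELECT name, recovery_model_desc
FROM   sys.databases
WHERE  name IN ('Normalized', 'Dimensional');  -- placeholder names

-- If not, and if full point-in-time recovery is not needed here:
-- ALTER DATABASE Dimensional SET RECOVERY SIMPLE;
```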
After some thought, we found ways of basing the entire data reorganization, from OLTP relational to OLAP dimensional, on inserts alone. It was done this way:
- The dimension tables were created, just like in the initial model - except that now the dimension tables were dropped at the start of the process, then re-created using SELECT ... INTO statements.
- For each fact table, a staging table was created containing all the fact data (the "metrics") and the data which would allow mapping to dimension tables - I suggest calling this "Lookup data". Unlike in the original model, this table did not contain the columns for the surrogate IDs. This table was materialized into the Dimensional database.
- For each dimension table, a temporary table was created containing only the ID from the fact table, and the surrogate ID from the dimension table.
- Finally, when all the temporary tables were created, a final query joined each temporary fact table to the relevant temporary dimension tables, and only the fact data and the surrogate IDs were inserted into the final fact table.
- Any staging tables that had been persisted to disk were deleted.
As this may seem more complicated than it really is, perhaps a diagram will help:
The advantages to this approach were:
- No updates at all were required.
- All inserts were SELECT ... INTO, which was considerably faster than INSERT INTO ... SELECT (at least on SQL Server 2005), and was minimally logged.
- The temporary tables used to hold the fact ID and the surrogate ID are extremely narrow, and only contain integer data. This makes the tables extremely small, even when containing millions of rows.
The end result was a reduction in processing time by a factor of nearly five. So a four and a half hour process was now completed in just over one hour. And before you ask: yes, it did give exactly the same results once the cube was processed.
Obviously I cannot give the code for the entire data warehouse, but I can give a much simplified example using one fact table (sales data) and three dimensions (geography, clients and products). This will, I hope, be a basis for you to try and adapt and extend to your own requirements. Clearly, this is the smallest data warehouse that you are ever likely to meet, and it will, hopefully, process in a couple of seconds. However if you apply these techniques to larger data sets, I hope that you will meet the sort of acceleration that we did in our project. The code (and the minuscule amount of purely fictitious data required to test it) is provided in the attached source code file.
As background information, imagine that you are working for a speciality sports car wholesale dealership, and that you have an OLTP database containing sales data (which becomes the Normalized database in the first drawing above), and this must be denormalised into a dimensional structure (the Dimensional database in the first drawing above). To reduce the complexity everything in this example is in the same database - though in practice this will almost certainly not be the case.
In the attached code sample, the following tables represent the "normalized" data:
Schematically, it is like this:
And the following are the "Dimensional" data:
Schematically, it is like this:
So here is the T-SQL code which produces the output described above. I have reduced it to one stored procedure in the attached code file, for simplicity's sake:
DROP TABLE dbo.Dim_Client
DROP TABLE dbo.Dim_Geography
DROP TABLE dbo.Dim_Product
DROP TABLE dbo.Fact_Cars

-- Create staging fact table
-- (SalePrice stands in for your real metric columns)
SELECT
     IDENTITY (INT,1,1) AS ID
    ,C.ClientName
    ,S.Make
    ,S.Marque
    ,S.Colour
    ,L.SalePrice
INTO dbo.Stg_Fact
FROM CarSales.dbo.Client C
INNER JOIN CarSales.dbo.Invoice I
    ON C.ID = I.ClientID
INNER JOIN CarSales.dbo.Invoice_Lines L
    ON I.ID = L.InvoiceID
INNER JOIN CarSales.dbo.Stock S
    ON L.StockID = S.ID

-- Create three dimensions - Clients, Geography and Products
-- Clients first
SELECT
     IDENTITY (INT,1,1) AS ClientID
    ,ClientName
INTO dbo.Dim_Client
FROM CarSales.dbo.Client
GROUP BY ClientName

-- Then Geography
SELECT
     IDENTITY (INT,1,1) AS GeographyID
    ,Country
    ,County
    ,Town
INTO dbo.Dim_Geography
FROM CarSales.dbo.Client
GROUP BY Country, County, Town

-- Finally Products
SELECT
     IDENTITY (INT,1,1) AS ProductID
    ,Make
    ,Marque
    ,Colour
INTO dbo.Dim_Product
FROM CarSales.dbo.Stock
GROUP BY Make, Marque, Colour

-- Now create the temp tables for surrogate IDs
SELECT F.ID, D.ClientID
INTO #Tmp_Client
FROM dbo.Dim_Client D INNER JOIN dbo.Stg_Fact F
    ON D.ClientName = F.ClientName

SELECT F.ID, D.GeographyID
INTO #Tmp_Geography
FROM dbo.Stg_Fact F
INNER JOIN CarSales.dbo.Client C
    ON F.ClientName = C.ClientName
INNER JOIN dbo.Dim_Geography D
    ON C.Country = D.Country
    AND C.County = D.County
    AND C.Town = D.Town

SELECT F.ID, D.ProductID
INTO #Tmp_Product
FROM dbo.Dim_Product D INNER JOIN dbo.Stg_Fact F
    ON D.Make = F.Make
    AND D.Marque = F.Marque
    AND D.Colour = F.Colour

-- And finally create the final fact table
SELECT
     C.ClientID
    ,G.GeographyID
    ,P.ProductID
    ,F.SalePrice
INTO dbo.Fact_Cars
FROM dbo.Stg_Fact F INNER JOIN #Tmp_Client C
    ON F.ID = C.ID
INNER JOIN #Tmp_Product P
    ON F.ID = P.ID
INNER JOIN #Tmp_Geography G
    ON F.ID = G.ID

-- And to clean up, drop the staging fact table
DROP TABLE dbo.Stg_Fact
- The surrogate IDs were created using the IDENTITY (INT, 1, 1) function. You could also use ROW_NUMBER() if you prefer.
- Adding indexes to the temporary tables (at least in our specific database) proved pointless, as adding the indexes took longer than any speed gain obtained when joining the temporary tables to create the final fact table. However, this will probably not be the case with all data sets, so I can only suggest that you test with your data. Indexing the OLTP tables, especially if they will be used later in the process to look up the surrogate IDs, can prove beneficial.
- Using #temporary tables (in tempdb) provided a clear speed increase over persisting the dimension temporary tables to disk in the Dimensional database - but this was probably due to the fact that tempdb was on a separate disk array, and had been optimized for the hardware (i.e. as many tempdb data files as there were processors).
- Making the staging fact table as narrow as possible is key to accelerating the creation of this table - and improving lookup times when using it to get surrogate IDs. So referring to source tables (as shown here in the case of the Geography dimension) can be a useful technique.
- You may be required, in a real-world situation, to perform quite complex SQL to create both the dimension tables, and the temporary tables containing the fact IDs and the surrogate IDs. Remember that intermediate temp tables to group data by levels of granularity can speed up (as well as simplify) this process.
- You can, of course create the dimensional tables alongside the relational tables in the same database - at least for proof of concept.
- This process can be optimized further by placing the database files on different disk spindles and/or by materializing the temp tables to a database on yet another disk array. All will depend on how far you want to go to improve throughput.
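As a sketch of the ROW_NUMBER() alternative mentioned in the first point above, a dimension could be built like this, using the same illustrative tables as the main example:

```sql
-- ROW_NUMBER() generates the surrogate ID instead of IDENTITY (INT,1,1)
SELECT ROW_NUMBER() OVER (ORDER BY ClientName) AS ClientID
      ,ClientName
INTO   dbo.Dim_Client
FROM   CarSales.dbo.Client
GROUP BY ClientName;
```

One advantage of this form is that you control the ordering of the surrogate IDs explicitly through the OVER clause.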
And a warning:
This sample data contains no NULLs and no unavailable data, so it all processed perfectly. In the real world, you may well have to handle "unavailable" data in the denormalisation phase. This could require the judicious use of OUTER JOINs to ensure that the dimensional mapping does not exclude data from the fact table. In practice we used outer joins, both to create the temporary tables of surrogate IDs and to create the final, definitive fact table, without any problems. There was a perceptible speed hit, but in our data set it made little measurable difference whether the outer joins came early, creating (larger) temporary tables of surrogate IDs, or were saved for the final query that builds the fact table. I imagine that this could vary with different data sets.
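As a hedged sketch of the outer-join variant - and assuming, as a common convention, that you have seeded each dimension with an "Unknown" member whose surrogate ID is -1 - the temp table creation could look like this:

```sql
-- LEFT OUTER JOIN keeps fact rows with no matching client;
-- unmatched rows are mapped to the "Unknown" member (-1),
-- which is assumed to exist in dbo.Dim_Client.
SELECT F.ID
      ,ISNULL(D.ClientID, -1) AS ClientID
INTO   #Tmp_Client
FROM   dbo.Stg_Fact F
LEFT OUTER JOIN dbo.Dim_Client D
       ON D.ClientName = F.ClientName;
```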
Of course, this process can be adapted to an incremental data load. Some ideas for doing this are:
- The dimension tables could be updated, if there are few changes to the data.
- On SQL Server 2008, INSERT INTO ... SELECT statements can also be minimally logged, provided that certain conditions are met (an empty or heap target and a TABLOCK hint, among others). So a similar approach could be used with a slightly different syntax.
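For that SQL Server 2008 case, the final step could then look something like the following sketch (SalePrice is an illustrative metric column, and the target table is assumed to be an empty heap so that minimal logging applies):

```sql
-- Minimally logged INSERT ... SELECT on SQL Server 2008:
-- empty heap target plus the TABLOCK hint, among other conditions.
INSERT INTO dbo.Fact_Cars WITH (TABLOCK)
       (ClientID, GeographyID, ProductID, SalePrice)
SELECT  C.ClientID
       ,G.GeographyID
       ,P.ProductID
       ,F.SalePrice
FROM    dbo.Stg_Fact F
INNER JOIN #Tmp_Client C    ON F.ID = C.ID
INNER JOIN #Tmp_Geography G ON F.ID = G.ID
INNER JOIN #Tmp_Product P   ON F.ID = P.ID;
```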
Anyway, this is an idea which I hope can be of some use to other SQL Server developers - and I would be delighted to hear any other ideas which speed up data warehouse preparation!