Brainstorming on strategies for a database with some data warehouse characteristics but with some updating required

Question

Post reply

Brainstorming on strategies for a database with some data warehouse characteristics but with some updating required

Anye Mercy

SSCrazy

Points: 2134
More actions
July 13, 2010 at 2:44 pm

#387346

Hi all, I apologize in advance for the non-specific nature of this inquiry but I am at the theoretical brainstorming level right now. We're about to start redesigning our database systems and I am trying to come up with the best approach for our business need.
The current problem space:
We are using SQL 2005 Standard Edition. Moving to Enterprise is not feasible because we not only host in-house but also sell the application & database as a "single box solution" and the cost of an Enterprise license would multiply the sales cost of that box by a factor of 10 or more - basically killing that line of business.
1 common database controlling all "jobs", user information, permissions per job etc.
Each job has its own database. The schema of an individual job database is about 98% the same between jobs, with a couple of tables that are dynamically built based on user selections at job creation time.
The web application that is powered by these databases only accesses a single job at a time, so a single database at a time. There is never a need to mix data between jobs.
The function of the web application that uses the job databases is to query the data in the job database based on some ad-hoc set of criteria. We'll call the data "documents" and the properties that it queries based on "metadata". There is a master table listing each document and its common properties, and several other tables that are joined to this table with either 1:1 or 1:many relationships, and the specifics of these relationships can vary. For example, each document belongs to a single category (1:1), but each document can be associated with any number of tags (1:0..n).
Most of the metadata is set at load time -- however, there is some that is user defined through usage of the application, and this information also needs to be queryable. For example, the user can apply tags to the document, and may want to query for all documents that are not tagged, or all docs that have a specific tag or tags. Additionally, the user can create new metadata fields to describe the document, populate data for a document in these fields and then later query these fields for a specific value or being null or not null.
Additionally - there are actually two modules in the web application that can access / query the data in different ways. They share the schema, and there is a table that has rows to indicate whether a document is present in module 1 and/or module 2 -- so all queries have to join to this table and look for the appropriate module row. I considered just adding flags to the document "master" table to indicate existence in one module or the other but that didn't seem as extensible if we add additional modules -- but I may revisit this decision for performance reasons.
Summary -
So -- to me it sounds like this problem space is like a hybrid between a data warehouse and an OLTP. The vast range of possible queries and "decisions" is very data warehouse like, the "transactions" are the tagging actions (where the user can tag one or many documents based on the results of a query) but these will change the data that will be queried the next time.
Our current system -- is a standard database (not datawarehouse) using 3rd normal relational modelling. I have indexes on all the fields that are used in query "where" clauses, FKs in all the appropriate places, but when our master table gets over about 1.5m documents, we start seeing performance issues for some queries and operations.
Things that I think are contributing to the performance issues -- the practice of "soft deletes" rather than hard deletes. I am doing this because our customers are notorious for making mistakes and changing their mind - and it is easier for us to back out changes when the rows are soft-deleted. But, with the large jobs, when all the queries have to filter out the deleted rows, this could be adding time -- it also increases the size of the tag table considerably as similar rows may be added, soft-deleted, added again -- but with bulk tag operations that is faster than looking for individual rows to undelete. And of course, the module join table probably slows it down a little too.
My questions --
Is there a way to design a system to get the benefits of data warehousing for the querying purpose but that can allow for some very specific data to be updated and queried? Maybe a way to get a list of documents based on the static parts of the query from a data warehouse and then somehow join them to the non-static data in a standard SQL Server database?
Is partitioning something that could help me? I am skeptical of horizontal partitioning in this case because there is no single property that we always query on -- we'd have to partition based on the identity column just for balancing purposes and it is very likely that any query a user chooses to run is going to hit multiple or all partitions. The one thing that is commonly part of the query that might make sense to partition by is not a static field -- it is a field that the user changes through the course of using the application -- so that doesn't seem feasible to me (rows would be moving across partitions willy-nilly, if that is even possible).
I am even considering a NoSql solution for this system, but I haven't got enough information there yet to see whether the performance would be better in our business space (I'm working on a proof of concept, but the .NET drivers for the NoSql system I am evaluating are still a bit buggy, so it may be awhile before it is ready for primetime.) But, I also have concerns about integrating NoSql with the 3rd party search indexing software that we are already using, and whether the complex searches will be too much for it.
Thanks in advance. Sorry about the vagueness, I'm really just brainstorming ideas now. I posted in another thread about a specific query issue I was having, but right now I am looking ahead at the big picture and wondering what underlying strategies might prove helpful in this scenario.
--
Anye Mercy
"Service Unavailable is not an Error" -- John, ENOM support
"You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
"Civilization exists by geologic consent, subject to change without notice." -- Will Durant

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

PaulB-TheOneAndOnly SSC Guru Points: 66772 More actions · Answer 1

Okay ... too much going on in a single post, let me pick one...

Anye Mercy (7/13/2010)
Is partitioning something that could help me?

Partitioning is usually implemented to help in administrative tasks like archiving and purging, also to improve query performance when queries have to retrieve a large number of rows in a table storing atomic data.

Having said that... are those candidate-tables 100M rows in size and growing?

If the answer is No... chances are partitioning is not going to help.

Last but not least, you can implement "poor's man partitioning" in Standard Edition, nothing but a hand made bunch of tables (one "per partition") with an umbrella view on top of them.

_____________________________________
Pablo (Paul) Berzukov

Author of Understanding Database Administration available at Amazon and other bookstores.

Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

lloyd-236007 SSC Enthusiast Points: 197 More actions · Answer 2

Anye Mercy (7/13/2010)
Each job has its own database. The schema of an individual job database is about 98% the same between jobs, with a couple of tables that are dynamically built based on user selections at job creation time.

Why would you need each job to have its own database, i think it can be achieved by a table, since you mention they are same between jobs. You could think about using horizontal partitioning here if the job tables have lots of data. That way your query will run over lesser rows.

The web application that is powered by these databases only accesses a single job at a time, so a single database at a time. There is never a need to mix data between jobs.

Even more proof to use it as a single table in a single database and you also avoid having to generate schema at run time.

Additionally - there are actually two modules in the web application that can access / query the data in different ways. They share the schema, and there is a table that has rows to indicate whether a document is present in module 1 and/or module 2 -- so all queries have to join to this table and look for the appropriate module row. I considered just adding flags to the document "master" table to indicate existence in one module or the other but that didn't seem as extensible if we add additional modules -- but I may revisit this decision for performance reasons.

It might make sense to store on one table with a flag of value 1 or 0 and have it as a clustered index since the data then is sorted on that field. You will have to update your queries to use that field then, much better than going to a separate table in my opinion

Things that I think are contributing to the performance issues -- the practice of "soft deletes" rather than hard deletes. I am doing this because our customers are notorious for making mistakes and changing their mind - and it is easier for us to back out changes when the rows are soft-deleted. But, with the large jobs, when all the queries have to filter out the deleted rows, this could be adding time -- it also increases the size of the tag table considerably as similar rows may be added, soft-deleted, added again -- but with bulk tag operations that is faster than looking for individual rows to undelete. And of course, the module join table probably slows it down a little too.

you could try to use a timestamp field that indicates when something was deleted along with a deleted column, that way you wouldn't have to join.

My questions --

Is there a way to design a system to get the benefits of data warehousing for the querying purpose but that can allow for some very specific data to be updated and queried? Maybe a way to get a list of documents based on the static parts of the query from a data warehouse and then somehow join them to the non-static data in a standard SQL Server database?

I think that is possible but then you would need to join across a DW database and a SQL database, if you have ROLAP on the DW side it might be able to do what you are asking but then it would beat the purpose of DW since it would have to update the aggregates at realtime

Anye Mercy SSCrazy Points: 2134 More actions · Answer 3

lloyd-236007 (7/14/2010)
Anye Mercy (7/13/2010)
Each job has its own database. The schema of an individual job database is about 98% the same between jobs, with a couple of tables that are dynamically built based on user selections at job creation time.
Why would you need each job to have its own database, i think it can be achieved by a table, since you mention they are same between jobs. You could think about using horizontal partitioning here if the job tables have lots of data. That way your query will run over lesser rows.

Each job belongs to a different customer, and they don't want their data mingling. Also since each job is 10's of thousands to millions of rows, and there are an unlimited number of jobs -- the tables would get massive.

I also said only some tables are the same between jobs -- there are some dynamic tables as well.

This part of the design will not be changed.

The web application that is powered by these databases only accesses a single job at a time, so a single database at a time. There is never a need to mix data between jobs.
Even more proof to use it as a single table in a single database and you also avoid having to generate schema at run time.

I don't follow - this aspect of it is exactly why it makes more sense to have multiple databases. You will never need to run a query that hits different customers' data at the same time.

Additionally - there are actually two modules in the web application that can access / query the data in different ways. They share the schema, and there is a table that has rows to indicate whether a document is present in module 1 and/or module 2 -- so all queries have to join to this table and look for the appropriate module row. I considered just adding flags to the document "master" table to indicate existence in one module or the other but that didn't seem as extensible if we add additional modules -- but I may revisit this decision for performance reasons.
It might make sense to store on one table with a flag of value 1 or 0 and have it as a clustered index since the data then is sorted on that field. You will have to update your queries to use that field then, much better than going to a separate table in my opinion

I think we are in agreement here.

Things that I think are contributing to the performance issues -- the practice of "soft deletes" rather than hard deletes. I am doing this because our customers are notorious for making mistakes and changing their mind - and it is easier for us to back out changes when the rows are soft-deleted. But, with the large jobs, when all the queries have to filter out the deleted rows, this could be adding time -- it also increases the size of the tag table considerably as similar rows may be added, soft-deleted, added again -- but with bulk tag operations that is faster than looking for individual rows to undelete. And of course, the module join table probably slows it down a little too.
you could try to use a timestamp field that indicates when something was deleted along with a deleted column, that way you wouldn't have to join.

The LastModifedDate field already holds when the record was deleted. I don't think that solves the issue though of mixing updates and inserts in a bulk process -- for example if they are tagging 10,000 rows at once, of which 1200 had previously had that tag which was now deleted, we'd need to insert 8800 records and update 1200. As opposed to just ignoring the previously deleted row and inserting 10,000.

My questions --
Is there a way to design a system to get the benefits of data warehousing for the querying purpose but that can allow for some very specific data to be updated and queried? Maybe a way to get a list of documents based on the static parts of the query from a data warehouse and then somehow join them to the non-static data in a standard SQL Server database?
I think that is possible but then you would need to join across a DW database and a SQL database, if you have ROLAP on the DW side it might be able to do what you are asking but then it would beat the purpose of DW since it would have to update the aggregates at realtime

Can you please elaborate on this? I have not worked with data warehouses so am only somewhat familiar with how they work. What I am envisioning (with no idea of the feasibility) is to only house the static data in the data warehouse, not the fields and tables that users can modify. I wouldn't think that a solely DW solution would work because of the need to update. Is it possible to join across the two? Or to query the DW, populate a temp table with a list of IDs that it returned and join that in the SQL? Having never queried a DW I don't know how similarly that works, whether you can query from .NET etc.

Thanks for the ideas!

--
Anye Mercy
"Service Unavailable is not an Error" -- John, ENOM support
"You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
"Civilization exists by geologic consent, subject to change without notice." -- Will Durant

lloyd-236007 SSC Enthusiast Points: 197 More actions · Answer 4

Can you please elaborate on this? I have not worked with data warehouses so am only somewhat familiar with how they work. What I am envisioning (with no idea of the feasibility) is to only house the static data in the data warehouse, not the fields and tables that users can modify. I wouldn't think that a solely DW solution would work because of the need to update. Is it possible to join across the two? Or to query the DW, populate a temp table with a list of IDs that it returned and join that in the SQL? Having never queried a DW I don't know how similarly that works, whether you can query from .NET etc.

One of the storage mechanisms when creating cubes in DW is called ROLAP, relational olap and if my memory serves me correctly, it can be used with the base database and updates happen on the fly against the static fields. Other mechanisms would be MOLAP and HOLAP, probably you could find some more details of msdn.

Thanks for the ideas![/quote]