RE: Brainstorming on strategies for a database with some data warehouse characteristics but with some updating required

  • Anye Mercy (7/13/2010)


    Each job has its own database. The schema of an individual job database is about 98% the same between jobs, with a couple of tables that are dynamically built based on user selections at job creation time.

    Why would you need each job to have its own database? I think this can be achieved with a single table, since you mention the schemas are the same between jobs. You could consider horizontal partitioning here if the job tables hold a lot of data; that way your queries scan fewer rows.
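
    A rough sketch of what that consolidation could look like on SQL Server 2005, with hypothetical table names and boundary values; every job's rows live in one table, partitioned on a JobID column:

        -- Hypothetical sketch: per-job databases consolidated into one
        -- partitioned table. Names and range boundaries are assumptions.
        CREATE PARTITION FUNCTION pfJob (int)
        AS RANGE RIGHT FOR VALUES (1000, 2000, 3000);

        CREATE PARTITION SCHEME psJob
        AS PARTITION pfJob ALL TO ([PRIMARY]);

        CREATE TABLE dbo.Document
        (
            JobID      int           NOT NULL,
            DocumentID int           NOT NULL,
            Title      nvarchar(255) NULL,
            CONSTRAINT PK_Document PRIMARY KEY CLUSTERED (JobID, DocumentID)
        ) ON psJob (JobID);

    With JobID leading the clustered key, queries for a single job touch only that job's partition(s).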

    The web application that is powered by these databases only accesses a single job at a time, so a single database at a time. There is never a need to mix data between jobs.

    Even more reason to use a single table in a single database; you also avoid having to generate schema at run time.
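
    The practical payoff, sketched against the same hypothetical table: the application passes the job as a parameter instead of building a database name at run time.

        -- Per-job databases force dynamic SQL along these lines (hypothetical):
        -- EXEC ('SELECT ... FROM [Job_' + @JobID + '].dbo.Document ...');

        -- With a single consolidated table, a plain parameterized query suffices:
        SELECT DocumentID, Title
        FROM dbo.Document
        WHERE JobID = @JobID;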

    Additionally - there are actually two modules in the web application that can access / query the data in different ways. They share the schema, and there is a table that has rows to indicate whether a document is present in module 1 and/or module 2 -- so all queries have to join to this table and look for the appropriate module row. I considered just adding flags to the document "master" table to indicate existence in one module or the other but that didn't seem as extensible if we add additional modules -- but I may revisit this decision for performance reasons.

    It might make sense to store it in one table with a flag column of 1 or 0 and make that field part of the clustered index, so the data is sorted on it. You would then have to update your queries to filter on that field, but in my opinion that is much better than going to a separate table.
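
    A minimal sketch of the flag approach, assuming the dbo.Document master table from above (all names here are hypothetical); whether the flags belong in the clustered index is a separate tuning decision worth testing against your workload:

        -- Hypothetical sketch: module membership as flags on the document
        -- "master" table instead of a separate join table.
        ALTER TABLE dbo.Document ADD
            InModule1 bit NOT NULL CONSTRAINT DF_Document_InModule1 DEFAULT (0),
            InModule2 bit NOT NULL CONSTRAINT DF_Document_InModule2 DEFAULT (0);

        -- Queries then filter directly instead of joining to the module table:
        SELECT DocumentID, Title
        FROM dbo.Document
        WHERE JobID = @JobID
          AND InModule1 = 1;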

    Things that I think are contributing to the performance issues -- the practice of "soft deletes" rather than hard deletes. I am doing this because our customers are notorious for making mistakes and changing their mind - and it is easier for us to back out changes when the rows are soft-deleted. But, with the large jobs, when all the queries have to filter out the deleted rows, this could be adding time -- it also increases the size of the tag table considerably as similar rows may be added, soft-deleted, added again -- but with bulk tag operations that is faster than looking for individual rows to undelete. And of course, the module join table probably slows it down a little too.

    You could use a datetime column that records when something was deleted, along with the deleted column; that way you wouldn't have to join, and backing out a bulk delete becomes a single UPDATE over a time range.
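
    A rough sketch of that idea with invented table and column names; one variant lets a nullable datetime double as the deleted flag, so bulk undeletes become a single range UPDATE rather than a hunt for individual rows:

        -- Hypothetical sketch of the soft-delete approach described above.
        ALTER TABLE dbo.Tag ADD
            DeletedAt datetime NULL;  -- NULL means the row is live

        -- Soft-delete a batch of tags in one pass:
        UPDATE dbo.Tag
        SET DeletedAt = GETDATE()
        WHERE JobID = @JobID AND TagGroupID = @TagGroupID;

        -- Undo everything deleted in a given window:
        UPDATE dbo.Tag
        SET DeletedAt = NULL
        WHERE JobID = @JobID
          AND DeletedAt >= @WindowStart AND DeletedAt < @WindowEnd;

        -- Live-row queries filter on the column rather than joining:
        SELECT TagID
        FROM dbo.Tag
        WHERE JobID = @JobID AND DeletedAt IS NULL;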

    My questions --

    Is there a way to design a system to get the benefits of data warehousing for the querying purpose but that can allow for some very specific data to be updated and queried? Maybe a way to get a list of documents based on the static parts of the query from a data warehouse and then somehow join them to the non-static data in a standard SQL Server database?

    I think that is possible, but you would then need to join across the DW database and the SQL database. If you have ROLAP on the DW side it might be able to do what you are asking, but that would defeat the purpose of the DW, since the aggregates would have to be updated in real time.
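
    If both databases sit on the same instance, the join itself is straightforward; a sketch with invented names, where JobWarehouse holds the static, query-optimized copy and JobLive holds the updatable rows (DeletedAt is the hypothetical soft-delete column from earlier):

        -- Hypothetical sketch: warehouse output joined back to live data.
        SELECT d.DocumentID, d.Title, f.SearchScore
        FROM JobWarehouse.dbo.FactDocument AS f  -- static, query-optimized side
        JOIN JobLive.dbo.Document AS d           -- updatable side
            ON d.DocumentID = f.DocumentID
        WHERE f.JobID = @JobID
          AND d.DeletedAt IS NULL;               -- live filter on mutable data

    The catch is exactly the one above: the warehouse side is only as fresh as its last load, so anything that must reflect real-time updates has to stay on the live side of the join.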