RE: Brainstorming on strategies for a database with some data warehouse characteristics but with some updating required

  • lloyd-236007 (7/14/2010)


    Anye Mercy (7/13/2010)


    Each job has its own database. The schema of an individual job database is about 98% the same between jobs, with a couple of tables that are dynamically built based on user selections at job creation time.

    Why would you need each job to have its own database? I think it could be achieved with a single table, since you mention they are the same between jobs. You could think about using horizontal partitioning here if the job tables have lots of data; that way your queries will run over fewer rows.

    Each job belongs to a different customer, and they don't want their data mingled. Also, since each job is tens of thousands to millions of rows and there is no limit on the number of jobs, the tables would get massive.

    I also said only some tables are the same between jobs -- there are some dynamic tables as well.

    This part of the design will not be changed.

    The web application that is powered by these databases only accesses a single job at a time, so a single database at a time. There is never a need to mix data between jobs.

    Even more reason to use a single table in a single database; you would also avoid having to generate schema at run time.

    I don't follow - this aspect of it is exactly why it makes more sense to have multiple databases. You will never need to run a query that hits different customers' data at the same time.

    Additionally, there are actually two modules in the web application that can access/query the data in different ways. They share the schema, and there is a table with rows indicating whether a document is present in module 1 and/or module 2, so all queries have to join to this table and look for the appropriate module row. I considered just adding flags to the document "master" table to indicate existence in one module or the other, but that didn't seem as extensible if we add additional modules -- though I may revisit this decision for performance reasons.

    It might make sense to store it in one table with a flag of value 1 or 0 and make that the clustered index, since the data is then sorted on that field. You would have to update your queries to use that field, but in my opinion that's much better than going to a separate table.

    I think we are in agreement here.
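
    For illustration, here is a rough sketch of what folding the module membership into the master table could look like. The table, column, and index names below are made up for the example, not our actual schema:

        -- Hypothetical: module flags on the document master table instead of a separate module table
        ALTER TABLE dbo.DocumentMaster ADD
            InModule1 bit NOT NULL DEFAULT (0),
            InModule2 bit NOT NULL DEFAULT (0);

        -- A supporting index; whether the flag belongs in the clustered key instead would need testing
        CREATE NONCLUSTERED INDEX IX_DocumentMaster_InModule1
            ON dbo.DocumentMaster (InModule1, DocumentID);

        -- Queries then filter on the flag directly instead of joining to the module table
        SELECT d.DocumentID
        FROM dbo.DocumentMaster d
        WHERE d.InModule1 = 1;

    The extensibility trade-off mentioned above is visible here: adding a new module would mean adding a column rather than just inserting a row.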

    Things that I think are contributing to the performance issues: the practice of "soft deletes" rather than hard deletes. I am doing this because our customers are notorious for making mistakes and changing their minds, and it is easier for us to back out changes when the rows are soft-deleted. But with the large jobs, when all the queries have to filter out the deleted rows, this could be adding time. It also increases the size of the tag table considerably, as similar rows may be added, soft-deleted, and added again -- though with bulk tag operations that is faster than looking for individual rows to undelete. And of course, the module join table probably slows it down a little too.

    You could try using a timestamp field that indicates when something was deleted, along with a deleted column; that way you wouldn't have to join.

    The LastModifedDate field already holds when the record was deleted. I don't think that solves the issue of mixing updates and inserts in a bulk process, though -- for example, if they are tagging 10,000 rows at once, of which 1,200 previously had that tag but it had since been soft-deleted, we'd need to insert 8,800 records and update 1,200, as opposed to just ignoring the previously deleted rows and inserting all 10,000.
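
    To make that concrete, the "update the resurrected rows, then insert the rest" pattern on SQL Server 2005 (no MERGE until 2008) would look roughly like this. All table and column names are invented for the sketch, and it assumes the batch of document IDs has already been loaded into a temp table:

        -- #TagBatch holds the 10,000 document IDs being tagged in this bulk operation
        DECLARE @TagID int;
        SET @TagID = 42;  -- the tag being applied (example value)

        -- Un-delete the rows that already exist for this tag (the 1,200 case)
        UPDATE dt
        SET    dt.IsDeleted = 0,
               dt.LastModifiedDate = GETDATE()
        FROM   dbo.DocumentTag dt
        INNER JOIN #TagBatch b ON b.DocumentID = dt.DocumentID
        WHERE  dt.TagID = @TagID
          AND  dt.IsDeleted = 1;

        -- Insert rows for documents that have never had this tag (the 8,800 case)
        INSERT INTO dbo.DocumentTag (DocumentID, TagID, IsDeleted, LastModifiedDate)
        SELECT b.DocumentID, @TagID, 0, GETDATE()
        FROM   #TagBatch b
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dbo.DocumentTag dt
                           WHERE  dt.DocumentID = b.DocumentID
                              AND dt.TagID = @TagID);

    Both statements are set-based, so the cost is two passes over the batch rather than row-by-row lookups; whether that actually beats the "always insert" approach would need testing against our data.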

    My questions --

    Is there a way to design a system that gets the benefits of data warehousing for querying purposes but still allows some very specific data to be updated and queried? Maybe a way to get a list of documents based on the static parts of the query from a data warehouse and then somehow join them to the non-static data in a standard SQL Server database?

    I think that is possible, but then you would need to join across a DW database and a SQL database. If you have ROLAP on the DW side it might be able to do what you are asking, but then it would defeat the purpose of the DW since it would have to update the aggregates in real time.

    Can you please elaborate on this? I have not worked with data warehouses, so I am only somewhat familiar with how they work. What I am envisioning (with no idea of the feasibility) is to house only the static data in the data warehouse, not the fields and tables that users can modify. I wouldn't think a solely DW solution would work because of the need to update. Is it possible to join across the two? Or to query the DW, populate a temp table with the list of IDs it returned, and join that in the SQL? Having never queried a DW, I don't know how similar that is, or whether you can query it from .NET, etc.
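
    For what it's worth, the "pull IDs from the warehouse, join locally" idea could be sketched on SQL Server 2005 with a linked server and OPENQUERY. Everything below (the linked server name, the warehouse table, the filter, and the local column names) is purely hypothetical to show the shape of the approach, not something taken from our schema:

        -- Pull the document IDs that satisfy the static criteria from the warehouse.
        -- DW_SERVER is a hypothetical linked server; the inner query could be SQL against
        -- a relational warehouse or MDX against an SSAS cube.
        SELECT CAST(q.DocumentID AS int) AS DocumentID
        INTO   #MatchingDocs
        FROM   OPENQUERY(DW_SERVER,
                   'SELECT DocumentID FROM dbo.FactDocument WHERE DocType = ''Email''') AS q;

        -- Join the static result set to the locally maintained, updatable data
        SELECT d.DocumentID, d.ReviewStatus
        FROM   dbo.DocumentMaster d
        INNER JOIN #MatchingDocs m ON m.DocumentID = d.DocumentID
        WHERE  d.IsDeleted = 0;

    From .NET, an Analysis Services cube can also be queried directly with MDX through ADOMD.NET, so the ID list could just as easily be fetched in the application and passed to SQL Server instead of going through a linked server.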

    Thanks for the ideas!

    --
    Anye Mercy
    "Service Unavailable is not an Error" -- John, ENOM support
    "You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
    "Civilization exists by geologic consent, subject to change without notice." -- Will Durant