Writing Nearly Codeless Apps: Part 5

Posted Friday, November 5, 2010 11:10 AM
SSC Rookie

David Ziffer (11/4/2010)
wbrianwhite (11/4/2010)

Several problems. 1, what if @AsOf is a future date? Your function returns nothing from the live or archive table. 2, the select max audit date subquery does not specify "and AuditStatus <> 'D'", so it can return a date that will be excluded by the select statement. The first one is hypothetical, but the second one is a failure-type error.

3, why select the max(AuditDate) in the subquery instead of selecting the primary key of the archive table? That seems like a better choice, since then the top-level select is selecting where PK = known value.

You could also find the archive row with just a left outer self join instead of a correlated subquery. Something like this (haven't tested it) ... I've seen left outer self joins perform much better than correlated subqueries that use aggregate functions. Though they're a bit harder to read at first, once you're familiar with the pattern they're fine.

Brian: Several answers:

1. It doesn't make much sense to specify a future audit date, so I would not sacrifice any performance to generate code to specially handle that circumstance. But in any event this code can't possibly be sensitive to whether the date you specify is in the future, because the data in the database has no idea what time it is when you run the query, and furthermore the query is not sensitive to the current time. If you think carefully about the algorithm (or just try it out in the RAP example) you will find that putting in a future date will guarantee that you get the latest version of everything, keeping in mind that the latest version of a deleted record is "no record". But you'll get the latest version expensively, because you'll be going through the more calculation-laden archive table query rather than the ultra-simple current-table query.

2. Either you're expecting a different result than I am or your analysis is flawed. The subquery correctly returns whatever record is the most recent one (meeting the search criterion, in this case matching the correct key) that precedes or equals the AuditDate. If that record happens to be a "D" record then the main query correctly returns no record (which is what we want, because the record has been deleted). If that record is not a "D" record then it returns the most recent version of the record preceding or equalling the date. It is not our intention to return some record no matter what; if a record has been deleted prior to or at the AsOfDate, then we don't want to see it at all.

3. The left-join logic for extracting the maximum date looks interesting and I must say that I never considered such a thing. However I think you'd need to reverse the sign so that we're looking for the "A" record where there is no GREATER "B" record (I believe your logic as published would find the minimum "A", not the maximum that we want). But I would have to test this carefully to see how the "where" condition interacts with the join. And we'd have to change the test for "D" because your query here would neglect to properly return "no record" if the AsOfDate follows the deletion time of the record. And I would be very suspicious of the claimed performance improvement ... it seems to me that the number of records this join would produce would rise exponentially with the number of archived records matching the search criterion.

Thanks for your inputs. I am always grateful when people take the time to analyze my stuff. It's just that in this case I am unable to agree with your assessments (except possibly that a left-join could be used in place of the subquery). I would be interested in your responses to my responses and/or your clarifications, in case there's something here that I'm just not getting.


"because the data in the database has no idea what time it is when you run the query,"
GetDate()? You could select from your main table where @AsOf is null or @AsOf >= GetDate(). To me, showing the main record would seem more likely to match users' expectations. If this function is a general-purpose function, I would expect it to end up used in places you aren't anticipating. In our shop we use a history table, where the top row in the history table is the current data (a trigger populates all data into the history table, even current data), so we can join against it for any time period, including the future. It avoids the issue of splitting queries between main and archive as well. What if someone wants to input that user Bob Smith will move from the R&D department to the Implementation department in June of 2011? How could they join against your user data retrieval function when doing so? It seems like it would return 'no such user Bob Smith in June of 2011'. For us, we have the same query to show user department changes for any time in the past or future. So that is definitely shaping my expectations here.
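A minimal sketch of that trigger-populated history pattern (the table, column, and trigger names here are hypothetical, not from the article or the post):

-- main table plus a _hist table that keeps every version, including the current one;
-- the newest row per UserId in Users_hist is always the current data
create table Users (
    UserId    int          not null primary key,
    LoginName varchar(50)  not null,
    Notes     varchar(500) null
);

create table Users_hist (
    UserId    int          not null,
    LoginName varchar(50)  not null,
    Notes     varchar(500) null,
    ValidFrom datetime     not null,
    RowStatus char(1)      not null    -- 'A' = active, 'D' = deleted
);
go

create trigger trgUsers_hist on Users
after insert, update, delete
as
begin
    set nocount on;

    -- inserted and updated rows become the newest history version
    insert into Users_hist (UserId, LoginName, Notes, ValidFrom, RowStatus)
    select i.UserId, i.LoginName, i.Notes, getdate(), 'A'
    from inserted i;

    -- deleted rows (no matching inserted row) get a terminating 'D' version
    insert into Users_hist (UserId, LoginName, Notes, ValidFrom, RowStatus)
    select d.UserId, d.LoginName, d.Notes, getdate(), 'D'
    from deleted d
    where not exists (select 1 from inserted i where i.UserId = d.UserId);
end;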

"If that record happens to be a "D" record then the main query correctly returns no record (which is what we want, because the record has been deleted). "
What if it's been un-deleted since then? The point of using soft-deletes in a status column is that they can be undone later, while physical deletes can't. If you stop reporting on deleted rows you might as well physically delete them and save the disk space and extra query overhead. You could also include a @IncludeDeleted parameter to allow them to be ignored/included as desired. At our shop someone could 'delete' a group. Then realize it was a mistake. Get the group list passing in @IncludeDeleted, and then undelete it.
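A sketch of that optional-parameter idea (the procedure, table, and column names are hypothetical):

-- soft-deleted groups are marked with GroupStatus = 'D'; pass @IncludeDeleted = 1
-- to see them (for example, to undelete one), otherwise they are filtered out
create procedure dbo.GetGroups
    @IncludeDeleted bit = 0
as
begin
    set nocount on;
    select GroupId, GroupName, GroupStatus
    from dbo.Groups
    where @IncludeDeleted = 1
       or GroupStatus <> 'D';
end;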

"And I would be very suspicious of the claimed performance improvement ... it seems to me that the number of records this join would produce would rise exponentially with the number of archived records matching the search criterion."
Run some tests. I have. The only time I like to use correlated subqueries is when doing an "and exists" correlated subquery, because then SQL Server can exit as soon as it finds a single row. Regular correlated subqueries are expensive. SQL Server is built to handle set-based logic efficiently, and the left outer join / "where B is null" approach is exactly that kind of set-based operation. You're right on the condition I put in though: it should be "and B.AuditDate > A.AuditDate ... where ... and B.UserId is null". That will get the A row where there is no later-dated row in that table that is still at or before the input date. I use it on large data sets all the time with good results. You should just be able to do
select A.UserId,
       A.LoginName,
       A.Notes,
       A.AuditDate,
       A.AuditUserId,
       A.AuditStatus
from TBadmUser# A
where A.UserId = @UserId
  and A.AuditStatus <> 'D'
  and A.AuditDate = (select max(B.AuditDate) from TBadmUser# B
                     where B.UserId = A.UserId
                       and B.AuditDate <= @AsOf)
  and @AsOf is not null

select case when A.AuditStatus = 'D' then null else A.UserId end as [UserId],
       case when A.AuditStatus = 'D' then null else A.LoginName end as [LoginName],
       case when A.AuditStatus = 'D' then null else A.Notes end as [Notes],
       case when A.AuditStatus = 'D' then null else A.AuditDate end as [AuditDate],
       case when A.AuditStatus = 'D' then null else A.AuditUserId end as [AuditUserId],
       case when A.AuditStatus = 'D' then null else A.AuditStatus end as [AuditStatus]
from TBadmUser# A
left join TBadmUser# B
       on B.UserId = A.UserId
      and B.AuditDate > A.AuditDate
      and B.AuditDate <= @AsOf
where A.UserId = @UserId
  and A.AuditDate <= @AsOf
  and B.UserId is null

And then include the actual execution plan, and see which select uses more than 50% of the query time. I'd be interested in your results.
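One way to set up that comparison, as a sketch (plug the two SELECTs above in between the SET statements):

-- turn on I/O and timing statistics, and enable "Include Actual Execution Plan"
-- in SSMS; the plan pane then shows each statement's share of the batch cost
set statistics io on;
set statistics time on;

declare @UserId int = 1;                -- example key value
declare @AsOf   datetime = getdate();   -- example as-of date

-- ... correlated-subquery select from above ...
-- ... left-outer-self-join select from above ...

set statistics io off;
set statistics time off;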
Post #1016676
Posted Friday, November 5, 2010 4:05 PM
SSC Rookie

Er, my case statement is a little off. Too many whens
Post #1016821
Posted Friday, November 5, 2010 10:06 PM
Valued Member

Brian: Your comments suggest to me that you are accustomed to an archiving model that is entirely different from RAP's archiving model, and that you are presuming that RAP works in the same way. I imagine that you are accustomed to the traditional model in which all versions of every record are stored in the same table as the most recent version, and in which queries for a given record must go looking for the most recent one based on the records' AuditDates, and that "deleted" records are merely marked for deletion rather than actually being deleted. In this model deleted records can be brought back to life by simply unmarking them. If this is your view of the RAP model, then I ask that you reread the article carefully and also reread the code sample carefully.

In the RAP model, each conceptual table is really two tables: a "current" table that stores only current records, and an "archive" table that stores a log of every version of every record. Queries for current data look only in the "current" table, where they will find only current records. Such queries never have to check for AuditDates because there is always exactly one version of every record in the "current" table, namely the current version. Queries for current data also never need to check for "deleted" flags because records in the "current" table are, by definition, not deleted.
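As a rough sketch of that layout (column names follow the code quoted earlier in this thread; the data types and keys are assumptions, not the actual RAP-generated DDL):

-- "current" table: exactly one row per live record, no audit columns to filter on
create table TBadmUser (
    UserId    int          not null primary key,
    LoginName varchar(50)  not null,
    Notes     varchar(500) null
);

-- "archive" table (TBadmUser# in the article's naming): every version of every record
create table [TBadmUser#] (
    UserId      int          not null,
    LoginName   varchar(50)  not null,
    Notes       varchar(500) null,
    AuditDate   datetime     not null,
    AuditUserId int          not null,
    AuditStatus char(1)      not null,   -- e.g. 'I'nsert, 'U'pdate, 'D'elete
    primary key (UserId, AuditDate)
);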

Queries for current data never look in the "archive" table. The only purpose of the "archive" table is to store past versions of every record for the purpose of reconstructing snapshots of past data as of a given time (for example seeing a report or your application screens as they would have appeared six months ago). There is no need to incorporate any sense of the current time (i.e. the time at which the query is executed) because such data is timeless. There is no reason for a query into this table using a future AsOf date to return anything other than what it currently returns, which is the last recorded version of every record that hasn't been deleted yet.

While it would be possible to reconstruct a new record in the "current" table containing the same business key (and other data) as a previously deleted record, that reconstructed record would necessarily have a new surrogate key (since SQL Server properly would not allow us to reuse a surrogate key in the "current" table) and as such, from the perspective of the "archive" table, that reconstructed record would be a new record (since it has a new key) and not part of the old record's audit trail.

There is no way for RAP to "undelete" a record in the manner you describe because RAP does not delete records by marking them. Rather it deletes records by actually deleting them from the "current" table. The corresponding record with the "D" flag in the "archive" table is merely a historical marker indicating when the lifetime of the record ended. Since queries for current data never look in the "archive" table, and since snapshot queries (i.e. queries with a non-null AsOf date) intentionally never return deleted records, the data in a "D" record is never returned in a query, nor is it intended to be returned.
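As a sketch of that delete path under the same assumed schema (not the actual RAP-generated procedure):

-- deleting a user: log a terminating 'D' version in the archive,
-- then remove the row from the current table, all in one transaction
declare @UserId int = 1, @AuditUserId int = 99;   -- example values

begin transaction;

insert into [TBadmUser#] (UserId, LoginName, Notes, AuditDate, AuditUserId, AuditStatus)
select UserId, LoginName, Notes, getdate(), @AuditUserId, 'D'
from TBadmUser
where UserId = @UserId;

delete from TBadmUser
where UserId = @UserId;

commit transaction;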

If you reread the article carefully, I think you will find that the RAP model for creating, updating, deleting, and querying data is radically different from what you are imagining. It is intentionally designed to be radically different, and I am hoping you will agree that this intentionally different design is radically better.
Post #1016849
Posted Monday, November 8, 2010 5:56 AM
SSC Rookie

David Ziffer (11/5/2010)
Brian: Your comments suggest to me that you are accustomed to an archiving model that is entirely different from RAP's archiving model, and that you are presuming that RAP works in the same way. [...]


And yet your sample query did what? Look in the current table or the archive table :)

"Rather it deletes records by actually deleting them from the "current" table. The corresponding record with the "D" flag in the "archive" table is merely a historical marker indicating when the lifetime of the record ended".

This seems odd to me. Why not mark a status on the current row, and insert a D row in the archive table? You can treat it as deleted without deleting it.

I do not agree this approach has any advantages over a _hist table populated with triggers. It relies on all developers using only the 'approved' procs to manipulate data, a convention that will eventually fail, while triggers will not.

Did you try the performance test I suggested above?
Post #1017124
Posted Monday, November 8, 2010 9:23 AM
SSC Rookie

I imagine that you are accustomed to the traditional model in which all versions of every record are stored in the same table as the most recent version, and in which queries for a given record must go looking for the most recent one


Also, no. I am imagining a user table and a user_hist table, where the hist table is populated by an insert/update trigger and a delete trigger. Are you familiar with this pattern? If not, it would explain some of your earlier comments. I have literally never heard of someone storing their history data in their main table before; the mind boggles at why someone would do that.
Post #1017262
Posted Thursday, November 11, 2010 9:26 AM
Valued Member

Brian: Let me illustrate the benefits of the RAP approach with a scenario. Let's say you and I both implement the same application and we have the same user running both apps side by side. After some time, our user has entered 1 million records into a particular table but has also deleted 400,000 of them, so that there are only 600,000 "current" records in the table.

In the RAP application, when the user queries the table for current data, the server looks only in the "current" table, which contains only current records, so it has to search only through 600,000 records. In your application, the server has to look through 1 million records.

In your application, every programmer writing every query into this table must remember to check for the "D" flag, and the server has to perform this test on every record in every query. The RAP application author (and server) does not need to check any flags - he just writes the natural query with no consideration of any audit or status fields.

In the RAP application, I have stored every version of every record, including every version that was created by repeatedly updating the same record. This means that when querying historical data (as the RAP "ExampleCRM" app does when you log in using a date), the application can see a snapshot of all the data in the entire application as of any given date. For example if I want to see ALL of my data exactly as it was six months ago (including all screens, reports, etc.) all I do is specify a date of "six months ago" and it's as though I had logged into the app on that day. Nothing is ever lost. And since all these record versions are stored separately from the "current" data, there is no performance cost for storing them when looking up current data, no matter how many there are. In your app, the best you can ever do is see the latest version of each record, because that's the only version you've got; there is simply no way for you to reconstruct all your data as of a given past date.
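For instance, under the function naming used elsewhere in this thread, a six-months-ago snapshot of a user record might be pulled like this (a sketch; the parameter order and behavior are assumptions based on the discussion, not verified against the RAP code):

declare @AsOf   datetime = dateadd(month, -6, getdate());  -- "six months ago"
declare @UserId int = 1;                                   -- example key

-- passing a non-null @AsOf routes the lookup through the archive table;
-- passing null would use the fast current-table path instead
select *
from dbo.[UFTBadmUser#PK](@AsOf, @UserId);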
Post #1019409
Posted Thursday, November 11, 2010 12:08 PM
SSC Rookie

David Ziffer (11/11/2010)
Brian: Let me illustrate the benefits of the RAP approach with a scenario. [...]


You are really missing some basic database concepts. You have a view for Current_Users or Active_Purchases or whatever, and 98% of the queries use those views. The 2% of the time you need to do something weird, like undelete, or report on 'all including deleted', you query the underlying table. That is for current data, soft deleted or otherwise. If you are starting from scratch, you make it an indexed view and suffer literally zero performance hit. We had huge tables before that was an option, so we don't have indexed views, but we still perform just fine.
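A minimal sketch of that indexed-view idea (table, column, and status values are hypothetical, not from the post):

-- the view hides soft-deleted rows; the unique clustered index materializes it,
-- so queries against Current_Users do not rescan deleted rows
create view dbo.Current_Users
with schemabinding
as
select UserId, LoginName, Notes
from dbo.Users
where UserStatus <> 'D';
go

create unique clustered index IX_Current_Users on dbo.Current_Users (UserId);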

Historical data is not stored in the main table. I was quite clear about this above. It's in a separate table, named by convention <table>_hist or something like that. The hist tables are populated by triggers from the main tables. Every insert/update/delete (when permitted) is recorded. If I want a user, group, and purchase list as of 6 months ago, I query the _hist tables, joined to the other _hist tables. I have one set of queries for current data, and one set of queries for historical data. You have a huge number of table-valued functions, each of which does an if/else to abstract away from the end developer the idea of current versus historical. The difference is that I can index my historical/current data and tune my current/historical queries very differently.
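A simplified sketch of that kind of as-of join across _hist tables, using the left-outer "no later row" pattern advocated above (table and column names are assumed, including a GroupId column on Users_hist; the deleted-row filter is omitted for brevity):

declare @AsOf datetime = dateadd(month, -6, getdate());

-- for each user and group, keep the newest history row at or before @AsOf
select u.UserId, u.LoginName, g.GroupName
from Users_hist u
left join Users_hist u2
       on u2.UserId = u.UserId
      and u2.ValidFrom > u.ValidFrom
      and u2.ValidFrom <= @AsOf
join Groups_hist g
       on g.GroupId = u.GroupId
left join Groups_hist g2
       on g2.GroupId = g.GroupId
      and g2.ValidFrom > g.ValidFrom
      and g2.ValidFrom <= @AsOf
where u.ValidFrom <= @AsOf
  and g.ValidFrom <= @AsOf
  and u2.UserId is null
  and g2.GroupId is null;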

If your database performance is significantly affected by storing 400,000 records, then there is probably something wrong. We have user records in the tens of millions, and keeping the ones that are no longer active is just not a performance hit. We also soft-delete to avoid cascading deletes. What do you do about them? What if a user has been responsible for tens of thousands of transactions over the years, all FK'd to them, each of which has tens of detail records hanging off it? When you physically delete that user, how many rows have to be deleted? Easily hundreds of thousands. For one user. What if you delete a group of 10,000? You're talking millions and millions of cascading deletes. Your entire database is in jeopardy at this point, as row locks escalate to page locks and then to table locks, and writes and transactions start to pile up to deadly levels as this one delete uses up all your I/O. While in my system, soft-deleting a user affects... two rows - one update to the current table, one insert into the hist table. Soft-deleting 10,000 affects 20,000 rows. It is a linear progression. Cascading deletes are a multiplicative progression. I will take a gradual creep in table size over huge unpredictable I/O thrashing any day.
Post #1019511
Posted Saturday, November 13, 2010 8:18 PM
Grasshopper

I was wondering what the benefits were of having two separate functions for potentially returning all records if the primary key parameter is passed in as NULL, like:

UFTBadmUser##PK (@AsOfDate, @UserId)
UFTBadmUser#PK (@AsOfDate, @UserId)

Why not just have the one function that uses
where (UserID = @UserID or @UserID is null)?

It can't be that much of a performance hit to check whether or not a parameter's value is null, and why would you even have a @UserID parameter in the second function at all if the only time you would use it would be if you wanted to pass @UserID = null?
Post #1020398
Posted Sunday, November 14, 2010 7:17 PM
Valued Member

omarserenity (11/13/2010)
[...] It can't be that much of a performance hit to check whether or not a parameter's value is null, and why would you even have a @UserID parameter in the second function at all if the only time you would use it would be if you wanted to pass @UserID = null?

Actually, there is quite a performance hit when you do the test for null. So there are two versions: one that allows you to get all records if you want, and a much faster one that requires an exact match on the key value(s).
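To make the trade-off concrete, here is roughly the difference between the two shapes (a simplified sketch, not the generated RAP code):

declare @UserId int = 1;   -- example value

-- catch-all form: one plan has to cover both the "one row" and "all rows" cases,
-- which is where the performance hit tends to come from
select UserId, LoginName
from TBadmUser
where (UserId = @UserId or @UserId is null);

-- exact-match form: a plain key lookup the optimizer can plan for directly
select UserId, LoginName
from TBadmUser
where UserId = @UserId;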
Post #1020509
Posted Monday, November 15, 2010 9:48 AM
SSC Veteran

It can't be that much of a performance hit to check whether or not a parameter's value is null ...

I can testify to the potential hit on checking for a parameter's NULL value. I once had a stored proc bring an entire server to its knees. Making one change - turning one proc into three procs that did not have to check for NULL values - instantly and completely fixed all problems.

You will not see this problem every time you check for nulls. There are times when checking for a null value makes sense. You should just be aware that it is a potential gotcha, especially since the problem may not be apparent in testing.
Post #1020891