
Production Subsets
Posted Thursday, August 21, 2014 8:17 PM


SSC-Dedicated

Comments posted to this topic are about the item Production Subsets






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1606092
Posted Thursday, August 21, 2014 11:51 PM
Valued Member

Unless there is a compelling reason not to, developing on full production data is best.

Consider:

  • A comprehensive set of values is needed to test logic

    Simple things like a NULL where you didn't expect one, a non-numeric value, or a duplicate can cause unexpected problems when you move to production unless you developed and tested against full production data (a short profiling sketch follows this list).


  • Performance can be vastly different depending on the volume of data involved

    That CTE or correlated subquery might work fine with a modest number of rows, but it may need to be refactored to hold up under the demands of full production data. It is better to test and tune as you go.


  • Helps explain and define requirements

    Sure, in a perfect world there is documentation about the data structures, and there is good referential integrity. But in the real world in which I live, sometimes there is neither, and reverse-engineering and research are required. I have often come across situations where there are thousands of rows in a table with NULLs in a column...and then rows that have a value. With only a subset of production data, I could easily conclude that the column was not being used when in fact it was.


  • Realistic demos and tests

    Imaginary data and simple "Test 1" data introduce an additional layer of abstraction for both developers and stakeholders: "Imagine that is a real order detail record" versus looking at an actual order detail. Besides the unexpected data that may affect behavior, looking at dummy data tends to make us gloss over problems because we aren't really interpreting what we are seeing in real-world terms.


  • Continuous Data Improvement

    Some development is about improving data and the way it is organized. For example, you might notice that long address values are being truncated and an address2 column is needed, that you are in danger of overflowing an int identity column, or that rows imported from another source need cleansing; there are any number of opportunities for improvement. These things might not be in a requirements document or even in an agile story, but they are things that should be corrected, and there is a better chance of finding them when you are seeing more data.


  • Simplicity

    Just today I was developing a complex stored procedure. I was testing against full production data, and it worked fine. I handed the code off to another developer who was testing against dev data...and my code did not return the expected results. It turned out that the problem was with the dev data. No big deal, but there was confusion and some wasted time that would have been avoided if we had both been using full production data.
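
To make the points about unexpected values and identity-column headroom concrete, here is a minimal profiling sketch of the kind of checks that only pay off against production-sized data. The table and column names (dbo.OrderDetail, CustomerRef, Quantity) are hypothetical, not taken from any real system.

-- Hypothetical profiling queries; object names are illustrative only.

-- Unexpected NULLs and duplicates in a column assumed to be populated and unique
SELECT COUNT(*)                                             AS TotalRows,
       SUM(CASE WHEN CustomerRef IS NULL THEN 1 ELSE 0 END) AS NullCustomerRefs,
       COUNT(CustomerRef) - COUNT(DISTINCT CustomerRef)     AS DuplicateCustomerRefs
FROM dbo.OrderDetail;

-- Values in a varchar column that will not convert cleanly to a numeric type
SELECT OrderDetailID, Quantity
FROM dbo.OrderDetail
WHERE Quantity IS NOT NULL
  AND TRY_CONVERT(int, Quantity) IS NULL;

-- Headroom remaining in an int identity column
SELECT IDENT_CURRENT('dbo.OrderDetail')              AS CurrentIdentityValue,
       2147483647 - IDENT_CURRENT('dbo.OrderDetail') AS ValuesRemaining;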



Post #1606116
Posted Friday, August 22, 2014 1:54 AM


SSCertifiable

As always there is a balance to be struck, and it depends very much on both the task one is doing and the conditions one is doing it under.

My preference is for developers to work against a local database with a subset of data (maintained through source control in some way, i.e. it can be completely ditched and replaced with a couple of clicks or commands), with the Continuous Integration server running tests on every check-in against a large set of data (the full set, unless it is absolutely MASSIVE, and I do mean massive) and a regular CI build with fully instrumented performance tests.

My reasoning: it is often too expensive, time-wise, to run the full suite of performance tests on every check-in (otherwise it encourages larger, less frequent check-ins); developers will not want to return to work that is non-performant, so they will often consider performance if they know they have to pick up their own failures; developers will keep deferring performance issues unless they break the build (and only performance tests do that); and in some cases stored procedure performance is outside the scope of the developer.

Some people may immediately shout at me for the last point, but I have worked on projects where the 3GL developers write stored procedures as best they can and database developers (DBAs) accept each stored procedure and treat it as a specification. Sometimes this means rewriting the stored procedure while keeping the signature and output identical, occasionally there is an advised or required change to the signature, and sometimes the stored procedure needs no alteration. Under these conditions I have always believed that the 3GL developers should aim for the last outcome but accept that either of the first two may happen.


Gaz

-- Stop your grinnin' and drop your linen...they're everywhere!!!
Post #1606136
Posted Friday, August 22, 2014 2:59 AM


SSCommitted

I'm with David and Gary. I'm not in a shop where we have subsets of devs like 3GL and database-specific types. We do have a database guy, but getting him to look at a project is very unlikely, as he will be sorting out other things like DTS/SSIS packages and data cleansing exercises. The point about working to get adequate performance with realistic data is the most important one, but everything David said was relevant.
Post #1606152
Posted Friday, August 22, 2014 3:04 AM
Valued Member

As long as the data is scrubbed so that it doesn't break the data protection laws, a full dataset can be easier. But that always takes time, and most developers want the unaltered data just in case the changes break their code.

But a subset is better because it can be clean. Why write more complex logic when the base data needs to be cleaned in the first place?
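
For what it's worth, here is a minimal sketch of the kind of scrubbing pass being described, run against a restored copy rather than against production; the table and column names are hypothetical.

-- Hypothetical masking pass; run against a restored copy, never against production.
UPDATE dbo.Customer
SET    FirstName   = 'First' + CAST(CustomerID AS varchar(10)),
       LastName    = 'Last'  + CAST(CustomerID AS varchar(10)),
       Email       = 'user'  + CAST(CustomerID AS varchar(10)) + '@example.com',
       Phone       = NULL,
       DateOfBirth = DATEADD(DAY, CustomerID % 365, '1970-01-01');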
Post #1606154
Posted Friday, August 22, 2014 3:18 AM


SSCertifiable

Yet Another DBA (8/22/2014)
As long as the data is scrubbed so that it doesn't break the data protection laws, a full dataset can be easier. But that always takes time, and most developers want the unaltered data just in case the changes break their code...


There are tools for this. I have not used them myself, but I have used their output and been very satisfied: the data was realistic (not the original values, but realistic in terms of varying lengths, NULLs, etc.) and it ensured that I was compliant.



Gaz

-- Stop your grinnin' and drop your linen...they're everywhere!!!
Post #1606157
Posted Friday, August 22, 2014 3:39 AM
SSC Rookie

I used to favour production data for development and testing, but with the introduction of the Data Protection Act in the UK, doing so became something of a headache.

It is still worth testing on a database of representative size*, but these days I would advocate using synthetic data. Production data can be of rather variable quality, and may not be available at all during the development phase. Synthetic data, if you know the algorithms used to generate it, will have predictable characteristics, so you can run a query and, whilst you may not know exactly what you will get back, you should know exactly how big the result set should be.

* So often, I have seen systems go live only to be beset by severe performance problems which needed to be addressed urgently. Using a full size database lets you find these problems before go-live.

Many years ago, I worked on a CRM system and was tasked with populating the database with enough test data to make it about the same size as the live system would be. I wrote a program to generate it, populating all the tables. It took a couple of hours to run, but it gave data with predictable characteristics, and you could throw the database away and start again without too much pain.
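
A rough sketch of the same idea in T-SQL, assuming a hypothetical dbo.TestOrder table (the original work may well have used a different language); because every distribution is chosen up front, the result-set sizes are predictable.

-- Hypothetical generator: one million synthetic orders with known, repeatable characteristics.
;WITH Numbers AS
(
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS i
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.TestOrder (OrderID, CustomerID, OrderDate, Amount)
SELECT i,
       (i % 50000) + 1,                          -- exactly 50,000 distinct customers
       DATEADD(DAY, i % 730, '2012-01-01'),      -- two years of order dates
       CAST((i % 1000) + 0.99 AS decimal(10, 2)) -- bounded, repeatable amounts
FROM Numbers;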

Usually, data migration happens close to go-live, so testing on production data can happen at the User Acceptance Testing (UAT) phase. Sure, there will be things in the production data that trip up the application. But letting the users test on the production data means they will take care of it the way they know how, and developers don't have to carry that responsibility.
Post #1606165
Posted Friday, August 22, 2014 3:43 AM
Valued Member

john.riley-1111039 (8/22/2014)
........

* So often, I have seen systems go live only to be beset by severe performance problems which needed to be addressed urgently. Using a full size database lets you find these problems before go-live.
......


Totally agree; it's very difficult for a developer not to tune, or to ask for help tuning, when their work is too slow for production.
With today's hard drives there is no reason why even a 100 GB database can't be restored quickly to test and develop code against.
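
For example, a routine refresh of a dev copy from last night's full backup; the database name, logical file names, and paths below are illustrative.

-- Illustrative restore of a production backup onto a dev server.
RESTORE DATABASE DevSales
FROM DISK = N'\\backupserver\sqlbackups\Sales_Full.bak'
WITH MOVE N'Sales_Data' TO N'D:\DevData\DevSales.mdf',
     MOVE N'Sales_Log'  TO N'E:\DevLogs\DevSales_log.ldf',
     REPLACE,
     STATS = 10;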


Post #1606168
Posted Friday, August 22, 2014 6:06 AM


SSC Veteran

I am for having the right data in the right environment.

Dev databases should have a data set that is representative of production but smaller. All known scenarios should be covered, including some bad data in dev if the prod system can allow it in. You won't always know what these scenarios are until you run into them in prod, but when you do, just add the scenario to the test data set. The point of the dev environment is to provide developers with enough data to test their work without hindering productivity. Security hinders productivity, so there should be no sensitive data in dev.

So how do we test properly against the full data set to verify performance and quality? I argue that it's not the developers' job; it belongs in the hands of the Quality Assurance (QA) team. The QA team should test on their own set of servers with their own data. This is a smaller group of people, so testing against prod data shouldn't be as big an issue; if it is, use test data here too, but make it comparable to prod in size as well as data distribution. The QA environment is also a good place to do regression testing: set up all your nightly jobs and unit tests to run against QA as well as prod. That should help you discover the unknown scenarios in your new code before it gets to prod.


Bryant E. Byrd, BSSE MCDBA MCAD
Business Intelligence Administrator
MSBI Administration Blog
Post #1606218
Posted Friday, August 22, 2014 6:17 AM
SSC Eights!

Tatsu (8/22/2014)
I am for having the right data in the right environment.

I don't see the point. If production data can be obfuscated while maintaining similar statistical characteristics, and if there's space, then why bother pruning the data? You have the extra cost of pruning, and all you achieve is to delay finding problems until the developer's mind is well and truly elsewhere.
Post #1606224