Hunting down bad code/ETL

  • Hi all,

    I'm leading a drive to get the code and ETL in our organisation reality-checked and improved, we'll be going to cloud soon and management acknowledge we have a lot of data is clearly aberrant both in terms of reporting, performance and cleanliness, and I hope someone here has a few roadmaps to share....

    For example, we'll be looking for SPROCS containing Select *, use/overusage of cursors, ETL that doesn't log an error when Zero rows are loaded...

    Any more stuff to look out for would be appreciated!!

  • The first thing that comes to mind is SQLCop.

    However, what you're trying to do might be an extensive work depending on your database and code quality.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • JaybeeSQL (6/16/2016)


    Hi all,

    I'm leading a drive to get the code and ETL in our organisation reality-checked and improved, we'll be going to cloud soon and management acknowledge we have a lot of data is clearly aberrant both in terms of reporting, performance and cleanliness, and I hope someone here has a few roadmaps to share....

    For example, we'll be looking for SPROCS containing Select *, use/overusage of cursors, ETL that doesn't log an error when Zero rows are loaded...

    Any more stuff to look out for would be appreciated!!

    I wouldn't call loading zero rows an error, but being alerted to such an event is worthwhile. You never know, it may not be an error.

  • Some coding practices, like cursors and SELECT *, are known to be bad and you can spot them just be reading the code or performing a text search across your code repository or directly againt the the objects in your database.

    For example:

    exec sp_msforeachdb

    '

    print ''?'';

    use ?;

    select db_name(), type, name as object_name

    , object_definition(object_id) as object_text

    from sys.objects as o

    where is_ms_shipped = 0

    and (object_definition(object_id) like ''%SELECT *%''

    or object_definition(object_id) like ''%DECLARE CURSOR%'');

    ';

    By the way, if you're not maintaining all your application, database, and ETL source code in a version control system like Git, SVN, or TFS, then definately do that first.

    Other performance bottlenecks are not easy to spot check in code, especially if performance tuning is not your primary expertise. One tool I would reccomend is SQLSentry, which is a process and operational status monitoring tool. Amuong 100 other things, it will tell you what statements are the top consumers of CPU, I/O, memory, duration, etc.

    PragmaticWorks BIxPress (amoung other things) provides a plugin auditing framework and monitoring dashboard for SSIS. It also has a feature for analyzing SSIS projects for best practices and suggested performance optimizations.

    Before you start any T-SQL performance tuning effort, familiarize yourself with some good books that focus on execution plan analysis, like the one by Grant Fritchey.

    https://www.simple-talk.com/books/sql-books/sql-server-execution-plans,-second-edition,-by-grant-fritchey/

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Lynn Pettis (6/16/2016)


    JaybeeSQL (6/16/2016)


    Hi all,

    I'm leading a drive to get the code and ETL in our organisation reality-checked and improved, we'll be going to cloud soon and management acknowledge we have a lot of data is clearly aberrant both in terms of reporting, performance and cleanliness, and I hope someone here has a few roadmaps to share....

    For example, we'll be looking for SPROCS containing Select *, use/overusage of cursors, ETL that doesn't log an error when Zero rows are loaded...

    Any more stuff to look out for would be appreciated!!

    I wouldn't call loading zero rows an error, but being alerted to such an event is worthwhile. You never know, it may not be an error.

    I would add that loading rows isn't an indication of success, either. A check on the correct number of rows should be in place.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • In the past there was an ETL process where a standard deviation in record counts based on the previous X runs would flag the dataset in staging and prevent it from release to production.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (6/16/2016)


    In the past there was an ETL process where a standard deviation in record counts based on the previous X runs would flag the dataset in staging and prevent it from release to production.

    We actually used control numbers to be sure that the data was correct. A row count and two sums on different values to ensure the data was correctly processed.

    We couldn't base operations on previous runs because some days had more activity than the others. We expected an approximate number of rows, but I remember when we thought we had a huge problem by getting less than 10 rows when we were expecting several thousands. It was just that the business had to close because there were disturbs in that country.

    You need to monitor as well for expensive queries (after applying fixes from SQLCop or anything similar) which could be queries that take one second or two but are run a thousand times a minute.

    As I said, this is a whole lot of work and no tool will solve everything.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (6/17/2016)


    Eric M Russell (6/16/2016)


    In the past there was an ETL process where a standard deviation in record counts based on the previous X runs would flag the dataset in staging and prevent it from release to production.

    We actually used control numbers to be sure that the data was correct. A row count and two sums on different values to ensure the data was correctly processed.

    We couldn't base operations on previous runs because some days had more activity than the others. We expected an approximate number of rows, but I remember when we thought we had a huge problem by getting less than 10 rows when we were expecting several thousands. It was just that the business had to close because there were disturbs in that country.

    You need to monitor as well for expensive queries (after applying fixes from SQLCop or anything similar) which could be queries that take one second or two but are run a thousand times a minute.

    As I said, this is a whole lot of work and no tool will solve everything.

    In our case, we were receiving healthcare data feeds from 200+ external providers. We had a dashboard application that displayed ETL operational stats, stuff like data distribution alerts in addition to record count deviations. For example, if more than X% of patients had an empty SSN, or if X phone numbers occurred for more than Y patients, then it could be an indication that the external data provider was stubbing the phone number column. The data analysts would research the issue, query the dataset in staging environment, and then could go into the dashboard and either release the dataset to production or flag it for folowup with the client. The application was developed in-house using SilverLight, but I'm sure there are canned 3rd party solutions for this as well.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • As a bit of a sidebar and like anything else in SQL Server, saying something like the use of SELECT * always being bad is the wrong thing to say. If you need all columns from a wide table (lots of columns like some of the more ridiculous tables we have at work that have more than 100 columns), you'll frequently find that SELECT * is sometimes significantly faster than listing every bloody column in the code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • That's exactly why this won't have an easy solution. Each system is different.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Jeff Moden (6/17/2016)


    As a bit of a sidebar and like anything else in SQL Server, saying something like the use of SELECT * always being bad is the wrong thing to say. If you need all columns from a wide table (lots of columns like some of the more ridiculous tables we have at work that have more than 100 columns), you'll frequently find that SELECT * is sometimes significantly faster than listing every bloody column in the code.

    Do you mean a faster execution plan at runtime, or it's just faster when coding by hand?

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (6/20/2016)


    Jeff Moden (6/17/2016)


    As a bit of a sidebar and like anything else in SQL Server, saying something like the use of SELECT * always being bad is the wrong thing to say. If you need all columns from a wide table (lots of columns like some of the more ridiculous tables we have at work that have more than 100 columns), you'll frequently find that SELECT * is sometimes significantly faster than listing every bloody column in the code.

    Do you mean a faster execution plan at runtime, or it's just faster when coding by hand?

    Both.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (6/20/2016)


    Eric M Russell (6/20/2016)


    Jeff Moden (6/17/2016)


    As a bit of a sidebar and like anything else in SQL Server, saying something like the use of SELECT * always being bad is the wrong thing to say. If you need all columns from a wide table (lots of columns like some of the more ridiculous tables we have at work that have more than 100 columns), you'll frequently find that SELECT * is sometimes significantly faster than listing every bloody column in the code.

    Do you mean a faster execution plan at runtime, or it's just faster when coding by hand?

    Both.

    I guess it's kind of like when I'm ordering a Sub sandwhich, the type where one walks up to the bar and tells the server what all to put between two slices of footlong bread. I just say "Give me everything", which is a lot faster to say and probably easier for them than: "Lettuce, tomato, spinach leaves, bell pepper, pickels, cucumber, onion, spicy peppers, olives, salt, pepper, olive oil, mayonaise, ...". Sometimes when I walk in the door, they recognize me and have already started my order, which is kind of like a cached execution plan. 🙂

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Augment ETL to other systems when possible, audit/logging/benchmark all systems into one location for accessibility and viewability, meta is king, watch out for common SQL (or SSIS) antipatterns, use flowcharts, don't over complicate documentation, and keep it simple stupid (KISS).

  • Eric M Russell (6/20/2016)


    Jeff Moden (6/20/2016)


    Eric M Russell (6/20/2016)


    Jeff Moden (6/17/2016)


    As a bit of a sidebar and like anything else in SQL Server, saying something like the use of SELECT * always being bad is the wrong thing to say. If you need all columns from a wide table (lots of columns like some of the more ridiculous tables we have at work that have more than 100 columns), you'll frequently find that SELECT * is sometimes significantly faster than listing every bloody column in the code.

    Do you mean a faster execution plan at runtime, or it's just faster when coding by hand?

    Both.

    I guess it's kind of like when I'm ordering a Sub sandwhich, the type where one walks up to the bar and tells the server what all to put between two slices of footlong bread. I just say "Give me everything", which is a lot faster to say and probably easier for them than: "Lettuce, tomato, spinach leaves, bell pepper, pickels, cucumber, onion, spicy peppers, olives, salt, pepper, olive oil, mayonaise, ...". Sometimes when I walk in the door, they recognize me and have already started my order, which is kind of like a cached execution plan. 🙂

    As someone who's worked in food service (ice cream store), I don't like when people say give me a sundae with everything - everything is different for different people. Had the same problem with people say saying they'll have the same thing but they really only wanted the same size with a different flavor and a different topping. :crazy:

    I did have a bunch of customers that I knew the usual order for. I always questioned whether I should start their usual order right away or check first to see if they wanted something different.



    The opinions expressed herein are strictly personal and do not necessarily reflect the views or policies of my employer.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply