Handling string column with dates

  • Get them to fix the dates for the next import or you are going to find yourself struggling with all this again.

  • Lynn Pettis - Friday, October 20, 2017 3:22 PM

    Get them to fix the dates for the next import or you are going to find yourself struggling with all this again.

    +10

    Thom, I wish it was that easy. The data I receive in excel cannot be fixed. It was manually entered in and there are many users entering the data into a template. We received the data from the client and have no control over their process. I extracted the data and inserted it into SQL. It what you said was an option, believe me...that would have been the first thing I did. That's why I said... "it is what it is..." I wasn't being sarcastic or trying to shoot you down.

    Although you may have no control over the current data, you should definitely be feeding back on the problems you are encountering to your provider here. Tell them about the problems you are encountering, and that they need to enforce, going forward, using data validations, especially with data such as numbers and dates. Let them know that because their data is in such a poor condition, so is the data/analysis you are going to provide from it. Let them know the assumptions you've had to make, which could produce anomalies (such as that all dates are assumed to be in the format dd MM yyyy, unless that conversion fails; this means someone may have entered 01 06 2017 but the date they meant is 06 January 2017).

    If you can't fix it this time around, then you're priority is to make sure it's in a usable condition next time. it'll make your (or whoever would have to work with the mess next time) life a lot easier, and your client will be much happier with data/analysis they can rely on. Far too many companies are happy to blame the company that produced the analysis, rather than themselves, when the data that was provided was flawed in the first place. "It's not our fault the data was wrong, we hired you to fix it!" 😉

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • GrassHopper - Friday, October 20, 2017 2:55 PM

    On another update, I am getting the same error msg and I can't find the needle in the haystack.  I was thinking of doing an update on a portion of the data at a time.  In other words, break up the update into 10 passes until it fails and then i can concentrate on a smaller set of data to look at for the problem.  The total size is 130,571 records.  It has an ID column that is unique, but not in serial order.  How can I break this update up into 10 update passes?

    i.e.  - ID numbers:
    117915
    117916
    117917
    118300
    118301
    118302
    118512
    118513

    SQL Code:

    Update XLSHdr_Promo_TescoUK
    Set PTL_EndDate_Dt =     Case WHEN #Staging.PTL_EndDate LIKE '[0-3][0-9].[0-1][0-9].[0-9][0-9][0-9][0-9]' THEN CONVERT(date, #Staging.PTL_EndDate, 104)  -- 27.01.2014
                                WHEN #Staging.PTL_EndDate LIKE '[0-3][0-9].[0-1][0-9].[0-9][0-9]' THEN CONVERT(date, #Staging.PTL_EndDate, 4)  -- 27.07.16
                            END
    From #Staging
    Where XLSHdr_Promo_TescoUK.ID = #Staging.ID     

    A loop like this will do the trick.  You can keep changing the WHERE clause to do different batches
    CREATE TABLE #Staging (
      ID      INT NOT NULL
    , PTL_EndDate   VARCHAR(20)
    , Clean_PTL_EndDate DATE
    );

    INSERT INTO #Staging (ID, PTL_EndDate)
    VALUES ( 117915, '27.01.2014' )
      , ( 117916, '15.27.2014' )
      , ( 117916, '01.27.14' )
      , ( 117917, '2014.01.27' )
      , ( 118301, '27.01.2014' )
      , ( 118302, '01.27.2014' );

    DECLARE @BatchSize INT = 1; -- Start with a big number, and keep making it smaller

    BEGIN TRY
    WHILE EXISTS (SELECT 1 FROM #Staging
          WHERE Clean_PTL_EndDate IS NULL
          AND (PTL_EndDate LIKE '[0-3][0-9].[0-1][0-9].[0-9][0-9][0-9][0-9]'
           OR PTL_EndDate LIKE '[0-1][0-9].[0-3][0-9].[0-9][0-9][0-9][0-9]')
         )
    BEGIN
      WITH cteBatch AS (
      SELECT TOP (@BatchSize)
        ID
       , PTL_EndDate
       , Clean_PTL_EndDate
       , YearPart = PARSENAME(PTL_EndDate, 1)
       , MonthPart = CASE WHEN PARSENAME(PTL_EndDate, 2) > 12 THEN PARSENAME(PTL_EndDate, 3) ELSE PARSENAME(PTL_EndDate, 2) END
       , DayPart = CASE WHEN PARSENAME(PTL_EndDate, 2) > 12 THEN PARSENAME(PTL_EndDate, 2) ELSE PARSENAME(PTL_EndDate, 3) END
      FROM #Staging
      WHERE Clean_PTL_EndDate IS NULL
       AND (PTL_EndDate LIKE '[0-3][0-9].[0-1][0-9].[0-9][0-9][0-9][0-9]'
        OR PTL_EndDate LIKE '[0-1][0-9].[0-3][0-9].[0-9][0-9][0-9][0-9]')
      ORDER BY ID
      )
      UPDATE cteBatch
      SET Clean_PTL_EndDate = CONVERT(date, DayPart + '.' + MonthPart + '.' + YearPart, 104);

      WAITFOR DELAY '00:00:00.100'; -- Pause for 100ms
    END;
    END TRY
    BEGIN CATCH
    -- Somewhere in this batch you should find your bad apple.
    -- If this returns no data, then the specific pattern has been taken care of
    SELECT TOP (@BatchSize) *
    FROM #Staging
    WHERE Clean_PTL_EndDate IS NULL
    AND (PTL_EndDate LIKE '[0-3][0-9].[0-1][0-9].[0-9][0-9][0-9][0-9]'
      OR PTL_EndDate LIKE '[0-1][0-9].[0-3][0-9].[0-9][0-9][0-9][0-9]')
    ORDER BY ID;
    END CATCH

  • GrassHopper - Friday, October 20, 2017 2:49 PM

    Thom A - Friday, October 20, 2017 12:21 PM

    I'm surprised you've said that string like '18th February 2015' are being returned with the LIKE statement (as they wouldn't).

    DesNorton - Friday, October 20, 2017 11:58 AM

    declare @datestring varchar(20) = '19th May 2015';

    select convert(date, replace(@datestring, 'th', ''));

    The problem with this, Des, (edit and now Lynn who has posted the same) is that if the OP has data with "10th" they probably have data with "1st", "21st" and even "31st". That presents a problem as...

    REPLACE(REPLACE(REPLACE(REPLACE('1st August 2017','st ',' '),'nd ',' '),'rd ',' '),'th ',' ')
    Returns '1 Augu 2017'. You're probably going to need to do something more like...
    REPLACE(REPLACE(REPLACE(REPLACE('1st August 2017','1st','1'),'nd ',' '),'rd ',' '),'th ',' ')

    I know I've been shot down about it before, but I'm going to say it again; Get the provider to fix their data... There's far far far too many permutations. I woulnd't be surprised if we start seeing examples of dates such as 'Friday, 20th October 2017' soon, or (even worse), something stupid like 'Thursday, 11st October 2017' (2017-10-11 was a Wednesday 😉 ).

    Thom, I wish it was that easy.  The data I receive in excel cannot be fixed.  It was manually entered in and there are many users entering the data into a template.  We received the data from the client and have no control over their process.  I extracted the data and inserted it into SQL.  It what you said was an option, believe me...that would have been the first thing I did.  That's why I said... "it is what it is..."  I wasn't being sarcastic or trying to shoot you down.

    I understand the source of your problem but, after all is said and done on this post, we're still bringing in data that could be incorrect when either DD or MM is 12 or less and we don't actually know if the correct format for the date is MM/DD/YYYY or DD/MM/YYYY.  If anything goes wrong, it's ultimately going to be your fault whether you that will be the scapegoat whether you can explain it or not.   Whatever god-forsaken management is having you perpetuate this data atrocity, they not the ones that will take the fall.  They'll put it on you and they can make it stick because you knew of the problem and coded for it.

    I hope you don't need it but good luck.  I'd have refused to write the code based on the fact that there is data that cannot be resolved properly and that's not an ethical thing to do because either you and management don't know what damage it cause or, if you do know it, that's even worse.  If someone ever comes down on you guys, remember your own words... "It is what it is" and it IS wrong.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Saturday, October 28, 2017 2:53 PM

    I understand the source of your problem but, after all is said and done on this post, we're still bringing in data that could be incorrect when either DD or MM is 12 or less and we don't actually know if the correct format for the date is MM/DD/YYYY or DD/MM/YYYY.  If anything goes wrong, it's ultimately going to be your fault whether you that will be the scapegoat whether you can explain it or not.   Whatever god-forsaken management is having you perpetuate this data atrocity, they not the ones that will take the fall.  They'll put it on you and they can make it stick because you knew of the problem and coded for it.

    I hope you don't need it but good luck.  I'd have refused to write the code based on the fact that there is data that cannot be resolved properly and that's not an ethical thing to do because either you and management don't know what damage it cause or, if you do know it, that's even worse.  If someone ever comes down on you guys, remember your own words... "It is what it is" and it IS wrong.

    Jeff you completely hit the nail on the head here.

    I've had the experience of being given data like these with the day and month were interchangeable and thus wouldn't do the job I was asked to; on account it would be wrong. I was pushed and pushed Todo it anyway. "Thom, there must be some magic you can work to make it right!" We all know, there is no magic button.

    In the end, I did the work, but I saved off all the communications between myself and manager at the time, raising my concerns and my hesitance to do it; and that effectively I was only doing the work under pressure that made me more uncomfortable than doing the work itself.

    Unsurprisingly, after a mk the or so, the analysts started reporting problem with "my" data and raised their concerns. They came down on me like a tonne of bricks and to say I was furious is an understatement. I did, however, pass on all the emails between myself and my manager to their director and the analyst's management team, which eventually cause the storm to blow away from my direction.

    Since then, however, I've refused to do the same again (not that I'm with that employer anymore). i don't let others in my team either, it's not worth the hassle, and any analysis of wrong data provides wrong analysis; so, what's the point? 🙂

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A - Saturday, October 28, 2017 3:17 PM

    Jeff Moden - Saturday, October 28, 2017 2:53 PM

    I understand the source of your problem but, after all is said and done on this post, we're still bringing in data that could be incorrect when either DD or MM is 12 or less and we don't actually know if the correct format for the date is MM/DD/YYYY or DD/MM/YYYY.  If anything goes wrong, it's ultimately going to be your fault whether you that will be the scapegoat whether you can explain it or not.   Whatever god-forsaken management is having you perpetuate this data atrocity, they not the ones that will take the fall.  They'll put it on you and they can make it stick because you knew of the problem and coded for it.

    I hope you don't need it but good luck.  I'd have refused to write the code based on the fact that there is data that cannot be resolved properly and that's not an ethical thing to do because either you and management don't know what damage it cause or, if you do know it, that's even worse.  If someone ever comes down on you guys, remember your own words... "It is what it is" and it IS wrong.

    Jeff you completely hit the nail on the head here.

    I've had the experience of being given data like these with the day and month were interchangeable and thus wouldn't do the job I was asked to; on account it would be wrong. I was pushed and pushed Todo it anyway. "Thom, there must be some magic you can work to make it right!" We all know, there is no magic button.

    In the end, I did the work, but I saved off all the communications between myself and manager at the time, raising my concerns and my hesitance to do it; and that effectively I was only doing the work under pressure that made me more uncomfortable than doing the work itself.

    Unsurprisingly, after a mk the or so, the analysts started reporting problem with "my" data and raised their concerns. They came down on me like a tonne of bricks and to say I was furious is an understatement. I did, however, pass on all the emails between myself and my manager to their director and the analyst's management team, which eventually cause the storm to blow away from my direction.

    Since then, however, I've refused to do the same again (not that I'm with that employer anymore). i don't let others in my team either, it's not worth the hassle, and any analysis of wrong data provides wrong analysis; so, what's the point? 🙂

    You were smart in doing the documentation.  As you found out, it's better so just say "No".

    I don't have any direct reports (thank the lucky stars), but I DO sit with the Developers and have expressed to them the great importance of absolute honesty absolutely all the time.  I also explained to them that I was looking for a job when I found this one and that if they are in fear of any ramifications of saying "No" when it  comes to violating data integrity or honest use, to bring the problem to me and I'll take point on the problem and go to the wall for them.  If they try to hide what has been done, then I'll take point on that problem and they know the meaning of that, as well.

    People sometimes say "Well... no one's life depends on what we're doing" and that idiotic, short sighted, moronic, uninformed saying couldn't be further from the truth.  I HAVE worked for companies that allowed such shenanigans (of which I took no part and even took it to the GMs) with data to occur and when the company got sued into oblivion, everyone's life in the company changed.  I've also worked for a fortune 500 company that warped data to mean something else (a BIG reason why I refer to "BI" as an "oxymoron" because I also took it all the way up to the GM, who ignored me) and half the company got laid off in the month that I told them it would happen about two years previous.  Those poor people had no where to go and rumor has it that a couple of people that could no longer afford certain medications because they were laid off actually did die.

    When it comes to data, I believe in absolute honesty, absolute integrity, absolute accuracy, and timely reporting of it all.  There is no "It is what it is" acceptance of crap data on my watch.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 6 posts - 46 through 50 (of 50 total)

You must be logged in to reply to this topic. Login to reply