trimming SSNs

  • I have thousands of social security numbers I need to trim to leave only the last 4 digits... it is in a varchar field as xxx-xx-xxxx sometimes, but others are in xxxxxxxxx format and some scraps are incomplete numbers like xxx-, etc. Any ideas?

  • If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

  • pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    Assuming someone never fat-fingered and left out a final digit.  However, you can certainly run a query ahead of time and see what those results look like.   The max number of possible DISTINCT values is only 10,000,  plus the unique number of shorter deviations or less than perfectly entered SSN values, especially where the length is > 9 and there are no dashes...  In that case, there's no possible way to be sure you have a valid "last 4".   You may want to see what that distinct list looks like, and set up some tests on length and or dashes, just to see if you can at least consistently identify anything that is some form of deviation.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • sgmunson - Wednesday, March 13, 2019 10:16 AM

    pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    Assuming someone never fat-fingered and left out a final digit.  However, you can certainly run a query ahead of time and see what those results look like.   The max number of possible DISTINCT values is only 10,000,  plus the unique number of shorter deviations or less than perfectly entered SSN values, especially where the length is > 9 and there are no dashes...  In that case, there's no possible way to be sure you have a valid "last 4".   You may want to see what that distinct list looks like, and set up some tests on length and or dashes, just to see if you can at least consistently identify anything that is some form of deviation.

    True - but that would mean that your data was bad BEFORE you started, so to some degree you're not making it any worse than it already is.  On paper - you're really no worse off by truncating to the last 4 that you have.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (4) - Wednesday, March 13, 2019 1:00 PM

    sgmunson - Wednesday, March 13, 2019 10:16 AM

    pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    Assuming someone never fat-fingered and left out a final digit.  However, you can certainly run a query ahead of time and see what those results look like.   The max number of possible DISTINCT values is only 10,000,  plus the unique number of shorter deviations or less than perfectly entered SSN values, especially where the length is > 9 and there are no dashes...  In that case, there's no possible way to be sure you have a valid "last 4".   You may want to see what that distinct list looks like, and set up some tests on length and or dashes, just to see if you can at least consistently identify anything that is some form of deviation.

    True - but that would mean that your data was bad BEFORE you started, so to some degree you're not making it any worse than it already is.  On paper - you're really no worse off by truncating to the last 4 that you have.

    My objective was to make that point obvious, and that just trusting what they have might not be prudent, and that identifying and quantifying the extent of the problem could be valuable.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Another damnable company that has SSNs in clear text. :crazy:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    >> I have thousands of social security numbers I need to trim to leave only the last 4 digits... it is in a varchar field as xxx-xx-xxxx sometimes, but others are in xxxxxxxxx format and some scraps are incomplete numbers like xxx-, etc. Any ideas? <<

    The first thing you should do is clean up the data that you've got. In a well-run database, the data is cleaned up and scrubbed before it gets into the tables. I also see no reason that you're using VARCHAR(n), since the SSN is always nine digits.
    ssn CHAR(11) NOT NULL
       CHECK (ssn LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
    Your incomplete and scrap numbers cannot be substringed safely and you're going to need to do some work.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • Matt Miller (4) - Wednesday, March 13, 2019 1:00 PM

    sgmunson - Wednesday, March 13, 2019 10:16 AM

    pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    Assuming someone never fat-fingered and left out a final digit.  However, you can certainly run a query ahead of time and see what those results look like.   The max number of possible DISTINCT values is only 10,000,  plus the unique number of shorter deviations or less than perfectly entered SSN values, especially where the length is > 9 and there are no dashes...  In that case, there's no possible way to be sure you have a valid "last 4".   You may want to see what that distinct list looks like, and set up some tests on length and or dashes, just to see if you can at least consistently identify anything that is some form of deviation.

    True - but that would mean that your data was bad BEFORE you started, so to some degree you're not making it any worse than it already is.  On paper - you're really no worse off by truncating to the last 4 that you have.

    Except that now you've destroyed your inputs....  in general, that's a bad idea.   If you made a backup of the original data, that would be another story.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    >> I have thousands of social security numbers I need to trim to leave only the last 4 digits... it is in a varchar field as xxx-xx-xxxx sometimes, but others are in xxxxxxxxx format and some scraps are incomplete numbers like xxx-, etc. Any ideas? <<

    The first thing you should do is clean up the data that you've got. In a well-run database, the data is cleaned up and scrubbed before it gets into the tables. I also see no reason that you're using VARCHAR(n), since the SSN is always nine digits.
    ssn CHAR(11) NOT NULL
       CHECK (ssn LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
    Your incomplete and scrap numbers cannot be substringed safely and you're going to need to do some work.

    I see no reason at all to waste 2 bytes per row storing dashes.  2B rows = 4B bytes totally wasted.

    ssn could be char(9) or even just an int, either properly encrypted of course.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Thursday, March 14, 2019 3:01 PM

    jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    I see no reason at all to waste 2 bytes per row storing dashes.  2B rows = 4B bytes totally wasted.

    Are these the same two bytes we saved by dropping the century from dates back in the Y2K days?  I'm going to argue for the dashes. The design principle that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    >> I have thousands of social security numbers I need to trim to leave only the last 4 digits... it is in a varchar field as xxx-xx-xxxx sometimes, but others are in xxxxxxxxx format and some scraps are incomplete numbers like xxx-, etc. Any ideas? <<

    The first thing you should do is clean up the data that you've got. In a well-run database, the data is cleaned up and scrubbed before it gets into the tables. I also see no reason that you're using VARCHAR(n), since the SSN is always nine digits.
    ssn CHAR(11) NOT NULL
       CHECK (ssn LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
    Your incomplete and scrap numbers cannot be substringed safely and you're going to need to do some work.

    No... not correct.  The FIRST thing to do is to protect the SSNs and that's not being done here.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jcelko212 32090 - Thursday, March 14, 2019 8:19 PM

    ScottPletcher - Thursday, March 14, 2019 3:01 PM

    jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    I see no reason at all to waste 2 bytes per row storing dashes.  2B rows = 4B bytes totally wasted.

    Are these the same two bytes we saved by dropping the century from dates back in the Y2K days?  I'm going to argue for the dashes. The design principle that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

    You  shouldn't be able to read SSNs in clear text from a table to begin with.  Save the formatting for the display layer.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jcelko212 32090 - Thursday, March 14, 2019 8:19 PM

    ScottPletcher - Thursday, March 14, 2019 3:01 PM

    jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    I see no reason at all to waste 2 bytes per row storing dashes.  2B rows = 4B bytes totally wasted.

    Are these the same two bytes we saved by dropping the century from dates back in the Y2K days?  I'm going to argue for the dashes. The design principle that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

    So you're saying that we should store dates as '2019-03-19'?  That's just not done.  Do we need to store the blanks in credit card #s too, which are usually printed on cards as nnnn nnnn nnnn nnnn to make them easier to read.

    The display has nothing to do with how data is stored.

    The Y2K method only saved one byte, and, yes, it was the correct decision at the time.

    You love living in a completely theoretical world, where the logical design is never converted to physical design.  Back in the real world, we have to follow the normal process of converting a logical design to a physical one, accepting comprises along the way.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Friday, March 15, 2019 9:04 AM

    jcelko212 32090 - Thursday, March 14, 2019 8:19 PM

    ScottPletcher - Thursday, March 14, 2019 3:01 PM

    jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    I see no reason at all to waste 2 bytes per row storing dashes.  2B rows = 4B bytes totally wasted.

    Are these the same two bytes we saved by dropping the century from dates back in the Y2K days?  I'm going to argue for the dashes. The design principle that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

    So you're saying that we should store dates as '2019-03-19'?  That's just not done.  Do we need to store the blanks in credit card #s too, which are usually printed on cards as nnnn nnnn nnnn nnnn to make them easier to read.

    The display has nothing to do with how data is stored.

    The Y2K method only saved one byte, and, yes, it was the correct decision at the time.

    You love living in a completely theoretical world, where the logical design is never converted to physical design.  Back in the real world, we have to follow the normal process of converting a logical design to a physical one, accepting comprises along the way.

    "He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast." -- Leonardo da Vinci

    My first paying IT job was in 1965. That's over 50 years in this trade; my consolation is that I got old by not dying. 🙂 Please don't tell me that I'm all theory. I was a code monkey for over 25 years. However, if you want to yell at me for being all "ANSI/ISO standards", "mathematical correctness", "best practices in the industry", "decades of research behind me" and pedantic as all merry hell, I will agree.

    Basically, at some point, people were paying me to fix the messes that have been made by People like you, who think that it's okay to compromise on things. You have given me most of my later SQL consulting work. I fix bad designs and I'm expensive. But it's cheaper than living with a bad design of somebody wanted to save the cost of some storage, or kludge a program to cover a bad design..

    I also agree that display has nothing to do with storage. Those things we show decimals are stored as binary. Of course we don't know if it's high-end or low-end storage or maybe on a it old Russian three valued machine. However, I like Brent's law that data should be stored the way it's used and use it the way it's stored. This means that a human being can read it, that pattern matching is a lot easier, check digits are easier to compute, etc. It's important to pick the level of abstraction (physical hardware, programming language, particular product, etc.) for your design. In the database, deciding to optimize at the current hardware/software level is always wrong.

    I wrote one of the first articles on the Y2K problems in Information Systems News when I had a regular column. It was all too familiar with what would happen we got to the cusp of 1999 – 2000 and had to decide which decade ambiguous dates would fall. This decision was totally dependent on your data. You would be surprised by the percentage of errors you got trying to add the century. We have a lot of people that live over 90. Now tell me if a birthdate year is 19xx or 20xx in the hospital that treats both geriatrics and pediatrics. Whoever thought this was "a correct decision at the time" bought me a car at the time I got through cleaning of their data. Our slogan was "another day, another K, and we mean take-home pay" for jobs like this.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

Viewing 15 posts - 1 through 15 (of 36 total)

You must be logged in to reply to this topic. Login to reply