trimming SSNs

  • jcelko212 32090 - Friday, March 15, 2019 11:40 AM

    ScottPletcher - Friday, March 15, 2019 9:04 AM

    jcelko212 32090 - Thursday, March 14, 2019 8:19 PM

    ScottPletcher - Thursday, March 14, 2019 3:01 PM

    jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    I see no reason at all to waste 2 bytes per row storing dashes.  2B rows = 4B bytes totally wasted.

    Are these the same two bytes we saved by dropping the century from dates back in the Y2K days?  I'm going to argue for the dashes. The design principle is that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

    So you're saying that we should store dates as '2019-03-19'?  That's just not done.  Do we need to store the blanks in credit card #s too, since they are usually printed on cards as nnnn nnnn nnnn nnnn to make them easier to read?

    The display has nothing to do with how data is stored.

    The Y2K method only saved one byte, and, yes, it was the correct decision at the time.

    You love living in a completely theoretical world, where the logical design is never converted to physical design.  Back in the real world, we have to follow the normal process of converting a logical design to a physical one, accepting compromises along the way.

    "He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast." -- Leonardo da Vinci

    My first paying IT job was in 1965. That's over 50 years in this trade; my consolation is that I got old by not dying. 🙂 Please don't tell me that I'm all theory. I was a code monkey for over 25 years. However, if you want to yell at me for being all "ANSI/ISO standards", "mathematical correctness", "best practices in the industry", "decades of research behind me" and pedantic as all merry hell, I will agree.

    Basically, at some point, people were paying me to fix the messes that had been made by people like you, who think that it's okay to compromise on things. You have given me most of my later SQL consulting work. I fix bad designs and I'm expensive. But it's cheaper than living with a bad design because somebody wanted to save the cost of some storage, or kludge a program to cover a bad design.

    I also agree that display has nothing to do with storage. Those things we show as decimals are stored as binary. Of course we don't know if it's high-end or low-end storage or maybe on an old Russian three-valued machine. However, I like Brent's law that data should be stored the way it's used and used the way it's stored. This means that a human being can read it, that pattern matching is a lot easier, check digits are easier to compute, etc. It's important to pick the level of abstraction (physical hardware, programming language, particular product, etc.) for your design. In the database, deciding to optimize at the current hardware/software level is always wrong.

    I wrote one of the first articles on the Y2K problems in Information Systems News when I had a regular column. I was all too familiar with what would happen when we got to the cusp of 1999 – 2000 and had to decide into which century ambiguous dates would fall. This decision was totally dependent on your data. You would be surprised by the percentage of errors you got trying to add the century. We have a lot of people that live over 90. Now tell me if a birthdate year is 19xx or 20xx in the hospital that treats both geriatrics and pediatrics. Whoever thought this was "a correct decision at the time" bought me a car by the time I got through cleaning up their data. Our slogan was "another day, another K, and we mean take-home pay" for jobs like this.

    And paying for the extra RAM and disks back in the day would have bought 2-5 houses.

    And it's not me you're doing re-work for.  I do a true logical design, to the constant annoyance of many others on this site, who insist it's not needed (many seem to think you can just start with an "identity" column and your "table" "design" issues are solved).  When things change sufficiently, you can go back to the logical design and convert it to a newer physical design.

    In the database, deciding to optimize at the current hardware/software level is always wrong. 

    ... Ridiculous.  There's literally nothing else you can do.  You can't make something that only runs 20 years from that time (well, those of us who don't just follow theory can't).  We need things that work now.

    Again, how data is physically stored is 100% irrelevant to how it's displayed, period.  Humans don't read binary, and all modern commercial data is stored that way (afaik).
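
    As a minimal illustration of the storage-versus-display point above, here is a sketch that assumes the SSN is stored as nine digits in a char(9) value and formats it only at query time. The variable name @ssn and the sample value are hypothetical.

    ```sql
    -- Hypothetical sample value: nine digits stored without dashes.
    DECLARE @ssn char(9) = '235557777';

    -- STUFF with a length of 0 inserts text without deleting anything.
    -- The inner call adds the second dash first so the positions for the outer call stay valid.
    SELECT Display = STUFF(STUFF(@ssn, 6, 0, '-'), 4, 0, '-');  -- 235-55-7777
    ```

    The same expression could sit in a view or computed column, so the storage format and the display format can change independently.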

    Yes, anyone that insists on storing the dashes within a SSN is just not thinking clearly.  What happens as the population grows when it goes to 10 chars?  And they decide as a result to format it differently?  You're the one that is providing future make-work, and at a loss today too!  The Y2K thing saved big resources at the time.  You're wanting to waste bytes now and make it harder to refactor in the future, the perfecta of poor design and practice.

    Any time I've dealt with contractors they've been virtually useless.  Smart people, many, but useless for actually getting anything done.  They do produce directories full of "recommendations" and "documentation", but nothing you can actually use to engineer a working system.

    The people that compromise when needed are the only people who ever produce anything.  If MS had waited until Windows was perfect, we'd still be waiting on it.  If IBM had waited until their relational implementation was perfect, System R and everything that followed would not have already happened.

    It's like saying "Don't reinvent the wheel."  The wheel's been reinvented hundreds of times, or we'd all have wooden wheels with no rims.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • briancampbellmcad - Friday, March 15, 2019 12:19 PM

    pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

    That looks more like the results of a LEFT([oldSSN],4).

  • Lynn Pettis - Friday, March 15, 2019 1:36 PM

    briancampbellmcad - Friday, March 15, 2019 12:19 PM

    pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

    That looks more like the results of a LEFT([oldSSN],4).

    Same results... note that this is stored as a varchar, not an integer

  • briancampbellmcad - Friday, March 15, 2019 1:42 PM

    Lynn Pettis - Friday, March 15, 2019 1:36 PM

    briancampbellmcad - Friday, March 15, 2019 12:19 PM

    pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

    That looks more like the results of a LEFT([oldSSN],4).

    Same results... note that this is stored as a varchar, not an integer

    Okay, you are going to have to show me your code.  LEFT and RIGHT are string functions so we are assuming that the data is stored as character data.

  • Lynn Pettis - Friday, March 15, 2019 4:14 PM

    briancampbellmcad - Friday, March 15, 2019 1:42 PM

    Lynn Pettis - Friday, March 15, 2019 1:36 PM

    briancampbellmcad - Friday, March 15, 2019 12:19 PM

    pietlinden - Wednesday, March 13, 2019 9:58 AM

    If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

    RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

    That looks more like the results of a LEFT([oldSSN],4).

    Same results... note that this is stored as a varchar, not an integer

    Okay, you are going to have to show me your code.  LEFT and RIGHT are string functions so we are assuming that the data is stored as character data.

    Using your single value sample:

    -- Drop the test table if it is left over from a previous run.
    IF OBJECT_ID('dbo.TestSSNTrim','U') IS NOT NULL
        DROP TABLE [dbo].[TestSSNTrim];

    CREATE TABLE [dbo].[TestSSNTrim](
        [OldSSN] varchar(11));

    INSERT INTO [dbo].[TestSSNTrim]
        ([OldSSN])
    VALUES
        ('235-55-7777'); -- OldSSN - varchar(11)

    -- LEFT returns the first four characters, RIGHT the last four.
    SELECT
        [tst].[OldSSN]
        , LeftTrim = LEFT([tst].[OldSSN],4)
        , RightTrim = RIGHT([tst].[OldSSN],4)
    FROM
        [dbo].[TestSSNTrim] AS [tst];

    -- Clean up the test table.
    IF OBJECT_ID('dbo.TestSSNTrim','U') IS NOT NULL
        DROP TABLE [dbo].[TestSSNTrim];
    GO


    OldSSN         LeftTrim RightTrim
    -----------    -------- ---------
    235-55-7777    235-     7777

  • All SSN handling/validation should be done in code, before being sent to the database. Is it only one type of SSN? Otherwise an OID is needed to define the type of SSN.
    For the examples above, maybe use char(11).
    For security, add SQL table or database encryption.

  • jonas.gunnarsson 52434 - Saturday, March 16, 2019 12:12 PM

    All SSN handling/validation should be done in code, before being sent to the database. Is it only one type of SSN? Otherwise an OID is needed to define the type of SSN.
    For the examples above, maybe use char(11).
    For security, add SQL table or database encryption.

    The SSN is a mess from a data design viewpoint. There is no check digit. As of 2011, parts of it are random.  You really have to use some kind of look-up to avoid dead people, and unissued numbers. 

    https://en.wikipedia.org/wiki/Social_Security_number

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jonas.gunnarsson 52434 - Saturday, March 16, 2019 12:12 PM

    All SSN handling/validation should be done in code, before being sent to the database. Is it only one type of SSN? Otherwise an OID is needed to define the type of SSN.
    For the examples above, maybe use char(11).
    For security, add SQL table or database encryption.

    I'm missing it... how will an OID help here?  And what if there is no "code" for "All SSN handling/validation"?    What if the inputs are coming from a file where the creators of the file did no such validation?

    And why in the hell are people still helping someone that admits the SSNs are stored as plain text in a VARCHAR() column?  :angry:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Saturday, March 16, 2019 2:22 PM

    I'm missing it... how will an OID help here?  And what if there is no "code" for "All SSN handling/validation"?    What if the inputs are coming from a file where the creators of the file did no such validation?

    And why in the hell are people still helping someone that admits the SSNs are stored as plain text in a VARCHAR() column?  :angry:

    An OID may be useful if you handle different kinds of person numbers. The number in the example is OID=2.16.840.1.113883.4.1
    You should not accept these kinds of values from an unknown origin; they should be validated against the official master database for the values.
    All types of sensitive data should be encrypted. I guess encrypting the database (TDE) would be sufficient?

  • jonas.gunnarsson 52434 - Sunday, March 17, 2019 10:50 AM

    Jeff Moden - Saturday, March 16, 2019 2:22 PM

    I'm missing it... how will an OID help here?  And what if there is no "code" for "All SSN handling/validation"?    What if the inputs are coming from a file where the creators of the file did no such validation?

    And why in the hell are people still helping someone that admits the SSNs are stored as plain text in a VARCHAR() column?  :angry:

    An OID may be useful if you handle different kinds of person numbers. The number in the example is OID=2.16.840.1.113883.4.1
    You should not accept these kinds of values from an unknown origin; they should be validated against the official master database for the values.
    All types of sensitive data should be encrypted. I guess encrypting the database (TDE) would be sufficient?

    Are you saying that the OID has a meaning similar to SSN in that it's supposed to uniquely identify someone?  Are you also saying that there's some "master database" that already exists or are you talking about building one?  And are you also saying that OIDs should be encrypted in our local databases???

    Also, if the OIDs are imported from 3rd party sources along with other customer data, what guarantee do you have that the originator of the data did it right?  Just like SSNs, I'm thinking you don't.

    The bottom line here is that I'm not seeing an advantage to OIDs over SSNs when it comes to databases even if they were accurate down to the length of a person's intestinal tract.

    And, no... TDE isn't actually sufficient by itself because people on the inside can still get to the data if the sensitive data doesn't have an extra layer of protection to protect against people on the inside, which is a bit more difficult to accomplish.

    --Jeff Moden



  • Jeff Moden - Thursday, March 14, 2019 8:31 PM

    jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    >> I have thousands of social security numbers I need to trim to leave only the last 4 digits... it is in a varchar field as xxx-xx-xxxx sometimes, but others are in xxxxxxxxx format and some scraps are incomplete numbers like xxx-, etc. Any ideas? <<

    The first thing you should do is clean up the data that you've got. In a well-run database, the data is cleaned up and scrubbed before it gets into the tables. I also see no reason that you're using VARCHAR(n), since the SSN is always nine digits.
    ssn CHAR(11) NOT NULL
       CHECK (ssn LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
    Your incomplete and scrap numbers cannot be substringed safely and you're going to need to do some work.

    No... not correct.  The FIRST thing to do is to protect the SSNs and that's not being done here.

    All the data is in 111-11-1111 text format. I didn't design the frontend or the backend or this would not even be an issue. The frontend cannot be modified as it is compiled and the design and display are locked. I did discover that RIGHT([ssno],4) works on the SQL Server side; however, the display keeps it at 111-. So what I would like to do is pad each resulting 111- style number to make it display as xxx-xx-1111. Is this possible?

  • briancampbellmcad - Monday, March 18, 2019 1:44 PM

    Jeff Moden - Thursday, March 14, 2019 8:31 PM

    jcelko212 32090 - Thursday, March 14, 2019 1:44 PM

    briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

    >> I have thousands of social security numbers I need to trim to leave only the last 4 digits... it is in a varchar field as xxx-xx-xxxx sometimes, but others are in xxxxxxxxx format and some scraps are incomplete numbers like xxx-, etc. Any ideas? <<

    The first thing you should do is clean up the data that you've got. In a well-run database, the data is cleaned up and scrubbed before it gets into the tables. I also see no reason that you're using VARCHAR(n), since the SSN is always nine digits.
    ssn CHAR(11) NOT NULL
       CHECK (ssn LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
    Your incomplete and scrap numbers cannot be substringed safely and you're going to need to do some work.

    No... not correct.  The FIRST thing to do is to protect the SSNs and that's not being done here.

    All the data is in 111-11-1111 text format. I didn't design the frontend or the backend or this would not even be an issue. The frontend cannot be modified as it is compiled and the design and display are locked. I did discover that RIGHT([ssno],4) works on the SQL Server side; however, the display keeps it at 111-. So what I would like to do is pad each resulting 111- style number to make it display as xxx-xx-1111. Is this possible?

    SELECT 'xxx-xx-' + RIGHT([ssno], 4)
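
    The one-liner above assumes every value is complete. For the mixed formats described in the original question (dashed, undashed, and scraps such as xxx-), a more defensive sketch might look like the following. The table variable @Sample, its rows, and the choice to return NULL for incomplete values are assumptions for illustration only.

    ```sql
    -- Hypothetical sample data covering the formats mentioned in the original question.
    DECLARE @Sample TABLE (ssno varchar(11));
    INSERT INTO @Sample (ssno)
    VALUES ('235-55-7777'), ('235557777'), ('235-'), (NULL);

    SELECT s.ssno,
           Masked = CASE
                        -- Mask only when exactly nine digits remain after removing dashes.
                        WHEN d.Digits LIKE REPLICATE('[0-9]', 9)
                        THEN 'xxx-xx-' + RIGHT(d.Digits, 4)
                        -- Scrap or incomplete value: return NULL rather than guess.
                        ELSE NULL
                    END
    FROM @Sample AS s
    CROSS APPLY (SELECT Digits = REPLACE(s.ssno, '-', '')) AS d;
    ```

    The first two rows come back as xxx-xx-7777 and the last two as NULL, which makes the rows that still need manual cleanup easy to find.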

  • I would like to emphasize some things that have already been said.
    First thing you should do is remove the dashes and just keep the digits.
    Second thing is create 2 new columns, one to store the last 4 digits of the SSN and one to store the encrypted full SSN.
    Last thing is to drop the column with the SSN in plain text.

    OK, the second thing might need much more work than is specified here (create keys, back them up, change procedures and queries to retrieve the SSN, etc.), but this is for everyone's safety. I suggest having 2 columns to prevent unnecessary use of resources and to further control who has access to the complete number.
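
    A rough sketch of those steps using SQL Server's built-in column encryption functions is shown below. It is not taken from a production system: the table dbo.Person, the column names, the key and certificate names, and the password placeholder are all hypothetical, and key backup and access control are left out.

    ```sql
    -- All object names and the password below are hypothetical placeholders.
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<use a strong password here>';
    CREATE CERTIFICATE SsnCert WITH SUBJECT = 'SSN column encryption';
    CREATE SYMMETRIC KEY SsnKey
        WITH ALGORITHM = AES_256
        ENCRYPTION BY CERTIFICATE SsnCert;
    GO

    -- One column for the last 4 digits, one for the encrypted full SSN.
    ALTER TABLE dbo.Person
        ADD SsnLast4     char(4)        NULL,
            SsnEncrypted varbinary(128) NULL;
    GO

    OPEN SYMMETRIC KEY SsnKey DECRYPTION BY CERTIFICATE SsnCert;

    -- Strip the dashes, then populate both columns for complete values only.
    UPDATE p
    SET    SsnLast4     = RIGHT(d.Digits, 4),
           SsnEncrypted = ENCRYPTBYKEY(KEY_GUID('SsnKey'), d.Digits)
    FROM   dbo.Person AS p
    CROSS APPLY (SELECT Digits = REPLACE(p.ssno, '-', '')) AS d
    WHERE  d.Digits LIKE REPLICATE('[0-9]', 9);

    CLOSE SYMMETRIC KEY SsnKey;
    GO

    -- Only after verifying the new columns and backing up the certificate and keys:
    -- ALTER TABLE dbo.Person DROP COLUMN ssno;
    ```

    Reading the full number back would use DECRYPTBYKEY with the key open, so access to the complete SSN can be limited to principals allowed to use the certificate.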

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares - Monday, March 18, 2019 2:27 PM

    I would like to emphasize some things that have already been said.
    First thing you should do is remove the dashes and just keep the digits.
    Second thing is create 2 new columns, one to store the last 4 digits of the SSN and one to store the encrypted full SSN.
    Last thing is to drop the column with the SSN in plain text.

    OK, the second thing might need much more work than is specified here (create keys, back them up, change procedures and queries to retrieve the SSN, etc.), but this is for everyone's safety. I suggest having 2 columns to prevent unnecessary use of resources and to further control who has access to the complete number.

    I think you should encrypt the last 4 digits of the SSN as well.  Even just the last 4 bytes is far too sensitive to be allowed in plain text.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • For all of you advocating encryption, how do you store SSNs?  Fortunately I don't deal with data requiring this level of protection, but I'm curious how you do it.

