SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


trimming SSNs


trimming SSNs

Author
Message
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)

Group: General Forum Members
Points: 984852 Visits: 49389
jcelko212 32090 - Thursday, March 14, 2019 1:44 PM
briancampbellmcad - Wednesday, March 13, 2019 9:50 AM


>> I have thousands of social security numbers I need to trim to leave only the last 4 digits... it is in a varchar field as xxx-xx-xxxx sometimes, but others are in xxxxxxxxx format and some scraps are incomplete numbers like xxx-, etc. Any ideas? <<

The first thing you should do is clean up the data that you've got. In a well-run database, the data is cleaned up and scrubbed before it gets into the tables. I also see no reason that you're using VARCHAR(n), since the SSN is always nine digits.
ssn CHAR(11) NOT NULL
CHECK (ssn LIKE '[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]')
Your incomplete and scrap numbers cannot be substringed safely and you're going to need to do some work.


No... not correct. The FIRST thing to do is to protect the SSNs and that's not being done here.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

When you put the right degree of spin on it, the number 318 is also a glyph that describes the nature of a DBAs job. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)SSC Guru (984K reputation)

Group: General Forum Members
Points: 984852 Visits: 49389
jcelko212 32090 - Thursday, March 14, 2019 8:19 PM
ScottPletcher - Thursday, March 14, 2019 3:01 PM
[quote]
jcelko212 32090 - Thursday, March 14, 2019 1:44 PM
[quote]
briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

I see no reason at all to waste 2 bytes per row storing dashes. 2B rows = 4B bytes totally wasted.
Are these the same two bytes we saved by dropping the century from dates back in the Y2K days? I'm going to argue for the dashes. The design principle that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

You shouldn't be able to read SSNs in clear text from a table to begin with. Save the formatting for the display layer.


--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

When you put the right degree of spin on it, the number 318 is also a glyph that describes the nature of a DBAs job. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
ScottPletcher
ScottPletcher
SSC Guru
SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)

Group: General Forum Members
Points: 96733 Visits: 10336
jcelko212 32090 - Thursday, March 14, 2019 8:19 PM
ScottPletcher - Thursday, March 14, 2019 3:01 PM
[quote]
jcelko212 32090 - Thursday, March 14, 2019 1:44 PM
[quote]
briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

I see no reason at all to waste 2 bytes per row storing dashes. 2B rows = 4B bytes totally wasted.
Are these the same two bytes we saved by dropping the century from dates back in the Y2K days? I'm going to argue for the dashes. The design principle that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

So you're saying that we should store dates as '2019-03-19'? That's just not done. Do we need to store the blanks in credit card #s too, which are usually printed on cards as nnnn nnnn nnnn nnnn to make them easier to read.

The display has nothing to do with how data is stored.

The Y2K method only saved one byte, and, yes, it was the correct decision at the time.

You love living in a completely theoretical world, where the logical design is never converted to physical design. Back in the real world, we have to follow the normal process of converting a logical design to a physical one, accepting comprises along the way.


SQL DBA,SQL Server MVP(07, 08, 09) Prosecutor James Blackburn, in closing argument in the Fatal Vision murders trial:
If in the future, you should cry a tear, cry one for them [the murder victims]. If in the future, you should say a prayer, say one for them. And if in the future, you should light a candle, light one for them.
Joe Celko
Joe Celko
SSCrazy Eights
SSCrazy Eights (8.5K reputation)SSCrazy Eights (8.5K reputation)SSCrazy Eights (8.5K reputation)SSCrazy Eights (8.5K reputation)SSCrazy Eights (8.5K reputation)SSCrazy Eights (8.5K reputation)SSCrazy Eights (8.5K reputation)SSCrazy Eights (8.5K reputation)

Group: General Forum Members
Points: 8534 Visits: 4310
ScottPletcher - Friday, March 15, 2019 9:04 AM
jcelko212 32090 - Thursday, March 14, 2019 8:19 PM
ScottPletcher - Thursday, March 14, 2019 3:01 PM
[quote]
jcelko212 32090 - Thursday, March 14, 2019 1:44 PM
[quote]
briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

I see no reason at all to waste 2 bytes per row storing dashes. 2B rows = 4B bytes totally wasted.
Are these the same two bytes we saved by dropping the century from dates back in the Y2K days? I'm going to argue for the dashes. The design principle that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

So you're saying that we should store dates as '2019-03-19'? That's just not done. Do we need to store the blanks in credit card #s too, which are usually printed on cards as nnnn nnnn nnnn nnnn to make them easier to read.

The display has nothing to do with how data is stored.

The Y2K method only saved one byte, and, yes, it was the correct decision at the time.

You love living in a completely theoretical world, where the logical design is never converted to physical design. Back in the real world, we have to follow the normal process of converting a logical design to a physical one, accepting comprises along the way.
"He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast." -- Leonardo da Vinci

My first paying IT job was in 1965. That's over 50 years in this trade; my consolation is that I got old by not dying. :-) Please don't tell me that I'm all theory. I was a code monkey for over 25 years. However, if you want to yell at me for being all "ANSI/ISO standards", "mathematical correctness", "best practices in the industry", "decades of research behind me" and pedantic as all merry hell, I will agree.

Basically, at some point, people were paying me to fix the messes that have been made by People like you, who think that it's okay to compromise on things. You have given me most of my later SQL consulting work. I fix bad designs and I'm expensive. But it's cheaper than living with a bad design of somebody wanted to save the cost of some storage, or kludge a program to cover a bad design..

I also agree that display has nothing to do with storage. Those things we show decimals are stored as binary. Of course we don't know if it's high-end or low-end storage or maybe on a it old Russian three valued machine. However, I like Brent's law that data should be stored the way it's used and use it the way it's stored. This means that a human being can read it, that pattern matching is a lot easier, check digits are easier to compute, etc. It's important to pick the level of abstraction (physical hardware, programming language, particular product, etc.) for your design. In the database, deciding to optimize at the current hardware/software level is always wrong.

I wrote one of the first articles on the Y2K problems in Information Systems News when I had a regular column. It was all too familiar with what would happen we got to the cusp of 1999 – 2000 and had to decide which decade ambiguous dates would fall. This decision was totally dependent on your data. You would be surprised by the percentage of errors you got trying to add the century. We have a lot of people that live over 90. Now tell me if a birthdate year is 19xx or 20xx in the hospital that treats both geriatrics and pediatrics. Whoever thought this was "a correct decision at the time" bought me a car at the time I got through cleaning of their data. Our slogan was "another day, another K, and we mean take-home pay" for jobs like this.

Please post DDL and follow ANSI/ISO standards when asking for help.
briancampbellmcad
briancampbellmcad
Hall of Fame
Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)

Group: General Forum Members
Points: 3551 Visits: 635
pietlinden - Wednesday, March 13, 2019 9:58 AM
If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

ScottPletcher
ScottPletcher
SSC Guru
SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)SSC Guru (96K reputation)

Group: General Forum Members
Points: 96733 Visits: 10336
jcelko212 32090 - Friday, March 15, 2019 11:40 AM
ScottPletcher - Friday, March 15, 2019 9:04 AM
jcelko212 32090 - Thursday, March 14, 2019 8:19 PM
ScottPletcher - Thursday, March 14, 2019 3:01 PM
[quote]
jcelko212 32090 - Thursday, March 14, 2019 1:44 PM
[quote]
briancampbellmcad - Wednesday, March 13, 2019 9:50 AM

I see no reason at all to waste 2 bytes per row storing dashes. 2B rows = 4B bytes totally wasted.
Are these the same two bytes we saved by dropping the century from dates back in the Y2K days? I'm going to argue for the dashes. The design principle that you want to store data the way it is used and use it the way it's stored. When you see 2019-03-19, you know that it's a date. When you see 252-77-6688, you know that it's a Social Security number. Likewise, 23:00:00 Hrs is clearly a time. The cost of a couple of pieces of punctuation is negligible these days, but the cost of an error in misreading the data is not negligible.

So you're saying that we should store dates as '2019-03-19'? That's just not done. Do we need to store the blanks in credit card #s too, which are usually printed on cards as nnnn nnnn nnnn nnnn to make them easier to read.

The display has nothing to do with how data is stored.

The Y2K method only saved one byte, and, yes, it was the correct decision at the time.

You love living in a completely theoretical world, where the logical design is never converted to physical design. Back in the real world, we have to follow the normal process of converting a logical design to a physical one, accepting comprises along the way.
"He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast." -- Leonardo da Vinci

My first paying IT job was in 1965. That's over 50 years in this trade; my consolation is that I got old by not dying. :-) Please don't tell me that I'm all theory. I was a code monkey for over 25 years. However, if you want to yell at me for being all "ANSI/ISO standards", "mathematical correctness", "best practices in the industry", "decades of research behind me" and pedantic as all merry hell, I will agree.

Basically, at some point, people were paying me to fix the messes that have been made by People like you, who think that it's okay to compromise on things. You have given me most of my later SQL consulting work. I fix bad designs and I'm expensive. But it's cheaper than living with a bad design of somebody wanted to save the cost of some storage, or kludge a program to cover a bad design..

I also agree that display has nothing to do with storage. Those things we show decimals are stored as binary. Of course we don't know if it's high-end or low-end storage or maybe on a it old Russian three valued machine. However, I like Brent's law that data should be stored the way it's used and use it the way it's stored. This means that a human being can read it, that pattern matching is a lot easier, check digits are easier to compute, etc. It's important to pick the level of abstraction (physical hardware, programming language, particular product, etc.) for your design. In the database, deciding to optimize at the current hardware/software level is always wrong.

I wrote one of the first articles on the Y2K problems in Information Systems News when I had a regular column. It was all too familiar with what would happen we got to the cusp of 1999 – 2000 and had to decide which decade ambiguous dates would fall. This decision was totally dependent on your data. You would be surprised by the percentage of errors you got trying to add the century. We have a lot of people that live over 90. Now tell me if a birthdate year is 19xx or 20xx in the hospital that treats both geriatrics and pediatrics. Whoever thought this was "a correct decision at the time" bought me a car at the time I got through cleaning of their data. Our slogan was "another day, another K, and we mean take-home pay" for jobs like this.

And paying for the extra RAM and disks back in the day would have bought 2-5 houses.

And it's not me you're doing re-work for. I do a true logical design, to the constant annoyance of many others on this site, who insists it's not needed (you can just start with an "identity" column and your "table" "design" issues are solved many seem to think). When things change sufficiently, you can go back to the logical design and convert it to a newer physical design.

In the database, deciding to optimize at the current hardware/software level is always wrong.

... Ridiculous. There's literally nothing else you can do. You can't make something that only runs 20 years from that timee (well, those of us who don't just follow theory can't). We need things that work now.

Again, how data is physically stored is 100% irrelevant to how it's displayed, period. Humans don't read binary, and all modern commercial data is stored that way (afaik).

Yes, anyone that insists on storing the dashes within a SSN is just not thinking clearly. What happens as the population grows when it goes to 10 chars? And they decide as a result to format it differently? You're the one that is providing future make-work, and at a loss today too! The Y2K thing saved big resources at the time. You're wanting to waste bytes now and make it harder to refactor in the future, the perfecta of poor design and practice.

Any times I've dealt with contractors they've been virtually useless. Smart people, many, but useless for actually getting anything done. They do produce directories full of "recommendations" and "documentation", but nothing you can actually use to engineer a working system.

The people that compromise when needed are the only people who ever produce anything. If MS had waited until Windows was perfect, we'd still be waiting on it. If IBM had waited until their relational implementation was perfect, System R and everything that followed would not have already happened.

It's like saying "Don't reinvent the wheel." The wheel's been reinvented hundreds of times, or we'd all have wooden wheels with no rims.


SQL DBA,SQL Server MVP(07, 08, 09) Prosecutor James Blackburn, in closing argument in the Fatal Vision murders trial:
If in the future, you should cry a tear, cry one for them [the murder victims]. If in the future, you should say a prayer, say one for them. And if in the future, you should light a candle, light one for them.
Lynn Pettis
Lynn Pettis
SSC Guru
SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)

Group: General Forum Members
Points: 439248 Visits: 44200
briancampbellmcad - Friday, March 15, 2019 12:19 PM
pietlinden - Wednesday, March 13, 2019 9:58 AM
If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

That looks more like the results of a LEFT([oldSSN],4).


Cool
Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
briancampbellmcad
briancampbellmcad
Hall of Fame
Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)Hall of Fame (3.6K reputation)

Group: General Forum Members
Points: 3551 Visits: 635
Lynn Pettis - Friday, March 15, 2019 1:36 PM
briancampbellmcad - Friday, March 15, 2019 12:19 PM
pietlinden - Wednesday, March 13, 2019 9:58 AM
If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

That looks more like the results of a LEFT([oldSSN],4).

Same results... note that this is stored a varchar not integer

Lynn Pettis
Lynn Pettis
SSC Guru
SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)

Group: General Forum Members
Points: 439248 Visits: 44200
briancampbellmcad - Friday, March 15, 2019 1:42 PM
Lynn Pettis - Friday, March 15, 2019 1:36 PM
briancampbellmcad - Friday, March 15, 2019 12:19 PM
pietlinden - Wednesday, March 13, 2019 9:58 AM
If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

That looks more like the results of a LEFT([oldSSN],4).

Same results... note that this is stored a varchar not integer

Okay, you are going to have to show me your code. LEFT and RIGHT are string functions so we are assuming that the data is stored as character data.


Cool
Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Lynn Pettis
Lynn Pettis
SSC Guru
SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)SSC Guru (439K reputation)

Group: General Forum Members
Points: 439248 Visits: 44200
Lynn Pettis - Friday, March 15, 2019 4:14 PM
briancampbellmcad - Friday, March 15, 2019 1:42 PM
Lynn Pettis - Friday, March 15, 2019 1:36 PM
briancampbellmcad - Friday, March 15, 2019 12:19 PM
pietlinden - Wednesday, March 13, 2019 9:58 AM
If you use RIGHT([oldSSN],4) then the rest shouldn't matter, because you'll never get any of the dashes anyway, because they would be 5th from the right.

RIGHT([oldSSN],4) just gives me the first 3 characters and a dash, e.g. 235-55-7777 becomes 235-

That looks more like the results of a LEFT([oldSSN],4).

Same results... note that this is stored a varchar not integer

Okay, you are going to have to show me your code. LEFT and RIGHT are string functions so we are assuming that the data is stored as character data.

Using your single value sample:

IF OBJECT_ID('dbo.TestSSNTrim','U') IS NOT NULL
DROP TABLE [dbo].[TestSSNTrim];

CREATE TABLE [dbo].[TestSSNTrim](
[OldSSN] varchar(11));

INSERT INTO [dbo].[TestSSNTrim]
(
[OldSSN]
)
VALUES
(
'235-55-7777' -- OldSSN - varchar(11)
);

SELECT
[tst].[OldSSN]
, LeftTrim = LEFT([tst].[OldSSN],4)
, RightTrim = RIGHT([tst].[OldSSN],4)
FROM
[dbo].[TestSSNTrim] AS [tst];

IF OBJECT_ID('dbo.TestSSNTrim','U') IS NOT NULL
DROP TABLE [dbo].[TestSSNTrim];
GO


OldSSN LeftTrim RightTrim
----------- -------- ---------
235-55-7777 235- 7777



Cool
Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum









































































































































































SQLServerCentral


Search