
Tally OH! An Improved SQL 8K “CSV Splitter” Function
Posted Tuesday, September 11, 2012 6:07 AM
Grasshopper


Group: General Forum Members
Last Login: Thursday, October 16, 2014 6:23 PM
Points: 14, Visits: 225
Paul,

I didn't take offence. I was similarly concerned that I'd been a bit abrupt, writing in a hurry late at night.

Can you avoid the conversion to Unicode by casting to a byte array?

-mark
Post #1357344
Posted Tuesday, September 11, 2012 6:13 AM
Grasshopper


Group: General Forum Members
Last Login: Thursday, October 16, 2014 6:23 PM
Points: 14, Visits: 225
Good point. The function should cast the output back with COLLATE DATABASE_DEFAULT. Another plus for CLR?

-mark

peter-757102 (9/11/2012)
The recent discussion has hit an interesting topic, namely that of collations. While I can subscribe to the idea that it is OK for the delimiter to be compared by binary means, I don't like to see side effects on the function output itself when that is further processed. Having to specify collations everywhere would make the function cumbersome to use. It might even throw off the optimizer, preventing index use due to collation differences.

An example scenario could be this:

The outcome of the splitter is used to match the substrings against a table of specific words (indexed), and for each word we count the number of occurrences in the original delimited input string. Assuming both the words in that table and the original string are in the same collation, the code should be completely transparent in terms of collations, else it loses much of its appeal.
Post #1357347
Posted Tuesday, September 11, 2012 8:32 AM
SSC-Addicted


Group: General Forum Members
Last Login: Monday, May 26, 2014 10:11 AM
Points: 452, Visits: 820
SQL Kiwi (9/11/2012)
m.t.cleary (9/11/2012)
I agree that the CLR has additional and desirable features, but if we're comparing apples we need to make sure the functionality we are comparing is the same. My first test with a new variant is to verify it works properly using a full outer join against the results of a known good splitter and compare.

Agreed. I was surprised no-one picked up the difference before (I guess it goes to show how many people even tried the CLR solution).


I did try it, but only to measure the execution speed. If I had been allowed to use it, I might have picked up this difference as well. But seriously, it was a nice catch. Another thing I just noticed is that the return data type for the item is nvarchar(4000). Should it not be nvarchar(max)?
Post #1357468
Posted Tuesday, September 11, 2012 8:37 AM
SSC-Addicted


Group: General Forum Members
Last Login: Monday, May 26, 2014 10:11 AM
Points: 452, Visits: 820
m.t.cleary (9/11/2012)
Jeff,

Usman’s N1 splitter is in http://www.sqlservercentral.com/Forums/FindPost1303895.aspx

On the subject of benchmark frameworks, I have some sympathy for storing the output. Splitters are often used where you are going to insert the results, and real-world measurements are the ultimate test. I was playing with Usman's splitter trying to work out where the speed differences are, because the tests you'd done with physical tables seemed to indicate CTEs were faster. I found that storing the output increased the variability and masked the small differences I was looking for.

What I learnt was:
• If you use a physical tally table, WITH (NOLOCK) makes quite a difference.



I do not think this is the major contributor. No string concatenation (which was part of the old physical tally table splitter) and use of data type INT instead of BIGINT (so no implicit conversions when dealing with up to varchar(8000) strings) seem to help a lot.
Post #1357471
Posted Tuesday, September 11, 2012 8:41 AM
SSC-Addicted


Group: General Forum Members
Last Login: Monday, May 26, 2014 10:11 AM
Points: 452, Visits: 820
m.t.cleary (9/11/2012)
peter-757102 (9/11/2012)
The recent discussion has hit an interesting topic, namely that of collations. While I can subscribe to the idea that it is OK for the delimiter to be compared by binary means, I don't like to see side effects on the function output itself when that is further processed. Having to specify collations everywhere would make the function cumbersome to use. It might even throw off the optimizer, preventing index use due to collation differences.

An example scenario could be this:

The outcome of the splitter is used to match the substrings against a table of specific words (indexed), and for each word we count the number of occurrences in the original delimited input string. Assuming both the words in that table and the original string are in the same collation, the code should be completely transparent in terms of collations, else it loses much of its appeal.

Good point. The function should cast the output back with COLLATE DATABASE_DEFAULT. Another plus for CLR?

-mark



I do not see the reason why. The binary collation is applied only for the delimiter comparison; the split items themselves keep the same collation.
Post #1357474
Posted Tuesday, September 11, 2012 8:50 AM
SSC-Addicted


Group: General Forum Members
Last Login: Monday, May 26, 2014 10:11 AM
Points: 452, Visits: 820
peter-757102 (9/11/2012)
Having to specify collations everywhere would make the function cumbersome to use. It might even throw off the optimizer, preventing index use due to collation differences.

An example scenario could be this:

The outcome of the splitter is used to match the substrings against a table of specific words (indexed), and for each word we count the number of occurrences in the original delimited input string. Assuming both the words in that table and the original string are in the same collation, the code should be completely transparent in terms of collations, else it loses much of its appeal.


Why would there be a need to specify collations everywhere? To clear up the ambiguity, this is the part of the code where the comparison is done using the binary collation (only for the delimiter, so it can be used in most cases):

SUBSTRING(@pString, (N+1),
    ISNULL(NULLIF(CHARINDEX(@pDelimiter,
        @pString COLLATE Latin1_General_BIN, (N+1)), 0)
    - (N+1), 8000))

And this is the part where the original string is split into items

SUBSTRING(@pString, (N+1),
ISNULL(NULLIF(CHARINDEX(@pDelimiter ,
@pString COLLATE Latin1_General_BIN, (N+1)), 0)
- (N+1), 8000) )

So I do not think we are losing the default collation.
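To illustrate with a minimal, self-contained sketch (variable names follow the function; the literal values are mine): the COLLATE clause is applied only to the argument inside CHARINDEX, so the SUBSTRING result keeps the collation of @pString itself.

DECLARE @pString    VARCHAR(8000) = 'abc,def,ghi',
        @pDelimiter CHAR(1)       = ',',
        @N          INT           = 0;  -- position just before the first element

SELECT Item = SUBSTRING(@pString, (@N+1),
                  ISNULL(NULLIF(CHARINDEX(@pDelimiter,
                      @pString COLLATE Latin1_General_BIN, (@N+1)), 0)
                  - (@N+1), 8000));
-- Only the comparison was re-collated; the Item column comes back
-- in the collation of @pString (the database default for a variable).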
Post #1357484
Posted Tuesday, September 11, 2012 8:58 AM
Old Hand


Group: General Forum Members
Last Login: 2 days ago @ 6:21 AM
Points: 334, Visits: 2,273
Usman Butt (9/11/2012)
peter-757102 (9/11/2012)
Having to specify collations everywhere would make the function cumbersome to use. It might even throw off the optimizer, preventing index use due to collation differences.

An example scenario could be this:

The outcome of the splitter is used to match the substrings against a table of specific words (indexed), and for each word we count the number of occurrences in the original delimited input string. Assuming both the words in that table and the original string are in the same collation, the code should be completely transparent in terms of collations, else it loses much of its appeal.


Why would there be a need to specify collations everywhere? To clear up the ambiguity, this is the part of the code where the comparison is done using the binary collation (only for the delimiter, so it can be used in most cases):

SUBSTRING(@pString, (N+1),
    ISNULL(NULLIF(CHARINDEX(@pDelimiter,
        @pString COLLATE Latin1_General_BIN, (N+1)), 0)
    - (N+1), 8000))

And this is the part where the original string is split into items

SUBSTRING(@pString, (N+1),
    ISNULL(NULLIF(CHARINDEX(@pDelimiter,
        @pString COLLATE Latin1_General_BIN, (N+1)), 0)
    - (N+1), 8000))

So I do not think we are losing the default collation.


My concern was just a general remark about such routines, not based on any specific implementation. I explicitly agreed that a binary comparison for the delimiter made sense. My concern is only about the final output, which I like to see in the original collation, or that of the database default when the input is a constant or variable.

Just to make further processing as clean as possible and able to use indexes without collation specifiers. If your implementation takes care of all that, all the better!

It is easy to test for any function. Just output the result to a table using the SELECT ... INTO TableX ... FROM ... construct and then check the collation on the columns using sp_help TableX.
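Something like this, as a minimal sketch of the test described above (TableX is a throw-away name; dbo.DelimitedSplit8K is the splitter from the article):

SELECT s.ItemNumber, s.Item
INTO   TableX
FROM   dbo.DelimitedSplit8K('a,b,c', ',') AS s;

EXEC sp_help 'TableX';   -- look at the Collation column reported for Item
DROP TABLE TableX;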
Post #1357492
Posted Tuesday, September 11, 2012 10:39 AM


SSCrazy Eights


Group: General Forum Members
Last Login: Thursday, November 20, 2014 7:53 PM
Points: 9,928, Visits: 11,194
Usman Butt (9/11/2012)
Another thing I just noticed is the return data type for the item is nvarchar(4000). Should not it be nvarchar(max)?

Not for me, no

The source code is there for anyone that does want to split strings where elements may be > 4000 characters.




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #1357578
Posted Tuesday, September 11, 2012 4:21 PM


SSCrazy Eights


Group: General Forum Members
Last Login: Thursday, November 20, 2014 7:53 PM
Points: 9,928, Visits: 11,194
m.t.cleary (9/11/2012)
Can you avoid the conversion to Unicode by casting to a byte array?

Yes, but it all gets very inconvenient from there on in - for example working out whether the byte stream was originally Unicode or not.
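A quick illustration of the ambiguity (assuming the default code page): the same text yields different byte streams as varchar versus nvarchar, and nothing in the bytes themselves records which one you started with.

SELECT CAST('AB'  AS VARBINARY(10)) AS VarcharBytes,   -- 0x4142
       CAST(N'AB' AS VARBINARY(10)) AS NvarcharBytes;  -- 0x41004200 (UTF-16LE)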




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #1357740
Posted Wednesday, September 12, 2012 7:43 AM
SSC-Addicted


Group: General Forum Members
Last Login: Monday, May 26, 2014 10:11 AM
Points: 452, Visits: 820
SQL Kiwi (9/11/2012)
Usman Butt (9/11/2012)
Another thing I just noticed is the return data type for the item is nvarchar(4000). Should not it be nvarchar(max)?

Not for me, no

The source code is there for anyone that does want to split strings where elements may be > 4000 characters.


hahahaha.. Yes, one can change it since the source code is available. But since it is to be taken as the generic splitter, it should handle strings with elements of any length. Should say we would be comparing oranges with oranges when it comes to varchar(max) versions.

But on a serious note, is there any performance degradation if nvarchar(max) is used? I do not remember the link, but I read somewhere that nvarchar(max) can sometimes speed up execution compared to nvarchar(N), though it may depend on the logic of the code. BTW, in my initial testing, I did see some improved performance if I changed the return type to, let's say, nvarchar(20).
Post #1357975