Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Tom.Thomson (5/13/2011)


    erroneous post - content deleted

    You make me curious 😉

  • peter-757102 (5/13/2011)


    Tom.Thomson (5/13/2011)


    erroneous post - content deleted

    You make me curious 😉

    I was thinking of trying with a quick approximation for sqrt which doesn't need any real numbers, but as soon as I posted the suggestion I realised that this is one of those cases where a quick and dirty approximation will hinder rather than help.

    Tom

  • Tom.Thomson (5/14/2011)


    peter-757102 (5/13/2011)


    Tom.Thomson (5/13/2011)


    erroneous post - content deleted

    You make me curious 😉

    I was thinking of trying with a quick approximation for sqrt which doesn't need any real numbers, but as soon as I posted the suggestion I realised that this is one of those cases where a quick and dirty approximation will hinder rather than help.

    Yeah, I mentioned the result of my tests with that a few posts back. It did reduce reported memory consumption, but i am not sure the reported memory is actually claimed...i need to develop some test for that really.

    In the meantime I did come up with a new function that uses nearly no memory and works a quite a bit faster for strings with ** very few ** elements. The trouble is, the number of few elements it can handle faster is too limited (break even at 5 elements), one has to wonder if it makes sense to have such a over specialized function. In theory the function should be incredibly fast, I have yet to find what exactly causes it to under perform to my expectations. lets hope I do find the cause as scaling it up to 16 characters or so would make it useful.

    The function quickly determines the first few separator positions using nothing but the minimum number of charindex function calls. So on a string with two separators (3 elements) it requires just 3 function calls and creates no substrings to test individually. It also works fully with collations like any other SQL code would and does not even need to know the length of the input string. The old method needs to know that and also inspect every character of the string plus perform a a charindex for each element detected. You could view this as an in-line loop unrolled version of a recursive CTE without the overhead and with a limit of how many elements will be recognized.

    Another avenue I am going to test in an attempt to reduce function call overhead is to modify the tally CTE in such a way, that it skips all characters before the first separator (based on a single charindex call). I am just hoping at this point it wont run into similar issues as the above mentioned version and suddenly becomes slower for the majority of cases.

    But life ain't fair in T-SQL land and I wish SQL Server had some better build-in functional extensibility as it does now. While CLR certainly works on a technical level, it is far from a perfect merger with SQL Server. Ideal would be to have basic programming capabilities comparable to that of CLR, but a more domain specific interface to T-SQL, and build into SQL Server itself completely. So no need for an external compiler and with an interface that makes streaming data the norm rather then something you have to work hard towards yourself. There are plenty of script languages that compile to well performing code on the fly, why can't we have one in SQL Server for this purpose is what I wonder. Maybe Microsoft is blinded and only wants to promote the CLR as the ideal solution for everything...which it obviously cannot be.

    Here the experimental code so far for the small number of elements splitter, maybe someone else has an idea what causes it do perform less then can be expected:

    create FUNCTION dbo.DelimitedSplit8K_T2( @pString VARCHAR(8000), @pDelimiter CHAR(1) )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    select

    ItemNumber = r.r

    , Item = substring( @pString, r.s, isnull( nullif( r.n, 0 ), 8001 ) - r.s )

    from

    ( select 1, charindex( @pDelimiter, @pString, 1 ) ) s1( s, n )

    outer apply( select top( isnull( sign( s1.n ), 0 ) ) s1.n + 1, charindex( @pDelimiter, @pString, s1.n + 1 ) ) s2( s, n )

    outer apply( select top( isnull( sign( s2.n ), 0 ) ) s2.n + 1, charindex( @pDelimiter, @pString, s2.n + 1 ) ) s3( s, n )

    outer apply( select top( isnull( sign( s3.n ), 0 ) ) s3.n + 1, charindex( @pDelimiter, @pString, s3.n + 1 ) ) s4( s, n )

    cross apply

    (

    select 1, s1.s, s1.n

    union all select top( isnull( sign( s1.n ), 0 ) ) 2, s2.s, s2.n

    union all select top( isnull( sign( s2.n ), 0 ) ) 3, s3.s, s3.n

    union all select top( isnull( sign( s3.n ), 0 ) ) 4, s4.s, s4.n

    ) as r( r, s, n )

    ;

    go

    A more streamlined version that is faster and has room to be extended to 6 elements (at the expense of loosing some of its speed gain):

    create FUNCTION dbo.DelimitedSplit8K_T2( @pString VARCHAR(8000), @pDelimiter CHAR(1) )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    select

    ItemNumber = r.r

    , Item = substring( @pString, r.s + 1, isnull( nullif( r.n, 0 ), 8001 ) - r.s - 1 )

    from

    ( select 0, charindex( @pDelimiter, @pString, 1 ) ) s1( s, n )

    outer apply( select top ( isnull( s1.n, 0 ) ) s1.n, charindex( @pDelimiter, @pString, s1.n + 1 ) ) s2( s, n )

    outer apply( select top ( isnull( s2.n, 0 ) ) s2.n, charindex( @pDelimiter, @pString, s2.n + 1 ) ) s3( s, n )

    outer apply( select top ( isnull( s3.n, 0 ) ) s3.n, charindex( @pDelimiter, @pString, s3.n + 1 ) ) s4( s, n )

    cross apply

    (

    select 1, s1.s, s1.n

    union all select 2, s2.s, s2.n

    union all select 3, s3.s, s3.n

    union all select 4, s4.s, s4.n

    ) as r( r, s, n )

    where

    r.s is not null

    ;

    go

    Other then that, I give up, its not going to work any better then this!

  • peter-757102 (5/14/2011)


    While CLR certainly works on a technical level, it is far from a perfect merger with SQL Server. Ideal would be to have basic programming capabilities comparable to that of CLR, but a more domain specific interface to T-SQL, and build into SQL Server itself completely. So no need for an external compiler and with an interface that makes streaming data the norm rather then something you have to work hard towards yourself.

    I agree with this sentiment, to a large extent.

    I would certainly like to see CLR code compiled and maintained within SQL Server, probably with a T-SQL extension. It is a popular request, and there is some evidence that MSFT will in fact provide this feature:

    http://connect.microsoft.com/SQLServer/feedback/details/265266/add-server-side-compilation-ability-to-sql-clr

    I have to say, though, that I think T-SQL is horribly broken. There was some chatter a while back about replacing T-SQL entirely. This would be a good thing, in my view, though it does rather depend on what it was to be replaced with 🙂

    I could go on about how SQL itself is broken (a properly relational database would be nice!) but that's probably a little too far off topic for this thread, and probably a little too much to hope for in any case.

  • It would be very interesting, indeed, if there were a true relational database that some of the people touting the attributes of relational databases could try. I believe the majority of them wouldn't be disappointed while people like me would absolutely hate it because I tend do things with data that probably couldn't be done in a true relational database.

    So far as T-SQL and SQL in general go... yes, I agree it has some shortcomings but I'm definitely in love with the idea of declarative functionality rather than procedural flexibility although, as some have already stated, a proper mix of the two would be much more useful.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQLkiwi (5/14/2011)


    peter-757102 (5/14/2011)


    There was some chatter a while back about replacing T-SQL entirely. This would be a good thing, in my view, though it does rather depend on what it was to be replaced with 🙂

    Let the horrors begin...

    * ErLang

    * F#

    * APL

  • I wouldn't dare to claim T-SQL is broken as for the most part it does a good job and to my satisfaction. The more recent added features (recursive CTE, output clause, to name a few) all have serious quirks that make adopting them non-transparent and often impossible to use. Even if your code is correct, it might not work as it completely depends on other features/settings used within the runtime environment, outside of your control. This makes these features pretty much non-existent and therefore represent a waste of time on Microsoft's part. This has been going on for like the last two major versions now, a very bad sign if you ask me and I worry about the quality and focus of the SQL Server team at Microsoft now. They don't seem to be in control of their product in the way they should.

    Also, during my tests on the _T2 version for limited number of elements I ran into a disturbing issue that caused me to give up in frustration. The optimizer did a nice job and recognized it could remove all cross joins and set operations and reduce it all to a constant scan (short for some procedural logic). The resulting logic however was full of redundant calls to charindex, and sadly it did do nothing to remove this redundancy. If it would have, the result would be blazingly fast, optimal code. It could be that it thought a call to charindex does not cost anything at all or it simply lacks the ability to de-duplicate operations at the expression level.

    The logic it reduced the critical part of the statement too must have looked like this:

    select

    s1 = 0

    , n1 = charindex( @pDelimiter, @pString, 1 )

    , s2 = nullif( charindex( @pDelimiter, @pString, 1 ), 0 )

    , n2 = charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, 1 ), 0 ) + 1 )

    , s3 = nullif( charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, 1 ), 0 ) + 1 ), 0 )

    , n3 = charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, 1 ), 0 ) + 1 ), 0 ) + 1 )

    , s4 = nullif( charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, 1 ), 0 ) + 1 ), 0 ) + 1 ), 0 )

    , n4 = charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, nullif( charindex( @pDelimiter, @pString, 1 ), 0 ) + 1 ), 0 ) + 1 ), 0 ) + 1 )

    In short, I think many of the problems I still do have with current T-SQL are more a result of a sloppy, not well thought out implementation. A thus not a result of some fundamental issues. Completely replacing everything seems overkill to me and just will open a new can of worms we all have to learn to live and work with.

  • peter-757102 (5/15/2011)


    ...I ran into a disturbing issue that caused me to give up in frustration.

    Time for some beer popsicles and a long chat with the dust bunnies. 😛

    In short, I think many of the problems I still do have with current T-SQL are more a result of a sloppy, not well thought out implementation. A thus not a result of some fundamental issues. Completely replacing everything seems overkill to me and just will open a new can of worms we all have to learn to live and work with.

    I'm right there with you. My stand is that if anyone tries to replace T-SQL with some other nonsense, I'll be donning my postal uniform and making a road-trip to Redmond with the biggest pork-chop launcher I can build. 😀 If folks don't like T-SQL, let them use something else but leave T-SQL the hell alone. 🙂 Same goes for the availability of NULLs.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQLkiwi (5/14/2011)


    peter-757102 (5/14/2011)


    While CLR certainly works on a technical level, it is far from a perfect merger with SQL Server. Ideal would be to have basic programming capabilities comparable to that of CLR, but a more domain specific interface to T-SQL, and build into SQL Server itself completely. So no need for an external compiler and with an interface that makes streaming data the norm rather then something you have to work hard towards yourself.

    I agree with this sentiment, to a large extent.

    I would certainly like to see CLR code compiled and maintained within SQL Server, probably with a T-SQL extension. It is a popular request, and there is some evidence that MSFT will in fact provide this feature:

    http://connect.microsoft.com/SQLServer/feedback/details/265266/add-server-side-compilation-ability-to-sql-clr

    Thanks for the pointer. That item now has another vote.

    I have to say, though, that I think T-SQL is horribly broken. There was some chatter a while back about replacing T-SQL entirely. This would be a good thing, in my view, though it does rather depend on what it was to be replaced with 🙂

    I could go on about how SQL itself is broken (a properly relational database would be nice!) but that's probably a little too far off topic for this thread, and probably a little too much to hope for in any case.

    Yes, it probably is too much to hope for, because (a) the relational fundamentalist extremists would be certain to wreck any attempt to do something sensible: they would insist that outer union and outer join be banned, along with any sort of NULL, and that for every statement that delivered some rows as its result the writer of the statement would have to specify the primary key of that result, and probably a lot of other rather silly requirements too and (b) the standards bodies would cave in to vested interests and besides making all sorts of stupidity mandatory would leave out of the standard all sorts of useful things because someone thought including them would advantage someone else (I can just see the SQL die-hards demanding that the relational language not implement natural equijoin, for example, so that we would still have to write it as a projection from the column-duplicating inner equijoin that SQL supports; and the "select * is a crime" nuts would of course puke at the very concept of natural join, wouldn't they).

    I suppose that sounds cynical - but my view is the result of observing the standardisation processes for data communications protocols, for SQL, and for some other languages over a long period.

    But SQL really is broken (T-SQL is overall no better, and the standardistas are voting for connect items to bring it down to to the level of the SQL standard in at least one place where it is currently better than the standard) and it really would be nice to have a real relational database system and an appropriate language for it.

    Tom

  • Jeff Moden (5/14/2011)


    It would be very interesting, indeed, if there were a true relational database that some of the people touting the attributes of relational databases could try. I believe the majority of them wouldn't be disappointed while people like me would absolutely hate it because I tend do things with data that probably couldn't be done in a true relational database.

    I'd be surprised if you did anything that couldn't be done in a true relational database - but not at all surprised if some of the people out there claiming to know what the one true relational model is produced a database that prevented you from doing them. If you were to get a real relational language I think you would probably like it (and those I sometimes call the "relational fundamentalists" would probably hate it).

    So far as T-SQL and SQL in general go... yes, I agree it has some shortcomings but I'm definitely in love with the idea of declarative functionality rather than procedural flexibility although, as some have already stated, a proper mix of the two would be much more useful.

    Well, if you want declarative functionality you certainly don't want SQL! Declarative systems people sometimes quarrel with SQL people because SQL people keep on claiming it is a declarative language, which it most certainly is not. Would you adopt the idea that no dml statement can modify the visible data in relations (all it can do is prepare modifications to be carried out by a subsequent commit statement or discarded by rollback) - once you let the modifications proposed by one dml statement be visible to a subsequent dml statement in the same transaction you have a state-oriented procedural dml not a declarative one. This is very hard indeed to handle in a language that doesn't allow higher order functions and general composition of statement components to form more complex statements, so you would end up with something not much like SQL (SQL has taken on board some of that language style with its fairly general use of subqueries and of CTEs, but not enough of it).

    With a decent declarative functional language for writing the bits we might do in CLR for SQL, there would be no need for a procedural language for those bits. I don't know whether F# will become such a language, but I suspect Haskell would be a better choice anyway.

    And we will always need some non-declarative bits: commitment can be handled as at present, with transaction starts and commits, maybe also with autocommit (single statement transactions). Writing to files (other than database files) with immediate effect (eg for activities needing synchronous logging, such as some debugging) and I guess anything else needed could be handled using monads in the auxiliary language. So there's probably nothing procedural needed at all at database level, and very little that's not declarative. Of course under the covers things like concurrency control and recovery logging and so on will be happening, and these are probably not going to be declarative and may be procedural.

    Tom

  • mark hutchinson (5/15/2011)


    SQLkiwi (5/14/2011)


    peter-757102 (5/14/2011)


    There was some chatter a while back about replacing T-SQL entirely. This would be a good thing, in my view, though it does rather depend on what it was to be replaced with 🙂

    Let the horrors begin...

    * ErLang

    * F#

    * APL

    Well, I agree that APL is probably a horror in this context; but the other two? If they count as horrors, then surely so must C#, and things like VBS must count as more horrid than horrors.:hehe::-D:w00t:

    Tom

  • Tom.Thomson (5/15/2011)


    Well, I agree that APL is probably a horror in this context; but the other two? If they count as horrors, then surely so must C#, and things like VBS must count as more horrid than horrors.:hehe::-D:w00t:

    I absolutely agree that the experiment known as C# was a mistake... 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • As a VB guy, I'm pissed that the VB team in Microsoft's .Net team lost every decision battle to the C# team.

    Thank you anyway, Anders, for Delphi. Maybe the VB-vs-C# decisions were just your revenge.

  • Tom.Thomson (5/15/2011)


    Jeff Moden (5/14/2011)


    It would be very interesting, indeed, if there were a true relational database that some of the people touting the attributes of relational databases could try. I believe the majority of them wouldn't be disappointed while people like me would absolutely hate it because I tend do things with data that probably couldn't be done in a true relational database.

    I'd be surprised if you did anything that couldn't be done in a true relational database - but not at all surprised if some of the people out there claiming to know what the one true relational model is produced a database that prevented you from doing them. If you were to get a real relational language I think you would probably like it (and those I sometimes call the "relational fundamentalists" would probably hate it).

    So far as T-SQL and SQL in general go... yes, I agree it has some shortcomings but I'm definitely in love with the idea of declarative functionality rather than procedural flexibility although, as some have already stated, a proper mix of the two would be much more useful.

    Well, if you want declarative functionality you certainly don't want SQL! Declarative systems people sometimes quarrel with SQL people because SQL people keep on claiming it is a declarative language, which it most certainly is not. Would you adopt the idea that no dml statement can modify the visible data in relations (all it can do is prepare modifications to be carried out by a subsequent commit statement or discarded by rollback) - once you let the modifications proposed by one dml statement be visible to a subsequent dml statement in the same transaction you have a state-oriented procedural dml not a declarative one. This is very hard indeed to handle in a language that doesn't allow higher order functions and general composition of statement components to form more complex statements, so you would end up with something not much like SQL (SQL has taken on board some of that language style with its fairly general use of subqueries and of CTEs, but not enough of it).

    With a decent declarative functional language for writing the bits we might do in CLR for SQL, there would be no need for a procedural language for those bits. I don't know whether F# will become such a language, but I suspect Haskell would be a better choice anyway.

    And we will always need some non-declarative bits: commitment can be handled as at present, with transaction starts and commits, maybe also with autocommit (single statement transactions). Writing to files (other than database files) with immediate effect (eg for activities needing synchronous logging, such as some debugging) and I guess anything else needed could be handled using monads in the auxiliary language. So there's probably nothing procedural needed at all at database level, and very little that's not declarative. Of course under the covers things like concurrency control and recovery logging and so on will be happening, and these are probably not going to be declarative and may be procedural.

    Thanks, Tom. I can always count on you for wise words.

    I started to write about what I think of some of the relational zealots and fundamentalists but some truly unkind words became the main feature of the post and so I decided to practice what most of our Mom's taught us... "If you don't have something nice to say, don't say it." 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I like to add my idea that it is a mistake to even strive for a fully declarative/functional language in the first place. SQL can be difficult to handle at times as one change in what you want might mean a redesign of your whole query. This property is not something to strive for endlessly, not everything should be written down as an endless formula that is impossible to comprehend in the end.

    We got to ask ourself: what is the benefit of a functional language over a procedural one?

    My take in this is the ability of the functional language to make an implementation where lots of details are taken care off for you (possibly efficiently). Things you otherwise need to deal with yourself, providing a kind of runtime framework if you will. So as long as you can easily work on a functional level and as human can comprehend the functional notation and easily change it to do something slightly different you want, your good with sticking to functional.

    But there comes a point (and this is different for everyone) where you have to break things down as these limits are reached, and that is the end of the line for sticking to pure functional. This means that in reality there is no true workable functional solution possible as you will always run into this limit sooner or later Only hybrid solutions can survive in the wild and have meaning, hybrids like SQL. And hybridization should be completed in terms of running procedural functionality outside the runtime framework of SQL itself. In other works, you should be able to add new functions, possible with some meta information (is deterministic, returns null on any null input, etc) that helps the optimizer make better use of them when used as part of an SQL statement. It would also be nice if some runtime environment properties were communicated, such as certain set options that can affect functionality.

    Anyway, I am getting sucked into a whole new discussion here, just wanted to say my thing and get some rest in my head now :).

Viewing 15 posts - 241 through 255 (of 990 total)

You must be logged in to reply to this topic. Login to reply