A Case For Concatenation (Building Delimited Strings ) In T-SQL

  • Bruce W Cassidy (2/26/2009)


    [font="Verdana"]One of the recent (i.e. in the last ten years) changes in data warehousing has been a move away from specialised ETL tools like Informatica, Data Stage, etc, and towards letting the RDBMS do the work. This approach has been called ELT (extract, load, transform).

    For smaller data warehouses, this makes sense. There's no need to support two disctinct platforms for throwing data around. It's all done in the RDBMS.

    Oracle have a product that does that (sorry, can't remember the name), and we use a product at work named Wherescape RED that does the same over the top of SQL Server 2005.

    So the database does a lot of work in chopping, changing and reformatting data during the staging process. There's no real presentation layer, as the data just gets posted into another database.

    That's a very valid reason to be looking at string operations within an RDBMS, from my perspective.

    [/font]

    Very cool... I've been doing it that way since day 1 with SQL Server... currently working on a project that I've dubbed as "STEPS".... "Simplified Table Extraction, Parsing, and Summarization". It's all being done in T-SQL and it imports files not importable directly, have an unknown number of columns all with unknown names, and some of the columns must be pair with adjacent columns depending on name. Currently, I'm using a 100k Tally table to do the heavy lifting. I just got done testing it... old process with Perl script and some ActiveX took 40 minutes to process two particular files and that was just to get them ready to import... my process actually does everything including the import in 107 seconds and I haven't even tuned the temp tables involved, yet. We import hundreds of such files a day... I'm thinking we'll finally be able to disconnect the garden hose from the server. 😉

    My mantra in the project has been, "We don' need no stinkin' DTS/SSIS". :hehe:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • [font="Verdana"]Any chance of you sharing some of the techniques you're using for that? I've been doing similar things, building an ODS for a client company. But I use bcp 😀

    I have issues with SSIS. Namely, to use it well, you have to be fluent in not one, but three programming languages at once (SQL for your database connections, VB for any plug-in code and C for the inline code). It's incredibly complex, and it's insanely hard to try and modify. I have met several companies who have SSIS packages that everyone flat out refuses to change simply because of the complexity of even just understanding what it does.

    Give me well-written straight line code in T-SQL any day. It may not perform quite as well as well-written SSIS, but at least I can easily understand and change it.

    [/font]

  • I've found SSIS is great for files with a relative degree of consistency, and has major problems with files that have been mangled by people who don't have the faintest clue what they're doing. What I've done in those cases is pretty much pull the whole thing into a table with 1 column of nvarchar(max) and an ID and nothing else, then process in T-SQL from there.

    Didn't have a prior automated process. Before I took it on, it was manual, and took about an hour per file. I took it down, eventually, to about 10 seconds per file (average file size about 5k rows, up to about 20k for the biggest ones). Numbers table and such were key to the process. I could do even better with 2k5 and some of the XML functions, but this was on 2k and didn't have what I would have used.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • files that have been mangled by people who don't have the faintest clue what they're doing

    I just have to ask, how much of your coding efforts are spent on compensating for the inadequacies of others?

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bob Hovious (2/26/2009)


    files that have been mangled by people who don't have the faintest clue what they're doing

    I just have to ask, how much of your coding efforts are spent on compensating for the inadequacies of others?

    Not much. For the most part, a customer not knowing how to format a spreadsheet for import into SQL isn't their problem. That's what I'm being paid for.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I was referring to inadequacies of alleged IT professionals 😉

    My war story is that our development group used to have a guy whose entire job was to import data. I found this out after we went through some rightsizing and he was let go. When I reviewed his code it turned out to be slow AND inaccurate. He was driving it through a VB program with maximum RBAR. We just scrapped it all. Now things he used to run for a day are handled in under a minute, without duplications or omissions. No brilliance required, it was just a sad example of developers (who don't understand database) thinking that knowing how to do an INSERT or an UPDATE is all there is to it.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Bruce W Cassidy (2/26/2009)


    [font="Verdana"]Any chance of you sharing some of the techniques you're using for that? I've been doing similar things, building an ODS for a client company. But I use bcp 😀

    I have issues with SSIS. Namely, to use it well, you have to be fluent in not one, but three programming languages at once (SQL for your database connections, VB for any plug-in code and C for the inline code). It's incredibly complex, and it's insanely hard to try and modify. I have met several companies who have SSIS packages that everyone flat out refuses to change simply because of the complexity of even just understanding what it does.

    Give me well-written straight line code in T-SQL any day. It may not perform quite as well as well-written SSIS, but at least I can easily understand and change it.

    [/font]

    Sure... most of it is just importing whole rows of the file into a 2 column table... one for an IDENTITY and one for the data. Of course, that requires the use of a BCP format file so I can skip over the identity.

    Then, I simply do a Tally table split to an NVP (Name/Value Pair) table in order by the IDENTITY column (which gets saved in the NVP) and the order of the Tally table.

    The reason I do this is two fold... that allows me to split the data into "columns" in the NVP without knowing how many there are and it also allows me to skip rows with BCP (Bulk Insert, actually). Neither like it very much when there are different numbers of delimiters in the header rows than the body of the file and they won't skip rows based just on the EOL character... the number of delimiters must match in all rows. So, I just treat each row as if it had no delimiters to allow the skip of 20 or 30 header rows.

    Note that I do NOT use a splitter function for this... I split the whole table all at once. If you haven't seen that before, take a look at the code in the following article (near the end)...

    http://www.sqlservercentral.com/articles/TSQL/62867/

    The section that I'm talking about in the article is titled, [font="Arial Black"]"One Final "Split" Trick with the Tally Table".[/font]

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bob Hovious (2/26/2009)


    files that have been mangled by people who don't have the faintest clue what they're doing

    I just have to ask, how much of your coding efforts are spent on compensating for the inadequacies of others?

    I make about 90-95% of my living at it. The other 5-10% is building new stuff and maintaining servers.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bob Hovious (2/27/2009)


    I was referring to inadequacies of alleged IT professionals 😉

    My war story is that our development group used to have a guy whose entire job was to import data. I found this out after we went through some rightsizing and he was let go. When I reviewed his code it turned out to be slow AND inaccurate. He was driving it through a VB program with maximum RBAR. We just scrapped it all. Now things he used to run for a day are handled in under a minute, without duplications or omissions. No brilliance required, it was just a sad example of developers (who don't understand database) thinking that knowing how to do an INSERT or an UPDATE is all there is to it.

    The really sad part is, some people probably thought he was doing a good job.

    And, absolutely spot on... just because someone walked by a book store that sells SQL books, doesn't mean that they even know how to spell it. 😉 I pretty much blame management for "bad" SQL Developers because it's those managers that are totally clueless... For example... you've seen the "want ads"... "Wanted, Web developer. Must have x number of years experience with Java/VB/C#/flavor-of-the-month and 1 year of experience with SQL." Such a person (usually, there are those wonderful exceptions) shouldn't even be allowed near the DB except by calling stored procedures that someone else wrote. But, sure enough, they get put into the position of solving ETL and other large batch problems.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Bob Hovious (2/27/2009)


    I was referring to inadequacies of alleged IT professionals 😉

    My war story is that our development group used to have a guy whose entire job was to import data. I found this out after we went through some rightsizing and he was let go. When I reviewed his code it turned out to be slow AND inaccurate. He was driving it through a VB program with maximum RBAR. We just scrapped it all. Now things he used to run for a day are handled in under a minute, without duplications or omissions. No brilliance required, it was just a sad example of developers (who don't understand database) thinking that knowing how to do an INSERT or an UPDATE is all there is to it.

    Looks like were migrating back in the direction of The Thread. 🙂

    My current job, the prior DBA was okay for some things, but for some aweful reason, he built a whole series of DTS packages that import text files into tables in SQL, and it's all done through VBS. Open file item, set a bunch of variables to the values in the first row, build an SQL string that does an insert statement with those variables, move to the next line in the file, rinse, repeat. Over 7-million times on one of the files. Some of these take as much as six hours to import a few hundred thousand lines of text. No transformations, just move from text to SQL, same columns and everything.

    He also wrote this wonderfully complex archive process for another database. The source database is third-party and subject to change without notice. The process queried the schema data for the tables, built a select and insert statement from sys.columns, dynamically, then dumped the data into a holding table using Select...Into..., then finally moved the data into the final archive tables, with the same structure as the primary tables.

    One problem was that it didn't compare the table structure and modify the target table, which completely defeated the purpose of the dynamic SQL. Errored out at least once because the target table didn't have a column that the source table did. The other problem was that it's a daily job, and the run-time was up to just over 20 hours, and the data is growing. So I changed it to a simple "Insert ... Select * from ... for XML raw, type" and an insert into a single XML column. No dynamic SQL, no staging tables, but handles schema changes without problem. Took the whole process down to about 3 minutes, and this copy is moving the data across a network to a different database server. Uses about 10% more disk space, and is somewhat more complex to query (XQuery instead of just straight XML), though views help with that, but worth it for not tying up a whole server for 20+ hours per day.

    On the other hand, he built a number of reasonably good databases, made sure there was adequate automatic maintenance on them, managed the backups and such, and otherwise did his job. Just those two things were pretty messed up.

    They also had a dev here who did a lot of "pay no attention to the man behind the curtain" type support for his applications. In other words, they don't run unless someone holds their hand and issues "back door" commands. Everyone thought he was great, till he left. His name was Dan, and now it's not uncommon to have people refer to broken things as "Dangineered" (pronounced like "engineered", but with his name incorporated).

    So, yeah, some time spent on such things. But for me, it's not even close to the majority of my time. Most of it is spent constructively, either on new systems or new features for existing systems.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply