SQL 2000 DBA Toolkit Part 2

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/mcoles/sql2000dbatoolkitpart2.asp

  • Why do you attempt to make SQL Server a perl interpreter or a Unix utility?

    I just cannot see any benefit this pattern matching brings when the utilities provided by the products above (AWK, ED, SED, Bourne, BASH, KSH, Java) are excellent products geared solely toward the kinds of patterns you're attempting to parse, and are entirely disjoint from database technologies.

    Not only that, SQL Server is a database tool with limited extensions into the world of files and file data streaming mechanisms, and your toolkit is a poor extension of it.

    If you can shed some light on its adopted usage within SQL Server as an addition or extension to the embedded toolkits shipped with SQL Server, then please share it with me; otherwise, time is better spent letting the right tool carry out the correct operation on the data, or extending the DOS capability (which has always been poor).

    A totally disappointing argument and article, one that can simply be replaced by one-shot pattern-matching utility scripts using simpler notation: more easily readable, more readily available, shareable, updateable, compatible, understandable, reusable... must I carry on like the article? The list of reasons is endless.

    David Wootton

  • I would venture the guess that the best tool is the database, since that's where most of the data is. SED, AWK, etc., apart from not generally being on Windows or inside SQL Server, are not accessible from within a query, which is where you would want to match up the data.

    This isn't UNIX either, so those excellent products work on a great many of the computers in the world, albeit a minority of them.

    The right tool is the one that fits the job, and for a SQL Server DBA these tools can work well. For a Unix admin it's not an issue, since you won't be working with SQL Server anyway.

  • Incorrect assumptions above.

    SQL Server can and does work logically with Unix by using linked servers, DTS or Windows Scripting Host, transparent gateways, and .NET (to name a few techniques); however, this is not relevant to this discussion.

    I feel you've missed the whole point of this criticism, which is: what use is this extension (used in its vaguest terms) to SQL Server functionality? It is not about the operating system SQL Server resides on.

    You opened part 2 of your article with a quote from Alice in Wonderland.

    Here's another quote for you

    Ambition is the last refuge of the failure.

    David Wootton


  • So you would advocate pulling all data from the database server, running it through a utility and then putting it back for further querying?

  • INTRODUCTION

    Thanks for your feedback.  Your quote is appreciated, and now I provide you with one of my personal favorites:  "Those who can, do.  Those who can't, become critics."  First let me explain that I'm making no "arguments" in the article, so I can see why you would find the "argument" to be "totally disappointing."  I am simply introducing tools that I have found useful in SQL Server at various times, and explaining how to use them.  Feel free to quote me on the following:

    "For those who don't find them useful, the simplest solution would probably be to not use them.

     For those who do find them useful, the best solution is probably to use them."

    BACKGROUND

    Next, a little background for those who might find the tools useful.  (Feel free to skip this paragraph, Mr. Wootton.)  These Regular Expression tools were created in response to another method of implementing Regular Expressions in SQL 2000: using sp_OACreate to invoke the VBA regular expression parser.  Unfortunately, that method caused me a lot of memory leaks, late-night server reboots, and some difficult conversions of Perl regexes to Microsoft syntax.

    UTILITY

    I am not going to argue the utility of Regular Expressions in SQL.  Like a debate over the usefulness of O2 in the atmosphere, the debate over the utility of Regular Expressions in SQL was resolved a long time ago by large organizations full of very smart people.  The proof of the utility of Regular Expressions in SQL is self-evident in SQL 2005 and Oracle, which both offer regular expression functionality.  In SQL 2005 it is accessed via the .NET Framework.  Oracle offers a more direct approach, with statements like "REGEXP_LIKE", "REGEXP_INSTR", "REGEXP_SUBSTR" and "REGEXP_REPLACE" which are similar to the XP/UDF methods I have implemented for SQL 2000.
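
    To make the Oracle comparison concrete, here is a rough sketch of what that "more direct approach" looks like; the table and column names are made up for illustration, and the pattern is a simplified stand-in rather than a full RFC 2822 check:

    -- Oracle 10g: the regex test sits directly in the WHERE clause.
    SELECT email_address
    FROM   customers
    WHERE  NOT REGEXP_LIKE(email_address, '^[^@ ]+@[^@ ]+\.[^@ ]+$');

    -- REGEXP_SUBSTR, REGEXP_INSTR and REGEXP_REPLACE handle extraction,
    -- position and substitution in the same in-query fashion.

    The XP/UDF calls in this toolkit give SQL 2000 the same in-query placement, as the examples further down show.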

    Your question about "making SQL Server a Perl interpreter or a Unix utility" might best be addressed to Oracle and Microsoft, as they have a lot more people who can take the time to explain the intricate details of the why and wherefore.  Of course, they might offer you a similar solution: i.e., if you don't like those tools then simply don't use them.  There's no reason to get rid of all the fishing nets in the world because you're a carpenter and will never use one.

    EXAMPLES:  As You Wish...

    Here are two examples, from my own experience, of times when I have found regular expressions directly in SQL to be useful:

    • When searching for a specific pattern via Query Analyzer.  This is a one-off situation for a single SELECT statement.  I would generally use this when debugging an application or otherwise verifying the format of data.

    It is absolutely true that you could DTS your 10,000,000-row table out to a text file, copy it over to a UNIX box and run SED against it; or write a Java or .NET application that connects to the database, reads all the data into memory and performs the comparisons.  Or even just open the flat file in Notepad, click Edit > Find in the main menu, and search it.  You could conceivably hard-wire a dozen one-time-use, disposable applications to perform these specific comparisons for each of the formats you are testing, using character-level comparisons and For..Next loops, and avoid regular expressions altogether!  One application for credit card numbers, one application for email addresses, one application for testing password complexity, and so on and so forth.

    Personally, I prefer to do any of these tasks with a single SELECT statement and a Regular Expression, particularly since Regular Expressions have already been written for just about any data validation task you can imagine.

    • I have encountered situations in which data validation (of various degrees) at the database level was a functional requirement.  I'll not argue the wrongness or rightness of placing validation functionality at that level; it is what it is: what the customer demanded.  Those who think data validation at the database level is wrong should feel free to lobby ANSI, Microsoft and Oracle to remove TRIGGERs and CONSTRAINTS.  Unfortunately I can do nothing about that.

    That said, it's true that you could write a stored procedure or user-defined function for every variation of the data you need to validate, using a counter variable and a WHILE loop, or you might even be able to use the extremely limited pattern-matching facilities of the SQL LIKE operator.  I can picture a UDF with a WHILE loop and a few "flag" variables to validate email addresses, another UDF using a WHILE loop to validate phone numbers, etc.  Alternatively, as you mentioned, you could write a program in Java to read all of the data into memory and perform the comparisons for you, although it would need to be invoked via xp_cmdshell.  I can only imagine the performance of a system where xp_cmdshell loads command.com, which invokes a command-line Java interpreter, which in turn loads an application that tests a single piece of data, on a table where 10,000,000 rows are being inserted.  That's 10,000,000 invocations of xp_cmdshell, assuming your DBA didn't disable xp_cmdshell for security reasons.

    CONCLUSION:  Put Your Writing Where Your Mouth Is

    As for AWK, SED and the "DOS" capability which has "always been poor", I for one would definitely be interested in seeing any examples you might have of validating SQL Server data using these facilities.  Of particular use, to me anyway, would be a way to call this functionality from Query Analyzer via a SELECT statement, without the need to export rows to an intermediate format like a flat file.

    I think that would make an extremely useful article that many would benefit from; you should definitely consider sharing your experience with the rest of us in article form, particularly the methods you use to access SED and AWK from within SQL Server!

    I look forward to further contributions from you, in your admittedly "unambitious style", to the SQL Server community at large!

    I sincerely hope your day gets better,

    Michael Coles

  • SOME EXAMPLES

    Just for giggles, let's look at some situations and see what solutions you propose:

    CONDITIONS

    1.  You have SQL Server 2000 on Windows 2003.

    2.  You have a Unix box which is on the same physical network as the SQL Server.

    3.  You have a table with 20,000,000 rows.

    4.  The table has a column for "email address".

    TASK #1:  One-time Validation

    You have just been tasked with performing a one-time validation of all the email addresses in the table against RFC 2822.

    One Solution:  Use the fn_regex_match() function to write A SINGLE SELECT statement in Query Analyzer.

    Wootton solution: ________________________________________________________________ 
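
    For the curious, the single SELECT in the solution above might look roughly like the sketch below.  The fn_regex_match signature and return value are assumptions on my part, and the pattern is a simplified stand-in for the full RFC 2822 grammar.

    -- One-time validation: list every row whose email address fails the pattern.
    -- UDF signature/return value assumed; the toolkit docs have the real ones.
    SELECT id, email_address
    FROM   dbo.email_table
    WHERE  master.dbo.fn_regex_match(email_address, '^[^@ ]+@[^@ ]+\.[^@ ]+$') = 0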

    TASK #2:  Trigger Validation

    Once the one-time validation above has been done, the technical specs for the database are revised to include a requirement for validation of email addresses against RFC 2822 via a TRIGGER at INSERT or UPDATE time.  It was decided that validation needed to occur at the database level so that admins and developers updating the rows directly wouldn't accidentally circumvent the validation and enter bad data.

    One Solution:  Use the fn_regex_match() UDF to create a trigger with a few lines of code in it.

    Wootton solution: ________________________________________________________________
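
    Again purely as a sketch (the table name, column name and the UDF's return value are my assumptions), the trigger boils down to a few lines:

    -- Reject any INSERT/UPDATE containing an address that fails the pattern.
    CREATE TRIGGER trg_validate_email ON dbo.email_table
    FOR INSERT, UPDATE
    AS
    IF EXISTS (SELECT 1 FROM inserted
               WHERE master.dbo.fn_regex_match(email_address, '^[^@ ]+@[^@ ]+\.[^@ ]+$') = 0)
    BEGIN
        RAISERROR ('Invalid email address.', 16, 1)
        ROLLBACK TRANSACTION
    END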

    TASK #3:  Parsing

    The boss has tasked you with creating a report of the domain names tied to these email addresses, ranking them from most common to least common.  He wants only the domain (e.g., "microsoft.com"), not the local host portion (e.g., the "mail." part of "mail.microsoft.com").  You also note that some of the domains are IP addresses, which should be returned in full.

    One solution: Use xp_regex_search to grab the domains out of the email address and insert them into a temporary table.  Then SELECT on the temporary table with a GROUP BY and ORDER BY.  In total, about 3 - 4 lines of T-SQL code.

    Wootton solution: _________________________________________________________________
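
    A rough sketch of that approach is below.  The xp_regex_search parameter order and result-set shape (matchnum, groupnum, value) are taken from the pseudo-code further down this thread, the mechanism for feeding it every row is glossed over, and the extraction pattern is a simplified illustration:

    -- Collect the registrable domain (or the full IP address) from each email
    -- address into a temp table, then rank the domains by frequency.
    CREATE TABLE #domains (matchnum int, groupnum int, value varchar(128))

    -- For each address (details depend on the XP's actual interface):
    INSERT INTO #domains (matchnum, groupnum, value)
    EXEC master.dbo.xp_regex_search
         '([A-Za-z0-9-]+\.[A-Za-z]+|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})$', @email, 'I+'

    SELECT   value AS domain, COUNT(*) AS occurrences
    FROM     #domains
    GROUP BY value
    ORDER BY COUNT(*) DESC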

    CONCLUSION:  Show Me The Money!

    As requested, I have supplied very specific examples of when and where these tools might be useful.  Each is based loosely on real-world situations that I, and other DBAs, have encountered at various times.

    I anxiously await your MS SQL Server/Unix-based solutions to these common DBA tasks, Mr. Wootton.

    Cheers,

    Michael Coles

    "He has a right to criticize, who has the heart to help." - A. Lincoln

  • You serious?  MS SQL runs on UNIX?  And how do you use BOURNE from inside T-SQL???

    I use regex on my SQL box now, the PCRE one from Code Project (http://www.codeproject.com/database/xp_pcre.asp), in triggers.  Plus I can copy and paste regexes from anywhere on the web.  I think they're a lot easier than trying to shuttle data back and forth to a Unix box every time I want to check a simple pattern.  Wouldn't that slow everything down?  Can you even do that from a trigger?  I tried T-SQL stored procs and UDFs at first, but they get complicated real fast.  What are the other options you talked about that are endless?

    Thanks.

  • Sorry guys ... can't read any of that ... I tend to tune out when there is a lot of bolding and uppercase typing.


    PS: Enjoying the series, Mike. I'm sure Steve can fill you in on what to do with bad reviews.


    --------------------
    Colt 45 - the original point and click interface

  • LOL.  Thanks Phill.  I usually try to work with critics to explain my position in a rational way and listen to their position in a patient way.  Usually there's middle ground we can meet on once we understand each other...  But that doesn't always work...

    When a guy starts screaming about using UNIX to validate the data in your SQL Server databases without any further explanation, he doesn't leave you much to work with.

    As always, appreciate the feedback and hope you find the tools useful!

  • FYI to Everyone:

    Steve has made the source code (VC++ 7.1 solution) for this project available in a ZIP file from http://www.sqlservercentral.com/products/mcoles/default.asp.

    Compiling some parts requires the Boost library source (in particular, the Regular Expression extended procedures require Boost).  The Boost source is available for free download at http://www.boost.org/.

    If you do download and compile the source, I highly recommend downloading the latest Platform SDK from http://www.microsoft.com/downloads/details.aspx?FamilyId=A55B6B43-E24F-4EA3-A93E-40C0EC4F68E5&displaylang=en.

    I've also been asked several times about the license for this toolkit.  Here it is:

    DBA Toolkit License

    The DBA Toolkit is composed of my original source code, and/or other source code that the original authors have placed in the public domain.  Credit has been given throughout the source code to those original authors, and their original licensing agreements have been maintained in the source code where appropriate.

    You are free to use this source code and binaries however you want, modify it however you like, and use it for any purpose you like, personal or commercial.  All I ask is two things:

    1.  If you redistribute the source code, modified or unmodified, give me a shout out in the comments or something

    2.  If you do something really cool and make big and/or useful improvements, consider sharing it with the rest of the world (this is not mandatory, but it is the nice thing to do).  I'm particularly interested in the improvements people make to the source, and always appreciate the feedback.

    Obviously the Boost Library is covered by Boost's own licensing agreement, which you will need to observe if you decide to use the portions of code that interface with it (the Regular Expression functionality).  And Microsoft has their own licensing agreements you might need to review before distributing programs compiled using the Platform SDK.

  • I've always greatly admired the DBA toolkit, and found the criticism of Mr Wootton very amusing. Well, I assume it was tongue-in-cheek. Even if the toolkit weren't useful, it is great to illustrate that it can be done.

    One point he's right about: SQL Server works very well in a Unix/J2EE environment, thanks to its rather good JDBC driver. Of course, the database server itself has to run on Windows, but that is only a problem for the fanatics.

    Best wishes,
    Phil Factor

  • I recognize I'm a little late to the party, but...

    Has anyone tried to use these xp's in an x64 environment? We are moving from 32-bit SQL2K to 64-bit SQL2K5 and would like to continue using these in the new environment, rather than refactor to the SQL2K5 built-ins.

    Thanks,

    Paul S

  • Just FYI, I haven't compiled or tested these on 64-bit systems, but for SQL 2K5 I'd highly recommend using the SQLCLR and the built-in functionality.  On SQL 2K5, XPs are deprecated, and SQLCLR code is much easier to maintain.  Creating regular expression matching procedures and functions using .NET and the SQLCLR is relatively simple.  Just my .02.
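
    For anyone taking that route, the T-SQL side of the deployment looks roughly like this; the assembly, class and method names below are placeholders (the .NET code itself is just a thin wrapper around System.Text.RegularExpressions):

    -- Register the .NET assembly, then expose a static method as a T-SQL function.
    CREATE ASSEMBLY RegexHelpers
    FROM 'C:\clr\RegexHelpers.dll'
    WITH PERMISSION_SET = SAFE;
    GO

    CREATE FUNCTION dbo.fn_clr_regex_match (@input nvarchar(4000), @pattern nvarchar(4000))
    RETURNS bit
    AS EXTERNAL NAME RegexHelpers.[RegexHelpers.UserDefinedFunctions].IsMatch;
    GO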

    Thanks,

    Michael

  • Mike

    First of all, congratulations on earning your MVP status. Reading the news prompted me to look this article up to help me with an address validation task I have been asked to do.

    I have used the kit and have managed to solve my problem, but I am having to use a cursor :w00t: as xp_regex_search returns a table containing the found value. Any chance we can nag Steve into fixing the link to the source code above so I can muck about with it and make it do what I want?

    For those that are interested, here's more on what I'm trying to do. (Usual Disclaimer: Schema and App NOT my design ;))

    We have an address table that contains among other columns PK, CustCode, AddressLine1, AddressLine2, AddressLine3, AddressLine4, AddressLine5, AddressLine6, Postcode.

    The app that uses this table is dreadful, and over time postcodes have ended up in any of the address fields. So addressLine5 might contain "MANCHESTER M13 OJR" for example.

    What I currently have is the following pseudo(ish) code:

    Read all records with a NULL postcode into a temp table (call it #TestMatch) consisting of the above columns, then open a cursor over it and fetch the first record.

    CREATE TABLE #TestSearch (matchnum int, groupnum int, value varchar(128))

    SELECT @addLine = @add1  -- set the string to be tested (AddressLine1 on this pass)

    EXEC master.dbo.xp_regex_match @regExStr, @addLine, 'I+', @Ans OUTPUT  -- @regExStr is a postcode-matching pattern

    IF @Ans = 'Y'
    BEGIN

        INSERT INTO #TestSearch (matchnum, groupnum, value)
        EXEC master.dbo.xp_regex_search @regExStr, @addLine, 'I+'

        IF EXISTS (SELECT 1 FROM #TestSearch)
        BEGIN
            SELECT @val = value FROM #TestSearch

            UPDATE #TestMatch
            SET postcode = @val
            WHERE id = @id
        END

        DELETE FROM #TestSearch  -- clear it down before testing the next column

    END

    Do the same for AddressLine2, 3, etc., then fetch the next record and go back to the top.

    It would be much simpler for me, and probably much quicker to say

    Update #TestMatch

    Set postcode = master.dbo.fn_regex_match_and_return_matched_value(addressLine1, @regEx)

    Where Postcode is NULL

    And then do AddressLine2, etc. Or build a COALESCE statement, something like the sketch below. Anyway, I'm sure you get the idea.
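
    To spell the COALESCE idea out (using the hypothetical function I'm asking for, which doesn't exist in the toolkit today; I'm assuming it would return the matched text, or NULL when the pattern isn't found):

    -- Hypothetical: fn_regex_match_and_return_matched_value is the function
    -- I'd like to add; assumed to return NULL when nothing matches.
    UPDATE #TestMatch
    SET postcode = COALESCE(
        master.dbo.fn_regex_match_and_return_matched_value(AddressLine1, @regEx),
        master.dbo.fn_regex_match_and_return_matched_value(AddressLine2, @regEx),
        master.dbo.fn_regex_match_and_return_matched_value(AddressLine3, @regEx),
        master.dbo.fn_regex_match_and_return_matched_value(AddressLine4, @regEx),
        master.dbo.fn_regex_match_and_return_matched_value(AddressLine5, @regEx),
        master.dbo.fn_regex_match_and_return_matched_value(AddressLine6, @regEx))
    WHERE postcode IS NULL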

    Any other obvious solutions I have missed, please feel free to correct me 😀

    Once again, well done on the MVP thing.

    Dave J


    http://glossopian.co.uk/
    "I don't know what I don't know."
