Parsing HTML to SQL using SQLDOM

  • Jon,

    Sure, you could parse HTML that is already stored in a column.

    I wrote a brief post[/url] on this sort of thing at http://sf.net/p/sqldom/discussion/general/thread/5bf1fdb6/

    In a nutshell, the #spactDOMLoad procedure accepts an @HTML parameter that contains the HTML to be parsed: it can come from any source, including an existing row in a SQL table.

    After calling #spactDOMLoad you could select the data you want to keep either with the #spgetDOM procedure or by directly selecting rows from the temporary tables #tblDOM, #tblDomAttribs, etc.

  • Hello again,

    I was curious if it would be possible to to change what it is your code is looking to capture after a tag that looks like this <OPTION selected>Text I want is here</OPTION>. I have ran your code against some test HTML for the form I was speaking about and I do see the option element in your tag column but the problem is it doesn't pick up the selected portion and place it into the tag column as well, which I use to determine which option the user has selected so that I may parse it. Also, when I use HTML: <Option Selected Id=55555 Name=Burr Oak> It fills in the tag column with a blank string into the name column but a NULL in the ID column. Also if I leave the Selected tag information out it not only puts a blank string in for the name column but also does the same for the ID field as well (see screenshot). If I could get it to somehow notice the Selected word inside the <Option Selected> tag and save the actual ID and Name in the result list I could use it to pull the couple pieces of information I need.

    Also on a side note, how do you recommend actually running a where against the result set because I would need to look through hundreds of <option> tags to find the couple that say <option Selected> in it?

    Thank you in advance.

  • Normally attributes that are parsed by SQLDOM are stored in the table #tblDOMAttributes (which can be joined to table #tblDOM via the DEID column).

    In HTML 4, "selected" is really supposed to be attributes like this:

    <option selected="selected">My Option</option>

    SQLDOM has not support parsing out attributes that are not in the form of "name=value". Consequently, SQLDOM has not parsed out the selected attribute in someting like:

    <option selected>My Option</option>

    Back in the old days (HTML 2, and possibly HTML 3.2), "selected" with no value was allowed.

    Based on your request I modified SQLDOM and posted an updated version (.924) that parses out these old-fashioned quasi-attributes with no values. Please download version .924 from: https://sourceforge.net/projects/sqldom/files/

    Once you parse your HTML with version .924 you will be able to retrieve the selected options with something like this:

    EXEC #spactDOMLoad @HTML = '<html><body>Hello World.<option selected>Try this</option></body></html>'

    SELECT

    dom.Tag,

    dom.ID,

    txt.TextData,

    attr.Name,

    attr.Value

    FROM

    #tblDOM dom

    LEFT JOIN #tblDOM txt ON

    dom.DEID = txt.ParentDEID AND

    txt.TextData IS NOT NULL

    LEFT JOIN #tblDOMAttribs attr ON

    dom.DEID = attr.DEID

  • David,

    Thank you very, very much. I should be able to pull off exactly what I am trying to do. I really do appreciate it. This could be very useful for future endeavors for me as well.

    Thanks again,

    Jon

  • Great article. We'll done sir!

    I do, however, disagree a little with this line (emph. mine)...

    But HTML is often not well-formed XML. Some tags are "singleton" tags--that do not require end tags...

    Html 4.01+ (May 2000) is in fact well-formed XML. In other words, if it was developed or generated this century it is almost certainly well-formed. I don't know how much html 4.0 and ealier is still out there but it's certainly the exception and not the rule.

    HTML 4.01 DOES require end tags or it can be empty. E.g. this is well formed: and so is this <p/>. This: <p> is neither legit html if its 4.01 or later, nor is it well-formed xml.

    I am not trying to take anything away from your excellent article but felt that is worth calling out.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • In your article:

    This will return a resultset containing the nodes that were parsed:

    DEIDDocIDTagIDNameClassTextData

    OpenTagStartPos

    CloseTagEndPos

    ParentDEID

    HUIDSortHUIDDOMLevel

    11htmlNULLNULLNULLNULL1119NULL111

    21headNULLNULLNULLNULL105311.11.0000012

    31titleNULLNULLNULLNULL184321.1.1000001.000001.0000013

    41NULLNULLNULLNULLSample HTMLNULLNULL31.1.1.1000001.000001.000001.0000014

    51bodyNULLNULLNULLNULL5711011.21.0000022

    61divNULLNULLNULLNULL6710051.2.1000001.000002.0000013

    71NULLNULLNULLNULLHelloNULLNULL61.2.1.1000001.000002.000001.0000014

    81bNULLNULLNULLNULL825361.2.1.2000001.000002.000001.0000024

    91NULLNULLNULLNULLWorldNULLNULL81.2.1.2.1000001.000002.000001.000002.0000015

    for the row DEID

    8

    the OpenTagStartPos is 82 and the CloseTagEndPos is 53.

    Is this a typo?

  • for the row DEID 8

    the OpenTagStartPos is 82 and the CloseTagEndPos is 53.

    Is this a typo?

    Yes..a typo. That row actually returns a br for the tag (not a b), and returns the correct start/end position.

    (Copying results and formatting them by hand in markup for article publication is a manual process and subject to typos like this slipping through. Good catch!)

  • This is a really cool exercise. However, it should remain simply that.

    SQL is not HTML

    HTML is not SQL

    I can't believe nobody else has raised a warning flag on this.

    Just because SQL Server enables this doesn't mean it is a good idea.

    Do you know why this could be bad? What concerns or questions must a developer ask before falling back on something this niave?

    It violates almost all of the SOLID principles for software development. SOLID is not only for object oriented coding.

    You are mixing presentation, persistence and database layers all in a database engine. What are you going to do when your SQL Server, which costs a LOT of money, begins to have slower performance because it is Converting To/From HTML To/From SQL instead of managing database and database transactions?

    Keep writing stuff like this. It keeps me in business rescuing applications designed this way!

  • taylor_benjamin (12/29/2014)


    This is a really cool exercise. However, it should remain simply that.

    SQL is not HTML

    HTML is not SQL

    I can't believe nobody else has raised a warning flag on this.

    Just because SQL Server enables this doesn't mean it is a good idea.

    Do you know why this could be bad? ...

    Raised a warning flag on what exactly? I suggest that you re-read the first few paragraphs and re-examine your comment.

    Nobody is suggesting that you store replace SQL with HTML or vise-versa. There are times when you need to extract data from HTML into your OLTP systems or data warehouse. This article provides an excellent way to do so when required.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan:

    I think it is cool that you can do this using SQL.

    There are so many better tools to use. Why would you use an engine tuned to manage set data to perform procedural code using MXLDOM? Yes, SQL Server expresses the contents as sets, with a lot of overhead.

    If you were simply using a local instance of SQL Server to do this, it wouldn't be as bad. But this kind of design can kill a production SQL Server and keep it from doing what it needs to do efficiently; serve up data.

    SSIS was designed to do this kind of thing. There are many other tools. Linq is also another wonderful tool for this kind of activity if you enjoy set like capabilities.

    I don't see any warnings given in the editorial regarding why or where this should be used. Perhaps I need to go back and read it more thoroughly.

    In my 30+ years of DBA experience, if this were done on an instance of SQL Server intended to store data, it is not a design that can scale. If that instance supports things that need to scale, don't implement this there. Put it on SQL Express, or an in memory instance. Don't put this on a production instance of SQL Server.

    I've seen too many systems where databases have been sharded into multiple instances or massive servers were purchased because developers put things on the database server that could be done just as easily using tools that can scale. SQL Server doesn't scale out easily. So, a good rule of thumb is, if it isn't storing, manipulating, or retrieving data, there is probably a better place to do it than on your SQL Server.

    I know that is a hard statement. But with a million war stories, many of my own (and my own making), demonstrating why this is a non-production solution.

    Let me close by saying I really appreciate your style of writing. it is clear and accurate, easy to understand. Please don't let my criticism discourage you in any fashion. I look forward to seeing more.

  • I'll put my bug report here, and thanks for the work on this!

    First, a not-a-bug query: This works fine on my Workstation, but fails to get the data from URLs if run on the server. There are no errors reported, so I assume the GET is not returning anything. I can browse to the page with IE on the server fine. Any thoughts on what Server/SQL/Firewall configuration needs to be done?

    Now, SQLDOM bug and notes:

    sdom.spactDOMClear is missing DELETE on the table #tblDOMHierarchy.

    Not necessarily a bug, but tricky to work around what I found in a loop to get multiple URLs:

    SELECT @UrlId = 1;

    WHILE (@UrlId <= 5)

    BEGIN

    SELECT @URL = URL FROM #urls WHERE UrlId = @UrlId;

    EXEC sdom.sputilGetHTTP

    @URL = @URL,

    @ResponseText = @HTML OUTPUT,

    @SuppressResultset = 1;

    -- ***** DOMLoad will error if there is a value in @DocID, even if @CreateNew = 1

    SET @DocID = NULL;

    EXEC sdom.spactDOMLoad @DocID = @DocID, @HTML = @HTML, @CreateNew = 1;

    -- ***** DOMLoad apparently doesn't return @DocID value correctly

    SET @DocID = @UrlId;

    EXEC sdom.spgetDOM @DocID = @DocID, @SuppressResultSet = 1;

    SELECT @UrlId = @UrlId + 1;

    END;

    Edit: Using SQLDOM_core_persist_927

    So long, and thanks for all the fish,

    Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3

  • Suggestion: Change the spgetInitSession to create the temp tables and then execute user code. This avoids pasting (potentially changeable) SQLDOM code into user's procedures and neatly resolves the temp table scope problem. Note: Set the owner of the sdom schema to an account with permissions to execute the OA_ procedures.

    CREATE PROCEDURE sdom.spgetInitSession

    @SQLToExecute nvarchar(MAX)

    WITH EXECUTE AS OWNER

    AS

    BEGIN

    SET NOCOUNT ON

    IF OBJECT_ID('tempdb..#tblDOMDocs') IS NOT NULL BEGIN

    DROP TABLE #tblDOMDocs

    END

    -- REMOVED THE REMAINING SQLDOM TEMP TABLE CREATION CODE

    -- ...

    -- Now execute the user code

    EXECUTE sp_executesql @SQLToExecute;

    END

    GO

    Usage:

    DECLARE @SqlToExecute VARCHAR(MAX) =

    '

    DECLARE @HTML varchar(MAX);

    DECLARE @DocID INT;

    EXEC sdom.sputilGetHTTP

    @URL = ''https://google.com'',

    @ResponseText = @HTML OUTPUT,

    @SuppressResultset = 0;

    -- ***** DOMLoad will error if there is a value in @DocID, even if @CreateNew = 1

    SET @DocID = NULL;

    EXEC sdom.spactDOMLoad @DocID = @DocID, @HTML = @HTML, @CreateNew = 1;

    -- ***** DOMLoad apparently does not return @DocID value

    SET @DocID = 1;

    EXEC sdom.spgetDOM @DocID = @DocID, @SuppressResultSet = 1;

    SELECT * FROM #tblDOMHierarchy;

    ';

    EXECUTE sdom.spgetInitSession @SqlToExecute;

    So long, and thanks for all the fish,

    Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3

  • FYI, I actually have two variants of SQLDOM--one that creates permanent objects in the database (i.e the "persist" version), and one that creates only temporary objects.

    The article really deals with the temporary-only flavor of SQLDOM.

    However, the thinking behind the "persist" version is that in some user applications there is a SQL connection that stays alive and remains dedicated to the user session. In a scenario like this, at user session start-up you may want to instantiate the session-specific temporary tables, but have these persist for the life of the session. The only way to accomplish this is to have SQLDOM return the dynamic SQL to the caller so that the caller can execute it on the connection. (Otherwise, the scope of the temporary objects would be limited to the initialization call.)

    In fact, I didn't really realize that this article was going to be re-published this month. I have several refinements and updates I have accumulated over the past year that I should post to the SourceForge project. I appreciate and am open to the suggestions, particularly if you and/or others have an interest in using or contributing to the project.

    I might suggest that we move the discussion about enhancements to SourceForge, so as to leave this SQL Server Central thread to deal more specifically with the article. Does that make sense?

  • Here's the answer: On the server, running in a job, results were returned but were getting trucated at 512 characters. I needed to add one statement before calling the sproc that gets the data to get the job to not truncate the data:

    SET TEXTSIZE 2147483647

    Russell Shilling (12/30/2014)


    I'll put my bug report here, and thanks for the work on this!

    First, a not-a-bug query: This works fine on my Workstation, but fails to get the data from URLs if run on the server. There are no errors reported, so I assume the GET is not returning anything. I can browse to the page with IE on the server fine. Any thoughts on what Server/SQL/Firewall configuration needs to be done?

    So long, and thanks for all the fish,

    Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3

  • Will do. I'll put me feedback there.

    David Rueter (12/31/2014)


    <snip>

    In fact, I didn't really realize that this article was going to be re-published this month. I have several refinements and updates I have accumulated over the past year that I should post to the SourceForge project. I appreciate and am open to the suggestions, particularly if you and/or others have an interest in using or contributing to the project.

    I might suggest that we move the discussion about enhancements to SourceForge, so as to leave this SQL Server Central thread to deal more specifically with the article. Does that make sense?

    So long, and thanks for all the fish,

    Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3

Viewing 15 posts - 16 through 30 (of 31 total)

You must be logged in to reply to this topic. Login to reply