Parsing HTML to SQL using SQLDOM

  • David Rueter

    SSCrazy

    Points: 2632

    Comments posted to this topic are about the item Parsing HTML to SQL using SQLDOM

  • davoscollective

    SSCertifiable

    Points: 6355

    Very useful tools, thanks for sharing!

  • phil 26982

    SSC Enthusiast

    Points: 105

    Very nice thanks for sharing those tools with us!

    I have had to do somethinng similar in the past but was fortunate enough to be using an html source that was well formatted xml, so the xml type was just what I needed. This looks very useful for 'other' flavours of html document.

  • Jerrry Brenner

    SSC Journeyman

    Points: 81

    I gave a talk at SQL Saturday 109 about how we have functionality in our application that captures the top 400 SQL statements by total elapsed time, including the showplan xml, and builds a zip file of interlinked html pages and returns the zip file to the user through the browser. For each showplan xml string, we build a DOM and run some xPath expressions across the DOM, to pull out useful information and to build a tabular representation of the query plan, showing useful information that's dependent on the type of RelOp node. (I'm repeating the talk for the SQL PASS Virtual Performance Chapter[/url] on 5/24. I got the idea of using a tabular display for the query plans from blog posts by Pinal Dave and Michael Coles. They used xQuery imbedded in SQL, but doing it in the app server is much more efficient, especially given the # of plans that I am capturing and the conditional evaluation of xPath expressions. (Show the build and probe tables for a hash match, sort columns, etc.)

    It seems like your code could be used to do some (or all) of that, with the ability to conditionally run xPath expressions against the showplan xml being very useful for anyone that wanted to do a detailed analysis of an arbitrary query plan.

  • Phil Morris-454316

    SSC Veteran

    Points: 252

    That's really neat Jerry. SSIS log tables would be another similar need for something good like that. I've done it with the BI tools before, but a SQL-HTML setup for this would be even better. I had a play with SQL-XML-XSLT-HTML with this also and that also seemed a neat approach.

  • chris.rogers00

    Right there with Babe

    Points: 790

    Very useful tool, thank you, but I don't know why you felt the need to include a whole text document about "God" in the project files. I'm not sure what your god has to do with SQL Server. Thank you though

  • Dave Vroman

    SSC Eights!

    Points: 821

    Very nice. I'm curious if the SQLDOM is using HTML5.

    In that case the <BR>

    would become <br />. Any of the tags that do not have a closing tag, such as line break and image must be in the form <tag />.

  • dmfd

    SSC Enthusiast

    Points: 164

    David, thank you for this. I have a pet project that will require me to screen scrape a few sites. This will definitely come in handy.

  • mohdasalah

    Valued Member

    Points: 73

    Thank you it's very usefull but how about its performance, did you do any stress test?

  • EKD

    Mr or Mrs. 500

    Points: 502

    The text file in the project looks good.

  • David Rueter

    SSCrazy

    Points: 2632

    I did stress test, and spent some time optimizing.

    I don't have detailed benchmarks handy, but by way of example parsing the Google home page (once the HTML is loaded in to a variable) takes roughly 120ms on my dev box--a modest VM running SQL 2008. (Tested over a large number of iterations.)

    Anecdotally, I noticed that parse time is similar to (perhaps a little faster) http://lint.brihten.com/html/

    I was able to decrease parse time down to about a third of what it was in my initial version.

    I learned something very interesting: concatenating long strings is a performance killer. Concatenating short strings is fine. A VARCHAR(MAX) that has less than 8000 characters is treated as a short string--with fast concatenation. When the string grows to 8000 characters, concatenation becomes painfully slow.

    So to optimize, where concatenation is needed I try to work only with short strings. When I approach 8000 I concatenate the chunk of data to a larger string and then clear out my accumulator variable. In this way I minimize the number of long concatenations.

    I know performance for the actual parsing isn't as fast as what we would see if the parser were built in C++, but it seems that performance is solid, consistent and sufficient for some production uses. Also, it is uncertain that a different implementation in C++ could parse and deliver the rows to SQL faster: I suspect that this T-SQL implementation isn't quite as efficient in processing strings, but excels in getting the data into a resultset--meaning that overall performance may be similar to what you could achieve in a different environment.

  • anthonyqkiernan

    SSC Rookie

    Points: 46

    This looks like it could be very useful for a project I have on the go. Even if it's not, I'm very impressed with the work behind it. Well done you!

  • heb1014

    Hall of Fame

    Points: 3790

    Good post. While I don't have an immediate use for this, I'll tuck it away for the future.

    Also, I do know why you felt the need to include a whole text document about God in the project files. I'm sure God is the center of all things and I ask His guidance even before working with SQL Server. Thank you indeed! 🙂

  • JMelin

    SSC Journeyman

    Points: 85

    Looks like this would be a very useful tool. I just had one quick question. I am storing an HTML form inside of a table in my database. What I am looking to do is parse certain portions of the HTML code and place it into another table along with other data that I can obtain without parsing HTML. My question is would I be able ot point your SQLDOM at a column that holds HTML code and use it in a similar fashion?

    Essentially I can grab the data normally through SSIS and place it into a table but instead of saving the HTML column from the source I would like to parse out certain elements of the HTML and place it into the destination along with the elements I need that aren't within the HTML code.

    Thank you,

    Jon

  • JMelin

    SSC Journeyman

    Points: 85

    --- Sorry double post, was having browser issues ---

    Looks like this would be a very useful tool. I just had one quick question. I am storing an HTML form inside of a table in my database. What I am looking to do is parse certain portions of the HTML code and place it into another table along with other data that I can obtain without parsing HTML. My question is would I be able ot point your SQLDOM at a column that holds HTML code and use it in a similar fashion?

    Essentially I can grab the data normally through SSIS and place it into a table but instead of saving the HTML column from the source I would like to parse out certain elements of the HTML and place it into the destination along with the elements I need that aren't within the HTML code.

    Thank you,

    Jon

Viewing 15 posts - 1 through 15 (of 31 total)

You must be logged in to reply to this topic. Login to reply