Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««1234»»»

Parsing HTML to SQL using SQLDOM Expand / Collapse
Author
Message
Posted Tuesday, April 17, 2012 8:38 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Today @ 9:45 AM
Points: 71, Visits: 547
I did stress test, and spent some time optimizing.

I don't have detailed benchmarks handy, but by way of example parsing the Google home page (once the HTML is loaded in to a variable) takes roughly 120ms on my dev box--a modest VM running SQL 2008. (Tested over a large number of iterations.)

Anecdotally, I noticed that parse time is similar to (perhaps a little faster) http://lint.brihten.com/html/

I was able to decrease parse time down to about a third of what it was in my initial version.

I learned something very interesting: concatenating long strings is a performance killer. Concatenating short strings is fine. A VARCHAR(MAX) that has less than 8000 characters is treated as a short string--with fast concatenation. When the string grows to 8000 characters, concatenation becomes painfully slow.

So to optimize, where concatenation is needed I try to work only with short strings. When I approach 8000 I concatenate the chunk of data to a larger string and then clear out my accumulator variable. In this way I minimize the number of long concatenations.

I know performance for the actual parsing isn't as fast as what we would see if the parser were built in C++, but it seems that performance is solid, consistent and sufficient for some production uses. Also, it is uncertain that a different implementation in C++ could parse and deliver the rows to SQL faster: I suspect that this T-SQL implementation isn't quite as efficient in processing strings, but excels in getting the data into a resultset--meaning that overall performance may be similar to what you could achieve in a different environment.
Post #1284975
Posted Wednesday, April 18, 2012 2:59 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, May 3, 2012 8:20 AM
Points: 4, Visits: 15
This looks like it could be very useful for a project I have on the go. Even if it's not, I'm very impressed with the work behind it. Well done you!
Post #1285479
Posted Thursday, April 19, 2012 6:55 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Saturday, August 8, 2015 4:00 PM
Points: 56, Visits: 423
Good post. While I don't have an immediate use for this, I'll tuck it away for the future.

Also, I do know why you felt the need to include a whole text document about God in the project files. I'm sure God is the center of all things and I ask His guidance even before working with SQL Server. Thank you indeed!
Post #1286377
Posted Monday, April 23, 2012 10:09 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 10, 2013 9:11 AM
Points: 5, Visits: 22
Looks like this would be a very useful tool. I just had one quick question. I am storing an HTML form inside of a table in my database. What I am looking to do is parse certain portions of the HTML code and place it into another table along with other data that I can obtain without parsing HTML. My question is would I be able ot point your SQLDOM at a column that holds HTML code and use it in a similar fashion?

Essentially I can grab the data normally through SSIS and place it into a table but instead of saving the HTML column from the source I would like to parse out certain elements of the HTML and place it into the destination along with the elements I need that aren't within the HTML code.

Thank you,
Jon
Post #1288309
Posted Monday, April 23, 2012 10:35 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 10, 2013 9:11 AM
Points: 5, Visits: 22
--- Sorry double post, was having browser issues ---

Looks like this would be a very useful tool. I just had one quick question. I am storing an HTML form inside of a table in my database. What I am looking to do is parse certain portions of the HTML code and place it into another table along with other data that I can obtain without parsing HTML. My question is would I be able ot point your SQLDOM at a column that holds HTML code and use it in a similar fashion?

Essentially I can grab the data normally through SSIS and place it into a table but instead of saving the HTML column from the source I would like to parse out certain elements of the HTML and place it into the destination along with the elements I need that aren't within the HTML code.

Thank you,
Jon
Post #1288332
Posted Monday, April 23, 2012 12:52 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Today @ 9:45 AM
Points: 71, Visits: 547
Jon,

Sure, you could parse HTML that is already stored in a column.

I wrote a brief post on this sort of thing at http://sf.net/p/sqldom/discussion/general/thread/5bf1fdb6/

In a nutshell, the #spactDOMLoad procedure accepts an @HTML parameter that contains the HTML to be parsed: it can come from any source, including an existing row in a SQL table.

After calling #spactDOMLoad you could select the data you want to keep either with the #spgetDOM procedure or by directly selecting rows from the temporary tables #tblDOM, #tblDomAttribs, etc.
Post #1288449
Posted Tuesday, April 24, 2012 3:31 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 10, 2013 9:11 AM
Points: 5, Visits: 22
Hello again,

I was curious if it would be possible to to change what it is your code is looking to capture after a tag that looks like this <OPTION selected>Text I want is here</OPTION>. I have ran your code against some test HTML for the form I was speaking about and I do see the option element in your tag column but the problem is it doesn't pick up the selected portion and place it into the tag column as well, which I use to determine which option the user has selected so that I may parse it. Also, when I use HTML: <Option Selected Id=55555 Name=Burr Oak> It fills in the tag column with a blank string into the name column but a NULL in the ID column. Also if I leave the Selected tag information out it not only puts a blank string in for the name column but also does the same for the ID field as well (see screenshot). If I could get it to somehow notice the Selected word inside the <Option Selected> tag and save the actual ID and Name in the result list I could use it to pull the couple pieces of information I need.

Also on a side note, how do you recommend actually running a where against the result set because I would need to look through hundreds of <option> tags to find the couple that say <option Selected> in it?

Thank you in advance.


  Post Attachments 
ExampleDOM.jpg (2 views, 100.56 KB)
Post #1289404
Posted Tuesday, April 24, 2012 10:33 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Today @ 9:45 AM
Points: 71, Visits: 547
Normally attributes that are parsed by SQLDOM are stored in the table #tblDOMAttributes (which can be joined to table #tblDOM via the DEID column).

In HTML 4, "selected" is really supposed to be attributes like this:

<option selected="selected">My Option</option>

SQLDOM has not support parsing out attributes that are not in the form of "name=value". Consequently, SQLDOM has not parsed out the selected attribute in someting like:

<option selected>My Option</option>

Back in the old days (HTML 2, and possibly HTML 3.2), "selected" with no value was allowed.

Based on your request I modified SQLDOM and posted an updated version (.924) that parses out these old-fashioned quasi-attributes with no values. Please download version .924 from: https://sourceforge.net/projects/sqldom/files/

Once you parse your HTML with version .924 you will be able to retrieve the selected options with something like this:

EXEC #spactDOMLoad @HTML = '<html><body>Hello World.<option selected>Try this</option></body></html>'

SELECT
dom.Tag,
dom.ID,
txt.TextData,
attr.Name,
attr.Value
FROM
#tblDOM dom
LEFT JOIN #tblDOM txt ON
dom.DEID = txt.ParentDEID AND
txt.TextData IS NOT NULL
LEFT JOIN #tblDOMAttribs attr ON
dom.DEID = attr.DEID

Post #1289533
Posted Thursday, April 26, 2012 3:40 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 10, 2013 9:11 AM
Points: 5, Visits: 22
David,

Thank you very, very much. I should be able to pull off exactly what I am trying to do. I really do appreciate it. This could be very useful for future endeavors for me as well.

Thanks again,
Jon
Post #1291162
Posted Sunday, December 28, 2014 12:10 AM


Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Today @ 7:38 AM
Points: 1,189, Visits: 4,469
Great article. We'll done sir!

I do, however, disagree a little with this line (emph. mine)...

But HTML is often not well-formed XML. Some tags are "singleton" tags--that do not require end tags...


Html 4.01+ (May 2000) is in fact well-formed XML. In other words, if it was developed or generated this century it is almost certainly well-formed. I don't know how much html 4.0 and ealier is still out there but it's certainly the exception and not the rule.

HTML 4.01 DOES require end tags or it can be empty. E.g. this is well formed: and so is this <p/>. This: <p> is neither legit html if its 4.01 or later, nor is it well-formed xml.

I am not trying to take anything away from your excellent article but felt that is worth calling out.


-- Alan Burstein



Read this article for best practices on asking questions.
Need to split a string? Try this (Jeff Moden)
Need a pattern-based string spitter? Try this (Dwain Camps)

"I can't stress enough the importance of switching from a 'sequential files' mindset to 'set-based' thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code. " -- Itzek Ben-Gan 2001

My blog
Post #1646699
« Prev Topic | Next Topic »

Add to briefcase ««1234»»»

Permissions Expand / Collapse