Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12

Parsing HTML to SQL using SQLDOM Expand / Collapse
Author
Message
Posted Tuesday, April 17, 2012 1:46 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Wednesday, August 20, 2014 2:22 AM
Points: 50, Visits: 124
The text file in the project looks good.
Post #1284722
Posted Tuesday, April 17, 2012 8:38 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Yesterday @ 2:44 PM
Points: 55, Visits: 430
I did stress test, and spent some time optimizing.

I don't have detailed benchmarks handy, but by way of example parsing the Google home page (once the HTML is loaded in to a variable) takes roughly 120ms on my dev box--a modest VM running SQL 2008. (Tested over a large number of iterations.)

Anecdotally, I noticed that parse time is similar to (perhaps a little faster) http://lint.brihten.com/html/

I was able to decrease parse time down to about a third of what it was in my initial version.

I learned something very interesting: concatenating long strings is a performance killer. Concatenating short strings is fine. A VARCHAR(MAX) that has less than 8000 characters is treated as a short string--with fast concatenation. When the string grows to 8000 characters, concatenation becomes painfully slow.

So to optimize, where concatenation is needed I try to work only with short strings. When I approach 8000 I concatenate the chunk of data to a larger string and then clear out my accumulator variable. In this way I minimize the number of long concatenations.

I know performance for the actual parsing isn't as fast as what we would see if the parser were built in C++, but it seems that performance is solid, consistent and sufficient for some production uses. Also, it is uncertain that a different implementation in C++ could parse and deliver the rows to SQL faster: I suspect that this T-SQL implementation isn't quite as efficient in processing strings, but excels in getting the data into a resultset--meaning that overall performance may be similar to what you could achieve in a different environment.
Post #1284975
Posted Wednesday, April 18, 2012 2:59 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, May 3, 2012 8:20 AM
Points: 4, Visits: 15
This looks like it could be very useful for a project I have on the go. Even if it's not, I'm very impressed with the work behind it. Well done you!
Post #1285479
Posted Thursday, April 19, 2012 6:55 AM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Yesterday @ 6:16 AM
Points: 51, Visits: 385
Good post. While I don't have an immediate use for this, I'll tuck it away for the future.

Also, I do know why you felt the need to include a whole text document about God in the project files. I'm sure God is the center of all things and I ask His guidance even before working with SQL Server. Thank you indeed!
Post #1286377
Posted Monday, April 23, 2012 10:09 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 10, 2013 9:11 AM
Points: 5, Visits: 22
Looks like this would be a very useful tool. I just had one quick question. I am storing an HTML form inside of a table in my database. What I am looking to do is parse certain portions of the HTML code and place it into another table along with other data that I can obtain without parsing HTML. My question is would I be able ot point your SQLDOM at a column that holds HTML code and use it in a similar fashion?

Essentially I can grab the data normally through SSIS and place it into a table but instead of saving the HTML column from the source I would like to parse out certain elements of the HTML and place it into the destination along with the elements I need that aren't within the HTML code.

Thank you,
Jon
Post #1288309
Posted Monday, April 23, 2012 10:35 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 10, 2013 9:11 AM
Points: 5, Visits: 22
--- Sorry double post, was having browser issues ---

Looks like this would be a very useful tool. I just had one quick question. I am storing an HTML form inside of a table in my database. What I am looking to do is parse certain portions of the HTML code and place it into another table along with other data that I can obtain without parsing HTML. My question is would I be able ot point your SQLDOM at a column that holds HTML code and use it in a similar fashion?

Essentially I can grab the data normally through SSIS and place it into a table but instead of saving the HTML column from the source I would like to parse out certain elements of the HTML and place it into the destination along with the elements I need that aren't within the HTML code.

Thank you,
Jon
Post #1288332
Posted Monday, April 23, 2012 12:52 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Yesterday @ 2:44 PM
Points: 55, Visits: 430
Jon,

Sure, you could parse HTML that is already stored in a column.

I wrote a brief post on this sort of thing at http://sf.net/p/sqldom/discussion/general/thread/5bf1fdb6/

In a nutshell, the #spactDOMLoad procedure accepts an @HTML parameter that contains the HTML to be parsed: it can come from any source, including an existing row in a SQL table.

After calling #spactDOMLoad you could select the data you want to keep either with the #spgetDOM procedure or by directly selecting rows from the temporary tables #tblDOM, #tblDomAttribs, etc.
Post #1288449
Posted Tuesday, April 24, 2012 3:31 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 10, 2013 9:11 AM
Points: 5, Visits: 22
Hello again,

I was curious if it would be possible to to change what it is your code is looking to capture after a tag that looks like this <OPTION selected>Text I want is here</OPTION>. I have ran your code against some test HTML for the form I was speaking about and I do see the option element in your tag column but the problem is it doesn't pick up the selected portion and place it into the tag column as well, which I use to determine which option the user has selected so that I may parse it. Also, when I use HTML: <Option Selected Id=55555 Name=Burr Oak> It fills in the tag column with a blank string into the name column but a NULL in the ID column. Also if I leave the Selected tag information out it not only puts a blank string in for the name column but also does the same for the ID field as well (see screenshot). If I could get it to somehow notice the Selected word inside the <Option Selected> tag and save the actual ID and Name in the result list I could use it to pull the couple pieces of information I need.

Also on a side note, how do you recommend actually running a where against the result set because I would need to look through hundreds of <option> tags to find the couple that say <option Selected> in it?

Thank you in advance.


  Post Attachments 
ExampleDOM.jpg (2 views, 100.56 KB)
Post #1289404
Posted Tuesday, April 24, 2012 10:33 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Yesterday @ 2:44 PM
Points: 55, Visits: 430
Normally attributes that are parsed by SQLDOM are stored in the table #tblDOMAttributes (which can be joined to table #tblDOM via the DEID column).

In HTML 4, "selected" is really supposed to be attributes like this:

<option selected="selected">My Option</option>

SQLDOM has not support parsing out attributes that are not in the form of "name=value". Consequently, SQLDOM has not parsed out the selected attribute in someting like:

<option selected>My Option</option>

Back in the old days (HTML 2, and possibly HTML 3.2), "selected" with no value was allowed.

Based on your request I modified SQLDOM and posted an updated version (.924) that parses out these old-fashioned quasi-attributes with no values. Please download version .924 from: https://sourceforge.net/projects/sqldom/files/

Once you parse your HTML with version .924 you will be able to retrieve the selected options with something like this:

EXEC #spactDOMLoad @HTML = '<html><body>Hello World.<option selected>Try this</option></body></html>'

SELECT
dom.Tag,
dom.ID,
txt.TextData,
attr.Name,
attr.Value
FROM
#tblDOM dom
LEFT JOIN #tblDOM txt ON
dom.DEID = txt.ParentDEID AND
txt.TextData IS NOT NULL
LEFT JOIN #tblDOMAttribs attr ON
dom.DEID = attr.DEID

Post #1289533
Posted Thursday, April 26, 2012 3:40 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, July 10, 2013 9:11 AM
Points: 5, Visits: 22
David,

Thank you very, very much. I should be able to pull off exactly what I am trying to do. I really do appreciate it. This could be very useful for future endeavors for me as well.

Thanks again,
Jon
Post #1291162
« Prev Topic | Next Topic »

Add to briefcase ««12

Permissions Expand / Collapse