|
|
|
Valued Member
      
Group: General Forum Members
Last Login: Monday, May 20, 2013 10:32 PM
Points: 50,
Visits: 120
|
|
| The text file in the project looks good.
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Friday, May 31, 2013 11:24 AM
Points: 44,
Visits: 359
|
|
I did stress test, and spent some time optimizing.
I don't have detailed benchmarks handy, but by way of example parsing the Google home page (once the HTML is loaded in to a variable) takes roughly 120ms on my dev box--a modest VM running SQL 2008. (Tested over a large number of iterations.)
Anecdotally, I noticed that parse time is similar to (perhaps a little faster) http://lint.brihten.com/html/
I was able to decrease parse time down to about a third of what it was in my initial version.
I learned something very interesting: concatenating long strings is a performance killer. Concatenating short strings is fine. A VARCHAR(MAX) that has less than 8000 characters is treated as a short string--with fast concatenation. When the string grows to 8000 characters, concatenation becomes painfully slow.
So to optimize, where concatenation is needed I try to work only with short strings. When I approach 8000 I concatenate the chunk of data to a larger string and then clear out my accumulator variable. In this way I minimize the number of long concatenations.
I know performance for the actual parsing isn't as fast as what we would see if the parser were built in C++, but it seems that performance is solid, consistent and sufficient for some production uses. Also, it is uncertain that a different implementation in C++ could parse and deliver the rows to SQL faster: I suspect that this T-SQL implementation isn't quite as efficient in processing strings, but excels in getting the data into a resultset--meaning that overall performance may be similar to what you could achieve in a different environment.
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Thursday, May 03, 2012 8:20 AM
Points: 4,
Visits: 15
|
|
| This looks like it could be very useful for a project I have on the go. Even if it's not, I'm very impressed with the work behind it. Well done you!
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Tuesday, June 04, 2013 8:55 AM
Points: 39,
Visits: 292
|
|
Good post. While I don't have an immediate use for this, I'll tuck it away for the future.
Also, I do know why you felt the need to include a whole text document about God in the project files. I'm sure God is the center of all things and I ask His guidance even before working with SQL Server. Thank you indeed! 
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Monday, July 09, 2012 1:24 PM
Points: 5,
Visits: 21
|
|
Looks like this would be a very useful tool. I just had one quick question. I am storing an HTML form inside of a table in my database. What I am looking to do is parse certain portions of the HTML code and place it into another table along with other data that I can obtain without parsing HTML. My question is would I be able ot point your SQLDOM at a column that holds HTML code and use it in a similar fashion?
Essentially I can grab the data normally through SSIS and place it into a table but instead of saving the HTML column from the source I would like to parse out certain elements of the HTML and place it into the destination along with the elements I need that aren't within the HTML code.
Thank you, Jon
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Monday, July 09, 2012 1:24 PM
Points: 5,
Visits: 21
|
|
--- Sorry double post, was having browser issues ---
Looks like this would be a very useful tool. I just had one quick question. I am storing an HTML form inside of a table in my database. What I am looking to do is parse certain portions of the HTML code and place it into another table along with other data that I can obtain without parsing HTML. My question is would I be able ot point your SQLDOM at a column that holds HTML code and use it in a similar fashion?
Essentially I can grab the data normally through SSIS and place it into a table but instead of saving the HTML column from the source I would like to parse out certain elements of the HTML and place it into the destination along with the elements I need that aren't within the HTML code.
Thank you, Jon
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Friday, May 31, 2013 11:24 AM
Points: 44,
Visits: 359
|
|
Jon,
Sure, you could parse HTML that is already stored in a column.
I wrote a brief post on this sort of thing at http://sf.net/p/sqldom/discussion/general/thread/5bf1fdb6/
In a nutshell, the #spactDOMLoad procedure accepts an @HTML parameter that contains the HTML to be parsed: it can come from any source, including an existing row in a SQL table.
After calling #spactDOMLoad you could select the data you want to keep either with the #spgetDOM procedure or by directly selecting rows from the temporary tables #tblDOM, #tblDomAttribs, etc.
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Monday, July 09, 2012 1:24 PM
Points: 5,
Visits: 21
|
|
Hello again,
I was curious if it would be possible to to change what it is your code is looking to capture after a tag that looks like this <OPTION selected>Text I want is here</OPTION>. I have ran your code against some test HTML for the form I was speaking about and I do see the option element in your tag column but the problem is it doesn't pick up the selected portion and place it into the tag column as well, which I use to determine which option the user has selected so that I may parse it. Also, when I use HTML: <Option Selected Id=55555 Name=Burr Oak> It fills in the tag column with a blank string into the name column but a NULL in the ID column. Also if I leave the Selected tag information out it not only puts a blank string in for the name column but also does the same for the ID field as well (see screenshot). If I could get it to somehow notice the Selected word inside the <Option Selected> tag and save the actual ID and Name in the result list I could use it to pull the couple pieces of information I need.
Also on a side note, how do you recommend actually running a where against the result set because I would need to look through hundreds of <option> tags to find the couple that say <option Selected> in it?
Thank you in advance.
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Friday, May 31, 2013 11:24 AM
Points: 44,
Visits: 359
|
|
Normally attributes that are parsed by SQLDOM are stored in the table #tblDOMAttributes (which can be joined to table #tblDOM via the DEID column).
In HTML 4, "selected" is really supposed to be attributes like this:
<option selected="selected">My Option</option>
SQLDOM has not support parsing out attributes that are not in the form of "name=value". Consequently, SQLDOM has not parsed out the selected attribute in someting like:
<option selected>My Option</option>
Back in the old days (HTML 2, and possibly HTML 3.2), "selected" with no value was allowed.
Based on your request I modified SQLDOM and posted an updated version (.924) that parses out these old-fashioned quasi-attributes with no values. Please download version .924 from: https://sourceforge.net/projects/sqldom/files/
Once you parse your HTML with version .924 you will be able to retrieve the selected options with something like this:
EXEC #spactDOMLoad @HTML = '<html><body>Hello World.<option selected>Try this</option></body></html>'
SELECT dom.Tag, dom.ID, txt.TextData, attr.Name, attr.Value FROM #tblDOM dom LEFT JOIN #tblDOM txt ON dom.DEID = txt.ParentDEID AND txt.TextData IS NOT NULL LEFT JOIN #tblDOMAttribs attr ON dom.DEID = attr.DEID
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Monday, July 09, 2012 1:24 PM
Points: 5,
Visits: 21
|
|
David,
Thank you very, very much. I should be able to pull off exactly what I am trying to do. I really do appreciate it. This could be very useful for future endeavors for me as well.
Thanks again, Jon
|
|
|
|