Parsing HTML to SQL using SQLDOM

Question

Post reply

Parsing HTML to SQL using SQLDOM

Viewing 15 posts - 16 through 30 (of 32 total)

You must be logged in to reply to this topic. Login to reply

David Rueter SSCrazy Points: 2661 More actions · Answer 1

Jon,

Sure, you could parse HTML that is already stored in a column.

I wrote a brief post[/url] on this sort of thing at http://sf.net/p/sqldom/discussion/general/thread/5bf1fdb6/

In a nutshell, the #spactDOMLoad procedure accepts an @HTML parameter that contains the HTML to be parsed: it can come from any source, including an existing row in a SQL table.

After calling #spactDOMLoad you could select the data you want to keep either with the #spgetDOM procedure or by directly selecting rows from the temporary tables #tblDOM, #tblDomAttribs, etc.

JMelin SSC Journeyman Points: 85 More actions · Answer 2

Hello again,

I was curious if it would be possible to to change what it is your code is looking to capture after a tag that looks like this <OPTION selected>Text I want is here</OPTION>. I have ran your code against some test HTML for the form I was speaking about and I do see the option element in your tag column but the problem is it doesn't pick up the selected portion and place it into the tag column as well, which I use to determine which option the user has selected so that I may parse it. Also, when I use HTML: <Option Selected Id=55555 Name=Burr Oak> It fills in the tag column with a blank string into the name column but a NULL in the ID column. Also if I leave the Selected tag information out it not only puts a blank string in for the name column but also does the same for the ID field as well (see screenshot). If I could get it to somehow notice the Selected word inside the <Option Selected> tag and save the actual ID and Name in the result list I could use it to pull the couple pieces of information I need.

Also on a side note, how do you recommend actually running a where against the result set because I would need to look through hundreds of <option> tags to find the couple that say <option Selected> in it?

Thank you in advance.

David Rueter SSCrazy Points: 2661 More actions · Answer 3

Normally attributes that are parsed by SQLDOM are stored in the table #tblDOMAttributes (which can be joined to table #tblDOM via the DEID column).

In HTML 4, "selected" is really supposed to be attributes like this:

<option selected="selected">My Option</option>

SQLDOM has not support parsing out attributes that are not in the form of "name=value". Consequently, SQLDOM has not parsed out the selected attribute in someting like:

<option selected>My Option</option>

Back in the old days (HTML 2, and possibly HTML 3.2), "selected" with no value was allowed.

Based on your request I modified SQLDOM and posted an updated version (.924) that parses out these old-fashioned quasi-attributes with no values. Please download version .924 from: https://sourceforge.net/projects/sqldom/files/

Once you parse your HTML with version .924 you will be able to retrieve the selected options with something like this:

EXEC #spactDOMLoad @HTML = '<html><body>Hello World.<option selected>Try this</option></body></html>'

SELECT

dom.Tag,

dom.ID,

txt.TextData,

attr.Name,

attr.Value

FROM

#tblDOM dom

LEFT JOIN #tblDOM txt ON

dom.DEID = txt.ParentDEID AND

txt.TextData IS NOT NULL

LEFT JOIN #tblDOMAttribs attr ON

dom.DEID = attr.DEID

JMelin SSC Journeyman Points: 85 More actions · Answer 4

David,

Thank you very, very much. I should be able to pull off exactly what I am trying to do. I really do appreciate it. This could be very useful for future endeavors for me as well.

Thanks again,

Jon

Alan Burstein SSC Guru Points: 61141 More actions · Answer 5

Great article. We'll done sir!

I do, however, disagree a little with this line (emph. mine)...

But HTML is often not well-formed XML. Some tags are "singleton" tags--that do not require end tags...

Html 4.01+ (May 2000) is in fact well-formed XML. In other words, if it was developed or generated this century it is almost certainly well-formed. I don't know how much html 4.0 and ealier is still out there but it's certainly the exception and not the rule.

HTML 4.01 DOES require end tags or it can be empty. E.g. this is well formed: and so is this <p/>. This: <p> is neither legit html if its 4.01 or later, nor is it well-formed xml.

I am not trying to take anything away from your excellent article but felt that is worth calling out.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Robert.Sterbal SSCrazy Points: 2855 More actions · Answer 6

In your article:

This will return a resultset containing the nodes that were parsed:

DEIDDocIDTagIDNameClassTextData

OpenTagStartPos

CloseTagEndPos

ParentDEID

HUIDSortHUIDDOMLevel

11htmlNULLNULLNULLNULL1119NULL111

21headNULLNULLNULLNULL105311.11.0000012

31titleNULLNULLNULLNULL184321.1.1000001.000001.0000013

41NULLNULLNULLNULLSample HTMLNULLNULL31.1.1.1000001.000001.000001.0000014

51bodyNULLNULLNULLNULL5711011.21.0000022

61divNULLNULLNULLNULL6710051.2.1000001.000002.0000013

71NULLNULLNULLNULLHelloNULLNULL61.2.1.1000001.000002.000001.0000014

81bNULLNULLNULLNULL825361.2.1.2000001.000002.000001.0000024

91NULLNULLNULLNULLWorldNULLNULL81.2.1.2.1000001.000002.000001.000002.0000015

for the row DEID

8

the OpenTagStartPos is 82 and the CloseTagEndPos is 53.

Is this a typo?

David Rueter SSCrazy Points: 2661 More actions · Answer 7

for the row DEID 8
the OpenTagStartPos is 82 and the CloseTagEndPos is 53.
Is this a typo?

Yes..a typo. That row actually returns a br for the tag (not a b), and returns the correct start/end position.

(Copying results and formatting them by hand in markup for article publication is a manual process and subject to typos like this slipping through. Good catch!)

taylor_benjamin SSC Veteran Points: 286 More actions · Answer 8

This is a really cool exercise. However, it should remain simply that.

SQL is not HTML

HTML is not SQL

I can't believe nobody else has raised a warning flag on this.

Just because SQL Server enables this doesn't mean it is a good idea.

Do you know why this could be bad? What concerns or questions must a developer ask before falling back on something this niave?

It violates almost all of the SOLID principles for software development. SOLID is not only for object oriented coding.

You are mixing presentation, persistence and database layers all in a database engine. What are you going to do when your SQL Server, which costs a LOT of money, begins to have slower performance because it is Converting To/From HTML To/From SQL instead of managing database and database transactions?

Keep writing stuff like this. It keeps me in business rescuing applications designed this way!

Alan Burstein SSC Guru Points: 61141 More actions · Answer 9

taylor_benjamin (12/29/2014)
This is a really cool exercise. However, it should remain simply that.
SQL is not HTML
HTML is not SQL
I can't believe nobody else has raised a warning flag on this.
Just because SQL Server enables this doesn't mean it is a good idea.
Do you know why this could be bad? ...

Raised a warning flag on what exactly? I suggest that you re-read the first few paragraphs and re-examine your comment.

Nobody is suggesting that you store replace SQL with HTML or vise-versa. There are times when you need to extract data from HTML into your OLTP systems or data warehouse. This article provides an excellent way to do so when required.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

taylor_benjamin SSC Veteran Points: 286 More actions · Answer 10

Alan:

I think it is cool that you can do this using SQL.

There are so many better tools to use. Why would you use an engine tuned to manage set data to perform procedural code using MXLDOM? Yes, SQL Server expresses the contents as sets, with a lot of overhead.

If you were simply using a local instance of SQL Server to do this, it wouldn't be as bad. But this kind of design can kill a production SQL Server and keep it from doing what it needs to do efficiently; serve up data.

SSIS was designed to do this kind of thing. There are many other tools. Linq is also another wonderful tool for this kind of activity if you enjoy set like capabilities.

I don't see any warnings given in the editorial regarding why or where this should be used. Perhaps I need to go back and read it more thoroughly.

In my 30+ years of DBA experience, if this were done on an instance of SQL Server intended to store data, it is not a design that can scale. If that instance supports things that need to scale, don't implement this there. Put it on SQL Express, or an in memory instance. Don't put this on a production instance of SQL Server.

I've seen too many systems where databases have been sharded into multiple instances or massive servers were purchased because developers put things on the database server that could be done just as easily using tools that can scale. SQL Server doesn't scale out easily. So, a good rule of thumb is, if it isn't storing, manipulating, or retrieving data, there is probably a better place to do it than on your SQL Server.

I know that is a hard statement. But with a million war stories, many of my own (and my own making), demonstrating why this is a non-production solution.

Let me close by saying I really appreciate your style of writing. it is clear and accurate, easy to understand. Please don't let my criticism discourage you in any fashion. I look forward to seeing more.

Russell Shilling SSCarpal Tunnel Points: 4399 More actions · Answer 11

I'll put my bug report here, and thanks for the work on this!

First, a not-a-bug query: This works fine on my Workstation, but fails to get the data from URLs if run on the server. There are no errors reported, so I assume the GET is not returning anything. I can browse to the page with IE on the server fine. Any thoughts on what Server/SQL/Firewall configuration needs to be done?

Now, SQLDOM bug and notes:

sdom.spactDOMClear is missing DELETE on the table #tblDOMHierarchy.

Not necessarily a bug, but tricky to work around what I found in a loop to get multiple URLs:

SELECT @UrlId = 1;

WHILE (@UrlId <= 5)

BEGIN

SELECT @URL = URL FROM #urls WHERE UrlId = @UrlId;

EXEC sdom.sputilGetHTTP

@URL = @URL,

@ResponseText = @HTML OUTPUT,

@SuppressResultset = 1;

-- ***** DOMLoad will error if there is a value in @DocID, even if @CreateNew = 1

SET @DocID = NULL;

EXEC sdom.spactDOMLoad @DocID = @DocID, @HTML = @HTML, @CreateNew = 1;

-- ***** DOMLoad apparently doesn't return @DocID value correctly

SET @DocID = @UrlId;

EXEC sdom.spgetDOM @DocID = @DocID, @SuppressResultSet = 1;

SELECT @UrlId = @UrlId + 1;

END;

Edit: Using SQLDOM_core_persist_927

So long, and thanks for all the fish,

Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3

Russell Shilling SSCarpal Tunnel Points: 4399 More actions · Answer 12

Suggestion: Change the spgetInitSession to create the temp tables and then execute user code. This avoids pasting (potentially changeable) SQLDOM code into user's procedures and neatly resolves the temp table scope problem. Note: Set the owner of the sdom schema to an account with permissions to execute the OA_ procedures.

CREATE PROCEDURE sdom.spgetInitSession

@SQLToExecute nvarchar(MAX)

WITH EXECUTE AS OWNER

AS

BEGIN

SET NOCOUNT ON

IF OBJECT_ID('tempdb..#tblDOMDocs') IS NOT NULL BEGIN

DROP TABLE #tblDOMDocs

END

-- REMOVED THE REMAINING SQLDOM TEMP TABLE CREATION CODE

-- ...

-- Now execute the user code

EXECUTE sp_executesql @SQLToExecute;

END

GO

Usage:

DECLARE @SqlToExecute VARCHAR(MAX) =

'

DECLARE @HTML varchar(MAX);

DECLARE @DocID INT;

EXEC sdom.sputilGetHTTP

@URL = ''https://google.com'',

@ResponseText = @HTML OUTPUT,

@SuppressResultset = 0;

-- ***** DOMLoad will error if there is a value in @DocID, even if @CreateNew = 1

SET @DocID = NULL;

EXEC sdom.spactDOMLoad @DocID = @DocID, @HTML = @HTML, @CreateNew = 1;

-- ***** DOMLoad apparently does not return @DocID value

SET @DocID = 1;

EXEC sdom.spgetDOM @DocID = @DocID, @SuppressResultSet = 1;

SELECT * FROM #tblDOMHierarchy;

';

EXECUTE sdom.spgetInitSession @SqlToExecute;

So long, and thanks for all the fish,

Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3

David Rueter SSCrazy Points: 2661 More actions · Answer 13

FYI, I actually have two variants of SQLDOM--one that creates permanent objects in the database (i.e the "persist" version), and one that creates only temporary objects.

The article really deals with the temporary-only flavor of SQLDOM.

However, the thinking behind the "persist" version is that in some user applications there is a SQL connection that stays alive and remains dedicated to the user session. In a scenario like this, at user session start-up you may want to instantiate the session-specific temporary tables, but have these persist for the life of the session. The only way to accomplish this is to have SQLDOM return the dynamic SQL to the caller so that the caller can execute it on the connection. (Otherwise, the scope of the temporary objects would be limited to the initialization call.)

In fact, I didn't really realize that this article was going to be re-published this month. I have several refinements and updates I have accumulated over the past year that I should post to the SourceForge project. I appreciate and am open to the suggestions, particularly if you and/or others have an interest in using or contributing to the project.

I might suggest that we move the discussion about enhancements to SourceForge, so as to leave this SQL Server Central thread to deal more specifically with the article. Does that make sense?

Russell Shilling SSCarpal Tunnel Points: 4399 More actions · Answer 14

Here's the answer: On the server, running in a job, results were returned but were getting trucated at 512 characters. I needed to add one statement before calling the sproc that gets the data to get the job to not truncate the data:

SET TEXTSIZE 2147483647

Russell Shilling (12/30/2014)
I'll put my bug report here, and thanks for the work on this!
First, a not-a-bug query: This works fine on my Workstation, but fails to get the data from URLs if run on the server. There are no errors reported, so I assume the GET is not returning anything. I can browse to the page with IE on the server fine. Any thoughts on what Server/SQL/Firewall configuration needs to be done?

So long, and thanks for all the fish,

Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3

Russell Shilling SSCarpal Tunnel Points: 4399 More actions · Answer 15

Will do. I'll put me feedback there.

David Rueter (12/31/2014)
<snip>
In fact, I didn't really realize that this article was going to be re-published this month. I have several refinements and updates I have accumulated over the past year that I should post to the SourceForge project. I appreciate and am open to the suggestions, particularly if you and/or others have an interest in using or contributing to the project.
I might suggest that we move the discussion about enhancements to SourceForge, so as to leave this SQL Server Central thread to deal more specifically with the article. Does that make sense?

So long, and thanks for all the fish,

Russell Shilling, MCDBA, MCSA 2K3, MCSE 2K3

Parsing HTML to SQL using SQLDOM

Cookies on SQLServerCentral