September 30, 2012 at 11:31 pm
Dear SQL Gurus:
I have a table [tblHTML] with an ntext column [strPage] that has HTML in it.
Each record's [strPage] field text should contain an equal number of '<table>' and '</table>' strings.
But some do not. I need to find those records so I can fix them.
I think what I need to do is:
1. figure out how many occurrences of '<table>' and '</table>' there are in each record's [strPage].
2. If the number of '<table>' strings does not equal the number of '</table>' strings, then return the id of that record as part of the recordset.
I'm stuck on #1. I've looked for 3 hours now, and I haven't found anyone else with my exact issue - to return a count for each record of the number of occurrences of a string in an ntext field.
I found some solutions that involve using LEN and REPLACE ( like this one) but those do not seem to work because "Argument data type ntext is invalid for argument 1 of replace function."
Then I found a post here at SQL Sever Central and I eagerly made a number table to use with it, but then I realized it returns a total count for the column, not the individual records that contain the string.
Can you please help me or point me in the right direction? Thank you!
October 1, 2012 at 4:05 am
You should really avoid using ntext, text and image datatypes and use nvarchar(max), varchar(max), and varbinary(max) instead http://msdn.microsoft.com/en-us/library/ms187993.aspx. I guess you have old database migrated from previous version of SQL Server.
For mow you can convert ntext to nvarchar(max) before using string functions. However, I would create and use CLR function which would give way better performance for what you're trying to do.
October 1, 2012 at 5:56 am
This will get you 1 and 2:
DROP TABLE #tblHTML
CREATE TABLE #tblHTML (RowID INT, strPage ntext)
INSERT INTO #tblHTML (RowID, strPage)
SELECT
1, REPLICATE(CAST('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' AS VARCHAR(MAX)),25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
UNION ALL
SELECT
2, REPLICATE(CAST('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' AS VARCHAR(MAX)),25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + 'xxxxxxxx'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
+ REPLICATE('I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ',25)
+ '<table>' + 'I have a table [tblHTML] with an ntext column [strPage] that has HTML in it ' + '</table>'
SELECT *, LEN(CAST(strPage AS VARCHAR(MAX))) FROM #tblHTML
;WITH rCTE AS (
SELECT RowID, q.strPage,
lastStartpos = NULLIF(CHARINDEX('<table>',q.strPage,1),0),
lastEndpospos = NULLIF(CHARINDEX('</table>',q.strPage,1),0)
FROM #tblHTML
CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q
UNION ALL
SELECT RowID, strPage,
lastStartpos = NULLIF(CHARINDEX('<table>',strPage,lastStartpos+1),0),
lastEndpospos = NULLIF(CHARINDEX('</table>',strPage,lastEndpospos+1),0)
FROM rCTE
WHERE NOT (lastStartpos IS NULL OR lastEndpospos IS NULL)
)
SELECT RowID, TagStarts = COUNT(lastStartpos), TagEnds = COUNT(lastEndpospos)
FROM rCTE
GROUP BY RowID
HAVING COUNT(lastStartpos) <> COUNT(lastEndpospos)
ORDER BY RowID
OPTION(MAXRECURSION 0)
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 1, 2012 at 6:12 am
This appears to work:
I've elected to show rows where both values are zero in case these are wrong - You can remove this if you want.
I've set MAXRECURSION to 10, but you can increase this if necessary...
IF OBJECT_ID('dbo.tblHTML') IS NOT NULL
DROP TABLE dbo.tblHTML;
CREATE TABLE [dbo].[tblHTML](
[RowId] [int] NOT NULL,
[strPage] [ntext] NULL
) ON [PRIMARY]
insert into dbo.tblHTML values ( 1, '<table>www</table><table>xxx</table><table>yyy</table>zzz</table>' );
insert into dbo.tblHTML values ( 2, '<table>xxx</table><table>yyy</table><table>zzz</table>' );
insert into dbo.tblHTML values ( 3, '>xxx</table><table>yyy</table><table>zzz</table>' );
insert into dbo.tblHTML values ( 4, '>xxx</table>' );
insert into dbo.tblHTML values ( 5, '>xxx</table' );
insert into dbo.tblHTML values ( 6, '<table>xxx</table>' );
DECLARE @SearchHTMLStart Varchar(30), @SearchHTMLEnd Varchar(30);
SET @SearchHTMLStart = '<table>';
SET @SearchHTMLEnd = '</table>';
WITH CTE_Start AS
(
SELECT RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 )
FROM dbo.tblHTML
UNION ALL
SELECT h.RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 )
FROM dbo.tblHTML h
INNER JOIN CTE_Start c ON h.RowId = c.RowId
WHERE Occurrence > 0
),
CTE_End AS
(
SELECT RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 )
FROM dbo.tblHTML
UNION ALL
SELECT h.RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 )
FROM dbo.tblHTML h
INNER JOIN CTE_End c ON h.RowId = c.RowId
WHERE Occurrence > 0
),
CTE_Start_Sum AS
(
SELECT s.RowId, SUM(s.Occurrence) as NumberStart
FROM CTE_Start s
GROUP BY s.RowId
),
CTE_End_Sum AS
(
SELECT e.RowId, SUM(e.Occurrence) as NumberEnd
FROM CTE_End e
GROUP BY e.RowId
)
SELECT ISNULL(s.RowId, e.RowId) as RowId, NumberStart, NumberEnd
FROM CTE_Start_Sum s
INNER JOIN CTE_End_Sum e ON s.RowId = e.RowId
WHERE (NumberStart <> NumberEnd) OR (NumberStart = 0 AND NumberEnd = 0)
OPTION (MAXRECURSION 10);
October 1, 2012 at 6:19 am
i like all the examples so rar, but make sure you search for <table without the ending > tag.
if the table has a class, a css style or even older style html, you will get misleading counts;
<table border="1">
<table class="main_css">
Lowell
October 1, 2012 at 6:24 am
laurie-789651 (10/1/2012)
This appears to work:I've elected to show rows where both values are zero in case these are wrong - You can remove this if you want.
I've set MAXRECURSION to 10, but you can increase this if necessary...
IF OBJECT_ID('dbo.tblHTML') IS NOT NULL
DROP TABLE dbo.tblHTML;
CREATE TABLE [dbo].[tblHTML](
[RowId] [int] NOT NULL,
[strPage] [ntext] NULL
) ON [PRIMARY]
insert into dbo.tblHTML values ( 1, '<table>www</table><table>xxx</table><table>yyy</table>zzz</table>' );
insert into dbo.tblHTML values ( 2, '<table>xxx</table><table>yyy</table><table>zzz</table>' );
insert into dbo.tblHTML values ( 3, '>xxx</table><table>yyy</table><table>zzz</table>' );
insert into dbo.tblHTML values ( 4, '>xxx</table>' );
insert into dbo.tblHTML values ( 5, '>xxx</table' );
insert into dbo.tblHTML values ( 6, '<table>xxx</table>' );
DECLARE @SearchHTMLStart Varchar(30), @SearchHTMLEnd Varchar(30);
SET @SearchHTMLStart = '<table>';
SET @SearchHTMLEnd = '</table>';
WITH CTE_Start AS
(
SELECT RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 )
FROM dbo.tblHTML
UNION ALL
SELECT h.RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 )
FROM dbo.tblHTML h
INNER JOIN CTE_Start c ON h.RowId = c.RowId
WHERE Occurrence > 0
),
CTE_End AS
(
SELECT RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 )
FROM dbo.tblHTML
UNION ALL
SELECT h.RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 )
FROM dbo.tblHTML h
INNER JOIN CTE_End c ON h.RowId = c.RowId
WHERE Occurrence > 0
),
CTE_Start_Sum AS
(
SELECT s.RowId, SUM(s.Occurrence) as NumberStart
FROM CTE_Start s
GROUP BY s.RowId
),
CTE_End_Sum AS
(
SELECT e.RowId, SUM(e.Occurrence) as NumberEnd
FROM CTE_End e
GROUP BY e.RowId
)
SELECT ISNULL(s.RowId, e.RowId) as RowId, NumberStart, NumberEnd
FROM CTE_Start_Sum s
INNER JOIN CTE_End_Sum e ON s.RowId = e.RowId
WHERE (NumberStart <> NumberEnd) OR (NumberStart = 0 AND NumberEnd = 0)
OPTION (MAXRECURSION 10);
I'm informed it won't work properly:
CHARINDEX won't work properly with ntext, you have to cast to VARCHAR(max), otherwise it stops working at 8000 characters
Thanks ChrisM@Work π
October 1, 2012 at 6:29 am
laurie-789651 (10/1/2012)
laurie-789651 (10/1/2012)
This appears to work:I've elected to show rows where both values are zero in case these are wrong - You can remove this if you want.
I've set MAXRECURSION to 10, but you can increase this if necessary...
IF OBJECT_ID('dbo.tblHTML') IS NOT NULL
DROP TABLE dbo.tblHTML;
CREATE TABLE [dbo].[tblHTML](
[RowId] [int] NOT NULL,
[strPage] [ntext] NULL
) ON [PRIMARY]
insert into dbo.tblHTML values ( 1, '<table>www</table><table>xxx</table><table>yyy</table>zzz</table>' );
insert into dbo.tblHTML values ( 2, '<table>xxx</table><table>yyy</table><table>zzz</table>' );
insert into dbo.tblHTML values ( 3, '>xxx</table><table>yyy</table><table>zzz</table>' );
insert into dbo.tblHTML values ( 4, '>xxx</table>' );
insert into dbo.tblHTML values ( 5, '>xxx</table' );
insert into dbo.tblHTML values ( 6, '<table>xxx</table>' );
DECLARE @SearchHTMLStart Varchar(30), @SearchHTMLEnd Varchar(30);
SET @SearchHTMLStart = '<table>';
SET @SearchHTMLEnd = '</table>';
WITH CTE_Start AS
(
SELECT RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLStart ,[strPage] , 0 )
FROM dbo.tblHTML
UNION ALL
SELECT h.RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLStart ,h.[strPage] , c.Location +1 )
FROM dbo.tblHTML h
INNER JOIN CTE_Start c ON h.RowId = c.RowId
WHERE Occurrence > 0
),
CTE_End AS
(
SELECT RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLEnd ,[strPage] , 0 )
FROM dbo.tblHTML
UNION ALL
SELECT h.RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLEnd ,h.[strPage] , c.Location +1 )
FROM dbo.tblHTML h
INNER JOIN CTE_End c ON h.RowId = c.RowId
WHERE Occurrence > 0
),
CTE_Start_Sum AS
(
SELECT s.RowId, SUM(s.Occurrence) as NumberStart
FROM CTE_Start s
GROUP BY s.RowId
),
CTE_End_Sum AS
(
SELECT e.RowId, SUM(e.Occurrence) as NumberEnd
FROM CTE_End e
GROUP BY e.RowId
)
SELECT ISNULL(s.RowId, e.RowId) as RowId, NumberStart, NumberEnd
FROM CTE_Start_Sum s
INNER JOIN CTE_End_Sum e ON s.RowId = e.RowId
WHERE (NumberStart <> NumberEnd) OR (NumberStart = 0 AND NumberEnd = 0)
OPTION (MAXRECURSION 10);
I'm informed it won't work properly:
CHARINDEX won't work properly with ntext, you have to cast to VARCHAR(max), otherwise it stops working at 8000 characters
Thanks ChrisM@Work π
You're welcome - your query works just fine if you introduce the CAST where necessary. Why am I writing this when you're sitting 3 feet away? How was your weekend? π
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
October 1, 2012 at 7:55 am
Solution incorporating the above suggestions (thank you):
--====== TEST DATA =============
IF OBJECT_ID('dbo.tblHTML') IS NOT NULL
DROP TABLE dbo.tblHTML;
CREATE TABLE [dbo].[tblHTML](
[RowId] [int] NOT NULL,
[strPage] [ntext] NULL
) ON [PRIMARY]
insert into dbo.tblHTML values ( 1, '<table>www</table><table>xxx</table><table>yyy</table>zzz</table>' );
insert into dbo.tblHTML values ( 2, '<table>xxx</table><table>yyy</table><table>zzz</table>' );
insert into dbo.tblHTML values ( 3, '>xxx</table><table>yyy</table><table>zzz</table>' );
insert into dbo.tblHTML values ( 4, '>xxx</table>' );
insert into dbo.tblHTML values ( 5, '>xxx</table' );
insert into dbo.tblHTML values ( 6, '<table>xxx</table>' );
--========= SOLUTION Altered as recommended by ChrisM@Work ===============
DECLARE @SearchHTMLStart Varchar(30), @SearchHTMLEnd Varchar(30);
SET @SearchHTMLStart = '<table'; -- Leaving off trailing '>'
SET @SearchHTMLEnd = '</table>';
WITH CTE_Start AS
(
SELECT RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,q.[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLStart ,q.[strPage] , 0 )
FROM dbo.tblHTML
CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q
UNION ALL
SELECT h.RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLStart ,q.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLStart ,q.[strPage] , c.Location +1 )
FROM dbo.tblHTML h
CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q
INNER JOIN CTE_Start c ON h.RowId = c.RowId
WHERE Occurrence > 0
),
CTE_End AS
(
SELECT RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,q.[strPage] , 0 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLEnd ,q.[strPage] , 0 )
FROM dbo.tblHTML
CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q
UNION ALL
SELECT h.RowId,
Occurrence = CASE WHEN CHARINDEX ( @SearchHTMLEnd ,q.[strPage] , c.Location +1 ) > 0 THEN 1 ELSE 0 END,
Location = CHARINDEX ( @SearchHTMLEnd ,q.[strPage] , c.Location +1 )
FROM dbo.tblHTML h
CROSS APPLY (SELECT strPage = CAST(strPage AS VARCHAR(MAX))) q
INNER JOIN CTE_End c ON h.RowId = c.RowId
WHERE Occurrence > 0
),
CTE_Start_Sum AS
(
SELECT s.RowId, SUM(s.Occurrence) as NumberStart
FROM CTE_Start s
GROUP BY s.RowId
),
CTE_End_Sum AS
(
SELECT e.RowId, SUM(e.Occurrence) as NumberEnd
FROM CTE_End e
GROUP BY e.RowId
)
SELECT ISNULL(s.RowId, e.RowId) as RowId, NumberStart, NumberEnd
FROM CTE_Start_Sum s
INNER JOIN CTE_End_Sum e ON s.RowId = e.RowId
WHERE (NumberStart <> NumberEnd) OR (NumberStart = 0 AND NumberEnd = 0)
OPTION (MAXRECURSION 10);
October 1, 2012 at 12:33 pm
Wow! I had hoped to find a response but I am overwhelmed. Thank you all very much!
Lowell, thank you for the valuable tip. I do need to check for <table in case there are some classes or other values in the opening tag.
SSCommitted, You are quite right, this database has been around a while. Thank you for the suggestion I will read up on changing the data type.
And ChrisM@Work & laurie-789651, thank you both for your code! Both your examples are way over my head, but I'm going to pick them apart and learn how they work.
I've already run the query (with the <table modification) and found the 500 or so records out of 30,000 which need fixing. I'm very happy! :w00t:
Thank you all again! π
October 2, 2012 at 2:05 am
Mixolydian (10/1/2012)
Wow! I had hoped to find a response but I am overwhelmed. Thank you all very much!Lowell, thank you for the valuable tip. I do need to check for <table in case there are some classes or other values in the opening tag.
SSCommitted, You are quite right, this database has been around a while. Thank you for the suggestion I will read up on changing the data type.
And ChrisM@Work & laurie-789651, thank you both for your code! Both your examples are way over my head, but I'm going to pick them apart and learn how they work.
I've already run the query (with the <table modification) and found the 500 or so records out of 30,000 which need fixing. I'm very happy! :w00t:
Thank you all again! π
You're welcome, thanks for the feedback. If you have trouble understanding the solutions posted, just ask π
For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply