Distinguish Between Identical Rows

Question

Distinguish Between Identical Rows

Neil Burton

SSC-Insane

Points: 23137
More actions
March 5, 2021 at 2:19 pm

#3853842

We're in the middle of a project to load IIS log files into a database for the purpose of keeping an eye on application performance. I strongly suspect we're re-inventing the wheel here but please humour me, it's not my idea. We're using Log Parser to load the file contents into a landing table and then moving into the appropriate table. The difficulty I'm finding is we have a lot of rows that appear to be duplicates in that they have the same values for all columns but they are actually loaded from discrete entries in the log files. The only difference is the row number they have in the file. For example, the first row is RowNo1,a,b,c,d,e,f and the second row is RowNo2,a,b,c,d,e,f, the only difference between the rows is the RowNo, however they're not technically duplicates. It's further complicated by the fact that the RowNo is only unique within a file and to ensure uniqueness, we have to add the file name too. I need to ensure that a row has not already been loaded and so far the only way we've found to do this is to include the RowNo and file name (or ID) to the row as a uniqueifier. In my opinion though, the file data is not actually an attribute of the event in the IIS log and so it doesn't belong on a table of events.
My question is how do we distinguish these rows when we're loading tables? They are identical but they're not duplicates. We want a count of events per time period so we need all of them. Is including the file and row data on the table a valid way of doing things?

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher
How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Jonathan AC Roberts SSCoach Points: 18407 More actions · Answer 1

Not sure if I fully understand your problem but would a query to load from the landing table into the appropriate table like this do?

INSERT INTO MainTable
(
    Col1,
    Col2,
    ...
)
SELECT DISTINCT 
       Col1,
       Col2,
       ...
  FROM LandingTable l
 WHERE NOT EXISTS(SELECT *
                    FROM MainTable m
                   WHERE m.Col1 = l.Col1
                     AND m.Col2 = l.Col2
                     AND ...
                  )
;

Are you worried you might try to load the same file twice?

Neil Burton SSC-Insane Points: 23137 More actions · Answer 2

I think you've got what I mean and that kind of query would be my first choice. But here's the rub, there will be rows in the main table like below.

WHERE 
l.col1 = m.col1 
and l.col2 = m.col2
.
.
.
and l.col9 = m.col9

There is no difference between the values in each column but each row definitely represents a discrete entry in the log. The only difference between them is their LogRowID and the filename. As I said, I'm of the opinion that the file data does not represent an attribute of the event in which we're interested so it shouldn't go on the main table. I'm happy to be put straight on that though and it's currently the only way of ensuring a row is unique.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Jonathan AC Roberts SSCoach Points: 18407 More actions · Answer 3

Why are you worried about getting duplicates?

If you make sure the same file isn't loaded twice then would that solve your problem?

Neil Burton SSC-Insane Points: 23137 More actions · Answer 4

Jonathan AC Roberts wrote:

Why are you worried about getting duplicates?
If you make sure the same file isn't loaded twice then would that solve your problem?

It's not the "duplicates" within a file that I'm worried about. There's stuff in place to ensure that a given row in a given file isn't loaded more than once. The problem I have is to ensure that rows in different files that have the same values barring the file data get loaded without loading anything twice. There's nothing to stop a row like File2,RowNo1,a,b,c,d,e,f being loaded into the main table when there is already a row like File1,RowNo1,a,b,c,d,e,f in there. What I need to find is a way to avoid loading File1,RowNo3,k,b,c,g,e,f, again. I'm trying to find a way of comparing two rows that are identical in all ways apart from

I'm not explaining this particularly well.

The problem is that both within and across different files, there can be discrete rows with the same values in each column. Each of these rows represents a different event even though they appear to be the same. I need to find a way to allow these rows to be loaded individually but at the same time not loading unique rows more than once. A query using WHERE NOT EXISTS breaks down because there are rows where everything matches apart from the file data. In my opinion, the file data shouldn't be included on a table holding event data because it's not an attribute of the event. It is however, in some cases, the only way of differentiating between the rows to see what has and hasn't been loaded.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Jonathan AC Roberts SSCoach Points: 18407 More actions · Answer 5

So you are just trying to make sure you don't load the same file twice. Maybe if you create another table of all the files you have loaded, insert a new row into that and check it doesn't already exist in that table, then import the file.

ScottPletcher SSC Guru Points: 101223 More actions · Answer 6

I think you need to add the file "key" -- to identity the unique file being loaded -- to the data file to distinguish between rows. That doesn't violate data rules, it's a very common thing. For example, the order# in an order detail table. The order# is not an attribute of a line item on the order, but it is needed to identity the owner / parent of the row data.

SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

Neil Burton SSC-Insane Points: 23137 More actions · Answer 7

ScottPletcher wrote:

I think you need to add the file "key" -- to identity the unique file being loaded -- to the data file to distinguish between rows. That doesn't violate data rules, it's a very common thing. For example, the order# in an order detail table. The order# is not an attribute of a line item on the order, but it is needed to identity the owner / parent of the row data.

We're already doing that because it's the only way to ensure uniqueness. It just didn't feel right and I may have been paying too much attention to a certain user's rants. We'll crack on as we are.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Neil Burton SSC-Insane Points: 23137 More actions · Answer 8

Jonathan AC Roberts wrote:

So you are just trying to make sure you don't load the same file twice. Maybe if you create another table of all the files you have loaded, insert a new row into that and check it doesn't already exist in that table, then import the file.

I'm really making a mess of explaining this.

We log the file name and the last row loaded from each file in a table already. Each file can be read from more than once so we use the last row loaded to know where to each time. When we load from the file, we take the data for each event and load it to the landing table with the file ID and the LogRowID. The event data can be identical between events i.e. the event time (only held to the second), the queryID, StemID, Method, TimeTaken and UserName can be the same or NULL from one row to the next both within a file and in separate files.

What I was trying to do was differentiate between unique and non-unique rows across files to avoid loading any row more than once. It didn't feel right to add the file data to do this but it appears that it's a valid approach.

Thanks for your help. I'll make sure I think through my explanations more thoroughly next time.

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
—Charles Babbage, Passages from the Life of a Philosopher

How to post a question to get the most help http://www.sqlservercentral.com/articles/Best+Practices/61537

Jeffrey Williams SSC Guru Points: 90351 More actions · Answer 9

To process IIS logs - the first thing you must do is change how those log files are configured. In IIS - you need to at least add the server name to the log files to distinguish the data.

You should also change the schedule of log file rollover, if you have not already configured that...and set the file to use local time for file naming and rollover.

I use a Powershell script to execute Log Parser - here is what that looks like:

# Setup the LogParser Query
$query  = "SELECT" `
            + "  s-computername As server_name" `
            + ", '$fileName' As file_name" `
            + ", LogRow" `
            + ", to_localtime(to_timestamp(date,time)) As cs_date_time" `
            + ", to_string(to_localtime(to_timestamp(date,time)), 'hh:mm:ss') As cs_time" `
            + ", s-ip As s_ip" `
            + ", cs-method As cs_method" `
            + ", cs-uri-stem As cs_uri_stem" `
            + ", cs-uri-query As cs_uri_query" `
            + ", s-port As s_port" `
            + ", cs-username As cs_username" `
            + ", c-ip As c_ip" `
            + ", cs(User-Agent) As cs_user_agent" `
            + ", cs-host As cs_host" `
            + ", sc-status As sc_status" `
            + ", sc-substatus As sc_substatus" `
            + ", sc-win32-status As sc_win32_status" `
            + ", sc-bytes As sc_bytes" `
            + ", cs-bytes As cs_bytes" `
            + ", time-taken As time_taken";
    $query += " INTO dbo.IISLogs";
    $query += " FROM '$($_.FullName)'";
    
    # Setup LogParser Parameters
    $parameters = @();
    $parameters += $query;
    $parameters += "-i:IISW3C";
    $parameters += "-o:SQL";
    $parameters += "-server:SERVERNAMEHERE";
    $parameters += "-database:DATABASENAMEHERE";
    $parameters += "-transactionRowCount:2000000";
    $parameters += "-ignoreIdCols:on";
$parameters += "-createtable:on";
    
    # Import the file using LogParser
    & $logParser $parameters | Out-Null;

The table is setup this way:

USE [DATABASENAMEHERE]
GO

/****** Object:  Table [dbo].[IISLogs]    Script Date: 3/5/2021 12:26:16 PM ******/
SET ANSI_NULLS ON
GO

SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[IISLogs](
[row_id] [numeric](18, 0) IDENTITY(1,1) NOT NULL,
[server_name] [varchar](255) NOT NULL,
[file_name] [varchar](255) NOT NULL,
[LogRow] [int] NOT NULL,
[cs_date_time] [datetime] NULL,
[cs_time] [varchar](255) NULL,
[s_ip] [varchar](255) NULL,
[cs_method] [varchar](255) NULL,
[cs_uri_stem] [varchar](255) NULL,
[cs_uri_query] [varchar](255) NULL,
[s_port] [int] NULL,
[cs_username] [varchar](255) NULL,
[c_ip] [varchar](255) NULL,
[cs_user_agent] [varchar](255) NULL,
[cs_host] [varchar](255) NULL,
[sc_status] [int] NULL,
[sc_substatus] [int] NULL,
[sc_win32_status] [int] NULL,
[sc_bytes] [int] NULL,
[cs_bytes] [int] NULL,
[time_taken] [int] NULL,
 CONSTRAINT [PK_IISLogs] PRIMARY KEY CLUSTERED 
(
[row_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE) ON [PRIMARY]
) ON [PRIMARY]

GO

SET ANSI_PADDING ON

GO

/****** Object:  Index [IX_IISLogs_DateTime]    Script Date: 3/5/2021 12:26:16 PM ******/
CREATE NONCLUSTERED INDEX [IX_IISLogs_DateTime] ON [dbo].[IISLogs]
(
[cs_date_time] ASC
)
INCLUDE ( [server_name],
[time_taken]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE) ON [PRIMARY]
GO

SET ANSI_PADDING ON

GO

/****** Object:  Index [UX_IISLogs_Server_File_Row]    Script Date: 3/5/2021 12:26:16 PM ******/
CREATE UNIQUE NONCLUSTERED INDEX [UX_IISLogs_Server_File_Row] ON [dbo].[IISLogs]
(
[server_name] ASC,
[file_name] ASC,
[LogRow] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE) ON [PRIMARY]
GO

With this setup - there is no need for a staging table or post processing. If we attempt to load a file that has already been loaded - it will fail due to the constraint on server_name, file_name and LogRow. If we do want to reprocess that table we can delete the data from the table and reload the file.

To keep things manageable - we set our IIS log files to rollover every hour. Once a day, after midnight - we run a process that sends all of the previous days log files to a shared location on the SQL Server system. At 5am every morning we run a process that looks to this shared location, loops through all of the files that have been delivered and loads them to the database - the script above is part of that script that is run.

We archive the files and hold them for 14 days...just in case we need to reprocess for some reason.

I would be careful with adding indexes...this can become a tremendous amount of data in a very short amount of time. You should make sure you decide how much data to keep and build in an archive/purge process...and make sure you set the table to use PAGE compression.

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs