What is the best practice for maintaining and using a big log database?

  • Hi

we have a big log database.

    (We store all user clicks (link addresses) and all user searches on our web site in this database for BI analysis.)

    It is a big database that grows by about 2 GB per day!

There are some problems with this DB:

    1- Huge size of the live database.

    2- We can't get reports from it.

- Because we didn't create appropriate indexes on it, selects against it are very slow.

    (Indexing a huge database that is 2% selects and 98% inserts seems wrong to me.)

    3- Every 2 months we have to back it up, move the backup to another server, truncate the tables, and start again...

    -------------------------------------------------------

Is there any best practice for maintaining and using these user-action log databases?

    thank you

  • Yes. Use the SQL Server Enterprise Edition and do Table Partitioning.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Or if you don't have Ent then you can use SPs to query the tables and split them up into date ranges.

    When you call the SP you can make a decision based off of the date and route it to the right table.

    You can also throw some views in there if the data's not dated.

    All the same, while Ent is nice, it's not always feasible. Getting just a little creative you can do a lot.
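As a rough sketch of that routing idea (the procedure and monthly table names here are hypothetical, assuming the click data is split by month):

    -- Hypothetical wrapper proc: routes the query to the physical monthly table
    -- based on the requested date range.
    CREATE PROCEDURE dbo.GetClicksByDate
        @FromDate date,
        @ToDate   date
    AS
    BEGIN
        IF @FromDate >= '20150101' AND @ToDate < '20150201'
            SELECT RegDate, SessionId, LandingUrl
            FROM dbo.Clicks_2015_01
            WHERE RegDate BETWEEN @FromDate AND @ToDate;
        ELSE IF @FromDate >= '20150201' AND @ToDate < '20150301'
            SELECT RegDate, SessionId, LandingUrl
            FROM dbo.Clicks_2015_02
            WHERE RegDate BETWEEN @FromDate AND @ToDate;
        -- ...one branch per monthly table, or build the table name dynamically
        -- and run it with sp_executesql.
    END;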

    Watch my free SQL Server Tutorials at:
    http://MidnightDBA.com
    Blog Author of:
    DBA Rant – http://www.MidnightDBA.com/DBARant

    Minion Maintenance is FREE:

  • I can see now this thread is going to get a lot of comment. This is one niche application where I'd suggest Hadoop and HiveSQL.

    As far as doing this in SQL Server, where to begin...

First question: Why does the organization need this thing? I know the web developers must find it very useful when debugging some issue with the website, but why is it important from the business perspective to track every user's click and then have the BI team report on it?

If you must keep this data in SQL Server, then make sure the table schema is as compact and indexable as possible. For example, hopefully you are not storing the full website URL, user name, and date/time stamp as columns; you should instead be using a star-schema-type model with integer-based foreign keys.

    For example:

create table WebSite_Click
    (
        TimeKey int not null,
        UserKey int not null,
        URLKey  int not null,
        constraint PK_WebSite_Click
            primary key clustered ( TimeKey, UserKey, URLKey )
            with ( ignore_dup_key = on, data_compression = page )
    );

It's important that TimeKey be the leading column in the clustered key to prevent page splits and fragmentation. It's also a good assumption that practically all queries against this table will be at least partially within the context of time, even if aggregating over an entire week or month.

You don't want the web application inserting directly into the table for every click. Instead have the website write click records to something like a tab-delimited flat file, and then implement an SSIS package or job to bulk insert into the table in batches periodically (every 15 minutes, hourly, or whatever). Negotiate the required interval with the BI team; they'll naturally want it updated in real time, but the less frequent the batch inserts, the better for performance and maintainability. You may even find that a longer batch interval results in better query performance.
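For illustration, a minimal sketch of the periodic load step that the SSIS package or Agent job might wrap, assuming a hypothetical staging file path and the WebSite_Click table above:

    -- Hypothetical batch load: the website appends tab-delimited rows to the file,
    -- this step loads them, and the file is then rolled over or cleared.
    BULK INSERT dbo.WebSite_Click
    FROM 'D:\ClickLogs\clicks_current.tsv'
    WITH
    (
        FIELDTERMINATOR = '\t',
        ROWTERMINATOR   = '\n',
        TABLOCK,              -- allows minimal logging when the usual conditions are met
        BATCHSIZE = 100000    -- commit in chunks to keep the transaction log manageable
    );

Scheduled every 15 to 60 minutes, something like this keeps the web application completely decoupled from the table.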

    At this point even a medium powered SQL Server instance should be able to handle multi-million row bulk inserts and billion row aggregate queries within a reasonable amount of time (15 minutes or less).

    Also, investigate using table partitioning with a sliding window and Analysis Services for reporting. Ralph Kimball has written about data modeling and ETL for click-stream data warehousing, and I think he even wrote a book entirely on the subject.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • What is the real size of the "big" database?

    Paresh Motiwala Manager of Data Team, Big Data Enthusiast, ex DBA

  • KenpoDBA (1/22/2015)


Or if you don't have Ent then you can use SPs to query the tables and split them up into date ranges.

    When you call the SP you can make a decision based off of the date and route it to the right table.

    You can also throw some views in there if the data's not dated.

    All the same, while Ent is nice, it's not always feasible. Getting just a little creative you can do a lot.

    Agreed. You can do partitioning in the Standard Edition without much fuss with Partitioned Views, which require the splitting of a large table in a manner similar to what you've described but without the need for custom code in your SP's.
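For illustration only, a minimal partitioned-view sketch, assuming hypothetical monthly click tables with CHECK constraints on the date column (the constraints are what let the optimizer skip months that a query's date filter rules out):

    -- Hypothetical monthly member tables (columns trimmed for brevity).
    CREATE TABLE dbo.Click_2015_01
    (
        RegDate    date   NOT NULL
            CONSTRAINT CK_Click_2015_01_RegDate CHECK (RegDate >= '20150101' AND RegDate < '20150201'),
        SessionId  bigint NOT NULL,
        LandingUrl nvarchar(2000) NOT NULL
    );
    GO
    CREATE TABLE dbo.Click_2015_02
    (
        RegDate    date   NOT NULL
            CONSTRAINT CK_Click_2015_02_RegDate CHECK (RegDate >= '20150201' AND RegDate < '20150301'),
        SessionId  bigint NOT NULL,
        LandingUrl nvarchar(2000) NOT NULL
    );
    GO
    -- The view presents all the months as one logical table for reporting;
    -- the load process writes to the current month's table directly.
    CREATE VIEW dbo.Click
    AS
    SELECT RegDate, SessionId, LandingUrl FROM dbo.Click_2015_01
    UNION ALL
    SELECT RegDate, SessionId, LandingUrl FROM dbo.Click_2015_02;
    GO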

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Thanks, all

The feedback was very helpful and wide-ranging.

    I've read back through the answers more carefully

    (especially the ones about Hadoop and HiveSQL).

    -----------------------------------------------------------

Now, more details in response to your questions and ideas:

1- We offer thousands of products on our website, and a user may click on any of them repeatedly.

    Products are imported dynamically and don't have a fixed address for every click. (Also, our products may sell out and be deleted from the website after a few days.)

    So saving a code for the address instead of the full address is an interesting idea, but I don't think it's possible.

    2- The database has several tables. The biggest of them (the main table) is about 140 GB and holds about 490 million records for 5 months.

    3- Partitioning is a good idea. I can create a partition for every month. But do I have to run some operation to create the new partition each month, or can it be done automatically?

    4- I have doubts about creating appropriate indexes on this table in the live database, because we have a huge insert volume and insert speed may suffer. The indexes would also be very large for this table!

    5- Compacting the tables is a nice suggestion. I should do it, but I need to think about how long compacting our main table would take!

    6- Inserting into a flat file and then bulk inserting is a good idea. I should investigate it. Would it benefit us?

    7- Could I create a database next to this live DB, copy each day's data to the new DB every night, then truncate the tables in the live DB, and create as many indexes as needed on the new database without fear of slowing things down?

    8- And last, NoSQL. I must learn more about these tools. Can the BI team use them?

    thanks again

Looking at your questions, the answer to all of them is "It Depends" because you haven't told us what that "Main Table" is being used for. Is that the products table or the "big log" table that holds all of the clicks?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

This database has 5 tables, all for saving log data: addresses, searches, sessions, referrer sites, errors, and so on.

    The biggest of them is for saving the click log. That's all.

  • Can you post the CREATE TABLE code for the largest table (the "clicks" table, apparently) along with any indexes and constraints, please?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

CREATE TABLE [dbo].[...](
        [Id] [bigint] IDENTITY(1,1) NOT NULL,
        [SessionId] [bigint] NOT NULL,
        [RegDate] [date] NOT NULL,
        [RegTime] [time](0) NOT NULL,
        [ReferrerUrl] [nvarchar](2000) NOT NULL,
        [LandingUrl] [nvarchar](2000) NOT NULL,
        CONSTRAINT [PK_TBL_Request_Id] PRIMARY KEY CLUSTERED
        (
            [Id] DESC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

  • sm_iransoftware (1/23/2015)


CREATE TABLE [dbo].[...](
        [Id] [bigint] IDENTITY(1,1) NOT NULL,
        [SessionId] [bigint] NOT NULL,
        [RegDate] [date] NOT NULL,
        [RegTime] [time](0) NOT NULL,
        [ReferrerUrl] [nvarchar](2000) NOT NULL,
        [LandingUrl] [nvarchar](2000) NOT NULL,
        CONSTRAINT [PK_TBL_Request_Id] PRIMARY KEY CLUSTERED
        (
            [Id] DESC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    I'd partition that table by RegDate by month. In the process of partitioning it, I'd move the clustered index to a composite index of RegDate and ID (In that order... the addition of ID is to make it a UNIQUE clustered index). Since it IS an audit table, I see no reason to bother with a PK because we always want to capture everything thrown into this table. Keep in mind that the clustered index does NOT have to be the PK nor does the table need to have a PK to be a proper table of this nature. And, no... I'm not saying we should get rid of the ID column. We need that for uniqueness of each row.

And yes, the table can be partitioned whether you have the Standard Edition or the Enterprise Edition. If you have the Standard Edition, we can do it using a Partitioned View. If you have the Enterprise Edition, we can do it using a Partitioned Table or a Partitioned View. Both have some serious advantages over leaving a table this large as a monolithic structure, not the least of which is greatly reduced backup times because, once the individual partitions have each been backed up once, you'll only ever need to back up the current month after that. Index maintenance will enjoy similar serious benefits, as will your poor log file during index maintenance. Depending on the criteria of your queries, you might see a bit of increased performance, but that's not what we'd partition for.

Shifting gears, if you had left RegDate and RegTime as a single RegDT column, we might have been able to use the RegDT and SessionId columns as the UNIQUE clustered index and just drop the ID column to save on disk, memory, and backup space. We COULD do that transformation during the partitioning process without adding any additional time. Your queries wouldn't suffer any degradation in performance, and some might even improve, depending, of course, on the queries, but I think that's very likely.

Bottom line is that you'd only ever need to back up the current month (once all partitions have been backed up once), index maintenance would only need to be done on the current month, and making this into either a Partitioned Table or Partitioned View wouldn't require any changes to any other code. It would also allow for Online Piecemeal Restores should anything ever happen to a partition of this table. That would also allow you to "Get Back In Business" much more quickly should a DR event occur where the whole database needs to be restored.
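For illustration, a minimal sketch of that approach on the Enterprise Edition, with example monthly boundaries, everything kept on [PRIMARY] for brevity, and the table name TBL_Request only inferred from the PK constraint name (so treat it as hypothetical):

    -- Monthly partition function and scheme on RegDate (boundary dates are examples).
    CREATE PARTITION FUNCTION pf_ClickMonthly (date)
    AS RANGE RIGHT FOR VALUES ('20150101', '20150201', '20150301', '20150401', '20150501');
    GO
    CREATE PARTITION SCHEME ps_ClickMonthly
    AS PARTITION pf_ClickMonthly ALL TO ([PRIMARY]);
    GO
    -- After dropping the existing clustered PK on Id, recreate the clustered index
    -- as UNIQUE on (RegDate, Id), aligned to the partition scheme.
    CREATE UNIQUE CLUSTERED INDEX CIX_TBL_Request_RegDate_Id
        ON dbo.TBL_Request (RegDate, Id)
        ON ps_ClickMonthly (RegDate);
    GO

Each month you would add a new boundary with ALTER PARTITION FUNCTION ... SPLIT RANGE, and old months could be switched out or backed up individually.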

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sm_iransoftware (1/22/2015)


Thanks, all

    The feedback was very helpful and wide-ranging.

    I've read back through the answers more carefully

    (especially the ones about Hadoop and HiveSQL).

    -----------------------------------------------------------

    Now, more details in response to your questions and ideas:

    1- We offer thousands of products on our website, and a user may click on any of them repeatedly.

    Products are imported dynamically and don't have a fixed address for every click. (Also, our products may sell out and be deleted from the website after a few days.)

    So saving a code for the address instead of the full address is an interesting idea, but I don't think it's possible.

    It seems to me that, if the application could insert the actual ProductID being clicked on, that would be more useful to the business for reporting purposes than the full URL. From a database administration perspective, this would probably reduce the storage size of this table by 90%. That's a game changer.

Also, if the business is simply interested in how many distinct users click on each distinct product (but NOT in how many times a distinct user clicks on a distinct product), then you may want to set IGNORE_DUP_KEY = ON to prevent insertion of duplicate records. There is no point in logging 10 separate records for the same Time / Product / User event if that's just redundant data that the BI team has to filter out. First let the BI team define what a "duplicate" is, and then run some queries and see how much "duplication" there is in this table of 400+ million records.
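As a rough sketch of that kind of duplication check, assuming for the moment that a "duplicate" means the same session, landing URL, and second, and using the table name assumed earlier (both are just starting points for the BI team to refine):

    -- Counts the rows that are repeats of an earlier row for the same
    -- session / URL / second; change the GROUP BY once "duplicate" is defined.
    SELECT SUM(cnt - 1) AS RedundantRows
    FROM
    (
        SELECT COUNT_BIG(*) AS cnt
        FROM dbo.TBL_Request
        GROUP BY SessionId, LandingUrl, RegDate, RegTime
        HAVING COUNT_BIG(*) > 1
    ) AS dupes;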

2- The database has several tables. The biggest of them (the main table) is about 140 GB and holds about 490 million records for 5 months.

An average of 3 million inserts per day would put a heavy strain on an OLTP server. I definitely recommend you investigate the approach of having the web application log records to flat text files, and then have an ETL process that bulk loads the records into the table at regular intervals, like hourly. That too is a game changer.

3- Partitioning is a good idea. I can create a partition for every month. But do I have to run some operation to create the new partition each month, or can it be done automatically?

If you take the above advice, reduce the table to a narrow set of keys, and perform batch inserting, then it's possible you could get by without partitioning. Sliding window partitioning is fairly complex to set up, and I would only do it as a last resort.

4- I have doubts about creating appropriate indexes on this table in the live database, because we have a huge insert volume and insert speed may suffer. The indexes would also be very large for this table!

If you cluster the table on insert time, then that will help reduce index fragmentation. Also, insert time is probably what drives most of the reporting queries on this table. To further facilitate reporting, you may get by with only one additional non-clustered index. Really, if the reporting team is hitting this table hard with queries throughout the day, then you may want to look into leveraging Analysis Services or some other OLAP tool.
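Purely as an illustration of what that one extra index might look like (the column choice is a guess that should follow the actual report queries, and the table name is assumed as before):

    -- Hypothetical supporting index; kept narrow so it doesn't slow the inserts much.
    CREATE NONCLUSTERED INDEX IX_TBL_Request_SessionId
        ON dbo.TBL_Request (SessionId, RegDate);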

5- Compacting the tables is a nice suggestion. I should do it, but I need to think about how long compacting our main table would take!

Perhaps you thought I was talking about schema compaction (DBCC CLEANTABLE) or index reorganization (DBCC DBREINDEX). But actually I was just talking about replacing things like user names and URLs with smaller data types (like a ProductID instead of a URL, for example) so the column set and row size stay narrow.

6- Inserting into a flat file and then bulk inserting is a good idea. I should investigate it. Would it benefit us?

There would definitely be a benefit to logging records into a flat text file and then bulk loading at intervals. You don't want the web application inserting 3 million records daily (~35 records per second) into a table. That's bad for the web application (synchronous database calls), bad for the database (heavy disk I/O queuing), and bad for the reporting queries (blocking). Another advantage is that whenever you need to perform some maintenance on the table, like re-indexing, you can temporarily turn the bulk load process off. The web application would continue logging to the text file uninterrupted, and then the bulk load can catch up again when it's turned back on. You may also find benefit in suspending the bulk load process while the reporting team is running reports.

    If nothing else, do this.

7- Could I create a database next to this live DB, copy each day's data to the new DB every night, then truncate the tables in the live DB, and create as many indexes as needed on the new database without fear of slowing things down?

    Just forget about logging daily records from the application directly to a table. As described above, have the application log to a text file, and then schedule a process to load the records to the table daily (or hourly).

8- And last, NoSQL. I must learn more about these tools. Can the BI team use them?

Hadoop would be an investment in learning and development, both for you and the BI team. Currently you are accumulating about 300 GB a year. The threshold where Hadoop starts to make sense is about 10 TB, so that can be part of a five-year plan. Just focus on the above recommendations and read up on Microsoft's HDInsight service when you have time.

    Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction. - Albert Einstein

    thanks again

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Thanks a lot.

    There are very useful ideas and tips in all of your comments.

    Thanks again.

  • Hi Again

I want to do something with this database.

    First of all, I should start with a small database.

    But the current DB has many connections from the site, and I should replace it with a new, empty database (with the same structure).

    This is my plan:

1- Create a new database with the same structure.

    2- Rename the current database to OldDataBase,

    with this code:

USE master
    GO
    EXEC sp_dboption CurDataBase, 'Single User', True
    EXEC sp_renamedb 'CurDataBase', 'OldDataBase'
    GO

3- Rename the new database to the current DB name.

USE master
    GO
    EXEC sp_renamedb 'NewDataBase', 'CurDataBase'

Is this approach correct, and is the T-SQL code OK? (Please don't forget the many connections to CurDataBase, which is the log DB. Losing a few seconds of data is not a problem.)

    Thank you
