• sm_iransoftware (1/22/2015)


    Thanks, all.

    The feedback was very helpful and wide-ranging.

    I should have read through the answers more carefully

    (especially about Hadoop and HiveQL)

    -----------------------------------------------------------

    And now, more details about your questions and ideas:

    1- We offer thousands of products on our website, and a user may click on any of them repeatedly.

    Products are imported dynamically, and I don't have a fixed address for every click. (Also, our products may run out and be deleted from the website after a few days.)

    So: saving a code instead of the whole address is an interesting idea, but I think it is not possible.

    It seems to me that, if the application could insert the actual ProductID being clicked on, that would be more useful to the business for reporting purposes than the full URL. From a database administration perspective, this would probably reduce the storage size of this table by 90%. That's a game changer.
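
    For illustration, here is roughly what a narrowed click-log table could look like. The table and column names (ClickLog, ClickTime, UserID, ProductID) are just placeholders, not your actual schema:

    CREATE TABLE dbo.ClickLog
    (
        ClickTime  DATETIME2(0) NOT NULL,  -- when the click happened
        UserID     INT          NOT NULL,  -- surrogate key for the user
        ProductID  INT          NOT NULL   -- surrogate key for the product,
                                           -- instead of the full URL string
    );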

    Also, if the business is simply interested in how many distinct users click on each distinct product (but NOT in how many times a distinct user clicks on a distinct product), then you may want to set IGNORE_DUP_KEY = ON to prevent insertion of duplicate records. There is no point in logging 10 separate records for the same Time / Product / User event if that's just redundant data that the BI team has to filter out. First let the BI team define what a "duplicate" is, and then run some queries to see how much "duplication" there is in this table of 400+ million records.
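
    Assuming a narrow table like the placeholder dbo.ClickLog sketched above, enforcing one row per Time / Product / User could look something like this; duplicate inserts are silently discarded rather than raising an error:

    CREATE UNIQUE CLUSTERED INDEX UX_ClickLog_Time_Product_User
        ON dbo.ClickLog (ClickTime, ProductID, UserID)
        WITH (IGNORE_DUP_KEY = ON);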

    2- The database has several tables. The biggest of them (the main table) is about 140 GB and holds about 490 million records for 5 months.

    An average of 3 million inserts per day would put a heavy strain on an OLTP server. I definitely recommend you investigate the approach of having the web application log records to flat text files, and then have an ETL process that bulk loads the records into the table at regular intervals, such as hourly. That too is a game changer.

    3- Partitioning is a good idea. I can create a partition for every month. But do I need to run some operation to create the new partition each month? Can it be done automatically?

    If you take the above advice, reduce the table to a narrow set of keys, and perform batch inserting, then it's possible you could get by without partitioning. Sliding window partitioning is fairly complex to set up, and I would only do it as a last resort.
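
    If you do end up needing it, the building blocks are a monthly partition function and scheme, plus a new boundary added each month. The boundary dates and filegroup below are only illustrative:

    CREATE PARTITION FUNCTION pfClickByMonth (DATETIME2(0))
        AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');

    CREATE PARTITION SCHEME psClickByMonth
        AS PARTITION pfClickByMonth ALL TO ([PRIMARY]);

    -- each month, the "sliding window" adds the next boundary:
    ALTER PARTITION SCHEME psClickByMonth NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pfClickByMonth() SPLIT RANGE ('2015-04-01');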

    4- I have doubts about creating the appropriate indexes on this table in the live database, because we have a huge volume of inserts and insert speed may slow down. And the indexes will take up a lot of space for this table!!!

    If you cluster the table on insert time, that will help reduce index fragmentation. Also, insert time is probably what drives most of the reporting queries on this table. To further facilitate reporting, you may get by with only one additional non-clustered index. Really, if the reporting team is hitting this table hard with queries throughout the day, then you may want to look into leveraging Analysis Services or some other OLAP tool.
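
    Sticking with the placeholder dbo.ClickLog table, that could be as simple as clustering on ClickTime (as sketched earlier) plus one supporting non-clustered index for per-product reporting; the index name and included columns are just assumptions:

    CREATE NONCLUSTERED INDEX IX_ClickLog_Product
        ON dbo.ClickLog (ProductID)
        INCLUDE (UserID, ClickTime);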

    5- Compacting the tables is a nice suggestion. I should do it. But I have to think about when to compact our main table!!!

    Perhaps you thought I was talking about schema compaction (DBCC CLEANTABLE) or index rebuilding (DBCC DBREINDEX). But actually I was just talking about replacing things like user names and URLs with smaller data types (like ProductID instead of the URL, for example) so the column set and row size stay narrow.

    6- Insert into a flat file and then bulk insert is a good idea. I should investigate it. Is there a benefit for us?

    There would definitely be a benefit to logging records into a flat text file and then bulk loading at intervals. You don't want the web application inserting 3 million records daily (~35 records per second) into a table. That's bad for the web application (synchronous database calls), bad for the database (heavy disk I/O queuing), and bad for the reporting queries (blocking). Another advantage is that whenever you need to perform some maintenance on the table, like re-indexing, you can temporarily turn the bulk load process off. The web application would continue logging to the text file uninterrupted, and then the bulk load can catch up again when it's turned back on. You may also find benefit in suspending the bulk load process while the reporting team is running reports.

    If nothing else, do this.
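
    The load step itself can be as simple as a scheduled BULK INSERT. The file path and delimiters below are only examples of how the web application might write its log file:

    BULK INSERT dbo.ClickLog
    FROM 'D:\ClickLogs\clicks_current.txt'
    WITH
    (
        FIELDTERMINATOR = '|',
        ROWTERMINATOR   = '\n',
        TABLOCK   -- take a table-level lock so the batch loads quickly
    );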

    7- Can I create a database next to this live DB, copy each day's data to the new DB every night, and then truncate the tables on the live DB? Then on the new database I can create as many indexes as needed without fear of slowing things down.

    Just forget about logging daily records from the application directly to a table. As described above, have the application log to a text file, and then schedule a process to load the records to the table daily (or hourly).

    8- And last, NoSQL. I must learn more about it. And can the BI team use it?

    Hadoop would be an investment in learning and development, both for you and for the BI team. Currently you are accumulating about 300 GB a year. The threshold for Hadoop is about 10 TB, so that can be part of a five-year plan. Just focus on the above recommendations and read up on Microsoft's HDInsight service when you have time.

    Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius, and a lot of courage, to move in the opposite direction. - Albert Einstein

    Thanks again

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho