Need ideas on how to handle the inserting of millions of rows per day

  • I'm looking for ideas and am curious how your company handles inserting large numbers of rows.

    Our scenario is this: we have a high-traffic website (millions of visits per day) and need to track everything that occurs. This means each click, page visit, etc. We do this today, but it's extremely slow, and we are investigating new ways to handle it. We are using SQL 2012 Enterprise with tons of memory and procs on multiple servers.

    We're not looking to store it in a relational structure up front, of course, since inserting needs to be fast. We'll deal with parsing everything out via ETL.

    One developer suggested using CouchDB; however, in researching it, I don't really see the advantage, since there isn't an easy way to get the data back out into SQL.

    We're already using 3rd-party companies for traffic analysis, but we need to store all of this data onsite for our DW.

    Any suggestions would be appreciated. If anyone is doing this currently, I'd be curious how you handle it.

    Thanks

  • ItsDaveTime (12/13/2013)


    I'm looking for ideas and am curious how your company handles inserting large numbers of rows.

    Our scenario is this: we have a high-traffic website (millions of visits per day) and need to track everything that occurs. This means each click, page visit, etc. We do this today, but it's extremely slow, and we are investigating new ways to handle it. We are using SQL 2012 Enterprise with tons of memory and procs on multiple servers.

    We're not looking to store it in a relational structure up front, of course, since inserting needs to be fast. We'll deal with parsing everything out via ETL.

    One developer suggested using CouchDB; however, in researching it, I don't really see the advantage, since there isn't an easy way to get the data back out into SQL.

    We're already using 3rd-party companies for traffic analysis, but we need to store all of this data onsite for our DW.

    Any suggestions would be appreciated. If anyone is doing this currently, I'd be curious how you handle it.

    Thanks

    You cannot fix what you don't know is broken. You must find the top bottleneck, address it, and lather-rinse-repeat. Given the memory and procs you already have, I am guessing it is one of three things:

    1) Network bandwidth. It is surprising how fast a busy website can saturate a few Gb NICs with clickstream data.

    2) Write activity back to disk, especially (but not exclusively) tlog writes.

    3) Once you solve those two, you are into serious high-volume insert mode, and you run SMACK into the brick wall known as page latch contention. If you have the ubiquitous identity column on your table(s), you are dead meat. You need to actually force the data to spread out as it is inserted. The most common mechanism for this is some form of hashing that distributes the load across multiple hot spots; Thomas Kejser has some really nice blog posts on this, and a sketch of the idea follows below.
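
    To make that last point concrete, here is a minimal sketch of the hash-distribution pattern. All object names are hypothetical and the bucket count of 8 is just an example; the idea is simply that a small persisted bucket number leads the clustered key, so concurrent inserts land on several pages instead of a single hot last page.

        -- Hypothetical table; the computed bucket spreads inserts across eight insertion points.
        CREATE TABLE dbo.ClickEvent
        (
            EventId    BIGINT IDENTITY(1,1) NOT NULL,
            SessionId  UNIQUEIDENTIFIER     NOT NULL,
            EventTime  DATETIME2(3)         NOT NULL CONSTRAINT DF_ClickEvent_Time DEFAULT (SYSUTCDATETIME()),
            EventData  VARCHAR(2000)        NULL,
            -- Values 0-7: eight "last pages" instead of one.
            HashBucket AS (CONVERT(TINYINT, ABS(CHECKSUM(SessionId) % 8))) PERSISTED
        );

        CREATE UNIQUE CLUSTERED INDEX CIX_ClickEvent
            ON dbo.ClickEvent (HashBucket, EventId);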

    PLEASE understand that serious performance from any engine requires a significant amount of knowledge and experience. I hate it when people say SQL Server can't perform well when they haven't done a good job of coaxing maximum performance from it! It does seem like you aren't that type, thankfully. So please do consider getting expert help to extract optimum value from your VERY expensive SQL Server Enterprise licenses!! 😎

    One additional item. Depending on how you intend to extract the data back out of the server in your ETL process, you could be a good candidate for either a) SQL Server 2014 and its Hekaton feature or b) StreamInsight. The latter would potentially take significant work to implement unless you have VERY simple needs. I think they are still taking TAP customers for Hekaton.
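
    For what it is worth, a memory-optimized version of such a table in SQL Server 2014 would look roughly like the sketch below. The names are hypothetical, and it assumes the database already has a MEMORY_OPTIMIZED_DATA filegroup; the appeal for this workload is that Hekaton tables take no latches at all, which sidesteps the page latch contention described in point 3 above.

        -- Hypothetical Hekaton table; EventId is assumed to be supplied by the
        -- application or a SEQUENCE rather than by an IDENTITY column.
        CREATE TABLE dbo.ClickEvent_InMem
        (
            EventId   BIGINT           NOT NULL
                      PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 16777216),
            SessionId UNIQUEIDENTIFIER NOT NULL,
            EventTime DATETIME2(3)     NOT NULL,
            EventData VARCHAR(2000)    NULL
        )
        WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);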

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Tens of millions of rows a day is low activity. If you were talking millions of rows a second, then I'd understand looking for specialised, non-relational solutions.

    If you're having performance problems, the answer is not to flail around and change something. That's like killing mosquitoes with a shotgun: expensive and inefficient. Or, as I've heard it put in the past: "Something must be done. This is something. Therefore this must be done." 'This' usually doesn't work, so the cycle repeats until, by chance, something does.

    If you are having performance problems, identify the cause. Work down (or up) the process and look for where the delays are. Once you have found where the delays are, then you are in a position to recommend a solution.
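
    As one hedged starting point for "look for where the delays are", the aggregate wait statistics the instance has collected since the last restart (or since the stats were cleared) usually point at the dominant bottleneck: WRITELOG suggests transaction log throughput, PAGELATCH_* suggests last-page insert contention, and so on. The short exclusion list of benign wait types below is illustrative, not complete.

        -- Top waits since the last restart or clear of the statistics.
        SELECT TOP (10)
               wait_type,
               waiting_tasks_count,
               wait_time_ms,
               signal_wait_time_ms
        FROM   sys.dm_os_wait_stats
        WHERE  wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'WAITFOR',
                                 N'SQLTRACE_BUFFER_FLUSH', N'XE_TIMER_EVENT',
                                 N'BROKER_TASK_STOP')
        ORDER BY wait_time_ms DESC;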

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks for your responses. I've been doing this for many years (as I'm sure you have as well) and have implemented and developed high-availability, high-transaction OLTP and business intelligence SQL systems for very large companies. This project, however, is a different animal. The front-end C# developer on this project is one of those "I know better" types; rather than investigate the bottleneck or try to work with it, he just wants to throw different technologies at it. I'm sure you all know the type. Unfortunately, he has the ear of the CIO, so it's an uphill battle. I'm confident that SQL Server can handle this, as I've done it in the past. The problem is the architecture on the front end and, unfortunately, the mindset of the front-end developer.

  • ItsDaveTime (12/14/2013)


    Thanks for your responses. I've been doing this for many years (as I'm sure you have as well) and have implemented and developed high-availability, high-transaction OLTP and business intelligence SQL systems for very large companies. This project, however, is a different animal. The front-end C# developer on this project is one of those "I know better" types; rather than investigate the bottleneck or try to work with it, he just wants to throw different technologies at it. I'm sure you all know the type. Unfortunately, he has the ear of the CIO, so it's an uphill battle. I'm confident that SQL Server can handle this, as I've done it in the past. The problem is the architecture on the front end and, unfortunately, the mindset of the front-end developer.

    I LOVE meeting this type at clients! I can invariably show that they have no leg to stand on when they try to justify that SQL Server cannot do what is asked of it. Often this type is just looking to pad their resume with some "shiny" new technology. I personally despise that kind of behavior when it comes at the expense of the company, even though I have made a lot of money cleaning up the messes that usually follow.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Writing data into an OLTP system and reading data from an OLAP system are completely different beasts, especially in high-transaction scenarios. To keep the speed of the inserts up, remove the indexes that are not needed for the transactional processes, make sure that you have a suitable clustered index so that most of the table inserts go to the end of the index, and make sure that page fill space is set appropriately so that updates do not force unnecessary page splits, as this fragments the indexes. Remember that if the engine has to maintain 15 indexes, and all the maintenance that can go along with that, it will significantly slow down the insert process.
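
    As a minimal sketch of that advice (table, column, and constraint names are hypothetical): a narrow table whose clustered key is ever-increasing, so inserts append to the end of the index, with no nonclustered indexes defined on the OLTP side to slow the writes down.

        -- Hypothetical insert-optimized table for the transactional side.
        CREATE TABLE dbo.PageVisit
        (
            VisitId   BIGINT IDENTITY(1,1) NOT NULL,
            VisitTime DATETIME2(3)         NOT NULL,
            Url       VARCHAR(400)         NOT NULL,
            UserRef   VARCHAR(50)          NULL,
            CONSTRAINT PK_PageVisit PRIMARY KEY CLUSTERED (VisitId)
                -- 100 suits append-only data; leave free space instead if rows are updated in place.
                WITH (FILLFACTOR = 100)
        );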

    If you need to run analysis on this data, then I would suggest replication: build your reporting indexes on the subscriber, and possibly also create cubes to speed up the reporting. If that is not an option, could you rebuild the reporting indexes during low-demand hours and leave them disabled during high-demand times?
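
    A hedged illustration of that last option, with hypothetical index and table names: the reporting index is disabled ahead of the busy window and rebuilt during quiet hours.

        -- Before the high-volume window: the index stops being maintained (and used).
        ALTER INDEX IX_PageVisit_Reporting ON dbo.PageVisit DISABLE;

        -- During low-demand hours: bring it back for reporting.
        ALTER INDEX IX_PageVisit_Reporting ON dbo.PageVisit REBUILD
            WITH (SORT_IN_TEMPDB = ON);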

    If you are tracking mouse clicks and movement, it would be sensible to assume you are using some sort of JavaScript monitoring code running on the website. Could this be made to write to a log file which is then consumed into the database via SSIS?
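
    An SSIS package itself can't be shown inline, but the same staged-load idea can be sketched in plain T-SQL; the file path, terminators, and staging table below are all hypothetical. The point is that the website appends to flat log files and the database ingests them in large batches, instead of taking one small insert per click.

        -- Hypothetical batch load of a flat clickstream log into a staging table.
        BULK INSERT dbo.ClickEvent_Staging
        FROM 'D:\Logs\clickstream_20131214.log'
        WITH (FIELDTERMINATOR = '\t',
              ROWTERMINATOR   = '\n',
              BATCHSIZE       = 100000,
              TABLOCK);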

  • Also, tell your C# developer that the underlying DB is none of his damn business; he should be using sprocs or LINQ to get data into and out of his objects. If he isn't doing that, then he probably isn't that good of a C# developer either.

  • aaron.reese (12/14/2013)


    Also, tell your C# developer that the underlying DB is none of his damn business; he should be using sprocs or LINQ to get data into and out of his objects. If he isn't doing that, then he probably isn't that good of a C# developer either.

    I'd really suggest never taking that approach, especially since the C# developer in this case has the full trust and confidence of the CIO. 😉 It's a team, and part of the problem is that the C# developer and the CIO don't, apparently, understand that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ItsDaveTime (12/14/2013)


    Thanks for your responses. I've been doing this for many years (as I'm sure you have as well) and have implemented and developed high-availability, high-transaction OLTP and business intelligence SQL systems for very large companies. This project, however, is a different animal. The front-end C# developer on this project is one of those "I know better" types; rather than investigate the bottleneck or try to work with it, he just wants to throw different technologies at it. I'm sure you all know the type. Unfortunately, he has the ear of the CIO, so it's an uphill battle. I'm confident that SQL Server can handle this, as I've done it in the past. The problem is the architecture on the front end and, unfortunately, the mindset of the front-end developer.

    Ah... and the real problem is identified. 😉

    I know it's of little help to you but I wanted to take my hat off to you for trying to do the right things for the company. As you well know from your previous experience, these types of environments are very difficult to change and take a lot of time to change, if they can be changed at all. You also know that only demonstrable code can change their minds (again, if at all) because it sounds like they're stuck on the worst problem of them all... themselves. Playing the part of the devil's advocate, make sure that you're not just as stuck on ritual DB stuff as they are on new stuff.

    Make sure you have a regular, robust, and reliable backup plan to cover any failure they may have, not so much for their sake or your sake but for the sake of the company. As much as it's going to pain you, try to be helpful when they ask something of you (unlike them, be a good member of the team), and document (literally, keep a well-hidden handwritten log/diary) everything they do, or would have you do, that you believe is incorrect, for all the obvious reasons.

    Of course, you also have to remember that they might be thinking the same of you as you do of them. Make sure that there's no tangible reason for them to think that way. And start thinking of a way to document, troubleshoot, and demonstrate the repair of the bottleneck so that you can repair it if they fail.

    If the situation becomes insufferable (I believe it already is), remember that it's easier to find a nice, friendly, well-paying job while you're employed than it is when you're not.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If the situation becomes insufferable (I believe it already is), remember that it's easier to find a nice, friendly, well-paying job while you're employed than it is when you're not.

    I think it is ESPECIALLY easy for talented SQL Server folks to find such work these days. From what I see there is about -2% unemployment in SQL Server land these days!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
