How to analyze massive OLTP databases

  • I've just started a job and I'm responsible for administering (I'll define what that means below) a few massive OLTP databases. These are not the direct transactional systems; they are published directly from the main OLTP system, so they mirror the production tables. I am responsible for these tables. Specifically, I need to accomplish the following:

    1- Create ERD models for the tables. Since the tables are replicas/mirrors of production, they have *zero* references or key relationships defined. Nobody has access to the main OLTP system and, obviously, there is no documentation. How can I establish keys (PK or FK relationships) without actually having access to the main source?

    2- Identify the most used and most important tables

    3- Monitor incoming sources and data loads into the server. Not just the databases I'm responsible for, but the server as a whole: basically, analyze how the SQL Server instance is being used by the tools that are connected to it.

    I'm completely lost. Any help, guidance, blogs, or third-party tool suggestions are very much appreciated.

    Many thanks

  • 1. Look for common column names, make guesses, and check the data. For keys, you have to determine uniqueness. Ultimately, I think asking questions and finding someone who can query the main db makes sense. If this is a copy of the OLTP db, then you can recreate their keys. However, if the data is transformed at all, their keys might not be valid.

    2. Use Extended Events (XEvents) to capture workloads, looking for reads and writes. There are also index usage stats you can query (see the sketch after this list).

    3. You can monitor connections and trace them back; app names, IPs, etc. are recorded for each connection. These are a starting point.
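
    As a rough sketch of points 2 and 3, assuming you can query the replica server's DMVs directly (the column choices here are just an illustration):

      -- Point 2: reads and writes per table/index since the last service restart
      SELECT  OBJECT_SCHEMA_NAME(s.object_id) AS schema_name,
              OBJECT_NAME(s.object_id)        AS table_name,
              i.name                          AS index_name,
              s.user_seeks + s.user_scans + s.user_lookups AS total_reads,
              s.user_updates                  AS total_writes,
              s.last_user_seek,
              s.last_user_update
      FROM    sys.dm_db_index_usage_stats AS s
      JOIN    sys.indexes AS i
                ON  i.object_id = s.object_id
                AND i.index_id  = s.index_id
      WHERE   s.database_id = DB_ID()   -- current database only
      ORDER BY total_reads DESC;

      -- Point 3: who is connected, from where, and with which tool
      SELECT  c.client_net_address,
              s.host_name,
              s.program_name,
              s.login_name,
              s.last_request_end_time
      FROM    sys.dm_exec_connections AS c
      JOIN    sys.dm_exec_sessions    AS s
                ON s.session_id = c.session_id
      WHERE   s.is_user_process = 1;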

  • Steve Jones - SSC Editor wrote:

     


    1. Look for common column names, make guesses, and check the data. For keys, you have to determine uniqueness. Ultimately, I think asking questions and finding someone who can query the main db makes sense. If this is a copy of the OLTP db, then you can recreate their keys. However, if the data is transformed at all, their keys might not be valid.

    2. Use Extended Events (XEvents) to capture workloads, looking for reads and writes. There are also index usage stats you can query.

    3. You can monitor connections and trace them back; app names, IPs, etc. are recorded for each connection. These are a starting point.

    Thank you so much for your response. It is a great start; however, it is unreasonable to just "look for common column names, make guesses, check data" at this scale. Also, it is not an exact replica, but all the keys keep the same values and content (they are just missing the reference constraints). I had looked at SSIS's Data Profiling Task, which can suggest candidate keys, but it is very slow and so far I am struggling with it. So I was hoping for a more strategic approach.

     

    Could you also tell me about "index stats"? I've never heard of that.

     

  • How are these tables replicated/mirrored from the production system? There has to be a process in place to insert/update/delete data, unless it is a 'full' refresh of the data at a specific point in time.

    Since you have to monitor the data loads, that tells me there are processes running that load the data into these systems. Review those load processes to see how they are identifying key columns.

    If you cannot get access to any of that, then you are going to have to analyze each table, make some assumptions, and then test those assumptions. Look for a column that appears to identify the row (e.g. SomethingID) and test whether that column is actually unique. Review the column names to see if they indicate some type of relationship; for example, a column named GenderID (or just Gender) that contains numeric values is probably related to a table named Gender (or something similarly named). Both checks are sketched below.
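
    A minimal sketch of both checks, using invented table and column names (dbo.Patient, PatientID, GenderID) purely for illustration:

      -- Is the candidate key column actually unique?
      SELECT COUNT(*)                  AS total_rows,
             COUNT(DISTINCT PatientID) AS distinct_values
      FROM   dbo.Patient;
      -- If total_rows = distinct_values (and the column has no NULLs), it's a PK candidate.

      -- Columns whose names suggest a parent table, e.g. GenderID -> table Gender
      SELECT c.TABLE_SCHEMA, c.TABLE_NAME, c.COLUMN_NAME,
             t.TABLE_NAME AS likely_parent_table
      FROM   INFORMATION_SCHEMA.COLUMNS AS c
      JOIN   INFORMATION_SCHEMA.TABLES  AS t
               ON  c.COLUMN_NAME = t.TABLE_NAME + 'ID'
               AND t.TABLE_TYPE  = 'BASE TABLE'
      WHERE  c.TABLE_NAME <> t.TABLE_NAME;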

    Additionally, review any stored procedures, views and functions to see how those utilize the system. These will also give you a start on identifying the *most* important tables...

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • How are these tables replicated?

    It's a proprietary process, and it is a complete black box to us. So we get refreshed data, but we are not allowed to see the actual prod environment.

     

    "Monitor the data" refers to other processes, not the mirroring.

    Unfortunately, no stored procedures or views exist; once again, they give us the least amount of information possible in order to sell their consulting services at $900/hr!

     

  • Then you need to run an extended event session to capture the SQL statements during the load process.  You can then review those statements to see how they are updating the system.
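
    A minimal Extended Events session along these lines would do it; the session name and file path here are placeholders, and you can add a WHERE predicate on sqlserver.client_app_name once you know which application runs the load:

      CREATE EVENT SESSION CaptureLoadProcess ON SERVER
      ADD EVENT sqlserver.sql_batch_completed
      (
          ACTION (sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.sql_text)
      ),
      ADD EVENT sqlserver.rpc_completed
      (
          ACTION (sqlserver.client_app_name, sqlserver.client_hostname, sqlserver.sql_text)
      )
      ADD TARGET package0.event_file
      (
          SET filename = N'C:\Temp\CaptureLoadProcess.xel'   -- placeholder output path
      );

      ALTER EVENT SESSION CaptureLoadProcess ON SERVER STATE = START;

    Afterwards, read the .xel file with sys.fn_xe_file_target_read_file to see exactly which statements the load process runs.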

    If they have not provided you a structure with any unique keys or clustered indexes, then the system is not useful. And if you add indexes without understanding how they are loading the data, the best you can hope for is slowing the load processes down; the worst case is causing the load process to fail.

    From what you have stated - it seems all you have is a bunch of HEAPS and no idea of how the tables relate.  I wouldn't be surprised if the table names and column names are also obfuscated in such a way that it would be almost impossible to determine any relationships.
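
    A quick way to confirm that, assuming you can query the replica database, is to list the tables that have no clustered index:

      -- Tables with index_id = 0 are heaps (no clustered index)
      SELECT SCHEMA_NAME(t.schema_id) AS schema_name,
             t.name                   AS table_name
      FROM   sys.tables  AS t
      JOIN   sys.indexes AS i
               ON i.object_id = t.object_id
      WHERE  i.index_id = 0
      ORDER BY schema_name, table_name;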

    If that is the situation - my recommendation is to start looking for a better product from a more reliable company.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • If you can't follow Jeffrey's advice, what I would suggest is that you examine your data. You should have some idea of what ought to be unique in a row: an email address, or some other identifier. What you would think about with these is setting a primary key, clustered or nonclustered.

    For the clustering key, there is lots of advice out there. The clustering key does move the data around, and you really want to ensure you pick one that doesn't create hot spots. Do some research here. It doesn't have to be the primary key.
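
    As a sketch, with invented names (dbo.Patient, PatientID, CreatedDate), the primary key and the clustering key can be defined separately:

      -- Nonclustered primary key on the identifying column
      ALTER TABLE dbo.Patient
        ADD CONSTRAINT PK_Patient PRIMARY KEY NONCLUSTERED (PatientID);

      -- Clustered index on a separate key (the clustering key doesn't have to be the PK)
      CREATE CLUSTERED INDEX CIX_Patient_CreatedDate
        ON dbo.Patient (CreatedDate, PatientID);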

    For FKs, look for data repeated in different tables: CustomerID, StatusCode, something that is a lookup from one table to another. I'd use queries to check that you have good data, i.e. no missing parents for child rows, and then carefully create the FKs.
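
    Something like this, with invented table and column names (dbo.Orders, dbo.Customer, CustomerID) just as an illustration:

      -- Any child rows with no matching parent?
      SELECT  o.CustomerID, COUNT(*) AS orphan_rows
      FROM    dbo.Orders AS o
      LEFT JOIN dbo.Customer AS c
              ON c.CustomerID = o.CustomerID
      WHERE   c.CustomerID IS NULL
        AND   o.CustomerID IS NOT NULL
      GROUP BY o.CustomerID;

      -- Only once the query above returns nothing:
      ALTER TABLE dbo.Orders
        ADD CONSTRAINT FK_Orders_Customer
            FOREIGN KEY (CustomerID) REFERENCES dbo.Customer (CustomerID);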

     

    Document as you can. Redgate has SQL Doc, which can be used to keep this up to date in your db, but anything works.

    Disclosure: I work for Redgate
