Physical and Logical Design of the Database

  • I have been working in SQL for a long time, but at times I find it a little difficult to give someone a theoretical explanation of a concept.

    Someone asked me how he should approach the physical and logical design of a new database. I explained it as best I could, but I also told him to read up on it himself. I know it's an open-ended question, but could you give some references where I can get in-depth knowledge of this subject?

  • For explaining the concepts behind a database in a clear and concise manner you could do worse than the book "The Manga Guide to Databases".

  • A picture is worth a thousand words. So I like to use a good modeling program (Embarcadero's ERStudio is my favorite by a long shot), which not only offers lots of capabilities for doing the design itself but can also be annotated, broken down into submodels, etc. I also use Visio regularly to create documentation to help myself and others understand the sometimes complex business processes our database applications facilitate/track.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on Google's mail service

  • There is way too much to cover here in a post. It is something that experienced DBAs can hash over for weeks. But I will share a couple of pointers I have after 26 years as a DBA (mainframe DB2, Unix, and SQL Server). There are some general rules that apply regardless of the platform.

    Your logical design should bring closure to the business and IT organization as to what constitutes a completed database design for purposes of the data requirements. The real work now is done by the DBA to guarantee performance while meeting the availability requirements. I will only cover a couple of pointers for operational data (OLTP), not BI/data warehouse physical design. These two are very different.

    First, you have to decide whether your physical design will have declared database referential integrity ("RI"). There are many factors that go into a decision like this.

    Will the end user be making ad hoc updates? This sounds crazy in production, but some shops cannot enforce prohibitions against it. If so, then you need RI. Usually, though, updates are done through stored procedures.

    Will you be deleting a logical parent and expecting an automatic delete of all logical children? Be careful with this. While RI works faster in this case than explicit logical-child deletes, it also works slower when there are logical-child adds (i.e., under the same logical parent). Also beware of the havoc a cascading delete can wreak. For example, if you delete a department, do you automatically delete all the employees in that department? You might have some employees pretty angry about getting fired when they were only going to be moved to another department!
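
    A minimal sketch of the two delete behaviors, using hypothetical Department/Employee tables (names and columns are invented for illustration, not taken from any real schema):

    ```sql
    -- Declared RI between a hypothetical parent and child.
    CREATE TABLE dbo.Department
    (
        DepartmentID   int          NOT NULL CONSTRAINT PK_Department PRIMARY KEY,
        DepartmentName varchar(100) NOT NULL
    );

    CREATE TABLE dbo.Employee
    (
        EmployeeID   int          NOT NULL CONSTRAINT PK_Employee PRIMARY KEY,
        EmployeeName varchar(100) NOT NULL,
        DepartmentID int          NOT NULL
            CONSTRAINT FK_Employee_Department
            REFERENCES dbo.Department (DepartmentID)
            ON DELETE NO ACTION  -- default: deleting a department that still has employees fails
         -- ON DELETE CASCADE    -- alternative: deleting a department silently deletes its employees
    );
    ```

    With NO ACTION the delete of a parent that still has children is rejected; with CASCADE the children silently go with it, which is exactly the department/employee trap described above.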

    Logical-child inserts also cause share-level locking on the logical parent. So if you have an application that must update a non-key column of the logical parent while logical-child inserts are going on, you will experience lock delays or even timeouts.

    Second, what type of data will be in your new database? Persistent or non-persistent? If it is persistent (such as a customer table), then you will have to consider access speed (typically clustering on the primary key). If it is non-persistent data that you plan to purge after 30 days, one year, seven years, etc., then you should consider the effect of partitioning. If the data will be read primarily for a particular period, then base the partitioning on a date value; queries will then have a smaller underlying data set to access. You will need to test your purge strategy before going into production. Will you need to archive? If so, will you need to perform a recall/restore? Then make sure that is tested as well before going into production.
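
    As a rough illustration of the date-based partitioning and purge idea on SQL Server, here is a minimal sketch; the table, monthly boundaries, and switch-out purge are hypothetical choices, not a prescription:

    ```sql
    -- Partition a purgeable table by month on a date column.
    CREATE PARTITION FUNCTION pf_MonthlyByDate (date)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

    CREATE PARTITION SCHEME ps_MonthlyByDate
    AS PARTITION pf_MonthlyByDate ALL TO ([PRIMARY]);

    CREATE TABLE dbo.AuditEvent
    (
        EventDate date         NOT NULL,
        EventID   bigint       NOT NULL,
        Payload   varchar(500) NOT NULL,
        CONSTRAINT PK_AuditEvent PRIMARY KEY CLUSTERED (EventDate, EventID)
    ) ON ps_MonthlyByDate (EventDate);

    -- Purge = switch the oldest partition out to an empty table of identical
    -- structure on the same filegroup, then truncate that table. Test this
    -- (and any archive/restore path) before it ever runs in production.
    -- ALTER TABLE dbo.AuditEvent SWITCH PARTITION 1 TO dbo.AuditEvent_PurgeStage;
    -- TRUNCATE TABLE dbo.AuditEvent_PurgeStage;
    ```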

    I could go on about freespace, indexing, logical vertical partitioning, logical horizontal partitioning, and other issues. As I said, it can be a very involved, complex subject.

    Jim

  • jim.drewe (7/18/2012):

    Kudos to this reply! It just about sums up my job description here in our organization 🙂 🙂 Thanks for this; it makes me realize how complicated being a DBA is, beyond configuration, administration, and pressing F5.

    thanks.

    Cheers! 🙂

  • Thanks a lot! It helped a lot.

  • jim.drewe (7/18/2012):

    Why do you say there is a huge difference between the physical model of OLTP and the physical model of OLAP? Isn't the only difference the fact tables and dimensions?

  • There are several models to choose from when you consider data warehousing. Some physical designers are "purists" and would never conceive of having a child to a dimension table, as it would violate the concept of a pure star schema design. There are times, I believe, when you can do this without a penalty, especially if the DBMS involved is good at exploiting single-column indexes with logical ANDing/ORing. Another model difference is whether you are going to perform "typical" data warehouse reporting or whether you are going to design for data mining (either predictive analytics or the newer data warehouse appliances such as Netezza).
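
    To make "a child to a dimension table" concrete, here is a hypothetical snowflaked product dimension; a star-schema purist would instead denormalize the category column straight into the product dimension:

    ```sql
    -- Snowflake: the fact table reaches Category only through Product.
    CREATE TABLE dbo.DimProductCategory
    (
        CategoryKey  int         NOT NULL CONSTRAINT PK_DimProductCategory PRIMARY KEY,
        CategoryName varchar(50) NOT NULL
    );

    CREATE TABLE dbo.DimProduct
    (
        ProductKey  int          NOT NULL CONSTRAINT PK_DimProduct PRIMARY KEY,
        ProductName varchar(100) NOT NULL,
        CategoryKey int          NOT NULL REFERENCES dbo.DimProductCategory (CategoryKey)
    );

    -- Pure star alternative: drop DimProductCategory and carry
    -- CategoryName as a denormalized column on dbo.DimProduct.
    ```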

    In several industries, your fact table can have an extremely long row size (my experience was with financial services, but the insurance industry is the same way). In the logical model, you will have logical sub-typing of the fact table. That is, you can have core information that is the same for every transaction, but there are differing/variable extensions to that transaction record. Not even mainframe DB2 can handle that kind of length without going to a wasteful 32K page size. This is especially true when you consider that most of the variations will be filled with blanks. I know that database compression will take these out very efficiently, but most of the time a DBA will have to determine whether these variations to the transaction record can be split out into their own fact table. Then, of course, you will have to make sure that users will NEVER join one fact table to another fact table. If they do, performance will be in the toilet and there will be nothing you can do about it.
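
    A minimal sketch of splitting the sub-typed variations out of a core fact table (all names are hypothetical; per the point above, users would report against each fact table on its own rather than joining fact to fact):

    ```sql
    -- Core fact: columns common to every transaction, kept narrow.
    CREATE TABLE dbo.FactTransaction
    (
        TransactionKey bigint        NOT NULL CONSTRAINT PK_FactTransaction PRIMARY KEY,
        DateKey        int           NOT NULL,
        AccountKey     int           NOT NULL,
        Amount         decimal(19,4) NOT NULL
    );

    -- Extension fact: the variable columns that apply only to one sub-type,
    -- split out so the core row stays short and mostly-blank columns disappear.
    CREATE TABLE dbo.FactTransactionTradeDetail
    (
        TransactionKey    bigint        NOT NULL CONSTRAINT PK_FactTransactionTradeDetail PRIMARY KEY,
        SettlementDateKey int           NULL,
        CounterpartyKey   int           NULL,
        TradeNotes        varchar(2000) NULL
    );
    ```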

    Data warehousing/BI will do more with date-sensitive data, so it will be incumbent on the physical designer to design/partition in such a way that queries will not be plowing through unnecessary rows. Sometimes the 80/20 rule comes into play: don't design for 100 percent of the queries. If 80 percent are processing against the last three years of data, give them optimum performance in your partitioning. If the other 20 percent are crazy, then let them have sucky performance (or purchase their own data mart).

    Vertical and horizontal table partitioning will probably play a greater role in DW/BI. Indexing will probably play a greater role too, since the heavy, repetitive SQL write operations of OLTP will not be occurring.

  • Oops. I forgot to add a couple of other very important differences between BI/DW and OLTP.

    In some shops, ETL processing will allow separate operational reporting on the staging databases, and presumably the data is cleansed there. That is, the traditional data quality issues of OLTP are certified to be corrected. If you have confidence in the cleanliness of your ETL data, there is no need to restrict your loading/inserting to a fact table with a unique index/primary key; that is unneeded overhead. On the other hand, if you doubt the cleanliness of the data coming in (e.g., duplicate primary keys in the fact table), you have no choice but to slap a unique index on the logical primary key of the fact table. It is regrettable but unavoidable.
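
    For example, a sketch of that trade-off with a hypothetical fact table and logical key:

    ```sql
    -- Only needed if ETL cannot be trusted to deduplicate: reject duplicate
    -- business keys at load time, at the cost of extra write overhead.
    CREATE UNIQUE INDEX UX_FactSales_LogicalKey
        ON dbo.FactSales (DateKey, StoreKey, ProductKey, TicketNumber);

    -- If the ETL output is certified clean, omit this index and save the overhead.
    ```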

    Another problem is what to do about monster dimensions. From the DBMS's optimizer standpoint, they look like two fact tables getting joined together. Some DBMSes have something called Materialized Query Tables (or automated summary tables) that can mitigate the performance problem here. Again, the 80/20 rule comes into play.
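
    In SQL Server terms, the closest analog to a materialized query table is an indexed view; a minimal sketch against hypothetical fact and dimension tables might look like this:

    ```sql
    -- Pre-join and pre-aggregate the monster dimension once, so summary queries
    -- can be answered from the indexed view instead of redoing the join.
    CREATE VIEW dbo.vSalesByCustomerSegment
    WITH SCHEMABINDING
    AS
    SELECT c.SegmentKey,
           f.DateKey,
           SUM(f.SalesAmount) AS TotalSales,
           COUNT_BIG(*)       AS RowCnt   -- required in an aggregated indexed view
    FROM dbo.FactSales AS f
    JOIN dbo.DimCustomer AS c
      ON c.CustomerKey = f.CustomerKey
    GROUP BY c.SegmentKey, f.DateKey;
    GO

    -- Materializing the view requires a unique clustered index on it.
    CREATE UNIQUE CLUSTERED INDEX UXC_vSalesByCustomerSegment
        ON dbo.vSalesByCustomerSegment (SegmentKey, DateKey);
    ```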
