Some tips for working with big data models


As SQL Server DBA I do a lot of consulting in different clients that have problems with their databases. Among the many database scenarios that I found probably the aspect that mostly impacts my work is to deal with large and complex database models that were custom created to fulfill data storage requirements.

The origin of these models it not uncommon these days: as the new requirements came up, the developers and other professionals make amends to existing objects, create new tables, relationships, columns, change datatypes and so on. On top of that is the fact that as the business grows, the data stored tends to grow too, which increases the complexity to perform maintenance tasks in the model, the data, the objects, and in the database in general. So, in order to be prepared to deal with large and complex database models, I came up with some tips over the years that prepare the ground and guide me through what I have to do to solve whatever are the database needs of the client that hired me.

The main goal of this article is to present some tips to help professionals that need to work with complex, big, and hard to understand database models that anyone may came across some day. Typically, these database models store thousands of gigabytes in many related objects that can be used by hundreds of concurrent users. So having some kind of a "to-do" list to deal with this sort of special databases can be beneficial not only to solve immediate problems but also to prepare the ground for future requirements and maintenance tasks. Without further delay, these are my tips for those that have to understand in a nutshell a big database model without being its creator:

1) Identify the biggest, the largest, and the most used tables in the model

Every big database model has at least one main table that contains a high percentage of the data stored in the database. This is a fact observed through many years of experience. By "big table" I mean a table with a large amount of data on in (i.e. rows) and when I say "large table" I mean that this object contains many columns with different datatypes. On most cases these big and large tables are the most accessed and used by the applications, so it is very helpful and recommended to know and recognize these object because chances are that you probably will end up working with them. Have a way to differentiate these tables from the other objects in a diagram is a plus that can save time when a quick look in the model is needed.

2) Use a source control, versioning or change management tool

Nowadays there are many tools that offer features to track changes over time in the database model. New columns, relationships, changes in datatypes and other minor modifications like these are common in a big database model that is constant evolving to adapt itself to new business requirements or change requests. The use of source control, versioning or change management tools may require the generation of a script containing the definition of the database objects. Alternatively, the model's versioning control may be integrated with existing IDEs such as Microsoft Visual Studio.NET with Team Foundation Server. Nevertheless the argument that DBAs do not need to use source control tools because they do no program or build applications are no longer acceptable.

3) Know how to print the complete or partial database model

Ah, the printed database model! Decorating thousands of cubicles and walls in the IT departments over the world, these relics becomes ubiquitous and part of the decoration in development environments as the data model gets bigger. As I learn from the many teams that I worked with, there is always someone that asks for a printed version of some object or even the entire database model. So, the idea here is to have a simple, quick and easy way to print a table, or a set of tables and its relationships. Options to generate PDF files, print the database model in separated pages (followed by the not-so-important and time consuming task of correcting taping hem), and use of different paper colors to identify versions and other model's metadata are among the few practices that came in handy for a DBA when working with a team of developers.

4) Identity the most used complementary objects (stored procedures, triggers, functions, indexes)

Since the database is not composed only by tables and its relationships, it is a very pragmatic action to know the other objects that access, control and manipulate the main tables mentioned in the first item of this article. I found out that big data models implemented in SQL Server usually employ many stored procedures, views, functions, triggers and other objects that are as much important as the tables that store the data. To know the logic behind the main lines of code that compose these object can save time while debugging, decrease the effort required for the tuning process and put the DBA in a very comfortable position when there is the need to modify the data.

5) Have a way to see the database in separated layers: with and without relationships, with and without indexes, with and without constraints, etc.

The market is flooded with CASE modeling tools that can help professionals to deal with big logical or physical database models. However, if these models could be created iteratively and incrementally in the same way as an image is created from layers, the DBA can have a powerful way to visualize specific details without all the complexity of a big model. For instance, imagine an ERP (Enterprise Resource Planning) database model that contains tables for the ERP system and some tables created for the integration with an existing intranet system. It would be very valuable if the DBA could just hide the tables from the model that came from the integration with the intranet, thus visualizing only the information required for a specific task that requires only the ERP tables and vice-versa. This type of model layering is a very clever way to separate and isolate parts of the model that are specific to a task in order to avoid the spread of the model's complexity.

6) Use colored rectangles to group related tables from the same subsystem

Another approach that can be employed to separate and isolate parts of the model without modifying it is to use drawing elements. Colored rectangles are almost a standard in modeling when there is the need to separate subsystems. This technique is easy, simple and it not only groups tables and relationships but it also improves the readability and the documentation of the model. Of course, comments on the model are very useful when they and can describe with few words information that may takes hours of study digging from the direct analysis of the model.

7) Enumerate the correct order to insert, update and remove the data in specific tables in order to respect the relationships

This item is a bit tricky because the correct order to insert, update and remove the data into the model's tables is programmed in the application or inside an object in the database such as a stored procedure. The tricky part is that the modifications are mixed and molded by business rules and, sometimes, it is almost impossible to separate them. However, if the DBA can accomplish this separation and create a script that correctly insert, update or remove data in the tables in the same way as they are made through the application he/she may increase its arsenal of scripts that can save several hours during an emergency support call. Please, don't get me wrong here: I'm not suggesting that the DBA or any other professional should modify directly the table's data and override the application in order to quickly solve an issued. What I'm suggesting is that the knowledge of how the tables interact with each order for a particular operation or task can be very helpful and should be in the DBA's toolset.

8) Always have a way to search for the table's name, columns, datatypes, nullability, description and other attributes

Complex models may contain hundreds or even thousands of database objects. It's wise to have a simple way to search and obtain metadata information from it. Almost always it is the case to know exactly which database system catalog view or table to look for when the DBA is searching for something. Good CASE modeling tools also provide options and features for basic and extended searches in the items even if they are still in the logical modeling diagram.

9) Have a script that generate the current database (all objects) with a fraction of its size (10% is ok)

This item is more a personal experienced tip than a must-have tool. In several occasions I needed to create a new database environment for test/production/homologation (or for any combination of these) from an existing database model. These situations require that the all the objects must be created in a different server (sometimes a virtualized one) with all the complexity of the model but with only a fraction of its data. The more or less 10% mark seems to be an acceptable data rate to start with because there must be instances of certain entities (products, customers, employees, application settings, etc) in the database in order to enable the application to start. The script needed to create the all objects is trivial and can be generated automatically by a tool provided by the database but the generation of the correct order to insert the data may be very complex, as discussed in the item 7.

10) Keep an updated list of the permissions for the most common and used database objects (in order to know very quickly what a specific user can and cannot do with the objects)

Perhaps the most common administrative task of a DBA is to manage permissions. The DCL (Data Control Language) commands and the graphic user interfaces of database administrative tools provides an adequate and simplified way of granting, revoking and denying permissions for object to users. So, the DBA that work with a complex model have to be aware of the permissions employed for the objects and the users. The separation of permissions in roles can be very helpful, but one must know how to quickly point out a role with a specific set of permissions for the tables of one subsystem of the model is a must-have skill that any DBA should pose. Again, the use of simpler summarized and grouped documentation of the user's permissions for the main tables and objects can have a great impact in the daily administrative task especially if the database has an odd security way of organizing the user's permissions and access of the objects.

11) Know how to predict and estimate sizes of specific objects in order to forecast database 's growth or shrinkage.

Forecasting based on real data and statistical models that are based on a data distribution can be very helpful when there is need to create a report suggesting the allocation and/or the relocation of database resources that affect hardware and link costs. In my experience with consulting I found more professionals acting by guess than I would like to admit when there are need to justify allocation/reallocation of database resources. Here the tip is simple: always make your suggestions based on statistics, metrics, real needs and demands in order to make your argument solid instead of based on guessing and poor predicted values without a concrete statistical base.

12) Show in the database model which objects have partitioning options, if they are compressed and the filegroups they belong to.

SQL Server and other commercial products have many options for compression, partitioning, and to separating internally the data that is stored inside a table. When doing maintenance task it is important to know which tables are compressed, have partitioning enabled, how this partitioning is implemented, and where the data is physically stored (i.e. filegroups and datafiles). However, the data model and the ER diagram do not have a specific notation to represent this type of information, so it is up to de DBA to figure how to describe and document these options in the ER diagram. Comments, different colors, geometric figures, or even a small table in excel style with sample data showing how the partitioning separates the data are suggestions that can make the model richer and increase the awareness of the DBA about partitioning options just by taking a look at the diagram.

13) In OLAP (Online Analytical Processing) models centralize the fact table and have a way to link the description of the main hierarchy, levels, members, and grains for each dimension table.

OLAP data models, implemented either with the star, snow-flake or a mix of these two schemas, tends to be simpler that complex OLTP data models. One of the reasons for that is because there is more planning involved during the elaboration of the relationships and the entities than the planning in OLTP models. Also, the definitions of the dimensions, levels, hierarchies, and grains create a logical structure in the data that is easily visualized after an inspection of the table's details. Nevertheless, the poor documentation of the characteristics of dimensions and its relationships with fact data makes the model harder to understand. One simple idea to simplify the understanding of an OLAP data model is to center the fact table in the diagram in such a way that this table step up from the others and can be clearly identified at first sight. Decorating the model with plenty of grouped information about the structure and metadata of the dimensions is also a good practice that can save a lot of time guessing how the data is supposed to be shown to do user when browsing a cube' data.


After explaining these tips I point out that most of them focus on documentation of the database model and its objects. Of course, the main efforts in programming are related to develop and implement new requirements and change existing features. However, the IT community, and specially the DBAs, must enforce the role of documentation not only on source code or deliverable software such an application, but also on artifacts that are important for all the development cycle.

Following the tips presented in this article the readers can prepare better DBAs and other professionals when they must understand and work with large and complex data models. This scenario is not uncommon as business grows and count with a few tips and best practices can really help when one have to deal with data in a complex and large database.


3.73 (33)




3.73 (33)