The
Problem
All
over the world organizations store or have already accumulated during their
activity large volumes of data in corporate databases. These data
"sets" store major potential possibilities for extracting new,
analytical information to help build the business tactics to reveal market
evolution tendencies and to find new solutions stipulating successful
development in conditions of competition. It is obvious that for some companies
such analysis is an integral part of their daily activity, but most of them, at
least here in Russia, are only
starting to begin taking it seriously.
Historically,
in Russia the solution
for daily operating activity automation was extremely advanced (I mine online
transaction processing systems, OLTP). These systems ensure registration and
storage of the facts describing companies' business processes. The base for such kinds of systems are relational database
management systems (RDBMS).
The traditional approach is to use created OLTP-systems for support of accepted
administrative solutions, i.e., attempts to construct report sets
"under" the corporate database and usage obtained (after
interpreting) reports directly for acceptance of strategic business solutions.
Among the disadvantages of such a method of decision-making is that it's
necessary to mark the following (as a minimum):
- Small number of data used for
support of crucial business-solutions;
- Similar process is very slow,
as the process of query writing and interpreting them are durable. It
takes many days when the chief solution may be necessary to make
immediately;
- The problem of various spheres
of activity. An expert in information technologies (person who prepares
the report) and the chief think in different categories and as a corollary
tend not to understand each other;
- Complexity of reports for
understanding (an absence of visualization);
- Lack of time (for chief) to
find "significant" numbers.
Actually, these problems are not a corollary of poor quality of the
corporate OLTP-systems. The problem is in the difference between activity which
is automated by the OLTP-systems and decision-making activity. Data stored in
corporate database systems are simply the records about some events having a
place -- the facts, but not information in a general sense of this word.
E.F.
Codd, the expert of technologies of management
relational systems, said: "... although RDBMS systems are accessible for
users, they never were considered as a resource giving power functions on
synthesis, the analysis and consolidation (so called, multidimensional data
analysis)". The question is synthesis of the information, on converting
the data of operating systems into the information and even in quality
standards.
The
Solution
Automation
or computerization, as a rule, happens historically from "below to
upwards". The defined divisions, which occur in mass operations on data
processing (product registration, the warehouse registration, dealing with
suppliers and/or clients), are exposed to the process of target automation.
This means that mass tasks will be made with the help of computers, which will
replace operation with paper media (documents, forms) on the mode, in which
operations are mirrored in databases, or (that more often and happens) the
information will be duplicated both on paper and in databases.
By
virtue of that (or to reflection in the database) is exposed to a
computerization each time a small fragment of business processes, and it
happens not always in coordination. Most databases are "badly
coordinated" with one another. The phrase "badly coordinated"
means that in various databases identical "real world" objects may be
reflected differently (for example, Smith D. - at charge of wages and Don Smith
- at obtaining material assets).
"Inside"
corporate databases target transactions that are executed (for example, filing
the table of credited invoice). Such systems usually are called online
transaction processing systems (or OLTP). Target transactions change a state of
databases and bring them into accord to a current state of that fragment of the
real world which is simulated with the database. Thus, the main assignment of
target databases are transaction processing.
Another
big layer of the operations fulfilled with corporate databases is reports
creation for the defined period of existence of the database. The form of these
reports may vary and, as a rule, is determined by external rules (bookkeeping
rules and laws) and internal rules (duty regulations, orders and instructions)
within the framework of which company works.
The
databases created during such automation, as a rule, are optimized for fast
execution of necessary transactions and store the data for a small temporary
period. Now it is possible to ask ourselves: "What processes have
undergone automation? What operation of staff was automated and
simplified?" The answer for this question is obvious: operation
of performers which fulfill standard procedures are automated. Automation
almost has not affected top-managers - the staff responsible for decision
making.
Creation
of data warehouses - "deckhouses" above corporate databases allow one
to proceed to the next stage of business activity automation - creation the
platform for decision making support. The main difference of activity on
decision-making from performing daily, operational activity, from the point
data used, is the need for comprehensive vision of processes for all diversity
of parameters on which they depend, for various temporary intervals. It is
possible to say that performer work with the data on happening processes
whereas the information is necessary for decision making for managers. For
creation of decision support systems it is necessary complete, non-contradictory,
the information for various temporary intervals which may be as is generalized
(sum or aggregated in another way), and detailed. These requests are defining
at creation of data warehouses as bases for creation of decision support
systems (DSS).
For
transition from the process of automation business of processes to automation
of decision-making process we use software products which can be divided into
three categories:
- Reporting tools;
- OLAP tools;
- Data Mining
tools.
Report generation tools are intended for data acquisition in tabular
form, diagrams and (geographical) maps. This class of tools allows managers to
inspect happening processes, having some of the fixed sights on metrics of
these processes.
OLAP
tools are aimed at a test of hypotheses -- they allow one to find the data
which confirm or refute the formulated administrative hypotheses. Hypotheses
may be formulated as very much definitely (whether there is a dependence
between falling arrived for the last half-year and rises of product cost), and
is more indistinct (whether there are parameters which distinguish the division
which has brought the greatest profit on all remaining company divisions). This
information allows managers to change processes of firm for reaching definite
purposes.
Data
mining tools are intended for checking, finding and confirming hypotheses based
on the existing data. These tools hardly depend on data domain and structure of
input data. However, the availability of similar tools are necessary in case of
the big data volumes and a plenty of parameters, on which these data depend, as
they allow to detect (or other words, to make visible, mapped) the facts and
tendencies which are completely invisible at usual review of huge data arrays.
Specific
data usage forms (against transaction processing "inside" OLTP
databases) generate appropriate requirements to storage and data presentation
models. Data warehouses and OLTP-databases run under different query types (or
load templates). Corporate OLTP databases are as much as possible optimized for
short transactions series execution with maximum performance. Target
transactions combine not only data reading, but also data modification and new
data insertion. In case of data warehouses we deal, at first, with data
acquisition, a great bulk of queries to data warehouse are reading / sampling
data. The most widespread model for data warehouses is the three-dimensional
model or model of multidimensional cubes.
Attempts
to construct analysis / decision making systems which would access directly to
OLTP-databases usually appear unsuccessful:
- Analytical queries
"compete" with online transactions, locking the data and causing
shortage of database server resources;
- Operating data structure /
scheme intended for effective support of short and often transactions (it
is normalized as a minimum up to the third normal form) and by virtue of
it:
- Too
complicated for understanding by end-users;
- Does
not ensure a necessary execution performance of analytical queries.
- As a rule in a company there
are several operating databases. This database has various data
structures, units of measurements, methods of encoding. For the end user
(business - analyst) the task of construction of any summary query on
several similar databases are practically insoluble.
Multidimensional data models allow to store measure values (for example, the sum of sale, an amount of the
goods, an amount of shipped products) in the uniform object of the data with
defining parameters or arguments; further we shall name these objects as
measures (for example, number of the score, date of posting, number of the waybill,
the name of the payer, the name and the type of the goods, a title of service).
It is necessary to mark, that the data for storage are received from target
online corporate databases. Data warehouses contain the information on the
various sides of processes which might be mirrored in various target databases
in the uniform object, the process of data multiplexing from various sources,
or the process of consolidation is necessary.
Most
measures (parameters on which measures depend) are inhomogeneous. For example,
if we speak about the description of the goods, it is also possible to speak
about some commodity section, the type of the goods. Thus, measure
"product" may have a name hierarchy, for example "product
type" - > "product kind" - > "product unit
name". At stage of data analysis and decision making often it is necessary
to work not with the numerical values specific to concrete production; there is
enough detailing a level of sort of production or even its(her) type. For
support of fast access to this sum the information of repository, as a rule,
contain this information already calculated. The process of
calculation of the similar sums are we named - aggregation, the
calculated values - aggregates.
Data
warehouse structures noticeably differ from those used in OLTP-systems. The
typical data structure used in warehouses are :
- Star schema;
- Snowflake schema.
These schemes for data warehouses are the same canonical, as third
normal form for online transaction processing systems (OLTP).
In a
basis of OLAP concept lays the principle of a multidimensional data
presentation. "Father of the relational theory" doctor E.F. Codd has considered disadvantages of relational model,
first having showed impossibility "to unite, view and parse the data
viewed from multiplicity of measurements, that is the most understandable for
corporate analysts a method", and has defined common requirements to OLAP
tools which are expanding functionality of relational DBMS and switching on the
multidimensional analysis as one performance.
These
12 rules (according to Codd) which should satisfy the
OLAP software are:
|
Multi-Dimensional
Conceptual View
|
Business-analyst
"sees the company world" multivariate and multi-dimensional,
accordingly and conceptual data model representation in OLAP product should
be multivariate and multi-dimensional on a nature, that
allow analysts to fulfil intuitive
operations: "slice and dice", rotate and pivot directions of
consolidation.
|
|
Transparency
|
The
user should not know what concrete resources are used for storage and data
processing and how the data are organized.
Without dependence from that, the OLAP-product a part of resources of the
user is whether or not, this fact should be transparent for the user. If OLAP
it is granted by client - server calculations this fact also, whenever
possible, should be imperceptible for the user. OLAP should be granted in a
context of open architecture, allowing the user where he was to communicate
through the analytical tool with the server. In addition transparency should
be achieved in interaction of the analytical tool with homogeneous /
heterogeneous databases.
|
|
Accessibility
|
Business
analyst should have a possibility to analyze within the framework of the
common conceptual scheme, thus the data may remain under the control of old,
"inherited" DBMS, being thus pegged to common analytical model. So
OLAP tool kit should superimpose its own logic
scheme on physical data arrays, fulfilling all conversions required for
support of an uniform, agreed and complete "user sight" on the
information.
|
|
Consistent
Reporting Performance
|
With
increasing of numbers of measures and database size analysts should not face
with any decrease of productivity. Stable productivity is necessary for
maintaining an usage simplicity which is required
for finishing OLAP up to the end user.
If the user - analyst will test essential distinctions in productivity
according to number of measures then he will try to compensate these
distinctions the strategy of development that will call data representation
other ways, but not with what it is really necessary to present the data. Costs of time to bypass the system for compensation of its
inadequacy is not what analytical products are intended for.
|
|
Client-Server
Architecture
|
Large
data volumes, required operating analytical
processing stored on mainframes, but extracted from PC. Therefore one of
requests - ability of OLAP products to operate in client - server
environment. Main idea here is that OLAP tool server component should be
intelligent enough and can build the common conceptual scheme based on
generalization and consolidations of various logical and physical schemes of
corporate databases.
|
|
Generic
Dimensionality
|
All
measures should be equivalent. Additional performances may be given to
separate measures, but as all of them are symmetric, the given additional
functionality may be given to any measure. Base data structure, formulas and
report formats should not base on any one measurement and should not be
displaced aside to any measure. Each measure should be applied irrespectively
to its structure and operational abilities. Additional operational abilities
may be granted to any selected measure, and as measures are symmetric, any
function may be given to any measure.
|
|
Dynamic
Sparse Matrix Handling
|
OLAP
tool should guarantee optimal processing of the sparse matrixes. Access speed
should be saved without dependence from data cells layout and to be a
constant for the models having different number of measures and different
data sparse.
|
|
Multi-User
Support
|
Frequently
some analysts have the necessity to work simultaneously with one analytical
model or to create various models based on the same data. OLAP tool should
grant them competitive access, guarantee integrity and data protection.
|
|
Unrestricted
Cross-dimensional Operations
|
Data
calculation and manipulation on any number of measures should not prohibit or
limit any ratios among data cells. The conversions requiring arbitrary definition, should be set in functionally complete formula
language.
|
|
Intuitive
Data Manipulation
|
Directions
consolidation, detailing data in columns and rows, aggregation and other data
manipulations inherent to hierarchy structure ,
should be executed in maximum convenient, natural and comfortable user
interface.
|
|
Flexible
Reporting
|
Various
methods of data visualization should be supported,
other word reports should be presented in any possible orientation.
|
|
Unlimited
Dimensions and Aggregation Levels
|
Strongly
recommended, that each serious OLAP tool should have a minimum of 15 (better
more than 20 measures in analytical model. Moreover, each of these measures
should admit practically unlimited amount of aggregation levels, defined by
user, on any direction of consolidation.
|
It
is necessary to consider this set of requests, being the actual definition of
OLAP (by E. F. Kodd), as recommendations, and
evaluate concrete OLAP-products on degree of approximation to correspondence to
all above 12 rules.
This
completes the first part (from a series) of the articles concerning OLAP
technology -- technologies of data analysis, the "platform" which
allows modern organizations in the modern world to build the business-tactics
to reveal market evolution tendencies and to find new solutions stipulating
successful development in conditions of competition.
Later,
in my future articles you can find a description of utilities / tools, related
to OLAP technology:
- Data transformation tools;
- Data analysis tools;
- OLAP-administration tools.
Ideally, in my future articles, I would like to proceed from the
theoretical description of OLAP to a more practical presentation and the
technical application to a solution of those practical tasks which are put
before us in our businesses. I hope to make it with your help and your advice.