SQLServerCentral is supported by Redgate
 

Data Virtualization vs. Data Movement

I have blogged about Data Virtualization vs Data Warehouse and wanted to blog on a similar topic: Data Virtualization vs. Data Movement.

Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.

Data movement is the process of extracting data from source systems and loading it into the data warehouse. It is commonly called ETL, which stands for extraction, transformation, and loading.
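To make the two definitions concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The two "source systems" (crm and erp), the table names, and the sample rows are all hypothetical, and attached in-memory SQLite databases only stand in for real, separate systems. The data movement path copies an aggregated result into a warehouse table, while the virtualization path defines a view that reads the sources at query time.

```python
import sqlite3

# One connection with two attached in-memory databases that stand in for
# two separate source systems (hypothetical names: crm and erp).
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS erp")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE erp.orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO erp.orders VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 75.0)])

SALES_QUERY = """
    SELECT c.name AS customer, SUM(o.amount) AS total
    FROM crm.customers AS c
    JOIN erp.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
"""

# Data movement: extract, transform (aggregate), and load a copy
# into a warehouse table in the main database.
conn.execute("CREATE TABLE main.dw_sales AS " + SALES_QUERY)

# Data virtualization: no copy is made. The view reads the sources
# every time it is queried. (SQLite only lets TEMP views span databases.)
conn.execute("CREATE TEMP VIEW v_sales AS " + SALES_QUERY)

# A new order arrives in the source system after the load has run.
conn.execute("INSERT INTO erp.orders VALUES (2, 25.0)")

moved = dict(conn.execute("SELECT customer, total FROM dw_sales"))
virtual = dict(conn.execute("SELECT customer, total FROM v_sales"))
print("moved  :", moved)    # copy is stale until the next ETL run
print("virtual:", virtual)  # view reflects the new order immediately
```

The copy in dw_sales stays as it was at load time, while v_sales picks up the new order because it runs the join against the sources on every query. That freshness is paid for with the cost of re-running the join each time.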

If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?

The most common scenario where you would want to do data movement is when you will aggregate or transform the data once and query the results many times.  Another common scenario is when you will frequently join data sets from multiple sources and query performance is critical.  These turn out to be the scenarios for most data warehouse solutions.  But there could be cases where you will have many ad-hoc queries that don’t need to be especially fast.  And you could certainly have a data warehouse that uses data movement for some tables and data virtualization for others.
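The "transform once, query many times" argument can be sketched as a simple break-even calculation. This is only a back-of-the-envelope model: the function names and every cost figure below are hypothetical, and real costs depend on data volumes, network, and the tools involved.

```python
import math


def total_cost(n_queries, per_query_cost, one_time_cost=0):
    """Total cost of answering n_queries, plus any one-time setup cost."""
    return one_time_cost + n_queries * per_query_cost


def break_even_queries(etl_cost, warehouse_query_cost, virtual_query_cost):
    """Smallest number of queries at which data movement is strictly
    cheaper in total than data virtualization, or None if it never is."""
    saving_per_query = virtual_query_cost - warehouse_query_cost
    if saving_per_query <= 0:
        return None  # virtual queries are no slower, so ETL never pays off
    return math.floor(etl_cost / saving_per_query) + 1


# Hypothetical costs in seconds: a 600 s ETL run, 1 s per warehouse query,
# 31 s per virtualized query that transforms the source data on the fly.
n = break_even_queries(etl_cost=600, warehouse_query_cost=1,
                       virtual_query_cost=31)
print("data movement wins from query number", n)
print("movement total:", total_cost(n, 1, one_time_cost=600))
print("virtual total :", total_cost(n, 31))
```

Under these assumed numbers, a handful of ad-hoc queries favors virtualization, while a report that runs many times a day quickly repays the one-time load.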

Here is a comparison of both:

Other data virtualization benefits:

  • Provides complete data lineage from the source to the presentation layer
  • Additional data sources can be added without having to change transformation packages or staging tables
  • All data presented through the data virtualization software is available through a common SQL interface regardless of the source (e.g. flat files, Excel, mainframe, SQL Server, etc.)

While this table gives some good benefits of data virtualization over data movement, it may not be enough to overcome the sacrifice in performance or other drawbacks listed at Data Virtualization vs Data Warehouse.  Also keep in mind the virtualization tool you choose may not support some of your data sources.

The better data virtualization tools (e.g. Denodo) provide features such as query optimization, query pushdown, and caching that may help with performance.  You may see tools with these features called “data virtualization” and tools without them called “data federation” (e.g. PolyBase).
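To illustrate why pushdown and caching matter, here is a minimal sketch in Python. The VirtualLayer class, its method names, and the sample orders table are hypothetical and do not represent the API of Denodo, PolyBase, or any other product. An in-memory SQLite database stands in for a remote source, and a counter tracks how many rows would cross the network.

```python
import sqlite3

# An in-memory SQLite database stands in for a remote source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(i, "EU" if i % 4 == 0 else "US", float(i)) for i in range(1, 101)],
)


class VirtualLayer:
    """Hypothetical virtualization layer that counts rows pulled from the source."""

    def __init__(self, conn):
        self.conn = conn
        self.rows_transferred = 0
        self.cache = {}

    def query_without_pushdown(self, region):
        # Pull every row across, then filter inside the virtualization layer.
        rows = self.conn.execute(
            "SELECT id, region, amount FROM orders").fetchall()
        self.rows_transferred += len(rows)
        return [row for row in rows if row[1] == region]

    def query_with_pushdown(self, region):
        # Push the filter down so the source returns only matching rows.
        rows = self.conn.execute(
            "SELECT id, region, amount FROM orders WHERE region = ?",
            (region,)).fetchall()
        self.rows_transferred += len(rows)
        return rows

    def query_cached(self, region):
        # Serve repeated requests from the cache instead of the source.
        if region not in self.cache:
            self.cache[region] = self.query_with_pushdown(region)
        return self.cache[region]


naive = VirtualLayer(source)
naive.query_without_pushdown("EU")
pushed = VirtualLayer(source)
pushed.query_with_pushdown("EU")
cached = VirtualLayer(source)
cached.query_cached("EU")
cached.query_cached("EU")  # second call never touches the source
print(naive.rows_transferred, pushed.rows_transferred, cached.rows_transferred)
```

All three approaches return the same rows. The difference is how much data leaves the source system, which is where much of the performance cost of virtualization comes from. A real cache would also need an invalidation or expiry policy, which this sketch leaves out.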

More info:

A FRESH LOOK AT DATA VIRTUALIZATION

Developing a Bi-Modal Logical Data Warehouse Architecture Using Data Virtualization

James Serra's Blog

James is a big data and data warehousing technology specialist at Microsoft. He is a thought leader in the use and application of Big Data technologies, including MPP solutions involving hybrid technologies of relational data, Hadoop, and private and public cloud. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 30 years of IT experience. James is a popular blogger (JamesSerra.com) and speaker, having presented at dozens of PASS events including the PASS Business Analytics conference and the PASS Summit. He is the author of the book “Reporting with Microsoft SQL Server 2012”. He received a Bachelor of Science degree in Computer Engineering from the University of Nevada-Las Vegas.
