One of the biggest differences between the data mesh and other data platform architectures is that a data mesh is a highly decentralized, distributed data architecture, as opposed to a centralized, monolithic architecture built on a data warehouse and/or a data lake.
A centralized data architecture means the data from each domain/subject (e.g., payroll, operations, finance) is copied to one location (e.g., a data lake under one storage account), and the data from the multiple domains/subjects is combined to create centralized data models and unified views. It also means centralized ownership of the data (usually by IT). This is the approach used by a data fabric.
A decentralized, distributed data architecture means the data from each domain is not copied but rather kept within the domain (each domain/subject has its own data lake under its own storage account), and each domain has its own data models. It also means distributed ownership of the data, with each domain having its own owner.
So is decentralized better than centralized?
The first thing to mention is that a decentralized solution is not for smaller companies; it is meant for very large companies that have very complex data models, high data volumes, and many data domains. I would estimate that for at least 90% of companies, a decentralized solution would be overkill.
Second, a lot depends on the technology used. In future blog posts I'll go more into the technology used for a data mesh and some concerns I have about it. If you are not familiar with the data mesh, I recommend reading the just-released, freely available chapters of the book by Zhamak Dehghani, Data Mesh: Delivering Data-Driven Value at Scale.
For this blog, I want to cover the specific question: Is data virtualization/federation a good solution for enabling decentralization, where data in separate remote data stores can be queried and joined together? (I’ll dig into domain data models vs centralized data models, along with data ownership, in a future blog post).
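To make the idea concrete, here is a minimal, purely illustrative sketch in Python of what a federated query engine does conceptually: pull rows from each remote store and join them at query time, without copying the data into a central location. The store names and fields are hypothetical stand-ins, not any real product's API.

```python
# Each "store" stands in for a remote data source, e.g. a domain's own
# data lake. A federation engine queries them in place and joins the
# results at query time, rather than copying data to a central location.

payroll_store = [  # imagine this lives in the payroll domain's data lake
    {"employee_id": 1, "salary": 70000},
    {"employee_id": 2, "salary": 85000},
]

operations_store = [  # imagine this lives in the operations domain's store
    {"employee_id": 1, "department": "Manufacturing"},
    {"employee_id": 2, "department": "Logistics"},
]

def federated_join(left, right, key):
    """Join two 'remote' result sets on a shared key at query time."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

result = federated_join(payroll_store, operations_store, "employee_id")
```

The tradeoff this sketch hints at is the same one the pros/cons list below describes: no data duplication, but every query pays the cost of reaching out to the source systems.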
To enable data virtualization/federation, there are full proprietary virtualization software products, such as Denodo, Dremio, Starburst, and Fraxses, that can query many different types of data stores (e.g., Dremio supports 19 sources, Starburst supports 45, Denodo supports 67+).
While there are benefits to using full proprietary virtualization software, there are some tradeoffs. I already blogged about those tradeoffs at Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement. I also found a list of pros/cons in a presentation from Microsoft called Azure Modern Data Strategy with Data Mesh. It explains how to use Azure to build a data mesh, and it makes one exception to the ideal data mesh in that storage and data governance are centralized (which I'm finding is a common exception to the ideal data mesh). Definitely worth a watch! Here is that list of pros/cons of data virtualization:
Pros:

- Reduces data duplication
- Reduces ETL/ELT data pipelines
- Improves speed-to-market and rapid prototyping
- Lowers costs (but beware of egress/ingress charges)
- Reduces data staleness (no waiting on refreshes)
- Security is centralized

Cons:

- Slower performance (not sub-second)
- Data ownership is still not addressed
- Data versioning/history not supported (e.g., Slowly Changing Dimensions)
- Affects source system performance (OLTP)
- How to manage Master Data Management (MDM)?
- How to manage data cleansing?
- Not a star schema optimized for reads
- Changes at the source will break the chain
- Might require installing software on the source systems
An alternative to using full proprietary virtualization software, sort of a "light" version of virtualization, is the Serverless SQL pool in Azure Synapse Analytics, which can query remote data stores. It currently only supports querying data in Azure Data Lake Storage (Parquet, Delta Lake, and delimited text formats), Cosmos DB, or Dataverse, but hopefully more sources will come in the future. And if your company uses Power BI, another option is Power BI's DirectQuery, which can also query remote data stores and supports many data sources. Note that a dataset built in Power BI that uses DirectQuery can be used outside of Power BI via XMLA endpoints.
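As a rough sketch of what the "light" Serverless SQL pool approach looks like in practice, the snippet below builds a T-SQL OPENROWSET query that reads Parquet files in place from ADLS Gen2. The storage account, container, and path are hypothetical; you would run the resulting query with any SQL client pointed at your workspace's serverless endpoint.

```python
# Sketch: building a Synapse Serverless SQL query that reads Parquet data
# in place from ADLS Gen2 (no copy into a central store). The storage
# account, container, and path below are made-up examples.

def openrowset_query(storage_account: str, container: str, path: str) -> str:
    """Build a Serverless SQL (T-SQL) query that reads Parquet in place."""
    url = f"https://{storage_account}.dfs.core.windows.net/{container}/{path}"
    return (
        "SELECT TOP 10 *\n"
        f"FROM OPENROWSET(BULK '{url}', FORMAT = 'PARQUET') AS rows"
    )

sql = openrowset_query("payrolllake", "raw", "payroll/*.parquet")
# The query string would then be executed against the workspace's
# serverless endpoint (e.g. via pyodbc), which bills per data scanned.
```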
I have seen data virtualization products used most often when data from many sources has been copied into different data stores inside a modern data warehouse or data fabric (Cosmos DB, SQL Database, ADLS Gen2, etc.) and you need to query those multiple data stores and join the data.
Now, if you are building a data fabric and decide to use data virtualization to keep data in place instead of copying it to a centralized location, then I would say your data fabric and a data mesh are nearly the same thing. At least one difference remains: a data mesh has standards/frameworks for how each domain handles its data, treating data as a product with the domain as the owner, while a data fabric does not.
I would love to hear your thoughts in the comment section below. More to come on this topic!