I have done a ton of research lately on Data Mesh (see the excellent Building a successful Data Mesh – More than just a technology initiative for more details), and have some concerns about the paradigm shift it requires. My last blog tackled the one about Centralized vs decentralized data architecture. In this one I want to talk about centralized ownership vs decentralized ownership, along with another paradigm shift (or core principle) closely related to it: siloed data engineering teams vs cross-functional data domain teams.
First, I want to mention that there is a Data Mesh Learning Slack channel that I have spent a lot of time reading, and what is apparent is that there is a lot of confusion about exactly what a data mesh is and how to build it. I see this as a major problem: the more difficult it is to explain a concept, the more difficult it will be for companies to successfully build it, so the promise of a data mesh improving the failure rates for big data projects will be difficult to achieve if we can’t all agree on exactly what a data mesh is. What’s more, the core principles of the data mesh sound great in theory but will be challenging to implement, hence my thoughts in this blog on centralized ownership vs decentralized ownership.
To review what centralized ownership vs decentralized ownership means (which reminds me of the data mart arguments of the Kimball vs Inmon debates many years ago): Rather than thinking in terms of pipeline stages (i.e. data source teams copying data to a central data lake to be filtered by a centralized data team in IT, who then prepare it for data consumers, so “central ownership”), we think about data in terms of domains (e.g. HR or marketing or finance) where the data is owned and kept within each domain and exposed as a data product, hence “decentralized ownership” (also called domain or distributed ownership). From a business perspective this makes things easier as it maps much more closely to the actual structure of your business. Domains can be followed from one end of the business to the other. Each team is accountable for its data, and its processes can be scaled without impacting other teams. Each domain will have its own team for implementing its domain solution (“cross-functional data domain teams”) instead of one centralized team that resides in IT being responsible for all the implementations (“siloed data engineering teams”).
Inside a domain such as HR, that team manages its HR-related OLTP systems (e.g. Salesforce, Dynamics) and has created its own datasets built on top of a data warehouse or a data lake that combines the data from all the HR-related OLTP systems. I have not seen clarity from the data mesh discussions on how exactly a domain handles OLTP and analytical data, so please comment below if you have a different opinion.
To be part of the data mesh, each domain must follow a set of IT guidelines and standards (“contracts”) that describe how their domain data will be managed, secured, discovered and accessed.
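To make the idea of a contract a bit more concrete, here is a minimal sketch of how a domain’s contract could be checked against the four areas mentioned above (managed, secured, discovered, accessed). The class name, field names, and checks are all hypothetical illustrations on my part, not part of any data mesh standard or product.

```python
from dataclasses import dataclass, field


@dataclass
class DataProductContract:
    """Hypothetical contract a domain publishes for one data product."""
    domain: str
    product_name: str
    owner_email: str = ""                                   # who is accountable
    schema: dict[str, str] = field(default_factory=dict)    # column name -> type
    allowed_roles: list[str] = field(default_factory=list)  # who may read it
    catalog_tags: list[str] = field(default_factory=list)   # how it is found
    access_endpoint: str = ""                               # where it is served


def contract_violations(contract: DataProductContract) -> list[str]:
    """Return the gaps in a contract; an empty list means every check passed."""
    problems = []
    if not contract.owner_email:
        problems.append("managed: no accountable owner")
    if not contract.schema:
        problems.append("managed: no published schema")
    if not contract.allowed_roles:
        problems.append("secured: no access roles defined")
    if not contract.catalog_tags:
        problems.append("discovered: no catalog tags")
    if not contract.access_endpoint:
        problems.append("accessed: no access endpoint")
    return problems
```

A check like this only confirms that a domain has filled in the contract, not that the implementation behind it is any good, which is part of my concern below.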
Having built database and data warehouse solutions for 35 years, I have some concerns about this approach:
- Domains will only be thinking of their own data product and not how to work with other products, possibly making it difficult to combine the data from multiple domains
- Not having IT-like people in each product group to do the implementation but instead trying to use business-like people
- Does each domain have the budget to do its own implementation?
- You may have domains not wanting to deal with data and just focus on what they are good at (e.g. serving their customers), happy to have IT handle their data
- Each domain could be using different technology, some of which could be obscure, and may not have the experience to pick the right technology
- Having centralized policies with a data mesh oftentimes leaves the implementation details to the individual teams. This has the potential of inconsistent implementations that may lead to performance degradations and differing cost profiles
- If implementing a Common Data Model (CDM), then you will have to get every domain to implement it
- You will have to coordinate each domain to have its own unique IDs for rows when it has the same types of data as other domains (e.g. customers)
- Domains may have their own roadmap and want to implement their use case now and/or don’t want to pay or wait for a data mesh. And what if you have dozens of domains/orgs who feel this way?
- Conformed dimensions would have to be duplicated in each domain
- You could plan on having a bunch of people with domain knowledge within each domain, but what if you already have many people in IT who understand all the domains and how to integrate the data to get more value than the separate domains of data? Wouldn’t this favor centralized ownership?
- Ideally you want deep expertise on your cross-functional teams in streaming, ETL batch processing, data warehouse design, and data visualization. So if you have many domains this means many roles to fill, and that might not be affordable. The data mesh approach assumes that each domain team has the necessary skills, or can acquire them, to build robust data products. These skills are incredibly hard to find
- How do you convince ‘business people’ in each domain to take ownership of data if it only introduces extra work for them? And what if there could be a disruption in service?
- If each domain is building their own data transformation code, then there will be a lot of duplication of effort
- If there are already data experts within each domain, why not just have IT work closely with them under centralized ownership?
- The domain teams may say their data is clean and won’t change it, whereas if the data is centralized then it can be cleaned. And domains may have different interpretations of clean or how to standardize data (e.g. defining states with abbreviations or the full state name). And what if the domains don’t have time to clean the data?
- Who scans for personally identifiable information (PII), and who fixes the issue if it is found that people are seeing PII that they should not be allowed to see?
- Who coordinates if a domain changes its data model, causing problems with core data models or queries that join domain data models?
- Who handles DataOps?
- Shifting from a centralized set of individuals servicing their data requests to a self-serve approach could be very challenging for many companies
- Each domain ingesting their own data could lead to duplication of purchased data, along with many domains building similar ingestion platforms
- The problem of domains ignoring the data security standards or data quality standards, which would not happen in a centralized architecture
- You create data silos for domains that don’t want to join the data mesh or are not allowed to because they don’t follow the data mesh contract for domains
- Replacing the IT data engineers with engineers in each domain (“business engineers”) will provide the benefit of business engineers knowing the data better, but the tradeoff is they don’t have the specialized technical knowledge that IT data engineers have which could lead to less-than-optimal technical solutions
- Having multiple domains that have aggregates or copies of data from other domains for performance reasons leads to duplication of data
- A data mesh assumes that the people who are closest to the data are the best able to understand it, but that is not always true. Plus, they likely don’t understand how best to combine their data with other domains
- A data mesh touts that it reduces the “organizational complexity”, but it may actually make it worse when the teams are distributed instead of centralized and many more people are involved
- The assumption that IT data engineers don’t have business and domain knowledge is not always true in my experience. I have seen some that have more knowledge than the actual domain experts, plus they understand how to combine the data from different domains together. And if IT data engineers don’t have the domain knowledge, having them obtain that knowledge could be a better solution than a whole new way of working that comes with a data mesh (in which those people are in many cases just moved to the business group). Wouldn’t improving the communication between IT and the domains be the easiest solution?
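Two of the concerns above (unique IDs across domains, and domains standardizing the same field differently) are easy to picture with a small sketch. This assumes two hypothetical domains that both hold customer rows, one storing state abbreviations and the other full state names. The lookup table, function names, and ID format are illustrative assumptions on my part, not a recommended design.

```python
# Hypothetical lookup; a real one would cover every state and territory.
STATE_ABBREVIATIONS = {"washington": "WA", "oregon": "OR", "texas": "TX"}


def normalize_state(value: str) -> str:
    """Map a full state name or an abbreviation to the two-letter form."""
    cleaned = value.strip()
    if len(cleaned) == 2:
        return cleaned.upper()
    # Unknown names raise KeyError so gaps surface instead of passing silently.
    return STATE_ABBREVIATIONS[cleaned.lower()]


def global_id(domain: str, local_id: int) -> str:
    """Prefix a domain-local row ID so it stays unique across domains."""
    return f"{domain}:{local_id}"


def combine_customers(rows_by_domain: dict[str, list[dict]]) -> list[dict]:
    """Merge customer rows from several domains into one consistent list."""
    combined = []
    for domain, rows in rows_by_domain.items():
        for row in rows:
            combined.append({
                "customer_id": global_id(domain, row["id"]),
                "state": normalize_state(row["state"]),
            })
    return combined
```

If a sales domain holds `{"id": 1, "state": "WA"}` and a marketing domain holds `{"id": 1, "state": "Washington"}`, the combined rows get the IDs `sales:1` and `marketing:1`, both with state `WA`. Notice that this still says nothing about whether those two rows are the same customer, and someone has to own and maintain this shared logic, which is exactly the coordination question raised in the list above.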
Finally, I have to take issue when I hear that current big data solutions don’t scale and data mesh will solve that problem. It is trying to solve what it perceives as a major problem (“crisis”) that is really not major in my opinion. There are thousands of companies that have implemented successful big data solutions, but there are very few data meshes in production. I have seen many “monolithic” architectures scale the technology and the organization very well. Sure, many big data projects fail, but for the same reasons that would have made them fail if they had tried to implement a data mesh instead (and arguably there would be an even higher failure rate trying to build a data mesh due to the additional challenges of a data mesh). Technology for centralizing data has improved greatly, allowing solutions to scale, with serverless options now meeting most big data requirements while saving costs, and it will continue to improve. There is a risk with the new architecture and organizational change that comes with a data mesh, especially compared to the centralized data warehouse, which has proven to work for many years if done right. Plus, the data mesh assumes that each source system can dynamically scale to meet the demands of the consumers, which will be particularly challenging when data assets become “hot spots” within the ecosystem.
But I want to be clear that I see a lot of positives with the data mesh architecture, and my hope is that it will be a great approach for certain use cases (mainly large fragmented organizations). I’m just trying to point out that a data mesh is not a silver bullet, and you need to be aware of the concerns listed above before undertaking a data mesh to make sure it’s the right approach for you, so you don’t become another statistic under the failed project column. It requires a large change in a company’s technology strategy and an even larger change in a company’s organizational strategy, which will be a huge challenge that you have to be prepared for.
The post Data Mesh: Centralized ownership vs decentralized ownership first appeared on James Serra's Blog.