Is this a ‘new dawn’ for the data catalog?

How should you store the information you need to understand the information in your relational databases? Or rather, how should you store the information your new developers need, that your compliance team requires, that your business wants?

With extended properties, Microsoft gave us a mechanism in SQL Server to add metadata to database objects. Extended properties allow us to attach text with very little structure (and thereby a lot of flexibility).
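For instance, here's a minimal sketch of tagging a column this way (the dbo.Customer table, its Email column, and the property name are all hypothetical):

-- Attach a free-text 'Sensitivity' note to a single column
EXEC sys.sp_addextendedproperty
    @name = N'Sensitivity',
    @value = N'Confidential - contains personal data',
    @level0type = N'SCHEMA', @level0name = N'dbo',
    @level1type = N'TABLE',  @level1name = N'Customer',
    @level2type = N'COLUMN', @level2name = N'Email';

Both the property name and its value are entirely up to you, which is exactly where the flexibility comes from.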

However, that trade-off between structure and flexibility has arguably fallen a little short of what we need in this new era of oversight and accountability for how we handle data (in light of a data breach epidemic and a regulatory revolution, spearheaded by the GDPR).

Extended properties might be fine for some basic free-form documentation, but all sorts of things can go wrong when you try to use free text fields for critical functions like information governance, from mismatches due to typos to conflicting or incoherent taxonomies.

There’s also a problem with accessibility. Telling your information security team to ‘go check the extended properties’ isn’t likely to win you any friends.
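In fairness, 'go check the extended properties' means something like the following query against the system catalog (a sketch that reuses the hypothetical 'Sensitivity' property from above):

-- List column-level extended properties named 'Sensitivity' in the current database
SELECT s.name AS schema_name,
       t.name AS table_name,
       c.name AS column_name,
       ep.value AS sensitivity
FROM sys.extended_properties AS ep
JOIN sys.tables  AS t ON ep.major_id = t.object_id
JOIN sys.schemas AS s ON t.schema_id = s.schema_id
JOIN sys.columns AS c ON ep.major_id = c.object_id
                     AND ep.minor_id = c.column_id
WHERE ep.class = 1                  -- object/column scope
  AND ep.name = N'Sensitivity'
ORDER BY schema_name, table_name, column_name;

It works, but only one database at a time, only for people comfortable in the system catalog, and only if everyone spelled the property name the same way.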

Microsoft seem to be acknowledging as much in recent changes to SQL Server: first in the form of the data classification feature in SSMS 17.5 and later, and now in changes to the SQL Server engine itself. With either mechanism, you can now associate ‘information types’ and ‘sensitivity labels’ with columns to describe the content (or intended content). You can also generate simple reports from the metadata applied.
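The engine-level version looks roughly like this (a sketch assuming SQL Server 2019 or Azure SQL Database, and the same hypothetical dbo.Customer.Email column as before):

-- Classify a column with a sensitivity label and an information type
ADD SENSITIVITY CLASSIFICATION TO dbo.Customer.Email
WITH (LABEL = 'Confidential - GDPR', INFORMATION_TYPE = 'Contact Info');

-- The classifications surface in a catalog view, which is what makes simple reporting possible
SELECT o.name AS table_name,
       c.name AS column_name,
       sc.label,
       sc.information_type
FROM sys.sensitivity_classifications AS sc
JOIN sys.objects AS o ON sc.major_id = o.object_id
JOIN sys.columns AS c ON sc.major_id = c.object_id
                     AND sc.minor_id = c.column_id;

The earlier, SSMS-only flavour of the feature stores its classifications as extended properties behind the scenes, which is worth bearing in mind if you later want to query them.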

Of course, there’s nothing new about the need to store information about your information.

Twenty years ago, as a cub developer working on a data warehousing project for the London insurance market, I had to consult a large black book to understand the source data I was working with in the claims database. Our data dictionary was literally a 500-page bound volume that was carried from desk to desk. Quaint.

In a later role, simply by working with the business to group systems into three tiers by importance, I was once able to reduce the resilience requirements (and the associated spend) on non-critical databases which I’d previously been treating as tier 1.

Even DBAs who know their database schemas like the back of their hand are recognising a need to record their assumptions and to explicitly confirm that their understanding – and the data policy they implement as a result – matches the expectations of the business. Not least because the regulators are becoming increasingly specific about what they expect, and thereby attaching a more explicit value to the process.

But how? Recording metadata in the database has many advantages, but also some challenges.

I wrote recently about some of the parallels I’ve encountered between database metadata and physical uses of labels and catalogs.

What happens when the contents of a column change, for example when static data masking is applied so that a copy of the database can be used more safely outside Production?

What happens when you can’t change the schema you’re classifying because, while the data is your responsibility, the schema is under the control of a third-party vendor or ISV?

At Redgate, we think the answer might be a new kind of data catalog.

Data catalogs have existed for a while of course, and not just in the form of ‘big black books’.

But the vendor offerings tend to be expensive and difficult to implement well, and are often avoided in favour of ragbag DIY proxies composed of documentation, wikis, and tribal knowledge.

A data catalog that supported DevOps workflows around SQL Server might have some specific capabilities, such as:

  • Environment awareness – is this database in Dev, Test, UAT?
  • Classification that recognises changes between environments (and can be stored independently if needed)
  • The ability to synchronise metadata definitions across environments, and to recognise differences between them
  • Extent of access, and access paths – do individuals have ‘select * from…’, or only application-controlled access?
  • Security measures in place – is this data protected by encryption, or some other control?
  • An API-first implementation that allows metadata to be accessed within other processes, like monitoring, masking, provisioning to dev/test, supplying context to developers (do we need to perform a Data Protection Impact Assessment for this change?)
  • Reporting, search, automated updates

If you’re interested in helping Redgate develop our vision of the modern data catalog, through taking part in research and having access to technical previews and release candidates, please register for our early access program.