28.08.2020 Blog

If a dataset is published in your data platform and nobody hears about it, does that dataset really exist?


A year and a half ago I wrote a blog post about data catalogs, libraries, Tom Waits, Nick Cave and a few other random things. I received a lot of feedback about it, and the analogy of a data catalog being like a library generated especially much commentary. For some readers it worked, whereas for others, not so much. A library is indeed quite static, at least in its traditional form, and requires manual work and centralized administration. This is not ideal for cataloging in a fast-moving data organization.

Over the last year or so I have had numerous conversations about data catalogs, both early-phase discussions about what these catalogs are and discussions about implementation experiences and best practices. It has become clear that the library analogy is only one part of the story and that it is important to embrace other aspects of data catalogs as well. One of these aspects is marketing and promoting your data assets.

Make the tree heard across the whole forest

The philosophical conundrum of whether a tree falling in the forest makes a sound when nobody is there to hear it may seem far-fetched as an introduction to data catalogs, but it is in fact spot on in pointing out a key benefit of data catalogs: no matter how much effort you put into acquiring, preparing, automating and publishing data in your data platform, if nobody hears about, sees, accesses or uses that data, does any of that data management effort really matter? This is where data catalogs come in. While most other data technology focuses on the technical aspects of managing and working with data, data catalogs also take on, directly or indirectly, the mission of promoting data.

Exchanging information within the data worker community used to be easy and straightforward. The teams were smaller, and the engineers/developers and the analysts were often co-located. Communication around data did not require much effort, as it was a by-product of the way the organization was constructed.

This is no longer the case in most organizations. As everyone aspires to leverage digitalization and become increasingly data-driven, the number of data workers has increased rapidly. At the same time, organizations have grown overall and found diverse ways to execute data work. This work is done at multiple sites, in multiple countries and through multiple employment models, typically blending employment, contracting, and outsourcing. Information about the data assets is no longer effectively exchanged by a dozen colleagues sharing the same office space, lunch hours and coffee breaks.

As an example, when I, the data engineer, have successfully published my latest dataset in the warehouse, it is no longer sufficient for me to proudly announce it to those sitting around me. Nor is it sufficient to assume that I can just answer questions as and when people around me ask them. We need something more scalable. We need a platform where I can announce the publication of the dataset, provide solid documentation and metadata, and clarify and discuss the data further with users from multiple parts of the organization. This promotional effort is a core purpose of a data catalog.

Everyone is a data worker now and you want them all onboard

But metadata is not just one-way anymore. While the volume and variety of datasets have increased, so have the volume and variety of people looking for knowledge about these data assets. And while the volume and variety of users have increased, so have the volume and variety of subject matter experts who can contribute to the organization’s common data knowledge. It is no longer just me, the data engineer, who can explain or provide examples of how to use the data. Any other user can do it too. Peers can and should be encouraged to share their explicit and tacit knowledge with each other, and the data catalog should drive and encourage this behavior.

When your organization captures and shares all that diverse information in one place, your “data trees” will truly start being heard, understood, and used in optimal ways. The goal of your data catalog should be to increase awareness, engagement, and usage of both the data and the metadata. Let me just repeat that: “increase awareness, engagement, and usage”. Sounds like a marketing campaign, right? Without marketing, your data platform is the tree that nobody hears falling.

A good data catalog implementation will help create a positive loop of active user engagement across the organization, high-quality metadata and data knowledge, and optimal use of data to improve business processes. It is work you need to do anyway to ensure your data efforts create value, and a catalog will make that promotional work much easier. Think of your data catalog as the place where your data people gather to share their stories and promote their work, and where all this information is captured for current and future generations to use. This is how you generate value out of your data assets and associated investments.