Episode 2: Why Data Catalogs Are Vital (Really!)
- David Williams
“Where’s the right data when I need it?”
The rise of data clutter is a well-known problem. The scale and complexity of the information being collected today are beyond what was ever imagined. Metadata management and data catalogs improve data discovery. These data curation methods answer questions such as: where’s the data I need, how does it relate to other information in my organization, can I rely on it, and is it up to date?
Metadata management and catalogs have evolved through a number of generations over the last 15 years or so. A key concept was identified fairly early: placing business terms at the center of a web of connections, linked to data policies, rules, physical data, data quality results, integration script code, and many other attributes. Most crucially, business terms were connected to other business terms (IBM InfoSphere Business Glossary, 2008/9).
This approach has become an important paradigm in metadata management systems. Establishing conceptual models of business terms helps define other conceptual and logical models (differently structured models of the business area, for example). The web of business terms thus strings together the physical “things”.
How to Create a Complete View of the Enterprise Data Model
In principle, connecting everything to a single conceptual model (often referred to as a canonical model) means that you can follow connections from one physical object to another. Eventually, a full enterprise model develops. This is why the best data cataloging and metadata repository applications are often based on graph theory (we’ll delve into that in a future article).
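To make the idea of following connections concrete, here is a minimal sketch in Python. It assumes the catalog can be treated as a simple undirected graph of business terms and physical objects. Every name in it (the terms, the column paths, and the helper functions) is hypothetical and chosen only for illustration; it does not describe how any particular catalog product stores its links.

```python
from collections import deque

# Hypothetical catalog links: each edge joins a business term to a physical
# object, or to another business term. All names are invented for illustration.
EDGES = [
    ("term:Customer", "crm.accounts.cust_name"),
    ("term:Customer", "billing.invoices.customer_id"),
    ("term:Customer", "term:Order"),
    ("term:Order", "warehouse.orders.order_no"),
]

def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def find_path(graph, start, goal):
    """Breadth-first search: shortest list of nodes from start to goal, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sort neighbors so the traversal order is deterministic.
        for neighbor in sorted(graph[node]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

graph = build_graph(EDGES)
path = find_path(graph, "crm.accounts.cust_name", "warehouse.orders.order_no")
print(" -> ".join(path))
```

In this toy graph the two physical columns have no direct link. The only route between them runs through the business terms, which is the role the canonical model plays in the description above.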
You Say Tomato, I Say Lycopersicon
Conceptual models operate much like the language Esperanto, that is, as a single language through which all other languages are connected. Translating through one shared language takes far fewer translations than translating directly between every pair of languages. In theory, if everyone learns Esperanto, then we can all communicate in a common tongue. The same principle applies to a canonical model approach to metadata management: it lets you see all instances of “Customer”, for example, plus all its attributes in each instance.
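The Esperanto comparison can be put into rough numbers. The sketch below, in Python, assumes each system has its own vocabulary and counts how many mappings are needed with and without a shared model. It then shows a lookup of every field mapped to “Customer”. The system names, field names, and function names are all hypothetical.

```python
# Hypothetical local vocabularies: each system's own field name mapped to a
# canonical business term. System and field names are invented for illustration.
CANONICAL_MAP = {
    ("crm", "cust_name"): "Customer",
    ("billing", "client"): "Customer",
    ("support", "account_holder"): "Customer",
    ("billing", "invoice_no"): "Invoice",
}

def pairwise_mappings(n):
    """Direct mappings needed if each of n vocabularies maps to every other one."""
    return n * (n - 1) // 2

def canonical_mappings(n):
    """Mappings needed if each of n vocabularies maps once to a shared model."""
    return n

def instances_of(term, mapping=CANONICAL_MAP):
    """List every (system, field) pair mapped to the given canonical term."""
    return sorted(key for key, value in mapping.items() if value == term)

# Compare the two approaches for a few vocabulary counts.
for n in (3, 10, 50):
    print(n, pairwise_mappings(n), canonical_mappings(n))

print(instances_of("Customer"))
```

With 50 vocabularies, the pairwise approach needs 1,225 mappings while the shared model needs 50. The gap widens as more systems are added.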
Great Concept, “How do we get all that information?”
When catalogs first appeared, the intention was that data would be captured and managed manually. As the sheer size of that effort became clear, metadata capture tools evolved to address it.
Products were developed in the mid-to-late 2000s using early versions of machine-learning algorithms (GlobalIDs and Exeros in the early 2000s). These tools differed from earlier reverse-engineering solutions that analyzed data structures and sometimes reverse-engineered code.
These new tools would search through the data itself, find information, and assign similarities between like data. Once a correlation was established, it was up to a user to identify which component of the conceptual model that piece of data related to, in order to create the Esperanto-like view.
Once a critical mass of data is gathered and organized (by applying the conceptual model), the information can be stored and made searchable in the products we now call data catalogs.
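As a rough illustration of what “assigning similarities between like data” can look like, the Python sketch below compares the values held in different columns and flags pairs that overlap heavily. It uses Jaccard similarity, which is an assumption made for this example and not a description of how the products named above worked. The column names, sample values, and threshold are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of distinct values two columns have in common (0.0 to 1.0)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical sample values profiled from three columns.
COLUMNS = {
    "crm.accounts.cust_id": ["C001", "C002", "C003", "C004"],
    "billing.invoices.customer_ref": ["C002", "C003", "C004", "C005"],
    "hr.staff.employee_id": ["E100", "E101", "E102"],
}

def candidate_matches(columns, threshold=0.5):
    """Return column pairs whose value overlap meets the (assumed) threshold."""
    matches = []
    for (name_a, vals_a), (name_b, vals_b) in combinations(columns.items(), 2):
        score = jaccard(vals_a, vals_b)
        if score >= threshold:
            matches.append((name_a, name_b, round(score, 2)))
    return matches

for name_a, name_b, score in candidate_matches(COLUMNS):
    print(f"{name_a} ~ {name_b}: {score}")
```

The output is only a list of candidates. As the text notes, a person still has to decide which component of the conceptual model each matched pair belongs to.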
The Intrinsic Problem with Catalogs Emerges …
This presented a new problem. Connecting points to the conceptual model is a massive effort in its own right. Who is responsible for that effort and for keeping catalogs up to date? That burden often falls on the data management team or data stewards.
This problem is at the heart of why catalog efforts are so challenging. Businesses focus on the outcome: a usable and complete data catalog. Data teams often focus on the challenge and complexity of capturing and organizing the information needed to service a catalog. The gap between the two views is often where a catalog fails.
Here we’ve covered a bit of background and presented the core issue with data catalogs. You can find out how catalog initiatives fail, and what you can do to build a better foundation for your data programs, in “Avoid the Epic Data Catalog Fail”.