Avoid the epic data catalog fail

The Road is Littered with Failed Catalog Efforts

A lack of critical mass of content – Catalog projects have to compromise between a narrow and deep scope of data and metadata and a wide and shallow one, typically due to resource restrictions. With either approach, the chance that the information a user wants is missing stays high until there is a critical mass (i.e., when a user searches for an asset, they actually find it most of the time).
Users don’t see value from the catalog – While it’s possible to get buy-in and support for your initial data inventory effort, sustaining that support over the long term is impossible unless the business is getting value back. If they are going to a catalog and not finding what they want, they are not getting value.
Disconnect between the effort to build a catalog and the value from using it – Before critical mass, the effort required for everyone in the organization to populate the catalog on a day-to-day basis is greater than the value they get. As an incremental task without a payback, it gets abandoned when resources are squeezed or pressure slackens.
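
One way to make "critical mass" concrete is to track how often a search lands on an asset that is already cataloged. The snippet below is a minimal sketch of that idea, not a description of any particular catalog product: the function name search_hit_rate and the asset names are hypothetical, and it assumes you can log which asset each user was looking for.

```python
def search_hit_rate(searched_assets, cataloged_assets):
    """Return the fraction of searches whose target is already in the catalog."""
    if not searched_assets:
        return 0.0
    hits = sum(1 for asset in searched_assets if asset in cataloged_assets)
    return hits / len(searched_assets)


# Hypothetical example data: what the catalog holds vs. what users looked for.
cataloged = {"policy_master", "claims_ledger", "agent_directory"}
searched = [
    "policy_master",
    "reserve_model_inputs",
    "claims_ledger",
    "reinsurance_treaties",
]

rate = search_hit_rate(searched, cataloged)
print(f"Search hit rate: {rate:.0%}")
```

A rate that stays low over time suggests users are still in the phase where populating the catalog costs them more than it returns.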

Obvious Must-Haves

In addition to the challenges of catalog implementation projects, catalogs need to have a simple set of characteristics to be useful and effective. Surprisingly, many don’t have these minimum requirements for viability and are destined for a painful and embarrassing death.

Trustworthiness – I trust this information for analytics and compliance responses, and I am willing to stake my reputation on it.
Freshness – This information is current and up-to-date: real-time, or as close to it as possible.
Completeness – I am confident this information is definitive for my purposes, or I understand the limitations of this inventory.

“Field of Dreams” Failure

There is an assumption that the data technology folks know what the business side needs, but the requirements of the business are rarely obvious.

Professional data people often focus on the process of data management rather than the underlying business drivers. Building a fully integrated metadata-driven environment that connects business concepts to physical data stores through canonical models is a fascinating and professionally challenging endeavor. However, success comes from the business using this resource to do the things they want to do on a day-to-day basis: assess the impact of a change, identify who to speak with concerning a data issue, make an assessment on two different sources of information, ensure the data needed for a project is available and can be relied upon to meet the project’s objectives, etc.

To make this happen, the presentation of data in a catalog needs to be simple and easily accessed. The content needs to be easily navigated, and a user needs to know that the information they are looking for is either there or doesn’t exist.

Too often, elegant solutions are developed, and a high-level view of the enterprise’s data is constructed and connected through huge effort and endurance from a data team. It’s released to great fanfare, motivating users with the novelty of the solution. However, after an initial period of use, it falls into disuse. A catalog only stays alive if it is used: use is what keeps it current and makes it increasingly comprehensive.

Build it and they will come… Unfortunately, this seems to apply only to ghostly baseball. Build what you think needs to be built, and the user community will take a quick look, briefly think it is a good idea, and decide to come back when it has something in it they actually need. They never do.

Go Big or Go Home

The other pattern that data catalog efforts seem to fall into is a “Start Small or Proof of Concept (POC)” approach. While POCs are generally a great idea, the technology is much less important than the content itself when it comes to data catalogs. Introducing a catalog as technology often signals that the data group sees it as a data tool, not a business enabler. The use cases for a catalog almost invariably rely on a complete view of the enterprise. For example: Where is personal or private data, data-subject information, or other party-based information being kept? What is the most authoritative source for this data? Who is the organization’s data expert for X data?

In the insurance space, the number of assets that need to be described and cataloged just to support a single business unit can be enormous. ACORD (an insurance standards organization) defines over 80,000 attributes in their data model. To describe the data represented in an annual report, you may need to define thousands of terms, their relationships, and the business rules that apply to those terms to fully describe that report. You can be sure that the part the key user needs on any particular day is the part you haven’t done yet. Who is going to manually connect all of these dots? No one.

$1,000,000,000 at Stake

For a large insurer, understanding where the information behind business decisions comes from has direct financial consequences. The difference between a verifiable source of information and a vague understanding can lead to a difference of a billion dollars in reserved capital. Understanding the quality of the data in an actuarial model can mean the difference between a viable insurance product and a loss of hundreds of millions of dollars. A difference of a single percentage point in underwriting returns can wipe billions from an insurer’s stock value. Partial solutions often equate to unverifiable or unprovable outcomes; in insurance (and many other businesses), that is real money.

You simply cannot afford to “start small.” The first solution needs to bring value. If your target is the compliance or enterprise data view, you need to be exhaustive for a particular domain, geography, or business process area.

More to Come

Over a series of articles, we’ll discuss some of the challenges associated with implementing a catalog. We’ll also share some best practices to help you be successful in the implementation and adoption of this critical prerequisite for a data-driven organization.

Andrew Ahn