Where Do We Start When Building a Data Catalog?
- Patrick Egan
Many companies, when embarking on building a data catalog, will look outside their organization for guidance. This often involves turning to research-based subscriptions to provide waves, quadrants or other stacking mechanisms based on certain canned criteria. Consulting firms are keen to drop teams onsite with theoretical “best practice frameworks”. Both of these approaches can have value and may illustrate guidelines from others’ deployments, but it’s critical to keep your own organization’s perspective(s) on its truth(s) front and center and not waver. What exactly does this mean? In the next few paragraphs, we will discuss many of the nuggets of knowledge I have gained through my transition from IT guy to business-facing consultant and eventually to co-founder of a data governance start-up.
1. “Don’t boil the ocean”
If I had a dollar for every time I heard this quote I would retire to a nice golf course in the Caribbean and work on my handicap! Disregarding my dislike for the line, there is a semblance of truth to it. Most enterprises have a business vocabulary of between 2,000 and 6,000 individual terms, depending on the industry, geographic spread, and business drivers. This is not to be confused with the broader technology field definitions, which may run into hundreds of thousands if not millions of data elements in some big data initiatives. Within this business scope, some of the data is neither considered critical nor shared outside its encased environment. It’s important to establish scope as soon as possible to discover what data really drives the business.
Last year we needed a new washing machine; we began looking at the basic models and got caught up in the “wouldn’t it be cool to have...” feature sprawl. Roll on a year; we only use “normal wash” for all laundry loads. Last month, the machine warranty expired and we had to pay to replace a circuit board for a feature we never used, but whose failure prevented us from doing our normal laundry.
Why do I bring this up? After going through many RFPs, cool research surveys and consultants’ “must-have” requirements lists, customers add content and features to data catalogs (or in my scenario a washing machine) that have minimal value and are impossible or expensive to maintain. It’s important to take a step back and ask WHY. Speaking of value-to-effort scope, look out for my future blog, “Data Lineage’s Dirty Little Secret”, which describes my views on the value/effort needed for the collection and maintenance of lineage.
2. Data is shared more than you think
One of the biggest revelations in many workshops is when different departments discuss who has access to what data. The emotion usually evolves from eye-opening to jaw-dropping as the discussion progresses through how they get their hands on the data: everything from email to shared drives to the savvier off-the-grid custom solutions built in Microsoft Excel or Access. Business users are becoming more technology-aware and, in some cases, also savvy in the dark arts of collecting and managing “their” data.
3. Experts are Built – not Bought
When I see public job openings for Data Stewards or Subject Matter Experts, I typically suppress a little chuckle. It’s not because the job offering itself is humorous, but because it requires the person to “hit the ground running”! No doubt there are resources out there who are familiar with various industry datasets. However, there is a cultural learning curve that cannot be overlooked. There are no shortcuts to acquiring tribal knowledge. Gaining respect and cooperation is tied to winning acceptance into the tribe. This requires understanding (maybe not always agreeing with) their way of moving and using data in order to start unwinding the “data hairball”.
4. Raise Metrics to Show Value
Corporate dogma often dictates you can’t manage what you can’t measure. It’s worth noting that even with all the metrics in the world, the value must be applied, otherwise you just have a mass of meaningless statistics. When designing a data catalog, it’s important to keep both in perspective: build in key measurements to track engagement, assignments, and validations. Users also need to be engaged and see value in the solution, bubbling up insights. There are many techniques that can be applied, such as simple friendly user experiences, gamification, social networking or collaboration. The approach should be “sticky”, drawing in users to come back and use the system.
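As a rough illustration of building in those key measurements, here is a minimal Python sketch. It assumes the catalog can export a simple log of (user, event type) pairs; the event names "view", "assignment" and "validation" and the function summarize_events are hypothetical and not taken from any particular product.

```python
from collections import Counter


def summarize_events(events):
    """Roll a list of (user, event_type) pairs up into a few engagement metrics."""
    counts = Counter(kind for _, kind in events)
    active_users = {user for user, _ in events}
    return {
        "active_users": len(active_users),
        "views": counts["view"],
        "assignments": counts["assignment"],
        "validations": counts["validation"],
    }


# Hypothetical sample log for one week of catalog activity.
week_log = [
    ("ana", "view"),
    ("ana", "validation"),
    ("raj", "view"),
    ("raj", "assignment"),
    ("raj", "view"),
]
weekly_metrics = summarize_events(week_log)
```

Numbers like these only become useful once someone acts on them, for example by following up with a department that views terms often but never validates any.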
5. Don’t just Automate – Automate and Learn
In many cases (except in some very large organizations), the task of keeping a catalog current is not considered a full-time position. Typically, resources from the business side are “volunteered” into becoming Data Stewards with little to no mandate or relief from their existing workloads. Alternatively, the task is handed over to IT, who will automate the collection and harvesting of technical metadata, invariably missing the “truths” of other departments. The outcomes are unfruitful in both approaches. What’s really needed is the ability to augment the harvesting of metadata with the intelligence to interpret and assign the “truths” that various departments hold. The learning component is key to the success or failure of the catalog. We have to move towards leveraging these resources on an “as needed” basis, learning from their interactions and automating where it makes the most sense based on interactions with the catalog. Intelligent catalogs constantly monitor and gather expert decisions into a machine learning feedback loop to reduce the burden after each cycle. Over time, the goal is to reduce the need for expert resources to curate the data regardless of the growth in data volume.
Each of these five points is significant to the establishment of a useful catalog and, more importantly, to its ongoing maintenance. The enterprise data catalog has to facilitate the multidimensional view we have of data and provide some level of controls and commonality when data is shared across various domains. Taking Obi-Wan Kenobi’s wisdom to create a quote of my own regarding enterprise data management:
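To make the “automate and learn” idea a little more concrete, here is a minimal Python sketch of such a feedback loop. Everything in it is an assumption for illustration: the StewardFeedback class, the min_votes and min_agreement thresholds, and the simple agreement count that stands in for a real machine learning model. The idea is only that each steward decision is recorded, and the catalog starts assigning a business term on its own once enough consistent decisions have accumulated.

```python
from collections import Counter, defaultdict


class StewardFeedback:
    """Hypothetical store of steward decisions, keyed by normalized column name."""

    def __init__(self, min_votes=3, min_agreement=0.8):
        self.min_votes = min_votes          # decisions needed before automating
        self.min_agreement = min_agreement  # share of decisions that must agree
        self.decisions = defaultdict(Counter)

    @staticmethod
    def normalize(column_name):
        # Treat CUST_ID, cust-id and CustId as the same technical name.
        return column_name.lower().replace("_", "").replace("-", "")

    def record(self, column_name, business_term):
        """Store one expert decision mapping a column to a business term."""
        self.decisions[self.normalize(column_name)][business_term] += 1

    def suggest(self, column_name):
        """Return (term, action), where action is 'auto_assign' or 'ask_steward'."""
        votes = self.decisions.get(self.normalize(column_name))
        if not votes:
            return None, "ask_steward"
        term, count = votes.most_common(1)[0]
        total = sum(votes.values())
        if total >= self.min_votes and count / total >= self.min_agreement:
            return term, "auto_assign"
        return term, "ask_steward"


feedback = StewardFeedback()
for name in ["CUST_ID", "cust_id", "Cust-Id"]:
    feedback.record(name, "Customer Identifier")

known = feedback.suggest("custid")      # enough consistent decisions to automate
unknown = feedback.suggest("order_dt")  # never seen, so a steward is still asked
```

With each cycle, more columns cross the threshold and fewer questions reach the stewards, which is the reduction in burden described above. A production catalog would replace the agreement count with a trained model, but the loop has the same shape.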
Understand and maintain our differences, but manage to the common good.
Data catalogs do come with their challenges, but they are not insurmountable with the right mindset and approach in place. I will leave you with the final words of Obi-Wan Kenobi:
May the force be with you [on your data catalog journey]!