For many companies, data sharing is a necessary evil. It gives employees new ways to unlock insights, make better decisions, and fine-tune how they manage business processes. But it also opens up valid concerns and risks around data security and privacy, with penalties to the top and bottom lines growing heavier.
Jon Teo, data governance domain expert at Informatica, believes that many of the roadblocks stem from poor data governance. In this interview, he explains why companies need to start their data sharing journeys with solid data governance that encourages (not hinders) data sharing efforts.
When planning data sharing, companies fear confidential data abuse and privacy issues. How do you address this fear with your clients?
Teo: Promotion of data sharing practices needs to be supported by a proper data classification, privacy, and protection framework. Having a comprehensive way to detect sensitive data across various file systems, databases, and other repositories is often part of the strategy, as is having a method of controlling access by user groups and monitoring data use.
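To make the detection step concrete, here is a minimal sketch (not Informatica's implementation) of rule-based sensitive-data classification: sample values from each column are checked against regex patterns and tagged as likely PII. The pattern set and table names are illustrative; production classifiers rely on far richer rules, reference data, and ML models.

```python
import re

# Hypothetical pattern set; real classification frameworks go well beyond regexes.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s\-()]{7,}\d"),
    "credit_card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def classify_columns(table_sample: dict) -> dict:
    """Return the sensitive-data tags detected in each sampled column."""
    findings = {}
    for column, values in table_sample.items():
        tags = {
            label
            for label, pattern in PATTERNS.items()
            for value in values
            if pattern.search(str(value))
        }
        if tags:
            findings[column] = tags
    return findings

sample = {
    "contact": ["jane@example.com", "bob@example.org"],
    "notes": ["call +65 9123 4567 after 5pm"],
    "region": ["APAC", "EMEA"],
}
print(classify_columns(sample))  # {'contact': {'email'}, 'notes': {'phone'}}
```

Flags like these can then feed the access-control and monitoring policies described above, so that sensitive columns are masked or restricted by user group before any dataset is promoted for sharing.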
To get started quickly, many organizations choose to pilot their data sharing with data assets that are less sensitive, for example, promoting datasets and BI assets that are based on aggregated or deidentified data but have already proven useful to parts of the organization. This type of data sharing starts to build reuse and awareness.
How do you enforce strong data governance to guide data sharing?
Teo: In the context of data sharing, data governance is about setting the right expectations and policies to balance the trust in data usage with the openness or freedom to enable greater use of the data assets. Getting the proper governance structure and stakeholders is crucial, but there needs to be a clear justification for why and how data should be shared.
As an example of one such approach, Gartner has recommended adopting a “data sharing by default” policy: data owners and departments should be open to data sharing requests unless they have a justifiable reason not to.
There are four parts to data sharing: finding data, understanding data, trusting data, and accessing data. How is Informatica helping companies address all four areas?
Teo: All of the stages here are critical to operating data-sharing initiatives at scale. Understanding and trusting the data means that, apart from having a single ‘source of truth’ about the data universe (i.e., a data catalog), these assets need to be enriched with guidance about each data asset’s provenance, management, and usage terms.
Informatica typically operationalizes data stewardship. Any key data asset can be linked to its stakeholders, business processes, quality metrics, data origin, data certification status, and more. All of this enrichment comes into play to promote greater data access by the community. Data stewards can publish their curated datasets into a built-in data marketplace to promote data search and discovery, and offer data consumers an intuitive, governed data shopping experience.
The productization of data also becomes useful here, as logically defined collections of data can be more consumer-oriented. A data product can also include multiple assets, such as sets of related tables or views, AI models along with their training datasets, or compilations based on “most popular” or “frequently used together.”
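As a rough illustration of the stewardship enrichment and productization described above, the sketch below models a data product that bundles related assets together with the governed context a consumer would see in a marketplace listing. All class and field names (DataProduct, quality_score, usage_terms, and so on) are assumptions for this example, not Informatica's data model.

```python
from dataclasses import dataclass, field

@dataclass
class DataAsset:
    name: str           # e.g. a table, view, BI report, or model artifact
    asset_type: str     # "table", "view", "ml_model", ...
    origin: str         # short lineage summary / source system

@dataclass
class DataProduct:
    name: str
    owner: str                      # accountable data steward or domain owner
    certified: bool                 # data certification status
    quality_score: float            # e.g. 0..1 from profiling and quality rules
    usage_terms: str                # conditions consumers agree to on access
    assets: list = field(default_factory=list)

# Hypothetical listing: a consumer-oriented bundle rather than raw tables.
customer_360 = DataProduct(
    name="Customer 360",
    owner="crm-domain-team",
    certified=True,
    quality_score=0.94,
    usage_terms="Internal analytics only; no re-identification.",
    assets=[
        DataAsset("dim_customer", "table", "CRM extract"),
        DataAsset("churn_model_v3", "ml_model", "trained on dim_customer"),
    ],
)
```

Publishing a record like this to a marketplace is what lets consumers shop by business meaning, certification, and quality rather than by physical table names.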
Users can quickly identify the best data assets, and owners can approve and track access rights to data. This lets data consumers select the most appropriate dataset(s) based on multiple considerations and potentially interact with the data owners or creators to get clarifications.
Finally, the data-sharing initiative also generates metrics about how much data is being published and consumed, and how its trustworthiness increases over time. Informatica sees these types of ‘data activity analytics’ as a way to support data managers and analytics teams in managing and improving their overall data-sharing efforts.
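A minimal sketch of what such data activity analytics can start from, assuming a simple audit log of marketplace events (the event names and assets here are made up): aggregate the log into per-asset publish and consumption counts that data managers can track over time.

```python
from collections import Counter

events = [  # hypothetical audit-log entries: (asset, action)
    ("customer_360", "published"),
    ("customer_360", "access_requested"),
    ("customer_360", "access_granted"),
    ("sales_daily", "published"),
    ("customer_360", "access_requested"),
]

# Count activity per asset and action to spot heavily used vs. dormant assets.
metrics = Counter((asset, action) for asset, action in events)
for (asset, action), count in sorted(metrics.items()):
    print(f"{asset:<14} {action:<18} {count}")
```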
How can you avoid misinterpretation, and what’s the role of automation?
Teo: Misinterpretation often stems from a lack of understanding of the data context or from assumptions about the source or nature of the dataset being used. Automation can help to a large extent by quickly relating physical data elements, fields, or columns to specific data terms and definitions, helping the data user gain a correct understanding of the raw data.
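As a simple illustration of that automated association (the glossary entries and matching threshold below are assumptions, not a vendor algorithm), a physical column name can be fuzzily matched against a business glossary so the governed definition is surfaced next to the raw field:

```python
from difflib import SequenceMatcher

# Hypothetical business glossary: term -> governed definition.
GLOSSARY = {
    "customer_id": "Unique identifier assigned to a customer at onboarding.",
    "order_value": "Gross order amount in local currency, before discounts.",
    "churn_flag": "Indicator that the customer has lapsed in the last 90 days.",
}

def suggest_term(column, threshold=0.6):
    """Return (term, definition) for the closest glossary match, or None."""
    best_term, best_score = None, 0.0
    for term in GLOSSARY:
        score = SequenceMatcher(None, column.lower(), term).ratio()
        if score > best_score:
            best_term, best_score = term, score
    if best_term and best_score >= threshold:
        return best_term, GLOSSARY[best_term]
    return None

print(suggest_term("cust_id"))     # likely maps to "customer_id"
print(suggest_term("fax_number"))  # no close match in the glossary -> None
```

In practice, catalog tools combine name matching with profiling of the actual values, and a data steward confirms or corrects the suggested associations.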
This alone may not be enough. Automated mapping of data lineage provides further clues about the data asset: where it was collected (online, through point-of-sale, phone, etc.) and what types of alterations were made to the data elements along the way. Across a large data user community, using data well and avoiding misinterpretation will still require domain knowledge, experience, and intuition.
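To show the kind of question lineage answers, here is a minimal sketch (asset names and transformations are illustrative) that walks recorded lineage upstream from a curated mart table back to its source, printing the transformation noted on each hop:

```python
# Hypothetical lineage store: target -> [(source, transformation applied)].
LINEAGE = {
    "mart.customer_ltv": [("dw.orders_clean", "aggregated to lifetime value")],
    "dw.orders_clean": [("staging.pos_orders", "deduplicated, currency normalized")],
    "staging.pos_orders": [("source.point_of_sale", "nightly batch extract")],
}

def trace_upstream(asset, depth=0):
    """Print the upstream chain for an asset, including recorded transformations."""
    for upstream, transformation in LINEAGE.get(asset, []):
        print("  " * depth + f"{asset} <- {upstream}  ({transformation})")
        trace_upstream(upstream, depth + 1)

trace_upstream("mart.customer_ltv")
```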
In this respect, the use of shared data assets should also incorporate collaborative options and social cues on the dataset: feedback and ratings from other analysts and data scientists who may have used the dataset previously, and what their findings were.
If we see these measures as part of establishing “Federated Data Product Ownership” (promoted by the data mesh concept), then centrally established automation technologies become part of the common capabilities that all data domain owners can take advantage of, helping to drive consistency and efficiency as each domain group goes through its definition, curation, and publishing processes.
Ultimately, a data-driven company will see data workers using self-service to analyze and share insights. What steps can companies take to enable self-service analytics?
Teo: In terms of setting up the right foundations for data self-service, it starts with a clear organizational mandate and policy position on how the organization will leverage data and encourage self-service and data access moving forward. All the different steps — setting up data governance foundations, ensuring data trust, and enabling data sharing — then build up towards a trusted environment where the users can find and access the data, along with all the rich context required to make it useful in their current projects.
As a technology expert, what is currently lacking in the market landscape for self-service analytics and data sharing?
Teo: As many companies race towards establishing cloud data lakes, modernized data warehouses, and other data platforms, there is often a multiplicity of technology options, practices, and access channels to the data assets held in these different environments. This also reflects the different skill levels and preferences of data consumer groups. For example, data science and engineering personas are very comfortable writing code and manually scrubbing data, while data analyst and business personas expect one-stop data shopping and Excel-like data analysis.
As a result, it is all too easy to end up with different tools, user experiences, and access channels for data, with the concomitant risk of losing data governance consistency or, conversely, slowing some consumer groups down too much.
To address this, the pervasive use of enterprise metadata lets organizations see all their data assets as a single ‘data universe’ spanning every environment and data type, and helps enforce consistent controls. A single data management platform can then leverage this metadata for data stewardship, data engineering, data quality improvement, data access auditing, and more, while supporting the different consumption experiences of each consumer group.
This article is part of an eGuide that you can download here.
Winston Thomas is the editor-in-chief of CDOTrends and DigitalWorkforceTrends. He’s a singularity believer, a blockchain enthusiast, and believes we already live in a metaverse. You can reach him at [email protected].
Image credit: iStockphoto/cyano66