The Road To Quality Data

Is a given dataset fit for purpose? This is a question that data scientists grapple with on a daily basis as they seek to draw insights or train new machine learning models as part of their job.

This isn’t a frivolous consideration but a mission-critical one. We know that bad data can result in anything from wrong business decisions and poorly defined KPIs to lost productivity and discrimination or bias in AI models.

But cleaning data is easier said than done. At our latest CDOTrends Digi-Live! Summit virtual event, “How Changing Business Needs Transforms Data Analytics,” held this week, panelists noted that some data are “fairly clean”. However, they also pointed out that the effort to consistently get clean data is non-trivial and can be extraordinarily challenging with legacy data.

Measuring data quality

The popular adage that you can’t manage what you can’t measure suggests that organizations should establish a way to measure data quality before rolling up their sleeves to clean up their data.

In an opinion piece on InformationWeek, William McKnight of the McKnight Consulting Group offered recommendations on measuring data quality, observing that good quality data can be defined as data free of intolerable defects.

“There is a finite set of possibilities that can constitute data quality defects and that categorize all data quality rules, such as data existence, referential integrity, expected uniqueness, expected cardinality, accurate calculations, data within expected bounds and just simply correct data,” he wrote.
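To make these categories concrete, here is a minimal sketch of how such rule checks might look in practice. The pandas code below, including the orders and customers tables and every column name, is purely hypothetical and not drawn from McKnight’s piece; it simply illustrates one check per defect category.

```python
import pandas as pd

# Hypothetical tables, invented purely for illustration
orders = pd.DataFrame({
    "order_id":    [1, 2, 2, 4],
    "customer_id": [10, 11, None, 99],
    "quantity":    [5, -1, 3, 2],
    "unit_price":  [2.0, 3.0, 1.5, 4.0],
    "total":       [10.0, -3.0, 4.5, 9.0],
})
customers = pd.DataFrame({"customer_id": [10, 11, 12]})

checks = {
    # Data existence: required fields must not be null
    "existence": orders["customer_id"].notna(),
    # Expected uniqueness: order_id should identify exactly one row
    "uniqueness": ~orders["order_id"].duplicated(keep=False),
    # Referential integrity: customer_id must exist in customers
    "referential_integrity": orders["customer_id"].isin(customers["customer_id"]),
    # Data within expected bounds: quantities must be positive
    "bounds": orders["quantity"] > 0,
    # Accurate calculations: total should equal quantity * unit_price
    "calculation": orders["total"].eq(orders["quantity"] * orders["unit_price"]),
}

for rule, passed in checks.items():
    print(f"{rule}: {passed.mean():.0%} of rows pass")
```

Each check yields a pass/fail verdict per row, so the fraction of passing rows doubles as a simple per-rule quality metric.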

According to McKnight, the first step of measuring data quality begins with an inventory. By narrowing down the evaluation to specific metrics, he argues, organizations can turn vague “feelings” of data dirtiness into something tangible.

Software can then be used to profile the data, measuring how closely the selected data conform to predefined rules. This allows data quality to be scored for a particular repository, and the repository scores can then be prorated to obtain an overall score.
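McKnight’s piece does not spell out the proration formula, but one plausible reading is a weighted average of repository scores, with weights proportional to repository size. The sketch below uses made-up repository names, scores, and row counts purely for illustration.

```python
# Hypothetical per-repository scores: fraction of rule checks passed,
# prorated (weighted) by the number of rows each repository holds.
repo_scores = {          # repository -> (quality score, row count)
    "orders":    (0.88, 120_000),
    "customers": (0.97,  15_000),
    "invoices":  (0.91,  80_000),
}

total_rows = sum(rows for _, rows in repo_scores.values())
overall = sum(score * rows for score, rows in repo_scores.values()) / total_rows
print(f"Overall data quality score: {overall:.1%}")
```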

On the bright side, McKnight notes that data quality improvements are typically made to a small subset of data elements, as most data elements already conform to the standard – and presumably because organizations with massive data problems are soon out of business.

Tools for quality data

While the expert inputs of data professionals are vital to improving data quality, it is gratifying that modern data tools can significantly accelerate the process. For instance, self-service data labeling platforms are gaining attention as businesses turn to a combination of humans and tools to curate data sets for analysis and power data-driven initiatives.

In a contributed opinion piece, Stuart Harvey, the CEO of Datactics, argued for democratizing data quality through a self-service platform. Detractors may call Harvey biased, since Datactics sells an enterprise self-service data platform, but his arguments bear hearing out.

Legacy data quality tools were traditionally owned by IT teams because of the significant technical skills needed to use them, says Harvey. Yet this creates a bottleneck, as IT teams often lack knowledge about the content of the data.

“If a central IT department is to maintain quality of data correctly, it must liaise with many of these business users to correctly implement the data quality controls and remediation required. This creates a huge drain on IT resources and a slow-moving backlog of data quality change requirements within IT that simply can’t keep up,” explained Harvey.

The solution, he argues, is a self-service data quality platform that shifts responsibility away from IT to a centralized data management team. Different groups of data users will require different capabilities, of course. For instance, a data analyst will need a data profiling and rules studio, while business users will need an easy-to-understand data quality dashboard.

The objective is to achieve data agility, he says. This would allow new datasets to be onboarded quickly, and fluid data quality demands to be met by well-equipped data professionals.

“Organizations that successfully embrace and implement a self-service data quality platform are more likely to benefit from actionable data, resulting in deeper insight and [in] more efficient compliance, in turn unlocking significant competitive advantage,” concluded Harvey.

Do you agree about the importance of a self-service data quality platform?

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/vovashevchuk