Data Observability: DataOps’ Shiny New Toy

Image credit: iStockphoto/iagodina

Once upon a time, data scientists practiced their dark (data) science in dark corners of the enterprise. They wanted data; their overlords listened and invested in data platforms and tools. Then, as workloads grew, they got data engineers and a DataOps team. Problem solved.

Well, not really.

The practical truth is that data pipelines have evolved. Modern data pipelines are not neat one-way data flows that lead from the source to the target repository. Ask data engineers, and they’ll attest to messy, complex, and interconnected pipelines.

This creates a host of issues. Common complaints are schema drifts, data anomalies, and slowdowns in data pipelines. It also creates new headaches for data engineers who are already drowning in their workloads.

This is why many DataOps teams are starting to take data observability seriously. According to New Relic's 2021 Observability Forecast, 90% of IT leaders see observability as critical to their business success, and 76% expect their observability budgets to increase this year.

The hope is that data observability tools will give data engineers a clearer picture of the health of their data. In a Forbes article, Rohit Choudhary, Acceldata’s founder and chief executive officer, likened running data systems without them to driving a car without a dashboard.

“In the automotive equivalent of a black box, you have no idea where you’re going or how fast your engine is revving or whether your tires are about to blow,” he wrote.

He added that you couldn’t blame companies. There weren’t many data observability tools. But that’s changing.

Observability vs. monitoring

Choudhary’s company calls itself a data observability platform company. Its tools, it claims, have helped GE Digital, PhonePe, and Thailand’s True Digital to better manage and optimize their enterprise data systems.

Then there are companies like Splunk, Monte Carlo Data, and Observe.ai (which partnered with Snowflake) that identify data system problems while keeping compute and data separate.

You also have the application performance management (APM) platforms, including older ones that predate the world turning data-hungry and data-driven. This category has evolved too, with players like Dynatrace, New Relic, and Datadog.

Companies like Cribl are helping to make sense of this growing space. Its centralized observability infrastructure plugs into data sources and observability tools. In doing so, it claims it gives its users the freedom to use other brands.  

All look to help DataOps teams deliver what data scientists have always wanted: better data quality. According to Gartner, poor data quality is a huge concern, costing organizations an average of USD 12.9 million a year. What’s not captured in this number is the untold pain caused by issues like model drift in machine learning projects that trace back to poor data.

From the “what” to “why”

The challenge for data engineers lies in finding out what data observability tools they need. And there are significant differences.

Many APM tools help us answer the “what” question — what went wrong. This falls within the realm of data monitoring that many major data vendors already do. The new crop of data observability tools tells you why it went wrong. They look at how up to date the data tables are (freshness), whether the data covers the right range (distribution), the amount and completeness of data (volume), schema issues (data structure changes), and lineage.
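To make the freshness, volume, and schema checks above concrete, here is a minimal Python sketch. The table metadata, expected schema, staleness SLA, and row-count threshold are all hypothetical illustrations, not any vendor's actual implementation:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical expected structure of an "orders" table.
EXPECTED_SCHEMA = {"order_id": "int", "amount": "float", "created_at": "timestamp"}

def check_table_health(last_loaded_at, row_count, schema,
                       max_staleness=timedelta(hours=6),
                       min_rows=1000):
    """Return a list of observability findings for one table."""
    findings = []
    # Freshness: is the data up to date?
    if datetime.now(timezone.utc) - last_loaded_at > max_staleness:
        findings.append("stale: last load exceeds freshness SLA")
    # Volume: did we receive roughly as much data as expected?
    if row_count < min_rows:
        findings.append(f"low volume: {row_count} rows < {min_rows} expected")
    # Schema: did the structure drift (columns added, removed, or retyped)?
    if schema != EXPECTED_SCHEMA:
        drifted = set(schema.items()) ^ set(EXPECTED_SCHEMA.items())
        findings.append(f"schema drift: {sorted(k for k, _ in drifted)}")
    return findings
```

A healthy table returns an empty list; a table that loaded seven hours ago with ten malformed rows would surface all three findings. Real platforms run checks like these continuously and infer the thresholds from history rather than hard-coding them.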

Getting to know the “why” helps DataOps teams improve their approach and practices. It can also help to streamline data, reduce workloads, and retain data engineering talent that might otherwise be eyeing the exit.

The future lies in AI and standardization

Data observability is only starting on its long road to improving data quality. Expect machine learning to make its mark as data observability becomes vital and DataOps teams look to machine learning for help.

Good examples include Monte Carlo Data, which uses AI models to watch for upstream dependency changes and quickly pinpoint what’s causing data quality problems. Observe.ai uses NLP and speech recognition to help call centers flag data shifts, repetitive patterns, and anomalies.
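The statistical intuition behind such anomaly detection can be sketched very simply. The example below flags a table's daily row count when it drifts more than a few standard deviations from its historical baseline (a z-score test); the row counts are invented, and this is a deliberately simple stand-in for the learned models these vendors actually ship:

```python
from statistics import mean, stdev

def volume_anomaly(history, today, threshold=3.0):
    """Flag today's row count if it sits more than `threshold`
    standard deviations away from the historical mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Constant baseline: any deviation at all is anomalous.
        return today != mu
    return abs(today - mu) / sigma > threshold

# Hypothetical daily row counts for one pipeline over the past week.
daily_rows = [10_120, 9_980, 10_050, 10_210, 9_900, 10_070, 10_000]

print(volume_anomaly(daily_rows, 10_100))  # within the normal range
print(volume_anomaly(daily_rows, 2_300))   # sudden drop, e.g. an upstream failure
```

A sudden drop like the second case is often the first visible symptom of an upstream dependency change, which is exactly the kind of signal these tools chase back to a root cause.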

The push for data standardization will also make data observability easier. OpenTelemetry is one example of how the industry is coming together to make data observability easier. Its SDKs, open tools, and APIs offer a standard way to collect telemetry data.

Whatever route data observability takes, it is here to stay. Properly deployed and used, it can help DataOps teams to stop being mere plumbers and become the pipeline magicians they were employed to be.

Winston Thomas is the editor-in-chief of CDOTrends and DigitalWorkforceTrends. He’s a singularity believer, a blockchain enthusiast, and believes we already live in a metaverse. You can reach him at [email protected].
