DataOps: Navigating the Data Wastelands

Image credit: iStockphoto/Rost-9D

After years of defining (and redefining) the term data-driven, companies are beginning to realize the stark truth: we’re rich in data and poor in actionable insights.

Blaming everything on bad data or misaligned data science teams misses the point. Most of the time, the insights are locked away in bits and bytes. While it is the job of data scientists and analysts to sift through the river of data for nuggets, traditional data management makes that difficult. Add multiple data teams with different goals, and the complexity multiplies.

We needed a better approach. Enter DataOps.

Andy Palmer, chief executive officer and co-founder of Tamr, points to Wikipedia for a comprehensive definition: “DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics.”

DataOps is not the data science version of DevOps. Nevertheless, the loose analogy puts its value proposition in context.

“If the purpose of DevOps was to increase feature velocity for the large internet companies, then the purpose of DataOps is to increase analytic velocity for the large enterprise,” Palmer explains.  

The significant difference from DevOps is that the core artifact here is not software but data. Data is also generated and consumed very differently, Palmer adds.

Consumption led

DataOps acknowledges that we need to respect data better.

“For decades, data has been treated as an exhaust from operational systems instead of as a strategic asset.  We’ve all got a ton of work to do in order to build the next generation of modern data engineering infrastructure in large enterprises — there are huge gaps and tons of legacy to be replaced,” says Palmer.

Doing so can free data scientists and analysts to focus on their jobs and spend less time preparing data. It also allows data engineers to weave together and manage different data pipelines in a more standardized fashion.

The challenge DataOps is trying to solve is not a technical one. For Palmer, it addresses the human challenge in data science.

Tamr takes it a step further by using machine learning to help humans cope with vast workloads. “At Tamr, we work with our customers to help them realize the power of using modern machine learning techniques to master their data and deliver on the unfulfilled promises of traditional MDM.”

Still, there are gaps. A major one is our approach to data governance. Currently, it is “source-based,” but Palmer believes it should be “consumption-based.”

He draws an analogy with water management. Many data governance programs focus on where the data comes from. For water, such sources are rivers, lakes, or rain clouds. While this is always good information, “the most important question for data as with water is what’s coming out of the faucet and do you want to drink it,” Palmer argues.

Consumption-based governance focuses on the data consumer. It puts a framework and monitoring in place based on the “information access policy of the company and the requisite information access [across] roles/personas.”
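To make the idea concrete, here is a minimal sketch of governing data at the point of consumption. It is an illustration only, not Tamr's product or any method Palmer describes: the `ACCESS_POLICY` table, the role names, the field names, and the `ConsumptionMonitor` class are all hypothetical. The sketch filters each record down to what a role is allowed to see and logs what was actually served.

```python
from dataclasses import dataclass, field

# Hypothetical information access policy: fields each role/persona may consume.
ACCESS_POLICY = {
    "data_citizen": {"customer_id", "region"},
    "data_analyst": {"customer_id", "region", "lifetime_value"},
    "data_scientist": {"customer_id", "region", "lifetime_value", "churn_score"},
}


@dataclass
class ConsumptionMonitor:
    """Filters records per role and keeps a log of what each consumer received."""

    log: list = field(default_factory=list)

    def serve(self, role, record):
        # Unknown roles get an empty allow-list, so they receive nothing.
        allowed = ACCESS_POLICY.get(role, set())
        served = {key: value for key, value in record.items() if key in allowed}
        withheld = sorted(set(record) - allowed)
        # Log at the "faucet": what actually came out, and for whom.
        self.log.append(
            {"role": role, "served": sorted(served), "withheld": withheld}
        )
        return served


monitor = ConsumptionMonitor()
record = {"customer_id": 1, "region": "APAC", "lifetime_value": 1200, "churn_score": 0.3}
citizen_view = monitor.serve("data_citizen", record)
```

The design choice worth noting is that the check and the log sit where the data is served, not where it originates, which is the distinction Palmer draws with his faucet analogy.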

“Most companies are so distracted with source-based governance that they don’t have the time or resources to govern the data which is being consumed — which I believe is by far the more important question,” said Palmer.

Dangerous behaviors

The point where data consumption occurs is where Palmer advises companies to begin their DataOps journey. Starting there ensures you manage the data that matters to data consumers, including data citizens, data analysts, data scientists, and developers.

For example, data citizens will consume data in an HTML page that looks like a table or a Wikipedia page. Data analysts will visualize data using tools or spreadsheets. Data scientists will consume data in a modeling system such as SAS, R, DataRobot, or a proprietary model they’ve likely built in Python. And developers will consume data as RESTful endpoints.

“One key thing to understand, in my humble opinion, is that the bottleneck in modern data engineering — a.k.a. DataOps — is not technology. It is people and their behaviors that are the biggest bottlenecks. How data is consumed, shared (or not), and managed by people is the primary challenge facing most projects that we experience,” said Palmer.

It is also people and behaviors that separate digital-native companies like Google or Amazon from other companies. They assume data is their biggest asset. “Whereas the non-digital natives are only beginning to realize the dysfunctional dogma of their organizations/people with regard to data,” Palmer added.

One bad behavior is data hoarding, where individuals keep data to themselves or their teams as a form of control. It hails from a time when information was viewed as career leverage.

But Palmer believes this is dangerous behavior. While many department heads cite security and privacy as reasons to restrict access, data sharing is still vital for companies to break away from their old dogmas.

“We would laugh if someone in a company was allowed to stash cash in their desk drawers. But people do this with data in large companies all the time,” he adds.  

Human-machine collaboration

With business users starved of actionable insights, real-time IoT data streams multiplying, and data-hungry enterprise-wide AI programs coming online, Palmer feels it will be difficult for humans alone to run DataOps. The future lies with human-machine collaboration.

He cites Marvin Lee Minsky as a critical influence on how Tamr approaches DataOps. He sees enormous advantages when humans and machines work together to drive efficiency.

“Marvin taught me two things — first, that no algorithm is useful without enough great data. Second, that it’s always about the human and machine working together,” explains Palmer.

The combination will allow large enterprises to scale their DataOps. Palmer points to Data Mastering as a great example of where human and machine collaboration “is so incredibly powerful.”

Early years

Essentially, Tamr uses active learning design principles to integrate human feedback into the models “so that we don’t waste the humans’ time with the same questions over and over again,” says Palmer. “You have to design with the human and machine working closely together.” 
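The active learning idea can be sketched in a few lines. This is a generic uncertainty-sampling illustration under stated assumptions, not Tamr's implementation: `most_uncertain`, `review_loop`, the `scores` dictionary of match probabilities, and the `ask_human` callback are all hypothetical names, and the step where a model retrains on the new labels is omitted. The loop asks a person about the record pair the model is least sure of and remembers every answer so the same question is never asked twice.

```python
def most_uncertain(scores, answered):
    """Return the unanswered pair whose match score is closest to 0.5."""
    candidates = [pair for pair in scores if pair not in answered]
    if not candidates:
        return None
    return min(candidates, key=lambda pair: abs(scores[pair] - 0.5))


def review_loop(scores, ask_human, budget):
    """Spend up to `budget` human questions on the most uncertain pairs.

    `scores` maps a pair of record names to a model's match probability.
    `ask_human` is a callback that returns True if the pair is a match.
    """
    answered = {}
    for _ in range(budget):
        pair = most_uncertain(scores, answered)
        if pair is None:
            break  # every pair already has a human answer
        answered[pair] = ask_human(pair)
    return answered


# Hypothetical match scores for candidate duplicate records.
scores = {
    ("Acme Inc", "ACME Incorporated"): 0.55,
    ("Acme Inc", "Apex Ltd"): 0.05,
    ("Globex", "Globex Corp"): 0.48,
}
labels = review_loop(scores, ask_human=lambda pair: True, budget=2)
```

With a budget of two questions, the reviewer is only asked about the two borderline pairs; the pair the model already scores as a clear non-match never reaches a human.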

Yet Palmer feels it is still early days for DataOps. “We’ve got a lot of heavy lifting to do over the next 10 to 20 years to make DataOps a reality in the enterprise,” he says.

Winston Thomas is the editor-in-chief of CDOTrends, HR&DigitalTrends and DataOpsTrends. He is always curious about all things digital, including new digital business models, the widening impact of AI/ML, unproven singularity theories, proven data science success stories, lurking cybersecurity dangers, and reimagining the digital experience.
