AWS revolutionized IT. Instead of managing hardware, developers could set up an entire cloud infrastructure with a few clicks. AWS also released APIs to automate the clicking of those buttons, making it not only possible but fairly easy to write a script that said “create ten servers” and run it whenever you wanted.
There were problems, though. Each time you ran that script, you got ten more servers! And if you weren’t careful and ran it too often, you could end up with thousands of servers and a scary bill.
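To make the problem concrete, here is a minimal sketch of such an imperative script (assuming Python with boto3 and default AWS credentials; the AMI ID is a placeholder):

```python
# Imperative script: every run unconditionally launches ten MORE
# instances, no matter how many already exist.
import boto3

ec2 = boto3.client("ec2")

ec2.run_instances(
    ImageId="ami-12345678",   # placeholder AMI ID
    InstanceType="t2.micro",
    MinCount=10,
    MaxCount=10,
)
```

Run it once, you have ten servers; run it again, you have twenty.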
A typical solution was to patch the script with more complicated logic. This was no silver bullet, though: the script remained fragile because the state of the system was often too complex to account for.
Fast forward, and DevOps came along and revolutionized IT again. DevOps let you write scripts like “I want ten servers” that describe not what you want the system to do but how you want the system to be. Instead of always creating ten new servers, the script could create or remove servers depending on how many already existed. It could adapt to different situations and dynamically generate a plan to achieve the desired result.
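A minimal sketch of the declarative version, again assuming boto3 (the filter and AMI ID are illustrative): the script compares the desired count with the current count and plans whatever actions close the gap:

```python
# Declarative script: describe the desired state ("I want ten servers")
# and reconcile the real world toward it.
import boto3

DESIRED = 10
ec2 = boto3.client("ec2")

# Count the currently running instances. A real script would scope this
# filter to a tag identifying its own fleet.
resp = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)
instances = [i for r in resp["Reservations"] for i in r["Instances"]]
current = len(instances)

if current < DESIRED:
    # Too few: launch only the difference.
    ec2.run_instances(
        ImageId="ami-12345678",  # placeholder AMI ID
        InstanceType="t2.micro",
        MinCount=DESIRED - current,
        MaxCount=DESIRED - current,
    )
elif current > DESIRED:
    # Too many: terminate the surplus.
    surplus = [i["InstanceId"] for i in instances[: current - DESIRED]]
    ec2.terminate_instances(InstanceIds=surplus)
# If current == DESIRED, there is nothing to do: the system already
# matches the description.
```

Running this script twice is safe; running it a thousand times is safe. The description, not the run count, determines the outcome.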
This made DevOps scripts reliable, which was a game-changer. Declarative scripts, which describe how a system should be, let DevOps teams create reusable infrastructure-as-code that is open to collaboration and testing. It is how large DevOps teams work together to build and deploy improvements consistently.
By making data pipelines declarative in the same way, we can bring these benefits to DataOps. One way we do this today is through machine learning. Instead of telling the pipeline what to do with the data through complicated, fragile rules, we teach the system what the data should be through machine learning labels. Tamr learns the rules dynamically, adapting to the inherent nuances in the data.
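As a rough illustration of that shift (a generic sketch using scikit-learn, not Tamr’s actual implementation): rather than hand-writing a brittle rule such as “two records match if their names are equal,” you label example pairs and let a model learn the decision boundary from similarity features:

```python
# Generic sketch of learning a matching rule from labeled examples
# instead of hand-coding it. Illustrative only, not Tamr's implementation.
from sklearn.linear_model import LogisticRegression

# Each row holds similarity features for a pair of records,
# e.g. [name_similarity, address_similarity], scored from 0 to 1.
X = [
    [0.95, 0.90],  # near-identical records
    [0.90, 0.15],  # same name, different address
    [0.10, 0.85],  # different name, same address
    [0.05, 0.10],  # clearly different records
]
y = [1, 1, 0, 0]  # human-supplied labels: 1 = match, 0 = no match

model = LogisticRegression().fit(X, y)

# The "rule" is now learned. Relabeling and retraining adapts it to new
# nuances in the data, with no fragile hand-written logic to patch.
print(model.predict([[0.93, 0.88]]))  # -> [1]
```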
Machine learning handles the rules that run inside the data pipeline. But we are also improving how the pipeline itself is constructed. Today, we construct pipelines by telling Tamr what to do: “create a project, make some mappings, set the DNF, run a job, and so on.” As part of our ongoing modularization work, we are making it possible to tell Tamr what the pipeline should be. For example: “run a job with this exact pipeline, including its projects, mappings, DNF, and so on.”
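To illustrate the difference (the spec format and the apply_pipeline() function below are hypothetical, not Tamr’s actual API): an imperative client issues an ordered sequence of commands, while a declarative client submits one description of the whole pipeline and lets the system plan the steps:

```python
# Hypothetical sketch. The spec keys and apply_pipeline() are
# illustrative, not Tamr's actual API.

# Imperative style: an ordered sequence of commands.
#   create_project("customers")
#   add_mappings("customers", mappings)
#   set_dnf("customers", dnf)
#   run_job("customers")

# Declarative style: one document describing what the pipeline should be.
desired_pipeline = {
    "project": "customers",
    "mappings": [
        {"source": "cust_name", "target": "name"},
        {"source": "cust_addr", "target": "address"},
    ],
    "dnf": "name AND (address OR phone)",
}

def apply_pipeline(spec):
    """Hypothetical reconciler: diff the spec against the pipeline's
    current state, then create, update, or remove components so the
    pipeline matches the spec."""
    ...

apply_pipeline(desired_pipeline)
```

Just as with the server example, applying the same pipeline spec twice leaves the system unchanged, which is what makes experiments and deployments repeatable.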
By making the construction and the execution of pipelines reliable, we will make it easier for DataOps engineers to collaborate, experiment, test, and deploy their pipelines, ultimately leading to consistent improvements in the quality and insightfulness of their data.
Tianyu Zhu, a software developer at Tamr, wrote this article.
The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.