This Open Source Tool Makes Data Validation Easier

Image credit: iStockphoto/NicoElNino

Companies extract millions of data points from various sources. Data goes through further steps of replication and migration to improve availability, accessibility, and system resilience. 

However, replicating and migrating data is complex, prone to failure, and can expose systems to breaches. The problem gets worse when validation has to span two different databases. Even so, data validation is necessary to ensure data integrity.

Data reliability platform Datafold has launched Open Source Data-Diff, a first-of-its-kind open-source command-line tool and Python library for validating data replication and migration across databases. It uses high-performance algorithms that let data engineers validate data pipelines at scale in seconds. The release is an upgrade to the company's original Data Diff tool, which dealt with a single database.

Data-diff uses checksums to verify quickly and efficiently that two different data sources are 100% consistent. For example, it can rapidly validate a row-level comparison of 100 million records without errors in the resulting comparison.
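To make the checksum idea concrete, here is a minimal Python sketch of one common way to use checksums for table comparison: hash a range of keys on each side, and only drill into a range when the two hashes disagree. It is an illustration, not Datafold's implementation. The in-memory dictionaries standing in for tables, the integer primary keys, and the names `checksum`, `diff_range`, and `min_size` are all assumptions made for the example.

```python
import hashlib


def checksum(table, lo, hi):
    """Hash every row whose key falls in [lo, hi), in key order."""
    digest = hashlib.md5()
    for key in sorted(k for k in table if lo <= k < hi):
        digest.update(repr((key, table[key])).encode("utf-8"))
    return digest.hexdigest()


def diff_range(source, target, lo, hi, min_size=4):
    """Return (key, source_row, target_row) for rows that differ in [lo, hi)."""
    # Matching checksums: treat the whole range as identical and stop here.
    if checksum(source, lo, hi) == checksum(target, lo, hi):
        return []
    # Small range: fall back to a direct row-by-row comparison.
    if hi - lo <= min_size:
        diffs = []
        for key in range(lo, hi):
            s, t = source.get(key), target.get(key)
            if s != t:
                diffs.append((key, s, t))
        return diffs
    # Otherwise split the key range in half and check each half separately.
    mid = (lo + hi) // 2
    return (diff_range(source, target, lo, mid, min_size)
            + diff_range(source, target, mid, hi, min_size))


# Hypothetical data: two "tables" that differ in one changed and one missing row.
source = {i: f"row-{i}" for i in range(1000)}
target = dict(source)
target[137] = "changed"
del target[900]

differences = diff_range(source, target, 0, 1000)
print(differences)
```

When the two sides mostly agree, most key ranges are dismissed after a single checksum comparison, so only the few ranges that contain a discrepancy are ever compared row by row.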

Open source data-diff also allows Datafold to provide customers coverage throughout the extract, load, transform (ELT) process.

“Data-diff fulfills a need that wasn’t previously being met. Every data-savvy business today replicates data between databases in some way, for example, to integrate all available data in a warehouse or data lake to leverage it for analytics and machine learning,” said Gleb Mezhanskiy, Datafold founder and chief executive officer.

At present, data engineers compare data using a range of methods, from simple row counts, which are fast but not comprehensive, to row-level analysis, which is accurate but slow. Datafold’s approach lets developers and data analysts compare numerous databases quickly, removing the need to build makeshift diff tools.
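The trade-off between those two methods can be shown with a small Python sketch. It is purely illustrative: the dictionaries stand in for tables, and the function names `row_count_matches` and `row_level_diff` are hypothetical, not part of any tool mentioned here.

```python
def row_count_matches(source, target):
    """Fast check: only compares how many rows each side has."""
    return len(source) == len(target)


def row_level_diff(source, target):
    """Thorough check: compares every row by key and reports mismatches."""
    keys = sorted(set(source) | set(target))
    return [
        (key, source.get(key), target.get(key))
        for key in keys
        if source.get(key) != target.get(key)
    ]


# Hypothetical data: same number of rows, but one value was altered in transit.
source_rows = {1: "alice", 2: "bob", 3: "carol"}
target_rows = {1: "alice", 2: "bobby", 3: "carol"}

counts_ok = row_count_matches(source_rows, target_rows)
mismatches = row_level_diff(source_rows, target_rows)
print(counts_ok, mismatches)
```

Here the row counts agree even though one record differs, which is the kind of discrepancy a count-only check cannot see and a row-level comparison catches at the cost of reading every row.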

Mezhanskiy added that data-diff solves problems of one-off manual checks, tedious investigations of discrepancies, and customer distrust of data replication by providing an easy way to validate the consistency of data sets across databases at scale. The tool can compare one-billion-row data sets across different databases in less than five minutes on a regular laptop, and it can be easily embedded into existing workflows and systems.

Data-diff is released under the MIT license and includes connectors for BigQuery, MySQL, Oracle, Postgres, Presto, and Redshift. The company aims to expand its contributor network to add more data sources and other business applications.
