5 Features To Tackle Disparate Data in a Data Mesh

Image credit: iStockphoto/Dmi+T

The volume of data being created, captured, and consumed grows exponentially year over year. A major problem many organizations face when extracting value and insights from that data is that roughly 80% of it lives in disparate systems and infrastructure. Conventional analytics technology that connects to and gathers unstructured data from disparate sources can be complex and ineffective. Data mesh architecture, however, is being hailed as one of the most effective ways to analyze data in heterogeneous environments.

If your organization plans to go down the data mesh route, here are some features you’ll want to make sure your analytics platform has ready for this type of architecture.

  1. Ability to connect to external tables

If the analytics platform can connect to external tables, regardless of where they sit, it can greatly help with disparate data. Implementations vary: some solutions require the entire data set to be copied into the database before it can be accessed, while others query it in place without making copies. Where feasible, look for a flexible analytics engine that can analyze data where it lives and still deliver enough performance to handle large data sets. This minimizes data duplication and reduces data warehousing costs and complexity. If your data updates frequently, users can query the latest data directly, eliminating concerns about whether query results are valid and up to date. What’s more, since there is only a single copy of the data, all consumers see and query the same data.
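
As a minimal sketch of what querying data in place looks like, the snippet below uses DuckDB from Python to aggregate a Parquet file directly, without copying it into a managed database. The file name `sales.parquet` and its columns are hypothetical placeholders, not part of any specific platform’s API.

```python
import duckdb

# Query the Parquet file where it lives -- nothing is copied into a warehouse first.
# 'sales.parquet' and its columns (region, amount) are placeholders for illustration.
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total_sales
    FROM read_parquet('sales.parquet')
    GROUP BY region
""").df()

print(result)
```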

  2. Support for query acceleration, predicate pushdown, and partition pruning

Analytics query accelerators provide SQL or SQL-like support on a broad range of data sources that cannot deliver sufficient performance or ease of use on their own. Data and analytics teams must ensure the analytics tool they choose can hand workloads off to these subsystems so that query responses can be accelerated when needed. Similarly, pushing a query’s filters down to where the data lives (a.k.a. predicate pushdown) can drastically reduce query processing time because data is filtered out as early as possible rather than after it has been moved. Partition pruning is another optimization technique to look for in analytical databases: it lets the query run against a smaller data set, bypassing excluded partitions and improving query performance.
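
To make predicate pushdown and partition pruning concrete, here is a rough sketch using PyArrow datasets over a hypothetical Hive-partitioned directory of Parquet files; the path and column names are assumptions for illustration only.

```python
import pyarrow.dataset as ds

# Hypothetical layout: events/event_date=2024-01-01/part-0.parquet, ...
dataset = ds.dataset("events/", format="parquet", partitioning="hive")

# The filter is pushed down into the scan: partitions whose event_date does not
# match are pruned entirely, and non-matching row groups can be skipped using
# Parquet statistics, so far less data is read and processed.
table = dataset.to_table(
    filter=(ds.field("event_date") == "2024-01-01") & (ds.field("status") == "ok")
)
print(table.num_rows)
```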

  3. Availability of a wide range of ecosystem connectivity

A data ecosystem is made up of a variety of data sources and disparate applications across public cloud and on-premises environments. It is essential that the analytics engine can connect to various applications (via API) and layer or share metadata easily so that users can readily access and transform their data. Connectors are also useful when exchanging data for loading, report generation, and other common database tasks. Analytics should include connectors for both commercial and open-source applications and support a wide range of use cases, such as streaming processes (where data is streamed in real time from source applications), batch processes (where large volumes of data are moved between databases, data lakes, data warehouses, and applications), and driver-based processes (where disparate applications connect to databases through drivers such as ODBC, JDBC, and ADO.NET to execute specific queries).
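
As a small, hedged example of driver-based connectivity, the snippet below opens an ODBC connection from Python with pyodbc; the DSN, credentials, and table name are placeholders, and the same pattern applies to JDBC or ADO.NET clients in other languages.

```python
import pyodbc

# 'AnalyticsDB', the credentials, and the 'orders' table are hypothetical placeholders.
conn = pyodbc.connect("DSN=AnalyticsDB;UID=analyst;PWD=secret")
try:
    cursor = conn.cursor()
    # Parameterized query executed over the ODBC driver.
    cursor.execute("SELECT COUNT(*) FROM orders WHERE order_date >= ?", "2024-01-01")
    print(cursor.fetchone()[0])
finally:
    conn.close()
```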

  4. Capability for schema-on-read

With the rapid growth in semi-structured and unstructured data, many tools make you do extra work to define a schema while loading the data. To overcome this challenge, analysts need schema-on-read, where structure is applied only when the data is read. Schema-on-read can be a huge time saver when setting up a data mesh. In addition, platforms should be able to load and query both structured and unstructured data with evolving schemas (real-time changes to the structure of incoming data). They should be flexible and not mandate strict data source requirements, leaving the choice of storage structure, schema, ingestion frequency, and data quality to the data owner.
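
To illustrate the idea, here is a minimal schema-on-read sketch in Python: no table or schema is declared up front, and the structure is inferred only when the (hypothetical) newline-delimited JSON file is read.

```python
import pandas as pd

# 'events.jsonl' is a placeholder for semi-structured, newline-delimited JSON events.
# No schema is declared in advance; columns and types are discovered at read time.
events = pd.read_json("events.jsonl", lines=True)

print(events.dtypes)   # the schema, as inferred on read
print(events.head())
```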

  5. Interoperability with various file types and formats

In a data mesh architecture, data is often analyzed where it sits rather than being imported or translated into another format before analysis. This means the analytics tool must be able to look at a file, analyze its structure, and read from it automatically. There are several common data storage file formats, and the data mesh should be able to use them without issues. Typical file types include TEXT, CSV, Parquet, ORC, JSON, and Avro, as well as (but not limited to) compressed files such as BZIP, GZIP, and LZO. Ideally, these should all be seamlessly accessible from the data mesh and usable for analytics.
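
The sketch below shows the same kind of in-place access across a few of these formats from Python; the file names are hypothetical, and the libraries shown (pandas and PyArrow) stand in for whatever readers your analytics platform provides.

```python
import pandas as pd
import pyarrow.parquet as pq

# Hypothetical files in different storage formats, read without prior conversion.
csv_rows = pd.read_csv("logs.csv.gz", compression="gzip")   # gzip-compressed CSV
json_rows = pd.read_json("logs.jsonl", lines=True)          # newline-delimited JSON
parquet_rows = pq.read_table("logs.parquet")                # columnar Parquet

print(len(csv_rows), len(json_rows), parquet_rows.num_rows)
```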

Any large architectural change, such as a data mesh, requires a well-thought-out strategy before implementation or execution. Evaluating whether a data analytics platform can support business goals is one of the most critical steps that analytics teams must undertake before they finalize their overall data strategy. Discovering out-of-the-box tools and integrations beforehand can save a lot of frustration and costs while significantly accelerating your journey to a data mesh environment.

Steve Sarsfield, a director at Vertica, wrote this article. He writes about data governance and analytics, and his work includes a popular data governance blog, articles on Medium.com, and the book “The Data Governance Imperative.”

The views reflected in this article are the author's views and do not necessarily reflect the opinions of CDOTrends.