Whether you noticed it or not, an open-source project has been a quiet champion in the data processing and machine learning sphere. Under the hood, its efficient data interchange format has been leveraged by various well-known projects such as Apache Spark, Google BigQuery, and TensorFlow.
Apache Arrow
Specifically, Apache Arrow is used by the projects named above, as well as “many” commercial or closed-source services, according to software engineer and data expert Maximilian Michels.
At its core, Arrow was designed for high-performance analytics. It offers a standardized, language-agnostic specification for representing structured, table-like datasets in memory, using a columnar layout that supports fast data access and efficient analytic operations on modern hardware such as CPUs and GPUs.
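To make the idea of a columnar in-memory layout concrete, here is a minimal sketch in plain Python. It does not use Arrow or reproduce its actual memory format; the field names `id` and `price` and the sample values are made up for illustration. It only contrasts storing records row by row with storing each field as one contiguous, typed buffer.

```python
from array import array

# Row-oriented layout: one Python dict per record (hypothetical sample data).
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 12.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented layout: one contiguous, typed buffer per field.
# "q" is a signed 64-bit integer, "d" is a 64-bit float.
columns = {
    "id": array("q", (r["id"] for r in rows)),
    "price": array("d", (r["price"] for r in rows)),
}

# An aggregate over one field touches every record in the row layout...
row_total = sum(r["price"] for r in rows)

# ...but only a single buffer in the column layout.
col_total = sum(columns["price"])
```

Both totals are the same; the difference is that the columnar version reads one densely packed buffer, which is the kind of access pattern analytic workloads tend to favor.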
To support cross-language development, Arrow currently works with popular languages such as C, Go, Java, JavaScript, Python, R, and Ruby, among others. The project also includes DataFusion, a query engine written in Rust that works with Arrow data.
Great for data analytics
In a blog entry, Michels pointed to Arrow’s support for in-memory computing, its standardized columnar storage format, and its inter-process communication (IPC) and remote procedure call (RPC) framework for data exchange as its outstanding features.
So, if you are facing efficiency issues with manipulating or transferring data for your data science project, you might want to check out Apache Arrow.