Data and AI firm Databricks announced that it has set a world record for the official 100 terabyte TPC-DS benchmark, used to evaluate the performance of data warehouse systems.
Databricks says the results, which were audited by the Transaction Processing Performance Council (TPC), outperformed the previous world record by 2.2 times. Specifically, Databricks SQL delivered 32,941,245 QphDS @ 100TB, beating Alibaba’s customized system, which achieved 14,861,137 QphDS @ 100TB.
QphDS is the primary metric for TPC-DS and represents the combined performance of several workloads: loading the data set, a sequential query test, a concurrent query test, and data maintenance functions that insert and delete data.
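As a quick sanity check, the 2.2 times figure can be reproduced from the two reported scores. The snippet below is only an illustrative sketch: the variable names are made up for this example, and the numbers are the QphDS @ 100TB results quoted in this article.

```python
# QphDS @ 100TB scores as reported in this article.
databricks_qphds = 32_941_245
alibaba_qphds = 14_861_137

# Ratio of the new result to the previous record.
speedup = databricks_qphds / alibaba_qphds

# Rounded to one decimal place, this matches the "2.2 times" claim.
print(f"Databricks vs. previous record: {speedup:.1f}x")
```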
Redefining the data lake
Databricks says the results prove that its data lakehouse architecture built on open data lakes can deliver better data warehousing performance than traditional data warehouses built using proprietary data formats.
Organizations have traditionally maintained two distinct data stacks: a data lake for data science and machine learning, and a data warehouse for business intelligence (BI) and analytics. The ability of BI tools to work directly on the data lake means organizations can now be spared the duplicate data and governance issues inherent in maintaining two sets of data.
Databricks says its world record offers compelling proof that the data warehousing capabilities it is bringing to data lakes offer the best of both worlds while delivering the performance that analysts and business users expect of their data warehouse.
Under the hood, Databricks uses its Lakehouse architecture, which is designed to enable efficient and secure AI and BI on data in data lakes. Key technologies are also open source, including its Delta Lake storage layer, which itself relies on the open source Apache Parquet format to store data.
In a lengthy blog post, the Databricks team explained how it overcame multiple technical barriers to achieve its leapfrog performance. Central to this is Photon, a new database engine written from scratch in C++ with modern hardware and parallel query processing in mind. The result is a massively parallel processing (MPP) engine that sits at the heart of this record.
“We focused initially on establishing our business not on data warehousing, but on related fields (data science and AI) that shared a lot of the common technological problems… [this] success then enabled us to fund the most aggressive SQL team build-out in history; in a short period of time, we’ve assembled a team with extensive data warehouse background,” wrote Reynold Xin and Mostafa Mokhtar.
According to Xin and Mokhtar, talent includes lead engineers and designers from the top data systems such as Amazon Redshift, Google BigQuery, and enterprise SQL systems from Oracle, IBM, and Microsoft.
Image credit: iStockphoto