As a versatile and easy-to-learn programming language, Python is widely considered essential for data scientists. But because it is interpreted, it carries a heavy overhead that makes it much slower than compiled code, leading to long runtimes when working with very large data sets.
However, researchers from Brown University and MIT say they have developed a data science platform called Tuplex that runs data queries written in Python at the speed of native code.
The researchers unveiled Tuplex at SIGMOD 2021, a data processing conference held mostly online this year with a local physical event in Xi’an, China.
Tuplex for fast data science
When working with large data sets, data scientists typically interact with industry-standard data systems like Spark or Dask by writing specialized queries, which often consist of custom code or “user-defined functions” (UDFs), and are often written in Python.
Processing is then performed by distributing tasks across multiple processor cores or servers within a data center where they are processed in parallel to make light work of even giant data sets. The presence of UDFs written in Python, though, forces these systems to interpret the code line by line, resulting in significantly slower performance.
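To make the pattern concrete, here is a minimal sketch in plain Python of a UDF applied to partitions of a data set in parallel. It is not Spark, Dask, or Tuplex code: the names parse_price, process_partition, and run_query are hypothetical, and a thread pool stands in for the cores or servers a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_price(row):
    """Hypothetical UDF: turn a price string such as "$10.50" into a float."""
    return float(row["price"].lstrip("$"))


def process_partition(partition):
    # The Python interpreter executes the UDF once for every row,
    # which is where the per-row overhead comes from.
    return [parse_price(row) for row in partition]


def run_query(rows, num_partitions=2):
    # Split the input into contiguous chunks so the output keeps input order.
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
    # A thread pool is only a stand-in here; real systems ship each
    # partition to a separate core or server.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        parts = list(pool.map(process_partition, partitions))
    return [value for part in parts for value in part]


rows = [{"price": "$10.50"}, {"price": "$3.25"}, {"price": "$7.00"}]
prices = run_query(rows)
```

However the work is split up, every row still passes through the interpreted UDF, which is the cost a compiled approach aims to remove.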
The solution calls for native code, which Tuplex offers. To be clear, there are existing Python compilers that generate machine code. Numba and Weld, for instance, are designed to work with large data sets such as multi-dimensional NumPy arrays.
However, Numba and Weld are specific libraries and do not optimize the full end-to-end pipeline and UDFs, the researchers say. According to them, Tuplex optimizes the entire pipeline while delivering a similar user experience to PySpark and Dask.
Under the hood, Tuplex does its magic with a novel dual-mode execution model that compiles a highly specialized program for specific queries and common-case input data. Rows that fail to match this optimized fast path, which the researchers say account for only a small percentage of instances, are instead handled by falling back to an interpreter.
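The dual-mode idea can be illustrated with a small conceptual sketch in plain Python. This is not Tuplex's implementation, which compiles the fast path to native code; the functions fast_path, general_path, and run_dual_mode are hypothetical names, and the "common case" here is simply assumed to be a (str, int) row.

```python
def fast_path(row):
    # Specialized for the assumed common case: a (str, int) row.
    # Anything else is rejected rather than handled.
    name, quantity = row
    if type(name) is not str or type(quantity) is not int:
        raise TypeError("row does not match the common case")
    return (name.upper(), quantity * 2)


def general_path(row):
    # Slower, fully general handling for rows the fast path rejects.
    name, quantity = row
    return (str(name).upper(), int(quantity) * 2)


def run_dual_mode(rows):
    results = []
    fallback_count = 0
    for row in rows:
        try:
            results.append(fast_path(row))
        except TypeError:
            # Rare case: fall back to the general path for this row only.
            fallback_count += 1
            results.append(general_path(row))
    return results, fallback_count


results, fallback_count = run_dual_mode([("a", 1), ("b", "2"), ("c", 3)])
```

Because the fast path can assume fixed types, it can skip the checks and conversions that the general path must carry for every row.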
Faster than compiled Python
In their research paper, the researchers report that a wait time of 10 minutes can be reduced to just one second. They also report that data queries written in Python were processed in Tuplex up to 90 times faster than in Spark.
The researchers also compared Tuplex with Cython, an optimizing static compiler for Python, and found the latter to be six times slower than Tuplex. This was primarily because Tuplex generates LLVM IR, a low-level intermediate representation that is compiled to machine code and delivers performance similar to assembly.
“Tuplex outperforms [Cython] because it replaces C-API calls with native code, eliminates dispensable checks, and uses a more efficient object representation than [Cython], which use CPython’s representation,” wrote the authors.
“Tuplex is a new data analytics system that compiles Python UDFs to optimized, native code. Tuplex’s key idea is dual-mode processing, which makes optimizing UDF compilation [effective],” they summed up.
Smooth operator
Tuplex offers another advantage: It can deal with anomalous data common in large datasets. Corrupted records or data fields that deviate from expected data types can trigger a crash midway through a query – which can be extremely difficult to debug.
Tuplex avoids that by identifying anomalies and setting them aside as it runs, giving data scientists the option of repairing them after the query completes.
“We think this could have a major productivity impact for data scientists,” wrote Malte Schwarzkopf, an assistant professor of computer science at Brown and one of the developers of Tuplex, in a blog post.
“To not have to run out to get a cup of coffee while waiting for an output, and to not have a program run for an hour only to crash before it’s done would be a really big deal.”
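The set-aside-and-repair behavior described in this section can be sketched in plain Python as follows. This is only an illustration of the idea under simple assumptions, not Tuplex's API: run_with_quarantine and to_celsius are hypothetical names, and the "repair" step is a manual fix applied after the run.

```python
def run_with_quarantine(rows, udf):
    """Apply udf to every row; set failing rows aside instead of crashing."""
    good, failed = [], []
    for index, row in enumerate(rows):
        try:
            good.append(udf(row))
        except Exception as exc:
            # Keep the position, the raw row, and the error for later repair.
            failed.append((index, row, exc))
    return good, failed


def to_celsius(row):
    # Hypothetical UDF: expects a numeric Fahrenheit reading.
    return (float(row) - 32) * 5 / 9


readings = ["212", "32", "n/a", "50"]
good, failed = run_with_quarantine(readings, to_celsius)

# After the query completes, the anomalous rows can be inspected and repaired.
repaired = [None for _index, _row, _exc in failed]
```

The query runs to completion even though one record is malformed, and the failing row is reported with its index and exception rather than ending the run partway through.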
The Tuplex framework is released as open-source software, and source code and instructions to build it can be accessed here. The original research paper can be accessed here (pdf).
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.