MIT-Stanford project uses LLVM to break big data bottlenecks


The more cores you can use, the better — especially with big data. But the easier a big data framework is to work with, the harder it is for the resulting pipelines, such as TensorFlow plus Apache Spark, to run in parallel as a single unit.

Researchers from MIT CSAIL, the home of envelope-pushing big data acceleration projects like  and , have paired with the Stanford InfoLab to create a possible solution. Written in the Rust language,  generates code for an entire data analysis workflow that runs efficiently in parallel using the LLVM compiler framework.

The group  Weld as a “common runtime for data analytics” that takes the disjointed pieces of a modern data processing stack and optimizes them in concert. Each individual piece runs fast, but “data movement across the [different] functions can dominate the execution time.”

In other words, the pipeline spends more time moving data back and forth between pieces than actually doing work on it. Weld creates a runtime that each library can plug into, providing a common method to run key data across the pipeline that needs parallelization and optimization.

 for high-speed vector math.

InfoLab put together preliminary benchmarks comparing the native versions of Spark SQL, NumPy, TensorFlow, and the Python math-and-stats framework Pandas with their Weld-accelerated counterparts. The most dramatic speedups came with the NumPy-plus-Pandas benchmark, where the work could be amplified “by up to two orders of magnitude” when parallelized across 12 cores.

Those familiar with Pandas and want to take Weld for a spin can check out , a custom implementation of Weld with Pandas.

It’s not the pipeline, it’s the pieces

Weld’s approach comes out of what its creators believe is a fundamental problem with the current state of big data processing frameworks. The individual pieces aren’t slow; most of the bottlenecks arise from having to hook them together in the first place.

, Mozilla’s language for fast, safe software development. Despite its relative youth, Rust has an active and growing community of professional developers frustrated with having to compromise safety for speed or vice versa. There’s been talk of rewriting existing applications in Rust, but it’s tough to . Greenfield efforts like Weld, with no existing dependencies, are likely to become the standard-bearers for the language as it matures.