Presto! It’s not only an incantation to excite your audience after a magic trick, but also a name being used more and more when discussing how to churn through big data. While there are many deployments of Presto in the wild, the technology — a distributed SQL query engine that supports all kinds of data sources — remains unfamiliar to many developers and data analysts who could benefit from using it.
In this article, I’ll be discussing : what it is, where it came from, how it is different from other data warehousing solutions, and why you should consider it for your big data solutions.
Presto vs. Hive
Presto originated at Facebook back in 2012. Open-sourced in 2013 and managed by the Presto Foundation (part of the Linux Foundation), Presto has experienced a steady rise in popularity over the years. Today, several companies have built a business model around Presto, such as , with PrestoDB-based ad hoc analytics offerings.
Presto was built as a means to provide end-users access to enormous data sets to perform ad hoc analysis. Before Presto, Facebook would use Hive (also built by Facebook and then donated to the Apache Software Foundation) in order to perform this kind of analysis. As Facebook’s data sets grew, Hive was found to be insufficiently interactive (read: too slow). This was largely because the foundation of Hive is MapReduce, which, at the time, required intermediate data sets to be persisted to HDFS. That meant a lot of I/O to disk for data that was ultimately thrown away.
Presto takes a different approach to executing those queries to save time. Instead of keeping intermediate data on HDFS, Presto allows you to pull the data into memory and perform operations on the data there instead of persisting all of the intermediate data sets to disk. If that sounds familiar, you may have heard of Apache Spark (or any number of other technologies out there) that have the same basic concept to effectively replace MapReduce-based technologies. Using Presto, I’ll keep the data where it lives (in Hadoop or, as we’ll see, anywhere) and perform the executions in-memory across our distributed system, shuffling data between servers as needed. I avoid touching any disk, ultimately speeding up query execution time.
How Presto works
Different from a traditional data warehouse, Presto is referred to as a SQL query execution engine. Data warehouses control how data is written, where that data resides, and how it is read. Once you get data into your warehouse, it can prove difficult to get it back out. Presto takes another approach by decoupling data storage from processing, while providing support for the same ANSI SQL query language you are used to.
, “this query quantifies the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year.”
. Passionate about distributed systems, Ashish joined Ahana from WalmartLabs, where as principal engineer he built a multicloud data acceleration service powered by Presto while leading and architecting other products related to data discovery, federated query engines, and data governance. Previously, Ashish was a senior data architect at PubMatic where he designed and delivered a large-scale adtech data platform for reporting, analytics, and machine learning. Earlier in his career, he was a data engineer at VeriSign. Ashish is also an Apache committer and contributor to open source projects.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to .
Copyright © 2020 IDG Communications, Inc.