I recently worked on a content project for Alooma, an ETL and data pipeline service in the cloud. ETL stands for “extract, transform, load,” a time-tested process for loading mission-critical events into a central data store while making sure the data is correct.
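To make the three stages concrete, here is a minimal sketch of the classic extract, transform, load flow. The function names and the sample event data are illustrative only, not taken from Alooma or any real pipeline, and each stage is reduced to its simplest form.

```python
def extract(source):
    """Extract: pull raw records out of a source system."""
    return list(source)

def transform(records):
    """Transform: validate and normalize each record."""
    cleaned = []
    for r in records:
        if r.get("user_id") is None:  # drop malformed events
            continue
        cleaned.append({
            "user_id": int(r["user_id"]),
            "event": r["event"].strip().lower(),
        })
    return cleaned

def load(records, warehouse):
    """Load: append validated records to the central store."""
    warehouse.extend(records)
    return warehouse

# Hypothetical raw events, one of them malformed:
raw_events = [
    {"user_id": "1", "event": " Login "},
    {"user_id": None, "event": "click"},  # dropped in transform
]
warehouse = []
load(transform(extract(raw_events)), warehouse)
print(warehouse)  # [{'user_id': 1, 'event': 'login'}]
```

Even in this toy version, the transform stage is where most of the correctness work lives, which is exactly the part the testing courses discussed below spend their time on.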
A new take on ETL
Alooma and its competitors like XPlenty and Stitch Data have a new take on ETL. They promise to move huge volumes of data into data warehouses or other data stores effortlessly, with all the integration and data plumbing taken care of as a managed service.
If you ask them, you do not need an ETL tool or an ETL process, per se, anymore, because the managed data pipeline comes with its own industrial-strength ETL capabilities which are much easier to use than the ETL tools of old.
Can ETL really be that simple? I didn’t take their word for it.
“ETL in minutes,” as Alooma proclaims, sounds great on paper, but has the vendor really taken care of all the complexity in Udemy’s 248 ETL Testing lectures, obviating the need for this type of training?
I took the trouble of distilling those endless tutorials and lectures into discrete process stages, to identify “what really happens” in the ETL preparation and testing process.
I sat down with Alooma to see which of these stages are simply not relevant in a cloud-based ETL architecture, and which are actually relevant, but made easier by managed solutions.
It turns out that of 32 discrete stages or issues in the old ETL process:
- 17 stages are not relevant at all in a cloud architecture
- 15 stages are relevant but made easier in a cloud architecture
- Many of these 15 stages are handled transparently by the data pipeline platform, with little or no user intervention
(This is based on my analysis of the Alooma platform; give or take, it should be similar for competing vendors as well.)
The full details of all those ETL stages and how they are translated into the new architecture are beyond the scope of this post, but for those of nerdy inclination, watch out for a detailed writeup I’ll be releasing soon.
The bottom line is that, yes, a new cloud-based data pipeline can eliminate over half of the stages in the old ETL process, and because it dramatically simplifies the remaining stages, you can actually set up ETL, if not in minutes as advertised, then within hours or days, while old enterprise ETL projects could easily take years.
From manual bookkeeping to cash register
I thought of comparing it to bookkeeping for a physical store. Many years ago, stores had clerks who would meticulously copy each transaction into a day book, and then aggregate those transactions manually into debit and credit columns in a general ledger.
Imagine how many manual operations are required to record daily transactions for even a small store, how many mistakes are possible, and the extent of verification, testing, and auditing that would be required to run this process accurately at scale.
It is my impression that the ETL tools of old are like a calculator or spreadsheet that helps organize and streamline manual bookkeeping. They can definitely make things much easier, but they leave the process as is.
The new “ETL-inside” data pipelines are like a digital cash register. Imagine what a huge difference it makes for a store with manual bookkeeping to implement a cash register: a machine able to scan the barcodes on each item and automatically generate books for the store. So many manual steps eliminated in one swoop, and more importantly, the remaining steps abstracted away into a seamless operating environment.
So yes, the advertising is true: Ye olde ETL shoppe is now a 7-Eleven.
This article is published as part of the IDG Contributor Network.