Azure Databricks: Fast analytics in the cloud with Apache Spark


We’re living in a world of big data. The current generation of line-of-business computer systems generates terabytes of data every year, tracking sales and production through CRM and ERP. It’s a flood of data that’s only going to get bigger as we add the sensors of the industrial internet of things, along with the data that’s needed to deliver even the simplest predictive-maintenance systems.

Having that data is one thing; using it is another. Big data is often unstructured, spread across many servers and databases. You need something to bring it together. That’s where big data analysis tools like Apache Spark come into play; these distributed analytical tools work across clusters of computers. Building on techniques developed for the MapReduce algorithms used by tools like Hadoop, today’s big data analysis tools go further to support more database-like behavior, working with in-memory data at scale, using loops to speed up queries, and providing a foundation for machine learning systems.
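To make the MapReduce idea concrete, here is a minimal sketch in plain Python, not Spark or Hadoop code. It assumes a tiny, made-up data set split into two "partitions" that stand in for data held on separate workers; the function names are illustrative only. Each partition is mapped to key/value pairs, the pairs are grouped by key, and each group is reduced to a single count.

```python
from collections import defaultdict

# Hypothetical sample data: each inner list stands in for a partition on one worker.
partitions = [
    ["sales crm sales", "erp sales"],
    ["crm sensor", "sensor sensor erp"],
]

def map_phase(lines):
    """Emit a (word, 1) pair for every word in one partition."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}

def word_count(parts):
    # In a real cluster the map calls would run in parallel, one per partition.
    mapped = [pair for part in parts for pair in map_phase(part)]
    return reduce_phase(shuffle(mapped))

counts = word_count(partitions)
print(counts)
```

A distributed engine runs the same three steps across many machines; the in-memory approach described above largely comes down to keeping intermediate results like `mapped` in RAM so that repeated or looping queries don't have to reread them from disk.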

Apache Spark is fast, but Databricks is faster. Founded by the Spark team, Databricks offers a cloud-hosted Spark service that takes advantage of public cloud services to scale rapidly and uses cloud storage to host its data. It also offers tools to make it easier to explore your data, using the notebook model popularized by tools like Jupyter Notebooks.

Microsoft’s new support for Databricks on Azure, called Azure Databricks, signals a new direction for its cloud services, bringing Databricks in as a partner rather than through an acquisition.

Results can be surfaced in a dashboard, making Azure Databricks part of an end-to-end data architecture that allows more complex reporting than a simple SQL or NoSQL service, or even Hadoop.

Microsoft plus Databricks: a new model for Azure Services

Microsoft hasn’t yet detailed its pricing for Azure Databricks, but it does claim that it can improve performance and reduce cost by as much as 99 percent compared to running your own unmanaged Spark installation on Azure’s infrastructure services. If Microsoft’s claim bears out, that promises to be a significant saving, especially when you factor in no longer having to run your own Spark infrastructure.

Azure’s Databricks service will connect directly to Azure storage services, including Azure Data Lake, with optimizations for queries and caching. There’s also the option of using it with Cosmos DB, so you can take advantage of global data sources and a range of NoSQL data models, including MongoDB and Cassandra compatibility—as well as Cosmos DB’s graph APIs. It should also work well with Azure’s data-streaming tools, giving you a new option for near real-time IoT analytics.

If you’re already using Databricks’ Spark tools, this new service won’t affect you or your relationship with Databricks. It’s only if you take the models and analytics you’ve developed on-premises to Azure’s cloud that you’ll get a billing relationship with Microsoft. You’ll also have fewer management tasks, leaving you more time to work with your data.

Microsoft’s decision to work with an expert partner on a new service makes a lot of sense. Databricks has the expertise, and Microsoft has the platform. If the resulting service is successful, it could set a new pattern for how Azure evolves in the future, building on what businesses are already using and making them part of the Azure hybrid cloud without absorbing those services into Microsoft.