One of the advantages of the cloud is scale. We don’t call the big three cloud platforms hyperscale for nothing; they have massive data centers all around the world with millions of servers that can be treated as pools of compute and storage. A modern distributed application can run across many cores of compute, each with its own memory, all addressing terabytes of storage. We’re abstracted away from all the physical infrastructure that makes up the cloud, treating everything we use as another set of services.
Combining that approach with some of the newer cloud hardware, like servers that can address as much as 4TB of memory, has changed the type of applications we can build. Instead of limiting our code to fit the servers we have, we can build applications that take advantage of the available resources in the cloud. Even pooling standard VMs gives us a platform where we can build large-scale systems, capable of working with truly big data.
The cloud and big, big data
That capability is perhaps the real value of the big cloud providers; their economies of scale mean they can purchase storage at a much lower cost than we can for our own data centers. With tools like Azure’s various Data Box devices we have the ability to link on-premises data sources to cloud services, either by wholesale shifting of files or by connecting on-premises networks to cloud storage. The prospect of delivering large amounts of data to the cloud is interesting because it mixes the data generation capabilities of modern business systems with the processing capabilities of the cloud.
If we can get our data to the cloud, how can we work with it? Until recently, much of the work done on cloud-scale data processing focused on using tools like Bigtable and Hadoop to analyze nonrelational data at scale. By using alternative data structures, we were able to process large amounts of data quickly, distributing our analysis across many compute nodes. Building on the technologies used to deliver consumer search engines such as Bing or Google has worked well for many classes of problem and many data sets. But it’s not what we need to work with the structured data in our line-of-business applications.