IDG Contributor Network: The clash of big data and the cloud


Recently, I visited a few conferences and I noticed a somewhat hidden theme. While a lot of attention was being paid to moving to a (hybrid) cloud-based architecture and what you need for that (such as cloud management platforms), a few presentations showed an interesting overall development that everybody acknowledges but that does not get a lot of close attention: the enormous growth of the amount of digital data stored in the world.

What especially caught my attention was a presentation from PureStorage (a storage vendor) that combined two data points from two other vendors: first, a June 2017 Cisco white paper that extrapolates the growth of internet bandwidth, and second, a Seagate-sponsored IDC study that extrapolates the growth of data in the world. PureStorage combined both extrapolations in the following figure (reused with permission):


PureStorage’s depiction of the clash between world data growth and world internet bandwidth growth.

These trends, if they become reality (and there are reasons enough to think these predictions are reasonable), are going to have a major impact on the computing and data landscapes in the years to come. And they will especially impact the cloud hoopla that is still in full force. Note: The cloud is real and will be , but simplistic ideas about it being a panacea for every IT ailment are strongly reminiscent of the “new economy” dreams of the dot-com boom. And we know how that ended.

The inescapable issue

Anyway, there are two core elements of all IT: the data and the logic working with/on the data. Big data is not just about the data. Data is useless (or as would have it: meaningless) unless it can be used. What everybody working with big data already knows: To use huge amounts of data, you need to bring the processing to the data and not the data to the processing. Having the processing at any “distance” creates such a transport bottleneck that performance decreases to almost nothing and any function of that logic .
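To make the transport bottleneck concrete, here is a rough back-of-the-envelope sketch in Python. It is an illustration, not a measurement: the function name `transfer_hours`, the data volumes, the link speeds, and the 80 percent sustained-efficiency figure are all assumptions chosen for the example, not numbers from the Cisco or IDC studies.

```python
def transfer_hours(data_terabytes: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move a data set over a network link.

    data_terabytes: size of the data set in decimal terabytes (1 TB = 1e12 bytes).
    link_gbps: nominal link speed in gigabits per second.
    efficiency: assumed fraction of the nominal speed that is actually sustained.
    """
    if data_terabytes < 0:
        raise ValueError("data_terabytes must not be negative")
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency must be in (0, 1]")
    bits_to_move = data_terabytes * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical scenarios: (data set size in TB, link speed in Gbps).
    scenarios = [(1, 1), (100, 10), (10_000, 10)]
    for terabytes, gbps in scenarios:
        hours = transfer_hours(terabytes, gbps)
        print(f"{terabytes:>6} TB over {gbps:>2} Gbps: about {hours:,.1f} hours")
```

Under these assumptions, moving 100 TB over a 10 Gbps link takes roughly 28 hours, and the time grows linearly with the amount of data. If data volumes grow faster than bandwidth, the gap only widens, which is the reason for bringing the processing to the data.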

) on the possibility that cloud providers might be extending into your datacenters, but the colocation pattern is another possible solution for solving the inescapable bandwidth and latency issues arising from the exponential growth of data.

The situation may not be as dire as I’m sketching it. For example, maybe the actual average volatility of all that data will ultimately be very low. On the other hand, you would not want to run your analytics on stale data. But one conclusion can be drawn already: Simply assuming that you can distribute your workloads to a host of different cloud providers (the “cloud … yippy!” strategy) is risky, especially if at the same time the amount of data you are working with grows exponentially (which it certainly will, if everyone wants to combine their own data with streams from Twitter, Facebook, etc., let alone if those combinations spawn all sorts of new streams).