IDG Contributor Network: Use the cloud to create open, connected data lakes for AI, not data swamps


Every organization produces data, which makes it the common denominator across industries as we look to advance how cloud and AI are incorporated into our operations and daily lives. Before the potential of cloud-powered data science and AI can be fully realized, however, we first have to grapple with the sheer volume of data. That means figuring out how to turn its velocity and mass from an overwhelming firehose into an organized stream of intelligence.

To capture all the complex data streaming into their systems from various sources, businesses have turned to data lakes. Often hosted in the cloud, these are storage repositories that hold enormous amounts of data, raw or refined, structured or unstructured, until it’s ready to be analyzed. The concept seems sound: the more data companies collect, the less likely they are to miss important patterns and trends.

However, a data scientist will quickly tell you that the data lake approach is a recipe for a data swamp, and there are a few reasons why. First, a good amount of data is stored hastily, without a consistent strategy for how to organize, govern and maintain it. Think of your junk drawer at home: items get thrown in at random over time, until whatever you’re looking for is buried and all but impossible to find.

This disorganization leads to the second problem: users often cannot find a dataset once it has been ingested into the data lake. Without an easy way to search for data, it’s nearly impossible to discover and use it, and difficult for teams to ensure that it stays in compliance or reaches the right knowledge workers. These problems compound and create a breeding ground for dark data: unorganized, unstructured, and unmanageable data.

The cost shows up in how data scientists work: they spend only a small share of their time on actual data analysis, and 80 percent of their time finding, cleaning, and reorganizing tons of data.

The power of the cloud

One of the most promising elements of the cloud is its ability to reach across open and proprietary platforms to connect and organize all of a company’s data, regardless of where it resides. This gives data science teams complete visibility, helping them quickly find the datasets they need and better share and govern them.
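To make the idea of organized, findable data concrete, here is a minimal sketch of a metadata catalog in Python. It is an illustration under stated assumptions, not a description of any particular cloud product: the `DatasetEntry` and `Catalog` names, the tags, and the storage locations are all hypothetical. The point it shows is that the catalog holds only descriptions of datasets, so the data can stay on whatever platform it already lives on and still be searchable from one place.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one dataset; the data itself stays where it lives."""
    name: str
    location: str  # hypothetical URI pointing at the platform holding the data
    owner: str
    tags: set[str] = field(default_factory=set)


class Catalog:
    """A tiny in-memory catalog: register metadata, then search it by tag."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        # Refuse silent overwrites so one team cannot clobber another's entry.
        if entry.name in self._entries:
            raise ValueError(f"dataset already registered: {entry.name}")
        self._entries[entry.name] = entry

    def search(self, *tags: str) -> list[DatasetEntry]:
        # Return the entries that carry every requested tag, sorted by name.
        wanted = {t.lower() for t in tags}
        hits = [
            e for e in self._entries.values()
            if wanted <= {t.lower() for t in e.tags}
        ]
        return sorted(hits, key=lambda e: e.name)


# Hypothetical datasets spread across different platforms.
catalog = Catalog()
catalog.register(DatasetEntry(
    "web_clicks", "s3://example-lake/raw/clicks/", "marketing",
    {"raw", "clickstream"}))
catalog.register(DatasetEntry(
    "orders", "hdfs://onprem.example/warehouse/orders", "finance",
    {"refined", "sales"}))
catalog.register(DatasetEntry(
    "crm_customers", "postgres://crm.example/customers", "sales-ops",
    {"refined", "sales", "pii"}))

for entry in catalog.search("sales"):
    print(entry.name, "->", entry.location)
```

Searching on `"sales"` returns the two sales datasets even though they sit on different systems, and adding `"pii"` to the search narrows the result to the one dataset a governance team would need to review. A real catalog would persist its entries and enforce access controls; this sketch leaves both out.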

Accessing and cataloging data via the cloud also makes it possible to connect to new analytical techniques and services, such as predictive analytics, data visualization and AI. These cloud-fueled tools help data to be more easily understood and shared across multiple business teams and users, not just data scientists.