IDG Contributor Network: Data lakes: Just a swamp without data governance and catalog


The big data landscape has exploded in an incredibly short amount of time. It was only in 2013 that the term “big data” was added to the Oxford English Dictionary. Less than five years later, enormous volumes of data are being generated every day. In response to this flood of raw data, many businesses rushed to stand up large-scale storage such as data warehouses and data lakes, often without much forethought.

On the surface, modern data lakes hold an ocean of possibility for organizations eager to put analytics to work. They offer a storage repository for those capitalizing on new transformative data initiatives and capturing vast amounts of data from disparate sources (including social, mobile, cloud applications, and the internet of things). Unlike the traditional data warehouse, the data lake holds “raw” data in its native format, including structured, semistructured, and unstructured data. The data structure and requirements are not defined until the data is needed.
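To make that last point concrete, here is a minimal sketch of defining structure only when the data is read. The sample records and the `read_with_schema` helper are hypothetical, invented for illustration; a real lake would sit on a distributed store rather than a Python list.

```python
import json

# Hypothetical raw records, kept exactly as they arrived from different sources.
RAW_LAKE = [
    '{"user": "ana", "amount": "19.90", "channel": "mobile"}',
    '{"user": "ben", "amount": 5, "device": {"os": "ios"}}',
    'free-text note: customer called about a refund',
]


def read_with_schema(raw_records, schema):
    """Apply a schema at read time.

    `schema` maps field names to cast functions. Records that cannot be
    parsed, lack a field, or fail a cast are skipped, not rejected at load.
    """
    rows = []
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unstructured text: not usable for this particular read
        if not isinstance(record, dict):
            continue
        try:
            rows.append({name: cast(record[name]) for name, cast in schema.items()})
        except (KeyError, TypeError, ValueError):
            continue  # record does not fit the structure this reader needs
    return rows


# Two consumers impose two different structures on the same raw data.
purchases = read_with_schema(RAW_LAKE, {"user": str, "amount": float})
channels = read_with_schema(RAW_LAKE, {"user": str, "channel": str})
```

Nothing is rejected when the data lands in the lake, so each reader has to work out for itself which records are usable.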

One of the most common challenges organizations face with their data lakes, though, is the inability to find, understand, and trust the data they need to create business value or gain a competitive edge. That’s because the data may be unintelligible in its native format, or may even conflict with other data. When a data scientist wants to access enterprise data for modeling or to deliver insights for analytics teams, they are forced to dive into the depths of the data lake and wade through murky, undefined data sets from multiple sources. As data becomes an increasingly important tool for businesses, this scenario is clearly not sustainable.

To get the most out of the data stored in data lakes, businesses need to add context by implementing policy-driven processes that classify the information in the lake and identify what it is, why it is there, what it means, who owns it, and who is using it. This is best accomplished through data governance integrated with a data catalog. Once this is done, the murky data lake will become crystal clear, particularly for the users who need it most.
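As a rough illustration of what such context might look like in practice, the sketch below models a catalog entry that records what a data set is, why it exists, what it means, who owns it, and who uses it, with a simple governance policy enforced at registration. The class names, fields, and classification labels are assumptions made for this example, not the API of any particular catalog product.

```python
from dataclasses import dataclass, field

# Hypothetical classification labels; a real policy would define its own.
ALLOWED_CLASSIFICATIONS = {"public", "internal", "restricted"}


@dataclass
class CatalogEntry:
    name: str            # what is in the lake
    purpose: str         # why it is there
    description: str     # what it means
    owner: str           # who owns it
    classification: str  # how it is classified under the governance policy
    consumers: set = field(default_factory=set)  # who is using it


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Admit a data set only if it carries the context the policy requires."""
        missing = [
            name for name in ("purpose", "description", "owner")
            if not getattr(entry, name).strip()
        ]
        if missing:
            raise ValueError(f"{entry.name}: missing {', '.join(missing)}")
        if entry.classification not in ALLOWED_CLASSIFICATIONS:
            raise ValueError(f"{entry.name}: unknown classification")
        self._entries[entry.name] = entry

    def record_use(self, name, user):
        """Track who is using a registered data set."""
        self._entries[name].consumers.add(user)

    def search(self, keyword):
        """Find data sets by keyword in their name or description."""
        keyword = keyword.lower()
        return sorted(
            entry.name for entry in self._entries.values()
            if keyword in entry.name.lower() or keyword in entry.description.lower()
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="web_clickstream_raw",
    purpose="Feed churn models",
    description="Unfiltered page-view events from the public website",
    owner="digital-analytics",
    classification="internal",
))
catalog.record_use("web_clickstream_raw", "data-science")
matches = catalog.search("page-view")
```

With entries like this, a data scientist can search by meaning and see the owner and current users before touching the raw files, and data sets that arrive without context are turned away instead of silently sinking into the lake.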