What is a data lake? Flexible big data management explained


If you are tuned in to the latest technology concepts around big data, you’ve likely heard the term “data lake.” The image conjures up a large reservoir of water, and that’s what a data lake is in concept: a reservoir. Only it’s for data.

Data lake defined

A data lake holds a vast amount of raw, unstructured data in its native format.

Because the data stays in its native format, all you need is a device that supports a flat file system, which means you can use a mainframe if you want. The data is then moved to other servers for processing. Most enterprises go with a file system designed for fast processing of large data sets, the kind used in the big data environments where a data lake is likely to be found.

That support for native-format data brings a key benefit. “If I want to get a ridiculous amount of data and figure out what to do with it later, that fits in the mantra of what we do with data lakes now,” says Michael Hiskey, head of strategy at Semarchy, a vendor of data management software.
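The "store it now, figure it out later" idea can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the directory layout, the file names, and the helper functions (`land_raw`, `read_events`, and the two analyses) are hypothetical and not taken from any vendor's product. Files land in a flat directory exactly as received, and structure is applied only when someone reads them.

```python
import csv
import json
import tempfile
from collections import Counter
from pathlib import Path


def land_raw(lake_dir: Path, name: str, payload: str) -> Path:
    """Store a payload exactly as received: no parsing, no schema."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    target = lake_dir / name
    target.write_text(payload, encoding="utf-8")
    return target


def read_events(lake_dir: Path):
    """Apply structure only at read time, picking a parser per file type."""
    for path in sorted(lake_dir.iterdir()):
        if path.suffix == ".jsonl":
            for line in path.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    yield json.loads(line)
        elif path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as handle:
                yield from csv.DictReader(handle)
        # Anything else (free text, images, logs) stays in the lake untouched.


def purchases_per_customer(lake_dir: Path) -> Counter:
    """One question asked of the raw data: who buys, and how often?"""
    counts = Counter()
    for event in read_events(lake_dir):
        if event.get("action") == "purchase":
            counts[event["customer"]] += 1
    return counts


def actions_overall(lake_dir: Path) -> Counter:
    """A different question asked later of the same, unchanged files."""
    return Counter(event.get("action") for event in read_events(lake_dir))


def demo():
    """Land three hypothetical files, then run both analyses."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp) / "lake"
        land_raw(
            lake,
            "web.jsonl",
            '{"customer": "a", "action": "view"}\n'
            '{"customer": "a", "action": "purchase"}\n',
        )
        land_raw(lake, "store.csv", "customer,action\nb,purchase\na,purchase\n")
        land_raw(lake, "notes.txt", "free-form text nobody has parsed yet")
        return purchases_per_customer(lake), actions_overall(lake)


if __name__ == "__main__":
    by_customer, by_action = demo()
    print(dict(by_customer))
    print(dict(by_action))
```

Nothing in the sketch decides up front what the data is for. The two analyses are written after the files have landed, and a third could be added later without reloading or reshaping anything.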

Several vendors offer products that provide security, governance, integration, and data transformation for data lakes:

  • Trifacta: Its Wrangler software uses AI and machine learning algorithms to automate and simplify data processing and interaction with the analyst or business user. It visually tracks and presents the lineage of data transformation steps for specific data sets and across multiple workflows.
  • Zaloni: Zaloni offers an enterprise data lake platform called Zaloni Data Platform, which includes support for cloud and on-premises deployment, a management platform, data catalog, zones for data governance, and self-service data-prep tools that cover end-to-end processing.

When to avoid a data lake

A data lake is not for everyone. Some companies may not need it, and it might make things worse. For example, Hiskey says data lakes are not for real-time work. “If you are looking for real-time, up-to-date info, a data lake is not for you. It’s for historical data. You’re still going to need a fast, transactional system.”

Wilhelmy says some industries won’t allow data lakes due to their unorganized nature. “There’s no strong data governance of random bits and files, and no one understands what governance processes are around the data lake. A prerequisite would be a strong data-governance position. The organization would have to be at an intermediate or advanced level of maturity to govern data processes in a data lake, from taking it in and cleaning it to passing it out to the organization.”

And Joshua Greenbaum, principal analyst with Enterprise Applications Consulting, doesn’t think data lakes are a good idea at all. “In most cases, data lakes are a sign of laziness on the side of IT and not a case of strategic thinking. The laziness is ‘Let’s put our data in one place and think about it later,’” he says.

Greenbaum argues that if you don’t know the problems you are trying to solve, you’re collecting as many bricks as you can because one day you want to build something. “But if you don’t have a plan, all you have is a pile of bricks, and what if you need wooden beams? If you started with a design, you would know what you need to have.”

His cynicism comes from seeing this happen before with data warehouses. “This is a movie we’ve seen before, with different actors but the plot is the same and the end is the same. You are going to waste a lot of money on a data lake like [you did on] a data warehouse if you don’t do it strategically,” says Greenbaum.

A data lake with no purpose is an expensive “just in case” approach. But done strategically, it’s an excellent way to store information that you want to analyze and act on in different ways over time (customer patterns, for example), because you haven’t processed it to the point where it can be used to do only one thing, as in a typical data warehouse.