There’s data, and then there’s big data. So, what’s the difference?

Big data defined

A clear big data definition can be difficult to pin down because big data can cover a multitude of use cases. But in general the term refers to sets of data that are so large in volume and so complex that traditional data processing software products are not capable of capturing, managing, and processing the data within a reasonable amount of time.

These big data sets can include structured, unstructured, and semistructured data, each of which can be mined for insights.

How much data actually constitutes “big” is open to debate, but it is typically measured in multiples of petabytes, and for the largest projects in exabytes.
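To give a rough sense of these scales, the sketch below estimates how long one full read of a petabyte would take. The 200 MB/s per-machine throughput is an illustrative assumption, not a measurement, and decimal units (1 PB = 10^15 bytes) are assumed.

```python
# Illustrative arithmetic only: the throughput figure is an assumption.
PETABYTE = 10**15  # bytes, decimal (SI) definition
EXABYTE = 10**18


def scan_days(size_bytes: int, bytes_per_second: float, workers: int = 1) -> float:
    """Days needed to read size_bytes once, split evenly across workers."""
    seconds = size_bytes / (bytes_per_second * workers)
    return seconds / 86_400


single = scan_days(PETABYTE, 200e6)         # one machine at 200 MB/s
cluster = scan_days(PETABYTE, 200e6, 1000)  # 1,000 machines reading in parallel
print(f"1 PB on one machine: {single:.1f} days")
print(f"1 PB on 1,000 machines: {cluster * 24 * 60:.1f} minutes")
```

Under these assumptions a single machine needs weeks for one pass, while a thousand machines working in parallel need on the order of an hour, which is why big data processing is usually distributed.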

Often, big data is characterized by the three Vs:

  • an extreme volume of data
  • a broad variety of types of data
  • the velocity at which the data needs to be processed and analyzed

The data that constitutes big data stores can come from sources that include websites, social media, desktop and mobile apps, scientific experiments, and, increasingly, sensors and other devices in the internet of things (IoT).

Big data is typically paired with analytics, where analysts evaluate large data sets to identify relationships, patterns, and trends.

has its own specialized techniques and tools.)

To store all the incoming data, organizations need to have adequate data storage in place. Among the storage options are traditional data warehouses, data lakes, and cloud-based storage.

Security infrastructure tools might include data encryption, user authentication and other access controls, monitoring systems, firewalls, enterprise mobility management, and other products to protect systems and data.

Big data technologies

In addition to the foregoing IT infrastructure used for data in general, there are several technologies specific to big data that your IT infrastructure should support.

Hadoop ecosystem

Hadoop is one of the technologies most closely associated with big data. The Apache Hadoop project develops open source software for scalable, distributed computing.

The Hadoop software library is a framework that enables the distributed processing of large data sets across clusters of computers using simple programming models. It’s designed to scale up from a single server to thousands, each offering local computation and storage.

The project includes several modules:

  • Hadoop Common, the common utilities that support other Hadoop modules
  • Hadoop Distributed File System, which provides high-throughput access to application data
  • Hadoop YARN, a framework for job scheduling and cluster resource management
  • Hadoop MapReduce, a YARN-based system for parallel processing of large data sets
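The programming model behind the MapReduce module can be sketched in plain Python. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's actual API; the function names and sample input are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for an input split that a separate worker would process.
splits = ["big data is big", "data lakes hold raw data"]
mapped = chain.from_iterable(map_phase(s) for s in splits)
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

In a real cluster the map and reduce calls would run on different machines, with the framework handling the shuffle, scheduling, and recovery from failures.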

Apache Spark

Apache Spark, part of the Hadoop ecosystem, is an open source cluster-computing framework that serves as an engine for processing big data within Hadoop. Spark has become one of the key big data distributed processing frameworks, and can be deployed in a variety of ways. It provides native bindings for the Java, Scala, Python, and R programming languages, and it supports SQL, streaming data, machine learning, and graph processing.

Data lakes

Data lakes are storage repositories that hold extremely large volumes of raw data in its native format until the data is needed by business users. Helping to fuel the growth of data lakes are digital transformation initiatives and the growth of the IoT. Data lakes are designed to make it easier for users to access vast amounts of data when the need arises.
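One way to picture "raw data in its native format until the data is needed" is to apply structure only at read time. The sketch below is a toy illustration under that assumption: a Python list stands in for the lake, and the record layouts and device names are made up.

```python
import csv
import io
import json

# Stand-in for a data lake: raw payloads stored untouched, tagged only with their format.
lake = [
    ("json", '{"device": "sensor-1", "temp_c": 21.5}'),
    ("csv", "device,temp_c\nsensor-2,19.0\nsensor-3,22.25"),
]


def read_temperatures(raw_objects):
    """Apply a schema only at read time, when a user actually needs the data."""
    for fmt, payload in raw_objects:
        if fmt == "json":
            record = json.loads(payload)
            yield record["device"], float(record["temp_c"])
        elif fmt == "csv":
            for row in csv.DictReader(io.StringIO(payload)):
                yield row["device"], float(row["temp_c"])


readings = dict(read_temperatures(lake))
print(readings)
```

Nothing is converted when the data arrives; the parsing cost is paid only by the query that needs it.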

NoSQL databases

Conventional SQL databases are designed for reliable transactions and ad hoc queries, but they come with restrictions such as rigid schema that make them less suitable for some types of applications. NoSQL databases address those limitations, and store and manage data in ways that allow for high operational speed and great flexibility. Many were developed by companies that sought better ways to store content or process data for massive websites. Unlike SQL databases, many NoSQL databases can be scaled horizontally across hundreds or thousands of servers.
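The two ideas in this paragraph, flexible schemas and spreading data across many servers, can be sketched with a toy key-value store. This is not any particular NoSQL product's API; the server names are hypothetical, and plain dictionaries stand in for each server's storage.

```python
import hashlib


def shard_for(key: str, servers: list[str]) -> str:
    """Pick a server by hashing the key (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = [f"node-{i}" for i in range(4)]  # hypothetical server names
store = {name: {} for name in servers}     # one dict per server stands in for its storage


def put(key: str, document: dict) -> None:
    """Write the document to whichever server owns the key."""
    store[shard_for(key, servers)][key] = document


def get(key: str) -> dict:
    """Read the document back from the server that owns the key."""
    return store[shard_for(key, servers)][key]


# Documents need not share a schema.
put("user:1", {"name": "Ada", "tags": ["admin"]})
put("user:2", {"name": "Lin", "last_login": "2024-01-01"})
print(get("user:2"))
```

Modulo sharding is the simplest possible scheme and moves most keys whenever a server is added or removed; real systems often use consistent hashing or range partitioning instead.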

In-memory databases

An in-memory database (IMDB) is a database management system that primarily relies on main memory, rather than disk, for data storage. In-memory databases are faster than disk-optimized databases, an important consideration for big data analytics uses and the creation of data warehouses and data marts.
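As a small-scale illustration of the concept, Python's built-in `sqlite3` module can open a database that lives entirely in main memory. SQLite is an embedded database rather than a big data IMDB product, so this is only a sketch of the idea; the table and values are made up.

```python
import sqlite3

# ":memory:" asks SQLite to keep the whole database in RAM instead of a file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 30.5)],
)

# Aggregate without touching disk; each result row is a (region, total) pair.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals)
conn.close()
```

The trade-off is durability: data held this way disappears when the connection closes, so in-memory systems typically pair RAM storage with some form of persistence or replication.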

Big data skills

Big data and big data analytics endeavors require specific skills, whether they come from inside the organization or through outside experts.

Many of these skills are related to the key big data technology components, such as Hadoop, Spark, NoSQL databases, in-memory databases, and analytics software.

Others are specific to disciplines such as data science, data mining, statistical and quantitative analysis, data visualization, general-purpose programming, and data structures and algorithms. There is also a need for people with overall management skills to see big data projects through to completion.

Given how common big data analytics projects have become and the shortage of people with these types of skills, finding experienced professionals might be one of the biggest challenges for organizations.

Big data analytics use cases

Big data and analytics can be applied to many business problems and use cases. Here are a few examples:

  • Customer analytics. Companies can examine customer data to enhance customer experience, improve conversion rates, and increase retention.
  • Operational analytics. Improving operational performance and making better use of corporate assets are the goals of many companies. Big data analytics tools can help businesses find ways to operate more efficiently and improve performance.
  • Fraud prevention. Big data tools and analysis can help organizations identify suspicious activity and patterns that might indicate fraudulent behavior and help mitigate risks.
  • Price optimization. Companies can use big data analytics to optimize the prices they charge for products and services, helping to boost revenue.
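As a toy illustration of the fraud prevention case above, the sketch below flags transaction amounts that sit far from the typical value, using a modified z-score based on the median. The sample amounts are invented, and the 0.6745 constant and 3.5 threshold are a commonly cited convention assumed here, not a recommendation from any specific tool.

```python
from statistics import median


def flag_outliers(amounts, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    center = median(amounts)
    mad = median(abs(x - center) for x in amounts)  # median absolute deviation
    if mad == 0:
        return []  # no spread to measure against
    return [x for x in amounts if 0.6745 * abs(x - center) / mad > threshold]


# Hypothetical card transactions for one customer.
amounts = [25.0, 30.0, 27.0, 31.0, 29.0, 26.0, 28.0, 950.0]
suspicious = flag_outliers(amounts)
print(suspicious)
```

A median-based score is used because a single extreme value inflates the mean and standard deviation enough to hide itself in small samples. Production fraud systems combine many more signals than transaction amount alone.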