Over the past few years, there has been a subtle but significant shift in the way that data is structured in databases. Whereas yesterday’s databases were typically limited to storing data in rows and tables, today’s modern databases often make use of nested data structures.
In this article, we will take a deeper dive into the nature of nested data structures, how they are represented in different databases, and the benefits and challenges of using nested data structures. Finally, we’ll propose an approach that addresses the challenge of marrying the traditional world of business intelligence with the modern world of nested data.
What is nested data?
Let’s start with a little introduction to dimensional modeling, using a website visit as an example. There are measures of the visit that exist at the visit level, such as the number of visits and the length of the visit. There are also attributes of the visit that only exist at the visit level, such as the user’s IP address, browser type, and OS. There are also page views that occur as part of each visit, each with their own measures, for example the number of page views and the time on page. And there are page view specific attributes, such as page name, page category, and page URL.
In the traditional world of data mart or data warehouse design, a common approach to creating a model to support the analysis of this web data might be to create something that looks like the following (simplified) data model.
” for storage costs. As a result, companies are no longer as intensely focused on reducing the cost of data warehouse and data mart models through key-based dimensional “denormalization” as they once were.
Nesting data within database records
As a result of the factors listed above, a new approach to storing and querying log data and machine-generated data in relational databases has emerged. While different databases have different implementations of this new approach, the general concept is the same, which is to support a single table where information is “nested” within records. Using the same example above, a nested model for a web session might look like the following conceptual diagram.
Because this nested approach stores all data as a set of discrete column elements in a single table, it addresses the challenges inherent in the traditional dimensional approach, and achieves the following objectives:
- Reducing the bottlenecks and performance issues that result from doing large-scale joins.
- Taking advantage of columnar data formats for all data elements.
- Optimizing query performance for scans and aggregations at the expense of optimizing for storage costs.
- Reducing the amount of pre-processing required to query data in its “natural” form.
For more information on how this capability is supported in a number of modern big data platforms, check out the following documents:
Support for nested data in these platforms makes modeling and storage decisions easier and also improves query performance. Nevertheless, there remain a number of challenges related to using this type of data in traditional analysis and BI scenarios. In the remainder of this article, we will discuss these challenges and the opportunities that exist when it comes to supporting BI and analytics workloads on these data structures.
For the purpose of this exercise, we will dive into how AtScale and BigQuery can be used together to analyze large-scale nested data sets.
Analyzing nested data in Google BigQuery
Google BigQuery is a modern, serverless, cloud-based data warehouse that has been architected to handle modern big data challenges, including working with nested data sets. While you can learn more about BigQuery’s nested data support , we’ll run through a quick example using a sample Google Analytics (web analytics) data set.
Let’s take a look at how BigQuery is able to store and query data that matches the session and page view example discussed above. With BigQuery’s support for nested data structures, it is possible to define a “nested” table structure with the schema shown at left.
). The screenshot below shows an AtScale multilevel model that has been constructed on the nested Google Analytics data set.
As you can see in this model, there are two “fact tables” that have been projected on top of the single Google Analytic nested data set (using our Query Data Set, or QDS, functionality). There are also a number of shared dimensions between the sessions and page facts, as well as some dimensions that are only relevant to the page fact (including page path). In the preview panel, you can see that a Page Views measure has been clearly defined. Let’s redo the same Tableau analysis, but this time connecting to the AtScale virtual model.
You can see in this visualization that we get the same session level results as before: 162 Chrome page views, and a total of 249 page views for all of the visits being analyzed. Now, let’s add page path to the analysis.
Note that in this scenario, the number of page views for Chrome has remained at 162, despite the fact that we are now looking at the nested page level records within the visit records. This is because the multi-level model has included the appropriate logic to avoid the duplicate counting of the page view records within the visit. The result is that Tableau users can seamlessly analyze nested Google Analytics data sets without needing to think about how the underlying data is stored, and benefitting from the great self-service data visualization capabilities of Tableau.
Old BI, new tricks
If you are an analytics or business intelligence practitioner, it behooves you to be aware of the increasing presence of nested data sets in modern data platforms, along with the benefits and challenges introduced by these types of data structures. While using nested data may make modeling and storage decisions easier and also improve query performance, there remain a number of challenges related to using this type of data in traditional analysis and BI scenarios.
With the increasing adoption of modern, cloud-based data platforms and the proliferation of machine-generated data, it’s clear that the ability to analyze nested data structures must be part of any business intelligence architecture. In this article, we introduced the concept of nested data, discussed the advantages, and highlighted some of the challenges in analyzing this type of data. And we illustrated one approach that allows organizations to take advantage of nested data support along with the ability to use the tools of choice for analysts and BI consumers.
Josh Klahr is vice president of product management at .
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to .