Previously, we looked at the history of precomputation and how it is used to speed up analytics applications. Since then, we’ve gotten a lot of questions about how precomputation compares with data virtualization. In this blog, we will take a look at the similarities and differences between the two technologies.
TL;DR: Both technologies try to address similar challenges: making analytics easily accessible to a wider audience in a modern big data environment. Precomputation focuses on performance, response times, and concurrency in the production environment. Data virtualization focuses on making analysis easily available to users by reducing or eliminating ETL and data warehouses.
Precomputation strategies tend to put less pressure on the source systems. They are deployed when data quality and data governance are factors and when security needs to be de-coupled from the source systems. If you’d like to learn more about precomputation, check out this blog about modern, AI-augmented precomputation technology in the cloud era.
Data virtualization technology focuses on rapid deployment and on eliminating some of the IT processes associated with loading data into a data warehouse or otherwise structuring the data. Users can simply connect to various data sources such as files, RDBMS, or NoSQL DBs and build a ‘virtual’ view that is exposed to the front-end BI layer. There is no data warehouse to design and no ETL jobs to schedule; every query runs on demand, the moment a user clicks in the BI tool and the SQL query is fired.
A typical data virtualization product, at a high level, has the following components:
- Connectors to read data from various sources
- A query engine that parses and processes queries from BI applications
- A caching layer
- Optionally, a metadata layer to track data lineage
- Other management components
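To make the component list concrete, here is a minimal Python sketch of how connectors, a query engine, and a caching layer might fit together. It is a toy model under stated assumptions, not the design of any particular product: the class `VirtualizationLayer`, its methods, and the `orders_csv` source are all hypothetical names, and the "query engine" only supports a single SUM.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, float]


class VirtualizationLayer:
    """Toy model of connectors + query engine + cache (all names hypothetical)."""

    def __init__(self) -> None:
        # Connector: a callable that reads all rows from one source on demand.
        self.connectors: Dict[str, Callable[[], List[Row]]] = {}
        # Caching layer: remembers results keyed by (source, column).
        self.cache: Dict[Tuple[str, str], float] = {}
        # Counts how often a source system was actually hit.
        self.source_reads = 0

    def register(self, name: str, connector: Callable[[], List[Row]]) -> None:
        self.connectors[name] = connector

    def sum_column(self, source: str, column: str) -> float:
        """The 'query engine': answer SUM(column) over one source."""
        key = (source, column)
        if key in self.cache:
            return self.cache[key]  # served from the cache, source untouched
        self.source_reads += 1
        rows = self.connectors[source]()  # on-demand read from the source
        total = float(sum(row[column] for row in rows))
        self.cache[key] = total
        return total


layer = VirtualizationLayer()
layer.register("orders_csv", lambda: [{"amount": 10.0}, {"amount": 5.5}])
first = layer.sum_column("orders_csv", "amount")   # reads the source
second = layer.sum_column("orders_csv", "amount")  # answered by the cache
```

In this sketch the first query pays the cost of reading the source and the second is answered from the cache, which is why the caching layer matters so much for virtualization performance.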
Performance and Concurrency vs. Agility and Immediacy
With precomputation, query results are computed ahead of time and stored in aggregate indexes. At query time, most of the results are readily available, which makes very low response times achievable for complex aggregate queries, even over petabytes of data. Once the results are computed, query response time is predictable: it no longer depends on the state of the source tables, so we don’t have to worry about a data engineer accidentally messing up the partitions and ruining the user experience.
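The idea of an aggregate index can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not how any specific product stores its indexes: the functions `build_aggregate_index` and `lookup` and the sample `rows` are hypothetical, and the only measure is a SUM. The sketch precomputes the total for every combination of dimensions, so that a query becomes a dictionary lookup.

```python
from collections import defaultdict
from itertools import combinations


def build_aggregate_index(rows, dimensions, measure):
    """Precompute SUM(measure) for every combination of the given dimensions."""
    index = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                # Group key: the row's values for this dimension combination.
                totals[tuple(row[d] for d in dims)] += row[measure]
            index[dims] = dict(totals)
    return index


def lookup(index, dims, values):
    """Answer an aggregate query without touching the raw rows.

    `dims` must list dimensions in the same order used when building the index.
    """
    return index[tuple(dims)].get(tuple(values), 0.0)


rows = [
    {"region": "EU", "year": 2023, "sales": 100.0},
    {"region": "EU", "year": 2024, "sales": 150.0},
    {"region": "US", "year": 2024, "sales": 200.0},
]
index = build_aggregate_index(rows, ["region", "year"], "sales")

eu_total = lookup(index, ["region"], ["EU"])
total_2024 = lookup(index, ["year"], [2024])
grand_total = lookup(index, [], [])
```

The build step scans the raw rows once per dimension combination, while each query afterwards costs a single lookup regardless of how many raw rows exist. Enumerating every combination grows exponentially with the number of dimensions, so a real system would need to be selective about which combinations it builds.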
In the case of data virtualization, performance is traded for agility and immediacy. Because every query runs against the sources on demand, its response time is ultimately decided by the slowest data source. A missing index in an RDBMS, an inefficient layout in a file, or a busy source system can all add latency to the query. Unfortunately, most of the time we don’t have any control over these conditions. To improve query performance, data virtualization products rely on a caching layer, whether it’s materialized views stored as local files or a third-party caching product.
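The "slowest source" effect can be shown with a deliberately simple latency model in Python. The assumptions are loud ones: the query fans out to all sources in parallel and waits for every one of them, a cached source answers in a fixed `CACHE_HIT_MS`, and the source names and latency figures are made up for illustration.

```python
from typing import Dict, Iterable

CACHE_HIT_MS = 5  # hypothetical latency of a cache hit


def federated_latency_ms(
    source_latency_ms: Dict[str, int], cached: Iterable[str] = ()
) -> int:
    """Latency of a query that waits on all sources in parallel.

    Sources listed in `cached` are answered by the caching layer instead.
    """
    cached_set = set(cached)
    effective = [
        CACHE_HIT_MS if name in cached_set else latency
        for name, latency in source_latency_ms.items()
    ]
    # The query is only as fast as its slowest participant.
    return max(effective, default=0)


# Illustrative, made-up per-source latencies in milliseconds.
sources = {"postgres": 120, "s3_parquet": 900, "mongodb": 60}

uncached = federated_latency_ms(sources)
with_cache = federated_latency_ms(sources, cached=["s3_parquet"])
```

Under this model, caching the slowest source moves the bottleneck to the next slowest one rather than removing it, which is why cache coverage and freshness become central tuning concerns.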
Data Quality and Data Engineering
With data virtualization, only rudimentary checking and formatting are done at query time. While the effort of ETL is eliminated, the chance to cleanse and conform the data is also missed. With precomputation, data cleansing, transformation, and formatting can be done up front. This gives users a much higher level of confidence in the accuracy of their data and their query results.
Some architects prefer the data lake for its flexibility. Others would rather drain the data lake in favor of a data warehouse for its schema consistency. Many are interested in unifying the data lake and the data warehouse with a “lakehouse.” Precomputation works with all of these back-end architectures, reading from files in the data lake or from tables in the data warehouse. The precomputation layer doesn’t make a duplicate of the raw data; instead, it computes the aggregate results.
Data architects have the freedom to choose a data lake and/or data warehouse architecture, while the precomputation layer handles the query acceleration and, as explained in the following section, provides a unified semantic layer.
BI Semantics, Security, and Governance
A unified semantic layer presents data in a multi-dimensional model with complex business logic built in. Users typically work with BI tools like Tableau, MicroStrategy, Power BI, and even Excel. Each of these BI tools features its own semantic model.
But instead of creating a separate model in each tool, and risking inconsistency across different BI tools, users can define common models in the unified semantic layer and employ these models across all BI tools, even Excel. With a unified semantic layer, users can concentrate on asking the right questions and get the answers at the speed of thought, instead of trying to build and debug SQL queries.
In the precomputation layer, we can centrally define the access control model at the row, column, and cell level. We don’t need to worry about mapping the security model in the virtual view to the security model of each source system. Instead, we focus on who can access what in the model, which is much easier to manage. In the end, a centrally defined security and governance policy that is de-coupled from the source systems provides greater flexibility and scale.
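As a rough illustration of "who can access what in the model," the following Python sketch defines row-level and column-level rules once, in one place, and applies them to query results. It is a hypothetical model, not the Kyligence API: `Policy`, `apply_policy`, the role names, and the sample data are all invented for this example, and cell-level rules are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


@dataclass
class Policy:
    """Access rules for one role, defined against the model, not the sources."""

    visible_columns: Set[str]          # column-level rule
    row_filter: Callable[[Row], bool]  # row-level rule


def apply_policy(rows: List[Row], policy: Policy) -> List[Row]:
    """Drop rows the role may not see, then drop hidden columns."""
    return [
        {col: val for col, val in row.items() if col in policy.visible_columns}
        for row in rows
        if policy.row_filter(row)
    ]


# Hypothetical roles, defined once and reused by every BI tool.
policies = {
    "eu_analyst": Policy({"region", "sales"}, lambda r: r["region"] == "EU"),
    "admin": Policy({"region", "sales", "customer_email"}, lambda r: True),
}

rows = [
    {"region": "EU", "sales": 100.0, "customer_email": "a@example.com"},
    {"region": "US", "sales": 200.0, "customer_email": "b@example.com"},
]

eu_view = apply_policy(rows, policies["eu_analyst"])
admin_view = apply_policy(rows, policies["admin"])
```

Because the rules refer only to columns and values in the model, nothing in this sketch needs to know which source system a row originally came from.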
Precomputation reduces the strain on the source systems because scanning, reading, and joining from data sources happens only once. The "compute once, query multiple times" nature of the precomputation pattern is extremely cost-effective in large-scale production environments in the cloud. Because the aggregate query results are precomputed, the source systems do less work and can therefore service more concurrent users and applications. Precomputation aims to serve thousands of concurrent users or more without sacrificing performance.
Since most questions can be answered by a simple lookup (i.e., without expensive in-memory processing), it is much easier to scale out a precomputation solution to support a large number of concurrent users. One of our largest customers provides KPI dashboards to nearly 100,000 employees worldwide on top of a data service layer provided by Kyligence.
With data virtualization, each query from the BI layer fires off queries to the sources, which can be quite expensive during peak business hours. The problem gets worse when more than a few analysts run these queries concurrently. Without precomputation, each dashboard click or refresh fires off one or more queries in the cloud, and the cloud vendor will happily charge you per byte scanned or per CPU cycle consumed. With precomputation, the initial compute cost can be easily offset by the savings from simple lookups at query time, resulting in significantly lower TCO in the cloud.
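The cost argument is simple arithmetic, sketched below in Python. Every number here is a made-up placeholder chosen only to make the calculation easy to follow (the price per TB, bytes scanned per query, build size, and query count are not measurements or vendor prices), and the model assumes billing purely per byte scanned.

```python
import math

BYTES_PER_TB = 1e12


def on_demand_cost(queries, scan_bytes_per_query, price_per_tb):
    """Every query scans the source data."""
    return queries * scan_bytes_per_query / BYTES_PER_TB * price_per_tb


def precomputed_cost(queries, build_bytes, lookup_bytes_per_query, price_per_tb):
    """One up-front build, then cheap lookups."""
    total_bytes = build_bytes + queries * lookup_bytes_per_query
    return total_bytes / BYTES_PER_TB * price_per_tb


def break_even_queries(build_bytes, scan_bytes_per_query, lookup_bytes_per_query):
    """Smallest query count at which precomputation becomes cheaper."""
    saved_per_query = scan_bytes_per_query - lookup_bytes_per_query
    return math.floor(build_bytes / saved_per_query) + 1


# Hypothetical workload: all figures are illustrative assumptions.
PRICE_PER_TB = 5.0              # currency units per TB scanned
QUERIES = 10_000                # dashboard queries per month
SCAN_BYTES = 50_000_000_000     # 50 GB scanned per on-demand query
BUILD_BYTES = 2_000_000_000_000 # 2 TB scanned once to build the index
LOOKUP_BYTES = 10_000_000       # 10 MB read per precomputed lookup

virtual = on_demand_cost(QUERIES, SCAN_BYTES, PRICE_PER_TB)
precomputed = precomputed_cost(QUERIES, BUILD_BYTES, LOOKUP_BYTES, PRICE_PER_TB)
break_even = break_even_queries(BUILD_BYTES, SCAN_BYTES, LOOKUP_BYTES)
```

With these placeholder figures the up-front build pays for itself after a few dozen queries. Workloads with very few repeated queries, or with data that must be rebuilt constantly, shift the balance the other way, so the inputs are worth estimating for your own environment.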
Data virtualization and precomputation try to solve similar problems, but with very different approaches. Their strengths differ, and they should be used to serve different use cases in the enterprise. Speaking for precomputation, here is a summary of the advantages of an intelligent precomputation approach:
- You can get consistent super-low query latency
- You don’t have to choose between the data warehouse and data lake
- Kyligence Unified Semantic Layer greatly simplifies your analytics process
- Less pressure on the source systems in a production environment
- Better, centrally defined security and governance policies that are de-coupled from the data sources
- Precompute-once-query-many greatly reduces costs and TCO
- Scales from hundreds to thousands of concurrent users
The answer to the question ‘Which one is right for me?’ may not be straightforward. The best approach is to take a careful look at your data assets, your business requirements, and your growth plan. The answer may not be either/or, but a combination of both.